diff --git a/rss/CVPR2023.xml b/rss/CVPR2023.xml
new file mode 100644
index 0000000..c4f365c
--- /dev/null
+++ b/rss/CVPR2023.xml
@@ -0,0 +1,14125 @@
+ + + + CVPR 2023 + + + GFPose: Learning 3D Human Pose Prior With Gradient Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Ci_GFPose_Learning_3D_Human_Pose_Prior_With_Gradient_Fields_CVPR_2023_paper.pdf + Learning 3D human pose prior is essential to human-centered AI. Here, we present GFPose, a versatile framework to model plausible 3D human poses for various applications. At the core of GFPose is a time-dependent score network, which estimates the gradient on each body joint and progressively denoises the perturbed 3D human pose to match a given task specification. During the denoising process, GFPose implicitly incorporates pose priors in gradients and unifies various discriminative and generative tasks in an elegant framework. Despite the simplicity, GFPose demonstrates great potential in several downstream tasks. Our experiments empirically show that 1) as a multi-hypothesis pose estimator, GFPose outperforms existing SOTAs by 20% on Human3.6M dataset. 2) as a single-hypothesis pose estimator, GFPose achieves comparable results to deterministic SOTAs, even with a vanilla backbone. 3) GFPose is able to produce diverse and realistic samples in pose denoising, completion and generation tasks. + + + + CXTrack: Improving 3D Point Cloud Tracking With Contextual Information + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_CXTrack_Improving_3D_Point_Cloud_Tracking_With_Contextual_Information_CVPR_2023_paper.pdf + 3D single object tracking plays an essential role in many applications, such as autonomous driving. It remains a challenging problem due to the large appearance variation and the sparsity of points caused by occlusion and limited sensor capabilities. Therefore, contextual information across two consecutive frames is crucial for effective object tracking. However, points containing such useful information are often overlooked and cropped out in existing methods, leading to insufficient use of important contextual knowledge. To address this issue, we propose CXTrack, a novel transformer-based network for 3D object tracking, which exploits ConteXtual information to improve the tracking results. Specifically, we design a target-centric transformer network that directly takes point features from two consecutive frames and the previous bounding box as input to explore contextual information and implicitly propagate target cues. To achieve accurate localization for objects of all sizes, we propose a transformer-based localization head with a novel center embedding module to distinguish the target from distractors. Extensive experiments on three large-scale datasets, KITTI, nuScenes and Waymo Open Dataset, show that CXTrack achieves state-of-the-art tracking performance while running at 34 FPS. + + + + NoisyTwins: Class-Consistent and Diverse Image Generation Through StyleGANs + http://openaccess.thecvf.com//content/CVPR2023/papers/Rangwani_NoisyTwins_Class-Consistent_and_Diverse_Image_Generation_Through_StyleGANs_CVPR_2023_paper.pdf + StyleGANs are at the forefront of controllable image generation as they produce a latent space that is semantically disentangled, making it suitable for image editing and manipulation. However, the performance of StyleGANs severely degrades when trained via class-conditioning on large-scale long-tailed datasets.
We find that one reason for degradation is the collapse of latents for each class in the W latent space. With NoisyTwins, we first introduce an effective and inexpensive augmentation strategy for class embeddings, which then decorrelates the latents based on self-supervision in the W space. This decorrelation mitigates collapse, ensuring that our method preserves intra-class diversity with class-consistency in image generation. We show the effectiveness of our approach on large-scale real-world long-tailed datasets of ImageNet-LT and iNaturalist 2019, where our method outperforms other methods by 19% on FID, establishing a new state-of-the-art. + + + + DisCoScene: Spatially Disentangled Generative Radiance Fields for Controllable 3D-Aware Scene Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_DisCoScene_Spatially_Disentangled_Generative_Radiance_Fields_for_Controllable_3D-Aware_Scene_CVPR_2023_paper.pdf + Existing 3D-aware image synthesis approaches mainly focus on generating a single canonical object and show limited capacity in composing a complex scene containing a variety of objects. This work presents DisCoScene: a 3D-aware generative model for high-quality and controllable scene synthesis. The key ingredient of our method is a very abstract object-level representation (i.e., 3D bounding boxes without semantic annotation) as the scene layout prior, which is simple to obtain, general to describe various scene contents, and yet informative to disentangle objects and background. Moreover, it serves as an intuitive user control for scene editing. Based on such a prior, the proposed model spatially disentangles the whole scene into object-centric generative radiance fields by learning on only 2D images with the global-local discrimination. Our model obtains the generation fidelity and editing flexibility of individual objects while being able to efficiently compose objects and the background into a complete scene. We demonstrate state-of-the-art performance on many scene datasets, including the challenging Waymo outdoor dataset. Our code will be made publicly available. + + + + Minimizing the Accumulated Trajectory Error To Improve Dataset Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Minimizing_the_Accumulated_Trajectory_Error_To_Improve_Dataset_Distillation_CVPR_2023_paper.pdf + Model-based deep learning has achieved astounding successes due in part to the availability of large-scale real-world data. However, processing such massive amounts of data comes at a considerable cost in terms of computations, storage, training and the search for good neural architectures. Dataset distillation has thus recently come to the fore. This paradigm involves distilling information from large real-world datasets into tiny and compact synthetic datasets such that processing the latter yields similar performances as the former. State-of-the-art methods primarily rely on learning the synthetic dataset by matching the gradients obtained during training between the real and synthetic data. However, these gradient-matching methods suffer from the accumulated trajectory error caused by the discrepancy between the distillation and subsequent evaluation. To alleviate the adverse impact of this accumulated trajectory error, we propose a novel approach that encourages the optimization algorithm to seek a flat trajectory. 
We show that the weights trained on synthetic data are robust against the accumulated errors perturbations with the regularization towards the flat trajectory. Our method, called Flat Trajectory Distillation (FTD), is shown to boost the performance of gradient-matching methods by up to 4.7% on a subset of images of the ImageNet dataset with higher resolution images. We also validate the effectiveness and generalizability of our method with datasets of different resolutions and demonstrate its applicability to neural architecture search. + + + + Implicit Occupancy Flow Fields for Perception and Prediction in Self-Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Agro_Implicit_Occupancy_Flow_Fields_for_Perception_and_Prediction_in_Self-Driving_CVPR_2023_paper.pdf + A self-driving vehicle (SDV) must be able to perceive its surroundings and predict the future behavior of other traffic participants. Existing works either perform object detection followed by trajectory forecasting of the detected objects, or predict dense occupancy and flow grids for the whole scene. The former poses a safety concern as the number of detections needs to be kept low for efficiency reasons, sacrificing object recall. The latter is computationally expensive due to the high-dimensionality of the output grid, and suffers from the limited receptive field inherent to fully convolutional networks. Furthermore, both approaches employ many computational resources predicting areas or objects that might never be queried by the motion planner. This motivates our unified approach to perception and future prediction that implicitly represents occupancy and flow over time with a single neural network. Our method avoids unnecessary computation, as it can be directly queried by the motion planner at continuous spatio-temporal locations. Moreover, we design an architecture that overcomes the limited receptive field of previous explicit occupancy prediction methods by adding an efficient yet effective global attention mechanism. Through extensive experiments in both urban and highway settings, we demonstrate that our implicit model outperforms the current state-of-the-art. For more information, visit the project website: https://waabi.ai/research/implicito. + + + + CCuantuMM: Cycle-Consistent Quantum-Hybrid Matching of Multiple Shapes + http://openaccess.thecvf.com//content/CVPR2023/papers/Bhatia_CCuantuMM_Cycle-Consistent_Quantum-Hybrid_Matching_of_Multiple_Shapes_CVPR_2023_paper.pdf + Jointly matching multiple, non-rigidly deformed 3D shapes is a challenging, NP-hard problem. A perfect matching is necessarily cycle-consistent: Following the pairwise point correspondences along several shapes must end up at the starting vertex of the original shape. Unfortunately, existing quantum shape-matching methods do not support multiple shapes and even less cycle consistency. This paper addresses the open challenges and introduces the first quantum-hybrid approach for 3D shape multi-matching; in addition, it is also cycle-consistent. Its iterative formulation is admissible to modern adiabatic quantum hardware and scales linearly with the total number of input shapes. Both these characteristics are achieved by reducing the N-shape case to a sequence of three-shape matchings, the derivation of which is our main technical contribution. Thanks to quantum annealing, high-quality solutions with low energy are retrieved for the intermediate NP-hard objectives. 
On benchmark datasets, the proposed approach significantly outperforms extensions to multi-shape matching of a previous quantum-hybrid two-shape matching method and is on-par with classical multi-matching methods. Our source code is available at 4dqv.mpi-inf.mpg.de/CCuantuMM/ + + + + TrojViT: Trojan Insertion in Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_TrojViT_Trojan_Insertion_in_Vision_Transformers_CVPR_2023_paper.pdf + Vision Transformers (ViTs) have demonstrated the state-of-the-art performance in various vision-related tasks. The success of ViTs motivates adversaries to perform backdoor attacks on ViTs. Although the vulnerability of traditional CNNs to backdoor attacks is well-known, backdoor attacks on ViTs are seldom-studied. Compared to CNNs capturing pixel-wise local features by convolutions, ViTs extract global context information through patches and attentions. Naively transplanting CNN-specific backdoor attacks to ViTs yields only a low clean data accuracy and a low attack success rate. In this paper, we propose a stealth and practical ViT-specific backdoor attack TrojViT. Rather than an area-wise trigger used by CNN-specific backdoor attacks, TrojViT generates a patch-wise trigger designed to build a Trojan composed of some vulnerable bits on the parameters of a ViT stored in DRAM memory through patch salience ranking and attention-target loss. TrojViT further uses parameter distillation to reduce the bit number of the Trojan. Once the attacker inserts the Trojan into the ViT model by flipping the vulnerable bits, the ViT model still produces normal inference accuracy with benign inputs. But when the attacker embeds a trigger into an input, the ViT model is forced to classify the input to a predefined target class. We show that flipping only few vulnerable bits identified by TrojViT on a ViT model using the well-known RowHammer can transform the model into a backdoored one. We perform extensive experiments of multiple datasets on various ViT models. TrojViT can classify 99.64% of test images to a target class by flipping 345 bits on a ViT for ImageNet. + + + + Robust Outlier Rejection for 3D Registration With Variational Bayes + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Robust_Outlier_Rejection_for_3D_Registration_With_Variational_Bayes_CVPR_2023_paper.pdf + Learning-based outlier (mismatched correspondence) rejection for robust 3D registration generally formulates the outlier removal as an inlier/outlier classification problem. The core for this to be successful is to learn the discriminative inlier/outlier feature representations. In this paper, we develop a novel variational non-local network-based outlier rejection framework for robust alignment. By reformulating the non-local feature learning with variational Bayesian inference, the Bayesian-driven long-range dependencies can be modeled to aggregate discriminative geometric context information for inlier/outlier distinction. Specifically, to achieve such Bayesian-driven contextual dependencies, each query/key/value component in our non-local network predicts a prior feature distribution and a posterior one. Embedded with the inlier/outlier label, the posterior feature distribution is label-dependent and discriminative. Thus, pushing the prior to be close to the discriminative posterior in the training step enables the features sampled from this prior at test time to model high-quality long-range dependencies. 
Notably, to achieve effective posterior feature guidance, a specific probabilistic graphical model is designed over our non-local model, which lets us derive a variational low bound as our optimization objective for model training. Finally, we propose a voting-based inlier searching strategy to cluster the high-quality hypothetical inliers for transformation estimation. Extensive experiments on 3DMatch, 3DLoMatch, and KITTI datasets verify the effectiveness of our method. + + + + Power Bundle Adjustment for Large-Scale 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Weber_Power_Bundle_Adjustment_for_Large-Scale_3D_Reconstruction_CVPR_2023_paper.pdf + We introduce Power Bundle Adjustment as an expansion type algorithm for solving large-scale bundle adjustment problems. It is based on the power series expansion of the inverse Schur complement and constitutes a new family of solvers that we call inverse expansion methods. We theoretically justify the use of power series and we prove the convergence of our approach. Using the real-world BAL dataset we show that the proposed solver challenges the state-of-the-art iterative methods and significantly accelerates the solution of the normal equation, even for reaching a very high accuracy. This easy-to-implement solver can also complement a recently presented distributed bundle adjustment framework. We demonstrate that employing the proposed Power Bundle Adjustment as a sub-problem solver significantly improves speed and accuracy of the distributed optimization. + + + + Picture That Sketch: Photorealistic Image Generation From Abstract Sketches + http://openaccess.thecvf.com//content/CVPR2023/papers/Koley_Picture_That_Sketch_Photorealistic_Image_Generation_From_Abstract_Sketches_CVPR_2023_paper.pdf + Given an abstract, deformed, ordinary sketch from untrained amateurs like you and me, this paper turns it into a photorealistic image - just like those shown in Fig. 1(a), all non-cherry-picked. We differ significantly from prior art in that we do not dictate an edgemap-like sketch to start with, but aim to work with abstract free-hand human sketches. In doing so, we essentially democratise the sketch-to-photo pipeline, "picturing" a sketch regardless of how good you sketch. Our contribution at the outset is a decoupled encoder-decoder training paradigm, where the decoder is a StyleGAN trained on photos only. This importantly ensures that generated results are always photorealistic. The rest is then all centred around how best to deal with the abstraction gap between sketch and photo. For that, we propose an autoregressive sketch mapper trained on sketch-photo pairs that maps a sketch to the StyleGAN latent space. We further introduce specific designs to tackle the abstract nature of human sketches, including a fine-grained discriminative loss on the back of a trained sketch-photo retrieval model, and a partial-aware sketch augmentation strategy. Finally, we showcase a few downstream tasks our generation model enables, amongst them is showing how fine-grained sketch-based image retrieval, a well-studied problem in the sketch community, can be reduced to an image (generated) to image retrieval task, surpassing state-of-the-arts. We put forward generated results in the supplementary for everyone to scrutinise. 
Project page: https://subhadeepkoley.github.io/PictureThatSketch + + + + 3D-Aware Object Goal Navigation via Simultaneous Exploration and Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_3D-Aware_Object_Goal_Navigation_via_Simultaneous_Exploration_and_Identification_CVPR_2023_paper.pdf + Object goal navigation (ObjectNav) in unseen environments is a fundamental task for Embodied AI. Agents in existing works learn ObjectNav policies based on 2D maps, scene graphs, or image sequences. Considering this task happens in 3D space, a 3D-aware agent can advance its ObjectNav capability via learning from fine-grained spatial information. However, leveraging 3D scene representation can be prohibitively unpractical for policy learning in this floor-level task, due to low sample efficiency and expensive computational cost. In this work, we propose a framework for the challenging 3D-aware ObjectNav based on two straightforward sub-policies. The two sub-policies, namely corner-guided exploration policy and category-aware identification policy, simultaneously perform by utilizing online fused 3D points as observation. Through extensive experiments, we show that this framework can dramatically improve the performance in ObjectNav through learning from 3D scene representation. Our framework achieves the best performance among all modular-based methods on the Matterport3D and Gibson datasets while requiring (up to 30x) less computational cost for training. The code will be released to benefit the community. + + + + Shape, Pose, and Appearance From a Single Image via Bootstrapped Radiance Field Inversion + http://openaccess.thecvf.com//content/CVPR2023/papers/Pavllo_Shape_Pose_and_Appearance_From_a_Single_Image_via_Bootstrapped_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRF) coupled with GANs represent a promising direction in the area of 3D reconstruction from a single view, owing to their ability to efficiently model arbitrary topologies. Recent work in this area, however, has mostly focused on synthetic datasets where exact ground-truth poses are known, and has overlooked pose estimation, which is important for certain downstream applications such as augmented reality (AR) and robotics. We introduce a principled end-to-end reconstruction framework for natural images, where accurate ground-truth poses are not available. Our approach recovers an SDF-parameterized 3D shape, pose, and appearance from a single image of an object, without exploiting multiple views during training. More specifically, we leverage an unconditional 3D-aware generator, to which we apply a hybrid inversion scheme where a model produces a first guess of the solution which is then refined via optimization. Our framework can de-render an image in as few as 10 steps, enabling its use in practical scenarios. We demonstrate state-of-the-art results on a variety of real and synthetic benchmarks. + + + + Unlearnable Clusters: Towards Label-Agnostic Unlearnable Examples + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Unlearnable_Clusters_Towards_Label-Agnostic_Unlearnable_Examples_CVPR_2023_paper.pdf + There is a growing interest in developing unlearnable examples (UEs) against visual privacy leaks on the Internet. UEs are training samples added with invisible but unlearnable noise, which have been found can prevent unauthorized training of machine learning models.
UEs typically are generated via a bilevel optimization framework with a surrogate model to remove (minimize) errors from the original samples, and then applied to protect the data against unknown target models. However, existing UE generation methods all rely on an ideal assumption called label-consistency, where the hackers and protectors are assumed to hold the same label for a given sample. In this work, we propose and promote a more practical label-agnostic setting, where the hackers may exploit the protected data quite differently from the protectors. E.g., an m-class unlearnable dataset held by the protector may be exploited by the hacker as an n-class dataset. Existing UE generation methods are rendered ineffective in this challenging setting. To tackle this challenge, we present a novel technique called Unlearnable Clusters (UCs) to generate label-agnostic unlearnable examples with cluster-wise perturbations. Furthermore, we propose to leverage Vision-and-Language Pretrained Models (VLPMs) like CLIP as the surrogate model to improve the transferability of the crafted UCs to diverse domains. We empirically verify the effectiveness of our proposed approach under a variety of settings with different datasets, target models, and even commercial platforms Microsoft Azure and Baidu PaddlePaddle. Code is available at https://github.com/jiamingzhang94/Unlearnable-Clusters. + + + + NoPe-NeRF: Optimising Neural Radiance Field With No Pose Prior + http://openaccess.thecvf.com//content/CVPR2023/papers/Bian_NoPe-NeRF_Optimising_Neural_Radiance_Field_With_No_Pose_Prior_CVPR_2023_paper.pdf + Training a Neural Radiance Field (NeRF) without pre-computed camera poses is challenging. Recent advances in this direction demonstrate the possibility of jointly optimising a NeRF and camera poses in forward-facing scenes. However, these methods still face difficulties during dramatic camera movement. We tackle this challenging problem by incorporating undistorted monocular depth priors. These priors are generated by correcting scale and shift parameters during training, with which we are then able to constrain the relative poses between consecutive frames. This constraint is achieved using our proposed novel loss functions. Experiments on real-world indoor and outdoor scenes show that our method can handle challenging camera trajectories and outperforms existing methods in terms of novel view rendering quality and pose estimation accuracy. Our project page is https://nope-nerf.active.vision. + + + + SIEDOB: Semantic Image Editing by Disentangling Object and Background + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_SIEDOB_Semantic_Image_Editing_by_Disentangling_Object_and_Background_CVPR_2023_paper.pdf + Semantic image editing provides users with a flexible tool to modify a given image guided by a corresponding segmentation map. In this task, the features of the foreground objects and the backgrounds are quite different. However, all previous methods handle backgrounds and objects as a whole using a monolithic model. Consequently, they remain limited in processing content-rich images and suffer from generating unrealistic objects and texture-inconsistent backgrounds. To address this issue, we propose a novel paradigm, Semantic Image Editing by Disentangling Object and Background (SIEDOB), the core idea of which is to explicitly leverage several heterogeneous subnetworks for objects and backgrounds. First, SIEDOB disassembles the edited input into background regions and instance-level objects.
Then, we feed them into the dedicated generators. Finally, all synthesized parts are embedded in their original locations and utilize a fusion network to obtain a harmonized result. Moreover, to produce high-quality edited images, we propose some innovative designs, including Semantic-Aware Self-Propagation Module, Boundary-Anchored Patch Discriminator, and Style-Diversity Object Generator, and integrate them into SIEDOB. We conduct extensive experiments on Cityscapes and ADE20K-Room datasets and exhibit that our method remarkably outperforms the baselines, especially in synthesizing realistic and diverse objects and texture-consistent backgrounds. + + + + Robust 3D Shape Classification via Non-Local Graph Attention Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Qin_Robust_3D_Shape_Classification_via_Non-Local_Graph_Attention_Network_CVPR_2023_paper.pdf + We introduce a non-local graph attention network (NLGAT), which generates a novel global descriptor through two sub-networks for robust 3D shape classification. In the first sub-network, we capture the global relationships between points (i.e., point-point features) by designing a global relationship network (GRN). In the second sub-network, we enhance the local features with a geometric shape attention map obtained from a global structure network (GSN). To keep rotation invariant and extract more information from sparse point clouds, all sub-networks use the Gram matrices with different dimensions as input for working with robust classification. Additionally, GRN effectively preserves the low-frequency features and improves the classification results. Experimental results on various datasets exhibit that the classification effect of the NLGAT model is better than other state-of-the-art models. Especially, in the case of sparse point clouds (64 points) with noise under arbitrary SO(3) rotation, the classification result (85.4%) of NLGAT is improved by 39.4% compared with the best development of other methods. + + + + Exploring Structured Semantic Prior for Multi Label Recognition With Incomplete Labels + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_Exploring_Structured_Semantic_Prior_for_Multi_Label_Recognition_With_Incomplete_CVPR_2023_paper.pdf + Multi-label recognition (MLR) with incomplete labels is very challenging. Recent works strive to explore the image-to-label correspondence in the vision-language model, i.e., CLIP, to compensate for insufficient annotations. In spite of promising performance, they generally overlook the valuable prior about the label-to-label correspondence. In this paper, we advocate remedying the deficiency of label supervision for the MLR with incomplete labels by deriving a structured semantic prior about the label-to-label correspondence via a semantic prior prompter. We then present a novel Semantic Correspondence Prompt Network (SCPNet), which can thoroughly explore the structured semantic prior. A Prior-Enhanced Self-Supervised Learning method is further introduced to enhance the use of the prior. Comprehensive experiments and analyses on several widely used benchmark datasets show that our method significantly outperforms existing methods on all datasets, well demonstrating the effectiveness and the superiority of our method. Our code will be available at https://github.com/jameslahm/SCPNet. 
+ + + + Delving Into Shape-Aware Zero-Shot Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Delving_Into_Shape-Aware_Zero-Shot_Semantic_Segmentation_CVPR_2023_paper.pdf + Thanks to the impressive progress of large-scale vision-language pretraining, recent recognition models can classify arbitrary objects in a zero-shot and open-set manner, with a surprisingly high accuracy. However, translating this success to semantic segmentation is not trivial, because this dense prediction task requires not only accurate semantic understanding but also fine shape delineation and existing vision-language models are trained with image-level language descriptions. To bridge this gap, we pursue shape-aware zero-shot semantic segmentation in this study. Inspired by classical spectral methods in the image segmentation literature, we propose to leverage the eigen vectors of Laplacian matrices constructed with self-supervised pixel-wise features to promote shape-awareness. Despite that this simple and effective technique does not make use of the masks of seen classes at all, we demonstrate that it out-performs a state-of-the-art shape-aware formulation that aligns ground truth and predicted edges during training. We also delve into the performance gains achieved on different datasets using different backbones and draw several interesting and conclusive observations: the benefits of promoting shape-awareness highly relates to mask compactness and language embedding locality. Finally, our method sets new state-of-the-art performance for zero-shot semantic segmentation on both Pascal and COCO, with significant margins. Code and models will be accessed at https://github.com/Liuxinyv/SAZS. + + + + Post-Training Quantization on Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Shang_Post-Training_Quantization_on_Diffusion_Models_CVPR_2023_paper.pdf + Denoising diffusion (score-based) generative models have recently achieved significant accomplishments in generating realistic and diverse data. These approaches define a forward diffusion process for transforming data into noise and a backward denoising process for sampling data from noise. Unfortunately, the generation process of current denoising diffusion models is notoriously slow due to the lengthy iterative noise estimations, which rely on cumbersome neural networks. It prevents the diffusion models from being widely deployed, especially on edge devices. Previous works accelerate the generation process of diffusion model (DM) via finding shorter yet effective sampling trajectories. However, they overlook the cost of noise estimation with a heavy network in every iteration. In this work, we accelerate generation from the perspective of compressing the noise estimation network. Due to the difficulty of retraining DMs, we exclude mainstream training-aware compression paradigms and introduce post-training quantization (PTQ) into DM acceleration. However, the output distributions of noise estimation networks change with time-step, making previous PTQ methods fail in DMs since they are designed for single-time step scenarios. To devise a DM-specific PTQ method, we explore PTQ on DM in three aspects: quantized operations, calibration dataset, and calibration metric. We summarize and use several observations derived from all-inclusive investigations to formulate our method, which especially targets the unique multi-time-step structure of DMs. 
Experimentally, our method can directly quantize full-precision DMs into 8-bit models while maintaining or even improving their performance in a training-free manner. Importantly, our method can serve as a plug-and-play module on other fast-sampling methods, such as DDIM. + + + + Leveraging Inter-Rater Agreement for Classification in the Presence of Noisy Labels + http://openaccess.thecvf.com//content/CVPR2023/papers/Bucarelli_Leveraging_Inter-Rater_Agreement_for_Classification_in_the_Presence_of_Noisy_CVPR_2023_paper.pdf + In practical settings, classification datasets are obtained through a labelling process that is usually done by humans. Labels can be noisy as they are obtained by aggregating the different individual labels assigned to the same sample by multiple, and possibly disagreeing, annotators. The inter-rater agreement on these datasets can be measured while the underlying noise distribution to which the labels are subject is assumed to be unknown. In this work, we: (i) show how to leverage the inter-annotator statistics to estimate the noise distribution to which labels are subject; (ii) introduce methods that use the estimate of the noise distribution to learn from the noisy dataset; and (iii) establish generalization bounds in the empirical risk minimization framework that depend on the estimated quantities. We conclude the paper by providing experiments that illustrate our findings. + + + + Analyzing Physical Impacts Using Transient Surface Wave Imaging + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Analyzing_Physical_Impacts_Using_Transient_Surface_Wave_Imaging_CVPR_2023_paper.pdf + The subtle vibrations on an object's surface contain information about the object's physical properties and its interaction with the environment. Prior works imaged surface vibration to recover the object's material properties via modal analysis, which discards the transient vibrations propagating immediately after the object is disturbed. Conversely, prior works that captured transient vibrations focused on recovering localized signals (e.g., recording nearby sound sources), neglecting the spatiotemporal relationship between vibrations at different object points. In this paper, we extract information from the transient surface vibrations simultaneously measured at a sparse set of object points using the dual-shutter camera described by Sheinin[31]. We model the geometry of an elastic wave generated shortly after an object's surface is disturbed (e.g., a knock or a footstep), and use the model to localize the disturbance source for various materials (e.g., wood, plastic, tile). We also show that transient object vibrations contain additional cues about the impact force and the impacting object's material properties. We demonstrate our approach in applications like localizing the strikes of a ping-pong ball on a table mid-play and recovering the footsteps' locations by imaging the floor vibrations they create. + + + + ScanDMM: A Deep Markov Model of Scanpath Prediction for 360deg Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Sui_ScanDMM_A_Deep_Markov_Model_of_Scanpath_Prediction_for_360deg_CVPR_2023_paper.pdf + Scanpath prediction for 360deg images aims to produce dynamic gaze behaviors based on the human visual perception mechanism. Most existing scanpath prediction methods for 360deg images do not give a complete treatment of the time-dependency when predicting human scanpath, resulting in inferior performance and poor generalizability. 
In this paper, we present a scanpath prediction method for 360deg images by designing a novel Deep Markov Model (DMM) architecture, namely ScanDMM. We propose a semantics-guided transition function to learn the nonlinear dynamics of time-dependent attentional landscape. Moreover, a state initialization strategy is proposed by considering the starting point of viewing, enabling the model to learn the dynamics with the correct "launcher". We further demonstrate that our model achieves state-of-the-art performance on four 360deg image databases, and exhibit its generalizability by presenting two applications of applying scanpath prediction models to other visual tasks - saliency detection and image quality assessment, expecting to provide profound insights into these fields. + + + + Continual Semantic Segmentation With Automatic Memory Sample Selection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Continual_Semantic_Segmentation_With_Automatic_Memory_Sample_Selection_CVPR_2023_paper.pdf + Continual Semantic Segmentation (CSS) extends static semantic segmentation by incrementally introducing new classes for training. To alleviate the catastrophic forgetting issue in CSS, a memory buffer that stores a small number of samples from the previous classes is constructed for replay. However, existing methods select the memory samples either randomly or based on a single-factor-driven hand-crafted strategy, which has no guarantee to be optimal. In this work, we propose a novel memory sample selection mechanism that selects informative samples for effective replay in a fully automatic way by considering comprehensive factors including sample diversity and class performance. Our mechanism regards the selection operation as a decision-making process and learns an optimal selection policy that directly maximizes the validation performance on a reward set. To facilitate the selection decision, we design a novel state representation and a dual-stage action space. Our extensive experiments on Pascal-VOC 2012 and ADE 20K datasets demonstrate the effectiveness of our approach with state-of-the-art (SOTA) performance achieved, outperforming the second-place one by 12.54% for the 6-stage setting on Pascal-VOC 2012. + + + + Meta-Tuning Loss Functions and Data Augmentation for Few-Shot Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Demirel_Meta-Tuning_Loss_Functions_and_Data_Augmentation_for_Few-Shot_Object_Detection_CVPR_2023_paper.pdf + Few-shot object detection, the problem of modelling novel object detection categories with few training instances, is an emerging topic in the area of few-shot learning and object detection. Contemporary techniques can be divided into two groups: fine-tuning based and meta-learning based approaches. While meta-learning approaches aim to learn dedicated meta-models for mapping samples to novel class models, fine-tuning approaches tackle few-shot detection in a simpler manner, by adapting the detection model to novel classes through gradient based optimization. Despite their simplicity, fine-tuning based approaches typically yield competitive detection results. Based on this observation, we focus on the role of loss functions and augmentations as the force driving the fine-tuning process, and propose to tune their dynamics through meta-learning principles. The proposed training scheme, therefore, allows learning inductive biases that can boost few-shot detection, while keeping the advantages of fine-tuning based approaches. 
In addition, the proposed approach yields interpretable loss functions, as opposed to highly parametric and complex few-shot meta-models. The experimental results highlight the merits of the proposed scheme, with significant improvements over the strong fine-tuning based few-shot detection baselines on benchmark Pascal VOC and MS-COCO datasets, in terms of both standard and generalized few-shot performance metrics. + + + + CLIP2Scene: Towards Label-Efficient 3D Scene Understanding by CLIP + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_CLIP2Scene_Towards_Label-Efficient_3D_Scene_Understanding_by_CLIP_CVPR_2023_paper.pdf + Contrastive Language-Image Pre-training (CLIP) achieves promising results in 2D zero-shot and few-shot learning. Despite the impressive performance in 2D, applying CLIP to help the learning in 3D scene understanding has yet to be explored. In this paper, we make the first attempt to investigate how CLIP knowledge benefits 3D scene understanding. We propose CLIP2Scene, a simple yet effective framework that transfers CLIP knowledge from 2D image-text pre-trained models to a 3D point cloud network. We show that the pre-trained 3D network yields impressive performance on various downstream tasks, i.e., annotation-free and fine-tuning with labelled data for semantic segmentation. Specifically, built upon CLIP, we design a Semantic-driven Cross-modal Contrastive Learning framework that pre-trains a 3D network via semantic and spatial-temporal consistency regularization. For the former, we first leverage CLIP's text semantics to select the positive and negative point samples and then employ the contrastive loss to train the 3D network. In terms of the latter, we force the consistency between the temporally coherent point cloud features and their corresponding image features. We conduct experiments on SemanticKITTI, nuScenes, and ScanNet. For the first time, our pre-trained network achieves annotation-free 3D semantic segmentation with 20.8% and 25.08% mIoU on nuScenes and ScanNet, respectively. When fine-tuned with 1% or 100% labelled data, our method significantly outperforms other self-supervised methods, with improvements of 8% and 1% mIoU, respectively. Furthermore, we demonstrate the generalizability for handling cross-domain datasets. Code is publicly available. + + + + LOGO: A Long-Form Video Dataset for Group Action Quality Assessment + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_LOGO_A_Long-Form_Video_Dataset_for_Group_Action_Quality_Assessment_CVPR_2023_paper.pdf + Action quality assessment (AQA) has become an emerging topic since it can be extensively applied in numerous scenarios. However, most existing methods and datasets focus on single-person short-sequence scenes, hindering the application of AQA in more complex situations. To address this issue, we construct a new multi-person long-form video dataset for action quality assessment named LOGO. Distinguished in scenario complexity, our dataset contains 200 videos from 26 artistic swimming events with 8 athletes in each sample along with an average duration of 204.2 seconds. As for richness in annotations, LOGO includes formation labels to depict group information of multiple athletes and detailed annotations on action procedures. Furthermore, we propose a simple yet effective method to model relations among athletes and reason about the potential temporal logic in long-form videos. 
Specifically, we design a group-aware attention module, which can be easily plugged into existing AQA methods, to enrich the clip-wise representations based on contextual group information. To benchmark LOGO, we systematically conduct investigations on the performance of several popular methods in AQA and action segmentation. The results reveal the challenges our dataset brings. Extensive experiments also show that our approach achieves state-of-the-art on the LOGO dataset. The dataset and code will be released at https://github.com/shiyi-zh0408/LOGO. + + + + UniSim: A Neural Closed-Loop Sensor Simulator + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_UniSim_A_Neural_Closed-Loop_Sensor_Simulator_CVPR_2023_paper.pdf + Rigorously testing autonomy systems is essential for making safe self-driving vehicles (SDV) a reality. It requires one to generate safety critical scenarios beyond what can be collected safely in the world, as many scenarios happen rarely on our roads. To accurately evaluate performance, we need to test the SDV on these scenarios in closed-loop, where the SDV and other actors interact with each other at each timestep. Previously recorded driving logs provide a rich resource to build these new scenarios from, but for closed loop evaluation, we need to modify the sensor data based on the new scene configuration and the SDV's decisions, as actors might be added or removed and the trajectories of existing actors and the SDV will differ from the original log. In this paper, we present UniSim, a neural sensor simulator that takes a single recorded log captured by a sensor-equipped vehicle and converts it into a realistic closed-loop multi-sensor simulation. UniSim builds neural feature grids to reconstruct both the static background and dynamic actors in the scene, and composites them together to simulate LiDAR and camera data at new viewpoints, with actors added or removed and at new placements. To better handle extrapolated views, we incorporate learnable priors for dynamic objects, and leverage a convolutional network to complete unseen regions. Our experiments show UniSim can simulate realistic sensor data with small domain gap on downstream tasks. With UniSim, we demonstrate, for the first time, closed-loop evaluation of an autonomy system on safety-critical scenarios as if it were in the real world. + + + + Prefix Conditioning Unifies Language and Label Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Saito_Prefix_Conditioning_Unifies_Language_and_Label_Supervision_CVPR_2023_paper.pdf + Pretraining visual models on web-scale image-caption datasets has recently emerged as a powerful alternative to traditional pretraining on image classification data. Image-caption datasets are more "open-domain", containing broader scene types and vocabulary words, and result in models that have strong performance in few- and zero-shot recognition tasks. However large-scale classification datasets can provide fine-grained categories with a balanced label distribution. In this work, we study a pretraining strategy that uses both classification and caption datasets to unite their complementary benefits. First, we show that naively unifying the datasets results in sub-optimal performance in downstream zero-shot recognition tasks, as the model is affected by dataset bias: the coverage of image domains and vocabulary words is different in each dataset. 
We address this problem with novel Prefix Conditioning, a simple yet effective method that helps disentangle dataset biases from visual concepts. This is done by introducing prefix tokens that inform the language encoder of the input data type (e.g., classification vs caption) at training time. Our approach allows the language encoder to learn from both datasets while also tailoring feature extraction to each dataset. Prefix conditioning is generic and can be easily integrated into existing VL pretraining objectives, such as CLIP or UniCL. In experiments, we show that it improves zero-shot image recognition and robustness to image-level distribution shift. + + + + Towards Scalable Neural Representation for Diverse Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Towards_Scalable_Neural_Representation_for_Diverse_Videos_CVPR_2023_paper.pdf + Implicit neural representations (INR) have gained increasing attention in representing 3D scenes and images, and have been recently applied to encode videos (e.g., NeRV, E-NeRV). While achieving promising results, existing INR-based methods are limited to encoding a handful of short videos (e.g., seven 5-second videos in the UVG dataset) with redundant visual content, leading to a model design that fits individual video frames independently and is not efficiently scalable to a large number of diverse videos. This paper focuses on developing neural representations for a more practical setup -- encoding long and/or a large number of videos with diverse visual content. We first show that instead of dividing videos into small subsets and encoding them with separate models, encoding long and diverse videos jointly with a unified model achieves better compression results. Based on this observation, we propose D-NeRV, a novel neural representation framework designed to encode diverse videos by (i) decoupling clip-specific visual content from motion information, (ii) introducing temporal reasoning into the implicit neural network, and (iii) employing the task-oriented flow as intermediate output to reduce spatial redundancies. Our new model largely surpasses NeRV and traditional video compression techniques on UCF101 and UVG datasets on the video compression task. Moreover, when used as an efficient data-loader, D-NeRV achieves 3%-10% higher accuracy than NeRV on action recognition tasks on the UCF101 dataset under the same compression ratios. + + + + Towards Robust Tampered Text Detection in Document Image: New Dataset and New Solution + http://openaccess.thecvf.com//content/CVPR2023/papers/Qu_Towards_Robust_Tampered_Text_Detection_in_Document_Image_New_Dataset_CVPR_2023_paper.pdf + Recently, tampered text detection in document image has attracted increasingly attention due to its essential role on information security. However, detecting visually consistent tampered text in photographed document images is still a main challenge. In this paper, we propose a novel framework to capture more fine-grained clues in complex scenarios for tampered text detection, termed as Document Tampering Detector (DTD), which consists of a Frequency Perception Head (FPH) to compensate the deficiencies caused by the inconspicuous visual features, and a Multi-view Iterative Decoder (MID) for fully utilizing the information of features in different scales. 
In addition, we design a new training paradigm, termed as Curriculum Learning for Tampering Detection (CLTD), which can address the confusion during the training procedure and thus to improve the robustness for image compression and the ability to generalize. To further facilitate the tampered text detection in document images, we construct a large-scale document image dataset, termed as DocTamper, which contains 170,000 document images of various types. Experiments demonstrate that our proposed DTD outperforms previous state-of-the-art by 9.2%, 26.3% and 12.3% in terms of F-measure on the DocTamper testing set, and the cross-domain testing sets of DocTamper-FCD and DocTamper-SCD, respectively. Codes and dataset will be available at https://github.com/qcf-568/DocTamper. + + + + DeSTSeg: Segmentation Guided Denoising Student-Teacher for Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_DeSTSeg_Segmentation_Guided_Denoising_Student-Teacher_for_Anomaly_Detection_CVPR_2023_paper.pdf + Visual anomaly detection, an important problem in computer vision, is usually formulated as a one-class classification and segmentation task. The student-teacher (S-T) framework has proved to be effective in solving this challenge. However, previous works based on S-T only empirically applied constraints on normal data and fused multi-level information. In this study, we propose an improved model called DeSTSeg, which integrates a pre-trained teacher network, a denoising student encoder-decoder, and a segmentation network into one framework. First, to strengthen the constraints on anomalous data, we introduce a denoising procedure that allows the student network to learn more robust representations. From synthetically corrupted normal images, we train the student network to match the teacher network feature of the same images without corruption. Second, to fuse the multi-level S-T features adaptively, we train a segmentation network with rich supervision from synthetic anomaly masks, achieving a substantial performance improvement. Experiments on the industrial inspection benchmark dataset demonstrate that our method achieves state-of-the-art performance, 98.6% on image-level AUC, 75.8% on pixel-level average precision, and 76.4% on instance-level average precision. + + + + Neural Rate Estimator and Unsupervised Learning for Efficient Distributed Image Analytics in Split-DNN Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Ahuja_Neural_Rate_Estimator_and_Unsupervised_Learning_for_Efficient_Distributed_Image_CVPR_2023_paper.pdf + Thanks to advances in computer vision and AI, there has been a large growth in the demand for cloud-based visual analytics in which images captured by a low-powered edge device are transmitted to the cloud for analytics. Use of conventional codecs (JPEG, MPEG, HEVC, etc.) for compressing such data introduces artifacts that can seriously degrade the performance of the downstream analytic tasks. Split-DNN computing has emerged as a paradigm to address such usages, in which a DNN is partitioned into a client-side portion and a server side portion. Low-complexity neural networks called 'bottleneck units' are introduced at the split point to transform the intermediate layer features into a lower-dimensional representation better suited for compression and transmission. Optimizing the pipeline for both compression and task-performance requires high-quality estimates of the information-theoretic rate of the intermediate features. 
Most works on compression for image analytics use heuristic approaches to estimate the rate, leading to suboptimal performance. We propose a high-quality 'neural rate-estimator' to address this gap. We interpret the lower-dimensional bottleneck output as a latent representation of the intermediate feature and cast the rate-distortion optimization problem as one of training an equivalent variational auto-encoder with an appropriate loss function. We show that this leads to improved rate-distortion outcomes. We further show that replacing supervised loss terms (such as cross-entropy loss) by distillation-based losses in a teacher-student framework allows for unsupervised training of bottleneck units without the need for explicit training labels. This makes our method very attractive for real world deployments where access to labeled training data is difficult or expensive. We demonstrate that our method outperforms several state-of-the-art methods by obtaining improved task accuracy at lower bitrates on image classification and semantic segmentation tasks. + + + + Object Pop-Up: Can We Infer 3D Objects and Their Poses From Human Interactions Alone? + http://openaccess.thecvf.com//content/CVPR2023/papers/Petrov_Object_Pop-Up_Can_We_Infer_3D_Objects_and_Their_Poses_CVPR_2023_paper.pdf + The intimate entanglement between objects affordances and human poses is of large interest, among others, for behavioural sciences, cognitive psychology, and Computer Vision communities. In recent years, the latter has developed several object-centric approaches: starting from items, learning pipelines synthesizing human poses and dynamics in a realistic way, satisfying both geometrical and functional expectations. However, the inverse perspective is significantly less explored: Can we infer 3D objects and their poses from human interactions alone? Our investigation follows this direction, showing that a generic 3D human point cloud is enough to pop up an unobserved object, even when the user is just imitating a functionality (e.g., looking through a binocular) without involving a tangible counterpart. We validate our method qualitatively and quantitatively, with synthetic data and sequences acquired for the task, showing applicability for XR/VR. + + + + VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_VoP_Text-Video_Co-Operative_Prompt_Tuning_for_Cross-Modal_Retrieval_CVPR_2023_paper.pdf + Many recent studies leverage the pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings huge computational burdens with much more parameters, but also leads to the knowledge forgetting from upstream models. In this work, we propose the VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework with both video & text prompts introducing, which can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms to improve the performance with different scales of trainable parameters. The basic idea of the VoP enhancement is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. 
Extensive experiments show that compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead. The code will be available at https://github.com/bighuang624/VoP. + + + + Exploiting Unlabelled Photos for Stronger Fine-Grained SBIR + http://openaccess.thecvf.com//content/CVPR2023/papers/Sain_Exploiting_Unlabelled_Photos_for_Stronger_Fine-Grained_SBIR_CVPR_2023_paper.pdf + This paper advances the fine-grained sketch-based image retrieval (FG-SBIR) literature by putting forward a strong baseline that overshoots prior state-of-the art by 11%. This is not via complicated design though, but by addressing two critical issues facing the community (i) the gold standard triplet loss does not enforce holistic latent space geometry, and (ii) there are never enough sketches to train a high accuracy model. For the former, we propose a simple modification to the standard triplet loss, that explicitly enforces separation amongst photos/sketch instances. For the latter, we put forward a novel knowledge distillation module can leverage photo data for model training. Both modules are then plugged into a novel plug-n-playable training paradigm that allows for more stable training. More specifically, for (i) we employ an intra-modal triplet loss amongst sketches to bring sketches of the same instance closer from others, and one more amongst photos to push away different photo instances while bringing closer a structurally augmented version of the same photo (offering a gain of 4-6%). To tackle (ii), we first pre-train a teacher on the large set of unlabelled photos over the aforementioned intra-modal photo triplet loss. Then we distill the contextual similarity present amongst the instances in the teacher's embedding space to that in the student's embedding space, by matching the distribution over inter-feature distances of respective samples in both embedding spaces (delivering a further gain of 4-5%). Apart from outperforming prior arts significantly, our model also yields satisfactory results on generalising to new classes. Project page: https://aneeshan95.github.io/Sketch_PVT/ + + + + PIP-Net: Patch-Based Intuitive Prototypes for Interpretable Image Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Nauta_PIP-Net_Patch-Based_Intuitive_Prototypes_for_Interpretable_Image_Classification_CVPR_2023_paper.pdf + Interpretable methods based on prototypical patches recognize various components in an image in order to explain their reasoning to humans. However, existing prototype-based methods can learn prototypes that are not in line with human visual perception, i.e., the same prototype can refer to different concepts in the real world, making interpretation not intuitive. Driven by the principle of explainability-by-design, we introduce PIP-Net (Patch-based Intuitive Prototypes Network): an interpretable image classification model that learns prototypical parts in a self-supervised fashion which correlate better with human vision. PIP-Net can be interpreted as a sparse scoring sheet where the presence of a prototypical part in an image adds evidence for a class. The model can also abstain from a decision for out-of-distribution data by saying "I haven't seen this before". We only use image-level labels and do not rely on any part annotations. PIP-Net is globally interpretable since the set of learned prototypes shows the entire reasoning of the model. 
A smaller local explanation locates the relevant prototypes in one image. We show that our prototypes correlate with ground-truth object parts, indicating that PIP-Net closes the "semantic gap" between latent space and pixel space. Hence, our PIP-Net with interpretable prototypes enables users to interpret the decision making process in an intuitive, faithful and semantically meaningful way. Code is available at https://github.com/M-Nauta/PIPNet. + + + + CloSET: Modeling Clothed Humans on Continuous Surface With Explicit Template Decomposition + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_CloSET_Modeling_Clothed_Humans_on_Continuous_Surface_With_Explicit_Template_CVPR_2023_paper.pdf + Creating animatable avatars from static scans requires the modeling of clothing deformations in different poses. Existing learning-based methods typically add pose-dependent deformations upon a minimally-clothed mesh template or a learned implicit template, which have limitations in capturing details or hinder end-to-end learning. In this paper, we revisit point-based solutions and propose to decompose explicit garment-related templates and then add pose-dependent wrinkles to them. In this way, the clothing deformations are disentangled such that the pose-dependent wrinkles can be better learned and applied to unseen poses. Additionally, to tackle the seam artifact issues in recent state-of-the-art point-based methods, we propose to learn point features on a body surface, which establishes a continuous and compact feature space to capture the fine-grained and pose-dependent clothing geometry. To facilitate the research in this field, we also introduce a high-quality scan dataset of humans in real-world clothing. Our approach is validated on two existing datasets and our newly introduced dataset, showing better clothing deformation results in unseen poses. The project page with code and dataset can be found at https://www.liuyebin.com/closet. + + + + BUOL: A Bottom-Up Framework With Occupancy-Aware Lifting for Panoptic 3D Scene Reconstruction From a Single Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Chu_BUOL_A_Bottom-Up_Framework_With_Occupancy-Aware_Lifting_for_Panoptic_3D_CVPR_2023_paper.pdf + Understanding and modeling the 3D scene from a single image is a practical problem. A recent advance proposes a panoptic 3D scene reconstruction task that performs both 3D reconstruction and 3D panoptic segmentation from a single image. Although having made substantial progress, recent works only focus on top-down approaches that fill 2D instances into 3D voxels according to estimated depth, which hinders their performance by two ambiguities. (1) instance-channel ambiguity: The variable ids of instances in each scene lead to ambiguity during filling voxel channels with 2D information, confusing the following 3D refinement. (2) voxel-reconstruction ambiguity: 2D-to-3D lifting with estimated single view depth only propagates 2D information onto the surface of 3D regions, leading to ambiguity during the reconstruction of regions behind the frontal view surface. In this paper, we propose BUOL, a Bottom-Up framework with Occupancy-aware Lifting to address the two issues for panoptic 3D scene reconstruction from a single image. For instance-channel ambiguity, a bottom-up framework lifts 2D information to 3D voxels based on deterministic semantic assignments rather than arbitrary instance id assignments. 
The 3D voxels are then refined and grouped into 3D instances according to the predicted 2D instance centers. For voxel-reconstruction ambiguity, the estimated multi-plane occupancy is leveraged together with depth to fill the whole regions of things and stuff. Our method shows a tremendous performance advantage over state-of-the-art methods on the synthetic 3D-Front dataset and the real-world Matterport3D dataset. Code and models will be released. + + + + Seeing What You Miss: Vision-Language Pre-Training With Semantic Completion Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Ji_Seeing_What_You_Miss_Vision-Language_Pre-Training_With_Semantic_Completion_Learning_CVPR_2023_paper.pdf + Cross-modal alignment is essential for vision-language pre-training (VLP) models to learn the correct corresponding information across different modalities. For this purpose, inspired by the success of masked language modeling (MLM) tasks in the NLP pre-training area, numerous masked modeling tasks have been proposed for VLP to further promote cross-modal interactions. The core idea of previous masked modeling tasks is to focus on reconstructing the masked tokens based on visible context for learning local-to-local alignment. However, most of them pay little attention to the global semantic features generated for the masked data, resulting in a limited cross-modal alignment ability of global representations. Therefore, in this paper, we propose a novel Semantic Completion Learning (SCL) task, complementary to existing masked modeling tasks, to facilitate global-to-local alignment. Specifically, the SCL task complements the missing semantics of masked data by capturing the corresponding information from the other modality, promoting the learning of more representative global features, which have a great impact on the performance of downstream tasks. Moreover, we present a flexible vision encoder, which enables our model to perform image-text and video-text multimodal tasks simultaneously. Experimental results show that our proposed method obtains state-of-the-art performance on various vision-language benchmarks, such as visual question answering, image-text retrieval, and video-text retrieval. + + + + Differentiable Shadow Mapping for Efficient Inverse Graphics + http://openaccess.thecvf.com//content/CVPR2023/papers/Worchel_Differentiable_Shadow_Mapping_for_Efficient_Inverse_Graphics_CVPR_2023_paper.pdf + We show how shadows can be efficiently generated in differentiable rendering of triangle meshes. Our central observation is that pre-filtered shadow mapping, a technique for approximating shadows based on rendering from the perspective of a light, can be combined with existing differentiable rasterizers to yield differentiable visibility information. We demonstrate on several inverse graphics problems that differentiable shadow maps are orders of magnitude faster than differentiable light transport simulation with similar accuracy -- while differentiable rasterization without shadows often fails to converge. + + + + Understanding and Constructing Latent Modality Structures in Multi-Modal Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Understanding_and_Constructing_Latent_Modality_Structures_in_Multi-Modal_Representation_Learning_CVPR_2023_paper.pdf + Contrastive loss has been increasingly used in learning representations from multiple modalities.
In the limit, the nature of the contrastive loss encourages modalities to exactly match each other in the latent space. Yet it remains an open question how the modality alignment affects the downstream task performance. In this paper, based on an information-theoretic argument, we first prove that exact modality alignment is sub-optimal in general for downstream prediction tasks. Hence we advocate that the key of better performance lies in meaningful latent modality structures instead of perfect modality alignment. To this end, we propose three general approaches to construct latent modality structures. Specifically, we design 1) a deep feature separation loss for intra-modality regularization; 2) a Brownian-bridge loss for inter-modality regularization; and 3) a geometric consistency loss for both intra- and inter-modality regularization. Extensive experiments are conducted on two popular multi-modal representation learning frameworks: the CLIP-based two-tower model and the ALBEF-based fusion model. We test our model on a variety of tasks including zero/few-shot image classification, image-text retrieval, visual question answering, visual reasoning, and visual entailment. Our method achieves consistent improvements over existing methods, demonstrating the effectiveness and generalizability of our proposed approach on latent modality structure regularization. + + + + Instant Volumetric Head Avatars + http://openaccess.thecvf.com//content/CVPR2023/papers/Zielonka_Instant_Volumetric_Head_Avatars_CVPR_2023_paper.pdf + We present Instant Volumetric Head Avatars (INSTA), a novel approach for reconstructing photo-realistic digital avatars instantaneously. INSTA models a dynamic neural radiance field based on neural graphics primitives embedded around a parametric face model. Our pipeline is trained on a single monocular RGB portrait video that observes the subject under different expressions and views. While state-of-the-art methods take up to several days to train an avatar, our method can reconstruct a digital avatar in less than 10 minutes on modern GPU hardware, which is orders of magnitude faster than previous solutions. In addition, it allows for the interactive rendering of novel poses and expressions. By leveraging the geometry prior of the underlying parametric face model, we demonstrate that INSTA extrapolates to unseen poses. In quantitative and qualitative studies on various subjects, INSTA outperforms state-of-the-art methods regarding rendering quality and training time. Project website: https://zielon.github.io/insta/ + + + + Cross-Domain Image Captioning With Discriminative Finetuning + http://openaccess.thecvf.com//content/CVPR2023/papers/Dessi_Cross-Domain_Image_Captioning_With_Discriminative_Finetuning_CVPR_2023_paper.pdf + Neural captioners are typically trained to mimic human-generated references without optimizing for any specific communication goal, leading to problems such as the generation of vague captions. In this paper, we show that fine-tuning an out-of-the-box neural captioner with a self-supervised discriminative communication objective helps to recover a plain, visually descriptive language that is more informative about image contents. Given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify such image among a set of candidates. We experiment with the popular ClipCap captioner, also replicating the main results with BLIP. 
In terms of similarity to ground-truth human descriptions, the captions emerging from discriminative finetuning lag slightly behind those generated by the non-finetuned model, when the latter is trained and tested on the same caption dataset. However, when the model is used without further tuning to generate captions for out-of-domain datasets, our discriminatively-finetuned captioner generates descriptions that resemble human references more than those produced by the same captioner without finetuning. We further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla ClipCap captions or ground-truth captions for human annotators tasked with an image discrimination task. + + + + DBARF: Deep Bundle-Adjusting Generalizable Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_DBARF_Deep_Bundle-Adjusting_Generalizable_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Recent works such as BARF and GARF can bundle adjust camera poses with neural radiance fields (NeRF) which is based on coordinate-MLPs. Despite the impressive results, these methods cannot be applied to Generalizable NeRFs (GeNeRFs) which require image feature extractions that are often based on more complicated 3D CNN or transformer architectures. In this work, we first analyze the difficulties of jointly optimizing camera poses with GeNeRFs, and then further propose our DBARF to tackle these issues. Our DBARF which bundle adjusts camera poses by taking a cost feature map as an implicit cost function can be jointly trained with GeNeRFs in a self-supervised manner. Unlike BARF and its follow-up works, which can only be applied to per-scene optimized NeRFs and need accurate initial camera poses with the exception of forward-facing scenes, our method can generalize across scenes and does not require any good initialization. Experiments show the effectiveness and generalization ability of our DBARF when evaluated on real-world datasets. Our code is available at https://aibluefisher.github.io/dbarf. + + + + Connecting the Dots: Floorplan Reconstruction Using Two-Level Queries + http://openaccess.thecvf.com//content/CVPR2023/papers/Yue_Connecting_the_Dots_Floorplan_Reconstruction_Using_Two-Level_Queries_CVPR_2023_paper.pdf + We address 2D floorplan reconstruction from 3D scans. Existing approaches typically employ heuristically designed multi-stage pipelines. Instead, we formulate floorplan reconstruction as a single-stage structured prediction task: find a variable-size set of polygons, which in turn are variable-length sequences of ordered vertices. To solve it we develop a novel Transformer architecture that generates polygons of multiple rooms in parallel, in a holistic manner without hand-crafted intermediate stages. The model features two-level queries for polygons and corners, and includes polygon matching to make the network end-to-end trainable. Our method achieves a new state-of-the-art for two challenging datasets, Structured3D and SceneCAD, along with significantly faster inference than previous methods. Moreover, it can readily be extended to predict additional information, i.e., semantic room types and architectural elements like doors and windows. Our code and models are available at: https://github.com/ywyue/RoomFormer. 
+ + + + Analyzing and Diagnosing Pose Estimation With Attributions + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Analyzing_and_Diagnosing_Pose_Estimation_With_Attributions_CVPR_2023_paper.pdf + We present Pose Integrated Gradient (PoseIG), the first interpretability technique designed for pose estimation. We extend the concept of integrated gradients for pose estimation to generate pixel-level attribution maps. To enable comparison across different pose frameworks, we unify different pose outputs into a common output space, along with a likelihood approximation function for gradient back-propagation. To complement the qualitative insight from the attribution maps, we propose three indices for quantitative analysis. With these tools, we systematically compare different pose estimation frameworks to understand the impacts of network design, backbone and auxiliary tasks. Our analysis reveals an interesting shortcut of the knuckles (MCP joints) for hand pose estimation and an under-explored inversion error for keypoints in body pose estimation. Project page: https://qy-h00.github.io/poseig/. + + + + Make-a-Story: Visual Memory Conditioned Consistent Story Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Rahman_Make-a-Story_Visual_Memory_Conditioned_Consistent_Story_Generation_CVPR_2023_paper.pdf + There has been a recent explosion of impressive generative models that can produce high quality images (or videos) conditioned on text descriptions. However, all such approaches rely on conditional sentences that contain unambiguous descriptions of scenes and main actors in them. Therefore employing such models for more complex task of story visualization, where naturally references and co-references exist, and one requires to reason about when to maintain consistency of actors and backgrounds across frames/scenes, and when not to, based on story progression, remains a challenge. In this work, we address the aforementioned challenges and propose a novel autoregressive diffusion-based framework with a visual memory module that implicitly captures the actor and background context across the generated frames. Sentence-conditioned soft attention over the memories enables effective reference resolution and learns to maintain scene and actor consistency when needed. To validate the effectiveness of our approach, we extend the MUGEN dataset and introduce additional characters, backgrounds and referencing in multi-sentence storylines. Our experiments for story generation on the MUGEN, the PororoSV and the FlintstonesSV dataset show that our method not only outperforms prior state-of-the-art in generating frames with high visual quality, which are consistent with the story, but also models appropriate correspondences between the characters and the background. + + + + TinyMIM: An Empirical Study of Distilling MIM Pre-Trained Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Ren_TinyMIM_An_Empirical_Study_of_Distilling_MIM_Pre-Trained_Models_CVPR_2023_paper.pdf + Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications cannot or only marginally benefit from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. 
We systematically study different options in the distillation framework, including distillation targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) Distilling token relations is more effective than CLS token- and feature-based distillation; 2) Using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) Weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification, using all the ViT-Tiny, ViT-Small, and ViT-Base models, with +4.2%/+2.4%/+1.4% gains, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way for developing small vision Transformer models, that is, by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM. + + + + OneFormer: One Transformer To Rule Universal Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jain_OneFormer_One_Transformer_To_Rule_Universal_Image_Segmentation_CVPR_2023_paper.pdf + Universal Image Segmentation is not a new concept. Past attempts to unify image segmentation include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation because they need to be trained individually on semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve SOTA performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework that unifies segmentation with a multi-task train-once design. We first propose a task-conditioned joint training strategy that enables training on ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Secondly, we introduce a task token to condition our model on the task at hand, making our model task-dynamic to support multi-task training and inference. Thirdly, we propose using a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models across all three segmentation tasks on ADE20k, Cityscapes, and COCO, despite the latter being trained on each task individually. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. + + + + Finding Geometric Models by Clustering in the Consensus Space + http://openaccess.thecvf.com//content/CVPR2023/papers/Barath_Finding_Geometric_Models_by_Clustering_in_the_Consensus_Space_CVPR_2023_paper.pdf + We propose a new algorithm for finding an unknown number of geometric models, e.g., homographies. The problem is formalized as finding dominant model instances progressively without forming crisp point-to-model assignments.
Dominant instances are found via a RANSAC-like sampling and a consolidation process driven by a model quality function considering previously proposed instances. New ones are found by clustering in the consensus space. This new formulation leads to a simple iterative algorithm with state-of-the-art accuracy while running in real-time on a number of vision problems -- at least two orders of magnitude faster than the competitors on two-view motion estimation. Also, we propose a deterministic sampler reflecting the fact that real-world data tend to form spatially coherent structures. The sampler returns connected components in a progressively densified neighborhood-graph. We present a number of applications where the use of multiple geometric models improves accuracy. These include pose estimation from multiple generalized homographies; trajectory estimation of fast-moving objects; and we also propose a way of using multiple homographies in global SfM algorithms. Source code: https://github.com/danini/clustering-in-consensus-space. + + + + Leapfrog Diffusion Model for Stochastic Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Mao_Leapfrog_Diffusion_Model_for_Stochastic_Trajectory_Prediction_CVPR_2023_paper.pdf + To model the indeterminacy of human behaviors, stochastic trajectory prediction requires a sophisticated multi-modal distribution of future trajectories. Emerging diffusion models have revealed their tremendous representation capacities in numerous generation tasks, showing potential for stochastic trajectory prediction. However, expensive time consumption prevents diffusion models from real-time prediction, since a large number of denoising steps are required to assure sufficient representation ability. To resolve the dilemma, we present LEapfrog Diffusion model (LED), a novel diffusion-based trajectory prediction model, which provides real-time, precise, and diverse predictions. The core of the proposed LED is to leverage a trainable leapfrog initializer to directly learn an expressive multi-modal distribution of future trajectories, which skips a large number of denoising steps, significantly accelerating inference speed. Moreover, the leapfrog initializer is trained to appropriately allocate correlated samples to provide a diversity of predicted future trajectories, significantly improving prediction performances. Extensive experiments on four real-world datasets, including NBA/NFL/SDD/ETH-UCY, show that LED consistently improves performance and achieves 23.7%/21.9% ADE/FDE improvement on NFL. The proposed LED also speeds up the inference 19.3/30.8/24.3/25.1 times compared to the standard diffusion model on NBA/NFL/SDD/ETH-UCY, satisfying real-time inference needs. Code is available at https://github.com/MediaBrain-SJTU/LED. + + + + GeoLayoutLM: Geometric Pre-Training for Visual Information Extraction + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_GeoLayoutLM_Geometric_Pre-Training_for_Visual_Information_Extraction_CVPR_2023_paper.pdf + Visual information extraction (VIE) plays an important role in Document Intelligence. Generally, it is divided into two tasks: semantic entity recognition (SER) and relation extraction (RE). Recently, pre-trained models for documents have achieved substantial progress in VIE, particularly in SER. However, most of the existing models learn the geometric representation in an implicit way, which has been found insufficient for the RE task since geometric information is especially crucial for RE. 
Moreover, we reveal that another factor limiting the performance of RE lies in the objective gap between the pre-training phase and the fine-tuning phase for RE. To tackle these issues, we propose in this paper a multi-modal framework, named GeoLayoutLM, for VIE. GeoLayoutLM explicitly models the geometric relations in pre-training, which we call geometric pre-training. Geometric pre-training is achieved by three specially designed geometry-related pre-training tasks. Additionally, novel relation heads, which are pre-trained by the geometric pre-training tasks and fine-tuned for RE, are elaborately designed to enrich and enhance the feature representation. According to extensive experiments on standard VIE benchmarks, GeoLayoutLM achieves highly competitive scores in the SER task and significantly outperforms the previous state of the art for RE (e.g., the F1 score of RE on FUNSD is boosted from 80.35% to 89.45%). + + + + SFD2: Semantic-Guided Feature Detection and Description + http://openaccess.thecvf.com//content/CVPR2023/papers/Xue_SFD2_Semantic-Guided_Feature_Detection_and_Description_CVPR_2023_paper.pdf + Visual localization is a fundamental task for various applications including autonomous driving and robotics. Prior methods focus on extracting large amounts of often redundant locally reliable features, resulting in limited efficiency and accuracy, especially in large-scale environments under challenging conditions. Instead, we propose to extract globally reliable features by implicitly embedding high-level semantics into both the detection and description processes. Specifically, our semantic-aware detector is able to detect keypoints from reliable regions (e.g. building, traffic lane) and suppress unreliable areas (e.g. sky, car) implicitly instead of relying on explicit semantic labels. This boosts the accuracy of keypoint matching by reducing the number of features sensitive to appearance changes and avoiding the need for additional segmentation networks at test time. Moreover, our descriptors are augmented with semantics and have stronger discriminative ability, providing more inliers at test time. Particularly, experiments on the long-term large-scale visual localization Aachen Day-Night and RobotCar-Seasons datasets demonstrate that our model outperforms previous local features and gives accuracy competitive with advanced matchers while being about 2 and 3 times faster when using 2k and 4k keypoints, respectively. + + + + CLIP for All Things Zero-Shot Sketch-Based Image Retrieval, Fine-Grained or Not + http://openaccess.thecvf.com//content/CVPR2023/papers/Sain_CLIP_for_All_Things_Zero-Shot_Sketch-Based_Image_Retrieval_Fine-Grained_or_CVPR_2023_paper.pdf + In this paper, we leverage CLIP for zero-shot sketch-based image retrieval (ZS-SBIR). We are largely inspired by recent advances in foundation models and the unparalleled generalisation ability they seem to offer, but for the first time tailor it to benefit the sketch community. We put forward novel designs on how best to achieve this synergy, for both the category setting and the fine-grained setting ("all"). At the very core of our solution is a prompt learning setup. First, we show that just by factoring in sketch-specific prompts, we already have a category-level ZS-SBIR system that overshoots all prior arts, by a large margin (24.8%) - a great testimony to studying the CLIP and ZS-SBIR synergy. Moving on to the fine-grained setup is, however, trickier, and requires a deeper dive into this synergy.
For that, we come up with two specific designs to tackle the fine-grained matching nature of the problem: (i) an additional regularisation loss to ensure that the relative separation between sketches and photos is uniform across categories, which is not the case for the gold standard standalone triplet loss, and (ii) a clever patch shuffling technique to help establish instance-level structural correspondences between sketch-photo pairs. With these designs, we again observe significant performance gains in the region of 26.9% over the previous state-of-the-art. The take-home message, if any, is that the proposed CLIP and prompt learning paradigm carries great promise in tackling other sketch-related tasks (not limited to ZS-SBIR) where data scarcity remains a great challenge. Project page: https://aneeshan95.github.io/Sketch_LVM/ + + + + RIAV-MVS: Recurrent-Indexing an Asymmetric Volume for Multi-View Stereo + http://openaccess.thecvf.com//content/CVPR2023/papers/Cai_RIAV-MVS_Recurrent-Indexing_an_Asymmetric_Volume_for_Multi-View_Stereo_CVPR_2023_paper.pdf + This paper presents a learning-based method for multi-view depth estimation from posed images. Our core idea is a "learning-to-optimize" paradigm that iteratively indexes a plane-sweeping cost volume and regresses the depth map via a convolutional Gated Recurrent Unit (GRU). Since the cost volume plays a paramount role in encoding the multi-view geometry, we aim to improve its construction both at the pixel and frame levels. At the pixel level, we propose to break the symmetry of the Siamese network (which is typically used in MVS to extract image features) by introducing a transformer block to the reference image (but not to the source images). Such an asymmetric volume allows the network to extract global features from the reference image to predict its depth map. Given potential inaccuracies in the poses between reference and source images, we propose to incorporate a residual pose network to correct the relative poses. This essentially rectifies the cost volume at the frame level. We conduct extensive experiments on real-world MVS datasets and show that our method achieves state-of-the-art performance in terms of both within-dataset evaluation and cross-dataset generalization. + + + + 3D Video Loops From Asynchronous Input + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_3D_Video_Loops_From_Asynchronous_Input_CVPR_2023_paper.pdf + Looping videos are short video clips that can be looped endlessly without visible seams or artifacts. They provide a very attractive way to capture the dynamism of natural scenes. Existing methods have been mostly limited to 2D representations. In this paper, we take a step forward and propose a practical solution that enables an immersive experience on dynamic 3D looping scenes. The key challenge is to consider the per-view looping conditions from asynchronous input while maintaining view consistency for the 3D representation. We propose a novel sparse 3D video representation, namely Multi-Tile Video (MTV), which not only provides a view-consistent prior, but also greatly reduces memory usage, making the optimization of a 4D volume tractable. Then, we introduce a two-stage pipeline to construct the 3D looping MTV from completely asynchronous multi-view videos with no time overlap. A novel looping loss based on video temporal retargeting algorithms is adopted during the optimization to loop the 3D scene.
Experiments show that our framework successfully generates and renders photorealistic 3D looping videos in real time, even on mobile devices. The code, dataset, and live demos are available at https://limacv.github.io/VideoLoop3D_web/. + + + + Style Projected Clustering for Domain Generalized Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Style_Projected_Clustering_for_Domain_Generalized_Semantic_Segmentation_CVPR_2023_paper.pdf + Existing semantic segmentation methods improve generalization capability by regularizing various images to a canonical feature space. While this process contributes to generalization, it inevitably weakens the representation. In contrast to existing methods, we instead utilize the difference between images to build a better representation space, where the distinct style features are extracted and stored as the bases of representation. Then, the generalization to unseen image styles is achieved by projecting features to this known space. Specifically, we realize the style projection as a weighted combination of stored bases, where the similarity distances are adopted as the weighting factors. Based on the same concept, we extend this process to the decision part of the model and promote the generalization of semantic prediction. By measuring the similarity distances to semantic bases (i.e., prototypes), we replace the common deterministic prediction with semantic clustering. Comprehensive experiments demonstrate the advantage of the proposed method over the state of the art, with up to 3.6% mIoU improvement on average in unseen scenarios. + + + + DIP: Dual Incongruity Perceiving Network for Sarcasm Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wen_DIP_Dual_Incongruity_Perceiving_Network_for_Sarcasm_Detection_CVPR_2023_paper.pdf + Sarcasm indicates that the literal meaning is contrary to the real attitude. Considering the popularity and complementarity of image-text data, we investigate the task of multi-modal sarcasm detection. Different from other multi-modal tasks, for sarcastic data there exists intrinsic incongruity within an image-text pair, as demonstrated in psychological theories. To tackle this issue, we propose a Dual Incongruity Perceiving (DIP) network consisting of two branches to mine the sarcastic information from factual and affective levels. For the factual aspect, we introduce a channel-wise reweighting strategy to obtain semantically discriminative embeddings, and leverage a Gaussian distribution to model the uncertain correlation caused by the incongruity. The distribution is generated from the latest data stored in the memory bank, which can adaptively model the difference of semantic similarity between sarcastic and non-sarcastic data. For the affective aspect, we utilize Siamese layers with shared parameters to learn cross-modal sentiment information. Furthermore, we use the polarity value to construct a relation graph for the mini-batch, which forms the continuous contrastive loss to acquire affective embeddings. Extensive experiments demonstrate that our proposed method performs favorably against state-of-the-art approaches. Our code is released at https://github.com/downdric/MSD.
+ + + + Learning To Generate Language-Supervised and Open-Vocabulary Scene Graph Using Pre-Trained Visual-Semantic Space + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Learning_To_Generate_Language-Supervised_and_Open-Vocabulary_Scene_Graph_Using_Pre-Trained_CVPR_2023_paper.pdf + Scene graph generation (SGG) aims to abstract an image into a graph structure, by representing objects as graph nodes and their relations as labeled edges. However, two knotty obstacles limit the practicability of current SGG methods in real-world scenarios: 1) training SGG models requires time-consuming ground-truth annotations, and 2) the closed-set object categories make the SGG models limited in their ability to recognize novel objects outside of training corpora. To address these issues, we novelly exploit a powerful pre-trained visual-semantic space (VSS) to trigger language-supervised and open-vocabulary SGG in a simple yet effective manner. Specifically, cheap scene graph supervision data can be easily obtained by parsing image language descriptions into semantic graphs. Next, the noun phrases on such semantic graphs are directly grounded over image regions through region-word alignment in the pre-trained VSS. In this way, we enable open-vocabulary object detection by performing object category name grounding with a text prompt in this VSS. On the basis of visually-grounded objects, the relation representations are naturally built for relation recognition, pursuing open-vocabulary SGG. We validate our proposed approach with extensive experiments on the Visual Genome benchmark across various SGG scenarios (i.e., supervised / language-supervised, closed-set / open-vocabulary). Consistent superior performances are achieved compared with existing methods, demonstrating the potential of exploiting pre-trained VSS for SGG in more practical scenarios. + + + + VectorFloorSeg: Two-Stream Graph Attention Network for Vectorized Roughcast Floorplan Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_VectorFloorSeg_Two-Stream_Graph_Attention_Network_for_Vectorized_Roughcast_Floorplan_Segmentation_CVPR_2023_paper.pdf + Vector graphics (VG) are ubiquitous in industrial designs. In this paper, we address semantic segmentation of a typical VG, i.e., roughcast floorplans with bare wall structures, whose output can be directly used for further applications like interior furnishing and room space modeling. Previous semantic segmentation works mostly process well-decorated floorplans in raster images and usually yield aliased boundaries and outlier fragments in segmented rooms, due to pixel-level segmentation that ignores the regular elements (e.g. line segments) in vector floorplans. To overcome these issues, we propose to fully utilize the regular elements in vector floorplans for more integral segmentation. Our pipeline predicts room segmentation from vector floorplans by dually classifying line segments as room boundaries, and regions partitioned by line segments as room segments. To fully exploit the structural relationships between lines and regions, we use two-stream graph neural networks to process the line segments and partitioned regions respectively, and devise a novel modulated graph attention layer to fuse the heterogeneous information from one stream to the other. Extensive experiments show that by directly operating on vector floorplans, we outperform image-based methods in both mIoU and mAcc. 
In addition, we propose a new metric that captures room integrity and boundary regularity, which confirms that our method produces much more regular segmentations. Source code is available at https://github.com/DrZiji/VecFloorSeg + + + + Extracting Motion and Appearance via Inter-Frame Attention for Efficient Video Frame Interpolation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Extracting_Motion_and_Appearance_via_Inter-Frame_Attention_for_Efficient_Video_CVPR_2023_paper.pdf + Effectively extracting inter-frame motion and appearance information is important for video frame interpolation (VFI). Previous works either extract both types of information in a mixed way or devise separate modules for each type of information, which lead to representation ambiguity and low efficiency. In this paper, we propose a new module to explicitly extract motion and appearance information via a unified operation. Specifically, we rethink the information process in inter-frame attention and reuse its attention map for both appearance feature enhancement and motion information extraction. Furthermore, for efficient VFI, our proposed module could be seamlessly integrated into a hybrid CNN and Transformer architecture. This hybrid pipeline can alleviate the computational complexity of inter-frame attention as well as preserve detailed low-level structure information. Experimental results demonstrate that, for both fixed- and arbitrary-timestep interpolation, our method achieves state-of-the-art performance on various datasets. Meanwhile, our approach enjoys a lighter computation overhead over models with close performance. The source code and models are available at https://github.com/MCG-NJU/EMA-VFI. + + + + Minimizing Maximum Model Discrepancy for Transferable Black-Box Targeted Attacks + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Minimizing_Maximum_Model_Discrepancy_for_Transferable_Black-Box_Targeted_Attacks_CVPR_2023_paper.pdf + In this work, we study the black-box targeted attack problem from the model discrepancy perspective. On the theoretical side, we present a generalization error bound for black-box targeted attacks, which gives a rigorous theoretical analysis for guaranteeing the success of the attack. We reveal that the attack error on a target model mainly depends on empirical attack error on the substitute model and the maximum model discrepancy among substitute models. On the algorithmic side, we derive a new algorithm for black-box targeted attacks based on our theoretical analysis, in which we additionally minimize the maximum model discrepancy(M3D) of the substitute models when training the generator to generate adversarial examples. In this way, our model is capable of crafting highly transferable adversarial examples that are robust to the model variation, thus improving the success rate for attacking the black-box model. We conduct extensive experiments on the ImageNet dataset with different classification models, and our proposed approach outperforms existing state-of-the-art methods by a significant margin. 
+ + + + Efficient Loss Function by Minimizing the Detrimental Effect of Floating-Point Errors on Gradient-Based Attacks + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Efficient_Loss_Function_by_Minimizing_the_Detrimental_Effect_of_Floating-Point_CVPR_2023_paper.pdf + Attackers can deceive neural networks by adding human imperceptive perturbations to their input data; this reveals the vulnerability and weak robustness of current deep-learning networks. Many attack techniques have been proposed to evaluate the model's robustness. Gradient-based attacks suffer from severely overestimating the robustness. This paper identifies that the relative error in calculated gradients caused by floating-point errors, including floating-point underflow and rounding errors, is a fundamental reason why gradient-based attacks fail to accurately assess the model's robustness. Although it is hard to eliminate the relative error in the gradients, we can control its effect on the gradient-based attacks. Correspondingly, we propose an efficient loss function by minimizing the detrimental impact of the floating-point errors on the attacks. Experimental results show that it is more efficient and reliable than other loss functions when examined across a wide range of defence mechanisms. + + + + BAD-NeRF: Bundle Adjusted Deblur Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_BAD-NeRF_Bundle_Adjusted_Deblur_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRF) have received considerable attention recently, due to its impressive capability in photo-realistic 3D reconstruction and novel view synthesis, given a set of posed camera images. Earlier work usually assumes the input images are of good quality. However, image degradation (e.g. image motion blur in low-light conditions) can easily happen in real-world scenarios, which would further affect the rendering quality of NeRF. In this paper, we present a novel bundle adjusted deblur Neural Radiance Fields (BAD-NeRF), which can be robust to severe motion blurred images and inaccurate camera poses. Our approach models the physical image formation process of a motion blurred image, and jointly learns the parameters of NeRF and recovers the camera motion trajectories during exposure time. In experiments, we show that by directly modeling the real physical image formation process, BAD-NeRF achieves superior performance over prior works on both synthetic and real datasets. Code and data are available at https://github.com/WU-CVGL/BAD-NeRF. + + + + QPGesture: Quantization-Based and Phase-Guided Motion Matching for Natural Speech-Driven Gesture Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_QPGesture_Quantization-Based_and_Phase-Guided_Motion_Matching_for_Natural_Speech-Driven_Gesture_CVPR_2023_paper.pdf + Speech-driven gesture generation is highly challenging due to the random jitters of human motion. In addition, there is an inherent asynchronous relationship between human speech and gestures. To tackle these challenges, we introduce a novel quantization-based and phase-guided motion matching framework. Specifically, we first present a gesture VQ-VAE module to learn a codebook to summarize meaningful gesture units. With each code representing a unique gesture, random jittering problems are alleviated effectively. We then use Levenshtein distance to align diverse gestures with different speech. 
Levenshtein distance based on audio quantization as a similarity metric of corresponding speech of gestures helps match more appropriate gestures with speech, and solves the alignment problem of speech and gestures well. Moreover, we introduce phase to guide the optimal gesture matching based on the semantics of context or rhythm of audio. Phase guides when text-based or speech-based gestures should be performed to make the generated gestures more natural. Extensive experiments show that our method outperforms recent approaches on speech-driven gesture generation. Our code, database, pre-trained models and demos are available at https://github.com/YoungSeng/QPGesture. + + + + Multiscale Tensor Decomposition and Rendering Equation Encoding for View Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_Multiscale_Tensor_Decomposition_and_Rendering_Equation_Encoding_for_View_Synthesis_CVPR_2023_paper.pdf + Rendering novel views from captured multi-view images has made considerable progress since the emergence of the neural radiance field. This paper aims to further advance the quality of view rendering by proposing a novel approach dubbed the neural radiance feature field (NRFF). We first propose a multiscale tensor decomposition scheme to organize learnable features so as to represent scenes from coarse to fine scales. We demonstrate many benefits of the proposed multiscale representation, including more accurate scene shape and appearance reconstruction, and faster convergence compared with the single-scale representation. Instead of encoding view directions to model view-dependent effects, we further propose to encode the rendering equation in the feature space by employing the anisotropic spherical Gaussian mixture predicted from the proposed multiscale representation. The proposed NRFF improves state-of-the-art rendering results by over 1 dB in PSNR on both the NeRF and NSVF synthetic datasets. A significant improvement has also been observed on the real-world Tanks & Temples dataset. Code can be found at https://github.com/imkanghan/nrff. + + + + NS3D: Neuro-Symbolic Grounding of 3D Objects and Relations + http://openaccess.thecvf.com//content/CVPR2023/papers/Hsu_NS3D_Neuro-Symbolic_Grounding_of_3D_Objects_and_Relations_CVPR_2023_paper.pdf + Grounding object properties and relations in 3D scenes is a prerequisite for a wide range of artificial intelligence tasks, such as visually grounded dialogues and embodied manipulation. However, the variability of the 3D domain induces two fundamental challenges: 1) the expense of labeling and 2) the complexity of 3D grounded language. Hence, essential desiderata for models are to be data-efficient, generalize to different data distributions and tasks with unseen semantic forms, as well as ground complex language semantics (e.g., view-point anchoring and multi-object reference). To address these challenges, we propose NS3D, a neuro-symbolic framework for 3D grounding. NS3D translates language into programs with hierarchical structures by leveraging large language-to-code models. Different functional modules in the programs are implemented as neural networks. Notably, NS3D extends prior neuro-symbolic visual reasoning methods by introducing functional modules that effectively reason about high-arity relations (i.e., relations among more than two objects), key in disambiguating objects in complex 3D scenes. 
Modular and compositional architecture enables NS3D to achieve state-of-the-art results on the ReferIt3D view-dependence task, a 3D referring expression comprehension benchmark. Importantly, NS3D shows significantly improved performance on settings of data-efficiency and generalization, and demonstrate zero-shot transfer to an unseen 3D question-answering task. + + + + GANmouflage: 3D Object Nondetection With Texture Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_GANmouflage_3D_Object_Nondetection_With_Texture_Fields_CVPR_2023_paper.pdf + We propose a method that learns to camouflage 3D objects within scenes. Given an object's shape and a distribution of viewpoints from which it will be seen, we estimate a texture that will make it difficult to detect. Successfully solving this task requires a model that can accurately reproduce textures from the scene, while simultaneously dealing with the highly conflicting constraints imposed by each viewpoint. We address these challenges with a model based on texture fields and adversarial learning. Our model learns to camouflage a variety of object shapes from randomly sampled locations and viewpoints within the input scene, and is the first to address the problem of hiding complex object shapes. Using a human visual search study, we find that our estimated textures conceal objects significantly better than previous methods. + + + + Revisiting Residual Networks for Adversarial Robustness + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Revisiting_Residual_Networks_for_Adversarial_Robustness_CVPR_2023_paper.pdf + Efforts to improve the adversarial robustness of convolutional neural networks have primarily focused on developing more effective adversarial training methods. In contrast, little attention was devoted to analyzing the role of architectural elements (e.g., topology, depth, and width) on adversarial robustness. This paper seeks to bridge this gap and present a holistic study on the impact of architectural design on adversarial robustness. We focus on residual networks and consider architecture design at the block level as well as at the network scaling level. In both cases, we first derive insights through systematic experiments. Then we design a robust residual block, dubbed RobustResBlock, and a compound scaling rule, dubbed RobustScaling, to distribute depth and width at the desired FLOP count. Finally, we combine RobustResBlock and RobustScaling and present a portfolio of adversarially robust residual networks, RobustResNets, spanning a broad spectrum of model capacities. Experimental validation across multiple datasets and adversarial attacks demonstrate that RobustResNets consistently outperform both the standard WRNs and other existing robust architectures, achieving state-of-the-art AutoAttack robust accuracy 63.7% with 500K external data while being 2x more compact in terms of parameters. The code is available at https://github.com/zhichao-lu/robust-residual-network. + + + + PosterLayout: A New Benchmark and Approach for Content-Aware Visual-Textual Presentation Layout + http://openaccess.thecvf.com//content/CVPR2023/papers/Hsu_PosterLayout_A_New_Benchmark_and_Approach_for_Content-Aware_Visual-Textual_Presentation_CVPR_2023_paper.pdf + Content-aware visual-textual presentation layout aims at arranging spatial space on the given canvas for pre-defined elements, including text, logo, and underlay, which is a key to automatic template-free creative graphic design. 
In practical applications, e.g., poster designs, the canvas is originally non-empty, and both inter-element and inter-layer relationships should be considered when generating a proper layout. A few recent works deal with them simultaneously, but they still suffer from poor graphic performance, such as a lack of layout variety or spatial non-alignment. Since content-aware visual-textual presentation layout is a novel task, we first construct a new dataset named PKU PosterLayout, which consists of 9,974 poster-layout pairs and 905 images, i.e., non-empty canvases. It is more challenging and useful, offering greater layout variety, domain diversity, and content diversity. Then, we propose design sequence formation (DSF), which reorganizes elements in layouts to imitate the design processes of human designers, and a novel CNN-LSTM-based conditional generative adversarial network (GAN) is presented to generate proper layouts. Specifically, the discriminator is design-sequence-aware and will supervise the "design" process of the generator. Experimental results verify the usefulness of the new benchmark and the effectiveness of the proposed approach, which achieves the best performance by generating suitable layouts for diverse canvases. The dataset and the source code are available at https://github.com/PKU-ICST-MIPL/PosterLayout-CVPR2023. + + + + A General Regret Bound of Preconditioned Gradient Method for DNN Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Yong_A_General_Regret_Bound_of_Preconditioned_Gradient_Method_for_DNN_CVPR_2023_paper.pdf + While adaptive learning rate methods, such as Adam, have achieved remarkable improvement in optimizing Deep Neural Networks (DNNs), they consider only the diagonal elements of the full preconditioned matrix. Though the full-matrix preconditioned gradient methods theoretically have a lower regret bound, they are impractical for training DNNs because of their high complexity. In this paper, we present a general regret bound with a constrained full-matrix preconditioned gradient and show that the updating formula of the preconditioner can be derived by solving a cone-constrained optimization problem. With the block-diagonal and Kronecker-factorized constraints, a specific guide function can be obtained. By minimizing the upper bound of the guide function, we develop a new DNN optimizer, termed AdaBK. A series of techniques, including statistics updating, dampening, efficient matrix inverse root computation, and gradient amplitude preservation, are developed to make AdaBK effective and efficient to implement. The proposed AdaBK can be readily embedded into many existing DNN optimizers, e.g., SGDM and AdamW, and the corresponding SGDM_BK and AdamW_BK algorithms demonstrate significant improvements over existing DNN optimizers on benchmark vision tasks, including image classification, object detection and segmentation. The source code will be made publicly available. + + + + Optimal Proposal Learning for Deployable End-to-End Pedestrian Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_Optimal_Proposal_Learning_for_Deployable_End-to-End_Pedestrian_Detection_CVPR_2023_paper.pdf + End-to-end pedestrian detection focuses on training a pedestrian detection model by discarding the Non-Maximum Suppression (NMS) post-processing. Though a few methods have been explored, most of them still suffer from longer training times and more complex deployment, and thus cannot be deployed in actual industrial applications.
In this paper, we intend to bridge this gap and propose an Optimal Proposal Learning (OPL) framework for deployable end-to-end pedestrian detection. Specifically, we achieve this goal by using CNN-based light detector and introducing two novel modules, including a Coarse-to-Fine (C2F) learning strategy for proposing precise positive proposals for the Ground-Truth (GT) instances by reducing the ambiguity of sample assignment/output in training/testing respectively, and a Completed Proposal Network (CPN) for producing extra information compensation to further recall the hard pedestrian samples. Extensive experiments are conducted on CrowdHuman, TJU-Ped and Caltech, and the results show that our proposed OPL method significantly outperforms the competing methods. + + + + Temporal Interpolation Is All You Need for Dynamic Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_Temporal_Interpolation_Is_All_You_Need_for_Dynamic_Neural_Radiance_CVPR_2023_paper.pdf + Temporal interpolation often plays a crucial role to learn meaningful representations in dynamic scenes. In this paper, we propose a novel method to train spatiotemporal neural radiance fields of dynamic scenes based on temporal interpolation of feature vectors. Two feature interpolation methods are suggested depending on underlying representations, neural networks or grids. In the neural representation, we extract features from space-time inputs via multiple neural network modules and interpolate them based on time frames. The proposed multi-level feature interpolation network effectively captures features of both short-term and long-term time ranges. In the grid representation, space-time features are learned via four-dimensional hash grids, which remarkably reduces training time. The grid representation shows more than 100 times faster training speed than the previous neural-net-based methods while maintaining the rendering quality. Concatenating static and dynamic features and adding a simple smoothness term further improve the performance of our proposed models. Despite the simplicity of the model architectures, our method achieved state-of-the-art performance both in rendering quality for the neural representation and in training speed for the grid representation. + + + + Graph Transformer GANs for Graph-Constrained House Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Graph_Transformer_GANs_for_Graph-Constrained_House_Generation_CVPR_2023_paper.pdf + We present a novel graph Transformer generative adversarial network (GTGAN) to learn effective graph node relations in an end-to-end fashion for the challenging graph-constrained house generation task. The proposed graph-Transformer-based generator includes a novel graph Transformer encoder that combines graph convolutions and self-attentions in a Transformer to model both local and global interactions across connected and non-connected graph nodes. Specifically, the proposed connected node attention (CNA) and non-connected node attention (NNA) aim to capture the global relations across connected nodes and non-connected nodes in the input graph, respectively. The proposed graph modeling block (GMB) aims to exploit local vertex interactions based on a house layout topology. Moreover, we propose a new node classification-based discriminator to preserve the high-level semantic and discriminative node features for different house components. 
Finally, we propose a novel graph-based cycle-consistency loss that aims at maintaining the relative spatial relationships between ground truth and predicted graphs. Experiments on two challenging graph-constrained house generation tasks (i.e., house layout and roof generation) with two public datasets demonstrate the effectiveness of GTGAN in terms of objective quantitative scores and subjective visual realism. New state-of-the-art results are established by large margins on both tasks. + + + + On the Benefits of 3D Pose and Tracking for Human Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Rajasegaran_On_the_Benefits_of_3D_Pose_and_Tracking_for_Human_CVPR_2023_paper.pdf + In this work we study the benefits of using tracking and 3D poses for action recognition. To achieve this, we take the Lagrangian view on analysing actions over a trajectory of human motion rather than at a fixed point in space. Taking this stand allows us to use the tracklets of people to predict their actions. In this spirit, first we show the benefits of using 3D pose to infer actions, and study person-person interactions. Subsequently, we propose a Lagrangian Action Recognition model by fusing 3D pose and contextualized appearance over tracklets. To this end, our method achieves state-of-the-art performance on the AVA v2.2 dataset on both pose only settings and on standard benchmark settings. When reasoning about the action using only pose cues, our pose model achieves +10.0 mAP gain over the corresponding state-of-the-art while our fused model has a gain of +2.8 mAP over the best state-of-the-art model. Code and results are available at: https://brjathu.github.io/LART + + + + How to Backdoor Diffusion Models? + http://openaccess.thecvf.com//content/CVPR2023/papers/Chou_How_to_Backdoor_Diffusion_Models_CVPR_2023_paper.pdf + Diffusion models are state-of-the-art deep learning empowered generative models that are trained based on the principle of learning forward and reverse diffusion processes via progressive noise-addition and denoising. To gain a better understanding of the limitations and potential risks, this paper presents the first study on the robustness of diffusion models against backdoor attacks. Specifically, we propose BadDiffusion, a novel attack framework that engineers compromised diffusion processes during model training for backdoor implantation. At the inference stage, the backdoored diffusion model will behave just like an untampered generator for regular data inputs, while falsely generating some targeted outcome designed by the bad actor upon receiving the implanted trigger signal. Such a critical risk can be dreadful for downstream tasks and applications built upon the problematic model. Our extensive experiments on various backdoor attack settings show that BadDiffusion can consistently lead to compromised diffusion models with high utility and target specificity. Even worse, BadDiffusion can be made cost-effective by simply finetuning a clean pre-trained diffusion model to implant backdoors. We also explore some possible countermeasures for risk mitigation. Our results call attention to potential risks and possible misuse of diffusion models. + + + + PACO: Parts and Attributes of Common Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Ramanathan_PACO_Parts_and_Attributes_of_Common_Objects_CVPR_2023_paper.pdf + Object models are gradually progressing from predicting just category labels to providing detailed descriptions of object instances. 
This motivates the need for large datasets which go beyond traditional object masks and provide richer annotations such as part masks and attributes. Hence, we introduce PACO: Parts and Attributes of Common Objects. It spans 75 object categories, 456 object-part categories and 55 attributes across image (LVIS) and video (Ego4D) datasets. We provide 641K part masks annotated across 260K object boxes, with roughly half of them exhaustively annotated with attributes as well. We design evaluation metrics and provide benchmark results for three tasks on the dataset: part mask segmentation, object and part attribute prediction and zero-shot instance detection. Dataset, models, and code are open-sourced at https://github.com/facebookresearch/paco. + + + + Continuous Sign Language Recognition With Correlation Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Continuous_Sign_Language_Recognition_With_Correlation_Network_CVPR_2023_paper.pdf + Human body trajectories are a salient cue to identify actions in video. Such body trajectories are mainly conveyed by hands and face across consecutive frames in sign language. However, current methods in continuous sign language recognition(CSLR) usually process frames independently to capture frame-wise features, thus failing to capture cross-frame trajectories to effectively identify a sign. To handle this limitation, we propose correlation network (CorrNet) to explicitly leverage body trajectories across frames to identify signs. In specific, an identification module is first presented to emphasize informative regions in each frame that are beneficial in expressing a sign. A correlation module is then proposed to dynamically compute correlation maps between current frame and adjacent neighbors to capture cross-frame trajectories. As a result, the generated features are able to gain an overview of local temporal movements to identify a sign. Thanks to its special attention on body trajectories, CorrNet achieves new state-of-the-art accuracy on four large-scale datasets, PHOENIX14, PHOENIX14-T, CSL-Daily, and CSL. A comprehensive comparison between CorrNet and previous spatial-temporal reasoning methods verifies its effectiveness. Visualizations are given to demonstrate the effects of CorrNet on emphasizing human body trajectories across adjacent frames. + + + + A Simple Framework for Text-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yi_A_Simple_Framework_for_Text-Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Text-supervised semantic segmentation is a novel research topic that allows semantic segments to emerge with image-text contrasting. However, pioneering methods could be subject to specifically designed network architectures. This paper shows that a vanilla contrastive language-image pre-training (CLIP) model is an effective text-supervised semantic segmentor by itself. First, we reveal that a vanilla CLIP is inferior to localization and segmentation due to its optimization being driven by densely aligning visual and language representations. Second, we propose the locality-driven alignment (LoDA) to address the problem, where CLIP optimization is driven by sparsely aligning local representations. Third, we propose a simple segmentation (SimSeg) framework. LoDA and SimSeg jointly ameliorate a vanilla CLIP to produce impressive semantic segmentation results. Our method outperforms previous state-of-the-art methods on PASCAL VOC 2012, PASCAL Context and COCO datasets by large margins. 
Code and models are available at github.com/muyangyi/SimSeg. + + + + PlenVDB: Memory Efficient VDB-Based Radiance Fields for Fast Training and Rendering + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_PlenVDB_Memory_Efficient_VDB-Based_Radiance_Fields_for_Fast_Training_and_CVPR_2023_paper.pdf + In this paper, we present a new representation for neural radiance fields that accelerates both the training and the inference processes with VDB, a hierarchical data structure for sparse volumes. VDB takes both the advantages of sparse and dense volumes for compact data representation and efficient data access, being a promising data structure for NeRF data interpolation and ray marching. Our method, Plenoptic VDB (PlenVDB), directly learns the VDB data structure from a set of posed images by means of a novel training strategy and then uses it for real-time rendering. Experimental results demonstrate the effectiveness and the efficiency of our method over previous arts: First, it converges faster in the training process. Second, it delivers a more compact data format for NeRF data presentation. Finally, it renders more efficiently on commodity graphics hardware. Our mobile PlenVDB demo achieves 30+ FPS, 1280x720 resolution on an iPhone12 mobile phone. Check plenvdb.github.io for details. + + + + An Actor-Centric Causality Graph for Asynchronous Temporal Inference in Group Activity + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_An_Actor-Centric_Causality_Graph_for_Asynchronous_Temporal_Inference_in_Group_CVPR_2023_paper.pdf + The causality relation modeling remains a challenging task for group activity recognition. The causality relations describe the influence of some actors (cause actors) on other actors (effect actors). Most existing graph models focus on learning the actor relation with synchronous temporal features, which is insufficient to deal with the causality relation with asynchronous temporal features. In this paper, we propose an Actor-Centric Causality Graph Model, which learns the asynchronous temporal causality relation with three modules, i.e., an asynchronous temporal causality relation detection module, a causality feature fusion module, and a causality relation graph inference module. First, given a centric actor and correlative actor, we analyze their influences to detect causality relation. We estimate the self influence of the centric actor with self regression. We estimate the correlative influence from the correlative actor to the centric actor with correlative regression, which uses asynchronous features at different timestamps. Second, we synchronize the two action features by estimating the temporal delay between the cause action and the effect action. The synchronized features are used to enhance the feature of the effect action with a channel-wise fusion. Third, we describe the nodes (actors) with causality features and learn the edges by fusing the causality relation with the appearance relation and distance relation. The causality relation graph inference provides crucial features of effect action, which are complementary to the base model using synchronous relation inference. The two relation inferences are aggregated to enhance group relation learning. Extensive experiments show that our method achieves state-of-the-art performance on the Volleyball dataset and Collective Activity dataset. 
+ + + + Color Backdoor: A Robust Poisoning Attack in Color Space + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Color_Backdoor_A_Robust_Poisoning_Attack_in_Color_Space_CVPR_2023_paper.pdf + Backdoor attacks against neural networks have been intensively investigated, where the adversary compromises the integrity of the victim model, causing it to make wrong predictions for inference samples containing a specific trigger. To make the trigger more imperceptible and human-unnoticeable, a variety of stealthy backdoor attacks have been proposed: some works employ imperceptible perturbations as the backdoor triggers, which restrict the pixel differences between the triggered image and the clean image, while other works use special image styles (e.g., reflection, Instagram filters) as the backdoor triggers. However, these attacks sacrifice robustness and can be easily defeated by common preprocessing-based defenses. This paper presents a novel color backdoor attack, which can exhibit robustness and stealthiness at the same time. The key insight of our attack is to apply a uniform color space shift for all pixels as the trigger. This global feature is robust to image transformation operations and the triggered samples remain natural-looking. To find the optimal trigger, we first define naturalness restrictions through the metrics of PSNR, SSIM and LPIPS. Then we employ the Particle Swarm Optimization (PSO) algorithm to search for the optimal trigger that can achieve high attack effectiveness and robustness while satisfying the restrictions. Extensive experiments demonstrate the superiority of PSO and the robustness of the color backdoor against different mainstream backdoor defenses. + + + + How You Feelin'? Learning Emotions and Mental States in Movie Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Srivastava_How_You_Feelin_Learning_Emotions_and_Mental_States_in_Movie_CVPR_2023_paper.pdf + Movie story analysis requires understanding characters' emotions and mental states. Towards this goal, we formulate emotion understanding as predicting a diverse and multi-label set of emotions at the level of a movie scene and for each character. We propose EmoTx, a multimodal Transformer-based architecture that ingests videos, multiple characters, and dialog utterances to make joint predictions. By leveraging annotations from the MovieGraphs dataset, we aim to predict classic emotions (e.g. happy, angry) and other mental states (e.g. honest, helpful). We conduct experiments on the most frequently occurring 10 and 25 labels, and a mapping that clusters 181 labels to 26. Ablation studies and comparison against adapted state-of-the-art emotion recognition approaches show the effectiveness of EmoTx. Analyzing EmoTx's self-attention scores reveals that expressive emotions often look at character tokens while other mental states rely on video and dialog cues. + + + + Dynamic Inference With Grounding Based Vision and Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Uzkent_Dynamic_Inference_With_Grounding_Based_Vision_and_Language_Models_CVPR_2023_paper.pdf + Transformers have recently been utilized for vision and language tasks successfully. For example, recent image and language models with more than 200M parameters have been proposed to learn visual grounding in the pre-training step and show impressive results on downstream vision and language tasks. &#13;
On the other hand, there exists a large amount of computational redundancy in these large models, which limits their run-time efficiency. To address this problem, we propose dynamic inference for grounding based vision and language models conditioned on the input image-text pair. We first design an approach to dynamically skip multihead self-attention and feed forward network layers across the two backbones and the multimodal network. Additionally, we propose dynamic token pruning and fusion for the two backbones. In particular, we remove redundant tokens at different levels of the backbones and fuse the image tokens with the language tokens in an adaptive manner. To learn policies for dynamic inference, we train agents using reinforcement learning. In this direction, we replace the CNN backbone in a recent grounding-based vision and language model, MDETR, with a vision transformer and call it ViTMDETR. Then, we apply our dynamic inference method to ViTMDETR, called D-ViTMDETR, and perform experiments on image-language tasks. Our results show that we can improve the run-time efficiency of the state-of-the-art models MDETR and GLIP by up to 50% on Referring Expression Comprehension and Segmentation, and VQA, with at most a 0.3% accuracy drop. + + + + Connecting Vision and Language With Video Localized Narratives + http://openaccess.thecvf.com//content/CVPR2023/papers/Voigtlaender_Connecting_Vision_and_Language_With_Video_Localized_Narratives_CVPR_2023_paper.pdf + We propose Video Localized Narratives, a new form of multimodal video annotations connecting vision and language. In the original Localized Narratives, annotators speak and move their mouse simultaneously on an image, thus grounding each word with a mouse trace segment. However, this is challenging on a video. Our new protocol empowers annotators to tell the story of a video with Localized Narratives, capturing even complex events involving multiple actors interacting with each other and with several passive objects. We annotated 20k videos of the OVIS, UVO, and Oops datasets, totalling 1.7M words. Based on this data, we also construct new benchmarks for the video narrative grounding and video question answering tasks, and provide reference results from strong baseline models. Our annotations are available at https://google.github.io/video-localized-narratives/. + + + + Diverse Embedding Expansion Network and Low-Light Cross-Modality Benchmark for Visible-Infrared Person Re-Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Diverse_Embedding_Expansion_Network_and_Low-Light_Cross-Modality_Benchmark_for_Visible-Infrared_CVPR_2023_paper.pdf + For the visible-infrared person re-identification (VIReID) task, one of the major challenges is the modality gap between visible (VIS) and infrared (IR) images. However, the training samples are usually limited, while the modality gaps are too large, which prevents the existing methods from effectively mining diverse cross-modality clues. To handle this limitation, we propose a novel augmentation network in the embedding space, called diverse embedding expansion network (DEEN). The proposed DEEN can effectively generate diverse embeddings to learn the informative feature representations and reduce the modality discrepancy between the VIS and IR images. Moreover, the VIReID model may be seriously affected by drastic illumination changes, while all the existing VIReID datasets are captured under sufficient illumination without significant light changes. &#13;
Thus, we provide a low-light cross-modality (LLCM) dataset, which contains 46,767 bounding boxes of 1,064 identities captured by 9 RGB/IR cameras. Extensive experiments on the SYSU-MM01, RegDB and LLCM datasets show the superiority of the proposed DEEN over several other state-of-the-art methods. The code and dataset are released at: https://github.com/ZYK100/LLCM + + + + Visual-Language Prompt Tuning With Knowledge-Guided Context Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Yao_Visual-Language_Prompt_Tuning_With_Knowledge-Guided_Context_Optimization_CVPR_2023_paper.pdf + Prompt tuning is an effective way to adapt the pretrained visual-language model (VLM) to the downstream task using task-related textual tokens. Representative CoOp-based works combine the learnable textual tokens with the class tokens to obtain specific textual knowledge. However, the specific textual knowledge generalizes worse to unseen classes because it forgets the essential general textual knowledge, which has a strong generalization ability. To tackle this issue, we introduce a novel Knowledge-guided Context Optimization (KgCoOp) to enhance the generalization ability of the learnable prompt for unseen classes. To remember the essential general knowledge, KgCoOp constructs a regularization term to ensure that the essential general textual knowledge can be embedded into the special textual knowledge generated by the learnable prompt. Specifically, KgCoOp minimizes the discrepancy between the textual embeddings generated by learned prompts and the hand-crafted prompts. Finally, adding KgCoOp on top of the contrastive loss produces a discriminative prompt for both seen and unseen tasks. Extensive evaluation on several benchmarks demonstrates that the proposed Knowledge-guided Context Optimization is an efficient method for prompt tuning, i.e., it achieves better performance with less training time. + + + + Weakly Supervised Video Representation Learning With Unaligned Text for Sequential Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_Weakly_Supervised_Video_Representation_Learning_With_Unaligned_Text_for_Sequential_CVPR_2023_paper.pdf + Sequential video understanding, as an emerging video understanding task, has drawn a lot of researchers' attention because of its goal-oriented nature. This paper studies weakly supervised sequential video understanding where the accurate time-stamp level text-video alignment is not provided. We solve this task by borrowing ideas from CLIP. Specifically, we use a transformer to aggregate frame-level features for video representation and use a pre-trained text encoder to encode the texts corresponding to each action and the whole video, respectively. To model the correspondence between text and video, we propose a multiple granularity loss, where the video-paragraph contrastive loss enforces matching between the whole video and the complete script, and a fine-grained frame-sentence contrastive loss enforces the matching between each action and its description. As the frame-sentence correspondence is not available, we propose to use the fact that video actions happen sequentially in the temporal domain to generate pseudo frame-sentence correspondence and supervise the network training with the pseudo labels. Extensive experiments on video sequence verification and text-to-video matching show that our method outperforms baselines by a large margin, which validates the effectiveness of our proposed approach. &#13;
Code is available at https://github.com/svip-lab/WeakSVR. + + + + Bootstrap Your Own Prior: Towards Distribution-Agnostic Novel Class Discovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Bootstrap_Your_Own_Prior_Towards_Distribution-Agnostic_Novel_Class_Discovery_CVPR_2023_paper.pdf + Novel Class Discovery (NCD) aims to discover unknown classes without any annotation, by exploiting the transferable knowledge already learned from a base set of known classes. Existing works hold an impractical assumption that the novel class distribution prior is uniform, yet neglect the imbalanced nature of real-world data. In this paper, we relax this assumption by proposing a new challenging task: distribution-agnostic NCD, which allows data drawn from arbitrary unknown class distributions and thus renders existing methods useless or even harmful. We tackle this challenge by proposing a new method, dubbed "Bootstrapping Your Own Prior (BYOP)", which iteratively estimates the class prior based on the model prediction itself. At each iteration, we devise a dynamic temperature technique that better estimates the class prior by encouraging sharper predictions for less-confident samples. Thus, BYOP obtains more accurate pseudo-labels for the novel samples, which are beneficial for the next training iteration. Extensive experiments show that existing methods suffer from imbalanced class distributions, while BYOP outperforms them by clear margins, demonstrating its effectiveness across various distribution scenarios. + + + + Learning To Generate Image Embeddings With User-Level Differential Privacy + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Learning_To_Generate_Image_Embeddings_With_User-Level_Differential_Privacy_CVPR_2023_paper.pdf + Small on-device models have been successfully trained with user-level differential privacy (DP) for next word prediction and image classification tasks in the past. However, existing methods can fail when directly applied to learn embedding models using supervised training data with a large class space. To achieve user-level DP for large image-to-embedding feature extractors, we propose DP-FedEmb, a variant of federated learning algorithms with per-user sensitivity control and noise addition, to train from user-partitioned data centralized in datacenter. DP-FedEmb combines virtual clients, partial aggregation, private local fine-tuning, and public pretraining to achieve strong privacy utility trade-offs. We apply DP-FedEmb to train image embedding models for faces, landmarks and natural species, and demonstrate its superior utility under same privacy budget on benchmark datasets DigiFace, GLD and iNaturalist. We further illustrate it is possible to achieve strong user-level DP guarantees of epsilon < 2 while controlling the utility drop within 5%, when millions of users can participate in training. + + + + Open-Vocabulary Panoptic Segmentation With Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Open-Vocabulary_Panoptic_Segmentation_With_Text-to-Image_Diffusion_Models_CVPR_2023_paper.pdf + We present ODISE: Open-vocabulary DIffusion-based panoptic SEgmentation, which unifies pre-trained text-image diffusion and discriminative models to perform open-vocabulary panoptic segmentation. Text-to-image diffusion models have the remarkable ability to generate high-quality images with diverse open-vocabulary language descriptions. 
This demonstrates that their internal representation space is highly correlated with open concepts in the real world. Text-image discriminative models like CLIP, on the other hand, are good at classifying images into open-vocabulary labels. We leverage the frozen internal representations of both these models to perform panoptic segmentation of any category in the wild. Our approach outperforms the previous state of the art by significant margins on both open-vocabulary panoptic and semantic segmentation tasks. In particular, with COCO training only, our method achieves 23.4 PQ and 30.0 mIoU on the ADE20K dataset, with 8.3 PQ and 7.9 mIoU absolute improvement over the previous state of the art. We open-source our code and models at https://github.com/NVlabs/ODISE. + + + + Learning Open-Vocabulary Semantic Segmentation Models From Natural Language Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Learning_Open-Vocabulary_Semantic_Segmentation_Models_From_Natural_Language_Supervision_CVPR_2023_paper.pdf + In this paper, we consider the problem of open-vocabulary semantic segmentation (OVS), which aims to segment objects of arbitrary classes instead of pre-defined, closed-set categories. The main contributions are as follows: First, we propose a transformer-based model for OVS, termed as OVSegmentor, which only exploits web-crawled image-text pairs for pre-training without using any mask annotations. OVSegmentor assembles the image pixels into a set of learnable group tokens via a slot-attention based binding module, and aligns the group tokens to the corresponding caption embedding. Second, we propose two proxy tasks for training, namely masked entity completion and cross-image mask consistency. The former aims to infer all masked entities in the caption given the group tokens, that enables the model to learn fine-grained alignment between visual groups and text entities. The latter enforces consistent mask predictions between images that contain shared entities, which encourages the model to learn visual invariance. Third, we construct CC4M dataset for pre-training by filtering CC12M with frequently appeared entities, which significantly improves training efficiency. Fourth, we perform zero-shot transfer on three benchmark datasets, PASCAL VOC 2012, PASCAL Context, and COCO Object. Our model achieves superior segmentation results over the state-of-the-art method by using only 3% data (4M vs 134M) for pre-training. Code and pre-trained models will be released for future research. + + + + OcTr: Octree-Based Transformer for 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_OcTr_Octree-Based_Transformer_for_3D_Object_Detection_CVPR_2023_paper.pdf + A key challenge for LiDAR-based 3D object detection is to capture sufficient features from large scale 3D scenes especially for distant or/and occluded objects. Albeit recent efforts made by Transformers with the long sequence modeling capability, they fail to properly balance the accuracy and efficiency, suffering from inadequate receptive fields or coarse-grained holistic correlations. In this paper, we propose an Octree-based Transformer, named OcTr, to address this issue. It first constructs a dynamic octree on the hierarchical feature pyramid through conducting self-attention on the top level and then recursively propagates to the level below restricted by the octants, which captures rich global context in a coarse-to-fine manner while maintaining the computational complexity under control. 
Furthermore, for enhanced foreground perception, we propose a hybrid positional embedding, composed of the semantic-aware positional embedding and an attention mask, to fully exploit semantic and geometric clues. Extensive experiments are conducted on the Waymo Open Dataset and KITTI Dataset, and OcTr achieves new state-of-the-art results. + + + + Learning Distortion Invariant Representation for Image Restoration From a Causality Perspective + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Learning_Distortion_Invariant_Representation_for_Image_Restoration_From_a_Causality_CVPR_2023_paper.pdf + In recent years, we have witnessed the great advancement of deep neural networks (DNNs) in image restoration. However, a critical limitation is that they cannot generalize well to real-world degradations with different degrees or types. In this paper, we are the first to propose a novel training strategy for image restoration from the causality perspective, to improve the generalization ability of DNNs for unknown degradations. Our method, termed Distortion Invariant representation Learning (DIL), treats each distortion type and degree as one specific confounder, and learns the distortion-invariant representation by eliminating the harmful confounding effect of each degradation. We derive our DIL with the back-door criterion in causality by modeling the interventions of different distortions from the optimization perspective. Particularly, we introduce counterfactual distortion augmentation to simulate the virtual distortion types and degrees as the confounders. Then, we instantiate the intervention of each distortion with a virtual model updating based on corresponding distorted images, and eliminate them from the meta-learning perspective. Extensive experiments demonstrate the generalization capability of our DIL on unseen distortion types and degrees. Our code will be available at https://github.com/lixinustc/Causal-IR-DIL. + + + + MOT: Masked Optimal Transport for Partial Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_MOT_Masked_Optimal_Transport_for_Partial_Domain_Adaptation_CVPR_2023_paper.pdf + As an important methodology to measure distribution discrepancy, optimal transport (OT) has been successfully applied to learn generalizable visual models under changing environments. However, there are still limitations, including strict prior assumptions and implicit alignment, for current OT modeling in challenging real-world scenarios like partial domain adaptation, where the learned transport plan may be biased and negative transfer is inevitable. Thus, it is necessary to explore a more feasible OT methodology for real-world applications. In this work, we focus on the rigorous OT modeling for conditional distribution matching and label shift correction. A novel masked OT (MOT) methodology on conditional distributions is proposed by defining a mask operation with label information. Further, a relaxed and reweighting formulation is proposed to improve the robustness of OT in extreme scenarios. We prove the theoretical equivalence between conditional OT and MOT, which implies that the well-defined MOT serves as a computation-friendly proxy. Extensive experiments validate the effectiveness of the theoretical results and the proposed model. &#13;
+ + + + UDE: A Unified Driving Engine for Human Motion Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_UDE_A_Unified_Driving_Engine_for_Human_Motion_Generation_CVPR_2023_paper.pdf + Generating controllable and editable human motion sequences is a key challenge in 3D Avatar generation. It has been labor-intensive to generate and animate human motion for a long time until learning-based approaches have been developed and applied recently. However, these approaches are still task-specific or modality-specific. In this paper, we propose "UDE", the first unified driving engine that enables generating human motion sequences from natural language or audio sequences (see Fig. 1). Specifically, UDE consists of the following key components: 1) a motion quantization module based on VQVAE that represents continuous motion sequence as discrete latent code, 2) a modality-agnostic transformer encoder that learns to map modality-aware driving signals to a joint space, and 3) a unified token transformer (GPT-like) network to predict the quantized latent code index in an auto-regressive manner. 4) a diffusion motion decoder that takes as input the motion tokens and decodes them into motion sequences with high diversity. We evaluate our method on HumanML3D and AIST++ benchmarks, and the experiment results demonstrate our method achieves state-of-the-art performance. + + + + Extracting Class Activation Maps From Non-Discriminative Features As Well + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Extracting_Class_Activation_Maps_From_Non-Discriminative_Features_As_Well_CVPR_2023_paper.pdf + Extracting class activation maps (CAM) from a classification model often results in poor coverage on foreground objects, i.e., only the discriminative region (e.g., the "head" of "sheep") is recognized and the rest (e.g., the "leg" of "sheep") mistakenly as background. The crux behind is that the weight of the classifier (used to compute CAM) captures only the discriminative features of objects. We tackle this by introducing a new computation method for CAM that explicitly captures non-discriminative features as well, thereby expanding CAM to cover whole objects. Specifically, we omit the last pooling layer of the classification model, and perform clustering on all local features of an object class, where "local" means "at a spatial pixel position". We call the resultant K cluster centers local prototypes - represent local semantics like the "head", "leg", and "body" of "sheep". Given a new image of the class, we compare its unpooled features to every prototype, derive K similarity matrices, and then aggregate them into a heatmap (i.e., our CAM). Our CAM thus captures all local features of the class without discrimination. We evaluate it in the challenging tasks of weakly-supervised semantic segmentation (WSSS), and plug it in multiple state-of-the-art WSSS methods, such as MCTformer and AMN, by simply replacing their original CAM with ours. Our extensive experiments on standard WSSS benchmarks (PASCAL VOC and MS COCO) show the superiority of our method: consistent improvements with little computational overhead. + + + + BlendFields: Few-Shot Example-Driven Facial Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Kania_BlendFields_Few-Shot_Example-Driven_Facial_Modeling_CVPR_2023_paper.pdf + Generating faithful visualizations of human faces requires capturing both coarse and fine-level details of the face geometry and appearance. 
Existing methods are either data-driven, requiring an extensive corpus of data not publicly accessible to the research community, or fail to capture fine details because they rely on geometric face models that cannot represent fine-grained details in texture with a mesh discretization and linear deformation designed to model only a coarse face geometry. We introduce a method that bridges this gap by drawing inspiration from traditional computer graphics techniques. Unseen expressions are modeled by blending appearance from a sparse set of extreme poses. This blending is performed by measuring local volumetric changes in those expressions and locally reproducing their appearance whenever a similar expression is performed at test time. We show that our method generalizes to unseen expressions, adding fine-grained effects on top of smooth volumetric deformations of a face, and demonstrate how it generalizes beyond faces. + + + + NeFII: Inverse Rendering for Reflectance Decomposition With Near-Field Indirect Illumination + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_NeFII_Inverse_Rendering_for_Reflectance_Decomposition_With_Near-Field_Indirect_Illumination_CVPR_2023_paper.pdf + Inverse rendering methods aim to estimate geometry, materials and illumination from multi-view RGB images. In order to achieve better decomposition, recent approaches attempt to model indirect illuminations reflected from different materials via Spherical Gaussians (SG), which, however, tends to blur the high-frequency reflection details. In this paper, we propose an end-to-end inverse rendering pipeline that decomposes materials and illumination from multi-view images, while considering near-field indirect illumination. In a nutshell, we introduce the Monte Carlo sampling based path tracing and cache the indirect illumination as neural radiance, enabling a physics-faithful and easy-to-optimize inverse rendering method. To enhance efficiency and practicality, we leverage SG to represent the smooth environment illuminations and apply importance sampling techniques. To supervise indirect illuminations from unobserved directions, we develop a novel radiance consistency constraint between implicit neural radiance and path tracing results of unobserved rays along with the joint optimization of materials and illuminations, thus significantly improving the decomposition performance. Extensive experiments demonstrate that our method outperforms the state-of-the-art on multiple synthetic and real datasets, especially in terms of inter-reflection decomposition. + + + + Towards Professional Level Crowd Annotation of Expert Domain Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Towards_Professional_Level_Crowd_Annotation_of_Expert_Domain_Data_CVPR_2023_paper.pdf + Image recognition on expert domains is usually fine-grained and requires expert labeling, which is costly. This limits dataset sizes and the accuracy of learning systems. To address this challenge, we consider annotating expert data with crowdsourcing. This is denoted as PrOfeSsional lEvel cRowd (POSER) annotation. A new approach, based on semi-supervised learning (SSL) and denoted as SSL with human filtering (SSL-HF) is proposed. It is a human-in-the-loop SSL method, where crowd-source workers act as filters of pseudo-labels, replacing the unreliable confidence thresholding used by state-of-the-art SSL methods. 
To enable annotation by non-experts, classes are specified implicitly, via positive and negative sets of examples and augmented with deliberative explanations, which highlight regions of class ambiguity. In this way, SSL-HF leverages the strong low-shot learning and confidence estimation ability of humans to create an intuitive but effective labeling experience. Experiments show that SSL-HF significantly outperforms various alternative approaches in several benchmarks. + + + + Deep Stereo Video Inpainting + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Deep_Stereo_Video_Inpainting_CVPR_2023_paper.pdf + Stereo video inpainting aims to fill the missing regions on the left and right views of the stereo video with plausible content simultaneously. Compared with the single video inpainting that has achieved promising results using deep convolutional neural networks, inpainting the missing regions of stereo video has not been thoroughly explored. In essence, apart from the spatial and temporal consistency that single video inpainting needs to achieve, another key challenge for stereo video inpainting is to maintain the stereo consistency between left and right views and hence alleviate the 3D fatigue for viewers. In this paper, we propose a novel deep stereo video inpainting network named SVINet, which is the first attempt for stereo video inpainting task utilizing deep convolutional neural networks. SVINet first utilizes a self-supervised flow-guided deformable temporal alignment module to align the features on the left and right view branches, respectively. Then, the aligned features are fed into a shared adaptive feature aggregation module to generate missing contents of their respective branches. Finally, the parallax attention module (PAM) that uses the cross-view information to consider the significant stereo correlation is introduced to fuse the completed features of left and right views. Furthermore, we develop a stereo consistency loss to regularize the trained parameters, so that our model is able to yield high-quality stereo video inpainting results with better stereo consistency. Experimental results demonstrate that our SVINet outperforms state-of-the-art single video inpainting models. + + + + IFSeg: Image-Free Semantic Segmentation via Vision-Language Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Yun_IFSeg_Image-Free_Semantic_Segmentation_via_Vision-Language_Model_CVPR_2023_paper.pdf + Vision-language (VL) pre-training has recently gained much attention for its transferability and flexibility in novel concepts (e.g., cross-modality transfer) across various visual tasks. However, VL-driven segmentation has been under-explored, and the existing approaches still have the burden of acquiring additional training images or even segmentation annotations to adapt a VL model to downstream segmentation tasks. In this paper, we introduce a novel image-free segmentation task where the goal is to perform semantic segmentation given only a set of the target semantic categories, but without any task-specific images and annotations. To tackle this challenging task, our proposed method, coined IFSeg, generates VL-driven artificial image-segmentation pairs and updates a pre-trained VL model to a segmentation task. We construct this artificial training data by creating a 2D map of random semantic categories and another map of their corresponding word tokens. 
Given that a pre-trained VL model projects visual and text tokens into a common space where tokens that share the same semantics are located close together, this artificially generated word map can replace the real image inputs for such a VL model. Through an extensive set of experiments, our model not only establishes an effective baseline for this novel task but also demonstrates strong performance compared to existing methods that rely on stronger supervision, such as task-specific images and segmentation masks. Code is available at https://github.com/alinlab/ifseg. + + + + Is BERT Blind? Exploring the Effect of Vision-and-Language Pretraining on Visual Language Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Alper_Is_BERT_Blind_Exploring_the_Effect_of_Vision-and-Language_Pretraining_on_CVPR_2023_paper.pdf + Most humans use visual imagination to understand and reason about language, but models such as BERT reason about language using knowledge acquired during text-only pretraining. In this work, we investigate whether vision-and-language pretraining can improve performance on text-only tasks that involve implicit visual reasoning, focusing primarily on zero-shot probing methods. We propose a suite of visual language understanding (VLU) tasks for probing the visual reasoning abilities of text encoder models, as well as various non-visual natural language understanding (NLU) tasks for comparison. We also contribute a novel zero-shot knowledge probing method, Stroop probing, for applying models such as CLIP to text-only tasks without needing a prediction head such as the masked language modelling head of models like BERT. We show that SOTA multimodally trained text encoders outperform unimodally trained text encoders on the VLU tasks while being underperformed by them on the NLU tasks, lending new context to previously mixed results regarding the NLU capabilities of multimodal models. We conclude that exposure to images during pretraining affords inherent visual reasoning knowledge that is reflected in language-only tasks that require implicit visual reasoning. Our findings bear importance in the broader context of multimodal learning, providing principled guidelines for the choice of text encoders used in such contexts. + + + + 3D GAN Inversion With Facial Symmetry Prior + http://openaccess.thecvf.com//content/CVPR2023/papers/Yin_3D_GAN_Inversion_With_Facial_Symmetry_Prior_CVPR_2023_paper.pdf + Recently, a surge of high-quality 3D-aware GANs have been proposed, which leverage the generative power of neural rendering. It is natural to associate 3D GANs with GAN inversion methods to project a real image into the generator's latent space, allowing free-view consistent synthesis and editing, referred to as 3D GAN inversion. Even with the facial prior preserved in pre-trained 3D GANs, reconstructing a 3D portrait from only one monocular image is still an ill-posed problem. The straightforward application of 2D GAN inversion methods focuses on texture similarity only while ignoring the correctness of 3D geometry shapes. It may cause geometry collapse effects, especially when reconstructing a side face under an extreme pose. Besides, the synthetic results in novel views tend to be blurry. In this work, we propose a novel method to promote 3D GAN inversion by introducing a facial symmetry prior. &#13;
We design a pipeline and constraints to make full use of the pseudo auxiliary view obtained via image flipping, which helps obtain a view-consistent and well-structured geometry shape during the inversion process. To enhance texture fidelity in unobserved viewpoints, pseudo labels from depth-guided 3D warping can provide extra supervision. We design constraints aimed at filtering out conflict areas for optimization in asymmetric situations. Comprehensive quantitative and qualitative evaluations on image reconstruction and editing demonstrate the superiority of our method. + + + + SDFusion: Multimodal 3D Shape Completion, Reconstruction, and Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Cheng_SDFusion_Multimodal_3D_Shape_Completion_Reconstruction_and_Generation_CVPR_2023_paper.pdf + In this work, we present a novel framework built to simplify 3D asset generation for amateur users. To enable interactive generation, our method supports a variety of input modalities that can be easily provided by a human, including images, texts, partially observed shapes and combinations of these, further allowing for adjusting the strength of each input. At the core of our approach is an encoder-decoder, compressing 3D shapes into a compact latent representation, upon which a diffusion model is learned. To enable a variety of multi-modal inputs, we employ task-specific encoders with dropout followed by a cross-attention mechanism. Due to its flexibility, our model naturally supports a variety of tasks, outperforming prior works on shape completion, image-based 3D reconstruction, and text-to-3D. Most interestingly, our model can combine all these tasks into one swiss-army-knife tool, enabling the user to perform shape generation using incomplete shapes, images, and textual descriptions at the same time, providing the relative weights for each input and facilitating interactivity. Despite our approach being shape-only, we further show an efficient method to texture the generated shapes using large-scale text-to-image models. + + + + SMAE: Few-Shot Learning for HDR Deghosting With Saturation-Aware Masked Autoencoders + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_SMAE_Few-Shot_Learning_for_HDR_Deghosting_With_Saturation-Aware_Masked_Autoencoders_CVPR_2023_paper.pdf + Generating a high-quality High Dynamic Range (HDR) image from dynamic scenes has recently been extensively studied by exploiting Deep Neural Networks (DNNs). Most DNN-based methods require a large amount of training data with ground truth, requiring tedious and time-consuming work. Few-shot HDR imaging aims to generate satisfactory images with limited data. However, it is difficult for modern DNNs to avoid overfitting when trained on only a few images. In this work, we propose a novel semi-supervised approach to realize few-shot HDR imaging via two stages of training, called SSHDR. Unlike previous methods, which directly recover content and remove ghosts simultaneously and thus struggle to reach an optimum, we first generate the content of saturated regions with a self-supervised mechanism and then address ghosts via an iterative semi-supervised learning framework. Concretely, considering that saturated regions can be regarded as masking Low Dynamic Range (LDR) input regions, we design a Saturated Mask AutoEncoder (SMAE) to learn a robust feature representation and reconstruct a non-saturated HDR image. &#13;
We also propose an adaptive pseudo-label selection strategy to pick high-quality HDR pseudo-labels in the second stage to avoid the effect of mislabeled samples. Experiments demonstrate that SSHDR outperforms state-of-the-art methods quantitatively and qualitatively within and across different datasets, achieving appealing HDR visualization with few labeled samples. + + + + Learning To Render Novel Views From Wide-Baseline Stereo Pairs + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Learning_To_Render_Novel_Views_From_Wide-Baseline_Stereo_Pairs_CVPR_2023_paper.pdf + We introduce a method for novel view synthesis given only a single wide-baseline stereo image pair. In this challenging regime, 3D scene points are regularly observed only once, requiring prior-based reconstruction of scene geometry and appearance. We find that existing approaches to novel view synthesis from sparse observations fail due to recovering incorrect 3D geometry and the high cost of differentiable rendering that precludes their scaling to large-scale training. We take a step towards resolving these shortcomings by formulating a multi-view transformer encoder, proposing an efficient, image-space epipolar line sampling scheme to assemble image features for a target ray, and a lightweight cross-attention-based renderer. Our contributions enable training of our method on a large-scale real-world dataset of indoor and outdoor scenes. In several ablation studies, we demonstrate that our contributions enable learning of powerful multi-view geometry priors while reducing both rendering time and memory footprint. We conduct extensive comparisons on held-out test scenes across two real-world datasets, significantly outperforming prior work on novel view synthesis from sparse image observations and achieving multi-view-consistent novel view synthesis. + + + + TryOnDiffusion: A Tale of Two UNets + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_TryOnDiffusion_A_Tale_of_Two_UNets_CVPR_2023_paper.pdf + Given two images depicting a person and a garment worn by another person, our goal is to generate a visualization of how the garment might look on the input person. A key challenge is to synthesize a photorealistic detail-preserving visualization of the garment, while warping the garment to accommodate a significant body pose and shape change across the subjects. Previous methods either focus on garment detail preservation without effective pose and shape variation, or allow try-on with the desired shape and pose but lack garment details. In this paper, we propose a diffusion-based architecture that unifies two UNets (referred to as Parallel-UNet), which allows us to preserve garment details and warp the garment for significant pose and body change in a single network. The key ideas behind Parallel-UNet include: 1) garment is warped implicitly via a cross attention mechanism, 2) garment warp and person blend happen as part of a unified process as opposed to a sequence of two separate tasks. Experimental results indicate that TryOnDiffusion achieves state-of-the-art performance both qualitatively and quantitatively. + + + + Automatic High Resolution Wire Segmentation and Removal + http://openaccess.thecvf.com//content/CVPR2023/papers/Chiu_Automatic_High_Resolution_Wire_Segmentation_and_Removal_CVPR_2023_paper.pdf + Wires and powerlines are common visual distractions that often undermine the aesthetics of photographs. 
The manual process of precisely segmenting and removing them is extremely tedious and may take hours, especially on high-resolution photos where wires may span the entire space. In this paper, we present an automatic wire clean-up system that eases the process of wire segmentation and removal/inpainting to within a few seconds. We observe several unique challenges: wires are thin, lengthy, and sparse. These are rare properties of subjects that common segmentation tasks cannot handle, especially in high-resolution images. We thus propose a two-stage method that leverages both global and local context to accurately segment wires in high-resolution images efficiently, and a tile-based inpainting strategy to remove the wires given our predicted segmentation masks. We also introduce the first wire segmentation benchmark dataset, WireSegHR. Finally, we demonstrate quantitatively and qualitatively that our wire clean-up system enables fully automated wire removal with strong generalization to various wire appearances. + + + + The Resource Problem of Using Linear Layer Leakage Attack in Federated Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_The_Resource_Problem_of_Using_Linear_Layer_Leakage_Attack_in_CVPR_2023_paper.pdf + Secure aggregation promises a heightened level of privacy in federated learning, maintaining that a server only has access to a decrypted aggregate update. Within this setting, linear layer leakage methods are the only data reconstruction attacks able to scale and achieve a high leakage rate regardless of the number of clients or batch size. This is done through increasing the size of an injected fully-connected (FC) layer. We show that this results in a resource overhead which grows larger with an increasing number of clients. We show that this resource overhead is caused by an incorrect perspective in all prior work that treats an attack on an aggregate update in the same way as an individual update with a larger batch size. Instead, attacking the update from the perspective that aggregation combines multiple individual updates allows the application of sparsity to alleviate the resource overhead. We show that the use of sparsity can decrease the model size overhead by over 327x and the computation time by 3.34x compared to SOTA while maintaining an equivalent total leakage rate of 77%, even with 1000 clients in aggregation. + + + + Seeing a Rose in Five Thousand Ways + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Seeing_a_Rose_in_Five_Thousand_Ways_CVPR_2023_paper.pdf + What is a rose, visually? A rose comprises its intrinsics, including the distribution of geometry, texture, and material specific to its object category. With knowledge of these intrinsic properties, we may render roses of different sizes and shapes, in different poses, and under different lighting conditions. In this work, we build a generative model that learns to capture such object intrinsics from a single image, such as a photo of a bouquet. Such an image includes multiple instances of an object type. These instances all share the same intrinsics, but appear different due to a combination of variance within these intrinsics and differences in extrinsic factors, such as pose and illumination. Experiments show that our model successfully learns object intrinsics (distribution of geometry, texture, and material) for a wide range of objects, each from a single Internet image. &#13;
Our method achieves superior results on multiple downstream tasks, including intrinsic image decomposition, shape and image generation, view synthesis, and relighting. + + + + Neural Residual Radiance Fields for Streamably Free-Viewpoint Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Neural_Residual_Radiance_Fields_for_Streamably_Free-Viewpoint_Videos_CVPR_2023_paper.pdf + The success of Neural Radiance Fields (NeRFs) for modeling and free-view rendering of static objects has inspired numerous attempts on dynamic scenes. Current techniques that utilize neural rendering for facilitating free-view videos (FVVs) are either restricted to offline rendering or capable of processing only brief sequences with minimal motion. In this paper, we present a novel technique, Residual Radiance Field or ReRF, as a highly compact neural representation to achieve real-time FVV rendering on long-duration dynamic scenes. ReRF explicitly models the residual information between adjacent timestamps in the spatial-temporal feature space, with a global coordinate-based tiny MLP as the feature decoder. Specifically, ReRF employs a compact motion grid along with a residual feature grid to exploit inter-frame feature similarities. We show that such a strategy can handle large motions without sacrificing quality. We further present a sequential training scheme to maintain the smoothness and the sparsity of the motion/residual grids. Based on ReRF, we design a special FVV codec that achieves a three-orders-of-magnitude compression rate and provides a companion ReRF player to support online streaming of long-duration FVVs of dynamic scenes. Extensive experiments demonstrate the effectiveness of ReRF for compactly representing dynamic radiance fields, enabling an unprecedented free-viewpoint viewing experience in speed and quality. + + + + ACSeg: Adaptive Conceptualization for Unsupervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_ACSeg_Adaptive_Conceptualization_for_Unsupervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Recently, self-supervised large-scale visual pre-training models have shown great promise in representing pixel-level semantic relationships, significantly promoting the development of unsupervised dense prediction tasks, e.g., unsupervised semantic segmentation (USS). The extracted relationship among pixel-level representations typically contains rich class-aware information, in that semantically identical pixel embeddings in the representation space gather together to form sophisticated concepts. However, leveraging the learned models to ascertain semantically consistent pixel groups or regions in the image is non-trivial since over-/under-clustering overwhelms the conceptualization procedure under various semantic distributions of different images. In this work, we investigate the pixel-level semantic aggregation in self-supervised ViT pre-trained models as image segmentation and propose the Adaptive Conceptualization approach for USS, termed ACSeg. Concretely, we explicitly encode concepts into learnable prototypes and design the Adaptive Concept Generator (ACG), which adaptively maps these prototypes to informative concepts for each image. Meanwhile, considering the scene complexity of different images, we propose the modularity loss to optimize ACG independent of the concept number based on estimating the intensity of pixel pairs belonging to the same concept. &#13;
Finally, we turn the USS task into classifying the discovered concepts in an unsupervised manner. Extensive experiments with state-of-the-art results demonstrate the effectiveness of the proposed ACSeg. + + + + Reproducible Scaling Laws for Contrastive Language-Image Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Cherti_Reproducible_Scaling_Laws_for_Contrastive_Language-Image_Learning_CVPR_2023_paper.pdf + Scaling up neural networks has led to remarkable performance across a wide range of tasks. Moreover, performance often follows reliable scaling laws as a function of training set size, model size, and compute, which offers valuable guidance as large-scale experiments are becoming increasingly expensive. However, previous work on scaling laws has primarily used private data & models or focused on uni-modal language or vision learning. To address these limitations, we investigate scaling laws for contrastive language-image pre-training (CLIP) with the public LAION dataset and the open-source OpenCLIP repository. Our large-scale experiments involve models trained on up to two billion image-text pairs and identify power law scaling for multiple downstream tasks including zero-shot classification, retrieval, linear probing, and end-to-end fine-tuning. We find that the training distribution plays a key role in scaling laws as the OpenAI and OpenCLIP models exhibit different scaling behavior despite identical model architectures and similar training recipes. We open-source our evaluation workflow and all models, including the largest public CLIP models, to ensure reproducibility and make scaling laws research more accessible. Source code and instructions to reproduce this study are available at https://github.com/LAION-AI/scaling-laws-openclip. + + + + PromptCAL: Contrastive Affinity Learning via Auxiliary Prompts for Generalized Novel Category Discovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_PromptCAL_Contrastive_Affinity_Learning_via_Auxiliary_Prompts_for_Generalized_Novel_CVPR_2023_paper.pdf + Although existing semi-supervised learning models achieve remarkable success in learning with unannotated in-distribution data, they mostly fail to learn on unlabeled data sampled from novel semantic classes due to their closed-set assumption. In this work, we target a pragmatic but under-explored Generalized Novel Category Discovery (GNCD) setting. The GNCD setting aims to categorize unlabeled training data coming from known and novel classes by leveraging the information of partially labeled known classes. We propose a two-stage Contrastive Affinity Learning method with auxiliary visual Prompts, dubbed PromptCAL, to address this challenging problem. Our approach discovers reliable pairwise sample affinities to learn better semantic clustering of both known and novel classes for the class token and visual prompts. First, we propose a discriminative prompt regularization loss to reinforce semantic discriminativeness of prompt-adapted pre-trained vision transformer for refined affinity relationships. Besides, we propose contrastive affinity learning to calibrate semantic representations based on our iterative semi-supervised affinity graph generation method for semantically-enhanced supervision. 
Extensive experimental evaluation demonstrates that our PromptCAL method is more effective in discovering novel classes even with limited annotations and surpasses the current state-of-the-art on generic and fine-grained benchmarks (e.g., with nearly 11% gain on CUB-200, and 9% on ImageNet-100) on overall accuracy. Our code will be released to the public. + + + + A Unified Spatial-Angular Structured Light for Single-View Acquisition of Shape and Reflectance + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_A_Unified_Spatial-Angular_Structured_Light_for_Single-View_Acquisition_of_Shape_CVPR_2023_paper.pdf + We propose a unified structured light, consisting of an LED array and an LCD mask, for high-quality acquisition of both shape and reflectance from a single view. For geometry, one LED projects a set of learned mask patterns to accurately encode spatial information; the decoded results from multiple LEDs are then aggregated to produce a final depth map. For appearance, learned light patterns are cast through a transparent mask to efficiently probe angularly-varying reflectance. Per-point BRDF parameters are differentiably optimized with respect to corresponding measurements, and stored in texture maps as the final reflectance. We establish a differentiable pipeline for the joint capture to automatically optimize both the mask and light patterns towards optimal acquisition quality. The effectiveness of our light is demonstrated with a wide variety of physical objects. Our results compare favorably with state-of-the-art techniques. + + + + On the Difficulty of Unpaired Infrared-to-Visible Video Translation: Fine-Grained Content-Rich Patches Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_On_the_Difficulty_of_Unpaired_Infrared-to-Visible_Video_Translation_Fine-Grained_Content-Rich_CVPR_2023_paper.pdf + Explicit visible videos can provide sufficient visual information and facilitate vision applications. Unfortunately, the image sensors of visible cameras are sensitive to light conditions like darkness or overexposure. To make up for this, recently, infrared sensors capable of stable imaging have received increasing attention in autonomous driving and monitoring. However, most prosperous vision models are still trained on massive clear visible data, facing huge visual gaps when deploying to infrared imaging scenarios. In such cases, transferring the infrared video to a distinct visible one with fine-grained semantic patterns is a worthwhile endeavor. Previous works improve the outputs by equally optimizing each patch on the translated visible results, which is unfair for enhancing the details on content-rich patches due to the long-tail effect of pixel distribution. Here we propose a novel CPTrans framework to tackle the challenge via balancing gradients of different patches, achieving the fine-grained Content-rich Patches Transferring. Specifically, the content-aware optimization module encourages model optimization along gradients of target patches, ensuring the improvement of visual details. Additionally, the content-aware temporal normalization module enforces the generator to be robust to the motions of target patches. Moreover, we extend the existing dataset InfraredCity to more challenging adverse weather conditions (rain and snow), dubbed as InfraredCity-Adverse. Extensive experiments show that the proposed CPTrans achieves state-of-the-art performance under diverse scenes while requiring less training time than competitive methods. 
+ + + + CLIP the Gap: A Single Domain Generalization Approach for Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Vidit_CLIP_the_Gap_A_Single_Domain_Generalization_Approach_for_Object_CVPR_2023_paper.pdf + Single Domain Generalization (SDG) tackles the problem of training a model on a single source domain so that it generalizes to any unseen target domain. While this has been well studied for image classification, the literature on SDG object detection remains almost non-existent. To address the challenges of simultaneously learning robust object localization and representation, we propose to leverage a pre-trained vision-language model to introduce semantic domain concepts via textual prompts. We achieve this via a semantic augmentation strategy acting on the features extracted by the detector backbone, as well as a text-based classification loss. Our experiments evidence the benefits of our approach, outperforming by 10% the only existing SDG object detection method, Single-DGOD[49], on their own diverse weather-driving benchmark. + + + + On the Importance of Accurate Geometry Data for Dense 3D Vision Tasks + http://openaccess.thecvf.com//content/CVPR2023/papers/Jung_On_the_Importance_of_Accurate_Geometry_Data_for_Dense_3D_CVPR_2023_paper.pdf + Learning-based methods to solve dense 3D vision problems typically train on 3D sensor data. The respectively used principle of measuring distances provides advantages and drawbacks. These are typically not compared nor discussed in the literature due to a lack of multi-modal datasets. Texture-less regions are problematic for structure from motion and stereo, reflective material poses issues for active sensing, and distances for translucent objects are intricate to measure with existing hardware. Training on inaccurate or corrupt data induces model bias and hampers generalisation capabilities. These effects remain unnoticed if the sensor measurement is considered as ground truth during the evaluation. This paper investigates the effect of sensor errors for the dense 3D vision tasks of depth estimation and reconstruction. We rigorously show the significant impact of sensor characteristics on the learned predictions and notice generalisation issues arising from various technologies in everyday household environments. For evaluation, we introduce a carefully designed dataset comprising measurements from commodity sensors, namely D-ToF, I-ToF, passive/active stereo, and monocular RGB+P. Our study quantifies the considerable sensor noise impact and paves the way to improved dense vision estimates and targeted data fusion. + + + + Understanding Masked Autoencoders via Hierarchical Latent Variable Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Kong_Understanding_Masked_Autoencoders_via_Hierarchical_Latent_Variable_Models_CVPR_2023_paper.pdf + Masked autoencoder (MAE), a simple and effective self-supervised learning framework based on the reconstruction of masked image regions, has recently achieved prominent success in a variety of vision tasks. Despite the emergence of intriguing empirical observations on MAE, a theoretically principled understanding is still lacking. In this work, we formally characterize and justify existing empirical insights and provide theoretical guarantees of MAE. 
We formulate the underlying data-generating process as a hierarchical latent variable model, and show that under reasonable assumptions, MAE provably identifies a set of latent variables in the hierarchical model, explaining why MAE can extract high-level information from pixels. Further, we show how key hyperparameters in MAE (the masking ratio and the patch size) determine which true latent variables are to be recovered, therefore influencing the level of semantic information in the representation. Specifically, extremely large or small masking ratios inevitably lead to low-level representations. Our theory offers coherent explanations of existing empirical observations and provides insights for potential empirical improvements and fundamental limitations of the masked-reconstruction paradigm. We conduct extensive experiments to validate our theoretical insights. + + + + Unbalanced Optimal Transport: A Unified Framework for Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/De_Plaen_Unbalanced_Optimal_Transport_A_Unified_Framework_for_Object_Detection_CVPR_2023_paper.pdf + During training, supervised object detection tries to correctly match the predicted bounding boxes and associated classification scores to the ground truth. This is essential to determine which predictions are to be pushed towards which solutions, or to be discarded. Popular matching strategies include matching to the closest ground truth box (mostly used in combination with anchors), or matching via the Hungarian algorithm (mostly used in anchor-free methods). Each of these strategies comes with its own properties, underlying losses, and heuristics. We show how Unbalanced Optimal Transport unifies these different approaches and opens a whole continuum of methods in between. This allows for a finer selection of the desired properties. Experimentally, we show that training an object detection model with Unbalanced Optimal Transport is able to reach the state-of-the-art both in terms of Average Precision and Average Recall as well as to provide a faster initial convergence. The approach is well suited for GPU implementation, which proves to be an advantage for large-scale models. + + + + Photo Pre-Training, but for Sketch + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Photo_Pre-Training_but_for_Sketch_CVPR_2023_paper.pdf + The sketch community has faced up to its unique challenges over the years; that of data scarcity, however, still remains the most significant to date. This lack of sketch data has imposed on the community a few "peculiar" design choices -- the most representative of them all is perhaps the coerced utilisation of photo-based pre-training (i.e., no sketch), for many core tasks that otherwise dictate specific sketch understanding. In this paper, we ask just the one question -- can we make such photo-based pre-training, to actually benefit sketch? Our answer lies in cultivating the topology of photo data learned at pre-training, and using that as a "free" source of supervision for downstream sketch tasks. In particular, we use fine-grained sketch-based image retrieval (FG-SBIR), one of the most studied and data-hungry sketch tasks, to showcase our new perspective on pre-training. In this context, the topology-informed supervision learned from photos acts as a constraint that takes effect at every fine-tuning step -- neighbouring photos in the pre-trained model remain neighbours under each FG-SBIR update. 
We further portray this neighbourhood consistency constraint as a photo ranking problem and formulate it into a neat cross-modal triplet loss. We also show how this target is better leveraged as a meta objective rather than optimised in parallel with the main FG-SBIR objective. With just this change on pre-training, we beat all previously published results on all five product-level FG-SBIR benchmarks with significant margins (sometimes >10%). And the most beautiful thing, as we note, is that such a gigantic leap is made possible with just a few extra lines of code! Our implementation is available at https://github.com/KeLi-SketchX/Photo-Pre-Training-But-for-Sketch + + + + NeuralPCI: Spatio-Temporal Neural Field for 3D Point Cloud Multi-Frame Non-Linear Interpolation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_NeuralPCI_Spatio-Temporal_Neural_Field_for_3D_Point_Cloud_Multi-Frame_Non-Linear_CVPR_2023_paper.pdf + In recent years, there has been a significant increase in focus on the interpolation task of computer vision. Despite the tremendous advancement of video interpolation, point cloud interpolation remains insufficiently explored. Meanwhile, the existence of numerous nonlinear large motions in real-world scenarios makes the point cloud interpolation task more challenging. In light of these issues, we present NeuralPCI: an end-to-end 4D spatio-temporal Neural field for 3D Point Cloud Interpolation, which implicitly integrates multi-frame information to handle nonlinear large motions for both indoor and outdoor scenarios. Furthermore, we construct a new multi-frame point cloud interpolation dataset called NL-Drive for large nonlinear motions in autonomous driving scenes to better demonstrate the superiority of our method. Ultimately, NeuralPCI achieves state-of-the-art performance on both DHB (Dynamic Human Bodies) and NL-Drive datasets. Beyond the interpolation task, our method can be naturally extended to point cloud extrapolation, morphing, and auto-labeling, which indicates substantial potential in other domains. Codes are available at https://github.com/ispc-lab/NeuralPCI. + + + + Bidirectional Cross-Modal Knowledge Exploration for Video Recognition With Pre-Trained Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Bidirectional_Cross-Modal_Knowledge_Exploration_for_Video_Recognition_With_Pre-Trained_Vision-Language_CVPR_2023_paper.pdf + Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability on various visual tasks. Transferring knowledge from such powerful VLMs is a promising direction for building effective video recognition models. However, current exploration in this field is still limited. We believe that the greatest value of pre-trained VLMs lies in building a bridge between visual and textual domains. In this paper, we propose a novel framework called BIKE, which utilizes the cross-modal bridge to explore bidirectional knowledge: i) We introduce the Video Attribute Association mechanism, which leverages the Video-to-Text knowledge to generate textual auxiliary attributes for complementing video recognition. ii) We also present a Temporal Concept Spotting mechanism that uses the Text-to-Video expertise to capture temporal saliency in a parameter-free manner, leading to enhanced video representation. 
Extensive studies on six popular video datasets, including Kinetics-400 & 600, UCF-101, HMDB-51, ActivityNet and Charades, show that our method achieves state-of-the-art performance in various recognition scenarios, such as general, zero-shot, and few-shot video recognition. Our best model achieves a state-of-the-art accuracy of 88.6% on the challenging Kinetics-400 using the released CLIP model. The code is available at https://github.com/whwu95/BIKE. + + + + Adaptive Plasticity Improvement for Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Liang_Adaptive_Plasticity_Improvement_for_Continual_Learning_CVPR_2023_paper.pdf + Many works have tried to solve the catastrophic forgetting (CF) problem in continual learning (lifelong learning). However, pursuing non-forgetting on old tasks may damage the model's plasticity for new tasks. Although some methods have been proposed to achieve stability-plasticity trade-off, no methods have considered evaluating a model's plasticity and improving plasticity adaptively for a new task. In this work, we propose a new method, called adaptive plasticity improvement (API), for continual learning. Besides the ability to overcome CF on old tasks, API also tries to evaluate the model's plasticity and then adaptively improve the model's plasticity for learning a new task if necessary. Experiments on several real datasets show that API can outperform other state-of-the-art baselines in terms of both accuracy and memory usage. + + + + Semantic Scene Completion With Cleaner Self + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Semantic_Scene_Completion_With_Cleaner_Self_CVPR_2023_paper.pdf + Semantic Scene Completion (SSC) transforms an image of single-view depth and/or RGB 2D pixels into 3D voxels, each of whose semantic labels are predicted. SSC is a well-known ill-posed problem as the prediction model has to "imagine" what is behind the visible surface, which is usually represented by Truncated Signed Distance Function (TSDF). Due to the sensory imperfection of the depth camera, most existing methods based on the noisy TSDF estimated from depth values suffer from 1) incomplete volumetric predictions and 2) confused semantic labels. To this end, we use the ground-truth 3D voxels to generate a perfect visible surface, called TSDF-CAD, and then train a "cleaner" SSC model. As the model is noise-free, it is expected to focus more on the "imagination" of unseen voxels. Then, we propose to distill the intermediate "cleaner" knowledge into another model with noisy TSDF input. In particular, we use the 3D occupancy feature and the semantic relations of the "cleaner self" to supervise the counterparts of the "noisy self" to respectively address the above two incorrect predictions. Experimental results validate that the proposed method improves the noisy counterparts with 3.1% IoU and 2.2% mIoU for measuring scene completion and SSC, and also achieves new state-of-the-art accuracy on the popular NYU dataset. The code is available at https://github.com/fereenwong/CleanerS. + + + + Deep Factorized Metric Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Deep_Factorized_Metric_Learning_CVPR_2023_paper.pdf + Learning a generalizable and comprehensive similarity metric to depict the semantic discrepancies between images is the foundation of many computer vision tasks. 
While existing methods approach this goal by learning an ensemble of embeddings with diverse objectives, the backbone network still receives a mix of all the training signals. Differently, we propose a deep factorized metric learning method (DFML) to factorize the training signal and employ different samples to train various components of the backbone network. We factorize the network to different sub-blocks and devise a learnable router to adaptively allocate the training samples to each sub-block with the objective to capture the most information. The metric model trained by DFML captures different characteristics with different sub-blocks and constitutes a generalizable metric when using all the sub-blocks. The proposed DFML achieves state-of-the-art performance on all three benchmarks for deep metric learning including CUB-200-2011, Cars196, and Stanford Online Products. We also generalize DFML to the image classification task on ImageNet-1K and observe consistent improvement in accuracy/computation trade-off. Specifically, we improve the performance of ViT-B on ImageNet (+0.2% accuracy) with less computation load (-24% FLOPs). + + + + High-Fidelity 3D Face Generation From Natural Language Descriptions + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_High-Fidelity_3D_Face_Generation_From_Natural_Language_Descriptions_CVPR_2023_paper.pdf + Synthesizing high-quality 3D face models from natural language descriptions is very valuable for many applications, including avatar creation, virtual reality, and telepresence. However, little research ever tapped into this task. We argue the major obstacle lies in 1) the lack of high-quality 3D face data with descriptive text annotation, and 2) the complex mapping relationship between descriptive language space and shape/appearance space. To solve these problems, we build DESCRIBE3D dataset, the first large-scale dataset with fine-grained text descriptions for text-to-3D face generation task. Then we propose a two-stage framework to first generate a 3D face that matches the concrete descriptions, then optimize the parameters in the 3D shape and texture space with abstract description to refine the 3D face model. Extensive experimental results show that our method can produce a faithful 3D face that conforms to the input descriptions with higher accuracy and quality than previous methods. The code and DESCRIBE3D dataset are released at https://github.com/zhuhao-nju/describe3d. + + + + Dual-Path Adaptation From Image to Video Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_Dual-Path_Adaptation_From_Image_to_Video_Transformers_CVPR_2023_paper.pdf + In this paper, we efficiently transfer the surpassing representation power of the vision foundation models, such as ViT and Swin, for video understanding with only a few trainable parameters. Previous adaptation methods have simultaneously considered spatial and temporal modeling with a unified learnable module but still suffered from fully leveraging the representative capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose a novel DUALPATH adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. 
Especially for temporal dynamic modeling, we incorporate consecutive frames into a grid-like frameset to precisely imitate vision transformers' capability that extrapolates relationships between tokens. In addition, we extensively investigate the multiple baselines from a unified perspective in video understanding and compare them with DUALPATH. Experimental results on four action recognition benchmarks prove that pretrained image transformers with DUALPATH can be effectively generalized beyond the data domain. + + + + Towards Better Decision Forests: Forest Alternating Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Carreira-Perpinan_Towards_Better_Decision_Forests_Forest_Alternating_Optimization_CVPR_2023_paper.pdf + Decision forests are among the most accurate models in machine learning. This is remarkable given that the way they are trained is highly heuristic: neither the individual trees nor the overall forest optimize any well-defined loss. While diversity mechanisms such as bagging or boosting have been until now critical in the success of forests, we think that a better optimization should lead to better forests---ideally eliminating any need for an ensembling heuristic. However, unlike for most other models, such as neural networks, optimizing forests or trees is not easy, because they define a non-differentiable function. We show, for the first time, that it is possible to learn a forest by optimizing a desirable loss and regularization jointly over all its trees and parameters. Our algorithm, Forest Alternating Optimization, is based on defining a forest as a parametric model with a fixed number of trees and structure (rather than adding trees indefinitely as in bagging or boosting). It then iteratively updates each tree in alternation so that the objective function decreases monotonically. The algorithm is so effective at optimizing that it easily overfits, but this can be corrected by averaging. The result is a forest that consistently exceeds the accuracy of the state-of-the-art while using fewer, smaller trees. + + + + Dynamic Graph Enhanced Contrastive Learning for Chest X-Ray Report Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Dynamic_Graph_Enhanced_Contrastive_Learning_for_Chest_X-Ray_Report_Generation_CVPR_2023_paper.pdf + Automatic radiology reporting has great clinical potential to relieve radiologists from heavy workloads and improve diagnosis interpretation. Recently, researchers have enhanced data-driven neural networks with medical knowledge graphs to eliminate the severe visual and textual bias in this task. The structures of such graphs are exploited by using the clinical dependencies formed by the disease topic tags via general knowledge and usually do not update during the training process. Consequently, the fixed graphs can not guarantee the most appropriate scope of knowledge and limit the effectiveness. To address the limitation, we propose a knowledge graph with Dynamic structure and nodes to facilitate chest X-ray report generation with Contrastive Learning, named DCL. In detail, the fundamental structure of our graph is pre-constructed from general knowledge. Then we explore specific knowledge extracted from the retrieved reports to add additional nodes or redefine their relations in a bottom-up manner. Each image feature is integrated with its very own updated graph before being fed into the decoder module for report generation. 
Finally, this paper introduces Image-Report Contrastive and Image-Report Matching losses to better represent visual features and textual information. Evaluated on IU-Xray and MIMIC-CXR datasets, our DCL outperforms previous state-of-the-art models on these two benchmarks. + + + + FrustumFormer: Adaptive Instance-Aware Resampling for Multi-View 3D Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_FrustumFormer_Adaptive_Instance-Aware_Resampling_for_Multi-View_3D_Detection_CVPR_2023_paper.pdf + The transformation of features from 2D perspective space to 3D space is essential to multi-view 3D object detection. Recent approaches mainly focus on the design of view transformation, either pixel-wisely lifting perspective view features into 3D space with estimated depth or grid-wisely constructing BEV features via 3D projection, treating all pixels or grids equally. However, choosing what to transform is also important but has rarely been discussed before. The pixels of a moving car are more informative than the pixels of the sky. To fully utilize the information contained in images, the view transformation should be able to adapt to different image regions according to their contents. In this paper, we propose a novel framework named FrustumFormer, which pays more attention to the features in instance regions via adaptive instance-aware resampling. Specifically, the model obtains instance frustums on the bird's eye view by leveraging image view object proposals. An adaptive occupancy mask within the instance frustum is learned to refine the instance location. Moreover, the temporal frustum intersection could further reduce the localization uncertainty of objects. Comprehensive experiments on the nuScenes dataset demonstrate the effectiveness of FrustumFormer, and we achieve a new state-of-the-art performance on the benchmark. Codes and models will be made available at https://github.com/Robertwyq/Frustum. + + + + Class-Conditional Sharpness-Aware Minimization for Deep Long-Tailed Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Class-Conditional_Sharpness-Aware_Minimization_for_Deep_Long-Tailed_Recognition_CVPR_2023_paper.pdf + It's widely acknowledged that deep learning models with flatter minima in their loss landscapes tend to generalize better. However, such a property is under-explored in deep long-tailed recognition (DLTR), a practical problem where the model is required to generalize equally well across all classes when trained on a highly imbalanced label distribution. In this paper, through empirical observations, we argue that sharp minima are in fact prevalent in deep long-tailed models, whereas naive integration of existing flattening operations into long-tailed learning algorithms brings little improvement. Instead, we propose an effective two-stage sharpness-aware optimization approach based on the decoupling paradigm in DLTR. In the first stage, both the feature extractor and classifier are trained under parameter perturbations at a class-conditioned scale, which is theoretically motivated by the characteristic radius of flat minima under the PAC-Bayesian framework. In the second stage, we generate adversarial features with class-balanced sampling to further robustify the classifier with the backbone frozen. Extensive experiments on multiple long-tailed visual recognition benchmarks show that our proposed Class-Conditional Sharpness-Aware Minimization (CC-SAM) achieves competitive performance compared to the state-of-the-arts. 
Code is available at https:// github.com/zzpustc/CC-SAM. + + + + Efficient On-Device Training via Gradient Filtering + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Efficient_On-Device_Training_via_Gradient_Filtering_CVPR_2023_paper.pdf + Despite its importance for federated learning, continuous learning and many other applications, on-device training remains an open problem for EdgeAI. The problem stems from the large number of operations (e.g., floating point multiplications and additions) and memory consumption required during training by the back-propagation algorithm. Consequently, in this paper, we propose a new gradient filtering approach which enables on-device CNN model training. More precisely, our approach creates a special structure with fewer unique elements in the gradient map, thus significantly reducing the computational complexity and memory consumption of back propagation during training. Extensive experiments on image classification and semantic segmentation with multiple CNN models (e.g., MobileNet, DeepLabV3, UPerNet) and devices (e.g., Raspberry Pi and Jetson Nano) demonstrate the effectiveness and wide applicability of our approach. For example, compared to SOTA, we achieve up to 19x speedup and 77.1% memory savings on ImageNet classification with only 0.1% accuracy loss. Finally, our method is easy to implement and deploy; over 20x speedup and 90% energy savings have been observed compared to highly optimized baselines in MKLDNN and CUDNN on NVIDIA Jetson Nano. Consequently, our approach opens up a new direction of research with a huge potential for on-device training. + + + + 3D Human Mesh Estimation From Virtual Markers + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_3D_Human_Mesh_Estimation_From_Virtual_Markers_CVPR_2023_paper.pdf + Inspired by the success of volumetric 3D pose estimation, some recent human mesh estimators propose to estimate 3D skeletons as intermediate representations, from which, the dense 3D meshes are regressed by exploiting the mesh topology. However, body shape information is lost in extracting skeletons, leading to mediocre performance. The advanced motion capture systems solve the problem by placing dense physical markers on the body surface, which allows to extract realistic meshes from their non-rigid motions. However, they cannot be applied to wild images without markers. In this work, we present an intermediate representation, named virtual markers, which learns 64 landmark keypoints on the body surface based on the large-scale mocap data in a generative style, mimicking the effects of physical markers. The virtual markers can be accurately detected from wild images and can reconstruct the intact meshes with realistic shapes by simple interpolation. Our approach outperforms the state-of-the-art methods on three datasets. In particular, it surpasses the existing methods by a notable margin on the SURREAL dataset, which has diverse body shapes. Code is available at https://github.com/ShirleyMaxx/VirtualMarker. + + + + CUDA: Convolution-Based Unlearnable Datasets + http://openaccess.thecvf.com//content/CVPR2023/papers/Sadasivan_CUDA_Convolution-Based_Unlearnable_Datasets_CVPR_2023_paper.pdf + Large-scale training of modern deep learning models heavily relies on publicly available data on the web. This potentially unauthorized usage of online data leads to concerns regarding data privacy. 
Recent works aim to make unlearnable data for deep learning models by adding small, specially designed noises to tackle this issue. However, these methods are vulnerable to adversarial training (AT) and/or are computationally heavy. In this work, we propose a novel, model-free, Convolution-based Unlearnable DAtaset (CUDA) generation technique. CUDA is generated using controlled class-wise convolutions with filters that are randomly generated via a private key. CUDA encourages the network to learn the relation between filters and labels rather than informative features for classifying the clean data. We develop some theoretical analysis demonstrating that CUDA can successfully poison Gaussian mixture data by reducing the clean data performance of the optimal Bayes classifier. We also empirically demonstrate the effectiveness of CUDA with various datasets (CIFAR-10, CIFAR-100, ImageNet-100, and Tiny-ImageNet), and architectures (ResNet-18, VGG-16, Wide ResNet-34-10, DenseNet-121, DeIT, EfficientNetV2-S, and MobileNetV2). Our experiments show that CUDA is robust to various data augmentations and training approaches such as smoothing, AT with different budgets, transfer learning, and fine-tuning. For instance, training a ResNet-18 on ImageNet-100 CUDA achieves only 8.96%, 40.08%, and 20.58% clean test accuracies with empirical risk minimization (ERM), L_infinity AT, and L_2 AT, respectively. Here, ERM on the clean training data achieves a clean test accuracy of 80.66%. CUDA exhibits unlearnability effect with ERM even when only a fraction of the training dataset is perturbed. Furthermore, we also show that CUDA is robust to adaptive defenses designed specifically to break it. + + + + MIANet: Aggregating Unbiased Instance and General Information for Few-Shot Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_MIANet_Aggregating_Unbiased_Instance_and_General_Information_for_Few-Shot_Semantic_CVPR_2023_paper.pdf + Existing few-shot segmentation methods are based on the meta-learning strategy and extract instance knowledge from a support set and then apply the knowledge to segment target objects in a query set. However, the extracted knowledge is insufficient to cope with the variable intra-class differences since the knowledge is obtained from a few samples in the support set. To address the problem, we propose a multi-information aggregation network (MIANet) that effectively leverages the general knowledge, i.e., semantic word embeddings, and instance information for accurate segmentation. Specifically, in MIANet, a general information module (GIM) is proposed to extract a general class prototype from word embeddings as a supplement to instance information. To this end, we design a triplet loss that treats the general class prototype as an anchor and samples positive-negative pairs from local features in the support set. The calculated triplet loss can transfer semantic similarities among language identities from a word embedding space to a visual representation space. To alleviate the model biasing towards the seen training classes and to obtain multi-scale information, we then introduce a non-parametric hierarchical prior module (HPM) to generate unbiased instance-level information via calculating the pixel-level similarity between the support and query image features. Finally, an information fusion module (IFM) combines the general and instance information to make predictions for the query image. 
Extensive experiments on PASCAL-5i and COCO-20i show that MIANet yields superior performance and sets a new state-of-the-art. Code is available at github.com/Aldrich2y/MIANet. + + + + Starting From Non-Parametric Networks for 3D Point Cloud Analysis + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Starting_From_Non-Parametric_Networks_for_3D_Point_Cloud_Analysis_CVPR_2023_paper.pdf + We present a Non-parametric Network for 3D point cloud analysis, Point-NN, which consists of purely non-learnable components: farthest point sampling (FPS), k-nearest neighbors (k-NN), and pooling operations, with trigonometric functions. Surprisingly, it performs well on various 3D tasks, requiring no parameters or training, and even surpasses existing fully trained models. Starting from this basic non-parametric model, we propose two extensions. First, Point-NN can serve as a base architectural framework to construct Parametric Networks by simply inserting linear layers on top. Given the superior non-parametric foundation, the derived Point-PN exhibits a high performance-efficiency trade-off with only a few learnable parameters. Second, Point-NN can be regarded as a plug-and-play module for the already trained 3D models during inference. Point-NN captures the complementary geometric knowledge and enhances existing methods for different 3D benchmarks without re-training. We hope our work may cast a light on the community for understanding 3D point clouds with non-parametric methods. Code is available at https://github.com/ZrrSkywalker/Point-NN. + + + + Light Source Separation and Intrinsic Image Decomposition Under AC Illumination + http://openaccess.thecvf.com//content/CVPR2023/papers/Yoshida_Light_Source_Separation_and_Intrinsic_Image_Decomposition_Under_AC_Illumination_CVPR_2023_paper.pdf + Artificial light sources are often powered by an electric grid, and then their intensities rapidly oscillate in response to the grid's alternating current (AC). Interestingly, the flickers of scene radiance values due to AC illumination are useful for extracting rich information on a scene of interest. In this paper, we show that the flickers due to AC illumination are useful for intrinsic image decomposition (IID). Our proposed method conducts the light source separation (LSS) followed by the IID under AC illumination. In particular, we reveal the ambiguity in the blind LSS via matrix factorization and the ambiguity in the IID assuming the Lambert model, and then show why and how those ambiguities can be resolved. We experimentally confirmed that our method can recover the colors of the light sources, the diffuse reflectance values, and the diffuse and specular intensities (shadings) under each of the light sources, and that the IID under AC illumination is effective for application to auto white balancing. + + + + CFA: Class-Wise Calibrated Fair Adversarial Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_CFA_Class-Wise_Calibrated_Fair_Adversarial_Training_CVPR_2023_paper.pdf + Adversarial training has been widely acknowledged as the most effective method to improve the adversarial robustness against adversarial examples for Deep Neural Networks (DNNs). So far, most existing works focus on enhancing the overall model robustness, treating each class equally in both the training and testing phases. Although revealing the disparity in robustness among classes, few works try to make adversarial training fair at the class level without sacrificing overall robustness. 
In this paper, we are the first to theoretically and empirically investigate the preference of different classes for adversarial configurations, including perturbation margin, regularization, and weight averaging. Motivated by this, we further propose a Class-wise calibrated Fair Adversarial training framework, named CFA, which customizes specific training configurations for each class automatically. Experiments on benchmark datasets demonstrate that our proposed CFA can improve both overall robustness and fairness notably over other state-of-the-art methods. Code is available at https://github.com/PKU-ML/CFA. + + + + 3D Human Pose Estimation With Spatio-Temporal Criss-Cross Attention + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_3D_Human_Pose_Estimation_With_Spatio-Temporal_Criss-Cross_Attention_CVPR_2023_paper.pdf + Recent transformer-based solutions have shown great success in 3D human pose estimation. Nevertheless, to calculate the joint-to-joint affinity matrix, the computational cost has a quadratic growth with the increasing number of joints. Such drawback becomes even worse especially for pose estimation in a video sequence, which necessitates spatio-temporal correlation spanning over the entire video. In this paper, we facilitate the issue by decomposing correlation learning into space and time, and present a novel Spatio-Temporal Criss-cross attention (STC) block. Technically, STC first slices its input feature into two partitions evenly along the channel dimension, followed by performing spatial and temporal attention respectively on each partition. STC then models the interactions between joints in an identical frame and joints in an identical trajectory simultaneously by concatenating the outputs from attention layers. On this basis, we devise STCFormer by stacking multiple STC blocks and further integrate a new Structure-enhanced Positional Embedding (SPE) into STCFormer to take the structure of human body into consideration. The embedding function consists of two components: spatio-temporal convolution around neighboring joints to capture local structure, and part-aware embedding to indicate which part each joint belongs to. Extensive experiments are conducted on Human3.6M and MPI-INF-3DHP benchmarks, and superior results are reported when comparing to the state-of-the-art approaches. More remarkably, STCFormer achieves to-date the best published performance: 40.5mm P1 error on the challenging Human3.6M dataset. + + + + Plateau-Reduced Differentiable Path Tracing + http://openaccess.thecvf.com//content/CVPR2023/papers/Fischer_Plateau-Reduced_Differentiable_Path_Tracing_CVPR_2023_paper.pdf + Current differentiable renderers provide light transport gradients with respect to arbitrary scene parameters. However, the mere existence of these gradients does not guarantee useful update steps in an optimization. Instead, inverse rendering might not converge due to inherent plateaus, i.e., regions of zero gradient, in the objective function. We propose to alleviate this by convolving the high-dimensional rendering function that maps scene parameters to images with an additional kernel that blurs the parameter space. We describe two Monte Carlo estimators to compute plateau-free gradients efficiently, i.e., with low variance, and show that these translate into net-gains in optimization error and runtime performance. 
Our approach is a straightforward extension to both black-box and differentiable renderers and enables the successful optimization of problems with intricate light transport, such as caustics or global illumination, that existing differentiable path tracers do not converge on. Our code is at github.com/mfischer-ucl/prdpt. + + + + Glocal Energy-Based Learning for Few-Shot Open-Set Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Glocal_Energy-Based_Learning_for_Few-Shot_Open-Set_Recognition_CVPR_2023_paper.pdf + Few-shot open-set recognition (FSOR) is a challenging task of great practical value. It aims to categorize a sample to one of the pre-defined, closed-set classes illustrated by few examples while being able to reject the sample from unknown classes. In this work, we approach the FSOR task by proposing a novel energy-based hybrid model. The model is composed of two branches, where a classification branch learns a metric to classify a sample to one of closed-set classes and the energy branch explicitly estimates the open-set probability. To achieve holistic detection of open-set samples, our model leverages both class-wise and pixel-wise features to learn a glocal energy-based score, in which a global energy score is learned using the class-wise features, while a local energy score is learned using the pixel-wise features. The model is enforced to assign large energy scores to samples that are deviated from the few-shot examples in either the class-wise features or the pixel-wise features, and to assign small energy scores otherwise. Experiments on three standard FSOR datasets show the superior performance of our model. + + + + Revisiting Temporal Modeling for CLIP-Based Image-to-Video Knowledge Transferring + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Revisiting_Temporal_Modeling_for_CLIP-Based_Image-to-Video_Knowledge_Transferring_CVPR_2023_paper.pdf + Image-text pretrained models, e.g., CLIP, have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, thus attracting increasing attention for their potential to improve visual representation learning in the video domain. In this paper, based on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transferring, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to work on the two cases simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both high-level and low-level knowledge in CLIP model. To tackle this problem, we present Spatial-Temporal Auxiliary Network (STAN) -- a simple and effective temporal modeling mechanism extending CLIP model to diverse video tasks. Specifically, to realize both low-level and high-level knowledge transferring, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over the state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. 
Codes will be available at https://github.com/farewellthree/STAN + + + + EFEM: Equivariant Neural Field Expectation Maximization for 3D Object Segmentation Without Scene Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Lei_EFEM_Equivariant_Neural_Field_Expectation_Maximization_for_3D_Object_Segmentation_CVPR_2023_paper.pdf + We introduce Equivariant Neural Field Expectation Maximization (EFEM), a simple, effective, and robust geometric algorithm that can segment objects in 3D scenes without annotations or training on scenes. We achieve such unsupervised segmentation by exploiting single object shape priors. We make two novel steps in that direction. First, we introduce equivariant shape representations to this problem to eliminate the complexity induced by the variation in object configuration. Second, we propose a novel EM algorithm that can iteratively refine segmentation masks using the equivariant shape prior. We collect a novel real dataset Chairs and Mugs that contains various object configurations and novel scenes in order to verify the effectiveness and robustness of our method. Experimental results demonstrate that our method achieves consistent and robust performance across different scenes where the (weakly) supervised methods may fail. Code and data available at https://www.cis.upenn.edu/ leijh/projects/efem + + + + ECON: Explicit Clothed Humans Optimized via Normal Integration + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiu_ECON_Explicit_Clothed_Humans_Optimized_via_Normal_Integration_CVPR_2023_paper.pdf + The combination of deep learning, artist-curated scans, and Implicit Functions (IF), is enabling the creation of detailed, clothed, 3D humans from images. However, existing methods are far from perfect. IF-based methods recover free-form geometry, but produce disembodied limbs or degenerate shapes for novel poses or clothes. To increase robustness for these cases, existing work uses an explicit parametric body model to constrain surface reconstruction, but this limits the recovery of free-form surfaces such as loose clothing that deviates from the body. What we want is a method that combines the best properties of implicit representation and explicit body regularization. To this end, we make two key observations: (1) current networks are better at inferring detailed 2D maps than full-3D surfaces, and (2) a parametric model can be seen as a "canvas" for stitching together detailed surface patches. Based on these, our method, ECON, has three main steps: (1) It infers detailed 2D normal maps for the front and back side of a clothed person. (2) From these, it recovers 2.5D front and back surfaces, called d-BiNI, that are equally detailed, yet incomplete, and registers these w.r.t. each other with the help of a SMPL-X body mesh recovered from the image. (3) It "inpaints" the missing geometry between d-BiNI surfaces. If the face and hands are noisy, they can optionally be replaced with the ones of SMPL-X. As a result, ECON infers high-fidelity 3D humans even in loose clothes and challenging poses. This goes beyond previous methods, according to the quantitative evaluation on the CAPE and Renderpeople datasets. Perceptual studies also show that ECON's perceived realism is better by a large margin. 
Code and models are available for research purposes at econ.is.tue.mpg.de + + + + F2-NeRF: Fast Neural Radiance Field Training With Free Camera Trajectories + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_F2-NeRF_Fast_Neural_Radiance_Field_Training_With_Free_Camera_Trajectories_CVPR_2023_paper.pdf + This paper presents a novel grid-based NeRF called F^2-NeRF (Fast-Free-NeRF) for novel view synthesis, which enables arbitrary input camera trajectories and only costs a few minutes for training. Existing fast grid-based NeRF training frameworks, like Instant-NGP, Plenoxels, DVGO, or TensoRF, are mainly designed for bounded scenes and rely on space warping to handle unbounded scenes. The two existing widely-used space-warping methods are only designed for the forward-facing trajectory or the 360° object-centric trajectory but cannot process arbitrary trajectories. In this paper, we delve deep into the mechanism of space warping to handle unbounded scenes. Based on our analysis, we further propose a novel space-warping method called perspective warping, which allows us to handle arbitrary trajectories in the grid-based NeRF framework. Extensive experiments demonstrate that F^2-NeRF is able to use the same perspective warping to render high-quality images on two standard datasets and a new free trajectory dataset collected by us. + + + + Learning To Detect and Segment for Open Vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Learning_To_Detect_and_Segment_for_Open_Vocabulary_Object_Detection_CVPR_2023_paper.pdf + Open vocabulary object detection has been greatly advanced by the recent development of vision-language pre-trained models, which help recognize the novel objects with only semantic categories. The prior works mainly focus on knowledge transferring to the object proposal classification and employ class-agnostic box and mask prediction. In this work, we propose CondHead, a principled dynamic network design to better generalize the box regression and mask segmentation for the open vocabulary setting. The core idea is to conditionally parametrize the network heads on semantic embedding and thus the model is guided with class-specific knowledge to better detect novel categories. Specifically, CondHead is composed of two streams of network heads, the dynamically aggregated heads and dynamically generated heads. The former is instantiated with a set of static heads that are conditionally aggregated; these heads are optimized as experts and are expected to learn sophisticated prediction. The latter is instantiated with dynamically generated parameters and encodes general class-specific information. With such conditional design, the detection model is bridged by the semantic embedding to offer strongly generalizable class-wise box and mask prediction. Our method brings significant improvement to the prior state-of-the-art open vocabulary object detection methods with very minor overhead, e.g., it surpasses a RegionClip model by 3.0 detection AP on novel categories, with only 1.1% more computation. + + + + Disentangling Writer and Character Styles for Handwriting Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Dai_Disentangling_Writer_and_Character_Styles_for_Handwriting_Generation_CVPR_2023_paper.pdf + Training machines to synthesize diverse handwritings is an intriguing task. Recently, RNN-based methods have been proposed to generate stylized online Chinese characters. 
However, these methods mainly focus on capturing a person's overall writing style, neglecting subtle style inconsistencies between characters written by the same person. For example, while a person's handwriting typically exhibits general uniformity (e.g., glyph slant and aspect ratios), there are still small style variations in finer details (e.g., stroke length and curvature) of characters. In light of this, we propose to disentangle the style representations at both writer and character levels from individual handwritings to synthesize realistic stylized online handwritten characters. Specifically, we present the style-disentangled Transformer (SDT), which employs two complementary contrastive objectives to extract the style commonalities of reference samples and capture the detailed style patterns of each sample, respectively. Extensive experiments on various language scripts demonstrate the effectiveness of SDT. Notably, our empirical findings reveal that the two learned style representations provide information at different frequency magnitudes, underscoring the importance of separate style extraction. Our source code is public at: https://github.com/dailenson/SDT. + + + + StyleSync: High-Fidelity Generalized and Personalized Lip Sync in Style-Based Generator + http://openaccess.thecvf.com//content/CVPR2023/papers/Guan_StyleSync_High-Fidelity_Generalized_and_Personalized_Lip_Sync_in_Style-Based_Generator_CVPR_2023_paper.pdf + Despite recent advances in syncing lip movements with any audio waves, current methods still struggle to balance generation quality and the model's generalization ability. Previous studies either require long-term data for training or produce a similar movement pattern on all subjects with low quality. In this paper, we propose StyleSync, an effective framework that enables high-fidelity lip synchronization. We identify that a style-based generator would sufficiently enable such a charming property on both one-shot and few-shot scenarios. Specifically, we design a mask-guided spatial information encoding module that preserves the details of the given face. The mouth shapes are accurately modified by audio through modulated convolutions. Moreover, our design also enables personalized lip-sync by introducing style space and generator refinement on only limited frames. Thus the identity and talking style of a target person could be accurately preserved. Extensive experiments demonstrate the effectiveness of our method in producing high-fidelity results on a variety of scenes. + + + + Coreset Sampling From Open-Set for Fine-Grained Self-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Coreset_Sampling_From_Open-Set_for_Fine-Grained_Self-Supervised_Learning_CVPR_2023_paper.pdf + Deep learning in general domains has constantly been extended to domain-specific tasks requiring the recognition of fine-grained characteristics. However, real-world applications for fine-grained tasks suffer from two challenges: a high reliance on expert knowledge for annotation and necessity of a versatile model for various downstream tasks in a specific domain (e.g., prediction of categories, bounding boxes, or pixel-wise annotations). Fortunately, the recent self-supervised learning (SSL) is a promising approach to pretrain a model without annotations, serving as an effective initialization for any downstream tasks. Since SSL does not rely on the presence of annotation, in general, it utilizes the large-scale unlabeled dataset, referred to as an open-set. 
In this sense, we introduce a novel Open-Set Self-Supervised Learning problem under the assumption that a large-scale unlabeled open-set is available, as well as the fine-grained target dataset, during a pretraining phase. In our problem setup, it is crucial to consider the distribution mismatch between the open-set and the target dataset. Hence, we propose the SimCore algorithm to sample a coreset, the subset of the open-set that has the minimum distance to the target dataset in the latent space. We demonstrate that SimCore significantly improves representation learning performance through extensive experimental settings, including eleven fine-grained datasets and seven open-sets in various downstream tasks. + + + + Generative Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Generative_Semantic_Segmentation_CVPR_2023_paper.pdf + We present Generative Semantic Segmentation (GSS), a generative learning approach for semantic segmentation. Uniquely, we cast semantic segmentation as an image-conditioned mask generation problem. This is achieved by replacing the conventional per-pixel discriminative learning with a latent prior learning process. Specifically, we model the variational posterior distribution of latent variables given the segmentation mask. To that end, the segmentation mask is expressed with a special type of image (dubbed maskige). This posterior distribution allows us to generate segmentation masks unconditionally. To achieve semantic segmentation on a given image, we further introduce a conditioning network. It is optimized by minimizing the divergence between the posterior distribution of maskige (i.e., segmentation masks) and the latent prior distribution of input training images. Extensive experiments on standard benchmarks show that our GSS can perform competitively with prior art alternatives in the standard semantic segmentation setting, whilst achieving a new state of the art in the more challenging cross-domain setting. + + + + Instant-NVR: Instant Neural Volumetric Rendering for Human-Object Interactions From Monocular RGBD Stream + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Instant-NVR_Instant_Neural_Volumetric_Rendering_for_Human-Object_Interactions_From_Monocular_CVPR_2023_paper.pdf + Convenient 4D modeling of human-object interactions is essential for numerous applications. However, monocular tracking and rendering of complex interaction scenarios remain challenging. In this paper, we propose Instant-NVR, a neural approach for instant volumetric human-object tracking and rendering using a single RGBD camera. It bridges traditional non-rigid tracking with recent instant radiance field techniques via a multi-thread tracking-rendering mechanism. In the tracking front-end, we adopt a robust human-object capture scheme to provide sufficient motion priors. We further introduce a separated instant neural representation with a novel hybrid deformation module for the interacting scene. We also provide an on-the-fly reconstruction scheme of the dynamic/static radiance fields via efficient motion-prior searching. Moreover, we introduce an online key frame selection scheme and a rendering-aware refinement strategy to significantly improve the appearance details for online novel-view synthesis.
Extensive experiments demonstrate the effectiveness and efficiency of our approach for the instant generation of human-object radiance fields on the fly, notably achieving real-time photo-realistic novel view synthesis under complex human-object interactions. + + + + Aligning Step-by-Step Instructional Diagrams to Video Demonstrations + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Aligning_Step-by-Step_Instructional_Diagrams_to_Video_Demonstrations_CVPR_2023_paper.pdf + Multimodal alignment facilitates the retrieval of instances from one modality when queried using another. In this paper, we consider a novel setting where such an alignment is between (i) instruction steps that are depicted as assembly diagrams (commonly seen in Ikea assembly manuals) and (ii) video segments from in-the-wild videos; these videos comprising an enactment of the assembly actions in the real world. To learn this alignment, we introduce a novel supervised contrastive learning method that learns to align videos with the subtle details in the assembly diagrams, guided by a set of novel losses. To study this problem and demonstrate the effectiveness of our method, we introduce a novel dataset: IAW---for Ikea assembly in the wild---consisting of 183 hours of videos from diverse furniture assembly collections and nearly 8,300 illustrations from their associated instruction manuals and annotated for their ground truth alignments. We define two tasks on this dataset: First, nearest neighbor retrieval between video segments and illustrations, and, second, alignment of instruction steps and the segments for each video. Extensive experiments on IAW demonstrate superior performances of our approach against alternatives. + + + + High-Fidelity and Freely Controllable Talking Head Video Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_High-Fidelity_and_Freely_Controllable_Talking_Head_Video_Generation_CVPR_2023_paper.pdf + Talking head generation is to generate video based on a given source identity and target motion. However, current methods face several challenges that limit the quality and controllability of the generated videos. First, the generated face often has unexpected deformation and severe distortions. Second, the driving image does not explicitly disentangle movement-relevant information, such as poses and expressions, which restricts the manipulation of different attributes during generation. Third, the generated videos tend to have flickering artifacts due to the inconsistency of the extracted landmarks between adjacent frames. In this paper, we propose a novel model that produces high-fidelity talking head videos with free control over head pose and expression. Our method leverages both self-supervised learned landmarks and 3D face model-based landmarks to model the motion. We also introduce a novel motion-aware multi-scale feature alignment module to effectively transfer the motion without face distortion. Furthermore, we enhance the smoothness of the synthesized talking head videos with a feature context adaptation and propagation module. We evaluate our model on challenging datasets and demonstrate its state-of-the-art performance. More information is available at https://yuegao.me/PECHead. 
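As a side note on the instructional-diagram-to-video alignment entry above (IAW): that work learns the alignment with supervised contrastive objectives. The snippet below is only a minimal sketch of a generic symmetric InfoNCE loss between paired segment and diagram embeddings; the encoder outputs, batch construction, and temperature are illustrative assumptions, not the paper's actual losses.

import torch
import torch.nn.functional as F

def infonce_alignment_loss(video_emb, diagram_emb, temperature=0.07):
    """Symmetric InfoNCE loss between paired video-segment and diagram embeddings.

    video_emb, diagram_emb: (B, D) tensors; row i of each is assumed to be a
    matching (video segment, instruction diagram) pair, all other rows serve as negatives.
    """
    v = F.normalize(video_emb, dim=-1)
    d = F.normalize(diagram_emb, dim=-1)
    logits = v @ d.t() / temperature                      # (B, B) cosine-similarity logits
    targets = torch.arange(v.size(0), device=v.device)    # matching pairs lie on the diagonal
    # Contrast in both directions: video-to-diagram and diagram-to-video.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    # Random features standing in for encoder outputs.
    print(infonce_alignment_loss(torch.randn(8, 256), torch.randn(8, 256)).item())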
+ + + + Q-DETR: An Efficient Low-Bit Quantized Detection Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Q-DETR_An_Efficient_Low-Bit_Quantized_Detection_Transformer_CVPR_2023_paper.pdf + The recent detection transformer (DETR) has advanced object detection, but its application on resource-constrained devices requires massive computation and memory resources. Quantization stands out as a solution by representing the network in low-bit parameters and operations. However, there is a significant performance drop when performing low-bit quantized DETR (Q-DETR) with existing quantization methods. Through our empirical analyses, we find that the bottlenecks of Q-DETR come from query information distortion. This paper addresses this problem with a distribution rectification distillation (DRD) method. We formulate our DRD as a bi-level optimization problem, which can be derived by generalizing the information bottleneck (IB) principle to the learning of Q-DETR. At the inner level, we conduct a distribution alignment for the queries to maximize the self-information entropy. At the upper level, we introduce a new foreground-aware query matching scheme to effectively transfer the teacher information to distillation-desired features to minimize the conditional information entropy. Extensive experimental results show that our method performs much better than prior art. For example, the 4-bit Q-DETR can theoretically accelerate DETR with a ResNet-50 backbone by 6.6x and achieve 39.4% AP, with only a 2.6% performance gap compared with its real-valued counterpart on the COCO dataset. + + + + Burstormer: Burst Image Restoration and Enhancement Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Dudhane_Burstormer_Burst_Image_Restoration_and_Enhancement_Transformer_CVPR_2023_paper.pdf + On a shutter press, modern handheld cameras capture multiple images in rapid succession and merge them to generate a single image. However, individual frames in a burst are misaligned due to inevitable motions and contain multiple degradations. The challenge is to properly align the successive image shots and merge their complementary information to achieve high-quality outputs. Toward this end, we propose Burstormer: a novel transformer-based architecture for burst image restoration and enhancement. In comparison to existing works, our approach exploits multi-scale local and non-local features to achieve improved alignment and feature fusion. Our key idea is to enable inter-frame communication in the burst neighborhoods for information aggregation and progressive fusion while modeling the burst-wide context. However, the input burst frames need to be properly aligned before fusing their information. Therefore, we propose an enhanced deformable alignment module for aligning burst features with respect to the reference frame. Unlike existing methods, the proposed alignment module not only aligns burst features but also exchanges feature information and maintains focused communication with the reference frame through the proposed reference-based feature enrichment mechanism, which facilitates handling complex motions. After multi-level alignment and enrichment, we re-emphasize inter-frame communication within the burst using a cyclic burst sampling module. Finally, the inter-frame information is aggregated using the proposed burst feature fusion module followed by progressive upsampling.
Our Burstormer outperforms state-of-the-art methods on burst super-resolution, burst denoising and burst low-light enhancement. Our code and pre-trained models are available at https://github.com/akshaydudhane16/Burstormer. + + + + Progressive Transformation Learning for Leveraging Virtual Images in Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_Progressive_Transformation_Learning_for_Leveraging_Virtual_Images_in_Training_CVPR_2023_paper.pdf + To effectively interrogate UAV-based images for detecting objects of interest, such as humans, it is essential to acquire large-scale UAV-based datasets that include human instances with various poses captured from widely varying viewing angles. As a viable alternative to laborious and costly data curation, we introduce Progressive Transformation Learning (PTL), which gradually augments a training dataset by adding transformed virtual images with enhanced realism. Generally, a virtual2real transformation generator in the conditional GAN framework suffers from quality degradation when a large domain gap exists between real and virtual images. To deal with the domain gap, PTL takes a novel approach that progressively iterates the following three steps: 1) select a subset from a pool of virtual images according to the domain gap, 2) transform the selected virtual images to enhance realism, and 3) add the transformed virtual images to the training set while removing them from the pool. In PTL, accurately quantifying the domain gap is critical. To do that, we theoretically demonstrate that the feature representation space of a given object detector can be modeled as a multivariate Gaussian distribution from which the Mahalanobis distance between a virtual object and the Gaussian distribution of each object category in the representation space can be readily computed. Experiments show that PTL results in a substantial performance increase over the baseline, especially in the small-data and cross-domain regimes. + + + + Co-Speech Gesture Synthesis by Reinforcement Learning With Contrastive Pre-Trained Rewards + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Co-Speech_Gesture_Synthesis_by_Reinforcement_Learning_With_Contrastive_Pre-Trained_Rewards_CVPR_2023_paper.pdf + There is a growing demand for automatically synthesizing co-speech gestures for virtual characters. However, it remains a challenge due to the complex relationship between input speech and target gestures. Most existing works focus on predicting the next gesture that fits the data best; however, such methods are myopic and lack the ability to plan for future gestures. In this paper, we propose a novel reinforcement learning (RL) framework called RACER to generate sequences of gestures that maximize the overall satisfaction. RACER employs a vector quantized variational autoencoder to learn compact representations of gestures and a GPT-based policy architecture to generate coherent sequences of gestures autoregressively. In particular, we propose a contrastive pre-training approach to calculate the rewards, which integrates contextual information into action evaluation and successfully captures the complex relationships between multi-modal speech-gesture data. Experimental results show that our method significantly outperforms existing baselines in terms of both objective metrics and subjective human judgements. Demos can be found at https://github.com/RLracer/RACER.git.
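The Progressive Transformation Learning entry above quantifies the domain gap with the Mahalanobis distance between a virtual object's detector feature and a per-category Gaussian fitted to real features. The following is a rough sketch of that distance computation only, with random arrays standing in for detector features; the feature dimensionality and covariance regularization are assumptions.

import numpy as np

def fit_category_gaussian(real_feats, eps=1e-6):
    """Fit a multivariate Gaussian (mean, inverse covariance) to real-object features of shape (N, D)."""
    mu = real_feats.mean(axis=0)
    cov = np.cov(real_feats, rowvar=False) + eps * np.eye(real_feats.shape[1])  # regularize for invertibility
    return mu, np.linalg.inv(cov)

def mahalanobis_distance(feat, mu, cov_inv):
    """Distance of a single feature vector of shape (D,) to the fitted category Gaussian."""
    diff = feat - mu
    return float(np.sqrt(diff @ cov_inv @ diff))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real_feats = rng.normal(size=(500, 64))   # stand-in for real-object features of one category
    virtual_feat = rng.normal(size=64)        # stand-in for one virtual object's feature
    mu, cov_inv = fit_category_gaussian(real_feats)
    print(mahalanobis_distance(virtual_feat, mu, cov_inv))

In PTL's step 1, such distances would presumably rank which virtual images are close enough to the real distribution to be transformed next.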
+ + + + SDC-UDA: Volumetric Unsupervised Domain Adaptation Framework for Slice-Direction Continuous Cross-Modality Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shin_SDC-UDA_Volumetric_Unsupervised_Domain_Adaptation_Framework_for_Slice-Direction_Continuous_Cross-Modality_CVPR_2023_paper.pdf + Recent advances in deep learning-based medical image segmentation studies achieve nearly human-level performance in fully supervised manner. However, acquiring pixel-level expert annotations is extremely expensive and laborious in medical imaging fields. Unsupervised domain adaptation (UDA) can alleviate this problem, which makes it possible to use annotated data in one imaging modality to train a network that can successfully perform segmentation on target imaging modality with no labels. In this work, we propose SDC-UDA, a simple yet effective volumetric UDA framework for Slice-Direction Continuous cross-modality medical image segmentation which combines intra- and inter-slice self-attentive image translation, uncertainty-constrained pseudo-label refinement, and volumetric self-training. Our method is distinguished from previous methods on UDA for medical image segmentation in that it can obtain continuous segmentation in the slice direction, thereby ensuring higher accuracy and potential in clinical practice. We validate SDC-UDA with multiple publicly available cross-modality medical image segmentation datasets and achieve state-of-the-art segmentation performance, not to mention the superior slice-direction continuity of prediction compared to previous studies. + + + + Divide and Conquer: Answering Questions With Object Factorization and Compositional Reasoning + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Divide_and_Conquer_Answering_Questions_With_Object_Factorization_and_Compositional_CVPR_2023_paper.pdf + Humans have the innate capability to answer diverse questions, which is rooted in the natural ability to correlate different concepts based on their semantic relationships and decompose difficult problems into sub-tasks. On the contrary, existing visual reasoning methods assume training samples that capture every possible object and reasoning problem, and rely on black-boxed models that commonly exploit statistical priors. They have yet to develop the capability to address novel objects or spurious biases in real-world scenarios, and also fall short of interpreting the rationales behind their decisions. Inspired by humans' reasoning of the visual world, we tackle the aforementioned challenges from a compositional perspective, and propose an integral framework consisting of a principled object factorization method and a novel neural module network. Our factorization method decomposes objects based on their key characteristics, and automatically derives prototypes that represent a wide range of objects. With these prototypes encoding important semantics, the proposed network then correlates objects by measuring their similarity on a common semantic space and makes decisions with a compositional reasoning process. It is capable of answering questions with diverse objects regardless of their availability during training, and overcoming the issues of biased question-answer distributions. In addition to the enhanced generalizability, our framework also provides an interpretable interface for understanding the decision-making process of models. Our code is available at https://github.com/szzexpoi/POEM. 
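The SDC-UDA entry above mentions uncertainty-constrained pseudo-label refinement during volumetric self-training. Below is a toy sketch of one common way to realize such a constraint, keeping only low-entropy predictions; the entropy measure, quantile threshold, and ignore index here are assumptions rather than the paper's actual rule.

import torch

def refine_pseudo_labels(probs, entropy_quantile=0.8, ignore_index=255):
    """Keep pseudo labels only for low-uncertainty pixels of a softmax prediction.

    probs: (C, H, W) class probabilities for one slice. Pixels whose predictive
    entropy lies above the given quantile are marked with ignore_index so that
    they do not contribute to the self-training loss.
    """
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=0)   # (H, W) predictive entropy
    labels = probs.argmax(dim=0)                                  # (H, W) hard pseudo labels
    threshold = torch.quantile(entropy.flatten(), entropy_quantile)
    labels[entropy > threshold] = ignore_index
    return labels

if __name__ == "__main__":
    probs = torch.softmax(torch.randn(4, 64, 64), dim=0)
    pseudo = refine_pseudo_labels(probs)
    print((pseudo == 255).float().mean().item(), "fraction of pixels discarded")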
+ + + + Jedi: Entropy-Based Localization and Removal of Adversarial Patches + http://openaccess.thecvf.com//content/CVPR2023/papers/Tarchoun_Jedi_Entropy-Based_Localization_and_Removal_of_Adversarial_Patches_CVPR_2023_paper.pdf + Real-world adversarial physical patches were recently shown to be successful in compromising state-of-the-art models in a variety of computer vision applications. The most promising defenses that are based on either input gradient or features analyses have been shown to be compromised by recent GAN-based adaptive attacks that generate realistic/naturalistic patches. In this paper, we propose Jedi, a new defense against adversarial patches that is resilient to realistic patch attacks, and also improves detection and recovery compared to the state of the art. Jedi leverages two new ideas: (1) it improves the identification of potential patch regions using entropy analysis: we show that the entropy of adversarial patches is high, even in naturalistic patches; and (2) it improves the localization of adversarial patches, using an autoencoder that is able to complete patch regions and filter out normal regions with high entropy that are not part of a patch. Jedi achieves high precision adversarial patch localization, which we show is critical to successfully repair the images. Since Jedi relies on an input entropy analysis, it is model-agnostic, and can be applied on pre-trained off-the-shelf models without changes to the training or inference of the protected models. Jedi detects on average 90% of adversarial patches across different benchmarks and recovers up to 94% of successful patch attacks (Compared to 75% and 65% for LGS and Jujutsu, respectively). Jedi is also able to continue detection even in the presence of adaptive realistic patches that are able to fool other defenses. + + + + Localized Semantic Feature Mixers for Efficient Pedestrian Detection in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Khan_Localized_Semantic_Feature_Mixers_for_Efficient_Pedestrian_Detection_in_Autonomous_CVPR_2023_paper.pdf + Autonomous driving systems rely heavily on the underlying perception module which needs to be both performant and efficient to allow precise decisions in real-time. Avoiding collisions with pedestrians is of topmost priority in any autonomous driving system. Therefore, pedestrian detection is one of the core parts of such systems' perception modules. Current state-of-the-art pedestrian detectors have two major issues. Firstly, they have long inference times which affect the efficiency of the whole perception module, and secondly, their performance in the case of small and heavily occluded pedestrians is poor. We propose Localized Semantic Feature Mixers (LSFM), a novel, anchor-free pedestrian detection architecture. It uses our novel Super Pixel Pyramid Pooling module instead of the, computationally costly, Feature Pyramid Networks for feature encoding. Moreover, our MLPMixer-based Dense Focal Detection Network is used as a light detection head, reducing computational effort and inference time compared to existing approaches. To boost the performance of the proposed architecture, we adapt and use mixup augmentation which improves the performance, especially in small and heavily occluded cases. We benchmark LSFM against the state-of-the-art on well-established traffic scene pedestrian datasets. 
The proposed LSFM achieves state-of-the-art performance in Caltech, City Persons, Euro City Persons, and TJU-Traffic-Pedestrian datasets while reducing the inference time on average by 55%. Further, LSFM beats the human baseline for the first time in the history of pedestrian detection. Finally, we conducted a cross-dataset evaluation which proved that our proposed LSFM generalizes well to unseen data. + + + + VDN-NeRF: Resolving Shape-Radiance Ambiguity via View-Dependence Normalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_VDN-NeRF_Resolving_Shape-Radiance_Ambiguity_via_View-Dependence_Normalization_CVPR_2023_paper.pdf + We propose VDN-NeRF, a method to train neural radiance fields (NeRFs) for better geometry under non-Lambertian surface and dynamic lighting conditions that cause significant variation in the radiance of a point when viewed from different angles. Instead of explicitly modeling the underlying factors that result in the view-dependent phenomenon, which could be complex yet not inclusive, we develop a simple and effective technique that normalizes the view-dependence by distilling invariant information already encoded in the learned NeRFs. We then jointly train NeRFs for view synthesis with view-dependence normalization to attain quality geometry. Our experiments show that even though shape-radiance ambiguity is inevitable, the proposed normalization can minimize its effect on geometry, which essentially aligns the optimal capacity needed for explaining view-dependent variations. Our method applies to various baselines and significantly improves geometry without changing the volume rendering pipeline, even if the data is captured under a moving light source. Code is available at: https://github.com/BoifZ/VDN-NeRF. + + + + Coaching a Teachable Student + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Coaching_a_Teachable_Student_CVPR_2023_paper.pdf + We propose a novel knowledge distillation framework for effectively teaching a sensorimotor student agent to drive from the supervision of a privileged teacher agent. Current distillation for sensorimotor agents methods tend to result in suboptimal learned driving behavior by the student, which we hypothesize is due to inherent differences between the input, modeling capacity, and optimization processes of the two agents. We develop a novel distillation scheme that can address these limitations and close the gap between the sensorimotor agent and its privileged teacher. Our key insight is to design a student which learns to align their input features with the teacher's privileged Bird's Eye View (BEV) space. The student then can benefit from direct supervision by the teacher over the internal representation learning. To scaffold the difficult sensorimotor learning task, the student model is optimized via a student-paced coaching mechanism with various auxiliary supervision. We further propose a high-capacity imitation learned privileged agent that surpasses prior privileged agents in CARLA and ensures the student learns safe driving behavior. Our proposed sensorimotor agent results in a robust image-based behavior cloning agent in CARLA, improving over current models by over 20.6% in driving score without requiring LiDAR, historical observations, ensemble of models, on-policy data aggregation or reinforcement learning. 
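The Jedi entry above locates adversarial patches by flagging image regions with unusually high entropy. Below is a simplified sliding-window entropy map over a grayscale image; the window size, histogram bins, and percentile threshold are illustrative choices, not values from the paper.

import numpy as np

def local_entropy_map(gray, window=16, bins=32):
    """Shannon entropy of pixel intensities inside non-overlapping windows of a grayscale image."""
    rows, cols = gray.shape[0] // window, gray.shape[1] // window
    ent = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            patch = gray[i * window:(i + 1) * window, j * window:(j + 1) * window]
            hist, _ = np.histogram(patch, bins=bins, range=(0, 256))
            p = hist / max(hist.sum(), 1)
            p = p[p > 0]
            ent[i, j] = -(p * np.log2(p)).sum()
    return ent

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.integers(0, 256, size=(128, 128)).astype(np.float64)
    ent = local_entropy_map(img)
    # Windows whose entropy exceeds, e.g., the 95th percentile become patch candidates.
    print(int((ent > np.percentile(ent, 95)).sum()), "high-entropy candidate windows")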
+ + + + RealImpact: A Dataset of Impact Sound Fields for Real Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Clarke_RealImpact_A_Dataset_of_Impact_Sound_Fields_for_Real_Objects_CVPR_2023_paper.pdf + Objects make unique sounds under different perturbations, environment conditions, and poses relative to the listener. While prior works have modeled impact sounds and sound propagation in simulation, we lack a standard dataset of impact sound fields of real objects for audio-visual learning and calibration of the sim-to-real gap. We present RealImpact, a large-scale dataset of real object impact sounds recorded under controlled conditions. RealImpact contains 150,000 recordings of impact sounds of 50 everyday objects with detailed annotations, including their impact locations, microphone locations, contact force profiles, material labels, and RGBD images. We make preliminary attempts to use our dataset as a reference to current simulation methods for estimating object impact sounds that match the real world. Moreover, we demonstrate the usefulness of our dataset as a testbed for acoustic and audio-visual learning via the evaluation of two benchmark tasks, including listener location classification and visual acoustic matching. + + + + Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Uni-Perceiver_v2_A_Generalist_Model_for_Large-Scale_Vision_and_Vision-Language_CVPR_2023_paper.pdf + Despite the remarkable success of foundation models, their task-specific fine-tuning paradigm makes them inconsistent with the goal of general perception modeling. The key to eliminating this inconsistency is to use generalist models for general task modeling. However, existing attempts at generalist models are inadequate in both versatility and performance. In this paper, we propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks with competitive performance. Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model. The encoded representations are transformed by a task-agnostic decoder. Different tasks are formulated as a unified maximum likelihood estimation problem. We further propose an effective optimization technique named Task-Balanced Gradient Normalization to ensure stable multi-task learning with an unmixed sampling strategy, which is helpful for tasks requiring large batch-size training. After being jointly trained on various tasks, Uni-Perceiver v2 is capable of directly handling downstream tasks without any task-specific adaptation. Results show that Uni-Perceiver v2 outperforms all existing generalist models in both versatility and performance. Meanwhile, compared with the commonly-recognized strong baselines that require tasks-specific fine-tuning, Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks. 
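Uni-Perceiver v2 above credits part of its stable multi-task training to Task-Balanced Gradient Normalization. The sketch below shows one plausible way to rescale per-task gradients to a common norm before the optimizer step; the exact normalization rule, scheduling, and unmixed sampling in the paper are not reproduced here, so treat this purely as an assumption-laden illustration.

import torch

def task_balanced_step(model, task_losses, optimizer):
    """Accumulate per-task gradients rescaled to the mean gradient norm, then step.

    task_losses: list of scalar losses, one per task, computed on the shared model.
    """
    params = [p for p in model.parameters() if p.requires_grad]
    per_task_grads = []
    for loss in task_losses:
        grads = torch.autograd.grad(loss, params, retain_graph=True, allow_unused=True)
        grads = [g if g is not None else torch.zeros_like(p) for g, p in zip(grads, params)]
        per_task_grads.append(grads)
    # Per-task global gradient norms and their mean as the common target scale.
    norms = [torch.sqrt(sum((g ** 2).sum() for g in grads)) for grads in per_task_grads]
    target = torch.stack(norms).mean()
    optimizer.zero_grad()
    for p, *task_g in zip(params, *per_task_grads):
        p.grad = sum(g * (target / (n + 1e-12)) for g, n in zip(task_g, norms))
    optimizer.step()

if __name__ == "__main__":
    model = torch.nn.Linear(16, 4)
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    x = torch.randn(8, 16)
    task_balanced_step(model, [model(x).pow(2).mean(), model(x).abs().mean()], opt)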
+ + + + Decompose More and Aggregate Better: Two Closer Looks at Frequency Representation Learning for Human Motion Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_Decompose_More_and_Aggregate_Better_Two_Closer_Looks_at_Frequency_CVPR_2023_paper.pdf + Encouraged by the effectiveness of encoding temporal dynamics within the frequency domain, recent human motion prediction systems prefer to first convert the motion representation from the original pose space into the frequency space. In this paper, we introduce two closer looks at effective frequency representation learning for robust motion prediction and summarize them as: decompose more and aggregate better. Motivated by these two insights, we develop two powerful units that factorize the frequency representation learning task with a novel decomposition-aggregation two-stage strategy: (1) the frequency decomposition unit unweaves multi-view frequency representations from an input body motion by embedding its frequency features into multiple spaces; (2) the feature aggregation unit deploys a series of intra-space and inter-space feature aggregation layers to collect comprehensive frequency representations from these spaces for robust human motion prediction. As evaluated on large-scale datasets, we develop a strong baseline model for the human motion prediction task that outperforms state-of-the-art methods by large margins: 8%-12% on Human3.6M, 3%-7% on CMU MoCap, and 7%-10% on 3DPW. + + + + Affection: Learning Affective Explanations for Real-World Visual Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Achlioptas_Affection_Learning_Affective_Explanations_for_Real-World_Visual_Data_CVPR_2023_paper.pdf + In this work, we explore the space of emotional reactions induced by real-world images. For this, we first introduce a large-scale dataset that contains both categorical emotional reactions and free-form textual explanations for 85,007 publicly available images, analyzed by 6,283 annotators who were asked to indicate and explain how they felt when observing a particular image and why, with a total of 526,749 responses. Although emotional reactions are subjective and sensitive to context (personal mood, social status, past experiences), we show that there is significant common ground to capture emotional responses with a large support in the subject population. In light of this observation, we ask the following questions: i) Can we develop neural networks that provide plausible affective responses to real-world visual data explained with language? ii) Can we steer such methods towards producing explanations with varying degrees of pragmatic language, justifying different emotional reactions by grounding them in the visual stimulus? Finally, iii) How do we evaluate the performance of such methods for this novel task? In this work, we take the first steps in addressing all of these questions, paving the way for more human-centric and emotionally-aware image analysis systems. Our code and data are publicly available at https://affective-explanations.org. + + + + PLA: Language-Driven Open-Vocabulary 3D Scene Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_PLA_Language-Driven_Open-Vocabulary_3D_Scene_Understanding_CVPR_2023_paper.pdf + Open-vocabulary scene understanding aims to localize and recognize unseen categories beyond the annotated label space.
The recent breakthrough of 2D open-vocabulary perception is largely driven by Internet-scale paired image-text data with rich vocabulary concepts. However, this success cannot be directly transferred to 3D scenarios due to the inaccessibility of large-scale 3D-text pairs. To this end, we propose to distill knowledge encoded in pre-trained vision-language (VL) foundation models through captioning multi-view images from 3D, which allows explicitly associating 3D and semantic-rich captions. Further, to foster coarse-to-fine visual-semantic representation learning from captions, we design hierarchical 3D-caption pairs, leveraging geometric constraints between 3D scenes and multi-view images. Finally, by employing contrastive learning, the model learns language-aware embeddings that connect 3D and text for open-vocabulary tasks. Our method not only remarkably outperforms baseline methods by 25.8%-44.7% hIoU and 14.5%-50.4% hAP_50 in open-vocabulary semantic and instance segmentation, but also shows robust transferability on challenging zero-shot domain transfer tasks. See the project website at https://dingry.github.io/projects/PLA. + + + + InstMove: Instance Motion for Object-Centric Video Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_InstMove_Instance_Motion_for_Object-Centric_Video_Segmentation_CVPR_2023_paper.pdf + Despite significant efforts, cutting-edge video segmentation methods remain sensitive to occlusion and rapid movement, due to their reliance on the appearance of objects in the form of object embeddings, which are vulnerable to these disturbances. A common solution is to use optical flow to provide motion information, but essentially it only considers pixel-level motion, which still relies on appearance similarity and hence is often inaccurate under occlusion and fast movement. In this work, we study instance-level motion and present InstMove, which stands for Instance Motion for Object-centric Video Segmentation. In comparison to pixel-wise motion, InstMove mainly relies on instance-level motion information that is free from image feature embeddings, and features physical interpretations, making it more accurate and robust toward occlusion and fast-moving objects. To better fit in with video segmentation tasks, InstMove uses instance masks to model the physical presence of an object and learns the dynamic model through a memory network to predict its position and shape in the next frame. With only a few lines of code, InstMove can be integrated into current SOTA methods for three different video segmentation tasks and boost their performance. Specifically, we improve over the prior art by 1.5 AP on the OVIS dataset, which features heavy occlusions, and 4.9 AP on the YouTubeVIS-Long dataset, which mainly contains fast-moving objects. These results suggest that instance-level motion is robust and accurate, and hence serves as a powerful solution in complex scenarios for object-centric video segmentation. + + + + Towards Effective Adversarial Textured 3D Meshes on Physical Face Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Towards_Effective_Adversarial_Textured_3D_Meshes_on_Physical_Face_Recognition_CVPR_2023_paper.pdf + Face recognition is a prevailing authentication solution in numerous biometric applications. Physical adversarial attacks, as an important surrogate, can identify the weaknesses of face recognition systems and evaluate their robustness before they are deployed.
However, most existing physical attacks are either readily detectable or ineffective against commercial recognition systems. The goal of this work is to develop a more reliable technique that can carry out an end-to-end evaluation of adversarial robustness for commercial systems. It requires that this technique can simultaneously deceive black-box recognition models and evade defensive mechanisms. To fulfill this, we design adversarial textured 3D meshes (AT3D) with an elaborate topology on a human face, which can be 3D-printed and pasted on the attacker's face to evade the defenses. However, the mesh-based optimization regime calculates gradients in high-dimensional mesh space, and can be trapped in local optima with unsatisfactory transferability. To deviate from the mesh-based space, we propose to perturb the low-dimensional coefficient space based on the 3D Morphable Model, which significantly improves black-box transferability while enjoying faster search efficiency and better visual quality. Extensive experiments in digital and physical scenarios show that our method effectively explores the security vulnerabilities of multiple popular commercial services, including three recognition APIs, four anti-spoofing APIs, two prevailing mobile phones and two automated access control systems. + + + + Effective Ambiguity Attack Against Passport-Based DNN Intellectual Property Protection Schemes Through Fully Connected Layer Substitution + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Effective_Ambiguity_Attack_Against_Passport-Based_DNN_Intellectual_Property_Protection_Schemes_CVPR_2023_paper.pdf + Since training a deep neural network (DNN) is costly, well-trained deep models can be regarded as valuable intellectual property (IP) assets. The IP protection associated with deep models has been receiving increasing attention in recent years. The passport-based method, which replaces normalization layers with passport layers, has been one of the few protection solutions that are claimed to be secure against advanced attacks. In this work, we tackle the issue of evaluating the security of passport-based IP protection methods. We propose a novel and effective ambiguity attack against the passport-based method, capable of successfully forging multiple valid passports with a small training dataset. This is accomplished by inserting a specially designed accessory block ahead of the passport parameters. Using less than 10% of the training data, the model with the forged passport exhibits an almost indistinguishable performance difference (less than 2%) compared with that of the authorized passport. In addition, it is shown that our attack strategy can be readily generalized to attack other IP protection methods based on watermark embedding. Directions for potential remedy solutions are also given. + + + + TempSAL - Uncovering Temporal Information for Deep Saliency Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Aydemir_TempSAL_-_Uncovering_Temporal_Information_for_Deep_Saliency_Prediction_CVPR_2023_paper.pdf + Deep saliency prediction algorithms complement object recognition features; they typically rely on additional information such as scene context, semantic relationships, gaze direction, and object dissimilarity. However, none of these models consider the temporal nature of gaze shifts during image observation. We introduce a novel saliency prediction model that learns to output saliency maps in sequential time intervals by exploiting human temporal attention patterns.
Our approach locally modulates the saliency predictions by combining the learned temporal maps. Our experiments show that our method outperforms the state-of-the-art models, including a multi-duration saliency model, on the SALICON benchmark and CodeCharts1k dataset. Our code is publicly available on GitHub. + + + + Megahertz Light Steering Without Moving Parts + http://openaccess.thecvf.com//content/CVPR2023/papers/Pediredla_Megahertz_Light_Steering_Without_Moving_Parts_CVPR_2023_paper.pdf + We introduce a light steering technology that operates at megahertz frequencies, has no moving parts, and costs less than a hundred dollars. Our technology can benefit many projector and imaging systems that critically rely on high-speed, reliable, low-cost, and wavelength-independent light steering, including laser scanning projectors, LiDAR sensors, and fluorescence microscopes. Our technology uses ultrasound waves to generate a spatiotemporally-varying refractive index field inside a compressible medium, such as water, turning the medium into a dynamic traveling lens. By controlling the electrical input of the ultrasound transducers that generate the waves, we can change the lens, and thus steer light, at the speed of sound (1.5 km/s in water). We build a physical prototype of this technology, use it to realize different scanning techniques at megahertz rates (three orders of magnitude faster than commercial alternatives such as galvo mirror scanners), and demonstrate proof-of-concept projector and LiDAR applications. To encourage further innovation towards this new technology, we derive the theory for its fundamental limits and develop a physically-accurate simulator for virtual design. Our technology offers a promising solution for achieving high-speed and low-cost light steering in a variety of applications. + + + + Iterative Proposal Refinement for Weakly-Supervised Video Grounding + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Iterative_Proposal_Refinement_for_Weakly-Supervised_Video_Grounding_CVPR_2023_paper.pdf + Weakly-Supervised Video Grounding (WSVG) aims to localize events of interest in untrimmed videos with only video-level annotations. To date, most of the state-of-the-art WSVG methods follow a two-stage pipeline, i.e., firstly generating potential temporal proposals and then grounding with these proposal candidates. Despite the recent progress, existing proposal generation methods suffer from two drawbacks: 1) lack of explicit correspondence modeling; and 2) partial coverage of complex events. To this end, we propose a novel IteRative prOposal refiNement network (dubbed as IRON) to gradually distill the prior knowledge into each proposal and encourage proposals with more complete coverage. Specifically, we set up two lightweight distillation branches to uncover the cross-modal correspondence on both the semantic and conceptual levels. Then, an iterative Label Propagation (LP) strategy is devised to prevent the network from focusing excessively on the most discriminative events instead of the whole sentence content. Precisely, during each iteration, the proposal with the minimal distillation loss and its adjacent ones are regarded as the positive samples, which refines proposal confidence scores in a cascaded manner. Extensive experiments and ablation studies on two challenging WSVG datasets have attested to the effectiveness of our IRON. 
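TempSAL, listed a few entries above, predicts saliency maps for successive time intervals and then locally combines them into an image-level map. The snippet below is one guess at what such a learned, per-pixel combination could look like (softmax-weighted fusion); the actual modulation used by the paper may differ.

import torch

def combine_temporal_saliency(temporal_maps, mixing_logits):
    """Fuse per-interval saliency maps into a single image-level saliency map.

    temporal_maps: (T, H, W) saliency predicted for T successive viewing intervals.
    mixing_logits: (T, H, W) learned per-pixel weights; a softmax over the time axis
    lets each location decide which intervals matter most.
    """
    weights = torch.softmax(mixing_logits, dim=0)
    fused = (weights * temporal_maps).sum(dim=0)                  # (H, W)
    # Rescale to [0, 1], as saliency benchmarks usually expect.
    return (fused - fused.min()) / (fused.max() - fused.min() + 1e-8)

if __name__ == "__main__":
    print(combine_temporal_saliency(torch.rand(5, 60, 80), torch.randn(5, 60, 80)).shape)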
+ + + + SCConv: Spatial and Channel Reconstruction Convolution for Feature Redundancy + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_SCConv_Spatial_and_Channel_Reconstruction_Convolution_for_Feature_Redundancy_CVPR_2023_paper.pdf + Convolutional Neural Networks (CNNs) have achieved remarkable performance in various computer vision tasks but this comes at the cost of tremendous computational resources, partly due to convolutional layers extracting redundant features. Recent works either compress well-trained large-scale models or explore well-designed lightweight models. In this paper, we make an attempt to exploit spatial and channel redundancy among features for CNN compression and propose an efficient convolution module, called SCConv (Spatial and Channel reconstruction Convolution), to decrease redundant computing and facilitate representative feature learning. The proposed SCConv consists of two units: spatial reconstruction unit (SRU) and channel reconstruction unit (CRU). SRU utilizes a separate-and-reconstruct method to suppress the spatial redundancy while CRU uses a split-transform-and-fuse strategy to diminish the channel redundancy. In addition, SCConv is a plug-and-play architectural unit that can be used to replace standard convolution in various convolutional neural networks directly. Experimental results show that SCConv-embedded models are able to achieve better performance by reducing redundant features with significantly lower complexity and computational costs. + + + + Positive-Augmented Contrastive Learning for Image and Video Captioning Evaluation + http://openaccess.thecvf.com//content/CVPR2023/papers/Sarto_Positive-Augmented_Contrastive_Learning_for_Image_and_Video_Captioning_Evaluation_CVPR_2023_paper.pdf + The CLIP model has been recently proven to be very effective for a variety of cross-modal tasks, including the evaluation of captions generated from vision-and-language architectures. In this paper, we propose a new recipe for a contrastive-based evaluation metric for image captioning, namely Positive-Augmented Contrastive learning Score (PAC-S), that in a novel way unifies the learning of a contrastive visual-semantic space with the addition of generated images and text on curated data. Experiments spanning several datasets demonstrate that our new metric achieves the highest correlation with human judgments on both images and videos, outperforming existing reference-based metrics like CIDEr and SPICE and reference-free metrics like CLIP-Score. Finally, we test the system-level correlation of the proposed metric when considering popular image captioning approaches, and assess the impact of employing different cross-modal features. Our source code and trained models are publicly available at: https://github.com/aimagelab/pacscore. + + + + 3D Cinemagraphy From a Single Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_3D_Cinemagraphy_From_a_Single_Image_CVPR_2023_paper.pdf + We present 3D Cinemagraphy, a new technique that marries 2D image animation with 3D photography. Given a single still image as input, our goal is to generate a video that contains both visual content animation and camera motion. We empirically find that naively combining existing 2D image animation and 3D photography methods leads to obvious artifacts or inconsistent animation. Our key insight is that representing and animating the scene in 3D space offers a natural solution to this task. 
To this end, we first convert the input image into feature-based layered depth images using predicted depth values, followed by unprojecting them to a feature point cloud. To animate the scene, we perform motion estimation and lift the 2D motion into the 3D scene flow. Finally, to resolve the problem of hole emergence as points move forward, we propose to bidirectionally displace the point cloud as per the scene flow and synthesize novel views by separately projecting them into target image planes and blending the results. Extensive experiments demonstrate the effectiveness of our method. A user study is also conducted to validate the compelling rendering results of our method. + + + + AttriCLIP: A Non-Incremental Learner for Incremental Knowledge Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_AttriCLIP_A_Non-Incremental_Learner_for_Incremental_Knowledge_Learning_CVPR_2023_paper.pdf + Continual learning aims to enable a model to incrementally learn knowledge from sequentially arrived data. Previous works adopt the conventional classification architecture, which consists of a feature extractor and a classifier. The feature extractor is shared across sequentially arrived tasks or classes, but one specific group of weights of the classifier corresponding to one new class should be incrementally expanded. Consequently, the parameters of a continual learner gradually increase. Moreover, as the classifier contains all historical arrived classes, a certain size of the memory is usually required to store rehearsal data to mitigate classifier bias and catastrophic forgetting. In this paper, we propose a non-incremental learner, named AttriCLIP, to incrementally extract knowledge of new classes or tasks. Specifically, AttriCLIP is built upon the pre-trained visual-language model CLIP. Its image encoder and text encoder are fixed to extract features from both images and text prompts. Each text prompt consists of a category name and a fixed number of learnable parameters which are selected from our designed attribute bank and serve as attributes. As we compute the visual and textual similarity for classification, AttriCLIP is a non-incremental learner. The attribute prompts, which encode the common knowledge useful for classification, can effectively mitigate the catastrophic forgetting and avoid constructing a replay memory. We empirically evaluate our AttriCLIP and compare it with CLIP-based and previous state-of-the-art continual learning methods in realistic settings with domain-shift and long-sequence learning. The results show that our method performs favorably against previous state-of-the-arts. + + + + StyleRes: Transforming the Residuals for Real Image Editing With StyleGAN + http://openaccess.thecvf.com//content/CVPR2023/papers/Pehlivan_StyleRes_Transforming_the_Residuals_for_Real_Image_Editing_With_StyleGAN_CVPR_2023_paper.pdf + We present a novel image inversion framework and a training pipeline to achieve high-fidelity image inversion with high-quality attribute editing. Inverting real images into StyleGAN's latent space is an extensively studied problem, yet the trade-off between the image reconstruction fidelity and image editing quality remains an open challenge. The low-rate latent spaces are limited in their expressiveness power for high-fidelity reconstruction. On the other hand, high-rate latent spaces result in degradation in editing quality. 
In this work, to achieve high-fidelity inversion, we learn residual features in higher latent codes that lower latent codes were not able to encode. This enables preserving image details in reconstruction. To achieve high-quality editing, we learn how to transform the residual features for adapting to manipulations in latent codes. We train the framework to extract residual features and transform them via a novel architecture pipeline and cycle consistency losses. We run extensive experiments and compare our method with state-of-the-art inversion methods. Qualitative metrics and visual comparisons show significant improvements. + + + + Diffusion Video Autoencoders: Toward Temporally Consistent Face Video Editing via Disentangled Video Encoding + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Diffusion_Video_Autoencoders_Toward_Temporally_Consistent_Face_Video_Editing_via_CVPR_2023_paper.pdf + Inspired by the impressive performance of recent face image editing methods, several studies have been naturally proposed to extend these methods to the face video editing task. One of the main challenges here is temporal consistency among edited frames, which is still unresolved. To this end, we propose a novel face video editing framework based on diffusion autoencoders that can successfully extract the decomposed features - for the first time as a face video editing model - of identity and motion from a given video. This modeling allows us to edit the video by simply manipulating the temporally invariant feature to the desired direction for the consistency. Another unique strength of our model is that, since our model is based on diffusion models, it can satisfy both reconstruction and edit capabilities at the same time, and is robust to corner cases in wild face videos (e.g. occluded faces) unlike the existing GAN-based methods. + + + + SIM: Semantic-Aware Instance Mask Generation for Box-Supervised Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_SIM_Semantic-Aware_Instance_Mask_Generation_for_Box-Supervised_Instance_Segmentation_CVPR_2023_paper.pdf + Weakly supervised instance segmentation using only bounding box annotations has recently attracted much research attention. Most of the current efforts leverage low-level image features as extra supervision without explicitly exploiting the high-level semantic information of the objects, which will become ineffective when the foreground objects have similar appearances to the background or other objects nearby. We propose a new box-supervised instance segmentation approach by developing a Semantic-aware Instance Mask (SIM) generation paradigm. Instead of heavily relying on local pair-wise affinities among neighboring pixels, we construct a group of category-wise feature centroids as prototypes to identify foreground objects and assign them semantic-level pseudo labels. Considering that the semantic-aware prototypes cannot distinguish different instances of the same semantics, we propose a self-correction mechanism to rectify the falsely activated regions while enhancing the correct ones. Furthermore, to handle the occlusions between objects, we tailor the Copy-Paste operation for the weakly-supervised instance segmentation task to augment challenging training data. Extensive experimental results demonstrate the superiority of our proposed SIM approach over other state-of-the-art methods. The source code: https://github.com/lslrh/SIM. 
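The SIM entry above assigns semantic-level pseudo labels by comparing pixel features against category-wise prototypes inside the annotated box. A stripped-down sketch of that assignment step is given below; the prototypes, similarity threshold, and box mask are placeholders, and the paper's self-correction mechanism is omitted.

import torch
import torch.nn.functional as F

def assign_semantic_pseudo_labels(pixel_feats, prototypes, box_mask, sim_thresh=0.5, ignore_index=255):
    """Assign each in-box pixel the class of its most similar prototype, if confident enough.

    pixel_feats: (D, H, W) per-pixel features from the backbone.
    prototypes:  (K, D) one feature centroid per category.
    box_mask:    (H, W) boolean mask of the annotated bounding box.
    """
    d, h, w = pixel_feats.shape
    feats = F.normalize(pixel_feats.reshape(d, -1), dim=0)   # (D, H*W), unit-norm pixel features
    protos = F.normalize(prototypes, dim=1)                  # (K, D), unit-norm prototypes
    sim = protos @ feats                                     # (K, H*W) cosine similarities
    best_sim, best_cls = sim.max(dim=0)
    labels = torch.full((h * w,), ignore_index, dtype=torch.long)
    confident = best_sim > sim_thresh
    labels[confident] = best_cls[confident]
    labels = labels.reshape(h, w)
    labels[~box_mask] = ignore_index                         # only trust pixels inside the box
    return labels

if __name__ == "__main__":
    feats, protos = torch.randn(32, 40, 40), torch.randn(5, 32)
    box = torch.zeros(40, 40, dtype=torch.bool)
    box[10:30, 8:35] = True
    print(assign_semantic_pseudo_labels(feats, protos, box).unique())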
+ + + + Compression-Aware Video Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Compression-Aware_Video_Super-Resolution_CVPR_2023_paper.pdf + Videos stored on mobile devices or delivered on the Internet are usually in compressed format and are of various unknown compression parameters, but most video super-resolution (VSR) methods often assume ideal inputs resulting in large performance gap between experimental settings and real-world applications. In spite of a few pioneering works being proposed recently to super-resolve the compressed videos, they are not specially designed to deal with videos of various levels of compression. In this paper, we propose a novel and practical compression-aware video super-resolution model, which could adapt its video enhancement process to the estimated compression level. A compression encoder is designed to model compression levels of input frames, and a base VSR model is then conditioned on the implicitly computed representation by inserting compression-aware modules. In addition, we propose to further strengthen the VSR model by taking full advantage of meta data that is embedded naturally in compressed video streams in the procedure of information fusion. Extensive experiments are conducted to demonstrate the effectiveness and efficiency of the proposed method on compressed VSR benchmarks. + + + + Incremental 3D Semantic Scene Graph Prediction From RGB Sequences + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Incremental_3D_Semantic_Scene_Graph_Prediction_From_RGB_Sequences_CVPR_2023_paper.pdf + 3D semantic scene graphs are a powerful holistic representation as they describe the individual objects and depict the relation between them. They are compact high-level graphs that enable many tasks requiring scene reasoning. In real-world settings, existing 3D estimation methods produce robust predictions that mostly rely on dense inputs. In this work, we propose a real-time framework that incrementally builds a consistent 3D semantic scene graph of a scene given an RGB image sequence. Our method consists of a novel incremental entity estimation pipeline and a scene graph prediction network. The proposed pipeline simultaneously reconstructs a sparse point map and fuses entity estimation from the input images. The proposed network estimates 3D semantic scene graphs with iterative message passing using multi-view and geometric features extracted from the scene entities. Extensive experiments on the 3RScan dataset show the effectiveness of the proposed method in this challenging task, outperforming state-of-the-art approaches. + + + + VLPD: Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_VLPD_Context-Aware_Pedestrian_Detection_via_Vision-Language_Semantic_Self-Supervision_CVPR_2023_paper.pdf + Detecting pedestrians accurately in urban scenes is significant for realistic applications like autonomous driving or video surveillance. However, confusing human-like objects often lead to wrong detections, and small scale or heavily occluded pedestrians are easily missed due to their unusual appearances. To address these challenges, only object regions are inadequate, thus how to fully utilize more explicit and semantic contexts becomes a key problem. Meanwhile, previous context-aware pedestrian detectors either only learn latent contexts with visual clues, or need laborious annotations to obtain explicit and semantic contexts. 
Therefore, we propose in this paper a novel approach via Vision-Language semantic self-supervision for context-aware Pedestrian Detection (VLPD) to model explicitly semantic contexts without any extra annotations. Firstly, we propose a self-supervised Vision-Language Semantic (VLS) segmentation method, which learns both fully-supervised pedestrian detection and contextual segmentation via self-generated explicit labels of semantic classes by vision-language models. Furthermore, a self-supervised Prototypical Semantic Contrastive (PSC) learning method is proposed to better discriminate pedestrians and other classes, based on more explicit and semantic contexts obtained from VLS. Extensive experiments on popular benchmarks show that our proposed VLPD achieves superior performances over the previous state-of-the-arts, particularly under challenging circumstances like small scale and heavy occlusion. Code is available at https://github.com/lmy98129/VLPD. + + + + TexPose: Neural Texture Learning for Self-Supervised 6D Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_TexPose_Neural_Texture_Learning_for_Self-Supervised_6D_Object_Pose_Estimation_CVPR_2023_paper.pdf + In this paper, we introduce neural texture learning for 6D object pose estimation from synthetic data and a few unlabelled real images. Our major contribution is a novel learning scheme which removes the drawbacks of previous works, namely the strong dependency on co-modalities or additional refinement. These have been previously necessary to provide training signals for convergence. We formulate such a scheme as two sub-optimisation problems on texture learning and pose learning. We separately learn to predict realistic texture of objects from real image collections and learn pose estimation from pixel-perfect synthetic data. Combining these two capabilities allows then to synthesise photorealistic novel views to supervise the pose estimator with accurate geometry. To alleviate pose noise and segmentation imperfection present during the texture learning phase, we propose a surfel-based adversarial training loss together with texture regularisation from synthetic data. We demonstrate that the proposed approach significantly outperforms the recent state-of-the-art methods without ground-truth pose annotations and demonstrates substantial generalisation improvements towards unseen scenes. Remarkably, our scheme improves the adopted pose estimators substantially even when initialised with much inferior performance. + + + + DynIBaR: Neural Dynamic Image-Based Rendering + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_DynIBaR_Neural_Dynamic_Image-Based_Rendering_CVPR_2023_paper.pdf + We address the problem of synthesizing novel views from a monocular video depicting a complex dynamic scene. State-of-the-art methods based on temporally varying Neural Radiance Fields (aka dynamic NeRFs) have shown impressive results on this task. However, for long videos with complex object motions and uncontrolled camera trajectories,these methods can produce blurry or inaccurate renderings, hampering their use in real-world applications. 
Instead of encoding the entire dynamic scene within the weights of MLPs, we present a new approach that addresses these limitations by adopting a volumetric image-based rendering framework that synthesizes new viewpoints by aggregating features from nearby views in a scene motion-aware manner. Our system retains the advantages of prior methods in its ability to model complex scenes and view-dependent effects, but also enables synthesizing photo-realistic novel views from long videos featuring complex scene dynamics with unconstrained camera trajectories. We demonstrate significant improvements over state-of-the-art methods on dynamic scene datasets, and also apply our approach to in-the-wild videos with challenging camera and object motion, where prior methods fail to produce high-quality renderings. + + + + Unsupervised Object Localization: Observing the Background To Discover Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Simeoni_Unsupervised_Object_Localization_Observing_the_Background_To_Discover_Objects_CVPR_2023_paper.pdf + Recent advances in self-supervised visual representation learning have paved the way for unsupervised methods tackling tasks such as object discovery and instance segmentation. However, discovering objects in an image with no supervision is a very hard task; what are the desired objects, when to separate them into parts, how many are there, and of what classes? The answers to these questions depend on the tasks and datasets of evaluation. In this work, we take a different approach and propose to look for the background instead. This way, the salient objects emerge as a by-product without any strong assumption on what an object should be. We propose FOUND, a simple model made of a single conv 1x1 initialized with coarse background masks extracted from self-supervised patch-based representations. After fast training and refining these seed masks, the model reaches state-of-the-art results on unsupervised saliency detection and object discovery benchmarks. Moreover, we show that our approach yields good results in the unsupervised semantic segmentation retrieval task. The code to reproduce our results is available at https://github.com/valeoai/FOUND. + + + + BEV-LaneDet: An Efficient 3D Lane Detection Based on Virtual Camera via Key-Points + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_BEV-LaneDet_An_Efficient_3D_Lane_Detection_Based_on_Virtual_Camera_CVPR_2023_paper.pdf + 3D lane detection, which plays a crucial role in vehicle routing, has recently been a rapidly developing topic in autonomous driving. Previous works struggle with practicality due to their complicated spatial transformations and inflexible representations of 3D lanes. Faced with these issues, our work proposes an efficient and robust monocular 3D lane detection method called BEV-LaneDet with three main contributions. First, we introduce the Virtual Camera that unifies the in/extrinsic parameters of cameras mounted on different vehicles to guarantee the consistency of the spatial relationship among cameras. It can effectively promote the learning procedure due to the unified visual space. We secondly propose a simple but efficient 3D lane representation called Key-Points Representation. This module is more suitable to represent the complicated and diverse 3D lane structures. At last, we present a light-weight and chip-friendly spatial transformation module named Spatial Transformation Pyramid to transform multiscale front-view features into BEV features.
Experimental results demonstrate that our work outperforms the state-of-the-art approaches in terms of F-Score, being 10.6% higher on the OpenLane dataset and 4.0% higher on the Apollo 3D synthetic dataset, with a speed of 185 FPS. Code is released at https://github.com/gigo-team/bev_lane_det. + + + + Self-Supervised 3D Scene Flow Estimation Guided by Superpoints + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_Self-Supervised_3D_Scene_Flow_Estimation_Guided_by_Superpoints_CVPR_2023_paper.pdf + 3D scene flow estimation aims to estimate point-wise motions between two consecutive frames of point clouds. Superpoints, i.e., points with similar geometric features, are usually employed to capture similar motions of local regions in 3D scenes for scene flow estimation. However, in existing methods, superpoints are generated with the offline clustering methods, which cannot characterize local regions with similar motions for complex 3D scenes well, leading to inaccurate scene flow estimation. To this end, we propose an iterative end-to-end superpoint based scene flow estimation framework, where the superpoints can be dynamically updated to guide the point-level flow prediction. Specifically, our framework consists of a flow guided superpoint generation module and a superpoint guided flow refinement module. In our superpoint generation module, we utilize the bidirectional flow information at the previous iteration to obtain the matching points of points and superpoint centers for soft point-to-superpoint association construction, in which the superpoints are generated for pairwise point clouds. With the generated superpoints, we first reconstruct the flow for each point by adaptively aggregating the superpoint-level flow, and then encode the consistency between the reconstructed flow of pairwise point clouds. Finally, we feed the consistency encoding along with the reconstructed flow into GRU to refine point-level flow. Extensive experiments on several different datasets show that our method can achieve promising performance. + + + + A Unified Pyramid Recurrent Network for Video Frame Interpolation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_A_Unified_Pyramid_Recurrent_Network_for_Video_Frame_Interpolation_CVPR_2023_paper.pdf + Flow-guided synthesis provides a common framework for frame interpolation, where optical flow is estimated to guide the synthesis of intermediate frames between consecutive inputs. In this paper, we present UPR-Net, a novel Unified Pyramid Recurrent Network for frame interpolation. Cast in a flexible pyramid framework, UPR-Net exploits lightweight recurrent modules for both bi-directional flow estimation and intermediate frame synthesis. At each pyramid level, it leverages estimated bi-directional flow to generate forward-warped representations for frame synthesis; across pyramid levels, it enables iterative refinement for both optical flow and intermediate frame. In particular, we show that our iterative synthesis strategy can significantly improve the robustness of frame interpolation on large motion cases. Despite being extremely lightweight (1.7M parameters), our base version of UPR-Net achieves excellent performance on a large range of benchmarks. Code and trained models of our UPR-Net series are available at: https://github.com/srcn-ivl/UPR-Net. 
+ + + + DiffusioNeRF: Regularizing Neural Radiance Fields With Denoising Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Wynn_DiffusioNeRF_Regularizing_Neural_Radiance_Fields_With_Denoising_Diffusion_Models_CVPR_2023_paper.pdf + Under good conditions, Neural Radiance Fields (NeRFs) have shown impressive results on novel view synthesis tasks. NeRFs learn a scene's color and density fields by minimizing the photometric discrepancy between training views and differentiable renderings of the scene. Once trained from a sufficient set of views, NeRFs can generate novel views from arbitrary camera positions. However, the scene geometry and color fields are severely under-constrained, which can lead to artifacts, especially when trained with few input views. To alleviate this problem we learn a prior over scene geometry and color, using a denoising diffusion model (DDM). Our DDM is trained on RGBD patches of the synthetic Hypersim dataset and can be used to predict the gradient of the logarithm of a joint probability distribution of color and depth patches. We show that, these gradients of logarithms of RGBD patch priors serve to regularize geometry and color of a scene. During NeRF training, random RGBD patches are rendered and the estimated gradient of the log-likelihood is backpropagated to the color and density fields. Evaluations on LLFF, the most relevant dataset, show that our learned prior achieves improved quality in the reconstructed geometry and improved generalization to novel views. Evaluations on DTU show improved reconstruction quality among NeRF methods. + + + + Edge-Aware Regional Message Passing Controller for Image Forgery Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Edge-Aware_Regional_Message_Passing_Controller_for_Image_Forgery_Localization_CVPR_2023_paper.pdf + Digital image authenticity has promoted research on image forgery localization. Although deep learning-based methods achieve remarkable progress, most of them usually suffer from severe feature coupling between the forged and authentic regions. In this work, we propose a two-step Edge-aware Regional Message Passing Controlling strategy to address the above issue. Specifically, the first step is to account for fully exploiting the edge information. It consists of two core designs: context-enhanced graph construction and threshold-adaptive differentiable binarization edge algorithm. The former assembles the global semantic information to distinguish the features between the forged and authentic regions, while the latter stands on the output of the former to provide the learnable edges. In the second step, guided by the learnable edges, a region message passing controller is devised to weaken the message passing between the forged and authentic regions. In this way, our ERMPC is capable of explicitly modeling the inconsistency between the forged and authentic regions and enabling it to perform well on refined forged images. Extensive experiments on several challenging benchmarks show that our method is superior to state-of-the-art image forgery localization methods qualitatively and quantitatively. + + + + Spatiotemporal Self-Supervised Learning for Point Clouds in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Spatiotemporal_Self-Supervised_Learning_for_Point_Clouds_in_the_Wild_CVPR_2023_paper.pdf + Self-supervised learning (SSL) has the potential to benefit many applications, particularly those where manually annotating data is cumbersome. 
One such situation is the semantic segmentation of point clouds. In this context, existing methods employ contrastive learning strategies and define positive pairs by performing various augmentation of point clusters in a single frame. As such, these methods do not exploit the temporal nature of LiDAR data. In this paper, we introduce an SSL strategy that leverages positive pairs in both the spatial and temporal domains. To this end, we design (i) a point-to-cluster learning strategy that aggregates spatial information to distinguish objects; and (ii) a cluster-to-cluster learning strategy based on unsupervised object tracking that exploits temporal correspondences. We demonstrate the benefits of our approach via extensive experiments performed by self-supervised training on two large-scale LiDAR datasets and transferring the resulting models to other point cloud segmentation benchmarks. Our results evidence that our method outperforms the state-of-the-art point cloud SSL methods. + + + + Semi-Supervised Learning Made Simple With Self-Supervised Clustering + http://openaccess.thecvf.com//content/CVPR2023/papers/Fini_Semi-Supervised_Learning_Made_Simple_With_Self-Supervised_Clustering_CVPR_2023_paper.pdf + Self-supervised learning models have been shown to learn rich visual representations without requiring human annotations. However, in many real-world scenarios, labels are partially available, motivating a recent line of work on semi-supervised methods inspired by self-supervised principles. In this paper, we propose a conceptually simple yet empirically powerful approach to turn clustering-based self-supervised methods such as SwAV or DINO into semi-supervised learners. More precisely, we introduce a multi-task framework merging a supervised objective using ground-truth labels and a self-supervised objective relying on clustering assignments with a single cross-entropy loss. This approach may be interpreted as imposing the cluster centroids to be class prototypes. Despite its simplicity, we provide empirical evidence that our approach is highly effective and achieves state-of-the-art performance on CIFAR100 and ImageNet. + + + + Frequency-Modulated Point Cloud Rendering With Easy Editing + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Frequency-Modulated_Point_Cloud_Rendering_With_Easy_Editing_CVPR_2023_paper.pdf + We develop an effective point cloud rendering pipeline for novel view synthesis, which enables high fidelity local detail reconstruction, real-time rendering and user-friendly editing. In the heart of our pipeline is an adaptive frequency modulation module called Adaptive Frequency Net (AFNet), which utilizes a hypernetwork to learn the local texture frequency encoding that is consecutively injected into adaptive frequency activation layers to modulate the implicit radiance signal. This mechanism improves the frequency expressive ability of the network with richer frequency basis support, only at a small computational budget. To further boost performance, a preprocessing module is also proposed for point cloud geometry optimization via point opacity estimation. In contrast to implicit rendering, our pipeline supports high-fidelity interactive editing based on point cloud manipulation. Extensive experimental results on NeRF-Synthetic, ScanNet, DTU and Tanks and Temples datasets demonstrate the superior performances achieved by our method in terms of PSNR, SSIM and LPIPS, in comparison to the state-of-the-art. 
+ + + + Few-Shot Referring Relationships in Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Kumar_Few-Shot_Referring_Relationships_in_Videos_CVPR_2023_paper.pdf + Interpreting visual relationships is a core aspect of comprehensive video understanding. Given a query visual relationship as <subject, predicate, object> and a test video, our objective is to localize the subject and object that are connected via the predicate. Given modern visio-lingual understanding capabilities, solving this problem is achievable, provided that there are large-scale annotated training examples available. However, annotating for every combination of subject, object, and predicate is cumbersome, expensive, and possibly infeasible. Therefore, there is a need for models that can learn to spatially and temporally localize subjects and objects that are connected via an unseen predicate using only a few support set videos sharing the common predicate. We address this challenging problem, referred to as few-shot referring relationships in videos for the first time. To this end, we pose the problem as a minimization of an objective function defined over a T-partite random field. Here, the vertices of the random field correspond to candidate bounding boxes for the subject and object, and T represents the number of frames in the test video. This objective function is composed of frame level and visual relationship similarity potentials. To learn these potentials, we use a relation network that takes query-conditioned translational relationship embedding as inputs and is meta-trained using support set videos in an episodic manner. Further, the objective function is minimized using a belief propagation-based message passing on the random field to obtain the spatiotemporal localization or subject and object trajectories. We perform extensive experiments using two public benchmarks, namely ImageNet-VidVRD and VidOR, and compare the proposed approach with competitive baselines to assess its efficacy. + + + + 3D Human Pose Estimation via Intuitive Physics + http://openaccess.thecvf.com//content/CVPR2023/papers/Tripathi_3D_Human_Pose_Estimation_via_Intuitive_Physics_CVPR_2023_paper.pdf + Estimating 3D humans from images often produces implausible bodies that lean, float, or penetrate the floor. Such methods ignore the fact that bodies are typically supported by the scene. A physics engine can be used to enforce physical plausibility, but these are not differentiable, rely on unrealistic proxy bodies, and are difficult to integrate into existing optimization and learning frameworks. In contrast, we exploit novel intuitive-physics (IP) terms that can be inferred from a 3D SMPL body interacting with the scene. Inspired by biomechanics, we infer the pressure heatmap on the body, the Center of Pressure (CoP) from the heatmap, and the SMPL body's Center of Mass (CoM). With these, we develop IPMAN, to estimate a 3D body from a color image in a "stable" configuration by encouraging plausible floor contact and overlapping CoP and CoM. Our IP terms are intuitive, easy to implement, fast to compute, differentiable, and can be integrated into existing optimization and regression methods. We evaluate IPMAN on standard datasets and MoYo, a new dataset with synchronized multi-view images, ground-truth 3D bodies with complex poses, body-floor contact, CoM and pressure. IPMAN produces more plausible results than the state of the art, improving accuracy for static poses, while not hurting dynamic ones. 
Code and data are available for research at https://ipman.is.tue.mpg.de/. + + + + SplineCam: Exact Visualization and Characterization of Deep Network Geometry and Decision Boundaries + http://openaccess.thecvf.com//content/CVPR2023/papers/Humayun_SplineCam_Exact_Visualization_and_Characterization_of_Deep_Network_Geometry_and_CVPR_2023_paper.pdf + Current Deep Network (DN) visualization and interpretability methods rely heavily on data space visualizations such as scoring which dimensions of the data are responsible for their associated prediction or generating new data features or samples that best match a given DN unit or representation. In this paper, we go one step further by developing the first provably exact method for computing the geometry of a DN's mapping -- including its decision boundary -- over a specified region of the data space. By leveraging the theory of Continuous Piecewise Linear (CPWL) spline DNs, SplineCam exactly computes a DN's geometry without resorting to approximations such as sampling or architecture simplification. SplineCam applies to any DN architecture based on CPWL activation nonlinearities, including (leaky) ReLU, absolute value, maxout, and max-pooling and can also be applied to regression DNs such as implicit neural representations. Beyond decision boundary visualization and characterization, SplineCam enables one to compare architectures, measure generalizability, and sample from the decision boundary on or off the data manifold. Project website: https://bit.ly/splinecam + + + + SE-ORNet: Self-Ensembling Orientation-Aware Network for Unsupervised Point Cloud Shape Correspondence + http://openaccess.thecvf.com//content/CVPR2023/papers/Deng_SE-ORNet_Self-Ensembling_Orientation-Aware_Network_for_Unsupervised_Point_Cloud_Shape_Correspondence_CVPR_2023_paper.pdf + Unsupervised point cloud shape correspondence aims to obtain dense point-to-point correspondences between point clouds without manually annotated pairs. However, humans and some animals have bilateral symmetry and various orientations, which leads to severe mispredictions of symmetrical parts. Besides, point cloud noise disrupts consistent representations for point cloud and thus degrades the shape correspondence accuracy. To address the above issues, we propose a Self-Ensembling ORientation-aware Network termed SE-ORNet. The key of our approach is to exploit an orientation estimation module with a domain adaptive discriminator to align the orientations of point cloud pairs, which significantly alleviates the mispredictions of symmetrical parts. Additionally, we design a self-ensembling framework for unsupervised point cloud shape correspondence. In this framework, the disturbances of point cloud noise are overcome by perturbing the inputs of the student and teacher networks with different data augmentations and constraining the consistency of predictions. Extensive experiments on both human and animal datasets show that our SE-ORNet can surpass state-of-the-art unsupervised point cloud shape correspondence methods. + + + + A Bag-of-Prototypes Representation for Dataset-Level Applications + http://openaccess.thecvf.com//content/CVPR2023/papers/Tu_A_Bag-of-Prototypes_Representation_for_Dataset-Level_Applications_CVPR_2023_paper.pdf + This work investigates dataset vectorization for two dataset-level tasks: assessing training set suitability and test set difficulty. 
The former measures how suitable a training set is for a target domain, while the latter studies how challenging a test set is for a learned model. Central to the two tasks is measuring the underlying relationship between datasets. This requires a suitable dataset vectorization scheme, which should preserve as much discriminative dataset information as possible so that the distance between the resulting dataset vectors can reflect dataset-to-dataset similarity. To this end, we propose a bag-of-prototypes (BoP) dataset representation that extends the image-level bag consisting of patch descriptors to a dataset-level bag consisting of semantic prototypes. Specifically, we develop a codebook consisting of K prototypes clustered from a reference dataset. Given a dataset to be encoded, we quantize each of its image features to a certain prototype in the codebook and obtain a K-dimensional histogram feature. Without assuming access to dataset labels, the BoP representation provides a rich characterization of the dataset semantic distribution. Further, the BoP representation cooperates well with the Jensen-Shannon divergence for measuring dataset-to-dataset similarity. Albeit very simple, BoP consistently shows its advantage over existing representations on a series of benchmarks for two dataset-level tasks. + + + + Leverage Interactive Affinity for Affordance Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_Leverage_Interactive_Affinity_for_Affordance_Learning_CVPR_2023_paper.pdf + Perceiving potential "action possibilities" (i.e., affordance) regions of images and learning interactive functionalities of objects from human demonstration is a challenging task due to the diversity of human-object interactions. Prevailing affordance learning algorithms often adopt the label assignment paradigm and presume that there is a unique relationship between functional region and affordance label, yielding poor performance when adapting to unseen environments with large appearance variations. In this paper, we propose to leverage interactive affinity for affordance learning, i.e., extracting interactive affinity from human-object interaction and transferring it to non-interactive objects. Interactive affinity, which represents the contacts between different parts of the human body and local regions of the target object, can provide inherent cues of interconnectivity between humans and objects, thereby reducing the ambiguity of the perceived action possibilities. Specifically, we propose a pose-aided interactive affinity learning framework that exploits human pose to guide the network to learn the interactive affinity from human-object interactions. Particularly, a keypoint heuristic perception (KHP) scheme is devised to exploit the keypoint association of human pose to alleviate the uncertainties due to interaction diversities and contact occlusions. Besides, a contact-driven affordance learning (CAL) dataset is constructed by collecting and labeling over 5,000 images. Experimental results demonstrate that our method outperforms the representative models regarding objective metrics and visual quality. Code and dataset: github.com/lhc1224/PIAL-Net.
+ + + + Deep Semi-Supervised Metric Learning With Mixed Label Propagation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhuang_Deep_Semi-Supervised_Metric_Learning_With_Mixed_Label_Propagation_CVPR_2023_paper.pdf + Metric learning requires the identification of far-apart similar pairs and close dissimilar pairs during training, and this is difficult to achieve with unlabeled data because pairs are typically assumed to be similar if they are close. We present a novel metric learning method which circumvents this issue by identifying hard negative pairs as those which obtain dissimilar labels via label propagation (LP), when the edge linking the pair of data is removed in the affinity matrix. In so doing, the negative pairs can be identified despite their proximity, and we are able to utilize this information to significantly improve LP's ability to identify far-apart positive pairs and close negative pairs. This results in a considerable improvement in semi-supervised metric learning performance as evidenced by recall, precision and Normalized Mutual Information (NMI) performance metrics on Content-based Information Retrieval (CBIR) applications. + + + + OVTrack: Open-Vocabulary Multiple Object Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_OVTrack_Open-Vocabulary_Multiple_Object_Tracking_CVPR_2023_paper.pdf + The ability to recognize, localize and track dynamic objects in a scene is fundamental to many real-world applications, such as self-driving and robotic systems. Yet, traditional multiple object tracking (MOT) benchmarks rely only on a few object categories that hardly represent the multitude of possible objects that are encountered in the real world. This leaves contemporary MOT methods limited to a small set of pre-defined object categories. In this paper, we address this limitation by tackling a novel task, open-vocabulary MOT, that aims to evaluate tracking beyond pre-defined training categories. We further develop OVTrack, an open-vocabulary tracker that is capable of tracking arbitrary object classes. Its design is based on two key ingredients: First, leveraging vision-language models for both classification and association via knowledge distillation; second, a data hallucination strategy for robust appearance feature learning from denoising diffusion probabilistic models. The result is an extremely data-efficient open-vocabulary tracker that sets a new state-of-the-art on the large-scale, large-vocabulary TAO benchmark, while being trained solely on static images. The project page is at https://www.vis.xyz/pub/ovtrack/. + + + + Hyperspherical Embedding for Point Cloud Completion + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Hyperspherical_Embedding_for_Point_Cloud_Completion_CVPR_2023_paper.pdf + Most real-world 3D measurements from depth sensors are incomplete, and to address this issue the point cloud completion task aims to predict the complete shapes of objects from partial observations. Previous works often adapt an encoder-decoder architecture, where the encoder is trained to extract embeddings that are used as inputs to generate predictions from the decoder. However, the learned embeddings have sparse distribution in the feature space, which leads to worse generalization results during testing. To address these problems, this paper proposes a hyperspherical module, which transforms and normalizes embeddings from the encoder to be on a unit hypersphere. 
With the proposed module, the magnitude and direction of the output hyperspherical embedding are decoupled and only the directional information is optimized. We theoretically analyze the hyperspherical embedding and show that it enables more stable training with a wider range of learning rates and more compact embedding distributions. Experiment results show consistent improvement of point cloud completion in both single-task and multi-task learning, which demonstrates the effectiveness of the proposed method. + + + + QuantArt: Quantizing Image Style Transfer Towards High Visual Fidelity + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_QuantArt_Quantizing_Image_Style_Transfer_Towards_High_Visual_Fidelity_CVPR_2023_paper.pdf + The mechanism of existing style transfer algorithms is by minimizing a hybrid loss function to push the generated image toward high similarities in both content and style. However, this type of approach cannot guarantee visual fidelity, i.e., the generated artworks should be indistinguishable from real ones. In this paper, we devise a new style transfer framework called QuantArt for high visual-fidelity stylization. QuantArt pushes the latent representation of the generated artwork toward the centroids of the real artwork distribution with vector quantization. By fusing the quantized and continuous latent representations, QuantArt allows flexible control over the generated artworks in terms of content preservation, style similarity, and visual fidelity. Experiments on various style transfer settings show that our QuantArt framework achieves significantly higher visual fidelity compared with the existing style transfer methods. + + + + SlowLiDAR: Increasing the Latency of LiDAR-Based Detection Using Adversarial Examples + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_SlowLiDAR_Increasing_the_Latency_of_LiDAR-Based_Detection_Using_Adversarial_Examples_CVPR_2023_paper.pdf + LiDAR-based perception is a central component of autonomous driving, playing a key role in tasks such as vehicle localization and obstacle detection. Since the safety of LiDAR-based perceptual pipelines is critical to safe autonomous driving, a number of past efforts have investigated its vulnerability under adversarial perturbations of raw point cloud inputs. However, most such efforts have focused on investigating the impact of such perturbations on predictions (integrity), and little has been done to understand the impact on latency (availability), a critical concern for real-time cyber-physical systems. We present the first systematic investigation of the availability of LiDAR detection pipelines, and SlowLiDAR, an adversarial perturbation attack that maximizes LiDAR detection runtime. The attack overcomes the technical challenges posed by the non-differentiable parts of the LiDAR detection pipelines by using differentiable proxies and uses a novel loss function that effectively captures the impact of adversarial perturbations on the execution time of the pipeline. Extensive experimental results show that SlowLiDAR can significantly increase the latency of the six most popular LiDAR detection pipelines while maintaining imperceptibility. 
+ + + + + + CutMIB: Boosting Light Field Super-Resolution via Multi-View Image Blending + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiao_CutMIB_Boosting_Light_Field_Super-Resolution_via_Multi-View_Image_Blending_CVPR_2023_paper.pdf + Data augmentation (DA) is an efficient strategy for improving the performance of deep neural networks. Recent DA strategies have demonstrated utility in single image super-resolution (SR). Little research has, however, focused on the DA strategy for light field SR, in which multi-view information utilization is required. For the first time in light field SR, we propose a potent DA strategy called CutMIB to improve the performance of existing light field SR networks while keeping their structures unchanged. Specifically, CutMIB first cuts low-resolution (LR) patches from each view at the same location. Then CutMIB blends all LR patches to generate the blended patch and finally pastes the blended patch to the corresponding regions of high-resolution light field views, and vice versa. By doing so, CutMIB enables light field SR networks to learn from implicit geometric information during the training stage. Experimental results demonstrate that CutMIB can improve the reconstruction performance and the angular consistency of existing light field SR networks. We further verify the effectiveness of CutMIB on real-world light field SR and light field denoising. The implementation code is available at https://github.com/zeyuxiao1997/CutMIB. + + + + Energy-Efficient Adaptive 3D Sensing + http://openaccess.thecvf.com//content/CVPR2023/papers/Tilmon_Energy-Efficient_Adaptive_3D_Sensing_CVPR_2023_paper.pdf + Active depth sensing achieves robust depth estimation but is usually limited by the sensing range. Naively increasing the optical power can improve sensing range but induces eye-safety concerns for many applications, including autonomous robots and augmented reality. In this paper, we propose an adaptive active depth sensor that jointly optimizes range, power consumption, and eye-safety. The main observation is that we need not project light patterns to the entire scene but only to small regions of interest where depth is necessary for the application and passive stereo depth estimation fails. We theoretically compare this adaptive sensing scheme with other sensing strategies, such as full-frame projection, line scanning, and point scanning. We show that, to achieve the same maximum sensing distance, the proposed method consumes the least power while having the shortest (best) eye-safety distance. We implement this adaptive sensing scheme with two hardware prototypes, one with a phase-only spatial light modulator (SLM) and the other with a micro-electro-mechanical (MEMS) mirror and diffractive optical elements (DOE). Experimental results validate the advantage of our method and demonstrate its capability of acquiring higher quality geometry adaptively. + + + + CR-FIQA: Face Image Quality Assessment by Learning Sample Relative Classifiability + http://openaccess.thecvf.com//content/CVPR2023/papers/Boutros_CR-FIQA_Face_Image_Quality_Assessment_by_Learning_Sample_Relative_Classifiability_CVPR_2023_paper.pdf + Face image quality assessment (FIQA) estimates the utility of the captured image in achieving reliable and accurate recognition performance. This work proposes a novel FIQA method, CR-FIQA, that estimates the face image quality of a sample by learning to predict its relative classifiability. 
This classifiability is measured based on the allocation of the training sample feature representation in angular space with respect to its class center and the nearest negative class center. We experimentally illustrate the correlation between the face image quality and the sample relative classifiability. As this property is only observable for the training dataset, we propose to learn it by probing internal network observations during the training process and utilizing it to predict the quality of unseen samples. Through extensive evaluation experiments on eight benchmarks and four face recognition models, we demonstrate the superiority of our proposed CR-FIQA over state-of-the-art (SOTA) FIQA algorithms. + + + + Endpoints Weight Fusion for Class Incremental Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiao_Endpoints_Weight_Fusion_for_Class_Incremental_Semantic_Segmentation_CVPR_2023_paper.pdf + Class incremental semantic segmentation (CISS) focuses on alleviating catastrophic forgetting to improve discrimination. Previous works mainly exploit regularization (e.g., knowledge distillation) to maintain previous knowledge in the current model. However, distillation alone often yields limited gain to the model since only the representations of the old and new models are restricted to be consistent. In this paper, we propose a simple yet effective method to obtain a model with a strong memory of old knowledge, named Endpoints Weight Fusion (EWF). In our method, the model containing old knowledge is fused with the model retaining new knowledge in a dynamic fusion manner, strengthening the memory of old classes in ever-changing distributions. In addition, we analyze the relation between our fusion strategy and a popular moving average technique, EMA, which reveals why our method is more suitable for class-incremental learning. To facilitate parameter fusion with closer distance in the parameter space, we use distillation to enhance the optimization process. Furthermore, we conduct experiments on two widely used datasets, achieving state-of-the-art performance. + + + + GeneCIS: A Benchmark for General Conditional Image Similarity + http://openaccess.thecvf.com//content/CVPR2023/papers/Vaze_GeneCIS_A_Benchmark_for_General_Conditional_Image_Similarity_CVPR_2023_paper.pdf + We argue that there are many notions of 'similarity' and that models, like humans, should be able to adapt to these dynamically. This contrasts with most representation learning methods, supervised or self-supervised, which learn a fixed embedding function and hence implicitly assume a single notion of similarity. For instance, models trained on ImageNet are biased towards object categories, while a user might prefer the model to focus on colors, textures or specific elements in the scene. In this paper, we propose the GeneCIS ('genesis') benchmark, which measures models' ability to adapt to a range of similarity conditions. Extending prior work, our benchmark is designed for zero-shot evaluation only, and hence considers an open set of similarity conditions. We find that baselines from powerful CLIP models struggle on GeneCIS and that performance on the benchmark is only weakly correlated with ImageNet accuracy, suggesting that simply scaling existing methods is not fruitful. We further propose a simple, scalable solution based on automatically mining information from existing image-caption datasets.
We find our method offers a substantial boost over the baselines on GeneCIS, and further improves zero-shot performance on related image retrieval benchmarks. In fact, though evaluated zero-shot, our model surpasses state-of-the-art supervised models on MIT-States. + + + + MD-VQA: Multi-Dimensional Quality Assessment for UGC Live Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_MD-VQA_Multi-Dimensional_Quality_Assessment_for_UGC_Live_Videos_CVPR_2023_paper.pdf + User-generated content (UGC) live videos are often bothered by various distortions during capture procedures and thus exhibit diverse visual qualities. Such source videos are further compressed and transcoded by media server providers before being distributed to end-users. Because of the flourishing of UGC live videos, effective video quality assessment (VQA) tools are needed to monitor and perceptually optimize live streaming videos in the distributing process. Unfortunately, existing compressed UGC VQA databases are either small in scale or employ high-quality UGC videos as source videos, so VQA models developed on these databases have limited abilities to evaluate UGC live videos. In this paper, we address UGC Live VQA problems by constructing a first-of-a-kind subjective UGC Live VQA database and developing an effective evaluation tool. Concretely, 418 source UGC videos are collected in real live streaming scenarios and 3,762 compressed ones at different bit rates are generated for the subsequent subjective VQA experiments. Based on the built database, we develop a Multi-Dimensional VQA (MD-VQA) evaluator to measure the visual quality of UGC live videos from semantic, distortion, and motion aspects respectively. Extensive experimental results show that MD-VQA achieves state-of-the-art performance on both our UGC Live VQA database and existing compressed UGC VQA databases. + + + + Spring: A High-Resolution High-Detail Dataset and Benchmark for Scene Flow, Optical Flow and Stereo + http://openaccess.thecvf.com//content/CVPR2023/papers/Mehl_Spring_A_High-Resolution_High-Detail_Dataset_and_Benchmark_for_Scene_Flow_CVPR_2023_paper.pdf + While recent methods for motion and stereo estimation recover an unprecedented amount of details, such highly detailed structures are neither adequately reflected in the data of existing benchmarks nor their evaluation methodology. Hence, we introduce Spring -- a large, high-resolution, high-detail, computer-generated benchmark for scene flow, optical flow, and stereo. Based on rendered scenes from the open-source Blender movie "Spring", it provides photo-realistic HD datasets with state-of-the-art visual effects and ground truth training data. Furthermore, we provide a website to upload, analyze and compare results. Using a novel evaluation methodology based on a super-resolved UHD ground truth, our Spring benchmark can assess the quality of fine structures and provides further detailed performance statistics on different image regions. Regarding the number of ground truth frames, Spring is 60x larger than the only scene flow benchmark, KITTI 2015, and 15x larger than the well-established MPI Sintel optical flow benchmark. Initial results for recent methods on our benchmark show that estimating fine details is indeed challenging, as their accuracy leaves significant room for improvement. The Spring benchmark and the corresponding datasets are available at http://spring-benchmark.org. 
+ + + + MAESTER: Masked Autoencoder Guided Segmentation at Pixel Resolution for Accurate, Self-Supervised Subcellular Structure Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_MAESTER_Masked_Autoencoder_Guided_Segmentation_at_Pixel_Resolution_for_Accurate_CVPR_2023_paper.pdf + Accurate segmentation of cellular images remains an elusive task due to the intrinsic variability in morphology of biological structures. Complete manual segmentation is unfeasible for large datasets, and while supervised methods have been proposed to automate segmentation, they often rely on manually generated ground truths which are especially challenging and time consuming to generate in biology due to the requirement of domain expertise. Furthermore, these methods have limited generalization capacity, requiring additional manual labels to be generated for each dataset and use case. We introduce MAESTER (Masked AutoEncoder guided SegmenTation at pixEl Resolution), a self-supervised method for accurate, subcellular structure segmentation at pixel resolution. MAESTER treats segmentation as a representation learning and clustering problem. Specifically, MAESTER learns semantically meaningful token representations of multi-pixel image patches while simultaneously maintaining a sufficiently large field of view for contextual learning. We also develop a cover-and-stride inference strategy to achieve pixel-level subcellular structure segmentation. We evaluated MAESTER on a publicly available volumetric electron microscopy (VEM) dataset of primary mouse pancreatic islets beta cells and achieved upwards of 29.1% improvement over state-of-the-art under the same evaluation criteria. Furthermore, our results are competitive against supervised methods trained on the same tasks, closing the gap between self-supervised and supervised approaches. MAESTER shows promise for alleviating the critical bottleneck of ground truth generation for imaging related data analysis and thereby greatly increasing the rate of biological discovery. + + + + Self-Supervised Image-to-Point Distillation via Semantically Tolerant Contrastive Loss + http://openaccess.thecvf.com//content/CVPR2023/papers/Mahmoud_Self-Supervised_Image-to-Point_Distillation_via_Semantically_Tolerant_Contrastive_Loss_CVPR_2023_paper.pdf + An effective framework for learning 3D representations for perception tasks is distilling rich self-supervised image features via contrastive learning. However, image-to-point representation learning for autonomous driving datasets faces two main challenges: 1) the abundance of self-similarity, which results in the contrastive losses pushing away semantically similar point and image regions and thus disturbing the local semantic structure of the learned representations, and 2) severe class imbalance as pretraining gets dominated by over-represented classes. We propose to alleviate the self-similarity problem through a novel semantically tolerant image-to-point contrastive loss that takes into consideration the semantic distance between positive and negative image regions to minimize contrasting semantically similar point and image regions. Additionally, we address class imbalance by designing a class-agnostic balanced loss that approximates the degree of class imbalance through an aggregate sample-to-samples semantic similarity measure. 
We demonstrate that our semantically-tolerant contrastive loss with class balancing improves state-of-the-art 2D-to-3D representation learning in all evaluation settings on 3D semantic segmentation. Our method consistently outperforms state-of-the-art 2D-to-3D representation learning frameworks across a wide range of 2D self-supervised pretrained models. + + + + Efficient Robust Principal Component Analysis via Block Krylov Iteration and CUR Decomposition + http://openaccess.thecvf.com//content/CVPR2023/papers/Fang_Efficient_Robust_Principal_Component_Analysis_via_Block_Krylov_Iteration_and_CVPR_2023_paper.pdf + Robust principal component analysis (RPCA) is widely studied in computer vision. Recently an adaptive rank estimate based RPCA has achieved top performance in low-level vision tasks without the prior rank, but both the rank estimate and RPCA optimization algorithm involve singular value decomposition, which requires extremely huge computational resource for large-scale matrices. To address these issues, an efficient RPCA (eRPCA) algorithm is proposed based on block Krylov iteration and CUR decomposition in this paper. Specifically, the Krylov iteration method is employed to approximate the eigenvalue decomposition in the rank estimation, which requires O(ndrq + n(rq)^2) for an (nxd) input matrix, in which q is a parameter with a small value, r is the target rank. Based on the estimated rank, CUR decomposition is adopted to replace SVD in updating low-rank matrix component, whose complexity reduces from O(rnd) to O(r^2n) per iteration. Experimental results verify the efficiency and effectiveness of the proposed eRPCA over the state-of-the-art methods in various low-level vision applications. + + + + VIVE3D: Viewpoint-Independent Video Editing Using 3D-Aware GANs + http://openaccess.thecvf.com//content/CVPR2023/papers/Fruhstuck_VIVE3D_Viewpoint-Independent_Video_Editing_Using_3D-Aware_GANs_CVPR_2023_paper.pdf + We introduce VIVE3D, a novel approach that extends the capabilities of image-based 3D GANs to video editing and is able to represent the input video in an identity-preserving and temporally consistent way. We propose two new building blocks. First, we introduce a novel GAN inversion technique specifically tailored to 3D GANs by jointly embedding multiple frames and optimizing for the camera parameters. Second, besides traditional semantic face edits (e.g. for age and expression), we are the first to demonstrate edits that show novel views of the head enabled by the inherent properties of 3D GANs and our optical flow-guided compositing technique to combine the head with the background video. Our experiments demonstrate that VIVE3D generates high-fidelity face edits at consistent quality from a range of camera viewpoints which are composited with the original video in a temporally and spatially-consistent manner. + + + + DPE: Disentanglement of Pose and Expression for General Video Portrait Editing + http://openaccess.thecvf.com//content/CVPR2023/papers/Pang_DPE_Disentanglement_of_Pose_and_Expression_for_General_Video_Portrait_CVPR_2023_paper.pdf + One-shot video-driven talking face generation aims at producing a synthetic talking video by transferring the facial motion from a video to an arbitrary portrait image. Head pose and facial expression are always entangled in facial motion and transferred simultaneously. 
However, the entanglement sets up a barrier for these methods to be used in video portrait editing directly, where it may require to modify the expression only while maintaining the pose unchanged. One challenge of decoupling pose and expression is the lack of paired data, such as the same pose but different expressions. Only a few methods attempt to tackle this challenge with the feat of 3D Morphable Models (3DMMs) for explicit disentanglement. But 3DMMs are not accurate enough to capture facial details due to the limited number of Blendshapes, which has side effects on motion transfer. In this paper, we introduce a novel self-supervised disentanglement framework to decouple pose and expression without 3DMMs and paired data, which consists of a motion editing module, a pose generator, and an expression generator. The editing module projects faces into a latent space where pose motion and expression motion can be disentangled, and the pose or expression transfer can be performed in the latent space conveniently via addition. The two generators render the modified latent codes to images, respectively. Moreover, to guarantee the disentanglement, we propose a bidirectional cyclic training strategy with well-designed constraints. Evaluations demonstrate our method can control pose or expression independently and be used for general video editing. + + + + HexPlane: A Fast Representation for Dynamic Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_HexPlane_A_Fast_Representation_for_Dynamic_Scenes_CVPR_2023_paper.pdf + Modeling and re-rendering dynamic 3D scenes is a challenging task in 3D vision. Prior approaches build on NeRF and rely on implicit representations. This is slow since it requires many MLP evaluations, constraining real-world applications. We show that dynamic 3D scenes can be explicitly represented by six planes of learned features, leading to an elegant solution we call HexPlane. A HexPlane computes features for points in spacetime by fusing vectors extracted from each plane, which is highly efficient. Pairing a HexPlane with a tiny MLP to regress output colors and training via volume rendering gives impressive results for novel view synthesis on dynamic scenes, matching the image quality of prior work but reducing training time by more than 100x. Extensive ablations confirm our HexPlane design and show that it is robust to different feature fusion mechanisms, coordinate systems, and decoding mechanisms. HexPlane is a simple and effective solution for representing 4D volumes, and we hope they can broadly contribute to modeling spacetime for dynamic 3D scenes. + + + + Boosting Semi-Supervised Learning by Exploiting All Unlabeled Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Boosting_Semi-Supervised_Learning_by_Exploiting_All_Unlabeled_Data_CVPR_2023_paper.pdf + Semi-supervised learning (SSL) has attracted enormous attention due to its vast potential of mitigating the dependence on large labeled datasets. The latest methods (e.g., FixMatch) use a combination of consistency regularization and pseudo-labeling to achieve remarkable successes. However, these methods all suffer from the waste of complicated examples since all pseudo-labels have to be selected by a high threshold to filter out noisy ones. Hence, the examples with ambiguous predictions will not contribute to the training phase. For better leveraging all unlabeled examples, we propose two novel techniques: Entropy Meaning Loss (EML) and Adaptive Negative Learning (ANL). 
EML incorporates the prediction distribution of non-target classes into the optimization objective to avoid competition with target class, and thus generating more high-confidence predictions for selecting pseudo-label. ANL introduces the additional negative pseudo-label for all unlabeled data to leverage low-confidence examples. It adaptively allocates this label by dynamically evaluating the top-k performance of the model. EML and ANL do not introduce any additional parameter and hyperparameter. We integrate these techniques with FixMatch, and develop a simple yet powerful framework called FullMatch. Extensive experiments on several common SSL benchmarks (CIFAR-10/100, SVHN, STL-10 and ImageNet) demonstrate that FullMatch exceeds FixMatch by a large margin. Integrated with FlexMatch (an advanced FixMatch-based framework), we achieve state-of-the-art performance. Source code is available at https://github.com/megvii-research/FullMatch. + + + + Novel-View Acoustic Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Novel-View_Acoustic_Synthesis_CVPR_2023_paper.pdf + We introduce the novel-view acoustic synthesis (NVAS) task: given the sight and sound observed at a source viewpoint, can we synthesize the sound of that scene from an unseen target viewpoint? We propose a neural rendering approach: Visually-Guided Acoustic Synthesis (ViGAS) network that learns to synthesize the sound of an arbitrary point in space by analyzing the input audio-visual cues. To benchmark this task, we collect two first-of-their-kind large-scale multi-view audio-visual datasets, one synthetic and one real. We show that our model successfully reasons about the spatial cues and synthesizes faithful audio on both datasets. To our knowledge, this work represents the very first formulation, dataset, and approach to solve the novel-view acoustic synthesis task, which has exciting potential applications ranging from AR/VR to art and design. Unlocked by this work, we believe that the future of novel-view synthesis is in multi-modal learning from videos. + + + + Constrained Evolutionary Diffusion Filter for Monocular Endoscope Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_Constrained_Evolutionary_Diffusion_Filter_for_Monocular_Endoscope_Tracking_CVPR_2023_paper.pdf + Stochastic filtering is widely used to deal with nonlinear optimization problems such as 3-D and visual tracking in various computer vision and augmented reality applications. Many current methods suffer from an imbalance between exploration and exploitation due to their particle degeneracy and impoverishment, resulting in local optimums. To address this imbalance, this work proposes a new constrained evolutionary diffusion filter for nonlinear optimization. Specifically, this filter develops spatial state constraints and adaptive history-recall differential evolution embedded evolutionary stochastic diffusion instead of sequential resampling to resolve the degeneracy and impoverishment problem. With application to monocular endoscope 3-D tracking, the experimental results show that the proposed filtering significantly improves the balance between exploration and exploitation and certainly works better than recent 3-D tracking methods. Particularly, the surgical tracking error was reduced from 4.03 mm to 2.59 mm. 
+ + + + Toward Accurate Post-Training Quantization for Image Super Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Tu_Toward_Accurate_Post-Training_Quantization_for_Image_Super_Resolution_CVPR_2023_paper.pdf + Model quantization is a crucial step for deploying super resolution (SR) networks on mobile devices. However, existing works focus on quantization-aware training, which requires complete dataset and expensive computational overhead. In this paper, we study post-training quantization(PTQ) for image super resolution using only a few unlabeled calibration images. As the SR model aims to maintain the texture and color information of input images, the distribution of activations are long-tailed, asymmetric and highly dynamic compared with classification models. To this end, we introduce the density-based dual clipping to cut off the outliers based on analyzing the asymmetric bounds of activations. Moreover, we present a novel pixel aware calibration method with the supervision of the full-precision model to accommodate the highly dynamic range of different samples. Extensive experiments demonstrate that the proposed method significantly outperforms existing PTQ algorithms on various models and datasets. For instance, we get a 2.091 dB increase on Urban100 benchmark when quantizing EDSRx4 to 4-bit with 100 unlabeled images. Our code is available at both https://github.com/huawei-noah/Efficient-Computing/tree/master/Quantization/PTQ4SR and https://gitee.com/mindspore/models/tree/master/research/cv/PTQ4SR. + + + + Omnimatte3D: Associating Objects and Their Effects in Unconstrained Monocular Video + http://openaccess.thecvf.com//content/CVPR2023/papers/Suhail_Omnimatte3D_Associating_Objects_and_Their_Effects_in_Unconstrained_Monocular_Video_CVPR_2023_paper.pdf + We propose a method to decompose a video into a background and a set of foreground layers, where the background captures stationary elements while the foreground layers capture moving objects along with their associated effects (e.g. shadows and reflections). Our approach is designed for unconstrained monocular videos, with arbitrary camera and object motion. Prior work that tackles this problem assumes that the video can be mapped onto a fixed 2D canvas, severely limiting the possible space of camera motion. Instead, our method applies recent progress in monocular camera pose and depth estimation to create a full, RGBD video layer for the background, along with a video layer for each foreground object. To solve the underconstrained decomposition problem, we propose a new loss formulation based on multi-view consistency. We test our method on challenging videos with complex camera motion and show significant qualitative improvement over current approaches. + + + + Incrementer: Transformer for Class-Incremental Semantic Segmentation With Knowledge Distillation Focusing on Old Class + http://openaccess.thecvf.com//content/CVPR2023/papers/Shang_Incrementer_Transformer_for_Class-Incremental_Semantic_Segmentation_With_Knowledge_Distillation_Focusing_CVPR_2023_paper.pdf + Class-incremental semantic segmentation aims to incrementally learn new classes while maintaining the capability to segment old ones, and suffers catastrophic forgetting since the old-class labels are unavailable. 
Most existing methods are based on convolutional networks and prevent forgetting through knowledge distillation, which (1) need to add additional convolutional layers to predict new classes, and (2) fail to distinguish different regions corresponding to old and new classes during knowledge distillation, roughly distilling all the features and thus limiting the learning of new classes. Based on the above observations, we propose a new transformer framework for class-incremental semantic segmentation, dubbed Incrementer, which only needs to add new class tokens to the transformer decoder for new-class learning. Based on the Incrementer, we propose a new knowledge distillation scheme that focuses on the distillation in the old-class regions, which reduces the constraints of the old model on the new-class learning, thus improving the plasticity. Moreover, we propose a class deconfusion strategy to alleviate the overfitting to new classes and the confusion of similar classes. Our method is simple and effective, and extensive experiments show that it outperforms the SOTAs by a large margin (5-15 absolute-point boosts on both Pascal VOC and ADE20k). We hope that our Incrementer can serve as a new strong pipeline for class-incremental semantic segmentation. + + + + Patch-Mix Transformer for Unsupervised Domain Adaptation: A Game Perspective + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Patch-Mix_Transformer_for_Unsupervised_Domain_Adaptation_A_Game_Perspective_CVPR_2023_paper.pdf + Endeavors have recently been made to leverage the vision transformer (ViT) for the challenging unsupervised domain adaptation (UDA) task. They typically adopt the cross-attention in ViT for direct domain alignment. However, as the performance of cross-attention highly relies on the quality of pseudo labels for target samples, it becomes less effective when the domain gap becomes large. We solve this problem from a game-theoretic perspective with the proposed model, dubbed PMTrans, which bridges source and target domains with an intermediate domain. Specifically, we propose a novel ViT-based module called PatchMix that effectively builds up the intermediate domain, i.e., probability distribution, by learning to sample patches from both domains based on game-theoretical models. This way, it learns to mix the patches from the source and target domains to maximize the cross entropy (CE), while exploiting two semi-supervised mixup losses in the feature and label spaces to minimize it. As such, we interpret the process of UDA as a min-max CE game with three players, including the feature extractor, classifier, and PatchMix, to find the Nash Equilibria. Moreover, we leverage attention maps from ViT to re-weight the label of each patch by its importance, making it possible to obtain more domain-discriminative feature representations. We conduct extensive experiments on four benchmark datasets, and the results show that PMTrans significantly surpasses the ViT-based and CNN-based SoTA methods by +3.6% on Office-Home, +1.4% on Office-31, and +17.7% on DomainNet, respectively.
https://vlis2022.github.io/cvpr23/PMTrans + + + + CAMS: CAnonicalized Manipulation Spaces for Category-Level Functional Hand-Object Manipulation Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_CAMS_CAnonicalized_Manipulation_Spaces_for_Category-Level_Functional_Hand-Object_Manipulation_Synthesis_CVPR_2023_paper.pdf + In this work, we focus on a novel task of category-level functional hand-object manipulation synthesis covering both rigid and articulated object categories. Given an object geometry, an initial human hand pose as well as a sparse control sequence of object poses, our goal is to generate a physically reasonable hand-object manipulation sequence that performs like human beings. To address such a challenge, we first design CAnonicalized Manipulation Spaces (CAMS), a two-level space hierarchy that canonicalizes the hand poses in an object-centric and contact-centric view. Benefiting from the representation capability of CAMS, we then present a two-stage framework for synthesizing human-like manipulation animations. Our framework achieves state-of-the-art performance for both rigid and articulated categories with impressive visual effects. Codes and video results can be found at our project homepage: https://cams-hoi.github.io/. + + + + Multiplicative Fourier Level of Detail + http://openaccess.thecvf.com//content/CVPR2023/papers/Dou_Multiplicative_Fourier_Level_of_Detail_CVPR_2023_paper.pdf + We develop a simple yet surprisingly effective implicit representing scheme called Multiplicative Fourier Level of Detail (MFLOD) motivated by the recent success of multiplicative filter network. Built on multi-resolution feature grid/volume (e.g., the sparse voxel octree), each level's feature is first modulated by a sinusoidal function and then element-wisely multiplied by a linear transformation of previous layer's representation in a layer-to-layer recursive manner, yielding the scale-aggregated encodings for a subsequent simple linear forward to get final output. In contrast to previous hybrid representations relying on interleaved multilevel fusion and nonlinear activation-based decoding, MFLOD could be elegantly characterized as a linear combination of sine basis functions with varying amplitude, frequency, and phase upon the learned multilevel features, thus offering great feasibility in Fourier analysis. Comprehensive experimental results on implicit neural representation learning tasks including image fitting, 3D shape representation, and neural radiance fields well demonstrate the superior quality and generalizability achieved by the proposed MFLOD scheme. + + + + Relational Context Learning for Human-Object Interaction Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Relational_Context_Learning_for_Human-Object_Interaction_Detection_CVPR_2023_paper.pdf + Recent state-of-the-art methods for HOI detection typically build on transformer architectures with two decoder branches, one for human-object pair detection and the other for interaction classification. Such disentangled transformers, however, may suffer from insufficient context exchange between the branches and lead to a lack of context information for relational reasoning, which is critical in discovering HOI instances. In this work, we propose the multiplex relation network (MUREN) that performs rich context exchange between three decoder branches using unary, pairwise, and ternary relations of human, object, and interaction tokens. 
The proposed method learns comprehensive relational contexts for discovering HOI instances, achieving state-of-the-art performance on two standard benchmarks for HOI detection, HICO-DET and V-COCO. + + + + Multi-Label Compound Expression Recognition: C-EXPR Database & Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Kollias_Multi-Label_Compound_Expression_Recognition_C-EXPR_Database__Network_CVPR_2023_paper.pdf + Research in automatic analysis of facial expressions mainly focuses on recognising the seven basic ones. However, compound expressions are more diverse and represent the complexity and subtlety of our daily affective displays more accurately. Limited research has been conducted for compound expression recognition (CER), because only a few databases exist, which are small, lab controlled, imbalanced and static. In this paper we present an in-the-wild A/V database, C-EXPR-DB, consisting of 400 videos of 200K frames, annotated in terms of 13 compound expressions, valence-arousal emotion descriptors, action units, speech, facial landmarks and attributes. We also propose C-EXPR-NET, a multi-task learning (MTL) method for CER and AU detection (AU-D); the latter task is introduced to enhance CER performance. For AU-D we incorporate AU semantic description along with visual information. For CER we use a multi-label formulation and the KL-divergence loss. We also propose a distribution matching loss for coupling CER and AU-D tasks to boost their performance and alleviate negative transfer (i.e., when MT model's performance is worse than that of at least one single-task model). An extensive experimental study has been conducted illustrating the excellent performance of C-EXPR-NET, validating the theoretical claims. Finally, C-EXPR-NET is shown to effectively generalize its knowledge in new emotion recognition contexts, in a zero-shot manner. + + + + CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_CORA_Adapting_CLIP_for_Open-Vocabulary_Detection_With_Region_Prompting_and_CVPR_2023_paper.pdf + Open-vocabulary detection (OVD) is an object detection task aiming at detecting objects from novel categories beyond the base categories on which the detector is trained. Recent OVD methods rely on large-scale visual-language pre-trained models, such as CLIP, for recognizing novel objects. We identify the two core obstacles that need to be tackled when incorporating these models into detector training: (1) the distribution mismatch that happens when applying a VL-model trained on whole images to region recognition tasks; (2) the difficulty of localizing objects of unseen classes. To overcome these obstacles, we propose CORA, a DETR-style framework that adapts CLIP for Open-vocabulary detection by Region prompting and Anchor pre-matching. Region prompting mitigates the whole-to-region distribution gap by prompting the region features of the CLIP-based region classifier. Anchor pre-matching helps learning generalizable object localization by a class-aware matching mechanism. We evaluate CORA on the COCO OVD benchmark, where we achieve 41.7 AP50 on novel classes, which outperforms the previous SOTA by 2.4 AP50 even without resorting to extra training data. When extra training data is available, we train CORA+ on both ground-truth base-category annotations and additional pseudo bounding box labels computed by CORA. 
CORA+ achieves 43.1 AP50 on the COCO OVD benchmark and 28.1 box APr on the LVIS OVD benchmark. The code is available at https://github.com/tgxs002/CORA. + + + + 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars + http://openaccess.thecvf.com//content/CVPR2023/papers/Abdal_3DAvatarGAN_Bridging_Domains_for_Personalized_Editable_Avatars_CVPR_2023_paper.pdf + Modern 3D-GANs synthesize geometry and texture by training on large-scale datasets with a consistent structure. Training such models on stylized, artistic data, with often unknown, highly variable geometry, and camera information has not yet been shown possible. Can we train a 3D GAN on such artistic data, while maintaining multi-view consistency and texture quality? To this end, we propose an adaptation framework, where the source domain is a pre-trained 3D-GAN, while the target domain is a 2D-GAN trained on artistic datasets. We, then, distill the knowledge from a 2D generator to the source 3D generator. To do that, we first propose an optimization-based method to align the distributions of camera parameters across domains. Second, we propose regularizations necessary to learn high-quality texture, while avoiding degenerate geometric solutions, such as flat shapes. Third, we show a deformation-based technique for modeling exaggerated geometry of artistic domains, enabling---as a byproduct---personalized geometric editing. Finally, we propose a novel inversion method for 3D-GANs linking the latent spaces of the source and the target domains. Our contributions---for the first time---allow for the generation, editing, and animation of personalized artistic 3D avatars on artistic datasets. + + + + Discriminative Co-Saliency and Background Mining Transformer for Co-Salient Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Discriminative_Co-Saliency_and_Background_Mining_Transformer_for_Co-Salient_Object_Detection_CVPR_2023_paper.pdf + Most previous co-salient object detection works mainly focus on extracting co-salient cues via mining the consistency relations across images while ignoring the explicit exploration of background regions. In this paper, we propose a Discriminative co-saliency and background Mining Transformer framework (DMT) based on several economical multi-grained correlation modules to explicitly mine both co-saliency and background information and effectively model their discrimination. Specifically, we first propose region-to-region correlation modules to economically model inter-image relations for pixel-wise segmentation features. Then, we use two types of predefined tokens to mine co-saliency and background information via our proposed contrast-induced pixel-to-token and co-saliency token-to-token correlation modules. We also design a token-guided feature refinement module to enhance the discriminability of the segmentation features under the guidance of the learned tokens. We perform iterative mutual promotion for the segmentation feature extraction and token construction. Experimental results on three benchmark datasets demonstrate the effectiveness of our proposed method. The source code is available at: https://github.com/dragonlee258079/DMT. + + + + Person Image Synthesis via Denoising Diffusion Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Bhunia_Person_Image_Synthesis_via_Denoising_Diffusion_Model_CVPR_2023_paper.pdf + The pose-guided person image generation task requires synthesizing photorealistic images of humans in arbitrary poses. 
The existing approaches use generative adversarial networks that do not necessarily maintain realistic textures or need dense correspondences that struggle to handle complex deformations and severe occlusions. In this work, we show how denoising diffusion models can be applied for high-fidelity person image synthesis with strong sample diversity and enhanced mode coverage of the learnt data distribution. Our proposed Person Image Diffusion Model (PIDM) disintegrates the complex transfer problem into a series of simpler forward-backward denoising steps. This helps in learning plausible source-to-target transformation trajectories that result in faithful textures and undistorted appearance details. We introduce a 'texture diffusion module' based on cross-attention to accurately model the correspondences between appearance and pose information available in source and target images. Further, we propose 'disentangled classifier-free guidance' to ensure close resemblance between the conditional inputs and the synthesized output in terms of both pose and appearance information. Our extensive results on two large-scale benchmarks and a user study demonstrate the photorealism of our proposed approach under challenging scenarios. We also show how our generated images can help in downstream tasks. + + + + Adaptive Assignment for Geometry Aware Local Feature Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Adaptive_Assignment_for_Geometry_Aware_Local_Feature_Matching_CVPR_2023_paper.pdf + The detector-free feature matching approaches are currently attracting great attention thanks to their excellent performance. However, these methods still struggle at large-scale and viewpoint variations, due to the geometric inconsistency resulting from the application of the mutual nearest neighbour criterion (i.e., one-to-one assignment) in patch-level matching. Accordingly, we introduce AdaMatcher, which first accomplishes the feature correlation and co-visible area estimation through an elaborate feature interaction module, then performs adaptive assignment on patch-level matching while estimating the scales between images, and finally refines the co-visible matches through scale alignment and sub-pixel regression module. Extensive experiments show that AdaMatcher outperforms solid baselines and achieves state-of-the-art results on many downstream tasks. Additionally, the adaptive assignment and sub-pixel refinement module can be used as a refinement network for other matching methods, such as SuperGlue, to boost their performance further. The code will be publicly available at https://github.com/AbyssGaze/AdaMatcher. + + + + Initialization Noise in Image Gradients and Saliency Maps + http://openaccess.thecvf.com//content/CVPR2023/papers/Woerl_Initialization_Noise_in_Image_Gradients_and_Saliency_Maps_CVPR_2023_paper.pdf + In this paper, we examine gradients of logits of image classification CNNs by input pixel values. We observe that these fluctuate considerably with training randomness, such as the random initialization of the networks. We extend our study to gradients of intermediate layers, obtained via GradCAM, as well as popular network saliency estimators such as DeepLIFT, SHAP, LIME, Integrated Gradients, and SmoothGrad. 
While empirical noise levels vary, qualitatively different attributions to image features are still possible with all of these, which comes with implications for interpreting such attributions, in particular when seeking data-driven explanations of the phenomenon generating the data. Finally, we demonstrate that the observed artefacts can be removed by marginalizing over the initialization distribution via simple stochastic integration. + + + + Implicit Neural Head Synthesis via Controllable Local Deformation Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Implicit_Neural_Head_Synthesis_via_Controllable_Local_Deformation_Fields_CVPR_2023_paper.pdf + High-quality reconstruction of controllable 3D head avatars from 2D videos is highly desirable for virtual human applications in movies, games, and telepresence. Neural implicit fields provide a powerful representation to model 3D head avatars with personalized shape, expressions, and facial parts, e.g., hair and mouth interior, that go beyond the linear 3D morphable model (3DMM). However, existing methods do not model faces with fine-scale facial features, or local control of facial parts that extrapolate asymmetric expressions from monocular videos. Further, most condition only on 3DMM parameters with poor(er) locality, and resolve local features with a global neural field. We build on part-based implicit shape models that decompose a global deformation field into local ones. Our novel formulation models multiple implicit deformation fields with local semantic rig-like control via 3DMM-based parameters, and representative facial landmarks. Further, we propose a local control loss and attention mask mechanism that promote sparsity of each learned deformation field. Our formulation renders sharper locally controllable nonlinear deformations than previous implicit monocular approaches, especially mouth interior, asymmetric expressions, and facial details. Project page: https://imaging.cs.cmu.edu/local_deformation_fields/ + + + + Curricular Object Manipulation in LiDAR-Based Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Curricular_Object_Manipulation_in_LiDAR-Based_Object_Detection_CVPR_2023_paper.pdf + This paper explores the potential of curriculum learning in LiDAR-based 3D object detection by proposing a curricular object manipulation (COM) framework. The framework embeds the curricular training strategy into both the loss design and the augmentation process. For the loss design, we propose the COMLoss to dynamically predict object-level difficulties and emphasize objects of different difficulties based on training stages. On top of the widely-used augmentation technique called GT-Aug in LiDAR detection tasks, we propose a novel COMAug strategy which first clusters objects in the ground-truth database based on well-designed heuristics. Group-level difficulties rather than individual ones are then predicted and updated during training for stable results. Model performance and generalization capabilities can be improved by sampling and augmenting progressively more difficult objects into the training points. Extensive experiments and ablation studies reveal the superiority and generality of the proposed framework. The code is available at https://github.com/ZZY816/COM.
+ + + + Shape-Constraint Recurrent Flow for 6D Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Hai_Shape-Constraint_Recurrent_Flow_for_6D_Object_Pose_Estimation_CVPR_2023_paper.pdf + Most recent 6D object pose estimation methods rely on 2D optical flow networks to refine their results. However, these optical flow methods typically do not consider any 3D shape information of the targets during matching, making them suffer in 6D object pose estimation. In this work, we propose a shape-constraint recurrent flow network for 6D object pose estimation, which embeds the 3D shape information of the targets into the matching procedure. We first introduce a flow-to-pose component to learn an intermediate pose from the current flow estimation, then impose a shape constraint from the current pose on the lookup space of the 4D correlation volume for flow estimation, which reduces the matching space significantly and is much easier to learn. Finally, we optimize the flow and pose simultaneously in a recurrent manner until convergence. We evaluate our method on three challenging 6D object pose datasets and show that it outperforms the state of the art in both accuracy and efficiency. + + + + Micron-BERT: BERT-Based Facial Micro-Expression Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Nguyen_Micron-BERT_BERT-Based_Facial_Micro-Expression_Recognition_CVPR_2023_paper.pdf + Micro-expression recognition is one of the most challenging topics in affective computing. It aims to recognize tiny facial movements difficult for humans to perceive in a brief period, i.e., 0.25 to 0.5 seconds. Recent advances in pre-training deep Bidirectional Transformers (BERT) have significantly improved self-supervised learning tasks in computer vision. However, the standard BERT in vision problems is designed to learn only from full images or videos, and the architecture cannot accurately detect details of facial micro-expressions. This paper presents Micron-BERT (u-BERT), a novel approach to facial micro-expression recognition. The proposed method can automatically capture these movements in an unsupervised manner based on two key ideas. First, we employ Diagonal Micro-Attention (DMA) to detect tiny differences between two frames. Second, we introduce a new Patch of Interest (PoI) module to localize and highlight micro-expression interest regions and simultaneously reduce noisy backgrounds and distractions. By incorporating these components into an end-to-end deep network, the proposed u-BERT significantly outperforms all previous work in various micro-expression tasks. u-BERT can be trained on a large-scale unlabeled dataset, i.e., up to 8 million images, and achieves high accuracy on new unseen facial micro-expression datasets. Empirical experiments show u-BERT consistently outperforms state-of-the-art performance on four micro-expression benchmarks, including SAMM, CASME II, SMIC, and CASME3, by significant margins. Code will be available at https://github.com/uark-cviu/Micron-BERT + + + + PanelNet: Understanding 360 Indoor Environment via Panel Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_PanelNet_Understanding_360_Indoor_Environment_via_Panel_Representation_CVPR_2023_paper.pdf + Indoor 360 panoramas have two essential properties. (1) The panoramas are continuous and seamless in the horizontal direction. (2) Gravity plays an important role in indoor environment design. 
By leveraging these properties, we present PanelNet, a framework that understands indoor environments using a novel panel representation of 360 images. We represent an equirectangular projection (ERP) as consecutive vertical panels with corresponding 3D panel geometry. To reduce the negative impact of panoramic distortion, we incorporate a panel geometry embedding network that encodes both the local and global geometric features of a panel. To capture the geometric context in room design, we introduce Local2Global Transformer, which aggregates local information within a panel and panel-wise global context. It greatly improves the model performance with low training overhead. Our method outperforms existing methods on indoor 360 depth estimation and shows competitive results against state-of-the-art approaches on the tasks of indoor layout estimation and semantic segmentation. + + + + PoseExaminer: Automated Testing of Out-of-Distribution Robustness in Human Pose and Shape Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_PoseExaminer_Automated_Testing_of_Out-of-Distribution_Robustness_in_Human_Pose_and_CVPR_2023_paper.pdf + Human pose and shape (HPS) estimation methods achieve remarkable results. However, current HPS benchmarks are mostly designed to test models in scenarios that are similar to the training data. This can lead to critical situations in real-world applications when the observed data differs significantly from the training data and hence is out-of-distribution (OOD). It is therefore important to test and improve the OOD robustness of HPS methods. To address this fundamental problem, we develop a simulator that can be controlled in a fine-grained manner using interpretable parameters to explore the manifold of images of human pose, e.g. by varying poses, shapes, and clothes. We introduce a learning-based testing method, termed PoseExaminer, that automatically diagnoses HPS algorithms by searching over the parameter space of human pose images to find the failure modes. Our strategy for exploring this high-dimensional parameter space is a multi-agent reinforcement learning system, in which the agents collaborate to explore different parts of the parameter space. We show that our PoseExaminer discovers a variety of limitations in current state-of-the-art models that are relevant in real-world scenarios but are missed by current benchmarks. For example, it finds large regions of realistic human poses that are not predicted correctly, as well as reduced performance for humans with skinny and corpulent body shapes. In addition, we show that fine-tuning HPS methods by exploiting the failure modes found by PoseExaminer improves their robustness and even their performance on standard benchmarks by a significant margin. The code is available for research purposes. + + + + GANHead: Towards Generative Animatable Neural Head Avatars + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_GANHead_Towards_Generative_Animatable_Neural_Head_Avatars_CVPR_2023_paper.pdf + To bring digital avatars into people's lives, it is highly desirable to efficiently generate complete, realistic, and animatable head avatars. This task is challenging, and it is difficult for existing methods to satisfy all the requirements at once.
To achieve these goals, we propose GANHead (Generative Animatable Neural Head Avatar), a novel generative head model that takes advantage of both the fine-grained control over explicit expression parameters and the realistic rendering results of implicit representations. Specifically, GANHead represents coarse geometry, fine-grained details and texture via three networks in canonical space to obtain the ability to generate complete and realistic head avatars. To achieve flexible animation, we define the deformation field via standard linear blend skinning (LBS), with the learned continuous pose and expression bases and LBS weights. This allows the avatars to be directly animated by FLAME parameters and generalize well to unseen poses and expressions. Compared to state-of-the-art (SOTA) methods, GANHead achieves superior performance on head avatar generation and raw scan fitting. + + + + Deep Dive Into Gradients: Better Optimization for 3D Object Detection With Gradient-Corrected IoU Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Ming_Deep_Dive_Into_Gradients_Better_Optimization_for_3D_Object_Detection_CVPR_2023_paper.pdf + Intersection-over-Union (IoU) is the most popular metric to evaluate regression performance in 3D object detection. Recently, some methods have also applied IoU to the optimization of 3D bounding box regression. However, we demonstrate through experiments and mathematical proof that the 3D IoU loss suffers from abnormal gradients w.r.t. angular error and object scale, which further lead to slow convergence and a suboptimal regression process, respectively. In this paper, we propose a Gradient-Corrected IoU (GCIoU) loss to achieve fast and accurate 3D bounding box regression. Specifically, a gradient correction strategy is designed to endow the 3D IoU loss with a reasonable gradient. It ensures that the model converges quickly in the early stage of training, and helps to achieve fine-grained refinement of bounding boxes in the later stage. To solve suboptimal regression of the 3D IoU loss for objects at different scales, we introduce a gradient rescaling strategy to adaptively optimize the step size. Finally, we integrate GCIoU Loss into multiple models to achieve stable performance gains and faster model convergence. Experiments on the KITTI dataset demonstrate the superiority of the proposed method. The code is available at https://github.com/ming71/GCIoU-loss. + + + + Doubly Right Object Recognition: A Why Prompt for Visual Rationales + http://openaccess.thecvf.com//content/CVPR2023/papers/Mao_Doubly_Right_Object_Recognition_A_Why_Prompt_for_Visual_Rationales_CVPR_2023_paper.pdf + Many visual recognition models are evaluated only on their classification accuracy, a metric for which they obtain strong performance. In this paper, we investigate whether computer vision models can also provide correct rationales for their predictions. We propose a "doubly right" object recognition benchmark, where the metric requires the model to simultaneously produce both the right labels as well as the right rationales. We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions. However, by transferring the rationales from language models into visual representations through a tailored dataset, we show that we can learn a "why prompt," which adapts large visual representations to produce correct rationales.
Visualizations and empirical experiments show that our prompts significantly improve performance on doubly right object recognition, in addition to zero-shot transfer to unseen tasks and datasets. + + + + Distilling Neural Fields for Real-Time Articulated Shape Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Tan_Distilling_Neural_Fields_for_Real-Time_Articulated_Shape_Reconstruction_CVPR_2023_paper.pdf + We present a method for reconstructing articulated 3D models from videos in real-time, without test-time optimization or manual 3D supervision at training time. Prior work often relies on pre-built deformable models (e.g. SMAL/SMPL), or slow per-scene optimization through differentiable rendering (e.g. dynamic NeRFs). Such methods fail to support arbitrary object categories, or are unsuitable for real-time applications. To address the challenge of collecting large-scale 3D training data for arbitrary deformable object categories, our key insight is to use off-the-shelf video-based dynamic NeRFs as 3D supervision to train a fast feed-forward network, turning 3D shape and motion prediction into a supervised distillation task. Our temporal-aware network uses articulated bones and blend skinning to represent arbitrary deformations, and is self-supervised on video datasets without requiring 3D shapes or viewpoints as input. Through distillation, our network learns to 3D-reconstruct unseen articulated objects at interactive frame rates. Our method yields higher-fidelity 3D reconstructions than prior real-time methods for animals, with the ability to render realistic images at novel viewpoints and poses. + + + + IPCC-TP: Utilizing Incremental Pearson Correlation Coefficient for Joint Multi-Agent Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_IPCC-TP_Utilizing_Incremental_Pearson_Correlation_Coefficient_for_Joint_Multi-Agent_Trajectory_CVPR_2023_paper.pdf + Reliable multi-agent trajectory prediction is crucial for the safe planning and control of autonomous systems. Compared with single-agent cases, the major challenge in simultaneously processing multiple agents lies in modeling complex social interactions caused by various driving intentions and road conditions. Previous methods typically leverage graph-based message propagation or attention mechanism to encapsulate such interactions in the format of marginal probabilistic distributions. However, it is inherently sub-optimal. In this paper, we propose IPCC-TP, a novel relevance-aware module based on Incremental Pearson Correlation Coefficient to improve multi-agent interaction modeling. IPCC-TP learns pairwise joint Gaussian Distributions through the tightly-coupled estimation of the means and covariances according to interactive incremental movements. Our module can be conveniently embedded into existing multi-agent prediction methods to extend original motion distribution decoders. Extensive experiments on nuScenes and Argoverse 2 datasets demonstrate that IPCC-TP improves the performance of baselines by a large margin. + + + + MobileOne: An Improved One Millisecond Mobile Backbone + http://openaccess.thecvf.com//content/CVPR2023/papers/Vasu_MobileOne_An_Improved_One_Millisecond_Mobile_Backbone_CVPR_2023_paper.pdf + Efficient neural network backbones for mobile devices are often optimized for metrics such as FLOPs or parameter count. However, these metrics may not correlate well with latency of the network when deployed on a mobile device. 
Therefore, we perform extensive analysis of different metrics by deploying several mobile-friendly networks on a mobile device. We identify and analyze architectural and optimization bottlenecks in recent efficient neural networks and provide ways to mitigate these bottlenecks. To this end, we design an efficient backbone MobileOne, with variants achieving an inference time under 1 ms on an iPhone12 with 75.9% top-1 accuracy on ImageNet. We show that MobileOne achieves state-of-the-art performance within the efficient architectures while being many times faster on mobile. Our best model obtains similar performance on ImageNet as MobileFormer while being 38x faster. Our model obtains 2.3% better top-1 accuracy on ImageNet than EfficientNet at similar latency. Furthermore, we show that our model generalizes to multiple tasks -- image classification, object detection, and semantic segmentation with significant improvements in latency and accuracy as compared to existing efficient architectures when deployed on a mobile device. + + + + A Data-Based Perspective on Transfer Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Jain_A_Data-Based_Perspective_on_Transfer_Learning_CVPR_2023_paper.pdf + It is commonly believed that more pre-training data leads to better transfer learning performance. However, recent evidence suggests that removing data from the source dataset can actually help too. In this work, we present a framework for probing the impact of the source dataset's composition on transfer learning performance. Our framework facilitates new capabilities such as identifying transfer learning brittleness and detecting pathologies such as data-leakage and the presence of misleading examples in the source dataset. In particular, we demonstrate that removing detrimental datapoints identified by our framework improves transfer performance from ImageNet on a variety of transfer tasks. + + + + Meta-Explore: Exploratory Hierarchical Vision-and-Language Navigation Using Scene Object Spectrum Grounding + http://openaccess.thecvf.com//content/CVPR2023/papers/Hwang_Meta-Explore_Exploratory_Hierarchical_Vision-and-Language_Navigation_Using_Scene_Object_Spectrum_Grounding_CVPR_2023_paper.pdf + The main challenge in vision-and-language navigation (VLN) is how to understand natural-language instructions in an unseen environment. The main limitation of conventional VLN algorithms is that if an action is mistaken, the agent fails to follow the instructions or explores unnecessary regions, leading the agent to an irrecoverable path. To tackle this problem, we propose Meta-Explore, a hierarchical navigation method deploying an exploitation policy to correct misled recent actions. We show that an exploitation policy, which moves the agent toward a well-chosen local goal among unvisited but observable states, outperforms a method which moves the agent to a previously visited state. We also highlight the demand for imagining regretful explorations with semantically meaningful clues. The key to our approach is understanding the object placements around the agent in spectral-domain. Specifically, we present a novel visual representation, called scene object spectrum (SOS), which performs category-wise 2D Fourier transform of detected objects. Combining exploitation policy and SOS features, the agent can correct its path by choosing a promising local goal. We evaluate our method in three VLN benchmarks: R2R, SOON, and REVERIE. 
Meta-Explore outperforms other baselines and shows significant generalization performance. In addition, local goal search using the proposed spectral-domain SOS features significantly improves the success rate by 17.1% and SPL by 20.6% for the SOON benchmark. + + + + Recovering 3D Hand Mesh Sequence From a Single Blurry Image: A New Dataset and Temporal Unfolding + http://openaccess.thecvf.com//content/CVPR2023/papers/Oh_Recovering_3D_Hand_Mesh_Sequence_From_a_Single_Blurry_Image_CVPR_2023_paper.pdf + Hands, one of the most dynamic parts of our body, suffer from blur due to their active movements. However, previous 3D hand mesh recovery methods have mainly focused on sharp hand images rather than considering blur due to the absence of datasets providing blurry hand images. We first present a novel dataset BlurHand, which contains blurry hand images with 3D groundtruths. The BlurHand is constructed by synthesizing motion blur from sequential sharp hand images, imitating realistic and natural motion blurs. In addition to the new dataset, we propose BlurHandNet, a baseline network for accurate 3D hand mesh recovery from a blurry hand image. Our BlurHandNet unfolds a blurry input image to a 3D hand mesh sequence to utilize temporal information in the blurry input image, while previous works output a static single hand mesh. We demonstrate the usefulness of BlurHand for the 3D hand mesh recovery from blurry images in our experiments. The proposed BlurHandNet produces much more robust results on blurry images while generalizing well to in-the-wild images. The training codes and BlurHand dataset are available at https://github.com/JaehaKim97/BlurHand_RELEASE. + + + + NaQ: Leveraging Narrations As Queries To Supervise Episodic Memory + http://openaccess.thecvf.com//content/CVPR2023/papers/Ramakrishnan_NaQ_Leveraging_Narrations_As_Queries_To_Supervise_Episodic_Memory_CVPR_2023_paper.pdf + Searching long egocentric videos with natural language queries (NLQ) has compelling applications in augmented reality and robotics, where a fluid index into everything that a person (agent) has seen before could augment human memory and surface relevant information on demand. However, the structured nature of the learning problem (free-form text query inputs, localized video temporal window outputs) and its needle-in-a-haystack nature makes it both technically challenging and expensive to supervise. We introduce Narrations-as-Queries (NaQ), a data augmentation strategy that transforms standard video-text narrations into training data for a video query localization model. Validating our idea on the Ego4D benchmark, we find it has tremendous impact in practice. NaQ improves multiple top models by substantial margins (even doubling their accuracy), and yields the very best results to date on the Ego4D NLQ challenge, soundly outperforming all challenge winners in the CVPR and ECCV 2022 competitions and topping the current public leaderboard. Beyond achieving the state-of-the-art for NLQ, we also demonstrate unique properties of our approach such as the ability to perform zero-shot and few-shot NLQ, and improved performance on queries about long-tail object categories. Code and models: http://vision.cs.utexas.edu/projects/naq. 
+ + + + FedSeg: Class-Heterogeneous Federated Learning for Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Miao_FedSeg_Class-Heterogeneous_Federated_Learning_for_Semantic_Segmentation_CVPR_2023_paper.pdf + Federated Learning (FL) is a distributed learning paradigm that collaboratively learns a global model across multiple clients while preserving data privacy. Although many FL algorithms have been proposed for classification tasks, few works focus on the more challenging semantic segmentation task, especially in the class-heterogeneous FL setting. Compared with classification, the issues arising from heterogeneous FL for semantic segmentation are more severe: (1) Due to the non-IID distribution, different clients may contain inconsistent foreground-background classes, resulting in divergent local updates. (2) Class-heterogeneity for complex dense prediction tasks makes the local optimum of clients farther from the global optimum. In this work, we propose FedSeg, a basic federated learning approach for class-heterogeneous semantic segmentation. We first propose a simple but strong modified cross-entropy loss to correct the local optimization and address the foreground-background inconsistency problem. Based on it, we introduce pixel-level contrastive learning to enforce that local pixel embeddings belong to the global semantic space. Extensive experiments on four semantic segmentation benchmarks (Cityscapes, CamVID, PascalVOC and ADE20k) demonstrate the effectiveness of our FedSeg. We hope this work will attract more attention from the FL community to the challenging task of federated semantic segmentation. + + + + Fast Monocular Scene Reconstruction With Global-Sparse Local-Dense Grids + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_Fast_Monocular_Scene_Reconstruction_With_Global-Sparse_Local-Dense_Grids_CVPR_2023_paper.pdf + Indoor scene reconstruction from monocular images has long been sought after by augmented reality and robotics developers. Recent advances in neural field representations and monocular priors have led to remarkable results in scene-level surface reconstructions. The reliance on Multilayer Perceptrons (MLPs), however, significantly limits speed in training and rendering. In this work, we propose to directly use a signed distance function (SDF) in sparse voxel block grids for fast and accurate scene reconstruction without MLPs. Our globally sparse and locally dense data structure exploits surfaces' spatial sparsity, enables cache-friendly queries, and allows direct extensions to multi-modal data such as color and semantic labels. To apply this representation to monocular scene reconstruction, we develop a scale calibration algorithm for fast geometric initialization from monocular depth priors. We apply differentiable volume rendering from this initialization to refine details with fast convergence. We also introduce efficient high-dimensional Continuous Random Fields (CRFs) to further exploit the semantic-geometry consistency between scene objects. Experiments show that our approach is 10x faster in training and 100x faster in rendering while achieving comparable accuracy to state-of-the-art neural implicit methods.
+ + + + Thermal Spread Functions (TSF): Physics-Guided Material Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Dashpute_Thermal_Spread_Functions_TSF_Physics-Guided_Material_Classification_CVPR_2023_paper.pdf + Robust and non-destructive material classification is a challenging but crucial first step in numerous vision applications. We propose a physics-guided material classification framework that relies on thermal properties of the object. Our key observation is that the rate of heating and cooling of an object depends on the unique intrinsic properties of the material, namely the emissivity and diffusivity. We leverage this observation by gently heating the objects in the scene with a low-power laser for a fixed duration and then turning it off, while a thermal camera captures measurements during the heating and cooling process. We then take this spatial and temporal "thermal spread function" (TSF) to solve an inverse heat equation using the finite-differences approach, resulting in a spatially varying estimate of diffusivity and emissivity. These tuples are then used to train a classifier that produces a fine-grained material label at each spatial pixel. Our approach is extremely simple, requiring only a small light source (low-power laser) and a thermal camera, and produces robust classification results with 86% accuracy over 16 classes. + + + + Unsupervised 3D Point Cloud Representation Learning by Triangle Constrained Contrast for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Pang_Unsupervised_3D_Point_Cloud_Representation_Learning_by_Triangle_Constrained_Contrast_CVPR_2023_paper.pdf + Due to the difficulty of annotating the 3D LiDAR data of autonomous driving, an efficient unsupervised 3D representation learning method is important. In this paper, we design the Triangle Constrained Contrast (TriCC) framework tailored for autonomous driving scenes, which learns unsupervised 3D representations from both multimodal information and the dynamics of temporal sequences. We treat one camera image and two LiDAR point clouds with different timestamps as a triplet. Our key design is the consistency constraint that automatically finds matching relationships among the triplet through a "self-cycle" and learns representations from it. With the matching relations across the temporal dimension and modalities, we can further conduct a triplet contrast to improve learning efficiency. To the best of our knowledge, TriCC is the first framework that unifies both the temporal and multimodal semantics, which means it utilizes almost all the information in autonomous driving scenes. Compared with previous contrastive methods, it can automatically dig out contrasting pairs with higher difficulty, instead of relying on handcrafted ones. Extensive experiments are conducted with Minkowski-UNet and VoxelNet on several semantic segmentation and 3D detection datasets. Results show that TriCC learns effective representations with much fewer training iterations and improves the SOTA results greatly on all the downstream tasks. Code and models can be found at https://bopang1996.github.io/.
+ + + + Prompt-Guided Zero-Shot Anomaly Action Recognition Using Pretrained Deep Skeleton Features + http://openaccess.thecvf.com//content/CVPR2023/papers/Sato_Prompt-Guided_Zero-Shot_Anomaly_Action_Recognition_Using_Pretrained_Deep_Skeleton_Features_CVPR_2023_paper.pdf + This study investigates unsupervised anomaly action recognition, which identifies video-level abnormal-human-behavior events in an unsupervised manner without abnormal samples, and simultaneously addresses three limitations in the conventional skeleton-based approaches: target domain-dependent DNN training, robustness against skeleton errors, and a lack of normal samples. We present a unified, user prompt-guided zero-shot learning framework using a target domain-independent skeleton feature extractor, which is pretrained on a large-scale action recognition dataset. Particularly, during the training phase using normal samples, the method models the distribution of skeleton features of the normal actions while freezing the weights of the DNNs and estimates the anomaly score using this distribution in the inference phase. Additionally, to increase robustness against skeleton errors, we introduce a DNN architecture inspired by a point cloud deep learning paradigm, which sparsely propagates the features between joints. Furthermore, to prevent the unobserved normal actions from being misidentified as abnormal actions, we incorporate a similarity score between the user prompt embeddings and skeleton features aligned in the common space into the anomaly score, which indirectly supplements normal actions. On two publicly available datasets, we conduct experiments to test the effectiveness of the proposed method with respect to abovementioned limitations. + + + + Efficient Multimodal Fusion via Interactive Prompting + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Efficient_Multimodal_Fusion_via_Interactive_Prompting_CVPR_2023_paper.pdf + Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era. Following this trend, the size of multimodal learning models constantly increases, leading to an urgent need to reduce the massive computational cost of fine-tuning these models for downstream tasks. In this paper, we propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pretrained transformers. Specifically, we first present a modular multimodal fusion framework that exhibits high flexibility and facilitates mutual interactions among different modalities. In addition, we disentangle vanilla prompts into three types in order to learn different optimizing objectives for multimodal learning. It is also worth noting that we propose to add prompt vectors only on the deep layers of the unimodal transformers, thus significantly reducing the training memory usage. Experiment results show that our proposed method achieves comparable performance to several other multimodal finetuning methods with less than 3% trainable parameters and up to 66% saving of training memory usage. + + + + Depth Estimation From Indoor Panoramas With Neural Scene Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chang_Depth_Estimation_From_Indoor_Panoramas_With_Neural_Scene_Representation_CVPR_2023_paper.pdf + Depth estimation from indoor panoramas is challenging due to the equirectangular distortions of panoramas and inaccurate matching. 
In this paper, we propose a practical framework to improve the accuracy and efficiency of depth estimation from multi-view indoor panoramic images with the Neural Radiance Field technology. Specifically, we develop two networks to implicitly learn the Signed Distance Function for depth measurements and the radiance field from panoramas. We also introduce a novel spherical position embedding scheme to achieve high accuracy. For better convergence, we propose an initialization method for the network weights based on the Manhattan World Assumption. Furthermore, we devise a geometric consistency loss, leveraging the surface normal, to further refine the depth estimation. The experimental results demonstrate that our proposed method outperforms state-of-the-art works by a large margin in both quantitative and qualitative evaluations. Our source code is available at https://github.com/WJ-Chang-42/IndoorPanoDepth. + + + + Task-Specific Fine-Tuning via Variational Information Bottleneck for Weakly-Supervised Pathology Whole Slide Image Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Task-Specific_Fine-Tuning_via_Variational_Information_Bottleneck_for_Weakly-Supervised_Pathology_Whole_CVPR_2023_paper.pdf + While Multiple Instance Learning (MIL) has shown promising results in digital Pathology Whole Slide Image (WSI) analysis, such a paradigm still faces performance and generalization problems due to high computational costs and limited supervision of Gigapixel WSIs. To deal with the computation problem, previous methods utilize a frozen model pretrained from ImageNet to obtain representations, however, it may lose key information owing to the large domain gap and hinder the generalization ability without image-level training-time augmentation. Though Self-supervised Learning (SSL) proposes viable representation learning schemes, the downstream task-specific features via partial label tuning are not explored. To alleviate this problem, we propose an efficient WSI fine-tuning framework motivated by the Information Bottleneck theory. The theory enables the framework to find the minimal sufficient statistics of WSI, thus supporting us to fine-tune the backbone into a task-specific representation only depending on WSI-level weak labels. The WSI-MIL problem is further analyzed to theoretically deduce our fine-tuning method. We evaluate the method on five pathological WSI datasets on various WSI heads. The experimental results show significant improvements in both accuracy and generalization compared with previous works. Source code will be available at https://github.com/invoker-LL/WSI-finetuning. + + + + One-Shot Model for Mixed-Precision Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Koryakovskiy_One-Shot_Model_for_Mixed-Precision_Quantization_CVPR_2023_paper.pdf + Neural network quantization is a popular approach for model compression. Modern hardware supports quantization in mixed-precision mode, which allows for greater compression rates but adds the challenging task of searching for the optimal bit width. The majority of existing searchers find a single mixed-precision architecture. To select an architecture that is suitable in terms of performance and resource consumption, one has to restart searching multiple times. We focus on a specific class of methods that find tensor bit width using gradient-based optimization. First, we theoretically derive several methods that were empirically proposed earlier. 
Second, we present a novel One-Shot method that finds a diverse set of Pareto-front architectures in O(1) time. For large models, the proposed method is 5 times more efficient than existing methods. We verify the method on two classification and super-resolution models and show above 0.93 correlation score between the predicted and actual model performance. The Pareto-front architecture selection is straightforward and takes only 20 to 40 supernet evaluations, which is the new state-of-the-art result to the best of our knowledge. + + + + MARLIN: Masked Autoencoder for Facial Video Representation LearnINg + http://openaccess.thecvf.com//content/CVPR2023/papers/Cai_MARLIN_Masked_Autoencoder_for_Facial_Video_Representation_LearnINg_CVPR_2023_paper.pdf + This paper proposes a self-supervised approach to learn universal facial representations from videos, that can transfer across a variety of facial analysis tasks such as Facial Attribute Recognition (FAR), Facial Expression Recognition (FER), DeepFake Detection (DFD), and Lip Synchronization (LS). Our proposed framework, named MARLIN, is a facial video masked autoencoder, that learns highly robust and generic facial embeddings from abundantly available non-annotated web crawled facial videos. As a challenging auxiliary task, MARLIN reconstructs the spatio-temporal details of the face from the densely masked facial regions which mainly include eyes, nose, mouth, lips, and skin to capture local and global aspects that in turn help in encoding generic and transferable features. Through a variety of experiments on diverse downstream tasks, we demonstrate MARLIN to be an excellent facial video encoder as well as feature extractor, that performs consistently well across a variety of downstream tasks including FAR (1.13% gain over supervised benchmark), FER (2.64% gain over unsupervised benchmark), DFD (1.86% gain over unsupervised benchmark), LS (29.36% gain for Frechet Inception Distance), and even in low data regime. Our code and models are available at https://github.com/ControlNet/MARLIN. + + + + Dynamic Coarse-To-Fine Learning for Oriented Tiny Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Dynamic_Coarse-To-Fine_Learning_for_Oriented_Tiny_Object_Detection_CVPR_2023_paper.pdf + Detecting arbitrarily oriented tiny objects poses intense challenges to existing detectors, especially for label assignment. Despite the exploration of adaptive label assignment in recent oriented object detectors, the extreme geometry shape and limited feature of oriented tiny objects still induce severe mismatch and imbalance issues. Specifically, the position prior, positive sample feature, and instance are mismatched, and the learning of extreme-shaped objects is biased and unbalanced due to little proper feature supervision. To tackle these issues, we propose a dynamic prior along with the coarse-to-fine assigner, dubbed DCFL. For one thing, we model the prior, label assignment, and object representation all in a dynamic manner to alleviate the mismatch issue. For another, we leverage the coarse prior matching and finer posterior constraint to dynamically assign labels, providing appropriate and relatively balanced supervision for diverse instances. Extensive experiments on six datasets show substantial improvements to the baseline. Notably, we obtain the state-of-the-art performance for one-stage detectors on the DOTA-v1.5, DOTA-v2.0, and DIOR-R datasets under single-scale training and testing. 
Codes are available at https://github.com/Chasel-Tsui/mmrotate-dcfl. + + + + Controllable Mesh Generation Through Sparse Latent Point Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Lyu_Controllable_Mesh_Generation_Through_Sparse_Latent_Point_Diffusion_Models_CVPR_2023_paper.pdf + Mesh generation is of great value in various applications involving computer graphics and virtual content, yet designing generative models for meshes is challenging due to their irregular data structure and inconsistent topology of meshes in the same category. In this work, we design a novel sparse latent point diffusion model for mesh generation. Our key insight is to regard point clouds as an intermediate representation of meshes, and model the distribution of point clouds instead. While meshes can be generated from point clouds via techniques like Shape as Points (SAP), the challenges of directly generating meshes can be effectively avoided. To boost the efficiency and controllability of our mesh generation method, we propose to further encode point clouds to a set of sparse latent points with point-wise semantic meaningful features, where two DDPMs are trained in the space of sparse latent points to respectively model the distribution of the latent point positions and features at these latent points. We find that sampling in this latent space is faster than directly sampling dense point clouds. Moreover, the sparse latent points also enable us to explicitly control both the overall structures and local details of the generated meshes. Extensive experiments are conducted on the ShapeNet dataset, where our proposed sparse latent point diffusion model achieves superior performance in terms of generation quality and controllability when compared to existing methods. + + + + Look Before You Match: Instance Understanding Matters in Video Object Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Look_Before_You_Match_Instance_Understanding_Matters_in_Video_Object_CVPR_2023_paper.pdf + Exploring dense matching between the current frame and past frames for long-range context modeling, memory-based methods have demonstrated impressive results in video object segmentation (VOS) recently. Nevertheless, due to the lack of instance understanding ability, the above approaches are oftentimes brittle to large appearance variations or viewpoint changes resulted from the movement of objects and cameras. In this paper, we argue that instance understanding matters in VOS, and integrating it with memory-based matching can enjoy the synergy, which is intuitively sensible from the definition of VOS task, i.e., identifying and segmenting object instances within the video. Towards this goal, we present a two-branch network for VOS, where the query-based instance segmentation (IS) branch delves into the instance details of the current frame and the VOS branch performs spatial-temporal matching with the memory bank. We employ the well-learned object queries from IS branch to inject instance-specific information into the query key, with which the instance-augmented matching is further performed. In addition, we introduce a multi-path fusion block to effectively combine the memory readout with multi-scale features from the instance segmentation decoder, which incorporates high-resolution instance-aware features to produce final segmentation results. 
Our method achieves state-of-the-art performance on DAVIS 2016/2017 val (92.6% and 87.1%), DAVIS 2017 test-dev (82.8%), and YouTube-VOS 2018/2019 val (86.3% and 86.3%), outperforming alternative methods by clear margins. + + + + Boundary Unlearning: Rapid Forgetting of Deep Networks via Shifting the Decision Boundary + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Boundary_Unlearning_Rapid_Forgetting_of_Deep_Networks_via_Shifting_the_CVPR_2023_paper.pdf + The practical needs of the "right to be forgotten" and poisoned data removal call for efficient machine unlearning techniques, which enable machine learning models to unlearn, or to forget, a fraction of training data and its lineage. Recent studies on machine unlearning for deep neural networks (DNNs) attempt to destroy the influence of the forgetting data by scrubbing the model parameters. However, this is prohibitively expensive due to the large dimension of the parameter space. In this paper, we refocus our attention from the parameter space to the decision space of the DNN model, and propose Boundary Unlearning, a rapid yet effective way to unlearn an entire class from a trained DNN model. The key idea is to shift the decision boundary of the original DNN model to imitate the decision behavior of the model retrained from scratch. We develop two novel boundary shift methods, namely Boundary Shrink and Boundary Expanding, both of which can rapidly achieve utility and privacy guarantees. We extensively evaluate Boundary Unlearning on the CIFAR-10 and Vggface2 datasets, and the results show that Boundary Unlearning can effectively forget the forgetting class on image classification and face recognition tasks, with an expected speed-up of 17x and 19x, respectively, compared with retraining from scratch. + + + + Orthogonal Annotation Benefits Barely-Supervised Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Cai_Orthogonal_Annotation_Benefits_Barely-Supervised_Medical_Image_Segmentation_CVPR_2023_paper.pdf + Recent trends in semi-supervised learning have significantly boosted the performance of 3D semi-supervised medical image segmentation. Compared with 2D images, 3D medical volumes involve information from different directions, e.g., transverse, sagittal, and coronal planes, and thus naturally provide complementary views. These complementary views and the intrinsic similarity among adjacent 3D slices inspire us to develop a novel annotation scheme and its corresponding semi-supervised model for effective segmentation. Specifically, we first propose orthogonal annotation, which labels only two orthogonal slices in a labeled volume and thereby significantly relieves the burden of annotation. Then, we perform registration to obtain the initial pseudo labels for sparsely labeled volumes. Subsequently, by introducing unlabeled volumes, we propose a dual-network paradigm named Dense-Sparse Co-training (DeSCO) that exploits dense pseudo labels in the early stage and sparse labels in the later stage, while forcing consistent outputs from the two networks. Experimental results on three benchmark datasets validate the effectiveness of our method in both segmentation performance and annotation efficiency. For example, with only 10 annotated slices, our method reaches a Dice score of up to 86.93% on the KiTS19 dataset.
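As a brief aside on the Dice score quoted in the Orthogonal Annotation abstract above: the Dice coefficient is a standard overlap metric for segmentation masks. The Python sketch below is a generic, minimal illustration under assumed array shapes, not code from the paper, and the variable names are hypothetical.

import numpy as np

def dice_score(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    # Dice = 2 * |A intersect B| / (|A| + |B|), computed on boolean masks.
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))

# Hypothetical usage on a 3D segmentation volume.
pred = np.random.rand(64, 64, 64) > 0.5
target = np.random.rand(64, 64, 64) > 0.5
print(f"Dice: {dice_score(pred, target):.4f}")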
+ + + + Spectral Enhanced Rectangle Transformer for Hyperspectral Image Denoising + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Spectral_Enhanced_Rectangle_Transformer_for_Hyperspectral_Image_Denoising_CVPR_2023_paper.pdf + Denoising is a crucial step for hyperspectral image (HSI) applications. Though witnessing the great power of deep learning, existing HSI denoising methods suffer from limitations in capturing the non-local self-similarity. Transformers have shown potential in capturing long-range dependencies, but few attempts have been made with specifically designed Transformer to model the spatial and spectral correlation in HSIs. In this paper, we address these issues by proposing a spectral enhanced rectangle Transformer, driving it to explore the non-local spatial similarity and global spectral low-rank property of HSIs. For the former, we exploit the rectangle self-attention horizontally and vertically to capture the non-local similarity in the spatial domain. For the latter, we design a spectral enhancement module that is capable of extracting global underlying low-rank property of spatial-spectral cubes to suppress noise, while enabling the interactions among non-overlapping spatial rectangles. Extensive experiments have been conducted on both synthetic noisy HSIs and real noisy HSIs, showing the effectiveness of our proposed method in terms of both objective metric and subjective visual quality. The code is available at https://github.com/MyuLi/SERT. + + + + UMat: Uncertainty-Aware Single Image High Resolution Material Capture + http://openaccess.thecvf.com//content/CVPR2023/papers/Rodriguez-Pardo_UMat_Uncertainty-Aware_Single_Image_High_Resolution_Material_Capture_CVPR_2023_paper.pdf + We propose a learning-based method to recover normals, specularity, and roughness from a single diffuse image of a material, using microgeometry appearance as our primary cue. Previous methods that work on single images tend to produce over-smooth outputs with artifacts, operate at limited resolution, or train one model per class with little room for generalization. In contrast, in this work, we propose a novel capture approach that leverages a generative network with attention and a U-Net discriminator, which shows outstanding performance integrating global information at reduced computational complexity. We showcase the performance of our method with a real dataset of digitized textile materials and show that a commodity flatbed scanner can produce the type of diffuse illumination required as input to our method. Additionally, because the problem might be ill-posed --more than a single diffuse image might be needed to disambiguate the specular reflection-- or because the training dataset is not representative enough of the real distribution, we propose a novel framework to quantify the model's confidence about its prediction at test time. Our method is the first one to deal with the problem of modeling uncertainty in material digitization, increasing the trustworthiness of the process and enabling more intelligent strategies for dataset creation, as we demonstrate with an active learning experiment. + + + + Similarity Maps for Self-Training Weakly-Supervised Phrase Grounding + http://openaccess.thecvf.com//content/CVPR2023/papers/Shaharabany_Similarity_Maps_for_Self-Training_Weakly-Supervised_Phrase_Grounding_CVPR_2023_paper.pdf + A phrase grounding model receives an input image and a text phrase and outputs a suitable localization map. 
We present an effective way to refine a phrase grounding model by considering self-similarity maps extracted from the latent representation of the model's image encoder. Our main insights are that these maps resemble localization maps and that by combining such maps, one can obtain useful pseudo-labels for performing self-training. Our results surpass, by a large margin, the state-of-the-art in weakly supervised phrase grounding. A similar gap in performance is obtained for a recently proposed downstream task called WWbL, in which the input image is given without any text. Our code is available as supplementary. + + + + SCOOP: Self-Supervised Correspondence and Optimization-Based Scene Flow + http://openaccess.thecvf.com//content/CVPR2023/papers/Lang_SCOOP_Self-Supervised_Correspondence_and_Optimization-Based_Scene_Flow_CVPR_2023_paper.pdf + Scene flow estimation is a long-standing problem in computer vision, where the goal is to find the 3D motion of a scene from its consecutive observations. Recently, there have been efforts to compute the scene flow from 3D point clouds. A common approach is to train a regression model that consumes source and target point clouds and outputs the per-point translation vector. An alternative is to learn point matches between the point clouds concurrently with regressing a refinement of the initial correspondence flow. In both cases, the learning task is very challenging since the flow regression is done in the free 3D space, and a typical solution is to resort to a large annotated synthetic dataset. We introduce SCOOP, a new method for scene flow estimation that can be learned on a small amount of data without employing ground-truth flow supervision. In contrast to previous work, we train a pure correspondence model focused on learning point feature representation and initialize the flow as the difference between a source point and its softly corresponding target point. Then, in the run-time phase, we directly optimize a flow refinement component with a self-supervised objective, which leads to a coherent and accurate flow field between the point clouds. Experiments on widely used datasets demonstrate the performance gains achieved by our method compared to existing leading techniques while using a fraction of the training data. Our code is publicly available. + + + + Human-Art: A Versatile Human-Centric Dataset Bridging Natural and Artificial Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Ju_Human-Art_A_Versatile_Human-Centric_Dataset_Bridging_Natural_and_Artificial_Scenes_CVPR_2023_paper.pdf + Humans have long been recorded in a variety of forms since antiquity. For example, sculptures and paintings were the primary media for depicting human beings before the invention of cameras. However, most current human-centric computer vision tasks like human pose estimation and human image generation focus exclusively on natural images in the real world. Artificial humans, such as those in sculptures, paintings, and cartoons, are commonly neglected, making existing models fail in these scenarios. As an abstraction of life, art incorporates humans in both natural and artificial scenes. We take advantage of this and introduce the Human-Art dataset to bridge related tasks in natural and artificial scenarios.
Specifically, Human-Art contains 50k high-quality images with over 123k person instances from 5 natural and 15 artificial scenarios, which are annotated with bounding boxes, keypoints, self-contact points, and text information for humans represented in both 2D and 3D. It is, therefore, comprehensive and versatile for various downstream tasks. We also provide a rich set of baseline results and detailed analyses for related tasks, including human detection, 2D and 3D human pose estimation, image generation, and motion transfer. As a challenging dataset, we hope Human-Art can provide insights for relevant research and open up new research questions. + + + + Turning a CLIP Model Into a Scene Text Detector + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Turning_a_CLIP_Model_Into_a_Scene_Text_Detector_CVPR_2023_paper.pdf + The recent large-scale Contrastive Language-Image Pretraining (CLIP) model has shown great potential in various downstream tasks via leveraging the pretrained vision and language knowledge. Scene text, which contains rich textual and visual information, has an inherent connection with a model like CLIP. Recently, pretraining approaches based on vision language models have made effective progresses in the field of text detection. In contrast to these works, this paper proposes a new method, termed TCM, focusing on Turning the CLIP Model directly for text detection without pretraining process. We demonstrate the advantages of the proposed TCM as follows: (1) The underlying principle of our framework can be applied to improve existing scene text detector. (2) It facilitates the few-shot training capability of existing methods, e.g., by using 10% of labeled data, we significantly improve the performance of the baseline method with an average of 22% in terms of the F-measure on 4 benchmarks. (3) By turning the CLIP model into existing scene text detection methods, we further achieve promising domain adaptation ability. The code will be publicly released at https://github.com/wenwenyu/TCM. + + + + RODIN: A Generative Model for Sculpting 3D Digital Avatars Using Diffusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_RODIN_A_Generative_Model_for_Sculpting_3D_Digital_Avatars_Using_CVPR_2023_paper.pdf + This paper presents a 3D diffusion model that automatically generates 3D digital avatars represented as neural radiance fields (NeRFs). A significant challenge for 3D diffusion is that the memory and processing costs are prohibitive for producing high-quality results with rich details. To tackle this problem, we propose the roll-out diffusion network (RODIN), which takes a 3D NeRF model represented as multiple 2D feature maps and rolls out them onto a single 2D feature plane within which we perform 3D-aware diffusion. The RODIN model brings much-needed computational efficiency while preserving the integrity of 3D diffusion by using 3D-aware convolution that attends to projected features in the 2D plane according to their original relationships in 3D. We also use latent conditioning to orchestrate the feature generation with global coherence, leading to high-fidelity avatars and enabling semantic editing based on text prompts. Finally, we use hierarchical synthesis to further enhance details. The 3D avatars generated by our model compare favorably with those produced by existing techniques. We can generate highly detailed avatars with realistic hairstyles and facial hair. 
We also demonstrate 3D avatar generation from image or text, as well as text-guided editability. + + + + On the Pitfall of Mixup for Uncertainty Calibration + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_On_the_Pitfall_of_Mixup_for_Uncertainty_Calibration_CVPR_2023_paper.pdf + By simply taking convex combinations between pairs of samples and their labels, mixup training has been shown to easily improve predictive accuracy. It has recently been found that models trained with mixup also perform well on uncertainty calibration. However, in this study, we find that mixup training usually makes models less calibratable than vanilla empirical risk minimization, which means that it harms uncertainty estimation when post-hoc calibration is considered. By decomposing the mixup process into data transformation and random perturbation, we suggest that the confidence-penalty nature of the data transformation is the reason for the calibration degradation. To mitigate this problem, we first investigate the mixup inference strategy and find that, although it improves calibration for mixup, this ensemble-like strategy does not necessarily outperform a simple ensemble. Then, we propose a general strategy named mixup inference in training, which adopts a simple decoupling principle for recovering the outputs of raw samples at the end of the forward network pass. By embedding the mixup inference, models can be learned from the original one-hot labels and hence avoid the negative impact of the confidence penalty. Our experiments show that this strategy properly solves mixup's calibration issue without sacrificing predictive performance, and even improves accuracy over vanilla mixup. + + + + Feature Shrinkage Pyramid for Camouflaged Object Detection With Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Feature_Shrinkage_Pyramid_for_Camouflaged_Object_Detection_With_Transformers_CVPR_2023_paper.pdf + Vision transformers have recently shown strong global context modeling capabilities in camouflaged object detection. However, they suffer from two major limitations: less effective locality modeling and insufficient feature aggregation in decoders, which are not conducive to camouflaged object detection that explores subtle cues from indistinguishable backgrounds. To address these issues, in this paper, we propose a novel transformer-based Feature Shrinkage Pyramid Network (FSPNet), which aims to hierarchically decode locality-enhanced neighboring transformer features through progressive shrinking for camouflaged object detection. Specifically, we propose a non-local token enhancement module (NL-TEM) that employs the non-local mechanism to enable interactions among neighboring tokens and to explore graph-based high-order relations within tokens to enhance local representations of transformers. Moreover, we design a feature shrinkage decoder (FSD) with adjacent interaction modules (AIM), which progressively aggregates adjacent transformer features through a layer-by-layer shrinkage pyramid to accumulate imperceptible but effective cues as much as possible for object information decoding. Extensive quantitative and qualitative experiments demonstrate that the proposed model significantly outperforms 24 existing competitors on three challenging COD benchmark datasets under six widely used evaluation metrics. Our code is publicly available at https://github.com/ZhouHuang23/FSPNet.
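As a brief aside on the mixup operation described in the "On the Pitfall of Mixup for Uncertainty Calibration" abstract above: mixup forms convex combinations of pairs of inputs and their one-hot labels. The Python sketch below is a generic, minimal illustration of that data transformation, not the paper's implementation; the batch shapes, the Beta(0.2, 0.2) mixing distribution, and the function name are assumptions made for this example.

import numpy as np

def mixup_batch(x, y_onehot, alpha=0.2, rng=None):
    # Mix each sample with a randomly chosen partner: x' = lam*x_i + (1-lam)*x_j.
    rng = np.random.default_rng() if rng is None else rng
    lam = rng.beta(alpha, alpha)
    perm = rng.permutation(len(x))
    x_mix = lam * x + (1.0 - lam) * x[perm]
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]
    return x_mix, y_mix, lam

# Hypothetical usage: a batch of 8 RGB images with 10-class one-hot labels.
x = np.random.rand(8, 3, 32, 32).astype(np.float32)
y = np.eye(10, dtype=np.float32)[np.random.randint(0, 10, size=8)]
x_mix, y_mix, lam = mixup_batch(x, y)
print(x_mix.shape, y_mix.shape, round(lam, 3))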
+ + + + Matching Is Not Enough: A Two-Stage Framework for Category-Agnostic Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shi_Matching_Is_Not_Enough_A_Two-Stage_Framework_for_Category-Agnostic_Pose_CVPR_2023_paper.pdf + Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary categories given support images with keypoint annotations. Existing approaches match the keypoints across the image for localization. However, such a one-stage matching paradigm shows inferior accuracy: the prediction heavily relies on the matching results, which can be noisy due to the open set nature in CAPE. For example, two mirror-symmetric keypoints (e.g., left and right eyes) in the query image can both trigger high similarity on certain support keypoints (eyes), which leads to duplicated or opposite predictions. To calibrate the inaccurate matching results, we introduce a two-stage framework, where matched keypoints from the first stage are viewed as similarity-aware position proposals. Then, the model learns to fetch relevant features to correct the initial proposals in the second stage. We instantiate the framework with a transformer model tailored for CAPE. The transformer encoder incorporates specific designs to improve the representation and similarity modeling in the first matching stage. In the second stage, similarity-aware proposals are packed as queries in the decoder for refinement via cross-attention. Our method surpasses the previous best approach by large margins on CAPE benchmark MP-100 on both accuracy and efficiency. Code available at https://github.com/flyinglynx/CapeFormer + + + + High-Fidelity Guided Image Synthesis With Latent Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Singh_High-Fidelity_Guided_Image_Synthesis_With_Latent_Diffusion_Models_CVPR_2023_paper.pdf + Controllable image synthesis with user scribbles has gained huge public interest with the recent advent of text-conditioned latent diffusion models. The user scribbles control the color composition while the text prompt provides control over the overall image semantics. However, we find that prior works suffer from an intrinsic domain shift problem wherein the generated outputs often lack details and resemble simplistic representations of the target domain. In this paper, we propose a novel guided image synthesis framework, which addresses this problem by modeling the output image as the solution of a constrained optimization problem. We show that while computing an exact solution to the optimization is infeasible, an approximation of the same can be achieved while just requiring a single pass of the reverse diffusion process. Additionally, we show that by simply defining a cross-attention based correspondence between the input text tokens and the user stroke-painting, the user is also able to control the semantics of different painted regions without requiring any conditional training or finetuning. Human user study results show that the proposed approach outperforms the previous state-of-the-art by over 85.32% on the overall user satisfaction scores. Project page for our paper is available at https://1jsingh.github.io/gradop. 
+ + + + Semi-Supervised Parametric Real-World Image Harmonization + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Semi-Supervised_Parametric_Real-World_Image_Harmonization_CVPR_2023_paper.pdf + Learning-based image harmonization techniques are usually trained to undo synthetic global transformations applied to a masked foreground in a single ground-truth photo. This simulated data does not model many important appearance mismatches (illumination, object boundaries, etc.) between foreground and background in real composites, leading to models that do not generalize well and cannot model complex local changes. We propose a new semi-supervised training strategy that addresses this problem and lets us learn complex local appearance harmonization from unpaired real composites, where foreground and background come from different images. Our model is fully parametric. It uses RGB curves to correct the global colors and tone and a shading map to model local variations. Our approach outperforms previous work on established benchmarks and real composites, as shown in a user study, and processes high-resolution images interactively. The code and project page are available at https://kewang0622.github.io/sprih/. + + + + Learning Visibility Field for Detailed 3D Human Reconstruction and Relighting + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_Learning_Visibility_Field_for_Detailed_3D_Human_Reconstruction_and_Relighting_CVPR_2023_paper.pdf + Detailed 3D reconstruction and photo-realistic relighting of digital humans are essential for various applications. To this end, we propose a novel sparse-view 3D human reconstruction framework that closely incorporates the occupancy field and albedo field with an additional visibility field -- it not only resolves occlusion ambiguity in multi-view feature aggregation, but can also be used to evaluate light attenuation for self-shadowed relighting. To enhance its training viability and efficiency, we discretize visibility onto a fixed set of sample directions and supply it with coupled geometric 3D depth features and local 2D image features. We further propose a novel rendering-inspired loss, namely TransferLoss, to implicitly enforce the alignment between the visibility and occupancy fields, enabling end-to-end joint training. Results and extensive experiments demonstrate the effectiveness of the proposed method, as it surpasses the state of the art in terms of reconstruction accuracy while achieving relighting accuracy comparable to ray-traced ground truth. + + + + Improving Robustness of Vision Transformers by Reducing Sensitivity To Patch Corruptions + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Improving_Robustness_of_Vision_Transformers_by_Reducing_Sensitivity_To_Patch_CVPR_2023_paper.pdf + Despite their success, vision transformers remain vulnerable to image corruptions, such as noise or blur. Indeed, we find that this vulnerability mainly stems from the unstable self-attention mechanism, which is inherently built upon patch-based inputs and often becomes overly sensitive to corruptions across patches. For example, when we occlude only a small number of patches with random noise (e.g., 10%), these patch corruptions lead to severe accuracy drops and greatly distract intermediate attention layers. To address this, we propose a new training method that improves the robustness of transformers from a new perspective -- reducing sensitivity to patch corruptions (RSPC).
Specifically, we first identify and occlude/corrupt the most vulnerable patches and then explicitly reduce sensitivity to them by aligning the intermediate features between clean and corrupted examples. We highlight that the construction of patch corruptions is learned adversarially to the following feature alignment process, which is particularly effective and essentially different from existing methods. In experiments, our RSPC greatly improves the stability of attention layers and consistently yields better robustness on various benchmarks, including CIFAR-10/100-C, ImageNet-A, ImageNet-C, and ImageNet-P. + + + + VecFontSDF: Learning To Reconstruct and Synthesize High-Quality Vector Fonts via Signed Distance Functions + http://openaccess.thecvf.com//content/CVPR2023/papers/Xia_VecFontSDF_Learning_To_Reconstruct_and_Synthesize_High-Quality_Vector_Fonts_via_CVPR_2023_paper.pdf + Font design is of vital importance in the digital content design and modern printing industry. Developing algorithms capable of automatically synthesizing vector fonts can significantly facilitate the font design process. However, existing methods mainly concentrate on raster image generation, and only a few approaches can directly synthesize vector fonts. This paper proposes an end-to-end trainable method, VecFontSDF, to reconstruct and synthesize high-quality vector fonts using signed distance functions (SDFs). Specifically, based on the proposed SDF-based implicit shape representation, VecFontSDF learns to model each glyph as shape primitives enclosed by several parabolic curves, which can be precisely converted to quadratic Bezier curves that are widely used in vector font products. In this manner, most image generation methods can be easily extended to synthesize vector fonts. Qualitative and quantitative experiments conducted on a publicly-available dataset demonstrate that our method obtains high-quality results on several tasks, including vector font reconstruction, interpolation, and few-shot vector font synthesis, markedly outperforming the state of the art. + + + + MSF: Motion-Guided Sequential Fusion for Efficient 3D Object Detection From Point Cloud Sequences + http://openaccess.thecvf.com//content/CVPR2023/papers/He_MSF_Motion-Guided_Sequential_Fusion_for_Efficient_3D_Object_Detection_From_CVPR_2023_paper.pdf + Point cloud sequences are commonly used to accurately detect 3D objects in applications such as autonomous driving. Current top-performing multi-frame detectors mostly follow a Detect-and-Fuse framework, which extracts features from each frame of the sequence and fuses them to detect the objects in the current frame. However, this inevitably leads to redundant computation since adjacent frames are highly correlated. In this paper, we propose an efficient Motion-guided Sequential Fusion (MSF) method, which exploits the continuity of object motion to mine useful sequential contexts for object detection in the current frame. We first generate 3D proposals on the current frame and propagate them to preceding frames based on the estimated velocities. The points-of-interest are then pooled from the sequence and encoded as proposal features. A novel Bidirectional Feature Aggregation (BiFA) module is further proposed to facilitate the interactions of proposal features across frames. Besides, we optimize the point cloud pooling by a voxel-based sampling technique so that millions of points can be processed in several milliseconds. 
The proposed MSF method achieves not only better efficiency than other multi-frame detectors but also leading accuracy, with 83.12% and 78.30% mAP on the LEVEL1 and LEVEL2 test sets of Waymo Open Dataset, respectively. Codes can be found at https://github.com/skyhehe123/MSF. + + + + HypLiLoc: Towards Effective LiDAR Pose Regression With Hyperbolic Fusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_HypLiLoc_Towards_Effective_LiDAR_Pose_Regression_With_Hyperbolic_Fusion_CVPR_2023_paper.pdf + LiDAR relocalization plays a crucial role in many fields, including robotics, autonomous driving, and computer vision. LiDAR-based retrieval from a database typically incurs high computation storage costs and can lead to globally inaccurate pose estimations if the database is too sparse. On the other hand, pose regression methods take images or point clouds as inputs and directly regress global poses in an end-to-end manner. They do not perform database matching and are more computationally efficient than retrieval techniques. We propose HypLiLoc, a new model for LiDAR pose regression. We use two branched backbones to extract 3D features and 2D projection features, respectively. We consider multi-modal feature fusion in both Euclidean and hyperbolic spaces to obtain more effective feature representations. Experimental results indicate that HypLiLoc achieves state-of-the-art performance in both outdoor and indoor datasets. We also conduct extensive ablation studies on the framework design, which demonstrate the effectiveness of multi-modal feature extraction and multi-space embedding. Our code is released at: https://github.com/sijieaaa/HypLiLoc + + + + Robust Model-Based Face Reconstruction Through Weakly-Supervised Outlier Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Robust_Model-Based_Face_Reconstruction_Through_Weakly-Supervised_Outlier_Segmentation_CVPR_2023_paper.pdf + In this work, we aim to enhance model-based face reconstruction by avoiding fitting the model to outliers, i.e. regions that cannot be well-expressed by the model such as occluders or make-up. The core challenge for localizing outliers is that they are highly variable and difficult to annotate. To overcome this challenging problem, we introduce a joint Face-autoencoder and outlier segmentation approach (FOCUS).In particular, we exploit the fact that the outliers cannot be fitted well by the face model and hence can be localized well given a high-quality model fitting. The main challenge is that the model fitting and the outlier segmentation are mutually dependent on each other, and need to be inferred jointly. We resolve this chicken-and-egg problem with an EM-type training strategy, where a face autoencoder is trained jointly with an outlier segmentation network. This leads to a synergistic effect, in which the segmentation network prevents the face encoder from fitting to the outliers, enhancing the reconstruction quality. The improved 3D face reconstruction, in turn, enables the segmentation network to better predict the outliers. To resolve the ambiguity between outliers and regions that are difficult to fit, such as eyebrows, we build a statistical prior from synthetic data that measures the systematic bias in model fitting. Experiments on the NoW testset demonstrate that FOCUS achieves SOTA 3D face reconstruction performance among all baselines that are trained without 3D annotation. 
Moreover, our results on CelebA-HQ and the AR database show that the segmentation network can localize occluders accurately despite being trained without any segmentation annotation. + + + + Not All Image Regions Matter: Masked Vector Quantization for Autoregressive Image Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Not_All_Image_Regions_Matter_Masked_Vector_Quantization_for_Autoregressive_CVPR_2023_paper.pdf + Existing autoregressive models follow the two-stage generation paradigm that first learns a codebook in the latent space for image reconstruction and then completes the image generation autoregressively based on the learned codebook. However, existing codebook learning simply models all local region information of images without distinguishing their different perceptual importance, which brings redundancy in the learned codebook that not only limits the next stage's autoregressive model's ability to model important structure but also results in high training cost and slow generation speed. In this study, we borrow the idea of importance perception from classical image coding theory and propose a novel two-stage framework, which consists of Masked Quantization VAE (MQ-VAE) and Stackformer, to relieve the model from modeling redundancy. Specifically, MQ-VAE incorporates an adaptive mask module for masking redundant region features before quantization and an adaptive de-mask module for recovering the original grid image feature map to faithfully reconstruct the original images after quantization. Then, Stackformer learns to predict the combination of the next code and its position in the feature map. Comprehensive experiments on various image generation validate our effectiveness and efficiency. + + + + Masked Video Distillation: Rethinking Masked Feature Modeling for Self-Supervised Video Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Masked_Video_Distillation_Rethinking_Masked_Feature_Modeling_for_Self-Supervised_Video_CVPR_2023_paper.pdf + Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. However, existing methods focus on learning representations from scratch through reconstructing low-level features like raw pixel values. In this paper, we propose masked video distillation (MVD), a simple yet effective two-stage masked feature modeling framework for video representation learning: firstly we pretrain an image (or video) model by recovering low-level features of masked patches, then we use the resulting features as targets for masked feature modeling. For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks. Visualization analysis also indicates different teachers produce different learned patterns for students. To leverage the advantage of different teachers, we design a spatial-temporal co-teaching method for MVD. Specifically, we distill student models from both video teachers and image teachers by masked feature modeling. Extensive experimental results demonstrate that video transformers pretrained with spatial-temporal co-teaching outperform models distilled with a single teacher on a multitude of video datasets. Our MVD with vanilla ViT achieves state-of-the-art performance compared with previous methods on several challenging video downstream tasks. 
For example, with the ViT-Large model, our MVD achieves 86.4% and 76.7% Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming VideoMAE by 1.2% and 2.4% respectively. When a larger ViT-Huge model is adopted, MVD achieves the state-of-the-art performance with 77.3% Top-1 accuracy on Something-Something-v2. Code will be available at https://github.com/ruiwang2021/mvd. + + + + Transformer-Based Unified Recognition of Two Hands Manipulating Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Cho_Transformer-Based_Unified_Recognition_of_Two_Hands_Manipulating_Objects_CVPR_2023_paper.pdf + Understanding the hand-object interactions from an egocentric video has received a great attention recently. So far, most approaches are based on the convolutional neural network (CNN) features combined with the temporal encoding via the long short-term memory (LSTM) or graph convolution network (GCN) to provide the unified understanding of two hands, an object and their interactions. In this paper, we propose the Transformer-based unified framework that provides better understanding of two hands manipulating objects. In our framework, we insert the whole image depicting two hands, an object and their interactions as input and jointly estimate 3 information from each frame: poses of two hands, pose of an object and object types. Afterwards, the action class defined by the hand-object interactions is predicted from the entire video based on the estimated information combined with the contact map that encodes the interaction between two hands and an object. Experiments are conducted on H2O and FPHA benchmark datasets and we demonstrated the superiority of our method achieving the state-of-the-art accuracy. Ablative studies further demonstrate the effectiveness of each proposed module. + + + + RangeViT: Towards Vision Transformers for 3D Semantic Segmentation in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Ando_RangeViT_Towards_Vision_Transformers_for_3D_Semantic_Segmentation_in_Autonomous_CVPR_2023_paper.pdf + Casting semantic segmentation of outdoor LiDAR point clouds as a 2D problem, e.g., via range projection, is an effective and popular approach. These projection-based methods usually benefit from fast computations and, when combined with techniques which use other point cloud representations, achieve state-of-the-art results. Today, projection-based methods leverage 2D CNNs but recent advances in computer vision show that vision transformers (ViTs) have achieved state-of-the-art results in many image-based benchmarks. In this work, we question if projection-based methods for 3D semantic segmentation can benefit from these latest improvements on ViTs. We answer positively but only after combining them with three key ingredients: (a) ViTs are notoriously hard to train and require a lot of training data to learn powerful representations. By preserving the same backbone architecture as for RGB images, we can exploit the knowledge from long training on large image collections that are much cheaper to acquire and annotate than point clouds. We reach our best results with pre-trained ViTs on large image datasets. (b) We compensate ViTs' lack of inductive bias by substituting a tailored convolutional stem for the classical linear embedding layer. 
(c) We refine pixel-wise predictions with a convolutional decoder and a skip connection from the convolutional stem to combine low-level but fine-grained features of the convolutional stem with the high-level but coarse predictions of the ViT encoder. With these ingredients, we show that our method, called RangeViT, outperforms existing projection-based methods on nuScenes and SemanticKITTI. The code is available at https://github.com/valeoai/rangevit. + + + + ProTeGe: Untrimmed Pretraining for Video Temporal Grounding by Video Temporal Grounding + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_ProTeGe_Untrimmed_Pretraining_for_Video_Temporal_Grounding_by_Video_Temporal_CVPR_2023_paper.pdf + Video temporal grounding (VTG) is the task of localizing a given natural language text query in an arbitrarily long untrimmed video. While the task involves untrimmed videos, all existing VTG methods leverage features from video backbones pretrained on trimmed videos. This is largely due to the lack of a large-scale, well-annotated VTG dataset to perform pretraining. As a result, the pretrained features lack a notion of temporal boundaries, leading to video-text alignment that is less distinguishable between correct and incorrect locations. We present ProTeGe as the first method to perform VTG-based untrimmed pretraining to bridge the gap between trimmed pretrained backbones and downstream VTG tasks. ProTeGe reconfigures the HowTo100M dataset, with noisily correlated video-text pairs, into a VTG dataset and introduces a novel Video-Text Similarity-based Grounding Module and a pretraining objective to make pretraining robust to noise in HowTo100M. Extensive experiments on multiple datasets across downstream tasks with all variations of supervision validate that pretrained features from ProTeGe can significantly outperform features from trimmed pretrained backbones on VTG. + + + + Exploring Incompatible Knowledge Transfer in Few-Shot Image Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Exploring_Incompatible_Knowledge_Transfer_in_Few-Shot_Image_Generation_CVPR_2023_paper.pdf + Few-shot image generation (FSIG) learns to generate diverse and high-fidelity images from a target domain using a few (e.g., 10) reference samples. Existing FSIG methods select, preserve and transfer prior knowledge from a source generator (pretrained on a related domain) to learn the target generator. In this work, we investigate an underexplored issue in FSIG, dubbed incompatible knowledge transfer, which significantly degrades the realism of synthetic samples. Empirical observations show that the issue stems from the least significant filters of the source generator. To this end, we propose knowledge truncation to mitigate this issue in FSIG, which is a complementary operation to knowledge preservation and is implemented by a lightweight pruning-based method. Extensive experiments show that knowledge truncation is simple and effective, consistently achieving state-of-the-art performance, including in challenging setups where the source and target domains are more distant. Project Page: https://yunqing-me.github.io/RICK.
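As a brief aside on point (b) of the RangeViT abstract above, which substitutes a convolutional stem for a ViT's linear patch-embedding layer: the Python (PyTorch) sketch below is a generic, hypothetical illustration of such a stem, not the RangeViT code; the layer widths, the 384-dimensional embedding, and the implied 16x16 patch size are assumptions made for this example.

import torch
import torch.nn as nn

class ConvStem(nn.Module):
    # Four stride-2 conv blocks reduce the input by 16x in each spatial dimension,
    # producing one token per 16x16 patch, as a linear patch embedding would.
    def __init__(self, in_chans=3, embed_dim=384):
        super().__init__()
        widths = [64, 128, 256, embed_dim]
        layers, c = [], in_chans
        for w in widths:
            layers += [nn.Conv2d(c, w, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(w), nn.GELU()]
            c = w
        self.stem = nn.Sequential(*layers)

    def forward(self, x):
        feat = self.stem(x)                       # (B, embed_dim, H/16, W/16)
        tokens = feat.flatten(2).transpose(1, 2)  # (B, N, embed_dim) tokens for the ViT encoder
        return tokens, feat                       # feat could feed a decoder skip connection, as in (c)

tokens, feat = ConvStem()(torch.randn(1, 3, 224, 224))
print(tokens.shape, feat.shape)  # (1, 196, 384) and (1, 384, 14, 14)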
+ + + + All-in-One Image Restoration for Unknown Degradations Using Adaptive Discriminative Filters for Specific Degradations + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_All-in-One_Image_Restoration_for_Unknown_Degradations_Using_Adaptive_Discriminative_Filters_CVPR_2023_paper.pdf + Image restorations for single degradations have been widely studied, demonstrating excellent performance for each degradation, but can not reflect unpredictable realistic environments with unknown multiple degradations, which may change over time. To mitigate this issue, image restorations for known and unknown multiple degradations have recently been investigated, showing promising results, but require large networks or have sub-optimal architectures for potential interference among different degradations. Here, inspired by the filter attribution integrated gradients (FAIG), we propose an adaptive discriminative filter-based model for specific degradations (ADMS) to restore images with unknown degradations. Our method allows the network to contain degradation-dedicated filters only for about 3% of all network parameters per each degradation and to apply them adaptively via degradation classification (DC) to explicitly disentangle the network for multiple degradations. Our proposed method has demonstrated its effectiveness in comparison studies and achieved state-of-the-art performance in all-in-one image restoration benchmark datasets of both Rain-Noise-Blur and Rain-Snow-Haze. + + + + Efficient RGB-T Tracking via Cross-Modality Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Efficient_RGB-T_Tracking_via_Cross-Modality_Distillation_CVPR_2023_paper.pdf + Most current RGB-T trackers adopt a two-stream structure to extract unimodal RGB and thermal features and complex fusion strategies to achieve multi-modal feature fusion, which require a huge number of parameters, thus hindering their real-life applications. On the other hand, a compact RGB-T tracker may be computationally efficient but encounter non-negligible performance degradation, due to the weakening of feature representation ability. To remedy this situation, a cross-modality distillation framework is presented to bridge the performance gap between a compact tracker and a powerful tracker. Specifically, a specific-common feature distillation module is proposed to transform the modality-common information as well as the modality-specific information from a deeper two-stream network to a shallower single-stream network. In addition, a multi-path selection distillation module is proposed to instruct a simple fusion module to learn more accurate multi-modal information from a well-designed fusion mechanism by using multiple paths. We validate the effectiveness of our method with extensive experiments on three RGB-T benchmarks, which achieves state-of-the-art performance but consumes much less computational resources. + + + + Passive Micron-Scale Time-of-Flight With Sunlight Interferometry + http://openaccess.thecvf.com//content/CVPR2023/papers/Kotwal_Passive_Micron-Scale_Time-of-Flight_With_Sunlight_Interferometry_CVPR_2023_paper.pdf + We introduce an interferometric technique for passive time-of-flight imaging and depth sensing at micrometer axial resolutions. Our technique uses a full-field Michelson interferometer, modified to use sunlight as the only light source. 
The large spectral bandwidth of sunlight makes it possible to acquire micrometer-resolution time-resolved scene responses through a simple axial scanning operation. Additionally, the angular bandwidth of sunlight makes it possible to capture time-of-flight measurements insensitive to indirect illumination effects, such as interreflections and subsurface scattering. We build an experimental prototype that we operate outdoors, under direct sunlight, and in adverse environmental conditions such as machine vibrations and vehicle traffic. We use this prototype to demonstrate, for the first time, passive imaging capabilities such as micrometer-scale depth sensing robust to indirect illumination, direct-only imaging, and imaging through diffusers. + + + + Behavioral Analysis of Vision-and-Language Navigation Agents + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Behavioral_Analysis_of_Vision-and-Language_Navigation_Agents_CVPR_2023_paper.pdf + To be successful, Vision-and-Language Navigation (VLN) agents must be able to ground instructions to actions based on their surroundings. In this work, we develop a methodology to study agent behavior on a skill-specific basis -- examining how well existing agents ground instructions about stopping, turning, and moving towards specified objects or rooms. Our approach is based on generating skill-specific interventions and measuring changes in agent predictions. We present a detailed case study analyzing the behavior of a recent agent and then compare multiple agents in terms of skill-specific competency scores. This analysis suggests that biases from training have lasting effects on agent behavior and that existing models are able to ground simple referring expressions. Our comparisons between models show that skill-specific scores correlate with improvements in overall VLN task performance. + + + + Unsupervised Volumetric Animation + http://openaccess.thecvf.com//content/CVPR2023/papers/Siarohin_Unsupervised_Volumetric_Animation_CVPR_2023_paper.pdf + We propose a novel approach for unsupervised 3D animation of non-rigid deformable objects. Our method learns the 3D structure and dynamics of objects solely from single-view RGB videos, and can decompose them into semantically meaningful parts that can be tracked and animated. Using a 3D autodecoder framework, paired with a keypoint estimator via a differentiable PnP algorithm, our model learns the underlying object geometry and parts decomposition in an entirely unsupervised manner. This allows it to perform 3D segmentation, 3D keypoint estimation, novel view synthesis, and animation. We primarily evaluate the framework on two video datasets: VoxCeleb 256^2 and TEDXPeople 256^2. In addition, on the Cats 256^2 dataset, we show that it learns compelling 3D geometry even from raw image data. Finally, we show that our model can obtain animatable 3D objects from a single image or a few images. + + + + Unite and Conquer: Plug & Play Multi-Modal Synthesis Using Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Nair_Unite_and_Conquer_Plug__Play_Multi-Modal_Synthesis_Using_Diffusion_CVPR_2023_paper.pdf + Generating photos satisfying multiple constraints finds broad utility in the content creation industry. A key hurdle to accomplishing this task is the need for paired data consisting of all modalities (i.e., constraints) and their corresponding output. Moreover, existing methods need retraining using paired data across all modalities to introduce a new condition.
This paper proposes a solution to this problem based on denoising diffusion probabilistic models (DDPMs). Our motivation for choosing diffusion models over other generative models comes from the flexible internal structure of diffusion models. Since each sampling step in the DDPM follows a Gaussian distribution, we show that there exists a closed-form solution for generating an image given various constraints. Our method can utilize a single diffusion model trained on multiple sub-tasks and improve the combined task through our proposed sampling strategy. We also introduce a novel reliability parameter that allows using different off-the-shelf diffusion models trained across various datasets during sampling time alone to guide it to the desired outcome satisfying multiple constraints. We perform experiments on various standard multimodal tasks to demonstrate the effectiveness of our approach. More details can be found at: https://nithin-gk.github.io/projectpages/Multidiff + + + + ZBS: Zero-Shot Background Subtraction via Instance-Level Background Modeling and Foreground Selection + http://openaccess.thecvf.com//content/CVPR2023/papers/An_ZBS_Zero-Shot_Background_Subtraction_via_Instance-Level_Background_Modeling_and_Foreground_CVPR_2023_paper.pdf + Background subtraction (BGS) aims to extract all moving objects in the video frames to obtain binary foreground segmentation masks. Deep learning has been widely used in this field. Compared with supervised-based BGS methods, unsupervised methods have better generalization. However, previous unsupervised deep learning BGS algorithms perform poorly in sophisticated scenarios such as shadows or night lights, and they cannot detect objects outside the pre-defined categories. In this work, we propose an unsupervised BGS algorithm based on zero-shot object detection called Zero-shot Background Subtraction ZBS. The proposed method fully utilizes the advantages of zero-shot object detection to build the open-vocabulary instance-level background model. Based on it, the foreground can be effectively extracted by comparing the detection results of new frames with the background model. ZBS performs well for sophisticated scenarios, and it has rich and extensible categories. Furthermore, our method can easily generalize to other tasks, such as abandoned object detection in unseen environments. We experimentally show that ZBS surpasses state-of-the-art unsupervised BGS methods by 4.70% F-Measure on the CDnet 2014 dataset. The code is released at https://github.com/CASIA-IVA-Lab/ZBS. + + + + MobileBrick: Building LEGO for 3D Reconstruction on Mobile Devices + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_MobileBrick_Building_LEGO_for_3D_Reconstruction_on_Mobile_Devices_CVPR_2023_paper.pdf + High-quality 3D ground-truth shapes are critical for 3D object reconstruction evaluation. However, it is difficult to create a replica of an object in reality, and even 3D reconstructions generated by 3D scanners have artefacts that cause biases in evaluation. To address this issue, we introduce a novel multi-view RGBD dataset captured using a mobile device, which includes highly precise 3D ground-truth annotations for 153 object models featuring a diverse set of 3D structures. We obtain precise 3D ground-truth shape without relying on high-end 3D scanners by utilising LEGO models with known geometry as the 3D structures for image capture. 
The distinct data modality offered by high-resolution RGB images and low-resolution depth maps captured on a mobile device, when combined with precise 3D geometry annotations, presents a unique opportunity for future research on high-fidelity 3D reconstruction. Furthermore, we evaluate a range of 3D reconstruction algorithms on the proposed dataset. + + + + GKEAL: Gaussian Kernel Embedded Analytic Learning for Few-Shot Class Incremental Task + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhuang_GKEAL_Gaussian_Kernel_Embedded_Analytic_Learning_for_Few-Shot_Class_Incremental_CVPR_2023_paper.pdf + Few-shot class incremental learning (FSCIL) aims to address catastrophic forgetting during class incremental learning in a few-shot learning setting. In this paper, we approach FSCIL by adopting analytic learning, a technique that converts network training into linear problems. This is inspired by the fact that the recursive implementation (batch-by-batch learning) of analytic learning gives weights identical to those produced by training on the entire dataset at once. The recursive implementation and the weight-identical property highly resemble the FSCIL setting (phase-by-phase learning) and its goal of avoiding catastrophic forgetting. By bridging FSCIL with analytic learning, we propose Gaussian kernel embedded analytic learning (GKEAL) for FSCIL. The key components of GKEAL include the kernel analytic module, which allows GKEAL to conduct FSCIL in a recursive manner, and the augmented feature concatenation module, which balances the preference between old and new tasks and is especially effective under the few-shot setting. Our experiments show that GKEAL gives state-of-the-art performance on several benchmark datasets. + + + + Active Exploration of Multimodal Complementarity for Few-Shot Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Wanyan_Active_Exploration_of_Multimodal_Complementarity_for_Few-Shot_Action_Recognition_CVPR_2023_paper.pdf + Recently, few-shot action recognition has received increasing attention and achieved remarkable progress. However, previous methods mainly rely on limited unimodal data (e.g., RGB frames) while the multimodal information remains relatively underexplored. In this paper, we propose a novel Active Multimodal Few-shot Action Recognition (AMFAR) framework, which can actively find the reliable modality for each sample based on task-dependent context information to improve the few-shot reasoning procedure. In meta-training, we design an Active Sample Selection (ASS) module to organize query samples with large differences in the reliability of modalities into different groups based on modality-specific posterior distributions. In addition, we design an Active Mutual Distillation (AMD) module to capture discriminative task-specific knowledge from the reliable modality and improve the representation learning of the unreliable modality through bidirectional knowledge distillation. In meta-testing, we adopt an Adaptive Multimodal Inference (AMI) module to adaptively fuse the modality-specific posterior distributions with a larger weight on the reliable modality. Extensive experimental results on four public benchmarks demonstrate that our model achieves significant improvements over existing unimodal and multimodal methods.
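As a brief aside on the premise of the GKEAL abstract above, that analytic learning turns training into a linear problem whose batch-by-batch (recursive) solution matches training on all data at once: the Python sketch below illustrates this weight-identical property for a simple ridge-regression head on frozen features. It is a generic illustration under assumed shapes and a made-up regularizer, not the paper's algorithm.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 32))   # feature embeddings, e.g. from a frozen backbone
Y = rng.standard_normal((200, 10))   # training targets (one-hot labels in practice)
lam = 1e-2                           # ridge regularizer (assumed value)

# Closed-form ridge solution computed on the entire dataset at once.
W_full = np.linalg.solve(X.T @ X + lam * np.eye(32), X.T @ Y)

# The same weights obtained by accumulating sufficient statistics batch by batch.
A = lam * np.eye(32)
b = np.zeros((32, 10))
for Xb, Yb in zip(np.array_split(X, 5), np.array_split(Y, 5)):
    A += Xb.T @ Xb
    b += Xb.T @ Yb
W_recursive = np.linalg.solve(A, b)

print(np.allclose(W_full, W_recursive))  # True: batch-by-batch accumulation matches full-data training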
+ + + + Magic3D: High-Resolution Text-to-3D Content Creation + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Magic3D_High-Resolution_Text-to-3D_Content_Creation_CVPR_2023_paper.pdf + Recently, DreamFusion demonstrated the utility of a pretrained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: 1) optimization of the NeRF representation is extremely slow, 2) NeRF is supervised by images at a low resolution (64x64), thus leading to low-quality 3D models with a long wait time. In this paper, we address these limitations by utilizing a two-stage coarse-to-fine optimization framework. In the first stage, we use a sparse 3D neural representation to accelerate optimization while using a low-resolution diffusion prior. In the second stage, we use a textured mesh model initialized from the coarse neural representation, allowing us to perform optimization with a very efficient differentiable renderer interacting with high-resolution images. Our method, dubbed Magic3D, can create a 3D mesh model in 40 minutes, 2x faster than DreamFusion (reportedly taking 1.5 hours on average), while achieving 8x higher resolution. User studies show that 61.7% of raters prefer our approach over DreamFusion. Together with the image-conditioned generation capabilities, we provide users with new ways to control 3D synthesis, opening up new avenues to various creative applications. + + + + Sketch2Saliency: Learning To Detect Salient Objects From Human Drawings + http://openaccess.thecvf.com//content/CVPR2023/papers/Bhunia_Sketch2Saliency_Learning_To_Detect_Salient_Objects_From_Human_Drawings_CVPR_2023_paper.pdf + Human sketch has already proved its worth in various visual understanding tasks (e.g., retrieval, segmentation, image-captioning, etc). In this paper, we reveal a new trait of sketches -- that they are also salient. This is intuitive as sketching is a natural attentive process at its core. More specifically, we aim to study how sketches can be used as a weak label to detect salient objects present in an image. To this end, we propose a novel method that emphasises how "salient object" could be explained by hand-drawn sketches. To accomplish this, we introduce a photo-to-sketch generation model that aims to generate sequential sketch coordinates corresponding to a given visual photo through a 2D attention mechanism. Attention maps accumulated across the time steps give rise to salient regions in the process. Extensive quantitative and qualitative experiments prove our hypothesis and delineate how our sketch-based saliency detection model gives a competitive performance compared to the state-of-the-art. + + + + Efficient Frequency Domain-Based Transformers for High-Quality Image Deblurring + http://openaccess.thecvf.com//content/CVPR2023/papers/Kong_Efficient_Frequency_Domain-Based_Transformers_for_High-Quality_Image_Deblurring_CVPR_2023_paper.pdf + We present an effective and efficient method that explores the properties of Transformers in the frequency domain for high-quality image deblurring. Our method is motivated by the convolution theorem that the correlation or convolution of two signals in the spatial domain is equivalent to an element-wise product of them in the frequency domain.
This inspires us to develop an efficient frequency domain-based self-attention solver (FSAS) to estimate the scaled dot-product attention by an element-wise product operation instead of the matrix multiplication in the spatial domain. In addition, we note that simply using the naive feed-forward network (FFN) in Transformers does not generate good deblurred results. To overcome this problem, we propose a simple yet effective discriminative frequency domain-based FFN (DFFN), where we introduce a gated mechanism in the FFN based on the Joint Photographic Experts Group (JPEG) compression algorithm to discriminatively determine which low- and high-frequency information of the features should be preserved for latent clear image restoration. We formulate the proposed FSAS and DFFN into an asymmetrical network based on an encoder and decoder architecture, where the FSAS is only used in the decoder module for better image deblurring. Experimental results show that the proposed method performs favorably against the state-of-the-art approaches. + + + + Distilling Focal Knowledge From Imperfect Expert for 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zeng_Distilling_Focal_Knowledge_From_Imperfect_Expert_for_3D_Object_Detection_CVPR_2023_paper.pdf + Multi-camera 3D object detection blossoms in recent years and most of state-of-the-art methods are built up on the bird's-eye-view (BEV) representations. Albeit remarkable performance, these works suffer from low efficiency. Typically, knowledge distillation can be used for model compression. However, due to unclear 3D geometry reasoning, expert features usually contain some noisy and confusing areas. In this work, we investigate on how to distill the knowledge from an imperfect expert. We propose FD3D, a Focal Distiller for 3D object detection. Specifically, a set of queries are leveraged to locate the instance-level areas for masked feature generation, to intensify feature representation ability in these areas. Moreover, these queries search out the representative fine-grained positions for refined distillation. We verify the effectiveness of our method by applying it to two popular detection models, BEVFormer and DETR3D. The results demonstrate that our method achieves improvements of 4.07 and 3.17 points respectively in terms of NDS metric on nuScenes benchmark. Code is hosted at https://github.com/OpenPerceptionX/BEVPerception-Survey-Recipe. + + + + ULIP: Learning a Unified Representation of Language, Images, and Point Clouds for 3D Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_ULIP_Learning_a_Unified_Representation_of_Language_Images_and_Point_CVPR_2023_paper.pdf + The recognition capabilities of current state-of-the-art 3D models are limited by datasets with a small number of annotated data and a pre-defined set of categories. In its 2D counterpart, recent advances have shown that similar problems can be significantly alleviated by employing knowledge from other modalities, such as language. Inspired by this, leveraging multimodal information for 3D modality could be promising to improve 3D understanding under the restricted data regime, but this line of research is not well studied. Therefore, we introduce ULIP to learn a unified representation of images, language, and 3D point clouds by pre-training with object triplets from the three modalities. 
To overcome the shortage of training triplets, ULIP leverages a pre-trained vision-language model that has already learned a common visual and textual space by training with massive image-text pairs. Then, ULIP learns a 3D representation space aligned with the common image-text space, using a small number of automatically synthesized triplets. ULIP is agnostic to 3D backbone networks and can easily be integrated into any 3D architecture. Experiments show that ULIP effectively improves the performance of multiple recent 3D backbones by simply pre-training them on ShapeNet55 using our framework, achieving state-of-the-art performance in both standard 3D classification and zero-shot 3D classification on ModelNet40 and ScanObjectNN. ULIP also improves the performance of PointMLP by around 3% in 3D classification on ScanObjectNN, and outperforms PointCLIP by 28.8% on top-1 accuracy for zero-shot 3D classification on ModelNet40. Our code and pre-trained models are released at https://github.com/salesforce/ULIP. + + + + Deep Learning of Partial Graph Matching via Differentiable Top-K + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Deep_Learning_of_Partial_Graph_Matching_via_Differentiable_Top-K_CVPR_2023_paper.pdf + Graph matching (GM) aims at discovering node matching between graphs, by maximizing the node- and edge-wise affinities between the matched elements. As an NP-hard problem, its challenge is further pronounced in the existence of outlier nodes in both graphs which is ubiquitous in practice, especially for vision problems. However, popular affinity-maximization-based paradigms often lack a principled scheme to suppress the false matching and resort to handcrafted thresholding to dismiss the outliers. This limitation is also inherited by the neural GM solvers though they have shown superior performance in the ideal no-outlier setting. In this paper, we propose to formulate the partial GM problem as the top-k selection task with a given/estimated number of inliers k. Specifically, we devise a differentiable top-k module that enables effective gradient descent over the optimal-transport layer, which can be readily plugged into SOTA deep GM pipelines including the quadratic matching network NGMv2 as well as the linear matching network GCAN. Meanwhile, the attention-fused aggregation layers are developed to estimate k to enable automatic outlier-robust matching in the wild. Last but not least, we remake and release a new benchmark called IMC-PT-SparseGM, originating from the IMC-PT stereo-matching dataset. The new benchmark involves more scale-varying graphs and partial matching instances from the real world. Experiments show that our methods outperform other partial matching schemes on popular benchmarks. + + + + NVTC: Nonlinear Vector Transform Coding + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_NVTC_Nonlinear_Vector_Transform_Coding_CVPR_2023_paper.pdf + In theory, vector quantization (VQ) is always better than scalar quantization (SQ) in terms of rate-distortion (R-D) performance. Recent state-of-the-art methods for neural image compression are mainly based on nonlinear transform coding (NTC) with uniform scalar quantization, overlooking the benefits of VQ due to its exponentially increased complexity. In this paper, we first investigate on some toy sources, demonstrating that even if modern neural networks considerably enhance the compression performance of SQ with nonlinear transform, there is still an insurmountable chasm between SQ and VQ. 
Therefore, revolving around VQ, we propose a novel framework for neural image compression named Nonlinear Vector Transform Coding (NVTC). NVTC solves the critical complexity issue of VQ through (1) a multi-stage quantization strategy and (2) nonlinear vector transforms. In addition, we apply entropy-constrained VQ in latent space to adaptively determine the quantization boundaries for joint rate-distortion optimization, which improves the performance both theoretically and experimentally. Compared to previous NTC approaches, NVTC demonstrates superior rate-distortion performance, faster decoding speed, and smaller model size. Our code is available at https://github.com/USTC-IMCL/NVTC. + + + + On the Effectiveness of Partial Variance Reduction in Federated Learning With Heterogeneous Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_On_the_Effectiveness_of_Partial_Variance_Reduction_in_Federated_Learning_CVPR_2023_paper.pdf + Data heterogeneity across clients is a key challenge in federated learning. Prior works address this by either aligning client and server models or using control variates to correct client model drift. Although these methods achieve fast convergence in convex or simple non-convex problems, the performance in over-parameterized models such as deep neural networks is lacking. In this paper, we first revisit the widely used FedAvg algorithm in a deep neural network to understand how data heterogeneity influences the gradient updates across the neural network layers. We observe that while the feature extraction layers are learned efficiently by FedAvg, the substantial diversity of the final classification layers across clients impedes the performance. Motivated by this, we propose to correct model drift by variance reduction only on the final layers. We demonstrate that this significantly outperforms existing benchmarks at a similar or lower communication cost. We furthermore provide proof for the convergence rate of our algorithm. + + + + Point Cloud Forecasting as a Proxy for 4D Occupancy Forecasting + http://openaccess.thecvf.com//content/CVPR2023/papers/Khurana_Point_Cloud_Forecasting_as_a_Proxy_for_4D_Occupancy_Forecasting_CVPR_2023_paper.pdf + Predicting how the world can evolve in the future is crucial for motion planning in autonomous systems. Classical methods are limited because they rely on costly human annotations in the form of semantic class labels, bounding boxes, and tracks or HD maps of cities to plan their motion -- and thus are difficult to scale to large unlabeled datasets. One promising self-supervised task is 3D point cloud forecasting from unannotated LiDAR sequences. We show that this task requires algorithms to implicitly capture (1) sensor extrinsics (i.e., the egomotion of the autonomous vehicle), (2) sensor intrinsics (i.e., the sampling pattern specific to the particular LiDAR sensor), and (3) the shape and motion of other objects in the scene. But autonomous systems should make predictions about the world and not their sensors! To this end, we factor out (1) and (2) by recasting the task as one of spacetime (4D) occupancy forecasting. But because it is expensive to obtain ground-truth 4D occupancy, we "render" point cloud data from 4D occupancy predictions given sensor extrinsics and intrinsics, allowing one to train and test occupancy algorithms with unannotated LiDAR sequences. This also allows one to evaluate and compare point cloud forecasting algorithms across diverse datasets, sensors, and vehicles. 
+ + + + Masked Representation Learning for Domain Generalized Stereo Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Rao_Masked_Representation_Learning_for_Domain_Generalized_Stereo_Matching_CVPR_2023_paper.pdf + Recently, many deep stereo matching methods have begun to focus on cross-domain performance, achieving impressive achievements. However, these methods did not deal with the significant volatility of generalization performance among different training epochs. Inspired by masked representation learning and multi-task learning, this paper designs a simple and effective masked representation for domain generalized stereo matching. First, we feed the masked left and complete right images as input into the models. Then, we add a lightweight and simple decoder following the feature extraction module to recover the original left image. Finally, we train the models with two tasks (stereo matching and image reconstruction) as a pseudo-multi-task learning framework, promoting models to learn structure information and to improve generalization performance. We implement our method on two well-known architectures (CFNet and LacGwcNet) to demonstrate its effectiveness. Experimental results on multi-datasets show that: (1) our method can be easily plugged into the current various stereo matching models to improve generalization performance; (2) our method can reduce the significant volatility of generalization performance among different training epochs; (3) we find that the current methods prefer to choose the best results among different training epochs as generalization performance, but it is impossible to select the best performance by ground truth in practice. + + + + You Can Ground Earlier Than See: An Effective and Efficient Pipeline for Temporal Sentence Grounding in Compressed Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Fang_You_Can_Ground_Earlier_Than_See_An_Effective_and_Efficient_CVPR_2023_paper.pdf + Given an untrimmed video, temporal sentence grounding (TSG) aims to locate a target moment semantically according to a sentence query. Although previous respectable works have made decent success, they only focus on high-level visual features extracted from the consecutive decoded frames and fail to handle the compressed videos for query modelling, suffering from insufficient representation capability and significant computational complexity during training and testing. In this paper, we pose a new setting, compressed-domain TSG, which directly utilizes compressed videos rather than fully-decompressed frames as the visual input. To handle the raw video bit-stream input, we propose a novel Three-branch Compressed-domain Spatial-temporal Fusion (TCSF) framework, which extracts and aggregates three kinds of low-level visual features (I-frame, motion vector and residual features) for effective and efficient grounding. Particularly, instead of encoding the whole decoded frames like previous works, we capture the appearance representation by only learning the I-frame feature to reduce delay or latency. Besides, we explore the motion information not only by learning the motion vector feature, but also by exploring the relations of neighboring frames via the residual feature. In this way, a three-branch spatial-temporal attention layer with an adaptive motion-appearance fusion module is further designed to extract and aggregate both appearance and motion information for the final grounding. 
Experiments on three challenging datasets show that our TCSF achieves better performance than other state-of-the-art methods with lower complexity. + + + + EqMotion: Equivariant Multi-Agent Motion Prediction With Invariant Interaction Reasoning + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_EqMotion_Equivariant_Multi-Agent_Motion_Prediction_With_Invariant_Interaction_Reasoning_CVPR_2023_paper.pdf + Learning to predict agent motions with relationship reasoning is important for many applications. In motion prediction tasks, maintaining motion equivariance under Euclidean geometric transformations and invariance of agent interaction is a critical and fundamental principle. However, such equivariance and invariance properties are overlooked by most existing methods. To fill this gap, we propose EqMotion, an efficient equivariant motion prediction model with invariant interaction reasoning. To achieve motion equivariance, we propose an equivariant geometric feature learning module to learn a Euclidean transformable feature through dedicated designs of equivariant operations. To reason about agents' interactions, we propose an invariant interaction reasoning module to achieve more stable interaction modeling. To further promote more comprehensive motion features, we propose an invariant pattern feature learning module to learn an invariant pattern feature, which cooperates with the equivariant geometric feature to enhance network expressiveness. We conduct experiments for the proposed model on four distinct scenarios: particle dynamics, molecule dynamics, human skeleton motion prediction and pedestrian trajectory prediction. Experimental results show that our method is not only generally applicable, but also achieves state-of-the-art prediction performances on all four tasks, improving by 24.0/30.1/8.6/9.2%. Code is available at https://github.com/MediaBrain-SJTU/EqMotion. + + + + FlowFormer++: Masked Cost Volume Autoencoding for Pretraining Optical Flow Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shi_FlowFormer_Masked_Cost_Volume_Autoencoding_for_Pretraining_Optical_Flow_Estimation_CVPR_2023_paper.pdf + FlowFormer introduces a transformer architecture into optical flow estimation and achieves state-of-the-art performance. The core component of FlowFormer is the transformer-based cost-volume encoder. Inspired by the recent success of masked autoencoding (MAE) pretraining in unleashing transformers' capacity of encoding visual representation, we propose Masked Cost Volume Autoencoding (MCVA) to enhance FlowFormer by pretraining the cost-volume encoder with a novel MAE scheme. Firstly, we introduce a block-sharing masking strategy to prevent masked information leakage, as the cost maps of neighboring source pixels are highly correlated. Secondly, we propose a novel pre-text reconstruction task, which encourages the cost-volume encoder to aggregate long-range information and ensures pretraining-finetuning consistency. We also show how to modify the FlowFormer architecture to accommodate masks during pretraining. Pretrained with MCVA, our proposed FlowFormer++ ranks 1st among published methods on both Sintel and KITTI-2015 benchmarks. Specifically, FlowFormer++ achieves 1.07 and 1.94 average end-point-error (AEPE) on the clean and final pass of Sintel benchmark, leading to 7.76% and 7.18% error reductions from FlowFormer. FlowFormer++ obtains 4.52 F1-all on the KITTI-2015 test set, improving FlowFormer by 0.16.
+ + + + 3D Human Keypoints Estimation From Point Clouds in the Wild Without Human Labels + http://openaccess.thecvf.com//content/CVPR2023/papers/Weng_3D_Human_Keypoints_Estimation_From_Point_Clouds_in_the_Wild_CVPR_2023_paper.pdf + Training a 3D human keypoint detector from point clouds in a supervised manner requires large volumes of high quality labels. While it is relatively easy to capture large amounts of human point clouds, annotating 3D keypoints is expensive, subjective, error prone and especially difficult for long-tail cases (pedestrians with rare poses, scooterists, etc.). In this work, we propose GC-KPL - Geometry Consistency inspired Key Point Learning, an approach for learning 3D human joint locations from point clouds without human labels. We achieve this by our novel unsupervised loss formulations that account for the structure and movement of the human body. We show that by training on a large training set from Waymo Open Dataset without any human annotated keypoints, we are able to achieve reasonable performance as compared to the fully supervised approach. Further, the backbone benefits from the unsupervised training and is useful in downstream few-shot learning of keypoints, where fine-tuning on only 10 percent of the labeled training data gives comparable performance to fine-tuning on the entire set. We demonstrate that GC-KPL outperforms SoTA by a large margin when trained on the entire dataset and efficiently leverages large volumes of unlabeled data. + + + + Where Is My Spot? Few-Shot Image Generation via Latent Subspace Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_Where_Is_My_Spot_Few-Shot_Image_Generation_via_Latent_Subspace_CVPR_2023_paper.pdf + Image generation relies on massive training data that can hardly produce diverse images of an unseen category according to a few examples. In this paper, we address this dilemma by projecting sparse few-shot samples into a continuous latent space that can potentially generate infinite unseen samples. The rationale behind this is that we aim to locate a centroid latent position in a conditional StyleGAN, where the corresponding output image on that centroid can maximize the similarity with the given samples. Although the given samples are unseen for the conditional StyleGAN, we assume the neighboring latent subspace around the centroid belongs to the novel category, and therefore introduce two latent subspace optimization objectives. In the first one we use few-shot samples as positive anchors of the novel class, and adjust the StyleGAN to produce the corresponding results with the new class label condition. The second objective is to govern the generation process from the other way around, by altering the centroid and its surrounding latent subspace for a more precise generation of the novel class. These reciprocal optimization objectives inject a novel class into the StyleGAN latent subspace, and therefore new unseen samples can be easily produced by sampling images from it. Extensive experiments demonstrate superior few-shot generation performances compared with state-of-the-art methods, especially in terms of diversity and generation quality. Code is available at https://github.com/chansey0529/LSO.
+ + + + TopNet: Transformer-Based Object Placement Network for Image Compositing + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_TopNet_Transformer-Based_Object_Placement_Network_for_Image_Compositing_CVPR_2023_paper.pdf + We investigate the problem of automatically placing an object into a background image for image compositing. Given a background image and a segmented object, the goal is to train a model to predict plausible placements (location and scale) of the object for compositing. The quality of the composite image highly depends on the predicted location/scale. Existing works either generate candidate bounding boxes or apply sliding-window search using global representations from background and object images, which fail to model local information in background images. However, local clues in background images are important to determine the compatibility of placing the objects with certain locations/scales. In this paper, we propose to learn the correlation between object features and all local background features with a transformer module so that detailed information can be provided on all possible location/scale configurations. A sparse contrastive loss is further proposed to train our model with sparse supervision. Our new formulation generates a 3D heatmap indicating the plausibility of all location/scale combinations in one network forward pass, which is >10x faster than the previous sliding-window method. It also supports interactive search when users provide a pre-defined location or scale. The proposed method can be trained with explicit annotation or in a self-supervised manner using an off-the-shelf inpainting model, and it outperforms state-of-the-art methods significantly. User study shows that the trained model generalizes well to real-world images with diverse challenging scenes and object categories. + + + + Gloss Attention for Gloss-Free Sign Language Translation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yin_Gloss_Attention_for_Gloss-Free_Sign_Language_Translation_CVPR_2023_paper.pdf + Most sign language translation (SLT) methods to date require the use of gloss annotations to provide additional supervision information, however, the acquisition of gloss is not easy. To solve this problem, we first perform an analysis of existing models to confirm how gloss annotations make SLT easier. We find that it can provide two aspects of information for the model, 1) it can help the model implicitly learn the location of semantic boundaries in continuous sign language videos, 2) it can help the model understand the sign language video globally. We then propose gloss attention, which enables the model to keep its attention within video segments that have the same semantics locally, just as gloss helps existing models do. Furthermore, we transfer the knowledge of sentence-to-sentence similarity from the natural language model to our gloss attention SLT network (GASLT) to help it understand sign language videos at the sentence level. Experimental results on multiple large-scale sign language datasets show that our proposed GASLT model significantly outperforms existing methods. Our code is provided in https://github.com/YinAoXiong/GASLT. 
+ + + + Revisiting Rolling Shutter Bundle Adjustment: Toward Accurate and Fast Solution + http://openaccess.thecvf.com//content/CVPR2023/papers/Liao_Revisiting_Rolling_Shutter_Bundle_Adjustment_Toward_Accurate_and_Fast_Solution_CVPR_2023_paper.pdf + We propose a robust and fast bundle adjustment solution that estimates the 6-DoF pose of the camera and the geometry of the environment based on measurements from a rolling shutter (RS) camera. This tackles the challenges in the existing works, namely relying on additional sensors, high frame rate video as input, restrictive assumptions on camera motion, readout direction, and poor efficiency. To this end, we first investigate the influence of normalization of the image points on RSBA performance and show that it better approximates the real 6-DoF camera motion. Then we present a novel analytical model for the visual residual covariance, which can be used to standardize the reprojection error during the optimization, consequently improving the overall accuracy. More importantly, the combination of normalization and covariance standardization weighting in RSBA (NW-RSBA) can avoid common planar degeneracy without needing to constrain the filming manner. Besides, we propose an acceleration strategy for NW-RSBA based on the sparsity of its Jacobian matrix and Schur complement. The extensive synthetic and real data experiments verify the effectiveness and efficiency of the proposed solution over the state-of-the-art works. We also demonstrate that the proposed method can be easily implemented and plugged into popular GSSfM and GSSLAM systems as complete RSSfM and RSSLAM solutions. + + + + Context-Aware Relative Object Queries To Unify Video Instance and Panoptic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Choudhuri_Context-Aware_Relative_Object_Queries_To_Unify_Video_Instance_and_Panoptic_CVPR_2023_paper.pdf + Object queries have emerged as a powerful abstraction to generically represent object proposals. However, their use for temporal tasks like video segmentation poses two questions: 1) How to process frames sequentially and propagate object queries seamlessly across frames. Using independent object queries per frame doesn't permit tracking, and requires post-processing. 2) How to produce temporally consistent, yet expressive object queries that model both appearance and position changes. Using the entire video at once doesn't capture position changes and doesn't scale to long videos. As one answer to both questions we propose 'context-aware relative object queries', which are continuously propagated frame-by-frame. They seamlessly track objects and deal with occlusion and re-appearance of objects, without post-processing. Further, we find context-aware relative object queries better capture position changes of objects in motion. We evaluate the proposed approach across three challenging tasks: video instance segmentation, multi-object tracking and segmentation, and video panoptic segmentation. Using the same approach and architecture, we match or surpass state-of-the-art results on the diverse and challenging OVIS, Youtube-VIS, Cityscapes-VPS, MOTS 2020 and KITTI-MOTS data.
+ + + + Enhancing Deformable Local Features by Jointly Learning To Detect and Describe Keypoints + http://openaccess.thecvf.com//content/CVPR2023/papers/Potje_Enhancing_Deformable_Local_Features_by_Jointly_Learning_To_Detect_and_CVPR_2023_paper.pdf + Local feature extraction is a standard approach in computer vision for tackling important tasks such as image matching and retrieval. The core assumption of most methods is that images undergo affine transformations, disregarding more complicated effects such as non-rigid deformations. Furthermore, incipient works tailored for non-rigid correspondence still rely on keypoint detectors designed for rigid transformations, hindering performance due to the limitations of the detector. We propose DALF (Deformation-Aware Local Features), a novel deformation-aware network for jointly detecting and describing keypoints, to handle the challenging problem of matching deformable surfaces. All network components work cooperatively through a feature fusion approach that enforces the descriptors' distinctiveness and invariance. Experiments using real deforming objects showcase the superiority of our method, where it delivers an 8% improvement in matching scores compared to the previous best results. Our approach also enhances the performance of two real-world applications: deformable object retrieval and non-rigid 3D surface registration. Code for training, inference, and applications is publicly available at verlab.dcc.ufmg.br/descriptors/dalf_cvpr23. + + + + Siamese Image Modeling for Self-Supervised Vision Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Tao_Siamese_Image_Modeling_for_Self-Supervised_Vision_Representation_Learning_CVPR_2023_paper.pdf + Self-supervised learning (SSL) has delivered superior performance on a variety of downstream vision tasks. Two main-stream SSL frameworks have been proposed, i.e., Instance Discrimination (ID) and Masked Image Modeling (MIM). ID pulls together representations from different views of the same image, while avoiding feature collapse. It lacks spatial sensitivity, which requires modeling the local structure within each image. On the other hand, MIM reconstructs the original content given a masked image. It instead does not have good semantic alignment, which requires projecting semantically similar views into nearby representations. To address this dilemma, we observe that (1) semantic alignment can be achieved by matching different image views with strong augmentations; (2) spatial sensitivity can benefit from predicting dense representations with masked images. Driven by these analyses, we propose Siamese Image Modeling (SiameseIM), which predicts the dense representations of an augmented view, based on another masked view from the same image but with different augmentations. SiameseIM uses a Siamese network with two branches. The online branch encodes the first view, and predicts the second view's representation according to the relative positions between these two views. The target branch produces the target by encoding the second view. SiameseIM can surpass both ID and MIM on a wide range of downstream tasks, including ImageNet finetuning and linear probing, COCO and LVIS detection, and ADE20k semantic segmentation. The improvement is more significant in few-shot, long-tail and robustness-concerned scenarios. Code shall be released.
+ + + + Generating Part-Aware Editable 3D Shapes Without 3D Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Tertikas_Generating_Part-Aware_Editable_3D_Shapes_Without_3D_Supervision_CVPR_2023_paper.pdf + Impressive progress in generative models and implicit representations gave rise to methods that can generate 3D shapes of high quality. However, being able to locally control and edit shapes is another essential property that can unlock several content creation applications. Local control can be achieved with part-aware models, but existing methods require 3D supervision and cannot produce textures. In this work, we devise PartNeRF, a novel part-aware generative model for editable 3D shape synthesis that does not require any explicit 3D supervision. Our model generates objects as a set of locally defined NeRFs, augmented with an affine transformation. This enables several editing operations such as applying transformations on parts, mixing parts from different objects, etc. To ensure distinct, manipulable parts we enforce a hard assignment of rays to parts that makes sure that the color of each ray is only determined by a single NeRF. As a result, altering one part does not affect the appearance of the others. Evaluations on various ShapeNet categories demonstrate the ability of our model to generate editable 3D objects of improved fidelity, compared to previous part-based generative approaches that require 3D supervision or models relying on NeRFs. + + + + High-Fidelity Facial Avatar Reconstruction From Monocular Video With Generative Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Bai_High-Fidelity_Facial_Avatar_Reconstruction_From_Monocular_Video_With_Generative_Priors_CVPR_2023_paper.pdf + High-fidelity facial avatar reconstruction from a monocular video is a significant research problem in computer graphics and computer vision. Recently, Neural Radiance Field (NeRF) has shown impressive novel view rendering results and has been considered for facial avatar reconstruction. However, the complex facial dynamics and missing 3D information in monocular videos raise significant challenges for faithful facial reconstruction. In this work, we propose a new method for NeRF-based facial avatar reconstruction that utilizes a 3D-aware generative prior. Different from existing works that depend on a conditional deformation field for dynamic modeling, we propose to learn a personalized generative prior, which is formulated as a local and low dimensional subspace in the latent space of 3D-GAN. We propose an efficient method to construct the personalized generative prior based on a small set of facial images of a given individual. After learning, it allows for photo-realistic rendering with novel views, and the face reenactment can be realized by performing navigation in the latent space. Our proposed method is applicable to different driving signals, including RGB images, 3DMM coefficients, and audio. Compared with existing works, we obtain superior novel view synthesis results and faithful face reenactment performance. The code is available at https://github.com/bbaaii/HFA-GP.
+ + + + CABM: Content-Aware Bit Mapping for Single Image Super-Resolution Network With Large Input + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_CABM_Content-Aware_Bit_Mapping_for_Single_Image_Super-Resolution_Network_With_CVPR_2023_paper.pdf + With the development of high-definition display devices, the practical scenario of Super-Resolution (SR) usually needs to super-resolve large input like 2K to higher resolution (4K/8K). To reduce the computational and memory cost, current methods first split the large input into local patches and then merge the SR patches into the output. These methods adaptively allocate a subnet for each patch. Quantization is a very important technique for network acceleration and has been used to design the subnets. Current methods train an MLP bit selector to determine the proper bit for each layer. However, they uniformly sample subnets for training, making simple subnets overfitted and complicated subnets underfitted. Therefore, the trained bit selector fails to determine the optimal bit. Apart from this, the introduced bit selector brings additional cost to each layer of the SR network. In this paper, we propose a novel method named Content-Aware Bit Mapping (CABM), which can remove the bit selector without any performance loss. CABM also learns a bit selector for each layer during training. After training, we analyze the relation between the edge information of an input patch and the bit of each layer. We observe that the edge information can be an effective metric for the selected bit. Therefore, we design a strategy to build an Edge-to-Bit lookup table that maps the edge score of a patch to the bit of each layer during inference. The bit configuration of the SR network can be determined by the lookup tables of all layers. Our strategy can find a better bit configuration, resulting in more efficient mixed precision networks. We conduct detailed experiments to demonstrate the generalization ability of our method. The code will be released. + + + + Decoupling MaxLogit for Out-of-Distribution Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Decoupling_MaxLogit_for_Out-of-Distribution_Detection_CVPR_2023_paper.pdf + In machine learning, it is often observed that standard training outputs anomalously high confidence for both in-distribution (ID) and out-of-distribution (OOD) data. Thus, the ability to detect OOD samples is critical to model deployment. An essential step for OOD detection is post-hoc scoring. MaxLogit is one of the simplest scoring functions which uses the maximum logits as the OOD score. To provide a new viewpoint to study the logit-based scoring function, we reformulate the logit into cosine similarity and logit norm and propose to use MaxCosine and MaxNorm. We empirically find that MaxCosine is a core factor in the effectiveness of MaxLogit, and that the performance of MaxLogit is encumbered by MaxNorm. To tackle the problem, we propose the Decoupling MaxLogit (DML) for flexibility to balance MaxCosine and MaxNorm. To further embody the core of our method, we extend DML to DML+ based on the new insights that fewer hard samples and compact feature space are the key components to make logit-based methods effective. We demonstrate the effectiveness of our logit-based OOD detection methods on CIFAR-10, CIFAR-100 and ImageNet and establish state-of-the-art performance.
+ + + + Generalizing Dataset Distillation via Deep Generative Prior + http://openaccess.thecvf.com//content/CVPR2023/papers/Cazenavette_Generalizing_Dataset_Distillation_via_Deep_Generative_Prior_CVPR_2023_paper.pdf + Dataset Distillation aims to distill an entire dataset's knowledge into a few synthetic images. The idea is to synthesize a small number of synthetic data points that, when given to a learning algorithm as training data, result in a model approximating one trained on the original data. Despite a recent upsurge of progress in the field, existing dataset distillation methods fail to generalize to new architectures and scale to high-resolution datasets. To overcome the above issues, we propose to use the learned prior from pre-trained deep generative models to synthesize the distilled data. To achieve this, we present a new optimization algorithm that distills a large number of images into a few intermediate feature vectors in the generative model's latent space. Our method augments existing techniques, significantly improving cross-architecture generalization in all settings. + + + + Adaptive Patch Deformation for Textureless-Resilient Multi-View Stereo + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Adaptive_Patch_Deformation_for_Textureless-Resilient_Multi-View_Stereo_CVPR_2023_paper.pdf + In recent years, deep learning-based approaches have shown great strength in multi-view stereo because of their outstanding ability to extract robust visual features. However, most learning-based methods need to build the cost volume and increase the receptive field enormously to get a satisfactory result when dealing with large-scale textureless regions, consequently leading to prohibitive memory consumption. To ensure both memory-friendly and textureless-resilient, we innovatively transplant the spirit of deformable convolution from deep learning into the traditional PatchMatch-based method. Specifically, for each pixel with matching ambiguity (termed unreliable pixel), we adaptively deform the patch centered on it to extend the receptive field until covering enough correlative reliable pixels (without matching ambiguity) that serve as anchors. When performing PatchMatch, constrained by the anchor pixels, the matching cost of an unreliable pixel is guaranteed to reach the global minimum at the correct depth and therefore increases the robustness of multi-view stereo significantly. To detect more anchor pixels to ensure better adaptive patch deformation, we propose to evaluate the matching ambiguity of a certain pixel by checking the convergence of the estimated depth as optimization proceeds. As a result, our method achieves state-of-the-art performance on ETH3D and Tanks and Temples while preserving low memory consumption. + + + + Detection of Out-of-Distribution Samples Using Binary Neuron Activation Patterns + http://openaccess.thecvf.com//content/CVPR2023/papers/Olber_Detection_of_Out-of-Distribution_Samples_Using_Binary_Neuron_Activation_Patterns_CVPR_2023_paper.pdf + Deep neural networks (DNN) have outstanding performance in various applications. Despite numerous efforts of the research community, out-of-distribution (OOD) samples remain a significant limitation of DNN classifiers. The ability to identify previously unseen inputs as novel is crucial in safety-critical applications such as self-driving cars, unmanned aerial vehicles, and robots. 
Existing approaches to detect OOD samples treat a DNN as a black box and evaluate the confidence score of the output predictions. Unfortunately, this method frequently fails, because DNNs are not trained to reduce their confidence for OOD inputs. In this work, we introduce a novel method for OOD detection. Our method is motivated by theoretical analysis of neuron activation patterns (NAP) in ReLU-based architectures. The proposed method does not introduce a high computational overhead due to the binary representation of the activation patterns extracted from convolutional layers. The extensive empirical evaluation proves its high performance on various DNN architectures and seven image datasets. + + + + SeaThru-NeRF: Neural Radiance Fields in Scattering Media + http://openaccess.thecvf.com//content/CVPR2023/papers/Levy_SeaThru-NeRF_Neural_Radiance_Fields_in_Scattering_Media_CVPR_2023_paper.pdf + Research on neural radiance fields (NeRFs) for novel view generation is exploding with new models and extensions. However, a question that remains unanswered is what happens in underwater or foggy scenes where the medium strongly influences the appearance of objects. Thus far, NeRF and its variants have ignored these cases. However, since the NeRF framework is based on volumetric rendering, it has inherent capability to account for the medium's effects, once modeled appropriately. We develop a new rendering model for NeRFs in scattering media, which is based on the SeaThru image formation model, and suggest a suitable architecture for learning both scene information and medium parameters. We demonstrate the strength of our method using simulated and real-world scenes, correctly rendering novel photorealistic views underwater. Even more excitingly, we can render clear views of these scenes, removing the medium between the camera and the scene and reconstructing the appearance and depth of far objects, which are severely occluded by the medium. Our code and unique datasets are available on the project's website. + + + + MixMAE: Mixed and Masked Autoencoder for Efficient Pretraining of Hierarchical Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_MixMAE_Mixed_and_Masked_Autoencoder_for_Efficient_Pretraining_of_Hierarchical_CVPR_2023_paper.pdf + In this paper, we propose Mixed and Masked AutoEncoder (MixMAE), a simple but efficient pretraining method that is applicable to various hierarchical Vision Transformers. Existing masked image modeling (MIM) methods for hierarchical Vision Transformers replace a random subset of input tokens with a special [MASK] symbol and aim at reconstructing original image tokens from the corrupted image. However, we find that using the [MASK] symbol greatly slows down the training and causes pretraining-finetuning inconsistency, due to the large masking ratio (e.g., 60% in SimMIM). On the other hand, MAE does not introduce [MASK] tokens at its encoder at all but is not applicable for hierarchical Vision Transformers. To solve the issue and accelerate the pretraining of hierarchical models, we replace the masked tokens of one image with visible tokens of another image, i.e., creating a mixed image. We then conduct dual reconstruction to reconstruct the two original images from the mixed input, which significantly improves efficiency. While MixMAE can be applied to various hierarchical Transformers, this paper explores using Swin Transformer with a large window size and scales up to huge model size (to reach 600M parameters). 
Empirical results demonstrate that MixMAE can learn high-quality visual representations efficiently. Notably, MixMAE with Swin-B/W14 achieves 85.1% top-1 accuracy on ImageNet-1K by pretraining for 600 epochs. Besides, its transfer performances on the other 6 datasets show that MixMAE has better FLOPs / performance tradeoff than previous popular MIM methods. + + + + Human Pose Estimation in Extremely Low-Light Conditions + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Human_Pose_Estimation_in_Extremely_Low-Light_Conditions_CVPR_2023_paper.pdf + We study human pose estimation in extremely low-light images. This task is challenging due to the difficulty of collecting real low-light images with accurate labels, and severely corrupted inputs that degrade prediction quality significantly. To address the first issue, we develop a dedicated camera system and build a new dataset of real low-light images with accurate pose labels. Thanks to our camera system, each low-light image in our dataset is coupled with an aligned well-lit image, which enables accurate pose labeling and is used as privileged information during training. We also propose a new model and a new training strategy that fully exploit the privileged information to learn representation insensitive to lighting conditions. Our method demonstrates outstanding performance on real extremely low-light images, and extensive analyses validate that both of our model and dataset contribute to the success. + + + + EventNeRF: Neural Radiance Fields From a Single Colour Event Camera + http://openaccess.thecvf.com//content/CVPR2023/papers/Rudnev_EventNeRF_Neural_Radiance_Fields_From_a_Single_Colour_Event_Camera_CVPR_2023_paper.pdf + Asynchronously operating event cameras find many applications due to their high dynamic range, vanishingly low motion blur, low latency and low data bandwidth. The field saw remarkable progress during the last few years, and existing event-based 3D reconstruction approaches recover sparse point clouds of the scene. However, such sparsity is a limiting factor in many cases, especially in computer vision and graphics, that has not been addressed satisfactorily so far. Accordingly, this paper proposes the first approach for 3D-consistent, dense and photorealistic novel view synthesis using just a single colour event stream as input. At its core is a neural radiance field trained entirely in a self-supervised manner from events while preserving the original resolution of the colour event channels. Next, our ray sampling strategy is tailored to events and allows for data-efficient training. At test, our method produces results in the RGB space at unprecedented quality. We evaluate our method qualitatively and numerically on several challenging synthetic and real scenes and show that it produces significantly denser and more visually appealing renderings than the existing methods. We also demonstrate robustness in challenging scenarios with fast motion and under low lighting conditions. We release the newly recorded dataset and our source code to facilitate the research field, see https://4dqv.mpi-inf.mpg.de/EventNeRF. + + + + Neighborhood Attention Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Hassani_Neighborhood_Attention_Transformer_CVPR_2023_paper.pdf + We present Neighborhood Attention (NA), the first efficient and scalable sliding window attention mechanism for vision. 
NA is a pixel-wise operation, localizing self attention (SA) to the nearest neighboring pixels, and therefore enjoys a linear time and space complexity compared to the quadratic complexity of SA. The sliding window pattern allows NA's receptive field to grow without needing extra pixel shifts, and preserves translational equivariance, unlike Swin Transformer's Window Self Attention (WSA). We develop NATTEN (Neighborhood Attention Extension), a Python package with efficient C++ and CUDA kernels, which allows NA to run up to 40% faster than Swin's WSA while using up to 25% less memory. We further present Neighborhood Attention Transformer (NAT), a new hierarchical transformer design based on NA that boosts image classification and downstream vision performance. Experimental results on NAT are competitive; NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet, 51.4% mAP on MS-COCO and 48.4% mIoU on ADE20K, which is a 1.9% ImageNet accuracy, 1.0% COCO mAP, and 2.6% ADE20K mIoU improvement over a Swin model with similar size. To support more research based on sliding window attention, we open source our project and release our checkpoints. + + + + Progressive Spatio-Temporal Alignment for Efficient Event-Based Motion Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Progressive_Spatio-Temporal_Alignment_for_Efficient_Event-Based_Motion_Estimation_CVPR_2023_paper.pdf + In this paper, we propose an efficient event-based motion estimation framework for various motion models. Different from previous works, we design a progressive event-to-map alignment scheme and utilize the spatio-temporal correlations to align events. In detail, we progressively align sampled events in an event batch to the time-surface map and obtain the updated motion model by minimizing a novel time-surface loss. In addition, a dynamic batch size strategy is applied to adaptively adjust the batch size so that all events in the batch are consistent with the current motion model. Our framework has three advantages: a) the progressive scheme refines motion parameters iteratively, achieving accurate motion estimation; b) within one iteration, only a small portion of events are involved in optimization, which greatly reduces the total runtime; c) the dynamic batch size strategy ensures that the constant velocity assumption always holds. We conduct comprehensive experiments to evaluate our framework on challenging high-speed scenes with three motion models: rotational, homography, and 6-DOF models. Experimental results demonstrate that our framework achieves state-of-the-art estimation accuracy and efficiency. + + + + Trap Attention: Monocular Depth Estimation With Manual Traps + http://openaccess.thecvf.com//content/CVPR2023/papers/Ning_Trap_Attention_Monocular_Depth_Estimation_With_Manual_Traps_CVPR_2023_paper.pdf + Predicting a high quality depth map from a single image is a challenging task, because there exist infinite possibilities to project a 2D scene to the corresponding 3D scene. Recently, some studies introduced multi-head attention (MHA) modules to perform long-range interaction, which have shown significant progress in regressing the depth maps. The main functions of MHA can be loosely summarized as capturing long-distance information and reporting the attention map based on the relationships between pixels. However, due to the quadratic complexity of MHA, these methods cannot leverage MHA to compute depth features in high resolution with an appropriate computational complexity.
In this paper, we exploit a depth-wise convolution to obtain long-range information, and propose a novel trap attention, which sets some traps on the extended space for each pixel, and forms the attention mechanism by the feature retention ratio of the convolution window, so that the quadratic computational complexity is reduced to a linear form. Then we build an encoder-decoder trap depth estimation network, which introduces a vision transformer as the encoder, and uses the trap attention to estimate the depth from a single image in the decoder. Extensive experimental results demonstrate that our proposed network can outperform the state-of-the-art methods in monocular depth estimation on the NYU Depth-v2 and KITTI datasets, with a significantly reduced number of parameters. Code is available at: https://github.com/ICSResearch/TrapAttention. + + + + Representing Volumetric Videos As Dynamic MLP Maps + http://openaccess.thecvf.com//content/CVPR2023/papers/Peng_Representing_Volumetric_Videos_As_Dynamic_MLP_Maps_CVPR_2023_paper.pdf + This paper introduces a novel representation of volumetric videos for real-time view synthesis of dynamic scenes. Recent advances in neural scene representations demonstrate their remarkable capability to model and render complex static scenes, but extending them to represent dynamic scenes is not straightforward due to their slow rendering speed or high storage cost. To solve this problem, our key idea is to represent the radiance field of each frame as a set of shallow MLP networks whose parameters are stored in 2D grids, called MLP maps, and dynamically predicted by a 2D CNN decoder shared by all frames. Representing 3D scenes with shallow MLPs significantly improves the rendering speed, while dynamically predicting MLP parameters with a shared 2D CNN instead of explicitly storing them leads to low storage cost. Experiments show that the proposed approach achieves state-of-the-art rendering quality on the NHR and ZJU-MoCap datasets, while being efficient for real-time rendering with a speed of 41.7 fps for 512 x 512 images on an RTX 3090 GPU. The code is available at https://zju3dv.github.io/mlp_maps/. + + + + Video-Text As Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_Video-Text_As_Game_Players_Hierarchical_Banzhaf_Interaction_for_Cross-Modal_Representation_CVPR_2023_paper.pdf + Contrastive learning-based video-language representation learning approaches, e.g., CLIP, have achieved outstanding performance, which pursue semantic interaction upon pre-defined video-text pairs. To clarify this coarse-grained global interaction and move a step further, we have to encounter challenging shell-breaking interactions for fine-grained cross-modal learning. In this paper, we creatively model video-text as game players with multivariate cooperative game theory to wisely handle the uncertainty during fine-grained semantic interaction with diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value possible correspondence between video frames and text words for sensitive and explainable cross-modal contrast. To efficiently realize the cooperative game of multiple video frames and multiple text words, the proposed method clusters the original video frames (text words) and computes the Banzhaf Interaction between the merged tokens.
By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video-question answering benchmarks with superior performances justify the efficacy of our HBI. More encouragingly, it can also serve as a visualization tool to promote the understanding of cross-modal interaction, which may have a far-reaching impact on the community. Project page is available at https://jpthu17.github.io/HBI/. + + + + Blur Interpolation Transformer for Real-World Motion From Blur + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhong_Blur_Interpolation_Transformer_for_Real-World_Motion_From_Blur_CVPR_2023_paper.pdf + This paper studies the challenging problem of recovering motion from blur, also known as joint deblurring and interpolation or blur temporal super-resolution. The challenges are twofold: 1) the current methods still leave considerable room for improvement in terms of visual quality even on the synthetic dataset, and 2) poor generalization to real-world data. To this end, we propose a blur interpolation transformer (BiT) to effectively unravel the underlying temporal correlation encoded in blur. Based on multi-scale residual Swin transformer blocks, we introduce dual-end temporal supervision and temporally symmetric ensembling strategies to generate effective features for time-varying motion rendering. In addition, we design a hybrid camera system to collect the first real-world dataset of one-to-many blur-sharp video pairs. Experimental results show that BiT has a significant gain over the state-of-the-art methods on the public dataset Adobe240. Besides, the proposed real-world dataset effectively helps the model generalize well to real blurry scenarios. Code and data are available at https://github.com/zzh-tech/BiT. + + + + Rethinking Few-Shot Medical Segmentation: A Vector Quantization View + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Rethinking_Few-Shot_Medical_Segmentation_A_Vector_Quantization_View_CVPR_2023_paper.pdf + The existing few-shot medical segmentation networks share the same practice that the more prototypes, the better performance. This phenomenon can be theoretically interpreted in Vector Quantization (VQ) view: the more prototypes, the more clusters are separated from pixel-wise feature points distributed over the full space. However, as we further think about few-shot segmentation with this perspective, it is found that the clusterization of feature points and the adaptation to unseen tasks have not received enough attention. Motivated by the observation, we propose a learning VQ mechanism consisting of grid-format VQ (GFVQ), self-organized VQ (SOVQ) and residual oriented VQ (ROVQ). To be specific, GFVQ generates the prototype matrix by averaging square grids over the spatial extent, which uniformly quantizes the local details; SOVQ adaptively assigns the feature points to different local classes and creates a new representation space where the learnable local prototypes are updated with a global view; ROVQ introduces residual information to fine-tune the aforementioned learned local prototypes without re-training, which benefits the generalization performance for the irrelevance to the training task. We empirically show that our VQ framework yields the state-of-the-art performance over abdomen, cardiac and prostate MRI datasets and expect this work will provoke a rethink of the current few-shot medical segmentation model design. 
Our code will soon be publicly available. + + + + Event-Based Shape From Polarization http://openaccess.thecvf.com//content/CVPR2023/papers/Muglikar_Event-Based_Shape_From_Polarization_CVPR_2023_paper.pdf State-of-the-art solutions for Shape-from-Polarization (SfP) suffer from a speed-resolution tradeoff: they either sacrifice the number of polarization angles measured or necessitate lengthy acquisition times due to framerate constraints, thus compromising either accuracy or latency. We tackle this tradeoff using event cameras. Event cameras operate at microseconds resolution with negligible motion blur, and output a continuous stream of events that precisely measures how light changes over time asynchronously. We propose a setup that consists of a linear polarizer rotating at high speeds in front of an event camera. Our method uses the continuous event stream caused by the rotation to reconstruct relative intensities at multiple polarizer angles. Experiments demonstrate that our method outperforms physics-based baselines using frames, reducing the MAE by 25% on synthetic and real-world datasets. In the real world, we observe, however, that challenging conditions (i.e., when few events are generated) harm the performance of physics-based solutions. To overcome this, we propose a learning-based approach that learns to estimate surface normals even at low event-rates, improving the physics-based approach by 52% on the real-world dataset. The proposed system achieves an acquisition speed equivalent to 50 fps (>twice the framerate of the commercial polarization sensor) while retaining the spatial resolution of 1MP. Our evaluation is based on the first large-scale dataset for event-based SfP. + + + + ARO-Net: Learning Implicit Fields From Anchored Radial Observations http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_ARO-Net_Learning_Implicit_Fields_From_Anchored_Radial_Observations_CVPR_2023_paper.pdf We introduce anchored radial observations (ARO), a novel shape encoding for learning implicit field representation of 3D shapes that is category-agnostic and generalizable amid significant shape variations. The main idea behind our work is to reason about shapes through partial observations from a set of viewpoints, called anchors. We develop a general and unified shape representation by employing a fixed set of anchors, via Fibonacci sampling, and designing a coordinate-based deep neural network to predict the occupancy value of a query point in space. Different from prior neural implicit models that use a global shape feature, our shape encoder operates on contextual, query-specific features. To predict point occupancy, locally observed shape information from the perspective of the anchors surrounding the input query point is encoded and aggregated through an attention module, before implicit decoding is performed. We demonstrate the quality and generality of our network, coined ARO-Net, on surface reconstruction from sparse point clouds, with tests on novel and unseen object categories, "one-shape" training, and comparisons to state-of-the-art neural and classical methods for reconstruction and tessellation. + + + + All in One: Exploring Unified Video-Language Pre-Training http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_All_in_One_Exploring_Unified_Video-Language_Pre-Training_CVPR_2023_paper.pdf Mainstream Video-Language Pre-training models consist of three parts: a video encoder, a text encoder, and a video-text fusion Transformer. 
They pursue better performance by utilizing heavier unimodal encoders or multimodal fusion Transformers, resulting in increased parameters with lower efficiency in downstream tasks. In this work, we for the first time introduce an end-to-end video-language model, namely all-in-one Transformer, that embeds raw video and textual signals into joint representations using a unified backbone architecture. We argue that the unique temporal information of video data turns out to be a key barrier hindering the design of a modality-agnostic Transformer. To overcome the challenge, we introduce a novel and effective token rolling operation to encode temporal representations from video clips in a non-parametric manner. The careful design enables the representation learning of both video-text multimodal inputs and unimodal inputs using a unified backbone model. Our pre-trained all-in-one Transformer is transferred to various downstream video-text tasks after fine-tuning, including text-video retrieval, video-question answering, multiple choice and visual commonsense reasoning. State-of-the-art performances with the minimal model FLOPs on nine datasets demonstrate the superiority of our method compared to the competitive counterparts. + + + + Making Vision Transformers Efficient From a Token Sparsification View http://openaccess.thecvf.com//content/CVPR2023/papers/Chang_Making_Vision_Transformers_Efficient_From_a_Token_Sparsification_View_CVPR_2023_paper.pdf The quadratic computational complexity with respect to the number of tokens limits the practical applications of Vision Transformers (ViTs). Several works propose to prune redundant tokens to achieve efficient ViTs. However, these methods generally suffer from (i) dramatic accuracy drops, (ii) application difficulty in the local vision transformer, and (iii) non-general-purpose networks for downstream tasks. In this work, we propose a novel Semantic Token ViT (STViT), for efficient global and local vision transformers, which can also be revised to serve as a backbone for downstream tasks. The semantic tokens represent cluster centers, and they are initialized by pooling image tokens in space and recovered by attention, which can adaptively represent global or local semantic information. Due to the cluster properties, a few semantic tokens can attain the same effect as vast image tokens, for both global and local vision transformers. For instance, only 16 semantic tokens on DeiT-(Tiny,Small,Base) can achieve the same accuracy with more than 100% inference speed improvement and nearly 60% FLOPs reduction; on Swin-(Tiny,Small,Base), we can employ 16 semantic tokens in each window to further speed it up by around 20% with a slight accuracy increase. Besides great success in image classification, we also extend our method to video recognition. In addition, we design a STViT-R(ecovery) network to restore the detailed spatial information based on the STViT, making it work for downstream tasks, something previous token sparsification methods cannot do. Experiments demonstrate that our method can achieve competitive results compared to the original networks in object detection and instance segmentation, with over 30% FLOPs reduction for the backbone. 
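A minimal sketch of the semantic-token idea described in the STViT entry above, under simplifying assumptions: initialise a handful of tokens by spatial pooling, then let each of them re-aggregate the full token set with one round of attention. Shapes and names are illustrative, not the released STViT code.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def semantic_tokens(image_tokens, grid=(4, 4)):
    """Pool image tokens into a few 'cluster centre' tokens, then refine them
    with one round of cross-attention over all image tokens.

    image_tokens: (H, W, C) patch embeddings.
    Returns (grid[0]*grid[1], C) semantic tokens standing in for the H*W tokens.
    """
    H, W, C = image_tokens.shape
    gh, gw = grid
    # 1) initialisation: average-pool each spatial cell -> one token per cell
    pooled = image_tokens.reshape(gh, H // gh, gw, W // gw, C).mean(axis=(1, 3))
    queries = pooled.reshape(-1, C)               # (S, C) semantic tokens
    keys = values = image_tokens.reshape(-1, C)   # (N, C) all image tokens
    # 2) recovery: each semantic token re-aggregates the image tokens it covers
    attn = softmax(queries @ keys.T / np.sqrt(C), axis=-1)  # (S, N)
    return attn @ values                           # (S, C)

tokens = np.random.randn(16, 16, 64)     # 256 image tokens
print(semantic_tokens(tokens).shape)     # (16, 64): 16 semantic tokens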
+ + + + RefCLIP: A Universal Teacher for Weakly Supervised Referring Expression Comprehension http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_RefCLIP_A_Universal_Teacher_for_Weakly_Supervised_Referring_Expression_Comprehension_CVPR_2023_paper.pdf Referring Expression Comprehension (REC) is a task of grounding the referent based on an expression, and its development is greatly limited by expensive instance-level annotations. Most existing weakly supervised methods are built on two-stage detection networks, which are computationally expensive. In this paper, we resort to the efficient one-stage detector and propose a novel weakly supervised model called RefCLIP. Specifically, RefCLIP redefines weakly supervised REC as an anchor-text matching problem, which can avoid the complex post-processing in existing methods. To achieve weakly supervised learning, we introduce anchor-based contrastive loss to optimize RefCLIP via numerous anchor-text pairs. Based on RefCLIP, we further propose the first model-agnostic weakly supervised training scheme for existing REC models, where RefCLIP acts as a mature teacher to generate pseudo-labels for teaching common REC models. With our careful designs, this scheme can even help existing REC models, e.g., TransVG and SimREC, achieve better weakly supervised performance than RefCLIP itself. To validate our approaches, we conduct extensive experiments on four REC benchmarks, i.e., RefCOCO, RefCOCO+, RefCOCOg and ReferItGame. Experimental results not only report our significant performance gains over existing weakly supervised models, e.g., +24.87% on RefCOCO, but also show a 5x faster inference speed. Project: https://refclip.github.io. + + + + Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild http://openaccess.thecvf.com//content/CVPR2023/papers/Saha_Re-IQA_Unsupervised_Learning_for_Image_Quality_Assessment_in_the_Wild_CVPR_2023_paper.pdf Automatic Perceptual Image Quality Assessment is a challenging problem that impacts billions of internet and social media users daily. To advance research in this field, we propose a Mixture of Experts approach to train two separate encoders to learn high-level content and low-level image quality features in an unsupervised setting. The unique novelty of our approach is its ability to generate low-level representations of image quality that are complementary to high-level features representing image content. We refer to the framework used to train the two encoders as Re-IQA. For Image Quality Assessment in the Wild, we deploy the complementary low and high-level image representations obtained from the Re-IQA framework to train a linear regression model, which is used to map the image representations to the ground-truth quality scores (see Figure 1). Our method achieves state-of-the-art performance on multiple large-scale image quality assessment databases containing both real and synthetic distortions, demonstrating how deep neural networks can be trained in an unsupervised setting to produce perceptually relevant representations. We conclude from our experiments that the low and high-level features obtained are indeed complementary and positively impact the performance of the linear regressor. A public release of all the codes associated with this work will be made available on GitHub. 
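The final stage described in the Re-IQA entry above, mapping frozen content and quality representations to quality scores with a linear model, reduces to plumbing of the following shape; the two encoders are stubbed out with random projections, and a ridge-regularised least-squares fit stands in for the paper's linear regressor.

import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for the two frozen encoders: in the paper these are learned with
# contrastive objectives; here they are just fixed random projections.
def content_features(images):   # high-level content representation
    return images.reshape(len(images), -1) @ rng.normal(size=(32 * 32, 128))

def quality_features(images):   # low-level quality representation
    return images.reshape(len(images), -1) @ rng.normal(size=(32 * 32, 128))

images = rng.normal(size=(200, 32, 32))
mos = rng.uniform(1, 5, size=200)           # ground-truth quality scores (toy)

# Concatenate the complementary representations and fit a ridge regressor.
X = np.concatenate([content_features(images), quality_features(images)], axis=1)
lam = 1e-2
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ mos)
pred = X @ w
print("train RMSE:", np.sqrt(np.mean((pred - mos) ** 2)))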
+ + + + Catch Missing Details: Image Reconstruction With Frequency Augmented Variational Autoencoder http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Catch_Missing_Details_Image_Reconstruction_With_Frequency_Augmented_Variational_Autoencoder_CVPR_2023_paper.pdf The popular VQ-VAE models reconstruct images through learning a discrete codebook but suffer from a significant issue in the rapid quality degradation of image reconstruction as the compression rate rises. One major reason is that a higher compression rate induces more loss of visual signals on the higher frequency spectrum, which reflect the details in pixel space. In this paper, a Frequency Complement Module (FCM) architecture is proposed to capture the missing frequency information for enhancing reconstruction quality. The FCM can be easily incorporated into the VQ-VAE structure, and we refer to the new model as Frequency Augmented VAE (FA-VAE). In addition, a Dynamic Spectrum Loss (DSL) is introduced to guide the FCMs to balance between various frequencies dynamically for optimal reconstruction. FA-VAE is further extended to the text-to-image synthesis task, and a Cross-attention Autoregressive Transformer (CAT) is proposed to obtain more precise semantic attributes in texts. Extensive reconstruction experiments with different compression rates are conducted on several benchmark datasets, and the results demonstrate that the proposed FA-VAE is able to restore the details more faithfully than SOTA methods. CAT also shows improved generation quality with better image-text semantic alignment. + + + + Rotation-Invariant Transformer for Point Cloud Matching http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Rotation-Invariant_Transformer_for_Point_Cloud_Matching_CVPR_2023_paper.pdf The intrinsic rotation invariance lies at the core of matching point clouds with handcrafted descriptors. However, it is widely despised by recent deep matchers that obtain the rotation invariance extrinsically via data augmentation. As the finite number of augmented rotations can never span the continuous SO(3) space, these methods usually show instability when facing rotations that are rarely seen. To this end, we introduce RoITr, a Rotation-Invariant Transformer to cope with the pose variations in the point cloud matching task. We contribute both on the local and global levels. Starting from the local level, we introduce an attention mechanism embedded with Point Pair Feature (PPF)-based coordinates to describe the pose-invariant geometry, upon which a novel attention-based encoder-decoder architecture is constructed. We further propose a global transformer with rotation-invariant cross-frame spatial awareness learned by the self-attention mechanism, which significantly improves the feature distinctiveness and makes the model robust with respect to the low overlap. Experiments are conducted on both the rigid and non-rigid public benchmarks, where RoITr outperforms all the state-of-the-art models by a considerable margin in the low-overlapping scenarios. Especially when the rotations are enlarged on the challenging 3DLoMatch benchmark, RoITr surpasses the existing methods by at least 13 and 5 percentage points in terms of Inlier Ratio and Registration Recall, respectively. + + + + Habitat-Matterport 3D Semantics Dataset http://openaccess.thecvf.com//content/CVPR2023/papers/Yadav_Habitat-Matterport_3D_Semantics_Dataset_CVPR_2023_paper.pdf We present the Habitat-Matterport 3D Semantics (HM3DSEM) dataset. 
HM3DSEM is the largest dataset of 3D real-world spaces with densely annotated semantics that is currently available to the academic community. It consists of 142,646 object instance annotations across 216 3D spaces and 3,100 rooms within those spaces. The scale, quality, and diversity of object annotations far exceed those of prior datasets. A key difference setting apart HM3DSEM from other datasets is the use of texture information to annotate pixel-accurate object boundaries. We demonstrate the effectiveness of the HM3DSEM dataset for the Object Goal Navigation task using different methods. Policies trained using HM3DSEM outperform those trained on prior datasets. Introduction of HM3DSEM in the Habitat ObjectNav Challenge led to an increase in participation from 400 submissions in 2021 to 1022 submissions in 2022. Project page: https://aihabitat.org/datasets/hm3d-semantics/ + + + + EDGE: Editable Dance Generation From Music http://openaccess.thecvf.com//content/CVPR2023/papers/Tseng_EDGE_Editable_Dance_Generation_From_Music_CVPR_2023_paper.pdf Dance is an important human art form, but creating new dances can be difficult and time-consuming. In this work, we introduce Editable Dance GEneration (EDGE), a state-of-the-art method for editable dance generation that is capable of creating realistic, physically-plausible dances while remaining faithful to the input music. EDGE uses a transformer-based diffusion model paired with Jukebox, a strong music feature extractor, and confers powerful editing capabilities well-suited to dance, including joint-wise conditioning and in-betweening. We introduce a new metric for physical plausibility, and evaluate dance quality generated by our method extensively through (1) multiple quantitative metrics on physical plausibility, alignment, and diversity benchmarks, and more importantly, (2) a large-scale user study, demonstrating a significant improvement over previous state-of-the-art methods. Qualitative samples from our model can be found at our website. + + + + Curricular Contrastive Regularization for Physics-Aware Single Image Dehazing http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_Curricular_Contrastive_Regularization_for_Physics-Aware_Single_Image_Dehazing_CVPR_2023_paper.pdf Considering the ill-posed nature, contrastive regularization has been developed for single image dehazing, introducing the information from negative images as a lower bound. However, the contrastive samples are nonconsensual, as the negatives are usually represented distantly from the clear (i.e., positive) image, leaving the solution space still under-constricted. Moreover, the interpretability of deep dehazing models is underexplored towards the physics of the hazing process. In this paper, we propose a novel curricular contrastive regularization targeted at a consensual contrastive space as opposed to a non-consensual one. Our negatives, which provide better lower-bound constraints, can be assembled from 1) the hazy image, and 2) corresponding restorations by other existing methods. Further, due to the different similarities between the embeddings of the clear image and negatives, the learning difficulty of the multiple components is intrinsically imbalanced. To tackle this issue, we customize a curriculum learning strategy to reweight the importance of different negatives. In addition, to improve the interpretability in the feature space, we build a physics-aware dual-branch unit according to the atmospheric scattering model. 
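The atmospheric scattering model that the physics-aware dual-branch unit above builds on is the standard hazing equation I(x) = J(x)t(x) + A(1 - t(x)); the C2PNet abstract continues below. The short sketch that follows synthesises haze from a clear image and inverts the model when (t, A) are known, using made-up values; it is only the textbook physics, not anything specific to C2PNet.

import numpy as np

def add_haze(J, t, A):
    """Atmospheric scattering model: I = J*t + A*(1 - t).

    J: clear scene radiance in [0, 1], t: per-pixel transmission in (0, 1],
    A: global atmospheric light.
    """
    return J * t + A * (1.0 - t)

def dehaze(I, t, A, t_min=0.1):
    """Invert the model when transmission and airlight are known/estimated."""
    return (I - A) / np.maximum(t, t_min) + A

rng = np.random.default_rng(0)
J = rng.uniform(0, 1, size=(4, 4, 3))     # clear image
depth = rng.uniform(0, 1, size=(4, 4, 1))
t = np.exp(-1.5 * depth)                  # t = exp(-beta * depth)
A = 0.9
I = add_haze(J, t, A)
print(np.allclose(dehaze(I, t, A), J))    # True: the model is invertible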
With the unit, as well as curricular contrastive regularization, we establish our dehazing network, named C2PNet. Extensive experiments demonstrate that our C2PNet significantly outperforms state-of-the-art methods, with extreme PSNR boosts of 3.94dB and 1.50dB, respectively, on the SOTS-indoor and SOTS-outdoor datasets. Code is available at https://github.com/YuZheng9/C2PNet. + + + + Sharpness-Aware Gradient Matching for Domain Generalization http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Sharpness-Aware_Gradient_Matching_for_Domain_Generalization_CVPR_2023_paper.pdf The goal of domain generalization (DG) is to enhance the generalization capability of the model learned from a source domain to other unseen domains. The recently developed Sharpness-Aware Minimization (SAM) method aims to achieve this goal by minimizing the sharpness measure of the loss landscape. Though SAM and its variants have demonstrated impressive DG performance, they may not always converge to the desired flat region with a small loss value. In this paper, we present two conditions to ensure that the model could converge to a flat minimum with a small loss, and present an algorithm, named Sharpness-Aware Gradient Matching (SAGM), to meet the two conditions for improving model generalization capability. Specifically, the optimization objective of SAGM will simultaneously minimize the empirical risk, the perturbed loss (i.e., the maximum loss within a neighborhood in the parameter space), and the gap between them. By implicitly aligning the gradient directions between the empirical risk and the perturbed loss, SAGM improves the generalization capability over SAM and its variants without increasing the computational cost. Extensive experimental results show that our proposed SAGM method consistently outperforms the state-of-the-art methods on five DG benchmarks, including PACS, VLCS, OfficeHome, TerraIncognita, and DomainNet. Codes are available at https://github.com/Wang-pengfei/SAGM. + + + + Bi-Directional Feature Fusion Generative Adversarial Network for Ultra-High Resolution Pathological Image Virtual Re-Staining http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Bi-Directional_Feature_Fusion_Generative_Adversarial_Network_for_Ultra-High_Resolution_Pathological_CVPR_2023_paper.pdf The cost of pathological examination makes virtual re-staining of pathological images meaningful. However, due to the ultra-high resolution of pathological images, traditional virtual re-staining methods have to divide a WSI image into patches for model training and inference. Such a limitation leads to a lack of global information, resulting in observable differences in color, brightness and contrast when the re-stained patches are merged to generate an image of larger size. We summarize this issue as the square effect. Some existing methods try to solve this issue through overlapping between patches or simple post-processing. However, the former is not very effective, while the latter requires careful tuning. In order to eliminate the square effect, we design a bi-directional feature fusion generative adversarial network (BFF-GAN) with a global branch and a local branch. It learns the inter-patch connections through the fusion of global and local features plus patch-wise attention. We perform experiments on both the private dataset RCC and the public dataset ANHIR. 
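Relating to the SAGM entry above (the re-staining abstract continues below): the toy sketch here shows the SAM-style ingredients SAGM reasons about, namely the empirical loss, the perturbed loss at w + rho * g/||g||, and the gap between them, on a least-squares problem. The way the terms are combined in the update is an illustrative guess, not the paper's exact objective.

import numpy as np

def loss(w, X, y):
    return 0.5 * np.mean((X @ w - y) ** 2)

def grad(w, X, y):
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 5))
y = X @ rng.normal(size=5) + 0.1 * rng.normal(size=64)

w = np.zeros(5)
rho, lr = 0.05, 0.1
for _ in range(200):
    g = grad(w, X, y)                             # gradient of the empirical loss
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # ascent step to the rho-ball boundary
    g_adv = grad(w + eps, X, y)                   # gradient of the perturbed loss
    # Descending on empirical + perturbed loss also presses on their gap
    # (a sharpness surrogate); SAGM's exact weighting and alignment differ.
    w -= lr * (g + g_adv)

g = grad(w, X, y)
w_adv = w + rho * g / (np.linalg.norm(g) + 1e-12)
print("empirical loss:", round(float(loss(w, X, y)), 5))
print("sharpness gap :", round(float(loss(w_adv, X, y) - loss(w, X, y)), 6))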
The results show that our model achieves competitive performance and is able to generate highly realistic images that can deceive even experienced pathologists, which indicates its great clinical significance. + + + + Towards Practical Plug-and-Play Diffusion Models http://openaccess.thecvf.com//content/CVPR2023/papers/Go_Towards_Practical_Plug-and-Play_Diffusion_Models_CVPR_2023_paper.pdf Diffusion-based generative models have achieved remarkable success in image generation. Their guidance formulation allows an external model to plug-and-play control the generation process for various tasks without fine-tuning the diffusion model. However, the direct use of publicly available off-the-shelf models for guidance fails due to their poor performance on noisy inputs. For that, the existing practice is to fine-tune the guidance models with labeled data corrupted with noises. In this paper, we argue that this practice has limitations in two aspects: (1) handling inputs with extremely varied noise levels is too hard for a single guidance model; (2) collecting labeled datasets hinders scaling up for various tasks. To tackle the limitations, we propose a novel strategy that leverages multiple experts where each expert is specialized in a particular noise range and guides the reverse process of the diffusion at its corresponding timesteps. However, as it is infeasible to manage multiple networks and utilize labeled data, we present a practical guidance framework termed Practical Plug-And-Play (PPAP), which leverages parameter-efficient fine-tuning and data-free knowledge transfer. We exhaustively conduct ImageNet class-conditional generation experiments to show that our method can successfully guide diffusion with a small number of trainable parameters and no labeled data. Finally, we show that image classifiers, depth estimators, and semantic segmentation models can guide publicly available GLIDE through our framework in a plug-and-play manner. Our code is available at https://github.com/riiid/PPAP. + + + + YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_YOLOv7_Trainable_Bag-of-Freebies_Sets_New_State-of-the-Art_for_Real-Time_Object_Detectors_CVPR_2023_paper.pdf Real-time object detection is one of the most important research topics in computer vision. As new approaches regarding architecture optimization and training optimization are continually being developed, we have found two research topics that have emerged when dealing with these latest state-of-the-art methods. To address the topics, we propose a trainable bag-of-freebies oriented solution. We combine the flexible and efficient training tools with the proposed architecture and the compound scaling method. YOLOv7 surpasses all known object detectors in both speed and accuracy in the range from 5 FPS to 120 FPS and has the highest accuracy, 56.8% AP, among all known real-time object detectors with 30 FPS or higher on GPU V100. Source code is released at https://github.com/WongKinYiu/yolov7. + + + + PartDistillation: Learning Parts From Instance Segmentation http://openaccess.thecvf.com//content/CVPR2023/papers/Cho_PartDistillation_Learning_Parts_From_Instance_Segmentation_CVPR_2023_paper.pdf We present a scalable framework to learn part segmentation from object instance labels. State-of-the-art instance segmentation models contain a surprising amount of part information. However, much of this information is hidden from plain view. 
For each object instance, the part information is noisy, inconsistent, and incomplete. PartDistillation transfers the part information of an instance segmentation model into a part segmentation model through self-supervised self-training on a large dataset. The resulting segmentation model is robust, accurate, and generalizes well. We evaluate the model on various part segmentation datasets. Our model outperforms supervised part segmentation in zero-shot generalization performance by a large margin. Our model also outperforms its supervised counterpart and other baselines when fine-tuned on target datasets, especially in the few-shot regime. Finally, our model provides a wider coverage of rare parts when evaluated over 10K object classes. Code is at https://github.com/facebookresearch/PartDistillation. + + + + Boosting Video Object Segmentation via Space-Time Correspondence Learning http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Boosting_Video_Object_Segmentation_via_Space-Time_Correspondence_Learning_CVPR_2023_paper.pdf Current top-leading solutions for video object segmentation (VOS) typically follow a matching-based regime: for each query frame, the segmentation mask is inferred according to its correspondence to previously processed and the first annotated frames. They simply exploit the supervisory signals from the groundtruth masks for learning mask prediction only, without posing any constraint on the space-time correspondence matching, which, however, is the fundamental building block of such a regime. To alleviate this crucial yet commonly ignored issue, we devise a correspondence-aware training framework, which boosts matching-based VOS solutions by explicitly encouraging robust correspondence matching during network learning. Through comprehensively exploring the intrinsic coherence in videos on pixel and object levels, our algorithm reinforces the standard, fully supervised training of mask segmentation with label-free, contrastive correspondence learning. Without requiring extra annotation cost during training, causing speed delays during deployment, or incurring architectural modifications, our algorithm provides solid performance gains on four widely used benchmarks, i.e., DAVIS2016&2017 and YouTube-VOS2018&2019, on top of famous matching-based VOS solutions. Our implementation will be released. + + + + Towards Realistic Long-Tailed Semi-Supervised Learning: Consistency Is All You Need http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Towards_Realistic_Long-Tailed_Semi-Supervised_Learning_Consistency_Is_All_You_Need_CVPR_2023_paper.pdf While long-tailed semi-supervised learning (LTSSL) has received tremendous attention in many real-world classification problems, existing LTSSL algorithms typically assume that the class distributions of labeled and unlabeled data are almost identical. Those LTSSL algorithms built upon this assumption can severely suffer when the class distributions of labeled and unlabeled data are mismatched since they utilize biased pseudo-labels from the model. To alleviate this issue, we propose a new simple method that can effectively utilize unlabeled data of unknown class distributions by introducing the adaptive consistency regularizer (ACR). ACR realizes the dynamic refinement of pseudo-labels for various distributions in a unified formula by estimating the true class distribution of unlabeled data. 
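One generic way to realise the kind of distribution-aware pseudo-label refinement the ACR entry above describes (its abstract continues below) is logit adjustment with an estimated class prior, sketched here with invented priors; this is a common recipe in long-tailed semi-supervised learning, not necessarily ACR's exact formula.

import numpy as np

def softmax(z, axis=-1):
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refine_pseudo_labels(logits, train_prior, est_prior, tau=1.0):
    """Shift logits from the labeled-set prior toward the estimated unlabeled prior.

    logits: (N, K) model outputs on unlabeled data.
    train_prior: (K,) class frequencies of the labeled set (what the model is biased to).
    est_prior: (K,) running estimate of the true class distribution of the unlabeled set.
    """
    adjusted = logits + tau * (np.log(est_prior + 1e-12) - np.log(train_prior + 1e-12))
    probs = softmax(adjusted)
    return probs.argmax(axis=1), probs.max(axis=1)   # pseudo-labels, confidences

rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 3)) + np.log([0.7, 0.2, 0.1])  # model biased to the head class
train_prior = np.array([0.7, 0.2, 0.1])                     # labeled (long-tailed) distribution
est_prior = np.array([0.2, 0.3, 0.5])                       # estimated (mismatched) unlabeled distribution
labels, conf = refine_pseudo_labels(logits, train_prior, est_prior)
print(labels, conf.round(2))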
Despite its simplicity, we show that ACR achieves state-of-the-art performance on a variety of standard LTSSL benchmarks, e.g., an averaged 10% absolute increase of test accuracy against existing algorithms when the class distributions of labeled and unlabeled data are mismatched. Even when the class distributions are identical, ACR consistently outperforms many sophisticated LTSSL algorithms. We carry out extensive ablation studies to tease apart the factors that are most important to ACR's success. Source code is available at https://github.com/Gank0078/ACR. + + + + GAPartNet: Cross-Category Domain-Generalizable Object Perception and Manipulation via Generalizable and Actionable Parts + http://openaccess.thecvf.com//content/CVPR2023/papers/Geng_GAPartNet_Cross-Category_Domain-Generalizable_Object_Perception_and_Manipulation_via_Generalizable_and_CVPR_2023_paper.pdf + For years, researchers have been devoted to generalizable object perception and manipulation, where cross-category generalizability is highly desired yet underexplored. In this work, we propose to learn such cross-category skills via Generalizable and Actionable Parts (GAParts). By identifying and defining 9 GAPart classes (lids, handles, etc.) in 27 object categories, we construct a large-scale part-centric interactive dataset, GAPartNet, where we provide rich, part-level annotations (semantics, poses) for 8,489 part instances on 1,166 objects. Based on GAPartNet, we investigate three cross-category tasks: part segmentation, part pose estimation, and part-based object manipulation. Given the significant domain gaps between seen and unseen object categories, we propose a robust 3D segmentation method from the perspective of domain generalization by integrating adversarial learning techniques. Our method outperforms all existing methods by a large margin, no matter on seen or unseen categories. Furthermore, with part segmentation and pose estimation results, we leverage the GAPart pose definition to design part-based manipulation heuristics that can generalize well to unseen object categories in both the simulator and the real world. + + + + OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_OmniObject3D_Large-Vocabulary_3D_Object_Dataset_for_Realistic_Perception_Reconstruction_and_CVPR_2023_paper.pdf + Recent advances in modeling 3D objects mostly rely on synthetic datasets due to the lack of large-scale real-scanned 3D databases. To facilitate the development of 3D perception, reconstruction, and generation in the real world, we propose OmniObject3D, a large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects. OmniObject3D has several appealing properties: 1) Large Vocabulary: It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popular 2D datasets (e.g., ImageNet and LVIS), benefiting the pursuit of generalizable 3D representations. 2) Rich Annotations: Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos. 3) Realistic Scans: The professional scanners support high-quality object scans with precise shapes and realistic appearances. With the vast exploration space offered by OmniObject3D, we carefully set up four evaluation tracks: a) robust 3D perception, b) novel-view synthesis, c) neural surface reconstruction, and d) 3D object generation. 
Extensive studies are performed on these four benchmarks, revealing new observations, challenges, and opportunities for future research in realistic 3D vision. + + + + Uncovering the Disentanglement Capability in Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Uncovering_the_Disentanglement_Capability_in_Text-to-Image_Diffusion_Models_CVPR_2023_paper.pdf + Generative models have been widely studied in computer vision. Recently, diffusion models have drawn substantial attention due to the high quality of their generated images. A key desired property of image generative models is the ability to disentangle different attributes, which should enable modification towards a style without changing the semantic content, and the modification parameters should generalize to different images. Previous studies have found that generative adversarial networks (GANs) are inherently endowed with such disentanglement capability, so they can perform disentangled image editing without re-training or fine-tuning the network. In this work, we explore whether diffusion models are also inherently equipped with such a capability. Our finding is that for stable diffusion models, by partially changing the input text embedding from a neutral description (e.g., "a photo of person") to one with style (e.g., "a photo of person with smile") while fixing all the Gaussian random noises introduced during the denoising process, the generated images can be modified towards the target style without changing the semantic content. Based on this finding, we further propose a simple, light-weight image editing algorithm where the mixing weights of the two text embeddings are optimized for style matching and content preservation. This entire process only involves optimizing over around 50 parameters and does not fine-tune the diffusion model itself. Experiments show that the proposed method can modify a wide range of attributes, with the performance outperforming diffusion-model-based image-editing algorithms that require fine-tuning. The optimized weights generalize well to different images. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffusionDisentanglement. + + + + EXIF As Language: Learning Cross-Modal Associations Between Images and Camera Metadata + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_EXIF_As_Language_Learning_Cross-Modal_Associations_Between_Images_and_Camera_CVPR_2023_paper.pdf + We learn a visual representation that captures information about the camera that recorded a given photo. To do this, we train a multimodal embedding between image patches and the EXIF metadata that cameras automatically insert into image files. Our model represents this metadata by simply converting it to text and then processing it with a transformer. The features that we learn significantly outperform other self-supervised and supervised features on downstream image forensics and calibration tasks. In particular, we successfully localize spliced image regions "zero shot" by clustering the visual embeddings for all of the patches within an image. + + + + TrojDiff: Trojan Attacks on Diffusion Models With Diverse Targets + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_TrojDiff_Trojan_Attacks_on_Diffusion_Models_With_Diverse_Targets_CVPR_2023_paper.pdf + Diffusion models have achieved great success in a range of tasks, such as image synthesis and molecule design. 
As such successes hinge on large-scale training data collected from diverse sources, the trustworthiness of these collected data is hard to control or audit. In this work, we aim to explore the vulnerabilities of diffusion models under potential training data manipulations and try to answer: How hard is it to perform Trojan attacks on well-trained diffusion models? What are the adversarial targets that such Trojan attacks can achieve? To answer these questions, we propose an effective Trojan attack against diffusion models, TrojDiff, which optimizes the Trojan diffusion and generative processes during training. In particular, we design novel transitions during the Trojan diffusion process to diffuse adversarial targets into a biased Gaussian distribution and propose a new parameterization of the Trojan generative process that leads to an effective training objective for the attack. In addition, we consider three types of adversarial targets: the Trojaned diffusion models will always output instances belonging to a certain class from the in-domain distribution (In-D2D attack), out-of-domain distribution (Out-D2D-attack), and one specific instance (D2I attack). We evaluate TrojDiff on CIFAR-10 and CelebA datasets against both DDPM and DDIM diffusion models. We show that TrojDiff always achieves high attack performance under different adversarial targets using different types of triggers, while the performance in benign environments is preserved. The code is available at https://github.com/chenweixin107/TrojDiff. + + + + Mitigating Task Interference in Multi-Task Learning via Explicit Task Routing With Non-Learnable Primitives + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_Mitigating_Task_Interference_in_Multi-Task_Learning_via_Explicit_Task_Routing_CVPR_2023_paper.pdf + Multi-task learning (MTL) seeks to learn a single model to accomplish multiple tasks by leveraging shared information among the tasks. Existing MTL models, however, have been known to suffer from negative interference among tasks. Efforts to mitigate task interference have focused on either loss/gradient balancing or implicit parameter partitioning with partial overlaps among the tasks. In this paper, we propose ETR-NLP to mitigate task interference through a synergistic combination of non-learnable primitives (NLPs) and explicit task routing (ETR). Our key idea is to employ non-learnable primitives to extract a diverse set of task-agnostic features and recombine them into a shared branch common to all tasks and explicit task-specific branches reserved for each task. The non-learnable primitives and the explicit decoupling of learnable parameters into shared and task-specific ones afford the flexibility needed for minimizing task interference. We evaluate the efficacy of ETR-NLP networks for both image-level classification and pixel-level dense prediction MTL problems. Experimental results indicate that ETR-NLP significantly outperforms state-of-the-art baselines with fewer learnable parameters and similar FLOPs across all datasets. Code is available at this URL. + + + + Neural Kernel Surface Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Neural_Kernel_Surface_Reconstruction_CVPR_2023_paper.pdf + We present a novel method for reconstructing a 3D implicit surface from a large-scale, sparse, and noisy point cloud. Our approach builds upon the recently introduced Neural Kernel Fields (NKF) representation. 
It enjoys similar generalization capabilities to NKF, while simultaneously addressing its main limitations: (a) We can scale to large scenes through compactly supported kernel functions, which enable the use of memory-efficient sparse linear solvers. (b) We are robust to noise, through a gradient fitting solve. (c) We minimize training requirements, enabling us to learn from any dataset of dense oriented points, and even mix training data consisting of objects and scenes at different scales. Our method is capable of reconstructing millions of points in a few seconds, and handling very large scenes in an out-of-core fashion. We achieve state-of-the-art results on reconstruction benchmarks consisting of single objects (ShapeNet, ABC), indoor scenes (ScanNet, Matterport3D), and outdoor scenes (CARLA, Waymo). + + + + Multilateral Semantic Relations Modeling for Image Text Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Multilateral_Semantic_Relations_Modeling_for_Image_Text_Retrieval_CVPR_2023_paper.pdf + Image-text retrieval is a fundamental task to bridge vision and language by exploiting various strategies to fine-grained alignment between regions and words. This is still tough mainly because of one-to-many correspondence, where a set of matches from another modality can be accessed by a random query. While existing solutions to this problem including multi-point mapping, probabilistic distribution, and geometric embedding have made promising progress, one-to-many correspondence is still under-explored. In this work, we develop a Multilateral Semantic Relations Modeling (termed MSRM) for image-text retrieval to capture the one-to-many correspondence between multiple samples and a given query via hypergraph modeling. Specifically, a given query is first mapped as a probabilistic embedding to learn its true semantic distribution based on Mahalanobis distance. Then each candidate instance in a mini-batch is regarded as a hypergraph node with its mean semantics while a Gaussian query is modeled as a hyperedge to capture the semantic correlations beyond the pair between candidate points and the query. Comprehensive experimental results on two widely used datasets demonstrate that our MSRM method can outperform state-of-the-art methods in the settlement of multiple matches while still maintaining the comparable performance of instance-level matching. Our codes and checkpoints will be released soon. + + + + Optimization-Inspired Cross-Attention Transformer for Compressive Sensing + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_Optimization-Inspired_Cross-Attention_Transformer_for_Compressive_Sensing_CVPR_2023_paper.pdf + By integrating certain optimization solvers with deep neural networks, deep unfolding network (DUN) with good interpretability and high performance has attracted growing attention in compressive sensing (CS). However, existing DUNs often improve the visual quality at the price of a large number of parameters and have the problem of feature information loss during iteration. In this paper, we propose an Optimization-inspired Cross-attention Transformer (OCT) module as an iterative process, leading to a lightweight OCT-based Unfolding Framework (OCTUF) for image CS. Specifically, we design a novel Dual Cross Attention (Dual-CA) sub-module, which consists of an Inertia-Supplied Cross Attention (ISCA) block and a Projection-Guided Cross Attention (PGCA) block. 
The ISCA block introduces multi-channel inertia forces and increases the memory effect via a cross-attention mechanism between adjacent iterations. The PGCA block achieves an enhanced information interaction, introducing the inertia force into the gradient descent step through a cross-attention block. Extensive CS experiments demonstrate that our OCTUF achieves superior performance compared to state-of-the-art methods while requiring lower training complexity. Codes are available at https://github.com/songjiechong/OCTUF. + + + + Normalizing Flow Based Feature Synthesis for Outlier-Aware Object Detection http://openaccess.thecvf.com//content/CVPR2023/papers/Kumar_Normalizing_Flow_Based_Feature_Synthesis_for_Outlier-Aware_Object_Detection_CVPR_2023_paper.pdf Real-world deployment of reliable object detectors is crucial for applications such as autonomous driving. However, general-purpose object detectors like Faster R-CNN are prone to providing overconfident predictions for outlier objects. Recent outlier-aware object detection approaches estimate the density of instance-wide features with class-conditional Gaussians and train on synthesized outlier features from their low-likelihood regions. However, this strategy does not guarantee that the synthesized outlier features will have a low likelihood according to the other class-conditional Gaussians. We propose a novel outlier-aware object detection framework that distinguishes outliers from inlier objects by learning the joint data distribution of all inlier classes with an invertible normalizing flow. The appropriate sampling of the flow model ensures that the synthesized outliers have a lower likelihood than inliers of all object classes, thereby modeling a better decision boundary between inlier and outlier objects. Our approach significantly outperforms the state-of-the-art for outlier-aware object detection on both image and video datasets. + + + + DivClust: Controlling Diversity in Deep Clustering http://openaccess.thecvf.com//content/CVPR2023/papers/Metaxas_DivClust_Controlling_Diversity_in_Deep_Clustering_CVPR_2023_paper.pdf Clustering has been a major research topic in the field of machine learning, one to which Deep Learning has recently been applied with significant success. However, an aspect of clustering that is not addressed by existing deep clustering methods is that of efficiently producing multiple, diverse partitionings for a given dataset. This is particularly important, as a diverse set of base clusterings is necessary for consensus clustering, which has been found to produce better and more robust results than relying on a single clustering. To address this gap, we propose DivClust, a diversity-controlling loss that can be incorporated into existing deep clustering frameworks to produce multiple clusterings with the desired degree of diversity. We conduct experiments with multiple datasets and deep clustering frameworks and show that: a) our method effectively controls diversity across frameworks and datasets with very small additional computational cost, b) the sets of clusterings learned by DivClust include solutions that significantly outperform single-clustering baselines, and c) using an off-the-shelf consensus clustering algorithm, DivClust produces consensus clustering solutions that consistently outperform single-clustering baselines, effectively improving the performance of the base deep clustering framework. 
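A rough sketch of the kind of inter-clustering diversity control the DivClust entry above describes: measure how similar the soft assignments of two clustering heads are and penalise similarity only when it exceeds a target bound. The co-association similarity and hinge-style threshold used here are simplifying assumptions, not the paper's loss.

import numpy as np

def assignment_similarity(P, Q):
    """Similarity between two soft clusterings of the same N samples.

    P: (N, K1), Q: (N, K2) soft cluster-assignment matrices (rows sum to 1).
    Cosine similarity between their co-association matrices.
    """
    A, B = P @ P.T, Q @ Q.T                  # (N, N) co-association matrices
    return float((A * B).sum() / (np.linalg.norm(A) * np.linalg.norm(B)))

def diversity_loss(assignments, target=0.5):
    """Penalise pairs of clusterings whose similarity exceeds the target bound."""
    loss, n_pairs = 0.0, 0
    for i in range(len(assignments)):
        for j in range(i + 1, len(assignments)):
            sim = assignment_similarity(assignments[i], assignments[j])
            loss += max(0.0, sim - target)   # only push apart when too similar
            n_pairs += 1
    return loss / max(n_pairs, 1)

rng = np.random.default_rng(0)
heads = [rng.dirichlet(np.ones(10), size=256) for _ in range(4)]  # 4 clusterings
print(diversity_loss(heads, target=0.5))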
+ + + + Topology-Guided Multi-Class Cell Context Generation for Digital Pathology + http://openaccess.thecvf.com//content/CVPR2023/papers/Abousamra_Topology-Guided_Multi-Class_Cell_Context_Generation_for_Digital_Pathology_CVPR_2023_paper.pdf + In digital pathology, the spatial context of cells is important for cell classification, cancer diagnosis and prognosis. To model such complex cell context, however, is challenging. Cells form different mixtures, lineages, clusters and holes. To model such structural patterns in a learnable fashion, we introduce several mathematical tools from spatial statistics and topological data analysis. We incorporate such structural descriptors into a deep generative model as both conditional inputs and a differentiable loss. This way, we are able to generate high quality multi-class cell layouts for the first time. We show that the topology-rich cell layouts can be used for data augmentation and improve the performance of downstream tasks such as cell classification. + + + + Adaptive Graph Convolutional Subspace Clustering + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Adaptive_Graph_Convolutional_Subspace_Clustering_CVPR_2023_paper.pdf + Spectral-type subspace clustering algorithms have shown excellent performance in many subspace clustering applications. The existing spectral-type subspace clustering algorithms either focus on designing constraints for the reconstruction coefficient matrix or feature extraction methods for finding latent features of original data samples. In this paper, inspired by graph convolutional networks, we use the graph convolution technique to develop a feature extraction method and a coefficient matrix constraint simultaneously. And the graph-convolutional operator is updated iteratively and adaptively in our proposed algorithm. Hence, we call the proposed method adaptive graph convolutional subspace clustering (AGCSC). We claim that, by using AGCSC, the aggregated feature representation of original data samples is suitable for subspace clustering, and the coefficient matrix could reveal the subspace structure of the original data set more faithfully. Finally, plenty of subspace clustering experiments prove our conclusions and show that AGCSC outperforms some related methods as well as some deep models. + + + + Learning Steerable Function for Efficient Image Resampling + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Learning_Steerable_Function_for_Efficient_Image_Resampling_CVPR_2023_paper.pdf + Image resampling is a basic technique that is widely employed in daily applications. Existing deep neural networks (DNNs) have made impressive progress in resampling performance. Yet these methods are still not the perfect substitute for interpolation, due to the issues of efficiency and continuous resampling. In this work, we propose a novel method of Learning Resampling Function (termed LeRF), which takes advantage of both the structural priors learned by DNNs and the locally continuous assumption of interpolation methods. Specifically, LeRF assigns spatially-varying steerable resampling functions to input image pixels and learns to predict the hyper-parameters that determine the orientations of these resampling functions with a neural network. To achieve highly efficient inference, we adopt look-up tables (LUTs) to accelerate the inference of the learned neural network. Furthermore, we design a directional ensemble strategy and edge-sensitive indexing patterns to better capture local structures. 
Extensive experiments show that our method runs as fast as interpolation, generalizes well to arbitrary transformations, and outperforms interpolation significantly, e.g., up to 3dB PSNR gain over bicubic for x2 upsampling on Manga109. + + + + Cut and Learn for Unsupervised Object Detection and Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Cut_and_Learn_for_Unsupervised_Object_Detection_and_Instance_Segmentation_CVPR_2023_paper.pdf + We propose Cut-and-LEaRn (CutLER), a simple approach for training unsupervised object detection and segmentation models. We leverage the property of self-supervised models to 'discover' objects without supervision and amplify it to train a state-of-the-art localization model without any human labels. CutLER first uses our proposed MaskCut approach to generate coarse masks for multiple objects in an image, and then learns a detector on these masks using our robust loss function. We further improve performance by self-training the model on its predictions. Compared to prior work, CutLER is simpler, compatible with different detection architectures, and detects multiple objects. CutLER is also a zero-shot unsupervised detector and improves detection performance AP_50 by over 2.7x on 11 benchmarks across domains like video frames, paintings, sketches, etc. With finetuning, CutLER serves as a low-shot detector surpassing MoCo-v2 by 7.3% AP^box and 6.6% AP^mask on COCO when training with 5% labels. + + + + Privacy-Preserving Adversarial Facial Features + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Privacy-Preserving_Adversarial_Facial_Features_CVPR_2023_paper.pdf + Face recognition service providers protect face privacy by extracting compact and discriminative facial features (representations) from images, and storing the facial features for real-time recognition. However, such features can still be exploited to recover the appearance of the original face by building a reconstruction network. Although several privacy-preserving methods have been proposed, the enhancement of face privacy protection is at the expense of accuracy degradation. In this paper, we propose an adversarial features-based face privacy protection (AdvFace) approach to generate privacy-preserving adversarial features, which can disrupt the mapping from adversarial features to facial images to defend against reconstruction attacks. To this end, we design a shadow model which simulates the attackers' behavior to capture the mapping function from facial features to images and generate adversarial latent noise to disrupt the mapping. The adversarial features rather than the original features are stored in the server's database to prevent leaked features from exposing facial information. Moreover, the AdvFace requires no changes to the face recognition network and can be implemented as a privacy-enhancing plugin in deployed face recognition systems. Extensive experimental results demonstrate that AdvFace outperforms the state-of-the-art face privacy-preserving methods in defending against reconstruction attacks while maintaining face recognition accuracy. 
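The defence described in the AdvFace entry above amounts to perturbing the stored feature so that a shadow reconstruction model fails while matching similarity is largely preserved. The toy sketch below plays this out with linear stand-ins for the encoder and the attacker, so it only illustrates the optimisation pattern, not the paper's networks or threat model.

import numpy as np

rng = np.random.default_rng(0)
D_IMG, D_FEAT = 64, 16

# Linear stand-ins: a "face encoder" and a shadow "reconstruction attacker".
enc = rng.normal(size=(D_IMG, D_FEAT)) / np.sqrt(D_IMG)
dec = np.linalg.pinv(enc)                 # attacker that inverts the encoder

face = rng.normal(size=D_IMG)
feat = face @ enc                         # feature the server would store

def recon_error(f):
    return float(np.sum((f @ dec - face) ** 2))

# Gradient ascent on the attacker's reconstruction error, with a norm budget
# so the perturbed feature still matches the original for recognition.
delta, step, budget = np.zeros(D_FEAT), 0.05, 0.5 * np.linalg.norm(feat)
for _ in range(100):
    g = 2 * ((feat + delta) @ dec - face) @ dec.T   # d/d(delta) of recon error
    delta += step * g
    if np.linalg.norm(delta) > budget:              # project back into the budget
        delta *= budget / np.linalg.norm(delta)

adv = feat + delta
cos = adv @ feat / (np.linalg.norm(adv) * np.linalg.norm(feat))
print("recon error before/after:", round(recon_error(feat), 3), round(recon_error(adv), 3))
print("feature cosine similarity kept at:", round(float(cos), 3))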
+ + + + Exploring the Relationship Between Architectural Design and Adversarially Robust Generalization http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Exploring_the_Relationship_Between_Architectural_Design_and_Adversarially_Robust_Generalization_CVPR_2023_paper.pdf Adversarial training has been demonstrated to be one of the most effective remedies for defending against adversarial examples, yet it often suffers from the huge robustness generalization gap on unseen testing adversaries, deemed as the adversarially robust generalization problem. Despite the preliminary understandings devoted to adversarially robust generalization, little is known from the architectural perspective. To bridge the gap, this paper for the first time systematically investigated the relationship between adversarially robust generalization and architectural design. In particular, we comprehensively evaluated the 20 most representative adversarially trained architectures on the ImageNette and CIFAR-10 datasets towards multiple l_p-norm adversarial attacks. Based on the extensive experiments, we found that, under aligned settings, Vision Transformers (e.g., PVT, CoAtNet) often yield better adversarially robust generalization while CNNs tend to overfit on specific attacks and fail to generalize on multiple adversaries. To better understand the nature behind it, we conduct theoretical analysis via the lens of Rademacher complexity. We revealed the fact that higher weight sparsity contributes significantly to the better adversarially robust generalization of Transformers, which can often be achieved by the specially-designed attention blocks. We hope our paper could help to better understand the mechanism for designing robust DNNs. Our model weights can be found at http://robust.art. + + + + Side Adapter Network for Open-Vocabulary Semantic Segmentation http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Side_Adapter_Network_for_Open-Vocabulary_Semantic_Segmentation_CVPR_2023_paper.pdf This paper presents a new framework for open-vocabulary semantic segmentation with the pre-trained vision-language model, named SAN. Our approach models the semantic segmentation task as a region recognition problem. A side network is attached to a frozen CLIP model with two branches: one for predicting mask proposals, and the other for predicting attention bias which is applied in the CLIP model to recognize the class of masks. This decoupled design eases the recognition of the classes of the mask proposals by CLIP. Since the attached side network can reuse CLIP features, it can be very light. In addition, the entire network can be trained end-to-end, allowing the side network to be adapted to the frozen CLIP model, which makes the predicted mask proposals CLIP-aware. Our approach is fast, accurate, and only adds a few additional trainable parameters. We evaluate our approach on multiple semantic segmentation benchmarks. Our method significantly outperforms other counterparts, with up to 18 times fewer trainable parameters and 19 times faster inference speed. We hope our approach will serve as a solid baseline and help ease future research in open-vocabulary semantic segmentation. + + + + Multi-Centroid Task Descriptor for Dynamic Class Incremental Inference http://openaccess.thecvf.com//content/CVPR2023/papers/Cai_Multi-Centroid_Task_Descriptor_for_Dynamic_Class_Incremental_Inference_CVPR_2023_paper.pdf Incremental learning can be roughly divided into two categories, i.e., class- and task-incremental learning. 
The main difference is whether the task ID is given during evaluation. In this paper, we show that this task information is indeed strong prior knowledge, which brings significant improvement over a class-incremental learning baseline, e.g., DER. Based on this observation, we propose a gate network to predict the task ID for class incremental inference. This is challenging as there is no explicit semantic relationship between the categories within a task. Therefore, we propose a multi-centroid task descriptor by assuming that the data within a task can form multiple clusters. The cluster centers are optimized by pulling relevant sample-centroid pairs together while pushing others away, which ensures that there is at least one centroid close to a given sample. To select relevant pairs, we use class prototypes as proxies and solve a bipartite matching problem, making the task descriptor representative yet not degenerate into a uni-modal one. As a result, our dynamic inference network is trained independently of the baseline and provides a flexible, efficient solution for distinguishing between tasks. Extensive experiments show that our approach achieves state-of-the-art results, e.g., we achieve 72.41% average accuracy on CIFAR100-B0S50, outperforming DER by 3.40%. + + + + Physics-Guided ISO-Dependent Sensor Noise Modeling for Extreme Low-Light Photography + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Physics-Guided_ISO-Dependent_Sensor_Noise_Modeling_for_Extreme_Low-Light_Photography_CVPR_2023_paper.pdf + Although deep neural networks have achieved astonishing performance in many vision tasks, existing learning-based methods are far inferior to the physical model-based solutions in extreme low-light sensor noise modeling. To tap the potential of learning-based sensor noise modeling, we investigate the noise formation in a typical imaging process and propose a novel physics-guided ISO-dependent sensor noise modeling approach. Specifically, we build a normalizing flow-based framework to represent the complex noise characteristics of CMOS camera sensors. Each component of the noise model is dedicated to a particular kind of noise under the guidance of physical models. Moreover, we take the ISO dependence into consideration in the noise model, which is not completely considered by the existing learning-based methods. For training the proposed noise model, a new dataset is further collected with paired noisy-clean images, as well as flat-field and bias frames covering a wide range of ISO settings. Compared to existing methods, the proposed noise model benefits from its flexible structure and accurate modeling capabilities, which can help achieve better denoising performance in extreme low-light scenes. The source code and collected dataset will be publicly available. + + + + DiGeo: Discriminative Geometry-Aware Learning for Generalized Few-Shot Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_DiGeo_Discriminative_Geometry-Aware_Learning_for_Generalized_Few-Shot_Object_Detection_CVPR_2023_paper.pdf + Generalized few-shot object detection aims to achieve precise detection on both base classes with abundant annotations and novel classes with limited training data. Existing approaches enhance few-shot generalization at the sacrifice of base-class performance, or maintain high precision in base-class detection with limited improvement in novel-class adaptation. In this paper, we point out that the reason is insufficient Discriminative feature learning for all of the classes.
As such, we propose a new training framework, DiGeo, to learn Geometry-aware features of inter-class separation and intra-class compactness. To guide the separation of feature clusters, we derive an offline simplex equiangular tight frame (ETF) classifier whose weights serve as class centers and are maximally and equally separated. To tighten the cluster for each class, we include adaptive class-specific margins into the classification loss and encourage the features close to the class centers. Experimental studies on two few-shot benchmark datasets (PASCAL VOC, MSCOCO) and one long-tail dataset (LVIS) demonstrate that, with a single model, our method can effectively improve generalization on novel classes without hurting the detection of base classes. + + + + A Soma Segmentation Benchmark in Full Adult Fly Brain + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_A_Soma_Segmentation_Benchmark_in_Full_Adult_Fly_Brain_CVPR_2023_paper.pdf + Neuron reconstruction in a full adult fly brain from high-resolution electron microscopy (EM) data is regarded as a cornerstone for neuroscientists to explore how neurons inspire intelligence. As the central part of neurons, somas in the full brain indicate the origin of neurogenesis and neural functions. However, due to the absence of EM datasets specifically annotated for somas, existing deep learning-based neuron reconstruction methods cannot directly provide accurate soma distribution and morphology. Moreover, full brain neuron reconstruction remains extremely time-consuming due to the unprecedentedly large size of EM data. In this paper, we develop an efficient soma reconstruction method for obtaining accurate soma distribution and morphology information in a full adult fly brain. To this end, we first make a high-resolution EM dataset with fine-grained 3D manual annotations on somas. Relying on this dataset, we propose an efficient, two-stage deep learning algorithm for predicting accurate locations and boundaries of 3D soma instances. Further, we deploy a parallelized, high-throughput data processing pipeline for executing the above algorithm on the full brain. Finally, we provide quantitative and qualitative benchmark comparisons on the testset to validate the superiority of the proposed method, as well as preliminary statistics of the reconstructed somas in the full adult fly brain from the biological perspective. We release our code and dataset at https://github.com/liuxy1103/EMADS. + + + + CaPriDe Learning: Confidential and Private Decentralized Learning Based on Encryption-Friendly Distillation Loss + http://openaccess.thecvf.com//content/CVPR2023/papers/Tastan_CaPriDe_Learning_Confidential_and_Private_Decentralized_Learning_Based_on_Encryption-Friendly_CVPR_2023_paper.pdf + Large volumes of data required to train accurate deep neural networks (DNNs) are seldom available with any single entity. Often, privacy concerns and stringent data regulations prevent entities from sharing data with each other or with a third-party learning service provider. While cross-silo federated learning (FL) allows collaborative learning of large DNNs without sharing the data itself, most existing cross-silo FL algorithms have an unacceptable utility-privacy trade-off. In this work, we propose a framework called Confidential and Private Decentralized (CaPriDe) learning, which optimally leverages the power of fully homomorphic encryption (FHE) to enable collaborative learning without compromising on the confidentiality and privacy of data. 
In CaPriDe learning, participating entities release their private data in an encrypted form allowing other participants to perform inference in the encrypted domain. The crux of CaPriDe learning is mutual knowledge distillation between multiple local models through a novel distillation loss, which is an approximation of the Kullback-Leibler (KL) divergence between the local predictions and encrypted inferences of other participants on the same data that can be computed in the encrypted domain. Extensive experiments on three datasets show that CaPriDe learning can improve the accuracy of local models without any central coordination, provide strong guarantees of data confidentiality and privacy, and has the ability to handle statistical heterogeneity. Constraints on the model architecture (arising from the need to be FHE-friendly), limited scalability, and computational complexity of encrypted domain inference are the main limitations of the proposed approach. The code can be found at https://github.com/tnurbek/capride-learning. + + + + vMAP: Vectorised Object Mapping for Neural Field SLAM + http://openaccess.thecvf.com//content/CVPR2023/papers/Kong_vMAP_Vectorised_Object_Mapping_for_Neural_Field_SLAM_CVPR_2023_paper.pdf + We present vMAP, an object-level dense SLAM system using neural field representations. Each object is represented by a small MLP, enabling efficient, watertight object modelling without the need for 3D priors. As an RGB-D camera browses a scene with no prior information, vMAP detects object instances on-the-fly, and dynamically adds them to its map. Specifically, thanks to the power of vectorised training, vMAP can optimise as many as 50 individual objects in a single scene, with an extremely efficient training speed of 5Hz map update. We experimentally demonstrate significantly improved scene-level and object-level reconstruction quality compared to prior neural field SLAM systems. Project page: https://kxhit.github.io/vMAP. + + + + Images Speak in Images: A Generalist Painter for In-Context Visual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Images_Speak_in_Images_A_Generalist_Painter_for_In-Context_Visual_CVPR_2023_paper.pdf + In-context learning, as a new paradigm in NLP, allows the model to rapidly adapt to various tasks with only a handful of prompts and examples. But in computer vision, the difficulties for in-context learning lie in that tasks vary significantly in the output representations, thus it is unclear how to define the general-purpose task prompts that the vision model can understand and transfer to out-of-domain tasks. In this work, we present Painter, a generalist model which addresses these obstacles with an "image"-centric solution, that is, to redefine the output of core vision tasks as images, and specify task prompts as also images. With this idea, our training process is extremely simple, which performs standard masked image modeling on the stitch of input and output image pairs. This makes the model capable of performing tasks conditioned on visible image patches. Thus, during inference, we can adopt a pair of input and output images from the same task as the input condition, to indicate which task to perform. Without bells and whistles, our generalist Painter can achieve competitive performance compared to well-established task-specific models, on seven representative vision tasks ranging from high-level visual understanding to low-level image processing. 
In addition, Painter significantly outperforms recent generalist models on several challenging tasks. + + + + StyLess: Boosting the Transferability of Adversarial Examples + http://openaccess.thecvf.com//content/CVPR2023/papers/Liang_StyLess_Boosting_the_Transferability_of_Adversarial_Examples_CVPR_2023_paper.pdf + Adversarial attacks can mislead deep neural networks (DNNs) by adding imperceptible perturbations to benign examples. The attack transferability enables adversarial examples to attack black-box DNNs with unknown architectures or parameters, which poses threats to many real-world applications. We find that existing transferable attacks do not distinguish between style and content features during optimization, limiting their attack transferability. To improve attack transferability, we propose a novel attack method called style-less perturbation (StyLess). Specifically, instead of using a vanilla network as the surrogate model, we advocate using stylized networks, which encode different style features by perturbing an adaptive instance normalization. Our method can prevent adversarial examples from using non-robust style features and help generate transferable perturbations. Comprehensive experiments show that our method can significantly improve the transferability of adversarial examples. Furthermore, our approach is generic and can outperform state-of-the-art transferable attacks when combined with other attack techniques. + + + + Uncertainty-Aware Optimal Transport for Semantically Coherent Out-of-Distribution Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Uncertainty-Aware_Optimal_Transport_for_Semantically_Coherent_Out-of-Distribution_Detection_CVPR_2023_paper.pdf + Semantically coherent out-of-distribution (SCOOD) detection aims to discern outliers from the intended data distribution with access to an unlabeled extra set. The coexistence of in-distribution and out-of-distribution samples will exacerbate model overfitting when no distinction is made. To address this problem, we propose a novel uncertainty-aware optimal transport scheme. Our scheme consists of an energy-based transport (ET) mechanism that estimates the fluctuating cost of uncertainty to promote the assignment of semantic-agnostic representation, and an inter-cluster extension strategy that enhances the discrimination of semantic property among different clusters by widening the corresponding margin distance. Furthermore, a T-energy score is presented to mitigate the magnitude gap between the parallel transport and classifier branches. Extensive experiments on two standard SCOOD benchmarks demonstrate above-par OOD detection performance, outperforming the state-of-the-art methods by margins of 27.69% and 34.4% on FPR@95, respectively. + + + + MISC210K: A Large-Scale Dataset for Multi-Instance Semantic Correspondence + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_MISC210K_A_Large-Scale_Dataset_for_Multi-Instance_Semantic_Correspondence_CVPR_2023_paper.pdf + Semantic correspondence has opened up a new way for object recognition. However, the current single-object matching schema makes it hard to discover commonalities within a category and remains far from real-world recognition tasks. To fill this gap, we design the multi-instance semantic correspondence task, which aims at constructing correspondences between multiple objects in an image pair. To support this task, we build a multi-instance semantic correspondence (MISC) dataset, called MISC210K, from the COCO Detection 2017 task.
We construct our dataset in three steps: (1) category selection and data cleaning; (2) keypoint design based on 3D models and object description rules; (3) human-machine collaborative annotation. Following these steps, we select 34 classes of objects with 4,812 challenging images annotated via a well-designed semi-automatic workflow, and finally acquire 218,179 image pairs annotated with instance masks and instance-level keypoint pairs. We design a dual-path collaborative learning pipeline to train the instance-level co-segmentation task and the fine-grained correspondence task together. Benchmark evaluations and further ablation results with detailed analysis are provided, along with three proposed future directions. Our project is available at https://github.com/YXSUNMADMAX/MISC210K. + + + + MAGE: MAsked Generative Encoder To Unify Representation Learning and Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_MAGE_MAsked_Generative_Encoder_To_Unify_Representation_Learning_and_Image_CVPR_2023_paper.pdf + Generative modeling and representation learning are two key tasks in computer vision. However, these models are typically trained independently, which ignores the potential for each task to help the other, and leads to training and model maintenance overheads. In this work, we propose MAsked Generative Encoder (MAGE), the first framework to unify SOTA image generation and self-supervised representation learning. Our key insight is that using variable masking ratios in masked image modeling pre-training can allow generative training (very high masking ratio) and representation learning (lower masking ratio) under the same training framework. Inspired by previous generative models, MAGE uses semantic tokens learned by a vector-quantized GAN at inputs and outputs, combining this with masking. We can further improve the representation by adding a contrastive loss to the encoder output. We extensively evaluate the generation and representation learning capabilities of MAGE. On ImageNet-1K, a single MAGE ViT-L model obtains 9.10 FID in the task of class-unconditional image generation and 78.9% top-1 accuracy for linear probing, achieving state-of-the-art performance in both image generation and representation learning. Code is available at https://github.com/LTH14/mage. + + + + Lift3D: Synthesize 3D Training Data by Lifting 2D GAN to 3D Generative Radiance Field + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Lift3D_Synthesize_3D_Training_Data_by_Lifting_2D_GAN_to_CVPR_2023_paper.pdf + This work explores the use of 3D generative models to synthesize training data for 3D vision tasks. The key requirements of the generative models are that the generated data should be photorealistic to match real-world scenarios, and the corresponding 3D attributes should be aligned with the given sampling labels. However, we find that recent NeRF-based 3D GANs hardly meet the above requirements due to their designed generation pipeline and the lack of explicit 3D supervision. In this work, we propose Lift3D, an inverted 2D-to-3D generation framework to achieve the data generation objectives. Lift3D has several merits compared to prior methods: (1) Unlike previous 3D GANs, whose output resolution is fixed after training, Lift3D can generalize to any camera intrinsics with higher-resolution and photorealistic output.
(2) By lifting well-disentangled 2D GAN to 3D object NeRF, Lift3D provides explicit 3D information of generated objects, thus offering accurate 3D annotations for downstream tasks. We evaluate the effectiveness of our framework by augmenting autonomous driving datasets. Experimental results demonstrate that our data generation framework can effectively improve the performance of 3D object detectors. Code: len-li.github.io/lift3d-web + + + + Hunting Sparsity: Density-Guided Contrastive Learning for Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Hunting_Sparsity_Density-Guided_Contrastive_Learning_for_Semi-Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Recent semi-supervised semantic segmentation methods combine pseudo labeling and consistency regularization to enhance model generalization from perturbation-invariant training. In this work, we argue that adequate supervision can be extracted directly from the geometry of feature space. Inspired by density-based unsupervised clustering, we propose to leverage feature density to locate sparse regions within feature clusters defined by label and pseudo labels. The hypothesis is that lower-density features tend to be under-trained compared with those densely gathered. Therefore, we propose to apply regularization on the structure of the cluster by tackling the sparsity to increase intra-class compactness in feature space. With this goal, we present a Density-Guided Contrastive Learning (DGCL) strategy to push anchor features in sparse regions toward cluster centers approximated by high-density positive keys. The heart of our method is to estimate feature density which is defined as neighbor compactness. We design a multi-scale density estimation module to obtain the density from multiple nearest-neighbor graphs for robust density modeling. Moreover, a unified training framework is proposed to combine label-guided self-training and density-guided geometry regularization to form complementary supervision on unlabeled data. Experimental results on PASCAL VOC and Cityscapes under various semi-supervised settings demonstrate that our proposed method achieves state-of-the-art performances. + + + + An Erudite Fine-Grained Visual Classification Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Chang_An_Erudite_Fine-Grained_Visual_Classification_Model_CVPR_2023_paper.pdf + Current fine-grained visual classification (FGVC) models are isolated. In practice, we first need to identify the coarse-grained label of an object, then select the corresponding FGVC model for recognition. This hinders the application of the FGVC algorithm in real-life scenarios. In this paper, we propose an erudite FGVC model jointly trained by several different datasets, which can efficiently and accurately predict an object's fine-grained label across the combined label space. We found through a pilot study that positive and negative transfers co-occur when different datasets are mixed for training, i.e., the knowledge from other datasets is not always useful. Therefore, we first propose a feature disentanglement module and a feature re-fusion module to reduce negative transfer and boost positive transfer between different datasets. In detail, we reduce negative transfer by decoupling the deep features through many dataset-specific feature extractors. Subsequently, these are channel-wise re-fused to facilitate positive transfer. 
Finally, we propose a meta-learning based dataset-agnostic spatial attention layer to take full advantage of the multi-dataset training data, given that localisation is dataset-agnostic between different datasets. Experimental results across 11 different mixed-datasets built on four different FGVC datasets demonstrate the effectiveness of the proposed method. Furthermore, the proposed method can be easily combined with existing FGVC methods to obtain state-of-the-art results. + + + + Adversarially Robust Neural Architecture Search for Graph Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Adversarially_Robust_Neural_Architecture_Search_for_Graph_Neural_Networks_CVPR_2023_paper.pdf + Graph Neural Networks (GNNs) obtain tremendous success in modeling relational data. Still, they are prone to adversarial attacks, which are massive threats to applying GNNs to risk-sensitive domains. Existing defensive methods neither guarantee performance facing new data/tasks or adversarial attacks nor provide insights to understand GNN robustness from an architectural perspective. Neural Architecture Search (NAS) has the potential to solve this problem by automating GNN architecture designs. Nevertheless, current graph NAS approaches lack robust design and are vulnerable to adversarial attacks. To tackle these challenges, we propose a novel Robust Neural Architecture search framework for GNNs (G-RNA). Specifically, we design a robust search space for the message-passing mechanism by adding graph structure mask operations into the search space, which comprises various defensive operation candidates and allows us to search for defensive GNNs. Furthermore, we define a robustness metric to guide the search procedure, which helps to filter robust architectures. In this way, G-RNA helps understand GNN robustness from an architectural perspective and effectively searches for optimal adversarial robust GNNs. Extensive experimental results on benchmark datasets show that G-RNA significantly outperforms manually designed robust GNNs and vanilla graph NAS baselines by 12.1% to 23.4% under adversarial attacks. + + + + Affordance Grounding From Demonstration Video To Target Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Affordance_Grounding_From_Demonstration_Video_To_Target_Image_CVPR_2023_paper.pdf + Humans excel at learning from expert demonstrations and solving their own problems. To equip intelligent robots and assistants, such as AR glasses, with this ability, it is essential to ground human hand interactions (i.e., affordances) from demonstration videos and apply them to a target image like a user's AR glass view. The video-to-image affordance grounding task is challenging due to (1) the need to predict fine-grained affordances, and (2) the limited training data, which inadequately covers video-image discrepancies and negatively impacts grounding. To tackle them, we propose Affordance Transformer (Afformer), which has a fine-grained transformer-based decoder that gradually refines affordance grounding. Moreover, we introduce Mask Affordance Hand (MaskAHand), a self-supervised pretraining technique for synthesizing video-image data and simulating context changes, enhancing affordance grounding across video-image discrepancies. Afformer with MaskAHand pre-training achieves state-of-the-art performance on multiple benchmarks, including a substantial 37% improvement on the OPRA dataset. Code is made available at https://github.com/showlab/afformer. 
+ + + + DeepMAD: Mathematical Architecture Design for Deep Convolutional Neural Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_DeepMAD_Mathematical_Architecture_Design_for_Deep_Convolutional_Neural_Network_CVPR_2023_paper.pdf + The rapid advances in Vision Transformers (ViTs) have refreshed the state-of-the-art performance in various vision tasks, overshadowing conventional CNN-based models. This has ignited several recent strike-back studies in the CNN world showing that pure CNN models can perform as well as ViT models when carefully tuned. While encouraging, designing such high-performance CNN models is challenging, requiring non-trivial prior knowledge of network design. To this end, a novel framework termed Mathematical Architecture Design for Deep CNN (DeepMAD) is proposed to design high-performance CNN models in a principled way. In DeepMAD, a CNN network is modeled as an information processing system whose expressiveness and effectiveness can be analytically formulated by its structural parameters. Then a constrained mathematical programming (MP) problem is proposed to optimize these structural parameters. The MP problem can be easily solved by off-the-shelf MP solvers on CPUs with a small memory footprint. In addition, DeepMAD is a purely mathematical framework: no GPU or training data is required during network design. The superiority of DeepMAD is validated on multiple large-scale computer vision benchmark datasets. Notably, on ImageNet-1k, using only conventional convolutional layers, DeepMAD achieves 0.7% and 1.5% higher top-1 accuracy than ConvNeXt and Swin at the Tiny level, and 0.8% and 0.9% higher at the Small level. + + + + BBDM: Image-to-Image Translation With Brownian Bridge Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_BBDM_Image-to-Image_Translation_With_Brownian_Bridge_Diffusion_Models_CVPR_2023_paper.pdf + Image-to-image translation is an important and challenging problem in computer vision and image processing. Diffusion models (DMs) have shown great potential for high-quality image synthesis and have achieved competitive performance on the task of image-to-image translation. However, most of the existing diffusion models treat image-to-image translation as a conditional generation process, and suffer heavily from the gap between distinct domains. In this paper, a novel image-to-image translation method based on the Brownian Bridge Diffusion Model (BBDM) is proposed, which models image-to-image translation as a stochastic Brownian Bridge process and learns the translation between two domains directly through the bidirectional diffusion process rather than a conditional generation process. To the best of our knowledge, it is the first work that proposes a Brownian Bridge diffusion process for image-to-image translation. Experimental results on various benchmarks demonstrate that the proposed BBDM model achieves competitive performance through both visual inspection and measurable metrics. + + + + Probing Neural Representations of Scene Perception in a Hippocampally Dependent Task Using Artificial Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Frey_Probing_Neural_Representations_of_Scene_Perception_in_a_Hippocampally_Dependent_CVPR_2023_paper.pdf + Deep artificial neural networks (DNNs) trained through backpropagation provide effective models of the mammalian visual system, accurately capturing the hierarchy of neural responses from primary visual cortex to inferior temporal cortex (IT).
However, the ability of these networks to explain representations in higher cortical areas is relatively lacking and considerably less well researched. For example, DNNs have been less successful as a model of the egocentric to allocentric transformation embodied by circuits in retrosplenial and posterior parietal cortex. We describe a novel scene perception benchmark inspired by a hippocampal dependent task, designed to probe the ability of DNNs to transform scenes viewed from different egocentric perspectives. Using a network architecture inspired by the connectivity between temporal lobe structures and the hippocampus, we demonstrate that DNNs trained using a triplet loss can learn this task. Moreover, by enforcing a factorized latent space, we can split information propagation into "what" and "where" pathways, which we use to reconstruct the input. This allows us to beat the state-of-the-art for unsupervised object segmentation on the CATER and MOVi-A,B,C benchmarks. + + + + A Probabilistic Framework for Lifelong Test-Time Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Brahma_A_Probabilistic_Framework_for_Lifelong_Test-Time_Adaptation_CVPR_2023_paper.pdf + Test-time adaptation (TTA) is the problem of updating a pre-trained source model at inference time given test input(s) from a different target domain. Most existing TTA approaches assume the setting in which the target domain is stationary, i.e., all the test inputs come from a single target domain. However, in many practical settings, the test input distribution might exhibit a lifelong/continual shift over time. Moreover, existing TTA approaches also lack the ability to provide reliable uncertainty estimates, which is crucial when distribution shifts occur between the source and target domain. To address these issues, we present PETAL (Probabilistic lifElong Test-time Adaptation with seLf-training prior), which solves lifelong TTA using a probabilistic approach, and naturally results in (1) a student-teacher framework, where the teacher model is an exponential moving average of the student model, and (2) regularizing the model updates at inference time using the source model as a regularizer. To prevent model drift in the lifelong/continual TTA setting, we also propose a data-driven parameter restoration technique which contributes to reducing the error accumulation and maintaining the knowledge of recent domains by restoring only the irrelevant parameters. In terms of predictive error rate as well as uncertainty based metrics such as Brier score and negative log-likelihood, our method achieves better results than the current state-of-the-art for online lifelong test-time adaptation across various benchmarks, such as CIFAR-10C, CIFAR-100C, ImageNetC, and ImageNet3DCC datasets. The source code for our approach is accessible at https://github.com/dhanajitb/petal. + + + + Sound to Visual Scene Generation by Audio-to-Visual Latent Alignment + http://openaccess.thecvf.com//content/CVPR2023/papers/Sung-Bin_Sound_to_Visual_Scene_Generation_by_Audio-to-Visual_Latent_Alignment_CVPR_2023_paper.pdf + How does audio describe the world around us? In this paper, we propose a method for generating an image of a scene from sound. Our method addresses the challenges of dealing with the large gaps that often exist between sight and sound. We design a model that works by scheduling the learning procedure of each model component to associate audio-visual modalities despite their information gaps. 
The key idea is to enrich the audio features with visual information by learning to align audio to visual latent space. We translate the input audio to visual features, then use a pre-trained generator to produce an image. To further improve the quality of our generated images, we use sound source localization to select the audio-visual pairs that have strong cross-modal correlations. We obtain substantially better results on the VEGAS and VGGSound datasets than prior approaches. We also show that we can control our model's predictions by applying simple manipulations to the input waveform, or to the latent space. + + + + Filtering, Distillation, and Hard Negatives for Vision-Language Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Radenovic_Filtering_Distillation_and_Hard_Negatives_for_Vision-Language_Pre-Training_CVPR_2023_paper.pdf + Vision-language models trained with contrastive learning on large-scale noisy data are becoming increasingly popular for zero-shot recognition problems. In this paper we improve the following three aspects of the contrastive pre-training pipeline: dataset noise, model initialization and the training objective. First, we propose a straightforward filtering strategy titled Complexity, Action, and Text-spotting (CAT) that significantly reduces dataset size, while achieving improved performance across zero-shot vision-language tasks. Next, we propose an approach titled Concept Distillation to leverage strong unimodal representations for contrastive training that does not increase training complexity while outperforming prior work. Finally, we modify the traditional contrastive alignment objective, and propose an importance-sampling approach to up-sample the importance of hard-negatives without adding additional complexity. On an extensive zero-shot benchmark of 29 tasks, our Distilled and Hard-negative Training (DiHT) approach improves on 20 tasks compared to the baseline. Furthermore, for few-shot linear probing, we propose a novel approach that bridges the gap between zero-shot and few-shot performance, substantially improving over prior work. Models are available at github.com/facebookresearch/diht. + + + + PointCMP: Contrastive Mask Prediction for Self-Supervised Learning on Point Cloud Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_PointCMP_Contrastive_Mask_Prediction_for_Self-Supervised_Learning_on_Point_Cloud_CVPR_2023_paper.pdf + Self-supervised learning can extract representations of good quality from solely unlabeled data, which is appealing for point cloud videos due to their high labelling cost. In this paper, we propose a contrastive mask prediction (PointCMP) framework for self-supervised learning on point cloud videos. Specifically, our PointCMP employs a two-branch structure to achieve simultaneous learning of both local and global spatio-temporal information. On top of this two-branch structure, a mutual similarity based augmentation module is developed to synthesize hard samples at the feature level. By masking dominant tokens and erasing principal channels, we generate hard samples to facilitate learning representations with better discrimination and generalization performance. Extensive experiments show that our PointCMP achieves the state-of-the-art performance on benchmark datasets and outperforms existing full-supervised counterparts. Transfer learning results demonstrate the superiority of the learned representations across different datasets and tasks. 
+ + + + IS-GGT: Iterative Scene Graph Generation With Generative Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Kundu_IS-GGT_Iterative_Scene_Graph_Generation_With_Generative_Transformers_CVPR_2023_paper.pdf + Scene graphs provide a rich, structured representation of a scene by encoding the entities (objects) and their spatial relationships in a graphical format. This representation has proven useful in several tasks, such as question answering, captioning, and even object detection, to name a few. Current approaches take a generation-by-classification approach where the scene graph is generated through labeling of all possible edges between objects in a scene, which adds computational overhead to the approach. This work introduces a generative transformer-based approach to generating scene graphs beyond link prediction. Using two transformer-based components, we first sample a possible scene graph structure from detected objects and their visual features. We then perform predicate classification on the sampled edges to generate the final scene graph. This approach allows us to efficiently generate scene graphs from images with minimal inference overhead. Extensive experiments on the Visual Genome dataset demonstrate the efficiency of the proposed approach. Without bells and whistles, we obtain, on average, 20.7% mean recall (mR@100) across different settings for scene graph generation (SGG), outperforming state-of-the-art SGG approaches while offering competitive performance to unbiased SGG approaches. + + + + Meta Omnium: A Benchmark for General-Purpose Learning-To-Learn + http://openaccess.thecvf.com//content/CVPR2023/papers/Bohdal_Meta_Omnium_A_Benchmark_for_General-Purpose_Learning-To-Learn_CVPR_2023_paper.pdf + Meta-learning and other approaches to few-shot learning are widely studied for image recognition, and are increasingly applied to other vision tasks such as pose estimation and dense prediction. This naturally raises the question of whether there is any few-shot meta-learning algorithm capable of generalizing across these diverse task types? To support the community in answering this question, we introduce Meta Omnium, a dataset-of-datasets spanning multiple vision tasks including recognition, keypoint localization, semantic segmentation and regression. We experiment with popular few-shot meta-learning baselines and analyze their ability to generalize across tasks and to transfer knowledge between them. Meta Omnium enables meta-learning researchers to evaluate model generalization to a much wider array of tasks than previously possible, and provides a single framework for evaluating meta-learners across a wide suite of vision applications in a consistent manner. + + + + Multimodal Industrial Anomaly Detection via Hybrid Fusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Multimodal_Industrial_Anomaly_Detection_via_Hybrid_Fusion_CVPR_2023_paper.pdf + 2D-based Industrial Anomaly Detection has been widely discussed, however, multimodal industrial anomaly detection based on 3D point clouds and RGB images still has many untouched fields. Existing multimodal industrial anomaly detection methods directly concatenate the multimodal features, which leads to a strong disturbance between features and harms the detection performance. 
In this paper, we propose Multi-3D-Memory (M3DM), a novel multimodal anomaly detection method with a hybrid fusion scheme: firstly, we design an unsupervised feature fusion with patch-wise contrastive learning to encourage the interaction of different modal features; secondly, we use decision-layer fusion with multiple memory banks to avoid loss of information, and additional novelty classifiers to make the final decision. We further propose a point feature alignment operation to better align the point cloud and RGB features. Extensive experiments show that our multimodal industrial anomaly detection model outperforms the state-of-the-art (SOTA) methods in both detection and segmentation precision on the MVTec-3D AD dataset. Code at github.com/nomewang/M3DM. + + + + BoxTeacher: Exploring High-Quality Pseudo Labels for Weakly Supervised Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Cheng_BoxTeacher_Exploring_High-Quality_Pseudo_Labels_for_Weakly_Supervised_Instance_Segmentation_CVPR_2023_paper.pdf + Labeling objects with pixel-wise segmentation requires a huge amount of human labor compared to bounding boxes. Most existing methods for weakly supervised instance segmentation focus on designing heuristic losses with priors from bounding boxes. However, we find that box-supervised methods can produce some fine segmentation masks, and we wonder whether the detectors could learn from these fine masks while ignoring low-quality ones. To answer this question, we present BoxTeacher, an efficient and end-to-end training framework for high-performance weakly supervised instance segmentation, which leverages a sophisticated teacher to generate high-quality masks as pseudo labels. Considering that massive noisy masks hurt the training, we present a mask-aware confidence score to estimate the quality of pseudo masks and propose the noise-aware pixel loss and noise-reduced affinity loss to adaptively optimize the student with pseudo masks. Extensive experiments demonstrate the effectiveness of the proposed BoxTeacher. Without bells and whistles, BoxTeacher remarkably achieves 35.0 mask AP and 36.5 mask AP with ResNet-50 and ResNet-101, respectively, on the challenging COCO dataset, which outperforms the previous state-of-the-art methods by a significant margin and bridges the gap between box-supervised and mask-supervised methods. The code and models will be available later. + + + + Change-Aware Sampling and Contrastive Learning for Satellite Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Mall_Change-Aware_Sampling_and_Contrastive_Learning_for_Satellite_Images_CVPR_2023_paper.pdf + Automatic remote sensing tools can help inform many large-scale challenges such as disaster management, climate change, etc. While a vast amount of spatio-temporal satellite image data is readily available, most of it remains unlabelled. Without labels, this data is not very useful for supervised learning algorithms. Self-supervised learning instead provides a way to learn effective representations for various downstream tasks without labels. In this work, we leverage characteristics unique to satellite images to learn better self-supervised features. Specifically, we use the temporal signal to contrast images with long-term and short-term differences, and we leverage the fact that satellite images do not change frequently. Using these characteristics, we formulate a new contrastive loss called the Change-Aware Contrastive (CACo) Loss.
Further, we also present a novel method of sampling different geographical regions. We show that leveraging these properties leads to better performance on diverse downstream tasks. For example, we see a 6.5% relative improvement for semantic segmentation and an 8.5% relative improvement for change detection over the best-performing baseline with our method. + + + + KD-DLGAN: Data Limited Image Generation via Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Cui_KD-DLGAN_Data_Limited_Image_Generation_via_Knowledge_Distillation_CVPR_2023_paper.pdf + Generative Adversarial Networks (GANs) rely heavily on large-scale training data for training high-quality image generation models. With limited training data, the GAN discriminator often suffers from severe overfitting which directly leads to degraded generation especially in generation diversity. Inspired by the recent advances in knowledge distillation (KD), we propose KD-GAN, a knowledge-distillation based generation framework that introduces pre-trained vision-language models for training effective data-limited image generation models. KD-GAN consists of two innovative designs. The first is aggregated generative KD that mitigates the discriminator overfitting by challenging the discriminator with harder learning tasks and distilling more generalizable knowledge from the pre-trained models. The second is correlated generative KD that improves the generation diversity by distilling and preserving the diverse image-text correlation within the pre-trained models. Extensive experiments over multiple benchmarks show that KD-GAN achieves superior image generation with limited training data. In addition, KD-GAN complements the state-of-the-art with consistent and substantial performance gains. Note that codes will be released. + + + + Batch Model Consolidation: A Multi-Task Model Consolidation Framework + http://openaccess.thecvf.com//content/CVPR2023/papers/Fostiropoulos_Batch_Model_Consolidation_A_Multi-Task_Model_Consolidation_Framework_CVPR_2023_paper.pdf + In Continual Learning (CL), a model is required to learn a stream of tasks sequentially without significant performance degradation on previously learned tasks. Current approaches fail for a long sequence of tasks from diverse domains and difficulties. Many of the existing CL approaches are difficult to apply in practice due to excessive memory cost or training time, or are tightly coupled to a single device. With the intuition derived from the widely applied mini-batch training, we propose Batch Model Consolidation (BMC) to support more realistic CL under conditions where multiple agents are exposed to a range of tasks. During a regularization phase, BMC trains multiple expert models in parallel on a set of disjoint tasks. Each expert maintains weight similarity to a base model through a stability loss, and constructs a buffer from a fraction of the task's data. During the consolidation phase, we combine the learned knowledge on 'batches' of expert models using a batched consolidation loss in memory data that aggregates all buffers. We thoroughly evaluate each component of our method in an ablation study and demonstrate the effectiveness on standardized benchmark datasets Split-CIFAR-100, Tiny-ImageNet, and the Stream dataset composed of 71 image classification tasks from diverse domains and difficulties. Our method outperforms the next best CL approach by 70% and is the only approach that can maintain performance at the end of 71 tasks. 
+ + + + DR2: Diffusion-Based Robust Degradation Remover for Blind Face Restoration + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_DR2_Diffusion-Based_Robust_Degradation_Remover_for_Blind_Face_Restoration_CVPR_2023_paper.pdf + Blind face restoration usually synthesizes degraded low-quality data with a pre-defined degradation model for training, while more complex cases could happen in the real world. This gap between the assumed and actual degradation hurts the restoration performance where artifacts are often observed in the output. However, it is expensive and infeasible to include every type of degradation to cover real-world cases in the training data. To tackle this robustness issue, we propose Diffusion-based Robust Degradation Remover (DR2) to first transform the degraded image to a coarse but degradation-invariant prediction, then employ an enhancement module to restore the coarse prediction to a high-quality image. By leveraging a well-performing denoising diffusion probabilistic model, our DR2 diffuses input images to a noisy status where various types of degradation give way to Gaussian noise, and then captures semantic information through iterative denoising steps. As a result, DR2 is robust against common degradation (e.g. blur, resize, noise and compression) and compatible with different designs of enhancement modules. Experiments in various settings show that our framework outperforms state-of-the-art methods on heavily degraded synthetic and real-world datasets. + + + + LiDAR2Map: In Defense of LiDAR-Based Semantic Map Construction Using Online Camera Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_LiDAR2Map_In_Defense_of_LiDAR-Based_Semantic_Map_Construction_Using_Online_CVPR_2023_paper.pdf + Semantic map construction under bird's-eye view (BEV) plays an essential role in autonomous driving. In contrast to camera image, LiDAR provides the accurate 3D observations to project the captured 3D features onto BEV space inherently. However, the vanilla LiDAR-based BEV feature often contains many indefinite noises, where the spatial features have little texture and semantic cues. In this paper, we propose an effective LiDAR-based method to build semantic map. Specifically, we introduce a BEV pyramid feature decoder that learns the robust multi-scale BEV features for semantic map construction, which greatly boosts the accuracy of the LiDAR-based method. To mitigate the defects caused by lacking semantic cues in LiDAR data, we present an online Camera-to-LiDAR distillation scheme to facilitate the semantic learning from image to point cloud. Our distillation scheme consists of feature-level and logit-level distillation to absorb the semantic information from camera in BEV. The experimental results on challenging nuScenes dataset demonstrate the efficacy of our proposed LiDAR2Map on semantic map construction, which significantly outperforms the previous LiDAR-based methods over 27.9% mIoU and even performs better than the state-of-the-art camera-based approaches. Source code is available at: https://github.com/songw-zju/LiDAR2Map. + + + + Token Contrast for Weakly-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ru_Token_Contrast_for_Weakly-Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Weakly-Supervised Semantic Segmentation (WSSS) using image-level labels typically utilizes Class Activation Map (CAM) to generate the pseudo labels. 
Limited by the local structure perception of CNNs, CAM usually cannot identify integral object regions. Though the recent Vision Transformer (ViT) can remedy this flaw, we observe that it also brings an over-smoothing issue, i.e., the final patch tokens tend to be uniform. In this work, we propose Token Contrast (ToCo) to address this issue and further explore the virtue of ViT for WSSS. Firstly, motivated by the observation that intermediate layers in ViT can still retain semantic diversity, we design a Patch Token Contrast module (PTC). PTC supervises the final patch tokens with the pseudo token relations derived from intermediate layers, allowing them to align the semantic regions and thus yield more accurate CAM. Secondly, to further differentiate the low-confidence regions in CAM, we devise a Class Token Contrast module (CTC), inspired by the fact that class tokens in ViT can capture high-level semantics. CTC facilitates representation consistency between uncertain local regions and global objects by contrasting their class tokens. Experiments on the PASCAL VOC and MS COCO datasets show that the proposed ToCo remarkably surpasses other single-stage competitors and achieves comparable performance with state-of-the-art multi-stage methods. Code is available at https://github.com/rulixiang/ToCo. + + + + LightedDepth: Video Depth Estimation in Light of Limited Inference View Angles + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_LightedDepth_Video_Depth_Estimation_in_Light_of_Limited_Inference_View_CVPR_2023_paper.pdf + Video depth estimation infers the dense scene depth from immediately neighboring video frames. While recent works consider it a simplified structure-from-motion (SfM) problem, it still differs from SfM in that significantly fewer view angles are available at inference. This setting, however, suits mono-depth and optical flow estimation. This observation motivates us to decouple video depth estimation into two components: a normalized pose estimation over a flowmap and a logged residual depth estimation over a mono-depth map. The two parts are unified with an efficient off-the-shelf scale alignment algorithm. Additionally, we stabilize the indoor two-view pose estimation by including additional projection constraints and ensuring sufficient camera translation. Though our method is a two-view algorithm, we validate the benefit of the decoupling with substantial performance improvements over multi-view iterative prior works on indoor and outdoor datasets. Codes and models are available at https://github.com/ShngJZ/LightedDepth. + + + + HouseDiffusion: Vector Floorplan Generation via a Diffusion Model With Discrete and Continuous Denoising + http://openaccess.thecvf.com//content/CVPR2023/papers/Shabani_HouseDiffusion_Vector_Floorplan_Generation_via_a_Diffusion_Model_With_Discrete_CVPR_2023_paper.pdf + The paper presents a novel approach for vector-floorplan generation via a diffusion model, which denoises 2D coordinates of room/door corners with two inference objectives: 1) a single-step noise as the continuous quantity to precisely invert the continuous forward process; and 2) the final 2D coordinate as the discrete quantity to establish geometric incident relationships such as parallelism, orthogonality, and corner-sharing. Our task is graph-conditioned floorplan generation, a common workflow in floorplan design. We represent a floorplan as 1D polygonal loops, each of which corresponds to a room or a door.
Our diffusion model employs a Transformer architecture at the core, which controls the attention masks based on the input graph-constraint and directly generates vector-graphics floorplans via a discrete and continuous denoising process. We have evaluated our approach on RPLAN dataset. The proposed approach makes significant improvements in all the metrics against the state-of-the-art with significant margins, while being capable of generating non-Manhattan structures and controlling the exact number of corners per room. We will share all our code and models. + + + + V2X-Seq: A Large-Scale Sequential Dataset for Vehicle-Infrastructure Cooperative Perception and Forecasting + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_V2X-Seq_A_Large-Scale_Sequential_Dataset_for_Vehicle-Infrastructure_Cooperative_Perception_and_CVPR_2023_paper.pdf + Utilizing infrastructure and vehicle-side information to track and forecast the behaviors of surrounding traffic participants can significantly improve decision-making and safety in autonomous driving. However, the lack of real-world sequential datasets limits research in this area. To address this issue, we introduce V2X-Seq, the first large-scale sequential V2X dataset, which includes data frames, trajectories, vector maps, and traffic lights captured from natural scenery. V2X-Seq comprises two parts: the sequential perception dataset, which includes more than 15,000 frames captured from 95 scenarios, and the trajectory forecasting dataset, which contains about 80,000 infrastructure-view scenarios, 80,000 vehicle-view scenarios, and 50,000 cooperative-view scenarios captured from 28 intersections' areas, covering 672 hours of data. Based on V2X-Seq, we introduce three new tasks for vehicle-infrastructure cooperative (VIC) autonomous driving: VIC3D Tracking, Online-VIC Forecasting, and Offline-VIC Forecasting. We also provide benchmarks for the introduced tasks. Find data, code, and more up-to-date information at https://github.com/AIR-THU/DAIR-V2X-Seq. + + + + Bridging the Gap Between Model Explanations in Partially Annotated Multi-Label Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Bridging_the_Gap_Between_Model_Explanations_in_Partially_Annotated_Multi-Label_CVPR_2023_paper.pdf + Due to the expensive costs of collecting labels in multi-label classification datasets, partially annotated multi-label classification has become an emerging field in computer vision. One baseline approach to this task is to assume unobserved labels as negative labels, but this assumption induces label noise as a form of false negative. To understand the negative impact caused by false negative labels, we study how these labels affect the model's explanation. We observe that the explanation of two models, trained with full and partial labels each, highlights similar regions but with different scaling, where the latter tends to have lower attribution scores. Based on these findings, we propose to boost the attribution scores of the model trained with partial labels to make its explanation resemble that of the model trained with full labels. Even with the conceptually simple approach, the multi-label classification performance improves by a large margin in three different datasets on a single positive label setting and one on a large-scale partial label setting. Code is available at https://github.com/youngwk/BridgeGapExplanationPAMC. 
+ + + + Learning Audio-Visual Source Localization via False Negative Aware Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Learning_Audio-Visual_Source_Localization_via_False_Negative_Aware_Contrastive_Learning_CVPR_2023_paper.pdf + Self-supervised audio-visual source localization aims to locate sound-source objects in video frames without extra annotations. Recent methods often approach this goal with the help of contrastive learning, which assumes only the audio and visual contents from the same video are positive samples for each other. However, this assumption would suffer from false negative samples in real-world training. For example, for an audio sample, treating the frames from the same audio class as negative samples may mislead the model and therefore harm the learned representations (e.g., the audio of a siren wailing may reasonably correspond to the ambulances in multiple images). Based on this observation, we propose a new learning strategy named False Negative Aware Contrastive (FNAC) to mitigate the problem of misleading the training with such false negative samples. Specifically, we utilize the intra-modal similarities to identify potentially similar samples and construct corresponding adjacency matrices to guide contrastive learning. Further, we propose to strengthen the role of true negative samples by explicitly leveraging the visual features of sound sources to facilitate the differentiation of authentic sounding source regions. FNAC achieves state-of-the-art performances on Flickr-SoundNet, VGG-Sound, and AVSBench, which demonstrates the effectiveness of our method in mitigating the false negative issue. The code is available at https://github.com/OpenNLPLab/FNAC_AVL + + + + MMG-Ego4D: Multimodal Generalization in Egocentric Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Gong_MMG-Ego4D_Multimodal_Generalization_in_Egocentric_Action_Recognition_CVPR_2023_paper.pdf + In this paper, we study a novel problem in egocentric action recognition, which we term as "Multimodal Generalization" (MMG). MMG aims to study how systems can generalize when data from certain modalities is limited or even completely missing. We thoroughly investigate MMG in the context of standard supervised action recognition and the more challenging few-shot setting for learning new action categories. MMG consists of two novel scenarios, designed to support security, and efficiency considerations in real-world applications: (1) missing modality generalization where some modalities that were present during the train time are missing during the inference time, and (2) cross-modal zero-shot generalization, where the modalities present during the inference time and the training time are disjoint. To enable this investigation, we construct a new dataset MMG-Ego4D containing data points with video, audio, and inertial motion sensor (IMU) modalities. Our dataset is derived from Ego4D dataset, but processed and thoroughly re-annotated by human experts to facilitate research in the MMG problem. We evaluate a diverse array of models on MMG-Ego4D and propose new methods with improved generalization ability. In particular, we introduce a new fusion module with modality dropout training, contrastive-based alignment training, and a novel cross-modal prototypical loss for better few-shot performance. We hope this study will serve as a benchmark and guide future research in multimodal generalization problems. 
The benchmark and code are available at https://github.com/facebookresearch/MMG_Ego4D + + + + 3D Video Object Detection With Learnable Object-Centric Global Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/He_3D_Video_Object_Detection_With_Learnable_Object-Centric_Global_Optimization_CVPR_2023_paper.pdf + We explore long-term temporal visual correspondence-based optimization for 3D video object detection in this work. Visual correspondence refers to one-to-one mappings for pixels across multiple images. Correspondence-based optimization is the cornerstone of 3D scene reconstruction but is less studied in 3D video object detection, because moving objects violate multi-view geometry constraints and are treated as outliers during scene reconstruction. We address this issue by treating objects as first-class citizens during correspondence-based optimization. In this work, we propose BA-Det, an end-to-end optimizable object detector with object-centric temporal correspondence learning and featuremetric object bundle adjustment. Empirically, we verify the effectiveness and efficiency of BA-Det for multiple baseline 3D detectors under various setups. Our BA-Det achieves SOTA performance on the large-scale Waymo Open Dataset (WOD) with only marginal computation cost. Our code is available at https://github.com/jiaweihe1996/BA-Det. + + + + Improving the Transferability of Adversarial Samples by Path-Augmented Method + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Improving_the_Transferability_of_Adversarial_Samples_by_Path-Augmented_Method_CVPR_2023_paper.pdf + Deep neural networks have achieved unprecedented success on diverse vision tasks. However, they are vulnerable to adversarial noise that is imperceptible to humans. This phenomenon negatively affects their deployment in real-world scenarios, especially security-related ones. To evaluate the robustness of a target model in practice, transfer-based attacks craft adversarial samples with a local model and have attracted increasing attention from researchers due to their high efficiency. The state-of-the-art transfer-based attacks are generally based on data augmentation, which typically augments multiple training images from a linear path when learning adversarial samples. However, such methods select the image augmentation path heuristically and may augment images that are semantics-inconsistent with the target images, which harms the transferability of the generated adversarial samples. To overcome this pitfall, we propose the Path-Augmented Method (PAM). Specifically, PAM first constructs a candidate augmentation path pool. It then selects the augmentation paths to employ during adversarial sample generation via greedy search. Furthermore, to avoid augmenting semantics-inconsistent images, we train a Semantics Predictor (SP) to constrain the length of the augmentation path. Extensive experiments confirm that PAM can achieve an improvement of over 4.8% on average compared with the state-of-the-art baselines in terms of the attack success rates. + + + + Robust Mean Teacher for Continual and Gradual Test-Time Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Dobler_Robust_Mean_Teacher_for_Continual_and_Gradual_Test-Time_Adaptation_CVPR_2023_paper.pdf + Since experiencing domain shifts during test time is inevitable in practice, test-time adaptation (TTA) continues to adapt the model after deployment. Recently, the area of continual and gradual test-time adaptation (TTA) emerged. 
In contrast to standard TTA, continual TTA considers not only a single domain shift, but a sequence of shifts. Gradual TTA further exploits the property that some shifts evolve gradually over time. Since in both settings long test sequences are present, error accumulation needs to be addressed for methods relying on self-training. In this work, we propose and show that in the setting of TTA, the symmetric cross-entropy is better suited as a consistency loss for mean teachers compared to the commonly used cross-entropy. This is justified by our analysis with respect to the (symmetric) cross-entropy's gradient properties. To pull the test feature space closer to the source domain, where the pre-trained model is well-posed, contrastive learning is leveraged. Since applications differ in their requirements, we address several settings, including having source data available and the more challenging source-free setting. We demonstrate the effectiveness of our proposed method "robust mean teacher" (RMT) on the continual and gradual corruption benchmarks CIFAR10C, CIFAR100C, and ImageNet-C. We further consider ImageNet-R and propose a new continual DomainNet-126 benchmark. State-of-the-art results are achieved on all benchmarks. + + + + MOVES: Manipulated Objects in Video Enable Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Higgins_MOVES_Manipulated_Objects_in_Video_Enable_Segmentation_CVPR_2023_paper.pdf + We present a method that uses manipulation to learn to understand the objects people hold as well as hand-object contact. We train a system that takes a single RGB image and produces a pixel-embedding that can be used to answer grouping questions (do these two pixels go together) as well as hand-association questions (is this hand holding that pixel). Rather than painstakingly annotating segmentation masks, we observe people in realistic video data. We show that pairing epipolar geometry with modern optical flow produces simple and effective pseudo-labels for grouping. Given people segmentations, we can further associate pixels with hands to understand contact. Our system achieves competitive results on hand and hand-held object tasks. + + + + Generating Holistic 3D Human Motion From Speech + http://openaccess.thecvf.com//content/CVPR2023/papers/Yi_Generating_Holistic_3D_Human_Motion_From_Speech_CVPR_2023_paper.pdf + This work addresses the problem of generating 3D holistic body motions from human speech. Given a speech recording, we synthesize sequences of 3D body poses, hand gestures, and facial expressions that are realistic and diverse. To achieve this, we first build a high-quality dataset of 3D holistic body meshes with synchronous speech. We then define a novel speech-to-motion generation framework in which the face, body, and hands are modeled separately. The separated modeling stems from the fact that face articulation strongly correlates with human speech, while body poses and hand gestures are less correlated. Specifically, we employ an autoencoder for face motions, and a compositional vector-quantized variational autoencoder (VQ-VAE) for the body and hand motions. The compositional VQ-VAE is key to generating diverse results. Additionally, we propose a cross conditional autoregressive model that generates body poses and hand gestures, leading to coherent and realistic motions. Extensive experiments and user studies demonstrate that our proposed approach achieves state-of-the-art performance both qualitatively and quantitatively. 
Our novel dataset and code will be released for research purposes. + + + + ShadowNeuS: Neural SDF Reconstruction by Shadow Ray Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Ling_ShadowNeuS_Neural_SDF_Reconstruction_by_Shadow_Ray_Supervision_CVPR_2023_paper.pdf + By supervising camera rays between a scene and multi-view image planes, NeRF reconstructs a neural scene representation for the task of novel view synthesis. On the other hand, shadow rays between the light source and the scene have yet to be considered. Therefore, we propose a novel shadow ray supervision scheme that optimizes both the samples along the ray and the ray location. By supervising shadow rays, we successfully reconstruct a neural SDF of the scene from single-view images under multiple lighting conditions. Given single-view binary shadows, we train a neural network to reconstruct a complete scene not limited by the camera's line of sight. By further modeling the correlation between the image colors and the shadow rays, our technique can also be effectively extended to RGB inputs. We compare our method with previous works on challenging tasks of shape reconstruction from single-view binary shadow or RGB images and observe significant improvements. The code and data are available at https://github.com/gerwang/ShadowNeuS. + + + + Generalized UAV Object Detection via Frequency Domain Disentanglement + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Generalized_UAV_Object_Detection_via_Frequency_Domain_Disentanglement_CVPR_2023_paper.pdf + When deploying the Unmanned Aerial Vehicles object detection (UAV-OD) network to complex and unseen real-world scenarios, the generalization ability is usually reduced due to the domain shift. To address this issue, this paper proposes a novel frequency domain disentanglement method to improve the UAV-OD generalization. Specifically, we first verified that the spectrum of different bands in the image has different effects to the UAV-OD generalization. Based on this conclusion, we design two learnable filters to extract domain-invariant spectrum and domain-specific spectrum, respectively. The former can be used to train the UAV-OD network and improve its capacity for generalization. In addition, we design a new instance-level contrastive loss to guide the network training. This loss enables the network to concentrate on extracting domain-invariant spectrum and domain-specific spectrum, so as to achieve better disentangling results. Experimental results on three unseen target domains demonstrate that our method has better generalization ability than both the baseline method and state-of-the-art methods. + + + + DINER: Disorder-Invariant Implicit Neural Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_DINER_Disorder-Invariant_Implicit_Neural_Representation_CVPR_2023_paper.pdf + Implicit neural representation (INR) characterizes the attributes of a signal as a function of corresponding coordinates which emerges as a sharp weapon for solving inverse problems. However, the capacity of INR is limited by the spectral bias in the network training. In this paper, we find that such a frequency-related problem could be largely solved by re-arranging the coordinates of the input signal, for which we propose the disorder-invariant implicit neural representation (DINER) by augmenting a hash-table to a traditional INR backbone. 
Given discrete signals sharing the same histogram of attributes and different arrangement orders, the hash-table could project the coordinates into the same distribution for which the mapped signal can be better modeled using the subsequent INR network, leading to significantly alleviated spectral bias. Experiments not only reveal the generalization of the DINER for different INR backbones (MLP vs. SIREN) and various tasks (image/video representation, phase retrieval, and refractive index recovery) but also show the superiority over the state-of-the-art algorithms both in quality and speed. + + + + A Light Touch Approach to Teaching Transformers Multi-View Geometry + http://openaccess.thecvf.com//content/CVPR2023/papers/Bhalgat_A_Light_Touch_Approach_to_Teaching_Transformers_Multi-View_Geometry_CVPR_2023_paper.pdf + Transformers are powerful visual learners, in large part due to their conspicuous lack of manually-specified priors. This flexibility can be problematic in tasks that involve multiple-view geometry, due to the near-infinite possible variations in 3D shapes and viewpoints (requiring flexibility), and the precise nature of projective geometry (obeying rigid laws). To resolve this conundrum, we propose a "light touch" approach, guiding visual Transformers to learn multiple-view geometry but allowing them to break free when needed. We achieve this by using epipolar lines to guide the Transformer's cross-attention maps, penalizing attention values outside the epipolar lines and encouraging higher attention along these lines since they contain geometrically plausible matches. Unlike previous methods, our proposal does not require any camera pose information at test-time. We focus on pose-invariant object instance retrieval, where standard Transformer networks struggle, due to the large differences in viewpoint between query and retrieved images. Experimentally, our method outperforms state-of-the-art approaches at object retrieval, without needing pose information at test-time. + + + + Trade-Off Between Robustness and Accuracy of Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Trade-Off_Between_Robustness_and_Accuracy_of_Vision_Transformers_CVPR_2023_paper.pdf + Although deep neural networks (DNNs) have shown great successes in computer vision tasks, they are vulnerable to perturbations on inputs, and there exists a trade-off between the natural accuracy and robustness to such perturbations, which is mainly caused by the existence of robust non-predictive features and non-robust predictive features. Recent empirical analyses find Vision Transformers (ViTs) are inherently robust to various kinds of perturbations, but the aforementioned trade-off still exists for them. In this work, we propose Trade-off between Robustness and Accuracy of Vision Transformers (TORA-ViTs), which aims to efficiently transfer ViT models pretrained on natural tasks for both accuracy and robustness. TORA-ViTs consist of two major components, including a pair of accuracy and robustness adapters to extract predictive and robust features, respectively, and a gated fusion module to adjust the trade-off. The gated fusion module takes outputs of a pretrained ViT block as queries and outputs of our adapters as keys and values, and tokens from different adapters at different spatial locations are compared with each other to generate attention scores for a balanced mixing of predictive and robust features. 
Experiments on ImageNet with various robust benchmarks show that our TORA-ViTs can efficiently improve the robustness of naturally pretrained ViTs while maintaining competitive natural accuracy. Our most balanced setting (TORA-ViTs with lambda = 0.5) can maintain 83.7% accuracy on clean ImageNet and reach 54.7% and 38.0% accuracy under FGSM and PGD white-box attacks, respectively. In terms of various ImageNet variants, it can reach 39.2% and 56.3% accuracy on ImageNet-A and ImageNet-R and reach 34.4% mCE on ImageNet-C. + + + + Deep Graph-Based Spatial Consistency for Robust Non-Rigid Point Cloud Registration + http://openaccess.thecvf.com//content/CVPR2023/papers/Qin_Deep_Graph-Based_Spatial_Consistency_for_Robust_Non-Rigid_Point_Cloud_Registration_CVPR_2023_paper.pdf + We study the problem of outlier correspondence pruning for non-rigid point cloud registration. In rigid registration, spatial consistency has been a commonly used criterion to discriminate outliers from inliers. It measures the compatibility of two correspondences by the discrepancy between the respective distances in two point clouds. However, spatial consistency no longer holds in non-rigid cases and outlier rejection for non-rigid registration has not been well studied. In this work, we propose Graph-based Spatial Consistency Network (GraphSCNet) to filter outliers for non-rigid registration. Our method is based on the fact that non-rigid deformations are usually locally rigid, or local shape preserving. We first design a local spatial consistency measure over the deformation graph of the point cloud, which evaluates the spatial compatibility only between the correspondences in the vicinity of a graph node. An attention-based non-rigid correspondence embedding module is then devised to learn a robust representation of non-rigid correspondences from local spatial consistency. Despite its simplicity, GraphSCNet effectively improves the quality of the putative correspondences and attains state-of-the-art performance on three challenging benchmarks. Our code and models are available at https://github.com/qinzheng93/GraphSCNet. + + + + Slide-Transformer: Hierarchical Vision Transformer With Local Self-Attention + http://openaccess.thecvf.com//content/CVPR2023/papers/Pan_Slide-Transformer_Hierarchical_Vision_Transformer_With_Local_Self-Attention_CVPR_2023_paper.pdf + Self-attention mechanism has been a key factor in the recent progress of Vision Transformer (ViT), which enables adaptive feature extraction from global contexts. However, existing self-attention methods either adopt sparse global attention or window attention to reduce the computation complexity, which may compromise the local feature learning or subject to some handcrafted designs. In contrast, local attention, which restricts the receptive field of each query to its own neighboring pixels, enjoys the benefits of both convolution and self-attention, namely local inductive bias and dynamic feature selection. Nevertheless, current local attention modules either use inefficient Im2Col function or rely on specific CUDA kernels that are hard to generalize to devices without CUDA support. In this paper, we propose a novel local attention module, Slide Attention, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability. Specifically, we first re-interpret the column-based Im2Col function from a new row-based perspective and use Depthwise Convolution as an efficient substitution. 
On this basis, we propose a deformed shifting module based on the re-parameterization technique, which further relaxes the fixed key/value positions to deformed features in the local region. In this way, our module realizes the local attention paradigm in an efficient and flexible manner. Extensive experiments show that our slide attention module is applicable to a variety of advanced Vision Transformer models and compatible with various hardware devices, and achieves consistently improved performance on comprehensive benchmarks. + + + + NeRF-Supervised Deep Stereo + http://openaccess.thecvf.com//content/CVPR2023/papers/Tosi_NeRF-Supervised_Deep_Stereo_CVPR_2023_paper.pdf + We introduce a novel framework for training deep stereo networks effortlessly and without any ground-truth. By leveraging state-of-the-art neural rendering solutions, we generate stereo training data from image sequences collected with a single handheld camera. On top of them, a NeRF-supervised training procedure is carried out, from which we exploit rendered stereo triplets to compensate for occlusions and depth maps as proxy labels. This results in stereo networks capable of predicting sharp and detailed disparity maps. Experimental results show that models trained under this regime yield a 30-40% improvement over existing self-supervised methods on the challenging Middlebury dataset, filling the gap to supervised models and, in most cases, outperforming them at zero-shot generalization. + + + + Decoupled Multimodal Distilling for Emotion Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Decoupled_Multimodal_Distilling_for_Emotion_Recognition_CVPR_2023_paper.pdf + Human multimodal emotion recognition (MER) aims to perceive human emotions via language, visual and acoustic modalities. Despite the impressive performance of previous MER approaches, the inherent multimodal heterogeneities still persist and the contribution of different modalities varies significantly. In this work, we mitigate this issue by proposing a decoupled multimodal distillation (DMD) approach that facilitates flexible and adaptive crossmodal knowledge distillation, aiming to enhance the discriminative features of each modality. Specifically, the representation of each modality is decoupled into two parts, i.e., modality-irrelevant/-exclusive spaces, in a self-regression manner. DMD utilizes a graph distillation unit (GD-Unit) for each decoupled part so that each GD can be performed in a more specialized and effective manner. A GD-Unit consists of a dynamic graph where each vertex represents a modality and each edge indicates a dynamic knowledge distillation. Such a GD paradigm provides a flexible knowledge transfer manner where the distillation weights can be automatically learned, thus enabling diverse crossmodal knowledge transfer patterns. Experimental results show DMD consistently obtains superior performance compared to state-of-the-art MER methods. Visualization results show the graph edges in DMD exhibit meaningful distributional patterns w.r.t. the modality-irrelevant/-exclusive feature spaces. Codes are released at https://github.com/mdswyz/DMD. 
+ + + + DualRefine: Self-Supervised Depth and Pose Estimation Through Iterative Epipolar Sampling and Refinement Toward Equilibrium + http://openaccess.thecvf.com//content/CVPR2023/papers/Bangunharcana_DualRefine_Self-Supervised_Depth_and_Pose_Estimation_Through_Iterative_Epipolar_Sampling_CVPR_2023_paper.pdf + Self-supervised multi-frame depth estimation achieves high accuracy by computing matching costs of pixel correspondences between adjacent frames, injecting geometric information into the network. These pixel-correspondence candidates are computed based on the relative pose estimates between the frames. Accurate pose predictions are essential for precise matching cost computation as they influence the epipolar geometry. Furthermore, improved depth estimates can, in turn, be used to align pose estimates. Inspired by traditional structure-from-motion (SfM) principles, we propose the DualRefine model, which tightly couples depth and pose estimation through a feedback loop. Our novel update pipeline uses a deep equilibrium model framework to iteratively refine depth estimates and a hidden state of feature maps by computing local matching costs based on epipolar geometry. Importantly, we used the refined depth estimates and feature maps to compute pose updates at each step. This update in the pose estimates slowly alters the epipolar geometry during the refinement process. Experimental results on the KITTI dataset demonstrate competitive depth prediction and odometry prediction performance surpassing published self-supervised baselines. The code is available at https://github.com/antabangun/DualRefine. + + + + Improving Generalization of Meta-Learning With Inverted Regularization at Inner-Level + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Improving_Generalization_of_Meta-Learning_With_Inverted_Regularization_at_Inner-Level_CVPR_2023_paper.pdf + Despite the broad interest in meta-learning, the generalization problem remains one of the significant challenges in this field. Existing works focus on meta-generalization to unseen tasks at the meta-level by regularizing the meta-loss, while ignoring that adapted models may not generalize to the task domains at the adaptation level. In this paper, we propose a new regularization mechanism for meta-learning -- Minimax-Meta Regularization, which employs inverted regularization at the inner loop and ordinary regularization at the outer loop during training. In particular, the inner inverted regularization makes the adapted model more difficult to generalize to task domains; thus, optimizing the outer-loop loss forces the meta-model to learn meta-knowledge with better generalization. Theoretically, we prove that inverted regularization improves the meta-testing performance by reducing generalization errors. We conduct extensive experiments on the representative scenarios, and the results show that our method consistently improves the performance of meta-learning algorithms. + + + + SmallCap: Lightweight Image Captioning Prompted With Retrieval Augmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ramos_SmallCap_Lightweight_Image_Captioning_Prompted_With_Retrieval_Augmentation_CVPR_2023_paper.pdf + Recent advances in image captioning have focused on scaling the data and model size, substantially increasing the cost of pre-training and finetuning. As an alternative to large models, we present SmallCap, which generates a caption conditioned on an input image and related captions retrieved from a datastore. 
Our model is lightweight and fast to train as the only learned parameters are in newly introduced cross-attention layers between a pre-trained CLIP encoder and GPT-2 decoder. SmallCap can transfer to new domains without additional finetuning and can exploit large-scale data in a training-free fashion since the contents of the datastore can be readily replaced. Our experiments show that SmallCap, trained only on COCO, has competitive performance on this benchmark, and also transfers to other domains without retraining, solely through retrieval from target-domain data. Further improvement is achieved through the training-free exploitation of diverse human-labeled and web data, which proves effective for a range of domains, including the nocaps benchmark, designed to test generalization to unseen visual concepts. + + + + Unifying Layout Generation With a Decoupled Diffusion Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Hui_Unifying_Layout_Generation_With_a_Decoupled_Diffusion_Model_CVPR_2023_paper.pdf + Layout generation aims to synthesize realistic graphic scenes consisting of elements with different attributes including category, size, position, and between-element relation. It is a crucial task for reducing the burden on heavy-duty graphic design works for formatted scenes, e.g., publications, documents, and user interfaces (UIs). Diverse application scenarios impose a big challenge in unifying various layout generation subtasks, including conditional and unconditional generation. In this paper, we propose a Layout Diffusion Generative Model (LDGM) to achieve such unification with a single decoupled diffusion model. LDGM views a layout of arbitrary missing or coarse element attributes as an intermediate diffusion status from a completed layout. Since different attributes have their individual semantics and characteristics, we propose to decouple the diffusion processes for them to improve the diversity of training samples and learn the reverse process jointly to exploit global-scope contexts for facilitating generation. As a result, our LDGM can generate layouts either from scratch or conditional on arbitrary available attributes. Extensive qualitative and quantitative experiments demonstrate our proposed LDGM outperforms existing layout generation models in both functionality and performance. + + + + Dynamic Neural Network for Multi-Task Learning Searching Across Diverse Network Topologies + http://openaccess.thecvf.com//content/CVPR2023/papers/Choi_Dynamic_Neural_Network_for_Multi-Task_Learning_Searching_Across_Diverse_Network_CVPR_2023_paper.pdf + In this paper, we present a new MTL framework that searches for structures optimized for multiple tasks with diverse graph topologies and shares features among tasks. We design a restricted DAG-based central network with read-in/read-out layers to build topologically diverse task-adaptive structures while limiting search space and time. We search for a single optimized network that serves as multiple task adaptive sub-networks using our three-stage training process. To make the network compact and discretized, we propose a flow-based reduction algorithm and a squeeze loss used in the training process. We evaluate our optimized network on various public MTL datasets and show ours achieves state-of-the-art performance. An extensive ablation study experimentally validates the effectiveness of the sub-module and schemes in our framework. 
+ + + + Relightable Neural Human Assets From Multi-View Gradient Illuminations + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Relightable_Neural_Human_Assets_From_Multi-View_Gradient_Illuminations_CVPR_2023_paper.pdf + Human modeling and relighting are two fundamental problems in computer vision and graphics, where high-quality datasets can largely facilitate related research. However, most existing human datasets only provide multi-view human images captured under the same illumination. Although valuable for modeling tasks, they are not readily used in relighting problems. To promote research in both fields, in this paper, we present UltraStage, a new 3D human dataset that contains more than 2,000 high-quality human assets captured under both multi-view and multi-illumination settings. Specifically, for each example, we provide 32 surrounding views illuminated with one white light and two gradient illuminations. In addition to regular multi-view images, gradient illuminations help recover detailed surface normals and spatially-varying material maps, enabling various relighting applications. Inspired by recent advances in neural representation, we further interpret each example into a neural human asset which allows novel view synthesis under arbitrary lighting conditions. We show our neural human assets can achieve extremely high capture performance and are capable of representing fine details such as facial wrinkles and cloth folds. We also validate UltraStage in single image relighting tasks, training neural networks with virtual relighted data from neural assets and demonstrating realistic rendering improvements over prior art. UltraStage will be publicly available to the community to stimulate significant future developments in various human modeling and rendering tasks. The dataset is available at https://miaoing.github.io/RNHA. + + + + Probing Sentiment-Oriented Pre-Training Inspired by Human Sentiment Perception Mechanism + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Probing_Sentiment-Oriented_Pre-Training_Inspired_by_Human_Sentiment_Perception_Mechanism_CVPR_2023_paper.pdf + Pre-training of deep convolutional neural networks (DCNNs) plays a crucial role in the field of visual sentiment analysis (VSA). Most proposed methods employ off-the-shelf backbones pre-trained on large-scale object classification datasets (i.e., ImageNet). While this boosts performance by a large margin over random initialization, we argue that DCNNs simply pre-trained on ImageNet may excessively focus on recognizing objects but fail to provide high-level concepts in terms of sentiment. To address this long-overlooked problem, we propose a sentiment-oriented pre-training method that is built upon the human visual sentiment perception (VSP) mechanism. Specifically, we factorize the process of VSP into three steps, namely stimuli taking, holistic organizing, and high-level perceiving. By imitating each VSP step, a total of three models are separately pre-trained via our devised sentiment-aware tasks that contribute to excavating sentiment-discriminated representations. Moreover, along with our elaborated multi-model amalgamation strategy, the prior knowledge learned from each perception step can be effectively transferred into a single target model, yielding substantial performance gains. 
Finally, we verify the superiority of our proposed method through extensive experiments, covering mainstream VSA tasks from single-label learning (SLL) and multi-label learning (MLL) to label distribution learning (LDL). Experimental results demonstrate that our proposed method leads to consistent improvements in these downstream tasks. Our code is released at https://github.com/tinglyfeng/sentiment_pretraining + + + + Imitation Learning As State Matching via Differentiable Physics + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Imitation_Learning_As_State_Matching_via_Differentiable_Physics_CVPR_2023_paper.pdf + Existing imitation learning (IL) methods such as inverse reinforcement learning (IRL) usually have a double-loop training process, alternating between learning a reward function and a policy, and tend to suffer from long training times and high variance. In this work, we identify the benefits of differentiable physics simulators and propose a new IL method, i.e., Imitation Learning via Differentiable Physics (ILD), which gets rid of the double-loop design and achieves significant improvements in final performance, convergence speed, and stability. The proposed ILD incorporates the differentiable physics simulator as a physics prior into its computational graph for policy learning. It unrolls the dynamics by sampling actions from a parameterized policy, simply minimizing the distance between the expert trajectory and the agent trajectory, and back-propagating the gradient into the policy via temporal physics operators. With the physics prior, ILD policies can not only be transferable to unseen environment specifications but also yield higher final performance on a variety of tasks. In addition, ILD naturally forms a single-loop structure, which significantly improves the stability and training speed. To simplify the complex optimization landscape induced by temporal physics operations, ILD dynamically selects the learning objectives for each state during optimization. In our experiments, we show that ILD outperforms state-of-the-art methods in a variety of continuous control tasks with Brax, requiring only one expert demonstration. In addition, ILD can be applied to challenging deformable object manipulation tasks and can be generalized to unseen configurations. + + + + TOPLight: Lightweight Neural Networks With Task-Oriented Pretraining for Visible-Infrared Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_TOPLight_Lightweight_Neural_Networks_With_Task-Oriented_Pretraining_for_Visible-Infrared_Recognition_CVPR_2023_paper.pdf + Visible-infrared recognition (VI recognition) is a challenging task due to the enormous visual difference across heterogeneous images. Most existing works achieve promising results by transfer learning, such as pretraining on ImageNet, based on advanced neural architectures like ResNet and ViT. However, such methods ignore the negative influence of the pretrained colour prior knowledge, and their heavy computational burden makes them hard to deploy in actual scenarios with limited resources. In this paper, we propose a novel task-oriented pretrained lightweight neural network (TOPLight) for VI recognition. Specifically, the TOPLight method simulates the domain conflict and sample variations with the proposed fake domain loss in the pretraining stage, which guides the network to learn how to handle those difficulties, such that a more general modality-shared feature representation is learned for the heterogeneous images. 
Moreover, an effective fine-grained dependency reconstruction module (FDR) is developed to discover substantial pattern dependencies shared in two modalities. Extensive experiments on VI person re-identification and VI face recognition datasets demonstrate the superiority of the proposed TOPLight, which significantly outperforms the current state of the art while demanding fewer computational resources. + + + + DeFeeNet: Consecutive 3D Human Motion Prediction With Deviation Feedback + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_DeFeeNet_Consecutive_3D_Human_Motion_Prediction_With_Deviation_Feedback_CVPR_2023_paper.pdf + Let us rethink the real-world scenarios that require human motion prediction techniques, such as human-robot collaboration. Current works simplify the task of predicting human motions into a one-off process of forecasting a short future sequence (usually no longer than 1 second) based on a historical observed one. However, such simplification may fail to meet practical needs due to the neglect of the fact that motion prediction in real applications is not an isolated "observe then predict" unit, but a consecutive process composed of many rounds of such units, semi-overlapped along the entire sequence. As time goes on, the predicted part of the previous round has its corresponding ground truth observable in the new round, but their deviation in-between is neither exploited nor able to be captured by the existing isolated learning fashion. In this paper, we propose DeFeeNet, a simple yet effective network that can be added onto existing one-off prediction models to realize deviation perception and feedback when applied to the consecutive motion prediction task. At each prediction round, the deviation generated by the previous unit is first encoded by our DeFeeNet, and then incorporated into the existing predictor to enable deviation-aware prediction, which, for the first time, allows for information transmission across adjacent prediction units. We design two versions of DeFeeNet, MLP-based and GRU-based, respectively. On Human3.6M and the more complicated BABEL, experimental results indicate that our proposed network improves consecutive human motion prediction performance regardless of the basic model. + + + + Mask DINO: Towards a Unified Transformer-Based Framework for Object Detection and Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Mask_DINO_Towards_a_Unified_Transformer-Based_Framework_for_Object_Detection_CVPR_2023_paper.pdf + In this paper we present Mask DINO, a unified object detection and segmentation framework. Mask DINO extends DINO (DETR with Improved Denoising Anchor Boxes) by adding a mask prediction branch which supports all image segmentation tasks (instance, panoptic, and semantic). It makes use of the query embeddings from DINO to dot-product a high-resolution pixel embedding map to predict a set of binary masks. Some key components in DINO are extended for segmentation through a shared architecture and training process. Mask DINO is simple, efficient, scalable, and benefits from joint large-scale detection and segmentation datasets. Our experiments show that Mask DINO significantly outperforms all existing specialized segmentation methods, both on a ResNet-50 backbone and a pre-trained model with SwinL backbone. 
Notably, Mask DINO establishes the best results to date on instance segmentation (54.5 AP on COCO), panoptic segmentation (59.4 PQ on COCO), and semantic segmentation (60.8 mIoU on ADE20K) among models under one billion parameters. We will release the code after the blind review. + + + + FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_FAME-ViL_Multi-Tasking_Vision-Language_Model_for_Heterogeneous_Fashion_Tasks_CVPR_2023_paper.pdf + In the fashion domain, there exists a variety of vision-and-language (V+L) tasks, including cross-modal retrieval, text-guided image retrieval, multi-modal classification, and image captioning. They differ drastically in each individual input/output format and dataset size. It has been common to design a task-specific model and fine-tune it independently from a pre-trained V+L model (e.g., CLIP). This results in parameter inefficiency and inability to exploit inter-task relatedness. To address such issues, we propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL) in this work. Compared with existing approaches, FAME-ViL applies a single model for multiple heterogeneous fashion tasks, therefore being much more parameter-efficient. It is enabled by two novel components: (1) a task-versatile architecture with cross-attention adapters and task-specific adapters integrated into a unified V+L model, and (2) a stable and effective multi-task training strategy that supports learning from heterogeneous data and prevents negative transfer. Extensive experiments on four fashion tasks show that our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models. Code is available at https://github.com/BrandonHanx/FAME-ViL. + + + + Rate Gradient Approximation Attack Threats Deep Spiking Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Bu_Rate_Gradient_Approximation_Attack_Threats_Deep_Spiking_Neural_Networks_CVPR_2023_paper.pdf + Spiking Neural Networks (SNNs) have attracted significant attention due to their energy-efficient properties and potential application on neuromorphic hardware. State-of-the-art SNNs are typically composed of simple Leaky Integrate-and-Fire (LIF) neurons and have become comparable to ANNs in image classification tasks on large-scale datasets. However, the robustness of these deep SNNs has not yet been fully uncovered. In this paper, we first experimentally observe that layers in these SNNs mostly communicate by rate coding. Based on this rate coding property, we develop a novel rate coding SNN-specified attack method, Rate Gradient Approximation Attack (RGA). We generalize the RGA attack to SNNs composed of LIF neurons with different leaky parameters and input encoding by designing surrogate gradients. In addition, we develop the time-extended enhancement to generate more effective adversarial examples. The experiment results indicate that our proposed RGA attack is more effective than the previous attack and is less sensitive to neuron hyperparameters. We also conclude from the experiment that rate-coded SNN composed of LIF neurons is not secure, which calls for exploring training methods for SNNs composed of complex neurons and other neuronal codings. 
Code is available at https://github.com/putshua/SNN_attack_RGA + + + + Adaptive Data-Free Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Qian_Adaptive_Data-Free_Quantization_CVPR_2023_paper.pdf + Data-free quantization (DFQ) recovers the performance of a quantized network (Q) without the original data, but generates fake samples via a generator (G) by learning from the full-precision network (P), which, however, is totally independent of Q, overlooking the adaptability of the knowledge from generated samples, i.e., informative or not to the learning process of Q, resulting in the overflow of generalization error. Building on this, several critical questions arise -- how to measure the sample adaptability to Q under varied bit-width scenarios? whether the largest adaptability is the best? how to generate the samples with adaptive adaptability to improve Q's generalization? To answer the above questions, in this paper, we propose an Adaptive Data-Free Quantization (AdaDFQ) method, which revisits DFQ from a zero-sum game perspective upon the sample adaptability between two players -- a generator and a quantized network. Following this viewpoint, we further define the disagreement and agreement samples to form two boundaries, where the margin between the two boundaries is optimized to adaptively regulate the adaptability of generated samples to Q, so as to address the over- and under-fitting issues. Our AdaDFQ reveals: 1) the largest adaptability is NOT the best for sample generation to benefit Q's generalization; 2) the knowledge of the generated sample should not be informative to Q only, but also related to the category and distribution information of the training data for P. The theoretical and empirical analysis validates the advantages of AdaDFQ over the state-of-the-art methods. Our code is available at https://github.com/hfutqian/AdaDFQ. + + + + Overcoming the Trade-Off Between Accuracy and Plausibility in 3D Hand Shape Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Overcoming_the_Trade-Off_Between_Accuracy_and_Plausibility_in_3D_Hand_CVPR_2023_paper.pdf + Direct mesh fitting for 3D hand shape reconstruction estimates highly accurate meshes. However, the resulting meshes are prone to artifacts and do not appear as plausible hand shapes. Conversely, parametric models like MANO ensure plausible hand shapes but are not as accurate as the non-parametric methods. In this work, we introduce a novel weakly-supervised hand shape estimation framework that integrates non-parametric mesh fitting with MANO models in an end-to-end fashion. Our joint model overcomes the trade-off between accuracy and plausibility to yield well-aligned and high-quality 3D meshes, especially in challenging two-hand and hand-object interaction scenarios. + + + + Open-Vocabulary Attribute Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Bravo_Open-Vocabulary_Attribute_Detection_CVPR_2023_paper.pdf + Vision-language modeling has enabled open-vocabulary tasks where predictions can be queried using any text prompt in a zero-shot manner. Existing open-vocabulary tasks focus on object classes, whereas research on object attributes is limited due to the lack of a reliable attribute-focused evaluation benchmark. This paper introduces the Open-Vocabulary Attribute Detection (OVAD) task and the corresponding OVAD benchmark. The objective of the novel task and benchmark is to probe object-level attribute information learned by vision-language models. 
To this end, we created a clean and densely annotated test set covering 117 attribute classes on the 80 object classes of MS COCO. It includes positive and negative annotations, which enables open-vocabulary evaluation. Overall, the benchmark consists of 1.4 million annotations. For reference, we provide a first baseline method for open-vocabulary attribute detection. Moreover, we demonstrate the benchmark's value by studying the attribute detection performance of several foundation models. + + + + TBP-Former: Learning Temporal Bird's-Eye-View Pyramid for Joint Perception and Prediction in Vision-Centric Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Fang_TBP-Former_Learning_Temporal_Birds-Eye-View_Pyramid_for_Joint_Perception_and_Prediction_CVPR_2023_paper.pdf + Vision-centric joint perception and prediction (PnP) has become an emerging trend in autonomous driving research. It predicts the future states of the traffic participants in the surrounding environment from raw RGB images. However, it is still a critical challenge to synchronize features obtained at multiple camera views and timestamps due to inevitable geometric distortions and further exploit those spatial-temporal features. To address this issue, we propose a temporal bird's-eye-view pyramid transformer (TBP-Former) for vision-centric PnP, which includes two novel designs. First, a pose-synchronized BEV encoder is proposed to map raw image inputs with any camera pose at any time to a shared and synchronized BEV space for better spatial-temporal synchronization. Second, a spatial-temporal pyramid transformer is introduced to comprehensively extract multi-scale BEV features and predict future BEV states with the support of spatial priors. Extensive experiments on nuScenes dataset show that our proposed framework overall outperforms all state-of-the-art vision-based prediction methods. + + + + Test of Time: Instilling Video-Language Models With a Sense of Time + http://openaccess.thecvf.com//content/CVPR2023/papers/Bagad_Test_of_Time_Instilling_Video-Language_Models_With_a_Sense_of_CVPR_2023_paper.pdf + Modelling and understanding time remains a challenge in contemporary video understanding models. With language emerging as a key driver towards powerful generalization, it is imperative for foundational video-language models to have a sense of time. In this paper, we consider a specific aspect of temporal understanding: consistency of time order as elicited by before/after relations. We establish that seven existing video-language models struggle to understand even such simple temporal relations. We then question whether it is feasible to equip these foundational models with temporal awareness without re-training them from scratch. Towards this, we propose a temporal adaptation recipe on top of one such model, VideoCLIP, based on post-pretraining on a small amount of video-text data. We conduct a zero-shot evaluation of the adapted models on six datasets for three downstream tasks which require varying degrees of time awareness. We observe encouraging performance gains especially when the task needs higher time awareness. Our work serves as a first step towards probing and instilling a sense of time in existing video-language models without the need for data and compute-intense training from scratch. 
+ + + + Learning To Segment Every Referring Object Point by Point + http://openaccess.thecvf.com//content/CVPR2023/papers/Qu_Learning_To_Segment_Every_Referring_Object_Point_by_Point_CVPR_2023_paper.pdf + Referring Expression Segmentation (RES) can facilitate pixel-level semantic alignment between vision and language. Most of the existing RES approaches require massive pixel-level annotations, which are expensive and exhaustive. In this paper, we propose a new partially supervised training paradigm for RES, i.e., training using abundant referring bounding boxes and only a few (e.g., 1%) pixel-level referring masks. To maximize the transferability from the REC model, we construct our model based on the point-based sequence prediction model. We propose the co-content teacher-forcing to make the model explicitly associate the point coordinates (scale values) with the referred spatial features, which alleviates the exposure bias caused by the limited segmentation masks. To make the most of referring bounding box annotations, we further propose the resampling pseudo points strategy to select more accurate pseudo-points as supervision. Extensive experiments show that our model achieves 52.06% in terms of accuracy (versus 58.93% in fully supervised setting) on RefCOCO+@testA, when only using 1% of the mask annotations. + + + + Seeing With Sound: Long-range Acoustic Beamforming for Multimodal Scene Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Chakravarthula_Seeing_With_Sound_Long-range_Acoustic_Beamforming_for_Multimodal_Scene_Understanding_CVPR_2023_paper.pdf + Existing autonomous vehicles primarily use sensors that rely on electromagnetic waves which are undisturbed in good environmental conditions but can suffer in adverse scenarios, such as low light or for objects with low reflectance. Moreover, only objects in direct line-of-sight are typically detected by these existing methods. Acoustic pressure waves emanating from road users do not share these limitations. However, such signals are typically ignored in automotive perception because they suffer from low spatial resolution and lack directional information. In this work, we introduce long-range acoustic beamforming of pressure waves from noise directly produced by automotive vehicles in-the-wild as a complementary sensing modality to traditional optical sensor approaches for detection of objects in dynamic traffic environments. To this end, we introduce the first multimodal long-range acoustic beamforming dataset. We propose a neural aperture expansion method for beamforming and we validate its utility for multimodal automotive object detection. We validate the benefit of adding sound detections to existing RGB cameras in challenging automotive scenarios, where camera-only approaches fail or do not deliver the ultra-fast rates of pressure sensors. + + + + OpenScene: 3D Scene Understanding With Open Vocabularies + http://openaccess.thecvf.com//content/CVPR2023/papers/Peng_OpenScene_3D_Scene_Understanding_With_Open_Vocabularies_CVPR_2023_paper.pdf + Traditional 3D scene understanding approaches rely on labeled 3D datasets to train a model for a single task with supervision. We propose OpenScene, an alternative approach where a model predicts dense features for 3D scene points that are co-embedded with text and image pixels in CLIP feature space. This zero-shot approach enables task-agnostic training and open-vocabulary queries. 
For example, to perform SOTA zero-shot 3D semantic segmentation it first infers CLIP features for every 3D point and later classifies them based on similarities to embeddings of arbitrary class labels. More interestingly, it enables a suite of open-vocabulary scene understanding applications that have never been done before. For example, it allows a user to enter an arbitrary text query and then see a heat map indicating which parts of a scene match. Our approach is effective at identifying objects, materials, affordances, activities, and room types in complex 3D scenes, all using a single model trained without any labeled 3D data. + + + + Movies2Scenes: Using Movie Metadata To Learn Scene Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Movies2Scenes_Using_Movie_Metadata_To_Learn_Scene_Representation_CVPR_2023_paper.pdf + Understanding scenes in movies is crucial for a variety of applications such as video moderation, search, and recommendation. However, labeling individual scenes is a time-consuming process. In contrast, movie level metadata (e.g., genre, synopsis, etc.) regularly gets produced as part of the film production process, and is therefore significantly more commonly available. In this work, we propose a novel contrastive learning approach that uses movie metadata to learn a general-purpose scene representation. Specifically, we use movie metadata to define a measure of movie similarity, and use it during contrastive learning to limit our search for positive scene-pairs to only the movies that are considered similar to each other. Our learned scene representation consistently outperforms existing state-of-the-art methods on a diverse set of tasks evaluated using multiple benchmark datasets. Notably, our learned representation offers an average improvement of 7.9% on the seven classification tasks and 9.7% improvement on the two regression tasks in LVU dataset. Furthermore, using a newly collected movie dataset, we present comparative results of our scene representation on a set of video moderation tasks to demonstrate its generalizability on previously less explored tasks. + + + + Joint Token Pruning and Squeezing Towards More Aggressive Compression of Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Joint_Token_Pruning_and_Squeezing_Towards_More_Aggressive_Compression_of_CVPR_2023_paper.pdf + Although vision transformers (ViTs) have shown promising results in various computer vision tasks recently, their high computational cost limits their practical applications. Previous approaches that prune redundant tokens have demonstrated a good trade-off between performance and computation costs. Nevertheless, errors caused by pruning strategies can lead to significant information loss. Our quantitative experiments reveal that the impact of pruned tokens on performance should be noticeable. To address this issue, we propose a novel joint Token Pruning & Squeezing module (TPS) for compressing vision transformers with higher efficiency. Firstly, TPS adopts pruning to get the reserved and pruned subsets. Secondly, TPS squeezes the information of pruned tokens into partial reserved tokens via the unidirectional nearest-neighbor matching and similarity-oriented fusing steps. Compared to state-of-the-art methods, our approach outperforms them under all token pruning intensities. 
Especially while shrinking DeiT-tiny&small computational budgets to 35%, it improves the accuracy by 1%-6% compared with baselines on ImageNet classification. The proposed method can accelerate the throughput of DeiT-small beyond DeiT-tiny, while its accuracy surpasses DeiT-tiny by 4.78%. Experiments on various transformers demonstrate the effectiveness of our method, while analysis experiments prove our higher robustness to the errors of the token pruning policy. Code is available at https://github.com/megvii-research/TPS-CVPR2023. + + + + Solving Oscillation Problem in Post-Training Quantization Through a Theoretical Perspective + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_Solving_Oscillation_Problem_in_Post-Training_Quantization_Through_a_Theoretical_Perspective_CVPR_2023_paper.pdf + Post-training quantization (PTQ) is widely regarded as one of the most efficient compression methods practically, benefitting from its data privacy and low computation costs. We argue that an overlooked problem of oscillation is in the PTQ methods. In this paper, we take the initiative to explore and present a theoretical proof to explain why such a problem is essential in PTQ. And then, we try to solve this problem by introducing a principled and generalized framework theoretically. In particular, we first formulate the oscillation in PTQ and prove the problem is caused by the difference in module capacity. To this end, we define the module capacity (ModCap) under data-dependent and data-free scenarios, where the differentials between adjacent modules are used to measure the degree of oscillation. The problem is then solved by selecting top-k differentials, in which the corresponding modules are jointly optimized and quantized. Extensive experiments demonstrate that our method successfully reduces the performance drop and is generalized to different neural networks and PTQ methods. For example, with 2/4 bit ResNet-50 quantization, our method surpasses the previous state-of-the-art method by 1.9%. It becomes more significant on small model quantization, e.g. surpasses BRECQ method by 6.61% on MobileNetV2*0.5. + + + + Masked Image Modeling With Local Multi-Scale Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Masked_Image_Modeling_With_Local_Multi-Scale_Reconstruction_CVPR_2023_paper.pdf + Masked Image Modeling (MIM) achieves outstanding success in self-supervised representation learning. Unfortunately, MIM models typically have huge computational burden and slow learning process, which is an inevitable obstacle for their industrial applications. Although the lower layers play the key role in MIM, existing MIM models conduct reconstruction task only at the top layer of encoder. The lower layers are not explicitly guided and the interaction among their patches is only used for calculating new activations. Considering the reconstruction task requires non-trivial inter-patch interactions to reason target signals, we apply it to multiple local layers including lower and upper layers. Further, since the multiple layers expect to learn the information of different scales, we design local multi-scale reconstruction, where the lower and upper layers reconstruct fine-scale and coarse-scale supervision signals respectively. This design not only accelerates the representation learning process by explicitly guiding multiple layers, but also facilitates multi-scale semantical understanding to the input. 
Extensive experiments show that with significantly less pre-training burden, our model achieves comparable or better performance on classification, detection and segmentation tasks than existing MIM models. + + + + Flexible-Cm GAN: Towards Precise 3D Dose Prediction in Radiotherapy + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_Flexible-Cm_GAN_Towards_Precise_3D_Dose_Prediction_in_Radiotherapy_CVPR_2023_paper.pdf + Deep learning has been utilized in knowledge-based radiotherapy planning in which a system trained with a set of clinically approved plans is employed to infer a three-dimensional dose map for a given new patient. However, previous deep methods are primarily limited to simple scenarios, e.g., a fixed planning type or a consistent beam angle configuration. This in fact limits the usability of such approaches and makes them not generalizable over a larger set of clinical scenarios. Herein, we propose a novel conditional generative model, Flexible-C^m GAN, utilizing additional information regarding planning types and various beam geometries. A miss-consistency loss is proposed to deal with the challenge of having a limited set of conditions on the input data, e.g., incomplete training samples. To address the challenges of including clinical preferences, we derive a differentiable shift-dose-volume loss to incorporate the well-known dose-volume histogram constraints. During inference, users can flexibly choose a specific planning type and a set of beam angles to meet the clinical requirements. We conduct experiments on an illustrative face dataset to show the motivation of Flexible-C^m GAN and further validate our model's potential clinical values with two radiotherapy datasets. The results demonstrate the superior performance of the proposed method in a practical heterogeneous radiotherapy planning application compared to existing deep learning-based approaches. + + + + Handy: Towards a High Fidelity 3D Hand Shape and Appearance Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Potamias_Handy_Towards_a_High_Fidelity_3D_Hand_Shape_and_Appearance_CVPR_2023_paper.pdf + Over the last few years, with the advent of virtual and augmented reality, an enormous amount of research has been focused on modeling, tracking and reconstructing human hands. Given their power to express human behavior, hands have been a very important, but challenging component of the human body. Currently, most of the state-of-the-art reconstruction and pose estimation methods rely on the low polygon MANO model. Apart from its low polygon count, MANO model was trained with only 31 adult subjects, which not only limits its expressive power but also imposes unnecessary shape reconstruction constraints on pose estimation methods. Moreover, hand appearance remains almost unexplored and neglected from the majority of hand reconstruction methods. In this work, we propose "Handy", a large-scale model of the human hand, modeling both shape and appearance composed of over 1200 subjects which we make publicly available for the benefit of the research community. In contrast to current models, our proposed hand model was trained on a dataset with large diversity in age, gender, and ethnicity, which tackles the limitations of MANO and accurately reconstructs out-of-distribution samples. In order to create a high quality texture model, we trained a powerful GAN, which preserves high frequency details and is able to generate high resolution hand textures. 
To showcase the capabilities of the proposed model, we built a synthetic dataset of textured hands and trained a hand pose estimation network to reconstruct both the shape and appearance from single images. As it is demonstrated in an extensive series of quantitative as well as qualitative experiments, our model proves to be robust against the state-of-the-art and realistically captures the 3D hand shape and pose along with a high frequency detailed texture even in adverse "in-the-wild" conditions. + + + + Learning To Zoom and Unzoom + http://openaccess.thecvf.com//content/CVPR2023/papers/Thavamani_Learning_To_Zoom_and_Unzoom_CVPR_2023_paper.pdf + Many perception systems in mobile computing, autonomous navigation, and AR/VR face strict compute constraints that are particularly challenging for high-resolution input images. Previous works propose nonuniform downsamplers that "learn to zoom" on salient image regions, reducing compute while retaining task-relevant image information. However, for tasks with spatial labels (such as 2D/3D object detection and semantic segmentation), such distortions may harm performance. In this work (LZU), we "learn to zoom" in on the input image, compute spatial features, and then "unzoom" to revert any deformations. To enable efficient and differentiable unzooming, we approximate the zooming warp with a piecewise bilinear mapping that is invertible. LZU can be applied to any task with 2D spatial input and any model with 2D spatial features, and we demonstrate this versatility by evaluating on a variety of tasks and datasets: object detection on Argoverse-HD, semantic segmentation on Cityscapes, and monocular 3D object detection on nuScenes. Interestingly, we observe boosts in performance even when high-resolution sensor data is unavailable, implying that LZU can be used to "learn to upsample" as well. Code and additional visuals are available at https://tchittesh.github.io/lzu/. + + + + Task Difficulty Aware Parameter Allocation & Regularization for Lifelong Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Task_Difficulty_Aware_Parameter_Allocation__Regularization_for_Lifelong_Learning_CVPR_2023_paper.pdf + Parameter regularization or allocation methods are effective in overcoming catastrophic forgetting in lifelong learning. However, they solve all tasks in a sequence uniformly and ignore the differences in the learning difficulty of different tasks. So parameter regularization methods face significant forgetting when learning a new task very different from learned tasks, and parameter allocation methods face unnecessary parameter overhead when learning simple tasks. In this paper, we propose the Parameter Allocation & Regularization (PAR), which adaptively select an appropriate strategy for each task from parameter allocation and regularization based on its learning difficulty. A task is easy for a model that has learned tasks related to it and vice versa. We propose a divergence estimation method based on the Nearest-Prototype distance to measure the task relatedness using only features of the new task. Moreover, we propose a time-efficient relatedness-aware sampling-based architecture search strategy to reduce the parameter overhead for allocation. Experimental results on multiple benchmarks demonstrate that, compared with SOTAs, our method is scalable and significantly reduces the model's redundancy while improving the model's performance. Further qualitative analysis indicates that PAR obtains reasonable task-relatedness. 
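The PAR entry above states only at a high level that task relatedness is estimated from Nearest-Prototype distances computed on the new task's features. Below is a minimal sketch of one plausible reading; the Euclidean metric, the mean reduction, and the `threshold` used to pick between regularization and allocation are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def nearest_prototype_divergence(new_feats, old_prototypes):
    """Rough estimate of how far a new task's features lie from the class
    prototypes of previously learned tasks (smaller => more related).

    new_feats:      (N, D) features extracted from the new task's data
    old_prototypes: (K, D) per-class mean features from already-learned tasks
    """
    # Pairwise Euclidean distances between new-task features and old prototypes.
    dists = np.linalg.norm(new_feats[:, None, :] - old_prototypes[None, :, :], axis=-1)
    # Score each new-task sample by its nearest learned prototype, then average.
    return dists.min(axis=1).mean()

def choose_strategy(new_feats, old_prototypes, threshold=1.0):
    # Related (easy) task -> cheap parameter regularization; unrelated -> allocation.
    d = nearest_prototype_divergence(new_feats, old_prototypes)
    return "regularization" if d < threshold else "allocation"
```

In this reading, a small divergence means the new task sits close to prototypes of tasks the model has already learned, so regularization suffices; a large divergence triggers parameter allocation.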
+ + + + From Node Interaction To Hop Interaction: New Effective and Scalable Graph Learning Paradigm + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_From_Node_Interaction_To_Hop_Interaction_New_Effective_and_Scalable_CVPR_2023_paper.pdf + Existing Graph Neural Networks (GNNs) follow the message-passing mechanism that conducts information interaction among nodes iteratively. While considerable progress has been made, such node interaction paradigms still have the following limitation. First, the scalability limitation precludes the broad application of GNNs in large-scale industrial settings since the node interaction among rapidly expanding neighbors incurs high computation and memory costs. Second, the over-smoothing problem restricts the discrimination ability of nodes, i.e., node representations of different classes will converge to indistinguishable after repeated node interactions. In this work, we propose a novel hop interaction paradigm to address these limitations simultaneously. The core idea is to convert the interaction target among nodes to pre-processed multi-hop features inside each node. We design a simple yet effective HopGNN framework that can easily utilize existing GNNs to achieve hop interaction. Furthermore, we propose a multi-task learning strategy with a self-supervised learning objective to enhance HopGNN. We conduct extensive experiments on 12 benchmark datasets in a wide range of domains, scales, and smoothness of graphs. Experimental results show that our methods achieve superior performance while maintaining high scalability and efficiency. The code is at https://github.com/JC-202/HopGNN. + + + + Understanding and Improving Features Learned in Deep Functional Maps + http://openaccess.thecvf.com//content/CVPR2023/papers/Attaiki_Understanding_and_Improving_Features_Learned_in_Deep_Functional_Maps_CVPR_2023_paper.pdf + Deep functional maps have recently emerged as a successful paradigm for non-rigid 3D shape correspondence tasks. An essential step in this pipeline consists in learning feature functions that are used as constraints to solve for a functional map inside the network. However, the precise nature of the information learned and stored in these functions is not yet well understood. Specifically, a major question is whether these features can be used for any other objective, apart from their purely algebraic role, in solving for functional map matrices. In this paper, we show that under some mild conditions, the features learned within deep functional map approaches can be used as point-wise descriptors and thus are directly comparable across different shapes, even without the necessity of solving for a functional map at test time. Furthermore, informed by our analysis, we propose effective modifications to the standard deep functional map pipeline, which promotes structural properties of learned features, significantly improving the matching results. Finally, we demonstrate that previously unsuccessful attempts at using extrinsic architectures for deep functional map feature extraction can be remedied via simple architectural changes, which promote the theoretical properties suggested by our analysis. We thus bridge the gap between intrinsic and extrinsic surface-based learning, suggesting the necessary and sufficient conditions for successful shape matching. Our code is available at https://github.com/pvnieo/clover. 
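The deep functional maps entry above argues that, under mild conditions, the learned per-point features act as point-wise descriptors that are directly comparable across shapes, so a map can be read off without solving for a functional map at test time. A minimal sketch of that matching step is shown below; the cosine similarity and plain argmax nearest-neighbour search are assumptions for illustration, not the paper's full pipeline.

```python
import torch
import torch.nn.functional as F

def match_by_features(feat_x, feat_y):
    """Use learned per-point features directly as descriptors: each point on
    shape X is matched to its nearest neighbour in feature space on shape Y,
    with no functional map solved at test time.

    feat_x: (Nx, D) features of shape X
    feat_y: (Ny, D) features of shape Y
    returns: (Nx,) indices into Y giving a point-to-point map X -> Y
    """
    fx = F.normalize(feat_x, dim=-1)
    fy = F.normalize(feat_y, dim=-1)
    # Cosine-similarity nearest neighbour; for large meshes this Nx x Ny
    # similarity matrix would be computed in chunks.
    return (fx @ fy.T).argmax(dim=-1)
```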
+ + + + PartManip: Learning Cross-Category Generalizable Part Manipulation Policy From Point Cloud Observations + http://openaccess.thecvf.com//content/CVPR2023/papers/Geng_PartManip_Learning_Cross-Category_Generalizable_Part_Manipulation_Policy_From_Point_Cloud_CVPR_2023_paper.pdf + Learning a generalizable object manipulation policy is vital for an embodied agent to work in complex real-world scenes. Parts, as the shared components in different object categories, have the potential to increase the generalization ability of the manipulation policy and achieve cross-category object manipulation. In this work, we build the first large-scale, part-based cross-category object manipulation benchmark, PartManip, which is composed of 11 object categories, 494 objects, and 1432 tasks in 6 task classes. Compared to previous work, our benchmark is also more diverse and realistic, i.e., having more objects and using sparse-view point cloud as input without oracle information like part segmentation. To tackle the difficulties of vision-based policy learning, we first train a state-based expert with our proposed part-based canonicalization and part-aware rewards, and then distill the knowledge to a vision-based student. We also find an expressive backbone is essential to overcome the large diversity of different objects. For cross-category generalization, we introduce domain adversarial learning for domain-invariant feature extraction. Extensive experiments in simulation show that our learned policy can outperform other methods by a large margin, especially on unseen object categories. We also demonstrate our method can successfully manipulate novel objects in the real world. + + + + Polynomial Implicit Neural Representations for Large Diverse Datasets + http://openaccess.thecvf.com//content/CVPR2023/papers/Singh_Polynomial_Implicit_Neural_Representations_for_Large_Diverse_Datasets_CVPR_2023_paper.pdf + Implicit neural representations (INR) have gained significant popularity for signal and image representation for many end-tasks, such as superresolution, 3D modeling, and more. Most INR architectures rely on sinusoidal positional encoding, which accounts for high-frequency information in data. However, the finite encoding size restricts the model's representational power. Higher representational power is needed to go from representing a single given image to representing large and diverse datasets. Our approach addresses this gap by representing an image with a polynomial function and eliminates the need for positional encodings. Therefore, to achieve a progressively higher degree of polynomial representation, we use element-wise multiplications between features and affine-transformed coordinate locations after every ReLU layer. The proposed method is evaluated qualitatively and quantitatively on large datasets like ImageNet. The proposed Poly-INR model performs comparably to state-of-the-art generative models without any convolution, normalization, or self-attention layers, and with far fewer trainable parameters. With much fewer training parameters and higher representative power, our approach paves the way for broader adoption of INR models for generative modeling tasks in complex domains. 
The code is available at https://github.com/Rajhans0/Poly_INR + + + + High-Frequency Stereo Matching Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_High-Frequency_Stereo_Matching_Network_CVPR_2023_paper.pdf + In the field of binocular stereo matching, remarkable progress has been made by iterative methods like RAFT-Stereo and CREStereo. However, most of these methods lose information during the iterative process, making it difficult to generate more detailed difference maps that take full advantage of high-frequency information. We propose the Decouple module to alleviate the problem of data coupling and allow features containing subtle details to transfer across the iterations which proves to alleviate the problem significantly in the ablations. To further capture high-frequency details, we propose a Normalization Refinement module that unifies the disparities as a proportion of the disparities over the width of the image, which address the problem of module failure in cross-domain scenarios. Further, with the above improvements, the ResNet-like feature extractor that has not been changed for years becomes a bottleneck. Towards this end, we proposed a multi-scale and multi-stage feature extractor that introduces the channel-wise self-attention mechanism which greatly addresses this bottleneck. Our method (DLNR) ranks 1st on the Middlebury leaderboard, significantly outperforming the next best method by 13.04%. Our method also achieves SOTA performance on the KITTI-2015 benchmark for D1-fg. + + + + Spatial-Then-Temporal Self-Supervised Learning for Video Correspondence + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Spatial-Then-Temporal_Self-Supervised_Learning_for_Video_Correspondence_CVPR_2023_paper.pdf + In low-level video analyses, effective representations are important to derive the correspondences between video frames. These representations have been learned in a self-supervised fashion from unlabeled images/videos, using carefully designed pretext tasks in some recent studies. However, the previous work concentrates on either spatial-discriminative features or temporal-repetitive features, with little attention to the synergy between spatial and temporal cues. To address this issue, we propose a novel spatial-then-temporal self-supervised learning method. Specifically, we firstly extract spatial features from unlabeled images via contrastive learning, and secondly enhance the features by exploiting the temporal cues in unlabeled videos via reconstructive learning. In the second step, we design a global correlation distillation loss to ensure the learning not to forget the spatial cues, and we design a local correlation distillation loss to combat the temporal discontinuity that harms the reconstruction. The proposed method outperforms the state-of-the-art self-supervised methods, as established by the experimental results on a series of correspondence-based video analysis tasks. Also, we performed ablation studies to verify the effectiveness of the two-step design as well as the distillation losses. + + + + Unsupervised Contour Tracking of Live Cells by Mechanical and Cycle Consistency Losses + http://openaccess.thecvf.com//content/CVPR2023/papers/Jang_Unsupervised_Contour_Tracking_of_Live_Cells_by_Mechanical_and_Cycle_CVPR_2023_paper.pdf + Analyzing the dynamic changes of cellular morphology is important for understanding the various functions and characteristics of live cells, including stem cells and metastatic cancer cells. 
To this end, we need to track all points on the highly deformable cellular contour in every frame of live cell video. Local shapes and textures on the contour are not evident, and their motions are complex, often with expansion and contraction of local contour features. The prior arts for optical flow or deep point set tracking are unsuited due to the fluidity of cells, and previous deep contour tracking does not consider point correspondence. We propose the first deep learning-based tracking of cellular (or more generally viscoelastic materials) contours with point correspondence by fusing dense representation between two contours with cross attention. Since it is impractical to manually label dense tracking points on the contour, unsupervised learning comprised of the mechanical and cyclical consistency losses is proposed to train our contour tracker. The mechanical loss forcing the points to move perpendicular to the contour effectively helps out. For quantitative evaluation, we labeled sparse tracking points along the contour of live cells from two live cell datasets taken with phase contrast and confocal fluorescence microscopes. Our contour tracker quantitatively outperforms compared methods and produces qualitatively more favorable results. Our code and data are publicly available at https://github.com/JunbongJang/contour-tracking/ + + + + Distribution Shift Inversion for Out-of-Distribution Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Distribution_Shift_Inversion_for_Out-of-Distribution_Prediction_CVPR_2023_paper.pdf + Machine learning society has witnessed the emergence of a myriad of Out-of-Distribution (OoD) algorithms, which address the distribution shift between the training and the testing distribution by searching for a unified predictor or invariant feature representation. However, the task of directly mitigating the distribution shift in the unseen testing set is rarely investigated, due to the unavailability of the testing distribution during the training phase and thus the impossibility of training a distribution translator mapping between the training and testing distribution. In this paper, we explore how to bypass the requirement of testing distribution for distribution translator training and make the distribution translation useful for OoD prediction. We propose a portable Distribution Shift Inversion (DSI) algorithm, in which, before being fed into the prediction model, the OoD testing samples are first linearly combined with additional Gaussian noise and then transferred back towards the training distribution using a diffusion model trained only on the source distribution. Theoretical analysis reveals the feasibility of our method. Experimental results, on both multiple-domain generalization datasets and single-domain generalization datasets, show that our method provides a general performance gain when plugged into a wide range of commonly used OoD algorithms. Our code is available at https://github.com/yu-rp/Distribution-Shift-Iverson. + + + + Parallel Diffusion Models of Operator and Image for Blind Inverse Problems + http://openaccess.thecvf.com//content/CVPR2023/papers/Chung_Parallel_Diffusion_Models_of_Operator_and_Image_for_Blind_Inverse_CVPR_2023_paper.pdf + Diffusion model-based inverse problem solvers have demonstrated state-of-the-art performance in cases where the forward operator is known (i.e. non-blind).
However, the applicability of the method to blind inverse problems has yet to be explored. In this work, we show that we can indeed solve a family of blind inverse problems by constructing another diffusion prior for the forward operator. Specifically, parallel reverse diffusion guided by gradients from the intermediate stages enables joint optimization of both the forward operator parameters as well as the image, such that both are jointly estimated at the end of the parallel reverse diffusion procedure. We show the efficacy of our method on two representative tasks --- blind deblurring, and imaging through turbulence --- and show that our method yields state-of-the-art performance, while also being flexible to be applicable to general blind inverse problems when we know the functional forms. Code available: https://github.com/BlindDPS/blind-dps + + + + Semidefinite Relaxations for Robust Multiview Triangulation + http://openaccess.thecvf.com//content/CVPR2023/papers/Harenstam-Nielsen_Semidefinite_Relaxations_for_Robust_Multiview_Triangulation_CVPR_2023_paper.pdf + We propose an approach based on convex relaxations for certifiably optimal robust multiview triangulation. To this end, we extend existing relaxation approaches to non-robust multiview triangulation by incorporating a least squares cost function. We propose two formulations, one based on epipolar constraints and one based on fractional reprojection constraints. The first is lower dimensional and remains tight under moderate noise and outlier levels, while the second is higher dimensional and therefore slower but remains tight even under extreme noise and outlier levels. We demonstrate through extensive experiments that the proposed approaches allow us to compute provably optimal reconstructions even under significant noise and a large percentage of outliers. + + + + Modeling Video As Stochastic Processes for Fine-Grained Video Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Modeling_Video_As_Stochastic_Processes_for_Fine-Grained_Video_Representation_Learning_CVPR_2023_paper.pdf + A meaningful video is semantically coherent and changes smoothly. However, most existing fine-grained video representation learning methods learn frame-wise features by aligning frames across videos or exploring relevance between multiple views, neglecting the inherent dynamic process of each video. In this paper, we propose to learn video representations by modeling Video as Stochastic Processes (VSP) via a novel process-based contrastive learning framework, which aims to discriminate between video processes and simultaneously capture the temporal dynamics in the processes. Specifically, we enforce the embeddings of the frame sequence of interest to approximate a goal-oriented stochastic process, i.e., Brownian bridge, in the latent space via a process-based contrastive loss. To construct the Brownian bridge, we adapt specialized sampling strategies under different annotations for both self-supervised and weakly-supervised learning. Experimental results on four datasets show that VSP stands as a state-of-the-art method for various video understanding tasks, including phase progression, phase classification and frame retrieval. Code is available at 'https://github.com/hengRUC/VSP'. 
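The VSP entry above enforces frame embeddings to follow a Brownian bridge between the first and last frame of a clip via a process-based contrastive loss. The sketch below writes out one simplified form of that idea; the unit diffusion coefficient, the choice of negatives, and the cross-entropy formulation are assumptions for illustration rather than the paper's exact loss or sampling strategy.

```python
import torch
import torch.nn.functional as F

def bridge_log_density(z_t, z_0, z_T, t, T):
    """Unnormalized log-density of a Brownian bridge from z_0 (time 0) to
    z_T (time T), evaluated at z_t (time t); unit diffusion assumed."""
    mu = (1 - t / T) * z_0 + (t / T) * z_T
    var = t * (T - t) / T + 1e-6
    return -((z_t - mu) ** 2).sum(-1) / (2 * var)

def bridge_contrastive_loss(z_0, z_t, z_T, neg_z_t, t, T):
    """Toy process-based contrastive loss: the true intermediate embedding z_t
    should fit the bridge between z_0 and z_T better than negatives do.

    z_0, z_t, z_T: (B, D) embeddings of the first, intermediate and last frames
    neg_z_t:       (B, K, D) negative intermediate embeddings (e.g. other clips)
    """
    pos = bridge_log_density(z_t, z_0, z_T, t, T)                        # (B,)
    neg = bridge_log_density(neg_z_t, z_0[:, None], z_T[:, None], t, T)  # (B, K)
    logits = torch.cat([pos[:, None], neg], dim=1)
    target = torch.zeros(len(z_0), dtype=torch.long, device=z_0.device)  # positive is index 0
    return F.cross_entropy(logits, target)
```

Minimizing this pulls true intermediate frames toward the bridge mean between the clip's endpoints while pushing negatives away, which is the goal-oriented behaviour the abstract describes.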
+ + + + Relational Space-Time Query in Long-Form Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Relational_Space-Time_Query_in_Long-Form_Videos_CVPR_2023_paper.pdf + Egocentric videos are often available in the form of uninterrupted, uncurated long videos capturing the camera wearers' daily life activities.Understanding these videos requires models to be able to reason about activities, objects, and their interactions. However, current video benchmarks study these problems independently and under short, curated clips. In contrast, real-world applications, e.g., AR assistants, require bundling these problems for both model development and evaluation. In this paper, we propose to study these problems in a joint framework for long video understanding. Our contributions are three-fold. First, we propose an integrated framework, namely Relational Space-Time Query (ReST), for evaluating video understanding models via templated spatiotemporal queries. Second, we introduce two new benchmarks, ReST-ADL and ReST-Ego4D, which augment the existing egocentric video datasets with abundant query annotations generated by the ReST framework. Finally, we present a set of baselines and in-depth analysis on the two benchmarks and provide insights about the query tasks. We view our integrated framework and benchmarks as a step towards comprehensive, multi-step reasoning in long videos, and believe it will facilitate the development of next generations of video understanding models. + + + + BiFormer: Learning Bilateral Motion Estimation via Bilateral Transformer for 4K Video Frame Interpolation + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_BiFormer_Learning_Bilateral_Motion_Estimation_via_Bilateral_Transformer_for_4K_CVPR_2023_paper.pdf + A novel 4K video frame interpolator based on bilateral transformer (BiFormer) is proposed in this paper, which performs three steps: global motion estimation, local motion refinement, and frame synthesis. First, in global motion estimation, we predict symmetric bilateral motion fields at a coarse scale. To this end, we propose BiFormer, the first transformer-based bilateral motion estimator. Second, we refine the global motion fields efficiently using blockwise bilateral cost volumes (BBCVs). Third, we warp the input frames using the refined motion fields and blend them to synthesize an intermediate frame. Extensive experiments demonstrate that the proposed BiFormer algorithm achieves excellent interpolation performance on 4K datasets. The source codes are available at https://github.com/JunHeum/BiFormer. + + + + Learning From Unique Perspectives: User-Aware Saliency Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Learning_From_Unique_Perspectives_User-Aware_Saliency_Modeling_CVPR_2023_paper.pdf + Everyone is unique. Given the same visual stimuli, people's attention is driven by both salient visual cues and their own inherent preferences. Knowledge of visual preferences not only facilitates understanding of fine-grained attention patterns of diverse users, but also has the potential of benefiting the development of customized applications. Nevertheless, existing saliency models typically limit their scope to attention as it applies to the general population and ignore the variability between users' behaviors. In this paper, we identify the critical roles of visual preferences in attention modeling, and for the first time study the problem of user-aware saliency modeling. 
Our work aims to advance attention research from three distinct perspectives: (1) We present a new model with the flexibility to capture attention patterns of various combinations of users, so that we can adaptively predict personalized attention, user group attention, and general saliency at the same time with one single model; (2) To augment models with knowledge about the composition of attention from different users, we further propose a principled learning method to understand visual attention in a progressive manner; and (3) We carry out extensive analyses on publicly available saliency datasets to shed light on the roles of visual preferences. Experimental results on diverse stimuli, including naturalistic images and web pages, demonstrate the advantages of our method in capturing the distinct visual behaviors of different users and the general saliency of visual stimuli. + + + + MaskSketch: Unpaired Structure-Guided Masked Image Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Bashkirova_MaskSketch_Unpaired_Structure-Guided_Masked_Image_Generation_CVPR_2023_paper.pdf + Recent conditional image generation methods produce images of remarkable diversity, fidelity and realism. However, the majority of these methods allow conditioning only on labels or text prompts, which limits their level of control over the generation result. In this paper, we introduce MaskSketch, an image generation method that allows spatial conditioning of the generation result using a guiding sketch as an extra conditioning signal during sampling. MaskSketch utilizes a pre-trained masked generative transformer, requiring no model training or paired supervision, and works with input sketches of different levels of abstraction. We show that intermediate self-attention maps of a masked generative transformer encode important structural information of the input image, such as scene layout and object shape, and we propose a novel sampling method based on this observation to enable structure-guided generation. Our results show that MaskSketch achieves high image realism and fidelity to the guiding structure. Evaluated on standard benchmark datasets, MaskSketch outperforms state-of-the-art methods for sketch-to-image translation, as well as unpaired image-to-image translation approaches. The code can be found on our project website: https://masksketch.github.io/ + + + + Open-Vocabulary Point-Cloud Object Detection Without 3D Annotation + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Open-Vocabulary_Point-Cloud_Object_Detection_Without_3D_Annotation_CVPR_2023_paper.pdf + The goal of open-vocabulary detection is to identify novel objects based on arbitrary textual descriptions. In this paper, we address open-vocabulary 3D point-cloud detection by a dividing-and-conquering strategy, which involves: 1) developing a point-cloud detector that can learn a general representation for localizing various objects, and 2) connecting textual and point-cloud representations to enable the detector to classify novel object categories based on text prompting. Specifically, we resort to rich image pre-trained models, by which the point-cloud detector learns localizing objects under the supervision of predicted 2D bounding boxes from 2D pre-trained detectors. Moreover, we propose a novel de-biased triplet cross-modal contrastive learning to connect the modalities of image, point-cloud and text, thereby enabling the point-cloud detector to benefit from vision-language pre-trained models, i.e., CLIP. 
The novel use of image and vision-language pre-trained models for point-cloud detectors allows for open-vocabulary 3D object detection without the need for 3D annotations. Experiments demonstrate that the proposed method improves at least 3.03 points and 7.47 points over a wide range of baselines on the ScanNet and SUN RGB-D datasets, respectively. Furthermore, we provide a comprehensive analysis to explain why our approach works. + + + + Lookahead Diffusion Probabilistic Models for Refining Mean Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Lookahead_Diffusion_Probabilistic_Models_for_Refining_Mean_Estimation_CVPR_2023_paper.pdf + We propose lookahead diffusion probabilistic models (LA-DPMs) to exploit the correlation in the outputs of the deep neural networks (DNNs) over subsequent timesteps in diffusion probabilistic models (DPMs) to refine the mean estimation of the conditional Gaussian distributions in the backward process. A typical DPM first obtains an estimate of the original data sample x by feeding the most recent state z_i and index i into the DNN model and then computes the mean vector of the conditional Gaussian distribution for z_ i-1 . We propose to calculate a more accurate estimate for x by performing extrapolation on the two estimates of x that are obtained by feeding (z_ i+1 , i+1) and (z_i, i) into the DNN model. The extrapolation can be easily integrated into the backward process of existing DPMs by introducing an additional connection over two consecutive timesteps, and fine-tuning is not required. Extensive experiments showed that plugging in the additional connection into DDPM, DDIM, DEIS, S-PNDM, and high-order DPM-Solvers leads to a significant performance gain in terms of Frechet inception distance (FID) score. Our implementation is available at https://github.com/guoqiangzhang-x/LA-DPM. + + + + TensoIR: Tensorial Inverse Rendering + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_TensoIR_Tensorial_Inverse_Rendering_CVPR_2023_paper.pdf + We propose TensoIR, a novel inverse rendering approach based on tensor factorization and neural fields. Unlike previous works that use purely MLP-based neural fields, thus suffering from low capacity and high computation costs, we extend TensoRF, a state-of-the-art approach for radiance field modeling, to estimate scene geometry, surface reflectance, and environment illumination from multi-view images captured under unknown lighting conditions. Our approach jointly achieves radiance field reconstruction and physically-based model estimation, leading to photo-realistic novel view synthesis and relighting. Benefiting from the efficiency and extensibility of the TensoRF-based representation, our method can accurately model secondary shading effects (like shadows and indirect lighting) and generally support input images captured under a single or multiple unknown lighting conditions. The low-rank tensor representation allows us to not only achieve fast and compact reconstruction but also better exploit shared information under an arbitrary number of capturing lighting conditions. We demonstrate the superiority of our method to baseline methods qualitatively and quantitatively on various challenging synthetic and real-world scenes. 
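The LA-DPM entry above (Lookahead Diffusion Probabilistic Models) gives a concrete recipe: take the clean-sample estimates produced at timesteps i+1 and i and extrapolate them to refine the current estimate before the usual mean computation, with no fine-tuning of the network. The snippet below sketches how such a connection could be hooked into a generic backward loop; `predict_x0`, `ddim_step`, and the fixed extrapolation strength `gamma` are placeholders, not the paper's exact parameterization.

```python
import torch

def lookahead_x0(x0_i, x0_ip1, gamma=0.1):
    """Refine the current clean-sample estimate by extrapolating the two most
    recent estimates (timesteps i and i+1). `gamma` is an assumed strength."""
    return x0_i + gamma * (x0_i - x0_ip1)

@torch.no_grad()
def backward_with_lookahead(z_T, timesteps, predict_x0, ddim_step, gamma=0.1):
    """Sketch of plugging the extrapolation into a generic backward pass.
    `predict_x0(z, i)` and `ddim_step(z, x0, i)` stand in for the user's own
    denoiser and sampler update; they are not a specific library API."""
    z, x0_prev = z_T, None
    for i in reversed(timesteps):
        x0_raw = predict_x0(z, i)
        # Extra connection over two consecutive timesteps (skipped at the first step).
        x0 = x0_raw if x0_prev is None else lookahead_x0(x0_raw, x0_prev, gamma)
        z, x0_prev = ddim_step(z, x0, i), x0_raw
    return z
```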
+ + + + NIPQ: Noise Proxy-Based Integrated Pseudo-Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Shin_NIPQ_Noise_Proxy-Based_Integrated_Pseudo-Quantization_CVPR_2023_paper.pdf + Straight-through estimator (STE), which enables the gradient flow over the non-differentiable function via approximation, has been favored in studies related to quantization-aware training (QAT). However, STE incurs unstable convergence during QAT, resulting in notable quality degradation in low-precision representation. Recently, pseudo-quantization training has been proposed as an alternative approach to updating the learnable parameters using the pseudo-quantization noise instead of STE. In this study, we propose a novel noise proxy-based integrated pseudo-quantization (NIPQ) that enables unified support of pseudo-quantization for both activation and weight with minimal error by integrating the idea of truncation on the pseudo-quantization framework. NIPQ updates all of the quantization parameters (e.g., bit-width and truncation boundary) as well as the network parameters via gradient descent without STE instability, resulting in greatly-simplified but reliable precision allocation without human intervention. Our extensive experiments show that NIPQ outperforms existing quantization algorithms in various vision and language applications by a large margin. + + + + Object-Goal Visual Navigation via Effective Exploration of Relations Among Historical Navigation States + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Object-Goal_Visual_Navigation_via_Effective_Exploration_of_Relations_Among_Historical_CVPR_2023_paper.pdf + Object-goal visual navigation aims at steering an agent toward an object via a series of moving steps. Previous works mainly focus on learning informative visual representations for navigation, but overlook the impacts of navigation states on the effectiveness and efficiency of navigation. We observe that high relevance among navigation states will cause navigation inefficiency or failure for existing methods. In this paper, we present a History-inspired Navigation Policy Learning (HiNL) framework to estimate navigation states effectively by exploring relationships among historical navigation states. In HiNL, we propose a History-aware State Estimation (HaSE) module to alleviate the impacts of dominant historical states on the current state estimation. Meanwhile, HaSE also encourages an agent to be alert to the current observation changes, thus enabling the agent to make valid actions. Furthermore, we design a History-based State Regularization (HbSR) to explicitly suppress the correlation among navigation states in training. As a result, our agent can update states more effectively while reducing the correlations among navigation states. Experiments on the artificial platform AI2-THOR (i.e.,, iTHOR and RoboTHOR) demonstrate that HiNL significantly outperforms state-of-the-art methods on both Success Rate and SPL in unseen testing environments. + + + + Probabilistic Knowledge Distillation of Face Ensembles + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Probabilistic_Knowledge_Distillation_of_Face_Ensembles_CVPR_2023_paper.pdf + Mean ensemble (i.e. averaging predictions from multiple models) is a commonly-used technique in machine learning that improves the performance of each individual model. 
We formalize it as feature alignment for ensemble in open-set face recognition and generalize it into Bayesian Ensemble Averaging (BEA) through the lens of probabilistic modeling. This generalization brings up two practical benefits that existing methods could not provide: (1) the uncertainty of a face image can be evaluated and further decomposed into aleatoric uncertainty and epistemic uncertainty, the latter of which can be used as a measure for out-of-distribution detection of faceness; (2) a BEA statistic provably reflects the aleatoric uncertainty of a face image, acting as a measure for face image quality to improve recognition performance. To inherit the uncertainty estimation capability from BEA without the loss of inference efficiency, we propose BEA-KD, a student model to distill knowledge from BEA. BEA-KD mimics the overall behavior of ensemble members and consistently outperforms SOTA knowledge distillation methods on various challenging benchmarks. + + + + MeMaHand: Exploiting Mesh-Mano Interaction for Single Image Two-Hand Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_MeMaHand_Exploiting_Mesh-Mano_Interaction_for_Single_Image_Two-Hand_Reconstruction_CVPR_2023_paper.pdf + Existing methods proposed for hand reconstruction tasks usually parameterize a generic 3D hand model or predict hand mesh positions directly. The parametric representations consisting of hand shapes and rotational poses are more stable, while the non-parametric methods can predict more accurate mesh positions. In this paper, we propose to reconstruct meshes and estimate MANO parameters of two hands from a single RGB image simultaneously to utilize the merits of two kinds of hand representations. To fulfill this target, we propose novel Mesh-Mano interaction blocks (MMIBs), which take mesh vertices positions and MANO parameters as two kinds of query tokens. MMIB consists of one graph residual block to aggregate local information and two transformer encoders to model long-range dependencies. The transformer encoders are equipped with different asymmetric attention masks to model the intra-hand and inter-hand attention, respectively. Moreover, we introduce the mesh alignment refinement module to further enhance the mesh-image alignment. Extensive experiments on the InterHand2.6M benchmark demonstrate promising results over the state-of-the-art hand reconstruction methods. + + + + DSFNet: Dual Space Fusion Network for Occlusion-Robust 3D Dense Face Alignment + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_DSFNet_Dual_Space_Fusion_Network_for_Occlusion-Robust_3D_Dense_Face_CVPR_2023_paper.pdf + Sensitivity to severe occlusion and large view angles limits the usage scenarios of the existing monocular 3D dense face alignment methods. The state-of-the-art 3DMM-based method, directly regresses the model's coefficients, underutilizing the low-level 2D spatial and semantic information, which can actually offer cues for face shape and orientation. In this work, we demonstrate how modeling 3D facial geometry in image and model space jointly can solve the occlusion and view angle problems. Instead of predicting the whole face directly, we regress image space features in the visible facial region by dense prediction first. Subsequently, we predict our model's coefficients based on the regressed feature of the visible regions, leveraging the prior knowledge of whole face geometry from the morphable models to complete the invisible regions. 
We further propose a fusion network that combines the advantages of both the image and model space predictions to achieve high robustness and accuracy in unconstrained scenarios. Thanks to the proposed fusion module, our method is robust not only to occlusion and large pitch and roll view angles, which is the benefit of our image space approach, but also to noise and large yaw angles, which is the benefit of our model space method. Comprehensive evaluations demonstrate the superior performance of our method compared with the state-of-the-art methods. On the 3D dense face alignment task, we achieve 3.80% NME on the AFLW2000-3D dataset, which outperforms the state-of-the-art method by 5.5%. Code is available at https://github.com/lhyfst/DSFNet. + + + + MoStGAN-V: Video Generation With Temporal Motion Styles + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_MoStGAN-V_Video_Generation_With_Temporal_Motion_Styles_CVPR_2023_paper.pdf + Video generation remains a challenging task due to spatiotemporal complexity and the requirement of synthesizing diverse motions with temporal consistency. Previous works attempt to generate videos in arbitrary lengths either in an autoregressive manner or regarding time as a continuous signal. However, they struggle to synthesize detailed and diverse motions with temporal coherence and tend to generate repetitive scenes after a few time steps. In this work, we argue that a single time-agnostic latent vector of style-based generator is insufficient to model various and temporally-consistent motions. Hence, we introduce additional time-dependent motion styles to model diverse motion patterns. In addition, a Motion Style Attention modulation mechanism, dubbed as MoStAtt, is proposed to augment frames with vivid dynamics for each specific scale (i.e., layer), which assigns attention score for each motion style w.r.t deconvolution filter weights in the target synthesis layer and softly attends different motion styles for weight modulation. Experimental results show our model achieves state-of-the-art performance on four unconditional 256^2 video synthesis benchmarks trained with only 3 frames per clip and produces better qualitative results with respect to dynamic motions. Code and videos have been made available at https://github.com/xiaoqian-shen/MoStGAN-V. + + + + Poly-PC: A Polyhedral Network for Multiple Point Cloud Tasks at Once + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Poly-PC_A_Polyhedral_Network_for_Multiple_Point_Cloud_Tasks_at_CVPR_2023_paper.pdf + In this work, we show that it is feasible to perform multiple tasks concurrently on point cloud with a straightforward yet effective multi-task network. Our framework, Poly-PC, tackles the inherent obstacles (e.g., different model architectures caused by task bias and conflicting gradients caused by multiple dataset domains, etc.) of multi-task learning on point cloud. Specifically, we propose a residual set abstraction (Res-SA) layer for efficient and effective scaling in both width and depth of the network, hence accommodating the needs of various tasks. We develop a weight-entanglement-based one-shot NAS technique to find optimal architectures for all tasks. Moreover, such technique entangles the weights of multiple tasks in each layer to offer task-shared parameters for efficient storage deployment while providing ancillary task-specific parameters for learning task-related features. 
Finally, to facilitate the training of Poly-PC, we introduce a task-prioritization-based gradient balance algorithm that leverages task prioritization to reconcile conflicting gradients, ensuring high performance for all tasks. Benefiting from the suggested techniques, models optimized by Poly-PC collectively for all tasks keep fewer total FLOPs and parameters and outperform previous methods. We also demonstrate that Poly-PC allows incremental learning and evades catastrophic forgetting when tuned to a new task. + + + + HandsOff: Labeled Dataset Generation With No Additional Human Annotations + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_HandsOff_Labeled_Dataset_Generation_With_No_Additional_Human_Annotations_CVPR_2023_paper.pdf + Recent work leverages the expressive power of generative adversarial networks (GANs) to generate labeled synthetic datasets. These dataset generation methods often require new annotations of synthetic images, which forces practitioners to seek out annotators, curate a set of synthetic images, and ensure the quality of generated labels. We introduce the HandsOff framework, a technique capable of producing an unlimited number of synthetic images and corresponding labels after being trained on less than 50 pre-existing labeled images. Our framework avoids the practical drawbacks of prior work by unifying the field of GAN inversion with dataset generation. We generate datasets with rich pixel-wise labels in multiple challenging domains such as faces, cars, full-body human poses, and urban driving scenes. Our method achieves state-of-the-art performance in semantic segmentation, keypoint detection, and depth estimation compared to prior dataset generation approaches and transfer learning baselines. We additionally showcase its ability to address broad challenges in model development which stem from fixed, hand-annotated datasets, such as the long-tail problem in semantic segmentation. Project page: austinxu87.github.io/handsoff. + + + + Semi-Supervised 2D Human Pose Estimation Driven by Position Inconsistency Pseudo Label Correction Module + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Semi-Supervised_2D_Human_Pose_Estimation_Driven_by_Position_Inconsistency_Pseudo_CVPR_2023_paper.pdf + In this paper, we delve into semi-supervised 2D human pose estimation. The previous method ignored two problems: (i) When conducting interactive training between large model and lightweight model, the pseudo label of lightweight model will be used to guide large models. (ii) The negative impact of noise pseudo labels on training. Moreover, the labels used for 2D human pose estimation are relatively complex: keypoint category and keypoint position. To solve the problems mentioned above, we propose a semi-supervised 2D human pose estimation framework driven by a position inconsistency pseudo label correction module (SSPCM). We introduce an additional auxiliary teacher and use the pseudo labels generated by the two teacher model in different periods to calculate the inconsistency score and remove outliers. Then, the two teacher models are updated through interactive training, and the student model is updated using the pseudo labels generated by two teachers. To further improve the performance of the student model, we use the semi-supervised Cut-Occlude based on pseudo keypoint perception to generate more hard and effective samples. In addition, we also proposed a new indoor overhead fisheye human keypoint dataset WEPDTOF-Pose.
Extensive experiments demonstrate that our method outperforms the previous best semi-supervised 2D human pose estimation method. We will release the code and dataset at https://github.com/hlz0606/SSPCM. + + + + ARKitTrack: A New Diverse Dataset for Tracking Using Mobile RGB-D Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_ARKitTrack_A_New_Diverse_Dataset_for_Tracking_Using_Mobile_RGB-D_CVPR_2023_paper.pdf + Compared with traditional RGB-only visual tracking, few datasets have been constructed for RGB-D tracking. In this paper, we propose ARKitTrack, a new RGB-D tracking dataset for both static and dynamic scenes captured by consumer-grade LiDAR scanners equipped on Apple's iPhone and iPad. ARKitTrack contains 300 RGB-D sequences, 455 targets, and 229.7K video frames in total. Along with the bounding box annotations and frame-level attributes, we also annotate this dataset with 123.9K pixel-level target masks. Besides, the camera intrinsic and camera pose of each frame are provided for future developments. To demonstrate the potential usefulness of this dataset, we further present a unified baseline for both box-level and pixel-level tracking, which integrates RGB features with bird's-eye-view representations to better explore cross-modality 3D geometry. In-depth empirical analysis has verified that the ARKitTrack dataset can significantly facilitate RGB-D tracking and that the proposed baseline method compares favorably against the state of the arts. The source code and dataset will be released. + + + + Efficient Verification of Neural Networks Against LVM-Based Specifications + http://openaccess.thecvf.com//content/CVPR2023/papers/Hanspal_Efficient_Verification_of_Neural_Networks_Against_LVM-Based_Specifications_CVPR_2023_paper.pdf + The deployment of perception systems based on neural networks in safety critical applications requires assurance on their robustness. Deterministic guarantees on network robustness require formal verification. Standard approaches for verifying robustness analyse invariance to analytically defined transformations, but not the diverse and ubiquitous changes involving object pose, scene viewpoint, occlusions, etc. To this end, we present an efficient approach for verifying specifications definable using Latent Variable Models that capture such diverse changes. The approach involves adding an invertible encoding head to the network to be verified, enabling the verification of latent space sets with minimal reconstruction overhead. We report verification experiments for three classes of proposed latent space specifications, each capturing different types of realistic input variations. Differently from previous work in this area, the proposed approach is relatively independent of input dimensionality and scales to a broad class of deep networks and real-world datasets by mitigating the inefficiency and decoder expressivity dependence in the present state-of-the-art. + + + + Feature Aggregated Queries for Transformer-Based Video Object Detectors + http://openaccess.thecvf.com//content/CVPR2023/papers/Cui_Feature_Aggregated_Queries_for_Transformer-Based_Video_Object_Detectors_CVPR_2023_paper.pdf + Video object detection needs to solve feature degradation situations that rarely happen in the image domain. One solution is to use the temporal information and fuse the features from the neighboring frames. 
With Transformer-based object detectors getting a better performance on the image domain tasks, recent works began to extend those methods to video object detection. However, those existing Transformer-based video object detectors still follow the same pipeline as those used for classical object detectors, like enhancing the object feature representations by aggregation. In this work, we take a different perspective on video object detection. In detail, we improve the qualities of queries for the Transformer-based models by aggregation. To achieve this goal, we first propose a vanilla query aggregation module that weighted averages the queries according to the features of the neighboring frames. Then, we extend the vanilla module to a more practical version, which generates and aggregates queries according to the features of the input frames. Extensive experimental results validate the effectiveness of our proposed methods: On the challenging ImageNet VID benchmark, when integrated with our proposed modules, the current state-of-the-art Transformer-based object detectors can be improved by more than 2.4% on mAP and 4.2% on AP50. + + + + Decomposed Cross-Modal Distillation for RGB-Based Temporal Action Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Decomposed_Cross-Modal_Distillation_for_RGB-Based_Temporal_Action_Detection_CVPR_2023_paper.pdf + Temporal action detection aims to predict the time intervals and the classes of action instances in the video. Despite the promising performance, existing two-stream models exhibit slow inference speed due to their reliance on computationally expensive optical flow. In this paper, we introduce a decomposed cross-modal distillation framework to build a strong RGB-based detector by transferring knowledge of the motion modality. Specifically, instead of direct distillation, we propose to separately learn RGB and motion representations, which are in turn combined to perform action localization. The dual-branch design and the asymmetric training objectives enable effective motion knowledge transfer while preserving RGB information intact. In addition, we introduce a local attentive fusion to better exploit the multimodal complementarity. It is designed to preserve the local discriminability of the features that is important for action localization. Extensive experiments on the benchmarks verify the effectiveness of the proposed method in enhancing RGB-based action detectors. Notably, our framework is agnostic to backbones and detection heads, bringing consistent gains across different model combinations. + + + + A Unified Knowledge Distillation Framework for Deep Directed Graphical Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_A_Unified_Knowledge_Distillation_Framework_for_Deep_Directed_Graphical_Models_CVPR_2023_paper.pdf + Knowledge distillation (KD) is a technique that transfers the knowledge from a large teacher network to a small student network. It has been widely applied to many different tasks, such as model compression and federated learning. However, existing KD methods fail to generalize to general deep directed graphical models (DGMs) with arbitrary layers of random variables. We refer by deep DGMs to DGMs whose conditional distributions are parameterized by deep neural networks. In this work, we propose a novel unified knowledge distillation framework for deep DGMs on various applications. 
Specifically, we leverage the reparameterization trick to hide the intermediate latent variables, resulting in a compact DGM. Then we develop a surrogate distillation loss to reduce error accumulation through multiple layers of random variables. Moreover, we present the connections between our method and some existing knowledge distillation approaches. The proposed framework is evaluated on four applications: data-free hierarchical variational autoencoder (VAE) compression, data-free variational recurrent neural networks (VRNN) compression, data-free Helmholtz Machine (HM) compression, and VAE continual learning. The results show that our distillation method outperforms the baselines in data-free model compression tasks. We further demonstrate that our method significantly improves the performance of KD-based continual learning for data generation. Our source code is available at https://github.com/YizhuoChen99/KD4DGM-CVPR. + + + + Blemish-Aware and Progressive Face Retouching With Limited Paired Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Blemish-Aware_and_Progressive_Face_Retouching_With_Limited_Paired_Data_CVPR_2023_paper.pdf + Face retouching aims to remove facial blemishes, while at the same time maintaining the textual details of a given input image. The main challenge lies in distinguishing blemishes from the facial characteristics, such as moles. Training an image-to-image translation network with pixel-wise supervision suffers from the problem of expensive paired training data, since professional retouching needs specialized experience and is time-consuming. In this paper, we propose a Blemish-aware and Progressive Face Retouching model, which is referred to as BPFRe. Our framework can be partitioned into two manageable stages to perform progressive blemish removal. Specifically, an encoder-decoder-based module learns to coarsely remove the blemishes at the first stage, and the resulting intermediate features are injected into a generator to enrich local detail at the second stage. We find that explicitly suppressing the blemishes can contribute to an effective collaboration among the components. Toward this end, we incorporate an attention module, which learns to infer a blemish-aware map and further determine the corresponding weights, which are then used to refine the intermediate features transferred from the encoder to the decoder, and from the decoder to the generator. Therefore, BPFRe is able to deliver significant performance gains on a wide range of face retouching tasks. It is worth noting that we reduce the dependence of BPFRe on paired training samples by imposing effective regularization on unpaired ones. + + + + Detecting and Grounding Multi-Modal Media Manipulation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shao_Detecting_and_Grounding_Multi-Modal_Media_Manipulation_CVPR_2023_paper.pdf + Misinformation has become a pressing issue. Fake media, in both visual and textual forms, is widespread on the web. While various deepfake detection and text fake news detection methods have been proposed, they are only designed for single-modality forgery based on binary classification, let alone analyzing and reasoning subtle forgery traces across different modalities. In this paper, we highlight a new research problem for multi-modal fake media, namely Detecting and Grounding Multi-Modal Media Manipulation (DGM^4). 
DGM^4 aims to not only detect the authenticity of multi-modal media, but also ground the manipulated content (i.e., image bounding boxes and text tokens), which requires deeper reasoning of multi-modal media manipulation. To support a large-scale investigation, we construct the first DGM^4 dataset, where image-text pairs are manipulated by various approaches, with rich annotation of diverse manipulations. Moreover, we propose a novel HierArchical Multi-modal Manipulation rEasoning tRansformer (HAMMER) to fully capture the fine-grained interaction between different modalities. HAMMER performs 1) manipulation-aware contrastive learning between two uni-modal encoders as shallow manipulation reasoning, and 2) modality-aware cross-attention by multi-modal aggregator as deep manipulation reasoning. Dedicated manipulation detection and grounding heads are integrated from shallow to deep levels based on the interacted multi-modal information. Finally, we build an extensive benchmark and set up rigorous evaluation metrics for this new research problem. Comprehensive experiments demonstrate the superiority of our model; several valuable observations are also revealed to facilitate future research in multi-modal media manipulation. + + + + Human Pose As Compositional Tokens + http://openaccess.thecvf.com//content/CVPR2023/papers/Geng_Human_Pose_As_Compositional_Tokens_CVPR_2023_paper.pdf + Human pose is typically represented by a coordinate vector of body joints or their heatmap embeddings. While easy for data processing, unrealistic pose estimates are admitted due to the lack of dependency modeling between the body joints. In this paper, we present a structured representation, named Pose as Compositional Tokens (PCT), to explore the joint dependency. It represents a pose by M discrete tokens with each characterizing a sub-structure with several interdependent joints. The compositional design enables it to achieve a small reconstruction error at a low cost. Then we cast pose estimation as a classification task. In particular, we learn a classifier to predict the categories of the M tokens from an image. A pre-learned decoder network is used to recover the pose from the tokens without further post-processing. We show that it achieves better or comparable pose estimation results as the existing methods in general scenarios, yet continues to work well when occlusion occurs, which is ubiquitous in practice. The code and models are publicly available at https://github.com/Gengzigang/PCT. + + + + Synthesizing Photorealistic Virtual Humans Through Cross-Modal Disentanglement + http://openaccess.thecvf.com//content/CVPR2023/papers/Ravichandran_Synthesizing_Photorealistic_Virtual_Humans_Through_Cross-Modal_Disentanglement_CVPR_2023_paper.pdf + Over the last few decades, many aspects of human life have been enhanced with virtual domains, from the advent of digital assistants such as Amazon's Alexa and Apple's Siri to the latest metaverse efforts of the rebranded Meta. These trends underscore the importance of generating photorealistic visual depictions of humans. This has led to the rapid growth of so-called deepfake and talking-head generation methods in recent years. Despite their impressive results and popularity, they usually lack certain qualitative aspects such as texture quality, lips synchronization, or resolution, and practical aspects such as the ability to run in real-time. 
To allow for virtual human avatars to be used in practical scenarios, we propose an end-to-end framework for synthesizing high-quality virtual human faces capable of speaking with accurate lip motion with a special emphasis on performance. We introduce a novel network utilizing visemes as an intermediate audio representation and a novel data augmentation strategy employing a hierarchical image synthesis approach that allows disentanglement of the different modalities used to control the global head motion. Our method runs in real-time, and is able to deliver superior results compared to the current state-of-the-art. + + + + Test Time Adaptation With Regularized Loss for Weakly Supervised Salient Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Veksler_Test_Time_Adaptation_With_Regularized_Loss_for_Weakly_Supervised_Salient_CVPR_2023_paper.pdf + It is well known that CNNs tend to overfit to the training data. Test-time adaptation is an extreme approach to deal with overfitting: given a test image, the aim is to adapt the trained model to that image. Indeed nothing can be closer to the test data than the test image itself. The main difficulty of test-time adaptation is that the ground truth is not available. Thus test-time adaptation, while intriguing, applies to only a few scenarios where one can design an effective loss function that does not require ground truth. We propose the first approach for test-time Salient Object Detection (SOD) in the context of weak supervision. Our approach is based on a so called regularized loss function, which can be used for training CNN when pixel precise ground truth is unavailable. Regularized loss tends to have lower values for the more likely object segments, and thus it can be used to fine-tune an already trained CNN to a given test image, adapting to images unseen during training. We develop a regularized loss function particularly suitable for test-time adaptation and show that our approach significantly outperforms prior work for weakly supervised SOD. + + + + Self-Supervised Pre-Training With Masked Shape Prediction for 3D Scene Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Self-Supervised_Pre-Training_With_Masked_Shape_Prediction_for_3D_Scene_Understanding_CVPR_2023_paper.pdf + Masked signal modeling has greatly advanced self-supervised pre-training for language and 2D images. However, it is still not fully explored in 3D scene understanding. Thus, this paper introduces Masked Shape Prediction (MSP), a new framework to conduct masked signal modeling in 3D scenes. MSP uses the essential 3D semantic cue, i.e., geometric shape, as the prediction target for masked points. The context-enhanced shape target consisting of explicit shape context and implicit deep shape feature is proposed to facilitate exploiting contextual cues in shape prediction. Meanwhile, the pre-training architecture in MSP is carefully designed to alleviate the masked shape leakage from point coordinates. Experiments on multiple 3D understanding tasks on both indoor and outdoor datasets demonstrate the effectiveness of MSP in learning good feature representations to consistently boost downstream performance. 
+ + + + Guiding Pseudo-Labels With Uncertainty Estimation for Source-Free Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Litrico_Guiding_Pseudo-Labels_With_Uncertainty_Estimation_for_Source-Free_Unsupervised_Domain_Adaptation_CVPR_2023_paper.pdf + Standard Unsupervised Domain Adaptation (UDA) methods assume the availability of both source and target data during the adaptation. In this work, we investigate Source-free Unsupervised Domain Adaptation (SF-UDA), a specific case of UDA where a model is adapted to a target domain without access to source data. We propose a novel approach for the SF-UDA setting based on a loss reweighting strategy that brings robustness against the noise that inevitably affects the pseudo-labels. The classification loss is reweighted based on the reliability of the pseudo-labels that is measured by estimating their uncertainty. Guided by such reweighting strategy, the pseudo-labels are progressively refined by aggregating knowledge from neighbouring samples. Furthermore, a self-supervised contrastive framework is leveraged as a target space regulariser to enhance such knowledge aggregation. A novel negative pairs exclusion strategy is proposed to identify and exclude negative pairs made of samples sharing the same class, even in presence of some noise in the pseudo-labels. Our method outperforms previous methods on three major benchmarks by a large margin. We set the new SF-UDA state-of-the-art on VisDA-C and DomainNet with a performance gain of +1.8% on both benchmarks and on PACS with +12.3% in the single-source setting and +6.6% in multi-target adaptation. Additional analyses demonstrate that the proposed approach is robust to the noise, which results in significantly more accurate pseudo-labels compared to state-of-the-art approaches. + + + + HuManiFlow: Ancestor-Conditioned Normalising Flows on SO(3) Manifolds for Human Pose and Shape Distribution Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Sengupta_HuManiFlow_Ancestor-Conditioned_Normalising_Flows_on_SO3_Manifolds_for_Human_Pose_CVPR_2023_paper.pdf + Monocular 3D human pose and shape estimation is an ill-posed problem since multiple 3D solutions can explain a 2D image of a subject. Recent approaches predict a probability distribution over plausible 3D pose and shape parameters conditioned on the image. We show that these approaches exhibit a trade-off between three key properties: (i) accuracy - the likelihood of the ground-truth 3D solution under the predicted distribution, (ii) sample-input consistency - the extent to which 3D samples from the predicted distribution match the visible 2D image evidence, and (iii) sample diversity - the range of plausible 3D solutions modelled by the predicted distribution. Our method, HuManiFlow, predicts simultaneously accurate, consistent and diverse distributions. We use the human kinematic tree to factorise full body pose into ancestor-conditioned per-body-part pose distributions in an autoregressive manner. Per-body-part distributions are implemented using normalising flows that respect the manifold structure of SO(3), the Lie group of per-body-part poses. We show that ill-posed, but ubiquitous, 3D point estimate losses reduce sample diversity, and employ only probabilistic training losses. HuManiFlow outperforms state-of-the-art probabilistic approaches on the 3DPW and SSP-3D datasets. 
+ + + + EC2: Emergent Communication for Embodied Control + http://openaccess.thecvf.com//content/CVPR2023/papers/Mu_EC2_Emergent_Communication_for_Embodied_Control_CVPR_2023_paper.pdf + Embodied control requires agents to leverage multi-modal pre-training to quickly learn how to act in new environments, where video demonstrations contain visual and motion details needed for low-level perception and control, and language instructions support generalization with abstract, symbolic structures. While recent approaches apply contrastive learning to force alignment between the two modalities, we hypothesize better modeling their complementary differences can lead to more holistic representations for downstream adaption. To this end, we propose Emergent Communication for Embodied Control (EC^2), a novel scheme to pre-train video-language representations for few-shot embodied control. The key idea is to learn an unsupervised "language" of videos via emergent communication, which bridges the semantics of video details and structures of natural language. We learn embodied representations of video trajectories, emergent language, and natural language using a language model, which is then used to finetune a lightweight policy network for downstream control. Through extensive experiments in Metaworld and Franka Kitchen embodied benchmarks, EC^2 is shown to consistently outperform previous contrastive learning methods for both videos and texts as task inputs. Further ablations confirm the importance of the emergent language, which is beneficial for both video and language learning, and significantly superior to using pre-trained video captions. We also present a quantitative and qualitative analysis of the emergent language and discuss future directions toward better understanding and leveraging emergent communication in embodied tasks. + + + + DynamicDet: A Unified Dynamic Architecture for Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_DynamicDet_A_Unified_Dynamic_Architecture_for_Object_Detection_CVPR_2023_paper.pdf + Dynamic neural network is an emerging research topic in deep learning. With adaptive inference, dynamic models can achieve remarkable accuracy and computational efficiency. However, it is challenging to design a powerful dynamic detector, because of no suitable dynamic architecture and exiting criterion for object detection. To tackle these difficulties, we propose a dynamic framework for object detection, named DynamicDet. Firstly, we carefully design a dynamic architecture based on the nature of the object detection task. Then, we propose an adaptive router to analyze the multi-scale information and to decide the inference route automatically. We also present a novel optimization strategy with an exiting criterion based on the detection losses for our dynamic detectors. Last, we present a variable-speed inference strategy, which helps to realize a wide range of accuracy-speed trade-offs with only one dynamic detector. Extensive experiments conducted on the COCO benchmark demonstrate that the proposed DynamicDet achieves new state-of-the-art accuracy-speed trade-offs. For instance, with comparable accuracy, the inference speed of our dynamic detector Dy-YOLOv7-W6 surpasses YOLOv7-E6 by 12%, YOLOv7-D6 by 17%, and YOLOv7-E6E by 39%. The code is available at https://github.com/VDIGPKU/DynamicDet. 
+ + + + Common Pets in 3D: Dynamic New-View Synthesis of Real-Life Deformable Categories + http://openaccess.thecvf.com//content/CVPR2023/papers/Sinha_Common_Pets_in_3D_Dynamic_New-View_Synthesis_of_Real-Life_Deformable_CVPR_2023_paper.pdf + Obtaining photorealistic reconstructions of objects from sparse views is inherently ambiguous and can only be achieved by learning suitable reconstruction priors. Earlier works on sparse rigid object reconstruction successfully learned such priors from large datasets such as CO3D. In this paper, we extend this approach to dynamic objects. We use cats and dogs as a representative example and introduce Common Pets in 3D (CoP3D), a collection of crowd-sourced videos showing around 4,200 distinct pets. CoP3D is one of the first large-scale datasets for benchmarking non-rigid 3D reconstruction "in the wild". We also propose Tracker-NeRF, a method for learning 4D reconstruction from our dataset. At test time, given a small number of video frames of an unseen sequence, Tracker-NeRF predicts the trajectories and dynamics of the 3D points and generates new views, interpolating viewpoint and time. Results on CoP3D reveal significantly better non-rigid new-view synthesis performance than existing baselines. The data is available on the project webpage: https://cop3d.github.io/. + + + + Normal-Guided Garment UV Prediction for Human Re-Texturing + http://openaccess.thecvf.com//content/CVPR2023/papers/Jafarian_Normal-Guided_Garment_UV_Prediction_for_Human_Re-Texturing_CVPR_2023_paper.pdf + Clothes undergo complex geometric deformations, which lead to appearance changes. To edit human videos in a physically plausible way, a texture map must take into account not only the garment transformation induced by the body movements and clothes fitting, but also its 3D fine-grained surface geometry. This poses, however, a new challenge of 3D reconstruction of dynamic clothes from an image or a video. In this paper, we show that it is possible to edit dressed human images and videos without 3D reconstruction. We estimate a geometry aware texture map between the garment region in an image and the texture space, a.k.a, UV map. Our UV map is designed to preserve isometry with respect to the underlying 3D surface by making use of the 3D surface normals predicted from the image. Our approach captures the underlying geometry of the garment in a self-supervised way, requiring no ground truth annotation of UV maps and can be readily extended to predict temporally coherent UV maps. We demonstrate that our method outperforms the state-of-the-art human UV map estimation approaches on both real and synthetic data. + + + + Learning Compact Representations for LiDAR Completion and Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiong_Learning_Compact_Representations_for_LiDAR_Completion_and_Generation_CVPR_2023_paper.pdf + LiDAR provides accurate geometric measurements of the 3D world. Unfortunately, dense LiDARs are very expensive and the point clouds captured by low-beam LiDAR are often sparse. To address these issues, we present UltraLiDAR, a data-driven framework for scene-level LiDAR completion, LiDAR generation, and LiDAR manipulation. The crux of UltraLiDAR is a compact, discrete representation that encodes the point cloud's geometric structure, is robust to noise, and is easy to manipulate. 
We show that by aligning the representation of a sparse point cloud to that of a dense point cloud, we can densify the sparse point clouds as if they were captured by a real high-density LiDAR, drastically reducing the cost. Furthermore, by learning a prior over the discrete codebook, we can generate diverse, realistic LiDAR point clouds for self-driving. We evaluate the effectiveness of UltraLiDAR on sparse-to-dense LiDAR completion and LiDAR generation. Experiments show that densifying real-world point clouds with our approach can significantly improve the performance of downstream perception systems. Compared to prior art on LiDAR generation, our approach generates much more realistic point clouds. According to A/B test, over 98.5% of the time human participants prefer our results over those of previous methods. Please refer to project page https://waabi.ai/research/ultralidar/ for more information. + + + + Hubs and Hyperspheres: Reducing Hubness and Improving Transductive Few-Shot Learning With Hyperspherical Embeddings + http://openaccess.thecvf.com//content/CVPR2023/papers/Trosten_Hubs_and_Hyperspheres_Reducing_Hubness_and_Improving_Transductive_Few-Shot_Learning_CVPR_2023_paper.pdf + Distance-based classification is frequently used in transductive few-shot learning (FSL). However, due to the high-dimensionality of image representations, FSL classifiers are prone to suffer from the hubness problem, where a few points (hubs) occur frequently in multiple nearest neighbour lists of other points. Hubness negatively impacts distance-based classification when hubs from one class appear often among the nearest neighbors of points from another class, degrading the classifier's performance. To address the hubness problem in FSL, we first prove that hubness can be eliminated by distributing representations uniformly on the hypersphere. We then propose two new approaches to embed representations on the hypersphere, which we prove optimize a tradeoff between uniformity and local similarity preservation -- reducing hubness while retaining class structure. Our experiments show that the proposed methods reduce hubness, and significantly improves transductive FSL accuracy for a wide range of classifiers. + + + + Improving Graph Representation for Point Cloud Segmentation via Attentive Filtering + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Improving_Graph_Representation_for_Point_Cloud_Segmentation_via_Attentive_Filtering_CVPR_2023_paper.pdf + Recently, self-attention networks achieve impressive performance in point cloud segmentation due to their superiority in modeling long-range dependencies. However, compared to self-attention mechanism, we find graph convolutions show a stronger ability in capturing local geometry information with less computational cost. In this paper, we employ a hybrid architecture design to construct our Graph Convolution Network with Attentive Filtering (AF-GCN), which takes advantage of both graph convolution and self-attention mechanism. We adopt graph convolutions to aggregate local features in the shallow encoder stages, while in the deeper stages, we propose a self-attention-like module named Graph Attentive Filter (GAF) to better model long-range contexts from distant neighbors. Besides, to further improve graph representation for point cloud segmentation, we employ a Spatial Feature Projection (SFP) module for graph convolutions which helps to handle spatial variations of unstructured point clouds. 
Finally, a graph-shared down-sampling and up-sampling strategy is introduced to make full use of the graph structures in point cloud processing. We conduct extensive experiments on multiple datasets including S3DIS, ScanNetV2, Toronto-3D, and ShapeNetPart. Experimental results show our AF-GCN obtains competitive performance. + + + + PCT-Net: Full Resolution Image Harmonization Using Pixel-Wise Color Transformations + http://openaccess.thecvf.com//content/CVPR2023/papers/Guerreiro_PCT-Net_Full_Resolution_Image_Harmonization_Using_Pixel-Wise_Color_Transformations_CVPR_2023_paper.pdf + In this paper, we present PCT-Net, a simple and general image harmonization method that can be easily applied to images at full resolution. The key idea is to learn a parameter network that uses downsampled input images to predict the parameters for pixel-wise color transforms (PCTs) which are applied to each pixel in the full-resolution image. We show that affine color transforms are both efficient and effective, resulting in state-of-the-art harmonization results. Moreover, we explore both CNNs and Transformers as the parameter network and show that Transformers lead to better results. We evaluate the proposed method on the public full-resolution iHarmony4 dataset, which is comprised of four datasets, and show a reduction of the foreground MSE (fMSE) and MSE values by more than 20% and an increase of the PSNR value by 1.4dB while keeping the architecture light-weight. In a user study with 20 people, we show that the method achieves a higher B-T score than two other recent methods. + + + + Architecture, Dataset and Model-Scale Agnostic Data-Free Meta-Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Architecture_Dataset_and_Model-Scale_Agnostic_Data-Free_Meta-Learning_CVPR_2023_paper.pdf + The goal of data-free meta-learning is to learn useful prior knowledge from a collection of pre-trained models without accessing their training data. However, existing works only solve the problem in parameter space, which (i) ignore the fruitful data knowledge contained in the pre-trained models; (ii) can not scale to large-scale pre-trained models; (iii) can only meta-learn pre-trained models with the same network architecture. To address those issues, we propose a unified framework, dubbed PURER, which contains: (1) ePisode cUrriculum inveRsion (ECI) during data-free meta training; and (2) invErsion calibRation following inner loop (ICFIL) during meta testing. During meta training, we propose ECI to perform pseudo episode training for learning to adapt fast to new unseen tasks. Specifically, we progressively synthesize a sequence of pseudo episodes by distilling the training data from each pre-trained model. The ECI adaptively increases the difficulty level of pseudo episodes according to the real-time feedback of the meta model. We formulate the optimization process of meta training with ECI as an adversarial form in an end-to-end manner. During meta testing, we further propose a simple plug-and-play supplement--ICFIL--only used during meta testing to narrow the gap between meta training and meta testing task distribution. Extensive experiments in various real-world scenarios show the superior performance of ours. 
+ + + + Egocentric Video Task Translation + http://openaccess.thecvf.com//content/CVPR2023/papers/Xue_Egocentric_Video_Task_Translation_CVPR_2023_paper.pdf + Different video understanding tasks are typically treated in isolation, and even with distinct types of curated data (e.g., classifying sports in one dataset, tracking animals in another). However, in wearable cameras, the immersive egocentric perspective of a person engaging with the world around them presents an interconnected web of video understanding tasks---hand-object manipulations, navigation in the space, or human-human interactions---that unfold continuously, driven by the person's goals. We argue that this calls for a much more unified approach. We propose EgoTask Translation (EgoT2), which takes a collection of models optimized on separate tasks and learns to translate their outputs for improved performance on any or all of them at once. Unlike traditional transfer or multi-task learning, EgoT2's "flipped design" entails separate task-specific backbones and a task translator shared across all tasks, which captures synergies between even heterogeneous tasks and mitigates task competition. Demonstrating our model on a wide array of video tasks from Ego4D, we show its advantages over existing transfer paradigms and achieve top-ranked results on four of the Ego4D 2022 benchmark challenges. + + + + Gaussian Label Distribution Learning for Spherical Image Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Gaussian_Label_Distribution_Learning_for_Spherical_Image_Object_Detection_CVPR_2023_paper.pdf + Spherical image object detection emerges in many applications from virtual reality to robotics and autonomous driving, while many existing detectors use l_n-norm losses for regression of spherical bounding boxes. There are two intrinsic flaws in l_n-norm losses, i.e., independent optimization of parameters and inconsistency between the metric (dominated by IoU) and the loss. These problems are common in planar image detection but more significant in spherical image detection. Solutions for these problems have been extensively discussed in planar image detection by using IoU loss and related variants. However, these solutions cannot be migrated to spherical image object detection due to the non-differentiability of the Spherical IoU (SphIoU). In this paper, we design a simple but effective regression loss based on Gaussian Label Distribution Learning (GLDL) for spherical image object detection. Besides, we observe that the scale of the object in a spherical image varies greatly. The huge differences among objects from different categories make the sample selection strategy based on SphIoU challenging. Therefore, we propose GLDL-ATSS as a better training sample selection strategy for objects in spherical images, which alleviates the scale-sample imbalance drawback of IoU threshold-based strategies. Extensive results on two datasets with various baseline detectors show the effectiveness of our approach. + + + + Better "CMOS" Produces Clearer Images: Learning Space-Variant Blur Estimation for Blind Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Better_CMOS_Produces_Clearer_Images_Learning_Space-Variant_Blur_Estimation_for_CVPR_2023_paper.pdf + Most of the existing blind image Super-Resolution (SR) methods assume that the blur kernels are space-invariant.
However, the blur involved in real applications is usually space-variant due to object motion, defocus, etc., resulting in a severe performance drop for advanced SR methods. To address this problem, we first introduce two new datasets with out-of-focus blur, i.e., NYUv2-BSR and Cityscapes-BSR, to support further research on blind SR with space-variant blur. Based on the datasets, we design a novel Cross-MOdal fuSion network (CMOS) that estimates both blur and semantics simultaneously, which leads to improved SR results. It involves a feature Grouping Interactive Attention (GIA) module to make the two modalities interact more effectively and avoid inconsistency. GIA can also be used for the interaction of other features because of the universality of its structure. Qualitative and quantitative comparisons with state-of-the-art methods on the above datasets and real-world images demonstrate the superiority of our method, e.g., improving PSNR/SSIM by +1.91/+0.0048 on NYUv2-BSR over MANet. + + + + MixTeacher: Mining Promising Labels With Mixed Scale Teacher for Semi-Supervised Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_MixTeacher_Mining_Promising_Labels_With_Mixed_Scale_Teacher_for_Semi-Supervised_CVPR_2023_paper.pdf + Scale variation across object instances is one of the key challenges in object detection. Although modern detection models have achieved remarkable progress in dealing with the scale variation, it still brings trouble in the semi-supervised case. Most existing semi-supervised object detection methods rely on strict conditions to filter out high-quality pseudo labels from the network predictions. However, we observe that objects with extreme scale tend to have low confidence, which leaves positive supervision missing for these objects. In this paper, we delve into the scale variation problem, and propose a novel framework by introducing a mixed scale teacher to improve pseudo label generation and scale-invariant learning. In addition, benefiting from the better predictions from mixed scale features, we propose to mine pseudo labels with the score promotion of predictions across scales. Extensive experiments on MS COCO and PASCAL VOC benchmarks under various semi-supervised settings demonstrate that our method achieves new state-of-the-art performance. The code and models will be made publicly available. + + + + NeuMap: Neural Coordinate Mapping by Auto-Transdecoder for Camera Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_NeuMap_Neural_Coordinate_Mapping_by_Auto-Transdecoder_for_Camera_Localization_CVPR_2023_paper.pdf + This paper presents an end-to-end neural mapping method for camera localization, dubbed NeuMap, encoding a whole scene into a grid of latent codes, with which a Transformer-based auto-decoder regresses 3D coordinates of query pixels. State-of-the-art feature matching methods require each scene to be stored as a 3D point cloud with per-point features, consuming several gigabytes of storage per scene. While compression is possible, performance drops significantly at high compression rates. Conversely, coordinate regression methods achieve high compression by storing scene information in a neural network but suffer from reduced robustness. NeuMap combines the advantages of both approaches by utilizing 1) learnable latent codes for efficient scene representation and 2) a scene-agnostic Transformer-based auto-decoder to infer coordinates for query pixels.
This scene-agnostic network design learns robust matching priors from large-scale data and enables rapid optimization of codes for new scenes while keeping the network weights fixed. Extensive evaluations on five benchmarks show that NeuMap significantly outperforms other coordinate regression methods and achieves comparable performance to feature matching methods while requiring a much smaller scene representation size. For example, NeuMap achieves 39.1% accuracy in the Aachen night benchmark with only 6MB of data, whereas alternative methods require 100MB or several gigabytes and fail completely under high compression settings. The codes are available at https://github.com/Tangshitao/NeuMap. + + + + AShapeFormer: Semantics-Guided Object-Level Active Shape Encoding for 3D Object Detection via Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_AShapeFormer_Semantics-Guided_Object-Level_Active_Shape_Encoding_for_3D_Object_Detection_CVPR_2023_paper.pdf + 3D object detection techniques commonly follow a pipeline that aggregates predicted object central point features to compute candidate points. However, these candidate points contain only positional information, largely ignoring the object-level shape information. This eventually leads to sub-optimal 3D object detection. In this work, we propose AShapeFormer, a semantics-guided object-level shape encoding module for 3D object detection. This is a plug-n-play module that leverages multi-head attention to encode object shape information. We also propose shape tokens and object-scene positional encoding to ensure that the shape information is fully exploited. Moreover, we introduce a semantic guidance sub-module to sample more foreground points and suppress the influence of background points for a better object shape perception. We demonstrate a straightforward enhancement of multiple existing methods with our AShapeFormer. Through extensive experiments on the popular SUN RGB-D and ScanNetV2 dataset, we show that our enhanced models are able to outperform the baselines by a considerable absolute margin of up to 8.1%. Code will be available at https://github.com/ZechuanLi/AShapeFormer + + + + SeSDF: Self-Evolved Signed Distance Field for Implicit 3D Clothed Human Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_SeSDF_Self-Evolved_Signed_Distance_Field_for_Implicit_3D_Clothed_Human_CVPR_2023_paper.pdf + We address the problem of clothed human reconstruction from a single image or uncalibrated multi-view images. Existing methods struggle with reconstructing detailed geometry of a clothed human and often require a calibrated setting for multi-view reconstruction. We propose a flexible framework which, by leveraging the parametric SMPL-X model, can take an arbitrary number of input images to reconstruct a clothed human model under an uncalibrated setting. At the core of our framework is our novel self-evolved signed distance field (SeSDF) module which allows the framework to learn to deform the signed distance field (SDF) derived from the fitted SMPL-X model, such that detailed geometry reflecting the actual clothed human can be encoded for better reconstruction. Besides, we propose a simple method for self-calibration of multi-view images via the fitted SMPL-X parameters. This lifts the requirement of tedious manual calibration and largely increases the flexibility of our method. 
Further, we introduce an effective occlusion-aware feature fusion strategy to exploit the most useful features for reconstructing the human model. We thoroughly evaluate our framework on public benchmarks, demonstrating significant superiority over state-of-the-art methods both qualitatively and quantitatively. + + + + Deep Depth Estimation From Thermal Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Shin_Deep_Depth_Estimation_From_Thermal_Image_CVPR_2023_paper.pdf + Robust and accurate geometric understanding under adverse weather conditions is one of the top priorities for achieving high-level autonomy in self-driving cars. However, autonomous driving algorithms relying on the visible spectrum band are easily impacted by weather and lighting conditions. A long-wave infrared camera, also known as a thermal imaging camera, is a potential solution for achieving high-level robustness. However, a well-established large-scale dataset and public benchmark results are still missing. To this end, in this paper, we first build a large-scale Multi-Spectral Stereo (MS^2) dataset, including stereo RGB, stereo NIR, stereo thermal, and stereo LiDAR data along with GNSS/IMU information. The collected dataset provides about 195K synchronized data pairs taken from city, residential, road, campus, and suburban areas in the morning, daytime, and nighttime under clear-sky, cloudy, and rainy conditions. Secondly, we conduct an exhaustive validation process of monocular and stereo depth estimation algorithms designed for visible spectrum bands to benchmark their performance in the thermal image domain. Lastly, we propose a unified depth network that effectively bridges monocular depth and stereo depth tasks from a conditional random field approach perspective. Our dataset and source code are available at https://github.com/UkcheolShin/MS2-MultiSpectralStereoDataset. + + + + Cross-GAN Auditing: Unsupervised Identification of Attribute Level Similarities and Differences Between Pretrained Generative Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Olson_Cross-GAN_Auditing_Unsupervised_Identification_of_Attribute_Level_Similarities_and_Differences_CVPR_2023_paper.pdf + Generative Adversarial Networks (GANs) are notoriously difficult to train, especially for complex distributions and with limited data. This has driven the need for interpretable tools to audit trained networks, for example, to identify biases or ensure fairness. Existing GAN audit tools are restricted to coarse-grained, model-data comparisons based on summary statistics such as FID or recall. In this paper, we propose an alternative approach that compares a newly developed GAN against a prior baseline. To this end, we introduce Cross-GAN Auditing (xGA) that, given an established "reference" GAN and a newly proposed "client" GAN, jointly identifies semantic attributes that are either common across both GANs, novel to the client GAN, or missing from the client GAN. This provides both users and model developers with an intuitive assessment of similarity and differences between GANs. We introduce novel metrics to evaluate attribute-based GAN auditing approaches and use these metrics to demonstrate quantitatively that xGA outperforms baseline approaches. We also include qualitative results that illustrate the common, novel and missing attributes identified by xGA from GANs trained on a variety of image datasets.
+ + + + Backdoor Defense via Adaptively Splitting Poisoned Dataset + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_Backdoor_Defense_via_Adaptively_Splitting_Poisoned_Dataset_CVPR_2023_paper.pdf + Backdoor defenses have been studied to alleviate the threat of deep neural networks (DNNs) being backdoor-attacked and thus maliciously altered. Since DNNs usually adopt some external training data from an untrusted third party, a robust backdoor defense strategy during the training stage is of great importance. We argue that the core of training-time defense is to select poisoned samples and to handle them properly. In this work, we summarize training-time defenses within a unified framework as splitting the poisoned dataset into two data pools. Under our framework, we propose an adaptively splitting dataset-based defense (ASD). Concretely, we apply loss-guided split and meta-learning-inspired split to dynamically update two data pools. With the split clean data pool and polluted data pool, ASD successfully defends against backdoor attacks during training. Extensive experiments on multiple benchmark datasets and DNN models against six state-of-the-art backdoor attacks demonstrate the superiority of our ASD. + + + + Towards Stable Human Pose Estimation via Cross-View Fusion and Foot Stabilization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhuo_Towards_Stable_Human_Pose_Estimation_via_Cross-View_Fusion_and_Foot_CVPR_2023_paper.pdf + Towards stable human pose estimation from monocular images, there remain two main dilemmas. On the one hand, different perspectives, i.e., front view, side view, and top view, show inconsistent performance due to depth ambiguity. On the other hand, foot posture plays a significant role in complicated human pose estimation, e.g., dance and sports, and in foot-ground interaction, but unfortunately, it is omitted in most general approaches and datasets. In this paper, we first propose the Cross-View Fusion (CVF) module, built on a vision transformer encoder, to obtain a better 3D intermediate representation and alleviate the view inconsistency. Then an optimization-based method is introduced to reconstruct the foot pose and foot-ground contact for the general multi-view datasets including AIST++ and Human3.6M. Besides, a reversible kinematic topology strategy is devised to incorporate the contact information into the full-body-with-foot pose regressor. Extensive experiments on the popular benchmarks demonstrate that our method outperforms the state-of-the-art approaches by achieving 40.1mm PA-MPJPE on the 3DPW test set and 43.8mm on the AIST++ test set. + + + + SINE: SINgle Image Editing With Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_SINE_SINgle_Image_Editing_With_Text-to-Image_Diffusion_Models_CVPR_2023_paper.pdf + Recent works on diffusion models have demonstrated a strong capability for conditioning image generation, e.g., text-guided image synthesis. Such success inspires many efforts trying to use large-scale pre-trained diffusion models for tackling a challenging problem--real image editing. Works conducted in this area learn a unique textual token corresponding to several images containing the same object. However, under many circumstances, only one image is available, such as the painting of the Girl with a Pearl Earring. Using existing works on fine-tuning the pre-trained diffusion models with a single image causes severe overfitting issues.
Due to information leakage from the pre-trained diffusion models, the edited image cannot keep the same content as the given image while creating new features depicted by the language guidance. This work aims to address the problem of single-image editing. We propose a novel model-based guidance built upon the classifier-free guidance so that the knowledge from the model trained on a single image can be distilled into the pre-trained diffusion model, enabling content creation even with one given image. Additionally, we propose a patch-based fine-tuning strategy that can effectively help the model generate images of arbitrary resolution. We provide extensive experiments to validate the design choices of our approach and show promising editing capabilities, including changing style, content addition, and object manipulation. + + + + OSAN: A One-Stage Alignment Network To Unify Multimodal Alignment and Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_OSAN_A_One-Stage_Alignment_Network_To_Unify_Multimodal_Alignment_and_CVPR_2023_paper.pdf + Extending from unimodal to multimodal settings is a critical challenge for unsupervised domain adaptation (UDA). Two major problems emerge in unsupervised multimodal domain adaptation: domain adaptation and modality alignment. An intuitive way to handle these two problems is to fulfill these tasks in two separate stages: aligning modalities followed by domain adaptation, or vice versa. However, domains and modalities are not associated in most existing two-stage studies, and the relationship between them, which can provide complementary information to each other, is not leveraged. In this paper, we unify these two stages into one to align domains and modalities simultaneously. In our model, a tensor-based alignment module (TAL) is presented to explore the relationship between domains and modalities. By this means, domains and modalities can interact sufficiently and are guided to utilize complementary information for better results. Furthermore, to establish a bridge between domains, a dynamic domain generator (DDG) module is proposed to build transitional samples by mixing the shared information of two domains in a self-supervised manner, which helps our model learn a domain-invariant common representation space. Extensive experiments prove that our method can achieve superior performance in two real-world applications. The code will be publicly available. + + + + Heat Diffusion Based Multi-Scale and Geometric Structure-Aware Transformer for Mesh Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wong_Heat_Diffusion_Based_Multi-Scale_and_Geometric_Structure-Aware_Transformer_for_Mesh_CVPR_2023_paper.pdf + Triangle mesh segmentation is an important task in 3D shape analysis, especially in applications such as digital humans and AR/VR. The Transformer model is inherently permutation-invariant to its input, which makes it a suitable candidate model for 3D mesh processing. However, two main challenges involved in adapting Transformers from natural language to 3D meshes are yet to be solved: i) extracting the multi-scale information of mesh data in an adaptive manner; ii) capturing geometric structures of mesh data as the discriminative characteristics of the shape. Current point-based Transformer models fail to tackle such challenges and thus provide inferior performance for discretized surface segmentation. In this work, a heat diffusion based method is exploited to tackle these problems.
A novel Transformer model called MeshFormer is proposed, which i) integrates Heat Diffusion method into Multi-head Self-Attention operation (HDMSA) to adaptively capture the features from local neighborhood to global contexts; ii) applies a novel Heat Kernel Signature based Structure Encoding (HKSSE) to embed the intrinsic geometric structures of mesh instances into Transformer for structure-aware processing. Extensive experiments on triangle mesh segmentation validate the effectiveness of the proposed MeshFormer model and show significant improvements over current state-of-the-art methods. + + + + Multi-Granularity Archaeological Dating of Chinese Bronze Dings Based on a Knowledge-Guided Relation Graph + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Multi-Granularity_Archaeological_Dating_of_Chinese_Bronze_Dings_Based_on_a_CVPR_2023_paper.pdf + The archaeological dating of bronze dings has played a critical role in the study of ancient Chinese history. Current archaeology depends on trained experts to carry out bronze dating, which is time-consuming and labor-intensive. For such dating, in this study, we propose a learning-based approach to integrate advanced deep learning techniques and archaeological knowledge. To achieve this, we first collect a large-scale image dataset of bronze dings, which contains richer attribute information than other existing fine-grained datasets. Second, we introduce a multihead classifier and a knowledge-guided relation graph to mine the relationship between attributes and the ding era. Third, we conduct comparison experiments with various existing methods, the results of which show that our dating method achieves a state-of-the-art performance. We hope that our data and applied networks will enrich fine-grained classification research relevant to other interdisciplinary areas of expertise. The dataset and source code used are included in our supplementary materials, and will be open after submission owing to the anonymity policy. Source codes and data are available at: https://github.com/zhourixin/bronze-Ding. + + + + CASP-Net: Rethinking Video Saliency Prediction From an Audio-Visual Consistency Perceptual Perspective + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiong_CASP-Net_Rethinking_Video_Saliency_Prediction_From_an_Audio-Visual_Consistency_Perceptual_CVPR_2023_paper.pdf + Incorporating the audio stream enables Video Saliency Prediction (VSP) to imitate the selective attention mechanism of human brain. By focusing on the benefits of joint auditory and visual information, most VSP methods are capable of exploiting semantic correlation between vision and audio modalities but ignoring the negative effects due to the temporal inconsistency of audio-visual intrinsics. Inspired by the biological inconsistency-correction within multi-sensory information, in this study, a consistency-aware audio-visual saliency prediction network (CASP-Net) is proposed, which takes a comprehensive consideration of the audio-visual semantic interaction and consistent perception. In addition a two-stream encoder for elegant association between video frames and corresponding sound source, a novel consistency-aware predictive coding is also designed to improve the consistency within audio and visual representations iteratively. To further aggregate the multi-scale audio-visual information, a saliency decoder is introduced for the final saliency map generation. 
Extensive experiments demonstrate that the proposed CASP-Net outperforms the other state-of-the-art methods on six challenging audio-visual eye-tracking datasets. For a demo of our system, please see https://woshihaozhu.github.io/CASP-Net/. + + + + Learning Expressive Prompting With Residuals for Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Das_Learning_Expressive_Prompting_With_Residuals_for_Vision_Transformers_CVPR_2023_paper.pdf + Prompt learning is an efficient approach to adapt transformers by inserting a learnable set of parameters into the input and intermediate representations of a pre-trained model. In this work, we present Expressive Prompts with Residuals (EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViT). Our method constructs downstream representations via learnable "output" tokens that are akin to the learned class tokens of the ViT. Further, for better steering of the downstream representation processed by the frozen transformer, we introduce residual learnable tokens that are added to the output of various computations. We apply EXPRES for image classification, few-shot learning, and semantic segmentation, and show our method is capable of achieving state-of-the-art prompt tuning on 3/3 categories of the VTAB benchmark. In addition to strong performance, we observe that our approach is an order of magnitude more prompt-efficient than existing visual prompting baselines. We analytically show the computational benefits of our approach over weight-space adaptation techniques such as finetuning. Lastly, we systematically corroborate the architectural design of our method via a series of ablation experiments. + + + + AnyFlow: Arbitrary Scale Optical Flow With Implicit Neural Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jung_AnyFlow_Arbitrary_Scale_Optical_Flow_With_Implicit_Neural_Representation_CVPR_2023_paper.pdf + To apply optical flow in practice, it is often necessary to resize the input to smaller dimensions in order to reduce computational costs. However, downsizing inputs makes the estimation more challenging because objects and motion ranges become smaller. Even though recent approaches have demonstrated high-quality flow estimation, they tend to fail to accurately model small objects and precise boundaries when the input resolution is lowered, restricting their applicability to high-resolution inputs. In this paper, we introduce AnyFlow, a robust network that estimates accurate flow from images of various resolutions. By representing optical flow as a continuous coordinate-based representation, AnyFlow generates outputs at arbitrary scales from low-resolution inputs, demonstrating superior performance over prior works in capturing tiny objects with detail preservation on a wide range of scenes. We establish new state-of-the-art performance in cross-dataset generalization on the KITTI dataset, while achieving accuracy comparable to other SOTA methods on the online benchmarks. + + + + Federated Domain Generalization With Generalization Adjustment + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Federated_Domain_Generalization_With_Generalization_Adjustment_CVPR_2023_paper.pdf + Federated Domain Generalization (FedDG) attempts to learn, in a privacy-preserving manner, a global model that generalizes well to new clients possibly with domain shift.
Recent exploration mainly focuses on designing an unbiased training strategy within each individual domain. However, without multi-domain data jointly available in mini-batch training, almost no method can guarantee generalization under domain shift. To overcome this problem, we propose a novel global objective incorporating a new variance reduction regularizer to encourage fairness. A novel FL-friendly method named Generalization Adjustment (GA) is proposed to optimize the above objective by dynamically calibrating the aggregation weights. The theoretical analysis of GA demonstrates the possibility of achieving a tighter generalization bound with an explicit re-weighted aggregation, substituting the implicit multi-domain data sharing that is only applicable to the conventional DG settings. Besides, the proposed algorithm is generic and can be combined with any local client training-based method. Extensive experiments on several benchmark datasets have shown the effectiveness of the proposed method, with consistent improvements over several FedDG algorithms when used in combination. The source code is released at https://github.com/MediaBrain-SJTU/FedDG-GA. + + + + CoMFormer: Continual Learning in Semantic and Panoptic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Cermelli_CoMFormer_Continual_Learning_in_Semantic_and_Panoptic_Segmentation_CVPR_2023_paper.pdf + Continual learning for segmentation has recently seen increasing interest. However, all previous works focus on the narrower task of semantic segmentation and disregard panoptic segmentation, an important task with real-world impacts. In this paper, we present the first continual learning model capable of operating on both semantic and panoptic segmentation. Inspired by recent transformer approaches that consider segmentation as a mask-classification problem, we design CoMFormer. Our method carefully exploits the properties of transformer architectures to learn new classes over time. Specifically, we propose a novel adaptive distillation loss along with a mask-based pseudo-labeling technique to effectively prevent forgetting. To evaluate our approach, we introduce a novel continual panoptic segmentation benchmark on the challenging ADE20K dataset. Our CoMFormer outperforms all the existing baselines by both forgetting fewer old classes and learning new classes more effectively. In addition, we report an extensive evaluation in the large-scale continual semantic segmentation scenario, showing that CoMFormer also significantly outperforms state-of-the-art methods. + + + + Conditional Generation of Audio From Video via Foley Analogies + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Conditional_Generation_of_Audio_From_Video_via_Foley_Analogies_CVPR_2023_paper.pdf + The sound effects that designers add to videos are designed to convey a particular artistic effect and, thus, may be quite different from a scene's true sound. Inspired by the challenges of creating a soundtrack for a video that differs from its true sound, but that nonetheless matches the actions occurring on screen, we propose the problem of conditional Foley. We present the following contributions to address this problem. First, we propose a pretext task for training our model to predict sound for an input video clip using a conditional audio-visual clip sampled from another time within the same source video.
Second, we propose a model for generating a soundtrack for a silent input video, given a user-supplied example that specifies what the video should "sound like". We show through human studies and automated evaluation metrics that our model successfully generates sound from video, while varying its output according to the content of a supplied example. + + + + Diverse 3D Hand Gesture Prediction From Body Dynamics by Bilateral Hand Disentanglement + http://openaccess.thecvf.com//content/CVPR2023/papers/Qi_Diverse_3D_Hand_Gesture_Prediction_From_Body_Dynamics_by_Bilateral_CVPR_2023_paper.pdf + Predicting natural and diverse 3D hand gestures from the upper body dynamics is a practical yet challenging task in virtual avatar creation. Previous works usually overlook the asymmetric motions between two hands and generate two hands in a holistic manner, leading to unnatural results. In this work, we introduce a novel bilateral hand disentanglement based two-stage 3D hand generation method to achieve natural and diverse 3D hand prediction from body dynamics. In the first stage, we intend to generate natural hand gestures by two hand-disentanglement branches. Considering the asymmetric gestures and motions of two hands, we introduce a Spatial-Residual Memory (SRM) module to model spatial interaction between the body and each hand by residual learning. To enhance the coordination of two hand motions wrt. body dynamics holistically, we then present a Temporal-Motion Memory (TMM) module. TMM can effectively model the temporal association between body dynamics and two hand motions. The second stage is built upon the insight that 3D hand predictions should be non-deterministic given the sequential body postures. Thus, we further diversify our 3D hand predictions based on the initial output from the stage one. Concretely, we propose a Prototypical-Memory Sampling Strategy (PSS) to generate the non-deterministic hand gestures by gradient-based Markov Chain Monte Carlo (MCMC) sampling. Extensive experiments demonstrate that our method outperforms the state-of-the-art models on the B2H dataset and our newly collected TED Hands dataset. The dataset and code are available at: https://github.com/XingqunQi-lab/Diverse-3D-Hand-Gesture-Prediction. + + + + Learning Video Representations From Large Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Learning_Video_Representations_From_Large_Language_Models_CVPR_2023_paper.pdf + We introduce LAVILA, a new approach to learning video-language representations by leveraging Large Language Models (LLMs). We repurpose pre-trained LLMs to be conditioned on visual input, and finetune them to create automatic video narrators. Our auto-generated narrations offer a number of advantages, including dense coverage of long videos, better temporal synchronization of the visual information and text, and much higher diversity of text. The video-language embedding learned contrastively with these narrations outperforms the previous state-of-the-art on multiple first-person and third-person video tasks, both in zero-shot and finetuned setups. Most notably, LAVILA obtains an absolute gain of 10.1% on EGTEA classification and 5.9% Epic-Kitchens-100 multi-instance retrieval benchmarks. Furthermore, LAVILA trained with only half the narrations from the Ego4D dataset outperforms models trained on the full set, and shows positive scaling behavior on increasing pre-training data and model size. 
+ + + + Open-Vocabulary Semantic Segmentation With Mask-Adapted CLIP + http://openaccess.thecvf.com//content/CVPR2023/papers/Liang_Open-Vocabulary_Semantic_Segmentation_With_Mask-Adapted_CLIP_CVPR_2023_paper.pdf + Open-vocabulary semantic segmentation aims to segment an image into semantic regions according to text descriptions, which may not have been seen during training. Recent two-stage methods first generate class-agnostic mask proposals and then leverage pre-trained vision-language models, e.g., CLIP, to classify masked regions. We identify the performance bottleneck of this paradigm to be the pre-trained CLIP model, since it does not perform well on masked images. To address this, we propose to finetune CLIP on a collection of masked image regions and their corresponding text descriptions. We collect training data by mining an existing image-caption dataset (e.g., COCO Captions), using CLIP to match masked image regions to nouns in the image captions. Compared with the more precise and manually annotated segmentation labels with fixed classes (e.g., COCO-Stuff), we find our noisy but diverse dataset can better retain CLIP's generalization ability. Along with finetuning the entire model, we utilize the "blank" areas in masked images using a method we dub mask prompt tuning. Experiments demonstrate mask prompt tuning brings significant improvement without modifying any weights of CLIP, and it can further improve a fully finetuned model. In particular, when trained on COCO and evaluated on ADE20K-150, our best model achieves 29.6% mIoU, which is +8.5% higher than the previous state-of-the-art. For the first time, open-vocabulary generalist models match the performance of 2017 supervised specialist models without dataset-specific adaptations. + + + + A Loopback Network for Explainable Microvascular Invasion Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_A_Loopback_Network_for_Explainable_Microvascular_Invasion_Classification_CVPR_2023_paper.pdf + Microvascular invasion (MVI) is a critical factor for prognosis evaluation and cancer treatment. The current diagnosis of MVI relies on pathologists to manually identify cancerous cells from hundreds of blood vessels, which is time-consuming, tedious, and subjective. Recently, deep learning has achieved promising results in medical image analysis tasks. However, the unexplainability of black box models and the requirement of massive annotated samples limit the clinical application of deep learning based diagnostic methods. In this paper, aiming to develop an accurate, objective, and explainable diagnosis tool for MVI, we propose a Loopback Network (LoopNet) for classifying MVI efficiently. With the image-level category annotations of the collected Pathologic Vessel Image Dataset (PVID), LoopNet is devised to be composed of a binary classification branch and a cell locating branch. The latter is devised to locate the area of cancerous cells, regular non-cancerous cells, and background. For healthy samples, the pseudo masks of cells supervise the cell locating branch to distinguish the area of regular non-cancerous cells and background. For each MVI sample, the cell locating branch predicts the mask of cancerous cells. Then the masked cancerous and non-cancerous areas of the same sample are fed back to the binary classification branch separately. The loopback between two branches enables the category label to supervise the cell locating branch to learn the locating ability for cancerous areas.
Experimental results show that the proposed LoopNet achieves 97.5% accuracy on MVI classification. Surprisingly, the proposed loopback mechanism not only enables LoopNet to predict the cancerous area but also facilitates the classification backbone to achieve better classification performance. + + + + Exact-NeRF: An Exploration of a Precise Volumetric Parameterization for Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Isaac-Medina_Exact-NeRF_An_Exploration_of_a_Precise_Volumetric_Parameterization_for_Neural_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRF) have attracted significant attention due to their ability to synthesize novel scene views with great accuracy. However, inherent to their underlying formulation, the sampling of points along a ray with zero width may result in ambiguous representations that lead to further rendering artifacts such as aliasing in the final scene. To address this issue, the recent variant mip-NeRF proposes an Integrated Positional Encoding (IPE) based on a conical view frustum. Although this is expressed with an integral formulation, mip-NeRF instead approximates this integral as the expected value of a multivariate Gaussian distribution. This approximation is reliable for short frustums but degrades with highly elongated regions, which arises when dealing with distant scene objects under a larger depth of field. In this paper, we explore the use of an exact approach for calculating the IPE by using a pyramid-based integral formulation instead of an approximated conical-based one. We denote this formulation as Exact-NeRF and contribute the first approach to offer a precise analytical solution to the IPE within the NeRF domain. Our exploratory work illustrates that such an exact formulation (Exact-NeRF) matches the accuracy of mip-NeRF and furthermore provides a natural extension to more challenging scenarios without further modification, such as in the case of unbounded scenes. Our contribution aims to both address the hitherto unexplored issues of frustum approximation in earlier NeRF work and provide insight into the potential consideration of analytical solutions in future NeRF extensions. + + + + WildLight: In-the-Wild Inverse Rendering With a Flashlight + http://openaccess.thecvf.com//content/CVPR2023/papers/Cheng_WildLight_In-the-Wild_Inverse_Rendering_With_a_Flashlight_CVPR_2023_paper.pdf + This paper proposes a practical photometric solution for the challenging problem of in-the-wild inverse rendering under unknown ambient lighting. Our system recovers scene geometry and reflectance using only multi-view images captured by a smartphone. The key idea is to exploit the smartphone's built-in flashlight as a minimally controlled light source, and decompose image intensities into two photometric components -- a static appearance corresponding to ambient flux, plus a dynamic reflection induced by the moving flashlight. Our method does not require flash/non-flash images to be captured in pairs. Building on the success of neural light fields, we use an off-the-shelf method to capture the ambient reflections, while the flashlight component enables physically accurate photometric constraints to decouple reflectance and illumination. Compared to existing inverse rendering methods, our setup is applicable to non-darkroom environments yet sidesteps the inherent difficulties of explicitly solving ambient reflections.
We demonstrate by extensive experiments that our method is easy to implement, casual to set up, and consistently outperforms existing in-the-wild inverse rendering techniques. Finally, our neural reconstruction can be easily exported to a PBR textured triangle mesh ready for industrial renderers. Our source code and data are released at https://github.com/za-cheng/WildLight + + + + A Probabilistic Attention Model With Occlusion-Aware Texture Regression for 3D Hand Reconstruction From a Single RGB Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_A_Probabilistic_Attention_Model_With_Occlusion-Aware_Texture_Regression_for_3D_CVPR_2023_paper.pdf + Recently, deep learning based approaches have shown promising results in 3D hand reconstruction from a single RGB image. These approaches can be roughly divided into model-based approaches, which are heavily dependent on the model's parameter space, and model-free approaches, which require large numbers of 3D ground truths to reduce depth ambiguity and struggle in weakly-supervised scenarios. To overcome these issues, we propose a novel probabilistic model to achieve the robustness of model-based approaches and the reduced dependence on the model's parameter space of model-free approaches. The proposed probabilistic model incorporates a model-based network as a prior-net to estimate the prior probability distribution of joints and vertices. An Attention-based Mesh Vertices Uncertainty Regression (AMVUR) model is proposed to capture dependencies among vertices and the correlation between joints and mesh vertices to improve their feature representation. We further propose a learning based occlusion-aware Hand Texture Regression model to achieve high-fidelity texture reconstruction. We demonstrate the flexibility of the proposed probabilistic model to be trained in both supervised and weakly-supervised scenarios. The experimental results demonstrate our probabilistic model's state-of-the-art accuracy in 3D hand and texture reconstruction from a single image in both training schemes, including in the presence of severe occlusions. + + + + Attribute-Preserving Face Dataset Anonymization via Latent Code Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Barattin_Attribute-Preserving_Face_Dataset_Anonymization_via_Latent_Code_Optimization_CVPR_2023_paper.pdf + This work addresses the problem of anonymizing the identity of faces in a dataset of images, such that the privacy of those depicted is not violated, while at the same time the dataset remains useful for downstream tasks such as training machine learning models. To the best of our knowledge, we are the first to explicitly address this issue and deal with two major drawbacks of the existing state-of-the-art approaches, namely that they (i) require the costly training of additional, purpose-trained neural networks, and/or (ii) fail to retain the facial attributes of the original images in the anonymized counterparts, the preservation of which is of paramount importance for their use in downstream tasks. We accordingly present a task-agnostic anonymization procedure that directly optimizes the images' latent representation in the latent space of a pre-trained GAN. By optimizing the latent codes directly, we ensure that the identity is a desired distance away from the original (with an identity obfuscation loss), whilst preserving the facial attributes (using a novel feature-matching loss in FaRL's deep feature space).
We demonstrate through a series of both qualitative and quantitative experiments that our method is capable of anonymizing the identity of the images whilst--crucially--better preserving the facial attributes. We make the code and the pre-trained models publicly available at: https://github.com/chi0tzp/FALCO. + + + + Ensemble-Based Blackbox Attacks on Dense Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Cai_Ensemble-Based_Blackbox_Attacks_on_Dense_Prediction_CVPR_2023_paper.pdf + We propose an approach for adversarial attacks on dense prediction models (such as object detectors and segmentation). It is well known that the attacks generated by a single surrogate model do not transfer to arbitrary (blackbox) victim models. Furthermore, targeted attacks are often more challenging than untargeted attacks. In this paper, we show that a carefully designed ensemble can create effective attacks for a number of victim models. In particular, we show that normalization of the weights for individual models plays a critical role in the success of the attacks. We then demonstrate that adjusting the weights of the ensemble according to the victim model can further improve the performance of the attacks. We performed a number of experiments for object detectors and segmentation to highlight the significance of our proposed methods. Our proposed ensemble-based method outperforms existing blackbox attack methods for object detection and segmentation. Finally, we show that our proposed method can also generate a single perturbation that can fool multiple blackbox detection and segmentation models simultaneously. + + + + Improving Fairness in Facial Albedo Estimation via Visual-Textual Cues + http://openaccess.thecvf.com//content/CVPR2023/papers/Ren_Improving_Fairness_in_Facial_Albedo_Estimation_via_Visual-Textual_Cues_CVPR_2023_paper.pdf + Recent 3D face reconstruction methods have made significant advances in geometry prediction, yet further cosmetic improvements are limited by lagging albedo estimation, since inferring albedo from appearance is an ill-posed problem. Although some existing methods consider prior knowledge from illumination to improve albedo estimation, they still produce a light-skin bias due to racially biased albedo models and limited light constraints. In this paper, we reconsider the relationship between albedo and face attributes and propose ID2Albedo to directly estimate albedo without constraining illumination. Our key insight is that intrinsic semantic attributes such as race, skin color, and age can constrain the albedo map. We first introduce visual-textual cues and design a semantic loss to supervise facial albedo estimation. Specifically, we pre-define text labels such as race, skin color, age, and wrinkles. Then, we employ the text-image model (CLIP) to compute the similarity between the text and the input image, and assign a pseudo-label to each facial image. We constrain generated albedos in the training phase to have the same attributes as the inputs. In addition, we train a high-quality, unbiased facial albedo generator and utilize the semantic loss to learn the mapping from illumination-robust identity features to the albedo latent codes. Finally, our ID2Albedo is trained in a self-supervised way and outperforms state-of-the-art albedo estimation methods in terms of accuracy and fidelity. It is worth mentioning that our approach has excellent generalizability and fairness, especially on in-the-wild data.
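The CLIP-based attribute pseudo-labeling step described above can be sketched as follows; this is only an illustration (the prompt list, CLIP variant, and image path are assumptions, not the paper's configuration):

    import torch
    import clip  # OpenAI CLIP package
    from PIL import Image

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, preprocess = clip.load("ViT-B/32", device=device)

    # Illustrative attribute prompts; the paper's actual label set may differ.
    skin_prompts = ["a face with pale skin", "a face with fair skin",
                    "a face with brown skin", "a face with dark skin"]

    def pseudo_label(image_path, prompts):
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        tokens = clip.tokenize(prompts).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(image)
            txt_feat = model.encode_text(tokens)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
            txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
            sims = (img_feat @ txt_feat.t()).squeeze(0)  # cosine similarity per prompt
        return prompts[sims.argmax().item()], sims.softmax(dim=0)

    # label, confidences = pseudo_label("face.jpg", skin_prompts)  # hypothetical image path

The highest-similarity prompt per attribute group then acts as the pseudo-label that the semantic loss compares against during albedo training.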
+ + + + SmartAssign: Learning a Smart Knowledge Assignment Strategy for Deraining and Desnowing + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_SmartAssign_Learning_a_Smart_Knowledge_Assignment_Strategy_for_Deraining_and_CVPR_2023_paper.pdf + Existing methods mainly handle single weather types. However, the connections of different weather conditions at the deep representation level are usually ignored. These connections, if used properly, can generate complementary representations for each other to make up for insufficient training data, yielding positive performance gains and better generalization. In this paper, we focus on the closely correlated rain and snow to explore their connections at the deep representation level. Because sub-optimal connections may cause negative effects, another issue is how to find an optimal connection strategy that simultaneously improves deraining and desnowing performance when rain and snow are handled in a multi-task learning manner. To build the desired connection, we propose a smart knowledge assignment strategy, called SmartAssign, to optimally assign the knowledge learned from both tasks to a specific one. In order to further enhance the accuracy of knowledge assignment, we propose a novel knowledge contrast mechanism, so that the knowledge assigned to different tasks preserves better uniqueness. Since inherited inductive biases usually limit the modelling ability of CNNs, we introduce a novel transformer block to constitute the backbone of our network, effectively combining long-range context dependency and local image details. Extensive experiments on seven benchmark datasets verify that the proposed SmartAssign explores effective connections between rain and snow, and noticeably improves the performance of both deraining and desnowing. The implementation code will be available at https://gitee.com/mindspore/models/tree/master/research/cv/SmartAssign. + + + + sRGB Real Noise Synthesizing With Neighboring Correlation-Aware Noise Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Fu_sRGB_Real_Noise_Synthesizing_With_Neighboring_Correlation-Aware_Noise_Model_CVPR_2023_paper.pdf + Modeling and synthesizing real noise in the standard RGB (sRGB) domain is challenging due to the complicated noise distribution. While most deep noise generators propose to synthesize sRGB real noise using an end-to-end trained model, the lack of explicit noise modeling degrades the quality of their synthesized noise. In this work, we propose to model the real noise as not only dependent on the underlying clean image pixel intensity, but also highly correlated to its neighboring noise realization within the local region. Correspondingly, we propose a novel noise synthesizing framework by explicitly learning its neighboring correlation on top of the signal dependency. With the proposed noise model, our framework greatly bridges the distribution gap between synthetic noise and real noise. We show that our generated "real" sRGB noisy images can be used for training supervised deep denoisers, thus improving their real denoising results by a large margin compared to popular classic denoisers or deep denoisers trained on other sRGB noise generators. The code will be available at https://github.com/xuan611/sRGB-Real-Noise-Synthesizing.
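To make the modeling assumption concrete, here is a toy stand-in (not the learned generator from the paper; the variance law and correlation kernel are assumed) that produces noise whose strength depends on the clean intensity and whose samples are correlated with their neighbors:

    import numpy as np
    from scipy.ndimage import convolve

    def toy_correlated_noise(clean, a=0.02, b=0.0005, kernel_size=3, seed=0):
        # clean: (H, W) grayscale image with values in [0, 255].
        rng = np.random.default_rng(seed)
        img = clean.astype(np.float32) / 255.0
        sigma = np.sqrt(a * img + b)                        # signal-dependent std per pixel
        white = rng.standard_normal(img.shape).astype(np.float32)
        k = np.full((kernel_size, kernel_size), 1.0 / kernel_size**2, np.float32)
        correlated = convolve(white, k, mode="reflect")     # introduce neighboring correlation
        correlated /= correlated.std() + 1e-8               # restore unit noise power
        noisy = np.clip(img + sigma * correlated, 0.0, 1.0)
        return (noisy * 255.0).astype(np.uint8)

    # noisy = toy_correlated_noise(np.full((64, 64), 128, np.uint8))

The paper replaces this hand-crafted kernel and variance law with a learned, content-conditioned model; the sketch only conveys the two ingredients named in the abstract, signal dependency and neighboring correlation.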
+ + + + Revisiting Weak-to-Strong Consistency in Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Revisiting_Weak-to-Strong_Consistency_in_Semi-Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + In this work, we revisit the weak-to-strong consistency framework, popularized by FixMatch from semi-supervised classification, where the prediction of a weakly perturbed image serves as supervision for its strongly perturbed version. Intriguingly, we observe that such a simple pipeline already achieves competitive results against recent advanced works, when transferred to our segmentation scenario. Its success heavily relies on the manual design of strong data augmentations, however, which may be limited and inadequate to explore a broader perturbation space. Motivated by this, we propose an auxiliary feature perturbation stream as a supplement, leading to an expanded perturbation space. On the other hand, to sufficiently probe original image-level augmentations, we present a dual-stream perturbation technique, enabling two strong views to be simultaneously guided by a common weak view. Consequently, our overall Unified Dual-Stream Perturbations approach (UniMatch) surpasses all existing methods significantly across all evaluation protocols on the Pascal, Cityscapes, and COCO benchmarks. Its superiority is also demonstrated in remote sensing interpretation and medical image analysis. We hope our reproduced FixMatch and our results can inspire more future works. Code and logs are available at https://github.com/LiheYoung/UniMatch. + + + + Implicit View-Time Interpolation of Stereo Videos Using Multi-Plane Disparities and Non-Uniform Coordinates + http://openaccess.thecvf.com//content/CVPR2023/papers/Paliwal_Implicit_View-Time_Interpolation_of_Stereo_Videos_Using_Multi-Plane_Disparities_and_CVPR_2023_paper.pdf + In this paper, we propose an approach for view-time interpolation of stereo videos. Specifically, we build upon X-Fields, which approximates an interpolatable mapping between the input coordinates and 2D RGB images using a convolutional decoder. Our main contribution is to analyze and identify the sources of the problems with using X-Fields in our application and propose novel techniques to overcome these challenges. Specifically, we observe that X-Fields struggles to implicitly interpolate the disparities for large baseline cameras. Therefore, we propose multi-plane disparities to reduce the spatial distance of the objects in the stereo views. Moreover, we propose non-uniform time coordinates to handle the non-linear and sudden motion spikes in videos. We additionally introduce several simple, but important, improvements over X-Fields. We demonstrate that our approach is able to produce better results than the state of the art, while running at near real-time rates and having low memory and storage costs. + + + + Soft-Landing Strategy for Alleviating the Task Discrepancy Problem in Temporal Action Localization Tasks + http://openaccess.thecvf.com//content/CVPR2023/papers/Kang_Soft-Landing_Strategy_for_Alleviating_the_Task_Discrepancy_Problem_in_Temporal_CVPR_2023_paper.pdf + Temporal Action Localization (TAL) methods typically operate on top of feature sequences from a frozen snippet encoder that is pretrained with the Trimmed Action Classification (TAC) tasks, resulting in a task discrepancy problem.
While existing TAL methods mitigate this issue either by retraining the encoder with a pretext task or by end-to-end finetuning, they commonly require an overload of high memory and computation. In this work, we introduce Soft-Landing (SoLa) strategy, an efficient yet effective framework to bridge the transferability gap between the pretrained encoder and the downstream tasks by incorporating a light-weight neural network, i.e., a SoLa module, on top of the frozen encoder. We also propose an unsupervised training scheme for the SoLa module; it learns with inter-frame Similarity Matching that uses the frame interval as its supervisory signal, eliminating the need for temporal annotations. Experimental evaluation on various benchmarks for downstream TAL tasks shows that our method effectively alleviates the task discrepancy problem with remarkable computational efficiency. + + + + Visibility Aware Human-Object Interaction Tracking From Single RGB Camera + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Visibility_Aware_Human-Object_Interaction_Tracking_From_Single_RGB_Camera_CVPR_2023_paper.pdf + Capturing the interactions between humans and their environment in 3D is important for many applications in robotics, graphics, and vision. Recent works to reconstruct the 3D human and object from a single RGB image do not have consistent relative translation across frames because they assume a fixed depth. Moreover, their performance drops significantly when the object is occluded. In this work, we propose a novel method to track the 3D human, object, contacts, and relative translation across frames from a single RGB camera, while being robust to heavy occlusions. Our method is built on two key insights. First, we condition our neural field reconstructions for human and object on per-frame SMPL model estimates obtained by pre-fitting SMPL to a video sequence. This improves neural reconstruction accuracy and produces coherent relative translation across frames. Second, human and object motion from visible frames provides valuable information to infer the occluded object. We propose a novel transformer-based neural network that explicitly uses object visibility and human motion to leverage neighboring frames to make predictions for the occluded frames. Building on these insights, our method is able to track both human and object robustly even under occlusions. Experiments on two datasets show that our method significantly improves over the state-of-the-art methods. Our code and pretrained models are available at: https://virtualhumans.mpi-inf.mpg.de/VisTracker. + + + + Rethinking Gradient Projection Continual Learning: Stability / Plasticity Feature Space Decoupling + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Rethinking_Gradient_Projection_Continual_Learning_Stability__Plasticity_Feature_Space_CVPR_2023_paper.pdf + Continual learning aims to incrementally learn novel classes over time, while not forgetting the learned knowledge. Recent studies have found that learning would not forget if the updated gradient is orthogonal to the feature space. However, previous approaches require the gradient to be fully orthogonal to the whole feature space, leading to poor plasticity, as the feasible gradient direction becomes narrow when the tasks continually come, i.e., feature space is unlimitedly expanded. 
In this paper, we propose a space decoupling (SD) algorithm to decouple the feature space into a pair of complementary subspaces, i.e., the stability space I, and the plasticity space R. I is established by conducting space intersection between the historic and current feature space, and thus I contains more task-shared bases. R is constructed by seeking the orthogonal complementary subspace of I, and thus R mainly contains more task-specific bases. By imposing distinguishing constraints on R and I, our method achieves a better balance between stability and plasticity. Extensive experiments are conducted by applying SD to gradient projection baselines, and show SD is model-agnostic and achieves SOTA results on publicly available datasets. + + + + FlatFormer: Flattened Window Attention for Efficient Point Cloud Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_FlatFormer_Flattened_Window_Attention_for_Efficient_Point_Cloud_Transformer_CVPR_2023_paper.pdf + Transformer, as an alternative to CNN, has been proven effective in many modalities (e.g., texts and images). For 3D point cloud transformers, existing efforts focus primarily on pushing their accuracy to the state-of-the-art level. However, their latency lags behind sparse convolution-based models (3x slower), hindering their usage in resource-constrained, latency-sensitive applications (such as autonomous driving). This inefficiency comes from point clouds' sparse and irregular nature, whereas transformers are designed for dense, regular workloads. This paper presents FlatFormer to close this latency gap by trading spatial proximity for better computational regularity. We first flatten the point cloud with window-based sorting and partition points into groups of equal sizes rather than windows of equal shapes. This effectively avoids expensive structuring and padding overheads. We then apply self-attention within groups to extract local features, alternate the sorting axis to gather features from different directions, and shift windows to exchange features across groups. FlatFormer delivers state-of-the-art accuracy on the Waymo Open Dataset with 4.6x speedup over (transformer-based) SST and 1.4x speedup over (sparse convolutional) CenterPoint. This is the first point cloud transformer that achieves real-time performance on edge GPUs and is faster than sparse convolutional methods while achieving on-par or even superior accuracy on large-scale benchmarks. + + + + Dynamic Graph Learning With Content-Guided Spatial-Frequency Relation Reasoning for Deepfake Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Dynamic_Graph_Learning_With_Content-Guided_Spatial-Frequency_Relation_Reasoning_for_Deepfake_CVPR_2023_paper.pdf + With the rapid development of face synthesis techniques, there is a pressing need for powerful face forgery detection methods due to security concerns. Some existing methods attempt to employ auxiliary frequency-aware information combined with CNN backbones to discover the forged clues. Due to the inadequate information interaction with image content, the extracted frequency features are thus spatially irrelevant, struggling to generalize well on increasingly realistic counterfeit types. To address this issue, we propose a Spatial-Frequency Dynamic Graph method to exploit the relation-aware features in spatial and frequency domains via dynamic graph learning.
To this end, we introduce three well-designed components: 1) Content-guided Adaptive Frequency Extraction module to mine the content-adaptive forged frequency clues. 2) Multiple Domains Attention Map Learning module to enrich the spatial-frequency contextual features with multiscale attention maps. 3) Dynamic Graph Spatial-Frequency Feature Fusion Network to explore the high-order relation of spatial and frequency features. Extensive experiments on several benchmarks show that our proposed method consistently exceeds the state of the art by a considerable margin. + + + + Two-Stage Co-Segmentation Network Based on Discriminative Representation for Recovering Human Mesh From Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Two-Stage_Co-Segmentation_Network_Based_on_Discriminative_Representation_for_Recovering_Human_CVPR_2023_paper.pdf + Recovering 3D human mesh from videos has recently made significant progress. However, most of the existing methods focus on the temporal consistency of videos, while ignoring the spatial representation in complex scenes, thus failing to recover a reasonable and smooth human mesh sequence under extreme illumination and chaotic backgrounds. To alleviate this problem, we propose a two-stage co-segmentation network based on discriminative representation for recovering human body meshes from videos. Specifically, the first stage of the network segments the video spatial domain to spotlight spatially fine-grained information, and then learns and enhances the intra-frame discriminative representation through a dual-excitation mechanism and a frequency domain enhancement module, while suppressing irrelevant information (e.g., background). The second stage focuses on temporal context by segmenting the video temporal domain, and models inter-frame discriminative representation via a dynamic integration strategy. Further, to efficiently generate reasonable human discriminative actions, we carefully elaborate a landmark anchor area loss to constrain the variation of the human motion area. Extensive experimental results on large publicly available datasets indicate its superiority in comparison with most state-of-the-art methods. Code will be made public. + + + + Learning Anchor Transformations for 3D Garment Animation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Learning_Anchor_Transformations_for_3D_Garment_Animation_CVPR_2023_paper.pdf + This paper proposes an anchor-based deformation model, namely AnchorDEF, to predict 3D garment animation from a body motion sequence. It deforms a garment mesh template by a mixture of rigid transformations with extra nonlinear displacements. A set of anchors around the mesh surface is introduced to guide the learning of rigid transformation matrices. Once the anchor transformations are found, per-vertex nonlinear displacements of the garment template can be regressed in a canonical space, which reduces the complexity of deformation space learning. By explicitly constraining the transformed anchors to satisfy the consistencies of position, normal and direction, the physical meaning of learned anchor transformations in space is guaranteed for better generalization. Furthermore, an adaptive anchor updating is proposed to optimize the anchor position by being aware of local mesh topology for learning representative anchor transformations.
Qualitative and quantitative experiments on different types of garments demonstrate that AnchorDEF achieves the state-of-the-art performance on 3D garment deformation prediction in motion, especially for loose-fitting garments. + + + + Actionlet-Dependent Contrastive Learning for Unsupervised Skeleton-Based Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Actionlet-Dependent_Contrastive_Learning_for_Unsupervised_Skeleton-Based_Action_Recognition_CVPR_2023_paper.pdf + The self-supervised pretraining paradigm has achieved great success in skeleton-based action recognition. However, these methods treat the motion and static parts equally, and lack an adaptive design for different parts, which has a negative impact on the accuracy of action recognition. To realize the adaptive action modeling of both parts, we propose an Actionlet-Dependent Contrastive Learning method (ActCLR). The actionlet, defined as the discriminative subset of the human skeleton, effectively decomposes motion regions for better action modeling. In detail, by contrasting with the static anchor without motion, we extract the motion region of the skeleton data, which serves as the actionlet, in an unsupervised manner. Then, centering on actionlet, a motion-adaptive data transformation method is built. Different data transformations are applied to actionlet and non-actionlet regions to introduce more diversity while maintaining their own characteristics. Meanwhile, we propose a semantic-aware feature pooling method to build feature representations among motion and static regions in a distinguished manner. Extensive experiments on NTU RGB+D and PKUMMD show that the proposed method achieves remarkable action recognition performance. More visualization and quantitative experiments demonstrate the effectiveness of our method. + + + + Ref-NPR: Reference-Based Non-Photorealistic Radiance Fields for Controllable Scene Stylization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Ref-NPR_Reference-Based_Non-Photorealistic_Radiance_Fields_for_Controllable_Scene_Stylization_CVPR_2023_paper.pdf + Current 3D scene stylization methods transfer textures and colors as styles using arbitrary style references, lacking meaningful semantic correspondences. We introduce Reference-Based Non-Photorealistic Radiance Fields (Ref-NPR) to address this limitation. This controllable method stylizes a 3D scene using radiance fields with a single stylized 2D view as a reference. We propose a ray registration process based on the stylized reference view to obtain pseudo-ray supervision in novel views. Then we exploit semantic correspondences in content images to fill occluded regions with perceptually similar styles, resulting in non-photorealistic and continuous novel view sequences. Our experimental results demonstrate that Ref-NPR outperforms existing scene and video stylization methods regarding visual quality and semantic correspondence. The code and data are publicly available on the project page at https://ref-npr.github.io. + + + + Tree Instance Segmentation With Temporal Contour Graph + http://openaccess.thecvf.com//content/CVPR2023/papers/Firoze_Tree_Instance_Segmentation_With_Temporal_Contour_Graph_CVPR_2023_paper.pdf + We present a novel approach to perform instance segmentation, and counting, for densely packed self-similar trees using a top-view RGB image sequence. We propose a solution that leverages pixel content, shape, and self-occlusion. 
First, we perform an initial over-segmentation of the image sequence and aggregate structural characteristics into a contour graph with temporal information incorporated. Second, using a graph convolutional network and its inherent local message passing abilities, we merge adjacent tree crown patches into a final set of tree crowns. Across various studies and comparisons, our method is superior to all prior methods and results in high-accuracy instance segmentation and counting, despite the trees being tightly packed. Finally, we provide various forest image sequence datasets suitable for subsequent benchmarking and evaluation, captured at different altitudes and leaf conditions. + + + + Meta-Causal Learning for Single Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Meta-Causal_Learning_for_Single_Domain_Generalization_CVPR_2023_paper.pdf + Single domain generalization aims to learn a model from a single training domain (source domain) and apply it to multiple unseen test domains (target domains). Existing methods focus on expanding the distribution of the training domain to cover the target domains, but without estimating the domain shift between the source and target domains. In this paper, we propose a new learning paradigm, namely simulate-analyze-reduce, which first simulates the domain shift by building an auxiliary domain as the target domain, then learns to analyze the causes of domain shift, and finally learns to reduce the domain shift for model adaptation. Under this paradigm, we propose a meta-causal learning method to learn meta-knowledge, that is, how to infer the causes of domain shift between the auxiliary and source domains during training. We use the meta-knowledge to analyze the shift between the target and source domains during testing. Specifically, we perform multiple transformations on source data to generate the auxiliary domain, perform counterfactual inference to learn to discover the causal factors of the shift between the auxiliary and source domains, and incorporate the inferred causality into factor-aware domain alignments. Extensive experiments on several benchmarks of image classification show the effectiveness of our method. + + + + Grad-PU: Arbitrary-Scale Point Cloud Upsampling via Gradient Descent With Learned Distance Functions + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Grad-PU_Arbitrary-Scale_Point_Cloud_Upsampling_via_Gradient_Descent_With_Learned_CVPR_2023_paper.pdf + Most existing point cloud upsampling methods have roughly three steps: feature extraction, feature expansion and 3D coordinate prediction. However, they usually suffer from two critical issues: (1) fixed upsampling rate after one-time training, since the feature expansion unit is customized for each upsampling rate; (2) outliers or shrinkage artifacts caused by the difficulty of precisely predicting 3D coordinates or residuals of upsampled points. To address them, we propose a new framework for accurate point cloud upsampling that supports arbitrary upsampling rates. Our method first interpolates the low-res point cloud according to a given upsampling rate, and then refines the positions of the interpolated points with an iterative optimization process, guided by a trained model estimating the difference between the current point cloud and the high-res target. Extensive quantitative and qualitative results on benchmarks and downstream tasks demonstrate that our method achieves state-of-the-art accuracy and efficiency.
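The interpolate-then-refine loop described above can be sketched as gradient-based refinement against a point-to-surface distance estimator; the optimizer, step count, and the toy distance function below are assumptions for illustration, not the trained model from the paper:

    import torch

    def refine_upsampled_points(coarse_pts, distance_fn, steps=20, lr=0.01):
        # coarse_pts: (N, 3) points obtained by interpolating the low-res cloud.
        # distance_fn: callable returning a per-point distance to the target surface.
        pts = coarse_pts.clone().requires_grad_(True)
        opt = torch.optim.Adam([pts], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = distance_fn(pts).pow(2).mean()  # push every point toward the surface
            loss.backward()
            opt.step()
        return pts.detach()

    # Toy stand-in: unsigned distance to a unit sphere centered at the origin.
    toy_distance = lambda p: (p.norm(dim=-1) - 1.0).abs()
    dense = refine_upsampled_points(torch.randn(1024, 3), toy_distance)

Because the refinement only needs a distance estimate, the same loop works for any upsampling rate chosen at interpolation time, which is the property the abstract emphasizes.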
+ + + + Trainable Projected Gradient Method for Robust Fine-Tuning + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_Trainable_Projected_Gradient_Method_for_Robust_Fine-Tuning_CVPR_2023_paper.pdf + Recent studies on transfer learning have shown that selectively fine-tuning a subset of layers or customizing different learning rates for each layer can greatly improve robustness to out-of-distribution (OOD) data and retain generalization capability in the pre-trained models. However, most of these methods employ manually crafted heuristics or expensive hyper-parameter search, which prevent them from scaling up to large datasets and neural networks. To solve this problem, we propose Trainable Projected Gradient Method (TPGM) to automatically learn the constraint imposed for each layer for a fine-grained fine-tuning regularization. This is motivated by formulating fine-tuning as a bi-level constrained optimization problem. Specifically, TPGM maintains a set of projection radii, i.e., distance constraints between the fine-tuned model and the pre-trained model, for each layer, and enforces them through weight projections. To learn the constraints, we propose a bi-level optimization to automatically learn the best set of projection radii in an end-to-end manner. Theoretically, we show that the bi-level optimization formulation is the key to learn different constraints for each layer. Empirically, with little hyper-parameter search cost, TPGM outperforms existing fine-tuning methods in OOD performance while matching the best in-distribution (ID) performance. For example, when fine-tuned on DomainNet-Real and ImageNet, compared to vanilla fine-tuning, TPGM shows 22% and 10% relative OOD improvement respectively on their sketch counterparts. + + + + Text2Scene: Text-Driven Indoor Scene Stylization With Part-Aware Details + http://openaccess.thecvf.com//content/CVPR2023/papers/Hwang_Text2Scene_Text-Driven_Indoor_Scene_Stylization_With_Part-Aware_Details_CVPR_2023_paper.pdf + We propose Text2Scene, a method to automatically create realistic textures for virtual scenes composed of multiple objects. Guided by a reference image and text descriptions, our pipeline adds detailed texture on labeled 3D geometries in the room such that the generated colors respect the hierarchical structure or semantic parts that are often composed of similar materials. Instead of applying flat stylization on the entire scene at a single step, we obtain weak semantic cues from geometric segmentation, which are further clarified by assigning initial colors to segmented parts. Then we add texture details for individual objects such that their projections on image space exhibit feature embedding aligned with the embedding of the input. The decomposition makes the entire pipeline tractable to a moderate amount of computation resources and memory. As our framework utilizes the existing resources of image and text embedding, it does not require dedicated datasets with high-quality textures designed by skillful artists. To the best of our knowledge, it is the first practical and scalable approach that can create detailed and realistic textures of the desired style that maintain structural context for scenes with multiple objects. 
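The per-object texturing step described above (adjusting an object's texture so that embeddings of its renderings align with the reference embedding) can be sketched as a simple optimization loop; the differentiable renderer, image encoder, and loss form are stand-ins assumed for illustration, not the paper's pipeline:

    import torch
    import torch.nn.functional as F

    def stylize_object_texture(texture, render_fn, image_encoder, target_emb, steps=200, lr=0.05):
        # texture: learnable texture parameters for one object.
        # render_fn: differentiable function mapping texture -> (1, 3, H, W) rendering.
        # image_encoder: maps renderings into the same embedding space as target_emb.
        tex = texture.clone().requires_grad_(True)
        opt = torch.optim.Adam([tex], lr=lr)
        target = F.normalize(target_emb, dim=-1)
        for _ in range(steps):
            opt.zero_grad()
            emb = F.normalize(image_encoder(render_fn(tex)), dim=-1)
            loss = 1.0 - (emb * target).sum(dim=-1).mean()  # cosine-alignment loss
            loss.backward()
            opt.step()
        return tex.detach()

Running such a loop per object, rather than over the whole scene at once, mirrors the decomposition the abstract credits for keeping computation and memory moderate.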
+ + + + FEND: A Future Enhanced Distribution-Aware Contrastive Learning Framework for Long-Tail Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_FEND_A_Future_Enhanced_Distribution-Aware_Contrastive_Learning_Framework_for_Long-Tail_CVPR_2023_paper.pdf + Predicting the future trajectories of traffic agents is a key technique in autonomous driving. However, trajectory prediction suffers from data imbalance in the prevalent datasets, and the tailed data is often more complicated and safety-critical. In this paper, we focus on dealing with the long-tail phenomenon in trajectory prediction. Previous methods dealing with long-tail data did not take into account the variety of motion patterns in the tailed data. To this end, we put forward a future enhanced contrastive learning framework to recognize tail trajectory patterns and form a feature space with separate pattern clusters. Furthermore, a distribution-aware hyper predictor is introduced to better utilize the shaped feature space. Our method is a model-agnostic framework and can be plugged into many well-known baselines. Experimental results show that our framework outperforms the state-of-the-art long-tail prediction method on tailed samples by 9.5% on ADE and 8.5% on FDE, while maintaining or slightly improving the averaged performance. Our method also surpasses many long-tail techniques on the trajectory prediction task. + + + + HDR Imaging With Spatially Varying Signal-to-Noise Ratios + http://openaccess.thecvf.com//content/CVPR2023/papers/Chi_HDR_Imaging_With_Spatially_Varying_Signal-to-Noise_Ratios_CVPR_2023_paper.pdf + While today's high dynamic range (HDR) image fusion algorithms are capable of blending multiple exposures, the acquisition is often controlled so that the dynamic range within one exposure is narrow. For HDR imaging in photon-limited situations, the dynamic range can be enormous and the noise within one exposure is spatially varying. Existing image denoising algorithms and HDR fusion algorithms both fail to handle this situation, leading to severe limitations in low-light HDR imaging. This paper presents two contributions. Firstly, we identify the source of the problem. We find that the issue is associated with the co-existence of (1) spatially varying signal-to-noise ratio, especially the excessive noise due to very dark regions, and (2) a wide luminance range within each exposure. We show that while the issue can be handled by a bank of denoisers, the complexity is high. Secondly, we propose a new method called the spatially varying high dynamic range (SV-HDR) fusion network to simultaneously denoise and fuse images. We introduce a new exposure-shared block within our custom-designed multi-scale transformer framework. In a variety of testing conditions, the performance of the proposed SV-HDR is better than the existing methods. + + + + Reliability in Semantic Segmentation: Are We on the Right Track? + http://openaccess.thecvf.com//content/CVPR2023/papers/de_Jorge_Reliability_in_Semantic_Segmentation_Are_We_on_the_Right_Track_CVPR_2023_paper.pdf + Motivated by the increasing popularity of transformers in computer vision, in recent times there has been a rapid development of novel architectures. While in-domain performance follows a constant, upward trend, properties like robustness or uncertainty estimation are less explored, leaving doubts about advances in model reliability. Studies along these axes exist, but they are mainly limited to classification models.
In contrast, we carry out a study on semantic segmentation, a relevant task for many real-world applications where model reliability is paramount. We analyze a broad variety of models, spanning from older ResNet-based architectures to novel transformers and assess their reliability based on four metrics: robustness, calibration, misclassification detection and out-of-distribution (OOD) detection. We find that while recent models are significantly more robust, they are not overall more reliable in terms of uncertainty estimation. We further explore methods that can come to the rescue and show that improving calibration can also help with other uncertainty metrics such as misclassification or OOD detection. This is the first study on modern segmentation models focused on both robustness and uncertainty estimation and we hope it will help practitioners and researchers interested in this fundamental vision task. + + + + Blowing in the Wind: CycleNet for Human Cinemagraphs From Still Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Bertiche_Blowing_in_the_Wind_CycleNet_for_Human_Cinemagraphs_From_Still_CVPR_2023_paper.pdf + Cinemagraphs are short looping videos created by adding subtle motions to a static image. This kind of media is popular and engaging. However, automatic generation of cinemagraphs is an underexplored area and current solutions require tedious low-level manual authoring by artists. In this paper, we present an automatic method that allows generating human cinemagraphs from single RGB images. We investigate the problem in the context of dressed humans under the wind. At the core of our method is a novel cyclic neural network that produces looping cinemagraphs for the target loop duration. To circumvent the problem of collecting real data, we demonstrate that it is possible, by working in the image normal space, to learn garment motion dynamics on synthetic data and generalize to real data. We evaluate our method on both synthetic and real data and demonstrate that it is possible to create compelling and plausible cinemagraphs from single RGB images. + + + + Panoptic Compositional Feature Field for Editable Scene Rendering With Network-Inferred Labels via Metric Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Cheng_Panoptic_Compositional_Feature_Field_for_Editable_Scene_Rendering_With_Network-Inferred_CVPR_2023_paper.pdf + Despite neural implicit representations demonstrating impressive high-quality view synthesis capacity, decomposing such representations into objects for instance-level editing is still challenging. Recent works learn object-compositional representations supervised by ground truth instance annotations and produce promising scene editing results. However, ground truth annotations are manually labeled and expensive in practice, which limits their usage in real-world scenes. In this work, we attempt to learn an object-compositional neural implicit representation for editable scene rendering by leveraging labels inferred from the off-the-shelf 2D panoptic segmentation networks instead of the ground truth annotations. We propose a novel framework named Panoptic Compositional Feature Field (PCFF), which introduces an instance quadruplet metric learning to build a discriminating panoptic feature space for reliable scene editing. In addition, we propose semantic-related strategies to further exploit the correlations between semantic and appearance attributes for achieving better rendering results. 
Experiments on multiple scene datasets including ScanNet, Replica, and ToyDesk demonstrate that our proposed method achieves superior performance for novel view synthesis and produces convincing real-world scene editing results. The code will be available. + + + + Neural Kaleidoscopic Space Sculpting + http://openaccess.thecvf.com//content/CVPR2023/papers/Ahn_Neural_Kaleidoscopic_Space_Sculpting_CVPR_2023_paper.pdf + We introduce a method that recovers full-surround 3D reconstructions from a single kaleidoscopic image using a neural surface representation. Full-surround 3D reconstruction is critical for many applications, such as augmented and virtual reality. A kaleidoscope, which uses a single camera and multiple mirrors, is a convenient way of achieving full-surround coverage, as it redistributes light directions and thus captures multiple viewpoints in a single image. This enables single-shot and dynamic full-surround 3D reconstruction. However, using a kaleidoscopic image for multi-view stereo is challenging, as we need to decompose the image into multi-view images by identifying which pixel corresponds to which virtual camera, a process we call labeling. To address this challenge, our approach avoids the need to explicitly estimate labels, but instead "sculpts" a neural surface representation through the careful use of silhouette, background, foreground, and texture information present in the kaleidoscopic image. We demonstrate the advantages of our method in a range of simulated and real experiments, on both static and dynamic scenes. + + + + Implicit Identity Driven Deepfake Face Swapping Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Implicit_Identity_Driven_Deepfake_Face_Swapping_Detection_CVPR_2023_paper.pdf + In this paper, we consider face swapping detection from the perspective of face identity. Face swapping aims to replace the target face with the source face and generate a fake face that humans cannot distinguish from a real one. We argue that the fake face contains an explicit identity and an implicit identity, which respectively correspond to the identity of the source face and the target face during face swapping. Note that the explicit identities of faces can be extracted by regular face recognizers. Particularly, the implicit identity of a real face is consistent with its explicit identity. Thus the difference between the explicit and implicit identities of a face facilitates face swapping detection. Following this idea, we propose a novel implicit identity driven framework for face swapping detection. Specifically, we design an explicit identity contrast (EIC) loss and an implicit identity exploration (IIE) loss, which supervise a CNN backbone to embed face images into the implicit identity space. Under the guidance of EIC, real samples are pulled closer to their explicit identities, while fake samples are pushed away from their explicit identities. Moreover, IIE is derived from the margin-based classification loss function, which encourages the fake faces with known target identities to enjoy intra-class compactness and inter-class diversity. Extensive experiments and visualizations on several datasets demonstrate the generalization of our method against the state-of-the-art counterparts.
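A minimal sketch of an EIC-style objective (the margin value, cosine-distance form, and batch layout are assumptions for illustration; the paper's exact formulation may differ):

    import torch
    import torch.nn.functional as F

    def explicit_identity_contrast_loss(implicit_emb, explicit_emb, is_real, margin=0.5):
        # implicit_emb: (B, D) backbone embeddings in the implicit-identity space.
        # explicit_emb: (B, D) identity embeddings from a frozen face recognizer.
        # is_real: (B,) boolean mask marking genuine (non-swapped) faces.
        z = F.normalize(implicit_emb, dim=-1)
        e = F.normalize(explicit_emb, dim=-1)
        cos_dist = 1.0 - (z * e).sum(dim=-1)                    # cosine distance in [0, 2]
        pull_real = cos_dist[is_real].mean()                    # pull real faces toward their explicit identity
        push_fake = F.relu(margin - cos_dist[~is_real]).mean()  # push fakes at least `margin` away
        return pull_real + push_fake

    # Toy usage with random embeddings and a fixed real/fake split.
    loss = explicit_identity_contrast_loss(
        torch.randn(16, 256), torch.randn(16, 256),
        is_real=torch.tensor([True] * 8 + [False] * 8))

In the paper this term is combined with the margin-based IIE classification loss; the sketch only illustrates the pull/push behavior the abstract describes.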
+ + + + Class Relationship Embedded Learning for Source-Free Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Class_Relationship_Embedded_Learning_for_Source-Free_Unsupervised_Domain_Adaptation_CVPR_2023_paper.pdf + This work focuses on a practical knowledge transfer task defined as Source-Free Unsupervised Domain Adaptation (SFUDA), where only a well-trained source model and unlabeled target data are available. To fully utilize source knowledge, we propose to transfer the class relationship, which is domain-invariant but still under-explored in previous works. To this end, we first regard the classifier weights of the source model as class prototypes to compute class relationship, and then propose a novel probability-based similarity between target-domain samples by embedding the source-domain class relationship, resulting in Class Relationship embedded Similarity (CRS). Here the inter-class term is particularly considered in order to more accurately represent the similarity between two samples, in which the source prior of class relationship is utilized by weighting. Finally, we propose to embed CRS into contrastive learning in a unified form. Here both class-aware and instance discrimination contrastive losses are employed, which are complementary to each other. We combine the proposed method with existing representative methods to evaluate its efficacy in multiple SFUDA settings. Extensive experimental results reveal that our method can achieve state-of-the-art performance due to the transfer of domain-invariant class relationship. + + + + One-to-Few Label Assignment for End-to-End Dense Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_One-to-Few_Label_Assignment_for_End-to-End_Dense_Detection_CVPR_2023_paper.pdf + One-to-one (o2o) label assignment plays a key role for transformer based end-to-end detection, and it has been recently introduced in fully convolutional detectors for lightweight end-to-end dense detection. However, o2o can largely degrade the feature learning performance due to the limited number of positive samples. Though extra positive samples can be introduced to mitigate this issue, the computation of self- and cross- attentions among anchors prevents its practical application to dense and fully convolutional detectors. In this work, we propose a simple yet effective one-to-few (o2f) label assignment strategy for end-to-end dense detection. Apart from defining one positive and many negative anchors for each object, we define several soft anchors, which serve as positive and negative samples simultaneously. The positive and negative weights of these soft anchors are dynamically adjusted during training so that they can contribute more to 'representation learning' in the early training stage and contribute more to 'duplicated prediction removal' in the later stage. The detector trained in this way can not only learn a strong feature representation but also perform end-to-end detection. Experiments on COCO and CrowdHuman datasets demonstrate the effectiveness of the proposed o2f scheme. + + + + Fake It Till You Make It: Learning Transferable Representations From Synthetic ImageNet Clones + http://openaccess.thecvf.com//content/CVPR2023/papers/Sariyildiz_Fake_It_Till_You_Make_It_Learning_Transferable_Representations_From_CVPR_2023_paper.pdf + Recent image generation models such as Stable Diffusion have exhibited an impressive ability to generate fairly realistic images starting from a simple text prompt. 
Could such models render real images obsolete for training image prediction models? In this paper, we answer part of this provocative question by investigating the need for real images when training models for ImageNet classification. Provided only with the class names that have been used to build the dataset, we explore the ability of Stable Diffusion to generate synthetic clones of ImageNet and measure how useful these are for training classification models from scratch. We show that with minimal and class-agnostic prompt engineering, ImageNet clones are able to close a large part of the gap between models produced by synthetic images and models trained with real images, for the several standard classification benchmarks that we consider in this study. More importantly, we show that models trained on synthetic images exhibit strong generalization properties and perform on par with models trained on real data for transfer. Project page: https://europe.naverlabs.com/imagenet-sd + + + + Interactive and Explainable Region-Guided Radiology Report Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Tanida_Interactive_and_Explainable_Region-Guided_Radiology_Report_Generation_CVPR_2023_paper.pdf + The automatic generation of radiology reports has the potential to assist radiologists in the time-consuming task of report writing. Existing methods generate the full report from image-level features, failing to explicitly focus on anatomical regions in the image. We propose a simple yet effective region-guided report generation model that detects anatomical regions and then describes individual, salient regions to form the final report. While previous methods generate reports without the possibility of human intervention and with limited explainability, our method opens up novel clinical use cases through additional interactive capabilities and introduces a high degree of transparency and explainability. Comprehensive experiments demonstrate our method's effectiveness in report generation, outperforming previous state-of-the-art models, and highlight its interactive capabilities. The code and checkpoints are available at https://github.com/ttanida/rgrg. + + + + MED-VT: Multiscale Encoder-Decoder Video Transformer With Application To Object Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Karim_MED-VT_Multiscale_Encoder-Decoder_Video_Transformer_With_Application_To_Object_Segmentation_CVPR_2023_paper.pdf + Multiscale video transformers have been explored in a wide variety of vision tasks. To date, however, the multiscale processing has been confined to the encoder or decoder alone. We present a unified multiscale encoder-decoder transformer that is focused on dense prediction tasks in videos. Multiscale representation at both encoder and decoder yields key benefits of implicit extraction of spatiotemporal features (i.e., without reliance on input optical flow) as well as temporal consistency at encoding and coarse-to-fine detection for high-level (e.g., object) semantics to guide precise localization at decoding. Moreover, we propose a transductive learning scheme through many-to-many label propagation to provide temporally consistent predictions. We showcase our Multiscale Encoder-Decoder Video Transformer (MED-VT) on Automatic Video Object Segmentation (AVOS) and actor/action segmentation, where we outperform state-of-the-art approaches on multiple benchmarks using only raw images, without using optical flow.
+ + + + Benchmarking Self-Supervised Learning on Diverse Pathology Datasets + http://openaccess.thecvf.com//content/CVPR2023/papers/Kang_Benchmarking_Self-Supervised_Learning_on_Diverse_Pathology_Datasets_CVPR_2023_paper.pdf + Computational pathology can lead to saving human lives, but models are annotation hungry and pathology images are notoriously expensive to annotate. Self-supervised learning has been shown to be an effective method for utilizing unlabeled data, and its application to pathology could greatly benefit its downstream tasks. Yet, there are no principled studies that compare SSL methods and discuss how to adapt them for pathology. To address this need, we execute the largest-scale study of SSL pre-training on pathology image data, to date. Our study is conducted using 4 representative SSL methods on diverse downstream tasks. We establish that large-scale domain-aligned pre-training in pathology consistently outperforms ImageNet pre-training in standard SSL settings such as linear and fine-tuning evaluations, as well as in low-label regimes. Moreover, we propose a set of domain-specific techniques that we experimentally show leads to a performance boost. Lastly, for the first time, we apply SSL to the challenging task of nuclei instance segmentation and show large and consistent performance improvements under diverse settings. + + + + Document Image Shadow Removal Guided by Color-Aware Background + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Document_Image_Shadow_Removal_Guided_by_Color-Aware_Background_CVPR_2023_paper.pdf + Existing works on document image shadow removal mostly depend on learning and leveraging a constant background (the color of the paper) from the image. However, the constant background is less representative and frequently ignores other background colors, such as the printed colors, resulting in distorted results. In this paper, we present a color-aware background extraction network (CBENet) for extracting a spatially varying background image that accurately depicts the background colors of the document. Furthermore, we propose a background-guided document image shadow removal network (BGShadowNet) using the predicted spatially varying background as auxiliary information, which consists of two stages. At Stage I, a background-constrained decoder is designed to promote a coarse result. Then, the coarse result is refined with a background-based attention module (BAModule) to maintain a consistent appearance and a detail improvement module (DEModule) to enhance the texture details at Stage II. Experiments on two benchmark datasets qualitatively and quantitatively validate the superiority of the proposed approach over state-of-the-arts. + + + + Improved Distribution Matching for Dataset Condensation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Improved_Distribution_Matching_for_Dataset_Condensation_CVPR_2023_paper.pdf + Dataset Condensation aims to condense a large dataset into a smaller one while maintaining its ability to train a well-performing model, thus reducing the storage cost and training effort in deep learning applications. However, conventional dataset condensation methods are optimization-oriented and condense the dataset by performing gradient or parameter matching during model optimization, which is computationally intensive even on small datasets and models. In this paper, we propose a novel dataset condensation method based on distribution matching, which is more efficient and promising. 
Specifically, we identify two important shortcomings of naive distribution matching (i.e., imbalanced feature numbers and unvalidated embeddings for distance computation) and address them with three novel techniques (i.e., partitioning and expansion augmentation, efficient and enriched model sampling, and class-aware distribution regularization). Our simple yet effective method outperforms most previous optimization-oriented methods with far fewer computational resources, thereby scaling data condensation to larger datasets and models. Extensive experiments demonstrate the effectiveness of our method. Codes are available at https://github.com/uitrbn/IDM + + + + Feature Separation and Recalibration for Adversarial Robustness + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Feature_Separation_and_Recalibration_for_Adversarial_Robustness_CVPR_2023_paper.pdf + Deep neural networks are susceptible to adversarial attacks due to the accumulation of perturbations in the feature level, and numerous works have boosted model robustness by deactivating the non-robust feature activations that cause model mispredictions. However, we claim that these malicious activations still contain discriminative cues and that with recalibration, they can capture additional useful information for correct model predictions. To this end, we propose a novel, easy-to-plugin approach named Feature Separation and Recalibration (FSR) that recalibrates the malicious, non-robust activations for more robust feature maps through Separation and Recalibration. The Separation part disentangles the input feature map into the robust feature with activations that help the model make correct predictions and the non-robust feature with activations that are responsible for model mispredictions upon adversarial attack. The Recalibration part then adjusts the non-robust activations to restore the potentially useful cues for model predictions. Extensive experiments verify the superiority of FSR compared to traditional deactivation techniques and demonstrate that it improves the robustness of existing adversarial training methods by up to 8.57% with small computational overhead. Codes are available at https://github.com/wkim97/FSR. + + + + Slimmable Dataset Condensation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Slimmable_Dataset_Condensation_CVPR_2023_paper.pdf + Dataset distillation, also known as dataset condensation, aims to compress a large dataset into a compact synthetic one. Existing methods perform dataset condensation by assuming a fixed storage or transmission budget. When the budget changes, however, they have to repeat the synthesizing process with access to original datasets, which is highly cumbersome if not infeasible at all. In this paper, we explore the problem of slimmable dataset condensation, to extract a smaller synthetic dataset given only previous condensation results. We first study the limitations of existing dataset condensation algorithms on such a successive compression setting and identify two key factors: (1) the inconsistency of neural networks over different compression times and (2) the underdetermined solution space for synthetic data. Accordingly, we propose a novel training objective for slimmable dataset condensation to explicitly account for both factors. Moreover, synthetic datasets in our method adopt a significance-aware parameterization. Theoretical derivation indicates that an upper-bounded error can be achieved by discarding the minor components without training. 
Alternatively, if training is allowed, this strategy can serve as a strong initialization that enables a fast convergence. Extensive comparisons and ablations demonstrate the superiority of the proposed solution over existing methods on multiple benchmarks. + + + + Multi-View Azimuth Stereo via Tangent Space Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Multi-View_Azimuth_Stereo_via_Tangent_Space_Consistency_CVPR_2023_paper.pdf + We present a method for 3D reconstruction only using calibrated multi-view surface azimuth maps. Our method, multi-view azimuth stereo, is effective for textureless or specular surfaces, which are difficult for conventional multi-view stereo methods. We introduce the concept of tangent space consistency: Multi-view azimuth observations of a surface point should be lifted to the same tangent space. Leveraging this consistency, we recover the shape by optimizing a neural implicit surface representation. Our method harnesses the robust azimuth estimation capabilities of photometric stereo methods or polarization imaging while bypassing potentially complex zenith angle estimation. Experiments using azimuth maps from various sources validate the accurate shape recovery with our method, even without zenith angles. + + + + VectorFusion: Text-to-SVG by Abstracting Pixel-Based Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Jain_VectorFusion_Text-to-SVG_by_Abstracting_Pixel-Based_Diffusion_Models_CVPR_2023_paper.pdf + Diffusion models have shown impressive results in text-to-image synthesis. Using massive datasets of captioned images, diffusion models learn to generate raster images of highly diverse objects and scenes. However, designers frequently use vector representations of images like Scalable Vector Graphics (SVGs) for digital icons, graphics and stickers. Vector graphics can be scaled to any size, and are compact. In this work, we show that a text-conditioned diffusion model trained on pixel representations of images can be used to generate SVG-exportable vector graphics. We do so without access to large datasets of captioned SVGs. Instead, inspired by recent work on text-to-3D synthesis, we vectorize a text-to-image diffusion sample and fine-tune with a Score Distillation Sampling loss. By optimizing a differentiable vector graphics rasterizer, our method distills abstract semantic knowledge out of a pretrained diffusion model. By constraining the vector representation, we can also generate coherent pixel art and sketches. Our approach, VectorFusion, produces more coherent graphics than prior works that optimize CLIP, a contrastive image-text model. + + + + The Dialog Must Go On: Improving Visual Dialog via Generative Self-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Kang_The_Dialog_Must_Go_On_Improving_Visual_Dialog_via_Generative_CVPR_2023_paper.pdf + Visual dialog (VisDial) is a task of answering a sequence of questions grounded in an image, using the dialog history as context. Prior work has trained the dialog agents solely on VisDial data via supervised learning or leveraged pre-training on related vision-and-language datasets. This paper presents a semi-supervised learning approach for visually-grounded dialog, called Generative Self-Training (GST), to leverage unlabeled images on the Web. Specifically, GST first retrieves in-domain images through out-of-distribution detection and generates synthetic dialogs regarding the images via multimodal conditional text generation. 
GST then trains a dialog agent on the synthetic and the original VisDial data. As a result, GST scales the amount of training data to an order of magnitude more than VisDial (from 1.2M to 12.9M QA pairs). For robust training of the synthetic dialogs, we also propose perplexity-based data selection and multimodal consistency regularization. Evaluation on VisDial v1.0 and v0.9 datasets shows that GST achieves new state-of-the-art results on both datasets. We further observe the robustness of GST against both visual and textual adversarial attacks. Finally, GST yields strong performance gains in the low-data regime. Code is available at https://github.com/gicheonkang/gst-visdial. + + + + Binarizing Sparse Convolutional Networks for Efficient Point Cloud Analysis + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Binarizing_Sparse_Convolutional_Networks_for_Efficient_Point_Cloud_Analysis_CVPR_2023_paper.pdf + In this paper, we propose binary sparse convolutional networks called BSC-Net for efficient point cloud analysis. We empirically observe that sparse convolution operation causes larger quantization errors than standard convolution. However, conventional network quantization methods directly binarize the weights and activations in sparse convolution, resulting in performance drop due to the significant quantization loss. On the contrary, we search the optimal subset of convolution operation that activates the sparse convolution at various locations for quantization error alleviation, and the performance gap between real-valued and binary sparse convolutional networks is closed without complexity overhead. Specifically, we first present the shifted sparse convolution that fuses the information in the receptive field for the active sites that match the pre-defined positions. Then we employ the differentiable search strategies to discover the optimal positions for active site matching in the shifted sparse convolution, and the quantization errors are significantly alleviated for efficient point cloud analysis. For fair evaluation of the proposed method, we empirically select the recent advances that are beneficial for sparse convolution network binarization to construct a strong baseline. The experimental results on ScanNet and NYU Depth v2 show that our BSC-Net achieves significant improvement upon our strong baseline and outperforms the state-of-the-art network binarization methods by a remarkable margin without additional computation overhead for binarizing sparse convolutional networks. + + + + Diffusion Art or Digital Forgery? Investigating Data Replication in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Somepalli_Diffusion_Art_or_Digital_Forgery_Investigating_Data_Replication_in_Diffusion_CVPR_2023_paper.pdf + Cutting-edge diffusion models produce images with high quality and customizability, enabling them to be used for commercial art and graphic design purposes. But do diffusion models create unique works of art, or are they replicating content directly from their training sets? In this work, we study image retrieval frameworks that enable us to compare generated images with training samples and detect when content has been replicated. Applying our frameworks to diffusion models trained on multiple datasets including Oxford flowers, Celeb-A, ImageNet, and LAION, we discuss how factors such as training set size impact rates of content replication. 
We also identify cases where diffusion models, including the popular Stable Diffusion model, blatantly copy from their training data. + + + + Neuralizer: General Neuroimage Analysis Without Re-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Czolbe_Neuralizer_General_Neuroimage_Analysis_Without_Re-Training_CVPR_2023_paper.pdf + Neuroimage processing tasks like segmentation, reconstruction, and registration are central to the study of neuroscience. Robust deep learning strategies and architectures used to solve these tasks are often similar. Yet, when presented with a new task or a dataset with different visual characteristics, practitioners most often need to train a new model, or fine-tune an existing one. This is a time-consuming process that poses a substantial barrier for the thousands of neuroscientists and clinical researchers who often lack the resources or machine-learning expertise to train deep learning models. In practice, this leads to a lack of adoption of deep learning, and neuroscience tools being dominated by classical frameworks. We introduce Neuralizer, a single model that generalizes to previously unseen neuroimaging tasks and modalities without the need for re-training or fine-tuning. Tasks do not have to be known a priori, and generalization happens in a single forward pass during inference. The model can solve processing tasks across multiple image modalities, acquisition methods, and datasets, and generalize to tasks and modalities it has not been trained on. Our experiments on coronal slices show that when few annotated subjects are available, our multi-task network outperforms task-specific baselines without training on the task. + + + + UniDexGrasp: Universal Robotic Dexterous Grasping via Learning Diverse Proposal Generation and Goal-Conditioned Policy + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_UniDexGrasp_Universal_Robotic_Dexterous_Grasping_via_Learning_Diverse_Proposal_Generation_CVPR_2023_paper.pdf + In this work, we tackle the problem of learning universal robotic dexterous grasping from a point cloud observation under a table-top setting. The goal is to grasp and lift up objects in high-quality and diverse ways and generalize across hundreds of categories and even the unseen. Inspired by successful pipelines used in parallel gripper grasping, we split the task into two stages: 1) grasp proposal (pose) generation and 2) goal-conditioned grasp execution. For the first stage, we propose a novel probabilistic model of grasp pose conditioned on the point cloud observation that factorizes rotation from translation and articulation. Trained on our synthesized large-scale dexterous grasp dataset, this model enables us to sample diverse and high-quality dexterous grasp poses for the object point cloud. For the second stage, we propose to replace the motion planning used in parallel gripper grasping with a goal-conditioned grasp policy, due to the complexity involved in dexterous grasping execution. Note that it is very challenging to learn this highly generalizable grasp policy that only takes realistic inputs without oracle states. We thus propose several important innovations, including state canonicalization, object curriculum, and teacher-student distillation. 
Integrating the two stages, our final pipeline becomes the first to achieve universal generalization for dexterous grasping, demonstrating an average success rate of more than 60% on thousands of object instances, which significantly outperforms all baselines, meanwhile showing only a minimal generalization gap. + + + + A Rotation-Translation-Decoupled Solution for Robust and Efficient Visual-Inertial Initialization + http://openaccess.thecvf.com//content/CVPR2023/papers/He_A_Rotation-Translation-Decoupled_Solution_for_Robust_and_Efficient_Visual-Inertial_Initialization_CVPR_2023_paper.pdf + We propose a novel visual-inertial odometry (VIO) initialization method, which decouples rotation and translation estimation, and achieves higher efficiency and better robustness. Existing loosely-coupled VIO-initialization methods suffer from poor stability of visual structure-from-motion (SfM), whereas those tightly-coupled methods often ignore the gyroscope bias in the closed-form solution, resulting in limited accuracy. Moreover, the aforementioned two classes of methods are computationally expensive, because 3D point clouds need to be reconstructed simultaneously. In contrast, our new method fully combines inertial and visual measurements for both rotational and translational initialization. First, a rotation-only solution is designed for gyroscope bias estimation, which tightly couples the gyroscope and camera observations. Second, the initial velocity and gravity vector are solved with linear translation constraints in a globally optimal fashion and without reconstructing 3D point clouds. Extensive experiments have demonstrated that our method is 8 to 72 times faster (w.r.t. a 10-frame set) than the state-of-the-art methods, and also presents significantly higher robustness and accuracy. The source code is available at https://github.com/boxuLibrary/drt-vio-init. + + + + BundleSDF: Neural 6-DoF Tracking and 3D Reconstruction of Unknown Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Wen_BundleSDF_Neural_6-DoF_Tracking_and_3D_Reconstruction_of_Unknown_Objects_CVPR_2023_paper.pdf + We present a near real-time (10Hz) method for 6-DoF tracking of an unknown object from a monocular RGBD video sequence, while simultaneously performing neural 3D reconstruction of the object. Our method works for arbitrary rigid objects, even when visual texture is largely absent. The object is assumed to be segmented in the first frame only. No additional information is required, and no assumption is made about the interaction agent. Key to our method is a Neural Object Field that is learned concurrently with a pose graph optimization process in order to robustly accumulate information into a consistent 3D representation capturing both geometry and appearance. A dynamic pool of posed memory frames is automatically maintained to facilitate communication between these threads. Our approach handles challenging sequences with large pose changes, partial and full occlusion, untextured surfaces, and specular highlights. We show results on HO3D, YCBInEOAT, and BEHAVE datasets, demonstrating that our method significantly outperforms existing approaches. 
Project page: https://bundlesdf.github.io/ + + + + Texture-Guided Saliency Distilling for Unsupervised Salient Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Texture-Guided_Saliency_Distilling_for_Unsupervised_Salient_Object_Detection_CVPR_2023_paper.pdf + Deep Learning-based Unsupervised Salient Object Detection (USOD) mainly relies on the noisy saliency pseudo labels that have been generated from traditional handcrafted methods or pre-trained networks. To cope with the noisy labels problem, a class of methods focuses on only easy samples with reliable labels but ignores valuable knowledge in hard samples. In this paper, we propose a novel USOD method to mine rich and accurate saliency knowledge from both easy and hard samples. First, we propose a Confidence-aware Saliency Distilling (CSD) strategy that scores samples conditioned on samples' confidences, which guides the model to distill saliency knowledge from easy samples to hard samples progressively. Second, we propose a Boundary-aware Texture Matching (BTM) strategy to refine the boundaries of noisy labels by matching the textures around the predicted boundaries. Extensive experiments on RGB, RGB-D, RGB-T, and video SOD benchmarks prove that our method achieves state-of-the-art USOD performance. Code is available at www.github.com/moothes/A2S-v2. + + + + AltFreezing for More General Video Face Forgery Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_AltFreezing_for_More_General_Video_Face_Forgery_Detection_CVPR_2023_paper.pdf + Existing face forgery detection models try to discriminate fake images by detecting only spatial artifacts (e.g., generative artifacts, blending) or mainly temporal artifacts (e.g., flickering, discontinuity). They may experience significant performance degradation when facing out-domain artifacts. In this paper, we propose to capture both spatial and temporal artifacts in one model for face forgery detection. A simple idea is to leverage a spatiotemporal model (3D ConvNet). However, we find that it may easily rely on one type of artifact and ignore the other. To address this issue, we present a novel training strategy called AltFreezing for more general face forgery detection. The AltFreezing aims to encourage the model to detect both spatial and temporal artifacts. It divides the weights of a spatiotemporal network into two groups: spatial- and temporal-related. Then the two groups of weights are alternately frozen during the training process so that the model can learn spatial and temporal features to distinguish real or fake videos. Furthermore, we introduce various video-level data augmentation methods to improve the generalization capability of the forgery detection model. Extensive experiments show that our framework outperforms existing methods in terms of generalization to unseen manipulations and datasets. + + + + Learning Partial Correlation Based Deep Visual Representation for Image Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Rahman_Learning_Partial_Correlation_Based_Deep_Visual_Representation_for_Image_Classification_CVPR_2023_paper.pdf + Visual representation based on covariance matrix has demonstrated its efficacy for image classification by characterising the pairwise correlation of different channels in convolutional feature maps. However, pairwise correlation will become misleading once there is another channel correlating with both channels of interest, resulting in the "confounding" effect. 
For this case, "partial correlation" which removes the confounding effect shall be estimated instead. Nevertheless, reliably estimating partial correlation requires solving a symmetric positive definite matrix optimisation, known as sparse inverse covariance estimation (SICE). How to incorporate this process into CNN remains an open issue. In this work, we formulate SICE as a novel structured layer of CNN. To ensure end-to-end trainability, we develop an iterative method to solve the above matrix optimisation during forward and backward propagation steps. Our work obtains a partial correlation based deep visual representation and mitigates the small sample problem often encountered by covariance matrix estimation in CNN. Computationally, our model can be effectively trained with GPU and works well with a large number of channels of advanced CNNs. Experiments show the efficacy and superior classification performance of our deep visual representation compared to covariance matrix based counterparts. + + + + Hi-LASSIE: High-Fidelity Articulated Shape and Skeleton Discovery From Sparse Image Ensemble + http://openaccess.thecvf.com//content/CVPR2023/papers/Yao_Hi-LASSIE_High-Fidelity_Articulated_Shape_and_Skeleton_Discovery_From_Sparse_Image_CVPR_2023_paper.pdf + Automatically estimating 3D skeleton, shape, camera viewpoints, and part articulation from sparse in-the-wild image ensembles is a severely under-constrained and challenging problem. Most prior methods rely on large-scale image datasets, dense temporal correspondence, or human annotations like camera pose, 2D keypoints, and shape templates. We propose Hi-LASSIE, which performs 3D articulated reconstruction from only 20-30 online images in the wild without any user-defined shape or skeleton templates. We follow the recent work of LASSIE that tackles a similar problem setting and make two significant advances. First, instead of relying on a manually annotated 3D skeleton, we automatically estimate a class-specific skeleton from the selected reference image. Second, we improve the shape reconstructions with novel instance-specific optimization strategies that allow reconstructions to faithfully fit on each instance while preserving the class-specific priors learned across all images. Experiments on in-the-wild image ensembles show that Hi-LASSIE obtains higher fidelity state-of-the-art 3D reconstructions despite requiring minimal user input. Project page: chhankyao.github.io/hi-lassie/ + + + + Computationally Budgeted Continual Learning: What Does Matter? + http://openaccess.thecvf.com//content/CVPR2023/papers/Prabhu_Computationally_Budgeted_Continual_Learning_What_Does_Matter_CVPR_2023_paper.pdf + Continual Learning (CL) aims to sequentially train models on streams of incoming data that vary in distribution by preserving previous knowledge while adapting to new data. Current CL literature focuses on restricted access to previously seen data, while imposing no constraints on the computational budget for training. This is unreasonable for applications in-the-wild, where systems are primarily constrained by computational and time budgets, not storage. We revisit this problem with a large-scale benchmark and analyze the performance of traditional CL approaches in a compute-constrained setting, where effective memory samples used in training can be implicitly restricted as a consequence of limited computation. 
We conduct experiments evaluating various CL sampling strategies, distillation losses, and partial fine-tuning on two large-scale datasets, namely ImageNet2K and Continual Google Landmarks V2 in data incremental, class incremental, and time incremental settings. Through extensive experiments amounting to a total of over 1500 GPU-hours, we find that, under compute-constrained setting, traditional CL approaches, with no exception, fail to outperform a simple minimal baseline that samples uniformly from memory. Our conclusions are consistent in a different number of stream time steps, e.g., 20 to 200, and under several computational budgets. This suggests that most existing CL methods are particularly too computationally expensive for realistic budgeted deployment. Code for this project is available at: https://github.com/drimpossible/BudgetCL. + + + + Decentralized Learning With Multi-Headed Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhmoginov_Decentralized_Learning_With_Multi-Headed_Distillation_CVPR_2023_paper.pdf + Decentralized learning with private data is a central problem in machine learning. We propose a novel distillation-based decentralized learning technique that allows multiple agents with private non-iid data to learn from each other, without having to share their data, weights or weight updates. Our approach is communication efficient, utilizes an unlabeled public dataset and uses multiple auxiliary heads for each client, greatly improving training efficiency in the case of heterogeneous data. This approach allows individual models to preserve and enhance performance on their private tasks while also dramatically improving their performance on the global aggregated data distribution. We study the effects of data and model architecture heterogeneity and the impact of the underlying communication graph topology on learning efficiency and show that our agents can significantly improve their performance compared to learning in isolation. + + + + CF-Font: Content Fusion for Few-Shot Font Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_CF-Font_Content_Fusion_for_Few-Shot_Font_Generation_CVPR_2023_paper.pdf + Content and style disentanglement is an effective way to achieve few-shot font generation. It allows to transfer the style of the font image in a source domain to the style defined with a few reference images in a target domain. However, the content feature extracted using a representative font might not be optimal. In light of this, we propose a content fusion module (CFM) to project the content feature into a linear space defined by the content features of basis fonts, which can take the variation of content features caused by different fonts into consideration. Our method also allows to optimize the style representation vector of reference images through a lightweight iterative style-vector refinement (ISR) strategy. Moreover, we treat the 1D projection of a character image as a probability distribution and leverage the distance between two distributions as the reconstruction loss (namely projected character loss, PCL). Compared to L2 or L1 reconstruction loss, the distribution distance pays more attention to the global shape of characters. We have evaluated our method on a dataset of 300 fonts with 6.5k characters each. Experimental results verify that our method outperforms existing state-of-the-art few-shot font generation methods by a large margin. The source code can be found at https://github.com/wangchi95/CF-Font. 
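The projected character loss (PCL) described in the CF-Font entry above is easy to sketch: project each character image onto its row and column marginals, normalize them into 1D distributions, and penalize the distance between the two distributions. The PyTorch sketch below uses sum projections and an L1 distance between cumulative distributions as one plausible instantiation; the paper's exact projection and distance may differ.

```python
# Illustrative sketch of a projected-character-loss-style objective; assumptions noted inline.
import torch

def projected_character_loss(pred, target, eps=1e-8):
    """pred, target: (B, 1, H, W) grayscale character images in [0, 1]."""
    loss = 0.0
    for dim in (2, 3):                               # project onto columns and rows
        p = pred.sum(dim=dim)
        t = target.sum(dim=dim)
        p = p / (p.sum(dim=-1, keepdim=True) + eps)  # normalize to 1D distributions
        t = t / (t.sum(dim=-1, keepdim=True) + eps)
        # L1 distance between cumulative distributions (a 1D Wasserstein-style choice; an assumption)
        loss = loss + (p.cumsum(-1) - t.cumsum(-1)).abs().mean()
    return loss

# usage: loss = projected_character_loss(generated_chars, reference_chars)
```

Because the projections summarize mass along each axis, this kind of loss reacts to the overall glyph shape rather than to per-pixel differences, which matches the motivation given in the abstract.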
+ + + + 3Mformer: Multi-Order Multi-Mode Transformer for Skeletal Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_3Mformer_Multi-Order_Multi-Mode_Transformer_for_Skeletal_Action_Recognition_CVPR_2023_paper.pdf + Many skeletal action recognition models use GCNs to represent the human body by 3D body joints connected body parts. GCNs aggregate one- or few-hop graph neighbourhoods, and ignore the dependency between not linked body joints. We propose to form hypergraph to model hyper-edges between graph nodes (e.g., third- and fourth-order hyper-edges capture three and four nodes) which help capture higher-order motion patterns of groups of body joints. We split action sequences into temporal blocks, Higher-order Transformer (HoT) produces embeddings of each temporal block based on (i) the body joints, (ii) pairwise links of body joints and (iii) higher-order hyper-edges of skeleton body joints. We combine such HoT embeddings of hyper-edges of orders 1, ..., r by a novel Multi-order Multi-mode Transformer (3Mformer) with two modules whose order can be exchanged to achieve coupled-mode attention on coupled-mode tokens based on 'channel-temporal block', 'order-channel-body joint', 'channel-hyper-edge (any order)' and 'channel-only' pairs. The first module, called Multi-order Pooling (MP), additionally learns weighted aggregation along the hyper-edge mode, whereas the second module, Temporal block Pooling (TP), aggregates along the temporal block mode. Our end-to-end trainable network yields state-of-the-art results compared to GCN-, transformer- and hypergraph-based counterparts. + + + + Transformer Scale Gate for Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shi_Transformer_Scale_Gate_for_Semantic_Segmentation_CVPR_2023_paper.pdf + Effectively encoding multi-scale contextual information is crucial for accurate semantic segmentation. Most of the existing transformer-based segmentation models combine features across scales without any selection, where features on sub-optimal scales may degrade segmentation outcomes. Leveraging from the inherent properties of Vision Transformers, we propose a simple yet effective module, Transformer Scale Gate (TSG), to optimally combine multi-scale features. TSG exploits cues in self and cross attentions in Vision Transformers for the scale selection. TSG is a highly flexible plug-and-play module, and can easily be incorporated with any encoder-decoder-based hierarchical vision Transformer architecture. Extensive experiments on the Pascal Context, ADE20K and Cityscapes datasets demonstrate that our feature selection strategy achieves consistent gains. + + + + EMT-NAS:Transferring Architectural Knowledge Between Tasks From Different Datasets + http://openaccess.thecvf.com//content/CVPR2023/papers/Liao_EMT-NASTransferring_Architectural_Knowledge_Between_Tasks_From_Different_Datasets_CVPR_2023_paper.pdf + The success of multi-task learning (MTL) can largely be attributed to the shared representation of related tasks, allowing the models to better generalise. In deep learning, this is usually achieved by sharing a common neural network architecture and jointly training the weights. However, the joint training of weighting parameters on multiple related tasks may lead to performance degradation, known as negative transfer. 
To address this issue, this work proposes an evolutionary multi-tasking neural architecture search (EMT-NAS) algorithm to accelerate the search process by transferring architectural knowledge across multiple related tasks. In EMT-NAS, unlike the traditional MTL, the model for each task has a personalised network architecture and its own weights, thus offering the capability of effectively alleviating negative transfer. A fitness re-evaluation method is suggested to alleviate fluctuations in performance evaluations resulting from parameter sharing and the mini-batch gradient descent training method, thereby avoiding losing promising solutions during the search process. To rigorously verify the performance of EMT-NAS, the classification tasks used in the empirical assessments are derived from different datasets, including CIFAR-10, CIFAR-100, and four MedMNIST datasets. Extensive comparative experiments on different numbers of tasks demonstrate that EMT-NAS takes 8% less time on CIFAR and up to 40% less time on MedMNIST to find competitive neural architectures than its single-task counterparts. + + + + Learning Joint Latent Space EBM Prior Model for Multi-Layer Generator + http://openaccess.thecvf.com//content/CVPR2023/papers/Cui_Learning_Joint_Latent_Space_EBM_Prior_Model_for_Multi-Layer_Generator_CVPR_2023_paper.pdf + This paper studies the fundamental problem of learning multi-layer generator models. The multi-layer generator model builds multiple layers of latent variables as a prior model on top of the generator, which benefits learning complex data distribution and hierarchical representations. However, such a prior model usually focuses on modeling inter-layer relations between latent variables by assuming non-informative (conditional) Gaussian distributions, which can be limited in model expressivity. To tackle this issue and learn more expressive prior models, we propose an energy-based model (EBM) on the joint latent space over all layers of latent variables with the multi-layer generator as its backbone. Such joint latent space EBM prior model captures the intra-layer contextual relations at each layer through layer-wise energy terms, and latent variables across different layers are jointly corrected. We develop a joint training scheme via maximum likelihood estimation (MLE), which involves Markov Chain Monte Carlo (MCMC) sampling for both prior and posterior distributions of the latent variables from different layers. To ensure efficient inference and learning, we further propose a variational training scheme where an inference model is used to amortize the costly posterior MCMC sampling. Our experiments demonstrate that the learned model can be expressive in generating high-quality images and capturing hierarchical features for better outlier detection. + + + + Benchmarking Robustness of 3D Object Detection to Common Corruptions + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_Benchmarking_Robustness_of_3D_Object_Detection_to_Common_Corruptions_CVPR_2023_paper.pdf + 3D object detection is an important task in autonomous driving to perceive the surroundings. Despite the excellent performance, the existing 3D detectors lack the robustness to real-world corruptions caused by adverse weather, sensor noises, etc., provoking concerns about the safety and reliability of autonomous driving systems. 
To comprehensively and rigorously benchmark the corruption robustness of 3D detectors, in this paper we design 27 types of common corruptions for both LiDAR and camera inputs considering real-world driving scenarios. By synthesizing these corruptions on public datasets, we establish three corruption robustness benchmarks---KITTI-C, nuScenes-C, and Waymo-C. Then, we conduct large-scale experiments on 24 diverse 3D object detection models to evaluate their corruption robustness. Based on the evaluation results, we draw several important findings, including: 1) motion-level corruptions are the most threatening ones that lead to significant performance drop of all models; 2) LiDAR-camera fusion models demonstrate better robustness; 3) camera-only models are extremely vulnerable to image corruptions, showing the indispensability of LiDAR point clouds. We release the benchmarks and codes at https://github.com/thu-ml/3D_Corruptions_AD to be helpful for future studies. + + + + STMT: A Spatial-Temporal Mesh Transformer for MoCap-Based Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_STMT_A_Spatial-Temporal_Mesh_Transformer_for_MoCap-Based_Action_Recognition_CVPR_2023_paper.pdf + We study the problem of human action recognition using motion capture (MoCap) sequences. Unlike existing techniques that take multiple manual steps to derive standardized skeleton representations as model input, we propose a novel Spatial-Temporal Mesh Transformer (STMT) to directly model the mesh sequences. The model uses a hierarchical transformer with intra-frame off-set attention and inter-frame self-attention. The attention mechanism allows the model to freely attend between any two vertex patches to learn non-local relationships in the spatial-temporal domain. Masked vertex modeling and future frame prediction are used as two self-supervised tasks to fully activate the bi-directional and auto-regressive attention in our hierarchical transformer. The proposed method achieves state-of-the-art performance compared to skeleton-based and point-cloud-based models on common MoCap benchmarks. Code is available at https://github.com/zgzxy001/STMT. + + + + High-Fidelity Generalized Emotional Talking Face Generation With Multi-Modal Emotion Space Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_High-Fidelity_Generalized_Emotional_Talking_Face_Generation_With_Multi-Modal_Emotion_Space_CVPR_2023_paper.pdf + Recently, emotional talking face generation has received considerable attention. However, existing methods only adopt one-hot coding, image, or audio as emotion conditions, thus lacking flexible control in practical applications and failing to handle unseen emotion styles due to limited semantics. They either ignore the one-shot setting or the quality of generated faces. In this paper, we propose a more flexible and generalized framework. Specifically, we supplement the emotion style in text prompts and use an Aligned Multi-modal Emotion encoder to embed the text, image, and audio emotion modality into a unified space, which inherits rich semantic prior from CLIP. Consequently, effective multi-modal emotion space learning helps our method support arbitrary emotion modality during testing and could generalize to unseen emotion styles. Besides, an Emotion-aware Audio-to-3DMM Convertor is proposed to connect the emotion condition and the audio sequence to structural representation. 
A subsequent style-based High-fidelity Emotional Face generator is designed to generate arbitrary high-resolution realistic identities. Our texture generator hierarchically learns flow fields and animated faces in a residual manner. Extensive experiments demonstrate the flexibility and generalization of our method in emotion control and the effectiveness of high-quality face synthesis. + + + + Teaching Matters: Investigating the Role of Supervision in Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Walmer_Teaching_Matters_Investigating_the_Role_of_Supervision_in_Vision_Transformers_CVPR_2023_paper.pdf + Vision Transformers (ViTs) have gained significant popularity in recent years and have proliferated into many applications. However, their behavior under different learning paradigms is not well explored. We compare ViTs trained through different methods of supervision, and show that they learn a diverse range of behaviors in terms of their attention, representations, and downstream performance. We also discover ViT behaviors that are consistent across supervision, including the emergence of Offset Local Attention Heads. These are self-attention heads that attend to a token adjacent to the current token with a fixed directional offset, a phenomenon that to the best of our knowledge has not been highlighted in any prior work. Our analysis shows that ViTs are highly flexible and learn to process local and global information in different orders depending on their training method. We find that contrastive self-supervised methods learn features that are competitive with explicitly supervised features, and they can even be superior for part-level tasks. We also find that the representations of reconstruction-based models show non-trivial similarity to contrastive self-supervised models. + + + + Imagic: Text-Based Real Image Editing With Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Kawar_Imagic_Text-Based_Real_Image_Editing_With_Diffusion_Models_CVPR_2023_paper.pdf + Text-conditioned image editing has recently attracted considerable interest. However, most methods are currently limited to one of the following: specific editing types (e.g., object overlay, style transfer), synthetically generated images, or requiring multiple input images of a common object. In this paper we demonstrate, for the very first time, the ability to apply complex (e.g., non-rigid) text-based semantic edits to a single real image. For example, we can change the posture and composition of one or multiple objects inside an image, while preserving its original characteristics. Our method can make a standing dog sit down, cause a bird to spread its wings, etc. -- each within its single high-resolution user-provided natural image. Contrary to previous work, our proposed method requires only a single input image and a target text (the desired edit). It operates on real images, and does not require any additional inputs (such as image masks or additional views of the object). Our method, called Imagic, leverages a pre-trained text-to-image diffusion model for this task. It produces a text embedding that aligns with both the input image and the target text, while fine-tuning the diffusion model to capture the image-specific appearance. We demonstrate the quality and versatility of Imagic on numerous inputs from various domains, showcasing a plethora of high quality complex semantic image edits, all within a single unified framework. 
To better assess performance, we introduce TEdBench, a highly challenging image editing benchmark. We conduct a user study, whose findings show that human raters prefer Imagic to previous leading editing methods on TEdBench. + + + + LightPainter: Interactive Portrait Relighting With Freehand Scribble + http://openaccess.thecvf.com//content/CVPR2023/papers/Mei_LightPainter_Interactive_Portrait_Relighting_With_Freehand_Scribble_CVPR_2023_paper.pdf + Recent portrait relighting methods have achieved realistic results of portrait lighting effects given a desired lighting representation such as an environment map. However, these methods are not intuitive for user interaction and lack precise lighting control. We introduce LightPainter, a scribble-based relighting system that allows users to interactively manipulate portrait lighting effect with ease. This is achieved by two conditional neural networks, a delighting module that recovers geometry and albedo optionally conditioned on skin tone, and a scribble-based module for relighting. To train the relighting module, we propose a novel scribble simulation procedure to mimic real user scribbles, which allows our pipeline to be trained without any human annotations. We demonstrate high-quality and flexible portrait lighting editing capability with both quantitative and qualitative experiments. User study comparisons with commercial lighting editing tools also demonstrate consistent user preference for our method. + + + + Vision Transformers Are Parameter-Efficient Audio-Visual Learners + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Vision_Transformers_Are_Parameter-Efficient_Audio-Visual_Learners_CVPR_2023_paper.pdf + Vision transformers (ViTs) have achieved impressive results on various computer vision tasks in the last several years. In this work, we study the capability of frozen ViTs, pretrained only on visual data, to generalize to audio-visual data without finetuning any of its original parameters. To do so, we propose a latent audio-visual hybrid (LAVISH) adapter that adapts pretrained ViTs to audio-visual tasks by injecting a small number of trainable parameters into every layer of a frozen ViT. To efficiently fuse visual and audio cues, our LAVISH adapter uses a small set of latent tokens, which form an attention bottleneck, thus, eliminating the quadratic cost of standard cross-attention. Compared to the existing modality-specific audio-visual methods, our approach achieves competitive or even better performance on various audio-visual tasks while using fewer tunable parameters and without relying on costly audio pretraining or external audio encoders. Our code is available at https://genjib.github.io/project_page/LAVISH/ + + + + Training Debiased Subnetworks With Contrastive Weight Pruning + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_Training_Debiased_Subnetworks_With_Contrastive_Weight_Pruning_CVPR_2023_paper.pdf + Neural networks are often biased to spuriously correlated features that provide misleading statistical evidence that does not generalize. This raises an interesting question: "Does an optimal unbiased functional subnetwork exist in a severely biased network? If so, how to extract such subnetwork?" While empirical evidence has been accumulated about the existence of such unbiased subnetworks, these observations are mainly based on the guidance of ground-truth unbiased samples. Thus, it is unexplored how to discover the optimal subnetworks with biased training datasets in practice. 
To address this, here we first present our theoretical insight that alerts potential limitations of existing algorithms in exploring unbiased subnetworks in the presence of strong spurious correlations. We then further elucidate the importance of bias-conflicting samples on structure learning. Motivated by these observations, we propose a Debiased Contrastive Weight Pruning (DCWP) algorithm, which probes unbiased subnetworks without expensive group annotations. Experimental results demonstrate that our approach significantly outperforms state-of-the-art debiasing methods despite its considerable reduction in the number of parameters. + + + + SparseViT: Revisiting Activation Sparsity for Efficient High-Resolution Vision Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_SparseViT_Revisiting_Activation_Sparsity_for_Efficient_High-Resolution_Vision_Transformer_CVPR_2023_paper.pdf + High-resolution images enable neural networks to learn richer visual representations. However, this improved performance comes at the cost of growing computational complexity, hindering their usage in latency-sensitive applications. As not all pixels are equal, skipping computations for less-important regions offers a simple and effective measure to reduce the computation. This, however, is hard to be translated into actual speedup for CNNs since it breaks the regularity of the dense convolution workload. In this paper, we introduce SparseViT that revisits activation sparsity for recent window-based vision transformers (ViTs). As window attentions are naturally batched over blocks, actual speedup with window activation pruning becomes possible: i.e., 50% latency reduction with 60% sparsity. Different layers should be assigned with different pruning ratios due to their diverse sensitivities and computational costs. We introduce sparsity-aware adaptation and apply the evolutionary search to efficiently find the optimal layerwise sparsity configuration within the vast search space. SparseViT achieves speedups of 1.5x, 1.4x, and 1.3x compared to its dense counterpart in monocular 3D object detection, 2D instance segmentation, and 2D semantic segmentation, respectively, with negligible to no loss of accuracy. + + + + Multispectral Video Semantic Segmentation: A Benchmark Dataset and Baseline + http://openaccess.thecvf.com//content/CVPR2023/papers/Ji_Multispectral_Video_Semantic_Segmentation_A_Benchmark_Dataset_and_Baseline_CVPR_2023_paper.pdf + Robust and reliable semantic segmentation in complex scenes is crucial for many real-life applications such as autonomous safe driving and nighttime rescue. In most approaches, it is typical to make use of RGB images as input. They however work well only in preferred weather conditions; when facing adverse conditions such as rainy, overexposure, or low-light, they often fail to deliver satisfactory results. This has led to the recent investigation into multispectral semantic segmentation, where RGB and thermal infrared (RGBT) images are both utilized as input. This gives rise to significantly more robust segmentation of image objects in complex scenes and under adverse conditions. Nevertheless, the present focus in single RGBT image input restricts existing methods from well addressing dynamic real-world scenes. Motivated by the above observations, in this paper, we set out to address a relatively new task of semantic segmentation of multispectral video input, which we refer to as Multispectral Video Semantic Segmentation, or MVSS in short. 
An in-house MVSeg dataset is thus curated, consisting of 738 calibrated RGB and thermal videos, accompanied by 3,545 fine-grained pixel-level semantic annotations of 26 categories. Our dataset contains a wide range of challenging urban scenes in both daytime and nighttime. Moreover, we propose an effective MVSS baseline, dubbed MVNet, which is to our knowledge the first model to jointly learn semantic representations from multispectral and temporal contexts. Comprehensive experiments are conducted using various semantic segmentation models on the MVSeg dataset. Empirically, the engagement of multispectral video input is shown to lead to significant improvement in semantic segmentation; the effectiveness of our MVNet baseline has also been verified. + + + + Reducing the Label Bias for Timestamp Supervised Temporal Action Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Reducing_the_Label_Bias_for_Timestamp_Supervised_Temporal_Action_Segmentation_CVPR_2023_paper.pdf + Timestamp supervised temporal action segmentation (TSTAS) is more cost-effective than fully supervised counterparts. However, previous approaches suffer from severe label bias due to over-reliance on sparse timestamp annotations, resulting in unsatisfactory performance. In this paper, we propose the Debiasing-TSTAS (D-TSTAS) framework by exploiting unannotated frames to alleviate this bias from two phases: 1) Initialization. To reduce the dependencies on annotated frames, we propose masked timestamp predictions (MTP) to ensure that initialized model captures more contextual information. 2) Refinement. To overcome the limitation of the expressiveness from sparsely annotated timestamps, we propose a center-oriented timestamp expansion (CTE) approach to progressively expand pseudo-timestamp groups which contain semantic-rich motion representation of action segments. Then, these pseudo-timestamp groups and the model output are used to iteratively generate pseudo-labels for refining the model in a fully supervised setup. We further introduce segmental confidence loss to enable the model to have high confidence predictions within the pseudo-timestamp groups and more accurate action boundaries. Our D-TSTAS outperforms the state-of-the-art TSTAS method as well as achieves competitive results compared with fully supervised approaches on three benchmark datasets. + + + + A Meta-Learning Approach to Predicting Performance and Data Requirements + http://openaccess.thecvf.com//content/CVPR2023/papers/Jain_A_Meta-Learning_Approach_to_Predicting_Performance_and_Data_Requirements_CVPR_2023_paper.pdf + We propose an approach to estimate the number of samples required for a model to reach a target performance. We find that the power law, the de facto principle to estimate model performance, leads to large error when using a small dataset (e.g., 5 samples per class) for extrapolation. This is because the log-performance error against the log-dataset size follows a nonlinear progression in the few-shot regime followed by a linear progression in the high-shot regime. We introduce a novel piecewise power law (PPL) that handles the two data regimes differently. To estimate the parameters of the PPL, we introduce a random forest regressor trained via meta learning that generalizes across classification/detection tasks, ResNet/ViT based architectures, and random/pre-trained initializations. 
The PPL improves the performance estimation on average by 37% across 16 classification datasets and 33% across 10 detection datasets, compared to the power law. We further extend the PPL to provide a confidence bound and use it to limit the prediction horizon that reduces over-estimation of data by 76% on classification and 91% on detection datasets. + + + + Deep Curvilinear Editing: Commutative and Nonlinear Image Manipulation for Pretrained Deep Generative Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Aoshima_Deep_Curvilinear_Editing_Commutative_and_Nonlinear_Image_Manipulation_for_Pretrained_CVPR_2023_paper.pdf + Semantic editing of images is the fundamental goal of computer vision. Although deep learning methods, such as generative adversarial networks (GANs), are capable of producing high-quality images, they often do not have an inherent way of editing generated images semantically. Recent studies have investigated a way of manipulating the latent variable to determine the images to be generated. However, methods that assume linear semantic arithmetic have certain limitations in terms of the quality of image editing, whereas methods that discover nonlinear semantic pathways provide non-commutative editing, which is inconsistent when applied in different orders. This study proposes a novel method called deep curvilinear editing (DeCurvEd) to determine semantic commuting vector fields on the latent space. We theoretically demonstrate that owing to commutativity, the editing of multiple attributes depends only on the quantities and not on the order. Furthermore, we experimentally demonstrate that compared to previous methods, the nonlinear and commutative nature of DeCurvEd provides higher-quality editing. + + + + Learning Semantic-Aware Knowledge Guidance for Low-Light Image Enhancement + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Learning_Semantic-Aware_Knowledge_Guidance_for_Low-Light_Image_Enhancement_CVPR_2023_paper.pdf + Low-light image enhancement (LLIE) investigates how to improve illumination and produce normal-light images. The majority of existing methods improve low-light images via a global and uniform manner, without taking into account the semantic information of different regions. Without semantic priors, a network may easily deviate from a region's original color. To address this issue, we propose a novel semantic-aware knowledge-guided framework (SKF) that can assist a low-light enhancement model in learning rich and diverse priors encapsulated in a semantic segmentation model. We concentrate on incorporating semantic knowledge from three key aspects: a semantic-aware embedding module that wisely integrates semantic priors in feature representation space, a semantic-guided color histogram loss that preserves color consistency of various instances, and a semantic-guided adversarial loss that produces more natural textures by semantic priors. Our SKF is appealing in acting as a general framework in LLIE task. Extensive experiments show that models equipped with the SKF significantly outperform the baselines on multiple datasets and our SKF generalizes to different models and scenes well. The code is available at Semantic-Aware-Low-Light-Image-Enhancement. 
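To make the semantic-guided color histogram loss from the SKF entry above concrete, here is a small sketch that compares differentiable per-class color histograms of an enhanced image and a reference image under a segmentation mask. The bin count, Gaussian kernel width, and per-channel L1 comparison are illustrative assumptions, not the paper's exact formulation.

```python
# Rough sketch of a semantic-guided color histogram loss; hyperparameters are assumptions.
import torch

def soft_histogram(x, bins=32, sigma=0.02):
    """x: (N,) values in [0, 1]; returns a differentiable, normalized histogram."""
    centers = torch.linspace(0.0, 1.0, bins, device=x.device)
    weights = torch.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * sigma ** 2))
    hist = weights.sum(dim=0)
    return hist / (hist.sum() + 1e-8)

def semantic_color_hist_loss(enhanced, reference, seg_mask, num_classes):
    """enhanced, reference: (3, H, W) images in [0, 1]; seg_mask: (H, W) integer class ids."""
    loss = 0.0
    for c in range(num_classes):
        region = seg_mask == c
        if region.sum() == 0:        # skip classes absent from this image
            continue
        for ch in range(3):          # per-channel histogram within the class region
            h_e = soft_histogram(enhanced[ch][region])
            h_r = soft_histogram(reference[ch][region])
            loss = loss + (h_e - h_r).abs().mean()
    return loss
```

Restricting each histogram to one semantic region is what keeps dissimilar regions, such as sky and skin, from being pushed toward the same global color statistics.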
+ + + + Deep Arbitrary-Scale Image Super-Resolution via Scale-Equivariance Pursuit + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Deep_Arbitrary-Scale_Image_Super-Resolution_via_Scale-Equivariance_Pursuit_CVPR_2023_paper.pdf + The ability of scale-equivariance processing blocks plays a central role in arbitrary-scale image super-resolution tasks. Inspired by this crucial observation, this work proposes two novel scale-equivariant modules within a transformer-style framework to enhance arbitrary-scale image super-resolution (ASISR) performance, especially in high upsampling rate image extrapolation. In the feature extraction phase, we design a plug-in module called Adaptive Feature Extractor, which injects explicit scale information in frequency-expanded encoding, thus achieving scale-adaption in representation learning. In the upsampling phase, a learnable Neural Kriging upsampling operator is introduced, which simultaneously encodes both relative distance (i.e., scale-aware) information as well as feature similarity (i.e., with priori learned from training data) in a bilateral manner, providing scale-encoded spatial feature fusion. The above operators are easily plugged into multiple stages of a SR network, and a recent emerging pre-training strategy is also adopted to impulse the model's performance further. Extensive experimental results have demonstrated the outstanding scale-equivariance capability offered by the proposed operators and our learning framework, with much better results than previous SOTAs at arbitrary scales for SR. Our code is available at https://github.com/neuralchen/EQSR. + + + + OmniAL: A Unified CNN Framework for Unsupervised Anomaly Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_OmniAL_A_Unified_CNN_Framework_for_Unsupervised_Anomaly_Localization_CVPR_2023_paper.pdf + Unsupervised anomaly localization and detection is crucial for industrial manufacturing processes due to the lack of anomalous samples. Recent unsupervised advances on industrial anomaly detection achieve high performance by training separate models for many different categories. The model storage and training time cost of this paradigm is high. Moreover, the setting of one-model-N-classes leads to fearful degradation of existing methods. In this paper, we propose a unified CNN framework for unsupervised anomaly localization, named OmniAL. This method conquers aforementioned problems by improving anomaly synthesis, reconstruction and localization. To prevent the model learning identical reconstruction, it trains the model with proposed panel-guided synthetic anomaly data rather than directly using normal data. It increases anomaly reconstruction error for multi-class distribution by using a network that is equipped with proposed Dilated Channel and Spatial Attention (DCSA) blocks. To better localize the anomaly regions, it employs proposed DiffNeck between reconstruction and localization sub-networks to explore multi-level differences. Experiments on 15-class MVTecAD and 12-class VisA datasets verify the advantage of proposed OmniAL that surpasses the state-of-the-art of unified models. On 15-class-MVTecAD/12-class-VisA, its single unified model achieves 97.2/87.8 image-AUROC, 98.3/96.6 pixel-AUROC and 73.4/41.7 pixel-AP for anomaly detection and localization respectively. 
Besides that, we make the first attempt to conduct a comprehensive study on the robustness of unsupervised anomaly localization and detection methods against different levels of adversarial attacks. Experimental results show OmniAL has good application prospects for its superior performance. + + + + Canonical Fields: Self-Supervised Learning of Pose-Canonicalized Neural Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Agaram_Canonical_Fields_Self-Supervised_Learning_of_Pose-Canonicalized_Neural_Fields_CVPR_2023_paper.pdf + Coordinate-based implicit neural networks, or neural fields, have emerged as useful representations of shape and appearance in 3D computer vision. Despite advances, however, it remains challenging to build neural fields for categories of objects without datasets like ShapeNet that provide "canonicalized" object instances that are consistently aligned for their 3D position and orientation (pose). We present Canonical Field Network (CaFi-Net), a self-supervised method to canonicalize the 3D pose of instances from an object category represented as neural fields, specifically neural radiance fields (NeRFs). CaFi-Net directly learns from continuous and noisy radiance fields using a Siamese network architecture that is designed to extract equivariant field features for category-level canonicalization. During inference, our method takes pre-trained neural radiance fields of novel object instances at arbitrary 3D pose, and estimates a canonical field with consistent 3D pose across the entire category. Extensive experiments on a new dataset of 1300 NeRF models across 13 object categories show that our method matches or exceeds the performance of 3D point cloud-based methods. + + + + BiasAdv: Bias-Adversarial Augmentation for Model Debiasing + http://openaccess.thecvf.com//content/CVPR2023/papers/Lim_BiasAdv_Bias-Adversarial_Augmentation_for_Model_Debiasing_CVPR_2023_paper.pdf + Neural networks are often prone to bias toward spurious correlations inherent in a dataset, thus failing to generalize to unbiased test criteria. A key challenge to resolving the issue is the significant lack of bias-conflicting training data (i.e., samples without spurious correlations). In this paper, we propose a novel data augmentation approach termed Bias-Adversarial augmentation (BiasAdv) that supplements bias-conflicting samples with adversarial images. Our key idea is that an adversarial attack on a biased model that makes decisions based on spurious correlations may generate synthetic bias-conflicting samples, which can then be used as augmented training data for learning a debiased model. Specifically, we formulate an optimization problem for generating adversarial images that attack the predictions of an auxiliary biased model without ruining the predictions of the desired debiased model. Despite its simplicity, we find that BiasAdv can generate surprisingly useful synthetic bias-conflicting samples, allowing the debiased model to learn generalizable representations. Furthermore, BiasAdv does not require any bias annotations or prior knowledge of the bias type, which enables its broad applicability to existing debiasing methods to improve their performances. Our extensive experimental results demonstrate the superiority of BiasAdv, achieving state-of-the-art performance on four popular benchmark datasets across various bias domains. 
+ + + + CDDFuse: Correlation-Driven Dual-Branch Feature Decomposition for Multi-Modality Image Fusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_CDDFuse_Correlation-Driven_Dual-Branch_Feature_Decomposition_for_Multi-Modality_Image_Fusion_CVPR_2023_paper.pdf + Multi-modality (MM) image fusion aims to render fused images that maintain the merits of different modalities, e.g., functional highlight and detailed textures. To tackle the challenge in modeling cross-modality features and decomposing desirable modality-specific and modality-shared features, we propose a novel Correlation-Driven feature Decomposition Fusion (CDDFuse) network. Firstly, CDDFuse uses Restormer blocks to extract cross-modality shallow features. We then introduce a dual-branch Transformer-CNN feature extractor with Lite Transformer (LT) blocks leveraging long-range attention to handle low-frequency global features and Invertible Neural Networks (INN) blocks focusing on extracting high-frequency local information. A correlation-driven loss is further proposed to make the low-frequency features correlated while the high-frequency features uncorrelated based on the embedded information. Then, the LT-based global fusion and INN-based local fusion layers output the fused image. Extensive experiments demonstrate that our CDDFuse achieves promising results in multiple fusion tasks, including infrared-visible image fusion and medical image fusion. We also show that CDDFuse can boost the performance in downstream infrared-visible semantic segmentation and object detection in a unified benchmark. The code is available at https://github.com/Zhaozixiang1228/MMIF-CDDFuse. + + + + Cross-Modal Implicit Relation Reasoning and Aligning for Text-to-Image Person Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Cross-Modal_Implicit_Relation_Reasoning_and_Aligning_for_Text-to-Image_Person_Retrieval_CVPR_2023_paper.pdf + Text-to-image person retrieval aims to identify the target person based on a given textual description query. The primary challenge is to learn the mapping of visual and textual modalities into a common latent space. Prior works have attempted to address this challenge by leveraging separately pre-trained unimodal models to extract visual and textual features. However, these approaches lack the necessary underlying alignment capabilities required to match multimodal data effectively. Besides, these works use prior information to explore explicit part alignments, which may lead to the distortion of intra-modality information. To alleviate these issues, we present IRRA: a cross-modal Implicit Relation Reasoning and Aligning framework that learns relations between local visual-textual tokens and enhances global image-text matching without requiring additional prior supervision. Specifically, we first design an Implicit Relation Reasoning module in a masked language modeling paradigm. This achieves cross-modal interaction by integrating the visual cues into the textual tokens with a cross-modal multimodal interaction encoder. Secondly, to globally align the visual and textual embeddings, Similarity Distribution Matching is proposed to minimize the KL divergence between image-text similarity distributions and the normalized label matching distributions. The proposed method achieves new state-of-the-art results on all three public datasets, with a notable margin of about 3%-9% for Rank-1 accuracy compared to prior methods. 
+ + + + Learning To Retain While Acquiring: Combating Distribution-Shift in Adversarial Data-Free Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Patel_Learning_To_Retain_While_Acquiring_Combating_Distribution-Shift_in_Adversarial_Data-Free_CVPR_2023_paper.pdf + Data-free Knowledge Distillation (DFKD) has gained popularity recently, with the fundamental idea of carrying out knowledge transfer from a Teacher neural network to a Student neural network in the absence of training data. However, in the Adversarial DFKD framework, the student network's accuracy, suffers due to the non-stationary distribution of the pseudo-samples under multiple generator updates. To this end, at every generator update, we aim to maintain the student's performance on previously encountered examples while acquiring knowledge from samples of the current distribution. Thus, we propose a meta-learning inspired framework by treating the task of Knowledge-Acquisition (learning from newly generated samples) and Knowledge-Retention (retaining knowledge on previously met samples) as meta-train and meta-test, respectively. Hence, we dub our method as Learning to Retain while Acquiring. Moreover, we identify an implicit aligning factor between the Knowledge-Retention and Knowledge-Acquisition tasks indicating that the proposed student update strategy enforces a common gradient direction for both tasks, alleviating interference between the two objectives. Finally, we support our hypothesis by exhibiting extensive evaluation and comparison of our method with prior arts on multiple datasets. + + + + Good Is Bad: Causality Inspired Cloth-Debiasing for Cloth-Changing Person Re-Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Good_Is_Bad_Causality_Inspired_Cloth-Debiasing_for_Cloth-Changing_Person_Re-Identification_CVPR_2023_paper.pdf + Entangled representation of clothing and identity (ID)-intrinsic clues are potentially concomitant in conventional person Re-IDentification (ReID). Nevertheless, eliminating the negative impact of clothing on ID remains challenging due to the lack of theory and the difficulty of isolating the exact implications. In this paper, a causality-based Auto-Intervention Model, referred to as AIM, is first proposed to mitigate clothing bias for robust cloth-changing person ReID (CC-ReID). Specifically, we analyze the effect of clothing on the model inference and adopt a dual-branch model to simulate causal intervention. Progressively, clothing bias is eliminated automatically with model training. AIM is encouraged to learn more discriminative ID clues that are free from clothing bias. Extensive experiments on two standard CC-ReID datasets demonstrate the superiority of the proposed AIM over other state-of-the-art methods. + + + + Use Your Head: Improving Long-Tail Video Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Perrett_Use_Your_Head_Improving_Long-Tail_Video_Recognition_CVPR_2023_paper.pdf + This paper presents an investigation into long-tail video recognition. We demonstrate that, unlike naturally-collected video datasets and existing long-tail image benchmarks, current video benchmarks fall short on multiple long-tailed properties. Most critically, they lack few-shot classes in their tails. In response, we propose new video benchmarks that better assess long-tail recognition, by sampling subsets from two datasets: SSv2 and VideoLT. 
We then propose a method, Long-Tail Mixed Reconstruction (LMR), which reduces overfitting to instances from few-shot classes by reconstructing them as weighted combinations of samples from head classes. LMR then employs label mixing to learn robust decision boundaries. It achieves state-of-the-art average class accuracy on EPIC-KITCHENS and the proposed SSv2-LT and VideoLT-LT. Benchmarks and code at: github.com/tobyperrett/lmr + + + + Revisiting the P3P Problem + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_Revisiting_the_P3P_Problem_CVPR_2023_paper.pdf + One of the classical multi-view geometry problems is the so-called P3P problem, where the absolute pose of a calibrated camera is determined from three 2D-to-3D correspondences. Since these solvers form a critical component of many vision systems (e.g. in localization and Structure-from-Motion), there has been significant effort in developing faster and more stable algorithms. While the current state-of-the-art solvers are both extremely fast and stable, there still exist configurations where they break down. In this paper we algebraically formulate the problem as finding the intersection of two conics. With this formulation we are able to analytically characterize the real roots of the polynomial system and employ a tailored solution strategy for each problem instance. The result is a fast and completely stable solver that is able to correctly solve cases where competing methods fail. Our experimental evaluation shows that we outperform the current state-of-the-art methods both in terms of speed and success rate. + + + + TimeBalance: Temporally-Invariant and Temporally-Distinctive Video Representations for Semi-Supervised Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Dave_TimeBalance_Temporally-Invariant_and_Temporally-Distinctive_Video_Representations_for_Semi-Supervised_Action_Recognition_CVPR_2023_paper.pdf + Semi-Supervised Learning can be more beneficial for the video domain compared to images because of its higher annotation cost and dimensionality. Besides, any video understanding task requires reasoning over both spatial and temporal dimensions. In order to learn both the static and motion related features for the semi-supervised action recognition task, existing methods rely on hard input inductive biases like using two modalities (RGB and Optical-flow) or two streams with different playback rates. Instead of utilizing unlabeled videos through diverse input streams, we rely on self-supervised video representations; particularly, we utilize temporally-invariant and temporally-distinctive representations. We observe that these representations complement each other depending on the nature of the action. Based on this observation, we propose a student-teacher semi-supervised learning framework, TimeBalance, where we distill the knowledge from a temporally-invariant and a temporally-distinctive teacher. Depending on the nature of the unlabeled video, we dynamically combine the knowledge of these two teachers based on a novel temporal similarity-based reweighting scheme. Our method achieves state-of-the-art performance on three action recognition benchmarks: UCF101, HMDB51, and Kinetics400. Code: https://github.com/DAVEISHAN/TimeBalance. 
+ + + + Generating Aligned Pseudo-Supervision From Non-Aligned Data for Image Restoration in Under-Display Camera + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Generating_Aligned_Pseudo-Supervision_From_Non-Aligned_Data_for_Image_Restoration_in_CVPR_2023_paper.pdf + Due to the difficulty in collecting large-scale and perfectly aligned paired training data for Under-Display Camera (UDC) image restoration, previous methods resort to monitor-based image systems or simulation-based methods, sacrificing the realness of the data and introducing domain gaps. In this work, we revisit the classic stereo setup for training data collection -- capturing two images of the same scene with one UDC and one standard camera. The key idea is to "copy" details from a high-quality reference image and "paste" them on the UDC image. While being able to generate real training pairs, this setting is susceptible to spatial misalignment due to perspective and depth of field changes. The problem is further compounded by the large domain discrepancy between the UDC and normal images, which is unique to UDC restoration. In this paper, we mitigate the non-trivial domain discrepancy and spatial misalignment through a novel Transformer-based framework that generates well-aligned yet high-quality target data for the corresponding UDC input. This is made possible through two carefully designed components, namely, the Domain Alignment Module (DAM) and Geometric Alignment Module (GAM), which encourage robust and accurate discovery of correspondence between the UDC and normal views. Extensive experiments show that high-quality and well-aligned pseudo UDC training pairs are beneficial for training a robust restoration network. Code and the dataset are available at https://github.com/jnjaby/AlignFormer. + + + + Neural Pixel Composition for 3D-4D View Synthesis From Multi-Views + http://openaccess.thecvf.com//content/CVPR2023/papers/Bansal_Neural_Pixel_Composition_for_3D-4D_View_Synthesis_From_Multi-Views_CVPR_2023_paper.pdf + We present Neural Pixel Composition (NPC), a novel approach for continuous 3D-4D view synthesis given only a discrete set of multi-view observations as input. Existing state-of-the-art approaches require dense multi-view supervision and an extensive computational budget. The proposed formulation reliably operates on sparse and wide-baseline multi-view imagery and can be trained efficiently within a few seconds to 10 minutes for hi-res (12MP) content, i.e., 200-400X faster convergence than existing methods. Crucial to our approach are two core novelties: 1) a representation of a pixel that contains color and depth information accumulated from multi-views for a particular location and time along a line of sight, and 2) a multi-layer perceptron (MLP) that enables the composition of this rich information provided for a pixel location to obtain the final color output. We experiment with a large variety of multi-view sequences, compare to existing approaches, and achieve better results in diverse and challenging settings. + + + + CRAFT: Concept Recursive Activation FacTorization for Explainability + http://openaccess.thecvf.com//content/CVPR2023/papers/Fel_CRAFT_Concept_Recursive_Activation_FacTorization_for_Explainability_CVPR_2023_paper.pdf + Attribution methods are a popular class of explainability methods that use heatmaps to depict the most important areas of an image that drive a model decision. 
Nevertheless, recent work has shown that these methods have limited utility in practice, presumably because they only highlight the most salient parts of an image (i.e., "where" the model looked) and do not communicate any information about "what" the model saw at those locations. In this work, we try to fill in this gap with Craft -- a novel approach to identify both "what" and "where" by generating concept-based explanations. We introduce 3 new ingredients to the automatic concept extraction literature: (i) a recursive strategy to detect and decompose concepts across layers, (ii) a novel method for a more faithful estimation of concept importance using Sobol indices, and (iii) the use of implicit differentiation to unlock Concept Attribution Maps. We conduct both human and computer vision experiments to demonstrate the benefits of the proposed approach. We show that our recursive decomposition generates meaningful and accurate concepts and that the proposed concept importance estimation technique is more faithful to the model than previous methods. When evaluating the usefulness of the method for human experimenters on the utility benchmark, we find that our approach significantly improves on two of the three test scenarios (while none of the current methods including ours help on the third). Overall, our study suggests that, while much work remains toward the development of general explainability methods that are useful in practical scenarios, the identification of meaningful concepts at the proper level of granularity yields useful and complementary information beyond that afforded by attribution methods. + + + + Recognizing Rigid Patterns of Unlabeled Point Clouds by Complete and Continuous Isometry Invariants With No False Negatives and No False Positives + http://openaccess.thecvf.com//content/CVPR2023/papers/Widdowson_Recognizing_Rigid_Patterns_of_Unlabeled_Point_Clouds_by_Complete_and_CVPR_2023_paper.pdf + Rigid structures such as cars or any other solid objects are often represented by finite clouds of unlabeled points. The most natural equivalence on these point clouds is rigid motion or isometry maintaining all inter-point distances. Rigid patterns of point clouds can be reliably compared only by complete isometry invariants that can also be called equivariant descriptors without false negatives (isometric clouds having different descriptions) and without false positives (non-isometric clouds with the same description). Noise and motion in data motivate a search for invariants that are continuous under perturbations of points in a suitable metric. We propose the first continuous and complete invariant of unlabeled clouds in any Euclidean space. For a fixed dimension, the new metric for this invariant is computable in a polynomial time in the number of points. + + + + N-Gram in Swin Transformers for Efficient Lightweight Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Choi_N-Gram_in_Swin_Transformers_for_Efficient_Lightweight_Image_Super-Resolution_CVPR_2023_paper.pdf + While some studies have proven that Swin Transformer (Swin) with window self-attention (WSA) is suitable for single image super-resolution (SR), the plain WSA ignores the broad regions when reconstructing high-resolution images due to a limited receptive field. In addition, many deep learning SR methods suffer from intensive computations. To address these problems, we introduce the N-Gram context to the low-level vision with Transformers for the first time. 
We define N-Gram as neighboring local windows in Swin, which differs from text analysis that views N-Gram as consecutive characters or words. N-Grams interact with each other by sliding-WSA, expanding the regions seen to restore degraded pixels. Using the N-Gram context, we propose NGswin, an efficient SR network with SCDP bottleneck taking multi-scale outputs of the hierarchical encoder. Experimental results show that NGswin achieves competitive performance while maintaining an efficient structure when compared with previous leading methods. Moreover, we also improve other Swin-based SR methods with the N-Gram context, thereby building an enhanced model: SwinIR-NG. Our improved SwinIR-NG outperforms the current best lightweight SR approaches and establishes state-of-the-art results. Codes are available at https://github.com/rami0205/NGramSwin. + + + + Hybrid Neural Rendering for Large-Scale Scenes With Motion Blur + http://openaccess.thecvf.com//content/CVPR2023/papers/Dai_Hybrid_Neural_Rendering_for_Large-Scale_Scenes_With_Motion_Blur_CVPR_2023_paper.pdf + Rendering novel view images is highly desirable for many applications. Despite recent progress, it remains challenging to render high-fidelity and view-consistent novel views of large-scale scenes from in-the-wild images with inevitable artifacts (e.g., motion blur). To this end, we develop a hybrid neural rendering model that makes image-based representation and neural 3D representation join forces to render high-quality, view-consistent images. Besides, images captured in the wild inevitably contain artifacts, such as motion blur, which deteriorates the quality of rendered images. Accordingly, we propose strategies to simulate blur effects on the rendered images to mitigate the negative influence of blurriness images and reduce their importance during training based on precomputed quality-aware weights. Extensive experiments on real and synthetic data demonstrate our model surpasses state-of-the-art point-based methods for novel view synthesis. The code is available at https://daipengwa.github.io/Hybrid-Rendering-ProjectPage. + + + + Perception-Oriented Single Image Super-Resolution Using Optimal Objective Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_Perception-Oriented_Single_Image_Super-Resolution_Using_Optimal_Objective_Estimation_CVPR_2023_paper.pdf + Single-image super-resolution (SISR) networks trained with perceptual and adversarial losses provide high-contrast outputs compared to those of networks trained with distortion-oriented losses, such as L1 or L2. However, it has been shown that using a single perceptual loss is insufficient for accurately restoring locally varying diverse shapes in images, often generating undesirable artifacts or unnatural details. For this reason, combinations of various losses, such as perceptual, adversarial, and distortion losses, have been attempted, yet it remains challenging to find optimal combinations. Hence, in this paper, we propose a new SISR framework that applies optimal objectives for each region to generate plausible results in overall areas of high-resolution outputs. Specifically, the framework comprises two models: a predictive model that infers an optimal objective map for a given low-resolution (LR) input and a generative model that applies a target objective map to produce the corresponding SR output. 
The generative model is trained over our proposed objective trajectory representing a set of essential objectives, which enables the single network to learn various SR results corresponding to combined losses on the trajectory. The predictive model is trained using pairs of LR images and corresponding optimal objective maps searched from the objective trajectory. Experimental results on five benchmarks show that the proposed method outperforms state-of-the-art perception-driven SR methods in LPIPS, DISTS, PSNR, and SSIM metrics. The visual results also demonstrate the superiority of our method in perception-oriented reconstruction. The code is available at https://github.com/seungho-snu/SROOE. + + + + Learning 3D Scene Priors With 2D Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Nie_Learning_3D_Scene_Priors_With_2D_Supervision_CVPR_2023_paper.pdf + Holistic 3D scene understanding entails estimation of both layout configuration and object geometry in a 3D environment. Recent works have shown advances in 3D scene estimation from various input modalities (e.g., images, 3D scans), by leveraging 3D supervision (e.g., 3D bounding boxes or CAD models), for which collection at scale is expensive and often intractable. To address this shortcoming, we propose a new method to learn 3D scene priors of layout and shape without requiring any 3D ground truth. Instead, we rely on 2D supervision from multi-view RGB images. Our method represents a 3D scene as a latent vector, from which we can progressively decode to a sequence of objects characterized by their class categories, 3D bounding boxes, and meshes. With our trained autoregressive decoder representing the scene prior, our method facilitates many downstream applications, including scene synthesis, interpolation, and single-view reconstruction. Experiments on 3D-FRONT and ScanNet show that our method outperforms state of the art in single-view reconstruction, and achieves state-of-the-art results in scene synthesis against baselines which require for 3D supervision. + + + + Label-Free Liver Tumor Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Label-Free_Liver_Tumor_Segmentation_CVPR_2023_paper.pdf + We demonstrate that AI models can accurately segment liver tumors without the need for manual annotation by using synthetic tumors in CT scans. Our synthetic tumors have two intriguing advantages: (I) realistic in shape and texture, which even medical professionals can confuse with real tumors; (II) effective for training AI models, which can perform liver tumor segmentation similarly to the model trained on real tumors--this result is exciting because no existing work, using synthetic tumors only, has thus far reached a similar or even close performance to real tumors. This result also implies that manual efforts for annotating tumors voxel by voxel (which took years to create) can be significantly reduced in the future. Moreover, our synthetic tumors can automatically generate many examples of small (or even tiny) synthetic tumors and have the potential to improve the success rate of detecting small liver tumors, which is critical for detecting the early stages of cancer. In addition to enriching the training data, our synthesizing strategy also enables us to rigorously assess the AI robustness. 
+ + + + Uncurated Image-Text Datasets: Shedding Light on Demographic Bias + http://openaccess.thecvf.com//content/CVPR2023/papers/Garcia_Uncurated_Image-Text_Datasets_Shedding_Light_on_Demographic_Bias_CVPR_2023_paper.pdf + The increasing tendency to collect large and uncurated datasets to train vision-and-language models has raised concerns about fair representations. It is known that even small but manually annotated datasets, such as MSCOCO, are affected by societal bias. This problem, far from being solved, may be getting worse with data crawled from the Internet without much control. In addition, the lack of tools to analyze societal bias in big collections of images makes addressing the problem extremely challenging. Our first contribution is to annotate part of the Google Conceptual Captions dataset, widely used for training vision-and-language models, with four demographic and two contextual attributes. Our second contribution is to conduct a comprehensive analysis of the annotations, focusing on how different demographic groups are represented. Our last contribution lies in evaluating three prevailing vision-and-language tasks: image captioning, text-image CLIP embeddings, and text-to-image generation, showing that societal bias is a persistent problem in all of them. + + + + Adversarial Robustness via Random Projection Filters + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_Adversarial_Robustness_via_Random_Projection_Filters_CVPR_2023_paper.pdf + Deep Neural Networks show superior performance in various tasks but are vulnerable to adversarial attacks. Most defense techniques are devoted to the adversarial training strategies, however, it is difficult to achieve satisfactory robust performance only with traditional adversarial training. We mainly attribute it to that aggressive perturbations which lead to the loss increment can always be found via gradient ascent in white-box setting. Although some noises can be involved to prevent attacks from deriving precise gradients on inputs, there exist trade-offs between the defense capability and natural generalization. Taking advantage of the properties of random projection, we propose to replace part of convolutional filters with random projection filters, and theoretically explore the geometric representation preservation of proposed synthesized filters via Johnson-Lindenstrauss lemma. We conduct sufficient evaluation on multiple networks and datasets. The experimental results showcase the superiority of proposed random projection filters to state-of-the-art baselines. The code is available on https://github.com/UniSerj/Random-Projection-Filters. + + + + VNE: An Effective Method for Improving Deep Representation by Manipulating Eigenvalue Distribution + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_VNE_An_Effective_Method_for_Improving_Deep_Representation_by_Manipulating_CVPR_2023_paper.pdf + Since the introduction of deep learning, a wide scope of representation properties, such as decorrelation, whitening, disentanglement, rank, isotropy, and mutual information, have been studied to improve the quality of representation. However, manipulating such properties can be challenging in terms of implementational effectiveness and general applicability. To address these limitations, we propose to regularize von Neumann entropy (VNE) of representation. First, we demonstrate that the mathematical formulation of VNE is superior in effectively manipulating the eigenvalues of the representation autocorrelation matrix. 
Then, we demonstrate that it is widely applicable in improving state-of-the-art algorithms or popular benchmark algorithms by investigating domain-generalization, meta-learning, self-supervised learning, and generative models. In addition, we formally establish theoretical connections with rank, disentanglement, and isotropy of representation. Finally, we provide discussions on the dimension control of VNE and the relationship with Shannon entropy. Code is available at: https://github.com/jaeill/CVPR23-VNE. + + + + Local Implicit Ray Function for Generalizable Radiance Field Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Local_Implicit_Ray_Function_for_Generalizable_Radiance_Field_Representation_CVPR_2023_paper.pdf + We propose LIRF (Local Implicit Ray Function), a generalizable neural rendering approach for novel view rendering. Current generalizable neural radiance fields (NeRF) methods sample a scene with a single ray per pixel and may therefore render blurred or aliased views when the input views and rendered views observe scene content at different resolutions. To solve this problem, we propose LIRF to aggregate the information from conical frustums to construct a ray. Given 3D positions within conical frustums, LIRF takes 3D coordinates and the features of conical frustums as inputs and predicts a local volumetric radiance field. Since the coordinates are continuous, LIRF renders high-quality novel views at a continuously-valued scale via volume rendering. Besides, we predict the visible weights for each input view via transformer-based feature matching to improve the performance in occluded areas. Experimental results on real-world scenes validate that our method outperforms state-of-the-art methods on novel view rendering of unseen scenes at arbitrary scales. + + + + Dense Distinct Query for End-to-End Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Dense_Distinct_Query_for_End-to-End_Object_Detection_CVPR_2023_paper.pdf + One-to-one label assignment in object detection has successfully obviated the need of non-maximum suppression (NMS) as a postprocessing and makes the pipeline end-to-end. However, it triggers a new dilemma as the widely used sparse queries cannot guarantee a high recall, while dense queries inevitably bring more similar queries and encounters optimization difficulty. As both sparse and dense queries are problematic, then what are the expected queries in end-to-end object detection? This paper shows that the solution should be Dense Distinct Queries (DDQ). Concretely, we first lay dense queries like traditional detectors and then select distinct ones for one-to-one assignments. DDQ blends the advantages of traditional and recent end-to-end detectors and significantly improves the performance of various detectors including FCN, R-CNN, and DETRs. Most impressively, DDQ-DETR achieves 52.1 AP on MS-COCO dataset within 12 epochs using a ResNet-50 backbone, outperforming all existing detectors in the same setting. DDQ also shares the benefit of end-to-end detectors in crowded scenes and achieves 93.8 AP on CrowdHuman. We hope DDQ can inspire researchers to consider the complementarity between traditional methods and end-to-end detectors. The source code can be found at https://github.com/jshilong/DDQ. 
+ + + + Divide and Adapt: Active Domain Adaptation via Customized Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Divide_and_Adapt_Active_Domain_Adaptation_via_Customized_Learning_CVPR_2023_paper.pdf + Active domain adaptation (ADA) aims to improve the model adaptation performance by incorporating the active learning (AL) techniques to label a maximally-informative subset of target samples. Conventional AL methods do not consider the existence of domain shift, and hence, fail to identify the truly valuable samples in the context of domain adaptation. To accommodate active learning and domain adaptation, the two naturally different tasks, in a collaborative framework, we advocate that a customized learning strategy for the target data is the key to the success of ADA solutions. We present Divide-and-Adapt (DiaNA), a new ADA framework that partitions the target instances into four categories with stratified transferable properties. With a novel data subdivision protocol based on uncertainty and domainness, DiaNA can accurately recognize the most gainful samples. While sending the informative instances for annotation, DiaNA employs tailored learning strategies for the remaining categories. Furthermore, we propose an informativeness score that unifies the data partitioning criteria. This enables the use of a Gaussian mixture model (GMM) to automatically sample unlabeled data into the proposed four categories. Thanks to the "divide-and-adapt" spirit, DiaNA can handle data with large variations of domain gap. In addition, we show that DiaNA can generalize to different domain adaptation settings, such as unsupervised domain adaptation (UDA), semi-supervised domain adaptation (SSDA), source-free domain adaptation (SFDA), etc. + + + + Learning Spatial-Temporal Implicit Neural Representations for Event-Guided Video Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Learning_Spatial-Temporal_Implicit_Neural_Representations_for_Event-Guided_Video_Super-Resolution_CVPR_2023_paper.pdf + Event cameras sense intensity changes asynchronously and produce event streams with high dynamic range and low latency. This has inspired research endeavors utilizing events to guide the challenging video super-resolution (VSR) task. In this paper, we make the first attempt to address a novel problem of achieving VSR at random scales by taking advantage of the high temporal resolution property of events. This is hampered by the difficulties of representing the spatial-temporal information of events when guiding VSR. To this end, we propose a novel framework that incorporates the spatial-temporal interpolation of events to VSR in a unified framework. Our key idea is to learn implicit neural representations from queried spatial-temporal coordinates and features from both RGB frames and events. Our method contains three parts. Specifically, the Spatial-Temporal Fusion (STF) module first learns the 3D features from events and RGB frames. Then, the Temporal Filter (TF) module unlocks more explicit motion information from the events near the queried timestamp and generates the 2D features. Lastly, the Spatial-Temporal Implicit Representation (STIR) module recovers the SR frame at arbitrary resolutions from the outputs of these two modules. In addition, we collect a real-world dataset with spatially aligned events and RGB frames. Extensive experiments show that our method significantly surpasses the prior arts and achieves VSR at random scales, e.g., 6.5. 
Code and dataset are available at https://. + + + + Both Style and Distortion Matter: Dual-Path Unsupervised Domain Adaptation for Panoramic Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_Both_Style_and_Distortion_Matter_Dual-Path_Unsupervised_Domain_Adaptation_for_CVPR_2023_paper.pdf + The ability of scene understanding has sparked active research for panoramic image semantic segmentation. However, the performance is hampered by distortion of the equirectangular projection (ERP) and a lack of pixel-wise annotations. For this reason, some works treat the ERP and pinhole images equally and transfer knowledge from the pinhole to ERP images via unsupervised domain adaptation (UDA). However, they fail to handle the domain gaps caused by: 1) the inherent differences between camera sensors and captured scenes; 2) the distinct image formats (e.g., ERP and pinhole images). In this paper, we propose a novel yet flexible dual-path UDA framework, DPPASS, taking ERP and tangent projection (TP) images as inputs. To reduce the domain gaps, we propose cross-projection and intra-projection training. The cross-projection training includes tangent-wise feature contrastive training and prediction consistency training. That is, the former formulates the features with the same projection locations as positive examples and vice versa, for the models' awareness of distortion, while the latter ensures the consistency of cross-model predictions between the ERP and TP. Moreover, adversarial intra-projection training is proposed to reduce the inherent gap, between the features of the pinhole images and those of the ERP and TP images, respectively. Importantly, the TP path can be freely removed after training, leading to no additional inference cost. Extensive experiments on two benchmarks show that our DPPASS achieves +1.06% mIoU increment than the state-of-the-art approaches. + + + + ALTO: Alternating Latent Topologies for Implicit 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_ALTO_Alternating_Latent_Topologies_for_Implicit_3D_Reconstruction_CVPR_2023_paper.pdf + This work introduces alternating latent topologies (ALTO) for high-fidelity reconstruction of implicit 3D surfaces from noisy point clouds. Previous work identifies that the spatial arrangement of latent encodings is important to recover detail. One school of thought is to encode a latent vector for each point (point latents). Another school of thought is to project point latents into a grid (grid latents) which could be a voxel grid or triplane grid. Each school of thought has tradeoffs. Grid latents are coarse and lose high-frequency detail. In contrast, point latents preserve detail. However, point latents are more difficult to decode into a surface, and quality and runtime suffer. In this paper, we propose ALTO to sequentially alternate between geometric representations, before converging to an easy-to-decode latent. We find that this preserves spatial expressiveness and makes decoding lightweight. We validate ALTO on implicit 3D recovery and observe not only a performance improvement over the state-of-the-art, but a runtime improvement of 3-10x. Anonymized source code at https://visual.ee.ucla.edu/alto.htm/. 
+ + + + Learning Debiased Representations via Conditional Attribute Interpolation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Learning_Debiased_Representations_via_Conditional_Attribute_Interpolation_CVPR_2023_paper.pdf + An image is usually described by more than one attribute like "shape" and "color". When a dataset is biased, i.e., most samples have attributes spuriously correlated with the target label, a Deep Neural Network (DNN) is prone to make predictions by the "unintended" attribute, especially if it is easier to learn. To improve the generalization ability when training on such a biased dataset, we propose a chi^2-model to learn debiased representations. First, we design a chi-shape pattern to match the training dynamics of a DNN and find Intermediate Attribute Samples (IASs) --- samples near the attribute decision boundaries, which indicate how the value of an attribute changes from one extreme to another. Then we rectify the representation with a chi-structured metric learning objective. Conditional interpolation among IASs eliminates the negative effect of peripheral attributes and facilitates retaining the intra-class compactness. Experiments show that chi^2-model learns debiased representation effectively and achieves remarkable improvements on various datasets. + + + + Modeling Inter-Class and Intra-Class Constraints in Novel Class Discovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Modeling_Inter-Class_and_Intra-Class_Constraints_in_Novel_Class_Discovery_CVPR_2023_paper.pdf + Novel class discovery (NCD) aims at learning a model that transfers the common knowledge from a class-disjoint labelled dataset to another unlabelled dataset and discovers new classes (clusters) within it. Many methods, as well as elaborate training pipelines and appropriate objectives, have been proposed and considerably boosted performance on NCD tasks. Despite all this, we find that the existing methods do not sufficiently take advantage of the essence of the NCD setting. To this end, in this paper, we propose to model both inter-class and intra-class constraints in NCD based on the symmetric Kullback-Leibler divergence (sKLD). Specifically, we propose an inter-class sKLD constraint to effectively exploit the disjoint relationship between labelled and unlabelled classes, enforcing the separability for different classes in the embedding space. In addition, we present an intra-class sKLD constraint to explicitly constrain the intra-relationship between a sample and its augmentations and ensure the stability of the training process at the same time. We conduct extensive experiments on the popular CIFAR10, CIFAR100 and ImageNet benchmarks and successfully demonstrate that our method can establish a new state of the art and can achieve significant performance improvements, e.g., 3.5%/3.7% clustering accuracy improvements on CIFAR100-50 dataset split under the task-aware/-agnostic evaluation protocol, over previous state-of-the-art methods. Code is available at https://github.com/FanZhichen/NCD-IIC. + + + + Multiple Instance Learning via Iterative Self-Paced Supervised Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Multiple_Instance_Learning_via_Iterative_Self-Paced_Supervised_Contrastive_Learning_CVPR_2023_paper.pdf + Learning representations for individual instances when only bag-level labels are available is a fundamental challenge in multiple instance learning (MIL). 
Recent works have shown promising results using contrastive self-supervised learning (CSSL), which learns to push apart representations corresponding to two different randomly-selected instances. Unfortunately, in real-world applications such as medical image classification, there is often class imbalance, so randomly-selected instances mostly belong to the same majority class, which precludes CSSL from learning inter-class differences. To address this issue, we propose a novel framework, Iterative Self-paced Supervised Contrastive Learning for MIL Representations (ItS2CLR), which improves the learned representation by exploiting instance-level pseudo labels derived from the bag-level labels. The framework employs a novel self-paced sampling strategy to ensure the accuracy of pseudo labels. We evaluate ItS2CLR on three medical datasets, showing that it improves the quality of instance-level pseudo labels and representations, and outperforms existing MIL methods in terms of both bag and instance level accuracy. Code is available at https://github.com/Kangningthu/ItS2CLR + + + + CrowdCLIP: Unsupervised Crowd Counting via Vision-Language Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Liang_CrowdCLIP_Unsupervised_Crowd_Counting_via_Vision-Language_Model_CVPR_2023_paper.pdf + Supervised crowd counting relies heavily on costly manual labeling, which is difficult and expensive, especially in dense scenes. To alleviate the problem, we propose a novel unsupervised framework for crowd counting, named CrowdCLIP. The core idea is built on two observations: 1) the recent contrastive pre-trained vision-language model (CLIP) has presented impressive performance on various downstream tasks; 2) there is a natural mapping between crowd patches and count text. To the best of our knowledge, CrowdCLIP is the first to investigate the vision-language knowledge to solve the counting problem. Specifically, in the training stage, we exploit the multi-modal ranking loss by constructing ranking text prompts to match the size-sorted crowd patches to guide the image encoder learning. In the testing stage, to deal with the diversity of image patches, we propose a simple yet effective progressive filtering strategy to first select the highly potential crowd patches and then map them into the language space with various counting intervals. Extensive experiments on five challenging datasets demonstrate that the proposed CrowdCLIP achieves superior performance compared to previous unsupervised state-of-the-art counting methods. Notably, CrowdCLIP even surpasses some popular fully-supervised methods under the cross-dataset setting. The source code will be available at https://github.com/dk-liang/CrowdCLIP. + + + + iCLIP: Bridging Image Classification and Contrastive Language-Image Pre-Training for Visual Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_iCLIP_Bridging_Image_Classification_and_Contrastive_Language-Image_Pre-Training_for_Visual_CVPR_2023_paper.pdf + This paper presents a method that effectively combines two prevalent visual recognition methods, i.e., image classification and contrastive language-image pre-training, dubbed iCLIP. Instead of naive multi-task learning that use two separate heads for each task, we fuse the two tasks in a deep fashion that adapts the image classification to share the same formula and the same model weights with the language-image pre-training. 
To further bridge these two tasks, we propose to enhance the category names in image classification tasks using external knowledge, such as their descriptions in dictionaries. Extensive experiments show that the proposed method combines the advantages of the two tasks well: the strong discrimination ability in image classification tasks due to the clear and clean category labels, and the good zero-shot ability in CLIP tasks ascribed to the richer semantics in the text descriptions. In particular, it reaches 82.9% top-1 accuracy on IN-1K, and surpasses CLIP by 1.8%, with similar model size, on zero-shot recognition of the Kornblith 12-dataset benchmark. The code and models are publicly available at https://github.com/weiyx16/iCLIP. + + + + RepMode: Learning to Re-Parameterize Diverse Experts for Subcellular Structure Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_RepMode_Learning_to_Re-Parameterize_Diverse_Experts_for_Subcellular_Structure_Prediction_CVPR_2023_paper.pdf + In biological research, fluorescence staining is a key technique to reveal the locations and morphology of subcellular structures. However, it is slow, expensive, and harmful to cells. In this paper, we model it as a deep learning task termed subcellular structure prediction (SSP), aiming to predict the 3D fluorescent images of multiple subcellular structures from a 3D transmitted-light image. Unfortunately, due to the limitations of current biotechnology, each image is partially labeled in SSP. Besides, naturally, subcellular structures vary considerably in size, which causes the multi-scale issue of SSP. To overcome these challenges, we propose Re-parameterizing Mixture-of-Diverse-Experts (RepMode), a network that dynamically organizes its parameters with task-aware priors to handle specified single-label prediction tasks. In RepMode, the Mixture-of-Diverse-Experts (MoDE) block is designed to learn the generalized parameters for all tasks, and gating re-parameterization (GatRep) is performed to generate the specialized parameters for each task, by which RepMode can maintain a compact practical topology exactly like a plain network, and meanwhile achieves a powerful theoretical topology. Comprehensive experiments show that RepMode can achieve state-of-the-art overall performance in SSP. + + + + Masked Motion Encoding for Self-Supervised Video Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Masked_Motion_Encoding_for_Self-Supervised_Video_Representation_Learning_CVPR_2023_paper.pdf + How to learn discriminative video representation from unlabeled videos is challenging but crucial for video analysis. The latest attempts seek to learn a representation model by predicting the appearance contents in the masked regions. However, simply masking and recovering appearance contents may not be sufficient to model temporal clues as the appearance contents can be easily reconstructed from a single frame. To overcome this limitation, we present Masked Motion Encoding (MME), a new pre-training paradigm that reconstructs both appearance and motion information to explore temporal clues. In MME, we focus on addressing two critical challenges to improve the representation performance: 1) how to well represent the possible long-term motion across multiple frames; and 2) how to obtain fine-grained temporal clues from sparsely sampled videos. 
Motivated by the fact that human is able to recognize an action by tracking objects' position changes and shape changes, we propose to reconstruct a motion trajectory that represents these two kinds of change in the masked regions. Besides, given the sparse video input, we enforce the model to reconstruct dense motion trajectories in both spatial and temporal dimensions. Pre-trained with our MME paradigm, the model is able to anticipate long-term and fine-grained motion details. Code is available at https://github.com/XinyuSun/MME. + + + + FFHQ-UV: Normalized Facial UV-Texture Dataset for 3D Face Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Bai_FFHQ-UV_Normalized_Facial_UV-Texture_Dataset_for_3D_Face_Reconstruction_CVPR_2023_paper.pdf + We present a large-scale facial UV-texture dataset that contains over 50,000 high-quality texture UV-maps with even illuminations, neutral expressions, and cleaned facial regions, which are desired characteristics for rendering realistic 3D face models under different lighting conditions. The dataset is derived from a large-scale face image dataset namely FFHQ, with the help of our fully automatic and robust UV-texture production pipeline. Our pipeline utilizes the recent advances in StyleGAN-based facial image editing approaches to generate multi-view normalized face images from single-image inputs. An elaborated UV-texture extraction, correction, and completion procedure is then applied to produce high-quality UV-maps from the normalized face images. Compared with existing UV-texture datasets, our dataset has more diverse and higher-quality texture maps. We further train a GAN-based texture decoder as the nonlinear texture basis for parametric fitting based 3D face reconstruction. Experiments show that our method improves the reconstruction accuracy over state-of-the-art approaches, and more importantly, produces high-quality texture maps that are ready for realistic renderings. The dataset, code, and pre-trained texture decoder are publicly available at https://github.com/csbhr/FFHQ-UV. + + + + SurfelNeRF: Neural Surfel Radiance Fields for Online Photorealistic Reconstruction of Indoor Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_SurfelNeRF_Neural_Surfel_Radiance_Fields_for_Online_Photorealistic_Reconstruction_of_CVPR_2023_paper.pdf + Online reconstructing and rendering of large-scale indoor scenes is a long-standing challenge. SLAM-based methods can reconstruct 3D scene geometry progressively in real time but can not render photorealistic results. While NeRF-based methods produce promising novel view synthesis results, their long offline optimization time and lack of geometric constraints pose challenges to efficiently handling online input. Inspired by the complementary advantages of classical 3D reconstruction and NeRF, we thus investigate marrying explicit geometric representation with NeRF rendering to achieve efficient online reconstruction and high-quality rendering. We introduce SurfelNeRF, a variant of neural radiance field which employs a flexible and scalable neural surfel representation to store geometric attributes and extracted appearance features from input images. We further extend conventional surfel-based fusion scheme to progressively integrate incoming input frames into the reconstructed global neural scene representation. 
In addition, we propose a highly-efficient differentiable rasterization scheme for rendering neural surfel radiance fields, which helps SurfelNeRF achieve 10x speedups in training and inference time, respectively. Experimental results show that our method achieves the state-of-the-art 23.82 PSNR and 29.58 PSNR on ScanNet in feedforward inference and per-scene optimization settings, respectively. + + + + Logical Implications for Visual Question Answering Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Tascon-Morales_Logical_Implications_for_Visual_Question_Answering_Consistency_CVPR_2023_paper.pdf + Despite considerable recent progress in Visual Question Answering (VQA) models, inconsistent or contradictory answers continue to cast doubt on their true reasoning capabilities. However, most proposed methods use indirect strategies or strong assumptions on pairs of questions and answers to enforce model consistency. Instead, we propose a novel strategy intended to improve model performance by directly reducing logical inconsistencies. To do this, we introduce a new consistency loss term that can be used by a wide range of the VQA models and which relies on knowing the logical relation between pairs of questions and answers. While such information is typically not available in VQA datasets, we propose to infer these logical relations using a dedicated language model and use these in our proposed consistency loss function. We conduct extensive experiments on the VQA Introspect and DME datasets and show that our method brings improvements to state-of-the-art VQA models while being robust across different architectures and settings. + + + + NeUDF: Leaning Neural Unsigned Distance Fields With Volume Rendering + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_NeUDF_Leaning_Neural_Unsigned_Distance_Fields_With_Volume_Rendering_CVPR_2023_paper.pdf + Multi-view shape reconstruction has achieved impressive progresses thanks to the latest advances in neural implicit surface rendering. However, existing methods based on signed distance function (SDF) are limited to closed surfaces, failing to reconstruct a wide range of real-world objects that contain open-surface structures. In this work, we introduce a new neural rendering framework, coded NeUDF, that can reconstruct surfaces with arbitrary topologies solely from multi-view supervision. To gain the flexibility of representing arbitrary surfaces, NeUDF leverages the unsigned distance function (UDF) as surface representation. While a naive extension of SDF-based neural renderer cannot scale to UDF, we propose two new formulations of weight function specially tailored for UDF-based volume rendering. Furthermore, to cope with open surface rendering, where the in/out test is no longer valid, we present a dedicated normal regularization strategy to resolve the surface orientation ambiguity. We extensively evaluate our method over a number of challenging datasets, including DTU, MGN, and Deep Fashion 3D. Experimental results demonstrate that NeUDF can significantly outperform the state-of-the-art method in the task of multi-view surface reconstruction, especially for the complex shapes with open boundaries. 
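A minimal sketch of the discrete volume-rendering accumulation that SDF/UDF-based neural renderers such as NeUDF build on. The abstract does not spell out the paper's two UDF-tailored weight formulations, so the unsigned-distance-to-density mapping used below (a simple scaled exponential bump with width beta) is purely an illustrative assumption, not the authors' formulation.

# Standard per-ray alpha compositing over samples; only the density mapping is assumed.
import numpy as np

def render_ray(udf_vals, colors, deltas, beta=0.1):
    # udf_vals: (N,) unsigned distances at samples along one ray
    # colors:   (N, 3) radiance predicted at those samples
    # deltas:   (N,) spacing between consecutive samples
    density = (1.0 / beta) * np.exp(-np.abs(udf_vals) / beta)      # assumed UDF -> density
    alpha = 1.0 - np.exp(-density * deltas)                         # per-sample opacity
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))   # accumulated transmittance
    weights = trans * alpha                                         # per-sample blending weight
    return (weights[:, None] * colors).sum(axis=0)                  # composited RGB

# Usage: rgb = render_ray(np.abs(np.linspace(-1, 1, 64)), np.random.rand(64, 3), np.full(64, 2 / 64))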
+ + + + MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling With Informative-Preserved Reconstruction and Self-Distilled Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_MM-3DScene_3D_Scene_Understanding_by_Customizing_Masked_Modeling_With_Informative-Preserved_CVPR_2023_paper.pdf + Masked Modeling (MM) has demonstrated widespread success in various vision challenges, by reconstructing masked visual patches. Yet, applying MM for large-scale 3D scenes remains an open problem due to the data sparsity and scene complexity. The conventional random masking paradigm used in 2D images often causes a high risk of ambiguity when recovering the masked region of 3D scenes. To this end, we propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve the representative structured points, effectively enhancing the pretext masking task for 3D scene understanding. Integrated with a progressive reconstruction manner, our method can concentrate on modeling regional geometry and enjoy less ambiguity for masked reconstruction. Besides, such scenes with progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, requiring to learn the consistent representations from unmasked areas. By elegantly combining informative-preserved reconstruction on masked areas and consistency self-distillation from unmasked areas, a unified framework called MM-3DScene is yielded. We conduct comprehensive experiments on a host of downstream tasks. The consistent improvement (e.g., +6.1% mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrates the superiority of our approach. + + + + Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation + http://openaccess.thecvf.com//content/CVPR2023/papers/Tumanyan_Plug-and-Play_Diffusion_Features_for_Text-Driven_Image-to-Image_Translation_CVPR_2023_paper.pdf + Large-scale text-to-image generative models have been a revolutionary breakthrough in the evolution of generative AI, synthesizing diverse images with highly complex visual concepts. However, a pivotal challenge in leveraging such models for real-world content creation is providing users with control over the generated content. In this paper, we present a new framework that takes text-to-image synthesis to the realm of image-to-image translation -- given a guidance image and a target text prompt as input, our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text, while preserving the semantic layout of the guidance image. Specifically, we observe and empirically demonstrate that fine-grained control over the generated structure can be achieved by manipulating spatial features and their self-attention inside the model. This results in a simple and effective approach, where features extracted from the guidance image are directly injected into the generation process of the translated image, requiring no training or fine-tuning. We demonstrate high-quality results on versatile text-guided image translation tasks, including translating sketches, rough drawings and animations into realistic images, changing the class and appearance of objects in a given image, and modifying global qualities such as lighting and color. 
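The injection idea described in the Plug-and-Play Diffusion Features abstract can be pictured with ordinary forward hooks: activations recorded while the guidance image passes through chosen layers are substituted into the same layers when generating the translated image. The toy torch model and the choice of layer index below are assumptions for illustration only; they stand in for the pre-trained diffusion U-Net and are not the authors' code.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 8, 3, padding=1), nn.ReLU(),
                      nn.Conv2d(8, 3, 3, padding=1))
inject_at = [2]            # indices of layers whose outputs get replaced (assumed choice)
saved = {}

def save_hook(idx):
    def hook(module, inp, out):
        saved[idx] = out.detach()      # record guidance-image features
    return hook

def inject_hook(idx):
    def hook(module, inp, out):
        return saved[idx]              # overwrite generation features with guidance features
    return hook

guidance = torch.randn(1, 3, 32, 32)
handles = [model[i].register_forward_hook(save_hook(i)) for i in inject_at]
model(guidance)                        # pass 1: extract features from the guidance image
for h in handles:
    h.remove()

handles = [model[i].register_forward_hook(inject_hook(i)) for i in inject_at]
translated = model(torch.randn(1, 3, 32, 32))   # pass 2: generate with injected features
for h in handles:
    h.remove()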
+ + + + Fast Contextual Scene Graph Generation With Unbiased Context Augmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_Fast_Contextual_Scene_Graph_Generation_With_Unbiased_Context_Augmentation_CVPR_2023_paper.pdf + Scene graph generation (SGG) methods have historically suffered from long-tail bias and slow inference speed. In this paper, we notice that humans can analyze relationships between objects relying solely on context descriptions, and this abstract cognitive process may be guided by experience. For example, given descriptions of cup and table with their spatial locations, humans can speculate possible relationships < cup, on, table > or < table, near, cup >. Even without visual appearance information, some impossible predicates like flying in and looking at can be empirically excluded. Accordingly, we propose a contextual scene graph generation (C-SGG) method without using visual information and introduce a context augmentation method. We propose that slight perturbations in the position and size of objects do not essentially affect the relationship between objects. Therefore, at the context level, we can produce diverse context descriptions by using a context augmentation method based on the original dataset. These diverse context descriptions can be used for unbiased training of C-SGG to alleviate long-tail bias. In addition, we also introduce a context guided visual scene graph generation (CV-SGG) method, which leverages the C-SGG experience to guide vision to focus on possible predicates. Through extensive experiments on the publicly available dataset, C-SGG alleviates long-tail bias and omits the huge computation of visual feature extraction to realize real-time SGG. CV-SGG achieves a great trade-off between common predicates and tail predicates. + + + + Re-Thinking Federated Active Learning Based on Inter-Class Diversity + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Re-Thinking_Federated_Active_Learning_Based_on_Inter-Class_Diversity_CVPR_2023_paper.pdf + Although federated learning has made awe-inspiring advances, most studies have assumed that the client's data are fully labeled. However, in a real-world scenario, every client may have a significant number of unlabeled instances. Among the various approaches to utilizing unlabeled data, a federated active learning framework has emerged as a promising solution. In the decentralized setting, there are two types of available query selector models, namely 'global' and 'local-only' models, but little literature discusses their performance dominance and its causes. In this work, we first demonstrate that the superiority of the two selector models depends on the global and local inter-class diversity. Furthermore, we observe that the global and local-only models are the keys to resolving the imbalance of each side. Based on our findings, we propose LoGo, a FAL sampling strategy robust to varying local heterogeneity levels and global imbalance ratio, that integrates both models via a two-step active selection scheme. LoGo consistently outperforms six active learning strategies across a total of 38 experimental settings. 
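A hedged sketch of how a two-step active selection could integrate the local-only and global selector models in the spirit of LoGo. The abstract does not give the exact procedure, so the entropy-based shortlist-then-pick scheme and its hyperparameters below are assumptions made only to illustrate the idea.

import numpy as np

def entropy(probs):
    return -(probs * np.log(probs + 1e-12)).sum(axis=1)

def two_step_select(local_probs, global_probs, budget, shortlist_factor=4):
    # Step 1: the local-only model shortlists the samples it is least certain about.
    shortlist = np.argsort(-entropy(local_probs))[: budget * shortlist_factor]
    # Step 2: the global model makes the final pick inside that shortlist.
    picked = shortlist[np.argsort(-entropy(global_probs[shortlist]))[:budget]]
    return picked

# Usage with random softmax outputs for 1000 unlabeled samples and 10 classes:
rng = np.random.default_rng(0)
p_local = rng.dirichlet(np.ones(10), size=1000)
p_global = rng.dirichlet(np.ones(10), size=1000)
queries = two_step_select(p_local, p_global, budget=32)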
+ + + + CiaoSR: Continuous Implicit Attention-in-Attention Network for Arbitrary-Scale Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_CiaoSR_Continuous_Implicit_Attention-in-Attention_Network_for_Arbitrary-Scale_Image_Super-Resolution_CVPR_2023_paper.pdf + Learning continuous image representations is recently gaining popularity for image super-resolution (SR) because of its ability to reconstruct high-resolution images with arbitrary scales from low-resolution inputs. Existing methods mostly ensemble nearby features to predict the new pixel at any queried coordinate in the SR image. Such a local ensemble suffers from some limitations: i) it has no learnable parameters and it neglects the similarity of the visual features; ii) it has a limited receptive field and cannot ensemble relevant features in a large field which are important in an image. To address these issues, this paper proposes a continuous implicit attention-in-attention network, called CiaoSR. We explicitly design an implicit attention network to learn the ensemble weights for the nearby local features. Furthermore, we embed a scale-aware attention in this implicit attention network to exploit additional non-local information. Extensive experiments on benchmark datasets demonstrate CiaoSR significantly outperforms the existing single image SR methods with the same backbone. In addition, CiaoSR also achieves the state-of-the-art performance on the arbitrary-scale SR task. The effectiveness of the method is also demonstrated on the real-world SR setting. More importantly, CiaoSR can be flexibly integrated into any backbone to improve the SR performance. + + + + The Best Defense Is a Good Offense: Adversarial Augmentation Against Adversarial Attacks + http://openaccess.thecvf.com//content/CVPR2023/papers/Frosio_The_Best_Defense_Is_a_Good_Offense_Adversarial_Augmentation_Against_CVPR_2023_paper.pdf + Many defenses against adversarial attacks (e.g. robust classifiers, randomization, or image purification) use countermeasures put to work only after the attack has been crafted. We adopt a different perspective to introduce A^5 (Adversarial Augmentation Against Adversarial Attacks), a novel framework including the first certified preemptive defense against adversarial attacks. The main idea is to craft a defensive perturbation to guarantee that any attack (up to a given magnitude) towards the input in hand will fail. To this aim, we leverage existing automatic perturbation analysis tools for neural networks. We study the conditions to apply A^5 effectively, analyze the importance of the robustness of the to-be-defended classifier, and inspect the appearance of the robustified images. We show effective on-the-fly defensive augmentation with a robustifier network that ignores the ground truth label, and demonstrate the benefits of robustifier and classifier co-training. In our tests, A^5 consistently beats state of the art certified defenses on MNIST, CIFAR10, FashionMNIST and Tinyimagenet. We also show how to apply A^5 to create certifiably robust physical objects. The released code at https://github.com/NVlabs/A5 allows experimenting on a wide range of scenarios beyond the man-in-the-middle attack tested here, including the case of physical attacks. 
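A rough sketch of the preemptive-defense idea behind A^5: before any attack is seen, optimize a small defensive perturbation of the input so that the worst-case attack within a bounded ball fails. The paper relies on certified perturbation-analysis tools, which are not reproduced here; plain alternating PGD is used as a stand-in, so this is illustrative only and carries no certification guarantee.

import torch
import torch.nn.functional as F

def defend(model, x, y, eps_def=0.03, eps_atk=0.03, steps=10, atk_steps=5, lr=0.01):
    for p in model.parameters():          # keep the classifier frozen
        p.requires_grad_(False)
    delta_d = torch.zeros_like(x, requires_grad=True)   # defensive perturbation
    opt = torch.optim.SGD([delta_d], lr=lr)
    for _ in range(steps):
        # Inner maximization: strongest attack against the currently defended input.
        delta_a = torch.zeros_like(x, requires_grad=True)
        for _ in range(atk_steps):
            loss = F.cross_entropy(model(x + delta_d.detach() + delta_a), y)
            loss.backward()
            with torch.no_grad():
                delta_a += eps_atk / atk_steps * delta_a.grad.sign()
                delta_a.clamp_(-eps_atk, eps_atk)
            delta_a.grad.zero_()
        # Outer minimization: update the defensive perturbation against that attack.
        opt.zero_grad()
        F.cross_entropy(model(x + delta_d + delta_a.detach()), y).backward()
        opt.step()
        with torch.no_grad():
            delta_d.clamp_(-eps_def, eps_def)
    return (x + delta_d).detach()          # robustified input, ready to publish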
+ + + + GaitGCI: Generative Counterfactual Intervention for Gait Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Dou_GaitGCI_Generative_Counterfactual_Intervention_for_Gait_Recognition_CVPR_2023_paper.pdf + Gait is one of the most promising biometrics that aims to identify pedestrians from their walking patterns. However, prevailing methods are susceptible to confounders, resulting in the networks hardly focusing on the regions that reflect effective walking patterns. To address this fundamental problem in gait recognition, we propose a Generative Counterfactual Intervention framework, dubbed GaitGCI, consisting of Counterfactual Intervention Learning (CIL) and Diversity-Constrained Dynamic Convolution (DCDC). CIL leverages causal inference to alleviate the impact of confounders by maximizing the likelihood difference between factual/counterfactual attention. DCDC adaptively generates sample-wise factual/counterfactual attention to perceive the sample properties. With matrix decomposition and diversity constraint, DCDC guarantees the model's efficiency and effectiveness. Extensive experiments indicate that proposed GaitGCI: 1) could effectively focus on the discriminative and interpretable regions that reflect gait patterns; 2) is model-agnostic and could be plugged into existing models to improve performance with nearly no extra cost; 3) efficiently achieves state-of-the-art performance on arbitrary scenarios (in-the-lab and in-the-wild). + + + + Constructing Deep Spiking Neural Networks From Artificial Neural Networks With Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Constructing_Deep_Spiking_Neural_Networks_From_Artificial_Neural_Networks_With_CVPR_2023_paper.pdf + Spiking neural networks (SNNs) are well known as the brain-inspired models with high computing efficiency, due to a key component that they utilize spikes as information units, close to the biological neural systems. Although spiking based models are energy efficient by taking advantage of discrete spike signals, their performance is limited by current network structures and their training methods. As discrete signals, typical SNNs cannot apply the gradient descent rules directly into parameters adjustment as artificial neural networks (ANNs). Aiming at this limitation, here we propose a novel method of constructing deep SNN models with knowledge distillation (KD) that uses ANN as teacher model and SNN as student model. Through ANN-SNN joint training algorithm, the student SNN model can learn rich feature information from the teacher ANN model through the KD method, yet it avoids training SNN from scratch when communicating with non-differentiable spikes. Our method can not only build a more efficient deep spiking structure feasibly and reasonably, but use few time steps to train whole model compared to direct training or ANN to SNN methods. More importantly, it has a superb ability of noise immunity for various types of artificial noises and natural signals. The proposed novel method provides efficient ways to improve the performance of SNN through constructing deeper structures in a high-throughput fashion, with potential usage for light and efficient brain-inspired computing of practical scenarios. 
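The ANN-teacher / SNN-student distillation described above can be summarized with the standard knowledge-distillation loss. The temperature, the loss weighting, and the use of time-averaged spike counts as the student's logits are assumptions made here for illustration; the paper's joint training algorithm and surrogate-gradient details are not reproduced.

import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    # Soft targets from the ANN teacher, softened by temperature T.
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)     # ordinary supervised term
    return alpha * soft + (1.0 - alpha) * hard

# Usage: the SNN's time-averaged firing rates act as its logits (an assumption).
spike_counts = torch.rand(8, 10, 4)                    # batch x classes x time steps
student_logits = spike_counts.mean(dim=2)
teacher_logits = torch.randn(8, 10)
labels = torch.randint(0, 10, (8,))
loss = kd_loss(student_logits, teacher_logits, labels)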
+ + + + KERM: Knowledge Enhanced Reasoning for Vision-and-Language Navigation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_KERM_Knowledge_Enhanced_Reasoning_for_Vision-and-Language_Navigation_CVPR_2023_paper.pdf + Vision-and-language navigation (VLN) is the task of enabling an embodied agent to navigate to a remote location by following natural language instructions in real scenes. Most of the previous approaches utilize the entire features or object-centric features to represent navigable candidates. However, these representations are not efficient enough for an agent to perform actions to arrive at the target location. As knowledge provides crucial information which is complementary to visible content, in this paper, we propose a Knowledge Enhanced Reasoning Model (KERM) to leverage knowledge to improve agent navigation ability. Specifically, we first retrieve facts (i.e., knowledge described by language descriptions) for the navigation views based on local regions from the constructed knowledge base. The retrieved facts range from properties of a single object (e.g., color, shape) to relationships between objects (e.g., action, spatial position), providing crucial information for VLN. We further present KERM, which contains the purification, fact-aware interaction, and instruction-guided aggregation modules to integrate visual, history, instruction, and fact features. The proposed KERM can automatically select and gather crucial and relevant cues, obtaining more accurate action prediction. Experimental results on the REVERIE, R2R, and SOON datasets demonstrate the effectiveness of the proposed method. The source code is available at https://github.com/XiangyangLi20/KERM. + + + + Abstract Visual Reasoning: An Algebraic Approach for Solving Raven's Progressive Matrices + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Abstract_Visual_Reasoning_An_Algebraic_Approach_for_Solving_Ravens_Progressive_CVPR_2023_paper.pdf + We introduce algebraic machine reasoning, a new reasoning framework that is well-suited for abstract reasoning. Effectively, algebraic machine reasoning reduces the difficult process of novel problem-solving to routine algebraic computation. The fundamental algebraic objects of interest are the ideals of some suitably initialized polynomial ring. We shall explain how solving Raven's Progressive Matrices (RPMs) can be realized as computational problems in algebra, which combine various well-known algebraic subroutines that include: Computing the Gröbner basis of an ideal, checking for ideal containment, etc. Crucially, the additional algebraic structure satisfied by ideals allows for more operations on ideals beyond set-theoretic operations. Our algebraic machine reasoning framework is not only able to select the correct answer from a given answer set, but also able to generate the correct answer with only the question matrix given. Experiments on the I-RAVEN dataset yield an overall 93.2% accuracy, which significantly outperforms the current state-of-the-art accuracy of 77.0% and exceeds human performance at 84.4% accuracy. + + + + 3D-Aware Conditional Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Deng_3D-Aware_Conditional_Image_Synthesis_CVPR_2023_paper.pdf + We propose pix2pix3D, a 3D-aware conditional generative model for controllable photorealistic image synthesis. Given a 2D label map, such as a segmentation or edge map, our model learns to synthesize a corresponding image from different viewpoints. 
To enable explicit 3D user control, we extend conditional generative models with neural radiance fields. Given widely-available posed monocular image and label map pairs, our model learns to assign a label to every 3D point in addition to color and density, which enables it to render the image and pixel-aligned label map simultaneously. Finally, we build an interactive system that allows users to edit the label map from different viewpoints and generate outputs accordingly. + + + + ABCD: Arbitrary Bitwise Coefficient for De-Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_ABCD_Arbitrary_Bitwise_Coefficient_for_De-Quantization_CVPR_2023_paper.pdf + Modern displays and contents support more than 8bits image and video. However, bit-starving situations such as compression codecs make low bit-depth (LBD) images (<8bits), occurring banding and blurry artifacts. Previous bit depth expansion (BDE) methods still produce unsatisfactory high bit-depth (HBD) images. To this end, we propose an implicit neural function with a bit query to recover de-quantized images from arbitrarily quantized inputs. We develop a phasor estimator to exploit the information of the nearest pixels. Our method shows superior performance against prior BDE methods on natural and animation images. We also demonstrate our model on YouTube UGC datasets for de-banding. Our source code is available at https://github.com/WooKyoungHan/ABCD + + + + Event-Based Blurry Frame Interpolation Under Blind Exposure + http://openaccess.thecvf.com//content/CVPR2023/papers/Weng_Event-Based_Blurry_Frame_Interpolation_Under_Blind_Exposure_CVPR_2023_paper.pdf + Restoring sharp high frame-rate videos from low frame-rate blurry videos is a challenging problem. Existing blurry frame interpolation methods assume a predefined and known exposure time, which suffer from severe performance drop when applied to videos captured in the wild. In this paper, we study the problem of blurry frame interpolation under blind exposure with the assistance of an event camera. The high temporal resolution of the event camera is beneficial to obtain the exposure prior that is lost during the imaging process. Besides, sharp frames can be restored using event streams and blurry frames relying on the mutual constraint among them. Therefore, we first propose an exposure estimation strategy guided by event streams to estimate the lost exposure prior, transforming the blind exposure problem well-posed. Second, we propose to model the mutual constraint with a temporal-exposure control strategy through iterative residual learning. Our blurry frame interpolation method achieves a distinct performance boost over existing methods on both synthetic and self-collected real-world datasets under blind exposure. + + + + Spider GAN: Leveraging Friendly Neighbors To Accelerate GAN Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Asokan_Spider_GAN_Leveraging_Friendly_Neighbors_To_Accelerate_GAN_Training_CVPR_2023_paper.pdf + Training Generative adversarial networks (GANs) stably is a challenging task. The generator in GANs transform noise vectors, typically Gaussian distributed, into realistic data such as images. In this paper, we propose a novel approach for training GANs with images as inputs, but without enforcing any pairwise constraints. The intuition is that images are more structured than noise, which the generator can leverage to learn a more robust transformation. 
The process can be made efficient by identifying closely related datasets, or a "friendly neighborhood" of the target distribution, inspiring the moniker, Spider GAN. To define friendly neighborhoods leveraging proximity between datasets, we propose a new measure called the signed inception distance (SID), inspired by the polyharmonic kernel. We show that the Spider GAN formulation results in faster convergence, as the generator can discover correspondence even between seemingly unrelated datasets, for instance, between Tiny-ImageNet and CelebA faces. Further, we demonstrate cascading Spider GAN, where the output distribution from a pre-trained GAN generator is used as the input to the subsequent network. Effectively, transporting one distribution to another in a cascaded fashion until the target is learnt -- a new flavor of transfer learning. We demonstrate the efficacy of the Spider approach on DCGAN, conditional GAN, PGGAN, StyleGAN2 and StyleGAN3. The proposed approach achieves state-of-the-art Frechet inception distance (FID) values, with one-fifth of the training iterations, in comparison to their baseline counterparts on high-resolution small datasets such as MetFaces, Ukiyo-E Faces and AFHQ-Cats. + + + + ScaleDet: A Scalable Multi-Dataset Object Detector + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_ScaleDet_A_Scalable_Multi-Dataset_Object_Detector_CVPR_2023_paper.pdf + Multi-dataset training provides a viable solution for exploiting heterogeneous large-scale datasets without extra annotation cost. In this work, we propose a scalable multi-dataset detector (ScaleDet) that can scale up its generalization across datasets when increasing the number of training datasets. Unlike existing multi-dataset learners that mostly rely on manual relabelling efforts or sophisticated optimizations to unify labels across datasets, we introduce a simple yet scalable formulation to derive a unified semantic label space for multi-dataset training. ScaleDet is trained by visual-textual alignment to learn the label assignment with label semantic similarities across datasets. Once trained, ScaleDet can generalize well on any given upstream and downstream datasets with seen and unseen classes. We conduct extensive experiments using LVIS, COCO, Objects365, OpenImages as upstream datasets, and 13 datasets from Object Detection in the Wild (ODinW) as downstream datasets. Our results show that ScaleDet achieves compelling strong model performance with an mAP of 50.7 on LVIS, 58.8 on COCO, 46.8 on Objects365, 76.2 on OpenImages, and 71.8 on ODinW, surpassing state-of-the-art detectors with the same backbone. + + + + Unbiased Multiple Instance Learning for Weakly Supervised Video Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Lv_Unbiased_Multiple_Instance_Learning_for_Weakly_Supervised_Video_Anomaly_Detection_CVPR_2023_paper.pdf + Weakly Supervised Video Anomaly Detection (WSVAD) is challenging because the binary anomaly label is only given on the video level, but the output requires snippet-level predictions. So, Multiple Instance Learning (MIL) is prevailing in WSVAD. However, MIL is notoriously known to suffer from many false alarms because the snippet-level detector is easily biased towards the abnormal snippets with simple context, confused by the normality with the same bias, and missing the anomaly with a different pattern. To this end, we propose a new MIL framework: Unbiased MIL (UMIL), to learn unbiased anomaly features that improve WSVAD. 
At each MIL training iteration, we use the current detector to divide the samples into two groups with different context biases: the most confident abnormal/normal snippets and the rest ambiguous ones. Then, by seeking the invariant features across the two sample groups, we can remove the variant context biases. Extensive experiments on benchmarks UCF-Crime and TAD demonstrate the effectiveness of our UMIL. Our code is provided at https://github.com/ktr-hubrt/UMIL. + + + + Towards Unbiased Volume Rendering of Neural Implicit Surfaces With Geometry Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Towards_Unbiased_Volume_Rendering_of_Neural_Implicit_Surfaces_With_Geometry_CVPR_2023_paper.pdf + Learning surface by neural implicit rendering has been a promising way for multi-view reconstruction in recent years. Existing neural surface reconstruction methods, such as NeuS and VolSDF, can produce reliable meshes from multi-view posed images. Although they build a bridge between volume rendering and Signed Distance Function (SDF), the accuracy is still limited. In this paper, we argue that this limited accuracy is due to the bias of their volume rendering strategies, especially when the viewing direction is close to be tangent to the surface. We revise and provide an additional condition for the unbiased volume rendering. Following this analysis, we propose a new rendering method by scaling the SDF field with the angle between the viewing direction and the surface normal vector. Experiments on simulated data indicate that our rendering method reduces the bias of SDF-based volume rendering. Moreover, there still exists non-negligible bias when the learnable standard deviation of SDF is large at early stage, which means that it is hard to supervise the rendered depth with depth priors. Alternatively we supervise zero-level set with surface points obtained from a pre-trained Multi-View Stereo network. We evaluate our method on the DTU dataset and show that it outperforms the state-of-the-arts neural implicit surface methods without mask supervision. + + + + Neuro-Modulated Hebbian Learning for Fully Test-Time Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Neuro-Modulated_Hebbian_Learning_for_Fully_Test-Time_Adaptation_CVPR_2023_paper.pdf + Fully test-time adaptation aims to adapt the network model based on sequential analysis of input samples during the inference stage to address the cross-domain performance degradation problem of deep neural networks. We take inspiration from the biological plausibility learning where the neuron responses are tuned based on a local synapse-change procedure and activated by competitive lateral inhibition rules. Based on these feed-forward learning rules, we design a soft Hebbian learning process which provides an unsupervised and effective mechanism for online adaptation. We observe that the performance of this feed-forward Hebbian learning for fully test-time adaptation can be significantly improved by incorporating a feedback neuro-modulation layer. It is able to fine-tune the neuron responses based on the external feedback generated by the error back-propagation from the top inference layers. This leads to our proposed neuro-modulated Hebbian learning (NHL) method for fully test-time adaptation. With the unsupervised feed-forward soft Hebbian learning being combined with a learned neuro-modulator to capture feedback from external responses, the source model can be effectively adapted during the testing process. 
Experimental results on benchmark datasets demonstrate that our proposed method can significantly improve the adaptation performance of network models and outperforms existing state-of-the-art methods. + + + + Implicit Identity Leakage: The Stumbling Block to Improving Deepfake Detection Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_Implicit_Identity_Leakage_The_Stumbling_Block_to_Improving_Deepfake_Detection_CVPR_2023_paper.pdf + In this paper, we analyse the generalization ability of binary classifiers for the task of deepfake detection. We find that the stumbling block to their generalization is caused by the unexpected learned identity representation on images. Termed as the Implicit Identity Leakage, this phenomenon has been qualitatively and quantitatively verified among various DNNs. Furthermore, based on such understanding, we propose a simple yet effective method named the ID-unaware Deepfake Detection Model to reduce the influence of this phenomenon. Extensive experimental results demonstrate that our method outperforms the state-of-the-art in both in-dataset and cross-dataset evaluation. The code is available at https://github.com/megvii-research/CADDM. + + + + Learning Federated Visual Prompt in Null Space for MRI Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Learning_Federated_Visual_Prompt_in_Null_Space_for_MRI_Reconstruction_CVPR_2023_paper.pdf + Federated Magnetic Resonance Imaging (MRI) reconstruction enables multiple hospitals to collaborate distributedly without aggregating local data, thereby protecting patient privacy. However, the data heterogeneity caused by different MRI protocols, insufficient local training data, and limited communication bandwidth inevitably impair global model convergence and updating. In this paper, we propose a new algorithm, FedPR, to learn federated visual prompts in the null space of global prompt for MRI reconstruction. FedPR is a new federated paradigm that adopts a powerful pre-trained model while only learning and communicating the prompts with few learnable parameters, thereby significantly reducing communication costs and achieving competitive performance on limited local data. Moreover, to deal with catastrophic forgetting caused by data heterogeneity, FedPR also updates efficient federated visual prompts that project the local prompts into an approximate null space of the global prompt, thereby suppressing the interference of gradients on the server performance. Extensive experiments on federated MRI show that FedPR significantly outperforms state-of-the-art FL algorithms with < 6% of communication costs when given the limited amount of local data. + + + + Data-Driven Feature Tracking for Event Cameras + http://openaccess.thecvf.com//content/CVPR2023/papers/Messikommer_Data-Driven_Feature_Tracking_for_Event_Cameras_CVPR_2023_paper.pdf + Because of their high temporal resolution, increased resilience to motion blur, and very sparse output, event cameras have been shown to be ideal for low-latency and low-bandwidth feature tracking, even in challenging scenarios. Existing feature tracking methods for event cameras are either handcrafted or derived from first principles but require extensive parameter tuning, are sensitive to noise, and do not generalize to different scenarios due to unmodeled effects. 
To tackle these deficiencies, we introduce the first data-driven feature tracker for event cameras, which leverages low-latency events to track features detected in a grayscale frame. We achieve robust performance via a novel frame attention module, which shares information across feature tracks. By directly transferring zero-shot from synthetic to real data, our data-driven tracker outperforms existing approaches in relative feature age by up to 120% while also achieving the lowest latency. This performance gap is further increased to 130% by adapting our tracker to real data with a novel self-supervision strategy. + + + + Temporal Consistent 3D LiDAR Representation Learning for Semantic Perception in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Nunes_Temporal_Consistent_3D_LiDAR_Representation_Learning_for_Semantic_Perception_in_CVPR_2023_paper.pdf + Semantic perception is a core building block in autonomous driving, since it provides information about the drivable space and location of other traffic participants. For learning-based perception, often a large amount of diverse training data is necessary to achieve high performance. Data labeling is usually a bottleneck for developing such methods, especially for dense prediction tasks, e.g., semantic segmentation or panoptic segmentation. For 3D LiDAR data, the annotation process demands even more effort than for images. Especially in autonomous driving, point clouds are sparse, and objects appearance depends on its distance from the sensor, making it harder to acquire large amounts of labeled training data. This paper aims at taking an alternative path proposing a self-supervised representation learning method for 3D LiDAR data. Our approach exploits the vehicle motion to match objects across time viewed in different scans. We then train a model to maximize the point-wise feature similarities from points of the associated object in different scans, which enables to learn a consistent representation across time. The experimental results show that our approach performs better than previous state-of-the-art self-supervised representation learning methods when fine-tuning to different downstream tasks. We furthermore show that with only 10% of labeled data, a network pre-trained with our approach can achieve better performance than the same network trained from scratch with all labels for semantic segmentation on SemanticKITTI. + + + + DiffTalk: Crafting Diffusion Models for Generalized Audio-Driven Portraits Animation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_DiffTalk_Crafting_Diffusion_Models_for_Generalized_Audio-Driven_Portraits_Animation_CVPR_2023_paper.pdf + Talking head synthesis is a promising approach for the video production industry. Recently, a lot of effort has been devoted in this research area to improve the generation quality or enhance the model generalization. However, there are few works able to address both issues simultaneously, which is essential for practical applications. To this end, in this paper, we turn attention to the emerging powerful Latent Diffusion Models, and model the Talking head generation as an audio-driven temporally coherent denoising process (DiffTalk). More specifically, instead of employing audio signals as the single driving factor, we investigate the control mechanism of the talking face, and incorporate reference face images and landmarks as conditions for personality-aware generalized synthesis. 
In this way, the proposed DiffTalk is capable of producing high-quality talking head videos in synchronization with the source audio, and more importantly, it can be naturally generalized across different identities without any further fine-tuning. Additionally, our DiffTalk can be gracefully tailored for higher-resolution synthesis with negligible extra computational cost. Extensive experiments show that the proposed DiffTalk efficiently synthesizes high-fidelity audio-driven talking head videos for generalized novel identities. For more video results, please refer to https://sstzal.github.io/DiffTalk/. + + + + Visual Query Tuning: Towards Effective Usage of Intermediate Representations for Parameter and Memory Efficient Transfer Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Tu_Visual_Query_Tuning_Towards_Effective_Usage_of_Intermediate_Representations_for_CVPR_2023_paper.pdf + Intermediate features of a pre-trained model have been shown informative for making accurate predictions on downstream tasks, even if the model backbone is frozen. The key challenge is how to utilize them, given the gigantic amount. We propose visual query tuning (VQT), a simple yet effective approach to aggregate intermediate features of Vision Transformers. Through introducing a handful of learnable "query" tokens to each layer, VQT leverages the inner workings of Transformers to "summarize" rich intermediate features of each layer, which can then be used to train the prediction heads of downstream tasks. As VQT keeps the intermediate features intact and only learns to combine them, it enjoys memory efficiency in training, compared to many other parameter-efficient fine-tuning approaches that learn to adapt features and need back-propagation through the entire backbone. This also suggests the complementary role between VQT and those approaches in transfer learning. Empirically, VQT consistently surpasses the state-of-the-art approach that utilizes intermediate features for transfer learning and outperforms full fine-tuning in many cases. Compared to parameter-efficient approaches that adapt features, VQT achieves much higher accuracy under memory constraints. Most importantly, VQT is compatible with these approaches to attain higher accuracy, making it a simple add-on to further boost transfer learning. + + + + Compressing Volumetric Radiance Fields to 1 MB + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Compressing_Volumetric_Radiance_Fields_to_1_MB_CVPR_2023_paper.pdf + Approximating radiance fields with discretized volumetric grids is one of promising directions for improving NeRFs, represented by methods like DVGO, Plenoxels and TensoRF, which achieve super-fast training convergence and real-time rendering. However, these methods typically require a tremendous storage overhead, costing up to hundreds of megabytes of disk space and runtime memory for a single scene. We address this issue in this paper by introducing a simple yet effective framework, called vector quantized radiance fields (VQRF), for compressing these volume-grid-based radiance fields. We first present a robust and adaptive metric for estimating redundancy in grid models and performing voxel pruning by better exploring intermediate outputs of volumetric rendering. A trainable vector quantization is further proposed to improve the compactness of grid models. 
In combination with an efficient joint tuning strategy and post-processing, our method can achieve a compression ratio of 100x by reducing the overall model size to 1 MB with negligible loss on visual quality. Extensive experiments demonstrate that the proposed framework is capable of achieving unrivaled performance and well generalization across multiple methods with distinct volumetric structures, facilitating the wide use of volumetric radiance fields methods in real-world applications. Code is available at https://github.com/AlgoHunt/VQRF. + + + + Label Information Bottleneck for Label Enhancement + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Label_Information_Bottleneck_for_Label_Enhancement_CVPR_2023_paper.pdf + In this work, we focus on the challenging problem of Label Enhancement (LE), which aims to exactly recover label distributions from logical labels, and present a novel Label Information Bottleneck (LIB) method for LE. For the recovery process of label distributions, the label irrelevant information contained in the dataset may lead to unsatisfactory recovery performance. To address this limitation, we make efforts to excavate the essential label relevant information to improve the recovery performance. Our method formulates the LE problem as the following two joint processes: 1) learning the representation with the essential label relevant information, 2) recovering label distributions based on the learned representation. The label relevant information can be excavated based on the "bottleneck" formed by the learned representation. Significantly, both the label relevant information about the label assignments and the label relevant information about the label gaps can be explored in our method. Evaluation experiments conducted on several benchmark label distribution learning datasets verify the effectiveness and competitiveness of LIB. + + + + Multi-Modal Representation Learning With Text-Driven Soft Masks + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_Multi-Modal_Representation_Learning_With_Text-Driven_Soft_Masks_CVPR_2023_paper.pdf + We propose a visual-linguistic representation learning approach within a self-supervised learning framework by introducing a new operation, loss, and data augmentation strategy. First, we generate diverse features for the image-text matching (ITM) task via soft-masking the regions in an image, which are most relevant to a certain word in the corresponding caption, instead of completely removing them. Since our framework relies only on image-caption pairs with no fine-grained annotations, we identify the relevant regions to each word by computing the word-conditional visual attention using multi-modal encoder. Second, we encourage the model to focus more on hard but diverse examples by proposing a focal loss for the image-text contrastive learning (ITC) objective, which alleviates the inherent limitations of overfitting and bias issues. Last, we perform multi-modal data augmentations for self-supervised learning via mining various examples by masking texts and rendering distortions on images. We show that the combination of these three innovations is effective for learning a pretrained model, leading to outstanding performance on multiple vision-language downstream tasks. 
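One way to picture the focal image-text contrastive (ITC) objective mentioned in the soft-masks abstract above: confidently matched pairs are down-weighted so the loss concentrates on hard, diverse examples. The focusing parameter gamma and the symmetric image-to-text / text-to-image formulation are assumptions for illustration, not necessarily the paper's exact loss.

import torch
import torch.nn.functional as F

def focal_itc_loss(img_emb, txt_emb, temperature=0.07, gamma=2.0):
    img_emb = F.normalize(img_emb, dim=1)
    txt_emb = F.normalize(txt_emb, dim=1)
    logits = img_emb @ txt_emb.t() / temperature          # pairwise similarities
    targets = torch.arange(logits.size(0))
    total = 0.0
    for l in (logits, logits.t()):                        # image->text and text->image
        log_p = F.log_softmax(l, dim=1)
        log_p_match = log_p[targets, targets]             # log-prob of the matched pair
        focal = (1.0 - log_p_match.exp()) ** gamma        # down-weight easy (confident) pairs
        total = total + (-(focal * log_p_match).mean())
    return 0.5 * total

# Usage: loss = focal_itc_loss(torch.randn(16, 256), torch.randn(16, 256))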
+ + + + Gazeformer: Scalable, Effective and Fast Prediction of Goal-Directed Human Attention + http://openaccess.thecvf.com//content/CVPR2023/papers/Mondal_Gazeformer_Scalable_Effective_and_Fast_Prediction_of_Goal-Directed_Human_Attention_CVPR_2023_paper.pdf + Predicting human gaze is important in Human-Computer Interaction (HCI). However, to practically serve HCI applications, gaze prediction models must be scalable, fast, and accurate in their spatial and temporal gaze predictions. Recent scanpath prediction models focus on goal-directed attention (search). Such models are limited in their application due to a common approach relying on trained target detectors for all possible objects, and the availability of human gaze data for their training (both not scalable). In response, we pose a new task called ZeroGaze, a new variant of zero-shot learning where gaze is predicted for never-before-searched objects, and we develop a novel model, Gazeformer, to solve the ZeroGaze problem. In contrast to existing methods using object detector modules, Gazeformer encodes the target using a natural language model, thus leveraging semantic similarities in scanpath prediction. We use a transformer-based encoder-decoder architecture because transformers are particularly useful for generating contextual representations. Gazeformer surpasses other models by a large margin (19% - 70%) on the ZeroGaze setting. It also outperforms existing target-detection models on standard gaze prediction for both target-present and target-absent search tasks. In addition to its improved performance, Gazeformer is more than five times faster than the state-of-the-art target-present visual search model. + + + + Rethinking the Correlation in Few-Shot Segmentation: A Buoys View + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Rethinking_the_Correlation_in_Few-Shot_Segmentation_A_Buoys_View_CVPR_2023_paper.pdf + Few-shot segmentation (FSS) aims to segment novel objects in a given query image with only a few annotated support images. However, most previous best-performing methods, whether prototypical learning methods or affinity learning methods, neglect to alleviate false matches caused by their own pixel-level correlation. In this work, we rethink how to mitigate the false matches from the perspective of representative reference features (referred to as buoys), and propose a novel adaptive buoys correlation (ABC) network to rectify direct pairwise pixel-level correlation, including a buoys mining module and an adaptive correlation module. The proposed ABC enjoys several merits. First, to learn the buoys well without any correspondence supervision, we customize the buoys mining module according to the three characteristics of representativeness, task awareness and resilience. Second, the proposed adaptive correlation module is responsible for further endowing buoy-correlation-based pixel matching with an adaptive ability. Extensive experimental results with two different backbones on two challenging benchmarks demonstrate that our ABC, as a general plugin, achieves consistent improvements over several leading methods on both 1-shot and 5-shot settings. + + + + DiffRF: Rendering-Guided 3D Radiance Field Diffusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Muller_DiffRF_Rendering-Guided_3D_Radiance_Field_Diffusion_CVPR_2023_paper.pdf + We introduce DiffRF, a novel approach for 3D radiance field synthesis based on denoising diffusion probabilistic models. 
While existing diffusion-based methods operate on images, latent codes, or point cloud data, we are the first to directly generate volumetric radiance fields. To this end, we propose a 3D denoising model which directly operates on an explicit voxel grid representation. However, as radiance fields generated from a set of posed images can be ambiguous and contain artifacts, obtaining ground truth radiance field samples is non-trivial. We address this challenge by pairing the denoising formulation with a rendering loss, enabling our model to learn a deviated prior that favours good image quality instead of trying to replicate fitting errors like floating artifacts. In contrast to 2D-diffusion models, our model learns multi-view consistent priors, enabling free-view synthesis and accurate shape generation. Compared to 3D GANs, our diffusion-based approach naturally enables conditional generation like masked completion or single-view 3D synthesis at inference time. + + + + Parts2Words: Learning Joint Embedding of Point Clouds and Texts by Bidirectional Matching Between Parts and Words + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Parts2Words_Learning_Joint_Embedding_of_Point_Clouds_and_Texts_by_CVPR_2023_paper.pdf + Shape-Text matching is an important task of high-level shape understanding. Current methods mainly represent a 3D shape as multiple 2D rendered views, which obviously can not be understood well due to the structural ambiguity caused by self-occlusion in the limited number of views. To resolve this issue, we directly represent 3D shapes as point clouds, and propose to learn joint embedding of point clouds and texts by bidirectional matching between parts from shapes and words from texts. Specifically, we first segment the point clouds into parts, and then leverage optimal transport method to match parts and words in an optimized feature space, where each part is represented by aggregating features of all points within it and each word is abstracted by its contextual information. We optimize the feature space in order to enlarge the similarities between the paired training samples, while simultaneously maximizing the margin between the unpaired ones. Experiments demonstrate that our method achieves a significant improvement in accuracy over the SOTAs on multi-modal retrieval tasks under the Text2Shape dataset. Codes are available at https://github.com/JLUtangchuan/Parts2Words. + + + + Proposal-Based Multiple Instance Learning for Weakly-Supervised Temporal Action Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Ren_Proposal-Based_Multiple_Instance_Learning_for_Weakly-Supervised_Temporal_Action_Localization_CVPR_2023_paper.pdf + Weakly-supervised temporal action localization aims to localize and recognize actions in untrimmed videos with only video-level category labels during training. Without instance-level annotations, most existing methods follow the Segment-based Multiple Instance Learning (S-MIL) framework, where the predictions of segments are supervised by the labels of videos. However, the objective for acquiring segment-level scores during training is not consistent with the target for acquiring proposal-level scores during testing, leading to suboptimal results. 
To deal with this problem, we propose a novel Proposal-based Multiple Instance Learning (P-MIL) framework that directly classifies the candidate proposals in both the training and testing stages, which includes three key designs: 1) a surrounding contrastive feature extraction module to suppress the discriminative short proposals by considering the surrounding contrastive information, 2) a proposal completeness evaluation module to inhibit the low-quality proposals with the guidance of the completeness pseudo labels, and 3) an instance-level rank consistency loss to achieve robust detection by leveraging the complementarity of RGB and FLOW modalities. Extensive experimental results on two challenging benchmarks including THUMOS14 and ActivityNet demonstrate the superior performance of our method. Our code is available at github.com/OpenSpaceAI/CVPR2023_P-MIL. + + + + ASPnet: Action Segmentation With Shared-Private Representation of Multiple Data Sources + http://openaccess.thecvf.com//content/CVPR2023/papers/van_Amsterdam_ASPnet_Action_Segmentation_With_Shared-Private_Representation_of_Multiple_Data_Sources_CVPR_2023_paper.pdf + Most state-of-the-art methods for action segmentation are based on single input modalities or naive fusion of multiple data sources. However, effective fusion of complementary information can potentially strengthen segmentation models and make them more robust to sensor noise and more accurate with smaller training datasets. In order to improve multimodal representation learning for action segmentation, we propose to disentangle hidden features of a multi-stream segmentation model into modality-shared components, containing common information across data sources, and private components; we then use an attention bottleneck to capture long-range temporal dependencies in the data while preserving disentanglement in consecutive processing layers. Evaluation on 50salads, Breakfast and RARP45 datasets shows that our multimodal approach outperforms different data fusion baselines on both multiview and multimodal data sources, obtaining competitive or better results compared with the state-of-the-art. Our model is also more robust to additive sensor noise and can achieve performance on par with strong video baselines even with less training data. + + + + Ingredient-Oriented Multi-Degradation Learning for Image Restoration + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Ingredient-Oriented_Multi-Degradation_Learning_for_Image_Restoration_CVPR_2023_paper.pdf + Learning to leverage the relationship among diverse image restoration tasks is quite beneficial for unraveling the intrinsic ingredients behind the degradation. Recent years have witnessed the flourish of various All-in-one methods, which handle multiple image degradations within a single model. In practice, however, few attempts have been made to excavate task correlations in that exploring the underlying fundamental ingredients of various image degradations, resulting in poor scalability as more tasks are involved. In this paper, we propose a novel perspective to delve into the degradation via an ingredients-oriented rather than previous task-oriented manner for scalable learning. Specifically, our method, named Ingredients-oriented Degradation Reformulation framework (IDR), consists of two stages, namely task-oriented knowledge collection and ingredients-oriented knowledge integration. 
In the first stage, we conduct ad hoc operations on different degradations according to the underlying physics principles, and establish the corresponding prior hubs for each type of degradation. While the second stage progressively reformulates the preceding task-oriented hubs into single ingredients-oriented hub via learnable Principal Component Analysis (PCA), and employs a dynamic routing mechanism for probabilistic unknown degradation removal. Extensive experiments on various image restoration tasks demonstrate the effectiveness and scalability of our method. More importantly, our IDR exhibits the favorable generalization ability to unknown downstream tasks. + + + + How Can Objects Help Action Recognition? + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_How_Can_Objects_Help_Action_Recognition_CVPR_2023_paper.pdf + Current state-of-the-art video models process a video clip as a long sequence of spatio-temporal tokens. However, they do not explicitly model objects, their interactions across the video, and instead process all the tokens in the video. In this paper, we investigate how we can use knowledge of objects to design better video models, namely to process fewer tokens and to improve recognition accuracy. This is in contrast to prior works which either drop tokens at the cost of accuracy, or increase accuracy whilst also increasing the computation required. First, we propose an object-guided token sampling strategy that enables us to retain a small fraction of the input tokens with minimal impact on accuracy. And second, we propose an object-aware attention module that enriches our feature representation with object information and improves overall accuracy. Our resulting framework achieves better performance when using fewer tokens than strong baselines. In particular, we match our baseline with 30%, 40%, and 60% of the input tokens on SomethingElse, Something-something v2, and Epic-Kitchens, respectively. When we use our model to process the same number of tokens as our baseline, we improve by 0.6 to 4.2 points on these datasets. + + + + Realistic Saliency Guided Image Enhancement + http://openaccess.thecvf.com//content/CVPR2023/papers/Miangoleh_Realistic_Saliency_Guided_Image_Enhancement_CVPR_2023_paper.pdf + Common editing operations performed by professional photographers include the cleanup operations: de-emphasizing distracting elements and enhancing subjects. These edits are challenging, requiring a delicate balance between manipulating the viewer's attention while maintaining photo realism. While recent approaches can boast successful examples of attention attenuation or amplification, most of them also suffer from frequent unrealistic edits. We propose a realism loss for saliency-guided image enhancement to maintain high realism across varying image types, while attenuating distractors and amplifying objects of interest. Evaluations with professional photographers confirm that we achieve the dual objective of realism and effectiveness, and outperform the recent approaches on their own datasets, while requiring a smaller memory footprint and runtime. We thus offer a viable solution for automating image enhancement and photo cleanup operations. 
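A very rough sketch of how the two objectives in the saliency-guided enhancement abstract above could be combined: push a saliency predictor's response inside the edited region toward a target level (lower to de-emphasize a distractor, higher to amplify a subject) while a realism term penalizes unrealistic edits. Both saliency_model and realism_model are hypothetical stand-ins for pretrained networks; the paper's actual realism loss is not reproduced here.

import torch

def enhancement_loss(edited, mask, target_saliency, saliency_model, realism_model, lam=1.0):
    # edited: (B,3,H,W) edited image; mask: (B,1,H,W) binary region being edited
    sal = saliency_model(edited)                                        # (B,1,H,W) predicted saliency
    region_sal = (sal * mask).sum(dim=(1, 2, 3)) / mask.sum(dim=(1, 2, 3))
    saliency_term = (region_sal - target_saliency).pow(2).mean()        # attenuate or amplify
    realism_term = -realism_model(edited).mean()                        # higher score = more realistic
    return saliency_term + lam * realism_term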
+ + + + SLOPER4D: A Scene-Aware Dataset for Global 4D Human Pose Estimation in Urban Environments + http://openaccess.thecvf.com//content/CVPR2023/papers/Dai_SLOPER4D_A_Scene-Aware_Dataset_for_Global_4D_Human_Pose_Estimation_CVPR_2023_paper.pdf + We present SLOPER4D, a novel scene-aware dataset collected in large urban environments to facilitate the research of global human pose estimation (GHPE) with human-scene interaction in the wild. Employing a head-mounted device integrated with a LiDAR and camera, we record 12 human subjects' activities over 10 diverse urban scenes from an egocentric view. Frame-wise annotations for 2D key points, 3D pose parameters, and global translations are provided, together with reconstructed scene point clouds. To obtain accurate 3D ground truth in such large dynamic scenes, we propose a joint optimization method to fit local SMPL meshes to the scene and fine-tune the camera calibration during dynamic motions frame by frame, resulting in plausible and scene-natural 3D human poses. Eventually, SLOPER4D consists of 15 sequences of human motions, each of which has a trajectory length of more than 200 meters (up to 1,300 meters) and covers an area of more than 200 square meters (up to 30,000 square meters), including more than 100k LiDAR frames, 300k video frames, and 500K IMU-based motion frames. With SLOPER4D, we provide a detailed and thorough analysis of two critical tasks, including camera-based 3D HPE and LiDAR-based 3D HPE in urban environments, and benchmark a new task, GHPE. The in-depth analysis demonstrates SLOPER4D poses significant challenges to existing methods and produces great research opportunities. The dataset and code are released at http://www.lidarhumanmotion.net/sloper4d/. + + + + Mask-Guided Matting in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_Mask-Guided_Matting_in_the_Wild_CVPR_2023_paper.pdf + Mask-guided matting has shown great practicality compared to traditional trimap-based methods. The mask-guided approach takes an easily-obtainable coarse mask as guidance and produces an accurate alpha matte. To extend the success toward practical usage, we tackle mask-guided matting in the wild, which covers a wide range of categories in their complex context robustly. To this end, we propose a simple yet effective learning framework based on two core insights: 1) learning a generalized matting model that can better understand the given mask guidance and 2) leveraging weak supervision datasets (e.g., instance segmentation dataset) to alleviate the limited diversity and scale of existing matting datasets. Extensive experimental results on multiple benchmarks, consisting of a newly proposed synthetic benchmark (Composition-Wild) and existing natural datasets, demonstrate the superiority of the proposed method. Moreover, we provide appealing results on new practical applications (e.g., panoptic matting and mask-guided video matting), showing the great generality and potential of our model. + + + + Dynamic Conceptional Contrastive Learning for Generalized Category Discovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Pu_Dynamic_Conceptional_Contrastive_Learning_for_Generalized_Category_Discovery_CVPR_2023_paper.pdf + Generalized category discovery (GCD) is a recently proposed open-world problem, which aims to automatically cluster partially labeled data. The main challenge is that the unlabeled data contain instances that are not only from known categories of the labeled data but also from novel categories. 
This leaves traditional novel category discovery (NCD) methods incapacitated for GCD, due to their assumption that the unlabeled data come only from novel categories. One effective approach to GCD is applying self-supervised learning to learn discriminative representations for the unlabeled data. However, this manner largely ignores the underlying relationships between instances of the same concepts (e.g., class, super-class, and sub-class), which results in inferior representation learning. In this paper, we propose a Dynamic Conceptional Contrastive Learning (DCCL) framework, which can effectively improve clustering accuracy by alternately estimating underlying visual conceptions and learning conceptional representations. In addition, we design a dynamic conception generation and update mechanism, which is able to ensure consistent conception learning and thus further facilitate the optimization of DCCL. Extensive experiments show that DCCL achieves new state-of-the-art performance on six generic and fine-grained visual recognition datasets, especially on fine-grained ones. For example, our method significantly surpasses the best competitor by 16.2% on the new classes of the CUB-200 dataset. Code is available at https://github.com/TPCD/DCCL + + + + Neumann Network With Recursive Kernels for Single Image Defocus Deblurring + http://openaccess.thecvf.com//content/CVPR2023/papers/Quan_Neumann_Network_With_Recursive_Kernels_for_Single_Image_Defocus_Deblurring_CVPR_2023_paper.pdf + Single image defocus deblurring (SIDD) refers to recovering an all-in-focus image from a defocused blurry one. It is a challenging recovery task due to the spatially-varying defocus blurring effects with significant size variation. Motivated by the strong correlation among defocus kernels of different sizes and the blob-type structure of defocus kernels, we propose a learnable recursive kernel representation (RKR) for defocus kernels that expresses a defocus kernel by a linear combination of recursive, separable and positive atom kernels, leading to a compact yet effective and physics-encoded parametrization of the spatially-varying defocus blurring process. Afterwards, a physics-driven and efficient deep model with a cross-scale fusion structure is presented for SIDD, with inspirations from the truncated Neumann series for approximating the matrix inversion of the RKR-based blurring operator. In addition, a reblurring loss is proposed to regularize the RKR learning. Extensive experiments show that our proposed approach significantly outperforms existing ones, with a model size comparable to that of the top methods. + + + + Guided Recommendation for Model Fine-Tuning + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Guided_Recommendation_for_Model_Fine-Tuning_CVPR_2023_paper.pdf + Model selection is essential for reducing the search cost of finding the best pre-trained model over a large-scale model zoo for a downstream task. After analyzing recent hand-designed model selection criteria with 400+ ImageNet pre-trained models and 40 downstream tasks, we find that they can fail due to invalid assumptions and intrinsic limitations. Prior knowledge on model capacity and datasets also cannot be easily integrated into the existing criteria. To address these issues, we propose to cast model selection as a recommendation problem and to learn from past training history. Specifically, we characterize the meta information of datasets and models as features, and use their transfer learning performance as the guided score.
With thousands of historical training jobs, a recommendation system can be learned to predict the model selection score given the features of the dataset and the model as input. Our approach enables integrating existing model selection scores as additional features and scales with more historical data. We evaluate the prediction accuracy with 22 pre-trained models over 40 downstream tasks. With extensive evaluations, we show that the learned approach can outperform prior hand-designed model selection methods significantly when relevant training history is available. + + + + Masked Image Training for Generalizable Deep Image Denoising + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Masked_Image_Training_for_Generalizable_Deep_Image_Denoising_CVPR_2023_paper.pdf + When capturing and storing images, devices inevitably introduce noise. Reducing this noise is a critical task called image denoising. Deep learning has become the de facto method for image denoising, especially with the emergence of Transformer-based models that have achieved notable state-of-the-art results on various image tasks. However, deep learning-based methods often suffer from a lack of generalization ability. For example, deep models trained on Gaussian noise may perform poorly when tested on other noise distributions. To address this issue, we present a novel approach to enhance the generalization performance of denoising networks, known as masked training. Our method involves masking random pixels of the input image and reconstructing the missing information during training. We also mask out the features in the self-attention layers to avoid the impact of training-testing inconsistency. Our approach exhibits better generalization ability than other deep learning models and is directly applicable to real-world scenarios. Additionally, our interpretability analysis demonstrates the superiority of our method. + + + + DeAR: Debiasing Vision-Language Models With Additive Residuals + http://openaccess.thecvf.com//content/CVPR2023/papers/Seth_DeAR_Debiasing_Vision-Language_Models_With_Additive_Residuals_CVPR_2023_paper.pdf + Large pre-trained vision-language models (VLMs) reduce the time for developing predictive models for various vision-grounded language downstream tasks by providing rich, adaptable image and text representations. However, these models suffer from societal biases owing to the skewed distribution of various identity groups in the training data. These biases manifest as the skewed similarity between the representations for specific text concepts and images of people of different identity groups and, therefore, limit the usefulness of such models in real-world high-stakes applications. In this work, we present DeAR (Debiasing with Additive Residuals), a novel debiasing method that learns additive residual image representations to offset the original representations, ensuring fair output representations. In doing so, it reduces the ability of the representations to distinguish between the different identity groups. Further, we observe that the current fairness tests are performed on limited face image datasets that fail to indicate why a specific text concept should/should not apply to them. To bridge this gap and better evaluate DeAR, we introduce a new context-based bias benchmarking dataset - the Protected Attribute Tag Association (PATA) dataset for evaluating the fairness of large pre-trained VLMs. 
Additionally, PATA provides visual context for a diverse human population in different scenarios with both positive and negative connotations. Experimental results for fairness and zero-shot performance preservation using multiple datasets demonstrate the efficacy of our framework. + + + + E2PN: Efficient SE(3)-Equivariant Point Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_E2PN_Efficient_SE3-Equivariant_Point_Network_CVPR_2023_paper.pdf + This paper proposes a convolution structure for learning SE(3)-equivariant features from 3D point clouds. It can be viewed as an equivariant version of kernel point convolutions (KPConv), a widely used convolution form for processing point cloud data. Compared with existing equivariant networks, our design is simple, lightweight, fast, and easy to integrate with existing task-specific point cloud learning pipelines. We achieve these desirable properties by combining group convolutions and quotient representations. Specifically, we discretize SO(3) to finite groups for their simplicity while using SO(2) as the stabilizer subgroup to form spherical quotient feature fields to save computations. We also propose a permutation layer to recover SO(3) features from spherical features to preserve the capacity to distinguish rotations. Experiments show that our method achieves comparable or superior performance in various tasks, including object classification, pose estimation, and keypoint-matching, while consuming much less memory and running faster than existing work. The proposed method can foster the development of equivariant models for real-world applications based on point clouds. + + + + Understanding Masked Image Modeling via Learning Occlusion Invariant Feature + http://openaccess.thecvf.com//content/CVPR2023/papers/Kong_Understanding_Masked_Image_Modeling_via_Learning_Occlusion_Invariant_Feature_CVPR_2023_paper.pdf + Recently, Masked Image Modeling (MIM) has achieved great success in self-supervised visual recognition. However, as a reconstruction-based framework, it is still an open question how MIM works, since MIM appears very different from previous well-studied siamese approaches such as contrastive learning. In this paper, we propose a new viewpoint: MIM implicitly learns occlusion-invariant features, which is analogous to other siamese methods, while the latter learn other kinds of invariance. By relaxing the MIM formulation into an equivalent siamese form, MIM methods can be interpreted in a unified framework with conventional methods, among which only a) the data transformations, i.e., what invariance to learn, and b) the similarity measurements are different. Furthermore, taking MAE (He et al., 2021) as a representative example of MIM, we empirically find that the success of MIM models relates little to the choice of similarity functions, but rather to the occlusion-invariant feature learned from the masked image -- it turns out to be a favored initialization for vision transformers, even though the learned feature could be less semantic. We hope our findings can inspire researchers to develop more powerful self-supervised methods in the computer vision community. + + + + A Dynamic Multi-Scale Voxel Flow Network for Video Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_A_Dynamic_Multi-Scale_Voxel_Flow_Network_for_Video_Prediction_CVPR_2023_paper.pdf + The performance of video prediction has been greatly boosted by advanced deep neural networks.
However, most current methods suffer from large model sizes and require extra inputs, e.g., semantic/depth maps, for promising performance. For efficiency, in this paper we propose a Dynamic Multi-scale Voxel Flow Network (DMVFN) that achieves better video prediction performance than previous methods at lower computational cost, using only RGB images. The core of our DMVFN is a differentiable routing module that can effectively perceive the motion scales of video frames. Once trained, our DMVFN selects adaptive sub-networks for different inputs at the inference stage. Experiments on several benchmarks demonstrate that our DMVFN is an order of magnitude faster than Deep Voxel Flow and surpasses the state-of-the-art iterative-based OPT in generated image quality. Our code and demo are available at https://huxiaotaostasy.github.io/DMVFN/. + + + + UniDistill: A Universal Cross-Modality Knowledge Distillation Framework for 3D Object Detection in Bird's-Eye View + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_UniDistill_A_Universal_Cross-Modality_Knowledge_Distillation_Framework_for_3D_Object_CVPR_2023_paper.pdf + In the field of 3D object detection for autonomous driving, the sensor portfolio, spanning multi-modality and single-modality setups, is diverse and complex. Since multi-modal methods add system complexity while the accuracy of single-modal ones is relatively low, making a tradeoff between them is difficult. In this work, we propose a universal cross-modality knowledge distillation framework (UniDistill) to improve the performance of single-modality detectors. Specifically, during training, UniDistill projects the features of both the teacher and the student detector into Bird's-Eye-View (BEV), which is a friendly representation for different modalities. Then, three distillation losses are calculated to sparsely align the foreground features, helping the student learn from the teacher without introducing additional cost during inference. Taking advantage of the similar detection paradigm of different detectors in BEV, UniDistill easily supports LiDAR-to-camera, camera-to-LiDAR, fusion-to-LiDAR and fusion-to-camera distillation paths. Furthermore, the three distillation losses can filter the effect of misaligned background information and balance between objects of different sizes, improving the distillation effectiveness. Extensive experiments on nuScenes demonstrate that UniDistill effectively improves the mAP and NDS of student detectors by 2.0%-3.2%. + + + + Fine-Tuned CLIP Models Are Efficient Video Learners + http://openaccess.thecvf.com//content/CVPR2023/papers/Rasheed_Fine-Tuned_CLIP_Models_Are_Efficient_Video_Learners_CVPR_2023_paper.pdf + Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on the effective transfer of image-based CLIP to the video domain. In this pursuit, new parametric modules are added to learn temporal information and inter-frame relationships, which requires meticulous design effort. Furthermore, when the resulting models are learned on videos, they tend to overfit to the given task distribution and lack generalization. This raises the following question: how can image-level CLIP representations be effectively transferred to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos.
Our qualitative analysis illustrates that the frame-level processing from CLIP image-encoder followed by feature pooling and similarity matching with corresponding text embeddings helps in implicitly modeling the temporal cues within ViFi-CLIP. Such fine-tuning helps the model to focus on scene dynamics, moving objects and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a 'bridge and prompt' approach that first uses finetuning to bridge the domain gap and then learns prompts on language and vision side to adapt CLIP representations. We extensively evaluate this simple yet strong baseline on zero-shot, base-to-novel generalization, few-shot and fully supervised settings across five video benchmarks. Our code and models will be publicly released. + + + + Collaborative Diffusion for Multi-Modal Face Generation and Editing + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Collaborative_Diffusion_for_Multi-Modal_Face_Generation_and_Editing_CVPR_2023_paper.pdf + Diffusion models arise as a powerful generative tool recently. Despite the great progress, existing diffusion models mainly focus on uni-modal control, i.e., the diffusion process is driven by only one modality of condition. To further unleash the users' creativity, it is desirable for the model to be controllable by multiple modalities simultaneously, e.g., generating and editing faces by describing the age (text-driven) while drawing the face shape (mask-driven). In this work, we present Collaborative Diffusion, where pre-trained uni-modal diffusion models collaborate to achieve multi-modal face generation and editing without re-training. Our key insight is that diffusion models driven by different modalities are inherently complementary regarding the latent denoising steps, where bilateral connections can be established upon. Specifically, we propose dynamic diffuser, a meta-network that adaptively hallucinates multi-modal denoising steps by predicting the spatial-temporal influence functions for each pre-trained uni-modal model. Collaborative Diffusion not only collaborates generation capabilities from uni-modal diffusion models, but also integrates multiple uni-modal manipulations to perform multi-modal editing. Extensive qualitative and quantitative experiments demonstrate the superiority of our framework in both image quality and condition consistency. + + + + MACARONS: Mapping and Coverage Anticipation With RGB Online Self-Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Guedon_MACARONS_Mapping_and_Coverage_Anticipation_With_RGB_Online_Self-Supervision_CVPR_2023_paper.pdf + We introduce a method that simultaneously learns to explore new large environments and to reconstruct them in 3D from color images only. This is closely related to the Next Best View problem (NBV), where one has to identify where to move the camera next to improve the coverage of an unknown scene. However, most of the current NBV methods rely on depth sensors, need 3D supervision and/or do not scale to large scenes. Our method requires only a color camera and no 3D supervision. It simultaneously learns in a self-supervised fashion to predict a volume occupancy field from color images and, from this field, to predict the NBV. Thanks to this approach, our method performs well on new scenes as it is not biased towards any training 3D data. 
We demonstrate this on a recent dataset made of various 3D scenes and show it performs even better than recent methods requiring a depth sensor, which is not a realistic assumption for outdoor scenes captured with a flying drone. + + + + Tracking Multiple Deformable Objects in Egocentric Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Tracking_Multiple_Deformable_Objects_in_Egocentric_Videos_CVPR_2023_paper.pdf + Most existing multiple object tracking (MOT) methods that solely rely on appearance features struggle in tracking highly deformable objects. Other MOT methods that use motion clues to associate identities across frames have difficulty handling egocentric videos effectively or efficiently. In this work, we propose DETracker, a new MOT method that jointly detects and tracks deformable objects in egocentric videos. DETracker uses three novel modules, namely the motion disentanglement network (MDN), the patch association network (PAN) and the patch memory network (PMN), to explicitly tackle the difficulties caused by severe ego motion and fast morphing target objects. DETracker is end-to-end trainable and achieves near real-time speed. We also present DogThruGlasses, a large-scale deformable multi-object tracking dataset, with 150 videos and 73K annotated frames, collected by smart glasses. DETracker outperforms existing state-of-the-art method on the DogThruGlasses dataset and YouTube-Hand dataset. + + + + REC-MV: REconstructing 3D Dynamic Cloth From Monocular Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Qiu_REC-MV_REconstructing_3D_Dynamic_Cloth_From_Monocular_Videos_CVPR_2023_paper.pdf + Reconstructing dynamic 3D garment surfaces with open boundaries from monocular videos is an important problem as it provides a practical and low-cost solution for clothes digitization. Recent neural rendering methods achieve high-quality dynamic clothed human reconstruction results from monocular video, but these methods cannot separate the garment surface from the body. Moreover, despite existing garment reconstruction methods based on feature curve representation demonstrating impressive results for garment reconstruction from a single image, they struggle to generate temporally consistent surfaces for the video input. To address the above limitations, in this paper, we formulate this task as an optimization problem of 3D garment feature curves and surface reconstruction from monocular video. We introduce a novel approach, called REC-MV to jointly optimize the explicit feature curves and the implicit signed distance field (SDF) of the garments. Then the open garment meshes can be extracted via garment template registration in the canonical space. Experiments on multiple casually captured datasets show that our approach outperforms existing methods and can produce high-quality dynamic garment surfaces. + + + + JRDB-Pose: A Large-Scale Dataset for Multi-Person Pose Estimation and Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Vendrow_JRDB-Pose_A_Large-Scale_Dataset_for_Multi-Person_Pose_Estimation_and_Tracking_CVPR_2023_paper.pdf + Autonomous robotic systems operating in human environments must understand their surroundings to make accurate and safe decisions. In crowded human scenes with close-up human-robot interaction and robot navigation, a deep understanding of surrounding people requires reasoning about human motion and body dynamics over time with human body pose estimation and tracking. 
However, existing datasets captured from robot platforms either do not provide pose annotations or do not reflect the scene distribution of social robots. In this paper, we introduce JRDB-Pose, a large-scale dataset and benchmark for multi-person pose estimation and tracking. JRDB-Pose extends the existing JRDB which includes videos captured from a social navigation robot in a university campus environment, containing challenging scenes with crowded indoor and outdoor locations and a diverse range of scales and occlusion types. JRDB-Pose provides human pose annotations with per-keypoint occlusion labels and track IDs consistent across the scene and with existing annotations in JRDB. We conduct a thorough experimental study of state-of-the-art multi-person pose estimation and tracking methods on JRDB-Pose, showing that our dataset imposes new challenges for the existing methods. JRDB-Pose is available at https://jrdb.erc.monash.edu/. + + + + AsyFOD: An Asymmetric Adaptation Paradigm for Few-Shot Domain Adaptive Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_AsyFOD_An_Asymmetric_Adaptation_Paradigm_for_Few-Shot_Domain_Adaptive_Object_CVPR_2023_paper.pdf + In this work, we study few-shot domain adaptive object detection (FSDAOD), where only a few target labeled images are available for training in addition to sufficient source labeled images. Critically, in FSDAOD, the data-scarcity in the target domain leads to an extreme data imbalance between the source and target domains, which potentially causes over-adaptation in traditional feature alignment. To address the data imbalance problem, we propose an asymmetric adaptation paradigm, namely AsyFOD, which leverages the source and target instances from different perspectives. Specifically, by using target distribution estimation, the AsyFOD first identifies the target-similar source instances, which serves for augmenting the limited target instances. Then, we conduct asynchronous alignment between target-dissimilar source instances and augmented target instances, which is simple yet effective for alleviating the over-adaptation. Extensive experiments demonstrate that the proposed AsyFOD outperforms all state-of-the-art methods on four FSDAOD benchmarks with various environmental variances, e.g., 3.1% mAP improvement on Cityscapes-to-FoggyCityscapes and 2.9% mAP increase on Sim10k-to-Cityscapes. The code is available at https://github.com/Hlings/AsyFOD. + + + + Federated Learning With Data-Agnostic Distribution Fusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Duan_Federated_Learning_With_Data-Agnostic_Distribution_Fusion_CVPR_2023_paper.pdf + Federated learning has emerged as a promising distributed machine learning paradigm to preserve data privacy. One of the fundamental challenges of federated learning is that data samples across clients are usually not independent and identically distributed (non-IID), leading to slow convergence and severe performance drop of the aggregated global model. To facilitate model aggregation on non-IID data, it is desirable to infer the unknown global distributions without violating privacy protection policy. In this paper, we propose a novel data-agnostic distribution fusion based model aggregation method called FedFusion to optimize federated learning with non-IID local datasets, based on which the heterogeneous clients' data distributions can be represented by a global distribution of several virtual fusion components with different parameters and weights. 
We develop a Variational AutoEncoder (VAE) method to learn the optimal parameters of the distribution fusion components based on limited statistical information extracted from the local models, and apply the derived distribution fusion model to optimize federated model aggregation with non-IID data. Extensive experiments based on various federated learning scenarios with real-world datasets show that FedFusion achieves significant performance improvement compared to the state-of-the-art. + + + + Improving Commonsense in Vision-Language Models via Knowledge Graph Riddles + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_Improving_Commonsense_in_Vision-Language_Models_via_Knowledge_Graph_Riddles_CVPR_2023_paper.pdf + This paper focuses on analyzing and improving the commonsense ability of recent popular vision-language (VL) models. Despite the great success, we observe that existing VL-models still lack commonsense knowledge/reasoning ability (e.g., "Lemons are sour"), which is a vital component towards artificial general intelligence. Through our analysis, we find one important reason is that existing large-scale VL datasets do not contain much commonsense knowledge, which motivates us to improve the commonsense of VL-models from the data perspective. Rather than collecting a new VL training dataset, we propose a more scalable strategy, i.e., "Data Augmentation with kNowledge graph linearization for CommonsensE capability" (DANCE). It can be viewed as one type of data augmentation technique, which can inject commonsense knowledge into existing VL datasets on the fly during training. More specifically, we leverage the commonsense knowledge graph (e.g., ConceptNet) and create variants of text description in VL datasets via bidirectional sub-graph sequentialization. For better commonsense evaluation, we further propose the first retrieval-based commonsense diagnostic benchmark. By conducting extensive experiments on some representative VL-models, we demonstrate that our DANCE technique is able to significantly improve the commonsense ability while maintaining the performance on vanilla retrieval tasks. + + + + S3C: Semi-Supervised VQA Natural Language Explanation via Self-Critical Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Suo_S3C_Semi-Supervised_VQA_Natural_Language_Explanation_via_Self-Critical_Learning_CVPR_2023_paper.pdf + VQA Natural Language Explanation (VQA-NLE) task aims to explain the decision-making process of VQA models in natural language. Unlike traditional attention or gradient analysis, free-text rationales can be easier to understand and gain users' trust. Existing methods mostly use post-hoc or self-rationalization models to obtain a plausible explanation. However, these frameworks are bottlenecked by the following challenges: 1) the reasoning process cannot be faithfully responded to and suffer from the problem of logical inconsistency. 2) Human-annotated explanations are expensive and time-consuming to collect. In this paper, we propose a new Semi-Supervised VQA-NLE via Self-Critical Learning (S3C), which evaluates the candidate explanations by answering rewards to improve the logical consistency between answers and rationales. With a semi-supervised learning framework, the S3C can benefit from a tremendous amount of samples without human-annotated explanations. A large number of automatic measures and human evaluations all show the effectiveness of our method. 
Meanwhile, the framework achieves new state-of-the-art performance on the two VQA-NLE datasets. + + + + Spatio-Focal Bidirectional Disparity Estimation From a Dual-Pixel Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Spatio-Focal_Bidirectional_Disparity_Estimation_From_a_Dual-Pixel_Image_CVPR_2023_paper.pdf + Dual-pixel photography is monocular RGB-D photography with an ultra-high resolution, enabling many applications in computational photography. However, there are still several challenges to fully utilizing dual-pixel photography. Unlike the conventional stereo pair, the dual pixel exhibits a bidirectional disparity that includes positive and negative values, depending on the focus-plane depth in an image. Furthermore, capturing a wide range of dual-pixel disparity requires a shallow depth of field, resulting in a severely blurred image, degrading depth estimation performance. Recently, several data-driven approaches have been proposed to mitigate these two challenges. However, due to the lack of a ground-truth dataset of dual-pixel disparity, existing data-driven methods estimate either an inverse depth map or a blurriness map. In this work, we propose a self-supervised learning method that learns bidirectional disparity by utilizing the nature of anisotropic blur kernels in dual-pixel photography. We observe that the dual-pixel left/right images have reflective-symmetric anisotropic kernels, so their sum is equivalent to that of a conventional image. We take a self-supervised training approach with a novel kernel-split symmetry loss accounting for this phenomenon. Our method does not rely on a training dataset of dual-pixel disparity, which does not exist yet. Our method can estimate a complete disparity map with respect to the focus-plane depth from a dual-pixel image, outperforming the baseline dual-pixel methods. + + + + Rethinking Optical Flow From Geometric Matching Consistent Perspective + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_Rethinking_Optical_Flow_From_Geometric_Matching_Consistent_Perspective_CVPR_2023_paper.pdf + Optical flow estimation is a challenging problem that remains unsolved. Recent deep learning based optical flow models have achieved considerable success. However, these models often train networks from scratch on standard optical flow data, which restricts their ability to robustly and geometrically match image features. In this paper, we propose a rethinking of previous optical flow estimation. In particular, we leverage Geometric Image Matching (GIM) as a pre-training task for optical flow estimation (MatchFlow) to obtain better feature representations, as GIM shares common challenges with optical flow estimation and comes with massive labeled real-world data. Thus, matching static scenes helps to learn more fundamental feature correlations of objects and scenes with consistent displacements. Specifically, the proposed MatchFlow model employs a QuadTree attention-based network pre-trained on MegaDepth to extract coarse features for further flow regression. Extensive experiments show that our model has great cross-dataset generalization. Our method achieves 11.5% and 10.1% error reductions over GMA on the Sintel clean pass and the KITTI test set. At the time of anonymous submission, our MatchFlow(G) enjoys state-of-the-art performance on the Sintel clean and final passes compared to published approaches with comparable computation and memory footprint. Codes and models will be released at https://github.com/DQiaole/MatchFlow.
+ + + + Learning Optical Expansion From Scale Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Ling_Learning_Optical_Expansion_From_Scale_Matching_CVPR_2023_paper.pdf + This paper addresses the problem of optical expansion (OE). OE describes the change in object scale between two frames and is widely used in monocular 3D vision tasks. Previous methods estimate optical expansion mainly from optical flow results, but this two-stage architecture leaves their results limited by the accuracy of the optical flow and less robust. To solve these problems, we propose the concept of 3D optical flow by integrating optical expansion into 2D optical flow, implemented by a plug-and-play module, namely TPCV. TPCV matches features at the correct location and scale, thus allowing the simultaneous optimization of the optical flow and optical expansion tasks. Experimentally, we apply TPCV to the RAFT optical flow baseline. Experimental results show that the baseline optical flow performance is substantially improved. Moreover, we apply the optical flow and optical expansion results to various dynamic 3D vision tasks, including motion-in-depth, time-to-collision, and scene flow, often achieving significant improvements over the prior SOTA. Code will be available at https://github.com/HanLingsgjk/TPCV. + + + + TopDiG: Class-Agnostic Topological Directional Graph Extraction From Remote Sensing Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_TopDiG_Class-Agnostic_Topological_Directional_Graph_Extraction_From_Remote_Sensing_Images_CVPR_2023_paper.pdf + Rapid development in automatic vector extraction from remote sensing images has been witnessed in recent years. However, the vast majority of existing works concentrate on a specific target, are fragile to category variety, and hardly achieve stable performance across different categories. In this work, we propose an innovative class-agnostic model, namely TopDiG, to directly extract topological directional graphs from remote sensing images and solve these issues. Firstly, TopDiG employs a topology-concentrated node detector (TCND) to detect nodes and obtain a compact perception of topological components. Secondly, we propose a dynamic graph supervision (DGS) strategy to dynamically generate adjacency graph labels from unordered nodes. Finally, the directional graph (DiG) generator module is designed to construct topological directional graphs from the predicted nodes. Experiments on the Inria, CrowdAI, GID, GF2 and Massachusetts datasets empirically demonstrate that TopDiG is class-agnostic and achieves competitive performance on all datasets. + + + + StyleIPSB: Identity-Preserving Semantic Basis of StyleGAN for High Fidelity Face Swapping + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_StyleIPSB_Identity-Preserving_Semantic_Basis_of_StyleGAN_for_High_Fidelity_Face_CVPR_2023_paper.pdf + Recent research reveals that StyleGAN can generate highly realistic images, inspiring researchers to use pretrained StyleGAN to generate high-fidelity swapped faces. However, existing methods fail to meet expectations in two essential aspects of high-fidelity face swapping: their results are blurry without pore-level details, and they fail to preserve identity for challenging cases. To overcome the above artifacts, we innovatively construct a series of identity-preserving semantic bases of StyleGAN (called StyleIPSB) with respect to pose, expression, and illumination.
Each basis of StyleIPSB controls one specific semantic attribute and is disentangled from the others. StyleIPSB constrains the style code to a subspace of the W+ space to preserve pore-level details. StyleIPSB gives us a novel tool for high-fidelity face swapping, and we propose a three-stage framework for face swapping with StyleIPSB. Firstly, we transform the target facial images' attributes to the source image. We learn the mapping from 3D Morphable Model (3DMM) parameters, which capture the prominent semantic variance, to the coordinates of StyleIPSB, which show higher identity preservation and fidelity. Secondly, to transform detailed attributes which 3DMM does not capture, we learn the residual attribute between the reenacted face and the target face. Finally, the face is blended into the background of the target image. Extensive results and comparisons demonstrate that StyleIPSB can effectively preserve identity and pore-level details, and our face swapping results achieve state-of-the-art performance. We will release our code at https://github.com/a686432/StyleIPSB. + + + + Unknown Sniffer for Object Detection: Don't Turn a Blind Eye to Unknown Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Liang_Unknown_Sniffer_for_Object_Detection_Dont_Turn_a_Blind_Eye_CVPR_2023_paper.pdf + The recently proposed open-world object detection and open-set detection have achieved a breakthrough in finding never-seen-before objects and distinguishing them from known ones. However, their studies on knowledge transfer from known classes to unknown ones are not deep enough, resulting in a limited capability for detecting unknowns hidden in the background. In this paper, we propose the unknown sniffer (UnSniffer) to find both unknown and known objects. Firstly, the generalized object confidence (GOC) score is introduced, which only uses known samples for supervision and avoids improper suppression of unknowns in the background. Significantly, such a confidence score learned from known objects can be generalized to unknown ones. Additionally, we propose a negative energy suppression loss to further suppress the non-object samples in the background. Next, the best box of each unknown is hard to obtain during inference due to the lack of its semantic information in training. To solve this issue, we introduce a graph-based determination scheme to replace the hand-designed non-maximum suppression (NMS) post-processing. Finally, we present the Unknown Object Detection Benchmark, to our knowledge the first public benchmark that encompasses precision evaluation for unknown detection. Experiments show that our method is far better than the existing state-of-the-art methods. Code is available at: https://github.com/Went-Liang/UnSniffer. + + + + Multi-Concept Customization of Text-to-Image Diffusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Kumari_Multi-Concept_Customization_of_Text-to-Image_Diffusion_CVPR_2023_paper.pdf + While generative models produce high-quality images of concepts learned from a large-scale database, a user often wishes to synthesize instantiations of their own concepts (for example, their family, pets, or items). Can we teach a model to quickly acquire a new concept, given a few examples? Furthermore, can we compose multiple new concepts together? We propose Custom Diffusion, an efficient method for augmenting existing text-to-image models.
We find that only optimizing a few parameters in the text-to-image conditioning mechanism is sufficiently powerful to represent new concepts while enabling fast tuning (~6 minutes). Additionally, we can jointly train for multiple concepts or combine multiple fine-tuned models into one via closed-form constrained optimization. Our fine-tuned model generates variations of multiple new concepts and seamlessly composes them with existing concepts in novel settings. Our method outperforms or performs on par with several baselines and concurrent works in both qualitative and quantitative evaluations, while being memory and computationally efficient. + + + + LinK: Linear Kernel for LiDAR-Based 3D Perception + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_LinK_Linear_Kernel_for_LiDAR-Based_3D_Perception_CVPR_2023_paper.pdf + Extending the success of 2D large kernels to 3D perception is challenging due to: 1. the cubically-increasing overhead in processing 3D data; 2. the optimization difficulties arising from data scarcity and sparsity. Previous work has taken the first step to scale up the kernel size from 3x3x3 to 7x7x7 by introducing block-shared weights. However, to reduce the feature variations within a block, it only employs a modest block size and fails to achieve larger kernels such as 21x21x21. To address this issue, we propose a new method, called LinK, to achieve a wider-range perception receptive field in a convolution-like manner with two core designs. The first is to replace the static kernel matrix with a linear kernel generator, which adaptively provides weights only for non-empty voxels. The second is to reuse the pre-computed aggregation results in the overlapped blocks to reduce computational complexity. The proposed method successfully enables each voxel to perceive context within a range of 21x21x21. Extensive experiments on two basic perception tasks, 3D object detection and 3D semantic segmentation, demonstrate the effectiveness of our method. Notably, we rank 1st on the public leaderboard of the 3D detection benchmark of nuScenes (LiDAR track) by simply incorporating a LinK-based backbone into the basic detector, CenterPoint. We also boost the strong segmentation baseline's mIoU by 2.7% on the SemanticKITTI test set. Code is available at https://github.com/MCG-NJU/LinK. + + + + CP3: Channel Pruning Plug-In for Point-Based Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_CP3_Channel_Pruning_Plug-In_for_Point-Based_Networks_CVPR_2023_paper.pdf + Channel pruning has been widely studied as a prevailing method that effectively reduces both the computational cost and memory footprint of the original network while keeping comparable accuracy. Though great success has been achieved in channel pruning for 2D image-based convolutional networks (CNNs), existing works seldom extend channel pruning methods to 3D point-based neural networks (PNNs). Directly applying 2D CNN channel pruning methods to PNNs undermines the performance of PNNs because of the different representations of 2D images and 3D point clouds as well as the network architecture disparity. In this paper, we propose CP^3, a Channel Pruning Plug-in for Point-based networks. CP^3 is elaborately designed to leverage the characteristics of point clouds and PNNs in order to enable 2D channel pruning methods for PNNs.
Specifically, it presents a coordinate-enhanced channel importance metric to reflect the correlation between dimensional information and individual channel features, and it recycles the discarded points in PNN's sampling process and reconsiders their potentially-exclusive information to enhance the robustness of channel pruning. Experiments on various PNN architectures show that CP^3 constantly improves state-of-the-art 2D CNN pruning approaches on different point cloud tasks. For instance, our compressed PointNeXt-S on ScanObjectNN achieves an accuracy of 88.52% with a pruning rate of 57.8%, outperforming the baseline pruning methods with an accuracy gain of 1.94%. + + + + Two-Way Multi-Label Loss + http://openaccess.thecvf.com//content/CVPR2023/papers/Kobayashi_Two-Way_Multi-Label_Loss_CVPR_2023_paper.pdf + A natural image frequently contains multiple classification targets, accordingly providing multiple class labels rather than a single label per image. While the single-label classification is effectively addressed by applying a softmax cross-entropy loss, the multi-label task is tackled mainly in a binary cross-entropy (BCE) framework. In contrast to the softmax loss, the BCE loss involves issues regarding imbalance as multiple classes are decomposed into a bunch of binary classifications; recent works improve the BCE loss to cope with the issue by means of weighting. In this paper, we propose a multi-label loss by bridging a gap between the softmax loss and the multi-label scenario. The proposed loss function is formulated on the basis of relative comparison among classes which also enables us to further improve discriminative power of features by enhancing classification margin. The loss function is so flexible as to be applicable to a multi-label setting in two ways for discriminating classes as well as samples. In the experiments on multi-label classification, the proposed method exhibits competitive performance to the other multi-label losses, and it also provides transferrable features on single-label ImageNet training. Codes are available at https://github.com/tk1980/TwowayMultiLabelLoss. + + + + Where Is My Wallet? Modeling Object Proposal Sets for Egocentric Visual Query Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Where_Is_My_Wallet_Modeling_Object_Proposal_Sets_for_Egocentric_CVPR_2023_paper.pdf + This paper deals with the problem of localizing objects in image and video datasets from visual exemplars. In particular, we focus on the challenging problem of egocentric visual query localization. We first identify grave implicit biases in current query-conditioned model design and visual query datasets. Then, we directly tackle such biases at both frame and object set levels. Concretely, our method solves these issues by expanding limited annotations and dynamically dropping object proposals during training. Additionally, we propose a novel transformer-based module that allows for object-proposal set context to be considered while incorporating query information. We name our module Conditioned Contextual Transformer or CocoFormer. Our experiments show that the proposed adaptations improve egocentric query detection, leading to a better visual query localization system in both 2D and 3D configurations. Thus, we are able to improve frame-level detection performance from 26.28% to 31.26% in AP, which correspondingly improves the VQ2D and VQ3D localization scores by significant margins. 
Our improved context-aware query object detector ranked first and second in the VQ2D and VQ3D tasks in the 2nd Ego4D challenge. In addition, we showcase the relevance of our proposed model in the Few-Shot Detection (FSD) task, where we also achieve SOTA results. + + + + ReDirTrans: Latent-to-Latent Translation for Gaze and Head Redirection + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_ReDirTrans_Latent-to-Latent_Translation_for_Gaze_and_Head_Redirection_CVPR_2023_paper.pdf + Learning-based gaze estimation methods require large amounts of training data with accurate gaze annotations. Facing such demanding requirements of gaze data collection and annotation, several image synthesis methods were proposed, which successfully redirected gaze directions precisely given the assigned conditions. However, these methods focused on changing gaze directions of the images that only include eyes or restricted ranges of faces with low resolution (less than 128*128) to largely reduce interference from other attributes such as hairs, which limits application scenarios. To cope with this limitation, we proposed a portable network, called ReDirTrans, achieving latent-to-latent translation for redirecting gaze directions and head orientations in an interpretable manner. ReDirTrans projects input latent vectors into aimed-attribute embeddings only and redirects these embeddings with assigned pitch and yaw values. Then both the initial and edited embeddings are projected back (deprojected) to the initial latent space as residuals to modify the input latent vectors by subtraction and addition, representing old status removal and new status addition. The projection of aimed attributes only and subtraction-addition operations for status replacement essentially mitigate impacts on other attributes and the distribution of latent vectors. Thus, by combining ReDirTrans with a pretrained fixed e4e-StyleGAN pair, we created ReDirTrans-GAN, which enables accurately redirecting gaze in full-face images with 1024*1024 resolution while preserving other attributes such as identity, expression, and hairstyle. Furthermore, we presented improvements for the downstream learning-based gaze estimation task, using redirected samples as dataset augmentation. + + + + Noisy Correspondence Learning With Meta Similarity Correction + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_Noisy_Correspondence_Learning_With_Meta_Similarity_Correction_CVPR_2023_paper.pdf + Despite the success of multimodal learning in cross-modal retrieval task, the remarkable progress relies on the correct correspondence among multimedia data. However, collecting such ideal data is expensive and time-consuming. In practice, most widely used datasets are harvested from the Internet and inevitably contain mismatched pairs. Training on such noisy correspondence datasets causes performance degradation because the cross-modal retrieval methods can wrongly enforce the mismatched data to be similar. To tackle this problem, we propose a Meta Similarity Correction Network (MSCN) to provide reliable similarity scores. We view a binary classification task as the meta-process that encourages the MSCN to learn discrimination from positive and negative meta-data. To further alleviate the influence of noise, we design an effective data purification strategy using meta-data as prior knowledge to remove the noisy samples. 
Extensive experiments are conducted to demonstrate the strengths of our method in both synthetic and real-world noises, including Flickr30K, MS-COCO, and Conceptual Captions. + + + + Rethinking Video ViTs: Sparse Video Tubes for Joint Image and Video Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Piergiovanni_Rethinking_Video_ViTs_Sparse_Video_Tubes_for_Joint_Image_and_CVPR_2023_paper.pdf + We present a simple approach which can turn a ViT encoder into an efficient video model, which can seamlessly work with both image and video inputs. By sparsely sampling the inputs, the model is able to do training and inference from both inputs. The model is easily scalable and can be adapted to large-scale pre-trained ViTs without requiring full finetuning. The model achieves SOTA results. + + + + Unsupervised Continual Semantic Adaptation Through Neural Rendering + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Unsupervised_Continual_Semantic_Adaptation_Through_Neural_Rendering_CVPR_2023_paper.pdf + An increasing amount of applications rely on data-driven models that are deployed for perception tasks across a sequence of scenes. Due to the mismatch between training and deployment data, adapting the model on the new scenes is often crucial to obtain good performance. In this work, we study continual multi-scene adaptation for the task of semantic segmentation, assuming that no ground-truth labels are available during deployment and that performance on the previous scenes should be maintained. We propose training a Semantic-NeRF network for each scene by fusing the predictions of a segmentation model and then using the view-consistent rendered semantic labels as pseudo-labels to adapt the model. Through joint training with the segmentation model, the Semantic-NeRF model effectively enables 2D-3D knowledge transfer. Furthermore, due to its compact size, it can be stored in a long-term memory and subsequently used to render data from arbitrary viewpoints to reduce forgetting. We evaluate our approach on ScanNet, where we outperform both a voxel-based baseline and a state-of-the-art unsupervised domain adaptation method. + + + + Multi-View Adversarial Discriminator: Mine the Non-Causal Factors for Object Detection in Unseen Domains + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Multi-View_Adversarial_Discriminator_Mine_the_Non-Causal_Factors_for_Object_Detection_CVPR_2023_paper.pdf + Domain shift degrades the performance of object detection models in practical applications. To alleviate the influence of domain shift, plenty of previous work try to decouple and learn the domain-invariant (common) features from source domains via domain adversarial learning (DAL). However, inspired by causal mechanisms, we find that previous methods ignore the implicit insignificant non-causal factors hidden in the common features. This is mainly due to the single-view nature of DAL. In this work, we present an idea to remove non-causal factors from common features by multi-view adversarial training on source domains, because we observe that such insignificant non-causal factors may still be significant in other latent spaces (views) due to the multi-mode structure of data. 
To summarize, we propose a Multi-view Adversarial Discriminator (MAD) based domain generalization model, consisting of a Spurious Correlations Generator (SCG) that increases the diversity of source domain by random augmentation and a Multi-View Domain Classifier (MVDC) that maps features to multiple latent spaces, such that the non-causal factors are removed and the domain-invariant features are purified. Extensive experiments on six benchmarks show our MAD obtains state-of-the-art performance. + + + + Instance Relation Graph Guided Source-Free Domain Adaptive Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/VS_Instance_Relation_Graph_Guided_Source-Free_Domain_Adaptive_Object_Detection_CVPR_2023_paper.pdf + Unsupervised Domain Adaptation (UDA) is an effective approach to tackle the issue of domain shift. Specifically, UDA methods try to align the source and target representations to improve generalization on the target domain. Further, UDA methods work under the assumption that the source data is accessible during the adaptation process. However, in real-world scenarios, the labelled source data is often restricted due to privacy regulations, data transmission constraints, or proprietary data concerns. The Source-Free Domain Adaptation (SFDA) setting aims to alleviate these concerns by adapting a source-trained model for the target domain without requiring access to the source data. In this paper, we explore the SFDA setting for the task of adaptive object detection. To this end, we propose a novel training strategy for adapting a source-trained object detector to the target domain without source data. More precisely, we design a novel contrastive loss to enhance the target representations by exploiting the objects relations for a given target domain input. These object instance relations are modelled using an Instance Relation Graph (IRG) network, which are then used to guide the contrastive representation learning. In addition, we utilize a student-teacher to effectively distill knowledge from source-trained model to target domain. Extensive experiments on multiple object detection benchmark datasets show that the proposed approach is able to efficiently adapt source-trained object detectors to the target domain, outperforming state-of-the-art domain adaptive detection methods. Code and models are provided in https://viudomain.github.io/irg-sfda-web/ + + + + Instant Multi-View Head Capture Through Learnable Registration + http://openaccess.thecvf.com//content/CVPR2023/papers/Bolkart_Instant_Multi-View_Head_Capture_Through_Learnable_Registration_CVPR_2023_paper.pdf + Existing methods for capturing datasets of 3D heads in dense semantic correspondence are slow and commonly address the problem in two separate steps; multi-view stereo (MVS) reconstruction followed by non-rigid registration. To simplify this process, we introduce TEMPEH (Towards Estimation of 3D Meshes from Performances of Expressive Heads) to directly infer 3D heads in dense correspondence from calibrated multi-view images. Registering datasets of 3D scans typically requires manual parameter tuning to find the right balance between accurately fitting the scans' surfaces and being robust to scanning noise and outliers. Instead, we propose to jointly register a 3D head dataset while training TEMPEH. Specifically, during training, we minimize a geometric loss commonly used for surface registration, effectively leveraging TEMPEH as a regularizer. 
Our multi-view head inference builds on a volumetric feature representation that samples and fuses features from each view using camera calibration information. To account for partial occlusions and a large capture volume that enables head movements, we use view- and surface-aware feature fusion, and a spatial transformer-based head localization module, respectively. We use raw MVS scans as supervision during training, but, once trained, TEMPEH directly predicts 3D heads in dense correspondence without requiring scans. Predicting one head takes about 0.3 seconds with a median reconstruction error of 0.26 mm, 64% lower than the current state-of-the-art. This enables the efficient capture of large datasets containing multiple people and diverse facial motions. Code, model, and data are publicly available at https://tempeh.is.tue.mpg.de. + + + + GINA-3D: Learning To Generate Implicit Neural Assets in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_GINA-3D_Learning_To_Generate_Implicit_Neural_Assets_in_the_Wild_CVPR_2023_paper.pdf + Modeling the 3D world from sensor data for simulation is a scalable way of developing testing and validation environments for robotic learning problems such as autonomous driving. However, manually creating or re-creating real-world-like environments is difficult, expensive, and not scalable. Recent generative model techniques have shown promising progress to address such challenges by learning 3D assets using only plentiful 2D images -- but still suffer limitations as they leverage either human-curated image datasets or renderings from manually-created synthetic 3D environments. In this paper, we introduce GINA-3D, a generative model that uses real-world driving data from camera and LiDAR sensors to create photo-realistic 3D implicit neural assets of diverse vehicles and pedestrians. Compared to the existing image datasets, the real-world driving setting poses new challenges due to occlusions, lighting-variations and long-tail distributions. GINA-3D tackles these challenges by decoupling representation learning and generative modeling into two stages with a learned tri-plane latent structure, inspired by recent advances in generative modeling of images. To evaluate our approach, we construct a large-scale object-centric dataset containing over 520K images of vehicles and pedestrians from the Waymo Open Dataset, and a new set of 80K images of long-tail instances such as construction equipment, garbage trucks, and cable cars. We compare our model with existing approaches and demonstrate that it achieves state-of-the-art performance in quality and diversity for both generated images and geometries. + + + + Consistent Direct Time-of-Flight Video Depth Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Consistent_Direct_Time-of-Flight_Video_Depth_Super-Resolution_CVPR_2023_paper.pdf + Direct time-of-flight (dToF) sensors are promising for next-generation on-device 3D sensing. However, limited by manufacturing capabilities in a compact module, the dToF data has low spatial resolution (e.g., 20x30 for iPhone dToF), and it requires a super-resolution step before being passed to downstream tasks. In this paper, we solve this super-resolution problem by fusing the low-resolution dToF data with the corresponding high-resolution RGB guidance. 
Unlike the conventional RGB-guided depth enhancement approaches which perform the fusion in a per-frame manner, we propose the first multi-frame fusion scheme to mitigate the spatial ambiguity resulting from the low-resolution dToF imaging. In addition, dToF sensors provide unique depth histogram information for each local patch, and we incorporate this dToF-specific feature in our network design to further alleviate spatial ambiguity. To evaluate our models on complex dynamic indoor environments and to provide a large-scale dToF sensor dataset, we introduce DyDToF, the first synthetic RGB-dToF video dataset that features dynamic objects and a realistic dToF simulator following the physical imaging process. We believe the methods and dataset are beneficial to a broad community as dToF depth sensing is becoming mainstream on mobile devices. Our code and data are publicly available. https://github.com/facebookresearch/DVSR/ + + + + Crossing the Gap: Domain Generalization for Image Captioning + http://openaccess.thecvf.com//content/CVPR2023/papers/Ren_Crossing_the_Gap_Domain_Generalization_for_Image_Captioning_CVPR_2023_paper.pdf + Existing image captioning methods are under the assumption that the training and testing data are from the same domain or that the data from the target domain (i.e., the domain that testing data lie in) are accessible. However, this assumption is invalid in real-world applications where the data from the target domain is inaccessible. In this paper, we introduce a new setting called Domain Generalization for Image Captioning (DGIC), where the data from the target domain is unseen in the learning process. We first construct a benchmark dataset for DGIC, which helps us to investigate models' domain generalization (DG) ability on unseen domains. With the support of the new benchmark, we further propose a new framework called language-guided semantic metric learning (LSML) for the DGIC setting. Experiments on multiple datasets demonstrate the challenge of the task and the effectiveness of our newly proposed benchmark and LSML framework. + + + + Probabilistic Prompt Learning for Dense Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Kwon_Probabilistic_Prompt_Learning_for_Dense_Prediction_CVPR_2023_paper.pdf + Recent progress in deterministic prompt learning has become a promising alternative to various downstream vision tasks, enabling models to learn powerful visual representations with the help of pre-trained vision-language models. However, this approach results in limited performance for dense prediction tasks that require handling more complex and diverse objects, since a single and deterministic description cannot sufficiently represent the entire image. In this paper, we present a novel probabilistic prompt learning to fully exploit the vision-language knowledge in dense prediction tasks. First, we introduce learnable class-agnostic attribute prompts to describe universal attributes across the object class. The attributes are combined with class information and visual-context knowledge to define the class-specific textual distribution. Text representations are sampled and used to guide the dense prediction task using the probabilistic pixel-text matching loss, enhancing the stability and generalization capability of the proposed method. Extensive experiments on different dense prediction tasks and ablation studies demonstrate the effectiveness of our proposed method. 
+ + + + Exploring Intra-Class Variation Factors With Learnable Cluster Prompts for Semi-Supervised Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Exploring_Intra-Class_Variation_Factors_With_Learnable_Cluster_Prompts_for_Semi-Supervised_CVPR_2023_paper.pdf + Semi-supervised class-conditional image synthesis is typically performed by inferring and injecting class labels into a conditional Generative Adversarial Network (GAN). The supervision in the form of class identity may be inadequate to model classes with diverse visual appearances. In this paper, we propose a Learnable Cluster Prompt-based GAN (LCP-GAN) to capture class-wise characteristics and intra-class variation factors with a broader source of supervision. To exploit partially labeled data, we perform soft partitioning on each class, and explore the possibility of associating intra-class clusters with learnable visual concepts in the feature space of a pre-trained language-vision model, e.g., CLIP. For class-conditional image generation, we design a cluster-conditional generator by injecting a combination of intra-class cluster label embeddings, and further incorporate a real-fake classification head on top of CLIP to distinguish real instances from the synthesized ones, conditioned on the learnable cluster prompts. This significantly strengthens the generator with more semantic language supervision. LCP-GAN not only possesses superior generation capability but also matches the performance of the fully supervised version of the base models: BigGAN and StyleGAN2-ADA, on multiple standard benchmarks. + + + + NeAT: Learning Neural Implicit Surfaces With Arbitrary Topologies From Multi-View Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Meng_NeAT_Learning_Neural_Implicit_Surfaces_With_Arbitrary_Topologies_From_Multi-View_CVPR_2023_paper.pdf + Recent progress in neural implicit functions has set new state-of-the-art in reconstructing high-fidelity 3D shapes from a collection of images. However, these approaches are limited to closed surfaces as they require the surface to be represented by a signed distance field. In this paper, we propose NeAT, a new neural rendering framework that can learn implicit surfaces with arbitrary topologies from multi-view images. In particular, NeAT represents the 3D surface as a level set of a signed distance function (SDF) with a validity branch for estimating the surface existence probability at the query positions. We also develop a novel neural volume rendering method, which uses SDF and validity to calculate the volume opacity and avoids rendering points with low validity. NeAT supports easy field-to-mesh conversion using the classic Marching Cubes algorithm. Extensive experiments on DTU, MGN, and Deep Fashion 3D datasets indicate that our approach is able to faithfully reconstruct both watertight and non-watertight surfaces. In particular, NeAT significantly outperforms the state-of-the-art methods in the task of open surface reconstruction both quantitatively and qualitatively. + + + + SPARF: Neural Radiance Fields From Sparse and Noisy Poses + http://openaccess.thecvf.com//content/CVPR2023/papers/Truong_SPARF_Neural_Radiance_Fields_From_Sparse_and_Noisy_Poses_CVPR_2023_paper.pdf + Neural Radiance Field (NeRF) has recently emerged as a powerful representation to synthesize photorealistic novel views. 
While showing impressive performance, it relies on the availability of dense input views with highly accurate camera poses, thus limiting its application in real-world scenarios. In this work, we introduce Sparse Pose Adjusting Radiance Field (SPARF), to address the challenge of novel-view synthesis given only few wide-baseline input images (as low as 3) with noisy camera poses. Our approach exploits multi-view geometry constraints in order to jointly learn the NeRF and refine the camera poses. By relying on pixel matches extracted between the input views, our multi-view correspondence objective enforces the optimized scene and camera poses to converge to a global and geometrically accurate solution. Our depth consistency loss further encourages the reconstructed scene to be consistent from any viewpoint. Our approach sets a new state of the art in the sparse-view regime on multiple challenging datasets. + + + + Local Implicit Normalizing Flow for Arbitrary-Scale Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Yao_Local_Implicit_Normalizing_Flow_for_Arbitrary-Scale_Image_Super-Resolution_CVPR_2023_paper.pdf + Flow-based methods have demonstrated promising results in addressing the ill-posed nature of super-resolution (SR) by learning the distribution of high-resolution (HR) images with the normalizing flow. However, these methods can only perform a predefined fixed-scale SR, limiting their potential in real-world applications. Meanwhile, arbitrary-scale SR has gained more attention and achieved great progress. Nonetheless, previous arbitrary-scale SR methods ignore the ill-posed problem and train the model with per-pixel L1 loss, leading to blurry SR outputs. In this work, we propose "Local Implicit Normalizing Flow" (LINF) as a unified solution to the above problems. LINF models the distribution of texture details under different scaling factors with normalizing flow. Thus, LINF can generate photo-realistic HR images with rich texture details in arbitrary scale factors. We evaluate LINF with extensive experiments and show that LINF achieves the state-of-the-art perceptual quality compared with prior arbitrary-scale SR methods. + + + + Texts as Images in Prompt Tuning for Multi-Label Image Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Texts_as_Images_in_Prompt_Tuning_for_Multi-Label_Image_Recognition_CVPR_2023_paper.pdf + Prompt tuning has been employed as an efficient way to adapt large vision-language pre-trained models (e.g. CLIP) to various downstream tasks in data-limited or label-limited settings. Nonetheless, visual data (e.g., images) is by default prerequisite for learning prompts in existing methods. In this work, we advocate that the effectiveness of image-text contrastive learning in aligning the two modalities (for training CLIP) further makes it feasible to treat texts as images for prompt tuning and introduce TaI prompting. In contrast to the visual data, text descriptions are easy to collect, and their class labels can be directly derived. Particularly, we apply TaI prompting to multi-label image recognition, where sentences in the wild serve as alternatives to images for prompt tuning. Moreover, with TaI, double-grained prompt tuning (TaI-DPT) is further presented to extract both coarse-grained and fine-grained embeddings for enhancing the multi-label recognition performance. 
Experimental results show that our proposed TaI-DPT outperforms zero-shot CLIP by a large margin on multiple benchmarks, e.g., MS-COCO, VOC2007, and NUS-WIDE, while it can be combined with existing methods of prompting from images to improve recognition performance further. The code is released at https://github.com/guozix/TaI-DPT. + + + + Self-Correctable and Adaptable Inference for Generalizable Human Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kan_Self-Correctable_and_Adaptable_Inference_for_Generalizable_Human_Pose_Estimation_CVPR_2023_paper.pdf + A central challenge in human pose estimation, as well as in many other machine learning and prediction tasks, is the generalization problem. The learned network does not have the capability to characterize the prediction error, generate feedback information from the test sample, and correct the prediction error on the fly for each individual test sample, which results in degraded performance in generalization. In this work, we introduce a self-correctable and adaptable inference (SCAI) method to address the generalization challenge of network prediction and use human pose estimation as an example to demonstrate its effectiveness and performance. We learn a correction network to correct the prediction result conditioned by a fitness feedback error. This feedback error is generated by a learned fitness feedback network which maps the prediction result to the original input domain and compares it against the original input. Interestingly, we find that this self-referential feedback error is highly correlated with the actual prediction error. This strong correlation suggests that we can use this error as feedback to guide the correction process. It can be also used as a loss function to quickly adapt and optimize the correction network during the inference process. Our extensive experimental results on human pose estimation demonstrate that the proposed SCAI method is able to significantly improve the generalization capability and performance of human pose estimation. + + + + GradMA: A Gradient-Memory-Based Accelerated Federated Learning With Alleviated Catastrophic Forgetting + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_GradMA_A_Gradient-Memory-Based_Accelerated_Federated_Learning_With_Alleviated_Catastrophic_Forgetting_CVPR_2023_paper.pdf + Federated Learning (FL) has emerged as a de facto machine learning area and received rapid increasing research interests from the community. However, catastrophic forgetting caused by data heterogeneity and partial participation poses distinctive challenges for FL, which are detrimental to the performance. To tackle the problems, we propose a new FL approach (namely GradMA), which takes inspiration from continual learning to simultaneously correct the server-side and worker-side update directions as well as take full advantage of server's rich computing and memory resources. Furthermore, we elaborate a memory reduction strategy to enable GradMA to accommodate FL with a large scale of workers. We then analyze convergence of GradMA theoretically under the smooth non-convex setting and show that its convergence rate achieves a linear speed up w.r.t the increasing number of sampled active workers. At last, our extensive experiments on various image classification tasks show that GradMA achieves significant performance gains in accuracy and communication efficiency compared to SOTA baselines. We provide our code here: https://github.com/lkyddd/GradMA. 
+ + + + POTTER: Pooling Attention Transformer for Efficient Human Mesh Recovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_POTTER_Pooling_Attention_Transformer_for_Efficient_Human_Mesh_Recovery_CVPR_2023_paper.pdf + Transformer architectures have achieved SOTA performance on the human mesh recovery (HMR) from monocular images. However, the performance gain has come at the cost of substantial memory and computational overhead. A lightweight and efficient model to reconstruct accurate human mesh is needed for real-world applications. In this paper, we propose a pure transformer architecture named POoling aTtention TransformER (POTTER) for the HMR task from single images. Observing that the conventional attention module is memory and computationally expensive, we propose an efficient pooling attention module, which significantly reduces the memory and computational cost without sacrificing performance. Furthermore, we design a new transformer architecture by integrating a High-Resolution (HR) stream for the HMR task. The high-resolution local and global features from the HR stream can be utilized for recovering more accurate human mesh. Our POTTER outperforms the SOTA method METRO by only requiring 7% of total parameters and 14% of the Multiply-Accumulate Operations on the Human3.6M (PA-MPJPE) and 3DPW (all three metrics) datasets. Code will be publicly available. + + + + Learning Detailed Radiance Manifolds for High-Fidelity and 3D-Consistent Portrait Synthesis From Monocular Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Deng_Learning_Detailed_Radiance_Manifolds_for_High-Fidelity_and_3D-Consistent_Portrait_Synthesis_CVPR_2023_paper.pdf + A key challenge for novel view synthesis of monocular portrait images is 3D consistency under continuous pose variations. Most existing methods rely on 2D generative models which often leads to obvious 3D inconsistency artifacts. We present a 3D-consistent novel view synthesis approach for monocular portrait images based on a recent proposed 3D-aware GAN, namely Generative Radiance Manifolds (GRAM), which has shown strong 3D consistency at multiview image generation of virtual subjects via the radiance manifolds representation. However, simply learning an encoder to map a real image into the latent space of GRAM can only reconstruct coarse radiance manifolds without faithful fine details, while improving the reconstruction fidelity via instance-specific optimization is time-consuming. We introduce a novel detail manifolds reconstructor to learn 3D-consistent fine details on the radiance manifolds from monocular images, and combine them with the coarse radiance manifolds for high-fidelity reconstruction. The 3D priors derived from the coarse radiance manifolds are used to regulate the learned details to ensure reasonable synthesized results at novel views. Trained on in-the-wild 2D images, our method achieves high-fidelity and 3D-consistent portrait synthesis largely outperforming the prior art. Project page: https://yudeng.github.io/GRAMInverter/ + + + + Patch-Craft Self-Supervised Training for Correlated Image Denoising + http://openaccess.thecvf.com//content/CVPR2023/papers/Vaksman_Patch-Craft_Self-Supervised_Training_for_Correlated_Image_Denoising_CVPR_2023_paper.pdf + Supervised neural networks are known to achieve excellent results in various image restoration tasks. However, such training requires datasets composed of pairs of corrupted images and their corresponding ground truth targets. 
Unfortunately, such data is not available in many applications. For the task of image denoising in which the noise statistics are unknown, several self-supervised training methods have been proposed for overcoming this difficulty. Some of these require knowledge of the noise model, while others assume that the contaminating noise is uncorrelated; both assumptions are too limiting for many practical needs. This work proposes a novel self-supervised training technique suitable for the removal of unknown correlated noise. The proposed approach neither requires knowledge of the noise model nor access to ground truth targets. The input to our algorithm consists of easily captured bursts of noisy shots. Our algorithm constructs artificial patch-craft images from these bursts by patch matching and stitching, and the obtained crafted images are used as targets for the training. Our method does not require registration of the different images within the burst. We evaluate the proposed framework through extensive experiments with synthetic and real image noise. + + + + DistilPose: Tokenized Pose Regression With Heatmap Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_DistilPose_Tokenized_Pose_Regression_With_Heatmap_Distillation_CVPR_2023_paper.pdf + In the field of human pose estimation, regression-based methods have been dominant in terms of speed, while heatmap-based methods are far ahead in terms of performance. How to take advantage of both schemes remains a challenging problem. In this paper, we propose a novel human pose estimation framework termed DistilPose, which bridges the gap between heatmap-based and regression-based methods. Specifically, DistilPose maximizes the transfer of knowledge from the teacher model (heatmap-based) to the student model (regression-based) through a Token-distilling Encoder (TDE) and Simulated Heatmaps. TDE aligns the feature spaces of heatmap-based and regression-based models by introducing tokenization, while Simulated Heatmaps transfer explicit guidance (distribution and confidence) from teacher heatmaps into student models. Extensive experiments show that the proposed DistilPose can significantly improve the performance of the regression-based models while maintaining efficiency. Specifically, on the MSCOCO validation dataset, DistilPose-S obtains 71.6% mAP with 5.36M parameters, 2.38 GFLOPs and 40.2 FPS, which saves 12.95x in parameters and 7.16x in computational cost and is 4.9x faster than its teacher model with only a 0.9-point performance drop. Furthermore, DistilPose-L obtains 74.4% mAP on the MSCOCO validation dataset, achieving a new state-of-the-art among predominant regression-based models. + + + + Neural Volumetric Memory for Visual Locomotion Control + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Neural_Volumetric_Memory_for_Visual_Locomotion_Control_CVPR_2023_paper.pdf + Legged robots have the potential to expand the reach of autonomy beyond paved roads. In this work, we consider the difficult problem of locomotion on challenging terrains using a single forward-facing depth camera. Due to the partial observability of the problem, the robot has to rely on past observations to infer the terrain currently beneath it. To solve this problem, we follow the paradigm in computer vision that explicitly models the 3D geometry of the scene and propose Neural Volumetric Memory (NVM), a geometric memory architecture that explicitly accounts for the SE(3) equivariance of the 3D world.
NVM aggregates feature volumes from multiple camera views by first bringing them back to the ego-centric frame of the robot. We test the learned visual-locomotion policy on a physical robot and show that our approach, learning legged locomotion with neural volumetric memory, produces performance gains over prior works on challenging terrains. We include ablation studies and show that the representations stored in the neural volumetric memory capture sufficient geometric information to reconstruct the scene. Our project page with videos is https://rchalyang.github.io/NVM/ + + + + Propagate and Calibrate: Real-Time Passive Non-Line-of-Sight Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Propagate_and_Calibrate_Real-Time_Passive_Non-Line-of-Sight_Tracking_CVPR_2023_paper.pdf + Non-line-of-sight (NLOS) tracking has drawn increasing attention in recent years, due to its ability to detect object motion out of sight. Most previous works on NLOS tracking rely on active illumination, e.g., laser, and suffer from high cost and elaborate experimental conditions. Besides, these techniques are still far from practical application due to oversimplified settings. In contrast, we propose a purely passive method to track a person walking in an invisible room by only observing a relay wall, which is more in line with real application scenarios, e.g., security. To excavate imperceptible changes in videos of the relay wall, we introduce difference frames as an essential carrier of temporal-local motion messages. In addition, we propose PAC-Net, which consists of alternating propagation and calibration, making it capable of leveraging both dynamic and static messages on a frame-level granularity. To evaluate the proposed method, we build and publish the first dynamic passive NLOS tracking dataset, NLOS-Track, which fills the vacuum of realistic NLOS datasets. NLOS-Track contains thousands of NLOS video clips and corresponding trajectories. Both real-shot and synthetic data are included. Our codes and dataset are available at https://againstentropy.github.io/NLOS-Track/. + + + + Learning Decorrelated Representations Efficiently Using Fast Fourier Transform + http://openaccess.thecvf.com//content/CVPR2023/papers/Shigeto_Learning_Decorrelated_Representations_Efficiently_Using_Fast_Fourier_Transform_CVPR_2023_paper.pdf + Barlow Twins and VICReg are self-supervised representation learning models that use regularizers to decorrelate features. Although these models are as effective as conventional representation learning models, their training can be computationally demanding if the dimension d of the projected embeddings is high. As the regularizers are defined in terms of individual elements of a cross-correlation or covariance matrix, computing the loss for n samples takes O(n d^2) time. In this paper, we propose a relaxed decorrelating regularizer that can be computed in O(n d log d) time by Fast Fourier Transform. We also propose an inexpensive technique to mitigate undesirable local minima that develop with the relaxation. The proposed regularizer exhibits accuracy comparable to that of existing regularizers in downstream tasks, whereas their training requires less memory and is faster for large d. The source code is available. + + + + Two-Shot Video Object Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_Two-Shot_Video_Object_Segmentation_CVPR_2023_paper.pdf + Previous works on video object segmentation (VOS) are trained on densely annotated videos. 
Nevertheless, acquiring annotations in pixel level is expensive and time-consuming. In this work, we demonstrate the feasibility of training a satisfactory VOS model on sparsely annotated videos--we merely require two labeled frames per training video while the performance is sustained. We term this novel training paradigm as two-shot video object segmentation, or two-shot VOS for short. The underlying idea is to generate pseudo labels for unlabeled frames during training and to optimize the model on the combination of labeled and pseudo-labeled data. Our approach is extremely simple and can be applied to a majority of existing frameworks. We first pre-train a VOS model on sparsely annotated videos in a semi-supervised manner, with the first frame always being a labeled one. Then, we adopt the pre-trained VOS model to generate pseudo labels for all unlabeled frames, which are subsequently stored in a pseudo-label bank. Finally, we retrain a VOS model on both labeled and pseudo-labeled data without any restrictions on the first frame. For the first time, we present a general way to train VOS models on two-shot VOS datasets. By using 7.3% and 2.9% labeled data of YouTube-VOS and DAVIS benchmarks, our approach achieves comparable results in contrast to the counterparts trained on fully labeled set. Code and models are available at https://github.com/yk-pku/Two-shot-Video-Object-Segmentation. + + + + PiMAE: Point Cloud and Image Interactive Masked Autoencoders for 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_PiMAE_Point_Cloud_and_Image_Interactive_Masked_Autoencoders_for_3D_CVPR_2023_paper.pdf + Masked Autoencoders learn strong visual representations and achieve state-of-the-art results in several independent modalities, yet very few works have addressed their capabilities in multi-modality settings. In this work, we focus on point cloud and RGB image data, two modalities that are often presented together in the real world and explore their meaningful interactions. To improve upon the cross-modal synergy in existing works, we propose PiMAE, a self-supervised pre-training framework that promotes 3D and 2D interaction through three aspects. Specifically, we first notice the importance of masking strategies between the two sources and utilize a projection module to complementarily align the mask and visible tokens of the two modalities. Then, we utilize a well-crafted two-branch MAE pipeline with a novel shared decoder to promote cross-modality interaction in the mask tokens. Finally, we design a unique cross-modal reconstruction module to enhance representation learning for both modalities. Through extensive experiments performed on large-scale RGB-D scene understanding benchmarks (SUN RGB-D and ScannetV2), we discover it is nontrivial to interactively learn point-image features, where we greatly improve multiple 3D detectors, 2D detectors and few-shot classifiers by 2.9%, 6.7%, and 2.4%, respectively. Code is available at https://github.com/BLVLab/PiMAE. + + + + High-Fidelity 3D GAN Inversion by Pseudo-Multi-View Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_High-Fidelity_3D_GAN_Inversion_by_Pseudo-Multi-View_Optimization_CVPR_2023_paper.pdf + We present a high-fidelity 3D generative adversarial network (GAN) inversion framework that can synthesize photo-realistic novel views while preserving specific details of the input image. 
High-fidelity 3D GAN inversion is inherently challenging due to the geometry-texture trade-off, where overfitting to a single-view input image often damages the estimated geometry during the latent optimization. To solve this challenge, we propose a novel pipeline that builds on the pseudo-multi-view estimation with visibility analysis. We keep the original textures for the visible parts and utilize generative priors for the occluded parts. Extensive experiments show that our approach achieves advantageous reconstruction and novel view synthesis quality over prior work, even for images with out-of-distribution textures. The proposed pipeline also enables image attribute editing with the inverted latent code and 3D-aware texture modification. Our approach enables high-fidelity 3D rendering from a single image, which is promising for various applications of AI-generated 3D content. The source code is at https://github.com/jiaxinxie97/HFGI3D/. + + + + Single Image Backdoor Inversion via Robust Smoothed Classifiers + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Single_Image_Backdoor_Inversion_via_Robust_Smoothed_Classifiers_CVPR_2023_paper.pdf + Backdoor inversion, the process of finding a backdoor trigger inserted into a machine learning model, has become the pillar of many backdoor detection and defense methods. Previous works on backdoor inversion often recover the backdoor through an optimization process to flip a support set of clean images into the target class. However, it is rarely studied and understood how large this support set should be to recover a successful backdoor. In this work, we show that one can reliably recover the backdoor trigger with as few as a single image. Specifically, we propose the SmoothInv method, which first constructs a robust smoothed version of the backdoored classifier and then performs guided image synthesis towards the target class to reveal the backdoor pattern. SmoothInv requires neither an explicit modeling of the backdoor via a mask variable, nor any complex regularization schemes, which have become standard practice in backdoor inversion methods. We perform both quantitative and qualitative studies on backdoored classifiers from previously published backdoor attacks. We demonstrate that compared to existing methods, SmoothInv is able to recover successful backdoors from single images, while maintaining high fidelity to the original backdoor. We also show how we identify the target backdoored class from the backdoored classifier. Lastly, we propose and analyze two countermeasures to our approach and show that SmoothInv remains robust in the face of an adaptive attacker. Our code is available at https://github.com/locuslab/smoothinv. + + + + A Hierarchical Representation Network for Accurate and Detailed Face Reconstruction From In-the-Wild Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Lei_A_Hierarchical_Representation_Network_for_Accurate_and_Detailed_Face_Reconstruction_CVPR_2023_paper.pdf + Limited by the low-dimensional representational capacity of 3DMM, most 3DMM-based face reconstruction (FR) methods fail to recover high-frequency facial details, such as wrinkles, dimples, etc. Some attempt to solve the problem by introducing detail maps or non-linear operations; however, the results are still not vivid. To this end, in this paper we present a novel hierarchical representation network (HRN) to achieve accurate and detailed face reconstruction from a single image.
Specifically, we implement the geometry disentanglement and introduce the hierarchical representation to fulfill detailed face modeling. Meanwhile, 3D priors of facial details are incorporated to enhance the accuracy and authenticity of the reconstruction results. We also propose a de-retouching module to achieve better decoupling of the geometry and appearance. It is noteworthy that our framework can be extended to a multi-view fashion by considering detail consistency of different views. Extensive experiments on two single-view and two multi-view FR benchmarks demonstrate that our method outperforms the existing methods in both reconstruction accuracy and visual effects. Finally, we introduce a high-quality 3D face dataset FaceHD-100 to boost the research of high-fidelity face reconstruction. The project homepage is at https://younglbw.github.io/HRN-homepage/. + + + + PersonNeRF: Personalized Reconstruction From Photo Collections + http://openaccess.thecvf.com//content/CVPR2023/papers/Weng_PersonNeRF_Personalized_Reconstruction_From_Photo_Collections_CVPR_2023_paper.pdf + We present PersonNeRF, a method that takes a collection of photos of a subject (e.g., Roger Federer) captured across multiple years with arbitrary body poses and appearances, and enables rendering the subject with arbitrary novel combinations of viewpoint, body pose, and appearance. PersonNeRF builds a customized neural volumetric 3D model of the subject that is able to render an entire space spanned by camera viewpoint, body pose, and appearance. A central challenge in this task is dealing with sparse observations; a given body pose is likely only observed by a single viewpoint with a single appearance, and a given appearance is only observed under a handful of different body poses. We address this issue by recovering a canonical T-pose neural volumetric representation of the subject that allows for changing appearance across different observations, but uses a shared pose-dependent motion field across all observations. We demonstrate that this approach, along with regularization of the recovered volumetric geometry to encourage smoothness, is able to recover a model that renders compelling images from novel combinations of viewpoint, pose, and appearance from these challenging unstructured photo collections, outperforming prior work for free-viewpoint human rendering. + + + + NeuralLift-360: Lifting an In-the-Wild 2D Photo to a 3D Object With 360deg Views + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_NeuralLift-360_Lifting_an_In-the-Wild_2D_Photo_to_a_3D_Object_CVPR_2023_paper.pdf + Virtual reality and augmented reality (XR) bring increasing demand for 3D content generation. However, creating high-quality 3D content requires tedious work from a human expert. In this work, we study the challenging task of lifting a single image to a 3D object and, for the first time, demonstrate the ability to generate a plausible 3D object with 360deg views that corresponds well with the given reference image. By conditioning on the reference image, our model can fulfill the everlasting curiosity for synthesizing novel views of objects from images. Our technique sheds light on a promising direction of easing the workflows for 3D artists and XR designers. We propose a novel framework, dubbed NeuralLift-360, that utilizes a depth-aware neural radiance representation (NeRF) and learns to craft the scene guided by denoising diffusion models. 
By introducing a ranking loss, our NeuralLift-360 can be guided with rough depth estimation in the wild. We also adopt a CLIP-guided sampling strategy for the diffusion prior to provide coherent guidance. Extensive experiments demonstrate that our NeuralLift-360 significantly outperforms existing state-of-the-art baselines. Project page: https://vita-group.github.io/NeuralLift-360/ + + + + ViP3D: End-to-End Visual Trajectory Prediction via 3D Agent Queries + http://openaccess.thecvf.com//content/CVPR2023/papers/Gu_ViP3D_End-to-End_Visual_Trajectory_Prediction_via_3D_Agent_Queries_CVPR_2023_paper.pdf + Perception and prediction are two separate modules in the existing autonomous driving systems. They interact with each other via hand-picked features such as agent bounding boxes and trajectories. Due to this separation, prediction, as a downstream module, only receives limited information from the perception module. To make matters worse, errors from the perception modules can propagate and accumulate, adversely affecting the prediction results. In this work, we propose ViP3D, a query-based visual trajectory prediction pipeline that exploits rich information from raw videos to directly predict future trajectories of agents in a scene. ViP3D employs sparse agent queries to detect, track, and predict throughout the pipeline, making it the first fully differentiable vision-based trajectory prediction approach. Instead of using historical feature maps and trajectories, useful information from previous timestamps is encoded in agent queries, which makes ViP3D a concise streaming prediction method. Furthermore, extensive experimental results on the nuScenes dataset show the strong vision-based prediction performance of ViP3D over traditional pipelines and previous end-to-end models. + + + + LidarGait: Benchmarking 3D Gait Recognition With Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_LidarGait_Benchmarking_3D_Gait_Recognition_With_Point_Clouds_CVPR_2023_paper.pdf + Video-based gait recognition has achieved impressive results in constrained scenarios. However, visual cameras neglect human 3D structure information, which limits the feasibility of gait recognition in the 3D wild world. Instead of extracting gait features from images, this work explores precise 3D gait features from point clouds and proposes a simple yet efficient 3D gait recognition framework, termed LidarGait. Our proposed approach projects sparse point clouds into depth maps to learn the representations with 3D geometry information, which outperforms existing point-wise and camera-based methods by a significant margin. Due to the lack of point cloud datasets, we build the first large-scale LiDAR-based gait recognition dataset, SUSTech1K, collected by a LiDAR sensor and an RGB camera. The dataset contains 25,239 sequences from 1,050 subjects and covers many variations, including visibility, views, occlusions, clothing, carrying, and scenes. Extensive experiments show that (1) 3D structure information serves as a significant feature for gait recognition. (2) LidarGait outperforms existing point-based and silhouette-based methods by a significant margin, while it also offers stable cross-view results. (3) The LiDAR sensor is superior to the RGB camera for gait recognition in the outdoor environment. The source code and dataset have been made available at https://lidargait.github.io. 
+ + + + D2Former: Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-Based Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/He_D2Former_Jointly_Learning_Hierarchical_Detectors_and_Contextual_Descriptors_via_Agent-Based_CVPR_2023_paper.pdf + Establishing pixel-level matches between image pairs is vital for a variety of computer vision applications. However, achieving robust image matching remains challenging because CNN-extracted descriptors usually lack discriminative ability in texture-less regions and keypoint detectors are only good at identifying keypoints with a specific level of structure. To deal with these issues, a novel image matching method is proposed by Jointly Learning Hierarchical Detectors and Contextual Descriptors via Agent-based Transformers (D2Former), including a contextual feature descriptor learning (CFDL) module and a hierarchical keypoint detector learning (HKDL) module. The proposed D2Former enjoys several merits. First, the proposed CFDL module can model long-range contexts efficiently and effectively with the aid of designed descriptor agents. Second, the HKDL module can generate keypoint detectors in a hierarchical way, which is helpful for detecting keypoints with diverse levels of structures. Extensive experimental results on four challenging benchmarks show that our proposed method significantly outperforms state-of-the-art image matching methods. + + + + Joint Appearance and Motion Learning for Efficient Rolling Shutter Correction + http://openaccess.thecvf.com//content/CVPR2023/papers/Fan_Joint_Appearance_and_Motion_Learning_for_Efficient_Rolling_Shutter_Correction_CVPR_2023_paper.pdf + Rolling shutter correction (RSC) is becoming increasingly popular for RS cameras that are widely used in commercial and industrial applications. Despite the promising performance, existing RSC methods typically employ a two-stage network structure that ignores intrinsic information interactions and hinders fast inference. In this paper, we propose a single-stage encoder-decoder-based network, named JAMNet, for efficient RSC. It first extracts pyramid features from consecutive RS inputs, and then simultaneously refines the two complementary types of information (i.e., global shutter appearance and undistortion motion field) to achieve mutual promotion in a joint learning decoder. To inject sufficient motion cues for guiding joint learning, we introduce a transformer-based motion embedding module and propose to pass hidden states across pyramid levels. Moreover, we present a new data augmentation strategy "vertical flip + inverse order" to release the potential of the RSC datasets. Experiments on various benchmarks show that our approach surpasses the state-of-the-art methods by a large margin, especially with a 4.7 dB PSNR leap on real-world RSC. Code is available at https://github.com/GitCVfb/JAMNet. + + + + Federated Incremental Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_Federated_Incremental_Semantic_Segmentation_CVPR_2023_paper.pdf + Federated learning-based semantic segmentation (FSS) has drawn widespread attention via decentralized training on local clients. However, most FSS models assume categories are fixed in advance, thus heavily undergoing forgetting on old categories in practical applications where local clients receive new categories incrementally while having no memory storage to access old classes.
Moreover, new clients collecting novel classes may join in the global training of FSS, which further exacerbates catastrophic forgetting. To surmount the above challenges, we propose a Forgetting-Balanced Learning (FBL) model to address heterogeneous forgetting on old classes from both intra-client and inter-client aspects. Specifically, under the guidance of pseudo labels generated via adaptive class-balanced pseudo labeling, we develop a forgetting-balanced semantic compensation loss and a forgetting-balanced relation consistency loss to rectify intra-client heterogeneous forgetting of old categories with background shift. It performs balanced gradient propagation and relation consistency distillation within local clients. Moreover, to tackle heterogeneous forgetting from inter-client aspect, we propose a task transition monitor. It can identify new classes under privacy protection and store the latest old global model for relation distillation. Qualitative experiments reveal large improvement of our model against comparison methods. The code is available at https://github.com/JiahuaDong/FISS. + + + + Attention-Based Point Cloud Edge Sampling + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Attention-Based_Point_Cloud_Edge_Sampling_CVPR_2023_paper.pdf + Point cloud sampling is a less explored research topic for this data representation. The most commonly used sampling methods are still classical random sampling and farthest point sampling. With the development of neural networks, various methods have been proposed to sample point clouds in a task-based learning manner. However, these methods are mostly generative-based, rather than selecting points directly using mathematical statistics. Inspired by the Canny edge detection algorithm for images and with the help of the attention mechanism, this paper proposes a non-generative Attention-based Point cloud Edge Sampling method (APES), which captures salient points in the point cloud outline. Both qualitative and quantitative experimental results show the superior performance of our sampling method on common benchmark tasks. + + + + Avatars Grow Legs: Generating Smooth Human Motion From Sparse Tracking Inputs With Diffusion Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Avatars_Grow_Legs_Generating_Smooth_Human_Motion_From_Sparse_Tracking_CVPR_2023_paper.pdf + With the recent surge in popularity of AR/VR applications, realistic and accurate control of 3D full-body avatars has become a highly demanded feature. A particular challenge is that only a sparse tracking signal is available from standalone HMDs (Head Mounted Devices), often limited to tracking the user's head and wrists. While this signal is resourceful for reconstructing the upper body motion, the lower body is not tracked and must be synthesized from the limited information provided by the upper body joints. In this paper, we present AGRoL, a novel conditional diffusion model specifically designed to track full bodies given sparse upper-body tracking signals. Our model is based on a simple multi-layer perceptron (MLP) architecture and a novel conditioning scheme for motion data. It can predict accurate and smooth full-body motion, particularly the challenging lower body movement. Unlike common diffusion architectures, our compact architecture can run in real-time, making it suitable for online body-tracking applications. 
We train and evaluate our model on the AMASS motion capture dataset, and demonstrate that our approach outperforms state-of-the-art methods in generated motion accuracy and smoothness. We further justify our design choices through extensive experiments and ablation studies. + + + + Learning Neural Proto-Face Field for Disentangled 3D Face Modeling in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Learning_Neural_Proto-Face_Field_for_Disentangled_3D_Face_Modeling_in_CVPR_2023_paper.pdf + Generative models show good potential for recovering 3D faces beyond limited shape assumptions. While plausible details and resolutions are achieved, these models easily fail under extreme conditions of pose, shadow or appearance, due to the entangled fitting or lack of multi-view priors. To address this problem, this paper presents a novel Neural Proto-face Field (NPF) for unsupervised robust 3D face modeling. Instead of using constrained images as in Neural Radiance Fields (NeRF), NPF disentangles the common/specific facial cues, i.e., ID, expression and scene-specific details, from in-the-wild photo collections. Specifically, NPF learns a face prototype to aggregate 3D-consistent identity via uncertainty modeling, extracting multi-image priors from a photo collection. NPF then learns to deform the prototype with the appropriate facial expressions, constrained by a loss of expression consistency and personal idiosyncrasies. Finally, NPF is optimized to fit a target image in the collection, recovering specific details of appearance and geometry. In this way, the generative model benefits from multi-image priors and meaningful facial structures. Extensive experiments on benchmarks show that NPF recovers superior or competitive facial shapes and textures, compared to state-of-the-art methods. + + + + BUFFER: Balancing Accuracy, Efficiency, and Generalizability in Point Cloud Registration + http://openaccess.thecvf.com//content/CVPR2023/papers/Ao_BUFFER_Balancing_Accuracy_Efficiency_and_Generalizability_in_Point_Cloud_Registration_CVPR_2023_paper.pdf + An ideal point cloud registration framework should have superior accuracy, acceptable efficiency, and strong generalizability. However, this is highly challenging since existing registration techniques are either not accurate enough, far from efficient, or generalize poorly. It remains an open question how to achieve a satisfying balance among these three key elements. In this paper, we propose BUFFER, a point cloud registration method for balancing accuracy, efficiency, and generalizability. The key to our approach is to take advantage of both point-wise and patch-wise techniques, while overcoming their inherent drawbacks simultaneously. Different from a simple combination of existing methods, each component of our network has been carefully crafted to tackle specific issues. Specifically, a Point-wise Learner is first introduced to enhance computational efficiency by predicting keypoints and improving the representation capacity of features by estimating point orientations; a Patch-wise Embedder, which leverages a lightweight local feature learner, is then deployed to extract efficient and general patch features. Additionally, an Inliers Generator, which combines simple neural layers and general features, is presented to search for inlier correspondences. Extensive experiments on real-world scenarios demonstrate that our method achieves the best of both worlds in accuracy, efficiency, and generalization.
In particular, our method not only reaches the highest success rate on unseen domains, but also is almost 30 times faster than the strong baselines specializing in generalization. Code is available at https://github.com/aosheng1996/BUFFER. + + + + CrOC: Cross-View Online Clustering for Dense Visual Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Stegmuller_CrOC_Cross-View_Online_Clustering_for_Dense_Visual_Representation_Learning_CVPR_2023_paper.pdf + Learning dense visual representations without labels is an arduous task and more so from scene-centric data. We propose to tackle this challenging problem by proposing a Cross-view consistency objective with an Online Clustering mechanism (CrOC) to discover and segment the semantics of the views. In the absence of hand-crafted priors, the resulting method is more generalizable and does not require a cumbersome pre-processing step. More importantly, the clustering algorithm conjointly operates on the features of both views, thereby elegantly bypassing the issue of content not represented in both views and the ambiguous matching of objects from one crop to the other. We demonstrate excellent performance on linear and unsupervised segmentation transfer tasks on various datasets and similarly for video object segmentation. Our code and pre-trained models are publicly available at https://github.com/stegmuel/CrOC. + + + + DrapeNet: Garment Generation and Self-Supervised Draping + http://openaccess.thecvf.com//content/CVPR2023/papers/De_Luigi_DrapeNet_Garment_Generation_and_Self-Supervised_Draping_CVPR_2023_paper.pdf + Recent approaches to drape garments quickly over arbitrary human bodies leverage self-supervision to eliminate the need for large training sets. However, they are designed to train one network per clothing item, which severely limits their generalization abilities. In our work, we rely on self-supervision to train a single network to drape multiple garments. This is achieved by predicting a 3D deformation field conditioned on the latent codes of a generative network, which models garments as unsigned distance fields. Our pipeline can generate and drape previously unseen garments of any topology, whose shape can be edited by manipulating their latent codes. Being fully differentiable, our formulation makes it possible to recover accurate 3D models of garments from partial observations -- images or 3D scans -- via gradient descent. Our code is publicly available at https://github.com/liren2515/DrapeNet. + + + + FeatureBooster: Boosting Feature Descriptors With a Lightweight Neural Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_FeatureBooster_Boosting_Feature_Descriptors_With_a_Lightweight_Neural_Network_CVPR_2023_paper.pdf + We introduce a lightweight network to improve descriptors of keypoints within the same image. The network takes the original descriptors and the geometric properties of keypoints as the input, and uses an MLP-based self-boosting stage and a Transformer-based cross-boosting stage to enhance the descriptors. The boosted descriptors can be either real-valued or binary ones. We use the proposed network to boost both hand-crafted (ORB, SIFT) and the state-of-the-art learning-based descriptors (SuperPoint, ALIKE) and evaluate them on image matching, visual localization, and structure-from-motion tasks. 
The results show that our method significantly improves the performance of each task, particularly in challenging cases such as large illumination changes or repetitive patterns. Our method requires only 3.2ms on desktop GPU and 27ms on embedded GPU to process 2000 features, which is fast enough to be applied to a practical system. The code and trained weights are publicly available at github.com/SJTU-ViSYS/FeatureBooster. + + + + Towards Efficient Use of Multi-Scale Features in Transformer-Based Object Detectors + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Towards_Efficient_Use_of_Multi-Scale_Features_in_Transformer-Based_Object_Detectors_CVPR_2023_paper.pdf + Multi-scale features have been proven highly effective for object detection but often come with huge and even prohibitive extra computation costs, especially for the recent Transformer-based detectors. In this paper, we propose Iterative Multi-scale Feature Aggregation (IMFA) - a generic paradigm that enables efficient use of multi-scale features in Transformer-based object detectors. The core idea is to exploit sparse multi-scale features from just a few crucial locations, and it is achieved with two novel designs. First, IMFA rearranges the Transformer encoder-decoder pipeline so that the encoded features can be iteratively updated based on the detection predictions. Second, IMFA sparsely samples scale-adaptive features for refined detection from just a few keypoint locations under the guidance of prior detection predictions. As a result, the sampled multi-scale features are sparse yet still highly beneficial for object detection. Extensive experiments show that the proposed IMFA boosts the performance of multiple Transformer-based object detectors significantly yet with only slight computational overhead. + + + + Delivering Arbitrary-Modal Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Delivering_Arbitrary-Modal_Semantic_Segmentation_CVPR_2023_paper.pdf + Multimodal fusion can make semantic segmentation more robust. However, fusing an arbitrary number of modalities remains underexplored. To delve into this problem, we create the DeLiVER arbitrary-modal segmentation benchmark, covering Depth, LiDAR, multiple Views, Events, and RGB. Aside from this, we provide this dataset in four severe weather conditions as well as five sensor failure cases to exploit modal complementarity and resolve partial outages. To facilitate this data, we present the arbitrary cross-modal segmentation model CMNeXt. It encompasses a Self-Query Hub (SQ-Hub) designed to extract effective information from any modality for subsequent fusion with the RGB representation and adds only negligible amounts of parameters ( 0.01M) per additional modality. On top, to efficiently and flexibly harvest discriminative cues from the auxiliary modalities, we introduce the simple Parallel Pooling Mixer (PPX). With extensive experiments on a total of six benchmarks, our CMNeXt achieves state-of-the-art performance, allowing to scale from 1 to 81 modalities on the DeLiVER, KITTI-360, MFNet, NYU Depth V2, UrbanLF, and MCubeS datasets. On the freshly collected DeLiVER, the quad-modal CMNeXt reaches up to 66.30% in mIoU with a +9.10% gain as compared to the mono-modal baseline. 
+ + + + Consistent-Teacher: Towards Reducing Inconsistent Pseudo-Targets in Semi-Supervised Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Consistent-Teacher_Towards_Reducing_Inconsistent_Pseudo-Targets_in_Semi-Supervised_Object_Detection_CVPR_2023_paper.pdf + In this study, we dive deep into the inconsistency of pseudo targets in semi-supervised object detection (SSOD). Our core observation is that the oscillating pseudo-targets undermine the training of an accurate detector: they inject noise into the student's training, leading to severe overfitting problems. Therefore, we propose a systematic solution, termed Consistent-Teacher, to reduce the inconsistency. First, adaptive anchor assignment (ASA) substitutes the static IoU-based strategy, which enables the student network to be resistant to noisy pseudo-bounding boxes. Then we calibrate the subtask predictions by designing a 3D feature alignment module (FAM-3D). It allows each classification feature to adaptively query the optimal feature vector for the regression task at arbitrary scales and locations. Lastly, a Gaussian Mixture Model (GMM) dynamically revises the score threshold of pseudo-bboxes, which stabilizes the number of ground truths at an early stage and remedies the unreliable supervision signal during training. Consistent-Teacher provides strong results on a wide range of SSOD evaluations. It achieves 40.0 mAP with a ResNet-50 backbone given only 10% of annotated MS-COCO data, which surpasses previous baselines using pseudo labels by around 3 mAP. When trained on fully annotated MS-COCO with additional unlabeled data, the performance further increases to 47.7 mAP. Our code is available at https://github.com/Adamdad/ConsistentTeacher. + + + + DNeRV: Modeling Inherent Dynamics via Difference Neural Representation for Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_DNeRV_Modeling_Inherent_Dynamics_via_Difference_Neural_Representation_for_Videos_CVPR_2023_paper.pdf + Existing implicit neural representation (INR) methods do not fully exploit spatiotemporal redundancies in videos. Index-based INRs ignore the content-specific spatial features and hybrid INRs ignore the contextual dependency on adjacent frames, leading to poor modeling capability for scenes with large motion or dynamics. We analyze this limitation from the perspective of function fitting and reveal the importance of frame difference. To use explicit motion information, we propose Difference Neural Representation for Videos (DNeRV), which consists of two streams for content and frame difference. We also introduce a collaborative content unit for effective feature fusion. We test DNeRV for video compression, inpainting, and interpolation. DNeRV achieves competitive results against the state-of-the-art neural compression approaches and outperforms existing implicit methods on downstream inpainting and interpolation for 960 x 1920 videos. + + + + Continuous Pseudo-Label Rectified Domain Adaptive Semantic Segmentation With Implicit Neural Representations + http://openaccess.thecvf.com//content/CVPR2023/papers/Gong_Continuous_Pseudo-Label_Rectified_Domain_Adaptive_Semantic_Segmentation_With_Implicit_Neural_CVPR_2023_paper.pdf + Unsupervised domain adaptation (UDA) for semantic segmentation aims at improving the model performance on the unlabeled target domain by leveraging a labeled source domain. Existing approaches have achieved impressive progress by utilizing pseudo-labels on the unlabeled target-domain images.
Yet the low-quality pseudo-labels, arising from the domain discrepancy, inevitably hinder the adaptation. This calls for effective and accurate approaches to estimating the reliability of the pseudo-labels, in order to rectify them. In this paper, we propose to estimate the rectification values of the predicted pseudo-labels with implicit neural representations. We view the rectification value as a signal defined over the continuous spatial domain. Taking an image coordinate and the nearby deep features as inputs, the rectification value at a given coordinate is predicted as an output. This allows us to achieve high-resolution and detailed rectification values estimation, important for accurate pseudo-label generation at mask boundaries in particular. The rectified pseudo-labels are then leveraged in our rectification-aware mixture model (RMM) to be learned end-to-end and help the adaptation. We demonstrate the effectiveness of our approach on different UDA benchmarks, including synthetic-to-real and day-to-night. Our approach achieves superior results compared to state-of-the-art. The implementation is available at https://github.com/ETHRuiGong/IR2F. + + + + Hyperbolic Contrastive Learning for Visual Representations Beyond Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Ge_Hyperbolic_Contrastive_Learning_for_Visual_Representations_Beyond_Objects_CVPR_2023_paper.pdf + Although self-/un-supervised methods have led to rapid progress in visual representation learning, these methods generally treat objects and scenes using the same lens. In this paper, we focus on learning representations of objects and scenes that preserve the structure among them. Motivated by the observation that visually similar objects are close in the representation space, we argue that the scenes and objects should instead follow a hierarchical structure based on their compositionality. To exploit such a structure, we propose a contrastive learning framework where a Euclidean loss is used to learn object representations and a hyperbolic loss is used to encourage representations of scenes to lie close to representations of their constituent objects in hyperbolic space. This novel hyperbolic objective encourages the scene-object hypernymy among the representations by optimizing the magnitude of their norms. We show that when pretraining on the COCO and OpenImages datasets, the hyperbolic loss improves the downstream performance of several baselines across multiple datasets and tasks, including image classification, object detection, and semantic segmentation. We also show that the properties of the learned representations allow us to solve various vision tasks that involve the interaction between scenes and objects in a zero-shot fashion. + + + + AligNeRF: High-Fidelity Neural Radiance Fields via Alignment-Aware Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_AligNeRF_High-Fidelity_Neural_Radiance_Fields_via_Alignment-Aware_Training_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRFs) are a powerful representation for modeling a 3D scene as a continuous function. Though NeRF is able to render complex 3D scenes with view-dependent effects, few efforts have been devoted to exploring its limits in a high-resolution setting. Specifically, existing NeRF-based methods face several limitations when reconstructing high-resolution real scenes, including a very large number of parameters, misaligned input data, and overly smooth details. 
In this work, we conduct the first pilot study on training NeRF with high-resolution data and propose the corresponding solutions: 1) marrying the multilayer perceptron (MLP) with convolutional layers which can encode more neighborhood information while reducing the total number of parameters; 2) a novel training strategy to address misalignment caused by moving objects or small camera calibration errors; and 3) a high-frequency aware loss. Our approach is nearly free without introducing obvious training/testing costs, while experiments on different datasets demonstrate that it can recover more high-frequency details compared with the current state-of-the-art NeRF models. Project page: https://yifanjiang19.github.io/alignerf. + + + + NAR-Former: Neural Architecture Representation Learning Towards Holistic Attributes Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Yi_NAR-Former_Neural_Architecture_Representation_Learning_Towards_Holistic_Attributes_Prediction_CVPR_2023_paper.pdf + With the wide and deep adoption of deep learning models in real applications, there is an increasing need to model and learn the representations of the neural networks themselves. These models can be used to estimate attributes of different neural network architectures such as the accuracy and latency, without running the actual training or inference tasks. In this paper, we propose a neural architecture representation model that can be used to estimate these attributes holistically. Specifically, we first propose a simple and effective tokenizer to encode both the operation and topology information of a neural network into a single sequence. Then, we design a multi-stage fusion transformer to build a compact vector representation from the converted sequence. For efficient model training, we further propose an information flow consistency augmentation and correspondingly design an architecture consistency loss, which brings more benefits with less augmentation samples compared with previous random augmentation strategies. Experiment results on NAS-Bench-101, NAS-Bench-201, DARTS search space and NNLQP show that our proposed framework can be used to predict the aforementioned latency and accuracy attributes of both cell architectures and whole deep neural networks, and achieves promising performance. Code is available at https://github.com/yuny220/NAR-Former. + + + + Teaching Structured Vision & Language Concepts to Vision & Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Doveh_Teaching_Structured_Vision__Language_Concepts_to_Vision__Language_CVPR_2023_paper.pdf + Vision and Language (VL) models have demonstrated remarkable zero-shot performance in a variety of tasks. However, some aspects of complex language understanding still remain a challenge. We introduce the collective notion of Structured Vision & Language Concepts (SVLC) which includes object attributes, relations, and states which are present in the text and visible in the image. Recent studies have shown that even the best VL models struggle with SVLC. A possible way of fixing this issue is by collecting dedicated datasets for teaching each SVLC type, yet this might be expensive and time-consuming. Instead, we propose a more elegant data-driven approach for enhancing VL models' understanding of SVLCs that makes more effective use of existing VL pre-training datasets and does not require any additional data. 
While automatic understanding of image structure still remains largely unsolved, language structure is much better modeled and understood, allowing for its effective utilization in teaching VL models. In this paper, we propose various techniques based on language structure understanding that can be used to manipulate the textual part of off-the-shelf paired VL datasets. VL models trained with the updated data exhibit a significant improvement of up to 15% in their SVLC understanding with only a mild degradation in their zero-shot capabilities both when training from scratch or fine-tuning a pre-trained model. Our code and pretrained models are available at: https://github.com/SivanDoveh/TSVLC + + + + NULL-Text Inversion for Editing Real Images Using Guided Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Mokady_NULL-Text_Inversion_for_Editing_Real_Images_Using_Guided_Diffusion_Models_CVPR_2023_paper.pdf + Recent large-scale text-guided diffusion models provide powerful image generation capabilities. Currently, a massive effort is given to enable the modification of these images using text only as means to offer intuitive and versatile editing tools. To edit a real image using these state-of-the-art tools, one must first invert the image with a meaningful text prompt into the pretrained model's domain. In this paper, we introduce an accurate inversion technique and thus facilitate an intuitive text-based modification of the image. Our proposed inversion consists of two key novel components: (i) Pivotal inversion for diffusion models. While current methods aim at mapping random noise samples to a single input image, we use a single pivotal noise vector for each timestamp and optimize around it. We recognize that a direct DDIM inversion is inadequate on its own, but does provide a rather good anchor for our optimization. (ii) NULL-text optimization, where we only modify the unconditional textual embedding that is used for classifier-free guidance, rather than the input text embedding. This allows for keeping both the model weights and the conditional embedding intact and hence enables applying prompt-based editing while avoiding the cumbersome tuning of the model's weights. Our Null-text inversion, based on the publicly available Stable Diffusion model, is extensively evaluated on a variety of images and various prompt editing, showing high-fidelity editing of real images. + + + + Selective Structured State-Spaces for Long-Form Video Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Selective_Structured_State-Spaces_for_Long-Form_Video_Understanding_CVPR_2023_paper.pdf + Effective modeling of complex spatiotemporal dependencies in long-form videos remains an open problem. The recently proposed Structured State-Space Sequence (S4) model with its linear complexity offers a promising direction in this space. However, we demonstrate that treating all image-tokens equally as done by S4 model can adversely affect its efficiency and accuracy. To address this limitation, we present a novel Selective S4 (i.e., S5) model that employs a lightweight mask generator to adaptively select informative image tokens resulting in more efficient and accurate modeling of long-term spatiotemporal dependencies in videos. Unlike previous mask-based token reduction methods used in transformers, our S5 model avoids the dense self-attention calculation by making use of the guidance of the momentum-updated S4 model. 
This enables our model to efficiently discard less informative tokens and adapt to various long-form video understanding tasks more effectively. However, as is the case for most token reduction methods, the informative image tokens could be dropped incorrectly. To improve the robustness and the temporal horizon of our model, we propose a novel long-short masked contrastive learning (LSMCL) approach that enables our model to predict longer temporal context using shorter input videos. We present extensive comparative results using three challenging long-form video understanding datasets (LVU, COIN and Breakfast), demonstrating that our approach consistently outperforms the previous state-of-the-art S4 model by up to 9.6% accuracy while reducing its memory footprint by 23%. + + + + Motion Information Propagation for Neural Video Compression + http://openaccess.thecvf.com//content/CVPR2023/papers/Qi_Motion_Information_Propagation_for_Neural_Video_Compression_CVPR_2023_paper.pdf + In most existing neural video codecs, the information flow therein is uni-directional, where only motion coding provides motion vectors for frame coding. In this paper, we argue that, through information interactions, the synergy between motion coding and frame coding can be achieved. We effectively introduce bi-directional information interactions between motion coding and frame coding via our Motion Information Propagation. When generating the temporal contexts for frame coding, the high-dimension motion feature from the motion decoder serves as motion guidance to mitigate the alignment errors. Meanwhile, besides assisting frame coding at the current time step, the feature from context generation will be propagated as motion condition when coding the subsequent motion latent. Through the cycle of such interactions, feature propagation on motion coding is built, strengthening the capacity of exploiting long-range temporal correlation. In addition, we propose hybrid context generation to exploit the multi-scale context features and provide better motion condition. Experiments show that our method can achieve 12.9% bit rate saving over the previous SOTA neural video codec. + + + + Accelerated Coordinate Encoding: Learning to Relocalize in Minutes Using RGB and Poses + http://openaccess.thecvf.com//content/CVPR2023/papers/Brachmann_Accelerated_Coordinate_Encoding_Learning_to_Relocalize_in_Minutes_Using_RGB_CVPR_2023_paper.pdf + Learning-based visual relocalizers exhibit leading pose accuracy, but require hours or days of training. Since training needs to happen on each new scene again, long training times make learning-based relocalization impractical for most applications, despite its promise of high accuracy. In this paper we show how such a system can actually achieve the same accuracy in less than 5 minutes. We start from the obvious: a relocalization network can be split in a scene-agnostic feature backbone, and a scene-specific prediction head. Less obvious: using an MLP prediction head allows us to optimize across thousands of view points simultaneously in each single training iteration. This leads to stable and extremely fast convergence. Furthermore, we substitute effective but slow end-to-end training using a robust pose solver with a curriculum over a reprojection loss. Our approach does not require privileged knowledge, such a depth maps or a 3D model, for speedy training. Overall, our approach is up to 300x faster in mapping than state-of-the-art scene coordinate regression, while keeping accuracy on par. 
Code is available: https://nianticlabs.github.io/ace + + + + Robust Dynamic Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Robust_Dynamic_Radiance_Fields_CVPR_2023_paper.pdf + Dynamic radiance field reconstruction methods aim to model the time-varying structure and appearance of a dynamic scene. Existing methods, however, assume that accurate camera poses can be reliably estimated by Structure from Motion (SfM) algorithms. These methods, thus, are unreliable as SfM algorithms often fail or produce erroneous poses on challenging videos with highly dynamic objects, poorly textured surfaces, and rotating camera motion. We address this issue by jointly estimating the static and dynamic radiance fields along with the camera parameters (poses and focal length). We demonstrate the robustness of our approach via extensive quantitative and qualitative experiments. Our results show favorable performance over the state-of-the-art dynamic view synthesis methods. + + + + PLIKS: A Pseudo-Linear Inverse Kinematic Solver for 3D Human Body Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shetty_PLIKS_A_Pseudo-Linear_Inverse_Kinematic_Solver_for_3D_Human_Body_CVPR_2023_paper.pdf + We introduce PLIKS (Pseudo-Linear Inverse Kinematic Solver) for reconstruction of a 3D mesh of the human body from a single 2D image. Current techniques directly regress the shape, pose, and translation of a parametric model from an input image through a non-linear mapping with minimal flexibility to any external influences. We approach the task as a model-in-the-loop optimization problem. PLIKS is built on a linearized formulation of the parametric SMPL model. Using PLIKS, we can analytically reconstruct the human model via 2D pixel-aligned vertices. This enables us with the flexibility to use accurate camera calibration information when available. PLIKS offers an easy way to introduce additional constraints such as shape and translation. We present quantitative evaluations which confirm that PLIKS achieves more accurate reconstruction with greater than 10% improvement compared to other state-of-the-art methods with respect to the standard 3D human pose and shape benchmarks while also obtaining a reconstruction error improvement of 12.9 mm on the newer AGORA dataset. + + + + Promoting Semantic Connectivity: Dual Nearest Neighbors Contrastive Learning for Unsupervised Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Promoting_Semantic_Connectivity_Dual_Nearest_Neighbors_Contrastive_Learning_for_Unsupervised_CVPR_2023_paper.pdf + Domain Generalization (DG) has achieved great success in generalizing knowledge from source domains to unseen target domains. However, current DG methods rely heavily on labeled source data, which are usually costly and unavailable. Since unlabeled data are far more accessible, we study a more practical unsupervised domain generalization (UDG) problem. Learning invariant visual representation from different views, i.e., contrastive learning, promises well semantic features for in-domain unsupervised learning. However, it fails in cross-domain scenarios. In this paper, we first delve into the failure of vanilla contrastive learning and point out that semantic connectivity is the key to UDG. Specifically, suppressing the intra-domain connectivity and encouraging the intra-class connectivity help to learn the domain-invariant semantic information. 
Then, we propose a novel unsupervised domain generalization approach, namely Dual Nearest Neighbors contrastive learning with strong Augmentation (DN^2A). Our DN^2A leverages strong augmentations to suppress the intra-domain connectivity and proposes a novel dual nearest neighbors search strategy to find trustworthy cross-domain neighbors along with in-domain neighbors to encourage the intra-class connectivity. Experimental results demonstrate that our DN^2A outperforms the state-of-the-art by a large margin, e.g., 12.01% and 13.11% accuracy gain with only 1% labels for linear evaluation on PACS and DomainNet, respectively. + + + + Interactive Segmentation of Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Goel_Interactive_Segmentation_of_Radiance_Fields_CVPR_2023_paper.pdf + Radiance Fields (RF) are popular for representing casually-captured scenes for new view synthesis and several applications beyond it. Mixed reality on personal spaces needs understanding and manipulating scenes represented as RFs, with semantic segmentation of objects as an important step. Prior segmentation efforts show promise but don't scale to complex objects with diverse appearance. We present the ISRF method to interactively segment objects with fine structure and appearance. Nearest neighbor feature matching using distilled semantic features identifies high-confidence seed regions. Bilateral search in a joint spatio-semantic space grows the region to recover accurate segmentation. We show state-of-the-art results of segmenting objects from RFs and compositing them to another scene, changing appearance, etc., and an interactive segmentation tool that others can use. + + + + Exploring and Utilizing Pattern Imbalance + http://openaccess.thecvf.com//content/CVPR2023/papers/Mei_Exploring_and_Utilizing_Pattern_Imbalance_CVPR_2023_paper.pdf + In this paper, we identify pattern imbalance from several aspects, and further develop a new training scheme to avert pattern preference as well as spurious correlation. In contrast to prior methods which are mostly concerned with category or domain granularity, ignoring the potential finer structure that exists in datasets, we give a new definition of seed category as an appropriate optimization unit to distinguish different patterns in the same category or domain. Extensive experiments on domain generalization datasets of diverse scales demonstrate the effectiveness of the proposed method. + + + + Are Data-Driven Explanations Robust Against Out-of-Distribution Data? + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Are_Data-Driven_Explanations_Robust_Against_Out-of-Distribution_Data_CVPR_2023_paper.pdf + As black-box models increasingly power high-stakes applications, a variety of data-driven explanation methods have been introduced. Meanwhile, machine learning models are constantly challenged by distributional shifts. A question naturally arises: Are data-driven explanations robust against out-of-distribution data? Our empirical results show that even though the model predicts correctly, it might still yield unreliable explanations under distributional shifts. How can we develop robust explanations against out-of-distribution data? To address this problem, we propose an end-to-end model-agnostic learning framework, Distributionally Robust Explanations (DRE). The key idea is, inspired by self-supervised learning, to fully utilize the inter-distribution information to provide supervisory signals for the learning of explanations without human annotation.
Can robust explanations benefit the model's generalization capability? We conduct extensive experiments on a wide range of tasks and data types, including classification and regression on image and scientific tabular data. Our results demonstrate that the proposed method significantly improves the model's performance in terms of explanation and prediction robustness against distributional shifts. + + + + Top-Down Visual Attention From Analysis by Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Shi_Top-Down_Visual_Attention_From_Analysis_by_Synthesis_CVPR_2023_paper.pdf + Current attention algorithms (e.g., self-attention) are stimulus-driven and highlight all the salient objects in an image. However, intelligent agents like humans often guide their attention based on the high-level task at hand, focusing only on task-related objects. This ability of task-guided top-down attention provides task-adaptive representation and helps the model generalize to various tasks. In this paper, we consider top-down attention from a classic Analysis-by-Synthesis (AbS) perspective of vision. Prior work indicates a functional equivalence between visual attention and sparse reconstruction; we show that an AbS visual system that optimizes a similar sparse reconstruction objective modulated by a goal-directed top-down signal naturally simulates top-down attention. We further propose Analysis-by-Synthesis Vision Transformer (AbSViT), which is a top-down modulated ViT model that variationally approximates AbS, and achieves controllable top-down attention. For real-world applications, AbSViT consistently improves over baselines on Vision-Language tasks such as VQA and zero-shot retrieval where language guides the top-down attention. AbSViT can also serve as a general backbone, improving performance on classification, semantic segmentation, and model robustness. Project page: https://sites.google.com/view/absvit. + + + + Hierarchical Fine-Grained Image Forgery Detection and Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Hierarchical_Fine-Grained_Image_Forgery_Detection_and_Localization_CVPR_2023_paper.pdf + Differences in forgery attributes of images generated in CNN-synthesized and image-editing domains are large, and such differences make a unified image forgery detection and localization (IFDL) challenging. To this end, we present a hierarchical fine-grained formulation for IFDL representation learning. Specifically, we first represent forgery attributes of a manipulated image with multiple labels at different levels. Then we perform fine-grained classification at these levels using the hierarchical dependency between them. As a result, the algorithm is encouraged to learn both comprehensive features and inherent hierarchical nature of different forgery attributes, thereby improving the IFDL representation. Our proposed IFDL framework contains three components: multi-branch feature extractor, localization and classification modules. Each branch of the feature extractor learns to classify forgery attributes at one level, while localization and classification modules segment the pixel-level forgery region and detect image-level forgery, respectively. Lastly, we construct a hierarchical fine-grained dataset to facilitate our study. We demonstrate the effectiveness of our method on 7 different benchmarks, for both tasks of IFDL and forgery attribute classification. 
Our source code and dataset can be found at https://github.com/CHELSEA234/HiFi_IFDL + + + + Fantastic Breaks: A Dataset of Paired 3D Scans of Real-World Broken Objects and Their Complete Counterparts + http://openaccess.thecvf.com//content/CVPR2023/papers/Lamb_Fantastic_Breaks_A_Dataset_of_Paired_3D_Scans_of_Real-World_CVPR_2023_paper.pdf + Automated shape repair approaches currently lack access to datasets that describe real-world damaged geometry. We present Fantastic Breaks (and Where to Find Them: https://terascale-all-sensing-research-studio.github.io/FantasticBreaks), a dataset containing scanned, waterproofed, and cleaned 3D meshes for 150 broken objects, paired and geometrically aligned with complete counterparts. Fantastic Breaks contains class and material labels, proxy repair parts that join to broken meshes to generate complete meshes, and manually annotated fracture boundaries. Through a detailed analysis of fracture geometry, we reveal differences between Fantastic Breaks and synthetic fracture datasets generated using geometric and physics-based methods. We show experimental shape repair evaluation with Fantastic Breaks using multiple learning-based approaches pre-trained with synthetic datasets and re-trained with a subset of Fantastic Breaks. + + + + Deep Frequency Filtering for Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Deep_Frequency_Filtering_for_Domain_Generalization_CVPR_2023_paper.pdf + Improving the generalization ability of Deep Neural Networks (DNNs) is critical for their practical uses, which has been a longstanding challenge. Some theoretical studies have uncovered that DNNs have preferences for some frequency components in the learning process and indicated that this may affect the robustness of learned features. In this paper, we propose Deep Frequency Filtering (DFF) for learning domain-generalizable features, which is the first endeavour to explicitly modulate the frequency components of different transfer difficulties across domains in the latent space during training. To achieve this, we perform Fast Fourier Transform (FFT) for the feature maps at different layers, then adopt a light-weight module to learn attention masks from the frequency representations after FFT to enhance transferable components while suppressing the components not conducive to generalization. Further, we empirically compare the effectiveness of adopting different types of attention designs for implementing DFF. Extensive experiments demonstrate the effectiveness of our proposed DFF and show that applying our DFF on a plain baseline outperforms the state-of-the-art methods on different domain generalization tasks, including close-set classification and open-set retrieval. + + + + Frame Flexible Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Frame_Flexible_Network_CVPR_2023_paper.pdf + Existing video recognition algorithms always conduct different training pipelines for inputs with different frame numbers, which requires repetitive training operations and multiplies storage costs. If we evaluate the model using frame numbers not used in training, we observe that the performance drops significantly (see Fig. 1), which we summarize as the Temporal Frequency Deviation phenomenon.
To fix this issue, we propose a general framework, named Frame Flexible Network (FFN), which not only enables the model to be evaluated at different frames to adjust its computation, but also reduces the memory costs of storing multiple models significantly. Concretely, FFN integrates several sets of training sequences, involves Multi-Frequency Alignment (MFAL) to learn temporal frequency invariant representations, and leverages Multi-Frequency Adaptation (MFAD) to further strengthen the representation abilities. Comprehensive empirical validations using various architectures and popular benchmarks solidly demonstrate the effectiveness and generalization of FFN (e.g., 7.08/5.15/2.17% performance gain at Frame 4/8/16 on Something-Something V1 dataset over Uniformer). Code is available at https://github.com/BeSpontaneous/FFN. + + + + Unsupervised Cumulative Domain Adaptation for Foggy Scene Optical Flow + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Unsupervised_Cumulative_Domain_Adaptation_for_Foggy_Scene_Optical_Flow_CVPR_2023_paper.pdf + Optical flow has achieved great success under clean scenes, but suffers from restricted performance under foggy scenes. To bridge the clean-to-foggy domain gap, the existing methods typically adopt the domain adaptation to transfer the motion knowledge from clean to synthetic foggy domain. However, these methods unexpectedly neglect the synthetic-to-real domain gap, and thus are erroneous when applied to real-world scenes. To handle the practical optical flow under real foggy scenes, in this work, we propose a novel unsupervised cumulative domain adaptation optical flow (UCDA-Flow) framework: depth-association motion adaptation and correlation-alignment motion adaptation. Specifically, we discover that depth is a key ingredient to influence the optical flow: the deeper depth, the inferior optical flow, which motivates us to design a depth-association motion adaptation module to bridge the clean-to-foggy domain gap. Moreover, we figure out that the cost volume correlation shares similar distribution of the synthetic and real foggy images, which enlightens us to devise a correlation-alignment motion adaptation module to distill motion knowledge of the synthetic foggy domain to the real foggy domain. Note that synthetic fog is designed as the intermediate domain. Under this unified framework, the proposed cumulative adaptation progressively transfers knowledge from clean scenes to real foggy scenes. Extensive experiments have been performed to verify the superiority of the proposed method. + + + + MarS3D: A Plug-and-Play Motion-Aware Model for Semantic Segmentation on Multi-Scan 3D Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_MarS3D_A_Plug-and-Play_Motion-Aware_Model_for_Semantic_Segmentation_on_Multi-Scan_CVPR_2023_paper.pdf + 3D semantic segmentation on multi-scan large-scale point clouds plays an important role in autonomous systems. Unlike the single-scan-based semantic segmentation task, this task requires distinguishing the motion states of points in addition to their semantic categories. However, methods designed for single-scan-based segmentation tasks perform poorly on the multi-scan task due to the lacking of an effective way to integrate temporal information. We propose MarS3D, a plug-and-play motion-aware model for semantic segmentation on multi-scan 3D point clouds. This module can be flexibly combined with single-scan models to allow them to have multi-scan perception abilities. 
The model encompasses two key designs: the Cross-Frame Feature Embedding module for enriching representation learning and the Motion-Aware Feature Learning module for enhancing motion awareness. Extensive experiments show that MarS3D can improve the performance of the baseline model by a large margin. The code is available at https://github.com/CVMI-Lab/MarS3D. + + + + An Image Quality Assessment Dataset for Portraits + http://openaccess.thecvf.com//content/CVPR2023/papers/Chahine_An_Image_Quality_Assessment_Dataset_for_Portraits_CVPR_2023_paper.pdf + Year after year, the demand for ever-better smartphone photos continues to grow, in particular in the domain of portrait photography. Manufacturers thus use perceptual quality criteria throughout the development of smartphone cameras. This costly procedure can be partially replaced by automated learning-based methods for image quality assessment (IQA). Due to its subjective nature, it is necessary to estimate and guarantee the consistency of the IQA process, a characteristic lacking in the mean opinion scores (MOS) widely used for crowdsourcing IQA. In addition, existing blind IQA (BIQA) datasets pay little attention to the difficulty of cross-content assessment, which may degrade the quality of annotations. This paper introduces PIQ23, a portrait-specific IQA dataset of 5116 images of 50 predefined scenarios acquired by 100 smartphones, covering a high variety of brands, models, and use cases. The dataset includes individuals of various genders and ethnicities who have given explicit and informed consent for their photographs to be used in public research. It is annotated by pairwise comparisons (PWC) collected from over 30 image quality experts for three image attributes: face detail preservation, face target exposure, and overall image quality. An in-depth statistical analysis of these annotations allows us to evaluate their consistency over PIQ23. Finally, we show through an extensive comparison with existing baselines that semantic information (image context) can be used to improve IQA predictions. + + + + Painting 3D Nature in 2D: View Synthesis of Natural Scenes From a Single Semantic Mask + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Painting_3D_Nature_in_2D_View_Synthesis_of_Natural_Scenes_CVPR_2023_paper.pdf + We introduce a novel approach that takes a single semantic mask as input to synthesize multi-view consistent color images of natural scenes, trained with a collection of single images from the Internet. Prior works on 3D-aware image synthesis either require multi-view supervision or learning category-level prior for specific classes of objects, which are inapplicable to natural scenes. Our key idea to solve this challenge is to use a semantic field as the intermediate representation, which is easier to reconstruct from an input semantic mask and then translated to a radiance field with the assistance of off-the-shelf semantic image synthesis models. Experiments show that our method outperforms baseline methods and produces photorealistic and multi-view consistent videos of a variety of natural scenes. The project website is https://zju3dv.github.io/paintingnature/. + + + + Fast Point Cloud Generation With Straight Flows + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Fast_Point_Cloud_Generation_With_Straight_Flows_CVPR_2023_paper.pdf + Diffusion models have emerged as a powerful tool for point cloud generation. 
A key component that drives their impressive performance in generating high-quality samples from noise is the iterative denoising over thousands of steps. While beneficial, this large number of learning steps has limited their application to many real-world 3D settings. To address this limitation, we propose Point Straight Flow (PSF), a model that exhibits impressive performance using a single step. Our idea is based on a reformulation of the standard diffusion model, which optimizes the curvy learning trajectory into a straight path. Further, we develop a distillation strategy to shorten the straight path into one step without a performance loss, enabling applications to real-world 3D tasks with latency constraints. We perform evaluations on multiple 3D tasks and find that our PSF performs comparably to the standard diffusion model, outperforming other efficient 3D point cloud generation methods. On real-world applications such as point cloud completion and training-free text-guided generation in a low-latency setup, PSF performs favorably. + + + + Achieving a Better Stability-Plasticity Trade-Off via Auxiliary Networks in Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Achieving_a_Better_Stability-Plasticity_Trade-Off_via_Auxiliary_Networks_in_Continual_CVPR_2023_paper.pdf + In contrast to the natural capabilities of humans to learn new tasks in a sequential fashion, neural networks are known to suffer from catastrophic forgetting, where the model's performances on old tasks drop dramatically after being optimized for a new task. Since then, the continual learning (CL) community has proposed several solutions aiming to equip the neural network with the ability to learn the current task (plasticity) while still achieving high accuracy on the previous tasks (stability). Despite remarkable improvements, the plasticity-stability trade-off is still far from being solved, and its underlying mechanism is poorly understood. In this work, we propose Auxiliary Network Continual Learning (ANCL), a novel method that applies an additional auxiliary network, which promotes plasticity, to the continually learned model, which mainly focuses on stability. More concretely, the proposed framework materializes in a regularizer that naturally interpolates between plasticity and stability, surpassing strong baselines on task incremental and class incremental scenarios. Through extensive analyses of ANCL solutions, we identify some essential principles underlying the stability-plasticity trade-off. + + + + Video Event Restoration Based on Keyframes for Video Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Video_Event_Restoration_Based_on_Keyframes_for_Video_Anomaly_Detection_CVPR_2023_paper.pdf + Video anomaly detection (VAD) is a significant computer vision problem. Existing deep neural network (DNN) based VAD methods mostly follow the route of frame reconstruction or frame prediction. However, the lack of mining and learning of higher-level visual features and temporal context relationships in videos limits the further performance of these two approaches. Inspired by video codec theory, we introduce a brand-new VAD paradigm to break through these limitations: First, we propose a new task of video event restoration based on keyframes.
The DNN is encouraged to infer multiple missing frames from video keyframes so as to restore a video event, which more effectively motivates it to mine and learn potential higher-level visual features and comprehensive temporal context relationships in the video. To this end, we propose a novel U-shaped Swin Transformer Network with Dual Skip Connections (USTN-DSC) for video event restoration, where a cross-attention and a temporal upsampling residual skip connection are introduced to further assist in restoring complex static and dynamic motion object features in the video. In addition, we propose a simple and effective adjacent frame difference loss to constrain the motion consistency of the video sequence. Extensive experiments on benchmarks demonstrate that USTN-DSC outperforms most existing methods, validating the effectiveness of our method. + + + + EcoTTA: Memory-Efficient Continual Test-Time Adaptation via Self-Distilled Regularization + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_EcoTTA_Memory-Efficient_Continual_Test-Time_Adaptation_via_Self-Distilled_Regularization_CVPR_2023_paper.pdf + This paper presents a simple yet effective approach that improves continual test-time adaptation (TTA) in a memory-efficient manner. TTA may primarily be conducted on edge devices with limited memory, so reducing memory is crucial but has been overlooked in previous TTA studies. In addition, long-term adaptation often leads to catastrophic forgetting and error accumulation, which hinders applying TTA in real-world deployments. Our approach consists of two components to address these issues. First, we present lightweight meta networks that can adapt the frozen original networks to the target domain. This novel architecture minimizes memory consumption by decreasing the size of intermediate activations required for backpropagation. Second, our novel self-distilled regularization controls the output of the meta networks so that it does not deviate significantly from the output of the frozen original networks, thereby preserving well-trained knowledge from the source domain. Without additional memory, this regularization prevents error accumulation and catastrophic forgetting, resulting in stable performance even in long-term test-time adaptation. We demonstrate that our simple yet effective strategy outperforms other state-of-the-art methods on various benchmarks for image classification and semantic segmentation tasks. Notably, our proposed method with ResNet-50 and WideResNet-40 takes 86% and 80% less memory than the recent state-of-the-art method, CoTTA. + + + + Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Tri-Perspective_View_for_Vision-Based_3D_Semantic_Occupancy_Prediction_CVPR_2023_paper.pdf + Modern methods for vision-centric autonomous driving perception widely adopt the bird's-eye-view (BEV) representation to describe a 3D scene. Despite its better efficiency than voxel representation, it has difficulty describing the fine-grained 3D structure of a scene with a single plane. To address this, we propose a tri-perspective view (TPV) representation which accompanies BEV with two additional perpendicular planes. We model each point in the 3D space by summing its projected features on the three planes. To lift image features to the 3D TPV space, we further propose a transformer-based TPV encoder (TPVFormer) to obtain the TPV features effectively.
We employ the attention mechanism to aggregate the image features corresponding to each query in each TPV plane. Experiments show that our model trained with sparse supervision effectively predicts the semantic occupancy for all voxels. We demonstrate for the first time that using only camera inputs can achieve comparable performance with LiDAR-based methods on the LiDAR segmentation task on nuScenes. Code: https://github.com/wzzheng/TPVFormer. + + + + Castling-ViT: Compressing Self-Attention via Switching Towards Linear-Angular Attention at Vision Transformer Inference + http://openaccess.thecvf.com//content/CVPR2023/papers/You_Castling-ViT_Compressing_Self-Attention_via_Switching_Towards_Linear-Angular_Attention_at_Vision_CVPR_2023_paper.pdf + Vision Transformers (ViTs) have shown impressive performance but still require a high computation cost as compared to convolutional neural networks (CNNs), one reason is that ViTs' attention measures global similarities and thus has a quadratic complexity with the number of input tokens. Existing efficient ViTs adopt local attention or linear attention, which sacrifice ViTs' capabilities of capturing either global or local context. In this work, we ask an important research question: Can ViTs learn both global and local context while being more efficient during inference? To this end, we propose a framework called Castling-ViT, which trains ViTs using both linear-angular attention and masked softmax-based quadratic attention, but then switches to having only linear-angular attention during inference. Our Castling-ViT leverages angular kernels to measure the similarities between queries and keys via spectral angles. And we further simplify it with two techniques: (1) a novel linear-angular attention mechanism: we decompose the angular kernels into linear terms and high-order residuals, and only keep the linear terms; and (2) we adopt two parameterized modules to approximate high-order residuals: a depthwise convolution and an auxiliary masked softmax attention to help learn global and local information, where the masks for softmax attention are regularized to gradually become zeros and thus incur no overhead during inference. Extensive experiments validate the effectiveness of our Castling-ViT, e.g., achieving up to a 1.8% higher accuracy or 40% MACs reduction on classification and 1.2 higher mAP on detection under comparable FLOPs, as compared to ViTs with vanilla softmax-based attentions. Project page is available at https://www.haoranyou.com/castling-vit. + + + + Rethinking Federated Learning With Domain Shift: A Prototype View + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Rethinking_Federated_Learning_With_Domain_Shift_A_Prototype_View_CVPR_2023_paper.pdf + Federated learning shows a bright promise as a privacy-preserving collaborative learning technique. However, prevalent solutions mainly focus on all private data sampled from the same domain. An important challenge is that when distributed data are derived from diverse domains. The private model presents degenerative performance on other domains (with domain shift). Therefore, we expect that the global model optimized after the federated learning process stably provides generalizability performance on multiple domains. In this paper, we propose Federated Prototypes Learning (FPL) for federated learning under domain shift. The core idea is to construct cluster prototypes and unbiased prototypes, providing fruitful domain knowledge and a fair convergent target. 
On the one hand, we pull the sample embedding closer to cluster prototypes belonging to the same semantics than cluster prototypes from distinct classes. On the other hand, we introduce consistency regularization to align the local instance with the respective unbiased prototype. Empirical results on Digits and Office Caltech tasks demonstrate the effectiveness of the proposed solution and the efficiency of crucial modules. + + + + HGFormer: Hierarchical Grouping Transformer for Domain Generalized Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_HGFormer_Hierarchical_Grouping_Transformer_for_Domain_Generalized_Semantic_Segmentation_CVPR_2023_paper.pdf + Current semantic segmentation models have achieved great success under the independent and identically distributed (i.i.d.) condition. However, in real-world applications, test data might come from a different domain than training data. Therefore, it is important to improve model robustness against domain differences. This work studies semantic segmentation under the domain generalization setting, where a model is trained only on the source domain and tested on the unseen target domain. Existing works show that Vision Transformers are more robust than CNNs and show that this is related to the visual grouping property of self-attention. In this work, we propose a novel hierarchical grouping transformer (HGFormer) to explicitly group pixels to form part-level masks and then whole-level masks. The masks at different scales aim to segment out both parts and a whole of classes. HGFormer combines mask classification results at both scales for class label prediction. We assemble multiple interesting cross-domain settings by using seven public semantic segmentation datasets. Experiments show that HGFormer yields more robust semantic segmentation results than per-pixel classification methods and flat-grouping transformers, and outperforms previous methods significantly. Code will be available at https://github.com/dingjiansw101/HGFormer. + + + + Distilling Vision-Language Pre-Training To Collaborate With Weakly-Supervised Temporal Action Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Ju_Distilling_Vision-Language_Pre-Training_To_Collaborate_With_Weakly-Supervised_Temporal_Action_Localization_CVPR_2023_paper.pdf + Weakly-supervised temporal action localization (WTAL) learns to detect and classify action instances with only category labels. Most methods widely adopt the off-the-shelf Classification-Based Pre-training (CBP) to generate video features for action localization. However, the different optimization objectives between classification and localization, make temporally localized results suffer from the serious incomplete issue. To tackle this issue without additional annotations, this paper considers to distill free action knowledge from Vision-Language Pre-training (VLP), as we surprisingly observe that the localization results of vanilla VLP have an over-complete issue, which is just complementary to the CBP results. To fuse such complementarity, we propose a novel distillation-collaboration framework with two branches acting as CBP and VLP respectively. The framework is optimized through a dual-branch alternate training strategy. Specifically, during the B step, we distill the confident background pseudo-labels from the CBP branch; while during the F step, the confident foreground pseudo-labels are distilled from the VLP branch. 
As a result, the dual-branch complementarity is effectively fused to promote one strong alliance. Extensive experiments and ablation studies on THUMOS14 and ActivityNet1.2 reveal that our method significantly outperforms state-of-the-art methods. + + + + Augmentation Matters: A Simple-Yet-Effective Approach to Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Augmentation_Matters_A_Simple-Yet-Effective_Approach_to_Semi-Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Recent studies on semi-supervised semantic segmentation (SSS) have seen fast progress. Despite their promising performance, current state-of-the-art methods tend to increasingly complex designs at the cost of introducing more network components and additional training procedures. Differently, in this work, we follow a standard teacher-student framework and propose AugSeg, a simple and clean approach that focuses mainly on data perturbations to boost the SSS performance. We argue that various data augmentations should be adjusted to better adapt to the semi-supervised scenarios instead of directly applying these techniques from supervised learning. Specifically, we adopt a simplified intensity-based augmentation that selects a random number of data transformations with uniformly sampling distortion strengths from a continuous space. Based on the estimated confidence of the model on different unlabeled samples, we also randomly inject labelled information to augment the unlabeled samples in an adaptive manner. Without bells and whistles, our simple AugSeg can readily achieve new state-of-the-art performance on SSS benchmarks under different partition protocols. + + + + Boosting Verified Training for Robust Image Classifications via Abstraction + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Boosting_Verified_Training_for_Robust_Image_Classifications_via_Abstraction_CVPR_2023_paper.pdf + This paper proposes a novel, abstraction-based, certified training method for robust image classifiers. Via abstraction, all perturbed images are mapped into intervals before feeding into neural networks for training. By training on intervals, all the perturbed images that are mapped to the same interval are classified as the same label, rendering the variance of training sets to be small and the loss landscape of the models to be smooth. Consequently, our approach significantly improves the robustness of trained models. For the abstraction, our training method also enables a sound and complete black-box verification approach, which is orthogonal and scalable to arbitrary types of neural networks regardless of their sizes and architectures. We evaluate our method on a wide range of benchmarks in different scales. The experimental results show that our method outperforms state of the art by (i) reducing the verified errors of trained models up to 95.64%; (ii) totally achieving up to 602.50x speedup; and (iii) scaling up to larger models with up to 138 million trainable parameters. The demo is available at https://github.com/zhangzhaodi233/ABSCERT.git. + + + + 3D Shape Reconstruction of Semi-Transparent Worms + http://openaccess.thecvf.com//content/CVPR2023/papers/Ilett_3D_Shape_Reconstruction_of_Semi-Transparent_Worms_CVPR_2023_paper.pdf + 3D shape reconstruction typically requires identifying object features or textures in multiple images of a subject. This approach is not viable when the subject is semi-transparent and moving in and out of focus. 
Here we overcome these challenges by rendering a candidate shape with adaptive blurring and transparency for comparison with the images. We use the microscopic nematode Caenorhabditis elegans as a case study as it freely explores a 3D complex fluid with constantly changing optical properties. We model the slender worm as a 3D curve using an intrinsic parametrisation that naturally admits biologically-informed constraints and regularisation. To account for the changing optics we develop a novel differentiable renderer to construct images from 2D projections and compare against raw images to generate a pixel-wise error to jointly update the curve, camera and renderer parameters using gradient descent. The method is robust to interference such as bubbles and dirt trapped in the fluid, stays consistent through complex sequences of postures, recovers reliable estimates from blurry images and provides a significant improvement on previous attempts to track C. elegans in 3D. Our results demonstrate the potential of direct approaches to shape estimation in complex physical environments in the absence of ground-truth data. + + + + Mapping Degeneration Meets Label Evolution: Learning Infrared Small Target Detection With Single Point Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Ying_Mapping_Degeneration_Meets_Label_Evolution_Learning_Infrared_Small_Target_Detection_CVPR_2023_paper.pdf + Training a convolutional neural network (CNN) to detect infrared small targets in a fully supervised manner has gained remarkable research interests in recent years, but is highly labor expensive since a large number of per-pixel annotations are required. To handle this problem, in this paper, we make the first attempt to achieve infrared small target detection with point-level supervision. Interestingly, during the training phase supervised by point labels, we discover that CNNs first learn to segment a cluster of pixels near the targets, and then gradually converge to predict groundtruth point labels. Motivated by this "mapping degeneration" phenomenon, we propose a label evolution framework named label evolution with single point supervision (LESPS) to progressively expand the point label by leveraging the intermediate predictions of CNNs. In this way, the network predictions can finally approximate the updated pseudo labels, and a pixel-level target mask can be obtained to train CNNs in an end-to-end manner. We conduct extensive experiments with insightful visualizations to validate the effectiveness of our method. Experimental results show that CNNs equipped with LESPS can well recover the target masks from corresponding point labels, and can achieve over 70% and 95% of their fully supervised performance in terms of pixel-level intersection over union (IoU) and object-level probability of detection (Pd), respectively. Code is available at https://github.com/XinyiYing/LESPS. + + + + Swept-Angle Synthetic Wavelength Interferometry + http://openaccess.thecvf.com//content/CVPR2023/papers/Kotwal_Swept-Angle_Synthetic_Wavelength_Interferometry_CVPR_2023_paper.pdf + We present a new imaging technique, swept-angle synthetic wavelength interferometry, for full-field micron-scale 3D sensing. As in conventional synthetic wavelength interferometry, our technique uses light consisting of two narrowly-separated optical wavelengths, resulting in per-pixel interferometric measurements whose phase encodes scene depth. 
Our technique additionally uses a new type of light source that, by emulating spatially-incoherent illumination, makes interferometric measurements insensitive to aberrations and (sub)surface scattering, effects that corrupt phase measurements. The resulting technique combines the robustness to such corruptions of scanning interferometric setups, with the speed of full-field interferometric setups. Overall, our technique can recover full-frame depth at a lateral and axial resolution of 5 microns, at frame rates of 5 Hz, even under strong ambient light. We build an experimental prototype, and use it to demonstrate these capabilities by scanning a variety of objects, including objects representative of applications in inspection and fabrication, and objects that contain challenging light scattering effects. + + + + Adaptive Global Decay Process for Event Cameras + http://openaccess.thecvf.com//content/CVPR2023/papers/Nunes_Adaptive_Global_Decay_Process_for_Event_Cameras_CVPR_2023_paper.pdf + In virtually all event-based vision problems, there is the need to select the most recent events, which are assumed to carry the most relevant information content. To achieve this, at least one of three main strategies is applied, namely: 1) constant temporal decay or fixed time window, 2) constant number of events, and 3) flow-based lifetime of events. However, these strategies suffer from at least one major limitation each. We instead propose a novel decay process for event cameras that adapts to the global scene dynamics and whose latency is in the order of nanoseconds. The main idea is to construct an adaptive quantity that encodes the global scene dynamics, denoted by event activity. The proposed method is evaluated in several event-based vision problems and datasets, consistently improving the corresponding baseline methods' performance. We thus believe it can have a significant widespread impact on event-based research. Code available: https://github.com/neuromorphic-paris/event_batch. + + + + Multi-Space Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Yin_Multi-Space_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRF) and its variants have reached state-of-the-art performance in many novel-view-synthesis-related tasks. However, current NeRF-based methods still suffer from the existence of reflective objects, often resulting in blurry or distorted rendering. Instead of calculating a single radiance field, we propose a multispace neural radiance field (MS-NeRF) that represents the scene using a group of feature fields in parallel sub-spaces, which leads to a better understanding of the neural network toward the existence of reflective and refractive objects. Our multi-space scheme works as an enhancement to existing NeRF methods, with only small computational overheads needed for training and inferring the extra-space outputs. We demonstrate the superiority and compatibility of our approach using three representative NeRF-based models, i.e., NeRF, Mip-NeRF, and Mip-NeRF 360. Comparisons are performed on a novelly constructed dataset consisting of 25 synthetic scenes and 7 real captured scenes with complex reflection and refraction, all having 360-degree viewpoints. Extensive experiments show that our approach significantly outperforms the existing single-space NeRF methods for rendering high-quality scenes concerned with complex light paths through mirror-like objects. 
 + + + + Bitstream-Corrupted JPEG Images Are Restorable: Two-Stage Compensation and Alignment Framework for Image Restoration + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Bitstream-Corrupted_JPEG_Images_Are_Restorable_Two-Stage_Compensation_and_Alignment_Framework_CVPR_2023_paper.pdf + In this paper, we study a real-world JPEG image restoration problem with bit errors on the encrypted bitstream. The bit errors bring unpredictable color casts and block shifts on decoded image contents, which cannot be trivially resolved by existing image restoration methods mainly relying on pre-defined degradation models in the pixel domain. To address these challenges, we propose a robust JPEG decoder, followed by a two-stage compensation and alignment framework to restore bitstream-corrupted JPEG images. Specifically, the robust JPEG decoder adopts an error-resilient mechanism to decode the corrupted JPEG bitstream. The two-stage framework is composed of the self-compensation and alignment (SCA) stage and the guided-compensation and alignment (GCA) stage. The SCA adaptively performs block-wise image color compensation and alignment based on the estimated color and block offsets via image content similarity. The GCA leverages the extracted low-resolution thumbnail from the JPEG header to guide full-resolution pixel-wise image restoration in a coarse-to-fine manner. It is achieved by a coarse-guided pix2pix network and a refine-guided bi-directional Laplacian pyramid fusion network. We conduct experiments on three benchmarks with varying bit error rates. Experimental results and ablation studies demonstrate the superiority of our proposed method. The code will be released at https://github.com/wenyang001/Two-ACIR. + + + + Histopathology Whole Slide Image Analysis With Heterogeneous Graph Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Chan_Histopathology_Whole_Slide_Image_Analysis_With_Heterogeneous_Graph_Representation_Learning_CVPR_2023_paper.pdf + Graph-based methods have been extensively applied to whole slide histopathology image (WSI) analysis due to the advantage of modeling the spatial relationships among different entities. However, most of the existing methods focus on modeling WSIs with homogeneous graphs (e.g., with homogeneous node type). Despite their successes, these works are incapable of mining the complex structural relations between biological entities (e.g., the diverse interaction among different cell types) in the WSI. We propose a novel heterogeneous graph-based framework to leverage the inter-relationships among different types of nuclei for WSI analysis. Specifically, we formulate the WSI as a heterogeneous graph with a "nucleus-type" attribute on each node and a semantic similarity attribute on each edge. We then present a new heterogeneous-graph edge attribute transformer (HEAT) to take advantage of the edge and node heterogeneity during message aggregation. Further, we design a new pseudo-label-based semantic-consistent pooling mechanism to obtain graph-level features, which can mitigate the over-parameterization issue of conventional cluster-based pooling. Additionally, observing the limitations of existing association-based localization methods, we propose a causal-driven approach attributing the contribution of each node to improve the interpretability of our framework. 
Extensive experiments on three public TCGA benchmark datasets demonstrate that our framework outperforms the state-of-the-art methods with considerable margins on various tasks. Our codes are available at https://github.com/HKU-MedAI/WSI-HGNN. + + + + Towards All-in-One Pre-Training via Maximizing Multi-Modal Mutual Information + http://openaccess.thecvf.com//content/CVPR2023/papers/Su_Towards_All-in-One_Pre-Training_via_Maximizing_Multi-Modal_Mutual_Information_CVPR_2023_paper.pdf + To effectively exploit the potential of large-scale models, various pre-training strategies supported by massive data from different sources are proposed, including supervised pre-training, weakly-supervised pre-training, and self-supervised pre-training. It has been proved that combining multiple pre-training strategies and data from various modalities/sources can greatly boost the training of large-scale models. However, current works adopt a multi-stage pre-training system, where the complex pipeline may increase the uncertainty and instability of the pre-training. It is thus desirable that these strategies can be integrated in a single-stage manner. In this paper, we first propose a general multi-modal mutual information formula as a unified optimization target and demonstrate that all mainstream approaches are special cases of our framework. Under this unified perspective, we propose an all-in-one single-stage pre-training approach, named Maximizing Multi-modal Mutual Information Pre-training (M3I Pre-training). Our approach achieves better performance than previous pre-training methods on various vision benchmarks, including ImageNet classification, COCO object detection, LVIS long-tailed object detection, and ADE20k semantic segmentation. Notably, we successfully pre-train a billion-level parameter image backbone and achieve state-of-the-art performance on various benchmarks under public data setting. Code shall be released at https://github.com/OpenGVLab/M3I-Pretraining. + + + + Aligning Bag of Regions for Open-Vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Aligning_Bag_of_Regions_for_Open-Vocabulary_Object_Detection_CVPR_2023_paper.pdf + Pre-trained vision-language models (VLMs) learn to align vision and language representations on large-scale datasets, where each image-text pair usually contains a bag of semantic concepts. However, existing open-vocabulary object detectors only align region embeddings individually with the corresponding features extracted from the VLMs. Such a design leaves the compositional structure of semantic concepts in a scene under-exploited, although the structure may be implicitly learned by the VLMs. In this work, we propose to align the embedding of bag of regions beyond individual regions. The proposed method groups contextually interrelated regions as a bag. The embeddings of regions in a bag are treated as embeddings of words in a sentence, and they are sent to the text encoder of a VLM to obtain the bag-of-regions embedding, which is learned to be aligned to the corresponding features extracted by a frozen VLM. Applied to the commonly used Faster R-CNN, our approach surpasses the previous best results by 4.6 box AP 50 and 2.8 mask AP on novel categories of open-vocabulary COCO and LVIS benchmarks, respectively. Code and models are available at https://github.com/wusize/ovdet. 
 + + + + Two-View Geometry Scoring Without Correspondences + http://openaccess.thecvf.com//content/CVPR2023/papers/Barroso-Laguna_Two-View_Geometry_Scoring_Without_Correspondences_CVPR_2023_paper.pdf + Camera pose estimation for two-view geometry traditionally relies on RANSAC. Normally, a multitude of image correspondences leads to a pool of proposed hypotheses, which are then scored to find a winning model. The inlier count is generally regarded as a reliable indicator of "consensus". We examine this scoring heuristic, and find that it favors disappointing models under certain circumstances. As a remedy, we propose the Fundamental Scoring Network (FSNet), which infers a score for a pair of overlapping images and any proposed fundamental matrix. It does not rely on sparse correspondences, but rather embodies a two-view geometry model through an epipolar attention mechanism that predicts the pose error of the two images. FSNet can be incorporated into traditional RANSAC loops. We evaluate FSNet on fundamental and essential matrix estimation on indoor and outdoor datasets, and establish that FSNet can successfully identify good poses for pairs of images with few or unreliable correspondences. Besides, we show that naively combining FSNet with the MAGSAC++ scoring approach achieves state-of-the-art results. + + + + Annealing-Based Label-Transfer Learning for Open World Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_Annealing-Based_Label-Transfer_Learning_for_Open_World_Object_Detection_CVPR_2023_paper.pdf + Open world object detection (OWOD) has attracted extensive attention due to its practicability in the real world. Previous OWOD works manually designed unknown-discovery strategies to select unknown proposals from the background, suffering from uncertainties without appropriate priors. In this paper, we claim that the learning of object detection could be seen as an object-level feature-entanglement process, where unknown traits are propagated to the known proposals through convolutional operations and could be distilled to benefit unknown recognition without manual selection. Therefore, we propose a simple yet effective Annealing-based Label-Transfer framework, which sufficiently explores the known proposals to alleviate the uncertainties. Specifically, a Label-Transfer Learning paradigm is introduced to decouple the known and unknown features, while a Sawtooth Annealing Scheduling strategy is further employed to rebuild the decision boundaries of the known and unknown classes, thus promoting both known and unknown recognition. Moreover, previous OWOD works neglected the trade-off of known and unknown performance, and we thus introduce a metric called Equilibrium Index to comprehensively evaluate the effectiveness of the OWOD models. To the best of our knowledge, this is the first OWOD work without manual unknown selection. Extensive experiments conducted on the commonly used benchmark validate that our model achieves superior detection performance (200% unknown mAP improvement with even higher known detection performance) compared to other state-of-the-art methods. Our code is available at https://github.com/DIG-Beihang/ALLOW.git. + + + + Self-Supervised Video Forensics by Audio-Visual Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Self-Supervised_Video_Forensics_by_Audio-Visual_Anomaly_Detection_CVPR_2023_paper.pdf + Manipulated videos often contain subtle inconsistencies between their visual and audio signals. 
We propose a video forensics method, based on anomaly detection, that can identify these inconsistencies, and that can be trained solely using real, unlabeled data. We train an autoregressive model to generate sequences of audio-visual features, using feature sets that capture the temporal synchronization between video frames and sound. At test time, we then flag videos that the model assigns low probability. Despite being trained entirely on real videos, our model obtains strong performance on the task of detecting manipulated speech videos. Project site: https://cfeng16.github.io/audio-visual-forensics. + + + + Class Balanced Adaptive Pseudo Labeling for Federated Semi-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Class_Balanced_Adaptive_Pseudo_Labeling_for_Federated_Semi-Supervised_Learning_CVPR_2023_paper.pdf + This paper focuses on federated semi-supervised learning (FSSL), assuming that few clients have fully labeled data (labeled clients) and the training datasets in other clients are fully unlabeled (unlabeled clients). Existing methods attempt to deal with the challenges caused by not independent and identically distributed data (Non-IID) setting. Though methods such as sub-consensus models have been proposed, they usually adopt standard pseudo labeling or consistency regularization on unlabeled clients which can be easily influenced by imbalanced class distribution. Thus, problems in FSSL are still yet to be solved. To seek for a fundamental solution to this problem, we present Class Balanced Adaptive Pseudo Labeling (CBAFed), to study FSSL from the perspective of pseudo labeling. In CBAFed, the first key element is a fixed pseudo labeling strategy to handle the catastrophic forgetting problem, where we keep a fixed set by letting pass information of unlabeled data at the beginning of the unlabeled client training in each communication round. The second key element is that we design class balanced adaptive thresholds via considering the empirical distribution of all training data in local clients, to encourage a balanced training process. To make the model reach a better optimum, we further propose a residual weight connection in local supervised training and global model aggregation. Extensive experiments on five datasets demonstrate the superiority of CBAFed. Code will be released. + + + + Rethinking Out-of-Distribution (OOD) Detection: Masked Image Modeling Is All You Need + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Rethinking_Out-of-Distribution_OOD_Detection_Masked_Image_Modeling_Is_All_You_CVPR_2023_paper.pdf + The core of out-of-distribution (OOD) detection is to learn the in-distribution (ID) representation, which is distinguishable from OOD samples. Previous work applied recognition-based methods to learn the ID features, which tend to learn shortcuts instead of comprehensive representations. In this work, we find surprisingly that simply using reconstruction-based methods could boost the performance of OOD detection significantly. We deeply explore the main contributors of OOD detection and find that reconstruction-based pretext tasks have the potential to provide a generally applicable and efficacious prior, which benefits the model in learning intrinsic data distributions of the ID dataset. Specifically, we take Masked Image Modeling as a pretext task for our OOD detection framework (MOOD). 
Without bells and whistles, MOOD outperforms previous SOTA of one-class OOD detection by 5.7%, multi-class OOD detection by 3.0%, and near-distribution OOD detection by 2.1%. It even defeats the 10-shot-per-class outlier exposure OOD detection, although we do not include any OOD samples for our detection. + + + + Masked Scene Contrast: A Scalable Framework for Unsupervised 3D Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Masked_Scene_Contrast_A_Scalable_Framework_for_Unsupervised_3D_Representation_CVPR_2023_paper.pdf + As a pioneering work, PointContrast conducts unsupervised 3D representation learning via leveraging contrastive learning over raw RGB-D frames and proves its effectiveness on various downstream tasks. However, the trend of large-scale unsupervised learning in 3D has yet to emerge due to two stumbling blocks: the inefficiency of matching RGB-D frames as contrastive views and the annoying mode collapse phenomenon mentioned in previous works. Turning the two stumbling blocks into empirical stepping stones, we first propose an efficient and effective contrastive learning framework, which generates contrastive views directly on scene-level point clouds by a well-curated data augmentation pipeline and a practical view mixing strategy. Second, we introduce reconstructive learning on the contrastive learning framework with an exquisite design of contrastive cross masks, which targets the reconstruction of point color and surfel normal. Our Masked Scene Contrast (MSC) framework is capable of extracting comprehensive 3D representations more efficiently and effectively. It accelerates the pre-training procedure by at least 3x and still achieves an uncompromised performance compared with previous work. Besides, MSC also enables large-scale 3D pre-training across multiple datasets, which further boosts the performance and achieves state-of-the-art fine-tuning results on several downstream tasks, e.g., 75.5% mIoU on ScanNet semantic segmentation validation set. + + + + Multi Domain Learning for Motion Magnification + http://openaccess.thecvf.com//content/CVPR2023/papers/Singh_Multi_Domain_Learning_for_Motion_Magnification_CVPR_2023_paper.pdf + Video motion magnification makes subtle invisible motions visible, such as small chest movements while breathing, subtle vibrations in the moving objects etc. But small motions are prone to noise, illumination changes, large motions, etc. making the task difficult. Most state-of-the-art methods use hand-crafted concepts which result in small magnification, ringing artifacts etc. The deep learning based approach has higher magnification but is prone to severe artifacts in some scenarios. We propose a new phase based deep network for video motion magnification that operates in both domains (frequency and spatial) to address this issue. It generates motion magnification from frequency domain phase fluctuations and then improves its quality in the spatial domain. The proposed models are lightweight networks with fewer parameters ( 0.11M and 0.05M). Further, the proposed networks performance is compared to the SOTA approaches and evaluated on real-world and synthetic videos. Finally, an ablation study is also conducted to show the impact of different parts of the network. 
 + + + + A Simple Baseline for Video Restoration With Grouped Spatial-Temporal Shift + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_A_Simple_Baseline_for_Video_Restoration_With_Grouped_Spatial-Temporal_Shift_CVPR_2023_paper.pdf + Video restoration, which aims to restore clear frames from degraded videos, has numerous important applications. The key to video restoration lies in utilizing inter-frame information. However, existing deep learning methods often rely on complicated network architectures, such as optical flow estimation, deformable convolution, and cross-frame self-attention layers, resulting in high computational costs. In this study, we propose a simple yet effective framework for video restoration. Our approach is based on grouped spatial-temporal shift, which is a lightweight and straightforward technique that can implicitly capture inter-frame correspondences for multi-frame aggregation. By introducing grouped spatial shift, we attain expansive effective receptive fields. Combined with basic 2D convolution, this simple framework can effectively aggregate inter-frame information. Extensive experiments demonstrate that our framework outperforms the previous state-of-the-art method, while using less than a quarter of its computational cost, on both video deblurring and video denoising tasks. These results indicate the potential for our approach to significantly reduce computational overhead while maintaining high-quality results. Code is available at https://github.com/dasongli1/Shift-Net. + + + + itKD: Interchange Transfer-Based Knowledge Distillation for 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Cho_itKD_Interchange_Transfer-Based_Knowledge_Distillation_for_3D_Object_Detection_CVPR_2023_paper.pdf + Point-cloud-based 3D object detectors have recently achieved remarkable progress. However, most studies are limited to the development of network architectures for improving only their accuracy without consideration of the computational efficiency. In this paper, we first propose an autoencoder-style framework comprising channel-wise compression and decompression via interchange transfer-based knowledge distillation. To learn the map-view feature of a teacher network, the features from teacher and student networks are independently passed through the shared autoencoder; here, we use a compressed representation loss that binds the channel-wise compression knowledge from both student and teacher networks as a kind of regularization. The decompressed features are transferred in opposite directions to reduce the gap in the interchange reconstructions. Lastly, we present a head attention loss to match the 3D object detection information drawn by the multi-head self-attention mechanism. Through extensive experiments, we verify that our method can train a lightweight model that is well-aligned with the 3D point cloud detection task, and we demonstrate its superiority using well-known public datasets, e.g., Waymo and nuScenes. + + + + 2PCNet: Two-Phase Consistency Training for Day-to-Night Unsupervised Domain Adaptive Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Kennerley_2PCNet_Two-Phase_Consistency_Training_for_Day-to-Night_Unsupervised_Domain_Adaptive_Object_CVPR_2023_paper.pdf + Object detection at night is a challenging problem due to the absence of night image annotations. Despite several domain adaptation methods, achieving high-precision results remains an issue. 
False-positive error propagation is still observed in methods using the well-established student-teacher framework, particularly for small-scale and low-light objects. This paper proposes a two-phase consistency unsupervised domain adaptation network, 2PCNet, to address these issues. The network employs high-confidence bounding-box predictions from the teacher in the first phase and appends them to the student's region proposals for the teacher to re-evaluate in the second phase, resulting in a combination of high and low confidence pseudo-labels. The night images and pseudo-labels are scaled-down before being used as input to the student, providing stronger small-scale pseudo-labels. To address errors that arise from low-light regions and other night-related attributes in images, we propose a night-specific augmentation pipeline called NightAug. This pipeline involves applying random augmentations, such as glare, blur, and noise, to daytime images. Experiments on publicly available datasets demonstrate that our method achieves superior results to state-of-the-art methods by 20%, and to supervised models trained directly on the target data. + + + + Panoptic Lifting for 3D Scene Understanding With Neural Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Siddiqui_Panoptic_Lifting_for_3D_Scene_Understanding_With_Neural_Fields_CVPR_2023_paper.pdf + We propose Panoptic Lifting, a novel approach for learning panoptic 3D volumetric representations from images of in-the-wild scenes. Once trained, our model can render color images together with 3D-consistent panoptic segmentation from novel viewpoints. Unlike existing approaches which use 3D input directly or indirectly, our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network. Our core contribution is a panoptic lifting scheme based on a neural field representation that generates a unified and multi-view consistent, 3D panoptic representation of the scene. To account for inconsistencies of 2D instance identifiers across views, we solve a linear assignment with a cost based on the model's current predictions and the machine-generated segmentation masks, thus enabling us to lift 2D instances to 3D in a consistent way. We further propose and ablate contributions that make our method more robust to noisy, machine-generated labels, including test-time augmentations for confidence estimates, segment consistency loss, bounded segmentation fields, and gradient stopping. Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets, improving by 8.4, 13.8, and 10.6% in scene-level PQ over state of the art. + + + + WeatherStream: Light Transport Automation of Single Image Deweathering + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_WeatherStream_Light_Transport_Automation_of_Single_Image_Deweathering_CVPR_2023_paper.pdf + Today single image deweathering is arguably more sensitive to the dataset type, rather than the model. We introduce WeatherStream, an automatic pipeline capturing all real-world weather effects (rain, snow, and rain fog degradations), along with their clean image pairs. Previous state-of-the-art methods that have attempted the all-weather removal task train on synthetic pairs, and are thus limited by the Sim2Real domain gap. Recent work has attempted to manually collect time multiplexed pairs, but the use of human labor limits the scale of such a dataset. 
We introduce a pipeline that uses the power of light-transport physics and a model trained on a small, initial seed dataset to reject approximately 99.6% of unwanted scenes. The pipeline is able to generalize to new scenes and degradations that can, in turn, be used to train existing models just like fully human-labeled data. Training on a dataset collected through this procedure leads to significant improvements on multiple existing weather removal methods on a carefully human-collected test set of real-world weather effects. The dataset and code can be found at the following website: http://visual.ee.ucla.edu/wstream.htm/. + + + + Learning To Detect Mirrors From Videos via Dual Correspondences + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Learning_To_Detect_Mirrors_From_Videos_via_Dual_Correspondences_CVPR_2023_paper.pdf + Detecting mirrors from static images has received significant research interest recently. However, detecting mirrors over dynamic scenes is still under-explored due to the lack of a high-quality dataset and an effective method for video mirror detection (VMD). To the best of our knowledge, this is the first work to address the VMD problem from a deep-learning-based perspective. Our observation is that there are often correspondences between the contents inside (reflected) and outside (real) of a mirror, but such correspondences may not always appear in every frame, e.g., due to the change of camera pose. This inspires us to propose a video mirror detection method, named VMD-Net, that can tolerate spatially missing correspondences by considering the mirror correspondences at both the intra-frame level as well as the inter-frame level via a dual correspondence module that looks over multiple frames spatially and temporally for correlating correspondences. We further propose the first large-scale dataset for VMD (named VMD-D), which contains 14,987 image frames from 269 videos with corresponding manually annotated masks. Experimental results show that the proposed method outperforms SOTA methods from relevant fields. To enable real-time VMD, our method efficiently utilizes the backbone features by removing the redundant multi-level module design and gets rid of the post-processing of the output maps commonly used in existing methods, making it very efficient and practical for real-time video-based applications. Code, dataset, and models are available at https://jiaying.link/cvpr2023-vmd/ + + + + The Devil Is in the Points: Weakly Semi-Supervised Instance Segmentation via Point-Guided Mask Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_The_Devil_Is_in_the_Points_Weakly_Semi-Supervised_Instance_Segmentation_CVPR_2023_paper.pdf + In this paper, we introduce a novel learning scheme named weakly semi-supervised instance segmentation (WSSIS) with point labels for budget-efficient and high-performance instance segmentation. Namely, we consider a dataset setting consisting of a few fully-labeled images and a large number of point-labeled images. Motivated by the observation that the main challenge of semi-supervised approaches derives mainly from the trade-off between false-negative and false-positive instance proposals, we propose a method for WSSIS that can effectively leverage the budget-friendly point labels as a powerful weak supervision source to resolve the challenge. Furthermore, to deal with the hard case where the amount of fully-labeled data is extremely limited, we propose a MaskRefineNet that refines noise in rough masks. 
We conduct extensive experiments on COCO and BDD100K datasets, and the proposed method achieves promising results comparable to those of the fully-supervised model, even with 50% of the fully labeled COCO data (38.8% vs. 39.7%). Moreover, when using as little as 5% of fully labeled COCO data, our method shows significantly superior performance over the state-of-the-art semi-supervised learning method (33.7% vs. 24.9%). The code is available at https://github.com/clovaai/PointWSSIS. + + + + Language-Guided Audio-Visual Source Separation via Trimodal Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Tan_Language-Guided_Audio-Visual_Source_Separation_via_Trimodal_Consistency_CVPR_2023_paper.pdf + We propose a self-supervised approach for learning to perform audio source separation in videos based on natural language queries, using only unlabeled video and audio pairs as training data. A key challenge in this task is learning to associate the linguistic description of a sound-emitting object to its visual features and the corresponding components of the audio waveform, all without access to annotations during training. To overcome this challenge, we adapt off-the-shelf vision-language foundation models to provide pseudo-target supervision via two novel loss functions and encourage a stronger alignment between the audio, visual and natural language modalities. During inference, our approach can separate sounds given text, video and audio input, or given text and audio input alone. We demonstrate the effectiveness of our self-supervised approach on three audio-visual separation datasets, including MUSIC, SOLOS and AudioSet, where we outperform state-of-the-art strongly supervised approaches despite not using object detectors or text labels during training. Finally, we also include samples of our separated audios in the supplemental for reference. + + + + DynaMask: Dynamic Mask Selection for Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_DynaMask_Dynamic_Mask_Selection_for_Instance_Segmentation_CVPR_2023_paper.pdf + The representative instance segmentation methods mostly segment different object instances with a mask of the fixed resolution, e.g., 28x 28 grid. However, a low-resolution mask loses rich details, while a high-resolution mask incurs quadratic computation overhead. It is a challenging task to predict the optimal binary mask for each instance. In this paper, we propose to dynamically select suitable masks for different object proposals. First, a dual-level Feature Pyramid Network (FPN) with adaptive feature aggregation is developed to gradually increase the mask grid resolution, ensuring high-quality segmentation of objects. Specifically, an efficient region-level top-down path (r-FPN) is introduced to incorporate complementary contextual and detailed information from different stages of image-level FPN (i-FPN). Then, to alleviate the increase of computation and memory costs caused by using large masks, we develop a Mask Switch Module (MSM) with negligible computational cost to select the most suitable mask resolution for each instance, achieving high efficiency while maintaining high segmentation accuracy. Without bells and whistles, the proposed method, namely DynaMask, brings consistent and noticeable performance improvements over other state-of-the-arts at a moderate computation overhead. The source code: https://github.com/lslrh/DynaMask. 
 + + + + SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_SAP-DETR_Bridging_the_Gap_Between_Salient_Points_and_Queries-Based_Transformer_CVPR_2023_paper.pdf + Recently, the dominant DETR-based approaches apply a central-concept spatial prior to accelerate Transformer detector convergence. These methods gradually refine the reference points to the center of target objects and imbue object queries with the updated central reference information for spatially conditional attention. However, centralizing reference points may severely deteriorate queries' saliency and confuse detectors due to the indiscriminative spatial prior. To bridge the gap between the reference points of salient queries and Transformer detectors, we propose SAlient Point-based DETR (SAP-DETR) by treating object detection as a transformation from salient points to instance objects. In SAP-DETR, we explicitly initialize a query-specific reference point for each object query, gradually aggregate them into an instance object, and then predict the distance from each side of the bounding box to these points. By rapidly attending to the query-specific reference region and other conditional extreme regions from the image features, SAP-DETR can effectively bridge the gap between the salient point and the query-based Transformer detector with a significantly faster convergence speed. Our extensive experiments have demonstrated that SAP-DETR achieves 1.4 times the convergence speed with competitive performance. Under the standard training scheme, SAP-DETR stably improves upon the SOTA approaches by 1.0 AP. Based on ResNet-DC-101, SAP-DETR achieves 46.9 AP. The code will be released at https://github.com/liuyang-ict/SAP-DETR. + + + + GD-MAE: Generative Decoder for MAE Pre-Training on LiDAR Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_GD-MAE_Generative_Decoder_for_MAE_Pre-Training_on_LiDAR_Point_Clouds_CVPR_2023_paper.pdf + Despite the tremendous progress of Masked Autoencoders (MAE) in developing vision tasks such as image and video, exploring MAE in large-scale 3D point clouds remains challenging due to the inherent irregularity. In contrast to previous 3D MAE frameworks, which either design a complex decoder to infer masked information from maintained regions or adopt sophisticated masking strategies, we instead propose a much simpler paradigm. The core idea is to apply a Generative Decoder for MAE (GD-MAE) to automatically merge the surrounding context to restore the masked geometric knowledge in a hierarchical fusion manner. In doing so, our approach is free from introducing the heuristic design of decoders and enjoys the flexibility of exploring various masking strategies. The corresponding part costs less than 12% latency compared with conventional methods, while achieving better performance. We demonstrate the efficacy of the proposed method on several large-scale benchmarks: Waymo, KITTI, and ONCE. Consistent improvement on downstream detection tasks illustrates strong robustness and generalization capability. Not only does our method achieve state-of-the-art results, but remarkably, it attains comparable accuracy even with 20% of the labeled data on the Waymo dataset. Code will be released. 
 + + + + Re-Thinking Model Inversion Attacks Against Deep Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Nguyen_Re-Thinking_Model_Inversion_Attacks_Against_Deep_Neural_Networks_CVPR_2023_paper.pdf + Model inversion (MI) attacks aim to infer and reconstruct private training data by abusing access to a model. MI attacks have raised concerns about the leaking of sensitive information (e.g. private face images used in training a face recognition system). Recently, several algorithms for MI have been proposed to improve the attack performance. In this work, we revisit MI, study two fundamental issues pertaining to all state-of-the-art (SOTA) MI algorithms, and propose solutions to these issues which lead to a significant boost in attack performance for all SOTA MI. In particular, our contributions are two-fold: 1) We analyze the optimization objective of SOTA MI algorithms, argue that the objective is sub-optimal for achieving MI, and propose an improved optimization objective that boosts attack performance significantly. 2) We analyze "MI overfitting", show that it would prevent reconstructed images from learning semantics of training data, and propose a novel "model augmentation" idea to overcome this issue. Our proposed solutions are simple and improve all SOTA MI attack accuracy significantly. For example, on the standard CelebA benchmark, our solutions improve accuracy by 11.8% and achieve for the first time over 90% attack accuracy. Our findings demonstrate that there is a clear risk of leaking sensitive information from deep learning models. We urge serious consideration to be given to the privacy implications. Our code, demo, and models are available at https://ngoc-nguyen-0.github.io/re-thinking_model_inversion_attacks/ + + + + You Need Multiple Exiting: Dynamic Early Exiting for Accelerating Unified Vision Language Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_You_Need_Multiple_Exiting_Dynamic_Early_Exiting_for_Accelerating_Unified_CVPR_2023_paper.pdf + Large-scale transformer models bring significant improvements for various downstream vision language tasks with a unified architecture. The performance improvements come with increasing model size, resulting in slow inference speed and increased cost for serving. While certain predictions benefit from the full complexity of the large-scale model, not all inputs need the same amount of computation, potentially leading to wasted computational resources. To handle this challenge, early exiting is proposed to adaptively allocate computational power according to input complexity to improve inference efficiency. Existing early exiting strategies usually adopt output confidence based on intermediate layers as a proxy of input complexity to decide whether to skip the following layers. However, such strategies cannot be applied to the encoder in the widely-used unified architecture with both encoder and decoder, due to the difficulty of output confidence estimation in the encoder. Ignoring early exiting in the encoder component is suboptimal in terms of saving computational power. To handle this challenge, we propose a novel early exiting strategy for unified visual language models, named MuE, which allows dynamically skipping layers in the encoder and decoder simultaneously based on input layer-wise similarities, with multiple early exits. 
By decomposing the image and text modalities in the encoder, MuE is flexible and can skip different layers for different modalities, advancing the inference efficiency while minimizing the performance drop. Experiments on the SNLI-VE and MS COCO datasets show that the proposed approach MuE can reduce inference time by up to 50% and 40% while maintaining 99% and 96% of performance, respectively. + + + + PROB: Probabilistic Objectness for Open World Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zohar_PROB_Probabilistic_Objectness_for_Open_World_Object_Detection_CVPR_2023_paper.pdf + Open World Object Detection (OWOD) is a new and challenging computer vision task that bridges the gap between classic object detection (OD) benchmarks and object detection in the real world. In addition to detecting and classifying seen/labeled objects, OWOD algorithms are expected to detect novel/unknown objects - which can be classified and incrementally learned. In standard OD, object proposals not overlapping with a labeled object are automatically classified as background. Therefore, simply applying OD methods to OWOD fails as unknown objects would be predicted as background. The challenge of detecting unknown objects stems from the lack of supervision in distinguishing unknown objects and background object proposals. Previous OWOD methods have attempted to overcome this issue by generating supervision using pseudo-labeling - however, unknown object detection has remained low. Probabilistic/generative models may provide a solution for this challenge. Herein, we introduce a novel probabilistic framework for objectness estimation, where we alternate between probability distribution estimation and objectness likelihood maximization of known objects in the embedded feature space - ultimately allowing us to estimate the objectness probability of different proposals. The resulting Probabilistic Objectness transformer-based open-world detector, PROB, integrates our framework into traditional object detection models, adapting them for the open-world setting. Comprehensive experiments on OWOD benchmarks show that PROB outperforms all existing OWOD methods in both unknown object detection (2x unknown recall) and known object detection (mAP). Our code is available at https://github.com/orrzohar/PROB. + + + + SparseFusion: Distilling View-Conditioned Diffusion for 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_SparseFusion_Distilling_View-Conditioned_Diffusion_for_3D_Reconstruction_CVPR_2023_paper.pdf + We propose SparseFusion, a sparse view 3D reconstruction approach that unifies recent advances in neural rendering and probabilistic image generation. Existing approaches typically build on neural rendering with re-projected features but fail to generate unseen regions or handle uncertainty under large viewpoint changes. Alternate methods treat this as a (probabilistic) 2D synthesis task, and while they can generate plausible 2D images, they do not infer a consistent underlying 3D. However, we find that this trade-off between 3D consistency and probabilistic image generation does not need to exist. In fact, we show that geometric consistency and generative inference can be complementary in a mode seeking behavior. By distilling a 3D consistent scene representation from a view-conditioned latent diffusion model, we are able to recover a plausible 3D representation whose renderings are both accurate and realistic. 
We evaluate our approach across 51 categories in the CO3D dataset and show that it outperforms existing methods, in both distortion and perception metrics, for sparse view novel view synthesis. + + + + Dynamic Focus-Aware Positional Queries for Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Dynamic_Focus-Aware_Positional_Queries_for_Semantic_Segmentation_CVPR_2023_paper.pdf + The DETR-like segmentors have underpinned the most recent breakthroughs in semantic segmentation, which end-to-end train a set of queries representing the class prototypes or target segments. Recently, masked attention is proposed to restrict each query to only attend to the foreground regions predicted by the preceding decoder block for easier optimization. Although promising, it relies on the learnable parameterized positional queries which tend to encode the dataset statistics, leading to inaccurate localization for distinct individual queries. In this paper, we propose a simple yet effective query design for semantic segmentation termed Dynamic Focus-aware Positional Queries (DFPQ), which dynamically generates positional queries conditioned on the cross-attention scores from the preceding decoder block and the positional encodings for the corresponding image features, simultaneously. Therefore, our DFPQ preserves rich localization information for the target segments and provides accurate and fine-grained positional priors. In addition, we propose to efficiently deal with high-resolution cross-attention by only aggregating the contextual tokens based on the low-resolution cross-attention scores to perform local relation aggregation. Extensive experiments on ADE20K and Cityscapes show that with the two modifications on Mask2former, our framework achieves SOTA performance and outperforms Mask2former by clear margins of 1.1%, 1.9%, and 1.1% single-scale mIoU with ResNet-50, Swin-T, and Swin-B backbones on the ADE20K validation set, respectively. Source code is available at https://github.com/ziplab/FASeg. + + + + HARP: Personalized Hand Reconstruction From a Monocular RGB Video + http://openaccess.thecvf.com//content/CVPR2023/papers/Karunratanakul_HARP_Personalized_Hand_Reconstruction_From_a_Monocular_RGB_Video_CVPR_2023_paper.pdf + We present HARP (HAnd Reconstruction and Personalization), a personalized hand avatar creation approach that takes a short monocular RGB video of a human hand as input and reconstructs a faithful hand avatar exhibiting a high-fidelity appearance and geometry. In contrast to the major trend of neural implicit representations, HARP models a hand with a mesh-based parametric hand model, a vertex displacement map, a normal map, and an albedo without any neural components. The explicit nature of our representation enables a truly scalable, robust, and efficient approach to hand avatar creation as validated by our experiments. HARP is optimized via gradient descent from a short sequence captured by a hand-held mobile phone and can be directly used in AR/VR applications with real-time rendering capability. To enable this, we carefully design and implement a shadow-aware differentiable rendering scheme that is robust to high degree articulations and self-shadowing regularly present in hand motions, as well as challenging lighting conditions. It also generalizes to unseen poses and novel viewpoints, producing photo-realistic renderings of hand animations. 
Furthermore, the learned HARP representation can be used for improving 3D hand pose estimation quality in challenging viewpoints. The key advantages of HARP are validated by the in-depth analyses on appearance reconstruction, novel view and novel pose synthesis, and 3D hand pose refinement. It is an AR/VR-ready personalized hand representation that shows superior fidelity and scalability. + + + + DART: Diversify-Aggregate-Repeat Training Improves Generalization of Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Jain_DART_Diversify-Aggregate-Repeat_Training_Improves_Generalization_of_Neural_Networks_CVPR_2023_paper.pdf + Generalization of Neural Networks is crucial for deploying them safely in the real world. Common training strategies to improve generalization involve the use of data augmentations, ensembling and model averaging. In this work, we first establish a surprisingly simple but strong benchmark for generalization which utilizes diverse augmentations within a training minibatch, and show that this can learn a more balanced distribution of features. Further, we propose Diversify-Aggregate-Repeat Training (DART) strategy that first trains diverse models using different augmentations (or domains) to explore the loss basin, and further Aggregates their weights to combine their expertise and obtain improved generalization. We find that Repeating the step of Aggregation throughout training improves the overall optimization trajectory and also ensures that the individual models have sufficiently low loss barrier to obtain improved generalization on combining them. We theoretically justify the proposed approach and show that it indeed generalizes better. In addition to improvements in In-Domain generalization, we demonstrate SOTA performance on the Domain Generalization benchmarks in the popular DomainBed framework as well. Our method is generic and can easily be integrated with several base training algorithms to achieve performance gains. Our code is available here: https://github.com/val-iisc/DART. + + + + EvShutter: Transforming Events for Unconstrained Rolling Shutter Correction + http://openaccess.thecvf.com//content/CVPR2023/papers/Erbach_EvShutter_Transforming_Events_for_Unconstrained_Rolling_Shutter_Correction_CVPR_2023_paper.pdf + Widely used Rolling Shutter (RS) CMOS sensors capture high resolution images at the expense of introducing distortions and artifacts in the presence of motion. In such situations, RS distortion correction algorithms are critical. Recent methods rely on a constant velocity assumption and require multiple frames to predict the dense displacement field. In this work, we introduce a new method, called Eventful Shutter (EvShutter), that corrects RS artifacts using a single RGB image and event information with high temporal resolution. The method firstly removes blur using a novel flow-based deblurring module and then compensates RS using a double encoder hourglass network. In contrast to previous methods, it does not rely on a constant velocity assumption and uses a simple architecture thanks to an event transformation dedicated to RS, called Filter and Flip (FnF), that transforms input events to encode only the changes between GS and RS images. To evaluate the proposed method and facilitate future research, we collect the first dataset with real events and high-quality RS images with optional blur, called RS-ERGB. We generate the RS images from GS images using a newly proposed simulator based on adaptive interpolation. 
The simulator permits the use of inexpensive cameras with long exposure to capture high-quality GS images. We show that on this realistic dataset the proposed method outperforms the state-of-the-art image- and event-based methods by 9.16 dB and 0.75 dB, respectively, in terms of PSNR, with improvements of 23% and 21% in LPIPS. + + + + Ambiguity-Resistant Semi-Supervised Learning for Dense Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Ambiguity-Resistant_Semi-Supervised_Learning_for_Dense_Object_Detection_CVPR_2023_paper.pdf + With basic Semi-Supervised Object Detection (SSOD) techniques, one-stage detectors generally obtain limited gains compared with their two-stage counterparts. We experimentally find that the root lies in two kinds of ambiguities: (1) Selection ambiguity, where selected pseudo labels are less accurate, since classification scores cannot properly represent the localization quality. (2) Assignment ambiguity, where samples are matched with improper labels in pseudo-label assignment, as the strategy is misguided by missed objects and inaccurate pseudo boxes. To tackle these problems, we propose Ambiguity-Resistant Semi-supervised Learning (ARSL) for one-stage detectors. Specifically, to alleviate the selection ambiguity, Joint-Confidence Estimation (JCE) is proposed to jointly quantify the classification and localization quality of pseudo labels. As for the assignment ambiguity, Task-Separation Assignment (TSA) is introduced to assign labels based on pixel-level predictions rather than unreliable pseudo boxes. It employs a 'divide-and-conquer' strategy and separately exploits positives for the classification and localization task, which is more robust to the assignment ambiguity. Comprehensive experiments demonstrate that ARSL effectively mitigates the ambiguities and achieves state-of-the-art SSOD performance on MS COCO and PASCAL VOC. Codes can be found at https://github.com/PaddlePaddle/PaddleDetection. + + + + Scalable, Detailed and Mask-Free Universal Photometric Stereo + http://openaccess.thecvf.com//content/CVPR2023/papers/Ikehata_Scalable_Detailed_and_Mask-Free_Universal_Photometric_Stereo_CVPR_2023_paper.pdf + In this paper, we introduce SDM-UniPS, a groundbreaking Scalable, Detailed, Mask-free, and Universal Photometric Stereo network. Our approach can recover astonishingly intricate surface normal maps, rivaling the quality of 3D scanners, even when images are captured under unknown, spatially-varying lighting conditions in uncontrolled environments. We have extended previous universal photometric stereo networks to extract spatial-light features, utilizing all available information in high-resolution input images and accounting for non-local interactions among surface points. Moreover, we present a new synthetic training dataset that encompasses a diverse range of shapes, materials, and illumination scenarios found in real-world scenes. Through extensive evaluation, we demonstrate that our method not only surpasses calibrated, lighting-specific techniques on public benchmarks, but also excels with a significantly smaller number of input images even without object masks. 
+ + + + Towards High-Quality and Efficient Video Super-Resolution via Spatial-Temporal Data Overfitting + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Towards_High-Quality_and_Efficient_Video_Super-Resolution_via_Spatial-Temporal_Data_Overfitting_CVPR_2023_paper.pdf + As deep convolutional neural networks (DNNs) are widely used in various fields of computer vision, leveraging the overfitting ability of the DNN to achieve video resolution upscaling has become a new trend in the modern video delivery system. By dividing videos into chunks and overfitting each chunk with a super-resolution model, the server encodes videos before transmitting them to the clients, thus achieving better video quality and transmission efficiency. However, a large number of chunks are expected to ensure good overfitting quality, which substantially increases the storage and consumes more bandwidth resources for data transmission. On the other hand, decreasing the number of chunks through training optimization techniques usually requires high model capacity, which significantly slows down execution speed. To reconcile such, we propose a novel method for high-quality and efficient video resolution upscaling tasks, which leverages the spatial-temporal information to accurately divide video into chunks, thus keeping the number of chunks as well as the model size to a minimum. Additionally, we advance our method into a single overfitting model by a data-aware joint training technique, which further reduces the storage requirement with negligible quality drop. We deploy our proposed overfitting models on an off-the-shelf mobile phone, and experimental results show that our method achieves real-time video super-resolution with high video quality. Compared with the state-of-the-art, our method achieves 28 fps streaming speed with 41.60 PSNR, which is 14 times faster and 2.29 dB better in the live video resolution upscaling tasks. + + + + BiFormer: Vision Transformer With Bi-Level Routing Attention + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_BiFormer_Vision_Transformer_With_Bi-Level_Routing_Attention_CVPR_2023_paper.pdf + As the core building block of vision transformers, attention is a powerful tool to capture long-range dependency. However, such power comes at a cost: it incurs a huge computation burden and heavy memory footprint as pairwise token interaction across all spatial locations is computed. A series of works attempt to alleviate this problem by introducing handcrafted and content-agnostic sparsity into attention, such as restricting the attention operation to be inside local windows, axial stripes, or dilated windows. In contrast to these approaches, we propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness. Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions (i.e., routed regions). We provide a simple yet effective implementation of the proposed bi-level routing attention, which utilizes the sparsity to save both computation and memory while involving only GPU-friendly dense matrix multiplications. Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented. 
As BiFormer attends to a small subset of relevant tokens in a query-adaptive manner without distraction from other irrelevant ones, it enjoys both good performance and high computational efficiency, especially in dense prediction tasks. Empirical results across several computer vision tasks such as image classification, object detection, and semantic segmentation verify the effectiveness of our design. Code is available at https://github.com/rayleizhu/BiFormer. + + + + Class-Incremental Exemplar Compression for Class-Incremental Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_Class-Incremental_Exemplar_Compression_for_Class-Incremental_Learning_CVPR_2023_paper.pdf + Exemplar-based class-incremental learning (CIL) finetunes the model with all samples of new classes but few-shot exemplars of old classes in each incremental phase, where the "few-shot" abides by the limited memory budget. In this paper, we break this "few-shot" limit based on a simple yet surprisingly effective idea: compressing exemplars by downsampling non-discriminative pixels and saving "many-shot" compressed exemplars in the memory. Without needing any manual annotation, we achieve this compression by generating 0-1 masks on discriminative pixels from class activation maps (CAM). We propose an adaptive mask generation model called class-incremental masking (CIM) to explicitly resolve two difficulties of using CAM: 1) transforming the heatmaps of CAM to 0-1 masks with an arbitrary threshold leads to a trade-off between the coverage on discriminative pixels and the quantity of exemplars, as the total memory is fixed; and 2) optimal thresholds vary for different object classes, which is particularly obvious in the dynamic environment of CIL. We optimize the CIM model alternatively with the conventional CIL model through a bilevel optimization problem. We conduct extensive experiments on high-resolution CIL benchmarks including Food-101, ImageNet-100, and ImageNet-1000, and show that using the compressed exemplars by CIM can achieve a new state-of-the-art CIL accuracy, e.g., 4.8 percentage points higher than FOSTER on 10-Phase ImageNet-1000. Our code is available at https://github.com/xfflzl/CIM-CIL. + + + + Behind the Scenes: Density Fields for Single View Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Wimbauer_Behind_the_Scenes_Density_Fields_for_Single_View_Reconstruction_CVPR_2023_paper.pdf + Inferring a meaningful geometric scene representation from a single image is a fundamental problem in computer vision. Approaches based on traditional depth map prediction can only reason about areas that are visible in the image. Currently, neural radiance fields (NeRFs) can capture true 3D including color, but are too complex to be generated from a single image. As an alternative, we propose to predict an implicit density field from a single image. It maps every location in the frustum of the image to volumetric density. By directly sampling color from the available views instead of storing color in the density field, our scene representation becomes significantly less complex compared to NeRFs, and a neural network can predict it in a single forward pass. The network is trained through self-supervision from only video data. Our formulation allows volume rendering to perform both depth prediction and novel view synthesis. Through experiments, we show that our method is able to predict meaningful geometry for regions that are occluded in the input image. 
Additionally, we demonstrate the potential of our approach on three datasets for depth prediction and novel-view synthesis. + + + + StyleGAN Salon: Multi-View Latent Optimization for Pose-Invariant Hairstyle Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Khwanmuang_StyleGAN_Salon_Multi-View_Latent_Optimization_for_Pose-Invariant_Hairstyle_Transfer_CVPR_2023_paper.pdf + Our paper seeks to transfer the hairstyle of a reference image to an input photo for virtual hair try-on. We target a variety of challenges scenarios, such as transforming a long hairstyle with bangs to a pixie cut, which requires removing the existing hair and inferring how the forehead would look, or transferring partially visible hair from a hat-wearing person in a different pose. Past solutions leverage StyleGAN for hallucinating any missing parts and producing a seamless face-hair composite through so-called GAN inversion or projection. However, there remains a challenge in controlling the hallucinations to accurately transfer hairstyle and preserve the face shape and identity of the input. To overcome this, we propose a multi-view optimization framework that uses "two different views" of reference composites to semantically guide occluded or ambiguous regions. Our optimization shares information between two poses, which allows us to produce high fidelity and realistic results from incomplete references. Our framework produces high-quality results and outperforms prior work in a user study that consists of significantly more challenging hair transfer scenarios than previously studied. Project page: https://stylegan-salon.github.io/. + + + + Resource-Efficient RGBD Aerial Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Resource-Efficient_RGBD_Aerial_Tracking_CVPR_2023_paper.pdf + Aerial robots are now able to fly in complex environments, and drone-captured data gains lots of attention in object tracking. However, current research on aerial perception has mainly focused on limited categories, such as pedestrian or vehicle, and most scenes are captured in urban environments from a birds-eye view. Recently, UAVs equipped with depth cameras have been also deployed for more complex applications, while RGBD aerial tracking is still unexplored. Compared with traditional RGB object tracking, adding depth information can more effectively deal with more challenging scenes such as target and background interference. To this end, in this paper, we explore RGBD aerial tracking in an overhead space, which can greatly enlarge the development of drone-based visual perception. To boost the research, we first propose a large-scale benchmark for RGBD aerial tracking, containing 1,000 drone-captured RGBD videos with dense annotations. Then, as drone-based applications require for real-time processing with limited computational resources, we also propose an efficient RGBD tracker named EMT. Our tracker runs at over 100 fps on GPU, and 25 fps on the edge platform of NVidia Jetson NX Xavier, benefiting from its efficient multimodal fusion and feature matching. Extensive experiments show that our EMT achieves promising tracking performance. All resources are available at https://github.com/yjybuaa/RGBDAerialTracking. + + + + Bilateral Memory Consolidation for Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Nie_Bilateral_Memory_Consolidation_for_Continual_Learning_CVPR_2023_paper.pdf + Humans are proficient at continuously acquiring and integrating new knowledge. 
By contrast, deep models forget catastrophically, especially when tackling highly long task sequences. Inspired by the way our brains constantly rewrite and consolidate past recollections, we propose a novel Bilateral Memory Consolidation (BiMeCo) framework that focuses on enhancing memory interaction capabilities. Specifically, BiMeCo explicitly decouples model parameters into short-term memory module and long-term memory module, responsible for representation ability of the model and generalization over all learned tasks, respectively. BiMeCo encourages dynamic interactions between two memory modules by knowledge distillation and momentum-based updating for forming generic knowledge to prevent forgetting. The proposed BiMeCo is parameter-efficient and can be integrated into existing methods seamlessly. Extensive experiments on challenging benchmarks show that BiMeCo significantly improves the performance of existing continual learning methods. For example, combined with the state-of-the-art method CwD, BiMeCo brings in significant gains of around 2% to 6% while using 2x fewer parameters on CIFAR-100 under ResNet-18. + + + + Search-Map-Search: A Frame Selection Paradigm for Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Search-Map-Search_A_Frame_Selection_Paradigm_for_Action_Recognition_CVPR_2023_paper.pdf + Despite the success of deep learning in video understanding tasks, processing every frame in a video is computationally expensive and often unnecessary in real-time applications. Frame selection aims to extract the most informative and representative frames to help a model better understand video content. Existing frame selection methods either individually sample frames based on per-frame importance prediction, without considering interaction among frames, or adopt reinforcement learning agents to find representative frames in succession, which are costly to train and may lead to potential stability issues. To overcome the limitations of existing methods, we propose a Search-Map-Search learning paradigm which combines the advantages of heuristic search and supervised learning to select the best combination of frames from a video as one entity. By combining search with learning, the proposed method can better capture frame interactions while incurring a low inference overhead. Specifically, we first propose a hierarchical search method conducted on each training video to search for the optimal combination of frames with the lowest error on the downstream task. A feature mapping function is then learned to map the frames of a video to the representation of its target optimal frame combination. During inference, another search is performed on an unseen video to select a combination of frames whose feature representation is close to the projected feature representation. Extensive experiments based on several action recognition benchmarks demonstrate that our frame selection method effectively improves performance of action recognition models, and significantly outperforms a number of competitive baselines. + + + + Uncovering the Missing Pattern: Unified Framework Towards Trajectory Imputation and Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Uncovering_the_Missing_Pattern_Unified_Framework_Towards_Trajectory_Imputation_and_CVPR_2023_paper.pdf + Trajectory prediction is a crucial undertaking in understanding entity movement or human behavior from observed sequences. 
However, current methods often assume that the observed sequences are complete while ignoring the potential for missing values caused by object occlusion, scope limitation, sensor failure, etc. This limitation inevitably hinders the accuracy of trajectory prediction. To address this issue, our paper presents a unified framework, the Graph-based Conditional Variational Recurrent Neural Network (GC-VRNN), which can perform trajectory imputation and prediction simultaneously. Specifically, we introduce a novel Multi-Space Graph Neural Network (MS-GNN) that can extract spatial features from incomplete observations and leverage missing patterns. Additionally, we employ a Conditional VRNN with a specifically designed Temporal Decay (TD) module to capture temporal dependencies and temporal missing patterns in incomplete trajectories. The inclusion of the TD module allows for valuable information to be conveyed through the temporal flow. We also curate and benchmark three practical datasets for the joint problem of trajectory imputation and prediction. Extensive experiments verify the exceptional performance of our proposed method. As far as we know, this is the first work to address the lack of benchmarks and techniques for trajectory imputation and prediction in a unified manner. + + + + FlexiViT: One Model for All Patch Sizes + http://openaccess.thecvf.com//content/CVPR2023/papers/Beyer_FlexiViT_One_Model_for_All_Patch_Sizes_CVPR_2023_paper.pdf + Vision Transformers convert images to sequences by slicing them into patches. The size of these patches controls a speed/accuracy tradeoff, with smaller patches leading to higher accuracy at greater computational cost, but changing the patch size typically requires retraining the model. In this paper, we demonstrate that simply randomizing the patch size at training time leads to a single set of weights that performs well across a wide range of patch sizes, making it possible to tailor the model to different compute budgets at deployment time. We extensively evaluate the resulting model, which we call FlexiViT, on a wide range of tasks, including classification, image-text retrieval, openworld detection, panoptic segmentation, and semantic segmentation, concluding that it usually matches, and sometimes outperforms, standard ViT models trained at a single patch size in an otherwise identical setup. Hence, FlexiViT training is a simple drop-in improvement for ViT that makes it easy to add compute-adaptive capabilities to most models relying on a ViT backbone architecture. Code and pretrained models are available at github.com/googleresearch/big_vision. + + + + Structured Kernel Estimation for Photon-Limited Deconvolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Sanghvi_Structured_Kernel_Estimation_for_Photon-Limited_Deconvolution_CVPR_2023_paper.pdf + Images taken in a low light condition with the presence of camera shake suffer from motion blur and photon shot noise. While state-of-the-art image restoration networks show promising results, they are largely limited to well-illuminated scenes and their performance drops significantly when photon shot noise is strong. In this paper, we propose a new blur estimation technique customized for photon-limited conditions. The proposed method employs a gradient-based backpropagation method to estimate the blur kernel. 
By modeling the blur kernel using a low-dimensional representation with the key points on the motion trajectory, we significantly reduce the search space and improve the regularity of the kernel estimation problem. When plugged into an iterative framework, our novel low-dimensional representation provides improved kernel estimates and hence significantly better deconvolution performance when compared to end-to-end trained neural networks. + + + + Frame Interpolation Transformer and Uncertainty Guidance + http://openaccess.thecvf.com//content/CVPR2023/papers/Plack_Frame_Interpolation_Transformer_and_Uncertainty_Guidance_CVPR_2023_paper.pdf + Video frame interpolation has seen important progress in recent years, thanks to developments in several directions. Some works leverage better optical flow methods with improved splatting strategies or additional cues from depth, while others have investigated alternative approaches through direct predictions or transformers. Still, the problem remains unsolved in more challenging conditions such as complex lighting or large motion. In this work, we are bridging the gap towards video production with a novel transformer-based interpolation network architecture capable of estimating the expected error together with the interpolated frame. This offers several advantages that are of key importance for frame interpolation usage: First, we obtained improved visual quality over several datasets. The improvement in terms of quality is also clearly demonstrated through a user study. Second, our method estimates error maps for the interpolated frame, which are essential for real-life applications on longer video sequences where problematic frames need to be flagged. Finally, for rendered content a partial rendering pass of the intermediate frame, guided by the predicted error, can be utilized during the interpolation to generate a new frame of superior quality. Through this error estimation, our method can produce even higher-quality intermediate frames using only a fraction of the time compared to a full rendering. + + + + Neural Preset for Color Style Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Ke_Neural_Preset_for_Color_Style_Transfer_CVPR_2023_paper.pdf + In this paper, we present a Neural Preset technique to address the limitations of existing color style transfer methods, including visual artifacts, vast memory requirement, and slow style switching speed. Our method is based on two core designs. First, we propose Deterministic Neural Color Mapping (DNCM) to consistently operate on each pixel via an image-adaptive color mapping matrix, avoiding artifacts and supporting high-resolution inputs with a small memory footprint. Second, we develop a two-stage pipeline by dividing the task into color normalization and stylization, which allows efficient style switching by extracting color styles as presets and reusing them on normalized input images. Due to the unavailability of pairwise datasets, we describe how to train Neural Preset via a self-supervised strategy. Various advantages of Neural Preset over existing methods are demonstrated through comprehensive evaluations. Besides, we show that our trained model can naturally support multiple applications without fine-tuning, including low-light image enhancement, underwater image correction, image dehazing, and image harmonization. 
+ + + + Wavelet Diffusion Models Are Fast and Scalable Image Generators + http://openaccess.thecvf.com//content/CVPR2023/papers/Phung_Wavelet_Diffusion_Models_Are_Fast_and_Scalable_Image_Generators_CVPR_2023_paper.pdf + Diffusion models are rising as a powerful solution for high-fidelity image generation, which exceeds GANs in quality in many circumstances. However, their slow training and inference speed is a huge bottleneck, blocking them from being used in real-time applications. A recent DiffusionGAN method significantly decreases the models' running time by reducing the number of sampling steps from thousands to several, but their speeds still largely lag behind the GAN counterparts. This paper aims to reduce the speed gap by proposing a novel wavelet-based diffusion scheme. We extract low-and-high frequency components from both image and feature levels via wavelet decomposition and adaptively handle these components for faster processing while maintaining good generation quality. Furthermore, we propose to use a reconstruction term, which effectively boosts the model training convergence. Experimental results on CelebA-HQ, CIFAR-10, LSUN-Church, and STL-10 datasets prove our solution is a stepping-stone to offering real-time and high-fidelity diffusion models. Our code and pre-trained checkpoints are available at https://github.com/VinAIResearch/WaveDiff.git. + + + + PA&DA: Jointly Sampling Path and Data for Consistent NAS + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_PADA_Jointly_Sampling_Path_and_Data_for_Consistent_NAS_CVPR_2023_paper.pdf + Based on the weight-sharing mechanism, one-shot NAS methods train a supernet and then inherit the pre-trained weights to evaluate sub-models, largely reducing the search cost. However, several works have pointed out that the shared weights suffer from different gradient descent directions during training. And we further find that large gradient variance occurs during supernet training, which degrades the supernet ranking consistency. To mitigate this issue, we propose to explicitly minimize the gradient variance of the supernet training by jointly optimizing the sampling distributions of PAth and DAta (PA&DA). We theoretically derive the relationship between the gradient variance and the sampling distributions, and reveal that the optimal sampling probability is proportional to the normalized gradient norm of path and training data. Hence, we use the normalized gradient norm as the importance indicator for path and training data, and adopt an importance sampling strategy for the supernet training. Our method only requires negligible computation cost for optimizing the sampling distributions of path and data, but achieves lower gradient variance during supernet training and better generalization performance for the supernet, resulting in a more consistent NAS. We conduct comprehensive comparisons with other improved approaches in various search spaces. Results show that our method surpasses others with more reliable ranking performance and higher accuracy of searched architectures, showing the effectiveness of our method. Code is available at https://github.com/ShunLu91/PA-DA. 
+ + + + 3D Spatial Multimodal Knowledge Accumulation for Scene Graph Prediction in Point Cloud + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_3D_Spatial_Multimodal_Knowledge_Accumulation_for_Scene_Graph_Prediction_in_CVPR_2023_paper.pdf + In-depth understanding of a 3D scene not only involves locating/recognizing individual objects, but also requires to infer the relationships and interactions among them. However, since 3D scenes contain partially scanned objects with physical connections, dense placement, changing sizes, and a wide variety of challenging relationships, existing methods perform quite poorly with limited training samples. In this work, we find that the inherently hierarchical structures of physical space in 3D scenes aid in the automatic association of semantic and spatial arrangements, specifying clear patterns and leading to less ambiguous predictions. Thus, they well meet the challenges due to the rich variations within scene categories. To achieve this, we explicitly unify these structural cues of 3D physical spaces into deep neural networks to facilitate scene graph prediction. Specifically, we exploit an external knowledge base as a baseline to accumulate both contextualized visual content and textual facts to form a 3D spatial multimodal knowledge graph. Moreover, we propose a knowledge-enabled scene graph prediction module benefiting from the 3D spatial knowledge to effectively regularize semantic space of relationships. Extensive experiments demonstrate the superiority of the proposed method over current state-of-the-art competitors. Our code is available at https://github.com/HHrEtvP/SMKA. + + + + ViTs for SITS: Vision Transformers for Satellite Image Time Series + http://openaccess.thecvf.com//content/CVPR2023/papers/Tarasiou_ViTs_for_SITS_Vision_Transformers_for_Satellite_Image_Time_Series_CVPR_2023_paper.pdf + In this paper we introduce the Temporo-Spatial Vision Transformer (TSViT), a fully-attentional model for general Satellite Image Time Series (SITS) processing based on the Vision Transformer (ViT). TSViT splits a SITS record into non-overlapping patches in space and time which are tokenized and subsequently processed by a factorized temporo-spatial encoder. We argue, that in contrast to natural images, a temporal-then-spatial factorization is more intuitive for SITS processing and present experimental evidence for this claim. Additionally, we enhance the model's discriminative power by introducing two novel mechanisms for acquisition-time-specific temporal positional encodings and multiple learnable class tokens. The effect of all novel design choices is evaluated through an extensive ablation study. Our proposed architecture achieves state-of-the-art performance, surpassing previous approaches by a significant margin in three publicly available SITS semantic segmentation and classification datasets. All model, training and evaluation codes can be found at https://github.com/michaeltrs/DeepSatModels. + + + + Prompt, Generate, Then Cache: Cascade of Foundation Models Makes Strong Few-Shot Learners + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Prompt_Generate_Then_Cache_Cascade_of_Foundation_Models_Makes_Strong_CVPR_2023_paper.pdf + Visual recognition in low-data regimes requires deep neural networks to learn generalized representations from limited training samples. Recently, CLIP-based methods have shown promising few-shot performance benefited from the contrastive language-image pre-training. 
We then question if the more diverse pre-training knowledge can be cascaded to further assist few-shot representation learning. In this paper, we propose CaFo, a Cascade of Foundation models that incorporates diverse prior knowledge of various pre-training paradigms for better few-shot learning. Our CaFo incorporates CLIP's language-contrastive knowledge, DINO's vision-contrastive knowledge, DALL-E's vision generative knowledge, and GPT-3's language-generative knowledge. Specifically, CaFo works by 'Prompt, Generate, then Cache'. Firstly, we leverage GPT-3 to produce textual inputs for prompting CLIP with rich downstream linguistic semantics. Then, we generate synthetic images via DALL-E to expand the few-shot training data without any manpower. At last, we introduce a learnable cache model to adaptively blend the predictions from CLIP and DINO. By such collaboration, CaFo can fully unleash the potential of different pre-training methods and unify them to perform state-of-the-art for few-shot classification. Code is available at https://github.com/ZrrSkywalker/CaFo. + + + + VideoMAE V2: Scaling Video Masked Autoencoders With Dual Masking + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_VideoMAE_V2_Scaling_Video_Masked_Autoencoders_With_Dual_Masking_CVPR_2023_paper.pdf + Scale is the primary factor for building a powerful foundation model that could well generalize to a variety of downstream tasks. However, it is still challenging to train video foundation models with billions of parameters. This paper shows that video masked autoencoder (VideoMAE) is a scalable and general self-supervised pre-trainer for building video foundation models. We scale the VideoMAE in both model and data with a core design. Specifically, we present a dual masking strategy for efficient pre-training, with an encoder operating on a subset of video tokens and a decoder processing another subset of video tokens. Although VideoMAE is very efficient due to the high masking ratio in the encoder, masking the decoder can still further reduce the overall computational cost. This enables the efficient pre-training of billion-level models in video. We also introduce a progressive training paradigm that involves initial pre-training on the diverse multi-sourced unlabeled dataset, followed by fine-tuning on a mixed labeled dataset. Finally, we successfully train a video ViT model with a billion parameters, which achieves a new state-of-the-art performance on the datasets of Kinetics (90.0% on K400 and 89.9% on K600) and Something-Something (68.7% on V1 and 77.0% on V2). In addition, we extensively verify the pre-trained video ViT models on a variety of downstream tasks, demonstrating their effectiveness as a general video representation learner. + + + + Perception and Semantic Aware Regularization for Sequential Confidence Calibration + http://openaccess.thecvf.com//content/CVPR2023/papers/Peng_Perception_and_Semantic_Aware_Regularization_for_Sequential_Confidence_Calibration_CVPR_2023_paper.pdf + Deep sequence recognition (DSR) models receive increasing attention due to their superior performance in various applications. Most DSR models use merely the target sequences as supervision without considering other related sequences, leading to over-confidence in their predictions. The DSR models trained with label smoothing regularize labels by equally and independently smoothing each token, reallocating a small value to other tokens for mitigating overconfidence.
However, they do not consider token/sequence correlations that may provide more effective information to regularize training and thus lead to sub-optimal performance. In this work, we find that tokens/sequences with high perception and semantic correlations with the target ones contain more correlated and effective information and thus facilitate more effective regularization. To this end, we propose a Perception and Semantic aware Sequence Regularization framework, which explores perceptively and semantically correlated tokens/sequences as regularization. Specifically, we introduce a semantic context-free recognition and a language model to acquire similar sequences with high perceptive similarities and semantic correlation, respectively. Moreover, the degree of over-confidence varies across samples according to their difficulties. Thus, we further design an adaptive calibration intensity module to compute a difficulty score for each sample to obtain finer-grained regularization. Extensive experiments on canonical sequence recognition tasks, including scene text and speech recognition, demonstrate that our method sets novel state-of-the-art results. Code is available at https://github.com/husterpzh/PSSR. + + + + Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Vid2Seq_Large-Scale_Pretraining_of_a_Visual_Language_Model_for_Dense_CVPR_2023_paper.pdf + In this work, we introduce Vid2Seq, a multi-modal single-stage dense event captioning model pretrained on narrated videos which are readily-available at scale. The Vid2Seq architecture augments a language model with special time tokens, allowing it to seamlessly predict event boundaries and textual descriptions in the same output sequence. Such a unified model requires large-scale training data, which is not available in current annotated datasets. We show that it is possible to leverage unlabeled narrated videos for dense video captioning, by reformulating sentence boundaries of transcribed speech as pseudo event boundaries, and using the transcribed speech sentences as pseudo event captions. The resulting Vid2Seq model pretrained on the YT-Temporal-1B dataset improves the state of the art on a variety of dense video captioning benchmarks including YouCook2, ViTT and ActivityNet Captions. Vid2Seq also generalizes well to the tasks of video paragraph captioning and video clip captioning, and to few-shot settings. Our code is publicly available at https://antoyang.github.io/vid2seq.html. + + + + ERNIE-ViLG 2.0: Improving Text-to-Image Diffusion Model With Knowledge-Enhanced Mixture-of-Denoising-Experts + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_ERNIE-ViLG_2.0_Improving_Text-to-Image_Diffusion_Model_With_Knowledge-Enhanced_Mixture-of-Denoising-Experts_CVPR_2023_paper.pdf + Recent progress in diffusion models has revolutionized the popular technology of text-to-image generation. While existing approaches could produce photorealistic high-resolution images with text conditions, there are still several open problems to be solved, which limits the further improvement of image fidelity and text relevancy. In this paper, we propose ERNIE-ViLG 2.0, a large-scale Chinese text-to-image diffusion model, to progressively upgrade the quality of generated images by: (1) incorporating fine-grained textual and visual knowledge of key elements in the scene, and (2) utilizing different denoising experts at different denoising stages.
With the proposed mechanisms, ERNIE-ViLG 2.0 not only achieves a new state-of-the-art on MS-COCO with zero-shot FID-30k score of 6.75, but also significantly outperforms recent models in terms of image fidelity and image-text alignment, with side-by-side human evaluation on the bilingual prompt set ViLG-300. + + + + Revisiting the Stack-Based Inverse Tone Mapping + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Revisiting_the_Stack-Based_Inverse_Tone_Mapping_CVPR_2023_paper.pdf + Current stack-based inverse tone mapping (ITM) methods can recover high dynamic range (HDR) radiance by predicting a set of multi-exposure images from a single low dynamic range image. However, there are still some limitations. On the one hand, these methods estimate a fixed number of images (e.g., three exposure-up and three exposure-down), which may introduce unnecessary computational cost or reconstruct incorrect results. On the other hand, they neglect the connections between the up-exposure and down-exposure models and thus fail to fully excavate effective features. In this paper, we revisit the stack-based ITM approaches and propose a novel method to reconstruct HDR radiance from a single image, which only needs to estimate two exposure images. At first, we design the exposure adaptive block that can adaptively adjust the exposure based on the luminance distribution of the input image. Secondly, we devise the cross-model attention block to connect the exposure adjustment models. Thirdly, we propose an end-to-end ITM pipeline by incorporating the multi-exposure fusion model. Furthermore, we propose and open a multi-exposure dataset that indicates the optimal exposure-up/down levels. Experimental results show that the proposed method outperforms some state-of-the-art methods. + + + + Exploiting Completeness and Uncertainty of Pseudo Labels for Weakly Supervised Video Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Exploiting_Completeness_and_Uncertainty_of_Pseudo_Labels_for_Weakly_Supervised_CVPR_2023_paper.pdf + Weakly supervised video anomaly detection aims to identify abnormal events in videos using only video-level labels. Recently, two-stage self-training methods have achieved significant improvements by self-generating pseudo labels and self-refining anomaly scores with these labels. As the pseudo labels play a crucial role, we propose an enhancement framework by exploiting completeness and uncertainty properties for effective self-training. Specifically, we first design a multi-head classification module (each head serves as a classifier) with a diversity loss to maximize the distribution differences of predicted pseudo labels across heads. This encourages the generated pseudo labels to cover as many abnormal events as possible. We then devise an iterative uncertainty pseudo label refinement strategy, which improves not only the initial pseudo labels but also the updated ones obtained by the desired classifier in the second stage. Extensive experimental results demonstrate the proposed method performs favorably against state-of-the-art approaches on the UCF-Crime, TAD, and XD-Violence benchmark datasets. + + + + Full or Weak Annotations? An Adaptive Strategy for Budget-Constrained Annotation Campaigns + http://openaccess.thecvf.com//content/CVPR2023/papers/Tejero_Full_or_Weak_Annotations_An_Adaptive_Strategy_for_Budget-Constrained_Annotation_CVPR_2023_paper.pdf + Annotating new datasets for machine learning tasks is tedious, time-consuming, and costly. 
For segmentation applications, the burden is particularly high as manual delineations of relevant image content are often extremely expensive or can only be done by experts with domain-specific knowledge. Thanks to developments in transfer learning and training with weak supervision, segmentation models can now also greatly benefit from annotations of different kinds. However, for any new domain application looking to use weak supervision, the dataset builder still needs to define a strategy to distribute full segmentation and other weak annotations. Doing so is challenging, however, as it is a priori unknown how to distribute an annotation budget for a given new dataset. To this end, we propose a novel approach to determine annotation strategies for segmentation datasets, whereby estimating what proportion of segmentation and classification annotations should be collected given a fixed budget. To do so, our method sequentially determines proportions of segmentation and classification annotations to collect for budget-fractions by modeling the expected improvement of the final segmentation model. We show in our experiments that our approach yields annotations that perform very close to the optimal for a number of different annotation budgets and datasets. + + + + Backdoor Defense via Deconfounded Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Backdoor_Defense_via_Deconfounded_Representation_Learning_CVPR_2023_paper.pdf + Deep neural networks (DNNs) are recently shown to be vulnerable to backdoor attacks, where attackers embed hidden backdoors in the DNN model by injecting a few poisoned examples into the training dataset. While extensive efforts have been made to detect and remove backdoors from backdoored DNNs, it is still not clear whether a backdoor-free clean model can be directly obtained from poisoned datasets. In this paper, we first construct a causal graph to model the generation process of poisoned data and find that the backdoor attack acts as the confounder, which brings spurious associations between the input images and target labels, making the model predictions less reliable. Inspired by the causal understanding, we propose the Causality-inspired Backdoor Defense (CBD), to learn deconfounded representations by employing the front-door adjustment. Specifically, a backdoored model is intentionally trained to capture the confounding effects. The other clean model dedicates to capturing the desired causal effects by minimizing the mutual information with the confounding representations from the backdoored model and employing a sample-wise re-weighting scheme. Extensive experiments on multiple benchmark datasets against 6 state-of-the-art attacks verify that our proposed defense method is effective in reducing backdoor threats while maintaining high accuracy in predicting benign samples. Further analysis shows that CBD can also resist potential adaptive attacks. + + + + HairStep: Transfer Synthetic to Real Using Strand and Depth Maps for Single-View 3D Hair Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_HairStep_Transfer_Synthetic_to_Real_Using_Strand_and_Depth_Maps_CVPR_2023_paper.pdf + In this work, we tackle the challenging problem of learning-based single-view 3D hair modeling. Due to the great difficulty of collecting paired real image and 3D hair data, using synthetic data to provide prior knowledge for real domain becomes a leading solution. This unfortunately introduces the challenge of domain gap. 
Due to the inherent difficulty of realistic hair rendering, existing methods typically use orientation maps instead of hair images as input to bridge the gap. We firmly think an intermediate representation is essential, but we argue that orientation map using the dominant filtering-based methods is sensitive to uncertain noise and far from a competent representation. Thus, we first raise this issue up and propose a novel intermediate representation, termed as HairStep, which consists of a strand map and a depth map. It is found that HairStep not only provides sufficient information for accurate 3D hair modeling, but also is feasible to be inferred from real images. Specifically, we collect a dataset of 1,250 portrait images with two types of annotations. A learning framework is further designed to transfer real images to the strand map and depth map. It is noted that, an extra bonus of our new dataset is the first quantitative metric for 3D hair modeling. Our experiments show that HairStep narrows the domain gap between synthetic and real and achieves state-of-the-art performance on single-view 3D hair reconstruction. + + + + MoDAR: Using Motion Forecasting for 3D Object Detection in Point Cloud Sequences + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_MoDAR_Using_Motion_Forecasting_for_3D_Object_Detection_in_Point_CVPR_2023_paper.pdf + Occluded and long-range objects are ubiquitous and challenging for 3D object detection. Point cloud sequence data provide unique opportunities to improve such cases, as an occluded or distant object can be observed from different viewpoints or gets better visibility over time. However, the efficiency and effectiveness in encoding long-term sequence data can still be improved. In this work, we propose MoDAR, using motion forecasting outputs as a type of virtual modality, to augment LiDAR point clouds. The MoDAR modality propagates object information from temporal contexts to a target frame, represented as a set of virtual points, one for each object from a waypoint on a forecasted trajectory. A fused point cloud of both raw sensor points and the virtual points can then be fed to any off-the-shelf point-cloud based 3D object detector. Evaluated on the Waymo Open Dataset, our method significantly improves prior art detectors by using motion forecasting from extra-long sequences (e.g. 18 seconds), achieving new state of the arts, while not adding much computation overhead. + + + + ALSO: Automotive Lidar Self-Supervision by Occupancy Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Boulch_ALSO_Automotive_Lidar_Self-Supervision_by_Occupancy_Estimation_CVPR_2023_paper.pdf + We propose a new self-supervised method for pre-training the backbone of deep perception models operating on point clouds. The core idea is to train the model on a pretext task which is the reconstruction of the surface on which the 3D points are sampled, and to use the underlying latent vectors as input to the perception head. The intuition is that if the network is able to reconstruct the scene surface, given only sparse input points, then it probably also captures some fragments of semantic information, that can be used to boost an actual perception task. This principle has a very simple formulation, which makes it both easy to implement and widely applicable to a large range of 3D sensors and deep networks performing semantic segmentation or object detection. 
In fact, it supports a single-stream pipeline, as opposed to most contrastive learning approaches, allowing training on limited resources. We conducted extensive experiments on various autonomous driving datasets, involving very different kinds of lidars, for both semantic segmentation and object detection. The results show the effectiveness of our method to learn useful representations without any annotation, compared to existing approaches. The code is available at github.com/valeoai/ALSO + + + + Learning Dynamic Style Kernels for Artistic Style Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Learning_Dynamic_Style_Kernels_for_Artistic_Style_Transfer_CVPR_2023_paper.pdf + Arbitrary style transfer has been demonstrated to be efficient in artistic image generation. Previous methods either globally modulate the content feature ignoring local details, or overly focus on the local structure details leading to style leakage. In contrast to the literature, we propose a new scheme "style kernel" that learns spatially adaptive kernel for per-pixel stylization, where the convolutional kernels are dynamically generated from the global style-content aligned feature and then the learned kernels are applied to modulate the content feature at each spatial position. This new scheme allows flexible both global and local interactions between the content and style features such that the wanted styles can be easily transferred to the content image while at the same time the content structure can be easily preserved. To further enhance the flexibility of our style transfer method, we propose a Style Alignment Encoding (SAE) module complemented with a Content-based Gating Modulation (CGM) module for learning the dynamic style kernels in focusing regions. Extensive experiments strongly demonstrate that our proposed method outperforms state-of-the-art methods and exhibits superior performance in terms of visual quality and efficiency. + + + + Chat2Map: Efficient Scene Mapping From Multi-Ego Conversations + http://openaccess.thecvf.com//content/CVPR2023/papers/Majumder_Chat2Map_Efficient_Scene_Mapping_From_Multi-Ego_Conversations_CVPR_2023_paper.pdf + Can conversational videos captured from multiple egocentric viewpoints reveal the map of a scene in a cost-efficient way? We seek to answer this question by proposing a new problem: efficiently building the map of a previously unseen 3D environment by exploiting shared information in the egocentric audio-visual observations of participants in a natural conversation. Our hypothesis is that as multiple people ("egos") move in a scene and talk among themselves, they receive rich audio-visual cues that can help uncover the unseen areas of the scene. Given the high cost of continuously processing egocentric visual streams, we further explore how to actively coordinate the sampling of visual information, so as to minimize redundancy and reduce power use. To that end, we present an audio-visual deep reinforcement learning approach that works with our shared scene mapper to selectively turn on the camera to efficiently chart out the space. We evaluate the approach using a state-of-the-art audio-visual simulator for 3D scenes as well as real-world video. Our model outperforms previous state-of-the-art mapping methods, and achieves an excellent cost-accuracy tradeoff. Project: https://vision.cs.utexas.edu/projects/chat2map. 
+ + + + GeoMAE: Masked Geometric Target Prediction for Self-Supervised Point Cloud Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_GeoMAE_Masked_Geometric_Target_Prediction_for_Self-Supervised_Point_Cloud_Pre-Training_CVPR_2023_paper.pdf + This paper tries to address a fundamental question in point cloud self-supervised learning: what is a good signal we should leverage to learn features from point clouds without annotations? To answer that, we introduce a point cloud representation learning framework, based on geometric feature reconstruction. In contrast to recent papers that directly adopt masked autoencoder (MAE) and only predict original coordinates or occupancy from masked point clouds, our method revisits differences between images and point clouds and identifies three self-supervised learning objectives peculiar to point clouds, namely centroid prediction, normal estimation, and curvature prediction. Combined, these three objectives yield a nontrivial self-supervised learning task and mutually facilitate models to better reason about the fine-grained geometry of point clouds. Our pipeline is conceptually simple and it consists of two major steps: first, it randomly masks out groups of points, followed by a Transformer-based point cloud encoder; second, a lightweight Transformer decoder predicts centroid, normal, and curvature for points in each voxel. We transfer the pre-trained Transformer encoder to a downstream perception model. On the nuScenes Dataset, our model achieves 3.38 mAP improvement for object detection, 2.1 mIoU gain for segmentation, and 1.7 AMOTA gain for multi-object tracking. We also conduct experiments on the Waymo Open Dataset and achieve significant performance improvements over baselines as well. + + + + Learning Conditional Attributes for Compositional Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Learning_Conditional_Attributes_for_Compositional_Zero-Shot_Learning_CVPR_2023_paper.pdf + Compositional Zero-Shot Learning (CZSL) aims to train models to recognize novel compositional concepts based on learned concepts such as attribute-object combinations. One of the challenges is to model attributes that interact with different objects, e.g., the attribute "wet" in "wet apple" and "wet cat" is different. As a solution, we provide analysis and argue that attributes are conditioned on the recognized object and input image and explore learning conditional attribute embeddings by a proposed attribute learning framework containing an attribute hyper learner and an attribute base learner. By encoding conditional attributes, our model is able to generate flexible attribute embeddings for generalization from seen to unseen compositions. Experiments on CZSL benchmarks, including the more challenging C-GQA dataset, demonstrate better performances compared with other state-of-the-art approaches and validate the importance of learning conditional attributes. + + + + Complete 3D Human Reconstruction From a Single Incomplete Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Complete_3D_Human_Reconstruction_From_a_Single_Incomplete_Image_CVPR_2023_paper.pdf + This paper presents a method to reconstruct a complete human geometry and texture from an image of a person with only a partial body observed, e.g., a torso.
The core challenge arises from the occlusion: there exists no pixel to reconstruct, and many existing single-view human reconstruction methods are not designed to handle such invisible parts, leading to missing data in 3D. To address this challenge, we introduce a novel coarse-to-fine human reconstruction framework. For coarse reconstruction, explicit volumetric features are learned to generate a complete human geometry with 3D convolutional neural networks conditioned by a 3D body model and the style features from visible parts. An implicit network combines the learned 3D features with the high-quality surface normals enhanced from multiview to produce fine local details, e.g., high-frequency wrinkles. Finally, we perform progressive texture inpainting to reconstruct a complete appearance of the person in a view-consistent way, which is not possible without the reconstruction of a complete geometry. In experiments, we demonstrate that our method can reconstruct high-quality 3D humans and is robust to occlusion. + + + + PVT-SSD: Single-Stage 3D Object Detector With Point-Voxel Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_PVT-SSD_Single-Stage_3D_Object_Detector_With_Point-Voxel_Transformer_CVPR_2023_paper.pdf + Recent Transformer-based 3D object detectors learn point cloud features either from point- or voxel-based representations. However, the former requires time-consuming sampling while the latter introduces quantization errors. In this paper, we present a novel Point-Voxel Transformer for single-stage 3D detection (PVT-SSD) that takes advantage of these two representations. Specifically, we first use voxel-based sparse convolutions for efficient feature encoding. Then, we propose a Point-Voxel Transformer (PVT) module that obtains long-range contexts in a cheap manner from voxels while attaining accurate positions from points. The key to associating the two different representations is our introduced input-dependent Query Initialization module, which could efficiently generate reference points and content queries. Then, PVT adaptively fuses long-range contextual and local geometric information around reference points into content queries. Further, to quickly find the neighboring points of reference points, we design the Virtual Range Image module, which generalizes the native range image to multi-sensor and multi-frame. The experiments on several autonomous driving benchmarks verify the effectiveness and efficiency of the proposed method. Code will be available. + + + + Adaptive Human Matting for Dynamic Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Adaptive_Human_Matting_for_Dynamic_Videos_CVPR_2023_paper.pdf + The most recent efforts in video matting have focused on eliminating trimap dependency since trimap annotations are expensive and trimap-based methods are less adaptable for real-time applications. Despite the latest trimap-free methods showing promising results, their performance often degrades when dealing with highly diverse and unstructured videos. We address this limitation by introducing Adaptive Matting for Dynamic Videos, termed AdaM, which is a framework designed for simultaneously differentiating foregrounds from backgrounds and capturing alpha matte details of human subjects in the foreground.
Two interconnected network designs are employed to achieve this goal: (1) an encoder-decoder network that produces alpha mattes and intermediate masks which are used to guide the transformer in adaptively decoding foregrounds and backgrounds, and (2) a transformer network in which long- and short-term attention combine to retain spatial and temporal contexts, facilitating the decoding of foreground details. We benchmark and study our methods on recently introduced datasets, showing that our model notably improves matting realism and temporal coherence in complex real-world videos and achieves new best-in-class generalizability. Further details and examples are available at https://github.com/microsoft/AdaM. + + + + Learning Common Rationale To Improve Self-Supervised Representation for Fine-Grained Visual Recognition Problems + http://openaccess.thecvf.com//content/CVPR2023/papers/Shu_Learning_Common_Rationale_To_Improve_Self-Supervised_Representation_for_Fine-Grained_Visual_CVPR_2023_paper.pdf + Self-supervised learning (SSL) strategies have demonstrated remarkable performance in various recognition tasks. However, both our preliminary investigation and recent studies suggest that they may be less effective in learning representations for fine-grained visual recognition (FGVR) since many features helpful for optimizing SSL objectives are not suitable for characterizing the subtle differences in FGVR. To overcome this issue, we propose learning an additional screening mechanism to identify discriminative clues commonly seen across instances and classes, dubbed as common rationales in this paper. Intuitively, common rationales tend to correspond to the discriminative patterns from the key parts of foreground objects. We show that a common rationale detector can be learned by simply exploiting the GradCAM induced from the SSL objective without using any pre-trained object parts or saliency detectors, making it seamlessly to be integrated with the existing SSL process. Specifically, we fit the GradCAM with a branch with limited fitting capacity, which allows the branch to capture the common rationales and discard the less common discriminative patterns. At the test stage, the branch generates a set of spatial weights to selectively aggregate features representing an instance. Extensive experimental results on four visual tasks demonstrate that the proposed method can lead to a significant improvement in different evaluation settings. + + + + High-Fidelity 3D Human Digitization From Single 2K Resolution Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_High-Fidelity_3D_Human_Digitization_From_Single_2K_Resolution_Images_CVPR_2023_paper.pdf + High-quality 3D human body reconstruction requires high-fidelity and large-scale training data and appropriate network design that effectively exploits the high-resolution input images. To tackle these problems, we propose a simple yet effective 3D human digitization method called 2K2K, which constructs a large-scale 2K human dataset and infers 3D human models from 2K resolution images. The proposed method separately recovers the global shape of a human and its details. The low-resolution depth network predicts the global structure from a low-resolution image, and the part-wise image-to-normal network predicts the details of the 3D human body structure. The high-resolution depth network merges the global 3D shape and the detailed structures to infer the high-resolution front and back side depth maps. 
Finally, an off-the-shelf mesh generator reconstructs the full 3D human model, which is available at https://github.com/SangHunHan92/2K2K. In addition, we also provide 2,050 3D human models, including texture maps, 3D joints, and SMPL parameters for research purposes. In experiments, we demonstrate competitive performance over the recent works on various datasets. + + + + Fully Self-Supervised Depth Estimation From Defocus Clue + http://openaccess.thecvf.com//content/CVPR2023/papers/Si_Fully_Self-Supervised_Depth_Estimation_From_Defocus_Clue_CVPR_2023_paper.pdf + Depth-from-defocus (DFD), modeling the relationship between depth and defocus pattern in images, has demonstrated promising performance in depth estimation. Recently, several self-supervised works try to overcome the difficulties in acquiring accurate depth ground-truth. However, they depend on the all-in-focus (AIF) images, which cannot be captured in real-world scenarios. Such a limitation discourages the applications of DFD methods. To tackle this issue, we propose a completely self-supervised framework that estimates depth purely from a sparse focal stack. We show that our framework circumvents the need for the depth and AIF image ground-truth, and achieves superior predictions, thus closing the gap between the theoretical success of DFD works and their applications in the real world. In particular, we propose (i) a more realistic setting for DFD tasks, where no depth or AIF image ground-truth is available; (ii) a novel self-supervision framework that provides reliable predictions of depth and AIF image under the challenging setting. The proposed framework uses a neural model to predict the depth and AIF image, and utilizes an optical model to validate and refine the prediction. We verify our framework on three benchmark datasets with rendered focal stacks and real focal stacks. Qualitative and quantitative evaluations show that our method provides a strong baseline for self-supervised DFD tasks. The source code is publicly available at https://github.com/Ehzoahis/DEReD. + + + + Prompting Large Language Models With Answer Heuristics for Knowledge-Based Visual Question Answering + http://openaccess.thecvf.com//content/CVPR2023/papers/Shao_Prompting_Large_Language_Models_With_Answer_Heuristics_for_Knowledge-Based_Visual_CVPR_2023_paper.pdf + Knowledge-based visual question answering (VQA) requires external knowledge beyond the image to answer the question. Early studies retrieve required knowledge from explicit knowledge bases (KBs), which often introduces irrelevant information to the question, hence restricting the performance of their models. Recent works have sought to use a large language model (i.e., GPT-3) as an implicit knowledge engine to acquire the necessary knowledge for answering. Despite the encouraging results achieved by these methods, we argue that they have not fully activated the capacity of GPT-3 as the provided input information is insufficient. In this paper, we present Prophet---a conceptually simple framework designed to prompt GPT-3 with answer heuristics for knowledge-based VQA. Specifically, we first train a vanilla VQA model on a specific knowledge-based VQA dataset without external knowledge. After that, we extract two types of complementary answer heuristics from the model: answer candidates and answer-aware examples. Finally, the two types of answer heuristics are encoded into the prompts to enable GPT-3 to better comprehend the task, thus enhancing its capacity.
Prophet significantly outperforms all existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, delivering 61.1% and 55.7% accuracies on their testing sets, respectively. + + + + Improving Robustness of Semantic Segmentation to Motion-Blur Using Class-Centric Augmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Aakanksha_Improving_Robustness_of_Semantic_Segmentation_to_Motion-Blur_Using_Class-Centric_Augmentation_CVPR_2023_paper.pdf + Semantic segmentation involves classifying each pixel into one of a pre-defined set of object/stuff classes. Such a fine-grained detection and localization of objects in the scene is challenging by itself. The complexity increases manifold in the presence of blur. With cameras becoming increasingly lightweight and compact, blur caused by motion during capture time has become unavoidable. Most research has focused on improving segmentation performance for sharp clean images, and the few works that deal with degradations consider motion-blur as one of many generic degradations. In this work, we focus exclusively on motion-blur and attempt to achieve robustness for semantic segmentation in its presence. Based on the observation that segmentation annotations can be used to generate synthetic space-variant blur, we propose a Class-Centric Motion-Blur Augmentation (CCMBA) strategy. Our approach involves randomly selecting a subset of semantic classes present in the image and using the segmentation map annotations to blur only the corresponding regions. This enables the network to simultaneously learn semantic segmentation for clean images, images with egomotion blur, as well as images with dynamic scene blur. We demonstrate the effectiveness of our approach for both CNN and Vision Transformer-based semantic segmentation networks on the PASCAL VOC and Cityscapes datasets. We also illustrate the improved generalizability of our method to complex real-world blur by evaluating on the commonly used deblurring datasets GoPro and REDS. + + + + Progressive Open Space Expansion for Open-Set Model Attribution + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Progressive_Open_Space_Expansion_for_Open-Set_Model_Attribution_CVPR_2023_paper.pdf + Despite the remarkable progress in generative technology, the Janus-faced issues of intellectual property protection and malicious content supervision have arisen. Efforts have been made to manage synthetic images by attributing them to a set of potential source models. However, the closed-set classification setting limits the application in real-world scenarios for handling content generated by arbitrary models. In this study, we focus on a challenging task, namely Open-Set Model Attribution (OSMA), to simultaneously attribute images to known models and identify those from unknown ones. Compared to existing open-set recognition (OSR) tasks focusing on semantic novelty, OSMA is more challenging as the distinction between images from known and unknown models may only lie in visually imperceptible traces. To this end, we propose a Progressive Open Space Expansion (POSE) solution, which simulates open-set samples that maintain the same semantics as closed-set samples but are embedded with different imperceptible traces. Guided by a diversity constraint, the open space is simulated progressively by a set of lightweight augmentation models.
We consider three real-world scenarios and construct an OSMA benchmark dataset, including unknown models trained with different random seeds, architectures, and datasets from known ones. Extensive experiments on the dataset demonstrate that POSE is superior to both existing model attribution methods and off-the-shelf OSR methods. + + + + Backdoor Cleansing With Unlabeled Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Pang_Backdoor_Cleansing_With_Unlabeled_Data_CVPR_2023_paper.pdf + Due to the increasing computational demand of Deep Neural Networks (DNNs), companies and organizations have begun to outsource the training process. However, the externally trained DNNs can potentially be backdoor-attacked. It is crucial to defend against such attacks, i.e., to postprocess a suspicious model so that its backdoor behavior is mitigated while its normal prediction power on clean inputs remains uncompromised. To remove the abnormal backdoor behavior, existing methods mostly rely on additional labeled clean samples. However, such a requirement may be unrealistic as the training data are often unavailable to end users. In this paper, we investigate the possibility of circumventing such a barrier. We propose a novel defense method that does not require training labels. Through a carefully designed layer-wise weight re-initialization and knowledge distillation, our method can effectively cleanse backdoor behaviors of a suspicious network with negligible compromise in its normal behavior. In experiments, we show that our method, trained without labels, is on par with state-of-the-art defense methods trained using labels. We also observe promising defense results even on out-of-distribution data. This makes our method very practical. Code is available at: https://github.com/luluppang/BCU. + + + + Harmonious Feature Learning for Interactive Hand-Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Harmonious_Feature_Learning_for_Interactive_Hand-Object_Pose_Estimation_CVPR_2023_paper.pdf + Joint hand and object pose estimation from a single image is extremely challenging as serious occlusion often occurs when the hand and object interact. Existing approaches typically first extract coarse hand and object features from a single backbone, then further enhance them with reference to each other via interaction modules. However, these works usually ignore that the hand and object are competitive in feature learning, since the backbone takes both of them as foreground and they are usually mutually occluded. In this paper, we propose a novel Harmonious Feature Learning Network (HFL-Net). HFL-Net introduces a new framework that combines the advantages of single- and double-stream backbones: it shares the parameters of the low- and high-level convolutional layers of a common ResNet-50 model for the hand and object, leaving the middle-level layers unshared. This strategy enables the hand and the object to be extracted as the sole targets by the middle-level layers, avoiding their competition in feature learning. The shared high-level layers also force their features to be harmonious, thereby facilitating their mutual feature enhancement. In particular, we propose to enhance the feature of the hand via concatenation with the feature in the same location from the object stream. A subsequent self-attention layer is adopted to deeply fuse the concatenated feature.
Experimental results show that our proposed approach consistently outperforms state-of-the-art methods on the popular HO3D and Dex-YCB databases. Notably, the performance of our model on hand pose estimation even surpasses that of existing works that only perform the single-hand pose estimation task. Code is available at https://github.com/lzfff12/HFL-Net. + + + + CLOTH4D: A Dataset for Clothed Human Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Zou_CLOTH4D_A_Dataset_for_Clothed_Human_Reconstruction_CVPR_2023_paper.pdf + Clothed human reconstruction is the cornerstone for creating the virtual world. To a great extent, the quality of recovered avatars decides whether the Metaverse is a passing fad. In this work, we introduce CLOTH4D, a clothed human dataset containing 1,000 subjects with varied appearances, 1,000 3D outfits, and over 100,000 clothed meshes with paired unclothed humans, to fill the gap in large-scale and high-quality 4D clothing data. It enjoys appealing characteristics: 1) Accurate and detailed clothing textured meshes---all clothing items are manually created and then simulated in professional software, strictly following the general standard in fashion design. 2) Separated textured clothing and under-clothing body meshes, closer to the physical world than single-layer raw scans. 3) Clothed human motion sequences simulated given a set of 289 actions, covering fundamental and complicated dynamics. Upon CLOTH4D, we design a series of novel temporally-aware metrics to evaluate the temporal stability of the generated 3D human meshes, which has been overlooked previously. Moreover, by assessing and retraining current state-of-the-art clothed human reconstruction methods, we reveal insights, present improved performance, and propose potential future research directions, confirming our dataset's advancement. The dataset is available at www.github.com/AemikaChow/AiDLab-fAshIon-Data + + + + Generative Bias for Robust Visual Question Answering + http://openaccess.thecvf.com//content/CVPR2023/papers/Cho_Generative_Bias_for_Robust_Visual_Question_Answering_CVPR_2023_paper.pdf + The task of Visual Question Answering (VQA) is known to be plagued by the issue of VQA models exploiting biases within the dataset to make their final predictions. Various previous ensemble-based debiasing methods have been proposed where an additional model is purposefully trained to be biased in order to train a robust target model. However, these methods compute the bias for a model simply from the label statistics of the training data or from single modal branches. In this work, in order to better learn the bias a target VQA model suffers from, we propose a generative method to train the bias model directly from the target model, called GenB. In particular, GenB employs a generative network to learn the bias in the target model through a combination of the adversarial objective and knowledge distillation. We then debias our target model with GenB as a bias model, and show through extensive experiments the effects of our method on various VQA bias datasets including VQA-CP2, VQA-CP1, GQA-OOD, and VQA-CE, and show state-of-the-art results with the LXMERT architecture on VQA-CP2. + + + + Data-Free Sketch-Based Image Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Chaudhuri_Data-Free_Sketch-Based_Image_Retrieval_CVPR_2023_paper.pdf + Rising concerns about privacy and anonymity preservation of deep learning models have facilitated research in data-free learning.
Primarily based on data-free knowledge distillation, models developed in this area so far have only been able to operate in a single modality, performing the same kind of task as that of the teacher. For the first time, we propose Data-Free Sketch-Based Image Retrieval (DF-SBIR), a cross-modal data-free learning setting, where teachers trained for classification in a single modality have to be leveraged by students to learn a cross-modal metric-space for retrieval. The widespread availability of pre-trained classification models, along with the difficulty in acquiring paired photo-sketch datasets for SBIR justify the practicality of this setting. We present a methodology for DF-SBIR, which can leverage knowledge from models independently trained to perform classification on photos and sketches. We evaluate our model on the Sketchy, TU-Berlin, and QuickDraw benchmarks, designing a variety of baselines based on existing data-free learning literature, and observe that our method surpasses all of them by significant margins. Our method also achieves mAPs competitive with data-dependent approaches, all the while requiring no training data. Implementation is available at https://github.com/abhrac/data-free-sbir. + + + + Multi-Object Manipulation via Object-Centric Neural Scattering Functions + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_Multi-Object_Manipulation_via_Object-Centric_Neural_Scattering_Functions_CVPR_2023_paper.pdf + Learned visual dynamics models have proven effective for robotic manipulation tasks. Yet, it remains unclear how best to represent scenes involving multi-object interactions. Current methods decompose a scene into discrete objects, yet they struggle with precise modeling and manipulation amid challenging lighting conditions since they only encode appearance tied with specific illuminations. In this work, we propose using object-centric neural scattering functions (OSFs) as object representations in a model-predictive control framework. OSFs model per-object light transport, enabling compositional scene re-rendering under object rearrangement and varying lighting conditions. By combining this approach with inverse parameter estimation and graph-based neural dynamics models, we demonstrate improved model-predictive control performance and generalization in compositional multi-object environments, even in previously unseen scenarios and harsh lighting conditions. + + + + The Wisdom of Crowds: Temporal Progressive Attention for Early Action Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Stergiou_The_Wisdom_of_Crowds_Temporal_Progressive_Attention_for_Early_Action_CVPR_2023_paper.pdf + Early action prediction deals with inferring the ongoing action from partially-observed videos, typically at the outset of the video. We propose a bottleneck-based attention model that captures the evolution of the action, through progressive sampling over fine-to-coarse scales. Our proposed Temporal Progressive (TemPr) model is composed of multiple attention towers, one for each scale. The predicted action label is based on the collective agreement considering confidences of these towers. Extensive experiments over four video datasets showcase state-of-the-art performance on the task of Early Action Prediction across a range of encoder architectures. We demonstrate the effectiveness and consistency of TemPr through detailed ablations. 
+ + + + Invertible Neural Skinning + http://openaccess.thecvf.com//content/CVPR2023/papers/Kant_Invertible_Neural_Skinning_CVPR_2023_paper.pdf + Building animatable and editable models of clothed humans from raw 3D scans and poses is a challenging problem. Existing reposing methods suffer from the limited expressiveness of Linear Blend Skinning (LBS), require costly mesh extraction to generate each new pose, and typically do not preserve surface correspondences across different poses. In this work, we introduce Invertible Neural Skinning (INS) to address these shortcomings. To maintain correspondences, we propose a Pose-conditioned Invertible Network (PIN) architecture, which extends the LBS process by learning additional pose-varying deformations. Next, we combine PIN with a differentiable LBS module to build an expressive and end-to-end Invertible Neural Skinning (INS) pipeline. We demonstrate the strong performance of our method by outperforming the state-of-the-art reposing techniques on clothed humans and preserving surface correspondences, while being an order of magnitude faster. We also perform an ablation study, which shows the usefulness of our pose-conditioning formulation, and our qualitative results show that INS can effectively rectify artefacts introduced by LBS. + + + + Weakly Supervised Semantic Segmentation via Adversarial Learning of Classifier and Reconstructor + http://openaccess.thecvf.com//content/CVPR2023/papers/Kweon_Weakly_Supervised_Semantic_Segmentation_via_Adversarial_Learning_of_Classifier_and_CVPR_2023_paper.pdf + In Weakly Supervised Semantic Segmentation (WSSS), Class Activation Maps (CAMs) usually 1) do not cover the whole object and 2) are activated on irrelevant regions. To address these issues, we propose a novel WSSS framework via adversarial learning of a classifier and an image reconstructor. When an image is perfectly decomposed into class-wise segments, information (i.e., color or texture) of a single segment cannot be inferred from the other segments. Therefore, inferability between the segments can represent the preciseness of segmentation. We quantify the inferability as a reconstruction quality of one segment from the other segments. If one segment could be reconstructed from the others, then the segment would be imprecise. To bring this idea into WSSS, we simultaneously train two models: a classifier generating CAMs that decompose an image into segments and a reconstructor that measures the inferability between the segments. As in GANs, while being alternately trained in an adversarial manner, the two networks provide positive feedback to each other. We verify the superiority of the proposed framework with extensive ablation studies. Our method achieves new state-of-the-art performance on both PASCAL VOC 2012 and MS COCO 2014. The code is available at https://github.com/sangrockEG/ACR. + + + + Distilling Cross-Temporal Contexts for Continuous Sign Language Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Distilling_Cross-Temporal_Contexts_for_Continuous_Sign_Language_Recognition_CVPR_2023_paper.pdf + Continuous sign language recognition (CSLR) aims to recognize glosses in a sign language video. State-of-the-art methods typically have two modules, a spatial perception module and a temporal aggregation module, which are jointly learned end-to-end.
Existing results in [9,20,25,36] have indicated that, as the frontal component of the overall model, the spatial perception module used for spatial feature extraction tends to be insufficiently trained. In this paper, we first conduct empirical studies and show that a shallow temporal aggregation module allows more thorough training of the spatial perception module. However, a shallow temporal aggregation module cannot adequately capture both local and global temporal context information in sign language. To address this dilemma, we propose a cross-temporal context aggregation (CTCA) model. Specifically, we build a dual-path network that contains two branches for the perception of local and global temporal context. We further design a cross-context knowledge distillation learning objective to aggregate the two types of context and the linguistic prior. The knowledge distillation enables the resultant one-branch temporal aggregation module to perceive local-global temporal and semantic context. This shallow temporal perception module structure facilitates spatial perception module learning. Extensive experiments on challenging CSLR benchmarks demonstrate that our method outperforms all state-of-the-art methods. + + + + Unsupervised Deep Probabilistic Approach for Partial Point Cloud Registration + http://openaccess.thecvf.com//content/CVPR2023/papers/Mei_Unsupervised_Deep_Probabilistic_Approach_for_Partial_Point_Cloud_Registration_CVPR_2023_paper.pdf + Deep point cloud registration methods face challenges with partial overlaps and rely on labeled data. To address these issues, we propose UDPReg, an unsupervised deep probabilistic registration framework for point clouds with partial overlaps. Specifically, we first adopt a network to learn posterior probability distributions of Gaussian mixture models (GMMs) from point clouds. To handle partial point cloud registration, we apply the Sinkhorn algorithm to predict the distribution-level correspondences under the constraint of the mixing weights of GMMs. To enable unsupervised learning, we design three distribution consistency-based losses: self-consistency, cross-consistency, and local contrastive. The self-consistency loss is formulated by encouraging GMMs in Euclidean and feature spaces to share identical posterior distributions. The cross-consistency loss derives from the fact that the points of two partially overlapping point clouds belonging to the same clusters share the cluster centroids. The cross-consistency loss allows the network to flexibly learn a transformation-invariant posterior distribution of two aligned point clouds. The local contrastive loss helps the network extract discriminative local features. Our UDPReg achieves competitive performance on the 3DMatch/3DLoMatch and ModelNet/ModelLoNet benchmarks. + + + + Similarity Metric Learning for RGB-Infrared Group Re-Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiong_Similarity_Metric_Learning_for_RGB-Infrared_Group_Re-Identification_CVPR_2023_paper.pdf + Group re-identification (G-ReID) aims to re-identify a group of people that is observed from non-overlapping camera systems. The existing literature has mainly addressed RGB-based problems, but the RGB-infrared (RGB-IR) cross-modality matching problem has not yet been studied. In this paper, we propose a metric learning method, Closest Permutation Matching (CPM), for RGB-IR G-ReID.
We model each group as a set of single-person features extracted by MPANet, and then propose the metric Closest Permutation Distance (CPD) to measure the similarity between two sets of features. CPD is invariant to order changes of group members, so it solves the layout change problem in G-ReID. Furthermore, we introduce the problem of G-ReID without person labels. In the weakly-supervised case, we design the Relation-aware Module (RAM) that exploits visual context and relations among group members to produce a modality-invariant order of features in each group, with which group member features within a set can be sorted to form a group representation that is robust against modality change. To support the study on RGB-IR G-ReID, we construct a new large-scale RGB-IR G-ReID dataset, CM-Group. The dataset contains 15,440 RGB images and 15,506 infrared images of 427 groups and 1,013 identities. Extensive experiments on the new dataset demonstrate the effectiveness of the proposed models and the complexity of CM-Group. The code and dataset are available at: https://github.com/WhollyOat/CM-Group. + + + + Train/Test-Time Adaptation With Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Zancato_TrainTest-Time_Adaptation_With_Retrieval_CVPR_2023_paper.pdf + We introduce Train/Test-Time Adaptation with Retrieval (T3AR), a method to adapt models both at train and test time by means of a retrieval module and a searchable pool of external samples. Before inference, T3AR adapts a given model to the downstream task using refined pseudo-labels and a self-supervised contrastive objective function whose noise distribution leverages retrieved real samples to improve feature adaptation on the target data manifold. The retrieval of real images is key to T3AR since it does not rely solely on synthetic data augmentations to compensate for the lack of adaptation data, as typically done by other adaptation algorithms. Furthermore, thanks to the retrieval module, our method gives the user or service provider the possibility to improve model adaptation on the downstream task by incorporating further relevant data or to fully remove samples that may no longer be available due to changes in user preference after deployment. First, we show that T3AR can be used at training time to improve downstream fine-grained classification over standard fine-tuning baselines, and the fewer the adaptation data, the higher the relative improvement (up to 13%). Second, we apply T3AR for test-time adaptation and show that exploiting a pool of external images at test time leads to more robust representations over existing methods on DomainNet-126 and VISDA-C, especially when few adaptation data are available (up to 8%). + + + + ProxyFormer: Proxy Alignment Assisted Point Cloud Completion With Missing Part Sensitive Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_ProxyFormer_Proxy_Alignment_Assisted_Point_Cloud_Completion_With_Missing_Part_CVPR_2023_paper.pdf + Problems such as equipment defects or limited viewpoints will cause the captured point clouds to be incomplete. Therefore, recovering the complete point clouds from the partial ones plays a vital role in many practical tasks, and one of the keys lies in the prediction of the missing part. In this paper, we propose a novel point cloud completion approach, named ProxyFormer, which divides point clouds into existing (input) and missing (to be predicted) parts and each part communicates information through its proxies.
Specifically, we fuse information into point proxy via feature and position extractor, and generate features for missing point proxies from the features of existing point proxies. Then, in order to better perceive the position of missing points, we design a missing part sensitive transformer, which converts random normal distribution into reasonable position information, and uses proxy alignment to refine the missing proxies. It makes the predicted point proxies more sensitive to the features and positions of the missing part, and thus makes these proxies more suitable for subsequent coarse-to-fine processes. Experimental results show that our method outperforms state-of-the-art completion networks on several benchmark datasets and has the fastest inference speed. + + + + Mod-Squad: Designing Mixtures of Experts As Modular Multi-Task Learners + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Mod-Squad_Designing_Mixtures_of_Experts_As_Modular_Multi-Task_Learners_CVPR_2023_paper.pdf + Optimization in multi-task learning (MTL) is more challenging than single-task learning (STL), as the gradient from different tasks can be contradictory. When tasks are related, it can be beneficial to share some parameters among them (cooperation). However, some tasks require additional parameters with expertise in a specific type of data or discrimination (specialization). To address the MTL challenge, we propose Mod-Squad, a new model that is Modularized into groups of experts (a 'Squad'). This structure allows us to formalize cooperation and specialization as the process of matching experts and tasks. We optimize this matching process during the training of a single model. Specifically, we incorporate mixture of experts (MoE) layers into a transformer model, with a new loss that incorporates the mutual dependence between tasks and experts. As a result, only a small set of experts are activated for each task. This prevents the sharing of the entire backbone model between all tasks, which strengthens the model, especially when the training set size and the number of tasks scale up. More interestingly, for each task, we can extract the small set of experts as a standalone model that maintains the same performance as the large model. Extensive experiments on the Taskonomy dataset with 13 vision tasks and the PASCALContext dataset with 5 vision tasks show the superiority of our approach. The project page can be accessed at https://vis-www.cs.umass.edu/mod-squad. + + + + Learning Customized Visual Models With Retrieval-Augmented Knowledge + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Learning_Customized_Visual_Models_With_Retrieval-Augmented_Knowledge_CVPR_2023_paper.pdf + Image-text contrastive learning models such as CLIP have demonstrated strong task transfer ability. The high generality and usability of these visual models is achieved via a web-scale data collection process to ensure broad concept coverage, followed by expensive pre-training to feed all the knowledge into model weights. Alternatively, we propose REACT, REtrieval-Augmented CusTomization, a framework to acquire the relevant web knowledge to build customized visual models for target domains. We retrieve the most relevant image-text pairs ( 3% of CLIP pre-training data) from the web-scale database as external knowledge and propose to customize the model by only training new modularized blocks while freezing all the original weights. 
The effectiveness of REACT is demonstrated via extensive experiments on classification, retrieval, detection and segmentation tasks, including zero, few, and full-shot settings. Particularly, on the zero-shot classification task, compared with CLIP, it achieves up to 5.4% improvement on ImageNet and 3.7% on the ELEVATER benchmark (20 datasets). + + + + Run, Don't Walk: Chasing Higher FLOPS for Faster Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Run_Dont_Walk_Chasing_Higher_FLOPS_for_Faster_Neural_Networks_CVPR_2023_paper.pdf + To design fast neural networks, many works have been focusing on reducing the number of floating-point operations (FLOPs). We observe that such reduction in FLOPs, however, does not necessarily lead to a similar level of reduction in latency. This mainly stems from inefficiently low floating-point operations per second (FLOPS). To achieve faster networks, we revisit popular operators and demonstrate that such low FLOPS is mainly due to frequent memory access of the operators, especially the depthwise convolution. We hence propose a novel partial convolution (PConv) that extracts spatial features more efficiently, by cutting down redundant computation and memory access simultaneously. Building upon our PConv, we further propose FasterNet, a new family of neural networks, which attains substantially higher running speed than others on a wide range of devices, without compromising on accuracy for various vision tasks. For example, on ImageNet-1k, our tiny FasterNet-T0 is 2.8x, 3.3x, and 2.4x faster than MobileViT-XXS on GPU, CPU, and ARM processors, respectively, while being 2.9% more accurate. Our large FasterNet-L achieves impressive 83.5% top-1 accuracy, on par with the emerging Swin-B, while having 36% higher inference throughput on GPU, as well as saving 37% compute time on CPU. Code is available at https://github.com/JierunChen/FasterNet. + + + + Learning Procedure-Aware Video Representation From Instructional Videos and Their Narrations + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Learning_Procedure-Aware_Video_Representation_From_Instructional_Videos_and_Their_Narrations_CVPR_2023_paper.pdf + The abundance of instructional videos and their narrations over the Internet offers an exciting avenue for understanding procedural activities. In this work, we propose to learn video representation that encodes both action steps and their temporal ordering, based on a large-scale dataset of web instructional videos and their narrations, without using human annotations. Our method jointly learns a video representation to encode individual step concepts, and a deep probabilistic model to capture both temporal dependencies and immense individual variations in the step ordering. We empirically demonstrate that learning temporal ordering not only enables new capabilities for procedure reasoning, but also reinforces the recognition of individual steps. Our model significantly advances the state-of-the-art results on step classification (+2.8% / +3.3% on COIN / EPIC-Kitchens) and step forecasting (+7.4% on COIN). Moreover, our model attains promising results in zero-shot inference for step classification and forecasting, as well as in predicting diverse and plausible steps for incomplete procedures. Our code is available at https://github.com/facebookresearch/ProcedureVRL. 
+ + + + Co-Training 2L Submodels for Visual Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Touvron_Co-Training_2L_Submodels_for_Visual_Recognition_CVPR_2023_paper.pdf + This paper introduces submodel co-training, a regularization method related to co-training, self-distillation and stochastic depth. Given a neural network to be trained, for each sample we implicitly instantiate two altered networks, "submodels", with stochastic depth: i.e. activating only a subset of the layers and skipping others. Each network serves as a soft teacher to the other, by providing a cross-entropy loss that complements the regular softmax cross-entropy loss provided by the one-hot label. Our approach, dubbed "cosub", uses a single set of weights, and does not involve a pre-trained external model or temporal averaging. Experimentally, we show that submodel co-training is effective to train backbones for recognition tasks such as image classification and semantic segmentation, and that our approach is compatible with multiple recent architectures, including RegNet, PiT, and Swin. We report new state-of-the-art results for vision transformers trained on ImageNet only. For instance, a ViT-B pre-trained with cosub on Imagenet-21k achieves 87.4% top-1 acc. on Imagenet-val. + + + + K-Planes: Explicit Radiance Fields in Space, Time, and Appearance + http://openaccess.thecvf.com//content/CVPR2023/papers/Fridovich-Keil_K-Planes_Explicit_Radiance_Fields_in_Space_Time_and_Appearance_CVPR_2023_paper.pdf + We introduce k-planes, a white-box model for radiance fields in arbitrary dimensions. Our model uses d-choose-2 planes to represent a d-dimensional scene, providing a seamless way to go from static (d=3) to dynamic (d=4) scenes. This planar factorization makes adding dimension-specific priors easy, e.g. temporal smoothness and multi-resolution spatial structure, and induces a natural decomposition of static and dynamic components of a scene. We use a linear feature decoder with a learned color basis that yields similar performance as a nonlinear black-box MLP decoder. Across a range of synthetic and real, static and dynamic, fixed and varying appearance scenes, k-planes yields competitive and often state-of-the-art reconstruction fidelity with low memory usage, achieving 1000x compression over a full 4D grid, and fast optimization with a pure PyTorch implementation. For video results and code, please see sarafridov.github.io/K-Planes. + + + + Multi-Mode Online Knowledge Distillation for Self-Supervised Visual Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_Multi-Mode_Online_Knowledge_Distillation_for_Self-Supervised_Visual_Representation_Learning_CVPR_2023_paper.pdf + Self-supervised learning (SSL) has made remarkable progress in visual representation learning. Some studies combine SSL with knowledge distillation (SSL-KD) to boost the representation learning performance of small models. In this study, we propose a Multi-mode Online Knowledge Distillation method (MOKD) to boost self-supervised visual representation learning. Different from existing SSL-KD methods that transfer knowledge from a static pre-trained teacher to a student, in MOKD, two different models learn collaboratively in a self-supervised manner. Specifically, MOKD consists of two distillation modes: self-distillation and cross-distillation modes. 
Among them, self-distillation performs self-supervised learning for each model independently, while cross-distillation realizes knowledge interaction between different models. In cross-distillation, a cross-attention feature search strategy is proposed to enhance the semantic feature alignment between different models. As a result, the two models can absorb knowledge from each other to boost their representation learning performance. Extensive experimental results on different backbones and datasets demonstrate that two heterogeneous models can benefit from MOKD and outperform their independently trained baseline. In addition, MOKD also outperforms existing SSL-KD methods for both the student and teacher models. + + + + Viewpoint Equivariance for Multi-View 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Viewpoint_Equivariance_for_Multi-View_3D_Object_Detection_CVPR_2023_paper.pdf + 3D object detection from visual sensors is a cornerstone capability of robotic systems. State-of-the-art methods focus on reasoning and decoding object bounding boxes from multi-view camera input. In this work we gain intuition from the integral role of multi-view consistency in 3D scene understanding and geometric learning. To this end, we introduce VEDet, a novel 3D object detection framework that exploits 3D multi-view geometry to improve localization through viewpoint awareness and equivariance. VEDet leverages a query-based transformer architecture and encodes the 3D scene by augmenting image features with positional encodings from their 3D perspective geometry. We design view-conditioned queries at the output level, which enables the generation of multiple virtual frames during training to learn viewpoint equivariance by enforcing multi-view consistency. The multi-view geometry injected at the input level as positional encodings and regularized at the loss level provides rich geometric cues for 3D object detection, leading to state-of-the-art performance on the nuScenes benchmark. The code and model are made available at https://github.com/TRI-ML/VEDet. + + + + A Generalized Framework for Video Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Heo_A_Generalized_Framework_for_Video_Instance_Segmentation_CVPR_2023_paper.pdf + The handling of long videos with complex and occluded sequences has recently emerged as a new challenge in the video instance segmentation (VIS) community. However, existing methods have limitations in addressing this challenge. We argue that the biggest bottleneck in current approaches is the discrepancy between training and inference. To effectively bridge this gap, we propose a Generalized framework for VIS, namely GenVIS, that achieves state-of-the-art performance on challenging benchmarks without designing complicated architectures or requiring extra post-processing. The key contribution of GenVIS is the learning strategy, which includes a query-based training pipeline for sequential learning with a novel target label assignment. Additionally, we introduce a memory that effectively acquires information from previous states. Thanks to the new perspective, which focuses on building relationships between separate frames or clips, GenVIS can be flexibly executed in both online and semi-online manner. We evaluate our approach on popular VIS benchmarks, achieving state-of-the-art results on YouTube-VIS 2019/2021/2022 and Occluded VIS (OVIS). 
Notably, we greatly outperform the state-of-the-art on the long VIS benchmark (OVIS), improving 5.6 AP with ResNet-50 backbone. Code is available at https://github.com/miranheo/GenVIS. + + + + On Distillation of Guided Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Meng_On_Distillation_of_Guided_Diffusion_Models_CVPR_2023_paper.pdf + Classifier-free guided diffusion models have recently been shown to be highly effective at high-resolution image generation, and they have been widely used in large-scale diffusion frameworks including DALL*E 2, Stable Diffusion and Imagen. However, a downside of classifier-free guided diffusion models is that they are computationally expensive at inference time since they require evaluating two diffusion models, a class-conditional model and an unconditional model, tens to hundreds of times. To deal with this limitation, we propose an approach to distilling classifier-free guided diffusion models into models that are fast to sample from: Given a pre-trained classifier-free guided model, we first learn a single model to match the output of the combined conditional and unconditional models, and then we progressively distill that model to a diffusion model that requires much fewer sampling steps. For standard diffusion models trained on the pixel-space, our approach is able to generate images visually comparable to that of the original model using as few as 4 sampling steps on ImageNet 64x64 and CIFAR-10, achieving FID/IS scores comparable to that of the original model while being up to 256 times faster to sample from. For diffusion models trained on the latent-space (e.g., Stable Diffusion), our approach is able to generate high-fidelity images using as few as 1 to 4 denoising steps, accelerating inference by at least 10-fold compared to existing methods on ImageNet 256x256 and LAION datasets. We further demonstrate the effectiveness of our approach on text-guided image editing and inpainting, where our distilled model is able to generate high-quality results using as few as 2-4 denoising steps. + + + + Disentangled Representation Learning for Unsupervised Neural Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Noh_Disentangled_Representation_Learning_for_Unsupervised_Neural_Quantization_CVPR_2023_paper.pdf + The inverted index is a widely used data structure to avoid the infeasible exhaustive search. It accelerates retrieval significantly by splitting the database into multiple disjoint sets and restricts distance computation to a small fraction of the database. Moreover, it even improves search quality by allowing quantizers to exploit the compact distribution of residual vector space. However, we firstly point out a problem that an existing deep learning-based quantizer hardly benefits from the residual vector space, unlike conventional shallow quantizers. To cope with this problem, we introduce a novel disentangled representation learning for unsupervised neural quantization. Similar to the concept of residual vector space, the proposed method enables more compact latent space by disentangling information of the inverted index from the vectors. Experimental results on large-scale datasets confirm that our method outperforms the state-of-the-art retrieval systems by a large margin. 
+ + + + Zero-Shot Pose Transfer for Unrigged Stylized 3D Characters + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Zero-Shot_Pose_Transfer_for_Unrigged_Stylized_3D_Characters_CVPR_2023_paper.pdf + Transferring the pose of a reference avatar to stylized 3D characters of various shapes is a fundamental task in computer graphics. Existing methods either require the stylized characters to be rigged, or they use the stylized character in the desired pose as ground truth at training. We present a zero-shot approach that requires only the widely available deformed non-stylized avatars in training, and deforms stylized characters of significantly different shapes at inference. Classical methods achieve strong generalization by deforming the mesh at the triangle level, but this requires labelled correspondences. We leverage the power of local deformation, but without requiring explicit correspondence labels. We introduce a semi-supervised shape-understanding module to bypass the need for explicit correspondences at test time, and an implicit pose deformation module that deforms individual surface points to match the target pose. Furthermore, to encourage realistic and accurate deformation of stylized characters, we introduce an efficient volume-based test-time training procedure. Because it does not need rigging, nor the deformed stylized character at training time, our model generalizes to categories with scarce annotation, such as stylized quadrupeds. Extensive experiments demonstrate the effectiveness of the proposed method compared to the state-of-the-art approaches trained with comparable or more supervision. Our project page is available at https://jiashunwang.github.io/ZPT + + + + Listening Human Behavior: 3D Human Pose Estimation With Acoustic Signals + http://openaccess.thecvf.com//content/CVPR2023/papers/Shibata_Listening_Human_Behavior_3D_Human_Pose_Estimation_With_Acoustic_Signals_CVPR_2023_paper.pdf + Given only acoustic signals without any high-level information, such as voices or sounds of scenes/actions, how much can we infer about the behavior of humans? Unlike existing methods, which suffer from privacy issues because they use signals that include human speech or the sounds of specific actions, we explore how low-level acoustic signals can provide enough clues to estimate 3D human poses by active acoustic sensing with a single pair of microphones and loudspeakers (see Fig. 1). This is a challenging task since sound is much more diffractive than other signals and therefore covers up the shape of objects in a scene. Accordingly, we introduce a framework that encodes multichannel audio features into 3D human poses. Aiming to capture subtle sound changes to reveal detailed pose information, we explicitly extract phase features from the acoustic signals together with typical spectrum features and feed them into our human pose estimation network. Also, we show that reflected or diffracted sounds are easily influenced by subjects' physique differences e.g., height and muscularity, which deteriorates prediction accuracy. We reduce these gaps by using a subject discriminator to improve accuracy. Our experiments suggest that with the use of only low-dimensional acoustic information, our method outperforms baseline methods. The datasets and codes used in this project will be publicly available. 
+ + + + Meta-Learning With a Geometry-Adaptive Preconditioner + http://openaccess.thecvf.com//content/CVPR2023/papers/Kang_Meta-Learning_With_a_Geometry-Adaptive_Preconditioner_CVPR_2023_paper.pdf + Model-agnostic meta-learning (MAML) is one of the most successful meta-learning algorithms. It has a bi-level optimization structure where the outer-loop process learns a shared initialization and the inner-loop process optimizes task-specific weights. Although MAML relies on the standard gradient descent in the inner loop, recent studies have shown that controlling the inner loop's gradient descent with a meta-learned preconditioner can be beneficial. Existing preconditioners, however, cannot simultaneously adapt in a task-specific and path-dependent way. Additionally, they do not satisfy the Riemannian metric condition, which can enable steepest-descent learning with a preconditioned gradient. In this study, we propose Geometry-Adaptive Preconditioned gradient descent (GAP), which can overcome the limitations in MAML; GAP can efficiently meta-learn a preconditioner that is dependent on task-specific parameters, and its preconditioner can be shown to be a Riemannian metric. Thanks to these two properties, the geometry-adaptive preconditioner is effective for improving the inner-loop optimization. Experimental results show that GAP outperforms the state-of-the-art MAML family and preconditioned gradient descent-MAML (PGD-MAML) family in a variety of few-shot learning tasks. Code is available at: https://github.com/Suhyun777/CVPR23-GAP. + + + + NeuralDome: A Neural Modeling Pipeline on Multi-View Human-Object Interactions + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_NeuralDome_A_Neural_Modeling_Pipeline_on_Multi-View_Human-Object_Interactions_CVPR_2023_paper.pdf + Humans constantly interact with objects in daily life tasks. Capturing such processes and subsequently conducting visual inferences from a fixed viewpoint suffers from occlusions, shape and texture ambiguities, motions, etc. To mitigate the problem, it is essential to build a training dataset that captures free-viewpoint interactions. We construct a dense multi-view dome to acquire a complex human-object interaction dataset, named HODome, that consists of 71M frames of 10 subjects interacting with 23 objects. To process the HODome dataset, we develop NeuralDome, a layer-wise neural processing pipeline tailored for multi-view video inputs to conduct accurate tracking, geometry reconstruction and free-view rendering, for both human subjects and objects. Extensive experiments on the HODome dataset demonstrate the effectiveness of NeuralDome on a variety of inference, modeling, and rendering tasks. Both the dataset and the NeuralDome tools will be disseminated to the community for further development, which can be found at https://juzezhang.github.io/NeuralDome + + + + No One Left Behind: Improving the Worst Categories in Long-Tailed Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_No_One_Left_Behind_Improving_the_Worst_Categories_in_Long-Tailed_CVPR_2023_paper.pdf + Unlike the case when using a balanced training dataset, the per-class recall (i.e., accuracy) of neural networks trained with an imbalanced dataset is known to vary considerably from category to category. The convention in long-tailed recognition is to manually split all categories into three subsets and report the average accuracy within each subset. We argue that under such an evaluation setting, some categories are inevitably sacrificed.
On one hand, focusing on the average accuracy on a balanced test set incurs little penalty even if some worst performing categories have zero accuracy. On the other hand, classes in the "Few" subset do not necessarily perform worse than those in the "Many" or "Medium" subsets. We therefore advocate to focus more on improving the lowest recall among all categories and the harmonic mean of all recall values. Specifically, we propose a simple plug-in method that is applicable to a wide range of methods. By simply re-training the classifier of an existing pre-trained model with our proposed loss function and using an optional ensemble trick that combines the predictions of the two classifiers, we achieve a more uniform distribution of recall values across categories, which leads to a higher harmonic mean accuracy while the (arithmetic) average accuracy is still high. The effectiveness of our method is justified on widely used benchmark datasets. + + + + Target-Referenced Reactive Grasping for Dynamic Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Target-Referenced_Reactive_Grasping_for_Dynamic_Objects_CVPR_2023_paper.pdf + Reactive grasping, which enables the robot to successfully grasp dynamic moving objects, is of great interest in robotics. Current methods mainly focus on the temporal smoothness of the predicted grasp poses but few consider their semantic consistency. Consequently, the predicted grasps are not guaranteed to fall on the same part of the same object, especially in cluttered scenes. In this paper, we propose to solve reactive grasping in a target-referenced setting by tracking through generated grasp spaces. Given a targeted grasp pose on an object and detected grasp poses in a new observation, our method is composed of two stages: 1) discovering grasp pose correspondences through an attentional graph neural network and selecting the one with the highest similarity with respect to the target pose; 2) refining the selected grasp poses based on target and historical information. We evaluate our method on a large-scale benchmark GraspNet-1Billion. We also collect 30 scenes of dynamic objects for testing. The results suggest that our method outperforms other representative methods. Furthermore, our real robot experiments achieve an average success rate of over 80 percent. + + + + Complexity-Guided Slimmable Decoder for Efficient Deep Video Compression + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Complexity-Guided_Slimmable_Decoder_for_Efficient_Deep_Video_Compression_CVPR_2023_paper.pdf + In this work, we propose the complexity-guided slimmable decoder (cgSlimDecoder) in combination with skip-adaptive entropy coding (SaEC) for efficient deep video compression. Specifically, given the target complexity constraints, in our cgSlimDecoder, we introduce a set of new channel width selection modules to automatically decide the optimal channel width of each slimmable convolution layer. By optimizing the complexity-rate-distortion related objective function to directly learn the parameters of the newly introduced channel width selection modules and other modules in the decoder, our cgSlimDecoder can automatically allocate the optimal numbers of parameters for different types of modules (e.g., motion/residual decoder and the motion compensation network) and simultaneously support multiple complexity levels by using a single learnt decoder instead of multiple decoders. 
In addition, our proposed SaEC can further accelerate the entropy decoding procedure in both motion and residual decoders by simply skipping the entropy coding process for the elements in the encoded feature maps that are already well-predicted by the hyperprior network. As demonstrated in our comprehensive experiments, our newly proposed methods cgSlimDecoder and SaEC are general and can be readily incorporated into three widely used deep video codecs (i.e., DVC, FVC and DCVC) to significantly improve their coding efficiency with negligible performance drop. + + + + MarginMatch: Improving Semi-Supervised Learning with Pseudo-Margins + http://openaccess.thecvf.com//content/CVPR2023/papers/Sosea_MarginMatch_Improving_Semi-Supervised_Learning_with_Pseudo-Margins_CVPR_2023_paper.pdf + We introduce MarginMatch, a new SSL approach combining consistency regularization and pseudo-labeling, with its main novelty arising from the use of unlabeled data training dynamics to measure pseudo-label quality. Instead of using only the model's confidence on an unlabeled example at an arbitrary iteration to decide if the example should be masked or not, MarginMatch also analyzes the behavior of the model on the pseudo-labeled examples as the training progresses, ensuring low fluctuations in the model's predictions from one iteration to another. MarginMatch brings substantial improvements on four vision benchmarks in low data regimes and on two large-scale datasets, emphasizing the importance of enforcing high-quality pseudo-labels. Notably, we obtain an improvement in error rate over the state-of-the-art of 3.25% on CIFAR-100 with only 25 examples per class and of 4.19% on STL-10 using as few as 4 examples per class. + + + + Beyond Appearance: A Semantic Controllable Self-Supervised Learning Framework for Human-Centric Visual Tasks + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Beyond_Appearance_A_Semantic_Controllable_Self-Supervised_Learning_Framework_for_Human-Centric_CVPR_2023_paper.pdf + Human-centric visual tasks have attracted increasing research attention due to their widespread applications. In this paper, we aim to learn a general human representation from massive unlabeled human images which can benefit downstream human-centric tasks to the maximum extent. We call this method SOLIDER, a Semantic cOntrollable seLf-supervIseD lEaRning framework. Unlike the existing self-supervised learning methods, prior knowledge from human images is utilized in SOLIDER to build pseudo semantic labels and import more semantic information into the learned representation. Meanwhile, we note that different downstream tasks always require different ratios of semantic information and appearance information. For example, human parsing requires more semantic information, while person re-identification needs more appearance information for identification purpose. So a single learned representation cannot fit for all requirements. To solve this problem, SOLIDER introduces a conditional network with a semantic controller. After the model is trained, users can send values to the controller to produce representations with different ratios of semantic information, which can fit different needs of downstream tasks. Finally, SOLIDER is verified on six downstream human-centric visual tasks. It outperforms state of the arts and builds new baselines for these tasks. The code is released in https://github.com/tinyvision/SOLIDER. 
+ + + + Neural Fourier Filter Bank + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Neural_Fourier_Filter_Bank_CVPR_2023_paper.pdf + We present a novel method to provide efficient and highly detailed reconstructions. Inspired by wavelets, we learn a neural field that decomposes the signal both spatially and frequency-wise. We follow the recent grid-based paradigm for spatial decomposition, but unlike existing work, encourage specific frequencies to be stored in each grid via Fourier feature encodings. We then apply a multi-layer perceptron with sine activations, taking in these Fourier-encoded features at appropriate layers so that higher-frequency components are accumulated on top of lower-frequency components sequentially, which we sum up to form the final output. We demonstrate that our method outperforms the state of the art regarding model compactness and convergence speed on multiple tasks: 2D image fitting, 3D shape reconstruction, and neural radiance fields. Our code is available at https://github.com/ubc-vision/NFFB. + + + + NeRFInvertor: High Fidelity NeRF-GAN Inversion for Single-Shot Real Image Animation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yin_NeRFInvertor_High_Fidelity_NeRF-GAN_Inversion_for_Single-Shot_Real_Image_Animation_CVPR_2023_paper.pdf + NeRF-based generative models have shown impressive capacity in generating high-quality images with consistent 3D geometry. Despite successful synthesis of fake identity images randomly sampled from latent space, adopting these models for generating face images of real subjects is still a challenging task due to the so-called inversion issue. In this paper, we propose a universal method to surgically fine-tune these NeRF-GAN models in order to achieve high-fidelity animation of real subjects using only a single image. Given the optimized latent code for an out-of-domain real image, we employ 2D loss functions on the rendered image to reduce the identity gap. Furthermore, our method leverages explicit and implicit 3D regularizations using the in-domain neighborhood samples around the optimized latent code to remove geometrical and visual artifacts. Our experiments confirm the effectiveness of our method in realistic, high-fidelity, and 3D-consistent animation of real faces on multiple NeRF-GAN models across different datasets. + + + + Trace and Pace: Controllable Pedestrian Animation via Guided Trajectory Diffusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Rempe_Trace_and_Pace_Controllable_Pedestrian_Animation_via_Guided_Trajectory_Diffusion_CVPR_2023_paper.pdf + We introduce a method for generating realistic pedestrian trajectories and full-body animations that can be controlled to meet user-defined goals. We draw on recent advances in guided diffusion modeling to achieve test-time controllability of trajectories, which is normally only associated with rule-based systems. Our guided diffusion model allows users to constrain trajectories through target waypoints, speed, and specified social groups while accounting for the surrounding environment context. This trajectory diffusion model is integrated with a novel physics-based humanoid controller to form a closed-loop, full-body pedestrian animation system capable of placing large crowds in a simulated environment with varying terrains.
We further propose utilizing the value function learned during RL training of the animation controller to guide diffusion to produce trajectories better suited for particular scenarios such as collision avoidance and traversing uneven terrain. + + + + Overlooked Factors in Concept-Based Explanations: Dataset Choice, Concept Learnability, and Human Capability + http://openaccess.thecvf.com//content/CVPR2023/papers/Ramaswamy_Overlooked_Factors_in_Concept-Based_Explanations_Dataset_Choice_Concept_Learnability_and_CVPR_2023_paper.pdf + Concept-based interpretability methods aim to explain a deep neural network model's components and predictions using a pre-defined set of semantic concepts. These methods evaluate a trained model on a new, "probe" dataset and correlate the model's outputs with concepts labeled in that dataset. Despite their popularity, they suffer from limitations that are not well-understood and articulated in the literature. In this work, we identify and analyze three commonly overlooked factors in concept-based explanations. First, we find that the choice of the probe dataset has a profound impact on the generated explanations. Our analysis reveals that different probe datasets lead to very different explanations, suggesting that the generated explanations are not generalizable outside the probe dataset. Second, we find that concepts in the probe dataset are often harder to learn than the target classes they are used to explain, calling into question the correctness of the explanations. We argue that only easily learnable concepts should be used in concept-based explanations. Finally, while existing methods use hundreds or even thousands of concepts, our human studies reveal a much stricter upper bound of 32 concepts or fewer, beyond which the explanations are much less practically useful. We discuss the implications of our findings and provide suggestions for future development of concept-based interpretability methods. Code for our analysis and user interface can be found at https://github.com/princetonvisualai/OverlookedFactors. + + + + Unsupervised 3D Shape Reconstruction by Part Retrieval and Assembly + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Unsupervised_3D_Shape_Reconstruction_by_Part_Retrieval_and_Assembly_CVPR_2023_paper.pdf + Representing a 3D shape with a set of primitives can aid perception of structure, improve robotic object manipulation, and enable editing, stylization, and compression of 3D shapes. Existing methods either use simple parametric primitives or learn a generative shape space of parts. Both have limitations: parametric primitives lead to coarse approximations, while learned parts offer too little control over the decomposition. We instead propose to decompose shapes using a library of 3D parts provided by the user, giving full control over the choice of parts. The library can contain parts with high-quality geometry that are suitable for a given category, resulting in meaningful decompositions with clean geometry. The type of decomposition can also be controlled through the choice of parts in the library. Our method works via an unsupervised approach that iteratively retrieves parts from the library and refines their placements. We show that this approach gives higher reconstruction accuracy and more desirable decompositions than existing approaches. Additionally, we show how the decomposition can be controlled through the part library by using different part libraries to reconstruct the same shapes. 
+ + + + SeqTrack: Sequence to Sequence Learning for Visual Object Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_SeqTrack_Sequence_to_Sequence_Learning_for_Visual_Object_Tracking_CVPR_2023_paper.pdf + In this paper, we present a new sequence-to-sequence learning framework for visual tracking, dubbed SeqTrack. It casts visual tracking as a sequence generation problem, which predicts object bounding boxes in an autoregressive fashion. This is different from prior Siamese trackers and transformer trackers, which rely on designing complicated head networks, such as classification and regression heads. SeqTrack only adopts a simple encoder-decoder transformer architecture. The encoder extracts visual features with a bidirectional transformer, while the decoder generates a sequence of bounding box values autoregressively with a causal transformer. The loss function is a plain cross-entropy. Such a sequence learning paradigm not only simplifies tracking framework, but also achieves competitive performance on benchmarks. For instance, SeqTrack gets 72.5% AUC on LaSOT, establishing a new state-of-the-art performance. Code and models are available at https://github.com/microsoft/VideoX. + + + + AutoLabel: CLIP-Based Framework for Open-Set Video Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zara_AutoLabel_CLIP-Based_Framework_for_Open-Set_Video_Domain_Adaptation_CVPR_2023_paper.pdf + Open-set Unsupervised Video Domain Adaptation (OUVDA) deals with the task of adapting an action recognition model from a labelled source domain to an unlabelled target domain that contains "target-private" categories, which are present in the target but absent in the source. In this work we deviate from the prior work of training a specialized open-set classifier or weighted adversarial learning by proposing to use pre-trained Language and Vision Models (CLIP). The CLIP is well suited for OUVDA due to its rich representation and the zero-shot recognition capabilities. However, rejecting target-private instances with the CLIP's zero-shot protocol requires oracle knowledge about the target-private label names. To circumvent the impossibility of the knowledge of label names, we propose AutoLabel that automatically discovers and generates object-centric compositional candidate target-private class names. Despite its simplicity, we show that CLIP when equipped with AutoLabel can satisfactorily reject the target-private instances, thereby facilitating better alignment between the shared classes of the two domains. The code is available. + + + + DINER: Depth-Aware Image-Based NEural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Prinzler_DINER_Depth-Aware_Image-Based_NEural_Radiance_Fields_CVPR_2023_paper.pdf + We present Depth-aware Image-based NEural Radiance fields (DINER). Given a sparse set of RGB input views, we predict depth and feature maps to guide the reconstruction of a volumetric scene representation that allows us to render 3D objects under novel views. Specifically, we propose novel techniques to incorporate depth information into feature fusion and efficient scene sampling. In comparison to the previous state of the art, DINER achieves higher synthesis quality and can process input views with greater disparity. This allows us to capture scenes more completely without changing capturing hardware requirements and ultimately enables larger viewpoint changes during novel view synthesis. 
We evaluate our method by synthesizing novel views, both for human heads and for general objects, and observe significantly improved qualitative results and increased perceptual metrics compared to the previous state of the art. + + + + Reconstructing Signing Avatars From Video Using Linguistic Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Forte_Reconstructing_Signing_Avatars_From_Video_Using_Linguistic_Priors_CVPR_2023_paper.pdf + Sign language (SL) is the primary method of communication for the 70 million Deaf people around the world. Video dictionaries of isolated signs are a core SL learning tool. Replacing these with 3D avatars can aid learning and enable AR/VR applications, improving access to technology and online media. However, little work has attempted to estimate expressive 3D avatars from SL video; occlusion, noise, and motion blur make this task difficult. We address this by introducing novel linguistic priors that are universally applicable to SL and provide constraints on 3D hand pose that help resolve ambiguities within isolated signs. Our method, SGNify, captures fine-grained hand pose, facial expression, and body movement fully automatically from in-the-wild monocular SL videos. We evaluate SGNify quantitatively by using a commercial motion-capture system to compute 3D avatars synchronized with monocular video. SGNify outperforms state-of-the-art 3D body-pose- and shape-estimation methods on SL videos. A perceptual study shows that SGNify's 3D reconstructions are significantly more comprehensible and natural than those of previous methods and are on par with the source videos. Code and data are available at sgnify.is.tue.mpg.de. + + + + DeepMapping2: Self-Supervised Large-Scale LiDAR Map Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_DeepMapping2_Self-Supervised_Large-Scale_LiDAR_Map_Optimization_CVPR_2023_paper.pdf + LiDAR mapping is important yet challenging in self-driving and mobile robotics. To tackle such a global point cloud registration problem, DeepMapping converts the complex map estimation into a self-supervised training of simple deep networks. Despite its broad convergence range on small datasets, DeepMapping still cannot produce satisfactory results on large-scale datasets with thousands of frames. This is due to the lack of loop closures and exact cross-frame point correspondences, and the slow convergence of its global localization network. We propose DeepMapping2 by adding two novel techniques to address these issues: (1) organization of training batches based on map topology from loop closing, and (2) a self-supervised local-to-global point consistency loss leveraging pairwise registration. Our experiments and ablation studies on public datasets such as KITTI, NCLT, and Nebula, demonstrate the effectiveness of our method. + + + + DoNet: Deep De-Overlapping Network for Cytology Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_DoNet_Deep_De-Overlapping_Network_for_Cytology_Instance_Segmentation_CVPR_2023_paper.pdf + Cell instance segmentation in cytology images has significant importance for biological analysis and cancer screening, while remaining challenging due to 1) the extensive overlapping translucent cell clusters that cause ambiguous boundaries, and 2) the confusion of mimics and debris as nuclei. In this work, we propose a De-overlapping Network (DoNet) with a decompose-and-recombine strategy. 
A Dual-path Region Segmentation Module (DRM) explicitly decomposes the cell clusters into intersection and complement regions, followed by a Semantic Consistency-guided Recombination Module (CRM) for integration. To further introduce the containment relationship of the nucleus in the cytoplasm, we design a Mask-guided Region Proposal Strategy (MRP) that integrates the cell attention maps for inner-cell instance prediction. We validate the proposed approach on ISBI2014 and CPS datasets. Experiments show that our proposed DoNet significantly outperforms other state-of-the-art (SOTA) cell instance segmentation methods. The code is available at https://github.com/DeepDoNet/DoNet. + + + + Instant Domain Augmentation for LiDAR Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ryu_Instant_Domain_Augmentation_for_LiDAR_Semantic_Segmentation_CVPR_2023_paper.pdf + Despite the increasing popularity of LiDAR sensors, perception algorithms using 3D LiDAR data struggle with the 'sensor-bias problem'. Specifically, the performance of perception algorithms significantly drops when an unseen specification of LiDAR sensor is applied at test time due to the domain discrepancy. This paper presents a fast and flexible LiDAR augmentation method for the semantic segmentation task, called 'LiDomAug'. It aggregates raw LiDAR scans and creates a LiDAR scan of any configurations with the consideration of dynamic distortion and occlusion, resulting in instant domain augmentation. Our on-demand augmentation module runs at 330 FPS, so it can be seamlessly integrated into the data loader in the learning framework. In our experiments, learning-based approaches aided with the proposed LiDomAug are less affected by the sensor-bias issue and achieve new state-of-the-art domain adaptation performances on SemanticKITTI and nuScenes dataset without the use of the target domain data. We also present a sensor-agnostic model that faithfully works on the various LiDAR configurations. + + + + A Characteristic Function-Based Method for Bottom-Up Human Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Qu_A_Characteristic_Function-Based_Method_for_Bottom-Up_Human_Pose_Estimation_CVPR_2023_paper.pdf + Most recent methods formulate the task of human pose estimation as a heatmap estimation problem, and use the overall L2 loss computed from the entire heatmap to optimize the heatmap prediction. In this paper, we show that in bottom-up human pose estimation where each heatmap often contains multiple body joints, using the overall L2 loss to optimize the heatmap prediction may not be the optimal choice. This is because, minimizing the overall L2 loss cannot always lead the model to locate all the body joints across different sub-regions of the heatmap more accurately. To cope with this problem, from a novel perspective, we propose a new bottom-up human pose estimation method that optimizes the heatmap prediction via minimizing the distance between two characteristic functions respectively constructed from the predicted heatmap and the groundtruth heatmap. Our analysis presented in this paper indicates that the distance between these two characteristic functions is essentially the upper bound of the L2 losses w.r.t. sub-regions of the predicted heatmap. Therefore, via minimizing the distance between the two characteristic functions, we can optimize the model to provide a more accurate localization result for the body joints in different sub-regions of the predicted heatmap. 
We show the effectiveness of our proposed method through extensive experiments on the COCO dataset and the CrowdPose dataset. + + + + SceneTrilogy: On Human Scene-Sketch and Its Complementarity With Photo and Text + http://openaccess.thecvf.com//content/CVPR2023/papers/Chowdhury_SceneTrilogy_On_Human_Scene-Sketch_and_Its_Complementarity_With_Photo_and_CVPR_2023_paper.pdf + In this paper, we extend scene understanding to include that of human sketch. The result is a complete trilogy of scene representation from three diverse and complementary modalities -- sketch, photo, and text. Instead of learning a rigid three-way embedding and being done with it, we focus on learning a flexible joint embedding that fully supports the "optionality" that this complementarity brings. Our embedding supports optionality on two axes: (i) optionality across modalities -- use any combination of modalities as query for downstream tasks like retrieval, (ii) optionality across tasks -- simultaneously utilising the embedding for either discriminative (e.g., retrieval) or generative tasks (e.g., captioning). This provides flexibility to end-users by exploiting the best of each modality, therefore serving the very purpose behind our proposal of a trilogy in the first place. First, a combination of information-bottleneck and conditional invertible neural networks disentangles the modality-specific component from the modality-agnostic one in sketch, photo, and text. Second, the modality-agnostic instances from sketch, photo, and text are synergised using a modified cross-attention. Once learned, we show our embedding can accommodate a multitude of scene-related tasks, including those enabled for the first time by the inclusion of sketch, all without any task-specific modifications. Project Page: http://www.pinakinathc.me/scenetrilogy + + + + RefSR-NeRF: Towards High Fidelity and Super Resolution View Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_RefSR-NeRF_Towards_High_Fidelity_and_Super_Resolution_View_Synthesis_CVPR_2023_paper.pdf + We present Reference-guided Super-Resolution Neural Radiance Field (RefSR-NeRF) that extends NeRF to super resolution and photorealistic novel view synthesis. Despite NeRF's extraordinary success in the neural rendering field, it suffers from blur in high resolution rendering because its inherent multilayer perceptron struggles to learn high frequency details and incurs a computational explosion as resolution increases. Therefore, we propose RefSR-NeRF, an end-to-end framework that first learns a low resolution NeRF representation, and then reconstructs the high frequency details with the help of a high resolution reference image. We observe that simply introducing pre-trained models from the literature tends to produce unsatisfactory artifacts due to the divergence in the degradation model. To this end, we design a novel lightweight RefSR model to learn the inverse degradation process from NeRF renderings to target HR ones. Extensive experiments on multiple benchmarks demonstrate that our method exhibits an impressive trade-off among rendering quality, speed, and memory usage, outperforming or on par with NeRF and its variants while achieving a 52x speedup with minor extra memory usage. 
+ + + + Polarimetric iToF: Measuring High-Fidelity Depth Through Scattering Media + http://openaccess.thecvf.com//content/CVPR2023/papers/Jeon_Polarimetric_iToF_Measuring_High-Fidelity_Depth_Through_Scattering_Media_CVPR_2023_paper.pdf + Indirect time-of-flight (iToF) imaging allows us to capture dense depth information at a low cost. However, iToF imaging often suffers from multipath interference (MPI) artifacts in the presence of scattering media, resulting in severe depth-accuracy degradation. For instance, iToF cameras cannot measure depth accurately through fog because ToF active illumination scatters back to the sensor before reaching the farther target surface. In this work, we propose a polarimetric iToF imaging method that can capture depth information robustly through scattering media. Our observations on the principle of indirect ToF imaging and polarization of light allow us to formulate a novel computational model of scattering-aware polarimetric phase measurements that enables us to correct MPI errors. We first devise a scattering-aware polarimetric iToF model that can estimate the phase of unpolarized backscattered light. We then combine the optical filtering of polarization and our computational modeling of unpolarized backscattered light via scattering analysis of phase and amplitude. This allows us to tackle the MPI problem by estimating the scattering energy through the participating media. We validate our method on an experimental setup using a customized off-the-shelf iToF camera. Our method outperforms baseline methods by a significant margin by means of our scattering model and polarimetric phase measurements. + + + + Mobile User Interface Element Detection via Adaptively Prompt Tuning + http://openaccess.thecvf.com//content/CVPR2023/papers/Gu_Mobile_User_Interface_Element_Detection_via_Adaptively_Prompt_Tuning_CVPR_2023_paper.pdf + Recent object detection approaches rely on pretrained vision-language models for image-text alignment. However, they fail to detect Mobile User Interface (MUI) elements since these contain additional OCR information, which describes their content and function but is often ignored. In this paper, we develop a new MUI element detection dataset named MUI-zh and propose an Adaptively Prompt Tuning (APT) module to take advantage of discriminating OCR information. APT is a lightweight and effective module to jointly optimize category prompts across different modalities. For every element, APT uniformly encodes its visual features and OCR descriptions to dynamically adjust the representation of frozen category prompts. We evaluate the effectiveness of our plug-and-play APT upon several existing CLIP-based detectors for both standard and open-vocabulary MUI element detection. Extensive experiments show that our method achieves considerable improvements on two datasets. The dataset is available at github.com/antmachineintelligence/MUI-zh. + + + + Sparse Multi-Modal Graph Transformer With Shared-Context Processing for Representation Learning of Giga-Pixel Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Nakhli_Sparse_Multi-Modal_Graph_Transformer_With_Shared-Context_Processing_for_Representation_Learning_CVPR_2023_paper.pdf + Processing giga-pixel whole slide histopathology images (WSI) is a computationally expensive task. Multiple instance learning (MIL) has become the conventional approach to process WSIs, in which these images are split into smaller patches for further processing. 
However, MIL-based techniques ignore explicit information about the individual cells within a patch. In this paper, by defining the novel concept of shared-context processing, we designed a multi-modal Graph Transformer that uses the cellular graph within the tissue to provide a single representation for a patient while taking advantage of the hierarchical structure of the tissue, enabling a dynamic focus between cell-level and tissue-level information. We benchmarked the performance of our model against multiple state-of-the-art methods in survival prediction and showed that ours can significantly outperform all of them including hierarchical vision Transformer (ViT). More importantly, we show that our model is strongly robust to missing information to an extent that it can achieve the same performance with as little as 20% of the data. Finally, in two different cancer datasets, we demonstrated that our model was able to stratify the patients into low-risk and high-risk groups while other state-of-the-art methods failed to achieve this goal. We also publish a large dataset of immunohistochemistry (IHC) images containing 1,600 tissue microarray (TMA) cores from 188 patients along with their survival information, making it one of the largest publicly available datasets in this context. + + + + Generating Human Motion From Textual Descriptions With Discrete Representations + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Generating_Human_Motion_From_Textual_Descriptions_With_Discrete_Representations_CVPR_2023_paper.pdf + In this work, we investigate a simple and well-known conditional generative framework based on Vector Quantised-Variational AutoEncoder (VQ-VAE) and Generative Pre-trained Transformer (GPT) for human motion generation from textual descriptions. We show that a simple CNN-based VQ-VAE with commonly used training recipes (EMA and Code Reset) allows us to obtain high-quality discrete representations. For GPT, we incorporate a simple corruption strategy during training to alleviate the training-testing discrepancy. Despite its simplicity, our T2M-GPT shows better performance than competitive approaches, including recent diffusion-based approaches. For example, on HumanML3D, which is currently the largest dataset, we achieve comparable performance on the consistency between text and generated motion (R-Precision), but with an FID of 0.116, largely outperforming MotionDiffuse at 0.630. Additionally, we conduct analyses on HumanML3D and observe that the dataset size is a limitation of our approach. Our work suggests that VQ-VAE still remains a competitive approach for human motion generation. Our implementation is available on the project page: https://mael-zys.github.io/T2M-GPT/ + + + + Spatial-Temporal Concept Based Explanation of 3D ConvNets + http://openaccess.thecvf.com//content/CVPR2023/papers/Ji_Spatial-Temporal_Concept_Based_Explanation_of_3D_ConvNets_CVPR_2023_paper.pdf + Convolutional neural networks (CNNs) have shown remarkable performance on various tasks. Despite their widespread adoption, the decision procedure of such networks still lacks transparency and interpretability, making it difficult to enhance the performance further. Hence, there has been considerable interest in providing explanation and interpretability for CNNs over the last few years. Explainable artificial intelligence (XAI) investigates the relationship between input images or videos and output predictions. 
Recent studies have achieved outstanding success in explaining 2D image classification ConvNets. On the other hand, due to the high computation cost and complexity of video data, the explanation of 3D video recognition ConvNets is relatively less studied, and none of the existing methods is able to produce a high-level explanation. In this paper, we propose an STCE (Spatial-temporal Concept-based Explanation) framework for interpreting 3D ConvNets. In our approach: (1) videos are represented with high-level supervoxels, and similar supervoxels are clustered into a concept, which is straightforward for humans to understand; and (2) the interpreting framework calculates a score for each concept, which reflects its significance in the ConvNet decision procedure. Experiments on diverse 3D ConvNets demonstrate that our method can identify global concepts with different importance levels, allowing us to investigate the impact of the concepts on a target task, such as action recognition, in depth. The source codes are publicly available at https://github.com/yingji425/STCE. + + + + Robust Test-Time Adaptation in Dynamic Scenarios + http://openaccess.thecvf.com//content/CVPR2023/papers/Yuan_Robust_Test-Time_Adaptation_in_Dynamic_Scenarios_CVPR_2023_paper.pdf + Test-time adaptation (TTA) aims to adapt the pretrained model to test distributions with only unlabeled test data streams. Most of the previous TTA methods have achieved great success on simple test data streams such as independently sampled data from single or multiple distributions. However, these attempts may fail in dynamic scenarios of real-world applications like autonomous driving, where the environments gradually change and the test data is sampled correlatively over time. In this work, we explore such practical test data streams to deploy the model on the fly, namely practical test-time adaptation (PTTA). To do so, we elaborate a Robust Test-Time Adaptation (RoTTA) method against the complex data stream in PTTA. More specifically, we present a robust batch normalization scheme to estimate the normalization statistics. Meanwhile, a memory bank is utilized to sample category-balanced data with consideration of timeliness and uncertainty. Further, to stabilize the training procedure, we develop a time-aware reweighting strategy with a teacher-student model. Extensive experiments prove that RoTTA enables continual test-time adaptation on the correlatively sampled data streams. Our method is easy to implement, making it a good choice for rapid deployment. The code is publicly available at https://github.com/BIT-DA/RoTTA + + + + Global and Local Mixture Consistency Cumulative Learning for Long-Tailed Visual Recognitions + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Global_and_Local_Mixture_Consistency_Cumulative_Learning_for_Long-Tailed_Visual_CVPR_2023_paper.pdf + In this paper, our goal is to design a simple learning paradigm for long-tail visual recognition, which not only improves the robustness of the feature extractor but also alleviates the bias of the classifier towards head classes while reducing training tricks and overhead. We propose an efficient one-stage training strategy for long-tailed visual recognition called Global and Local Mixture Consistency cumulative learning (GLMC). Our core ideas are twofold: (1) a global and local mixture consistency loss improves the robustness of the feature extractor. 
Specifically, we generate two augmented batches by the global MixUp and local CutMix from the same batch data, respectively, and then use cosine similarity to minimize the difference. (2) A cumulative head-tail soft label reweighted loss mitigates the head class bias problem. We use empirical class frequencies to reweight the mixed label of the head-tail class for long-tailed data and then balance the conventional loss and the rebalanced loss with a coefficient accumulated by epochs. Our approach achieves state-of-the-art accuracy on CIFAR10-LT, CIFAR100-LT, and ImageNet-LT datasets. Additional experiments on balanced ImageNet and CIFAR demonstrate that GLMC can significantly improve the generalization of backbones. Code is made publicly available at https://github.com/ynu-yangpeng/GLMC + + + + NIRVANA: Neural Implicit Representations of Videos With Adaptive Networks and Autoregressive Patch-Wise Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Maiya_NIRVANA_Neural_Implicit_Representations_of_Videos_With_Adaptive_Networks_and_CVPR_2023_paper.pdf + Implicit Neural Representations (INR) have recently shown to be a powerful tool for high-quality video compression. However, existing works are limited as they do not explicitly exploit the temporal redundancy in videos, leading to a long encoding time. Additionally, these methods have fixed architectures which do not scale to longer videos or higher resolutions. To address these issues, we propose NIRVANA, which treats videos as groups of frames and fits separate networks to each group performing patch-wise prediction. This design shares computation within each group, in the spatial and temporal dimensions, resulting in reduced encoding time of the video. The video representation is modeled autoregressively, with networks fit on a current group initialized using weights from the previous group's model. To further enhance efficiency, we perform quantization of the network parameters during training, requiring no post-hoc pruning or quantization. When compared with previous works on the benchmark UVG dataset, NIRVANA improves encoding quality from 37.36 to 37.70 (in terms of PSNR) and the encoding speed by 12x, while maintaining the same compression rate. In contrast to prior video INR works which struggle with larger resolution and longer videos, we show that our algorithm is highly flexible and scales naturally due to its patch-wise and autoregressive designs. Moreover, our method achieves variable bitrate compression by adapting to videos with varying inter-frame motion. NIRVANA achieves 6x decoding speed and scales well with more GPUs, making it practical for various deployment scenarios. + + + + Collaboration Helps Camera Overtake LiDAR in 3D Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Collaboration_Helps_Camera_Overtake_LiDAR_in_3D_Detection_CVPR_2023_paper.pdf + Camera-only 3D detection provides an economical solution with a simple configuration for localizing objects in 3D space compared to LiDAR-based detection systems. However, a major challenge lies in precise depth estimation due to the lack of direct 3D measurements in the input. Many previous methods attempt to improve depth estimation through network designs, e.g., deformable layers and larger receptive fields. This work proposes an orthogonal direction, improving the camera-only 3D detection by introducing multi-agent collaborations. 
Our proposed collaborative camera-only 3D detection (CoCa3D) enables agents to share complementary information with each other through communication. Meanwhile, we optimize communication efficiency by selecting the most informative cues. The shared messages from multiple viewpoints disambiguate the single-agent estimated depth and complement the occluded and long-range regions in the single-agent view. We evaluate CoCa3D in one real-world dataset and two new simulation datasets. Results show that CoCa3D improves previous SOTA performances by 44.21% on DAIR-V2X, 30.60% on OPV2V+, 12.59% on CoPerception-UAVs+ for AP@70. Our preliminary results show a potential that with sufficient collaboration, the camera might overtake LiDAR in some practical scenarios. We released the dataset and code at https://siheng-chen.github.io/dataset/CoPerception+ and https://github.com/MediaBrain-SJTU/CoCa3D. + + + + ReCo: Region-Controlled Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_ReCo_Region-Controlled_Text-to-Image_Generation_CVPR_2023_paper.pdf + Recently, large-scale text-to-image (T2I) models have shown impressive performance in generating high-fidelity images, but with limited controllability, e.g., precisely specifying the content in a specific region with a free-form text description. In this paper, we propose an effective technique for such regional control in T2I generation. We augment T2I models' inputs with an extra set of position tokens, which represent the quantized spatial coordinates. Each region is specified by four position tokens to represent the top-left and bottom-right corners, followed by an open-ended natural language regional description. Then, we fine-tune a pre-trained T2I model with such new input interface. Our model, dubbed as ReCo (Region-Controlled T2I), enables the region control for arbitrary objects described by open-ended regional texts rather than by object labels from a constrained category set. Empirically, ReCo achieves better image quality than the T2I model strengthened by positional words (FID: 8.82 -> 7.36, SceneFID: 15.54 -> 6.51 on COCO), together with objects being more accurately placed, amounting to a 20.40% region classification accuracy improvement on COCO. Furthermore, we demonstrate that ReCo can better control the object count, spatial relationship, and region attributes such as color/size, with the free-form regional description. Human evaluation on PaintSkill shows that ReCo is +19.28% and +17.21% more accurate in generating images with correct object count and spatial relationship than the T2I model. + + + + Fix the Noise: Disentangling Source Feature for Controllable Domain Translation + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Fix_the_Noise_Disentangling_Source_Feature_for_Controllable_Domain_Translation_CVPR_2023_paper.pdf + Recent studies show strong generative performance in domain translation especially by using transfer learning techniques on the unconditional generator. However, the control between different domain features using a single model is still challenging. Existing methods often require additional models, which is computationally demanding and leads to unsatisfactory visual quality. In addition, they have restricted control steps, which prevents a smooth transition. In this paper, we propose a new approach for high-quality domain translation with better controllability. The key idea is to preserve source features within a disentangled subspace of a target feature space. 
This allows our method to smoothly control the degree to which it preserves source features while generating images from an entirely new domain using only a single model. Our extensive experiments show that the proposed method can produce more consistent and realistic images than previous works and maintain precise controllability over different levels of transformation. The code is available at LeeDongYeun/FixNoise. + + + + Sparsely Annotated Semantic Segmentation With Adaptive Gaussian Mixtures + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Sparsely_Annotated_Semantic_Segmentation_With_Adaptive_Gaussian_Mixtures_CVPR_2023_paper.pdf + Sparsely annotated semantic segmentation (SASS) aims to learn a segmentation model from images with sparse labels (i.e., points or scribbles). Existing methods mainly focus on introducing low-level affinity or generating pseudo labels to strengthen supervision, while largely ignoring the inherent relation between labeled and unlabeled pixels. In this paper, we observe that pixels that are close to each other in the feature space are more likely to share the same class. Inspired by this, we propose a novel SASS framework, which is equipped with an Adaptive Gaussian Mixture Model (AGMM). Our AGMM can effectively provide reliable supervision for unlabeled pixels based on the distributions of labeled and unlabeled pixels. Specifically, we first build Gaussian mixtures using labeled pixels and their relatively similar unlabeled pixels, where the labeled pixels act as centroids, for modeling the feature distribution of each class. Then, we leverage the reliable information from labeled pixels and adaptively generated GMM predictions to supervise the training of unlabeled pixels, achieving online, dynamic, and robust self-supervision. In addition, by capturing category-wise Gaussian mixtures, AGMM encourages the model to learn discriminative class decision boundaries in an end-to-end contrastive learning manner. Experimental results conducted on the PASCAL VOC 2012 and Cityscapes datasets demonstrate that our AGMM can establish new state-of-the-art SASS performance. Code is available at https://github.com/Luffy03/AGMM-SASS. + + + + Diversity-Aware Meta Visual Prompting + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Diversity-Aware_Meta_Visual_Prompting_CVPR_2023_paper.pdf + We present Diversity-Aware Meta Visual Prompting (DAM-VP), an efficient and effective prompting method for transferring pre-trained models to downstream tasks with a frozen backbone. A challenging issue in visual prompting is that image datasets sometimes have large data diversity, whereas a per-dataset generic prompt can hardly handle the complex distribution shift toward the original pretraining data distribution properly. To address this issue, we propose a dataset Diversity-Aware prompting strategy whose initialization is realized by a Meta-prompt. Specifically, we cluster the downstream dataset into small homogeneous subsets in a diversity-adaptive way, with each subset having its own prompt optimized separately. Such a divide-and-conquer design reduces the optimization difficulty greatly and significantly boosts the prompting performance. Furthermore, all the prompts are initialized with a meta-prompt, which is learned across several datasets. It is a bootstrapped paradigm, with the key observation that the prompting knowledge learned from previous datasets could help the prompt to converge faster and perform better on a new dataset. 
During inference, we dynamically select a proper prompt for each input, based on the feature distance between the input and each subset. Through extensive experiments, our DAM-VP demonstrates superior efficiency and effectiveness, clearly surpassing previous prompting methods in a series of downstream datasets for different pretraining models. Our code is available at: https://github.com/shikiw/DAM-VP. + + + + FaceLit: Neural 3D Relightable Faces + http://openaccess.thecvf.com//content/CVPR2023/papers/Ranjan_FaceLit_Neural_3D_Relightable_Faces_CVPR_2023_paper.pdf + We propose a generative framework, FaceLit, capable of generating a 3D face that can be rendered at various user-defined lighting conditions and views, learned purely from 2D images in-the-wild without any manual annotation. Unlike existing works that require careful capture setup or human labor, we rely on off-the-shelf pose and illumination estimators. With these estimates, we incorporate the Phong reflectance model in the neural volume rendering framework. Our model learns to generate shape and material properties of a face such that, when rendered according to the natural statistics of pose and illumination, produces photorealistic face images with multiview 3D and illumination consistency. Our method enables photorealistic generation of faces with explicit illumination and view controls on multiple datasets -- FFHQ, MetFaces and CelebA-HQ. We show state-of-the-art photorealism among 3D aware GANs on FFHQ dataset achieving an FID score of 3.5. + + + + Visual Programming: Compositional Visual Reasoning Without Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Gupta_Visual_Programming_Compositional_Visual_Reasoning_Without_Training_CVPR_2023_paper.pdf + We present VISPROG, a neuro-symbolic approach to solving complex and compositional visual tasks given natural language instructions. VISPROG avoids the need for any task-specific training. Instead, it uses the in-context learning ability of large language models to generate python-like modular programs, which are then executed to get both the solution and a comprehensive and interpretable rationale. Each line of the generated program may invoke one of several off-the-shelf computer vision models, image processing routines, or python functions to produce intermediate outputs that may be consumed by subsequent parts of the program. We demonstrate the flexibility of VISPROG on 4 diverse tasks - compositional visual question answering, zero-shot reasoning on image pairs, factual knowledge object tagging, and language-guided image editing. We believe neuro-symbolic approaches like VISPROG are an exciting avenue to easily and effectively expand the scope of AI systems to serve the long tail of complex tasks that people may wish to perform. + + + + Real-Time Evaluation in Online Continual Learning: A New Hope + http://openaccess.thecvf.com//content/CVPR2023/papers/Ghunaim_Real-Time_Evaluation_in_Online_Continual_Learning_A_New_Hope_CVPR_2023_paper.pdf + Current evaluations of Continual Learning (CL) methods typically assume that there is no constraint on training time and computation. This is an unrealistic assumption for any real-world setting, which motivates us to propose: a practical real-time evaluation of continual learning, in which the stream does not wait for the model to complete training before revealing the next data for predictions. To do this, we evaluate current CL methods with respect to their computational costs. 
We conduct extensive experiments on CLOC, a large-scale dataset containing 39 million time-stamped images with geolocation labels. We show that a simple baseline outperforms state-of-the-art CL methods under this evaluation, questioning the applicability of existing methods in realistic settings. In addition, we explore various CL components commonly used in the literature, including memory sampling strategies and regularization approaches. We find that all considered methods fail to be competitive against our simple baseline. This surprisingly suggests that the majority of existing CL literature is tailored to a specific class of streams that is not practical. We hope that the evaluation we provide will be the first step towards a paradigm shift to consider the computational cost in the development of online continual learning methods. + + + + BAAM: Monocular 3D Pose and Shape Reconstruction With Bi-Contextual Attention Module and Attention-Guided Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_BAAM_Monocular_3D_Pose_and_Shape_Reconstruction_With_Bi-Contextual_Attention_CVPR_2023_paper.pdf + 3D traffic scene comprises various 3D information about car objects, including their pose and shape. However, most recent studies pay relatively less attention to reconstructing detailed shapes. Furthermore, most of them treat each 3D object as an independent one, resulting in losses of relative context inter-objects and scene context reflecting road circumstances. A novel monocular 3D pose and shape reconstruction algorithm, based on bi-contextual attention and attention-guided modeling (BAAM), is proposed in this work. First, given 2D primitives, we reconstruct 3D object shape based on attention-guided modeling that considers the relevance between detected objects and vehicle shape priors. Next, we estimate 3D object pose through bi-contextual attention, which leverages relation-context inter objects and scene-context between an object and road environment. Finally, we propose a 3D non maximum suppression algorithm to eliminate spurious objects based on their Bird-Eye-View distance. Extensive experiments demonstrate that the proposed BAAM yields state-of-the-art performance on ApolloCar3D. Also, they show that the proposed BAAM can be plugged into any mature monocular 3D object detector on KITTI and significantly boost their performance. + + + + Freestyle Layout-to-Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Xue_Freestyle_Layout-to-Image_Synthesis_CVPR_2023_paper.pdf + Typical layout-to-image synthesis (LIS) models generate images for a closed set of semantic classes, e.g., 182 common objects in COCO-Stuff. In this work, we explore the freestyle capability of the model, i.e., how far can it generate unseen semantics (e.g., classes, attributes, and styles) onto a given layout, and call the task Freestyle LIS (FLIS). Thanks to the development of large-scale pre-trained language-image models, a number of discriminative models (e.g., image classification and object detection) trained on limited base classes are empowered with the ability of unseen class prediction. Inspired by this, we opt to leverage large-scale pre-trained text-to-image diffusion models to achieve the generation of unseen semantics. The key challenge of FLIS is how to enable the diffusion model to synthesize images from a specific layout which very likely violates its pre-learned knowledge, e.g., the model never sees "a unicorn sitting on a bench" during its pre-training. 
To this end, we introduce a new module called Rectified Cross-Attention (RCA) that can be conveniently plugged in the diffusion model to integrate semantic masks. This "plug-in" is applied in each cross-attention layer of the model to rectify the attention maps between image and text tokens. The key idea of RCA is to enforce each text token to act on the pixels in a specified region, allowing us to freely put a wide variety of semantics from pre-trained knowledge (which is general) onto the given layout (which is specific). Extensive experiments show that the proposed diffusion network produces realistic and freestyle layout-to-image generation results with diverse text inputs, which has a high potential to spawn a bunch of interesting applications. Code is available at https://github.com/essunny310/FreestyleNet. + + + + Visual Dependency Transformers: Dependency Tree Emerges From Reversed Attention + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_Visual_Dependency_Transformers_Dependency_Tree_Emerges_From_Reversed_Attention_CVPR_2023_paper.pdf + Humans possess a versatile mechanism for extracting structured representations of our visual world. When looking at an image, we can decompose the scene into entities and their parts as well as obtain the dependencies between them. To mimic such capability, we propose Visual Dependency Transformers (DependencyViT) that can induce visual dependencies without any labels. We achieve that with a novel neural operator called reversed attention that can naturally capture long-range visual dependencies between image patches. Specifically, we formulate it as a dependency graph where a child token in reversed attention is trained to attend to its parent tokens and send information following a normalized probability distribution rather than gathering information in conventional self-attention. With such a design, hierarchies naturally emerge from reversed attention layers, and a dependency tree is progressively induced from leaf nodes to the root node unsupervisedly. DependencyViT offers several appealing benefits. (i) Entities and their parts in an image are represented by different subtrees, enabling part partitioning from dependencies; (ii) Dynamic visual pooling is made possible. The leaf nodes which rarely send messages can be pruned without hindering the model performance, based on which we propose the lightweight DependencyViT-Lite to reduce the computational and memory footprints; (iii) DependencyViT works well on both self- and weakly-supervised pretraining paradigms on ImageNet, and demonstrates its effectiveness on 8 datasets and 5 tasks, such as unsupervised part and saliency segmentation, recognition, and detection. + + + + Differentiable Architecture Search With Random Features + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Differentiable_Architecture_Search_With_Random_Features_CVPR_2023_paper.pdf + Differentiable architecture search (DARTS) has significantly promoted the development of NAS techniques because of its high search efficiency and effectiveness but suffers from performance collapse. In this paper, we make efforts to alleviate the performance collapse problem for DARTS from two aspects. First, we investigate the expressive power of the supernet in DARTS and then derive a new setup of DARTS paradigm with only training BatchNorm. 
Second, we theoretically find that random features dilute the auxiliary connection role of skip-connection in supernet optimization and enable the search algorithm to focus on fairer operation selection, thereby solving the performance collapse problem. We instantiate DARTS and PC-DARTS with random features to build an improved version of each, named RF-DARTS and RF-PCDARTS respectively. Experimental results show that RF-DARTS obtains 94.36% test accuracy on CIFAR-10 (which is the nearest optimal result in NAS-Bench-201), and achieves a new state-of-the-art top-1 test error of 24.0% on ImageNet when transferring from CIFAR-10. Moreover, RF-DARTS performs robustly across three datasets (CIFAR-10, CIFAR-100, and SVHN) and four search spaces (S1-S4). Besides, RF-PCDARTS achieves even better results on ImageNet, that is, 23.9% top-1 and 7.1% top-5 test error, surpassing representative methods like single-path, training-free, and partial-channel paradigms directly searched on ImageNet. + + + + Enhanced Stable View Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Jain_Enhanced_Stable_View_Synthesis_CVPR_2023_paper.pdf + We introduce an approach to enhance novel view synthesis from images taken from a freely moving camera. The introduced approach focuses on outdoor scenes where recovering an accurate geometric scaffold and camera pose is challenging, leading to inferior results using the state-of-the-art stable view synthesis (SVS) method. SVS and related methods fail for outdoor scenes primarily due to (i) over-relying on the multiview stereo (MVS) for geometric scaffold recovery and (ii) assuming COLMAP computed camera poses as the best possible estimates, despite it being well-studied that MVS 3D reconstruction accuracy is limited by scene disparity and camera-pose accuracy is sensitive to key-point correspondence selection. This work proposes a principled way to enhance novel view synthesis solutions drawing inspiration from the basics of multiple view geometry. By leveraging the complementary behavior of MVS and monocular depth, we arrive at a better scene depth per view for nearby and far points, respectively. Moreover, our approach jointly refines camera poses with image-based rendering via multiple rotation averaging graph optimization. The recovered scene depth and camera poses help better view-dependent on-surface feature aggregation of the entire scene. Extensive evaluation of our approach on popular benchmark datasets, such as Tanks and Temples, shows substantial improvement in view synthesis results compared to the prior art. For instance, our method shows 1.5 dB of PSNR improvement on Tanks and Temples. Similar statistics are observed when tested on other benchmark datasets such as FVS, Mip-NeRF 360, and DTU. + + + + Breaching FedMD: Image Recovery via Paired-Logits Inversion Attack + http://openaccess.thecvf.com//content/CVPR2023/papers/Takahashi_Breaching_FedMD_Image_Recovery_via_Paired-Logits_Inversion_Attack_CVPR_2023_paper.pdf + Federated Learning with Model Distillation (FedMD) is a nascent collaborative learning paradigm, where only output logits of public datasets are transmitted as distilled knowledge, instead of passing on private model parameters that are susceptible to gradient inversion attacks, a known privacy risk in federated learning. 
In this paper, we find that even though sharing output logits of public datasets is safer than directly sharing gradients, there still exists a substantial risk of data exposure caused by carefully designed malicious attacks. Our study shows that a malicious server can inject a PLI (Paired-Logits Inversion) attack against FedMD and its variants by training an inversion neural network that exploits the confidence gap between the server and client models. Experiments on multiple facial recognition datasets validate that under FedMD-like schemes, by using paired server-client logits of public datasets only, the malicious server is able to reconstruct private images on all tested benchmarks with a high success rate. + + + + Biomechanics-Guided Facial Action Unit Detection Through Force Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Cui_Biomechanics-Guided_Facial_Action_Unit_Detection_Through_Force_Modeling_CVPR_2023_paper.pdf + Existing AU detection algorithms are mainly based on appearance information extracted from 2D images, and well-established facial biomechanics that governs 3D facial skin deformation is rarely considered. In this paper, we propose a biomechanics-guided AU detection approach, where facial muscle activation forces are modelled and employed to predict AU activation. Specifically, our model consists of two branches: a 3D physics branch and a 2D image branch. In the 3D physics branch, we first derive the Euler-Lagrange equation governing facial deformation. The Euler-Lagrange equation, represented as an ordinary differential equation (ODE), is embedded into a differentiable ODE solver. Muscle activation forces, together with other physics parameters, are first regressed and then utilized to simulate 3D deformation by solving the ODE. By leveraging facial biomechanics, we obtain physically plausible facial muscle activation forces. The 2D image branch compensates for the 3D physics branch by employing additional appearance information from 2D images. Both estimated forces and appearance features are employed for AU detection. The proposed approach achieves competitive AU detection performance on two benchmark datasets. Furthermore, by leveraging biomechanics, our approach achieves outstanding performance with reduced training data. + + + + Equiangular Basis Vectors + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_Equiangular_Basis_Vectors_CVPR_2023_paper.pdf + We propose Equiangular Basis Vectors (EBVs) for classification tasks. In deep neural networks, models usually end with a k-way fully connected layer with softmax to handle different classification tasks. The learning objective of these methods can be summarized as mapping the learned feature representations to the samples' label space. In metric learning approaches, by contrast, the main objective is to learn a transformation function that maps training data points from the original space to a new space where similar points are closer while dissimilar points become farther apart. Different from previous methods, our EBVs generate normalized vector embeddings as "predefined classifiers", which are required not only to have equal status with each other, but also to be as orthogonal as possible. By minimizing the spherical distance between the embedding of an input and its categorical EBV in training, the predictions can be obtained by identifying the categorical EBV with the smallest distance during inference. 
Various experiments on the ImageNet-1K dataset and other downstream tasks demonstrate that our method outperforms the general fully connected classifier while it does not introduce huge additional computation compared with classical metric learning methods. Our EBVs won the first place in the 2022 DIGIX Global AI Challenge, and our code is open-source and available at https://github.com/NJUST-VIPGroup/Equiangular-Basis-Vectors. + + + + Cross-Guided Optimization of Radiance Fields With Multi-View Image Super-Resolution for High-Resolution Novel View Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Yoon_Cross-Guided_Optimization_of_Radiance_Fields_With_Multi-View_Image_Super-Resolution_for_CVPR_2023_paper.pdf + Novel View Synthesis (NVS) aims at synthesizing an image from an arbitrary viewpoint using multi-view images and camera poses. Among the methods for NVS, Neural Radiance Fields (NeRF) is capable of NVS for an arbitrary resolution as it learns a continuous volumetric representation. However, radiance fields rely heavily on the spectral characteristics of coordinate-based networks. Thus, there is a limit to improving the performance of high-resolution novel view synthesis (HRNVS). To solve this problem, we propose a novel framework using cross-guided optimization of the single-image super-resolution (SISR) and radiance fields. We perform multi-view image super-resolution (MVSR) on train-view images during the radiance fields optimization process. It derives the updated SR result by fusing the feature map obtained from SISR and voxel-based uncertainty fields generated by integrated errors of train-view images. By repeating the updates during radiance fields optimization, train-view images for radiance fields optimization have multi-view consistency and high-frequency details simultaneously, ultimately improving the performance of HRNVS. Experiments of HRNVS and MVSR on various benchmark datasets show that the proposed method significantly surpasses existing methods. + + + + Unified Pose Sequence Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Foo_Unified_Pose_Sequence_Modeling_CVPR_2023_paper.pdf + We propose a Unified Pose Sequence Modeling approach to unify heterogeneous human behavior understanding tasks based on pose data, e.g., action recognition, 3D pose estimation and 3D early action prediction. A major obstacle is that different pose-based tasks require different output data formats. Specifically, the action recognition and prediction tasks require class predictions as outputs, while 3D pose estimation requires a human pose output, which limits existing methods to leverage task-specific network architectures for each task. Hence, in this paper, we propose a novel Unified Pose Sequence (UPS) model to unify heterogeneous output formats for the aforementioned tasks by considering text-based action labels and coordinate-based human poses as language sequences. Then, by optimizing a single auto-regressive transformer, we can obtain a unified output sequence that can handle all the aforementioned tasks. Moreover, to avoid the interference brought by the heterogeneity between different tasks, a dynamic routing mechanism is also proposed to empower our UPS with the ability to learn which subsets of parameters should be shared among different tasks. To evaluate the efficacy of the proposed UPS, extensive experiments are conducted on four different tasks with four popular behavior understanding benchmarks. 
+ + + + Probability-Based Global Cross-Modal Upsampling for Pansharpening + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Probability-Based_Global_Cross-Modal_Upsampling_for_Pansharpening_CVPR_2023_paper.pdf + Pansharpening is an essential preprocessing step for remote sensing image processing. Although deep learning (DL) approaches performed well on this task, current upsampling methods used in these approaches only utilize the local information of each pixel in the low-resolution multispectral (LRMS) image while neglecting to exploit its global information as well as the cross-modal information of the guiding panchromatic (PAN) image, which limits their performance improvement. To address this issue, this paper develops a novel probability-based global cross-modal upsampling (PGCU) method for pan-sharpening. Precisely, we first formulate the PGCU method from a probabilistic perspective and then design an efficient network module to implement it by fully utilizing the information mentioned above while simultaneously considering the channel specificity. The PGCU module consists of three blocks, i.e., information extraction (IE), distribution and expectation estimation (DEE), and fine adjustment (FA). Extensive experiments verify the superiority of the PGCU method compared with other popular upsampling methods. Additionally, experiments also show that the PGCU module can help improve the performance of existing SOTA deep learning pansharpening methods. The codes are available at https://github.com/Zeyu-Zhu/PGCU. + + + + FAC: 3D Representation Learning via Foreground Aware Feature Contrast + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_FAC_3D_Representation_Learning_via_Foreground_Aware_Feature_Contrast_CVPR_2023_paper.pdf + Contrastive learning has recently demonstrated great potential for unsupervised pre-training in 3D scene understanding tasks. However, most existing work randomly selects point features as anchors while building contrast, leading to a clear bias toward background points that often dominate in 3D scenes. Also, object awareness and foreground-to-background discrimination are neglected, making contrastive learning less effective. To tackle these issues, we propose a general foreground-aware feature contrast (FAC) framework to learn more effective point cloud representations in pre-training. FAC consists of two novel contrast designs to construct more effective and informative contrast pairs. The first is building positive pairs within the same foreground segment where points tend to have the same semantics. The second is that we prevent over-discrimination between 3D segments/objects and encourage foreground-to-background distinctions at the segment level with adaptive feature learning in a Siamese correspondence network, which adaptively learns feature correlations within and across point cloud views effectively. Visualization with point activation maps shows that our contrast pairs capture clear correspondences among foreground regions during pre-training. Quantitative experiments also show that FAC achieves superior knowledge transfer and data efficiency in various downstream 3D semantic segmentation and object detection tasks. All codes, data, and models are available at:https://github.com/KangchengLiu/FAC_Foreground_Aware_Contrast. 
+ + + + Improving Visual Representation Learning Through Perceptual Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Tukra_Improving_Visual_Representation_Learning_Through_Perceptual_Understanding_CVPR_2023_paper.pdf + We present an extension to masked autoencoders (MAE) which improves on the representations learnt by the model by explicitly encouraging the learning of higher scene-level features. We do this by: (i) the introduction of a perceptual similarity term between generated and real images (ii) incorporating several techniques from the adversarial training literature including multi-scale training and adaptive discriminator augmentation. The combination of these results in not only better pixel reconstruction but also representations which appear to capture better higher-level details within images. More consequentially, we show how our method, Perceptual MAE, leads to better performance when used for downstream tasks outperforming previous methods. We achieve 78.1% top-1 accuracy linear probing on ImageNet-1K and up to 88.1% when fine-tuning, with similar results for other downstream tasks, all without use of additional pre-trained models or data. + + + + Learning Bottleneck Concepts in Image Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Learning_Bottleneck_Concepts_in_Image_Classification_CVPR_2023_paper.pdf + Interpreting and explaining the behavior of deep neural networks is critical for many tasks. Explainable AI provides a way to address this challenge, mostly by providing per-pixel relevance to the decision. Yet, interpreting such explanations may require expert knowledge. Some recent attempts toward interpretability adopt a concept-based framework, giving a higher-level relationship between some concepts and model decisions. This paper proposes Bottleneck Concept Learner (BotCL), which represents an image solely by the presence/absence of concepts learned through training over the target task without explicit supervision over the concepts. It uses self-supervision and tailored regularizers so that learned concepts can be human-understandable. Using some image classification tasks as our testbed, we demonstrate BotCL's potential to rebuild neural networks for better interpretability. + + + + Inversion-Based Style Transfer With Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Inversion-Based_Style_Transfer_With_Diffusion_Models_CVPR_2023_paper.pdf + The artistic style within a painting is the means of expression, which includes not only the painting material, colors, and brushstrokes, but also the high-level attributes, including semantic elements and object shapes. Previous arbitrary example-guided artistic image generation methods often fail to control shape changes or convey elements. Pre-trained text-to-image synthesis diffusion probabilistic models have achieved remarkable quality but often require extensive textual descriptions to accurately portray the attributes of a particular painting.The uniqueness of an artwork lies in the fact that it cannot be adequately explained with normal language. Our key idea is to learn the artistic style directly from a single painting and then guide the synthesis without providing complex textual descriptions. 
Specifically, we perceive style as a learnable textual description of a painting.We propose an inversion-based style transfer method (InST), which can efficiently and accurately learn the key information of an image, thus capturing and transferring the artistic style of a painting. We demonstrate the quality and efficiency of our method on numerous paintings of various artists and styles. Codes are available at https://github.com/zyxElsa/InST. + + + + Learning Imbalanced Data With Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Learning_Imbalanced_Data_With_Vision_Transformers_CVPR_2023_paper.pdf + The real-world data tends to be heavily imbalanced and severely skew the data-driven deep neural networks, which makes Long-Tailed Recognition (LTR) a massive challenging task. Existing LTR methods seldom train Vision Transformers (ViTs) with Long-Tailed (LT) data, while the off-the-shelf pretrain weight of ViTs always leads to unfair comparisons. In this paper, we systematically investigate the ViTs' performance in LTR and propose LiVT to train ViTs from scratch only with LT data. With the observation that ViTs suffer more severe LTR problems, we conduct Masked Generative Pretraining (MGP) to learn generalized features. With ample and solid evidence, we show that MGP is more robust than supervised manners. Although Binary Cross Entropy (BCE) loss performs well with ViTs, it struggles on the LTR tasks. We further propose the balanced BCE to ameliorate it with strong theoretical groundings. Specially, we derive the unbiased extension of Sigmoid and compensate extra logit margins for deploying it. Our Bal-BCE contributes to the quick convergence of ViTs in just a few epochs. Extensive experiments demonstrate that with MGP and Bal-BCE, LiVT successfully trains ViTs well without any additional data and outperforms comparable state-of-the-art methods significantly, e.g., our ViT-B achieves 81.0% Top-1 accuracy in iNaturalist 2018 without bells and whistles. Code is available at https://github.com/XuZhengzhuo/LiVT. + + + + PHA: Patch-Wise High-Frequency Augmentation for Transformer-Based Person Re-Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_PHA_Patch-Wise_High-Frequency_Augmentation_for_Transformer-Based_Person_Re-Identification_CVPR_2023_paper.pdf + Although recent studies empirically show that injecting Convolutional Neural Networks (CNNs) into Vision Transformers (ViTs) can improve the performance of person re-identification, the rationale behind it remains elusive. From a frequency perspective, we reveal that ViTs perform worse than CNNs in preserving key high-frequency components (e.g, clothes texture details) since high-frequency components are inevitably diluted by low-frequency ones due to the intrinsic Self-Attention within ViTs. To remedy such inadequacy of the ViT, we propose a Patch-wise High-frequency Augmentation (PHA) method with two core designs. First, to enhance the feature representation ability of high-frequency components, we split patches with high-frequency components by the Discrete Haar Wavelet Transform, then empower the ViT to take the split patches as auxiliary input. Second, to prevent high-frequency components from being diluted by low-frequency ones when taking the entire sequence as input during network optimization, we propose a novel patch-wise contrastive loss. 
From the view of gradient optimization, it acts as an implicit augmentation to improve the representation ability of key high-frequency components. This benefits the ViT to capture key high-frequency components to extract discriminative person representations. PHA is necessary during training and can be removed during inference, without bringing extra complexity. Extensive experiments on widely-used ReID datasets validate the effectiveness of our method. + + + + Learning Instance-Level Representation for Large-Scale Multi-Modal Pretraining in E-Commerce + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_Learning_Instance-Level_Representation_for_Large-Scale_Multi-Modal_Pretraining_in_E-Commerce_CVPR_2023_paper.pdf + This paper aims to establish a generic multi-modal foundation model that has the scalable capability to massive downstream applications in E-commerce. Recently, large-scale vision-language pretraining approaches have achieved remarkable advances in the general domain. However, due to the significant differences between natural and product images, directly applying these frameworks for modeling image-level representations to E-commerce will be inevitably sub-optimal. To this end, we propose an instance-centric multi-modal pretraining paradigm called ECLIP in this work. In detail, we craft a decoder architecture that introduces a set of learnable instance queries to explicitly aggregate instance-level semantics. Moreover, to enable the model to focus on the desired product instance without reliance on expensive manual annotations, two specially configured pretext tasks are further proposed. Pretrained on the 100 million E-commerce-related data, ECLIP successfully extracts more generic, semantic-rich, and robust representations. Extensive experimental results show that, without further fine-tuning, ECLIP surpasses existing methods by a large margin on a broad range of downstream tasks, demonstrating the strong transferability to real-world E-commerce applications. + + + + Conditional Text Image Generation With Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Conditional_Text_Image_Generation_With_Diffusion_Models_CVPR_2023_paper.pdf + Current text recognition systems, including those for handwritten scripts and scene text, have relied heavily on image synthesis and augmentation, since it is difficult to realize real-world complexity and diversity through collecting and annotating enough real text images. In this paper, we explore the problem of text image generation, by taking advantage of the powerful abilities of Diffusion Models in generating photo-realistic and diverse image samples with given conditions, and propose a method called Conditional Text Image Generation with Diffusion Models (CTIG-DM for short). To conform to the characteristics of text images, we devise three conditions: image condition, text condition, and style condition, which can be used to control the attributes, contents, and styles of the samples in the image generation process. Specifically, four text image generation modes, namely: (1) synthesis mode, (2) augmentation mode, (3) recovery mode, and (4) imitation mode, can be derived by combining and configuring these three conditions. Extensive experiments on both handwritten and scene text demonstrate that the proposed CTIG-DM is able to produce image samples that simulate real-world complexity and diversity, and thus can boost the performance of existing text recognizers. 
Besides, CTIG-DM shows its appealing potential in domain adaptation and generating images containing Out-Of-Vocabulary (OOV) words. + + + + AnchorFormer: Point Cloud Completion From Discriminative Nodes + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_AnchorFormer_Point_Cloud_Completion_From_Discriminative_Nodes_CVPR_2023_paper.pdf + Point cloud completion aims to recover the completed 3D shape of an object from its partial observation. A common strategy is to encode the observed points to a global feature vector and then predict the complete points through a generative process on this vector. Nevertheless, the results may suffer from the high-quality shape generation problem due to the fact that a global feature vector cannot sufficiently characterize diverse patterns in one object. In this paper, we present a new shape completion architecture, namely AnchorFormer, that innovatively leverages pattern-aware discriminative nodes, i.e., anchors, to dynamically capture regional information of objects. Technically, AnchorFormer models the regional discrimination by learning a set of anchors based on the point features of the input partial observation. Such anchors are scattered to both observed and unobserved locations through estimating particular offsets, and form sparse points together with the down-sampled points of the input observation. To reconstruct the fine-grained object patterns, AnchorFormer further employs a modulation scheme to morph a canonical 2D grid at individual locations of the sparse points into a detailed 3D structure. Extensive experiments on the PCN, ShapeNet-55/34 and KITTI datasets quantitatively and qualitatively demonstrate the efficacy of AnchorFormer over the state-of-the-art point cloud completion approaches. Source code is available at https://github.com/chenzhik/AnchorFormer. + + + + Co-SLAM: Joint Coordinate and Sparse Parametric Encodings for Neural Real-Time SLAM + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Co-SLAM_Joint_Coordinate_and_Sparse_Parametric_Encodings_for_Neural_Real-Time_CVPR_2023_paper.pdf + We present Co-SLAM, a neural RGB-D SLAM system based on a hybrid representation, that performs robust camera tracking and high-fidelity surface reconstruction in real time. Co-SLAM represents the scene as a multi-resolution hash-grid to exploit its high convergence speed and ability to represent high-frequency local features. In addition, Co-SLAM incorporates one-blob encoding, to encourage surface coherence and completion in unobserved areas. This joint parametric-coordinate encoding enables real-time and robust performance by bringing the best of both worlds: fast convergence and surface hole filling. Moreover, our ray sampling strategy allows Co-SLAM to perform global bundle adjustment over all keyframes instead of requiring keyframe selection to maintain a small number of active keyframes as competing neural SLAM approaches do. Experimental results show that Co-SLAM runs at 10-17Hz and achieves state-of-the-art scene reconstruction results, and competitive tracking performance in various datasets and benchmarks (ScanNet, TUM, Replica, Synthetic RGBD). 
Project page: https://hengyiwang.github.io/projects/CoSLAM + + + + Regularization of Polynomial Networks for Image Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Chrysos_Regularization_of_Polynomial_Networks_for_Image_Recognition_CVPR_2023_paper.pdf + Deep Neural Networks (DNNs) have obtained impressive performance across tasks, however they still remain as black boxes, e.g., hard to theoretically analyze. At the same time, Polynomial Networks (PNs) have emerged as an alternative method with a promising performance and improved interpretability but have yet to reach the performance of the powerful DNN baselines. In this work, we aim to close this performance gap. We introduce a class of PNs, which are able to reach the performance of ResNet across a range of six benchmarks. We demonstrate that strong regularization is critical and conduct an extensive study of the exact regularization schemes required to match performance. To further motivate the regularization schemes, we introduce D-PolyNets that achieve a higher-degree of expansion than previously proposed polynomial networks. D-PolyNets are more parameter-efficient while achieving a similar performance as other polynomial networks. We expect that our new models can lead to an understanding of the role of elementwise activation functions (which are no longer required for training PNs). The source code is available at https://github.com/grigorisg9gr/regularized_polynomials. + + + + EfficientViT: Memory Efficient Vision Transformer With Cascaded Group Attention + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_EfficientViT_Memory_Efficient_Vision_Transformer_With_Cascaded_Group_Attention_CVPR_2023_paper.pdf + Vision transformers have shown great success due to their high model capabilities. However, their remarkable performance is accompanied by heavy computation costs, which makes them unsuitable for real-time applications. In this paper, we propose a family of high-speed vision transformers named EfficientViT. We find that the speed of existing transformer models is commonly bounded by memory inefficient operations, especially the tensor reshaping and element-wise functions in MHSA. Therefore, we design a new building block with a sandwich layout, i.e., using a single memory-bound MHSA between efficient FFN layers, which improves memory efficiency while enhancing channel communication. Moreover, we discover that the attention maps share high similarities across heads, leading to computational redundancy. To address this, we present a cascaded group attention module feeding attention heads with different splits of the full feature, which not only saves computation cost but also improves attention diversity. Comprehensive experiments demonstrate EfficientViT outperforms existing efficient models, striking a good trade-off between speed and accuracy. For instance, our EfficientViT-M5 surpasses MobileNetV3-Large by 1.9% in accuracy, while getting 40.4% and 45.2% higher throughput on Nvidia V100 GPU and Intel Xeon CPU, respectively. Compared to the recent efficient model MobileViT-XXS, EfficientViT-M2 achieves 1.8% superior accuracy, while running 5.8x/3.7x faster on the GPU/CPU, and 7.4x faster when converted to ONNX format. Code and models will be available soon. 
+ + + + DiffCollage: Parallel Generation of Large Content With Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_DiffCollage_Parallel_Generation_of_Large_Content_With_Diffusion_Models_CVPR_2023_paper.pdf + We present DiffCollage, a compositional diffusion model that can generate large content by leveraging diffusion models trained on generating pieces of the large content. Our approach is based on a factor graph representation where each factor node represents a portion of the content and a variable node represents their overlap. This representation allows us to aggregate intermediate outputs from diffusion models defined on individual nodes to generate content of arbitrary size and shape in parallel without resorting to an autoregressive generation procedure. We apply DiffCollage to various tasks, including infinite image generation, panorama image generation, and long-duration text-guided motion generation. Extensive experimental results with a comparison to strong autoregressive baselines verify the effectiveness of our approach. + + + + Efficient Second-Order Plane Adjustment + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Efficient_Second-Order_Plane_Adjustment_CVPR_2023_paper.pdf + Planes are generally used in 3D reconstruction for depth sensors, such as RGB-D cameras and LiDARs. This paper focuses on the problem of estimating the optimal planes and sensor poses to minimize the point-to-plane distance. The resulting least-squares problem is referred to as plane adjustment (PA) in the literature, which is the counterpart of bundle adjustment (BA) in visual reconstruction. Iterative methods are adopted to solve these least-squares problems. In practice, Newton's method is rarely used for large-scale least-squares problems, due to the high computational complexity of the Hessian matrix. Instead, methods using an approximation of the Hessian matrix, such as the Levenberg-Marquardt (LM) method, are generally adopted. This paper adopts Newton's method to efficiently solve the PA problem. Specifically, given the poses, the optimal planes have a closed-form solution. Thus, we can eliminate planes from the cost function, which significantly reduces the number of variables. Furthermore, as the optimal planes are functions of the poses, this method ensures that the optimal planes for the current estimated poses can be obtained at each iteration, which benefits the convergence. The difficulty lies in how to efficiently compute the Hessian matrix and the gradient of the resulting cost. This paper provides an efficient solution. Empirical evaluation shows that our algorithm outperforms the state-of-the-art algorithms. + + + + Mofusion: A Framework for Denoising-Diffusion-Based Motion Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Dabral_Mofusion_A_Framework_for_Denoising-Diffusion-Based_Motion_Synthesis_CVPR_2023_paper.pdf + Conventional methods for human motion synthesis have either been deterministic or have had to struggle with the trade-off between motion diversity and motion quality. In response to these limitations, we introduce MoFusion, a new denoising-diffusion-based framework for high-quality conditional human motion synthesis that can synthesise long, temporally plausible, and semantically accurate motions based on a range of conditioning contexts (such as music and text). &#13;
We also present ways to introduce well-known kinematic losses for motion plausibility within the motion-diffusion framework through our scheduled weighting strategy. The learned latent space can be used for several interactive motion-editing applications like in-betweening, seed-conditioning, and text-based editing, thus, providing crucial abilities for virtual-character animation and robotics. Through comprehensive quantitative evaluations and a perceptual user study, we demonstrate the effectiveness of MoFusion compared to the state-of-the-art on established benchmarks in the literature. We urge the reader to watch our supplementary video. The source code will be released. + + + + PoseFormerV2: Exploring Frequency Domain for Efficient and Robust 3D Human Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_PoseFormerV2_Exploring_Frequency_Domain_for_Efficient_and_Robust_3D_Human_CVPR_2023_paper.pdf + Recently, transformer-based methods have gained significant success in sequential 2D-to-3D lifting human pose estimation. As a pioneering work, PoseFormer captures spatial relations of human joints in each video frame and human dynamics across frames with cascaded transformer layers and has achieved impressive performance. However, in real scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) The length of the input joint sequence; (b) The quality of 2D joint detection. Existing methods typically apply self-attention to all frames of the input sequence, causing a huge computational burden when the frame number is increased to obtain advanced estimation accuracy, and they are not robust to noise naturally brought by the limited capability of 2D joint detectors. In this paper, we propose PoseFormerV2, which exploits a compact representation of lengthy skeleton sequences in the frequency domain to efficiently scale up the receptive field and boost robustness to noisy 2D joint detection. With minimum modifications to PoseFormer, the proposed method effectively fuses features both in the time domain and frequency domain, enjoying a better speed-accuracy trade-off than its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that the proposed approach significantly outperforms the original PoseFormer and other transformer-based variants. Code is released at https://github.com/QitaoZhao/PoseFormerV2. + + + + Mask3D: Pre-Training 2D Vision Transformers by Learning Masked 3D Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Hou_Mask3D_Pre-Training_2D_Vision_Transformers_by_Learning_Masked_3D_Priors_CVPR_2023_paper.pdf + Current popular backbones in computer vision, such as Vision Transformers (ViT) and ResNets are trained to perceive the world from 2D images. However, to more effectively understand 3D structural priors in 2D backbones, we propose Mask3D to leverage existing large-scale RGB-D data in a self-supervised pre-training to embed these 3D priors into 2D learned feature representations. In contrast to traditional 3D contrastive learning paradigms requiring 3D reconstructions or multi-view correspondences, our approach is simple: we formulate a pre-text reconstruction task by masking RGB and depth patches in individual RGB-D frames. 
We demonstrate that Mask3D is particularly effective in embedding 3D priors into the powerful 2D ViT backbone, enabling improved representation learning for various scene understanding tasks, such as semantic segmentation, instance segmentation and object detection. Experiments show that Mask3D notably outperforms existing self-supervised 3D pre-training approaches on ScanNet, NYUv2, and Cityscapes image understanding tasks, with an improvement of +6.5% mIoU against the state-of-the-art Pri3D on ScanNet image semantic segmentation. + + + + Physically Adversarial Infrared Patches With Learnable Shapes and Locations + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Physically_Adversarial_Infrared_Patches_With_Learnable_Shapes_and_Locations_CVPR_2023_paper.pdf + Owing to the extensive application of infrared object detectors in safety-critical tasks, it is necessary to evaluate their robustness against adversarial examples in the real world. However, the few existing physical infrared attacks are complicated to implement in practical applications because of their complex transformation from the digital world to the physical world. To address this issue, in this paper, we propose a physically feasible infrared attack method called "adversarial infrared patches". Because infrared cameras image objects by capturing their thermal radiation, adversarial infrared patches conduct attacks by attaching a patch of thermal insulation material to the target object to manipulate its thermal distribution. To enhance adversarial attacks, we present a novel aggregation regularization to guide the simultaneous learning of the patch's shape and location on the target object. Thus, a simple gradient-based optimization can be adapted to solve for them. We verify adversarial infrared patches in different object detection tasks with various object detectors. Experimental results show that our method achieves more than 90% Attack Success Rate (ASR) against the pedestrian detector and vehicle detector in the physical environment, where the objects are captured from different angles and distances and in different postures and scenes. More importantly, the adversarial infrared patch is easy to implement, and it only needs 0.5 hours to be constructed in the physical world, which verifies its effectiveness and efficiency. + + + + Exemplar-FreeSOLO: Enhancing Unsupervised Instance Segmentation With Exemplars + http://openaccess.thecvf.com//content/CVPR2023/papers/Ishtiak_Exemplar-FreeSOLO_Enhancing_Unsupervised_Instance_Segmentation_With_Exemplars_CVPR_2023_paper.pdf + Instance segmentation seeks to identify and segment each object from images, which often relies on a large number of dense annotations for model training. To alleviate this burden, unsupervised instance segmentation methods have been developed to train class-agnostic instance segmentation models without any annotation. In this paper, we propose a novel unsupervised instance segmentation approach, Exemplar-FreeSOLO, to enhance unsupervised instance segmentation by exploiting a limited number of unannotated and unsegmented exemplars. The proposed framework offers a new perspective on directly perceiving top-down information without annotations. Specifically, Exemplar-FreeSOLO introduces a novel exemplar knowledge abstraction module to acquire beneficial top-down guidance knowledge for instances using unsupervised exemplar object extraction. &#13;
Moreover, a new exemplar embedding contrastive module is designed to enhance the discriminative capability of the segmentation model by exploiting the contrastive exemplar-based guidance knowledge in the embedding space. To evaluate the proposed ExemplarFreeSOLO, we conduct comprehensive experiments and perform in-depth analyses on three image instance segmentation datasets. The experimental results demonstrate that the proposed approach is effective and outperforms the state-of-the-art methods. + + + + Multimodal Prompting With Missing Modalities for Visual Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Multimodal_Prompting_With_Missing_Modalities_for_Visual_Recognition_CVPR_2023_paper.pdf + In this paper, we tackle two challenges in multimodal learning for visual recognition: 1) when missing-modality occurs either during training or testing in real-world situations; and 2) when the computation resources are not available to finetune on heavy transformer models. To this end, we propose to utilize prompt learning and mitigate the above two challenges together. Specifically, our modality-missing-aware prompts can be plugged into multimodal transformers to handle general missing-modality cases, while only requiring less than 1% learnable parameters compared to training the entire model. We further explore the effect of different prompt configurations and analyze the robustness to missing modality. Extensive experiments are conducted to show the effectiveness of our prompt learning framework that improves the performance under various missing-modality cases, while alleviating the requirement of heavy model re-training. Code is available. + + + + Neural Koopman Pooling: Control-Inspired Temporal Dynamics Encoding for Skeleton-Based Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Neural_Koopman_Pooling_Control-Inspired_Temporal_Dynamics_Encoding_for_Skeleton-Based_Action_CVPR_2023_paper.pdf + Skeleton-based human action recognition is becoming increasingly important in a variety of fields. Most existing works train a CNN or GCN based backbone to extract spatial-temporal features, and use temporal average/max pooling to aggregate the information. However, these pooling methods fail to capture high-order dynamics information. To address the problem, we propose a plug-and-play module called Koopman pooling, which is a parameterized high-order pooling technique based on Koopman theory. The Koopman operator linearizes a non-linear dynamics system, thus providing a way to represent the complex system through the dynamics matrix, which can be used for classification. We also propose an eigenvalue normalization method to encourage the learned dynamics to be non-decaying and stable. Besides, we also show that our Koopman pooling framework can be easily extended to one-shot action recognition when combined with Dynamic Mode Decomposition. The proposed method is evaluated on three benchmark datasets, namely NTU RGB+D 60, 120 and NW-UCLA. Our experiments clearly demonstrate that Koopman pooling significantly improves the performance under both full-dataset and one-shot settings. 
+ + + + Blind Image Quality Assessment via Vision-Language Correspondence: A Multitask Learning Perspective + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Blind_Image_Quality_Assessment_via_Vision-Language_Correspondence_A_Multitask_Learning_CVPR_2023_paper.pdf + We aim at advancing blind image quality assessment (BIQA), which predicts the human perception of image quality without any reference information. We develop a general and automated multitask learning scheme for BIQA to exploit auxiliary knowledge from other tasks, in a way that the model parameter sharing and the loss weighting are determined automatically. Specifically, we first describe all candidate label combinations (from multiple tasks) using a textual template, and compute the joint probability from the cosine similarities of the visual-textual embeddings. Predictions of each task can be inferred from the joint distribution, and optimized by carefully designed loss functions. Through comprehensive experiments on learning three tasks - BIQA, scene classification, and distortion type identification, we verify that the proposed BIQA method 1) benefits from the scene classification and distortion type identification tasks and outperforms the state-of-the-art on multiple IQA datasets, 2) is more robust in the group maximum differentiation competition, and 3) realigns the quality annotations from different IQA datasets more effectively. The source code is available at https://github.com/zwx8981/LIQE. + + + + Integral Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Solodskikh_Integral_Neural_Networks_CVPR_2023_paper.pdf + We introduce a new family of deep neural networks. Instead of the conventional representation of network layers as N-dimensional weight tensors, we use continuous layer representation along the filter and channel dimensions. We call such networks Integral Neural Networks (INNs). In particular, the weights of INNs are represented as continuous functions defined on N-dimensional hypercubes, and the discrete transformations of inputs to the layers are replaced by continuous integration operations, accordingly. During the inference stage, our continuous layers can be converted into the traditional tensor representation via numerical integral quadratures. Such kind of representation allows the discretization of a network to an arbitrary size with various discretization intervals for the integral kernels. This approach can be applied to prune the model directly on the edge device while featuring only a small performance loss at high rates of structural pruning without any fine-tuning. To evaluate the practical benefits of our proposed approach, we have conducted experiments using various neural network architectures for multiple tasks. Our reported results show that the proposed INNs achieve the same performance with their conventional discrete counterparts, while being able to preserve approximately the same performance (2 % accuracy loss for ResNet18 on Imagenet) at a high rate (up to 30%) of structural pruning without fine-tuning, compared to 65 % accuracy loss of the conventional pruning methods under the same conditions. + + + + EXCALIBUR: Encouraging and Evaluating Embodied Exploration + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_EXCALIBUR_Encouraging_and_Evaluating_Embodied_Exploration_CVPR_2023_paper.pdf + Experience precedes understanding. 
Humans constantly explore and learn about their environment out of curiosity, gather information, and update their models of the world. On the other hand, machines are either trained to learn passively from static and fixed datasets, or taught to complete specific goal-conditioned tasks. To encourage the development of exploratory interactive agents, we present the EXCALIBUR benchmark. EXCALIBUR allows agents to explore their environment for long durations and then query their understanding of the physical world via inquiries like: "is the small heavy red bowl made from glass?" or "is there a silver spoon heavier than the egg?". This design encourages agents to perform free-form home exploration without myopia induced by goal conditioning. Once the agents have answered a series of questions, they can re-enter the scene to refine their knowledge, update their beliefs, and improve their performance on the questions. Our experiments demonstrate the challenges posed by this dataset for present-day state-of-the-art embodied systems and the headroom it affords for developing new, innovative methods. Finally, we present a virtual reality interface that enables humans to seamlessly interact within the simulated world and use it to gather human performance measures. EXCALIBUR affords unique challenges in comparison to present-day benchmarks and represents the next frontier for embodied AI research. + + + + Visual DNA: Representing and Comparing Images Using Distributions of Neuron Activations + http://openaccess.thecvf.com//content/CVPR2023/papers/Ramtoula_Visual_DNA_Representing_and_Comparing_Images_Using_Distributions_of_Neuron_CVPR_2023_paper.pdf + Selecting appropriate datasets is critical in modern computer vision. However, no general-purpose tools exist to evaluate the extent to which two datasets differ. For this, we propose representing images -- and by extension datasets -- using Distributions of Neuron Activations (DNAs). DNAs fit distributions, such as histograms or Gaussians, to activations of neurons in a pre-trained feature extractor through which we pass the image(s) to represent. This extractor is frozen for all datasets, and we rely on its generally expressive power in feature space. By comparing two DNAs, we can evaluate the extent to which two datasets differ with granular control over the comparison attributes of interest, providing the ability to customise the way distances are measured to suit the requirements of the task at hand. Furthermore, DNAs are compact, representing datasets of any size with less than 15 megabytes. We demonstrate the value of DNAs by evaluating their applicability on several tasks, including conditional dataset comparison, synthetic image evaluation, and transfer learning, and across diverse datasets, ranging from synthetic cat images to celebrity faces and urban driving scenes. + + + + Recognizability Embedding Enhancement for Very Low-Resolution Face Recognition and Quality Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chai_Recognizability_Embedding_Enhancement_for_Very_Low-Resolution_Face_Recognition_and_Quality_CVPR_2023_paper.pdf + Very low-resolution face recognition (VLRFR) poses unique challenges, such as tiny regions of interest and poor resolution due to extreme standoff distance or wide viewing angle of the acquisition device. In this paper, we study principled approaches to elevate the recognizability of a face in the embedding space instead of the visual quality. &#13;
We first formulate a robust learning-based face recognizability measure, namely recognizability index (RI), based on two criteria: (i) proximity of each face embedding against the unrecognizable faces cluster center and (ii) closeness of each face embedding against its positive and negative class prototypes. We then devise an index diversion loss to push the hard-to-recognize face embedding with low RI away from unrecognizable faces cluster to boost the RI, which reflects better recognizability. Additionally, a perceptibility-aware attention mechanism is introduced to attend to the salient recognizable face regions, which offers better explanatory and discriminative content for embedding learning. Our proposed model is trained end-to-end and simultaneously serves recognizability-aware embedding learning and face quality estimation. To address VLRFR, extensive evaluations on three challenging low-resolution datasets and face quality assessment demonstrate the superiority of the proposed model over the state-of-the-art methods. + + + + Accelerating Dataset Distillation via Model Augmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Accelerating_Dataset_Distillation_via_Model_Augmentation_CVPR_2023_paper.pdf + Dataset Distillation (DD), a newly emerging field, aims at generating much smaller but efficient synthetic training datasets from large ones. Existing DD methods based on gradient matching achieve leading performance; however, they are extremely computationally intensive as they require continuously optimizing a dataset among thousands of randomly initialized models. In this paper, we assume that training the synthetic data with diverse models leads to better generalization performance. Thus we propose two model augmentation techniques, i.e. using early-stage models and parameter perturbation to learn an informative synthetic set with significantly reduced training cost. Extensive experiments demonstrate that our method achieves up to 20x speedup and comparable performance on par with state-of-the-art methods. + + + + Frame-Event Alignment and Fusion Network for High Frame Rate Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Frame-Event_Alignment_and_Fusion_Network_for_High_Frame_Rate_Tracking_CVPR_2023_paper.pdf + Most existing RGB-based trackers target low frame rate benchmarks of around 30 frames per second. This setting restricts the tracker's functionality in the real world, especially for fast motion. Event-based cameras as bioinspired sensors provide considerable potential for high frame rate tracking due to their high temporal resolution. However, event-based cameras cannot offer fine-grained texture information like conventional cameras. This unique complementarity motivates us to combine conventional frames and events for high frame rate object tracking under various challenging conditions. In this paper, we propose an end-to-end network consisting of multi-modality alignment and fusion modules to effectively combine meaningful information from both modalities at different measurement rates. The alignment module is responsible for cross-modality and cross-frame-rate alignment between frame and event modalities under the guidance of the moving cues furnished by events. While the fusion module is accountable for emphasizing valuable features and suppressing noise information by the mutual complement between the two modalities. 
Extensive experiments show that the proposed approach outperforms state-of-the-art trackers by a significant margin in high frame rate tracking. With the FE240hz dataset, our approach achieves high frame rate tracking up to 240Hz. + + + + + + Solving Relaxations of MAP-MRF Problems: Combinatorial In-Face Frank-Wolfe Directions + http://openaccess.thecvf.com//content/CVPR2023/papers/Kolmogorov_Solving_Relaxations_of_MAP-MRF_Problems_Combinatorial_In-Face_Frank-Wolfe_Directions_CVPR_2023_paper.pdf + We consider the problem of solving LP relaxations of MAP-MRF inference problems, and in particular the method proposed recently in (Swoboda, Kolmogorov 2019; Kolmogorov, Pock 2021). As a key computational subroutine, it uses a variant of the Frank-Wolfe (FW) method to minimize a smooth convex function over a combinatorial polytope. We propose an efficient implementation of this subroutine based on in-face Frank-Wolfe directions, introduced in (Freund et al. 2017) in a different context. More generally, we define an abstract data structure for a combinatorial subproblem that enables in-face FW directions, and describe its specialization for tree-structured MAP-MRF inference subproblems. Experimental results indicate that the resulting method is the current state-of-the-art LP solver for some classes of problems. Our code is available at pub.ist.ac.at/ vnk/papers/IN-FACE-FW.html. + + + + MEGANE: Morphable Eyeglass and Avatar Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_MEGANE_Morphable_Eyeglass_and_Avatar_Network_CVPR_2023_paper.pdf + Eyeglasses play an important role in the perception of identity. Authentic virtual representations of faces can benefit greatly from their inclusion. However, modeling the geometric and appearance interactions of glasses and the face in virtual representations of humans is challenging. Glasses and faces affect each other's geometry at their contact points, and also induce appearance changes due to light transport. Most existing approaches do not capture these physical interactions since they model eyeglasses and faces independently. Others attempt to resolve interactions as a 2D image synthesis problem and suffer from view and temporal inconsistencies. In this work, we propose a 3D compositional morphable model of eyeglasses that accurately incorporates high-fidelity geometric and photometric interaction effects. To support the large variation in eyeglass topology efficiently, we employ a hybrid representation that combines surface geometry and a volumetric representation. Unlike volumetric approaches, our model naturally retains correspondences across glasses, and hence explicit modification of geometry, such as lens insertion and frame deformation, is greatly simplified. In addition, our model is relightable under point lights and natural illumination, supporting high-fidelity rendering of various frame materials, including translucent plastic and metal, within a single morphable model. Importantly, our approach models global light transport effects, such as casting shadows between faces and glasses. Our morphable model for eyeglasses can also be fit to novel glasses via inverse rendering. We compare our approach to state-of-the-art methods and demonstrate significant quality improvements. &#13;
+ + + + Enhancing Multiple Reliability Measures via Nuisance-Extended Information Bottleneck + http://openaccess.thecvf.com//content/CVPR2023/papers/Jeong_Enhancing_Multiple_Reliability_Measures_via_Nuisance-Extended_Information_Bottleneck_CVPR_2023_paper.pdf + In practical scenarios where training data is limited, many predictive signals in the data can instead come from biases in data acquisition (i.e., be less generalizable), so one cannot prevent a model from co-adapting to such (so-called) "shortcut" signals: this makes the model fragile under various distribution shifts. To bypass such failure modes, we consider an adversarial threat model under a mutual information constraint to cover a wider class of perturbations in training. This motivates us to extend the standard information bottleneck to additionally model the nuisance information. We propose an autoencoder-based training scheme to implement the objective, as well as practical encoder designs to facilitate the proposed hybrid discriminative-generative training for both convolutional- and Transformer-based architectures. Our experimental results show that the proposed scheme improves the robustness of learned representations (remarkably, without using any domain-specific knowledge) with respect to multiple challenging reliability measures. For example, our model could advance the state-of-the-art on a recent challenging OBJECTS benchmark in novelty detection from 78.4% to 87.2% AUROC, while simultaneously enjoying improved corruption, background and (certified) adversarial robustness. Code is available at https://github.com/jh-jeong/nuisance_ib. + + + + Rethinking the Approximation Error in 3D Surface Fitting for Point Cloud Normal Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Rethinking_the_Approximation_Error_in_3D_Surface_Fitting_for_Point_CVPR_2023_paper.pdf + Most existing approaches for point cloud normal estimation aim to locally fit a geometric surface and calculate the normal from the fitted surface. Recently, learning-based methods have adopted a routine of predicting point-wise weights to solve the weighted least-squares surface fitting problem. Despite achieving remarkable progress, these methods overlook the approximation error of the fitting problem, resulting in a less accurate fitted surface. In this paper, we first carry out an in-depth analysis of the approximation error in the surface fitting problem. Then, in order to bridge the gap between estimated and precise surface normals, we present two basic design principles: 1) apply the Z-direction Transform to rotate local patches for a better surface fitting with a lower approximation error; 2) model the error of the normal estimation as a learnable term. We implement these two principles using deep neural networks, and integrate them with state-of-the-art (SOTA) normal estimation methods in a plug-and-play manner. Extensive experiments verify that our approaches bring benefits to point cloud normal estimation and push the frontier of state-of-the-art performance on both synthetic and real-world datasets. The code is available at https://github.com/hikvision-research/3DVision. + + + + Objaverse: A Universe of Annotated 3D Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Deitke_Objaverse_A_Universe_of_Annotated_3D_Objects_CVPR_2023_paper.pdf + Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. &#13;
Large neural models trained on such datasets produce impressive results and top many of today's benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of objects with 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present day 3D repositories in terms of scale, number of categories, and in the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models. Objaverse can open new directions for research and enable new applications across the field of AI. + + + + A-Cap: Anticipation Captioning With Commonsense Knowledge + http://openaccess.thecvf.com//content/CVPR2023/papers/Vo_A-Cap_Anticipation_Captioning_With_Commonsense_Knowledge_CVPR_2023_paper.pdf + Humans possess the capacity to reason about the future based on a sparse collection of visual cues acquired over time. In order to emulate this ability, we introduce a novel task called Anticipation Captioning, which generates a caption for an unseen oracle image using a sparsely temporally-ordered set of images. To tackle this new task, we propose a model called A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption. Through both qualitative and quantitative evaluations on a customized visual storytelling dataset, A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning. We also address the challenges inherent in this task. + + + + Domain Generalized Stereo Matching via Hierarchical Visual Transformation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chang_Domain_Generalized_Stereo_Matching_via_Hierarchical_Visual_Transformation_CVPR_2023_paper.pdf + Recently, deep Stereo Matching (SM) networks have shown impressive performance and attracted increasing attention in computer vision. However, existing deep SM networks are prone to learn dataset-dependent shortcuts, which fail to generalize well on unseen realistic datasets. This paper takes a step towards training robust models for the domain generalized SM task, which mainly focuses on learning shortcut-invariant representation from synthetic data to alleviate the domain shifts. Specifically, we propose a Hierarchical Visual Transformation (HVT) network to 1) first transform the training sample hierarchically into new domains with diverse distributions from three levels: Global, Local, and Pixel, 2) then maximize the visual discrepancy between the source domain and new domains, and minimize the cross-domain feature inconsistency to capture domain-invariant features. In this way, we can prevent the model from exploiting the artifacts of synthetic stereo images as shortcut features, thereby estimating the disparity maps more effectively based on the learned robust and shortcut-invariant representation. We integrate our proposed HVT network with SOTA SM networks and evaluate its effectiveness on several public SM benchmark datasets. 
Extensive experiments clearly show that the HVT network can substantially enhance the performance of existing SM networks in synthetic-to-realistic domain generalization. + + + + Adapting Shortcut With Normalizing Flow: An Efficient Tuning Framework for Visual Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Adapting_Shortcut_With_Normalizing_Flow_An_Efficient_Tuning_Framework_for_CVPR_2023_paper.pdf + Pretraining followed by fine-tuning has proven to be effective in visual recognition tasks. However, fine-tuning all parameters can be computationally expensive, particularly for large-scale models. To mitigate the computational and storage demands, recent research has explored Parameter-Efficient Fine-Tuning (PEFT), which focuses on tuning a minimal number of parameters for efficient adaptation. Existing methods, however, fail to analyze the impact of the additional parameters on the model, resulting in an unclear and suboptimal tuning process. In this paper, we introduce a novel and effective PEFT paradigm, named SNF (Shortcut adaptation via Normalization Flow), which utilizes normalizing flows to adjust the shortcut layers. We highlight that layers without Lipschitz constraints can lead to error propagation when adapting to downstream datasets. Since modifying the over-parameterized residual connections in these layers is expensive, we focus on adjusting the cheap yet crucial shortcuts. Moreover, learning new information with few parameters in PEFT can be challenging, and information loss can result in label information degradation. To address this issue, we propose an information-preserving normalizing flow. Experimental results demonstrate the effectiveness of SNF. Specifically, with only 0.036M parameters, SNF surpasses previous approaches on both the FGVC and VTAB-1k benchmarks using ViT/B-16 as the backbone. The code is available at https://github.com/Wang-Yaoming/SNF + + + + Unpaired Image-to-Image Translation With Shortest Path Regularization + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Unpaired_Image-to-Image_Translation_With_Shortest_Path_Regularization_CVPR_2023_paper.pdf + Unpaired image-to-image translation aims to learn proper mappings that can map images from one domain to another domain while preserving the content of the input image. However, with large enough capacities, the network can learn to map the inputs to any random permutation of images in another domain. Existing methods treat two domains as discrete and propose different assumptions to address this problem. In this paper, we start from a different perspective and consider the paths connecting the two domains. We assume that the optimal path length between the input and output image should be the shortest among all possible paths. Based on this assumption, we propose a new method to allow generating images along the path and present a simple way to encourage the network to find the shortest path without pair information. Extensive experiments on various tasks demonstrate the superiority of our approach. + + + + MotionDiffuser: Controllable Multi-Agent Motion Prediction Using Diffusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_MotionDiffuser_Controllable_Multi-Agent_Motion_Prediction_Using_Diffusion_CVPR_2023_paper.pdf + We present MotionDiffuser, a diffusion based representation for the joint distribution of future trajectories over multiple agents. 
Such representation has several key advantages: first, our model learns a highly multimodal distribution that captures diverse future outcomes. Second, the simple predictor design requires only a single L2 loss training objective, and does not depend on trajectory anchors. Third, our model is capable of learning the joint distribution for the motion of multiple agents in a permutation-invariant manner. Furthermore, we utilize a compressed trajectory representation via PCA, which improves model performance and allows for efficient computation of the exact sample log probability. Subsequently, we propose a general constrained sampling framework that enables controlled trajectory sampling based on differentiable cost functions. This strategy enables a host of applications such as enforcing rules and physical priors, or creating tailored simulation scenarios. MotionDiffuser can be combined with existing backbone architectures to achieve top motion forecasting results. We obtain state-of-the-art results for multi-agent motion prediction on the Waymo Open Motion Dataset. + + + + ConvNeXt V2: Co-Designing and Scaling ConvNets With Masked Autoencoders + http://openaccess.thecvf.com//content/CVPR2023/papers/Woo_ConvNeXt_V2_Co-Designing_and_Scaling_ConvNets_With_Masked_Autoencoders_CVPR_2023_paper.pdf + Driven by improved architectures and better representation learning frameworks, the field of visual recognition has enjoyed rapid modernization and performance boost in the early 2020s. For example, modern ConvNets, represented by ConvNeXt models, have demonstrated strong performance across different application scenarios. Like many other architectures, ConvNeXt models were designed under the supervised learning setting with ImageNet labels. It is natural to expect ConvNeXt can also benefit from state-of-the-art self-supervised learning frameworks such as masked autoencoders (MAE), which was originally designed with Transformers. However, we show that simply combining the two designs yields subpar performance. In this paper, we develop an efficient and fully-convolutional masked autoencoder framework. We then upgrade the ConvNeXt architecture with a new Global Response Normalization (GRN) layer. GRN enhances inter-channel feature competition and is crucial for pre-training with masked input. The new model family, dubbed ConvNeXt V2, is a complete training recipe that synergizes both the architectural improvement and the advancement in self-supervised learning. With ConvNeXt V2, we are able to significantly advance pure ConvNets' performance across different recognition benchmarks including ImageNet classification, ADE20K segmentation and COCO detection. To accommodate different use cases, we provide pre-trained ConvNeXt V2 models of a wide range of complexity: from an efficient 3.7M-parameter Atto model that achieves 76.8% top-1 accuracy on ImageNet, to a 650M Huge model that can reach a state-of-the-art 88.9% accuracy using public training data only. + + + + Unsupervised Deep Asymmetric Stereo Matching With Spatially-Adaptive Self-Similarity + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_Unsupervised_Deep_Asymmetric_Stereo_Matching_With_Spatially-Adaptive_Self-Similarity_CVPR_2023_paper.pdf + Unsupervised stereo matching has received a lot of attention since it enables the learning of disparity estimation without ground-truth data. 
However, most of the unsupervised stereo matching algorithms assume that the left and right images have consistent visual properties, i.e., symmetric, and easily fail when the stereo images are asymmetric. In this paper, we present a novel spatially-adaptive self-similarity (SASS) for unsupervised asymmetric stereo matching. It extends the concept of self-similarity and generates deep features that are robust to the asymmetries. The sampling patterns to calculate self-similarities are adaptively generated throughout the image regions to effectively encode diverse patterns. In order to learn the effective sampling patterns, we design a contrastive similarity loss with positive and negative weights. Consequently, SASS is further encouraged to encode asymmetry-agnostic features, while maintaining the distinctiveness for stereo correspondence. We present extensive experimental results including ablation studies and comparisons with different methods, demonstrating effectiveness of the proposed method under resolution and noise asymmetries. + + + + TWINS: A Fine-Tuning Framework for Improved Transferability of Adversarial Robustness and Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_TWINS_A_Fine-Tuning_Framework_for_Improved_Transferability_of_Adversarial_Robustness_CVPR_2023_paper.pdf + Recent years have seen the ever-increasing importance of pre-trained models and their downstream training in deep learning research and applications. At the same time, the defense for adversarial examples has been mainly investigated in the context of training from random initialization on simple classification tasks. To better exploit the potential of pre-trained models in adversarial robustness, this paper focuses on the fine-tuning of an adversarially pre-trained model in various classification tasks. Existing research has shown that since the robust pre-trained model has already learned a robust feature extractor, the crucial question is how to maintain the robustness in the pre-trained model when learning the downstream task. We study the model-based and data-based approaches for this goal and find that the two common approaches cannot achieve the objective of improving both generalization and adversarial robustness. Thus, we propose a novel statistics-based approach, Two-WIng NormliSation (TWINS) fine-tuning framework, which consists of two neural networks where one of them keeps the population means and variances of pre-training data in the batch normalization layers. Besides the robust information transfer, TWINS increases the effective learning rate without hurting the training stability since the relationship between a weight norm and its gradient norm in standard batch normalization layer is broken, resulting in a faster escape from the sub-optimal initialization and alleviating the robust overfitting. Finally, TWINS is shown to be effective on a wide range of image classification datasets in terms of both generalization and robustness. + + + + Object-Aware Distillation Pyramid for Open-Vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Object-Aware_Distillation_Pyramid_for_Open-Vocabulary_Object_Detection_CVPR_2023_paper.pdf + Open-vocabulary object detection aims to provide object detectors trained on a fixed set of object categories with the generalizability to detect objects described by arbitrary text queries. 
Previous methods adopt knowledge distillation to extract knowledge from Pretrained Vision-and-Language Models (PVLMs) and transfer it to detectors. However, due to the non-adaptive proposal cropping and single-level feature mimicking processes, they suffer from information destruction during knowledge extraction and inefficient knowledge transfer. To remedy these limitations, we propose an Object-Aware Distillation Pyramid (OADP) framework, including an Object-Aware Knowledge Extraction (OAKE) module and a Distillation Pyramid (DP) mechanism. When extracting object knowledge from PVLMs, the former adaptively transforms object proposals and adopts object-aware mask attention to obtain precise and complete knowledge of objects. The latter introduces global and block distillation for more comprehensive knowledge transfer to compensate for the missing relation information in object distillation. Extensive experiments show that our method achieves significant improvement compared to current methods. Especially on the MS-COCO dataset, our OADP framework reaches 35.6 mAP^N_50, surpassing the current state-of-the-art method by 3.3 mAP^N_50. Code is anonymously provided in the supplementary materials. + + + + Evolved Part Masking for Self-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Evolved_Part_Masking_for_Self-Supervised_Learning_CVPR_2023_paper.pdf + Existing Masked Image Modeling methods apply fixed mask patterns to guide the self-supervised training. As those patterns resort to different criteria to mask local regions, sticking to a fixed pattern leads to limited vision cues modeling capability. This paper proposes an evolved part-based masking to pursue more general visual cues modeling in self-supervised learning. Our method is based on an adaptive part partition module, which leverages the vision model being trained to construct a part graph, and partitions parts with graph cut. The accuracy of partitioned parts is on par with the capability of the pre-trained model, leading to evolved mask patterns at different training stages. It generates simple patterns at the initial training stage to learn low-level visual cues, which hence evolves to eliminate accurate object parts to reinforce the learning of object semantics and contexts. Our method does not require extra pre-trained models or annotations, and effectively ensures the training efficiency by evolving the training difficulty. Experimental results show that it substantially boosts the performance on various tasks including image classification, object detection, and semantic segmentation. For example, it outperforms the recent MAE by 0.69% on ImageNet-1K classification and 1.61% on ADE20K segmentation with the same training epochs. + + + + MV-JAR: Masked Voxel Jigsaw and Reconstruction for LiDAR-Based Self-Supervised Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_MV-JAR_Masked_Voxel_Jigsaw_and_Reconstruction_for_LiDAR-Based_Self-Supervised_Pre-Training_CVPR_2023_paper.pdf + This paper introduces the Masked Voxel Jigsaw and Reconstruction (MV-JAR) method for LiDAR-based self-supervised pre-training and a carefully designed data-efficient 3D object detection benchmark on the Waymo dataset. Inspired by the scene-voxel-point hierarchy in downstream 3D object detectors, we design masking and reconstruction strategies accounting for voxel distributions in the scene and local point distributions within the voxel.
We employ a Reversed-Furthest-Voxel-Sampling strategy to address the uneven distribution of LiDAR points and propose MV-JAR, which combines two techniques for modeling the aforementioned distributions, resulting in superior performance. Our experiments reveal limitations in previous data-efficient experiments, which uniformly sample fine-tuning splits with varying data proportions from each LiDAR sequence, leading to similar data diversity across splits. To address this, we propose a new benchmark that samples scene sequences for diverse fine-tuning splits, ensuring adequate model convergence and providing a more accurate evaluation of pre-training methods. Experiments on our Waymo benchmark and the KITTI dataset demonstrate that MV-JAR consistently and significantly improves 3D detection performance across various data scales, achieving up to a 6.3% increase in mAPH compared to training from scratch. Codes and the benchmark are available at https://github.com/SmartBot-PJLab/MV-JAR. + + + + Open-Set Semantic Segmentation for Point Clouds via Adversarial Prototype Framework + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Open-Set_Semantic_Segmentation_for_Point_Clouds_via_Adversarial_Prototype_Framework_CVPR_2023_paper.pdf + Recently, point cloud semantic segmentation has attracted much attention in computer vision. Most of the existing works in literature assume that the training and testing point clouds have the same object classes, but they are generally invalid in many real-world scenarios for identifying the 3D objects whose classes are not seen in the training set. To address this problem, we propose an Adversarial Prototype Framework (APF) for handling the open-set 3D semantic segmentation task, which aims to identify 3D unseen-class points while maintaining the segmentation performance on seen-class points. The proposed APF consists of a feature extraction module for extracting point features, a prototypical constraint module, and a feature adversarial module. The prototypical constraint module is designed to learn prototypes for each seen class from point features. The feature adversarial module utilizes generative adversarial networks to estimate the distribution of unseen-class features implicitly, and the synthetic unseen-class features are utilized to prompt the model to learn more effective point features and prototypes for discriminating unseen-class samples from the seen-class ones. Experimental results on two public datasets demonstrate that the proposed APF outperforms the comparative methods by a large margin in most cases. + + + + Learning Attention As Disentangler for Compositional Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Hao_Learning_Attention_As_Disentangler_for_Compositional_Zero-Shot_Learning_CVPR_2023_paper.pdf + Compositional zero-shot learning (CZSL) aims at learning visual concepts (i.e., attributes and objects) from seen compositions and combining concept knowledge into unseen compositions. The key to CZSL is learning the disentanglement of the attribute-object composition. To this end, we propose to exploit cross-attentions as compositional disentanglers to learn disentangled concept embeddings. For example, if we want to recognize an unseen composition "yellow flower", we can learn the attribute concept "yellow" and object concept "flower" from different yellow objects and different flowers respectively. 
To further constrain the disentanglers to learn the concept of interest, we employ a regularization at the attention level. Specifically, we adapt the earth mover's distance (EMD) as a feature similarity metric in the cross-attention module. Moreover, benefiting from concept disentanglement, we improve the inference process and tune the prediction score by combining multiple concept probabilities. Comprehensive experiments on three CZSL benchmark datasets demonstrate that our method significantly outperforms previous works in both closed- and open-world settings, establishing a new state-of-the-art. Project page: https://haoosz.github.io/ade-czsl/ + + + + MetaViewer: Towards a Unified Multi-View Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_MetaViewer_Towards_a_Unified_Multi-View_Representation_CVPR_2023_paper.pdf + Existing multi-view representation learning methods typically follow a specific-to-uniform pipeline, extracting latent features from each view and then fusing or aligning them to obtain the unified object representation. However, the manually pre-specified fusion functions and aligning criteria could potentially degrade the quality of the derived representation. To overcome them, we propose a novel uniform-to-specific multi-view learning framework from a meta-learning perspective, where the unified representation no longer involves manual manipulation but is automatically derived from a meta-learner named MetaViewer. Specifically, we formulated the extraction and fusion of view-specific latent features as a nested optimization problem and solved it by using a bi-level optimization scheme. In this way, MetaViewer automatically fuses view-specific features into a unified one and learns the optimal fusion scheme by observing reconstruction processes from the unified to the specific over all views. Extensive experimental results in downstream classification and clustering tasks demonstrate the efficiency and effectiveness of the proposed method. + + + + Natural Language-Assisted Sign Language Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Zuo_Natural_Language-Assisted_Sign_Language_Recognition_CVPR_2023_paper.pdf + Sign languages are visual languages which convey information by signers' handshape, facial expression, body movement, and so forth. Due to the inherent restriction of combinations of these visual ingredients, there exist a significant number of visually indistinguishable signs (VISigns) in sign languages, which limits the recognition capacity of vision neural networks. To mitigate the problem, we propose the Natural Language-Assisted Sign Language Recognition (NLA-SLR) framework, which exploits semantic information contained in glosses (sign labels). First, for VISigns with similar semantic meanings, we propose language-aware label smoothing by generating soft labels for each training sign whose smoothing weights are computed from the normalized semantic similarities among the glosses to ease training. Second, for VISigns with distinct semantic meanings, we present an inter-modality mixup technique which blends vision and gloss features to further maximize the separability of different signs under the supervision of blended labels. Besides, we also introduce a novel backbone, video-keypoint network, which not only models both RGB videos and human body keypoints but also derives knowledge from sign videos of different temporal receptive fields. 
Empirically, our method achieves state-of-the-art performance on three widely-adopted benchmarks: MSASL, WLASL, and NMFs-CSL. Codes are available at https://github.com/FangyunWei/SLRT. + + + + Learning Semantic Relationship Among Instances for Image-Text Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Fu_Learning_Semantic_Relationship_Among_Instances_for_Image-Text_Matching_CVPR_2023_paper.pdf + Image-text matching, a bridge connecting image and language, is an important task, which generally learns a holistic cross-modal embedding to achieve a high-quality semantic alignment between the two modalities. However, previous studies only focus on capturing fragment-level relation within a sample from a particular modality, e.g., salient regions in an image or text words in a sentence, where they usually pay less attention to capturing instance-level interactions among samples and modalities, e.g., multiple images and texts. In this paper, we argue that sample relations could help learn subtle differences for hard negative instances, and thus transfer shared knowledge for infrequent samples should be promising in obtaining better holistic embeddings. Therefore, we propose a novel hierarchical relation modeling framework (HREM), which explicitly capture both fragment- and instance-level relations to learn discriminative and robust cross-modal embeddings. Extensive experiments on Flickr30K and MS-COCO show our proposed method outperforms the state-of-the-art ones by 4%-10% in terms of rSum. + + + + Global-to-Local Modeling for Video-Based 3D Human Pose and Shape Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_Global-to-Local_Modeling_for_Video-Based_3D_Human_Pose_and_Shape_Estimation_CVPR_2023_paper.pdf + Video-based 3D human pose and shape estimations are evaluated by intra-frame accuracy and inter-frame smoothness. Although these two metrics are responsible for different ranges of temporal consistency, existing state-of-the-art methods treat them as a unified problem and use monotonous modeling structures (e.g., RNN or attention-based block) to design their networks. However, using a single kind of modeling structure is difficult to balance the learning of short-term and long-term temporal correlations, and may bias the network to one of them, leading to undesirable predictions like global location shift, temporal inconsistency, and insufficient local details. To solve these problems, we propose to structurally decouple the modeling of long-term and short-term correlations in an end-to-end framework, Global-to-Local Transformer (GLoT). First, a global transformer is introduced with a Masked Pose and Shape Estimation strategy for long-term modeling. The strategy stimulates the global transformer to learn more inter-frame correlations by randomly masking the features of several frames. Second, a local transformer is responsible for exploiting local details on the human mesh and interacting with the global transformer by leveraging cross-attention. Moreover, a Hierarchical Spatial Correlation Regressor is further introduced to refine intra-frame estimations by decoupled global-local representation and implicit kinematic constraints. Our GLoT surpasses previous state-of-the-art methods with the lowest model parameters on popular benchmarks, i.e., 3DPW, MPI-INF-3DHP, and Human3.6M. Codes are available at https://github.com/sxl142/GLoT. 
+ + + + BEDLAM: A Synthetic Dataset of Bodies Exhibiting Detailed Lifelike Animated Motion + http://openaccess.thecvf.com//content/CVPR2023/papers/Black_BEDLAM_A_Synthetic_Dataset_of_Bodies_Exhibiting_Detailed_Lifelike_Animated_CVPR_2023_paper.pdf + We show, for the first time, that neural networks trained only on synthetic data achieve state-of-the-art accuracy on the problem of 3D human pose and shape (HPS) estimation from real images. Previous synthetic datasets have been small, unrealistic, or lacked realistic clothing. Achieving sufficient realism is non-trivial and we show how to do this for full bodies in motion. Specifically, our BEDLAM dataset contains monocular RGB videos with ground-truth 3D bodies in SMPL-X format. It includes a diversity of body shapes, motions, skin tones, hair, and clothing. The clothing is realistically simulated on the moving bodies using commercial clothing physics simulation. We render varying numbers of people in realistic scenes with varied lighting and camera motions. We then train various HPS regressors using BEDLAM and achieve state-of-the-art accuracy on real-image benchmarks despite training with synthetic data. We use BEDLAM to gain insights into what model design choices are important for accuracy. With good synthetic training data, we find that a basic method like HMR approaches the accuracy of the current SOTA method (CLIFF). BEDLAM is useful for a variety of tasks and all images, ground truth bodies, 3D clothing, support code, and more are available for research purposes. Additionally, we provide detailed information about our synthetic data generation pipeline, enabling others to generate their own datasets. See the project page: https://bedlam.is.tue.mpg.de/. + + + + ProtoCon: Pseudo-Label Refinement via Online Clustering and Prototypical Consistency for Efficient Semi-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Nassar_ProtoCon_Pseudo-Label_Refinement_via_Online_Clustering_and_Prototypical_Consistency_for_CVPR_2023_paper.pdf + Confidence-based pseudo-labeling is among the dominant approaches in semi-supervised learning (SSL). It relies on including high-confidence predictions made on unlabeled data as additional targets to train the model. We propose ProtoCon, a novel SSL method aimed at the less-explored label-scarce SSL where such methods usually underperform. ProtoCon refines the pseudo-labels by leveraging their nearest neighbours' information. The neighbours are identified as the training proceeds using an online clustering approach operating in an embedding space trained via a prototypical loss to encourage well-formed clusters. The online nature of ProtoCon allows it to utilise the label history of the entire dataset in one training cycle to refine labels in the following cycle without the need to store image embeddings. Hence, it can seamlessly scale to larger datasets at a low cost. Finally, ProtoCon addresses the poor training signal in the initial phase of training (due to fewer confident predictions) by introducing an auxiliary self-supervised loss. It delivers significant gains and faster convergence over state-of-the-art across 5 datasets, including CIFARs, ImageNet and DomainNet. 
+ + + + Image Super-Resolution Using T-Tetromino Pixels + http://openaccess.thecvf.com//content/CVPR2023/papers/Grosche_Image_Super-Resolution_Using_T-Tetromino_Pixels_CVPR_2023_paper.pdf + For modern high-resolution imaging sensors, pixel binning is performed in low-lighting conditions and in case high frame rates are required. To recover the original spatial resolution, single-image super-resolution techniques can be applied for upscaling. To achieve a higher image quality after upscaling, we propose a novel binning concept using tetromino-shaped pixels. It is embedded into the field of compressed sensing and the coherence is calculated to motivate the sensor layouts used. Next, we investigate the reconstruction quality using tetromino pixels for the first time in literature. Instead of using different types of tetrominoes as proposed elsewhere, we show that using a small repeating cell consisting of only four T-tetrominoes is sufficient. For reconstruction, we use a locally fully connected reconstruction (LFCR) network as well as two classical reconstruction methods from the field of compressed sensing. Using the LFCR network in combination with the proposed tetromino layout, we achieve superior image quality in terms of PSNR, SSIM, and visually compared to conventional single-image super-resolution using the very deep super-resolution (VDSR) network. For PSNR, a gain of up to +1.92 dB is achieved. + + + + GFIE: A Dataset and Baseline for Gaze-Following From 2D to 3D in Indoor Environments + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_GFIE_A_Dataset_and_Baseline_for_Gaze-Following_From_2D_to_CVPR_2023_paper.pdf + Gaze-following is a kind of research that requires locating where the person in the scene is looking automatically under the topic of gaze estimation. It is an important clue for understanding human intention, such as identifying objects or regions of interest to humans. However, a survey of datasets used for gaze-following tasks reveals defects in the way they collect gaze point labels. Manual labeling may introduce subjective bias and is labor-intensive, while automatic labeling with an eye-tracking device would alter the person's appearance. In this work, we introduce GFIE, a novel dataset recorded by a gaze data collection system we developed. The system is constructed with two devices, an Azure Kinect and a laser rangefinder, which generate the laser spot to steer the subject's attention as they perform in front of the camera. And an algorithm is developed to locate laser spots in images for annotating 2D/3D gaze targets and removing ground truth introduced by the spots. The whole procedure of collecting gaze behavior allows us to obtain unbiased labels in unconstrained environments semi-automatically. We also propose a baseline method with stereo field-of-view (FoV) perception for establishing a 2D/3D gaze-following benchmark on the GFIE dataset. Project page: https://sites.google.com/view/gfie. + + + + BKinD-3D: Self-Supervised 3D Keypoint Discovery From Multi-View Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_BKinD-3D_Self-Supervised_3D_Keypoint_Discovery_From_Multi-View_Videos_CVPR_2023_paper.pdf + Quantifying motion in 3D is important for studying the behavior of humans and other animals, but manual pose annotations are expensive and time-consuming to obtain. Self-supervised keypoint discovery is a promising strategy for estimating 3D poses without annotations. 
However, current keypoint discovery approaches commonly process single 2D views and do not operate in the 3D space. We propose a new method to perform self-supervised keypoint discovery in 3D from multi-view videos of behaving agents, without any keypoint or bounding box supervision in 2D or 3D. Our method, BKinD-3D, uses an encoder-decoder architecture with a 3D volumetric heatmap, trained to reconstruct spatiotemporal differences across multiple views, in addition to joint length constraints on a learned 3D skeleton of the subject. In this way, we discover keypoints without requiring manual supervision in videos of humans and rats, demonstrating the potential of 3D keypoint discovery for studying behavior. + + + + StyleRF: Zero-Shot 3D Style Transfer of Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_StyleRF_Zero-Shot_3D_Style_Transfer_of_Neural_Radiance_Fields_CVPR_2023_paper.pdf + 3D style transfer aims to render stylized novel views of a 3D scene with multi-view consistency. However, most existing work suffers from a three-way dilemma over accurate geometry reconstruction, high-quality stylization, and being generalizable to arbitrary new styles. We propose StyleRF (Style Radiance Fields), an innovative 3D style transfer technique that resolves the three-way dilemma by performing style transformation within the feature space of a radiance field. StyleRF employs an explicit grid of high-level features to represent 3D scenes, with which high-fidelity geometry can be reliably restored via volume rendering. In addition, it transforms the grid features according to the reference style which directly leads to high-quality zero-shot style transfer. StyleRF consists of two innovative designs. The first is sampling-invariant content transformation that makes the transformation invariant to the holistic statistics of the sampled 3D points and accordingly ensures multi-view consistency. The second is deferred style transformation of 2D feature maps which is equivalent to the transformation of 3D points but greatly reduces memory footprint without degrading multi-view consistency. Extensive experiments show that StyleRF achieves superior 3D stylization quality with precise geometry reconstruction and it can generalize to various new styles in a zero-shot manner. Project website: https://kunhao-liu.github.io/StyleRF/ + + + + Accidental Light Probes + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Accidental_Light_Probes_CVPR_2023_paper.pdf + Recovering lighting in a scene from a single image is a fundamental problem in computer vision. While a mirror ball light probe can capture omnidirectional lighting, light probes are generally unavailable in everyday images. In this work, we study recovering lighting from accidental light probes (ALPs)---common, shiny objects like Coke cans, which often accidentally appear in daily scenes. We propose a physically-based approach to model ALPs and estimate lighting from their appearances in single images. The main idea is to model the appearance of ALPs by photogrammetrically principled shading and to invert this process via differentiable rendering to recover incidental illumination. We demonstrate that we can put an ALP into a scene to allow high-fidelity lighting estimation. Our model can also recover lighting for existing images that happen to contain an ALP. 
+ + + + Iterative Vision-and-Language Navigation + http://openaccess.thecvf.com//content/CVPR2023/papers/Krantz_Iterative_Vision-and-Language_Navigation_CVPR_2023_paper.pdf + We present Iterative Vision-and-Language Navigation (IVLN), a paradigm for evaluating language-guided agents navigating in a persistent environment over time. Existing Vision-and-Language Navigation (VLN) benchmarks erase the agent's memory at the beginning of every episode, testing the ability to perform cold-start navigation with no prior information. However, deployed robots occupy the same environment for long periods of time. The IVLN paradigm addresses this disparity by training and evaluating VLN agents that maintain memory across tours of scenes that consist of up to 100 ordered instruction-following Room-to-Room (R2R) episodes, each defined by an individual language instruction and a target path. We present discrete and continuous Iterative Room-to-Room (IR2R) benchmarks comprising about 400 tours each in 80 indoor scenes. We find that extending the implicit memory of high-performing transformer VLN agents is not sufficient for IVLN, but agents that build maps can benefit from environment persistence, motivating a renewed focus on map-building agents in VLN. + + + + Adversarial Counterfactual Visual Explanations + http://openaccess.thecvf.com//content/CVPR2023/papers/Jeanneret_Adversarial_Counterfactual_Visual_Explanations_CVPR_2023_paper.pdf + Counterfactual explanations and adversarial attacks have a related goal: flipping output labels with minimal perturbations regardless of their characteristics. Yet, adversarial attacks cannot be used directly in a counterfactual explanation perspective, as such perturbations are perceived as noise and not as actionable and understandable image modifications. Building on the robust learning literature, this paper proposes an elegant method to turn adversarial attacks into semantically meaningful perturbations, without modifying the classifiers to explain. The proposed approach hypothesizes that Denoising Diffusion Probabilistic Models are excellent regularizers for avoiding high-frequency and out-of-distribution perturbations when generating adversarial attacks. The paper's key idea is to build attacks through a diffusion model to polish them. This allows studying the target model regardless of its robustification level. Extensive experimentation shows the advantages of our counterfactual explanation approach over current State-of-the-Art in multiple testbeds. + + + + MaLP: Manipulation Localization Using a Proactive Scheme + http://openaccess.thecvf.com//content/CVPR2023/papers/Asnani_MaLP_Manipulation_Localization_Using_a_Proactive_Scheme_CVPR_2023_paper.pdf + Advancements in the generation quality of various Generative Models (GMs) has made it necessary to not only perform binary manipulation detection but also localize the modified pixels in an image. However, prior works termed as passive for manipulation localization exhibit poor generalization performance over unseen GMs and attribute modifications. To combat this issue, we propose a proactive scheme for manipulation localization, termed MaLP. We encrypt the real images by adding a learned template. If the image is manipulated by any GM, this added protection from the template not only aids binary detection but also helps in identifying the pixels modified by the GM. The template is learned by leveraging local and global-level features estimated by a two-branch architecture. 
We show that MaLP performs better than prior passive works. We also show the generalizability of MaLP by testing on 22 different GMs, providing a benchmark for future research on manipulation localization. Finally, we show that MaLP can be used as a discriminator for improving the generation quality of GMs. Our models/codes are available at www.github.com/vishal3477/pro_loc. + + + + MM-Diffusion: Learning Multi-Modal Diffusion Models for Joint Audio and Video Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ruan_MM-Diffusion_Learning_Multi-Modal_Diffusion_Models_for_Joint_Audio_and_Video_CVPR_2023_paper.pdf + We propose the first joint audio-video generation framework that brings engaging watching and listening experiences simultaneously, towards high-quality realistic videos. To generate joint audio-video pairs, we propose a novel Multi-Modal Diffusion model (i.e., MM-Diffusion), with two-coupled denoising autoencoders. In contrast to existing single-modal diffusion models, MM-Diffusion consists of a sequential multi-modal U-Net for a joint denoising process by design. Two subnets for audio and video learn to gradually generate aligned audio-video pairs from Gaussian noises. To ensure semantic consistency across modalities, we propose a novel random-shift based attention block bridging over the two subnets, which enables efficient cross-modal alignment, and thus reinforces the audio-video fidelity for each other. Extensive experiments show superior results in unconditional audio-video generation, and zero-shot conditional tasks (e.g., video-to-audio). In particular, we achieve the best FVD and FAD on Landscape and AIST++ dancing datasets. Turing tests of 10k votes further demonstrate dominant preferences for our model. + + + + Robust Generalization Against Photon-Limited Corruptions via Worst-Case Sharpness Minimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Robust_Generalization_Against_Photon-Limited_Corruptions_via_Worst-Case_Sharpness_Minimization_CVPR_2023_paper.pdf + Robust generalization aims to tackle the most challenging data distributions which are rare in the training set and contain severe noises, i.e., photon-limited corruptions. Common solutions such as distributionally robust optimization (DRO) focus on the worst-case empirical risk to ensure low training error on the uncommon noisy distributions. However, due to the over-parameterized model being optimized on scarce worst-case data, DRO fails to produce a smooth loss landscape, thus struggling on generalizing well to the test set. Therefore, instead of focusing on the worst-case risk minimization, we propose SharpDRO by penalizing the sharpness of the worst-case distribution, which measures the loss changes around the neighbor of learning parameters. Through worst-case sharpness minimization, the proposed method successfully produces a flat loss curve on the corrupted distributions, thus achieving robust generalization. Moreover, by considering whether the distribution annotation is available, we apply SharpDRO to two problem settings and design a worst-case selection process for robust generalization. Theoretically, we show that SharpDRO has a great convergence guarantee. Experimentally, we simulate photon-limited corruptions using CIFAR10/100 and ImageNet30 datasets and show that SharpDRO exhibits a strong generalization ability against severe corruptions and exceeds well-known baseline methods with large performance gains. 
+ + + + Point2Pix: Photo-Realistic Point Cloud Rendering via Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Point2Pix_Photo-Realistic_Point_Cloud_Rendering_via_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Synthesizing photo-realistic images from a point cloud is challenging because of the sparsity of point cloud representation. Recent Neural Radiance Fields and extensions are proposed to synthesize realistic images from 2D input. In this paper, we present Point2Pix as a novel point renderer to link the 3D sparse point clouds with 2D dense image pixels. Taking advantage of the point cloud 3D prior and NeRF rendering pipeline, our method can synthesize high-quality images from colored point clouds, generally for novel indoor scenes. To improve the efficiency of ray sampling, we propose point-guided sampling, which focuses on valid samples. Also, we present Point Encoding to build Multi-scale Radiance Fields that provide discriminative 3D point features. Finally, we propose Fusion Encoding to efficiently synthesize high-quality images. Extensive experiments on the ScanNet and ArkitScenes datasets demonstrate the effectiveness and generalization. + + + + NICO++: Towards Better Benchmarking for Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_NICO_Towards_Better_Benchmarking_for_Domain_Generalization_CVPR_2023_paper.pdf + Despite the remarkable performance that modern deep neural networks have achieved on independent and identically distributed (I.I.D.) data, they can crash under distribution shifts. Most current evaluation methods for domain generalization (DG) adopt the leave-one-out strategy as a compromise on the limited number of domains. We propose a large-scale benchmark with extensive labeled domains named NICO++ along with more rational evaluation methods for comprehensively evaluating DG algorithms. To evaluate DG datasets, we propose two metrics to quantify covariate shift and concept shift, respectively. Two novel generalization bounds from the perspective of data construction are proposed to prove that limited concept shift and significant covariate shift favor the evaluation capability for generalization. Through extensive experiments, NICO++ shows its superior evaluation capability compared with current DG datasets and its contribution in alleviating unfairness caused by the leak of oracle knowledge in model selection. + + + + CHMATCH: Contrastive Hierarchical Matching and Robust Adaptive Threshold Boosted Semi-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_CHMATCH_Contrastive_Hierarchical_Matching_and_Robust_Adaptive_Threshold_Boosted_Semi-Supervised_CVPR_2023_paper.pdf + The recently proposed FixMatch and FlexMatch have achieved remarkable results in the field of semi-supervised learning. But these two methods go to two extremes as FixMatch and FlexMatch use a pre-defined constant threshold for all classes and an adaptive threshold for each category, respectively. By only investigating consistency regularization, they also suffer from unstable results and indiscriminative feature representation, especially under the situation of few labeled samples. In this paper, we propose a novel CHMatch method, which can learn robust adaptive thresholds for instance-level prediction matching as well as discriminative features by contrastive hierarchical matching. We first present a memory-bank based robust threshold learning strategy to select highly-confident samples. 
In the meantime, we make full use of the structured information in the hierarchical labels to learn an accurate affinity graph for contrastive learning. CHMatch achieves very stable and superior results on several commonly-used benchmarks. For example, CHMatch achieves 8.44% and 9.02% error rate reduction over FlexMatch on CIFAR-100 under WRN-28-2 with only 4 and 25 labeled samples per class, respectively. + + + + Neural Dependencies Emerging From Learning Massive Categories + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Neural_Dependencies_Emerging_From_Learning_Massive_Categories_CVPR_2023_paper.pdf + This work presents two astonishing findings on neural networks learned for large-scale image classification. 1) Given a well-trained model, the logits predicted for some category can be directly obtained by linearly combining the predictions of a few other categories, which we call neural dependency. 2) Neural dependencies exist not only within a single model, but even between two independently learned models, regardless of their architectures. Towards a theoretical analysis of such phenomena, we demonstrate that identifying neural dependencies is equivalent to solving the Covariance Lasso (CovLasso) regression problem proposed in this paper. Through investigating the properties of the problem solution, we confirm that neural dependency is guaranteed by a redundant logit covariance matrix, a condition easily met given massive categories, and that neural dependency is sparse, which implies one category relates to only a few others. We further empirically show the potential of neural dependencies in understanding internal data correlations, generalizing models to unseen categories, and improving model robustness with a dependency-derived regularizer. Code to exactly reproduce the results in this work will be released publicly. + + + + ARCTIC: A Dataset for Dexterous Bimanual Hand-Object Manipulation + http://openaccess.thecvf.com//content/CVPR2023/papers/Fan_ARCTIC_A_Dataset_for_Dexterous_Bimanual_Hand-Object_Manipulation_CVPR_2023_paper.pdf + Humans intuitively understand that inanimate objects do not move by themselves, but that state changes are typically caused by human manipulation (e.g., the opening of a book). This is not yet the case for machines. In part this is because there exist no datasets with ground-truth 3D annotations for the study of physically consistent and synchronised motion of hands and articulated objects. To this end, we introduce ARCTIC -- a dataset of two hands that dexterously manipulate objects, containing 2.1M video frames paired with accurate 3D hand and object meshes and detailed, dynamic contact information. It contains bi-manual articulation of objects such as scissors or laptops, where hand poses and object states evolve jointly in time. We propose two novel articulated hand-object interaction tasks: (1) Consistent motion reconstruction: Given a monocular video, the goal is to reconstruct two hands and articulated objects in 3D, so that their motions are spatio-temporally consistent. (2) Interaction field estimation: Dense relative hand-object distances must be estimated from images. We introduce two baselines, ArcticNet and InterField, respectively, and evaluate them qualitatively and quantitatively on ARCTIC. Our code and data are available at https://arctic.is.tue.mpg.de.
+ + + + MAGVIT: Masked Generative Video Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_MAGVIT_Masked_Generative_Video_Transformer_CVPR_2023_paper.pdf + We introduce the MAsked Generative VIdeo Transformer, MAGVIT, to tackle various video synthesis tasks with a single model. We introduce a 3D tokenizer to quantize a video into spatial-temporal visual tokens and propose an embedding method for masked video token modeling to facilitate multi-task learning. We conduct extensive experiments to demonstrate the quality, efficiency, and flexibility of MAGVIT. Our experiments show that (i) MAGVIT performs favorably against state-of-the-art approaches and establishes the best-published FVD on three video generation benchmarks, including the challenging Kinetics-600. (ii) MAGVIT outperforms existing methods in inference time by two orders of magnitude against diffusion models and by 60x against autoregressive models. (iii) A single MAGVIT model supports ten diverse generation tasks and generalizes across videos from different visual domains. The source code and trained models will be released to the public at https://magvit.cs.cmu.edu. + + + + Hidden Gems: 4D Radar Scene Flow Learning Using Cross-Modal Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_Hidden_Gems_4D_Radar_Scene_Flow_Learning_Using_Cross-Modal_Supervision_CVPR_2023_paper.pdf + This work proposes a novel approach to 4D radar-based scene flow estimation via cross-modal learning. Our approach is motivated by the co-located sensing redundancy in modern autonomous vehicles. Such redundancy implicitly provides various forms of supervision cues to the radar scene flow estimation. Specifically, we introduce a multi-task model architecture for the identified cross-modal learning problem and propose loss functions to opportunistically engage scene flow estimation using multiple cross-modal constraints for effective model training. Extensive experiments show the state-of-the-art performance of our method and demonstrate the effectiveness of cross-modal supervised learning to infer more accurate 4D radar scene flow. We also show its usefulness to two subtasks - motion segmentation and ego-motion estimation. Our source code will be available on https://github.com/Toytiny/CMFlow. + + + + OmniMAE: Single Model Masked Pretraining on Images and Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Girdhar_OmniMAE_Single_Model_Masked_Pretraining_on_Images_and_Videos_CVPR_2023_paper.pdf + Transformer-based architectures have become competitive across a variety of visual domains, most notably images and videos. While prior work studies these modalities in isolation, having a common architecture suggests that one can train a single unified model for multiple visual modalities. Prior attempts at unified modeling typically use architectures tailored for vision tasks, or obtain worse performance compared to single modality models. In this work, we show that masked autoencoding can be used to train a simple Vision Transformer on images and videos, without requiring any labeled data. This single model learns visual representations that are comparable to or better than single-modality representations on both image and video benchmarks, while using a much simpler architecture. Furthermore, this model can be learned by dropping 90% of the image and 95% of the video patches, enabling extremely fast training of huge model architectures. 
In particular, we show that our single ViT-Huge model can be finetuned to achieve 86.6% on ImageNet and 75.5% on the challenging Something Something-v2 video benchmark, setting a new state-of-the-art. + + + + Real-Time Neural Light Field on Mobile Devices + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Real-Time_Neural_Light_Field_on_Mobile_Devices_CVPR_2023_paper.pdf + Recent efforts in Neural Radiance Fields (NeRF) have shown impressive results on novel view synthesis by utilizing implicit neural representation to represent 3D scenes. Due to the process of volumetric rendering, the inference speed for NeRF is extremely slow, limiting the application scenarios of utilizing NeRF on resource-constrained hardware, such as mobile devices. Many works have been conducted to reduce the latency of running NeRF models. However, most of them still require a high-end GPU for acceleration or extra storage memory, neither of which is available on mobile devices. Another emerging direction utilizes the neural light field (NeLF) for speedup, as only one forward pass is performed on a ray to predict the pixel color. Nevertheless, to reach a similar rendering quality as NeRF, the network in NeLF is designed with intensive computation, which is not mobile-friendly. In this work, we propose an efficient network that runs in real-time on mobile devices for neural rendering. We follow the setting of NeLF to train our network. Unlike existing works, we introduce a novel network architecture that runs efficiently on mobile devices with low latency and small size, i.e., saving 15x-24x storage compared with MobileNeRF. Our model achieves high-resolution generation while maintaining real-time inference for both synthetic and real-world scenes on mobile devices, e.g., 18.04ms (iPhone 13) for rendering one 1008x756 image of real 3D scenes. Additionally, we achieve similar image quality as NeRF and better quality than MobileNeRF (PSNR 26.15 vs. 25.91 on the real-world forward-facing dataset). + + + + End-to-End Video Matting With Trimap Propagation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_End-to-End_Video_Matting_With_Trimap_Propagation_CVPR_2023_paper.pdf + The research of video matting mainly focuses on temporal coherence and has gained significant improvement via neural networks. However, matting usually relies on user-annotated trimaps to estimate alpha values, which is a labor-intensive issue. Although recent studies exploit video object segmentation methods to propagate the given trimaps, they suffer from inconsistent results. Here we present a more robust and faster end-to-end video matting model equipped with trimap propagation called FTP-VM (Fast Trimap Propagation - Video Matting). The FTP-VM combines trimap propagation and video matting in one model, where the additional backbone in memory matching is replaced with the proposed lightweight trimap fusion module. The segmentation consistency loss is adopted from automotive segmentation to fit trimap segmentation with the collaboration of RNN (Recurrent Neural Network) to improve the temporal coherence. The experimental results demonstrate that the FTP-VM performs competitively in both composited and real videos with only a few given trimaps. The efficiency is eight times higher than that of the state-of-the-art methods, which confirms its robustness and applicability in real-time scenarios.
The code is available at https://github.com/csvt32745/FTP-VM + + + + DropMAE: Masked Autoencoders With Spatial-Attention Dropout for Tracking Tasks + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_DropMAE_Masked_Autoencoders_With_Spatial-Attention_Dropout_for_Tracking_Tasks_CVPR_2023_paper.pdf + In this paper, we study masked autoencoder (MAE) pretraining on videos for matching-based downstream tasks, including visual object tracking (VOT) and video object segmentation (VOS). A simple extension of MAE is to randomly mask out frame patches in videos and reconstruct the frame pixels. However, we find that this simple baseline heavily relies on spatial cues while ignoring temporal relations for frame reconstruction, thus leading to sub-optimal temporal matching representations for VOT and VOS. To alleviate this problem, we propose DropMAE, which adaptively performs spatial-attention dropout in the frame reconstruction to facilitate temporal correspondence learning in videos. We show that our DropMAE is a strong and efficient temporal matching learner, which achieves better finetuning results on matching-based tasks than the ImageNet-based MAE with 2x faster pre-training speed. Moreover, we also find that motion diversity in pre-training videos is more important than scene diversity for improving the performance on VOT and VOS. Our pre-trained DropMAE model can be directly loaded in existing ViT-based trackers for fine-tuning without further modifications. Notably, DropMAE sets new state-of-the-art performance on 8 out of 9 highly competitive video tracking and segmentation datasets. Our code and pre-trained models are available at https://github.com/jimmy-dq/DropMAE.git. + + + + High-Fidelity Clothed Avatar Reconstruction From a Single Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Liao_High-Fidelity_Clothed_Avatar_Reconstruction_From_a_Single_Image_CVPR_2023_paper.pdf + This paper presents a framework for efficient 3D clothed avatar reconstruction. By combining the advantages of the high accuracy of optimization-based methods and the efficiency of learning-based methods, we propose a coarse-to-fine way to realize a high-fidelity clothed avatar reconstruction (CAR) from a single image. At the first stage, we use an implicit model to learn the general shape in the canonical space of a person in a learning-based way, and at the second stage, we refine the surface detail by estimating the non-rigid deformation in the posed space in an optimization way. A hyper-network is utilized to generate a good initialization so that the convergence of the optimization process is greatly accelerated. Extensive experiments on various datasets show that the proposed CAR successfully produces high-fidelity avatars for arbitrarily clothed humans in real scenes. The codes will be released. + + + + Zero-Shot Object Counting + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Zero-Shot_Object_Counting_CVPR_2023_paper.pdf + Class-agnostic object counting aims to count object instances of an arbitrary class at test time. It is challenging but also enables many potential applications. Current methods require human-annotated exemplars as inputs, which are often unavailable for novel categories, especially for autonomous systems. Thus, we propose zero-shot object counting (ZSC), a new setting where only the class name is available during test time. Such a counting system does not require human annotators in the loop and can operate automatically.
Starting from a class name, we propose a method that can accurately identify the optimal patches which can then be used as counting exemplars. Specifically, we first construct a class prototype to select the patches that are likely to contain the objects of interest, namely class-relevant patches. Furthermore, we introduce a model that can quantitatively measure how suitable an arbitrary patch is as a counting exemplar. By applying this model to all the candidate patches, we can select the most suitable patches as exemplars for counting. Experimental results on a recent class-agnostic counting dataset, FSC-147, validate the effectiveness of our method. + + + + Implicit Diffusion Models for Continuous Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_Implicit_Diffusion_Models_for_Continuous_Super-Resolution_CVPR_2023_paper.pdf + Image super-resolution (SR) has attracted increasing attention due to its wide applications. However, current SR methods generally suffer from over-smoothing and artifacts, and most work only with fixed magnifications. This paper introduces an Implicit Diffusion Model (IDM) for high-fidelity continuous image super-resolution. IDM integrates an implicit neural representation and a denoising diffusion model in a unified end-to-end framework, where the implicit neural representation is adopted in the decoding process to learn continuous-resolution representation. Furthermore, we design a scale-controllable conditioning mechanism that consists of a low-resolution (LR) conditioning network and a scaling factor. The scaling factor regulates the resolution and accordingly modulates the proportion of the LR information and generated features in the final output, which enables the model to accommodate the continuous-resolution requirement. Extensive experiments validate the effectiveness of our IDM and demonstrate its superior performance over prior arts. + + + + Phase-Shifting Coder: Predicting Accurate Orientation in Oriented Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Phase-Shifting_Coder_Predicting_Accurate_Orientation_in_Oriented_Object_Detection_CVPR_2023_paper.pdf + With the vigorous development of computer vision, oriented object detection has gradually been featured. In this paper, a novel differentiable angle coder named phase-shifting coder (PSC) is proposed to accurately predict the orientation of objects, along with a dual-frequency version (PSCD). By mapping the rotational periodicity of different cycles into the phase of different frequencies, we provide a unified framework for various periodic fuzzy problems caused by rotational symmetry in oriented object detection. Upon such a framework, common problems in oriented object detection such as boundary discontinuity and square-like problems are elegantly solved in a unified form. Visual analysis and experiments on three datasets prove the effectiveness and the potentiality of our approach. When facing scenarios requiring high-quality bounding boxes, the proposed methods are expected to give a competitive performance. The codes are publicly available at https://github.com/open-mmlab/mmrotate. + + + + Neural Lens Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Xian_Neural_Lens_Modeling_CVPR_2023_paper.pdf + Recent methods for 3D reconstruction and rendering increasingly benefit from end-to-end optimization of the entire image formation process. 
However, this approach is currently limited: effects of the optical hardware stack and in particular lenses are hard to model in a unified way. This limits the quality that can be achieved for camera calibration and the fidelity of the results of 3D reconstruction. In this paper, we propose NeuroLens, a neural lens model for distortion and vignetting that can be used for point projection and ray casting and can be optimized through both operations. This means that it can (optionally) be used to perform pre-capture calibration using classical calibration targets, and can later be used to perform calibration or refinement during 3D reconstruction, e.g., while optimizing a radiance field. To evaluate the performance of our proposed model, we create a comprehensive dataset assembled from the Lensfun database with a multitude of lenses. Using this and other real-world datasets, we show that the quality of our proposed lens model outperforms standard packages as well as recent approaches while being much easier to use and extend. The model generalizes across many lens types and is trivial to integrate into existing 3D reconstruction and rendering systems. Visit our project website at: https://neural-lens.github.io. + + + + CoralStyleCLIP: Co-Optimized Region and Layer Selection for Image Editing + http://openaccess.thecvf.com//content/CVPR2023/papers/Revanur_CoralStyleCLIP_Co-Optimized_Region_and_Layer_Selection_for_Image_Editing_CVPR_2023_paper.pdf + Edit fidelity is a significant issue in open-world controllable generative image editing. Recently, CLIP-based approaches have traded off simplicity to alleviate these problems by introducing spatial attention in a handpicked layer of a StyleGAN. In this paper, we propose CoralStyleCLIP, which incorporates a multi-layer attention-guided blending strategy in the feature space of StyleGAN2 for obtaining high-fidelity edits. We propose multiple forms of our co-optimized region and layer selection strategy to demonstrate the variation of time complexity with the quality of edits over different architectural intricacies while preserving simplicity. We conduct extensive experimental analysis and benchmark our method against state-of-the-art CLIP-based methods. Our findings suggest that CoralStyleCLIP results in high-quality edits while preserving the ease of use. + + + + GLeaD: Improving GANs With a Generator-Leading Task + http://openaccess.thecvf.com//content/CVPR2023/papers/Bai_GLeaD_Improving_GANs_With_a_Generator-Leading_Task_CVPR_2023_paper.pdf + Generative adversarial network (GAN) is formulated as a two-player game between a generator (G) and a discriminator (D), where D is asked to differentiate whether an image comes from real data or is produced by G. Under such a formulation, D plays as the rule maker and hence tends to dominate the competition. Towards a fairer game in GANs, we propose a new paradigm for adversarial training, which makes G assign a task to D as well. Specifically, given an image, we expect D to extract representative features that can be adequately decoded by G to reconstruct the input. That way, instead of learning freely, D is urged to align with the view of G for domain classification. Experimental results on various datasets demonstrate the substantial superiority of our approach over the baselines. For instance, we improve the FID of StyleGAN2 from 4.30 to 2.55 on LSUN Bedroom and from 4.04 to 2.82 on LSUN Church. 
We believe that the pioneering attempt present in this work could inspire the community with better designed generator-leading tasks for GAN improvement. Project page is at https://ezioby.github.io/glead/. + + + + GALIP: Generative Adversarial CLIPs for Text-to-Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Tao_GALIP_Generative_Adversarial_CLIPs_for_Text-to-Image_Synthesis_CVPR_2023_paper.pdf + Synthesizing high-fidelity complex images from text is challenging. Based on large pretraining, the autoregressive and diffusion models can synthesize photo-realistic images. Although these large models have shown notable progress, there remain three flaws. 1) These models require tremendous training data and parameters to achieve good performance. 2) The multi-step generation design slows the image synthesis process heavily. 3) The synthesized visual features are challenging to control and require delicately designed prompts. To enable high-quality, efficient, fast, and controllable text-to-image synthesis, we propose Generative Adversarial CLIPs, namely GALIP. GALIP leverages the powerful pretrained CLIP model both in the discriminator and generator. Specifically, we propose a CLIP-based discriminator. The complex scene understanding ability of CLIP enables the discriminator to accurately assess the image quality. Furthermore, we propose a CLIP-empowered generator that induces the visual concepts from CLIP through bridge features and prompts. The CLIP-integrated generator and discriminator boost training efficiency, and as a result, our model only requires about 3% training data and 6% learnable parameters, achieving comparable results to large pretrained autoregressive and diffusion models. Moreover, our model achieves 120 times faster synthesis speed and inherits the smooth latent space from GAN. The extensive experimental results demonstrate the excellent performance of our GALIP. Code is available at https://github.com/tobran/GALIP. + + + + Indiscernible Object Counting in Underwater Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Indiscernible_Object_Counting_in_Underwater_Scenes_CVPR_2023_paper.pdf + Recently, indiscernible scene understanding has attracted a lot of attention in the vision community. We further advance the frontier of this field by systematically studying a new challenge named indiscernible object counting (IOC), the goal of which is to count objects that are blended with respect to their surroundings. Due to a lack of appropriate IOC datasets, we present a large-scale dataset IOCfish5K which contains a total of 5,637 high-resolution images and 659,024 annotated center points. Our dataset consists of a large number of indiscernible objects (mainly fish) in underwater scenes, making the annotation process all the more challenging. IOCfish5K is superior to existing datasets with indiscernible scenes because of its larger scale, higher image resolutions, more annotations, and denser scenes. All these aspects make it the most challenging dataset for IOC so far, supporting progress in this area. For benchmarking purposes, we select 14 mainstream methods for object counting and carefully evaluate them on IOCfish5K. Furthermore, we propose IOCFormer, a new strong baseline that combines density and regression branches in a unified framework and can effectively tackle object counting under concealed scenes. Experiments show that IOCFormer achieves state-of-the-art scores on IOCfish5K. 
+ + + + Low-Light Image Enhancement via Structure Modeling and Guidance + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Low-Light_Image_Enhancement_via_Structure_Modeling_and_Guidance_CVPR_2023_paper.pdf + This paper proposes a new framework for low-light image enhancement by simultaneously conducting the appearance as well as structure modeling. It employs the structural feature to guide the appearance enhancement, leading to sharp and realistic results. The structure modeling in our framework is implemented as the edge detection in low-light images. It is achieved with a modified generative model via designing a structure-aware feature extractor and generator. The detected edge maps can accurately emphasize the essential structural information, and the edge prediction is robust towards the noises in dark areas. Moreover, to improve the appearance modeling, which is implemented with a simple U-Net, a novel structure-guided enhancement module is proposed with structure-guided feature synthesis layers. The appearance modeling, edge detector, and enhancement module can be trained end-to-end. The experiments are conducted on representative datasets (sRGB and RAW domains), showing that our model consistently achieves SOTA performance on all datasets with the same architecture. + + + + Physics-Driven Diffusion Models for Impact Sound Synthesis From Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Su_Physics-Driven_Diffusion_Models_for_Impact_Sound_Synthesis_From_Videos_CVPR_2023_paper.pdf + Modeling sounds emitted from physical object interactions is critical for immersive perceptual experiences in real and virtual worlds. Traditional methods of impact sound synthesis use physics simulation to obtain a set of physics parameters that could represent and synthesize the sound. However, they require fine details of both the object geometries and impact locations, which are rarely available in the real world and cannot be applied to synthesize impact sounds from common videos. On the other hand, existing video-driven deep learning-based approaches could only capture the weak correspondence between visual content and impact sounds since they lack physics knowledge. In this work, we propose a physics-driven diffusion model that can synthesize high-fidelity impact sound for a silent video clip. In addition to the video content, we propose to use additional physics priors to guide the impact sound synthesis procedure. The physics priors include both physics parameters that are directly estimated from noisy real-world impact sound examples without sophisticated setup and learned residual parameters that interpret the sound environment via neural networks. We further implement a novel diffusion model with specific training and inference strategies to combine physics priors and visual information for impact sound synthesis. Experimental results show that our model outperforms several existing systems in generating realistic impact sounds. More importantly, the physics-based representations are fully interpretable and transparent, thus enabling us to perform sound editing flexibly. We encourage the readers to visit our project page to watch demo videos with audio turned on to experience the results. 
+ + + + Alias-Free Convnets: Fractional Shift Invariance via Polynomial Activations + http://openaccess.thecvf.com//content/CVPR2023/papers/Michaeli_Alias-Free_Convnets_Fractional_Shift_Invariance_via_Polynomial_Activations_CVPR_2023_paper.pdf + Although CNNs are believed to be invariant to translations, recent works have shown this is not the case due to aliasing effects that stem from down-sampling layers. The existing architectural solutions to prevent the aliasing effects are partial since they do not solve those effects that originate in non-linearities. We propose an extended anti-aliasing method that tackles both down-sampling and non-linear layers, thus creating truly alias-free, shift-invariant CNNs. We show that the presented model is invariant to integer as well as fractional (i.e., sub-pixel) translations, thus outperforming other shift-invariant methods in terms of robustness to adversarial translations. + + + + Shortcomings of Top-Down Randomization-Based Sanity Checks for Evaluations of Deep Neural Network Explanations + http://openaccess.thecvf.com//content/CVPR2023/papers/Binder_Shortcomings_of_Top-Down_Randomization-Based_Sanity_Checks_for_Evaluations_of_Deep_CVPR_2023_paper.pdf + While the evaluation of explanations is an important step towards trustworthy models, it needs to be done carefully, and the employed metrics need to be well-understood. Specifically, model randomization testing can be overinterpreted if regarded as a primary criterion for selecting or discarding explanation methods. To address shortcomings of this test, we start by observing an experimental gap in the ranking of explanation methods between randomization-based sanity checks [1] and model output faithfulness measures (e.g. [20]). We identify limitations of model-randomization-based sanity checks for the purpose of evaluating explanations. Firstly, we show that uninformative attribution maps created with zero pixel-wise covariance easily achieve high scores in this type of check. Secondly, we show that top-down model randomization preserves scales of forward pass activations with high probability. That is, channels with large activations have a high probability of contributing strongly to the output, even after randomization of the network on top of them. Hence, explanations after randomization can only be expected to differ to a certain extent. This explains the observed experimental gap. In summary, these results demonstrate the inadequacy of model-randomization-based sanity checks as a criterion to rank attribution methods. + + + + Neural Part Priors: Learning To Optimize Part-Based Object Completion in RGB-D Scans + http://openaccess.thecvf.com//content/CVPR2023/papers/Bokhovkin_Neural_Part_Priors_Learning_To_Optimize_Part-Based_Object_Completion_in_CVPR_2023_paper.pdf + 3D scene understanding has seen significant advances in recent years, but has largely focused on object understanding in 3D scenes with independent per-object predictions. We thus propose to learn Neural Part Priors (NPPs), parametric spaces of objects and their parts, that enable optimizing to fit to a new input 3D scan geometry with global scene consistency constraints. The rich structure of our NPPs enables accurate, holistic scene reconstruction across similar objects in the scene. Both objects and their part geometries are characterized by coordinate field MLPs, facilitating optimization at test time to fit to input geometric observations as well as similar objects in the input scan. 
This enables more accurate reconstructions than independent per-object predictions as a single forward pass, while establishing global consistency within a scene. Experiments on the ScanNet dataset demonstrate that NPPs significantly outperform the state-of-the-art in part decomposition and object completion in real-world scenes. + + + + Towards Trustable Skin Cancer Diagnosis via Rewriting Model's Decision + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_Towards_Trustable_Skin_Cancer_Diagnosis_via_Rewriting_Models_Decision_CVPR_2023_paper.pdf + Deep neural networks have demonstrated promising performance on image recognition tasks. However, they may heavily rely on confounding factors, using irrelevant artifacts or bias within the dataset as the cue to improve performance. When a model performs decision-making based on these spurious correlations, it can become untrustable and lead to catastrophic outcomes when deployed in the real-world scene. In this paper, we explore and try to solve this problem in the context of skin cancer diagnosis. We introduce a human-in-the-loop framework in the model training process such that users can observe and correct the model's decision logic when confounding behaviors happen. Specifically, our method can automatically discover confounding factors by analyzing the co-occurrence behavior of the samples. It is capable of learning confounding concepts using easily obtained concept exemplars. By mapping the blackbox model's feature representation onto an explainable concept space, human users can interpret the concept and intervene via first-order logic instruction. We systematically evaluate our method on our newly crafted, well-controlled skin lesion dataset and several public skin lesion datasets. Experiments show that our method can effectively detect and remove confounding factors from datasets without any prior knowledge about the category distribution and does not require fully annotated concept labels. We also show that our method enables the model to focus on clinically related concepts, improving the model's performance and trustworthiness during model inference. + + + + FeatER: An Efficient Network for Human Reconstruction via Feature Map-Based TransformER + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_FeatER_An_Efficient_Network_for_Human_Reconstruction_via_Feature_Map-Based_CVPR_2023_paper.pdf + Recently, vision transformers have shown great success in a set of human reconstruction tasks such as 2D human pose estimation (2D HPE), 3D human pose estimation (3D HPE), and human mesh reconstruction (HMR) tasks. In these tasks, feature map representations of the human structural information are often extracted first from the image by a CNN (such as HRNet), and then further processed by transformer to predict the heatmaps (encodes each joint's location into a feature map with a Gaussian distribution) for HPE or HMR. However, existing transformer architectures are not able to process these feature map inputs directly, forcing an unnatural flattening of the location-sensitive human structural information. Furthermore, much of the performance benefit in recent HPE and HMR methods has come at the cost of ever-increasing computation and memory needs. Therefore, to simultaneously address these problems, we propose FeatER, a novel transformer design which preserves the inherent structure of feature map representations when modeling attention while reducing the memory and computational costs. 
Taking advantage of FeatER, we build an efficient network for a set of human reconstruction tasks including 2D HPE, 3D HPE, and HMR. A feature map reconstruction module is applied to improve the performance of the estimated human pose and mesh. Extensive experiments demonstrate the effectiveness of FeatER on various human pose and mesh datasets. For instance, FeatER outperforms the SOTA method MeshGraphormer by requiring 5% of Params (total parameters) and 16% of MACs (the Multiply-Accumulate Operations) on Human3.6M and 3DPW datasets. Code will be publicly available. + + + + Visibility Constrained Wide-Band Illumination Spectrum Design for Seeing-in-the-Dark + http://openaccess.thecvf.com//content/CVPR2023/papers/Niu_Visibility_Constrained_Wide-Band_Illumination_Spectrum_Design_for_Seeing-in-the-Dark_CVPR_2023_paper.pdf + Seeing-in-the-dark is one of the most important and challenging computer vision tasks due to its wide applications and extreme complexities of in-the-wild scenarios. Existing arts can be mainly divided into two threads: 1) RGB-dependent methods restore information using degraded RGB inputs only (e.g., low-light enhancement), 2) RGB-independent methods translate images captured under auxiliary near-infrared (NIR) illuminants into RGB domain (e.g., NIR2RGB translation). The latter is very attractive since it works in complete darkness and the illuminants are visually friendly to naked eyes, but tends to be unstable due to its intrinsic ambiguities. In this paper, we try to robustify NIR2RGB translation by designing the optimal spectrum of auxiliary illumination in the wide-band VIS-NIR range, while keeping visual friendliness. Our core idea is to quantify the visibility constraint implied by the human vision system and incorporate it into the design pipeline. By modeling the formation process of images in the VIS-NIR range, the optimal multiplexing of a wide range of LEDs is automatically designed in a fully differentiable manner, within the feasible region defined by the visibility constraint. We also collect a substantially expanded VIS-NIR hyperspectral image dataset for experiments by using a customized 50-band filter wheel. Experimental results show that the task can be significantly improved by using the optimized wide-band illumination rather than using NIR only. Codes Available: https://github.com/MyNiuuu/VCSD. + + + + Learning With Noisy Labels via Self-Supervised Adversarial Noisy Masking + http://openaccess.thecvf.com//content/CVPR2023/papers/Tu_Learning_With_Noisy_Labels_via_Self-Supervised_Adversarial_Noisy_Masking_CVPR_2023_paper.pdf + Collecting large-scale datasets is crucial for training deep models; annotating the data, however, inevitably yields noisy labels, which poses challenges to deep learning algorithms. Previous efforts tend to mitigate this problem via identifying and removing noisy samples or correcting their labels according to the statistical properties (e.g., loss values) among training samples. In this paper, we aim to tackle this problem from a new perspective. Delving into the deep feature maps, we empirically find that models trained with clean and mislabeled samples manifest distinguishable activation feature distributions. From this observation, a novel robust training approach termed adversarial noisy masking is proposed. The idea is to regularize deep features with a label quality guided masking scheme, which adaptively modulates the input data and label simultaneously, preventing the model from overfitting noisy samples. 
Further, an auxiliary task is designed to reconstruct input data; it naturally provides noise-free self-supervised signals to reinforce the generalization ability of deep models. The proposed method is simple and flexible; it is tested on both synthetic and real-world noisy datasets, where significant improvements are achieved over previous state-of-the-art methods. + + + + Towards Domain Generalization for Multi-View 3D Object Detection in Bird-Eye-View + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Towards_Domain_Generalization_for_Multi-View_3D_Object_Detection_in_Bird-Eye-View_CVPR_2023_paper.pdf + Multi-view 3D object detection (MV3D-Det) in Bird-Eye-View (BEV) has drawn extensive attention due to its low cost and high efficiency. Although new algorithms for camera-only 3D object detection have been continuously proposed, most of them may risk drastic performance degradation when the domain of input images differs from that of training. In this paper, we first analyze the causes of the domain gap for the MV3D-Det task. Based on the covariate shift assumption, we find that the gap is mainly attributable to the feature distribution of BEV, which is determined by the quality of both depth estimation and 2D image's feature representation. To acquire a robust depth prediction, we propose to decouple the depth estimation from the intrinsic parameters of the camera (i.e. the focal length) through converting the prediction of metric depth to that of scale-invariant depth and perform dynamic perspective augmentation to increase the diversity of the extrinsic parameters (i.e. the camera poses) by utilizing homography. Moreover, we modify the focal length values to create multiple pseudo-domains and construct an adversarial training loss to encourage the feature representation to be more domain-agnostic. Without bells and whistles, our approach, namely DG-BEV, successfully alleviates the performance drop on the unseen target domain without impairing the accuracy of the source domain. Extensive experiments on Waymo, nuScenes, and Lyft demonstrate the generalization and effectiveness of our approach. + + + + Q: How To Specialize Large Vision-Language Models to Data-Scarce VQA Tasks? A: Self-Train on Unlabeled Images! + http://openaccess.thecvf.com//content/CVPR2023/papers/Khan_Q_How_To_Specialize_Large_Vision-Language_Models_to_Data-Scarce_VQA_CVPR_2023_paper.pdf + Finetuning a large vision language model (VLM) on a target dataset after large scale pretraining is a dominant paradigm in visual question answering (VQA). Datasets for specialized tasks such as knowledge-based VQA or VQA in non natural-image domains are orders of magnitude smaller than those for general-purpose VQA. While collecting additional labels for specialized tasks or domains can be challenging, unlabeled images are often available. We introduce SelTDA (Self-Taught Data Augmentation), a strategy for finetuning large VLMs on small-scale VQA datasets. SelTDA uses the VLM and target dataset to build a teacher model that can generate question-answer pseudolabels directly conditioned on an image alone, allowing us to pseudolabel unlabeled images. SelTDA then finetunes the initial VLM on the original dataset augmented with freshly pseudolabeled images. 
We describe a series of experiments showing that our self-taught data augmentation increases robustness to adversarially searched questions, counterfactual examples, and rephrasings, improves domain generalization, and results in greater retention of numerical reasoning skills. The proposed strategy requires no additional annotations or architectural modifications, and is compatible with any modern encoder-decoder multimodal transformer. Code available at https://github.com/codezakh/SelTDA + + + + Improving Robust Generalization by Direct PAC-Bayesian Bound Minimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Improving_Robust_Generalization_by_Direct_PAC-Bayesian_Bound_Minimization_CVPR_2023_paper.pdf + Recent research in robust optimization has shown an overfitting-like phenomenon in which models trained against adversarial attacks exhibit higher robustness on the training set compared to the test set. Although previous work provided theoretical explanations for this phenomenon using a robust PAC-Bayesian bound over the adversarial test error, related algorithmic derivations are at best only loosely connected to this bound, which implies that there is still a gap between their empirical success and our understanding of adversarial robustness theory. To close this gap, in this paper we consider a different form of the robust PAC-Bayesian bound and directly minimize it with respect to the model posterior. The derivation of the optimal solution connects PAC-Bayesian learning to the geometry of the robust loss surface through a Trace of Hessian (TrH) regularizer that measures the surface flatness. In practice, we restrict the TrH regularizer to the top layer only, which results in an analytical solution to the bound whose computational cost does not depend on the network depth. Finally, we evaluate our TrH regularization approach over CIFAR-10/100 and ImageNet using Vision Transformers (ViT) and compare against baseline adversarial robustness algorithms. Experimental results show that TrH regularization leads to improved ViT robustness that either matches or surpasses previous state-of-the-art approaches while at the same time requiring less memory and computational cost. + + + + AssemblyHands: Towards Egocentric Activity Understanding via 3D Hand Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ohkawa_AssemblyHands_Towards_Egocentric_Activity_Understanding_via_3D_Hand_Pose_Estimation_CVPR_2023_paper.pdf + We present AssemblyHands, a large-scale benchmark dataset with accurate 3D hand pose annotations, to facilitate the study of egocentric activities with challenging hand-object interactions. The dataset includes synchronized egocentric and exocentric images sampled from the recent Assembly101 dataset, in which participants assemble and disassemble take-apart toys. To obtain high-quality 3D hand pose annotations for the egocentric images, we develop an efficient pipeline, where we use an initial set of manual annotations to train a model to automatically annotate a much larger dataset. Our annotation model uses multi-view feature fusion and an iterative refinement scheme, and achieves an average keypoint error of 4.20 mm, which is 85% lower than the error of the original annotations in Assembly101. AssemblyHands provides 3.0M annotated images, including 490K egocentric images, making it the largest existing benchmark dataset for egocentric 3D hand pose estimation. 
Using this data, we develop a strong single-view baseline of 3D hand pose estimation from egocentric images. Furthermore, we design a novel action classification task to evaluate predicted 3D hand poses. Our study shows that having higher-quality hand poses directly improves the ability to recognize actions. + + + + Scene-Aware Egocentric 3D Human Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Scene-Aware_Egocentric_3D_Human_Pose_Estimation_CVPR_2023_paper.pdf + Egocentric 3D human pose estimation with a single head-mounted fisheye camera has recently attracted attention due to its numerous applications in virtual and augmented reality. Existing methods still struggle in challenging poses where the human body is highly occluded or is closely interacting with the scene. To address this issue, we propose a scene-aware egocentric pose estimation method that guides the prediction of the egocentric pose with scene constraints. To this end, we propose an egocentric depth estimation network to predict the scene depth map from a wide-view egocentric fisheye camera while mitigating the occlusion of the human body with a depth-inpainting network. Next, we propose a scene-aware pose estimation network that projects the 2D image features and estimated depth map of the scene into a voxel space and regresses the 3D pose with a V2V network. The voxel-based feature representation provides the direct geometric connection between 2D image features and scene geometry, and further facilitates the V2V network to constrain the predicted pose based on the estimated scene geometry. To enable the training of the aforementioned networks, we also generated a synthetic dataset, called EgoGTA, and an in-the-wild dataset based on EgoPW, called EgoPW-Scene. The experimental results of our new evaluation sequences show that the predicted 3D egocentric poses are accurate and physically plausible in terms of human-scene interaction, demonstrating that our method outperforms the state-of-the-art methods both quantitatively and qualitatively. + + + + NeuralField-LDM: Scene Generation With Hierarchical Latent Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_NeuralField-LDM_Scene_Generation_With_Hierarchical_Latent_Diffusion_Models_CVPR_2023_paper.pdf + Automatically generating high-quality real world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient high-quality 2D content creation. We first train a scene auto-encoder to express a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent-autoencoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting and scene style manipulation. 
+ + + + DPF: Learning Dense Prediction Fields With Weak Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_DPF_Learning_Dense_Prediction_Fields_With_Weak_Supervision_CVPR_2023_paper.pdf + Nowadays, many visual scene understanding problems are addressed by dense prediction networks. But pixel-wise dense annotations are very expensive (e.g., for scene parsing) or impossible (e.g., for intrinsic image decomposition), motivating us to leverage cheap point-level weak supervision. However, existing pointly-supervised methods still use the same architecture designed for full supervision. In stark contrast to them, we propose a new paradigm that makes predictions for point coordinate queries, as inspired by the recent success of implicit representations, like distance or radiance fields. As such, the method is named as dense prediction fields (DPFs). DPFs generate expressive intermediate features for continuous sub-pixel locations, thus allowing outputs of an arbitrary resolution. DPFs are naturally compatible with point-level supervision. We showcase the effectiveness of DPFs using two substantially different tasks: high-level semantic parsing and low-level intrinsic image decomposition. In these two cases, supervision comes in the form of single-point semantic category and two-point relative reflectance, respectively. As benchmarked by three large-scale public datasets PascalContext, ADE20k and IIW, DPFs set new state-of-the-art performance on all of them with significant margins. Code can be accessed at https://github.com/cxx226/DPF. + + + + CNVid-3.5M: Build, Filter, and Pre-Train the Large-Scale Public Chinese Video-Text Dataset + http://openaccess.thecvf.com//content/CVPR2023/papers/Gan_CNVid-3.5M_Build_Filter_and_Pre-Train_the_Large-Scale_Public_Chinese_Video-Text_CVPR_2023_paper.pdf + Owing to well-designed large-scale video-text datasets, recent years have witnessed tremendous progress in video-text pre-training. However, existing large-scale video-text datasets are mostly English-only. Though there are certain methods studying the Chinese video-text pre-training, they pre-train their models on private datasets whose videos and text are unavailable. This lack of large-scale public datasets and benchmarks in Chinese hampers the research and downstream applications of Chinese video-text pre-training. Towards this end, we release and benchmark CNVid-3.5M, a large-scale public cross-modal dataset containing over 3.5M Chinese video-text pairs. We summarize our contributions by three verbs, i.e., "Build", "Filter", and "Pre-train": 1) To build a public Chinese video-text dataset, we collect over 4.5M videos from the Chinese websites. 2) To improve the data quality, we propose a novel method to filter out 1M weakly-paired videos, resulting in the CNVid-3.5M dataset. And 3) we benchmark CNVid-3.5M with three mainstream pixel-level pre-training architectures. At last, we propose the Hard Sample Curriculum Learning strategy to promote the pre-training performance. To the best of our knowledge, CNVid-3.5M is the largest public video-text dataset in Chinese, and we provide the first pixel-level benchmarks for Chinese video-text pre-training. The dataset, codebase, and pre-trained models are available at https://github.com/CNVid/CNVid-3.5M. 
+ + + + iQuery: Instruments As Queries for Audio-Visual Sound Separation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_iQuery_Instruments_As_Queries_for_Audio-Visual_Sound_Separation_CVPR_2023_paper.pdf + Current audio-visual separation methods share a standard architecture design where an audio encoder-decoder network is fused with visual encoding features at the encoder bottleneck. This design confounds the learning of multi-modal feature encoding with robust sound decoding for audio separation. To generalize to a new instrument, one must fine-tune the entire visual and audio network for all musical instruments. We re-formulate the visual-sound separation task and propose Instruments as Queries (iQuery) with a flexible query expansion mechanism. Our approach ensures cross-modal consistency and cross-instrument disentanglement. We utilize "visually named" queries to initiate the learning of audio queries and use cross-modal attention to remove potential sound source interference at the estimated waveforms. To generalize to a new instrument or event class, drawing inspiration from the text-prompt design, we insert additional queries as audio prompts while freezing the attention mechanism. Experimental results on three benchmarks demonstrate that our iQuery improves audio-visual sound source separation performance. Code is available at https://github.com/JiabenChen/iQuery. + + + + Sampling Is Matter: Point-Guided 3D Human Mesh Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Sampling_Is_Matter_Point-Guided_3D_Human_Mesh_Reconstruction_CVPR_2023_paper.pdf + This paper presents a simple yet powerful method for 3D human mesh reconstruction from a single RGB image. Most recently, the non-local interactions of the whole mesh vertices have been effectively estimated in the transformer while the relationship between body parts also has begun to be handled via the graph model. Even though those approaches have shown the remarkable progress in 3D human mesh reconstruction, it is still difficult to directly infer the relationship between features, which are encoded from the 2D input image, and 3D coordinates of each vertex. To resolve this problem, we propose to design a simple feature sampling scheme. The key idea is to sample features in the embedded space by following the guide of points, which are estimated as projection results of 3D mesh vertices (i.e., ground truth). This helps the model to concentrate more on vertex-relevant features in the 2D space, thus leading to the reconstruction of the natural human pose. Furthermore, we apply progressive attention masking to precisely estimate local interactions between vertices even under severe occlusions. Experimental results on benchmark datasets show that the proposed method efficiently improves the performance of 3D human mesh reconstruction. The code and model are publicly available at: https://github.com/DCVL-3D/PointHMR_release. + + + + Look Around for Anomalies: Weakly-Supervised Anomaly Detection via Context-Motion Relational Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Cho_Look_Around_for_Anomalies_Weakly-Supervised_Anomaly_Detection_via_Context-Motion_Relational_CVPR_2023_paper.pdf + Weakly-supervised Video Anomaly Detection is the task of detecting frame-level anomalies using video-level labeled training data. It is difficult to explore class representative features using minimal supervision of weak labels with a single backbone branch. 
Furthermore, in real-world scenarios, the boundary between normal and abnormal is ambiguous and varies depending on the situation. For example, even for the same motion of a running person, the abnormality varies depending on whether the surroundings are a playground or a roadway. Therefore, our aim is to extract discriminative features by widening the relative gap between classes' features from a single branch. In the proposed Class-Activate Feature Learning (CLAV), the features are extracted as per the weights that are implicitly activated depending on the class, and the gap is then enlarged through relative distance learning. Furthermore, as the relationship between context and motion is important in order to identify the anomalies in complex and diverse scenes, we propose a Context--Motion Interrelation Module (CoMo), which models the relationship between the appearance of the surroundings and motion, rather than utilizing only temporal dependencies or motion information. The proposed method shows SOTA performance on four benchmarks including large-scale real-world datasets, and we demonstrate the importance of relational information by analyzing the qualitative results and generalization ability. + + + + Detecting Everything in the Open World: Towards Universal Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Detecting_Everything_in_the_Open_World_Towards_Universal_Object_Detection_CVPR_2023_paper.pdf + In this paper, we formally address universal object detection, which aims to detect every scene and predict every category. The dependence on human annotations, the limited visual information, and the novel categories in the open world severely restrict the universality of traditional detectors. We propose UniDetector, a universal object detector that has the ability to recognize enormous categories in the open world. The critical points for the universality of UniDetector are: 1) it leverages images of multiple sources and heterogeneous label spaces for training through the alignment of image and text spaces, which guarantees sufficient information for universal representations. 2) it generalizes to the open world easily while keeping the balance between seen and unseen classes, thanks to abundant information from both vision and language modalities. 3) it further promotes the generalization ability to novel categories through our proposed decoupling training manner and probability calibration. These contributions allow UniDetector to detect over 7k categories, the largest measurable category size so far, with only about 500 classes participating in training. Our UniDetector exhibits strong zero-shot generalization ability on large-vocabulary datasets like LVIS, ImageNetBoxes, and VisualGenome - it surpasses the traditional supervised baselines by more than 4% on average without seeing any corresponding images. On 13 public detection datasets with various scenes, UniDetector also achieves state-of-the-art performance with only 3% of the training data. + + + + NUWA-LIP: Language-Guided Image Inpainting With Defect-Free VQGAN + http://openaccess.thecvf.com//content/CVPR2023/papers/Ni_NUWA-LIP_Language-Guided_Image_Inpainting_With_Defect-Free_VQGAN_CVPR_2023_paper.pdf + Language-guided image inpainting aims to fill the defective regions of an image under the guidance of text while keeping the non-defective regions unchanged. 
However, directly encoding the defective images is prone to have an adverse effect on the non-defective regions, giving rise to distorted structures on non-defective parts. To better adapt the text guidance to the inpainting task, this paper proposes NUWA-LIP, which involves defect-free VQGAN (DF-VQGAN) and a multi-perspective sequence-to-sequence module (MP-S2S). To be specific, DF-VQGAN introduces relative estimation to carefully control the receptive spreading, as well as symmetrical connections to keep structure details unchanged. For harmoniously embedding text guidance into the locally defective regions, MP-S2S is employed by aggregating the complementary perspectives from low-level pixels, high-level tokens as well as the text description. Experiments show that our DF-VQGAN effectively aids the inpainting process while avoiding unexpected changes in non-defective regions. Results on three open-domain benchmarks demonstrate the superior performance of our method against state-of-the-art methods. Our code, datasets, and model will be made publicly available. + + + + Language Adaptive Weight Generation for Multi-Task Visual Grounding + http://openaccess.thecvf.com//content/CVPR2023/papers/Su_Language_Adaptive_Weight_Generation_for_Multi-Task_Visual_Grounding_CVPR_2023_paper.pdf + Despite the impressive performance in visual grounding, the prevailing approaches usually exploit the visual backbone in a passive way, i.e., the visual backbone extracts features with fixed weights without expression-related hints. The passive perception may lead to mismatches (e.g., redundant and missing), limiting further performance improvement. Ideally, the visual backbone should actively extract visual features since the expressions already provide the blueprint of desired visual features. The active perception can take expressions as priors to extract relevant visual features, which can effectively alleviate the mismatches. Inspired by this, we propose an active perception Visual Grounding framework based on Language Adaptive Weights, called VG-LAW. The visual backbone serves as an expression-specific feature extractor through dynamic weights generated for various expressions. Benefiting from the specific and relevant visual features extracted from the language-aware visual backbone, VG-LAW does not require additional modules for cross-modal interaction. Along with a neat multi-task head, VG-LAW can be competent in referring expression comprehension and segmentation jointly. Extensive experiments on four representative datasets, i.e., RefCOCO, RefCOCO+, RefCOCOg, and ReferItGame, validate the effectiveness of the proposed framework and demonstrate state-of-the-art performance. + + + + Continuous Intermediate Token Learning With Implicit Motion Manifold for Keyframe Based Motion Interpolation + http://openaccess.thecvf.com//content/CVPR2023/papers/Mo_Continuous_Intermediate_Token_Learning_With_Implicit_Motion_Manifold_for_Keyframe_CVPR_2023_paper.pdf + Deriving sophisticated 3D motions from sparse keyframes is a particularly challenging problem, due to the required continuity and exceptional skeletal precision. The action features are often derivable accurately from the full series of keyframes, and thus, leveraging the global context with transformers has been a promising data-driven embedding approach. However, existing methods often take as input intermediate frames interpolated from the keyframes with basic interpolation methods for continuity, which results in a trivial local minimum during training. 
In this paper, we propose a novel framework to formulate latent motion manifolds with keyframe-based constraints, from which the continuous nature of intermediate token representations is considered. Particularly, our proposed framework consists of two stages for identifying a latent motion subspace, i.e., a keyframe encoding stage and an intermediate token generation stage, and a subsequent motion synthesis stage to extrapolate and compose motion data from manifolds. Through our extensive experiments conducted on both the LaFAN1 and CMU Mocap datasets, our proposed method demonstrates both superior interpolation accuracy and high visual similarity to ground truth motions. + + + + SGLoc: Scene Geometry Encoding for Outdoor LiDAR Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_SGLoc_Scene_Geometry_Encoding_for_Outdoor_LiDAR_Localization_CVPR_2023_paper.pdf + LiDAR-based absolute pose regression estimates the global pose through a deep network in an end-to-end manner, achieving impressive results in learning-based localization. However, the accuracy of existing methods still has room to improve due to the difficulty of effectively encoding the scene geometry and the unsatisfactory quality of the data. In this work, we propose a novel LiDAR localization framework, SGLoc, which decouples the pose estimation to point cloud correspondence regression and pose estimation via this correspondence. This decoupling effectively encodes the scene geometry because the decoupled correspondence regression step greatly preserves the scene geometry, leading to significant performance improvement. Apart from this decoupling, we also design a tri-scale spatial feature aggregation module and inter-geometric consistency constraint loss to effectively capture scene geometry. Moreover, we empirically find that the ground truth might be noisy due to GPS/INS measuring errors, greatly reducing the pose estimation performance. Thus, we propose a pose quality evaluation and enhancement method to measure and correct the ground truth pose. Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate the effectiveness of SGLoc, which outperforms state-of-the-art regression-based localization methods by 68.5% and 67.6% on position accuracy, respectively. + + + + Bridging Search Region Interaction With Template for RGB-T Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Hui_Bridging_Search_Region_Interaction_With_Template_for_RGB-T_Tracking_CVPR_2023_paper.pdf + RGB-T tracking aims to leverage the mutual enhancement and complement ability of RGB and TIR modalities for improving the tracking process in various scenarios, where cross-modal interaction is the key component. Some previous methods concatenate the RGB and TIR search region features directly to perform a coarse interaction process with redundant background noises introduced. Many other methods sample candidate boxes from search frames and conduct various fusion approaches on isolated pairs of RGB and TIR boxes, which limits the cross-modal interaction within local regions and brings about inadequate context modeling. To alleviate these limitations, we propose a novel Template-Bridged Search region Interaction (TBSI) module which exploits templates as the medium to bridge the cross-modal interaction between RGB and TIR search regions by gathering and distributing target-relevant object and environment contexts. Original templates are also updated with enriched multimodal contexts from the template medium. 
Our TBSI module is inserted into a ViT backbone for joint feature extraction, search-template matching, and cross-modal interaction. Extensive experiments on three popular RGB-T tracking benchmarks demonstrate our method achieves new state-of-the-art performances. Code is available at https://github.com/RyanHTR/TBSI. + + + + Indescribable Multi-Modal Spatial Evaluator + http://openaccess.thecvf.com//content/CVPR2023/papers/Kong_Indescribable_Multi-Modal_Spatial_Evaluator_CVPR_2023_paper.pdf + Multi-modal image registration spatially aligns two images with different distributions. One of its major challenges is that images acquired from different imaging machines have different imaging distributions, making it difficult to focus only on the spatial aspect of the images and ignore differences in distributions. In this study, we developed a self-supervised approach, Indescribable Multi-model Spatial Evaluator (IMSE), to address multi-modal image registration. IMSE creates an accurate multi-modal spatial evaluator to measure spatial differences between two images, and then optimizes registration by minimizing the error predicted of the evaluator. To optimize IMSE performance, we also proposed a new style enhancement method called Shuffle Remap which randomizes the image distribution into multiple segments, and then randomly disorders and remaps these segments, so that the distribution of the original image is changed. Shuffle Remap can help IMSE to predict the difference in spatial location from unseen target distributions. Our results show that IMSE outperformed the existing methods for registration using T1-T2 and CT-MRI datasets. IMSE also can be easily integrated into the traditional registration process, and can provide a convenient way to evaluate and visualize registration results. IMSE also has the potential to be used as a new paradigm for image-to-image translation. Our code is available at https://github.com/Kid-Liet/IMSE. + + + + ImageBind: One Embedding Space To Bind Them All + http://openaccess.thecvf.com//content/CVPR2023/papers/Girdhar_ImageBind_One_Embedding_Space_To_Bind_Them_All_CVPR_2023_paper.pdf + We present ImageBind, an approach to learn a joint embedding across six different modalities - images, text, audio, depth, thermal, and IMU data. We show that all combinations of paired data are not necessary to train such a joint embedding, and only image-paired data is sufficient to bind the modalities together. ImageBind can leverage recent large scale vision-language models, and extends their zero-shot capabilities to new modalities just by using their natural pairing with images. It enables novel emergent applications 'out-of-the-box' including cross-modal retrieval, composing modalities with arithmetic, cross-modal detection and generation. The emergent capabilities improve with the strength of the image encoder and we set a new state-of-the-art on emergent zero-shot recognition tasks across modalities, outperforming specialist supervised models. Finally, we show strong few-shot recognition results outperforming prior work, and that ImageBind serves as a new way to evaluate vision models for visual and non-visual tasks. 
+ + + + Three Guidelines You Should Know for Universally Slimmable Self-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Three_Guidelines_You_Should_Know_for_Universally_Slimmable_Self-Supervised_Learning_CVPR_2023_paper.pdf + We propose universally slimmable self-supervised learning (dubbed as US3L) to achieve better accuracy-efficiency trade-offs for deploying self-supervised models across different devices. We observe that direct adaptation of self-supervised learning (SSL) to universally slimmable networks misbehaves as the training process frequently collapses. We then discover that temporal consistent guidance is the key to the success of SSL for universally slimmable networks, and we propose three guidelines for the loss design to ensure this temporal consistency from a unified gradient perspective. Moreover, we propose dynamic sampling and group regularization strategies to simultaneously improve training efficiency and accuracy. Our US3L method has been empirically validated on both convolutional neural networks and vision transformers. With only once training and one copy of weights, our method outperforms various state-of-the-art methods (individually trained or not) on benchmarks including recognition, object detection and instance segmentation. + + + + MetaFusion: Infrared and Visible Image Fusion via Meta-Feature Embedding From Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_MetaFusion_Infrared_and_Visible_Image_Fusion_via_Meta-Feature_Embedding_From_CVPR_2023_paper.pdf + Fusing infrared and visible images can provide more texture details for subsequent object detection task. Conversely, detection task furnishes object semantic information to improve the infrared and visible image fusion. Thus, a joint fusion and detection learning to use their mutual promotion is attracting more attention. However, the feature gap between these two different-level tasks hinders the progress. Addressing this issue, this paper proposes an infrared and visible image fusion via meta-feature embedding from object detection. The core idea is that meta-feature embedding model is designed to generate object semantic features according to fusion network ability, and thus the semantic features are naturally compatible with fusion features. It is optimized by simulating a meta learning. Moreover, we further implement a mutual promotion learning between fusion and detection tasks to improve their performances. Comprehensive experiments on three public datasets demonstrate the effectiveness of our method. Code and model are available at: https://github.com/wdzhao123/MetaFusion. + + + + End-to-End Vectorized HD-Map Construction With Piecewise Bezier Curve + http://openaccess.thecvf.com//content/CVPR2023/papers/Qiao_End-to-End_Vectorized_HD-Map_Construction_With_Piecewise_Bezier_Curve_CVPR_2023_paper.pdf + Vectorized high-definition map (HD-map) construction, which focuses on the perception of centimeter-level environmental information, has attracted significant research interest in the autonomous driving community. Most existing approaches first obtain rasterized map with the segmentation-based pipeline and then conduct heavy post-processing for downstream-friendly vectorization. In this paper, by delving into parameterization-based methods, we pioneer a concise and elegant scheme that adopts unified piecewise Bezier curve. 
In order to vectorize changeful map elements end-to-end, we elaborate a simple yet effective architecture, named Piecewise Bezier HD-map Network (BeMapNet), which is formulated as a direct set prediction paradigm and postprocessing-free. Concretely, we first introduce a novel IPM-PE Align module to inject 3D geometry prior into BEV features through common position encoding in Transformer. Then a well-designed Piecewise Bezier Head is proposed to output the details of each map element, including the coordinate of control points and the segment number of curves. In addition, based on the progressive restoration of the Bezier curve, we also present an efficient Point-Curve-Region Loss for supervising more robust and precise HD-map modeling. Extensive comparisons show that our method is remarkably superior to other existing SOTAs by 18.0 mAP at least. + + + + On Data Scaling in Masked Image Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_On_Data_Scaling_in_Masked_Image_Modeling_CVPR_2023_paper.pdf + Scaling properties have been one of the central issues in self-supervised pre-training, especially the data scalability, which has successfully motivated the large-scale self-supervised pre-trained language models and endowed them with significant modeling capabilities. However, scaling properties seem to be unintentionally neglected in the recent trending studies on masked image modeling (MIM), and some arguments even suggest that MIM cannot benefit from large-scale data. In this work, we try to break down these preconceptions and systematically study the scaling behaviors of MIM through extensive experiments, with data ranging from 10% of ImageNet-1K to full ImageNet-22K, model parameters ranging from 49-million to one-billion, and training length ranging from 125K to 500K iterations. Our main findings can be summarized as two-fold: 1) masked image modeling still demands large-scale data in order to scale up compute and model parameters; 2) masked image modeling cannot benefit from more data under a non-overfitting scenario, which diverges from the previous observations in self-supervised pre-trained language models or supervised pre-trained vision models. In addition, we reveal several intriguing properties in MIM, such as high sample efficiency in large MIM models and strong correlation between pre-training validation loss and transfer performance. We hope that our findings could deepen the understanding of masked image modeling and facilitate future developments on large-scale vision models. Code and models will be available at https://github.com/microsoft/SimMIM. + + + + Balanced Energy Regularization Loss for Out-of-Distribution Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Choi_Balanced_Energy_Regularization_Loss_for_Out-of-Distribution_Detection_CVPR_2023_paper.pdf + In the field of out-of-distribution (OOD) detection, a previous method that uses auxiliary data as OOD data has shown promising performance. However, the method provides an equal loss to all auxiliary data to differentiate them from inliers. Yet, based on our observation, in various tasks, there is a general imbalance in the distribution of the auxiliary OOD data across classes. We propose a balanced energy regularization loss that is simple but generally effective for a variety of tasks. Our balanced energy regularization loss utilizes class-wise different prior probabilities for auxiliary data to address the class imbalance in OOD data. 
The main concept is to regularize auxiliary samples from majority classes, more heavily than those from minority classes. Our approach performs better for OOD detection in semantic segmentation, long-tailed image classification, and image classification than the prior energy regularization loss. Furthermore, our approach achieves state-of-the-art performance in two tasks: OOD detection in semantic segmentation and long-tailed image classification. + + + + 3D-Aware Face Swapping + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_3D-Aware_Face_Swapping_CVPR_2023_paper.pdf + Face swapping is an important research topic in computer vision with wide applications in entertainment and privacy protection. Existing methods directly learn to swap 2D facial images, taking no account of the geometric information of human faces. In the presence of large pose variance between the source and the target faces, there always exist undesirable artifacts on the swapped face. In this paper, we present a novel 3D-aware face swapping method that generates high-fidelity and multi-view-consistent swapped faces from single-view source and target images. To achieve this, we take advantage of the strong geometry and texture prior of 3D human faces, where the 2D faces are projected into the latent space of a 3D generative model. By disentangling the identity and attribute features in the latent space, we succeed in swapping faces in a 3D-aware manner, being robust to pose variations while transferring fine-grained facial details. Extensive experiments demonstrate the superiority of our 3D-aware face swapping framework in terms of visual quality, identity similarity, and multi-view consistency. Code is available at https://lyx0208.github.io/3dSwap. + + + + Phone2Proc: Bringing Robust Robots Into Our Chaotic World + http://openaccess.thecvf.com//content/CVPR2023/papers/Deitke_Phone2Proc_Bringing_Robust_Robots_Into_Our_Chaotic_World_CVPR_2023_paper.pdf + Training embodied agents in simulation has become mainstream for the embodied AI community. However, these agents often struggle when deployed in the physical world due to their inability to generalize to real-world environments. In this paper, we present Phone2Proc, a method that uses a 10-minute phone scan and conditional procedural generation to create a distribution of training scenes that are semantically similar to the target environment. The generated scenes are conditioned on the wall layout and arrangement of large objects from the scan, while also sampling lighting, clutter, surface textures, and instances of smaller objects with randomized placement and materials. Leveraging just a simple RGB camera, training with Phone2Proc shows massive improvements from 34.7% to 70.7% success rate in sim-to-real ObjectNav performance across a test suite of over 200 trials in diverse real-world environments, including homes, offices, and RoboTHOR. Furthermore, Phone2Proc's diverse distribution of generated scenes makes agents remarkably robust to changes in the real world, such as human movement, object rearrangement, lighting changes, or clutter. + + + + Learning Articulated Shape With Keypoint Pseudo-Labels From Web Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Stathopoulos_Learning_Articulated_Shape_With_Keypoint_Pseudo-Labels_From_Web_Images_CVPR_2023_paper.pdf + This paper shows that it is possible to learn models for monocular 3D reconstruction of articulated objects (e.g. 
horses, cows, sheep), using as few as 50-150 images labeled with 2D keypoints. Our proposed approach involves training category-specific keypoint estimators, generating 2D keypoint pseudo-labels on unlabeled web images, and using both the labeled and self-labeled sets to train 3D reconstruction models. It is based on two key insights: (1) 2D keypoint estimation networks trained on as few as 50-150 images of a given object category generalize well and generate reliable pseudo-labels; (2) a data selection mechanism can automatically create a "curated" subset of the unlabeled web images that can be used for training -- we evaluate four data selection methods. Coupling these two insights enables us to train models that effectively utilize web images, resulting in improved 3D reconstruction performance for several articulated object categories beyond the fully-supervised baseline. Our approach can quickly bootstrap a model and requires only a few images labeled with 2D keypoints. This requirement can be easily satisfied for any new object category. To showcase the practicality of our approach for predicting the 3D shape of arbitrary object categories, we annotate 2D keypoints on 250 giraffe and bear images from COCO in just 2.5 hours per category. + + + + Rethinking Image Super Resolution From Long-Tailed Distribution Learning Perspective + http://openaccess.thecvf.com//content/CVPR2023/papers/Gou_Rethinking_Image_Super_Resolution_From_Long-Tailed_Distribution_Learning_Perspective_CVPR_2023_paper.pdf + Existing studies have empirically observed that the resolution of the low-frequency region is easier to enhance than that of the high-frequency one. Although plentiful works have been devoted to alleviating this problem, little understanding has been offered to explain it. In this paper, we try to give a feasible answer from a machine learning perspective, i.e., the twin fitting problem caused by the long-tailed pixel distribution in natural images. With this explanation, we reformulate image super resolution (SR) as a long-tailed distribution learning problem and solve it by bridging the gaps of the problem between low- and high-level vision tasks. As a result, we design a long-tailed distribution learning solution that rebalances the gradients from the pixels in the low- and high-frequency regions by introducing a static and a learnable structure prior. The learned SR model achieves a better balance in fitting the low- and high-frequency regions, so that the overall performance is improved. In the experiments, we evaluate the solution on four CNN- and one Transformer-based SR models w.r.t. six datasets and three tasks, and experimental results demonstrate its superiority. + + + + SCOTCH and SODA: A Transformer Video Shadow Detection Framework + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_SCOTCH_and_SODA_A_Transformer_Video_Shadow_Detection_Framework_CVPR_2023_paper.pdf + Shadows in videos are difficult to detect because of the large shadow deformation between frames. In this work, we argue that accounting for shadow deformation is essential when designing a video shadow detection method. To this end, we introduce the shadow deformation attention trajectory (SODA), a new type of video self-attention module, specially designed to handle the large shadow deformations in videos. Moreover, we present a new shadow contrastive learning mechanism (SCOTCH) which aims at guiding the network to learn a unified shadow representation from massive positive shadow pairs across different videos.
We empirically demonstrate the effectiveness of our two contributions in an ablation study. Furthermore, we show that SCOTCH and SODA significantly outperform existing techniques for video shadow detection. Code is available at the project page: https://lihaoliu-cambridge.github.io/scotch_and_soda/ + + + + CodeTalker: Speech-Driven 3D Facial Animation With Discrete Motion Prior + http://openaccess.thecvf.com//content/CVPR2023/papers/Xing_CodeTalker_Speech-Driven_3D_Facial_Animation_With_Discrete_Motion_Prior_CVPR_2023_paper.pdf + Speech-driven 3D facial animation has been widely studied, yet there is still a gap in achieving realism and vividness due to the highly ill-posed nature and scarcity of audio-visual data. Existing works typically formulate the cross-modal mapping as a regression task, which suffers from the regression-to-mean problem, leading to over-smoothed facial motions. In this paper, we propose to cast speech-driven facial animation as a code query task in a finite proxy space of the learned codebook, which effectively promotes the vividness of the generated motions by reducing the cross-modal mapping uncertainty. The codebook is learned by self-reconstruction over real facial motions and is thus embedded with realistic facial motion priors. Over the discrete motion space, a temporal autoregressive model is employed to sequentially synthesize facial motions from the input speech signal, which guarantees lip-sync as well as plausible facial expressions. We demonstrate that our approach outperforms current state-of-the-art methods both qualitatively and quantitatively. Also, a user study further justifies our superiority in perceptual quality. + + + + Improving Zero-Shot Generalization and Robustness of Multi-Modal Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Ge_Improving_Zero-Shot_Generalization_and_Robustness_of_Multi-Modal_Models_CVPR_2023_paper.pdf + Multi-modal image-text models such as CLIP and LiT have demonstrated impressive performance on image classification benchmarks, and their zero-shot generalization ability is particularly exciting. While the top-5 zero-shot accuracies of these models are very high, the top-1 accuracies are much lower (over 25% gap in some cases). We investigate the reasons for this performance gap and find that many of the failure cases are caused by ambiguity in the text prompts. First, we develop a simple and efficient zero-shot post-hoc method to identify images whose top-1 prediction is likely to be incorrect, by measuring consistency of the predictions w.r.t. multiple prompts and image transformations. We show that our procedure better predicts mistakes, outperforming the popular max logit baseline on selective prediction tasks. Next, we propose a simple and efficient way to improve accuracy on such uncertain images by making use of the WordNet hierarchy; specifically, we augment the original class by incorporating its parent and children from the semantic label hierarchy, and plug the augmentation into text prompts. We conduct experiments on both CLIP and LiT models with five different ImageNet-based datasets. For CLIP, our method improves the top-1 accuracy by 17.13% on the uncertain subset and 3.6% on the entire ImageNet validation set. We also show that our method improves across ImageNet-shifted datasets, four other datasets, and other model architectures such as LiT. Our proposed method is hyperparameter-free, requires no additional model training and can be easily scaled to other large multi-modal architectures.
Code is available at https://github.com/gyhandy/Hierarchy-CLIP. + + + + CODA-Prompt: COntinual Decomposed Attention-Based Prompting for Rehearsal-Free Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Smith_CODA-Prompt_COntinual_Decomposed_Attention-Based_Prompting_for_Rehearsal-Free_Continual_Learning_CVPR_2023_paper.pdf + Computer vision models suffer from a phenomenon known as catastrophic forgetting when learning novel concepts from continuously shifting training data. Typical solutions for this continual learning problem require extensive rehearsal of previously seen data, which increases memory costs and may violate data privacy. Recently, the emergence of large-scale pre-trained vision transformer models has enabled prompting approaches as an alternative to data rehearsal. These approaches rely on a key-query mechanism to generate prompts and have been found to be highly resistant to catastrophic forgetting in the well-established rehearsal-free continual learning setting. However, the key mechanism of these methods is not trained end-to-end with the task sequence. Our experiments show that this leads to a reduction in their plasticity, hence sacrificing new-task accuracy, and an inability to benefit from expanded parameter capacity. We instead propose to learn a set of prompt components which are assembled with input-conditioned weights to produce input-conditioned prompts, resulting in a novel attention-based end-to-end key-query scheme. Our experiments show that we outperform the current SOTA method DualPrompt on established benchmarks by as much as 4.5% in average final accuracy. We also outperform the state of the art by as much as 4.4% accuracy on a continual learning benchmark which contains both class-incremental and domain-incremental task shifts, corresponding to many practical settings. Our code is available at https://github.com/GT-RIPL/CODA-Prompt + + + + Real-Time Multi-Person Eyeblink Detection in the Wild for Untrimmed Video + http://openaccess.thecvf.com//content/CVPR2023/papers/Zeng_Real-Time_Multi-Person_Eyeblink_Detection_in_the_Wild_for_Untrimmed_Video_CVPR_2023_paper.pdf + Real-time eyeblink detection in the wild can serve a wide range of applications, such as fatigue detection, face anti-spoofing, and emotion analysis. Existing research efforts generally focus on single-person cases in trimmed video. However, the multi-person scenario within untrimmed videos is also important for practical applications, yet it has received little attention. To address this, we shed light on this research field for the first time with essential contributions to datasets, theory, and practice. In particular, a large-scale dataset termed MPEblink, which involves 686 untrimmed videos with 8748 eyeblink events, is proposed under multi-person conditions. The samples are captured from unconstrained films to reveal "in the wild" characteristics. Meanwhile, a real-time multi-person eyeblink detection method is also proposed. Unlike existing counterparts, our method runs in a one-stage spatio-temporal manner with end-to-end learning capacity. Specifically, it simultaneously addresses the sub-tasks of face detection, face tracking, and human instance-level eyeblink detection.
This paradigm has two main advantages: (1) eyeblink features can be facilitated via the face's global context (e.g., head pose and illumination condition) with joint optimization and interaction, and (2) addressing these sub-tasks in parallel rather than sequentially saves considerable time, meeting the real-time running requirement. Experiments on MPEblink verify the essential challenges of real-time multi-person eyeblink detection in the wild for untrimmed video. Our method also outperforms existing approaches by large margins and with a high inference speed. + + + + Category Query Learning for Human-Object Interaction Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Category_Query_Learning_for_Human-Object_Interaction_Classification_CVPR_2023_paper.pdf + Unlike most previous HOI methods that focus on learning better human-object features, we propose a novel and complementary approach called category query learning. Such queries are explicitly associated with interaction categories, converted to image-specific category representations via a transformer decoder, and learnt via an auxiliary image-level classification task. This idea is motivated by an earlier multi-label image classification method, but is applied for the first time to the challenging human-object interaction classification task. Our method is simple, general and effective. It is validated on three representative HOI baselines and achieves new state-of-the-art results on two benchmarks. + + + + MDQE: Mining Discriminative Query Embeddings To Segment Occluded Instances on Challenging Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_MDQE_Mining_Discriminative_Query_Embeddings_To_Segment_Occluded_Instances_on_CVPR_2023_paper.pdf + While impressive progress has been achieved, video instance segmentation (VIS) methods with per-clip input often fail on challenging videos with occluded objects and crowded scenes. This is mainly because instance queries in these methods cannot encode the discriminative embeddings of instances well, making it difficult for the query-based segmenter to distinguish those 'hard' instances. To address these issues, we propose to mine discriminative query embeddings (MDQE) to segment occluded instances on challenging videos. First, we initialize the positional embeddings and content features of object queries by considering their spatial contextual information and the inter-frame object motion. Second, we propose an inter-instance mask repulsion loss to distance each instance from its nearby non-target instances. The proposed MDQE is the first VIS method with per-clip input that achieves state-of-the-art results on challenging videos and competitive performance on simple videos. Specifically, MDQE with ResNet50 achieves 33.0% and 44.5% mask AP on OVIS and YouTube-VIS 2021, respectively. Code of MDQE can be found at https://github.com/MinghanLi/MDQE_CVPR2023. + + + + Are We Ready for Vision-Centric Driving Streaming Perception? The ASAP Benchmark + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Are_We_Ready_for_Vision-Centric_Driving_Streaming_Perception_The_ASAP_CVPR_2023_paper.pdf + In recent years, vision-centric perception has flourished in various autonomous driving tasks, including 3D detection, semantic map construction, motion forecasting, and depth estimation. Nevertheless, the latency of vision-centric approaches is too high for practical deployment (e.g., most camera-based 3D detectors have a runtime greater than 300ms).
To bridge the gap between ideal researches and real-world applications, it is necessary to quantify the trade-off between performance and efficiency. Traditionally, autonomous-driving perception benchmarks perform the online evaluation, neglecting the inference time delay. To mitigate the problem, we propose the Autonomous-driving StreAming Perception (ASAP) benchmark, which is the first benchmark to evaluate the online performance of vision-centric perception in autonomous driving. On the basis of the 2Hz annotated nuScenes dataset, we first propose an annotation-extending pipeline to generate high-frame-rate labels for the 12Hz raw images. Referring to the practical deployment, the Streaming Perception Under constRained-computation (SPUR) evaluation protocol is further constructed, where the 12Hz inputs are utilized for streaming evaluation under the constraints of different computational resources. In the ASAP benchmark, comprehensive experiment results reveal that the model rank alters under different constraints, suggesting that the model latency and computation budget should be considered as design choices to optimize the practical deployment. To facilitate further research, we establish baselines for camera-based streaming 3D detection, which consistently enhance the streaming performance across various hardware. The ASAP benchmark will be made publicly available. + + + + PDPP:Projected Diffusion for Procedure Planning in Instructional Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_PDPPProjected_Diffusion_for_Procedure_Planning_in_Instructional_Videos_CVPR_2023_paper.pdf + In this paper, we study the problem of procedure planning in instructional videos, which aims to make goal-directed plans given the current visual observations in unstructured real-life videos. Previous works cast this problem as a sequence planning problem and leverage either heavy intermediate visual observations or natural language instructions as supervision, resulting in complex learning schemes and expensive annotation costs. In contrast, we treat this problem as a distribution fitting problem. In this sense, we model the whole intermediate action sequence distribution with a diffusion model (PDPP), and thus transform the planning problem to a sampling process from this distribution. In addition, we remove the expensive intermediate supervision, and simply use task labels from instructional videos as supervision instead. Our model is a U-Net based diffusion model, which directly samples action sequences from the learned distribution with the given start and end observations. Furthermore, we apply an efficient projection method to provide accurate conditional guides for our model during the learning and sampling process. Experiments on three datasets with different scales show that our PDPP model can achieve the state-of-the-art performance on multiple metrics, even without the task supervision. Code and trained models are available at https://github.com/MCG-NJU/PDPP. + + + + Efficient Map Sparsification Based on 2D and 3D Discretized Grids + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Efficient_Map_Sparsification_Based_on_2D_and_3D_Discretized_Grids_CVPR_2023_paper.pdf + Localization in a pre-built map is a basic technique for robot autonomous navigation. Existing mapping and localization methods commonly work well in small-scale environments. As a map grows larger, however, more memory is required and localization becomes inefficient. 
To solve these problems, map sparsification becomes a practical necessity to acquire a subset of the original map for localization. Previous map sparsification methods add a quadratic term in mixed-integer programming to enforce a uniform distribution of selected landmarks, which requires high memory capacity and heavy computation. In this paper, we formulate map sparsification in an efficient linear form and select uniformly distributed landmarks based on 2D discretized grids. Furthermore, to reduce the influence of different spatial distributions between the mapping and query sequences, which is not considered in previous methods, we also introduce a space constraint term based on 3D discretized grids. The exhaustive experiments in different datasets demonstrate the superiority of the proposed methods in both efficiency and localization performance. The relevant codes will be released at https://github.com/fishmarch/SLAM_Map_Compression. + + + + Class Attention Transfer Based Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Class_Attention_Transfer_Based_Knowledge_Distillation_CVPR_2023_paper.pdf + Previous knowledge distillation methods have shown their impressive performance on model compression tasks, however, it is hard to explain how the knowledge they transferred helps to improve the performance of the student network. In this work, we focus on proposing a knowledge distillation method that has both high interpretability and competitive performance. We first revisit the structure of mainstream CNN models and reveal that possessing the capacity of identifying class discriminative regions of input is critical for CNN to perform classification. Furthermore, we demonstrate that this capacity can be obtained and enhanced by transferring class activation maps. Based on our findings, we propose class attention transfer based knowledge distillation (CAT-KD). Different from previous KD methods, we explore and present several properties of the knowledge transferred by our method, which not only improve the interpretability of CAT-KD but also contribute to a better understanding of CNN. While having high interpretability, CAT-KD achieves state-of-the-art performance on multiple benchmarks. Code is available at: https://github.com/GzyAftermath/CAT-KD. + + + + Temporally Consistent Online Depth Estimation Using Point-Based Fusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Khan_Temporally_Consistent_Online_Depth_Estimation_Using_Point-Based_Fusion_CVPR_2023_paper.pdf + Depth estimation is an important step in many computer vision problems such as 3D reconstruction, novel view synthesis, and computational photography. Most existing work focuses on depth estimation from single frames. When applied to videos, the result lacks temporal consistency, showing flickering and swimming artifacts. In this paper we aim to estimate temporally consistent depth maps of video streams in an online setting. This is a difficult problem as future frames are not available and the method must choose between enforcing consistency and correcting errors from previous estimations. The presence of dynamic objects further complicates the problem. We propose to address these challenges by using a global point cloud that is dynamically updated each frame, along with a learned fusion approach in image space. Our approach encourages consistency while simultaneously allowing updates to handle errors and dynamic objects. 
Qualitative and quantitative results show that our method achieves state-of-the-art quality for consistent video depth estimation. + + + + Generalizable Implicit Neural Representations via Instance Pattern Composers + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Generalizable_Implicit_Neural_Representations_via_Instance_Pattern_Composers_CVPR_2023_paper.pdf + Despite recent advances in implicit neural representations (INRs), it remains challenging for a coordinate-based multi-layer perceptron (MLP) of INRs to learn a common representation across data instances and generalize it for unseen instances. In this work, we introduce a simple yet effective framework for generalizable INRs that enables a coordinate-based MLP to represent complex data instances by modulating only a small set of weights in an early MLP layer as an instance pattern composer; the remaining MLP weights learn pattern composition rules to learn common representations across instances. Our generalizable INR framework is fully compatible with existing meta-learning and hypernetworks in learning to predict the modulated weights for unseen instances. Extensive experiments demonstrate that our method achieves high performance on a wide range of domains such as audio, images, and 3D objects, while the ablation study validates our weight modulation. + + + + What Can Human Sketches Do for Object Detection? + http://openaccess.thecvf.com//content/CVPR2023/papers/Chowdhury_What_Can_Human_Sketches_Do_for_Object_Detection_CVPR_2023_paper.pdf + Sketches are highly expressive, inherently capturing subjective and fine-grained visual cues. The exploration of such innate properties of human sketches has, however, been limited to that of image retrieval. In this paper, for the first time, we cultivate the expressiveness of sketches but for the fundamental vision task of object detection. The end result is a sketch-enabled object detection framework that detects based on what you sketch -- that "zebra" (e.g., one that is eating the grass) in a herd of zebras (instance-aware detection), and only the part (e.g., "head" of a "zebra") that you desire (part-aware detection). We further dictate that our model works (i) without knowing which category to expect at testing (zero-shot) and (ii) without requiring additional bounding boxes (as per fully supervised) or class labels (as per weakly supervised). Instead of devising a model from the ground up, we show an intuitive synergy between foundation models (e.g., CLIP) and existing sketch models built for sketch-based image retrieval (SBIR), which can already elegantly solve the task -- CLIP to provide model generalisation, and SBIR to bridge the (sketch->photo) gap. In particular, we first perform independent prompting on both sketch and photo branches of an SBIR model to build highly generalisable sketch and photo encoders on the back of the generalisation ability of CLIP. We then devise a training paradigm to adapt the learned encoders for object detection, such that the region embeddings of detected boxes are aligned with the sketch and photo embeddings from SBIR. When evaluated on standard object detection datasets like PASCAL-VOC and MS-COCO, our framework outperforms both supervised (SOD) and weakly-supervised object detectors (WSOD) in zero-shot setups.
Project Page: https://pinakinathc.github.io/sketch-detect + + + + Identity-Preserving Talking Face Generation With Landmark and Appearance Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhong_Identity-Preserving_Talking_Face_Generation_With_Landmark_and_Appearance_Priors_CVPR_2023_paper.pdf + Generating talking face videos from audio attracts lots of research interest. A few person-specific methods can generate vivid videos but require the target speaker's videos for training or fine-tuning. Existing person-generic methods have difficulty in generating realistic and lip-synced videos while preserving identity information. To tackle this problem, we propose a two-stage framework consisting of audio-to-landmark generation and landmark-to-video rendering procedures. First, we devise a novel Transformer-based landmark generator to infer lip and jaw landmarks from the audio. Prior landmark characteristics of the speaker's face are employed to make the generated landmarks coincide with the facial outline of the speaker. Then, a video rendering model is built to translate the generated landmarks into face images. During this stage, prior appearance information is extracted from the lower-half occluded target face and static reference images, which helps generate realistic and identity-preserving visual content. For effectively exploring the prior information of static reference images, we align static reference images with the target face's pose and expression based on motion fields. Moreover, auditory features are reused to guarantee that the generated face images are well synchronized with the audio. Extensive experiments demonstrate that our method can produce more realistic, lip-synced, and identity-preserving videos than existing person-generic talking face generation methods. + + + + Weakly Supervised Segmentation With Point Annotations for Histopathology Images via Contrast-Based Variational Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Weakly_Supervised_Segmentation_With_Point_Annotations_for_Histopathology_Images_via_CVPR_2023_paper.pdf + Image segmentation is a fundamental task in the field of imaging and vision. Supervised deep learning for segmentation has achieved unparalleled success when sufficient training data with annotated labels are available. However, annotation is known to be expensive to obtain, especially for histopathology images where the target regions are usually with high morphology variations and irregular shapes. Thus, weakly supervised learning with sparse annotations of points is promising to reduce the annotation workload. In this work, we propose a contrast-based variational model to generate segmentation results, which serve as reliable complementary supervision to train a deep segmentation model for histopathology images. The proposed method considers the common characteristics of target regions in histopathology images and can be trained in an end-to-end manner. It can generate more regionally consistent and smoother boundary segmentation, and is more robust to unlabeled 'novel' regions. Experiments on two different histology datasets demonstrate its effectiveness and efficiency in comparison to previous models. Code is available at: https://github.com/hrzhang1123/CVM_WS_Segmentation. 
+ + + + Zero-Shot Generative Model Adaptation via Image-Specific Prompt Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Zero-Shot_Generative_Model_Adaptation_via_Image-Specific_Prompt_Learning_CVPR_2023_paper.pdf + Recently, CLIP-guided image synthesis has shown appealing performance on adapting a pre-trained source-domain generator to an unseen target domain. It does not require any target-domain samples but only the textual domain labels. The training is highly efficient, e.g., a few minutes. However, existing methods still have some limitations in the quality of generated images and may suffer from the mode collapse issue. A key reason is that a fixed adaptation direction is applied for all cross-domain image pairs, which leads to identical supervision signals. To address this issue, we propose an Image-specific Prompt Learning (IPL) method, which learns specific prompt vectors for each source-domain image. This produces a more precise adaptation direction for every cross-domain image pair, endowing the target-domain generator with greatly enhanced flexibility. Qualitative and quantitative evaluations on various domains demonstrate that IPL effectively improves the quality and diversity of synthesized images and alleviates the mode collapse. Moreover, IPL is independent of the structure of the generative model, such as generative adversarial networks or diffusion models. Code is available at https://github.com/Picsart-AI-Research/IPL-Zero-Shot-Generative-Model-Adaptation. + + + + CelebV-Text: A Large-Scale Facial Text-Video Dataset + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_CelebV-Text_A_Large-Scale_Facial_Text-Video_Dataset_CVPR_2023_paper.pdf + Text-driven generation models are flourishing in video generation and editing. However, face-centric text-to-video generation remains a challenge due to the lack of a suitable dataset containing high-quality videos and highly relevant texts. This paper presents CelebV-Text, a large-scale, diverse, and high-quality dataset of facial text-video pairs, to facilitate research on facial text-to-video generation tasks. CelebV-Text comprises 70,000 in-the-wild face video clips with diverse visual content, each paired with 20 texts generated using the proposed semi-automatic text generation strategy. The provided texts are of high quality, describing both static and dynamic attributes precisely. The superiority of CelebV-Text over other datasets is demonstrated via comprehensive statistical analysis of the videos, texts, and text-video relevance. The effectiveness and potential of CelebV-Text are further shown through extensive self-evaluation. A benchmark is constructed with representative methods to standardize the evaluation of the facial text-to-video generation task. All data and models are publicly available. + + + + Hard Patches Mining for Masked Image Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Hard_Patches_Mining_for_Masked_Image_Modeling_CVPR_2023_paper.pdf + Masked image modeling (MIM) has attracted much research attention due to its promising potential for learning scalable visual representations. In typical approaches, models usually focus on predicting specific contents of masked patches, and their performances are highly related to pre-defined mask strategies. Intuitively, this procedure can be considered as training a student (the model) on solving given problems (predict masked patches). 
However, we argue that the model should not only focus on solving given problems, but also stand in the shoes of a teacher to produce a more challenging problem by itself. To this end, we propose Hard Patches Mining (HPM), a brand-new framework for MIM pre-training. We observe that the reconstruction loss can naturally be the metric of the difficulty of the pre-training task. Therefore, we introduce an auxiliary loss predictor, predicting patch-wise losses first and deciding where to mask next. It adopts a relative relationship learning strategy to prevent overfitting to exact reconstruction loss values. Experiments under various settings demonstrate the effectiveness of HPM in constructing masked images. Furthermore, we empirically find that solely introducing the loss prediction objective leads to powerful representations, verifying the efficacy of the ability to be aware of where is hard to reconstruct. + + + + Diffusion-SDF: Text-To-Shape via Voxelized Diffusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Diffusion-SDF_Text-To-Shape_via_Voxelized_Diffusion_CVPR_2023_paper.pdf + With the rising industrial attention to 3D virtual modeling technology, generating novel 3D content based on specified conditions (e.g. text) has become a hot issue. In this paper, we propose a new generative 3D modeling framework called Diffusion-SDF for the challenging task of text-to-shape synthesis. Previous approaches lack flexibility in both 3D data representation and shape generation, thereby failing to generate highly diversified 3D shapes conforming to the given text descriptions. To address this, we propose a SDF autoencoder together with the Voxelized Diffusion model to learn and generate representations for voxelized signed distance fields (SDFs) of 3D shapes. Specifically, we design a novel UinU-Net architecture that implants a local-focused inner network inside the standard U-Net architecture, which enables better reconstruction of patch-independent SDF representations. We extend our approach to further text-to-shape tasks including text-conditioned shape completion and manipulation. Experimental results show that Diffusion-SDF generates both higher quality and more diversified 3D shapes that conform well to given text descriptions when compared to previous approaches. Code is available at: https://github.com/ttlmh/Diffusion-SDF. + + + + + + Boundary-Aware Backward-Compatible Representation via Adversarial Learning in Image Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Pan_Boundary-Aware_Backward-Compatible_Representation_via_Adversarial_Learning_in_Image_Retrieval_CVPR_2023_paper.pdf + Image retrieval plays an important role in the Internet world. Usually, the core parts of mainstream visual retrieval systems include an online service of the embedding model and a large-scale vector database. For traditional model upgrades, the old model will not be replaced by the new one until the embeddings of all the images in the database are re-computed by the new model, which takes days or weeks for a large amount of data. Recently, backward-compatible training (BCT) enables the new model to be immediately deployed online by making the new embeddings directly comparable to the old ones. For BCT, improving the compatibility of two models with less negative impact on retrieval performance is the key challenge. 
In this paper, we introduce AdvBCT, an Adversarial Backward-Compatible Training method with an elastic boundary constraint that takes both compatibility and discrimination into consideration. We first employ adversarial learning to minimize the distribution disparity between embeddings of the new model and the old model. Meanwhile, we add an elastic boundary constraint during training to improve compatibility and discrimination efficiently. Extensive experiments on GLDv2, Revisited Oxford (ROxford), and Revisited Paris (RParis) demonstrate that our method outperforms other BCT methods on both compatibility and discrimination. The implementation of AdvBCT will be publicly available at https://github.com/Ashespt/AdvBCT. + + + + Super-CLEVR: A Virtual Benchmark To Diagnose Domain Robustness in Visual Reasoning + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Super-CLEVR_A_Virtual_Benchmark_To_Diagnose_Domain_Robustness_in_Visual_CVPR_2023_paper.pdf + Visual Question Answering (VQA) models often perform poorly on out-of-distribution data and struggle on domain generalization. Due to the multi-modal nature of this task, multiple factors of variation are intertwined, making generalization difficult to analyze. This motivates us to introduce a virtual benchmark, Super-CLEVR, where different factors in VQA domain shifts can be isolated in order that their effects can be studied independently. Four factors are considered: visual complexity, question redundancy, concept distribution and concept compositionality. With controllably generated data, Super-CLEVR enables us to test VQA methods in situations where the test data differs from the training data along each of these axes. We study four existing methods, including two neural symbolic methods NSCL and NSVQA, and two non-symbolic methods FiLM and mDETR; and our proposed method, probabilistic NSVQA (P-NSVQA), which extends NSVQA with uncertainty reasoning. P-NSVQA outperforms other methods on three of the four domain shift factors. Our results suggest that disentangling reasoning and perception, combined with probabilistic uncertainty, form a strong VQA model that is more robust to domain shifts. The dataset and code are released at https://github.com/Lizw14/Super-CLEVR. + + + + Sliced Optimal Partial Transport + http://openaccess.thecvf.com//content/CVPR2023/papers/Bai_Sliced_Optimal_Partial_Transport_CVPR_2023_paper.pdf + Optimal transport (OT) has become exceedingly popular in machine learning, data science, and computer vision. The core assumption in the OT problem is the equal total amount of mass in source and target measures, which limits its application. Optimal Partial Transport (OPT) is a recently proposed solution to this limitation. Similar to the OT problem, the computation of OPT relies on solving a linear programming problem (often in high dimensions), which can become computationally prohibitive. In this paper, we propose an efficient algorithm for calculating the OPT problem between two non-negative measures in one dimension. Next, following the idea of sliced OT distances, we utilize slicing to define the sliced OPT distance. Finally, we demonstrate the computational and accuracy benefits of the sliced OPT-based method in various numerical experiments. In particular, we show an application of our proposed Sliced-OPT in noisy point cloud registration. 
+ + + + Siamese DETR + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Siamese_DETR_CVPR_2023_paper.pdf + Recent self-supervised methods are mainly designed for representation learning with the base model, e.g., ResNets or ViTs. They cannot be easily transferred to DETR, with task-specific Transformer modules. In this work, we present Siamese DETR, a Siamese self-supervised pretraining approach for the Transformer architecture in DETR. We consider learning view-invariant and detection-oriented representations simultaneously through two complementary tasks, i.e., localization and discrimination, in a novel multi-view learning framework. Two self-supervised pretext tasks are designed: (i) Multi-View Region Detection aims at learning to localize regions-of-interest between augmented views of the input, and (ii) Multi-View Semantic Discrimination attempts to improve object-level discrimination for each region. The proposed Siamese DETR achieves state-of-the-art transfer performance on COCO and PASCAL VOC detection using different DETR variants in all setups. Code is available at https://github.com/Zx55/SiameseDETR. + + + + Turning Strengths Into Weaknesses: A Certified Robustness Inspired Attack Framework Against Graph Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Turning_Strengths_Into_Weaknesses_A_Certified_Robustness_Inspired_Attack_Framework_CVPR_2023_paper.pdf + Graph neural networks (GNNs) have achieved state-of-the-art performance in many graph-related tasks such as node classification. However, recent studies show that GNNs are vulnerable to both test-time and training-time attacks that perturb the graph structure. While the existing attack methods have shown promising attack performance, we would like to design an attack framework that can significantly enhance both the existing evasion and poisoning attacks. In particular, our attack framework is inspired by certified robustness. Certified robustness was originally used by defenders to defend against adversarial attacks. We are the first, from the attacker perspective, to leverage its properties to better attack GNNs. Specifically, we first leverage and derive nodes' certified perturbation sizes against evasion and poisoning attacks based on randomized smoothing. A larger certified perturbation size of a node indicates this node is theoretically more robust to graph perturbations. Such a property motivates us to focus more on nodes with smaller certified perturbation sizes, as they are easier to be attacked after graph perturbations. Accordingly, we design a certified robustness inspired attack loss, when incorporated into (any) existing attacks, produces our certified robustness inspired attack framework. We apply our attack framework to the existing attacks and results show it can significantly enhance the existing attacks' performance. + + + + Demystifying Causal Features on Adversarial Examples and Causal Inoculation for Robust Network by Adversarial Instrumental Variable Regression + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Demystifying_Causal_Features_on_Adversarial_Examples_and_Causal_Inoculation_for_CVPR_2023_paper.pdf + The origin of adversarial examples is still inexplicable in research fields, and it arouses arguments from various viewpoints, albeit comprehensive investigations. 
In this paper, we propose a way of delving into the unexpected vulnerability in adversarially trained networks from a causal perspective, namely adversarial instrumental variable (IV) regression. By deploying it, we estimate the causal relation of adversarial prediction under an unbiased environment dissociated from unknown confounders. Our approach aims to demystify inherent causal features on adversarial examples by leveraging a zero-sum optimization game between a causal feature estimator (i.e., hypothesis model) and worst-case counterfactuals (i.e., test function) disturbing the search for causal features. Through extensive analyses, we demonstrate that the estimated causal features are highly related to the correct prediction for adversarial robustness, and the counterfactuals exhibit extreme features significantly deviating from the correct prediction. In addition, we present how to effectively inoculate CAusal FEatures (CAFE) into defense networks for improving adversarial robustness. + + + + B-Spline Texture Coefficients Estimator for Screen Content Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Pak_B-Spline_Texture_Coefficients_Estimator_for_Screen_Content_Image_Super-Resolution_CVPR_2023_paper.pdf + Screen content images (SCIs) include many informative components, e.g., texts and graphics. Such content creates sharp edges or homogeneous areas, making the pixel distribution of an SCI different from that of a natural image. Therefore, we need to properly handle the edges and textures to minimize information distortion of the contents when a display device's resolution differs from that of the SCI. To achieve this goal, we propose an implicit neural representation using B-splines for screen content image super-resolution (SCI SR) with arbitrary scales. Our method extracts scaling, translating, and smoothing parameters of B-splines. The subsequent multi-layer perceptron (MLP) uses the estimated B-splines to recover high-resolution SCI. Our network outperforms both a transformer-based reconstruction and an implicit Fourier representation method at almost every upscaling factor, thanks to the positive constraint and compact support of the B-spline basis. Moreover, our SR results are recognized as correct text letters with the highest confidence by a pre-trained scene text recognition network. Source code is available at https://github.com/ByeongHyunPak/btc. + + + + Domain Expansion of Image Generators + http://openaccess.thecvf.com//content/CVPR2023/papers/Nitzan_Domain_Expansion_of_Image_Generators_CVPR_2023_paper.pdf + Can one inject new concepts into an already trained generative model, while respecting its existing structure and knowledge? We propose a new task -- domain expansion -- to address this. Given a pretrained generator and novel (but related) domains, we expand the generator to jointly model all domains, old and new, harmoniously. First, we note the generator contains a meaningful, pretrained latent space. Is it possible to minimally perturb this hard-earned representation, while maximally representing the new domains? Interestingly, we find that the latent space offers unused, "dormant" axes, which do not affect the output. This provides an opportunity -- by "repurposing" these axes, we are able to represent new domains, without perturbing the original representation. In fact, we find that pretrained generators have the capacity to add several -- even hundreds -- of new domains!
Using our expansion technique, one "expanded" model can supersede numerous domain-specific models, without expanding model size. Additionally, using a single, expanded generator natively supports smooth transitions between and composition of domains. + + + + LVQAC: Lattice Vector Quantization Coupled With Spatially Adaptive Companding for Efficient Learned Image Compression + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_LVQAC_Lattice_Vector_Quantization_Coupled_With_Spatially_Adaptive_Companding_for_CVPR_2023_paper.pdf + Recently, numerous end-to-end optimized image compression neural networks have been developed and proved themselves as leaders in rate-distortion performance. The main strength of these learnt compression methods is in powerful nonlinear analysis and synthesis transforms that can be facilitated by deep neural networks. However, out of operational expediency, most of these end-to-end methods adopt uniform scalar quantizers rather than vector quantizers, which are information-theoretically optimal. In this paper, we present a novel Lattice Vector Quantization scheme coupled with a spatially Adaptive Companding (LVQAC) mapping. LVQ can better exploit the inter-feature dependencies than scalar uniform quantization while being computationally almost as simple as the latter. Moreover, to improve the adaptability of LVQ to source statistics, we couple a spatially adaptive companding (AC) mapping with LVQ. The resulting LVQAC design can be easily embedded into any end-to-end optimized image compression system. Extensive experiments demonstrate that for any end-to-end CNN image compression models, replacing uniform quantizer by LVQAC achieves better rate-distortion performance without significantly increasing the model complexity. + + + + Fine-Grained Face Swapping via Regional GAN Inversion + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Fine-Grained_Face_Swapping_via_Regional_GAN_Inversion_CVPR_2023_paper.pdf + We present a novel paradigm for high-fidelity face swapping that faithfully preserves the desired subtle geometry and texture details. We rethink face swapping from the perspective of fine-grained face editing, i.e., editing for swapping (E4S), and propose a framework that is based on the explicit disentanglement of the shape and texture of facial components. Following the E4S principle, our framework enables both global and local swapping of facial features, as well as controlling the amount of partial swapping specified by the user. Furthermore, the E4S paradigm is inherently capable of handling facial occlusions by means of facial masks. At the core of our system lies a novel Regional GAN Inversion (RGI) method, which allows the explicit disentanglement of shape and texture. It also allows face swapping to be performed in the latent space of StyleGAN. Specifically, we design a multi-scale mask-guided encoder to project the texture of each facial component into regional style codes. We also design a mask-guided injection module to manipulate the feature maps with the style codes. Based on the disentanglement, face swapping is reformulated as a simplified problem of style and mask swapping. Extensive experiments and comparisons with current state-of-the-art methods demonstrate the superiority of our approach in preserving texture and shape details, as well as working with high resolution images. 
The project page is https://e4s2022.github.io + + + + Taming Diffusion Models for Audio-Driven Co-Speech Gesture Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Taming_Diffusion_Models_for_Audio-Driven_Co-Speech_Gesture_Generation_CVPR_2023_paper.pdf + Animating virtual avatars to make co-speech gestures facilitates various applications in human-machine interaction. Existing methods mainly rely on generative adversarial networks (GANs), which typically suffer from notorious mode collapse and unstable training, making it difficult to learn accurate audio-gesture joint distributions. In this work, we propose a novel diffusion-based framework, named Diffusion Co-Speech Gesture (DiffGesture), to effectively capture the cross-modal audio-to-gesture associations and preserve temporal coherence for high-fidelity audio-driven co-speech gesture generation. Specifically, we first establish the diffusion-conditional generation process on clips of skeleton sequences and audio to enable the whole framework. Then, a novel Diffusion Audio-Gesture Transformer is devised to better attend to the information from multiple modalities and model the long-term temporal dependency. Moreover, to eliminate temporal inconsistency, we propose an effective Diffusion Gesture Stabilizer with an annealed noise sampling strategy. Benefiting from the architectural advantages of diffusion models, we further incorporate implicit classifier-free guidance to trade off between diversity and gesture quality. Extensive experiments demonstrate that DiffGesture achieves state-of-the-art performance, rendering coherent gestures with better mode coverage and stronger audio correlations. Code is available at https://github.com/Advocate99/DiffGesture. + + + + NeRFLix: High-Quality Neural View Synthesis by Learning a Degradation-Driven Inter-Viewpoint MiXer + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_NeRFLix_High-Quality_Neural_View_Synthesis_by_Learning_a_Degradation-Driven_Inter-Viewpoint_CVPR_2023_paper.pdf + Neural radiance fields (NeRF) show great success in novel-view synthesis. However, in real-world scenes, recovering high-quality details from the source images is still challenging for existing NeRF-based approaches, due to potentially imperfect calibration information and scene representation inaccuracy. Even with high-quality training frames, the synthetic novel-view frames produced by NeRF models still suffer from notable rendering artifacts, such as noise and blur. To improve the synthesis quality of NeRF-based approaches, we propose NeRFLiX, a general NeRF-agnostic restorer paradigm that learns a degradation-driven inter-viewpoint mixer. Specifically, we design a NeRF-style degradation modeling approach and construct large-scale training data, enabling existing deep neural networks to effectively remove NeRF-native rendering artifacts. Moreover, beyond degradation removal, we propose an inter-viewpoint aggregation framework that is able to fuse highly related high-quality training images, pushing the performance of cutting-edge NeRF models to entirely new levels and producing highly photo-realistic synthetic images.
+ + + + STMixer: A One-Stage Sparse Action Detector + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_STMixer_A_One-Stage_Sparse_Action_Detector_CVPR_2023_paper.pdf + Traditional video action detectors typically adopt the two-stage pipeline, where a person detector is first employed to yield actor boxes and then 3D RoIAlign is used to extract actor-specific features for classification. This detection paradigm requires multi-stage training and inference and cannot capture context information outside the bounding box. Recently, a few query-based action detectors are proposed to predict action instances in an end-to-end manner. However, they still lack adaptability in feature sampling or decoding, thus suffering from the issue of inferior performance or slower convergence. In this paper, we propose a new one-stage sparse action detector, termed STMixer. STMixer is based on two core designs. First, we present a query-based adaptive feature sampling module, which endows our STMixer with the flexibility of mining a set of discriminative features from the entire spatiotemporal domain. Second, we devise a dual-branch feature mixing module, which allows our STMixer to dynamically attend to and mix video features along the spatial and the temporal dimension respectively for better feature decoding. Coupling these two designs with a video backbone yields an efficient and accurate action detector. Without bells and whistles, STMixer obtains the state-of-the-art results on the datasets of AVA, UCF101-24, and JHMDB. + + + + Genie: Show Me the Data for Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Jeon_Genie_Show_Me_the_Data_for_Quantization_CVPR_2023_paper.pdf + Zero-shot quantization is a promising approach for developing lightweight deep neural networks when data is inaccessible owing to various reasons, including cost and issues related to privacy. By exploiting the learned parameters (u and sigma) of batch normalization layers in an FP32-pre-trained model, zero-shot quantization schemes focus on generating synthetic data. Subsequently, they distill knowledge from the pre-trained model (teacher) to the quantized model (student) such that the quantized model can be optimized with the synthetic dataset. However, thus far, zero-shot quantization has primarily been discussed in the context of quantization-aware training methods, which require task-specific losses and long-term optimization as much as retraining. We thus introduce a post-training quantization scheme for zero-shot quantization that produces high-quality quantized networks within a few hours. Furthermore, we propose a framework called GENIE that generates data suited for quantization. With the data synthesized by GENIE, we can produce robust quantized models without real datasets, which is comparable to few-shot quantization. We also propose a post-training quantization algorithm to enhance the performance of quantized models. By combining them, we can bridge the gap between zero-shot and few-shot quantization while significantly improving the quantization performance compared to that of existing approaches. In other words, we can obtain a unique state-of-the-art zero-shot quantization approach. 
+ + + + Multi-Agent Automated Machine Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Multi-Agent_Automated_Machine_Learning_CVPR_2023_paper.pdf + In this paper, we propose multi-agent automated machine learning (MA2ML) with the aim to effectively handle joint optimization of modules in automated machine learning (AutoML). MA2ML takes each machine learning module, such as data augmentation (AUG), neural architecture search (NAS), or hyper-parameters (HPO), as an agent and the final performance as the reward, to formulate a multi-agent reinforcement learning problem. MA2ML explicitly assigns credit to each agent according to its marginal contribution to enhance cooperation among modules, and incorporates off-policy learning to improve search efficiency. Theoretically, MA2ML guarantees monotonic improvement of joint optimization. Extensive experiments show that MA2ML yields the state-of-the-art top-1 accuracy on ImageNet under constraints of computational cost, e.g., 79.7%/80.5% with FLOPs fewer than 600M/800M. Extensive ablation studies verify the benefits of credit assignment and off-policy learning of MA2ML. + + + + Robot Structure Prior Guided Temporal Attention for Camera-to-Robot Pose Estimation From Image Sequence + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_Robot_Structure_Prior_Guided_Temporal_Attention_for_Camera-to-Robot_Pose_Estimation_CVPR_2023_paper.pdf + In this work, we tackle the problem of online camera-to-robot pose estimation from single-view successive frames of an image sequence, a crucial task for robots to interact with the world. The primary obstacles of this task are the robot's self-occlusions and the ambiguity of single-view images. This work demonstrates, for the first time, the effectiveness of temporal information and the robot structure prior in addressing these challenges. Given the successive frames and the robot joint configuration, our method learns to accurately regress the 2D coordinates of the predefined robot's keypoints (e.g., joints). With the camera intrinsic and robotic joints status known, we get the camera-to-robot pose using a Perspective-n-point (PnP) solver. We further improve the camera-to-robot pose iteratively using the robot structure prior. To train the whole pipeline, we build a large-scale synthetic dataset generated with domain randomisation to bridge the sim-to-real gap. The extensive experiments on synthetic and real-world datasets and the downstream robotic grasping task demonstrate that our method achieves new state-of-the-art performances and outperforms traditional hand-eye calibration algorithms in real-time (36 FPS). Code and data are available at the project page: https://sites.google.com/view/sgtapose. + + + + HRDFuse: Monocular 360deg Depth Estimation by Collaboratively Learning Holistic-With-Regional Depth Distributions + http://openaccess.thecvf.com//content/CVPR2023/papers/Ai_HRDFuse_Monocular_360deg_Depth_Estimation_by_Collaboratively_Learning_Holistic-With-Regional_Depth_CVPR_2023_paper.pdf + Depth estimation from a monocular 360 image is a burgeoning problem owing to its holistic sensing of a scene. Recently, some methods, e.g., OmniFusion, have applied the tangent projection (TP) to represent a 360 image and predicted depth values via patch-wise regressions, which are merged to get a depth map with equirectangular projection (ERP) format. 
However, these methods suffer from 1) non-trivial process of merging a large number of patches; 2) capturing less holistic-with-regional contextual information by directly regressing the depth value of each pixel. In this paper, we propose a novel framework, HRDFuse, that subtly combines the potential of convolutional neural networks (CNNs) and transformers by collaboratively learning the holistic contextual information from the ERP and the regional structural information from the TP. Firstly, we propose a spatial feature alignment (SFA) module that learns feature similarities between the TP and ERP to aggregate the TP features into a complete ERP feature map in a pixel-wise manner. Secondly, we propose a collaborative depth distribution classification (CDDC) module that learns the holistic-with-regional histograms capturing the ERP and TP depth distributions. As such, the final depth values can be predicted as a linear combination of histogram bin centers. Lastly, we adaptively combine the depth predictions from ERP and TP to obtain the final depth map. Extensive experiments show that our method predicts more smooth and accurate depth results while achieving favorably better results than the SOTA methods. + + + + StructVPR: Distill Structural Knowledge With Weighting Samples for Visual Place Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_StructVPR_Distill_Structural_Knowledge_With_Weighting_Samples_for_Visual_Place_CVPR_2023_paper.pdf + Visual place recognition (VPR) is usually considered as a specific image retrieval problem. Limited by existing training frameworks, most deep learning-based works cannot extract sufficiently stable global features from RGB images and rely on a time-consuming re-ranking step to exploit spatial structural information for better performance. In this paper, we propose StructVPR, a novel training architecture for VPR, to enhance structural knowledge in RGB global features and thus improve feature stability in a constantly changing environment. Specifically, StructVPR uses segmentation images as a more definitive source of structural knowledge input into a CNN network and applies knowledge distillation to avoid online segmentation and inference of seg-branch in testing. Considering that not all samples contain high-quality and helpful knowledge, and some even hurt the performance of distillation, we partition samples and weigh each sample's distillation loss to enhance the expected knowledge precisely. Finally, StructVPR achieves impressive performance on several benchmarks using only global retrieval and even outperforms many two-stage approaches by a large margin. After adding additional re-ranking, ours achieves state-of-the-art performance while maintaining a low computational cost. + + + + Learning Human-to-Robot Handovers From Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Christen_Learning_Human-to-Robot_Handovers_From_Point_Clouds_CVPR_2023_paper.pdf + We propose the first framework to learn control policies for vision-based human-to-robot handovers, a critical task for human-robot interaction. While research in Embodied AI has made significant progress in training robot agents in simulated environments, interacting with humans remains challenging due to the difficulties of simulating humans. Fortunately, recent research has developed realistic simulated environments for human-to-robot handovers. 
Leveraging this result, we introduce a method that is trained with a human-in-the-loop via a two-stage teacher-student framework that uses motion and grasp planning, reinforcement learning, and self-supervision. We show significant performance gains over baselines on a simulation benchmark, in sim-to-sim transfer, and in sim-to-real transfer. + + + + Score Jacobian Chaining: Lifting Pretrained 2D Diffusion Models for 3D Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Score_Jacobian_Chaining_Lifting_Pretrained_2D_Diffusion_Models_for_3D_CVPR_2023_paper.pdf + A diffusion model learns to predict a vector field of gradients. We propose to apply the chain rule to the learned gradients, and back-propagate the score of a diffusion model through the Jacobian of a differentiable renderer, which we instantiate to be a voxel radiance field. This setup aggregates 2D scores at multiple camera viewpoints into a 3D score, and repurposes a pretrained 2D model for 3D data generation. We identify a technical challenge of distribution mismatch that arises in this application, and propose a novel estimation mechanism to resolve it. We run our algorithm on several off-the-shelf diffusion image generative models, including the recently released Stable Diffusion trained on the large-scale LAION dataset. + + + + Role of Transients in Two-Bounce Non-Line-of-Sight Imaging + http://openaccess.thecvf.com//content/CVPR2023/papers/Somasundaram_Role_of_Transients_in_Two-Bounce_Non-Line-of-Sight_Imaging_CVPR_2023_paper.pdf + The goal of non-line-of-sight (NLOS) imaging is to image objects occluded from the camera's field of view using multiply scattered light. Recent works have demonstrated the feasibility of two-bounce (2B) NLOS imaging by scanning a laser and measuring cast shadows of occluded objects in scenes with two relay surfaces. In this work, we study the role of time-of-flight (ToF) measurements, i.e., transients, in 2B-NLOS under multiplexed illumination. Specifically, we study how ToF information can reduce the number of measurements and spatial resolution needed for shape reconstruction. We present our findings with respect to tradeoffs in (1) temporal resolution, (2) spatial resolution, and (3) number of image captures by studying SNR and recoverability as functions of system parameters. This leads to a formal definition of the mathematical constraints for 2B lidar. We believe that our work lays an analytical groundwork for the design of future NLOS imaging systems, especially as ToF sensors become increasingly ubiquitous. + + + + Elastic Aggregation for Federated Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Elastic_Aggregation_for_Federated_Optimization_CVPR_2023_paper.pdf + Federated learning enables the privacy-preserving training of neural network models using real-world data across distributed clients. FedAvg has become the preferred optimizer for federated learning because of its simplicity and effectiveness. FedAvg uses naive aggregation to update the server model, interpolating client models based on the number of instances used in their training. However, naive aggregation suffers from client-drift when the data is heterogeneous (non-IID), leading to unstable and slow convergence. In this work, we propose a novel aggregation approach, elastic aggregation, to overcome these issues.
Elastic aggregation interpolates client models adaptively according to parameter sensitivity, which is measured by computing how much the overall prediction function output changes when each parameter is changed. This measurement is performed in an unsupervised and online manner. Elastic aggregation reduces the magnitudes of updates to the more sensitive parameters so as to prevent the server model from drifting to any one client distribution, and conversely boosts updates to the less sensitive parameters to better explore different client distributions. Empirical results on real and synthetic data as well as analytical results show that elastic aggregation leads to efficient training in both convex and non-convex settings, while being fully agnostic to client heterogeneity and robust to large numbers of clients, partial participation, and imbalanced data. Finally, elastic aggregation works well with other federated optimizers and achieves significant improvements across the board. + + + + ObjectMatch: Robust Registration Using Canonical Object Correspondences + http://openaccess.thecvf.com//content/CVPR2023/papers/Gumeli_ObjectMatch_Robust_Registration_Using_Canonical_Object_Correspondences_CVPR_2023_paper.pdf + We present ObjectMatch, a semantic and object-centric camera pose estimator for RGB-D SLAM pipelines. Modern camera pose estimators rely on direct correspondences of overlapping regions between frames; however, they cannot align camera frames with little or no overlap. In this work, we propose to leverage indirect correspondences obtained via semantic object identification. For instance, when an object is seen from the front in one frame and from the back in another frame, we can provide additional pose constraints through canonical object correspondences. We first propose a neural network to predict such correspondences on a per-pixel level, which we then combine in our energy formulation with state-of-the-art keypoint matching solved with a joint Gauss-Newton optimization. In a pairwise setting, our method improves registration recall of state-of-the-art feature matching, including from 24% to 45% in pairs with 10% or less inter-frame overlap. In registering RGB-D sequences, our method outperforms cutting-edge SLAM baselines in challenging, low-frame-rate scenarios, achieving more than 35% reduction in trajectory error in multiple scenes. + + + + Center Focusing Network for Real-Time LiDAR Panoptic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Center_Focusing_Network_for_Real-Time_LiDAR_Panoptic_Segmentation_CVPR_2023_paper.pdf + LiDAR panoptic segmentation facilitates an autonomous vehicle to comprehensively understand the surrounding objects and scenes and is required to run in real time. The recent proposal-free methods accelerate the algorithm, but their effectiveness and efficiency are still limited owing to the difficulty of modeling non-existent instance centers and the costly center-based clustering modules. To achieve accurate and real-time LiDAR panoptic segmentation, a novel center focusing network (CFNet) is introduced. Specifically, the center focusing feature encoding (CFFE) is proposed to explicitly understand the relationships between the original LiDAR points and virtual instance centers by shifting the LiDAR points and filling in the center points. Moreover, to leverage the redundantly detected centers, a fast center deduplication module (CDM) is proposed to select only one center for each instance. 
Experiments on the SemanticKITTI and nuScenes panoptic segmentation benchmarks demonstrate that our CFNet outperforms all existing methods by a large margin and is 1.6 times faster than the most efficient method. + + + + Restoration of Hand-Drawn Architectural Drawings Using Latent Space Mapping With Degradation Generator + http://openaccess.thecvf.com//content/CVPR2023/papers/Choi_Restoration_of_Hand-Drawn_Architectural_Drawings_Using_Latent_Space_Mapping_With_CVPR_2023_paper.pdf + This work presents the restoration of drawings of wooden built heritage. Hand-drawn drawings contain the most important original information but are often severely degraded over time. A novel restoration method based on the vector quantized variational autoencoders is presented. Latent space representations of drawings and noise are learned, which are used to map noisy drawings to clean drawings for restoration and to generate authentic noisy drawings for data augmentation. The proposed method is applied to the drawings archived in the Cultural Heritage Administration. Restored drawings show significant quality improvement and allow more accurate interpretations of information. + + + + Few-Shot Class-Incremental Learning via Class-Aware Bilateral Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Few-Shot_Class-Incremental_Learning_via_Class-Aware_Bilateral_Distillation_CVPR_2023_paper.pdf + Few-Shot Class-Incremental Learning (FSCIL) aims to continually learn novel classes based on only few training samples, which poses a more challenging task than the well-studied Class-Incremental Learning (CIL) due to data scarcity. While knowledge distillation, a prevailing technique in CIL, can alleviate the catastrophic forgetting of older classes by regularizing outputs between current and previous model, it fails to consider the overfitting risk of novel classes in FSCIL. To adapt the powerful distillation technique for FSCIL, we propose a novel distillation structure, by taking the unique challenge of overfitting into account. Concretely, we draw knowledge from two complementary teachers. One is the model trained on abundant data from base classes that carries rich general knowledge, which can be leveraged for easing the overfitting of current novel classes. The other is the updated model from last incremental session that contains the adapted knowledge of previous novel classes, which is used for alleviating their forgetting. To combine the guidances, an adaptive strategy conditioned on the class-wise semantic similarities is introduced. Besides, for better preserving base class knowledge when accommodating novel concepts, we adopt a two-branch network with an attention-based aggregation module to dynamically merge predictions from two complementary branches. Extensive experiments on 3 popular FSCIL datasets: mini-ImageNet, CIFAR100 and CUB200 validate the effectiveness of our method by surpassing existing works by a significant margin. + + + + Learning To Dub Movies via Hierarchical Prosody Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Cong_Learning_To_Dub_Movies_via_Hierarchical_Prosody_Models_CVPR_2023_paper.pdf + Given a piece of text, a video clip and a reference audio, the movie dubbing (also known as visual voice clone, V2C) task aims to generate speeches that match the speaker's emotion presented in the video using the desired speaker voice as reference. 
V2C is more challenging than conventional text-to-speech tasks as it additionally requires the generated speech to exactly match the varying emotions and speaking speed presented in the video. Unlike previous works, we propose a novel movie dubbing architecture to tackle these problems via hierarchical prosody modeling, which bridges the visual information to corresponding speech prosody from three aspects: lip, face, and scene. Specifically, we align lip movement to the speech duration, and convey facial expression to speech energy and pitch via attention mechanism based on valence and arousal representations inspired by the psychology findings. Moreover, we design an emotion booster to capture the atmosphere from global video scenes. All these embeddings are used together to generate mel-spectrogram, which is then converted into speech waves by an existing vocoder. Extensive experimental results on the V2C and Chem benchmark datasets demonstrate the favourable performance of the proposed method. The code and trained models will be made available at https://github.com/GalaxyCong/HPMDubbing. + + + + DiffusionRig: Learning Personalized Priors for Facial Appearance Editing + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_DiffusionRig_Learning_Personalized_Priors_for_Facial_Appearance_Editing_CVPR_2023_paper.pdf + We address the problem of learning person-specific facial priors from a small number (e.g., 20) of portrait photos of the same person. This enables us to edit this specific person's facial appearance, such as expression and lighting, while preserving their identity and high-frequency facial details. Key to our approach, which we dub DiffusionRig, is a diffusion model conditioned on, or "rigged by," crude 3D face models estimated from single in-the-wild images by an off-the-shelf estimator. On a high level, DiffusionRig learns to map simplistic renderings of 3D face models to realistic photos of a given person. Specifically, DiffusionRig is trained in two stages: It first learns generic facial priors from a large-scale face dataset and then person-specific priors from a small portrait photo collection of the person of interest. By learning the CGI-to-photo mapping with such personalized priors, DiffusionRig can "rig" the lighting, facial expression, head pose, etc. of a portrait photo, conditioned only on coarse 3D models while preserving this person's identity and other high-frequency characteristics. Qualitative and quantitative experiments show that DiffusionRig outperforms existing approaches in both identity preservation and photorealism. Please see the project website: https://diffusionrig.github.io for the supplemental material, video, code, and data. + + + + Delving StyleGAN Inversion for Image Editing: A Foundation Latent Space Viewpoint + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Delving_StyleGAN_Inversion_for_Image_Editing_A_Foundation_Latent_Space_CVPR_2023_paper.pdf + GAN inversion and editing via StyleGAN maps an input image into the embedding spaces (W, W^+, and F) to simultaneously maintain image fidelity and meaningful manipulation. From latent space W to extended latent space W^+ to feature space F in StyleGAN, the editability of GAN inversion decreases while its reconstruction quality increases. Recent GAN inversion methods typically explore W^+ and F rather than W to improve reconstruction fidelity while maintaining editability. 
As W^+ and F are derived from W that is essentially the foundation latent space of StyleGAN, these GAN inversion methods focusing on W^+ and F spaces could be improved by stepping back to W. In this work, we propose to first obtain the proper latent code in foundation latent space W. We introduce contrastive learning to align W and the image space for proper latent code discovery. Then, we leverage a cross-attention encoder to transform the obtained latent code in W into W^+ and F, accordingly. Our experiments show that our exploration of the foundation latent space W improves the representation ability of latent codes in W^+ and features in F, which yields state-of-the-art reconstruction fidelity and editability results on the standard benchmarks. Project page: https://kumapowerliu.github.io/CLCAE. + + + + Enlarging Instance-Specific and Class-Specific Information for Open-Set Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Cen_Enlarging_Instance-Specific_and_Class-Specific_Information_for_Open-Set_Action_Recognition_CVPR_2023_paper.pdf + Open-set action recognition is to reject unknown human action cases which are out of the distribution of the training set. Existing methods mainly focus on learning better uncertainty scores but dismiss the importance of feature representations. We find that features with richer semantic diversity can significantly improve the open-set performance under the same uncertainty scores. In this paper, we begin with analyzing the feature representation behavior in the open-set action recognition (OSAR) problem based on the information bottleneck (IB) theory, and propose to enlarge the instance-specific (IS) and class-specific (CS) information contained in the feature for better performance. To this end, a novel Prototypical Similarity Learning (PSL) framework is proposed to keep the instance variance within the same class to retain more IS information. Besides, we notice that unknown samples sharing similar appearances to known samples are easily misclassified as known classes. To alleviate this issue, video shuffling is further introduced in our PSL to learn distinct temporal information between original and shuffled samples, which we find enlarges the CS information. Extensive experiments demonstrate that the proposed PSL can significantly boost both the open-set and closed-set performance and achieves state-of-the-art results on multiple benchmarks. Code is available at https://github.com/Jun-CEN/PSL. + + + + Decoupled Semantic Prototypes Enable Learning From Diverse Annotation Types for Semi-Weakly Segmentation in Expert-Driven Domains + http://openaccess.thecvf.com//content/CVPR2023/papers/Reiss_Decoupled_Semantic_Prototypes_Enable_Learning_From_Diverse_Annotation_Types_for_CVPR_2023_paper.pdf + A vast amount of images and pixel-wise annotations allowed our community to build scalable segmentation solutions for natural domains. However, the transfer to expert-driven domains like microscopy applications or medical healthcare remains difficult as domain experts are a critical factor due to their limited availability for providing pixel-wise annotations. To enable affordable segmentation solutions for such domains, we need training strategies which can simultaneously handle diverse annotation types and are not bound to costly pixel-wise annotations. In this work, we analyze existing training algorithms towards their flexibility for different annotation types and scalability to small annotation regimes. 
We conduct an extensive evaluation in the challenging domain of organelle segmentation and find that existing semi- and semi-weakly supervised training algorithms are not able to fully exploit diverse annotation types. Driven by our findings, we introduce Decoupled Semantic Prototypes (DSP) as a training method for semantic segmentation which enables learning from annotation types as diverse as image-level-, point-, bounding box-, and pixel-wise annotations and which leads to remarkable accuracy gains over existing solutions for semi-weakly segmentation. + + + + Iterative Next Boundary Detection for Instance Segmentation of Tree Rings in Microscopy Images of Shrub Cross Sections + http://openaccess.thecvf.com//content/CVPR2023/papers/Gillert_Iterative_Next_Boundary_Detection_for_Instance_Segmentation_of_Tree_Rings_CVPR_2023_paper.pdf + We address the problem of detecting tree rings in microscopy images of shrub cross sections. This can be regarded as a special case of the instance segmentation task with several unique challenges such as the concentric circular ring shape of the objects and high precision requirements that result in inadequate performance of existing methods. We propose a new iterative method which we term Iterative Next Boundary Detection (INBD). It intuitively models the natural growth direction, starting from the center of the shrub cross section and detecting the next ring boundary in each iteration step. In our experiments, INBD shows superior performance to generic instance segmentation methods and is the only one with a built-in notion of chronological order. Our dataset and source code are available at http://github.com/alexander-g/INBD. + + + + Learning and Aggregating Lane Graphs for Urban Automated Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Buchner_Learning_and_Aggregating_Lane_Graphs_for_Urban_Automated_Driving_CVPR_2023_paper.pdf + Lane graph estimation is an essential and highly challenging task in automated driving and HD map learning. Existing methods using either onboard or aerial imagery struggle with complex lane topologies, out-of-distribution scenarios, or significant occlusions in the image space. Moreover, merging overlapping lane graphs to obtain consistent largescale graphs remains difficult. To overcome these challenges, we propose a novel bottom-up approach to lane graph estimation from aerial imagery that aggregates multiple overlapping graphs into a single consistent graph. Due to its modular design, our method allows us to address two complementary tasks: predicting ego-respective successor lane graphs from arbitrary vehicle positions using a graph neural network and aggregating these predictions into a consistent global lane graph. Extensive experiments on a large-scale lane graph dataset demonstrate that our approach yields highly accurate lane graphs, even in regions with severe occlusions. The presented approach to graph aggregation proves to eliminate inconsistent predictions while increasing the overall graph quality. We make our large-scale urban lane graph dataset and code publicly available at http://urbanlanegraph.cs.uni-freiburg.de. 
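The lane-graph abstract above ends with the aggregation of many ego-relative successor graphs into one consistent global graph. As a toy illustration of the merge step only (the paper's aggregation is learned and more sophisticated), the sketch below snaps predicted nodes that fall into the same spatial cell; the `xy` node attribute and the `snap_dist` threshold are assumptions.

```python
import networkx as nx

def aggregate_lane_graphs(graphs, snap_dist=1.0):
    """Merge overlapping directed lane graphs, whose nodes carry world
    coordinates in an 'xy' attribute, by snapping nodes that fall into the
    same spatial cell. A deliberately simplistic stand-in for the paper's
    learned aggregation."""
    merged = nx.DiGraph()
    cell_to_node = {}

    def cell(xy):
        return (round(xy[0] / snap_dist), round(xy[1] / snap_dist))

    for g in graphs:
        remap = {}
        for n, data in g.nodes(data=True):
            c = cell(data["xy"])
            if c not in cell_to_node:
                cell_to_node[c] = len(cell_to_node)
                merged.add_node(cell_to_node[c], xy=data["xy"])
            remap[n] = cell_to_node[c]
        for u, v in g.edges():
            if remap[u] != remap[v]:
                merged.add_edge(remap[u], remap[v])
    return merged

# Two ego-relative successor graphs that share one waypoint near x = 5.
g1, g2 = nx.DiGraph(), nx.DiGraph()
g1.add_node(0, xy=(0.0, 0.0)); g1.add_node(1, xy=(5.0, 0.1)); g1.add_edge(0, 1)
g2.add_node(0, xy=(5.0, -0.2)); g2.add_node(1, xy=(10.0, 0.0)); g2.add_edge(0, 1)
print(sorted(aggregate_lane_graphs([g1, g2]).edges()))   # [(0, 1), (1, 2)]
```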
+ + + + Universal Instance Perception As Object Discovery and Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_Universal_Instance_Perception_As_Object_Discovery_and_Retrieval_CVPR_2023_paper.pdf + All instance perception tasks aim at finding certain objects specified by some queries such as category names, language expressions, and target annotations, but this complete field has been split into multiple independent subtasks. In this work, we present a universal instance perception model of the next generation, termed UNINEXT. UNINEXT reformulates diverse instance perception tasks into a unified object discovery and retrieval paradigm and can flexibly perceive different types of objects by simply changing the input prompts. This unified formulation brings the following benefits: (1) enormous data from different tasks and label vocabularies can be exploited for jointly training general instance-level representations, which is especially beneficial for tasks lacking in training data. (2) the unified model is parameter-efficient and can save redundant computation when handling multiple tasks simultaneously. UNINEXT shows superior performance on 20 challenging benchmarks from 10 instance-level tasks including classical image-level tasks (object detection and instance segmentation), vision-and-language tasks (referring expression comprehension and segmentation), and six video-level object tracking tasks. Code is available at https://github.com/MasterBin-IIAU/UNINEXT. + + + + Transferable Adversarial Attacks on Vision Transformers With Token Gradient Regularization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Transferable_Adversarial_Attacks_on_Vision_Transformers_With_Token_Gradient_Regularization_CVPR_2023_paper.pdf + Vision transformers (ViTs) have been successfully deployed in a variety of computer vision tasks, but they are still vulnerable to adversarial samples. Transfer-based attacks use a local model to generate adversarial samples and directly transfer them to attack a target black-box model. The high efficiency of transfer-based attacks makes it a severe security threat to ViT-based applications. Therefore, it is vital to design effective transfer-based attacks to identify the deficiencies of ViTs beforehand in security-sensitive scenarios. Existing efforts generally focus on regularizing the input gradients to stabilize the updated direction of adversarial samples. However, the variance of the back-propagated gradients in intermediate blocks of ViTs may still be large, which may make the generated adversarial samples focus on some model-specific features and get stuck in poor local optima. To overcome the shortcomings of existing approaches, we propose the Token Gradient Regularization (TGR) method. According to the structural characteristics of ViTs, TGR reduces the variance of the back-propagated gradient in each internal block of ViTs in a token-wise manner and utilizes the regularized gradient to generate adversarial samples. Extensive experiments on attacking both ViTs and CNNs confirm the superiority of our approach. Notably, compared to the state-of-the-art transfer-based attacks, our TGR offers a performance improvement of 8.8 % on average. 
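Token Gradient Regularization, described in the abstract above, regularizes the gradients flowing backward through each ViT block in a token-wise manner. The hook-based sketch below illustrates one plausible way to dampen outlier token gradients during adversarial-example generation; it is not the paper's exact update rule, and `model.blocks` assumes a timm-style ViT surrogate.

```python
import torch

def add_token_gradient_damping(vit_blocks, k=1.5):
    """Register backward hooks on ViT blocks that zero out tokens whose
    back-propagated gradient magnitude is an outlier. A rough illustration
    of token-wise gradient regularization, not the exact TGR rule."""
    def hook(module, grad_input, grad_output):
        g = grad_input[0]
        if g is None or g.dim() != 3:                 # expect [batch, tokens, dim]
            return None
        norms = g.norm(dim=-1, keepdim=True)          # per-token gradient magnitude
        thresh = norms.mean(dim=1, keepdim=True) + k * norms.std(dim=1, keepdim=True)
        g = torch.where(norms > thresh, torch.zeros_like(g), g)
        return (g,) + tuple(grad_input[1:])
    return [blk.register_full_backward_hook(hook) for blk in vit_blocks]

# e.g. for a timm-style ViT surrogate (assumption: blocks live in model.blocks):
#   handles = add_token_gradient_damping(model.blocks)
#   ...run the usual iterative FGSM/MI-FGSM attack loop, then remove the hooks:
#   for h in handles: h.remove()
```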
+ + + + MCF: Mutual Correction Framework for Semi-Supervised Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_MCF_Mutual_Correction_Framework_for_Semi-Supervised_Medical_Image_Segmentation_CVPR_2023_paper.pdf + Semi-supervised learning is a promising method for medical image segmentation under limited annotation. However, the model cognitive bias impairs the segmentation performance, especially for edge regions. Furthermore, current mainstream semi-supervised medical image segmentation (SSMIS) methods lack designs to handle model bias. The neural network has a strong learning ability, but the cognitive bias will gradually deepen during the training, and it is difficult to correct itself. We propose a novel mutual correction framework (MCF) to explore network bias correction and improve the performance of SSMIS. Inspired by the plain contrast idea, MCF introduces two different subnets to explore and utilize the discrepancies between subnets to correct cognitive bias of the model. More concretely, a contrastive difference review (CDR) module is proposed to find out inconsistent prediction regions and perform a review training. Additionally, a dynamic competitive pseudo-label generation (DCPLG) module is proposed to evaluate the performance of subnets in real-time, dynamically selecting more reliable pseudo-labels. Experimental results on two medical image databases with different modalities (CT and MRI) show that our method achieves superior performance compared to several state-of-the-art methods. The code will be available at https://github.com/WYC-321/MCF. + + + + Parametric Implicit Face Representation for Audio-Driven Facial Reenactment + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Parametric_Implicit_Face_Representation_for_Audio-Driven_Facial_Reenactment_CVPR_2023_paper.pdf + Audio-driven facial reenactment is a crucial technique that has a range of applications in film-making, virtual avatars and video conferences. Existing works either employ explicit intermediate face representations (e.g., 2D facial landmarks or 3D face models) or implicit ones (e.g., Neural Radiance Fields), thus suffering from the trade-offs between interpretability and expressive power, hence between controllability and quality of the results. In this work, we break these trade-offs with our novel parametric implicit face representation and propose a novel audio-driven facial reenactment framework that is both controllable and can generate high-quality talking heads. Specifically, our parametric implicit representation parameterizes the implicit representation with interpretable parameters of 3D face models, thereby taking the best of both explicit and implicit methods. In addition, we propose several new techniques to improve the three components of our framework, including i) incorporating contextual information into the audio-to-expression parameters encoding; ii) using conditional image synthesis to parameterize the implicit representation and implementing it with an innovative tri-plane structure for efficient learning; iii) formulating facial reenactment as a conditional image inpainting problem and proposing a novel data augmentation technique to improve model generalizability. Extensive experiments demonstrate that our method can generate more realistic results than previous methods with greater fidelity to the identities and talking styles of speakers. 
+ + + + VILA: Learning Image Aesthetics From User Comments With Vision-Language Pretraining + http://openaccess.thecvf.com//content/CVPR2023/papers/Ke_VILA_Learning_Image_Aesthetics_From_User_Comments_With_Vision-Language_Pretraining_CVPR_2023_paper.pdf + Assessing the aesthetics of an image is challenging, as it is influenced by multiple factors including composition, color, style, and high-level semantics. Existing image aesthetic assessment (IAA) methods primarily rely on human-labeled rating scores, which oversimplify the visual aesthetic information that humans perceive. Conversely, user comments offer more comprehensive information and are a more natural way to express human opinions and preferences regarding image aesthetics. In light of this, we propose learning image aesthetics from user comments, and exploring vision-language pretraining methods to learn multimodal aesthetic representations. Specifically, we pretrain an image-text encoder-decoder model with image-comment pairs, using contrastive and generative objectives to learn rich and generic aesthetic semantics without human labels. To efficiently adapt the pretrained model for downstream IAA tasks, we further propose a lightweight rank-based adapter that employs text as an anchor to learn the aesthetic ranking concept. Our results show that our pretrained aesthetic vision-language model outperforms prior works on image aesthetic captioning over the AVA-Captions dataset, and it has powerful zero-shot capability for aesthetic tasks such as zero-shot style classification and zero-shot IAA, surpassing many supervised baselines. With only minimal finetuning parameters using the proposed adapter module, our model achieves state-of-the-art IAA performance over the AVA dataset. + + + + Procedure-Aware Pretraining for Instructional Video Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Procedure-Aware_Pretraining_for_Instructional_Video_Understanding_CVPR_2023_paper.pdf + Our goal is to learn a video representation that is useful for downstream procedure understanding tasks in instructional videos. Due to the small amount of available annotations, a key challenge in procedure understanding is to be able to extract from unlabeled videos the procedural knowledge such as the identity of the task (e.g., 'make latte'), its steps (e.g., 'pour milk'), or the potential next steps given partial progress in its execution. Our main insight is that instructional videos depict sequences of steps that repeat between instances of the same or different tasks, and that this structure can be well represented by a Procedural Knowledge Graph (PKG), where nodes are discrete steps and edges connect steps that occur sequentially in the instructional activities. This graph can then be used to generate pseudo labels to train a video representation that encodes the procedural knowledge in a more accessible form to generalize to multiple procedure understanding tasks. We build a PKG by combining information from a text-based procedural knowledge database and an unlabeled instructional video corpus and then use it to generate training pseudo labels with four novel pre-training objectives. We call this PKG-based pre-training procedure and the resulting model Paprika, Procedure-Aware PRe-training for Instructional Knowledge Acquisition. We evaluate Paprika on COIN and CrossTask for procedure understanding tasks such as task recognition, step recognition, and step forecasting. 
Paprika yields a video representation that improves over the state of the art: up to 11.23% gains in accuracy in 12 evaluation settings. Implementation is available at https://github.com/salesforce/paprika. + + + + Fine-Grained Audible Video Description + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_Fine-Grained_Audible_Video_Description_CVPR_2023_paper.pdf + We explore a new task for audio-visual-language modeling called fine-grained audible video description (FAVD). It aims to provide detailed textual descriptions for the given audible videos, including the appearance and spatial locations of each object, the actions of moving objects, and the sounds in videos. Existing visual-language modeling tasks often concentrate on visual cues in videos while undervaluing the language and audio modalities. On the other hand, FAVD requires not only audio-visual-language modeling skills but also paragraph-level language generation abilities. We construct the first fine-grained audible video description benchmark (FAVDBench) to facilitate this research. For each video clip, we first provide a one-sentence summary of the video, ie, the caption, followed by 4-6 sentences describing the visual details and 1-2 audio-related descriptions at the end. The descriptions are provided in both English and Chinese. We create two new metrics for this task: an EntityScore to gauge the completeness of entities in the visual descriptions, and an AudioScore to assess the audio descriptions. As a preliminary approach to this task, we propose an audio-visual-language transformer that extends existing video captioning model with an additional audio branch. We combine the masked language modeling and auto-regressive language modeling losses to optimize our model so that it can produce paragraph-level descriptions. We illustrate the efficiency of our model in audio-visual-language modeling by evaluating it against the proposed benchmark using both conventional captioning metrics and our proposed metrics. We further put our benchmark to the test in video generation models, demonstrating that employing fine-grained video descriptions can create more intricate videos than using captions. Code and dataset are available at https://github.com/OpenNLPLab/FAVDBench. Our online benchmark is available at www.avlbench.opennlplab.cn. + + + + 3D Semantic Segmentation in the Wild: Learning Generalized Models for Adverse-Condition Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiao_3D_Semantic_Segmentation_in_the_Wild_Learning_Generalized_Models_for_CVPR_2023_paper.pdf + Robust point cloud parsing under all-weather conditions is crucial to level-5 autonomy in autonomous driving. However, how to learn a universal 3D semantic segmentation (3DSS) model is largely neglected as most existing benchmarks are dominated by point clouds captured under normal weather. We introduce SemanticSTF, an adverse-weather point cloud dataset that provides dense point-level annotations and allows to study 3DSS under various adverse weather conditions. We investigate universal 3DSS modeling with two tasks: 1) domain adaptive 3DSS that adapts from normal-weather data to adverse-weather data; 2) domain generalized 3DSS that learns a generalizable model from normal-weather data. Our studies reveal the challenge while existing 3DSS methods encounter adverse-weather data, showing the great value of SemanticSTF in steering the future endeavor along this very meaningful research direction. 
In addition, we design a domain randomization technique that alternatively randomizes the geometry styles of point clouds and aggregates their encoded embeddings, ultimately leading to a generalizable model that effectively improves 3DSS under various adverse weather. The SemanticSTF and related codes are available at https://github.com/xiaoaoran/SemanticSTF. + + + + RaBit: Parametric Modeling of 3D Biped Cartoon Characters With a Topological-Consistent Dataset + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_RaBit_Parametric_Modeling_of_3D_Biped_Cartoon_Characters_With_a_CVPR_2023_paper.pdf + Assisting people in efficiently producing visually plausible 3D characters has always been a fundamental research topic in computer vision and computer graphics. Recent learning-based approaches have achieved unprecedented accuracy and efficiency in the area of 3D real human digitization. However, none of the prior works focus on modeling 3D biped cartoon characters, which are also in great demand in gaming and filming. In this paper, we introduce 3DBiCar, the first large-scale dataset of 3D biped cartoon characters, and RaBit, the corresponding parametric model. Our dataset contains 1,500 topologically consistent high-quality 3D textured models which are manually crafted by professional artists. Built upon the data, RaBit is thus designed with a SMPL-like linear blend shape model and a StyleGAN-based neural UV-texture generator, simultaneously expressing the shape, pose, and texture. To demonstrate the practicality of 3DBiCar and RaBit, various applications are conducted, including single-view reconstruction, sketch-based modeling, and 3D cartoon animation. For the single-view reconstruction setting, we find a straightforward global mapping from input images to the output UV-based texture maps tends to lose detailed appearances of some local parts (e.g., nose, ears). Thus, a part-sensitive texture reasoner is adopted to make all important local areas perceived. Experiments further demonstrate the effectiveness of our method both qualitatively and quantitatively. 3DBiCar and RaBit are available at gaplab.cuhk.edu.cn/projects/RaBit. + + + + Uni3D: A Unified Baseline for Multi-Dataset 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Uni3D_A_Unified_Baseline_for_Multi-Dataset_3D_Object_Detection_CVPR_2023_paper.pdf + Current 3D object detection models follow a single dataset-specific training and testing paradigm, which often faces a serious detection accuracy drop when they are directly deployed in another dataset. In this paper, we study the task of training a unified 3D detector from multiple datasets. We observe that this appears to be a challenging task, which is mainly due to that these datasets present substantial data-level differences and taxonomy-level variations caused by different LiDAR types and data acquisition standards. Inspired by such observation, we present a Uni3D which leverages a simple data-level correction operation and a designed semantic-level coupling-and-recoupling module to alleviate the unavoidable data-level and taxonomy-level differences, respectively. Our method is simple and easily combined with many 3D object detection baselines such as PV-RCNN and Voxel-RCNN, enabling them to effectively learn from multiple off-the-shelf 3D datasets to obtain more discriminative and generalizable representations. Experiments are conducted on many dataset consolidation settings. 
Their results demonstrate that Uni3D exceeds a series of individual detectors trained on a single dataset, with a 1.04x parameter increase over a selected baseline detector. We expect this work will inspire the research of 3D generalization since it will push the limits of perceptual performance. Our code is available at: https://github.com/PJLab-ADG/3DTrans + + + + ACR: Attention Collaboration-Based Regressor for Arbitrary Two-Hand Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_ACR_Attention_Collaboration-Based_Regressor_for_Arbitrary_Two-Hand_Reconstruction_CVPR_2023_paper.pdf + Reconstructing two hands from monocular RGB images is challenging due to frequent occlusion and mutual confusion. Existing methods mainly learn an entangled representation to encode two interacting hands, which are incredibly fragile to impaired interaction, such as truncated hands, separate hands, or external occlusion. This paper presents ACR (Attention Collaboration-based Regressor), which makes the first attempt to reconstruct hands in arbitrary scenarios. To achieve this, ACR explicitly mitigates interdependencies between hands and between parts by leveraging center and part-based attention for feature extraction. However, reducing interdependence helps release the input constraint while weakening the mutual reasoning about reconstructing the interacting hands. Thus, based on center attention, ACR also learns cross-hand prior that handle the interacting hands better. We evaluate our method on various types of hand reconstruction datasets. Our method significantly outperforms the best interacting-hand approaches on the InterHand2.6M dataset while yielding comparable performance with the state-of-the-art single-hand methods on the FreiHand dataset. More qualitative results on in-the-wild and hand-object interaction datasets and web images/videos further demonstrate the effectiveness of our approach for arbitrary hand reconstruction. Our code is available at https://github.com/ZhengdiYu/Arbitrary-Hands-3D-Reconstruction + + + + Improving Table Structure Recognition With Visual-Alignment Sequential Coordinate Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Improving_Table_Structure_Recognition_With_Visual-Alignment_Sequential_Coordinate_Modeling_CVPR_2023_paper.pdf + Table structure recognition aims to extract the logical and physical structure of unstructured table images into a machine-readable format. The latest end-to-end image-to-text approaches simultaneously predict the two structures by two decoders, where the prediction of the physical structure (the bounding boxes of the cells) is based on the representation of the logical structure. However, as the logical representation lacks the local visual information, the previous methods often produce imprecise bounding boxes. To address this issue, we propose an end-to-end sequential modeling framework for table structure recognition called VAST. It contains a novel coordinate sequence decoder triggered by the representation of the non-empty cell from the logical structure decoder. In the coordinate sequence decoder, we model the bounding box coordinates as a language sequence, where the left, top, right and bottom coordinates are decoded sequentially to leverage the inter-coordinate dependency. Furthermore, we propose an auxiliary visual-alignment loss to enforce the logical representation of the non-empty cells to contain more local visual details, which helps produce better cell bounding boxes. 
Extensive experiments demonstrate that our proposed method can achieve state-of-the-art results in both logical and physical structure recognition. The ablation study also validates that the proposed coordinate sequence decoder and the visual-alignment loss are the keys to the success of our method. + + + + HumanGen: Generating Human Radiance Fields With Explicit Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_HumanGen_Generating_Human_Radiance_Fields_With_Explicit_Priors_CVPR_2023_paper.pdf + Recent years have witnessed the tremendous progress of 3D GANs for generating view-consistent radiance fields with photo-realism. Yet, high-quality generation of human radiance fields remains challenging, partially due to the limited human-related priors adopted in existing methods. We present HumanGen, a novel 3D human generation scheme with detailed geometry and 360deg realistic free-view rendering. It explicitly marries the 3D human generation with various priors from the 2D generator and 3D reconstructor of humans through the design of "anchor image". We introduce a hybrid feature representation using the anchor image to bridge the latent space of HumanGen with the existing 2D generator. We then adopt a pronged design to disentangle the generation of geometry and appearance. With the aid of the anchor image, we adapt a 3D reconstructor for fine-grained details synthesis and propose a two-stage blending scheme to boost appearance generation. Extensive experiments demonstrate our effectiveness for state-of-the-art 3D human generation regarding geometry details, texture quality, and free-view performance. Notably, HumanGen can also incorporate various off-the-shelf 2D latent editing methods, seamlessly lifting them into 3D. + + + + Local Connectivity-Based Density Estimation for Face Clustering + http://openaccess.thecvf.com//content/CVPR2023/papers/Shin_Local_Connectivity-Based_Density_Estimation_for_Face_Clustering_CVPR_2023_paper.pdf + Recent graph-based face clustering methods predict the connectivity of enormous edges, including false positive edges that link nodes with different classes. However, those false positive edges, which connect negative node pairs, have the risk of integration of different clusters when their connectivity is incorrectly estimated. This paper proposes a novel face clustering method to address this problem. The proposed clustering method employs density-based clustering, which maintains edges that have higher density. For this purpose, we propose a reliable density estimation algorithm based on local connectivity between K nearest neighbors (KNN). We effectively exclude negative pairs from the KNN graph based on the reliable density while maintaining sufficient positive pairs. Furthermore, we develop a pairwise connectivity estimation network to predict the connectivity of the selected edges. Experimental results demonstrate that the proposed clustering method significantly outperforms the state-of-the-art clustering methods on large-scale face clustering datasets and fashion image clustering datasets. Our code is available at https://github.com/illian01/LCE-PCENet + + + + Adaptive Zone-Aware Hierarchical Planner for Vision-Language Navigation + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_Adaptive_Zone-Aware_Hierarchical_Planner_for_Vision-Language_Navigation_CVPR_2023_paper.pdf + The task of Vision-Language Navigation (VLN) is for an embodied agent to reach the global goal according to the instruction. 
Essentially, during navigation, a series of sub-goals need to be adaptively set and achieved, which is naturally a hierarchical navigation process. However, previous methods leverage a single-step planning scheme, i.e., directly performing a navigation action at each step, which is unsuitable for such a hierarchical navigation process. In this paper, we propose an Adaptive Zone-aware Hierarchical Planner (AZHP) to explicitly divide the navigation process into two heterogeneous phases, i.e., sub-goal setting via zone partition/selection (high-level action) and sub-goal execution (low-level action), for hierarchical planning. Specifically, AZHP asynchronously performs two levels of action via the designed State-Switcher Module (SSM). For high-level action, we devise a Scene-aware adaptive Zone Partition (SZP) method to adaptively divide the whole navigation area into different zones on-the-fly. Then the Goal-oriented Zone Selection (GZS) method is proposed to select a proper zone for the current sub-goal. For low-level action, the agent conducts multiple navigation-decision steps in the selected zone. Moreover, we design a Hierarchical RL (HRL) strategy and auxiliary losses with curriculum learning to train the AZHP framework, which provides effective supervision signals for each stage. Extensive experiments demonstrate the superiority of our proposed method, which achieves state-of-the-art performance on three VLN benchmarks (REVERIE, SOON, R2R). + + + + Memory-Friendly Scalable Super-Resolution via Rewinding Lottery Ticket Hypothesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Memory-Friendly_Scalable_Super-Resolution_via_Rewinding_Lottery_Ticket_Hypothesis_CVPR_2023_paper.pdf + Scalable deep Super-Resolution (SR) models are increasingly in demand, whose memory can be customized and tuned to the computational resources of the platform. The existing dynamic scalable SR methods are not memory-friendly enough because multi-scale models have to be saved with a fixed size for each model. Inspired by the success of the Lottery Ticket Hypothesis (LTH) on image classification, we explore the existence of unstructured scalable SR deep models, that is, we find gradually shrinking sub-networks of extreme sparsity named winning tickets. In this paper, we propose a Memory-friendly Scalable SR framework (MSSR). The advantage is that only a single scalable model covers multiple SR models with different sizes, instead of reloading SR models of different sizes. Concretely, MSSR consists of the forward and backward stages, the former for model compression and the latter for model expansion. In the forward stage, we take advantage of LTH with rewinding weights to progressively shrink the SR model and obtain pruning-out masks that form nested sets. Moreover, stochastic self-distillation (SSD) is conducted to boost the performance of sub-networks. By stochastically selecting multiple depths, the current model inputs the selected features into the corresponding parts of the larger model and improves the performance of the current model based on the feedback results of the larger model. In the backward stage, the smaller SR model can be expanded by recovering and fine-tuning the pruned parameters according to the pruning-out masks obtained in the forward stage. Extensive experiments show the effectiveness of MSSR. The smallest-scale sub-network achieves a sparsity of 94% and outperforms the compared lightweight SR methods.
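The MSSR abstract above relies on the Lottery Ticket Hypothesis with weight rewinding to obtain nested sparse sub-networks. The snippet below sketches one round of the generic magnitude-prune-and-rewind recipe under simplifying assumptions (per-tensor pruning, masks stored by parameter name); it is not the MSSR code.

```python
import torch

def magnitude_prune_with_rewind(model, rewind_state, masks, prune_frac=0.2):
    """One round of lottery-ticket pruning with weight rewinding (generic
    sketch). `rewind_state` is a checkpoint saved early in training and
    `masks` maps parameter names to binary masks that only ever shrink,
    which is what makes the resulting sub-networks nested."""
    # 1) Among the weights that are still alive, prune the smallest fraction.
    for name, param in model.named_parameters():
        if name not in masks:
            continue
        alive = param.detach().abs()[masks[name].bool()]
        if alive.numel() == 0:
            continue
        threshold = torch.quantile(alive, prune_frac)
        masks[name] = masks[name] * (param.detach().abs() > threshold).float()
    # 2) Rewind the surviving weights to the early checkpoint.
    with torch.no_grad():
        for name, param in model.named_parameters():
            if name in masks:
                param.copy_(rewind_state[name] * masks[name])
    return masks

# Usage sketch: after a few warm-up epochs, save
#   rewind_state = {k: v.clone() for k, v in model.state_dict().items()}
# train to convergence, call magnitude_prune_with_rewind, re-apply the masks
# after every optimizer step while retraining, and repeat for higher sparsity.
```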
+ + + + Therbligs in Action: Video Understanding Through Motion Primitives + http://openaccess.thecvf.com//content/CVPR2023/papers/Dessalene_Therbligs_in_Action_Video_Understanding_Through_Motion_Primitives_CVPR_2023_paper.pdf + In this paper we introduce a rule-based, compositional, and hierarchical modeling of action using Therbligs as our atoms. Introducing these atoms provides us with a consistent, expressive, contact-centered representation of action. Over the atoms we introduce a differentiable method of rule-based reasoning to regularize for logical consistency. Our approach is complementary to other approaches in that the Therblig-based representations produced by our architecture augment rather than replace existing architectures' representations. We release the first Therblig-centered annotations over two popular video datasets - EPIC Kitchens 100 and 50-Salads. We also broadly demonstrate the benefits of adopting Therblig representations through evaluation on the following tasks: action segmentation, action anticipation, and action recognition - observing an average 10.5%/7.53%/6.5% relative improvement, respectively, over EPIC Kitchens and an average 8.9%/6.63%/4.8% relative improvement, respectively, over 50-Salads. Code and data will be made publicly available. + + + + SadTalker: Learning Realistic 3D Motion Coefficients for Stylized Audio-Driven Single Image Talking Face Animation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_SadTalker_Learning_Realistic_3D_Motion_Coefficients_for_Stylized_Audio-Driven_Single_CVPR_2023_paper.pdf + Generating talking head videos from a face image and a piece of speech audio still involves many challenges, i.e., unnatural head movement, distorted expression, and identity modification. We argue that these issues are mainly caused by learning from the coupled 2D motion fields. On the other hand, explicitly using 3D information also suffers from problems of stiff expression and incoherent video. We present SadTalker, which generates 3D motion coefficients (head pose, expression) of the 3DMM from audio and implicitly modulates a novel 3D-aware face render for talking head generation. To learn the realistic motion coefficients, we explicitly model the connections between audio and different types of motion coefficients individually. Precisely, we present ExpNet to learn the accurate facial expression from audio by distilling both coefficients and 3D-rendered faces. As for the head pose, we design PoseVAE via a conditional VAE to synthesize head motion in different styles. Finally, the generated 3D motion coefficients are mapped to the unsupervised 3D keypoints space of the proposed face render to synthesize the final video. We conducted extensive experiments to show the superiority of our method in terms of motion and video quality. + + + + HAAV: Hierarchical Aggregation of Augmented Views for Image Captioning + http://openaccess.thecvf.com//content/CVPR2023/papers/Kuo_HAAV_Hierarchical_Aggregation_of_Augmented_Views_for_Image_Captioning_CVPR_2023_paper.pdf + A great deal of progress has been made in image captioning, driven by research into how to encode the image using pre-trained models. This includes visual encodings (e.g. image grid features or detected objects) and more recently textual encodings (e.g. image tags or text descriptions of image regions). As more advanced encodings are available and incorporated, it is natural to ask: how to efficiently and effectively leverage the heterogeneous set of encodings?
In this paper, we propose to regard the encodings as augmented views of the input image. The image captioning model encodes each view independently with a shared encoder efficiently, and a contrastive loss is incorporated across the encoded views in a novel way to improve their representation quality and the model's data efficiency. Our proposed hierarchical decoder then adaptively weighs the encoded views according to their effectiveness for caption generation by first aggregating within each view at the token level, and then across views at the view level. We demonstrate significant performance improvements of +5.6% CIDEr on MS-COCO and +12.9% CIDEr on Flickr30k compared to the state of the art. + + + + Learning Sample Relationship for Exposure Correction + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Learning_Sample_Relationship_for_Exposure_Correction_CVPR_2023_paper.pdf + The exposure correction task aims to correct both underexposed and overexposed images to normal exposure within a single network. As is well recognized, the optimization flows for the two cases are opposite. Despite the great advancement, existing exposure correction methods are usually trained with mini-batches that mix underexposed and overexposed samples, and have not explored the relationship between them to solve the optimization inconsistency. In this paper, we introduce a new perspective that couples their optimization processes by correlating and constraining the relationship of the correction procedures within a mini-batch. The core design of our framework consists of two steps: 1) formulating the exposure relationship of samples across the batch dimension via a context-irrelevant pretext task; 2) delivering the above sample relationship design as a regularization term within the loss function to promote optimization consistency. As a general term, the proposed sample relationship design can be easily integrated into existing exposure correction methods without any computational burden at inference time. Extensive experiments over multiple representative exposure correction benchmarks demonstrate consistent performance gains by introducing our sample relationship design. + + + + TRACE: 5D Temporal Regression of Avatars With Dynamic Cameras in 3D Environments + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_TRACE_5D_Temporal_Regression_of_Avatars_With_Dynamic_Cameras_in_CVPR_2023_paper.pdf + Although the estimation of 3D human pose and shape (HPS) is rapidly progressing, current methods still cannot reliably estimate moving humans in global coordinates, which is critical for many applications. This is particularly challenging when the camera is also moving, entangling human and camera motion. To address these issues, we adopt a novel 5D representation (space, time, and identity) that enables end-to-end reasoning about people in scenes. Our method, called TRACE, introduces several novel architectural components. Most importantly, it uses two new "maps" to reason about the 3D trajectory of people over time in camera and world coordinates. An additional memory unit enables persistent tracking of people even during long occlusions. TRACE is the first one-stage method to jointly recover and track 3D humans in global coordinates from dynamic cameras. By training it end-to-end, and using full image information, TRACE achieves state-of-the-art performance on tracking and HPS benchmarks. The code and dataset are released for research purposes.
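The HAAV entry above incorporates a contrastive loss across the encoded views of an image. The generic InfoNCE-style sketch below illustrates that kind of cross-view objective; it is only a loose illustration under assumed choices (l2-normalized features, a temperature of 0.07) and is not the paper's actual formulation.

```python
# Generic cross-view contrastive (InfoNCE-style) loss sketch: encodings of the
# same image under two views are pulled together, other images in the batch act
# as negatives. Random features stand in for real encoder outputs.
import numpy as np

def info_nce(view_a, view_b, temperature=0.07):
    """view_a, view_b: (batch, dim) encodings of two views of the same images."""
    a = view_a / np.linalg.norm(view_a, axis=1, keepdims=True)
    b = view_b / np.linalg.norm(view_b, axis=1, keepdims=True)
    logits = a @ b.T / temperature                 # scaled cosine similarities
    labels = np.arange(len(a))                     # the matching view is the positive
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_prob[labels, labels].mean()        # cross-entropy on the diagonal

rng = np.random.default_rng(0)
loss = info_nce(rng.normal(size=(8, 128)), rng.normal(size=(8, 128)))
```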
+ + + + End-to-End 3D Dense Captioning With Vote2Cap-DETR + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_End-to-End_3D_Dense_Captioning_With_Vote2Cap-DETR_CVPR_2023_paper.pdf + 3D dense captioning aims to generate multiple captions localized with their associated object regions. Existing methods follow a sophisticated "detect-then-describe" pipeline equipped with numerous hand-crafted components. However, these hand-crafted components would yield suboptimal performance given cluttered object spatial and class distributions among different scenes. In this paper, we propose a simple-yet-effective transformer framework Vote2Cap-DETR based on recent popular DEtection TRansformer (DETR). Compared with prior arts, our framework has several appealing advantages: 1) Without resorting to numerous hand-crafted components, our method is based on a full transformer encoder-decoder architecture with a learnable vote query driven object decoder, and a caption decoder that produces the dense captions in a set-prediction manner. 2) In contrast to the two-stage scheme, our method can perform detection and captioning in one-stage. 3) Without bells and whistles, extensive experiments on two commonly used datasets, ScanRefer and Nr3D, demonstrate that our Vote2Cap-DETR surpasses current state-of-the-arts by 11.13% and 7.11% in CIDEr@0.5IoU, respectively. Codes will be released soon. + + + + Learned Two-Plane Perspective Prior Based Image Resampling for Efficient Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Ghosh_Learned_Two-Plane_Perspective_Prior_Based_Image_Resampling_for_Efficient_Object_CVPR_2023_paper.pdf + Real-time efficient perception is critical for autonomous navigation and city scale sensing. Orthogonal to architectural improvements, streaming perception approaches have exploited adaptive sampling improving real-time detection performance. In this work, we propose a learnable geometry-guided prior that incorporates rough geometry of the 3D scene (a ground plane and a plane above) to resample images for efficient object detection. This significantly improves small and far-away object detection performance while also being more efficient both in terms of latency and memory. For autonomous navigation, using the same detector and scale, our approach improves detection rate by +4.1 AP_S or +39% and in real-time performance by +5.3 sAP_S or +63% for small objects over state-of-the-art (SOTA). For fixed traffic cameras, our approach detects small objects at image scales other methods cannot. At the same scale, our approach improves detection of small objects by 195% (+12.5 AP_S) over naive-downsampling and 63% (+4.2 AP_S) over SOTA. + + + + Tell Me What Happened: Unifying Text-Guided Video Completion via Multimodal Masked Video Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Fu_Tell_Me_What_Happened_Unifying_Text-Guided_Video_Completion_via_Multimodal_CVPR_2023_paper.pdf + Generating a video given the first several static frames is challenging as it anticipates reasonable future frames with temporal coherence. Besides video prediction, the ability to rewind from the last frame or infilling between the head and tail is also crucial, but they have rarely been explored for video completion. Since there could be different outcomes from the hints of just a few frames, a system that can follow natural language to perform video completion may significantly improve controllability. 
Inspired by this, we introduce a novel task, text-guided video completion (TVC), which requests the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address this TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them to perform video completion from any time point. At inference time, a single MMVG model can address all 3 cases of TVC, including video prediction, rewind, and infilling, by applying corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC. + + + + Tracking Through Containers and Occluders in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Van_Hoorick_Tracking_Through_Containers_and_Occluders_in_the_Wild_CVPR_2023_paper.pdf + Tracking objects with persistence in cluttered and dynamic environments remains a difficult challenge for computer vision systems. In this paper, we introduce TCOW, a new benchmark and model for visual tracking through heavy occlusion and containment. We set up a task where the goal is to, given a video sequence, segment both the projected extent of the target object, as well as the surrounding container or occluder whenever one exists. To study this task, we create a mixture of synthetic and annotated real datasets to support both supervised learning and structured evaluation of model performance under various forms of task variation, such as moving or nested containment. We evaluate two recent transformer-based video models and find that while they can be surprisingly capable of tracking targets under certain settings of task variation, there remains a considerable performance gap before we can claim a tracking model to have acquired a true notion of object permanence. + + + + Decompose, Adjust, Compose: Effective Normalization by Playing With Frequency for Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Decompose_Adjust_Compose_Effective_Normalization_by_Playing_With_Frequency_for_CVPR_2023_paper.pdf + Domain generalization (DG) is a principal task to evaluate the robustness of computer vision models. Many previous studies have used normalization for DG. In normalization, statistics and normalized features are regarded as style and content, respectively. However, it has a content variation problem when removing style because the boundary between content and style is unclear. This study addresses this problem from the frequency domain perspective, where amplitude and phase are considered as style and content, respectively. First, we verify the quantitative phase variation of normalization through the mathematical derivation of the Fourier transform formula. Then, based on this, we propose a novel normalization method, PCNorm, which eliminates style only as the preserving content through spectral decomposition. Furthermore, we propose advanced PCNorm variants, CCNorm and SCNorm, which adjust the degrees of variations in content and style, respectively. Thus, they can learn domain-agnostic representations for DG. With the normalization methods, we propose ResNet-variant models, DAC-P and DAC-SC, which are robust to the domain gap. The proposed models outperform other recent DG methods. 
The DAC-SC achieves an average state-of-the-art performance of 65.6% on five datasets: PACS, VLCS, Office-Home, DomainNet, and TerraIncognita. + + + + Novel Class Discovery for 3D Point Cloud Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Riz_Novel_Class_Discovery_for_3D_Point_Cloud_Semantic_Segmentation_CVPR_2023_paper.pdf + Novel class discovery (NCD) for semantic segmentation is the task of learning a model that can segment unlabelled (novel) classes using only the supervision from labelled (base) classes. This problem has recently been pioneered for 2D image data, but no work exists for 3D point cloud data. In fact, the assumptions made for 2D are loosely applicable to 3D in this case. This paper is presented to advance the state of the art on point cloud data analysis in four directions. Firstly, we address the new problem of NCD for point cloud semantic segmentation. Secondly, we show that the transposition of the only existing NCD method for 2D semantic segmentation to 3D data is suboptimal. Thirdly, we present a new method for NCD based on online clustering that exploits uncertainty quantification to produce prototypes for pseudo-labelling the points of the novel classes. Lastly, we introduce a new evaluation protocol to assess the performance of NCD for point cloud semantic segmentation. We thoroughly evaluate our method on SemanticKITTI and SemanticPOSS datasets, showing that it can significantly outperform the baseline. Project page: https://github.com/LuigiRiz/NOPS. + + + + Learning 3D-Aware Image Synthesis With Unknown Pose Distribution + http://openaccess.thecvf.com//content/CVPR2023/papers/Shi_Learning_3D-Aware_Image_Synthesis_With_Unknown_Pose_Distribution_CVPR_2023_paper.pdf + Existing methods for 3D-aware image synthesis largely depend on the 3D pose distribution pre-estimated on the training set. An inaccurate estimation may mislead the model into learning faulty geometry. This work proposes PoF3D that frees generative radiance fields from the requirements of 3D pose priors. We first equip the generator with an efficient pose learner, which is able to infer a pose from a latent code, to approximate the underlying true pose distribution automatically. We then assign the discriminator a task to learn pose distribution under the supervision of the generator and to differentiate real and synthesized images with the predicted pose as the condition. The pose-free generator and the pose-aware discriminator are jointly trained in an adversarial manner. Extensive results on a couple of datasets confirm that the performance of our approach, regarding both image quality and geometry quality, is on par with state of the art. To our best knowledge, PoF3D demonstrates the feasibility of learning high-quality 3D-aware image synthesis without using 3D pose priors for the first time. Project page can be found at https://vivianszf.github.io/pof3d/. + + + + Train-Once-for-All Personalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Train-Once-for-All_Personalization_CVPR_2023_paper.pdf + We study the problem of how to train a "personalization-friendly" model such that given only the task descriptions, the model can be adapted to different end-users' needs, e.g., for accurately classifying different subsets of objects. One baseline approach is to train a "generic" model for classifying a wide range of objects, followed by class selection. 
In our experiments, we however found it suboptimal, perhaps because the model's weights are kept frozen without being personalized. To address this drawback, we propose Train-once-for-All PERsonalization (TAPER), a framework that is trained just once and can later customize a model for different end-users given their task descriptions. TAPER learns a set of "basis" models and a mixer predictor, such that given the task description, the weights (not the predictions!) of the basis models can be on the fly combined into a single "personalized" model. Via extensive experiments on multiple recognition tasks, we show that TAPER consistently outperforms the baseline methods in achieving a higher personalized accuracy. Moreover, we show that TAPER can synthesize a much smaller model to achieve comparable performance to a huge generic model, making it "deployment-friendly" to resource-limited end devices. Interestingly, even without end-users' task descriptions, TAPER can still be specialized to the deployed context based on its past predictions, making it even more "personalization-friendly". + + + + DIFu: Depth-Guided Implicit Function for Clothed Human Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_DIFu_Depth-Guided_Implicit_Function_for_Clothed_Human_Reconstruction_CVPR_2023_paper.pdf + Recently, implicit function (IF)-based methods for clothed human reconstruction using a single image have received a lot of attention. Most existing methods rely on a 3D embedding branch using volume such as the skinned multi-person linear (SMPL) model, to compensate for the lack of information in a single image. Beyond the SMPL, which provides skinned parametric human 3D information, in this paper, we propose a new IF-based method, DIFu, that utilizes a projected depth prior containing textured and non-parametric human 3D information. In particular, DIFu consists of a generator, an occupancy prediction network, and a texture prediction network. The generator takes an RGB image of the human front-side as input, and hallucinates the human back-side image. After that, depth maps for front/back images are estimated and projected into 3D volume space. Finally, the occupancy prediction network extracts a pixel-aligned feature and a voxel-aligned feature through a 2D encoder and a 3D encoder, respectively, and estimates occupancy using these features. Note that voxel-aligned features are obtained from the projected depth maps, thus it can contain detailed 3D information such as hair and cloths. Also, colors of each 3D point are also estimated with the texture inference branch. The effectiveness of DIFu is demonstrated by comparing to recent IF-based models quantitatively and qualitatively. + + + + Bi-LRFusion: Bi-Directional LiDAR-Radar Fusion for 3D Dynamic Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Bi-LRFusion_Bi-Directional_LiDAR-Radar_Fusion_for_3D_Dynamic_Object_Detection_CVPR_2023_paper.pdf + LiDAR and Radar are two complementary sensing approaches in that LiDAR specializes in capturing an object's 3D shape while Radar provides longer detection ranges as well as velocity hints. Though seemingly natural, how to efficiently combine them for improved feature representation is still unclear. The main challenge arises from that Radar data are extremely sparse and lack height information. Therefore, directly integrating Radar features into LiDAR-centric detection networks is not optimal. 
In this work, we introduce a bi-directional LiDAR-Radar fusion framework, termed Bi-LRFusion, to tackle the challenges and improve 3D detection for dynamic objects. Technically, Bi-LRFusion involves two steps: first, it enriches Radar's local features by learning important details from the LiDAR branch to alleviate the problems caused by the absence of height information and extreme sparsity; second, it combines LiDAR features with the enhanced Radar features in a unified bird's-eye-view representation. We conduct extensive experiments on nuScenes and ORR datasets, and show that our Bi-LRFusion achieves state-of-the-art performance for detecting dynamic objects. Notably, Radar data in these two datasets have different formats, which demonstrates the generalizability of our method. Codes will be published. + + + + LOCATE: Localize and Transfer Object Parts for Weakly Supervised Affordance Grounding + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_LOCATE_Localize_and_Transfer_Object_Parts_for_Weakly_Supervised_Affordance_CVPR_2023_paper.pdf + Humans excel at acquiring knowledge through observation. For example, we can learn to use new tools by watching demonstrations. This skill is fundamental for intelligent systems to interact with the world. A key step to acquire this skill is to identify what part of the object affords each action, which is called affordance grounding. In this paper, we address this problem and propose a framework called LOCATE that can identify matching object parts across images, to transfer knowledge from images where an object is being used (exocentric images used for learning), to images where the object is inactive (egocentric ones used to test). To this end, we first find interaction areas and extract their feature embeddings. Then we learn to aggregate the embeddings into compact prototypes (human, object part, and background), and select the one representing the object part. Finally, we use the selected prototype to guide affordance grounding. We do this in a weakly supervised manner, learning only from image-level affordance and object labels. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods by a large margin on both seen and unseen objects. + + + + TokenHPE: Learning Orientation Tokens for Efficient Head Pose Estimation via Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_TokenHPE_Learning_Orientation_Tokens_for_Efficient_Head_Pose_Estimation_via_CVPR_2023_paper.pdf + Head pose estimation (HPE) has been widely used in the fields of human machine interaction, self-driving, and attention estimation. However, existing methods cannot deal with extreme head pose randomness and serious occlusions. To address these challenges, we identify three cues from head images, namely, neighborhood similarities, significant facial changes, and critical minority relationships. To leverage the observed findings, we propose a novel critical minority relationship-aware method based on the Transformer architecture in which the facial part relationships can be learned. Specifically, we design several orientation tokens to explicitly encode the basic orientation regions. Meanwhile, a novel token guide multi-loss function is designed to guide the orientation tokens as they learn the desired regional similarities and relationships. We evaluate the proposed method on three challenging benchmark HPE datasets. Experiments show that our method achieves better performance compared with state-of-the-art methods. 
Our code is publicly available at https://github.com/zc2023/TokenHPE. + + + + BioNet: A Biologically-Inspired Network for Face Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_BioNet_A_Biologically-Inspired_Network_for_Face_Recognition_CVPR_2023_paper.pdf + Recently, whether and how cutting-edge Neuroscience findings can inspire Artificial Intelligence (AI) has confused both communities and drawn much discussion. As one of the most critical fields in AI, Computer Vision (CV) also pays much attention to the discussion. To contribute our ideas and experimental evidence to the discussion, we focus on one of the most broadly researched topics in both the Neuroscience and CV fields, i.e., Face Recognition (FR). Neuroscience studies show that face attributes are essential to the human face-recognizing system, and how the attributes contribute has also been explained by the Neuroscience community. Even though a few CV works improved FR performance with attribute enhancement, none of them is inspired by the human face-recognizing mechanism, nor do they boost performance significantly. To test our idea experimentally, we purposely model the biological characteristics of the human face-recognizing system with classical Convolutional Neural Network Operators (CNN Ops). We name the proposed Biologically-inspired Network BioNet. Our BioNet consists of two cascaded sub-networks, i.e., the Visual Cortex Network (VCN) and the Inferotemporal Cortex Network (ICN). The VCN is modeled with a classical CNN backbone. The proposed ICN comprises three biologically-inspired modules, i.e., the Cortex Functional Compartmentalization, the Compartment Response Transform, and the Response Intensity Modulation. The experiments prove that: 1) The cutting-edge findings about the human face-recognizing system can further boost the CNN-based FR network. 2) With the biological mechanism, both identity-related attributes (e.g., gender) and identity-unrelated attributes (e.g., expression) can benefit the deep FR models. Surprisingly, the identity-unrelated ones contribute even more than the identity-related ones. 3) The proposed BioNet significantly boosts the state of the art on standard FR benchmark datasets. For example, BioNet boosts IJB-B@1e-6 from 52.12% to 68.28% and MegaFace from 98.74% to 99.19%. The source code will be released. + + + + Scaling Up GANs for Text-to-Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Kang_Scaling_Up_GANs_for_Text-to-Image_Synthesis_CVPR_2023_paper.pdf + The recent success of text-to-image synthesis has taken the world by storm and captured the general public's imagination. From a technical standpoint, it also marked a drastic change in the favored architecture to design generative image models. GANs used to be the de facto choice, with techniques like StyleGAN. With DALL-E 2, auto-regressive and diffusion models became the new standard for large-scale generative models overnight. This rapid shift raises a fundamental question: can we scale up GANs to benefit from large datasets like LAION? We find that naively increasing the capacity of the StyleGAN architecture quickly becomes unstable. We introduce GigaGAN, a new GAN architecture that far exceeds this limit, demonstrating GANs as a viable option for text-to-image synthesis. GigaGAN offers three major advantages. First, it is orders of magnitude faster at inference time, taking only 0.13 seconds to synthesize a 512px image. Second, it can synthesize high-resolution images, for example, 16-megapixel images in 3.66 seconds.
Finally, GigaGAN supports various latent space editing applications such as latent interpolation, style mixing, and vector arithmetic operations. + + + + DepGraph: Towards Any Structural Pruning + http://openaccess.thecvf.com//content/CVPR2023/papers/Fang_DepGraph_Towards_Any_Structural_Pruning_CVPR_2023_paper.pdf + Structural pruning enables model acceleration by removing structurally-grouped parameters from neural networks. However, the parameter-grouping patterns vary widely across different models, making architecture-specific pruners, which rely on manually-designed grouping schemes, non-generalizable to new architectures. In this work, we study a highly-challenging yet barely-explored task, any structural pruning, to tackle general structural pruning of arbitrary architecture like CNNs, RNNs, GNNs and Transformers. The most prominent obstacle towards this goal lies in the structural coupling, which not only forces different layers to be pruned simultaneously, but also expects all removed parameters to be consistently unimportant, thereby avoiding structural issues and significant performance degradation after pruning. To address this problem, we propose a general and fully automatic method, Dependency Graph (DepGraph), to explicitly model the dependency between layers and comprehensively group coupled parameters for pruning. In this work, we extensively evaluate our method on several architectures and tasks, including ResNe(X)t, DenseNet, MobileNet and Vision transformer for images, GAT for graph, DGCNN for 3D point cloud, alongside LSTM for language, and demonstrate that, even with a simple norm-based criterion, the proposed method consistently yields gratifying performances. + + + + Exploring Discontinuity for Video Frame Interpolation + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Exploring_Discontinuity_for_Video_Frame_Interpolation_CVPR_2023_paper.pdf + Video frame interpolation (VFI) is the task that synthesizes the intermediate frame given two consecutive frames. Most of the previous studies have focused on appropriate frame warping operations and refinement modules for the warped frames. These studies have been conducted on natural videos containing only continuous motions. However, many practical videos contain various unnatural objects with discontinuous motions such as logos, user interfaces and subtitles. We propose three techniques that can make the existing deep learning-based VFI architectures robust to these elements. First is a novel data augmentation strategy called figure-text mixing (FTM) which can make the models learn discontinuous motions during training stage without any extra dataset. Second, we propose a simple but effective module that predicts a map called discontinuity map (D-map), which densely distinguishes between areas of continuous and discontinuous motions. Lastly, we propose loss functions to give supervisions of the discontinuous motion areas which can be applied along with FTM and D-map. We additionally collect a special test benchmark called Graphical Discontinuous Motion (GDM) dataset consisting of some mobile games and chatting videos. Applied to the various state-of-the-art VFI networks, our method significantly improves the interpolation qualities on the videos from not only GDM dataset, but also the existing benchmarks containing only continuous motions such as Vimeo90K, UCF101, and DAVIS. 
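The DepGraph entry above hinges on the observation that structurally coupled parameters must be pruned together. The toy sketch below shows the simplest form of this dependency for a plain chain of layers; the name-based group description is purely hypothetical, and a real dependency graph must also handle skip connections, concatenations, and normalization layers.

```python
# Toy illustration (not the DepGraph implementation) of coupled parameter groups:
# removing output channel c of one layer in a chain forces removing input channel c
# of the next layer, so the two belong to the same pruning group.
def coupled_groups(layer_widths):
    """layer_widths: output widths of a simple chain of layers, e.g. [64, 128, 256]."""
    groups = []
    for i in range(len(layer_widths) - 1):
        for c in range(layer_widths[i]):
            # Each group couples an output channel with the matching input channel
            # of the following layer.
            groups.append({(f"layer{i}", "out", c), (f"layer{i + 1}", "in", c)})
    return groups

groups = coupled_groups([4, 4, 2])
print(len(groups), "coupled parameter groups, e.g.", sorted(groups[0]))
```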
+ + + + DynamicStereo: Consistent Dynamic Depth From Stereo Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Karaev_DynamicStereo_Consistent_Dynamic_Depth_From_Stereo_Videos_CVPR_2023_paper.pdf + We consider the problem of reconstructing a dynamic scene observed from a stereo camera. Most existing methods for depth from stereo treat different stereo frames independently, leading to temporally inconsistent depth predictions. Temporal consistency is especially important for immersive AR or VR scenarios, where flickering greatly diminishes the user experience. We propose DynamicStereo, a novel transformer-based architecture to estimate disparity for stereo videos. The network learns to pool information from neighboring frames to improve the temporal consistency of its predictions. Our architecture is designed to process stereo videos efficiently through divided attention layers. We also introduce Dynamic Replica, a new benchmark dataset containing synthetic videos of people and animals in scanned environments, which provides complementary training and evaluation data for dynamic stereo closer to real applications than existing datasets. Training with this dataset further improves the quality of predictions of our proposed DynamicStereo as well as prior methods. Finally, it acts as a benchmark for consistent stereo methods. + + + + Vid2Avatar: 3D Avatar Reconstruction From Videos in the Wild via Self-Supervised Scene Decomposition + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Vid2Avatar_3D_Avatar_Reconstruction_From_Videos_in_the_Wild_via_CVPR_2023_paper.pdf + We present Vid2Avatar, a method to learn human avatars from monocular in-the-wild videos. Reconstructing humans that move naturally from monocular in-the-wild videos is difficult. Solving it requires accurately separating humans from arbitrary backgrounds. Moreover, it requires reconstructing detailed 3D surface from short video sequences, making it even more challenging. Despite these challenges, our method does not require any groundtruth supervision or priors extracted from large datasets of clothed human scans, nor do we rely on any external segmentation modules. Instead, it solves the tasks of scene decomposition and surface reconstruction directly in 3D by modeling both the human and the background in the scene jointly, parameterized via two separate neural fields. Specifically, we define a temporally consistent human representation in canonical space and formulate a global optimization over the background model, the canonical human shape and texture, and per-frame human pose parameters. A coarse-to-fine sampling strategy for volume rendering and novel objectives are introduced for a clean separation of dynamic human and static background, yielding detailed and robust 3D human reconstructions. The evaluation of our method shows improvements over prior art on publicly available datasets. + + + + Task Residual for Tuning Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Task_Residual_for_Tuning_Vision-Language_Models_CVPR_2023_paper.pdf + Large-scale vision-language models (VLMs) pre-trained on billion-level data have learned general visual representations and broad visual concepts. In principle, the well-learned knowledge structure of the VLMs should be inherited appropriately when being transferred to downstream tasks with limited data. 
However, most existing efficient transfer learning (ETL) approaches for VLMs either damage or are excessively biased towards the prior knowledge, e.g., prompt tuning (PT) discards the pre-trained text-based classifier and builds a new one, while adapter-style tuning (AT) fully relies on the pre-trained features. To address this, we propose a new efficient tuning approach for VLMs named Task Residual Tuning (TaskRes), which operates directly on the text-based classifier and explicitly decouples the prior knowledge of the pre-trained models from the new knowledge regarding a target task. Specifically, TaskRes keeps the original classifier weights from the VLMs frozen and obtains a new classifier for the target task by tuning a set of prior-independent parameters as a residual to the original one, which enables reliable prior knowledge preservation and flexible task-specific knowledge exploration. The proposed TaskRes is simple yet effective: it significantly outperforms previous ETL methods (e.g., PT and AT) on 11 benchmark datasets while requiring minimal implementation effort. Our code is available at https://github.com/geekyutao/TaskRes. + + + + Hierarchical Prompt Learning for Multi-Task Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Hierarchical_Prompt_Learning_for_Multi-Task_Learning_CVPR_2023_paper.pdf + Vision-language models (VLMs) can effectively transfer to various vision tasks via prompt learning. Real-world scenarios often require adapting a model to multiple similar yet distinct tasks. Existing methods focus on learning a specific prompt for each task, limiting the ability to exploit potentially shared information from other tasks. Naively training a task-shared prompt using a combination of all tasks ignores fine-grained task correlations, and significant discrepancies across tasks could cause negative transfer. Considering this, we present Hierarchical Prompt (HiPro) learning, a simple and effective method for jointly adapting a pre-trained VLM to multiple downstream tasks. Our method quantifies inter-task affinity and subsequently constructs a hierarchical task tree. Task-shared prompts learned by internal nodes explore the information within the corresponding task group, while task-individual prompts learned by leaf nodes obtain fine-grained information targeted at each task. The combination of hierarchical prompts provides high-quality content of different granularity. We evaluate HiPro on four multi-task learning datasets. The results demonstrate the effectiveness of our method. + + + + RIFormer: Keep Your Vision Backbone Effective but Removing Token Mixer + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_RIFormer_Keep_Your_Vision_Backbone_Effective_but_Removing_Token_Mixer_CVPR_2023_paper.pdf + This paper studies how to keep a vision backbone effective while removing token mixers in its basic building blocks. Token mixers, such as self-attention in vision transformers (ViTs), are intended to perform information communication between different spatial tokens, but suffer from considerable computational cost and latency. However, directly removing them leads to an incomplete model structure prior and thus brings a significant accuracy drop. To this end, we first develop a RepIdentityFormer based on the re-parameterizing idea to study the token-mixer-free model architecture.
We then explore an improved learning paradigm to break the limitations of the simple token-mixer-free backbone, and summarize the empirical practice into 5 guidelines. Equipped with the proposed optimization strategy, we are able to build an extremely simple vision backbone with encouraging performance, while enjoying high efficiency during inference. Extensive experiments and ablative analysis also demonstrate that the inductive bias of a network architecture can be incorporated into a simple network structure with an appropriate optimization strategy. We hope this work can serve as a starting point for the exploration of optimization-driven efficient network design. + + + + Context-Based Trit-Plane Coding for Progressive Image Compression + http://openaccess.thecvf.com//content/CVPR2023/papers/Jeon_Context-Based_Trit-Plane_Coding_for_Progressive_Image_Compression_CVPR_2023_paper.pdf + Trit-plane coding enables deep progressive image compression, but it cannot use autoregressive context models. In this paper, we propose the context-based trit-plane coding (CTC) algorithm to achieve progressive compression more compactly. First, we develop the context-based rate reduction module to estimate trit probabilities of latent elements accurately and thus encode the trit-planes compactly. Second, we develop the context-based distortion reduction module to refine partial latent tensors from the trit-planes and improve the reconstructed image quality. Third, we propose a retraining scheme for the decoder to attain better rate-distortion tradeoffs. Extensive experiments show that CTC outperforms the baseline trit-plane codec significantly, e.g. by -14.84% in BD-rate on the Kodak lossless dataset, while increasing the time complexity only marginally. The source codes are available at https://github.com/seungminjeon-github/CTC. + + + + Recurrent Vision Transformers for Object Detection With Event Cameras + http://openaccess.thecvf.com//content/CVPR2023/papers/Gehrig_Recurrent_Vision_Transformers_for_Object_Detection_With_Event_Cameras_CVPR_2023_paper.pdf + We present Recurrent Vision Transformers (RVTs), a novel backbone for object detection with event cameras. Event cameras provide visual information with sub-millisecond latency at a high dynamic range and with strong robustness against motion blur. These unique properties offer great potential for low-latency object detection and tracking in time-critical scenarios. Prior work in event-based vision has achieved outstanding detection performance but at the cost of substantial inference time, typically beyond 40 milliseconds. By revisiting the high-level design of recurrent vision backbones, we reduce inference time by a factor of 6 while retaining similar performance. To achieve this, we explore a multi-stage design that utilizes three key concepts in each stage: First, a convolutional prior that can be regarded as a conditional positional embedding. Second, local- and dilated global self-attention for spatial feature interaction. Third, recurrent temporal feature aggregation to minimize latency while retaining temporal information. RVTs can be trained from scratch to reach state-of-the-art performance on event-based object detection - achieving an mAP of 47.2% on the Gen1 automotive dataset. At the same time, RVTs offer fast inference (<12 ms on a T4 GPU) and favorable parameter efficiency (5 times fewer than prior art). Our study brings new insights into effective design choices that can be fruitful for research beyond event-based vision.
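The CTC entry above builds on trit-plane coding. The short sketch below only illustrates the underlying representation: a non-negative integer latent is split into base-3 digit planes that can be transmitted progressively, from most to least significant. The toy latent and the number of planes are assumptions, and the context models that CTC adds on top are omitted.

```python
# Minimal sketch of trit-plane decomposition for progressive coding (illustration
# only): planes[0] holds the most significant trits, so decoding a prefix of the
# planes already gives a coarse reconstruction of the latent.
import numpy as np

def to_trit_planes(latent, num_planes):
    """latent: non-negative integer array with values below 3**num_planes."""
    return [(latent // 3**p) % 3 for p in reversed(range(num_planes))]

def from_trit_planes(planes, num_planes):
    """Reconstruct from however many planes have been received so far."""
    value = np.zeros_like(planes[0])
    for i, plane in enumerate(planes):
        value += plane * 3**(num_planes - 1 - i)
    return value

latent = np.array([[5, 17], [0, 26]])
planes = to_trit_planes(latent, num_planes=3)
partial = from_trit_planes(planes[:2], num_planes=3)  # coarse, progressive reconstruction
full = from_trit_planes(planes, num_planes=3)         # exact for values < 27
assert (full == latent).all()
```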
+ + + + METransformer: Radiology Report Generation by Transformer With Multiple Learnable Expert Tokens + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_METransformer_Radiology_Report_Generation_by_Transformer_With_Multiple_Learnable_Expert_CVPR_2023_paper.pdf + In clinical scenarios, multi-specialist consultation could significantly benefit the diagnosis, especially for intricate cases. This inspires us to explore a "multi-expert joint diagnosis" mechanism to upgrade the existing "single expert" framework commonly seen in the current literature. To this end, we propose METransformer, a method to realize this idea with a transformer-based backbone. The key design of our method is the introduction of multiple learnable "expert" tokens into both the transformer encoder and decoder. In the encoder, each expert token interacts with both vision tokens and other expert tokens to learn to attend different image regions for image representation. These expert tokens are encouraged to capture complementary information by an orthogonal loss that minimizes their overlap. In the decoder, each attended expert token guides the cross-attention between input words and visual tokens, thus influencing the generated report. A metrics-based expert voting strategy is further developed to generate the final report. By the multi-experts concept, our model enjoys the merits of an ensemble-based approach but through a manner that is computationally more efficient and supports more sophisticated interactions among experts. Experimental results demonstrate the promising performance of our proposed model on two widely used benchmarks. Last but not least, the framework-level innovation makes our work ready to incorporate advances on existing "single-expert" models to further improve its performance. + + + + Revealing the Dark Secrets of Masked Image Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Revealing_the_Dark_Secrets_of_Masked_Image_Modeling_CVPR_2023_paper.pdf + Masked image modeling (MIM) as pre-training is shown to be effective for numerous vision downstream tasks, but how and where MIM works remain unclear. In this paper, we compare MIM with the long-dominant supervised pre-trained models from two perspectives, the visualizations and the experiments, to uncover their key representational differences. From the visualizations, we find that MIM brings locality inductive bias to all layers of the trained models, but supervised models tend to focus locally at lower layers but more globally at higher layers. That may be the reason why MIM helps Vision Transformers that have a very large receptive field to optimize. Using MIM, the model can maintain a large diversity on attention heads in all layers. But for supervised models, the diversity on attention heads almost disappears from the last three layers and less diversity harms the fine-tuning performance. From the experiments, we find that MIM models can perform significantly better on geometric and motion tasks with weak semantics or fine-grained classification tasks, than their supervised counterparts. Without bells and whistles, a standard MIM pre-trained SwinV2-L could achieve state-of-the-art performance on pose estimation (78.9 AP on COCO test-dev and 78.0 AP on CrowdPose), depth estimation (0.287 RMSE on NYUv2 and 1.966 RMSE on KITTI), and video object tracking (70.7 SUC on LaSOT). 
For the semantic understanding datasets where the categories are sufficiently covered by the supervised pre-training, MIM models can still achieve highly competitive transfer performance. With a deeper understanding of MIM, we hope that our work can inspire new and solid research in this direction. Code will be available at https://github.com/zdaxie/MIM-DarkSecrets. + + + + Fine-Grained Classification With Noisy Labels + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Fine-Grained_Classification_With_Noisy_Labels_CVPR_2023_paper.pdf + Learning with noisy labels (LNL) aims to ensure model generalization given a label-corrupted training set. In this work, we investigate a rarely studied scenario of LNL on fine-grained datasets (LNL-FG), which is more practical and challenging as large inter-class ambiguities among fine-grained classes cause more noisy labels. We empirically show that existing methods that work well for LNL fail to achieve satisfying performance for LNL-FG, arising the practical need of effective solutions for LNL-FG. To this end, we propose a novel framework called stochastic noise-tolerated supervised contrastive learning (SNSCL) that confronts label noise by encouraging distinguishable representation. Specifically, we design a noise-tolerated supervised contrastive learning loss that incorporates a weight-aware mechanism for noisy label correction and selectively updating momentum queue lists. By this mechanism, we mitigate the effects of noisy anchors and avoid inserting noisy labels into the momentum-updated queue. Besides, to avoid manually-defined augmentation strategies in contrastive learning, we propose an efficient stochastic module that samples feature embeddings from a generated distribution, which can also enhance the representation ability of deep models. SNSCL is general and compatible with prevailing robust LNL strategies to improve their performance for LNL-FG. Extensive experiments demonstrate the effectiveness of SNSCL. + + + + CAP: Robust Point Cloud Classification via Semantic and Structural Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_CAP_Robust_Point_Cloud_Classification_via_Semantic_and_Structural_Modeling_CVPR_2023_paper.pdf + Recently, deep neural networks have shown great success on 3D point cloud classification tasks, which simultaneously raises the concern of adversarial attacks that cause severe damage to real-world applications. Moreover, defending against adversarial examples in point cloud data is extremely difficult due to the emergence of various attack strategies. In this work, with the insight of the fact that the adversarial examples in this task still preserve the same semantic and structural information as the original input, we design a novel defense framework for improving the robustness of existing classification models, which consists of two main modules: the attention-based pooling and the dynamic contrastive learning. In addition, we also develop an algorithm to theoretically certify the robustness of the proposed framework. Extensive empirical results on two datasets and three classification models show the robustness of our approach against various attacks, e.g., the averaged attack success rate of PointNet decreases from 70.2% to 2.7% on the ModelNet40 dataset under 9 common attacks. 
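The CAP entry above names attention-based pooling as one of its two modules. The sketch below shows a generic version of such pooling over per-point features, with random arrays standing in for learned parameters and backbone outputs (both assumptions); CAP's dynamic contrastive learning and robustness certification are not reproduced here.

```python
# Generic attention-based pooling over per-point features (illustration only):
# each point gets a learned score, scores are softmax-normalized over the cloud,
# and the global descriptor is the attention-weighted sum of point features.
import numpy as np

def attention_pool(point_features, score_weights):
    """point_features: (num_points, dim); score_weights: (dim,) learned scorer."""
    scores = point_features @ score_weights          # one scalar per point
    scores = scores - scores.max()                   # numerical stability
    attn = np.exp(scores) / np.exp(scores).sum()     # softmax over points
    return attn @ point_features                     # weighted global descriptor

rng = np.random.default_rng(0)
feats = rng.normal(size=(1024, 64))   # stand-in for point-cloud backbone features
global_desc = attention_pool(feats, rng.normal(size=64))
print(global_desc.shape)              # (64,)
```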
+ + + + Visual-Tactile Sensing for In-Hand Object Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Visual-Tactile_Sensing_for_In-Hand_Object_Reconstruction_CVPR_2023_paper.pdf + Tactile sensing is one of the modalities humans rely on heavily to perceive the world. Working with vision, this modality refines local geometric structure, measures deformation at the contact area, and indicates the hand-object contact state. With the availability of open-source tactile sensors such as DIGIT, research on visual-tactile learning is becoming more accessible and reproducible. Leveraging this tactile sensor, we propose a novel visual-tactile in-hand object reconstruction framework VTacO, and extend it to VTacOH for hand-object reconstruction. Since our method can support both rigid and deformable object reconstruction and no existing benchmark is suitable for this goal, we propose a simulation environment, VT-Sim, which supports generating hand-object interactions for both rigid and deformable objects. With VT-Sim, we generate a large-scale training dataset, and evaluate our method on it. Extensive experiments demonstrate that our proposed method can outperform the previous baseline methods qualitatively and quantitatively. Finally, we directly apply our model trained in simulation to various real-world test cases and present qualitative results. Codes, models, the simulation environment, and datasets will be publicly available. + + + + Local-to-Global Registration for Bundle-Adjusting Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Local-to-Global_Registration_for_Bundle-Adjusting_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRF) have achieved photorealistic novel view synthesis; however, the requirement of accurate camera poses limits their application. Although analysis-by-synthesis extensions for jointly learning neural 3D representations and registering camera frames exist, they are susceptible to suboptimal solutions if poorly initialized. We propose L2G-NeRF, a Local-to-Global registration method for bundle-adjusting Neural Radiance Fields: first, a pixel-wise flexible alignment, followed by a frame-wise constrained parametric alignment. Pixel-wise local alignment is learned in an unsupervised way via a deep network which optimizes photometric reconstruction errors. Frame-wise global alignment is performed using differentiable parameter estimation solvers on the pixel-wise correspondences to find a global transformation. Experiments on synthetic and real-world data show that our method outperforms the current state-of-the-art in terms of high-fidelity reconstruction and resolving large camera pose misalignment. Our module is an easy-to-use plugin that can be applied to NeRF variants and other neural field applications. + + + + FJMP: Factorized Joint Multi-Agent Motion Prediction Over Learned Directed Acyclic Interaction Graphs + http://openaccess.thecvf.com//content/CVPR2023/papers/Rowe_FJMP_Factorized_Joint_Multi-Agent_Motion_Prediction_Over_Learned_Directed_Acyclic_CVPR_2023_paper.pdf + Predicting the future motion of road agents is a critical task in an autonomous driving pipeline. In this work, we address the problem of generating a set of scene-level, or joint, future trajectory predictions in multi-agent driving scenarios. To this end, we propose FJMP, a Factorized Joint Motion Prediction framework for multi-agent interactive driving scenarios.
FJMP models the future scene interaction dynamics as a sparse directed interaction graph, where edges denote explicit interactions between agents. We then prune the graph into a directed acyclic graph (DAG) and decompose the joint prediction task into a sequence of marginal and conditional predictions according to the partial ordering of the DAG, where joint future trajectories are decoded using a directed acyclic graph neural network (DAGNN). We conduct experiments on the INTERACTION and Argoverse 2 datasets and demonstrate that FJMP produces more accurate and scene-consistent joint trajectory predictions than non-factorized approaches, especially on the most interactive and kinematically interesting agents. FJMP ranks 1st on the multi-agent test leaderboard of the INTERACTION dataset. + + + + Correlational Image Modeling for Self-Supervised Visual Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Correlational_Image_Modeling_for_Self-Supervised_Visual_Pre-Training_CVPR_2023_paper.pdf + We introduce Correlational Image Modeling (CIM), a novel but surprisingly effective approach to self-supervised visual pre-training. Our CIM performs a simple pretext task: we randomly crop image regions (exemplar) from an input image (context) and predict correlation maps between the exemplars and the context. Three key designs enable correlational image modeling as a nontrivial and meaningful self-supervisory task. First, to generate useful exemplar-context pairs, we consider cropping image regions with various scales, shapes, rotations, and transformations. Second, we employ a bootstrap learning framework that involves online and target networks. During pre-training, the former takes exemplars as inputs while the latter converts the context. Third, we model the output correlation maps via a simple cross-attention block, within which the context serves as queries and the exemplars offer values and keys. We show that CIM performs on par or better than the current state of the art on self-supervised and transfer benchmarks. + + + + Self-Supervised Implicit Glyph Attention for Text Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Guan_Self-Supervised_Implicit_Glyph_Attention_for_Text_Recognition_CVPR_2023_paper.pdf + The attention mechanism has become the de facto module in scene text recognition (STR) methods, due to its capability of extracting character-level representations. These methods can be summarized into implicit attention based and supervised attention based, depended on how the attention is computed, i.e., implicit attention and supervised attention are learned from sequence-level text annotations and character-level bounding box annotations, respectively. Implicit attention, as it may extract coarse or even incorrect spatial regions as character attention, is prone to suffering from an alignment-drifted issue. Supervised attention can alleviate the above issue, but it is category-specific, which requires extra laborious character-level bounding box annotations and would be memory-intensive when the number of character categories is large. To address the aforementioned issues, we propose a novel attention mechanism for STR, self-supervised implicit glyph attention (SIGA). SIGA delineates the glyph structures of text images by jointly self-supervised text segmentation and implicit attention alignment, which serve as the supervision to improve attention correctness without extra character-level annotations. 
Experimental results demonstrate that SIGA performs consistently and significantly better than previous attention-based STR methods, in terms of both attention correctness and final recognition performance on publicly available context benchmarks and our contributed contextless benchmarks. + + + + ACL-SPC: Adaptive Closed-Loop System for Self-Supervised Point Cloud Completion + http://openaccess.thecvf.com//content/CVPR2023/papers/Hong_ACL-SPC_Adaptive_Closed-Loop_System_for_Self-Supervised_Point_Cloud_Completion_CVPR_2023_paper.pdf + Point cloud completion addresses filling in the missing parts of a partial point cloud obtained from depth sensors and generating a complete point cloud. Although there has been steep progress in the supervised methods on the synthetic point cloud completion task, it is hardly applicable in real-world scenarios due to the domain gap between the synthetic and real-world datasets or the requirement of prior information. To overcome these limitations, we propose a novel self-supervised framework ACL-SPC for point cloud completion to train and test on the same data. ACL-SPC takes a single partial input and attempts to output the complete point cloud using an adaptive closed-loop (ACL) system that enforces the output same for the variation of an input. We evaluate our ACL-SPC on various datasets to prove that it can successfully learn to complete a partial point cloud as the first self-supervised scheme. Results show that our method is comparable with unsupervised methods and achieves superior performance on the real-world dataset compared to the supervised methods trained on the synthetic dataset. Extensive experiments justify the necessity of self-supervised learning and the effectiveness of our proposed method for the real-world point cloud completion task. The code is publicly available from this link. + + + + Focus on Details: Online Multi-Object Tracking With Diverse Fine-Grained Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ren_Focus_on_Details_Online_Multi-Object_Tracking_With_Diverse_Fine-Grained_Representation_CVPR_2023_paper.pdf + Discriminative representation is essential to keep a unique identifier for each target in Multiple object tracking (MOT). Some recent MOT methods extract features of the bounding box region or the center point as identity embeddings. However, when targets are occluded, these coarse-grained global representations become unreliable. To this end, we propose exploring diverse fine-grained representation, which describes appearance comprehensively from global and local perspectives. This fine-grained representation requires high feature resolution and precise semantic information. To effectively alleviate the semantic misalignment caused by indiscriminate contextual information aggregation, Flow Alignment FPN (FAFPN) is proposed for multi-scale feature alignment aggregation. It generates semantic flow among feature maps from different resolutions to transform their pixel positions. Furthermore, we present a Multi-head Part Mask Generator (MPMG) to extract fine-grained representation based on the aligned feature maps. Multiple parallel branches of MPMG allow it to focus on different parts of targets to generate local masks without label supervision. The diverse details in target masks facilitate fine-grained representation. 
Eventually, benefiting from a Shuffle-Group Sampling (SGS) training strategy with positive and negative samples balanced, we achieve state-of-the-art performance on MOT17 and MOT20 test sets. Even on DanceTrack, where the appearance of targets is extremely similar, our method significantly outperforms ByteTrack by 5.0% on HOTA and 5.6% on IDF1. Extensive experiments have proved that diverse fine-grained representation makes Re-ID great again in MOT. + + + + DiffPose: Toward More Reliable 3D Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Gong_DiffPose_Toward_More_Reliable_3D_Pose_Estimation_CVPR_2023_paper.pdf + Monocular 3D human pose estimation is quite challenging due to the inherent ambiguity and occlusion, which often lead to high uncertainty and indeterminacy. On the other hand, diffusion models have recently emerged as an effective tool for generating high-quality images from noise. Inspired by their capability, we explore a novel pose estimation framework (DiffPose) that formulates 3D pose estimation as a reverse diffusion process. We incorporate novel designs into our DiffPose to facilitate the diffusion process for 3D pose estimation: a pose-specific initialization of pose uncertainty distributions, a Gaussian Mixture Model-based forward diffusion process, and a context-conditioned reverse diffusion process. Our proposed DiffPose significantly outperforms existing methods on the widely used pose estimation benchmarks Human3.6M and MPI-INF-3DHP. Project page: https://gongjia0208.github.io/Diffpose/. + + + + Learning Analytical Posterior Probability for Human Mesh Recovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Fang_Learning_Analytical_Posterior_Probability_for_Human_Mesh_Recovery_CVPR_2023_paper.pdf + Despite various probabilistic methods for modeling the uncertainty and ambiguity in human mesh recovery, their overall precision is limited because existing formulations for joint rotations are either not constrained to SO(3) or difficult to learn for neural networks. To address such an issue, we derive a novel analytical formulation for learning posterior probability distributions of human joint rotations conditioned on bone directions in a Bayesian manner, and based on this, we propose a new posterior-guided framework for human mesh recovery. We demonstrate that our framework is not only superior to existing SOTA baselines on multiple benchmarks but also flexible enough to seamlessly incorporate with additional sensors due to its Bayesian nature. The code is available at https://github.com/NetEase-GameAI/ProPose. + + + + Non-Contrastive Unsupervised Learning of Physiological Signals From Video + http://openaccess.thecvf.com//content/CVPR2023/papers/Speth_Non-Contrastive_Unsupervised_Learning_of_Physiological_Signals_From_Video_CVPR_2023_paper.pdf + Subtle periodic signals such as blood volume pulse and respiration can be extracted from RGB video, enabling noncontact health monitoring at low cost. Advancements in remote pulse estimation -- or remote photoplethysmography (rPPG) -- are currently driven by deep learning solutions. However, modern approaches are trained and evaluated on benchmark datasets with ground truth from contact-PPG sensors. We present the first non-contrastive unsupervised learning framework for signal regression to mitigate the need for labelled video data. With minimal assumptions of periodicity and finite bandwidth, our approach discovers the blood volume pulse directly from unlabelled videos. 
We find that encouraging sparse power spectra within normal physiological bandlimits and variance over batches of power spectra is sufficient for learning visual features of periodic signals. We perform the first experiments utilizing unlabelled video data not specifically created for rPPG to train robust pulse rate estimators. Given the limited inductive biases and impressive empirical results, the approach is theoretically capable of discovering other periodic signals from video, enabling multiple physiological measurements without the need for ground truth signals. + + + + FashionSAP: Symbols and Attributes Prompt for Fine-Grained Fashion Vision-Language Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_FashionSAP_Symbols_and_Attributes_Prompt_for_Fine-Grained_Fashion_Vision-Language_Pre-Training_CVPR_2023_paper.pdf + Fashion vision-language pre-training models have shown efficacy for a wide range of downstream tasks. However, general vision-language pre-training models pay less attention to fine-grained domain features, while these features are important in distinguishing the specific domain tasks from general tasks. We propose a method for fine-grained fashion vision-language pre-training based on fashion Symbols and Attributes Prompt (FashionSAP) to model fine-grained multi-modalities fashion attributes and characteristics. Firstly, we propose the fashion symbols, a novel abstract fashion concept layer, to represent different fashion items and to generalize various kinds of fine-grained fashion features, making modelling fine-grained attributes more effective. Secondly, the attributes prompt method is proposed to make the model learn specific attributes of fashion items explicitly. We design proper prompt templates according to the format of fashion data. Comprehensive experiments are conducted on two public fashion benchmarks, i.e., FashionGen and FashionIQ, and FashionSAP gets SOTA performances for four popular fashion tasks. The ablation study also shows the proposed abstract fashion symbols, and the attribute prompt method enables the model to acquire fine-grained semantics in the fashion domain effectively. The obvious performance gains from FashionSAP provide a new baseline for future fashion task research. + + + + Structure Aggregation for Cross-Spectral Stereo Image Guided Denoising + http://openaccess.thecvf.com//content/CVPR2023/papers/Sheng_Structure_Aggregation_for_Cross-Spectral_Stereo_Image_Guided_Denoising_CVPR_2023_paper.pdf + To obtain clean images with salient structures from noisy observations, a growing trend in current denoising studies is to seek the help of additional guidance images with high signal-to-noise ratios, which are often acquired in different spectral bands such as near infrared. Although previous guided denoising methods basically require the input images to be well-aligned, a more common way to capture the paired noisy target and guidance images is to exploit a stereo camera system. However, current studies on cross-spectral stereo matching cannot fully guarantee the pixel-level registration accuracy, and rarely consider the case of noise contamination. In this work, for the first time, we propose a guided denoising framework for cross-spectral stereo images. 
Instead of aligning the input images via conventional stereo matching, we aggregate structures from the guidance image to estimate a clean structure map for the noisy target image, which is then used to regress the final denoising result with a spatially variant linear representation model. Based on this, we design a neural network, called as SANet, to complete the entire guided denoising process. Experimental results show that, our SANet can effectively transfer structures from an unaligned guidance image to the restoration result, and outperforms state-of-the-art denoisers on various stereo image datasets. Besides, our structure aggregation strategy also shows its potential to handle other unaligned guided restoration tasks such as super-resolution and deblurring. The source code is available at https://github.com/lustrouselixir/SANet. + + + + RONO: Robust Discriminative Learning With Noisy Labels for 2D-3D Cross-Modal Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_RONO_Robust_Discriminative_Learning_With_Noisy_Labels_for_2D-3D_Cross-Modal_CVPR_2023_paper.pdf + Recently, with the advent of Metaverse and AI Generated Content, cross-modal retrieval becomes popular with a burst of 2D and 3D data. However, this problem is challenging given the heterogeneous structure and semantic discrepancies. Moreover, imperfect annotations are ubiquitous given the ambiguous 2D and 3D content, thus inevitably producing noisy labels to degrade the learning performance. To tackle the problem, this paper proposes a robust 2D-3D retrieval framework (RONO) to robustly learn from noisy multimodal data. Specifically, one novel Robust Discriminative Center Learning mechanism (RDCL) is proposed in RONO to adaptively distinguish clean and noisy samples for respectively providing them with positive and negative optimization directions, thus mitigating the negative impact of noisy labels. Besides, we present a Shared Space Consistency Learning mechanism (SSCL) to capture the intrinsic information inside the noisy data by minimizing the cross-modal and semantic discrepancy between common space and label space simultaneously. Comprehensive mathematical analyses are given to theoretically prove the noise tolerance of the proposed method. Furthermore, we conduct extensive experiments on four 3D-model multimodal datasets to verify the effectiveness of our method by comparing it with 15 state-of-the-art methods. Code is available at https://github.com/penghu-cs/RONO. + + + + ConQueR: Query Contrast Voxel-DETR for 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_ConQueR_Query_Contrast_Voxel-DETR_for_3D_Object_Detection_CVPR_2023_paper.pdf + Although DETR-based 3D detectors simplify the detection pipeline and achieve direct sparse predictions, their performance still lags behind dense detectors with post-processing for 3D object detection from point clouds. DETRs usually adopt a larger number of queries than GTs (e.g., 300 queries v.s. 40 objects in Waymo) in a scene, which inevitably incur many false positives during inference. In this paper, we propose a simple yet effective sparse 3D detector, named Query Contrast Voxel-DETR (ConQueR), to eliminate the challenging false positives, and achieve more accurate and sparser predictions. We observe that most false positives are highly overlapping in local regions, caused by the lack of explicit supervision to discriminate locally similar queries. 
We thus propose a Query Contrast mechanism to explicitly enhance queries towards their best-matched GTs over all unmatched query predictions. This is achieved by the construction of positive and negative GT-query pairs for each GT, and a contrastive loss to enhance positive GT-query pairs against negative ones based on feature similarities. ConQueR closes the gap of sparse and dense 3D detectors, and reduces 60% false positives. Our single-frame ConQueR achieves 71.6 mAPH/L2 on the challenging Waymo Open Dataset validation set, outperforming previous sota methods by over 2.0 mAPH/L2. Code: https://github.com/poodarchu/EFG. + + + + Robust Multiview Point Cloud Registration With Reliable Pose Graph Initialization and History Reweighting + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Robust_Multiview_Point_Cloud_Registration_With_Reliable_Pose_Graph_Initialization_CVPR_2023_paper.pdf + In this paper, we present a new method for the multiview registration of point cloud. Previous multiview registration methods rely on exhaustive pairwise registration to construct a densely-connected pose graph and apply Iteratively Reweighted Least Square (IRLS) on the pose graph to compute the scan poses. However, constructing a densely-connected graph is time-consuming and contains lots of outlier edges, which makes the subsequent IRLS struggle to find correct poses. To address the above problems, we first propose to use a neural network to estimate the overlap between scan pairs, which enables us to construct a sparse but reliable pose graph. Then, we design a novel history reweighting function in the IRLS scheme, which has strong robustness to outlier edges on the graph. In comparison with existing multiview registration methods, our method achieves 11% higher registration recall on the 3DMatch dataset and 13% lower registration errors on the ScanNet dataset while reducing 70% required pairwise registrations. Comprehensive ablation studies are conducted to demonstrate the effectiveness of our designs. The source code is available at https://github.com/WHU-USI3DV/SGHR. + + + + OSRT: Omnidirectional Image Super-Resolution With Distortion-Aware Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_OSRT_Omnidirectional_Image_Super-Resolution_With_Distortion-Aware_Transformer_CVPR_2023_paper.pdf + Omnidirectional images (ODIs) have obtained lots of research interest for immersive experiences. Although ODIs require extremely high resolution to capture details of the entire scene, the resolutions of most ODIs are insufficient. Previous methods attempt to solve this issue by image super-resolution (SR) on equirectangular projection (ERP) images. However, they omit geometric properties of ERP in the degradation process, and their models can hardly generalize to real ERP images. In this paper, we propose Fisheye downsampling, which mimics the real-world imaging process and synthesizes more realistic low-resolution samples. Then we design a distortion-aware Transformer (OSRT) to modulate ERP distortions continuously and self-adaptively. Without a cumbersome process, OSRT outperforms previous methods by about 0.2dB on PSNR. Moreover, we propose a convenient data augmentation strategy, which synthesizes pseudo ERP images from plain images. This simple strategy can alleviate the over-fitting problem of large networks and significantly boost the performance of ODI SR. Extensive experiments have demonstrated the state-of-the-art performance of our OSRT. 
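To make the Query Contrast idea in the ConQueR entry above more concrete, the following minimal PyTorch sketch contrasts each ground-truth embedding against its best-matched query and the remaining queries with an InfoNCE-style loss. The tensor shapes, the cosine-similarity scoring, the temperature, and the use of every non-matched query as a negative are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of an InfoNCE-style "query contrast" loss in the spirit of
# the ConQueR entry above: for each ground-truth (GT) box, pull its best-matched
# query embedding closer and push the other query embeddings away.
# Shapes, the temperature value, and the cosine-similarity choice are assumptions
# for illustration only, not the paper's implementation.
import torch
import torch.nn.functional as F


def query_contrast_loss(gt_embed, query_embed, matched_idx, temperature=0.1):
    """
    gt_embed:     (G, D) embeddings of ground-truth boxes.
    query_embed:  (Q, D) embeddings of object queries.
    matched_idx:  (G,) index of the best-matched query for each GT
                  (e.g. from Hungarian matching).
    """
    gt = F.normalize(gt_embed, dim=-1)
    q = F.normalize(query_embed, dim=-1)

    # Cosine similarity between every GT and every query, scaled by temperature.
    logits = gt @ q.t() / temperature          # (G, Q)

    # For each GT, the matched query is the positive class; all other queries act
    # as negatives in a softmax cross-entropy. (The paper restricts negatives to
    # unmatched queries; using all other queries here is a simplification.)
    return F.cross_entropy(logits, matched_idx)


# Toy usage with random features.
if __name__ == "__main__":
    G, Q, D = 4, 300, 256
    gt_feat = torch.randn(G, D)
    query_feat = torch.randn(Q, D)
    matched = torch.tensor([5, 17, 42, 99])    # pretend Hungarian assignments
    print(query_contrast_loss(gt_feat, query_feat, matched))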
+ + + + BEV@DC: Bird's-Eye View Assisted Training for Depth Completion + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_BEVDC_Birds-Eye_View_Assisted_Training_for_Depth_Completion_CVPR_2023_paper.pdf + Depth completion plays a crucial role in autonomous driving, in which cameras and LiDARs are two complementary sensors. Recent approaches attempt to exploit spatial geometric constraints hidden in LiDARs to enhance image-guided depth completion. However, only low efficiency and poor generalization can be achieved. In this paper, we propose BEV@DC, a more efficient and powerful multi-modal training scheme, to boost the performance of image-guided depth completion. In practice, the proposed BEV@DC model comprehensively takes advantage of LiDARs with rich geometric details in training, employing an enhanced depth completion manner in inference, which takes only images (RGB and depth) as input. Specifically, the geometric-aware LiDAR features are projected onto a unified BEV space, combining with RGB features to perform BEV completion. By equipping a newly proposed point-voxel spatial propagation network (PV-SPN), this auxiliary branch introduces strong guidance to the original image branches via 3D dense supervision and feature consistency. As a result, our baseline model demonstrates significant improvements with the sole image inputs. Concretely, it achieves state-of-the-art on several benchmarks, e.g., ranking Top-1 on the challenging KITTI depth completion benchmark. + + + + Large-Scale Training Data Search for Object Re-Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Yao_Large-Scale_Training_Data_Search_for_Object_Re-Identification_CVPR_2023_paper.pdf + We consider a scenario where we have access to the target domain, but cannot afford on-the-fly training data annotation, and instead would like to construct an alternative training set from a large-scale data pool such that a competitive model can be obtained. We propose a search and pruning (SnP) solution to this training data search problem, tailored to object re-identification (re-ID), an application aiming to match the same object captured by different cameras. Specifically, the search stage identifies and merges clusters of source identities which exhibit similar distributions with the target domain. The second stage, subject to a budget, then selects identities and their images from the Stage I output, to control the size of the resulting training set for efficient training. The two steps provide us with training sets 80% smaller than the source pool while achieving a similar or even higher re-ID accuracy. These training sets are also shown to be superior to a few existing search methods such as random sampling and greedy sampling under the same budget on training data size. If we release the budget, training sets resulting from the first stage alone allow even higher re-ID accuracy. We provide interesting discussions on the specificity of our method to the re-ID problem and particularly its role in bridging the re-ID domain gap. The code is available at https://github.com/yorkeyao/SnP. + + + + SelfME: Self-Supervised Motion Learning for Micro-Expression Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Fan_SelfME_Self-Supervised_Motion_Learning_for_Micro-Expression_Recognition_CVPR_2023_paper.pdf + Facial micro-expressions (MEs) refer to brief spontaneous facial movements that can reveal a person's genuine emotion. 
They are valuable in lie detection, criminal analysis, and other areas. While deep learning-based ME recognition (MER) methods achieved impressive success, these methods typically require pre-processing using conventional optical flow-based methods to extract facial motions as inputs. To overcome this limitation, we propose a novel MER framework using self-supervised learning to extract facial motion for ME (SelfME). To the best of our knowledge, this is the first work using an automatically self-learned motion technique for MER. However, the self-supervised motion learning method might suffer from ignoring symmetrical facial actions on the left and right sides of faces when extracting fine features. To address this issue, we developed a symmetric contrastive vision transformer (SCViT) to constrain the learning of similar facial action features for the left and right parts of faces. Experiments were conducted on two benchmark datasets showing that our method achieved state-of-the-art performance, and ablation studies demonstrated the effectiveness of our method. + + + + NewsNet: A Novel Dataset for Hierarchical Temporal Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_NewsNet_A_Novel_Dataset_for_Hierarchical_Temporal_Segmentation_CVPR_2023_paper.pdf + Temporal video segmentation is the go-to step of automatic video analysis, which decomposes a long-form video into smaller components for follow-up understanding tasks. Recent works have studied several levels of granularity to segment a video, such as shot, event, and scene. Those segmentations can help compare the semantics in the corresponding scales, but lack a wider view of larger temporal spans, especially when the video is complex and structured. Therefore, we present two abstractive levels of temporal segmentation and study their hierarchy with respect to the existing fine-grained levels. Accordingly, we collect NewsNet, the largest news video dataset, consisting of 1,000 videos totaling over 900 hours, associated with several tasks for hierarchical temporal video segmentation. Each news video is a collection of stories on different topics, represented as aligned audio, visual, and textual data, along with extensive frame-wise annotations in four granularities. We assert that the study on NewsNet can advance the understanding of complex structured video and benefit more areas such as short-video creation, personalized advertisement, digital instruction, and education. Our dataset and code are publicly available at: https://github.com/NewsNet-Benchmark/NewsNet. + + + + Uncertainty-Aware Unsupervised Image Deblurring With Deep Residual Prior + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Uncertainty-Aware_Unsupervised_Image_Deblurring_With_Deep_Residual_Prior_CVPR_2023_paper.pdf + Non-blind deblurring methods achieve decent performance under the accurate blur kernel assumption. Since kernel uncertainty (i.e. kernel error) is inevitable in practice, semi-blind deblurring is suggested to handle it by introducing a prior on the kernel (or induced) error. However, how to design a suitable prior for the kernel (or induced) error remains challenging. A hand-crafted prior, incorporating domain knowledge, generally performs well but may lead to poor performance when the kernel (or induced) error is complex. A data-driven prior, which excessively depends on the diversity and abundance of training data, is vulnerable to out-of-distribution blurs and images.
To address this challenge, we suggest a dataset-free deep residual prior for the kernel-induced error (termed the residual), expressed by a customized untrained deep neural network, which allows us to flexibly adapt to different blurs and images in real scenarios. By organically integrating the respective strengths of deep priors and hand-crafted priors, we propose an unsupervised semi-blind deblurring model which recovers the latent image from the blurry image and an inaccurate blur kernel. To tackle the formulated model, an efficient alternating minimization algorithm is developed. Extensive experiments demonstrate the favorable performance of the proposed method compared to model-driven and data-driven methods in terms of image quality and robustness to different types of kernel error. + + + + FedDM: Iterative Distribution Matching for Communication-Efficient Federated Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiong_FedDM_Iterative_Distribution_Matching_for_Communication-Efficient_Federated_Learning_CVPR_2023_paper.pdf + Federated learning (FL) has recently attracted increasing attention from academia and industry, with the ultimate goal of achieving collaborative training under privacy and communication constraints. Existing iterative model averaging based FL algorithms require a large number of communication rounds to obtain a well-performing model due to extremely unbalanced and non-i.i.d. data partitioning among different clients. Thus, we propose FedDM to build the global training objective from multiple local surrogate functions, which enables the server to gain a more global view of the loss landscape. In detail, we construct synthetic sets of data on each client to locally match the loss landscape from original data through distribution matching. FedDM reduces communication rounds and improves model quality by transmitting more informative and smaller synthesized data compared with unwieldy model weights. We conduct extensive experiments on three image classification datasets, and results show that our method can outperform other FL counterparts in terms of efficiency and model performance. Moreover, we demonstrate that FedDM can be adapted to preserve differential privacy with the Gaussian mechanism and train a better model under the same privacy budget. + + + + Bit-Shrinking: Limiting Instantaneous Sharpness for Improving Post-Training Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Bit-Shrinking_Limiting_Instantaneous_Sharpness_for_Improving_Post-Training_Quantization_CVPR_2023_paper.pdf + Post-training quantization (PTQ) is an effective compression method to reduce the model size and computational cost. However, quantizing a model into a low-bit one, e.g., lower than 4, is difficult and often results in nonnegligible performance degradation. To address this, we investigate the loss landscapes of quantized networks with various bit-widths. We show that a network with a more ragged loss surface is more easily trapped in bad local minima, which mostly appears in low-bit quantization. A deeper analysis indicates that the ragged surface is caused by the injection of excessive quantization noise. To this end, we detach a sharpness term from the loss which reflects the impact of quantization noise. To smooth the rugged loss surface, we propose to keep the sharpness term small and stable during optimization.
Instead of directly optimizing the network at the target bit-width, the bit-width of the quantized network follows a self-adapted shrinking schedule in the continuous domain, from a high bit-width down to the target, while limiting the increasing sharpness term to a proper range. It can be viewed as iteratively adding small "instant" quantization noise and adjusting the network to eliminate its impact. Extensive experiments on classification and detection tasks demonstrate the effectiveness of the Bit-shrinking strategy in PTQ. On the Vision Transformer models, our INT8 and INT6 models drop within 0.5% and 1.5% Top-1 accuracy, respectively. On the traditional CNN networks, our INT4 quantized models drop within 1.3% and 3.5% Top-1 accuracy on ResNet18 and MobileNetV2 without fine-tuning, achieving state-of-the-art performance. + + + + LSTFE-Net: Long Short-Term Feature Enhancement Network for Video Small Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiao_LSTFE-NetLong_Short-Term_Feature_Enhancement_Network_for_Video_Small_Object_Detection_CVPR_2023_paper.pdf + Video small object detection is a difficult task due to the lack of object information. Recent methods focus on adding more temporal information to obtain more potent high-level features, but often fail to specify the most vital information for small objects, resulting in insufficient or inappropriate features. Since information from frames at different positions contributes differently to small objects, it is not ideal to assume that one universal method will extract proper features. We find that context information from the long-term frame and temporal information from the short-term frame are two useful cues for video small object detection. To fully utilize these two cues, we propose a long short-term feature enhancement network (LSTFE-Net) for video small object detection. First, we develop a plug-and-play spatio-temporal feature alignment module to create temporal correspondences between the short-term and current frames. Then, we propose a frame selection module to select the long-term frame that can provide the most additional context information. Finally, we propose a long short-term feature aggregation module to fuse long short-term features. Compared to other state-of-the-art methods, our LSTFE-Net achieves a 4.4% absolute boost in AP on the FL-Drones dataset. More details can be found at https://github.com/xiaojs18/LSTFE-Net. + + + + MIC: Masked Image Consistency for Context-Enhanced Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Hoyer_MIC_Masked_Image_Consistency_for_Context-Enhanced_Domain_Adaptation_CVPR_2023_paper.pdf + In unsupervised domain adaptation (UDA), a model trained on source data (e.g. synthetic) is adapted to target data (e.g. real-world) without access to target annotations. Most previous UDA methods struggle with classes that have a similar visual appearance on the target domain, as no ground truth is available to learn the slight appearance differences. To address this problem, we propose a Masked Image Consistency (MIC) module to enhance UDA by learning spatial context relations of the target domain as additional clues for robust visual recognition. MIC enforces consistency between predictions of masked target images, where random patches are withheld, and pseudo-labels that are generated based on the complete image by an exponential moving average teacher.
To minimize the consistency loss, the network has to learn to infer the predictions of the masked regions from their context. Due to its simple and universal concept, MIC can be integrated into various UDA methods across different visual recognition tasks such as image classification, semantic segmentation, and object detection. MIC significantly improves the state-of-the-art performance across the different recognition tasks for synthetic-to-real, day-to-nighttime, and clear-to-adverse-weather UDA. For instance, MIC achieves an unprecedented UDA performance of 75.9 mIoU and 92.8% on GTA-to-Cityscapes and VisDA-2017, respectively, which corresponds to an improvement of +2.1 and +3.0 percent points over the previous state of the art. The implementation is available at https://github.com/lhoyer/MIC. + + + + SkyEye: Self-Supervised Bird's-Eye-View Semantic Mapping Using Monocular Frontal View Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Gosala_SkyEye_Self-Supervised_Birds-Eye-View_Semantic_Mapping_Using_Monocular_Frontal_View_Images_CVPR_2023_paper.pdf + Bird's-Eye-View (BEV) semantic maps have become an essential component of automated driving pipelines due to the rich representation they provide for decision-making tasks. However, existing approaches for generating these maps still follow a fully supervised training paradigm and hence rely on large amounts of annotated BEV data. In this work, we address this limitation by proposing the first self-supervised approach for generating a BEV semantic map using a single monocular image from the frontal view (FV). During training, we overcome the need for BEV ground truth annotations by leveraging the more easily available FV semantic annotations of video sequences. Thus, we propose the SkyEye architecture that learns based on two modes of self-supervision, namely, implicit supervision and explicit supervision. Implicit supervision trains the model by enforcing spatial consistency of the scene over time based on FV semantic sequences, while explicit supervision exploits BEV pseudolabels generated from FV semantic annotations and self-supervised depth estimates. Extensive evaluations on the KITTI-360 dataset demonstrate that our self-supervised approach performs on par with the state-of-the-art fully supervised methods and achieves competitive results using only 1% of direct supervision in BEV compared to fully supervised approaches. Finally, we publicly release both our code and the BEV datasets generated from the KITTI-360 and Waymo datasets. + + + + VoxFormer: Sparse Voxel Transformer for Camera-Based 3D Semantic Scene Completion + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_VoxFormer_Sparse_Voxel_Transformer_for_Camera-Based_3D_Semantic_Scene_Completion_CVPR_2023_paper.pdf + Humans can easily imagine the complete 3D geometry of occluded objects and scenes. This appealing ability is vital for recognition and understanding. To enable such capability in AI systems, we propose VoxFormer, a Transformer-based semantic scene completion framework that can output complete 3D volumetric semantics from only 2D images. Our framework adopts a two-stage design where we start from a sparse set of visible and occupied voxel queries from depth estimation, followed by a densification stage that generates dense 3D voxels from the sparse ones. A key idea of this design is that the visual features on 2D images correspond only to the visible scene structures rather than the occluded or empty spaces. 
Therefore, starting with the featurization and prediction of the visible structures is more reliable. Once we obtain the set of sparse queries, we apply a masked autoencoder design to propagate the information to all the voxels by self-attention. Experiments on SemanticKITTI show that VoxFormer outperforms the state of the art with a relative improvement of 20.0% in geometry and 18.1% in semantics and reduces GPU memory during training to less than 16GB. Our code is available on https://github.com/NVlabs/VoxFormer. + + + + Joint Video Multi-Frame Interpolation and Deblurring Under Unknown Exposure Time + http://openaccess.thecvf.com//content/CVPR2023/papers/Shang_Joint_Video_Multi-Frame_Interpolation_and_Deblurring_Under_Unknown_Exposure_Time_CVPR_2023_paper.pdf + Natural videos captured by consumer cameras often suffer from low framerate and motion blur due to the combination of dynamic scene complexity, lens and sensor imperfection, and less than ideal exposure setting. As a result, computational methods that jointly perform video frame interpolation and deblurring begin to emerge with the unrealistic assumption that the exposure time is known and fixed. In this work, we aim ambitiously for a more realistic and challenging task - joint video multi-frame interpolation and deblurring under unknown exposure time. Toward this goal, we first adopt a variant of supervised contrastive learning to construct an exposure-aware representation from input blurred frames. We then train two U-Nets for intra-motion and inter-motion analysis, respectively, adapting to the learned exposure representation via gain tuning. We finally build our video reconstruction network upon the exposure and motion representation by progressive exposure-adaptive convolution and motion refinement. Extensive experiments on both simulated and real-world datasets show that our optimized method achieves notable performance gains over the state-of-the-art on the joint video x8 interpolation and deblurring task. Moreover, on the seemingly implausible x16 interpolation task, our method outperforms existing methods by more than 1.5 dB in terms of PSNR. + + + + Dual-Bridging With Adversarial Noise Generation for Domain Adaptive rPPG Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Dual-Bridging_With_Adversarial_Noise_Generation_for_Domain_Adaptive_rPPG_Estimation_CVPR_2023_paper.pdf + The remote photoplethysmography (rPPG) technique can estimate pulse-related metrics (e.g. heart rate and respiratory rate) from facial videos and has a high potential for health monitoring. The latest deep rPPG methods can model in-distribution noise due to head motion, video compression, etc., and estimate high-quality rPPG signals under similar scenarios. However, deep rPPG models may not generalize well to the target test domain with unseen noise and distortions. In this paper, to improve the generalization ability of rPPG models, we propose a dual-bridging network to reduce the domain discrepancy by aligning intermediate domains and synthesizing the target noise in the source domain for better noise reduction. To comprehensively explore the target domain noise, we propose a novel adversarial noise generation in which the noise generator indirectly competes with the noise reducer. To further improve the robustness of the noise reducer, we propose hard noise pattern mining to encourage the generator to learn hard noise patterns contained in the target domain features. 
We evaluated the proposed method on three public datasets with different types of interference. Under different cross-domain scenarios, the comprehensive results show the effectiveness of our method. + + + + NeuDA: Neural Deformable Anchor for High-Fidelity Implicit Surface Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Cai_NeuDA_Neural_Deformable_Anchor_for_High-Fidelity_Implicit_Surface_Reconstruction_CVPR_2023_paper.pdf + This paper studies implicit surface reconstruction leveraging differentiable ray casting. Previous works such as IDR and NeuS overlook the spatial context in 3D space when predicting and rendering the surface, and thereby may fail to capture sharp local topologies such as small holes and structures. To mitigate the limitation, we propose a flexible neural implicit representation leveraging hierarchical voxel grids, namely Neural Deformable Anchor (NeuDA), for high-fidelity surface reconstruction. NeuDA maintains hierarchical anchor grids where each vertex stores a 3D position (or anchor) instead of a direct embedding (or feature). We optimize the anchor grids such that different local geometry structures can be adaptively encoded. Besides, we dig into frequency encoding strategies and introduce a simple hierarchical positional encoding method for the hierarchical anchor structure to flexibly exploit the properties of high-frequency and low-frequency geometry and appearance. Experiments on both the DTU and BlendedMVS datasets demonstrate that NeuDA can produce promising mesh surfaces. + + + + Boosting Weakly-Supervised Temporal Action Localization With Text Information + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Boosting_Weakly-Supervised_Temporal_Action_Localization_With_Text_Information_CVPR_2023_paper.pdf + Due to the lack of temporal annotation, current Weakly-supervised Temporal Action Localization (WTAL) methods are generally stuck in over-complete or incomplete localization. In this paper, we aim to leverage text information to boost WTAL from two aspects, i.e., (a) a discriminative objective to enlarge the inter-class difference, thus reducing over-complete localization; and (b) a generative objective to enhance the intra-class integrity, thus finding more complete temporal boundaries. For the discriminative objective, we propose a Text-Segment Mining (TSM) mechanism, which constructs a text description based on the action class label and regards the text as the query to mine all class-related segments. Without the temporal annotation of actions, TSM compares the text query with entire videos across the dataset to mine the best matching segments while ignoring irrelevant ones. Due to the shared sub-actions in different categories of videos, merely applying TSM is too strict and neglects semantically related segments, which results in incomplete localization. We further introduce a generative objective named Video-text Language Completion (VLC), which focuses on all semantically related segments from videos to complete the text sentence. We achieve state-of-the-art performance on THUMOS14 and ActivityNet1.3. Surprisingly, we also find that our proposed method can be seamlessly applied to existing methods, and improves their performance by a clear margin. The code is available at https://github.com/lgzlIlIlI/Boosting-WTAL.
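As a toy illustration of the text-segment mining step described in the weakly-supervised temporal action localization entry above, the sketch below scores temporal segment features against a class-label text embedding and keeps the best-matching segments. The embedding dimensions, cosine-similarity scoring, and fixed top-k selection are hypothetical simplifications rather than the paper's actual mechanism.

# Hypothetical sketch of the text-segment mining step described in the
# weakly-supervised temporal action localization entry above: a class-label
# text embedding is compared against per-segment video features and the
# best-matching segments are kept while irrelevant ones are ignored.
# The feature dimensions, the cosine-similarity scoring, and the top-k
# selection rule are assumptions for illustration, not the paper's method.
import torch
import torch.nn.functional as F


def mine_segments(text_embed, segment_feats, topk=5):
    """
    text_embed:    (D,) embedding of a text description built from a class label.
    segment_feats: (T, D) features of T temporal segments from a video.
    Returns the indices of the top-k segments and their similarity scores.
    """
    text = F.normalize(text_embed, dim=-1)
    segs = F.normalize(segment_feats, dim=-1)

    scores = segs @ text                      # (T,) cosine similarity per segment
    k = min(topk, scores.numel())
    values, indices = scores.topk(k)          # keep best-matching segments
    return indices, values


# Toy usage with random features.
if __name__ == "__main__":
    D, T = 512, 100
    text_query = torch.randn(D)               # e.g. an embedded class description
    video_segments = torch.randn(T, D)
    idx, sim = mine_segments(text_query, video_segments, topk=5)
    print(idx, sim)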
+ + + + OpenMix: Exploring Outlier Samples for Misclassification Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_OpenMix_Exploring_Outlier_Samples_for_Misclassification_Detection_CVPR_2023_paper.pdf + Reliable confidence estimation for deep neural classifiers is a challenging yet fundamental requirement in high-stakes applications. Unfortunately, modern deep neural networks are often overconfident for their erroneous predictions. In this work, we exploit the easily available outlier samples, i.e., unlabeled samples coming from non-target classes, for helping detect misclassification errors. Particularly, we find that the well-known Outlier Exposure, which is powerful in detecting out-of-distribution (OOD) samples from unknown classes, does not provide any gain in identifying misclassification errors. Based on these observations, we propose a novel method called OpenMix, which incorporates open-world knowledge by learning to reject uncertain pseudo-samples generated via outlier transformation. OpenMix significantly improves confidence reliability under various scenarios, establishing a strong and unified framework for detecting both misclassified samples from known classes and OOD samples from unknown classes. + + + + Multivariate, Multi-Frequency and Multimodal: Rethinking Graph Neural Networks for Emotion Recognition in Conversation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Multivariate_Multi-Frequency_and_Multimodal_Rethinking_Graph_Neural_Networks_for_Emotion_CVPR_2023_paper.pdf + Complex relationships of high arity across modality and context dimensions is a critical challenge in the Emotion Recognition in Conversation (ERC) task. Yet, previous works tend to encode multimodal and contextual relationships in a loosely-coupled manner, which may harm relationship modelling. Recently, Graph Neural Networks (GNN) which show advantages in capturing data relations, offer a new solution for ERC. However, existing GNN-based ERC models fail to address some general limits of GNNs, including assuming pairwise formulation and erasing high-frequency signals, which may be trivial for many applications but crucial for the ERC task. In this paper, we propose a GNN-based model that explores multivariate relationships and captures the varying importance of emotion discrepancy and commonality by valuing multi-frequency signals. We empower GNNs to better capture the inherent relationships among utterances and deliver more sufficient multimodal and contextual modelling. Experimental results show that our proposed method outperforms previous state-of-the-art works on two popular multimodal ERC datasets. + + + + Bridging Precision and Confidence: A Train-Time Loss for Calibrating Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Munir_Bridging_Precision_and_Confidence_A_Train-Time_Loss_for_Calibrating_Object_CVPR_2023_paper.pdf + Deep neural networks (DNNs) have enabled astounding progress in several vision-based problems. Despite showing high predictive accuracy, recently, several works have revealed that they tend to provide overconfident predictions and thus are poorly calibrated. The majority of the works addressing the miscalibration of DNNs fall under the scope of classification and consider only in-domain predictions. However, there is little to no progress in studying the calibration of DNN-based object detection models, which are central to many vision-based safety-critical applications. 
In this paper, inspired by the train-time calibration methods, we propose a novel auxiliary loss formulation that explicitly aims to align the class confidence of bounding boxes with the accurateness of predictions (i.e. precision). Since the original formulation of our loss depends on the counts of true positives and false positives in a minibatch, we develop a differentiable proxy of our loss that can be used during training with other application-specific loss functions. We perform extensive experiments on challenging in-domain and out-domain scenarios with six benchmark datasets including MS-COCO, Cityscapes, Sim10k, and BDD100k. Our results reveal that our train-time loss surpasses strong calibration baselines in reducing calibration error for both in and out-domain scenarios. Our source code and pre-trained models are available at https://github.com/akhtarvision/bpc_calibration + + + + DyLiN: Making Light Field Networks Dynamic + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_DyLiN_Making_Light_Field_Networks_Dynamic_CVPR_2023_paper.pdf + Light Field Networks, the re-formulations of radiance fields to oriented rays, are magnitudes faster than their coordinate network counterparts, and provide higher fidelity with respect to representing 3D structures from 2D observations. They would be well suited for generic scene representation and manipulation, but suffer from one problem: they are limited to holistic and static scenes. In this paper, we propose the Dynamic Light Field Network (DyLiN) method that can handle non-rigid deformations, including topological changes. We learn a deformation field from input rays to canonical rays, and lift them into a higher dimensional space to handle discontinuities. We further introduce CoDyLiN, which augments DyLiN with controllable attribute inputs. We train both models via knowledge distillation from pretrained dynamic radiance fields. We evaluated DyLiN using both synthetic and real world datasets that include various non-rigid deformations. DyLiN qualitatively outperformed and quantitatively matched state-of-the-art methods in terms of visual fidelity, while being 25 - 71x computationally faster. We also tested CoDyLiN on attribute annotated data and it surpassed its teacher model. Project page: https://dylin2023.github.io. + + + + Human Guided Ground-Truth Generation for Realistic Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Human_Guided_Ground-Truth_Generation_for_Realistic_Image_Super-Resolution_CVPR_2023_paper.pdf + How to generate the ground-truth (GT) image is a critical issue for training realistic image super-resolution (Real-ISR) models. Existing methods mostly take a set of high-resolution (HR) images as GTs and apply various degradations to simulate their low-resolution (LR) counterparts. Though great progress has been achieved, such an LR-HR pair generation scheme has several limitations. First, the perceptual quality of HR images may not be high enough, limiting the quality of Real-ISR outputs. Second, existing schemes do not consider much human perception in GT generation, and the trained models tend to produce over-smoothed results or unpleasant artifacts. With the above considerations, we propose a human guided GT generation scheme. We first elaborately train multiple image enhancement models to improve the perceptual quality of HR images, and enable one LR image having multiple HR counterparts. 
Human subjects are then involved to annotate the high quality regions among the enhanced HR images as GTs, and label the regions with unpleasant artifacts as negative samples. A human guided GT image dataset with both positive and negative samples is then constructed, and a loss function is proposed to train the Real-ISR models. Experiments show that the Real-ISR models trained on our dataset can produce perceptually more realistic results with less artifacts. Dataset and codes can be found at https://github.com/ChrisDud0257/HGGT. + + + + Align and Attend: Multimodal Summarization With Dual Contrastive Losses + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Align_and_Attend_Multimodal_Summarization_With_Dual_Contrastive_Losses_CVPR_2023_paper.pdf + The goal of multimodal summarization is to extract the most important information from different modalities to form summaries. Unlike unimodal summarization, the multimodal summarization task explicitly leverages cross-modal information to help generate more reliable and high-quality summaries. However, existing methods fail to leverage the temporal correspondence between different modalities and ignore the intrinsic correlation between different samples. To address this issue, we introduce Align and Attend Multimodal Summarization (A2Summ), a unified multimodal transformer-based model which can effectively align and attend the multimodal input. In addition, we propose two novel contrastive losses to model both inter-sample and intra-sample correlations. Extensive experiments on two standard video summarization datasets (TVSum and SumMe) and two multimodal summarization datasets (Daily Mail and CNN) demonstrate the superiority of A2Summ, achieving state-of-the-art performances on all datasets. Moreover, we collected a large-scale multimodal summarization dataset BLiSS, which contains livestream videos and transcribed texts with annotated summaries. Our code and dataset are publicly available at https://boheumd.github.io/A2Summ/. + + + + SinGRAF: Learning a 3D Generative Radiance Field for a Single Scene + http://openaccess.thecvf.com//content/CVPR2023/papers/Son_SinGRAF_Learning_a_3D_Generative_Radiance_Field_for_a_Single_CVPR_2023_paper.pdf + Generative models have shown great promise in synthesizing photorealistic 3D objects, but they require large amounts of training data. We introduce SinGRAF, a 3D-aware generative model that is trained with a few input images of a single scene. Once trained, SinGRAF generates different realizations of this 3D scene that preserve the appearance of the input while varying scene layout. For this purpose, we build on recent progress in 3D GAN architectures and introduce a novel progressive-scale patch discrimination approach during training. With several experiments, we demonstrate that the results produced by SinGRAF outperform the closest related works in both quality and diversity by a large margin. + + + + Self-Supervised AutoFlow + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Self-Supervised_AutoFlow_CVPR_2023_paper.pdf + Recently, AutoFlow has shown promising results on learning a training set for optical flow, but requires ground truth labels in the target domain to compute its search metric. Observing a strong correlation between the ground truth search metric and self-supervised losses, we introduce self-supervised AutoFlow to handle real-world videos without ground truth labels. 
Using self-supervised loss as the search metric, our self-supervised AutoFlow performs on par with AutoFlow on Sintel and KITTI where ground truth is available, and performs better on the real-world DAVIS dataset. We further explore using self-supervised AutoFlow in the (semi-)supervised setting and obtain competitive results against the state of the art. + + + + Neuralangelo: High-Fidelity Neural Surface Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Neuralangelo_High-Fidelity_Neural_Surface_Reconstruction_CVPR_2023_paper.pdf + Neural surface reconstruction has been shown to be powerful for recovering dense 3D surfaces via image-based neural rendering. However, current methods struggle to recover detailed structures of real-world scenes. To address the issue, we present Neuralangelo, which combines the representation power of multi-resolution 3D hash grids with neural surface rendering. Two key ingredients enable our approach: (1) numerical gradients for computing higher-order derivatives as a smoothing operation and (2) coarse-to-fine optimization on the hash grids controlling different levels of details. Even without auxiliary inputs such as depth, Neuralangelo can effectively recover dense 3D surface structures from multi-view images with fidelity significantly surpassing previous methods, enabling detailed large-scale scene reconstruction from RGB video captures. + + + + Re-GAN: Data-Efficient GANs Training via Architectural Reconfiguration + http://openaccess.thecvf.com//content/CVPR2023/papers/Saxena_Re-GAN_Data-Efficient_GANs_Training_via_Architectural_Reconfiguration_CVPR_2023_paper.pdf + Training Generative Adversarial Networks (GANs) on high-fidelity images usually requires a vast number of training images. Recent research on GAN tickets reveals that dense GANs models contain sparse sub-networks or "lottery tickets" that, when trained separately, yield better results under limited data. However, finding GANs tickets requires an expensive process of train-prune-retrain. In this paper, we propose Re-GAN, a data-efficient GANs training that dynamically reconfigures GANs architecture during training to explore different sub-network structures in training time. Our method repeatedly prunes unimportant connections to regularize GANs network and regrows them to reduce the risk of prematurely pruning important connections. Re-GAN stabilizes the GANs models with less data and offers an alternative to the existing GANs tickets and progressive growing methods. We demonstrate that Re-GAN is a generic training methodology which achieves stability on datasets of varying sizes, domains, and resolutions (CIFAR-10, Tiny-ImageNet, and multiple few-shot generation datasets) as well as different GANs architectures (SNGAN, ProGAN, StyleGAN2 and AutoGAN). Re-GAN also improves performance when combined with the recent augmentation approaches. Moreover, Re-GAN requires fewer floating-point operations (FLOPs) and less training time by removing the unimportant connections during GANs training while maintaining comparable or even generating higher-quality samples. When compared to state-of-the-art StyleGAN2, our method outperforms without requiring any additional fine-tuning step. 
Code can be found at this link: https://github.com/IntellicentAI-Lab/Re-GAN + + + + Dimensionality-Varying Diffusion Process + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Dimensionality-Varying_Diffusion_Process_CVPR_2023_paper.pdf + Diffusion models, which learn to reverse a signal destruction process to generate new data, typically require the signal at each step to have the same dimension. We argue that, considering the spatial redundancy in image signals, there is no need to maintain a high dimensionality in the evolution process, especially in the early generation phase. To this end, we make a theoretical generalization of the forward diffusion process via signal decomposition. Concretely, we manage to decompose an image into multiple orthogonal components and control the attenuation of each component when perturbing the image. That way, along with the noise strength increasing, we are able to diminish those inconsequential components and thus use a lower-dimensional signal to represent the source, barely losing information. Such a reformulation allows to vary dimensions in both training and inference of diffusion models. Extensive experiments on a range of datasets suggest that our approach substantially reduces the computational cost and achieves on-par or even better synthesis performance compared to baseline methods. We also show that our strategy facilitates high-resolution image synthesis and improves FID of diffusion model trained on FFHQ at 1024x1024 resolution from 52.40 to 10.46. Code is available at https://github.com/damo-vilab/dvdp. + + + + RenderDiffusion: Image Diffusion for 3D Reconstruction, Inpainting and Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Anciukevicius_RenderDiffusion_Image_Diffusion_for_3D_Reconstruction_Inpainting_and_Generation_CVPR_2023_paper.pdf + Diffusion models currently achieve state-of-the-art performance for both conditional and unconditional image generation. However, so far, image diffusion models do not support tasks required for 3D understanding, such as view-consistent 3D generation or single-view object reconstruction. In this paper, we present RenderDiffusion, the first diffusion model for 3D generation and inference, trained using only monocular 2D supervision. Central to our method is a novel image denoising architecture that generates and renders an intermediate three-dimensional representation of a scene in each denoising step. This enforces a strong inductive structure within the diffusion process, providing a 3D consistent representation while only requiring 2D supervision. The resulting 3D representation can be rendered from any view. We evaluate RenderDiffusion on FFHQ, AFHQ, ShapeNet and CLEVR datasets, showing competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images. Additionally, our diffusion-based approach allows us to use 2D inpainting to edit 3D scenes. + + + + Latent-NeRF for Shape-Guided Generation of 3D Shapes and Textures + http://openaccess.thecvf.com//content/CVPR2023/papers/Metzer_Latent-NeRF_for_Shape-Guided_Generation_of_3D_Shapes_and_Textures_CVPR_2023_paper.pdf + Text-guided image generation has progressed rapidly in recent years, inspiring major breakthroughs in text-guided shape generation. Recently, it has been shown that using score distillation, one can successfully text-guide a NeRF model to generate a 3D object. 
We adapt the score distillation to the publicly available, and computationally efficient, Latent Diffusion Models, which apply the entire diffusion process in a compact latent space of a pretrained autoencoder. As NeRFs operate in image space, a naive solution for guiding them with latent score distillation would require encoding to the latent space at each guidance step. Instead, we propose to bring the NeRF to the latent space, resulting in a Latent-NeRF. Analyzing our Latent-NeRF, we show that while Text-to-3D models can generate impressive results, they are inherently unconstrained and may lack the ability to guide or enforce a specific 3D structure. To assist and direct the 3D generation, we propose to guide our Latent-NeRF using a Sketch-Shape: an abstract geometry that defines the coarse structure of the desired object. Then, we present means to integrate such a constraint directly into a Latent-NeRF. This unique combination of text and shape guidance allows for increased control over the generation process. We also show that latent score distillation can be successfully applied directly on 3D meshes. This allows for generating high-quality textures on a given geometry. Our experiments validate the power of our different forms of guidance and the efficiency of using latent rendering. + + + + Learning Generative Structure Prior for Blind Text Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Learning_Generative_Structure_Prior_for_Blind_Text_Image_Super-Resolution_CVPR_2023_paper.pdf + Blind text image super-resolution (SR) is challenging as one needs to cope with diverse font styles and unknown degradation. To address the problem, existing methods perform character recognition in parallel to regularize the SR task, either through a loss constraint or intermediate feature condition. Nonetheless, the high-level prior could still fail when encountering severe degradation. The problem is further compounded given characters of complex structures, e.g., Chinese characters that combine multiple pictographic or ideographic symbols into a single character. In this work, we present a novel prior that focuses more on the character structure. In particular, we learn to encapsulate rich and diverse structures in a StyleGAN and exploit such generative structure priors for restoration. To restrict the generative space of StyleGAN so that it obeys the structure of characters yet remains flexible in handling different font styles, we store the discrete features for each character in a codebook . The code subsequently drives the StyleGAN to generate high-resolution structural details to aid text SR. Compared to priors based on character recognition, the proposed structure prior exerts stronger character-specific guidance to restore faithful and precise strokes of a designated character. Extensive experiments on synthetic and real datasets demonstrate the compelling performance of the proposed generative structure prior in facilitating robust text SR. Our code is available at https://github.com/csxmli2016/MARCONet. + + + + PEFAT: Boosting Semi-Supervised Medical Image Classification via Pseudo-Loss Estimation and Feature Adversarial Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Zeng_PEFAT_Boosting_Semi-Supervised_Medical_Image_Classification_via_Pseudo-Loss_Estimation_and_CVPR_2023_paper.pdf + Pseudo-labeling approaches have been proven beneficial for semi-supervised learning (SSL) schemes in computer vision and medical imaging. 
Most works are dedicated to finding samples with high-confidence pseudo-labels from the perspective of model-predicted probability. However, this approach may lead to the inclusion of incorrectly pseudo-labeled data if the threshold is not carefully adjusted. In addition, low-confidence samples are frequently disregarded and not employed to their full potential. In this paper, we propose a novel Pseudo-loss Estimation and Feature Adversarial Training semi-supervised framework, termed PEFAT, to boost the performance of multi-class and multi-label medical image classification from the perspective of loss distribution modeling and adversarial training. Specifically, we develop a trustworthy data selection scheme to split off a high-quality pseudo-labeled set, inspired by the dividable pseudo-loss assumption that clean data tend to show lower loss while noisy data show the opposite. Instead of directly discarding these samples with low-quality pseudo-labels, we present a novel regularization approach to learn discriminative information from them by injecting adversarial noise at the feature level to smooth the decision boundary. Experimental results on three medical and two natural image benchmarks validate that our PEFAT achieves promising performance and surpasses other state-of-the-art methods. The code is available at https://github.com/maxwell0027/PEFAT. + + + + Ground-Truth Free Meta-Learning for Deep Compressive Sampling + http://openaccess.thecvf.com//content/CVPR2023/papers/Qin_Ground-Truth_Free_Meta-Learning_for_Deep_Compressive_Sampling_CVPR_2023_paper.pdf + Deep learning has become an important tool for reconstructing images in compressive sampling (CS). This paper proposes a ground-truth (GT) free meta-learning method for CS, which leverages both external and internal learning for unsupervised high-quality image reconstruction. The proposed method first trains a deep model via external meta-learning using only CS measurements, and then efficiently adapts the trained model to a test sample for further improvement by exploiting its internal characteristics. The meta-learning and model adaptation are built on an improved Stein's unbiased risk estimator (iSURE) that provides efficient computation and effective guidance for accurate prediction in the range space of the adjoint of the measurement matrix. To further improve the learning on the null space of the measurement matrix, a modified model-agnostic meta-learning scheme is proposed, along with a null-space-consistent loss and a bias-adaptive deep unrolling network to improve and accelerate model adaptation at test time. Experimental results have demonstrated that the proposed GT-free method performs well, and can even compete with supervised learning-based methods. + + + + SHS-Net: Learning Signed Hyper Surfaces for Oriented Normal Estimation of Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_SHS-Net_Learning_Signed_Hyper_Surfaces_for_Oriented_Normal_Estimation_of_CVPR_2023_paper.pdf + We propose a novel method called SHS-Net for oriented normal estimation of point clouds by learning signed hyper surfaces, which can accurately predict normals with globally consistent orientation from various point clouds. Almost all existing methods estimate oriented normals through a two-stage pipeline, i.e., unoriented normal estimation and normal orientation, and each step is implemented by a separate algorithm.
However, previous methods are sensitive to parameter settings, resulting in poor results from point clouds with noise, density variations and complex geometries. In this work, we introduce signed hyper surfaces (SHS), which are parameterized by multi-layer perceptron (MLP) layers, to learn to estimate oriented normals from point clouds in an end-to-end manner. The signed hyper surfaces are implicitly learned in a high-dimensional feature space where the local and global information is aggregated. Specifically, we introduce a patch encoding module and a shape encoding module to encode a 3D point cloud into a local latent code and a global latent code, respectively. Then, an attention-weighted normal prediction module is proposed as a decoder, which takes the local and global latent codes as input to predict oriented normals. Experimental results show that our SHS-Net outperforms the state-of-the-art methods in both unoriented and oriented normal estimation on the widely used benchmarks. The code, data and pretrained models are available at https://github.com/LeoQLi/SHS-Net. + + + + DistractFlow: Improving Optical Flow Estimation via Realistic Distractions and Pseudo-Labeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Jeong_DistractFlow_Improving_Optical_Flow_Estimation_via_Realistic_Distractions_and_Pseudo-Labeling_CVPR_2023_paper.pdf + We propose a novel data augmentation approach, DistractFlow, for training optical flow estimation models by introducing realistic distractions to the input frames. Based on a mixing ratio, we combine one of the frames in the pair with a distractor image depicting a similar domain, which allows for inducing visual perturbations congruent with natural objects and scenes. We refer to such pairs as distracted pairs. Our intuition is that using semantically meaningful distractors enables the model to learn related variations and attain robustness against challenging deviations, compared to conventional augmentation schemes focusing only on low-level aspects and modifications. More specifically, in addition to the supervised loss computed between the estimated flow for the original pair and its ground-truth flow, we include a second supervised loss defined between the distracted pair's flow and the original pair's ground-truth flow, weighted with the same mixing ratio. Furthermore, when unlabeled data is available, we extend our augmentation approach to self-supervised settings through pseudo-labeling and cross-consistency regularization. Given an original pair and its distracted version, we enforce the estimated flow on the distracted pair to agree with the flow of the original pair. Our approach allows increasing the number of available training pairs significantly without requiring additional annotations. It is agnostic to the model architecture and can be applied to training any optical flow estimation models. Our extensive evaluations on multiple benchmarks, including Sintel, KITTI, and SlowFlow, show that DistractFlow improves existing models consistently, outperforming the latest state of the art. + + + + DSVT: Dynamic Sparse Voxel Transformer With Rotated Sets + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_DSVT_Dynamic_Sparse_Voxel_Transformer_With_Rotated_Sets_CVPR_2023_paper.pdf + Designing an efficient yet deployment-friendly 3D backbone to handle sparse point clouds is a fundamental problem in 3D perception. 
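A minimal sketch of the DistractFlow augmentation described above, assuming a generic flow_model(frame1, frame2) callable and an end-point-error loss; the Beta mixing distribution and the exact weighting convention are assumptions rather than the paper's released code.

# Sketch of a DistractFlow-style "distracted pair" and its weighted supervised loss.
import torch

def epe_loss(pred_flow, gt_flow):
    # End-point error: mean L2 distance between predicted and ground-truth flow (B, 2, H, W).
    return torch.norm(pred_flow - gt_flow, dim=1).mean()

def distractflow_supervised_loss(flow_model, frame1, frame2, distractor, gt_flow):
    # Mixing ratio for the distracted pair (assumed Beta-distributed).
    alpha = torch.distributions.Beta(1.0, 1.0).sample().item()

    # Build the distracted pair: blend the second frame with a distractor image
    # from a similar domain.
    frame2_distracted = alpha * frame2 + (1.0 - alpha) * distractor

    # Supervised loss on the original pair.
    loss_orig = epe_loss(flow_model(frame1, frame2), gt_flow)

    # Second supervised loss on the distracted pair, still measured against the
    # original pair's ground truth and weighted by the mixing ratio.
    loss_distracted = epe_loss(flow_model(frame1, frame2_distracted), gt_flow)

    return loss_orig + alpha * loss_distracted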
Compared with the customized sparse convolution, the attention mechanism in Transformers is more appropriate for flexibly modeling long-range relationships and is easier to be deployed in real-world applications. However, due to the sparse characteristics of point clouds, it is non-trivial to apply a standard transformer on sparse points. In this paper, we present Dynamic Sparse Voxel Transformer (DSVT), a single-stride window-based voxel Transformer backbone for outdoor 3D perception. In order to efficiently process sparse points in parallel, we propose Dynamic Sparse Window Attention, which partitions a series of local regions in each window according to its sparsity and then computes the features of all regions in a fully parallel manner. To allow the cross-set connection, we design a rotated set partitioning strategy that alternates between two partitioning configurations in consecutive self-attention layers. To support effective downsampling and better encode geometric information, we also propose an attention-style 3D pooling module on sparse points, which is powerful and deployment-friendly without utilizing any customized CUDA operations. Our model achieves state-of-the-art performance with a broad range of 3D perception tasks. More importantly, DSVT can be easily deployed by TensorRT with real-time inference speed (27Hz). Code will be available at https://github.com/Haiyang-W/DSVT. + + + + Enhancing the Self-Universality for Transferable Targeted Attacks + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Enhancing_the_Self-Universality_for_Transferable_Targeted_Attacks_CVPR_2023_paper.pdf + In this paper, we propose a novel transfer-based targeted attack method that optimizes the adversarial perturbations without any extra training efforts for auxiliary networks on training data. Our new attack method is proposed based on the observation that highly universal adversarial perturbations tend to be more transferable for targeted attacks. Therefore, we propose to make the perturbation to be agnostic to different local regions within one image, which we called as self-universality. Instead of optimizing the perturbations on different images, optimizing on different regions to achieve self-universality can get rid of using extra data. Specifically, we introduce a feature similarity loss that encourages the learned perturbations to be universal by maximizing the feature similarity between adversarial perturbed global images and randomly cropped local regions. With the feature similarity loss, our method makes the features from adversarial perturbations to be more dominant than that of benign images, hence improving targeted transferability. We name the proposed attack method as Self-Universality (SU) attack. Extensive experiments demonstrate that SU can achieve high success rates for transfer-based targeted attacks. On ImageNet-compatible dataset, SU yields an improvement of 12% compared with existing state-of-the-art methods. Code is available at https://github.com/zhipeng-wei/Self-Universality. + + + + EditableNeRF: Editing Topologically Varying Neural Radiance Fields by Key Points + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_EditableNeRF_Editing_Topologically_Varying_Neural_Radiance_Fields_by_Key_Points_CVPR_2023_paper.pdf + Neural radiance fields (NeRF) achieve highly photo-realistic novel-view synthesis, but it's a challenging problem to edit the scenes modeled by NeRF-based methods, especially for dynamic scenes. 
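To make the Self-Universality (SU) objective above concrete, a simplified sketch of its loss follows: a targeted classification term on the perturbed global image plus a feature-similarity term between that image and a random local crop. The model, feature_extractor, crop size, and weighting are illustrative assumptions; in practice the perturbation delta would be optimized under an L-infinity budget using this loss.

# Simplified sketch of an SU-style loss (illustrative only).
import torch
import torch.nn.functional as F

def su_loss(model, feature_extractor, x, delta, target, crop_size=112, lam=1.0):
    x_adv = (x + delta).clamp(0, 1)

    # Random local crop of the perturbed image, resized back to input resolution
    # (assumes crop_size is smaller than the input height/width).
    b, c, h, w = x_adv.shape
    top = torch.randint(0, h - crop_size + 1, (1,)).item()
    left = torch.randint(0, w - crop_size + 1, (1,)).item()
    local = x_adv[:, :, top:top + crop_size, left:left + crop_size]
    local = F.interpolate(local, size=(h, w), mode="bilinear", align_corners=False)

    # Targeted classification loss on the global adversarial image.
    cls_loss = F.cross_entropy(model(x_adv), target)

    # Feature-similarity loss: encourage global and local adversarial features to agree.
    f_global = feature_extractor(x_adv).flatten(1)
    f_local = feature_extractor(local).flatten(1)
    sim_loss = 1.0 - F.cosine_similarity(f_global, f_local, dim=1).mean()

    return cls_loss + lam * sim_loss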
We propose editable neural radiance fields that enable end-users to easily edit dynamic scenes and even support topological changes. Input with an image sequence from a single camera, our network is trained fully automatically and models topologically varying dynamics using our picked-out surface key points. Then end-users can edit the scene by easily dragging the key points to desired new positions. To achieve this, we propose a scene analysis method to detect and initialize key points by considering the dynamics in the scene, and a weighted key points strategy to model topologically varying dynamics by joint key points and weights optimization. Our method supports intuitive multi-dimensional (up to 3D) editing and can generate novel scenes that are unseen in the input sequence. Experiments demonstrate that our method achieves high-quality editing on various dynamic scenes and outperforms the state-of-the-art. Our code and captured data are available at https://chengwei-zheng.github.io/EditableNeRF/. + + + + NeuralEditor: Editing Neural Radiance Fields via Manipulating Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_NeuralEditor_Editing_Neural_Radiance_Fields_via_Manipulating_Point_Clouds_CVPR_2023_paper.pdf + This paper proposes NeuralEditor that enables neural radiance fields (NeRFs) natively editable for general shape editing tasks. Despite their impressive results on novel-view synthesis, it remains a fundamental challenge for NeRFs to edit the shape of the scene. Our key insight is to exploit the explicit point cloud representation as the underlying structure to construct NeRFs, inspired by the intuitive interpretation of NeRF rendering as a process that projects or "plots" the associated 3D point cloud to a 2D image plane. To this end, NeuralEditor introduces a novel rendering scheme based on deterministic integration within K-D tree-guided density-adaptive voxels, which produces both high-quality rendering results and precise point clouds through optimization. NeuralEditor then performs shape editing via mapping associated points between point clouds. Extensive evaluation shows that NeuralEditor achieves state-of-the-art performance in both shape deformation and scene morphing tasks. Notably, NeuralEditor supports both zero-shot inference and further fine-tuning over the edited scene. Our code, benchmark, and demo video are available at https://immortalco.github.io/NeuralEditor. + + + + NIKI: Neural Inverse Kinematics With Invertible Neural Networks for 3D Human Pose and Shape Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_NIKI_Neural_Inverse_Kinematics_With_Invertible_Neural_Networks_for_3D_CVPR_2023_paper.pdf + With the progress of 3D human pose and shape estimation, state-of-the-art methods can either be robust to occlusions or obtain pixel-aligned accuracy in non-occlusion cases. However, they cannot obtain robustness and mesh-image alignment at the same time. In this work, we present NIKI (Neural Inverse Kinematics with Invertible Neural Network), which models bi-directional errors to improve the robustness to occlusions and obtain pixel-aligned accuracy. NIKI can learn from both the forward and inverse processes with invertible networks. In the inverse process, the model separates the error from the plausible 3D pose manifold for a robust 3D human pose estimation. In the forward process, we enforce the zero-error boundary conditions to improve the sensitivity to reliable joint positions for better mesh-image alignment. 
Furthermore, NIKI emulates the analytical inverse kinematics algorithms with the twist-and-swing decomposition for better interpretability. Experiments on standard and occlusion-specific benchmarks demonstrate the effectiveness of NIKI, where we exhibit robust and well-aligned results simultaneously. Code is available at https://github.com/Jeff-sjtu/NIKI + + + + Transfer4D: A Framework for Frugal Motion Capture and Deformation Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Maheshwari_Transfer4D_A_Framework_for_Frugal_Motion_Capture_and_Deformation_Transfer_CVPR_2023_paper.pdf + Animating a virtual character based on a real performance of an actor is a challenging task that currently requires expensive motion capture setups and additional effort by expert animators, rendering it accessible only to large production houses. The goal of our work is to democratize this task by developing a frugal alternative termed "Transfer4D" that uses only commodity depth sensors and further reduces animators' effort by automating the rigging and animation transfer process. To handle sparse, incomplete videos from depth video inputs and large variations between source and target objects, we propose to use skeletons as an intermediary representation between motion capture and transfer. We propose a novel skeleton extraction pipeline from single-view depth sequence that incorporates additional geometric information, resulting in superior performance in motion reconstruction and transfer in comparison to the contemporary methods. We use non-rigid reconstruction to track motion from the depth sequence, and then we rig the source object using skinning decomposition. Finally, the rig is embedded into the target object for motion retargeting. + + + + Randomized Adversarial Training via Taylor Expansion + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_Randomized_Adversarial_Training_via_Taylor_Expansion_CVPR_2023_paper.pdf + In recent years, there has been an explosion of research into developing more robust deep neural networks against adversarial examples. Adversarial training appears as one of the most successful methods. To deal with both the robustness against adversarial examples and the accuracy over clean examples, many works develop enhanced adversarial training methods to achieve various trade-offs between them. Leveraging over the studies that smoothed update on weights during training may help find flat minima and improve generalization, we suggest reconciling the robustness-accuracy trade-off from another perspective, i.e., by adding random noise into deterministic weights. The randomized weights enable our design of a novel adversarial training method via Taylor expansion of a small Gaussian noise, and we show that the new adversarial training method can flatten loss landscape and find flat minima. With PGD, CW, and Auto Attacks, an extensive set of experiments demonstrate that our method enhances the state-of-the-art adversarial training methods, boosting both robustness and clean accuracy. The code is available at https://github.com/Alexkael/Randomized-Adversarial-Training. 
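The core ingredient of the randomized adversarial training described above, injecting Gaussian noise into otherwise deterministic weights, can be sketched as follows. This is a simplified stand-in that omits the paper's Taylor-expansion objective; the noise scale sigma, PGD settings, and helper names are assumptions.

# Simplified sketch: PGD adversarial training with Gaussian noise temporarily
# injected into the weights when the training loss is evaluated.
import torch
import torch.nn.functional as F

def pgd_attack(model, x, y, eps=8/255, alpha=2/255, steps=10):
    # Standard PGD: iteratively follow the sign of the input gradient.
    delta = torch.zeros_like(x).uniform_(-eps, eps).requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(model((x + delta).clamp(0, 1)), y)
        grad = torch.autograd.grad(loss, delta)[0]
        delta = (delta + alpha * grad.sign()).clamp(-eps, eps).detach().requires_grad_(True)
    return (x + delta).clamp(0, 1).detach()

def train_step(model, optimizer, x, y, sigma=0.01):
    x_adv = pgd_attack(model, x, y)
    # Perturb every weight with Gaussian noise for this loss evaluation only.
    noises = []
    with torch.no_grad():
        for p in model.parameters():
            n = sigma * torch.randn_like(p)
            p.add_(n)
            noises.append(n)
    loss = F.cross_entropy(model(x_adv), y)
    optimizer.zero_grad()
    loss.backward()
    # Remove the noise so the optimizer updates the underlying clean weights.
    with torch.no_grad():
        for p, n in zip(model.parameters(), noises):
            p.sub_(n)
    optimizer.step()
    return loss.item()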
+ + + + Learning To Measure the Point Cloud Reconstruction Loss in a Representation Space + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Learning_To_Measure_the_Point_Cloud_Reconstruction_Loss_in_a_CVPR_2023_paper.pdf + For point cloud reconstruction-related tasks, the reconstruction losses to evaluate the shape differences between reconstructed results and the ground truths are typically used to train the task networks. Most existing works measure the training loss with point-to-point distance, which may introduce extra defects as predefined matching rules may deviate from the real shape differences. Although some learning-based works have been proposed to overcome the weaknesses of manually-defined rules, they still measure the shape differences in 3D Euclidean space, which may limit their ability to capture defects in reconstructed shapes. In this work, we propose a learning-based Contrastive Adversarial Loss (CALoss) to measure the point cloud reconstruction loss dynamically in a non-linear representation space by combining the contrastive constraint with the adversarial strategy. Specifically, we use the contrastive constraint to help CALoss learn a representation space with shape similarity, while we introduce the adversarial strategy to help CALoss mine differences between reconstructed results and ground truths. According to experiments on reconstruction-related tasks, CALoss can help task networks improve reconstruction performances and learn more representative representations. + + + + Progressive Neighbor Consistency Mining for Correspondence Pruning + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Progressive_Neighbor_Consistency_Mining_for_Correspondence_Pruning_CVPR_2023_paper.pdf + The goal of correspondence pruning is to recognize correct correspondences (inliers) from initial ones, with applications to various feature matching based tasks. Seeking neighbors in the coordinate and feature spaces is a common strategy in many previous methods. However, it is difficult to ensure that these neighbors are always consistent, since the distribution of false correspondences is extremely irregular. For addressing this problem, we propose a novel global-graph space to search for consistent neighbors based on a weighted global graph that can explicitly explore long-range dependencies among correspondences. On top of that, we progressively construct three neighbor embeddings according to different neighbor search spaces, and design a Neighbor Consistency block to extract neighbor context and explore their interactions sequentially. In the end, we develop a Neighbor Consistency Mining Network (NCMNet) for accurately recovering camera poses and identifying inliers. Experimental results indicate that our NCMNet achieves a significant performance advantage over state-of-the-art competitors on challenging outdoor and indoor matching scenes. The source code can be found at https://github.com/xinliu29/NCMNet. + + + + Bootstrapping Objectness From Videos by Relaxed Common Fate and Visual Grouping + http://openaccess.thecvf.com//content/CVPR2023/papers/Lian_Bootstrapping_Objectness_From_Videos_by_Relaxed_Common_Fate_and_Visual_CVPR_2023_paper.pdf + We study learning object segmentation from unlabeled videos. Humans can easily segment moving objects without knowing what they are. The Gestalt law of common fate, i.e., what move at the same speed belong together, has inspired unsupervised object discovery based on motion segmentation. 
However, common fate is not a reliable indicator of objectness: Parts of an articulated / deformable object may not move at the same speed, whereas shadows / reflections of an object always move with it but are not part of it. Our insight is to bootstrap objectness by first learning image features from relaxed common fate and then refining them based on visual appearance grouping within the image itself and across images statistically. Specifically, we learn an image segmenter first in the loop of approximating optical flow with constant segment flow plus small within-segment residual flow, and then by refining it for more coherent appearance and statistical figure-ground relevance. On unsupervised video object segmentation, using only ResNet and convolutional heads, our model surpasses the state-of-the-art by absolute gains of 7/9/5% on DAVIS16 / STv2 / FBMS59 respectively, demonstrating the effectiveness of our ideas. Our code is publicly available. + + + + Semi-Supervised Hand Appearance Recovery via Structure Disentanglement and Dual Adversarial Discrimination + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Semi-Supervised_Hand_Appearance_Recovery_via_Structure_Disentanglement_and_Dual_Adversarial_CVPR_2023_paper.pdf + Enormous hand images with reliable annotations are collected through marker-based MoCap. Unfortunately, degradations caused by markers limit their application in hand appearance reconstruction. A clear appearance recovery insight is an image-to-image translation trained with unpaired data. However, most frameworks fail because there exists structure inconsistency from a degraded hand to a bare one. The core of our approach is to first disentangle the bare hand structure from those degraded images and then wrap the appearance to this structure with a dual adversarial discrimination (DAD) scheme. Both modules take full advantage of the semi-supervised learning paradigm: The structure disentanglement benefits from the modeling ability of ViT, and the translator is enhanced by the dual discrimination on both translation processes and translation results. Comprehensive evaluations have been conducted to prove that our framework can robustly recover photo-realistic hand appearance from diverse marker-contained and even object-occluded datasets. It provides a novel avenue to acquire bare hand appearance data for other downstream learning problems. + + + + Back to the Source: Diffusion-Driven Adaptation To Test-Time Corruption + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_Back_to_the_Source_Diffusion-Driven_Adaptation_To_Test-Time_Corruption_CVPR_2023_paper.pdf + Test-time adaptation harnesses test inputs to improve the accuracy of a model trained on source data when tested on shifted target data. Most methods update the source model by (re-)training on each target domain. While re-training can help, it is sensitive to the amount and order of the data and the hyperparameters for optimization. We update the target data instead, and project all test inputs toward the source domain with a generative diffusion model. Our diffusion-driven adaptation (DDA) method shares its models for classification and generation across all domains, training both on source then freezing them for all targets, to avoid expensive domain-wise re-training. We augment diffusion with image guidance and classifier self-ensembling to automatically decide how much to adapt. 
Input adaptation by DDA is more robust than model adaptation across a variety of corruptions, models, and data regimes on the ImageNet-C benchmark. With its input-wise updates, DDA succeeds where model adaptation degrades on too little data (small batches), on dependent data (correlated orders), or on mixed data (multiple corruptions). + + + + LayoutDM: Discrete Diffusion Model for Controllable Layout Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Inoue_LayoutDM_Discrete_Diffusion_Model_for_Controllable_Layout_Generation_CVPR_2023_paper.pdf + Controllable layout generation aims at synthesizing plausible arrangement of element bounding boxes with optional constraints, such as type or position of a specific element. In this work, we try to solve a broad range of layout generation tasks in a single model that is based on discrete state-space diffusion models. Our model, named LayoutDM, naturally handles the structured layout data in the discrete representation and learns to progressively infer a noiseless layout from the initial input, where we model the layout corruption process by modality-wise discrete diffusion. For conditional generation, we propose to inject layout constraints in the form of masking or logit adjustment during inference. We show in the experiments that our LayoutDM successfully generates high-quality layouts and outperforms both task-specific and task-agnostic baselines on several layout tasks. + + + + ShapeTalk: A Language Dataset and Framework for 3D Shape Edits and Deformations + http://openaccess.thecvf.com//content/CVPR2023/papers/Achlioptas_ShapeTalk_A_Language_Dataset_and_Framework_for_3D_Shape_Edits_CVPR_2023_paper.pdf + Editing 3D geometry is a challenging task requiring specialized skills. In this work, we aim to facilitate the task of editing the geometry of 3D models through the use of natural language. For example, we may want to modify a 3D chair model to "make its legs thinner" or to "open a hole in its back". To tackle this problem in a manner that promotes open-ended language use and enables fine-grained shape edits, we introduce the most extensive existing corpus of natural language utterances describing shape differences: ShapeTalk. ShapeTalk contains over half a million discriminative utterances produced by contrasting the shapes of common 3D objects for a variety of object classes and degrees of similarity. We also introduce a generic framework, ChangeIt3D, which builds on ShapeTalk and can use an arbitrary 3D generative model of shapes to produce edits that align the output better with the edit or deformation description. Finally, we introduce metrics for the quantitative evaluation of language-assisted shape editing methods that reflect key desiderata within this editing setup. We note that ShapeTalk allows methods to be trained with explicit 3D-to-language data, bypassing the necessity of "lifting" 2D to 3D using methods like neural rendering, as required by extant 2D image-language foundation models. Our code and data are publicly available at https://changeit3d.github.io/. + + + + RGBD2: Generative Scene Synthesis via Incremental View Inpainting Using RGBD Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Lei_RGBD2_Generative_Scene_Synthesis_via_Incremental_View_Inpainting_Using_RGBD_CVPR_2023_paper.pdf + We address the challenge of recovering an underlying scene geometry and colors from a sparse set of RGBD view observations. 
In this work, we present a new solution termed RGBD2 that sequentially generates novel RGBD views along a camera trajectory, and the scene geometry is simply the fusion result of these views. More specifically, we maintain an intermediate surface mesh used for rendering new RGBD views, which subsequently becomes complete by an inpainting network; each rendered RGBD view is later back-projected as a partial surface and is supplemented into the intermediate mesh. The use of intermediate mesh and camera projection helps solve the tough problem of multi-view inconsistency. We practically implement the RGBD inpainting network as a versatile RGBD diffusion model, which is previously used for 2D generative modeling; we make a modification to its reverse diffusion process to enable our use. We evaluate our approach on the task of 3D scene synthesis from sparse RGBD inputs; extensive experiments on the ScanNet dataset demonstrate the superiority of our approach over existing ones. Project page: https://jblei.site/proj/rgbd-diffusion. + + + + System-Status-Aware Adaptive Network for Online Streaming Video Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Foo_System-Status-Aware_Adaptive_Network_for_Online_Streaming_Video_Understanding_CVPR_2023_paper.pdf + Recent years have witnessed great progress in deep neural networks for real-time applications. However, most existing works do not explicitly consider the general case where the device's state and the available resources fluctuate over time, and none of them investigate or address the impact of varying computational resources for online video understanding tasks. This paper proposes a System-status-aware Adaptive Network (SAN) that considers the device's real-time state to provide high-quality predictions with low delay. Usage of our agent's policy improves efficiency and robustness to fluctuations of the system status. On two widely used video understanding tasks, SAN obtains state-of-the-art performance while constantly keeping processing delays low. Moreover, training such an agent on various types of hardware configurations is not easy as the labeled training data might not be available, or can be computationally prohibitive. To address this challenging problem, we propose a Meta Self-supervised Adaptation (MSA) method that adapts the agent's policy to new hardware configurations at test-time, allowing for easy deployment of the model onto other unseen hardware platforms. + + + + Local-Guided Global: Paired Similarity Representation for Visual Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Choi_Local-Guided_Global_Paired_Similarity_Representation_for_Visual_Reinforcement_Learning_CVPR_2023_paper.pdf + Recent vision-based reinforcement learning (RL) methods have found extracting high-level features from raw pixels with self-supervised learning to be effective in learning policies. However, these methods focus on learning global representations of images, and disregard local spatial structures present in the consecutively stacked frames. In this paper, we propose a novel approach, termed self-supervised Paired Similarity Representation Learning (PSRL) for effectively encoding spatial structures in an unsupervised manner. Given the input frames, the latent volumes are first generated individually using an encoder, and they are used to capture the variance in terms of local spatial structures, i.e., correspondence maps among multiple frames. 
This enables for providing plenty of fine-grained samples for training the encoder of deep RL. We further attempt to learn the global semantic representations in the global prediction module that predicts future state representations using action vector as a medium. The proposed method imposes similarity constraints on the three latent volumes; transformed query representations by estimated pixel-wise correspondence, predicted query representations from the global prediction model, and target representations of future state, guiding global prediction with locality-inherent volume. Experimental results on complex tasks in Atari Games and DeepMind Control Suite demonstrate that the RL methods are significantly boosted by the proposed self-supervised learning of structured representations. + + + + FFCV: Accelerating Training by Removing Data Bottlenecks + http://openaccess.thecvf.com//content/CVPR2023/papers/Leclerc_FFCV_Accelerating_Training_by_Removing_Data_Bottlenecks_CVPR_2023_paper.pdf + We present FFCV, a library for easy, fast, resource-efficient training of machine learning models. FFCV speeds up model training by eliminating (often subtle) data bottlenecks from the training process. In particular, we combine techniques such as an efficient file storage format, caching, data pre-loading, asynchronous data transfer, and just-in-time compilation to (a) make data loading and transfer significantly more efficient, ensuring that GPUs can reach full utilization; and (b) offload as much data processing as possible to the CPU asynchronously, freeing GPU up capacity for training. Using FFCV, we train ResNet-18 and ResNet-50 on the ImageNet dataset with a state-of-the-art tradeoff between accuracy and training time. For example, across the range of ResNet-50 models we test, we obtain the same accuracy as the best baselines in half the time. We demonstrate FFCV's performance, ease-of-use, extensibility, and ability to adapt to resource constraints through several case studies. + + + + Region-Aware Pretraining for Open-Vocabulary Object Detection With Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Region-Aware_Pretraining_for_Open-Vocabulary_Object_Detection_With_Vision_Transformers_CVPR_2023_paper.pdf + We present Region-aware Open-vocabulary Vision Transformers (RO-ViT) -- a contrastive image-text pretraining recipe to bridge the gap between image-level pretraining and open-vocabulary object detection. At the pretraining phase, we propose to randomly crop and resize regions of positional embeddings instead of using the whole image positional embeddings. This better matches the use of positional embeddings at region-level in the detection finetuning phase. In addition, we replace the common softmax cross entropy loss in contrastive learning with focal loss to better learn the informative yet difficult examples. Finally, we leverage recent advances in novel object proposals to improve open-vocabulary detection finetuning. We evaluate our full model on the LVIS and COCO open-vocabulary detection benchmarks and zero-shot transfer. RO-ViT achieves a state-of-the-art 32.1 APr on LVIS, surpassing the best existing approach by +5.8 points in addition to competitive zero-shot transfer detection. Surprisingly, RO-ViT improves the image-level representation as well and achieves the state of the art on 9 out of 12 metrics on COCO and Flickr image-text retrieval benchmarks, outperforming competitive approaches with larger models. 
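The region-aware pretraining trick in the RO-ViT abstract above, randomly cropping and resizing a region of the whole-image positional embeddings, can be sketched as below. This is illustrative rather than the released implementation and assumes a square patch grid with no class-token embedding.

# Sketch of cropped-and-resized positional embeddings for image-text pretraining.
import torch
import torch.nn.functional as F

def cropped_positional_embedding(pos_embed, out_size, crop_scale=(0.3, 1.0)):
    """pos_embed: (1, H*W, C) whole-image positional embeddings on an HxW grid.
    Randomly crop a region of the grid and resize it back to out_size x out_size,
    mimicking how region crops are positionally embedded at detection time."""
    _, n, c = pos_embed.shape
    h = w = int(n ** 0.5)
    grid = pos_embed.reshape(1, h, w, c).permute(0, 3, 1, 2)  # (1, C, H, W)

    # Sample a random square crop of the positional grid.
    scale = torch.empty(1).uniform_(*crop_scale).item()
    ch, cw = max(1, int(h * scale)), max(1, int(w * scale))
    top = torch.randint(0, h - ch + 1, (1,)).item()
    left = torch.randint(0, w - cw + 1, (1,)).item()
    crop = grid[:, :, top:top + ch, left:left + cw]

    # Resize the cropped embeddings back to the full token grid.
    crop = F.interpolate(crop, size=(out_size, out_size),
                         mode="bilinear", align_corners=False)
    return crop.permute(0, 2, 3, 1).reshape(1, out_size * out_size, c)

# Usage (hypothetical): pe = cropped_positional_embedding(vit_pos_embed, out_size=14)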
+ + + + Towards Unsupervised Object Detection From LiDAR Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Towards_Unsupervised_Object_Detection_From_LiDAR_Point_Clouds_CVPR_2023_paper.pdf + In this paper, we study the problem of unsupervised object detection from 3D point clouds in self-driving scenes. We present a simple yet effective method that exploits (i) point clustering in near-range areas where the point clouds are dense, (ii) temporal consistency to filter out noisy unsupervised detections, (iii) translation equivariance of CNNs to extend the auto-labels to long range, and (iv) self-supervision for improving on its own. Our approach, OYSTER (Object Discovery via Spatio-Temporal Refinement), does not impose constraints on data collection (such as repeated traversals of the same location), is able to detect objects in a zero-shot manner without supervised finetuning (even in sparse, distant regions), and continues to self-improve given more rounds of iterative self-training. To better measure model performance in self-driving scenarios, we propose a new planning-centric perception metric based on distance-to-collision. We demonstrate that our unsupervised object detector significantly outperforms unsupervised baselines on PandaSet and Argoverse 2 Sensor dataset, showing promise that self-supervision combined with object priors can enable object discovery in the wild. For more information, visit the project website: https://waabi.ai/research/oyster. + + + + NeRF-DS: Neural Radiance Fields for Dynamic Specular Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_NeRF-DS_Neural_Radiance_Fields_for_Dynamic_Specular_Objects_CVPR_2023_paper.pdf + Dynamic Neural Radiance Field (NeRF) is a powerful algorithm capable of rendering photo-realistic novel view images from a monocular RGB video of a dynamic scene. Although it warps moving points across frames from the observation spaces to a common canonical space for rendering, dynamic NeRF does not model the change of the reflected color during the warping. As a result, this approach often fails drastically on challenging specular objects in motion. We address this limitation by reformulating the neural radiance field function to be conditioned on surface position and orientation in the observation space. This allows the specular surface at different poses to keep the different reflected colors when mapped to the common canonical space. Additionally, we add the mask of moving objects to guide the deformation field. As the specular surface changes color during motion, the mask mitigates the problem of failure to find temporal correspondences with only RGB supervision. We evaluate our model based on the novel view synthesis quality with a self-collected dataset of different moving specular objects in realistic environments. The experimental results demonstrate that our method significantly improves the reconstruction quality of moving specular objects from monocular RGB videos compared to the existing NeRF models. Our code and data are available at the project website https://github.com/JokerYan/NeRF-DS. 
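A highly simplified sketch of the reformulation motivated by the NeRF-DS abstract above: the color branch of the radiance field is additionally conditioned on the observation-space surface position and orientation. Positional encoding and the deformation and mask components are omitted, and the architecture below is an assumption rather than the authors' network.

# Toy radiance field with an observation-space-conditioned color branch.
import torch
import torch.nn as nn

class ObservationConditionedNeRF(nn.Module):
    def __init__(self, hidden=256):
        super().__init__()
        # Density branch: canonical-space position only.
        self.density_mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.density_head = nn.Linear(hidden, 1)
        # Color branch: geometry feature + view direction
        # + observation-space position and surface normal.
        self.color_mlp = nn.Sequential(
            nn.Linear(hidden + 3 + 3 + 3, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),
        )

    def forward(self, x_canonical, view_dir, x_observation, normal_observation):
        feat = self.density_mlp(x_canonical)
        sigma = torch.relu(self.density_head(feat))
        rgb = self.color_mlp(torch.cat(
            [feat, view_dir, x_observation, normal_observation], dim=-1))
        return sigma, rgb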
+ + + + M6Doc: A Large-Scale Multi-Format, Multi-Type, Multi-Layout, Multi-Language, Multi-Annotation Category Dataset for Modern Document Layout Analysis + http://openaccess.thecvf.com//content/CVPR2023/papers/Cheng_M6Doc_A_Large-Scale_Multi-Format_Multi-Type_Multi-Layout_Multi-Language_Multi-Annotation_Category_Dataset_CVPR_2023_paper.pdf + Document layout analysis is a crucial prerequisite for document understanding, including document retrieval and conversion. Most public datasets currently contain only PDF documents and lack realistic documents. Models trained on these datasets may not generalize well to real-world scenarios. Therefore, this paper introduces a large and diverse document layout analysis dataset called M^6-Doc. The M^6 designation represents six properties: (1) Multi-Format (including scanned, photographed, and PDF documents); (2) Multi-Type (such as scientific articles, textbooks, books, test papers, magazines, newspapers, and notes); (3) Multi-Layout (rectangular, Manhattan, non-Manhattan, and multi-column Manhattan); (4) Multi-Language (Chinese and English); (5) Multi-Annotation Category (74 types of annotation labels with 237,116 annotation instances in 9,080 manually annotated pages); and (6) Modern documents. Additionally, we propose a transformer-based document layout analysis method called TransDLANet, which leverages an adaptive element matching mechanism that enables query embedding to better match ground truth to improve recall, and constructs a segmentation branch for more precise document image instance segmentation. We conduct a comprehensive evaluation of M^6-Doc with various layout analysis methods and demonstrate its effectiveness. TransDLANet achieves state-of-the-art performance on M^6-Doc with 64.5% mAP. The M^6-Doc dataset will be available at https://github.com/HCIILAB/M6Doc. + + + + RealFusion: 360deg Reconstruction of Any Object From a Single Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Melas-Kyriazi_RealFusion_360deg_Reconstruction_of_Any_Object_From_a_Single_Image_CVPR_2023_paper.pdf + We consider the problem of reconstructing a full 360deg photographic model of an object from a single image of it. We do so by fitting a neural radiance field to the image, but find this problem to be severely ill-posed. We thus take an off-the-self conditional image generator based on diffusion and engineer a prompt that encourages it to "dream up" novel views of the object. Using the recent DreamFusion method, we fuse the given input view, the conditional prior, and other regularizers in a final, consistent reconstruction. We demonstrate state-of-the-art reconstruction results on benchmark images when compared to prior methods for monocular 3D reconstruction of objects. Qualitatively, our reconstructions provide a faithful match of the input view and a plausible extrapolation of its appearance and 3D shape, including to the side of the object not visible in the image. + + + + LargeKernel3D: Scaling Up Kernels in 3D Sparse CNNs + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_LargeKernel3D_Scaling_Up_Kernels_in_3D_Sparse_CNNs_CVPR_2023_paper.pdf + Recent advance in 2D CNNs has revealed that large kernels are important. However, when directly applying large convolutional kernels in 3D CNNs, severe difficulties are met, where those successful module designs in 2D become surprisingly ineffective on 3D networks, including the popular depth-wise convolution. 
To address this vital challenge, we instead propose the spatial-wise partition convolution and its large-kernel module. As a result, it avoids the optimization and efficiency issues of naive 3D large kernels. Our large-kernel 3D CNN network, LargeKernel3D, yields notable improvement in 3D tasks of semantic segmentation and object detection. It achieves 73.9% mIoU on the ScanNetv2 semantic segmentation and 72.8% NDS nuScenes object detection benchmarks, ranking 1st on the nuScenes LIDAR leaderboard. The performance further boosts to 74.2% NDS with a simple multi-modal fusion. In addition, LargeKernel3D can be scaled to 17x17x17 kernel size on Waymo 3D object detection. For the first time, we show that large kernels are feasible and essential for 3D visual tasks. + + + + 3D Concept Learning and Reasoning From Multi-View Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Hong_3D_Concept_Learning_and_Reasoning_From_Multi-View_Images_CVPR_2023_paper.pdf + Humans are able to accurately reason in 3D by gathering multi-view observations of the surrounding world. Inspired by this insight, we introduce a new large-scale benchmark for 3D multi-view visual question answering (3DMV-VQA). This dataset is collected by an embodied agent actively moving and capturing RGB images in an environment using the Habitat simulator. In total, it consists of approximately 5k scenes, 600k images, paired with 50k questions. We evaluate various state-of-the-art models for visual reasoning on our benchmark and find that they all perform poorly. We suggest that a principled approach for 3D reasoning from multi-view images should be to infer a compact 3D representation of the world from the multi-view images, which is further grounded on open-vocabulary semantic concepts, and then to execute reasoning on these 3D representations. As the first step towards this approach, we propose a novel 3D concept learning and reasoning (3D-CLR) framework that seamlessly combines these components via neural fields, 2D pre-trained vision-language models, and neural reasoning operators. Experimental results suggest that our framework outperforms baseline models by a large margin, but the challenge remains largely unsolved. We further perform an in-depth analysis of the challenges and highlight potential future directions. + + + + Soft Augmentation for Image Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Soft_Augmentation_for_Image_Classification_CVPR_2023_paper.pdf + Modern neural networks are over-parameterized and thus rely on strong regularization such as data augmentation and weight decay to reduce overfitting and improve generalization. The dominant form of data augmentation applies invariant transforms, where the learning target of a sample is invariant to the transform applied to that sample. We draw inspiration from human visual classification studies and propose generalizing augmentation with invariant transforms to soft augmentation where the learning target softens non-linearly as a function of the degree of the transform applied to the sample: e.g., more aggressive image crop augmentations produce less confident learning targets. We demonstrate that soft targets allow for more aggressive data augmentation, offer more robust performance boosts, work with other augmentation policies, and interestingly, produce better calibrated models (since they are trained to be less confident on aggressively cropped/occluded examples). 
Combined with existing aggressive augmentation strategies, soft targets 1) double the top-1 accuracy boost across Cifar-10, Cifar-100, ImageNet-1K, and ImageNet-V2, 2) improve model occlusion performance by up to 4x, and 3) half the expected calibration error (ECE). Finally, we show that soft augmentation generalizes to self-supervised classification tasks. + + + + PREIM3D: 3D Consistent Precise Image Attribute Editing From a Single Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_PREIM3D_3D_Consistent_Precise_Image_Attribute_Editing_From_a_Single_CVPR_2023_paper.pdf + We study the 3D-aware image attribute editing problem in this paper, which has wide applications in practice. Recent methods solved the problem by training a shared encoder to map images into a 3D generator's latent space or by per-image latent code optimization and then edited images in the latent space. Despite their promising results near the input view, they still suffer from the 3D inconsistency of produced images at large camera poses and imprecise image attribute editing, like affecting unspecified attributes during editing. For more efficient image inversion, we train a shared encoder for all images. To alleviate 3D inconsistency at large camera poses, we propose two novel methods, an alternating training scheme and a multi-view identity loss, to maintain 3D consistency and subject identity. As for imprecise image editing, we attribute the problem to the gap between the latent space of real images and that of generated images. We compare the latent space and inversion manifold of GAN models and demonstrate that editing in the inversion manifold can achieve better results in both quantitative and qualitative evaluations. Extensive experiments show that our method produces more 3D consistent images and achieves more precise image editing than previous work. Source code and pretrained models can be found on our project page: https://mybabyyh.github.io/Preim3D. + + + + Detecting Backdoors in Pre-Trained Encoders + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Detecting_Backdoors_in_Pre-Trained_Encoders_CVPR_2023_paper.pdf + Self-supervised learning in computer vision trains on unlabeled data, such as images or (image, text) pairs, to obtain an image encoder that learns high-quality embeddings for input data. Emerging backdoor attacks towards encoders expose crucial vulnerabilities of self-supervised learning, since downstream classifiers (even further trained on clean data) may inherit backdoor behaviors from encoders. Existing backdoor detection methods mainly focus on supervised learning settings and cannot handle pre-trained encoders especially when input labels are not available. In this paper, we propose DECREE, the first backdoor detection approach for pre-trained encoders, requiring neither classifier headers nor input labels. We evaluate DECREE on over 400 encoders trojaned under 3 paradigms. We show the effectiveness of our method on image encoders pre-trained on ImageNet and OpenAI's CLIP 400 million image-text pairs. Our method consistently has a high detection accuracy even if we have only limited or no access to the pre-training dataset. 
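Returning to the soft augmentation idea described two entries above, one plausible instantiation softens the classification target as a non-linear function of how much of the image the crop removes; the quadratic schedule and the uniform redistribution below are assumptions, not the paper's exact formulation.

# Sketch of crop-dependent soft targets and a soft cross-entropy loss.
import torch
import torch.nn.functional as F

def soft_crop_targets(labels, visible_fraction, num_classes, power=2.0):
    """labels: (B,) integer class ids; visible_fraction: (B,) in [0, 1], the share
    of the original image kept by the crop. Returns (B, num_classes) soft targets."""
    # Confidence decays non-linearly as more of the image is cropped away;
    # the removed probability mass is spread uniformly over all classes.
    confidence = visible_fraction.clamp(0, 1) ** power
    uniform = torch.full((labels.shape[0], num_classes), 1.0 / num_classes,
                         device=labels.device)
    one_hot = F.one_hot(labels, num_classes).float()
    return confidence.unsqueeze(1) * one_hot + (1 - confidence).unsqueeze(1) * uniform

def soft_cross_entropy(logits, soft_targets):
    return -(soft_targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()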
+ + + + Primitive Generation and Semantic-Related Alignment for Universal Zero-Shot Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Primitive_Generation_and_Semantic-Related_Alignment_for_Universal_Zero-Shot_Segmentation_CVPR_2023_paper.pdf + We study universal zero-shot segmentation in this work to achieve panoptic, instance, and semantic segmentation for novel categories without any training samples. Such zero-shot segmentation ability relies on inter-class relationships in semantic space to transfer the visual knowledge learned from seen categories to unseen ones. Thus, it is desired to well bridge semantic-visual spaces and apply the semantic relationships to visual feature learning. We introduce a generative model to synthesize features for unseen categories, which links semantic and visual spaces as well as address the issue of lack of unseen training data. Furthermore, to mitigate the domain gap between semantic and visual spaces, firstly, we enhance the vanilla generator with learned primitives, each of which contains fine-grained attributes related to categories, and synthesize unseen features by selectively assembling these primitives. Secondly, we propose to disentangle the visual feature into the semantic-related part and the semantic-unrelated part that contains useful visual classification clues but is less relevant to semantic representation. The inter-class relationships of semantic-related visual features are then required to be aligned with those in semantic space, thereby transferring semantic knowledge to visual feature learning. The proposed approach achieves impressively state-of-the-art performance on zero-shot panoptic segmentation, instance segmentation, and semantic segmentation. + + + + Long Range Pooling for 3D Large-Scale Scene Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Long_Range_Pooling_for_3D_Large-Scale_Scene_Understanding_CVPR_2023_paper.pdf + Inspired by the success of recent vision transformers and large kernel design in convolutional neural networks (CNNs), in this paper, we analyze and explore essential reasons for their success. We claim two factors that are critical for 3D large-scale scene understanding: a larger receptive field and operations with greater non-linearity. The former is responsible for providing long range contexts and the latter can enhance the capacity of the network. To achieve the above properties, we propose a simple yet effective long range pooling (LRP) module using dilation max pooling, which provides a network with a large adaptive receptive field. LRP has few parameters, and can be readily added to current CNNs. Also, based on LRP, we present an entire network architecture, LRPNet, for 3D understanding. Ablation studies are presented to support our claims, and show that the LRP module achieves better results than large kernel convolution yet with reduced computation, due to its non-linearity. We also demonstrate the superiority of LRPNet on various benchmarks: LRPNet performs the best on ScanNet and surpasses other CNN-based methods on S3DIS and Matterport3D. Code will be avalible at https://github.com/li-xl/LRPNet. + + + + Causally-Aware Intraoperative Imputation for Overall Survival Time Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Causally-Aware_Intraoperative_Imputation_for_Overall_Survival_Time_Prediction_CVPR_2023_paper.pdf + Previous efforts in vision community are mostly made on learning good representations from visual patterns. 
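The long range pooling (LRP) module described above is essentially parameter-efficient dilated max pooling; a toy dense-tensor version might look like this. The paper operates on sparse 3D data, so this is only an illustration, and the dilation set and 1x1 projection are assumptions.

# Toy long-range pooling block from dilated max pooling (dense 3D tensors).
import torch
import torch.nn as nn

class LongRangePooling3D(nn.Module):
    def __init__(self, channels, dilations=(1, 2, 4)):
        super().__init__()
        # Dilated max pooling enlarges the receptive field with no extra weights
        # and contributes non-linearity (max) rather than a weighted sum.
        self.pools = nn.ModuleList([
            nn.MaxPool3d(kernel_size=3, stride=1, padding=d, dilation=d)
            for d in dilations
        ])
        self.proj = nn.Conv3d(channels * (len(dilations) + 1), channels, kernel_size=1)

    def forward(self, x):  # x: (B, C, D, H, W)
        feats = [x] + [pool(x) for pool in self.pools]
        return self.proj(torch.cat(feats, dim=1))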
Beyond this, this paper emphasizes the high-level ability of causal reasoning. We thus present a case study of solving the challenging task of Overall Survival (OS) time in primary liver cancers. Critically, the prediction of OS time at the early stage remains challenging, due to the unobvious image patterns of reflecting the OS. To this end, we propose a causal inference system by leveraging the intraoperative attributes and the correlation among them, as an intermediate supervision to bridge the gap between the images and the final OS. Particularly, we build a causal graph, and train the images to estimate the intraoperative attributes for final OS prediction. We present a novel Causally-aware Intraoperative Imputation Model (CAWIM) that can sequentially predict each attribute using its parent nodes in the estimated causal graph. To determine the causal directions, we propose a splitting-voting mechanism, which votes for the direction for each pair of adjacent nodes among multiple predictions obtained via causal discovery from heterogeneity. The practicability and effectiveness of our method are demonstrated by the promising result on liver cancer dataset of 361 patients with long-term observations. + + + + Twin Contrastive Learning With Noisy Labels + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Twin_Contrastive_Learning_With_Noisy_Labels_CVPR_2023_paper.pdf + Learning from noisy data is a challenging task that significantly degenerates the model performance. In this paper, we present TCL, a novel twin contrastive learning model to learn robust representations and handle noisy labels for classification. Specifically, we construct a Gaussian mixture model (GMM) over the representations by injecting the supervised model predictions into GMM to link label-free latent variables in GMM with label-noisy annotations. Then, TCL detects the examples with wrong labels as the out-of-distribution examples by another two-component GMM, taking into account the data distribution. We further propose a cross-supervision with an entropy regularization loss that bootstraps the true targets from model predictions to handle the noisy labels. As a result, TCL can learn discriminative representations aligned with estimated labels through mixup and contrastive learning. Extensive experimental results on several standard benchmarks and real-world datasets demonstrate the superior performance of TCL. In particular, TCL achieves 7.5% improvements on CIFAR-10 with 90% noisy label---an extremely noisy scenario. The source code is available at https://github.com/Hzzone/TCL. + + + + Asymmetric Feature Fusion for Image Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Asymmetric_Feature_Fusion_for_Image_Retrieval_CVPR_2023_paper.pdf + In asymmetric retrieval systems, models with different capacities are deployed on platforms with different computational and storage resources. Despite the great progress, existing approaches still suffer from a dilemma between retrieval efficiency and asymmetric accuracy due to the low capacity of the lightweight query model. In this work, we propose an Asymmetric Feature Fusion (AFF) paradigm, which advances existing asymmetric retrieval systems by considering the complementarity among different features just at the gallery side. Specifically, it first embeds each gallery image into various features, e.g., local features and global features. Then, a dynamic mixer is introduced to aggregate these features into a compact embedding for efficient search. 
On the query side, only a single lightweight model is deployed for feature extraction. The query model and dynamic mixer are jointly trained by sharing a momentum-updated classifier. Notably, the proposed paradigm boosts the accuracy of asymmetric retrieval without introducing any extra overhead to the query side. Exhaustive experiments on various landmark retrieval datasets demonstrate the superiority of our paradigm. + + + + CREPE: Can Vision-Language Foundation Models Reason Compositionally? + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_CREPE_Can_Vision-Language_Foundation_Models_Reason_Compositionally_CVPR_2023_paper.pdf + A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet, despite the performance gains contributed by large vision and language pretraining, we find that--across 7 architectures trained with 4 algorithms on massive datasets--they struggle at compositionality. To arrive at this conclusion, we introduce a new compositionality evaluation benchmark, CREPE, which measures two important aspects of compositionality identified by cognitive science literature: systematicity and productivity. To measure systematicity, CREPE consists of a test dataset containing over 370K image-text pairs and three different seen-unseen splits. The three splits are designed to test models trained on three popular training datasets: CC-12M, YFCC-15M, and LAION-400M. We also generate 325K, 316K, and 309K hard negative captions for a subset of the pairs. To test productivity, CREPE contains 17K image-text pairs with nine different complexities plus 278K hard negative captions with atomic, swapping, and negation foils. The datasets are generated by repurposing the Visual Genome scene graphs and region descriptions and applying handcrafted templates and GPT-3. For systematicity, we find that model performance decreases consistently when novel compositions dominate the retrieval set, with Recall@1 dropping by up to 9%. For productivity, models' retrieval success decays as complexity increases, frequently nearing random chance at high complexity. These results hold regardless of model and training dataset size. + + + + PyramidFlow: High-Resolution Defect Contrastive Localization Using Pyramid Normalizing Flow + http://openaccess.thecvf.com//content/CVPR2023/papers/Lei_PyramidFlow_High-Resolution_Defect_Contrastive_Localization_Using_Pyramid_Normalizing_Flow_CVPR_2023_paper.pdf + During industrial processing, unforeseen defects may arise in products due to uncontrollable factors. Although unsupervised methods have been successful in defect localization, the usual use of pre-trained models results in low-resolution outputs, which damages visual performance. To address this issue, we propose PyramidFlow, the first fully normalizing flow method without pre-trained models that enables high-resolution defect localization. Specifically, we propose a latent template-based defect contrastive localization paradigm to reduce intra-class variance, as the pre-trained models do. In addition, PyramidFlow utilizes pyramid-like normalizing flows for multi-scale fusing and volume normalization to help generalization. Our comprehensive studies on MVTecAD demonstrate the proposed method outperforms the comparable algorithms that do not use external priors, even achieving state-of-the-art performance in more challenging BTAD scenarios. 
+ + + + On-the-Fly Category Discovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_On-the-Fly_Category_Discovery_CVPR_2023_paper.pdf + Although machines have surpassed humans on visual recognition problems, they are still limited to providing closed-set answers. Unlike machines, humans can cognize novel categories at the first observation. Novel category discovery (NCD) techniques, transferring knowledge from seen categories to distinguish unseen categories, aim to bridge the gap. However, current NCD methods assume a transductive learning and offline inference paradigm, which restricts them to a pre-defined query set and renders them unable to deliver instant feedback. In this paper, we study on-the-fly category discovery (OCD) aimed at making the model instantaneously aware of novel category samples (i.e., enabling inductive learning and streaming inference). We first design a hash coding-based expandable recognition model as a practical baseline. Afterwards, noticing the sensitivity of hash codes to intra-category variance, we further propose a novel Sign-Magnitude dIsentangLEment (SMILE) architecture to alleviate the disturbance it brings. Our experimental results demonstrate the superiority of SMILE against our baseline model and prior art. Our code will be made publicly available. Our code is available at https://github.com/PRIS-CV/On-the-fly-Category-Discovery. + + + + MAIR: Multi-View Attention Inverse Rendering With 3D Spatially-Varying Lighting Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Choi_MAIR_Multi-View_Attention_Inverse_Rendering_With_3D_Spatially-Varying_Lighting_Estimation_CVPR_2023_paper.pdf + We propose a scene-level inverse rendering framework that uses multi-view images to decompose the scene into geometry, a SVBRDF, and 3D spatially-varying lighting. Because multi-view images provide a variety of information about the scene, multi-view images in object-level inverse rendering have been taken for granted. However, owing to the absence of multi-view HDR synthetic dataset, scene-level inverse rendering has mainly been studied using single-view image. We were able to successfully perform scene-level inverse rendering using multi-view images by expanding OpenRooms dataset and designing efficient pipelines to handle multi-view images, and splitting spatially-varying lighting. Our experiments show that the proposed method not only achieves better performance than single-view-based methods, but also achieves robust performance on unseen real-world scene. Also, our sophisticated 3D spatially-varying lighting volume allows for photorealistic object insertion in any 3D location. + + + + DF-Platter: Multi-Face Heterogeneous Deepfake Dataset + http://openaccess.thecvf.com//content/CVPR2023/papers/Narayan_DF-Platter_Multi-Face_Heterogeneous_Deepfake_Dataset_CVPR_2023_paper.pdf + Deepfake detection is gaining significant importance in the research community. While most of the research efforts are focused around high-quality images and videos, deepfake generation algorithms today have the capability to generate low-resolution videos, occluded deepfakes, and multiple-subject deepfakes. In this research, we emulate the real-world scenario of deepfake generation and spreading, and propose the DF-Platter dataset, which contains (i) both low-resolution and high-resolution deepfakes generated using multiple generation techniques and (ii) single-subject and multiple-subject deepfakes, with face images of Indian ethnicity. 
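As a toy illustration of the hash-coding-based baseline in the On-the-Fly Category Discovery abstract above (not the SMILE architecture itself), the sign pattern of a projected feature can serve as a category code, with unseen codes opening new categories during streaming inference; the code length and projection below are assumptions.

# Toy hash-code category assignment for streaming (on-the-fly) discovery.
import torch
import torch.nn as nn

class HashCategoryDiscovery(nn.Module):
    def __init__(self, feat_dim, code_bits=12):
        super().__init__()
        self.proj = nn.Linear(feat_dim, code_bits)
        self.code_to_id = {}  # maps hash codes to category ids, grown on the fly

    @torch.no_grad()
    def assign(self, features):  # features: (B, feat_dim)
        codes = (self.proj(features) > 0).int()
        ids = []
        for code in codes:
            key = tuple(code.tolist())
            if key not in self.code_to_id:
                # Streaming inference: a never-seen code opens a new category.
                self.code_to_id[key] = len(self.code_to_id)
            ids.append(self.code_to_id[key])
        return torch.tensor(ids)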
Faces in the dataset are annotated for various attributes such as gender, age, skin tone, and occlusion. The database is prepared in 116 days with continuous usage of 32 GPUs accounting to 1,800 GB cumulative memory. With over 500 GBs in size, the dataset contains a total of 133,260 videos encompassing three sets. To the best of our knowledge, this is one of the largest datasets containing vast variability and multiple challenges. We also provide benchmark results under multiple evaluation settings using popular and state-of-the-art deepfake detection models. Further, benchmark results under c23 and c40 compression are provided. The results demonstrate a significant performance reduction in the deepfake detection task on low-resolution deepfakes and show that the existing techniques fail drastically on multiple-subject deepfakes. It is our assertion that this database will improve the state-of-the-art by extending the capabilities of deepfake detection algorithms to real-world scenarios. The database is available at: http://iab-rubric.org/df-platter-database. + + + + Shifted Diffusion for Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Shifted_Diffusion_for_Text-to-Image_Generation_CVPR_2023_paper.pdf + We present Corgi, a novel method for text-to-image generation. Corgi is based on our proposed shifted diffusion model, which achieves better image embedding generation from input text. Different from the baseline diffusion model used in DALL-E 2, our method seamlessly encodes prior knowledge of the pre-trained CLIP model in its diffusion process by designing a new initialization distribution and a new transition step of the diffusion. Compared to the strong DALL-E 2 baseline, our method performs better in generating image embedding from the text in terms of both efficiency and effectiveness, which consequently results in better text-to-image generation. Extensive large-scale experiments are conducted and evaluated in terms of both quantitative measures and human evaluation, indicating a stronger generation ability of our method compared to existing ones. Furthermore, our model enables semi-supervised and language-free training for text-to-image generation, where only part or none of the images in the training dataset have an associated caption. Trained with only 1.7% of the images being captioned, our semi-supervised model obtains FID results comparable to DALL-E 2 on zero-shot text-to-image generation evaluated on MS-COCO. Corgi also achieves new state-of-the-art results across different datasets on downstream language-free text-to-image generation tasks, outperforming the previous method, Lafite, by a large margin. + + + + Boosting Detection in Crowd Analysis via Underutilized Output Features + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Boosting_Detection_in_Crowd_Analysis_via_Underutilized_Output_Features_CVPR_2023_paper.pdf + Detection-based methods have been viewed unfavorably in crowd analysis due to their poor performance in dense crowds. However, we argue that the potential of these methods has been underestimated, as they offer crucial information for crowd analysis that is often ignored. Specifically, the area size and confidence score of output proposals and bounding boxes provide insight into the scale and density of the crowd. To leverage these underutilized features, we propose Crowd Hat, a plug-and-play module that can be easily integrated with existing detection models. 
This module uses a mixed 2D-1D compression technique to refine the output features and obtain the spatial and numerical distribution of crowd-specific information. Based on these features, we further propose region-adaptive NMS thresholds and a decouple-then-align paradigm that address the major limitations of detection-based methods. Our extensive evaluations on various crowd analysis tasks, including crowd counting, localization, and detection, demonstrate the effectiveness of utilizing output features and the potential of detection-based methods in crowd analysis. Our code is available at https://github.com/wskingdom/Crowd-Hat. + + + + K3DN: Disparity-Aware Kernel Estimation for Dual-Pixel Defocus Deblurring http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_K3DN_Disparity-Aware_Kernel_Estimation_for_Dual-Pixel_Defocus_Deblurring_CVPR_2023_paper.pdf The dual-pixel (DP) sensor captures a two-view image pair in a single snapshot by splitting each pixel in half. The disparity occurs in defocus-blurred regions between the two views of the DP pair, while the in-focus sharp regions have zero disparity. This motivates us to propose the K3DN framework for DP pair deblurring, which has three modules: i) a disparity-aware deblur module. It estimates a disparity feature map, which is used to query a trainable kernel set to estimate a blur kernel that best describes the spatially-varying blur. The kernel is constrained to be symmetrical per the DP formulation. A simple Fourier transform is performed for deblurring that follows the blur model; ii) a reblurring regularization module. It reuses the blur kernel, performs a simple convolution for reblurring, and regularizes the estimated kernel and disparity feature in an unsupervised manner during the training stage; iii) a sharp region preservation module. It identifies in-focus regions that correspond to areas with zero disparity between DP images, aims to avoid the introduction of noise during the deblurring process, and improves image restoration performance. Experiments on four standard DP datasets show that the proposed K3DN outperforms state-of-the-art methods, with fewer parameters and FLOPs at the same time. + + + + DartBlur: Privacy Preservation With Detection Artifact Suppression http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_DartBlur_Privacy_Preservation_With_Detection_Artifact_Suppression_CVPR_2023_paper.pdf Nowadays, privacy issues have become a top priority when training AI algorithms. Machine learning algorithms are expected to benefit our daily life, while personal information must also be carefully protected from exposure. Facial information is particularly sensitive in this regard. Multiple datasets containing facial information have been taken offline, and the community is actively seeking solutions to remedy the privacy issues. Existing methods for privacy preservation can be divided into blur-based and face replacement-based methods. Owing to the advantages of review convenience and good accessibility, blur-based methods have become a dominant choice in practice. However, blur-based methods would inevitably introduce training artifacts harmful to the performance of downstream tasks. In this paper, we propose a novel De-artifact Blurring (DartBlur) privacy-preserving method, which capitalizes on a DNN architecture to generate blurred faces. DartBlur can effectively hide facial privacy information while detection artifacts are simultaneously suppressed.
We have designed four training objectives that particularly aim to improve review convenience and maximize detection artifact suppression. We associate the algorithm with an adversarial training strategy with a second-order optimization pipeline. Experimental results demonstrate that DartBlur outperforms the existing face-replacement method from both perspectives of review convenience and accessibility, and also shows an exclusive advantage in suppressing the training artifact compared to traditional blur-based methods. Our implementation is available at https://github.com/JaNg2333/DartBlur. + + + + LipFormer: High-Fidelity and Generalizable Talking Face Generation With a Pre-Learned Facial Codebook + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_LipFormer_High-Fidelity_and_Generalizable_Talking_Face_Generation_With_a_Pre-Learned_CVPR_2023_paper.pdf + Generating a talking face video from the input audio sequence is a practical yet challenging task. Most existing methods either fail to capture fine facial details or need to train a specific model for each identity. We argue that a codebook pre-learned on high-quality face images can serve as a useful prior that facilitates high-fidelity and generalizable talking head synthesis. Thanks to the strong capability of the codebook in representing face textures, we simplify the talking face generation task as finding proper lip-codes to characterize the variation of lips during a portrait talking. To this end, we propose LipFormer, a transformer-based framework, to model the audio-visual coherence and predict the lip-codes sequence based on the input audio features. We further introduce an adaptive face warping module, which helps warp the reference face to the target pose in the feature space, to alleviate the difficulty of lip-code prediction under different poses. By this means, LipFormer can make better use of the pre-learned priors in images and is robust to posture change. Extensive experiments show that LipFormer can produce more realistic talking face videos compared to previous methods and faithfully generalize to unseen identities. + + + + Generalizable Local Feature Pre-Training for Deformable Shape Analysis + http://openaccess.thecvf.com//content/CVPR2023/papers/Attaiki_Generalizable_Local_Feature_Pre-Training_for_Deformable_Shape_Analysis_CVPR_2023_paper.pdf + Transfer learning is fundamental for addressing problems in settings with little training data. While several transfer learning approaches have been proposed in 3D, unfortunately, these solutions typically operate on an entire 3D object or even scene-level and thus, as we show, fail to generalize to new classes, such as deformable organic shapes. In addition, there is currently a lack of understanding of what makes pre-trained features transferable across significantly different 3D shape categories. In this paper, we make a step toward addressing these challenges. First, we analyze the link between feature locality and transferability in tasks involving deformable 3D objects, while also comparing different backbones and losses for local feature pre-training. We observe that with proper training, learned features can be useful in such tasks, but, crucially, only with an appropriate choice of the receptive field size. We then propose a differentiable method for optimizing the receptive field within 3D transfer learning. Jointly, this leads to the first learnable features that can successfully generalize to unseen classes of 3D shapes such as humans and animals. 
Our extensive experiments show that this approach leads to state-of-the-art results on several downstream tasks such as segmentation, shape correspondence, and classification. Our code is available at https://github.com/pvnieo/vader. + + + + Progressive Random Convolutions for Single Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Choi_Progressive_Random_Convolutions_for_Single_Domain_Generalization_CVPR_2023_paper.pdf + Single domain generalization aims to train a generalizable model with only one source domain to perform well on arbitrary unseen target domains. Image augmentation based on Random Convolutions (RandConv), consisting of one convolution layer randomly initialized for each mini-batch, enables the model to learn generalizable visual representations by distorting local textures despite its simple and lightweight structure. However, RandConv has structural limitations in that the generated image easily loses semantics as the kernel size increases, and lacks the inherent diversity of a single convolution operation. To solve the problem, we propose a Progressive Random Convolution (Pro-RandConv) method that recursively stacks random convolution layers with a small kernel size instead of increasing the kernel size. This progressive approach can not only mitigate semantic distortions by reducing the influence of pixels away from the center in the theoretical receptive field, but also create more effective virtual domains by gradually increasing the style diversity. In addition, we develop a basic random convolution layer into a random convolution block including deformable offsets and affine transformation to support texture and contrast diversification, both of which are also randomly initialized. Without complex generators or adversarial learning, we demonstrate that our simple yet effective augmentation strategy outperforms state-of-the-art methods on single domain generalization benchmarks. + + + + OPE-SR: Orthogonal Position Encoding for Designing a Parameter-Free Upsampling Module in Arbitrary-Scale Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_OPE-SR_Orthogonal_Position_Encoding_for_Designing_a_Parameter-Free_Upsampling_Module_CVPR_2023_paper.pdf + Arbitrary-scale image super-resolution (SR) is often tackled using the implicit neural representation (INR) approach, which relies on a position encoding scheme to improve its representation ability. In this paper, we introduce orthogonal position encoding (OPE), an extension of position encoding, and an OPE-Upscale module to replace the INR-based upsampling module for arbitrary-scale image super-resolution. Our OPE-Upscale module takes 2D coordinates and latent code as inputs, just like INR, but does not require any training parameters. This parameter-free feature allows the OPE-Upscale module to directly perform linear combination operations, resulting in continuous image reconstruction and achieving arbitrary-scale image reconstruction. As a concise SR framework, our method is computationally efficient and consumes less memory than state-of-the-art methods, as confirmed by extensive experiments and evaluations. In addition, our method achieves comparable results with state-of-the-art methods in arbitrary-scale image super-resolution. Lastly, we show that OPE corresponds to a set of orthogonal basis, validating our design principle. 
+ + + + I2MVFormer: Large Language Model Generated Multi-View Document Supervision for Zero-Shot Image Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Naeem_I2MVFormer_Large_Language_Model_Generated_Multi-View_Document_Supervision_for_Zero-Shot_CVPR_2023_paper.pdf + Recent works have shown that unstructured text (documents) from online sources can serve as useful auxiliary information for zero-shot image classification. However, these methods require access to a high-quality source like Wikipedia and are limited to a single source of information. Large Language Models (LLM) trained on web-scale text show impressive abilities to repurpose their learned knowledge for a multitude of tasks. In this work, we provide a novel perspective on using an LLM to provide text supervision for a zero-shot image classification model. The LLM is provided with a few text descriptions from different annotators as examples. The LLM is conditioned on these examples to generate multiple text descriptions for each class (referred to as views). Our proposed model, I2MVFormer, learns multi-view semantic embeddings for zero-shot image classification with these class views. We show that each text view of a class provides complementary information allowing a model to learn a highly discriminative class embedding. Moreover, we show that I2MVFormer is better at consuming the multi-view text supervision from LLM compared to baseline models. I2MVFormer establishes a new state-of-the-art on three public benchmark datasets for zero-shot image classification with unsupervised semantic embeddings. + + + + MixSim: A Hierarchical Framework for Mixed Reality Traffic Simulation + http://openaccess.thecvf.com//content/CVPR2023/papers/Suo_MixSim_A_Hierarchical_Framework_for_Mixed_Reality_Traffic_Simulation_CVPR_2023_paper.pdf + The prevailing way to test a self-driving vehicle (SDV) in simulation involves non-reactive open-loop replay of real world scenarios. However, in order to safely deploy SDVs to the real world, we need to evaluate them in closed-loop. Towards this goal, we propose to leverage the wealth of interesting scenarios captured in the real world and make them reactive and controllable to enable closed-loop SDV evaluation in what-if situations. In particular, we present MixSim, a hierarchical framework for mixed reality traffic simulation. MixSim explicitly models agent goals as routes along the road network and learns a reactive route-conditional policy. By inferring each agent's route from the original scenario, MixSim can reactively re-simulate the scenario and enable testing different autonomy systems under the same conditions. Furthermore, by varying each agent's route, we can expand the scope of testing to what-if situations with realistic variations in agent behaviors or even safety-critical interactions. Our experiments show that MixSim can serve as a realistic, reactive, and controllable digital twin of real world scenarios. For more information, please visit the project website: https://waabi.ai/research/mixsim/ + + + + Context-Aware Alignment and Mutual Masking for 3D-Language Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_Context-Aware_Alignment_and_Mutual_Masking_for_3D-Language_Pre-Training_CVPR_2023_paper.pdf + 3D visual language reasoning plays an important role in effective human-computer interaction. 
The current approaches for 3D visual reasoning are task-specific, and lack pre-training methods to learn generic representations that can transfer across various tasks. Despite the encouraging progress in vision-language pre-training for image-text data, 3D-language pre-training is still an open issue due to limited 3D-language paired data, highly sparse and irregular structure of point clouds and ambiguities in spatial relations of 3D objects with viewpoint changes. In this paper, we present a generic 3D-language pre-training approach, that tackles multiple facets of 3D-language reasoning by learning universal representations. Our learning objective constitutes two main parts. 1) Context aware spatial-semantic alignment to establish fine-grained correspondence between point clouds and texts. It reduces relational ambiguities by aligning 3D spatial relationships with textual semantic context. 2) Mutual 3D-Language Masked modeling to enable cross-modality information exchange. Instead of reconstructing sparse 3D points for which language can hardly provide cues, we propose masked proposal reasoning to learn semantic class and mask-invariant representations. Our proposed 3D-language pre-training method achieves promising results once adapted to various downstream tasks, including 3D visual grounding, 3D dense captioning and 3D question answering. Our codes are available at https://github.com/leolyj/3D-VLP + + + + Generalized Decoding for Pixel, Image, and Language + http://openaccess.thecvf.com//content/CVPR2023/papers/Zou_Generalized_Decoding_for_Pixel_Image_and_Language_CVPR_2023_paper.pdf + We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) better or competitive finetuned performance to other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition. Code, demo, video and visualization are available at: https://x-decoder-vl.github.io. + + + + Towards Unified Scene Text Spotting Based on Sequence Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kil_Towards_Unified_Scene_Text_Spotting_Based_on_Sequence_Generation_CVPR_2023_paper.pdf + Sequence generation models have recently made significant progress in unifying various vision tasks. 
Although some auto-regressive models have demonstrated promising results in end-to-end text spotting, they use specific detection formats while ignoring various text shapes and are limited in the maximum number of text instances that can be detected. To overcome these limitations, we propose a UNIfied scene Text Spotter, called UNITS. Our model unifies various detection formats, including quadrilaterals and polygons, allowing it to detect text in arbitrary shapes. Additionally, we apply starting-point prompting to enable the model to extract texts from an arbitrary starting point, thereby extracting more texts beyond the number of instances it was trained on. Experimental results demonstrate that our method achieves competitive performance compared to state-of-the-art methods. Further analysis shows that UNITS can extract a larger number of texts than it was trained on. We provide the code for our method at https://github.com/clovaai/units. + + + + X3KD: Knowledge Distillation Across Modalities, Tasks and Stages for Multi-Camera 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Klingner_X3KD_Knowledge_Distillation_Across_Modalities_Tasks_and_Stages_for_Multi-Camera_CVPR_2023_paper.pdf + Recent advances in 3D object detection (3DOD) have obtained remarkably strong results for LiDAR-based models. In contrast, surround-view 3DOD models based on multiple camera images underperform due to the necessary view transformation of features from perspective view (PV) to a 3D world representation which is ambiguous due to missing depth information. This paper introduces X3KD, a comprehensive knowledge distillation framework across different modalities, tasks, and stages for multi-camera 3DOD. Specifically, we propose cross-task distillation from an instance segmentation teacher (X-IS) in the PV feature extraction stage providing supervision without ambiguous error backpropagation through the view transformation. After the transformation, we apply cross-modal feature distillation (X-FD) and adversarial training (X-AT) to improve the 3D world representation of multi-camera features through the information contained in a LiDAR-based 3DOD teacher. Finally, we also employ this teacher for cross-modal output distillation (X-OD), providing dense supervision at the prediction stage. We perform extensive ablations of knowledge distillation at different stages of multi-camera 3DOD. Our final X3KD model outperforms previous state-of-the-art approaches on the nuScenes and Waymo datasets and generalizes to RADAR-based 3DOD. Qualitative results video at https://youtu.be/1do9DPFmr38. + + + + Rawgment: Noise-Accounted RAW Augmentation Enables Recognition in a Wide Variety of Environments + http://openaccess.thecvf.com//content/CVPR2023/papers/Yoshimura_Rawgment_Noise-Accounted_RAW_Augmentation_Enables_Recognition_in_a_Wide_Variety_CVPR_2023_paper.pdf + Image recognition models that work in challenging environments (e.g., extremely dark, blurry, or high dynamic range conditions) must be useful. However, creating training datasets for such environments is expensive and hard due to the difficulties of data collection and annotation. It is desirable if we could get a robust model without the need for hard-to-obtain datasets. One simple approach is to apply data augmentation such as color jitter and blur to standard RGB (sRGB) images in simple scenes. 
Unfortunately, this approach struggles to yield realistic images in terms of pixel intensity and noise distribution because it does not consider the non-linearity of Image Signal Processors (ISPs) and the noise characteristics of image sensors. Instead, we propose a noise-accounted RAW image augmentation method. In essence, color jitter and blur augmentation are applied to a RAW image before applying the non-linear ISP, resulting in realistic intensity. Furthermore, we introduce a noise amount alignment method that calibrates the domain gap in the noise property caused by the augmentation. We show that our proposed noise-accounted RAW augmentation method doubles the image recognition accuracy in challenging environments with only simple training data. + + + + BITE: Beyond Priors for Improved Three-D Dog Pose Estimation http://openaccess.thecvf.com//content/CVPR2023/papers/Ruegg_BITE_Beyond_Priors_for_Improved_Three-D_Dog_Pose_Estimation_CVPR_2023_paper.pdf We address the problem of inferring the 3D shape and pose of dogs from images. Given the lack of 3D training data, this problem is challenging, and the best methods lag behind those designed to estimate human shape and pose. To make progress, we attack the problem from multiple sides at once. First, we need a good 3D shape prior, like those available for humans. To that end, we learn a dog-specific 3D parametric model, called D-SMAL. Second, existing methods focus on dogs in standing poses because when they sit or lie down, their legs are self-occluded and their bodies deform. Without access to a good pose prior or 3D data, we need an alternative approach. To that end, we exploit contact with the ground as a form of side information. We consider an existing large dataset of dog images and label any 3D contact of the dog with the ground. We exploit body-ground contact in estimating dog pose and find that it significantly improves results. Third, we develop a novel neural network architecture to infer and exploit this contact information. Fourth, to make progress, we have to be able to measure it. Current evaluation metrics are based on 2D features like keypoints and silhouettes, which do not directly correlate with 3D errors. To address this, we create a synthetic dataset containing rendered images of scanned 3D dogs. With these advances, our method recovers significantly better dog shape and pose than the state of the art, and we evaluate this improvement in 3D. Our code, model and test dataset are publicly available for research purposes at https://bite.is.tue.mpg.de. + + + + Equivalent Transformation and Dual Stream Network Construction for Mobile Image Super-Resolution http://openaccess.thecvf.com//content/CVPR2023/papers/Chao_Equivalent_Transformation_and_Dual_Stream_Network_Construction_for_Mobile_Image_CVPR_2023_paper.pdf In recent years, there has been an increasing demand for real-time super-resolution networks on mobile devices. To address this issue, many lightweight super-resolution models have been proposed. However, these models still contain time-consuming components that increase inference latency, limiting their real-world applications on mobile devices. In this paper, we propose a novel model for single-image super-resolution based on Equivalent Transformation and Dual Stream network construction (ETDS). The ET method is proposed to transform time-consuming operators into time-friendly ones, such as convolution and ReLU, on mobile devices.
Then, a dual stream network is designed to alleviate redundant parameters yielded from ET and enhance the feature extraction ability. Taking full advantage of the advance of ET and the dual stream network structure, we develop the efficient SR model ETDS for mobile devices. The experimental results demonstrate that our ETDS achieves superior inference speed and reconstruction quality compared to prior lightweight SR methods on mobile devices. The code is available at https://github.com/ECNUSR/ETDS. + + + + High-Resolution Image Reconstruction With Latent Diffusion Models From Human Brain Activity + http://openaccess.thecvf.com//content/CVPR2023/papers/Takagi_High-Resolution_Image_Reconstruction_With_Latent_Diffusion_Models_From_Human_Brain_CVPR_2023_paper.pdf + Reconstructing visual experiences from human brain activity offers a unique way to understand how the brain represents the world, and to interpret the connection between computer vision models and our visual system. While deep generative models have recently been employed for this task, reconstructing realistic images with high semantic fidelity is still a challenging problem. Here, we propose a new method based on a diffusion model (DM) to reconstruct images from human brain activity obtained via functional magnetic resonance imaging (fMRI). More specifically, we rely on a latent diffusion model (LDM) termed Stable Diffusion. This model reduces the computational cost of DMs, while preserving their high generative performance. We also characterize the inner mechanisms of the LDM by studying how its different components (such as the latent vector Z, conditioning inputs C, and different elements of the denoising U-Net) relate to distinct brain functions. We show that our proposed method can reconstruct high-resolution images with high fidelity in straightforward fashion, without the need for any additional training and fine-tuning of complex deep-learning models. We also provide a quantitative interpretation of different LDM components from a neuroscientific perspective. Overall, our study proposes a promising method for reconstructing images from human brain activity, and provides a new framework for understanding DMs. Please check out our webpage at https://sites.google.com/view/stablediffusion-withbrain/. + + + + DARE-GRAM: Unsupervised Domain Adaptation Regression by Aligning Inverse Gram Matrices + http://openaccess.thecvf.com//content/CVPR2023/papers/Nejjar_DARE-GRAM_Unsupervised_Domain_Adaptation_Regression_by_Aligning_Inverse_Gram_Matrices_CVPR_2023_paper.pdf + Unsupervised Domain Adaptation Regression (DAR) aims to bridge the domain gap between a labeled source dataset and an unlabelled target dataset for regression problems. Recent works mostly focus on learning a deep feature encoder by minimizing the discrepancy between source and target features. In this work, we present a different perspective for the DAR problem by analyzing the closed-form ordinary least square (OLS) solution to the linear regressor in the deep domain adaptation context. Rather than aligning the original feature embedding space, we propose to align the inverse Gram matrix of the features, which is motivated by its presence in the OLS solution and the Gram matrix's ability to capture the feature correlations. Specifically, we propose a simple yet effective DAR method which leverages the pseudo-inverse low-rank property to align the scale and angle in a selected subspace generated by the pseudo-inverse Gram matrix of the two domains. 
We evaluate our method on three domain adaptation regression benchmarks. Experimental results demonstrate that our method achieves state-of-the-art performance. Our code is available at https://github.com/ismailnejjar/DARE-GRAM. + + + + Bidirectional Copy-Paste for Semi-Supervised Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Bai_Bidirectional_Copy-Paste_for_Semi-Supervised_Medical_Image_Segmentation_CVPR_2023_paper.pdf + In semi-supervised medical image segmentation, there exist empirical mismatch problems between labeled and unlabeled data distribution. The knowledge learned from the labeled data may be largely discarded if treating labeled and unlabeled data separately or training labeled and unlabeled data in an inconsistent manner. We propose a straightforward method for alleviating the problem -- copy-pasting labeled and unlabeled data bidirectionally, in a simple Mean Teacher architecture. The method encourages unlabeled data to learn comprehensive common semantics from the labeled data in both inward and outward directions. More importantly, the consistent learning procedure for labeled and unlabeled data can largely reduce the empirical distribution gap. In detail, we copy-paste a random crop from a labeled image (foreground) onto an unlabeled image (background) and an unlabeled image (foreground) onto a labeled image (background), respectively. The two mixed images are fed into a Student network. It is trained by the generated supervisory signal via bidirectional copy-pasting between the predictions of the unlabeled images from the Teacher and the label maps of the labeled images. We explore several design choices of how to copy-paste to make it more effective for minimizing empirical distribution gaps between labeled and unlabeled data. We reveal that the simple mechanism of copy-pasting bidirectionally between labeled and unlabeled data is good enough and the experiments show solid gains (e.g., over 21% Dice improvement on ACDC dataset with 5% labeled data) compared with other state-of-the-arts on various semi-supervised medical image segmentation datasets. + + + + Learning Discriminative Representations for Skeleton Based Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Learning_Discriminative_Representations_for_Skeleton_Based_Action_Recognition_CVPR_2023_paper.pdf + Human action recognition aims at classifying the category of human action from a segment of a video. Recently, people have dived into designing GCN-based models to extract features from skeletons for performing this task, because skeleton representations are much more efficient and robust than other modalities such as RGB frames. However, when employing the skeleton data, some important clues like related items are also discarded. It results in some ambiguous actions that are hard to be distinguished and tend to be misclassified. To alleviate this problem, we propose an auxiliary feature refinement head (FR Head), which consists of spatial-temporal decoupling and contrastive feature refinement, to obtain discriminative representations of skeletons. Ambiguous samples are dynamically discovered and calibrated in the feature space. Furthermore, FR Head could be imposed on different stages of GCNs to build a multi-level refinement for stronger supervision. Extensive experiments are conducted on NTU RGB+D, NTU RGB+D 120, and NW-UCLA datasets. 
Our proposed models obtain results competitive with state-of-the-art methods and can help to discriminate those ambiguous samples. Codes are available at https://github.com/zhysora/FR-Head. + + + + Few-Shot Non-Line-of-Sight Imaging With Signal-Surface Collaborative Regularization http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Few-Shot_Non-Line-of-Sight_Imaging_With_Signal-Surface_Collaborative_Regularization_CVPR_2023_paper.pdf The non-line-of-sight imaging technique aims to reconstruct targets from multiply reflected light. For most existing methods, dense points on the relay surface are raster scanned to obtain high-quality reconstructions, which requires a long acquisition time. In this work, we propose a signal-surface collaborative regularization (SSCR) framework that provides noise-robust reconstructions with a minimal number of measurements. Using Bayesian inference, we design joint regularizations of the estimated signal, the 3D voxel-based representation of the objects, and the 2D surface-based description of the targets. To the best of our knowledge, this is the first work that combines regularizations in mixed dimensions for hidden targets. Experiments on synthetic and experimental datasets illustrate the efficiency of the proposed method under both confocal and non-confocal settings. We report the reconstruction of the hidden targets with complex geometric structures with only 5 x 5 confocal measurements from public datasets, indicating an acceleration of the conventional measurement process by a factor of 10,000. Besides, the proposed method enjoys low time and memory complexity with sparse measurements. Our approach has great potential in real-time non-line-of-sight imaging applications such as rescue operations and autonomous driving. + + + + Probabilistic Debiasing of Scene Graphs http://openaccess.thecvf.com//content/CVPR2023/papers/Biswas_Probabilistic_Debiasing_of_Scene_Graphs_CVPR_2023_paper.pdf The quality of scene graphs generated by the state-of-the-art (SOTA) models is compromised due to the long-tail nature of the relationships and their parent object pairs. Training of the scene graphs is dominated by the majority relationships of the majority pairs and, therefore, the object-conditional distributions of relationships in the minority pairs are not preserved after training converges. Consequently, the biased model performs well on more frequent relationships in the marginal distribution of relationships such as 'on' and 'wearing', and performs poorly on the less frequent relationships such as 'eating' or 'hanging from'. In this work, we propose a virtual-evidence-incorporated within-triplet Bayesian Network (BN) to preserve the object-conditional distribution of the relationship label and to eradicate the bias created by the marginal probability of the relationships. The insufficient number of relationships in the minority classes poses a significant problem in learning the within-triplet Bayesian network. We address this insufficiency by embedding-based augmentation of triplets, where we borrow samples of the minority triplet classes from their neighboring triplets in the semantic space. We perform experiments on two different datasets and achieve a significant improvement in the mean recall of the relationships. We also achieve a better balance between recall and mean recall performance compared to the SOTA de-biasing techniques of scene graph models.
+ + + + Depth Estimation From Camera Image and mmWave Radar Point Cloud + http://openaccess.thecvf.com//content/CVPR2023/papers/Singh_Depth_Estimation_From_Camera_Image_and_mmWave_Radar_Point_Cloud_CVPR_2023_paper.pdf + We present a method for inferring dense depth from a camera image and a sparse noisy radar point cloud. We first describe the mechanics behind mmWave radar point cloud formation and the challenges that it poses, i.e. ambiguous elevation and noisy depth and azimuth components that yields incorrect positions when projected onto the image, and how existing works have overlooked these nuances in camera-radar fusion. Our approach is motivated by these mechanics, leading to the design of a network that maps each radar point to the possible surfaces that it may project onto in the image plane. Unlike existing works, we do not process the raw radar point cloud as an erroneous depth map, but query each raw point independently to associate it with likely pixels in the image -- yielding a semi-dense radar depth map. To fuse radar depth with an image, we propose a gated fusion scheme that accounts for the confidence scores of the correspondence so that we selectively combine radar and camera embeddings to yield a dense depth map. We test our method on the NuScenes benchmark and show a 10.3% improvement in mean absolute error and a 9.1% improvement in root-mean-square error over the best method. + + + + Learning Event Guided High Dynamic Range Video Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Learning_Event_Guided_High_Dynamic_Range_Video_Reconstruction_CVPR_2023_paper.pdf + Limited by the trade-off between frame rate and exposure time when capturing moving scenes with conventional cameras, frame based HDR video reconstruction suffers from scene-dependent exposure ratio balancing and ghosting artifacts. Event cameras provide an alternative visual representation with a much higher dynamic range and temporal resolution free from the above issues, which could be an effective guidance for HDR imaging from LDR videos. In this paper, we propose a multimodal learning framework for event guided HDR video reconstruction. In order to better leverage the knowledge of the same scene from the two modalities of visual signals, a multimodal representation alignment strategy to learn a shared latent space and a fusion module tailored to complementing two types of signals for different dynamic ranges in different regions are proposed. Temporal correlations are utilized recurrently to suppress the flickering effects in the reconstructed HDR video. The proposed HDRev-Net demonstrates state-of-the-art performance quantitatively and qualitatively for both synthetic and real-world data. + + + + Prototypical Residual Networks for Anomaly Detection and Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Prototypical_Residual_Networks_for_Anomaly_Detection_and_Localization_CVPR_2023_paper.pdf + Anomaly detection and localization are widely used in industrial manufacturing for its efficiency and effectiveness. Anomalies are rare and hard to collect and supervised models easily over-fit to these seen anomalies with a handful of abnormal samples, producing unsatisfactory performance. On the other hand, anomalies are typically subtle, hard to discern, and of various appearance, making it difficult to detect anomalies and let alone locate anomalous regions. 
To address these issues, we propose a framework called Prototypical Residual Network (PRN), which learns feature residuals of varying scales and sizes between anomalous and normal patterns to accurately reconstruct the segmentation maps of anomalous regions. PRN mainly consists of two parts: multi-scale prototypes that explicitly represent the residual features of anomalies to normal patterns; a multi-size self-attention mechanism that enables variable-sized anomalous feature learning. Besides, we present a variety of anomaly generation strategies that consider both seen and unseen appearance variance to enlarge and diversify anomalies. Extensive experiments on the challenging and widely used MVTec AD benchmark show that PRN outperforms current state-of-the-art unsupervised and supervised methods. We further report SOTA results on three additional datasets to demonstrate the effectiveness and generalizability of PRN. + + + + Ultrahigh Resolution Image/Video Matting With Spatio-Temporal Sparsity http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Ultrahigh_Resolution_ImageVideo_Matting_With_Spatio-Temporal_Sparsity_CVPR_2023_paper.pdf Commodity ultra-high definition (UHD) displays are becoming more affordable, which creates demand for imaging in ultra-high resolution (UHR). This paper proposes SparseMat, a computationally efficient approach for UHR image/video matting. Note that it is infeasible to directly process UHR images at full resolution in one shot using existing matting algorithms without running out of memory on consumer-level computational platforms, e.g., an Nvidia 1080Ti with 11G memory, while patch-based approaches can introduce unsightly artifacts due to patch partitioning. Instead, our method resorts to spatial and temporal sparsity for solving general UHR matting. When processing videos, huge computation redundancy can be reduced through the rational use of spatial and temporal sparsity. In this paper, we show how to effectively estimate spatio-temporal sparsity, which serves as a gate to activate input pixels for the matting model. Under the guidance of such sparsity, our method discards patch-based inference in favor of memory-efficient, full-resolution matte refinement. Extensive experiments demonstrate that SparseMat can effectively and efficiently generate high-quality alpha mattes for UHR images and videos in one shot. + + + + Zero-Shot Noise2Noise: Efficient Image Denoising Without Any Data http://openaccess.thecvf.com//content/CVPR2023/papers/Mansour_Zero-Shot_Noise2Noise_Efficient_Image_Denoising_Without_Any_Data_CVPR_2023_paper.pdf Recently, self-supervised neural networks have shown excellent image denoising performance. However, current dataset-free methods are either computationally expensive, require a noise model, or have inadequate image quality. In this work, we show that a simple 2-layer network, without any training data or knowledge of the noise distribution, can enable high-quality image denoising at low computational cost. Our approach is motivated by Noise2Noise and Neighbor2Neighbor and works well for denoising pixel-wise independent noise. Our experiments on artificial, real-world camera, and microscope noise show that our method, termed ZS-N2N (Zero Shot Noise2Noise), often outperforms existing dataset-free methods at a reduced cost, making it suitable for use cases with scarce data availability and limited compute.
+ + + + FIANCEE: Faster Inference of Adversarial Networks via Conditional Early Exits http://openaccess.thecvf.com//content/CVPR2023/papers/Karpikova_FIANCEE_Faster_Inference_of_Adversarial_Networks_via_Conditional_Early_Exits_CVPR_2023_paper.pdf Generative DNNs are a powerful tool for image synthesis, but they are limited by their computational load. On the other hand, given a trained model and a task, e.g., face generation within a range of characteristics, the output image quality will be unevenly distributed among images with different characteristics. It follows that we might restrain the model's complexity on some instances while maintaining high quality. We propose a method for diminishing computations by adding so-called early exit branches to the original architecture, and dynamically switching the computational path depending on how difficult it will be to render the output. We apply our method to two different SOTA models performing generative tasks: generation from a semantic map, and cross reenactment of face expressions, showing that it is able to output images with custom lower quality thresholds. For a threshold of LPIPS <=0.1, we diminish their computations by up to a half. This is especially relevant for real-time applications such as synthesis of faces, when quality loss needs to be contained, but most of the inputs need fewer computations than the complex instances. + + + + Simultaneously Short- and Long-Term Temporal Modeling for Semi-Supervised Video Semantic Segmentation http://openaccess.thecvf.com//content/CVPR2023/papers/Lao_Simultaneously_Short-_and_Long-Term_Temporal_Modeling_for_Semi-Supervised_Video_Semantic_CVPR_2023_paper.pdf In order to tackle the video semantic segmentation task at a lower cost, e.g., with only one frame annotated per video, many efforts have been devoted to investigating the utilization of unlabeled frames by either assigning pseudo labels or performing feature enhancement. In this work, we propose a novel feature enhancement network to simultaneously model short- and long-term temporal correlation. Compared with existing work that only leverages short-term correspondence, the long-term temporal correlation obtained from distant frames can effectively expand the temporal perception field and provide a richer contextual prior. More importantly, modeling adjacent and distant frames together can alleviate the risk of over-fitting, hence producing high-quality feature representations for the distant unlabeled frames in the training set and unseen videos in the testing set. To this end, we term our method SSLTM, short for Simultaneously Short- and Long-Term Temporal Modeling. In the setting of only one frame annotated per video, SSLTM significantly outperforms the state-of-the-art methods by 2%-3% mIoU on the challenging VSPW dataset. Furthermore, when working with a pseudo-label-based method such as MeanTeacher, our final model exhibits only 0.13% mIoU less than the ceiling performance (i.e., all frames are manually annotated). + + + + Learning To Generate Text-Grounded Mask for Open-World Semantic Segmentation From Only Image-Text Pairs http://openaccess.thecvf.com//content/CVPR2023/papers/Cha_Learning_To_Generate_Text-Grounded_Mask_for_Open-World_Semantic_Segmentation_From_CVPR_2023_paper.pdf We tackle open-world semantic segmentation, which aims at learning to segment arbitrary visual concepts in images, by using only image-text pairs without dense annotations.
Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy, since it only considers image-text alignment during training, whereas segmentation requires region-text alignment during testing. In this paper, we proposed a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to directly learn region-text alignment. Our method generates a segmentation mask for a given text, extracts text-grounded image embedding from the masked region, and aligns it with text embedding via TCL. By learning region-text alignment directly, our framework encourages a model to directly improve the quality of generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with widely used 8 semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performances with large margins in all datasets. Code is available at https://github.com/kakaobrain/tcl. + + + + Shakes on a Plane: Unsupervised Depth Estimation From Unstabilized Photography + http://openaccess.thecvf.com//content/CVPR2023/papers/Chugunov_Shakes_on_a_Plane_Unsupervised_Depth_Estimation_From_Unstabilized_Photography_CVPR_2023_paper.pdf + Modern mobile burst photography pipelines capture and merge a short sequence of frames to recover an enhanced image, but often disregard the 3D nature of the scene they capture, treating pixel motion between images as a 2D aggregation problem. We show that in a "long-burst", forty-two 12-megapixel RAW frames captured in a two-second sequence, there is enough parallax information from natural hand tremor alone to recover high-quality scene depth. To this end, we devise a test-time optimization approach that fits a neural RGB-D representation to long-burst data and simultaneously estimates scene depth and camera motion. Our plane plus depth model is trained end-to-end, and performs coarse-to-fine refinement by controlling which multi-resolution volume features the network has access to at what time during training. We validate the method experimentally, and demonstrate geometrically accurate depth reconstructions with no additional hardware or separate data pre-processing and pose-estimation steps. + + + + Learning Correspondence Uncertainty via Differentiable Nonlinear Least Squares + http://openaccess.thecvf.com//content/CVPR2023/papers/Muhle_Learning_Correspondence_Uncertainty_via_Differentiable_Nonlinear_Least_Squares_CVPR_2023_paper.pdf + We propose a differentiable nonlinear least squares framework to account for uncertainty in relative pose estimation from feature correspondences. Specifically, we introduce a symmetric version of the probabilistic normal epipolar constraint, and an approach to estimate the covariance of feature positions by differentiating through the camera pose estimation procedure. We evaluate our approach on synthetic, as well as the KITTI and EuRoC real-world datasets. On the synthetic dataset, we confirm that our learned covariances accurately approximate the true noise distribution. In real world experiments, we find that our approach consistently outperforms state-of-the-art non-probabilistic and probabilistic approaches, regardless of the feature extraction algorithm of choice. 
+ + + + Towards Effective Visual Representations for Partial-Label Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Xia_Towards_Effective_Visual_Representations_for_Partial-Label_Learning_CVPR_2023_paper.pdf + Under partial-label learning (PLL) where, for each training instance, only a set of ambiguous candidate labels containing the unknown true label is accessible, contrastive learning has recently boosted the performance of PLL on vision tasks, attributed to representations learned by contrasting the same/different classes of entities. Without access to true labels, positive points are predicted using pseudolabels that are inherently noisy, and negative points often require large batches or momentum encoders, resulting in unreliable similarity information and a high computational overhead. In this paper, we rethink a state-of-the-art contrastive PLL method PiCO [24], inspiring the design of a simple framework termed PaPi (Partial-label learning with a guided Prototypical classifier), which demonstrates significant scope for improvement in representation learning, thus contributing to label disambiguation. PaPi guides the optimization of a prototypical classifier by a linear classifier with which they share the same feature encoder, thus explicitly encouraging the representation to reflect visual similarity between categories. It is also technically appealing, as PaPi requires only a few components in PiCO with the opposite direction of guidance, and directly eliminates the contrastive learning module that would introduce noise and consume computational resources. We empirically demonstrate that PaPi significantly outperforms other PLL methods on various image classification tasks. + + + + MaskCLIP: Masked Self-Distillation Advances Contrastive Language-Image Pretraining + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_MaskCLIP_Masked_Self-Distillation_Advances_Contrastive_Language-Image_Pretraining_CVPR_2023_paper.pdf + This paper presents a simple yet effective framework MaskCLIP, which incorporates a newly proposed masked self-distillation into contrastive language-image pretraining. The core idea of masked self-distillation is to distill representation from a full image to the representation predicted from a masked image. Such incorporation enjoys two vital benefits. First, masked self-distillation targets local patch representation learning, which is complementary to vision-language contrastive focusing on text-related representation. Second, masked self-distillation is also consistent with vision-language contrastive from the perspective of training objective as both utilize the visual encoder for feature aligning, and thus is able to learn local semantics getting indirect supervision from the language. We provide specially designed experiments with a comprehensive analysis to validate the two benefits. Symmetrically, we also introduce the local semantic supervision into the text branch, which further improves the pretraining performance. With extensive experiments, we show that MaskCLIP, when applied to various challenging downstream tasks, achieves superior results in linear probing, finetuning, and zero-shot performance with the guidance of the language encoder. We will release the code and data after the publication. 
+ + + + Inferring and Leveraging Parts From Object Shape for Improving Semantic Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Inferring_and_Leveraging_Parts_From_Object_Shape_for_Improving_Semantic_CVPR_2023_paper.pdf + Despite the progress in semantic image synthesis, it remains a challenging problem to generate photo-realistic parts from input semantic map. Integrating part segmentation map can undoubtedly benefit image synthesis, but is bothersome and inconvenient to be provided by users. To improve part synthesis, this paper presents to infer Parts from Object ShapE (iPOSE) and leverage it for improving semantic image synthesis. However, albeit several part segmentation datasets are available, part annotations are still not provided for many object categories in semantic image synthesis. To circumvent it, we resort to few-shot regime to learn a PartNet for predicting the object part map with the guidance of pre-defined support part maps. PartNet can be readily generalized to handle a new object category when a small number (e.g., 3) of support part maps for this category are provided. Furthermore, part semantic modulation is presented to incorporate both inferred part map and semantic map for image synthesis. Experiments show that our iPOSE not only generates objects with rich part details, but also enables to control the image synthesis flexibly. And our iPOSE performs favorably against the state-of-the-art methods in terms of quantitative and qualitative evaluation. Our code will be publicly available at https://github.com/csyxwei/iPOSE. + + + + MIME: Human-Aware 3D Scene Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yi_MIME_Human-Aware_3D_Scene_Generation_CVPR_2023_paper.pdf + Generating realistic 3D worlds occupied by moving humans has many applications in games, architecture, and synthetic data creation. But generating such scenes is expensive and labor intensive. Recent work generates human poses and motions given a 3D scene. Here, we take the opposite approach and generate 3D indoor scenes given 3D human motion. Such motions can come from archival motion capture or from IMU sensors worn on the body, effectively turning human movement in a "scanner" of the 3D world. Intuitively, human movement indicates the free-space in a room and human contact indicates surfaces or objects that support activities such as sitting, lying or touching. We propose MIME (Mining Interaction and Movement to infer 3D Environments), which is a generative model of indoor scenes that produces furniture layouts that are consistent with the human movement. MIME uses an auto-regressive transformer architecture that takes the already generated objects in the scene as well as the human motion as input, and outputs the next plausible object. To train MIME, we build a dataset by populating the 3D FRONT scene dataset with 3D humans. Our experiments show that MIME produces more diverse and plausible 3D scenes than a recent generative scene method that does not know about human movement. Code and data will be available for research. + + + + NerVE: Neural Volumetric Edges for Parametric Curve Extraction From Point Cloud + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_NerVE_Neural_Volumetric_Edges_for_Parametric_Curve_Extraction_From_Point_CVPR_2023_paper.pdf + Extracting parametric edge curves from point clouds is a fundamental problem in 3D vision and geometry processing. 
Existing approaches mainly rely on keypoint detection, a challenging procedure that tends to generate noisy output, making the subsequent edge extraction error-prone. To address this issue, we propose to directly detect structured edges to circumvent the limitations of the previous point-wise methods. We achieve this goal by presenting NerVE, a novel neural volumetric edge representation that can be easily learned through a volumetric learning framework. NerVE can be seamlessly converted to a versatile piece-wise linear (PWL) curve representation, enabling a unified strategy for learning all types of free-form curves. Furthermore, as NerVE encodes rich structural information, we show that edge extraction based on NerVE can be reduced to a simple graph search problem. After converting NerVE to the PWL representation, parametric curves can be obtained via off-the-shelf spline fitting algorithms. We evaluate our method on the challenging ABC dataset. We show that a simple network based on NerVE can already outperform the previous state-of-the-art methods by a great margin. + + + + ShapeClipper: Scalable 3D Shape Learning From Single-View Images via Geometric and CLIP-Based Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_ShapeClipper_Scalable_3D_Shape_Learning_From_Single-View_Images_via_Geometric_CVPR_2023_paper.pdf + We present ShapeClipper, a novel method that reconstructs 3D object shapes from real-world single-view RGB images. Instead of relying on laborious 3D, multi-view or camera pose annotation, ShapeClipper learns shape reconstruction from a set of single-view segmented images. The key idea is to facilitate shape learning via CLIP-based shape consistency, where we encourage objects with similar CLIP encodings to share similar shapes. We also leverage off-the-shelf normals as an additional geometric constraint so the model can learn better bottom-up reasoning of detailed surface geometry. These two novel consistency constraints, when used to regularize our model, improve its ability to learn both global shape structure and local geometric details. We evaluate our method over three challenging real-world datasets, Pix3D, Pascal3D+, and OpenImages, where we achieve superior performance over state-of-the-art methods. + + + + Backdoor Attacks Against Deep Image Compression via Adaptive Frequency Trigger + http://openaccess.thecvf.com//content/CVPR2023/papers/Tan_Backdoor_Attacks_Against_Deep_Image_Compression_via_Adaptive_Frequency_Trigger_CVPR_2023_paper.pdf + Recent deep-learning-based compression methods have achieved superior performance compared with traditional approaches. However, deep learning models have proven to be vulnerable to backdoor attacks, where some specific trigger patterns added to the input can lead to malicious behavior of the models. In this paper, we present a novel backdoor attack with multiple triggers against learned image compression models. Motivated by the widely used discrete cosine transform (DCT) in existing compression systems and standards, we propose a frequency-based trigger injection model that adds triggers in the DCT domain. In particular, we design several attack objectives for various attacking scenarios, including: 1) attacking compression quality in terms of bit-rate and reconstruction quality; 2) attacking task-driven measures, such as down-stream face recognition and semantic segmentation. 
Moreover, a novel simple dynamic loss is designed to balance the influence of different loss terms adaptively, which helps achieve more efficient training. Extensive experiments show that with our trained trigger injection models and simple modification of encoder parameters (of the compression model), the proposed attack can successfully inject several backdoors with corresponding triggers in a single image compression model. + + + + A New Path: Scaling Vision-and-Language Navigation With Synthetic Instructions and Imitation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Kamath_A_New_Path_Scaling_Vision-and-Language_Navigation_With_Synthetic_Instructions_and_CVPR_2023_paper.pdf + Recent studies in Vision-and-Language Navigation (VLN) train RL agents to execute natural-language navigation instructions in photorealistic environments, as a step towards robots that can follow human instructions. However, given the scarcity of human instruction data and limited diversity in the training environments, these agents still struggle with complex language grounding and spatial language understanding. Pre-training on large text and image-text datasets from the web has been extensively explored but the improvements are limited. We investigate large-scale augmentation with synthetic instructions. We take 500+ indoor environments captured in densely-sampled 360 degree panoramas, construct navigation trajectories through these panoramas, and generate a visually-grounded instruction for each trajectory using Marky, a high-quality multilingual navigation instruction generator. We also synthesize image observations from novel viewpoints using an image-to-image GAN. The resulting dataset of 4.2M instruction-trajectory pairs is two orders of magnitude larger than existing human-annotated datasets, and contains a wider variety of environments and viewpoints. To efficiently leverage data at this scale, we train a simple transformer agent with imitation learning. On the challenging RxR dataset, our approach outperforms all existing RL agents, improving the state-of-the-art NDTW from 71.1 to 79.1 in seen environments, and from 64.6 to 66.8 in unseen test environments. Our work points to a new path to improving instruction-following agents, emphasizing large-scale training on near-human quality synthetic instructions. + + + + Layout-Based Causal Inference for Object Navigation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Layout-Based_Causal_Inference_for_Object_Navigation_CVPR_2023_paper.pdf + Previous works for ObjectNav task attempt to learn the association (e.g. relation graph) between the visual inputs and the goal during training. Such association contains the prior knowledge of navigating in training environments, which is denoted as the experience. The experience performs a positive effect on helping the agent infer the likely location of the goal when the layout gap between the unseen environments of the test and the prior knowledge obtained in training is minor. However, when the layout gap is significant, the experience exerts a negative effect on navigation. Motivated by keeping the positive effect and removing the negative effect of the experience, we propose the layout-based soft Total Direct Effect (L-sTDE) framework based on the causal inference to adjust the prediction of the navigation policy. In particular, we propose to calculate the layout gap which is defined as the KL divergence between the posterior and the prior distribution of the object layout. 
Then the sTDE is proposed to appropriately control the effect of the experience based on the layout gap. Experimental results on AI2THOR, RoboTHOR, and Habitat demonstrate the effectiveness of our method. + + + + Pose-Disentangled Contrastive Learning for Self-Supervised Facial Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Pose-Disentangled_Contrastive_Learning_for_Self-Supervised_Facial_Representation_CVPR_2023_paper.pdf + Self-supervised facial representation has recently attracted increasing attention due to its ability to perform face understanding without relying heavily on large-scale annotated datasets. However, analytically, current contrastive-based self-supervised learning (SSL) still performs unsatisfactorily for learning facial representation. More specifically, existing contrastive learning (CL) tends to learn pose-invariant features that cannot depict the pose details of faces, compromising the learning performance. To overcome the above limitation of CL, we propose a novel Pose-disentangled Contrastive Learning (PCL) method for general self-supervised facial representation. Our PCL first devises a pose-disentangled decoder (PDD) with a delicately designed orthogonalizing regulation, which disentangles the pose-related features from the face-aware features; therefore, pose-related and other pose-unrelated facial information can be processed in individual subnetworks and do not affect each other's training. Furthermore, we introduce a pose-related contrastive learning scheme that learns pose-related information based on data augmentation of the same image, which delivers more effective face-aware representation for various downstream tasks. We conducted linear evaluation on four challenging downstream facial understanding tasks, i.e., facial expression recognition, face recognition, AU detection and head pose estimation. Experimental results demonstrate that PCL significantly outperforms cutting-edge SSL methods. Our code is available at https://github.com/DreamMr/PCL. + + + + Inverse Rendering of Translucent Objects Using Physical and Neural Renderers + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Inverse_Rendering_of_Translucent_Objects_Using_Physical_and_Neural_Renderers_CVPR_2023_paper.pdf + In this work, we propose an inverse rendering model that jointly estimates 3D shape, spatially-varying reflectance, homogeneous subsurface scattering parameters, and environment illumination from only a pair of captured images of a translucent object. In order to solve the ambiguity problem of inverse rendering, we use a physically-based renderer and a neural renderer for scene reconstruction and material editing. Because both renderers are differentiable, we can compute a reconstruction loss to assist parameter estimation. To enhance the supervision of the proposed neural renderer, we also propose an augmented loss. In addition, we use a flash and no-flash image pair as the input. To supervise the training, we constructed a large-scale synthetic dataset of translucent objects, which consists of 117K scenes. Qualitative and quantitative results on both synthetic and real-world datasets demonstrate the effectiveness of the proposed model.
+ + + + Towards Building Self-Aware Object Detectors via Reliable Uncertainty Quantification and Calibration + http://openaccess.thecvf.com//content/CVPR2023/papers/Oksuz_Towards_Building_Self-Aware_Object_Detectors_via_Reliable_Uncertainty_Quantification_and_CVPR_2023_paper.pdf + The current approach for testing the robustness of object detectors suffers from serious deficiencies such as improper methods of performing out-of-distribution detection and using calibration metrics which do not consider both localisation and classification quality. In this work, we address these issues, and introduce the Self Aware Object Detection (SAOD) task, a unified testing framework which respects and adheres to the challenges that object detectors face in safety-critical environments such as autonomous driving. Specifically, the SAOD task requires an object detector to be: robust to domain shift; obtain reliable uncertainty estimates for the entire scene; and provide calibrated confidence scores for the detections. We extensively use our framework, which introduces novel metrics and large scale test datasets, to test numerous object detectors in two different use-cases, allowing us to highlight critical insights into their robustness performance. Finally, we introduce a simple baseline for the SAOD task, enabling researchers to benchmark future proposed methods and move towards robust object detectors which are fit for purpose. Code is available at: https://github.com/fiveai/saod + + + + Source-Free Video Domain Adaptation With Spatial-Temporal-Historical Consistency Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Source-Free_Video_Domain_Adaptation_With_Spatial-Temporal-Historical_Consistency_Learning_CVPR_2023_paper.pdf + Source-free domain adaptation (SFDA) is an emerging research topic that studies how to adapt a pretrained source model using unlabeled target data. It is derived from unsupervised domain adaptation but has the advantage of not requiring labeled source data to learn adaptive models. This makes it particularly useful in real-world applications where access to source data is restricted. While there has been some SFDA work for images, little attention has been paid to videos. Naively extending image-based methods to videos without considering the unique properties of videos often leads to unsatisfactory results. In this paper, we propose a simple and highly flexible method for Source-Free Video Domain Adaptation (SFVDA), which extensively exploits consistency learning for videos from spatial, temporal, and historical perspectives. Our method is based on the assumption that videos of the same action category are drawn from the same low-dimensional space, regardless of the spatio-temporal variations in the high-dimensional space that cause domain shifts. To overcome domain shifts, we simulate spatio-temporal variations by applying spatial and temporal augmentations on target videos, and encourage the model to make consistent predictions from a video and its augmented versions. Due to the simple design, our method can be applied to various SFVDA settings, and experiments show that our method achieves state-of-the-art performance for all the settings. 
+ + + + Fusing Pre-Trained Language Models With Multimodal Prompts Through Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Fusing_Pre-Trained_Language_Models_With_Multimodal_Prompts_Through_Reinforcement_Learning_CVPR_2023_paper.pdf + Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), and larger models like GPT-3 manifest broad commonsense reasoning capacity. Can their knowledge be extended to multimodal inputs such as images and audio without paired domain data? In this work, we propose ESPER (Extending Sensory PErception with Reinforcement learning), which enables text-only pretrained models to address multimodal tasks such as visual commonsense reasoning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, our reward optimization relies only on cosine similarity derived from CLIP and requires no additional paired (image, text) data. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of multimodal text generation tasks ranging from captioning to commonsense reasoning; these include a new benchmark we collect and release, the ESP dataset, which tasks models with generating text in several different domains for each image. Our code and data are publicly released at https://github.com/JiwanChung/esper. + + + + Dense Network Expansion for Class Incremental Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Dense_Network_Expansion_for_Class_Incremental_Learning_CVPR_2023_paper.pdf + The problem of class incremental learning (CIL) is considered. State-of-the-art approaches use a dynamic architecture based on network expansion (NE), in which a task expert is added per task. While effective from a computational standpoint, these methods lead to models that grow quickly with the number of tasks. A new NE method, dense network expansion (DNE), is proposed to achieve a better trade-off between accuracy and model complexity. This is accomplished by the introduction of dense connections between the intermediate layers of the task expert networks, which enable the transfer of knowledge from old to new tasks via feature sharing and reusing. This sharing is implemented with a cross-task attention mechanism, based on a new task attention block (TAB), which fuses information across tasks. Unlike traditional attention mechanisms, TAB operates at the level of feature mixing and is decoupled from spatial attention. This is shown to be more effective than a joint spatial-and-task attention for CIL. The proposed DNE approach can strictly maintain the feature space of old classes while growing the network and feature scale at a much slower rate than previous methods. As a result, it outperforms the previous SOTA methods by a margin of 4% in terms of accuracy, with a similar or even smaller model scale. + + + + Regularize Implicit Neural Representation by Itself + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Regularize_Implicit_Neural_Representation_by_Itself_CVPR_2023_paper.pdf + This paper proposes a regularizer called Implicit Neural Representation Regularizer (INRR) to improve the generalization ability of the Implicit Neural Representation (INR). The INR is a fully connected network that can represent signals with details not restricted by grid resolution.
However, its generalization ability could be improved, especially with non-uniformly sampled data. The proposed INRR is based on learned Dirichlet Energy (DE) that measures similarities between rows/columns of the matrix. The smoothness of the Laplacian matrix is further integrated by parameterizing DE with a tiny INR. INRR improves the generalization of INR in signal representation by perfectly integrating the signal's self-similarity with the smoothness of the Laplacian matrix. Through well-designed numerical experiments, the paper also reveals a series of properties derived from INRR, including momentum methods like convergence trajectory and multi-scale similarity. Moreover, the proposed method could improve the performance of other signal representation methods. + + + + Ambiguous Medical Image Segmentation Using Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Rahman_Ambiguous_Medical_Image_Segmentation_Using_Diffusion_Models_CVPR_2023_paper.pdf + Collective insights from a group of experts have always proven to outperform an individual's best diagnostic for clinical tasks. For the task of medical image segmentation, existing research on AI-based alternatives focuses more on developing models that can imitate the best individual rather than harnessing the power of expert groups. In this paper, we introduce a single diffusion model-based approach that produces multiple plausible outputs by learning a distribution over group insights. Our proposed model generates a distribution of segmentation masks by leveraging the inherent stochastic sampling process of diffusion using only minimal additional learning. We demonstrate on three different medical image modalities- CT, ultrasound, and MRI that our model is capable of producing several possible variants while capturing the frequencies of their occurrences. Comprehensive results show that our proposed approach outperforms existing state-of-the-art ambiguous segmentation networks in terms of accuracy while preserving naturally occurring variation. We also propose a new metric to evaluate the diversity as well as the accuracy of segmentation predictions that aligns with the interest of clinical practice of collective insights. Implementation code will be released publicly after the review process. + + + + DANI-Net: Uncalibrated Photometric Stereo by Differentiable Shadow Handling, Anisotropic Reflectance Modeling, and Neural Inverse Rendering + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_DANI-Net_Uncalibrated_Photometric_Stereo_by_Differentiable_Shadow_Handling_Anisotropic_Reflectance_CVPR_2023_paper.pdf + Uncalibrated photometric stereo (UPS) is challenging due to the inherent ambiguity brought by the unknown light. Although the ambiguity is alleviated on non-Lambertian objects, the problem is still difficult to solve for more general objects with complex shapes introducing irregular shadows and general materials with complex reflectance like anisotropic reflectance. To exploit cues from shadow and reflectance to solve UPS and improve performance on general materials, we propose DANI-Net, an inverse rendering framework with differentiable shadow handling and anisotropic reflectance modeling. Unlike most previous methods that use non-differentiable shadow maps and assume isotropic material, our network benefits from cues of shadow and anisotropic reflectance through two differentiable paths. Experiments on multiple real-world datasets demonstrate our superior and robust performance. 
+ + + + Towards Better Stability and Adaptability: Improve Online Self-Training for Model Adaptation in Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Towards_Better_Stability_and_Adaptability_Improve_Online_Self-Training_for_Model_CVPR_2023_paper.pdf + Unsupervised domain adaptation (UDA) in semantic segmentation transfers the knowledge of the source domain to the target one to improve the adaptability of the segmentation model in the target domain. The need to access labeled source data makes UDA unable to handle adaptation scenarios involving privacy, property rights protection, and confidentiality. In this paper, we focus on unsupervised model adaptation (UMA), also called source-free domain adaptation, which adapts a source-trained model to the target domain without accessing source data. We find that the online self-training method has the potential to be deployed in UMA, but the lack of source domain loss will greatly weaken the stability and adaptability of the method. We analyze the two possible reasons for the degradation of online self-training, i.e. inopportune updates of the teacher model and biased knowledge from source-trained model. Based on this, we propose a dynamic teacher update mechanism and a training-consistency based resampling strategy to improve the stability and adaptability of online self training. On multiple model adaptation benchmarks, our method obtains new state-of-the-art performance, which is comparable or even better than state-of-the-art UDA methods. + + + + Ranking Regularization for Critical Rare Classes: Minimizing False Positives at a High True Positive Rate + http://openaccess.thecvf.com//content/CVPR2023/papers/Mohammadi_Ranking_Regularization_for_Critical_Rare_Classes_Minimizing_False_Positives_at_CVPR_2023_paper.pdf + In many real-world settings, the critical class is rare and a missed detection carries a disproportionately high cost. For example, tumors are rare and a false negative diagnosis could have severe consequences on treatment outcomes; fraudulent banking transactions are rare and an undetected occurrence could result in significant losses or legal penalties. In such contexts, systems are often operated at a high true positive rate, which may require tolerating high false positives. In this paper, we present a novel approach to address the challenge of minimizing false positives for systems that need to operate at a high true positive rate. We propose a ranking-based regularization (RankReg) approach that is easy to implement, and show empirically that it not only effectively reduces false positives, but also complements conventional imbalanced learning losses. With this novel technique in hand, we conduct a series of experiments on three broadly explored datasets (CIFAR-10&100 and Melanoma) and show that our approach lifts the previous state-of-the-art performance by notable margins. + + + + Joint HDR Denoising and Fusion: A Real-World Mobile HDR Image Dataset + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Joint_HDR_Denoising_and_Fusion_A_Real-World_Mobile_HDR_Image_CVPR_2023_paper.pdf + Mobile phones have become a ubiquitous and indispensable photographing device in our daily life, while the small aperture and sensor size make mobile phones more susceptible to noise and over-saturation, resulting in low dynamic range (LDR) and low image quality. It is thus crucial to develop high dynamic range (HDR) imaging techniques for mobile phones. 
Unfortunately, the existing HDR image datasets are mostly constructed by DSLR cameras in daytime, limiting their applicability to the study of HDR imaging for mobile phones. In this work, we develop, for the first time to our best knowledge, an HDR image dataset by using mobile phone cameras, namely Mobile-HDR dataset. Specifically, we utilize three mobile phone cameras to collect paired LDR-HDR images in the raw image domain, covering both daytime and nighttime scenes with different noise levels. We then propose a transformer based model with a pyramid cross-attention alignment module to aggregate highly correlated features from different exposure frames to perform joint HDR denoising and fusion. Experiments validate the advantages of our dataset and our method on mobile HDR imaging. Dataset and codes are available at https://github.com/shuaizhengliu/Joint-HDRDN. + + + + MIST: Multi-Modal Iterative Spatial-Temporal Transformer for Long-Form Video Question Answering + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_MIST_Multi-Modal_Iterative_Spatial-Temporal_Transformer_for_Long-Form_Video_Question_Answering_CVPR_2023_paper.pdf + To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, using a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions that are closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. The experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior at computation efficiency and interpretability. + + + + Privacy-Preserving Representations Are Not Enough: Recovering Scene Content From Camera Poses + http://openaccess.thecvf.com//content/CVPR2023/papers/Chelani_Privacy-Preserving_Representations_Are_Not_Enough_Recovering_Scene_Content_From_Camera_CVPR_2023_paper.pdf + Visual localization is the task of estimating the camera pose from which a given image was taken and is central to several 3D computer vision applications. With the rapid growth in the popularity of AR/VR/MR devices and cloud-based applications, privacy issues are becoming a very important aspect of the localization process. Existing work on privacy-preserving localization aims to defend against an attacker who has access to a cloud-based service. In this paper, we show that an attacker can learn about details of a scene without any access by simply querying a localization service. 
The attack is based on the observation that modern visual localization algorithms are robust to variations in appearance and geometry. While this is in general a desired property, it also leads to algorithms localizing objects that are similar enough to those present in a scene. An attacker can thus query a server with a large enough set of images of objects, e.g., obtained from the Internet, and some of them will be localized. The attacker can thus learn about object placements from the camera poses returned by the service (which is the minimal information returned by such a service). In this paper, we develop a proof-of-concept version of this attack and demonstrate its practical feasibility. The attack does not place any requirements on the localization algorithm used, and thus also applies to privacy-preserving representations. Current work on privacy-preserving representations alone is thus insufficient. + + + + A New Dataset Based on Images Taken by Blind People for Testing the Robustness of Image Classification Models Trained for ImageNet Categories + http://openaccess.thecvf.com//content/CVPR2023/papers/Bafghi_A_New_Dataset_Based_on_Images_Taken_by_Blind_People_CVPR_2023_paper.pdf + Our goal is to improve upon the status quo for designing image classification models trained in one domain that perform well on images from another domain. Complementing existing work in robustness testing, we introduce the first dataset for this purpose which comes from an authentic use case where photographers wanted to learn about the content in their images. We built a new test set using 8,900 images taken by people who are blind for which we collected metadata to indicate the presence versus absence of 200 ImageNet object categories. We call this dataset VizWiz-Classification. We characterize this dataset and how it compares to the mainstream datasets for evaluating how well ImageNet-trained classification models generalize. Finally, we analyze the performance of 100 ImageNet classification models on our new test dataset. Our fine-grained analysis demonstrates that these models struggle on images with quality issues. To enable future extensions to this work, we share our new dataset with evaluation server at: https://vizwiz.org/tasks-and-datasets/image-classification + + + + Detecting Backdoors During the Inference Stage Based on Corruption Robustness Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Detecting_Backdoors_During_the_Inference_Stage_Based_on_Corruption_Robustness_CVPR_2023_paper.pdf + Deep neural networks are proven to be vulnerable to backdoor attacks. Detecting the trigger samples during the inference stage, i.e., the test-time trigger sample detection, can prevent the backdoor from being triggered. However, existing detection methods often require the defenders to have high accessibility to victim models, extra clean data, or knowledge about the appearance of backdoor triggers, limiting their practicality. In this paper, we propose the test-time corruption robustness consistency evaluation (TeCo), a novel test-time trigger sample detection method that only needs the hard-label outputs of the victim models without any extra information. Our journey begins with the intriguing observation that the backdoor-infected models have similar performance across different image corruptions for the clean images, but perform discrepantly for the trigger samples. 
Based on this phenomenon, we design TeCo to evaluate test-time robustness consistency by calculating the deviation of severity that leads to predictions' transition across different corruptions. Extensive experiments demonstrate that compared with state-of-the-art defenses, which even require either certain information about the trigger types or accessibility of clean data, TeCo outperforms them on different backdoor attacks, datasets, and model architectures, enjoying a higher AUROC by 10% and 5 times of stability. The code is available at https://github.com/CGCL-codes/TeCo + + + + Black-Box Sparse Adversarial Attack via Multi-Objective Optimisation + http://openaccess.thecvf.com//content/CVPR2023/papers/Williams_Black-Box_Sparse_Adversarial_Attack_via_Multi-Objective_Optimisation_CVPR_2023_paper.pdf + Deep neural networks (DNNs) are susceptible to adversarial images, raising concerns about their reliability in safety-critical tasks. Sparse adversarial attacks, which limit the number of modified pixels, have shown to be highly effective in causing DNNs to misclassify. However, existing methods often struggle to simultaneously minimize the number of modified pixels and the size of the modifications, often requiring a large number of queries and assuming unrestricted access to the targeted DNN. In contrast, other methods that limit the number of modified pixels often permit unbounded modifications, making them easily detectable. To address these limitations, we propose a novel multi-objective sparse attack algorithm that efficiently minimizes the number of modified pixels and their size during the attack process. Our algorithm draws inspiration from evolutionary computation and incorporates a mechanism for prioritizing objectives that aligns with an attacker's goals. Our approach outperforms existing sparse attacks on CIFAR-10 and ImageNet trained DNN classifiers while requiring only a small query budget, attaining competitive attack success rates while perturbing fewer pixels. Overall, our proposed attack algorithm provides a solution to the limitations of current sparse attack methods by jointly minimizing the number of modified pixels and their size. Our results demonstrate the effectiveness of our approach in restricted scenarios, highlighting its potential to enhance DNN security. + + + + Renderable Neural Radiance Map for Visual Navigation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kwon_Renderable_Neural_Radiance_Map_for_Visual_Navigation_CVPR_2023_paper.pdf + We propose a novel type of map for visual navigation, a renderable neural radiance map (RNR-Map), which is designed to contain the overall visual information of a 3D environment. The RNR-Map has a grid form and consists of latent codes at each pixel. These latent codes are embedded from image observations, and can be converted to the neural radiance field which enables image rendering given a camera pose. The recorded latent codes implicitly contain visual information about the environment, which makes the RNR-Map visually descriptive. This visual information in RNR-Map can be a useful guideline for visual localization and navigation. We develop localization and navigation frameworks that can effectively utilize the RNR-Map. We evaluate the proposed frameworks on camera tracking, visual localization, and image-goal navigation. 
Experimental results show that the RNR-Map-based localization framework can find the target location based on a single query image with fast speed and competitive accuracy compared to other baselines. Also, this localization framework is robust to environmental changes, and even finds the most visually similar places when a query image from a different environment is given. The proposed navigation framework outperforms the existing image-goal navigation methods in difficult scenarios, under odometry and actuation noises. The navigation framework shows 65.7% success rate in curved scenarios of the NRNS dataset, which is an improvement of 18.6% over the current state-of-the-art. Project page: https://rllab-snu.github.io/projects/RNR-Map/ + + + + Learning Orthogonal Prototypes for Generalized Few-Shot Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Learning_Orthogonal_Prototypes_for_Generalized_Few-Shot_Semantic_Segmentation_CVPR_2023_paper.pdf + Generalized few-shot semantic segmentation (GFSS) distinguishes pixels of base and novel classes from the background simultaneously, conditioning on sufficient data of base classes and a few examples from novel class. A typical GFSS approach has two training phases: base class learning and novel class updating. Nevertheless, such a stand-alone updating process often compromises the well-learnt features and results in performance drop on base classes. In this paper, we propose a new idea of leveraging Projection onto Orthogonal Prototypes (POP), which updates features to identify novel classes without compromising base classes. POP builds a set of orthogonal prototypes, each of which represents a semantic class, and makes the prediction for each class separately based on the features projected onto its prototype. Technically, POP first learns prototypes on base data, and then extends the prototype set to novel classes. The orthogonal constraint of POP encourages the orthogonality between the learnt prototypes and thus mitigates the influence on base class features when generalizing to novel prototypes. Moreover, we capitalize on the residual of feature projection as the background representation to dynamically fit semantic shifting (i.e., background no longer includes the pixels of novel classes in updating phase). Extensive experiments on two benchmarks demonstrate that our POP achieves superior performances on novel classes without sacrificing much accuracy on base classes. Notably, POP outperforms the state-of-the-art fine-tuning by 3.93% overall mIoU on PASCAL-5i in 5-shot scenario. + + + + Are Deep Neural Networks SMARTer Than Second Graders? + http://openaccess.thecvf.com//content/CVPR2023/papers/Cherian_Are_Deep_Neural_Networks_SMARTer_Than_Second_Graders_CVPR_2023_paper.pdf + Recent times have witnessed an increasing number of applications of deep neural networks towards solving tasks that require superior cognitive abilities, e.g., playing Go, generating art, question answering (such as ChatGPT), etc. Such a dramatic progress raises the question: how generalizable are neural networks in solving problems that demand broad skills? To answer this question, we propose SMART: a Simple Multimodal Algorithmic Reasoning Task and the associated SMART-101 dataset, for evaluating the abstraction, deduction, and generalization abilities of neural networks in solving visuo-linguistic puzzles designed specifically for children in the 6--8 age group. 
Our dataset consists of 101 unique puzzles; each puzzle comprises a picture and a question, and their solution needs a mix of several elementary skills, including arithmetic, algebra, and spatial reasoning, among others. To scale our dataset towards training deep neural networks, we programmatically generate entirely new instances for each puzzle while retaining their solution algorithm. To benchmark the performance on the SMART-101 dataset, we propose a vision-and-language meta-learning model that can incorporate varied state-of-the-art neural backbones. Our experiments reveal that while powerful deep models offer reasonable performances on puzzles in a supervised setting, they are not better than random accuracy when analyzed for generalization -- filling this gap may demand new multimodal learning approaches. + + + + Bi-Level Meta-Learning for Few-Shot Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Qin_Bi-Level_Meta-Learning_for_Few-Shot_Domain_Generalization_CVPR_2023_paper.pdf + The goal of few-shot learning is to learn the generalizability from seen to unseen data with only a few samples. Most previous few-shot learning focus on learning generalizability within particular domains. However, the more practical scenarios may also require generalizability across domains. In this paper, we study the problem of Few-shot domain generalization (FSDG), which is a more challenging variant of few-shot classification. FSDG requires additional generalization with larger gap from seen domains to unseen domains. We address FSDG problem by meta-learning two levels of meta-knowledge, where the lower-level meta-knowledge are domain-specific embedding spaces as subspaces of a base space for intra-domain generalization, and the upper-level meta-knowledge is the base space and a prior subspace over domain-specific spaces for inter-domain generalization. We formulate the two levels of meta-knowledge learning problem with bi-level optimization, and further develop an optimization algorithm without Hessian information to solve it. We demonstrate our method is significantly superior to the previous works by evaluating it on the widely used benchmark Meta-Dataset. + + + + Multi-Modal Learning With Missing Modality via Shared-Specific Feature Modelling + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Multi-Modal_Learning_With_Missing_Modality_via_Shared-Specific_Feature_Modelling_CVPR_2023_paper.pdf + The missing modality issue is critical but non-trivial to be solved by multi-modal models. Current methods aiming to handle the missing modality problem in multi-modal tasks, either deal with missing modalities only during evaluation or train separate models to handle specific missing modality settings. In addition, these models are designed for specific tasks, so for example, classification models are not easily adapted to segmentation tasks and vice versa. In this paper, we propose the Shared-Specific Feature Modelling (ShaSpec) method that is considerably simpler and more effective than competing approaches that address the issues above. ShaSpec is designed to take advantage of all available input modalities during training and evaluation by learning shared and specific features to better represent the input data. This is achieved from a strategy that relies on auxiliary tasks based on distribution alignment and domain classification, in addition to a residual feature fusion procedure. 
Also, the design simplicity of ShaSpec enables its easy adaptation to multiple tasks, such as classification and segmentation. Experiments are conducted on both medical image segmentation and computer vision classification, with results indicating that ShaSpec outperforms competing methods by a large margin. For instance, on BraTS2018, ShaSpec improves the SOTA by more than 3% for enhancing tumour, 5% for tumour core and 3% for whole tumour. + + + + DisWOT: Student Architecture Search for Distillation WithOut Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_DisWOT_Student_Architecture_Search_for_Distillation_WithOut_Training_CVPR_2023_paper.pdf + Knowledge distillation (KD) is an effective training strategy to improve lightweight student models under the guidance of cumbersome teachers. However, the large architecture difference across teacher-student pairs limits the distillation gains. In contrast to previous adaptive distillation methods that reduce the teacher-student gap, we explore a novel training-free framework to search for the best student architectures for a given teacher. Our work first empirically shows that the optimal model under vanilla training cannot be the winner in distillation. Secondly, we find that the similarity of feature semantics and sample relations between randomly initialized teacher-student networks has good correlation with final distillation performance. Thus, we efficiently measure similarity matrices conditioned on the semantic activation maps to select the optimal student via an evolutionary algorithm without any training. In this way, our student architecture search for Distillation WithOut Training (DisWOT) significantly improves the performance of the model in the distillation stage with at least 180x training acceleration. Additionally, we extend the similarity metrics in DisWOT as new distillers and KD-based zero-proxies. Our experiments on CIFAR, ImageNet and NAS-Bench-201 demonstrate that our technique achieves state-of-the-art results on different search spaces. Our project and code are available at https://lilujunai.github.io/DisWOT-CVPR2023/. + + + + Logical Consistency and Greater Descriptive Power for Facial Hair Attribute Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Logical_Consistency_and_Greater_Descriptive_Power_for_Facial_Hair_Attribute_CVPR_2023_paper.pdf + Face attribute research has so far used only simple binary attributes for facial hair; e.g., beard / no beard. We have created a new, more descriptive facial hair annotation scheme and applied it to create a new facial hair attribute dataset, FH37K. Face attribute research has also so far not dealt with logical consistency and completeness. For example, in prior research, an image might be classified as both having no beard and also having a goatee (a type of beard). We show that the test accuracy of previous classification methods on facial hair attribute classification drops significantly if logical consistency of classifications is enforced. We propose a logically consistent prediction loss, LCPLoss, to aid learning of logical consistency across attributes, and also a label compensation training strategy to eliminate the problem of no positive prediction across a set of related attributes. Using an attribute classifier trained on FH37K, we investigate how facial hair affects face recognition accuracy, including variation across demographics.
Results show that similarity and difference in facial hairstyle have important effects on the impostor and genuine score distributions in face recognition. The code is at https:// github.com/ HaiyuWu/ facial hair logical. + + + + Spatio-Temporal Pixel-Level Contrastive Learning-Based Source-Free Domain Adaptation for Video Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Lo_Spatio-Temporal_Pixel-Level_Contrastive_Learning-Based_Source-Free_Domain_Adaptation_for_Video_Semantic_CVPR_2023_paper.pdf + Unsupervised Domain Adaptation (UDA) of semantic segmentation transfers labeled source knowledge to an unlabeled target domain by relying on accessing both the source and target data. However, the access to source data is often restricted or infeasible in real-world scenarios. Under the source data restrictive circumstances, UDA is less practical. To address this, recent works have explored solutions under the Source-Free Domain Adaptation (SFDA) setup, which aims to adapt a source-trained model to the target domain without accessing source data. Still, existing SFDA approaches use only image-level information for adaptation, making them sub-optimal in video applications. This paper studies SFDA for Video Semantic Segmentation (VSS), where temporal information is leveraged to address video adaptation. Specifically, we propose Spatio-Temporal Pixel-Level (STPL) contrastive learning, a novel method that takes full advantage of spatio-temporal information to tackle the absence of source data better. STPL explicitly learns semantic correlations among pixels in the spatio-temporal space, providing strong self-supervision for adaptation to the unlabeled target domain. Extensive experiments show that STPL achieves state-of-the-art performance on VSS benchmarks compared to current UDA and SFDA approaches. Code is available at: https://github.com/shaoyuanlo/STPL + + + + InternImage: Exploring Large-Scale Vision Foundation Models With Deformable Convolutions + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_InternImage_Exploring_Large-Scale_Vision_Foundation_Models_With_Deformable_Convolutions_CVPR_2023_paper.pdf + Compared to the great progress of large-scale vision transformers (ViTs) in recent years, large-scale models based on convolutional neural networks (CNNs) are still in an early state. This work presents a new large-scale CNN-based foundation model, termed InternImage, which can obtain the gain from increasing parameters and training data like ViTs. Different from the recent CNNs that focus on large dense kernels, InternImage takes deformable convolution as the core operator, so that our model not only has the large effective receptive field required for downstream tasks such as detection and segmentation, but also has the adaptive spatial aggregation conditioned by input and task information. As a result, the proposed InternImage reduces the strict inductive bias of traditional CNNs and makes it possible to learn stronger and more robust patterns with large-scale parameters from massive data like ViTs. The effectiveness of our model is proven on challenging benchmarks including ImageNet, COCO, and ADE20K. It is worth mentioning that InternImage-H achieved a new record 65.4 mAP on COCO test-dev and 62.9 mIoU on ADE20K, outperforming current leading CNNs and ViTs. 
+ + + + DAA: A Delta Age AdaIN Operation for Age Estimation via Binary Code Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_DAA_A_Delta_Age_AdaIN_Operation_for_Age_Estimation_via_CVPR_2023_paper.pdf + Naked eye recognition of age is usually based on comparison with the age of others. However, this idea is ignored by computer tasks because it is difficult to obtain representative contrast images of each age. Inspired by the transfer learning, we designed the Delta Age AdaIN (DAA) operation to obtain the feature difference with each age, which obtains the style map of each age through the learned values representing the mean and standard deviation. We let the input of transfer learning as the binary code of age natural number to obtain continuous age feature information. The learned two groups of values in Binary code mapping are corresponding to the mean and standard deviation of the comparison ages. In summary, our method consists of four parts: FaceEncoder, DAA operation, Binary code mapping, and AgeDecoder modules. After getting the delta age via AgeDecoder, we take the average value of all comparison ages and delta ages as the predicted age. Compared with state-of-the-art methods, our method achieves better performance with fewer parameters on multiple facial age datasets. Code is available at https://github.com/redcping/Delta_Age_AdaIN + + + + Mind the Label Shift of Augmentation-Based Graph OOD Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Mind_the_Label_Shift_of_Augmentation-Based_Graph_OOD_Generalization_CVPR_2023_paper.pdf + Out-of-distribution (OOD) generalization is an important issue for Graph Neural Networks (GNNs). Recent works employ different graph editions to generate augmented environments and learn an invariant GNN for generalization. However, the graph structural edition inevitably alters the graph label. This causes the label shift in augmentations and brings inconsistent predictive relationships among augmented environments. To address this issue, we propose LiSA, which generates label-invariant augmentations to facilitate graph OOD generalization. Instead of resorting to graph editions, LiSA exploits Label-invariant Subgraphs of the training graphs to construct Augmented environments. Specifically, LiSA first designs the variational subgraph generators to efficiently extract locally predictive patterns and construct multiple label-invariant subgraphs. Then, the subgraphs produced by different generators are collected to build different augmented environments. To promote diversity among augmented environments, LiSA further introduces a tractable energy-based regularization to enlarge pair-wise distances between the distributions of environments. In this manner, LiSA generates diverse augmented environments with a consistent predictive relationship to facilitate learning an invariant GNN. Extensive experiments on node-level and graph-level OOD benchmarks show that LiSA achieves impressive generalization performance with different GNN backbones. Code is available on https://github.com/Samyu0304/LiSA. + + + + Unsupervised Intrinsic Image Decomposition With LiDAR Intensity + http://openaccess.thecvf.com//content/CVPR2023/papers/Sato_Unsupervised_Intrinsic_Image_Decomposition_With_LiDAR_Intensity_CVPR_2023_paper.pdf + Intrinsic image decomposition (IID) is the task that decomposes a natural image into albedo and shade. 
While IID is typically solved through supervised learning methods, it is not ideal due to the difficulty in observing ground truth albedo and shade in general scenes. Conversely, unsupervised learning methods currently underperform supervised learning methods since there are no criteria for solving the ill-posed problem. Recently, light detection and ranging (LiDAR) has been widely used due to its ability to make highly precise distance measurements. Thus, we have focused on the utilization of LiDAR, especially LiDAR intensity, to address this issue. In this paper, we propose unsupervised intrinsic image decomposition with LiDAR intensity (IID-LI). Since the conventional unsupervised learning methods consist of image-to-image transformations, simply inputting LiDAR intensity is not an effective approach. Therefore, we design an intensity consistency loss that computes the error between LiDAR intensity and gray-scaled albedo to provide a criterion for the ill-posed problem. In addition, LiDAR intensity is difficult to handle due to its sparsity and occlusion; hence, a LiDAR intensity densification module is proposed. We verified the estimation quality using our own dataset, which includes RGB images, LiDAR intensity and human-judged annotations. As a result, we achieved an estimation accuracy that outperforms conventional unsupervised learning methods. + + + + PET-NeuS: Positional Encoding Tri-Planes for Neural Surfaces + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_PET-NeuS_Positional_Encoding_Tri-Planes_for_Neural_Surfaces_CVPR_2023_paper.pdf + A signed distance function (SDF) parametrized by an MLP is a common ingredient of neural surface reconstruction. We build on the successful recent method NeuS to extend it by three new components. The first component is to borrow the tri-plane representation from EG3D and represent signed distance fields as a mixture of tri-planes and MLPs instead of representing them with MLPs only. Using tri-planes leads to a more expressive data structure but will also introduce noise in the reconstructed surface. The second component is to use a new type of positional encoding with learnable weights to combat noise in the reconstruction process. We divide the features in the tri-plane into multiple frequency scales and modulate them with sin and cos functions of different frequencies. The third component is to use learnable convolution operations on the tri-plane features using self-attention convolution to produce features with different frequency bands. The experiments show that PET-NeuS achieves high-fidelity surface reconstruction on standard datasets. Following previous work and using the Chamfer metric as the most important way to measure surface reconstruction quality, we are able to improve upon the NeuS baseline by 57% on Nerf-synthetic (0.84 compared to 1.97) and by 15.5% on DTU (0.71 compared to 0.84). The qualitative evaluation reveals how our method can better control the interference of high-frequency noise. + + + + ZegCLIP: Towards Adapting CLIP for Zero-Shot Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_ZegCLIP_Towards_Adapting_CLIP_for_Zero-Shot_Semantic_Segmentation_CVPR_2023_paper.pdf + Recently, CLIP has been applied to pixel-level zero-shot learning tasks via a two-stage scheme. The general idea is to first generate class-agnostic region proposals and then feed the cropped proposal regions to CLIP to utilize its image-level zero-shot classification capability.
While effective, such a scheme requires two image encoders, one for proposal generation and one for CLIP, leading to a complicated pipeline and high computational cost. In this work, we pursue a simpler and more efficient one-stage solution that directly extends CLIP's zero-shot prediction capability from the image level to the pixel level. Our investigation starts with a straightforward extension as our baseline that generates semantic masks by comparing the similarity between text and patch embeddings extracted from CLIP. However, such a paradigm could heavily overfit the seen classes and fail to generalize to unseen classes. To handle this issue, we propose three simple but effective designs and find that they can significantly retain the inherent zero-shot capacity of CLIP and improve pixel-level generalization ability. Incorporating those modifications leads to an efficient zero-shot semantic segmentation system called ZegCLIP. Through extensive experiments on three public benchmarks, ZegCLIP demonstrates superior performance, outperforming the state-of-the-art methods by a large margin under both "inductive" and "transductive" zero-shot settings. In addition, compared with the two-stage method, our one-stage ZegCLIP is about 5 times faster during inference. We release the code at https://github.com/ZiqinZhou66/ZegCLIP.git. + + + + AdaptiveMix: Improving GAN Training via Feature Space Shrinkage + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_AdaptiveMix_Improving_GAN_Training_via_Feature_Space_Shrinkage_CVPR_2023_paper.pdf + Due to their outstanding capability for data generation, Generative Adversarial Networks (GANs) have attracted considerable attention in unsupervised learning. However, training GANs is difficult, since the training distribution is dynamic for the discriminator, leading to unstable image representation. In this paper, we address the problem of training GANs from a novel perspective, i.e., robust image classification. Motivated by studies on robust image representation, we propose a simple yet effective module, namely AdaptiveMix, for GANs, which shrinks the regions of training data in the image representation space of the discriminator. Considering it is intractable to directly bound the feature space, we propose to construct hard samples and narrow down the feature distance between hard and easy samples. The hard samples are constructed by mixing a pair of training images. We evaluate the effectiveness of our AdaptiveMix with widely-used and state-of-the-art GAN architectures. The evaluation results demonstrate that our AdaptiveMix can facilitate the training of GANs and effectively improve the image quality of generated samples. We also show that our AdaptiveMix can be further applied to image classification and Out-Of-Distribution (OOD) detection tasks by equipping it with state-of-the-art methods. Extensive experiments on seven publicly available datasets show that our method effectively boosts the performance of baselines. The code is publicly available at https://github.com/WentianZhang-ML/AdaptiveMix.
+ + + + Specialist Diffusion: Plug-and-Play Sample-Efficient Fine-Tuning of Text-to-Image Diffusion Models To Learn Any Unseen Style + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Specialist_Diffusion_Plug-and-Play_Sample-Efficient_Fine-Tuning_of_Text-to-Image_Diffusion_Models_To_CVPR_2023_paper.pdf + Diffusion models have demonstrated an impressive capability for text-conditioned image synthesis, and broader application horizons are emerging by personalizing those pretrained diffusion models toward generating some specialized target object or style. In this paper, we aim to learn an unseen style by simply fine-tuning a pre-trained diffusion model with a handful of images (e.g., less than 10), so that the fine-tuned model can generate high-quality images of arbitrary objects in this style. Such extremely low-shot fine-tuning is accomplished by a novel toolkit of fine-tuning techniques, including text-to-image customized data augmentations, a content loss to facilitate content-style disentanglement, and sparse updating that focuses on only a few time steps. Our framework, dubbed Specialist Diffusion, is plug-and-play with existing diffusion model backbones and other personalization techniques. We demonstrate that it outperforms the latest few-shot personalization alternatives of diffusion models, such as Textual Inversion and DreamBooth, in terms of learning highly sophisticated styles with ultra-sample-efficient tuning. We further show that Specialist Diffusion can be integrated on top of textual inversion to boost performance further, even on highly unusual styles. Our codes are available at: https://github.com/Picsart-AI-Research/Specialist-Diffusion + + + + HyperCUT: Video Sequence From a Single Blurry Image Using Unsupervised Ordering + http://openaccess.thecvf.com//content/CVPR2023/papers/Pham_HyperCUT_Video_Sequence_From_a_Single_Blurry_Image_Using_Unsupervised_CVPR_2023_paper.pdf + We consider the challenging task of training models for image-to-video deblurring, which aims to recover a sequence of sharp images corresponding to a given blurry image input. A critical issue disturbing the training of an image-to-video model is the ambiguity of the frame ordering since both the forward and backward sequences are plausible solutions. This paper proposes an effective self-supervised ordering scheme that allows training high-quality image-to-video deblurring models. Unlike previous methods that rely on order-invariant losses, we assign an explicit order for each video sequence, thus avoiding the order-ambiguity issue. Specifically, we map each video sequence to a vector in a latent high-dimensional space so that there exists a hyperplane such that for every video sequence, the vectors extracted from it and its reversed sequence are on different sides of the hyperplane. The side of the vectors will be used to define the order of the corresponding sequence. Last but not least, we propose a real-image dataset for the image-to-video deblurring problem that covers a variety of popular domains, including face, hand, and street. Extensive experimental results confirm the effectiveness of our method. Code and data are available at https://github.com/VinAIResearch/HyperCUT.git + + + + Can't Steal? Cont-Steal! 
Contrastive Stealing Attacks Against Image Encoders + http://openaccess.thecvf.com//content/CVPR2023/papers/Sha_Cant_Steal_Cont-Steal_Contrastive_Stealing_Attacks_Against_Image_Encoders_CVPR_2023_paper.pdf + Self-supervised representation learning techniques have been developing rapidly to make full use of unlabeled images. They encode images into rich features that are oblivious to downstream tasks. Behind their revolutionary representation power, the requirements for dedicated model designs and a massive amount of computation resources expose image encoders to the risks of potential model stealing attacks - a cheap way to mimic the well-trained encoder performance while circumventing the demanding requirements. Yet conventional attacks only target supervised classifiers given their predicted labels and/or posteriors, which leaves the vulnerability of unsupervised encoders unexplored. In this paper, we first instantiate the conventional stealing attacks against encoders and demonstrate their severer vulnerability compared with downstream classifiers. To better leverage the rich representation of encoders, we further propose Cont-Steal, a contrastive-learning-based attack, and validate its improved stealing effectiveness in various experiment settings. As a takeaway, we appeal to our community's attention to the intellectual property protection of representation learning techniques, especially to the defenses against encoder stealing attacks like ours. + + + + Nerflets: Local Radiance Fields for Efficient Structure-Aware 3D Scene Representation From 2D Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Nerflets_Local_Radiance_Fields_for_Efficient_Structure-Aware_3D_Scene_Representation_CVPR_2023_paper.pdf + We address efficient and structure-aware 3D scene representation from images. Nerflets are our key contribution-- a set of local neural radiance fields that together represent a scene. Each nerflet maintains its own spatial position, orientation, and extent, within which it contributes to panoptic, density, and radiance reconstructions. By leveraging only photometric and inferred panoptic image supervision, we can directly and jointly optimize the parameters of a set of nerflets so as to form a decomposed representation of the scene, where each object instance is represented by a group of nerflets. During experiments with indoor and outdoor environments, we find that nerflets: (1) fit and approximate the scene more efficiently than traditional global NeRFs, (2) allow the extraction of panoptic and photometric renderings from arbitrary views, and (3) enable tasks rare for NeRFs, such as 3D panoptic segmentation and interactive editing. + + + + CLIP Is Also an Efficient Segmenter: A Text-Driven Approach for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_CLIP_Is_Also_an_Efficient_Segmenter_A_Text-Driven_Approach_for_CVPR_2023_paper.pdf + Weakly supervised semantic segmentation (WSSS) with image-level labels is a challenging task. Mainstream approaches follow a multi-stage framework and suffer from high training costs. In this paper, we explore the potential of Contrastive Language-Image Pre-training models (CLIP) to localize different categories with only image-level labels and without further training. To efficiently generate high-quality segmentation masks from CLIP, we propose a novel WSSS framework called CLIP-ES. 
Our framework improves all three stages of WSSS with special designs for CLIP: 1) We introduce the softmax function into GradCAM and exploit the zero-shot ability of CLIP to suppress the confusion caused by non-target classes and backgrounds. Meanwhile, to take full advantage of CLIP, we re-explore text inputs under the WSSS setting and customize two text-driven strategies: sharpness-based prompt selection and synonym fusion. 2) To simplify the stage of CAM refinement, we propose a real-time class-aware attention-based affinity (CAA) module based on the inherent multi-head self-attention (MHSA) in CLIP-ViTs. 3) When training the final segmentation model with the masks generated by CLIP, we introduce a confidence-guided loss (CGL) to focus on confident regions. Our CLIP-ES achieves SOTA performance on Pascal VOC 2012 and MS COCO 2014 while taking only 10% of the time of previous methods for pseudo mask generation. Code is available at https://github.com/linyq2117/CLIP-ES. + + + + Spatially Adaptive Self-Supervised Learning for Real-World Image Denoising + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Spatially_Adaptive_Self-Supervised_Learning_for_Real-World_Image_Denoising_CVPR_2023_paper.pdf + Significant progress has been made in self-supervised image denoising (SSID) in recent years. However, most methods focus on dealing with spatially independent noise, and they have little practicality on real-world sRGB images with spatially correlated noise. Although pixel-shuffle downsampling has been suggested for breaking the noise correlation, it breaks the original information of images, which limits the denoising performance. In this paper, we propose a novel perspective to solve this problem, i.e., seeking spatially adaptive supervision for real-world sRGB image denoising. Specifically, we take into account the respective characteristics of flat and textured regions in noisy images, and construct supervisions for them separately. For flat areas, the supervision can be safely derived from non-adjacent pixels, which are far from the current pixel, excluding the influence of the noise-correlated ones. We extend the blind-spot network to a blind-neighborhood network (BNN) to provide supervision on flat areas. For textured regions, the supervision has to be closely related to the content of adjacent pixels. We present a locally aware network (LAN) to meet this requirement, while LAN itself is selectively supervised with the output of BNN. Combining these two supervisions, a denoising network (e.g., U-Net) can be well-trained. Extensive experiments show that our method performs favorably against state-of-the-art SSID methods on real-world sRGB photographs. The code is available at https://github.com/nagejacob/SpatiallyAdaptiveSSID. + + + + From Images to Textual Prompts: Zero-Shot Visual Question Answering With Frozen Large Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_From_Images_to_Textual_Prompts_Zero-Shot_Visual_Question_Answering_With_CVPR_2023_paper.pdf + Large language models (LLMs) have demonstrated excellent zero-shot generalization to new language tasks. However, effective utilization of LLMs for zero-shot visual question-answering (VQA) remains challenging, primarily due to the modality disconnection and task disconnection between the LLM and the VQA task. End-to-end training on vision and language data may bridge the disconnections, but is inflexible and computationally expensive. 
To address this issue, we propose Img2Prompt, a plug-and-play module that provides prompts that can bridge the aforementioned modality and task disconnections, so that LLMs can perform zero-shot VQA tasks without end-to-end training. In order to provide such prompts, we further employ LLM-agnostic models to provide prompts that can describe image content and self-constructed question-answer pairs, which can effectively guide the LLM to perform zero-shot VQA tasks. Img2Prompt offers the following benefits: 1) It can flexibly work with various LLMs to perform VQA. 2) Without the need for end-to-end training, it significantly reduces the cost of deploying LLMs for zero-shot VQA tasks. 3) It achieves comparable or better performance than methods relying on end-to-end training. For example, we outperform Flamingo by 5.6% on VQAv2. On the challenging A-OKVQA dataset, our method even outperforms few-shot methods by as much as 20%. + + + + Observation-Centric SORT: Rethinking SORT for Robust Multi-Object Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Observation-Centric_SORT_Rethinking_SORT_for_Robust_Multi-Object_Tracking_CVPR_2023_paper.pdf + Kalman filter (KF) based methods for multi-object tracking (MOT) make an assumption that objects move linearly. While this assumption is acceptable for very short periods of occlusion, linear estimates of motion for a prolonged time can be highly inaccurate. Moreover, when there is no measurement available to update Kalman filter parameters, the standard convention is to trust the a priori state estimation for the a posteriori update. This leads to the accumulation of errors during a period of occlusion. The error causes significant motion direction variance in practice. In this work, we show that a basic Kalman filter can still obtain state-of-the-art tracking performance if proper care is taken to fix the noise accumulated during occlusion. Instead of relying only on the linear state estimate (i.e., estimation-centric approach), we use object observations (i.e., the measurements from the object detector) to compute a virtual trajectory over the occlusion period to fix the error accumulation of filter parameters. This allows more time steps to correct errors accumulated during occlusion. We name our method Observation-Centric SORT (OC-SORT). It remains Simple, Online, and Real-Time but improves robustness during occlusion and non-linear motion. Given off-the-shelf detections as input, OC-SORT runs at 700+ FPS on a single CPU. It achieves state-of-the-art performance on multiple datasets, including MOT17, MOT20, KITTI, head tracking, and especially DanceTrack where the object motion is highly non-linear. The code and models are available at https://github.com/noahcao/OC_SORT. + + + + Transformer-Based Learned Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Gartner_Transformer-Based_Learned_Optimization_CVPR_2023_paper.pdf + We propose a new approach to learned optimization where we represent the computation of an optimizer's update step using a neural network. The parameters of the optimizer are then learned by training on a set of optimization tasks with the objective of performing minimization efficiently. Our innovation is a new neural network architecture, Optimus, for the learned optimizer inspired by the classic BFGS algorithm. As in BFGS, we estimate a preconditioning matrix as a sum of rank-one updates but use a Transformer-based neural network to predict these updates jointly with the step length and direction. 
In contrast to several recent learned optimization-based approaches, our formulation allows for conditioning across the dimensions of the parameter space of the target problem while remaining applicable to optimization tasks of variable dimensionality without retraining. We demonstrate the advantages of our approach on a benchmark composed of objective functions traditionally used for the evaluation of optimization algorithms, as well as on the real-world task of physics-based visual reconstruction of articulated 3D human motion. + + + + Quantum-Inspired Spectral-Spatial Pyramid Network for Hyperspectral Image Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Quantum-Inspired_Spectral-Spatial_Pyramid_Network_for_Hyperspectral_Image_Classification_CVPR_2023_paper.pdf + Hyperspectral image (HSI) classification aims at assigning a unique label to every pixel to identify categories of different land covers. Existing deep learning models for HSIs usually operate in a traditional learning paradigm. As emerging machines, quantum computers remain limited in the noisy intermediate-scale quantum (NISQ) era. Quantum theory offers a new paradigm for designing deep learning models. Motivated by the quantum circuit (QC) model, we propose a quantum-inspired spectral-spatial network (QSSN) for HSI feature extraction. The proposed QSSN consists of a phase-prediction module (PPM) and a measurement-like fusion module (MFM) inspired by quantum theory to dynamically fuse spectral and spatial information. Specifically, QSSN uses a quantum representation to represent an HSI cuboid and extracts joint spectral-spatial features using MFM. An HSI cuboid and its phases predicted by PPM are used in the quantum representation. Using QSSN as the building block, we propose an end-to-end quantum-inspired spectral-spatial pyramid network (QSSPN) for HSI feature extraction and classification. In this pyramid framework, QSSPN progressively learns feature representations by cascading QSSN blocks and performs classification with a softmax classifier. This is the first attempt to introduce quantum theory into HSI processing model design. Substantial experiments are conducted on three HSI datasets to verify the superiority of the proposed QSSPN framework over the state-of-the-art methods. + + + + Towards Benchmarking and Assessing Visual Naturalness of Physical World Adversarial Attacks + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Towards_Benchmarking_and_Assessing_Visual_Naturalness_of_Physical_World_Adversarial_CVPR_2023_paper.pdf + A physical world adversarial attack is a highly practical and threatening attack, which fools real-world deep learning systems by generating conspicuous and maliciously crafted real-world artifacts. In physical world attacks, evaluating naturalness is highly emphasized since humans can easily detect and remove unnatural attacks. However, current studies evaluate naturalness in a case-by-case fashion, which suffers from errors, bias and inconsistencies. In this paper, we take the first step to benchmark and assess the visual naturalness of physical world attacks, taking the autonomous driving scenario as the first attempt. First, to benchmark attack naturalness, we contribute the first Physical Attack Naturalness (PAN) dataset with human rating and gaze. 
PAN verifies several insights for the first time: naturalness is (disparately) affected by contextual features (i.e., environmental and semantic variations) and correlates with behavioral features (i.e., gaze signals). Second, to automatically assess attack naturalness in a way that aligns with human ratings, we further introduce the Dual Prior Alignment (DPA) network, which aims to embed human knowledge into the model reasoning process. Specifically, DPA imitates human reasoning in naturalness assessment by rating prior alignment and mimics human gaze behavior by attentive prior alignment. We hope our work fosters research on improving and automatically assessing the naturalness of physical world attacks. Our code and exemplar data can be found at https://github.com/zhangsn-19/PAN. + + + + Visual Prompt Multi-Modal Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Visual_Prompt_Multi-Modal_Tracking_CVPR_2023_paper.pdf + Visible-modal object tracking gives rise to a series of downstream multi-modal tracking tributaries. To inherit the powerful representations of the foundation model, a natural modus operandi for multi-modal tracking is full fine-tuning on the RGB-based parameters. Albeit effective, this approach is not optimal due to the scarcity of downstream data, poor transferability, etc. In this paper, inspired by the recent success of prompt learning in language models, we develop Visual Prompt multi-modal Tracking (ViPT), which learns the modal-relevant prompts to adapt the frozen pre-trained foundation model to various downstream multimodal tracking tasks. ViPT finds a better way to stimulate the knowledge of the RGB-based model that is pre-trained at scale, meanwhile only introducing a few trainable parameters (less than 1% of model parameters). ViPT outperforms the full fine-tuning paradigm on multiple downstream tracking tasks including RGB+Depth, RGB+Thermal, and RGB+Event tracking. Extensive experiments show the potential of visual prompt learning for multi-modal tracking, and ViPT can achieve state-of-the-art performance while satisfying parameter efficiency. Code and models are available at https://github.com/jiawen-zhu/ViPT. + + + + Dealing With Cross-Task Class Discrimination in Online Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Dealing_With_Cross-Task_Class_Discrimination_in_Online_Continual_Learning_CVPR_2023_paper.pdf + Existing continual learning (CL) research regards catastrophic forgetting (CF) as almost the only challenge. This paper argues for another challenge in class-incremental learning (CIL), which we call cross-task class discrimination (CTCD), i.e., how to establish decision boundaries between the classes of the new task and old tasks with no (or limited) access to the old task data. CTCD is implicitly and partially dealt with by replay-based methods. A replay method saves a small amount of data (replay data) from previous tasks. When a batch of current task data arrives, the system jointly trains the new data and some sampled replay data. The replay data enables the system to only partially learn the decision boundaries between the new classes and the old classes, as the amount of the saved data is small. However, this paper argues that the replay approach also has a dynamic training bias issue which reduces the effectiveness of the replay data in solving the CTCD problem. A novel optimization objective with a gradient-based adaptive method is proposed to dynamically deal with the problem in the online CL process. 
Experimental results show that the new method achieves much better results in online CL. + + + + GIVL: Improving Geographical Inclusivity of Vision-Language Models With Pre-Training Methods + http://openaccess.thecvf.com//content/CVPR2023/papers/Yin_GIVL_Improving_Geographical_Inclusivity_of_Vision-Language_Models_With_Pre-Training_Methods_CVPR_2023_paper.pdf + A key goal for the advancement of AI is to develop technologies that serve the needs not just of one group but of all communities regardless of their geographical region. In fact, a significant proportion of knowledge is locally shared by people from certain regions but may not apply equally in other regions because of cultural differences. If a model is unaware of regional characteristics, it may lead to performance disparity across regions and result in bias against underrepresented groups. We propose GIVL, a Geographically Inclusive Vision-and-Language Pre-trained model. There are two attributes of geo-diverse visual concepts which can help to learn geo-diverse knowledge: 1) concepts under similar categories have unique knowledge and visual characteristics, 2) concepts with similar visual features may fall in completely different categories. Motivated by the attributes, we design new pre-training objectives Image-Knowledge Matching (IKM) and Image Edit Checking (IEC) to pre-train GIVL. Compared with similar-size models pre-trained with similar scale of data, GIVL achieves state-of-the-art (SOTA) and more balanced performance on geo-diverse V&L tasks. Code and data are released at https://github.com/WadeYin9712/GIVL. + + + + Bi3D: Bi-Domain Active Learning for Cross-Domain 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Yuan_Bi3D_Bi-Domain_Active_Learning_for_Cross-Domain_3D_Object_Detection_CVPR_2023_paper.pdf + Unsupervised Domain Adaptation (UDA) technique has been explored in 3D cross-domain tasks recently. Though preliminary progress has been made, the performance gap between the UDA-based 3D model and the supervised one trained with fully annotated target domain is still large. This motivates us to consider selecting partial-yet-important target data and labeling them at a minimum cost, to achieve a good trade-off between high performance and low annotation cost. To this end, we propose a Bi-domain active learning approach, namely Bi3D, to solve the cross-domain 3D object detection task. The Bi3D first develops a domainness-aware source sampling strategy, which identifies target-domain-like samples from the source domain to avoid the model being interfered by irrelevant source data. Then a diversity-based target sampling strategy is developed, which selects the most informative subset of target domain to improve the model adaptability to the target domain using as little annotation budget as possible. Experiments are conducted on typical cross-domain adaptation scenarios including cross-LiDAR-beam, cross-country, and cross-sensor, where Bi3D achieves a promising target-domain detection accuracy (89.63% on KITTI) compared with UDA-based work (84.29%), even surpassing the detector trained on the full set of the labeled target domain (88.98%). 
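As an illustration of the diversity-based target sampling described in the Bi3D abstract above, the following is a hedged sketch of a generic farthest-point-style selection in feature space under a fixed annotation budget; it is one plausible reading of such a strategy, not the paper's implementation, and the array shapes are assumed.

import numpy as np

def select_diverse_targets(features, budget, seed=0):
    # features: (N, D) target-domain sample features; budget: number of samples to annotate
    rng = np.random.default_rng(seed)
    selected = [int(rng.integers(features.shape[0]))]      # start from a random sample
    dist = np.linalg.norm(features - features[selected[0]], axis=1)
    for _ in range(budget - 1):
        idx = int(np.argmax(dist))                          # farthest from everything chosen so far
        selected.append(idx)
        dist = np.minimum(dist, np.linalg.norm(features - features[idx], axis=1))
    return selected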
+ + + + Towards Fast Adaptation of Pretrained Contrastive Models for Multi-Channel Video-Language Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Towards_Fast_Adaptation_of_Pretrained_Contrastive_Models_for_Multi-Channel_Video-Language_CVPR_2023_paper.pdf + Multi-channel video-language retrieval requires models to understand information from different channels (e.g. video+question, video+speech) to correctly link a video with a textual response or query. Fortunately, contrastive multimodal models have been shown to be highly effective at aligning entities in images/videos and text, e.g., CLIP; text contrastive models have been extensively studied recently for their strong ability to produce discriminative sentence embeddings, e.g., SimCSE. However, there is not a clear way to quickly adapt these two lines of work to multi-channel video-language retrieval with limited data and resources. In this paper, we identify a principled model design space with two axes: how to represent videos and how to fuse video and text information. Based on a categorization of recent methods, we investigate the options of representing videos using continuous feature vectors or discrete text tokens; for the fusion method, we explore the use of a multimodal transformer or a pretrained contrastive text model. We extensively evaluate the four combinations on five video-language datasets. We surprisingly find that discrete text tokens coupled with a pretrained contrastive text model yield the best performance, even outperforming the state-of-the-art on the iVQA and How2QA datasets without additional training on millions of video-text data. Further analysis shows that this is because representing videos as text tokens captures the key visual information and text tokens are naturally aligned with text models that are strong retrievers after the contrastive pretraining process. All of this empirical analysis establishes a solid foundation for future research on affordable and upgradable multimodal intelligence. The code will be released at https://github.com/XudongLinthu/upgradable-multimodal-intelligence to facilitate future research. + + + + Crowd3D: Towards Hundreds of People Reconstruction From a Single Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Wen_Crowd3D_Towards_Hundreds_of_People_Reconstruction_From_a_Single_Image_CVPR_2023_paper.pdf + Image-based multi-person reconstruction in wide-field large scenes is critical for crowd analysis and security alerts. However, existing methods cannot deal with large scenes containing hundreds of people, which pose the challenges of a large number of people, large variations in human scale, and complex spatial distribution. In this paper, we propose Crowd3D, the first framework to reconstruct the 3D poses, shapes and locations of hundreds of people with global consistency from a single large-scene image. The core of our approach is to convert the problem of complex crowd localization into pixel localization with the help of our newly defined concept, Human-scene Virtual Interaction Point (HVIP). To reconstruct the crowd with global consistency, we propose a progressive reconstruction network based on HVIP by pre-estimating a scene-level camera and a ground plane. To deal with a large number of persons and various human sizes, we also design an adaptive human-centric cropping scheme. Besides, we contribute a benchmark dataset, LargeCrowd, for crowd reconstruction in a large scene. Experimental results demonstrate the effectiveness of the proposed method. 
The code and the dataset are available at http://cic.tju.edu.cn/faculty/likun/projects/Crowd3D. + + + + Highly Confident Local Structure Based Consensus Graph Learning for Incomplete Multi-View Clustering + http://openaccess.thecvf.com//content/CVPR2023/papers/Wen_Highly_Confident_Local_Structure_Based_Consensus_Graph_Learning_for_Incomplete_CVPR_2023_paper.pdf + Graph-based multi-view clustering has attracted extensive attention because of the powerful clustering-structure representation ability and noise robustness. Considering the reality of a large amount of incomplete data, in this paper, we propose a simple but effective method for incomplete multi-view clustering based on consensus graph learning, termed as HCLS_CGL. Unlike existing methods that utilize graph constructed from raw data to aid in the learning of consistent representation, our method directly learns a consensus graph across views for clustering. Specifically, we design a novel confidence graph and embed it to form a confidence structure driven consensus graph learning model. Our confidence graph is based on an intuitive similar-nearest-neighbor hypothesis, which does not require any additional information and can help the model to obtain a high-quality consensus graph for better clustering. Numerous experiments are performed to confirm the effectiveness of our method. + + + + Humans As Light Bulbs: 3D Human Reconstruction From Thermal Reflection + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Humans_As_Light_Bulbs_3D_Human_Reconstruction_From_Thermal_Reflection_CVPR_2023_paper.pdf + The relatively hot temperature of the human body causes people to turn into long-wave infrared light sources. Since this emitted light has a larger wavelength than visible light, many surfaces in typical scenes act as infrared mirrors with strong specular reflections. We exploit the thermal reflections of a person onto objects in order to locate their position and reconstruct their pose, even if they are not visible to a normal camera. We propose an analysis-by-synthesis framework that jointly models the objects, people, and their thermal reflections, which allows us to combine generative models with differentiable rendering of reflections. Quantitative and qualitative experiments show our approach works in highly challenging cases, such as with curved mirrors or when the person is completely unseen by a normal camera. + + + + CafeBoost: Causal Feature Boost To Eliminate Task-Induced Bias for Class Incremental Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Qiu_CafeBoost_Causal_Feature_Boost_To_Eliminate_Task-Induced_Bias_for_Class_CVPR_2023_paper.pdf + Continual learning requires a model to incrementally learn a sequence of tasks and aims to predict well on all the learned tasks so far, which notoriously suffers from the catastrophic forgetting problem. In this paper, we find a new type of bias appearing in continual learning, coined as task-induced bias. We place continual learning into a causal framework, based on which we find the task-induced bias is reduced naturally by two underlying mechanisms in task and domain incremental learning. However, these mechanisms do not exist in class incremental learning (CIL), in which each task contains a unique subset of classes. 
To eliminate the task-induced bias in CIL, we devise a causal intervention operation so as to cut off the causal path that causes the task-induced bias, and then implement it as a causal debias module that transforms biased features into unbiased ones. In addition, we propose a training pipeline to incorporate the novel module into existing methods and jointly optimize the entire architecture. Our overall approach does not rely on data replay, and is simple and convenient to plug into existing methods. Extensive empirical study on CIFAR-100 and ImageNet shows that our approach can improve accuracy and reduce forgetting of well-established methods by a large margin. + + + + A-La-Carte Prompt Tuning (APT): Combining Distinct Data via Composable Prompting + http://openaccess.thecvf.com//content/CVPR2023/papers/Bowman_A-La-Carte_Prompt_Tuning_APT_Combining_Distinct_Data_via_Composable_Prompting_CVPR_2023_paper.pdf + We introduce A-la-carte Prompt Tuning (APT), a transformer-based scheme to tune prompts on distinct data so that they can be arbitrarily composed at inference time. The individual prompts can be trained in isolation, possibly on different devices, at different times, and on different distributions or domains. Furthermore each prompt only contains information about the subset of data it was exposed to during training. During inference, models can be assembled based on arbitrary selections of data sources, which we call a-la-carte learning. A-la-carte learning enables constructing bespoke models specific to each user's individual access rights and preferences. We can add or remove information from the model by simply adding or removing the corresponding prompts without retraining from scratch. We demonstrate that a-la-carte built models achieve accuracy within 5% of models trained on the union of the respective sources, with comparable cost in terms of training and inference time. For the continual learning benchmarks Split CIFAR-100 and CORe50, we achieve state-of-the-art performance. + + + + ViLEM: Visual-Language Error Modeling for Image-Text Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_ViLEM_Visual-Language_Error_Modeling_for_Image-Text_Retrieval_CVPR_2023_paper.pdf + Dominant pre-training works for image-text retrieval adopt "dual-encoder" architecture to enable high efficiency, where two encoders are used to extract image and text representations and contrastive learning is employed for global alignment. However, coarse-grained global alignment ignores detailed semantic associations between image and text. In this work, we propose a novel proxy task, named Visual-Language Error Modeling (ViLEM), to inject detailed image-text association into "dual-encoder" model by "proofreading" each word in the text against the corresponding image. Specifically, we first edit the image-paired text to automatically generate diverse plausible negative texts with pre-trained language models. ViLEM then enforces the model to discriminate the correctness of each word in the plausible negative texts and further correct the wrong words via resorting to image information. Furthermore, we propose a multi-granularity interaction framework to perform ViLEM via interacting text features with both global and local image features, which associates local text semantics with both high-level visual context and multi-level local visual information. 
Our method surpasses state-of-the-art "dual-encoder" methods by a large margin on the image-text retrieval task and significantly improves discriminativeness to local textual semantics. Our model can also generalize well to video-text retrieval. + + + + Egocentric Auditory Attention Localization in Conversations + http://openaccess.thecvf.com//content/CVPR2023/papers/Ryan_Egocentric_Auditory_Attention_Localization_in_Conversations_CVPR_2023_paper.pdf + In a noisy conversation environment such as a dinner party, people often exhibit selective auditory attention, or the ability to focus on a particular speaker while tuning out others. Recognizing who somebody is listening to in a conversation is essential for developing technologies that can understand social behavior and devices that can augment human hearing by amplifying particular sound sources. The computer vision and audio research communities have made great strides towards recognizing sound sources and speakers in scenes. In this work, we take a step further by focusing on the problem of localizing auditory attention targets in egocentric video, or detecting who in a camera wearer's field of view they are listening to. To tackle the new and challenging Selective Auditory Attention Localization problem, we propose an end-to-end deep learning approach that uses egocentric video and multichannel audio to predict the heatmap of the camera wearer's auditory attention. Our approach leverages spatiotemporal audiovisual features and holistic reasoning about the scene to make predictions, and outperforms a set of baselines on a challenging multi-speaker conversation dataset. Project page: https://fkryan.github.io/saal + + + + Open-World Multi-Task Control Through Goal-Aware Representation Learning and Adaptive Horizon Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Cai_Open-World_Multi-Task_Control_Through_Goal-Aware_Representation_Learning_and_Adaptive_Horizon_CVPR_2023_paper.pdf + We study the problem of learning goal-conditioned policies in Minecraft, a popular, widely accessible yet challenging open-ended environment for developing human-level multi-task agents. We first identify two main challenges of learning such policies: 1) the indistinguishability of tasks from the state distribution, due to the vast scene diversity, and 2) the non-stationary nature of environment dynamics caused by the partial observability. To tackle the first challenge, we propose Goal-Sensitive Backbone (GSB) for the policy to encourage the emergence of goal-relevant visual state representations. To tackle the second challenge, the policy is further fueled by an adaptive horizon prediction module that helps alleviate the learning uncertainty brought by the non-stationary dynamics. Experiments on 20 Minecraft tasks show that our method significantly outperforms the best baseline so far; in many of them, we double the performance. Our ablation and exploratory studies then explain how our approach beat the counterparts and also unveil the surprising bonus of zero-shot generalization to new scenes (biomes). We hope our agent could help shed some light on learning goal-conditioned, multi-task agents in challenging, open-ended environments like Minecraft. + + + + MoDi: Unconditional Motion Synthesis From Diverse Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Raab_MoDi_Unconditional_Motion_Synthesis_From_Diverse_Data_CVPR_2023_paper.pdf + The emergence of neural networks has revolutionized the field of motion synthesis. 
Yet, learning to unconditionally synthesize motions from a given distribution remains challenging, especially when the motions are highly diverse. In this work, we present MoDi -- a generative model trained in an unsupervised setting from an extremely diverse, unstructured and unlabeled dataset. During inference, MoDi can synthesize high-quality, diverse motions. Despite the lack of any structure in the dataset, our model yields a well-behaved and highly structured latent space, which can be semantically clustered, constituting a strong motion prior that facilitates various applications including semantic editing and crowd animation. In addition, we present an encoder that inverts real motions into MoDi's natural motion manifold, issuing solutions to various ill-posed challenges such as completion from prefix and spatial editing. Our qualitative and quantitative experiments achieve state-of-the-art results that outperform recent SOTA techniques. Code and trained models are available at https://sigal-raab.github.io/MoDi. + + + + Visual Localization Using Imperfect 3D Models From the Internet + http://openaccess.thecvf.com//content/CVPR2023/papers/Panek_Visual_Localization_Using_Imperfect_3D_Models_From_the_Internet_CVPR_2023_paper.pdf + Visual localization is a core component in many applications, including augmented reality (AR). Localization algorithms compute the camera pose of a query image w.r.t. a scene representation, which is typically built from images. This often requires capturing and storing large amounts of data, followed by running Structure-from-Motion (SfM) algorithms. An interesting, and underexplored, source of data for building scene representations are 3D models that are readily available on the Internet, e.g., hand-drawn CAD models, 3D models generated from building footprints, or from aerial images. These models allow to perform visual localization right away without the time-consuming scene capturing and model building steps. Yet, it also comes with challenges as the available 3D models are often imperfect reflections of reality. E.g., the models might only have generic or no textures at all, might only provide a simple approximation of the scene geometry, or might be stretched. This paper studies how the imperfections of these models affect localization accuracy. We create a new benchmark for this task and provide a detailed experimental evaluation based on multiple 3D models per scene. We show that 3D models from the Internet show promise as an easy-to-obtain scene representation. At the same time, there is significant room for improvement for visual localization pipelines. To foster research on this interesting and challenging task, we release our benchmark at v-pnk.github.io/cadloc. + + + + PVO: Panoptic Visual Odometry + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_PVO_Panoptic_Visual_Odometry_CVPR_2023_paper.pdf + We present PVO, a novel panoptic visual odometry framework to achieve more comprehensive modeling of the scene motion, geometry, and panoptic segmentation information. Our PVO models visual odometry (VO) and video panoptic segmentation (VPS) in a unified view, which makes the two tasks mutually beneficial. Specifically, we introduce a panoptic update module into the VO Module with the guidance of image panoptic segmentation. This Panoptic-Enhanced VO Module can alleviate the impact of dynamic objects in the camera pose estimation with a panoptic-aware dynamic mask. 
On the other hand, the VO-Enhanced VPS Module also improves the segmentation accuracy by fusing the panoptic segmentation result of the current frame into the adjacent frames on the fly, using geometric information such as camera pose, depth, and optical flow obtained from the VO Module. These two modules contribute to each other through recurrent iterative optimization. Extensive experiments demonstrate that PVO outperforms state-of-the-art methods in both visual odometry and video panoptic segmentation tasks. + + + + Generative Diffusion Prior for Unified Image Restoration and Enhancement + http://openaccess.thecvf.com//content/CVPR2023/papers/Fei_Generative_Diffusion_Prior_for_Unified_Image_Restoration_and_Enhancement_CVPR_2023_paper.pdf + Existing image restoration methods mostly leverage the posterior distribution of natural images. However, they often assume known degradation and also require supervised training, which restricts their adaptation to complex real applications. In this work, we propose the Generative Diffusion Prior (GDP) to effectively model the posterior distributions in an unsupervised sampling manner. GDP utilizes a pre-trained denoising diffusion generative model (DDPM) for solving linear inverse, non-linear, or blind problems. Specifically, GDP systematically explores a protocol of conditional guidance, which is verified to be more practical than the commonly used guidance approach. Furthermore, GDP is effective at optimizing the parameters of the degradation model during the denoising process, achieving blind image restoration. Besides, we devise hierarchical guidance and patch-based methods, enabling GDP to generate images of arbitrary resolutions. Experimentally, we demonstrate GDP's versatility on several image datasets for linear problems, such as super-resolution, deblurring, inpainting, and colorization, as well as non-linear and blind problems, such as low-light enhancement and HDR image recovery. GDP outperforms the current leading unsupervised methods on the diverse benchmarks in reconstruction quality and perceptual quality. Moreover, GDP also generalizes well to natural images or synthesized images with arbitrary sizes from various tasks out of the distribution of the ImageNet training set. + + + + Real-Time Controllable Denoising for Image and Video + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Real-Time_Controllable_Denoising_for_Image_and_Video_CVPR_2023_paper.pdf + Controllable image denoising aims to generate clean samples with human perceptual priors and balance sharpness and smoothness. In traditional filter-based denoising methods, this can be easily achieved by adjusting the filtering strength. However, for NN (Neural Network)-based models, adjusting the final denoising strength requires performing network inference each time, making it almost impossible for real-time user interaction. In this paper, we introduce Real-time Controllable Denoising (RCD), the first deep image and video denoising pipeline that provides a fully controllable user interface to edit arbitrary denoising levels in real time with only one-time network inference. Unlike existing controllable denoising methods that require multiple denoisers and training stages, RCD replaces the last output layer (which usually outputs a single noise map) of an existing CNN-based model with a lightweight module that outputs multiple noise maps. 
We propose a novel Noise Decorrelation process to enforce the orthogonality of the noise feature maps, allowing arbitrary noise level control through noise map interpolation. This process is network-free and does not require network inference. Our experiments show that RCD can enable real-time editable image and video denoising for various existing heavy-weight models without sacrificing their original performance. + + + + ISBNet: A 3D Point Cloud Instance Segmentation Network With Instance-Aware Sampling and Box-Aware Dynamic Convolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Ngo_ISBNet_A_3D_Point_Cloud_Instance_Segmentation_Network_With_Instance-Aware_CVPR_2023_paper.pdf + Existing 3D instance segmentation methods are dominated by the bottom-up design -- a manually fine-tuned algorithm that groups points into clusters, followed by a refinement network. However, by relying on the quality of the clusters, these methods produce unreliable results when (1) nearby objects with the same semantic class are packed together, or (2) large objects have loosely connected regions. To address these limitations, we introduce ISBNet, a novel cluster-free method that represents instances as kernels and decodes instance masks via dynamic convolution. To efficiently generate high-recall and discriminative kernels, we propose a simple strategy named Instance-aware Farthest Point Sampling to sample candidates and leverage the local aggregation layer inspired by PointNet++ to encode candidate features. Moreover, we show that predicting and leveraging the 3D axis-aligned bounding boxes in the dynamic convolution further boosts performance. Our method sets new state-of-the-art results on ScanNetV2 (55.9), S3DIS (60.8), and STPLS3D (49.2) in terms of AP and retains fast inference time (237ms per scene on ScanNetV2). The source code and trained models are available at https://github.com/VinAIResearch/ISBNet. + + + + IterativePFN: True Iterative Point Cloud Filtering + http://openaccess.thecvf.com//content/CVPR2023/papers/de_Silva_Edirimuni_IterativePFN_True_Iterative_Point_Cloud_Filtering_CVPR_2023_paper.pdf + The quality of point clouds is often limited by noise introduced during their capture process. Consequently, a fundamental 3D vision task is the removal of noise, known as point cloud filtering or denoising. State-of-the-art learning-based methods focus on training neural networks to infer filtered displacements and directly shift noisy points onto the underlying clean surfaces. In high-noise conditions, they iterate the filtering process. However, this iterative filtering is only done at test time and is less effective at ensuring points converge quickly onto the clean surfaces. We propose IterativePFN (iterative point cloud filtering network), which consists of multiple IterationModules that model the true iterative filtering process internally, within a single network. We train our IterativePFN network using a novel loss function that utilizes an adaptive ground truth target at each iteration to capture the relationship between intermediate filtering results during training. This ensures that the filtered results converge faster to the clean surfaces. Our method is able to obtain better performance compared to state-of-the-art methods. The source code can be found at: https://github.com/ddsediri/IterativePFN. 
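The internal iteration described above can be pictured with the following schematic sketch; the IterationModule here is a stand-in MLP with assumed sizes, not the paper's architecture.

import torch
import torch.nn as nn

class IterationModule(nn.Module):
    # stand-in module: predicts a per-point displacement toward the clean surface
    def __init__(self, dim=3, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, pts):                    # pts: (N, 3) noisy points
        return self.net(pts)

class IterativeFilter(nn.Module):
    def __init__(self, n_iters=4):
        super().__init__()
        self.iters = nn.ModuleList(IterationModule() for _ in range(n_iters))

    def forward(self, pts):
        for module in self.iters:
            pts = pts + module(pts)            # each internal iteration shifts the points further
        return pts

filtered = IterativeFilter()(torch.randn(2048, 3))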
+ + + + CLIP-S4: Language-Guided Self-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/He_CLIP-S4_Language-Guided_Self-Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Existing semantic segmentation approaches are often limited by costly pixel-wise annotations and predefined classes. In this work, we present CLIP-S^4, which leverages self-supervised pixel representation learning and vision-language models to enable various semantic segmentation tasks (e.g., unsupervised, transfer learning, language-driven segmentation) without any human annotations or unknown class information. We first learn pixel embeddings with pixel-segment contrastive learning from different augmented views of images. To further improve the pixel embeddings and enable language-driven semantic segmentation, we design two types of consistency guided by vision-language models: 1) embedding consistency, aligning our pixel embeddings to the joint feature space of a pre-trained vision-language model, CLIP; and 2) semantic consistency, forcing our model to make the same predictions as CLIP over a set of carefully designed target classes with both known and unknown prototypes. Thus, CLIP-S^4 enables a new task of class-free semantic segmentation where no unknown class information is needed during training. As a result, our approach shows consistent and substantial performance improvement over four popular benchmarks compared with the state-of-the-art unsupervised and language-driven semantic segmentation methods. More importantly, our method outperforms these methods on unknown class recognition by a large margin. + + + + Deep Incomplete Multi-View Clustering With Cross-View Partial Sample and Prototype Alignment + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_Deep_Incomplete_Multi-View_Clustering_With_Cross-View_Partial_Sample_and_Prototype_CVPR_2023_paper.pdf + The success of existing multi-view clustering relies on the assumption of sample integrity across multiple views. However, in real-world scenarios, multi-view samples are only partially available due to data corruption or sensor failure, which leads to the study of incomplete multi-view clustering (IMVC). Although several attempts have been made to address IMVC, they suffer from the following drawbacks: i) Existing methods mainly adopt cross-view contrastive learning forcing the representations of each sample across views to be exactly the same, which might ignore view discrepancy and flexibility in representations; ii) Due to the absence of non-observed samples across multiple views, the obtained prototypes of clusters might be unaligned and biased, leading to incorrect fusion. To address the above issues, we propose a Cross-view Partial Sample and Prototype Alignment Network (CPSPAN) for Deep Incomplete Multi-view Clustering. Firstly, unlike existing contrastive-based methods, we adopt pair-observed data alignment as 'proxy supervised signals' to guide instance-to-instance correspondence construction among views. Then, regarding the shifted prototypes in IMVC, we further propose a prototype alignment module to achieve incomplete distribution calibration across views. Extensive experimental results showcase the effectiveness of our proposed modules, attaining noteworthy performance improvements when compared to existing IMVC competitors on benchmark datasets. 
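To illustrate the general idea of the prototype alignment module above (not CPSPAN's exact formulation), one plausible sketch matches the cluster prototypes of two views by minimizing pairwise distances with the Hungarian algorithm; the shapes are assumptions for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment

def align_prototypes(protos_a, protos_b):
    # protos_a, protos_b: (K, D) cluster prototypes from two views
    cost = np.linalg.norm(protos_a[:, None, :] - protos_b[None, :, :], axis=-1)
    row, col = linear_sum_assignment(cost)      # optimal one-to-one matching of clusters
    return protos_b[col]                         # view-b prototypes reordered to align with view-a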
+ + + + Revisiting Multimodal Representation in Contrastive Learning: From Patch and Token Embeddings to Finite Discrete Tokens + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Revisiting_Multimodal_Representation_in_Contrastive_Learning_From_Patch_and_Token_CVPR_2023_paper.pdf + Contrastive learning-based vision-language pre-training approaches, such as CLIP, have demonstrated great success in many vision-language tasks. These methods achieve cross-modal alignment by encoding a matched image-text pair with similar feature embeddings, which are generated by aggregating information from visual patches and language tokens. However, directly aligning cross-modal information using such representations is challenging, as visual patches and text tokens differ in semantic levels and granularities. To alleviate this issue, we propose a Finite Discrete Tokens (FDT) based multimodal representation. FDT is a set of learnable tokens representing certain visual-semantic concepts. Both images and texts are embedded using shared FDT by first grounding multimodal inputs to FDT space and then aggregating the activated FDT representations. The matched visual and semantic concepts are enforced to be represented by the same set of discrete tokens by a sparse activation constraint. As a result, the granularity gap between the two modalities is reduced. Through both quantitative and qualitative analyses, we demonstrate that using FDT representations in CLIP-style models improves cross-modal alignment and performance in visual recognition and vision-language downstream tasks. Furthermore, we show that our method can learn more comprehensive representations, and the learned FDT capture meaningful cross-modal correspondence, ranging from objects to actions and attributes. + + + + Heterogeneous Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Madaan_Heterogeneous_Continual_Learning_CVPR_2023_paper.pdf + We propose a novel framework and a solution to tackle the continual learning (CL) problem with changing network architectures. Most CL methods focus on adapting a single architecture to a new task/class by modifying its weights. However, with rapid progress in architecture design, the problem of adapting existing solutions to novel architectures becomes relevant. To address this limitation, we propose Heterogeneous Continual Learning (HCL), where a wide range of evolving network architectures emerge continually together with novel data/tasks. As a solution, we build on top of the distillation family of techniques and modify it to a new setting where a weaker model takes the role of a teacher; meanwhile, a new stronger architecture acts as a student. Furthermore, we consider a setup of limited access to previous data and propose Quick Deep Inversion (QDI) to recover prior task visual features to support knowledge transfer. QDI significantly reduces computational costs compared to previous solutions and improves overall performance. In summary, we propose a new setup for CL with a modified knowledge distillation paradigm and design a quick data inversion method to enhance distillation. Our evaluation on various benchmarks shows a significant improvement in accuracy in comparison to state-of-the-art methods across various network architectures. 
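A minimal sketch of the distillation setup described above, where the previous (weaker) model serves as teacher for a new, stronger student architecture; the temperature and loss weighting are illustrative assumptions, not values from the paper.

import torch
import torch.nn.functional as F

def hcl_distillation_loss(student, teacher, x, y, T=2.0, alpha=0.5):
    with torch.no_grad():
        t_logits = teacher(x)                               # frozen previous architecture (teacher)
    s_logits = student(x)                                   # new architecture being trained (student)
    kd = F.kl_div(F.log_softmax(s_logits / T, dim=1),
                  F.softmax(t_logits / T, dim=1),
                  reduction="batchmean") * (T * T)          # soft-label distillation term
    ce = F.cross_entropy(s_logits, y)                       # supervision on the current task
    return alpha * kd + (1.0 - alpha) * ce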
+ + + + Object Pose Estimation With Statistical Guarantees: Conformal Keypoint Detection and Geometric Uncertainty Propagation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Object_Pose_Estimation_With_Statistical_Guarantees_Conformal_Keypoint_Detection_and_CVPR_2023_paper.pdf + The two-stage object pose estimation paradigm first detects semantic keypoints on the image and then estimates the 6D pose by minimizing reprojection errors. Despite performing well on standard benchmarks, existing techniques offer no provable guarantees on the quality and uncertainty of the estimation. In this paper, we inject two fundamental changes, namely conformal keypoint detection and geometric uncertainty propagation, into the two-stage paradigm and propose the first pose estimator that endows an estimation with provable and computable worst-case error bounds. On one hand, conformal keypoint detection applies the statistical machinery of inductive conformal prediction to convert heuristic keypoint detections into circular or elliptical prediction sets that cover the groundtruth keypoints with a user-specified marginal probability (e.g., 90%). Geometric uncertainty propagation, on the other, propagates the geometric constraints on the keypoints to the 6D object pose, leading to a Pose UnceRtainty SEt (PURSE) that guarantees coverage of the groundtruth pose with the same probability. The PURSE, however, is a nonconvex set that does not directly lead to estimated poses and uncertainties. Therefore, we develop RANdom SAmple averaGing (RANSAG) to compute an average pose and apply semidefinite relaxation to upper bound the worst-case errors between the average pose and the groundtruth. On the LineMOD Occlusion dataset we demonstrate: (i) the PURSE covers the groundtruth with valid probabilities; (ii) the worst-case error bounds provide correct uncertainty quantification; and (iii) the average pose achieves accuracy better than or similar to that of representative methods based on sparse keypoints. + + + + 3D-Aware Multi-Class Image-to-Image Translation With NeRFs + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_3D-Aware_Multi-Class_Image-to-Image_Translation_With_NeRFs_CVPR_2023_paper.pdf + Recent advances in 3D-aware generative models (3D-aware GANs) combined with Neural Radiance Fields (NeRF) have achieved impressive results. However, no prior work investigates 3D-aware GANs for 3D-consistent multi-class image-to-image (3D-aware I2I) translation. Naively using 2D-I2I translation methods suffers from unrealistic shape/identity changes. To perform 3D-aware multi-class I2I translation, we decouple this learning process into a multi-class 3D-aware GAN step and a 3D-aware I2I translation step. In the first step, we propose two novel techniques: a new conditional architecture and an effective training strategy. In the second step, based on the well-trained multi-class 3D-aware GAN architecture, which preserves view-consistency, we construct a 3D-aware I2I translation system. To further reduce the view-consistency problems, we propose several new techniques, including a U-net-like adaptor network design, a hierarchical representation constraint and a relative regularization loss. In extensive experiments on two datasets, quantitative and qualitative results demonstrate that we successfully perform 3D-aware I2I translation with multi-view consistency. 
+ + + + Unsupervised Visible-Infrared Person Re-Identification via Progressive Graph Matching and Alternate Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Unsupervised_Visible-Infrared_Person_Re-Identification_via_Progressive_Graph_Matching_and_Alternate_CVPR_2023_paper.pdf + Unsupervised visible-infrared person re-identification is a challenging task due to the large modality gap and the unavailability of cross-modality correspondences. Cross-modality correspondences are very crucial to bridge the modality gap. Some existing works try to mine cross-modality correspondences, but they focus only on local information. They do not fully exploit the global relationship across identities, thus limiting the quality of the mined correspondences. Worse still, the number of clusters of the two modalities is often inconsistent, exacerbating the unreliability of the generated correspondences. In response, we devise a Progressive Graph Matching method to globally mine cross-modality correspondences under cluster imbalance scenarios. PGM formulates correspondences mining as a graph matching process and considers the global information by minimizing the global matching cost, where the matching cost measures the dissimilarity of clusters. Besides, PGM adopts a progressive strategy to address the imbalance issue with multiple dynamic matching processes. Based on PGM, we design an Alternate Cross Contrastive Learning (ACCL) module to reduce the modality gap with the mined cross-modality correspondences, while mitigating the effect of noise in correspondences through an alternate scheme. Extensive experiments demonstrate the reliability of the generated correspondences and the effectiveness of our method. + + + + Hierarchical B-Frame Video Coding Using Two-Layer CANF Without Motion Coding + http://openaccess.thecvf.com//content/CVPR2023/papers/Alexandre_Hierarchical_B-Frame_Video_Coding_Using_Two-Layer_CANF_Without_Motion_Coding_CVPR_2023_paper.pdf + Typical video compression systems consist of two main modules: motion coding and residual coding. This general architecture is adopted by classical coding schemes (such as international standards H.265 and H.266) and deep learning-based coding schemes. We propose a novel B-frame coding architecture based on two-layer Conditional Augmented Normalization Flows (CANF). It has the striking feature of not transmitting any motion information. Our proposed idea of video compression without motion coding offers a new direction for learned video coding. Our base layer is a low-resolution image compressor that replaces the full-resolution motion compressor. The low-resolution coded image is merged with the warped high-resolution images to generate a high-quality image as a conditioning signal for the enhancement-layer image coding in full resolution. One advantage of this architecture is significantly reduced computational complexity due to eliminating the motion information compressor. In addition, we adopt a skip-mode coding technique to reduce the transmitted latent samples. The rate-distortion performance of our scheme is slightly lower than that of the state-of-the-art learned B-frame coding scheme, B-CANF, but outperforms other learned B-frame coding schemes. However, compared to B-CANF, our scheme saves 45% of multiply-accumulate operations (MACs) for encoding and 27% of MACs for decoding. The code is available at https://nycu-clab.github.io. 
+ + + + Seeing Through the Glass: Neural 3D Reconstruction of Object Inside a Transparent Container + http://openaccess.thecvf.com//content/CVPR2023/papers/Tong_Seeing_Through_the_Glass_Neural_3D_Reconstruction_of_Object_Inside_CVPR_2023_paper.pdf + In this paper, we define a new problem of recovering the 3D geometry of an object confined in a transparent enclosure. We also propose a novel method for solving this challenging problem. Transparent enclosures pose challenges of multiple light reflections and refractions at the interface between different propagation media e.g. air or glass. These multiple reflections and refractions cause serious image distortions which invalidate the single viewpoint assumption. Hence the 3D geometry of such objects cannot be reliably reconstructed using existing methods, such as traditional structure from motion or modern neural reconstruction methods. We solve this problem by explicitly modeling the scene as two distinct sub-spaces, inside and outside the transparent enclosure. We use an existing neural reconstruction method (NeuS) that implicitly represents the geometry and appearance of the inner subspace. In order to account for complex light interactions, we develop a hybrid rendering strategy that combines volume rendering with ray tracing. We then recover the underlying geometry and appearance of the model by minimizing the difference between the real and rendered images. We evaluate our method on both synthetic and real data. Experiment results show that our method outperforms the state-of-the-art (SOTA) methods. Codes and data will be available at https://github.com/hirotong/ReNeuS + + + + Neural Voting Field for Camera-Space 3D Hand Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Neural_Voting_Field_for_Camera-Space_3D_Hand_Pose_Estimation_CVPR_2023_paper.pdf + We present a unified framework for camera-space 3D hand pose estimation from a single RGB image based on 3D implicit representation. As opposed to recent works, most of which first adopt holistic or pixel-level dense regression to obtain relative 3D hand pose and then follow with complex second-stage operations for 3D global root or scale recovery, we propose a novel unified 3D dense regression scheme to estimate camera-space 3D hand pose via dense 3D point-wise voting in camera frustum. Through direct dense modeling in 3D domain inspired by Pixel-aligned Implicit Functions for 3D detailed reconstruction, our proposed Neural Voting Field (NVF) fully models 3D dense local evidence and hand global geometry, helping to alleviate common 2D-to-3D ambiguities. Specifically, for a 3D query point in camera frustum and its pixel-aligned image feature, NVF, represented by a Multi-Layer Perceptron, regresses: (i) its signed distance to the hand surface; (ii) a set of 4D offset vectors (1D voting weight and 3D directional vector to each hand joint). Following a vote-casting scheme, 4D offset vectors from near-surface points are selected to calculate the 3D hand joint coordinates by a weighted average. Experiments demonstrate that NVF outperforms existing state-of-the-art algorithms on FreiHAND dataset for camera-space 3D hand pose estimation. We also adapt NVF to the classic task of root-relative 3D hand pose estimation, for which NVF also obtains state-of-the-art results on HO3D dataset. 
+ + + + Visual Recognition-Driven Image Restoration for Multiple Degradation With Intrinsic Semantics Recovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Visual_Recognition-Driven_Image_Restoration_for_Multiple_Degradation_With_Intrinsic_Semantics_CVPR_2023_paper.pdf + Deep image recognition models suffer a significant performance drop when applied to low-quality images since they are trained on high-quality images. Although many studies have investigated to solve the issue through image restoration or domain adaptation, the former focuses on visual quality rather than recognition quality, while the latter requires semantic annotations for task-specific training. In this paper, to address more practical scenarios, we propose a Visual Recognition-Driven Image Restoration network for multiple degradation, dubbed VRD-IR, to recover high-quality images from various unknown corruption types from the perspective of visual recognition within one model. Concretely, we harmonize the semantic representations of diverse degraded images into a unified space in a dynamic manner, and then optimize them towards intrinsic semantics recovery. Moreover, a prior-ascribing optimization strategy is introduced to encourage VRD-IR to couple with various downstream recognition tasks better. Our VRD-IR is corruption- and recognition-agnostic, and can be inserted into various recognition tasks directly as an image enhancement module. Extensive experiments on multiple image distortions demonstrate that our VRD-IR surpasses existing image restoration methods and show superior performance on diverse high-level tasks, including classification, detection, and person re-identification. + + + + Knowledge Combination To Learn Rotated Detection Without Rotated Annotation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Knowledge_Combination_To_Learn_Rotated_Detection_Without_Rotated_Annotation_CVPR_2023_paper.pdf + Rotated bounding boxes drastically reduce output ambiguity of elongated objects, making it superior to axis-aligned bounding boxes. Despite the effectiveness, rotated detectors are not widely employed. Annotating rotated bounding boxes is such a laborious process that they are not provided in many detection datasets where axis-aligned annotations are used instead. In this paper, we propose a framework that allows the model to predict precise rotated boxes only requiring cheaper axis-aligned annotation of the target dataset. To achieve this, we leverage the fact that neural networks are capable of learning richer representation of the target domain than what is utilized by the task. The under-utilized representation can be exploited to address a more detailed task. Our framework combines task knowledge of an out-of-domain source dataset with stronger annotation and domain knowledge of the target dataset with weaker annotation. A novel assignment process and projection loss are used to enable the co-training on the source and target datasets. As a result, the model is able to solve the more detailed task in the target domain, without additional computation overhead during inference. We extensively evaluate the method on various target datasets including fresh-produce dataset, HRSC2016 and SSDD. Results show that the proposed method consistently performs on par with the fully supervised approach. 
+ + + + Pointersect: Neural Rendering With Cloud-Ray Intersection + http://openaccess.thecvf.com//content/CVPR2023/papers/Chang_Pointersect_Neural_Rendering_With_Cloud-Ray_Intersection_CVPR_2023_paper.pdf + We propose a novel method that renders point clouds as if they are surfaces. The proposed method is differentiable and requires no scene-specific optimization. This unique capability enables, out-of-the-box, surface normal estimation, rendering room-scale point clouds, inverse rendering, and ray tracing with global illumination. Unlike existing work that focuses on converting point clouds to other representations--e.g., surfaces or implicit functions--our key idea is to directly infer the intersection of a light ray with the underlying surface represented by the given point cloud. Specifically, we train a set transformer that, given a small number of local neighbor points along a light ray, provides the intersection point, the surface normal, and the material blending weights, which are used to render the outcome of this light ray. Localizing the problem into small neighborhoods enables us to train a model with only 48 meshes and apply it to unseen point clouds. Our model achieves higher estimation accuracy than state-of-the-art surface reconstruction and point-cloud rendering methods on three test sets. When applied to room-scale point clouds, without any scene-specific optimization, the model achieves competitive quality with the state-of-the-art novel-view rendering methods. Moreover, we demonstrate ability to render and manipulate Lidar-scanned point clouds such as lighting control and object insertion. + + + + Beyond Attentive Tokens: Incorporating Token Importance and Diversity for Efficient Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Long_Beyond_Attentive_Tokens_Incorporating_Token_Importance_and_Diversity_for_Efficient_CVPR_2023_paper.pdf + Vision transformers have achieved significant improvements on various vision tasks but their quadratic interactions between tokens significantly reduce computational efficiency. Many pruning methods have been proposed to remove redundant tokens for efficient vision transformers recently. However, existing studies mainly focus on the token importance to preserve local attentive tokens but completely ignore the global token diversity. In this paper, we emphasize the cruciality of diverse global semantics and propose an efficient token decoupling and merging method that can jointly consider the token importance and diversity for token pruning. According to the class token attention, we decouple the attentive and inattentive tokens. In addition to preserve the most discriminative local tokens, we merge similar inattentive tokens and match homogeneous attentive tokens to maximize the token diversity. Despite its simplicity, our method obtains a promising trade-off between model complexity and classification accuracy. On DeiT-S, our method reduces the FLOPs by 35% with only a 0.2% accuracy drop. Notably, benefiting from maintaining the token diversity, our method can even improve the accuracy of DeiT-T by 0.1% after reducing its FLOPs by 40%. 
+ + + + STDLens: Model Hijacking-Resilient Federated Learning for Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Chow_STDLens_Model_Hijacking-Resilient_Federated_Learning_for_Object_Detection_CVPR_2023_paper.pdf + Federated Learning (FL) has been gaining popularity as a collaborative learning framework to train deep learning-based object detection models over a distributed population of clients. Despite its advantages, FL is vulnerable to model hijacking. The attacker can control how the object detection system should misbehave by implanting Trojaned gradients using only a small number of compromised clients in the collaborative learning process. This paper introduces STDLens, a principled approach to safeguarding FL against such attacks. We first investigate existing mitigation mechanisms and analyze their failures caused by the inherent errors in spatial clustering analysis on gradients. Based on the insights, we introduce a three-tier forensic framework to identify and expel Trojaned gradients and reclaim the performance over the course of FL. We consider three types of adaptive attacks and demonstrate the robustness of STDLens against advanced adversaries. Extensive experiments show that STDLens can protect FL against different model hijacking attacks and outperform existing methods in identifying and removing Trojaned gradients with significantly higher precision and much lower false-positive rates. The source code is available at https://github.com/git-disl/STDLens. + + + + MagicPony: Learning Articulated 3D Animals in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_MagicPony_Learning_Articulated_3D_Animals_in_the_Wild_CVPR_2023_paper.pdf + We consider the problem of predicting the 3D shape, articulation, viewpoint, texture, and lighting of an articulated animal like a horse given a single test image as input. We present a new method, dubbed MagicPony, that learns this predictor purely from in-the-wild single-view images of the object category, with minimal assumptions about the topology of deformation. At its core is an implicit-explicit representation of articulated shape and appearance, combining the strengths of neural fields and meshes. In order to help the model understand an object's shape and pose, we distil the knowledge captured by an off-the-shelf self-supervised vision transformer and fuse it into the 3D model. To overcome local optima in viewpoint estimation, we further introduce a new viewpoint sampling scheme that comes at no additional training cost. MagicPony outperforms prior work on this challenging task and demonstrates excellent generalisation in reconstructing art, despite the fact that it is only trained on real images. The code can be found on the project page at https://3dmagicpony.github.io/. + + + + Affordances From Human Videos as a Versatile Representation for Robotics + http://openaccess.thecvf.com//content/CVPR2023/papers/Bahl_Affordances_From_Human_Videos_as_a_Versatile_Representation_for_Robotics_CVPR_2023_paper.pdf + Building a robot that can understand and learn to interact by watching humans has inspired several vision problems. However, despite some successful results on static datasets, it remains unclear how current models can be used on a robot directly. In this paper, we aim to bridge this gap by leveraging videos of human interactions in an environment centric manner. 
Utilizing internet videos of human behavior, we train a visual affordance model that estimates where and how in the scene a human is likely to interact. The structure of these behavioral affordances directly enables the robot to perform many complex tasks. We show how to seamlessly integrate our affordance model with four robot learning paradigms including offline imitation learning, exploration, goal-conditioned learning, and action parameterization for reinforcement learning. We show the efficacy of our approach, which we call Vision-Robotics Bridge (VRB) across 4 real world environments, over 10 different tasks, and 2 robotic platforms operating in the wild. + + + + AMT: All-Pairs Multi-Field Transforms for Efficient Frame Interpolation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_AMT_All-Pairs_Multi-Field_Transforms_for_Efficient_Frame_Interpolation_CVPR_2023_paper.pdf + We present All-Pairs Multi-Field Transforms (AMT), a new network architecture for video frame interpolation. It is based on two essential designs. First, we build bidirectional correlation volumes for all pairs of pixels and use the predicted bilateral flows to retrieve correlations for updating both flows and the interpolated content feature. Second, we derive multiple groups of fine-grained flow fields from one pair of updated coarse flows for performing backward warping on the input frames separately. Combining these two designs enables us to generate promising task-oriented flows and reduce the difficulties in modeling large motions and handling occluded areas during frame interpolation. These qualities promote our model to achieve state-of-the-art performance on various benchmarks with high efficiency. Moreover, our convolution-based model competes favorably compared to Transformer-based models in terms of accuracy and efficiency. Our code is available at https://github.com/MCG-NKU/AMT. + + + + Toward RAW Object Detection: A New Benchmark and a New Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Toward_RAW_Object_Detection_A_New_Benchmark_and_a_New_CVPR_2023_paper.pdf + In many computer vision applications (e.g., robotics and autonomous driving), high dynamic range (HDR) data is necessary for object detection algorithms to handle a variety of lighting conditions, such as strong glare. In this paper, we aim to achieve object detection on RAW sensor data, which naturally saves the HDR information from image sensors without extra equipment costs. We build a novel RAW sensor dataset, named ROD, for Deep Neural Networks (DNNs)-based object detection algorithms to be applied to HDR data. The ROD dataset contains a large amount of annotated instances of day and night driving scenes in 24-bit dynamic range. Based on the dataset, we first investigate the impact of dynamic range for DNNs-based detectors and demonstrate the importance of dynamic range adjustment for detection on RAW sensor data. Then, we propose a simple and effective adjustment method for object detection on HDR RAW sensor data, which is image adaptive and jointly optimized with the downstream detector in an end-to-end scheme. Extensive experiments demonstrate that the performance of detection on RAW sensor data is significantly superior to standard dynamic range (SDR) data in different situations. Moreover, we analyze the influence of texture information and pixel distribution of input data on the performance of the DNNs-based detector. 
+ + + + Music-Driven Group Choreography http://openaccess.thecvf.com//content/CVPR2023/papers/Le_Music-Driven_Group_Choreography_CVPR_2023_paper.pdf Music-driven choreography is a challenging problem with a wide variety of industrial applications. Recently, many methods have been proposed to synthesize dance motions from music for a single dancer. However, generating dance motion for a group remains an open problem. In this paper, we present AIOZ-GDANCE, a new large-scale dataset for music-driven group dance generation. Unlike existing datasets that only support a single dancer, our new dataset contains group dance videos, hence supporting the study of group choreography. We propose a semi-autonomous labeling method with humans in the loop to obtain the 3D ground truth for our dataset. The proposed dataset consists of 16.7 hours of paired music and 3D motion from in-the-wild videos, covering 7 dance styles and 16 music genres. We show that naively applying a single-dance generation technique to create group dance motion may lead to unsatisfactory results, such as inconsistent movements and collisions between dancers. Based on our new dataset, we propose a new method that takes an input music sequence and a set of 3D positions of dancers to efficiently produce multiple group-coherent choreographies. We propose new evaluation metrics for measuring group dance quality and perform intensive experiments to demonstrate the effectiveness of our method. Our project facilitates future research on group dance generation and is available at https://aioz-ai.github.io/AIOZ-GDANCE/. + + + + Cascade Evidential Learning for Open-World Weakly-Supervised Temporal Action Localization http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Cascade_Evidential_Learning_for_Open-World_Weakly-Supervised_Temporal_Action_Localization_CVPR_2023_paper.pdf Aiming at recognizing and localizing action instances with only video-level labels during training, Weakly-supervised Temporal Action Localization (WTAL) has achieved significant progress in recent years. However, living in the dynamically changing open world where unknown actions constantly spring up, the closed-set assumption of existing WTAL methods is invalid. Compared with traditional open-set recognition tasks, Open-world WTAL (OWTAL) is challenging since not only are the annotations of unknown samples unavailable, but also the fine-grained annotations of known action instances can only be inferred ambiguously from the video category labels. To address this problem, we propose a Cascade Evidential Learning framework at an evidence level, which targets OWTAL for the first time. Our method jointly leverages multi-scale temporal contexts and knowledge-guided prototype information to progressively collect cascade and enhanced evidence for known action, unknown action, and background separation. Extensive experiments conducted on THUMOS-14 and ActivityNet-v1.3 verify the effectiveness of our method. Besides the classification metrics adopted by previous open-set recognition methods, we also evaluate our method on localization metrics, which are more reasonable for OWTAL. + + + + STAR Loss: Reducing Semantic Ambiguity in Facial Landmark Detection http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_STAR_Loss_Reducing_Semantic_Ambiguity_in_Facial_Landmark_Detection_CVPR_2023_paper.pdf Recently, deep learning-based facial landmark detection has achieved significant improvement.
However, the semantic ambiguity problem degrades detection performance. Specifically, semantic ambiguity causes inconsistent annotation and negatively affects the model's convergence, leading to worse accuracy and unstable predictions. To solve this problem, we propose a Self-adapTive Ambiguity Reduction (STAR) loss by exploiting the properties of semantic ambiguity. We find that semantic ambiguity results in an anisotropic predicted distribution, which inspires us to use the predicted distribution to represent semantic ambiguity. Based on this, we design the STAR loss that measures the anisotropism of the predicted distribution. Compared with the standard regression loss, STAR loss is encouraged to be small when the predicted distribution is anisotropic and thus adaptively mitigates the impact of semantic ambiguity. Moreover, we propose two kinds of eigenvalue restriction methods that could avoid both the distribution's abnormal change and the model's premature convergence. Finally, comprehensive experiments demonstrate that STAR loss outperforms the state-of-the-art methods on three benchmarks, i.e., COFW, 300W, and WFLW, with negligible computation overhead. Code is at https://github.com/ZhenglinZhou/STAR + + + + Seeing What You Said: Talking Face Generation Guided by a Lip Reading Expert http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Seeing_What_You_Said_Talking_Face_Generation_Guided_by_a_CVPR_2023_paper.pdf Talking face generation, also known as speech-to-lip generation, reconstructs facial motions concerning lips given coherent speech input. Previous studies revealed the importance of lip-speech synchronization and visual quality. Despite much progress, they hardly focus on the content of lip movements, i.e., the visual intelligibility of the spoken words, which is an important aspect of generation quality. To address the problem, we propose using a lip-reading expert to improve the intelligibility of the generated lip regions by penalizing incorrect generation results. Moreover, to compensate for data scarcity, we train the lip-reading expert in an audio-visual self-supervised manner. With a lip-reading expert, we propose a novel contrastive learning scheme to enhance lip-speech synchronization, and a transformer to encode audio synchronously with video, while considering the global temporal dependency of audio. For evaluation, we propose a new strategy with two different lip-reading experts to measure the intelligibility of the generated videos. Rigorous experiments show that our proposal is superior to other state-of-the-art (SOTA) methods, such as Wav2Lip, in reading intelligibility, i.e., over 38% Word Error Rate (WER) on the LRS2 dataset and 27.8% accuracy on the LRW dataset. We also achieve SOTA performance in lip-speech synchronization and comparable performance in visual quality. + + + + SimpSON: Simplifying Photo Cleanup With Single-Click Distracting Object Segmentation Network http://openaccess.thecvf.com//content/CVPR2023/papers/Huynh_SimpSON_Simplifying_Photo_Cleanup_With_Single-Click_Distracting_Object_Segmentation_Network_CVPR_2023_paper.pdf In photo editing, it is common practice to remove visual distractions to improve the overall image quality and highlight the primary subject. However, manually selecting and removing these small and dense distracting regions can be a laborious and time-consuming task. In this paper, we propose an interactive distractor selection method that is optimized to achieve the task with just a single click.
Our method surpasses the precision and recall achieved by the traditional method of running panoptic segmentation and then selecting the segments containing the clicks. We also showcase how a transformer-based module can be used to identify more distracting regions similar to the user's click position. Our experiments demonstrate that the model can effectively and accurately segment unknown distracting objects interactively and in groups. By significantly simplifying the photo cleaning and retouching process, our proposed model provides inspiration for exploring rare object segmentation and group selection with a single click. + + + + Learning Neural Duplex Radiance Fields for Real-Time View Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Wan_Learning_Neural_Duplex_Radiance_Fields_for_Real-Time_View_Synthesis_CVPR_2023_paper.pdf + Neural radiance fields (NeRFs) enable novel view synthesis with unprecedented visual quality. However, to render photorealistic images, NeRFs require hundreds of deep multilayer perceptron (MLP) evaluations -- for each pixel. This is prohibitively expensive and makes real-time rendering infeasible, even on powerful modern GPUs. In this paper, we propose a novel approach to distill and bake NeRFs into highly efficient mesh-based neural representations that are fully compatible with the massively parallel graphics rendering pipeline. We represent scenes as neural radiance features encoded on a two-layer duplex mesh, which effectively overcomes the inherent inaccuracies in 3D surface reconstruction by learning the aggregated radiance information from a reliable interval of ray-surface intersections. To exploit local geometric relationships of nearby pixels, we leverage screen-space convolutions instead of the MLPs used in NeRFs to achieve high-quality appearance. Finally, the performance of the whole framework is further boosted by a novel multi-view distillation optimization strategy. We demonstrate the effectiveness and superiority of our approach via extensive experiments on a range of standard datasets. + + + + Towards Modality-Agnostic Person Re-Identification With Descriptive Query + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Towards_Modality-Agnostic_Person_Re-Identification_With_Descriptive_Query_CVPR_2023_paper.pdf + Person re-identification (ReID) with descriptive query (text or sketch) provides an important supplement for general image-image paradigms, which is usually studied in a single cross-modality matching manner, e.g., text-to-image or sketch-to-photo. However, without a camera-captured photo query, it is uncertain whether the text or sketch is available or not in practical scenarios. This motivates us to study a new and challenging modality-agnostic person re-identification problem. Towards this goal, we propose a unified person re-identification (UNIReID) architecture that can effectively adapt to cross-modality and multi-modality tasks. Specifically, UNIReID incorporates a simple dual-encoder with task-specific modality learning to mine and fuse visual and textual modality information. To deal with the imbalanced training problem of different tasks in UNIReID, we propose a task-aware dynamic training strategy in terms of task difficulty, adaptively adjusting the training focus. Besides, we construct three multi-modal ReID datasets by collecting the corresponding sketches from photos to support this challenging task. 
The experimental results on three multi-modal ReID datasets show that our UNIReID greatly improves the retrieval accuracy and generalization ability on different tasks and unseen scenarios. + + + + An In-Depth Exploration of Person Re-Identification and Gait Recognition in Cloth-Changing Conditions + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_An_In-Depth_Exploration_of_Person_Re-Identification_and_Gait_Recognition_in_CVPR_2023_paper.pdf + The target of person re-identification (ReID) and gait recognition is consistent, that is to match the target pedestrian under surveillance cameras. For the cloth-changing problem, video-based ReID is rarely studied due to the lack of a suitable cloth-changing benchmark, and gait recognition is often researched under controlled conditions. To tackle this problem, we propose a Cloth-Changing benchmark for Person re-identification and Gait recognition (CCPG). It is a cloth-changing dataset, and there are several highlights in CCPG, (1) it provides 200 identities and over 16K sequences are captured indoors and outdoors, (2) each identity has seven different cloth-changing statuses, which is hardly seen in previous datasets, (3) RGB and silhouettes version data are both available for research purposes. Moreover, aiming to investigate the cloth-changing problem systematically, comprehensive experiments are conducted on video-based ReID and gait recognition methods. The experimental results demonstrate the superiority of ReID and gait recognition separately in different cloth-changing conditions and suggest that gait recognition is a potential solution for addressing the cloth-changing problem. Our dataset will be available at https://github.com/BNU-IVC/CCPG. + + + + Visual Exemplar Driven Task-Prompting for Unified Perception in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Liang_Visual_Exemplar_Driven_Task-Prompting_for_Unified_Perception_in_Autonomous_Driving_CVPR_2023_paper.pdf + Multi-task learning has emerged as a powerful paradigm to solve a range of tasks simultaneously with good efficiency in both computation resources and inference time. However, these algorithms are designed for different tasks mostly not within the scope of autonomous driving, thus making it hard to compare multi-task methods in autonomous driving. Aiming to enable the comprehensive evaluation of present multi-task learning methods in autonomous driving, we extensively investigate the performance of popular multi-task methods on the large-scale driving dataset, which covers four common perception tasks, i.e., object detection, semantic segmentation, drivable area segmentation, and lane detection. We provide an in-depth analysis of current multi-task learning methods under different common settings and find out that the existing methods make progress but there is still a large performance gap compared with single-task baselines. To alleviate this dilemma in autonomous driving, we present an effective multi-task framework, VE-Prompt, which introduces visual exemplars via task-specific prompting to guide the model toward learning high-quality task-specific representations. Specifically, we generate visual exemplars based on bounding boxes and color-based markers, which provide accurate visual appearances of target categories and further mitigate the performance gap. Furthermore, we bridge transformer-based encoders and convolutional layers for efficient and accurate unified perception in autonomous driving. 
Comprehensive experimental results on the diverse self-driving dataset BDD100K show that VE-Prompt improves the multi-task baseline and further surpasses single-task models. + + + + Toward Verifiable and Reproducible Human Evaluation for Text-to-Image Generation http://openaccess.thecvf.com//content/CVPR2023/papers/Otani_Toward_Verifiable_and_Reproducible_Human_Evaluation_for_Text-to-Image_Generation_CVPR_2023_paper.pdf Human evaluation is critical for validating the performance of text-to-image generative models, as this highly cognitive process requires deep comprehension of text and images. However, our survey of 37 recent papers reveals that many works rely solely on automatic measures (e.g., FID) or perform poorly described human evaluations that are not reliable or repeatable. This paper proposes a standardized and well-defined human evaluation protocol to facilitate verifiable and reproducible human evaluation in future works. In our pilot data collection, we experimentally show that the current automatic measures are incompatible with human perception in evaluating the performance of the text-to-image generation results. Furthermore, we provide insights for designing human evaluation experiments reliably and conclusively. Finally, we make several resources publicly available to the community to facilitate easy and fast implementations. + + + + Learning a 3D Morphable Face Reflectance Model From Low-Cost Data http://openaccess.thecvf.com//content/CVPR2023/papers/Han_Learning_a_3D_Morphable_Face_Reflectance_Model_From_Low-Cost_Data_CVPR_2023_paper.pdf Modeling non-Lambertian effects such as facial specularity leads to a more realistic 3D Morphable Face Model. Existing works build parametric models for diffuse and specular albedo using Light Stage data. However, diffuse and specular albedo alone cannot determine the full BRDF. In addition, the requirement of Light Stage data is hard to fulfill for the research community. This paper proposes the first 3D morphable face reflectance model with spatially varying BRDF using only low-cost publicly-available data. We apply linear shininess weighting into parametric modeling to represent spatially varying specular intensity and shininess. Then an inverse rendering algorithm is developed to reconstruct the reflectance parameters from non-Light Stage data, which are used to train an initial morphable reflectance model. To enhance the model's generalization capability and expressive power, we further propose an update-by-reconstruction strategy to fine-tune it on an in-the-wild dataset. Experimental results show that our method obtains decent rendering results with plausible facial specularities. Our code is released at https://yxuhan.github.io/ReflectanceMM/index.html. + + + + Recurrent Homography Estimation Using Homography-Guided Image Warping and Focus Transformer http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Recurrent_Homography_Estimation_Using_Homography-Guided_Image_Warping_and_Focus_Transformer_CVPR_2023_paper.pdf We propose the Recurrent homography estimation framework using Homography-guided image Warping and Focus transformer (FocusFormer), named RHWF. Both being appropriately absorbed into the recurrent framework, the homography-guided image warping progressively enhances the feature consistency, and the attention-focusing mechanism in FocusFormer aggregates the intra-inter correspondence in a global->nonlocal->local manner.
Thanks to the above strategies, RHWF ranks top in accuracy on a variety of datasets, including the challenging cross-resolution and cross-modal ones. Meanwhile, benefiting from the recurrent framework, RHWF achieves parameter efficiency despite the transformer architecture. Compared to previous state-of-the-art approaches LocalTrans and IHN, RHWF reduces the mean average corner error (MACE) by about 70% and 38.1% on the MSCOCO dataset, while saving the parameter costs by 86.5% and 24.6%. Similar to the previous works, RHWF can also be arranged in 1-scale for efficiency and 2-scale for accuracy, with the 1-scale RHWF already outperforming most of the previous methods. Source code is available at https://github.com/imdumpl78/RHWF. + + + + I2-SDF: Intrinsic Indoor Scene Reconstruction and Editing via Raytracing in Neural SDFs + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_I2-SDF_Intrinsic_Indoor_Scene_Reconstruction_and_Editing_via_Raytracing_in_CVPR_2023_paper.pdf + In this work, we present I^2-SDF, a new method for intrinsic indoor scene reconstruction and editing using differentiable Monte Carlo raytracing on neural signed distance fields (SDFs). Our holistic neural SDF-based framework jointly recovers the underlying shapes, incident radiance and materials from multi-view images. We introduce a novel bubble loss for fine-grained small objects and error-guided adaptive sampling scheme to largely improve the reconstruction quality on large-scale indoor scenes. Further, we propose to decompose the neural radiance field into spatially-varying material of the scene as a neural field through surface-based, differentiable Monte Carlo raytracing and emitter semantic segmentations, which enables physically based and photorealistic scene relighting and editing applications. Through a number of qualitative and quantitative experiments, we demonstrate the superior quality of our method on indoor scene reconstruction, novel view synthesis, and scene editing compared to state-of-the-art baselines. Our project page is at https://jingsenzhu.github.io/i2-sdf. + + + + DLBD: A Self-Supervised Direct-Learned Binary Descriptor + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiao_DLBD_A_Self-Supervised_Direct-Learned_Binary_Descriptor_CVPR_2023_paper.pdf + For learning-based binary descriptors, the binarization process has not been well addressed. The reason is that the binarization blocks gradient back-propagation. Existing learning-based binary descriptors learn real-valued output, and then it is converted to binary descriptors by their proposed binarization processes. Since their binarization processes are not a component of the network, the learning-based binary descriptor cannot fully utilize the advances of deep learning. To solve this issue, we propose a model-agnostic plugin binary transformation layer (BTL), making the network directly generate binary descriptors. Then, we present the first self-supervised, direct-learned binary descriptor, dubbed DLBD. Furthermore, we propose ultra-wide temperature-scaled cross-entropy loss to adjust the distribution of learned descriptors in a larger range. Experiments demonstrate that the proposed BTL can substitute the previous binarization process. Our proposed DLBD outperforms SOTA on different tasks such as image retrieval and classification. 
+ + + + Fuzzy Positive Learning for Semi-Supervised Semantic Segmentation http://openaccess.thecvf.com//content/CVPR2023/papers/Qiao_Fuzzy_Positive_Learning_for_Semi-Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf Semi-supervised learning (SSL) essentially pursues class boundary exploration with less dependence on human annotations. Although typical attempts focus on ameliorating the inevitable error-prone pseudo-labeling, we think differently and resort to exhausting informative semantics from multiple probably correct candidate labels. In this paper, we introduce Fuzzy Positive Learning (FPL) for accurate SSL semantic segmentation in a plug-and-play fashion, targeting adaptively encouraging fuzzy positive predictions and suppressing highly-probable negatives. Being conceptually simple yet practically effective, FPL can remarkably alleviate interference from wrong pseudo labels and progressively achieve clear pixel-level semantic discrimination. Concretely, our FPL approach consists of two main components, including fuzzy positive assignment (FPA) to provide an adaptive number of labels for each pixel and fuzzy positive regularization (FPR) to restrict the predictions of fuzzy positive categories to be larger than the rest under different perturbations. Theoretical analysis and extensive experiments on Cityscapes and VOC 2012 with consistent performance gain justify the superiority of our approach. Codes are available at https://github.com/qpc1611094/FPL. + + + + Multi-View Inverse Rendering for Large-Scale Real-World Indoor Scenes http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Multi-View_Inverse_Rendering_for_Large-Scale_Real-World_Indoor_Scenes_CVPR_2023_paper.pdf We present an efficient multi-view inverse rendering method for large-scale real-world indoor scenes that reconstructs global illumination and physically-reasonable SVBRDFs. Unlike previous representations, where the global illumination of large scenes is simplified as multiple environment maps, we propose a compact representation called Texture-based Lighting (TBL). It consists of a 3D mesh and HDR textures, and efficiently models direct and infinite-bounce indirect lighting of the entire large scene. Based on TBL, we further propose a hybrid lighting representation with precomputed irradiance, which significantly improves the efficiency and alleviates the rendering noise in the material optimization. To physically disentangle the ambiguity between materials, we propose a three-stage material optimization strategy based on the priors of semantic segmentation and room segmentation. Extensive experiments show that the proposed method outperforms the state-of-the-art quantitatively and qualitatively, and enables physically-reasonable mixed-reality applications such as material editing, editable novel view synthesis and relighting. The project page is at https://lzleejean.github.io/TexIR. + + + + Boosting Transductive Few-Shot Fine-Tuning With Margin-Based Uncertainty Weighting and Probability Regularization http://openaccess.thecvf.com//content/CVPR2023/papers/Tao_Boosting_Transductive_Few-Shot_Fine-Tuning_With_Margin-Based_Uncertainty_Weighting_and_Probability_CVPR_2023_paper.pdf Few-Shot Learning (FSL) has developed rapidly in recent years, potentially eliminating the requirement for significant data acquisition. Few-shot fine-tuning has been demonstrated to be practically efficient and helpful, especially for out-of-distribution data.
In this work, we first observe that few-shot fine-tuned methods are learned with an imbalanced class marginal distribution. This observation further motivates us to propose the Transductive Fine-tuning with Margin-based uncertainty weighting and Probability regularization (TF-MP), which learns a more balanced class marginal distribution. We first conduct sample weighting on the testing data with margin-based uncertainty scores and further regularize each test sample's categorical probability. TF-MP achieves state-of-the-art performance on in- / out-of-distribution evaluations of Meta-Dataset and surpasses previous transductive methods by a large margin. + + + + SMPConv: Self-Moving Point Representations for Continuous Convolution http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_SMPConv_Self-Moving_Point_Representations_for_Continuous_Convolution_CVPR_2023_paper.pdf Continuous convolution has recently gained prominence due to its ability to handle irregularly sampled data and model long-term dependencies. Also, the promising experimental results of using large convolutional kernels have catalyzed the development of continuous convolution since they can construct large kernels very efficiently. Leveraging neural networks, more specifically multilayer perceptrons (MLPs), is by far the most prevalent approach to implementing continuous convolution. However, there are a few drawbacks, such as high computational costs, complex hyperparameter tuning, and limited descriptive power of filters. This paper suggests an alternative approach to building a continuous convolution without neural networks, resulting in more computationally efficient and improved performance. We present self-moving point representations where weight parameters freely move, and interpolation schemes are used to implement continuous functions. When applied to construct convolutional kernels, the experimental results have shown improved performance with drop-in replacement in the existing frameworks. Due to its lightweight structure, we are the first to demonstrate the effectiveness of continuous convolution in a large-scale setting, e.g., ImageNet, presenting improvements over the prior art. Our code is available at https://github.com/sangnekim/SMPConv + + + + PRISE: Demystifying Deep Lucas-Kanade With Strongly Star-Convex Constraints for Multimodel Image Alignment http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_PRISE_Demystifying_Deep_Lucas-Kanade_With_Strongly_Star-Convex_Constraints_for_Multimodel_CVPR_2023_paper.pdf The Lucas-Kanade (LK) method is a classic iterative homography estimation algorithm for image alignment, but often suffers from poor local optimality especially when image pairs have large distortions. To address this challenge, in this paper we propose a novel Deep Star-Convexified Lucas-Kanade (PRISE) method for multimodel image alignment by introducing strongly star-convex constraints into the optimization problem. Our basic idea is to enforce the neural network to approximately learn a star-convex loss landscape around the ground truth given any data, to facilitate the convergence of the LK method to the ground truth through the high-dimensional space defined by the network. This leads to a minimax learning problem, with contrastive (hinge) losses due to the definition of strong star-convexity that are appended to the original loss for training.
We also provide an efficient sampling-based algorithm to leverage the training cost, as well as some analysis on the quality of the solutions from PRISE. We further evaluate our approach on benchmark datasets such as MSCOCO, GoogleEarth, and GoogleMap, and demonstrate state-of-the-art results, especially for small pixel errors. Demo code is attached. + + + + Learning To Exploit Temporal Structure for Biomedical Vision-Language Processing http://openaccess.thecvf.com//content/CVPR2023/papers/Bannur_Learning_To_Exploit_Temporal_Structure_for_Biomedical_Vision-Language_Processing_CVPR_2023_paper.pdf Self-supervised learning in vision--language processing (VLP) exploits semantic alignment between imaging and text modalities. Prior work in biomedical VLP has mostly relied on the alignment of single image and report pairs even though clinical notes commonly refer to prior images. This not only introduces poor alignment between the modalities but also misses an opportunity to exploit rich self-supervision through existing temporal content in the data. In this work, we explicitly account for prior images and reports when available during both training and fine-tuning. Our approach, named BioViL-T, uses a CNN--Transformer hybrid multi-image encoder trained jointly with a text model. It is designed to be versatile to arising challenges such as pose variations and missing input images across time. The resulting model excels on downstream tasks both in single- and multi-image setups, achieving state-of-the-art (SOTA) performance on (I) progression classification, (II) phrase grounding, and (III) report generation, whilst offering consistent improvements on disease classification and sentence-similarity tasks. We release a novel multi-modal temporal benchmark dataset, CXR-T, to quantify the quality of vision--language representations in terms of temporal semantics. Our experimental results show the significant advantages of incorporating prior images and reports to make the most use of the data. + + + + Simple Cues Lead to a Strong Multi-Object Tracker http://openaccess.thecvf.com//content/CVPR2023/papers/Seidenschwarz_Simple_Cues_Lead_to_a_Strong_Multi-Object_Tracker_CVPR_2023_paper.pdf For a long time, the most common paradigm in Multi-Object Tracking was tracking-by-detection (TbD), where objects are first detected and then associated over video frames. For association, most models resorted to motion and appearance cues, e.g., re-identification networks. Recent approaches based on attention propose to learn the cues in a data-driven manner, showing impressive results. In this paper, we ask ourselves whether simple good old TbD methods are also capable of achieving the performance of end-to-end models. To this end, we propose two key ingredients that allow a standard re-identification network to excel at appearance-based tracking. We extensively analyse its failure cases, and show that a combination of our appearance features with a simple motion model leads to strong tracking results. Our tracker generalizes to four public datasets, namely MOT17, MOT20, BDD100k, and DanceTrack, achieving state-of-the-art performance. https://github.com/dvl-tum/GHOST + + + + Marching-Primitives: Shape Abstraction From Signed Distance Function http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Marching-Primitives_Shape_Abstraction_From_Signed_Distance_Function_CVPR_2023_paper.pdf Representing complex objects with basic geometric primitives has long been a topic in computer vision.
Primitive-based representations have the merits of compactness and computational efficiency in higher-level tasks such as physics simulation, collision checking, and robotic manipulation. Unlike previous works which extract polygonal meshes from a signed distance function (SDF), in this paper, we present a novel method, named Marching-Primitives, to obtain a primitive-based abstraction directly from an SDF. Our method grows geometric primitives (such as superquadrics) iteratively by analyzing the connectivity of voxels while marching at different levels of signed distance. For each valid connected volume of interest, we march on the scope of voxels from which a primitive is able to be extracted in a probabilistic sense and simultaneously solve for the parameters of the primitive to capture the underlying local geometry. We evaluate the performance of our method on both synthetic and real-world datasets. The results show that the proposed method outperforms the state-of-the-art in terms of accuracy, and is directly generalizable among different categories and scales. The code is open-sourced at https://github.com/ChirikjianLab/Marching-Primitives.git. + + + + PointVector: A Vector Representation in Point Cloud Analysis + http://openaccess.thecvf.com//content/CVPR2023/papers/Deng_PointVector_A_Vector_Representation_in_Point_Cloud_Analysis_CVPR_2023_paper.pdf + In point cloud analysis, point-based methods have rapidly developed in recent years. These methods have recently focused on concise MLP structures, such as PointNeXt, which have demonstrated competitiveness with Convolutional and Transformer structures. However, standard MLPs are limited in their ability to extract local features effectively. To address this limitation, we propose a Vector-oriented Point Set Abstraction that can aggregate neighboring features through higher-dimensional vectors. To facilitate network optimization, we construct a transformation from scalar to vector using independent angles based on 3D vector rotations. Finally, we develop a PointVector model that follows the structure of PointNeXt. Our experimental results demonstrate that PointVector achieves state-of-the-art performance 72.3% mIOU on the S3DIS Area 5 and 78.4% mIOU on the S3DIS (6-fold cross-validation) with only 58% model parameters of PointNeXt. We hope our work will help the exploration of concise and effective feature representations. The code will be released soon. + + + + BAEFormer: Bi-Directional and Early Interaction Transformers for Bird's Eye View Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Pan_BAEFormer_Bi-Directional_and_Early_Interaction_Transformers_for_Birds_Eye_View_CVPR_2023_paper.pdf + Bird's Eye View (BEV) semantic segmentation is a critical task in autonomous driving. However, existing Transformer-based methods confront difficulties in transforming Perspective View (PV) to BEV due to their unidirectional and posterior interaction mechanisms. To address this issue, we propose a novel Bi-directional and Early Interaction Transformers framework named BAEFormer, consisting of (i) an early-interaction PV-BEV pipeline and (ii) a bi-directional cross-attention mechanism. Moreover, we find that the image feature maps' resolution in the cross-attention module has a limited effect on the final performance. 
Under this critical observation, we propose to enlarge the size of input images and downsample the multi-view image features for cross-interaction, further improving the accuracy while keeping the amount of computation controllable. Our proposed method for BEV semantic segmentation achieves state-of-the-art performance in real-time inference speed on the nuScenes dataset, i.e., 38.9 mIoU at 45 FPS on a single A100 GPU. + + + + Generic-to-Specific Distillation of Masked Autoencoders + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Generic-to-Specific_Distillation_of_Masked_Autoencoders_CVPR_2023_paper.pdf + Large vision Transformers (ViTs) driven by self-supervised pre-training mechanisms achieved unprecedented progress. Lightweight ViT models limited by the model capacity, however, benefit little from those pre-training mechanisms. Knowledge distillation defines a paradigm to transfer representations from large (teacher) models to small (student) ones. However, the conventional single-stage distillation easily gets stuck on task-specific transfer, failing to retain the task-agnostic knowledge crucial for model generalization. In this study, we propose generic-to-specific distillation (G2SD), to tap the potential of small ViT models under the supervision of large models pre-trained by masked autoencoders. In generic distillation, decoder of the small model is encouraged to align feature predictions with hidden representations of the large model, so that task-agnostic knowledge can be transferred. In specific distillation, predictions of the small model are constrained to be consistent with those of the large model, to transfer task-specific features which guarantee task performance. With G2SD, the vanilla ViT-Small model respectively achieves 98.7%, 98.1% and 99.3% the performance of its teacher (ViT-Base) for image classification, object detection, and semantic segmentation, setting a solid baseline for two-stage vision distillation. Code will be available at https://github.com/pengzhiliang/G2SD. + + + + Combining Implicit-Explicit View Correlation for Light Field Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Cong_Combining_Implicit-Explicit_View_Correlation_for_Light_Field_Semantic_Segmentation_CVPR_2023_paper.pdf + Since light field simultaneously records spatial information and angular information of light rays, it is considered to be beneficial for many potential applications, and semantic segmentation is one of them. The regular variation of image information across views facilitates a comprehensive scene understanding. However, in the case of limited memory, the high-dimensional property of light field makes the problem more intractable than generic semantic segmentation, manifested in the difficulty of fully exploiting the relationships among views while maintaining contextual information in single view. In this paper, we propose a novel network called LF-IENet for light field semantic segmentation. It contains two different manners to mine complementary information from surrounding views to segment central view. One is implicit feature integration that leverages attention mechanism to compute inter-view and intra-view similarity to modulate features of central view. The other is explicit feature propagation that directly warps features of other views to central view under the guidance of disparity. They complement each other and jointly realize complementary information fusion across views in light field. 
The proposed method achieves superior performance on both real-world and synthetic light field datasets, demonstrating the effectiveness of this new architecture. + + + + SOOD: Towards Semi-Supervised Oriented Object Detection http://openaccess.thecvf.com//content/CVPR2023/papers/Hua_SOOD_Towards_Semi-Supervised_Oriented_Object_Detection_CVPR_2023_paper.pdf Semi-Supervised Object Detection (SSOD), aiming to explore unlabeled data for boosting object detectors, has become an active task in recent years. However, existing SSOD approaches mainly focus on horizontal objects, leaving multi-oriented objects that are common in aerial images unexplored. This paper proposes a novel Semi-supervised Oriented Object Detection model, termed SOOD, built upon the mainstream pseudo-labeling framework. Towards oriented objects in aerial scenes, we design two loss functions to provide better supervision. Focusing on the orientations of objects, the first loss regularizes the consistency between each pseudo-label-prediction pair (each comprising a prediction and its corresponding pseudo-label) with adaptive weights based on their orientation gap. Focusing on the layout of an image, the second loss regularizes the similarity and explicitly builds the many-to-many relation between the sets of pseudo-labels and predictions. Such a global consistency constraint can further boost semi-supervised learning. Our experiments show that when trained with the two proposed losses, SOOD surpasses the state-of-the-art SSOD methods under various settings on the DOTA-v1.5 benchmark. The code will be available at https://github.com/HamPerdredes/SOOD. + + + + Beyond mAP: Towards Better Evaluation of Instance Segmentation http://openaccess.thecvf.com//content/CVPR2023/papers/Jena_Beyond_mAP_Towards_Better_Evaluation_of_Instance_Segmentation_CVPR_2023_paper.pdf Correct instance segmentation requires counting the number of objects, correctly localizing all predictions, and classifying each localized prediction. Average Precision is the de-facto metric used to measure all these constituents of segmentation. However, this metric does not penalize duplicate predictions in the high-recall range, and cannot distinguish instances that are localized correctly but categorized incorrectly. This weakness has inadvertently led to network designs that achieve significant gains in AP but also introduce a large number of false positives. We therefore cannot rely on AP to choose a model that provides an optimal tradeoff between false positives and high recall. To resolve this dilemma, we review alternative metrics in the literature and propose two new measures to explicitly measure the amount of both spatial and categorical duplicate predictions. We also propose a Semantic Sorting and NMS module to remove these duplicates based on a pixel occupancy matching scheme. Experiments show that modern segmentation networks have significant gains in AP, but also contain a considerable amount of duplicates. Our Semantic Sorting and NMS can be added as a plug-and-play module to mitigate hedged predictions and preserve AP. + + + + BASiS: Batch Aligned Spectral Embedding Space http://openaccess.thecvf.com//content/CVPR2023/papers/Streicher_BASiS_Batch_Aligned_Spectral_Embedding_Space_CVPR_2023_paper.pdf Graphs are a highly generic and diverse representation, suitable for almost any data processing problem. Spectral graph theory has been shown to provide powerful algorithms, backed by solid linear algebra theory.
It thus can be extremely instrumental to design deep network building blocks with spectral graph characteristics. For instance, such a network allows the design of optimal graphs for certain tasks or obtaining a canonical orthogonal low-dimensional embedding of the data. Recent attempts to solve this problem were based on minimizing Rayleigh-quotient type losses. We propose a different approach of directly learning the graph's eigenspace. A severe problem of the direct approach, applied in batch-learning, is the inconsistent mapping of features to eigenspace coordinates in different batches. We analyze the degrees of freedom of learning this task using batches and propose a stable alignment mechanism that can work both with batch changes and with graph-metric changes. We show that our learnt spectral embedding is better in terms of NMI, ACC, Grassmann distance, orthogonality and classification accuracy, compared to SOTA. In addition, the learning is more stable. + + + + DCFace: Synthetic Face Generation With Dual Condition Diffusion Model http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_DCFace_Synthetic_Face_Generation_With_Dual_Condition_Diffusion_Model_CVPR_2023_paper.pdf Generating synthetic datasets for training face recognition models is challenging because dataset generation entails more than creating high fidelity images. It involves generating multiple images of the same subjects under different factors (e.g., variations in pose, illumination, expression, aging and occlusion) which follow the real image conditional distribution. Previous works have studied the generation of synthetic datasets using GAN or 3D models. In this work, we approach the problem from the aspect of combining subject appearance (ID) and external factor (style) conditions. These two conditions provide a direct way to control the inter-class and intra-class variations. To this end, we propose a Dual Condition Face Generator (DCFace) based on a diffusion model. Our novel Patch-wise style extractor and Time-step dependent ID loss enable DCFace to consistently produce face images of the same subject under different styles with precise control. Face recognition models trained on synthetic images from the proposed DCFace provide higher verification accuracies compared to previous works by 6.11% on average in 4 out of 5 test datasets, LFW, CFP-FP, CPLFW, AgeDB and CALFW. Model, code, and synthetic dataset are available at https://github.com/mk-minchul/dcface + + + + Infinite Photorealistic Worlds Using Procedural Generation http://openaccess.thecvf.com//content/CVPR2023/papers/Raistrick_Infinite_Photorealistic_Worlds_Using_Procedural_Generation_CVPR_2023_paper.pdf We introduce Infinigen, a procedural generator of photorealistic 3D scenes of the natural world. Infinigen is entirely procedural: every asset, from shape to texture, is generated from scratch via randomized mathematical rules, using no external source and allowing infinite variation and composition. Infinigen offers broad coverage of objects and scenes in the natural world including plants, animals, terrains, and natural phenomena such as fire, cloud, rain, and snow. Infinigen can be used to generate unlimited, diverse training data for a wide range of computer vision tasks including object detection, semantic segmentation, optical flow, and 3D reconstruction. We expect Infinigen to be a useful resource for computer vision research and beyond. Please visit https://infinigen.org for videos, code and pre-generated data.
+ + + + Diversity-Measurable Anomaly Detection http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Diversity-Measurable_Anomaly_Detection_CVPR_2023_paper.pdf Reconstruction-based anomaly detection models achieve their purpose by suppressing the generalization ability for anomalies. However, diverse normal patterns are consequently not well reconstructed either. Although some efforts have been made to alleviate this problem by modeling sample diversity, they suffer from shortcut learning due to undesired transmission of abnormal information. In this paper, to better solve the tradeoff problem, we propose the Diversity-Measurable Anomaly Detection (DMAD) framework to enhance reconstruction diversity while avoiding undesired generalization on anomalies. To this end, we design the Pyramid Deformation Module (PDM), which models diverse normals and measures the severity of anomalies by estimating multi-scale deformation fields from the reconstructed reference to the original input. Integrated with an information compression module, PDM essentially decouples deformation from prototypical embedding and makes the final anomaly score more reliable. Experimental results on both surveillance videos and industrial images demonstrate the effectiveness of our method. In addition, DMAD works equally well in the presence of contaminated data and anomaly-like normal samples. + + + + A Large-Scale Robustness Analysis of Video Action Recognition Models http://openaccess.thecvf.com//content/CVPR2023/papers/Schiappa_A_Large-Scale_Robustness_Analysis_of_Video_Action_Recognition_Models_CVPR_2023_paper.pdf We have seen great progress in video action recognition in recent years. There are several models based on convolutional neural networks (CNNs) and some recent transformer-based approaches which provide top performance on existing benchmarks. In this work, we perform a large-scale robustness analysis of these existing models for video action recognition. We focus on robustness against real-world distribution shift perturbations instead of adversarial perturbations. We propose four different benchmark datasets, HMDB51-P, UCF101-P, Kinetics400-P, and SSv2-P, to perform this analysis. We study the robustness of six state-of-the-art action recognition models against 90 different perturbations. The study reveals some interesting findings: 1) Transformer-based models are consistently more robust than CNN-based models, 2) pre-training improves robustness for Transformer-based models more than for CNN-based models, and 3) all of the studied models are robust to temporal perturbations for all datasets but SSv2, suggesting that the importance of temporal information for action recognition varies with the dataset and activities. Next, we study the role of augmentations in model robustness and present a real-world dataset, UCF101-DS, which contains realistic distribution shifts, to further validate some of these findings. We believe this study will serve as a benchmark for future research in robust video action recognition. + + + + Blind Video Deflickering by Neural Filtering With a Flawed Atlas http://openaccess.thecvf.com//content/CVPR2023/papers/Lei_Blind_Video_Deflickering_by_Neural_Filtering_With_a_Flawed_Atlas_CVPR_2023_paper.pdf Many videos contain flickering artifacts; common causes of flicker include video processing algorithms, video generation algorithms, and capturing videos under specific situations.
Prior work usually requires specific guidance such as the flickering frequency, manual annotations, or extra consistent videos to remove the flicker. In this work, we propose a general flicker removal framework that only receives a single flickering video as input without additional guidance. Since it is blind to a specific flickering type or guidance, we name this "blind deflickering." The core of our approach is utilizing the neural atlas in cooperation with a neural filtering strategy. The neural atlas is a unified representation for all frames in a video that provides temporal consistency guidance but is flawed in many cases. To this end, a neural network is trained to mimic a filter to learn the consistent features (e.g., color, brightness) and avoid introducing the artifacts in the atlas. To validate our method, we construct a dataset that contains diverse real-world flickering videos. Extensive experiments show that our method achieves satisfying deflickering performance and even outperforms baselines that use extra guidance on a public benchmark. The source code is publicly available at https://chenyanglei.github.io/deflicker. + + + + Grid-Guided Neural Radiance Fields for Large Urban Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Grid-Guided_Neural_Radiance_Fields_for_Large_Urban_Scenes_CVPR_2023_paper.pdf + Purely MLP-based neural radiance fields (NeRF-based methods) often suffer from underfitting with blurred renderings on large-scale scenes due to limited model capacity. Recent approaches propose to geographically divide the scene and adopt multiple sub-NeRFs to model each region individually, leading to linear scale-up in training costs and the number of sub-NeRFs as the scene expands. An alternative solution is to use a feature grid representation, which is computationally efficient and can naturally scale to a large scene with increased grid resolutions. However, the feature grid tends to be less constrained and often reaches suboptimal solutions, producing noisy artifacts in renderings, especially in regions with complex geometry and texture. In this work, we present a new framework that realizes high-fidelity rendering on large urban scenes while being computationally efficient. We propose to use a compact multi-resolution ground feature plane representation to coarsely capture the scene, and complement it with positional encoding inputs through another NeRF branch for rendering in a joint learning fashion. We show that such an integration can utilize the advantages of two alternative solutions: a light-weighted NeRF is sufficient, under the guidance of the feature grid representation, to render photorealistic novel views with fine details; and the jointly optimized ground feature planes, can meanwhile gain further refinements, forming a more accurate and compact feature space and output much more natural rendering results. + + + + FreeNeRF: Improving Few-Shot Neural Rendering With Free Frequency Regularization + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_FreeNeRF_Improving_Few-Shot_Neural_Rendering_With_Free_Frequency_Regularization_CVPR_2023_paper.pdf + Novel view synthesis with sparse inputs is a challenging problem for neural radiance fields (NeRF). Recent efforts alleviate this challenge by introducing external supervision, such as pre-trained models and extra depth signals, or by using non-trivial patch-based rendering. 
In this paper, we present Frequency regularized NeRF (FreeNeRF), a surprisingly simple baseline that outperforms previous methods with minimal modifications to plain NeRF. We analyze the key challenges in few-shot neural rendering and find that frequency plays an important role in NeRF's training. Based on this analysis, we propose two regularization terms: one to regularize the frequency range of NeRF's inputs, and the other to penalize the near-camera density fields. Both techniques are "free lunches" that come at no additional computational cost. We demonstrate that even with just one line of code change, the original NeRF can achieve similar performance to other complicated methods in the few-shot setting. FreeNeRF achieves state-of-the-art performance across diverse datasets, including Blender, DTU, and LLFF. We hope that this simple baseline will motivate a rethinking of the fundamental role of frequency in NeRF's training, under both the low-data regime and beyond. This project is released at https://jiawei-yang.github.io/FreeNeRF/. + + + + NeuWigs: A Neural Dynamic Model for Volumetric Hair Capture and Animation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_NeuWigs_A_Neural_Dynamic_Model_for_Volumetric_Hair_Capture_and_CVPR_2023_paper.pdf + The capture and animation of human hair are two of the major challenges in the creation of realistic avatars for the virtual reality. Both problems are highly challenging, because hair has complex geometry and appearance, as well as exhibits challenging motion. In this paper, we present a two-stage approach that models hair independently from the head to address these challenges in a data-driven manner. The first stage, state compression, learns a low-dimensional latent space of 3D hair states containing motion and appearance, via a novel autoencoder-as-a-tracker strategy. To better disentangle the hair and head in appearance learning, we employ multi-view hair segmentation masks in combination with a differentiable volumetric renderer. The second stage learns a novel hair dynamics model that performs temporal hair transfer based on the discovered latent codes. To enforce higher stability while driving our dynamics model, we employ the 3D point-cloud autoencoder from the compression stage for de-noising of the hair state. Our model outperforms the state of the art in novel view synthesis and is capable of creating novel hair animations without having to rely on hair observations as a driving signal + + + + CLIP2: Contrastive Language-Image-Point Pretraining From Real-World Point Cloud Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Zeng_CLIP2_Contrastive_Language-Image-Point_Pretraining_From_Real-World_Point_Cloud_Data_CVPR_2023_paper.pdf + Contrastive Language-Image Pre-training, benefiting from large-scale unlabeled text-image pairs, has demonstrated great performance in open-world vision understanding tasks. However, due to the limited Text-3D data pairs, adapting the success of 2D Vision-Language Models (VLM) to the 3D space remains an open problem. Existing works that leverage VLM for 3D understanding generally resort to constructing intermediate 2D representations for the 3D data, but at the cost of losing 3D geometry information. To take a step toward open-world 3D vision understanding, we propose Contrastive Language-Image-Point Cloud Pretraining (CLIP^2) to directly learn the transferable 3D point cloud representation in realistic scenarios with a novel proxy alignment mechanism. 
Specifically, we exploit naturally existing correspondences in 2D and 3D scenarios, and build well-aligned and instance-based text-image-point proxies from those complex scenarios. On top of that, we propose a cross-modal contrastive objective to learn semantic and instance-level aligned point cloud representation. Experimental results on both indoor and outdoor scenarios show that our learned 3D representation has great transfer ability in downstream tasks, including zero-shot and few-shot 3D recognition, outperforming state-of-the-art methods by large margins. Furthermore, we provide analyses of the capability of different representations in real scenarios and present an optional ensemble scheme. + + + + HNeRV: A Hybrid Neural Representation for Videos http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_HNeRV_A_Hybrid_Neural_Representation_for_Videos_CVPR_2023_paper.pdf Implicit neural representations store videos as neural networks and have performed well for vision tasks such as video compression and denoising. With frame index and/or positional index as input, implicit representations (NeRV, E-NeRV, etc.) reconstruct video frames from fixed and content-agnostic embeddings. Such embedding largely limits the regression capacity and internal generalization for video interpolation. In this paper, we propose a Hybrid Neural Representation for Videos (HNeRV), where learnable and content-adaptive embeddings act as decoder input. Besides the input embedding, we introduce an HNeRV block to make model parameters evenly distributed across the entire network, so that higher layers (layers near the output) can have more capacity to store high-resolution content and video details. With content-adaptive embedding and re-designed model architecture, HNeRV outperforms implicit methods (NeRV, E-NeRV) on the video regression task for both reconstruction quality and convergence speed, and shows better internal generalization. As a simple and efficient video representation, HNeRV also shows decoding advantages for speed, flexibility, and deployment, compared to traditional codecs (H.264, H.265) and learning-based compression methods. Finally, we explore the effectiveness of HNeRV on downstream tasks such as video compression and video inpainting. + + + + Model-Agnostic Gender Debiased Image Captioning http://openaccess.thecvf.com//content/CVPR2023/papers/Hirota_Model-Agnostic_Gender_Debiased_Image_Captioning_CVPR_2023_paper.pdf Image captioning models are known to perpetuate and amplify harmful societal bias in the training set. In this work, we aim to mitigate such gender bias in image captioning models. While prior work has addressed this problem by forcing models to focus on people to reduce gender misclassification, it conversely generates gender-stereotypical words at the expense of predicting the correct gender. From this observation, we hypothesize that there are two types of gender bias affecting image captioning models: 1) bias that exploits context to predict gender, and 2) bias in the probability of generating certain (often stereotypical) words because of gender. To mitigate both types of gender biases, we propose a framework, called LIBRA, that learns from synthetically biased samples to decrease both types of biases, correcting gender misclassification and changing gender-stereotypical words to more neutral ones.
+ + + + FitMe: Deep Photorealistic 3D Morphable Model Avatars + http://openaccess.thecvf.com//content/CVPR2023/papers/Lattas_FitMe_Deep_Photorealistic_3D_Morphable_Model_Avatars_CVPR_2023_paper.pdf + In this paper, we introduce FitMe, a facial reflectance model and a differentiable rendering optimization pipeline, that can be used to acquire high-fidelity renderable human avatars from single or multiple images. The model consists of a multi-modal style-based generator, that captures facial appearance in terms of diffuse and specular reflectance, and a PCA-based shape model. We employ a fast differentiable rendering process that can be used in an optimization pipeline, while also achieving photorealistic facial shading. Our optimization process accurately captures both the facial reflectance and shape in high-detail, by exploiting the expressivity of the style-based latent representation and of our shape model. FitMe achieves state-of-the-art reflectance acquisition and identity preservation on single "in-the-wild" facial images, while it produces impressive scan-like results, when given multiple unconstrained facial images pertaining to the same identity. In contrast with recent implicit avatar reconstructions, FitMe requires only one minute and produces relightable mesh and texture-based avatars, that can be used by end-user applications. + + + + CLIPPO: Image-and-Language Understanding From Pixels Only + http://openaccess.thecvf.com//content/CVPR2023/papers/Tschannen_CLIPPO_Image-and-Language_Understanding_From_Pixels_Only_CVPR_2023_paper.pdf + Multimodal models are becoming increasingly effective, in part due to unified components, such as the Transformer architecture. However, multimodal models still often consist of many task- and modality-specific pieces and training procedures. For example, CLIP (Radford et al., 2021) trains independent text and image towers via a contrastive loss. We explore an additional unification: the use of a pure pixel-based model to perform image, text, and multimodal tasks. Our model is trained with contrastive loss alone, so we call it CLIP-Pixels Only (CLIPPO). CLIPPO uses a single encoder that processes both regular images and text rendered as images. CLIPPO performs image-based tasks such as retrieval and zero-shot image classification almost as well as CLIP-style models, with half the number of parameters and no text-specific tower or embedding. When trained jointly via image-text contrastive learning and next-sentence contrastive learning, CLIPPO can perform well on natural language understanding tasks, without any word-level loss (language modelling or masked language modelling), outperforming pixel-based prior work. Surprisingly, CLIPPO can obtain good accuracy in visual question answering, simply by rendering the question and image together. Finally, we exploit the fact that CLIPPO does not require a tokenizer to show that it can achieve strong performance on multilingual multimodal retrieval without modifications. Code and pretrained models are available at https://github.com/google-research/big_vision. 
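As a rough illustration of the single-tower contrastive setup described in the CLIPPO abstract above, the sketch below (PyTorch-style, with hypothetical tensor names; a minimal sketch, not the authors' released code) computes a symmetric contrastive loss between embeddings of regular images and of text rendered as images, both produced by one shared pixel encoder:

    import torch
    import torch.nn.functional as F

    def clippo_style_contrastive_loss(image_emb, rendered_text_emb, temperature=0.07):
        # Both inputs are (batch, dim) embeddings from the SAME pixel encoder:
        # one batch from regular images, one from text rendered as images.
        image_emb = F.normalize(image_emb, dim=-1)
        rendered_text_emb = F.normalize(rendered_text_emb, dim=-1)
        logits = image_emb @ rendered_text_emb.t() / temperature  # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)      # match each image to its rendered text
        loss_t2i = F.cross_entropy(logits.t(), targets)  # and each rendered text to its image
        return 0.5 * (loss_i2t + loss_t2i)

The same loss form would also apply to the next-sentence contrastive training mentioned in the abstract, with both batches being rendered text.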
+ + + + DETR With Additional Global Aggregation for Cross-Domain Weakly Supervised Object Detection http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_DETR_With_Additional_Global_Aggregation_for_Cross-Domain_Weakly_Supervised_Object_CVPR_2023_paper.pdf This paper presents a DETR-based method for cross-domain weakly supervised object detection (CDWSOD), aiming at adapting the detector from source to target domain through weak supervision. We think DETR has strong potential for CDWSOD due to an insight: the encoder and the decoder in DETR are both based on the attention mechanism and are thus capable of aggregating semantics across the entire image. The aggregation results, i.e., image-level predictions, can naturally exploit the weak supervision for domain alignment. Thus motivated, we propose DETR with additional Global Aggregation (DETR-GA), a CDWSOD detector that simultaneously makes "instance-level + image-level" predictions and utilizes "strong + weak" supervisions. The key point of DETR-GA is very simple: for the encoder / decoder, we respectively add multiple class queries / a foreground query to aggregate the semantics into image-level predictions. Our query-based aggregation has two advantages. First, in the encoder, the weakly-supervised class queries are capable of roughly locating the corresponding positions and excluding the distraction from non-relevant regions. Second, through our design, the object queries and the foreground query in the decoder share consensus on the class semantics, therefore making the strong and weak supervision mutually benefit each other for domain alignment. Extensive experiments on four popular cross-domain benchmarks show that DETR-GA significantly improves CDWSOD and advances the state of the art (e.g., 29.0% --> 79.4% mAP on PASCAL VOC --> Clipart_all dataset). + + + + Towards Bridging the Performance Gaps of Joint Energy-Based Models http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Towards_Bridging_the_Performance_Gaps_of_Joint_Energy-Based_Models_CVPR_2023_paper.pdf Can we train a hybrid discriminative-generative model with a single network? This question has recently been answered in the affirmative, introducing the field of Joint Energy-based Model (JEM), which achieves high classification accuracy and image generation quality simultaneously. Despite recent advances, there remain two performance gaps: the accuracy gap to the standard softmax classifier, and the generation quality gap to state-of-the-art generative models. In this paper, we introduce a variety of training techniques to bridge the accuracy gap and the generation quality gap of JEM. 1) We incorporate a recently proposed sharpness-aware minimization (SAM) framework to train JEM, which promotes the energy landscape smoothness and the generalization of JEM. 2) We exclude data augmentation from the maximum likelihood estimate pipeline of JEM, and mitigate the negative impact of data augmentation on image generation quality. Extensive experiments on multiple datasets demonstrate that our SADA-JEM achieves state-of-the-art performance and outperforms JEM in image classification, image generation, calibration, out-of-distribution detection and adversarial robustness by a notable margin. Our code is available at https://github.com/sndnyang/SADAJEM.
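For context on the JEM formulation that the abstract above builds on, the standard construction reinterprets a softmax classifier's logits as an energy function, E(x) = -logsumexp_y f(x)[y]; a minimal sketch (PyTorch-style, illustrative only, not the SADA-JEM training code) is:

    import torch

    def jem_energy(logits):
        # logits: (batch, num_classes) output of any softmax classifier f(x).
        # Lower energy corresponds to a higher unnormalized marginal density p(x).
        return -torch.logsumexp(logits, dim=-1)

Training in this family then typically combines the usual cross-entropy term with a maximum-likelihood term on this energy, estimated with SGLD-style sampling.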
+ + + + expOSE: Accurate Initialization-Free Projective Factorization Using Exponential Regularization http://openaccess.thecvf.com//content/CVPR2023/papers/Iglesias_expOSE_Accurate_Initialization-Free_Projective_Factorization_Using_Exponential_Regularization_CVPR_2023_paper.pdf Bundle adjustment is a key component in practically all available Structure from Motion systems. While it is crucial for achieving accurate reconstruction, convergence to the right solution hinges on good initialization. The recently introduced factorization-based pOSE methods formulate a surrogate for the bundle adjustment error without reliance on good initialization. In this paper, we show that pOSE has an undesirable penalization of large depths. To address this, we propose expOSE, which has an exponential regularization that is negligible for positive depths. To achieve efficient inference, we use a quadratic approximation that allows an iterative solution with VarPro. Furthermore, we extend the method with radial distortion robustness by decomposing the Object Space Error into radial and tangential components. Experimental results confirm that the proposed method is robust to initialization and improves reconstruction quality compared to state-of-the-art methods even without bundle adjustment refinement. + + + + OpenGait: Revisiting Gait Recognition Towards Better Practicality http://openaccess.thecvf.com//content/CVPR2023/papers/Fan_OpenGait_Revisiting_Gait_Recognition_Towards_Better_Practicality_CVPR_2023_paper.pdf Gait recognition is one of the most critical long-distance identification technologies and increasingly gains popularity in both research and industry communities. Despite the significant progress made on indoor datasets, much evidence shows that gait recognition techniques perform poorly in the wild. More importantly, we also find that some conclusions drawn from indoor datasets cannot be generalized to real applications. Therefore, the primary goal of this paper is to present a comprehensive benchmark study for better practicality rather than only a particular model for better performance. To this end, we first develop a flexible and efficient gait recognition codebase named OpenGait. Based on OpenGait, we deeply revisit the recent development of gait recognition by re-conducting the ablative experiments. Encouragingly, we detect some imperfect parts of certain prior works, as well as new insights. Inspired by these discoveries, we develop a structurally simple, empirically powerful, and practically robust baseline model, GaitBase. Experimentally, we comprehensively compare GaitBase with many current gait recognition methods on multiple public datasets, and the results reflect that GaitBase achieves consistently strong performance in most cases, regardless of indoor or outdoor situations. Code is available at https://github.com/ShiqiYu/OpenGait. + + + + DATID-3D: Diversity-Preserved Domain Adaptation Using Text-to-Image Diffusion for 3D Generative Model http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_DATID-3D_Diversity-Preserved_Domain_Adaptation_Using_Text-to-Image_Diffusion_for_3D_Generative_CVPR_2023_paper.pdf Recent 3D generative models have achieved remarkable performance in synthesizing high-resolution photorealistic images with view consistency and detailed 3D shapes, but training them for diverse domains is challenging since it requires massive training images and their camera distribution information.
Text-guided domain adaptation methods have shown impressive performance on converting the 2D generative model on one domain into the models on other domains with different styles by leveraging the CLIP (Contrastive Language-Image Pre-training), rather than collecting massive datasets for those domains. However, one drawback of them is that the sample diversity in the original generative model is not well-preserved in the domain-adapted generative models due to the deterministic nature of the CLIP text encoder. Text-guided domain adaptation will be even more challenging for 3D generative models not only because of catastrophic diversity loss, but also because of inferior text-image correspondence and poor image quality. Here we propose DATID-3D, a domain adaptation method tailored for 3D generative models using text-to-image diffusion models that can synthesize diverse images per text prompt without collecting additional images and camera information for the target domain. Unlike 3D extensions of prior text-guided domain adaptation methods, our novel pipeline was able to fine-tune the state-of-the-art 3D generator of the source domain to synthesize high resolution, multi-view consistent images in text-guided targeted domains without additional data, outperforming the existing text-guided domain adaptation methods in diversity and text-image correspondence. Furthermore, we propose and demonstrate diverse 3D image manipulations such as one-shot instance-selected adaptation and single-view manipulated 3D reconstruction to fully enjoy diversity in text. + + + + Learning Neural Volumetric Representations of Dynamic Humans in Minutes + http://openaccess.thecvf.com//content/CVPR2023/papers/Geng_Learning_Neural_Volumetric_Representations_of_Dynamic_Humans_in_Minutes_CVPR_2023_paper.pdf + This paper addresses the challenge of efficiently reconstructing volumetric videos of dynamic humans from sparse multi-view videos. Some recent works represent a dynamic human as a canonical neural radiance field (NeRF) and a motion field, which are learned from input videos through differentiable rendering. But the per-scene optimization generally requires hours. Other generalizable NeRF models leverage learned prior from datasets to reduce the optimization time by only finetuning on new scenes at the cost of visual fidelity. In this paper, we propose a novel method for learning neural volumetric representations of dynamic humans in minutes with competitive visual quality. Specifically, we define a novel part-based voxelized human representation to better distribute the representational power of the network to different human parts. Furthermore, we propose a novel 2D motion parameterization scheme to increase the convergence rate of deformation field learning. Experiments demonstrate that our model can be learned 100 times faster than previous per-scene optimization methods while being competitive in the rendering quality. Training our model on a 512x512 video with 100 frames typically takes about 5 minutes on a single RTX 3090 GPU. The code is available on our project page: https://zju3dv.github.io/instant_nvr + + + + Streaming Video Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Streaming_Video_Model_CVPR_2023_paper.pdf + Video understanding tasks have traditionally been modeled by two separate architectures, specially tailored for two distinct tasks. 
Sequence-based video tasks, such as action recognition, use a video backbone to directly extract spatiotemporal features, while frame-based video tasks, such as multiple object tracking (MOT), rely on a single fixed-image backbone to extract spatial features. In contrast, we propose to unify video understanding tasks into one novel streaming video architecture, referred to as Streaming Vision Transformer (S-ViT). S-ViT first produces frame-level features with a memory-enabled temporally-aware spatial encoder to serve the frame-based video tasks. Then the frame features are input into a task-related temporal decoder to obtain spatiotemporal features for sequence-based tasks. The efficiency and efficacy of S-ViT are demonstrated by the state-of-the-art accuracy in the sequence-based action recognition task and the competitive advantage over conventional architectures in the frame-based MOT task. We believe that the concept of the streaming video model and the implementation of S-ViT are solid steps towards a unified deep learning architecture for video understanding. Code will be available at https://github.com/yuzhms/Streaming-Video-Model. + + + + CapDet: Unifying Dense Captioning and Open-World Detection Pretraining http://openaccess.thecvf.com//content/CVPR2023/papers/Long_CapDet_Unifying_Dense_Captioning_and_Open-World_Detection_Pretraining_CVPR_2023_paper.pdf Benefiting from large-scale vision-language pre-training on image-text pairs, open-world detection methods have shown superior generalization ability under the zero-shot or few-shot detection settings. However, a pre-defined category space is still required during the inference stage of existing methods and only the objects belonging to that space will be predicted. To introduce a "real" open-world detector, in this paper, we propose a novel method named CapDet to either predict under a given category list or directly generate the category of predicted bounding boxes. Specifically, we unify the open-world detection and dense caption tasks into a single yet effective framework by introducing an additional dense captioning head to generate the region-grounded captions. Besides, adding the captioning task will in turn benefit the generalization of detection performance since the captioning dataset covers more concepts. Experimental results show that by unifying the dense caption task, our CapDet has obtained significant performance improvements (e.g., +2.1% mAP on LVIS rare classes) over the baseline method on LVIS (1203 classes). Besides, our CapDet also achieves state-of-the-art performance on dense captioning tasks, e.g., 15.44% mAP on VG V1.2 and 13.98% on the VG-COCO dataset. + + + + Bayesian Posterior Approximation With Stochastic Ensembles http://openaccess.thecvf.com//content/CVPR2023/papers/Balabanov_Bayesian_Posterior_Approximation_With_Stochastic_Ensembles_CVPR_2023_paper.pdf We introduce ensembles of stochastic neural networks to approximate the Bayesian posterior, combining stochastic methods such as dropout with deep ensembles. The stochastic ensembles are formulated as families of distributions and trained to approximate the Bayesian posterior with variational inference. We implement stochastic ensembles based on Monte Carlo dropout, DropConnect and a novel non-parametric version of dropout and evaluate them on a toy problem and CIFAR image classification. For both tasks, we test the quality of the posteriors directly against Hamiltonian Monte Carlo simulations.
Our results show that stochastic ensembles provide more accurate posterior estimates than other popular baselines for Bayesian inference. + + + + Symmetric Shape-Preserving Autoencoder for Unsupervised Real Scene Point Cloud Completion + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_Symmetric_Shape-Preserving_Autoencoder_for_Unsupervised_Real_Scene_Point_Cloud_Completion_CVPR_2023_paper.pdf + Unsupervised completion of real scene objects is of vital importance but still remains extremely challenging in preserving input shapes, predicting accurate results, and adapting to multi-category data. To solve these problems, we propose in this paper an Unsupervised Symmetric Shape-Preserving Autoencoding Network, termed USSPA, to predict complete point clouds of objects from real scenes. One of our main observations is that many natural and man-made objects exhibit significant symmetries. To accommodate this, we devise a symmetry learning module to learn from those objects and to preserve structural symmetries. Starting from an initial coarse predictor, our autoencoder refines the complete shape with a carefully designed upsampling refinement module. Besides the discriminative process on the latent space, the discriminators of our USSPA also take predicted point clouds as direct guidance, enabling more detailed shape prediction. Clearly different from previous methods which train each category separately, our USSPA can be adapted to the training of multi-category data in one pass through a classifier-guided discriminator, with consistent performance on single category. For more accurate evaluation, we contribute to the community a real scene dataset with paired CAD models as ground truth. Extensive experiments and comparisons demonstrate our superiority and generalization and show that our method achieves state-of-the-art performance on unsupervised completion of real scene objects. + + + + Comprehensive and Delicate: An Efficient Transformer for Image Restoration + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Comprehensive_and_Delicate_An_Efficient_Transformer_for_Image_Restoration_CVPR_2023_paper.pdf + Vision Transformers have shown promising performance in image restoration, which usually conduct window- or channel-based attention to avoid intensive computations. Although the promising performance has been achieved, they go against the biggest success factor of Transformers to a certain extent by capturing the local instead of global dependency among pixels. In this paper, we propose a novel efficient image restoration Transformer that first captures the superpixel-wise global dependency, and then transfers it into each pixel. Such a coarse-to-fine paradigm is implemented through two neural blocks, i.e., condensed attention neural block (CA) and dual adaptive neural block (DA). In brief, CA employs feature aggregation, attention computation, and feature recovery to efficiently capture the global dependency at the superpixel level. To embrace the pixel-wise global dependency, DA takes a novel dual-way structure to adaptively encapsulate the globality from superpixels into pixels. Thanks to the two neural blocks, our method achieves comparable performance while taking only 6% FLOPs compared with SwinIR. 
+ + + + Zero-Shot Model Diagnosis http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_Zero-Shot_Model_Diagnosis_CVPR_2023_paper.pdf When it comes to deploying deep vision models, the behavior of these systems must be explicable to ensure confidence in their reliability and fairness. A common approach to evaluate deep learning models is to build a labeled test set with attributes of interest and assess how well it performs. However, creating a balanced test set (i.e., one that is uniformly sampled over all the important traits) is often time-consuming, expensive, and prone to mistakes. The question we try to address is: can we evaluate the sensitivity of deep learning models to arbitrary visual attributes without an annotated test set? This paper argues the case that Zero-shot Model Diagnosis (ZOOM) is possible without the need for a test set or labeling. To avoid the need for test sets, our system relies on a generative model and CLIP. The key idea is enabling the user to select a set of prompts (relevant to the problem) and our system will automatically search for semantic counterfactual images (i.e., synthesized images that flip the prediction in the case of a binary classifier) using the generative model. We evaluate several visual tasks (classification, key-point detection, and segmentation) in multiple visual domains to demonstrate the viability of our methodology. Extensive experiments demonstrate that our method is capable of producing counterfactual images and offering sensitivity analysis for model diagnosis without the need for a test set. + + + + ShadowDiffusion: When Degradation Prior Meets Diffusion Model for Shadow Removal http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_ShadowDiffusion_When_Degradation_Prior_Meets_Diffusion_Model_for_Shadow_Removal_CVPR_2023_paper.pdf Recent deep learning methods have achieved promising results in image shadow removal. However, their restored images still suffer from unsatisfactory boundary artifacts, due to the lack of a degradation prior and the deficiency in modeling capacity. Our work addresses these issues by proposing a unified diffusion framework that integrates both the image and degradation priors for highly effective shadow removal. In detail, we first propose a shadow degradation model, which inspires us to build a novel unrolling diffusion model, dubbed ShadowDiffusion. It remarkably improves the model's capacity in shadow removal via progressively refining the desired output with both degradation prior and diffusive generative prior, which by nature can serve as a new strong baseline for image restoration. Furthermore, ShadowDiffusion progressively refines the estimated shadow mask as an auxiliary task of the diffusion generator, which leads to more accurate and robust shadow-free image generation. We conduct extensive experiments on three popular public datasets, including ISTD, ISTD+, and SRD, to validate our method's effectiveness. Compared to the state-of-the-art methods, our model achieves a significant improvement in terms of PSNR, increasing from 31.69dB to 34.73dB on the SRD dataset. + + + + + + NLOST: Non-Line-of-Sight Imaging With Transformer http://openaccess.thecvf.com//content/CVPR2023/papers/Li_NLOST_Non-Line-of-Sight_Imaging_With_Transformer_CVPR_2023_paper.pdf Time-resolved non-line-of-sight (NLOS) imaging is based on the multi-bounce indirect reflections from the hidden objects for 3D sensing. Reconstruction from NLOS measurements remains challenging, especially for complicated scenes.
To boost the performance, we present NLOST, the first transformer-based neural network for NLOS reconstruction. Specifically, after extracting the shallow features with the assistance of physics-based priors, we design two spatial-temporal self attention encoders to explore both local and global correlations within 3D NLOS data by splitting or downsampling the features into different scales, respectively. Then, we design a spatial-temporal cross attention decoder to integrate local and global features in the token space of transformer, resulting in deep features with high representation capabilities. Finally, deep and shallow features are fused to reconstruct the 3D volume of hidden scenes. Extensive experimental results demonstrate the superior performance of the proposed method over existing solutions on both synthetic data and real-world data captured by different NLOS imaging systems. + + + + Text-Visual Prompting for Efficient 2D Temporal Video Grounding + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Text-Visual_Prompting_for_Efficient_2D_Temporal_Video_Grounding_CVPR_2023_paper.pdf + In this paper, we study the problem of temporal video grounding (TVG), which aims to predict the starting/ending time points of moments described by a text sentence within a long untrimmed video. Benefiting from fine-grained 3D visual features, the TVG techniques have achieved remarkable progress in recent years. However, the high complexity of 3D convolutional neural networks (CNNs) makes extracting dense 3D visual features time-consuming, which calls for intensive memory and computing resources. Towards efficient TVG, we propose a novel text-visual prompting (TVP) framework, which incorporates optimized perturbation patterns (that we call 'prompts') into both visual inputs and textual features of a TVG model. In sharp contrast to 3D CNNs, we show that TVP allows us to effectively co-train vision encoder and language encoder in a 2D TVG model and improves the performance of crossmodal feature fusion using only low-complexity sparse 2D visual features. Further, we propose a Temporal-Distance IoU (TDIoU) loss for efficient learning of TVG. Experiments on two benchmark datasets, Charades-STA and ActivityNet Captions datasets, empirically show that the proposed TVP significantly boosts the performance of 2D TVG (e.g., 9.79% improvement on Charades-STA and 30.77% improvement on ActivityNet Captions) and achieves 5x inference acceleration over TVG using 3D visual features. Codes are available at Open.Intel. + + + + NEF: Neural Edge Fields for 3D Parametric Curve Reconstruction From Multi-View Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_NEF_Neural_Edge_Fields_for_3D_Parametric_Curve_Reconstruction_From_CVPR_2023_paper.pdf + We study the problem of reconstructing 3D feature curves of an object from a set of calibrated multi-view images. To do so, we learn a neural implicit field representing the density distribution of 3D edges which we refer to as Neural Edge Field (NEF). Inspired by NeRF, NEF is optimized with a view-based rendering loss where a 2D edge map is rendered at a given view and is compared to the ground-truth edge map extracted from the image of that view. The rendering-based differentiable optimization of NEF fully exploits 2D edge detection, without needing a supervision of 3D edges, a 3D geometric operator or cross-view edge correspondence. 
Several technical designs are devised to ensure learning a range-limited and view-independent NEF for robust edge extraction. The final parametric 3D curves are extracted from NEF with an iterative optimization method. On our benchmark with synthetic data, we demonstrate that NEF outperforms existing state-of-the-art methods on all metrics. Project page: https://yunfan1202.github.io/NEF/. + + + + Geometric Visual Similarity Learning in 3D Medical Image Self-Supervised Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Geometric_Visual_Similarity_Learning_in_3D_Medical_Image_Self-Supervised_Pre-Training_CVPR_2023_paper.pdf + Learning inter-image similarity is crucial for 3D medical images self-supervised pre-training, due to their sharing of numerous same semantic regions. However, the lack of the semantic prior in metrics and the semantic-independent variation in 3D medical images make it challenging to get a reliable measurement for the inter-image similarity, hindering the learning of consistent representation for same semantics. We investigate the challenging problem of this task, i.e., learning a consistent representation between images for a clustering effect of same semantic features. We propose a novel visual similarity learning paradigm, Geometric Visual Similarity Learning, which embeds the prior of topological invariance into the measurement of the inter-image similarity for consistent representation of semantic regions. To drive this paradigm, we further construct a novel geometric matching head, the Z-matching head, to collaboratively learn the global and local similarity of semantic regions, guiding the efficient representation learning for different scale-level inter-image semantic features. Our experiments demonstrate that the pre-training with our learning of inter-image similarity yields more powerful inner-scene, inter-scene, and global-local transferring ability on four challenging 3D medical image tasks. Our codes and pre-trained models will be publicly available in https://github.com/YutingHe-list/GVSL. + + + + Less Is More: Reducing Task and Model Complexity for 3D Point Cloud Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Less_Is_More_Reducing_Task_and_Model_Complexity_for_3D_CVPR_2023_paper.pdf + Whilst the availability of 3D LiDAR point cloud data has significantly grown in recent years, annotation remains expensive and time-consuming, leading to a demand for semi-supervised semantic segmentation methods with application domains such as autonomous driving. Existing work very often employs relatively large segmentation backbone networks to improve segmentation accuracy, at the expense of computational costs. In addition, many use uniform sampling to reduce ground truth data requirements for learning needed, often resulting in sub-optimal performance. To address these issues, we propose a new pipeline that employs a smaller architecture, requiring fewer ground-truth annotations to achieve superior segmentation accuracy compared to contemporary approaches. This is facilitated via a novel Sparse Depthwise Separable Convolution module that significantly reduces the network parameter count while retaining overall task performance. To effectively sub-sample our training data, we propose a new Spatio-Temporal Redundant Frame Downsampling (ST-RFD) method that leverages knowledge of sensor motion within the environment to extract a more diverse subset of training data frame samples. 
To leverage the use of limited annotated data samples, we further propose a soft pseudo-label method informed by LiDAR reflectivity. Our method outperforms contemporary semi-supervised work in terms of mIoU, using less labeled data, on the SemanticKITTI (59.5@5%) and ScribbleKITTI (58.1@5%) benchmark datasets, based on a 2.3x reduction in model parameters and 641x fewer multiply-add operations whilst also demonstrating significant performance improvement on limited training data (i.e., Less is More). + + + + AdaMAE: Adaptive Masking for Efficient Spatiotemporal Learning With Masked Autoencoders + http://openaccess.thecvf.com//content/CVPR2023/papers/Bandara_AdaMAE_Adaptive_Masking_for_Efficient_Spatiotemporal_Learning_With_Masked_Autoencoders_CVPR_2023_paper.pdf + Masked Autoencoders (MAEs) learn generalizable representations for image, text, audio, video, etc., by reconstructing masked input data from tokens of the visible data. Current MAE approaches for videos rely on random patch, tube, or frame based masking strategies to select these tokens. This paper proposes AdaMAE, an adaptive masking strategy for MAEs that is end-to-end trainable. Our adaptive masking strategy samples visible tokens based on the semantic context using an auxiliary sampling network. This network estimates a categorical distribution over spacetime-patch tokens. The tokens that increase the expected reconstruction error are rewarded and selected as visible tokens, motivated by the policy gradient algorithm in reinforcement learning. We show that AdaMAE samples more tokens from the high spatiotemporal information regions, thereby allowing us to mask 95% of tokens, resulting in lower memory requirements and faster pre-training. We conduct ablation studies on the Something-Something v2 (SSv2) dataset to demonstrate the efficacy of our adaptive sampling approach and report state-of-the-art results of 70.0% and 81.7% in top-1 accuracy on SSv2 and Kinetics-400 action classification datasets with a ViT-Base backbone and 800 pre-training epochs. Code and pre-trained models are available at: https://github.com/wgcban/adamae.git + + + + Directional Connectivity-Based Segmentation of Medical Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Directional_Connectivity-Based_Segmentation_of_Medical_Images_CVPR_2023_paper.pdf + Anatomical consistency in biomarker segmentation is crucial for many medical image analysis tasks. A promising paradigm for achieving anatomically consistent segmentation via deep networks is incorporating pixel connectivity, a basic concept in digital topology, to model inter-pixel relationships. However, previous works on connectivity modeling have ignored the rich channel-wise directional information in the latent space. In this work, we demonstrate that effective disentanglement of directional sub-space from the shared latent space can significantly enhance the feature representation in the connectivity-based network. To this end, we propose a directional connectivity modeling scheme for segmentation that decouples, tracks, and utilizes the directional information across the network. Experiments on various public medical image segmentation benchmarks show the effectiveness of our model as compared to the state-of-the-art methods. Code is available at https://github.com/Zyun-Y/DconnNet. 
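Returning to the AdaMAE abstract above, its core mechanism of sampling visible tokens from a learned categorical distribution can be sketched as follows (PyTorch-style, with hypothetical shapes and names; a rough sketch, not the released implementation):

    import torch

    def sample_visible_tokens(token_scores, num_visible):
        # token_scores: (batch, num_tokens) unnormalized scores from an auxiliary
        # sampling network; tokens not drawn here are masked out for the MAE.
        probs = torch.softmax(token_scores, dim=-1)
        visible_idx = torch.multinomial(probs, num_visible, replacement=False)
        return visible_idx  # (batch, num_visible) indices of tokens kept visible

Per the abstract's description, the sampling network is then rewarded, in a policy-gradient style, for selecting tokens whose masked counterparts are hard to reconstruct.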
+ + + + Towards Flexible Multi-Modal Document Models http://openaccess.thecvf.com//content/CVPR2023/papers/Inoue_Towards_Flexible_Multi-Modal_Document_Models_CVPR_2023_paper.pdf Creative workflows for generating graphical documents involve complex inter-related tasks, such as aligning elements, choosing appropriate fonts, or employing aesthetically harmonious colors. In this work, we attempt to build a holistic model that can jointly solve many different design tasks. Our model, which we denote by FlexDM, treats vector graphic documents as a set of multi-modal elements, and learns to predict masked fields such as element type, position, styling attributes, image, or text, using a unified architecture. Through the use of explicit multi-task learning and in-domain pre-training, our model can better capture the multi-modal relationships among the different document fields. Experimental results corroborate that our single FlexDM is able to successfully solve a multitude of different design tasks, while achieving performance that is competitive with task-specific and costly baselines. + + + + LiDAR-in-the-Loop Hyperparameter Optimization http://openaccess.thecvf.com//content/CVPR2023/papers/Goudreault_LiDAR-in-the-Loop_Hyperparameter_Optimization_CVPR_2023_paper.pdf LiDAR has become a cornerstone sensing modality for 3D vision. LiDAR systems emit pulses of light into the scene, take measurements of the returned signal, and rely on hardware digital signal processing (DSP) pipelines to construct 3D point clouds from these measurements. The resulting point clouds output by these DSPs are input to downstream 3D vision models -- both in the form of training datasets and as input at inference time. Existing LiDAR DSPs are composed of cascades of parameterized operations; modifying configuration parameters results in significant changes in the point clouds and consequently the output of downstream methods. Existing methods treat LiDAR systems as fixed black boxes and construct downstream task networks that are more robust to measurement fluctuations. Departing from this approach, the proposed method directly optimizes LiDAR sensing and DSP parameters for downstream tasks. To investigate the optimization of LiDAR system parameters, we devise a realistic LiDAR simulation method that generates raw waveforms as input to a LiDAR DSP pipeline. We optimize LiDAR parameters for both 3D object detection IoU losses and depth error metrics by solving a nonlinear multi-objective optimization problem with a 0th-order stochastic algorithm. For automotive 3D object detection models, the proposed method outperforms manual expert tuning by 39.5% mean Average Precision (mAP). + + + + Local 3D Editing via 3D Distillation of CLIP Knowledge http://openaccess.thecvf.com//content/CVPR2023/papers/Hyung_Local_3D_Editing_via_3D_Distillation_of_CLIP_Knowledge_CVPR_2023_paper.pdf 3D content manipulation is an important computer vision task with many real-world applications (e.g., product design, cartoon generation, and 3D Avatar editing). Recently proposed 3D GANs can generate diverse photo-realistic 3D-aware content using Neural Radiance Fields (NeRF). However, manipulation of NeRF still remains a challenging problem since the visual quality tends to degrade after manipulation and suboptimal control handles such as semantic maps are used for manipulations. While text-guided manipulations have shown potential in 3D editing, such approaches often lack locality.
To overcome the problems, we propose Local Editing NeRF (LENeRF), which only requires text inputs for fine-grained and localized manipulation. Specifically, we present three add-on modules of LENeRF, the Latent Residual Mapper, the Attention Field Network, and the Deformation Network, which are jointly used for local manipulations of 3D features by estimating a 3D attention field. The 3D attention field is learned in an unsupervised way, by distilling the CLIP's zero-shot mask generation capability to 3D with multi-view guidance. We conduct diverse experiments and thorough evaluations both quantitatively and qualitatively. + + + + Human Body Shape Completion With Implicit Shape and Flow Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Human_Body_Shape_Completion_With_Implicit_Shape_and_Flow_Learning_CVPR_2023_paper.pdf + In this paper, we investigate how to complete human body shape models by combining shape and flow estimation given two consecutive depth images. Shape completion is a challenging task in computer vision that is highly under-constrained when considering partial depth observations. Besides model based strategies that exploit strong priors, and consequently struggle to preserve fine geometric details, learning based approaches build on weaker assumptions and can benefit from efficient implicit representations. We adopt such a representation and explore how the motion flow between two consecutive frames can contribute to the shape completion task. In order to effectively exploit the flow information, our architecture combines both estimations and implements two features for robustness: First, an all-to-all attention module that encodes the correlation between points in the same frame and between corresponding points in different frames; Second, a coarse-dense to fine-sparse strategy that balances the representation ability and the computational cost. Our experiments demonstrate that the flow actually benefits human body model completion. They also show that our method outperforms the state-of-the-art approaches for shape completion on 2 benchmarks, considering different human shapes, poses, and clothing. + + + + Modular Memorability: Tiered Representations for Video Memorability Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Dumont_Modular_Memorability_Tiered_Representations_for_Video_Memorability_Prediction_CVPR_2023_paper.pdf + The question of how to best estimate the memorability of visual content is currently a source of debate in the memorability community. In this paper, we propose to explore how different key properties of images and videos affect their consolidation into memory. We analyze the impact of several features and develop a model that emulates the most important parts of a proposed "pathway to memory": a simple but effective way of representing the different hurdles that new visual content needs to surpass to stay in memory. This framework leads to the construction of our M3-S model, a novel memorability network that processes input videos in a modular fashion. Each module of the network emulates one of the four key steps of the pathway to memory: raw encoding, scene understanding, event understanding and memory consolidation. We find that the different representations learned by our modules are non-trivial and substantially different from each other. 
Additionally, we observe that certain representations tend to perform better at the task of memorability prediction than others, and we introduce an in-depth ablation study to support our results. Our proposed approach surpasses the state of the art on the two largest video memorability datasets and opens the door to new applications in the field. + + + + Weakly-Supervised Domain Adaptive Semantic Segmentation With Prototypical Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Das_Weakly-Supervised_Domain_Adaptive_Semantic_Segmentation_With_Prototypical_Contrastive_Learning_CVPR_2023_paper.pdf + There has been a lot of effort in improving the performance of unsupervised domain adaptation for the semantic segmentation task; however, there is still a huge gap in performance when compared with supervised learning. In this work, we propose a common framework to use different weak labels, e.g. image, point and coarse labels from the target domain to reduce this performance gap. Specifically, we propose to learn better prototypes that are representative class features, by exploiting these weak labels. We use these improved prototypes for contrastive alignment of class features. In particular, we perform two different feature alignments: first, we align pixel features with prototypes within each domain; second, we align pixel features from the source to prototypes of the target domain in an asymmetric way. This asymmetric alignment is beneficial as it preserves the target features during training, which is essential when weak labels are available from the target domain. Our experiments on standard benchmarks show that our framework achieves significant improvement compared to existing works and is able to reduce the performance gap with supervised learning. + + + + Language-Guided Music Recommendation for Video via Prompt Analogies + http://openaccess.thecvf.com//content/CVPR2023/papers/McKee_Language-Guided_Music_Recommendation_for_Video_via_Prompt_Analogies_CVPR_2023_paper.pdf + We propose a method to recommend music for an input video while allowing a user to guide music selection with free-form natural language. A key challenge of this problem setting is that existing music video datasets provide the needed (video, music) training pairs, but lack text descriptions of the music. This work addresses this challenge with the following three contributions. First, we propose a text-synthesis approach that relies on an analogy-based prompting procedure to generate natural language music descriptions from a large-scale language model (BLOOM-176B) given pre-trained music tagger outputs and a small number of human text descriptions. Second, we use these synthesized music descriptions to train a new trimodal model, which fuses text and video input representations to query music samples. For training, we introduce a text dropout regularization mechanism which we show is critical to model performance. Our model design allows for the retrieved music audio to agree with the two input modalities by matching visual style depicted in the video and musical genre, mood, or instrumentation described in the natural language query. Third, to evaluate our approach, we collect a testing dataset for our problem by annotating a subset of 4k clips from the YT8M-MusicVideo dataset with natural language music descriptions which we make publicly available.
We show that our approach can match or exceed the performance of prior methods on video-to-music retrieval while significantly improving retrieval accuracy when using text guidance. + + + + Re2TAL: Rewiring Pretrained Video Backbones for Reversible Temporal Action Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Re2TAL_Rewiring_Pretrained_Video_Backbones_for_Reversible_Temporal_Action_Localization_CVPR_2023_paper.pdf + Temporal action localization (TAL) requires long-form reasoning to predict actions of various durations and complex content. Given limited GPU memory, training TAL end to end (i.e., from videos to predictions) on long videos is a significant challenge. Most methods can only train on pre-extracted features without optimizing them for the localization problem, consequently limiting localization performance. In this work, to extend the potential in TAL networks, we propose a novel end-to-end method Re2TAL, which rewires pretrained video backbones for reversible TAL. Re2TAL builds a backbone with reversible modules, where the input can be recovered from the output such that the bulky intermediate activations can be cleared from memory during training. Instead of designing one single type of reversible module, we propose a network rewiring mechanism, to transform any module with a residual connection to a reversible module without changing any parameters. This provides two benefits: (1) a large variety of reversible networks are easily obtained from existing and even future model designs, and (2) the reversible models require much less training effort as they reuse the pre-trained parameters of their original non-reversible versions. Re2TAL, only using the RGB modality, reaches 37.01% average mAP on ActivityNet-v1.3, a new state-of-the-art record, and mAP 64.9% at tIoU=0.5 on THUMOS-14, outperforming all other RGB-only methods. Code is available at https://github.com/coolbay/Re2TAL. + + + + NeRFLight: Fast and Light Neural Radiance Fields Using a Shared Feature Grid + http://openaccess.thecvf.com//content/CVPR2023/papers/Rivas-Manzaneque_NeRFLight_Fast_and_Light_Neural_Radiance_Fields_Using_a_Shared_CVPR_2023_paper.pdf + While original Neural Radiance Fields (NeRF) have shown impressive results in modeling the appearance of a scene with compact MLP architectures, they are not able to achieve real-time rendering. This has been recently addressed by either baking the outputs of NeRF into a data structure or arranging trainable parameters in an explicit feature grid. These strategies, however, significantly increase the memory footprint of the model which prevents their deployment on bandwidth-constrained applications. In this paper, we extend the grid-based approach to achieve real-time view synthesis at more than 150 FPS using a lightweight model. Our main contribution is a novel architecture in which the density field of NeRF-based representations is split into N regions and the density is modeled using N different decoders which reuse the same feature grid. This results in a smaller grid where each feature is located in more than one spatial position, forcing them to learn a compact representation that is valid for different parts of the scene. We further reduce the size of the final model by disposing of the features symmetrically on each region, which favors feature pruning after training while also allowing smooth gradient transitions between neighboring voxels. 
An exhaustive evaluation demonstrates that our method achieves real-time performance and quality metrics on a par with the state of the art, with an improvement of more than 2x in the FPS/MB ratio. + + + + MVImgNet: A Large-Scale Dataset of Multi-View Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_MVImgNet_A_Large-Scale_Dataset_of_Multi-View_Images_CVPR_2023_paper.pdf + Being data-driven is one of the most iconic properties of deep learning algorithms. The birth of ImageNet drove a remarkable trend of "learning from large-scale data" in computer vision. Pretraining on ImageNet to obtain rich universal representations has been shown to benefit various 2D visual tasks and has become a standard in 2D vision. However, due to the laborious collection of real-world 3D data, there is yet no generic dataset serving as a counterpart of ImageNet in 3D vision, and how such a dataset could impact the 3D community remains unexplored. To remedy this defect, we introduce MVImgNet, a large-scale dataset of multi-view images, which is highly convenient to collect by shooting videos of real-world objects in human daily life. It contains 6.5 million frames from 219,188 videos covering objects from 238 classes, with rich annotations of object masks, camera parameters, and point clouds. The multi-view attribute endows our dataset with 3D-aware signals, making it a soft bridge between 2D and 3D vision. We conduct pilot studies for probing the potential of MVImgNet on a variety of 3D and 2D visual tasks, including radiance field reconstruction, multi-view stereo, and view-consistent image understanding, where MVImgNet demonstrates promising performance, leaving many possibilities open for future exploration. Besides, via dense reconstruction on MVImgNet, a 3D object point cloud dataset is derived, called MVPNet, covering 87,200 samples from 150 categories, with the class label on each point cloud. Experiments show that MVPNet can benefit real-world 3D object classification while posing new challenges to point cloud understanding. MVImgNet and MVPNet will be publicly available, hoping to inspire the broader vision community. + + + + A New Benchmark: On the Utility of Synthetic Data With Blender for Bare Supervised Learning and Downstream Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_A_New_Benchmark_On_the_Utility_of_Synthetic_Data_With_CVPR_2023_paper.pdf + Deep learning in computer vision has achieved great success at the price of large-scale labeled training data. However, exhaustive data annotation is impracticable for each task of all domains of interest, due to high labor costs and unguaranteed labeling accuracy. Besides, the uncontrollable data collection process produces non-IID training and test data, where undesired duplication may exist. All these nuisances may hinder the verification of typical theories and exposure to new findings. To circumvent them, an alternative is to generate synthetic data via 3D rendering with domain randomization. In this work, we push forward along this line by doing profound and extensive research on bare supervised learning and downstream domain adaptation. Specifically, under the well-controlled, IID data setting enabled by 3D rendering, we systematically verify the typical, important learning insights, e.g., shortcut learning, and discover the new laws of various data regimes and network architectures in generalization.
We further investigate the effect of image formation factors on generalization, e.g., object scale, material texture, illumination, camera viewpoint, and background in a 3D scene. Moreover, we use simulation-to-reality adaptation as a downstream task for comparing the transferability between synthetic and real data when used for pre-training, which demonstrates that synthetic data pre-training is also promising for improving real test results. Lastly, to promote future research, we develop a new large-scale synthetic-to-real benchmark for image classification, termed S2RDA, which provides more significant challenges for transfer from simulation to reality. The code and datasets are available at https://github.com/huitangtang/On_the_Utility_of_Synthetic_Data. + + + + Autoregressive Visual Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Autoregressive_Visual_Tracking_CVPR_2023_paper.pdf + We present ARTrack, an autoregressive framework for visual object tracking. ARTrack tackles tracking as a coordinate sequence interpretation task that estimates object trajectories progressively, where the current estimate is induced by previous states and in turn affects subsequences. This time-autoregressive approach models the sequential evolution of trajectories to keep tracing the object across frames, making it superior to existing template matching based trackers that only consider the per-frame localization accuracy. ARTrack is simple and direct, eliminating customized localization heads and post-processing. Despite its simplicity, ARTrack achieves state-of-the-art performance on prevailing benchmark datasets. + + + + Unsupervised Domain Adaption With Pixel-Level Discriminator for Image-Aware Layout Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Unsupervised_Domain_Adaption_With_Pixel-Level_Discriminator_for_Image-Aware_Layout_Generation_CVPR_2023_paper.pdf + Layout is essential for graphic design and poster generation. Recently, applying deep learning models to generate layouts has attracted increasing attention. This paper focuses on using the GAN-based model conditioned on image contents to generate advertising poster graphic layouts, which requires an advertising poster layout dataset with paired product images and graphic layouts. However, the paired images and layouts in the existing dataset are collected by inpainting and annotating posters, respectively. There exists a domain gap between inpainted posters (source domain data) and clean product images (target domain data). Therefore, this paper combines unsupervised domain adaption techniques to design a GAN with a novel pixel-level discriminator (PD), called PDA-GAN, to generate graphic layouts according to image contents. The PD is connected to the shallow-level feature map and computes the GAN loss for each input-image pixel. Both quantitative and qualitative evaluations demonstrate that PDA-GAN can achieve state-of-the-art performance and generate high-quality image-aware graphic layouts for advertising posters. + + + + Real-Time 6K Image Rescaling With Rate-Distortion Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Qi_Real-Time_6K_Image_Rescaling_With_Rate-Distortion_Optimization_CVPR_2023_paper.pdf + The task of image rescaling aims at embedding a high-resolution (HR) image into a low-resolution (LR) one that can contain embedded information for HR image reconstruction.
Existing image rescaling methods do not optimize the LR image file size, and recent flow-based rescaling methods are not yet real-time for HR image reconstruction (e.g., 6K). To address these two challenges, we propose a novel framework (HyperThumbnail) for real-time 6K rate-distortion-aware image rescaling. Our HyperThumbnail first embeds an HR image into a JPEG LR image (thumbnail) by an encoder with our proposed learnable JPEG quantization module, which optimizes the file size of the embedding LR JPEG image. Then, an efficient decoder reconstructs a high-fidelity HR (6K) image from the LR one in real time. Extensive experiments demonstrate that our framework outperforms previous image rescaling baselines in rate-distortion performance and is much faster than prior work in HR image reconstruction speed. + + + + Gated Stereo: Joint Depth Estimation From Gated and Wide-Baseline Active Stereo Cues + http://openaccess.thecvf.com//content/CVPR2023/papers/Walz_Gated_Stereo_Joint_Depth_Estimation_From_Gated_and_Wide-Baseline_Active_CVPR_2023_paper.pdf + We propose Gated Stereo, a high-resolution and long-range depth estimation technique that operates on active gated stereo images. Using active and high dynamic range passive captures, Gated Stereo exploits multi-view cues alongside time-of-flight intensity cues from active gating. To this end, we propose a depth estimation method with monocular and stereo depth prediction branches, which are combined in a final fusion stage. Each block is supervised through a combination of supervised and gated self-supervision losses. To facilitate training and validation, we acquire a long-range synchronized gated stereo dataset for automotive scenarios. We find that the method achieves an improvement of more than 50% MAE compared to the next best RGB stereo method, and 74% MAE compared to existing monocular gated methods for distances up to 160 m. Our code, models and datasets are available here: https://light.princeton.edu/gatedstereo/. + + + + MammalNet: A Large-Scale Video Benchmark for Mammal Recognition and Behavior Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_MammalNet_A_Large-Scale_Video_Benchmark_for_Mammal_Recognition_and_Behavior_CVPR_2023_paper.pdf + Monitoring animal behavior can facilitate conservation efforts by providing key insights into wildlife health, population status, and ecosystem function. Automatic recognition of animals and their behaviors is critical for capitalizing on the large unlabeled datasets generated by modern video devices and for accelerating monitoring efforts at scale. However, the development of automated recognition systems is currently hindered by a lack of appropriately labeled datasets. Existing video datasets 1) do not classify animals according to established biological taxonomies; 2) are too small to facilitate large-scale behavioral studies and are often limited to a single species; and 3) do not feature temporally localized annotations and therefore do not facilitate localization of targeted behaviors within longer video sequences. Thus, we propose MammalNet, a new large-scale animal behavior dataset with taxonomy-guided annotations of mammals and their common behaviors. MammalNet contains over 18K videos totaling 539 hours, which is 10 times larger than the largest existing animal behavior dataset. It covers 17 orders, 69 families, and 173 mammal categories for animal categorization and captures 12 high-level animal behaviors that received focus in previous animal behavior studies.
We establish three benchmarks on MammalNet: standard animal and behavior recognition, compositional low-shot animal and behavior recognition, and behavior detection. Our dataset and code have been made available at: https://mammal-net.github.io. + + + + Hand Avatar: Free-Pose Hand Animation and Rendering From Monocular Video + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Hand_Avatar_Free-Pose_Hand_Animation_and_Rendering_From_Monocular_Video_CVPR_2023_paper.pdf + We present HandAvatar, a novel representation for hand animation and rendering, which can generate smoothly compositional geometry and self-occlusion-aware texture. Specifically, we first develop a MANO-HD model as a high-resolution mesh topology to fit personalized hand shapes. Subsequently, we decompose hand geometry into per-bone rigid parts, and then re-compose paired geometry encodings to derive an across-part consistent occupancy field. As for texture modeling, we propose a self-occlusion-aware shading field (SelF). In SelF, drivable anchors are paved on the MANO-HD surface to record albedo information under a wide variety of hand poses. Moreover, directed soft occupancy is designed to describe the ray-to-surface relation, which is leveraged to generate an illumination field for the disentanglement of pose-independent albedo and pose-dependent illumination. Trained from monocular video data, our HandAvatar can perform free-pose hand animation and rendering while at the same time achieving superior appearance fidelity. We also demonstrate that HandAvatar provides a route for hand appearance editing. + + + + VindLU: A Recipe for Effective Video-and-Language Pretraining + http://openaccess.thecvf.com//content/CVPR2023/papers/Cheng_VindLU_A_Recipe_for_Effective_Video-and-Language_Pretraining_CVPR_2023_paper.pdf + The last several years have witnessed remarkable progress in video-and-language (VidL) understanding. However, most modern VidL approaches use complex and specialized model architectures and sophisticated pretraining protocols, making the reproducibility, analysis and comparisons of these frameworks difficult. Hence, instead of proposing yet another new VidL model, this paper conducts a thorough empirical study demystifying the most important factors in the VidL model design. Among the factors that we investigate are (i) the spatiotemporal architecture design, (ii) the multimodal fusion schemes, (iii) the pretraining objectives, (iv) the choice of pretraining data, (v) pretraining and finetuning protocols, and (vi) dataset and model scaling. Our empirical study reveals that the most important design factors include: temporal modeling, video-to-text multimodal fusion, masked modeling objectives, and joint training on images and videos. Using these empirical insights, we then develop a step-by-step recipe, dubbed VindLU, for effective VidL pretraining. Our final model trained using our recipe achieves results comparable to or better than the state of the art on several VidL tasks without relying on external CLIP pretraining. In particular, on the text-to-video retrieval task, our approach obtains 61.2% on DiDeMo, and 55.0% on ActivityNet, outperforming current SOTA by 7.8% and 6.1% respectively. Furthermore, our model also obtains state-of-the-art video question-answering results on ActivityNet-QA, MSRVTT-QA, MSRVTT-MC and TVQA. Our code and pretrained models are publicly available at: https://github.com/klauscc/VindLU.
+ + + + OmniAvatar: Geometry-Guided Controllable 3D Head Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_OmniAvatar_Geometry-Guided_Controllable_3D_Head_Synthesis_CVPR_2023_paper.pdf + We present OmniAvatar, a novel geometry-guided 3D head synthesis model trained from in-the-wild unstructured images that is capable of synthesizing diverse identity-preserved 3D heads with compelling dynamic details under full disentangled control over camera poses, facial expressions, head shapes, articulated neck and jaw poses. To achieve such high level of disentangled control, we first explicitly define a novel semantic signed distance function (SDF) around a head geometry (FLAME) conditioned on the control parameters. This semantic SDF allows us to build a differentiable volumetric correspondence map from the observation space to a disentangled canonical space from all the control parameters. We then leverage the 3D-aware GAN framework (EG3D) to synthesize detailed shape and appearance of 3D full heads in the canonical space, followed by a volume rendering step guided by the volumetric correspondence map to output into the observation space. To ensure the control accuracy on the synthesized head shapes and expressions, we introduce a geometry prior loss to conform to head SDF and a control loss to conform to the expression code. Further, we enhance the temporal realism with dynamic details conditioned upon varying expressions and joint poses. Our model can synthesize more preferable identity-preserved 3D heads with compelling dynamic details compared to the state-of-the-art methods both qualitatively and quantitatively. We also provide an ablation study to justify many of our system design choices. + + + + SUDS: Scalable Urban Dynamic Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Turki_SUDS_Scalable_Urban_Dynamic_Scenes_CVPR_2023_paper.pdf + We extend neural radiance fields (NeRFs) to dynamic large-scale urban scenes. Prior work tends to reconstruct single video clips of short durations (up to 10 seconds). Two reasons are that such methods (a) tend to scale linearly with the number of moving objects and input videos because a separate model is built for each and (b) tend to require supervision via 3D bounding boxes and panoptic labels, obtained manually or via category-specific models. As a step towards truly open-world reconstructions of dynamic cities, we introduce two key innovations: (a) we factorize the scene into three separate hash table data structures to efficiently encode static, dynamic, and far-field radiance fields, and (b) we make use of unlabeled target signals consisting of RGB images, sparse LiDAR, off-the-shelf self-supervised 2D descriptors, and most importantly, 2D optical flow. Operationalizing such inputs via photometric, geometric, and feature-metric reconstruction losses enables SUDS to decompose dynamic scenes into the static background, individual objects, and their motions. When combined with our multi-branch table representation, such reconstructions can be scaled to tens of thousands of objects across 1.2 million frames from 1700 videos spanning geospatial footprints of hundreds of kilometers, (to our knowledge) the largest dynamic NeRF built to date. We present qualitative initial results on a variety of tasks enabled by our representations, including novel-view synthesis of dynamic urban scenes, unsupervised 3D instance segmentation, and unsupervised 3D cuboid detection. 
To compare to prior work, we also evaluate on KITTI and Virtual KITTI 2, surpassing state-of-the-art methods that rely on ground truth 3D bounding box annotations while being 10x quicker to train. + + + + Cloud-Device Collaborative Adaptation to Continual Changing Environments in the Real-World + http://openaccess.thecvf.com//content/CVPR2023/papers/Pan_Cloud-Device_Collaborative_Adaptation_to_Continual_Changing_Environments_in_the_Real-World_CVPR_2023_paper.pdf + When facing changing environments in the real world, the lightweight model on client devices suffers from a severe performance drop under distribution shifts. The main limitations of existing device models are that (1) they cannot be updated due to the computation limit of the device, and (2) the lightweight model has limited generalization ability. Meanwhile, recent large models have shown strong generalization capability on the cloud, while they cannot be deployed on client devices due to tight computation constraints. To enable the device model to deal with changing environments, we propose a new learning paradigm of Cloud-Device Collaborative Continual Adaptation. To encourage collaboration between cloud and device and improve the generalization of the device model, we propose an Uncertainty-based Visual Prompt Adapted (U-VPA) teacher-student model under this paradigm. Specifically, we first design Uncertainty Guided Sampling (UGS) to continuously screen out challenging data and transmit the most out-of-distribution samples from the device to the cloud. To further transfer the generalization capability of the large model on the cloud to the device model, we propose a Visual Prompt Learning Strategy with Uncertainty guided updating (VPLU) to specifically deal with the selected samples with more distribution shifts. Then, we transmit the visual prompts to the device and concatenate them with the incoming data to pull the device testing distribution closer to the cloud training distribution. We conduct extensive experiments on two object detection datasets with continually changing environments. Our proposed U-VPA teacher-student framework outperforms previous state-of-the-art test-time adaptation and device-cloud collaboration methods. The code and datasets will be released. + + + + Seasoning Model Soups for Robustness to Adversarial and Natural Distribution Shifts + http://openaccess.thecvf.com//content/CVPR2023/papers/Croce_Seasoning_Model_Soups_for_Robustness_to_Adversarial_and_Natural_Distribution_CVPR_2023_paper.pdf + Adversarial training is widely used to make classifiers robust to a specific threat or adversary, such as l_p-norm bounded perturbations of a given p-norm. However, existing methods for training classifiers robust to multiple threats require knowledge of all attacks during training and remain vulnerable to unseen distribution shifts. In this work, we describe how to obtain adversarially-robust model soups (i.e., linear combinations of parameters) that smoothly trade off robustness to different l_p-norm bounded adversaries. We demonstrate that such soups allow us to control the type and level of robustness, and can achieve robustness to all threats without jointly training on all of them. In some cases, the resulting model soups are more robust to a given l_p-norm adversary than the constituent model specialized against that same adversary. Finally, we show that adversarially-robust model soups can be a viable tool to adapt to distribution shifts from a few examples.
+ + + + How To Prevent the Continuous Damage of Noises To Model Training? + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_How_To_Prevent_the_Continuous_Damage_of_Noises_To_Model_CVPR_2023_paper.pdf + Deep learning with noisy labels is challenging and inevitable in many circumstances. Existing methods reduce the impact of noise samples by reducing loss weights of uncertain samples or by filtering out potential noise samples, which rely heavily on the model's superior discriminative power for identifying noise samples. However, in the training stage, the imperfect trainee model will miss many noise samples, which causes continuous damage to the model training. Consequently, there is a large performance gap between existing anti-noise models trained with noisy samples and models trained with clean samples. In this paper, we put forward a Gradient Switching Strategy (GSS) to prevent the continuous damage of noise samples to the classifier. Theoretical analysis shows that the damage comes from the misleading gradient direction computed from the noise samples. The trainee model will deviate from the correct optimization direction under the influence of the accumulated misleading gradient of noise samples. To address this problem, the proposed GSS alleviates the damage by switching the current gradient direction of each sample to a new direction selected from a gradient direction pool, which contains all-class gradient directions with different probabilities. During training, the trainee model is optimized along switched gradient directions generated by GSS, which assigns higher probabilities to potential principal directions for high-confidence samples. Conversely, uncertain samples have a relatively uniform probability distribution over all gradient directions, which can cancel out the misleading gradient directions. Extensive experiments show that a model trained with GSS can achieve performance comparable to a model trained with clean data. Moreover, the proposed GSS can be plugged into existing frameworks for noisy-label learning. This work can provide a new perspective for future noisy-label learning. + + + + Skinned Motion Retargeting With Residual Perception of Motion Semantics & Geometry + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Skinned_Motion_Retargeting_With_Residual_Perception_of_Motion_Semantics__CVPR_2023_paper.pdf + A good motion retargeting cannot be reached without reasonable consideration of source-target differences on both the skeleton and shape geometry levels. In this work, we propose a novel Residual RETargeting network (R2ET) structure, which relies on two neural modification modules, to adjust the source motions to fit the target skeletons and shapes progressively. In particular, a skeleton-aware module is introduced to preserve the source motion semantics. A shape-aware module is designed to perceive the geometries of target characters to reduce interpenetration and contact-missing. Driven by our explored distance-based losses that explicitly model the motion semantics and geometry, these two modules can learn residual motion modifications on the source motion to generate plausible retargeted motion in a single inference without post-processing. To balance these two modifications, we further present a balancing gate to conduct linear interpolation between them.
Extensive experiments on the public dataset Mixamo demonstrate that our R2ET achieves the state-of-the-art performance, and provides a good balance between the preservation of motion semantics as well as the attenuation of interpenetration and contact-missing. Code is available at https://github.com/Kebii/R2ET. + + + + Weakly-Supervised Single-View Image Relighting + http://openaccess.thecvf.com//content/CVPR2023/papers/Yi_Weakly-Supervised_Single-View_Image_Relighting_CVPR_2023_paper.pdf + We present a learning-based approach to relight a single image of Lambertian and low-frequency specular objects. Our method enables inserting objects from photographs into new scenes and relighting them under the new environment lighting, which is essential for AR applications. To relight the object, we solve both inverse rendering and re-rendering. To resolve the ill-posed inverse rendering, we propose a weakly-supervised method by a low-rank constraint. To facilitate the weakly-supervised training, we contribute Relit, a large-scale (750K images) dataset of videos with aligned objects under changing illuminations. For re-rendering, we propose a differentiable specular rendering layer to render low-frequency non-Lambertian materials under various illuminations of spherical harmonics. The whole pipeline is end-to-end and efficient, allowing for a mobile app implementation of AR object insertion. Extensive evaluations demonstrate that our method achieves state-of-the-art performance. Project page: https://renjiaoyi.github.io/relighting/. + + + + DualVector: Unsupervised Vector Font Synthesis With Dual-Part Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_DualVector_Unsupervised_Vector_Font_Synthesis_With_Dual-Part_Representation_CVPR_2023_paper.pdf + Automatic generation of fonts can be an important aid to typeface design. Many current approaches regard glyphs as pixelated images, which present artifacts when scaling and inevitable quality losses after vectorization. On the other hand, existing vector font synthesis methods either fail to represent the shape concisely or require vector supervision during training. To push the quality of vector font synthesis to the next level, we propose a novel dual-part representation for vector glyphs, where each glyph is modeled as a collection of closed "positive" and "negative" path pairs. The glyph contour is then obtained by boolean operations on these paths. We first learn such a representation only from glyph images and devise a subsequent contour refinement step to align the contour with an image representation to further enhance details. Our method, named DualVector, outperforms state-of-the-art methods in vector font synthesis both quantitatively and qualitatively. Our synthesized vector fonts can be easily converted to common digital font formats like TrueType Font for practical use. The code is released at https://github.com/thuliu-yt16/dualvector. + + + + ReasonNet: End-to-End Driving With Temporal and Global Reasoning + http://openaccess.thecvf.com//content/CVPR2023/papers/Shao_ReasonNet_End-to-End_Driving_With_Temporal_and_Global_Reasoning_CVPR_2023_paper.pdf + The large-scale deployment of autonomous vehicles is yet to come, and one of the major remaining challenges lies in urban dense traffic scenarios. In such cases, it remains challenging to predict the future evolution of the scene and future behaviors of objects, and to deal with rare adverse events such as the sudden appearance of occluded objects. 
In this paper, we present ReasonNet, a novel end-to-end driving framework that extensively exploits both temporal and global information of the driving scene. By reasoning on the temporal behavior of objects, our method can effectively process the interactions and relationships among features in different frames. Reasoning about the global information of the scene can also improve overall perception performance and benefit the detection of adverse events, especially the anticipation of potential danger from occluded objects. For comprehensive evaluation on occlusion events, we also release publicly a driving simulation benchmark DriveOcclusionSim consisting of diverse occlusion events. We conduct extensive experiments on multiple CARLA benchmarks, where our model outperforms all prior methods, ranking first on the sensor track of the public CARLA Leaderboard. + + + + Learning Situation Hyper-Graphs for Video Question Answering + http://openaccess.thecvf.com//content/CVPR2023/papers/Urooj_Learning_Situation_Hyper-Graphs_for_Video_Question_Answering_CVPR_2023_paper.pdf + Answering questions about complex situations in videos requires not only capturing of the presence of actors, objects, and their relations, but also the evolution of these relationships over time. A situation hyper-graph is a representation that describes situations as scene sub-graphs for video frames and hyper-edges for connected sub-graphs, and has been proposed to capture all such information in a compact structured form. In this work, we propose an architecture for Video Question Answering (VQA) that enables answering questions related to video content by predicting situation hyper-graphs, coined Situation Hyper-Graph based Video Question Answering (SHG-VQA). To this end, we train a situation hyper-graph decoder to implicitly identify graph representations with actions and object/human-object relationships from the input video clip and to use cross-attention between the predicted situation hyper-graphs and the question embedding to predict the correct answer. The proposed method is trained in an end-to-end manner and optimized by a cross-entropy based VQA loss function and a Hungarian matching loss for the situation graph prediction. The effectiveness of the proposed architecture is extensively evaluated on two challenging benchmarks: AGQA and STAR. Our results show that learning the underlying situation hyper-graphs helps the system to significantly improve its performance for novel challenges of video question answering task. + + + + GazeNeRF: 3D-Aware Gaze Redirection With Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Ruzzi_GazeNeRF_3D-Aware_Gaze_Redirection_With_Neural_Radiance_Fields_CVPR_2023_paper.pdf + We propose GazeNeRF, a 3D-aware method for the task of gaze redirection. Existing gaze redirection methods operate on 2D images and struggle to generate 3D consistent results. Instead, we build on the intuition that the face region and eye balls are separate 3D structures that move in a coordinated yet independent fashion. Our method leverages recent advancements in conditional image-based neural radiance fields and proposes a two-branch architecture that predicts volumetric features for the face and eye regions separately. Rigidly transforming the eye features via a 3D rotation matrix provides fine-grained control over the desired gaze angle. The final, redirected image is then attained via differentiable volume compositing. 
Our experiments show that this architecture outperforms naively conditioned NeRF baselines as well as previous state-of-the-art 2D gaze redirection methods in terms of redirection accuracy and identity preservation. Code and models will be released for research purposes. + + + + SegLoc: Learning Segmentation-Based Representations for Privacy-Preserving Visual Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Pietrantoni_SegLoc_Learning_Segmentation-Based_Representations_for_Privacy-Preserving_Visual_Localization_CVPR_2023_paper.pdf + Inspired by properties of semantic segmentation, in this paper we investigate how to leverage robust image segmentation in the context of privacy-preserving visual localization. We propose a new localization framework, SegLoc, that leverages image segmentation to create robust, compact, and privacy-preserving scene representations, i.e., 3D maps. We build upon the correspondence-supervised, fine-grained segmentation approach from Larsson et al (ICCV'19), making it more robust by learning a set of cluster labels with discriminative clustering, additional consistency regularization terms and we jointly learn a global image representation along with a dense local representation. In our localization pipeline, the former will be used for retrieving the most similar images, the latter to refine the retrieved poses by minimizing the label inconsistency between the 3D points of the map and their projection onto the query image. In various experiments, we show that our proposed representation allows to achieve (close-to) state-of-the-art pose estimation results while only using a compact 3D map that does not contain enough information about the original images for an attacker to reconstruct personal information. + + + + Efficient Hierarchical Entropy Model for Learned Point Cloud Compression + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_Efficient_Hierarchical_Entropy_Model_for_Learned_Point_Cloud_Compression_CVPR_2023_paper.pdf + Learning an accurate entropy model is a fundamental way to remove the redundancy in point cloud compression. Recently, the octree-based auto-regressive entropy model which adopts the self-attention mechanism to explore dependencies in a large-scale context is proved to be promising. However, heavy global attention computations and auto-regressive contexts are inefficient for practical applications. To improve the efficiency of the attention model, we propose a hierarchical attention structure that has a linear complexity to the context scale and maintains the global receptive field. Furthermore, we present a grouped context structure to address the serial decoding issue caused by the auto-regression while preserving the compression performance. Experiments demonstrate that the proposed entropy model achieves superior rate-distortion performance and significant decoding latency reduction compared with the state-of-the-art large-scale auto-regressive entropy model. + + + + Image Cropping With Spatial-Aware Feature and Rank Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Image_Cropping_With_Spatial-Aware_Feature_and_Rank_Consistency_CVPR_2023_paper.pdf + Image cropping aims to find visually appealing crops in an image. Despite the great progress made by previous methods, they are weak in capturing the spatial relationship between crops and aesthetic elements (e.g., salient objects, semantic edges). 
Besides, due to the high annotation cost of labeled data, the potential of unlabeled data remains to be exploited. To address the first issue, we propose a spatial-aware feature to encode the spatial relationship between candidate crops and aesthetic elements, by feeding the concatenation of the crop mask and selectively aggregated feature maps to a lightweight encoder. To address the second issue, we train a pair-wise ranking classifier on labeled images and transfer such knowledge to unlabeled images to enforce rank consistency. Experimental results on the benchmark datasets show that our proposed method performs favorably against state-of-the-art methods. + + + + SVGformer: Representation Learning for Continuous Vector Graphics Using Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_SVGformer_Representation_Learning_for_Continuous_Vector_Graphics_Using_Transformers_CVPR_2023_paper.pdf + Advances in representation learning have led to great success in understanding and generating data in various domains. However, in modeling vector graphics data, the pure data-driven approach often yields unsatisfactory results in downstream tasks as existing deep learning methods often require the quantization of SVG parameters and cannot exploit the geometric properties explicitly. In this paper, we propose a transformer-based representation learning model (SVGformer) that directly operates on continuous input values and manipulates the geometric information of SVG to encode outline details and long-distance dependencies. SVGformer can be used for various downstream tasks: reconstruction, classification, interpolation, retrieval, etc. We have conducted extensive experiments on vector font and icon datasets to show that our model can capture high-quality representation information and outperform the previous state-of-the-art on downstream tasks significantly. + + + + Learning Attribute and Class-Specific Representation Duet for Fine-Grained Fashion Analysis + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiao_Learning_Attribute_and_Class-Specific_Representation_Duet_for_Fine-Grained_Fashion_Analysis_CVPR_2023_paper.pdf + Fashion representation learning involves the analysis and understanding of various visual elements at different granularities and the interactions among them. Existing works often learn fine-grained fashion representations at the attribute level without considering their relationships and inter-dependencies across different classes. In this work, we propose to learn an attribute- and class-specific fashion representation duet to better model such attribute relationships and inter-dependencies by leveraging prior knowledge about the taxonomy of fashion attributes and classes. Through two sub-networks for the attributes and classes, respectively, our proposed embedding network progressively learns and refines the visual representation of a fashion image to improve its robustness for fashion retrieval. A multi-granularity loss consisting of attribute-level and class-level losses is proposed to introduce appropriate inductive bias to learn across different granularities of the fashion representations. Experimental results on three benchmark datasets demonstrate the effectiveness of our method, which outperforms the state-of-the-art methods by a large margin.
+ + + + Pixels, Regions, and Objects: Multiple Enhancement for Salient Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Pixels_Regions_and_Objects_Multiple_Enhancement_for_Salient_Object_Detection_CVPR_2023_paper.pdf + Salient object detection (SOD) aims to mimic the human visual system (HVS) and cognition mechanisms to identify and segment salient objects. However, due to the complexity of these mechanisms, current methods are not perfect. Accuracy and robustness need to be further improved, particularly in complex scenes with multiple objects and background clutter. To address this issue, we propose a novel approach called Multiple Enhancement Network (MENet) that adopts the boundary sensibility, content integrity, iterative refinement, and frequency decomposition mechanisms of HVS. A multi-level hybrid loss is firstly designed to guide the network to learn pixel-level, region-level, and object-level features. A flexible multiscale feature enhancement module (ME-Module) is then designed to gradually aggregate and refine global or detailed features by changing the size order of the input feature sequence. An iterative training strategy is used to enhance boundary features and adaptive features in the dual-branch decoder of MENet. Comprehensive evaluations on six challenging benchmark datasets show that MENet achieves state-of-the-art results. Both the codes and results are publicly available at https://github.com/yiwangtz/MENet. + + + + Leveraging Temporal Context in Low Representational Power Regimes + http://openaccess.thecvf.com//content/CVPR2023/papers/Fosco_Leveraging_Temporal_Context_in_Low_Representational_Power_Regimes_CVPR_2023_paper.pdf + Computer vision models are excellent at identifying and exploiting regularities in the world. However, it is computationally costly to learn these regularities from scratch. This presents a challenge for low-parameter models, like those running on edge devices (e.g. smartphones). Can the performance of models with low representational power be improved by supplementing training with additional information about these statistical regularities? We explore this in the domains of action recognition and action anticipation, leveraging the fact that actions are typically embedded in stereotypical sequences. We introduce the Event Transition Matrix (ETM), computed from action labels in an untrimmed video dataset, which captures the temporal context of a given action, operationalized as the likelihood that it was preceded or followed by each other action in the set. We show that including information from the ETM during training improves action recognition and anticipation performance on various egocentric video datasets. Through ablation and control studies, we show that the coherent sequence of information captured by our ETM is key to this effect, and we find that the benefit of this explicit representation of temporal context is most pronounced for smaller models. Code, matrices and models are available in our project page: https://camilofosco.com/etm_website. + + + + Omni3D: A Large Benchmark and Model for 3D Object Detection in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Brazil_Omni3D_A_Large_Benchmark_and_Model_for_3D_Object_Detection_CVPR_2023_paper.pdf + Recognizing scenes and objects in 3D from a single image is a longstanding goal of computer vision with applications in robotics and AR/VR. For 2D recognition, large datasets and scalable solutions have led to unprecedented advances. 
In 3D, existing benchmarks are small in size and approaches specialize in few object categories and specific domains, e.g. urban driving scenes. Motivated by the success of 2D recognition, we revisit the task of 3D object detection by introducing a large benchmark, called Omni3D. Omni3D re-purposes and combines existing datasets resulting in 234k images annotated with more than 3 million instances and 98 categories. 3D detection at such scale is challenging due to variations in camera intrinsics and the rich diversity of scene and object types. We propose a model, called Cube R-CNN, designed to generalize across camera and scene types with a unified approach. We show that Cube R-CNN outperforms prior works on the larger Omni3D and existing benchmarks. Finally, we prove that Omni3D is a powerful dataset for 3D object recognition and show that it improves single-dataset performance and can accelerate learning on new smaller datasets via pre-training. + + + + OT-Filter: An Optimal Transport Filter for Learning With Noisy Labels + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_OT-Filter_An_Optimal_Transport_Filter_for_Learning_With_Noisy_Labels_CVPR_2023_paper.pdf + The success of deep learning is largely attributed to the training over clean data. However, data is often coupled with noisy labels in practice. Learning with noisy labels is challenging because the performance of deep neural networks (DNNs) drastically degrades, due to confirmation bias caused by the network's memorization of noisy labels. To alleviate that, a recent prominent direction is on sample selection, which retrieves clean data samples from noisy samples, so as to enhance the model's robustness and tolerance to noisy labels. In this paper, we revamp sample selection from the perspective of optimal transport theory and propose a novel method, called the OT-Filter. The OT-Filter provides geometrically meaningful distances and preserves distribution patterns to measure the data discrepancy, thus alleviating the confirmation bias. Extensive experiments on benchmarks, such as Clothing1M and ANIMAL-10N, show that the OT-Filter outperforms its counterparts. Meanwhile, results on benchmarks with synthetic labels, such as CIFAR-10/100, show the superiority of the OT-Filter in handling data labels of high noise. + + + + Rigidity-Aware Detection for 6D Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Hai_Rigidity-Aware_Detection_for_6D_Object_Pose_Estimation_CVPR_2023_paper.pdf + Most recent 6D object pose estimation methods first use object detection to obtain 2D bounding boxes before actually regressing the pose. However, the general object detection methods they use are ill-suited to handle cluttered scenes, thus producing poor initialization to the subsequent pose network. To address this, we propose a rigidity-aware detection method exploiting the fact that, in 6D pose estimation, the target objects are rigid. This lets us introduce an approach to sampling positive object regions from the entire visible object area during training, instead of naively drawing samples from the bounding box center where the object might be occluded. As such, every visible object part can contribute to the final bounding box prediction, yielding better detection robustness. Key to the success of our approach is a visibility map, which we propose to build using a minimum barrier distance between every pixel in the bounding box and the box boundary.
Our results on seven challenging 6D pose estimation datasets evidence that our method outperforms general detection frameworks by a large margin. Furthermore, combined with a pose regression network, we obtain state-of-the-art pose estimation results on the challenging BOP benchmark. + + + + Clover: Towards a Unified Video-Language Alignment and Fusion Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Clover_Towards_a_Unified_Video-Language_Alignment_and_Fusion_Model_CVPR_2023_paper.pdf + Building a universal video-language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge to the machine learning field. Towards this goal, most recent works build the model by stacking uni-modal and cross-modal feature encoders and train it with pair-wise contrastive pretext tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance. They mostly adopt different architectures to deal with different downstream tasks. We find this is because the pair-wise training cannot effectively align and fuse features from different modalities. We then introduce Clover--a Correlated Video-Language pre-training method--towards a universal video-language model for solving multiple video understanding tasks with neither performance nor efficiency compromise. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment via incorporating learning from semantic masked samples and a new pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks for both zero-shot and fine-tuning settings, and eight video question answering tasks. Codes and pre-trained models will be released at https://github.com/LeeYN-43/Clover. + + + + Self-Supervised Learning From Images With a Joint-Embedding Predictive Architecture + http://openaccess.thecvf.com//content/CVPR2023/papers/Assran_Self-Supervised_Learning_From_Images_With_a_Joint-Embedding_Predictive_Architecture_CVPR_2023_paper.pdf + This paper demonstrates an approach for learning highly semantic image representations without relying on hand-crafted data-augmentations. We introduce the Image-based Joint-Embedding Predictive Architecture (I-JEPA), a non-generative approach for self-supervised learning from images. The idea behind I-JEPA is simple: from a single context block, predict the representations of various target blocks in the same image. A core design choice to guide I-JEPA towards producing semantic representations is the masking strategy; specifically, it is crucial to (a) sample target blocks with sufficiently large scale (semantic), and to (b) use a sufficiently informative (spatially distributed) context block. Empirically, when combined with Vision Transformers, we find I-JEPA to be highly scalable. For instance, we train a ViT-Huge/14 on ImageNet using 16 A100 GPUs in under 72 hours to achieve strong downstream performance across a wide range of tasks, from linear classification to object counting and depth prediction.
+ + + + A2J-Transformer: Anchor-to-Joint Transformer Network for 3D Interacting Hand Pose Estimation From a Single RGB Image + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_A2J-Transformer_Anchor-to-Joint_Transformer_Network_for_3D_Interacting_Hand_Pose_Estimation_CVPR_2023_paper.pdf + 3D interacting hand pose estimation from a single RGB image is a challenging task, due to serious self-occlusion and inter-occlusion towards hands, confusing similar appearance patterns between 2 hands, ill-posed joint position mapping from 2D to 3D, etc.. To address these, we propose to extend A2J-the state-of-the-art depth-based 3D single hand pose estimation method-to RGB domain under interacting hand condition. Our key idea is to equip A2J with strong local-global aware ability to well capture interacting hands' local fine details and global articulated clues among joints jointly. To this end, A2J is evolved under Transformer's non-local encoding-decoding framework to build A2J-Transformer. It holds 3 main advantages over A2J. First, self-attention across local anchor points is built to make them global spatial context aware to better capture joints' articulation clues for resisting occlusion. Secondly, each anchor point is regarded as learnable query with adaptive feature learning for facilitating pattern fitting capacity, instead of having the same local representation with the others. Last but not least, anchor point locates in 3D space instead of 2D as in A2J, to leverage 3D pose prediction. Experiments on challenging InterHand 2.6M demonstrate that, A2J-Transformer can achieve state-of-the-art model-free performance (3.38mm MPJPE advancement in 2-hand case) and can also be applied to depth domain with strong generalization. + + + + The Treasure Beneath Multiple Annotations: An Uncertainty-Aware Edge Detector + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_The_Treasure_Beneath_Multiple_Annotations_An_Uncertainty-Aware_Edge_Detector_CVPR_2023_paper.pdf + Deep learning-based edge detectors heavily rely on pixel-wise labels which are often provided by multiple annotators. Existing methods fuse multiple annotations using a simple voting process, ignoring the inherent ambiguity of edges and labeling bias of annotators. In this paper, we propose a novel uncertainty-aware edge detector (UAED), which employs uncertainty to investigate the subjectivity and ambiguity of diverse annotations. Specifically, we first convert the deterministic label space into a learnable Gaussian distribution, whose variance measures the degree of ambiguity among different annotations. Then we regard the learned variance as the estimated uncertainty of the predicted edge maps, and pixels with higher uncertainty are likely to be hard samples for edge detection. Therefore we design an adaptive weighting loss to emphasize the learning from those pixels with high uncertainty, which helps the network to gradually concentrate on the important pixels. UAED can be combined with various encoder-decoder backbones, and the extensive experiments demonstrate that UAED achieves superior performance consistently across multiple edge detection benchmarks. The source code is available at https://github.com/ZhouCX117/UAED. 
+ + + + DP-NeRF: Deblurred Neural Radiance Field With Physical Scene Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_DP-NeRF_Deblurred_Neural_Radiance_Field_With_Physical_Scene_Priors_CVPR_2023_paper.pdf + Neural Radiance Field (NeRF) has exhibited outstanding three-dimensional (3D) reconstruction quality via the novel view synthesis from multi-view images and paired calibrated camera parameters. However, previous NeRF-based systems have been demonstrated under strictly controlled settings, with little attention paid to less ideal scenarios, including with the presence of noise such as exposure, illumination changes, and blur. In particular, though blur frequently occurs in real situations, NeRF that can handle blurred images has received little attention. The few studies that have investigated NeRF for blurred images have not considered geometric and appearance consistency in 3D space, which is one of the most important factors in 3D reconstruction. This leads to inconsistency and the degradation of the perceptual quality of the constructed scene. Hence, this paper proposes a DP-NeRF, a novel clean NeRF framework for blurred images, which is constrained with two physical priors. These priors are derived from the actual blurring process during image acquisition by the camera. DP-NeRF proposes rigid blurring kernel to impose 3D consistency utilizing the physical priors and adaptive weight proposal to refine the color composition error in consideration of the relationship between depth and blur. We present extensive experimental results for synthetic and real scenes with two types of blur: camera motion blur and defocus blur. The results demonstrate that DP-NeRF successfully improves the perceptual quality of the constructed NeRF ensuring 3D geometric and appearance consistency. We further demonstrate the effectiveness of our model with comprehensive ablation analysis. + + + + Self-Supervised Blind Motion Deblurring With Deep Expectation Maximization + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Self-Supervised_Blind_Motion_Deblurring_With_Deep_Expectation_Maximization_CVPR_2023_paper.pdf + When taking a picture, any camera shake during the shutter time can result in a blurred image. Recovering a sharp image from the one blurred by camera shake is a challenging yet important problem. Most existing deep learning methods use supervised learning to train a deep neural network (DNN) on a dataset of many pairs of blurred/latent images. In contrast, this paper presents a dataset-free deep learning method for removing uniform and non-uniform blur effects from images of static scenes. Our method involves a DNN-based re-parametrization of the latent image, and we propose a Monte Carlo Expectation Maximization (MCEM) approach to train the DNN without requiring any latent images. The Monte Carlo simulation is implemented via Langevin dynamics. Experiments showed that the proposed method outperforms existing methods significantly in removing motion blur from images of static scenes. + + + + Grounding Counterfactual Explanation of Image Classifiers to Textual Concept Space + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Grounding_Counterfactual_Explanation_of_Image_Classifiers_to_Textual_Concept_Space_CVPR_2023_paper.pdf + Concept-based explanation aims to provide concise and human-understandable explanations of an image classifier. 
However, existing concept-based explanation methods typically require a significant amount of manually collected concept-annotated images. This is costly and runs the risk of human biases being involved in the explanation. In this paper, we propose counterfactual explanation with text-driven concepts (CounTEX), where the concepts are defined only from text by leveraging a pre-trained multi-modal joint embedding space without additional concept-annotated datasets. A conceptual counterfactual explanation is generated with text-driven concepts. To utilize the text-driven concepts defined in the joint embedding space to interpret target classifier outcome, we present a novel projection scheme for mapping the two spaces with a simple yet effective implementation. We show that CounTEX generates faithful explanations that provide a semantic understanding of model decision rationale robust to human bias. + + + + SemiCVT: Semi-Supervised Convolutional Vision Transformer for Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_SemiCVT_Semi-Supervised_Convolutional_Vision_Transformer_for_Semantic_Segmentation_CVPR_2023_paper.pdf + Semi-supervised learning improves data efficiency of deep models by leveraging unlabeled samples to alleviate the reliance on a large set of labeled samples. These successes concentrate on the pixel-wise consistency by using convolutional neural networks (CNNs) but fail to address both global learning capability and class-level features for unlabeled data. Recent works raise a new trend that Transformer achieves superior performance on the entire feature map in various tasks. In this paper, we unify the current dominant Mean-Teacher approaches by reconciling intra-model and inter-model properties for semi-supervised segmentation to produce a novel algorithm, SemiCVT, that absorbs the quintessence of CNNs and Transformer in a comprehensive way. Specifically, we first design a parallel CNN-Transformer architecture (CVT) with introducing an intra-model local-global interaction schema (LGI) in Fourier domain for full integration. The inter-model class-wise consistency is further presented to complement the class-level statistics of CNNs and Transformer in a cross-teaching manner. Extensive empirical evidence shows that SemiCVT yields consistent improvements over the state-of-the-art methods in two public benchmarks. + + + + Towards Open-World Segmentation of Parts + http://openaccess.thecvf.com//content/CVPR2023/papers/Pan_Towards_Open-World_Segmentation_of_Parts_CVPR_2023_paper.pdf + Segmenting object parts such as cup handles and animal bodies is important in many real-world applications but requires more annotation effort. The largest dataset nowadays contains merely two hundred object categories, implying the difficulty to scale up part segmentation to an unconstrained setting. To address this, we propose to explore a seemingly simplified but empirically useful and scalable task, class-agnostic part segmentation. In this problem, we disregard the part class labels in training and instead treat all of them as a single part class. We argue and demonstrate that models trained without part classes can better localize parts and segment them on objects unseen in training. We then present two further improvements. First, we propose to make the model object-aware, leveraging the fact that parts are "compositions" whose extents are bounded by objects, whose appearances are by nature not independent but bundled.
Second, we introduce a novel approach to improve part segmentation on unseen objects, inspired by an interesting finding --- for unseen objects, the pixel-wise features extracted by the model often reveal high-quality part segments. To this end, we propose a novel self-supervised procedure that iterates between pixel clustering and supervised contrastive learning that pulls pixels closer or pushes them away. Via extensive experiments on PartImageNet and Pascal-Part, we show notable and consistent gains by our approach, essentially a critical step towards open-world part segmentation. + + + + Stitchable Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Pan_Stitchable_Neural_Networks_CVPR_2023_paper.pdf + The public model zoo containing enormous powerful pretrained model families (e.g., ResNet/DeiT) has reached an unprecedented scope, which significantly contributes to the success of deep learning. As each model family consists of pretrained models with diverse scales (e.g., DeiT-Ti/S/B), a fundamental question naturally arises of how to efficiently assemble these readily available models in a family for dynamic accuracy-efficiency trade-offs at runtime. To this end, we present Stitchable Neural Networks (SN-Net), a novel scalable and efficient framework for model deployment. It cheaply produces numerous networks with different complexity and performance trade-offs given a family of pretrained neural networks, which we call anchors. Specifically, SN-Net splits the anchors across the blocks/layers and then stitches them together with simple stitching layers to map the activations from one anchor to another. With only a few epochs of training, SN-Net effectively interpolates between the performance of anchors with varying scales. At runtime, SN-Net can instantly adapt to dynamic resource constraints by switching the stitching positions. Extensive experiments on ImageNet classification demonstrate that SN-Net can obtain on-par or even better performance than many individually trained networks while supporting diverse deployment scenarios. For example, by stitching Swin Transformers, we challenge hundreds of models in the Timm model zoo with a single network. We believe this new elastic model framework can serve as a strong baseline for further research in wider communities. + + + + Audio-Visual Grouping Network for Sound Localization From Mixtures + http://openaccess.thecvf.com//content/CVPR2023/papers/Mo_Audio-Visual_Grouping_Network_for_Sound_Localization_From_Mixtures_CVPR_2023_paper.pdf + Sound source localization is a typical and challenging task that predicts the location of sound sources in a video. Previous single-source methods mainly used the audio-visual association as clues to localize sounding objects in each frame. Due to the mixed property of multiple sound sources in the original space, few multi-source approaches exist that localize multiple sources simultaneously, except for one recent work using a contrastive random walk in the graph with images and separated sound as nodes. Despite their promising performance, they can only handle a fixed number of sources, and they cannot learn compact class-aware representations for individual sources. To alleviate this shortcoming, in this paper, we propose a novel audio-visual grouping network, namely AVGN, that can directly learn category-wise semantic features for each source from the input audio mixture and frame to localize multiple sources simultaneously.
Specifically, our AVGN leverages learnable audio-visual class tokens to aggregate class-aware source features. Then, the aggregated semantic features for each source can be used as guidance to localize the corresponding visual regions. Compared to existing multi-source methods, our new framework can localize a flexible number of sources and disentangle category-aware audio-visual representations for individual sound sources. We conduct extensive experiments on the MUSIC, VGGSound-Instruments, and VGG-Sound Sources benchmarks. The results demonstrate that the proposed AVGN can achieve state-of-the-art sounding object localization performance in both single-source and multi-source scenarios. + + + + Fair Federated Medical Image Segmentation via Client Contribution Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Fair_Federated_Medical_Image_Segmentation_via_Client_Contribution_Estimation_CVPR_2023_paper.pdf + How to ensure fairness is an important topic in federated learning (FL). Recent studies have investigated how to reward clients based on their contribution (collaboration fairness), and how to achieve uniformity of performance across clients (performance fairness). Despite achieving progress on either one, we argue that it is critical to consider them together, in order to engage and motivate more diverse clients joining FL to derive a high-quality global model. In this work, we propose a novel method to optimize both types of fairness simultaneously. Specifically, we propose to estimate client contribution in gradient and data space. In gradient space, we monitor the gradient direction differences of each client with respect to others. And in data space, we measure the prediction error on client data using an auxiliary model. Based on this contribution estimation, we propose an FL method, federated training via contribution estimation (FedCE), i.e., using estimation as global model aggregation weights. We have theoretically analyzed our method and empirically evaluated it on two real-world medical datasets. The effectiveness of our approach has been validated with significant performance improvements, better collaboration fairness, better performance fairness, and comprehensive analytical studies. + + + + Dynamic Generative Targeted Attacks With Pattern Injection + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Dynamic_Generative_Targeted_Attacks_With_Pattern_Injection_CVPR_2023_paper.pdf + Adversarial attacks can evaluate model robustness and have been of great concern in recent years. Among various attacks, targeted attacks aim at misleading victim models to output adversary-desired predictions, which are more challenging and threatening than untargeted ones. Existing targeted attacks can be roughly divided into instance-specific and instance-agnostic attacks. Instance-specific attacks craft adversarial examples via iterative gradient updating on the specific instance. In contrast, instance-agnostic attacks learn a universal perturbation or a generative model on the global dataset to perform attacks. However, they rely too much on the classification boundary of substitute models, ignoring the realistic distribution of the target class, which may result in limited targeted attack performance. And there is no attempt to simultaneously combine the information of the specific instance and the global dataset.
To deal with these limitations, we first conduct an analysis via a causal graph and propose to craft transferable targeted adversarial examples by injecting target patterns. Based on this analysis, we introduce a generative attack model composed of a cross-attention guided convolution module and a pattern injection module. Concretely, the former adopts a dynamic convolution kernel and a static convolution kernel for the specific instance and the global dataset, respectively, which can inherit the advantages of both instance-specific and instance-agnostic attacks. And the pattern injection module utilizes a pattern prototype to encode target patterns, which can guide the generation of targeted adversarial examples. Besides, we also provide rigorous theoretical analysis to guarantee the effectiveness of our method. Extensive experiments demonstrate that our method show superior performance than 10 existing adversarial attacks against 13 models. + + + + Visual Recognition by Request + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Visual_Recognition_by_Request_CVPR_2023_paper.pdf + Humans have the ability of recognizing visual semantics in an unlimited granularity, but existing visual recognition algorithms cannot achieve this goal. In this paper, we establish a new paradigm named visual recognition by request (ViRReq) to bridge the gap. The key lies in decomposing visual recognition into atomic tasks named requests and leveraging a knowledge base, a hierarchical and text-based dictionary, to assist task definition. ViRReq allows for (i) learning complicated whole-part hierarchies from highly incomplete annotations and (ii) inserting new concepts with minimal efforts. We also establish a solid baseline by integrating language-driven recognition into recent semantic and instance segmentation methods, and demonstrate its flexible recognition ability on CPP and ADE20K, two datasets with hierarchical whole-part annotations. + + + + PointCert: Point Cloud Classification With Deterministic Certified Robustness Guarantees + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_PointCert_Point_Cloud_Classification_With_Deterministic_Certified_Robustness_Guarantees_CVPR_2023_paper.pdf + Point cloud classification is an essential component in many security-critical applications such as autonomous driving and augmented reality. However, point cloud classifiers are vulnerable to adversarially perturbed point clouds. Existing certified defenses against adversarial point clouds suffer from a key limitation: their certified robustness guarantees are probabilistic, i.e., they produce an incorrect certified robustness guarantee with some probability. In this work, we propose a general framework, namely PointCert, that can transform an arbitrary point cloud classifier to be certifiably robust against adversarial point clouds with deterministic guarantees. PointCert certifiably predicts the same label for a point cloud when the number of arbitrarily added, deleted, and/or modified points is less than a threshold. Moreover, we propose multiple methods to optimize the certified robustness guarantees of PointCert in three application scenarios. We systematically evaluate PointCert on ModelNet and ScanObjectNN benchmark datasets. Our results show that PointCert substantially outperforms state-of-the-art certified defenses even though their robustness guarantees are probabilistic. + + + + Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? 
+ http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Cap4Video_What_Can_Auxiliary_Captions_Do_for_Text-Video_Retrieval_CVPR_2023_paper.pdf + Most existing text-video retrieval methods focus on cross-modal matching between the visual content of videos and textual query sentences. However, in real-world scenarios, online videos are often accompanied by relevant text information such as titles, tags, and even subtitles, which can be utilized to match textual queries. This insight has motivated us to propose a novel approach to text-video retrieval, where we directly generate associated captions from videos using zero-shot video captioning with knowledge from web-scale pre-trained models (e.g., CLIP and GPT-2). Given the generated captions, a natural question arises: what benefits do they bring to text-video retrieval? To answer this, we introduce Cap4Video, a new framework that leverages captions in three ways: i) Input data: video-caption pairs can augment the training data. ii) Intermediate feature interaction: we perform cross-modal feature interaction between the video and caption to produce enhanced video representations. iii) Output score: the Query-Caption matching branch can complement the original Query-Video matching branch for text-video retrieval. We conduct comprehensive ablation studies to demonstrate the effectiveness of our approach. Without any post-processing, Cap4Video achieves state-of-the-art performance on four standard text-video retrieval benchmarks: MSR-VTT (51.4%), VATEX (66.6%), MSVD (51.8%), and DiDeMo (52.0%). The code is available at https://github.com/whwu95/Cap4Video. + + + + Progressive Semantic-Visual Mutual Adaption for Generalized Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Progressive_Semantic-Visual_Mutual_Adaption_for_Generalized_Zero-Shot_Learning_CVPR_2023_paper.pdf + Generalized Zero-Shot Learning (GZSL) identifies unseen categories by knowledge transferred from the seen domain, relying on the intrinsic interactions between visual and semantic information. Prior works mainly localize regions corresponding to the sharing attributes. When various visual appearances correspond to the same attribute, the sharing attributes inevitably introduce semantic ambiguity, hampering the exploration of accurate semantic-visual interactions. In this paper, we deploy the dual semantic-visual transformer module (DSVTM) to progressively model the correspondences between attribute prototypes and visual features, constituting a progressive semantic-visual mutual adaption (PSVMA) network for semantic disambiguation and knowledge transferability improvement. Specifically, DSVTM devises an instance-motivated semantic encoder that learns instance-centric prototypes to adapt to different images, enabling the recast of the unmatched semantic-visual pair into the matched one. Then, a semantic-motivated instance decoder strengthens accurate cross-domain interactions between the matched pair for semantic-related instance adaption, encouraging the generation of unambiguous visual representations. Moreover, to mitigate the bias towards seen classes in GZSL, a debiasing loss is proposed to pursue response consistency between seen and unseen predictions. The PSVMA consistently yields superior performances against other state-of-the-art methods. Code will be available at: https://github.com/ManLiuCoder/PSVMA. 
+ + + + Block Selection Method for Using Feature Norm in Out-of-Distribution Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Block_Selection_Method_for_Using_Feature_Norm_in_Out-of-Distribution_Detection_CVPR_2023_paper.pdf + Detecting out-of-distribution (OOD) inputs during the inference stage is crucial for deploying neural networks in the real world. Previous methods commonly relied on the output of a network derived from the highly activated feature map. In this study, we first revealed that a norm of the feature map obtained from the other block than the last block can be a better indicator of OOD detection. Motivated by this, we propose a simple framework consisting of FeatureNorm: a norm of the feature map and NormRatio: a ratio of FeatureNorm for ID and OOD to measure the OOD detection performance of each block. In particular, to select the block that provides the largest difference between FeatureNorm of ID and FeatureNorm of OOD, we create jigsaw puzzles as pseudo OOD from ID training samples and calculate NormRatio, and the block with the largest value is selected. After the suitable block is selected, OOD detection with the FeatureNorm outperforms other OOD detection methods by reducing FPR95 by up to 52.77% on CIFAR10 benchmark and by up to 48.53% on ImageNet benchmark. We demonstrate that our framework can generalize to various architectures and the importance of block selection, which can improve previous OOD detection methods as well. + + + + Four-View Geometry With Unknown Radial Distortion + http://openaccess.thecvf.com//content/CVPR2023/papers/Hruby_Four-View_Geometry_With_Unknown_Radial_Distortion_CVPR_2023_paper.pdf + We present novel solutions to previously unsolved problems of relative pose estimation from images whose calibration parameters, namely focal lengths and radial distortion, are unknown. Our approach enables metric reconstruction without modeling these parameters. The minimal case for reconstruction requires 13 points in 4 views for both the calibrated and uncalibrated cameras. We describe and implement the first solution to these minimal problems. In the calibrated case, this may be modeled as a polynomial system of equations with 3584 solutions. Despite the apparent intractability, the problem decomposes spectacularly. Each solution falls into a Euclidean symmetry class of size 16, and we can estimate 224 class representatives by solving a sequence of three subproblems with 28, 2, and 4 solutions. We highlight the relationship between internal constraints on the radial quadrifocal tensor and the relations among the principal minors of a 4x4 matrix. We also address the case of 4 upright cameras, where 7 points are minimal. Finally, we evaluate our approach on simulated and real data and benchmark against previous calibration-free solutions, and show that our method provides an efficient startup for an SfM pipeline with radial cameras. + + + + How To Prevent the Poor Performance Clients for Personalized Federated Learning? + http://openaccess.thecvf.com//content/CVPR2023/papers/Qu_How_To_Prevent_the_Poor_Performance_Clients_for_Personalized_Federated_CVPR_2023_paper.pdf + Personalized federated learning (pFL) collaboratively trains personalized models, which provides a customized model solution for individual clients in the presence of heterogeneous distributed local data. 
Although many recent studies have applied various algorithms to enhance personalization in pFL, they mainly focus on improving performance from the average or top-performing perspective. However, some clients may fall into poor performance, and this is rarely discussed clearly. Therefore, how to prevent such poorly performing clients deserves critical consideration. Intuitively, these poor clients may come from biased universal information shared with others. To address this issue, we propose a novel pFL strategy, called Personalize Locally, Generalize Universally (PLGU). PLGU generalizes the fine-grained universal information and moderates its biased performance by designing a Layer-Wised Sharpness Aware Minimization (LWSAM) algorithm while keeping the personalization local. Specifically, we embed our proposed PLGU strategy into two pFL schemes considered in this paper: with/without a global model, and present the training procedures in detail. Through in-depth study, we show that the proposed PLGU strategy achieves competitive generalization bounds on both considered pFL schemes. Our extensive experimental results show that all the proposed PLGU-based algorithms achieve state-of-the-art performance. + + + + Galactic: Scaling End-to-End Reinforcement Learning for Rearrangement at 100k Steps-per-Second + http://openaccess.thecvf.com//content/CVPR2023/papers/Berges_Galactic_Scaling_End-to-End_Reinforcement_Learning_for_Rearrangement_at_100k_Steps-per-Second_CVPR_2023_paper.pdf + We present Galactic, a large-scale simulation and reinforcement-learning (RL) framework for robotic mobile manipulation in indoor environments. Specifically, a Fetch robot (equipped with a mobile base, 7DoF arm, RGBD camera, egomotion, and onboard sensing) is spawned in a home environment and asked to rearrange objects -- by navigating to an object, picking it up, navigating to a target location, and then placing the object at the target location. Galactic is fast. In terms of simulation speed (rendering + physics), Galactic achieves over 421,000 steps-per-second (SPS) on an 8-GPU node, which is 54x faster than Habitat 2.0 (7699 SPS). More importantly, Galactic was designed to optimize the entire rendering+physics+RL interplay since any bottleneck in the interplay slows down training. In terms of simulation+RL speed (rendering + physics + inference + learning), Galactic achieves over 108,000 SPS, which is 88x faster than Habitat 2.0 (1243 SPS). These massive speed-ups not only drastically cut the wall-clock training time of existing experiments, but also unlock an unprecedented scale of new experiments. First, Galactic can train a mobile pick skill to >80% accuracy in under 16 minutes, a 100x speedup compared to the over 24 hours it takes to train the same skill in Habitat 2.0. Second, we use Galactic to perform the largest-scale experiment to date for rearrangement using 5B steps of experience in 46 hours, which is equivalent to 20 years of robot experience. This scaling results in a single neural network composed of task-agnostic components achieving 85% success in GeometricGoal rearrangement, compared to 0% success reported in Habitat 2.0 for the same approach. The code is available at github.com/facebookresearch/galactic.
+ + + + Learning on Gradients: Generalized Artifacts Representation for GAN-Generated Images Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Tan_Learning_on_Gradients_Generalized_Artifacts_Representation_for_GAN-Generated_Images_Detection_CVPR_2023_paper.pdf + Recently, there has been a significant advancement in image generation technology, known as GAN. It can easily generate realistic fake images, leading to an increased risk of abuse. However, most image detectors suffer from sharp performance drops in unseen domains. The key of fake image detection is to develop a generalized representation to describe the artifacts produced by generation models. In this work, we introduce a novel detection framework, named Learning on Gradients (LGrad), designed for identifying GAN-generated images, with the aim of constructing a generalized detector with cross-model and cross-data. Specifically, a pretrained CNN model is employed as a transformation model to convert images into gradients. Subsequently, we leverage these gradients to present the generalized artifacts, which are fed into the classifier to ascertain the authenticity of the images. In our framework, we turn the data-dependent problem into a transformation-model-dependent problem. To the best of our knowledge, this is the first study to utilize gradients as the representation of artifacts in GAN-generated images. Extensive experiments demonstrate the effectiveness and robustness of gradients as generalized artifact representations. Our detector achieves a new state-of-the-art performance with a remarkable gain of 11.4%. The code is released at https://github.com/chuangchuangtan/LGrad. + + + + Don't Lie to Me! Robust and Efficient Explainability With Verified Perturbation Analysis + http://openaccess.thecvf.com//content/CVPR2023/papers/Fel_Dont_Lie_to_Me_Robust_and_Efficient_Explainability_With_Verified_CVPR_2023_paper.pdf + A variety of methods have been proposed to try to explain how deep neural networks make their decisions. Key to those approaches is the need to sample the pixel space efficiently in order to derive importance maps. However, it has been shown that the sampling methods used to date introduce biases and other artifacts, leading to inaccurate estimates of the importance of individual pixels and severely limit the reliability of current explainability methods. Unfortunately, the alternative -- to exhaustively sample the image space is computationally prohibitive. In this paper, we introduce EVA (Explaining using Verified perturbation Analysis) -- the first explainability method guarantee to have an exhaustive exploration of a perturbation space. Specifically, we leverage the beneficial properties of verified perturbation analysis -- time efficiency, tractability and guaranteed complete coverage of a manifold -- to efficiently characterize the input variables that are most likely to drive the model decision. We evaluate the approach systematically and demonstrate state-of-the-art results on multiple benchmarks. + + + + Defending Against Patch-Based Backdoor Attacks on Self-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Tejankar_Defending_Against_Patch-Based_Backdoor_Attacks_on_Self-Supervised_Learning_CVPR_2023_paper.pdf + Recently, self-supervised learning (SSL) was shown to be vulnerable to patch-based data poisoning backdoor attacks. 
It was shown that an adversary can poison a small part of the unlabeled data so that when a victim trains an SSL model on it, the final model will have a backdoor that the adversary can exploit. This work aims to defend self-supervised learning against such attacks. We use a three-step defense pipeline, where we first train a model on the poisoned data. In the second step, our proposed defense algorithm (PatchSearch) uses the trained model to search the training data for poisoned samples and removes them from the training set. In the third step, a final model is trained on the cleaned-up training set. Our results show that PatchSearch is an effective defense. As an example, it improves a model's accuracy on images containing the trigger from 38.2% to 63.7% which is very close to the clean model's accuracy, 64.6%. Moreover, we show that PatchSearch outperforms baselines and state-of-the-art defense approaches including those using additional clean, trusted data. Our code is available at https://github.com/UCDvision/PatchSearch + + + + GeoNet: Benchmarking Unsupervised Adaptation Across Geographies + http://openaccess.thecvf.com//content/CVPR2023/papers/Kalluri_GeoNet_Benchmarking_Unsupervised_Adaptation_Across_Geographies_CVPR_2023_paper.pdf + In recent years, several efforts have been aimed at improving the robustness of vision models to domains and environments unseen during training. An important practical problem pertains to models deployed in a new geography that is under-represented in the training dataset, posing a direct challenge to fair and inclusive computer vision. In this paper, we study the problem of geographic robustness and make three main contributions. First, we introduce a large-scale dataset GeoNet for geographic adaptation containing benchmarks across diverse tasks like scene recognition (GeoPlaces), image classification (GeoImNet) and universal adaptation (GeoUniDA). Second, we investigate the nature of distribution shifts typical to the problem of geographic adaptation and hypothesize that the major source of domain shifts arise from significant variations in scene context (context shift), object design (design shift) and label distribution (prior shift) across geographies. Third, we conduct an extensive evaluation of several state-of-the-art unsupervised domain adaptation algorithms and architectures on GeoNet, showing that they do not suffice for geographical adaptation, and that large-scale pre-training using large vision models also does not lead to geographic robustness. Our dataset is publicly available at https://tarun005.github.io/GeoNet. + + + + Learning Transformation-Predictive Representations for Detection and Description of Local Features + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Learning_Transformation-Predictive_Representations_for_Detection_and_Description_of_Local_Features_CVPR_2023_paper.pdf + The task of key-points detection and description is to estimate the stable location and discriminative representation of local features, which is essential for image matching. However, either the rough hard positive or negative labels generated from one-to-one correspondences among images bring indistinguishable samples, called pseudo positives or negatives, which act as inconsistent supervisions while learning key-points used for matching. Such pseudo-labeled samples prevent deep neural networks from learning discriminative descriptions for accurate matching. 
To tackle this challenge, we propose to learn transformation-predictive representations with self-supervised contrastive learning. We maximize the similarity between corresponded views of the same 3D point (landmark) by using none of the negative sample pairs (including true and pseudo negatives) and avoiding collapsing solutions. Then we design a learnable label prediction mechanism to soften the hard positive labels into soft continuous targets. The aggressively updated soft labels extensively deal with the training bottleneck (derived from the label noise of pseudo positives) and make the model can be trained under a stronger augmentation paradigm. Our self-supervised method outperforms the state-of-the-art on the standard image matching benchmarks by noticeable margins and shows excellent generalization capability on multiple downstream tasks. + + + + Dionysus: Recovering Scene Structures by Dividing Into Semantic Pieces + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Dionysus_Recovering_Scene_Structures_by_Dividing_Into_Semantic_Pieces_CVPR_2023_paper.pdf + Most existing 3D reconstruction methods result in either detail loss or unsatisfying efficiency. However, effectiveness and efficiency are equally crucial in real-world applications, e.g., autonomous driving and augmented reality. We argue that this dilemma comes from wasted resources on valueless depth samples. This paper tackles the problem by proposing a novel learning-based 3D reconstruction framework named Dionysus. Our main contribution is to find out the most promising depth candidates from estimated semantic maps. This strategy simultaneously enables high effectiveness and efficiency by attending to the most reliable nominators. Specifically, we distinguish unreliable depth candidates by checking the cross-view semantic consistency and allow adaptive sampling by redistributing depth nominators among pixels. Experiments on the most popular datasets confirm our proposed framework's effectiveness. + + + + Advancing Visual Grounding With Scene Knowledge: Benchmark and Method + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_Advancing_Visual_Grounding_With_Scene_Knowledge_Benchmark_and_Method_CVPR_2023_paper.pdf + Visual grounding (VG) aims to establish fine-grained alignment between vision and language. Ideally, it can be a testbed for vision-and-language models to evaluate their understanding of the images and texts and their reasoning abilities over their joint space. However, most existing VG datasets are constructed using simple description texts, which do not require sufficient reasoning over the images and texts. This has been demonstrated in a recent study, where a simple LSTM-based text encoder without pretraining can achieve state-of-the-art performance on mainstream VG datasets. Therefore, in this paper, we propose a novel benchmark of Scene Knowledge-guided Visual Grounding (SK-VG), where the image content and referring expressions are not sufficient to ground the target objects, forcing the models to have a reasoning ability on the long-form scene knowledge. To perform this task, we propose two approaches to accept the triple-type input, where the former embeds knowledge into the image features before the image-query interaction; the latter leverages linguistic structure to assist in computing the image-text matching. 
We conduct extensive experiments to analyze the above methods and show that the proposed approaches achieve promising results but still leave room for improvement, including performance and interpretability. + + + + Multiview Compressive Coding for 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Multiview_Compressive_Coding_for_3D_Reconstruction_CVPR_2023_paper.pdf + A central goal of visual recognition is to understand objects and scenes from a single image. 2D recognition has witnessed tremendous progress thanks to large-scale learning and general-purpose representations. But, 3D poses new challenges stemming from occlusions not depicted in the image. Prior works try to overcome these by inferring from multiple views or rely on scarce CAD models and category-specific priors which hinder scaling to novel settings. In this work, we explore single-view 3D reconstruction by learning generalizable representations inspired by advances in self-supervised learning. We introduce a simple framework that operates on 3D points of single objects or whole scenes coupled with category-agnostic large-scale training from diverse RGB-D videos. Our model, Multiview Compressive Coding (MCC), learns to compress the input appearance and geometry to predict the 3D structure by querying a 3D-aware decoder. MCC's generality and efficiency allow it to learn from large-scale and diverse data sources with strong generalization to novel objects imagined by DALL*E 2 or captured in-the-wild with an iPhone. + + + + Modeling Entities As Semantic Points for Visual Information Extraction in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Modeling_Entities_As_Semantic_Points_for_Visual_Information_Extraction_in_CVPR_2023_paper.pdf + Recently, Visual Information Extraction (VIE) has been becoming increasingly important in both academia and industry, due to the wide range of real-world applications. Previously, numerous works have been proposed to tackle this problem. However, the benchmarks used to assess these methods are relatively plain, i.e., scenarios with real-world complexity are not fully represented in these benchmarks. As the first contribution of this work, we curate and release a new dataset for VIE, in which the document images are much more challenging in that they are taken from real applications, and difficulties such as blur, partial occlusion, and printing shift are quite common. All these factors may lead to failures in information extraction. Therefore, as the second contribution, we explore an alternative approach to precisely and robustly extract key information from document images under such tough conditions. Specifically, in contrast to previous methods, which usually either incorporate visual information into a multi-modal architecture or train text spotting and information extraction in an end-to-end fashion, we explicitly model entities as semantic points, i.e., center points of entities are enriched with semantic information describing the attributes and relationships of different entities, which could largely benefit entity labeling and linking. Extensive experiments on standard benchmarks in this field as well as the proposed dataset demonstrate that the proposed method can achieve significantly enhanced performance on entity labeling and linking, compared with previous state-of-the-art models. 
+ + + + MobileVOS: Real-Time Video Object Segmentation Contrastive Learning Meets Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Miles_MobileVOS_Real-Time_Video_Object_Segmentation_Contrastive_Learning_Meets_Knowledge_Distillation_CVPR_2023_paper.pdf + This paper tackles the problem of semi-supervised video object segmentation on resource-constrained devices, such as mobile phones. We formulate this problem as a distillation task, whereby we demonstrate that small space-time-memory networks with finite memory can achieve competitive results with state of the art, but at a fraction of the computational cost (32 milliseconds per frame on a Samsung Galaxy S22). Specifically, we provide a theoretically grounded framework that unifies knowledge distillation with supervised contrastive representation learning. These models are able to jointly benefit from both pixel-wise contrastive learning and distillation from a pre-trained teacher. We validate this loss by achieving competitive J&F to state of the art on both the standard DAVIS and YouTube benchmarks, despite running up to x5 faster, and with x32 fewer parameters. + + + + Pose Synchronization Under Multiple Pair-Wise Relative Poses + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Pose_Synchronization_Under_Multiple_Pair-Wise_Relative_Poses_CVPR_2023_paper.pdf + Pose synchronization, which seeks to estimate consistent absolute poses among a collection of objects from noisy relative poses estimated between pairs of objects in isolation, is a fundamental problem in many inverse applications. This paper studies an extreme setting where multiple relative pose estimates exist between each object pair, and the majority is incorrect. Popular methods that solve pose synchronization via recovering a low-rank matrix that encodes relative poses in block fail under this extreme setting. We introduce a three-step algorithm for pose synchronization under multiple relative pose inputs. The first step performs diffusion and clustering to compute the candidate poses of the input objects. We present a theoretical result to justify our diffusion formulation. The second step jointly optimizes the best pose for each object. The final step refines the output of the second step. Experimental results on benchmark datasets of structurefrom-motion and scan-based geometry reconstruction show that our approach offers more accurate absolute poses than state-of-the-art pose synchronization techniques. + + + + Controllable Light Diffusion for Portraits + http://openaccess.thecvf.com//content/CVPR2023/papers/Futschik_Controllable_Light_Diffusion_for_Portraits_CVPR_2023_paper.pdf + We introduce light diffusion, a novel method to improve lighting in portraits, softening harsh shadows and specular highlights while preserving overall scene illumination. Inspired by professional photographers' diffusers and scrims, our method softens lighting given only a single portrait photo. Previous portrait relighting approaches focus on changing the entire lighting environment, removing shadows (ignoring strong specular highlights), or removing shading entirely. In contrast, we propose a learning based method that allows us to control the amount of light diffusion and apply it on in-the-wild portraits. Additionally, we design a method to synthetically generate plausible external shadows with sub-surface scattering effects while conforming to the shape of the subject's face. 
Finally, we show how our approach can increase the robustness of higher level vision applications, such as albedo estimation, geometry estimation and semantic segmentation. + + + + Boosting Low-Data Instance Segmentation by Unsupervised Pre-Training With Saliency Prompt + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Boosting_Low-Data_Instance_Segmentation_by_Unsupervised_Pre-Training_With_Saliency_Prompt_CVPR_2023_paper.pdf + Recently, inspired by DETR variants, query-based end-to-end instance segmentation (QEIS) methods have outperformed CNN-based models on large-scale datasets. Yet they would lose efficacy when only a small amount of training data is available since it's hard for the crucial queries/kernels to learn localization and shape priors. To this end, this work offers a novel unsupervised pre-training solution for low-data regimes. Inspired by the recent success of the Prompting technique, we introduce a new pre-training method that boosts QEIS models by giving Saliency Prompt for queries/kernels. Our method contains three parts: 1) Saliency Masks Proposal is responsible for generating pseudo masks from unlabeled images based on the saliency mechanism. 2) Prompt-Kernel Matching transfers pseudo masks into prompts and injects the corresponding localization and shape priors to the best-matched kernels. 3) Kernel Supervision is applied to supply supervision at the kernel level for robust learning. From a practical perspective, our pre-training method helps QEIS models achieve a similar convergence speed and comparable performance with CNN-based models in low-data regimes. Experimental results show that our method significantly boosts several QEIS models on three datasets. + + + + Virtual Occlusions Through Implicit Depth + http://openaccess.thecvf.com//content/CVPR2023/papers/Watson_Virtual_Occlusions_Through_Implicit_Depth_CVPR_2023_paper.pdf + For augmented reality (AR), it is important that virtual assets appear to 'sit among' real world objects. The virtual element should variously occlude and be occluded by real matter, based on a plausible depth ordering. This occlusion should be consistent over time as the viewer's camera moves. Unfortunately, small mistakes in the estimated scene depth can ruin the downstream occlusion mask, and thereby the AR illusion. Especially in real-time settings, depths inferred near boundaries or across time can be inconsistent. In this paper, we challenge the need for depth-regression as an intermediate step. We instead propose an implicit model for depth and use that to predict the occlusion mask directly. The inputs to our network are one or more color images, plus the known depths of any virtual geometry. We show how our occlusion predictions are more accurate and more temporally stable than predictions derived from traditional depth-estimation models. We obtain state-of-the-art occlusion results on the challenging ScanNetv2 dataset and superior qualitative results on real scenes. + + + + DiGA: Distil To Generalize and Then Adapt for Domain Adaptive Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_DiGA_Distil_To_Generalize_and_Then_Adapt_for_Domain_Adaptive_CVPR_2023_paper.pdf + Domain adaptive semantic segmentation methods commonly utilize stage-wise training, consisting of a warm-up and a self-training stage. 
However, this popular approach still faces several challenges in each stage: for warm-up, the widely adopted adversarial training often results in limited performance gain, due to blind feature alignment; for self-training, finding proper categorical thresholds is very tricky. To alleviate these issues, we first propose to replace the adversarial training in the warm-up stage by a novel symmetric knowledge distillation module that only accesses the source domain data and makes the model domain generalizable. Surprisingly, this domain generalizable warm-up model brings substantial performance improvement, which can be further amplified via our proposed cross-domain mixture data augmentation technique. Then, for the self-training stage, we propose a threshold-free dynamic pseudo-label selection mechanism to ease the aforementioned threshold problem and make the model better adapted to the target domain. Extensive experiments demonstrate that our framework achieves remarkable and consistent improvements compared to the prior arts on popular benchmarks. Codes and models are available at https://github.com/fy-vision/DiGA + + + + DiffSwap: High-Fidelity and Controllable Face Swapping via 3D-Aware Masked Diffusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_DiffSwap_High-Fidelity_and_Controllable_Face_Swapping_via_3D-Aware_Masked_Diffusion_CVPR_2023_paper.pdf + In this paper, we propose DiffSwap, a diffusion model based framework for high-fidelity and controllable face swapping. Unlike previous work that relies on carefully designed network architectures and loss functions to fuse the information from the source and target faces, we reformulate the face swapping as a conditional inpainting task, performed by a powerful diffusion model guided by the desired face attributes (e.g., identity and landmarks). An important issue that makes it nontrivial to apply diffusion models to face swapping is that we cannot perform the time-consuming multi-step sampling to obtain the generated image during training. To overcome this, we propose a midpoint estimation method to efficiently recover a reasonable diffusion result of the swapped face with only 2 steps, which enables us to introduce identity constraints to improve the face swapping quality. Our framework enjoys several favorable properties more appealing than prior arts: 1) Controllable. Our method is based on conditional masked diffusion on the latent space, where the mask and the conditions can be fully controlled and customized. 2) High-fidelity. The formulation of conditional inpainting can fully exploit the generative ability of diffusion models and can preserve the background of target images with minimal artifacts. 3) Shape-preserving. The controllability of our method enables us to use 3D-aware landmarks as the condition during generation to preserve the shape of the source face. Extensive experiments on both FF++ and FFHQ demonstrate that our method can achieve state-of-the-art face swapping results both qualitatively and quantitatively. + + + + Learned Image Compression With Mixed Transformer-CNN Architectures + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Learned_Image_Compression_With_Mixed_Transformer-CNN_Architectures_CVPR_2023_paper.pdf + Learned image compression (LIC) methods have exhibited promising progress and superior rate-distortion performance compared with classical image compression standards. 
Most existing LIC methods are Convolutional Neural Networks-based (CNN-based) or Transformer-based, which have different advantages. Exploiting both advantages is a point worth exploring, which has two challenges: 1) how to effectively fuse the two methods? 2) how to achieve higher performance with a suitable complexity? In this paper, we propose an efficient parallel Transformer-CNN Mixture (TCM) block with a controllable complexity to incorporate the local modeling ability of CNN and the non-local modeling ability of transformers to improve the overall architecture of image compression models. Besides, inspired by the recent progress of entropy estimation models and attention modules, we propose a channel-wise entropy model with parameter-efficient swin-transformer-based attention (SWAtten) modules by using channel squeezing. Experimental results demonstrate our proposed method achieves state-of-the-art rate-distortion performances on three different resolution datasets (i.e., Kodak, Tecnick, CLIC Professional Validation) compared to existing LIC methods. The code is at https://github.com/jmliu206/LIC_TCM. + + + + Quantum Multi-Model Fitting + http://openaccess.thecvf.com//content/CVPR2023/papers/Farina_Quantum_Multi-Model_Fitting_CVPR_2023_paper.pdf + Geometric model fitting is a challenging but fundamental computer vision problem. Recently, quantum optimization has been shown to enhance robust fitting for the case of a single model, while leaving the question of multi-model fitting open. In response to this challenge, this paper shows that the latter case can significantly benefit from quantum hardware and proposes the first quantum approach to multi-model fitting (MMF). We formulate MMF as a problem that can be efficiently sampled by modern adiabatic quantum computers without the relaxation of the objective function. We also propose an iterative and decomposed version of our method, which supports real-world-sized problems. The experimental evaluation demonstrates promising results on a variety of datasets. The source code is available at https://github.com/FarinaMatteo/qmmf. + + + + PermutoSDF: Fast Multi-View Reconstruction With Implicit Surfaces Using Permutohedral Lattices + http://openaccess.thecvf.com//content/CVPR2023/papers/Rosu_PermutoSDF_Fast_Multi-View_Reconstruction_With_Implicit_Surfaces_Using_Permutohedral_Lattices_CVPR_2023_paper.pdf + Neural radiance-density field methods have become increasingly popular for the task of novel-view rendering. Their recent extension to hash-based positional encoding ensures fast training and inference with visually pleasing results. However, density-based methods struggle with recovering accurate surface geometry. Hybrid methods alleviate this issue by optimizing the density based on an underlying SDF. However, current SDF methods are overly smooth and miss fine geometric details. In this work, we combine the strengths of these two lines of work in a novel hash-based implicit surface representation. We propose improvements to the two areas by replacing the voxel hash encoding with a permutohedral lattice which optimizes faster, especially for higher dimensions. We additionally propose a regularization scheme which is crucial for recovering high-frequency geometric detail. We evaluate our method on multiple datasets and show that we can recover geometric detail at the level of pores and wrinkles while using only RGB images for supervision. Furthermore, using sphere tracing we can render novel views at 30 fps on an RTX 3090. 
Code is publicly available at https://radualexandru.github.io/permuto_sdf + + + + Detection Hub: Unifying Object Detection Datasets via Query Adaptation on Language Embedding + http://openaccess.thecvf.com//content/CVPR2023/papers/Meng_Detection_Hub_Unifying_Object_Detection_Datasets_via_Query_Adaptation_on_CVPR_2023_paper.pdf + Combining multiple datasets enables performance boost on many computer vision tasks. But similar trend has not been witnessed in object detection when combining multiple datasets due to two inconsistencies among detection datasets: taxonomy difference and domain gap. In this paper, we address these challenges by a new design (named Detection Hub) that is dataset-aware and category-aligned. It not only mitigates the dataset inconsistency but also provides coherent guidance for the detector to learn across multiple datasets. In particular, the dataset-aware design is achieved by learning a dataset embedding that is used to adapt object queries as well as convolutional kernels in detection heads. The categories across datasets are semantically aligned into a unified space by replacing one-hot category representations with word embedding and leveraging the semantic coherence of language embedding. Detection Hub fulfills the benefits of large data on object detection. Experiments demonstrate that joint training on multiple datasets achieves significant performance gains over training on each dataset alone. Detection Hub further achieves SoTA performance on UODB benchmark with wide variety of datasets. + + + + Adversarial Normalization: I Can Visualize Everything (ICE) + http://openaccess.thecvf.com//content/CVPR2023/papers/Choi_Adversarial_Normalization_I_Can_Visualize_Everything_ICE_CVPR_2023_paper.pdf + Vision transformers use [CLS] tokens to predict image classes. Their explainability visualization has been studied using relevant information from [CLS] tokens or focusing on attention scores during self-attention. Such visualization, however, is challenging because of the dependence of the structure of a vision transformer on skip connections and attention operators, the instability of non-linearities in the learning process, and the limited reflection of self-attention scores on relevance. We argue that the output vectors for each input patch token in a vision transformer retain the image information of each patch location, which can facilitate the prediction of an image class. In this paper, we propose ICE (Adversarial Normalization: I Can visualize Everything), a novel method that enables a model to directly predict a class for each patch in an image; thus, advancing the effective visualization of the explainability of a vision transformer. Our method distinguishes background from foreground regions by predicting background classes for patches that do not determine image classes. We used the DeiT-S model, the most representative model employed in studies, on the explainability visualization of vision transformers. On the ImageNet-Segmentation dataset, ICE outperformed all explainability visualization methods for four cases depending on the model size. We also conducted quantitative and qualitative analyses on the tasks of weakly-supervised object localization and unsupervised object discovery. On the CUB-200-2011 and PASCALVOC07/12 datasets, ICE achieved comparable performance to the state-of-the-art methods. We incorporated ICE into the encoder of DeiT-S and improved efficiency by 44.01% on the ImageNet dataset over that achieved by the original DeiT-S model. 
We showed performance on the accuracy and efficiency comparable to EViT, the state-of-the-art pruning model, demonstrating the effectiveness of ICE. The code is available at https://github.com/Hanyang-HCC-Lab/ICE. + + + + Referring Multi-Object Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Referring_Multi-Object_Tracking_CVPR_2023_paper.pdf + Existing referring understanding tasks tend to involve the detection of a single text-referred object. In this paper, we propose a new and general referring understanding task, termed referring multi-object tracking (RMOT). Its core idea is to employ a language expression as a semantic cue to guide the prediction of multi-object tracking. To the best of our knowledge, it is the first work to achieve an arbitrary number of referent object predictions in videos. To push forward RMOT, we construct one benchmark with scalable expressions based on KITTI, named Refer-KITTI. Specifically, it provides 18 videos with 818 expressions, and each expression in a video is annotated with an average of 10.7 objects. Further, we develop a transformer-based architecture TransRMOT to tackle the new task in an online manner, which achieves impressive detection performance and outperforms other counterparts. The Refer-KITTI dataset and the code are released at https://referringmot.github.io. + + + + Hint-Aug: Drawing Hints From Foundation Vision Transformers Towards Boosted Few-Shot Parameter-Efficient Tuning + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Hint-Aug_Drawing_Hints_From_Foundation_Vision_Transformers_Towards_Boosted_Few-Shot_CVPR_2023_paper.pdf + Despite the growing demand for tuning foundation vision transformers (FViTs) on downstream tasks, fully unleashing FViTs' potential under data-limited scenarios (e.g., few-shot tuning) remains a challenge due to FViTs' data-hungry nature. Common data augmentation techniques fall short in this context due to the limited features contained in the few-shot tuning data. To tackle this challenge, we first identify an opportunity for FViTs in few-shot tuning: pretrained FViTs themselves have already learned highly representative features from large-scale pretraining data, which are fully preserved during widely used parameter-efficient tuning. We thus hypothesize that leveraging those learned features to augment the tuning data can boost the effectiveness of few-shot FViT tuning. To this end, we propose a framework called Hint-based Data Augmentation (Hint-Aug), which aims to boost FViT in few-shot tuning by augmenting the over-fitted parts of tuning samples with the learned features of pretrained FViTs. Specifically, Hint-Aug integrates two key enablers: (1) an Attentive Over-fitting Detector (AOD) to detect over-confident patches of foundation ViTs for potentially alleviating their over-fitting on the few-shot tuning data and (2) a Confusion-based Feature Infusion (CFI) module to infuse easy-to-confuse features from the pretrained FViTs with the over-confident patches detected by the above AOD in order to enhance the feature diversity during tuning. Extensive experiments and ablation studies on five datasets and three parameter-efficient tuning techniques consistently validate Hint-Aug's effectiveness: 0.04% 32.91% higher accuracy over the state-of-the-art (SOTA) data augmentation method under various low-shot settings. For example, on the Pet dataset, Hint-Aug achieves a 2.22% higher accuracy with 50% less training data over SOTA data augmentation methods. 
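The Hint-Aug entry above names two mechanisms, an Attentive Over-fitting Detector (AOD) and Confusion-based Feature Infusion (CFI). The snippet below is only a schematic, tensor-level sketch of that idea under assumed shapes; the function names, the attention-share threshold rule, and the prototype-blending step are illustrative assumptions, not the authors' released implementation.

# A schematic sketch (not the authors' code) of the two Hint-Aug components:
# flag over-confident patches, then blend in features of the most-confused class.
import torch
import torch.nn.functional as F

def detect_overconfident_patches(cls_to_patch_attn, ratio=0.9):
    # cls_to_patch_attn: (B, N) attention logits from the [CLS] token to N patches.
    # A patch is flagged when its share is close to the per-sample maximum.
    probs = F.softmax(cls_to_patch_attn, dim=-1)
    return probs > ratio * probs.max(dim=-1, keepdim=True).values   # (B, N) bool

def infuse_confusable_features(patch_tokens, flagged, class_prototypes, logits, alpha=0.3):
    # patch_tokens: (B, N, D); class_prototypes: (C, D); logits: (B, C).
    # The second-highest scoring class is treated as the "easy-to-confuse" class.
    confused_cls = logits.topk(2, dim=-1).indices[:, 1]              # (B,)
    hint = class_prototypes[confused_cls].unsqueeze(1)               # (B, 1, D)
    blended = (1.0 - alpha) * patch_tokens + alpha * hint            # broadcast over N
    return torch.where(flagged.unsqueeze(-1), blended, patch_tokens)

In the actual method the infusion augments the tuning samples of a frozen foundation ViT; here it is shown purely at the tensor level to make the two steps concrete.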
+ + + + A Strong Baseline for Generalized Few-Shot Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Hajimiri_A_Strong_Baseline_for_Generalized_Few-Shot_Semantic_Segmentation_CVPR_2023_paper.pdf + This paper introduces a generalized few-shot segmentation framework with a straightforward training process and an easy-to-optimize inference phase. In particular, we propose a simple yet effective model based on the well-known InfoMax principle, where the Mutual Information (MI) between the learned feature representations and their corresponding predictions is maximized. In addition, the terms derived from our MI-based formulation are coupled with a knowledge distillation term to retain the knowledge on base classes. With a simple training process, our inference model can be applied on top of any segmentation network trained on base classes. The proposed inference yields substantial improvements on the popular few-shot segmentation benchmarks, PASCAL-5^i and COCO-20^i. Particularly, for novel classes, the improvement gains range from 7% to 26% (PASCAL-5^i) and from 3% to 12% (COCO-20^i) in the 1-shot and 5-shot scenarios, respectively. Furthermore, we propose a more challenging setting, where performance gaps are further exacerbated. Our code is publicly available at https://github.com/sinahmr/DIaM. + + + + DynaFed: Tackling Client Data Heterogeneity With Global Dynamics + http://openaccess.thecvf.com//content/CVPR2023/papers/Pi_DynaFed_Tackling_Client_Data_Heterogeneity_With_Global_Dynamics_CVPR_2023_paper.pdf + The Federated Learning (FL) paradigm is known to face challenges under heterogeneous client data. Local training on non-iid distributed data results in deflected local optimum, which causes the client models drift further away from each other and degrades the aggregated global model's performance. A natural solution is to gather all client data onto the server, such that the server has a global view of the entire data distribution. Unfortunately, this reduces to regular training, which compromises clients' privacy and conflicts with the purpose of FL. In this paper, we put forth an idea to collect and leverage global knowledge on the server without hindering data privacy. We unearth such knowledge from the dynamics of the global model's trajectory. Specifically, we first reserve a short trajectory of global model snapshots on the server. Then, we synthesize a small pseudo dataset such that the model trained on it mimics the dynamics of the reserved global model trajectory. Afterward, the synthesized data is used to help aggregate the deflected clients into the global model. We name our method DynaFed, which enjoys the following advantages: 1) we do not rely on any external on-server dataset, which requires no additional cost for data collection; 2) the pseudo data can be synthesized in early communication rounds, which enables DynaFed to take effect early for boosting the convergence and stabilizing training; 3) the pseudo data only needs to be synthesized once and can be directly utilized on the server to help aggregation in subsequent rounds. Experiments across extensive benchmarks are conducted to showcase the effectiveness of DynaFed. We also provide insights and understanding of the underlying mechanism of our method. 
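As a rough illustration of the trajectory-matching idea in the DynaFed entry above, the sketch below optimizes a tiny synthetic set so that one simulated SGD step taken from a stored global snapshot lands near the next snapshot. The single-step simplification, the model_fn interface, and every shape and hyper-parameter are expository assumptions rather than the paper's algorithm.

# Toy sketch: fit synthetic data whose one-step training dynamics mimic the stored
# trajectory of global model snapshots.
import torch
import torch.nn.functional as F

def synthesize_pseudo_data(snapshots, model_fn, n_syn=32, dim=10, n_cls=5,
                           inner_lr=0.1, outer_lr=0.01, iters=200):
    # snapshots: list of state_dicts of the global model, oldest first.
    # model_fn: callable returning a fresh model mapping (n, dim) -> (n, n_cls).
    x_syn = torch.randn(n_syn, dim, requires_grad=True)
    y_syn = torch.randint(0, n_cls, (n_syn,))
    opt = torch.optim.Adam([x_syn], lr=outer_lr)
    for _ in range(iters):
        match_loss = 0.0
        for theta_t, theta_next in zip(snapshots[:-1], snapshots[1:]):
            model = model_fn()
            model.load_state_dict(theta_t)
            params = list(model.parameters())
            ce = F.cross_entropy(model(x_syn), y_syn)
            grads = torch.autograd.grad(ce, params, create_graph=True)
            stepped = [p - inner_lr * g for p, g in zip(params, grads)]  # simulated SGD step
            targets = [theta_next[name] for name, _ in model.named_parameters()]
            match_loss = match_loss + sum(((s - t) ** 2).sum()
                                          for s, t in zip(stepped, targets))
        opt.zero_grad()
        match_loss.backward()
        opt.step()
    return x_syn.detach(), y_syn

The synthesized pairs could then be used on the server for a few refinement steps after each aggregation round, as the abstract describes.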
+ + + + CUF: Continuous Upsampling Filters + http://openaccess.thecvf.com//content/CVPR2023/papers/Vasconcelos_CUF_Continuous_Upsampling_Filters_CVPR_2023_paper.pdf + Neural fields have rapidly been adopted for representing 3D signals, but their application to more classical 2D image-processing has been relatively limited. In this paper, we consider one of the most important operations in image processing: upsampling. In deep learning, learnable upsampling layers have extensively been used for single image super-resolution. We propose to parameterize upsampling kernels as neural fields. This parameterization leads to a compact architecture that obtains a 40-fold reduction in the number of parameters when compared with competing arbitrary-scale super-resolution architectures. When upsampling images of size 256x256 we show that our architecture is 2x-10x more efficient than competing arbitrary-scale super-resolution architectures, and more efficient than sub-pixel convolutions when instantiated to a single-scale model. In the general setting, these gains grow polynomially with the square of the target scale. We validate our method on standard benchmarks showing such efficiency gains can be achieved without sacrifices in super-resolution performance. + + + + Quantitative Manipulation of Custom Attributes on 3D-Aware Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Do_Quantitative_Manipulation_of_Custom_Attributes_on_3D-Aware_Image_Synthesis_CVPR_2023_paper.pdf + While 3D-based GAN techniques have been successfully applied to render photo-realistic 3D images with a variety of attributes while preserving view consistency, there has been little research on how to fine-control 3D images without limiting to a specific category of objects or their properties. To fill such a research gap, we propose a novel image manipulation model of 3D-based GAN representations for a fine-grained control of specific custom attributes. By extending the latest 3D-based GAN models (e.g., EG3D), our user-friendly quantitative manipulation model enables a fine yet normalized control of 3D manipulation of multi-attribute quantities while achieving view consistency. We validate the effectiveness of our proposed technique both qualitatively and quantitatively through various experiments. + + + + HOTNAS: Hierarchical Optimal Transport for Neural Architecture Search + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_HOTNAS_Hierarchical_Optimal_Transport_for_Neural_Architecture_Search_CVPR_2023_paper.pdf + Instead of searching the entire network directly, current NAS approaches increasingly search for multiple relatively small cells to reduce search costs. A major challenge is to jointly measure the similarity of cell micro-architectures and the difference in macro-architectures between different cell-based networks. Recently, optimal transport (OT) has been successfully applied to NAS as it can capture the operational and structural similarity across various networks. However, existing OT-based NAS methods either ignore the cell similarity or focus solely on searching for a single cell architecture. To address these issues, we propose a hierarchical optimal transport metric called HOTNN for measuring the similarity of different networks.
In HOTNN, the cell-level similarity computes the OT distance between cells in various networks by considering the similarity of each node and the differences in the information flow costs between node pairs within each cell in terms of operational and structural information. The network-level similarity calculates OT distance between networks by considering both the cell-level similarity and the variation in the global position of each cell within their respective networks. We then explore HOTNN in a Bayesian optimization framework called HOTNAS, and demonstrate its efficacy in diverse tasks. Extensive experiments demonstrate that HOTNAS can discover network architectures with better performance in multiple modular cell-based search spaces. + + + + Neural Fields Meet Explicit Geometric Representations for Inverse Rendering of Urban Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Neural_Fields_Meet_Explicit_Geometric_Representations_for_Inverse_Rendering_of_CVPR_2023_paper.pdf + Reconstruction and intrinsic decomposition of scenes from captured imagery would enable many applications such as relighting and virtual object insertion. Recent NeRF based methods achieve impressive fidelity of 3D reconstruction, but bake the lighting and shadows into the radiance field, while mesh-based methods that facilitate intrinsic decomposition through differentiable rendering have not yet scaled to the complexity and scale of outdoor scenes. We present a novel inverse rendering framework for large urban scenes capable of jointly reconstructing the scene geometry, spatially-varying materials, and HDR lighting from a set of posed RGB images with optional depth. Specifically, we use a neural field to account for the primary rays, and use an explicit mesh (reconstructed from the underlying neural field) for modeling secondary rays that produce higher-order lighting effects such as cast shadows. By faithfully disentangling complex geometry and materials from lighting effects, our method enables photorealistic relighting with specular and shadow effects on several outdoor datasets. Moreover, it supports physics-based scene manipulations such as virtual object insertion with ray-traced shadow casting. + + + + Cross-Image-Attention for Conditional Embeddings in Deep Metric Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Kotovenko_Cross-Image-Attention_for_Conditional_Embeddings_in_Deep_Metric_Learning_CVPR_2023_paper.pdf + Learning compact image embeddings that yield semantic similarities between images and that generalize to unseen test classes, is at the core of deep metric learning (DML). Finding a mapping from a rich, localized image feature map onto a compact embedding vector is challenging: Although similarity emerges between tuples of images, DML approaches marginalize out information in an individual image before considering another image to which similarity is to be computed. Instead, we propose during training to condition the embedding of an image on the image we want to compare it to. Rather than embedding by a simple pooling as in standard DML, we use cross-attention so that one image can identify relevant features in the other image. Consequently, the attention mechanism establishes a hierarchy of conditional embeddings that gradually incorporates information about the tuple to steer the representation of an individual image. 
The cross-attention layers bridge the gap between the original unconditional embedding and the final similarity and allow backpropagation to update encodings more directly than through a lossy pooling layer. At test time we use the resulting improved unconditional embeddings, thus requiring no additional parameters or computational overhead. Experiments on established DML benchmarks show that our cross-attention conditional embedding during training improves the underlying standard DML pipeline significantly so that it outperforms the state-of-the-art. + + + + Enhanced Multimodal Representation Learning With Cross-Modal KD + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Enhanced_Multimodal_Representation_Learning_With_Cross-Modal_KD_CVPR_2023_paper.pdf + This paper explores the tasks of leveraging auxiliary modalities which are only available at training to enhance multimodal representation learning through cross-modal Knowledge Distillation (KD). The widely adopted mutual information maximization-based objective leads to a short-cut solution of the weak teacher, i.e., achieving the maximum mutual information by simply making the teacher model as weak as the student model. To prevent such a weak solution, we introduce an additional objective term, i.e., the mutual information between the teacher and the auxiliary modality model. Besides, to narrow down the information gap between the student and teacher, we further propose to minimize the conditional entropy of the teacher given the student. Novel training schemes based on contrastive learning and adversarial learning are designed to optimize the mutual information and the conditional entropy, respectively. Experimental results on three popular multimodal benchmark datasets have shown that the proposed method outperforms a range of state-of-the-art approaches for video recognition, video retrieval and emotion classification. + + + + Learning a Depth Covariance Function + http://openaccess.thecvf.com//content/CVPR2023/papers/Dexheimer_Learning_a_Depth_Covariance_Function_CVPR_2023_paper.pdf + We propose learning a depth covariance function with applications to geometric vision tasks. Given RGB images as input, the covariance function can be flexibly used to define priors over depth functions, predictive distributions given observations, and methods for active point selection. We leverage these techniques for a selection of downstream tasks: depth completion, bundle adjustment, and monocular dense visual odometry. + + + + Evading DeepFake Detectors via Adversarial Statistical Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Hou_Evading_DeepFake_Detectors_via_Adversarial_Statistical_Consistency_CVPR_2023_paper.pdf + In recent years, as various realistic face forgery techniques known as DeepFake improve by leaps and bounds, more and more DeepFake detection techniques have been proposed. These methods typically rely on detecting statistical differences between natural (i.e., real) and DeepFake-generated images in both spatial and frequency domains. In this work, we propose to explicitly minimize the statistical differences to evade state-of-the-art DeepFake detectors. To this end, we propose a statistical consistency attack (StatAttack) against DeepFake detectors, which contains two main parts. First, we select several statistical-sensitive natural degradations (i.e., exposure, blur, and noise) and add them to the fake images in an adversarial way.
Second, we find that the statistical differences between natural and DeepFake images are positively associated with the distribution shifting between the two kinds of images, and we propose to use a distribution-aware loss to guide the optimization of different degradations. As a result, the feature distributions of generated adversarial examples are close to those of natural images. Furthermore, we extend the StatAttack to a more powerful version, MStatAttack, where we extend the single-layer degradation to multi-layer degradations sequentially and use the loss to tune the combination weights jointly. Comprehensive experimental results on four spatial-based detectors and two frequency-based detectors with four datasets demonstrate the effectiveness of our proposed attack method in both white-box and black-box settings. + + + + V2V4Real: A Real-World Large-Scale Dataset for Vehicle-to-Vehicle Cooperative Perception + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_V2V4Real_A_Real-World_Large-Scale_Dataset_for_Vehicle-to-Vehicle_Cooperative_Perception_CVPR_2023_paper.pdf + Modern perception systems of autonomous vehicles are known to be sensitive to occlusions and lack the capability of long perceiving range. It has been one of the key bottlenecks that prevents Level 5 autonomy. Recent research has demonstrated that the Vehicle-to-Vehicle (V2V) cooperative perception system has great potential to revolutionize the autonomous driving industry. However, the lack of a real-world dataset hinders the progress of this field. To facilitate the development of cooperative perception, we present V2V4Real, the first large-scale real-world multi-modal dataset for V2V perception. The data is collected by two vehicles equipped with multi-modal sensors driving together through diverse scenarios. Our V2V4Real dataset covers a driving area of 410 km, comprising 20K LiDAR frames, 40K RGB frames, 240K annotated 3D bounding boxes for 5 classes, and HDMaps that cover all the driving routes. V2V4Real introduces three perception tasks, including cooperative 3D object detection, cooperative 3D object tracking, and Sim2Real domain adaptation for cooperative perception. We provide comprehensive benchmarks of recent cooperative perception algorithms on three tasks. The V2V4Real dataset can be found at research.seas.ucla.edu/mobility-lab/v2v4real/. + + + + RMLVQA: A Margin Loss Approach for Visual Question Answering With Language Biases + http://openaccess.thecvf.com//content/CVPR2023/papers/Basu_RMLVQA_A_Margin_Loss_Approach_for_Visual_Question_Answering_With_CVPR_2023_paper.pdf + Visual Question Answering models have been shown to suffer from language biases, where the model learns a correlation between the question and the answer, ignoring the image. While early works attempted to use question-only models or data augmentations to reduce this bias, we propose an adaptive margin loss approach having two components. The first component considers the frequency of answers within a question type in the training data, which addresses the concern of the class-imbalance causing the language biases. However, it does not take into account the answering difficulty of the samples, which impacts their learning. We address this through the second component, where instance-specific margins are learnt, allowing the model to distinguish between samples of varying complexity. We introduce a bias-injecting component to our model, and compute the instance-specific margins from the confidence of this component.
We combine these with the estimated margins to consider both answer-frequency and task-complexity in the training loss. We show that, while the margin loss is effective for out-of-distribution (ood) data, the bias-injecting component is essential for generalising to in-distribution (id) data. Our proposed approach, Robust Margin Loss for Visual Question Answering (RMLVQA) improves upon the existing state-of-the-art results when compared to augmentation-free methods on benchmark VQA datasets suffering from language biases, while maintaining competitive performance on id data, making our method the most robust one among all comparable methods. + + + + Adaptive Sparse Convolutional Networks With Global Context Enhancement for Faster Object Detection on Drone Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Adaptive_Sparse_Convolutional_Networks_With_Global_Context_Enhancement_for_Faster_CVPR_2023_paper.pdf + Object detection on drone images with low-latency is an important but challenging task on the resource-constrained unmanned aerial vehicle (UAV) platform. This paper investigates optimizing the detection head based on the sparse convolution, which proves effective in balancing the accuracy and efficiency. Nevertheless, it suffers from inadequate integration of contextual information of tiny objects as well as clumsy control of the mask ratio in the presence of foreground with varying scales. To address the issues above, we propose a novel global context-enhanced adaptive sparse convolutional network (CEASC). It first develops a context-enhanced group normalization (CE-GN) layer, by replacing the statistics based on sparsely sampled features with the global contextual ones, and then designs an adaptive multi-layer masking strategy to generate optimal mask ratios at distinct scales for compact foreground coverage, promoting both the accuracy and efficiency. Extensive experimental results on two major benchmarks, i.e. VisDrone and UAVDT, demonstrate that CEASC remarkably reduces the GFLOPs and accelerates the inference procedure when plugging into the typical state-of-the-art detection frameworks (e.g. RetinaNet and GFL V1) with competitive performance. Code is available at https://github.com/Cuogeihong/CEASC. + + + + Command-Driven Articulated Object Understanding and Manipulation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chu_Command-Driven_Articulated_Object_Understanding_and_Manipulation_CVPR_2023_paper.pdf + We present Cart, a new approach towards articulated-object manipulations by human commands. Beyond the existing work that focuses on inferring articulation structures, we further support manipulating articulated shapes to align them subject to simple command templates. The key of Cart is to utilize the prediction of object structures to connect visual observations with user commands for effective manipulations. It is achieved by encoding command messages for motion prediction and a test-time adaptation to adjust the amount of movement from only command supervision. For a rich variety of object categories, Cart can accurately manipulate object shapes and outperform the state-of-the-art approaches in understanding the inherent articulation structures. Also, it can well generalize to unseen object categories and real-world objects. We hope Cart could open new directions for instructing machines to operate articulated objects. 
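A minimal sketch of the context-enhanced group normalization (CE-GN) idea from the CEASC entry above: instead of normalizing sparsely sampled foreground features with their own unstable statistics, the statistics are borrowed from a dense global-context tensor. The tensor shapes, the binary-mask interface, and the omission of learnable affine parameters are simplifying assumptions, not the paper's implementation.

# Sketch: group-norm-style normalization whose mean/variance come from a dense
# global-context feature map rather than from the sparse foreground samples.
import torch

def context_enhanced_group_norm(sparse_feat, global_context, mask, num_groups=8, eps=1e-5):
    # sparse_feat, global_context: (B, C, H, W); mask: (B, 1, H, W) in {0, 1}.
    # C must be divisible by num_groups.
    B, C, H, W = sparse_feat.shape
    ctx = global_context.reshape(B, num_groups, C // num_groups, H * W)
    mean = ctx.mean(dim=(2, 3), keepdim=True)
    var = ctx.var(dim=(2, 3), keepdim=True, unbiased=False)
    x = sparse_feat.reshape(B, num_groups, C // num_groups, H * W)
    x = (x - mean) / torch.sqrt(var + eps)
    return x.reshape(B, C, H, W) * mask   # keep only the sparse foreground responses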
+ + + + ConStruct-VL: Data-Free Continual Structured VL Concepts Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Smith_ConStruct-VL_Data-Free_Continual_Structured_VL_Concepts_Learning_CVPR_2023_paper.pdf + Recently, large-scale pre-trained Vision-and-Language (VL) foundation models have demonstrated remarkable capabilities in many zero-shot downstream tasks, achieving competitive results for recognizing objects defined by as little as short text prompts. However, it has also been shown that VL models are still brittle in Structured VL Concept (SVLC) reasoning, such as the ability to recognize object attributes, states, and inter-object relations. This leads to reasoning mistakes, which need to be corrected as they occur by teaching VL models the missing SVLC skills; often this must be done using private data where the issue was found, which naturally leads to a data-free continual (no task-id) VL learning setting. In this work, we introduce the first Continual Data-Free Structured VL Concepts Learning (ConStruct-VL) benchmark and show it is challenging for many existing data-free CL strategies. We, therefore, propose a data-free method comprised of a new approach of Adversarial Pseudo-Replay (APR) which generates adversarial reminders of past tasks from past task models. To use this method efficiently, we also propose a continual parameter-efficient Layered-LoRA (LaLo) neural architecture allowing no-memory-cost access to all past models at train time. We show this approach outperforms all data-free methods by as much as 7% while even matching some levels of experience-replay (prohibitive for applications where data-privacy must be preserved). Our code is publicly available at https://github.com/jamessealesmith/ConStruct-VL + + + + HelixSurf: A Robust and Efficient Neural Implicit Surface Learning of Indoor Scenes With Iterative Intertwined Regularization + http://openaccess.thecvf.com//content/CVPR2023/papers/Liang_HelixSurf_A_Robust_and_Efficient_Neural_Implicit_Surface_Learning_of_CVPR_2023_paper.pdf + Recovery of an underlying scene geometry from multi-view images stands as a long-time challenge in computer vision research. The recent promise leverages neural implicit surface learning and differentiable volume rendering, and achieves both the recovery of scene geometry and synthesis of novel views, where deep priors of neural models are used as an inductive smoothness bias. While promising for object-level surfaces, these methods suffer when coping with complex scene surfaces. In the meanwhile, traditional multi-view stereo can recover the geometry of scenes with rich textures, by globally optimizing the local, pixel-wise correspondences across multiple views. We are thus motivated to make use of the complementary benefits from the two strategies, and propose a method termed Helix-shaped neural implicit Surface learning or HelixSurf; HelixSurf uses the intermediate prediction from one strategy as the guidance to regularize the learning of the other one, and conducts such intertwined regularization iteratively during the learning process. We also propose an efficient scheme for differentiable volume rendering in HelixSurf. Experiments on surface reconstruction of indoor scenes show that our method compares favorably with existing methods and is orders of magnitude faster, even when some of existing methods are assisted with auxiliary training data. The source code is available at https://github.com/Gorilla-Lab-SCUT/HelixSurf. 
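The HelixSurf entry above intertwines neural implicit surface learning with multi-view stereo (MVS) by letting each strategy's intermediate predictions regularize the other. Below is a highly simplified sketch of one direction of that loop, where MVS depth regularizes the SDF rendering wherever the MVS estimate looks reliable; the tensor names, the confidence-mask rule, and the loss weights are assumptions rather than the released code.

# Sketch: MVS-guided regularization of an SDF-based volume-rendering loss.
import torch

def mvs_guided_sdf_loss(rendered_rgb, gt_rgb, rendered_depth, mvs_depth,
                        mvs_confidence, sdf_gradients,
                        conf_thresh=0.5, w_depth=0.1, w_eik=0.1):
    # rendered_*: per-ray outputs of volume-rendering the implicit surface.
    # mvs_*: per-ray depth and confidence from a PatchMatch-style MVS pass.
    rgb_loss = (rendered_rgb - gt_rgb).abs().mean()
    reliable = (mvs_confidence > conf_thresh).float()            # trust MVS only here
    depth_loss = (reliable * (rendered_depth - mvs_depth).abs()).sum() \
                 / reliable.sum().clamp(min=1.0)
    eikonal = ((sdf_gradients.norm(dim=-1) - 1.0) ** 2).mean()   # keep a valid SDF
    return rgb_loss + w_depth * depth_loss + w_eik * eikonal

The reverse direction, in which the learned surface guides MVS on texture-less regions and filters unreliable matches, is what makes the scheme iterative; it is omitted from this sketch.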
+ + + + Towards a Smaller Student: Capacity Dynamic Distillation for Efficient Image Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Towards_a_Smaller_Student_Capacity_Dynamic_Distillation_for_Efficient_Image_CVPR_2023_paper.pdf + Previous Knowledge Distillation based efficient image retrieval methods employ a lightweight network as the student model for fast inference. However, the lightweight student model lacks adequate representation capacity for effective knowledge imitation during the most critical early training period, causing final performance degeneration. To tackle this issue, we propose a Capacity Dynamic Distillation framework, which constructs a student model with editable representation capacity. Specifically, the employed student model is initially a heavy model to fruitfully learn distilled knowledge in the early training epochs, and the student model is gradually compressed during the training. To dynamically adjust the model capacity, our dynamic framework inserts a learnable convolutional layer within each residual block in the student model as the channel importance indicator. The indicator is optimized simultaneously by the image retrieval loss and the compression loss, and a retrieval-guided gradient resetting mechanism is proposed to release the gradient conflict. Extensive experiments show that our method has superior inference speed and accuracy, e.g., on the VeRi-776 dataset, given the ResNet101 as a teacher, our method saves 67.13% model parameters and 65.67% FLOPs without sacrificing accuracy. + + + + 3D-Aware Facial Landmark Detection via Multi-View Consistent Training on Synthetic Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Zeng_3D-Aware_Facial_Landmark_Detection_via_Multi-View_Consistent_Training_on_Synthetic_CVPR_2023_paper.pdf + Accurate facial landmark detection on wild images plays an essential role in human-computer interaction, entertainment, and medical applications. Existing approaches have limitations in enforcing 3D consistency while detecting 3D/2D facial landmarks due to the lack of multi-view in-the-wild training data. Fortunately, with the recent advances in generative visual models and neural rendering, we have witnessed rapid progress towards high quality 3D image synthesis. In this work, we leverage such approaches to construct a synthetic dataset and propose a novel multi-view consistent learning strategy to improve 3D facial landmark detection accuracy on in-the-wild images. The proposed 3D-aware module can be plugged into any learning-based landmark detection algorithm to enhance its accuracy. We demonstrate the superiority of the proposed plug-in module with extensive comparison against state-of-the-art methods on several real and synthetic datasets. + + + + PC2: Projection-Conditioned Point Cloud Diffusion for Single-Image 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Melas-Kyriazi_PC2_Projection-Conditioned_Point_Cloud_Diffusion_for_Single-Image_3D_Reconstruction_CVPR_2023_paper.pdf + Reconstructing the 3D shape of an object from a single RGB image is a long-standing problem in computer vision. In this paper, we propose a novel method for single-image 3D reconstruction which generates a sparse point cloud via a conditional denoising diffusion process. 
Our method takes as input a single RGB image along with its camera pose and gradually denoises a set of 3D points, whose positions are initially sampled randomly from a three-dimensional Gaussian distribution, into the shape of an object. The key to our method is a geometrically-consistent conditioning process which we call projection conditioning: at each step in the diffusion process, we project local image features onto the partially-denoised point cloud from the given camera pose. This projection conditioning process enables us to generate high-resolution sparse geometries that are well-aligned with the input image and can additionally be used to predict point colors after shape reconstruction. Moreover, due to the probabilistic nature of the diffusion process, our method is naturally capable of generating multiple different shapes consistent with a single input image. In contrast to prior work, our approach not only performs well on synthetic benchmarks but also gives large qualitative improvements on complex real-world data. + + + + Gradient-Based Uncertainty Attribution for Explainable Bayesian Deep Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Gradient-Based_Uncertainty_Attribution_for_Explainable_Bayesian_Deep_Learning_CVPR_2023_paper.pdf + Predictions made by deep learning models are prone to data perturbations, adversarial attacks, and out-of-distribution inputs. To build a trusted AI system, it is therefore critical to accurately quantify the prediction uncertainties. While current efforts focus on improving uncertainty quantification accuracy and efficiency, there is a need to identify uncertainty sources and take actions to mitigate their effects on predictions. Therefore, we propose to develop explainable and actionable Bayesian deep learning methods to not only perform accurate uncertainty quantification but also explain the uncertainties, identify their sources, and propose strategies to mitigate the uncertainty impacts. Specifically, we introduce a gradient-based uncertainty attribution method to identify the most problematic regions of the input that contribute to the prediction uncertainty. Compared to existing methods, the proposed UA-Backprop has competitive accuracy, relaxed assumptions, and high efficiency. Moreover, we propose an uncertainty mitigation strategy that leverages the attribution results as attention to further improve the model performance. Both qualitative and quantitative evaluations are conducted to demonstrate the effectiveness of our proposed methods. + + + + Manipulating Transfer Learning for Property Inference + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_Manipulating_Transfer_Learning_for_Property_Inference_CVPR_2023_paper.pdf + Transfer learning is a popular method for tuning pretrained (upstream) models for different downstream tasks using limited data and computational resources. We study how an adversary with control over an upstream model used in transfer learning can conduct property inference attacks on a victim's tuned downstream model. For example, to infer the presence of images of a specific individual in the downstream training set. We demonstrate attacks in which an adversary can manipulate the upstream model to conduct highly effective and specific property inference attacks (AUC score > 0.9), without incurring significant performance loss on the main task. 
The main idea of the manipulation is to make the upstream model generate activations (intermediate features) with different distributions for samples with and without a target property, thus enabling the adversary to distinguish easily between downstream models trained with and without training examples that have the target property. Our code is available at https://github.com/yulongt23/Transfer-Inference. + + + + Class Adaptive Network Calibration + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Class_Adaptive_Network_Calibration_CVPR_2023_paper.pdf + Recent studies have revealed that, beyond conventional accuracy, calibration should also be considered for training modern deep neural networks. To address miscalibration during learning, some methods have explored different penalty functions as part of the learning objective, alongside a standard classification loss, with a hyper-parameter controlling the relative contribution of each term. Nevertheless, these methods share two major drawbacks: 1) the scalar balancing weight is the same for all classes, hindering the ability to address different intrinsic difficulties or imbalance among classes; and 2) the balancing weight is usually fixed without an adaptive strategy, which may prevent from reaching the best compromise between accuracy and calibration, and requires hyper-parameter search for each application. We propose Class Adaptive Label Smoothing (CALS) for calibrating deep networks, which allows to learn class-wise multipliers during training, yielding a powerful alternative to common label smoothing penalties. Our method builds on a general Augmented Lagrangian approach, a well-established technique in constrained optimization, but we introduce several modifications to tailor it for large-scale, class-adaptive training. Comprehensive evaluation and multiple comparisons on a variety of benchmarks, including standard and long-tailed image classification, semantic segmentation, and text classification, demonstrate the superiority of the proposed method. The code is available at https://github.com/by-liu/CALS. + + + + Evading Forensic Classifiers With Attribute-Conditioned Adversarial Faces + http://openaccess.thecvf.com//content/CVPR2023/papers/Shamshad_Evading_Forensic_Classifiers_With_Attribute-Conditioned_Adversarial_Faces_CVPR_2023_paper.pdf + The ability of generative models to produce highly realistic synthetic face images has raised security and ethical concerns. As a first line of defense against such fake faces, deep learning based forensic classifiers have been developed. While these forensic models can detect whether a face image is synthetic or real with high accuracy, they are also vulnerable to adversarial attacks. Although such attacks can be highly successful in evading detection by forensic classifiers, they introduce visible noise patterns that are detectable through careful human scrutiny. Additionally, these attacks assume access to the target model(s) which may not always be true. Attempts have been made to directly perturb the latent space of GANs to produce adversarial fake faces that can circumvent forensic classifiers. In this work, we go one step further and show that it is possible to successfully generate adversarial fake faces with a specified set of attributes (e.g., hair color, eye size, race, gender, etc.). 
To achieve this goal, we leverage the state-of-the-art generative model StyleGAN with disentangled representations, which enables a range of modifications without leaving the manifold of natural images. We propose a framework to search for adversarial latent codes within the feature space of StyleGAN, where the search can be guided either by a text prompt or a reference image. We also propose a meta-learning based optimization strategy to achieve transferable performance on unknown target models. Extensive experiments demonstrate that the proposed approach can produce semantically manipulated adversarial fake faces, which are true to the specified attribute set and can successfully fool forensic face classifiers, while remaining undetectable by humans. Code: https://github.com/koushiksrivats/face_attribute_attack. + + + + OCTET: Object-Aware Counterfactual Explanations + http://openaccess.thecvf.com//content/CVPR2023/papers/Zemni_OCTET_Object-Aware_Counterfactual_Explanations_CVPR_2023_paper.pdf + Nowadays, deep vision models are being widely deployed in safety-critical applications, e.g., autonomous driving, and explainability of such models is becoming a pressing concern. Among explanation methods, counterfactual explanations aim to find minimal and interpretable changes to the input image that would also change the output of the model to be explained. Such explanations point end-users at the main factors that impact the decision of the model. However, previous methods struggle to explain decision models trained on images with many objects, e.g., urban scenes, which are more difficult to work with but also arguably more critical to explain. In this work, we propose to tackle this issue with an object-centric framework for counterfactual explanation generation. Our method, inspired by recent generative modeling works, encodes the query image into a latent space that is structured in a way to ease object-level manipulations. Doing so, it provides the end-user with control over which search directions (e.g., spatial displacement of objects, style modification, etc.) are to be explored during the counterfactual generation. We conduct a set of experiments on counterfactual explanation benchmarks for driving scenes, and we show that our method can be adapted beyond classification, e.g., to explain semantic segmentation models. To complete our analysis, we design and run a user study that measures the usefulness of counterfactual explanations in understanding a decision model. Code is available at https://github.com/valeoai/OCTET. + + + + Polarized Color Image Denoising + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Polarized_Color_Image_Denoising_CVPR_2023_paper.pdf + Single-chip polarized color photography provides both visual textures and object surface information in one snapshot. However, the use of an additional directional polarizing filter array tends to lower photon count and SNR, when compared to conventional color imaging. As a result, such a bilayer structure usually leads to unpleasant noisy images and undermines performance of polarization analysis, especially in low-light conditions. It is a challenge for traditional image processing pipelines owing to the fact that the physical constraints exerted implicitly in the channels are excessively complicated. In this paper, we propose to tackle this issue through a noise modeling method for realistic data synthesis and a powerful network structure inspired by vision Transformer. 
A real-world polarized color image dataset of paired raw short-exposed noisy images and long-exposed reference images is captured for experimental evaluation, which has demonstrated the effectiveness of our approaches for data synthesis and polarized color image denoising. + + + + UniDAformer: Unified Domain Adaptive Panoptic Segmentation Transformer via Hierarchical Mask Calibration + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_UniDAformer_Unified_Domain_Adaptive_Panoptic_Segmentation_Transformer_via_Hierarchical_Mask_CVPR_2023_paper.pdf + Domain adaptive panoptic segmentation aims to mitigate data annotation challenge by leveraging off-the-shelf annotated data in one or multiple related source domains. However, existing studies employ two separate networks for instance segmentation and semantic segmentation which lead to excessive network parameters as well as complicated and computationally intensive training and inference processes. We design UniDAformer, a unified domain adaptive panoptic segmentation transformer that is simple but can achieve domain adaptive instance segmentation and semantic segmentation simultaneously within a single network. UniDAformer introduces Hierarchical Mask Calibration (HMC) that rectifies inaccurate predictions at the level of regions, superpixels and pixels via online self-training on the fly. It has three unique features: 1) it enables unified domain adaptive panoptic adaptation; 2) it mitigates false predictions and improves domain adaptive panoptic segmentation effectively; 3) it is end-to-end trainable with a much simpler training and inference pipeline. Extensive experiments over multiple public benchmarks show that UniDAformer achieves superior domain adaptive panoptic segmentation as compared with the state-of-the-art. + + + + Non-Contrastive Learning Meets Language-Image Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Non-Contrastive_Learning_Meets_Language-Image_Pre-Training_CVPR_2023_paper.pdf + Contrastive language-image pre-training (CLIP) serves as a de-facto standard to align images and texts. Nonetheless, the loose correlation between images and texts of web-crawled data renders the contrastive objective data inefficient and craving for a large training batch size. In this work, we explore the validity of non-contrastive language-image pre-training (nCLIP) and study whether nice properties exhibited in visual self-supervised models can emerge. We empirically observe that the non-contrastive objective nourishes representation learning while sufficiently underperforming under zero-shot recognition. Based on the above study, we further introduce xCLIP, a multi-tasking framework combining CLIP and nCLIP, and show that nCLIP aids CLIP in enhancing feature semantics. The synergy between two objectives lets xCLIP enjoy the best of both worlds: superior performance in both zero-shot transfer and representation learning. Systematic evaluation is conducted spanning a wide variety of downstream tasks including zero-shot classification, out-of-domain classification, retrieval, visual representation learning, and textual representation learning, showcasing a consistent performance gain and validating the effectiveness of xCLIP. 
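To make the multi-tasking objective in the xCLIP entry above concrete, here is a small sketch that adds a non-contrastive term on top of a standard CLIP InfoNCE loss. The InfoNCE part follows common practice; the specific non-contrastive term shown (cross-entropy between softmax "pseudo-label" distributions with a stop-gradient target) and the weight w are illustrative assumptions, not the paper's exact formulation.

# Sketch: contrastive (CLIP-style) + non-contrastive (nCLIP-style) multi-task loss.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def noncontrastive_term(pred_proj, target_proj):
    # Cross-entropy between projected distributions; the target side is detached
    # (stop-gradient), mirroring common non-contrastive practice. No negatives used.
    log_p = F.log_softmax(pred_proj, dim=-1)
    q = F.softmax(target_proj.detach(), dim=-1)
    return -(q * log_p).sum(dim=-1).mean()

def xclip_style_loss(img_emb, txt_emb, img_proj, txt_proj, w=1.0):
    nc = 0.5 * (noncontrastive_term(img_proj, txt_proj) +
                noncontrastive_term(txt_proj, img_proj))
    return clip_contrastive_loss(img_emb, txt_emb) + w * nc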
+ + + + Switchable Representation Learning Framework With Self-Compatibility + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Switchable_Representation_Learning_Framework_With_Self-Compatibility_CVPR_2023_paper.pdf + Real-world visual search systems involve deployments on multiple platforms with different computing and storage resources. Deploying a unified model that suits the minimal-constrain platforms leads to limited accuracy. It is expected to deploy models with different capacities adapting to the resource constraints, which requires features extracted by these models to be aligned in the metric space. The method to achieve feature alignments is called "compatible learning". Existing research mainly focuses on the one-to-one compatible paradigm, which is limited in learning compatibility among multiple models. We propose a Switchable representation learning Framework with Self-Compatibility (SFSC). SFSC generates a series of compatible sub-models with different capacities through one training process. The optimization of sub-models faces gradients conflict, and we mitigate this problem from the perspective of the magnitude and direction. We adjust the priorities of sub-models dynamically through uncertainty estimation to co-optimize sub-models properly. Besides, the gradients with conflicting directions are projected to avoid mutual interference. SFSC achieves state-of-the-art performance on the evaluated datasets. + + + + Zero-Shot Dual-Lens Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Zero-Shot_Dual-Lens_Super-Resolution_CVPR_2023_paper.pdf + The asymmetric dual-lens configuration is commonly available on mobile devices nowadays, which naturally stores a pair of wide-angle and telephoto images of the same scene to support realistic super-resolution (SR). Even on the same device, however, the degradation for modeling realistic SR is image-specific due to the unknown acquisition process (e.g., tiny camera motion). In this paper, we propose a zero-shot solution for dual-lens SR (ZeDuSR), where only the dual-lens pair at test time is used to learn an image-specific SR model. As such, ZeDuSR adapts itself to the current scene without using external training data, and thus gets rid of generalization difficulty. However, there are two major challenges to achieving this goal: 1) dual-lens alignment while keeping the realistic degradation, and 2) effective usage of highly limited training data. To overcome these two challenges, we propose a degradation-invariant alignment method and a degradation-aware training strategy to fully exploit the information within a single dual-lens pair. Extensive experiments validate the superiority of ZeDuSR over existing solutions on both synthesized and real-world dual-lens datasets. + + + + Improving Vision-and-Language Navigation by Generating Future-View Image Semantics + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Improving_Vision-and-Language_Navigation_by_Generating_Future-View_Image_Semantics_CVPR_2023_paper.pdf + Vision-and-Language Navigation (VLN) is the task that requires an agent to navigate through the environment based on natural language instructions. At each step, the agent takes the next action by selecting from a set of navigable locations. In this paper, we aim to take one step further and explore whether the agent can benefit from generating the potential future view during navigation. 
Intuitively, humans will have an expectation of what the future environment will look like, based on the natural language instructions and surrounding views, which will aid correct navigation. Hence, to equip the agent with this ability to generate the semantics of future navigation views, we first propose three proxy tasks during the agent's in-domain pre-training: Masked Panorama Modeling (MPM), Masked Trajectory Modeling (MTM), and Action Prediction with Image Generation (APIG). These three objectives teach the model to predict missing views in a panorama (MPM), predict missing steps in the full trajectory (MTM), and generate the next view based on the full instruction and navigation history (APIG), respectively. We then fine-tune the agent on the VLN task with an auxiliary loss that minimizes the difference between the view semantics generated by the agent and the ground truth view semantics of the next step. Empirically, our VLN-SIG achieves the new state-of-the-art on both the Room-to-Room dataset and the CVDN dataset. We further show that our agent learns to fill in missing patches in future views qualitatively, which brings more interpretability over agents' predicted actions. Lastly, we demonstrate that learning to predict future view semantics also enables the agent to have better performance on longer paths. + + + + gSDF: Geometry-Driven Signed Distance Functions for 3D Hand-Object Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_gSDF_Geometry-Driven_Signed_Distance_Functions_for_3D_Hand-Object_Reconstruction_CVPR_2023_paper.pdf + Signed distance functions (SDFs) are an attractive framework that has recently shown promising results for 3D shape reconstruction from images. SDFs seamlessly generalize to different shape resolutions and topologies but lack explicit modelling of the underlying 3D geometry. In this work, we exploit the hand structure and use it as guidance for SDF-based shape reconstruction. In particular, we address reconstruction of hands and manipulated objects from monocular RGB images. To this end, we estimate poses of hands and objects and use them to guide 3D reconstruction. More specifically, we predict kinematic chains of pose transformations and align SDFs with highly-articulated hand poses. We improve the visual features of 3D points with geometry alignment and further leverage temporal information to enhance the robustness to occlusion and motion blurs. We conduct extensive experiments on the challenging ObMan and DexYCB benchmarks and demonstrate significant improvements of the proposed method over the state of the art. + + + + CIMI4D: A Large Multimodal Climbing Motion Dataset Under Human-Scene Interactions + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_CIMI4D_A_Large_Multimodal_Climbing_Motion_Dataset_Under_Human-Scene_Interactions_CVPR_2023_paper.pdf + Motion capture is a long-standing research problem. Although it has been studied for decades, the majority of research focuses on ground-based movements such as walking, sitting, dancing, etc. Off-grounded actions such as climbing are largely overlooked. As an important type of action in sports and the firefighting field, the climbing movement is challenging to capture because of its complex back poses, intricate human-scene interactions, and difficult global localization. The research community does not have an in-depth understanding of the climbing action due to the lack of specific datasets.
To address this limitation, we collect CIMI4D, a large rock ClImbing MotIon dataset from 12 persons climbing 13 different climbing walls. The dataset consists of around 180,000 frames of pose inertial measurements, LiDAR point clouds, RGB videos, high-precision static point cloud scenes, and reconstructed scene meshes. Moreover, we frame-wise annotate the touched rock holds to facilitate a detailed exploration of human-scene interaction. The core of this dataset is a blending optimization process, which corrects for the pose as it drifts and is affected by the magnetic conditions. To evaluate the merit of CIMI4D, we perform four tasks which include human pose estimations (with/without scene constraints), pose prediction, and pose generation. The experimental results demonstrate that CIMI4D presents great challenges to existing methods and enables extensive research opportunities. We share the dataset with the research community at http://www.lidarhumanmotion.net/cimi4d/. + + + + Modernizing Old Photos Using Multiple References via Photorealistic Style Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Gunawan_Modernizing_Old_Photos_Using_Multiple_References_via_Photorealistic_Style_Transfer_CVPR_2023_paper.pdf + This paper firstly presents old photo modernization using multiple references by performing stylization and enhancement in a unified manner. In order to modernize old photos, we propose a novel multi-reference-based old photo modernization (MROPM) framework consisting of a network MROPM-Net and a novel synthetic data generation scheme. MROPM-Net stylizes old photos using multiple references via photorealistic style transfer (PST) and further enhances the results to produce modern-looking images. Meanwhile, the synthetic data generation scheme trains the network to effectively utilize multiple references to perform modernization. To evaluate the performance, we propose a new old photos benchmark dataset (CHD) consisting of diverse natural indoor and outdoor scenes. Extensive experiments show that the proposed method outperforms other baselines in performing modernization on real old photos, even though no old photos were used during training. Moreover, our method can appropriately select styles from multiple references for each semantic region in the old photo to further improve the modernization performance. + + + + Curvature-Balanced Feature Manifold Learning for Long-Tailed Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_Curvature-Balanced_Feature_Manifold_Learning_for_Long-Tailed_Classification_CVPR_2023_paper.pdf + To address the challenges of long-tailed classification, researchers have proposed several approaches to reduce model bias, most of which assume that classes with few samples are weak classes. However, recent studies have shown that tail classes are not always hard to learn, and model bias has been observed on sample-balanced datasets, suggesting the existence of other factors that affect model bias. In this work, we systematically propose a series of geometric measures for perceptual manifolds in deep neural networks, and then explore the effect of the geometric characteristics of perceptual manifolds on classification difficulty and how learning shapes the geometric characteristics of perceptual manifolds.
An unanticipated finding is that the correlation between the class accuracy and the separation degree of perceptual manifolds gradually decreases during training, while the negative correlation with the curvature gradually increases, implying that curvature imbalance leads to model bias. Therefore, we propose curvature regularization to facilitate the model to learn curvature-balanced and flatter perceptual manifolds. Evaluations on multiple long-tailed and non-long-tailed datasets show the excellent performance and exciting generality of our approach, especially in achieving significant performance improvements based on current state-of-the-art techniques. Our work reminds researchers to pay attention to model bias not only on long-tailed datasets but also on non-long-tailed and even data-balanced datasets, which can improve model performance from another perspective. + + + + Revisiting Self-Similarity: Structural Embedding for Image Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Revisiting_Self-Similarity_Structural_Embedding_for_Image_Retrieval_CVPR_2023_paper.pdf + Despite advances in global image representation, existing image retrieval approaches rarely consider geometric structure during the global retrieval stage. In this work, we revisit the conventional self-similarity descriptor from a convolutional perspective, to encode both the visual and structural cues of the image to global image representation. Our proposed network, named Structural Embedding Network (SENet), captures the internal structure of the images and gradually compresses them into dense self-similarity descriptors while learning diverse structures from various images. These self-similarity descriptors and original image features are fused and then pooled into global embedding, so that global embedding can represent both geometric and visual cues of the image. Along with this novel structural embedding, our proposed network sets new state-of-the-art performances on several image retrieval benchmarks, convincing its robustness to look-alike distractors. The code and models are available: https://github.com/sungonce/SENet. + + + + Decoupling-and-Aggregating for Image Exposure Correction + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Decoupling-and-Aggregating_for_Image_Exposure_Correction_CVPR_2023_paper.pdf + The images captured under improper exposure conditions often suffer from contrast degradation and detail distortion. Contrast degradation will destroy the statistical properties of low-frequency components, while detail distortion will disturb the structural properties of high-frequency components, leading to the low-frequency and high-frequency components being mixed and inseparable. This will limit the statistical and structural modeling capacity for exposure correction. To address this issue, this paper proposes to decouple the contrast enhancement and detail restoration within each convolution process. It is based on the observation that, in the local regions covered by convolution kernels, the feature response of low-/high-frequency can be decoupled by addition/difference operation. To this end, we inject the addition/difference operation into the convolution process and devise a Contrast Aware (CA) unit and a Detail Aware (DA) unit to facilitate the statistical and structural regularities modeling. The proposed CA and DA can be plugged into existing CNN-based exposure correction networks to substitute the Traditional Convolution (TConv) to improve the performance. 
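As an aside, the addition/difference decoupling described in the exposure-correction abstract above can be pictured with a minimal, editor-added sketch; the depthwise box (mean) kernel, the center-minus-mean kernel, and the 3x3 window are illustrative assumptions, not the paper's CA/DA units.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AddDiffDecouple(nn.Module):
    # Editor's sketch: split each local window response into a low-frequency
    # ("addition") part and a high-frequency ("difference") residual.
    # Kernel choices here are assumptions, not the paper's CA/DA units.
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.k, self.channels = k, channels
        box = torch.full((channels, 1, k, k), 1.0 / (k * k))   # local mean filter
        diff = -box.clone()
        diff[:, :, k // 2, k // 2] += 1.0                       # center minus local mean
        self.register_buffer("box", box)
        self.register_buffer("diff", diff)

    def forward(self, x):
        low = F.conv2d(x, self.box, padding=self.k // 2, groups=self.channels)
        high = F.conv2d(x, self.diff, padding=self.k // 2, groups=self.channels)
        return low, high  # low + high reconstructs x, so the split is lossless

x = torch.randn(1, 8, 32, 32)
low, high = AddDiffDecouple(8)(x)
print(torch.allclose(low + high, x, atol=1e-5))  # True
```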
Furthermore, to keep the computational costs of the network unchanged, we aggregate the two units into a single TConv kernel using structural re-parameterization. Evaluations on nine methods and five benchmark datasets demonstrate that our proposed method can comprehensively improve the performance of existing methods without introducing extra computational costs compared with the original networks. The codes will be publicly available. + + + + MSeg3D: Multi-Modal 3D Semantic Segmentation for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_MSeg3D_Multi-Modal_3D_Semantic_Segmentation_for_Autonomous_Driving_CVPR_2023_paper.pdf + LiDAR and camera are two modalities available for 3D semantic segmentation in autonomous driving. The popular LiDAR-only methods severely suffer from inferior segmentation on small and distant objects due to insufficient laser points, while the robust multi-modal solution is under-explored, where we investigate three crucial inherent difficulties: modality heterogeneity, limited sensor field of view intersection, and multi-modal data augmentation. We propose a multi-modal 3D semantic segmentation model (MSeg3D) with joint intra-modal feature extraction and inter-modal feature fusion to mitigate the modality heterogeneity. The multi-modal fusion in MSeg3D consists of geometry-based feature fusion GF-Phase, cross-modal feature completion, and semantic-based feature fusion SF-Phase on all visible points. The multi-modal data augmentation is reinvigorated by applying asymmetric transformations on the LiDAR point cloud and multi-camera images individually, which benefits the model training with diversified augmentation transformations. MSeg3D achieves state-of-the-art results on the nuScenes, Waymo, and SemanticKITTI datasets. Under malfunctioning multi-camera input and multi-frame point cloud input, MSeg3D still shows robustness and improves the LiDAR-only baseline. Our code is publicly available at https://github.com/jialeli1/lidarseg3d. + + + + Dynamically Instance-Guided Adaptation: A Backward-Free Approach for Test-Time Domain Adaptive Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Dynamically_Instance-Guided_Adaptation_A_Backward-Free_Approach_for_Test-Time_Domain_Adaptive_CVPR_2023_paper.pdf + In this paper, we study the application of test-time domain adaptation in semantic segmentation (TTDA-Seg), where both efficiency and effectiveness are crucial. Existing methods either have low efficiency (e.g., backward optimization) or ignore semantic adaptation (e.g., distribution alignment). Besides, they suffer from accumulated errors caused by unstable optimization and abnormal distributions. To solve these problems, we propose a novel backward-free approach for TTDA-Seg, called Dynamically Instance-Guided Adaptation (DIGA). Our principle is to utilize each instance to dynamically guide its own adaptation in a non-parametric way, which avoids the error accumulation issue and expensive optimization cost. Specifically, DIGA is composed of a distribution adaptation module (DAM) and a semantic adaptation module (SAM), enabling us to jointly adapt the model in two indispensable aspects. DAM mixes the instance and source BN statistics to encourage the model to capture robust representations. SAM combines the historical prototypes with instance-level prototypes to adjust semantic predictions, which can be associated with the parametric classifier to mutually benefit the final results.
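The BN-statistic mixing mentioned for DAM above can be illustrated with a rough, editor-added sketch; the linear interpolation rule, the blending weight lam=0.7, and the use of per-batch instance statistics are assumptions, not the paper's exact scheme.

```python
import torch

def mixed_bn_forward(x, bn, lam=0.7):
    # Editor's sketch of test-time BN-statistic mixing: blend the stored
    # source-domain statistics with the current instance statistics before
    # normalizing. The interpolation rule and lam value are assumptions.
    inst_mean = x.mean(dim=(0, 2, 3))
    inst_var = x.var(dim=(0, 2, 3), unbiased=False)
    mean = lam * bn.running_mean + (1.0 - lam) * inst_mean
    var = lam * bn.running_var + (1.0 - lam) * inst_var
    x_hat = (x - mean[None, :, None, None]) / torch.sqrt(var[None, :, None, None] + bn.eps)
    return bn.weight[None, :, None, None] * x_hat + bn.bias[None, :, None, None]

bn = torch.nn.BatchNorm2d(8).eval()          # holds source-domain running statistics
out = mixed_bn_forward(torch.randn(2, 8, 16, 16), bn)
print(out.shape)                             # torch.Size([2, 8, 16, 16])
```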
Extensive experiments evaluated on five target domains demonstrate the effectiveness and efficiency of the proposed method. Our DIGA establishes new state-of-the-art performance in TTDA-Seg. + + + + LANIT: Language-Driven Image-to-Image Translation for Unlabeled Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_LANIT_Language-Driven_Image-to-Image_Translation_for_Unlabeled_Data_CVPR_2023_paper.pdf + Existing techniques for image-to-image translation commonly have suffered from two critical problems: heavy reliance on per-sample domain annotation and/or inability to handle multiple attributes per image. Recent truly-unsupervised methods adopt clustering approaches to easily provide per-sample one-hot domain labels. However, they cannot account for the real-world setting: one sample may have multiple attributes. In addition, the semantics of the clusters are not easily coupled to human understanding. To overcome these, we present LANguage-driven Image-to-image Translation model, dubbed LANIT. We leverage easy-to-obtain candidate attributes given in texts for a dataset: the similarity between images and attributes indicates per-sample domain labels. This formulation naturally enables multi-hot labels so that users can specify the target domain with a set of attributes in language. To account for the case that the initial prompts are inaccurate, we also present prompt learning. We further present domain regularization loss that enforces translated images to be mapped to the corresponding domain. Experiments on several standard benchmarks demonstrate that LANIT achieves comparable or superior performance to existing models. The code is available at github.com/KU-CVLAB/LANIT. + + + + MoLo: Motion-Augmented Long-Short Contrastive Learning for Few-Shot Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_MoLo_Motion-Augmented_Long-Short_Contrastive_Learning_for_Few-Shot_Action_Recognition_CVPR_2023_paper.pdf + Current state-of-the-art approaches for few-shot action recognition achieve promising performance by conducting frame-level matching on learned visual features. However, they generally suffer from two limitations: i) the matching procedure between local frames tends to be inaccurate due to the lack of guidance to force long-range temporal perception; ii) explicit motion learning is usually ignored, leading to partial information loss. To address these issues, we develop a Motion-augmented Long-short Contrastive Learning (MoLo) method that contains two crucial components, including a long-short contrastive objective and a motion autodecoder. Specifically, the long-short contrastive objective is to endow local frame features with long-form temporal awareness by maximizing their agreement with the global token of videos belonging to the same class. The motion autodecoder is a lightweight architecture to reconstruct pixel motions from the differential features, which explicitly embeds the network with motion dynamics. By this means, MoLo can simultaneously learn long-range temporal context and motion cues for comprehensive few-shot matching. To demonstrate the effectiveness, we evaluate MoLo on five standard benchmarks, and the results show that MoLo favorably outperforms recent advanced methods. The source code is available at https://github.com/alibaba-mmai-research/MoLo. 
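The long-short contrastive objective described in the MoLo abstract above can be pictured, very loosely, as a frame-to-video InfoNCE loss; the sketch below is an editor illustration with generic shapes and a temperature of 0.07, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def long_short_contrast(frame_feats, frame_labels, global_tokens, video_labels, tau=0.07):
    # Editor's sketch: pull each local frame feature towards the global tokens
    # of videos from the same class and push it away from the rest.
    # Generic InfoNCE form; shapes, labels, and tau are illustrative assumptions.
    f = F.normalize(frame_feats, dim=1)                  # (N, D) frame features
    g = F.normalize(global_tokens, dim=1)                # (M, D) global video tokens
    logits = f @ g.t() / tau                             # (N, M) similarities
    pos = (frame_labels[:, None] == video_labels[None, :]).float()  # same-class mask
    log_prob = logits.log_softmax(dim=1)
    loss = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1.0)
    return loss.mean()

loss = long_short_contrast(
    torch.randn(16, 128), torch.randint(0, 5, (16,)),
    torch.randn(10, 128), torch.randint(0, 5, (10,)),
)
print(loss.item())
```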
+ + + + Text-Guided Unsupervised Latent Transformation for Multi-Attribute Image Manipulation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Text-Guided_Unsupervised_Latent_Transformation_for_Multi-Attribute_Image_Manipulation_CVPR_2023_paper.pdf + Great progress has been made in StyleGAN-based image editing. To associate with preset attributes, most existing approaches focus on supervised learning for semantically meaningful latent space traversal directions, and each manipulation step is typically determined for an individual attribute. To address this limitation, we propose a Text-guided Unsupervised StyleGAN Latent Transformation (TUSLT) model, which adaptively infers a single transformation step in the latent space of StyleGAN to simultaneously manipulate multiple attributes on a given input image. Specifically, we adopt a two-stage architecture for a latent mapping network to break down the transformation process into two manageable steps. Our network first learns a diverse set of semantic directions tailored to an input image, and later nonlinearly fuses the ones associated with the target attributes to infer a residual vector. The resulting tightly interlinked two-stage architecture delivers the flexibility to handle diverse attribute combinations. By leveraging the cross-modal text-image representation of CLIP, we can perform pseudo annotations based on the semantic similarity between preset attribute text descriptions and training images, and further jointly train an auxiliary attribute classifier with the latent mapping network to provide semantic guidance. We perform extensive experiments to demonstrate that the adopted strategies contribute to the superior performance of TUSLT. + + + + Contrastive Semi-Supervised Learning for Underwater Image Restoration via Reliable Bank + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Contrastive_Semi-Supervised_Learning_for_Underwater_Image_Restoration_via_Reliable_Bank_CVPR_2023_paper.pdf + Despite the remarkable achievement of recent underwater image restoration techniques, the lack of labeled data has become a major hurdle for further progress. In this work, we propose a mean-teacher based Semi-supervised Underwater Image Restoration (Semi-UIR) framework to incorporate the unlabeled data into network training. However, the naive mean-teacher method suffers from two main problems: (1) The consistency loss used in training might become ineffective when the teacher's prediction is wrong. (2) Using L1 distance may cause the network to overfit wrong labels, resulting in confirmation bias. To address the above problems, we first introduce a reliable bank to store the "best-ever" outputs as pseudo ground truth. To assess the quality of outputs, we conduct an empirical analysis based on the monotonicity property to select the most trustworthy NR-IQA method. Besides, in view of the confirmation bias problem, we incorporate contrastive regularization to prevent the overfitting on wrong labels. Experimental results on both full-reference and non-reference underwater benchmarks demonstrate that our algorithm has obvious improvement over SOTA methods quantitatively and qualitatively. Code has been released at https://github.com/Huang-ShiRui/Semi-UIR. 
+ + + + Multiclass Confidence and Localization Calibration for Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Pathiraja_Multiclass_Confidence_and_Localization_Calibration_for_Object_Detection_CVPR_2023_paper.pdf + Albeit achieving high predictive accuracy across many challenging computer vision problems, recent studies suggest that deep neural networks (DNNs) tend to make overconfident predictions, rendering them poorly calibrated. Most of the existing attempts for improving DNN calibration are limited to classification tasks and restricted to calibrating in-domain predictions. Surprisingly, very little to no attempts have been made in studying the calibration of object detection methods, which occupy a pivotal space in vision-based security-sensitive, and safety-critical applications. In this paper, we propose a new train-time technique for calibrating modern object detection methods. It is capable of jointly calibrating multiclass confidence and box localization by leveraging their predictive uncertainties. We perform extensive experiments on several in-domain and out-of-domain detection benchmarks. Results demonstrate that our proposed train-time calibration method consistently outperforms several baselines in reducing calibration error for both in-domain and out-of-domain predictions. Our code and models are available at https://github.com/bimsarapathiraja/MCCL + + + + Query-Dependent Video Representation for Moment Retrieval and Highlight Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Moon_Query-Dependent_Video_Representation_for_Moment_Retrieval_and_Highlight_Detection_CVPR_2023_paper.pdf + Recently, video moment retrieval and highlight detection (MR/HD) are being spotlighted as the demand for video understanding is drastically increased. The key objective of MR/HD is to localize the moment and estimate clip-wise accordance level, i.e., saliency score, to the given text query. Although the recent transformer-based models brought some advances, we found that these methods do not fully exploit the information of a given query. For example, the relevance between text query and video contents is sometimes neglected when predicting the moment and its saliency. To tackle this issue, we introduce Query-Dependent DETR (QD-DETR), a detection transformer tailored for MR/HD. As we observe the insignificant role of a given query in transformer architectures, our encoding module starts with cross-attention layers to explicitly inject the context of text query into video representation. Then, to enhance the model's capability of exploiting the query information, we manipulate the video-query pairs to produce irrelevant pairs. Such negative (irrelevant) video-query pairs are trained to yield low saliency scores, which in turn, encourages the model to estimate precise accordance between query-video pairs. Lastly, we present an input-adaptive saliency predictor which adaptively defines the criterion of saliency scores for the given video-query pairs. Our extensive studies verify the importance of building the query-dependent representation for MR/HD. Specifically, QD-DETR outperforms state-of-the-art methods on QVHighlights, TVSum, and Charades-STA datasets. Codes are available at github.com/wjun0830/QD-DETR. 
+ + + + Instance-Specific and Model-Adaptive Supervision for Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Instance-Specific_and_Model-Adaptive_Supervision_for_Semi-Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Recently, semi-supervised semantic segmentation has achieved promising performance with a small fraction of labeled data. However, most existing studies treat all unlabeled data equally and barely consider the differences and training difficulties among unlabeled instances. Differentiating unlabeled instances can promote instance-specific supervision to adapt to the model's evolution dynamically. In this paper, we emphasize the cruciality of instance differences and propose an instance-specific and model-adaptive supervision for semi-supervised semantic segmentation, named iMAS. Relying on the model's performance, iMAS employs a class-weighted symmetric intersection-over-union to evaluate quantitative hardness of each unlabeled instance and supervises the training on unlabeled data in a model-adaptive manner. Specifically, iMAS learns from unlabeled instances progressively by weighing their corresponding consistency losses based on the evaluated hardness. Besides, iMAS dynamically adjusts the augmentation for each instance such that the distortion degree of augmented instances is adapted to the model's generalization capability across the training course. Not integrating additional losses and training procedures, iMAS can obtain remarkable performance gains against current state-of-the-art approaches on segmentation benchmarks under different semi-supervised partition protocols. + + + + X-Pruner: eXplainable Pruning for Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_X-Pruner_eXplainable_Pruning_for_Vision_Transformers_CVPR_2023_paper.pdf + Recently vision transformer models have become prominent models for a range of tasks. These models, however, usually suffer from intensive computational costs and heavy memory requirements, making them impractical for deployment on edge platforms. Recent studies have proposed to prune transformers in an unexplainable manner, which overlook the relationship between internal units of the model and the target class, thereby leading to inferior performance. To alleviate this problem, we propose a novel explainable pruning framework dubbed X-Pruner, which is designed by considering the explainability of the pruning criterion. Specifically, to measure each prunable unit's contribution to predicting each target class, a novel explainability-aware mask is proposed and learned in an end-to-end manner. Then, to preserve the most informative units and learn the layer-wise pruning rate, we adaptively search the layer-wise threshold that differentiates between unpruned and pruned units based on their explainability-aware mask values. To verify and evaluate our method, we apply the X-Pruner on representative transformer models including the DeiT and Swin Transformer. Comprehensive simulation results demonstrate that the proposed X-Pruner outperforms the state-of-the-art black-box methods with significantly reduced computational costs and slight performance degradation. 
+ + + + Hard Sample Matters a Lot in Zero-Shot Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Hard_Sample_Matters_a_Lot_in_Zero-Shot_Quantization_CVPR_2023_paper.pdf + Zero-shot quantization (ZSQ) is promising for compressing and accelerating deep neural networks when the data for training full-precision models are inaccessible. In ZSQ, network quantization is performed using synthetic samples, thus, the performance of quantized models depends heavily on the quality of synthetic samples. Nonetheless, we find that the synthetic samples constructed in existing ZSQ methods can be easily fitted by models. Accordingly, quantized models obtained by these methods suffer from significant performance degradation on hard samples. To address this issue, we propose HArd sample Synthesizing and Training (HAST). Specifically, HAST pays more attention to hard samples when synthesizing samples and makes synthetic samples hard to fit when training quantized models. HAST aligns features extracted by full-precision and quantized models to ensure the similarity between features extracted by these two models. Extensive experiments show that HAST significantly outperforms existing ZSQ methods, achieving performance comparable to models that are quantized with real data. + + + + Meta Compositional Referring Expression Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Meta_Compositional_Referring_Expression_Segmentation_CVPR_2023_paper.pdf + Referring expression segmentation aims to segment an object described by a language expression from an image. Despite the recent progress on this task, existing models tackling this task may not be able to fully capture semantics and visual representations of individual concepts, which limits their generalization capability, especially when handling novel compositions of learned concepts. In this work, through the lens of meta learning, we propose a Meta Compositional Referring Expression Segmentation (MCRES) framework to enhance model compositional generalization performance. Specifically, to handle various levels of novel compositions, our framework first uses training data to construct a virtual training set and multiple virtual testing sets, where data samples in each virtual testing set contain a level of novel compositions w.r.t. the support set. Then, following a novel meta optimization scheme to optimize the model to obtain good testing performance on the virtual testing sets after training on the virtual training set, our framework can effectively drive the model to better capture semantics and visual representations of individual concepts, and thus obtain robust generalization performance even when handling novel compositions. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our framework. + + + + A Light Weight Model for Active Speaker Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Liao_A_Light_Weight_Model_for_Active_Speaker_Detection_CVPR_2023_paper.pdf + Active speaker detection is a challenging task in audio-visual scenarios, with the aim to detect who is speaking in one or more speaker scenarios. This task has received considerable attention because it is crucial in many applications. Existing studies have attempted to improve the performance by inputting multiple candidate information and designing complex models. 
Although these methods have achieved excellent performance, their high memory and computational power consumption render their application to resource-limited scenarios difficult. Therefore, in this study, a lightweight active speaker detection architecture is constructed by reducing the number of input candidates, splitting 2D and 3D convolutions for audio-visual feature extraction, and applying gated recurrent units with low computational complexity for cross-modal modeling. Experimental results on the AVA-ActiveSpeaker dataset reveal that the proposed framework achieves competitive mAP performance (94.1% vs. 94.2%), while the resource costs are significantly lower than the state-of-the-art method, particularly in model parameters (1.0M vs. 22.5M, approximately 23x) and FLOPs (0.6G vs. 2.6G, approximately 4x). Additionally, the proposed framework also performs well on the Columbia dataset, thus demonstrating good robustness. The code and model weights are available at https://github.com/Junhua-Liao/Light-ASD. + + + + GCFAgg: Global and Cross-View Feature Aggregation for Multi-View Clustering + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_GCFAgg_Global_and_Cross-View_Feature_Aggregation_for_Multi-View_Clustering_CVPR_2023_paper.pdf + Multi-view clustering can partition data samples into their categories by learning a consensus representation in an unsupervised way and has received increasing attention in recent years. However, most existing deep clustering methods learn a consensus representation or view-specific representations from multiple views via a view-wise aggregation strategy, ignoring the structural relationships among all samples. In this paper, we propose a novel multi-view clustering network to address these problems, called Global and Cross-view Feature Aggregation for Multi-View Clustering (GCFAggMVC). Specifically, the consensus data representation from multiple views is obtained via cross-sample and cross-view feature aggregation, which fully explores the complementarity of similar samples. Moreover, we align the consensus representation and the view-specific representation with a structure-guided contrastive learning module, which makes the view-specific representations of samples with strong structural relationships similar. The proposed module is a flexible multi-view data representation module, which can also be applied to the incomplete multi-view data clustering task by plugging it into other frameworks. Extensive experiments show that the proposed method achieves excellent performance in both complete and incomplete multi-view data clustering tasks. + + + + DeGPR: Deep Guided Posterior Regularization for Multi-Class Cell Detection and Counting + http://openaccess.thecvf.com//content/CVPR2023/papers/Tyagi_DeGPR_Deep_Guided_Posterior_Regularization_for_Multi-Class_Cell_Detection_and_CVPR_2023_paper.pdf + Multi-class cell detection and counting is an essential task for many pathological diagnoses. Manual counting is tedious and often leads to inter-observer variations among pathologists. While there exist multiple general-purpose, deep learning-based object detection and counting methods, they may not readily transfer to detecting and counting cells in medical images, due to the limited data, presence of tiny overlapping objects, multiple cell types, severe class imbalance, minute differences in size/shape of cells, etc.
In response, we propose guided posterior regularization DeGPR, which assists an object detector by guiding it to exploit discriminative features among cells. The features may be pathologist-provided or inferred directly from visual data. We validate our model on two publicly available datasets (CoNSeP and MoNuSAC), and on MuCeD, a novel dataset that we contribute. MuCeD consists of 55 biopsy images of the human duodenum for predicting celiac disease. We perform extensive experimentation with three object detection baselines on three datasets to show that DeGPR is model-agnostic, and consistently improves baselines obtaining up to 9% (absolute) mAP gains. + + + + SliceMatch: Geometry-Guided Aggregation for Cross-View Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Lentsch_SliceMatch_Geometry-Guided_Aggregation_for_Cross-View_Pose_Estimation_CVPR_2023_paper.pdf + This work addresses cross-view camera pose estimation, i.e., determining the 3-Degrees-of-Freedom camera pose of a given ground-level image w.r.t. an aerial image of the local area. We propose SliceMatch, which consists of ground and aerial feature extractors, feature aggregators, and a pose predictor. The feature extractors extract dense features from the ground and aerial images. Given a set of candidate camera poses, the feature aggregators construct a single ground descriptor and a set of pose-dependent aerial descriptors. Notably, our novel aerial feature aggregator has a cross-view attention module for ground-view guided aerial feature selection and utilizes the geometric projection of the ground camera's viewing frustum on the aerial image to pool features. The efficient construction of aerial descriptors is achieved using precomputed masks. SliceMatch is trained using contrastive learning and pose estimation is formulated as a similarity comparison between the ground descriptor and the aerial descriptors. Compared to the state-of-the-art, SliceMatch achieves a 19% lower median localization error on the VIGOR benchmark using the same VGG16 backbone at 150 frames per second, and a 50% lower error when using a ResNet50 backbone. + + + + Single View Scene Scale Estimation Using Scale Field + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Single_View_Scene_Scale_Estimation_Using_Scale_Field_CVPR_2023_paper.pdf + In this paper, we propose a single image scale estimation method based on a novel scale field representation. A scale field defines the local pixel-to-metric conversion ratio along the gravity direction on all the ground pixels. This representation resolves the ambiguity in camera parameters, allowing us to use a simple yet effective way to collect scale annotations on arbitrary images from human annotators. By training our model on calibrated panoramic image data and the in-the-wild human annotated data, our single image scene scale estimation network generates robust scale field on a variety of image, which can be utilized in various 3D understanding and scale-aware image editing applications. + + + + Learning Semantic-Aware Disentangled Representation for Flexible 3D Human Body Editing + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Learning_Semantic-Aware_Disentangled_Representation_for_Flexible_3D_Human_Body_Editing_CVPR_2023_paper.pdf + 3D human body representation learning has received increasing attention in recent years. 
However, existing works cannot flexibly, controllably, and accurately represent human bodies, limited by coarse semantics and unsatisfactory representation capability, particularly in the absence of supervised data. In this paper, we propose a human body representation with fine-grained semantics and high reconstruction accuracy in an unsupervised setting. Specifically, we establish a correspondence between latent vectors and geometric measures of body parts by designing a part-aware, skeleton-separated decoupling strategy, which facilitates controllable editing of human bodies by modifying the corresponding latent codes. With the help of a bone-guided auto-encoder and an orientation-adaptive weighting strategy, our representation can be trained in an unsupervised manner. With the geometrically meaningful latent space, it can be applied to a wide range of applications, from human body editing to latent code interpolation and shape style transfer. Experimental results on public datasets demonstrate the accurate reconstruction and flexible editing abilities of the proposed method. The code will be available at http://cic.tju.edu.cn/faculty/likun/projects/SemanticHuman. + + + + Generating Features With Increased Crop-Related Diversity for Few-Shot Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Generating_Features_With_Increased_Crop-Related_Diversity_for_Few-Shot_Object_Detection_CVPR_2023_paper.pdf + Two-stage object detectors generate object proposals and classify them to detect objects in images. These proposals often do not perfectly contain the objects but overlap with them in many possible ways, exhibiting great variability in the difficulty levels of the proposals. Training a robust classifier against this crop-related variability requires abundant training data, which is not available in few-shot settings. To mitigate this issue, we propose a novel variational autoencoder (VAE) based data generation model, which is capable of generating data with increased crop-related diversity. The main idea is to transform the latent space such that latent codes with different norms represent different crop-related variations. This allows us to generate features with increased crop-related diversity in difficulty levels by simply varying the latent norm. In particular, each latent code is rescaled such that its norm linearly correlates with the IoU score of the input crop w.r.t. the ground-truth box. Here the IoU score is a proxy that represents the difficulty level of the crop. We train this VAE model on base classes conditioned on the semantic code of each class and then use the trained model to generate features for novel classes. Our experimental results show that our generated features consistently improve state-of-the-art few-shot object detection methods on the PASCAL VOC and MS COCO datasets. + + + + Towards Compositional Adversarial Robustness: Generalizing Adversarial Training to Composite Semantic Perturbations + http://openaccess.thecvf.com//content/CVPR2023/papers/Hsiung_Towards_Compositional_Adversarial_Robustness_Generalizing_Adversarial_Training_to_Composite_Semantic_CVPR_2023_paper.pdf + Model robustness against adversarial examples of a single perturbation type, such as the Lp-norm, has been widely studied, yet its generalization to more realistic scenarios involving multiple semantic perturbations and their composition remains largely unexplored. In this paper, we first propose a novel method for generating composite adversarial examples.
Our method can find the optimal attack composition by utilizing component-wise projected gradient descent and automatic attack-order scheduling. We then propose generalized adversarial training (GAT) to extend model robustness from Lp-ball to composite semantic perturbations, such as the combination of Hue, Saturation, Brightness, Contrast, and Rotation. Results obtained using ImageNet and CIFAR-10 datasets indicate that GAT can be robust not only to all the tested types of a single attack, but also to any combination of such attacks. GAT also outperforms baseline L-infinity-norm bounded adversarial training approaches by a significant margin. + + + + CVT-SLR: Contrastive Visual-Textual Transformation for Sign Language Recognition With Variational Alignment + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_CVT-SLR_Contrastive_Visual-Textual_Transformation_for_Sign_Language_Recognition_With_Variational_CVPR_2023_paper.pdf + Sign language recognition (SLR) is a weakly supervised task that annotates sign videos as textual glosses. Recent studies show that insufficient training caused by the lack of large-scale available sign datasets becomes the main bottleneck for SLR. Most SLR works thereby adopt pretrained visual modules and develop two mainstream solutions. The multi-stream architectures extend multi-cue visual features, yielding the current SOTA performances but requiring complex designs and might introduce potential noise. Alternatively, the advanced single-cue SLR frameworks using explicit cross-modal alignment between visual and textual modalities are simple and effective, potentially competitive with the multi-cue framework. In this work, we propose a novel contrastive visual-textual transformation for SLR, CVT-SLR, to fully explore the pretrained knowledge of both the visual and language modalities. Based on the single-cue cross-modal alignment framework, we propose a variational autoencoder (VAE) for pretrained contextual knowledge while introducing the complete pretrained language module. The VAE implicitly aligns visual and textual modalities while benefiting from pretrained contextual knowledge as the traditional contextual module. Meanwhile, a contrastive cross-modal alignment algorithm is designed to explicitly enhance the consistency constraints. Extensive experiments on public datasets (PHOENIX-2014 and PHOENIX-2014T) demonstrate that our proposed CVT-SLR consistently outperforms existing single-cue methods and even outperforms SOTA multi-cue methods. + + + + Paint by Example: Exemplar-Based Image Editing With Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Paint_by_Example_Exemplar-Based_Image_Editing_With_Diffusion_Models_CVPR_2023_paper.pdf + Language-guided image editing has achieved great success recently. In this paper, we investigate exemplar-guided image editing for more precise control. We achieve this goal by leveraging self-supervised training to disentangle and re-organize the source image and the exemplar. However, the naive approach will cause obvious fusing artifacts. We carefully analyze it and propose an information bottleneck and strong augmentations to avoid the trivial solution of directly copying and pasting the exemplar image. Meanwhile, to ensure the controllability of the editing process, we design an arbitrary shape mask for the exemplar image and leverage the classifier-free guidance to increase the similarity to the exemplar image. 
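The classifier-free guidance referenced in the Paint by Example abstract above is commonly implemented by extrapolating from the unconditional noise prediction towards the conditional one; the minimal sketch below uses a placeholder guidance scale and generic tensor shapes, not the paper's settings.

```python
import torch

def classifier_free_guidance(eps_uncond, eps_cond, scale=5.0):
    # Standard classifier-free guidance combination at one denoising step:
    # move the unconditional noise prediction towards the conditional one.
    # scale=5.0 is a placeholder value, not taken from the paper.
    return eps_uncond + scale * (eps_cond - eps_uncond)

eps_u, eps_c = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
print(classifier_free_guidance(eps_u, eps_c).shape)  # torch.Size([1, 4, 64, 64])
```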
The whole framework involves a single forward of the diffusion model without any iterative optimization. We demonstrate that our method achieves an impressive performance and enables controllable editing on in-the-wild images with high fidelity. + + + + Ego-Body Pose Estimation via Ego-Head Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Ego-Body_Pose_Estimation_via_Ego-Head_Pose_Estimation_CVPR_2023_paper.pdf + Estimating 3D human motion from an egocentric video sequence plays a critical role in human behavior understanding and has various applications in VR/AR. However, naively learning a mapping between egocentric videos and human motions is challenging, because the user's body is often unobserved by the front-facing camera placed on the head of the user. In addition, collecting large-scale, high-quality datasets with paired egocentric videos and 3D human motions requires accurate motion capture devices, which often limit the variety of scenes in the videos to lab-like environments. To eliminate the need for paired egocentric video and human motions, we propose a new method, Ego-Body Pose Estimation via Ego-Head Pose Estimation (EgoEgo), which decomposes the problem into two stages, connected by the head motion as an intermediate representation. EgoEgo first integrates SLAM and a learning approach to estimate accurate head motion. Subsequently, leveraging the estimated head pose as input, EgoEgo utilizes conditional diffusion to generate multiple plausible full-body motions. This disentanglement of head and body pose eliminates the need for training datasets with paired egocentric videos and 3D human motion, enabling us to leverage large-scale egocentric video datasets and motion capture datasets separately. Moreover, for systematic benchmarking, we develop a synthetic dataset, AMASS-Replica-Ego-Syn (ARES), with paired egocentric videos and human motion. On both ARES and real data, our EgoEgo model performs significantly better than the current state-of-the-art methods. + + + + Learning Rotation-Equivariant Features for Visual Correspondence + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Learning_Rotation-Equivariant_Features_for_Visual_Correspondence_CVPR_2023_paper.pdf + Extracting discriminative local features that are invariant to imaging variations is an integral part of establishing correspondences between images. In this work, we introduce a self-supervised learning framework to extract discriminative rotation-invariant descriptors using group-equivariant CNNs. Thanks to employing group-equivariant CNNs, our method effectively learns to obtain rotation-equivariant features and their orientations explicitly, without having to perform sophisticated data augmentations. The resultant features and their orientations are further processed by group aligning, a novel invariant mapping technique that shifts the group-equivariant features by their orientations along the group dimension. Our group aligning technique achieves rotation-invariance without any collapse of the group dimension and thus eschews loss of discriminability. The proposed method is trained end-to-end in a self-supervised manner, where we use an orientation alignment loss for the orientation estimation and a contrastive descriptor loss for robust local descriptors to geometric/photometric variations. 
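One way to picture the group-aligning step described in the rotation-equivariant descriptor abstract above is as a cyclic shift of each descriptor along the rotation-group dimension by its dominant orientation; the sketch below is an editor illustration in which the energy-based orientation estimate and the (N, C, G) feature layout are assumptions.

```python
import torch

def group_align(feat):
    # Editor's sketch of group aligning: estimate a dominant orientation per
    # descriptor over the group dimension and cyclically shift the features so
    # that orientation lands at index 0. A rotation of the input permutes the
    # group axis cyclically, so the shift cancels it without pooling over G.
    # feat: (N, C, G) group-equivariant features, G = number of rotations.
    energy = feat.pow(2).sum(dim=1)            # (N, G) response per rotation
    shift = energy.argmax(dim=1)               # (N,) dominant orientation index
    return torch.stack(
        [torch.roll(f, shifts=-int(s), dims=-1) for f, s in zip(feat, shift)]
    )

feat = torch.randn(4, 64, 8)
print(group_align(feat).shape)                 # torch.Size([4, 64, 8])
```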
Our method demonstrates state-of-the-art matching accuracy among existing rotation-invariant descriptors under varying rotation and also shows competitive results when transferred to the task of keypoint matching and camera pose estimation. + + + + DexArt: Benchmarking Generalizable Dexterous Manipulation With Articulated Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Bao_DexArt_Benchmarking_Generalizable_Dexterous_Manipulation_With_Articulated_Objects_CVPR_2023_paper.pdf + To enable general-purpose robots, we will require the robot to operate daily articulated objects as humans do. Current robot manipulation has heavily relied on using a parallel gripper, which restricts the robot to a limited set of objects. On the other hand, operating with a multi-finger robot hand will allow better approximation to human behavior and enable the robot to operate on diverse articulated objects. To this end, we propose a new benchmark called DexArt, which involves Dexterous manipulation with Articulated objects in a physical simulator. In our benchmark, we define multiple complex manipulation tasks, and the robot hand will need to manipulate diverse articulated objects within each task. Our main focus is to evaluate the generalizability of the learned policy on unseen articulated objects. This is very challenging given the high degrees of freedom of both hands and objects. We use Reinforcement Learning with 3D representation learning to achieve generalization. Through extensive studies, we provide new insights into how 3D representation learning affects decision making in RL with 3D point cloud inputs. More details can be found at https://www.chenbao.tech/dexart/. + + + + You Do Not Need Additional Priors or Regularizers in Retinex-Based Low-Light Image Enhancement + http://openaccess.thecvf.com//content/CVPR2023/papers/Fu_You_Do_Not_Need_Additional_Priors_or_Regularizers_in_Retinex-Based_CVPR_2023_paper.pdf + Images captured in low-light conditions often suffer from significant quality degradation. Recent works have built a large variety of deep Retinex-based networks to enhance low-light images. The Retinex-based methods require decomposing the image into reflectance and illumination components, which is a highly ill-posed problem and there is no available ground truth. Previous works addressed this problem by imposing some additional priors or regularizers. However, finding an effective prior or regularizer that can be applied in various scenes is challenging, and the performance of the model suffers from too many additional constraints. We propose a contrastive learning method and a self-knowledge distillation method that allow training our Retinex-based model for Retinex decomposition without elaborate hand-crafted regularization functions. Rather than estimating reflectance and illuminance images and representing the final images as their element-wise products as in previous works, our regularizer-free Retinex decomposition and synthesis network (RFR) extracts reflectance and illuminance features and synthesizes them end-to-end. In addition, we propose a loss function for contrastive learning and a progressive learning strategy for self-knowledge distillation. Extensive experimental results demonstrate that our proposed methods can achieve superior performance compared with state-of-the-art approaches. 
+ + + + SCADE: NeRFs from Space Carving With Ambiguity-Aware Depth Estimates + http://openaccess.thecvf.com//content/CVPR2023/papers/Uy_SCADE_NeRFs_from_Space_Carving_With_Ambiguity-Aware_Depth_Estimates_CVPR_2023_paper.pdf + Neural radiance fields (NeRFs) have enabled high fidelity 3D reconstruction from multiple 2D input views. However, a well-known drawback of NeRFs is the less-than-ideal performance under a small number of views, due to insufficient constraints enforced by volumetric rendering. To address this issue, we introduce SCADE, a novel technique that improves NeRF reconstruction quality on sparse, unconstrained input views for in-the-wild indoor scenes. To constrain NeRF reconstruction, we leverage geometric priors in the form of per-view depth estimates produced with state-of-the-art monocular depth estimation models, which can generalize across scenes. A key challenge is that monocular depth estimation is an ill-posed problem, with inherent ambiguities. To handle this issue, we propose a new method that learns to predict, for each view, a continuous, multimodal distribution of depth estimates using conditional Implicit Maximum Likelihood Estimation (cIMLE). In order to disambiguate exploiting multiple views, we introduce an original space carving loss that guides the NeRF representation to fuse multiple hypothesized depth maps from each view and distill from them a common geometry that is consistent with all views. Experiments show that our approach enables higher fidelity novel view synthesis from sparse views. Our project page can be found at https://scade-spacecarving-nerfs.github.io. + + + + 1% VS 100%: Parameter-Efficient Low Rank Adapter for Dense Predictions + http://openaccess.thecvf.com//content/CVPR2023/papers/Yin_1_VS_100_Parameter-Efficient_Low_Rank_Adapter_for_Dense_Predictions_CVPR_2023_paper.pdf + Fine-tuning large-scale pre-trained vision models to downstream tasks is a standard technique for achieving state-of-the-art performance on computer vision benchmarks. However, fine-tuning the whole model with millions of parameters is inefficient as it requires storing a same-sized new model copy for each task. In this work, we propose LoRand, a method for fine-tuning large-scale vision models with a better trade-off between task performance and the number of trainable parameters. LoRand generates tiny adapter structures with low-rank synthesis while keeping the original backbone parameters fixed, resulting in high parameter sharing. To demonstrate LoRand's effectiveness, we implement extensive experiments on object detection, semantic segmentation, and instance segmentation tasks. By only training a small percentage (1% to 3%) of the pre-trained backbone parameters, LoRand achieves comparable performance to standard fine-tuning on COCO and ADE20K and outperforms fine-tuning in low-resource PASCAL VOC dataset. + + + + ResFormer: Scaling ViTs With Multi-Resolution Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_ResFormer_Scaling_ViTs_With_Multi-Resolution_Training_CVPR_2023_paper.pdf + Vision Transformers (ViTs) have achieved overwhelming success, yet they suffer from vulnerable resolution scalability, i.e., the performance drops drastically when presented with input resolutions that are unseen during training. We introduce, ResFormer, a framework that is built upon the seminal idea of multi-resolution training for improved performance on a wide spectrum of, mostly unseen, testing resolutions. 
In particular, ResFormer operates on replicated images of different resolutions and enforces a scale consistency loss to engage interactive information across different scales. More importantly, to alternate among varying resolutions effectively, especially novel ones in testing, we propose a global-local positional embedding strategy that changes smoothly conditioned on input sizes. We conduct extensive experiments for image classification on ImageNet. The results provide strong quantitative evidence that ResFormer has promising scaling abilities towards a wide range of resolutions. For instance, ResFormer- B-MR achieves a Top-1 accuracy of 75.86% and 81.72% when evaluated on relatively low and high resolutions respectively (i.e., 96 and 640), which are 48% and 7.49% better than DeiT-B. We also demonstrate, moreover, ResFormer is flexible and can be easily extended to semantic segmentation, object detection and video action recognition. + + + + Hierarchical Video-Moment Retrieval and Step-Captioning + http://openaccess.thecvf.com//content/CVPR2023/papers/Zala_Hierarchical_Video-Moment_Retrieval_and_Step-Captioning_CVPR_2023_paper.pdf + There is growing interest in searching for information from large video corpora. Prior works have studied relevant tasks, such as text-based video retrieval, moment retrieval, video summarization, and video captioning in isolation, without an end-to-end setup that can jointly search from video corpora and generate summaries. Such an end-to-end setup would allow for many interesting applications, e.g., a text-based search that finds a relevant video from a video corpus, extracts the most relevant moment from that video, and segments the moment into important steps with captions. To address this, we present the HiREST (HIerarchical REtrieval and STep-captioning) dataset and propose a new benchmark that covers hierarchical information retrieval and visual/textual stepwise summarization from an instructional video corpus. HiREST consists of 3.4K text-video pairs from an instructional video dataset, where 1.1K videos have annotations of moment spans relevant to text query and breakdown of each moment into key instruction steps with caption and timestamps (totaling 8.6K step captions). Our hierarchical benchmark consists of video retrieval, moment retrieval, and two novel moment segmentation and step captioning tasks. In moment segmentation, models break down a video moment into instruction steps and identify start-end boundaries. In step captioning, models generate a textual summary for each step. We also present starting point task-specific and end-to-end joint baseline models for our new benchmark. While the baseline models show some promising results, there still exists large room for future improvement by the community. + + + + PD-Quant: Post-Training Quantization Based on Prediction Difference Metric + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_PD-Quant_Post-Training_Quantization_Based_on_Prediction_Difference_Metric_CVPR_2023_paper.pdf + Post-training quantization (PTQ) is a neural network compression technique that converts a full-precision model into a quantized model using lower-precision data types. Although it can help reduce the size and computational cost of deep neural networks, it can also introduce quantization noise and reduce prediction accuracy, especially in extremely low-bit settings. How to determine the appropriate quantization parameters (e.g., scaling factors and rounding of weights) is the main problem facing now. 
Existing methods attempt to determine these parameters by minimizing the distance between features before and after quantization, but such an approach only considers local information and may not yield optimal quantization parameters. We analyze this issue and propose PD-Quant, a method that addresses this limitation by considering global information. It determines the quantization parameters by using the difference between the network's predictions before and after quantization. In addition, PD-Quant can alleviate the overfitting problem in PTQ, caused by the small calibration set, by adjusting the distribution of activations. Experiments show that PD-Quant leads to better quantization parameters and improves the prediction accuracy of quantized models, especially in low-bit settings. For example, PD-Quant pushes the accuracy of ResNet-18 up to 53.14% and RegNetX-600MF up to 40.67% under 2-bit weight and 2-bit activation quantization. The code is released at https://github.com/hustvl/PD-Quant. + + + + AUNet: Learning Relations Between Action Units for Face Forgery Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Bai_AUNet_Learning_Relations_Between_Action_Units_for_Face_Forgery_Detection_CVPR_2023_paper.pdf + Face forgery detection has become increasingly crucial due to the serious security issues caused by face manipulation techniques. Recent studies in deepfake detection have yielded promising results when the training and testing face forgeries are from the same domain. However, the problem remains challenging when one tries to generalize the detector to forgeries created by unseen methods during training. Observing that face manipulation may alter the relations between different facial action units (AUs), we propose the Action Units Relation Learning framework to improve the generality of forgery detection. Specifically, it consists of the Action Units Relation Transformer (ART) and the Tampered AU Prediction (TAP). The ART constructs the relations between different AUs with an AU-agnostic branch and an AU-specific branch, which complement each other and work together to exploit forgery clues. In the Tampered AU Prediction, we tamper with AU-related regions at the image level and develop challenging pseudo samples at the feature level. The model is then trained to predict the tampered AU regions with the generated location-specific supervision. Experimental results demonstrate that our method can achieve state-of-the-art performance in both in-dataset and cross-dataset evaluations. + + + + PolyFormer: Referring Image Segmentation As Sequential Polygon Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_PolyFormer_Referring_Image_Segmentation_As_Sequential_Polygon_Generation_CVPR_2023_paper.pdf + In this work, instead of directly predicting the pixel-level segmentation masks, the problem of referring image segmentation is formulated as sequential polygon generation, and the predicted polygons can be later converted into segmentation masks. This is enabled by a new sequence-to-sequence framework, Polygon Transformer (PolyFormer), which takes a sequence of image patches and text query tokens as input, and outputs a sequence of polygon vertices autoregressively. For more accurate geometric localization, we propose a regression-based decoder, which predicts the precise floating-point coordinates directly, without any coordinate quantization error.
In the experiments, PolyFormer outperforms the prior art by a clear margin, e.g., 5.40% and 4.52% absolute improvements on the challenging RefCOCO+ and RefCOCOg datasets. It also shows strong generalization ability when evaluated on the referring video segmentation task without fine-tuning, e.g., achieving competitive 61.5% J&F on the Ref-DAVIS17 dataset. + + + + Interactive Segmentation As Gaussion Process Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Interactive_Segmentation_As_Gaussion_Process_Classification_CVPR_2023_paper.pdf + Click-based interactive segmentation (IS) aims to extract the target objects under user interaction. For this task, most of the current deep learning (DL)-based methods mainly follow the general pipelines of semantic segmentation. Albeit achieving promising performance, they do not fully and explicitly utilize and propagate the click information, inevitably leading to unsatisfactory segmentation results, even at clicked points. Against this issue, in this paper, we propose to formulate the IS task as a Gaussian process (GP)-based pixel-wise binary classification model on each image. To solve this model, we utilize amortized variational inference to approximate the intractable GP posterior in a data-driven manner and then decouple the approximated GP posterior into double space forms for efficient sampling with linear complexity. Then, we correspondingly construct a GP classification framework, named GPCIS, which is integrated with the deep kernel learning mechanism for more flexibility. The main specificities of the proposed GPCIS lie in: 1) Under the explicit guidance of the derived GP posterior, the information contained in clicks can be finely propagated to the entire image and then boost the segmentation; 2) The accuracy of predictions at clicks has good theoretical support. These merits of GPCIS as well as its good generality and high efficiency are substantiated by comprehensive experiments on several benchmarks, as compared with representative methods both quantitatively and qualitatively. Codes will be released at https://github.com/zmhhmz/GPCIS_CVPR2023. + + + + A Practical Stereo Depth System for Smart Glasses + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_A_Practical_Stereo_Depth_System_for_Smart_Glasses_CVPR_2023_paper.pdf + We present the design of a productionized end-to-end stereo depth sensing system that does pre-processing, online stereo rectification, and stereo depth estimation with a fallback to monocular depth estimation when rectification is unreliable. The output of our depth sensing system is then used in a novel view generation pipeline to create 3D computational photography effects using point-of-view images captured by smart glasses. All these steps are executed on-device on the stringent compute budget of a mobile phone, and because we expect the users can use a wide range of smartphones, our design needs to be general and cannot be dependent on a particular hardware or ML accelerator such as a smartphone GPU. Although each of these steps is well studied, a description of a practical system is still lacking. For such a system, all these steps need to work in tandem with one another and fallback gracefully on failures within the system or less than ideal input data. We show how we handle unforeseen changes to calibration, e.g., due to heat, robustly support depth estimation in the wild, and still abide by the memory and latency constraints required for a smooth user experience. 
We show that our trained models are fast and run in less than 1s on a six-year-old Samsung Galaxy S8 phone's CPU. Our models generalize well to unseen data and achieve good results on Middlebury and on in-the-wild images captured from the smart glasses. + + + + PointConvFormer: Revenge of the Point-Based Convolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_PointConvFormer_Revenge_of_the_Point-Based_Convolution_CVPR_2023_paper.pdf + We introduce PointConvFormer, a novel building block for point cloud based deep network architectures. Inspired by generalization theory, PointConvFormer combines ideas from point convolution, where filter weights are based only on relative position, and Transformers, which utilize feature-based attention. In PointConvFormer, attention computed from the feature differences between points in the neighborhood is used to modify the convolutional weights at each point. Hence, we preserve the invariances of point convolution, while attention helps to select relevant points in the neighborhood for convolution. We experiment on both semantic segmentation and scene flow estimation tasks on point clouds with multiple datasets, including ScanNet, SemanticKITTI, FlyingThings3D and KITTI. Our results show that PointConvFormer substantially outperforms classic convolutions, regular transformers, and voxelized sparse convolution approaches with much smaller and faster networks. Visualizations show that PointConvFormer performs similarly to convolution on flat areas, whereas the neighborhood selection effect is stronger on object boundaries, showing that it combines the best of both worlds. The code will be available with the final version. + + + + Variational Distribution Learning for Unsupervised Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kang_Variational_Distribution_Learning_for_Unsupervised_Text-to-Image_Generation_CVPR_2023_paper.pdf + We propose a text-to-image generation algorithm based on deep neural networks for the setting where text captions for images are unavailable during training. In this work, instead of simply generating pseudo-ground-truth sentences of training images using existing image captioning methods, we employ a pretrained CLIP model, which is capable of properly aligning embeddings of images and corresponding texts in a joint space and, consequently, works well on zero-shot recognition tasks. We optimize a text-to-image generation model by maximizing the data log-likelihood conditioned on pairs of image-text CLIP embeddings. To better align data in the two domains, we employ a principled approach based on variational inference, which efficiently estimates an approximate posterior of the hidden text embedding given an image and its CLIP feature. Experimental results validate that the proposed framework outperforms existing approaches by large margins under unsupervised and semi-supervised text-to-image generation settings. + + + + MetaMix: Towards Corruption-Robust Continual Learning With Temporally Self-Adaptive Data Transformation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_MetaMix_Towards_Corruption-Robust_Continual_Learning_With_Temporally_Self-Adaptive_Data_Transformation_CVPR_2023_paper.pdf + Continual Learning (CL) has achieved rapid progress in recent years. However, it is still largely unknown how to determine whether a CL model is trustworthy and how to foster its trustworthiness. This work focuses on evaluating and improving the robustness to corruptions of existing CL models. 
Our empirical evaluation results show that existing state-of-the-art (SOTA) CL models are particularly vulnerable to various data corruptions during testing. To make them trustworthy and robust to corruptions when deployed in safety-critical scenarios, we propose a meta-learning framework of self-adaptive data augmentation to tackle corruption robustness in CL. The proposed framework, MetaMix, learns to augment and mix data, automatically transforming the new task data or memory data. It directly optimizes the generalization performance against data corruptions during training. To evaluate the corruption robustness of our proposed approach, we construct several CL corruption datasets with different levels of severity. We perform comprehensive experiments on both task- and class-continual learning. Extensive experiments demonstrate the effectiveness of our proposed method compared to SOTA baselines. + + + + Ultra-High Resolution Segmentation With Ultra-Rich Context: A Novel Benchmark + http://openaccess.thecvf.com//content/CVPR2023/papers/Ji_Ultra-High_Resolution_Segmentation_With_Ultra-Rich_Context_A_Novel_Benchmark_CVPR_2023_paper.pdf + With the increasing interest and rapid development of methods for Ultra-High Resolution (UHR) segmentation, a large-scale benchmark covering a wide range of scenes with full fine-grained dense annotations is urgently needed to facilitate the field. To this end, the URUR dataset is introduced, whose name stands for Ultra-High Resolution dataset with Ultra-Rich Context. As the name suggests, URUR contains a large number of images of sufficiently high resolution (3,008 images of size 5,120x5,120), a wide range of complex scenes (from 63 cities), sufficiently rich context (1 million instances with 8 categories) and fine-grained annotations (about 80 billion manually annotated pixels), which is far superior to all existing UHR datasets, including DeepGlobe, Inria Aerial and UDD. Moreover, we propose WSDNet, a more efficient and effective framework for UHR segmentation, especially with ultra-rich context. Specifically, a multi-level Discrete Wavelet Transform (DWT) is naturally integrated to reduce the computation burden while preserving more spatial details, along with a Wavelet Smooth Loss (WSL) to reconstruct the original structured context and texture under a smoothness constraint. Experiments on several UHR datasets demonstrate its state-of-the-art performance. The dataset is available at https://github.com/jankyee/URUR. + + + + Accelerating Vision-Language Pretraining With Free Language Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Accelerating_Vision-Language_Pretraining_With_Free_Language_Modeling_CVPR_2023_paper.pdf + The state of the art in vision-language pretraining (VLP) achieves exemplary performance but suffers from high training costs resulting from slow convergence and long training times, especially on large-scale web datasets. An essential obstacle to training efficiency lies in the entangled prediction rate (percentage of tokens for reconstruction) and corruption rate (percentage of corrupted tokens) in masked language modeling (MLM); that is, a proper corruption rate is achieved at the cost of a large portion of output tokens being excluded from the prediction loss. To accelerate the convergence of VLP, we propose a new pretraining task, namely free language modeling (FLM), that enables a 100% prediction rate with arbitrary corruption rates. 
FLM successfully frees the prediction rate from its tie-up with the corruption rate while allowing the corruption spans to be customized for each token to be predicted. FLM-trained models are encouraged to learn better and faster given the same GPU time by exploiting bidirectional contexts more flexibly. Extensive experiments show that FLM achieves a 2.5x pretraining time reduction compared to MLM-based methods, while keeping competitive performance on both vision-language understanding and generation tasks. + + + + Efficient Mask Correction for Click-Based Interactive Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Efficient_Mask_Correction_for_Click-Based_Interactive_Image_Segmentation_CVPR_2023_paper.pdf + The goal of click-based interactive image segmentation is to extract target masks from the input of positive/negative clicks. Every time a new click is placed, existing methods run the whole segmentation network to obtain a corrected mask, which is inefficient since several clicks may be needed to reach satisfactory accuracy. To this end, we propose an efficient method to correct the mask with a lightweight mask correction network. The whole pipeline maintains a low computational cost from the second click onward, even with a large backbone. However, a simple correction network with limited capacity is not likely to achieve performance comparable to a classic segmentation network. Thus, we propose a click-guided self-attention module and a click-guided correlation module to effectively exploit the click information and boost performance. First, several templates are selected based on their semantic similarity to the click features. Then the self-attention module propagates the template information to other pixels, while the correlation module directly uses the templates to obtain target outlines. With the efficient architecture and two click-guided modules, our method shows favorable performance and efficiency compared to existing methods. The code will be released at https://github.com/feiaxyt/EMC-Click. + + + + Graphics Capsule: Learning Hierarchical 3D Face Representations From 2D Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Graphics_Capsule_Learning_Hierarchical_3D_Face_Representations_From_2D_Images_CVPR_2023_paper.pdf + The ability to construct hierarchies of objects is important to the visual process of the human brain. Previous studies have successfully adopted capsule networks to decompose digits and faces into parts in an unsupervised manner, in order to investigate whether neural networks have a similar perception mechanism. However, their descriptions are restricted to the 2D space, limiting their capacity to imitate the intrinsic 3D perception ability of humans. In this paper, we propose an Inverse Graphics Capsule Network (IGC-Net) to learn hierarchical 3D face representations from large-scale unlabeled images. The core of IGC-Net is a new type of capsule, named graphics capsule, which represents 3D primitives with interpretable parameters in computer graphics (CG), including depth, albedo, and 3D pose. Specifically, IGC-Net first decomposes the objects into a set of semantically consistent part-level descriptions and then assembles them into object-level descriptions to build the hierarchy. The learned graphics capsules reveal how neural networks, oriented at visual perception, understand faces as a hierarchy of 3D models. 
Besides, the discovered parts can be deployed to the unsupervised face segmentation task to evaluate the semantic consistency of our method. Moreover, the part-level descriptions with explicit physical meanings provide insight into the face analysis that originally runs in a black box, such as the importance of shape and texture for face recognition. Experiments on CelebA, BP4D, and Multi-PIE demonstrate the characteristics of our IGC-Net. + + + + Masked Autoencoders Enable Efficient Knowledge Distillers + http://openaccess.thecvf.com//content/CVPR2023/papers/Bai_Masked_Autoencoders_Enable_Efficient_Knowledge_Distillers_CVPR_2023_paper.pdf + This paper studies the potential of distilling knowledge from pre-trained models, especially Masked Autoencoders. Our approach is simple: in addition to optimizing the pixel reconstruction loss on masked inputs, we minimize the distance between the intermediate feature map of the teacher model and that of the student model. This design leads to a computationally efficient knowledge distillation framework, given 1) only a small visible subset of patches is used, and 2) the (cumbersome) teacher model only needs to be partially executed, i.e., forward propagate inputs through the first few layers, for obtaining intermediate feature maps. Compared to directly distilling fine-tuned models, distilling pre-trained models substantially improves downstream performance. For example, by distilling the knowledge from an MAE pre-trained ViT-L into a ViT-B, our method achieves 84.0% ImageNet top-1 accuracy, outperforming the baseline of directly distilling a fine-tuned ViT-L by 1.2%. More intriguingly, our method can robustly distill knowledge from teacher models even with extremely high masking ratios: e.g., with 95% masking ratio where merely TEN patches are visible during distillation, our ViT-B competitively attains a top-1 ImageNet accuracy of 83.6%; surprisingly, it can still secure 82.4% top-1 ImageNet accuracy by aggressively training with just FOUR visible patches (98% masking ratio). The code will be made publicly available. + + + + Persistent Nature: A Generative Model of Unbounded 3D Worlds + http://openaccess.thecvf.com//content/CVPR2023/papers/Chai_Persistent_Nature_A_Generative_Model_of_Unbounded_3D_Worlds_CVPR_2023_paper.pdf + Despite increasingly realistic image quality, recent 3D image generative models often operate on 3D volumes of fixed extent with limited camera motions. We investigate the task of unconditionally synthesizing unbounded nature scenes, enabling arbitrarily large camera motion while maintaining a persistent 3D world model. Our scene representation consists of an extendable, planar scene layout grid, which can be rendered from arbitrary camera poses via a 3D decoder and volume rendering, and a panoramic skydome. Based on this representation, we learn a generative world model solely from single-view internet photos. Our method enables simulating long flights through 3D landscapes, while maintaining global scene consistency---for instance, returning to the starting point yields the same view of the scene. Our approach enables scene extrapolation beyond the fixed bounds of current 3D generative models, while also supporting a persistent, camera-independent world representation that stands in contrast to auto-regressive 3D prediction models. Our project page: https://chail.github.io/persistent-nature/. 
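As a concrete point of reference for the distillation recipe described in the "Masked Autoencoders Enable Efficient Knowledge Distillers" abstract above, the sketch below combines an MAE-style masked pixel-reconstruction term with a distance between intermediate feature maps of a partially executed teacher and the student. It is only an illustrative PyTorch sketch under assumed tensor shapes; the projection head, the loss weighting, and the smooth-L1 choice are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mae_distill_loss(pred_pixels, target_pixels, mask,
                     student_feat, teacher_feat, proj, alpha=1.0):
    """Combined objective: masked pixel reconstruction + feature distillation.

    pred_pixels   : (B, N, P) patches reconstructed by the student decoder
    target_pixels : (B, N, P) ground-truth patch pixels
    mask          : (B, N) 1 for masked patches, 0 for visible ones
    student_feat  : (B, M, Ds) intermediate student features (visible patches)
    teacher_feat  : (B, M, Dt) features from the first few (frozen) teacher layers
    proj          : small head mapping Ds -> Dt so the two can be compared
    """
    # Pixel reconstruction loss averaged over masked patches only, as in MAE.
    rec = ((pred_pixels - target_pixels) ** 2).mean(dim=-1)
    rec_loss = (rec * mask).sum() / mask.sum().clamp(min=1)

    # Feature distillation: pull student features toward the frozen teacher's.
    feat_loss = F.smooth_l1_loss(proj(student_feat), teacher_feat.detach())

    return rec_loss + alpha * feat_loss

# Dummy usage with hypothetical sizes (B=2, N=196 patches, 49 visible, P=768).
proj = nn.Linear(384, 1024)
loss = mae_distill_loss(
    torch.randn(2, 196, 768), torch.randn(2, 196, 768),
    (torch.rand(2, 196) > 0.25).float(),
    torch.randn(2, 49, 384), torch.randn(2, 49, 1024), proj)
```

Because only visible patches are encoded and the teacher is truncated after its first few blocks, the extra distillation term keeps the overhead small, which is the efficiency argument the abstract makes.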
+ + + + Hierarchical Neural Memory Network for Low Latency Event Processing + http://openaccess.thecvf.com//content/CVPR2023/papers/Hamaguchi_Hierarchical_Neural_Memory_Network_for_Low_Latency_Event_Processing_CVPR_2023_paper.pdf + This paper proposes a low-latency neural network architecture for event-based dense prediction tasks. Conventional architectures encode entire scene contents at a fixed rate regardless of their temporal characteristics. Instead, the proposed network encodes contents at a proper temporal scale depending on their movement speed. We achieve this by constructing a temporal hierarchy using stacked latent memories that operate at different rates. Given low-latency event streams, the multi-level memories gradually extract dynamic to static scene contents by propagating information from the fast to the slow memory modules. The architecture not only reduces the redundancy of conventional architectures but also exploits long-term dependencies. Furthermore, an attention-based event representation efficiently encodes sparse event streams into the memory cells. We conduct extensive evaluations on three event-based dense prediction tasks, where the proposed approach outperforms the existing methods in accuracy and latency, while demonstrating effective event and image fusion capabilities. The code is available at https://hamarh.github.io/hmnet/ + + + + DaFKD: Domain-Aware Federated Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_DaFKD_Domain-Aware_Federated_Knowledge_Distillation_CVPR_2023_paper.pdf + Federated Distillation (FD) has recently attracted increasing attention for its efficiency in aggregating multiple diverse local models trained from statistically heterogeneous data of distributed clients. Existing FD methods generally treat these models equally by merely computing the average of their output soft predictions for a given input distillation sample, which does not take the diversity across all local models into account, thus leading to degraded performance of the aggregated model, especially when some local models learn little knowledge about the sample. In this paper, we propose a new perspective that treats the local data in each client as a specific domain and design a novel domain-knowledge-aware federated distillation method, dubbed DaFKD, that can discern the importance of each model to the distillation sample and thus is able to optimize the ensemble of soft predictions from diverse models. Specifically, we employ a domain discriminator for each client, which is trained to identify the correlation factor between the sample and the corresponding domain. Then, to facilitate the training of the domain discriminator while saving communication costs, we propose sharing its partial parameters with the classification model. Extensive experiments on various datasets and settings show that the proposed method can improve the model accuracy by up to 6.02% compared to state-of-the-art baselines. + + + + Boost Vision Transformer With GPU-Friendly Sparsity and Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Boost_Vision_Transformer_With_GPU-Friendly_Sparsity_and_Quantization_CVPR_2023_paper.pdf + The transformer extends its success from the language to the vision domain. 
Because of the numerous stacked self-attention and cross-attention blocks in the transformer, which involve many high-dimensional tensor multiplication operations, the accelerated deployment of vision transformers on GPU hardware is challenging and also rarely studied. This paper carefully designs a compression scheme to maximally utilize GPU-friendly 2:4 fine-grained structured sparsity and quantization. Specifically, the original large model with dense weight parameters is first pruned into a sparse one by 2:4 structured pruning, which exploits the GPU's acceleration of the 2:4 structured sparse pattern with the FP16 data type; the floating-point sparse model is then further quantized into a fixed-point one by sparse-distillation-aware quantization-aware training, which exploits the extra GPU speedup of 2:4 sparse calculation with integer tensors. A mixed-strategy knowledge distillation is used during the pruning and quantization process. The proposed compression scheme is flexible enough to support both supervised and unsupervised learning styles. Experimental results show that the GPUSQ-ViT scheme achieves state-of-the-art compression, reducing vision transformer model size by 6.4-12.7 times and FLOPs by 30.3-62 times with negligible accuracy degradation on ImageNet classification, COCO detection and ADE20K segmentation benchmarking tasks. Moreover, GPUSQ-ViT can boost actual deployment performance by 1.39-1.79 times in latency and 3.22-3.43 times in throughput on an A100 GPU, and by 1.57-1.69 times in latency and 2.11-2.51 times in throughput on AGX Orin. + + + + Spectral Bayesian Uncertainty for Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Spectral_Bayesian_Uncertainty_for_Image_Super-Resolution_CVPR_2023_paper.pdf + Recently, deep learning techniques have significantly advanced image super-resolution (SR). Due to their black-box nature, quantifying reconstruction uncertainty is crucial when employing these deep SR networks. Previous approaches for SR uncertainty estimation mostly focus on capturing pixel-wise uncertainty in the spatial domain. SR uncertainty in the frequency domain, which is highly relevant to image SR, is seldom explored. In this paper, we propose to quantify spectral Bayesian uncertainty in image SR. To achieve this, a Dual-Domain Learning (DDL) framework is first proposed. Combined with Bayesian approaches, the DDL model is able to estimate spectral uncertainty accurately, enabling a reliability assessment of high-frequency reasoning from the frequency-domain perspective. Extensive experiments under non-ideal premises are conducted and demonstrate the effectiveness of the proposed spectral uncertainty. Furthermore, we propose a novel Spectral Uncertainty based Decoupled Frequency (SUDF) training scheme for perceptual SR. Experimental results show that the proposed SUDF can clearly boost the perceptual quality of SR results without sacrificing much pixel accuracy. + + + + Mutual Information-Based Temporal Difference Learning for Human Pose Estimation in Video + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Mutual_Information-Based_Temporal_Difference_Learning_for_Human_Pose_Estimation_in_CVPR_2023_paper.pdf + Temporal modeling is crucial for multi-frame human pose estimation. Most existing methods directly employ optical flow or deformable convolution to predict full-spectrum motion fields, which might incur numerous irrelevant cues, such as a nearby person or the background. 
Without further efforts to excavate meaningful motion priors, their results are suboptimal, especially in complicated spatiotemporal interactions. On the other hand, the temporal difference has the ability to encode representative motion information that can potentially be valuable for pose estimation but has not been fully exploited. In this paper, we present a novel multi-frame human pose estimation framework, which employs temporal differences across frames to model dynamic contexts and a mutual information objective to facilitate the disentanglement of useful motion information. To be specific, we design a multi-stage Temporal Difference Encoder that performs incremental cascaded learning conditioned on multi-stage feature difference sequences to derive an informative motion representation. We further propose a Representation Disentanglement module from the mutual information perspective, which can grasp discriminative task-relevant motion signals by explicitly defining useful and noisy constituents of the raw motion features and minimizing their mutual information. These designs place us at rank No.1 in the Crowd Pose Estimation in Complex Events Challenge on the benchmark dataset HiEve, and achieve state-of-the-art performance on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21. + + + + SynthVSR: Scaling Up Visual Speech Recognition With Synthetic Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_SynthVSR_Scaling_Up_Visual_Speech_Recognition_With_Synthetic_Supervision_CVPR_2023_paper.pdf + Recently reported state-of-the-art results in visual speech recognition (VSR) often rely on increasingly large amounts of video data, while the publicly available transcribed video datasets are limited in size. In this paper, for the first time, we study the potential of leveraging synthetic visual data for VSR. Our method, termed SynthVSR, substantially improves the performance of VSR systems with synthetic lip movements. The key idea behind SynthVSR is to leverage a speech-driven lip animation model that generates lip movements conditioned on the input speech. The speech-driven lip animation model is trained on an unlabeled audio-visual dataset and can be further optimized towards a pre-trained VSR model when labeled videos are available. As plenty of transcribed acoustic data and face images are available, we are able to generate large-scale synthetic data using the proposed lip animation model for semi-supervised VSR training. We evaluate the performance of our approach on the largest public VSR benchmark - Lip Reading Sentences 3 (LRS3). SynthVSR achieves a WER of 43.3% with only 30 hours of real labeled data, outperforming off-the-shelf approaches that use thousands of hours of video. The WER is further reduced to 27.9% when using all 438 hours of labeled data from LRS3, which is on par with the state-of-the-art self-supervised AV-HuBERT method. Furthermore, when combined with large-scale pseudo-labeled audio-visual data, SynthVSR yields a new state-of-the-art VSR WER of 16.9% using publicly available data only, surpassing the recent state-of-the-art approaches trained with 29 times more non-public machine-transcribed video data (90,000 hours). Finally, we perform extensive ablation studies to understand the effect of each component in our proposed method. 
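The "Mutual Information-Based Temporal Difference Learning" abstract above builds its motion representation from per-stage feature differences between consecutive frames. The toy module below only illustrates that temporal-difference idea under assumed feature shapes; the 1x1 embeddings, the fusion convolution, and the shared spatial resolution are hypothetical simplifications rather than the paper's encoder.

```python
import torch
import torch.nn as nn

class TemporalDifferenceEncoder(nn.Module):
    """Toy encoder that turns multi-stage feature differences into motion features."""

    def __init__(self, channels=(64, 128, 256), out_dim=256):
        super().__init__()
        # One 1x1 conv per stage to embed the frame-to-frame feature difference.
        self.embed = nn.ModuleList([nn.Conv2d(c, out_dim, kernel_size=1) for c in channels])
        self.fuse = nn.Conv2d(out_dim * len(channels), out_dim, kernel_size=3, padding=1)

    def forward(self, feats_t, feats_prev):
        # feats_t / feats_prev: lists of per-stage feature maps for frames t and t-1,
        # each of shape (B, C_s, H, W), assumed here to share one spatial size.
        diffs = [emb(ft - fp) for emb, ft, fp in zip(self.embed, feats_t, feats_prev)]
        return self.fuse(torch.cat(diffs, dim=1))

# Example with dummy backbone features (B=2, shared 32x32 resolution per stage).
feats_t = [torch.randn(2, c, 32, 32) for c in (64, 128, 256)]
feats_prev = [torch.randn(2, c, 32, 32) for c in (64, 128, 256)]
motion = TemporalDifferenceEncoder()(feats_t, feats_prev)  # -> (2, 256, 32, 32)
```

The paper additionally disentangles useful from noisy motion components with a mutual information objective; that part is omitted here because it depends on estimator choices the abstract does not specify.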
+ + + + BiasBed - Rigorous Texture Bias Evaluation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kalischek_BiasBed_-_Rigorous_Texture_Bias_Evaluation_CVPR_2023_paper.pdf + The well-documented presence of texture bias in modern convolutional neural networks has led to a plethora of algorithms that promote an emphasis on shape cues, often to support generalization to new domains. Yet, common datasets, benchmarks and general model selection strategies are missing, and there is no agreed, rigorous evaluation protocol. In this paper, we investigate difficulties and limitations when training networks with reduced texture bias. In particular, we also show that proper evaluation and meaningful comparisons between methods are not trivial. We introduce BiasBed, a testbed for texture- and style-biased training, including multiple datasets and a range of existing algorithms. It comes with an extensive evaluation protocol that includes rigorous hypothesis testing to gauge the significance of the results, despite the considerable training instability of some style bias methods. Our extensive experiments shed new light on the need for careful, statistically founded evaluation protocols for style bias (and beyond). For example, we find that some algorithms proposed in the literature do not significantly mitigate the impact of style bias at all. With the release of BiasBed, we hope to foster a common understanding of consistent and meaningful comparisons, and consequently faster progress towards learning methods free of texture bias. Code is available at https://github.com/D1noFuzi/BiasBed. + + + + Open-Category Human-Object Interaction Pre-Training via Language Modeling Framework + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_Open-Category_Human-Object_Interaction_Pre-Training_via_Language_Modeling_Framework_CVPR_2023_paper.pdf + Human-object interaction (HOI) has long been plagued by the conflict between limited supervised data and a vast number of possible interaction combinations in real life. Current methods trained on closed-set data predict HOIs as fixed-dimension logits, which restricts their scalability to open-set categories. To address this issue, we introduce OpenCat, a language modeling framework that reformulates HOI prediction as sequence generation. By converting HOI triplets into a token sequence through a serialization scheme, our model is able to exploit the open-set vocabulary of the language modeling framework to predict novel interaction classes with a high degree of freedom. In addition, inspired by the great success of vision-language pre-training, we collect a large amount of weakly supervised data related to HOI from image-caption pairs, and devise several auxiliary proxy tasks, including soft relational matching and human-object relation prediction, to pre-train our model. Extensive experiments show that our OpenCat significantly boosts HOI performance, particularly on a broad range of rare and unseen categories. + + + + Explicit Boundary Guided Semi-Push-Pull Contrastive Learning for Supervised Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Yao_Explicit_Boundary_Guided_Semi-Push-Pull_Contrastive_Learning_for_Supervised_Anomaly_Detection_CVPR_2023_paper.pdf + Most anomaly detection (AD) models are learned using only normal samples in an unsupervised way, which may result in an ambiguous decision boundary and insufficient discriminability. 
In fact, a few anomaly samples are often available in real-world applications, and the valuable knowledge of known anomalies should also be effectively exploited. However, utilizing a few known anomalies during training may cause another issue: the model may be biased by those known anomalies and fail to generalize to unseen anomalies. In this paper, we tackle supervised anomaly detection, i.e., we learn AD models using a few available anomalies with the objective of detecting both seen and unseen anomalies. We propose a novel explicit boundary guided semi-push-pull contrastive learning mechanism, which can enhance the model's discriminability while mitigating the bias issue. Our approach is based on two core designs. First, we find an explicit and compact separating boundary as the guidance for further feature learning. As the boundary only relies on the normal feature distribution, the bias problem caused by a few known anomalies can be alleviated. Second, a boundary-guided semi-push-pull loss is developed to pull only the normal features together while pushing the abnormal features apart from the separating boundary beyond a certain margin region. In this way, our model can form a more explicit and discriminative decision boundary to distinguish both known and unseen anomalies from normal samples more effectively. Code will be available at https://github.com/xcyao00/BGAD. + + + + DeCo: Decomposition and Reconstruction for Compositional Temporal Grounding via Coarse-To-Fine Contrastive Ranking + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_DeCo_Decomposition_and_Reconstruction_for_Compositional_Temporal_Grounding_via_Coarse-To-Fine_CVPR_2023_paper.pdf + Understanding dense action in videos is a fundamental challenge towards the generalization of vision models. Several works show that compositionality is key to achieving generalization by combining known primitive elements, especially for handling novel composited structures. Compositional temporal grounding is the task of localizing dense action by using known words combined in novel ways, in the form of novel query sentences, for the actual grounding. In recent works, composition is assumed to be learned from pairs of whole videos and language embeddings through large-scale self-supervised pre-training. Alternatively, one can process the video and language into word-level primitive elements and then learn only fine-grained semantic correspondences. Neither approach considers the granularity of the compositions, where different query granularities correspond to different video segments. Therefore, a good compositional representation should be sensitive to different video and query granularities. We propose a method to learn a coarse-to-fine compositional representation by decomposing the original query sentence into different granular levels and then learning the correct correspondences between the video and recombined queries through a contrastive ranking constraint. Additionally, we run temporal boundary prediction in a coarse-to-fine manner for precise grounding boundary detection. Experiments on the Charades-CG and ActivityNet-CG datasets show the superior compositional generalizability of our approach. 
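One simple way to realize the boundary-guided semi-push-pull objective described in the supervised anomaly-detection (BGAD) abstract above is a hinge-style loss: normal samples are pulled below an explicit boundary on an anomaly score, while known anomalies are pushed beyond that boundary by at least a margin. The snippet below is a conceptual sketch with an assumed scalar score per sample and a boundary taken from a quantile of normal scores; it is not the released BGAD loss, which operates on normalizing-flow features.

```python
import torch

def semi_push_pull_loss(scores, labels, boundary, margin=1.0):
    """scores  : (N,) anomaly scores (higher = more anomalous)
    labels  : (N,) 0 for normal samples, 1 for known anomalies
    boundary: scalar separating boundary estimated from normal data only
    """
    normal = labels == 0
    anomalous = labels == 1
    # Pull: normal scores should stay below the boundary.
    pull = torch.relu(scores[normal] - boundary).mean() if normal.any() else scores.new_zeros(())
    # Push: known anomalies should exceed the boundary by at least `margin`.
    push = torch.relu(boundary + margin - scores[anomalous]).mean() if anomalous.any() else scores.new_zeros(())
    return pull + push

# Dummy usage: the boundary is chosen as a high quantile of the normal scores,
# so it depends only on the normal distribution, mirroring the abstract's argument.
scores = torch.tensor([0.1, 0.3, 0.2, 2.5, 1.8])
labels = torch.tensor([0, 0, 0, 1, 1])
b = torch.quantile(scores[labels == 0], 0.95)
loss = semi_push_pull_loss(scores, labels, b)
```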
+ + + + Dynamic Aggregated Network for Gait Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_Dynamic_Aggregated_Network_for_Gait_Recognition_CVPR_2023_paper.pdf + Gait recognition is beneficial for a variety of applications, including video surveillance, crime scene investigation, and social security, to mention a few. However, gait recognition often suffers from multiple exterior factors in real scenes, such as carrying conditions, wearing overcoats, and diverse viewing angles. Recently, various deep learning-based gait recognition methods have achieved promising results, but they tend to extract one of the salient features using fixed-weighted convolutional networks, do not well consider the relationship within gait features in key regions, and ignore the aggregation of complete motion patterns. In this paper, we propose a new perspective that actual gait features include global motion patterns in multiple key regions, and each global motion pattern is composed of a series of local motion patterns. To this end, we propose a Dynamic Aggregation Network (DANet) to learn more discriminative gait features. Specifically, we create a dynamic attention mechanism between the features of neighboring pixels that not only adaptively focuses on key regions but also generates more expressive local motion patterns. In addition, we develop a self-attention mechanism to select representative local motion patterns and further learn robust global motion patterns. Extensive experiments on three popular public gait datasets, i.e., CASIA-B, OUMVLP, and Gait3D, demonstrate that the proposed method can provide substantial improvements over the current state-of-the-art methods. + + + + Sphere-Guided Training of Neural Implicit Surfaces + http://openaccess.thecvf.com//content/CVPR2023/papers/Dogaru_Sphere-Guided_Training_of_Neural_Implicit_Surfaces_CVPR_2023_paper.pdf + In recent years, neural distance functions trained via volumetric ray marching have been widely adopted for multi-view 3D reconstruction. These methods, however, apply the ray marching procedure for the entire scene volume, leading to reduced sampling efficiency and, as a result, lower reconstruction quality in the areas of high-frequency details. In this work, we address this problem via joint training of the implicit function and our new coarse sphere-based surface reconstruction. We use the coarse representation to efficiently exclude the empty volume of the scene from the volumetric ray marching procedure without additional forward passes of the neural surface network, which leads to an increased fidelity of the reconstructions compared to the base systems. We evaluate our approach by incorporating it into the training procedures of several implicit surface modeling methods and observe uniform improvements across both synthetic and real-world datasets. Our codebase can be accessed via the project page. + + + + Bias Mimicking: A Simple Sampling Approach for Bias Mitigation + http://openaccess.thecvf.com//content/CVPR2023/papers/Qraitem_Bias_Mimicking_A_Simple_Sampling_Approach_for_Bias_Mitigation_CVPR_2023_paper.pdf + Prior work has shown that Visual Recognition datasets frequently underrepresent bias groups B (e.g. Female) within class labels Y (e.g. Programmers). This dataset bias can lead to models that learn spurious correlations between class labels and bias groups such as age, gender, or race. 
Most recent methods that address this problem require significant architectural changes or additional loss functions requiring more hyper-parameter tuning. Alternatively, data sampling baselines from the class imbalance literature (eg Undersampling, Upweighting), which can often be implemented in a single line of code and often have no hyperparameters, offer a cheaper and more efficient solution. However, these methods suffer from significant shortcomings. For example, Undersampling drops a significant part of the input distribution per epoch while Oversampling repeats samples, causing overfitting. To address these shortcomings, we introduce a new class-conditioned sampling method: Bias Mimicking. The method is based on the observation that if a class c bias distribution, i.e., P_D(B|Y=c) is mimicked across every c' != c, then Y and B are statistically independent. Using this notion, BM, through a novel training procedure, ensures that the model is exposed to the entire distribution per epoch without repeating samples. Consequently, Bias Mimicking improves underrepresented groups' accuracy of sampling methods by 3% over four benchmarks while maintaining and sometimes improving performance over nonsampling methods. Code: https://github.com/mqraitem/Bias-Mimicking + + + + NoisyQuant: Noisy Bias-Enhanced Post-Training Activation Quantization for Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_NoisyQuant_Noisy_Bias-Enhanced_Post-Training_Activation_Quantization_for_Vision_Transformers_CVPR_2023_paper.pdf + The complicated architecture and high training cost of vision transformers urge the exploration of post-training quantization. However, the heavy-tailed distribution of vision transformer activations hinders the effectiveness of previous post-training quantization methods, even with advanced quantizer designs. Instead of tuning the quantizer to better fit the complicated activation distribution, this paper proposes NoisyQuant, a quantizer-agnostic enhancement for the post-training activation quantization performance of vision transformers. We make a surprising theoretical discovery that for a given quantizer, adding a fixed Uniform noisy bias to the values being quantized can significantly reduce the quantization error under provable conditions. Building on the theoretical insight, NoisyQuant achieves the first success on actively altering the heavy-tailed activation distribution with additive noisy bias to fit a given quantizer. Extensive experiments show NoisyQuant largely improves the post-training quantization performance of vision transformer with minimal computation overhead. For instance, on linear uniform 6-bit activation quantization, NoisyQuant improves SOTA top-1 accuracy on ImageNet by up to 1.7%, 1.1% and 0.5% for ViT, DeiT, and Swin Transformer respectively, achieving on-par or even higher performance than previous nonlinear, mixed-precision quantization. + + + + Semi-Supervised Stereo-Based 3D Object Detection via Cross-View Consensus + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Semi-Supervised_Stereo-Based_3D_Object_Detection_via_Cross-View_Consensus_CVPR_2023_paper.pdf + Stereo-based 3D object detection, which aims at detecting 3D objects with stereo cameras, shows great potential in low-cost deployment compared to LiDAR-based methods and excellent performance compared to monocular-based algorithms. 
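For the NoisyQuant abstract above, the described mechanism can be illustrated numerically: a fixed, uniformly sampled bias is added to the activations before a standard uniform quantizer and removed again after dequantization (the paper folds this removal into the following layer). The snippet below is a toy demonstration with assumed bit-width, shapes, and noise range; whether the round-trip error actually drops depends on the activation distribution and the provable conditions analyzed in the paper.

```python
import torch

def uniform_quant(x, scale, zero_point=0.0, qmin=-128, qmax=127):
    """Plain uniform quantize -> dequantize round trip."""
    q = torch.clamp(torch.round(x / scale) + zero_point, qmin, qmax)
    return (q - zero_point) * scale

torch.manual_seed(0)
x = torch.randn(4, 16) * 3.0                  # toy activations
scale = float(x.abs().max() / 127)            # assumed per-tensor 8-bit scale

# Fixed noisy bias, sampled once (e.g., during calibration) and then frozen.
noisy_bias = torch.empty(1, 16).uniform_(-scale / 2, scale / 2)

plain = uniform_quant(x, scale)
noisy = uniform_quant(x + noisy_bias, scale) - noisy_bias

# Compare round-trip errors; improvement is only guaranteed under the paper's conditions.
print("plain MSE:     ", torch.mean((plain - x) ** 2).item())
print("noisy-bias MSE:", torch.mean((noisy - x) ** 2).item())
```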
However, the impressive performance of stereo-based 3D object detection is at the huge cost of high-quality manual annotations, which are hardly attainable for any given scene. Semi-supervised learning, in which limited annotated data and numerous unannotated data are required to achieve a satisfactory model, is a promising method to address the problem of data deficiency. In this work, we propose to achieve semi-supervised learning for stereo-based 3D object detection through pseudo annotation generation from a temporal-aggregated teacher model, which temporally accumulates knowledge from a student model. To facilitate a more stable and accurate depth estimation, we introduce Temporal-Aggregation-Guided (TAG) disparity consistency, a cross-view disparity consistency constraint between the teacher model and the student model for robust and improved depth estimation. To mitigate noise in pseudo annotation generation, we propose a cross-view agreement strategy, in which pseudo annotations should attain high degree of agreements between 3D and 2D views, as well as between binocular views. We perform extensive experiments on the KITTI 3D dataset to demonstrate our proposed method's capability in leveraging a huge amount of unannotated stereo images to attain significantly improved detection results. + + + + Video Compression With Entropy-Constrained Neural Representations + http://openaccess.thecvf.com//content/CVPR2023/papers/Gomes_Video_Compression_With_Entropy-Constrained_Neural_Representations_CVPR_2023_paper.pdf + Encoding videos as neural networks is a recently proposed approach that allows new forms of video processing. However, traditional techniques still outperform such neural video representation (NVR) methods for the task of video compression. This performance gap can be explained by the fact that current NVR methods: i) use architectures that do not efficiently obtain a compact representation of temporal and spatial information; and ii) minimize rate and distortion disjointly (first overfitting a network on a video and then using heuristic techniques such as post-training quantization or weight pruning to compress the model). We propose a novel convolutional architecture for video representation that better represents spatio-temporal information and a training strategy capable of jointly optimizing rate and distortion. All network and quantization parameters are jointly learned end-to-end, and the post-training operations used in previous works are unnecessary. We evaluate our method on the UVG dataset, achieving new state-of-the-art results for video compression with NVRs. Moreover, we deliver the first NVR-based video compression method that improves over the typically adopted HEVC benchmark (x265, disabled b-frames, "medium" preset), closing the gap to autoencoder-based video compression techniques. + + + + Deep Random Projector: Accelerated Deep Image Prior + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Deep_Random_Projector_Accelerated_Deep_Image_Prior_CVPR_2023_paper.pdf + Deep image prior (DIP) has shown great promise in tackling a variety of image restoration (IR) and general visual inverse problems, needing no training data. However, the resulting optimization process is often very slow, inevitably hindering DIP's practical usage for time-sensitive scenarios. 
In this paper, we focus on IR and propose two crucial modifications to DIP that help achieve substantial speedup: 1) optimizing the DIP seed while freezing randomly initialized network weights, and 2) reducing the network depth. In addition, we reintroduce explicit priors, such as the sparse gradient prior encoded by total-variation regularization, to preserve DIP's peak performance. We evaluate the proposed method on three IR tasks, including image denoising, image super-resolution, and image inpainting, against the original DIP and its variants, as well as the competing metaDIP that uses meta-learning to learn good initializers with extra data. Our method is a clear winner in obtaining competitive restoration quality in a minimal amount of time. Our code is available at https://github.com/sun-umn/Deep-Random-Projector. + + + + SCPNet: Semantic Scene Completion on Point Cloud + http://openaccess.thecvf.com//content/CVPR2023/papers/Xia_SCPNet_Semantic_Scene_Completion_on_Point_Cloud_CVPR_2023_paper.pdf + Training deep models for semantic scene completion is challenging due to the sparse and incomplete input, a large quantity of objects of diverse scales, as well as the inherent label noise for moving objects. To address the above-mentioned problems, we propose the following three solutions: 1) Redesigning the completion network. We design a novel completion network, which consists of several Multi-Path Blocks (MPBs) to aggregate multi-scale features and is free from lossy downsampling operations. 2) Distilling rich knowledge from the multi-frame model. We design a novel knowledge distillation objective, dubbed Dense-to-Sparse Knowledge Distillation (DSKD). It transfers the dense, relation-based semantic knowledge from the multi-frame teacher to the single-frame student, significantly improving the representation learning of the single-frame model. 3) Completion label rectification. We propose a simple yet effective label rectification strategy, which uses off-the-shelf panoptic segmentation labels to remove the traces of dynamic objects in completion labels, greatly improving the performance of deep models, especially for moving objects. Extensive experiments are conducted on two public semantic scene completion benchmarks, i.e., SemanticKITTI and SemanticPOSS. Our SCPNet ranks 1st on the SemanticKITTI semantic scene completion challenge and surpasses the competitive S3CNet by 7.2 mIoU. SCPNet also outperforms previous completion algorithms on the SemanticPOSS dataset. Besides, our method also achieves competitive results on the SemanticKITTI semantic segmentation task, showing that knowledge learned in scene completion is beneficial to segmentation. + + + + Revisiting Prototypical Network for Cross Domain Few-Shot Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Revisiting_Prototypical_Network_for_Cross_Domain_Few-Shot_Learning_CVPR_2023_paper.pdf + Prototypical Network is a popular few-shot solver that aims at establishing a feature metric generalizable to novel few-shot classification (FSC) tasks using deep neural networks. However, its performance drops dramatically when generalizing to FSC tasks in new domains. In this study, we revisit this problem and argue that the devil lies in the simplicity-bias pitfall of neural networks. Specifically, the network tends to focus on some biased shortcut features (e.g., color, shape, etc.) 
that are exclusively sufficient to distinguish very few classes in the meta-training tasks within a pre-defined domain, but fail to generalize across domains as some desirable semantic features. To mitigate this problem, we propose a Local-global Distillation Prototypical Network (LDP-net). Different from the standard Prototypical Network, we establish a two-branch network to classify the query image and its random local crops, respectively. Then, knowledge distillation is conducted among these two branches to enforce their class affiliation consistency. The rationale behind is that since such global-local semantic relationship is expected to hold regardless of data domains, the local-global distillation is beneficial to exploit some cross-domain transferable semantic features for feature metric establishment. Moreover, such local-global semantic consistency is further enforced among different images of the same class to reduce the intra-class semantic variation of the resultant feature. In addition, we propose to update the local branch as Exponential Moving Average (EMA) over training episodes, which makes it possible to better distill cross-episode knowledge and further enhance the generalization performance. Experiments on eight cross-domain FSC benchmarks empirically clarify our argument and show the state-of-the-art results of LDP-net. Code is available in https://github.com/NWPUZhoufei/LDP-Net + + + + Learning Accurate 3D Shape Based on Stereo Polarimetric Imaging + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Learning_Accurate_3D_Shape_Based_on_Stereo_Polarimetric_Imaging_CVPR_2023_paper.pdf + Shape from Polarization (SfP) aims to recover surface normal using the polarization cues of light. The accuracy of existing SfP methods is affected by two main problems. First, the ambiguity of polarization cues partially results in false normal estimation. Second, the widely-used assumption about orthographic projection is too ideal. To solve these problems, we propose the first approach that combines deep learning and stereo polarization information to recover not only normal but also disparity. Specifically, for the ambiguity problem, we design a Shape Consistency-based Mask Prediction (SCMP) module. It exploits the inherent consistency between normal and disparity to identify the areas with false normal estimation. We replace the unreliable features enclosed by these areas with new features extracted by global attention mechanism. As to the orthographic projection problem, we propose a novel Viewing Direction-aided Positional Encoding (VDPE) strategy. This strategy is based on the unique pixel-viewing direction encoding, and thus enables our neural network to handle the non-orthographic projection. In addition, we establish a real-world stereo SfP dataset that contains various object categories and illumination conditions. Experiments showed that compared with existing SfP methods, our approach is more accurate. Moreover, our approach shows higher robustness to light variation. + + + + RA-CLIP: Retrieval Augmented Contrastive Language-Image Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_RA-CLIP_Retrieval_Augmented_Contrastive_Language-Image_Pre-Training_CVPR_2023_paper.pdf + Contrastive Language-Image Pre-training (CLIP) is attracting increasing attention for its impressive zero-shot recognition performance on different down-stream tasks. 
However, training CLIP is data-hungry and requires a large number of image-text pairs to memorize various semantic concepts. In this paper, we propose a novel and efficient framework, Retrieval Augmented Contrastive Language-Image Pre-training (RA-CLIP), to augment embeddings by online retrieval. Specifically, we sample part of the image-text data as a hold-out reference set. Given an input image, relevant image-text pairs are retrieved from the reference set to enrich the representation of the input image. This process can be considered an open-book exam: with the reference set as a cheat sheet, the proposed method doesn't need to memorize all visual concepts in the training data. Instead, it explores how to recognize visual concepts by exploiting the correspondence between images and texts in the cheat sheet. The proposed RA-CLIP implements this idea, and comprehensive experiments are conducted to show how RA-CLIP works. Results on 10 image classification datasets and 2 object detection datasets show that RA-CLIP outperforms the vanilla CLIP baseline by a large margin on zero-shot image classification (+12.7%), linear-probe image classification (+6.9%) and zero-shot ROI classification (+2.8%). + + + + A Practical Upper Bound for the Worst-Case Attribution Deviations + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_A_Practical_Upper_Bound_for_the_Worst-Case_Attribution_Deviations_CVPR_2023_paper.pdf + Model attribution is a critical component of deep neural networks (DNNs), as it provides interpretability for complex models. Recent studies draw attention to the security of attribution methods, as they are vulnerable to attribution attacks that generate similar images with dramatically different attributions. Existing works have investigated empirically improving the robustness of DNNs against those attacks; however, none of them explicitly quantifies the actual deviations of attributions. In this work, for the first time, a constrained optimization problem is formulated to derive an upper bound that measures the largest dissimilarity of attributions after the samples are perturbed by any noise within a certain region while the classification results remain the same. Based on the formulation, different practical approaches are introduced to bound the attributions from above using Euclidean distance and cosine similarity under both L2 and Linf-norm perturbation constraints. The bounds developed by our theoretical study are validated on various datasets and two different types of attacks (the PGD attack and the IFIA attribution attack). Over 10 million attacks in the experiments indicate that the proposed upper bounds effectively quantify the robustness of models based on the worst-case attribution dissimilarities. + + + + Teacher-Generated Spatial-Attention Labels Boost Robustness and Accuracy of Contrastive Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Yao_Teacher-Generated_Spatial-Attention_Labels_Boost_Robustness_and_Accuracy_of_Contrastive_Models_CVPR_2023_paper.pdf + Human spatial attention conveys information about the regions of visual scenes that are important for performing visual tasks. Prior work has shown that information about human attention can be leveraged to benefit various supervised vision tasks. Might providing this weak form of supervision be useful for self-supervised representation learning? Addressing this question requires collecting large datasets with human attention labels. Yet, collecting such large-scale data is very expensive. 
To address this challenge, we construct an auxiliary teacher model to predict human attention, trained on a relatively small labeled dataset. This teacher model allows us to generate image (pseudo) attention labels for ImageNet. We then train a model with a primary contrastive objective; to this standard configuration, we add a simple output head trained to predict the attentional map for each image, guided by the pseudo labels from the teacher model. We measure the quality of the learned representations by evaluating classification performance from the frozen learned embeddings as well as performance on image retrieval tasks. We find that the spatial-attention maps predicted from the contrastive model trained with teacher guidance align better with human attention compared to vanilla contrastive models. Moreover, we find that our approach improves the classification accuracy and robustness of the contrastive models on ImageNet and ImageNet-C. Further, we find that the model representations become more useful for image retrieval tasks, as measured by precision-recall performance on the ImageNet, ImageNet-C, CIFAR10, and CIFAR10-C datasets. + + + + Exploring and Exploiting Uncertainty for Incomplete Multi-View Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Exploring_and_Exploiting_Uncertainty_for_Incomplete_Multi-View_Classification_CVPR_2023_paper.pdf + Classifying incomplete multi-view data is unavoidable, since arbitrarily missing views widely exist in real-world applications. Although great progress has been achieved, it is still difficult for existing incomplete multi-view methods to obtain trustworthy predictions due to the relatively high uncertainty of missing views. First, a missing view is of high uncertainty, and thus it is not reasonable to provide a single deterministic imputation. Second, the quality of the imputed data itself is of high uncertainty. To explore and exploit this uncertainty, we propose an Uncertainty-induced Incomplete Multi-View Data Classification (UIMC) model to classify incomplete multi-view data under a stable and reliable framework. We construct a distribution and sample from it multiple times to characterize the uncertainty of missing views, and adaptively utilize the samples according to the sampling quality. Accordingly, the proposed method realizes more perceivable imputation and controllable fusion. Specifically, we model each missing view with a distribution conditioned on the available views, thus introducing uncertainty. Then an evidence-based fusion strategy is employed to guarantee the trustworthy integration of the imputed views. Extensive experiments are conducted on multiple benchmark datasets, and our method establishes state-of-the-art results in terms of both performance and trustworthiness. + + + + Discovering the Real Association: Multimodal Causal Reasoning in Video Question Answering + http://openaccess.thecvf.com//content/CVPR2023/papers/Zang_Discovering_the_Real_Association_Multimodal_Causal_Reasoning_in_Video_Question_CVPR_2023_paper.pdf + Video Question Answering (VideoQA) is challenging as it requires capturing accurate correlations between modalities from redundant information. Recent methods focus on the explicit challenges of the task, e.g., multimodal feature extraction, video-text alignment and fusion. Their frameworks reason about the answer by relying on statistical evidence, which ignores potential bias in the multimodal data. 
In our work, we investigate relational structure from a causal representation perspective on multimodal data and propose a novel inference framework. For visual data, question-irrelevant objects may establish simple matching associations with the answer. For textual data, the model prefers the local phrase semantics which may deviate from the global semantics in long sentences. Therefore, to enhance the generalization of the model, we discover the real association by explicitly capturing visual features that are causally related to the question semantics and weakening the impact of local language semantics on question answering. The experimental results on two large causal VideoQA datasets verify that our proposed framework 1) improves the accuracy of the existing VideoQA backbone, 2) demonstrates robustness on complex scenes and questions. + + + + Learning Transformations To Reduce the Geometric Shift in Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Vidit_Learning_Transformations_To_Reduce_the_Geometric_Shift_in_Object_Detection_CVPR_2023_paper.pdf + The performance of modern object detectors drops when the test distribution differs from the training one. Most of the methods that address this focus on object appearance changes caused by, e.g., different illumination conditions, or gaps between synthetic and real images. Here, by contrast, we tackle geometric shifts emerging from variations in the image capture process, or due to the constraints of the environment causing differences in the apparent geometry of the content itself. We introduce a self-training approach that learns a set of geometric transformations to minimize these shifts without leveraging any labeled data in the new domain, nor any information about the cameras. We evaluate our method on two different shifts, i.e., a camera's field of view (FoV) change and a viewpoint change. Our results evidence that learning geometric transformations helps detectors to perform better in the target domains. + + + + OReX: Object Reconstruction From Planar Cross-Sections Using Neural Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Sawdayee_OReX_Object_Reconstruction_From_Planar_Cross-Sections_Using_Neural_Fields_CVPR_2023_paper.pdf + Reconstructing 3D shapes from planar cross-sections is a challenge inspired by downstream applications like medical imaging and geographic informatics. The input is an in/out indicator function fully defined on a sparse collection of planes in space, and the output is an interpolation of the indicator function to the entire volume. Previous works addressing this sparse and ill-posed problem either produce low quality results, or rely on additional priors such as target topology, appearance information, or input normal directions. In this paper, we present OReX, a method for 3D shape reconstruction from slices alone, featuring a Neural Field as the interpolation prior. A modest neural network is trained on the input planes to return an inside/outside estimate for a given 3D coordinate, yielding a powerful prior that induces smoothness and self-similarities. The main challenge for this approach is high-frequency details, as the neural prior is overly smoothing. To alleviate this, we offer an iterative estimation architecture and a hierarchical input sampling scheme that encourage coarse-to-fine training, allowing the training process to focus on high frequencies at later stages. 
In addition, we identify and analyze a ripple-like effect stemming from the mesh extraction step. We mitigate it by regularizing the spatial gradients of the indicator function around input in/out boundaries during network training, tackling the problem at the root. Through extensive qualitative and quantitative experimentation, we demonstrate our method is robust, accurate, and scales well with the size of the input. We report state-of-the-art results compared to previous approaches and recent potential solutions, and demonstrate the benefit of our individual contributions through analysis and ablation studies. + + + + SPIn-NeRF: Multiview Segmentation and Perceptual Inpainting With Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Mirzaei_SPIn-NeRF_Multiview_Segmentation_and_Perceptual_Inpainting_With_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRFs) have emerged as a popular approach for novel view synthesis. While NeRFs are quickly being adapted for a wider set of applications, intuitively editing NeRF scenes is still an open challenge. One important editing task is the removal of unwanted objects from a 3D scene, such that the replaced region is visually plausible and consistent with its context. We refer to this task as 3D inpainting. In 3D, solutions must be both consistent across multiple views and geometrically valid. In this paper, we propose a novel 3D inpainting method that addresses these challenges. Given a small set of posed images and sparse annotations in a single input image, our framework first rapidly obtains a 3D segmentation mask for a target object. Using the mask, a perceptual optimization-based approach is then introduced that leverages learned 2D image inpainters, distilling their information into 3D space, while ensuring view consistency. We also address the lack of a diverse benchmark for evaluating 3D scene inpainting methods by introducing a dataset comprised of challenging real-world scenes. In particular, our dataset contains views of the same scene with and without a target object, enabling more principled benchmarking of the 3D inpainting task. We first demonstrate the superiority of our approach on multiview segmentation, comparing to NeRF-based methods and 2D segmentation approaches. We then evaluate on the task of 3D inpainting, establishing state-of-the-art performance against other NeRF manipulation algorithms, as well as a strong 2D image inpainter baseline. + + + + Revisiting Rotation Averaging: Uncertainties and Robust Losses + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Revisiting_Rotation_Averaging_Uncertainties_and_Robust_Losses_CVPR_2023_paper.pdf + In this paper, we revisit the rotation averaging problem applied in global Structure-from-Motion pipelines. We argue that the main problem of current methods is the minimized cost function that is only weakly connected with the input data via the estimated epipolar geometries. We propose to better model the underlying noise distributions by directly propagating the uncertainty from the point correspondences into the rotation averaging. Such uncertainties are obtained for free by considering the Jacobians of two-view refinements. Moreover, we explore integrating a variant of the MAGSAC loss into the rotation averaging problem, instead of using classical robust losses employed in current frameworks. The proposed method leads to results superior to baselines, in terms of accuracy, on large-scale public benchmarks. 
The code is public. https://github.com/zhangganlin/GlobalSfMpy + + + + Patch-Based 3D Natural Scene Generation From a Single Example + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Patch-Based_3D_Natural_Scene_Generation_From_a_Single_Example_CVPR_2023_paper.pdf + We target a 3D generative model for general natural scenes that are typically unique and intricate. Lacking the necessary volumes of training data, along with the difficulties of having ad hoc designs in presence of varying scene characteristics, renders existing setups intractable. Inspired by classical patch-based image models, we advocate for synthesizing 3D scenes at the patch level, given a single example. At the core of this work lies important algorithmic designs w.r.t the scene representation and generative patch nearest-neighbor module, that address unique challenges arising from lifting classical 2D patch-based framework to 3D generation. These design choices, on a collective level, contribute to a robust, effective, and efficient model that can generate high-quality general natural scenes with both realistic geometric structure and visual appearance, in large quantities and varieties, as demonstrated upon a variety of exemplar scenes. Data and code can be found at http://wyysf-98.github.io/Sin3DGen. + + + + Leveraging Hidden Positives for Unsupervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Seong_Leveraging_Hidden_Positives_for_Unsupervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Dramatic demand for manpower to label pixel-level annotations triggered the advent of unsupervised semantic segmentation. Although the recent work employing the vision transformer (ViT) backbone shows exceptional performance, there is still a lack of consideration for task-specific training guidance and local semantic consistency. To tackle these issues, we leverage contrastive learning by excavating hidden positives to learn rich semantic relationships and ensure semantic consistency in local regions. Specifically, we first discover two types of global hidden positives, task-agnostic and task-specific ones for each anchor based on the feature similarities defined by a fixed pre-trained backbone and a segmentation head-in-training, respectively. A gradual increase in the contribution of the latter induces the model to capture task-specific semantic features. In addition, we introduce a gradient propagation strategy to learn semantic consistency between adjacent patches, under the inherent premise that nearby patches are highly likely to possess the same semantics. Specifically, we add the loss propagating to local hidden positives, semantically similar nearby patches, in proportion to the predefined similarity scores. With these training schemes, our proposed method achieves new state-of-the-art (SOTA) results in COCO-stuff, Cityscapes, and Potsdam-3 datasets. Our code is available at: https://github.com/hynnsk/HP. + + + + LG-BPN: Local and Global Blind-Patch Network for Self-Supervised Real-World Denoising + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_LG-BPN_Local_and_Global_Blind-Patch_Network_for_Self-Supervised_Real-World_Denoising_CVPR_2023_paper.pdf + Despite the significant results on synthetic noise under simplified assumptions, most self-supervised denoising methods fail under real noise due to the strong spatial noise correlation, including the advanced self-supervised blind-spot networks (BSNs). 
For recent methods targeting real-world denoising, they either suffer from ignoring this spatial correlation, or are limited by the destruction of fine textures for under-considering the correlation. In this paper, we present a novel method called LG-BPN for self-supervised real-world denoising, which takes the spatial correlation statistic into our network design for local detail restoration, and also brings the long-range dependencies modeling ability to previously CNN-based BSN methods. First, based on the correlation statistic, we propose a densely-sampled patch-masked convolution module. By taking more neighbor pixels with low noise correlation into account, we enable a denser local receptive field, preserving more useful information for enhanced fine structure recovery. Second, we propose a dilated Transformer block to allow distant context exploitation in BSN. This global perception addresses the intrinsic deficiency of BSN, whose receptive field is constrained by the blind spot requirement, which can not be fully resolved by the previous CNN-based BSNs. These two designs enable LG-BPN to fully exploit both the detailed structure and the global interaction in a blind manner. Extensive results on real-world datasets demonstrate the superior performance of our method. https://github.com/Wang-XIaoDingdd/LGBPN + + + + Efficient View Synthesis and 3D-Based Multi-Frame Denoising With Multiplane Feature Representations + http://openaccess.thecvf.com//content/CVPR2023/papers/Tanay_Efficient_View_Synthesis_and_3D-Based_Multi-Frame_Denoising_With_Multiplane_Feature_CVPR_2023_paper.pdf + While current multi-frame restoration methods combine information from multiple input images using 2D alignment techniques, recent advances in novel view synthesis are paving the way for a new paradigm relying on volumetric scene representations. In this work, we introduce the first 3D-based multi-frame denoising method that significantly outperforms its 2D-based counterparts with lower computational requirements. Our method extends the multiplane image (MPI) framework for novel view synthesis by introducing a learnable encoder-renderer pair manipulating multiplane representations in feature space. The encoder fuses information across views and operates in a depth-wise manner while the renderer fuses information across depths and operates in a view-wise manner. The two modules are trained end-to-end and learn to separate depths in an unsupervised way, giving rise to Multiplane Feature (MPF) representations. Experiments on the Spaces and Real Forward-Facing datasets as well as on raw burst data validate our approach for view synthesis, multi-frame denoising, and view synthesis under noisy conditions. + + + + Model Barrier: A Compact Un-Transferable Isolation Domain for Model Intellectual Property Protection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Model_Barrier_A_Compact_Un-Transferable_Isolation_Domain_for_Model_Intellectual_CVPR_2023_paper.pdf + As the scientific and technological achievements produced by human intellectual labor and computation cost, model intellectual property (IP) protection, which refers to preventing the usage of the well-trained model on an unauthorized domain, deserves further attention, so as to effectively mobilize the enthusiasm of model owners and creators. To this end, we propose a novel compact un-transferable isolation domain (CUTI-domain), which acts as a model barrier to block illegal transferring from the authorized domain to the unauthorized domain. 
Specifically, the CUTI-domain is designed to block cross-domain transfer by highlighting the private style features of the authorized domain, leading to recognition failures on unauthorized domains that contain irrelevant private style features. Furthermore, depending on whether the unauthorized domain is known or not, two solutions for using the CUTI-domain are provided: target-specified CUTI-domain and target-free CUTI-domain. Comprehensive experimental results on four digit datasets, CIFAR10 & STL10, and the VisDA-2017 dataset demonstrate that our CUTI-domain can be easily implemented with different backbones as a plug-and-play module and provides an efficient solution for model IP protection. + + + + Object Detection With Self-Supervised Scene Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Object_Detection_With_Self-Supervised_Scene_Adaptation_CVPR_2023_paper.pdf + This paper proposes a novel method to improve the performance of a trained object detector on scenes with fixed camera perspectives based on self-supervised adaptation. Given a specific scene, the trained detector is adapted using pseudo-ground truth labels generated by the detector itself and an object tracker in a cross-teaching manner. When the camera perspective is fixed, our method can utilize the background equivariance by proposing artifact-free object mixup as a means of data augmentation, and utilize accurate background extraction as an additional input modality. We also introduce a large-scale and diverse dataset for the development and evaluation of scene-adaptive object detection. Experiments on this dataset show that our method can improve the average precision of the original detector, outperforming the previous state-of-the-art self-supervised domain adaptive object detection methods by a large margin. Our dataset and code are published at https://github.com/cvlab-stonybrook/scenes100. + + + + Self-Positioning Point-Based Transformer for Point Cloud Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_Self-Positioning_Point-Based_Transformer_for_Point_Cloud_Understanding_CVPR_2023_paper.pdf + Transformers have shown superior performance on various computer vision tasks with their capability to capture long-range dependencies. Despite the success, it is challenging to directly apply Transformers to point clouds due to their quadratic cost in the number of points. In this paper, we present a Self-Positioning point-based Transformer (SPoTr), which is designed to capture both local and global shape contexts with reduced complexity. Specifically, this architecture consists of local self-attention and self-positioning point-based global cross-attention. The self-positioning points, adaptively located based on the input shape, consider both spatial and semantic information with disentangled attention to improve expressive power. With the self-positioning points, we propose a novel global cross-attention mechanism for point clouds, which improves the scalability of global self-attention by allowing the attention module to compute attention weights with only a small set of self-positioning points. Experiments show the effectiveness of SPoTr on three point cloud tasks: shape classification, part segmentation, and scene segmentation. In particular, our proposed model achieves an accuracy gain of 2.6% over the previous best models on shape classification with ScanObjectNN.
We also provide qualitative analyses to demonstrate the interpretability of self-positioning points. The code of SPoTr is available at https://github.com/mlvlab/SPoTr. + + + + DeepLSD: Line Segment Detection and Refinement With Deep Image Gradients + http://openaccess.thecvf.com//content/CVPR2023/papers/Pautrat_DeepLSD_Line_Segment_Detection_and_Refinement_With_Deep_Image_Gradients_CVPR_2023_paper.pdf + Line segments are ubiquitous in our human-made world and are increasingly used in vision tasks. They are complementary to feature points thanks to their spatial extent and the structural information they provide. Traditional line detectors based on the image gradient are extremely fast and accurate, but lack robustness in noisy images and challenging conditions. Their learned counterparts are more repeatable and can handle challenging images, but at the cost of a lower accuracy and a bias towards wireframe lines. We propose to combine traditional and learned approaches to get the best of both worlds: an accurate and robust line detector that can be trained in the wild without ground truth lines. Our new line segment detector, DeepLSD, processes images with a deep network to generate a line attraction field, before converting it to a surrogate image gradient magnitude and angle, which is then fed to any existing handcrafted line detector. Additionally, we propose a new optimization tool to refine line segments based on the attraction field and vanishing points. This refinement improves the accuracy of current deep detectors by a large margin. We demonstrate the performance of our method on low-level line detection metrics, as well as on several downstream tasks using multiple challenging datasets. The source code and models are available at https://github.com/cvg/DeepLSD. + + + + Executing Your Commands via Motion Diffusion in Latent Space + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Executing_Your_Commands_via_Motion_Diffusion_in_Latent_Space_CVPR_2023_paper.pdf + We study a challenging task, conditional human motion generation, which produces plausible human motion sequences according to various conditional inputs, such as action classes or textual descriptors. Since human motions are highly diverse and have a property of quite different distribution from conditional modalities, such as textual descriptors in natural languages, it is hard to learn a probabilistic mapping from the desired conditional modality to the human motion sequences. Besides, the raw motion data from the motion capture system might be redundant in sequences and contain noises; directly modeling the joint distribution over the raw motion sequences and conditional modalities would need a heavy computational overhead and might result in artifacts introduced by the captured noises. To learn a better representation of the various human motion sequences, we first design a powerful Variational AutoEncoder (VAE) and arrive at a representative and low-dimensional latent code for a human motion sequence. Then, instead of using a diffusion model to establish the connections between the raw motion sequences and the conditional inputs, we perform a diffusion process on the motion latent space. Our proposed Motion Latent-based Diffusion model (MLD) could produce vivid motion sequences conforming to the given conditional inputs and substantially reduce the computational overhead in both the training and inference stages. 
Extensive experiments on various human motion generation tasks demonstrate that our MLD achieves significant improvements over the state-of-the-art methods, while being two orders of magnitude faster than previous diffusion models that operate on raw motion sequences. + + + + Reconstructing Animatable Categories From Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Reconstructing_Animatable_Categories_From_Videos_CVPR_2023_paper.pdf + Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging. Recently, differentiable rendering has provided a pathway to obtain high-quality 3D models from monocular videos, but these are limited to rigid categories or single instances. We present RAC, a method to build category-level 3D models from monocular videos, disentangling variations over instances and motion over time. Three key ideas are introduced to solve this problem: (1) specializing a category-level skeleton to instances, (2) a method for latent space regularization that encourages shared structure across a category while maintaining instance details, and (3) using 3D background models to disentangle objects from the background. We build 3D models for humans, cats, and dogs given monocular videos. Project page: gengshan-y.github.io/rac-www/ + + + + Co-Salient Object Detection With Uncertainty-Aware Group Exchange-Masking + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Co-Salient_Object_Detection_With_Uncertainty-Aware_Group_Exchange-Masking_CVPR_2023_paper.pdf + The traditional definition of the co-salient object detection (CoSOD) task is to segment the common salient objects in a group of relevant images. Existing CoSOD models by default adopt the group consensus assumption. This introduces a robustness defect when irrelevant images appear in the testing image group, which hinders the use of CoSOD models in real-world applications. To address this issue, this paper presents a group exchange-masking (GEM) strategy for robust CoSOD model learning. With two groups of images containing different types of salient objects as input, the GEM first selects a set of images from each group by the proposed learning-based strategy, and then these images are exchanged. The proposed feature extraction module considers both the uncertainty caused by the irrelevant images and the group consensus in the remaining relevant images. We design a latent variable generator branch, built on a conditional variational autoencoder, to generate uncertainty-based global stochastic features. A CoSOD transformer branch is devised to capture the correlation-based local features that contain the group consistency information. Finally, the outputs of the two branches are concatenated and fed into a transformer-based decoder, producing robust co-saliency predictions. Extensive evaluations on co-saliency detection with and without irrelevant images demonstrate the superiority of our method over a variety of state-of-the-art methods. + + + + Tangentially Elongated Gaussian Belief Propagation for Event-Based Incremental Optical Flow Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Nagata_Tangentially_Elongated_Gaussian_Belief_Propagation_for_Event-Based_Incremental_Optical_Flow_CVPR_2023_paper.pdf + Optical flow estimation is a fundamental functionality in computer vision.
An event-based camera, which asynchronously detects sparse intensity changes, is an ideal device for realizing low-latency estimation of the optical flow owing to its low-latency sensing mechanism. An existing method using local plane fitting of events could utilize the sparsity to realize incremental updates for low-latency estimation; however, its output is merely a normal component of the full optical flow. An alternative approach using a frame-based deep neural network could estimate the full flow; however, its intensive non-incremental dense operation prohibits the low-latency estimation. We propose tangentially elongated Gaussian (TEG) belief propagation (BP) that realizes incremental full-flow estimation. We model the probability of full flow as the joint distribution of TEGs from the normal flow measurements, such that the marginal of this distribution with correct prior equals the full flow. We formulate the marginalization using a message-passing based on the BP to realize efficient incremental updates using sparse measurements. In addition to the theoretical justification, we evaluate the effectiveness of the TEGBP in real-world datasets; it outperforms SOTA incremental quasi-full flow method by a large margin. The code will be open-sourced upon acceptance. + + + + Adaptive Sparse Pairwise Loss for Object Re-Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Adaptive_Sparse_Pairwise_Loss_for_Object_Re-Identification_CVPR_2023_paper.pdf + Object re-identification (ReID) aims to find instances with the same identity as the given probe from a large gallery. Pairwise losses play an important role in training a strong ReID network. Existing pairwise losses densely exploit each instance as an anchor and sample its triplets in a mini-batch. This dense sampling mechanism inevitably introduces positive pairs that share few visual similarities, which can be harmful to the training. To address this problem, we propose a novel loss paradigm termed Sparse Pairwise (SP) loss that only leverages few appropriate pairs for each class in a mini-batch, and empirically demonstrate that it is sufficient for the ReID tasks. Based on the proposed loss framework, we propose an adaptive positive mining strategy that can dynamically adapt to diverse intra-class variations. Extensive experiments show that SP loss and its adaptive variant AdaSP loss outperform other pairwise losses, and achieve state-of-the-art performance across several ReID benchmarks. Code is available at https://github.com/Astaxanthin/AdaSP. + + + + Semi-Weakly Supervised Object Kinematic Motion Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Semi-Weakly_Supervised_Object_Kinematic_Motion_Prediction_CVPR_2023_paper.pdf + Given a 3D object, kinematic motion prediction aims to identify the mobile parts as well as the corresponding motion parameters. Due to the large variations in both topological structure and geometric details of 3D objects, this remains a challenging task and the lack of large scale labeled data also constrain the performance of deep learning based approaches. In this paper, we tackle the task of object kinematic motion prediction problem in a semi-weakly supervised manner. Our key observations are two-fold. First, although 3D dataset with fully annotated motion labels is limited, there are existing datasets and methods for object part semantic segmentation at large scale. 
Second, semantic part segmentation and mobile part segmentation is not always consistent but it is possible to detect the mobile parts from the underlying 3D structure. Towards this end, we propose a graph neural network to learn the map between hierarchical part-level segmentation and mobile parts parameters, which are further refined based on geometric alignment. This network can be first trained on PartNet-Mobility dataset with fully labeled mobility information and then applied on PartNet dataset with fine-grained and hierarchical part-level segmentation. The network predictions yield a large scale of 3D objects with pseudo labeled mobility information and can further be used for weakly-supervised learning with pre-existing segmentation. Our experiments show there are significant performance boosts with the augmented data for previous method designed for kinematic motion prediction on 3D partial scans. + + + + Learning a Simple Low-Light Image Enhancer From Paired Low-Light Instances + http://openaccess.thecvf.com//content/CVPR2023/papers/Fu_Learning_a_Simple_Low-Light_Image_Enhancer_From_Paired_Low-Light_Instances_CVPR_2023_paper.pdf + Low-light Image Enhancement (LIE) aims at improving contrast and restoring details for images captured in low-light conditions. Most of the previous LIE algorithms adjust illumination using a single input image with several handcrafted priors. Those solutions, however, often fail in revealing image details due to the limited information in a single image and the poor adaptability of handcrafted priors. To this end, we propose PairLIE, an unsupervised approach that learns adaptive priors from low-light image pairs. First, the network is expected to generate the same clean images as the two inputs share the same image content. To achieve this, we impose the network with the Retinex theory and make the two reflectance components consistent. Second, to assist the Retinex decomposition, we propose to remove inappropriate features in the raw image with a simple self-supervised mechanism. Extensive experiments on public datasets show that the proposed PairLIE achieves comparable performance against the state-of-the-art approaches with a simpler network and fewer handcrafted priors. Code is available at: https://github.com/zhenqifu/PairLIE. + + + + PivoTAL: Prior-Driven Supervision for Weakly-Supervised Temporal Action Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Rizve_PivoTAL_Prior-Driven_Supervision_for_Weakly-Supervised_Temporal_Action_Localization_CVPR_2023_paper.pdf + Weakly-supervised Temporal Action Localization (WTAL) attempts to localize the actions in untrimmed videos using only video-level supervision. Most recent works approach WTAL from a localization-by-classification perspective where these methods try to classify each video frame followed by a manually-designed post-processing pipeline to aggregate these per-frame action predictions into action snippets. Due to this perspective, the model lacks any explicit understanding of action boundaries and tends to focus only on the most discriminative parts of the video resulting in incomplete action localization. To address this, we present PivoTAL, Prior-driven Supervision for Weakly-supervised Temporal Action Localization, to approach WTAL from a localization-by-localization perspective by learning to localize the action snippets directly. 
To this end, PivoTAL leverages the underlying spatio-temporal regularities in videos in the form of an action-specific scene prior, an action-snippet generation prior, and a learnable Gaussian prior to supervise the localization-based training. PivoTAL shows significant improvement (of at least 3% avg mAP) over all existing methods on the benchmark datasets, THUMOS-14 and ActivityNet-v1.3. + + + + Improving Generalization With Domain Convex Game + http://openaccess.thecvf.com//content/CVPR2023/papers/Lv_Improving_Generalization_With_Domain_Convex_Game_CVPR_2023_paper.pdf + Domain generalization (DG) aims to alleviate the poor generalization capability of deep neural networks by learning a model with multiple source domains. A classical solution to DG is domain augmentation, the common belief of which is that diversifying source domains will be conducive to out-of-distribution generalization. However, these claims are understood intuitively, rather than mathematically. Our explorations empirically reveal that the correlation between model generalization and the diversity of domains may not be strictly positive, which limits the effectiveness of domain augmentation. This work therefore aims to guarantee and further enhance the validity of this strand. To this end, we propose a new perspective on DG that recasts it as a convex game between domains. We first encourage each diversified domain to enhance model generalization by elaborately designing a regularization term based on supermodularity. Meanwhile, a sample filter is constructed to eliminate low-quality samples, thereby avoiding the impact of potentially harmful information. Our framework presents a new avenue for the formal analysis of DG; heuristic analysis and extensive experiments demonstrate its rationality and effectiveness. + + + + Fair Scratch Tickets: Finding Fair Sparse Networks Without Weight Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Fair_Scratch_Tickets_Finding_Fair_Sparse_Networks_Without_Weight_Training_CVPR_2023_paper.pdf + Recent studies suggest that computer vision models come with the risk of compromising fairness. There is extensive work on alleviating unfairness in computer vision using pre-processing, in-processing, and post-processing methods. In this paper, we introduce a novel fairness-aware learning paradigm for in-processing methods through the lens of the lottery ticket hypothesis (LTH) in the context of computer vision fairness. We randomly initialize a dense neural network and find appropriate binary masks for the weights to obtain fair sparse subnetworks without any weight training. Interestingly, to the best of our knowledge, we are the first to discover that such sparse subnetworks with inborn fairness exist in randomly initialized networks, achieving an accuracy-fairness trade-off comparable to that of dense neural networks trained with existing fairness-aware in-processing approaches. We term these fair subnetworks Fair Scratch Tickets (FSTs). We also theoretically provide fairness and accuracy guarantees for them. In our experiments, we investigate the existence of FSTs on various datasets, target attributes, random initialization methods, sparsity patterns, and fairness surrogates. We also find that FSTs can transfer across datasets and investigate other properties of FSTs.
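To make the "binary masks over frozen random weights" recipe in the Fair Scratch Tickets description above concrete, the following is a minimal, illustrative PyTorch sketch of a generic supermask-style layer: weights stay at their random initialization and only per-weight scores are trained, with a straight-through estimator turning the top-scoring entries into a binary mask. This is not the authors' code; the layer name, initialization scale, top-k thresholding, and the placeholder for a fairness surrogate are assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MaskedLinear(nn.Module):
        """Linear layer whose weights stay frozen at random initialization;
        only per-weight scores are trained, and the top-scoring entries
        form a binary mask (straight-through estimator)."""
        def __init__(self, in_features, out_features, sparsity=0.9):
            super().__init__()
            self.weight = nn.Parameter(0.1 * torch.randn(out_features, in_features),
                                       requires_grad=False)  # never updated
            self.scores = nn.Parameter(0.01 * torch.randn(out_features, in_features))
            self.sparsity = sparsity

        def forward(self, x):
            k = max(1, int(self.scores.numel() * (1.0 - self.sparsity)))
            threshold = torch.topk(self.scores.flatten(), k).values.min()
            hard_mask = (self.scores >= threshold).float()
            # Forward pass uses the hard mask; gradients flow to the soft scores.
            mask = hard_mask + self.scores - self.scores.detach()
            return F.linear(x, self.weight * mask)

    # Hypothetical usage: only the scores are optimized, the weights never move.
    layer = MaskedLinear(128, 10, sparsity=0.9)
    optimizer = torch.optim.SGD([layer.scores], lr=0.1)
    logits = layer(torch.randn(32, 128))
    loss = F.cross_entropy(logits, torch.randint(0, 10, (32,)))
    # A fairness surrogate term would be added to `loss` in the paper's setting.
    loss.backward()
    optimizer.step()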
+ + + + Intrinsic Physical Concepts Discovery With Object-Centric Predictive Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Intrinsic_Physical_Concepts_Discovery_With_Object-Centric_Predictive_Models_CVPR_2023_paper.pdf + The ability to discover abstract physical concepts and understand how they work in the world through observation lies at the core of human intelligence. The acquisition of this ability is based on compositionally perceiving the environment in terms of objects and relations in an unsupervised manner. Recent approaches learn object-centric representations and capture visually observable concepts of objects, e.g., shape, size, and location. In this paper, we take a step forward and try to discover and represent intrinsic physical concepts such as mass and charge. We introduce the PHYsical Concepts Inference NEtwork (PHYCINE), a system that infers physical concepts at different levels of abstraction without supervision. The key insights underlying PHYCINE are two-fold: commonsense knowledge emerges with prediction, and physical concepts at different levels of abstraction should be reasoned about in a bottom-up fashion. Empirical evaluation demonstrates that the variables inferred by our system work in accordance with the properties of the corresponding physical concepts. We also show that object representations containing the discovered physical concept variables could help achieve better performance in causal reasoning tasks, i.e., on COMPHY. + + + + Towards Generalisable Video Moment Retrieval: Visual-Dynamic Injection to Image-Text Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_Towards_Generalisable_Video_Moment_Retrieval_Visual-Dynamic_Injection_to_Image-Text_Pre-Training_CVPR_2023_paper.pdf + The correlation between vision and text is essential for video moment retrieval (VMR); however, existing methods heavily rely on separately pre-trained feature extractors for visual and textual understanding. Without sufficient temporal boundary annotations, it is non-trivial to learn universal video-text alignments. In this work, we explore multi-modal correlations derived from large-scale image-text data to facilitate generalisable VMR. To address the limitations of image-text pre-training models in capturing video changes, we propose a generic method, referred to as Visual-Dynamic Injection (VDI), to empower the model's understanding of video moments. Whilst existing VMR methods focus on building temporally-aware video features, being aware of the text descriptions of temporal changes is also critical but was originally overlooked in pre-training, which matches static images with sentences. Therefore, we extract visual context and spatial dynamic information from video frames and explicitly enforce their alignment with the phrases describing video changes (e.g., verbs). By doing so, the potentially relevant visual and motion patterns in videos are encoded in (injected into) the corresponding text embeddings so as to enable more accurate video-text alignments. We conduct extensive experiments on two VMR benchmark datasets (Charades-STA and ActivityNet-Captions) and achieve state-of-the-art performance. In particular, VDI yields notable advantages when tested on the out-of-distribution splits where the testing samples involve novel scenes and vocabulary.
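The "explicitly enforce their alignment" step in the VDI description above can be pictured with a small contrastive objective. The sketch below is only a loose stand-in: it uses frame differences as a crude proxy for spatial dynamic information and a symmetric InfoNCE loss, whereas the paper's actual feature extraction and loss may differ; the temperature, pooling, and function name are assumptions.

    import torch
    import torch.nn.functional as F

    def dynamic_text_alignment_loss(frame_feats, verb_embeds, temperature=0.07):
        """Align a crude visual-dynamic signal with embeddings of
        change-describing phrases via a symmetric contrastive loss.

        frame_feats: (B, T, D) per-frame visual features
        verb_embeds: (B, D) text embeddings of the verb phrases
        """
        # Temporal differences as a cheap proxy for motion/dynamics.
        dyn = frame_feats[:, 1:] - frame_feats[:, :-1]          # (B, T-1, D)
        dyn = F.normalize(dyn.mean(dim=1), dim=-1)              # (B, D)
        txt = F.normalize(verb_embeds, dim=-1)

        logits = dyn @ txt.t() / temperature                    # (B, B)
        targets = torch.arange(logits.size(0), device=logits.device)
        # Matched clip/phrase pairs sit on the diagonal of the similarity matrix.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))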
+ + + + Learning Adaptive Dense Event Stereo From the Image Domain + http://openaccess.thecvf.com//content/CVPR2023/papers/Cho_Learning_Adaptive_Dense_Event_Stereo_From_the_Image_Domain_CVPR_2023_paper.pdf + Recently, event-based stereo matching has been studied due to its robustness in poor light conditions. However, existing event-based stereo networks suffer severe performance degradation when domains shift. Unsupervised domain adaptation (UDA) aims at resolving this problem without using the target domain ground-truth. However, traditional UDA still needs the input event data with ground-truth in the source domain, which is more challenging and costly to obtain than image data. To tackle this issue, we propose a novel unsupervised domain Adaptive Dense Event Stereo (ADES), which resolves gaps between the different domains and input modalities. The proposed ADES framework adapts event-based stereo networks from abundant image datasets with ground-truth on the source domain to event datasets without ground-truth on the target domain, which is a more practical setup. First, we propose a self-supervision module that trains the network on the target domain through image reconstruction, while an artifact prediction network trained on the source domain assists in removing intermittent artifacts in the reconstructed image. Secondly, we utilize the feature-level normalization scheme to align the extracted features along the epipolar line. Finally, we present the motion-invariant consistency module to impose the consistent output between the perturbed motion. Our experiments demonstrate that our approach achieves remarkable results in the adaptation ability of event-based stereo matching from the image domain. + + + + Foundation Model Drives Weakly Incremental Learning for Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Foundation_Model_Drives_Weakly_Incremental_Learning_for_Semantic_Segmentation_CVPR_2023_paper.pdf + Modern incremental learning for semantic segmentation methods usually learn new categories based on dense annotations. Although achieve promising results, pixel-by-pixel labeling is costly and time-consuming. Weakly incremental learning for semantic segmentation (WILSS) is a novel and attractive task, which aims at learning to segment new classes from cheap and widely available image-level labels. Despite the comparable results, the image-level labels can not provide details to locate each segment, which limits the performance of WILSS. This inspires us to think how to improve and effectively utilize the supervision of new classes given image-level labels while avoiding forgetting old ones. In this work, we propose a novel and data-efficient framework for WILSS, named FMWISS. Specifically, we propose pre-training based co-segmentation to distill the knowledge of complementary foundation models for generating dense pseudo labels. We further optimize the noisy pseudo masks with a teacher-student architecture, where a plug-in teacher is optimized with a proposed dense contrastive loss. Moreover, we introduce memory-based copy-paste augmentation to improve the catastrophic forgetting problem of old classes. Extensive experiments on Pascal VOC and COCO datasets demonstrate the superior performance of our framework, e.g., FMWISS achieves 70.7% and 73.3% in the 15-5 VOC setting, outperforming the state-of-the-art method by 3.4% and 6.1%, respectively. 
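One piece of the FMWISS pipeline above, supervising a student with dense pseudo labels distilled from foundation models, can be illustrated with a confidence-filtered cross-entropy. This is a simplified sketch only: the threshold value and ignore index are assumptions, and the paper's dense contrastive loss and memory-based copy-paste augmentation are not shown.

    import torch
    import torch.nn.functional as F

    def confident_pseudo_label_loss(student_logits, teacher_probs, conf_thresh=0.7):
        """Train a segmentation student on dense pseudo masks from a teacher,
        keeping only pixels where the teacher is confident.

        student_logits: (B, C, H, W) raw student predictions
        teacher_probs:  (B, C, H, W) teacher soft masks (softmax outputs)
        """
        conf, hard = teacher_probs.max(dim=1)       # per-pixel confidence and label
        hard[conf < conf_thresh] = 255              # mark uncertain pixels as ignore
        return F.cross_entropy(student_logits, hard, ignore_index=255)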
+ + + + NeRFVS: Neural Radiance Fields for Free View Synthesis via Geometry Scaffolds + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_NeRFVS_Neural_Radiance_Fields_for_Free_View_Synthesis_via_Geometry_CVPR_2023_paper.pdf + We present NeRFVS, a novel neural radiance fields (NeRF) based method to enable free navigation in a room. NeRF achieves impressive performance in rendering images for novel views similar to the input views while suffering for novel views that are significantly different from the training views. To address this issue, we utilize the holistic priors, including pseudo depth maps and view coverage information, from neural reconstruction to guide the learning of implicit neural representations of 3D indoor scenes. Concretely, an off-the-shelf neural reconstruction method is leveraged to generate a geometry scaffold. Then, two loss functions based on the holistic priors are proposed to improve the learning of NeRF: 1) A robust depth loss that can tolerate the error of the pseudo depth map to guide the geometry learning of NeRF; 2) A variance loss to regularize the variance of implicit neural representations to reduce the geometry and color ambiguity in the learning procedure. These two loss functions are modulated during NeRF optimization according to the view coverage information to reduce the negative influence brought by the view coverage imbalance. Extensive results demonstrate that our NeRFVS outperforms state-of-the-art view synthesis methods quantitatively and qualitatively on indoor scenes, achieving high-fidelity free navigation results. + + + + Auto-CARD: Efficient and Robust Codec Avatar Driving for Real-Time Mobile Telepresence + http://openaccess.thecvf.com//content/CVPR2023/papers/Fu_Auto-CARD_Efficient_and_Robust_Codec_Avatar_Driving_for_Real-Time_Mobile_CVPR_2023_paper.pdf + Real-time and robust photorealistic avatars for telepresence in AR/VR have been highly desired for enabling immersive photorealistic telepresence. However, there still exists one key bottleneck: the considerable computational expense needed to accurately infer facial expressions captured from headset-mounted cameras with a quality level that can match the realism of the avatar's human appearance. To this end, we propose a framework called Auto-CARD, which for the first time enables real-time and robust driving of Codec Avatars when exclusively using merely on-device computing resources. This is achieved by minimizing two sources of redundancy. First, we develop a dedicated neural architecture search technique called AVE-NAS for avatar encoding in AR/VR, which explicitly boosts both the searched architectures' robustness in the presence of extreme facial expressions and hardware friendliness on fast evolving AR/VR headsets. Second, we leverage the temporal redundancy in consecutively captured images during continuous rendering and develop a mechanism dubbed LATEX to skip the computation of redundant frames. Specifically, we first identify an opportunity from the linearity of the latent space derived by the avatar decoder and then propose to perform adaptive latent extrapolation for redundant frames. For evaluation, we demonstrate the efficacy of our Auto-CARD framework in real-time Codec Avatar driving settings, where we achieve a 5.05x speed-up on Meta Quest 2 while maintaining a comparable or even better animation quality than state-of-the-art avatar encoder designs. 
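The LATEX idea in the Auto-CARD description above, skipping encoder work on redundant frames by exploiting the approximate linearity of the latent space, can be sketched as follows. The skip test (a mean absolute pixel difference against a fixed threshold) and the class and attribute names are assumptions; the paper's adaptive criterion is not reproduced here.

    import torch

    class LatentExtrapolator:
        """Reuse the two previous latent codes via linear extrapolation when
        consecutive frames barely change, instead of re-running the encoder."""
        def __init__(self, encoder, pixel_thresh=0.01):
            self.encoder = encoder
            self.pixel_thresh = pixel_thresh
            self.prev_frame = None
            self.z_prev = None
            self.z_prev2 = None

        @torch.no_grad()
        def __call__(self, frame):
            redundant = (
                self.prev_frame is not None
                and self.z_prev2 is not None
                and (frame - self.prev_frame).abs().mean() < self.pixel_thresh
            )
            if redundant:
                # Linear extrapolation in latent space: z_t ~ 2*z_{t-1} - z_{t-2}.
                z = 2 * self.z_prev - self.z_prev2
            else:
                z = self.encoder(frame)             # full encoder pass
            self.prev_frame = frame
            self.z_prev2, self.z_prev = self.z_prev, z
            return z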
+ + + + Conjugate Product Graphs for Globally Optimal 2D-3D Shape Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Roetzer_Conjugate_Product_Graphs_for_Globally_Optimal_2D-3D_Shape_Matching_CVPR_2023_paper.pdf + We consider the problem of finding a continuous and non-rigid matching between a 2D contour and a 3D mesh. While such problems can be solved to global optimality by finding a shortest path in the product graph between both shapes, existing solutions heavily rely on unrealistic prior assumptions to avoid degenerate solutions (e.g. knowledge to which region of the 3D shape each point of the 2D contour is matched). To address this, we propose a novel 2D-3D shape matching formalism based on the conjugate product graph between the 2D contour and the 3D shape. Doing so allows us for the first time to consider higher-order costs, i.e. defined for edge chains, as opposed to costs defined for single edges. This offers substantially more flexibility, which we utilise to incorporate a local rigidity prior. By doing so, we effectively circumvent degenerate solutions and thereby obtain smoother and more realistic matchings, even when using only a one-dimensional feature descriptor. Overall, our method finds globally optimal and continuous 2D-3D matchings, has the same asymptotic complexity as previous solutions, produces state-of-the-art results for shape matching and is even capable of matching partial shapes. Our code is publicly available (https://github.com/paul0noah/sm-2D3D). + + + + Multi-Realism Image Compression With a Conditional Generator + http://openaccess.thecvf.com//content/CVPR2023/papers/Agustsson_Multi-Realism_Image_Compression_With_a_Conditional_Generator_CVPR_2023_paper.pdf + By optimizing the rate-distortion-realism trade-off, generative compression approaches produce detailed, realistic images, even at low bit rates, instead of the blurry reconstructions produced by rate-distortion optimized models. However, previous methods do not explicitly control how much detail is synthesized, which results in a common criticism of these methods: users might be worried that a misleading reconstruction far from the input image is generated. In this work, we alleviate these concerns by training a decoder that can bridge the two regimes and navigate the distortion-realism trade-off. From a single compressed representation, the receiver can decide to either reconstruct a low mean squared error reconstruction that is close to the input, a realistic reconstruction with high perceptual quality, or anything in between. With our method, we set a new state-of-the-art in distortion-realism, pushing the frontier of achievable distortion-realism pairs, i.e., our method achieves better distortions at high realism and better realism at low distortion than ever before. + + + + Best of Both Worlds: Multimodal Contrastive Learning With Tabular and Imaging Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Hager_Best_of_Both_Worlds_Multimodal_Contrastive_Learning_With_Tabular_and_CVPR_2023_paper.pdf + Medical datasets and especially biobanks, often contain extensive tabular data with rich clinical information in addition to images. In practice, clinicians typically have less data, both in terms of diversity and scale, but still wish to deploy deep learning solutions. Combined with increasing medical dataset sizes and expensive annotation costs, the necessity for unsupervised methods that can pretrain multimodally and predict unimodally has risen. 
To address these needs, we propose the first self-supervised contrastive learning framework that takes advantage of images and tabular data to train unimodal encoders. Our solution combines SimCLR and SCARF, two leading contrastive learning strategies, and is simple and effective. In our experiments, we demonstrate the strength of our framework by predicting risks of myocardial infarction and coronary artery disease (CAD) using cardiac MR images and 120 clinical features from 40,000 UK Biobank subjects. Furthermore, we show the generalizability of our approach to natural images using the DVM car advertisement dataset. We take advantage of the high interpretability of tabular data and through attribution and ablation experiments find that morphometric tabular features, describing size and shape, have outsized importance during the contrastive learning process and improve the quality of the learned embeddings. Finally, we introduce a novel form of supervised contrastive learning, label as a feature (LaaF), by appending the ground truth label as a tabular feature during multimodal pretraining, outperforming all supervised contrastive baselines. + + + + Masked Images Are Counterfactual Samples for Robust Fine-Tuning + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiao_Masked_Images_Are_Counterfactual_Samples_for_Robust_Fine-Tuning_CVPR_2023_paper.pdf + Deep learning models are challenged by the distribution shift between the training data and test data. Recently, the large models pre-trained on diverse data have demonstrated unprecedented robustness to various distribution shifts. However, fine-tuning these models can lead to a trade-off between in-distribution (ID) performance and out-of-distribution (OOD) robustness. Existing methods for tackling this trade-off do not explicitly address the OOD robustness problem. In this paper, based on causal analysis of the aforementioned problems, we propose a novel fine-tuning method, which uses masked images as counterfactual samples that help improve the robustness of the fine-tuning model. Specifically, we mask either the semantics-related or semantics-unrelated patches of the images based on class activation map to break the spurious correlation, and refill the masked patches with patches from other images. The resulting counterfactual samples are used in feature-based distillation with the pre-trained model. Extensive experiments verify that regularizing the fine-tuning with the proposed masked images can achieve a better trade-off between ID and OOD performance, surpassing previous methods on the OOD performance. Our code is available at https://github.com/Coxy7/robust-finetuning. + + + + StepFormer: Self-Supervised Step Discovery and Localization in Instructional Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Dvornik_StepFormer_Self-Supervised_Step_Discovery_and_Localization_in_Instructional_Videos_CVPR_2023_paper.pdf + Instructional videos are an important resource to learn procedural tasks from human demonstrations. However, the instruction steps in such videos are typically short and sparse, with most of the video being irrelevant to the procedure. This motivates the need to temporally localize the instruction steps in such videos, i.e. the task called key-step localization. Traditional methods for key-step localization require video-level human annotations and thus do not scale to large datasets. 
In this work, we tackle the problem with no human supervision and introduce StepFormer, a self-supervised model that discovers and localizes instruction steps in a video. StepFormer is a transformer decoder that attends to the video with learnable queries, and produces a sequence of slots capturing the key-steps in the video. We train our system on a large dataset of instructional videos, using their automatically-generated subtitles as the only source of supervision. In particular, we supervise our system with a sequence of text narrations using an order-aware loss function that filters out irrelevant phrases. We show that our model outperforms all previous unsupervised and weakly-supervised approaches on step detection and localization by a large margin on three challenging benchmarks. Moreover, our model demonstrates an emergent property to solve zero-shot multi-step localization and outperforms all relevant baselines at this task. + + + + Open Vocabulary Semantic Segmentation With Patch Aligned Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Mukhoti_Open_Vocabulary_Semantic_Segmentation_With_Patch_Aligned_Contrastive_Learning_CVPR_2023_paper.pdf + We introduce Patch Aligned Contrastive Learning (PACL), a modified compatibility function for CLIP's contrastive loss, intending to train an alignment between the patch tokens of the vision encoder and the CLS token of the text encoder. With such an alignment, a model can identify regions of an image corresponding to a given text input, and therefore transfer seamlessly to the task of open vocabulary semantic segmentation without requiring any segmentation annotations during training. Using pre-trained CLIP encoders with PACL, we are able to set the state-of-the-art on the task of open vocabulary zero-shot segmentation on 4 different segmentation benchmarks: Pascal VOC, Pascal Context, COCO Stuff and ADE20K. Furthermore, we show that PACL is also applicable to image-level predictions and when used with a CLIP backbone, provides a general improvement in zero-shot classification accuracy compared to CLIP, across a suite of 12 image classification datasets. + + + + Camouflaged Instance Segmentation via Explicit De-Camouflaging + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_Camouflaged_Instance_Segmentation_via_Explicit_De-Camouflaging_CVPR_2023_paper.pdf + Camouflaged Instance Segmentation (CIS) aims at predicting the instance-level masks of camouflaged objects, which are usually the animals in the wild adapting their appearance to match the surroundings. Previous instance segmentation methods perform poorly on this task as they are easily disturbed by the deceptive camouflage. To address these challenges, we propose a novel De-camouflaging Network (DCNet) including a pixel-level camouflage decoupling module and an instance-level camouflage suppression module. The proposed DCNet enjoys several merits. First, the pixel-level camouflage decoupling module can extract camouflage characteristics based on the Fourier transformation. Then a difference attention mechanism is proposed to eliminate the camouflage characteristics while reserving target object characteristics in the pixel feature. Second, the instance-level camouflage suppression module can aggregate rich instance information from pixels by use of instance prototypes. To mitigate the effect of background noise during segmentation, we introduce some reliable reference points to build a more robust similarity measurement. 
With the aid of these two modules, our DCNet can effectively model de-camouflaging and achieve accurate segmentation for camouflaged instances. Extensive experimental results on two benchmarks demonstrate that our DCNet performs favorably against state-of-the-art CIS methods, e.g., with more than 5% performance gains on COD10K and NC4K datasets in average precision. + + + + Pic2Word: Mapping Pictures to Words for Zero-Shot Composed Image Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Saito_Pic2Word_Mapping_Pictures_to_Words_for_Zero-Shot_Composed_Image_Retrieval_CVPR_2023_paper.pdf + In Composed Image Retrieval (CIR), a user combines a query image with text to describe their intended target. Existing methods rely on supervised learning of CIR models using labeled triplets consisting of the query image, text specification, and the target image. Labeling such triplets is expensive and hinders broad applicability of CIR. In this work, we propose to study an important task, Zero-Shot Composed Image Retrieval (ZS-CIR), whose goal is to build a CIR model without requiring labeled triplets for training. To this end, we propose a novel method, called Pic2Word, that requires only weakly labeled image-caption pairs and unlabeled image datasets to train. Unlike existing supervised CIR models, our model trained on weakly labeled or unlabeled datasets shows strong generalization across diverse ZS-CIR tasks, e.g., attribute editing, object composition, and domain conversion. Our approach outperforms several supervised CIR methods on the common CIR benchmark, CIRR and Fashion-IQ. + + + + MMANet: Margin-Aware Distillation and Modality-Aware Regularization for Incomplete Multimodal Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_MMANet_Margin-Aware_Distillation_and_Modality-Aware_Regularization_for_Incomplete_Multimodal_Learning_CVPR_2023_paper.pdf + Multimodal learning has shown great potentials in numerous scenes and attracts increasing interest recently. However, it often encounters the problem of missing modality data and thus suffers severe performance degradation in practice. To this end, we propose a general framework called MMANet to assist incomplete multimodal learning. It consists of three components: the deployment network used for inference, the teacher network transferring comprehensive multimodal information to the deployment network, and the regularization network guiding the deployment network to balance weak modality combinations. Specifically, we propose a novel margin-aware distillation (MAD) to assist the information transfer by weighing the sample contribution with the classification uncertainty. This encourages the deployment network to focus on the samples near decision boundaries and acquire the refined inter-class margin. Besides, we design a modality-aware regularization (MAR) algorithm to mine the weak modality combinations and guide the regularization network to calculate prediction loss for them. This forces the deployment network to improve its representation ability for the weak modality combinations adaptively. Finally, extensive experiments on multimodal classification and segmentation tasks demonstrate that our MMANet outperforms the state-of-the-art significantly. 
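To give a flavor of the margin-aware distillation described in the MMANet abstract above, where each sample's contribution is weighted by classification uncertainty, here is a compact PyTorch sketch. It substitutes a normalized teacher-entropy weighting for the paper's exact weighting rule, so the weighting scheme, temperature, and scaling are assumptions rather than the published formulation.

    import torch
    import torch.nn.functional as F

    def uncertainty_weighted_distillation(student_logits, teacher_logits, T=2.0):
        """KL distillation in which samples with higher teacher uncertainty
        (closer to decision boundaries) contribute more to the loss."""
        p_teacher = F.softmax(teacher_logits / T, dim=-1)
        log_p_student = F.log_softmax(student_logits / T, dim=-1)
        kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)  # (B,)

        entropy = -(p_teacher * p_teacher.clamp_min(1e-8).log()).sum(dim=-1)   # (B,)
        weights = entropy / entropy.sum().clamp_min(1e-8)   # normalize over the batch
        return (weights * kl).sum() * (T * T)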
+ + + + Putting People in Their Place: Affordance-Aware Human Insertion Into Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Kulal_Putting_People_in_Their_Place_Affordance-Aware_Human_Insertion_Into_Scenes_CVPR_2023_paper.pdf + We study the problem of inferring scene affordances by presenting a method for realistically inserting people into scenes. Given a scene image with a marked region and an image of a person, we insert the person into the scene while respecting the scene affordances. Our model can infer the set of realistic poses given the scene context, re-pose the reference person, and harmonize the composition. We set up the task in a self-supervised fashion by learning to re- pose humans in video clips. We train a large-scale diffusion model on a dataset of 2.4M video clips that produces diverse plausible poses while respecting the scene context. Given the learned human-scene composition, our model can also hallucinate realistic people and scenes when prompted without conditioning and also enables interactive editing. We conduct quantitative evaluation and show that our method synthesizes more realistic human appearance and more natural human-scene interactions when compared to prior work. + + + + 3D Neural Field Generation Using Triplane Diffusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Shue_3D_Neural_Field_Generation_Using_Triplane_Diffusion_CVPR_2023_paper.pdf + Diffusion models have emerged as the state-of-the-art for image generation, among other tasks. Here, we present an efficient diffusion-based model for 3D-aware generation of neural fields. Our approach pre-processes training data, such as ShapeNet meshes, by converting them to continuous occupancy fields and factoring them into a set of axis-aligned triplane feature representations. Thus, our 3D training scenes are all represented by 2D feature planes, and we can directly train existing 2D diffusion models on these representations to generate 3D neural fields with high quality and diversity, outperforming alternative approaches to 3D-aware generation. Our approach requires essential modifications to existing triplane factorization pipelines to make the resulting features easy to learn for the diffusion model. We demonstrate state-of-the-art results on 3D generation on several object classes from ShapeNet. + + + + Regularized Vector Quantization for Tokenized Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Regularized_Vector_Quantization_for_Tokenized_Image_Synthesis_CVPR_2023_paper.pdf + Quantizing images into discrete representations has been a fundamental problem in unified generative modeling. Predominant approaches learn the discrete representation either in a deterministic manner by selecting the best-matching token or in a stochastic manner by sampling from a predicted distribution. However, deterministic quantization suffers from severe codebook collapse and misaligned inference stage while stochastic quantization suffers from low codebook utilization and perturbed reconstruction objective. This paper presents a regularized vector quantization framework that allows to mitigate above issues effectively by applying regularization from two perspectives. The first is a prior distribution regularization which measures the discrepancy between a prior token distribution and predicted token distribution to avoid codebook collapse and low codebook utilization. 
The second is a stochastic mask regularization that introduces stochasticity during quantization to strike a good balance between inference stage misalignment and unperturbed reconstruction objective. In addition, we design a probabilistic contrastive loss which serves as a calibrated metric to further mitigate the perturbed reconstruction objective. Extensive experiments show that the proposed quantization framework outperforms prevailing vector quantizers consistently across different generative models including auto-regressive models and diffusion models. + + + + + + Multi-Level Logit Distillation http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_Multi-Level_Logit_Distillation_CVPR_2023_paper.pdf Knowledge Distillation (KD) aims at distilling the knowledge from a large teacher model to a lightweight student model. Mainstream KD methods can be divided into two categories, logit distillation and feature distillation. The former is easy to implement but inferior in performance, while the latter is not applicable to some practical circumstances due to concerns such as privacy and safety. To address this dilemma, in this paper, we explore a stronger logit distillation method by making better use of the logit outputs. Concretely, we propose a simple yet effective approach to logit distillation via multi-level prediction alignment. Through this framework, the prediction alignment is conducted not only at the instance level, but also at the batch and class level, through which the student model learns instance prediction, input correlation, and category correlation simultaneously. In addition, a prediction augmentation mechanism based on model calibration further boosts the performance. Extensive experimental results validate that our method enjoys consistently higher performance than previous logit distillation methods, and even reaches competitive performance with mainstream feature distillation methods. We promise to release our code and models to ensure reproducibility. + + + + DA Wand: Distortion-Aware Selection Using Neural Mesh Parameterization http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_DA_Wand_Distortion-Aware_Selection_Using_Neural_Mesh_Parameterization_CVPR_2023_paper.pdf We present a neural technique for learning to select a local sub-region around a point which can be used for mesh parameterization. The motivation for our framework comes from interactive workflows used for decaling, texturing, or painting on surfaces. Our key idea is to learn a local parameterization in a data-driven manner, using a novel differentiable parameterization layer within a neural network framework. We train a segmentation network to select 3D regions that are parameterized into 2D and penalized by the resulting distortion, giving rise to segmentations which are distortion-aware. Following training, a user can use our system to interactively select a point on the mesh and obtain a large, meaningful region around the selection which induces a low-distortion parameterization. Our code and project page are publicly available. + + + + Hierarchical Semantic Correspondence Networks for Video Paragraph Grounding http://openaccess.thecvf.com//content/CVPR2023/papers/Tan_Hierarchical_Semantic_Correspondence_Networks_for_Video_Paragraph_Grounding_CVPR_2023_paper.pdf Video Paragraph Grounding (VPG) is an essential yet challenging task in vision-language understanding, which aims to jointly localize multiple events from an untrimmed video with a paragraph query description.
One of the critical challenges in addressing this problem is to comprehend the complex semantic relations between visual and textual modalities. Previous methods focus on modeling the contextual information between the video and text from a single-level perspective (i.e., the sentence level), ignoring rich visual-textual correspondence relations at different semantic levels, e.g., the video-word and video-paragraph correspondence. To this end, we propose a novel Hierarchical Semantic Correspondence Network (HSCNet), which explores multi-level visual-textual correspondence by learning hierarchical semantic alignment and utilizes dense supervision by grounding diverse levels of queries. Specifically, we develop a hierarchical encoder that encodes the multi-modal inputs into semantics-aligned representations at different levels. To exploit the hierarchical semantic correspondence learned in the encoder for multi-level supervision, we further design a hierarchical decoder that progressively performs finer grounding for lower-level queries conditioned on higher-level semantics. Extensive experiments demonstrate the effectiveness of HSCNet and our method significantly outstrips the state-of-the-arts on two challenging benchmarks, i.e., ActivityNet-Captions and TACoS. + + + + Temporal Attention Unit: Towards Efficient Spatiotemporal Predictive Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Tan_Temporal_Attention_Unit_Towards_Efficient_Spatiotemporal_Predictive_Learning_CVPR_2023_paper.pdf + Spatiotemporal predictive learning aims to generate future frames by learning from historical frames. In this paper, we investigate existing methods and present a general framework of spatiotemporal predictive learning, in which the spatial encoder and decoder capture intra-frame features and the middle temporal module catches inter-frame correlations. While the mainstream methods employ recurrent units to capture long-term temporal dependencies, they suffer from low computational efficiency due to their unparallelizable architectures. To parallelize the temporal module, we propose the Temporal Attention Unit (TAU), which decomposes temporal attention into intra-frame statical attention and inter-frame dynamical attention. Moreover, while the mean squared error loss focuses on intra-frame errors, we introduce a novel differential divergence regularization to take inter-frame variations into account. Extensive experiments demonstrate that the proposed method enables the derived model to achieve competitive performance on various spatiotemporal prediction benchmarks. + + + + BiCro: Noisy Correspondence Rectification for Multi-Modality Data via Bi-Directional Cross-Modal Similarity Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_BiCro_Noisy_Correspondence_Rectification_for_Multi-Modality_Data_via_Bi-Directional_Cross-Modal_CVPR_2023_paper.pdf + As one of the most fundamental techniques in multimodal learning, cross-modal matching aims to project various sensory modalities into a shared feature space. To achieve this, massive and correctly aligned data pairs are required for model training. However, unlike unimodal datasets, multimodal datasets are extremely harder to collect and annotate precisely. As an alternative, the co-occurred data pairs (e.g., image-text pairs) collected from the Internet have been widely exploited in the area. 
Unfortunately, the cheaply collected dataset unavoidably contains many mismatched data pairs, which have been proven to be harmful to the model's performance. To address this, we propose a general framework called BiCro (Bidirectional Cross-modal similarity consistency), which can be easily integrated into existing cross-modal matching models and improve their robustness against noisy data. Specifically, BiCro aims to estimate soft labels for noisy data pairs to reflect their true correspondence degree. The basic idea of BiCro is that -- taking image-text matching as an example -- similar images should have similar textual descriptions and vice versa. The consistency of these two similarities can then be recast as estimated soft labels to train the matching model. Experiments on three popular cross-modal matching datasets demonstrate that our method significantly improves the noise-robustness of various matching models and surpasses the state-of-the-art by a clear margin. + + + + Transfer Knowledge From Head to Tail: Uncertainty Calibration Under Long-Tailed Distribution http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Transfer_Knowledge_From_Head_to_Tail_Uncertainty_Calibration_Under_Long-Tailed_CVPR_2023_paper.pdf How to estimate the uncertainty of a given model is a crucial problem. Current calibration techniques treat different classes equally and thus implicitly assume that the distribution of training data is balanced, ignoring the fact that real-world data often follows a long-tailed distribution. In this paper, we explore the problem of calibrating a model trained on a long-tailed distribution. Due to the difference between the imbalanced training distribution and the balanced test distribution, existing calibration methods such as temperature scaling cannot generalize well to this problem. Specific calibration methods for domain adaptation are also not applicable because they rely on unlabeled target domain instances, which are not available. Models trained on a long-tailed distribution tend to be more overconfident on head classes. To this end, we propose a novel knowledge-transferring-based calibration method that estimates importance weights for samples of tail classes to realize long-tailed calibration. Our method models the distribution of each class as a Gaussian distribution and views the source statistics of head classes as a prior to calibrate the target distributions of tail classes. We adaptively transfer knowledge from head classes to obtain the target probability density of tail classes. The importance weight is estimated by the ratio of the target probability density over the source probability density. Extensive experiments on CIFAR-10-LT, MNIST-LT, CIFAR-100-LT, and ImageNet-LT datasets demonstrate the effectiveness of our method. + + + + Global Vision Transformer Pruning With Hessian-Aware Saliency http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Global_Vision_Transformer_Pruning_With_Hessian-Aware_Saliency_CVPR_2023_paper.pdf Transformers yield state-of-the-art results across many tasks. However, their heuristically designed architectures impose huge computational costs during inference.
This work aims to challenge the common design philosophy of the Vision Transformer (ViT) model with a uniform dimension across all the stacked blocks in a model stage, where we redistribute the parameters both across transformer blocks and between different structures within the block via the first systematic attempt at global structural pruning. Dealing with diverse ViT structural components, we derive a novel Hessian-based structural pruning criterion that is comparable across all layers and structures, with latency-aware regularization for direct latency reduction. Performing iterative pruning on the DeiT-Base model leads to a new architecture family called NViT (Novel ViT), with a novel parameter redistribution that utilizes parameters more efficiently. On ImageNet-1K, NViT-Base achieves a 2.6x FLOPs reduction, 5.1x parameter reduction, and 1.9x run-time speedup over the DeiT-Base model in a near lossless manner. Smaller NViT variants achieve more than 1% accuracy gain at the same throughput as the DeiT Small/Tiny variants, as well as a lossless 3.3x parameter reduction over the SWIN-Small model. These results outperform prior art by a large margin. Further analysis is provided on the parameter redistribution insight of NViT, where we show the high prunability of ViT models, the distinct sensitivity within a ViT block, and a unique parameter distribution trend across stacked ViT blocks. Our insights suggest a simple yet effective parameter redistribution rule towards more efficient ViTs, providing an off-the-shelf performance boost. + + + + ScarceNet: Animal Pose Estimation With Scarce Annotations http://openaccess.thecvf.com//content/CVPR2023/papers/Li_ScarceNet_Animal_Pose_Estimation_With_Scarce_Annotations_CVPR_2023_paper.pdf Animal pose estimation is an important but under-explored task due to the lack of labeled data. In this paper, we tackle the task of animal pose estimation with scarce annotations, where only a small set of labeled data and unlabeled images are available. At the core of the solution to this problem setting is the use of the unlabeled data to compensate for the lack of well-labeled animal pose data. To this end, we propose ScarceNet, a pseudo-label-based approach that generates artificial labels for the unlabeled images. The pseudo labels, which are generated with a model trained on the small set of labeled images, are generally noisy and can hurt the performance when directly used for training. To solve this problem, we first use a small-loss trick to select reliable pseudo labels. Although effective, this selection is wasteful since numerous high-loss samples are left unused. We further propose to identify reusable samples from the high-loss samples based on an agreement check. Pseudo labels are re-generated to provide supervision for those reusable samples. Lastly, we introduce a student-teacher framework to enforce a consistency constraint, since there are still samples that are neither reliable nor reusable. By combining the reliable pseudo-label selection with the reusable sample re-labeling and the consistency constraint, we can make full use of the unlabeled data. We evaluate our approach on the challenging AP-10K dataset, where it outperforms existing semi-supervised approaches by a large margin. We also test on the TigDog dataset, where our approach achieves better performance than domain adaptation based approaches when only very few annotations are available. Our code is available at the project website.
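A minimal sketch of the small-loss selection plus agreement check described in the ScarceNet abstract above. The split ratio, the agreement threshold, and the use of two prediction sets are illustrative assumptions, not the paper's exact procedure.

import numpy as np

def select_pseudo_labels(losses, preds_a, preds_b, keep_ratio=0.3, agree_thresh=5.0):
    """Split pseudo-labeled samples into reliable / reusable / discarded sets.

    losses  : (N,) per-sample loss w.r.t. the pseudo labels (small-loss trick)
    preds_a : (N, K, 2) keypoint predictions from one network/view
    preds_b : (N, K, 2) keypoint predictions from a second network/view
    """
    n = len(losses)
    order = np.argsort(losses)
    reliable = order[: int(keep_ratio * n)]        # small-loss samples
    high_loss = order[int(keep_ratio * n):]

    # Agreement check: mean keypoint distance between the two prediction sets.
    disagreement = np.linalg.norm(preds_a - preds_b, axis=-1).mean(axis=-1)
    reusable = high_loss[disagreement[high_loss] < agree_thresh]
    discarded = high_loss[disagreement[high_loss] >= agree_thresh]
    return reliable, reusable, discarded

# Toy example: 10 samples, 17 keypoints each.
rng = np.random.default_rng(0)
r, u, d = select_pseudo_labels(rng.random(10),
                               rng.normal(size=(10, 17, 2)),
                               rng.normal(size=(10, 17, 2)))
print(len(r), len(u), len(d))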
+ + + + OmniCity: Omnipotent City Understanding With Multi-Level and Multi-View Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_OmniCity_Omnipotent_City_Understanding_With_Multi-Level_and_Multi-View_Images_CVPR_2023_paper.pdf + This paper presents OmniCity, a new dataset for omnipotent city understanding from multi-level and multi-view images. More precisely, OmniCity contains multi-view satellite images as well as street-level panorama and mono-view images, constituting over 100K pixel-wise annotated images that are well-aligned and collected from 25K geo-locations in New York City. To alleviate the substantial pixel-wise annotation efforts, we propose an efficient street-view image annotation pipeline that leverages the existing label maps of satellite view and the transformation relations between different views (satellite, panorama, and mono-view). With the new OmniCity dataset, we provide benchmarks for a variety of tasks including building footprint extraction, height estimation, and building plane/instance/fine-grained segmentation. Compared with existing multi-level and multi-view benchmarks, OmniCity contains a larger number of images with richer annotation types and more views, provides more benchmark results of state-of-the-art models, and introduces a new task for fine-grained building instance segmentation on street-level panorama images. Moreover, OmniCity provides new problem settings for existing tasks, such as cross-view image matching, synthesis, segmentation, detection, etc., and facilitates the developing of new methods for large-scale city understanding, reconstruction, and simulation. The OmniCity dataset as well as the benchmarks will be released at https://city-super.github.io/omnicity/. + + + + SViTT: Temporal Learning of Sparse Video-Text Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_SViTT_Temporal_Learning_of_Sparse_Video-Text_Transformers_CVPR_2023_paper.pdf + Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information by extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens. Trained with a curriculum which increases model sparsity with the clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, with a fraction of computational cost. Project page: http://svcl.ucsd.edu/projects/svitt. 
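The node-sparsity idea from the SViTT abstract above (discarding uninformative visual tokens) can be illustrated with a generic top-k pruning step that keeps the tokens most attended by a [CLS]-like token. The shapes, the attention source, and the keep ratio are assumptions for illustration, not SViTT's actual pruning rule.

import torch

def prune_tokens(tokens, cls_attention, keep_ratio=0.5):
    """Keep the most-attended visual tokens and drop the rest (node sparsity).

    tokens        : (B, N, D) visual token embeddings (excluding [CLS])
    cls_attention : (B, N) attention weight each token received from [CLS]
    """
    B, N, D = tokens.shape
    k = max(1, int(N * keep_ratio))
    idx = cls_attention.topk(k, dim=1).indices                    # (B, k)
    return torch.gather(tokens, 1, idx.unsqueeze(-1).expand(B, k, D))

# Toy usage: batch of 2 clips, 16 tokens of dim 8, keep half of them.
x, attn = torch.randn(2, 16, 8), torch.rand(2, 16)
print(prune_tokens(x, attn).shape)   # torch.Size([2, 8, 8])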
+ + + + Deep Fair Clustering via Maximizing and Minimizing Mutual Information: Theory, Algorithm and Metric + http://openaccess.thecvf.com//content/CVPR2023/papers/Zeng_Deep_Fair_Clustering_via_Maximizing_and_Minimizing_Mutual_Information_Theory_CVPR_2023_paper.pdf + Fair clustering aims to divide data into distinct clusters while preventing sensitive attributes (e.g., gender, race, RNA sequencing technique) from dominating the clustering. Although a number of works have been conducted and achieved huge success recently, most of them are heuristical, and there lacks a unified theory for algorithm design. In this work, we fill this blank by developing a mutual information theory for deep fair clustering and accordingly designing a novel algorithm, dubbed FCMI. In brief, through maximizing and minimizing mutual information, FCMI is designed to achieve four characteristics highly expected by deep fair clustering, i.e., compact, balanced, and fair clusters, as well as informative features. Besides the contributions to theory and algorithm, another contribution of this work is proposing a novel fair clustering metric built upon information theory as well. Unlike existing evaluation metrics, our metric measures the clustering quality and fairness as a whole instead of separate manner. To verify the effectiveness of the proposed FCMI, we conduct experiments on six benchmarks including a single-cell RNA-seq atlas compared with 11 state-of-the-art methods in terms of five metrics. The code could be accessed from https://pengxi.me. + + + + High Fidelity 3D Hand Shape Reconstruction via Scalable Graph Frequency Decomposition + http://openaccess.thecvf.com//content/CVPR2023/papers/Luan_High_Fidelity_3D_Hand_Shape_Reconstruction_via_Scalable_Graph_Frequency_CVPR_2023_paper.pdf + Despite the impressive performance obtained by recent single-image hand modeling techniques, they lack the capability to capture sufficient details of the 3D hand mesh. This deficiency greatly limits their applications when high fidelity hand modeling is required, e.g., personalized hand modeling. To address this problem, we design a frequency split network to generate 3D hand mesh using different frequency bands in a coarse-to-fine manner. To capture high-frequency personalized details, we transform the 3D mesh into the frequency domain, and propose a novel frequency decomposition loss to supervise each frequency component. By leveraging such a coarse-to-fine scheme, hand details that correspond to the higher frequency domain can be preserved. In addition, the proposed network is scalable, and can stop the inference at any resolution level to accommodate different hardwares with varying computational powers. To quantitatively evaluate the performance of our method in terms of recovering personalized shape details, we introduce a new evaluation metric named Mean Signal-to-Noise Ratio (MSNR) to measure the signal-to-noise ratio of each mesh frequency component. Extensive experiments demonstrate that our approach generates fine-grained details for high fidelity 3D hand reconstruction, and our evaluation metric is more effective for measuring mesh details compared with traditional metrics. 
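To make the graph frequency decomposition mentioned in the hand-reconstruction abstract above more concrete, the sketch below projects mesh vertex coordinates onto graph-Laplacian eigenvectors and reports a per-band signal-to-noise ratio. The band split and the SNR formula are rough assumptions; the paper's MSNR definition may differ in detail.

import numpy as np

def graph_laplacian(num_verts, edges):
    """Unnormalized graph Laplacian L = D - A of a mesh's vertex graph."""
    A = np.zeros((num_verts, num_verts))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return np.diag(A.sum(axis=1)) - A

def per_band_snr(verts_gt, verts_pred, edges, num_bands=3):
    """SNR (in dB) of each frequency band of the reconstructed mesh."""
    L = graph_laplacian(len(verts_gt), edges)
    _, U = np.linalg.eigh(L)                     # eigenvectors, low -> high frequency
    gt_hat = U.T @ verts_gt                      # spectral coefficients of the target
    err_hat = U.T @ (verts_pred - verts_gt)      # spectral coefficients of the error
    bands = np.array_split(np.arange(len(verts_gt)), num_bands)
    return [10 * np.log10(np.sum(gt_hat[b] ** 2) / (np.sum(err_hat[b] ** 2) + 1e-12))
            for b in bands]

# Toy example: a 4-vertex tetrahedral mesh with a small reconstruction error.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
gt = np.random.default_rng(1).normal(size=(4, 3))
pred = gt + 0.01 * np.random.default_rng(2).normal(size=(4, 3))
print(per_band_snr(gt, pred, edges))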
+ + + + COT: Unsupervised Domain Adaptation With Clustering and Optimal Transport + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_COT_Unsupervised_Domain_Adaptation_With_Clustering_and_Optimal_Transport_CVPR_2023_paper.pdf + Unsupervised domain adaptation (UDA) aims to transfer the knowledge from a labeled source domain to an unlabeled target domain. Typically, to guarantee desirable knowledge transfer, aligning the distribution between source and target domain from a global perspective is widely adopted in UDA. Recent researchers further point out the importance of local-level alignment and propose to construct instance-pair alignment by leveraging on Optimal Transport (OT) theory. However, existing OT-based UDA approaches are limited to handling class imbalance challenges and introduce a heavy computation overhead when considering a large-scale training situation. To cope with two aforementioned issues, we propose a Clustering-based Optimal Transport (COT) algorithm, which formulates the alignment procedure as an Optimal Transport problem and constructs a mapping between clustering centers in the source and target domain via an end-to-end manner. With this alignment on clustering centers, our COT eliminates the negative effect caused by class imbalance and reduces the computation cost simultaneously. Empirically, our COT achieves state-of-the-art performance on several authoritative benchmark datasets. + + + + Learning To Exploit the Sequence-Specific Prior Knowledge for Image Processing Pipelines Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Qin_Learning_To_Exploit_the_Sequence-Specific_Prior_Knowledge_for_Image_Processing_CVPR_2023_paper.pdf + The hardware image signal processing (ISP) pipeline is the intermediate layer between the imaging sensor and the downstream application, processing the sensor signal into an RGB image. The ISP is less programmable and consists of a series of processing modules. Each processing module handles a subtask and contains a set of tunable hyperparameters. A large number of hyperparameters form a complex mapping with the ISP output. The industry typically relies on manual and time-consuming hyperparameter tuning by image experts, biased towards human perception. Recently, several automatic ISP hyperparameter optimization methods using downstream evaluation metrics come into sight. However, existing methods for ISP tuning treat the high-dimensional parameter space as a global space for optimization and prediction all at once without inducing the structure knowledge of ISP. To this end, we propose a sequential ISP hyperparameter prediction framework that utilizes the sequential relationship within ISP modules and the similarity among parameters to guide the model sequence process. We validate the proposed method on object detection, image segmentation, and image quality tasks. + + + + Lite-Mono: A Lightweight CNN and Transformer Architecture for Self-Supervised Monocular Depth Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Lite-Mono_A_Lightweight_CNN_and_Transformer_Architecture_for_Self-Supervised_Monocular_CVPR_2023_paper.pdf + Self-supervised monocular depth estimation that does not require ground truth for training has attracted attention in recent years. It is of high interest to design lightweight but effective models so that they can be deployed on edge devices. Many existing architectures benefit from using heavier backbones at the expense of model sizes. 
This paper achieves comparable results with a lightweight architecture. Specifically, the efficient combination of CNNs and Transformers is investigated, and a hybrid architecture called Lite-Mono is presented. A Consecutive Dilated Convolutions (CDC) module and a Local-Global Features Interaction (LGFI) module are proposed. The former is used to extract rich multi-scale local features, and the latter takes advantage of the self-attention mechanism to encode long-range global information into the features. Experiments demonstrate that Lite-Mono outperforms Monodepth2 by a large margin in accuracy, with about 80% fewer trainable parameters. Our codes and models are available at https://github.com/noahzn/Lite-Mono. + + + + Neural Scene Chronology + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Neural_Scene_Chronology_CVPR_2023_paper.pdf + In this work, we aim to reconstruct a time-varying 3D model, capable of rendering photo-realistic renderings with independent control of viewpoint, illumination, and time, from Internet photos of large-scale landmarks. The core challenges are twofold. First, different types of temporal changes, such as illumination and changes to the underlying scene itself (such as replacing one graffiti artwork with another) are entangled together in the imagery. Second, scene-level temporal changes are often discrete and sporadic over time, rather than continuous. To tackle these problems, we propose a new scene representation equipped with a novel temporal step function encoding method that can model discrete scene-level content changes as piece-wise constant functions over time. Specifically, we represent the scene as a space-time radiance field with a per-image illumination embedding, where temporally-varying scene changes are encoded using a set of learned step functions. To facilitate our task of chronology reconstruction from Internet imagery, we also collect a new dataset of four scenes that exhibit various changes over time. We demonstrate that our method exhibits state-of-the-art view synthesis results on this dataset, while achieving independent control of viewpoint, time, and illumination. Code and data are available at https://zju3dv.github.io/NeuSC/. + + + + TIPI: Test Time Adaptation With Transformation Invariance + http://openaccess.thecvf.com//content/CVPR2023/papers/Nguyen_TIPI_Test_Time_Adaptation_With_Transformation_Invariance_CVPR_2023_paper.pdf + When deploying a machine learning model to a new environment, we often encounter the distribution shift problem -- meaning the target data distribution is different from the model's training distribution. In this paper, we assume that labels are not provided for this new domain, and that we do not store the source data (e.g., for privacy reasons). It has been shown that even small shifts in the data distribution can affect the model's performance severely. Test Time Adaptation offers a means to combat this problem, as it allows the model to adapt during test time to the new data distribution, using only unlabeled test data batches. To achieve this, the predominant approach is to optimize a surrogate loss on the test-time unlabeled target data. In particular, minimizing the prediction's entropy on target samples has received much interest as it is task-agnostic and does not require altering the model's training phase (e.g., does not require adding a self-supervised task during training on the source domain). 
However, as the target data's batch size is often small in real-world scenarios (e.g., autonomous driving models process each few frames in real-time), we argue that this surrogate loss is not optimal since it often collapses with small batch sizes. To tackle this problem, in this paper, we propose to use an invariance regularizer as the surrogate loss during test-time adaptation, motivated by our theoretical results regarding the model's performance under input transformations. The resulting method (TIPI -- Test tIme adaPtation with transformation Invariance) is validated with extensive experiments in various benchmarks (Cifar10-C, Cifar100-C, ImageNet-C, DIGITS, and VisDA17). Remarkably, TIPI is robust against small batch sizes (as small as 2 in our experiments), and consistently outperforms TENT in all settings. Our code is released at https://github.com/atuannguyen/TIPI. + + + + OTAvatar: One-Shot Talking Face Avatar With Controllable Tri-Plane Rendering + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_OTAvatar_One-Shot_Talking_Face_Avatar_With_Controllable_Tri-Plane_Rendering_CVPR_2023_paper.pdf + Controllability, generalizability and efficiency are the major objectives of constructing face avatars represented by neural implicit field. However, existing methods have not managed to accommodate the three requirements simultaneously. They either focus on static portraits, restricting the representation ability to a specific subject, or suffer from substantial computational cost, limiting their flexibility. In this paper, we propose One-shot Talking face Avatar (OTAvatar), which constructs face avatars by a generalized controllable tri-plane rendering solution so that each personalized avatar can be constructed from only one portrait as the reference. Specifically, OTAvatar first inverts a portrait image to a motion-free identity code. Second, the identity code and a motion code are utilized to modulate an efficient CNN to generate a tri-plane formulated volume, which encodes the subject in the desired motion. Finally, volume rendering is employed to generate an image in any view. The core of our solution is a novel decoupling-by-inverting strategy that disentangles identity and motion in the latent code via optimization-based inversion. Benefiting from the efficient tri-plane representation, we achieve controllable rendering of generalized face avatar at 35 FPS on A100. Experiments show promising performance of cross-identity reenactment on subjects out of the training set and better 3D consistency. The code is available at https://github.com/theEricMa/OTAvatar. + + + + Large-Capacity and Flexible Video Steganography via Invertible Neural Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Mou_Large-Capacity_and_Flexible_Video_Steganography_via_Invertible_Neural_Network_CVPR_2023_paper.pdf + Video steganography is the art of unobtrusively concealing secret data in a cover video and then recovering the secret data through a decoding protocol at the receiver end. Although several attempts have been made, most of them are limited to low-capacity and fixed steganography. To rectify these weaknesses, we propose a Large-capacity and Flexible Video Steganography Network (LF-VSN) in this paper. For large-capacity, we present a reversible pipeline to perform multiple videos hiding and recovering through a single invertible neural network (INN). Our method can hide/recover 7 secret videos in/from 1 cover video with promising performance. 
For flexibility, we propose a key-controllable scheme, enabling different receivers to recover particular secret videos from the same cover video through specific keys. Moreover, we further improve the flexibility by proposing a scalable strategy in multiple videos hiding, which can hide variable numbers of secret videos in a cover video with a single model and a single training session. Extensive experiments demonstrate that with the significant improvement of the video steganography performance, our proposed LF-VSN has high security, large hiding capacity, and flexibility. The source code is available at https://github.com/MC-E/LF-VSN. + + + + EVAL: Explainable Video Anomaly Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Singh_EVAL_Explainable_Video_Anomaly_Localization_CVPR_2023_paper.pdf + We develop a novel framework for single-scene video anomaly localization that allows for human-understandable reasons for the decisions the system makes. We first learn general representations of objects and their motions (using deep networks) and then use these representations to build a high-level, location-dependent model of any particular scene. This model can be used to detect anomalies in new videos of the same scene. Importantly, our approach is explainable -- our high-level appearance and motion features can provide human-understandable reasons for why any part of a video is classified as normal or anomalous. We conduct experiments on standard video anomaly detection datasets (Street Scene, CUHK Avenue, ShanghaiTech and UCSD Ped1, Ped2) and show significant improvements over the previous state-of-the-art. All of our code and extra datasets will be made publicly available. + + + + Position-Guided Text Prompt for Vision-Language Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Position-Guided_Text_Prompt_for_Vision-Language_Pre-Training_CVPR_2023_paper.pdf + Vision-Language Pre-Training (VLP) has shown promising capabilities to align image and text pairs, facilitating a broad variety of cross-modal learning tasks. However, we observe that VLP models often lack the visual grounding/localization capability which is critical for many downstream tasks such as visual reasoning. In this work, we propose a novel Position-guided Text Prompt (PTP) paradigm to enhance the visual grounding ability of cross-modal models trained with VLP. Specifically, in the VLP phase, PTP divides the image into NxN blocks, and identifies the objects in each block through the widely used object detector in VLP. It then reformulates the visual grounding task into a fill-in-the-blank problem given a PTP by encouraging the model to predict the objects in the given blocks or regress the blocks of a given object, e.g. filling "P" or "O" in a PTP "The block P has a O". This mechanism improves the visual grounding capability of VLP models and thus helps them better handle various downstream tasks. By introducing PTP into several state-of-the-art VLP frameworks, we observe consistently significant improvements across representative cross-modal learning model architectures and several benchmarks, e.g. zero-shot Flickr30K Retrieval (+4.8 in average recall@1) for ViLT baseline, and COCO Captioning (+5.3 in CIDEr) for SOTA BLIP baseline. Moreover, PTP achieves comparable results with object-detector based methods, and much faster inference speed since PTP discards its object detector for inference while the later cannot. Our code and pre-trained weight will be released. 
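A toy sketch of how position-guided prompts of the form "The block P has a O", as described in the PTP abstract above, could be generated from detector outputs. The block numbering, the prompt template details, and the function names are assumptions for illustration only.

def block_index(box, img_w, img_h, n=3):
    """Map a detected box (x1, y1, x2, y2) to one of the N x N image blocks
    based on its center point, numbered row-major from 1 to N*N."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    col = min(int(cx / img_w * n), n - 1)
    row = min(int(cy / img_h * n), n - 1)
    return row * n + col + 1

def position_guided_prompts(detections, img_w, img_h, n=3):
    """Build fill-in-the-blank style prompts from (label, box) detections."""
    return [f"The block {block_index(box, img_w, img_h, n)} has a {label}."
            for label, box in detections]

# Toy usage: two detected objects in a 300x300 image split into 3x3 blocks.
dets = [("dog", (10, 10, 90, 90)), ("ball", (200, 250, 240, 290))]
print(position_guided_prompts(dets, 300, 300))
# ['The block 1 has a dog.', 'The block 9 has a ball.']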
+ + + + HOLODIFFUSION: Training a 3D Diffusion Model Using 2D Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Karnewar_HOLODIFFUSION_Training_a_3D_Diffusion_Model_Using_2D_Images_CVPR_2023_paper.pdf + Diffusion models have emerged as the best approach for generative modeling of 2D images. Part of their success is due to the possibility of training them on millions if not billions of images with a stable learning objective. However, extending these models to 3D remains difficult for two reasons. First, finding a large quantity of 3D training data is much more complex than for 2D images. Second, while it is conceptually trivial to extend the models to operate on 3D rather than 2D grids, the associated cubic growth in memory and compute complexity makes this infeasible. We address the first challenge by introducing a new diffusion setup that can be trained, end-to-end, with only posed 2D images for supervision; and the second challenge by proposing an image formation model that decouples model memory from spatial memory. We evaluate our method on real-world data, using the CO3D dataset which has not been used to train 3D generative models before. We show that our diffusion models are scalable, train robustly, and are competitive in terms of sample quality and fidelity to existing approaches for 3D generative modeling. + + + + Stimulus Verification Is a Universal and Effective Sampler in Multi-Modal Human Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Stimulus_Verification_Is_a_Universal_and_Effective_Sampler_in_Multi-Modal_CVPR_2023_paper.pdf + To comprehensively cover the uncertainty of the future, the common practice of multi-modal human trajectory prediction is to first generate a set/distribution of candidate future trajectories and then sample required numbers of trajectories from them as final predictions. Even though a large number of previous researches develop various strong models to predict candidate trajectories, how to effectively sample the final ones has not received much attention yet. In this paper, we propose stimulus verification, serving as a universal and effective sampling process to improve the multi-modal prediction capability, where stimulus refers to the factor in the observation that may affect the future movements such as social interaction and scene context. Stimulus verification introduces a probabilistic model, denoted as stimulus verifier, to verify the coherence between a predicted future trajectory and its corresponding stimulus. By highlighting prediction samples with better stimulus-coherence, stimulus verification ensures sampled trajectories plausible from the stimulus' point of view and therefore aids in better multi-modal prediction performance. We implement stimulus verification on five representative prediction frameworks and conduct exhaustive experiments on three widely-used benchmarks. Superior results demonstrate the effectiveness of our approach. + + + + LoGoNet: Towards Accurate 3D Object Detection With Local-to-Global Cross-Modal Fusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_LoGoNet_Towards_Accurate_3D_Object_Detection_With_Local-to-Global_Cross-Modal_Fusion_CVPR_2023_paper.pdf + LiDAR-camera fusion methods have shown impressive performance in 3D object detection. Recent advanced multi-modal methods mainly perform global fusion, where image features and point cloud features are fused across the whole scene. 
Such practice lacks fine-grained region-level information, yielding suboptimal fusion performance. In this paper, we present the novel Local-to-Global fusion network (LoGoNet), which performs LiDAR-camera fusion at both local and global levels. Concretely, the Global Fusion (GoF) of LoGoNet is built upon previous literature, while we exclusively use point centroids to more precisely represent the position of voxel features, thus achieving better cross-modal alignment. As for the Local Fusion (LoF), we first divide each proposal into uniform grids and then project these grid centers to the images. The image features around the projected grid points are sampled to be fused with position-decorated point cloud features, maximally utilizing the rich contextual information around the proposals. The Feature Dynamic Aggregation (FDA) module is further proposed to achieve information interaction between these locally and globally fused features, thus producing more informative multi-modal features. Extensive experiments on both the Waymo Open Dataset (WOD) and KITTI datasets show that LoGoNet outperforms all state-of-the-art 3D detection methods. Notably, LoGoNet ranks 1st on the Waymo 3D object detection leaderboard and obtains 81.02 mAPH (L2) detection performance. It is noteworthy that, for the first time, the detection performance on three classes surpasses 80 APH (L2) simultaneously. Code will be available at https://github.com/sankin97/LoGoNet. + + + + ScaleKD: Distilling Scale-Aware Knowledge in Small Object Detector http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_ScaleKD_Distilling_Scale-Aware_Knowledge_in_Small_Object_Detector_CVPR_2023_paper.pdf Despite the prominent success of general object detection, the performance and efficiency of Small Object Detection (SOD) are still unsatisfactory. Unlike existing works that struggle to balance the trade-off between inference speed and SOD performance, in this paper, we propose a novel Scale-aware Knowledge Distillation (ScaleKD), which transfers knowledge of a complex teacher model to a compact student model. We design two novel modules to boost the quality of knowledge transfer in distillation for SOD: 1) a scale-decoupled feature distillation module that disentangles the teacher's feature representation into multi-scale embeddings, enabling explicit feature mimicking of the student model on small objects; 2) a cross-scale assistant that refines the noisy and uninformative bounding-box predictions of the student model, which can otherwise mislead the student and impair the efficacy of knowledge distillation. A multi-scale cross-attention layer is established to capture the multi-scale semantic information and improve the student model. We conduct experiments on the COCO and VisDrone datasets with diverse types of models, i.e., two-stage and one-stage detectors, to evaluate our proposed method. Our ScaleKD achieves superior general detection performance and obtains a substantial improvement in SOD performance. + + + + An Empirical Study of End-to-End Video-Language Transformers With Masked Visual Modeling http://openaccess.thecvf.com//content/CVPR2023/papers/Fu_An_Empirical_Study_of_End-to-End_Video-Language_Transformers_With_Masked_Visual_CVPR_2023_paper.pdf Masked visual modeling (MVM) has been recently proven effective for visual pre-training.
While similar reconstructive objectives on video inputs (e.g., masked frame modeling) have been explored in video-language (VidL) pre-training, previous studies fail to find a truly effective MVM strategy that can largely benefit the downstream performance. In this work, we systematically examine the potential of MVM in the context of VidL learning. Specifically, we base our study on a fully end-to-end VIdeO-LanguagE Transformer (VIOLET), where the supervision from MVM training can be backpropagated to the video pixel space. In total, eight different reconstructive targets of MVM are explored, from low-level pixel values and oriented gradients to high-level depth maps, optical flow, discrete visual tokens, and latent visual features. We conduct comprehensive experiments and provide insights into the factors leading to effective MVM training, resulting in an enhanced model VIOLETv2. Empirically, we show VIOLETv2 pre-trained with MVM objective achieves notable improvements on 13 VidL benchmarks, ranging from video question answering, video captioning, to text-to-video retrieval. + + + + MethaneMapper: Spectral Absorption Aware Hyperspectral Transformer for Methane Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Kumar_MethaneMapper_Spectral_Absorption_Aware_Hyperspectral_Transformer_for_Methane_Detection_CVPR_2023_paper.pdf + Methane (CH 4 ) is the chief contributor to global climate change. Recent Airborne Visible-Infrared Imaging Spectrometer-Next Generation (AVIRIS-NG) has been very useful in quantitative mapping of methane emissions. Existing methods for analyzing this data are sensitive to local terrain conditions, often require manual inspection from domain experts, prone to significant error and hence are not scalable. To address these challenges, we propose a novel end-to-end spectral absorption wavelength aware transformer network, MethaneMapper, to detect and quantify the emissions. MethaneMapper introduces two novel modules that help to locate the most relevant methane plume regions in the spectral domain and uses them to localize these accurately. Thorough evaluation shows that MethaneMapper achieves 0.63 mAP in detection and reduces the model size (by 5x) compared to the current state of the art. In addition, we also introduce a large-scale dataset of methane plume segmentation mask for over 1200 AVIRIS-NG flightlines from 2015-2022. It contains over 4000 methane plume sites. Our dataset will provide researchers the opportunity to develop and advance new methods for tackling this challenging green-house gas detection problem with significant broader social impact. Dataset and source code link. + + + + Autonomous Manipulation Learning for Similar Deformable Objects via Only One Demonstration + http://openaccess.thecvf.com//content/CVPR2023/papers/Ren_Autonomous_Manipulation_Learning_for_Similar_Deformable_Objects_via_Only_One_CVPR_2023_paper.pdf + In comparison with most methods focusing on 3D rigid object recognition and manipulation, deformable objects are more common in our real life but attract less attention. Generally, most existing methods for deformable object manipulation suffer two issues, 1) Massive demonstration: repeating thousands of robot-object demonstrations for model training of one specific instance; 2) Poor generalization: inevitably re-training for transferring the learned skill to a similar/new instance from the same category. 
Therefore, we propose a category-level deformable 3D object manipulation framework, which could manipulate deformable 3D objects with only one demonstration and generalize the learned skills to new similar instances without re-training. Specifically, our proposed framework consists of two modules. The Nocs State Transform (NST) module transfers the observed point clouds of the target to a pre-defined unified pose state (i.e., Nocs state), which is the foundation for the category-level manipulation learning; the Neural Spatial Encoding (NSE) module generalizes the learned skill to novel instances by encoding the category-level spatial information to pursue the expected grasping point without re-training. The relative motion path is then planned to achieve autonomous manipulation. Both the simulated results via our Cap40 dataset and real robotic experiments justify the effectiveness of our framework. + + + + Representation Learning for Visual Object Tracking by Masked Appearance Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Representation_Learning_for_Visual_Object_Tracking_by_Masked_Appearance_Transfer_CVPR_2023_paper.pdf + Visual representation plays an important role in visual object tracking. However, few works study the tracking-specified representation learning method. Most trackers directly use ImageNet pre-trained representations. In this paper, we propose masked appearance transfer, a simple but effective representation learning method for tracking, based on an encoder-decoder architecture. First, we encode the visual appearances of the template and search region jointly, and then we decode them separately. During decoding, the original search region image is reconstructed. However, for the template, we make the decoder reconstruct the target appearance within the search region. By this target appearance transfer, the tracking-specified representations are learned. We randomly mask out the inputs, thereby making the learned representations more discriminative. For sufficient evaluation, we design a simple and lightweight tracker that can evaluate the representation for both target localization and box regression. Extensive experiments show that the proposed method is effective, and the learned representations can enable the simple tracker to obtain state-of-the-art performance on six datasets. + + + + Learning To Name Classes for Vision and Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Parisot_Learning_To_Name_Classes_for_Vision_and_Language_Models_CVPR_2023_paper.pdf + Large scale vision and language models can achieve impressive zero-shot recognition performance by mapping class specific text queries to image content. Two distinct challenges that remain however, are high sensitivity to the choice of handcrafted class names that define queries, and the difficulty of adaptation to new, smaller datasets. Towards addressing these problems, we propose to leverage available data to learn, for each class, an optimal word embedding as a function of the visual content. By learning new word embeddings on an otherwise frozen model, we are able to retain zero-shot capabilities for new classes, easily adapt models to new datasets, and adjust potentially erroneous, non-descriptive or ambiguous class names. We show that our solution can easily be integrated in image classification and object detection pipelines, yields significant performance gains in multiple scenarios and provides insights into model biases and labelling errors. 
+ + + + Nighttime Smartphone Reflective Flare Removal Using Optical Center Symmetry Prior + http://openaccess.thecvf.com//content/CVPR2023/papers/Dai_Nighttime_Smartphone_Reflective_Flare_Removal_Using_Optical_Center_Symmetry_Prior_CVPR_2023_paper.pdf + Reflective flare is a phenomenon that occurs when light reflects inside lenses, causing bright spots or a "ghosting effect" in photos, which can impact their quality. Eliminating reflective flare is highly desirable but challenging. Many existing methods rely on manually designed features to detect these bright spots, but they often fail to identify reflective flares created by various types of light and may even mistakenly remove the light sources in scenarios with multiple light sources. To address these challenges, we propose an optical center symmetry prior, which suggests that the reflective flare and light source are always symmetrical around the lens's optical center. This prior helps to locate the reflective flare's proposal region more accurately and can be applied to most smartphone cameras. Building on this prior, we create the first reflective flare removal dataset called BracketFlare, which contains diverse and realistic reflective flare patterns. We use continuous bracketing to capture the reflective flare pattern in the underexposed image and combine it with a normally exposed image to synthesize a pair of flare-corrupted and flare-free images. With the dataset, neural networks can be trained to remove the reflective flares effectively. Extensive experiments demonstrate the effectiveness of our method on both synthetic and real-world datasets. + + + + + + Box-Level Active Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Lyu_Box-Level_Active_Detection_CVPR_2023_paper.pdf + Active learning selects informative samples for annotation within budget, which has proven efficient recently on object detection. However, the widely used active detection benchmarks conduct image-level evaluation, which is unrealistic in human workload estimation and biased towards crowded images. Furthermore, existing methods still perform image-level annotation, but equally scoring all targets within the same image incurs waste of budget and redundant labels. Having revealed above problems and limitations, we introduce a box-level active detection framework that controls a box-based budget per cycle, prioritizes informative targets and avoids redundancy for fair comparison and efficient application. Under the proposed box-level setting, we devise a novel pipeline, namely Complementary Pseudo Active Strategy (ComPAS). It exploits both human annotations and the model intelligence in a complementary fashion: an efficient input-end committee queries labels for informative objects only; meantime well-learned targets are identified by the model and compensated with pseudo-labels. ComPAS consistently outperforms 10 competitors under 4 settings in a unified codebase. With supervision from labeled data only, it achieves 100% supervised performance of VOC0712 with merely 19% box annotations. On the COCO dataset, it yields up to 4.3% mAP improvement over the second-best method. ComPAS also supports training with the unlabeled pool, where it surpasses 90% COCO supervised performance with 85% label reduction. Our source code is publicly available at https://github.com/lyumengyao/blad. 
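A highly simplified sketch of one box-level acquisition cycle in the spirit of the active detection pipeline above: boxes with high committee disagreement are queried for human labels until the box budget is exhausted, while confident remaining boxes are kept as pseudo-labels. The scoring fields, the confidence threshold, and the function name are illustrative assumptions, not ComPAS itself.

def box_level_acquisition(candidates, budget, pseudo_thresh=0.9):
    """Split candidate boxes into a human-query set and a pseudo-label set.

    candidates : list of dicts with 'box', 'confidence' (model score in [0, 1])
                 and 'disagreement' (committee disagreement; higher = more informative)
    budget     : maximum number of boxes sent to annotators this cycle
    """
    ranked = sorted(candidates, key=lambda c: c["disagreement"], reverse=True)
    to_annotate = ranked[:budget]                    # most informative boxes
    rest = ranked[budget:]
    pseudo_labeled = [c for c in rest if c["confidence"] >= pseudo_thresh]
    return to_annotate, pseudo_labeled

# Toy usage with four candidate boxes and a budget of 2 per cycle.
cands = [
    {"box": (0, 0, 10, 10), "confidence": 0.95, "disagreement": 0.1},
    {"box": (5, 5, 20, 20), "confidence": 0.40, "disagreement": 0.8},
    {"box": (8, 2, 30, 15), "confidence": 0.60, "disagreement": 0.7},
    {"box": (1, 1, 12, 14), "confidence": 0.92, "disagreement": 0.2},
]
query, pseudo = box_level_acquisition(cands, budget=2)
print(len(query), len(pseudo))   # 2 2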
+ + + + Self-Supervised Non-Uniform Kernel Estimation With Flow-Based Motion Prior for Blind Image Deblurring http://openaccess.thecvf.com//content/CVPR2023/papers/Fang_Self-Supervised_Non-Uniform_Kernel_Estimation_With_Flow-Based_Motion_Prior_for_Blind_CVPR_2023_paper.pdf Many deep learning-based solutions to blind image deblurring estimate the blur representation and reconstruct the target image from its blurry observation. However, these methods suffer from severe performance degradation in real-world scenarios because they ignore important prior information about motion blur (e.g., real-world motion blur is diverse and spatially varying). Some methods have attempted to explicitly estimate non-uniform blur kernels by CNNs, but accurate estimation is still challenging due to the lack of ground truth about spatially varying blur kernels in real-world images. To address these issues, we propose to represent the field of motion blur kernels in a latent space by normalizing flows, and design CNNs to predict the latent codes instead of motion kernels. To further improve the accuracy and robustness of non-uniform kernel estimation, we introduce uncertainty learning into the process of estimating latent codes and propose a multi-scale kernel attention module to better integrate image features with estimated kernels. Extensive experimental results, especially on real-world blur datasets, demonstrate that our method achieves state-of-the-art results in terms of both subjective and objective quality as well as excellent generalization performance for non-uniform image deblurring. The code is available at https://see.xidian.edu.cn/faculty/wsdong/Projects/UFPNet.htm. + + + + Collecting Cross-Modal Presence-Absence Evidence for Weakly-Supervised Audio-Visual Event Perception http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_Collecting_Cross-Modal_Presence-Absence_Evidence_for_Weakly-Supervised_Audio-Visual_Event_Perception_CVPR_2023_paper.pdf With only video-level event labels, this paper targets the task of weakly-supervised audio-visual event perception (WS-AVEP), which aims to temporally localize and categorize events belonging to each modality. Despite the recent progress, most existing approaches either ignore the unsynchronized property of audio-visual tracks or discount the complementary modality for explicit enhancement. We argue that, for an event residing in one modality, the modality itself should provide ample presence evidence of this event, while the other, complementary modality is encouraged to afford the absence evidence as a reference signal. To this end, we propose to collect Cross-Modal Presence-Absence Evidence (CMPAE) in a unified framework. Specifically, by leveraging uni-modal and cross-modal representations, a presence-absence evidence collector (PAEC) is designed under Subjective Logic theory. To learn the evidence in a reliable range, we propose a joint-modal mutual learning (JML) process, which calibrates the evidence of diverse audible, visible, and audio-visible events adaptively and dynamically. Extensive experiments show that our method surpasses the state of the art (e.g., absolute gains of 3.6% and 6.1% in terms of event-level visual and audio metrics). Code is available at github.com/MengyuanChen21/CVPR2023-CMPAE.
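For readers unfamiliar with the Subjective Logic machinery referenced in the CMPAE abstract above, the standard evidence-to-opinion mapping (belief b_k = e_k / S, uncertainty u = K / S, with S = sum_k (e_k + 1)) looks like this in code. This is the generic Dirichlet-based formulation, not the paper's specific evidence collector.

import numpy as np

def evidence_to_opinion(evidence):
    """Map non-negative class evidence to a subjective-logic opinion.

    Using the usual Dirichlet parameterization alpha_k = e_k + 1:
      belief_k = e_k / S, uncertainty = K / S, with S = sum_k alpha_k,
    so that the beliefs and the uncertainty sum to 1.
    """
    evidence = np.asarray(evidence, dtype=float)
    alpha = evidence + 1.0
    S = alpha.sum()
    belief = evidence / S
    uncertainty = len(evidence) / S
    return belief, uncertainty

# Strong evidence for class 0 -> low uncertainty; no evidence -> maximal uncertainty.
print(evidence_to_opinion([9.0, 0.0, 0.0]))   # beliefs [0.75, 0, 0], u = 0.25
print(evidence_to_opinion([0.0, 0.0, 0.0]))   # beliefs [0, 0, 0], u = 1.0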
+ + + + AVFace: Towards Detailed Audio-Visual 4D Face Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Chatziagapi_AVFace_Towards_Detailed_Audio-Visual_4D_Face_Reconstruction_CVPR_2023_paper.pdf + In this work, we present a multimodal solution to the problem of 4D face reconstruction from monocular videos. 3D face reconstruction from 2D images is an under-constrained problem due to the ambiguity of depth. State-of-the-art methods try to solve this problem by leveraging visual information from a single image or video, whereas 3D mesh animation approaches rely more on audio. However, in most cases (e.g. AR/VR applications), videos include both visual and speech information. We propose AVFace that incorporates both modalities and accurately reconstructs the 4D facial and lip motion of any speaker, without requiring any 3D ground truth for training. A coarse stage estimates the per-frame parameters of a 3D morphable model, followed by a lip refinement, and then a fine stage recovers facial geometric details. Due to the temporal audio and video information captured by transformer-based modules, our method is robust in cases when either modality is insufficient (e.g. face occlusions). Extensive qualitative and quantitative evaluation demonstrates the superiority of our method over the current state-of-the-art. + + + + ERM-KTP: Knowledge-Level Machine Unlearning via Knowledge Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_ERM-KTP_Knowledge-Level_Machine_Unlearning_via_Knowledge_Transfer_CVPR_2023_paper.pdf + Machine unlearning can fortify the privacy and security of machine learning applications. Unfortunately, the exact unlearning approaches are inefficient, and the approximate unlearning approaches are unsuitable for complicated CNNs. Moreover, the approximate approaches have serious security flaws because even unlearning completely different data points can produce the same contribution estimation as unlearning the target data points. To address the above problems, we try to define machine unlearning from the knowledge perspective, and we propose a knowledge-level machine unlearning method, namely ERM-KTP. Specifically, we propose an entanglement-reduced mask (ERM) structure to reduce the knowledge entanglement among classes during the training phase. When receiving the unlearning requests, we transfer the knowledge of the non-target data points from the original model to the unlearned model and meanwhile prohibit the knowledge of the target data points via our proposed knowledge transfer and prohibition (KTP) method. Finally, we will get the unlearned model as the result and delete the original model to accomplish the unlearning process. Especially, our proposed ERM-KTP is an interpretable unlearning method because the ERM structure and the crafted masks in KTP can explicitly explain the operation and the effect of unlearning data points. Extensive experiments demonstrate the effectiveness, efficiency, high fidelity, and scalability of the ERM-KTP unlearning method. + + + + DATE: Domain Adaptive Product Seeker for E-Commerce + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_DATE_Domain_Adaptive_Product_Seeker_for_E-Commerce_CVPR_2023_paper.pdf + Product Retrieval (PR) and Grounding (PG), aiming to seek image and object-level products respectively according to a textual query, have attracted great interest recently for better shopping experience. 
Owing to the lack of relevant datasets, we collect two large-scale benchmark datasets from Taobao Mall and Live domains with about 474k and 101k image-query pairs for PR, and manually annotate the object bounding boxes in each image for PG. As annotating boxes is expensive and time-consuming, we attempt to transfer knowledge from annotated domain to unannotated for PG to achieve un-supervised Domain Adaptation (PG-DA). We propose a Domain Adaptive producT sEeker (DATE) framework, regarding PR and PG as Product Seeking problem at different levels, to assist the query date the product. Concretely, we first design a semantics-aggregated feature extractor for each modality to obtain concentrated and comprehensive features for following efficient retrieval and fine-grained grounding tasks. Then, we present two cooperative seekers to simultaneously search the image for PR and localize the product for PG. Besides, we devise a domain aligner for PG-DA to alleviate uni-modal marginal and multi-modal conditional distribution shift between source and target domains, and design a pseudo box generator to dynamically select reliable instances and generate bounding boxes for further knowledge transfer. Extensive experiments show that our DATE achieves satisfactory performance in fully-supervised PR, PG and un-supervised PG-DA. Our desensitized datasets will be publicly available here https://github.com/Taobao-live/Product-Seeking. + + + + Self-Supervised Super-Plane for Neural 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_Self-Supervised_Super-Plane_for_Neural_3D_Reconstruction_CVPR_2023_paper.pdf + Neural implicit surface representation methods show impressive reconstruction results but struggle to handle texture-less planar regions that widely exist in indoor scenes. Existing approaches addressing this leverage image prior that requires assistive networks trained with large-scale annotated datasets. In this work, we introduce a self-supervised super-plane constraint by exploring the free geometry cues from the predicted surface, which can further regularize the reconstruction of plane regions without any other ground truth annotations. Specifically, we introduce an iterative training scheme, where (i) grouping of pixels to formulate a super-plane (analogous to super-pixels), and (ii) optimizing of the scene reconstruction network via a super-plane constraint, are progressively conducted. We demonstrate that the model trained with super-planes surprisingly outperforms the one using conventional annotated planes, as individual super-plane statistically occupies a larger area and leads to more stable training. Extensive experiments show that our self-supervised super-plane constraint significantly improves 3D reconstruction quality even better than using ground truth plane segmentation. Additionally, the plane reconstruction results from our model can be used for auto-labeling for other vision tasks. The code and models are available at https: //github.com/botaoye/S3PRecon. + + + + DisCo-CLIP: A Distributed Contrastive Loss for Memory Efficient CLIP Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_DisCo-CLIP_A_Distributed_Contrastive_Loss_for_Memory_Efficient_CLIP_Training_CVPR_2023_paper.pdf + We propose DisCo-CLIP, a distributed memory-efficient CLIP training approach, to reduce the memory consumption of contrastive loss when training contrastive learning models. 
Our approach decomposes the contrastive loss and its gradient computation into two parts, one to calculate the intra-GPU gradients and the other to compute the inter-GPU gradients. According to our decomposition, only the intra-GPU gradients are computed on the current GPU, while the inter-GPU gradients are collected via all_reduce from other GPUs instead of being repeatedly computed on every GPU. In this way, we can reduce the GPU memory consumption of contrastive loss computation from O(B^2) to O(B^2 / N), where B and N are the batch size and the number of GPUs used for training. Such a distributed solution is mathematically equivalent to the original non-distributed contrastive loss computation, without sacrificing any computation accuracy. It is particularly efficient for large-batch CLIP training. For instance, DisCo-CLIP can enable contrastive training of a ViT-B/32 model with a batch size of 32K or 196K using 8 or 64 A100 40GB GPUs, compared with the original CLIP solution which requires 128 A100 40GB GPUs to train a ViT-B/32 model with a batch size of 32K. + + + + GM-NeRF: Learning Generalizable Model-Based Neural Radiance Fields From Multi-View Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_GM-NeRF_Learning_Generalizable_Model-Based_Neural_Radiance_Fields_From_Multi-View_Images_CVPR_2023_paper.pdf + In this work, we focus on synthesizing high-fidelity novel view images for arbitrary human performers, given a set of sparse multi-view images. It is a challenging task due to the large variation among articulated body poses and heavy self-occlusions. To alleviate this, we introduce an effective generalizable framework Generalizable Model-based Neural Radiance Fields (GM-NeRF) to synthesize free-viewpoint images. Specifically, we propose a geometry-guided attention mechanism to register the appearance code from multi-view 2D images to a geometry proxy which can alleviate the misalignment between inaccurate geometry prior and pixel space. On top of that, we further conduct neural rendering and partial gradient backpropagation for efficient perceptual supervision and improvement of the perceptual quality of synthesis. To evaluate our method, we conduct experiments on synthesized datasets THuman2.0 and Multi-garment, and real-world datasets Genebody and ZJUMocap. The results demonstrate that our approach outperforms state-of-the-art methods in terms of novel view synthesis and geometric reconstruction. + + + + Perspective Fields for Single Image Camera Calibration + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_Perspective_Fields_for_Single_Image_Camera_Calibration_CVPR_2023_paper.pdf + Geometric camera calibration is often required for applications that understand the perspective of the image. We propose perspective fields as a representation that models the local perspective properties of an image. Perspective Fields contain per-pixel information about the camera view, parameterized as an up vector and a latitude value. This representation has a number of advantages as it makes minimal assumptions about the camera model and is invariant or equivariant to common image editing operations like cropping, warping, and rotation. It is also more interpretable and aligned with human perception. We train a neural network to predict Perspective Fields and the predicted Perspective Fields can be converted to calibration parameters easily. 
We demonstrate the robustness of our approach under various scenarios compared with camera calibration-based methods and show example applications in image compositing. Project page: https://jinlinyi.github.io/PerspectiveFields/ + + + + Towards Accurate Image Coding: Improved Autoregressive Image Generation With Dynamic Vector Quantization + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Towards_Accurate_Image_Coding_Improved_Autoregressive_Image_Generation_With_Dynamic_CVPR_2023_paper.pdf + Existing vector quantization (VQ) based autoregressive models follow a two-stage generation paradigm that first learns a codebook to encode images as discrete codes, and then completes generation based on the learned codebook. However, they encode fixed-size image regions into fixed-length codes and ignore their naturally different information densities, which results in insufficiency in important regions and redundancy in unimportant ones, and finally degrades the generation quality and speed. Moreover, the fixed-length coding leads to an unnatural raster-scan autoregressive generation. To address the problem, we propose a novel two-stage framework: (1) Dynamic-Quantization VAE (DQ-VAE) which encodes image regions into variable-length codes based on their information densities for an accurate & compact code representation. (2) DQ-Transformer which thereby generates images autoregressively from coarse-grained (smooth regions with fewer codes) to fine-grained (details regions with more codes) by modeling the position and content of codes in each granularity alternately, through a novel stacked-transformer architecture and shared-content, non-shared position input layers designs. Comprehensive experiments on various generation tasks validate our superiorities in both effectiveness and efficiency. + + + + WINNER: Weakly-Supervised hIerarchical decompositioN and aligNment for Spatio-tEmporal Video gRounding + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_WINNER_Weakly-Supervised_hIerarchical_decompositioN_and_aligNment_for_Spatio-tEmporal_Video_gRounding_CVPR_2023_paper.pdf + Spatio-temporal video grounding aims to localize the aligned visual tube corresponding to a language query. Existing techniques achieve such alignment by exploiting dense boundary and bounding box annotations, which can be prohibitively expensive. To bridge the gap, we investigate the weakly-supervised setting, where models learn from easily accessible video-language data without annotations. We identify that intra-sample spurious correlations among video-language components can be alleviated if the model captures the decomposed structures of video and language data. In this light, we propose a novel framework, namely WINNER, for hierarchical video-text understanding. WINNER first builds the language decomposition tree in a bottom-up manner, upon which the structural attention mechanism and top-down feature backtracking jointly build a multi-modal decomposition tree, permitting a hierarchical understanding of unstructured videos. The multi-modal decomposition tree serves as the basis for multi-hierarchy language-tube matching. A hierarchical contrastive learning objective is proposed to learn the multi-hierarchy correspondence and distinguishment with intra-sample and inter-sample video-text decomposition structures, achieving video-language decomposition structure alignment. 
Extensive experiments demonstrate the rationality of our design and its effectiveness beyond state-of-the-art weakly supervised methods, and even beyond some supervised methods. + + + + Preserving Linear Separability in Continual Learning by Backward Feature Projection + http://openaccess.thecvf.com//content/CVPR2023/papers/Gu_Preserving_Linear_Separability_in_Continual_Learning_by_Backward_Feature_Projection_CVPR_2023_paper.pdf + Catastrophic forgetting has been a major challenge in continual learning, where the model needs to learn new tasks with limited or no access to data from previously seen tasks. To tackle this challenge, methods based on knowledge distillation in feature space have been proposed and shown to reduce forgetting. However, most feature distillation methods directly constrain the new features to match the old ones, overlooking the need for plasticity. To achieve a better stability-plasticity trade-off, we propose Backward Feature Projection (BFP), a method for continual learning that allows the new features to change up to a learnable linear transformation of the old features. BFP preserves the linear separability of the old classes while allowing the emergence of new feature directions to accommodate new classes. BFP can be integrated with existing experience replay methods and boost performance by a significant margin. We also demonstrate that BFP helps learn a better representation space, in which linear separability is well preserved during continual learning and linear probing achieves high classification accuracy. + + + + MHPL: Minimum Happy Points Learning for Active Source Free Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_MHPL_Minimum_Happy_Points_Learning_for_Active_Source_Free_Domain_CVPR_2023_paper.pdf + Source free domain adaptation (SFDA) aims to transfer a trained source model to the unlabeled target domain without accessing the source data. However, the SFDA setting faces a performance bottleneck due to the absence of source data and target supervised information, as evidenced by the limited performance gains of the newest SFDA methods. Active source free domain adaptation (ASFDA) can break through this bottleneck by exploring and exploiting a small set of informative samples via active learning. In this paper, we first find that the best points to select are those that are neighbor-chaotic, individually different, and source-dissimilar. We define them as minimum happy (MH) points, which are challenging to explore with existing methods. We propose minimum happy points learning (MHPL) to explore and exploit MH points actively. We design three unique strategies, neighbor environment uncertainty, neighbor diversity relaxation, and one-shot querying, to explore the MH points. Further, to fully exploit MH points in the learning process, we design a neighbor focal loss that assigns the weighted neighbor purity to the cross-entropy loss of MH points so that the model focuses more on them. Extensive experiments verify that MHPL remarkably exceeds various types of baselines and achieves significant performance gains at a small labeling cost. + + + + Metadata-Based RAW Reconstruction via Implicit Neural Functions + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Metadata-Based_RAW_Reconstruction_via_Implicit_Neural_Functions_CVPR_2023_paper.pdf + Many low-level computer vision tasks benefit from using the unprocessed RAW image as input, which retains the linear relationship between pixel values and scene radiance.
Recent works advocate embedding sampled RAW image values into sRGB images at capture time, and reconstructing the RAW image from the sRGB image using this metadata when needed. However, there are still limitations in making full use of the metadata. In this paper, instead of following the perspective of sRGB-to-RAW mapping, we reformulate the problem as mapping the 2D coordinates of the metadata to their RAW values conditioned on the corresponding sRGB values. With this novel formulation, we propose to reconstruct the RAW image with an implicit neural function, which achieves a significant performance improvement (more than 10 dB average PSNR) with only uniform sampling. Compared with most deep learning-based approaches, our method is trained in a self-supervised way and requires no pre-training on different camera ISPs. We perform further experiments to demonstrate the effectiveness of our method, and show that our framework is also suitable for the task of guided super-resolution. + + + + Multimodality Helps Unimodality: Cross-Modal Few-Shot Learning With Multimodal Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Multimodality_Helps_Unimodality_Cross-Modal_Few-Shot_Learning_With_Multimodal_Models_CVPR_2023_paper.pdf + The ability to quickly learn a new task with minimal instruction - known as few-shot learning - is a central aspect of intelligent agents. Classical few-shot benchmarks make use of few-shot samples from a single modality, but such samples may not be sufficient to characterize an entire concept class. In contrast, humans use cross-modal information to learn new concepts efficiently. In this work, we demonstrate that one can indeed build a better visual dog classifier by reading about dogs and listening to them bark. To do so, we exploit the fact that recent multimodal foundation models such as CLIP are inherently cross-modal, mapping different modalities to the same representation space. Specifically, we propose a simple cross-modal adaptation approach that learns from few-shot examples spanning different modalities. By repurposing class names as additional one-shot training samples, we achieve SOTA results with an embarrassingly simple linear classifier for vision-language adaptation. Furthermore, we show that our approach can benefit existing methods such as prefix tuning and classifier ensembling. Finally, to explore other modalities beyond vision and language, we construct the first (to our knowledge) audiovisual few-shot benchmark and use cross-modal training to improve the performance of both image and audio classification. We hope our success can inspire future works to embrace cross-modality for even broader domains and tasks. + + + + 3D Highlighter: Localizing Regions on 3D Shapes via Text Descriptions + http://openaccess.thecvf.com//content/CVPR2023/papers/Decatur_3D_Highlighter_Localizing_Regions_on_3D_Shapes_via_Text_Descriptions_CVPR_2023_paper.pdf + We present 3D Highlighter, a technique for localizing semantic regions on a mesh using text as input. A key feature of our system is the ability to interpret "out-of-domain" localizations. Our system demonstrates the ability to reason about where to place non-obviously related concepts on an input 3D shape, such as adding clothing to a bare 3D animal model. Our method contextualizes the text description using a neural field and colors the corresponding region of the shape using a probability-weighted blend.
Our neural optimization is guided by a pre-trained CLIP encoder, which bypasses the need for any 3D datasets or 3D annotations. Thus, 3D Highlighter is highly flexible, general, and capable of producing localizations on a myriad of input shapes. + + + + Iterative Geometry Encoding Volume for Stereo Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Iterative_Geometry_Encoding_Volume_for_Stereo_Matching_CVPR_2023_paper.pdf + Recurrent All-Pairs Field Transforms (RAFT) has shown great potentials in matching tasks. However, all-pairs correlations lack non-local geometry knowledge and have difficulties tackling local ambiguities in ill-posed regions. In this paper, we propose Iterative Geometry Encoding Volume (IGEV-Stereo), a new deep network architecture for stereo matching. The proposed IGEV-Stereo builds a combined geometry encoding volume that encodes geometry and context information as well as local matching details, and iteratively indexes it to update the disparity map. To speed up the convergence, we exploit GEV to regress an accurate starting point for ConvGRUs iterations. Our IGEV-Stereo ranks first on KITTI 2015 and 2012 (Reflective) among all published methods and is the fastest among the top 10 methods. In addition, IGEV-Stereo has strong cross-dataset generalization as well as high inference efficiency. We also extend our IGEV to multi-view stereo (MVS), i.e. IGEV-MVS, which achieves competitive accuracy on DTU benchmark. Code is available at https://github.com/gangweiX/IGEV. + + + + GRES: Generalized Referring Expression Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_GRES_Generalized_Referring_Expression_Segmentation_CVPR_2023_paper.pdf + Referring Expression Segmentation (RES) aims to generate a segmentation mask for the object described by a given language expression. Existing classic RES datasets and methods commonly support single-target expressions only, i.e., one expression refers to one target object. Multi-target and no-target expressions are not considered. This limits the usage of RES in practice. In this paper, we introduce a new benchmark called Generalized Referring Expression Segmentation (GRES), which extends the classic RES to allow expressions to refer to an arbitrary number of target objects. Towards this, we construct the first large-scale GRES dataset called gRefCOCO that contains multi-target, no-target, and single-target expressions. GRES and gRefCOCO are designed to be well-compatible with RES, facilitating extensive experiments to study the performance gap of the existing RES methods on the GRES task. In the experimental study, we find that one of the big challenges of GRES is complex relationship modeling. Based on this, we propose a region-based GRES baseline ReLA that adaptively divides the image into regions with sub-instance clues, and explicitly models the region-region and region-language dependencies. The proposed approach ReLA achieves new state-of-the-art performance on the both newly proposed GRES and classic RES tasks. The proposed gRefCOCO dataset and method are available at https://henghuiding.github.io/GRES. + + + + Open-Set Fine-Grained Retrieval via Prompting Vision-Language Evaluator + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Open-Set_Fine-Grained_Retrieval_via_Prompting_Vision-Language_Evaluator_CVPR_2023_paper.pdf + Open-set fine-grained retrieval is an emerging challenge that requires an extra capability to retrieve unknown subcategories during evaluation. 
However, current works are rooted in close-set scenarios, where all subcategories are pre-defined, which makes it hard to capture discriminative knowledge from unknown subcategories and consequently fails to handle the inevitable unknown subcategories in open-world scenarios. In this work, we propose a novel Prompting vision-Language Evaluator (PLEor) framework based on the recently introduced contrastive language-image pretraining (CLIP) model, for open-set fine-grained retrieval. PLEor leverages the pre-trained CLIP model to infer the discrepancies encompassing both pre-defined and unknown subcategories, called category-specific discrepancies, and transfers them to the backbone network trained in the close-set scenarios. To make the pre-trained CLIP model sensitive to category-specific discrepancies, we design a dual prompt scheme to learn a vision prompt specifying the category-specific discrepancies, and turn random vectors with category names in a text prompt into category-specific discrepancy descriptions. Moreover, a vision-language evaluator is proposed to semantically align the vision and text prompts based on the CLIP model, so that they reinforce each other. In addition, we propose an open-set knowledge transfer to transfer the category-specific discrepancies into the backbone network using a knowledge distillation mechanism. A variety of quantitative and qualitative experiments show that our PLEor achieves promising performance on open-set fine-grained retrieval datasets. + + + + Sibling-Attack: Rethinking Transferable Adversarial Attacks Against Face Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Sibling-Attack_Rethinking_Transferable_Adversarial_Attacks_Against_Face_Recognition_CVPR_2023_paper.pdf + A hard challenge in developing practical face recognition (FR) attacks is the black-box nature of the target FR model, i.e., gradient and parameter information is inaccessible to attackers. While recent research has taken an important step towards attacking black-box FR models by leveraging transferability, performance is still limited, especially against online commercial FR systems, where results can be pessimistic (e.g., an attack success rate (ASR) of less than 50% on average). Motivated by this, we present Sibling-Attack, a new FR attack technique that, for the first time, explores a multi-task perspective (i.e., leveraging extra information from multi-correlated tasks to boost attacking transferability). Intuitively, Sibling-Attack selects a set of tasks correlated with FR and picks the Attribute Recognition (AR) task as the task used in Sibling-Attack based on theoretical and quantitative analysis. Sibling-Attack then develops an optimization framework that fuses adversarial gradient information through (1) constraining the cross-task features to be under the same space, (2) a joint-task meta optimization framework that enhances the gradient compatibility among tasks, and (3) a cross-task gradient stabilization method which mitigates the oscillation effect during attacking. Extensive experiments demonstrate that Sibling-Attack outperforms state-of-the-art FR attack techniques by a non-trivial margin, boosting ASR by 12.61% and 55.77% on average on state-of-the-art pre-trained FR models and two well-known, widely used commercial FR systems.
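Sibling-Attack fuses adversarial gradients from a face recognition (FR) model and a correlated attribute recognition (AR) model to improve black-box transferability. Below is a minimal, hedged sketch of the basic idea of a joint-task PGD step that mixes the two task losses; the surrogate models, the loss weighting, and the simple additive combination are illustrative placeholders rather than the paper's actual optimization framework (which adds cross-task feature constraints, meta optimization, and gradient stabilization).

```python
import torch

def joint_task_pgd_step(x_adv, x_src, fr_model, ar_model, fr_loss_fn, ar_loss_fn,
                        step_size=2 / 255, eps=8 / 255, ar_weight=0.5):
    """One PGD step on a surrogate FR model plus a correlated AR surrogate.

    fr_loss_fn / ar_loss_fn: callables mapping model outputs to scalar losses
    that are *larger* when the attack is more successful (so we ascend).
    All models and loss functions here are assumed placeholders.
    """
    x_adv = x_adv.clone().detach().requires_grad_(True)
    loss = fr_loss_fn(fr_model(x_adv)) + ar_weight * ar_loss_fn(ar_model(x_adv))
    loss.backward()

    with torch.no_grad():
        x_adv = x_adv + step_size * x_adv.grad.sign()     # gradient ascent step
        x_adv = x_src + (x_adv - x_src).clamp(-eps, eps)  # project to the L_inf ball
        x_adv = x_adv.clamp(0.0, 1.0)                     # keep a valid image
    return x_adv.detach()

# In practice this step would be iterated for a fixed number of rounds,
# starting from x_adv = x_src.
```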
+ + + + PIRLNav: Pretraining With Imitation and RL Finetuning for ObjectNav + http://openaccess.thecvf.com//content/CVPR2023/papers/Ramrakhya_PIRLNav_Pretraining_With_Imitation_and_RL_Finetuning_for_ObjectNav_CVPR_2023_paper.pdf + We study ObjectGoal Navigation -- where a virtual robot situated in a new environment is asked to navigate to an object. Prior work has shown that imitation learning (IL) using behavior cloning (BC) on a dataset of human demonstrations achieves promising results. However, this has limitations -- 1) BC policies generalize poorly to new states, since the training mimics actions not their consequences, and 2) collecting demonstrations is expensive. On the other hand, reinforcement learning (RL) is trivially scalable, but requires careful reward engineering to achieve desirable behavior. We present PIRLNav, a two-stage learning scheme for BC pretraining on human demonstrations followed by RL-finetuning. This leads to a policy that achieves a success rate of 65.0% on ObjectNav (+5.0% absolute over previous state-of-the-art). Using this BC->RL training recipe, we present a rigorous empirical analysis of design choices. First, we investigate whether human demonstrations can be replaced with 'free' (automatically generated) sources of demonstrations, e.g. shortest paths (SP) or task-agnostic frontier exploration (FE) trajectories. We find that BC->RL on human demonstrations outperforms BC->RL on SP and FE trajectories, even when controlled for the same BC-pretraining success on train, and even on a subset of val episodes where BC-pretraining success favors the SP or FE policies. Next, we study how RL-finetuning performance scales with the size of the BC pretraining dataset. We find that as we increase the size of the BC-pretraining dataset and get to high BC accuracies, the improvements from RL-finetuning are smaller, and that 90% of the performance of our best BC->RL policy can be achieved with less than half the number of BC demonstrations. Finally, we analyze failure modes of our ObjectNav policies, and present guidelines for further improving them. + + + + StyleGene: Crossover and Mutation of Region-Level Facial Genes for Kinship Face Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_StyleGene_Crossover_and_Mutation_of_Region-Level_Facial_Genes_for_Kinship_CVPR_2023_paper.pdf + High-fidelity kinship face synthesis has many potential applications, such as kinship verification, missing child identification, and social media analysis. However, it is challenging to synthesize high-quality descendant faces with genetic relations due to the lack of large-scale, high-quality annotated kinship data. This paper proposes RFG (Region-level Facial Gene) extraction framework to address this issue. We propose to use IGE (Image-based Gene Encoder), LGE (Latent-based Gene Encoder) and Gene Decoder to learn the RFGs of a given face image, and the relationships between RFGs and the latent space of StyleGAN2. As cycle-like losses are designed to measure the L_2 distances between the output of Gene Decoder and image encoder, and that between the output of LGE and IGE, only face images are required to train our framework, i.e. no paired kinship face data is required. Based upon the proposed RFGs, a crossover and mutation module is further designed to inherit the facial parts of parents. A Gene Pool has also been used to introduce the variations into the mutation of RFGs. The diversity of the faces of descendants can thus be significantly increased. 
Qualitative, quantitative, and subjective experiments on FIW, TSKinFace, and FF-Databases clearly show that the quality and diversity of kinship faces generated by our approach are much better than the existing state-of-the-art methods. + + + + Clothed Human Performance Capture With a Double-Layer Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Clothed_Human_Performance_Capture_With_a_Double-Layer_Neural_Radiance_Fields_CVPR_2023_paper.pdf + This paper addresses the challenge of capturing performance for the clothed humans from sparse-view or monocular videos. Previous methods capture the performance of full humans with a personalized template or recover the garments from a single frame with static human poses. However, it is inconvenient to extract cloth semantics and capture clothing motion with one-piece template, while single frame-based methods may suffer from instable tracking across videos. To address these problems, we propose a novel method for human performance capture by tracking clothing and human body motion separately with a double-layer neural radiance fields (NeRFs). Specifically, we propose a double-layer NeRFs for the body and garments, and track the densely deforming template of the clothing and body by jointly optimizing the deformation fields and the canonical double-layer NeRFs. In the optimization, we introduce a physics-aware cloth simulation network which can help generate physically plausible cloth dynamics and body-cloth interactions. Compared with existing methods, our method is fully differentiable and can capture both the body and clothing motion robustly from dynamic videos. Also, our method represents the clothing with an independent NeRFs, allowing us to model implicit fields of general clothes feasibly. The experimental evaluations validate its effectiveness on real multi-view or monocular videos. + + + + NeuFace: Realistic 3D Neural Face Rendering From Multi-View Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_NeuFace_Realistic_3D_Neural_Face_Rendering_From_Multi-View_Images_CVPR_2023_paper.pdf + Realistic face rendering from multi-view images is beneficial to various computer vision and graphics applications. Due to the complex spatially-varying reflectance properties and geometry characteristics of faces, however, it remains challenging to recover 3D facial representations both faithfully and efficiently in the current studies. This paper presents a novel 3D face rendering model, namely NeuFace, to learn accurate and physically-meaningful underlying 3D representations by neural rendering techniques. It naturally incorporates the neural BRDFs into physically based rendering, capturing sophisticated facial geometry and appearance clues in a collaborative manner. Specifically, we introduce an approximated BRDF integration and a simple yet new low-rank prior, which effectively lower the ambiguities and boost the performance of the facial BRDFs. Extensive experiments demonstrate the superiority of NeuFace in human face rendering, along with a decent generalization ability to common objects. Code is released at https://github.com/aejion/NeuFace. 
+ + + + Rethinking Domain Generalization for Face Anti-Spoofing: Separability and Alignment + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Rethinking_Domain_Generalization_for_Face_Anti-Spoofing_Separability_and_Alignment_CVPR_2023_paper.pdf + This work studies the generalization issue of face anti-spoofing (FAS) models on domain gaps, such as image resolution, blurriness and sensor variations. Most prior works regard domain-specific signals as a negative impact, and apply metric learning or adversarial losses to remove it from feature representation. Though learning a domain-invariant feature space is viable for the training data, we show that the feature shift still exists in an unseen test domain, which backfires on the generalizability of the classifier. In this work, instead of constructing a domain-invariant feature space, we encourage domain separability while aligning the live-to-spoof transition (i.e., the trajectory from live to spoof) to be the same for all domains. We formulate this FAS strategy of separability and alignment (SA-FAS) as a problem of invariant risk minimization (IRM), and learn domain-variant feature representation but domain-invariant classifier. We demonstrate the effectiveness of SA-FAS on challenging cross-domain FAS datasets and establish state-of-the-art performance. + + + + SMOC-Net: Leveraging Camera Pose for Self-Supervised Monocular Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Tan_SMOC-Net_Leveraging_Camera_Pose_for_Self-Supervised_Monocular_Object_Pose_Estimation_CVPR_2023_paper.pdf + Recently, self-supervised 6D object pose estimation, where synthetic images with object poses (sometimes jointly with un-annotated real images) are used for training, has attracted much attention in computer vision. Some typical works in literature employ a time-consuming differentiable renderer for object pose prediction at the training stage, so that (i) their performances on real images are generally limited due to the gap between their rendered images and real images and (ii) their training process is computationally expensive. To address the two problems, we propose a novel Network for Self-supervised Monocular Object pose estimation by utilizing the predicted Camera poses from un-annotated real images, called SMOC-Net. The proposed network is explored under a knowledge distillation framework, consisting of a teacher model and a student model. The teacher model contains a backbone estimation module for initial object pose estimation, and an object pose refiner for refining the initial object poses using a geometric constraint (called relative-pose constraint) derived from relative camera poses. The student model gains knowledge for object pose estimation from the teacher model by imposing the relative-pose constraint. Thanks to the relative-pose constraint, SMOC-Net could not only narrow the domain gap between synthetic and real data but also reduce the training cost. Experimental results on two public datasets demonstrate that SMOC-Net outperforms several state-of-the-art methods by a large margin while requiring much less training time than the differentiable-renderer-based methods. + + + + Learning Human Mesh Recovery in 3D Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_Learning_Human_Mesh_Recovery_in_3D_Scenes_CVPR_2023_paper.pdf + We present a novel method for recovering the absolute pose and shape of a human in a pre-scanned scene given a single image. 
Unlike previous methods that perform sceneaware mesh optimization, we propose to first estimate absolute position and dense scene contacts with a sparse 3D CNN, and later enhance a pretrained human mesh recovery network by cross-attention with the derived 3D scene cues. Joint learning on images and scene geometry enables our method to reduce the ambiguity caused by depth and occlusion, resulting in more reasonable global postures and contacts. Encoding scene-aware cues in the network also allows the proposed method to be optimization-free, and opens up the opportunity for real-time applications. The experiments show that the proposed network is capable of recovering accurate and physically-plausible meshes by a single forward pass and outperforms state-of-the-art methods in terms of both accuracy and speed. Code is available on our project page: https://zju3dv.github.io/sahmr/. + + + + Learning Locally Editable Virtual Humans + http://openaccess.thecvf.com//content/CVPR2023/papers/Ho_Learning_Locally_Editable_Virtual_Humans_CVPR_2023_paper.pdf + In this paper, we propose a novel hybrid representation and end-to-end trainable network architecture to model fully editable and customizable neural avatars. At the core of our work lies a representation that combines the modeling power of neural fields with the ease of use and inherent 3D consistency of skinned meshes. To this end, we construct a trainable feature codebook to store local geometry and texture features on the vertices of a deformable body model, thus exploiting its consistent topology under articulation. This representation is then employed in a generative auto-decoder architecture that admits fitting to unseen scans and sampling of realistic avatars with varied appearances and geometries. Furthermore, our representation allows local editing by swapping local features between 3D assets. To verify our method for avatar creation and editing, we contribute a new high-quality dataset, dubbed CustomHumans, for training and evaluation. Our experiments quantitatively and qualitatively show that our method generates diverse detailed avatars and achieves better model fitting performance compared to state-of-the-art methods. Our code and dataset are available at https://ait.ethz.ch/custom-humans. + + + + PillarNeXt: Rethinking Network Designs for 3D Object Detection in LiDAR Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_PillarNeXt_Rethinking_Network_Designs_for_3D_Object_Detection_in_LiDAR_CVPR_2023_paper.pdf + In order to deal with the sparse and unstructured raw point clouds, most LiDAR based 3D object detection research focuses on designing dedicated local point aggregators for fine-grained geometrical modeling. In this paper, we revisit the local point aggregators from the perspective of allocating computational resources. We find that the simplest pillar based models perform surprisingly well considering both accuracy and latency. Additionally, we show that minimal adaptions from the success of 2D object detection, such as enlarging receptive field, significantly boost the performance. Extensive experiments reveal that our pillar based networks with modernized designs in terms of architecture and training render the state-of-the-art performance on two popular benchmarks: Waymo Open Dataset and nuScenes. Our results challenge the common intuition that detailed geometry modeling is essential to achieve high performance for 3D object detection. 
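PillarNeXt, the last entry above, argues that simple pillar-based aggregation of LiDAR points, combined with standard 2D-detection design choices such as an enlarged receptive field, is enough for strong 3D detection. As a rough illustration of what "pillar" aggregation means, the sketch below scatters raw points into a bird's-eye-view grid and mean-pools their features per cell; the grid resolution and pooling choice are assumptions, and real pillar encoders (e.g., PointPillars-style) learn per-point features before pooling.

```python
import numpy as np

def pillarize(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.25):
    """Aggregate a LiDAR point cloud (N, 4) = (x, y, z, intensity) into a
    bird's-eye-view pillar grid by mean-pooling per cell. Returns a dense
    feature map of shape (H, W, 4) plus a per-cell point count."""
    nx = int(round((x_range[1] - x_range[0]) / cell))
    ny = int(round((y_range[1] - y_range[0]) / cell))

    ix = np.floor((points[:, 0] - x_range[0]) / cell).astype(np.int64)
    iy = np.floor((points[:, 1] - y_range[0]) / cell).astype(np.int64)
    keep = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    ix, iy, pts = ix[keep], iy[keep], points[keep]

    flat = iy * nx + ix                            # flattened pillar index
    feat_sum = np.zeros((ny * nx, pts.shape[1]))
    count = np.zeros(ny * nx)
    np.add.at(feat_sum, flat, pts)                 # scatter-add per pillar
    np.add.at(count, flat, 1.0)

    mean = feat_sum / np.maximum(count, 1.0)[:, None]
    return mean.reshape(ny, nx, -1), count.reshape(ny, nx)

# Toy usage with random points; a real pipeline would feed the resulting
# BEV map into a 2D convolutional backbone and detection head.
cloud = np.random.uniform(-50, 50, size=(10000, 4)).astype(np.float32)
bev, counts = pillarize(cloud)
print(bev.shape, counts.sum())
```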
+ + + + LINe: Out-of-Distribution Detection by Leveraging Important Neurons + http://openaccess.thecvf.com//content/CVPR2023/papers/Ahn_LINe_Out-of-Distribution_Detection_by_Leveraging_Important_Neurons_CVPR_2023_paper.pdf + It is important to quantify the uncertainty of input samples, especially in mission-critical domains such as autonomous driving and healthcare, where failure predictions on out-of-distribution (OOD) data are likely to cause big problems. OOD detection problem fundamentally begins in that the model cannot express what it is not aware of. Post-hoc OOD detection approaches are widely explored because they do not require an additional re-training process which might degrade the model's performance and increase the training cost. In this study, from the perspective of neurons in the deep layer of the model representing high-level features, we introduce a new aspect for analyzing the difference in model outputs between in-distribution data and OOD data. We propose a novel method, Leveraging Important Neurons (LINe), for post-hoc Out of distribution detection. Shapley value-based pruning reduces the effects of noisy outputs by selecting only high-contribution neurons for predicting specific classes of input data and masking the rest. Activation clipping fixes all values above a certain threshold into the same value, allowing LINe to treat all the class-specific features equally and just consider the difference between the number of activated feature differences between in-distribution and OOD data. Comprehensive experiments verify the effectiveness of the proposed method by outperforming state-of-the-art post-hoc OOD detection methods on CIFAR-10, CIFAR-100, and ImageNet datasets. + + + + Transforming Radiance Field With Lipschitz Network for Photorealistic 3D Scene Stylization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Transforming_Radiance_Field_With_Lipschitz_Network_for_Photorealistic_3D_Scene_CVPR_2023_paper.pdf + Recent advances in 3D scene representation and novel view synthesis have witnessed the rise of Neural Radiance Fields (NeRFs). Nevertheless, it is not trivial to exploit NeRF for the photorealistic 3D scene stylization task, which aims to generate visually consistent and photorealistic stylized scenes from novel views. Simply coupling NeRF with photorealistic style transfer (PST) will result in cross-view inconsistency and degradation of stylized view syntheses. Through a thorough analysis, we demonstrate that this non-trivial task can be simplified in a new light: When transforming the appearance representation of a pre-trained NeRF with Lipschitz mapping, the consistency and photorealism across source views will be seamlessly encoded into the syntheses. That motivates us to build a concise and flexible learning framework namely LipRF, which upgrades arbitrary 2D PST methods with Lipschitz mapping tailored for the 3D scene. Technically, LipRF first pre-trains a radiance field to reconstruct the 3D scene, and then emulates the style on each view by 2D PST as the prior to learn a Lipschitz network to stylize the pre-trained appearance. In view of that Lipschitz condition highly impacts the expressivity of the neural network, we devise an adaptive regularization to balance the reconstruction and stylization. A gradual gradient aggregation strategy is further introduced to optimize LipRF in a cost-efficient manner. 
We conduct extensive experiments to show the high quality and robust performance of LipRF on both photorealistic 3D stylization and object appearance editing. + + + + Guided Depth Super-Resolution by Deep Anisotropic Diffusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Metzger_Guided_Depth_Super-Resolution_by_Deep_Anisotropic_Diffusion_CVPR_2023_paper.pdf + Performing super-resolution of a depth image using the guidance from an RGB image is a problem that concerns several fields, such as robotics, medical imaging, and remote sensing. While deep learning methods have achieved good results in this problem, recent work highlighted the value of combining modern methods with more formal frameworks. In this work we propose a novel approach which combines guided anisotropic diffusion with a deep convolutional network and advances the state of the art for guided depth super-resolution. The edge transferring/enhancing properties of the diffusion are boosted by the contextual reasoning capabilities of modern networks, and a strict adjustment step guarantees perfect adherence to the source image. We achieve unprecedented results in three commonly used benchmarks for guided depth super resolution. The performance gain compared to other methods is the largest at larger scales, such as x32 scaling. Code for the proposed method will be made available to promote reproducibility of our results. + + + + Fresnel Microfacet BRDF: Unification of Polari-Radiometric Surface-Body Reflection + http://openaccess.thecvf.com//content/CVPR2023/papers/Ichikawa_Fresnel_Microfacet_BRDF_Unification_of_Polari-Radiometric_Surface-Body_Reflection_CVPR_2023_paper.pdf + Computer vision applications have heavily relied on the linear combination of Lambertian diffuse and microfacet specular reflection models for representing reflected radiance, which turns out to be physically incompatible and limited in applicability. In this paper, we derive a novel analytical reflectance model, which we refer to as Fresnel Microfacet BRDF model, that is physically accurate and generalizes to various real-world surfaces. Our key idea is to model the Fresnel reflection and transmission of the surface microgeometry with a collection of oriented mirror facets, both for body and surface reflections. We carefully derive the Fresnel reflection and transmission for each microfacet as well as the light transport between them in the subsurface. This physically-grounded modeling also allows us to express the polarimetric behavior of reflected light in addition to its radiometric behavior. That is, FMBRDF unifies not only body and surface reflections but also light reflection in radiometry and polarization and represents them in a single model. Experimental results demonstrate its effectiveness in accuracy, expressive power, image-based estimation, and geometry recovery. + + + + Simulated Annealing in Early Layers Leads to Better Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Sarfi_Simulated_Annealing_in_Early_Layers_Leads_to_Better_Generalization_CVPR_2023_paper.pdf + Recently, a number of iterative learning methods have been introduced to improve generalization. These typically rely on training for longer periods of time in exchange for improved generalization. LLF (later-layer-forgetting) is a state-of-the-art method in this category. It strengthens learning in early layers by periodically re-initializing the last few layers of the network. 
Our principal innovation in this work is to use Simulated annealing in EArly Layers (SEAL) of the network in place of re-initialization of later layers. Essentially, later layers go through the normal gradient descent process, while the early layers go through short stints of gradient ascent followed by gradient descent. Extensive experiments on the popular Tiny-ImageNet dataset benchmark and a series of transfer learning and few-shot learning tasks show that we outperform LLF by a significant margin. We further show that, compared to normal training, LLF features, although improving on the target task, degrade the transfer learning performance across all datasets we explored. In comparison, our method outperforms LLF across the same target datasets by a large margin. We also show that the prediction depth of our method is significantly lower than that of LLF and normal training, indicating on average better prediction performance. + + + + Exploring Data Geometry for Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_Exploring_Data_Geometry_for_Continual_Learning_CVPR_2023_paper.pdf + Continual learning aims to efficiently learn from a non-stationary stream of data while avoiding forgetting the knowledge of old data. In many practical applications, data complies with non-Euclidean geometry. As such, the commonly used Euclidean space cannot gracefully capture non-Euclidean geometric structures of data, leading to inferior results. In this paper, we study continual learning from a novel perspective by exploring data geometry for the non-stationary stream of data. Our method dynamically expands the geometry of the underlying space to match growing geometric structures induced by new data, and prevents forgetting by keeping geometric structures of old data into account. In doing so, we make use of the mixed-curvature space and propose an incremental search scheme, through which the growing geometric structures are encoded. Then, we introduce an angular-regularization loss and a neighbor-robustness loss to train the model, capable of penalizing the change of global geometric structures and local geometric structures. Experiments show that our method achieves better performance than baseline methods designed in Euclidean space. + + + + Learning Neural Parametric Head Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Giebenhain_Learning_Neural_Parametric_Head_Models_CVPR_2023_paper.pdf + We propose a novel 3D morphable model for complete human heads based on hybrid neural fields. At the core of our model lies a neural parametric representation that disentangles identity and expressions in disjoint latent spaces. To this end, we capture a person's identity in a canonical space as a signed distance field (SDF), and model facial expressions with a neural deformation field. In addition, our representation achieves high-fidelity local detail by introducing an ensemble of local fields centered around facial anchor points. To facilitate generalization, we train our model on a newly-captured dataset of over 3700 head scans from 203 different identities using a custom high-end 3D scanning setup. Our dataset significantly exceeds comparable existing datasets, both with respect to quality and completeness of geometry, averaging around 3.5M mesh faces per scan. Finally, we demonstrate that our approach outperforms state-of-the-art methods in terms of fitting error and reconstruction quality. 
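The neural parametric head model above disentangles identity (a canonical signed distance field) from expression (a neural deformation field applied before the SDF query). A minimal sketch of that composition is shown below; the tiny MLPs, latent sizes, and the simple additive deformation are illustrative assumptions only, and the actual model additionally uses an ensemble of local fields centered on facial anchor points.

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, in_dim, out_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, out_dim),
        )

    def forward(self, x):
        return self.net(x)

class ParametricHead(nn.Module):
    """Expression deformation field followed by a canonical identity SDF."""
    def __init__(self, id_dim=64, expr_dim=32):
        super().__init__()
        self.deform = MLP(3 + expr_dim, 3)   # per-point offset driven by expression
        self.sdf = MLP(3 + id_dim, 1)        # signed distance in canonical space

    def forward(self, x, z_id, z_expr):
        # x: (N, 3) query points; z_id: (id_dim,); z_expr: (expr_dim,)
        n = x.shape[0]
        offset = self.deform(torch.cat([x, z_expr.expand(n, -1)], dim=-1))
        x_canonical = x + offset             # warp expressive space -> canonical space
        return self.sdf(torch.cat([x_canonical, z_id.expand(n, -1)], dim=-1))

model = ParametricHead()
queries = torch.rand(1024, 3) * 2 - 1
sdf_vals = model(queries, torch.zeros(64), torch.zeros(32))
print(sdf_vals.shape)  # (1024, 1)
```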
+ + + + Removing Objects From Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Weder_Removing_Objects_From_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRFs) are emerging as a ubiquitous scene representation that allows for novel view synthesis. Increasingly, NeRFs will be shareable with other people. Before sharing a NeRF, though, it might be desirable to remove personal information or unsightly objects. Such removal is not easily achieved with the current NeRF editing frameworks. We propose a framework to remove objects from a NeRF representation created from an RGB-D sequence. Our NeRF inpainting method leverages recent work in 2D image inpainting and is guided by a user-provided mask. Our algorithm is underpinned by a confidence based view selection procedure. It chooses which of the individual 2D inpainted images to use in the creation of the NeRF, so that the resulting inpainted NeRF is 3D consistent. We show that our method for NeRF editing is effective for synthesizing plausible inpaintings in a multi-view coherent manner, outperforming competing methods. We validate our approach by proposing a new and still-challenging dataset for the task of NeRF inpainting. + + + + Structural Multiplane Image: Bridging Neural View Synthesis and 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Structural_Multiplane_Image_Bridging_Neural_View_Synthesis_and_3D_Reconstruction_CVPR_2023_paper.pdf + The Multiplane Image (MPI), containing a set of fronto-parallel RGBA layers, is an effective and efficient representation for view synthesis from sparse inputs. Yet, its fixed structure limits the performance, especially for surfaces imaged at oblique angles. We introduce the Structural MPI (S-MPI), where the plane structure approximates 3D scenes concisely. Conveying RGBA contexts with geometrically-faithful structures, the S-MPI directly bridges view synthesis and 3D reconstruction. It can not only overcome the critical limitations of MPI, i.e., discretization artifacts from sloped surfaces and abuse of redundant layers, and can also acquire planar 3D reconstruction. Despite the intuition and demand of applying S-MPI, great challenges are introduced, e.g., high-fidelity approximation for both RGBA layers and plane poses, multi-view consistency, non-planar regions modeling, and efficient rendering with intersected planes. Accordingly, we propose a transformer-based network based on a segmentation model. It predicts compact and expressive S-MPI layers with their corresponding masks, poses, and RGBA contexts. Non-planar regions are inclusively handled as a special case in our unified framework. Multi-view consistency is ensured by sharing global proxy embeddings, which encode plane-level features covering the complete 3D scenes with aligned coordinates. Intensive experiments show that our method outperforms both previous state-of-the-art MPI-based view synthesis methods and planar reconstruction methods. + + + + Harmonious Teacher for Cross-Domain Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Deng_Harmonious_Teacher_for_Cross-Domain_Object_Detection_CVPR_2023_paper.pdf + Self-training approaches recently achieved promising results in cross-domain object detection, where people iteratively generate pseudo labels for unlabeled target domain samples with a model, and select high-confidence samples to refine the model. 
In this work, we reveal that the consistency of classification and localization predictions is crucial for measuring the quality of pseudo labels, and propose a new Harmonious Teacher approach to improve self-training for cross-domain object detection. In particular, we first propose to enhance the quality of pseudo labels by regularizing the consistency of the classification and localization scores when training the detection model. The consistency losses are defined for both labeled source samples and the unlabeled target samples. Then, we further remold the traditional sample selection method with a sample reweighing strategy based on the consistency of classification and localization scores to improve the ranking of predictions. This allows us to fully exploit all instance predictions from the target domain without abandoning valuable hard examples. Without bells and whistles, our method shows superior performance in various cross-domain scenarios compared with the state-of-the-art baselines, which validates the effectiveness of our Harmonious Teacher. Our code will be available at https://github.com/kinredon/Harmonious-Teacher. + + + + Learning To Predict Scene-Level Implicit 3D From Posed RGBD Data + http://openaccess.thecvf.com//content/CVPR2023/papers/Kulkarni_Learning_To_Predict_Scene-Level_Implicit_3D_From_Posed_RGBD_Data_CVPR_2023_paper.pdf + We introduce a method that can learn to predict scene-level implicit functions for 3D reconstruction from posed RGBD data. At test time, our system maps a previously unseen RGB image to a 3D reconstruction of a scene via implicit functions. While implicit functions for 3D reconstruction have often been tied to meshes, we show that we can train one using only a set of posed RGBD images. This setting may help 3D reconstruction unlock the sea of accelerometer+RGBD data that is coming with new phones. Our system, D2-DRDF, can match and sometimes outperform current methods that use mesh supervision and shows better robustness to sparse data. + + + + Physical-World Optical Adversarial Attacks on 3D Face Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Physical-World_Optical_Adversarial_Attacks_on_3D_Face_Recognition_CVPR_2023_paper.pdf + The success rate of current adversarial attacks remains low on real-world 3D face recognition tasks because 3D-printing attacks need to meet the requirement that the generated points should be adjacent to the surface, which limits the adversarial examples' search space. Additionally, they have not considered unpredictable head movements or the non-homogeneous nature of skin reflectance in the real world. To address these real-world challenges, we propose a novel structured-light attack against structured-light-based 3D face recognition. We incorporate the 3D reconstruction process and the skin's reflectance into the optimization to obtain an end-to-end attack, and present a 3D transform invariant loss and sensitivity maps to improve robustness. Our attack enables adversarial points to be placed in any position and is resilient to random head movements while keeping the perturbation unnoticeable. Experiments show that our new method can attack point-cloud-based and depth-image-based 3D face recognition systems with a high success rate, using fewer perturbations than previous physical 3D adversarial attacks.
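The structured-light attack above makes its perturbation robust to unpredictable head movements with a 3D transform invariant loss. A common way to realize such invariance is expectation over transformations: average the attack loss over randomly sampled rigid transforms of the perturbed point cloud before back-propagating. The sketch below shows that generic recipe rather than the paper's exact formulation; the random-rotation sampling and the classifier interface are assumptions.

```python
import math
import torch

def random_rotation_z(max_deg=15.0):
    """Random small rotation about the vertical axis (simulating head yaw)."""
    theta = math.radians((torch.rand(1).item() * 2 - 1) * max_deg)
    c, s = math.cos(theta), math.sin(theta)
    return torch.tensor([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def transform_invariant_loss(points, delta, model, target_class, n_samples=8):
    """Average an untargeted attack loss over random rigid transforms of the
    perturbed face point cloud, so the perturbation stays effective under head
    motion. `model` is assumed to map an (N, 3) point cloud to class logits."""
    loss = 0.0
    for _ in range(n_samples):
        rot = random_rotation_z()
        logits = model((points + delta) @ rot.T)
        # Push down the probability of the true identity under each transform.
        loss = loss - torch.log_softmax(logits, dim=-1)[target_class]
    return loss / n_samples
```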
+ + + + Raw Image Reconstruction With Learned Compact Metadata + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Raw_Image_Reconstruction_With_Learned_Compact_Metadata_CVPR_2023_paper.pdf + While raw images exhibit advantages over sRGB images (e.g., linearity and fine-grained quantization levels), they are not widely used by common users due to their large storage requirements. Very recent works propose to compress raw images by designing the sampling masks in the raw image pixel space, leading to suboptimal image representations and redundant metadata. In this paper, we propose a novel framework to learn a compact representation in the latent space serving as the metadata in an end-to-end manner. Furthermore, we propose a novel sRGB-guided context model with improved entropy estimation strategies, which leads to better reconstruction quality, smaller metadata size, and faster speed. We illustrate how the proposed raw image compression scheme can adaptively allocate more bits to image regions that are important from a global perspective. The experimental results show that the proposed method can achieve superior raw image reconstruction results using less metadata on both uncompressed sRGB images and JPEG images. + + + + Semi-Supervised Video Inpainting With Cycle Consistency Constraints + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Semi-Supervised_Video_Inpainting_With_Cycle_Consistency_Constraints_CVPR_2023_paper.pdf + Deep learning-based video inpainting has yielded promising results and gained increasing attention from researchers. These methods generally assume that the corrupted region masks of each frame are known and easily obtained. However, annotating these masks is labor-intensive and expensive, which limits the practical application of current methods. Therefore, we relax this assumption by defining a new semi-supervised inpainting setting, in which the network must complete the corrupted regions of the whole video using the annotated mask of only one frame. Specifically, in this work, we propose an end-to-end trainable framework consisting of a completion network and a mask prediction network, which are designed to generate the content of the corrupted regions of the current frame using the known mask and to decide the regions to be filled in the next frame, respectively. Besides, we introduce a cycle consistency loss to regularize the training parameters of these two networks. In this way, the completion network and the mask prediction network can constrain each other, and hence the overall performance of the trained model can be maximized. Furthermore, due to the natural existence of prior knowledge (e.g., corrupted contents and clear borders), current video inpainting datasets are not suitable in the context of semi-supervised video inpainting. Thus, we create a new dataset by simulating corrupted videos of real-world scenarios. Extensive experimental results demonstrate the superiority of our model in the video inpainting task. Remarkably, although our model is trained in a semi-supervised manner, it achieves performance comparable to fully-supervised methods.
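The semi-supervised video inpainting framework above alternates between a completion network (which fills the current frame given a known mask) and a mask prediction network (which decides the regions to fill in the next frame), tied together by a cycle consistency loss. The training-step sketch below only illustrates how such a loop could be wired up; every module, loss weight, and tensor shape here is an assumed placeholder, not the authors' architecture.

```python
import torch
import torch.nn.functional as F

def semi_supervised_step(frames, mask0, completion_net, mask_net, lambda_cycle=1.0):
    """One illustrative training step on a clip `frames` of shape (T, C, H, W),
    where only the mask of the first frame, `mask0` of shape (1, H, W), is
    annotated. `mask_net` is assumed to output per-pixel probabilities in [0, 1].
    """
    total_loss, mask = 0.0, mask0
    for t in range(frames.shape[0]):
        frame = frames[t]
        completed = completion_net(frame.unsqueeze(0), mask.unsqueeze(0)).squeeze(0)

        # Cycle consistency: re-predicting the mask from the completed frame
        # should recover the mask that was actually used for completion.
        mask_rec = mask_net(completed.unsqueeze(0)).squeeze(0)
        total_loss = total_loss + lambda_cycle * F.binary_cross_entropy(mask_rec, mask)

        if t + 1 < frames.shape[0]:
            # The mask network also proposes the corrupted region of the next frame.
            mask = mask_net(frames[t + 1].unsqueeze(0)).squeeze(0).detach()
    return total_loss
```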
+ + + + Level-S$^2$fM: Structure From Motion on Neural Level Set of Implicit Surfaces http://openaccess.thecvf.com//content/CVPR2023/papers/Xiao_Level-S2fM_Structure_From_Motion_on_Neural_Level_Set_of_Implicit_CVPR_2023_paper.pdf This paper presents a neural incremental Structure-from-Motion (SfM) approach, Level-S2fM, which estimates the camera poses and scene geometry from a set of uncalibrated images by learning coordinate MLPs for the implicit surfaces and the radiance fields from the established keypoint correspondences. Our novel formulation poses some new challenges due to the inevitable two-view and few-view configurations in the incremental SfM pipeline, which complicate the optimization of coordinate MLPs for volumetric neural rendering with unknown camera poses. Nevertheless, we demonstrate that the strong inductive basis conveyed in the 2D correspondences is promising for tackling those challenges by exploiting the relationship between the ray sampling schemes. Based on this, we revisit the pipeline of incremental SfM and renew its key components, including two-view geometry initialization, camera pose registration, 3D point triangulation, and Bundle Adjustment, with a fresh perspective based on neural implicit surfaces. By unifying the scene geometry in small MLP networks through coordinate MLPs, our Level-S2fM treats the zero-level set of the implicit surface as an informative top-down regularization to manage the reconstructed 3D points, reject the outliers in correspondences via querying the SDF, and refine the estimated geometries by NBA (Neural BA). Not only does our Level-S2fM lead to promising results on camera pose estimation and scene geometry reconstruction, but it also shows a promising way for neural implicit rendering without knowing the camera extrinsics beforehand. + + + + Neuron Structure Modeling for Generalizable Remote Physiological Measurement http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Neuron_Structure_Modeling_for_Generalizable_Remote_Physiological_Measurement_CVPR_2023_paper.pdf Remote photoplethysmography (rPPG) technology has drawn increasing attention in recent years. It can extract Blood Volume Pulse (BVP) from facial videos, making many applications like health monitoring and emotional analysis more accessible. However, as the BVP signal is easily affected by environmental changes, existing methods struggle to generalize well to unseen domains. In this paper, we systematically address the domain shift problem in the rPPG measurement task. We show that most domain generalization methods do not work well in this problem, as domain labels are ambiguous under complicated environmental changes. In light of this, we propose a domain-label-free approach called NEuron STructure modeling (NEST). NEST improves the generalization capacity by maximizing the coverage of the feature space during training, which reduces the chance of under-optimized feature activation during inference. Besides, NEST can also enrich and enhance domain-invariant features across multiple domains. We create and benchmark a large-scale domain generalization protocol for the rPPG measurement task. Extensive experiments show that our approach outperforms the state-of-the-art methods in both cross-dataset and intra-dataset settings.
+ + + + Out-of-Candidate Rectification for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Cheng_Out-of-Candidate_Rectification_for_Weakly_Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Weakly supervised semantic segmentation is typically inspired by class activation maps, which serve as pseudo masks with class-discriminative regions highlighted. Although tremendous efforts have been made to recall precise and complete locations for each class, existing methods still commonly suffer from the unsolicited Out-of-Candidate (OC) error predictions that do not belong to the label candidates, which could be avoidable since the contradiction with image-level class tags is easy to be detected. In this paper, we develop a group ranking-based Out-of-Candidate Rectification (OCR) mechanism in a plug-and-play fashion. Firstly, we adaptively split the semantic categories into In-Candidate (IC) and OC groups for each OC pixel according to their prior annotation correlation and posterior prediction correlation. Then, we derive a differentiable rectification loss to force OC pixels to shift to the IC group. Incorporating OCR with seminal baselines (e.g., AffinityNet, SEAM, MCTformer), we can achieve remarkable performance gains on both Pascal VOC (+3.2%, +3.3%, +0.8% mIoU) and MS COCO (+1.0%, +1.3%, +0.5% mIoU) datasets with negligible extra training overhead, which justifies the effectiveness and generality of OCR. + + + + MonoATT: Online Monocular 3D Object Detection With Adaptive Token Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_MonoATT_Online_Monocular_3D_Object_Detection_With_Adaptive_Token_Transformer_CVPR_2023_paper.pdf + Mobile monocular 3D object detection (Mono3D) (e.g., on a vehicle, a drone, or a robot) is an important yet challenging task. Existing transformer-based offline Mono3D models adopt grid-based vision tokens, which is suboptimal when using coarse tokens due to the limited available computational power. In this paper, we propose an online Mono3D framework, called MonoATT, which leverages a novel vision transformer with heterogeneous tokens of varying shapes and sizes to facilitate mobile Mono3D. The core idea of MonoATT is to adaptively assign finer tokens to areas of more significance before utilizing a transformer to enhance Mono3D. To this end, we first use prior knowledge to design a scoring network for selecting the most important areas of the image, and then propose a token clustering and merging network with an attention mechanism to gradually merge tokens around the selected areas in multiple stages. Finally, a pixel-level feature map is reconstructed from heterogeneous tokens before employing a SOTA Mono3D detector as the underlying detection core. Experiment results on the real-world KITTI dataset demonstrate that MonoATT can effectively improve the Mono3D accuracy for both near and far objects and guarantee low latency. MonoATT yields the best performance compared with the state-of-the-art methods by a large margin and is ranked number one on the KITTI 3D benchmark. + + + + Image Quality-Aware Diagnosis via Meta-Knowledge Co-Embedding + http://openaccess.thecvf.com//content/CVPR2023/papers/Che_Image_Quality-Aware_Diagnosis_via_Meta-Knowledge_Co-Embedding_CVPR_2023_paper.pdf + Medical images usually suffer from image degradation in clinical practice, leading to decreased performance of deep learning-based models. 
To resolve this problem, most previous works have focused on filtering out degradation-causing low-quality images while ignoring their potential value for models. By effectively learning and leveraging the knowledge of degradations, models can better resist their adverse effects and avoid misdiagnosis. In this paper, we raise the problem of image quality-aware diagnosis, which aims to take advantage of low-quality images and image quality labels to achieve a more accurate and robust diagnosis. However, the diversity of degradations and the superficially unrelated targets of image quality assessment and disease diagnosis make it still quite challenging to effectively leverage quality labels to assist diagnosis. Thus, to tackle these issues, we propose a novel meta-knowledge co-embedding network consisting of two subnets: Task Net and Meta Learner. Task Net constructs an explicit quality information utilization mechanism to enhance diagnosis via knowledge co-embedding features, while Meta Learner ensures the effectiveness and constrains the semantics of these features via meta-learning and joint-encoding masking. Superior performance on five datasets with four widely-used medical imaging modalities demonstrates the effectiveness and generalizability of our method. + + + + Learning 3D Representations From 2D Pre-Trained Models via Image-to-Point Masked Autoencoders http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Learning_3D_Representations_From_2D_Pre-Trained_Models_via_Image-to-Point_Masked_CVPR_2023_paper.pdf Pre-training on abundant image data has become the de facto standard for learning robust 2D representations. In contrast, due to the expensive data processing, a paucity of 3D datasets severely hinders the learning of high-quality 3D features. In this paper, we propose an alternative to obtain superior 3D representations from 2D pre-trained models via Image-to-Point Masked Autoencoders, named I2P-MAE. By self-supervised pre-training, we leverage the well-learned 2D knowledge to guide 3D masked autoencoding, which reconstructs the masked point tokens with an encoder-decoder architecture. Specifically, we first utilize off-the-shelf 2D models to extract the multi-view visual features of the input point cloud, and then conduct two types of image-to-point learning schemes. For one, we introduce a 2D-guided masking strategy that keeps semantically important point tokens visible. Compared to random masking, the network can better concentrate on significant 3D structures with key spatial cues. For another, we enforce these visible tokens to reconstruct multi-view 2D features after the decoder. This enables the network to effectively inherit high-level 2D semantics for discriminative 3D modeling. Aided by our image-to-point pre-training, the frozen I2P-MAE, without any fine-tuning, achieves 93.4% accuracy for linear SVM on ModelNet40, competitive with existing fully trained methods. By further fine-tuning on ScanObjectNN's hardest split, I2P-MAE attains the state-of-the-art 90.11% accuracy, +3.68% over the second-best, demonstrating superior transferable capacity. Code is available at https://github.com/ZrrSkywalker/I2P-MAE.
+ + + + BEVFormer v2: Adapting Modern Image Backbones to Bird's-Eye-View Recognition via Perspective Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_BEVFormer_v2_Adapting_Modern_Image_Backbones_to_Birds-Eye-View_Recognition_via_CVPR_2023_paper.pdf + We present a novel bird's-eye-view (BEV) detector with perspective supervision, which converges faster and better suits modern image backbones. Existing state-of-the-art BEV detectors are often tied to certain depth pre-trained backbones like VoVNet, hindering the synergy between booming image backbones and BEV detectors. To address this limitation, we prioritize easing the optimization of BEV detectors by introducing perspective space supervision. To this end, we propose a two-stage BEV detector, where proposals from the perspective head are fed into the bird's-eye-view head for final predictions. To evaluate the effectiveness of our model, we conduct extensive ablation studies focusing on the form of supervision and the generality of the proposed detector. The proposed method is verified with a wide spectrum of traditional and modern image backbones and achieves new SoTA results on the large-scale nuScenes dataset. The code shall be released soon. + + + + Object Discovery From Motion-Guided Tokens + http://openaccess.thecvf.com//content/CVPR2023/papers/Bao_Object_Discovery_From_Motion-Guided_Tokens_CVPR_2023_paper.pdf + Object discovery -- separating objects from the background without manual labels -- is a fundamental open challenge in computer vision. Previous methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, texture) or learned (e.g., from auto-encoders). In this work, we augment the auto-encoder representation learning framework with two key components: motion-guidance and mid-level feature tokenization. Although both have been separately investigated, we introduce a new transformer decoder showing that their benefits can compound thanks to motion-guided vector quantization. We show that our architecture effectively leverages the synergy between motion and tokenization, improving upon the state of the art on both synthetic and real datasets. Our approach enables the emergence of interpretable object-specific mid-level features, demonstrating the benefits of motion-guidance (no labeling) and quantization (interpretability, memory efficiency). + + + + Event-Based Video Frame Interpolation With Cross-Modal Asymmetric Bidirectional Motion Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Event-Based_Video_Frame_Interpolation_With_Cross-Modal_Asymmetric_Bidirectional_Motion_Fields_CVPR_2023_paper.pdf + Video Frame Interpolation (VFI) aims to generate intermediate video frames between consecutive input frames. Since the event cameras are bio-inspired sensors that only encode brightness changes with a micro-second temporal resolution, several works utilized the event camera to enhance the performance of VFI. However, existing methods estimate bidirectional inter-frame motion fields with only events or approximations, which can not consider the complex motion in real-world scenarios. In this paper, we propose a novel event-based VFI framework with cross-modal asymmetric bidirectional motion field estimation. 
In detail, our EIF-BiOFNet utilizes each valuable characteristic of the events and images for direct estimation of inter-frame motion fields without any approximation methods. Moreover, we develop an interactive attention-based frame synthesis network to efficiently leverage the complementary warping-based and synthesis-based features. Finally, we build a large-scale event-based VFI dataset, ERF-X170FPS, with a high frame rate, extreme motion, and dynamic textures to overcome the limitations of previous event-based VFI datasets. Extensive experimental results validate that our method shows significant performance improvement over the state-of-the-art VFI methods on various datasets. Our project page is available at: https://github.com/intelpro/CBMNet + + + + VolRecon: Volume Rendering of Signed Ray Distance Functions for Generalizable Multi-View Reconstruction http://openaccess.thecvf.com//content/CVPR2023/papers/Ren_VolRecon_Volume_Rendering_of_Signed_Ray_Distance_Functions_for_Generalizable_CVPR_2023_paper.pdf The success of Neural Radiance Fields (NeRF) in novel view synthesis has inspired researchers to propose neural implicit scene reconstruction. However, most existing neural implicit reconstruction methods optimize per-scene parameters and therefore lack generalizability to new scenes. We introduce VolRecon, a novel generalizable implicit reconstruction method with Signed Ray Distance Function (SRDF). To reconstruct the scene with fine details and little noise, VolRecon combines projection features aggregated from multi-view features, and volume features interpolated from a coarse global feature volume. Using a ray transformer, we compute SRDF values of sampled points on a ray and then render color and depth. On the DTU dataset, VolRecon outperforms SparseNeuS by about 30% in sparse view reconstruction and achieves accuracy comparable to MVSNet in full view reconstruction. Furthermore, our approach exhibits good generalization performance on the large-scale ETH3D benchmark. + + + + DA-DETR: Domain Adaptive Detection Transformer With Information Fusion http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_DA-DETR_Domain_Adaptive_Detection_Transformer_With_Information_Fusion_CVPR_2023_paper.pdf The recent detection transformer (DETR) simplifies the object detection pipeline by removing hand-crafted designs and hyperparameters as employed in conventional two-stage object detectors. However, how to leverage the simple yet effective DETR architecture in domain adaptive object detection is largely neglected. Inspired by the unique DETR attention mechanisms, we design DA-DETR, a domain adaptive object detection transformer that introduces information fusion for effective transfer from a labeled source domain to an unlabeled target domain. DA-DETR introduces a novel CNN-Transformer Blender (CTBlender) that fuses the CNN features and Transformer features ingeniously for effective feature alignment and knowledge transfer across domains. Specifically, CTBlender employs the Transformer features to modulate the CNN features across multiple scales where the high-level semantic information and the low-level spatial information are fused for accurate object identification and localization. Extensive experiments show that DA-DETR achieves superior detection performance consistently across multiple widely adopted domain adaptation benchmarks.
+ + + + Vision Transformers Are Good Mask Auto-Labelers http://openaccess.thecvf.com//content/CVPR2023/papers/Lan_Vision_Transformers_Are_Good_Mask_Auto-Labelers_CVPR_2023_paper.pdf We propose Mask Auto-Labeler (MAL), a high-quality Transformer-based mask auto-labeling framework for instance segmentation using only box annotations. MAL takes box-cropped images as inputs and conditionally generates their mask pseudo-labels. We show that Vision Transformers are good mask auto-labelers. Our method significantly reduces the gap between auto-labeling and human annotation regarding mask quality. Instance segmentation models trained using the MAL-generated masks can nearly match the performance of their fully-supervised counterparts, retaining up to 97.4% of the performance of fully supervised models. The best model achieves 44.1% mAP on COCO instance segmentation (test-dev 2017), outperforming state-of-the-art box-supervised methods by significant margins. Qualitative results indicate that masks produced by MAL are, in some cases, even better than human annotations. + + + + Neural Transformation Fields for Arbitrary-Styled Font Generation http://openaccess.thecvf.com//content/CVPR2023/papers/Fu_Neural_Transformation_Fields_for_Arbitrary-Styled_Font_Generation_CVPR_2023_paper.pdf Few-shot font generation (FFG), which aims to generate font images from a few samples, has become an emerging topic in recent years due to its academic and commercial value. Typically, FFG approaches follow the style-content disentanglement paradigm, which transfers the target font styles to characters by combining the content representations of source characters and the style codes of reference samples. Most existing methods attempt to increase font generation ability by exploring powerful style representations, which may be a sub-optimal solution for the FFG task due to the lack of modeling spatial transformation in transferring font styles. In this paper, we model font generation as a continuous transformation process from the source character image to the target font image via the creation and dissipation of font pixels, and embed the corresponding transformations into a neural transformation field. With the estimated transformation path, the neural transformation field generates a set of intermediate transformation results via the sampling process, and a font rendering formula is developed to accumulate them into the target font image. Extensive experiments show that our method achieves state-of-the-art performance on the few-shot font generation task, which demonstrates the effectiveness of our proposed model. Our implementation is available at: https://github.com/fubinfb/NTF. + + + + EDICT: Exact Diffusion Inversion via Coupled Transformations http://openaccess.thecvf.com//content/CVPR2023/papers/Wallace_EDICT_Exact_Diffusion_Inversion_via_Coupled_Transformations_CVPR_2023_paper.pdf Finding an initial noise vector that produces an input image when fed into the diffusion process (known as inversion) is an important problem in denoising diffusion models (DDMs), with applications for real image editing. The standard approach for real image editing with inversion uses denoising diffusion implicit models (DDIMs) to deterministically noise the image to the intermediate state along the path that the denoising would follow given the original conditioning.
However, DDIM inversion for real images is unstable as it relies on local linearization assumptions, which result in the propagation of errors, leading to incorrect image reconstruction and loss of content. To alleviate these problems, we propose Exact Diffusion Inversion via Coupled Transformations (EDICT), an inversion method that draws inspiration from affine coupling layers. EDICT enables mathematically exact inversion of real and model-generated images by maintaining two coupled noise vectors which are used to invert each other in an alternating fashion. Using Stable Diffusion [25], a state-of-the-art latent diffusion model, we demonstrate that EDICT successfully reconstructs real images with high fidelity. On complex image datasets like MS-COCO, EDICT reconstruction significantly outperforms DDIM, improving the mean square error of reconstruction by a factor of two. Using noise vectors inverted from real images, EDICT enables a wide range of image edits--from local and global semantic edits to image stylization--while maintaining fidelity to the original image structure. EDICT requires no model training/finetuning, prompt tuning, or extra data and can be combined with any pretrained DDM. + + + + AeDet: Azimuth-Invariant Multi-View 3D Object Detection http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_AeDet_Azimuth-Invariant_Multi-View_3D_Object_Detection_CVPR_2023_paper.pdf Recent LSS-based multi-view 3D object detection has made tremendous progress by processing the features in Bird's-Eye-View (BEV) via a convolutional detector. However, the typical convolution ignores the radial symmetry of the BEV features and increases the difficulty of the detector optimization. To preserve the inherent property of the BEV features and ease the optimization, we propose an azimuth-equivariant convolution (AeConv) and an azimuth-equivariant anchor. The sampling grid of AeConv is always in the radial direction, and thus it can learn azimuth-invariant BEV features. The proposed anchor enables the detection head to learn to predict azimuth-irrelevant targets. In addition, we introduce a camera-decoupled virtual depth to unify the depth prediction for images with different camera intrinsic parameters. The resultant detector is dubbed Azimuth-equivariant Detector (AeDet). Extensive experiments are conducted on nuScenes, and AeDet achieves a 62.0% NDS, surpassing the recent multi-view 3D object detectors such as PETRv2 and BEVDepth by a large margin. + + + + OCELOT: Overlapped Cell on Tissue Dataset for Histopathology http://openaccess.thecvf.com//content/CVPR2023/papers/Ryu_OCELOT_Overlapped_Cell_on_Tissue_Dataset_for_Histopathology_CVPR_2023_paper.pdf Cell detection is a fundamental task in computational pathology that can be used for extracting high-level medical information from whole-slide images. For accurate cell detection, pathologists often zoom out to understand the tissue-level structures and zoom in to classify cells based on their morphology and the surrounding context. However, there has been little effort to reflect such pathologist behaviors in cell detection models, mainly due to the lack of datasets containing both cell and tissue annotations with overlapping regions. To overcome this limitation, we propose and publicly release OCELOT, a dataset purposely dedicated to the study of cell-tissue relationships for cell detection in histopathology. OCELOT provides overlapping cell and tissue annotations on images acquired from multiple organs.
Within this setting, we also propose multi-task learning approaches that benefit from learning both cell and tissue tasks simultaneously. When compared against a model trained only for the cell detection task, our proposed approaches improve cell detection performance on 3 datasets: proposed OCELOT, public TIGER, and internal CARP datasets. On the OCELOT test set in particular, we show up to 6.79 improvement in F1-score. We believe the contributions of this paper, including the release of the OCELOT dataset at https://lunit-io.github.io/research/publications/ocelot are a crucial starting point toward the important research direction of incorporating cell-tissue relationships in computation pathology. + + + + Unsupervised Sampling Promoting for Stochastic Human Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Unsupervised_Sampling_Promoting_for_Stochastic_Human_Trajectory_Prediction_CVPR_2023_paper.pdf + The indeterminate nature of human motion requires trajectory prediction systems to use a probabilistic model to formulate the multi-modality phenomenon and infer a finite set of future trajectories. However, the inference processes of most existing methods rely on Monte Carlo random sampling, which is insufficient to cover the realistic paths with finite samples, due to the long tail effect of the predicted distribution. To promote the sampling process of stochastic prediction, we propose a novel method, called BOsampler, to adaptively mine potential paths with Bayesian optimization in an unsupervised manner, as a sequential design strategy in which new prediction is dependent on the previously drawn samples. Specifically, we model the trajectory sampling as a Gaussian process and construct an acquisition function to measure the potential sampling value. This acquisition function applies the original distribution as prior and encourages exploring paths in the long-tail region. This sampling method can be integrated with existing stochastic predictive models without retraining. Experimental results on various baseline methods demonstrate the effectiveness of our method. The source code is released in this link. + + + + Safe Latent Diffusion: Mitigating Inappropriate Degeneration in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Schramowski_Safe_Latent_Diffusion_Mitigating_Inappropriate_Degeneration_in_Diffusion_Models_CVPR_2023_paper.pdf + Text-conditioned image generation models have recently achieved astonishing results in image quality and text alignment and are consequently employed in a fast-growing number of applications. Since they are highly data-driven, relying on billion-sized datasets randomly scraped from the internet, they also suffer, as we demonstrate, from degenerated and biased human behavior. In turn, they may even reinforce such biases. To help combat these undesired side effects, we present safe latent diffusion (SLD). Specifically, to measure the inappropriate degeneration due to unfiltered and imbalanced training sets, we establish a novel image generation test bed - inappropriate image prompts (I2P) - containing dedicated, real-world image-to-text prompts covering concepts such as nudity and violence. As our exhaustive empirical evaluation demonstrates, the introduced SLD removes and suppresses inappropriate image parts during the diffusion process, with no additional training required and no adverse effect on overall image quality or text alignment. 
+ + + + Superclass Learning With Representation Enhancement + http://openaccess.thecvf.com//content/CVPR2023/papers/Kang_Superclass_Learning_With_Representation_Enhancement_CVPR_2023_paper.pdf + In many real scenarios, data are often divided into a handful of artificial super categories in terms of expert knowledge rather than the representations of images. Concretely, a superclass may contain massive and various raw categories, such as refuse sorting. Due to the lack of common semantic features, the existing classification techniques are intractable to recognize superclass without raw class labels, thus they suffer severe performance damage or require huge annotation costs. To narrow this gap, this paper proposes a superclass learning framework, called SuperClass Learning with Representation Enhancement(SCLRE), to recognize super categories by leveraging enhanced representation. Specifically, by exploiting the self-attention technique across the batch, SCLRE collapses the boundaries of those raw categories and enhances the representation of each superclass. On the enhanced representation space, a superclass-aware decision boundary is then reconstructed. Theoretically, we prove that by leveraging attention techniques the generalization error of SCLRE can be bounded under superclass scenarios. Experimentally, extensive results demonstrate that SCLRE outperforms the baseline and other contrastive-based methods on CIFAR-100 datasets and four high-resolution datasets. + + + + Visual Prompt Tuning for Generative Transfer Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Sohn_Visual_Prompt_Tuning_for_Generative_Transfer_Learning_CVPR_2023_paper.pdf + Learning generative image models from various domains efficiently needs transferring knowledge from an image synthesis model trained on a large dataset. We present a recipe for learning vision transformers by generative knowledge transfer. We base our framework on generative vision transformers representing an image as a sequence of visual tokens with the autoregressive or non-autoregressive transformers. To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens called prompts to the image token sequence and introduces a new prompt design for our task. We study on a variety of visual domains with varying amounts of training images. We show the effectiveness of knowledge transfer and a significantly better image generation quality. Code is available at https://github.com/google-research/generative_transfer. + + + + ReLight My NeRF: A Dataset for Novel View Synthesis and Relighting of Real World Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Toschi_ReLight_My_NeRF_A_Dataset_for_Novel_View_Synthesis_and_CVPR_2023_paper.pdf + In this paper, we focus on the problem of rendering novel views from a Neural Radiance Field (NeRF) under unobserved light conditions. To this end, we introduce a novel dataset, dubbed ReNe (Relighting NeRF), framing real world objects under one-light-at-time (OLAT) conditions, annotated with accurate ground-truth camera and light poses. Our acquisition pipeline leverages two robotic arms holding, respectively, a camera and an omni-directional point-wise light source. We release a total of 20 scenes depicting a variety of objects with complex geometry and challenging materials. Each scene includes 2000 images, acquired from 50 different points of views under 40 different OLAT conditions. 
By leveraging the dataset, we perform an ablation study on the relighting capability of variants of the vanilla NeRF architecture and identify a lightweight architecture that can render novel views of an object under novel light conditions, which we use to establish a non-trivial baseline for the dataset. Dataset and benchmark are available at https://eyecan-ai.github.io/rene. + + + + Content-Aware Token Sharing for Efficient Semantic Segmentation With Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Content-Aware_Token_Sharing_for_Efficient_Semantic_Segmentation_With_Vision_Transformers_CVPR_2023_paper.pdf + This paper introduces Content-aware Token Sharing (CTS), a token reduction approach that improves the computational efficiency of semantic segmentation networks that use Vision Transformers (ViTs). Existing works have proposed token reduction approaches to improve the efficiency of ViT-based image classification networks, but these methods are not directly applicable to semantic segmentation, which we address in this work. We observe that, for semantic segmentation, multiple image patches can share a token if they contain the same semantic class, as they contain redundant information. Our approach leverages this by employing an efficient, class-agnostic policy network that predicts if image patches contain the same semantic class, and lets them share a token if they do. With experiments, we explore the critical design choices of CTS and show its effectiveness on the ADE20K, Pascal Context and Cityscapes datasets, various ViT backbones, and different segmentation decoders. With Content-aware Token Sharing, we are able to reduce the number of processed tokens by up to 44%, without diminishing the segmentation quality. + + + + Are Binary Annotations Sufficient? Video Moment Retrieval via Hierarchical Uncertainty-Based Active Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Ji_Are_Binary_Annotations_Sufficient_Video_Moment_Retrieval_via_Hierarchical_Uncertainty-Based_CVPR_2023_paper.pdf + Recent research on video moment retrieval has mostly focused on enhancing the performance of accuracy, efficiency, and robustness, all of which largely rely on the abundance of high-quality annotations. While the precise frame-level annotations are time-consuming and cost-expensive, few attentions have been paid to the labeling process. In this work, we explore a new interactive manner to stimulate the process of human-in-the-loop annotation in video moment retrieval task. The key challenge is to select "ambiguous" frames and videos for binary annotations to facilitate the network training. To be specific, we propose a new hierarchical uncertainty-based modeling that explicitly considers modeling the uncertainty of each frame within the entire video sequence corresponding to the query description, and selecting the frame with the highest uncertainty. Only selected frame will be annotated by the human experts, which can largely reduce the workload. After obtaining a small number of labels provided by the expert, we show that it is sufficient to learn a competitive video moment retrieval model in such a harsh environment. Moreover, we treat the uncertainty score of frames in a video as a whole, and estimate the difficulty of each video, which can further relieve the burden of video selection. In general, our active learning strategy for video moment retrieval works not only at the frame level but also at the sequence level. 
Experiments on two public datasets validate the effectiveness of our proposed method. + + + + VGFlow: Visibility Guided Flow Network for Human Reposing http://openaccess.thecvf.com//content/CVPR2023/papers/Jain_VGFlow_Visibility_Guided_Flow_Network_for_Human_Reposing_CVPR_2023_paper.pdf The task of human reposing involves generating a realistic image of a model standing in an arbitrary conceivable pose. There are multiple difficulties in generating perceptually accurate images, and existing methods suffer from limitations in preserving texture, maintaining pattern coherence, respecting cloth boundaries, handling occlusions, manipulating skin generation, etc. These difficulties are further exacerbated by the fact that the possible space of pose orientations for humans is large and variable, the nature of clothing items is highly non-rigid, and body shapes differ widely across the population. To alleviate these difficulties and synthesize perceptually accurate images, we propose VGFlow, a model which uses a visibility guided flow module to disentangle the flow into visible and invisible parts of the target for simultaneous texture preservation and style manipulation. Furthermore, to tackle distinct body shapes and avoid network artifacts, we also incorporate a self-supervised patch-wise "realness" loss to further improve the output. VGFlow achieves state-of-the-art results as observed qualitatively and quantitatively on different image quality metrics (SSIM, LPIPS, FID). + + + + Improving Selective Visual Question Answering by Learning From Your Peers http://openaccess.thecvf.com//content/CVPR2023/papers/Dancette_Improving_Selective_Visual_Question_Answering_by_Learning_From_Your_Peers_CVPR_2023_paper.pdf Despite advances in Visual Question Answering (VQA), the ability of models to assess their own correctness remains underexplored. Recent work has shown that VQA models, out-of-the-box, can have difficulties abstaining from answering when they are wrong. The option to abstain, also called Selective Prediction, is highly relevant when deploying systems to users who must trust the system's output (e.g., VQA assistants for users with visual impairments). For such scenarios, abstention can be especially important as users may provide out-of-distribution (OOD) or adversarial inputs that make incorrect answers more likely. In this work, we explore Selective VQA in both in-distribution (ID) and OOD scenarios, where models are presented with mixtures of ID and OOD data. The goal is to maximize the number of questions answered while minimizing the risk of error on those questions. We propose a simple yet effective Learning from Your Peers (LYP) approach for training multimodal selection functions for making abstention decisions. Our approach uses predictions from models trained on distinct subsets of the training data as targets for optimizing a Selective VQA model. It does not require additional manual labels or held-out data and provides a signal for identifying examples that are easy/difficult to generalize to. In our extensive evaluations, we show this benefits a number of models across different architectures and scales. Overall, for ID, we reach 32.92% in the selective prediction metric coverage at 1% risk of error (C@1%), which doubles the previous best coverage of 15.79% on this task.
For mixed ID/OOD, using models' softmax confidences for abstention decisions performs very poorly, answering <5% of questions at 1% risk of error even when faced with only 10% OOD examples, but a learned selection function with LYP can increase that to 25.38% C@1%. + + + + Look, Radiate, and Learn: Self-Supervised Localisation via Radio-Visual Correspondence + http://openaccess.thecvf.com//content/CVPR2023/papers/Alloulah_Look_Radiate_and_Learn_Self-Supervised_Localisation_via_Radio-Visual_Correspondence_CVPR_2023_paper.pdf + Next generation cellular networks will implement radio sensing functions alongside customary communications, thereby enabling unprecedented worldwide sensing coverage outdoors. Deep learning has revolutionised computer vision but has had limited application to radio perception tasks, in part due to lack of systematic datasets and benchmarks dedicated to the study of the performance and promise of radio sensing. To address this gap, we present MaxRay: a synthetic radio-visual dataset and benchmark that facilitate precise target localisation in radio. We further propose to learn to localise targets in radio without supervision by extracting self-coordinates from radio-visual correspondence. We use such self-supervised coordinates to train a radio localiser network. We characterise our performance against a number of state-of-the-art baselines. Our results indicate that accurate radio target localisation can be automatically learned from paired radio-visual data without labels, which is important for empirical data. This opens the door for vast data scalability and may prove key to realising the promise of robust radio sensing atop a unified communication-perception cellular infrastructure. Dataset will be hosted on IEEE DataPort. + + + + Shape-Erased Feature Learning for Visible-Infrared Person Re-Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Shape-Erased_Feature_Learning_for_Visible-Infrared_Person_Re-Identification_CVPR_2023_paper.pdf + Due to the modality gap between visible and infrared images with high visual ambiguity, learning diverse modality-shared semantic concepts for visible-infrared person re-identification (VI-ReID) remains a challenging problem. Body shape is one of the significant modality-shared cues for VI-ReID. To dig more diverse modality-shared cues, we expect that erasing body-shape-related semantic concepts in the learned features can force the ReID model to extract more and other modality-shared features for identification. To this end, we propose shape-erased feature learning paradigm that decorrelates modality-shared features in two orthogonal subspaces. Jointly learning shape-related feature in one subspace and shape-erased features in the orthogonal complement achieves a conditional mutual information maximization between shape-erased feature and identity discarding body shape information, thus enhancing the diversity of the learned representation explicitly. Extensive experiments on SYSU-MM01, RegDB, and HITSZ-VCM datasets demonstrate the effectiveness of our method. + + + + On Calibrating Semantic Segmentation Models: Analyses and an Algorithm + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_On_Calibrating_Semantic_Segmentation_Models_Analyses_and_an_Algorithm_CVPR_2023_paper.pdf + We study the problem of semantic segmentation calibration. Lots of solutions have been proposed to approach model miscalibration of confidence in image classification. 
However, to date, confidence calibration research on semantic segmentation is still limited. We provide a systematic study on the calibration of semantic segmentation models and propose a simple yet effective approach. First, we find that model capacity, crop size, multi-scale testing, and prediction correctness have impact on calibration. Among them, prediction correctness, especially misprediction, is more important to miscalibration due to over-confidence. Next, we propose a simple, unifying, and effective approach, namely selective scaling, by separating correct/incorrect prediction for scaling and more focusing on misprediction logit smoothing. Then, we study popular existing calibration methods and compare them with selective scaling on semantic segmentation calibration. We conduct extensive experiments with a variety of benchmarks on both in-domain and domain-shift calibration and show that selective scaling consistently outperforms other methods. + + + + Visual Atoms: Pre-Training Vision Transformers With Sinusoidal Waves + http://openaccess.thecvf.com//content/CVPR2023/papers/Takashima_Visual_Atoms_Pre-Training_Vision_Transformers_With_Sinusoidal_Waves_CVPR_2023_paper.pdf + Formula-driven supervised learning (FDSL) has been shown to be an effective method for pre-training vision transformers, where ExFractalDB-21k was shown to exceed the pre-training effect of ImageNet-21k. These studies also indicate that contours mattered more than textures when pre-training vision transformers. However, the lack of a systematic investigation as to why these contour-oriented synthetic datasets can achieve the same accuracy as real datasets leaves much room for skepticism. In the present work, we develop a novel methodology based on circular harmonics for systematically investigating the design space of contour-oriented synthetic datasets. This allows us to efficiently search the optimal range of FDSL parameters and maximize the variety of synthetic images in the dataset, which we found to be a critical factor. When the resulting new dataset VisualAtom-21k is used for pre-training ViT-Base, the top-1 accuracy reached 83.7% when fine-tuning on ImageNet-1k. This is only 0.5% difference from the top-1 accuracy (84.2%) achieved by the JFT-300M pre-training, even though the scale of images is 1/14. Unlike JFT-300M which is a static dataset, the quality of synthetic datasets will continue to improve, and the current work is a testament to this possibility. FDSL is also free of the common issues associated with real images, e.g. privacy/copyright issues, labeling costs/errors, and ethical biases. + + + + Masked Autoencoding Does Not Help Natural Language Supervision at Scale + http://openaccess.thecvf.com//content/CVPR2023/papers/Weers_Masked_Autoencoding_Does_Not_Help_Natural_Language_Supervision_at_Scale_CVPR_2023_paper.pdf + Self supervision and natural language supervision have emerged as two exciting ways to train general purpose image encoders which excel at a variety of downstream tasks. Recent works such as M3AE (Geng et al 2022) and SLIP (Mu et al 2022) have suggested that these approaches can be effectively combined, but most notably their results use small (<20M examples) pre-training datasets and don't effectively reflect the large-scale regime (>100M samples) that is commonly used for these approaches. Here we investigate whether a similar approach can be effective when trained with a much larger amount of data. 
We find that a combination of two state-of-the-art approaches, masked auto-encoders (MAE, He et al. 2021) and contrastive language-image pre-training (CLIP, Radford et al. 2021), provides a benefit over CLIP when trained on a corpus of 11.3M image-text pairs, but little to no benefit (as evaluated on a suite of common vision tasks) over CLIP when trained on a large corpus of 1.4B images. Our work provides some much-needed clarity on the effectiveness (or lack thereof) of self-supervision for large-scale image-text training. + + + + Transductive Few-Shot Learning With Prototype-Based Label Propagation by Iterative Graph Refinement http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Transductive_Few-Shot_Learning_With_Prototype-Based_Label_Propagation_by_Iterative_Graph_CVPR_2023_paper.pdf Few-shot learning (FSL) is popular due to its ability to adapt to novel classes. Compared with inductive few-shot learning, transductive models typically perform better as they leverage all samples of the query set. The two existing classes of methods, prototype-based and graph-based, have the disadvantages of inaccurate prototype estimation and sub-optimal graph construction with kernel functions, respectively. In this paper, we propose a novel prototype-based label propagation method to solve these issues. Specifically, our graph construction is based on the relation between prototypes and samples rather than between samples. As prototypes are being updated, the graph changes. We also estimate the label of each prototype instead of treating a prototype as the class centre. On the mini-ImageNet, tiered-ImageNet, CIFAR-FS and CUB datasets, we show that the proposed method outperforms other state-of-the-art methods in transductive FSL and semi-supervised FSL when some unlabeled data accompanies the novel few-shot task. + + + + Binary Latent Diffusion http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Binary_Latent_Diffusion_CVPR_2023_paper.pdf In this paper, we show that a binary latent space can be explored for compact yet expressive image representations. We model the bi-directional mappings between an image and the corresponding latent binary representation by training an auto-encoder with a Bernoulli encoding distribution. On the one hand, the binary latent space provides a compact discrete image representation of which the distribution can be modeled more efficiently than pixels or continuous latent representations. On the other hand, we now represent each image patch as a binary vector instead of an index into a learned codebook as in discrete image representations with vector quantization. In this way, we obtain binary latent representations that allow for better image quality and high-resolution image representations without any multi-stage hierarchy in the latent space. In this binary latent space, images can now be generated effectively using a binary latent diffusion model tailored specifically for modeling the prior over the binary image representations. We present both conditional and unconditional image generation experiments with multiple datasets, and show that the proposed method performs comparably to state-of-the-art methods while dramatically improving the sampling efficiency to as few as 16 steps without using any test-time acceleration. The proposed framework can also be seamlessly scaled to 1024 x 1024 high-resolution image generation without resorting to latent hierarchy or multi-stage refinements.
+ + + + FLAG3D: A 3D Fitness Activity Dataset With Language Instruction http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_FLAG3D_A_3D_Fitness_Activity_Dataset_With_Language_Instruction_CVPR_2023_paper.pdf With its continuously growing popularity around the world, fitness activity analytics has become an emerging research topic in computer vision. While a variety of new tasks and algorithms have been proposed recently, there is a growing hunger for data resources with high-quality data, fine-grained labels, and diverse environments. In this paper, we present FLAG3D, a large-scale 3D fitness activity dataset with language instruction containing 180K sequences of 60 categories. FLAG3D features the following three aspects: 1) accurate and dense 3D human poses captured by an advanced MoCap system to handle complex activities and large movements, 2) detailed and professional language instruction to describe how to perform a specific activity, 3) versatile video resources from a high-tech MoCap system, rendering software, and cost-effective smartphones in natural environments. Extensive experiments and in-depth analysis show that FLAG3D contributes great research value for various challenges, such as cross-domain human action recognition, dynamic human mesh recovery, and language-guided human action generation. Our dataset and source code are publicly available at https://andytang15.github.io/FLAG3D. + + + + NeuralUDF: Learning Unsigned Distance Fields for Multi-View Reconstruction of Surfaces With Arbitrary Topologies http://openaccess.thecvf.com//content/CVPR2023/papers/Long_NeuralUDF_Learning_Unsigned_Distance_Fields_for_Multi-View_Reconstruction_of_Surfaces_CVPR_2023_paper.pdf We present a novel method, called NeuralUDF, for reconstructing surfaces with arbitrary topologies from 2D images via volume rendering. Recent advances in neural rendering based reconstruction have achieved compelling results. However, these methods are limited to objects with closed surfaces since they adopt the Signed Distance Function (SDF) as the surface representation, which requires the target shape to be divided into inside and outside. In this paper, we propose to represent surfaces as the Unsigned Distance Function (UDF) and develop a new volume rendering scheme to learn the neural UDF representation. Specifically, a new density function that correlates the property of UDF with the volume rendering scheme is introduced for robust optimization of the UDF fields. Experiments on the DTU and DeepFashion3D datasets show that our method not only enables high-quality reconstruction of non-closed shapes with complex topologies, but also achieves comparable performance to the SDF-based methods on the reconstruction of closed surfaces. Visit our project page at https://www.xxlong.site/NeuralUDF/. + + + + Collaborative Static and Dynamic Vision-Language Streams for Spatio-Temporal Video Grounding http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Collaborative_Static_and_Dynamic_Vision-Language_Streams_for_Spatio-Temporal_Video_Grounding_CVPR_2023_paper.pdf Spatio-Temporal Video Grounding (STVG) aims to localize the target object spatially and temporally according to the given language query. It is a challenging task in which the model should thoroughly understand dynamic visual cues (e.g., motions) and static visual cues (e.g., object appearances) in the language description, which requires effective joint modeling of spatio-temporal visual-linguistic dependencies.
In this work, we propose a novel framework in which a static vision-language stream and a dynamic vision-language stream are developed to collaboratively reason the target tube. The static stream performs cross-modal understanding in a single frame and learns to attend to the target object spatially according to intra-frame visual cues like object appearances. The dynamic stream models visual-linguistic dependencies across multiple consecutive frames to capture dynamic cues like motions. We further design a novel cross-stream collaborative block between the two streams, which enables the static and dynamic streams to transfer useful and complementary information from each other to achieve collaborative reasoning. Experimental results show the effectiveness of the collaboration of the two streams and our overall framework achieves new state-of-the-art performance on both HCSTVG and VidSTG datasets. + + + + Residual Degradation Learning Unfolding Framework With Mixing Priors Across Spectral and Spatial for Compressive Spectral Imaging + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_Residual_Degradation_Learning_Unfolding_Framework_With_Mixing_Priors_Across_Spectral_CVPR_2023_paper.pdf + To acquire a snapshot spectral image, coded aperture snapshot spectral imaging (CASSI) is proposed. A core problem of the CASSI system is to recover the reliable and fine underlying 3D spectral cube from the 2D measurement. By alternately solving a data subproblem and a prior subproblem, deep unfolding methods achieve good performance. However, in the data subproblem, the used sensing matrix is ill-suited for the real degradation process due to the device errors caused by phase aberration, distortion; in the prior subproblem, it is important to design a suitable model to jointly exploit both spatial and spectral priors. In this paper, we propose a Residual Degradation Learning Unfolding Framework (RDLUF), which bridges the gap between the sensing matrix and the degradation process. Moreover, a MixS2 Transformer is designed via mixing priors across spectral and spatial to strengthen the spectral-spatial representation capability. Finally, plugging the MixS2 Transformer into the RDLUF leads to an end-to-end trainable and interpretable neural network RDLUF-MixS2. Experimental results establish the superior performance of the proposed method over existing ones. + + + + GamutMLP: A Lightweight MLP for Color Loss Recovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Le_GamutMLP_A_Lightweight_MLP_for_Color_Loss_Recovery_CVPR_2023_paper.pdf + Cameras and image-editing software often process images in the wide-gamut ProPhoto color space, encompassing 90% of all visible colors. However, when images are encoded for sharing, this color-rich representation is transformed and clipped to fit within the small-gamut standard RGB (sRGB) color space, representing only 30% of visible colors. Recovering the lost color information is challenging due to the clipping procedure. Inspired by neural implicit representations for 2D images, we propose a method that optimizes a lightweight multi-layer-perceptron (MLP) model during the gamut reduction step to predict the clipped values. GamutMLP takes approximately 2 seconds to optimize and requires only 23 KB of storage. The small memory footprint allows our GamutMLP model to be saved as metadata in the sRGB image---the model can be extracted when needed to restore wide-gamut color values. 
We demonstrate the effectiveness of our approach for color recovery and compare it with alternative strategies, including pre-trained DNN-based gamut expansion networks and other implicit neural representation methods. As part of this effort, we introduce a new color gamut dataset of 2200 wide-gamut/small-gamut images for training and testing. + + + + Instance-Aware Domain Generalization for Face Anti-Spoofing + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Instance-Aware_Domain_Generalization_for_Face_Anti-Spoofing_CVPR_2023_paper.pdf + Face anti-spoofing (FAS) based on domain generalization (DG) has been recently studied to improve the generalization on unseen scenarios. Previous methods typically rely on domain labels to align the distribution of each domain for learning domain-invariant representations. However, artificial domain labels are coarse-grained and subjective, which cannot reflect real domain distributions accurately. Besides, such domain-aware methods focus on domain-level alignment, which is not fine-grained enough to ensure that learned representations are insensitive to domain styles. To address these issues, we propose a novel perspective for DG FAS that aligns features on the instance level without the need for domain labels. Specifically, Instance-Aware Domain Generalization framework is proposed to learn the generalizable feature by weakening the features' sensitivity to instance-specific styles. Concretely, we propose Asymmetric Instance Adaptive Whitening to adaptively eliminate the style-sensitive feature correlation, boosting the generalization. Moreover, Dynamic Kernel Generator and Categorical Style Assembly are proposed to first extract the instance-specific features and then generate the style-diversified features with large style shifts, respectively, further facilitating the learning of style-insensitive features. Extensive experiments and analysis demonstrate the superiority of our method over state-of-the-art competitors. Code will be publicly available at this link: https://github.com/qianyuzqy/IADG. + + + + Robust and Scalable Gaussian Process Regression and Its Applications + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Robust_and_Scalable_Gaussian_Process_Regression_and_Its_Applications_CVPR_2023_paper.pdf + This paper introduces a robust and scalable Gaussian process regression (GPR) model via variational learning. This enables the application of Gaussian processes to a wide range of real data, which are often large-scale and contaminated by outliers. Towards this end, we employ a mixture likelihood model where outliers are assumed to be sampled from a uniform distribution. We next derive a variational formulation that jointly infers the mode of data, i.e., inlier or outlier, as well as hyperparameters by maximizing a lower bound of the true log marginal likelihood. Compared to previous robust GPR, our formulation approximates the exact posterior distribution. The inducing variable approximation and stochastic variational inference are further introduced to our variational framework, extending our model to large-scale data. We apply our model to two challenging real-world applications, namely feature matching and dense gene expression imputation. Extensive experiments demonstrate the superiority of our model in terms of robustness and speed. Notably, when matching 4k feature points, its inference is completed in milliseconds with almost no false matches. The code is at https://github.com/YifanLu2000/Robust-Scalable-GPR. 
+ + + + Shepherding Slots to Objects: Towards Stable and Robust Object-Centric Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Shepherding_Slots_to_Objects_Towards_Stable_and_Robust_Object-Centric_Learning_CVPR_2023_paper.pdf + Object-centric learning (OCL) aspires to general and compositional understanding of scenes by representing a scene as a collection of object-centric representations. OCL has also been extended to multi-view image and video datasets to apply various data-driven inductive biases by utilizing geometric or temporal information in the multi-image data. Single-view images carry less information about how to disentangle a given scene than videos or multi-view images do. Hence, owing to the difficulty of applying inductive biases, OCL for single-view images still remains challenging, resulting in inconsistent learning of object-centric representation. To this end, we introduce a novel OCL framework for single-view images, SLot Attention via SHepherding (SLASH), which consists of two simple-yet-effective modules on top of Slot Attention. The new modules, Attention Refining Kernel (ARK) and Intermediate Point Predictor and Encoder (IPPE), respectively, prevent slots from being distracted by the background noise and indicate locations for slots to focus on to facilitate learning of object-centric representation. We also propose a weak- and semi-supervision approach for OCL, whilst our proposed framework can be used without any assistant annotation during the inference. Experiments show that our proposed method enables consistent learning of object-centric representation and achieves strong performance across four datasets. Code is available at https://github.com/object-understanding/SLASH. + + + + High-Fidelity Event-Radiance Recovery via Transient Event Frequency + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_High-Fidelity_Event-Radiance_Recovery_via_Transient_Event_Frequency_CVPR_2023_paper.pdf + High-fidelity radiance recovery plays a crucial role in scene information reconstruction and understanding. Conventional cameras suffer from limited sensitivity in dynamic range, bit depth, and spectral response, etc. In this paper, we propose to use event cameras with bio-inspired silicon sensors, which are sensitive to radiance changes, to recover precise radiance values. We reveal that, under active lighting conditions, the transient frequency of event signals triggering linearly reflects the radiance value. We propose an innovative method to convert the high temporal resolution of event signals into precise radiance values. The precise radiance values yield several capabilities in image analysis. We demonstrate the feasibility of recovering radiance values solely from the transient event frequency (TEF) through multiple experiments. + + + + NeMo: Learning 3D Neural Motion Fields From Multiple Video Instances of the Same Action + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_NeMo_Learning_3D_Neural_Motion_Fields_From_Multiple_Video_Instances_CVPR_2023_paper.pdf + The task of reconstructing 3D human motion has wide-ranging applications. The gold standard Motion capture (MoCap) systems are accurate but inaccessible to the general public due to their cost, hardware, and space constraints. In contrast, monocular human mesh recovery (HMR) methods are much more accessible than MoCap as they take single-view videos as inputs.
Replacing the multi-view MoCap systems with a monocular HMR method would break the current barriers to collecting accurate 3D motion thus making exciting applications like motion analysis and motion-driven animation accessible to the general public. However, the performance of existing HMR methods degrades when the video contains challenging and dynamic motion that is not in existing MoCap datasets used for training. This reduces its appeal as dynamic motion is frequently the target in 3D motion recovery in the aforementioned applications. Our study aims to bridge the gap between monocular HMR and multi-view MoCap systems by leveraging information shared across multiple video instances of the same action. We introduce the Neural Motion (NeMo) field. It is optimized to represent the underlying 3D motions across a set of videos of the same action. Empirically, we show that NeMo can recover 3D motion in sports using videos from the Penn Action dataset, where NeMo outperforms existing HMR methods in terms of 2D keypoint detection. To further validate NeMo using 3D metrics, we collected a small MoCap dataset mimicking actions in Penn Action, and show that NeMo achieves better 3D reconstruction compared to various baselines. + + + + RIATIG: Reliable and Imperceptible Adversarial Text-to-Image Generation With Natural Prompts + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_RIATIG_Reliable_and_Imperceptible_Adversarial_Text-to-Image_Generation_With_Natural_Prompts_CVPR_2023_paper.pdf + The field of text-to-image generation has made remarkable strides in creating high-fidelity and photorealistic images. As this technology gains popularity, there is a growing concern about its potential security risks. However, there has been limited exploration into the robustness of these models from an adversarial perspective. Existing research has primarily focused on untargeted settings, and lacks holistic consideration for reliability (attack success rate) and stealthiness (imperceptibility). In this paper, we propose RIATIG, a reliable and imperceptible adversarial attack against text-to-image models via inconspicuous examples. By formulating the example crafting as an optimization process and solving it using a genetic-based method, our proposed attack can generate imperceptible prompts for text-to-image generation models in a reliable way. Evaluation of six popular text-to-image generation models demonstrates the efficiency and stealthiness of our attack in both white-box and black-box settings. To allow the community to build on top of our findings, we've made the artifacts available. + + + + GLIGEN: Open-Set Grounded Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_GLIGEN_Open-Set_Grounded_Text-to-Image_Generation_CVPR_2023_paper.pdf + Large-scale text-to-image diffusion models have made amazing advances. However, the status quo is to use text input alone, which can impede controllability. In this work, we propose GLIGEN: Open-Set Grounded Text-to-Image Generation, a novel approach that builds upon and extends the functionality of existing pre-trained text-to-image diffusion models by enabling them to also be conditioned on grounding inputs. To preserve the vast concept knowledge of the pre-trained model, we freeze all of its weights and inject the grounding information into new trainable layers via a gated mechanism. 
Our model achieves open-world grounded text2img generation with caption and bounding box condition inputs, and the grounding ability generalizes well to novel spatial configurations and concepts. GLIGEN's zero-shot performance on COCO and LVIS outperforms existing supervised layout-to-image baselines by a large margin. + + + + Learning Geometry-Aware Representations by Sketching + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Learning_Geometry-Aware_Representations_by_Sketching_CVPR_2023_paper.pdf + Understanding geometric concepts, such as distance and shape, is essential for understanding the real world and also for many vision tasks. To incorporate such information into a visual representation of a scene, we propose learning to represent the scene by sketching, inspired by human behavior. Our method, coined Learning by Sketching (LBS), learns to convert an image into a set of colored strokes that explicitly incorporate the geometric information of the scene in a single inference step without requiring a sketch dataset. A sketch is then generated from the strokes where CLIP-based perceptual loss maintains a semantic similarity between the sketch and the image. We show theoretically that sketching is equivariant with respect to arbitrary affine transformations and thus provably preserves geometric information. Experimental results show that LBS substantially improves the performance of object attribute classification on the unlabeled CLEVR dataset, domain transfer between CLEVR and STL-10 datasets, and for diverse downstream tasks, confirming that LBS provides rich geometric information. + + + + SVFormer: Semi-Supervised Video Transformer for Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Xing_SVFormer_Semi-Supervised_Video_Transformer_for_Action_Recognition_CVPR_2023_paper.pdf + Semi-supervised action recognition is a challenging but critical task due to the high cost of video annotations. Existing approaches mainly use convolutional neural networks, yet current revolutionary vision transformer models have been less explored. In this paper, we investigate the use of transformer models under the SSL setting for action recognition. To this end, we introduce SVFormer, which adopts a steady pseudo-labeling framework (i.e., EMA-Teacher) to cope with unlabeled video samples. While a wide range of data augmentations have been shown effective for semi-supervised image classification, they generally produce limited results for video recognition. We therefore introduce a novel augmentation strategy, Tube TokenMix, tailored for video data, where video clips are mixed via a mask with consistent masked tokens over the temporal axis. In addition, we propose a temporal warping augmentation to cover the complex temporal variation in videos, which stretches selected frames to various temporal durations in the clip. Extensive experiments on three datasets, Kinetics-400, UCF-101, and HMDB-51, verify the advantage of SVFormer. In particular, SVFormer outperforms the state-of-the-art by 31.5% with fewer training epochs under the 1% labeling rate of Kinetics-400. Our method can hopefully serve as a strong benchmark and encourage future research on semi-supervised action recognition with Transformer networks.
+ + + + X-Avatar: Expressive Human Avatars + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_X-Avatar_Expressive_Human_Avatars_CVPR_2023_paper.pdf + We present X-Avatar, a novel avatar model that captures the full expressiveness of digital humans to bring about life-like experiences in telepresence, AR/VR and beyond. Our method models bodies, hands, facial expressions and appearance in a holistic fashion and can be learned from either full 3D scans or RGB-D data. To achieve this, we propose a part-aware learned forward skinning module that can be driven by the parameter space of SMPL-X, allowing for expressive animation of X-Avatars. To efficiently learn the neural shape and deformation fields, we propose novel part-aware sampling and initialization strategies. This leads to higher fidelity results, especially for smaller body parts while maintaining efficient training despite an increased number of articulated bones. To capture the appearance of the avatar with high-frequency details, we extend the geometry and deformation fields with a texture network that is conditioned on pose, facial expression, geometry and the normals of the deformed surface. We show experimentally that our method outperforms strong baselines both quantitatively and qualitatively on the animation task. To facilitate future research on expressive avatars we contribute a new dataset, called X-Humans, containing 233 sequences of high-quality textured scans from 20 participants, totalling 35,500 data frames. + + + + AccelIR: Task-Aware Image Compression for Accelerating Neural Restoration + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_AccelIR_Task-Aware_Image_Compression_for_Accelerating_Neural_Restoration_CVPR_2023_paper.pdf + Recently, deep neural networks have been successfully applied for image restoration (IR) (e.g., super-resolution, de-noising, de-blurring). Despite their promising performance, running IR networks requires heavy computation. A large body of work has been devoted to addressing this issue by designing novel neural networks or pruning their parameters. However, the common limitation is that while images are saved in a compressed format before being enhanced by IR, prior work does not consider the impact of compression on the IR quality. In this paper, we present AccelIR, a framework that optimizes image compression considering the end-to-end pipeline of IR tasks. AccelIR encodes an image through IR-aware compression that optimizes compression levels across image blocks within an image according to the impact on the IR quality. Then, it runs a lightweight IR network on the compressed image, effectively reducing IR computation, while maintaining the same IR quality and image size. Our extensive evaluation using seven IR networks shows that AccelIR can reduce the computing overhead of super-resolution, de-noising, and de-blurring by 49%, 29%, and 32% on average, respectively. + + + + BEV-Guided Multi-Modality Fusion for Driving Perception + http://openaccess.thecvf.com//content/CVPR2023/papers/Man_BEV-Guided_Multi-Modality_Fusion_for_Driving_Perception_CVPR_2023_paper.pdf + Integrating multiple sensors and addressing diverse tasks in an end-to-end algorithm are challenging yet critical topics for autonomous driving. To this end, we introduce BEVGuide, a novel Bird's Eye-View (BEV) representation learning framework, representing the first attempt to unify a wide range of sensors under direct BEV guidance in an end-to-end fashion.
Our architecture accepts input from a diverse sensor pool, including but not limited to Camera, Lidar and Radar sensors, and extracts BEV feature embeddings using a versatile and general transformer backbone. We design a BEV-guided multi-sensor attention block to take queries from BEV embeddings and learn the BEV representation from sensor-specific features. BEVGuide is efficient due to its lightweight backbone design and highly flexible as it supports almost any input sensor configurations. Extensive experiments demonstrate that our framework achieves exceptional performance in BEV perception tasks with a diverse sensor set. Project page is at https://yunzeman.github.io/BEVGuide. + + + + Proximal Splitting Adversarial Attack for Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Rony_Proximal_Splitting_Adversarial_Attack_for_Semantic_Segmentation_CVPR_2023_paper.pdf + Classification has been the focal point of research on adversarial attacks, but only a few works investigate methods suited to denser prediction tasks, such as semantic segmentation. The methods proposed in these works do not accurately solve the adversarial segmentation problem and, therefore, overestimate the size of the perturbations required to fool models. Here, we propose a white-box attack for these models based on a proximal splitting to produce adversarial perturbations with much smaller l_infinity norms. Our attack can handle large numbers of constraints within a nonconvex minimization framework via an Augmented Lagrangian approach, coupled with adaptive constraint scaling and masking strategies. We demonstrate that our attack significantly outperforms previously proposed ones, as well as classification attacks that we adapted for segmentation, providing a first comprehensive benchmark for this dense task. + + + + Improved Test-Time Adaptation for Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Improved_Test-Time_Adaptation_for_Domain_Generalization_CVPR_2023_paper.pdf + The main challenge in domain generalization (DG) is to handle the distribution shift problem that lies between the training and test data. Recent studies suggest that test-time training (TTT), which adapts the learned model with test data, might be a promising solution to the problem. Generally, a TTT strategy hinges its performance on two main factors: selecting an appropriate auxiliary TTT task for updating and identifying reliable parameters to update during the test phase. Both previous arts and our experiments indicate that TTT may not improve but be detrimental to the learned model if those two factors are not properly considered. This work addresses those two factors by proposing an Improved Test-Time Adaptation (ITTA) method. First, instead of heuristically defining an auxiliary objective, we propose a learnable consistency loss for the TTT task, which contains learnable parameters that can be adjusted toward better alignment between our TTT task and the main prediction task. Second, we introduce additional adaptive parameters for the trained model, and we suggest only updating the adaptive parameters during the test phase. Through extensive experiments, we show that the proposed two strategies are beneficial for the learned model (see Figure 1), and ITTA could achieve superior performance to the current state-of-the-arts on several DG benchmarks. 
+ + + + Correspondence Transformers With Asymmetric Feature Learning and Matching Flow Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Correspondence_Transformers_With_Asymmetric_Feature_Learning_and_Matching_Flow_Super-Resolution_CVPR_2023_paper.pdf + This paper solves the problem of learning dense visual correspondences between different object instances of the same category with only sparse annotations. We decompose this pixel-level semantic matching problem into two easier ones: (i) First, local feature descriptors of source and target images need to be mapped into shared semantic spaces to get coarse matching flows. (ii) Second, matching flows in low resolution should be refined to generate accurate point-to-point matching results. We propose asymmetric feature learning and matching flow super-resolution based on vision transformers to solve the above problems. The asymmetric feature learning module exploits a biased cross-attention mechanism to encode token features of source images with their target counterparts. Then matching flow in low resolutions is enhanced by a super-resolution network to get accurate correspondences. Our pipeline is built upon vision transformers and can be trained in an end-to-end manner. Extensive experimental results on several popular benchmarks, such as PF-PASCAL, PF-WILLOW, and SPair-71K, demonstrate that the proposed method can catch subtle semantic differences in pixels efficiently. Code is available on https://github.com/YXSUNMADMAX/ACTR. + + + + Adjustment and Alignment for Unbiased Open Set Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Adjustment_and_Alignment_for_Unbiased_Open_Set_Domain_Adaptation_CVPR_2023_paper.pdf + Open Set Domain Adaptation (OSDA) transfers the model from a label-rich domain to a label-free one containing novel-class samples. Existing OSDA works overlook abundant novel-class semantics hidden in the source domain, leading to a biased model learning and transfer. Although the causality has been studied to remove the semantic-level bias, the non-available novel-class samples result in the failure of existing causal solutions in OSDA. To break through this barrier, we propose a novel causality-driven solution with the unexplored front-door adjustment theory, and then implement it with a theoretically grounded framework, coined AdjustmeNt aNd Alignment (ANNA), to achieve an unbiased OSDA. In a nutshell, ANNA consists of Front-Door Adjustment (FDA) to correct the biased learning in the source domain and Decoupled Causal Alignment (DCA) to transfer the model unbiasedly. On the one hand, FDA delves into fine-grained visual blocks to discover novel-class regions hidden in the base-class image. Then, it corrects the biased model optimization by implementing causal debiasing. On the other hand, DCA disentangles the base-class and novel-class regions with orthogonal masks, and then adapts the decoupled distribution for an unbiased model transfer. Extensive experiments show that ANNA achieves state-of-the-art results. The code is available at https://github.com/CityU-AIM-Group/Anna. + + + + ESLAM: Efficient Dense SLAM System Based on Hybrid Representation of Signed Distance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Johari_ESLAM_Efficient_Dense_SLAM_System_Based_on_Hybrid_Representation_of_CVPR_2023_paper.pdf + We present ESLAM, an efficient implicit neural representation method for Simultaneous Localization and Mapping (SLAM). 
ESLAM reads RGB-D frames with unknown camera poses in a sequential manner and incrementally reconstructs the scene representation while estimating the current camera position in the scene. We incorporate the latest advances in Neural Radiance Fields (NeRF) into a SLAM system, resulting in an efficient and accurate dense visual SLAM method. Our scene representation consists of multi-scale axis-aligned perpendicular feature planes and shallow decoders that, for each point in the continuous space, decode the interpolated features into Truncated Signed Distance Field (TSDF) and RGB values. Our extensive experiments on three standard datasets, Replica, ScanNet, and TUM RGB-D show that ESLAM improves the accuracy of 3D reconstruction and camera localization of state-of-the-art dense visual SLAM methods by more than 50%, while it runs up to 10 times faster and does not require any pre-training. Project page: https://www.idiap.ch/paper/eslam + + + + Unsupervised Space-Time Network for Temporally-Consistent Segmentation of Multiple Motions + http://openaccess.thecvf.com//content/CVPR2023/papers/Meunier_Unsupervised_Space-Time_Network_for_Temporally-Consistent_Segmentation_of_Multiple_Motions_CVPR_2023_paper.pdf + Motion segmentation is one of the main tasks in computer vision and is relevant for many applications. The optical flow (OF) is the input generally used to segment every frame of a video sequence into regions of coherent motion. Temporal consistency is a key feature of motion segmentation, but it is often neglected. In this paper, we propose an original unsupervised spatio-temporal framework for motion segmentation from optical flow that fully investigates the temporal dimension of the problem. More specifically, we have defined a 3D network for multiple motion segmentation that takes as input a sub-volume of successive optical flows and delivers accordingly a sub-volume of coherent segmentation maps. Our network is trained in a fully unsupervised way, and the loss function combines a flow reconstruction term involving spatio-temporal parametric motion models, and a regularization term enforcing temporal consistency on the masks. We have specified an easy temporal linkage of the predicted segments. Besides, we have proposed a flexible and efficient way of coding U-nets. We report experiments on several VOS benchmarks with convincing quantitative results, while not using appearance and not training with any ground-truth data. We also highlight through visual results the distinctive contribution of the short- and long-term temporal consistency brought by our OF segmentation method. + + + + iDisc: Internal Discretization for Monocular Depth Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Piccinelli_iDisc_Internal_Discretization_for_Monocular_Depth_Estimation_CVPR_2023_paper.pdf + Monocular depth estimation is fundamental for 3D scene understanding and downstream applications. However, even under the supervised setup, it is still challenging and ill-posed due to the lack of geometric constraints. We observe that although a scene can consist of millions of pixels, there are much fewer high-level patterns. We propose iDisc to learn those patterns with internal discretized representations. The method implicitly partitions the scene into a set of high-level concepts. In particular, our new module, Internal Discretization (ID), implements a continuous-discrete-continuous bottleneck to learn those concepts without supervision. 
In contrast to state-of-the-art methods, the proposed model does not enforce any explicit constraints or priors on the depth output. The whole network with the ID module can be trained in an end-to-end fashion thanks to the bottleneck module based on attention. Our method sets the new state of the art with significant improvements on NYU-Depth v2 and KITTI, outperforming all published methods on the official KITTI benchmark. iDisc can also achieve state-of-the-art results on surface normal estimation. Further, we explore the model generalization capability via zero-shot testing. From there, we observe the compelling need to promote diversification in the outdoor scenario and we introduce splits of two autonomous driving datasets, DDAD and Argoverse. Code is available at http://vis.xyz/pub/idisc/. + + + + Balancing Logit Variation for Long-Tailed Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Balancing_Logit_Variation_for_Long-Tailed_Semantic_Segmentation_CVPR_2023_paper.pdf + Semantic segmentation usually suffers from a long tail data distribution. Due to the imbalanced number of samples across categories, the features of those tail classes may get squeezed into a narrow area in the feature space. Towards a balanced feature distribution, we introduce category-wise variation into the network predictions in the training phase such that an instance is no longer projected to a feature point, but a small region instead. Such a perturbation is highly dependent on the category scale, which appears as assigning smaller variation to head classes and larger variation to tail classes. In this way, we manage to close the gap between the feature areas of different categories, resulting in a more balanced representation. It is noteworthy that the introduced variation is discarded at the inference stage to facilitate a confident prediction. Although with an embarrassingly simple implementation, our method manifests itself in strong generalizability to various datasets and task settings. Extensive experiments suggest that our plug-in design lends itself well to a range of state-of-the-art approaches and boosts the performance on top of them. + + + + Single Image Depth Prediction Made Better: A Multivariate Gaussian Take + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Single_Image_Depth_Prediction_Made_Better_A_Multivariate_Gaussian_Take_CVPR_2023_paper.pdf + Neural-network-based single image depth prediction (SIDP) is a challenging task where the goal is to predict the scene's per-pixel depth at test time. Since the problem, by definition, is ill-posed, the fundamental goal is to come up with an approach that can reliably model the scene depth from a set of training examples. In the pursuit of perfect depth estimation, most existing state-of-the-art learning techniques predict a single scalar depth value per-pixel. Yet, it is well-known that the trained model has accuracy limits and can predict imprecise depth. Therefore, an SIDP approach must be mindful of the expected depth variations in the model's prediction at test time. Accordingly, we introduce an approach that performs continuous modeling of per-pixel depth, where we can predict and reason about the per-pixel depth and its distribution. To this end, we model per-pixel scene depth using a multivariate Gaussian distribution. 
Moreover, contrary to the existing uncertainty modeling methods---in the same spirit, where per-pixel depth is assumed to be independent, we introduce per-pixel covariance modeling that encodes its depth dependency w.r.t. all the scene points. Unfortunately, per-pixel depth covariance modeling leads to a computationally expensive continuous loss function, which we solve efficiently using the learned low-rank approximation of the overall covariance matrix. Notably, when tested on benchmark datasets such as KITTI, NYU, and SUN-RGB-D, the SIDP model obtained by optimizing our loss function shows state-of-the-art results. Our method's accuracy (named MG) is among the top on the KITTI depth-prediction benchmark leaderboard. + + + + Query-Centric Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Query-Centric_Trajectory_Prediction_CVPR_2023_paper.pdf + Predicting the future trajectories of surrounding agents is essential for autonomous vehicles to operate safely. This paper presents QCNet, a modeling framework toward pushing the boundaries of trajectory prediction. First, we identify that the agent-centric modeling scheme used by existing approaches requires re-normalizing and re-encoding the input whenever the observation window slides forward, leading to redundant computations during online prediction. To overcome this limitation and achieve faster inference, we introduce a query-centric paradigm for scene encoding, which enables the reuse of past computations by learning representations independent of the global spacetime coordinate system. Sharing the invariant scene features among all target agents further allows the parallelism of multi-agent trajectory decoding. Second, even given rich encodings of the scene, existing decoding strategies struggle to capture the multimodality inherent in agents' future behavior, especially when the prediction horizon is long. To tackle this challenge, we first employ anchor-free queries to generate trajectory proposals in a recurrent fashion, which allows the model to utilize different scene contexts when decoding waypoints at different horizons. A refinement module then takes the trajectory proposals as anchors and leverages anchor-based queries to refine the trajectories further. By supplying adaptive and high-quality anchors to the refinement module, our query-based decoder can better deal with the multimodality in the output of trajectory prediction. Our approach ranks 1st on Argoverse 1 and Argoverse 2 motion forecasting benchmarks, outperforming all methods on all main metrics by a large margin. Meanwhile, our model can achieve streaming scene encoding and parallel multi-agent decoding thanks to the query-centric design ethos. + + + + The Enemy of My Enemy Is My Friend: Exploring Inverse Adversaries for Improving Adversarial Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Dong_The_Enemy_of_My_Enemy_Is_My_Friend_Exploring_Inverse_CVPR_2023_paper.pdf + Although current deep learning techniques have yielded superior performance on various computer vision tasks, yet they are still vulnerable to adversarial examples. Adversarial training and its variants have been shown to be the most effective approaches to defend against adversarial examples. A particular class of these methods regularize the difference between output probabilities for an adversarial and its corresponding natural example. However, it may have a negative impact if a natural example is misclassified. 
To circumvent this issue, we propose a novel adversarial training scheme that encourages the model to produce similar output probabilities for an adversarial example and its "inverse adversarial" counterpart. Particularly, the counterpart is generated by maximizing the likelihood in the neighborhood of the natural example. Extensive experiments on various vision datasets and architectures demonstrate that our training method achieves state-of-the-art robustness as well as natural accuracy among robust models. Furthermore, using a universal version of inverse adversarial examples, we improve the performance of single-step adversarial training techniques at a low computational cost. + + + + Exploring Motion Ambiguity and Alignment for High-Quality Video Frame Interpolation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Exploring_Motion_Ambiguity_and_Alignment_for_High-Quality_Video_Frame_Interpolation_CVPR_2023_paper.pdf + For video frame interpolation(VFI), existing deep-learning-based approaches strongly rely on the ground-truth (GT) intermediate frames, which sometimes ignore the non-unique nature of motion judging from the given adjacent frames. As a result, these methods tend to produce averaged solutions that are not clear enough. To alleviate this issue, we propose to relax the requirement of reconstructing an intermediate frame as close to the GT as possible. Towards this end, we develop a texture consistency loss (TCL) upon the assumption that the interpolated content should maintain similar structures with their counterparts in the given frames. Predictions satisfying this constraint are encouraged, though they may differ from the predefined GT. Without the bells and whistles, our plug-and-play TCL is capable of improving the performance of existing VFI frameworks consistently. On the other hand, previous methods usually adopt the cost volume or correlation map to achieve more accurate image or feature warping. However, the O(N^2) (N refers to the pixel count) computational complexity makes it infeasible for high-resolution cases. In this work, we design a simple, efficient O(N) yet powerful guided cross-scale pyramid alignment(GCSPA) module, where multi-scale information is highly exploited. Extensive experiments justify the efficiency and effectiveness of the proposed strategy. + + + + Knowledge Distillation for 6D Pose Estimation by Aligning Distributions of Local Predictions + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Knowledge_Distillation_for_6D_Pose_Estimation_by_Aligning_Distributions_of_CVPR_2023_paper.pdf + Knowledge distillation facilitates the training of a compact student network by using a deep teacher one. While this has achieved great success in many tasks, it remains completely unstudied for image-based 6D object pose estimation. In this work, we introduce the first knowledge distillation method driven by the 6D pose estimation task. To this end, we observe that most modern 6D pose estimation frameworks output local predictions, such as sparse 2D keypoints or dense representations, and that the compact student network typically struggles to predict such local quantities precisely. Therefore, instead of imposing prediction-to-prediction supervision from the teacher to the student, we propose to distill the teacher's distribution of local predictions into the student network, facilitating its training. 
Our experiments on several benchmarks show that our distillation method yields state-of-the-art results with different compact student models and for both keypoint-based and dense prediction-based architectures. + + + + Adaptive Annealing for Robust Geometric Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Sidhartha_Adaptive_Annealing_for_Robust_Geometric_Estimation_CVPR_2023_paper.pdf + Geometric estimation problems in vision are often solved via minimization of statistical loss functions which account for the presence of outliers in the observations. The corresponding energy landscape often has many local minima. Many approaches attempt to avoid local minima by annealing the scale parameter of loss functions using methods such as graduated non-convexity (GNC). However, little attention has been paid to the annealing schedule, which is often carried out in a fixed manner, resulting in a poor speed-accuracy trade-off and unreliable convergence to the global minimum. In this paper, we propose a principled approach for adaptively annealing the scale for GNC by tracking the positive-definiteness (i.e. local convexity) of the Hessian of the cost function. We illustrate our approach using the classic problem of registering 3D correspondences in the presence of noise and outliers. We also develop approximations to the Hessian that significantly speeds up our method. The effectiveness of our approach is validated by comparing its performance with state-of-the-art 3D registration approaches on a number of synthetic and real datasets. Our approach is accurate and efficient and converges to the global solution more reliably than the state-of-the-art methods. + + + + PointListNet: Deep Learning on 3D Point Lists + http://openaccess.thecvf.com//content/CVPR2023/papers/Fan_PointListNet_Deep_Learning_on_3D_Point_Lists_CVPR_2023_paper.pdf + Deep neural networks on regular 1D lists (e.g., natural languages) and irregular 3D sets (e.g., point clouds) have made tremendous achievements. The key to natural language processing is to model words and their regular order dependency in texts. For point cloud understanding, the challenge is to understand the geometry via irregular point coordinates, in which point-feeding orders do not matter. However, there are a few kinds of data that exhibit both regular 1D list and irregular 3D set structures, such as proteins and non-coding RNAs. In this paper, we refer to them as 3D point lists and propose a Transformer-style PointListNet to model them. First, PointListNet employs non-parametric distance-based attention because we find sometimes it is the distance, instead of the feature or type, that mainly determines how much two points, e.g., amino acids, are correlated in the micro world. Second, different from the vanilla Transformer that directly performs a simple linear transformation on inputs to generate values and does not explicitly model relative relations, our PointListNet integrates the 1D order and 3D Euclidean displacements into values. We conduct experiments on protein fold classification and enzyme reaction classification. Experimental results show the effectiveness of the proposed PointListNet. + + + + Upcycling Models Under Domain and Category Shift + http://openaccess.thecvf.com//content/CVPR2023/papers/Qu_Upcycling_Models_Under_Domain_and_Category_Shift_CVPR_2023_paper.pdf + Deep neural networks (DNNs) often perform poorly in the presence of domain shift and category shift. 
How to upcycle DNNs and adapt them to the target task remains an important open problem. Unsupervised Domain Adaptation (UDA), especially recently proposed Source-free Domain Adaptation (SFDA), has become a promising technology to address this issue. Nevertheless, most existing SFDA methods require that the source domain and target domain share the same label space, consequently being only applicable to the vanilla closed-set setting. In this paper, we take one step further and explore the Source-free Universal Domain Adaptation (SF-UniDA). The goal is to identify "known" data samples under both domain and category shift, and reject those "unknown" data samples (not present in source classes), with only the knowledge from standard pre-trained source model. To this end, we introduce an innovative global and local clustering learning technique (GLC). Specifically, we design a novel, adaptive one-vs-all global clustering algorithm to achieve the distinction across different target classes and introduce a local k-NN clustering strategy to alleviate negative transfer. We examine the superiority of our GLC on multiple benchmarks with different category shift scenarios, including partial-set, open-set, and open-partial-set DA. More remarkably, in the most challenging open-partial-set DA scenario, GLC outperforms UMAD by 14.8% on the VisDA benchmark. + + + + Single Domain Generalization for LiDAR Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Single_Domain_Generalization_for_LiDAR_Semantic_Segmentation_CVPR_2023_paper.pdf + With the success of the 3D deep learning models, various perception technologies for autonomous driving have been developed in the LiDAR domain. While these models perform well in the trained source domain, they struggle in unseen domains with a domain gap. In this paper, we propose a single domain generalization method for LiDAR semantic segmentation (DGLSS) that aims to ensure good performance not only in the source domain but also in the unseen domain by learning only on the source domain. We mainly focus on generalizing from a dense source domain and target the domain shift from different LiDAR sensor configurations and scene distributions. To this end, we augment the domain to simulate the unseen domains by randomly subsampling the LiDAR scans. With the augmented domain, we introduce two constraints for generalizable representation learning: sparsity invariant feature consistency (SIFC) and semantic correlation consistency (SCC). The SIFC aligns sparse internal features of the source domain with the augmented domain based on the feature affinity. For SCC, we constrain the correlation between class prototypes to be similar for every LiDAR scan. We also establish a standardized training and evaluation setting for DGLSS. With the proposed evaluation setting, our method showed improved performance in the unseen domains compared to other baselines. Even without access to the target domain, our method performed better than the domain adaptation method. The code is available at https://github.com/gzgzys9887/DGLSS. + + + + SLACK: Stable Learning of Augmentations With Cold-Start and KL Regularization + http://openaccess.thecvf.com//content/CVPR2023/papers/Marrie_SLACK_Stable_Learning_of_Augmentations_With_Cold-Start_and_KL_Regularization_CVPR_2023_paper.pdf + Data augmentation is known to improve the generalization capabilities of neural networks, provided that the set of transformations is chosen with care, a selection often performed manually. 
Automatic data augmentation aims at automating this process. However, most recent approaches still rely on some prior information; they start from a small pool of manually-selected default transformations that are either used to pretrain the network or forced to be part of the policy learned by the automatic data augmentation algorithm. In this paper, we propose to directly learn the augmentation policy without leveraging such prior knowledge. The resulting bilevel optimization problem becomes more challenging due to the larger search space and the inherent instability of bilevel optimization algorithms. To mitigate these issues (i) we follow a successive cold-start strategy with a Kullback-Leibler regularization, and (ii) we parameterize magnitudes as continuous distributions. Our approach leads to competitive results on standard benchmarks despite a more challenging setting, and generalizes beyond natural images. + + + + Gradient Norm Aware Minimization Seeks First-Order Flatness and Improves Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Gradient_Norm_Aware_Minimization_Seeks_First-Order_Flatness_and_Improves_Generalization_CVPR_2023_paper.pdf + Recently, flat minima have been proven to be effective for improving generalization, and sharpness-aware minimization (SAM) achieves state-of-the-art performance. Yet the current definition of flatness discussed in SAM and its follow-ups is limited to the zeroth-order flatness (i.e., the worst-case loss within a perturbation radius). We show that the zeroth-order flatness can be insufficient to discriminate minima with low generalization error from those with high generalization error both when there is a single minimum or multiple minima within the given perturbation radius. Thus we present first-order flatness, a stronger measure of flatness focusing on the maximal gradient norm within a perturbation radius which bounds both the maximal eigenvalue of Hessian at local minima and the regularization function of SAM. We also present a novel training procedure named Gradient norm Aware Minimization (GAM) to seek minima with uniformly small curvature across all directions. Experimental results show that GAM improves the generalization of models trained with current optimizers such as SGD and AdamW on various datasets and networks. Furthermore, we show that GAM can help SAM find flatter minima and achieve better generalization. + + + + Latency Matters: Real-Time Action Forecasting Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Girase_Latency_Matters_Real-Time_Action_Forecasting_Transformer_CVPR_2023_paper.pdf + We present RAFTformer, a real-time action forecasting transformer for latency-aware real-world action forecasting applications. RAFTformer is a two-stage fully transformer-based architecture which consists of a video transformer backbone that operates on high resolution, short range clips and a head transformer encoder that temporally aggregates information from multiple short range clips to span a long-term horizon. Additionally, we propose a self-supervised shuffled causal masking scheme to improve model generalization during training. Finally, we also propose a real-time evaluation setting that directly couples model inference latency to overall forecasting performance and brings forth a hitherto overlooked trade-off between latency and action forecasting performance.
Our parsimonious network design enables RAFTformer's inference latency to be 9x smaller than that of prior works at the same forecasting accuracy. Owing to its two-staged design, RAFTformer uses 94% less training compute and 90% fewer training parameters to outperform prior state-of-the-art baselines by 4.9 points on EGTEA Gaze+ and by 1.4 points on the EPIC-Kitchens-100 dataset, as measured by Top-5 recall (T5R) in the offline setting. In the real-time setting, RAFTformer outperforms prior works by an even greater margin of up to 4.4 T5R points on the EPIC-Kitchens-100 dataset. Project Webpage: https://karttikeya.github.io/publication/RAFTformer/ + + + + HierVL: Learning Hierarchical Video-Language Embeddings + http://openaccess.thecvf.com//content/CVPR2023/papers/Ashutosh_HierVL_Learning_Hierarchical_Video-Language_Embeddings_CVPR_2023_paper.pdf + Video-language embeddings are a promising avenue for injecting semantics into visual representations, but existing methods capture only short-term associations between seconds-long video clips and their accompanying text. We propose HierVL, a novel hierarchical video-language embedding that simultaneously accounts for both long-term and short-term associations. As training data, we take videos accompanied by timestamped text descriptions of human actions, together with a high-level text summary of the activity throughout the long video (as are available in Ego4D). We introduce a hierarchical contrastive training objective that encourages text-visual alignment at both the clip level and video level. While the clip-level constraints use the step-by-step descriptions to capture what is happening in that instant, the video-level constraints use the summary text to capture why it is happening, i.e., the broader context for the activity and the intent of the actor. Our hierarchical scheme yields a clip representation that outperforms its single-level counterpart, as well as a long-term video representation that achieves SotA results on tasks requiring long-term video modeling. HierVL successfully transfers to multiple challenging downstream tasks (in EPIC-KITCHENS-100, Charades-Ego, HowTo100M) in both zero-shot and fine-tuned settings. + + + + GraVoS: Voxel Selection for 3D Point-Cloud Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Shrout_GraVoS_Voxel_Selection_for_3D_Point-Cloud_Detection_CVPR_2023_paper.pdf + 3D object detection within large 3D scenes is challenging not only due to the sparse and irregular 3D point clouds, but also due to both the extreme foreground-background scene imbalance and class imbalance. A common approach is to add ground-truth objects from other scenes. Differently, we propose to modify the scenes by removing elements (voxels), rather than adding ones. Our approach selects the "meaningful" voxels, in a manner that addresses both types of dataset imbalance. The approach is general and can be applied to any voxel-based detector, yet the meaningfulness of a voxel is network-dependent. Our voxel selection is shown to improve the performance of several prominent 3D detection methods. + + + + RobustNeRF: Ignoring Distractors With Robust Losses + http://openaccess.thecvf.com//content/CVPR2023/papers/Sabour_RobustNeRF_Ignoring_Distractors_With_Robust_Losses_CVPR_2023_paper.pdf + Neural radiance fields (NeRF) excel at synthesizing new views given multi-view, calibrated images of a static scene.
When scenes include distractors, which are not persistent during image capture (moving objects, lighting variations, shadows), artifacts appear as view-dependent effects or 'floaters'. To cope with distractors, we advocate a form of robust estimation for NeRF training, modeling distractors in training data as outliers of an optimization problem. Our method successfully removes outliers from a scene and improves upon our baselines, on synthetic and real-world scenes. Our technique is simple to incorporate in modern NeRF frameworks, with few hyper-parameters. It does not assume a priori knowledge of the types of distractors, and is instead focused on the optimization problem rather than pre-processing or modeling transient objects. More results on our page https://robustnerf.github.io/public. + + + + Spherical Transformer for LiDAR-Based 3D Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Lai_Spherical_Transformer_for_LiDAR-Based_3D_Recognition_CVPR_2023_paper.pdf + LiDAR-based 3D point cloud recognition has benefited various applications. Without specially considering the LiDAR point distribution, most current methods suffer from information disconnection and limited receptive field, especially for the sparse distant points. In this work, we study the varying-sparsity distribution of LiDAR points and present SphereFormer to directly aggregate information from dense close points to the sparse distant ones. We design radial window self-attention that partitions the space into multiple non-overlapping narrow and long windows. It overcomes the disconnection issue and enlarges the receptive field smoothly and dramatically, which significantly boosts the performance of sparse distant points. Moreover, to fit the narrow and long windows, we propose exponential splitting to yield fine-grained position encoding and dynamic feature selection to increase model representation ability. Notably, our method ranks 1st on both nuScenes and SemanticKITTI semantic segmentation benchmarks with 81.9% and 74.8% mIoU, respectively. Also, we achieve the 3rd place on nuScenes object detection benchmark with 72.8% NDS and 68.5% mAP. Code is available at https://github.com/dvlab-research/SphereFormer.git. + + + + Watch or Listen: Robust Audio-Visual Speech Recognition With Visual Corruption Modeling and Reliability Scoring + http://openaccess.thecvf.com//content/CVPR2023/papers/Hong_Watch_or_Listen_Robust_Audio-Visual_Speech_Recognition_With_Visual_Corruption_CVPR_2023_paper.pdf + This paper deals with Audio-Visual Speech Recognition (AVSR) under multimodal input corruption situation where audio inputs and visual inputs are both corrupted, which is not well addressed in previous research directions. Previous studies have focused on how to complement the corrupted audio inputs with the clean visual inputs with the assumption of the availability of clean visual inputs. However, in real life, the clean visual inputs are not always accessible and can even be corrupted by occluded lip region or with noises. Thus, we firstly analyze that the previous AVSR models are not indeed robust to the corruption of multimodal input streams, the audio and the visual inputs, compared to uni-modal models. Then, we design multimodal input corruption modeling to develop robust AVSR models. Lastly, we propose a novel AVSR framework, namely Audio-Visual Reliability Scoring module (AV-RelScore), that is robust to the corrupted multimodal inputs. 
The AV-RelScore can determine which input modal stream is reliable or not for the prediction and also can exploit the more reliable streams in prediction. The effectiveness of the proposed method is evaluated with comprehensive experiments on popular benchmark databases, LRS2 and LRS3. We also show that the reliability scores obtained by AV-RelScore well reflect the degree of corruption and make the proposed model focus on the reliable multimodal representations. + + + + VisFusion: Visibility-Aware Online 3D Scene Reconstruction From Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_VisFusion_Visibility-Aware_Online_3D_Scene_Reconstruction_From_Videos_CVPR_2023_paper.pdf + We propose VisFusion, a visibility-aware online 3D scene reconstruction approach from posed monocular videos. In particular, we aim to reconstruct the scene from volumetric features. Unlike previous reconstruction methods which aggregate features for each voxel from input views without considering its visibility, we aim to improve the feature fusion by explicitly inferring its visibility from a similarity matrix, computed from its projected features in each image pair. Following previous works, our model is a coarse-to-fine pipeline including a volume sparsification process. Different from their works which sparsify voxels globally with a fixed occupancy threshold, we perform the sparsification on a local feature volume along each visual ray to preserve at least one voxel per ray for more fine details. The sparse local volume is then fused with a global one for online reconstruction. We further propose to predict TSDF in a coarse-to-fine manner by learning its residuals across scales leading to better TSDF predictions. Experimental results on benchmarks show that our method can achieve superior performance with more scene details. Code is available at: https://github.com/huiyu-gao/VisFusion + + + + Towards Transferable Targeted Adversarial Examples + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Towards_Transferable_Targeted_Adversarial_Examples_CVPR_2023_paper.pdf + Transferability of adversarial examples is critical for black-box deep learning model attacks. While most existing studies focus on enhancing the transferability of untargeted adversarial attacks, few of them studied how to generate transferable targeted adversarial examples that can mislead models into predicting a specific class. Moreover, existing transferable targeted adversarial attacks usually fail to sufficiently characterize the target class distribution, thus suffering from limited transferability. In this paper, we propose the Transferable Targeted Adversarial Attack (TTAA), which can capture the distribution information of the target class from both label-wise and feature-wise perspectives, to generate highly transferable targeted adversarial examples. To this end, we design a generative adversarial training framework consisting of a generator to produce targeted adversarial examples, and feature-label dual discriminators to distinguish the generated adversarial examples from the target class images. Specifically, we design the label discriminator to guide the adversarial examples to learn label-related distribution information about the target class. Meanwhile, we design a feature discriminator, which extracts the feature-wise information with strong cross-model consistency, to enable the adversarial examples to learn the transferable distribution information. 
Furthermore, we introduce random perturbation dropping to further enhance the transferability by augmenting the diversity of adversarial examples used in the training process. Experiments demonstrate that our method achieves excellent performance on the transferability of targeted adversarial examples. The targeted fooling rate reaches 95.13% when transferred from VGG-19 to DenseNet-121, which significantly outperforms the state-of-the-art methods. + + + + C-SFDA: A Curriculum Learning Aided Self-Training Framework for Efficient Source Free Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Karim_C-SFDA_A_Curriculum_Learning_Aided_Self-Training_Framework_for_Efficient_Source_CVPR_2023_paper.pdf + Unsupervised domain adaptation (UDA) approaches focus on adapting models trained on a labeled source domain to an unlabeled target domain. In contrast to UDA, source-free domain adaptation (SFDA) is a more practical setup as access to source data is no longer required during adaptation. Recent state-of-the-art (SOTA) methods on SFDA mostly focus on pseudo-label refinement based self-training, which generally suffers from two issues: i) inevitable occurrence of noisy pseudo-labels that could lead to early training time memorization, ii) the refinement process requires maintaining a memory bank which creates a significant burden in resource-constrained scenarios. To address these concerns, we propose C-SFDA, a curriculum learning aided self-training framework for SFDA that adapts efficiently and reliably to changes across domains based on selective pseudo-labeling. Specifically, we employ a curriculum learning scheme to promote learning from a restricted amount of pseudo labels selected based on their reliabilities. This simple yet effective step successfully prevents label noise propagation during different stages of adaptation and eliminates the need for costly memory-bank based label refinement. Our extensive experimental evaluations on both image recognition and semantic segmentation tasks confirm the effectiveness of our method. C-SFDA is also applicable to online test-time domain adaptation and outperforms previous SOTA methods in this task. + + + + Modeling the Distributional Uncertainty for Salient Object Detection Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_Modeling_the_Distributional_Uncertainty_for_Salient_Object_Detection_Models_CVPR_2023_paper.pdf + Most of the existing salient object detection (SOD) models focus on improving the overall model performance, without explicitly explaining the discrepancy between the training and testing distributions. In this paper, we investigate a particular type of epistemic uncertainty, namely distributional uncertainty, for salient object detection. Specifically, for the first time, we explore the existing class-aware distribution gap exploration techniques, i.e., long-tail learning, single-model uncertainty modeling and test-time strategies, and adapt them to model the distributional uncertainty for our class-agnostic task. We define test samples that are dissimilar to the training dataset as "out-of-distribution" (OOD) samples. Different from the conventional OOD definition, where OOD samples are those not belonging to the closed-world training categories, OOD samples for SOD are those that break the basic priors of saliency, i.e., center prior, color contrast prior, compactness prior, etc., indicating OOD as being "continuous" instead of being discrete for our task.
We carry out extensive experiments to verify the effectiveness of existing distribution gap modeling techniques for SOD, and conclude that both train-time single-model uncertainty estimation techniques and weight-regularization solutions that prevent model activations from drifting too much are promising directions for modeling distributional uncertainty for SOD. + + + + Kernel Aware Resampler + http://openaccess.thecvf.com//content/CVPR2023/papers/Bernasconi_Kernel_Aware_Resampler_CVPR_2023_paper.pdf + Deep learning based methods for super-resolution have become state-of-the-art and outperform traditional approaches by a significant margin. From the initial models designed for fixed integer scaling factors (e.g. x2 or x4), efforts were made to explore different directions such as modeling blur kernels or addressing non-integer scaling factors. However, existing works do not provide a sound framework to handle them jointly. In this paper, we propose a framework for generic image resampling that not only addresses all the above-mentioned issues but extends the set of possible transforms from upscaling to generic transforms. A key aspect to unlock these capabilities is the faithful modeling of image warping and changes of the sampling rate during the training data preparation. This allows a localized representation of the implicit image degradation that takes into account the reconstruction kernel, the local geometric distortion and the anti-aliasing kernel. Using this spatially variant degradation map as conditioning for our resampling model, we can address with the same model both global transformations, such as upscaling or rotation, and locally varying transformations such as lens distortion or undistortion. Another important contribution is the automatic estimation of the degradation map in this more complex resampling setting (i.e. blind image resampling). Finally, we show that state-of-the-art results can be achieved by predicting kernels to apply to the input image instead of direct color prediction. This renders our model applicable to different types of data not seen during training, such as normals. + + + + LaserMix for Semi-Supervised LiDAR Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kong_LaserMix_for_Semi-Supervised_LiDAR_Semantic_Segmentation_CVPR_2023_paper.pdf + Densely annotating LiDAR point clouds is costly, which often restrains the scalability of fully-supervised learning methods. In this work, we study the underexplored semi-supervised learning (SSL) in LiDAR semantic segmentation. Our core idea is to leverage the strong spatial cues of LiDAR point clouds to better exploit unlabeled data. We propose LaserMix to mix laser beams from different LiDAR scans and then encourage the model to make consistent and confident predictions before and after mixing. Our framework has three appealing properties. 1) Generic: LaserMix is agnostic to LiDAR representations (e.g., range view and voxel), and hence our SSL framework can be universally applied. 2) Statistically grounded: We provide a detailed analysis to theoretically explain the applicability of the proposed framework. 3) Effective: Comprehensive experimental analysis on popular LiDAR segmentation datasets (nuScenes, SemanticKITTI, and ScribbleKITTI) demonstrates our effectiveness and superiority. Notably, we achieve competitive results against fully-supervised counterparts with 2x to 5x fewer labels and significantly improve the supervised-only baseline by a relative 10.8%.
We hope this concise yet high-performing framework could facilitate future research in semi-supervised LiDAR segmentation. Code is publicly available. + + + + Complementary Intrinsics From Neural Radiance Fields and CNNs for Outdoor Scene Relighting + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Complementary_Intrinsics_From_Neural_Radiance_Fields_and_CNNs_for_Outdoor_CVPR_2023_paper.pdf + Relighting an outdoor scene is challenging due to the diverse illuminations and salient cast shadows. Intrinsic image decomposition on outdoor photo collections could partly solve this problem by weakly supervised labels with albedo and normal consistency from multi-view stereo. With neural radiance fields (NeRFs), editing the appearance code could produce more realistic results without explicitly interpreting the outdoor scene image formation. This paper proposes to complement the intrinsic estimation from volume rendering using NeRFs and from inversing the photometric image formation model using convolutional neural networks (CNNs). The former produces richer and more reliable pseudo labels (cast shadows and sky appearances in addition to albedo and normal) for training the latter to predict interpretable and editable lighting parameters via a single-image prediction pipeline. We demonstrate the advantages of our method for both intrinsic image decomposition and relighting for various real outdoor scenes. + + + + Azimuth Super-Resolution for FMCW Radar in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Azimuth_Super-Resolution_for_FMCW_Radar_in_Autonomous_Driving_CVPR_2023_paper.pdf + We tackle the task of Azimuth (angular dimension) super-resolution for Frequency Modulated Continuous Wave (FMCW) multiple-input multiple-output (MIMO) radar. FMCW MIMO radar is widely used in autonomous driving alongside Lidar and RGB cameras. However, compared to Lidar, MIMO radar is usually of low resolution due to hardware size restrictions. For example, achieving 1-degree azimuth resolution requires at least 100 receivers, but a single MIMO device usually supports at most 12 receivers. Having limitations on the number of receivers is problematic since a high-resolution measurement of azimuth angle is essential for estimating the location and velocity of objects. To improve the azimuth resolution of MIMO radar, we propose a light, yet efficient, Analog-to-Digital super-resolution model (ADC-SR) that predicts or hallucinates additional radar signals using signals from only a few receivers. Compared with the baseline models that are applied to processed radar Range-Azimuth-Doppler (RAD) maps, we show that our ADC-SR method that processes raw ADC signals achieves comparable performance with 98% (50 times) fewer parameters. We also propose a hybrid super-resolution model (Hybrid-SR) combining our ADC-SR with a standard RAD super-resolution model, and show that performance can be improved by a large margin. Experiments on our City-Radar dataset and the RADIal dataset validate the importance of leveraging raw radar ADC signals. To assess the value of our super-resolution model for autonomous driving, we also perform object detection on the results of our super-resolution model and find that our super-resolution model improves detection performance by around 4% in mAP. 
+ + + + VQACL: A Novel Visual Question Answering Continual Learning Setting + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_VQACL_A_Novel_Visual_Question_Answering_Continual_Learning_Setting_CVPR_2023_paper.pdf + Research on continual learning has recently led to a variety of work in the unimodal community; however, little attention has been paid to multimodal tasks like visual question answering (VQA). In this paper, we establish a novel VQA Continual Learning setting named VQACL, which contains two key components: a dual-level task sequence where visual and linguistic data are nested, and a novel composition testing containing new skill-concept combinations. The former is devoted to simulating the ever-changing multimodal data stream in the real world, and the latter aims at measuring models' generalizability for cognitive reasoning. Based on our VQACL, we perform in-depth evaluations of five well-established continual learning methods, and observe that they suffer from catastrophic forgetting and have weak generalizability. To address the above issues, we propose a novel representation learning method, which leverages a sample-specific and a sample-invariant feature to learn representations that are both discriminative and generalizable for VQA. Furthermore, by respectively extracting such representations for the visual and textual inputs, our method can explicitly disentangle the skill and the concept. Extensive experimental results illustrate that our method significantly outperforms existing models, demonstrating the effectiveness and compositionality of the proposed approach. + + + + High-Res Facial Appearance Capture From Polarized Smartphone Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Azinovic_High-Res_Facial_Appearance_Capture_From_Polarized_Smartphone_Images_CVPR_2023_paper.pdf + We propose a novel method for high-quality facial texture reconstruction from RGB images using a novel capturing routine based on a single smartphone which we equip with an inexpensive polarization foil. Specifically, we turn the flashlight into a polarized light source and add a polarization filter on top of the camera. Leveraging this setup, we capture the face of a subject with cross-polarized and parallel-polarized light. For each subject, we record two short sequences in a dark environment under flash illumination with different light polarization using the modified smartphone. Based on these observations, we reconstruct an explicit surface mesh of the face using structure from motion. We then exploit the camera and light co-location within a differentiable renderer to optimize the facial textures using an analysis-by-synthesis approach. Our method optimizes for high-resolution normal textures, diffuse albedo, and specular albedo using a coarse-to-fine optimization scheme. We show that the optimized textures can be used in a standard rendering pipeline to synthesize high-quality photo-realistic 3D digital humans in novel environments. + + + + JAWS: Just a Wild Shot for Cinematic Transfer in Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_JAWS_Just_a_Wild_Shot_for_Cinematic_Transfer_in_Neural_CVPR_2023_paper.pdf + This paper presents JAWS, an optimization-driven approach that achieves the robust transfer of visual cinematic features from a reference in-the-wild video clip to a newly generated clip. To this end, we rely on an implicit neural representation (INR) to compute a clip that shares the same cinematic features as the reference clip.
We propose a general formulation of a camera optimization problem in an INR that computes extrinsic and intrinsic camera parameters as well as timing. By leveraging the differentiability of neural representations, we can back-propagate our designed cinematic losses, measured on proxy estimators, through a NeRF network directly to the proposed cinematic parameters. We also introduce specific enhancements such as guidance maps to improve the overall quality and efficiency. Results display the capacity of our system to replicate well-known camera sequences from movies, adapting the framing, camera parameters and timing of the generated video clip to maximize the similarity with the reference clip. + + + + EfficientSCI: Densely Connected Network With Space-Time Factorization for Large-Scale Video Snapshot Compressive Imaging + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_EfficientSCI_Densely_Connected_Network_With_Space-Time_Factorization_for_Large-Scale_Video_CVPR_2023_paper.pdf + Video snapshot compressive imaging (SCI) uses a two-dimensional detector to capture consecutive video frames during a single exposure time. Following this, an efficient reconstruction algorithm needs to be designed to reconstruct the desired video frames. Although recent deep learning-based state-of-the-art (SOTA) reconstruction algorithms have achieved good results in most tasks, they still face the following challenges due to excessive model complexity and GPU memory limitations: 1) these models incur a high computational cost, and 2) they are usually unable to reconstruct large-scale video frames at high compression ratios. To address these issues, we develop an efficient network for video SCI by using dense connections and a space-time factorization mechanism within a single residual block, dubbed EfficientSCI. The EfficientSCI network establishes spatial-temporal correlations well by using convolutions in the spatial domain and Transformers in the temporal domain. We show, for the first time, that a UHD color video with a high compression ratio can be reconstructed from a snapshot 2D measurement using a single end-to-end deep learning model with PSNR above 32 dB. Extensive results on both simulation and real data show that our method significantly outperforms all previous SOTA algorithms with better real-time performance. + + + + MotionTrack: Learning Robust Short-Term and Long-Term Motions for Multi-Object Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Qin_MotionTrack_Learning_Robust_Short-Term_and_Long-Term_Motions_for_Multi-Object_Tracking_CVPR_2023_paper.pdf + The main challenge of Multi-Object Tracking (MOT) lies in maintaining a continuous trajectory for each target. Existing methods often learn reliable motion patterns to match the same target between adjacent frames and discriminative appearance features to re-identify the lost targets after a long period. However, the reliability of motion prediction and the discriminability of appearances can be easily hurt by dense crowds and extreme occlusions in the tracking process. In this paper, we propose a simple yet effective multi-object tracker, i.e., MotionTrack, which learns robust short-term and long-term motions in a unified framework to associate trajectories from a short to long range. For dense crowds, we design a novel Interaction Module to learn interaction-aware motions from short-term trajectories, which can estimate the complex movement of each target.
For extreme occlusions, we build a novel Refind Module to learn reliable long-term motions from the target's historical trajectory, which can link the interrupted trajectory with its corresponding detection. Our Interaction Module and Refind Module are embedded in the well-known tracking-by-detection paradigm and work in tandem to maintain superior performance. Extensive experimental results on the MOT17 and MOT20 datasets demonstrate the superiority of our approach in challenging scenarios, and it achieves state-of-the-art performance on various MOT metrics. We will make the code and trained models publicly available. + + + + 3D Registration With Maximal Cliques + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_3D_Registration_With_Maximal_Cliques_CVPR_2023_paper.pdf + As a fundamental problem in computer vision, 3D point cloud registration (PCR) aims to seek the optimal pose to align a point cloud pair. In this paper, we present a 3D registration method with maximal cliques (MAC). The key insight is to loosen the previous maximum clique constraint, and to mine more local consensus information in a graph for accurate pose hypothesis generation: 1) A compatibility graph is constructed to render the affinity relationship between initial correspondences. 2) We search for maximal cliques in the graph, each of which represents a consensus set. We then perform node-guided clique selection, where each node corresponds to the maximal clique with the greatest graph weight. 3) Transformation hypotheses are computed for the selected cliques by the SVD algorithm, and the best hypothesis is used to perform registration. Extensive experiments on U3M, 3DMatch, 3DLoMatch and KITTI demonstrate that MAC effectively increases registration accuracy, outperforms various state-of-the-art methods and boosts the performance of deep-learned methods. MAC combined with deep-learned methods achieves a state-of-the-art registration recall of 95.7% / 78.9% on 3DMatch / 3DLoMatch. + + + + MetaPortrait: Identity-Preserving Talking Head Generation With Fast Personalized Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_MetaPortrait_Identity-Preserving_Talking_Head_Generation_With_Fast_Personalized_Adaptation_CVPR_2023_paper.pdf + In this work, we propose an ID-preserving talking head generation framework, which advances previous methods in two aspects. First, as opposed to interpolating from sparse flow, we claim that dense landmarks are crucial to achieving accurate geometry-aware flow fields. Second, inspired by face-swapping methods, we adaptively fuse the source identity during synthesis, so that the network better preserves the key characteristics of the image portrait. Although the proposed model surpasses prior methods in generation fidelity on established benchmarks, personalized fine-tuning is still needed to make the talking head generation qualified for real-world usage. However, this process is so computationally demanding that it is unaffordable for standard users. To alleviate this, we propose a fast adaptation model using a meta-learning approach. The learned model can be adapted to a high-quality personalized model in as little as 30 seconds. Last but not least, a spatial-temporal enhancement module is proposed to improve the fine details while ensuring temporal coherency. Extensive experiments prove the significant superiority of our approach over the state of the art in both one-shot and personalized settings.
+ + + + UniHCP: A Unified Model for Human-Centric Perceptions + http://openaccess.thecvf.com//content/CVPR2023/papers/Ci_UniHCP_A_Unified_Model_for_Human-Centric_Perceptions_CVPR_2023_paper.pdf + Human-centric perceptions (e.g., pose estimation, human parsing, pedestrian detection, person re-identification, etc.) play a key role in industrial applications of visual models. While specific human-centric tasks have their own relevant semantic aspects to focus on, they also share the same underlying semantic structure of the human body. However, few works have attempted to exploit such homogeneity and design a general-purpose model for human-centric tasks. In this work, we revisit a broad range of human-centric tasks and unify them in a minimalist manner. We propose UniHCP, a Unified Model for Human-Centric Perceptions, which unifies a wide range of human-centric tasks in a simplified end-to-end manner with the plain vision transformer architecture. With large-scale joint training on 33 human-centric datasets, UniHCP can outperform strong baselines on several in-domain and downstream tasks by direct evaluation. When adapted to a specific task, UniHCP achieves new SOTAs on a wide range of human-centric tasks, e.g., 69.8 mIoU on CIHP for human parsing, 86.18 mA on PA-100K for attribute prediction, 90.3 mAP on Market1501 for ReID, and 85.8 JI on CrowdHuman for pedestrian detection, performing better than specialized models tailored for each task. The code and pretrained model are available at https://github.com/OpenGVLab/UniHCP. + + + + VoxelNeXt: Fully Sparse VoxelNet for 3D Object Detection and Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_VoxelNeXt_Fully_Sparse_VoxelNet_for_3D_Object_Detection_and_Tracking_CVPR_2023_paper.pdf + 3D object detectors usually rely on hand-crafted proxies, e.g., anchors or centers, and translate well-studied 2D frameworks to 3D. Thus, sparse voxel features need to be densified and processed by dense prediction heads, which inevitably costs extra computation. In this paper, we instead propose VoxelNeXt for fully sparse 3D object detection. Our core insight is to predict objects directly based on sparse voxel features, without relying on hand-crafted proxies. Our strong sparse convolutional network VoxelNeXt detects and tracks 3D objects through voxel features entirely. It is an elegant and efficient framework, with no need for sparse-to-dense conversion or NMS post-processing. Our method achieves a better speed-accuracy trade-off than other mainstream detectors on the nuScenes dataset. For the first time, we show that a fully sparse voxel-based representation works decently for LIDAR 3D object detection and tracking. Extensive experiments on the nuScenes, Waymo, and Argoverse2 benchmarks validate the effectiveness of our approach. Without bells and whistles, our model outperforms all existing LIDAR methods on the nuScenes tracking test benchmark. + + + + Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures + http://openaccess.thecvf.com//content/CVPR2023/papers/Iofinova_Bias_in_Pruned_Vision_Models_In-Depth_Analysis_and_Countermeasures_CVPR_2023_paper.pdf + Pruning - that is, setting a significant subset of the parameters of a neural network to zero - is one of the most popular methods of model compression. Yet, several recent works have raised the issue that pruning may induce or exacerbate bias in the output of the compressed model.
Despite existing evidence for this phenomenon, the relationship between neural network pruning and induced bias is not well understood. In this work, we systematically investigate and characterize this phenomenon in Convolutional Neural Networks for computer vision. First, we show that it is in fact possible to obtain highly sparse models, e.g. with less than 10% remaining weights, which neither decrease in accuracy nor substantially increase in bias when compared to dense models. At the same time, we also find that, at higher sparsities, pruned models exhibit higher uncertainty in their outputs, as well as increased correlations, which we directly link to increased bias. We propose easy-to-use criteria which, based only on the uncompressed model, establish whether bias will increase with pruning, and identify the samples most susceptible to biased predictions post-compression. + + + + AttentionShift: Iteratively Estimated Part-Based Attention Map for Pointly Supervised Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Liao_AttentionShift_Iteratively_Estimated_Part-Based_Attention_Map_for_Pointly_Supervised_Instance_CVPR_2023_paper.pdf + Pointly supervised instance segmentation (PSIS) learns to segment objects using a single point within the object extent as supervision. Challenged by the non-negligible semantic variance between object parts, however, the single supervision point causes semantic bias and false segmentation. In this study, we propose AttentionShift, a method that solves the semantic bias issue by iteratively decomposing the instance attention map into parts and estimating the fine-grained semantics of each part. AttentionShift consists of two modules plugged into the vision transformer backbone: (i) token querying for pointly supervised attention map generation, and (ii) key-point shift, which re-estimates part-based attention maps by key-point filtering in the feature space. These two steps are performed iteratively so that the part-based attention maps are optimized spatially as well as in the feature space to cover the full object extent. Experiments on the PASCAL VOC and MS COCO 2017 datasets show that AttentionShift improves the state-of-the-art by 7.7% and 4.8% respectively under mAP@0.5, setting a solid PSIS baseline using a vision transformer. Code is enclosed in the supplementary material. + + + + PlaneDepth: Self-Supervised Depth Estimation via Orthogonal Planes + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_PlaneDepth_Self-Supervised_Depth_Estimation_via_Orthogonal_Planes_CVPR_2023_paper.pdf + Depth representations based on multiple near frontal-parallel planes have demonstrated impressive results in self-supervised monocular depth estimation (MDE). However, such a representation causes discontinuity of the ground, as the ground is perpendicular to the frontal-parallel planes, which is detrimental to the identification of drivable space in autonomous driving. In this paper, we propose PlaneDepth, a novel orthogonal-planes-based representation, including vertical planes and ground planes. PlaneDepth estimates the depth distribution using a Laplacian Mixture Model based on orthogonal planes for an input image. These planes are used to synthesize a reference view to provide the self-supervision signal. Further, we find that the widely used resizing and cropping data augmentation breaks the orthogonality assumptions, leading to inferior plane predictions.
We address this problem by explicitly constructing the resizing and cropping transformation to rectify the predefined planes and the predicted camera pose. Moreover, we propose an augmented self-distillation loss supervised with a bilateral occlusion mask to boost the robustness of the orthogonal planes representation to occlusions. Thanks to our orthogonal planes representation, we can extract the ground plane in an unsupervised manner, which is important for autonomous driving. Extensive experiments on the KITTI dataset demonstrate the effectiveness and efficiency of our method. The code is available at https://github.com/svip-lab/PlaneDepth. + + + + Semantic-Conditional Diffusion Networks for Image Captioning + http://openaccess.thecvf.com//content/CVPR2023/papers/Luo_Semantic-Conditional_Diffusion_Networks_for_Image_Captioning_CVPR_2023_paper.pdf + Recent advances in text-to-image generation have witnessed the rise of diffusion models, which act as powerful generative models. Nevertheless, it is not trivial to exploit such latent variable models to capture the dependency among discrete words and meanwhile pursue complex visual-language alignment in image captioning. In this paper, we break the deeply rooted conventions of learning a Transformer-based encoder-decoder, and propose a new diffusion-model-based paradigm tailored for image captioning, namely Semantic-Conditional Diffusion Networks (SCD-Net). Technically, for each input image, we first search for semantically relevant sentences via a cross-modal retrieval model to convey comprehensive semantic information. The rich semantics are further regarded as a semantic prior to trigger the learning of the Diffusion Transformer, which produces the output sentence in a diffusion process. In SCD-Net, multiple Diffusion Transformer structures are stacked to progressively strengthen the output sentence with better visual-language alignment and linguistic coherence in a cascaded manner. Furthermore, to stabilize the diffusion process, a new self-critical sequence training strategy is designed to guide the learning of SCD-Net with the knowledge of a standard autoregressive Transformer model. Extensive experiments on the COCO dataset demonstrate the promising potential of using diffusion models in the challenging image captioning task. Source code is available at https://github.com/YehLi/xmodaler/tree/master/configs/image_caption/scdnet. + + + + TranSG: Transformer-Based Skeleton Graph Prototype Contrastive Learning With Structure-Trajectory Prompted Reconstruction for Person Re-Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Rao_TranSG_Transformer-Based_Skeleton_Graph_Prototype_Contrastive_Learning_With_Structure-Trajectory_Prompted_CVPR_2023_paper.pdf + Person re-identification (re-ID) via 3D skeleton data is an emerging topic with prominent advantages. Existing methods usually design skeleton descriptors with raw body joints or perform skeleton sequence representation learning. However, they typically cannot concurrently model different body-component relations, and rarely explore useful semantics from fine-grained representations of body joints. In this paper, we propose a generic Transformer-based Skeleton Graph prototype contrastive learning (TranSG) approach with structure-trajectory prompted reconstruction to fully capture skeletal relations and valuable spatial-temporal semantics from skeleton graphs for person re-ID.
Specifically, we first devise the Skeleton Graph Transformer (SGT) to simultaneously learn body and motion relations within skeleton graphs, so as to aggregate key correlative node features into graph representations. Then, we propose the Graph Prototype Contrastive learning (GPC) to mine the most typical graph features (graph prototypes) of each identity, and contrast the inherent similarity between graph representations and different prototypes from both skeleton and sequence levels to learn discriminative graph representations. Last, a graph Structure-Trajectory Prompted Reconstruction (STPR) mechanism is proposed to exploit the spatial and temporal contexts of graph nodes to prompt skeleton graph reconstruction, which facilitates capturing more valuable patterns and graph semantics for person re-ID. Empirical evaluations demonstrate that TranSG significantly outperforms existing state-of-the-art methods. We further show its generality under different graph modeling, RGB-estimated skeletons, and unsupervised scenarios. + + + + All Are Worth Words: A ViT Backbone for Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Bao_All_Are_Worth_Words_A_ViT_Backbone_for_Diffusion_Models_CVPR_2023_paper.pdf + Vision transformers (ViT) have shown promise in various vision tasks while the U-Net based on a convolutional neural network (CNN) remains dominant in diffusion models. We design a simple and general ViT-based architecture (named U-ViT) for image generation with diffusion models. U-ViT is characterized by treating all inputs including the time, condition and noisy image patches as tokens and employing long skip connections between shallow and deep layers. We evaluate U-ViT in unconditional and class-conditional image generation, as well as text-to-image generation tasks, where U-ViT is comparable if not superior to a CNN-based U-Net of a similar size. In particular, latent diffusion models with U-ViT achieve record-breaking FID scores of 2.29 in class-conditional image generation on ImageNet 256x256, and 5.48 in text-to-image generation on MS-COCO, among methods without accessing large external datasets during the training of generative models. Our results suggest that, for diffusion-based image modeling, the long skip connection is crucial while the down-sampling and up-sampling operators in CNN-based U-Net are not always necessary. We believe that U-ViT can provide insights for future research on backbones in diffusion models and benefit generative modeling on large scale cross-modality datasets. + + + + SteerNeRF: Accelerating NeRF Rendering via Smooth Viewpoint Trajectory + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_SteerNeRF_Accelerating_NeRF_Rendering_via_Smooth_Viewpoint_Trajectory_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRF) have demonstrated superior novel view synthesis performance but are slow at rendering. To speed up the volume rendering process, many acceleration methods have been proposed at the cost of large memory consumption. To push the frontier of the efficiency-memory trade-off, we explore a new perspective to accelerate NeRF rendering, leveraging a key fact that the viewpoint change is usually smooth and continuous in interactive viewpoint control. This allows us to leverage the information of preceding viewpoints to reduce the number of rendered pixels as well as the number of sampled points along the ray of the remaining pixels. 
In our pipeline, a low-resolution feature map is first rendered by volume rendering, and then a lightweight 2D neural renderer is applied to generate the output image at the target resolution, leveraging the features of preceding and current frames. We show that the proposed method can achieve competitive rendering quality while reducing the rendering time with little memory overhead, enabling 30 FPS at 1080P image resolution with a low memory footprint. + + + + Spatial-Frequency Mutual Learning for Face Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Spatial-Frequency_Mutual_Learning_for_Face_Super-Resolution_CVPR_2023_paper.pdf + Face super-resolution (FSR) aims to reconstruct high-resolution (HR) face images from low-resolution (LR) ones. With the advent of deep learning, the FSR technique has achieved significant breakthroughs. However, existing FSR methods either have a fixed receptive field or fail to maintain facial structure, limiting the FSR performance. To circumvent this problem, the Fourier transform is introduced, which can capture global facial structure information and achieve an image-size receptive field. Relying on the Fourier transform, we devise a spatial-frequency mutual network (SFMNet) for FSR, which, as far as we know, is the first FSR method to explore the correlations between the spatial and frequency domains. To be specific, our SFMNet is a two-branch network equipped with a spatial branch and a frequency branch. Benefiting from the property of the Fourier transform, the frequency branch can achieve an image-size receptive field and capture global dependency, while the spatial branch can extract local dependency. Considering that these dependencies are complementary and both favorable for FSR, we further develop a frequency-spatial interaction block (FSIB) which mutually amalgamates the complementary spatial and frequency information to enhance the capability of the model. Quantitative and qualitative experimental results show that the proposed method outperforms state-of-the-art FSR methods in recovering face images. The implementation and model will be released at https://github.com/wcy-cs/SFMNet. + + + + Being Comes From Not-Being: Open-Vocabulary Text-to-Motion Generation With Wordless Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Being_Comes_From_Not-Being_Open-Vocabulary_Text-to-Motion_Generation_With_Wordless_Training_CVPR_2023_paper.pdf + Text-to-motion generation is an emerging and challenging problem, which aims to synthesize motion with the same semantics as the input text. However, due to the lack of diverse labeled training data, most approaches are either limited to specific types of text annotations or require online optimization to cater to the texts during inference, at the cost of efficiency and stability. In this paper, we investigate offline open-vocabulary text-to-motion generation in a zero-shot learning manner that neither requires paired training data nor extra online optimization to adapt to unseen texts. Inspired by prompt learning in NLP, we pretrain a motion generator that learns to reconstruct the full motion from the masked motion. During inference, instead of changing the motion generator, our method reformulates the input text into a masked motion as the prompt for the motion generator to "reconstruct" the motion. In constructing the prompt, the unmasked poses of the prompt are synthesized by a text-to-pose generator.
To supervise the optimization of the text-to-pose generator, we propose the first text-pose alignment model for measuring the alignment between texts and 3D poses. And to prevent the pose generator from overfitting to limited training texts, we further propose a novel wordless training mechanism that optimizes the text-to-pose generator without any training texts. The comprehensive experimental results show that our method obtains a significant improvement against the baseline methods. The code is available at https://github.com/junfanlin/oohmg. + + + + MonoHuman: Animatable Human Neural Field From Monocular Video + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_MonoHuman_Animatable_Human_Neural_Field_From_Monocular_Video_CVPR_2023_paper.pdf + Animating virtual avatars with free-view control is crucial for various applications like virtual reality and digital entertainment. Previous studies have attempted to utilize the representation power of the neural radiance field (NeRF) to reconstruct the human body from monocular videos. Recent works propose to graft a deformation network into the NeRF to further model the dynamics of the human neural field for animating vivid human motions. However, such pipelines either rely on pose-dependent representations or fall short of motion coherency due to frame-independent optimization, making it difficult to generalize to unseen pose sequences realistically. In this paper, we propose a novel framework MonoHuman, which robustly renders view-consistent and high-fidelity avatars under arbitrary novel poses. Our key insight is to model the deformation field with bi-directional constraints and explicitly leverage the off-the-peg keyframe information to reason the feature correlations for coherent results. Specifically, we first propose a Shared Bidirectional Deformation module, which creates a pose-independent generalizable deformation field by disentangling backward and forward deformation correspondences into shared skeletal motion weight and separate non-rigid motions. Then, we devise a Forward Correspondence Search module, which queries the correspondence feature of keyframes to guide the rendering network. The rendered results are thus multi-view consistent with high fidelity, even under challenging novel pose settings. Extensive experiments demonstrate the superiority of our proposed MonoHuman over state-of-the-art methods. + + + + SINE: Semantic-Driven Image-Based NeRF Editing With Prior-Guided Editing Field + http://openaccess.thecvf.com//content/CVPR2023/papers/Bao_SINE_Semantic-Driven_Image-Based_NeRF_Editing_With_Prior-Guided_Editing_Field_CVPR_2023_paper.pdf + Despite the great success in 2D editing using user-friendly tools, such as Photoshop, semantic strokes, or even text prompts, similar capabilities in 3D areas are still limited, either relying on 3D modeling skills or allowing editing within only a few categories. In this paper, we present a novel semantic-driven NeRF editing approach, which enables users to edit a neural radiance field with a single image, and faithfully delivers edited novel views with high fidelity and multi-view consistency. 
To achieve this goal, we propose a prior-guided editing field to encode fine-grained geometric and texture editing in 3D space, and develop a series of techniques to aid the editing process, including cyclic constraints with a proxy mesh to facilitate geometric supervision, a color compositing mechanism to stabilize semantic-driven texture editing, and a feature-cluster-based regularization to keep the irrelevant content unchanged. Extensive experiments and editing examples on both real-world and synthetic data demonstrate that our method achieves photo-realistic 3D editing using only a single edited image, pushing the bound of semantic-driven editing in 3D real-world scenes. + + + + MetaCLUE: Towards Comprehensive Visual Metaphors Research + http://openaccess.thecvf.com//content/CVPR2023/papers/Akula_MetaCLUE_Towards_Comprehensive_Visual_Metaphors_Research_CVPR_2023_paper.pdf + Creativity is an indispensable part of human cognition and also an inherent part of how we make sense of the world. Metaphorical abstraction is fundamental in communicating creative ideas through nuanced relationships between abstract concepts such as feelings. While computer vision benchmarks and approaches predominantly focus on understanding and generating literal interpretations of images, metaphorical comprehension of images remains relatively unexplored. Towards this goal, we introduce MetaCLUE, a set of vision tasks on visual metaphor. We also collect high-quality and rich metaphor annotations (abstract objects, concepts, relationships along with their corresponding object boxes), as there do not exist any datasets that facilitate the evaluation of these tasks. We perform a comprehensive analysis of state-of-the-art models in vision and language based on our annotations, highlighting strengths and weaknesses of current approaches in visual metaphor Classification, Localization, Understanding (retrieval, question answering, captioning) and gEneration (text-to-image synthesis) tasks. We hope this work provides a concrete step towards systematically developing AI systems with human-like creative capabilities. Project page: https://metaclue.github.io + + + + Towards End-to-End Generative Modeling of Long Videos With Memory-Efficient Bidirectional Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Yoo_Towards_End-to-End_Generative_Modeling_of_Long_Videos_With_Memory-Efficient_Bidirectional_CVPR_2023_paper.pdf + Autoregressive transformers have shown remarkable success in video generation. However, the transformers are prohibited from directly learning the long-term dependency in videos due to the quadratic complexity of self-attention, and they inherently suffer from slow inference and error propagation due to the autoregressive process. In this paper, we propose the Memory-efficient Bidirectional Transformer (MeBT) for end-to-end learning of long-term dependency in videos and fast inference. Based on recent advances in bidirectional transformers, our method learns to decode the entire spatio-temporal volume of a video in parallel from partially observed patches. The proposed transformer achieves a linear time complexity in both encoding and decoding, by projecting observable context tokens into a fixed number of latent tokens and conditioning them to decode the masked tokens through cross-attention.
Empowered by linear complexity and bidirectional modeling, our method demonstrates significant improvement over the autoregressive Transformers for generating moderately long videos in both quality and speed. + + + + HaLP: Hallucinating Latent Positives for Skeleton-Based Self-Supervised Learning of Actions + http://openaccess.thecvf.com//content/CVPR2023/papers/Shah_HaLP_Hallucinating_Latent_Positives_for_Skeleton-Based_Self-Supervised_Learning_of_Actions_CVPR_2023_paper.pdf + Supervised learning of skeleton sequence encoders for action recognition has received significant attention in recent times. However, learning such encoders without labels continues to be a challenging problem. While prior works have shown promising results by applying contrastive learning to pose sequences, the quality of the learned representations is often observed to be closely tied to data augmentations that are used to craft the positives. However, augmenting pose sequences is a difficult task as the geometric constraints among the skeleton joints need to be enforced to make the augmentations realistic for that action. In this work, we propose a new contrastive learning approach to train models for skeleton-based action recognition without labels. Our key contribution is a simple module, HaLP - to Hallucinate Latent Positives for contrastive learning. Specifically, HaLP explores the latent space of poses in suitable directions to generate new positives. To this end, we present a novel optimization formulation to solve for the synthetic positives with an explicit control on their hardness. We propose approximations to the objective, making them solvable in closed form with minimal overhead. We show via experiments that using these generated positives within a standard contrastive learning framework leads to consistent improvements across benchmarks such as NTU-60, NTU-120, and PKU-II on tasks like linear evaluation, transfer learning, and kNN evaluation. Our code can be found at https://github.com/anshulbshah/HaLP. + + + + FLEX: Full-Body Grasping Without Full-Body Grasps + http://openaccess.thecvf.com//content/CVPR2023/papers/Tendulkar_FLEX_Full-Body_Grasping_Without_Full-Body_Grasps_CVPR_2023_paper.pdf + Synthesizing 3D human avatars interacting realistically with a scene is an important problem with applications in AR/VR, video games, and robotics. Towards this goal, we address the task of generating a virtual human -- hands and full body -- grasping everyday objects. Existing methods approach this problem by collecting a 3D dataset of humans interacting with objects and training on this data. However, 1) these methods do not generalize to different object positions and orientations or to the presence of furniture in the scene, and 2) the diversity of their generated full-body poses is very limited. In this work, we address all the above challenges to generate realistic, diverse full-body grasps in everyday scenes without requiring any 3D full-body grasping data. Our key insight is to leverage the existence of both full-body pose and hand-grasping priors, composing them using 3D geometrical constraints to obtain full-body grasps. We empirically validate that these constraints can generate a variety of feasible human grasps that are superior to baselines both quantitatively and qualitatively. 
+ + + + EVA: Exploring the Limits of Masked Visual Representation Learning at Scale + http://openaccess.thecvf.com//content/CVPR2023/papers/Fang_EVA_Exploring_the_Limits_of_Masked_Visual_Representation_Learning_at_CVPR_2023_paper.pdf + We launch EVA, a vision-centric foundation model to explore the limits of visual representation at scale using only publicly accessible data. EVA is a vanilla ViT pre-trained to reconstruct the masked-out image-text-aligned vision features conditioned on visible image patches. Via this pretext task, we can efficiently scale up EVA to one billion parameters, and set new records on a broad range of representative vision downstream tasks, such as image recognition, video action recognition, object detection, instance segmentation and semantic segmentation, without heavy supervised training. Moreover, we observe that quantitative changes in scaling EVA result in qualitative changes in transfer learning performance that are not present in other models. For instance, EVA takes a great leap in the challenging large-vocabulary instance segmentation task: our model achieves almost the same state-of-the-art performance on the LVIS dataset, with over a thousand categories, as on the COCO dataset, with only eighty categories. Beyond a pure vision encoder, EVA can also serve as a vision-centric, multi-modal pivot to connect images and text. We find that initializing the vision tower of a giant CLIP from EVA can greatly stabilize the training and outperform its train-from-scratch counterpart with far fewer samples and less compute, providing a new direction for scaling up and accelerating the costly training of multi-modal foundation models. + + + + Discrete Point-Wise Attack Is Not Enough: Generalized Manifold Adversarial Attack for Face Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Discrete_Point-Wise_Attack_Is_Not_Enough_Generalized_Manifold_Adversarial_Attack_CVPR_2023_paper.pdf + Classical adversarial attacks for Face Recognition (FR) models typically generate discrete examples for a target identity with a single-state image. However, such a point-wise attack paradigm exhibits poor generalization against numerous unknown states of the identity and can be easily defended against. In this paper, by rethinking the inherent relationship between the face of a target identity and its variants, we introduce a new pipeline of Generalized Manifold Adversarial Attack (GMAA) to achieve better attack performance by expanding the attack range. Specifically, this expansion lies in two aspects -- GMAA not only expands the target to be attacked from one to many to encourage a good generalization ability for the generated adversarial examples, but it also expands the latter from discrete points to a manifold by leveraging the domain knowledge that facial expression change can be continuous, which enhances the attack effect as a data augmentation mechanism does. Moreover, we further design a dual supervision with local and global constraints as a minor contribution to improve the visual quality of the generated adversarial examples. We demonstrate the effectiveness of our method based on extensive experiments, and reveal that GMAA promises a semantically continuous adversarial space with higher generalization ability and visual quality.
+ + + + FREDOM: Fairness Domain Adaptation Approach to Semantic Scene Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Truong_FREDOM_Fairness_Domain_Adaptation_Approach_to_Semantic_Scene_Understanding_CVPR_2023_paper.pdf + Although Domain Adaptation in Semantic Scene Segmentation has shown impressive improvement in recent years, the fairness concerns in the domain adaptation have yet to be well defined and addressed. In addition, fairness is one of the most critical aspects when deploying the segmentation models into human-related real-world applications, e.g., autonomous driving, as any unfair predictions could influence human safety. In this paper, we propose a novel Fairness Domain Adaptation (FREDOM) approach to semantic scene segmentation. In particular, from the proposed formulated fairness objective, a new adaptation framework will be introduced based on the fair treatment of class distributions. Moreover, to generally model the context of structural dependency, a new conditional structural constraint is introduced to impose the consistency of predicted segmentation. Thanks to the proposed Conditional Structure Network, the self-attention mechanism has sufficiently modeled the structural information of segmentation. Through the ablation studies, the proposed method has shown the performance improvement of the segmentation models and promoted fairness in the model predictions. The experimental results on the two standard benchmarks, i.e., SYNTHIA -> Cityscapes and GTA5 -> Cityscapes, have shown that our method achieved State-of-the-Art (SOTA) performance. + + + + IMP: Iterative Matching and Pose Estimation With Adaptive Pooling + http://openaccess.thecvf.com//content/CVPR2023/papers/Xue_IMP_Iterative_Matching_and_Pose_Estimation_With_Adaptive_Pooling_CVPR_2023_paper.pdf + Previous methods solve feature matching and pose estimation using a two-stage process by first finding matches and then estimating the pose. As they ignore the geometric relationships between the two tasks, they focus on either improving the quality of matches or filtering potential outliers, leading to limited efficiency or accuracy. In contrast, we propose an iterative matching and pose estimation framework (IMP) leveraging the geometric connections between the two tasks: a few good matches are enough for a roughly accurate pose estimation; a roughly accurate pose can be used to guide the matching by providing geometric constraints. To this end, we implement a geometry-aware recurrent module with transformers which jointly outputs sparse matches and camera poses. Specifically, for each iteration, we first implicitly embed geometric information into the module via a pose-consistency loss, allowing it to predict geometry-aware matches progressively. Second, we introduce an efficient IMP (EIMP) to dynamically discard keypoints without potential matches, avoiding redundant updating and significantly reducing the quadratic time complexity of attention computation in transformers. Experiments on YFCC100m, Scannet, and Aachen Day-Night datasets demonstrate that the proposed method outperforms previous approaches in terms of accuracy and efficiency. + + + + PATS: Patch Area Transportation With Subdivision for Local Feature Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Ni_PATS_Patch_Area_Transportation_With_Subdivision_for_Local_Feature_Matching_CVPR_2023_paper.pdf + Local feature matching aims at establishing sparse correspondences between a pair of images. 
Recently, detector-free methods have shown generally better performance but are not satisfactory for image pairs with large scale differences. In this paper, we propose Patch Area Transportation with Subdivision (PATS) to tackle this issue. Instead of building an expensive image pyramid, we start by splitting the original image pair into equal-sized patches and gradually resizing and subdividing them into smaller patches with the same scale. However, estimating scale differences between these patches is non-trivial since the scale differences are determined by both relative camera poses and scene structures, and thus vary spatially over image pairs. Moreover, it is hard to obtain the ground truth for real scenes. To this end, we propose patch area transportation, which enables learning scale differences in a self-supervised manner. In contrast to bipartite graph matching, which only handles one-to-one matching, our patch area transportation can deal with many-to-many relationships. PATS improves both matching accuracy and coverage, and shows superior performance in downstream tasks, such as relative pose estimation, visual localization, and optical flow estimation. The source code will be released to benefit the community. + + + + MEDIC: Remove Model Backdoors via Importance Driven Cloning + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_MEDIC_Remove_Model_Backdoors_via_Importance_Driven_Cloning_CVPR_2023_paper.pdf + We develop a novel method to remove injected backdoors in deep learning models. It works by cloning the benign behaviors of a trojaned model to a new model of the same structure. It trains the clone model from scratch on a very small subset of samples and aims to minimize a cloning loss that denotes the differences between the activations of important neurons across the two models. The set of important neurons varies for each input, depending on the magnitude of their activations and their impact on the classification result. We theoretically show that our method can better recover the benign functions of the backdoored model. Meanwhile, we prove that our method can be more effective in removing backdoors compared with fine-tuning. Our experiments show that our technique can effectively remove nine different types of backdoors with minor benign accuracy degradation, outperforming state-of-the-art backdoor removal techniques that are based on fine-tuning, knowledge distillation, and neuron pruning. + + + + SimpleNet: A Simple Network for Image Anomaly Detection and Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_SimpleNet_A_Simple_Network_for_Image_Anomaly_Detection_and_Localization_CVPR_2023_paper.pdf + We propose a simple and application-friendly network (called SimpleNet) for detecting and localizing anomalies. SimpleNet consists of four components: (1) a pre-trained Feature Extractor that generates local features, (2) a shallow Feature Adapter that transfers local features towards the target domain, (3) a simple Anomaly Feature Generator that counterfeits anomaly features by adding Gaussian noise to normal features, and (4) a binary Anomaly Discriminator that distinguishes anomaly features from normal features. During inference, the Anomaly Feature Generator is discarded. Our approach is based on three intuitions. First, transforming pre-trained features to target-oriented features helps avoid domain bias. Second, generating synthetic anomalies in feature space is more effective, as defects may not have much commonality in the image space.
Third, a simple discriminator is more efficient and practical. In spite of its simplicity, SimpleNet outperforms previous methods quantitatively and qualitatively. On the MVTec AD benchmark, SimpleNet achieves an anomaly detection AUROC of 99.6%, reducing the error by 55.5% compared to the next best-performing model. Furthermore, SimpleNet is faster than existing methods, with a high frame rate of 77 FPS on a 3080ti GPU. Additionally, SimpleNet demonstrates significant improvements in performance on the One-Class Novelty Detection task. Code: https://github.com/DonaldRR/SimpleNet. + + + + G-MSM: Unsupervised Multi-Shape Matching With Graph-Based Affinity Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Eisenberger_G-MSM_Unsupervised_Multi-Shape_Matching_With_Graph-Based_Affinity_Priors_CVPR_2023_paper.pdf + We present G-MSM (Graph-based Multi-Shape Matching), a novel unsupervised learning approach for non-rigid shape correspondence. Rather than treating a collection of input poses as an unordered set of samples, we explicitly model the underlying shape data manifold. To this end, we propose an adaptive multi-shape matching architecture that constructs an affinity graph on a given set of training shapes in a self-supervised manner. The key idea is to combine putative, pairwise correspondences by propagating maps along shortest paths in the underlying shape graph. During training, we enforce cycle-consistency between such optimal paths and the pairwise matches, which enables our model to learn topology-aware shape priors. We explore different classes of shape graphs and recover specific settings, like template-based matching (star graph) or learnable ranking/sorting (TSP graph), as special cases in our framework. Finally, we demonstrate state-of-the-art performance on several recent shape correspondence benchmarks, including real-world 3D scan meshes with topological noise and challenging inter-class pairs. + + + + Mixed Autoencoder for Self-Supervised Visual Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Mixed_Autoencoder_for_Self-Supervised_Visual_Representation_Learning_CVPR_2023_paper.pdf + The Masked Autoencoder (MAE) has demonstrated superior performance on various vision tasks via randomly masking image patches and reconstructing them. However, effective data augmentation strategies for MAE still remain an open question, different from contrastive learning, where augmentation serves as the most important part. This paper studies the prevailing mixing augmentation for MAE. We first demonstrate that naive mixing will, on the contrary, degrade model performance due to the increase of mutual information (MI). To address this, we propose homologous recognition, an auxiliary pretext task, not only to alleviate the MI increase by explicitly requiring each patch to recognize homologous patches, but also to perform object-aware self-supervised pre-training for better downstream dense perception performance. With extensive experiments, we demonstrate that our proposed Mixed Autoencoder (MixedAE) achieves state-of-the-art transfer results among masked image modeling (MIM) augmentations on different downstream tasks with significant efficiency. Specifically, our MixedAE outperforms MAE by +0.3% accuracy, +1.7 mIoU and +0.9 AP on ImageNet-1K, ADE20K and COCO respectively, with a standard ViT-Base. Moreover, MixedAE surpasses iBOT, a strong MIM method combined with instance discrimination, while accelerating training by 2x.
To the best of our knowledge, this is the very first work to consider mixing for MIM from the perspective of pretext task design. Code will be made available. + + + + ProphNet: Efficient Agent-Centric Motion Forecasting With Anchor-Informed Proposals + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_ProphNet_Efficient_Agent-Centric_Motion_Forecasting_With_Anchor-Informed_Proposals_CVPR_2023_paper.pdf + Motion forecasting is a key module in an autonomous driving system. Due to the heterogeneous nature of multi-sourced input, multimodality in agent behavior, and low latency required by onboard deployment, this task is notoriously challenging. To cope with these difficulties, this paper proposes a novel agent-centric model with anchor-informed proposals for efficient multimodal motion forecasting. We design a modality-agnostic strategy to concisely encode the complex input in a unified manner. We generate diverse proposals, fused with anchors bearing goal-oriented context, to induce multimodal prediction that covers a wide range of future trajectories. The network architecture is highly uniform and succinct, leading to an efficient model amenable to real-world deployment. Experiments reveal that our agent-centric network compares favorably with the state-of-the-art methods in prediction accuracy, while achieving scene-centric level inference latency. + + + + Learning Multi-Modal Class-Specific Tokens for Weakly Supervised Dense Object Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Learning_Multi-Modal_Class-Specific_Tokens_for_Weakly_Supervised_Dense_Object_Localization_CVPR_2023_paper.pdf + Weakly supervised dense object localization (WSDOL) generally relies on Class Activation Mapping (CAM), which exploits the correlation between the class weights of the image classifier and the pixel-level features. Due to the limited ability to address intra-class variations, the image classifier cannot properly associate the pixel features, leading to inaccurate dense localization maps. In this paper, we propose to explicitly construct multi-modal class representations by leveraging Contrastive Language-Image Pre-training (CLIP) to guide dense localization. More specifically, we propose a unified transformer framework to learn two modalities of class-specific tokens, i.e., class-specific visual and textual tokens. The former captures semantics from the target visual data while the latter exploits the class-related language priors from CLIP, providing complementary information to better perceive intra-class diversities. In addition, we propose to enrich the multi-modal class-specific tokens with sample-specific contexts comprising visual context and image-language context. This enables more adaptive class representation learning, which further facilitates dense localization. Extensive experiments show the superiority of the proposed method for WSDOL on two multi-label datasets, i.e., PASCAL VOC and MS COCO, and one single-label dataset, i.e., OpenImages. Our dense localization maps also lead to the state-of-the-art weakly supervised semantic segmentation (WSSS) results on PASCAL VOC and MS COCO.
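For readers unfamiliar with the vanilla CAM baseline referenced in the WSDOL abstract above, the following minimal Python sketch shows the basic idea of correlating the classifier's class weights with pixel-level features. It is an illustration only, not the paper's multi-modal token method, and all tensor shapes and variable names here are hypothetical.
import torch

def class_activation_map(features, fc_weights, class_idx):
    # features:   (C, H, W) feature map from the last conv block
    # fc_weights: (num_classes, C) weights of the linear classifier
    # returns a (H, W) localization map, min-max normalized to [0, 1]
    cam = torch.einsum('c,chw->hw', fc_weights[class_idx], features)
    cam = cam.clamp(min=0)                                   # keep positive evidence only
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam

# toy usage with random tensors
feat = torch.randn(512, 14, 14)
w = torch.randn(1000, 512)
heatmap = class_activation_map(feat, w, class_idx=3)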
+ + + + GlassesGAN: Eyewear Personalization Using Synthetic Appearance Discovery and Targeted Subspace Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Plesh_GlassesGAN_Eyewear_Personalization_Using_Synthetic_Appearance_Discovery_and_Targeted_Subspace_CVPR_2023_paper.pdf + We present GlassesGAN, a novel image editing framework for custom design of glasses, that sets a new standard in terms of output-image quality, edit realism, and continuous multi-style edit capability. To facilitate the editing process with GlassesGAN, we propose a Targeted Subspace Modelling (TSM) procedure that, based on a novel mechanism for (synthetic) appearance discovery in the latent space of a pre-trained GAN generator, constructs an eyeglasses-specific (latent) subspace that the editing framework can utilize. Additionally, we also introduce an appearance-constrained subspace initialization (SI) technique that centers the latent representation of the given input image in the well-defined part of the constructed subspace to improve the reliability of the learned edits. We test GlassesGAN on two (diverse) high-resolution datasets (CelebA-HQ and SiblingsDB-HQf) and compare it to three state-of-the-art baselines, i.e., InterfaceGAN, GANSpace, and MaskGAN. The reported results show that GlassesGAN convincingly outperforms all competing techniques, while offering functionality (e.g., fine-grained multi-style editing) not available with any of the competitors. The source code for GlassesGAN is made publicly available. + + + + Deep Hashing With Minimal-Distance-Separated Hash Centers + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Deep_Hashing_With_Minimal-Distance-Separated_Hash_Centers_CVPR_2023_paper.pdf + Deep hashing is an appealing approach for large-scale image retrieval. Most existing supervised deep hashing methods learn hash functions using pairwise or triple image similarities in randomly sampled mini-batches. They suffer from low training efficiency, insufficient coverage of data distribution, and pair imbalance problems. Recently, central similarity quantization (CSQ) attacks the above problems by using "hash centers" as a global similarity metric, which encourages the hash codes of similar images to approach their common hash center and distance themselves from other hash centers. Although achieving SOTA retrieval performance, CSQ falls short of a worst-case guarantee on the minimal distance between its constructed hash centers, i.e. the hash centers can be arbitrarily close. This paper presents an optimization method that finds hash centers with a constraint on the minimal distance between any pair of hash centers, which is non-trivial due to the non-convex nature of the problem. More importantly, we adopt the Gilbert-Varshamov bound from coding theory, which helps us to obtain a large minimal distance while ensuring the empirical feasibility of our optimization approach. With these clearly-separated hash centers, each is assigned to one image class, we propose several effective loss functions to train deep hashing networks. Extensive experiments on three datasets for image retrieval demonstrate that the proposed method achieves superior retrieval performance over the state-of-the-art deep hashing methods. 
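As a small illustration of the constraint described in the deep hashing abstract above (a guaranteed minimal distance between any pair of hash centers), the sketch below simply measures the minimal pairwise Hamming distance among candidate binary centers. It is a generic check under assumed inputs, not the paper's optimization procedure or its Gilbert-Varshamov-based construction.
import numpy as np

def min_pairwise_hamming(centers):
    # centers: (K, d) array of binary hash centers in {0, 1}
    # returns the smallest Hamming distance over all pairs of centers
    K, d = centers.shape
    best = d
    for i in range(K):
        for j in range(i + 1, K):
            best = min(best, int(np.sum(centers[i] != centers[j])))
    return best

# toy usage: 10 random 64-bit centers; a real method would optimize the
# centers so that this value stays above a chosen threshold
rng = np.random.default_rng(0)
centers = rng.integers(0, 2, size=(10, 64))
print(min_pairwise_hamming(centers))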
+ + + + VL-SAT: Visual-Linguistic Semantics Assisted Training for 3D Semantic Scene Graph Prediction in Point Cloud + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_VL-SAT_Visual-Linguistic_Semantics_Assisted_Training_for_3D_Semantic_Scene_Graph_CVPR_2023_paper.pdf + The task of 3D semantic scene graph (3DSSG) prediction in the point cloud is challenging since (1) the 3D point cloud only captures geometric structures with limited semantics compared to 2D images, and (2) long-tailed relation distribution inherently hinders the learning of unbiased prediction. Since 2D images provide rich semantics and scene graphs are naturally coupled with language, in this study, we propose a Visual-Linguistic Semantics Assisted Training (VL-SAT) scheme that can significantly empower 3DSSG prediction models with discrimination of long-tailed and ambiguous semantic relations. The key idea is to train a powerful multi-modal oracle model to assist the 3D model. This oracle learns reliable structural representations based on semantics from vision, language, and 3D geometry, and its benefits can be heterogeneously passed to the 3D model during the training stage. By effectively utilizing visual-linguistic semantics in training, our VL-SAT can significantly boost common 3DSSG prediction models, such as SGFN and SGGpoint, using only 3D inputs in the inference stage, especially when dealing with tail relation triplets. Comprehensive evaluations and ablation studies on the 3DSSG dataset have validated the effectiveness of the proposed scheme. Code is available at https://github.com/wz7in/CVPR2023-VLSAT. + + + + Learning Emotion Representations From Verbal and Nonverbal Communication + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Learning_Emotion_Representations_From_Verbal_and_Nonverbal_Communication_CVPR_2023_paper.pdf + Emotion understanding is an essential but highly challenging component of artificial general intelligence. The absence of extensive annotated datasets has significantly impeded advancements in this field. We present EmotionCLIP, the first pre-training paradigm to extract visual emotion representations from verbal and nonverbal communication using only uncurated data. Compared to numerical labels or descriptions used in previous methods, communication naturally contains emotion information. Furthermore, acquiring emotion representations from communication is more congruent with the human learning process. We guide EmotionCLIP to attend to nonverbal emotion cues through subject-aware context encoding and verbal emotion cues using sentiment-guided contrastive learning. Extensive experiments validate the effectiveness and transferability of EmotionCLIP. Using merely a linear-probe evaluation protocol, EmotionCLIP outperforms the state-of-the-art supervised visual emotion recognition methods and rivals many multimodal approaches across various benchmarks. We anticipate that the advent of EmotionCLIP will address the prevailing issue of data scarcity in emotion understanding, thereby fostering progress in related domains. The code and pre-trained models are available at https://github.com/Xeaver/EmotionCLIP. + + + + Architectural Backdoors in Neural Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Bober-Irizar_Architectural_Backdoors_in_Neural_Networks_CVPR_2023_paper.pdf + Machine learning is vulnerable to adversarial manipulation. Previous literature has demonstrated that, at the training stage, attackers can manipulate data (Gu et al.)
and data sampling procedures (Shumailov et al.) to control model behaviour. A common attack goal is to plant backdoors i.e. force the victim model to learn to recognise a trigger known only by the adversary. In this paper, we introduce a new class of backdoor attacks that hide inside model architectures i.e. in the inductive bias of the functions used to train. These backdoors are simple to implement, for instance by publishing open-source code for a backdoored model architecture that others will reuse unknowingly. We demonstrate that model architectural backdoors represent a real threat and, unlike other approaches, can survive a complete re-training from scratch. We formalise the main construction principles behind architectural backdoors, such as a connection between the input and the output, and describe some possible protections against them. We evaluate our attacks on computer vision benchmarks of different scales and demonstrate the underlying vulnerability is pervasive in a variety of common training settings. + + + + Semantic Human Parsing via Scalable Semantic Transfer Over Multiple Label Domains + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Semantic_Human_Parsing_via_Scalable_Semantic_Transfer_Over_Multiple_Label_CVPR_2023_paper.pdf + This paper presents Scalable Semantic Transfer (SST), a novel training paradigm, to explore how to leverage the mutual benefits of the data from different label domains (i.e. various levels of label granularity) to train a powerful human parsing network. In practice, two common application scenarios are addressed, termed universal parsing and dedicated parsing, where the former aims to learn homogeneous human representations from multiple label domains and switch predictions by only using different segmentation heads, and the latter aims to learn a specific domain prediction while distilling the semantic knowledge from other domains. The proposed SST has the following appealing benefits: (1) it can capably serve as an effective training scheme to embed semantic associations of human body parts from multiple label domains into the human representation learning process; (2) it is an extensible semantic transfer framework without predetermining the overall relations of multiple label domains, which allows continuously adding human parsing datasets to promote the training. (3) the relevant modules are only used for auxiliary training and can be removed during inference, eliminating the extra reasoning cost. Experimental results demonstrate SST can effectively achieve promising universal human parsing performance as well as impressive improvements compared to its counterparts on three human parsing benchmarks (i.e., PASCAL-Person-Part, ATR, and CIHP). Code is available at https://github.com/yangjie-cv/SST. + + + + GEN: Pushing the Limits of Softmax-Based Out-of-Distribution Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_GEN_Pushing_the_Limits_of_Softmax-Based_Out-of-Distribution_Detection_CVPR_2023_paper.pdf + Out-of-distribution (OOD) detection has been extensively studied in order to successfully deploy neural networks, in particular, for safety-critical applications. Moreover, performing OOD detection on large-scale datasets is closer to reality, but is also more challenging. Several approaches need to either access the training data for score design or expose models to outliers during training. Some post-hoc methods are able to avoid the aforementioned constraints, but are less competitive. 
In this work, we propose Generalized ENtropy score (GEN), a simple but effective entropy-based score function, which can be applied to any pre-trained softmax-based classifier. Its performance is demonstrated on the large-scale ImageNet-1k OOD detection benchmark. It consistently improves the average AUROC across six commonly-used CNN-based and visual transformer classifiers over a number of state-of-the-art post-hoc methods. The average AUROC improvement is at least 3.5%. Furthermore, we used GEN on top of feature-based enhancing methods as well as methods using training statistics to further improve the OOD detection performance. The code is available at: https://github.com/XixiLiu95/GEN. + + + + Learnable Skeleton-Aware 3D Point Cloud Sampling + http://openaccess.thecvf.com//content/CVPR2023/papers/Wen_Learnable_Skeleton-Aware_3D_Point_Cloud_Sampling_CVPR_2023_paper.pdf + Point cloud sampling is crucial for efficient large-scale point cloud analysis, where learning-to-sample methods have recently received increasing attention from the community for jointly training with downstream tasks. However, the above-mentioned task-specific sampling methods usually fail to explore the geometries of objects in an explicit manner. In this paper, we introduce a new skeleton-aware learning-to-sample method by learning object skeletons as the prior knowledge to preserve the object geometry and topology information during sampling. Specifically, without labor-intensive annotations per object category, we first learn category-agnostic object skeletons via the medial axis transform definition in an unsupervised manner. With object skeleton, we then evaluate the histogram of the local feature size as the prior knowledge to formulate skeleton-aware sampling from a probabilistic perspective. Additionally, the proposed skeleton-aware sampling pipeline with the task network is thus end-to-end trainable by exploring the reparameterization trick. Extensive experiments on three popular downstream tasks, point cloud classification, retrieval, and reconstruction, demonstrate the effectiveness of the proposed method for efficient point cloud analysis. + + + + Boundary-Enhanced Co-Training for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Rong_Boundary-Enhanced_Co-Training_for_Weakly_Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + The existing weakly supervised semantic segmentation (WSSS) methods pay much attention to generating accurate and complete class activation maps (CAMs) as pseudo-labels, while ignoring the importance of training the segmentation networks. In this work, we observe that there is an inconsistency between the quality of the pseudo-labels in CAMs and the performance of the final segmentation model, and the mislabeled pixels mainly lie on the boundary areas. Inspired by these findings, we argue that the focus of WSSS should be shifted to robust learning given the noisy pseudo-labels, and further propose a boundary-enhanced co-training (BECO) method for training the segmentation model. To be specific, we first propose to use a co-training paradigm with two interactive networks to improve the learning of uncertain pixels. Then we propose a boundary-enhanced strategy to boost the prediction of difficult boundary areas, which utilizes reliable predictions to construct artificial boundaries. Benefiting from the design of co-training and boundary enhancement, our method can achieve promising segmentation performance for different CAMs. 
Extensive experiments on PASCAL VOC 2012 and MS COCO 2014 validate the superiority of our BECO over other state-of-the-art methods. + + + + Sample-Level Multi-View Graph Clustering + http://openaccess.thecvf.com//content/CVPR2023/papers/Tan_Sample-Level_Multi-View_Graph_Clustering_CVPR_2023_paper.pdf + Multi-view clustering has hitherto been studied due to its effectiveness in dealing with heterogeneous data. Despite the empirical success of recent works, there still exist several severe challenges. Particularly, previous multi-view clustering algorithms seldom consider the topological structure in data, which is essential for clustering data on a manifold. Moreover, existing methods cannot fully guarantee the consistency of local structures between different views as they explore the clustering structure in a view-wise manner. In this paper, we propose to exploit the implied data manifold by learning the topological structure of data. Besides, considering that the consistency of multiple views is manifested in generally similar local structures while the inconsistent structures are in the minority, we further explore the intersections of multiple views at the sample level such that cross-view consistency can be better maintained. We model the above concerns in a unified framework and design an efficient algorithm to solve the corresponding optimization problem. Experimental results on various multi-view datasets certify the effectiveness of the proposed method and verify its superiority over other SOTA approaches. + + + + Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Next3D_Generative_Neural_Texture_Rasterization_for_3D-Aware_Head_Avatars_CVPR_2023_paper.pdf + 3D-aware generative adversarial networks (GANs) synthesize high-fidelity and multi-view-consistent facial images using only collections of single-view 2D imagery. Towards fine-grained control over facial attributes, recent efforts incorporate the 3D Morphable Face Model (3DMM) to describe deformation in generative radiance fields either explicitly or implicitly. Explicit methods provide fine-grained expression control but cannot handle topological changes caused by hair and accessories, while implicit ones can model varied topologies but have limited generalization caused by the unconstrained deformation fields. We propose a novel 3D GAN framework for unsupervised learning of generative, high-quality and 3D-consistent facial avatars from unstructured 2D images. To achieve both deformation accuracy and topological flexibility, we propose a 3D representation called Generative Texture-Rasterized Tri-planes. The proposed representation learns Generative Neural Textures on top of parametric mesh templates and then projects them into three orthogonal-viewed feature planes through rasterization, forming a tri-plane feature representation for volume rendering. In this way, we combine both fine-grained expression control of mesh-guided explicit deformation and the flexibility of implicit volumetric representation. We further propose specific modules for modeling the mouth interior, which is not taken into account by 3DMM. Our method demonstrates state-of-the-art 3D-aware synthesis quality and animation ability through extensive experiments. Furthermore, serving as a 3D prior, our animatable 3D representation boosts multiple applications including one-shot facial avatars and 3D-aware stylization.
+ + + + Linking Garment With Person via Semantically Associated Landmarks for Virtual Try-On + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_Linking_Garment_With_Person_via_Semantically_Associated_Landmarks_for_Virtual_CVPR_2023_paper.pdf + In this paper, a novel virtual try-on algorithm, dubbed SAL-VTON, is proposed, which links the garment with the person via semantically associated landmarks to alleviate misalignment. The semantically associated landmarks are a series of landmark pairs with the same local semantics on the in-shop garment image and the try-on image. Based on the semantically associated landmarks, SAL-VTON effectively models the local semantic association between garment and person, making up for the misalignment in the overall deformation of the garment. The outcome is achieved with a three-stage framework: 1) the semantically associated landmarks are estimated using the landmark localization model; 2) taking the landmarks as input, the warping model explicitly associates the corresponding parts of the garment and person for obtaining the local flow, thus refining the alignment in the global flow; 3) finally, a generator consumes the landmarks to better capture local semantics and control the try-on results. Moreover, we propose a new landmark dataset with a unified labelling rule of landmarks for diverse styles of garments. Extensive experimental results on popular datasets demonstrate that SAL-VTON can handle misalignment and outperform state-of-the-art methods both qualitatively and quantitatively. The dataset is available at https://modelscope.cn/datasets/damo/SAL-HG/summary. + + + + Devil's on the Edges: Selective Quad Attention for Scene Graph Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jung_Devils_on_the_Edges_Selective_Quad_Attention_for_Scene_Graph_CVPR_2023_paper.pdf + Scene graph generation aims to construct a semantic graph structure from an image such that its nodes and edges respectively represent objects and their relationships. One of the major challenges for the task lies in the presence of distracting objects and relationships in images; contextual reasoning is strongly distracted by irrelevant objects or backgrounds and, more importantly, a vast number of irrelevant candidate relations. To tackle the issue, we propose the Selective Quad Attention Network (SQUAT) that learns to select relevant object pairs and disambiguate them via diverse contextual interactions. SQUAT consists of two main components: edge selection and quad attention. The edge selection module selects relevant object pairs, i.e., edges in the scene graph, which helps contextual reasoning, and the quad attention module then updates the edge features using both edge-to-node and edge-to-edge cross-attentions to capture contextual information between objects and object pairs. Experiments demonstrate the strong performance and robustness of SQUAT, achieving the state of the art on the Visual Genome and Open Images v6 benchmarks. + + + + NIFF: Alleviating Forgetting in Generalized Few-Shot Object Detection via Neural Instance Feature Forging + http://openaccess.thecvf.com//content/CVPR2023/papers/Guirguis_NIFF_Alleviating_Forgetting_in_Generalized_Few-Shot_Object_Detection_via_Neural_CVPR_2023_paper.pdf + Privacy and memory are two recurring themes in a broad conversation about the societal impact of AI. These concerns arise from the need for huge amounts of data to train deep neural networks.
A promise of Generalized Few-shot Object Detection (G-FSOD), a learning paradigm in AI, is to alleviate the need for collecting abundant training samples of novel classes we wish to detect by leveraging prior knowledge from old classes (i.e., base classes). G-FSOD strives to learn these novel classes while alleviating catastrophic forgetting of the base classes. However, existing approaches assume that the base images are accessible, an assumption that does not hold when sharing and storing data is problematic. In this work, we propose the first data-free knowledge distillation (DFKD) approach for G-FSOD that leverages the statistics of the region of interest (RoI) features from the base model to forge instance-level features without accessing the base images. Our contribution is three-fold: (1) we design a standalone lightweight generator with (2) class-wise heads (3) to generate and replay diverse instance-level base features to the RoI head while finetuning on the novel data. This stands in contrast to standard DFKD approaches in image classification, which invert the entire network to generate base images. Moreover, we make careful design choices in the novel finetuning pipeline to regularize the model. We show that our approach can dramatically reduce the base memory requirements, all while setting a new standard for G-FSOD on the challenging MS-COCO and PASCAL-VOC benchmarks. + + + + Post-Processing Temporal Action Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Nag_Post-Processing_Temporal_Action_Detection_CVPR_2023_paper.pdf + Existing Temporal Action Detection (TAD) methods typically take a pre-processing step that converts an input video of varying length into a fixed-length snippet representation sequence, before temporal boundary estimation and action classification. This pre-processing step would temporally downsample the video, reducing the inference resolution and hampering the detection performance in the original temporal resolution. In essence, this is due to a temporal quantization error introduced during the resolution downsampling and recovery. This could negatively impact the TAD performance, but is largely ignored by existing methods. To address this problem, in this work we introduce a novel model-agnostic post-processing method without model redesign and retraining. Specifically, we model the start and end points of action instances with a Gaussian distribution to enable temporal boundary inference at a sub-snippet level. We further introduce an efficient Taylor-expansion based approximation, dubbed Gaussian Approximated Post-processing (GAP). Extensive experiments demonstrate that our GAP can consistently improve a wide variety of pre-trained off-the-shelf TAD models on the challenging ActivityNet (+0.2%-0.7% in average mAP) and THUMOS (+0.2%-0.5% in average mAP) benchmarks. Such performance gains are already significant and highly comparable to those achieved by novel model designs. Also, GAP can be integrated with model training for further performance gain. Importantly, GAP enables lower temporal resolutions for more efficient inference, facilitating low-resource applications. The code is available at https://github.com/sauradip/GAP.
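The Taylor-expansion idea mentioned in the GAP abstract above belongs to a well-known family of sub-grid refinement tricks: assume the discrete response is locally Gaussian around its peak and correct the integer argmax with a Newton-style offset from the first and second derivatives of the log response. The sketch below illustrates that generic trick in Python; it is not claimed to be the exact GAP formulation, and the input scores are hypothetical.
import numpy as np

def refine_boundary(scores):
    # refine the argmax of a 1-D boundary score sequence to sub-snippet
    # precision, assuming the scores are roughly Gaussian around the peak
    p = np.clip(np.asarray(scores, dtype=float), 1e-8, None)
    m = int(np.argmax(p))
    if m == 0 or m == len(p) - 1:
        return float(m)                                    # no central difference at the ends
    logp = np.log(p)
    d1 = (logp[m + 1] - logp[m - 1]) / 2.0                 # first derivative (central difference)
    d2 = logp[m + 1] - 2.0 * logp[m] + logp[m - 1]         # second derivative
    if d2 >= 0:
        return float(m)                                    # not a proper peak; keep the integer index
    return m - d1 / d2                                     # Newton-style sub-grid offset

print(refine_boundary([0.05, 0.2, 0.7, 0.5, 0.1]))         # refined location between snippets 2 and 3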
+ + + + ConZIC: Controllable Zero-Shot Image Captioning by Sampling-Based Polishing + http://openaccess.thecvf.com//content/CVPR2023/papers/Zeng_ConZIC_Controllable_Zero-Shot_Image_Captioning_by_Sampling-Based_Polishing_CVPR_2023_paper.pdf + Zero-shot capability has been considered a new revolution in deep learning, letting machines work on tasks without curated training data. As a good start and the only existing outcome of zero-shot image captioning (IC), ZeroCap abandons supervised training and sequentially searches for every word in the caption using the knowledge of large-scale pre-trained models. Though effective, its autoregressive generation and gradient-directed searching mechanism limit the diversity of captions and inference speed, respectively. Moreover, ZeroCap does not consider the controllability issue of zero-shot IC. To move forward, we propose a framework for Controllable Zero-shot IC, named ConZIC. The core of ConZIC is a novel sampling-based non-autoregressive language model named GibbsBERT, which can generate and continuously polish every word. Extensive quantitative and qualitative results demonstrate the superior performance of our proposed ConZIC for both zero-shot IC and controllable zero-shot IC. In particular, ConZIC achieves about 5x faster generation speed than ZeroCap, and about 1.5x higher diversity scores, with accurate generation given different control signals. + + + + Learning From Noisy Labels With Decoupled Meta Label Purifier + http://openaccess.thecvf.com//content/CVPR2023/papers/Tu_Learning_From_Noisy_Labels_With_Decoupled_Meta_Label_Purifier_CVPR_2023_paper.pdf + Training deep neural networks (DNN) with noisy labels is challenging since DNN can easily memorize inaccurate labels, leading to poor generalization ability. Recently, the meta-learning based label correction strategy has been widely adopted to tackle this problem via identifying and correcting potential noisy labels with the help of a small set of clean validation data. Although training with purified labels can effectively improve performance, solving the meta-learning problem inevitably involves a nested loop of bi-level optimization between model weights and hyperparameters (i.e., label distribution). As a compromise, previous methods resort to a coupled learning process with alternating updates. In this paper, we empirically find such simultaneous optimization over both model weights and label distribution cannot achieve an optimal routine, consequently limiting the representation ability of the backbone and the accuracy of corrected labels. From this observation, a novel multi-stage label purifier named DMLP is proposed. DMLP decouples the label correction process into label-free representation learning and a simple meta label purifier. In this way, DMLP can focus on extracting discriminative features and correcting labels in two distinctive stages. DMLP is a plug-and-play label purifier; the purified labels can be directly reused in naive end-to-end network retraining or other robust learning methods, where state-of-the-art results are obtained on several synthetic and real-world noisy datasets, especially under high noise levels.
+ + + + Language in a Bottle: Language Model Guided Concept Bottlenecks for Interpretable Image Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Language_in_a_Bottle_Language_Model_Guided_Concept_Bottlenecks_for_CVPR_2023_paper.pdf + Concept Bottleneck Models (CBM) are inherently interpretable models that factor model decisions into human-readable concepts. They allow people to easily understand why a model is failing, a critical feature for high-stakes applications. CBMs require manually specified concepts and often under-perform their black box counterparts, preventing their broad adoption. We address these shortcomings and are first to show how to construct high-performance CBMs without manual specification of similar accuracy to black box models. Our approach, Language Guided Bottlenecks (LaBo), leverages a language model, GPT-3, to define a large space of possible bottlenecks. Given a problem domain, LaBo uses GPT-3 to produce factual sentences about categories to form candidate concepts. LaBo efficiently searches possible bottlenecks through a novel submodular utility that promotes the selection of discriminative and diverse information. Ultimately, GPT-3's sentential concepts can be aligned to images using CLIP, to form a bottleneck layer. Experiments demonstrate that LaBo is a highly effective prior for concepts important to visual recognition. In the evaluation with 11 diverse datasets, LaBo bottlenecks excel at few-shot classification: they are 11.7% more accurate than black box linear probes at 1 shot and comparable with more data. Overall, LaBo demonstrates that inherently interpretable models can be widely applied at similar, or better, performance than black box approaches. + + + + ViPLO: Vision Transformer Based Pose-Conditioned Self-Loop Graph for Human-Object Interaction Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_ViPLO_Vision_Transformer_Based_Pose-Conditioned_Self-Loop_Graph_for_Human-Object_Interaction_CVPR_2023_paper.pdf + Human-Object Interaction (HOI) detection, which localizes and infers relationships between human and objects, plays an important role in scene understanding. Although two-stage HOI detectors have advantages of high efficiency in training and inference, they suffer from lower performance than one-stage methods due to the old backbone networks and the lack of considerations for the HOI perception process of humans in the interaction classifiers. In this paper, we propose Vision Transformer based Pose-Conditioned Self-Loop Graph (ViPLO) to resolve these problems. First, we propose a novel feature extraction method suitable for the Vision Transformer backbone, called masking with overlapped area (MOA) module. The MOA module utilizes the overlapped area between each patch and the given region in the attention function, which addresses the quantization problem when using the Vision Transformer backbone. In addition, we design a graph with a pose-conditioned self-loop structure, which updates the human node encoding with local features of human joints. This allows the classifier to focus on specific human joints to effectively identify the type of interaction, which is motivated by the human perception process for HOI. As a result, ViPLO achieves the state-of-the-art results on two public benchmarks, especially obtaining a +2.07 mAP performance gain on the HICO-DET dataset. 
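The MOA module described in the ViPLO abstract above is based on the overlapped area between each Vision Transformer patch and a given region. The simplified sketch below computes only that geometric quantity, the fraction of each patch covered by a region box, which could then be used to weight patch tokens; it is an illustration under assumed inputs, not the paper's full attention integration, and all names are hypothetical.
import numpy as np

def patch_overlap_weights(box, grid_h, grid_w, patch_size):
    # box: (x1, y1, x2, y2) region in pixels
    # returns a (grid_h, grid_w) array in [0, 1] with the covered fraction of each patch
    x1, y1, x2, y2 = box
    weights = np.zeros((grid_h, grid_w))
    for i in range(grid_h):
        for j in range(grid_w):
            px1, py1 = j * patch_size, i * patch_size
            px2, py2 = px1 + patch_size, py1 + patch_size
            iw = max(0.0, min(x2, px2) - max(x1, px1))      # horizontal overlap
            ih = max(0.0, min(y2, py2) - max(y1, py1))      # vertical overlap
            weights[i, j] = (iw * ih) / (patch_size * patch_size)
    return weights

# toy usage: a 224x224 image split into 16x16 patches (14x14 grid)
w = patch_overlap_weights((40.0, 40.0, 120.0, 96.0), grid_h=14, grid_w=14, patch_size=16)
print(w.shape, w.max())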
+ + + + MSINet: Twins Contrastive Search of Multi-Scale Interaction for Object ReID + http://openaccess.thecvf.com//content/CVPR2023/papers/Gu_MSINet_Twins_Contrastive_Search_of_Multi-Scale_Interaction_for_Object_ReID_CVPR_2023_paper.pdf + Neural Architecture Search (NAS) has become increasingly appealing to the object Re-Identification (ReID) community, as task-specific architectures significantly improve retrieval performance. Previous works explore new optimizing targets and search spaces for NAS ReID, yet they neglect the difference in training schemes between image classification and ReID. In this work, we propose a novel Twins Contrastive Mechanism (TCM) to provide more appropriate supervision for ReID architecture search. TCM reduces the category overlaps between the training and validation data, and assists NAS in simulating real-world ReID training schemes. We then design a Multi-Scale Interaction (MSI) search space to search for rational interaction operations between multi-scale features. In addition, we introduce a Spatial Alignment Module (SAM) to further enhance attention consistency when confronted with images from different sources. Under the proposed NAS scheme, a specific architecture is automatically searched, named MSINet. Extensive experiments demonstrate that our method surpasses state-of-the-art ReID methods in both in-domain and cross-domain scenarios. + + + + WIRE: Wavelet Implicit Neural Representations + http://openaccess.thecvf.com//content/CVPR2023/papers/Saragadam_WIRE_Wavelet_Implicit_Neural_Representations_CVPR_2023_paper.pdf + Implicit neural representations (INRs) have recently advanced numerous vision-related areas. INR performance depends strongly on the choice of activation function employed in its MLP network. A wide range of nonlinearities have been explored, but, unfortunately, current INRs designed to have high accuracy also suffer from poor robustness (to signal noise, parameter variation, etc.). Inspired by harmonic analysis, we develop a new, highly accurate and robust INR that does not exhibit this tradeoff. Our Wavelet Implicit neural REpresentation (WIRE) uses as its activation function the complex Gabor wavelet that is well-known to be optimally concentrated in space-frequency and to have excellent biases for representing images. A wide range of experiments (image denoising, image inpainting, super-resolution, computed tomography reconstruction, image overfitting, and novel view synthesis with neural radiance fields) demonstrate that WIRE defines the new state of the art in INR accuracy, training time, and robustness. + + + + Bringing Inputs to Shared Domains for 3D Interacting Hands Recovery in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Moon_Bringing_Inputs_to_Shared_Domains_for_3D_Interacting_Hands_Recovery_CVPR_2023_paper.pdf + Despite recent achievements, existing 3D interacting hands recovery methods have shown results mainly in motion capture (MoCap) environments, not in in-the-wild (ITW) ones. This is because collecting 3D interacting hands data in the wild is extremely challenging, even for the 2D data. We present InterWild, which brings MoCap and ITW samples to shared domains for robust 3D interacting hands recovery in the wild with a limited amount of ITW 2D/3D interacting hands data. 3D interacting hands recovery consists of two sub-problems: 1) 3D recovery of each hand and 2) 3D relative translation recovery between two hands.
For the first sub-problem, we bring MoCap and ITW samples to a shared 2D scale space. Although ITW datasets provide a limited amount of 2D/3D interacting hands, they contain large-scale 2D single hand data. Motivated by this, we use a single hand image as an input for the first sub-problem regardless of whether two hands are interacting. Hence, interacting hands of MoCap datasets are brought to the 2D scale space of single hands of ITW datasets. For the second sub-problem, we bring MoCap and ITW samples to a shared appearance-invariant space. Unlike the first sub-problem, 2D labels of ITW datasets are not helpful for the second sub-problem due to the 3D translation's ambiguity. Hence, instead of relying on ITW samples, we amplify the generalizability of MoCap samples by taking only a geometric feature without an image as an input for the second sub-problem. As the geometric feature is invariant to appearances, MoCap and ITW samples do not suffer from a huge appearance gap between the two datasets. The code is available at https://github.com/facebookresearch/InterWild. + + + + Deep Deterministic Uncertainty: A New Simple Baseline + http://openaccess.thecvf.com//content/CVPR2023/papers/Mukhoti_Deep_Deterministic_Uncertainty_A_New_Simple_Baseline_CVPR_2023_paper.pdf + Reliable uncertainty from deterministic single-forward pass models is sought after because conventional methods of uncertainty quantification are computationally expensive. We take two complex single-forward-pass uncertainty approaches, DUQ and SNGP, and examine whether they mainly rely on a well-regularized feature space. Crucially, without using their more complex methods for estimating uncertainty, we find that a single softmax neural net with such a regularized feature space, achieved via residual connections and spectral normalization, outperforms DUQ and SNGP's epistemic uncertainty predictions using simple Gaussian Discriminant Analysis post-training as a separate feature-space density estimator, without fine-tuning on OoD data, feature ensembling, or input pre-processing. Our conceptually simple Deep Deterministic Uncertainty (DDU) baseline can also be used to disentangle aleatoric and epistemic uncertainty and performs as well as Deep Ensembles, the state of the art for uncertainty prediction, on several OoD benchmarks (CIFAR-10/100 vs SVHN/Tiny-ImageNet, ImageNet vs ImageNet-O), active learning settings across different model architectures, as well as in large-scale vision tasks like semantic segmentation, while being computationally cheaper. + + + + NeRDi: Single-View NeRF Synthesis With Language-Guided Diffusion As General Image Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Deng_NeRDi_Single-View_NeRF_Synthesis_With_Language-Guided_Diffusion_As_General_Image_CVPR_2023_paper.pdf + 2D-to-3D reconstruction is an ill-posed problem, yet humans are good at solving this problem due to their prior knowledge of the 3D world developed over years. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representations by minimizing a diffusion loss on its arbitrary view renderings with a pretrained image diffusion model under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning inputs to the diffusion model.
This is essentially helpful for improving multiview content coherence as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method can synthesize novel views with higher quality even compared to existing methods trained on this dataset. We also demonstrate our generalizability in zero-shot NeRF synthesis for in-the-wild images. + + + + InstantAvatar: Learning Avatars From Monocular Video in 60 Seconds + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_InstantAvatar_Learning_Avatars_From_Monocular_Video_in_60_Seconds_CVPR_2023_paper.pdf + In this paper, we take one step further towards real-world applicability of monocular neural avatar reconstruction by contributing InstantAvatar, a system that can reconstruct human avatars from a monocular video within seconds, and these avatars can be animated and rendered at an interactive rate. To achieve this efficiency we propose a carefully designed and engineered system, that leverages emerging acceleration structures for neural fields, in combination with an efficient empty-space skipping strategy for dynamic scenes. We also contribute an efficient implementation that we will make available for research purposes. Compared to existing methods, InstantAvatar converges 130x faster and can be trained in minutes instead of hours. It achieves comparable or even better reconstruction quality and novel pose synthesis results. When given the same time budget, our method significantly outperforms SoTA methods. InstantAvatar can yield acceptable visual quality in as little as 10 seconds training time. For code and more demo results, please refer to https://ait.ethz.ch/InstantAvatar. + + + + You Only Segment Once: Towards Real-Time Panoptic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_You_Only_Segment_Once_Towards_Real-Time_Panoptic_Segmentation_CVPR_2023_paper.pdf + In this paper, we propose YOSO, a real-time panoptic segmentation framework. YOSO predicts masks via dynamic convolutions between panoptic kernels and image feature maps, in which you only need to segment once for both instance and semantic segmentation tasks. To reduce the computational overhead, we design a feature pyramid aggregator for the feature map extraction, and a separable dynamic decoder for the panoptic kernel generation. The aggregator re-parameterizes interpolation-first modules in a convolution-first way, which significantly speeds up the pipeline without any additional costs. The decoder performs multi-head cross-attention via separable dynamic convolution for better efficiency and accuracy. To the best of our knowledge, YOSO is the first real-time panoptic segmentation framework that delivers competitive performance compared to state-of-the-art models. Specifically, YOSO achieves 46.4 PQ, 45.6 FPS on COCO; 52.5 PQ, 22.6 FPS on Cityscapes; 38.0 PQ, 35.4 FPS on ADE20K; and 34.1 PQ, 7.1 FPS on Mapillary Vistas. Code is available at https://github.com/hujiecpp/YOSO. 
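The YOSO abstract above predicts masks via dynamic convolutions between panoptic kernels and image feature maps. The minimal Python sketch below shows the general mechanism only, a per-query 1x1 dynamic convolution implemented with an einsum; it is not the released YOSO code, and all shapes and names here are assumptions for illustration.
import torch

def dynamic_conv_masks(kernels, features):
    # kernels:  (B, Q, C) one 1x1 dynamic kernel per query (panoptic kernel)
    # features: (B, C, H, W) image feature map
    # returns (B, Q, H, W) mask logits
    return torch.einsum('bqc,bchw->bqhw', kernels, features)

# toy usage
B, Q, C, H, W = 2, 100, 256, 64, 64
mask_logits = dynamic_conv_masks(torch.randn(B, Q, C), torch.randn(B, C, H, W))
masks = mask_logits.sigmoid()        # per-query soft masks
print(masks.shape)                   # torch.Size([2, 100, 64, 64])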
+ + + + Robust Single Image Reflection Removal Against Adversarial Attacks + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_Robust_Single_Image_Reflection_Removal_Against_Adversarial_Attacks_CVPR_2023_paper.pdf + This paper addresses the problem of robust deep single-image reflection removal (SIRR) against adversarial attacks. Current deep learning based SIRR methods have shown significant performance degradation due to unnoticeable distortions and perturbations on input images. For a comprehensive robustness study, we first conduct diverse adversarial attacks specifically for the SIRR problem, i.e. towards different attacking targets and regions. Then we propose a robust SIRR model, which integrates the cross-scale attention module, the multi-scale fusion module, and the adversarial image discriminator. By exploiting the multi-scale mechanism, the model narrows the gap between features from clean and adversarial images. The image discriminator adaptively distinguishes clean or noisy inputs, and thus further gains reliable robustness. Extensive experiments on Nature, SIR^2, and Real datasets demonstrate that our model remarkably improves the robustness of SIRR across disparate scenes. + + + + PartMix: Regularization Strategy To Learn Part Discovery for Visible-Infrared Person Re-Identification + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_PartMix_Regularization_Strategy_To_Learn_Part_Discovery_for_Visible-Infrared_Person_CVPR_2023_paper.pdf + Modern data augmentation using a mixture-based technique can regularize the models from overfitting to the training data in various computer vision applications, but a proper data augmentation technique tailored for the part-based Visible-Infrared person Re-IDentification (VI-ReID) models remains unexplored. In this paper, we present a novel data augmentation technique, dubbed PartMix, that synthesizes the augmented samples by mixing the part descriptors across the modalities to improve the performance of part-based VI-ReID models. Especially, we synthesize the positive and negative samples within the same and across different identities and regularize the backbone model through contrastive learning. In addition, we also present an entropy-based mining strategy to weaken the adverse impact of unreliable positive and negative samples. When incorporated into existing part-based VI-ReID model, PartMix consistently boosts the performance. We conduct experiments to demonstrate the effectiveness of our PartMix over the existing VI-ReID methods and provide ablation studies. + + + + Feature Representation Learning With Adaptive Displacement Generation and Transformer Fusion for Micro-Expression Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhai_Feature_Representation_Learning_With_Adaptive_Displacement_Generation_and_Transformer_Fusion_CVPR_2023_paper.pdf + Micro-expressions are spontaneous, rapid and subtle facial movements that can neither be forged nor suppressed. They are very important nonverbal communication clues, but are transient and of low intensity thus difficult to recognize. Recently deep learning based methods have been developed for micro-expression recognition using feature extraction and fusion techniques, however, targeted feature learning and efficient feature fusion still lack further study according to micro-expression characteristics. 
To address these issues, we propose a novel framework, Feature Representation Learning with adaptive Displacement Generation and Transformer fusion (FRL-DGT), in which a convolutional Displacement Generation Module (DGM) with self-supervised learning is used to extract dynamic features targeted to the subsequent ME recognition task. A well-designed Transformer fusion mechanism, composed of Transformer-based local, global, and full-face fusion modules, is then applied to extract multi-level informative features from the output of the DGM for the final micro-expression prediction. Extensive experiments with solid leave-one-subject-out (LOSO) evaluation results have strongly demonstrated the superiority of our proposed FRL-DGT over state-of-the-art methods. + + + + ViewNet: A Novel Projection-Based Backbone With View Pooling for Few-Shot Point Cloud Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_ViewNet_A_Novel_Projection-Based_Backbone_With_View_Pooling_for_Few-Shot_CVPR_2023_paper.pdf + Although different approaches have been proposed for 3D point cloud-related tasks, few-shot learning (FSL) of 3D point clouds still remains under-explored. In FSL, unlike traditional supervised learning, the classes of training and test data do not overlap, and a model needs to recognize unseen classes from only a few samples. Existing FSL methods for 3D point clouds employ point-based models as their backbone. Yet, based on our extensive experiments and analysis, we first show that using a point-based backbone is not the most suitable FSL approach, since (i) a large number of points' features are discarded by the max pooling operation used in 3D point-based backbones, decreasing the ability to represent shape information; (ii) point-based backbones are sensitive to occlusion. To address these issues, we propose employing a projection- and 2D Convolutional Neural Network-based backbone, referred to as the ViewNet, for FSL from 3D point clouds. Our approach first projects a 3D point cloud onto six different views to alleviate the issue of missing points. Also, to generate more descriptive and distinguishing features, we propose View Pooling, which combines different projected plane combinations into five groups and performs max-pooling on each of them. The experiments performed on the ModelNet40, ScanObjectNN and ModelNet40-C datasets, with cross validation, show that our method consistently outperforms the state-of-the-art baselines. Moreover, compared to traditional image classification backbones, such as ResNet, the proposed ViewNet can extract more distinguishing features from multiple views of a point cloud. We also show that ViewNet can be used as a backbone with different FSL heads and provides improved performance compared to traditionally used backbones. + + + + ANetQA: A Large-Scale Benchmark for Fine-Grained Compositional Reasoning Over Untrimmed Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_ANetQA_A_Large-Scale_Benchmark_for_Fine-Grained_Compositional_Reasoning_Over_Untrimmed_CVPR_2023_paper.pdf + Building benchmarks to systematically analyze different capabilities of video question answering (VideoQA) models is challenging yet crucial. Existing benchmarks often use non-compositional simple questions and suffer from language biases, making it difficult to diagnose model weaknesses incisively.
A recent benchmark AGQA poses a promising paradigm to generate QA pairs automatically from pre-annotated scene graphs, enabling it to measure diverse reasoning abilities with granular control. However, its questions have limitations in reasoning about the fine-grained semantics in videos as such information is absent in its scene graphs. To this end, we present ANetQA, a large-scale benchmark that supports fine-grained compositional reasoning over the challenging untrimmed videos from ActivityNet. Similar to AGQA, the QA pairs in ANetQA are automatically generated from annotated video scene graphs. The fine-grained properties of ANetQA are reflected in the following: (i) untrimmed videos with fine-grained semantics; (ii) spatio-temporal scene graphs with fine-grained taxonomies; and (iii) diverse questions generated from fine-grained templates. ANetQA attains 1.4 billion unbalanced and 13.4 million balanced QA pairs, which is an order of magnitude larger than AGQA with a similar number of videos. Comprehensive experiments are performed for state-of-the-art methods. The best model achieves 44.5% accuracy while human performance tops out at 84.5%, leaving sufficient room for improvement. + + + + CLAMP: Prompt-Based Contrastive Learning for Connecting Language and Animal Pose + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_CLAMP_Prompt-Based_Contrastive_Learning_for_Connecting_Language_and_Animal_Pose_CVPR_2023_paper.pdf + Animal pose estimation is challenging for existing image-based methods because of limited training data and large intra- and inter-species variances. Motivated by the progress of visual-language research, we propose that pre-trained language models (eg, CLIP) can facilitate animal pose estimation by providing rich prior knowledge for describing animal keypoints in text. However, we found that building effective connections between pre-trained language models and visual animal keypoints is non-trivial since the gap between text-based descriptions and keypoint-based visual features about animal pose can be significant. To address this issue, we introduce a novel prompt-based Contrastive learning scheme for connecting Language and AniMal Pose (CLAMP) effectively. The CLAMP attempts to bridge the gap by adapting the text prompts to the animal keypoints during network training. The adaptation is decomposed into spatial-aware and feature-aware processes, and two novel contrastive losses are devised correspondingly. In practice, the CLAMP enables the first cross-modal animal pose estimation paradigm. Experimental results show that our method achieves state-of-the-art performance under the supervised, few-shot, and zero-shot settings, outperforming image-based methods by a large margin. The code is available at https://github.com/xuzhang1199/CLAMP. + + + + Standing Between Past and Future: Spatio-Temporal Modeling for Multi-Camera 3D Multi-Object Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Pang_Standing_Between_Past_and_Future_Spatio-Temporal_Modeling_for_Multi-Camera_3D_CVPR_2023_paper.pdf + This work proposes an end-to-end multi-camera 3D multi-object tracking (MOT) framework. It emphasizes spatio-temporal continuity and integrates both past and future reasoning for tracked objects. Thus, we name it "Past-and-Future reasoning for Tracking" (PF-Track). Specifically, our method adapts the "tracking by attention" framework and represents tracked instances coherently over time with object queries. 
To explicitly use historical cues, our "Past Reasoning" module learns to refine the tracks and enhance the object features by cross-attending to queries from previous frames and other objects. The "Future Reasoning" module digests historical information and predicts robust future trajectories. In the case of long-term occlusions, our method maintains the object positions and enables re-association by integrating motion predictions. On the nuScenes dataset, our method improves AMOTA by a large margin and remarkably reduces ID-Switches by 90% compared to prior approaches, which is an order of magnitude less. The code and models are made available at https://github.com/TRI-ML/PF-Track. + + + + TTA-COPE: Test-Time Adaptation for Category-Level Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_TTA-COPE_Test-Time_Adaptation_for_Category-Level_Object_Pose_Estimation_CVPR_2023_paper.pdf + Test-time adaptation methods have been gaining attention recently as a practical solution for addressing source-to-target domain gaps by gradually updating the model without requiring labels on the target data. In this paper, we propose a method of test-time adaptation for category-level object pose estimation called TTA-COPE. We design a pose ensemble approach with a self-training loss using pose-aware confidence. Unlike previous unsupervised domain adaptation methods for category-level object pose estimation, our approach processes the test data in a sequential, online manner, and it does not require access to the source domain at runtime. Extensive experimental results demonstrate that the proposed pose ensemble and the self-training loss improve category-level object pose performance during test time under both semi-supervised and unsupervised settings. + + + + Geometry and Uncertainty-Aware 3D Point Cloud Class-Incremental Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Geometry_and_Uncertainty-Aware_3D_Point_Cloud_Class-Incremental_Semantic_Segmentation_CVPR_2023_paper.pdf + Despite the significant recent progress made on 3D point cloud semantic segmentation, the current methods require training data for all classes at once, and are not suitable for real-life scenarios where new categories are being continuously discovered. Substantial memory storage and expensive re-training is required to update the model to sequentially arriving data for new concepts. In this paper, to continually learn new categories using previous knowledge, we introduce class-incremental semantic segmentation of 3D point cloud. Unlike 2D images, 3D point clouds are disordered and unstructured, making it difficult to store and transfer knowledge especially when the previous data is not available. We further face the challenge of semantic shift, where previous/future classes are indiscriminately collapsed and treated as the background in the current step, causing a dramatic performance drop on past classes. We exploit the structure of point cloud and propose two strategies to address these challenges. First, we design a geometry-aware distillation module that transfers point-wise feature associations in terms of their geometric characteristics. To counter forgetting caused by the semantic shift, we further develop an uncertainty-aware pseudo-labelling scheme that eliminates noise in uncertain pseudo-labels by label propagation within a local neighborhood. 
Our extensive experiments on S3DIS and ScanNet in a class-incremental setting show impressive results comparable to the joint training strategy (upper bound). Code is available at: https://github.com/leolyj/3DPC-CISS + + + + Cooperation or Competition: Avoiding Player Domination for Multi-Target Robustness via Adaptive Budgets + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Cooperation_or_Competition_Avoiding_Player_Domination_for_Multi-Target_Robustness_via_CVPR_2023_paper.pdf + Despite incredible advances, deep learning has been shown to be susceptible to adversarial attacks. Numerous approaches have been proposed to train robust networks both empirically and certifiably. However, most of them defend against only a single type of attack, while recent work has taken steps towards defending against multiple attacks. In this paper, to understand multi-target robustness, we view this problem as a bargaining game in which different players (adversaries) negotiate to reach an agreement on a joint direction of parameter updating. We identify a phenomenon named player domination in the bargaining game, and show that with this phenomenon, some of the existing max-based approaches such as MAX and MSD do not converge. Based on our theoretical results, we design a novel framework that adjusts the budgets of different adversaries to avoid player domination. Experiments on two benchmarks show that applying the proposed framework to existing approaches significantly advances multi-target robustness. + + + + CAT: LoCalization and IdentificAtion Cascade Detection Transformer for Open-World Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_CAT_LoCalization_and_IdentificAtion_Cascade_Detection_Transformer_for_Open-World_Object_CVPR_2023_paper.pdf + Open-world object detection (OWOD), as a more general and challenging goal, requires a model trained from data on known objects to detect both known and unknown objects and incrementally learn to identify these unknown objects. Existing works, which employ a standard detection framework and a fixed pseudo-labelling mechanism (PLM), have the following problems: (i) The inclusion of detecting unknown objects substantially reduces the model's ability to detect known ones. (ii) The PLM does not adequately utilize the prior knowledge of inputs. (iii) The fixed selection manner of the PLM cannot guarantee that the model is trained in the right direction. We observe that humans subconsciously prefer to focus on all foreground objects and then identify each one in detail, rather than localizing and identifying a single object simultaneously, which alleviates confusion. This motivates us to propose a novel solution called CAT: LoCalization and IdentificAtion Cascade Detection Transformer, which decouples the detection process via a shared decoder in a cascade decoding manner. Meanwhile, we propose a self-adaptive pseudo-labelling mechanism that combines model-driven and input-driven PLM and self-adaptively generates robust pseudo-labels for unknown objects, significantly improving the ability of CAT to retrieve unknown objects. Comprehensive experiments on two benchmark datasets, i.e., MS-COCO and PASCAL VOC, show that our model outperforms the state-of-the-art in terms of all metrics in the task of OWOD, incremental object detection (IOD) and open-set detection.
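To make the self-adaptive pseudo-labelling idea above more concrete, here is a minimal, hypothetical sketch (not the paper's implementation): unmatched candidate boxes are scored by combining a model-driven cue (predicted objectness) with an input-driven cue (mean backbone activation inside the box), and the kept set adapts to the per-image score distribution. The function name, the interpolation weight, and the mean-score threshold are illustrative assumptions.

```python
# Hypothetical sketch of self-adaptive pseudo-labelling for unknown objects:
# combine a model-driven score with an input-driven score and keep candidates
# whose combined score exceeds the per-image mean (capped at top_k).
import numpy as np

def select_unknown_pseudo_labels(objectness, saliency, alpha=0.5, top_k=5):
    """objectness: (N,) model-driven confidence of unmatched candidate boxes.
    saliency:   (N,) input-driven cue, e.g. mean feature activation in each box."""
    objectness = np.asarray(objectness, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    # Normalise the input-driven cue to [0, 1] so the two cues are comparable.
    span = saliency.max() - saliency.min()
    s = (saliency - saliency.min()) / (span + 1e-8)
    score = alpha * objectness + (1.0 - alpha) * s
    # Self-adaptive selection: the threshold follows the current image's scores.
    keep = np.where(score > score.mean())[0]
    return keep[np.argsort(-score[keep])][:top_k]

# Toy usage: four candidate boxes, two of which stand out on both cues.
print(select_unknown_pseudo_labels([0.9, 0.2, 0.6, 0.4], [5.0, 1.0, 4.0, 0.5]))
```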
+ + + + TruFor: Leveraging All-Round Clues for Trustworthy Image Forgery Detection and Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Guillaro_TruFor_Leveraging_All-Round_Clues_for_Trustworthy_Image_Forgery_Detection_and_CVPR_2023_paper.pdf + In this paper we present TruFor, a forensic framework that can be applied to a large variety of image manipulation methods, from classic cheapfakes to more recent manipulations based on deep learning. We rely on the extraction of both high-level and low-level traces through a transformer-based fusion architecture that combines the RGB image and a learned noise-sensitive fingerprint. The latter learns to embed the artifacts related to the camera's internal and external processing by training only on real data in a self-supervised manner. Forgeries are detected as deviations from the expected regular pattern that characterizes each pristine image. Looking for anomalies makes the approach able to robustly detect a variety of local manipulations, ensuring generalization. In addition to a pixel-level localization map and a whole-image integrity score, our approach outputs a reliability map that highlights areas where localization predictions may be error-prone. This is particularly important in forensic applications in order to reduce false alarms and allow for large-scale analysis. Extensive experiments on several datasets show that our method is able to reliably detect and localize both cheapfake and deepfake manipulations, outperforming state-of-the-art works. Code is publicly available at https://grip-unina.github.io/TruFor/. + + + + LANA: A Language-Capable Navigator for Instruction Following and Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_LANA_A_Language-Capable_Navigator_for_Instruction_Following_and_Generation_CVPR_2023_paper.pdf + Recently, visual-language navigation (VLN) -- which requires robot agents to follow navigation instructions -- has shown great advances. However, existing literature puts most emphasis on interpreting instructions into actions, only delivering "dumb" wayfinding agents. In this article, we devise LANA, a language-capable navigation agent which is able to not only execute human-written navigation commands, but also provide route descriptions to humans. This is achieved by simultaneously learning instruction following and generation with a single model. More specifically, two encoders, one for route encoding and one for language encoding, are built and shared by two decoders, one for action prediction and one for instruction generation, so as to exploit cross-task knowledge and capture task-specific characteristics. Throughout pretraining and fine-tuning, both instruction following and generation are set as optimization objectives. We empirically verify that, compared with recent advanced task-specific solutions, LANA attains better performance on both instruction following and route description, with nearly half the complexity. In addition, endowed with language generation capability, LANA can explain its behaviors to humans and assist their wayfinding. This work is expected to foster future efforts towards building more trustworthy and socially-intelligent navigation robots. Our code will be released.
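For readers who want a feel for the "one model, two tasks" layout just described, below is a toy PyTorch illustration: a route encoder and a language encoder feed both an action-prediction head (instruction following) and a word-prediction head (instruction generation), and the two losses are summed so both objectives update the shared encoders. All dimensions, heads, and the unweighted loss sum are assumptions for illustration, not LANA's actual architecture.

```python
# Minimal sketch of shared encoders serving two task heads with a joint loss.
import torch
import torch.nn as nn

class TinySharedNavigator(nn.Module):
    def __init__(self, d=128, n_actions=6, vocab=1000):
        super().__init__()
        self.route_enc = nn.Linear(d, d)       # encodes route/observation features
        self.lang_enc = nn.Linear(d, d)        # encodes instruction features
        self.action_head = nn.Linear(2 * d, n_actions)  # following: next action
        self.word_head = nn.Linear(2 * d, vocab)         # generation: next word

    def forward(self, route_feat, lang_feat):
        h = torch.cat([self.route_enc(route_feat), self.lang_enc(lang_feat)], dim=-1)
        return self.action_head(h), self.word_head(h)

model = TinySharedNavigator()
route, lang = torch.randn(4, 128), torch.randn(4, 128)
action_logits, word_logits = model(route, lang)
loss = nn.functional.cross_entropy(action_logits, torch.randint(0, 6, (4,))) \
     + nn.functional.cross_entropy(word_logits, torch.randint(0, 1000, (4,)))
loss.backward()  # both objectives back-propagate through the shared encoders
```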
+ + + + CAPE: Camera View Position Embedding for Multi-View 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiong_CAPE_Camera_View_Position_Embedding_for_Multi-View_3D_Object_Detection_CVPR_2023_paper.pdf + In this paper, we address the problem of detecting 3D objects from multi-view images. Current query-based methods rely on global 3D position embeddings (PE) to learn the geometric correspondence between images and 3D space. We claim that directly interacting 2D image features with global 3D PE could increase the difficulty of learning view transformation due to the variation of camera extrinsics. Thus we propose a novel method based on CAmera view Position Embedding, called CAPE. We form the 3D position embeddings under the local camera-view coordinate system instead of the global coordinate system, such that the 3D position embedding is free of encoding camera extrinsic parameters. Furthermore, we extend our CAPE to temporal modeling by exploiting the object queries of previous frames and encoding the ego motion for boosting 3D object detection. CAPE achieves state-of-the-art performance (61.0% NDS and 52.5% mAP) among all LiDAR-free methods on the standard nuScenes dataset. Code and models are available. + + + + Bi-Directional Distribution Alignment for Transductive Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Bi-Directional_Distribution_Alignment_for_Transductive_Zero-Shot_Learning_CVPR_2023_paper.pdf + It is well-known that zero-shot learning (ZSL) can suffer severely from the problem of domain shift, where the true and learned data distributions for the unseen classes do not match. Although transductive ZSL (TZSL) attempts to improve this by allowing the use of unlabelled examples from the unseen classes, there is still a high level of distribution shift. We propose a novel TZSL model (named Bi-VAEGAN), which largely mitigates the shift through a strengthened distribution alignment between the visual and auxiliary spaces. The key proposals of the model design include (1) a bi-directional distribution alignment, (2) a simple but effective L_2-norm based feature normalization approach, and (3) a more sophisticated unseen class prior estimation approach. In benchmark evaluation using four datasets, Bi-VAEGAN achieves a new state of the art under both the standard and generalized TZSL settings. Code can be found at https://github.com/Zhicaiwww/Bi-VAEGAN. + + + + FlexNeRF: Photorealistic Free-Viewpoint Rendering of Moving Humans From Sparse Views + http://openaccess.thecvf.com//content/CVPR2023/papers/Jayasundara_FlexNeRF_Photorealistic_Free-Viewpoint_Rendering_of_Moving_Humans_From_Sparse_Views_CVPR_2023_paper.pdf + We present FlexNeRF, a method for photorealistic free-viewpoint rendering of humans in motion from monocular videos. Our approach works well with sparse views, which is a challenging scenario when the subject is exhibiting fast/complex motions. We propose a novel approach which jointly optimizes a canonical time and pose configuration, with a pose-dependent motion field and pose-independent temporal deformations complementing each other. Thanks to our novel temporal and cyclic consistency constraints along with additional losses on intermediate representations such as segmentation, our approach provides high-quality outputs as the observed views become sparser.
We empirically demonstrate that our method significantly outperforms the state-of-the-art on public benchmark datasets as well as a self-captured fashion dataset. The project page is available at: https://flex-nerf.github.io/. + + + + Towards Better Gradient Consistency for Neural Signed Distance Functions via Level Set Alignment + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_Towards_Better_Gradient_Consistency_for_Neural_Signed_Distance_Functions_via_CVPR_2023_paper.pdf + Neural signed distance functions (SDFs) have shown remarkable capability in representing geometry with details. However, without signed distance supervision, it is still a challenge to infer SDFs from point clouds or multi-view images using neural networks. In this paper, we claim that gradient consistency in the field, indicated by the parallelism of level sets, is the key factor affecting the inference accuracy. Hence, we propose a level set alignment loss to evaluate the parallelism of level sets, which can be minimized to achieve better gradient consistency. Our novelty lies in that we can align all level sets to the zero level set by constraining gradients at queries and their projections on the zero level set in an adaptive way. Our insight is to propagate the zero level set everywhere in the field through consistent gradients, to eliminate uncertainty in the field that is caused by the discreteness of 3D point clouds or the lack of observations from multi-view images. Our proposed loss is a general term which can be used with different methods to infer SDFs from 3D point clouds and multi-view images. Our numerical and visual comparisons demonstrate that our loss can significantly improve the accuracy of SDFs inferred from point clouds or multi-view images under various benchmarks. Code and data are available at https://github.com/mabaorui/TowardsBetterGradient. + + + + Zero-Shot Everything Sketch-Based Image Retrieval, and in Explainable Style + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Zero-Shot_Everything_Sketch-Based_Image_Retrieval_and_in_Explainable_Style_CVPR_2023_paper.pdf + This paper studies the problem of zero-shot sketch-based image retrieval (ZS-SBIR), albeit with two significant differentiators from prior art: (i) we tackle all variants (inter-category, intra-category, and cross datasets) of ZS-SBIR with just one network ("everything"), and (ii) we would really like to understand how this sketch-photo matching operates ("explainable"). Our key innovation lies in the realization that such a cross-modal matching problem could be reduced to comparisons of groups of key local patches -- akin to the seasoned "bag-of-words" paradigm. Just with this change, we are able to achieve both of the aforementioned goals, with the added benefit of no longer requiring external semantic knowledge. Technically, ours is a transformer-based cross-modal network, with three novel components: (i) a self-attention module with a learnable tokenizer to produce visual tokens that correspond to the most informative local regions, (ii) a cross-attention module to compute local correspondences between the visual tokens across two modalities, and finally (iii) a kernel-based relation network to assemble local putative matches and produce an overall similarity metric for a sketch-photo pair. Experiments show ours indeed delivers superior performance across all ZS-SBIR settings.
The all important explainable goal is elegantly achieved by visualizing cross-modal token correspondences, and for the first time, via sketch to photo synthesis by universal replacement of all matched photo patches. + + + + Graph Representation for Order-Aware Visual Transformation + http://openaccess.thecvf.com//content/CVPR2023/papers/Qiu_Graph_Representation_for_Order-Aware_Visual_Transformation_CVPR_2023_paper.pdf + This paper proposes a new visual reasoning formulation that aims at discovering changes between image pairs and their temporal orders. Recognizing scene dynamics and their chronological orders is a fundamental aspect of human cognition. The aforementioned abilities make it possible to follow step-by-step instructions, reason about and analyze events, recognize abnormal dynamics, and restore scenes to their previous states. However, it remains unclear how well current AI systems perform in these capabilities. Although a series of studies have focused on identifying and describing changes from image pairs, they mainly consider those changes that occur synchronously, thus neglecting potential orders within those changes. To address the above issue, we first propose a visual transformation graph structure for conveying order-aware changes. Then, we benchmarked previous methods on our newly generated dataset and identified the issues of existing methods for change order recognition. Finally, we show a significant improvement in order-aware change recognition by introducing a new model that explicitly associates different changes and then identifies changes and their orders in a graph representation. + + + + StarCraftImage: A Dataset for Prototyping Spatial Reasoning Methods for Multi-Agent Environments + http://openaccess.thecvf.com//content/CVPR2023/papers/Kulinski_StarCraftImage_A_Dataset_for_Prototyping_Spatial_Reasoning_Methods_for_Multi-Agent_CVPR_2023_paper.pdf + Spatial reasoning tasks in multi-agent environments such as event prediction, agent type identification, or missing data imputation are important for multiple applications (e.g., autonomous surveillance over sensor networks and subtasks for reinforcement learning (RL)). StarCraft II game replays encode intelligent (and adversarial) multi-agent behavior and could provide a testbed for these tasks; however, extracting simple and standardized representations for prototyping these tasks is laborious and hinders reproducibility. In contrast, MNIST and CIFAR10, despite their extreme simplicity, have enabled rapid prototyping and reproducibility of ML methods. Following the simplicity of these datasets, we construct a benchmark spatial reasoning dataset based on StarCraft II replays that exhibit complex multi-agent behaviors, while still being as easy to use as MNIST and CIFAR10. Specifically, we carefully summarize a window of 255 consecutive game states to create 3.6 million summary images from 60,000 replays, including all relevant metadata such as game outcome and player races. We develop three formats of decreasing complexity: Hyperspectral images that include one channel for every unit type (similar to multispectral geospatial images), RGB images that mimic CIFAR10, and grayscale images that mimic MNIST. We show how this dataset can be used for prototyping spatial reasoning methods. All datasets, code for extraction, and code for dataset loading can be found at https://starcraftdata.davidinouye.com/. 
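To illustrate the three formats described above, the snippet below mocks how a window of unit placements could be summarized into a hyperspectral tensor (one channel per unit type) and then collapsed into a CIFAR10-like RGB view and an MNIST-like grayscale view. The map resolution, the number of unit types, and the grouping of channels into RGB bands are assumptions for illustration, not the released data layout.

```python
# Hypothetical sketch of summarizing a window of game states into the three
# image formats: hyperspectral (one channel per unit type), RGB, and grayscale.
import numpy as np

H = W = 64          # map resolution (assumed)
N_UNIT_TYPES = 30   # one channel per unit type (assumed)

def summarize_window(unit_events):
    """unit_events: list of (unit_type, y, x) placements over a window of frames."""
    hyper = np.zeros((N_UNIT_TYPES, H, W), dtype=np.float32)
    for unit_type, y, x in unit_events:
        hyper[unit_type, y, x] += 1.0          # accumulate presence counts
    rgb = np.stack([                            # group unit-type channels into 3 bands
        hyper[:10].sum(0), hyper[10:20].sum(0), hyper[20:].sum(0)
    ])
    gray = hyper.sum(0, keepdims=True)          # MNIST-like single channel
    return hyper, rgb, gray

hyper, rgb, gray = summarize_window([(0, 5, 7), (12, 5, 8), (25, 40, 40)])
print(hyper.shape, rgb.shape, gray.shape)  # (30, 64, 64) (3, 64, 64) (1, 64, 64)
```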
+ + + + Quality-Aware Pre-Trained Models for Blind Image Quality Assessment + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Quality-Aware_Pre-Trained_Models_for_Blind_Image_Quality_Assessment_CVPR_2023_paper.pdf + Blind image quality assessment (BIQA) aims to automatically evaluate the perceived quality of a single image, whose performance has been improved by deep learning-based methods in recent years. However, the paucity of labeled data somewhat restrains deep learning-based BIQA methods from unleashing their full potential. In this paper, we propose to solve the problem by a pretext task customized for BIQA in a self-supervised learning manner, which enables learning representations from orders of magnitude more data. To constrain the learning process, we propose a quality-aware contrastive loss based on a simple assumption: the quality of patches from a distorted image should be similar, but should differ from that of patches from the same image with different degradations and of patches from different images. Further, we improve the existing degradation process and form a degradation space with a size of roughly 2x10^7. After being pre-trained on ImageNet using our method, models are more sensitive to image quality and perform significantly better on downstream BIQA tasks. Experimental results show that our method obtains remarkable improvements on popular BIQA datasets. + + + + Network Expansion for Practical Training Acceleration + http://openaccess.thecvf.com//content/CVPR2023/papers/Ding_Network_Expansion_for_Practical_Training_Acceleration_CVPR_2023_paper.pdf + Recently, the sizes of deep neural networks and training datasets have both increased drastically in pursuit of better performance in a practical sense. With the prevalence of transformer-based models in vision tasks, even more pressure is placed on GPU platforms to train these heavy models, which consumes a large amount of time and computing resources. Therefore, it is crucial to accelerate the training process of deep neural networks. In this paper, we propose a general network expansion method to reduce the practical time cost of the model training process. Specifically, we utilize both width- and depth-level sparsity of dense models to accelerate the training of deep neural networks. Firstly, we pick a sparse sub-network from the original dense model by reducing the number of parameters as the starting point of training. Then the sparse architecture gradually expands during the training procedure and finally grows into a dense one. We design different expanding strategies to grow CNNs and ViTs respectively, due to the great heterogeneity between the two architectures. Our method can be easily integrated into popular deep learning frameworks, which saves considerable training time and hardware resources. Extensive experiments show that our acceleration method can significantly speed up the training process of modern vision models on general GPU devices with negligible performance drop (e.g. 1.42x faster for ResNet-101 and 1.34x faster for DeiT-base on ImageNet-1k). The code is available at https://github.com/huawei-noah/Efficient-Computing/tree/master/TrainingAcceleration/NetworkExpansion and https://gitee.com/mindspore/hub/blob/master/mshub_res/assets/noah-cvlab/gpu/1.8/networkexpansion_v1.0_imagenet2012.md.
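A rough sense of the width-level expansion schedule can be conveyed with the toy PyTorch sketch below: training starts with only a fraction of a layer's channels active and the active fraction grows linearly toward the full (dense) layer. The masking mechanism, the 25%-to-100% schedule, and the single toy layer are illustrative assumptions rather than the paper's implementation, which also covers depth-level expansion and architecture-specific strategies for CNNs and ViTs.

```python
# Minimal sketch of growing a sparse sub-network into a dense one during training.
import torch
import torch.nn as nn

class ExpandableBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.fc = nn.Linear(channels, channels)
        self.register_buffer("mask", torch.zeros(channels))  # active-channel mask

    def set_width(self, fraction):
        k = max(1, int(fraction * self.mask.numel()))
        self.mask.zero_()
        self.mask[:k] = 1.0          # activate the first k channels

    def forward(self, x):
        return torch.relu(self.fc(x * self.mask)) * self.mask

block = ExpandableBlock()
for step in range(1, 101):
    block.set_width(0.25 + 0.75 * step / 100)   # 25% width grows to 100% width
    out = block(torch.randn(8, 64))             # training step would follow here
```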
+ + + + FCC: Feature Clusters Compression for Long-Tailed Visual Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_FCC_Feature_Clusters_Compression_for_Long-Tailed_Visual_Recognition_CVPR_2023_paper.pdf + Deep Neural Networks (DNNs) struggle with long-tailed data, since they commonly exhibit an under-representation of minority classes. Various remedies have been proposed to tackle this problem from different perspectives, but they ignore the impact of the density of Backbone Features (BFs) on this issue. Through representation learning, DNNs can map BFs into dense clusters in feature space, while the features of minority classes often show sparse clusters. In practical applications, these features are discretely mapped or even cross the decision boundary, resulting in misclassification. Inspired by this observation, we propose a simple and generic method, namely Feature Clusters Compression (FCC), to increase the density of BFs by compressing backbone feature clusters. The proposed FCC can be easily achieved by only multiplying original BFs by a scaling factor in the training phase, which establishes a linear compression relationship between the original and multiplied features, and forces DNNs to map the former into denser clusters. In the test phase, we feed original features to the classifier without multiplying by the factor, such that BFs of test samples are mapped closer together and do not easily cross the decision boundary. Meanwhile, FCC can be readily combined with existing long-tailed methods and further boost them. We apply FCC to numerous state-of-the-art methods and evaluate them on widely used long-tailed benchmark datasets. Extensive experiments fully verify the effectiveness and generality of our method. Code is available at https://github.com/lijian16/FCC. + + + + Rethinking the Learning Paradigm for Dynamic Facial Expression Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Rethinking_the_Learning_Paradigm_for_Dynamic_Facial_Expression_Recognition_CVPR_2023_paper.pdf + Dynamic Facial Expression Recognition (DFER) is a rapidly developing field that focuses on recognizing facial expressions in video format. Previous research has considered non-target frames as noisy frames, but we propose that DFER should instead be treated as a weakly supervised problem. We also identify the imbalance of short- and long-term temporal relationships in DFER. Therefore, we introduce the Multi-3D Dynamic Facial Expression Learning (M3DFEL) framework, which utilizes Multi-Instance Learning (MIL) to handle inexact labels. M3DFEL generates 3D-instances to model the strong short-term temporal relationship and utilizes 3DCNNs for feature extraction. The Dynamic Long-term Instance Aggregation Module (DLIAM) is then utilized to learn the long-term temporal relationships and dynamically aggregate the instances. Our experiments on DFEW and FERV39K datasets show that M3DFEL outperforms existing state-of-the-art approaches with a vanilla R3D18 backbone. The source code is available at https://github.com/faceeyes/M3DFEL. + + + + Self-Supervised Learning for Multimodal Non-Rigid 3D Shape Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Self-Supervised_Learning_for_Multimodal_Non-Rigid_3D_Shape_Matching_CVPR_2023_paper.pdf + The matching of 3D shapes has been extensively studied for shapes represented as surface meshes, as well as for shapes represented as point clouds.
While point clouds are a common representation of raw real-world 3D data (e.g. from laser scanners), meshes encode rich and expressive topological information, but their creation typically requires some form of (often manual) curation. In turn, methods that purely rely on point clouds are unable to meet the matching quality of mesh-based methods that utilise the additional topological structure. In this work we close this gap by introducing a self-supervised multimodal learning strategy that combines mesh-based functional map regularisation with a contrastive loss that couples mesh and point cloud data. Our shape matching approach allows to obtain intramodal correspondences for triangle meshes, complete point clouds, and partially observed point clouds, as well as correspondences across these data modalities. We demonstrate that our method achieves state-of-the-art results on several challenging benchmark datasets even in comparison to recent supervised methods, and that our method reaches previously unseen cross-dataset generalisation ability. + + + + Ham2Pose: Animating Sign Language Notation Into Pose Sequences + http://openaccess.thecvf.com//content/CVPR2023/papers/Arkushin_Ham2Pose_Animating_Sign_Language_Notation_Into_Pose_Sequences_CVPR_2023_paper.pdf + Translating spoken languages into Sign languages is necessary for open communication between the hearing and hearing-impaired communities. To achieve this goal, we propose the first method for animating a text written in HamNoSys, a lexical Sign language notation, into signed pose sequences. As HamNoSys is universal by design, our proposed method offers a generic solution invariant to the target Sign language. Our method gradually generates pose predictions using transformer encoders that create meaningful representations of the text and poses while considering their spatial and temporal information. We use weak supervision for the training process and show that our method succeeds in learning from partial and inaccurate data. Additionally, we offer a new distance measurement that considers missing keypoints, to measure the distance between pose sequences using DTW-MJE. We validate its correctness using AUTSL, a large-scale Sign language dataset, show that it measures the distance between pose sequences more accurately than existing measurements, and use it to assess the quality of our generated pose sequences. Code for the data pre-processing, the model, and the distance measurement is publicly released for future research. + + + + Open-Set Likelihood Maximization for Few-Shot Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Boudiaf_Open-Set_Likelihood_Maximization_for_Few-Shot_Learning_CVPR_2023_paper.pdf + We tackle the Few-Shot Open-Set Recognition (FSOSR) problem, i.e. classifying instances among a set of classes for which we only have a few labeled samples, while simultaneously detecting instances that do not belong to any known class. We explore the popular transductive setting, which leverages the unlabelled query instances at inference. Motivated by the observation that existing transductive methods perform poorly in open-set scenarios, we propose a generalization of the maximum likelihood principle, in which latent scores down-weighing the influence of potential outliers are introduced alongside the usual parametric model. Our formulation embeds supervision constraints from the support set and additional penalties discouraging overconfident predictions on the query set. 
We proceed with a block-coordinate descent, with the latent scores and parametric model co-optimized alternately, thereby benefiting from each other. We call our resulting formulation Open-Set Likelihood Optimization (OSLO). OSLO is interpretable and fully modular; it can be applied on top of any pre-trained model seamlessly. Through extensive experiments, we show that our method surpasses existing inductive and transductive methods on both aspects of open-set recognition, namely inlier classification and outlier detection. Code is available at https://github.com/ebennequin/few-shot-open-set. + + + + Boosting Accuracy and Robustness of Student Models via Adaptive Adversarial Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Boosting_Accuracy_and_Robustness_of_Student_Models_via_Adaptive_Adversarial_CVPR_2023_paper.pdf + Distilled student models in teacher-student architectures are widely considered for computationally efficient deployment in real-time applications and edge devices. However, student models face a higher risk of encountering adversarial attacks at the edge. Popular enhancing schemes such as adversarial training have limited performance on compressed networks. Thus, recent studies focus on adversarial distillation (AD), which aims to inherit not only the prediction accuracy but also the adversarial robustness of a robust teacher model under the paradigm of robust optimization. In the min-max framework of AD, existing AD methods generally use fixed supervision information from the teacher model to guide the inner optimization for knowledge distillation, which often leads to an overcorrection towards model smoothness. In this paper, we propose an adaptive adversarial distillation (AdaAD) that involves the teacher model in the knowledge optimization process, interacting with the student model to adaptively search for the inner results. Compared with state-of-the-art methods, the proposed AdaAD can significantly boost both the prediction accuracy and adversarial robustness of student models in most scenarios. In particular, the ResNet-18 model trained by AdaAD achieves top-rank performance (54.23% robust accuracy) on RobustBench under AutoAttack. + + + + PixHt-Lab: Pixel Height Based Light Effect Generation for Image Compositing + http://openaccess.thecvf.com//content/CVPR2023/papers/Sheng_PixHt-Lab_Pixel_Height_Based_Light_Effect_Generation_for_Image_Compositing_CVPR_2023_paper.pdf + Lighting effects such as shadows or reflections are key in making synthetic images realistic and visually appealing. To generate such effects, traditional computer graphics uses a physically-based renderer along with 3D geometry. To compensate for the lack of geometry in 2D image compositing, recent deep learning-based approaches introduced a pixel height representation to generate soft shadows and reflections. However, the lack of geometry limits the quality of the generated soft shadows and constrains reflections to pure specular ones. We introduce PixHt-Lab, a system leveraging an explicit mapping from pixel height representation to 3D space. Using this mapping, PixHt-Lab reconstructs both the cutout and background geometry and renders realistic, diverse lighting effects for image compositing. Given a surface with physically-based materials, we can render reflections with varying glossiness. To generate more realistic soft shadows, we further propose to use 3D-aware buffer channels to guide a neural renderer.
Both quantitative and qualitative evaluations demonstrate that PixHt-Lab significantly improves soft shadow generation. + + + + RGB No More: Minimally-Decoded JPEG Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Park_RGB_No_More_Minimally-Decoded_JPEG_Vision_Transformers_CVPR_2023_paper.pdf + Most neural networks for computer vision are designed to infer using RGB images. However, these RGB images are commonly encoded in JPEG before saving to disk; decoding them imposes an unavoidable overhead for RGB networks. Instead, our work focuses on training Vision Transformers (ViT) directly from the encoded features of JPEG. This way, we can avoid most of the decoding overhead, accelerating data loading. Existing works have studied this aspect but they focus on CNNs. Due to how these encoded features are structured, CNNs require heavy modification to their architecture to accept such data. Here, we show that this is not the case for ViTs. In addition, we tackle data augmentation directly on these encoded features, which to our knowledge, has not been explored in-depth for training in this setting. With these two improvements -- ViT and data augmentation -- we show that our ViT-Ti model achieves up to 39.2% faster training and 17.9% faster inference with no accuracy loss compared to the RGB counterpart. + + + + Hybrid Active Learning via Deep Clustering for Video Action Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Rana_Hybrid_Active_Learning_via_Deep_Clustering_for_Video_Action_Detection_CVPR_2023_paper.pdf + In this work, we focus on reducing the annotation cost for video action detection, which requires costly frame-wise dense annotations. We study a novel hybrid active learning (AL) strategy which performs efficient labeling using both intra-sample and inter-sample selection. The intra-sample selection leads to labeling of fewer frames in a video, as opposed to inter-sample selection, which operates at the video level. This hybrid strategy reduces the annotation cost from two different aspects, leading to significant labeling cost reduction. The proposed approach utilizes Clustering-Aware Uncertainty Scoring (CLAUS), a novel label acquisition strategy which relies on both informativeness and diversity for sample selection. We also propose a novel Spatio-Temporal Weighted (STeW) loss formulation, which helps in model training under limited annotations. The proposed approach is evaluated on the UCF-101-24 and J-HMDB-21 datasets, demonstrating its effectiveness in significantly reducing the annotation cost, where it consistently outperforms other baselines. Project details available at https://sites.google.com/view/activesparselabeling/home + + + + Fine-Grained Image-Text Matching by Cross-Modal Hard Aligning Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Pan_Fine-Grained_Image-Text_Matching_by_Cross-Modal_Hard_Aligning_Network_CVPR_2023_paper.pdf + Current state-of-the-art image-text matching methods implicitly align the visual-semantic fragments, like regions in images and words in sentences, and adopt a cross-attention mechanism to discover fine-grained cross-modal semantic correspondence. However, the cross-attention mechanism may bring redundant or irrelevant region-word alignments, degrading retrieval accuracy and limiting efficiency. Although many researchers have made progress in mining meaningful alignments and thus improving accuracy, the problem of poor efficiency remains unresolved.
In this work, we propose to learn fine-grained image-text matching from the perspective of information coding. Specifically, we suggest a coding framework to explain the fragment aligning process, which provides a novel view to reexamine the cross-attention mechanism and analyze the problem of redundant alignments. Based on this framework, a Cross-modal Hard Aligning Network (CHAN) is designed, which comprehensively exploits the most relevant region-word pairs and eliminates all other alignments. Extensive experiments conducted on two public datasets, MS-COCO and Flickr30K, verify that the relevance of the most associated word-region pairs is discriminative enough as an indicator of the image-text similarity, with superior accuracy and efficiency over the state-of-the-art approaches on the bidirectional image and text retrieval tasks. Our code will be available at https://github.com/ppanzx/CHAN. + + + + Sparsifiner: Learning Sparse Instance-Dependent Attention for Efficient Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Sparsifiner_Learning_Sparse_Instance-Dependent_Attention_for_Efficient_Vision_Transformers_CVPR_2023_paper.pdf + Vision Transformers (ViT) have shown competitive advantages in terms of performance compared to convolutional neural networks (CNNs), though they often come with high computational costs. To this end, previous methods explore different attention patterns by limiting attention to a fixed number of spatially nearby tokens to accelerate the ViT's multi-head self-attention (MHSA) operations. However, such structured attention patterns limit the token-to-token connections to their spatial relevance, which disregards learned semantic connections from a full attention mask. In this work, we propose an approach to learn instance-dependent attention patterns, by devising a lightweight connectivity predictor module that estimates the connectivity score of each pair of tokens. Intuitively, two tokens have high connectivity scores if the features are considered relevant either spatially or semantically. As each token only attends to a small number of other tokens, the binarized connectivity masks are often very sparse by nature and therefore provide the opportunity to reduce network FLOPs via sparse computations. Equipped with the learned unstructured attention pattern, sparse attention ViT (Sparsifiner) produces a superior Pareto frontier between FLOPs and top-1 accuracy on ImageNet compared to token sparsity. Our method reduces the FLOPs of MHSA by 48% to 69% while the accuracy drop is within 0.4%. We also show that combining attention and token sparsity reduces ViT FLOPs by over 60%. + + + + Structured Sparsity Learning for Efficient Video Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Xia_Structured_Sparsity_Learning_for_Efficient_Video_Super-Resolution_CVPR_2023_paper.pdf + The high computational costs of video super-resolution (VSR) models hinder their deployment on resource-limited devices, e.g., smartphones and drones. Existing VSR models contain considerable redundant filters, which drag down the inference efficiency. To prune these unimportant filters, we develop a structured pruning scheme called Structured Sparsity Learning (SSL) according to the properties of VSR. In SSL, we design pruning schemes for several key components in VSR models, including residual blocks, recurrent networks, and upsampling networks.
Specifically, we develop a Residual Sparsity Connection (RSC) scheme for residual blocks of recurrent networks to liberate pruning restrictions and preserve the restoration information. For upsampling networks, we design a pixel-shuffle pruning scheme to guarantee the accuracy of feature channel-space conversion. In addition, we observe that pruning error would be amplified as the hidden states propagate along with recurrent networks. To alleviate the issue, we design Temporal Finetuning (TF). Extensive experiments show that SSL can significantly outperform recent methods quantitatively and qualitatively. The code is available at https://github.com/Zj-BinXia/SSL. + + + + "Seeing" Electric Network Frequency From Events + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Seeing_Electric_Network_Frequency_From_Events_CVPR_2023_paper.pdf + Most of the artificial lights fluctuate in response to the grid's alternating current and exhibit subtle variations in terms of both intensity and spectrum, providing the potential to estimate the Electric Network Frequency (ENF) from conventional frame-based videos. Nevertheless, the performance of Video-based ENF (V-ENF) estimation largely relies on the imaging quality and thus may suffer from significant interference caused by non-ideal sampling, motion, and extreme lighting conditions. In this paper, we show that the ENF can be extracted without the above limitations from a new modality provided by the so-called event camera, a neuromorphic sensor that encodes the light intensity variations and asynchronously emits events with extremely high temporal resolution and high dynamic range. Specifically, we first formulate and validate the physical mechanism for the ENF captured in events, and then propose a simple yet robust Event-based ENF (E-ENF) estimation method through mode filtering and harmonic enhancement. Furthermore, we build an Event-Video ENF Dataset (EV-ENFD) that records both events and videos in diverse scenes. Extensive experiments on EV-ENFD demonstrate that our proposed E-ENF method can extract more accurate ENF traces, outperforming the conventional V-ENF by a large margin, especially in challenging environments with object motions and extreme lighting conditions. The code and dataset are available at https://github.com/xlx-creater/E-ENF. + + + + MMVC: Learned Multi-Mode Video Compression With Block-Based Prediction Mode Selection and Density-Adaptive Entropy Coding + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_MMVC_Learned_Multi-Mode_Video_Compression_With_Block-Based_Prediction_Mode_Selection_CVPR_2023_paper.pdf + Learning-based video compression has been extensively studied over the past years, but it still has limitations in adapting to various motion patterns and entropy models. In this paper, we propose multi-mode video compression (MMVC), a block wise mode ensemble deep video compression framework that selects the optimal mode for feature domain prediction adapting to different motion patterns. Proposed multi-modes include ConvLSTM-based feature domain prediction, optical flow conditioned feature domain prediction, and feature propagation to address a wide range of cases from static scenes without apparent motions to dynamic scenes with a moving camera. We partition the feature space into blocks for temporal prediction in spatial block-based representations. 
For entropy coding, we consider both dense and sparse post-quantization residual blocks, and apply optional run-length coding to sparse residuals to improve the compression rate. In this sense, our method uses a dual-mode entropy coding scheme guided by a binary density map, which offers a significant rate reduction that surpasses the extra cost of transmitting the binary selection map. We validate our scheme with some of the most popular benchmark datasets. Compared with state-of-the-art video compression schemes and standard codecs, our method yields better or competitive results measured with PSNR and MS-SSIM. + + + + Omni Aggregation Networks for Lightweight Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Omni_Aggregation_Networks_for_Lightweight_Image_Super-Resolution_CVPR_2023_paper.pdf + While the lightweight ViT framework has made tremendous progress in image super-resolution, its uni-dimensional self-attention modeling, as well as its homogeneous aggregation scheme, prevents its effective receptive field (ERF) from including more comprehensive interactions from both the spatial and channel dimensions. To tackle these drawbacks, this work proposes two enhanced components under a new Omni-SR architecture. First, an Omni Self-Attention (OSA) paradigm is proposed based on the dense interaction principle, which can simultaneously model pixel-interaction from both spatial and channel dimensions, mining the potential correlations across omni-axis (i.e., spatial and channel). Coupled with mainstream window partitioning strategies, OSA can achieve superior performance with compelling computational budgets. Second, a multi-scale interaction scheme is proposed to mitigate sub-optimal ERF (i.e., premature saturation) in shallow models, which facilitates local propagation and meso-/global-scale interactions, rendering an omni-scale aggregation building block. Extensive experiments demonstrate that Omni-SR achieves record-high performance on lightweight super-resolution benchmarks (e.g., 26.95dB@Urban100 x4 with only 792K parameters). Our code is available at https://github.com/Francis0625/Omni-SR. + + + + Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Exploring_the_Effect_of_Primitives_for_Compositional_Generalization_in_Vision-and-Language_CVPR_2023_paper.pdf + Compositionality is one of the fundamental properties of human cognition (Fodor & Pylyshyn, 1988). Compositional generalization is critical to simulate the compositional capability of humans, and has received much attention in the vision-and-language (V&L) community. It is essential to understand the effect of the primitives, including words, image regions, and video frames, to improve the compositional generalization capability. In this paper, we explore the effect of primitives for compositional generalization in V&L. Specifically, we present a self-supervised learning based framework that equips V&L methods with two characteristics: semantic equivariance and semantic invariance. With the two characteristics, the methods understand primitives by perceiving the effect of primitive changes on sample semantics and ground-truth. Experimental results on two tasks, temporal video grounding and visual question answering, demonstrate the effectiveness of our framework.
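The two characteristics just described can be made concrete with a small, hypothetical loss sketch: an embedding should stay close to itself under a semantics-preserving perturbation (semantic invariance) and should move by at least a margin when a primitive such as a word or image region is changed (semantic equivariance). The cosine-distance formulation, the margin value, and the toy "views" below are assumptions for illustration only, not the paper's objective.

```python
# Toy sketch of pairing an invariance term with a margin-based equivariance term.
import torch
import torch.nn.functional as F

def primitive_losses(enc_orig, enc_prim_changed, enc_preserving, margin=0.5):
    """All inputs are (B, D) embeddings of the same samples under different views."""
    # Invariance: embeddings of semantics-preserving views should agree.
    inv = 1.0 - F.cosine_similarity(enc_orig, enc_preserving, dim=-1)
    # Equivariance: changing a primitive should move the embedding by at least `margin`.
    equi = F.relu(margin - (1.0 - F.cosine_similarity(enc_orig, enc_prim_changed, dim=-1)))
    return inv.mean() + equi.mean()

z = torch.randn(4, 128)
loss = primitive_losses(z, z + torch.randn_like(z), z + 0.01 * torch.randn_like(z))
print(float(loss))
```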
+ + + + DC2: Dual-Camera Defocus Control by Learning To Refocus + http://openaccess.thecvf.com//content/CVPR2023/papers/Alzayer_DC2_Dual-Camera_Defocus_Control_by_Learning_To_Refocus_CVPR_2023_paper.pdf + Smartphone cameras today are increasingly approaching the versatility and quality of professional cameras through a combination of hardware and software advancements. However, fixed aperture remains a key limitation, preventing users from controlling the depth of field (DoF) of captured images. At the same time, many smartphones now have multiple cameras with different fixed apertures - specifically, an ultra-wide camera with wider field of view and deeper DoF and a higher resolution primary camera with shallower DoF. In this work, we propose DC^2, a system for defocus control for synthetically varying camera aperture, focus distance and arbitrary defocus effects by fusing information from such a dual-camera system. Our key insight is to leverage real-world smartphone camera dataset by using image refocus as a proxy task for learning to control defocus. Quantitative and qualitative evaluations on real-world data demonstrate our system's efficacy where we outperform state-of-the-art on defocus deblurring, bokeh rendering, and image refocus. Finally, we demonstrate creative post-capture defocus control enabled by our method, including tilt-shift and content-based defocus effects. + + + + Looking Through the Glass: Neural Surface Reconstruction Against High Specular Reflections + http://openaccess.thecvf.com//content/CVPR2023/papers/Qiu_Looking_Through_the_Glass_Neural_Surface_Reconstruction_Against_High_Specular_CVPR_2023_paper.pdf + Neural implicit methods have achieved high-quality 3D object surfaces under slight specular highlights. However, high specular reflections (HSR) often appear in front of target objects when we capture them through glasses. The complex ambiguity in these scenes violates the multi-view consistency, then makes it challenging for recent methods to reconstruct target objects correctly. To remedy this issue, we present a novel surface reconstruction framework, NeuS-HSR, based on implicit neural rendering. In NeuS-HSR, the object surface is parameterized as an implicit signed distance function (SDF). To reduce the interference of HSR, we propose decomposing the rendered image into two appearances: the target object and the auxiliary plane. We design a novel auxiliary plane module by combining physical assumptions and neural networks to generate the auxiliary plane appearance. Extensive experiments on synthetic and real-world datasets demonstrate that NeuS-HSR outperforms state-of-the-art approaches for accurate and robust target surface reconstruction against HSR. + + + + PartSLIP: Low-Shot Part Segmentation for 3D Point Clouds via Pretrained Image-Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_PartSLIP_Low-Shot_Part_Segmentation_for_3D_Point_Clouds_via_Pretrained_CVPR_2023_paper.pdf + Generalizable 3D part segmentation is important but challenging in vision and robotics. Training deep models via conventional supervised methods requires large-scale 3D datasets with fine-grained part annotations, which are costly to collect. This paper explores an alternative way for low-shot part segmentation of 3D point clouds by leveraging a pretrained image-language model, GLIP, which achieves superior performance on open-vocabulary 2D detection. 
We transfer the rich knowledge from 2D to 3D through GLIP-based part detection on point cloud rendering and a novel 2D-to-3D label lifting algorithm. We also utilize multi-view 3D priors and few-shot prompt tuning to boost performance significantly. Extensive evaluation on the PartNet and PartNet-Mobility datasets shows that our method enables excellent zero-shot 3D part segmentation. Our few-shot version not only outperforms existing few-shot approaches by a large margin but also achieves highly competitive results compared to the fully supervised counterpart. Furthermore, we demonstrate that our method can be directly applied to iPhone-scanned point clouds without significant domain gaps. + + + + MAGVLT: Masked Generative Vision-and-Language Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_MAGVLT_Masked_Generative_Vision-and-Language_Transformer_CVPR_2023_paper.pdf + While generative modeling on multimodal image-text data has been actively developed with large-scale paired datasets, there have been limited attempts to generate both image and text data with a single model, rather than generating one fixed modality conditioned on the other modality. In this paper, we explore a unified generative vision-and-language (VL) model that can produce both images and text sequences. In particular, we propose a generative VL transformer based on non-autoregressive mask prediction, named MAGVLT, and compare it with an autoregressive generative VL transformer (ARGVLT). In comparison to ARGVLT, the proposed MAGVLT enables bidirectional context encoding, fast decoding by parallel token predictions in an iterative refinement, and extended editing capabilities such as image and text infilling. For rigorous training of our MAGVLT with image-text pairs from scratch, we combine the image-to-text, text-to-image, and joint image-and-text mask prediction tasks. Moreover, we devise two additional tasks based on the step-unrolled mask prediction and the selective prediction on the mixture of two image-text pairs. Experimental results on various downstream generation tasks of VL benchmarks show that our MAGVLT outperforms ARGVLT by a large margin even with significant inference speedup. Particularly, MAGVLT achieves competitive results on both zero-shot image-to-text and text-to-image generation tasks on MS-COCO with a single moderate-sized model (fewer than 500M parameters), even without the use of monomodal data and networks. + + + + Decoupling Human and Camera Motion From Videos in the Wild + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_Decoupling_Human_and_Camera_Motion_From_Videos_in_the_Wild_CVPR_2023_paper.pdf + We propose a method to reconstruct global human trajectories from videos in the wild. Our optimization method decouples the camera and human motion, which allows us to place people in the same world coordinate frame. Most existing methods do not model the camera motion; methods that rely on the background pixels to infer 3D human motion usually require a full scene reconstruction, which is often not possible for in-the-wild videos. However, even when existing SLAM systems cannot recover accurate scene reconstructions, the background pixel motion still provides enough signal to constrain the camera motion. We show that relative camera estimates along with data-driven human motion priors can resolve the scene scale ambiguity and recover global human trajectories.
Our method robustly recovers the global 3D trajectories of people in challenging in-the-wild videos, such as PoseTrack. We quantify our improvement over existing methods on 3D human dataset Egobody. We further demonstrate that our recovered camera scale allows us to reason about motion of multiple people in a shared coordinate frame, which improves performance of downstream tracking in PoseTrack. Code and additional results can be found at https://vye16.github.io/slahmr/. + + + + DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-Training via Word-Region Alignment + http://openaccess.thecvf.com//content/CVPR2023/papers/Yao_DetCLIPv2_Scalable_Open-Vocabulary_Object_Detection_Pre-Training_via_Word-Region_Alignment_CVPR_2023_paper.pdf + This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo labeling process, DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with a hybrid supervision from detection, grounding and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13x more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2 with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully-supervised counterpart by a large margin. + + + + GrowSP: Unsupervised Semantic Segmentation of 3D Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_GrowSP_Unsupervised_Semantic_Segmentation_of_3D_Point_Clouds_CVPR_2023_paper.pdf + We study the problem of 3D semantic segmentation from raw point clouds. Unlike existing methods which primarily rely on a large amount of human annotations for training neural networks, we propose the first purely unsupervised method, called GrowSP, to successfully identify complex semantic classes for every point in 3D scenes, without needing any type of human labels or pretrained models. The key to our approach is to discover 3D semantic elements via progressive growing of superpoints. Our method consists of three major components, 1) the feature extractor to learn per-point features from input point clouds, 2) the superpoint constructor to progressively grow the sizes of superpoints, and 3) the semantic primitive clustering module to group superpoints into semantic elements for the final semantic segmentation. We extensively evaluate our method on multiple datasets, demonstrating superior performance over all unsupervised baselines and approaching the classic fully supervised PointNet. We hope our work could inspire more advanced methods for unsupervised 3D semantic learning. 
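As a rough illustration of the pipeline described above, the numpy sketch below pools mocked per-point features into superpoints, repeatedly merges each superpoint with its nearest neighbour in feature space to mimic progressive growing, and finally clusters the grown superpoints into a small set of semantic primitives with a few k-means steps. The random features, the merge rule, and the number of primitives are assumptions; GrowSP itself learns the features and grows superpoints under geometric constraints.

```python
# Hypothetical numpy-only sketch: pool point features into superpoints, grow the
# superpoints by nearest-neighbour merging in feature space, then cluster the
# grown superpoints into semantic primitives with a few k-means iterations.
import numpy as np

rng = np.random.default_rng(0)
point_feat = rng.normal(size=(1000, 16))          # per-point features (mocked, not learned)
superpoint_id = rng.integers(0, 100, size=1000)   # initial over-segmentation (mocked)

def pool(feat, sp_id):
    ids = np.unique(sp_id)
    return ids, np.stack([feat[sp_id == i].mean(0) for i in ids])

def grow_once(sp_id, feat):
    """Merge every superpoint with its nearest superpoint in feature space."""
    ids, sp_feat = pool(feat, sp_id)
    dist = np.linalg.norm(sp_feat[:, None] - sp_feat[None], axis=-1)
    np.fill_diagonal(dist, np.inf)
    nearest = ids[dist.argmin(1)]
    merged = {i: min(i, j) for i, j in zip(ids, nearest)}   # crude union step
    return np.array([merged[i] for i in sp_id])

for _ in range(3):                                 # progressively grow superpoints
    superpoint_id = grow_once(superpoint_id, point_feat)

ids, sp_feat = pool(point_feat, superpoint_id)
k = min(13, len(sp_feat))                          # number of semantic primitives (assumed)
centers = sp_feat[rng.choice(len(sp_feat), size=k, replace=False)]
for _ in range(10):                                # a few k-means refinement steps
    labels = np.linalg.norm(sp_feat[:, None] - centers[None], axis=-1).argmin(1)
    for c in range(k):
        if np.any(labels == c):
            centers[c] = sp_feat[labels == c].mean(0)
print(len(ids), "superpoints grouped into", len(np.unique(labels)), "primitives")
```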
+ + + + One-Stage 3D Whole-Body Mesh Recovery With Component Aware Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_One-Stage_3D_Whole-Body_Mesh_Recovery_With_Component_Aware_Transformer_CVPR_2023_paper.pdf + Whole-body mesh recovery aims to estimate the 3D human body, face, and hands parameters from a single image. It is challenging to perform this task with a single network due to resolution issues, i.e., the face and hands are usually located in extremely small regions. Existing works usually detect hands and faces, enlarge their resolution to feed in a specific network to predict the parameter, and finally fuse the results. While this copy-paste pipeline can capture the fine-grained details of the face and hands, the connections between different parts cannot be easily recovered in late fusion, leading to implausible 3D rotation and unnatural pose. In this work, we propose a one-stage pipeline for expressive whole-body mesh recovery, named OSX, without separate networks for each part. Specifically, we design a Component Aware Transformer (CAT) composed of a global body encoder and a local face/hand decoder. The encoder predicts the body parameters and provides a high-quality feature map for the decoder, which performs a feature-level upsample-crop scheme to extract high-resolution part-specific features and adopt keypoint-guided deformable attention to estimate hand and face precisely. The whole pipeline is simple yet effective without any manual post-processing and naturally avoids implausible prediction. Comprehensive experiments demonstrate the effectiveness of OSX. Lastly, we build a large-scale Upper-Body dataset (UBody) with high-quality 2D and 3D whole-body annotations. It contains persons with partially visible bodies in diverse real-life scenarios to bridge the gap between the basic task and downstream applications. + + + + Masked Jigsaw Puzzle: A Versatile Position Embedding for Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Ren_Masked_Jigsaw_Puzzle_A_Versatile_Position_Embedding_for_Vision_Transformers_CVPR_2023_paper.pdf + Position Embeddings (PEs), an arguably indispensable component in Vision Transformers (ViTs), have been shown to improve the performance of ViTs on many vision tasks. However, PEs have a potentially high risk of privacy leakage since the spatial information of the input patches is exposed. This caveat naturally raises a series of interesting questions about the impact of PEs on accuracy, privacy, prediction consistency, etc. To tackle these issues, we propose a Masked Jigsaw Puzzle (MJP) position embedding method. In particular, MJP first shuffles the selected patches via our block-wise random jigsaw puzzle shuffle algorithm, and their corresponding PEs are occluded. Meanwhile, for the non-occluded patches, the PEs remain the original ones but their spatial relation is strengthened via our dense absolute localization regressor. The experimental results reveal that 1) PEs explicitly encode the 2D spatial relationship and lead to severe privacy leakage problems under gradient inversion attack; 2) Training ViTs with the naively shuffled patches can alleviate the problem, but it harms the accuracy; 3) Under a certain shuffle ratio, the proposed MJP not only boosts the performance and robustness on large-scale datasets (i.e., ImageNet-1K and ImageNet-C, -A/O) but also improves the privacy preservation ability under typical gradient attacks by a large margin. 
The source code and trained models are available at https://github.com/yhlleo/MJP. + + + + LayoutDiffusion: Controllable Diffusion Model for Layout-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_LayoutDiffusion_Controllable_Diffusion_Model_for_Layout-to-Image_Generation_CVPR_2023_paper.pdf + Recently, diffusion models have achieved great success in image synthesis. However, when it comes to the layout-to-image generation where an image often has a complex scene of multiple objects, how to make strong control over both the global layout map and each detailed object remains a challenging task. In this paper, we propose a diffusion model named LayoutDiffusion that can obtain higher generation quality and greater controllability than the previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout to fuse with the normal layout in a unified form. Moreover, Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationship among multiple objects and designed to be object-aware and position-sensitive, allowing for precisely controlling the spatial related information. Extensive experiments show that our LayoutDiffusion outperforms the previous SOTA methods on FID, CAS by relatively 46.35%, 26.70% on COCO-stuff and 44.29%, 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion. + + + + DISC: Learning From Noisy Labels via Dynamic Instance-Specific Selection and Correction + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_DISC_Learning_From_Noisy_Labels_via_Dynamic_Instance-Specific_Selection_and_CVPR_2023_paper.pdf + Existing studies indicate that deep neural networks (DNNs) can eventually memorize the label noise. We observe that the memorization strength of DNNs towards each instance is different and can be represented by the confidence value, which becomes larger and larger during the training process. Based on this, we propose a Dynamic Instance-specific Selection and Correction method (DISC) for learning from noisy labels (LNL). We first use a two-view-based backbone for image classification, obtaining confidence for each image from two views. Then we propose a dynamic threshold strategy for each instance, based on the momentum of each instance's memorization strength in previous epochs to select and correct noisy labeled data. Benefiting from the dynamic threshold strategy and two-view learning, we can effectively group each instance into one of the three subsets (i.e., clean, hard, and purified) based on the prediction consistency and discrepancy by two views at each epoch. Finally, we employ different regularization strategies to conquer subsets with different degrees of label noise, improving the whole network's robustness. Comprehensive evaluations on three controllable and four real-world LNL benchmarks show that our method outperforms the state-of-the-art (SOTA) methods to leverage useful information in noisy data while alleviating the pollution of label noise. + + + + Imagen Editor and EditBench: Advancing and Evaluating Text-Guided Image Inpainting + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Imagen_Editor_and_EditBench_Advancing_and_Evaluating_Text-Guided_Image_Inpainting_CVPR_2023_paper.pdf + Text-guided image editing can have a transformative impact in supporting creative applications. 
A key challenge is to generate edits that are faithful to the input text prompt, while consistent with the input image. We present Imagen Editor, a cascaded diffusion model, built by fine-tuning Imagen on text-guided image inpainting. Imagen Editor's edits are faithful to the text prompts, which is accomplished by incorporating object detectors for proposing inpainting masks during training. In addition, text-guided image inpainting captures fine details in the input image by conditioning the cascaded pipeline on the original high resolution image. To improve qualitative and quantitative evaluation, we introduce EditBench, a systematic benchmark for text-guided image inpainting. EditBench evaluates inpainting edits on natural and generated images exploring objects, attributes, and scenes. Through extensive human evaluation on EditBench, we find that object-masking during training leads to across-the-board improvements in text-image alignment -- such that Imagen Editor is preferred over DALL-E 2 and Stable Diffusion -- and, as a cohort, these models are better at object-rendering than text-rendering, and handle material/color/size attributes better than count/shape attributes. + + + + Text With Knowledge Graph Augmented Transformer for Video Captioning + http://openaccess.thecvf.com//content/CVPR2023/papers/Gu_Text_With_Knowledge_Graph_Augmented_Transformer_for_Video_Captioning_CVPR_2023_paper.pdf + Video captioning aims to describe the content of videos using natural language. Although significant progress has been made, there is still much room to improve the performance for real-world applications, mainly due to the long-tail and open set issues of words. In this paper, we propose a text with knowledge graph augmented transformer (TextKG) for video captioning. Notably, TextKG is a two-stream transformer, formed by the external stream and internal stream. The external stream is designed to absorb external knowledge, which models the interactions between the external knowledge, e.g., pre-built knowledge graph, and the built-in information of videos, e.g., the salient object regions, speech transcripts, and video captions, to mitigate the open set of words challenge. Meanwhile, the internal stream is designed to exploit the multi-modality information in original videos (e.g., the appearance of video frames, speech transcripts, and video captions) to deal with the long-tail issue. In addition, the cross attention mechanism is also used in both streams to share information. In this way, the two streams can help each other for more accurate results. Extensive experiments conducted on four challenging video captioning datasets, i.e., YouCookII, ActivityNet Captions, MSR-VTT, and MSVD, demonstrate that the proposed method performs favorably against the state-of-the-art methods. Specifically, the proposed TextKG method outperforms the best published results by improving 18.7% absolute CIDEr scores on the YouCookII dataset. + + + + Devil Is in the Queries: Advancing Mask Transformers for Real-World Medical Image Segmentation and Out-of-Distribution Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Yuan_Devil_Is_in_the_Queries_Advancing_Mask_Transformers_for_Real-World_CVPR_2023_paper.pdf + Real-world medical image segmentation has tremendous long-tailed complexity of objects, among which tail conditions correlate with relatively rare diseases and are clinically significant. 
A trustworthy medical AI algorithm should demonstrate its effectiveness on tail conditions to avoid clinically dangerous damage in these out-of-distribution (OOD) cases. In this paper, we adopt the concept of object queries in Mask transformers to formulate semantic segmentation as a soft cluster assignment. The queries fit the feature-level cluster centers of inliers during training. Therefore, when performing inference on a medical image in real-world scenarios, the similarity between pixels and the queries detects and localizes OOD regions. We term this OOD localization MaxQuery. Furthermore, the foregrounds of real-world medical images, whether OOD objects or inliers, are lesions. The difference between them is obviously less than that between the foreground and background, so the object queries may focus redundantly on the background. Thus, we propose a query-distribution (QD) loss to enforce clear boundaries between segmentation targets and other regions at the query level, improving the inlier segmentation and OOD indication. Our proposed framework is tested on two real-world segmentation tasks, i.e., segmentation of pancreatic and liver tumors, outperforming previous leading algorithms by an average of 7.39% on AUROC, 14.69% on AUPR, and 13.79% on FPR95 for OOD localization. On the other hand, our framework improves the performance of inlier segmentation by an average of 5.27% DSC compared with nnUNet. + + + + T-SEA: Transfer-Based Self-Ensemble Attack on Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_T-SEA_Transfer-Based_Self-Ensemble_Attack_on_Object_Detection_CVPR_2023_paper.pdf + Compared to query-based black-box attacks, transfer-based black-box attacks do not require any information about the attacked models, which ensures their secrecy. However, most existing transfer-based approaches rely on ensembling multiple models to boost the attack transferability, which is time- and resource-intensive, not to mention the difficulty of obtaining diverse models on the same task. To address this limitation, in this work, we focus on the single-model transfer-based black-box attack on object detection, utilizing only one model to achieve a high-transferability adversarial attack on multiple black-box detectors. Specifically, we first make observations on the patch optimization process of the existing method and propose an enhanced attack framework by slightly adjusting its training strategies. Then, we analogize patch optimization with regular model optimization, proposing a series of self-ensemble approaches on the input data, the attacked model, and the adversarial patch to efficiently make use of the limited information and prevent the patch from overfitting. The experimental results show that the proposed framework can be applied with multiple classical base attack methods (e.g., PGD and MIM) to greatly improve the black-box transferability of the well-optimized patch on multiple mainstream detectors, while also boosting white-box performance. + + + + PSVT: End-to-End Multi-Person 3D Pose and Shape Estimation With Progressive Video Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Qiu_PSVT_End-to-End_Multi-Person_3D_Pose_and_Shape_Estimation_With_Progressive_CVPR_2023_paper.pdf + Existing methods of multi-person video 3D human Pose and Shape Estimation (PSE) typically adopt a two-stage strategy, which first detects human instances in each frame and then performs single-person PSE with a temporal model.
However, the global spatio-temporal context among spatial instances cannot be captured. In this paper, we propose a new end-to-end multi-person 3D Pose and Shape estimation framework with a progressive Video Transformer, termed PSVT. In PSVT, a spatio-temporal encoder (STE) captures the global feature dependencies among spatial objects. Then, a spatio-temporal pose decoder (STPD) and a shape decoder (STSD) capture the global dependencies between pose queries and feature tokens, and between shape queries and feature tokens, respectively. To handle the variations of objects as time proceeds, a novel scheme of progressive decoding is used to update pose and shape queries at each frame. Besides, we propose a novel pose-guided attention (PGA) for the shape decoder to better predict shape parameters. The two components strengthen the decoder of PSVT to improve performance. Extensive experiments on the four datasets show that PSVT achieves state-of-the-art results. + + + + Unifying Vision, Text, and Layout for Universal Document Processing + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Unifying_Vision_Text_and_Layout_for_Universal_Document_Processing_CVPR_2023_paper.pdf + We propose Universal Document Processing (UDOP), a foundation Document AI model which unifies text, image, and layout modalities together with varied task formats, including document understanding and generation. UDOP leverages the spatial correlation between textual content and document image to model image, text, and layout modalities with one uniform representation. With a novel Vision-Text-Layout Transformer, UDOP unifies pretraining and multi-domain downstream tasks into a prompt-based sequence generation scheme. UDOP is pretrained on both large-scale unlabeled document corpora, using innovative self-supervised objectives, and on diverse labeled data. UDOP also learns to generate document images from text and layout modalities via masked image reconstruction. To the best of our knowledge, this is the first time in the field of document AI that one model simultaneously achieves high-quality neural document editing and content customization. Our method sets the state-of-the-art on 8 Document AI tasks, e.g., document understanding and QA, across diverse data domains like finance reports, academic papers, and websites. UDOP ranks first on the leaderboard of the Document Understanding Benchmark. + + + + SparsePose: Sparse-View Camera Pose Regression and Refinement + http://openaccess.thecvf.com//content/CVPR2023/papers/Sinha_SparsePose_Sparse-View_Camera_Pose_Regression_and_Refinement_CVPR_2023_paper.pdf + Camera pose estimation is a key step in standard 3D reconstruction pipelines that operate on a dense set of images of a single object or scene. However, methods for pose estimation often fail when there are only a few images available because they rely on the ability to robustly identify and match visual features between pairs of images. While these methods can work robustly with dense camera views, capturing a large set of images can be time-consuming or impractical. Here, we propose Sparse-View Camera Pose Regression and Refinement (SparsePose) for recovering accurate camera poses given a sparse set of wide-baseline images (fewer than 10). The method learns to regress initial camera poses and then iteratively refine them after training on a large-scale dataset of objects (Co3D: Common Objects in 3D). SparsePose significantly outperforms conventional and learning-based baselines in recovering accurate camera rotations and translations.
We also demonstrate our pipeline for high-fidelity 3D reconstruction using only 5-9 images of an object. + + + + Flow Supervision for Deformable NeRF + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Flow_Supervision_for_Deformable_NeRF_CVPR_2023_paper.pdf + In this paper, we present a new method for deformable NeRF that can directly use optical flow as supervision. We overcome the major challenge of the computational inefficiency of enforcing the flow constraints on the backward deformation field used by deformable NeRFs. Specifically, we show that inverting the backward deformation function is actually not needed for computing scene flows between frames. This insight dramatically simplifies the problem, as one is no longer constrained to deformation functions that can be analytically inverted. Instead, thanks to the weak assumptions required by our derivation based on the inverse function theorem, our approach can be extended to a broad class of commonly used backward deformation fields. We present results on monocular novel view synthesis with rapid object motion, and demonstrate significant improvements over baselines without flow supervision. + + + + Zero-Shot Text-to-Parameter Translation for Game Character Auto-Creation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Zero-Shot_Text-to-Parameter_Translation_for_Game_Character_Auto-Creation_CVPR_2023_paper.pdf + Recent popular Role-Playing Games (RPGs) have seen the great success of character auto-creation systems. The bone-driven face model controlled by continuous parameters (like the position of bones) and discrete parameters (like the hairstyles) makes it possible for users to personalize and customize in-game characters. Previous in-game character auto-creation systems are mostly image-driven, where facial parameters are optimized so that the rendered character looks similar to the reference face photo. This paper proposes a novel text-to-parameter translation method (T2P) to achieve zero-shot text-driven game character auto-creation. With our method, users can create a vivid in-game character from an arbitrary text description without using any reference photo or editing hundreds of parameters manually. In our method, leveraging the power of large-scale pre-trained multi-modal CLIP and neural rendering, T2P searches both continuous facial parameters and discrete facial parameters in a unified framework. Due to the discontinuous parameter representation, previous methods have difficulty in effectively learning discrete facial parameters. T2P, to the best of our knowledge, is the first method that can handle the optimization of both discrete and continuous parameters. Experimental results show that T2P can generate high-quality and vivid game characters with given text prompts. T2P outperforms other SOTA text-to-3D generation methods on both objective and subjective evaluations. + + + + PIVOT: Prompting for Video Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Villa_PIVOT_Prompting_for_Video_Continual_Learning_CVPR_2023_paper.pdf + Modern machine learning pipelines are limited due to data availability, storage quotas, privacy regulations, and expensive annotation processes. These constraints make it difficult or impossible to train and update large-scale models on such dynamic annotated sets.
Continual learning directly approaches this problem, with the ultimate goal of devising methods where a deep neural network effectively learns relevant patterns for new (unseen) classes, without significantly altering its performance on previously learned ones. In this paper, we address the problem of continual learning for video data. We introduce PIVOT, a novel method that leverages extensive knowledge in pre-trained models from the image domain, thereby reducing the number of trainable parameters and the associated forgetting. Unlike previous methods, ours is the first approach that effectively uses prompting mechanisms for continual learning without any in-domain pre-training. Our experiments show that PIVOT improves state-of-the-art methods by a significant 27% on the 20-task ActivityNet setup. + + + + Panoptic Video Scene Graph Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Panoptic_Video_Scene_Graph_Generation_CVPR_2023_paper.pdf + Towards building comprehensive real-world visual perception systems, we propose and study a new problem called panoptic scene graph generation (PVSG). PVSG is related to the existing video scene graph generation (VidSGG) problem, which focuses on temporal interactions between humans and objects localized with bounding boxes in videos. However, the limitation of bounding boxes in detecting non-rigid objects and backgrounds often causes VidSGG systems to miss key details that are crucial for comprehensive video understanding. In contrast, PVSG requires nodes in scene graphs to be grounded by more precise, pixel-level segmentation masks, which facilitate holistic scene understanding. To advance research in this new area, we contribute a high-quality PVSG dataset, which consists of 400 videos (289 third-person + 111 egocentric videos) with totally 150K frames labeled with panoptic segmentation masks as well as fine, temporal scene graphs. We also provide a variety of baseline methods and share useful design practices for future work. + + + + Understanding Imbalanced Semantic Segmentation Through Neural Collapse + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhong_Understanding_Imbalanced_Semantic_Segmentation_Through_Neural_Collapse_CVPR_2023_paper.pdf + A recent study has shown a phenomenon called neural collapse in that the within-class means of features and the classifier weight vectors converge to the vertices of a simplex equiangular tight frame at the terminal phase of training for classification. In this paper, we explore the corresponding structures of the last-layer feature centers and classifiers in semantic segmentation. Based on our empirical and theoretical analysis, we point out that semantic segmentation naturally brings contextual correlation and imbalanced distribution among classes, which breaks the equiangular and maximally separated structure of neural collapse for both feature centers and classifiers. However, such a symmetric structure is beneficial to discrimination for the minor classes. To preserve these advantages, we introduce a regularizer on feature centers to encourage the network to learn features closer to the appealing structure in imbalanced semantic segmentation. Experimental results show that our method can bring significant improvements on both 2D and 3D semantic segmentation benchmarks. Moreover, our method ranks first and sets a new record (+6.8% mIoU) on the ScanNet200 test leaderboard. 
+ + + + HOICLIP: Efficient Knowledge Transfer for HOI Detection With Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Ning_HOICLIP_Efficient_Knowledge_Transfer_for_HOI_Detection_With_Vision-Language_Models_CVPR_2023_paper.pdf + Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction prior for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and suffer from inferior performance under few/zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in CLIP text encoder is leveraged to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, we build a verb classifier from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, we propose a training-free enhancement to exploit global HOI predictions from CLIP. Extensive experiments demonstrate that our method outperforms the state of the art by a large margin on various settings, e.g. +4.04 mAP on HICO-Det. The source code is available in https://github.com/Artanic30/HOICLIP. + + + + Focused and Collaborative Feedback Integration for Interactive Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Focused_and_Collaborative_Feedback_Integration_for_Interactive_Image_Segmentation_CVPR_2023_paper.pdf + Interactive image segmentation aims at obtaining a segmentation mask for an image using simple user annotations. During each round of interaction, the segmentation result from the previous round serves as feedback to guide the user's annotation and provides dense prior information for the segmentation model, effectively acting as a bridge between interactions. Existing methods overlook the importance of feedback or simply concatenate it with the original input, leading to underutilization of feedback and an increase in the number of required annotations. To address this, we propose an approach called Focused and Collaborative Feedback Integration (FCFI) to fully exploit the feedback for click-based interactive image segmentation. FCFI first focuses on a local area around the new click and corrects the feedback based on the similarities of high-level features. It then alternately and collaboratively updates the feedback and deep features to integrate the feedback into the features. The efficacy and efficiency of FCFI were validated on four benchmarks, namely GrabCut, Berkeley, SBD, and DAVIS. Experimental results show that FCFI achieved new state-of-the-art performance with less computational overhead than previous methods. The source code is available at https://github.com/veizgyauzgyauz/FCFI. 
+ + + + Class Prototypes Based Contrastive Learning for Classifying Multi-Label and Fine-Grained Educational Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Gupta_Class_Prototypes_Based_Contrastive_Learning_for_Classifying_Multi-Label_and_Fine-Grained_CVPR_2023_paper.pdf + The recent growth in the consumption of online media by children during early childhood necessitates data-driven tools enabling educators to filter out appropriate educational content for young learners. This paper presents an approach for detecting educational content in online videos. We focus on two widely used educational content classes: literacy and math. For each class, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include 'letter names', 'letter sounds', and math codes include 'counting', 'sorting'. We pose this as a fine-grained multilabel classification problem as videos can contain multiple types of educational content and the content classes can get visually similar (e.g., 'letter names' vs 'letter sounds'). We propose a novel class prototypes based supervised contrastive learning approach that can handle fine-grained samples associated with multiple labels. We learn a class prototype for each class and a loss function is employed to minimize the distances between a class prototype and the samples from the class. Similarly, distances between a class prototype and the samples from other classes are maximized. As the alignment between visual and audio cues are crucial for effective comprehension, we consider a multimodal transformer network to capture the interaction between visual and audio cues in videos while learning the embedding for videos. For evaluation, we present a dataset, APPROVE, employing educational videos from YouTube labeled with fine-grained education classes by education researchers. APPROVE consists of 193 hours of expert-annotated videos with 19 classes. The proposed approach outperforms strong baselines on APPROVE and other benchmarks such as Youtube-8M, and COIN. The dataset is available at https://nusci.csl.sri.com/project/APPROVE. + + + + Source-Free Adaptive Gaze Estimation by Uncertainty Reduction + http://openaccess.thecvf.com//content/CVPR2023/papers/Cai_Source-Free_Adaptive_Gaze_Estimation_by_Uncertainty_Reduction_CVPR_2023_paper.pdf + Gaze estimation across domains has been explored recently because the training data are usually collected under controlled conditions while the trained gaze estimators are used in real and diverse environments. However, due to privacy and efficiency concerns, simultaneous access to annotated source data and to-be-predicted target data can be challenging. In light of this, we present an unsupervised source-free domain adaptation approach for gaze estimation, which adapts a source-trained gaze estimator to unlabeled target domains without source data. We propose the Uncertainty Reduction Gaze Adaptation (UnReGA) framework, which achieves adaptation by reducing both sample and model uncertainty. Sample uncertainty is mitigated by enhancing image quality and making them gaze-estimation-friendly, whereas model uncertainty is reduced by minimizing prediction variance on the same inputs. Extensive experiments are conducted on six cross-domain tasks, demonstrating the effectiveness of UnReGA and its components. 
Results show that UnReGA outperforms other state-of-the-art cross-domain gaze estimation methods under both protocols, with and without source data + + + + SuperDisco: Super-Class Discovery Improves Visual Recognition for the Long-Tail + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_SuperDisco_Super-Class_Discovery_Improves_Visual_Recognition_for_the_Long-Tail_CVPR_2023_paper.pdf + Modern image classifiers perform well on populated classes while degrading considerably on tail classes with only a few instances. Humans, by contrast, effortlessly handle the long-tailed recognition challenge, since they can learn the tail representation based on different levels of semantic abstraction, making the learned tail features more discriminative. This phenomenon motivated us to propose SuperDisco, an algorithm that discovers super-class representations for long-tailed recognition using a graph model. We learn to construct the super-class graph to guide the representation learning to deal with long-tailed distributions. Through message passing on the super-class graph, image representations are rectified and refined by attending to the most relevant entities based on the semantic similarity among their super-classes. Moreover, we propose to meta-learn the super-class graph under the supervision of a prototype graph constructed from a small amount of imbalanced data. By doing so, we obtain a more robust super-class graph that further improves the long-tailed recognition performance. The consistent state-of-the-art experiments on the long-tailed CIFAR-100, ImageNet, Places, and iNaturalist demonstrate the benefit of the discovered super-class graph for dealing with long-tailed distributions. + + + + Im2Hands: Learning Attentive Implicit Representation of Interacting Two-Hand Shapes + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Im2Hands_Learning_Attentive_Implicit_Representation_of_Interacting_Two-Hand_Shapes_CVPR_2023_paper.pdf + We present Implicit Two Hands (Im2Hands), the first neural implicit representation of two interacting hands. Unlike existing methods on two-hand reconstruction that rely on a parametric hand model and/or low-resolution meshes, Im2Hands can produce fine-grained geometry of two hands with high hand-to-hand and hand-to-image coherency. To handle the shape complexity and interaction context between two hands, Im2Hands models the occupancy volume of two hands -- conditioned on an RGB image and coarse 3D keypoints -- by two novel attention-based modules responsible for (1) initial occupancy estimation and (2) context-aware occupancy refinement, respectively. Im2Hands first learns per-hand neural articulated occupancy in the canonical space designed for each hand using query-image attention. It then refines the initial two-hand occupancy in the posed space to enhance the coherency between the two hand shapes using query-anchor attention. In addition, we introduce an optional keypoint refinement module to enable robust two-hand shape estimation from predicted hand keypoints in a single-image reconstruction scenario. We experimentally demonstrate the effectiveness of Im2Hands on two-hand reconstruction in comparison to related methods, where ours achieves state-of-the-art results. Our code is publicly available at https://github.com/jyunlee/Im2Hands. 
+ + + + Long-Term Visual Localization With Mobile Sensors + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_Long-Term_Visual_Localization_With_Mobile_Sensors_CVPR_2023_paper.pdf + Despite the remarkable advances in image matching and pose estimation, image-based localization of a camera in a temporally-varying outdoor environment is still a challenging problem due to the huge appearance disparity between query and reference images caused by illumination, seasonal, and structural changes. In this work, we propose to leverage additional sensors on a mobile phone, mainly GPS, compass, and gravity sensor, to solve this challenging problem. We show that these mobile sensors provide decent initial poses and effective constraints to reduce the search space in image matching and final pose estimation. With the initial pose, we are also able to devise a direct 2D-3D matching network to efficiently establish 2D-3D correspondences instead of tedious 2D-2D matching in existing systems. As no public dataset exists for the studied problem, we collect a new dataset that provides a variety of mobile sensor data and significant scene appearance variations, and develop a system to acquire ground-truth poses for query images. We benchmark our method as well as several state-of-the-art baselines and demonstrate the effectiveness of the proposed approach. Our code and dataset are available on the project page: https://zju3dv.github.io/sensloc/. + + + + Data-Efficient Large Scale Place Recognition With Graded Similarity Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Leyva-Vallina_Data-Efficient_Large_Scale_Place_Recognition_With_Graded_Similarity_Supervision_CVPR_2023_paper.pdf + Visual place recognition (VPR) is a fundamental task of computer vision for visual localization. Existing methods are trained using image pairs that either depict the same place or not. Such a binary indication does not consider continuous relations of similarity between images of the same place taken from different positions, determined by the continuous nature of camera pose. The binary similarity induces a noisy supervision signal into the training of VPR methods, which stall in local minima and require expensive hard mining algorithms to guarantee convergence. Motivated by the fact that two images of the same place only partially share visual cues due to camera pose differences, we deploy an automatic re-annotation strategy to re-label VPR datasets. We compute graded similarity labels for image pairs based on available localization metadata. Furthermore, we propose a new Generalized Contrastive Loss (GCL) that uses graded similarity labels for training contrastive networks. We demonstrate that the use of the new labels and GCL allows us to dispense with hard-pair mining and to train image descriptors that perform better in VPR by nearest-neighbor search, obtaining results superior or comparable to those of methods that require expensive hard-pair mining and re-ranking techniques. + + + + Weakly Supervised Class-Agnostic Motion Prediction for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Weakly_Supervised_Class-Agnostic_Motion_Prediction_for_Autonomous_Driving_CVPR_2023_paper.pdf + Understanding the motion behavior of dynamic environments is vital for autonomous driving, leading to increasing attention in class-agnostic motion prediction in LiDAR point clouds.
Outdoor scenes can often be decomposed into mobile foregrounds and static backgrounds, which enables us to associate motion understanding with scene parsing. Based on this observation, we study a novel weakly supervised motion prediction paradigm, where fully or partially (1%, 0.1%) annotated foreground/background binary masks rather than expensive motion annotations are used for supervision. To this end, we propose a two-stage weakly supervised approach, where the segmentation model trained with the incomplete binary masks in Stage1 will facilitate the self-supervised learning of the motion prediction network in Stage2 by estimating possible moving foregrounds in advance. Furthermore, for robust self-supervised motion learning, we design a Consistency-aware Chamfer Distance loss by exploiting multi-frame information and explicitly suppressing potential outliers. Comprehensive experiments show that, with fully or partially binary masks as supervision, our weakly supervised models surpass the self-supervised models by a large margin and perform on par with some supervised ones. This further demonstrates that our approach achieves a good compromise between annotation effort and performance. + + + + Where We Are and What We're Looking At: Query Based Worldwide Image Geo-Localization Using Hierarchies and Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Clark_Where_We_Are_and_What_Were_Looking_At_Query_Based_CVPR_2023_paper.pdf + Determining the exact latitude and longitude that a photo was taken is a useful and widely applicable task, yet it remains exceptionally difficult despite the accelerated progress of other computer vision tasks. Most previous approaches have opted to learn single representations of query images, which are then classified at different levels of geographic granularity. These approaches fail to exploit the different visual cues that give context to different hierarchies, such as the country, state, and city level. To this end, we introduce an end-to-end transformer-based architecture that exploits the relationship between different geographic levels (which we refer to as hierarchies) and the corresponding visual scene information in an image through hierarchical cross-attention. We achieve this by learning a query for each geographic hierarchy and scene type. Furthermore, we learn a separate representation for different environmental scenes, as different scenes in the same location are often defined by completely different visual features. We achieve state of the art accuracy on 4 standard geo-localization datasets : Im2GPS, Im2GPS3k, YFCC4k, and YFCC26k, as well as qualitatively demonstrate how our method learns different representations for different visual hierarchies and scenes, which has not been demonstrated in the previous methods. Above previous testing datasets mostly consist of iconic landmarks or images taken from social media, which makes the dataset a simple memory task, or makes it biased towards certain places. To address this issue we introduce a much harder testing dataset, Google-World-Streets-15k, comprised of images taken from Google Streetview covering the whole planet and present state of the art results. Our code can be found at https://github.com/AHKerrigan/GeoGuessNet. 
+ + + + Critical Learning Periods for Multisensory Integration in Deep Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Kleinman_Critical_Learning_Periods_for_Multisensory_Integration_in_Deep_Networks_CVPR_2023_paper.pdf + We show that the ability of a neural network to integrate information from diverse sources hinges critically on being exposed to properly correlated signals during the early phases of training. Interfering with the learning process during this initial stage can permanently impair the development of a skill, both in artificial and biological systems where the phenomenon is known as a critical learning period. We show that critical periods arise from the complex and unstable early transient dynamics, which are decisive of final performance of the trained system and their learned representations. This evidence challenges the view, engendered by analysis of wide and shallow networks, that early learning dynamics of neural networks are simple, akin to those of a linear model. Indeed, we show that even deep linear networks exhibit critical learning periods for multi-source integration, while shallow networks do not. To better understand how the internal representations change according to disturbances or sensory deficits, we introduce a new measure of source sensitivity, which allows us to track the inhibition and integration of sources during training. Our analysis of inhibition suggests cross-source reconstruction as a natural auxiliary training objective, and indeed we show that architectures trained with cross-sensor reconstruction objectives are remarkably more resilient to critical periods. Our findings suggest that the recent success in self-supervised multi-modal training compared to previous supervised efforts may be in part due to more robust learning dynamics and not solely due to better architectures and/or more data. + + + + GarmentTracking: Category-Level Garment Pose Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Xue_GarmentTracking_Category-Level_Garment_Pose_Tracking_CVPR_2023_paper.pdf + Garments are important to humans. A visual system that can estimate and track the complete garment pose can be useful for many downstream tasks and real-world applications. In this work, we present a complete package to address the category-level garment pose tracking task: (1) A recording system VR-Garment, with which users can manipulate virtual garment models in simulation through a VR interface. (2) A large-scale dataset VR-Folding, with complex garment pose configurations in manipulation like flattening and folding. (3) An end-to-end online tracking framework GarmentTracking, which predicts complete garment pose both in canonical space and task space given a point cloud sequence. Extensive experiments demonstrate that the proposed GarmentTracking achieves great performance even when the garment has large non-rigid deformation. It outperforms the baseline approach on both speed and accuracy. We hope our proposed solution can serve as a platform for future research. Codes and datasets are available in https://garment-tracking.robotflow.ai. + + + + MagicNet: Semi-Supervised Multi-Organ Segmentation via Magic-Cube Partition and Recovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_MagicNet_Semi-Supervised_Multi-Organ_Segmentation_via_Magic-Cube_Partition_and_Recovery_CVPR_2023_paper.pdf + We propose a novel teacher-student model for semi-supervised multi-organ segmentation. 
In the teacher-student model, data augmentation is usually adopted on unlabeled data to regularize the consistent training between teacher and student. We start from a key perspective that fixed relative locations and variable sizes of different organs can provide distribution information where a multi-organ CT scan is drawn. Thus, we treat the prior anatomy as a strong tool to guide the data augmentation and reduce the mismatch between labeled and unlabeled images for semi-supervised learning. More specifically, we propose a data augmentation strategy based on partition-and-recovery N^3 cubes cross- and within- labeled and unlabeled images. Our strategy encourages unlabeled images to learn organ semantics in relative locations from the labeled images (cross-branch) and enhances the learning ability for small organs (within-branch). For within-branch, we further propose to refine the quality of pseudo labels by blending the learned representations from small cubes to incorporate local attributes. Our method is termed as MagicNet, since it treats the CT volume as a magic-cube and N^3-cube partition-and-recovery process matches with the rule of playing a magic-cube. Extensive experiments on two public CT multi-organ datasets demonstrate the effectiveness of MagicNet, and noticeably outperforms state-of-the-art semi-supervised medical image segmentation approaches, with +7% DSC improvement on MACT dataset with 10% labeled images. + + + + Neural Intrinsic Embedding for Non-Rigid Point Cloud Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Neural_Intrinsic_Embedding_for_Non-Rigid_Point_Cloud_Matching_CVPR_2023_paper.pdf + As a primitive 3D data representation, point clouds are prevailing in 3D sensing, yet short of intrinsic structural information of the underlying objects. Such discrepancy poses great challenges in directly establishing correspondences between point clouds sampled from deformable shapes. In light of this, we propose Neural Intrinsic Embedding (NIE) to embed each vertex into a high-dimensional space in a way that respects the intrinsic structure. Based upon NIE, we further present a weakly-supervised learning framework for non-rigid point cloud registration. Unlike the prior works, we do not require expansive and sensitive off-line basis construction (e.g., eigen-decomposition of Laplacians), nor do we require ground-truth correspondence labels for supervision. We empirically show that our framework performs on par with or even better than the state-of-the-art baselines, which generally require more supervision and/or more structural geometric input. + + + + Few-Shot Geometry-Aware Keypoint Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Few-Shot_Geometry-Aware_Keypoint_Localization_CVPR_2023_paper.pdf + Supervised keypoint localization methods rely on large manually labeled image datasets, where objects can deform, articulate, or occlude. However, creating such large keypoint labels is time-consuming and costly, and is often error-prone due to inconsistent labeling. Thus, we desire an approach that can learn keypoint localization with fewer yet consistently annotated images. To this end, we present a novel formulation that learns to localize semantically consistent keypoint definitions, even for occluded regions, for varying object categories. We use a few user-labeled 2D images as input examples, which are extended via self-supervision using a larger unlabeled dataset. 
Unlike unsupervised methods, the few-shot images act as semantic shape constraints for object localization. Furthermore, we introduce 3D geometry-aware constraints to uplift keypoints, achieving more accurate 2D localization. Our general-purpose formulation paves the way for semantically conditioned generative modeling and attains competitive or state-of-the-art accuracy on several datasets, including human faces, eyes, animals, cars, and never-before-seen mouth interior (teeth) localization tasks, not attempted by the previous few-shot methods. Project page: https://xingzhehe.github.io/FewShot3DKP/ + + + + Neural Vector Fields: Implicit Representation by Explicit Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Neural_Vector_Fields_Implicit_Representation_by_Explicit_Learning_CVPR_2023_paper.pdf + Deep neural networks (DNNs) are widely applied for nowadays 3D surface reconstruction tasks and such methods can be further divided into two categories, which respectively warp templates explicitly by moving vertices or represent 3D surfaces implicitly as signed or unsigned distance functions. Taking advantage of both advanced explicit learning process and powerful representation ability of implicit functions, we propose a novel 3D representation method, Neural Vector Fields (NVF). It not only adopts the explicit learning process to manipulate meshes directly, but also leverages the implicit representation of unsigned distance functions (UDFs) to break the barriers in resolution and topology. Specifically, our method first predicts the displacements from queries towards the surface and models the shapes as Vector Fields. Rather than relying on network differentiation to obtain direction fields as most existing UDF-based methods, the produced vector fields encode the distance and direction fields both and mitigate the ambiguity at "ridge" points, such that the calculation of direction fields is straightforward and differentiation-free. The differentiation-free characteristic enables us to further learn a shape codebook via Vector Quantization, which encodes the cross-object priors, accelerates the training procedure, and boosts model generalization on cross-category reconstruction. The extensive experiments on surface reconstruction benchmarks indicate that our method outperforms those state-of-the-art methods in different evaluation scenarios including watertight vs non-watertight shapes, category-specific vs category-agnostic reconstruction, category-unseen reconstruction, and cross-domain reconstruction. Our code is released at https://github.com/Wi-sc/NVF. + + + + Think Twice Before Driving: Towards Scalable Decoders for End-to-End Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Jia_Think_Twice_Before_Driving_Towards_Scalable_Decoders_for_End-to-End_Autonomous_CVPR_2023_paper.pdf + End-to-end autonomous driving has made impressive progress in recent years. Existing methods usually adopt the decoupled encoder-decoder paradigm, where the encoder extracts hidden features from raw sensor data, and the decoder outputs the ego-vehicle's future trajectories or actions. Under such a paradigm, the encoder does not have access to the intended behavior of the ego agent, leaving the burden of finding out safety-critical regions from the massive receptive field and inferring about future situations to the decoder. 
Even worse, the decoder is usually composed of several simple multi-layer perceptrons (MLP) or GRUs while the encoder is delicately designed (e.g., a combination of heavy ResNets or Transformer). Such an imbalanced resource-task division hampers the learning process. In this work, we aim to alleviate the aforementioned problem by two principles: (1) fully utilizing the capacity of the encoder; (2) increasing the capacity of the decoder. Concretely, we first predict a coarse-grained future position and action based on the encoder features. Then, conditioned on the position and action, the future scene is imagined to check the ramification if we drive accordingly. We also retrieve the encoder features around the predicted coordinate to obtain fine-grained information about the safety-critical region. Finally, based on the predicted future and the retrieved salient feature, we refine the coarse-grained position and action by predicting its offset from ground-truth. The above refinement module could be stacked in a cascaded fashion, which extends the capacity of the decoder with spatial-temporal prior knowledge about the conditioned future. We conduct experiments on the CARLA simulator and achieve state-of-the-art performance in closed-loop benchmarks. Extensive ablation studies demonstrate the effectiveness of each proposed module. Code and models are available at https://github.com/opendrivelab/ThinkTwice. + + + + Disentangling Orthogonal Planes for Indoor Panoramic Room Layout Estimation With Cross-Scale Distortion Awareness + http://openaccess.thecvf.com//content/CVPR2023/papers/Shen_Disentangling_Orthogonal_Planes_for_Indoor_Panoramic_Room_Layout_Estimation_With_CVPR_2023_paper.pdf + Based on the Manhattan World assumption, most existing indoor layout estimation schemes focus on recovering layouts from vertically compressed 1D sequences. However, the compression procedure confuses the semantics of different planes, yielding inferior performance with ambiguous interpretability. To address this issue, we propose to disentangle this 1D representation by pre-segmenting orthogonal (vertical and horizontal) planes from a complex scene, explicitly capturing the geometric cues for indoor layout estimation. Considering the symmetry between the floor boundary and ceiling boundary, we also design a soft-flipping fusion strategy to assist the pre-segmentation. Besides, we present a feature assembling mechanism to effectively integrate shallow and deep features with distortion distribution awareness. To compensate for the potential errors in pre-segmentation, we further leverage triple attention to reconstruct the disentangled sequences for better performance. Experiments on four popular benchmarks demonstrate our superiority over existing SoTA solutions, especially on the 3DIoU metric. The code is available at https://github.com/zhijieshen-bjtu/DOPNet. + + + + Neural Map Prior for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiong_Neural_Map_Prior_for_Autonomous_Driving_CVPR_2023_paper.pdf + High-definition (HD) semantic maps are a crucial component for autonomous driving on urban streets. Traditional offline HD maps are created through labor-intensive manual annotation processes, which are costly and do not accommodate timely updates. Recently, researchers have proposed to infer local maps based on online sensor observations. However, the range of online map inference is constrained by sensor perception range and is easily affected by occlusions. 
In this work, we propose Neural Map Prior (NMP), a neural representation of global maps that enables automatic global map updates and enhances local map inference performance. To incorporate the strong map prior into local map inference, we leverage cross-attention to dynamically capture the correlations between current features and prior features. For updating the global neural map prior, we use a learning-based fusion module to guide the network in fusing features from previous traversals. This design allows the network to capture a global neural map prior while making sequential online map predictions. Experimental results on the nuScenes dataset demonstrate that our framework is compatible with most map segmentation/detection methods, improving map prediction performance in challenging weather conditions and over an extended horizon. To the best of our knowledge, this represents the first learning-based system for constructing a global map prior. + + + + PEAL: Prior-Embedded Explicit Attention Learning for Low-Overlap Point Cloud Registration + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_PEAL_Prior-Embedded_Explicit_Attention_Learning_for_Low-Overlap_Point_Cloud_Registration_CVPR_2023_paper.pdf + Learning distinctive point-wise features is critical for low-overlap point cloud registration. Recently, incorporating Transformers into point cloud feature representation has achieved huge success; such methods usually adopt a self-attention module to learn intra-point-cloud features first, then utilize a cross-attention module to perform feature exchange between input point clouds. Self-attention is computed by capturing the global dependency in geometric space. However, this global dependency can be ambiguous and lacks distinctiveness, especially in indoor low-overlap scenarios, where the dependence on an extensive range of non-overlapping points introduces ambiguity. To address this issue, we present PEAL, a Prior-embedded Explicit Attention Learning model. By incorporating prior knowledge into the learning process, the points are divided into two parts. One includes points lying in the putative overlapping region and the other includes points lying in the putative non-overlapping region. Then PEAL explicitly learns one-way attention with the putative overlapping points. This simple design attains surprising performance, significantly relieving the aforementioned feature ambiguity. Our method improves the Registration Recall by 6+% on the challenging 3DLoMatch benchmark and achieves state-of-the-art performance on Feature Matching Recall, Inlier Ratio, and Registration Recall on both 3DMatch and 3DLoMatch. Code will be made publicly available. + + + + GeoVLN: Learning Geometry-Enhanced Visual Representation With Slot Attention for Vision-and-Language Navigation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huo_GeoVLN_Learning_Geometry-Enhanced_Visual_Representation_With_Slot_Attention_for_Vision-and-Language_CVPR_2023_paper.pdf + Most existing works solving the Room-to-Room VLN problem only utilize RGB images and do not consider the local context around candidate views, which lacks sufficient visual cues about the surrounding environment. Moreover, natural language contains complex semantic information, so its correlations with visual inputs are hard to model merely with cross attention. In this paper, we propose GeoVLN, which learns a Geometry-enhanced visual representation based on slot attention for robust Visual-and-Language Navigation.
The RGB images are complemented with the corresponding depth maps and normal maps predicted by Omnidata as visual inputs. Technically, we introduce a two-stage module that combines local slot attention and the CLIP model to produce a geometry-enhanced representation from such inputs. We employ V&L BERT to learn a cross-modal representation that incorporates both language and vision information. Additionally, a novel multiway attention module is designed, encouraging different phrases of the input instruction to exploit the most related features from the visual input. Extensive experiments demonstrate the effectiveness of our newly designed modules and show the compelling performance of the proposed method. + + + + KiUT: Knowledge-Injected U-Transformer for Radiology Report Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_KiUT_Knowledge-Injected_U-Transformer_for_Radiology_Report_Generation_CVPR_2023_paper.pdf + Radiology report generation aims to automatically generate a clinically accurate and coherent paragraph from the X-ray image, which could relieve radiologists from the heavy burden of report writing. Although various image captioning methods have shown remarkable performance in the natural image field, generating accurate reports for medical images requires knowledge of multiple modalities, including vision, language, and medical terminology. We propose a Knowledge-injected U-Transformer (KiUT) to learn multi-level visual representations and adaptively distill the information with contextual and clinical knowledge for word prediction. In detail, a U-connection schema between the encoder and decoder is designed to model interactions between different modalities. In addition, a symptom graph and an injected knowledge distiller are developed to assist the report generation. Experimentally, we outperform state-of-the-art methods on two widely used benchmark datasets: IU-Xray and MIMIC-CXR. Further experimental results prove the advantages of our architecture and the complementary benefits of the injected knowledge. + + + + Neural Video Compression With Diverse Contexts + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Neural_Video_Compression_With_Diverse_Contexts_CVPR_2023_paper.pdf + For any video codec, coding efficiency relies heavily on whether the current signal to be encoded can find the relevant contexts from the previously reconstructed signals. Traditional codecs have verified that more contexts bring substantial coding gains, but in a time-consuming manner. However, for the emerging neural video codec (NVC), its contexts are still limited, leading to a low compression ratio. To boost NVC, this paper proposes increasing the context diversity in both temporal and spatial dimensions. First, we guide the model to learn hierarchical quality patterns across frames, which enriches long-term and yet high-quality temporal contexts. Furthermore, to tap the potential of the optical flow-based coding framework, we introduce a group-based offset diversity where cross-group interaction is proposed for better context mining. In addition, this paper also adopts a quadtree-based partition to increase spatial context diversity when encoding the latent representation in parallel. Experiments show that our codec obtains a 23.5% bitrate saving over the previous SOTA NVC. Better yet, our codec has surpassed the under-development next-generation traditional codec, ECM, in both RGB and YUV420 color spaces in terms of PSNR. The codes are at https://github.com/microsoft/DCVC.
+ + + + Markerless Camera-to-Robot Pose Estimation via Self-Supervised Sim-to-Real Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Markerless_Camera-to-Robot_Pose_Estimation_via_Self-Supervised_Sim-to-Real_Transfer_CVPR_2023_paper.pdf + Solving the camera-to-robot pose is a fundamental requirement for vision-based robot control, and is a process that takes considerable effort and care to make accurate. Traditional approaches require modification of the robot via markers, and subsequent deep learning approaches enabled markerless feature extraction. Mainstream deep learning methods only use synthetic data and rely on Domain Randomization to fill the sim-to-real gap, because acquiring 3D annotations is labor-intensive. In this work, we go beyond the limitation of 3D annotations for real-world data. We propose an end-to-end pose estimation framework that is capable of online camera-to-robot calibration and a self-supervised training method to scale the training to unlabeled real-world data. Our framework combines deep learning and geometric vision for solving the robot pose, and the pipeline is fully differentiable. To train the Camera-to-Robot Pose Estimation Network (CtRNet), we leverage foreground segmentation and differentiable rendering for image-level self-supervision. The pose prediction is visualized through a renderer and the image loss with the input image is back-propagated to train the neural network. Our experimental results on two public real datasets confirm the effectiveness of our approach over existing works. We also integrate our framework into a visual servoing system to demonstrate the promise of real-time precise robot pose estimation for automation tasks. + + + + CARTO: Category and Joint Agnostic Reconstruction of ARTiculated Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Heppert_CARTO_Category_and_Joint_Agnostic_Reconstruction_of_ARTiculated_Objects_CVPR_2023_paper.pdf + We present CARTO, a novel approach for reconstructing multiple articulated objects from a single stereo RGB observation. We use implicit object-centric representations and learn a single geometry and articulation decoder for multiple object categories. Despite training on multiple categories, our decoder achieves a comparable reconstruction accuracy to methods that train bespoke decoders separately for each category. Combined with our stereo image encoder, we infer the 3D shape, 6D pose, size, joint type, and the joint state of multiple unknown objects in a single forward pass. Our method achieves a 20.4% absolute improvement in mAP 3D IOU50 for novel instances when compared to a two-stage pipeline. Inference is fast and can run on an NVIDIA TITAN XP GPU at 1 Hz when eight or fewer objects are present. While only trained on simulated data, CARTO transfers to real-world object instances. Code and evaluation data are available at: http://carto.cs.uni-freiburg.de + + + + Event-Guided Person Re-Identification via Sparse-Dense Complementary Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Event-Guided_Person_Re-Identification_via_Sparse-Dense_Complementary_Learning_CVPR_2023_paper.pdf + Video-based person re-identification (Re-ID) is a prominent computer vision topic due to its wide range of video surveillance applications. Most existing methods utilize spatial and temporal correlations in frame sequences to obtain discriminative person features.
However, inevitable degradations, e.g., the motion blur contained in frames, often cause ambiguous texture noise and temporal disturbance, leading to the loss of identity-discriminating cues. Recently, a new bio-inspired sensor called the event camera, which can asynchronously record intensity changes, has brought new vitality to the Re-ID task. With microsecond resolution and low latency, event cameras can accurately capture the movements of pedestrians even in the aforementioned degraded environments. Inspired by the properties of event cameras, in this work, we propose a Sparse-Dense Complementary Learning Framework, which effectively extracts identity features by fully exploiting the complementary information of dense frames and sparse events. Specifically, for frames, we build a CNN-based module to aggregate the dense features of pedestrian appearance step-by-step, while for event streams, we design a bio-inspired spiking neural backbone, which encodes event signals into sparse feature maps in a spiking form, to present the dynamic motion cues of pedestrians. Finally, a cross feature alignment module is constructed to complementarily fuse motion information from events and appearance cues from frames to enhance identity representation learning. Experiments on several benchmarks show that by incorporating events and SNNs into Re-ID, our method significantly outperforms competitive methods. + + + + Regularizing Second-Order Influences for Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Regularizing_Second-Order_Influences_for_Continual_Learning_CVPR_2023_paper.pdf + Continual learning aims to learn on non-stationary data streams without catastrophically forgetting previous knowledge. Prevalent replay-based methods address this challenge by rehearsing on a small buffer holding the seen data, for which a delicate sample selection strategy is required. However, existing selection schemes typically seek only to maximize the utility of the ongoing selection, overlooking the interference between successive rounds of selection. Motivated by this, we dissect the interaction of sequential selection steps within a framework built on influence functions. We manage to identify a new class of second-order influences that will gradually amplify incidental bias in the replay buffer and compromise the selection process. To regularize the second-order effects, a novel selection objective is proposed, which also has clear connections to two widely adopted criteria. Furthermore, we present an efficient implementation for optimizing the proposed criterion. Experiments on multiple continual learning benchmarks demonstrate the advantage of our approach over state-of-the-art methods. Code is available at https://github.com/feifeiobama/InfluenceCL. + + + + Super-Resolution Neural Operator + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_Super-Resolution_Neural_Operator_CVPR_2023_paper.pdf + We propose Super-resolution Neural Operator (SRNO), a deep operator learning framework that can resolve high-resolution (HR) images at arbitrary scales from their low-resolution (LR) counterparts. Treating the LR-HR image pairs as continuous functions approximated with different grid sizes, SRNO learns the mapping between the corresponding function spaces.
From the perspective of approximation theory, SRNO first embeds the LR input into a higher-dimensional latent representation space, trying to capture sufficient basis functions, and then iteratively approximates the implicit image function with a kernel integral mechanism, followed by a final dimensionality reduction step to generate the RGB representation at the target coordinates. The key characteristics distinguishing SRNO from prior continuous SR works are: 1) the kernel integral in each layer is efficiently implemented via Galerkin-type attention, which possesses non-local properties in the spatial domain and therefore benefits the grid-free continuum; and 2) the multilayer attention architecture allows for dynamic latent basis updates, which is crucial for SR problems to "hallucinate" high-frequency information from the LR image. Experiments show that SRNO outperforms existing continuous SR methods in terms of both accuracy and running time. Our code is at https://github.com/2y7c3/Super-Resolution-Neural-Operator. + + + + GradICON: Approximate Diffeomorphisms via Gradient Inverse Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_GradICON_Approximate_Diffeomorphisms_via_Gradient_Inverse_Consistency_CVPR_2023_paper.pdf + We present an approach to learning regular spatial transformations between image pairs in the context of medical image registration. Contrary to optimization-based registration techniques and many modern learning-based methods, we do not directly penalize transformation irregularities but instead promote transformation regularity via an inverse consistency penalty. We use a neural network to predict a map between a source and a target image as well as the map when swapping the source and target images. Different from existing approaches, we compose these two resulting maps and regularize deviations of the Jacobian of this composition from the identity matrix. This regularizer -- GradICON -- results in much better convergence when training registration models compared to promoting inverse consistency of the composition of maps directly, while retaining the desirable implicit regularization effects of the latter. We achieve state-of-the-art registration performance on a variety of real-world medical image datasets using a single set of hyperparameters and a single non-dataset-specific training protocol. The code is available at https://github.com/uncbiag/ICON. + + + + LP-DIF: Learning Local Pattern-Specific Deep Implicit Function for 3D Objects and Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_LP-DIF_Learning_Local_Pattern-Specific_Deep_Implicit_Function_for_3D_Objects_CVPR_2023_paper.pdf + Deep Implicit Function (DIF) has gained much popularity as an efficient 3D shape representation. To capture geometry details, current mainstream methods divide 3D shapes into local regions and then learn each one with a local latent code via a decoder, where the decoder shares the geometric similarities among different local regions. Although such local methods can capture more local details, the large diversity of local regions increases the difficulty of learning an implicit function when all regions are treated equally using only a single decoder. In addition, these local regions often exhibit imbalanced distributions, where certain regions have significantly fewer observations. As a result, fine geometric details cannot be preserved well.
To solve this problem, we propose a novel Local Pattern-specific Implicit Function, named LP-DIF, for representing a shape with several clusters of local regions and multiple decoders, where each decoder only focuses on one cluster of local regions that share a certain pattern. Specifically, we first extract local codes for all regions, and then cluster them into multiple groups in the latent space, where similar regions sharing a common pattern fall into one group. After that, we train multiple decoders for mining local patterns of different groups, which simplifies the learning of fine geometric details by reducing the diversity of local regions seen by each decoder. To further alleviate the data-imbalance problem, we introduce a region re-weighting module to each pattern-specific decoder based on a kernel density estimator, which dynamically re-weights the regions during learning. Our LP-DIF can restore more geometric details, and thus improve the quality of 3D reconstruction. Experiments demonstrate that our method achieves state-of-the-art performance compared with previous methods. Code is available at https://github.com/gtyxyz/lpdif. + + + + PeakConv: Learning Peak Receptive Field for Radar Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_PeakConv_Learning_Peak_Receptive_Field_for_Radar_Semantic_Segmentation_CVPR_2023_paper.pdf + Modern machine learning-based technologies have shown considerable potential in automatic radar scene understanding. Among these efforts, radar semantic segmentation (RSS) can provide more refined and detailed information, including the moving objects and background clutter within the effective receptive field of the radar. Motivated by the success of convolutional networks in various visual computing tasks, these networks have also been introduced to solve the RSS task. However, neither the regular convolution operation nor its modified variants are specifically designed to interpret radar signals. The receptive fields of existing convolutions are defined by the object presentation in optical signals, but these two signals have different perception mechanisms. In classic radar signal processing, the object signature is detected according to a local peak response, i.e., CFAR detection. Inspired by this idea, we redefine the receptive field of the convolution operation as the peak receptive field (PRF) and propose the peak convolution operation (PeakConv) to learn the object signatures in an end-to-end network. By incorporating the proposed PeakConv layers into the encoders, our RSS network can achieve better segmentation results compared with other SoTA methods on a multi-view real-measured dataset collected from an FMCW radar. Our code for PeakConv is available at https://github.com/zlw9161/PKC. + + + + Explaining Image Classifiers With Multiscale Directional Image Representation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kolek_Explaining_Image_Classifiers_With_Multiscale_Directional_Image_Representation_CVPR_2023_paper.pdf + Image classifiers are known to be difficult to interpret and therefore require explanation methods to understand their decisions. We present ShearletX, a novel mask explanation method for image classifiers based on the shearlet transform -- a multiscale directional image representation. Current mask explanation methods are regularized by smoothness constraints that protect against undesirable fine-grained explanation artifacts.
However, the smoothness of a mask limits its ability to separate fine-detail patterns, that are relevant for the classifier, from nearby nuisance patterns, that do not affect the classifier. ShearletX solves this problem by avoiding smoothness regularization all together, replacing it by shearlet sparsity constraints. The resulting explanations consist of a few edges, textures, and smooth parts of the original image, that are the most relevant for the decision of the classifier. To support our method, we propose a mathematical definition for explanation artifacts and an information theoretic score to evaluate the quality of mask explanations. We demonstrate the superiority of ShearletX over previous mask based explanation methods using these new metrics, and present exemplary situations where separating fine-detail patterns allows explaining phenomena that were not explainable before. + + + + Deep Polarization Reconstruction With PDAVIS Events + http://openaccess.thecvf.com//content/CVPR2023/papers/Mei_Deep_Polarization_Reconstruction_With_PDAVIS_Events_CVPR_2023_paper.pdf + The polarization event camera PDAVIS is a novel bio-inspired neuromorphic vision sensor that reports both conventional polarization frames and asynchronous, continuously per-pixel polarization brightness changes (polarization events) with fast temporal resolution and large dynamic range. A deep neural network method (Polarization FireNet) was previously developed to reconstruct the polarization angle and degree from polarization events for bridging the gap between the polarization event camera and mainstream computer vision. However, Polarization FireNet applies a network pre-trained for normal event-based frame reconstruction independently on each of four channels of polarization events from four linear polarization angles, which ignores the correlations between channels and inevitably introduces content inconsistency between the four reconstructed frames, resulting in unsatisfactory polarization reconstruction performance. In this work, we strive to train an effective, yet efficient, DNN model that directly outputs polarization from the input raw polarization events. To this end, we constructed the first large-scale event-to-polarization dataset, which we subsequently employed to train our events-to-polarization network E2P. E2P extracts rich polarization patterns from input polarization events and enhances features through cross-modality context integration. We demonstrate that E2P outperforms Polarization FireNet by a significant margin with no additional computing cost. Experimental results also show that E2P produces more accurate measurement of polarization than the PDAVIS frames in challenging fast and high dynamic range scenes. + + + + VideoTrack: Learning To Track Objects via Video Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_VideoTrack_Learning_To_Track_Objects_via_Video_Transformer_CVPR_2023_paper.pdf + Existing Siamese tracking methods, which are built on pair-wise matching between two single frames, heavily rely on additional sophisticated mechanism to exploit temporal information among successive video frames, hindering them from high efficiency and industrial deployments. In this work, we resort to sequence-level target matching that can encode temporal contexts into the spatial features through a neat feedforward video model. 
Specifically, we adapt the standard video transformer architecture to visual tracking by enabling spatiotemporal feature learning directly from frame-level patch sequences. To better adapt to the tracking task, we carefully blend the spatiotemporal information in the video clips through sequential multi-branch triplet blocks, which formulates a video transformer backbone. Our experimental study compares different model variants, such as tokenization strategies, hierarchical structures, and video attention schemes. Then, we propose a disentangled dual-template mechanism that decouples static and dynamic appearance changes over time, and reduces the temporal redundancy in video frames. Extensive experiments show that our method, named as VideoTrack, achieves state-of-the-art results while running in real-time. + + + + Distilling Self-Supervised Vision Transformers for Weakly-Supervised Few-Shot Classification & Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Kang_Distilling_Self-Supervised_Vision_Transformers_for_Weakly-Supervised_Few-Shot_Classification__Segmentation_CVPR_2023_paper.pdf + We address the task of weakly-supervised few-shot image classification and segmentation, by leveraging a Vision Transformer (ViT) pretrained with self-supervision. Our proposed method takes token representations from the self-supervised ViT and leverages their correlations, via self-attention, to produce classification and segmentation predictions through separate task heads. Our model is able to effectively learn to perform classification and segmentation in the absence of pixel-level labels during training, using only image-level labels. To do this it uses attention maps, created from tokens generated by the self-supervised ViT backbone, as pixel-level pseudo-labels. We also explore a practical setup with "mixed" supervision, where a small number of training images contains ground-truth pixel-level labels and the remaining images have only image-level labels. For this mixed setup, we propose to improve the pseudo-labels using a pseudo-label enhancer that was trained using the available ground-truth pixel-level labels. Experiments on Pascal-5i and COCO-20i demonstrate significant performance gains in a variety of supervision settings, and in particular when little-to-no pixel-level labels are available. + + + + Collaborative Noisy Label Cleaner: Learning Scene-Aware Trailers for Multi-Modal Highlight Detection in Movies + http://openaccess.thecvf.com//content/CVPR2023/papers/Gan_Collaborative_Noisy_Label_Cleaner_Learning_Scene-Aware_Trailers_for_Multi-Modal_Highlight_CVPR_2023_paper.pdf + Movie highlights stand out of the screenplay for efficient browsing and play a crucial role on social media platforms. Based on existing efforts, this work has two observations: (1) For different annotators, labeling highlight has uncertainty, which leads to inaccurate and time-consuming annotations. (2) Besides previous supervised or unsupervised settings, some existing video corpora can be useful, e.g., trailers, but they are often noisy and incomplete to cover the full highlights. In this work, we study a more practical and promising setting, i.e., reformulating highlight detection as "learning with noisy labels". This setting does not require time-consuming manual annotations and can fully utilize existing abundant video corpora. First, based on movie trailers, we leverage scene segmentation to obtain complete shots, which are regarded as noisy labels. 
Then, we propose a Collaborative noisy Label Cleaner (CLC) framework to learn from noisy highlight moments. CLC consists of two modules: augmented cross-propagation (ACP) and multi-modality cleaning (MMC). The former aims to exploit the closely related audio-visual signals and fuse them to learn unified multi-modal representations. The latter aims to achieve cleaner highlight labels by observing the changes in losses among different modalities. To verify the effectiveness of CLC, we further collect a large-scale highlight dataset named MovieLights. Comprehensive experiments on MovieLights and YouTube Highlights datasets demonstrate the effectiveness of our approach. Code has been made available at: https://github.com/TencentYoutuResearch/HighlightDetection-CLC + + + + ContraNeRF: Generalizable Neural Radiance Fields for Synthetic-to-Real Novel View Synthesis via Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_ContraNeRF_Generalizable_Neural_Radiance_Fields_for_Synthetic-to-Real_Novel_View_Synthesis_CVPR_2023_paper.pdf + Although many recent works have investigated generalizable NeRF-based novel view synthesis for unseen scenes, they seldom consider the synthetic-to-real generalization, which is desired in many practical applications. In this work, we first investigate the effects of synthetic data in synthetic-to-real novel view synthesis and surprisingly observe that models trained with synthetic data tend to produce sharper but less accurate volume densities. For pixels where the volume densities are correct, fine-grained details will be obtained. Otherwise, severe artifacts will be produced. To maintain the advantages of using synthetic data while avoiding its negative effects, we propose to introduce geometry-aware contrastive learning to learn multi-view consistent features with geometric constraints. Meanwhile, we adopt cross-view attention to further enhance the geometry perception of features by querying features across input views. Experiments demonstrate that under the synthetic-to-real setting, our method can render images with higher quality and better fine-grained details, outperforming existing generalizable novel view synthesis methods in terms of PSNR, SSIM, and LPIPS. When trained on real data, our method also achieves state-of-the-art results. https://haoy945.github.io/contranerf/ + + + + PaletteNeRF: Palette-Based Appearance Editing of Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Kuang_PaletteNeRF_Palette-Based_Appearance_Editing_of_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Recent advances in neural radiance fields have enabled the high-fidelity 3D reconstruction of complex scenes for novel view synthesis. However, it remains underexplored how the appearance of such representations can be efficiently edited while maintaining photorealism. In this work, we present PaletteNeRF, a novel method for photorealistic appearance editing of neural radiance fields (NeRF) based on 3D color decomposition. Our method decomposes the appearance of each 3D point into a linear combination of palette-based bases (i.e., 3D segmentations defined by a group of NeRF-type functions) that are shared across the scene. While our palette-based bases are view-independent, we also predict a view-dependent function to capture the color residual (e.g., specular shading). 
During training, we jointly optimize the basis functions and the color palettes, and we also introduce novel regularizers to encourage the spatial coherence of the decomposition. Our method allows users to efficiently edit the appearance of the 3D scene by modifying the color palettes. We also extend our framework with compressed semantic features for semantic-aware appearance editing. We demonstrate that our technique is superior to baseline methods both quantitatively and qualitatively for appearance editing of complex real-world scenes. + + + + Contrastive Mean Teacher for Domain Adaptive Object Detectors + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_Contrastive_Mean_Teacher_for_Domain_Adaptive_Object_Detectors_CVPR_2023_paper.pdf + Object detectors often suffer from the domain gap between training (source domain) and real-world applications (target domain). Mean-teacher self-training is a powerful paradigm in unsupervised domain adaptation for object detection, but it struggles with low-quality pseudo-labels. In this work, we identify the intriguing alignment and synergy between mean-teacher self-training and contrastive learning. Motivated by this, we propose Contrastive Mean Teacher (CMT) -- a unified, general-purpose framework with the two paradigms naturally integrated to maximize beneficial learning signals. Instead of using pseudo-labels solely for final predictions, our strategy extracts object-level features using pseudo-labels and optimizes them via contrastive learning, without requiring labels in the target domain. When combined with recent mean-teacher self-training methods, CMT leads to new state-of-the-art target-domain performance: 51.9% mAP on Foggy Cityscapes, outperforming the previously best by 2.1% mAP. Notably, CMT can stabilize performance and provide more significant gains as pseudo-label noise increases. + + + + Learning Transferable Spatiotemporal Representations From Natural Script Knowledge + http://openaccess.thecvf.com//content/CVPR2023/papers/Zeng_Learning_Transferable_Spatiotemporal_Representations_From_Natural_Script_Knowledge_CVPR_2023_paper.pdf + Pre-training on large-scale video data has become a common recipe for learning transferable spatiotemporal representations in recent years. Despite some progress, existing methods are mostly limited to highly curated datasets (e.g., K400) and exhibit unsatisfactory out-of-the-box representations. We argue that it is due to the fact that they only capture pixel-level knowledge rather than spatiotemporal semantics, which hinders further progress in video understanding. Inspired by the great success of image-text pre-training (e.g., CLIP), we take the first step to exploit language semantics to boost transferable spatiotemporal representation learning. We introduce a new pretext task, Turning to Video for Transcript Sorting (TVTS), which sorts shuffled ASR scripts by attending to learned video representations. We do not rely on descriptive captions and learn purely from video, i.e., leveraging the natural transcribed speech knowledge to provide noisy but useful semantics over time. Our method enforces the vision model to contextualize what is happening over time so that it can re-organize the narrative transcripts, and can seamlessly apply to large-scale uncurated video data in the real world. Our method demonstrates strong out-of-the-box spatiotemporal representations on diverse benchmarks, e.g., +13.6% gains over VideoMAE on SSV2 via linear probing. 
The code is available at https://github.com/TencentARC/TVTS. + + + + CiCo: Domain-Aware Sign Language Retrieval via Cross-Lingual Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Bao_CiCo_Domain-Aware_Sign_Language_Retrieval_via_Cross-Lingual_Contrastive_Learning_CVPR_2023_paper.pdf + This work focuses on sign language retrieval--a recently proposed task for sign language understanding. Sign language retrieval consists of two sub-tasks: text-to-sign-video (T2V) retrieval and sign-video-to-text (V2T) retrieval. Different from traditional video-text retrieval, sign language videos not only contain visual signals but also carry abundant semantic meanings by themselves, due to the fact that sign languages are also natural languages. Considering this characteristic, we formulate sign language retrieval as a cross-lingual retrieval problem as well as a video-text retrieval task. Concretely, we take into account the linguistic properties of both sign languages and natural languages, and simultaneously identify the fine-grained cross-lingual (i.e., sign-to-word) mappings while contrasting the texts and the sign videos in a joint embedding space. This process is termed cross-lingual contrastive learning. Another challenge is raised by the data scarcity issue--sign language datasets are orders of magnitude smaller in scale than those for speech recognition. We alleviate this issue by adapting a domain-agnostic sign encoder pre-trained on large-scale sign videos to the target domain via pseudo-labeling. Our framework, termed domain-aware sign language retrieval via Cross-lingual Contrastive learning, or CiCo for short, outperforms the pioneering method by large margins on various datasets, e.g., +22.4 T2V and +28.0 V2T R@1 improvements on the How2Sign dataset, and +13.7 T2V and +17.1 V2T R@1 improvements on the PHOENIX-2014T dataset. Code and models are available at: https://github.com/FangyunWei/SLRT. + + + + Video Dehazing via a Multi-Range Temporal Alignment Network With Physical Prior + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Video_Dehazing_via_a_Multi-Range_Temporal_Alignment_Network_With_Physical_CVPR_2023_paper.pdf + Video dehazing aims to recover haze-free frames with high visibility and contrast. This paper presents a novel framework to effectively explore the physical haze priors and aggregate temporal information. Specifically, we design a memory-based physical prior guidance module to encode the prior-related features into long-range memory. Besides, we formulate a multi-range scene radiance recovery module to capture space-time dependencies in multiple space-time ranges, which helps to effectively aggregate temporal information from adjacent frames. Moreover, we construct the first large-scale outdoor video dehazing benchmark dataset, which contains videos in various real-world scenarios. Experimental results on both synthetic and real conditions show the superiority of our proposed method. + + + + Integrally Pre-Trained Transformer Pyramid Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Tian_Integrally_Pre-Trained_Transformer_Pyramid_Networks_CVPR_2023_paper.pdf + In this paper, we present an integral pre-training framework based on masked image modeling (MIM). We advocate for pre-training the backbone and neck jointly so that the transfer gap between MIM and downstream recognition tasks is minimal. We make two technical contributions.
First, we unify the reconstruction and recognition necks by inserting a feature pyramid into the pre-training stage. Second, we complement masked image modeling (MIM) with masked feature modeling (MFM), which offers multi-stage supervision to the feature pyramid. The pre-trained models, termed integrally pre-trained transformer pyramid networks (iTPNs), serve as powerful foundation models for visual recognition. In particular, the base/large-level iTPN achieves an 86.2%/87.8% top-1 accuracy on ImageNet-1K, a 53.2%/55.6% box AP on COCO object detection with a 1x training schedule using Mask-RCNN, and a 54.7%/57.7% mIoU on ADE20K semantic segmentation using UPerHead -- all these results set new records. Our work inspires the community to work on unifying upstream pre-training and downstream fine-tuning tasks. Code is available at https://github.com/sunsmarterjie/iTPN. + + + + Adaptive Channel Sparsity for Federated Learning Under System Heterogeneity + http://openaccess.thecvf.com//content/CVPR2023/papers/Liao_Adaptive_Channel_Sparsity_for_Federated_Learning_Under_System_Heterogeneity_CVPR_2023_paper.pdf + Owing to the non-i.i.d. nature of client data, channel neurons in federated-learned models may specialize to distinct features for different clients. Yet, existing channel-sparse federated learning (FL) algorithms prescribe fixed sparsity strategies for client models, and may thus prevent clients from training channel neurons collaboratively. To minimize the impact of sparsity on FL convergence, we propose Flado to improve the alignment of client model update trajectories by tailoring the sparsities of individual neurons in each client. Empirical results show that while other sparse methods have a surprisingly large impact on convergence, Flado can not only attain the highest task accuracies with unlimited budget across a range of datasets, but also significantly reduce the FLOPs required for training by more than 10x under the same communications budget, and push the Pareto frontier of the communication/computation trade-off notably further than competing FL algorithms. + + + + Sequential Training of GANs Against GAN-Classifiers Reveals Correlated "Knowledge Gaps" Present Among Independently Trained GAN Instances + http://openaccess.thecvf.com//content/CVPR2023/papers/Pathak_Sequential_Training_of_GANs_Against_GAN-Classifiers_Reveals_Correlated_Knowledge_Gaps_CVPR_2023_paper.pdf + Modern Generative Adversarial Networks (GANs) generate realistic images remarkably well. Previous work has demonstrated the feasibility of "GAN-classifiers" that are distinct from the co-trained discriminator, and operate on images generated from a frozen GAN. That such classifiers work at all affirms the existence of "knowledge gaps" (out-of-distribution artifacts across samples) present in GAN training. We iteratively train GAN-classifiers and train GANs that "fool" the classifiers (in an attempt to fill the knowledge gaps), and examine the effect on GAN training dynamics, output quality, and GAN-classifier generalization. We investigate two settings: a small DCGAN architecture trained on low-dimensional images (MNIST), and StyleGAN2, a SOTA GAN architecture trained on high-dimensional images (FFHQ). We find that the DCGAN is unable to effectively fool a held-out GAN-classifier without compromising the output quality.
However, StyleGAN2 can fool held-out classifiers with no change in output quality, and this effect persists over multiple rounds of GAN/classifier training, which appears to reveal an ordering over optima in the generator parameter space. Finally, we study different classifier architectures and show that the architecture of the GAN-classifier has a strong influence on the set of its learned artifacts. + + + + TriVol: Point Cloud Rendering via Triple Volumes + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_TriVol_Point_Cloud_Rendering_via_Triple_Volumes_CVPR_2023_paper.pdf + Existing learning-based methods for point cloud rendering adopt various 3D representations and feature querying mechanisms to alleviate the sparsity problem of point clouds. However, artifacts still appear in the rendered images, due to the challenges in extracting continuous and discriminative 3D features from point clouds. In this paper, we present a dense yet lightweight 3D representation, named TriVol, that can be combined with NeRF to render photo-realistic images from point clouds. Our TriVol consists of triple slim volumes, each of which is encoded from the input point cloud. Our representation has two advantages. First, it fuses the respective fields at different scales and thus extracts local and non-local features for discriminative representation. Second, since the volume size is greatly reduced, inference with our 3D decoder is efficient, allowing us to increase the resolution of the 3D space to render more point details. Extensive experiments on different benchmarks with varying kinds of scenes/objects demonstrate our framework's effectiveness compared with current approaches. Moreover, our framework has excellent generalization ability to render a category of scenes or objects without fine-tuning. + + + + (ML)$^2$P-Encoder: On Exploration of Channel-Class Correlation for Multi-Label Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_ML2P-Encoder_On_Exploration_of_Channel-Class_Correlation_for_Multi-Label_Zero-Shot_Learning_CVPR_2023_paper.pdf + Recent studies usually approach multi-label zero-shot learning (MLZSL) with visual-semantic mapping on spatial-class correlation, which can be computationally costly and, worse still, fails to capture fine-grained class-specific semantics. We observe that different channels may have different sensitivities to classes, which can correspond to specific semantics. Such an intrinsic channel-class correlation suggests a potential alternative for more accurate and class-harmonious feature representations. In this paper, our interest is to fully explore the power of channel-class correlation as the unique basis for MLZSL. Specifically, we propose a light yet efficient Multi-Label Multi-Layer Perceptron-based Encoder, dubbed (ML)^2P-Encoder, to extract and preserve channel-wise semantics. We reorganize the generated feature maps into several groups, each of which can be trained independently with the (ML)^2P-Encoder. On top of that, a global group-wise attention module is further designed to build the multi-label-specific class relationships among different classes, which eventually fulfills a novel Channel-Class Correlation MLZSL framework (C^3-MLZSL). Extensive experiments on large-scale MLZSL benchmarks including NUS-WIDE and Open-Images-V4 demonstrate the superiority of our model against other representative state-of-the-art models.
+ + + + Image as a Foreign Language: BEiT Pretraining for Vision and Vision-Language Tasks + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Image_as_a_Foreign_Language_BEiT_Pretraining_for_Vision_and_CVPR_2023_paper.pdf + A big convergence of language, vision, and multimodal pretraining is emerging. In this work, we introduce a general-purpose multimodal foundation model BEiT-3, which achieves excellent transfer performance on both vision and vision-language tasks. Specifically, we advance the big convergence from three aspects: backbone architecture, pretraining task, and model scaling up. We use Multiway Transformers for general-purpose modeling, where the modular architecture enables both deep fusion and modality-specific encoding. Based on the shared backbone, we perform masked "language" modeling on images (Imglish), texts (English), and image-text pairs ("parallel sentences") in a unified manner. Experimental results show that BEiT-3 obtains remarkable performance on object detection (COCO), semantic segmentation (ADE20K), image classification (ImageNet), visual reasoning (NLVR2), visual question answering (VQAv2), image captioning (COCO), and cross-modal retrieval (Flickr30K, COCO). + + + + Density-Insensitive Unsupervised Domain Adaption on 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Density-Insensitive_Unsupervised_Domain_Adaption_on_3D_Object_Detection_CVPR_2023_paper.pdf + 3D object detection from point clouds is crucial in safety-critical autonomous driving. Although many works have made great efforts and achieved significant progress on this task, most of them suffer from expensive annotation cost and poor transferability to unknown data due to the domain gap. Recently, few works attempt to tackle the domain gap in objects, but still fail to adapt to the gap of varying beam-densities between two domains, which is critical to mitigate the characteristic differences of the LiDAR collectors. To this end, we make the attempt to propose a density-insensitive domain adaption framework to address the density-induced domain gap. In particular, we first introduce Random Beam Re-Sampling (RBRS) to enhance the robustness of 3D detectors trained on the source domain to the varying beam-density. Then, we take this pre-trained detector as the backbone model, and feed the unlabeled target domain data into our newly designed task-specific teacher-student framework for predicting its high-quality pseudo labels. To further adapt the property of density-insensitive into the target domain, we feed the teacher and student branches with the same sample of different densities, and propose an Object Graph Alignment (OGA) module to construct two object-graphs between the two branches for enforcing the consistency in both the attribute and relation of cross-density objects. Experimental results on three widely adopted 3D object detection datasets demonstrate that our proposed domain adaption method outperforms the state-of-the-art methods, especially over varying-density data. Code is available at https://github.com/WoodwindHu/DTS. + + + + Learning Action Changes by Measuring Verb-Adverb Textual Relationships + http://openaccess.thecvf.com//content/CVPR2023/papers/Moltisanti_Learning_Action_Changes_by_Measuring_Verb-Adverb_Textual_Relationships_CVPR_2023_paper.pdf + The goal of this work is to understand the way actions are performed in videos. That is, given a video, we aim to predict an adverb indicating a modification applied to the action (e.g. 
cut "finely"). We cast this problem as a regression task. We measure textual relationships between verbs and adverbs to generate a regression target representing the action change we aim to learn. We test our approach on a range of datasets and achieve state-of-the-art results on both adverb prediction and antonym classification. Furthermore, we outperform previous work when we lift two commonly assumed conditions: the availability of action labels during testing and the pairing of adverbs as antonyms. Existing datasets for adverb recognition are either noisy, which makes learning difficult, or contain actions whose appearance is not influenced by adverbs, which makes evaluation less reliable. To address this, we collect a new high quality dataset: Adverbs in Recipes (AIR). We focus on instructional recipes videos, curating a set of actions that exhibit meaningful visual changes when performed differently. Videos in AIR are more tightly trimmed and were manually reviewed by multiple annotators to ensure high labelling quality. Results show that models learn better from AIR given its cleaner videos. At the same time, adverb prediction on AIR is challenging, demonstrating that there is considerable room for improvement. + + + + Context-Aware Pretraining for Efficient Blind Image Decomposition + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Context-Aware_Pretraining_for_Efficient_Blind_Image_Decomposition_CVPR_2023_paper.pdf + In this paper, we study Blind Image Decomposition (BID), which is to uniformly remove multiple types of degradation at once without foreknowing the noise type. There remain two practical challenges: (1) Existing methods typically require massive data supervision, making them infeasible to real-world scenarios. (2) The conventional paradigm usually focuses on mining the abnormal pattern of a superimposed image to separate the noise, which de facto conflicts with the primary image restoration task. Therefore, such a pipeline compromises repairing efficiency and authenticity. In an attempt to solve the two challenges in one go, we propose an efficient and simplified paradigm, called Context-aware Pretraining (CP), with two pretext tasks: mixed image separation and masked image reconstruction. Such a paradigm reduces the annotation demands and explicitly facilitates context-aware feature learning. Assuming the restoration process follows a structure-to-texture manner, we also introduce a Context-aware Pretrained network (CPNet). In particular, CPNet contains two transformer-based parallel encoders, one information fusion module, and one multi-head prediction module. The information fusion module explicitly utilizes the mutual correlation in the spatial-channel dimension, while the multi-head prediction module facilitates texture-guided appearance flow. Moreover, a new sampling loss along with an attribute label constraint is also deployed to make use of the spatial context, leading to high-fidelity image restoration. Extensive experiments on both real and synthetic benchmarks show that our method achieves competitive performance for various BID tasks. 
+ + + + Weakly Supervised Posture Mining for Fine-Grained Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Weakly_Supervised_Posture_Mining_for_Fine-Grained_Classification_CVPR_2023_paper.pdf + Because of the subtle differences between the sub-categories of common visual categories such as bird species, fine-grained classification has been seen as a challenging task for many years. Most previous works focus on features in a single discriminative region in isolation, while neglecting the connections between different discriminative regions in the whole image. However, the relationship between different discriminative regions contains rich posture information, and by adding this posture information, the model can learn the behavior of the object, which helps improve classification performance. In this paper, we propose a novel fine-grained framework named PMRC (posture mining and reverse cross-entropy), which can be combined with different backbones to good effect. In PMRC, we use the Deep Navigator to generate discriminative regions from the images, and then use them to construct a graph. We aggregate the graph by message passing and obtain the classification results. Specifically, in order to force PMRC to learn how to mine the posture information, we design a novel training paradigm, which makes the Deep Navigator and message passing communicate and train together. In addition, we propose the reverse cross-entropy (RCE) loss and demonstrate that, compared to the cross-entropy (CE) loss, RCE not only improves the accuracy of our model but also generalizes to improve the accuracy of other kinds of fine-grained classification models. Experimental results on benchmark datasets confirm that PMRC achieves state-of-the-art performance. + + + + LAVENDER: Unifying Video-Language Understanding As Masked Language Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_LAVENDER_Unifying_Video-Language_Understanding_As_Masked_Language_Modeling_CVPR_2023_paper.pdf + Unified vision-language frameworks have greatly advanced in recent years, most of which adopt an encoder-decoder architecture to unify image-text tasks as sequence-to-sequence generation. However, existing video-language (VidL) models still require task-specific designs in model architecture and training objectives for each task. In this work, we explore a unified VidL framework LAVENDER, where Masked Language Modeling (MLM) is used as the common interface for all pre-training and downstream tasks. Such unification leads to a simplified model architecture, where only a lightweight MLM head, instead of a decoder with many more parameters, is needed on top of the multimodal encoder. Surprisingly, experimental results show that this unified framework achieves competitive performance on 14 VidL benchmarks, covering video question answering, text-to-video retrieval and video captioning. Extensive analyses further demonstrate that LAVENDER can (i) seamlessly support all downstream tasks with just a single set of parameter values when multi-task finetuned; (ii) generalize to various downstream tasks with limited training samples; and (iii) enable zero-shot evaluation on video question answering tasks. + + + + Robust Unsupervised StyleGAN Image Restoration + http://openaccess.thecvf.com//content/CVPR2023/papers/Poirier-Ginter_Robust_Unsupervised_StyleGAN_Image_Restoration_CVPR_2023_paper.pdf + GAN-based image restoration inverts the generative process to repair images corrupted by known degradations.
Existing unsupervised methods must carefully be tuned for each task and degradation level. In this work, we make StyleGAN image restoration robust: a single set of hyperparameters works across a wide range of degradation levels. This makes it possible to handle combinations of several degradations, without the need to retune. Our proposed approach relies on a 3-phase progressive latent space extension and a conservative optimizer, which avoids the need for any additional regularization terms. Extensive experiments demonstrate robustness on inpainting, upsampling, denoising, and deartifacting at varying degradations levels, outperforming other StyleGAN-based inversion techniques. Our approach also favorably compares to diffusion-based restoration by yielding much more realistic inversion results. Code will be released upon publication. + + + + Event-Based Frame Interpolation With Ad-Hoc Deblurring + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Event-Based_Frame_Interpolation_With_Ad-Hoc_Deblurring_CVPR_2023_paper.pdf + The performance of video frame interpolation is inherently correlated with the ability to handle motion in the input scene. Even though previous works recognize the utility of asynchronous event information for this task, they ignore the fact that motion may or may not result in blur in the input video to be interpolated, depending on the length of the exposure time of the frames and the speed of the motion, and assume either that the input video is sharp, restricting themselves to frame interpolation, or that it is blurry, including an explicit, separate deblurring stage before interpolation in their pipeline. We instead propose a general method for event-based frame interpolation that performs deblurring ad-hoc and thus works both on sharp and blurry input videos. Our model consists in a bidirectional recurrent network that naturally incorporates the temporal dimension of interpolation and fuses information from the input frames and the events adaptively based on their temporal proximity. In addition, we introduce a novel real-world high-resolution dataset with events and color videos which provides a challenging evaluation setting for the examined task. Extensive experiments on the standard GoPro benchmark and on our dataset show that our network consistently outperforms previous state-of-the-art methods on frame interpolation, single image deblurring and the joint task of interpolation and deblurring. Our code and dataset will be available at https://github.com/AHupuJR/REFID. + + + + OvarNet: Towards Open-Vocabulary Object Attribute Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_OvarNet_Towards_Open-Vocabulary_Object_Attribute_Recognition_CVPR_2023_paper.pdf + In this paper, we consider the problem of simultaneously detecting objects and inferring their visual attributes in an image, even for those with no manual annotations provided at the training stage, resembling an open-vocabulary scenario. To achieve this goal, we make the following contributions: (i) we start with a naive two-stage approach for open-vocabulary object detection and attribute classification, termed CLIP-Attr. 
The candidate objects are first proposed with an offline RPN and later classified for semantic category and attributes; (ii) we combine all available datasets and train with a federated strategy to finetune the CLIP model, aligning the visual representation with attributes, additionally, we investigate the efficacy of leveraging freely available online image-caption pairs under weakly supervised learning; (iii) in pursuit of efficiency, we train a Faster-RCNN type model end-to-end with knowledge distillation, that performs class-agnostic object proposals and classification on semantic categories and attributes with classifiers generated from a text encoder; Finally, (iv) we conduct extensive experiments on VAW, MS-COCO, LSA, and OVAD datasets, and show that recognition of semantic category and attributes is complementary for visual scene understanding, i.e., jointly training object detection and attributes prediction largely outperform existing approaches that treat the two tasks independently, demonstrating strong generalization ability to novel attributes and categories. + + + + 3D Line Mapping Revisited + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_3D_Line_Mapping_Revisited_CVPR_2023_paper.pdf + In contrast to sparse keypoints, a handful of line segments can concisely encode the high-level scene layout, as they often delineate the main structural elements. In addition to offering strong geometric cues, they are also omnipresent in urban landscapes and indoor scenes. Despite their apparent advantages, current line-based reconstruction methods are far behind their point-based counterparts. In this paper we aim to close the gap by introducing LIMAP, a library for 3D line mapping that robustly and efficiently creates 3D line maps from multi-view imagery. This is achieved through revisiting the degeneracy problem of line triangulation, carefully crafted scoring and track building, and exploiting structural priors such as line coincidence, parallelism, and orthogonality. Our code integrates seamlessly with existing point-based Structure-from-Motion methods and can leverage their 3D points to further improve the line reconstruction. Furthermore, as a byproduct, the method is able to recover 3D association graphs between lines and points / vanishing points (VPs). In thorough experiments, we show that LIMAP significantly outperforms existing approaches for 3D line mapping. Our robust 3D line maps also open up new research directions. We show two example applications: visual localization and bundle adjustment, where integrating lines alongside points yields the best results. Code is available at https://github.com/cvg/limap. + + + + Efficient and Explicit Modelling of Image Hierarchies for Image Restoration + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Efficient_and_Explicit_Modelling_of_Image_Hierarchies_for_Image_Restoration_CVPR_2023_paper.pdf + The aim of this paper is to propose a mechanism to efficiently and explicitly model image hierarchies in the global, regional, and local range for image restoration. To achieve that, we start by analyzing two important properties of natural images including cross-scale similarity and anisotropic image features. Inspired by that, we propose the anchored stripe self-attention which achieves a good balance between the space and time complexity of self-attention and the modelling capacity beyond the regional range. 
Then we propose a new network architecture dubbed GRL to explicitly model image hierarchies in the Global, Regional, and Local range via anchored stripe self-attention, window self-attention, and channel attention enhanced convolution. Finally, the proposed network is applied to 7 image restoration types, covering both real and synthetic settings. The proposed method sets the new state-of-the-art for several of those. Code will be available at https://github.com/ofsoundof/GRL-Image-Restoration.git. + + + + DKT: Diverse Knowledge Transfer Transformer for Class Incremental Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_DKT_Diverse_Knowledge_Transfer_Transformer_for_Class_Incremental_Learning_CVPR_2023_paper.pdf + Deep neural networks suffer from catastrophic forgetting in class incremental learning, where the classification accuracy of old classes drastically deteriorates when the networks learn the knowledge of new classes. Many works have been proposed to solve the class incremental learning problem. However, most of them either suffer from serious catastrophic forgetting and the stability-plasticity dilemma or need too many extra parameters and computations. To meet the challenge, we propose a novel framework, Diverse Knowledge Transfer Transformer (DKT), which contains two novel knowledge transfers based on the attention mechanism to transfer the task-general knowledge and task-specific knowledge to the current task to alleviate catastrophic forgetting. Besides, we propose a duplex classifier to address the stability-plasticity dilemma, and a novel loss function to cluster the same categories in feature space and discriminate the features between old and new tasks to force the task-specific knowledge to be more diverse. Our method needs only a few extra parameters, which are negligible, to tackle the increasing number of tasks. We conduct comprehensive experiments on the CIFAR100 and ImageNet100/1000 datasets. The experimental results show that our method outperforms other competitive methods and achieves state-of-the-art performance. + + + + TarViS: A Unified Approach for Target-Based Video Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Athar_TarViS_A_Unified_Approach_for_Target-Based_Video_Segmentation_CVPR_2023_paper.pdf + The general domain of video segmentation is currently fragmented into different tasks spanning multiple benchmarks. Despite rapid progress in the state-of-the-art, current methods are overwhelmingly task-specific and cannot conceptually generalize to other tasks. Inspired by recent approaches with multi-task capability, we propose TarViS: a novel, unified network architecture that can be applied to any task that requires segmenting a set of arbitrarily defined 'targets' in video. Our approach is flexible with respect to how tasks define these targets, since it models the latter as abstract 'queries' which are then used to predict pixel-precise target masks. A single TarViS model can be trained jointly on a collection of datasets spanning different tasks, and can hot-swap between tasks during inference without any task-specific retraining. To demonstrate its effectiveness, we apply TarViS to four different tasks, namely Video Instance Segmentation (VIS), Video Panoptic Segmentation (VPS), Video Object Segmentation (VOS) and Point Exemplar-guided Tracking (PET). Our unified, jointly trained model achieves state-of-the-art performance on 5/7 benchmarks spanning these four tasks, and competitive performance on the remaining two.
Code and model weights are available at: https://github.com/Ali2500/TarViS + + + + IDGI: A Framework To Eliminate Explanation Noise From Integrated Gradients + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_IDGI_A_Framework_To_Eliminate_Explanation_Noise_From_Integrated_Gradients_CVPR_2023_paper.pdf + Integrated Gradients (IG) as well as its variants are well-known techniques for interpreting the decisions of deep neural networks. While IG-based approaches attain state-of-the-art performance, they often integrate noise into their explanation saliency maps, which reduces their interpretability. To minimize the noise, we examine the source of the noise analytically and propose a new approach to reduce the explanation noise based on our analytical findings. We propose the Important Direction Gradient Integration (IDGI) framework, which can be easily incorporated into any IG-based method that uses Riemann integration for integrated gradient computation. Extensive experiments with three IG-based methods show that IDGI improves them drastically on numerous interpretability metrics. + + + + Implicit Surface Contrastive Clustering for LiDAR Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Implicit_Surface_Contrastive_Clustering_for_LiDAR_Point_Clouds_CVPR_2023_paper.pdf + Self-supervised pretraining on large unlabeled datasets has shown tremendous success in improving the performance of many computer vision tasks. However, such techniques have not been widely used for outdoor LiDAR point cloud perception due to its scene complexity and wide range. This prevents the impactful application of 2D pretraining frameworks. In this paper, we propose ISCC, a new self-supervised pretraining method, the core of which is two pretext tasks newly designed for LiDAR point clouds. The first task focuses on learning semantic information by sorting local groups of points in the scene into a globally consistent set of semantically meaningful clusters using contrastive learning. This is augmented with a second task which reasons about precise surfaces of various parts of the scene through implicit surface reconstruction to learn geometric structures. We demonstrate their effectiveness on transfer learning performance on 3D object detection and semantic segmentation in real-world LiDAR scenes. We further design an unsupervised semantic grouping task to showcase the highly semantically meaningful features learned by our approach. + + + + Semantic Ray: Learning a Generalizable Semantic Field With Cross-Reprojection Attention + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Semantic_Ray_Learning_a_Generalizable_Semantic_Field_With_Cross-Reprojection_Attention_CVPR_2023_paper.pdf + In this paper, we aim to learn a semantic radiance field from multiple scenes that is accurate, efficient and generalizable. While most existing NeRFs target the tasks of neural scene rendering, image synthesis and multi-view reconstruction, there are a few attempts such as Semantic-NeRF that explore learning high-level semantic understanding with the NeRF structure. However, Semantic-NeRF simultaneously learns color and semantic label from a single ray with multiple heads, where the single ray fails to provide rich semantic information. As a result, Semantic-NeRF relies on positional encoding and needs to train one specific model for each scene. To address this, we propose Semantic Ray (S-Ray) to fully exploit semantic information along the ray direction from its multi-view reprojections.
As directly performing dense attention over multi-view reprojected rays would suffer from heavy computational cost, we design a Cross-Reprojection Attention module with consecutive intra-view radial and cross-view sparse attentions, which decomposes contextual information along reprojected rays and cross multiple views and then collects dense connections by stacking the modules. Experiments show that our S-Ray is able to learn from multiple scenes, and it presents strong generalization ability to adapt to unseen scenes. + + + + ORCa: Glossy Objects As Radiance-Field Cameras + http://openaccess.thecvf.com//content/CVPR2023/papers/Tiwary_ORCa_Glossy_Objects_As_Radiance-Field_Cameras_CVPR_2023_paper.pdf + Reflections on glossy objects contain valuable and hidden information about the surrounding environment. By converting these objects into cameras, we can unlock exciting applications, including imaging beyond the camera's field-of-view and from seemingly impossible vantage points, e.g. from reflections on the human eye. However, this task is challenging because reflections depend jointly on object geometry, material properties, the 3D environment, and the observer's viewing direction. Our approach converts glossy objects with unknown geometry into radiance-field cameras to image the world from the object's perspective. Our key insight is to convert the object surface into a virtual sensor that captures cast reflections as a 2D projection of the 5D environment radiance field visible to and surrounding the object. We show that recovering the environment radiance fields enables depth and radiance estimation from the object to its surroundings in addition to beyond field-of-view novel-view synthesis, i.e. rendering of novel views that are only directly visible to the glossy object present in the scene, but not the observer. Moreover, using the radiance field we can image around occluders caused by close-by objects in the scene. Our method is trained end-to-end on multi-view images of the object and jointly estimates object geometry, diffuse radiance, and the 5D environment radiance field. + + + + SECAD-Net: Self-Supervised CAD Reconstruction by Learning Sketch-Extrude Operations + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_SECAD-Net_Self-Supervised_CAD_Reconstruction_by_Learning_Sketch-Extrude_Operations_CVPR_2023_paper.pdf + Reverse engineering CAD models from raw geometry is a classic but strenuous research problem. Previous learning-based methods rely heavily on labels due to the supervised design patterns or reconstruct CAD shapes that are not easily editable. In this work, we introduce SECAD-Net, an end-to-end neural network aimed at reconstructing compact and easy-to-edit CAD models in a self-supervised manner. Drawing inspiration from the modeling language that is most commonly used in modern CAD software, we propose to learn 2D sketches and 3D extrusion parameters from raw shapes, from which a set of extrusion cylinders can be generated by extruding each sketch from a 2D plane into a 3D body. By incorporating the Boolean operation (i.e., union), these cylinders can be combined to closely approximate the target geometry. We advocate the use of implicit fields for sketch representation, which allows for creating CAD variations by interpolating latent codes in the sketch latent space. 
Extensive experiments on both ABC and Fusion 360 datasets demonstrate the effectiveness of our method, and show superiority over state-of-the-art alternatives including the closely related method for supervised CAD reconstruction. We further apply our approach to CAD editing and single-view CAD reconstruction. The code is released at https://github.com/BunnySoCrazy/SECAD-Net. + + + + MDL-NAS: A Joint Multi-Domain Learning Framework for Vision Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_MDL-NAS_A_Joint_Multi-Domain_Learning_Framework_for_Vision_Transformer_CVPR_2023_paper.pdf + In this work, we introduce MDL-NAS, a unified framework that integrates multiple vision tasks into a manageable supernet and optimizes these tasks collectively under diverse dataset domains. MDL-NAS is storage-efficient since multiple models with a majority of shared parameters can be deposited into a single one. Technically, MDL-NAS constructs a coarse-to-fine search space, where the coarse search space offers various optimal architectures for different tasks while the fine search space provides fine-grained parameter sharing to tackle the inherent obstacles of multi-domain learning. In the fine search space, we suggest two parameter sharing policies, i.e., sequential sharing policy and mask sharing policy. Compared with previous works, such two sharing policies allow for the partial sharing and non-sharing of parameters at each layer of the network, hence attaining real fine-grained parameter sharing. Finally, we present a joint-subnet search algorithm that finds the optimal architecture and sharing parameters for each task within total resource constraints, challenging the traditional practice that downstream vision tasks are typically equipped with backbone networks designed for image classification. Experimentally, we demonstrate that MDL-NAS families fitted with non-hierarchical or hierarchical transformers deliver competitive performance for all tasks compared with state-of-the-art methods while maintaining efficient storage deployment and computation. We also demonstrate that MDL-NAS allows incremental learning and evades catastrophic forgetting when generalizing to a new task. + + + + Dual Alignment Unsupervised Domain Adaptation for Video-Text Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Hao_Dual_Alignment_Unsupervised_Domain_Adaptation_for_Video-Text_Retrieval_CVPR_2023_paper.pdf + Video-text retrieval is an emerging stream in both computer vision and natural language processing communities, which aims to find relevant videos given text queries. In this paper, we study the notoriously challenging task, i.e., Unsupervised Domain Adaptation Video-text Retrieval (UDAVR), wherein training and testing data come from different distributions. Previous works merely alleviate the domain shift, which however overlook the pairwise misalignment issue in target domain, i.e., there exist no semantic relationships between target videos and texts. To tackle this, we propose a novel method named Dual Alignment Domain Adaptation (DADA). Specifically, we first introduce the cross-modal semantic embedding to generate discriminative source features in a joint embedding space. Besides, we utilize the video and text domain adaptations to smoothly balance the minimization of the domain shifts. To tackle the pairwise misalignment in target domain, we introduce the Dual Alignment Consistency (DAC) to fully exploit the semantic information of both modalities in target domain. 
The proposed DAC adaptively aligns the video-text pairs that are more likely to be relevant in the target domain, so that positive pairs increase progressively and the noisy ones are potentially aligned in later stages. To that end, our method can generate more truly aligned target pairs and ensure the discriminability of target features. Compared with the state-of-the-art methods, DADA achieves 20.18% and 18.61% relative improvements on R@1 under the setting of TGIF->MSRVTT and TGIF->MSVD, respectively, demonstrating the superiority of our method. + + + + Computational Flash Photography Through Intrinsics + http://openaccess.thecvf.com//content/CVPR2023/papers/Maralan_Computational_Flash_Photography_Through_Intrinsics_CVPR_2023_paper.pdf + Flash is an essential tool as it often serves as the sole controllable light source in everyday photography. However, the use of flash is a binary decision at the time a photograph is captured with limited control over its characteristics such as strength or color. In this work, we study the computational control of the flash light in photographs taken with or without flash. We present a physically motivated intrinsic formulation for flash photograph formation and develop flash decomposition and generation methods for flash and no-flash photographs, respectively. We demonstrate that our intrinsic formulation outperforms alternatives in the literature and allows us to computationally control flash in in-the-wild images. + + + + SpaText: Spatio-Textual Representation for Controllable Image Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Avrahami_SpaText_Spatio-Textual_Representation_for_Controllable_Image_Generation_CVPR_2023_paper.pdf + Recent text-to-image diffusion models are able to generate convincing results of unprecedented quality. However, it is nearly impossible to control the shapes of different regions/objects or their layout in a fine-grained fashion. Previous attempts to provide such controls were hindered by their reliance on a fixed set of labels. To this end, we present SpaText --- a new method for text-to-image generation using open-vocabulary scene control. In addition to a global text prompt that describes the entire scene, the user provides a segmentation map where each region of interest is annotated by a free-form natural language description. Due to the lack of large-scale datasets that have a detailed textual description for each region in the image, we choose to leverage the current large-scale text-to-image datasets and base our approach on a novel CLIP-based spatio-textual representation, and show its effectiveness on two state-of-the-art diffusion models: pixel-based and latent-based. In addition, we show how to extend the classifier-free guidance method in diffusion models to the multi-conditional case and present an alternative accelerated inference algorithm. Finally, we offer several automatic evaluation metrics and use them, in addition to FID scores and a user study, to evaluate our method and show that it achieves state-of-the-art results on image generation with free-form textual scene control.
+ + + + The ObjectFolder Benchmark: Multisensory Learning With Neural and Real Objects + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_The_ObjectFolder_Benchmark_Multisensory_Learning_With_Neural_and_Real_Objects_CVPR_2023_paper.pdf + We introduce the ObjectFolder Benchmark, a benchmark suite of 10 tasks for multisensory object-centric learning, centered around object recognition, reconstruction, and manipulation with sight, sound, and touch. We also introduce the ObjectFolder Real dataset, including the multisensory measurements for 100 real-world household objects, building upon a newly designed pipeline for collecting the 3D meshes, videos, impact sounds, and tactile readings of real-world objects. For each task in the ObjectFolder Benchmark, we conduct systematic benchmarking on both the 1,000 multisensory neural objects from ObjectFolder, and the real multisensory data from ObjectFolder Real. Our results demonstrate the importance of multisensory perception and reveal the respective roles of vision, audio, and touch for different object-centric learning tasks. By publicly releasing our dataset and benchmark suite, we hope to catalyze and enable new research in multisensory object-centric learning in computer vision, robotics, and beyond. Project page: https://objectfolder.stanford.edu + + + + ScaleFL: Resource-Adaptive Federated Learning With Heterogeneous Clients + http://openaccess.thecvf.com//content/CVPR2023/papers/Ilhan_ScaleFL_Resource-Adaptive_Federated_Learning_With_Heterogeneous_Clients_CVPR_2023_paper.pdf + Federated learning (FL) is an attractive distributed learning paradigm supporting real-time continuous learning and client privacy by default. In most FL approaches, all edge clients are assumed to have sufficient computation capabilities to participate in the learning of a deep neural network (DNN) model. However, in real-life applications, some clients may have severely limited resources and can only train a much smaller local model. This paper presents ScaleFL, a novel FL approach with two distinctive mechanisms to handle resource heterogeneity and provide an equitable FL framework for all clients. First, ScaleFL adaptively scales down the DNN model along width and depth dimensions by leveraging early exits to find the best-fit models for resource-aware local training on distributed clients. In this way, ScaleFL provides an efficient balance of preserving basic and complex features in local model splits with various sizes for joint training while enabling fast inference for model deployment. Second, ScaleFL utilizes self-distillation among exit predictions during training to improve aggregation through knowledge transfer among subnetworks. We conduct extensive experiments on benchmark CV (CIFAR-10/100, ImageNet) and NLP datasets (SST-2, AgNews). We demonstrate that ScaleFL outperforms existing representative heterogeneous FL approaches in terms of global/local model performance and provides inference efficiency, with up to 2x latency and 4x model size reduction with negligible performance drop below 2%. + + + + Reliable and Interpretable Personalized Federated Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Qin_Reliable_and_Interpretable_Personalized_Federated_Learning_CVPR_2023_paper.pdf + Federated learning can coordinate multiple users to participate in data training while ensuring data privacy. The collaboration of multiple agents allows for a natural connection between federated learning and collective intelligence. 
When there are large differences in data distribution among clients, it is crucial for federated learning to design a reliable client selection strategy and an interpretable client communication framework to better utilize group knowledge. Herein, a reliable personalized federated learning approach, termed RIPFL, is proposed and fully interpreted from the perspective of social learning. RIPFL reliably selects and divides the clients involved in training such that each client can use different amounts of social information and more effectively communicate with other clients. Simultaneously, the method effectively integrates personal information with the social information generated by the global model from the perspective of Bayesian decision rules and evidence theory, enabling individuals to grow better with the help of collective wisdom. An interpretable federated learning mind is well scalable, and the experimental results indicate that the proposed method has superior robustness and accuracy than other state-of-the-art federated learning algorithms. + + + + Optimal Transport Minimization: Crowd Localization on Density Maps for Semi-Supervised Counting + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Optimal_Transport_Minimization_Crowd_Localization_on_Density_Maps_for_Semi-Supervised_CVPR_2023_paper.pdf + The accuracy of crowd counting in images has improved greatly in recent years due to the development of deep neural networks for predicting crowd density maps. However, most methods do not further explore the ability to localize people in the density map, with those few works adopting simple methods, like finding the local peaks in the density map. In this paper, we propose the optimal transport minimization (OT-M) algorithm for crowd localization with density maps. The objective of OT-M is to find a target point map that has the minimal Sinkhorn distance with the input density map, and we propose an iterative algorithm to compute the solution. We then apply OT-M to generate hard pseudo-labels (point maps) for semi-supervised counting, rather than the soft pseudo-labels (density maps) used in previous methods. Our hard pseudo-labels provide stronger supervision, and also enable the use of recent density-to-point loss functions for training. We also propose a confidence weighting strategy to give higher weight to the more reliable unlabeled data. Extensive experiments show that our methods achieve outstanding performance on both crowd localization and semi-supervised counting. Code is available at https://github.com/Elin24/OT-M. + + + + AdamsFormer for Spatial Action Localization in the Future + http://openaccess.thecvf.com//content/CVPR2023/papers/Chi_AdamsFormer_for_Spatial_Action_Localization_in_the_Future_CVPR_2023_paper.pdf + Predicting future action locations is vital for applications like human-robot collaboration. While some computer vision tasks have made progress in predicting human actions, accurately localizing these actions in future frames remains an area with room for improvement. We introduce a new task called spatial action localization in the future (SALF), which aims to predict action locations in both observed and future frames. SALF is challenging because it requires understanding the underlying physics of video observations to predict future action locations accurately. To address SALF, we use the concept of NeuralODE, which models the latent dynamics of sequential data by solving ordinary differential equations (ODE) with neural networks. 
We propose a novel architecture, AdamsFormer, which extends observed frame features to future time horizons by modeling continuous temporal dynamics through ODE solving. Specifically, we employ the Adams method, a multi-step approach that efficiently uses information from previous steps without discarding it. Our extensive experiments on UCF101-24 and JHMDB-21 datasets demonstrate that our proposed model outperforms existing long-range temporal modeling methods by a significant margin in terms of frame-mAP. + + + + Leveraging per Image-Token Consistency for Vision-Language Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Gou_Leveraging_per_Image-Token_Consistency_for_Vision-Language_Pre-Training_CVPR_2023_paper.pdf + Most existing vision-language pre-training (VLP) approaches adopt cross-modal masked language modeling (CMLM) to learn vision-language associations. However, we find that CMLM is insufficient for this purpose according to our observations: (1) Modality bias: a considerable number of masked tokens in CMLM can be recovered with only the language information, ignoring the visual inputs. (2) Under-utilization of the unmasked tokens: CMLM primarily focuses on the masked token but it cannot simultaneously leverage other tokens to learn vision-language associations. To handle those limitations, we propose EPIC (lEveraging Per Image-Token Consistency for vision-language pre-training). In EPIC, for each image-sentence pair, we mask tokens that are salient to the image (i.e., Saliency-based Masking Strategy) and replace them with alternatives sampled from a language model (i.e., Inconsistent Token Generation Procedure), and then the model is required to determine for each token in the sentence whether it is consistent with the image (i.e., Image-Token Consistency Task). The proposed EPIC method is easily combined with pre-training methods. Extensive experiments show that the combination of the EPIC method and state-of-the-art pre-training approaches, including ViLT, ALBEF, METER, and X-VLM, leads to significant improvements on downstream tasks. Our code is released at https://github.com/gyhdog99/epic + + + + UTM: A Unified Multiple Object Tracking Model With Identity-Aware Feature Enhancement + http://openaccess.thecvf.com//content/CVPR2023/papers/You_UTM_A_Unified_Multiple_Object_Tracking_Model_With_Identity-Aware_Feature_CVPR_2023_paper.pdf + Recently, Multiple Object Tracking, which consists of object detection, feature embedding, and identity association, has achieved great success. Existing methods apply the three-step or two-step paradigm to generate robust trajectories, where identity association is independent of other components. However, the independent identity association results in the identity-aware knowledge contained in the tracklet not being used to boost the detection and embedding modules. To overcome the limitations of existing methods, we introduce a novel Unified Tracking Model (UTM) to bridge those three components for generating a positive feedback loop with mutual benefits. The key insight of UTM is the Identity-Aware Feature Enhancement (IAFE), which is applied to bridge and benefit these three components by utilizing the identity-aware knowledge to boost detection and embedding.
Formally, IAFE contains the Identity-Aware Boosting Attention (IABA) and the Identity-Aware Erasing Attention (IAEA), where IABA enhances the consistent regions between the current frame feature and identity-aware knowledge, and IAEA suppresses the distracted regions in the current frame feature. With better detections and embeddings, higher-quality tracklets can also be generated. Extensive experiments of public and private detections on three benchmarks demonstrate the robustness of UTM. + + + + On the Stability-Plasticity Dilemma of Class-Incremental Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_On_the_Stability-Plasticity_Dilemma_of_Class-Incremental_Learning_CVPR_2023_paper.pdf + A primary goal of class-incremental learning is to strike a balance between stability and plasticity, where models should be both stable enough to retain knowledge learned from previously seen classes, and plastic enough to learn concepts from new classes. While previous works demonstrate strong performance on class-incremental benchmarks, it is not clear whether their success comes from the models being stable, plastic, or a mixture of both. This paper aims to shed light on how effectively recent class-incremental learning algorithms address the stability-plasticity trade-off. We establish analytical tools that measure the stability and plasticity of feature representations, and employ such tools to investigate models trained with various algorithms on large-scale class-incremental benchmarks. Surprisingly, we find that the majority of class-incremental learning algorithms heavily favor stability over plasticity, to the extent that the feature extractor of a model trained on the initial set of classes is no less effective than that of the final incremental model. Our observations not only inspire two simple algorithms that highlight the importance of feature representation analysis, but also suggest that class-incremental learning approaches, in general, should strive for better feature representation learning. + + + + Generalization Matters: Loss Minima Flattening via Parameter Hybridization for Efficient Online Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Generalization_Matters_Loss_Minima_Flattening_via_Parameter_Hybridization_for_Efficient_CVPR_2023_paper.pdf + Most existing online knowledge distillation(OKD) techniques typically require sophisticated modules to produce diverse knowledge for improving students' generalization ability. In this paper, we strive to fully utilize multi-model settings instead of well-designed modules to achieve a distillation effect with excellent generalization performance. Generally, model generalization can be reflected in the flatness of the loss landscape. Since averaging parameters of multiple models can find flatter minima, we are inspired to extend the process to the sampled convex combinations of multi-student models in OKD. Specifically, by linearly weighting students' parameters in each training batch, we construct a Hybrid-Weight Model(HWM) to represent the parameters surrounding involved students. The supervision loss of HWM can estimate the landscape's curvature of the whole region around students to measure the generalization explicitly. Hence we integrate HWM's loss into students' training and propose a novel OKD framework via parameter hybridization(OKDPH) to promote flatter minima and obtain robust solutions. 
Considering that the redundancy of parameters could lead to the collapse of HWM, we further introduce a fusion operation to keep the high similarity of students. Compared to the state-of-the-art (SOTA) OKD methods and SOTA methods for seeking flat minima, our OKDPH achieves higher performance with fewer parameters, benefiting OKD with lightweight and robust characteristics. Our code is publicly available at https://github.com/tianlizhang/OKDPH. + + + + L-CoIns: Language-Based Colorization With Instance Awareness + http://openaccess.thecvf.com//content/CVPR2023/papers/Chang_L-CoIns_Language-Based_Colorization_With_Instance_Awareness_CVPR_2023_paper.pdf + Language-based colorization produces plausible colors consistent with the language description provided by the user. Recent studies introduce additional annotation to prevent color-object coupling and mismatch issues, but they still have difficulty in distinguishing instances corresponding to the same object words. In this paper, we propose a transformer-based framework to automatically aggregate similar image patches and achieve instance awareness without any additional knowledge. By applying our presented luminance augmentation and counter-color loss to break down the statistical correlation between luminance and color words, our model is driven to synthesize colors with better descriptive consistency. We further collect a dataset to provide distinctive visual characteristics and detailed language descriptions for multiple instances in the same image. Extensive experiments demonstrate our advantages of synthesizing visually pleasing and description-consistent results of instance-aware colorization. + + + + On the Effects of Self-Supervision and Contrastive Alignment in Deep Multi-View Clustering + http://openaccess.thecvf.com//content/CVPR2023/papers/Trosten_On_the_Effects_of_Self-Supervision_and_Contrastive_Alignment_in_Deep_CVPR_2023_paper.pdf + Self-supervised learning is a central component in recent approaches to deep multi-view clustering (MVC). However, we find large variations in the development of self-supervision-based methods for deep MVC, potentially slowing the progress of the field. To address this, we present DeepMVC, a unified framework for deep MVC that includes many recent methods as instances. We leverage our framework to make key observations about the effect of self-supervision, and in particular, drawbacks of aligning representations with contrastive learning. Further, we prove that contrastive alignment can negatively influence cluster separability, and that this effect becomes worse when the number of views increases. Motivated by our findings, we develop several new DeepMVC instances with new forms of self-supervision. We conduct extensive experiments and find that (i) in line with our theoretical findings, contrastive alignment decreases performance on datasets with many views; (ii) all methods benefit from some form of self-supervision; and (iii) our new instances outperform previous methods on several datasets. Based on our results, we suggest several promising directions for future research. To enhance the openness of the field, we provide an open-source implementation of DeepMVC, including recent models and our new instances. Our implementation includes a consistent evaluation protocol, facilitating fair and accurate evaluation of methods and components.
+ + + + Activating More Pixels in Image Super-Resolution Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Activating_More_Pixels_in_Image_Super-Resolution_Transformer_CVPR_2023_paper.pdf + Transformer-based methods have shown impressive performance in low-level vision tasks, such as image super-resolution. However, we find that these networks can only utilize a limited spatial range of input information through attribution analysis. This implies that the potential of Transformer is still not fully exploited in existing networks. In order to activate more input pixels for better reconstruction, we propose a novel Hybrid Attention Transformer (HAT). It combines both channel attention and window-based self-attention schemes, thus making use of their complementary advantages of being able to utilize global statistics and strong local fitting capability. Moreover, to better aggregate the cross-window information, we introduce an overlapping cross-attention module to enhance the interaction between neighboring window features. In the training stage, we additionally adopt a same-task pre-training strategy to exploit the potential of the model for further improvement. Extensive experiments show the effectiveness of the proposed modules, and we further scale up the model to demonstrate that the performance of this task can be greatly improved. Our overall method significantly outperforms the state-of-the-art methods by more than 1dB. Codes and models are available at https://github.com/XPixelGroup/HAT. + + + + BEV-SAN: Accurate BEV 3D Object Detection via Slice Attention Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Chi_BEV-SAN_Accurate_BEV_3D_Object_Detection_via_Slice_Attention_Networks_CVPR_2023_paper.pdf + Bird's-Eye-View (BEV) 3D Object Detection is a crucial multi-view technique for autonomous driving systems. Recently, plenty of works are proposed, following a similar paradigm consisting of three essential components, i.e., camera feature extraction, BEV feature construction, and task heads. Among the three components, BEV feature construction is BEV-specific compared with 2D tasks. Existing methods aggregate the multi-view camera features to the flattened grid in order to construct the BEV feature. However, flattening the BEV space along the height dimension fails to emphasize the informative features of different heights. For example, the barrier is located at a low height while the truck is located at a high height. In this paper, we propose a novel method named BEV Slice Attention Network (BEV-SAN) for exploiting the intrinsic characteristics of different heights. Instead of flattening the BEV space, we first sample along the height dimension to build the global and local BEV slices. Then, the features of BEV slices are aggregated from the camera features and merged by the attention mechanism. Finally, we fuse the merged local and global BEV features by a transformer to generate the final feature map for task heads. The purpose of local BEV slices is to emphasize informative heights. In order to find them, we further propose a LiDAR-guided sampling strategy to leverage the statistical distribution of LiDAR to determine the heights of local slices. Compared with uniform sampling, LiDAR-guided sampling can determine more informative heights. We conduct detailed experiments to demonstrate the effectiveness of BEV-SAN. Code will be released. 
+ + + + The Dark Side of Dynamic Routing Neural Networks: Towards Efficiency Backdoor Injection + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_The_Dark_Side_of_Dynamic_Routing_Neural_Networks_Towards_Efficiency_CVPR_2023_paper.pdf + Recent advancements in deploying deep neural networks (DNNs) on resource-constrained devices have generated interest in input-adaptive dynamic neural networks (DyNNs). DyNNs offer more efficient inferences and enable the deployment of DNNs on devices with limited resources, such as mobile devices. However, we have discovered a new vulnerability in DyNNs that could potentially compromise their efficiency. Specifically, we investigate whether adversaries can manipulate DyNNs' computational costs to create a false sense of efficiency. To address this question, we propose EfficFrog, an adversarial attack that injects universal efficiency backdoors in DyNNs. To inject a backdoor trigger into DyNNs, EfficFrog poisons only a minimal percentage of the DyNNs' training data. During the inference phase, EfficFrog can slow down the backdoored DyNNs and abuse the computational resources of systems running DyNNs by adding the trigger to any input. To evaluate EfficFrog, we tested it on three DNN backbone architectures (based on VGG16, MobileNet, and ResNet56) using two popular datasets (CIFAR-10 and Tiny ImageNet). Our results demonstrate that EfficFrog reduces the efficiency of DyNNs on triggered input samples while keeping the efficiency of clean samples almost the same. + + + + NeRF in the Palm of Your Hand: Corrective Augmentation for Robotics via Novel-View Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_NeRF_in_the_Palm_of_Your_Hand_Corrective_Augmentation_for_CVPR_2023_paper.pdf + Expert demonstrations are a rich source of supervision for training visual robotic manipulation policies, but imitation learning methods often require either a large number of demonstrations or expensive online expert supervision to learn reactive closed-loop behaviors. In this work, we introduce SPARTN (Synthetic Perturbations for Augmenting Robot Trajectories via NeRF): a fully-offline data augmentation scheme for improving robot policies that use eye-in-hand cameras. Our approach leverages neural radiance fields (NeRFs) to synthetically inject corrective noise into visual demonstrations: using NeRFs to generate perturbed viewpoints while simultaneously calculating the corrective actions. This requires no additional expert supervision or environment interaction, and distills the geometric information in NeRFs into a real-time reactive RGB-only policy. In a simulated 6-DoF visual grasping benchmark, SPARTN improves offline success rates by 2.8x over imitation learning without the corrective augmentations and even outperforms some methods that use online supervision. It additionally closes the gap between RGB-only and RGB-D success rates, eliminating the previous need for depth sensors. In real-world 6-DoF robotic grasping experiments from limited human demonstrations, our method improves absolute success rates by 22.5% on average, including objects that are traditionally challenging for depth-based methods. 
+ + + + Building Rearticulable Models for Arbitrary 3D Objects From 4D Point Clouds + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Building_Rearticulable_Models_for_Arbitrary_3D_Objects_From_4D_Point_CVPR_2023_paper.pdf + We build rearticulable models for arbitrary everyday man-made objects containing an arbitrary number of parts that are connected together in arbitrary ways via 1-degree-of-freedom joints. Given point cloud videos of such everyday objects, our method identifies the distinct object parts, what parts are connected to what other parts, and the properties of the joints connecting each part pair. We do this by jointly optimizing the part segmentation, transformation, and kinematics using a novel energy minimization framework. Our inferred animatable models, enables retargeting to novel poses with sparse point correspondences guidance. We test our method on a new articulating robot dataset and the Sapiens dataset with common daily objects. Experiments show that our method outperforms two leading prior works on various metrics. + + + + Neural Congealing: Aligning Images to a Joint Semantic Atlas + http://openaccess.thecvf.com//content/CVPR2023/papers/Ofri-Amar_Neural_Congealing_Aligning_Images_to_a_Joint_Semantic_Atlas_CVPR_2023_paper.pdf + We present Neural Congealing -- a zero-shot self-supervised framework for detecting and jointly aligning semantically-common content across a given set of images. Our approach harnesses the power of pre-trained DINO-ViT features to learn: (i) a joint semantic atlas -- a 2D grid that captures the mode of DINO-ViT features in the input set, and (ii) dense mappings from the unified atlas to each of the input images. We derive a new robust self-supervised framework that optimizes the atlas representation and mappings per image set, requiring only a few real-world images as input without any additional input information (e.g., segmentation masks). Notably, we design our losses and training paradigm to account only for the shared content under severe variations in appearance, pose, background clutter or other distracting objects. We demonstrate results on a plethora of challenging image sets including sets of mixed domains (e.g., aligning images depicting sculpture and artwork of cats), sets depicting related yet different object categories (e.g., dogs and tigers), or domains for which large-scale training data is scarce (e.g., coffee mugs). We thoroughly evaluate our method and show that our test-time optimization approach performs favorably compared to a state-of-the-art method that requires extensive training on large-scale datasets. + + + + Adaptive Spot-Guided Transformer for Consistent Local Feature Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Adaptive_Spot-Guided_Transformer_for_Consistent_Local_Feature_Matching_CVPR_2023_paper.pdf + Local feature matching aims at finding correspondences between a pair of images. Although current detector-free methods leverage Transformer architecture to obtain an impressive performance, few works consider maintaining local consistency. Meanwhile, most methods struggle with large scale variations. To deal with the above issues, we propose Adaptive Spot-Guided Transformer (ASTR) for local feature matching, which jointly models the local consistency and scale variations in a unified coarse-to-fine architecture. The proposed ASTR enjoys several merits. First, we design a spot-guided aggregation module to avoid interfering with irrelevant areas during feature aggregation. 
Second, we design an adaptive scaling module to adjust the size of grids according to the calculated depth information at the fine stage. Extensive experimental results on five standard benchmarks demonstrate that our ASTR performs favorably against state-of-the-art methods. Our code will be released at https://astr2023.github.io. + + + + Wide-Angle Rectification via Content-Aware Conformal Mapping + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Wide-Angle_Rectification_via_Content-Aware_Conformal_Mapping_CVPR_2023_paper.pdf + Despite the proliferation of ultra wide-angle lenses on smartphone cameras, such lenses often come with severe image distortion (e.g. curved linear structure, unnaturally skewed faces). Most existing rectification methods adopt a global warping transformation to undistort the input wide-angle image, yet their performances are not entirely satisfactory, leaving many unwanted residue distortions uncorrected or at the sacrifice of the intended wide FoV (field-of-view). This paper proposes a new method to tackle these challenges. Specifically, we derive a locally-adaptive polar-domain conformal mapping to rectify a wide-angle image. Parameters of the mapping are found automatically by analyzing image contents via deep neural networks. Experiments on a large number of photos have confirmed the superior performance of the proposed method compared with all available previous methods. + + + + Token Turing Machines + http://openaccess.thecvf.com//content/CVPR2023/papers/Ryoo_Token_Turing_Machines_CVPR_2023_paper.pdf + We propose Token Turing Machines (TTM), a sequential, autoregressive Transformer model with memory for real-world sequential visual understanding. Our model is inspired by the seminal Neural Turing Machine, and has an external memory consisting of a set of tokens which summarise the previous history (i.e., frames). This memory is efficiently addressed, read and written using a Transformer as the processing unit/controller at each step. The model's memory module ensures that a new observation will only be processed with the contents of the memory (and not the entire history), meaning that it can efficiently process long sequences with a bounded computational cost at each step. We show that TTM outperforms other alternatives, such as other Transformer models designed for long sequences and recurrent neural networks, on two real-world sequential visual understanding tasks: online temporal activity detection from videos and vision-based robot action policy learning. Code is publicly available at: https://github.com/google-research/scenic/tree/main/scenic/projects/token_turing. + + + + Solving 3D Inverse Problems Using Pre-Trained 2D Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Chung_Solving_3D_Inverse_Problems_Using_Pre-Trained_2D_Diffusion_Models_CVPR_2023_paper.pdf + Diffusion models have emerged as the new state-of-the-art generative model with high quality samples, with intriguing properties such as mode coverage and high flexibility. They have also been shown to be effective inverse problem solvers, acting as the prior of the distribution, while the information of the forward model can be granted at the sampling stage. Nonetheless, as the generative process remains in the same high dimensional (i.e. identical to data dimension) space, the models have not been extended to 3D inverse problems due to the extremely high memory and computational cost.
In this paper, we combine the ideas from the conventional model-based iterative reconstruction with the modern diffusion models, which leads to a highly effective method for solving 3D medical image reconstruction tasks such as sparse-view tomography, limited-angle tomography, and compressed-sensing MRI from pre-trained 2D diffusion models. In essence, we propose to augment the 2D diffusion prior with a model-based prior in the remaining direction at test time, such that one can achieve coherent reconstructions across all dimensions. Our method can be run on a single commodity GPU, and establishes the new state-of-the-art, showing that the proposed method can perform reconstructions of high fidelity and accuracy even in the most extreme cases (e.g. 2-view 3D tomography). We further reveal that the generalization capacity of the proposed method is surprisingly high, and can be used to reconstruct volumes that are entirely different from the training dataset. Code available: https://github.com/HJ-harry/DiffusionMBIR + + + + DyNCA: Real-Time Dynamic Texture Synthesis Using Neural Cellular Automata + http://openaccess.thecvf.com//content/CVPR2023/papers/Pajouheshgar_DyNCA_Real-Time_Dynamic_Texture_Synthesis_Using_Neural_Cellular_Automata_CVPR_2023_paper.pdf + Current Dynamic Texture Synthesis (DyTS) models can synthesize realistic videos. However, they require a slow iterative optimization process to synthesize a single fixed-size short video, and they do not offer any post-training control over the synthesis process. We propose Dynamic Neural Cellular Automata (DyNCA), a framework for real-time and controllable dynamic texture synthesis. Our method is built upon the recently introduced NCA models and can synthesize infinitely long and arbitrary-size realistic video textures in real-time. We quantitatively and qualitatively evaluate our model and show that our synthesized videos appear more realistic than the existing results. We improve the SOTA DyTS performance by 2 to 4 orders of magnitude. Moreover, our model offers several real-time video controls including motion speed, motion direction, and an editing brush tool. We exhibit our trained models in an online interactive demo that runs on local hardware and is accessible on personal computers and smartphones. + + + + Semantic-Promoted Debiasing and Background Disambiguation for Zero-Shot Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Semantic-Promoted_Debiasing_and_Background_Disambiguation_for_Zero-Shot_Instance_Segmentation_CVPR_2023_paper.pdf + Zero-shot instance segmentation aims to detect and precisely segment objects of unseen categories without any training samples. Since the model is trained on seen categories, there is a strong bias that the model tends to classify all the objects into seen categories. Besides, there is a natural confusion between background and novel objects that have never shown up in training. These two challenges make novel objects hard to be raised in the final instance segmentation results. It is desired to rescue novel objects from background and dominated seen categories. To this end, we propose D^2Zero with Semantic-Promoted Debiasing and Background Disambiguation to enhance the performance of Zero-shot instance segmentation. Semantic-promoted debiasing utilizes inter-class semantic relationships to involve unseen categories in visual feature training and learns an input-conditional classifier to conduct dynamical classification based on the input image.
Background disambiguation produces image-adaptive background representation to avoid mistaking novel objects for background. Extensive experiments show that we significantly outperform previous state-of-the-art methods by a large margin, e.g., 16.86% improvement on COCO. + + + + RelightableHands: Efficient Neural Relighting of Articulated Hand Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Iwase_RelightableHands_Efficient_Neural_Relighting_of_Articulated_Hand_Models_CVPR_2023_paper.pdf + We present the first neural relighting approach for rendering high-fidelity personalized hands that can be animated in real-time under novel illumination. Our approach adopts a teacher-student framework, where the teacher learns appearance under a single point light from images captured in a light-stage, allowing us to synthesize hands in arbitrary illuminations but with heavy compute. Using images rendered by the teacher model as training data, an efficient student model directly predicts appearance under natural illuminations in real-time. To achieve generalization, we condition the student model with physics-inspired illumination features such as visibility, diffuse shading, and specular reflections computed on a coarse proxy geometry, maintaining a small computational overhead. Our key insight is that these features have strong correlation with subsequent global light transport effects, which proves sufficient as conditioning data for the neural relighting network. Moreover, in contrast to bottleneck illumination conditioning, these features are spatially aligned based on underlying geometry, leading to better generalization to unseen illuminations and poses. In our experiments, we demonstrate the efficacy of our illumination feature representations, outperforming baseline approaches. We also show that our approach can photorealistically relight two interacting hands at real-time speeds. + + + + Paired-Point Lifting for Enhanced Privacy-Preserving Visual Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Lee_Paired-Point_Lifting_for_Enhanced_Privacy-Preserving_Visual_Localization_CVPR_2023_paper.pdf + Visual localization refers to the process of recovering camera pose from input image relative to a known scene, forming a cornerstone of numerous vision and robotics systems. While many algorithms utilize sparse 3D point cloud of the scene obtained via structure-from-motion (SfM) for localization, recent studies have raised privacy concerns by successfully revealing high-fidelity appearance of the scene from such sparse 3D representation. One prominent approach for bypassing this attack was to lift 3D points to randomly oriented 3D lines thereby hiding scene geometry, but latest work have shown such random line cloud has a critical statistical flaw that can be exploited to break through protection. In this work, we present an alternative lightweight strategy called Paired-Point Lifting (PPL) for constructing 3D line clouds. Instead of drawing one randomly oriented line per 3D point, PPL splits 3D points into pairs and joins each pair to form 3D lines. This seemingly simple strategy yields 3 benefits, i) new ambiguity in feature selection, ii) increased line cloud sparsity, and iii) non-trivial distribution of 3D lines, all of which contributes to enhanced protection against privacy attacks. Extensive experimental results demonstrate the strength of PPL in concealing scene details without compromising localization accuracy, unlocking the true potential of 3D line clouds. 
+ + + + What Happened 3 Seconds Ago? Inferring the Past With Thermal Imaging + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_What_Happened_3_Seconds_Ago_Inferring_the_Past_With_Thermal_CVPR_2023_paper.pdf + Inferring past human motion from RGB images is challenging due to the inherent uncertainty of the prediction problem. Thermal images, on the other hand, encode traces of past human-object interactions left in the environment via thermal radiation measurement. Based on this observation, we collect the first RGB-Thermal dataset for human motion analysis, dubbed Thermal-IM. Then we develop a three-stage neural network model for accurate past human pose estimation. Comprehensive experiments show that thermal cues significantly reduce the ambiguities of this task, and the proposed model achieves remarkable performance. The dataset is available at https://github.com/ZitianTang/Thermal-IM. + + + + Vector Quantization With Self-Attention for Quality-Independent Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Vector_Quantization_With_Self-Attention_for_Quality-Independent_Representation_Learning_CVPR_2023_paper.pdf + Recently, the robustness of deep neural networks has drawn extensive attention due to the potential distribution shift between training and testing data (e.g., deep models trained on high-quality images are sensitive to corruption during testing). Many researchers attempt to make the model learn invariant representations from multiple corrupted data through data augmentation or image-pair-based feature distillation to improve the robustness. Inspired by sparse representation in image restoration, we opt to address this issue by learning image-quality-independent feature representation in a simple plug-and-play manner, that is, to introduce discrete vector quantization (VQ) to remove redundancy in recognition models. Specifically, we first add a codebook module to the network to quantize deep features. Then we concatenate them and design a self-attention module to enhance the representation. During training, we enforce the quantization of features from clean and corrupted images in the same discrete embedding space so that an invariant quality-independent feature representation can be learned to improve the recognition robustness of low-quality images. Qualitative and quantitative experimental results show that our method achieved this goal effectively, leading to a new state-of-the-art result of 43.1% mCE on ImageNet-C with ResNet50 as the backbone. On other robustness benchmark datasets, such as ImageNet-R, our method also has an accuracy improvement of almost 2%. + + + + Generating Anomalies for Video Anomaly Detection With Prompt-Based Feature Mapping + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Generating_Anomalies_for_Video_Anomaly_Detection_With_Prompt-Based_Feature_Mapping_CVPR_2023_paper.pdf + Anomaly detection in surveillance videos is a challenging computer vision task where only normal videos are available during training. Recent work released the first virtual anomaly detection dataset to assist real-world detection. However, an anomaly gap exists because the anomalies are bounded in the virtual dataset but unbounded in the real world, so it reduces the generalization ability of the virtual dataset. 
There also exists a scene gap between virtual and real scenarios, including scene-specific anomalies (events that are abnormal in one scene but normal in another) and scene-specific attributes, such as the viewpoint of the surveillance camera. In this paper, we aim to solve the problem of the anomaly gap and scene gap by proposing a prompt-based feature mapping framework (PFMF). The PFMF contains a mapping network guided by an anomaly prompt to generate unseen anomalies with unbounded types in the real scenario, and a mapping adaptation branch to narrow the scene gap by applying a domain classifier and an anomaly classifier. The proposed framework outperforms the state-of-the-art on three benchmark datasets. Extensive ablation experiments also show the effectiveness of our framework design. + + + + Diffusion-Based Signed Distance Fields for 3D Shape Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Shim_Diffusion-Based_Signed_Distance_Fields_for_3D_Shape_Generation_CVPR_2023_paper.pdf + We propose a 3D shape generation framework (SDF-Diffusion in short) that uses denoising diffusion models with continuous 3D representation via signed distance fields (SDF). Unlike most existing methods that depend on discontinuous forms, such as point clouds, SDF-Diffusion generates high-resolution 3D shapes while alleviating memory issues by separating the generative process into two stages: generation and super-resolution. In the first stage, a diffusion-based generative model generates a low-resolution SDF of 3D shapes. Using the estimated low-resolution SDF as a condition, the second-stage diffusion model performs super-resolution to generate high-resolution SDF. Our framework can generate a high-fidelity 3D shape despite the extreme spatial complexity. On the ShapeNet dataset, our model shows competitive performance to the state-of-the-art methods and shows applicability on the shape completion task without modification. + + + + Hierarchical Temporal Transformer for 3D Hand Pose Estimation and Action Recognition From Egocentric RGB Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Wen_Hierarchical_Temporal_Transformer_for_3D_Hand_Pose_Estimation_and_Action_CVPR_2023_paper.pdf + Understanding dynamic hand motions and actions from egocentric RGB videos is a fundamental yet challenging task due to self-occlusion and ambiguity. To address occlusion and ambiguity, we develop a transformer-based framework to exploit temporal information for robust estimation. Noticing the different temporal granularity of and the semantic correlation between hand pose estimation and action recognition, we build a network hierarchy with two cascaded transformer encoders, where the first one exploits the short-term temporal cue for hand pose estimation, and the latter aggregates per-frame pose and object information over a longer time span to recognize the action. Our approach achieves competitive results on two first-person hand action benchmarks, namely FPHA and H2O. Extensive ablation studies verify our design choices. + + + + CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Wen_CAP-VSTNet_Content_Affinity_Preserved_Versatile_Style_Transfer_CVPR_2023_paper.pdf + Content affinity loss, including feature and pixel affinity, is a main problem that leads to artifacts in photorealistic and video style transfer.
This paper proposes a new framework named CAP-VSTNet, which consists of a new reversible residual network and an unbiased linear transform module, for versatile style transfer. This reversible residual network can not only preserve content affinity but also avoid introducing the redundant information that traditional reversible networks do, and hence facilitates better stylization. Empowered by a Matting Laplacian training loss, which addresses the pixel affinity loss problem caused by the linear transform, the proposed framework is applicable and effective on versatile style transfer. Extensive experiments show that CAP-VSTNet can produce better qualitative and quantitative results in comparison with the state-of-the-art methods. + + + + Tunable Convolutions With Parametric Multi-Loss Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Maggioni_Tunable_Convolutions_With_Parametric_Multi-Loss_Optimization_CVPR_2023_paper.pdf + The behavior of neural networks is irremediably determined by the specific loss and data used during training. However, it is often desirable to tune the model at inference time based on external factors such as preferences of the user or dynamic characteristics of the data. This is especially important to balance the perception-distortion trade-off of ill-posed image-to-image translation tasks. In this work, we propose to optimize a parametric tunable convolutional layer, which includes a number of different kernels, using a parametric multi-loss, which includes an equal number of objectives. Our key insight is to use a shared set of parameters to dynamically interpolate both the objectives and the kernels. During training, these parameters are sampled at random to explicitly optimize all possible combinations of objectives and consequently disentangle their effect into the corresponding kernels. During inference, these parameters become interactive inputs of the model, hence enabling reliable and consistent control over the model behavior. Extensive experimental results demonstrate that our tunable convolutions effectively work as a drop-in replacement for traditional convolutions in existing neural networks at virtually no extra computational cost, outperforming state-of-the-art control strategies in a wide range of applications, including image denoising, deblurring, super-resolution, and style transfer. + + + + DeepSolo: Let Transformer Decoder With Explicit Points Solo for Text Spotting + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_DeepSolo_Let_Transformer_Decoder_With_Explicit_Points_Solo_for_Text_CVPR_2023_paper.pdf + End-to-end text spotting aims to integrate scene text detection and recognition into a unified framework. Dealing with the relationship between the two sub-tasks plays a pivotal role in designing effective spotters. Although Transformer-based methods eliminate the heuristic post-processing, they still suffer from the synergy issue between the sub-tasks and low training efficiency. In this paper, we present DeepSolo, a simple DETR-like baseline that lets a single Decoder with Explicit Points Solo for text detection and recognition simultaneously. Technically, for each text instance, we represent the character sequence as ordered points and model them with learnable explicit point queries. After passing a single decoder, the point queries have encoded requisite text semantics and locations, and thus can be further decoded to the center line, boundary, script, and confidence of text via very simple prediction heads in parallel. 
Besides, we also introduce a text-matching criterion to deliver more accurate supervisory signals, thus enabling more efficient training. Quantitative experiments on public benchmarks demonstrate that DeepSolo outperforms previous state-of-the-art methods and achieves better training efficiency. In addition, DeepSolo is also compatible with line annotations, which require much less annotation cost than polygons. The code is available at https://github.com/ViTAE-Transformer/DeepSolo. + + + + DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ruiz_DreamBooth_Fine_Tuning_Text-to-Image_Diffusion_Models_for_Subject-Driven_Generation_CVPR_2023_paper.pdf + Large text-to-image models achieved a remarkable leap in the evolution of AI, enabling high-quality and diverse synthesis of images from a given text prompt. However, these models lack the ability to mimic the appearance of subjects in a given reference set and synthesize novel renditions of them in different contexts. In this work, we present a new approach for "personalization" of text-to-image diffusion models. Given as input just a few images of a subject, we fine-tune a pretrained text-to-image model such that it learns to bind a unique identifier with that specific subject. Once the subject is embedded in the output domain of the model, the unique identifier can be used to synthesize novel photorealistic images of the subject contextualized in different scenes. By leveraging the semantic prior embedded in the model with a new autogenous class-specific prior preservation loss, our technique enables synthesizing the subject in diverse scenes, poses, views and lighting conditions that do not appear in the reference images. We apply our technique to several previously-unassailable tasks, including subject recontextualization, text-guided view synthesis, and artistic rendering, all while preserving the subject's key features. We also provide a new dataset and evaluation protocol for this new task of subject-driven generation. Project page: https://dreambooth.github.io/ + + + + MOSO: Decomposing MOtion, Scene and Object for Video Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_MOSO_Decomposing_MOtion_Scene_and_Object_for_Video_Prediction_CVPR_2023_paper.pdf + Motion, scene and object are three primary visual components of a video. In particular, objects represent the foreground, scenes represent the background, and motion traces their dynamics. Based on this insight, we propose a two-stage MOtion, Scene and Object decomposition framework (MOSO) for video prediction, consisting of MOSO-VQVAE and MOSO-Transformer. In the first stage, MOSO-VQVAE decomposes a previous video clip into the motion, scene and object components, and represents them as distinct groups of discrete tokens. Then, in the second stage, MOSO-Transformer predicts the object and scene tokens of the subsequent video clip based on the previous tokens and adds dynamic motion at the token level to the generated object and scene tokens. Our framework can be easily extended to unconditional video generation and video frame interpolation tasks. Experimental results demonstrate that our method achieves new state-of-the-art performance on five challenging benchmarks for video prediction and unconditional video generation: BAIR, RoboNet, KTH, KITTI and UCF101. In addition, MOSO can produce realistic videos by combining objects and scenes from different videos. 
+ + + + Learning the Distribution of Errors in Stereo Matching for Joint Disparity and Uncertainty Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Learning_the_Distribution_of_Errors_in_Stereo_Matching_for_Joint_CVPR_2023_paper.pdf + We present a new loss function for joint disparity and uncertainty estimation in deep stereo matching. Our work is motivated by the need for precise uncertainty estimates and the observation that multi-task learning often leads to improved performance in all tasks. We show that this can be achieved by requiring the distribution of uncertainty to match the distribution of disparity errors via a KL divergence term in the network's loss function. A differentiable soft-histogramming technique is used to approximate the distributions so that they can be used in the loss. We experimentally assess the effectiveness of our approach and observe significant improvements in both disparity and uncertainty prediction on large datasets. Our code is available at https://github.com/lly00412/SEDNet.git. + + + + Samples With Low Loss Curvature Improve Data Efficiency + http://openaccess.thecvf.com//content/CVPR2023/papers/Garg_Samples_With_Low_Loss_Curvature_Improve_Data_Efficiency_CVPR_2023_paper.pdf + In this paper, we study the second-order properties of the loss of trained deep neural networks with respect to the training data points to understand the curvature of the loss surface in the vicinity of these points. We find that there is an unexpected concentration of samples with very low curvature. We note that these low-curvature samples are largely consistent across completely different architectures, and identifiable in the early epochs of training. We show that the curvature relates to the 'cleanliness' of the data points, with low-curvature samples corresponding to clean, higher-clarity samples, representative of their category. In contrast, high-curvature samples are often occluded, have conflicting features, and are visually atypical of their category. Armed with this insight, we introduce SLo-Curves, a novel coreset identification and training algorithm. SLo-Curves identifies the samples with low curvatures as being more data-efficient and trains on them with an additional regularizer that penalizes high curvature of the loss surface in their vicinity. We demonstrate the efficacy of SLo-Curves on the CIFAR-10 and CIFAR-100 datasets, where it outperforms state-of-the-art coreset selection methods at small coreset sizes by up to 9%. The identified coresets generalize across architectures, and hence can be pre-computed to generate condensed versions of datasets for use in downstream tasks. + + + + TINC: Tree-Structured Implicit Neural Compression + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_TINC_Tree-Structured_Implicit_Neural_Compression_CVPR_2023_paper.pdf + Implicit neural representation (INR) can describe the target scenes with high fidelity using a small number of parameters, and is emerging as a promising data compression technique. However, limited spectrum coverage is intrinsic to INR, and it is non-trivial to remove redundancy in diverse complex data effectively. Preliminary studies can only exploit either global or local correlation in the target data and are thus of limited performance. In this paper, we propose Tree-structured Implicit Neural Compression (TINC) to compactly represent local regions and extract the shared features of these local representations in a hierarchical manner. 
Specifically, we use Multi-Layer Perceptrons (MLPs) to fit the partitioned local regions, and these MLPs are organized in tree structure to share parameters according to the spatial distance. The parameter sharing scheme not only ensures the continuity between adjacent regions, but also jointly removes the local and non-local redundancy. Extensive experiments show that TINC improves the compression fidelity of INR, and has shown impressive compression capabilities over commercial tools and other deep learning based methods. Besides, the approach is of high flexibility and can be tailored for different data and parameter settings. The source code can be found at https://github.com/RichealYoung/TINC. + + + + Unifying Short and Long-Term Tracking With Graph Hierarchies + http://openaccess.thecvf.com//content/CVPR2023/papers/Cetintas_Unifying_Short_and_Long-Term_Tracking_With_Graph_Hierarchies_CVPR_2023_paper.pdf + Tracking objects over long videos effectively means solving a spectrum of problems, from short-term association for un-occluded objects to long-term association for objects that are occluded and then reappear in the scene. Methods tackling these two tasks are often disjoint and crafted for specific scenarios, and top-performing approaches are often a mix of techniques, which yields engineering-heavy solutions that lack generality. In this work, we question the need for hybrid approaches and introduce SUSHI, a unified and scalable multi-object tracker. Our approach processes long clips by splitting them into a hierarchy of subclips, which enables high scalability. We leverage graph neural networks to process all levels of the hierarchy, which makes our model unified across temporal scales and highly general. As a result, we obtain significant improvements over state-of-the-art on four diverse datasets. Our code and models are available at bit.ly/sushi-mot. + + + + Re-Basin via Implicit Sinkhorn Differentiation + http://openaccess.thecvf.com//content/CVPR2023/papers/Pena_Re-Basin_via_Implicit_Sinkhorn_Differentiation_CVPR_2023_paper.pdf + The recent emergence of new algorithms for permuting models into functionally equivalent regions of the solution space has shed some light on the complexity of error surfaces and some promising properties like mode connectivity. However, finding the permutation that minimizes some objectives is challenging, and current optimization techniques are not differentiable, which makes it difficult to integrate into a gradient-based optimization, and often leads to sub-optimal solutions. In this paper, we propose a Sinkhorn re-basin network with the ability to obtain the transportation plan that better suits a given objective. Unlike the current state-of-art, our method is differentiable and, therefore, easy to adapt to any task within the deep learning domain. Furthermore, we show the advantage of our re-basin method by proposing a new cost function that allows performing incremental learning by exploiting the linear mode connectivity property. The benefit of our method is compared against similar approaches from the literature under several conditions for both optimal transport and linear mode connectivity. The effectiveness of our continual learning method based on re-basin is also shown for several common benchmark datasets, providing experimental results that are competitive with the state-of-art. The source code is provided at https://github.com/fagp/sinkhorn-rebasin. 
+ + + + Supervised Masked Knowledge Distillation for Few-Shot Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Supervised_Masked_Knowledge_Distillation_for_Few-Shot_Transformers_CVPR_2023_paper.pdf + Vision Transformers (ViTs) have emerged to achieve impressive performance on many data-abundant computer vision tasks by capturing long-range dependencies among local features. However, under few-shot learning (FSL) settings on small datasets with only a few labeled samples, ViT tends to overfit and suffers from severe performance degradation due to its absence of CNN-like inductive bias. Previous works in FSL avoid this problem either through the help of self-supervised auxiliary losses, or through the dexterous use of label information under supervised settings. But the gap between self-supervised and supervised few-shot Transformers is still unfilled. Inspired by recent advances in self-supervised knowledge distillation and masked image modeling (MIM), we propose a novel Supervised Masked Knowledge Distillation model (SMKD) for few-shot Transformers which incorporates label information into self-distillation frameworks. Compared with previous self-supervised methods, we allow intra-class knowledge distillation on both class and patch tokens, and introduce the challenging task of masked patch token reconstruction across intra-class images. Experimental results on four few-shot classification benchmark datasets show that our method, with its simple design, outperforms previous methods by a large margin and achieves a new state-of-the-art. Detailed ablation studies confirm the effectiveness of each component of our model. Code for this paper is available here: https://github.com/HL-hanlin/SMKD. + + + + RIDCP: Revitalizing Real Image Dehazing via High-Quality Codebook Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_RIDCP_Revitalizing_Real_Image_Dehazing_via_High-Quality_Codebook_Priors_CVPR_2023_paper.pdf + Existing dehazing approaches struggle to process real-world hazy images owing to the lack of paired real data and robust priors. In this work, we present a new paradigm for real image dehazing from the perspectives of synthesizing more realistic hazy data and introducing more robust priors into the network. Specifically, (1) instead of adopting the de facto physical scattering model, we rethink the degradation of real hazy images and propose a phenomenological pipeline considering diverse degradation types. (2) We propose a Real Image Dehazing network via high-quality Codebook Priors (RIDCP). Firstly, a VQGAN is pre-trained on a large-scale high-quality dataset to obtain the discrete codebook, encapsulating high-quality priors (HQPs). After replacing the negative effects brought by haze with HQPs, the decoder equipped with a novel normalized feature alignment module can effectively utilize high-quality features and produce clean results. However, although our degradation pipeline drastically mitigates the domain gap between synthetic and real data, it is still intractable to avoid it entirely, which challenges HQP matching in the wild. Thus, we re-calculate the distance when matching the features to the HQPs by a controllable matching operation, which facilitates finding better counterparts. We provide a recommendation to control the matching based on an explainable solution. Users can also flexibly adjust the enhancement degree as per their preference. 
Extensive experiments verify the effectiveness of our data synthesis pipeline and the superior performance of RIDCP in real image dehazing. Code and data will be released. + + + + Recurrence Without Recurrence: Stable Video Landmark Detection With Deep Equilibrium Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Micaelli_Recurrence_Without_Recurrence_Stable_Video_Landmark_Detection_With_Deep_Equilibrium_CVPR_2023_paper.pdf + Cascaded computation, whereby predictions are recurrently refined over several stages, has been a persistent theme throughout the development of landmark detection models. In this work, we show that the recently proposed Deep Equilibrium Model (DEQ) can be naturally adapted to this form of computation. Our Landmark DEQ (LDEQ) achieves state-of-the-art performance on the challenging WFLW facial landmark dataset, reaching 3.92 NME with fewer parameters and a training memory cost of O(1) in the number of recurrent modules. Furthermore, we show that DEQs are particularly suited for landmark detection in videos. In this setting, it is typical to train on still images due to the lack of labelled videos. This can lead to a "flickering" effect at inference time on video, whereby a model can rapidly oscillate between different plausible solutions across consecutive frames. By rephrasing DEQs as a constrained optimization, we emulate recurrence at inference time, despite not having access to temporal data at training time. This Recurrence without Recurrence (RwR) paradigm helps in reducing landmark flicker, which we demonstrate by introducing a new metric, normalized mean flicker (NMF), and contributing a new facial landmark video dataset (WFLW-V) targeting landmark uncertainty. On the WFLW-V hard subset made up of 500 videos, our LDEQ with RwR improves the NME and NMF by 10 and 13% respectively, compared to the strongest previously published model using a hand-tuned conventional filter. + + + + Generalized Relation Modeling for Transformer Tracking + http://openaccess.thecvf.com//content/CVPR2023/papers/Gao_Generalized_Relation_Modeling_for_Transformer_Tracking_CVPR_2023_paper.pdf + Compared with previous two-stream trackers, the recent one-stream tracking pipeline, which allows earlier interaction between the template and search region, has achieved a remarkable performance gain. However, existing one-stream trackers always let the template interact with all parts inside the search region throughout all the encoder layers. This could potentially lead to target-background confusion when the extracted feature representations are not sufficiently discriminative. To alleviate this issue, we propose a generalized relation modeling method based on adaptive token division. The proposed method is a generalized formulation of attention-based relation modeling for Transformer tracking, which inherits the merits of both previous two-stream and one-stream pipelines whilst enabling more flexible relation modeling by selecting appropriate search tokens to interact with template tokens. An attention masking strategy and the Gumbel-Softmax technique are introduced to facilitate the parallel computation and end-to-end learning of the token division module. Extensive experiments show that our method is superior to the two-stream and one-stream pipelines and achieves state-of-the-art performance on six challenging benchmarks with a real-time running speed. 
+ + + + Non-Line-of-Sight Imaging With Signal Superresolution Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Non-Line-of-Sight_Imaging_With_Signal_Superresolution_Network_CVPR_2023_paper.pdf + Non-line-of-sight (NLOS) imaging aims at reconstructing the location, shape, albedo, and surface normal of the hidden object around the corner with measured transient data. Due to its strong potential in various fields, it has drawn much attention in recent years. However, long exposure time is not always available for applications such as auto-driving, which hinders the practical use of NLOS imaging. Although scanning fewer points can reduce the total measurement time, it also brings the problem of imaging quality degradation. This paper proposes a general learning-based pipeline for increasing imaging quality with only a few scanning points. We tailor a neural network to learn the operator that recovers a high spatial resolution signal. Experiments on synthetic and measured data indicate that the proposed method provides faithful reconstructions of the hidden scene under both confocal and non-confocal settings. Compared with original measurements, the acquisition of our approach is 16 times faster while maintaining similar reconstruction quality. Besides, the proposed pipeline can be applied directly to existing optical systems and imaging algorithms as a plug-in-and-play module. We believe the proposed pipeline is powerful in increasing the frame rate in NLOS video imaging. + + + + MixNeRF: Modeling a Ray With Mixture Density for Novel View Synthesis From Sparse Inputs + http://openaccess.thecvf.com//content/CVPR2023/papers/Seo_MixNeRF_Modeling_a_Ray_With_Mixture_Density_for_Novel_View_CVPR_2023_paper.pdf + Neural Radiance Field (NeRF) has broken new ground in the novel view synthesis due to its simple concept and state-of-the-art quality. However, it suffers from severe performance degradation unless trained with a dense set of images with different camera poses, which hinders its practical applications. Although previous methods addressing this problem achieved promising results, they relied heavily on the additional training resources, which goes against the philosophy of sparse-input novel-view synthesis pursuing the training efficiency. In this work, we propose MixNeRF, an effective training strategy for novel view synthesis from sparse inputs by modeling a ray with a mixture density model. Our MixNeRF estimates the joint distribution of RGB colors along the ray samples by modeling it with mixture of distributions. We also propose a new task of ray depth estimation as a useful training objective, which is highly correlated with 3D scene geometry. Moreover, we remodel the colors with regenerated blending weights based on the estimated ray depth and further improves the robustness for colors and viewpoints. Our MixNeRF outperforms other state-of-the-art methods in various standard benchmarks with superior efficiency of training and inference. + + + + Cross-Domain 3D Hand Pose Estimation With Dual Modalities + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Cross-Domain_3D_Hand_Pose_Estimation_With_Dual_Modalities_CVPR_2023_paper.pdf + Recent advances in hand pose estimation have shed light on utilizing synthetic data to train neural networks, which however inevitably hinders generalization to real-world data due to domain gaps. 
To solve this problem, we present a framework for cross-domain semi-supervised hand pose estimation and target the challenging scenario of learning models from labelled multi-modal synthetic data and unlabelled real-world data. To that end, we propose a dual-modality network that exploits synthetic RGB and synthetic depth images. For pre-training, our network uses multi-modal contrastive learning and attention-fused supervision to learn effective representations of the RGB images. We then integrate a novel self-distillation technique during fine-tuning to reduce pseudo-label noise. Experiments show that the proposed method significantly improves 3D hand pose estimation and 2D keypoint detection on benchmarks. + + + + Delving Into Discrete Normalizing Flows on SO(3) Manifold for Probabilistic Rotation Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Delving_Into_Discrete_Normalizing_Flows_on_SO3_Manifold_for_Probabilistic_CVPR_2023_paper.pdf + Normalizing flows (NFs) provide a powerful tool to construct an expressive distribution by a sequence of tractable transformations of a base distribution and form a probabilistic model of the underlying data. Rotation, as an important quantity in computer vision, graphics, and robotics, can exhibit many ambiguities when occlusion and symmetry occur and thus demands such probabilistic models. Though much progress has been made for NFs in Euclidean space, there are no effective normalizing flows without discontinuity or many-to-one mapping tailored for the SO(3) manifold. Given the unique non-Euclidean properties of the rotation manifold, adapting the existing NFs to the SO(3) manifold is non-trivial. In this paper, we propose a novel normalizing flow on SO(3) by combining a Mobius transformation-based coupling layer and a quaternion affine transformation. With our proposed rotation normalizing flows, one can not only effectively express arbitrary distributions on SO(3), but also conditionally build the target distribution given input observations. Extensive experiments show that our rotation normalizing flows significantly outperform the baselines on both unconditional and conditional tasks. + + + + SfM-TTR: Using Structure From Motion for Test-Time Refinement of Single-View Depth Networks + http://openaccess.thecvf.com//content/CVPR2023/papers/Izquierdo_SfM-TTR_Using_Structure_From_Motion_for_Test-Time_Refinement_of_Single-View_CVPR_2023_paper.pdf + Estimating a dense depth map from a single view is geometrically ill-posed, and state-of-the-art methods rely on learning depth's relation with visual appearance using deep neural networks. On the other hand, Structure from Motion (SfM) leverages multi-view constraints to produce very accurate but sparse maps, as matching across images is typically limited by locally discriminative texture. In this work, we combine the strengths of both approaches by proposing a novel test-time refinement (TTR) method, denoted as SfM-TTR, that boosts the performance of single-view depth networks at test time using SfM multi-view cues. Specifically, and differently from the state of the art, we use sparse SfM point clouds as a test-time self-supervisory signal, fine-tuning the network encoder to learn a better representation of the test scene. Our results show how the addition of SfM-TTR to several state-of-the-art self-supervised and supervised networks significantly improves their performance, outperforming previous TTR baselines mainly based on photometric multi-view consistency. 
The code is available at https://github.com/serizba/SfM-TTR. + + + + MELTR: Meta Loss Transformer for Learning To Fine-Tune Video Foundation Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Ko_MELTR_Meta_Loss_Transformer_for_Learning_To_Fine-Tune_Video_Foundation_CVPR_2023_paper.pdf + Foundation models have shown outstanding performance and generalization capabilities across domains. Since most studies on foundation models mainly focus on the pretraining phase, a naive strategy to minimize a single task-specific loss is adopted for fine-tuning. However, such fine-tuning methods do not fully leverage other losses that are potentially beneficial for the target task. Therefore, we propose MEta Loss TRansformer (MELTR), a plug-in module that automatically and non-linearly combines various loss functions to aid learning the target task via auxiliary learning. We formulate the auxiliary learning as a bi-level optimization problem and present an efficient optimization algorithm based on Approximate Implicit Differentiation (AID). For evaluation, we apply our framework to various video foundation models (UniVL, Violet and All-in-one), and show significant performance gain on all four downstream tasks: text-to-video retrieval, video question answering, video captioning, and multi-modal sentiment analysis. Our qualitative analyses demonstrate that MELTR adequately 'transforms' individual loss functions and 'melts' them into an effective unified loss. Code is available at https://github.com/mlvlab/MELTR. + + + + Meta-Personalizing Vision-Language Models To Find Named Instances in Video + http://openaccess.thecvf.com//content/CVPR2023/papers/Yeh_Meta-Personalizing_Vision-Language_Models_To_Find_Named_Instances_in_Video_CVPR_2023_paper.pdf + Large-scale vision-language models (VLM) have shown impressive results for language-guided search applications. While these models allow category-level queries, they currently struggle with personalized searches for moments in a video where a specific object instance such as "My dog Biscuit" appears. We present the following three contributions to address this problem. First, we describe a method to meta-personalize a pre-trained VLM, i.e., learning how to learn to personalize a VLM at test time to search in video. Our method extends the VLM's token vocabulary by learning novel word embeddings specific to each instance. To capture only instance-specific features, we represent each instance embedding as a combination of shared and learned global category features. Second, we propose to learn such personalization without explicit human supervision. Our approach automatically identifies moments of named visual instances in video using transcripts and vision-language similarity in the VLM's embedding space. Finally, we introduce This-Is-My, a personal video instance retrieval benchmark. We evaluate our approach on This-Is-My and DeepFashion2 and show that we obtain a 15% relative improvement over the state of the art on the latter dataset. + + + + Egocentric Audio-Visual Object Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Egocentric_Audio-Visual_Object_Localization_CVPR_2023_paper.pdf + Humans naturally perceive surrounding scenes by unifying sound and sight in a first-person view. Likewise, machines are advanced to approach human intelligence by learning with multisensory inputs from an egocentric perspective. 
In this paper, we explore the challenging egocentric audio-visual object localization task and observe that 1) egomotion commonly exists in first-person recordings, even within a short duration; 2) out-of-view sound components can be created while wearers shift their attention. To address the first problem, we propose a geometry-aware temporal aggregation module to handle the egomotion explicitly. The effect of egomotion is mitigated by estimating the temporal geometry transformation and exploiting it to update visual representations. Moreover, we propose a cascaded feature enhancement module to tackle the second issue. It improves cross-modal localization robustness by disentangling visually indicated audio representations. During training, we take advantage of the naturally available audio-visual temporal synchronization as the "free" self-supervision to avoid costly labeling. We also annotate and create the Epic Sounding Object dataset for evaluation purposes. Extensive experiments show that our method achieves state-of-the-art localization performance in egocentric videos and can be generalized to diverse audio-visual scenes. + + + + DropKey for Vision Transformer + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_DropKey_for_Vision_Transformer_CVPR_2023_paper.pdf + In this paper, we focus on analyzing and improving the dropout technique for self-attention layers of the Vision Transformer, which is important yet surprisingly ignored by prior works. In particular, we investigate three core questions: First, what should be dropped in self-attention layers? Different from dropping attention weights as in the literature, we propose to move the dropout operation forward, ahead of the attention matrix calculation, and set the Key as the dropout unit, yielding a novel dropout-before-softmax scheme. We theoretically verify that this scheme helps keep both the regularization and probability features of attention weights, alleviating the problem of overfitting to specific patterns and enhancing the model's ability to globally capture vital information. Second, how should the drop ratio be scheduled in consecutive layers? In contrast to exploiting a constant drop ratio for all layers, we present a new decreasing schedule that gradually lowers the drop ratio along the stack of self-attention layers. We experimentally validate that the proposed schedule avoids overfitting in low-level features and missing high-level semantics, thus improving the robustness and stability of model training. Third, is a structured dropout operation needed, as in CNNs? We attempt a patch-based block version of the dropout operation and find that this useful trick for CNNs is not essential for ViTs. Given the exploration of the above three questions, we present the novel DropKey method that regards the Key as the drop unit and exploits a decreasing schedule for the drop ratio, improving ViTs in a general way. Comprehensive experiments demonstrate the effectiveness of DropKey for various ViT architectures, e.g., T2T, VOLO, CeiT and DeiT, as well as for various vision tasks, e.g., image classification, object detection, human-object interaction detection and human body shape recovery. + + + + Meta Architecture for Point Cloud Analysis + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Meta_Architecture_for_Point_Cloud_Analysis_CVPR_2023_paper.pdf + Recent advances in 3D point cloud analysis bring a diverse set of network architectures to the field. 
However, the lack of a unified framework to interpret those networks makes any systematic comparison, contrast, or analysis challenging, and practically limits healthy development of the field. In this paper, we take the initiative to explore and propose a unified framework called PointMeta, into which popular 3D point cloud analysis approaches can fit. This brings three benefits. First, it allows us to compare different approaches in a fair manner, and use quick experiments to verify any empirical observations or assumptions summarized from the comparison. Second, the big picture brought by PointMeta enables us to think across different components, and revisit common beliefs and key design decisions made by the popular approaches. Third, based on the learnings from the previous two analyses, and by making simple tweaks to the existing approaches, we are able to derive a basic building block, termed PointMetaBase. It shows very strong performance in efficiency and effectiveness through extensive experiments on challenging benchmarks, and thus verifies the necessity and benefits of high-level interpretation, contrast, and comparison like PointMeta. In particular, PointMetaBase surpasses the previous state-of-the-art method by 0.7%/1.4%/2.1% mIoU with only 2%/11%/13% of the computation cost on the S3DIS datasets. Code is available in the supplementary materials. + + + + CIRCLE: Capture in Rich Contextual Environments + http://openaccess.thecvf.com//content/CVPR2023/papers/Araujo_CIRCLE_Capture_in_Rich_Contextual_Environments_CVPR_2023_paper.pdf + Synthesizing 3D human motion in a contextual, ecological environment is important for simulating realistic activities people perform in the real world. However, conventional optics-based motion capture systems are not suited for simultaneously capturing human movements and complex scenes. The lack of rich contextual 3D human motion datasets presents a roadblock to creating high-quality generative human motion models. We propose a novel motion acquisition system in which the actor perceives and operates in a highly contextual virtual world while being motion captured in the real world. Our system enables rapid collection of high-quality human motion in highly diverse scenes, without the concern of occlusion or the need for physical scene construction in the real world. We present CIRCLE, a dataset containing 10 hours of full-body reaching motion from 5 subjects across nine scenes, paired with ego-centric information of the environment represented in various forms, such as RGBD videos. We use this dataset to train a model that generates human motion conditioned on scene information. Leveraging our dataset, the model learns to use ego-centric scene information to achieve nontrivial reaching tasks in the context of complex 3D scenes. To download the data, please visit our website (https://stanford-tml.github.io/circle_dataset/). + + + + PyPose: A Library for Robot Learning With Physics-Based Optimization + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_PyPose_A_Library_for_Robot_Learning_With_Physics-Based_Optimization_CVPR_2023_paper.pdf + Deep learning has had remarkable success in robotic perception, but its data-centric nature suffers when it comes to generalizing to ever-changing environments. By contrast, physics-based optimization generalizes better, but it does not perform as well in complicated tasks due to the lack of high-level semantic information and reliance on manual parametric tuning. 
To take advantage of these two complementary worlds, we present PyPose: a robotics-oriented, PyTorch-based library that combines deep perceptual models with physics-based optimization. PyPose's architecture is tidy and well-organized, it has an imperative style interface and is efficient and user-friendly, making it easy to integrate into real-world robotic applications. Besides, it supports parallel computing of any order gradients of Lie groups and Lie algebras and 2nd-order optimizers, such as trust region methods. Experiments show that PyPose achieves more than 10x speedup in computation compared to the state-of-the-art libraries. To boost future research, we provide concrete examples for several fields of robot learning, including SLAM, planning, control, and inertial navigation. + + + + Make Landscape Flatter in Differentially Private Federated Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Shi_Make_Landscape_Flatter_in_Differentially_Private_Federated_Learning_CVPR_2023_paper.pdf + To defend the inference attacks and mitigate the sensitive information leakages in Federated Learning (FL), client-level Differentially Private FL (DPFL) is the de-facto standard for privacy protection by clipping local updates and adding random noise. However, existing DPFL methods tend to make a sharper loss landscape and have poorer weight perturbation robustness, resulting in severe performance degradation. To alleviate these issues, we propose a novel DPFL algorithm named DP-FedSAM, which leverages gradient perturbation to mitigate the negative impact of DP. Specifically, DP-FedSAM integrates Sharpness Aware Minimization (SAM) optimizer to generate local flatness models with better stability and weight perturbation robustness, which results in the small norm of local updates and robustness to DP noise, thereby improving the performance. From the theoretical perspective, we analyze in detail how DP-FedSAM mitigates the performance degradation induced by DP. Meanwhile, we give rigorous privacy guarantees with Renyi DP and present the sensitivity analysis of local updates. At last, we empirically confirm that our algorithm achieves state-of-the-art (SOTA) performance compared with existing SOTA baselines in DPFL. + + + + BlackVIP: Black-Box Visual Prompting for Robust Transfer Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Oh_BlackVIP_Black-Box_Visual_Prompting_for_Robust_Transfer_Learning_CVPR_2023_paper.pdf + With the surge of large-scale pre-trained models (PTMs), fine-tuning these models to numerous downstream tasks becomes a crucial problem. Consequently, parameter efficient transfer learning (PETL) of large models has grasped huge attention. While recent PETL methods showcase impressive performance, they rely on optimistic assumptions: 1) the entire parameter set of a PTM is available, and 2) a sufficiently large memory capacity for the fine-tuning is equipped. However, in most real-world applications, PTMs are served as a black-box API or proprietary software without explicit parameter accessibility. Besides, it is hard to meet a large memory requirement for modern PTMs. In this work, we propose black-box visual prompting (BlackVIP), which efficiently adapts the PTMs without knowledge about model architectures and parameters. BlackVIP has two components; 1) Coordinator and 2) simultaneous perturbation stochastic approximation with gradient correction (SPSA-GC). 
The Coordinator designs input-dependent image-shaped visual prompts, which improves few-shot adaptation and robustness on distribution/location shift. SPSA-GC efficiently estimates the gradient of a target model to update Coordinator. Extensive experiments on 16 datasets demonstrate that BlackVIP enables robust adaptation to diverse domains without accessing PTMs' parameters, with minimal memory requirements. Code: https://github.com/changdaeoh/BlackVIP + + + + DeepVecFont-v2: Exploiting Transformers To Synthesize Vector Fonts With Higher Quality + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_DeepVecFont-v2_Exploiting_Transformers_To_Synthesize_Vector_Fonts_With_Higher_Quality_CVPR_2023_paper.pdf + Vector font synthesis is a challenging and ongoing problem in the fields of Computer Vision and Computer Graphics. The recently-proposed DeepVecFont achieved state-of-the-art performance by exploiting information of both the image and sequence modalities of vector fonts. However, it has limited capability for handling long sequence data and heavily relies on an image-guided outline refinement post-processing. Thus, vector glyphs synthesized by DeepVecFont still often contain some distortions and artifacts and cannot rival human-designed results. To address the above problems, this paper proposes an enhanced version of DeepVecFont mainly by making the following three novel technical contributions. First, we adopt Transformers instead of RNNs to process sequential data and design a relaxation representation for vector outlines, markedly improving the model's capability and stability of synthesizing long and complex outlines. Second, we propose to sample auxiliary points in addition to control points to precisely align the generated and target Bezier curves or lines. Finally, to alleviate error accumulation in the sequential generation process, we develop a context-based self-refinement module based on another Transformer-based decoder to remove artifacts in the initially synthesized glyphs. Both qualitative and quantitative results demonstrate that the proposed method effectively resolves those intrinsic problems of the original DeepVecFont and outperforms existing approaches in generating English and Chinese vector fonts with complicated structures and diverse styles. + + + + pCON: Polarimetric Coordinate Networks for Neural Scene Representations + http://openaccess.thecvf.com//content/CVPR2023/papers/Peters_pCON_Polarimetric_Coordinate_Networks_for_Neural_Scene_Representations_CVPR_2023_paper.pdf + Neural scene representations have achieved great success in parameterizing and reconstructing images, but current state of the art models are not optimized with the preservation of physical quantities in mind. While current architectures can reconstruct color images correctly, they create artifacts when trying to fit maps of polar quantities. We propose polarimetric coordinate networks (pCON), a new model architecture for neural scene representations aimed at preserving polarimetric information while accurately parameterizing the scene. Our model removes artifacts created by current coordinate network architectures when reconstructing three polarimetric quantities of interest. 
+ + + + Uncertainty-Aware Vision-Based Metric Cross-View Geolocalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Fervers_Uncertainty-Aware_Vision-Based_Metric_Cross-View_Geolocalization_CVPR_2023_paper.pdf + This paper proposes a novel method for vision-based metric cross-view geolocalization (CVGL) that matches the camera images captured from a ground-based vehicle with an aerial image to determine the vehicle's geo-pose. Since aerial images are globally available at low cost, they represent a potential compromise between two established paradigms of autonomous driving, i.e. using expensive high-definition prior maps or relying entirely on the sensor data captured at runtime. We present an end-to-end differentiable model that uses the ground and aerial images to predict a probability distribution over possible vehicle poses. We combine multiple vehicle datasets with aerial images from orthophoto providers on which we demonstrate the feasibility of our method. Since the ground truth poses are often inaccurate w.r.t. the aerial images, we implement a pseudo-label approach to produce more accurate ground truth poses and make them publicly available. While previous works require training data from the target region to achieve reasonable localization accuracy (i.e. same-area evaluation), our approach overcomes this limitation and outperforms previous results even in the strictly more challenging cross-area case. We improve the previous state-of-the-art by a large margin even without ground or aerial data from the test region, which highlights the model's potential for global-scale application. We further integrate the uncertainty-aware predictions in a tracking framework to determine the vehicle's trajectory over time resulting in a mean position error on KITTI-360 of 0.78m. + + + + Continuous Landmark Detection With 3D Queries + http://openaccess.thecvf.com//content/CVPR2023/papers/Chandran_Continuous_Landmark_Detection_With_3D_Queries_CVPR_2023_paper.pdf + Neural networks for facial landmark detection are notoriously limited to a fixed set of landmarks in a dedicated layout, which must be specified at training time. Dedicated datasets must also be hand-annotated with the corresponding landmark configuration for training. We propose the first facial landmark detection network that can predict continuous, unlimited landmarks, allowing to specify the number and location of the desired landmarks at inference time. Our method combines a simple image feature extractor with a queried landmark predictor, and the user can specify any continuous query points relative to a 3D template face mesh as input. As it is not tied to a fixed set of landmarks, our method is able to leverage all pre-existing 2D landmark datasets for training, even if they have inconsistent landmark configurations. As a result, we present a very powerful facial landmark detector that can be trained once, and can be used readily for numerous applications like 3D face reconstruction, arbitrary face segmentation, and is even compatible with helmeted mounted cameras, and therefore could vastly simplify face tracking workflows for media and entertainment applications. 
+ + + + Unbiased Scene Graph Generation in Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Nag_Unbiased_Scene_Graph_Generation_in_Videos_CVPR_2023_paper.pdf + The task of dynamic scene graph generation (SGG) from videos is complicated and challenging due to the inherent dynamics of a scene, temporal fluctuation of model predictions, and the long-tailed distribution of the visual relationships in addition to the already existing challenges in image-based SGG. Existing methods for dynamic SGG have primarily focused on capturing spatio-temporal context using complex architectures without addressing the challenges mentioned above, especially the long-tailed distribution of relationships. This often leads to the generation of biased scene graphs. To address these challenges, we introduce a new framework called TEMPURA: TEmporal consistency and Memory Prototype guided UnceRtainty Attenuation for unbiased dynamic SGG. TEMPURA employs object-level temporal consistencies via transformer-based sequence modeling, learns to synthesize unbiased relationship representations using memory-guided training, and attenuates the predictive uncertainty of visual relations using a Gaussian Mixture Model (GMM). Extensive experiments demonstrate that our method achieves significant (up to 10% in some cases) performance gains over existing methods, highlighting its superiority in generating more unbiased scene graphs. Code: https://github.com/sayaknag/unbiasedSGG.git + + + + Visual Language Pretrained Multiple Instance Zero-Shot Transfer for Histopathology Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Visual_Language_Pretrained_Multiple_Instance_Zero-Shot_Transfer_for_Histopathology_Images_CVPR_2023_paper.pdf + Contrastive visual language pretraining has emerged as a powerful method for either training new language-aware image encoders or augmenting existing pretrained models with zero-shot visual recognition capabilities. However, existing works typically train on large datasets of image-text pairs and have been designed to perform downstream tasks involving only small to medium-sized images, neither of which is applicable to the emerging field of computational pathology, where there are limited publicly available paired image-text datasets and each image can span up to 100,000 x 100,000 pixels in dimensions. In this paper, we present MI-Zero, a simple and intuitive framework for unleashing the zero-shot transfer capabilities of contrastively aligned image and text models to gigapixel histopathology whole slide images, enabling multiple downstream diagnostic tasks to be carried out by pretrained encoders without requiring any additional labels. MI-Zero reformulates zero-shot transfer under the framework of multiple instance learning to overcome the computational challenge of inference on extremely large images. We used over 550k pathology reports and other available in-domain text corpora to pretrain our text encoder. By effectively leveraging strong pretrained encoders, our best model pretrained on over 33k histopathology image-caption pairs achieves an average median zero-shot accuracy of 70.2% across three different real-world cancer subtyping tasks. Our code is available at: https://github.com/mahmoodlab/MI-Zero. 
+ + + + PMR: Prototypical Modal Rebalance for Multimodal Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Fan_PMR_Prototypical_Modal_Rebalance_for_Multimodal_Learning_CVPR_2023_paper.pdf + Multimodal learning (MML) aims to jointly exploit the common priors of different modalities to compensate for their inherent limitations. However, existing MML methods often optimize a uniform objective for different modalities, leading to the notorious "modality imbalance" problem and counterproductive MML performance. To address the problem, some existing methods modulate the learning pace based on the fused modality, which is dominated by the better modality and eventually results in a limited improvement on the worse modal. To better exploit the features of multimodal, we propose Prototypical Modality Rebalance (PMR) to perform stimulation on the particular slow-learning modality without interference from other modalities. Specifically, we introduce the prototypes that represent general features for each class, to build the non-parametric classifiers for uni-modal performance evaluation. Then, we try to accelerate the slow-learning modality by enhancing its clustering toward prototypes. Furthermore, to alleviate the suppression from the dominant modality, we introduce a prototype-based entropy regularization term during the early training stage to prevent premature convergence. Besides, our method only relies on the representations of each modality and without restrictions from model structures and fusion methods, making it with great application potential for various scenarios. The source code is available here. + + + + Multi-Sensor Large-Scale Dataset for Multi-View 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Voynov_Multi-Sensor_Large-Scale_Dataset_for_Multi-View_3D_Reconstruction_CVPR_2023_paper.pdf + We present a new multi-sensor dataset for multi-view 3D surface reconstruction. It includes registered RGB and depth data from sensors of different resolutions and modalities: smartphones, Intel RealSense, Microsoft Kinect, industrial cameras, and structured-light scanner. The scenes are selected to emphasize a diverse set of material properties challenging for existing algorithms. We provide around 1.4 million images of 107 different scenes acquired from 100 viewing directions under 14 lighting conditions. We expect our dataset will be useful for evaluation and training of 3D reconstruction algorithms and for related tasks. The dataset is available at skoltech3d.appliedai.tech. + + + + PanoHead: Geometry-Aware 3D Full-Head Synthesis in 360deg + http://openaccess.thecvf.com//content/CVPR2023/papers/An_PanoHead_Geometry-Aware_3D_Full-Head_Synthesis_in_360deg_CVPR_2023_paper.pdf + Synthesis and reconstruction of 3D human head has gained increasing interests in computer vision and computer graphics recently. Existing state-of-the-art 3D generative adversarial networks (GANs) for 3D human head synthesis are either limited to near-frontal views or hard to preserve 3D consistency in large view angles. We propose PanoHead, the first 3D-aware generative model that enables high-quality view-consistent image synthesis of full heads in 360deg with diverse appearance and detailed geometry using only in-the-wild unstructured images for training. At its core, we lift up the representation power of recent 3D GANs and bridge the data alignment gap when training from in-the-wild images with widely distributed views. 
Specifically, we propose a novel two-stage self-adaptive image alignment for robust 3D GAN training. We further introduce a tri-grid neural volume representation that effectively addresses front-face and back-head feature entanglement rooted in the widely-adopted tri-plane formulation. Our method instills prior knowledge of 2D image segmentation in adversarial learning of 3D neural scene structures, enabling compositable head synthesis in diverse backgrounds. Benefiting from these designs, our method significantly outperforms previous 3D GANs, generating high-quality 3D heads with accurate geometry and diverse appearances, even with long wavy and afro hairstyles, renderable from arbitrary poses. Furthermore, we show that our system can reconstruct full 3D heads from single input images for personalized realistic 3D avatars. + + + + Rethinking Feature-Based Knowledge Distillation for Face Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Rethinking_Feature-Based_Knowledge_Distillation_for_Face_Recognition_CVPR_2023_paper.pdf + With the continual expansion of face datasets, feature-based distillation prevails for large-scale face recognition. In this work, we attempt to remove identity supervision in student training, to spare the GPU memory from saving massive class centers. However, this naive removal leads to inferior distillation result. We carefully inspect the performance degradation from the perspective of intrinsic dimension, and argue that the gap in intrinsic dimension, namely the intrinsic gap, is intimately connected to the infamous capacity gap problem. By constraining the teacher's search space with reverse distillation, we narrow the intrinsic gap and unleash the potential of feature-only distillation. Remarkably, the proposed reverse distillation creates universally student-friendly teacher that demonstrates outstanding student improvement. We further enhance its effectiveness by designing a student proxy to better bridge the intrinsic gap. As a result, the proposed method surpasses state-of-the-art distillation techniques with identity supervision on various face recognition benchmarks, and the improvements are consistent across different teacher-student pairs. + + + + NeurOCS: Neural NOCS Supervision for Monocular 3D Object Localization + http://openaccess.thecvf.com//content/CVPR2023/papers/Min_NeurOCS_Neural_NOCS_Supervision_for_Monocular_3D_Object_Localization_CVPR_2023_paper.pdf + Monocular 3D object localization in driving scenes is a crucial task, but challenging due to its ill-posed nature. Estimating 3D coordinates for each pixel on the object surface holds great potential as it provides dense 2D-3D geometric constraints for the underlying PnP problem. However, high-quality ground truth supervision is not available in driving scenes due to sparsity and various artifacts of Lidar data, as well as the practical infeasibility of collecting per-instance CAD models. In this work, we present NeurOCS, a framework that uses instance masks and 3D boxes as input to learn 3D object shapes by means of differentiable rendering, which further serves as supervision for learning dense object coordinates. Our approach rests on insights in learning a category-level shape prior directly from real driving scenes, while properly handling single-view ambiguities. Furthermore, we study and make critical design choices to learn object coordinates more effectively from an object-centric view. 
Altogether, our framework leads to new state-of-the-art in monocular 3D localization that ranks 1st on the KITTI-Object benchmark among published monocular methods. + + + + Revisiting Reverse Distillation for Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Tien_Revisiting_Reverse_Distillation_for_Anomaly_Detection_CVPR_2023_paper.pdf + Anomaly detection is an important application in large-scale industrial manufacturing. Recent methods for this task have demonstrated excellent accuracy but come with a latency trade-off. Memory based approaches with dominant performances like PatchCore or Coupled-hypersphere-based Feature Adaptation (CFA) require an external memory bank, which significantly lengthens the execution time. Another approach that employs Reversed Distillation (RD) can perform well while maintaining low latency. In this paper, we revisit this idea to improve its performance, establishing a new state-of-the-art benchmark on the challenging MVTec dataset for both anomaly detection and localization. The proposed method, called RD++, runs six times faster than PatchCore, and two times faster than CFA but introduces a negligible latency compared to RD. We also experiment on the BTAD and Retinal OCT datasets to demonstrate our method's generalizability and conduct important ablation experiments to provide insights into its configurations. Source code will be available at https://github.com/tientrandinh/Revisiting-Reverse-Distillation. + + + + Diffusion-Based Generation, Optimization, and Planning in 3D Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Diffusion-Based_Generation_Optimization_and_Planning_in_3D_Scenes_CVPR_2023_paper.pdf + We introduce SceneDiffuser, a conditional generative model for 3D scene understanding. SceneDiffuser provides a unified model for solving scene-conditioned generation, optimization, and planning. In contrast to prior works, SceneDiffuser is intrinsically scene-aware, physics-based, and goal-oriented. With an iterative sampling strategy, SceneDiffuser jointly formulates the scene-aware generation, physics-based optimization, and goal-oriented planning via a diffusion-based denoising process in a fully differentiable fashion. Such a design alleviates the discrepancies among different modules and the posterior collapse of previous scene-conditioned generative models. We evaluate SceneDiffuser with various 3D scene understanding tasks, including human pose and motion generation, dexterous grasp generation, path planning for 3D navigation, and motion planning for robot arms. The results show significant improvements compared with previous models, demonstrating the tremendous potential of SceneDiffuser for the broad community of 3D scene understanding. + + + + TMO: Textured Mesh Acquisition of Objects With a Mobile Device by Using Differentiable Rendering + http://openaccess.thecvf.com//content/CVPR2023/papers/Choi_TMO_Textured_Mesh_Acquisition_of_Objects_With_a_Mobile_Device_CVPR_2023_paper.pdf + We present a new pipeline for acquiring a textured mesh in the wild with a single smartphone which offers access to images, depth maps, and valid poses. Our method first introduces an RGBD-aided structure from motion, which can yield filtered depth maps and refines camera poses guided by corresponding depth. Then, we adopt the neural implicit surface reconstruction method, which allows for high quality mesh and develops a new training process for applying a regularization provided by classical multi-view stereo methods. 
Moreover, we apply a differentiable rendering to fine-tune incomplete texture maps and generate textures which are perceptually closer to the original scene. Our pipeline can be applied to any common objects in the real world without the need for either in-the-lab environments or accurate mask images. We demonstrate results of captured objects with complex shapes and validate our method numerically against existing 3D reconstruction and texture mapping methods. + + + + MP-Former: Mask-Piloted Transformer for Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_MP-Former_Mask-Piloted_Transformer_for_Image_Segmentation_CVPR_2023_paper.pdf + We present a mask-piloted Transformer which improves masked-attention in Mask2Former for image segmentation. The improvement is based on our observation that Mask2Former suffers from inconsistent mask predictions between consecutive decoder layers, which leads to inconsistent optimization goals and low utilization of decoder queries. To address this problem, we propose a mask-piloted training approach, which additionally feeds noised ground-truth masks in masked-attention and trains the model to reconstruct the original ones. Compared with the predicted masks used in mask-attention, the ground-truth masks serve as a pilot and effectively alleviate the negative impact of inaccurate mask predictions in Mask2Former. Based on this technique, our MP-Former achieves a remarkable performance improvement on all three image segmentation tasks (instance, panoptic, and semantic), yielding +2.3 AP and +1.6 mIoU on the Cityscapes instance and semantic segmentation tasks with a ResNet-50 backbone. Our method also significantly speeds up the training, outperforming Mask2Former with half of the number of training epochs on ADE20K with both a ResNet-50 and a Swin-L backbones. Moreover, our method only introduces little computation during training and no extra computation during inference. Our code will be released at https://github.com/IDEA-Research/MP-Former. + + + + TAPS3D: Text-Guided 3D Textured Shape Generation From Pseudo Supervision + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_TAPS3D_Text-Guided_3D_Textured_Shape_Generation_From_Pseudo_Supervision_CVPR_2023_paper.pdf + In this paper, we investigate an open research task of generating controllable 3D textured shapes from the given textual descriptions. Previous works either require ground truth caption labeling or extensive optimization time. To resolve these issues, we present a novel framework, TAPS3D, to train a text-guided 3D shape generator with pseudo captions. Specifically, based on rendered 2D images, we retrieve relevant words from the CLIP vocabulary and construct pseudo captions using templates. Our constructed captions provide high-level semantic supervision for generated 3D shapes. Further, in order to produce fine-grained textures and increase geometry diversity, we propose to adopt low-level image regularization to enable fake-rendered images to align with the real ones. During the inference phase, our proposed model can generate 3D textured shapes from the given text without any additional optimization. We conduct extensive experiments to analyze each of our proposed components and show the efficacy of our framework in generating high-fidelity 3D textured and text-relevant shapes. 
+ + + + Video Test-Time Adaptation for Action Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Video_Test-Time_Adaptation_for_Action_Recognition_CVPR_2023_paper.pdf + Although action recognition systems can achieve top performance when evaluated on in-distribution test points, they are vulnerable to unanticipated distribution shifts in test data. However, test-time adaptation of video action recognition models against common distribution shifts has so far not been demonstrated. We propose to address this problem with an approach tailored to spatio-temporal models that is capable of adaptation on a single video sample at a step. It consists of a feature distribution alignment technique that aligns online estimates of test set statistics towards the training statistics. We further enforce prediction consistency over temporally augmented views of the same test video sample. Evaluations on three benchmark action recognition datasets show that our proposed technique is architecture-agnostic and able to significantly boost the performance of both the state-of-the-art convolutional architecture TANet and the Video Swin Transformer. Our proposed method demonstrates a substantial performance gain over existing test-time adaptation approaches in both evaluations of a single distribution shift and the challenging case of random distribution shifts. + + + + Tensor4D: Efficient Neural 4D Decomposition for High-Fidelity Dynamic Reconstruction and Rendering + http://openaccess.thecvf.com//content/CVPR2023/papers/Shao_Tensor4D_Efficient_Neural_4D_Decomposition_for_High-Fidelity_Dynamic_Reconstruction_and_CVPR_2023_paper.pdf + We present Tensor4D, an efficient yet effective approach to dynamic scene modeling. The key to our solution is an efficient 4D tensor decomposition method so that the dynamic scene can be directly represented as a 4D spatio-temporal tensor. To tackle the accompanying memory issue, we decompose the 4D tensor hierarchically by projecting it first into three time-aware volumes and then nine compact feature planes. In this way, spatial information over time can be simultaneously captured in a compact and memory-efficient manner. When applying Tensor4D for dynamic scene reconstruction and rendering, we further factorize the 4D fields to different scales in the sense that structural motions and dynamic detailed changes can be learned from coarse to fine. The effectiveness of our method is validated on both synthetic and real-world scenes. Extensive experiments show that our method is able to achieve high-quality dynamic reconstruction and rendering from sparse-view camera rigs or even a monocular camera. + + + + Learning Personalized High Quality Volumetric Head Avatars From Monocular RGB Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Bai_Learning_Personalized_High_Quality_Volumetric_Head_Avatars_From_Monocular_RGB_CVPR_2023_paper.pdf + We propose a method to learn a high-quality implicit 3D head avatar from a monocular RGB video captured in the wild. The learnt avatar is driven by a parametric face model to achieve user-controlled facial expressions and head poses. Our hybrid pipeline combines the geometry prior and dynamic tracking of a 3DMM with a neural radiance field to achieve fine-grained control and photorealism. To reduce over-smoothing and improve out-of-model expression synthesis, we propose to predict local features anchored on the 3DMM geometry.
These learnt features are driven by 3DMM deformation and interpolated in 3D space to yield the volumetric radiance at a designated query point. We further show that using a Convolutional Neural Network in the UV space is critical in incorporating spatial context and producing representative local features. Extensive experiments show that we are able to reconstruct high-quality avatars, with more accurate expression-dependent details, good generalization to out-of-training expressions, and quantitatively superior renderings compared to other state-of-the-art approaches. + + + + Progressive Backdoor Erasing via Connecting Backdoor and Adversarial Attacks + http://openaccess.thecvf.com//content/CVPR2023/papers/Mu_Progressive_Backdoor_Erasing_via_Connecting_Backdoor_and_Adversarial_Attacks_CVPR_2023_paper.pdf + Deep neural networks (DNNs) are known to be vulnerable to both backdoor attacks and adversarial attacks. In the literature, these two types of attacks are commonly treated as distinct problems and solved separately, since they belong to training-time and inference-time attacks respectively. However, in this paper we find an intriguing connection between them: for a model planted with backdoors, we observe that its adversarial examples have similar behaviors to its triggered samples, i.e., both activate the same subset of DNN neurons. This indicates that planting a backdoor into a model will significantly affect the model's adversarial examples. Based on these observations, a novel Progressive Backdoor Erasing (PBE) algorithm is proposed to progressively purify the infected model by leveraging untargeted adversarial attacks. Different from previous backdoor defense methods, one significant advantage of our approach is that it can erase backdoors even when an additional clean dataset is unavailable. We empirically show that, against 5 state-of-the-art backdoor attacks, our approach can effectively erase the backdoor triggers without obvious performance degradation on clean samples and significantly outperforms existing defense methods. + + + + LayoutFormer++: Conditional Graphic Layout Generation via Constraint Serialization and Decoding Space Restriction + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_LayoutFormer_Conditional_Graphic_Layout_Generation_via_Constraint_Serialization_and_Decoding_CVPR_2023_paper.pdf + Conditional graphic layout generation, which generates realistic layouts according to user constraints, is a challenging task that has not been well-studied yet. First, there is limited discussion about how to handle diverse user constraints flexibly and uniformly. Second, to make the layouts conform to user constraints, existing work often sacrifices generation quality significantly. In this work, we propose LayoutFormer++ to tackle the above problems. First, to flexibly handle diverse constraints, we propose a constraint serialization scheme, which represents different user constraints as sequences of tokens with a predefined format. Then, we formulate conditional layout generation as a sequence-to-sequence transformation, and leverage an encoder-decoder framework with the Transformer as the basic architecture. Furthermore, to make the layout better meet user requirements without harming quality, we propose a decoding space restriction strategy. Specifically, we prune the predicted distribution by ignoring the options that definitely violate user constraints and likely result in low-quality layouts, and make the model sample from the restricted distribution.
Experiments demonstrate that LayoutFormer++ outperforms existing approaches on all the tasks in terms of both better generation quality and less constraint violation. + + + + Stare at What You See: Masked Image Modeling Without Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Xue_Stare_at_What_You_See_Masked_Image_Modeling_Without_Reconstruction_CVPR_2023_paper.pdf + Masked Autoencoders (MAE) have been prevailing paradigms for large-scale vision representation pre-training. By reconstructing masked image patches from a small portion of visible image regions, MAE forces the model to infer semantic correlation within an image. Recently, some approaches apply semantic-rich teacher models to extract image features as the reconstruction target, leading to better performance. However, unlike the low-level features such as pixel values, we argue the features extracted by powerful teacher models already encode rich semantic correlation across regions in an intact image. This raises one question: is reconstruction necessary in Masked Image Modeling (MIM) with a teacher model? In this paper, we propose an efficient MIM paradigm named MaskAlign. MaskAlign simply learns the consistency of visible patch feature extracted by the student model and intact image features extracted by the teacher model. To further advance the performance and tackle the problem of input inconsistency between the student and teacher model, we propose a Dynamic Alignment (DA) module to apply learnable alignment. Our experimental results demonstrate that masked modeling does not lose effectiveness even without reconstruction on masked regions. Combined with Dynamic Alignment, MaskAlign can achieve state-of-the-art performance with much higher efficiency. + + + + Joint Visual Grounding and Tracking With Natural Language Specification + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Joint_Visual_Grounding_and_Tracking_With_Natural_Language_Specification_CVPR_2023_paper.pdf + Tracking by natural language specification aims to locate the referred target in a sequence based on the natural language description. Existing algorithms solve this issue in two steps, visual grounding and tracking, and accordingly deploy the separated grounding model and tracking model to implement these two steps, respectively. Such a separated framework overlooks the link between visual grounding and tracking, which is that the natural language descriptions provide global semantic cues for localizing the target for both two steps. Besides, the separated framework can hardly be trained end-to-end. To handle these issues, we propose a joint visual grounding and tracking framework, which reformulates grounding and tracking as a unified task: localizing the referred target based on the given visual-language references. Specifically, we propose a multi-source relation modeling module to effectively build the relation between the visual-language references and the test image. In addition, we design a temporal modeling module to provide a temporal clue with the guidance of the global semantic information for our model, which effectively improves the adaptability to the appearance variations of the target. Extensive experimental results on TNL2K, LaSOT, OTB99, and RefCOCOg demonstrate that our method performs favorably against state-of-the-art algorithms for both tracking and grounding. Code is available at https://github.com/lizhou-cs/JointNLT. 
+ + + + Few-Shot Semantic Image Synthesis With Class Affinity Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Careil_Few-Shot_Semantic_Image_Synthesis_With_Class_Affinity_Transfer_CVPR_2023_paper.pdf + Semantic image synthesis aims to generate photo-realistic images given a semantic segmentation map. Despite much recent progress, training such models still requires large datasets of images annotated with per-pixel label maps that are extremely tedious to obtain. To alleviate the high annotation cost, we propose a transfer method that leverages a model trained on a large source dataset to improve the learning ability on small target datasets via estimated pairwise relations between source and target classes. The class affinity matrix is introduced as a first layer to the source model to make it compatible with the target label maps, and the source model is then further fine-tuned for the target domain. To estimate the class affinities we consider different approaches to leverage prior knowledge: semantic segmentation on the source domain, textual label embeddings, and self-supervised vision features. We apply our approach to GAN-based and diffusion-based architectures for semantic synthesis. Our experiments show that the different ways to estimate class affinity can be effectively combined, and that our approach significantly improves over existing state-of-the-art transfer approaches for generative image models. + + + + HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_HIER_Metric_Learning_Beyond_Class_Labels_via_Hierarchical_Regularization_CVPR_2023_paper.pdf + Supervision for metric learning has long been given in the form of equivalence between human-labeled classes. Although this type of supervision has been a basis of metric learning for decades, we argue that it hinders further advances in the field. In this regard, we propose a new regularization method, dubbed HIER, to discover the latent semantic hierarchy of training data, and to deploy the hierarchy to provide richer and more fine-grained supervision than inter-class separability induced by common metric learning losses. HIER achieves this goal with no annotation for the semantic hierarchy but by learning hierarchical proxies in hyperbolic spaces. The hierarchical proxies are learnable parameters, and each of them is trained to serve as an ancestor of a group of data or other proxies to approximate the semantic hierarchy among them. HIER deals with the proxies along with data in hyperbolic space since the geometric properties of the space are well-suited to represent their hierarchical structure. The efficacy of HIER is evaluated on four standard benchmarks, where it consistently improved the performance of conventional methods when integrated with them, and consequently achieved the best records, surpassing even the existing hyperbolic metric learning technique, in almost all settings. + + + + Diffusion Probabilistic Model Made Slim + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Diffusion_Probabilistic_Model_Made_Slim_CVPR_2023_paper.pdf + Despite the visually-pleasing results achieved, the massive computational cost has been a long-standing flaw for diffusion probabilistic models (DPMs), which, in turn, greatly limits their applications on resource-limited platforms. Prior methods towards efficient DPM, however, have largely focused on accelerating the testing yet overlooked their huge complexity and size.
In this paper, we make a dedicated attempt to lighten DPM while striving to preserve its favourable performance. We start by training a small-sized latent diffusion model (LDM) from scratch but observe a significant fidelity drop in the synthetic images. Through a thorough assessment, we find that DPM is intrinsically biased against high-frequency generation, and learns to recover different frequency components at different time-steps. These properties make compact networks unable to represent frequency dynamics with accurate high-frequency estimation. Towards this end, we introduce a customized design for slim DPM, which we term as Spectral Diffusion (SD), for lightweight image synthesis. SD incorporates wavelet gating in its architecture to enable frequency dynamic feature extraction at every reverse steps, and conducts spectrum-aware distillation to promote high-frequency recovery by inverse weighting the objective based on spectrum magnitudes. Experimental results demonstrate that, SD achieves 8-18x computational complexity reduction as compared to the latent diffusion models on a series of conditional and unconditional image generation tasks while retaining competitive image fidelity. + + + + Confidence-Aware Personalized Federated Learning via Variational Expectation Maximization + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Confidence-Aware_Personalized_Federated_Learning_via_Variational_Expectation_Maximization_CVPR_2023_paper.pdf + Federated Learning (FL) is a distributed learning scheme to train a shared model across clients. One common and fundamental challenge in FL is that the sets of data across clients could be non-identically distributed and have different sizes. Personalized Federated Learning (PFL) attempts to solve this challenge via locally adapted models. In this work, we present a novel framework for PFL based on hierarchical Bayesian modeling and variational inference. A global model is introduced as a latent variable to augment the joint distribution of clients' parameters and capture the common trends of different clients, optimization is derived based on the principle of maximizing the marginal likelihood and conducted using variational expectation maximization. Our algorithm gives rise to a closed-form estimation of a confidence value which comprises the uncertainty of clients' parameters and local model deviations from the global model. The confidence value is used to weigh clients' parameters in the aggregation stage and adjust the regularization effect of the global model. We evaluate our method through extensive empirical studies on multiple datasets. Experimental results show that our approach obtains competitive results under mild heterogeneous circumstances while significantly outperforming state-of-the-art PFL frameworks in highly heterogeneous settings. + + + + Hierarchical Supervision and Shuffle Data Augmentation for 3D Semi-Supervised Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Hierarchical_Supervision_and_Shuffle_Data_Augmentation_for_3D_Semi-Supervised_Object_CVPR_2023_paper.pdf + State-of-the-art 3D object detectors are usually trained on large-scale datasets with high-quality 3D annotations. However, such 3D annotations are often expensive and time-consuming, which may not be practical for real applications. A natural remedy is to adopt semi-supervised learning (SSL) by leveraging a limited amount of labeled samples and abundant unlabeled samples. 
Current pseudo-labeling-based SSL object detection methods mainly adopt a teacher-student framework, with a single fixed threshold strategy to generate supervision signals, which inevitably brings confused supervision when guiding the student network training. Besides, the data augmentation of the point cloud in the typical teacher-student framework is too weak, and only contains basic down sampling and flip-and-shift (i.e., rotate and scaling), which hinders the effective learning of feature information. Hence, we address these issues by introducing a novel approach of Hierarchical Supervision and Shuffle Data Augmentation (HSSDA), which is a simple yet effective teacher-student framework. The teacher network generates more reasonable supervision for the student network by designing a dynamic dual-threshold strategy. Besides, the shuffle data augmentation strategy is designed to strengthen the feature representation ability of the student network. Extensive experiments show that HSSDA consistently outperforms the recent state-of-the-art methods on different datasets. The code will be released at https://github.com/azhuantou/HSSDA. + + + + Planning-Oriented Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Planning-Oriented_Autonomous_Driving_CVPR_2023_paper.pdf + Modern autonomous driving system is characterized as modular tasks in sequential order, i.e., perception, prediction, and planning. In order to perform a wide diversity of tasks and achieve advanced-level intelligence, contemporary approaches either deploy standalone models for individual tasks, or design a multi-task paradigm with separate heads. However, they might suffer from accumulative errors or deficient task coordination. Instead, we argue that a favorable framework should be devised and optimized in pursuit of the ultimate goal, i.e., planning of the self-driving car. Oriented at this, we revisit the key components within perception and prediction, and prioritize the tasks such that all these tasks contribute to planning. We introduce Unified Autonomous Driving (UniAD), a comprehensive framework up-to-date that incorporates full-stack driving tasks in one network. It is exquisitely devised to leverage advantages of each module, and provide complementary feature abstractions for agent interaction from a global perspective. Tasks are communicated with unified query interfaces to facilitate each other toward planning. We instantiate UniAD on the challenging nuScenes benchmark. With extensive ablations, the effectiveness of using such a philosophy is proven by substantially outperforming previous state-of-the-arts in all aspects. Code and models are public. + + + + Independent Component Alignment for Multi-Task Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Senushkin_Independent_Component_Alignment_for_Multi-Task_Learning_CVPR_2023_paper.pdf + In a multi-task learning (MTL) setting, a single model is trained to tackle a diverse set of tasks jointly. Despite rapid progress in the field, MTL remains challenging due to optimization issues such as conflicting and dominating gradients. In this work, we propose using a condition number of a linear system of gradients as a stability criterion of an MTL optimization. We theoretically demonstrate that a condition number reflects the aforementioned optimization issues. 
Accordingly, we present Aligned-MTL, a novel MTL optimization approach based on the proposed criterion that eliminates instability in the training process by aligning the orthogonal components of the linear system of gradients. While many recent MTL approaches guarantee convergence to a minimum, task trade-offs cannot be specified in advance. In contrast, Aligned-MTL provably converges to an optimal point with pre-defined task-specific weights, which provides more control over the optimization result. Through experiments, we show that the proposed approach consistently improves performance on a diverse set of MTL benchmarks, including semantic and instance segmentation, depth estimation, surface normal estimation, and reinforcement learning. + + + + Edges to Shapes to Concepts: Adversarial Augmentation for Robust Vision + http://openaccess.thecvf.com//content/CVPR2023/papers/Tripathi_Edges_to_Shapes_to_Concepts_Adversarial_Augmentation_for_Robust_Vision_CVPR_2023_paper.pdf + Recent work has shown that deep vision models tend to be overly dependent on low-level or "texture" features, leading to poor generalization. Various data augmentation strategies have been proposed to overcome this so-called texture bias in DNNs. We propose a simple, lightweight adversarial augmentation technique that explicitly incentivizes the network to learn holistic shapes for accurate prediction in an object classification setting. Our augmentations superpose edgemaps from one image onto another image with shuffled patches, using a randomly determined mixing proportion, and label the result with the image label of the edgemap image. To classify these augmented images, the model needs to not only detect and focus on edges but also distinguish between relevant and spurious edges. We show that our augmentations significantly improve classification accuracy and robustness measures on a range of datasets and neural architectures. As an example, for ViT-S, we obtain absolute classification accuracy gains of up to 6%. We also obtain gains of up to 28% and 8.5% on natural adversarial and out-of-distribution datasets like ImageNet-A (for ViT-B) and ImageNet-R (for ViT-S), respectively. Analysis using a range of probe datasets shows substantially increased shape sensitivity in our trained models, explaining the observed improvement in robustness and classification accuracy. + + + + ReVISE: Self-Supervised Speech Resynthesis With Visual Input for Universal and Generalized Speech Regeneration + http://openaccess.thecvf.com//content/CVPR2023/papers/Hsu_ReVISE_Self-Supervised_Speech_Resynthesis_With_Visual_Input_for_Universal_and_CVPR_2023_paper.pdf + Prior works on improving speech quality with visual input typically study each type of auditory distortion separately (e.g., separation, inpainting, video-to-speech) and present tailored algorithms. This paper proposes to unify these subjects and study Generalized Speech Regeneration, where the goal is not to reconstruct the exact reference clean signal, but to focus on improving certain aspects of speech while not necessarily preserving the rest, such as voice. In particular, this paper concerns intelligibility, quality, and video synchronization. We cast the problem as audio-visual speech resynthesis, which is composed of two steps: pseudo audio-visual speech recognition (P-AVSR) and pseudo text-to-speech synthesis (P-TTS). P-AVSR and P-TTS are connected by discrete units derived from a self-supervised speech model.
Moreover, we utilize a self-supervised audio-visual speech model to initialize P-AVSR. The proposed model is coined ReVISE. ReVISE is the first high-quality model for in-the-wild video-to-speech synthesis and achieves superior performance on all LRS3 audio-visual regeneration tasks with a single model. To demonstrate its applicability in the real world, ReVISE is also evaluated on EasyCom, an audio-visual benchmark collected under challenging acoustic conditions with only 1.6 hours of training data. Similarly, ReVISE greatly suppresses noise and improves quality. Project page: https://wnhsu.github.io/ReVISE. + + + + Data-Free Knowledge Distillation via Feature Exchange and Activation Region Constraint + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Data-Free_Knowledge_Distillation_via_Feature_Exchange_and_Activation_Region_Constraint_CVPR_2023_paper.pdf + Despite the tremendous progress on data-free knowledge distillation (DFKD) based on synthetic data generation, there are still limitations in diverse and efficient data synthesis. It is naive to expect that a simple combination of generative network-based data synthesis and data augmentation will solve these issues. Therefore, this paper proposes a novel data-free knowledge distillation method (SpaceshipNet) based on channel-wise feature exchange (CFE) and a multi-scale spatial activation region consistency (mSARC) constraint. Specifically, CFE allows our generative network to better sample from the feature space and efficiently synthesize diverse images for learning the student network. However, using CFE alone can severely amplify the unwanted noises in the synthesized images, which may result in failure to improve distillation learning and even have negative effects. Therefore, we propose mSARC to ensure that the student network can imitate not only the logit output but also the spatial activation region of the teacher network, in order to alleviate the influence of unwanted noises in diverse synthetic images on distillation learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny-ImageNet, Imagenette, and ImageNet100 show that our method can work well with different backbone networks, and outperform the state-of-the-art DFKD methods. Code will be available at: https://github.com/skgyu/SpaceshipNet. + + + + CLIP-Sculptor: Zero-Shot Generation of High-Fidelity and Diverse Shapes From Natural Language + http://openaccess.thecvf.com//content/CVPR2023/papers/Sanghi_CLIP-Sculptor_Zero-Shot_Generation_of_High-Fidelity_and_Diverse_Shapes_From_Natural_CVPR_2023_paper.pdf + Recent works have demonstrated that natural language can be used to generate and edit 3D shapes. However, these methods generate shapes with limited fidelity and diversity. We introduce CLIP-Sculptor, a method to address these constraints by producing high-fidelity and diverse 3D shapes without the need for (text, shape) pairs during training. CLIP-Sculptor achieves this in a multi-resolution approach that first generates in a low-dimensional latent space and then upscales to a higher resolution for improved shape fidelity. For improved shape diversity, we use a discrete latent space which is modeled using a transformer conditioned on CLIP's image-text embedding space. We also present a novel variant of classifier-free guidance, which improves the accuracy-diversity trade-off. Finally, we perform extensive experiments demonstrating that CLIP-Sculptor outperforms state-of-the-art baselines.
+ + + + Mask-Free Video Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Ke_Mask-Free_Video_Instance_Segmentation_CVPR_2023_paper.pdf + The recent advancement in Video Instance Segmentation (VIS) has largely been driven by the use of deeper and increasingly data-hungry transformer-based models. However, video masks are tedious and expensive to annotate, limiting the scale and diversity of existing VIS datasets. In this work, we aim to remove the mask-annotation requirement. We propose MaskFreeVIS, achieving highly competitive VIS performance, while only using bounding box annotations for the object state. We leverage the rich temporal mask consistency constraints in videos by introducing the Temporal KNN-patch Loss (TK-Loss), providing strong mask supervision without any labels. Our TK-Loss finds one-to-many matches across frames, through an efficient patch-matching step followed by a K-nearest neighbor selection. A consistency loss is then enforced on the found matches. Our mask-free objective is simple to implement, has no trainable parameters, is computationally efficient, yet outperforms baselines employing, e.g., state-of-the-art optical flow to enforce temporal mask consistency. We validate MaskFreeVIS on the YouTube-VIS 2019/2021, OVIS and BDD100K MOTS benchmarks. The results clearly demonstrate the efficacy of our method by drastically narrowing the gap between fully and weakly-supervised VIS performance. Our code and trained models are available at http://vis.xyz/pub/maskfreevis. + + + + Continual Detection Transformer for Incremental Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Continual_Detection_Transformer_for_Incremental_Object_Detection_CVPR_2023_paper.pdf + Incremental object detection (IOD) aims to train an object detector in phases, each with annotations for new object categories. As other incremental settings, IOD is subject to catastrophic forgetting, which is often addressed by techniques such as knowledge distillation (KD) and exemplar replay (ER). However, KD and ER do not work well if applied directly to state-of-the-art transformer-based object detectors such as Deformable DETR and UP-DETR. In this paper, we solve these issues by proposing a ContinuaL DEtection TRansformer (CL-DETR), a new method for transformer-based IOD which enables effective usage of KD and ER in this context. First, we introduce a Detector Knowledge Distillation (DKD) loss, focusing on the most informative and reliable predictions from old versions of the model, ignoring redundant background predictions, and ensuring compatibility with the available ground-truth labels. We also improve ER by proposing a calibration strategy to preserve the label distribution of the training set, therefore better matching training and testing statistics. We conduct extensive experiments on COCO 2017 and demonstrate that CL-DETR achieves state-of-the-art results in the IOD setting. + + + + Two-Stream Networks for Weakly-Supervised Temporal Action Localization With Semantic-Aware Mechanisms + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Two-Stream_Networks_for_Weakly-Supervised_Temporal_Action_Localization_With_Semantic-Aware_Mechanisms_CVPR_2023_paper.pdf + Weakly-supervised temporal action localization aims to detect action boundaries in untrimmed videos with only video-level annotations. 
Most existing schemes detect temporal regions that are most responsive to video-level classification, but they overlook the semantic consistency between frames. In this paper, we hypothesize that snippets with similar representations should be considered as the same action class despite the absence of supervision signals on each snippet. To this end, we devise a learnable dictionary where entries are the class centroids of the corresponding action categories. The representations of snippets identified as the same action category are induced to be close to the same class centroid, which guides the network to perceive the semantics of frames and avoid unreasonable localization. Besides, we propose a two-stream framework that integrates the attention mechanism and the multiple-instance learning strategy to extract fine-grained clues and salient features respectively. Their complementarity enables the model to refine temporal boundaries. Finally, the developed model is validated on the publicly available THUMOS-14 and ActivityNet-1.3 datasets, where substantial experiments and analyses demonstrate that our model achieves remarkable advances over existing methods. + + + + HyperMatch: Noise-Tolerant Semi-Supervised Learning via Relaxed Contrastive Constraint + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_HyperMatch_Noise-Tolerant_Semi-Supervised_Learning_via_Relaxed_Contrastive_Constraint_CVPR_2023_paper.pdf + Recent developments of the application of Contrastive Learning in Semi-Supervised Learning (SSL) have demonstrated significant advancements, as a result of its exceptional ability to learn class-aware cluster representations and the full exploitation of massive unlabeled data. However, mismatched instance pairs caused by inaccurate pseudo labels would assign an unlabeled instance to the incorrect class in feature space, hence exacerbating SSL's renowned confirmation bias. To address this issue, we introduced a novel SSL approach, HyperMatch, which is a plug-in to several SSL designs enabling noise-tolerant utilization of unlabeled data. In particular, confidence predictions are combined with semantic similarities to generate a more objective class distribution, followed by a Gaussian Mixture Model to divide pseudo labels into a 'confident' and a 'less confident' subset. Then, we introduce Relaxed Contrastive Loss by assigning the 'less-confident' samples to a hyper-class, i.e. the union of top-K nearest classes, which effectively regularizes the interference of incorrect pseudo labels and even increases the probability of pulling a 'less confident' sample close to its true class. Experiments and in-depth studies demonstrate that HyperMatch delivers remarkable state-of-the-art performance, outperforming FixMatch on CIFAR100 with 400 and 2500 labeled samples by 11.86% and 4.88%, respectively. + + + + LEGO-Net: Learning Regular Rearrangements of Objects in Rooms + http://openaccess.thecvf.com//content/CVPR2023/papers/Wei_LEGO-Net_Learning_Regular_Rearrangements_of_Objects_in_Rooms_CVPR_2023_paper.pdf + Humans universally dislike the task of cleaning up a messy room. If machines were to help us with this task, they must understand human criteria for regular arrangements, such as several types of symmetry, co-linearity or co-circularity, spacing uniformity in linear or circular patterns, and further inter-object relationships that relate to style and functionality. 
Previous approaches for this task relied on human input to explicitly specify goal state, or synthesized scenes from scratch--but such methods do not address the rearrangement of existing messy scenes without providing a goal state. In this paper, we present LEGO-Net, a data-driven transformer-based iterative method for LEarning reGular rearrangement of Objects in messy rooms. LEGO-Net is partly inspired by diffusion models--it starts with an initial messy state and iteratively "de-noises" the position and orientation of objects to a regular state while reducing distance traveled. Given randomly perturbed object positions and orientations in an existing dataset of professionally-arranged scenes, our method is trained to recover a regular re-arrangement. Results demonstrate that our method is able to reliably rearrange room scenes and outperform other methods. We additionally propose a metric for evaluating regularity in room arrangements using number-theoretic machinery. + + + + FastInst: A Simple Query-Based Model for Real-Time Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/He_FastInst_A_Simple_Query-Based_Model_for_Real-Time_Instance_Segmentation_CVPR_2023_paper.pdf + Recent attention in instance segmentation has focused on query-based models. Despite being non-maximum suppression (NMS)-free and end-to-end, the superiority of these models on high-accuracy real-time benchmarks has not been well demonstrated. In this paper, we show the strong potential of query-based models on efficient instance segmentation algorithm designs. We present FastInst, a simple, effective query-based framework for real-time instance segmentation. FastInst can execute at a real-time speed (i.e., 32.5 FPS) while yielding an AP of more than 40 (i.e., 40.5 AP) on COCO test-dev without bells and whistles. Specifically, FastInst follows the meta-architecture of recently introduced Mask2Former. Its key designs include instance activation-guided queries, dual-path update strategy, and ground truth mask-guided learning, which enable us to use lighter pixel decoders, fewer Transformer decoder layers, while achieving better performance. The experiments show that FastInst outperforms most state-of-the-art real-time counterparts, including strong fully convolutional baselines, in both speed and accuracy. Code can be found at https://github.com/junjiehe96/FastInst. + + + + Self-Supervised Representation Learning for CAD + http://openaccess.thecvf.com//content/CVPR2023/papers/Jones_Self-Supervised_Representation_Learning_for_CAD_CVPR_2023_paper.pdf + Virtually every object in the modern world was created, modified, analyzed and optimized using computer aided design (CAD) tools. An active CAD research area is the use of data-driven machine learning methods to learn from the massive repositories of geometric and program representations. However, the lack of labeled data in CAD's native format, i.e., the parametric boundary representation (B-Rep), poses an obstacle at present difficult to overcome. Several datasets of mechanical parts in B-Rep format have recently been released for machine learning research. However, large-scale databases are mostly unlabeled, and labeled datasets are small. Additionally, task-specific label sets are rare and costly to annotate. This work proposes to leverage unlabeled CAD geometry on supervised learning tasks. We learn a novel, hybrid implicit/explicit surface representation for B-Rep geometry. 
Further, we show that this pre-training both significantly improves few-shot learning performance and achieves state-of-the-art performance on several current B-Rep benchmarks. + + + + DETRs With Hybrid Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Jia_DETRs_With_Hybrid_Matching_CVPR_2023_paper.pdf + One-to-one set matching is a key design for DETR to establish its end-to-end capability, so that object detection does not require a hand-crafted NMS (non-maximum suppression) to remove duplicate detections. This end-to-end signature is important for the versatility of DETR, and it has been generalized to broader vision tasks. However, we note that there are few queries assigned as positive samples and the one-to-one set matching significantly reduces the training efficacy of positive samples. We propose a simple yet effective method based on a hybrid matching scheme that combines the original one-to-one matching branch with an auxiliary one-to-many matching branch during training. Our hybrid strategy has been shown to significantly improve accuracy. In inference, only the original one-to-one match branch is used, thus maintaining the end-to-end merit and the same inference efficiency of DETR. The method is named H-DETR, and it shows that a wide range of representative DETR methods can be consistently improved across a wide range of visual tasks, including Deformable-DETR, PETRv2, PETR, and TransTrack, among others. + + + + Angelic Patches for Improving Third-Party Object Detector Performance + http://openaccess.thecvf.com//content/CVPR2023/papers/Si_Angelic_Patches_for_Improving_Third-Party_Object_Detector_Performance_CVPR_2023_paper.pdf + Deep learning models have shown extreme vulnerability to simple perturbations and spatial transformations. In this work, we explore whether we can adopt the characteristics of adversarial attack methods to help improve perturbation robustness for object detection. We study a class of realistic object detection settings wherein the target objects have control over their appearance. To this end, we propose a reversed Fast Gradient Sign Method (FGSM) to obtain these angelic patches that significantly increase the detection probability, even without pre-knowledge of the perturbations. In detail, we apply the patch to each object instance simultaneously, strengthen not only classification but also bounding box accuracy. Experiments demonstrate the efficacy of the partial-covering patch in solving the complex bounding box problem. More importantly, the performance is also transferable to different detection models even under severe affine transformations and deformable shapes. To our knowledge, we are the first (object detection) patch that achieves both cross-model and multiple-patch efficacy. We observed average accuracy improvements of 30% in the real-world experiments, which brings large social value. Our code is available at: https://github.com/averysi224/angelic_patches. + + + + Mask-Free OVIS: Open-Vocabulary Instance Segmentation Without Manual Mask Annotations + http://openaccess.thecvf.com//content/CVPR2023/papers/VS_Mask-Free_OVIS_Open-Vocabulary_Instance_Segmentation_Without_Manual_Mask_Annotations_CVPR_2023_paper.pdf + Existing instance segmentation models learn task-specific information using manual mask annotations from base (training) categories. These mask annotations require tremendous human effort, limiting the scalability to annotate novel (new) categories. 
To alleviate this problem, Open-Vocabulary (OV) methods leverage large-scale image-caption pairs and vision-language models to learn novel categories. In summary, an OV method learns task-specific information using strong supervision from base annotations and novel category information using weak supervision from image-captions pairs. This difference between strong and weak supervision leads to overfitting on base categories, resulting in poor generalization towards novel categories. In this work, we overcome this issue by learning both base and novel categories from pseudo-mask annotations generated by the vision-language model in a weakly supervised manner using our proposed Mask-free OVIS pipeline. Our method automatically generates pseudo-mask annotations by leveraging the localization ability of a pre-trained vision-language model for objects present in image-caption pairs. The generated pseudo-mask annotations are then used to supervise an instance segmentation model, freeing the entire pipeline from any labour-expensive instance-level annotations and overfitting. Our extensive experiments show that our method trained with just pseudo-masks significantly improves the mAP scores on the MS-COCO dataset and OpenImages dataset compared to the recent state-of-the-art methods trained with manual masks. Codes and models are provided in https://vibashan.github.io/ovis-web/. + + + + Complete-to-Partial 4D Distillation for Self-Supervised Point Cloud Sequence Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Complete-to-Partial_4D_Distillation_for_Self-Supervised_Point_Cloud_Sequence_Representation_Learning_CVPR_2023_paper.pdf + Recent work on 4D point cloud sequences has attracted a lot of attention. However, obtaining exhaustively labeled 4D datasets is often very expensive and laborious, so it is especially important to investigate how to utilize raw unlabeled data. However, most existing self-supervised point cloud representation learning methods only consider geometry from a static snapshot omitting the fact that sequential observations of dynamic scenes could reveal more comprehensive geometric details. To overcome such issues, this paper proposes a new 4D self-supervised pre-training method called Complete-to-Partial 4D Distillation. Our key idea is to formulate 4D self-supervised representation learning as a teacher-student knowledge distillation framework and let the student learn useful 4D representations with the guidance of the teacher. Experiments show that this approach significantly outperforms previous pre-training approaches on a wide range of 4D point cloud sequence understanding tasks. Code is available at: https://github.com/dongyh20/C2P. + + + + Multi-Modal Gait Recognition via Effective Spatial-Temporal Feature Fusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Cui_Multi-Modal_Gait_Recognition_via_Effective_Spatial-Temporal_Feature_Fusion_CVPR_2023_paper.pdf + Gait recognition is a biometric technology that identifies people by their walking patterns. The silhouettes-based method and the skeletons-based method are the two most popular approaches. However, the silhouette data are easily affected by clothing occlusion, and the skeleton data lack body shape information. 
To obtain a more robust and comprehensive gait representation for recognition, we propose a transformer-based gait recognition framework called MMGaitFormer, which effectively fuses and aggregates the spatial-temporal information from the skeletons and silhouettes. Specifically, a Spatial Fusion Module (SFM) and a Temporal Fusion Module (TFM) are proposed for effective spatial-level and temporal-level feature fusion, respectively. The SFM performs fine-grained body parts spatial fusion and guides the alignment of each part of the silhouette and each joint of the skeleton through the attention mechanism. The TFM performs temporal modeling through Cycle Position Embedding (CPE) and fuses temporal information of two modalities. Experiments demonstrate that our MMGaitFormer achieves state-of-the-art performance on popular gait datasets. For the most challenging "CL" (i.e., walking in different clothes) condition in CASIA-B, our method achieves a rank-1 accuracy of 94.8%, which outperforms the state-of-the-art single-modal methods by a large margin. + + + + Hierarchical Discriminative Learning Improves Visual Representations of Biomedical Microscopy + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Hierarchical_Discriminative_Learning_Improves_Visual_Representations_of_Biomedical_Microscopy_CVPR_2023_paper.pdf + Learning high-quality, self-supervised, visual representations is essential to advance the role of computer vision in biomedical microscopy and clinical medicine. Previous work has focused on self-supervised representation learning (SSL) methods developed for instance discrimination and applied them directly to image patches, or fields-of-view, sampled from gigapixel whole-slide images (WSIs) used for cancer diagnosis. However, this strategy is limited because it (1) assumes patches from the same patient are independent, (2) neglects the patient-slide-patch hierarchy of clinical biomedical microscopy, and (3) requires strong data augmentations that can degrade downstream performance. Importantly, sampled patches from WSIs of a patient's tumor are a diverse set of image examples that capture the same underlying cancer diagnosis. This motivated HiDisc, a data-driven method that leverages the inherent patient-slide-patch hierarchy of clinical biomedical microscopy to define a hierarchical discriminative learning task that implicitly learns features of the underlying diagnosis. HiDisc uses a self-supervised contrastive learning framework in which positive patch pairs are defined based on a common ancestry in the data hierarchy, and a unified patch, slide, and patient discriminative learning objective is used for visual SSL. We benchmark HiDisc visual representations on two vision tasks using two biomedical microscopy datasets, and demonstrate that (1) HiDisc pretraining outperforms current state-of-the-art self-supervised pretraining methods for cancer diagnosis and genetic mutation prediction, and (2) HiDisc learns high-quality visual representations using natural patch diversity without strong data augmentations. + + + + ProD: Prompting-To-Disentangle Domain Knowledge for Cross-Domain Few-Shot Image Classification + http://openaccess.thecvf.com//content/CVPR2023/papers/Ma_ProD_Prompting-To-Disentangle_Domain_Knowledge_for_Cross-Domain_Few-Shot_Image_Classification_CVPR_2023_paper.pdf + This paper considers few-shot image classification under the cross-domain scenario, where the train-to-test domain gap compromises classification accuracy. 
To mitigate the domain gap, we propose a prompting-to-disentangle (ProD) method through a novel exploration with the prompting mechanism. ProD adopts the popular multi-domain training scheme and extracts the backbone feature with a standard Convolutional Neural Network. Based on these two common practices, the key point of ProD is using the prompting mechanism in the transformer to disentangle the domain-general (DG) and domain-specific (DS) knowledge from the backbone feature. Specifically, ProD concatenates a DG and a DS prompt to the backbone feature and feeds them into a lightweight transformer. The DG prompt is learnable and shared by all the training domains, while the DS prompt is generated from the domain-of-interest on the fly. As a result, the transformer outputs DG and DS features in parallel with the two prompts, yielding the disentangling effect. We show that: 1) Simply sharing a single DG prompt for all the training domains already improves generalization towards the novel test domain. 2) The cross-domain generalization can be further reinforced by making the DG prompt neutral towards the training domains. 3) When inference, the DS prompt is generated from the support samples and can capture test domain knowledge through the prompting mechanism. Combining all three benefits, ProD significantly improves cross-domain few-shot classification. For instance, on CUB, ProD improves the 5-way 5-shot accuracy from 73.56% (baseline) to 79.19%, setting a new state of the art. + + + + ImageNet-E: Benchmarking Neural Network Robustness via Attribute Editing + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_ImageNet-E_Benchmarking_Neural_Network_Robustness_via_Attribute_Editing_CVPR_2023_paper.pdf + Recent studies have shown that higher accuracy on ImageNet usually leads to better robustness against different corruptions. In this paper, instead of following the traditional research paradigm that investigates new out-of-distribution corruptions or perturbations deep models may encounter, we conduct model debugging in in-distribution data to explore which object attributes a model may be sensitive to. To achieve this goal, we create a toolkit for object editing with controls of backgrounds, sizes, positions, and directions, and create a rigorous benchmark named ImageNet-E(diting) for evaluating the image classifier robustness in terms of object attributes. With our ImageNet-E, we evaluate the performance of current deep learning models, including both convolutional neural networks and vision transformers. We find that most models are quite sensitive to attribute changes. An imperceptible change in the background can lead to an average of 9.23% drop on top-1 accuracy. We also evaluate some robust models including both adversarially trained models and other robust trained models and find that some models show worse robustness against attribute changes than vanilla models. Based on these findings, we discover ways to enhance attribute robustness with preprocessing, architecture designs, and training strategies. We hope this work can provide some insights to the community and open up a new avenue for research in robust computer vision. The code and dataset will be publicly available. 
+ + + + Learning With Fantasy: Semantic-Aware Virtual Contrastive Constraint for Few-Shot Class-Incremental Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_Learning_With_Fantasy_Semantic-Aware_Virtual_Contrastive_Constraint_for_Few-Shot_Class-Incremental_CVPR_2023_paper.pdf + Few-shot class-incremental learning (FSCIL) aims at learning to classify new classes continually from limited samples without forgetting the old classes. The mainstream framework tackling FSCIL is first to adopt the cross-entropy (CE) loss for training at the base session, then freeze the feature extractor to adapt to new classes. However, in this work, we find that the CE loss is not ideal for the base session training as it suffers poor class separation in terms of representations, which further degrades generalization to novel classes. One tempting method to mitigate this problem is to apply an additional naive supervised contrastive learning (SCL) in the base session. Unfortunately, we find that although SCL can create a slightly better representation separation among different base classes, it still struggles to separate base classes and new classes. Inspired by the observations made, we propose Semantic-Aware Virtual Contrastive model (SAVC), a novel method that facilitates separation between new classes and base classes by introducing virtual classes to SCL. These virtual classes, which are generated via pre-defined transformations, not only act as placeholders for unseen classes in the representation space but also provide diverse semantic information. By learning to recognize and contrast in the fantasy space fostered by virtual classes, our SAVC significantly boosts base class separation and novel class generalization, achieving new state-of-the-art performance on the three widely-used FSCIL benchmark datasets. Code is available at: https://github.com/zysong0113/SAVC. + + + + Cascaded Local Implicit Transformer for Arbitrary-Scale Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Cascaded_Local_Implicit_Transformer_for_Arbitrary-Scale_Super-Resolution_CVPR_2023_paper.pdf + Implicit neural representation demonstrates promising ability in representing images with arbitrary resolutions recently. In this paper, we present Local Implicit Transformer (LIT) that integrates attention mechanism and frequency encoding technique into local implicit image function. We design a cross-scale local attention block to effectively aggregate local features and a local frequency encoding block to combine positional encoding with Fourier domain information for constructing high-resolution (HR) images. To further improve representative power, we propose Cascaded LIT (CLIT) exploiting multi-scale features along with cumulative training strategy that gradually increase the upsampling factors for training. We have performed extensive experiments to validate the effectiveness of these components and analyze the variants of the training strategy. The qualitative and quantitative results demonstrated that LIT and CLIT achieve favorable results and outperform the previous works within arbitrary super-resolution tasks. 
+ + + + Network-Free, Unsupervised Semantic Segmentation With Synthetic Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_Network-Free_Unsupervised_Semantic_Segmentation_With_Synthetic_Images_CVPR_2023_paper.pdf + We derive a method that yields highly accurate semantic segmentation maps without the use of any additional neural network, layers, manually annotated training data, or supervised training. Our method is based on the observation that the correlation of a set of pixels belonging to the same semantic segment do not change when generating synthetic variants of an image using the style mixing approach in GANs. We show how we can use GAN inversion to accurately semantically segment synthetic and real photos as well as generate large training image-semantic segmentation mask pairs for downstream tasks. + + + + Hierarchical Dense Correlation Distillation for Few-Shot Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Peng_Hierarchical_Dense_Correlation_Distillation_for_Few-Shot_Segmentation_CVPR_2023_paper.pdf + Few-shot semantic segmentation (FSS) aims to form class-agnostic models segmenting unseen classes with only a handful of annotations. Previous methods limited to the semantic feature and prototype representation suffer from coarse segmentation granularity and train-set overfitting. In this work, we design Hierarchically Decoupled Matching Network (HDMNet) mining pixel-level support correlation based on the transformer architecture. The self-attention modules are used to assist in establishing hierarchical dense features, as a means to accomplish the cascade matching between query and support features. Moreover, we propose a matching module to reduce train-set overfitting and introduce correlation distillation leveraging semantic correspondence from coarse resolution to boost fine-grained segmentation. Our method performs decently in experiments. We achieve 50.0% mIoU on COCO-5i dataset one-shot setting and 56.0% on five-shot segmentation, respectively. The code is available on the project website. + + + + Hi4D: 4D Instance Segmentation of Close Human Interaction + http://openaccess.thecvf.com//content/CVPR2023/papers/Yin_Hi4D_4D_Instance_Segmentation_of_Close_Human_Interaction_CVPR_2023_paper.pdf + We propose Hi4D, a method and dataset for the auto analysis of physically close human-human interaction under prolonged contact. Robustly disentangling several in-contact subjects is a challenging task due to occlusions and complex shapes. Hence, existing multi-view systems typically fuse 3D surfaces of close subjects into a single, connected mesh. To address this issue we leverage i) individually fitted neural implicit avatars; ii) an alternating optimization scheme that refines pose and surface through periods of close proximity; and iii) thus segment the fused raw scans into individual instances. From these instances we compile Hi4D dataset of 4D textured scans of 20 subject pairs, 100 sequences, and a total of more than 11K frames. Hi4D contains rich interaction-centric annotations in 2D and 3D alongside accurately registered parametric body models. We define varied human pose and shape estimation tasks on this dataset and provide results from state-of-the-art methods on these benchmarks. 
+ + + + SQUID: Deep Feature In-Painting for Unsupervised Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Xiang_SQUID_Deep_Feature_In-Painting_for_Unsupervised_Anomaly_Detection_CVPR_2023_paper.pdf + Radiography imaging protocols focus on particular body regions, therefore producing images of great similarity and yielding recurrent anatomical structures across patients. To exploit this structured information, we propose the use of Space-aware Memory Queues for In-painting and Detecting anomalies from radiography images (abbreviated as SQUID). We show that SQUID can taxonomize the ingrained anatomical structures into recurrent patterns; and in the inference, it can identify anomalies (unseen/modified patterns) in the image. SQUID surpasses 13 state-of-the-art methods in unsupervised anomaly detection by at least 5 points on two chest X-ray benchmark datasets measured by the Area Under the Curve (AUC). Additionally, we have created a new dataset (DigitAnatomy), which synthesizes the spatial correlation and consistent shape in chest anatomy. We hope DigitAnatomy can prompt the development, evaluation, and interpretability of anomaly detection methods. + + + + On the Convergence of IRLS and Its Variants in Outlier-Robust Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Peng_On_the_Convergence_of_IRLS_and_Its_Variants_in_Outlier-Robust_CVPR_2023_paper.pdf + Outlier-robust estimation involves estimating some parameters (e.g., 3D rotations) from data samples in the presence of outliers, and is typically formulated as a non-convex and non-smooth problem. For this problem, the classical method called iteratively reweighted least-squares (IRLS) and its variants have shown impressive performance. This paper makes several contributions towards understanding why these algorithms work so well. First, we incorporate majorization and graduated non-convexity (GNC) into the IRLS framework and prove that the resulting IRLS variant is a convergent method for outlier-robust estimation. Moreover, in the robust regression context with a constant fraction of outliers, we prove this IRLS variant converges to the ground truth at a global linear and local quadratic rate for a random Gaussian feature matrix with high probability. Experiments corroborate our theory and show that the proposed IRLS variant converges within 5-10 iterations for typical problem instances of outlier-robust estimation, while state-of-the-art methods need at least 30 iterations. A basic implementation of our method is provided: https://github.com/liangzu/IRLS-CVPR2023 + + + + A New Comprehensive Benchmark for Semi-Supervised Video Anomaly Detection and Anticipation + http://openaccess.thecvf.com//content/CVPR2023/papers/Cao_A_New_Comprehensive_Benchmark_for_Semi-Supervised_Video_Anomaly_Detection_and_CVPR_2023_paper.pdf + Semi-supervised video anomaly detection (VAD) is a critical task in the intelligent surveillance system. However, an essential type of anomaly in VAD named scene-dependent anomaly has not received the attention of researchers. Moreover, there is no research investigating anomaly anticipation, a more significant task for preventing the occurrence of anomalous events. To this end, we propose a new comprehensive dataset, NWPU Campus, containing 43 scenes, 28 classes of abnormal events, and 16 hours of videos. 
At present, it is the largest semi-supervised VAD dataset with the largest number of scenes and classes of anomalies, the longest duration, and the only one considering the scene-dependent anomaly. Meanwhile, it is also the first dataset proposed for video anomaly anticipation. We further propose a novel model capable of detecting and anticipating anomalous events simultaneously. Compared with 7 outstanding VAD algorithms in recent years, our method can cope with scene-dependent anomaly detection and anomaly anticipation both well, achieving state-of-the-art performance on ShanghaiTech, CUHK Avenue, IITB Corridor and the newly proposed NWPU Campus datasets consistently. Our dataset and code are available at: https://campusvad.github.io. + + + + HumanBench: Towards General Human-Centric Perception With Projector Assisted Pretraining + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_HumanBench_Towards_General_Human-Centric_Perception_With_Projector_Assisted_Pretraining_CVPR_2023_paper.pdf + Human-centric perceptions include a variety of vision tasks, which have widespread industrial applications, including surveillance, autonomous driving, and the metaverse. It is desirable to have a general pretrained model for versatile human-centric downstream tasks. This paper forges ahead along this path from the aspects of both benchmark and pretraining methods. Specifically, we propose HumanBench, based on existing datasets, to comprehensively evaluate on common ground the generalization abilities of different pretraining methods on 19 datasets from 6 diverse downstream tasks, including person ReID, pose estimation, human parsing, pedestrian attribute recognition, pedestrian detection, and crowd counting. To learn both coarse-grained and fine-grained knowledge in human bodies, we further propose a Projector AssisTed Hierarchical pretraining method (PATH) to learn diverse knowledge at different granularity levels. Comprehensive evaluations on HumanBench show that our PATH achieves new state-of-the-art results on 17 downstream datasets and on-par results on the other 2 datasets. The code will be publicly available at https://github.com/OpenGVLab/HumanBench. + + + + Deep Graph Reprogramming + http://openaccess.thecvf.com//content/CVPR2023/papers/Jing_Deep_Graph_Reprogramming_CVPR_2023_paper.pdf + In this paper, we explore a novel model reusing task tailored for graph neural networks (GNNs), termed as "deep graph reprogramming". We strive to reprogram a pre-trained GNN, without amending raw node features or model parameters, to handle a bunch of cross-level downstream tasks in various domains. To this end, we propose an innovative Data Reprogramming paradigm alongside a Model Reprogramming paradigm. The former aims to address the challenge of diversified graph feature dimensions for various tasks on the input side, while the latter alleviates the dilemma of fixed per-task-per-model behavior on the model side. For data reprogramming, we specifically devise an elaborated Meta-FeatPadding method to deal with heterogeneous input dimensions, and also develop a transductive Edge-Slimming as well as an inductive Meta-GraPadding approach for diverse homogeneous samples. Meanwhile, for model reprogramming, we propose a novel task-adaptive Reprogrammable-Aggregator, to endow the frozen model with larger expressive capacities in handling cross-domain tasks. 
Experiments on fourteen datasets across node/graph classification/regression, 3D object recognition, and distributed action recognition, demonstrate that the proposed methods yield gratifying results, on par with those by re-training from scratch. + + + + Compacting Binary Neural Networks by Sparse Kernel Selection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Compacting_Binary_Neural_Networks_by_Sparse_Kernel_Selection_CVPR_2023_paper.pdf + Binary Neural Network (BNN) represents convolution weights with 1-bit values, which enhances the efficiency of storage and computation. This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed: their values are mostly clustered into a small number of codewords. This phenomenon encourages us to compact typical BNNs and obtain further close performance through learning non-repetitive kernels within a binary kernel subspace. Specifically, we regard the binarization process as kernel grouping in terms of a binary codebook, and our task lies in learning to select a smaller subset of codewords from the full codebook. We then leverage the Gumbel-Sinkhorn technique to approximate the codeword selection process, and develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords. Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets. + + + + Unified Mask Embedding and Correspondence Learning for Self-Supervised Video Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Unified_Mask_Embedding_and_Correspondence_Learning_for_Self-Supervised_Video_Segmentation_CVPR_2023_paper.pdf + The objective of this paper is self-supervised learning of video object segmentation. We develop a unified framework which simultaneously models cross-frame dense correspondence for locally discriminative feature learning and embeds object-level context for target-mask decoding. As a result, it is able to directly learn to perform mask-guided sequential segmentation from unlabeled videos, in contrast to previous efforts usually relying on an oblique solution --- cheaply "copying" labels according to pixel-wise correlations. Concretely, our algorithm alternates between i) clustering video pixels for creating pseudo segmentation labels ex nihilo; and ii) utilizing the pseudo labels to learn mask encoding and decoding for VOS. Unsupervised correspondence learning is further incorporated into this self-taught, mask embedding scheme, so as to ensure the generic nature of the learnt representation and avoid cluster degeneracy. Our algorithm sets state-of-the-arts on two standard benchmarks (i.e., DAVIS17 and YouTube-VOS), narrowing the gap between self- and fully-supervised VOS, in terms of both performance and network architecture design. Our full code will be released. 
+ + + + Seeing Beyond the Brain: Conditional Diffusion Model With Sparse Masked Modeling for Vision Decoding + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Seeing_Beyond_the_Brain_Conditional_Diffusion_Model_With_Sparse_Masked_CVPR_2023_paper.pdf + Decoding visual stimuli from brain recordings aims to deepen our understanding of the human visual system and build a solid foundation for bridging human and computer vision through the Brain-Computer Interface. However, reconstructing high-quality images with correct semantics from brain recordings is a challenging problem due to the complex underlying representations of brain signals and the scarcity of data annotations. In this work, we present MinD-Vis: Sparse Masked Brain Modeling with Double-Conditioned Latent Diffusion Model for Human Vision Decoding. Firstly, we learn an effective self-supervised representation of fMRI data using mask modeling in a large latent space inspired by the sparse coding of information in the primary visual cortex. Then by augmenting a latent diffusion model with double-conditioning, we show that MinD-Vis can reconstruct highly plausible images with semantically matching details from brain recordings using very few paired annotations. We benchmarked our model qualitatively and quantitatively; the experimental results indicate that our method outperformed state-of-the-art in both semantic mapping (100-way semantic classification) and generation quality (FID) by 66% and 41% respectively. An exhaustive ablation study was also conducted to analyze our framework. + + + + PointAvatar: Deformable Point-Based Head Avatars From Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_PointAvatar_Deformable_Point-Based_Head_Avatars_From_Videos_CVPR_2023_paper.pdf + The ability to create realistic animatable and relightable head avatars from casual video sequences would open up wide ranging applications in communication and entertainment. Current methods either build on explicit 3D morphable meshes (3DMM) or exploit neural implicit representations. The former are limited by fixed topology, while the latter are non-trivial to deform and inefficient to render. Furthermore, existing approaches entangle lighting and albedo, limiting the ability to re-render the avatar in new environments. In contrast, we propose PointAvatar, a deformable point-based representation that disentangles the source color into intrinsic albedo and normal-dependent shading. We demonstrate that PointAvatar bridges the gap between existing mesh- and implicit representations, combining high-quality geometry and appearance with topological flexibility, ease of deformation and rendering efficiency. We show that our method is able to generate animatable 3D avatars using monocular videos from multiple sources including hand-held smartphones, laptop webcams and internet videos, achieving state-of-the-art quality in challenging cases where previous methods fail, e.g., thin hair strands, while being significantly more efficient in training than competing methods. + + + + OrienterNet: Visual Localization in 2D Public Maps With Neural Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Sarlin_OrienterNet_Visual_Localization_in_2D_Public_Maps_With_Neural_Matching_CVPR_2023_paper.pdf + Humans can orient themselves in their 3D environments using simple 2D maps. Differently, algorithms for visual localization mostly rely on complex 3D point clouds that are expensive to build, store, and maintain over time. 
We bridge this gap by introducing OrienterNet, the first deep neural network that can localize an image with sub-meter accuracy using the same 2D semantic maps that humans use. OrienterNet estimates the location and orientation of a query image by matching a neural Bird's-Eye View with open and globally available maps from OpenStreetMap, enabling anyone to localize anywhere such maps are available. OrienterNet is supervised only by camera poses but learns to perform semantic matching with a wide range of map elements in an end-to-end manner. To enable this, we introduce a large crowd-sourced dataset of images captured across 12 cities from the diverse viewpoints of cars, bikes, and pedestrians. OrienterNet generalizes to new datasets and pushes the state of the art in both robotics and AR scenarios. The code is available at https://github.com/facebookresearch/OrienterNet + + + + PMatch: Paired Masked Image Modeling for Dense Geometric Matching + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_PMatch_Paired_Masked_Image_Modeling_for_Dense_Geometric_Matching_CVPR_2023_paper.pdf + Dense geometric matching determines the dense pixel-wise correspondence between a source and support image corresponding to the same 3D structure. Prior works employ an encoder of transformer blocks to correlate the two-frame features. However, existing monocular pretraining tasks, e.g., image classification, and masked image modeling (MIM), can not pretrain the cross-frame module, yielding less optimal performance. To resolve this, we reformulate the MIM from reconstructing a single masked image to reconstructing a pair of masked images, enabling the pretraining of transformer module. Additionally, we incorporate a decoder into pretraining for improved upsampling results. Further, to be robust to the textureless area, we propose a novel cross-frame global matching module (CFGM). Since the most textureless area is planar surfaces, we propose a homography loss to further regularize its learning. Combined together, we achieve the State-of-The-Art (SoTA) performance on geometric matching. Codes and models are available at https://github.com/ShngJZ/PMatch. + + + + Masked and Adaptive Transformer for Exemplar Based Image Translation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_Masked_and_Adaptive_Transformer_for_Exemplar_Based_Image_Translation_CVPR_2023_paper.pdf + We present a novel framework for exemplar based image translation. Recent advanced methods for this task mainly focus on establishing cross-domain semantic correspondence, which sequentially dominates image generation in the manner of local style control. Unfortunately, cross domain semantic matching is challenging; and matching errors ultimately degrade the quality of generated images. To overcome this challenge, we improve the accuracy of matching on the one hand, and diminish the role of matching in image generation on the other hand. To achieve the former, we propose a masked and adaptive transformer (MAT) for learning accurate cross-domain correspondence, and executing context-aware feature augmentation. To achieve the latter, we use source features of the input and global style codes of the exemplar, as supplementary information, for decoding an image. Besides, we devise a novel contrastive style learning method, for acquire quality-discriminative style representations, which in turn benefit high-quality image generation. 
Experimental results show that our method, dubbed MATEBIT, performs considerably better than state-of-the-art methods, in diverse image translation tasks. + + + + You Are Catching My Attention: Are Vision Transformers Bad Learners Under Backdoor Attacks? + http://openaccess.thecvf.com//content/CVPR2023/papers/Yuan_You_Are_Catching_My_Attention_Are_Vision_Transformers_Bad_Learners_CVPR_2023_paper.pdf + Vision Transformers (ViTs), which made a splash in the field of computer vision (CV), have shaken the dominance of convolutional neural networks (CNNs). However, in the process of industrializing ViTs, backdoor attacks have brought severe challenges to security. The success of ViTs benefits from the self-attention mechanism. However, compared with CNNs, we find that this mechanism of capturing global information within patches makes ViTs more sensitive to patch-wise triggers. Under such observations, we delicately design a novel backdoor attack framework for ViTs, dubbed BadViT, which utilizes a universal patch-wise trigger to catch the model's attention from patches beneficial for classification to those with triggers, thereby manipulating the mechanism on which ViTs survive to confuse itself. Furthermore, we propose invisible variants of BadViT to increase the stealth of the attack by limiting the strength of the trigger perturbation. Through a large number of experiments, it is proved that BadViT is an efficient backdoor attack method against ViTs, which is less dependent on the number of poisons, with satisfactory convergence, and is transferable for downstream tasks. Furthermore, the risks inside of ViTs to backdoor attacks are also explored from the perspective of existing advanced defense schemes. + + + + Contrastive Grouping With Transformer for Referring Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Contrastive_Grouping_With_Transformer_for_Referring_Image_Segmentation_CVPR_2023_paper.pdf + Referring image segmentation aims to segment the target referent in an image conditioning on a natural language expression. Existing one-stage methods employ per-pixel classification frameworks, which attempt straightforwardly to align vision and language at the pixel level, thus failing to capture critical object-level information. In this paper, we propose a mask classification framework, Contrastive Grouping with Transformer network (CGFormer), which explicitly captures object-level information via token-based querying and grouping strategy. Specifically, CGFormer first introduces learnable query tokens to represent objects and then alternately queries linguistic features and groups visual features into the query tokens for object-aware cross-modal reasoning. In addition, CGFormer achieves cross-level interaction by jointly updating the query tokens and decoding masks in every two consecutive layers. Finally, CGFormer cooperates contrastive learning to the grouping strategy to identify the token and its mask corresponding to the referent. Experimental results demonstrate that CGFormer outperforms state-of-the-art methods in both segmentation and generalization settings consistently and significantly. Code is available at https://github.com/Toneyaya/CGFormer. 
+ + + + PaCa-ViT: Learning Patch-to-Cluster Attention in Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Grainger_PaCa-ViT_Learning_Patch-to-Cluster_Attention_in_Vision_Transformers_CVPR_2023_paper.pdf + Vision Transformers (ViTs) are built on the assumption of treating image patches as "visual tokens" and learn patch-to-patch attention. The patch embedding based tokenizer has a semantic gap with respect to its counterpart, the textual tokenizer. The patch-to-patch attention suffers from the quadratic complexity issue, and also makes it non-trivial to explain learned ViTs. To address these issues in ViT, this paper proposes to learn Patch-to-Cluster attention (PaCa) in ViT. Queries in our PaCa-ViT starts with patches, while keys and values are directly based on clustering (with a predefined small number of clusters). The clusters are learned end-to-end, leading to better tokenizers and inducing joint clustering-for-attention and attention-for-clustering for better and interpretable models. The quadratic complexity is relaxed to linear complexity. The proposed PaCa module is used in designing efficient and interpretable ViT backbones and semantic segmentation head networks. In experiments, the proposed methods are tested on ImageNet-1k image classification, MS-COCO object detection and instance segmentation and MIT-ADE20k semantic segmentation. Compared with the prior art, it obtains better performance in all the three benchmarks than the SWin and the PVTs by significant margins in ImageNet-1k and MIT-ADE20k. It is also significantly more efficient than PVT models in MS-COCO and MIT-ADE20k due to the linear complexity. The learned clusters are semantically meaningful. Code and model checkpoints are available at https://github.com/iVMCL/PaCaViT. + + + + Pix2map: Cross-Modal Retrieval for Inferring Street Maps From Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Pix2map_Cross-Modal_Retrieval_for_Inferring_Street_Maps_From_Images_CVPR_2023_paper.pdf + Self-driving vehicles rely on urban street maps for autonomous navigation. In this paper, we introduce Pix2Map, a method for inferring urban street map topology directly from ego-view images, as needed to continually update and expand existing maps. This is a challenging task, as we need to infer a complex urban road topology directly from raw image data. The main insight of this paper is that this problem can be posed as cross-modal retrieval by learning a joint, cross-modal embedding space for images and existing maps, represented as discrete graphs that encode the topological layout of the visual surroundings. We conduct our experimental evaluation using the Argoverse dataset and show that it is indeed possible to accurately retrieve street maps corresponding to both seen and unseen roads solely from image data. Moreover, we show that our retrieved maps can be used to update or expand existing maps and even show proof-of-concept results for visual localization and image retrieval from spatial graphs. + + + + Unsupervised Inference of Signed Distance Functions From Single Sparse Point Clouds Without Learning Priors + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Unsupervised_Inference_of_Signed_Distance_Functions_From_Single_Sparse_Point_CVPR_2023_paper.pdf + It is vital to infer signed distance functions (SDFs) from 3D point clouds. The latest methods rely on generalizing the priors learned from large scale supervision. 
However, the learned priors do not generalize well to various geometric variations that are unseen during training, especially for extremely sparse point clouds. To resolve this issue, we present a neural network to directly infer SDFs from single sparse point clouds without using signed distance supervision, learned priors or even normals. Our insight here is to learn surface parameterization and SDF inference in an end-to-end manner. To make up for the sparsity, we leverage parameterized surfaces as a coarse surface sampler to provide many coarse surface estimations in training iterations, according to which we mine supervision and our thin plate splines (TPS) based network infers SDFs as smooth functions in a statistical way. Our method significantly improves the generalization ability and accuracy on unseen point clouds. Our experimental results show our advantages over the state-of-the-art methods in surface reconstruction for sparse point clouds on synthetic datasets and real scans. The code is available at https://github.com/chenchao15/NeuralTPS. + + + + Deep Discriminative Spatial and Temporal Network for Efficient Video Deblurring + http://openaccess.thecvf.com//content/CVPR2023/papers/Pan_Deep_Discriminative_Spatial_and_Temporal_Network_for_Efficient_Video_Deblurring_CVPR_2023_paper.pdf + How to effectively explore spatial and temporal information is important for video deblurring. In contrast to existing methods that directly align adjacent frames without discrimination, we develop a deep discriminative spatial and temporal network to facilitate spatial and temporal feature exploration for better video deblurring. We first develop a channel-wise gated dynamic network to adaptively explore the spatial information. As adjacent frames usually contain different contents, directly stacking features of adjacent frames without discrimination may affect the latent clear frame restoration. Therefore, we develop a simple yet effective discriminative temporal feature fusion module to obtain useful temporal features for latent frame restoration. Moreover, to utilize the information from long-range frames, we develop a wavelet-based feature propagation method that takes the discriminative temporal feature fusion module as the basic unit to effectively propagate main structures from long-range frames for better video deblurring. We show that the proposed method does not require additional alignment methods and performs favorably against state-of-the-art ones on benchmark datasets in terms of accuracy and model complexity. + + + + Prototype-Based Embedding Network for Scene Graph Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_Prototype-Based_Embedding_Network_for_Scene_Graph_Generation_CVPR_2023_paper.pdf + Current Scene Graph Generation (SGG) methods explore contextual information to predict relationships among entity pairs. However, due to the diverse visual appearance of numerous possible subject-object combinations, there is a large intra-class variation within each predicate category, e.g., "man-eating-pizza, giraffe-eating-leaf", and severe inter-class similarity between different classes, e.g., "man-holding-plate, man-eating-pizza", in the model's latent space. The above challenges prevent current SGG methods from acquiring robust features for reliable relation prediction. 
In this paper, we claim that the predicate's category-inherent semantics can serve as class-wise prototypes in the semantic space for relieving the above challenges caused by the diverse visual appearances. To this end, we propose the Prototype-based Embedding Network (PE-Net), which models entities/predicates with prototype-aligned compact and distinctive representations and establishes matching between entity pairs and predicates in a common embedding space for relation recognition. Moreover, Prototype-guided Learning (PL) is introduced to help PE-Net efficiently learn such entity-predicate matching, and Prototype Regularization (PR) is devised to relieve the ambiguous entity-predicate matching caused by the predicate's semantic overlap. Extensive experiments demonstrate that our method gains superior relation recognition capability on SGG, achieving new state-of-the-art performance on both the Visual Genome and Open Images datasets. + + + + Efficient Movie Scene Detection Using State-Space Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Islam_Efficient_Movie_Scene_Detection_Using_State-Space_Transformers_CVPR_2023_paper.pdf + The ability to distinguish between different movie scenes is critical for understanding the storyline of a movie. However, accurately detecting movie scenes is often challenging as it requires the ability to reason over very long movie segments. This is in contrast to most existing video recognition models, which are typically designed for short-range video analysis. This work proposes a State-Space Transformer model that can efficiently capture dependencies in long movie videos for accurate movie scene detection. Our model, dubbed TranS4mer, is built using a novel S4A building block, which combines the strengths of structured state-space sequence (S4) and self-attention (A) layers. Given a sequence of frames divided into movie shots (uninterrupted periods where the camera position does not change), the S4A block first applies self-attention to capture short-range intra-shot dependencies. Afterward, the state-space operation in the S4A block is used to aggregate long-range inter-shot cues. The final TranS4mer model, which can be trained end-to-end, is obtained by stacking the S4A blocks one after the other multiple times. Our proposed TranS4mer outperforms all prior methods on three movie scene detection datasets, including MovieNet, BBC, and OVSD, while also being 2x faster and requiring 3x less GPU memory than standard Transformer models. We will release our code and models. + + + + Efficient Semantic Segmentation by Altering Resolutions for Compressed Videos + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Efficient_Semantic_Segmentation_by_Altering_Resolutions_for_Compressed_Videos_CVPR_2023_paper.pdf + Video semantic segmentation (VSS) is a computationally expensive task due to the per-frame prediction for videos of high frame rates. In recent work, compact models or adaptive network strategies have been proposed for efficient VSS. However, they did not consider a crucial factor that affects the computational cost from the input side: the input resolution. In this paper, we propose an altering resolution framework called AR-Seg for compressed videos to achieve efficient VSS. AR-Seg aims to reduce the computational cost by using low resolution for non-keyframes. 
To prevent the performance degradation caused by downsampling, we design a Cross Resolution Feature Fusion (CReFF) module, and supervise it with a novel Feature Similarity Training (FST) strategy. Specifically, CReFF first makes use of motion vectors stored in a compressed video to warp features from high-resolution keyframes to low-resolution non-keyframes for better spatial alignment, and then selectively aggregates the warped features with local attention mechanism. Furthermore, the proposed FST supervises the aggregated features with high-resolution features through an explicit similarity loss and an implicit constraint from the shared decoding layer. Extensive experiments on CamVid and Cityscapes show that AR-Seg achieves state-of-the-art performance and is compatible with different segmentation backbones. On CamVid, AR-Seg saves 67% computational cost (measured in GFLOPs) with the PSPNet18 backbone while maintaining high segmentation accuracy. Code: https://github.com/THU-LYJ-Lab/AR-Seg. + + + + Discriminating Known From Unknown Objects via Structure-Enhanced Recurrent Variational AutoEncoder + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Discriminating_Known_From_Unknown_Objects_via_Structure-Enhanced_Recurrent_Variational_AutoEncoder_CVPR_2023_paper.pdf + Discriminating known from unknown objects is an important essential ability for human beings. To simulate this ability, a task of unsupervised out-of-distribution object detection (OOD-OD) is proposed to detect the objects that are never-seen-before during model training, which is beneficial for promoting the safe deployment of object detectors. Due to lacking unknown data for supervision, for this task, the main challenge lies in how to leverage the known in-distribution (ID) data to improve the detector's discrimination ability. In this paper, we first propose a method of Structure-Enhanced Recurrent Variational AutoEncoder (SR-VAE), which mainly consists of two dedicated recurrent VAE branches. Specifically, to boost the performance of object localization, we explore utilizing the classical Laplacian of Gaussian (LoG) operator to enhance the structure information in the extracted low-level features. Meanwhile, we design a VAE branch that recurrently generates the augmentation of the classification features to strengthen the discrimination ability of the object classifier. Finally, to alleviate the impact of lacking unknown data, another cycle-consistent conditional VAE branch is proposed to synthesize virtual OOD features that deviate from the distribution of ID features, which improves the capability of distinguishing OOD objects. In the experiments, our method is evaluated on OOD-OD, open-vocabulary detection, and incremental object detection. The significant performance gains over baselines show the superiorities of our method. The code will be released at https://github.com/AmingWu/SR-VAE. + + + + Occlusion-Free Scene Recovery via Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Occlusion-Free_Scene_Recovery_via_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Our everyday lives are filled with occlusions that we strive to see through. By aggregating desired background information from different viewpoints, we can easily eliminate such occlusions without any external occlusion-free supervision. 
Though several occlusion removal methods have been proposed to empower machine vision systems with such ability, their performances are still unsatisfactory due to reliance on external supervision. We propose a novel method for occlusion removal by directly building a mapping between position and viewing angles and the corresponding occlusion-free scene details leveraging Neural Radiance Fields (NeRF). We also develop an effective scheme to jointly optimize camera parameters and scene reconstruction when occlusions are present. An additional depth constraint is applied to supervise the entire optimization without labeled external data for training. The experimental results on existing and newly collected datasets validate the effectiveness of our method. + + + + Semi-Supervised Domain Adaptation With Source Label Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Semi-Supervised_Domain_Adaptation_With_Source_Label_Adaptation_CVPR_2023_paper.pdf + Semi-Supervised Domain Adaptation (SSDA) involves learning to classify unseen target data with a few labeled and lots of unlabeled target data, along with many labeled source data from a related domain. Current SSDA approaches usually aim at aligning the target data to the labeled source data with feature space mapping and pseudo-label assignments. Nevertheless, such a source-oriented model can sometimes align the target data to source data of the wrong classes, degrading the classification performance. This paper presents a novel source-adaptive paradigm that adapts the source data to match the target data. Our key idea is to view the source data as a noisily-labeled version of the ideal target data. Then, we propose an SSDA model that cleans up the label noise dynamically with the help of a robust cleaner component designed from the target perspective. Since the paradigm is very different from the core ideas behind existing SSDA approaches, our proposed model can be easily coupled with them to improve their performance. Empirical results on two state-of-the-art SSDA approaches demonstrate that the proposed model effectively cleans up the noise within the source labels and exhibits superior performance over those approaches across benchmark datasets. Our code is available at https://github.com/chu0802/SLA. + + + + Range-Nullspace Video Frame Interpolation With Focalized Motion Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Range-Nullspace_Video_Frame_Interpolation_With_Focalized_Motion_Estimation_CVPR_2023_paper.pdf + Continuous-time video frame interpolation is a fundamental technique in computer vision for its flexibility in synthesizing motion trajectories and novel video frames at arbitrary intermediate time steps. Yet, how to infer accurate intermediate motion and synthesize high-quality video frames are two critical challenges. In this paper, we present a novel VFI framework with improved treatment for these challenges. To address the former, we propose focalized trajectory fitting, which performs confidence-aware motion trajectory estimation by learning to pay focus to reliable optical flow candidates while suppressing the outliers. The second is range-nullspace synthesis, a novel frame renderer cast as solving an ill-posed problem addressed by learning decoupled components in orthogonal subspaces. The proposed framework sets new records on 7 of 10 public VFI benchmarks. 
+ + + + FlowGrad: Controlling the Output of Generative ODEs With Gradients + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_FlowGrad_Controlling_the_Output_of_Generative_ODEs_With_Gradients_CVPR_2023_paper.pdf + Generative modeling with ordinary differential equations (ODEs) has achieved fantastic results on a variety of applications. Yet, few works have focused on controlling the generated content of a pre-trained ODE-based generative model. In this paper, we propose to optimize the output of ODE models according to a guidance function to achieve controllable generation. We point out that, the gradients can be efficiently back-propagated from the output to any intermediate time steps on the ODE trajectory, by decomposing the back-propagation and computing vector-Jacobian products. To further accelerate the computation of the back-propagation, we propose to use a non-uniform discretization to approximate the ODE trajectory, where we measure how straight the trajectory is and gather the straight parts into one discretization step. This allows us to save 90% of the back-propagation time with ignorable error. Our framework, named FlowGrad, outperforms the state-of-the-art baselines on text-guided image manipulation. Moreover, FlowGrad enables us to find global semantic directions in frozen ODE-based generative models that can be used to manipulate new images without extra optimization. + + + + Learning Weather-General and Weather-Specific Features for Image Restoration Under Multiple Adverse Weather Conditions + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Learning_Weather-General_and_Weather-Specific_Features_for_Image_Restoration_Under_Multiple_CVPR_2023_paper.pdf + Image restoration under multiple adverse weather conditions aims to remove weather-related artifacts by using the single set of network parameters. In this paper, we find that distorted images under different weather conditions contain general characteristics as well as their specific characteristics. Inspired by this observation, we design an efficient unified framework with a two-stage training strategy to explore the weather-general and weather-specific features. The first training stage aims to learn the weather-general features by taking the images under various weather conditions as the inputs and outputting the coarsely restored results. The second training stage aims to learn to adaptively expand the specific parameters for each weather type in the deep model, where requisite positions for expansion of weather-specific parameters are learned automatically. Hence, we can obtain an efficient and unified model for image restoration under multiple adverse weather conditions. Moreover, we build the first real-world benchmark dataset with multiple weather conditions to better deal with real-world weather scenarios. Experimental results show that our method achieves superior performance on all the synthetic and real-world benchmark datasets. + + + + Generalized Deep 3D Shape Prior via Part-Discretized Diffusion Process + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Generalized_Deep_3D_Shape_Prior_via_Part-Discretized_Diffusion_Process_CVPR_2023_paper.pdf + We develop a generalized 3D shape generation prior model, tailored for multiple 3D tasks including unconditional shape generation, point cloud completion, and cross-modality shape generation, etc. 
On one hand, to precisely capture local fine detailed shape information, a vector quantized variational autoencoder (VQ-VAE) is utilized to index local geometry from a compactly learned codebook based on a broad set of task training data. On the other hand, a discrete diffusion generator is introduced to model the inherent structural dependencies among different tokens. In the meantime, a multi-frequency fusion module (MFM) is developed to suppress high-frequency shape feature fluctuations, guided by multi-frequency contextual information. The above designs jointly equip our proposed 3D shape prior model with high-fidelity, diverse features as well as the capability of cross-modality alignment, and extensive experiments have demonstrated superior performances on various 3D shape generation tasks. + + + + Conflict-Based Cross-View Consistency for Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Conflict-Based_Cross-View_Consistency_for_Semi-Supervised_Semantic_Segmentation_CVPR_2023_paper.pdf + Semi-supervised semantic segmentation (SSS) has recently gained increasing research interest as it can reduce the requirement for large-scale fully-annotated training data. The current methods often suffer from the confirmation bias from the pseudo-labelling process, which can be alleviated by the co-training framework. The current co-training-based SSS methods rely on hand-crafted perturbations to prevent the different sub-nets from collapsing into each other, but these artificial perturbations cannot lead to the optimal solution. In this work, we propose a new conflict-based cross-view consistency (CCVC) method based on a two-branch co-training framework which aims at enforcing the two sub-nets to learn informative features from irrelevant views. In particular, we first propose a new cross-view consistency (CVC) strategy that encourages the two sub-nets to learn distinct features from the same input by introducing a feature discrepancy loss, while these distinct features are expected to generate consistent prediction scores of the input. The CVC strategy helps to prevent the two sub-nets from stepping into the collapse. In addition, we further propose a conflict-based pseudo-labelling (CPL) method to guarantee the model will learn more useful information from conflicting predictions, which will lead to a stable training process. We validate our new CCVC approach on the SSS benchmark datasets where our method achieves new state-of-the-art performance. Our code is available at https://github.com/xiaoyao3302/CCVC. + + + + SCoDA: Domain Adaptive Shape Completion for Real Scans + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_SCoDA_Domain_Adaptive_Shape_Completion_for_Real_Scans_CVPR_2023_paper.pdf + 3D shape completion from point clouds is a challenging task, especially from scans of real-world objects. Considering the paucity of 3D shape ground truths for real scans, existing works mainly focus on benchmarking this task on synthetic data, e.g. 3D computer-aided design models. However, the domain gap between synthetic and real data limits the generalizability of these methods. Thus, we propose a new task, SCoDA, for the domain adaptation of real scan shape completion from synthetic data. A new dataset, ScanSalon, is contributed with a bunch of elaborate 3D models created by skillful artists according to scans. 
To address this new task, we propose a novel cross-domain feature fusion method for knowledge transfer and a novel volume-consistent self-training framework for robust learning from real data. Extensive experiments show that our method is effective, bringing an improvement of 6%-7% mIoU. + + + + TransFlow: Transformer As Flow Learner + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_TransFlow_Transformer_As_Flow_Learner_CVPR_2023_paper.pdf + Optical flow is an indispensable building block for various important computer vision tasks, including motion estimation, object tracking, and disparity measurement. In this work, we propose TransFlow, a pure transformer architecture for optical flow estimation. Compared to dominant CNN-based methods, TransFlow demonstrates three advantages. First, it provides more accurate correlation and trustworthy matching in flow estimation by utilizing spatial self-attention and cross-attention mechanisms between adjacent frames to effectively capture global dependencies; second, it recovers more compromised information (e.g., occlusion and motion blur) in flow estimation through long-range temporal association in dynamic scenes; third, it enables a concise self-learning paradigm and effectively eliminates the complex and laborious multi-stage pre-training procedures. We achieve state-of-the-art results on Sintel and KITTI-15, as well as on several downstream tasks, including video object detection, interpolation and stabilization. Given its efficacy, we hope TransFlow can serve as a flexible baseline for optical flow estimation. + + + + AutoFocusFormer: Image Segmentation off the Grid + http://openaccess.thecvf.com//content/CVPR2023/papers/Ziwen_AutoFocusFormer_Image_Segmentation_off_the_Grid_CVPR_2023_paper.pdf + Real-world images often have highly imbalanced content density. Some areas are very uniform, e.g., large patches of blue sky, while other areas are scattered with many small objects. Yet, the commonly used successive grid downsampling strategy in convolutional deep networks treats all areas equally. Hence, small objects are represented in very few spatial locations, leading to worse results in tasks such as segmentation. Intuitively, retaining more pixels representing small objects during downsampling helps to preserve important information. To achieve this, we propose AutoFocusFormer (AFF), a local-attention transformer image recognition backbone, which performs adaptive downsampling by learning to retain the most important pixels for the task. Since adaptive downsampling generates a set of pixels irregularly distributed on the image plane, we abandon the classic grid structure. Instead, we develop a novel point-based local attention block, facilitated by a balanced clustering module and a learnable neighborhood merging module, which yields representations for our point-based versions of state-of-the-art segmentation heads. Experiments show that our AutoFocusFormer (AFF) improves significantly over baseline models of similar sizes. + + + + CLIP2Protect: Protecting Facial Privacy Using Text-Guided Makeup via Adversarial Latent Search + http://openaccess.thecvf.com//content/CVPR2023/papers/Shamshad_CLIP2Protect_Protecting_Facial_Privacy_Using_Text-Guided_Makeup_via_Adversarial_Latent_CVPR_2023_paper.pdf + The success of deep learning based face recognition systems has given rise to serious privacy concerns due to their ability to enable unauthorized tracking of users in the digital world. 
Existing methods for enhancing privacy fail to generate 'naturalistic' images that can protect facial privacy without compromising user experience. We propose a novel two-step approach for facial privacy protection that relies on finding adversarial latent codes in the low-dimensional manifold of a pretrained generative model. The first step inverts the given face image into the latent space and finetunes the generative model to achieve an accurate reconstruction of the given image from its latent code. This step produces a good initialization, aiding the generation of high-quality faces that resemble the given identity. Subsequently, user-defined makeup text prompts and identity-preserving regularization are used to guide the search for adversarial codes in the latent space. Extensive experiments demonstrate that faces generated by our approach have stronger black-box transferability with an absolute gain of 12.06% over the state-of-the-art facial privacy protection approach under the face verification task. Finally, we demonstrate the effectiveness of the proposed approach for commercial face recognition systems. Our code is available at https://github.com/fahadshamshad/Clip2Protect. + + + + Improving Weakly Supervised Temporal Action Localization by Bridging Train-Test Gap in Pseudo Labels + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Improving_Weakly_Supervised_Temporal_Action_Localization_by_Bridging_Train-Test_Gap_CVPR_2023_paper.pdf + The task of weakly supervised temporal action localization aims at generating temporal boundaries for actions of interest while also classifying the action category. Pseudo-label-based methods, which serve as an effective solution, have been widely studied recently. However, existing methods generate pseudo labels during training and make predictions during testing under different pipelines or settings, resulting in a gap between training and testing. In this paper, we propose to generate high-quality pseudo labels from the predicted action boundaries. Nevertheless, we note that existing post-processing, like NMS, would lead to information loss, which is insufficient for generating high-quality action boundaries. More importantly, transforming action boundaries into pseudo labels is quite challenging, since the predicted action instances are generally overlapped and have different confidence scores. Besides, the generated pseudo labels can be fluctuating and inaccurate at the early stage of training, and they might repeatedly strengthen false predictions if there is no mechanism for self-correction. To tackle these issues, we come up with an effective pipeline for learning better pseudo labels. First, we propose a Gaussian weighted fusion module to preserve information of action instances and obtain high-quality action boundaries. Second, we formulate pseudo-label generation as an optimization problem under constraints in terms of the confidence scores of action instances. Finally, we introduce the idea of Delta pseudo labels, which endows the model with the ability of self-correction. Our method achieves superior performance to existing methods on two benchmarks, THUMOS14 and ActivityNet1.3, achieving gains of 1.9% on THUMOS14 and 3.7% on ActivityNet1.3 in terms of average mAP. 
+ + + + REVEAL: Retrieval-Augmented Visual-Language Pre-Training With Multi-Source Multimodal Knowledge Memory + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_REVEAL_Retrieval-Augmented_Visual-Language_Pre-Training_With_Multi-Source_Multimodal_Knowledge_Memory_CVPR_2023_paper.pdf + In this paper, we propose an end-to-end Retrieval-Augmented Visual Language Model (REVEAL) that learns to encode world knowledge into a large-scale memory, and to retrieve from it to answer knowledge-intensive queries. REVEAL consists of four key components: the memory, the encoder, the retriever and the generator. The large-scale memory encodes various sources of multimodal world knowledge (e.g. image-text pairs, question answering pairs, knowledge graph triplets, etc.) via a unified encoder. The retriever finds the most relevant knowledge entries in the memory, and the generator fuses the retrieved knowledge with the input query to produce the output. A key novelty in our approach is that the memory, encoder, retriever and generator are all pre-trained end-to-end on a massive amount of data. Furthermore, our approach can use a diverse set of multimodal knowledge sources, which is shown to result in significant gains. We show that REVEAL achieves state-of-the-art results on visual question answering and image captioning. + + + + Why Is the Winner the Best? + http://openaccess.thecvf.com//content/CVPR2023/papers/Eisenmann_Why_Is_the_Winner_the_Best_CVPR_2023_paper.pdf + International benchmarking competitions have become fundamental for the comparative performance assessment of image analysis methods. However, little attention has been given to investigating what can be learnt from these competitions. Do they really generate scientific progress? What are common and successful participation strategies? What makes a solution superior to a competing method? To address this gap in the literature, we performed a multi-center study with all 80 competitions that were conducted in the scope of IEEE ISBI 2021 and MICCAI 2021. Statistical analyses performed based on comprehensive descriptions of the submitted algorithms linked to their rank as well as the underlying participation strategies revealed common characteristics of winning solutions. These typically include the use of multi-task learning (63%) and/or multi-stage pipelines (61%), and a focus on augmentation (100%), image preprocessing (97%), data curation (79%), and postprocessing (66%). The "typical" lead of a winning team is a computer scientist with a doctoral degree, five years of experience in biomedical image analysis, and four years of experience in deep learning. Two core general development strategies stood out for highly-ranked teams: the reflection of the metrics in the method design and the focus on analyzing and handling failure cases. According to the organizers, 43% of the winning algorithms exceeded the state of the art but only 11% completely solved the respective domain problem. The insights of our study could help researchers (1) improve algorithm development strategies when approaching new problems, and (2) focus on open research questions revealed by this work. 
+ + + + HGNet: Learning Hierarchical Geometry From Points, Edges, and Surfaces + http://openaccess.thecvf.com//content/CVPR2023/papers/Yao_HGNet_Learning_Hierarchical_Geometry_From_Points_Edges_and_Surfaces_CVPR_2023_paper.pdf + Parsing an unstructured point set into constituent local geometry structures (e.g., edges or surfaces) would be helpful for understanding and representing point clouds. This motivates us to design a deep architecture to model the hierarchical geometry from points, edges, surfaces (triangles), to super-surfaces (adjacent surfaces) for the thorough analysis of point clouds. In this paper, we present a novel Hierarchical Geometry Network (HGNet) that integrates such hierarchical geometry structures from super-surfaces, surfaces, edges, to points in a top-down manner for learning point cloud representations. Technically, we first construct the edges between every two neighbor points. A point-level representation is learnt with edge-to-point aggregation, i.e., aggregating all connected edges into the anchor point. Next, as every two neighbor edges compose a surface, we obtain the edge-level representation of each anchor edge via surface-to-edge aggregation over all neighbor surfaces. Furthermore, the surface-level representation is achieved through super-surface-to-surface aggregation by transforming all super-surfaces into the anchor surface. A Transformer structure is finally devised to unify all the point-level, edge-level, and surface-level features into the holistic point cloud representations. Extensive experiments on four point cloud analysis datasets demonstrate the superiority of HGNet for 3D object classification and part/semantic segmentation tasks. More remarkably, HGNet achieves the overall accuracy of 89.2% on ScanObjectNN, improving PointNeXt-S by 1.5%. + + + + PAniC-3D: Stylized Single-View 3D Reconstruction From Portraits of Anime Characters + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_PAniC-3D_Stylized_Single-View_3D_Reconstruction_From_Portraits_of_Anime_Characters_CVPR_2023_paper.pdf + We propose PAniC-3D, a system to reconstruct stylized 3D character heads directly from illustrated (p)ortraits of (ani)me (c)haracters. Our anime-style domain poses unique challenges to single-view reconstruction; compared to natural images of human heads, character portrait illustrations have hair and accessories with more complex and diverse geometry, and are shaded with non-photorealistic contour lines. In addition, there is a lack of both 3D model and portrait illustration data suitable to train and evaluate this ambiguous stylized reconstruction task. Facing these challenges, our proposed PAniC-3D architecture crosses the illustration-to-3D domain gap with a line-filling model, and represents sophisticated geometries with a volumetric radiance field. We train our system with two large new datasets (11.2k Vroid 3D models, 1k Vtuber portrait illustrations), and evaluate on a novel AnimeRecon benchmark of illustration-to-3D pairs. PAniC-3D significantly outperforms baseline methods, and provides data to establish the task of stylized reconstruction from portrait illustrations. + + + + SunStage: Portrait Reconstruction and Relighting Using the Sun as a Light Stage + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_SunStage_Portrait_Reconstruction_and_Relighting_Using_the_Sun_as_a_CVPR_2023_paper.pdf + A light stage uses a series of calibrated cameras and lights to capture a subject's facial appearance under varying illumination and viewpoint. 
This captured information is crucial for facial reconstruction and relighting. Unfortunately, light stages are often inaccessible: they are expensive and require significant technical expertise for construction and operation. In this paper, we present SunStage: a lightweight alternative to a light stage that captures comparable data using only a smartphone camera and the sun. Our method only requires the user to capture a selfie video outdoors, rotating in place, and uses the varying angles between the sun and the face as guidance in joint reconstruction of facial geometry, reflectance, camera pose, and lighting parameters. Despite the in-the-wild un-calibrated setting, our approach is able to reconstruct detailed facial appearance and geometry, enabling compelling effects such as relighting, novel view synthesis, and reflectance editing. + + + + Private Image Generation With Dual-Purpose Auxiliary Classifier + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Private_Image_Generation_With_Dual-Purpose_Auxiliary_Classifier_CVPR_2023_paper.pdf + Privacy-preserving image generation has been important for segments such as medical domains that have sensitive and limited data. The benefits of guaranteed privacy come at the costs of generated images' quality and utility due to the privacy budget constraints. The utility is currently measured by the gen2real accuracy (g2r%), i.e., the accuracy on real data of a downstream classifier trained using generated data. However, apart from this standard utility, we identify the "reversed utility" as another crucial aspect, which computes the accuracy on generated data of a classifier trained using real data, dubbed as real2gen accuracy (r2g%). Jointly considering these two views of utility, the standard and the reversed, could help the generation model better improve transferability between fake and real data. Therefore, we propose a novel private image generation method that incorporates a dual-purpose auxiliary classifier, which alternates between learning from real data and fake data, into the training of differentially private GANs. Additionally, our deliberate training strategies such as sequential training contributes to accelerating the generator's convergence and further boosting the performance upon exhausting the privacy budget. Our results achieve new state-of-the-arts over all metrics on three benchmarks: MNIST, Fashion-MNIST, and CelebA. + + + + 3D-POP - An Automated Annotation Approach to Facilitate Markerless 2D-3D Tracking of Freely Moving Birds With Marker-Based Motion Capture + http://openaccess.thecvf.com//content/CVPR2023/papers/Naik_3D-POP_-_An_Automated_Annotation_Approach_to_Facilitate_Markerless_2D-3D_CVPR_2023_paper.pdf + Recent advances in machine learning and computer vision are revolutionizing the field of animal behavior by enabling researchers to track the poses and locations of freely moving animals without any marker attachment. However, large datasets of annotated images of animals for markerless pose tracking, especially high-resolution images taken from multiple angles with accurate 3D annotations, are still scant. Here, we propose a method that uses a motion capture (mo-cap) system to obtain a large amount of annotated data on animal movement and posture (2D and 3D) in a semi-automatic manner. Our method is novel in that it extracts the 3D positions of morphological keypoints (e.g eyes, beak, tail) in reference to the positions of markers attached to the animals. 
Using this method, we obtained, and offer here, a new dataset - 3D-POP with approximately 300k annotated frames (4 million instances) in the form of videos having groups of one to ten freely moving birds from 4 different camera views in a 3.6m x 4.2m area. 3D-POP is the first dataset of flocking birds with accurate keypoint annotations in 2D and 3D along with bounding box and individual identities and will facilitate the development of solutions for problems of 2D to 3D markerless pose, trajectory tracking, and identification in birds. + + + + Unified Keypoint-Based Action Recognition Framework via Structured Keypoint Pooling + http://openaccess.thecvf.com//content/CVPR2023/papers/Hachiuma_Unified_Keypoint-Based_Action_Recognition_Framework_via_Structured_Keypoint_Pooling_CVPR_2023_paper.pdf + This paper simultaneously addresses three limitations associated with conventional skeleton-based action recognition; skeleton detection and tracking errors, poor variety of the targeted actions, as well as person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to the action recognition, and a unified framework along with a novel deep neural network architecture called Structured Keypoint Pooling is proposed. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure (which is inherent in skeletons), such as the instances and frames to which each keypoint belongs, and achieves robustness against input errors. Its less constrained and tracking-free architecture enables time-series keypoints consisting of human skeletons and nonhuman object contours to be efficiently treated as an input 3D point cloud and extends the variety of the targeted action. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. This trick enables our training scheme to naturally introduce novel data augmentation, which mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against the limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods. + + + + Multi-View Reconstruction Using Signed Ray Distance Functions (SRDF) + http://openaccess.thecvf.com//content/CVPR2023/papers/Zins_Multi-View_Reconstruction_Using_Signed_Ray_Distance_Functions_SRDF_CVPR_2023_paper.pdf + In this paper, we investigate a new optimization framework for multi-view 3D shape reconstructions. Recent differentiable rendering approaches have provided breakthrough performances with implicit shape representations though they can still lack precision in the estimated geometries. On the other hand multi-view stereo methods can yield pixel wise geometric accuracy with local depth predictions along viewing rays. Our approach bridges the gap between the two strategies with a novel volumetric shape representation that is implicit but parameterized with pixel depths to better materialize the shape surface with consistent signed distances along viewing rays. The approach retains pixel-accuracy while benefiting from volumetric integration in the optimization. 
To this aim, depths are optimized by evaluating, at each 3D location within the volumetric discretization, the agreement between the depth prediction consistency and the photometric consistency for the corresponding pixels. The optimization is agnostic to the associated photo-consistency term which can vary from a median-based baseline to more elaborate criteria, learned functions. Our experiments demonstrate the benefit of the volumetric integration with depth predictions. They also show that our approach outperforms existing approaches over standard 3D benchmarks with better geometry estimations. + + + + Improving Cross-Modal Retrieval With Set of Diverse Embeddings + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Improving_Cross-Modal_Retrieval_With_Set_of_Diverse_Embeddings_CVPR_2023_paper.pdf + Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity: An image often exhibits various situations, and a caption can be coupled with diverse images. Set-based embedding has been studied as a solution to this problem. It seeks to encode a sample into a set of different embedding vectors that capture different semantics of the sample. In this paper, we present a novel set-based embedding method, which is distinct from previous work in two aspects. First, we present a new similarity function called smooth-Chamfer similarity, which is designed to alleviate the side effects of existing similarity functions for set-based embedding. Second, we propose a novel set prediction module to produce a set of embedding vectors that effectively captures diverse semantics of input by the slot attention mechanism. Our method is evaluated on the COCO and Flickr30K datasets across different visual backbones, where it outperforms existing methods including ones that demand substantially larger computation at inference. + + + + Policy Adaptation From Foundation Model Feedback + http://openaccess.thecvf.com//content/CVPR2023/papers/Ge_Policy_Adaptation_From_Foundation_Model_Feedback_CVPR_2023_paper.pdf + Recent progress on vision-language foundation models have brought significant advancement to building general-purpose robots. By using the pre-trained models to encode the scene and instructions as inputs for decision making, the instruction-conditioned policy can generalize across different objects and tasks. While this is encouraging, the policy still fails in most cases given an unseen task or environment. In this work, we propose Policy Adaptation from Foundation model Feedback (PAFF). When deploying the trained policy to a new task or a new environment, we first let the policy play with randomly generated instructions to record the demonstrations. While the execution could be wrong, we can use the pre-trained foundation models to provide feedback to relabel the demonstrations. This automatically provides new pairs of demonstration-instruction data for policy fine-tuning. We evaluate our method on a broad range of experiments with the focus on generalization on unseen objects, unseen tasks, unseen environments, and sim-to-real transfer. We show PAFF improves baselines by a large margin in all cases. 
+ + + + Semi-DETR: Semi-Supervised Object Detection With Detection Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Semi-DETR_Semi-Supervised_Object_Detection_With_Detection_Transformers_CVPR_2023_paper.pdf + We analyze the DETR-based framework on semi-supervised object detection (SSOD) and observe that (1) the one-to-one assignment strategy generates incorrect matching when the pseudo ground-truth bounding box is inaccurate, leading to training inefficiency; (2) DETR-based detectors lack deterministic correspondence between the input query and its prediction output, which hinders the applicability of the consistency-based regularization widely used in current SSOD methods. We present Semi-DETR, the first transformer-based end-to-end semi-supervised object detector, to tackle these problems. Specifically, we propose a Stage-wise Hybrid Matching strategy that combines the one-to-many assignment and one-to-one assignment strategies to improve the training efficiency of the first stage and thus provide high-quality pseudo labels for the training of the second stage. Besides, we introduce a Cross-view Query Consistency method to learn the semantic feature invariance of object queries from different views while avoiding the need to find deterministic query correspondence. Furthermore, we propose a Cost-based Pseudo Label Mining module to dynamically mine more pseudo boxes based on the matching cost of pseudo ground truth bounding boxes for consistency training. Extensive experiments on all SSOD settings of both COCO and Pascal VOC benchmark datasets show that our Semi-DETR method outperforms all state-of-the-art methods by clear margins. + + + + GP-VTON: Towards General Purpose Virtual Try-On via Collaborative Local-Flow Global-Parsing Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_GP-VTON_Towards_General_Purpose_Virtual_Try-On_via_Collaborative_Local-Flow_Global-Parsing_CVPR_2023_paper.pdf + Image-based Virtual Try-ON aims to transfer an in-shop garment onto a specific person. Existing methods employ a global warping module to model the anisotropic deformation for different garment parts, which fails to preserve the semantic information of different parts when receiving challenging inputs (e.g., intricate human poses, difficult garments). Moreover, most of them directly warp the input garment to align with the boundary of the preserved region, which usually requires texture squeezing to meet the boundary shape constraint and thus leads to texture distortion. The above inferior performance hinders existing methods from real-world applications. To address these problems and take a step towards real-world virtual try-on, we propose a General-Purpose Virtual Try-ON framework, named GP-VTON, by developing an innovative Local-Flow Global-Parsing (LFGP) warping module and a Dynamic Gradient Truncation (DGT) training strategy. Specifically, compared with the previous global warping mechanism, LFGP employs local flows to warp garment parts individually, and assembles the local warped results via the global garment parsing, resulting in reasonable warped parts and a semantically correct intact garment even with challenging inputs. On the other hand, our DGT training strategy dynamically truncates the gradient in the overlap area and the warped garment is no longer required to meet the boundary constraint, which effectively avoids the texture squeezing problem.
Furthermore, our GP-VTON can be easily extended to multi-category scenario and jointly trained by using data from different garment categories. Extensive experiments on two high-resolution benchmarks demonstrate our superiority over the existing state-of-the-art methods. + + + + Decomposed Soft Prompt Guided Fusion Enhancing for Compositional Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Lu_Decomposed_Soft_Prompt_Guided_Fusion_Enhancing_for_Compositional_Zero-Shot_Learning_CVPR_2023_paper.pdf + Compositional Zero-Shot Learning (CZSL) aims to recognize novel concepts formed by known states and objects during training. Existing methods either learn the combined state-object representation, challenging the generalization of unseen compositions, or design two classifiers to identify state and object separately from image features, ignoring the intrinsic relationship between them. To jointly eliminate the above issues and construct a more robust CZSL system, we propose a novel framework termed Decomposed Fusion with Soft Prompt (DFSP), by involving vision-language models (VLMs) for unseen composition recognition. Specifically, DFSP constructs a vector combination of learnable soft prompts with state and object to establish the joint representation of them. In addition, a cross-modal decomposed fusion module is designed between the language and image branches, which decomposes state and object among language features instead of image features. Notably, being fused with the decomposed features, the image features can be more expressive for learning the relationship with states and objects, respectively, to improve the response of unseen compositions in the pair space, hence narrowing the domain gap between seen and unseen sets. Experimental results on three challenging benchmarks demonstrate that our approach significantly outperforms other state-of-the-art methods by large margins. + + + + Hierarchical Semantic Contrast for Scene-Aware Video Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Hierarchical_Semantic_Contrast_for_Scene-Aware_Video_Anomaly_Detection_CVPR_2023_paper.pdf + Increasing scene-awareness is a key challenge in video anomaly detection (VAD). In this work, we propose a hierarchical semantic contrast (HSC) method to learn a scene-aware VAD model from normal videos. We first incorporate foreground object and background scene features with high-level semantics by taking advantage of pre-trained video parsing models. Then, building upon the autoencoder-based reconstruction framework, we introduce both scene-level and object-level contrastive learning to enforce the encoded latent features to be compact within the same semantic classes while being separable across different classes. This hierarchical semantic contrast strategy helps to deal with the diversity of normal patterns and also increases their discrimination ability. Moreover, for the sake of tackling rare normal activities, we design a skeleton-based motion augmentation to increase samples and refine the model further. Extensive experiments on three public datasets and scene-dependent mixture datasets validate the effectiveness of our proposed method. 
+ + + + All-in-Focus Imaging From Event Focal Stack + http://openaccess.thecvf.com//content/CVPR2023/papers/Lou_All-in-Focus_Imaging_From_Event_Focal_Stack_CVPR_2023_paper.pdf + Traditional focal stack methods require multiple shots to capture images focused at different distances of the same scene, which cannot be applied to dynamic scenes well. Generating a high-quality all-in-focus image from a single shot is challenging, due to the highly ill-posed nature of the single-image defocus and deblurring problem. In this paper, to restore an all-in-focus image, we propose the event focal stack which is defined as event streams captured during a continuous focal sweep. Given an RGB image focused at an arbitrary distance, we explore the high temporal resolution of event streams, from which we automatically select refocusing timestamps and reconstruct corresponding refocused images with events to form a focal stack. Guided by the neighbouring events around the selected timestamps, we can merge the focal stack with proper weights and restore a sharp all-in-focus image. Experimental results on both synthetic and real datasets show superior performance over state-of-the-art methods. + + + + Video Probabilistic Diffusion Models in Projected Latent Space + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Video_Probabilistic_Diffusion_Models_in_Projected_Latent_Space_CVPR_2023_paper.pdf + Despite the remarkable progress in deep generative models, synthesizing high-resolution and temporally coherent videos still remains a challenge due to their high-dimensionality and complex temporal dynamics along with large spatial variations. Recent works on diffusion models have shown their potential to solve this challenge, yet they suffer from severe computation- and memory-inefficiency that limit the scalability. To handle this issue, we propose a novel generative model for videos, coined projected latent video diffusion models (PVDM), a probabilistic diffusion model which learns a video distribution in a low-dimensional latent space and thus can be efficiently trained with high-resolution videos under limited resources. Specifically, PVDM is composed of two components: (a) an autoencoder that projects a given video as 2D-shaped latent vectors that factorize the complex cubic structure of video pixels and (b) a diffusion model architecture specialized for our new factorized latent space and the training/sampling procedure to synthesize videos of arbitrary length with a single model. Experiments on popular video generation datasets demonstrate the superiority of PVDM compared with previous video synthesis methods; e.g., PVDM obtains the FVD score of 639.7 on the UCF-101 long video (128 frames) generation benchmark, which improves 1773.4 of the prior state-of-the-art. + + + + Defining and Quantifying the Emergence of Sparse Concepts in DNNs + http://openaccess.thecvf.com//content/CVPR2023/papers/Ren_Defining_and_Quantifying_the_Emergence_of_Sparse_Concepts_in_DNNs_CVPR_2023_paper.pdf + This paper aims to illustrate the concept-emerging phenomenon in a trained DNN. Specifically, we find that the inference score of a DNN can be disentangled into the effects of a few interactive concepts. These concepts can be understood as inference patterns in a sparse, symbolic graphical model, which explains the DNN. 
The faithfulness of using such a graphical model to explain the DNN is theoretically guaranteed, because we prove that the graphical model can well mimic the DNN's outputs on an exponential number of different masked samples. Besides, such a graphical model can be further simplified and re-written as an And-Or graph (AOG), without losing much explanation accuracy. The code is released at https://github.com/sjtu-xai-lab/aog. + + + + FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Qin_FreeSeg_Unified_Universal_and_Open-Vocabulary_Image_Segmentation_CVPR_2023_paper.pdf + Recently, open-vocabulary learning has emerged to accomplish segmentation for arbitrary categories of text-based descriptions, which popularizes the segmentation system to more general-purpose application scenarios. However, existing methods devote to designing specialized architectures or parameters for specific segmentation tasks. These customized design paradigms lead to fragmentation between various segmentation tasks, thus hindering the uniformity of segmentation models. Hence in this paper, we propose FreeSeg, a generic framework to accomplish Unified, Universal and Open-Vocabulary Image Segmentation. FreeSeg optimizes an all-in-one network via one-shot training and employs the same architecture and parameters to handle diverse segmentation tasks seamlessly in the inference procedure. Additionally, adaptive prompt learning facilitates the unified model to capture task-aware and category-sensitive concepts, improving model robustness in multi-task and varied scenarios. Extensive experimental results demonstrate that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, which outperforms the best task-specific architectures by a large margin: 5.5% mIoU on semantic segmentation, 17.6% mAP on instance segmentation, 20.1% PQ on panoptic segmentation for the unseen class on COCO. Project page: https://FreeSeg.github.io. + + + + AVFormer: Injecting Vision Into Frozen Speech Models for Zero-Shot AV-ASR + http://openaccess.thecvf.com//content/CVPR2023/papers/Seo_AVFormer_Injecting_Vision_Into_Frozen_Speech_Models_for_Zero-Shot_AV-ASR_CVPR_2023_paper.pdf + Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a speech recognition system by incorporating visual information. Training fully supervised multimodal models for this task from scratch, however is limited by the need for large labelled audiovisual datasets (in each downstream domain of interest). We present AVFormer, a simple method for augmenting audioonly models with visual information, at the same time performing lightweight domain adaptation. We do this by (i) injecting visual embeddings into a frozen ASR model using lightweight trainable adaptors. We show that these can be trained on a small amount of weakly labelled video data with minimum additional training time and parameters. (ii) We also introduce a simple curriculum scheme during training which we show is crucial to enable the model to jointly process audio and visual information effectively; and finally (iii) we show that our model achieves state of the art zero-shot results on three different AV-ASR benchmarks (How2, VisSpeech and Ego4D), while also crucially preserving decent performance on traditional audio-only speech recognition benchmarks (LibriSpeech). 
Qualitative results show that our model effectively leverages visual information for robust speech recognition. + + + + Self-Guided Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Self-Guided_Diffusion_Models_CVPR_2023_paper.pdf + Diffusion models have demonstrated remarkable progress in image generation quality, especially when guidance is used to control the generative process. However, guidance requires a large amount of image-annotation pairs for training and is thus dependent on their availability and correctness. In this paper, we eliminate the need for such annotation by instead exploiting the flexibility of self-supervision signals to design a framework for self-guided diffusion models. By leveraging a feature extraction function and a self-annotation function, our method provides guidance signals at various image granularities: from the level of holistic images to object boxes and even segmentation masks. Our experiments on single-label and multi-label image datasets demonstrate that self-labeled guidance always outperforms diffusion models without guidance and may even surpass guidance based on ground-truth labels. When equipped with self-supervised box or mask proposals, our method further generates visually diverse yet semantically consistent images, without the need for any class, box, or segment label annotation. Self-guided diffusion is simple, flexible and expected to profit from deployment at scale. + + + + One-Shot High-Fidelity Talking-Head Synthesis With Deformable Neural Radiance Field + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_One-Shot_High-Fidelity_Talking-Head_Synthesis_With_Deformable_Neural_Radiance_Field_CVPR_2023_paper.pdf + Talking head generation aims to generate faces that maintain the identity information of the source image and imitate the motion of the driving image. Most pioneering methods rely primarily on 2D representations and thus will inevitably suffer from face distortion when large head rotations are encountered. Recent works instead employ explicit 3D structural representations or implicit neural rendering to improve performance under large pose changes. Nevertheless, the fidelity of identity and expression is not so desirable, especially for novel-view synthesis. In this paper, we propose HiDe-NeRF, which achieves high-fidelity and free-view talking-head synthesis. Drawing on the recently proposed Deformable Neural Radiance Fields, HiDe-NeRF represents the 3D dynamic scene into a canonical appearance field and an implicit deformation field, where the former comprises the canonical source face and the latter models the driving pose and expression. In particular, we improve fidelity from two aspects: (i) to enhance identity expressiveness, we design a generalized appearance module that leverages multi-scale volume features to preserve face shape and details; (ii) to improve expression preciseness, we propose a lightweight deformation module that explicitly decouples the pose and expression to enable precise expression modeling. Extensive experiments demonstrate that our proposed approach can generate better results than previous works. 
Project page: https://www.waytron.net/hidenerf/ + + + + Trajectory-Aware Body Interaction Transformer for Multi-Person Pose Forecasting + http://openaccess.thecvf.com//content/CVPR2023/papers/Peng_Trajectory-Aware_Body_Interaction_Transformer_for_Multi-Person_Pose_Forecasting_CVPR_2023_paper.pdf + Multi-person pose forecasting remains a challenging problem, especially in modeling fine-grained human body interaction in complex crowd scenarios. Existing methods typically represent the whole pose sequence as a temporal series, yet overlook interactive influences among people based on skeletal body parts. In this paper, we propose a novel Trajectory-Aware Body Interaction Transformer (TBIFormer) for multi-person pose forecasting via effectively modeling body part interactions. Specifically, we construct a Temporal Body Partition Module that transforms all the pose sequences into a Multi-Person Body-Part sequence to retain spatial and temporal information based on body semantics. Then, we devise a Social Body Interaction Self-Attention (SBI-MSA) module, utilizing the transformed sequence to learn body part dynamics for inter- and intra-individual interactions. Furthermore, different from prior Euclidean distance-based spatial encodings, we present a novel and efficient Trajectory-Aware Relative Position Encoding for SBI-MSA to offer discriminative spatial information and additional interactive clues. On both short- and long-term horizons, we empirically evaluate our framework on CMU-Mocap, MuPoTS-3D, as well as synthesized datasets (6~10 persons), and demonstrate that our method greatly outperforms the state-of-the-art methods. + + + + Conditional Image-to-Video Generation With Latent Flow Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Ni_Conditional_Image-to-Video_Generation_With_Latent_Flow_Diffusion_Models_CVPR_2023_paper.pdf + Conditional image-to-video (cI2V) generation aims to synthesize a new plausible video starting from an image (e.g., a person's face) and a condition (e.g., an action class label like smile). The key challenge of the cI2V task lies in the simultaneous generation of realistic spatial appearance and temporal dynamics corresponding to the given image and condition. In this paper, we propose an approach for cI2V using novel latent flow diffusion models (LFDM) that synthesize an optical flow sequence in the latent space based on the given condition to warp the given image. Compared to previous direct-synthesis-based works, our proposed LFDM can better synthesize spatial details and temporal motion by fully utilizing the spatial content of the given image and warping it in the latent space according to the generated temporally-coherent flow. The training of LFDM consists of two separate stages: (1) an unsupervised learning stage to train a latent flow auto-encoder for spatial content generation, including a flow predictor to estimate latent flow between pairs of video frames, and (2) a conditional learning stage to train a 3D-UNet-based diffusion model (DM) for temporal latent flow generation. Unlike previous DMs operating in pixel space or latent feature space that couples spatial and temporal information, the DM in our LFDM only needs to learn a low-dimensional latent flow space for motion generation, thus being more computationally efficient. We conduct comprehensive experiments on multiple datasets, where LFDM consistently outperforms prior arts.
Furthermore, we show that LFDM can be easily adapted to new domains by simply finetuning the image decoder. Our code is available at https://github.com/nihaomiao/CVPR23_LFDM. + + + + Virtual Sparse Convolution for Multimodal 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_Virtual_Sparse_Convolution_for_Multimodal_3D_Object_Detection_CVPR_2023_paper.pdf + Recently, virtual/pseudo-point-based 3D object detection that seamlessly fuses RGB images and LiDAR data by depth completion has gained great attention. However, virtual points generated from an image are very dense, introducing a huge amount of redundant computation during detection. Meanwhile, noises brought by inaccurate depth completion significantly degrade detection precision. This paper proposes a fast yet effective backbone, termed VirConvNet, based on a new operator VirConv (Virtual Sparse Convolution), for virtual-point-based 3D object detection. The VirConv consists of two key designs: (1) StVD (Stochastic Voxel Discard) and (2) NRConv (Noise-Resistant Submanifold Convolution). The StVD alleviates the computation problem by discarding large amounts of nearby redundant voxels. The NRConv tackles the noise problem by encoding voxel features in both 2D image and 3D LiDAR space. By integrating our VirConv, we first develop an efficient pipeline VirConv-L based on an early fusion design. Then, we build a high-precision pipeline VirConv-T based on a transformed refinement scheme. Finally, we develop a semi-supervised pipeline VirConv-S based on a pseudo-label framework. On the KITTI car 3D detection test leaderboard, our VirConv-L achieves 85% AP with a fast running speed of 56ms. Our VirConv-T and VirConv-S attains a high-precision of 86.3% and 87.2% AP, and currently rank 2nd and 1st, respectively. The code is available at https://github.com/hailanyi/VirConv. + + + + Towards Universal Fake Image Detectors That Generalize Across Generative Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Ojha_Towards_Universal_Fake_Image_Detectors_That_Generalize_Across_Generative_Models_CVPR_2023_paper.pdf + With generative models proliferating at a rapid rate, there is a growing need for general purpose fake image detectors. In this work, we first show that the existing paradigm, which consists of training a deep network for real-vs-fake classification, fails to detect fake images from newer breeds of generative models when trained to detect GAN fake images. Upon analysis, we find that the resulting classifier is asymmetrically tuned to detect patterns that make an image fake. The real class becomes a 'sink' class holding anything that is not fake, including generated images from models not accessible during training. Building upon this discovery, we propose to perform real-vs-fake classification without learning; i.e., using a feature space not explicitly trained to distinguish real from fake images. We use nearest neighbor and linear probing as instantiations of this idea. When given access to the feature space of a large pretrained vision-language model, the very simple baseline of nearest neighbor classification has surprisingly good generalization ability in detecting fake images from a wide variety of generative models; e.g., it improves upon the SoTA by +15.07 mAP and +25.90% acc when tested on unseen diffusion and autoregressive models. 
+ + + + A Large-Scale Homography Benchmark + http://openaccess.thecvf.com//content/CVPR2023/papers/Barath_A_Large-Scale_Homography_Benchmark_CVPR_2023_paper.pdf + We present a large-scale dataset of Planes in 3D, Pi3D, of roughly 1000 planes observed in 10 000 images from the 1DSfM dataset, and HEB, a large-scale homography estimation benchmark leveraging Pi3D. The applications of the Pi3D dataset are diverse, e.g. training or evaluating monocular depth, surface normal estimation and image matching algorithms. The HEB dataset consists of 226 260 homographies and includes roughly 4M correspondences. The homographies link images that often undergo significant viewpoint and illumination changes. As applications of HEB, we perform a rigorous evaluation of a wide range of robust estimators and deep learning-based correspondence filtering methods, establishing the current state-of-the-art in robust homography estimation. We also evaluate the uncertainty of the SIFT orientations and scales w.r.t. the ground truth coming from the underlying homographies and provide codes for comparing uncertainty of custom detectors. + + + + Weakly Supervised Video Emotion Detection and Prediction via Cross-Modal Temporal Erasing Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Weakly_Supervised_Video_Emotion_Detection_and_Prediction_via_Cross-Modal_Temporal_CVPR_2023_paper.pdf + Automatically predicting the emotions of user-generated videos (UGVs) receives increasing interest recently. However, existing methods mainly focus on a few key visual frames, which may limit their capacity to encode the context that depicts the intended emotions. To tackle that, in this paper, we propose a cross-modal temporal erasing network that locates not only keyframes but also context and audio-related information in a weakly-supervised manner. In specific, we first leverage the intra- and inter-modal relationship among different segments to accurately select keyframes. Then, we iteratively erase keyframes to encourage the model to concentrate on the contexts that include complementary information. Extensive experiments on three challenging video emotion benchmarks demonstrate that our method performs favorably against state-of-the-art approaches. The code is released on https://github.com/nku-zhichengzhang/WECL. + + + + Consistent View Synthesis With Pose-Guided Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Tseng_Consistent_View_Synthesis_With_Pose-Guided_Diffusion_Models_CVPR_2023_paper.pdf + Novel view synthesis from a single image has been a cornerstone problem for many Virtual Reality applications that provide immersive experiences. However, most existing techniques can only synthesize novel views within a limited range of camera motion or fail to generate consistent and high-quality novel views under significant camera movement. In this work, we propose a pose-guided diffusion model to generate a consistent long-term video of novel views from a single image. We design an attention layer that uses epipolar lines as constraints to facilitate the association between different viewpoints. Experimental results on synthetic and real-world datasets demonstrate the effectiveness of the proposed diffusion model against state-of-the-art transformer-based and GAN-based approaches. More qualitative results are available at https://poseguided-diffusion.github.io/. 
+ + + + MSMDFusion: Fusing LiDAR and Camera at Multiple Scales With Multi-Depth Seeds for 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiao_MSMDFusion_Fusing_LiDAR_and_Camera_at_Multiple_Scales_With_Multi-Depth_CVPR_2023_paper.pdf + Fusing LiDAR and camera information is essential for accurate and reliable 3D object detection in autonomous driving systems. This is challenging due to the difficulty of combining multi-granularity geometric and semantic features from two drastically different modalities. Recent approaches aim at exploring the semantic densities of camera features through lifting points in 2D camera images (referred to as "seeds") into 3D space, and then incorporate 2D semantics via cross-modal interaction or fusion techniques. However, depth information is under-investigated in these approaches when lifting points into 3D space, thus 2D semantics can not be reliably fused with 3D points. Moreover, their multi-modal fusion strategy, which is implemented as concatenation or attention, either can not effectively fuse 2D and 3D information or is unable to perform fine-grained interactions in the voxel space. To this end, we propose a novel framework with better utilization of the depth information and fine-grained cross-modal interaction between LiDAR and camera, which consists of two important components. First, a Multi-Depth Unprojection (MDU) method is used to enhance the depth quality of the lifted points at each interaction level. Second, a Gated Modality-Aware Convolution (GMA-Conv) block is applied to modulate voxels involved with the camera modality in a fine-grained manner and then aggregate multi-modal features into a unified space. Together they provide the detection head with more comprehensive features from LiDAR and camera. On the nuScenes test benchmark, our proposed method, abbreviated as MSMDFusion, achieves state-of-the-art results on both 3D object detection and tracking tasks without using test-time-augmentation and ensemble techniques. The code is available at https://github.com/SxJyJay/MSMDFusion. + + + + Dense-Localizing Audio-Visual Events in Untrimmed Videos: A Large-Scale Benchmark and Baseline + http://openaccess.thecvf.com//content/CVPR2023/papers/Geng_Dense-Localizing_Audio-Visual_Events_in_Untrimmed_Videos_A_Large-Scale_Benchmark_and_CVPR_2023_paper.pdf + Existing audio-visual event localization (AVE) handles manually trimmed videos with only a single instance in each of them. However, this setting is unrealistic as natural videos often contain numerous audio-visual events with different categories. To better adapt to real-life applications, in this paper we focus on the task of dense-localizing audio-visual events, which aims to jointly localize and recognize all audio-visual events occurring in an untrimmed video. The problem is challenging as it requires fine-grained audio-visual scene and context understanding. To tackle this problem, we introduce the first Untrimmed Audio-Visual (UnAV-100) dataset, which contains 10K untrimmed videos with over 30K audio-visual events. Each video has 2.8 audio-visual events on average, and the events are usually related to each other and might co-occur as in real-life scenes. Next, we formulate the task using a new learning-based framework, which is capable of fully integrating audio and visual modalities to localize audio-visual events with various lengths and capture dependencies between them in a single pass. 
Extensive experiments demonstrate the effectiveness of our method as well as the significance of multi-scale cross-modal perception and dependency modeling for this task. + + + + Weak-Shot Object Detection Through Mutual Knowledge Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Du_Weak-Shot_Object_Detection_Through_Mutual_Knowledge_Transfer_CVPR_2023_paper.pdf + Weak-shot Object Detection methods exploit a fully-annotated source dataset to facilitate the detection performance on the target dataset which only contains image-level labels for novel categories. To bridge the gap between these two datasets, we aim to transfer the object knowledge between the source (S) and target (T) datasets in a bi-directional manner. We propose a novel Knowledge Transfer (KT) loss which simultaneously distills the knowledge of objectness and class entropy from a proposal generator trained on the S dataset to optimize a multiple instance learning module on the T dataset. By jointly optimizing the classification loss and the proposed KT loss, the multiple instance learning module effectively learns to classify object proposals into novel categories in the T dataset with the transferred knowledge from base categories in the S dataset. Noticing the predicted boxes on the T dataset can be regarded as an extension for the original annotations on the S dataset to refine the proposal generator in return, we further propose a novel Consistency Filtering (CF) method to reliably remove inaccurate pseudo labels by evaluating the stability of the multiple instance learning module upon noise injections. Via mutually transferring knowledge between the S and T datasets in an iterative manner, the detection performance on the target dataset is significantly improved. Extensive experiments on public benchmarks validate that the proposed method performs favourably against the state-of-the-art methods without increasing the model parameters or inference computational complexity. + + + + Toward Stable, Interpretable, and Lightweight Hyperspectral Super-Resolution + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Toward_Stable_Interpretable_and_Lightweight_Hyperspectral_Super-Resolution_CVPR_2023_paper.pdf + For real applications, existing HSI-SR methods are mostly not only limited to unstable performance under unknown scenarios but also suffer from high computation consumption. In this paper, we develop a new coordination optimization framework for stable, interpretable, and lightweight HSI-SR. Specifically, we create a positive cycle between fusion and degradation estimation under a new probabilistic framework. The estimated degradation is applied to fusion as guidance for a degradation-aware HSI-SR. Under the framework, we establish an explicit degradation estimation method to tackle the indeterminacy and unstable performance driven by black-box simulation in previous methods. Considering the interpretability in fusion, we integrate spectral mixing prior to the fusion process, which can be easily realized by a tiny autoencoder, leading to a dramatic release of the computation burden. We then develop a partial fine-tune strategy in inference to reduce the computation cost further. Comprehensive experiments demonstrate the superiority of our method against state-of-the-art under synthetic and real datasets. For instance, we achieve a 2.3 dB promotion on PSNR with 120x model size reduction and 4300x FLOPs reduction under the CAVE dataset. Code is available in https://github.com/WenjinGuo/DAEM. 
+ + + + Masked Auto-Encoders Meet Generative Adversarial Networks and Beyond + http://openaccess.thecvf.com//content/CVPR2023/papers/Fei_Masked_Auto-Encoders_Meet_Generative_Adversarial_Networks_and_Beyond_CVPR_2023_paper.pdf + Masked Auto-Encoder (MAE) pretraining methods randomly mask image patches and then train a vision Transformer to reconstruct the original pixels based on the unmasked patches. While these methods demonstrate impressive performance on downstream vision tasks, they generally require a large amount of training resources. In this paper, we introduce a novel Generative Adversarial Network-like framework, referred to as GAN-MAE, where a generator is used to generate the masked patches according to the remaining visible patches, and a discriminator is employed to predict whether the patch is synthesized by the generator. We believe this capacity of distinguishing whether the image patch is predicted or original is beneficial to representation learning. Another key point lies in that the parameters of the vision Transformer backbone in the generator and discriminator are shared. Extensive experiments demonstrate that adversarial training of the GAN-MAE framework is more efficient and accordingly outperforms the standard MAE given the same model size, training data, and computation resources. The gains are substantially robust for different model sizes and datasets; in particular, a ViT-B model trained with GAN-MAE for 200 epochs outperforms the MAE with 1600 epochs on fine-tuning top-1 accuracy of ImageNet-1k with far fewer FLOPs. Besides, our approach also works well when transferring to downstream tasks. + + + + RILS: Masked Visual Reconstruction in Language Semantic Space + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_RILS_Masked_Visual_Reconstruction_in_Language_Semantic_Space_CVPR_2023_paper.pdf + Both masked image modeling (MIM) and natural language supervision have facilitated the progress of transferable visual pre-training. In this work, we seek the synergy between two paradigms and study the emerging properties when MIM meets natural language supervision. To this end, we present a novel masked visual Reconstruction In Language semantic Space (RILS) pre-training framework, in which sentence representations, encoded by the text encoder, serve as prototypes to transform the vision-only signals into patch-sentence probabilities as semantically meaningful MIM reconstruction targets. The vision models can therefore capture useful components with structured information by predicting the proper semantics of masked tokens. Better visual representations could, in turn, improve the text encoder via the image-text alignment objective, which is essential for the effective MIM target transformation. Extensive experimental results demonstrate that our method not only enjoys the best of previous MIM and CLIP but also achieves further improvements on various tasks due to their mutual benefits. RILS exhibits advanced transferability on downstream classification, detection, and segmentation, especially for low-shot regimes. Code is available at https://github.com/hustvl/RILS. + + + + Decoupling Learning and Remembering: A Bilevel Memory Framework With Knowledge Projection for Task-Incremental Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_Decoupling_Learning_and_Remembering_A_Bilevel_Memory_Framework_With_Knowledge_CVPR_2023_paper.pdf + The dilemma between plasticity and stability arises as a common challenge for incremental learning.
In contrast, the human memory system is able to remedy this dilemma owing to its multi-level memory structure, which motivates us to propose a Bilevel Memory system with Knowledge Projection (BMKP) for incremental learning. BMKP decouples the functions of learning and knowledge remembering via a bilevel-memory design: a working memory responsible for adaptive model learning, to ensure plasticity; a long-term memory in charge of enduringly storing the knowledge incorporated within the learned model, to guarantee stability. However, an emerging issue is how to extract the learned knowledge from the working memory and assimilate it into the long-term memory. To approach this issue, we reveal that the model learned by the working memory actually resides in a redundant high-dimensional space, and the knowledge incorporated in the model can have a quite compact representation under a group of pattern bases shared by all incremental learning tasks. Therefore, we propose a knowledge projection process to adaptively maintain the shared basis, with which the loosely organized model knowledge of the working memory is projected into the compact representation to be remembered in the long-term memory. We evaluate BMKP on CIFAR-10, CIFAR-100, and Tiny-ImageNet. The experimental results show that BMKP achieves state-of-the-art performance with lower memory usage. + + + + R2Former: Unified Retrieval and Reranking Transformer for Place Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_R2Former_Unified_Retrieval_and_Reranking_Transformer_for_Place_Recognition_CVPR_2023_paper.pdf + Visual Place Recognition (VPR) estimates the location of query images by matching them with images in a reference database. Conventional methods generally adopt aggregated CNN features for global retrieval and RANSAC-based geometric verification for reranking. However, RANSAC only employs geometric information but ignores other possible information that could be useful for reranking, e.g., local feature correlations and attention values. In this paper, we propose a unified place recognition framework that handles both retrieval and reranking with a novel transformer model, named R2Former. The proposed reranking module takes feature correlation, attention value, and xy coordinates into account, and learns to determine whether the image pair is from the same location. The whole pipeline is end-to-end trainable and the reranking module alone can also be adopted on other CNN or transformer backbones as a generic component. Remarkably, R2Former significantly outperforms state-of-the-art methods on major VPR datasets with much less inference time and memory consumption. It also achieves the state-of-the-art on the hold-out MSLS challenge set and could serve as a simple yet strong solution for real-world large-scale applications. Experiments also show vision transformer tokens are comparable and sometimes better than CNN local features on local matching. The code is released at https://github.com/Jeff-Zilence/R2Former. + + + + Modality-Agnostic Debiasing for Single Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Qu_Modality-Agnostic_Debiasing_for_Single_Domain_Generalization_CVPR_2023_paper.pdf + Deep neural networks (DNNs) usually fail to generalize well to out-of-distribution (OOD) data, especially in the extreme case of single domain generalization (single-DG) that transfers DNNs from a single domain to multiple unseen domains.
Existing single-DG techniques commonly devise various data-augmentation algorithms, and remould the multi-source domain generalization methodology to learn domain-generalized (semantic) features. Nevertheless, these methods are typically modality-specific, thereby being only applicable to one single modality (e.g., image). In contrast, we target a versatile Modality-Agnostic Debiasing (MAD) framework for single-DG, that enables generalization for different modalities. Technically, MAD introduces a novel two-branch classifier: a biased-branch encourages the classifier to identify the domain-specific (superficial) features, and a general-branch captures domain-generalized features based on the knowledge from biased-branch. Our MAD is appealing in view that it is pluggable to most single-DG models. We validate the superiority of our MAD in a variety of single-DG scenarios with different modalities, including recognition on 1D texts, 2D images, 3D point clouds, and semantic segmentation on 2D images. More remarkably, for recognition on 3D point clouds and semantic segmentation on 2D images, MAD improves DSU by 2.82% and 1.5% in accuracy and mIOU. + + + + Difficulty-Based Sampling for Debiased Contrastive Representation Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Jang_Difficulty-Based_Sampling_for_Debiased_Contrastive_Representation_Learning_CVPR_2023_paper.pdf + Contrastive learning is a self-supervised representation learning method that achieves milestone performance in various classification tasks. However, due to its unsupervised fashion, it suffers from the false negative sample problem: randomly drawn negative samples that are assumed to have a different label but actually have the same label as the anchor. This deteriorates the performance of contrastive learning as it contradicts the motivation of contrasting semantically similar and dissimilar pairs. This raised the attention and the importance of finding legitimate negative samples, which should be addressed by distinguishing between 1) true vs. false negatives; 2) easy vs. hard negatives. However, previous works were limited to the statistical approach to handle false negative and hard negative samples with hyperparameters tuning. In this paper, we go beyond the statistical approach and explore the connection between hard negative samples and data bias. We introduce a novel debiased contrastive learning method to explore hard negatives by relative difficulty referencing the bias-amplifying counterpart. We propose triplet loss for training a biased encoder that focuses more on easy negative samples. We theoretically show that the triplet loss amplifies the bias in self-supervised representation learning. Finally, we empirically show the proposed method improves downstream classification performance. + + + + CompletionFormer: Depth Completion With Convolutions and Vision Transformers + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_CompletionFormer_Depth_Completion_With_Convolutions_and_Vision_Transformers_CVPR_2023_paper.pdf + Given sparse depths and the corresponding RGB images, depth completion aims at spatially propagating the sparse measurements throughout the whole image to get a dense depth prediction. Despite the tremendous progress of deep-learning-based depth completion methods, the locality of the convolutional layer or graph model makes it hard for the network to model the long-range relationship between pixels. 
While recent fully Transformer-based architecture has reported encouraging results with the global receptive field, the performance and efficiency gaps to the well-developed CNN models still exist because of its deteriorative local feature details. This paper proposes a joint convolutional attention and Transformer block (JCAT), which deeply couples the convolutional attention layer and Vision Transformer into one block, as the basic unit to construct our depth completion model in a pyramidal structure. This hybrid architecture naturally benefits both the local connectivity of convolutions and the global context of the Transformer in one single model. As a result, our CompletionFormer outperforms state-of-the-art CNNs-based methods on the outdoor KITTI Depth Completion benchmark and indoor NYUv2 dataset, achieving significantly higher efficiency (nearly 1/3 FLOPs) compared to pure Transformer-based methods. Especially when the captured depth is highly sparse, the performance gap with other methods gets much larger. + + + + Improving Visual Grounding by Encouraging Consistent Gradient-Based Explanations + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Improving_Visual_Grounding_by_Encouraging_Consistent_Gradient-Based_Explanations_CVPR_2023_paper.pdf + We propose a margin-based loss for tuning joint vision-language models so that their gradient-based explanations are consistent with region-level annotations provided by humans for relatively smaller grounding datasets. We refer to this objective as Attention Mask Consistency (AMC) and demonstrate that it produces superior visual grounding results than previous methods that rely on using vision-language models to score the outputs of object detectors. Particularly, a model trained with AMC on top of standard vision-language modeling objectives obtains a state-of-the-art accuracy of 86.49% in the Flickr30k visual grounding benchmark, an absolute improvement of 5.38% when compared to the best previous model trained under the same level of supervision. Our approach also performs exceedingly well on established benchmarks for referring expression comprehension where it obtains 80.34% accuracy in the easy test of RefCOCO+, and 64.55% in the difficult split. AMC is effective, easy to implement, and is general as it can be adopted by any vision-language model, and can use any type of region annotations. + + + + Physically Realizable Natural-Looking Clothing Textures Evade Person Detectors via 3D Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Physically_Realizable_Natural-Looking_Clothing_Textures_Evade_Person_Detectors_via_3D_CVPR_2023_paper.pdf + Recent works have proposed to craft adversarial clothes for evading person detectors, while they are either only effective at limited viewing angles or very conspicuous to humans. We aim to craft adversarial texture for clothes based on 3D modeling, an idea that has been used to craft rigid adversarial objects such as a 3D-printed turtle. Unlike rigid objects, humans and clothes are non-rigid, leading to difficulties in physical realization. In order to craft natural-looking adversarial clothes that can evade person detectors at multiple viewing angles, we propose adversarial camouflage textures (AdvCaT) that resemble one kind of the typical textures of daily clothes, camouflage textures. We leverage the Voronoi diagram and Gumbel-softmax trick to parameterize the camouflage textures and optimize the parameters via 3D modeling. 
Moreover, we propose an efficient augmentation pipeline on 3D meshes combining topologically plausible projection (TopoProj) and Thin Plate Spline (TPS) to narrow the gap between digital and real-world objects. We printed the developed 3D texture pieces on fabric materials and tailored them into T-shirts and trousers. Experiments show high attack success rates of these clothes against multiple detectors. + + + + Camouflaged Object Detection With Feature Decomposition and Edge Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Camouflaged_Object_Detection_With_Feature_Decomposition_and_Edge_Reconstruction_CVPR_2023_paper.pdf + Camouflaged object detection (COD) aims to address the tough issue of identifying camouflaged objects visually blended into the surrounding backgrounds. COD is a challenging task due to the intrinsic similarity of camouflaged objects with the background, as well as their ambiguous boundaries. Existing approaches to this problem have developed various techniques to mimic the human visual system. Albeit effective in many cases, these methods still struggle when camouflaged objects are highly deceptive to the visual system. In this paper, we propose the FEature Decomposition and Edge Reconstruction (FEDER) model for COD. The FEDER model addresses the intrinsic similarity of foreground and background by decomposing the features into different frequency bands using learnable wavelets. It then focuses on the most informative bands to mine subtle cues that differentiate foreground and background. To achieve this, a frequency attention module and a guidance-based feature aggregation module are developed. To combat the ambiguous boundary problem, we propose to learn an auxiliary edge reconstruction task alongside the COD task. We design an ordinary differential equation-inspired edge reconstruction module that generates exact edges. By learning the auxiliary task in conjunction with the COD task, the FEDER model can generate precise prediction maps with accurate object boundaries. Experiments show that our FEDER model significantly outperforms state-of-the-art methods with cheaper computational and memory costs. + + + + ALOFT: A Lightweight MLP-Like Architecture With Dynamic Low-Frequency Transform for Domain Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_ALOFT_A_Lightweight_MLP-Like_Architecture_With_Dynamic_Low-Frequency_Transform_for_CVPR_2023_paper.pdf + Domain generalization (DG) aims to learn a model that generalizes well to unseen target domains utilizing multiple source domains without re-training. Most existing DG works are based on convolutional neural networks (CNNs). However, the local operation of the convolution kernel makes the model focus too much on local representations (e.g., texture), which inherently makes the model more prone to overfitting to the source domains and hampers its generalization ability. Recently, several MLP-based methods have achieved promising results in supervised learning tasks by learning global interactions among different patches of the image. Inspired by this, in this paper, we first analyze the difference between CNN and MLP methods in DG and find that MLP methods exhibit a better generalization ability because they can better capture the global representations (e.g., structure) than CNN methods. Then, based on a recent lightweight MLP method, we obtain a strong baseline that outperforms most state-of-the-art CNN-based methods.
The baseline can learn global structure representations with a filter to suppress structure-irrelevant information in the frequency space. Moreover, we propose a dynAmic LOw-Frequency spectrum Transform (ALOFT) that can perturb local texture features while preserving global structure features, thus enabling the filter to remove structure-irrelevant information sufficiently. Extensive experiments on four benchmarks have demonstrated that our method can achieve a significant performance improvement with a small number of parameters compared to SOTA CNN-based DG methods. Our code is available at https://github.com/lingeringlight/ALOFT/. + + + + Learning Visual Representations via Language-Guided Sampling + http://openaccess.thecvf.com//content/CVPR2023/papers/Banani_Learning_Visual_Representations_via_Language-Guided_Sampling_CVPR_2023_paper.pdf + Although an object may appear in numerous contexts, we often describe it in a limited number of ways. Language allows us to abstract away visual variation to represent and communicate concepts. Building on this intuition, we propose an alternative approach to visual representation learning: using language similarity to sample semantically similar image pairs for contrastive learning. Our approach diverges from image-based contrastive learning by sampling view pairs using language similarity instead of hand-crafted augmentations or learned clusters. Our approach also differs from image-text contrastive learning by relying on pre-trained language models to guide the learning rather than directly minimizing a cross-modal loss. Through a series of experiments, we show that language-guided learning yields better features than image-based and image-text representation learning approaches. + + + + Master: Meta Style Transformer for Controllable Zero-Shot and Few-Shot Artistic Style Transfer + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_Master_Meta_Style_Transformer_for_Controllable_Zero-Shot_and_Few-Shot_Artistic_CVPR_2023_paper.pdf + Transformer-based models have recently achieved favorable performance in artistic style transfer thanks to their global receptive field and powerful multi-head/layer attention operations. Nevertheless, the over-parameterized multi-layer structure increases parameters significantly and thus presents a heavy burden for training. Moreover, for the task of style transfer, the vanilla Transformer, which fuses content and style features by residual connections, is prone to content-wise distortion. In this paper, we devise a novel Transformer model termed Master specifically for style transfer. On the one hand, in the proposed model, different Transformer layers share a common group of parameters, which (1) reduces the total number of parameters, (2) leads to more robust training convergence, and (3) makes it easy to control the degree of stylization by freely tuning the number of stacked layers during inference. On the other hand, different from the vanilla version, we adopt a learnable scaling operation on content features before content-style feature interaction, which better preserves the original similarity between a pair of content features while ensuring the stylization quality. We also propose a novel meta-learning scheme for the proposed model so that it can not only work in the typical setting of arbitrary style transfer, but is also adaptable to the few-shot setting, by only fine-tuning the Transformer encoder layer in the few-shot stage for one specific style.
Text-guided few-shot style transfer is firstly achieved with the proposed framework. Extensive experiments demonstrate the superiority of Master under both zero-shot and few-shot style transfer settings. + + + + Affordance Diffusion: Synthesizing Hand-Object Interactions + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_Affordance_Diffusion_Synthesizing_Hand-Object_Interactions_CVPR_2023_paper.pdf + Recent successes in image synthesis are powered by large-scale diffusion models. However, most methods are currently limited to either text- or image-conditioned generation for synthesizing an entire image, texture transfer or inserting objects into a user-specified region. In contrast, in this work we focus on synthesizing complex interactions (i.e., an articulated hand) with a given object. Given an RGB image of an object, we aim to hallucinate plausible images of a human hand interacting with it. We propose a two step generative approach that leverages a LayoutNet that samples an articulation-agnostic hand-object-interaction layout, and a ContentNet that synthesizes images of a hand grasping the object given the predicted layout. Both are built on top of a large-scale pretrained diffusion model to make use of its latent representation. Compared to baselines, the proposed method is shown to generalize better to novel objects and perform surprisingly well on out-of-distribution in-the-wild scenes. The resulting system allows us to predict descriptive affordance information, such as hand articulation and approaching orientation. + + + + Towards Artistic Image Aesthetics Assessment: A Large-Scale Dataset and a New Method + http://openaccess.thecvf.com//content/CVPR2023/papers/Yi_Towards_Artistic_Image_Aesthetics_Assessment_A_Large-Scale_Dataset_and_a_CVPR_2023_paper.pdf + Image aesthetics assessment (IAA) is a challenging task due to its highly subjective nature. Most of the current studies rely on large-scale datasets (e.g., AVA and AADB) to learn a general model for all kinds of photography images. However, little light has been shed on measuring the aesthetic quality of artistic images, and the existing datasets only contain relatively few artworks. Such a defect is a great obstacle to the aesthetic assessment of artistic images. To fill the gap in the field of artistic image aesthetics assessment (AIAA), we first introduce a large-scale AIAA dataset: Boldbrush Artistic Image Dataset (BAID), which consists of 60,337 artistic images covering various art forms, with more than 360,000 votes from online users. We then propose a new method, SAAN (Style-specific Art Assessment Network), which can effectively extract and utilize style-specific and generic aesthetic information to evaluate artistic images. Experiments demonstrate that our proposed approach outperforms existing IAA methods on the proposed BAID dataset according to quantitative comparisons. We believe the proposed dataset and method can serve as a foundation for future AIAA works and inspire more research in this field. + + + + Inverting the Imaging Process by Learning an Implicit Camera Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Inverting_the_Imaging_Process_by_Learning_an_Implicit_Camera_Model_CVPR_2023_paper.pdf + Representing visual signals with implicit coordinate-based neural networks, as an effective replacement of the traditional discrete signal representation, has gained considerable popularity in computer vision and graphics. 
In contrast to existing implicit neural representations which focus on modelling the scene only, this paper proposes a novel implicit camera model which represents the physical imaging process of a camera as a deep neural network. We demonstrate the power of this new implicit camera model on two inverse imaging tasks: i) generating all-in-focus photos, and ii) HDR imaging. Specifically, we devise an implicit blur generator and an implicit tone mapper to model the aperture and exposure of the camera's imaging process, respectively. Our implicit camera model is jointly learned together with implicit scene models under multi-focus stack and multi-exposure bracket supervision. We have demonstrated the effectiveness of our new model on large number of test images and videos, producing accurate and visually appealing all-in-focus and high dynamic range images. In principle, our new implicit neural camera model has the potential to benefit a wide array of other inverse imaging tasks. + + + + Enhanced Training of Query-Based Object Detection via Selective Query Recollection + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Enhanced_Training_of_Query-Based_Object_Detection_via_Selective_Query_Recollection_CVPR_2023_paper.pdf + This paper investigates a phenomenon where query-based object detectors mispredict at the last decoding stage while predicting correctly at an intermediate stage. We review the training process and attribute the overlooked phenomenon to two limitations: lack of training emphasis and cascading errors from decoding sequence. We design and present Selective Query Recollection (SQR), a simple and effective training strategy for query-based object detectors. It cumulatively collects intermediate queries as decoding stages go deeper and selectively forwards the queries to the downstream stages aside from the sequential structure. Such-wise, SQR places training emphasis on later stages and allows later stages to work with intermediate queries from earlier stages directly. SQR can be easily plugged into various query-based object detectors and significantly enhances their performance while leaving the inference pipeline unchanged. As a result, we apply SQR on Adamixer, DAB-DETR, and Deformable-DETR across various settings (backbone, number of queries, schedule) and consistently brings 1.4 2.8 AP improvement. + + + + Detecting Human-Object Contact in Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Detecting_Human-Object_Contact_in_Images_CVPR_2023_paper.pdf + Humans constantly contact objects to move and perform tasks. Thus, detecting human-object contact is important for building human-centered artificial intelligence. However, there exists no robust method to detect contact between the body and the scene from an image, and there exists no dataset to learn such a detector. We fill this gap with HOT ("Human-Object conTact"), a new dataset of human-object contacts in images. To build HOT, we use two data sources: (1) We use the PROX dataset of 3D human meshes moving in 3D scenes, and automatically annotate 2D image areas for contact via 3D mesh proximity and projection. (2) We use the V-COCO, HAKE and Watch-n-Patch datasets, and ask trained annotators to draw polygons around the 2D image areas where contact takes place. We also annotate the involved body part of the human body. 
We use our HOT dataset to train a new contact detector, which takes a single color image as input, and outputs 2D contact heatmaps as well as the body-part labels that are in contact. This is a new and challenging task that extends current foot-ground or hand-object contact detectors to the full generality of the whole body. The detector uses a part-attention branch to guide contact estimation through the context of the surrounding body parts and scene. We evaluate our detector extensively, and quantitative results show that our model outperforms baselines, and that all components contribute to better performance. Results on images from an online repository show reasonable detections and generalizability. Our HOT data and model are available for research at https://hot.is.tue.mpg.de. + + + + PointClustering: Unsupervised Point Cloud Pre-Training Using Transformation Invariance in Clustering + http://openaccess.thecvf.com//content/CVPR2023/papers/Long_PointClustering_Unsupervised_Point_Cloud_Pre-Training_Using_Transformation_Invariance_in_Clustering_CVPR_2023_paper.pdf + Feature invariance under different data transformations, i.e., transformation invariance, can be regarded as a type of self-supervision for representation learning. In this paper, we present PointClustering, a new unsupervised representation learning scheme that leverages transformation invariance for point cloud pre-training. PointClustering formulates the pretext task as deep clustering and employs transformation invariance as an inductive bias, following the philosophy that common point cloud transformations will not change the geometric properties and semantics. Technically, PointClustering iteratively optimizes the feature clusters and backbone, and exploits transformation invariance as a learning regularization from two perspectives: the point level and the instance level. Point-level invariance learning maintains local geometric properties by gathering point features of one instance across transformations, while instance-level invariance learning further measures clusters over the entire dataset to explore the semantics of instances. Our PointClustering is architecture-agnostic and readily applicable to MLP-based, CNN-based, and Transformer-based backbones. We empirically demonstrate that the models pre-learnt on the ScanNet dataset by PointClustering provide superior performance on six benchmarks, across downstream tasks of classification and segmentation. More remarkably, PointClustering achieves an accuracy of 94.5% on ModelNet40 with a Transformer backbone. Source code is available at https://github.com/FuchenUSTC/PointClustering. + + + + Out-of-Distributed Semantic Pruning for Robust Semi-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Out-of-Distributed_Semantic_Pruning_for_Robust_Semi-Supervised_Learning_CVPR_2023_paper.pdf + Recent advances in robust semi-supervised learning (SSL) typically filter out-of-distribution (OOD) information at the sample level. We argue that an overlooked problem of robust SSL is corrupted information at the semantic level, which practically limits the development of the field. In this paper, we take an initial step to explore this problem and propose a unified framework termed OOD Semantic Pruning (OSP), which aims at pruning OOD semantics out from the in-distribution (ID) features. Specifically, (i) we propose an aliasing OOD matching module to pair each ID sample with an OOD sample with semantic overlap.
(ii) We design a soft orthogonality regularization, which first transforms each ID feature by suppressing its semantic component that is collinear with paired OOD sample. It then forces the predictions before and after soft orthogonality transformation to be consistent. Being practically simple, our method shows a strong performance in OOD detection and ID classification on challenging benchmarks. In particular, OSP surpasses the previous state-of-the-art by 13.7% on accuracy for ID classification and 5.9% on AUROC for OOD detection on TinyImageNet dataset. Codes are available in the supplementary material. + + + + Understanding and Improving Visual Prompting: A Label-Mapping Perspective + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Understanding_and_Improving_Visual_Prompting_A_Label-Mapping_Perspective_CVPR_2023_paper.pdf + We revisit and advance visual prompting (VP), an input prompting technique for vision tasks. VP can reprogram a fixed, pre-trained source model to accomplish downstream tasks in the target domain by simply incorporating universal prompts (in terms of input perturbation patterns) into downstream data points. Yet, it remains elusive why VP stays effective even given a ruleless label mapping (LM) between the source classes and the target classes. Inspired by the above, we ask: How is LM interrelated with VP? And how to exploit such a relationship to improve its accuracy on target tasks? We peer into the influence of LM on VP and provide an affirmative answer that a better 'quality' of LM (assessed by mapping precision and explanation) can consistently improve the effectiveness of VP. This is in contrast to the prior art where the factor of LM was missing. To optimize LM, we propose a new VP framework, termed ILM-VP (iterative label mapping-based visual prompting), which automatically re-maps the source labels to the target labels and progressively improves the target task accuracy of VP. Further, when using a contrastive language-image pretrained (CLIP) model, we propose to integrate an LM process to assist the text prompt selection of CLIP and to improve the target task accuracy. Extensive experiments demonstrate that our proposal significantly outperforms state-of-the-art VP methods. As highlighted below, we show that when reprogramming an ImageNet-pretrained ResNet-18 to 13 target tasks, our method outperforms baselines by a substantial margin, e.g., 7.9% and 6.7% accuracy improvements in transfer learning to the target Flowers102 and CIFAR100 datasets. Besides, our proposal on CLIP-based VP provides 13.7% and 7.1% accuracy improvements on Flowers102 and DTD respectively. + + + + DegAE: A New Pretraining Paradigm for Low-Level Vision + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_DegAE_A_New_Pretraining_Paradigm_for_Low-Level_Vision_CVPR_2023_paper.pdf + Self-supervised pretraining has achieved remarkable success in high-level vision, but its application in low-level vision remains ambiguous and not well-established. What is the primitive intention of pretraining? What is the core problem of pretraining in low-level vision? In this paper, we aim to answer these essential questions and establish a new pretraining scheme for low-level vision. Specifically, we examine previous pretraining methods in both high-level and low-level vision, and categorize current low-level vision tasks into two groups based on the difficulty of data acquisition: low-cost and high-cost tasks. 
Existing literature has mainly focused on pretraining for low-cost tasks, where the observed performance improvement is often limited. However, we argue that pretraining is more significant for high-cost tasks, where data acquisition is more challenging. To learn a general low-level vision representation that can improve the performance of various tasks, we propose a new pretraining paradigm called degradation autoencoder (DegAE). DegAE follows the philosophy of designing pretext task for self-supervised pretraining and is elaborately tailored to low-level vision. With DegAE pretraining, SwinIR achieves a 6.88dB performance gain on image dehaze task, while Uformer obtains 3.22dB and 0.54dB improvement on dehaze and derain tasks, respectively. + + + + The Differentiable Lens: Compound Lens Search Over Glass Surfaces and Materials for Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Cote_The_Differentiable_Lens_Compound_Lens_Search_Over_Glass_Surfaces_and_CVPR_2023_paper.pdf + Most camera lens systems are designed in isolation, separately from downstream computer vision methods. Recently, joint optimization approaches that design lenses alongside other components of the image acquisition and processing pipeline--notably, downstream neural networks--have achieved improved imaging quality or better performance on vision tasks. However, these existing methods optimize only a subset of lens parameters and cannot optimize glass materials given their categorical nature. In this work, we develop a differentiable spherical lens simulation model that accurately captures geometrical aberrations. We propose an optimization strategy to address the challenges of lens design--notorious for non-convex loss function landscapes and many manufacturing constraints--that are exacerbated in joint optimization tasks. Specifically, we introduce quantized continuous glass variables to facilitate the optimization and selection of glass materials in an end-to-end design context, and couple this with carefully designed constraints to support manufacturability. In automotive object detection, we report improved detection performance over existing designs even when simplifying designs to two- or three-element lenses, despite significantly degrading the image quality. + + + + Adversarially Masking Synthetic To Mimic Real: Adaptive Noise Injection for Point Cloud Segmentation Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Adversarially_Masking_Synthetic_To_Mimic_Real_Adaptive_Noise_Injection_for_CVPR_2023_paper.pdf + This paper considers the synthetic-to-real adaptation of point cloud semantic segmentation, which aims to segment the real-world point clouds with only synthetic labels available. Contrary to synthetic data which is integral and clean, point clouds collected by real-world sensors typically contain unexpected and irregular noise because the sensors may be impacted by various environmental conditions. Consequently, the model trained on ideal synthetic data may fail to achieve satisfactory segmentation results on real data. Influenced by such noise, previous adversarial training methods, which are conventional for 2D adaptation tasks, become less effective. In this paper, we aim to mitigate the domain gap caused by target noise via learning to mask the source points during the adaptation procedure. To this end, we design a novel learnable masking module, which takes source features and 3D coordinates as inputs. 
We incorporate the Gumbel-Softmax operation into the masking module so that it can generate binary masks and be trained end-to-end via gradient back-propagation. With the help of adversarial training, the masking module can learn to generate source masks that mimic the pattern of irregular target noise, thereby narrowing the domain gap. We name our method "Adversarial Masking" as the adversarial training and the learnable masking module depend on each other and cooperate with each other to mitigate the domain gap. Experiments on two synthetic-to-real adaptation benchmarks verify the effectiveness of the proposed method. + + + + Understanding Deep Generative Models With Generalized Empirical Likelihoods + http://openaccess.thecvf.com//content/CVPR2023/papers/Ravuri_Understanding_Deep_Generative_Models_With_Generalized_Empirical_Likelihoods_CVPR_2023_paper.pdf + Understanding how well a deep generative model captures a distribution of high-dimensional data remains an important open challenge. It is especially difficult for certain model classes, such as Generative Adversarial Networks and Diffusion Models, whose models do not admit exact likelihoods. In this work, we demonstrate that generalized empirical likelihood (GEL) methods offer a family of diagnostic tools that can identify many deficiencies of deep generative models (DGMs). We show, with appropriate specification of moment conditions, that the proposed method can identify which modes have been dropped, the degree to which DGMs are mode imbalanced, and whether DGMs sufficiently capture intra-class diversity. We show how to combine techniques from Maximum Mean Discrepancy and Generalized Empirical Likelihood to create not only distribution tests that retain per-sample interpretability, but also metrics that include label information. We find that such tests predict the degree of mode dropping and mode imbalance up to 60% better than metrics such as improved precision/recall. + + + + CLIPPING: Distilling CLIP-Based Models With a Student Base for Video-Language Retrieval + http://openaccess.thecvf.com//content/CVPR2023/papers/Pei_CLIPPING_Distilling_CLIP-Based_Models_With_a_Student_Base_for_Video-Language_CVPR_2023_paper.pdf + Pre-training a vision-language model and then fine-tuning it on downstream tasks has become a popular paradigm. However, pre-trained vision-language models with the Transformer architecture usually incur long inference times. Knowledge distillation (KD) has been an efficient technique to transfer the capability of a large model to a small one while maintaining the accuracy, and it has achieved remarkable success in natural language processing. However, KD faces many problems when applied to multi-modality applications. In this paper, we propose a novel knowledge distillation method, named CLIPPING, where the plentiful knowledge of a large teacher model that has been fine-tuned for video-language tasks with the powerful pre-trained CLIP can be effectively transferred to a small student only at the fine-tuning stage. In particular, a new layer-wise alignment with the student as the base is proposed for knowledge distillation of the intermediate layers in CLIPPING, which enables the student's layers to be the bases of the teacher and thus allows the student to fully absorb the knowledge of the teacher. CLIPPING with MobileViT-v2 as the vision encoder, without any vision-language pre-training, achieves 88.1%-95.3% of the performance of its teacher on three video-language retrieval benchmarks, with its vision encoder being 19.5x smaller.
CLIPPING also significantly outperforms a state-of-the-art small baseline (ALL-in-one-B) on the MSR-VTT dataset, obtaining a relative 7.4% performance gain with 29% fewer parameters and 86.9% fewer FLOPs. Moreover, CLIPPING is comparable or even superior to many large pre-training models. + + + + BEVHeight: A Robust Framework for Vision-Based Roadside 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_BEVHeight_A_Robust_Framework_for_Vision-Based_Roadside_3D_Object_Detection_CVPR_2023_paper.pdf + While most recent autonomous driving systems focus on developing perception methods for ego-vehicle sensors, people tend to overlook an alternative approach that leverages intelligent roadside cameras to extend the perception ability beyond the visual range. We discover that state-of-the-art vision-centric bird's eye view detection methods have inferior performance on roadside cameras. This is because these methods mainly focus on recovering the depth with respect to the camera center, where the depth difference between the car and the ground quickly shrinks as the distance increases. In this paper, we propose a simple yet effective approach, dubbed BEVHeight, to address this issue. In essence, instead of predicting the pixel-wise depth, we regress the height to the ground to achieve a distance-agnostic formulation that eases the optimization of camera-only perception methods. On popular 3D detection benchmarks of roadside cameras, our method surpasses all previous vision-centric methods by a significant margin. The code is available at https://github.com/ADLab-AutoDrive/BEVHeight. + + + + LASP: Text-to-Text Optimization for Language-Aware Soft Prompting of Vision & Language Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Bulat_LASP_Text-to-Text_Optimization_for_Language-Aware_Soft_Prompting_of_Vision__CVPR_2023_paper.pdf + Soft prompt learning has recently emerged as one of the methods of choice for adapting V&L models to a downstream task using a few training examples. However, current methods significantly overfit the training data, suffering from large accuracy degradation when tested on unseen classes from the same domain. To this end, in this paper, we make the following four contributions: (1) To alleviate base class overfitting, we propose a novel Language-Aware Soft Prompting (LASP) learning method by means of a text-to-text cross-entropy loss that maximizes the probability that the learned prompts are correctly classified with respect to pre-defined hand-crafted textual prompts. (2) To increase the representation capacity of the prompts, we propose grouped LASP, where each group of prompts is optimized with respect to a separate subset of textual prompts. (3) We identify a visual-language misalignment introduced by prompt learning and LASP, and, more importantly, propose a re-calibration mechanism to address it. (4) We show that LASP is inherently amenable to including, during training, virtual classes, i.e., class names for which no visual samples are available, further increasing the robustness of the learned prompts. Through evaluations on 11 datasets, we show that our approach (a) significantly outperforms all prior works on soft prompting, and (b) matches and surpasses, for the first time, the accuracy on novel classes obtained by hand-crafted prompts and CLIP for 8 out of 11 test datasets. Code will be made available.
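Editor's note: the LASP abstract above hinges on a text-to-text cross-entropy loss that encourages each learned soft prompt to be classified as its own hand-crafted counterpart. The snippet below is only a minimal PyTorch-style sketch of that idea, not the authors' implementation; `text_encoder`, the prompt shapes, and the temperature are assumptions introduced for illustration.

```python
# Minimal sketch of a LASP-style text-to-text loss (assumptions: a frozen CLIP-like
# text encoder that maps token-embedding sequences to one feature per class prompt;
# shapes and hyperparameters are illustrative, not taken from the paper).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageAwareSoftPrompts(nn.Module):
    def __init__(self, text_encoder, handcrafted_prompt_feats, n_ctx=4, ctx_dim=512):
        super().__init__()
        self.text_encoder = text_encoder                      # frozen text encoder (assumed)
        self.register_buffer("anchor_feats",                  # [num_classes, dim], pre-encoded
                             F.normalize(handcrafted_prompt_feats, dim=-1))
        self.ctx = nn.Parameter(torch.randn(n_ctx, ctx_dim) * 0.02)  # learned soft prompt tokens

    def encode_learned_prompts(self, class_token_embeds):
        # class_token_embeds: [num_classes, n_tok, dim] embeddings of the class names
        n_cls = class_token_embeds.shape[0]
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        prompts = torch.cat([ctx, class_token_embeds], dim=1)  # prepend soft context
        return F.normalize(self.text_encoder(prompts), dim=-1)  # [num_classes, dim]

    def text_to_text_loss(self, class_token_embeds, temperature=0.07):
        # Each learned class prompt should be classified as its own hand-crafted prompt.
        learned = self.encode_learned_prompts(class_token_embeds)
        logits = learned @ self.anchor_feats.t() / temperature   # [C, C] similarity matrix
        targets = torch.arange(logits.shape[0], device=logits.device)
        return F.cross_entropy(logits, targets)
```

In this reading, the hand-crafted prompt features act as fixed anchors, so the soft prompts are regularized toward language that CLIP already understands rather than drifting to base-class-specific solutions.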
+ + + + AutoAD: Movie Description in Context + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_AutoAD_Movie_Description_in_Context_CVPR_2023_paper.pdf + The objective of this paper is an automatic Audio Description (AD) model that ingests movies and outputs AD in text form. Generating high-quality movie AD is challenging due to the dependency of the descriptions on context, and the limited amount of training data available. In this work, we leverage the power of pretrained foundation models, such as GPT and CLIP, and only train a mapping network that bridges the two models for visually-conditioned text generation. In order to obtain high-quality AD, we make the following four contributions: (i) we incorporate context from the movie clip, AD from previous clips, as well as the subtitles; (ii) we address the lack of training data by pretraining on large-scale datasets, where visual or contextual information is unavailable, e.g. text-only AD without movies or visual captioning datasets without context; (iii) we improve on the currently available AD datasets, by removing label noise in the MAD dataset, and adding character naming information; and (iv) we obtain strong results on the movie AD task compared with previous methods. + + + + SceneComposer: Any-Level Semantic Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Zeng_SceneComposer_Any-Level_Semantic_Image_Synthesis_CVPR_2023_paper.pdf + We propose a new framework for conditional image synthesis from semantic layouts of any precision levels, ranging from pure text to a 2D semantic canvas with precise shapes. More specifically, the input layout consists of one or more semantic regions with free-form text descriptions and adjustable precision levels, which can be set based on the desired controllability. The framework naturally reduces to text-to-image (T2I) at the lowest level with no shape information, and it becomes segmentation-to-image (S2I) at the highest level. By supporting the levels in-between, our framework is flexible in assisting users of different drawing expertise and at different stages of their creative workflow. We introduce several novel techniques to address the challenges coming with this new setup, including a pipeline for collecting training data; a precision-encoded mask pyramid and a text feature map representation to jointly encode precision level, semantics, and composition information; and a multi-scale guided diffusion model to synthesize images. To evaluate the proposed method, we collect a test dataset containing user-drawn layouts with diverse scenes and styles. Experimental results show that the proposed method can generate high-quality images following the layout at given precision, and compares favorably against existing methods. Project page https://zengxianyu.github.io/scenec/ + + + + MaPLe: Multi-Modal Prompt Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Khattak_MaPLe_Multi-Modal_Prompt_Learning_CVPR_2023_paper.pdf + Pre-trained vision-language (V-L) models such as CLIP have shown excellent generalization ability to downstream tasks. However, they are sensitive to the choice of input text prompts and require careful selection of prompt templates to perform well. Inspired by the Natural Language Processing (NLP) literature, recent CLIP adaptation approaches learn prompts as the textual inputs to fine-tune CLIP for downstream tasks. 
We note that using prompting to adapt representations in a single branch of CLIP (language or vision) is sub-optimal since it does not allow the flexibility to dynamically adjust both representation spaces on a downstream task. In this work, we propose Multi-modal Prompt Learning (MaPLe) for both vision and language branches to improve alignment between the vision and language representations. Our design promotes strong coupling between the vision-language prompts to ensure mutual synergy and discourages learning independent uni-modal solutions. Further, we learn separate prompts across different early stages to progressively model the stage-wise feature relationships to allow rich context learning. We evaluate the effectiveness of our approach on three representative tasks of generalization to novel classes, new target datasets and unseen domain shifts. Compared with the state-of-the-art method Co-CoOp, MaPLe exhibits favorable performance and achieves an absolute gain of 3.45% on novel classes and 2.72% on overall harmonic-mean, averaged over 11 diverse image recognition datasets. Our code and pre-trained models are available at https://github.com/muzairkhattak/multimodal-prompt-learning. + + + + Scaling Language-Image Pre-Training via Masking + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Scaling_Language-Image_Pre-Training_via_Masking_CVPR_2023_paper.pdf + We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and contrast more samples per iteration with similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. On a large diversity of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning. + + + + DNF: Decouple and Feedback Network for Seeing in the Dark + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_DNF_Decouple_and_Feedback_Network_for_Seeing_in_the_Dark_CVPR_2023_paper.pdf + The exclusive properties of RAW data have shown great potential for low-light image enhancement. Nevertheless, the performance is bottlenecked by the inherent limitations of existing architectures in both single-stage and multi-stage methods. Mixed mapping across two different domains, noise-to-clean and RAW-to-sRGB, misleads the single-stage methods due to the domain ambiguity. The multi-stage methods propagate the information merely through the resulting image of each stage, neglecting the abundant features in the lossy image-level dataflow. In this paper, we probe a generalized solution to these bottlenecks and propose a Decouple aNd Feedback framework, abbreviated as DNF. To mitigate the domain ambiguity, domainspecific subtasks are decoupled, along with fully utilizing the unique properties in RAW and sRGB domains. The feature propagation across stages with a feedback mechanism avoids the information loss caused by image-level dataflow. 
The two key insights of our method resolve the inherent limitations of RAW data-based low-light image enhancement satisfactorily, empowering our method to outperform the previous state-of-the-art method by a large margin with only 19% parameters, achieving 0.97dB and 1.30dB PSNR improvements on the Sony and Fuji subsets of SID. + + + + Deformable Mesh Transformer for 3D Human Mesh Recovery + http://openaccess.thecvf.com//content/CVPR2023/papers/Yoshiyasu_Deformable_Mesh_Transformer_for_3D_Human_Mesh_Recovery_CVPR_2023_paper.pdf + We present Deformable mesh transFormer (DeFormer), a novel vertex-based approach to monocular 3D human mesh recovery. DeFormer iteratively fits a body mesh model to an input image via a mesh alignment feedback loop formed within a transformer decoder that is equipped with efficient body mesh driven attention modules: 1) body sparse self-attention and 2) deformable mesh cross attention. As a result, DeFormer can effectively exploit high-resolution image feature maps and a dense mesh model which were computationally expensive to deal with in previous approaches using the standard transformer attention. Experimental results show that DeFormer achieves state-of-the-art performances on the Human3.6M and 3DPW benchmarks. Ablation study is also conducted to show the effectiveness of the DeFormer model designs for leveraging multi-scale feature maps. Code is available at https://github.com/yusukey03012/DeFormer. + + + + Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting + http://openaccess.thecvf.com//content/CVPR2023/papers/Wasim_Vita-CLIP_Video_and_Text_Adaptive_CLIP_via_Multimodal_Prompting_CVPR_2023_paper.pdf + Adopting contrastive image-text pretrained models like CLIP towards video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes significant drop in supervised accuracy. Because of this, recent works in literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that works to balance the supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) Global video-level prompts to model the data distribution; 2) Local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we can achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize a much lower number of parameters and retain the existing general representation which helps achieve the strong zero-shot performance. Our codes and models will be publicly released. 
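Editor's note: the Vita-CLIP abstract above describes prompting a frozen video CLIP backbone with global video-level, local frame-level, and summary prompts. The following is a hedged sketch of what such prompt tokens could look like; the frozen backbone interface, shapes, and pooling are assumptions, not the paper's design.

```python
# Minimal sketch of video prompt tokens in the spirit of the Vita-CLIP abstract
# (global video-level prompts, per-frame local prompts, and a summary prompt).
# The frozen backbone is assumed to accept a token sequence and return one per token.
import torch
import torch.nn as nn

class VideoPromptedEncoder(nn.Module):
    def __init__(self, frozen_vit, dim=768, n_frames=8, n_global=4):
        super().__init__()
        self.vit = frozen_vit.eval()                      # frozen CLIP-like transformer blocks
        for p in self.vit.parameters():
            p.requires_grad = False                       # only prompts are trainable
        self.global_prompts = nn.Parameter(torch.randn(n_global, dim) * 0.02)
        self.frame_prompts = nn.Parameter(torch.randn(n_frames, dim) * 0.02)
        self.summary_prompt = nn.Parameter(torch.randn(1, dim) * 0.02)

    def forward(self, patch_tokens):
        # patch_tokens: [B, T, N, D] per-frame patch embeddings from the frozen stem
        B, T, N, D = patch_tokens.shape
        x = patch_tokens.reshape(B, T * N, D)
        prompts = torch.cat([
            self.summary_prompt,                          # condensed video representation
            self.global_prompts,                          # video-level context
            self.frame_prompts[:T],                       # per-frame discriminative conditioning
        ], dim=0).unsqueeze(0).expand(B, -1, -1)
        x = torch.cat([prompts, x], dim=1)                # prepend learnable prompt tokens
        x = self.vit(x)                                   # frozen backbone; gradients hit prompts only
        return x[:, 0]                                    # summary token as the video feature
```

Keeping the backbone frozen and training only these few prompt parameters is what lets this family of methods retain zero-shot behavior while adding supervised capacity.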
+ + + + HS-Pose: Hybrid Scope Feature Extraction for Category-Level Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Zheng_HS-Pose_Hybrid_Scope_Feature_Extraction_for_Category-Level_Object_Pose_Estimation_CVPR_2023_paper.pdf + In this paper, we focus on the problem of category-level object pose estimation, which is challenging due to the large intra-category shape variation. 3D graph convolution (3D-GC) based methods have been widely used to extract local geometric features, but they have limitations for complex shaped objects and are sensitive to noise. Moreover, the scale and translation invariant properties of 3D-GC restrict the perception of an object's size and translation information. In this paper, we propose a simple network structure, the HS-layer, which extends 3D-GC to extract hybrid scope latent features from point cloud data for category-level object pose estimation tasks. The proposed HS-layer: 1) is able to perceive local-global geometric structure and global information, 2) is robust to noise, and 3) can encode size and translation information. Our experiments show that the simple replacement of the 3D-GC layer with the proposed HS-layer on the baseline method (GPV-Pose) achieves a significant improvement, with the performance increased by 14.5% on 5d2cm metric and 10.3% on IoU75. Our method outperforms the state-of-the-art methods by a large margin (8.3% on 5d2cm, 6.9% on IoU75) on REAL275 dataset and runs in real-time (50 FPS). + + + + LayoutDM: Transformer-Based Diffusion Model for Layout Generation + http://openaccess.thecvf.com//content/CVPR2023/papers/Chai_LayoutDM_Transformer-Based_Diffusion_Model_for_Layout_Generation_CVPR_2023_paper.pdf + Automatic layout generation that can synthesize high-quality layouts is an important tool for graphic design in many applications. Though existing methods based on generative models such as Generative Adversarial Networks (GANs) and Variational Auto-Encoders (VAEs) have progressed, they still leave much room for improving the quality and diversity of the results. Inspired by the recent success of diffusion models in generating high-quality images, this paper explores their potential for conditional layout generation and proposes Transformer-based Layout Diffusion Model (LayoutDM) by instantiating the conditional denoising diffusion probabilistic model (DDPM) with a purely transformer-based architecture. Instead of using convolutional neural networks, a transformer-based conditional Layout Denoiser is proposed to learn the reverse diffusion process to generate samples from noised layout data. Benefitting from both transformer and DDPM, our LayoutDM is of desired properties such as high-quality generation, strong sample diversity, faithful distribution coverage, and stationary training in comparison to GANs and VAEs. Quantitative and qualitative experimental results show that our method outperforms state-of-the-art generative models in terms of quality and diversity. + + + + HandNeRF: Neural Radiance Fields for Animatable Interacting Hands + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_HandNeRF_Neural_Radiance_Fields_for_Animatable_Interacting_Hands_CVPR_2023_paper.pdf + We propose a novel framework to reconstruct accurate appearance and geometry with neural radiance fields (NeRF) for interacting hands, enabling the rendering of photo-realistic images and videos for gesture animation from arbitrary views. 
Given multi-view images of a single hand or interacting hands, an off-the-shelf skeleton estimator is first employed to parameterize the hand poses. Then we design a pose-driven deformation field to establish correspondence from those different poses to a shared canonical space, where a pose-disentangled NeRF for one hand is optimized. Such unified modeling efficiently complements the geometry and texture cues in rarely-observed areas for both hands. Meanwhile, we further leverage the pose priors to generate pseudo depth maps as guidance for occlusion-aware density learning. Moreover, a neural feature distillation method is proposed to achieve cross-domain alignment for color optimization. We conduct extensive experiments to verify the merits of our proposed HandNeRF and report a series of state-of-the-art results both qualitatively and quantitatively on the large-scale InterHand2.6M dataset. + + + + Introducing Competition To Boost the Transferability of Targeted Adversarial Examples Through Clean Feature Mixup + http://openaccess.thecvf.com//content/CVPR2023/papers/Byun_Introducing_Competition_To_Boost_the_Transferability_of_Targeted_Adversarial_Examples_CVPR_2023_paper.pdf + Deep neural networks are widely known to be susceptible to adversarial examples, which can cause incorrect predictions through subtle input modifications. These adversarial examples tend to be transferable between models, but targeted attacks still have lower attack success rates due to significant variations in decision boundaries. To enhance the transferability of targeted adversarial examples, we propose introducing competition into the optimization process. Our idea is to craft adversarial perturbations in the presence of two new types of competitor noises: adversarial perturbations towards different target classes and friendly perturbations towards the correct class. With these competitors, even if an adversarial example deceives a network to extract specific features leading to the target class, this disturbance can be suppressed by other competitors. Therefore, within this competition, adversarial examples should take different attack strategies by leveraging more diverse features to overwhelm their interference, leading to improving their transferability to different models. Considering the computational complexity, we efficiently simulate various interference from these two types of competitors in feature space by randomly mixing up stored clean features in the model inference and named this method Clean Feature Mixup (CFM). Our extensive experimental results on the ImageNet-Compatible and CIFAR-10 datasets show that the proposed method outperforms the existing baselines with a clear margin. Our code is available at https://github.com/dreamflake/CFM. + + + + A Whac-a-Mole Dilemma: Shortcuts Come in Multiples Where Mitigating One Amplifies Others + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_A_Whac-a-Mole_Dilemma_Shortcuts_Come_in_Multiples_Where_Mitigating_One_CVPR_2023_paper.pdf + Machine learning models have been found to learn shortcuts---unintended decision rules that are unable to generalize---undermining models' reliability. Previous works address this problem under the tenuous assumption that only a single shortcut exists in the training data. Real-world images are rife with multiple visual cues from background to texture. 
Key to advancing the reliability of vision systems is understanding whether existing methods can overcome multiple shortcuts or struggle in a Whac-A-Mole game, i.e., where mitigating one shortcut amplifies reliance on others. To address this shortcoming, we propose two benchmarks: 1) UrbanCars, a dataset with precisely controlled spurious cues, and 2) ImageNet-W, an evaluation set based on ImageNet for watermark, a shortcut we discovered affects nearly every modern vision model. Along with texture and background, ImageNet-W allows us to study multiple shortcuts emerging from training on natural images. We find computer vision models, including large foundation models---regardless of training set, architecture, and supervision---struggle when multiple shortcuts are present. Even methods explicitly designed to combat shortcuts struggle in a Whac-A-Mole dilemma. To tackle this challenge, we propose Last Layer Ensemble, a simple-yet-effective method to mitigate multiple shortcuts without Whac-A-Mole behavior. Our results surface multi-shortcut mitigation as an overlooked challenge critical to advancing the reliability of vision systems. The datasets and code are released: https://github.com/facebookresearch/Whac-A-Mole. + + + + Efficient Scale-Invariant Generator With Column-Row Entangled Pixel Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Nguyen_Efficient_Scale-Invariant_Generator_With_Column-Row_Entangled_Pixel_Synthesis_CVPR_2023_paper.pdf + Any-scale image synthesis offers an efficient and scalable solution to synthesize photo-realistic images at any scale, even going beyond 2K resolution. However, existing GAN-based solutions depend excessively on convolutions and a hierarchical architecture, which introduce inconsistency and the "texture sticking" issue when scaling the output resolution. From another perspective, INR-based generators are scale-equivariant by design, but their huge memory footprint and slow inference hinder these networks from being adopted in large-scale or real-time systems. In this work, we propose Column-Row Entangled Pixel Synthesisthes (CREPS), a new generative model that is both efficient and scale-equivariant without using any spatial convolutions or coarse-to-fine design. To save memory footprint and make the system scalable, we employ a novel bi-line representation that decomposes layer-wise feature maps into separate "thick" column and row encodings. Experiments on standard datasets, including FFHQ, LSUN-Church, and MetFaces, confirm CREPS' ability to synthesize scale-consistent and alias-free images up to 4K resolution with proper training and inference speed. + + + + H2ONet: Hand-Occlusion-and-Orientation-Aware Network for Real-Time 3D Hand Mesh Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_H2ONet_Hand-Occlusion-and-Orientation-Aware_Network_for_Real-Time_3D_Hand_Mesh_Reconstruction_CVPR_2023_paper.pdf + Real-time 3D hand mesh reconstruction is challenging, especially when the hand is holding some object. Beyond the previous methods, we design H2ONet to fully exploit non-occluded information from multiple frames to boost the reconstruction quality. First, we decouple hand mesh reconstruction into two branches, one to exploit finger-level non-occluded information and the other to exploit global hand orientation, with lightweight structures to promote real-time inference. 
Second, we propose finger-level occlusion-aware feature fusion, leveraging predicted finger-level occlusion information as guidance to fuse finger-level information across time frames. Further, we design hand-level occlusion-aware feature fusion to fetch non-occluded information from nearby time frames. We conduct experiments on the Dex-YCB and HO3D-v2 datasets with challenging hand-object occlusion cases, manifesting that H2ONet is able to run in real-time and achieves state-of-the-art performance on both the hand mesh and pose precision. The code will be released on GitHub. + + + + Interventional Bag Multi-Instance Learning on Whole-Slide Pathological Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_Interventional_Bag_Multi-Instance_Learning_on_Whole-Slide_Pathological_Images_CVPR_2023_paper.pdf + Multi-instance learning (MIL) is an effective paradigm for whole-slide pathological images (WSIs) classification to handle the gigapixel resolution and slide-level label. Prevailing MIL methods primarily focus on improving the feature extractor and aggregator. However, one deficiency of these methods is that the bag contextual prior may trick the model into capturing spurious correlations between bags and labels. This deficiency is a confounder that limits the performance of existing MIL methods. In this paper, we propose a novel scheme, Interventional Bag Multi-Instance Learning (IBMIL), to achieve deconfounded bag-level prediction. Unlike traditional likelihood-based strategies, the proposed scheme is based on the backdoor adjustment to achieve the interventional training, thus is capable of suppressing the bias caused by the bag contextual prior. Note that the principle of IBMIL is orthogonal to existing bag MIL methods. Therefore, IBMIL is able to bring consistent performance boosting to existing schemes, achieving new state-of-the-art performance. Code is available at https://github.com/HHHedo/IBMIL. + + + + RankMix: Data Augmentation for Weakly Supervised Learning of Classifying Whole Slide Images With Diverse Sizes and Imbalanced Categories + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_RankMix_Data_Augmentation_for_Weakly_Supervised_Learning_of_Classifying_Whole_CVPR_2023_paper.pdf + Whole Slide Images (WSIs) are usually gigapixel in size and lack pixel-level annotations. The WSI datasets are also imbalanced in categories. These unique characteristics, significantly different from the ones in natural images, pose the challenge of classifying WSI images as a kind of weakly supervised learning problem. In this study, we propose RankMix, a data augmentation method of mixing ranked features in a pair of WSIs. RankMix introduces the concepts of pseudo labeling and ranking in order to extract key WSI regions contributing to the WSI classification task. A two-stage training is further proposed to boost stable training and model performance. To our knowledge, the study of weakly supervised learning from the perspective of data augmentation to deal with the WSI classification problem that suffers from lack of training data and imbalance of categories is relatively unexplored. 
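The RankMix entry above names its ingredients (pseudo labels, ranking, feature mixing) without spelling out the mixing step. As a hedged illustration of one plausible reading, the NumPy sketch below mixes the top-ranked instance features of two whole-slide-image feature bags with a Beta-sampled coefficient; the function names, the choice of attention scores for ranking, and all constants are assumptions of this sketch, not the authors' released code.

import numpy as np

# Hedged sketch, not the authors' implementation: mix the top-ranked instance
# features of two WSI feature bags. Each bag is an (N_i, D) array; `scores` are
# per-instance importance weights (e.g. attention weights from a first-stage MIL
# model, which would also supply the pseudo labels mentioned in the abstract).
def rank_and_trim(feats, scores, k):
    """Keep the k highest-scoring instance features, ordered by score."""
    order = np.argsort(scores)[::-1][:k]
    return feats[order]

def rankmix_style_mix(feats_a, scores_a, label_a,
                      feats_b, scores_b, label_b,
                      k=512, alpha=1.0, rng=None):
    """Mix two ranked feature bags and their (pseudo) labels with one lambda."""
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    top_a = rank_and_trim(feats_a, scores_a, k)
    top_b = rank_and_trim(feats_b, scores_b, k)
    k_eff = min(len(top_a), len(top_b))            # bags can be smaller than k
    mixed_feats = lam * top_a[:k_eff] + (1.0 - lam) * top_b[:k_eff]
    mixed_label = lam * label_a + (1.0 - lam) * label_b
    return mixed_feats, mixed_label

# Usage: features come from a frozen patch encoder; the mixed bag and soft label
# are fed to the bag-level classifier as an extra augmented training sample.
bag_a, bag_b = np.random.randn(900, 256), np.random.randn(1400, 256)
sc_a, sc_b = np.random.rand(900), np.random.rand(1400)
y_a, y_b = np.eye(2)[0], np.eye(2)[1]
mixed_x, mixed_y = rankmix_style_mix(bag_a, sc_a, y_a, bag_b, sc_b, y_b)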
+ + + + ActMAD: Activation Matching To Align Distributions for Test-Time-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Mirza_ActMAD_Activation_Matching_To_Align_Distributions_for_Test-Time-Training_CVPR_2023_paper.pdf + Test-Time-Training (TTT) is an approach to cope with out-of-distribution (OOD) data by adapting a trained model to distribution shifts occurring at test-time. We propose to perform this adaptation via Activation Matching (ActMAD): We analyze activations of the model and align activation statistics of the OOD test data to those of the training data. In contrast to existing methods, which model the distribution of entire channels in the ultimate layer of the feature extractor, we model the distribution of each feature in multiple layers across the network. This results in a more fine-grained supervision and makes ActMAD attain state of the art performance on CIFAR-100C and Imagenet-C. ActMAD is also architecture- and task-agnostic, which lets us go beyond image classification, and score 15.4% improvement over previous approaches when evaluating a KITTI-trained object detector on KITTI-Fog. Our experiments highlight that ActMAD can be applied to online adaptation in realistic scenarios, requiring little data to attain its full performance. + + + + DKM: Dense Kernelized Feature Matching for Geometry Estimation + http://openaccess.thecvf.com//content/CVPR2023/papers/Edstedt_DKM_Dense_Kernelized_Feature_Matching_for_Geometry_Estimation_CVPR_2023_paper.pdf + Feature matching is a challenging computer vision task that involves finding correspondences between two images of a 3D scene. In this paper we consider the dense approach instead of the more common sparse paradigm, thus striving to find all correspondences. Perhaps counter-intuitively, dense methods have previously shown inferior performance to their sparse and semi-sparse counterparts for estimation of two-view geometry. This changes with our novel dense method, which outperforms both dense and sparse methods on geometry estimation. The novelty is threefold: First, we propose a kernel regression global matcher. Secondly, we propose warp refinement through stacked feature maps and depthwise convolution kernels. Thirdly, we propose learning dense confidence through consistent depth and a balanced sampling approach for dense confidence maps. Through extensive experiments we confirm that our proposed dense method, Dense Kernelized Feature Matching, sets a new state-of-the-art on multiple geometry estimation benchmarks. In particular, we achieve an improvement on MegaDepth-1500 of +4.9 and +8.9 AUC@5 compared to the best previous sparse method and dense method respectively. Our code is provided at the following repository: https://github.com/Parskatt/DKM + + + + Structured 3D Features for Reconstructing Controllable Avatars + http://openaccess.thecvf.com//content/CVPR2023/papers/Corona_Structured_3D_Features_for_Reconstructing_Controllable_Avatars_CVPR_2023_paper.pdf + We introduce Structured 3D Features, a model based on a novel implicit 3D representation that pools pixel-aligned image features onto dense 3D points sampled from a parametric, statistical human mesh surface. The 3D points have associated semantics and can move freely in 3D space. This allows for optimal coverage of the person of interest, beyond just the body shape, which in turn, additionally helps modeling accessories, hair, and loose clothing. 
Owing to this, we present a complete 3D transformer-based attention framework which, given a single image of a person in an unconstrained pose, generates an animatable 3D reconstruction with albedo and illumination decomposition, as a result of a single end-to-end model, trained semi-supervised, and with no additional postprocessing. We show that our S3F model surpasses the previous state-of-the-art on various tasks, including monocular 3D reconstruction, as well as albedo & shading estimation. Moreover, we show that the proposed methodology allows novel view synthesis, relighting, and re-posing the reconstruction, and can naturally be extended to handle multiple input images (e.g. different views of a person, or the same view, in different poses, in video). Finally, we demonstrate the editing capabilities of our model for 3D virtual try-on applications. + + + + Active Finetuning: Exploiting Annotation Budget in the Pretraining-Finetuning Paradigm + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_Active_Finetuning_Exploiting_Annotation_Budget_in_the_Pretraining-Finetuning_Paradigm_CVPR_2023_paper.pdf + Given the large-scale data and the high annotation cost, pretraining-finetuning becomes a popular paradigm in multiple computer vision tasks. Previous research has covered both the unsupervised pretraining and supervised finetuning in this paradigm, while little attention is paid to exploiting the annotation budget for finetuning. To fill in this gap, we formally define this new active finetuning task focusing on the selection of samples for annotation in the pretraining-finetuning paradigm. We propose a novel method called ActiveFT for active finetuning task to select a subset of data distributing similarly with the entire unlabeled pool and maintaining enough diversity by optimizing a parametric model in the continuous space. We prove that the Earth Mover's distance between the distributions of the selected subset and the entire data pool is also reduced in this process. Extensive experiments show the leading performance and high efficiency of ActiveFT superior to baselines on both image classification and semantic segmentation. + + + + In-Hand 3D Object Scanning From an RGB Sequence + http://openaccess.thecvf.com//content/CVPR2023/papers/Hampali_In-Hand_3D_Object_Scanning_From_an_RGB_Sequence_CVPR_2023_paper.pdf + We propose a method for in-hand 3D scanning of an unknown object with a monocular camera. Our method relies on a neural implicit surface representation that captures both the geometry and the appearance of the object, however, by contrast with most NeRF-based methods, we do not assume that the camera-object relative poses are known. Instead, we simultaneously optimize both the object shape and the pose trajectory. As direct optimization over all shape and pose parameters is prone to fail without coarse-level initialization, we propose an incremental approach that starts by splitting the sequence into carefully selected overlapping segments within which the optimization is likely to succeed. We reconstruct the object shape and track its poses independently within each segment, then merge all the segments before performing a global optimization. We show that our method is able to reconstruct the shape and color of both textured and challenging texture-less objects, outperforms classical methods that rely only on appearance features, and that its performance is close to recent methods that assume known camera poses. 
+ + + + Zero-Shot Referring Image Segmentation With Global-Local Context Features + http://openaccess.thecvf.com//content/CVPR2023/papers/Yu_Zero-Shot_Referring_Image_Segmentation_With_Global-Local_Context_Features_CVPR_2023_paper.pdf + Referring image segmentation (RIS) aims to find a segmentation mask given a referring expression grounded to a region of the input image. Collecting labelled datasets for this task, however, is notoriously costly and labor-intensive. To overcome this issue, we propose a simple yet effective zero-shot referring image segmentation method by leveraging the pre-trained cross-modal knowledge from CLIP. In order to obtain segmentation masks grounded to the input text, we propose a mask-guided visual encoder that captures global and local contextual information of an input image. By utilizing instance masks obtained from off-the-shelf mask proposal techniques, our method is able to segment fine-detailed instance-level groundings. We also introduce a global-local text encoder where the global feature captures complex sentence-level semantics of the entire input expression while the local feature focuses on the target noun phrase extracted by a dependency parser. In our experiments, the proposed method outperforms several zero-shot baselines of the task and even the weakly supervised referring expression segmentation method with substantial margins. Our code is available at https://github.com/Seonghoon-Yu/Zero-shot-RIS. + + + + SketchXAI: A First Look at Explainability for Human Sketches + http://openaccess.thecvf.com//content/CVPR2023/papers/Qu_SketchXAI_A_First_Look_at_Explainability_for_Human_Sketches_CVPR_2023_paper.pdf + This paper, for the very first time, introduces human sketches to the landscape of XAI (Explainable Artificial Intelligence). We argue that sketch, as a "human-centred" data form, represents a natural interface to study explainability. We focus on cultivating sketch-specific explainability designs. This starts by identifying strokes as a unique building block that offers a degree of flexibility in object construction and manipulation impossible in photos. Following this, we design a simple explainability-friendly sketch encoder that accommodates the intrinsic properties of strokes: shape, location, and order. We then move on to define the first ever XAI task for sketch, that of stroke location inversion (SLI). Just as we have heat maps for photos, and correlation matrices for text, SLI offers an explainability angle to sketch in terms of asking a network how well it can recover stroke locations of an unseen sketch. We offer qualitative results for readers to interpret as snapshots of the SLI process in the paper, and as GIFs on the project page. A minor but interesting note is that thanks to its sketch-specific design, our sketch encoder also yields the best sketch recognition accuracy to date while having the smallest number of parameters. The code is available at https://sketchxai.github.io. + + + + Rebalancing Batch Normalization for Exemplar-Based Class-Incremental Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Cha_Rebalancing_Batch_Normalization_for_Exemplar-Based_Class-Incremental_Learning_CVPR_2023_paper.pdf + Batch Normalization (BN) and its variants have been extensively studied for neural nets in various computer vision tasks, but relatively little work has been dedicated to studying the effect of BN in continual learning. 
To that end, we develop a new update patch for BN, particularly tailored for the exemplar-based class-incremental learning (CIL). The main issue of BN in CIL is the imbalance of training data between current and past tasks in a mini-batch, which makes the empirical mean and variance as well as the learnable affine transformation parameters of BN heavily biased toward the current task --- contributing to the forgetting of past tasks. While one of the recent BN variants has been developed for "online" CIL, in which the training is done with a single epoch, we show that their method does not necessarily bring gains for "offline" CIL, in which a model is trained with multiple epochs on the imbalanced training data. The main reason for the ineffectiveness of their method lies in not fully addressing the data imbalance issue, especially in computing the gradients for learning the affine transformation parameters of BN. Accordingly, our new hyperparameter-free variant, dubbed as Task-Balanced BN (TBBN), is proposed to more correctly resolve the imbalance issue by making a horizontally-concatenated task-balanced batch using both reshape and repeat operations during training. Based on our experiments on class incremental learning of CIFAR-100, ImageNet-100, and five dissimilar task datasets, we demonstrate that our TBBN, which works exactly the same as the vanilla BN in the inference time, is easily applicable to most existing exemplar-based offline CIL algorithms and consistently outperforms other BN variants. + + + + OmniVidar: Omnidirectional Depth Estimation From Multi-Fisheye Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_OmniVidar_Omnidirectional_Depth_Estimation_From_Multi-Fisheye_Images_CVPR_2023_paper.pdf + Estimating depth from four large field of view (FoV) cameras has been a difficult and understudied problem. In this paper, we propose a novel and simple system that can convert this difficult problem into easier binocular depth estimation. We name this system OmniVidar, as its results are similar to LiDAR, but rely only on vision. OmniVidar contains three components: (1) a new camera model to address the shortcomings of existing models, (2) a new multi-fisheye camera based epipolar rectification method for solving the image distortion and simplifying the depth estimation problem, (3) an improved binocular depth estimation network, which achieves a better balance between accuracy and efficiency. Unlike other omnidirectional stereo vision methods, OmniVidar does not contain any 3D convolution, so it can achieve higher resolution depth estimation at fast speed. Results demonstrate that OmniVidar outperforms all other methods in terms of accuracy and performance. + + + + RWSC-Fusion: Region-Wise Style-Controlled Fusion Network for the Prohibited X-Ray Security Image Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Duan_RWSC-Fusion_Region-Wise_Style-Controlled_Fusion_Network_for_the_Prohibited_X-Ray_Security_CVPR_2023_paper.pdf + Automatic prohibited item detection in security inspection X-ray images is necessary for transportation. The abundance and diversity of the X-ray security images with prohibited items, termed as prohibited X-ray security images, are essential for training the detection model. In order to solve the data insufficiency, we propose a Region-Wise Style-Controlled Fusion (RWSC-Fusion) network, which superimposes the prohibited items onto the normal X-ray security images, to synthesize the prohibited X-ray security images. 
The proposed RWSC-Fusion innovates both network structure and loss functions to generate more realistic X-ray security images. Specifically, an RWSC-Fusion module is designed to enable the region-wise fusion by controlling the appearance of the overlapping region with novel modulation parameters. In addition, an EdgeAttention (EA) module is proposed to effectively improve the sharpness of the synthetic images. As for the unsupervised loss function, we propose the Luminance loss in Logarithmic form (LL) and Correlation loss of Saturation Difference (CSD), to optimize the fused X-ray security images in terms of luminance and saturation. We evaluate the authenticity and the training effect of the synthetic X-ray security images on private and public SIXray dataset. The results confirm that our synthetic images are reliable enough to augment the prohibited X-ray security images. + + + + Octree Guided Unoriented Surface Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Koneputugodage_Octree_Guided_Unoriented_Surface_Reconstruction_CVPR_2023_paper.pdf + We address the problem of surface reconstruction from unoriented point clouds. Implicit neural representations (INRs) have become popular for this task, but when information relating to the inside versus outside of a shape is not available (such as shape occupancy, signed distances or surface normal orientation) optimization relies on heuristics and regularizers to recover the surface. These methods can be slow to converge and easily get stuck in local minima. We propose a two-step approach, OG-INR, where we (1) construct a discrete octree and label what is inside and outside, and (2) optimize for a continuous and high-fidelity shape using an INR that is initially guided by the octree's labelling. To solve for our labelling, we propose an energy function over the discrete structure and provide an efficient move-making algorithm that explores many possible labellings. Furthermore we show that we can easily inject knowledge into the discrete octree, providing a simple way to influence the result from the continuous INR. We evaluate the effectiveness of our approach on two unoriented surface reconstruction datasets and show competitive performance compared to other unoriented, and some oriented, methods. Our results show that the exploration by the move-making algorithm avoids many of the bad local minima reached by purely gradient descent optimized methods (see Figure 1). + + + + ToThePoint: Efficient Contrastive Learning of 3D Point Clouds via Recycling + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_ToThePoint_Efficient_Contrastive_Learning_of_3D_Point_Clouds_via_Recycling_CVPR_2023_paper.pdf + Recent years have witnessed significant developments in point cloud processing, including classification and segmentation. However, supervised learning approaches need a lot of well-labeled data for training, and annotation is labor- and time-intensive. Self-supervised learning, on the other hand, uses unlabeled data, and pre-trains a backbone with a pretext task to extract latent representations to be used with the downstream tasks. Compared to 2D images, self-supervised learning of 3D point clouds is under-explored. Existing models, for self-supervised learning of 3D point clouds, rely on a large number of data samples, and require significant amount of computational resources and training time. To address this issue, we propose a novel contrastive learning approach, referred to as ToThePoint. 
Different from traditional contrastive learning methods, which maximize agreement between features obtained from a pair of point clouds formed only with different types of augmentation, ToThePoint also maximizes the agreement between the permutation invariant features and features discarded after max pooling. We first perform self-supervised learning on the ShapeNet dataset, and then evaluate the performance of the network on different downstream tasks. In the downstream task experiments, performed on the ModelNet40, ModelNet40C, ScanObjectNN and ShapeNet-Part datasets, our proposed ToThePoint achieves competitive, if not better, results compared to the state-of-the-art baselines, and does so with significantly less training time (200 times faster than baselines). + + + + Weakly Supervised Monocular 3D Object Detection Using Multi-View Projection and Direction Consistency + http://openaccess.thecvf.com//content/CVPR2023/papers/Tao_Weakly_Supervised_Monocular_3D_Object_Detection_Using_Multi-View_Projection_and_CVPR_2023_paper.pdf + Monocular 3D object detection has become a mainstream approach in automatic driving for its easy application. A prominent advantage is that it does not need LiDAR point clouds during the inference. However, most current methods still rely on 3D point cloud data for labeling the ground truths used in the training phase. This inconsistency between the training and inference makes it hard to utilize the large-scale feedback data and increases the data collection expenses. To bridge this gap, we propose a new weakly supervised monocular 3D object detection method, which can train the model with only 2D labels marked on images. To be specific, we explore three types of consistency in this task, i.e., the projection, multi-view and direction consistency, and design a weakly-supervised architecture based on these consistencies. Moreover, we propose a new 2D direction labeling method in this task to guide the model for accurate rotation direction prediction. Experiments show that our weakly-supervised method achieves comparable performance with some fully supervised methods. When used as a pre-training method, our model can significantly outperform the corresponding fully-supervised baseline with only 1/3 3D labels. + + + + EDA: Explicit Text-Decoupling and Dense Alignment for 3D Visual Grounding + http://openaccess.thecvf.com//content/CVPR2023/papers/Wu_EDA_Explicit_Text-Decoupling_and_Dense_Alignment_for_3D_Visual_Grounding_CVPR_2023_paper.pdf + 3D visual grounding aims to find the object within point clouds mentioned by free-form natural language descriptions with rich semantic cues. However, existing methods either extract the sentence-level features coupling all words or focus more on object names, which would lose the word-level information or neglect other attributes. To alleviate these issues, we present EDA that Explicitly Decouples the textual attributes in a sentence and conducts Dense Alignment between such fine-grained language and point cloud objects. Specifically, we first propose a text decoupling module to produce textual features for every semantic component. Then, we design two losses to supervise the dense matching between two modalities: position alignment loss and semantic alignment loss. On top of that, we further introduce a new visual grounding task, locating objects without object names, which can thoroughly evaluate the model's dense alignment capacity. 
Through experiments, we achieve state-of-the-art performance on two widely-adopted 3D visual grounding datasets, ScanRefer and SR3D/NR3D, and obtain absolute leadership on our newly-proposed task. The source code is available at https://github.com/yanmin-wu/EDA. + + + + MixPHM: Redundancy-Aware Parameter-Efficient Tuning for Low-Resource Visual Question Answering + http://openaccess.thecvf.com//content/CVPR2023/papers/Jiang_MixPHM_Redundancy-Aware_Parameter-Efficient_Tuning_for_Low-Resource_Visual_Question_Answering_CVPR_2023_paper.pdf + Recently, finetuning pretrained vision-language models (VLMs) has been a prevailing paradigm for achieving state-of-the-art performance in VQA. However, as VLMs scale, it becomes computationally expensive, storage inefficient, and prone to overfitting when tuning full model parameters for a specific task in low-resource settings. Although current parameter-efficient tuning methods dramatically reduce the number of tunable parameters, there still exists a significant performance gap with full finetuning. In this paper, we propose MixPHM, a redundancy-aware parameter-efficient tuning method that outperforms full finetuning in low-resource VQA. Specifically, MixPHM is a lightweight module implemented by multiple PHM-experts in a mixture-of-experts manner. To reduce parameter redundancy, we reparameterize expert weights in a low-rank subspace and share part of the weights inside and across MixPHM. Moreover, based on our quantitative analysis of representation redundancy, we propose Redundancy Regularization, which facilitates MixPHM to reduce task-irrelevant redundancy while promoting task-relevant correlation. Experiments conducted on VQA v2, GQA, and OK-VQA with different low-resource settings show that our MixPHM outperforms state-of-the-art parameter-efficient methods and is the only one consistently surpassing full finetuning. + + + + DejaVu: Conditional Regenerative Learning To Enhance Dense Prediction + http://openaccess.thecvf.com//content/CVPR2023/papers/Borse_DejaVu_Conditional_Regenerative_Learning_To_Enhance_Dense_Prediction_CVPR_2023_paper.pdf + We present DejaVu, a novel framework which leverages conditional image regeneration as additional supervision during training to improve deep networks for dense prediction tasks such as segmentation, depth estimation, and surface normal prediction. First, we apply redaction to the input image, which removes certain structural information by sparse sampling or selective frequency removal. Next, we use a conditional regenerator, which takes the redacted image and the dense predictions as inputs, and reconstructs the original image by filling in the missing structural information. In the redacted image, structural attributes like boundaries are broken while semantic context is largely preserved. In order to make the regeneration feasible, the conditional generator will then require the structure information from the other input source, i.e., the dense predictions. As such, by including this conditional regeneration objective during training, DejaVu encourages the base network to learn to embed accurate scene structure in its dense prediction. This leads to more accurate predictions with clearer boundaries and better spatial consistency. When it is feasible to leverage additional computation, DejaVu can be extended to incorporate an attention-based regeneration module within the dense prediction network, which further improves accuracy. 
Through extensive experiments on multiple dense prediction benchmarks such as Cityscapes, COCO, ADE20K, NYUD-v2, and KITTI, we demonstrate the efficacy of employing DejaVu during training, as it outperforms SOTA methods at no added computation cost. + + + + SmartBrush: Text and Shape Guided Object Inpainting With Diffusion Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Xie_SmartBrush_Text_and_Shape_Guided_Object_Inpainting_With_Diffusion_Model_CVPR_2023_paper.pdf + Generic image inpainting aims to complete a corrupted image by borrowing surrounding information, which barely generates novel content. By contrast, multi-modal inpainting provides more flexible and useful controls on the inpainted content, e.g., a text prompt can be used to describe an object with richer attributes, and a mask can be used to constrain the shape of the inpainted object rather than being only considered as a missing area. We propose a new diffusion-based model named SmartBrush for completing a missing region with an object using both text and shape guidance. While previous work such as DALLE-2 and Stable Diffusion can do text-guided inpainting, they do not support shape guidance and tend to modify background texture surrounding the generated object. Our model incorporates both text and shape guidance with precision control. To preserve the background better, we propose a novel training and sampling strategy by augmenting the diffusion U-net with object-mask prediction. Lastly, we introduce a multi-task training strategy by jointly training inpainting with text-to-image generation to leverage more training data. We conduct extensive experiments showing that our model outperforms all baselines in terms of visual quality, mask controllability, and background preservation. + + + + RUST: Latent Neural Scene Representations From Unposed Imagery + http://openaccess.thecvf.com//content/CVPR2023/papers/Sajjadi_RUST_Latent_Neural_Scene_Representations_From_Unposed_Imagery_CVPR_2023_paper.pdf + Inferring the structure of 3D scenes from 2D observations is a fundamental challenge in computer vision. Recently popularized approaches based on neural scene representations have achieved tremendous impact and have been applied across a variety of applications. One of the major remaining challenges in this space is training a single model which can provide latent representations which effectively generalize beyond a single scene. Scene Representation Transformer (SRT) has shown promise in this direction, but scaling it to a larger set of diverse scenes is challenging and necessitates accurately posed ground truth data. To address this problem, we propose RUST (Really Unposed Scene representation Transformer), a pose-free approach to novel view synthesis trained on RGB images alone. Our main insight is that one can train a Pose Encoder that peeks at the target image and learns a latent pose embedding which is used by the decoder for view synthesis. We perform an empirical investigation into the learned latent pose structure and show that it allows meaningful test-time camera transformations and accurate explicit pose readouts. Perhaps surprisingly, RUST achieves similar quality as methods which have access to perfect camera pose, thereby unlocking the potential for large-scale training of amortized neural scene representations. 
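The central claim in the RUST abstract above is architectural: a Pose Encoder peeks at (part of) the target view and emits a latent pose that replaces an explicit camera matrix at the decoder. The PyTorch toy below only traces that data flow; every module size, the channel-stacked input format, and the half-image "peek" are assumptions made for this sketch, not the published SRT/RUST architecture.

import torch
import torch.nn as nn

class TinyLatentPoseModel(nn.Module):
    """Toy sketch of the latent-pose idea: condition the decoder on a learned pose
    embedding inferred from a peek at the target view, with no ground-truth camera
    pose anywhere in the pipeline. Sizes and module choices are hypothetical."""
    def __init__(self, d_scene=128, d_pose=8, out_hw=32):
        super().__init__()
        self.out_hw = out_hw
        self.scene_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_scene), nn.ReLU())
        self.pose_enc = nn.Sequential(nn.Flatten(), nn.LazyLinear(d_pose))
        self.decoder = nn.Sequential(nn.Linear(d_scene + d_pose, 256), nn.ReLU(),
                                     nn.Linear(256, 3 * out_hw * out_hw))

    def forward(self, input_views, target_peek):
        z_scene = self.scene_enc(input_views)   # unposed input views -> scene latent
        z_pose = self.pose_enc(target_peek)     # peek at the target -> latent pose
        rgb = self.decoder(torch.cat([z_scene, z_pose], dim=-1))
        return rgb.view(-1, 3, self.out_hw, self.out_hw)

# Training step (sketch): reconstruct the full target view while the pose encoder
# only sees half of it, so the latent pose cannot carry the whole image.
model = TinyLatentPoseModel()
views = torch.randn(4, 2 * 3, 64, 64)           # e.g. two input views, channel-stacked
target = torch.randn(4, 3, 32, 32)
peek = target[:, :, :16, :]                     # the "peeked" half of the target view
loss = ((model(views, peek) - target) ** 2).mean()
loss.backward()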
+ + + + Open Set Action Recognition via Multi-Label Evidential Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhao_Open_Set_Action_Recognition_via_Multi-Label_Evidential_Learning_CVPR_2023_paper.pdf + Existing methods for open set action recognition focus on novelty detection that assumes video clips show a single action, which is unrealistic in the real world. We propose a new method for open set action recognition and novelty detection via MUlti-Label Evidential learning (MULE), which goes beyond previous novel action detection methods by addressing the more general problems of single or multiple actors in the same scene, with simultaneous action(s) by any actor. Our Beta Evidential Neural Network estimates multi-action uncertainty with Beta densities based on actor-context-object relation representations. An evidence debiasing constraint is added to the objective function for optimization to reduce the static bias of video representations, which can incorrectly correlate predictions and static cues. We develop a primal-dual average scheme update-based learning algorithm to optimize the proposed problem and provide corresponding theoretical analysis. Besides, uncertainty and belief-based novelty estimation mechanisms are formulated to detect novel actions. Extensive experiments on two real-world video datasets show that our proposed approach achieves promising performance in single/multi-actor, single/multi-action settings. Our code and models are released at https://github.com/charliezhaoyinpeng/mule. + + + + MAP: Multimodal Uncertainty-Aware Vision-Language Pre-Training Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Ji_MAP_Multimodal_Uncertainty-Aware_Vision-Language_Pre-Training_Model_CVPR_2023_paper.pdf + Multimodal semantic understanding often has to deal with uncertainty, which means the obtained messages tend to refer to multiple targets. Such uncertainty is problematic for our interpretation, including inter- and intra-modal uncertainty. Little effort has studied the modeling of this uncertainty, particularly in pre-training on unlabeled datasets and fine-tuning in task-specific downstream datasets. In this paper, we project the representations of all modalities as probabilistic distributions via a Probability Distribution Encoder (PDE) by utilizing sequence-level interactions. Compared to the existing deterministic methods, such uncertainty modeling can convey richer multimodal semantic information and more complex relationships. Furthermore, we integrate uncertainty modeling with popular pre-training frameworks and propose suitable pre-training tasks: Distribution-based Vision-Language Contrastive learning (D-VLC), Distribution-based Masked Language Modeling (D-MLM), and Distribution-based Image-Text Matching (D-ITM). The fine-tuned models are applied to challenging downstream tasks, including image-text retrieval, visual question answering, visual reasoning, and visual entailment, and achieve state-of-the-art results. + + + + DualRel: Semi-Supervised Mitochondria Segmentation From a Prototype Perspective + http://openaccess.thecvf.com//content/CVPR2023/papers/Mai_DualRel_Semi-Supervised_Mitochondria_Segmentation_From_a_Prototype_Perspective_CVPR_2023_paper.pdf + Automatic mitochondria segmentation enjoys great popularity with the development of deep learning. However, existing methods rely heavily on the labor-intensive manual gathering by experienced domain experts. 
And naively applying semi-supervised segmentation methods in the natural image field to mitigate the labeling cost is undesirable. In this work, we analyze the gap between mitochondrial images and natural images and rethink how to achieve effective semi-supervised mitochondria segmentation, from the perspective of reliable prototype-level supervision. We propose a novel end-to-end dual-reliable (DualRel) network, including a reliable pixel aggregation module and a reliable prototype selection module. The proposed DualRel enjoys several merits. First, to learn the prototypes well without any explicit supervision, we carefully design the referential correlation to rectify the direct pairwise correlation. Second, the reliable prototype selection module is responsible for further evaluating the reliability of prototypes in constructing prototype-level consistency regularization. Extensive experimental results on three challenging benchmarks demonstrate that our method performs favorably against state-of-the-art semi-supervised segmentation methods. Importantly, with extremely few samples used for training, DualRel is also on par with current state-of-the-art fully supervised methods. + + + + Gated Multi-Resolution Transfer Network for Burst Restoration and Enhancement + http://openaccess.thecvf.com//content/CVPR2023/papers/Mehta_Gated_Multi-Resolution_Transfer_Network_for_Burst_Restoration_and_Enhancement_CVPR_2023_paper.pdf + Burst image processing has become increasingly popular in recent years. However, it is a challenging task since individual burst images undergo multiple degradations and often have mutual misalignments resulting in ghosting and zipper artifacts. Existing burst restoration methods usually do not consider the mutual correlation and non-local contextual information among burst frames, which tends to limit these approaches in challenging cases. Another key challenge lies in the robust up-sampling of burst frames. The existing up-sampling methods cannot effectively utilize the advantages of single-stage and progressive up-sampling strategies with conventional and/or recent up-samplers at the same time. To address these challenges, we propose a novel Gated Multi-Resolution Transfer Network (GMTNet) to reconstruct a spatially precise high-quality image from a burst of low-quality raw images. GMTNet consists of three modules optimized for burst processing tasks: Multi-scale Burst Feature Alignment (MBFA) for feature denoising and alignment, Transposed-Attention Feature Merging (TAFM) for multi-frame feature aggregation, and Resolution Transfer Feature Up-sampler (RTFU) to up-scale merged features and construct a high-quality output image. Detailed experimental analysis on five datasets validates our approach and sets a new state-of-the-art for burst super-resolution, burst denoising, and low-light burst enhancement. Our codes and models are available at https://github.com/nanmehta/GMTNet. + + + + PIDNet: A Real-Time Semantic Segmentation Network Inspired by PID Controllers + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_PIDNet_A_Real-Time_Semantic_Segmentation_Network_Inspired_by_PID_Controllers_CVPR_2023_paper.pdf + Two-branch network architecture has shown its efficiency and effectiveness in real-time semantic segmentation tasks. However, direct fusion of high-resolution details and low-frequency context has the drawback of detailed features being easily overwhelmed by surrounding contextual information. 
This overshoot phenomenon limits the improvement of the segmentation accuracy of existing two-branch models. In this paper, we make a connection between Convolutional Neural Networks (CNN) and Proportional-Integral-Derivative (PID) controllers and reveal that a two-branch network is equivalent to a Proportional-Integral (PI) controller, which inherently suffers from similar overshoot issues. To alleviate this problem, we propose a novel three-branch network architecture: PIDNet, which contains three branches to parse detailed, context and boundary information, respectively, and employs boundary attention to guide the fusion of detailed and context branches. Our family of PIDNets achieve the best trade-off between inference speed and accuracy and their accuracy surpasses all the existing models with similar inference speed on the Cityscapes and CamVid datasets. Specifically, PIDNet-S achieves 78.6 mIOU with inference speed of 93.2 FPS on Cityscapes and 80.1 mIOU with speed of 153.7 FPS on CamVid. + + + + Frustratingly Easy Regularization on Representation Can Boost Deep Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/He_Frustratingly_Easy_Regularization_on_Representation_Can_Boost_Deep_Reinforcement_Learning_CVPR_2023_paper.pdf + Deep reinforcement learning (DRL) gives the promise that an agent learns good policy from high-dimensional information, whereas representation learning removes irrelevant and redundant information and retains pertinent information. In this work, we demonstrate that the learned representation of the Q-network and its target Q-network should, in theory, satisfy a favorable distinguishable representation property. Specifically, there exists an upper bound on the representation similarity of the value functions of two adjacent time steps in a typical DRL setting. However, through illustrative experiments, we show that the learned DRL agent may violate this property and lead to a sub-optimal policy. Therefore, we propose a simple yet effective regularizer called Policy Evaluation with Easy Regularization on Representation (PEER), which aims to maintain the distinguishable representation property via explicit regularization on internal representations. And we provide the convergence rate guarantee of PEER. Implementing PEER requires only one line of code. Our experiments demonstrate that incorporating PEER into DRL can significantly improve performance and sample efficiency. Comprehensive experiments show that PEER achieves state-of-the-art performance on all 4 environments on PyBullet, 9 out of 12 tasks on DMControl, and 19 out of 26 games on Atari. To the best of our knowledge, PEER is the first work to study the inherent representation property of Q-network and its target. Our code is available at https://sites.google.com/view/peer-cvpr2023/. + + + + PointDistiller: Structured Knowledge Distillation Towards Efficient and Compact 3D Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_PointDistiller_Structured_Knowledge_Distillation_Towards_Efficient_and_Compact_3D_Detection_CVPR_2023_paper.pdf + The remarkable breakthroughs in point cloud representation learning have boosted their usage in real-world applications such as self-driving cars and virtual reality. However, these applications usually have an urgent requirement for not only accurate but also efficient 3D object detection. 
Recently, knowledge distillation has been proposed as an effective model compression technique, which transfers the knowledge from an over-parameterized teacher to a lightweight student and achieves consistent effectiveness in 2D vision. However, due to point clouds' sparsity and irregularity, directly applying previous image-based knowledge distillation methods to point cloud detectors usually leads to unsatisfactory performance. To fill the gap, this paper proposes PointDistiller, a structured knowledge distillation framework for point clouds-based 3D detection. Concretely, PointDistiller includes local distillation which extracts and distills the local geometric structure of point clouds with dynamic graph convolution and a reweighted learning strategy, which highlights student learning on the critical points or voxels to improve knowledge distillation efficiency. Extensive experiments on both voxels-based and raw points-based detectors have demonstrated the effectiveness of our method over seven previous knowledge distillation methods. For instance, our 4X compressed PointPillars student achieves 2.8 and 3.4 mAP improvements on BEV and 3D object detection, outperforming its teacher by 0.9 and 1.8 mAP, respectively. Codes are available in the supplementary material and will be released on Github. + + + + LEMaRT: Label-Efficient Masked Region Transform for Image Harmonization + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_LEMaRT_Label-Efficient_Masked_Region_Transform_for_Image_Harmonization_CVPR_2023_paper.pdf + We present a simple yet effective self-supervised pretraining method for image harmonization which can leverage large-scale unannotated image datasets. To achieve this goal, we first generate pre-training data online with our Label-Efficient Masked Region Transform (LEMaRT) pipeline. Given an image, LEMaRT generates a foreground mask and then applies a set of transformations to perturb various visual attributes, e.g., defocus blur, contrast, saturation, of the region specified by the generated mask. We then pre-train image harmonization models by recovering the original image from the perturbed image. Secondly, we introduce an image harmonization model, namely SwinIH, by retrofitting the Swin Transformer [27] with a combination of local and global self-attention mechanisms. Pretraining SwinIH with LEMaRT results in a new state of the art for image harmonization, while being label-efficient, i.e., consuming less annotated data for fine-tuning than existing methods. Notably, on iHarmony4 dataset [8], SwinIH outperforms the state of the art, i.e., SCS-Co [16] by a margin of 0.4 dB when it is fine-tuned on only 50% of the training data, and by 1.0 dB when it is trained on the full training dataset. + + + + Discriminator-Cooperated Feature Map Distillation for GAN Compression + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_Discriminator-Cooperated_Feature_Map_Distillation_for_GAN_Compression_CVPR_2023_paper.pdf + Despite excellent performance in image generation, Generative Adversarial Networks (GANs) are notorious for their requirements of enormous storage and intensive computation. As an awesome "performance maker", knowledge distillation is demonstrated to be particularly efficacious in exploring low-priced GANs. In this paper, we investigate the irreplaceability of teacher discriminator and present an inventive discriminator-cooperated distillation, abbreviated as DCD, towards refining better feature maps from the generator. 
In contrast to conventional pixel-to-pixel match methods in feature map distillation, our DCD utilizes teacher discriminator as a transformation to drive intermediate results of the student generator to be perceptually close to corresponding outputs of the teacher generator. Furthermore, in order to mitigate mode collapse in GAN compression, we construct a collaborative adversarial training paradigm where the teacher discriminator is from scratch established to co-train with student generator in company with our DCD. Our DCD shows superior results compared with existing GAN compression methods. For instance, after reducing over 40x MACs and 80x parameters of CycleGAN, we well decrease FID metric from 61.53 to 48.24 while the current SoTA method merely has 51.92. This work's source code has been made accessible at https://github.com/poopit/DCD-official. + + + + StyleAdv: Meta Style Adversarial Training for Cross-Domain Few-Shot Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Fu_StyleAdv_Meta_Style_Adversarial_Training_for_Cross-Domain_Few-Shot_Learning_CVPR_2023_paper.pdf + Cross-Domain Few-Shot Learning (CD-FSL) is a recently emerging task that tackles few-shot learning across different domains. It aims at transferring prior knowledge learned on the source dataset to novel target datasets. The CD-FSL task is especially challenged by the huge domain gap between different datasets. Critically, such a domain gap actually comes from the changes of visual styles, and wave-SAN empirically shows that spanning the style distribution of the source data helps alleviate this issue. However, wave-SAN simply swaps styles of two images. Such a vanilla operation makes the generated styles "real" and "easy", which still fall into the original set of the source styles. Thus, inspired by vanilla adversarial learning, a novel model-agnostic meta Style Adversarial training (StyleAdv) method together with a novel style adversarial attack method is proposed for CD-FSL. Particularly, our style attack method synthesizes both "virtual" and "hard" adversarial styles for model training. This is achieved by perturbing the original style with the signed style gradients. By continually attacking styles and forcing the model to recognize these challenging adversarial styles, our model is gradually robust to the visual styles, thus boosting the generalization ability for novel target datasets. Besides the typical CNN-based backbone, we also employ our StyleAdv method on large-scale pretrained vision transformer. Extensive experiments conducted on eight various target datasets show the effectiveness of our method. Whether built upon ResNet or ViT, we achieve the new state of the art for CD-FSL. Code is available at https://github.com/lovelyqian/StyleAdv-CDFSL. + + + + Long-Tailed Visual Recognition via Self-Heterogeneous Integration With Knowledge Excavation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jin_Long-Tailed_Visual_Recognition_via_Self-Heterogeneous_Integration_With_Knowledge_Excavation_CVPR_2023_paper.pdf + Deep neural networks have made huge progress in the last few decades. However, as the real-world data often exhibits a long-tailed distribution, vanilla deep models tend to be heavily biased toward the majority classes. To address this problem, state-of-the-art methods usually adopt a mixture of experts (MoE) to focus on different parts of the long-tailed distribution. 
Experts in these methods are with the same model depth, which neglects the fact that different classes may have different preferences to be fit by models with different depths. To this end, we propose a novel MoE-based method called Self-Heterogeneous Integration with Knowledge Excavation (SHIKE). We first propose Depth-wise Knowledge Fusion (DKF) to fuse features between different shallow parts and the deep part in one network for each expert, which makes experts more diverse in terms of representation. Based on DKF, we further propose Dynamic Knowledge Transfer (DKT) to reduce the influence of the hardest negative class that has a non-negligible impact on the tail classes in our MoE framework. As a result, the classification accuracy of long-tailed data can be significantly improved, especially for the tail classes. SHIKE achieves the state-of-the-art performance of 56.3%, 60.3%, 75.4%, and 41.9% on CIFAR100-LT (IF100), ImageNet-LT, iNaturalist 2018, and Places-LT, respectively. The source code is available at https://github.com/jinyan-06/SHIKE. + + + + Context De-Confounded Emotion Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_Context_De-Confounded_Emotion_Recognition_CVPR_2023_paper.pdf + Context-Aware Emotion Recognition (CAER) is a crucial and challenging task that aims to perceive the emotional states of the target person with contextual information. Recent approaches invariably focus on designing sophisticated architectures or mechanisms to extract seemingly meaningful representations from subjects and contexts. However, a long-overlooked issue is that a context bias in existing datasets leads to a significantly unbalanced distribution of emotional states among different context scenarios. Concretely, the harmful bias is a confounder that misleads existing models to learn spurious correlations based on conventional likelihood estimation, significantly limiting the models' performance. To tackle the issue, this paper provides a causality-based perspective to disentangle the models from the impact of such bias, and formulate the causalities among variables in the CAER task via a tailored causal graph. Then, we propose a Contextual Causal Intervention Module (CCIM) based on the backdoor adjustment to de-confound the confounder and exploit the true causal effect for model training. CCIM is plug-in and model-agnostic, which improves diverse state-of-the-art approaches by considerable margins. Extensive experiments on three benchmark datasets demonstrate the effectiveness of our CCIM and the significance of causal insight. + + + + InstructPix2Pix: Learning To Follow Image Editing Instructions + http://openaccess.thecvf.com//content/CVPR2023/papers/Brooks_InstructPix2Pix_Learning_To_Follow_Image_Editing_Instructions_CVPR_2023_paper.pdf + We propose a method for editing images from human instructions: given an input image and a written instruction that tells the model what to do, our model follows these instructions to edit the image. To obtain training data for this problem, we combine the knowledge of two large pretrained models--a language model (GPT-3) and a text-to-image model (Stable Diffusion)--to generate a large dataset of image editing examples. Our conditional diffusion model, InstructPix2Pix, is trained on our generated data, and generalizes to real images and user-written instructions at inference time. 
Since it performs edits in the forward pass and does not require per-example fine-tuning or inversion, our model edits images quickly, in a matter of seconds. We show compelling editing results for a diverse collection of input images and written instructions. + + + + Progressive Disentangled Representation Learning for Fine-Grained Controllable Talking Head Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Progressive_Disentangled_Representation_Learning_for_Fine-Grained_Controllable_Talking_Head_Synthesis_CVPR_2023_paper.pdf + We present a novel one-shot talking head synthesis method that achieves disentangled and fine-grained control over lip motion, eye gaze&blink, head pose, and emotional expression. We represent different motions via disentangled latent representations and leverage an image generator to synthesize talking heads from them. To effectively disentangle each motion factor, we propose a progressive disentangled representation learning strategy by separating the factors in a coarse-to-fine manner, where we first extract unified motion feature from the driving signal, and then isolate each fine-grained motion from the unified feature. We introduce motion-specific contrastive learning and regressing for non-emotional motions, and feature-level decorrelation and self-reconstruction for emotional expression, to fully utilize the inherent properties of each motion factor in unstructured video data to achieve disentanglement. Experiments show that our method provides high quality speech&lip-motion synchronization along with precise and disentangled control over multiple extra facial motions, which can hardly be achieved by previous methods. + + + + Breaking the "Object" in Video Object Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Tokmakov_Breaking_the_Object_in_Video_Object_Segmentation_CVPR_2023_paper.pdf + The appearance of an object can be fleeting when it transforms. As eggs are broken or paper is torn, their color, shape, and texture can change dramatically, preserving virtually nothing of the original except for the identity itself. Yet, this important phenomenon is largely absent from existing video object segmentation (VOS) benchmarks. In this work, we close the gap by collecting a new dataset for Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 20 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent. We then extensively evaluate state-of-the-art VOS methods and make a number of important discoveries. In particular, we show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static, appearance cues. This motivates us to propose a few modifications for the top-performing baseline that improve its performance by better capturing spatio-temporal information. But more broadly, the hope is to stimulate discussion on learning more robust video object representations. 
+ + + + CoWs on Pasture: Baselines and Benchmarks for Language-Driven Zero-Shot Object Navigation + http://openaccess.thecvf.com//content/CVPR2023/papers/Gadre_CoWs_on_Pasture_Baselines_and_Benchmarks_for_Language-Driven_Zero-Shot_Object_CVPR_2023_paper.pdf + For robots to be generally useful, they must be able to find arbitrary objects described by people (i.e., be language-driven) even without expensive navigation training on in-domain data (i.e., perform zero-shot inference). We explore these capabilities in a unified setting: language-driven zero-shot object navigation (L-ZSON). Inspired by the recent success of open-vocabulary models for image classification, we investigate a straightforward framework, CLIP on Wheels (CoW), to adapt open-vocabulary models to this task without fine-tuning. To better evaluate L-ZSON, we introduce the Pasture benchmark, which considers finding uncommon objects, objects described by spatial and appearance attributes, and hidden objects described relative to visible objects. We conduct an in-depth empirical study by directly deploying 22 CoW baselines across Habitat, RoboTHOR, and Pasture. In total we evaluate over 90k navigation episodes and find that (1) CoW baselines often struggle to leverage language descriptions, but are surprisingly proficient at finding uncommon objects. (2) A simple CoW, with CLIP-based object localization and classical exploration---and no additional training---matches the navigation efficiency of a state-of-the-art ZSON method trained for 500M steps on Habitat MP3D data. This same CoW provides a 15.6 percentage point improvement in success over a state-of-the-art RoboTHOR ZSON model. + + + + CIGAR: Cross-Modality Graph Reasoning for Domain Adaptive Object Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_CIGAR_Cross-Modality_Graph_Reasoning_for_Domain_Adaptive_Object_Detection_CVPR_2023_paper.pdf + Unsupervised domain adaptive object detection (UDA-OD) aims to learn a detector by generalizing knowledge from a labeled source domain to an unlabeled target domain. Though the existing graph-based methods for UDA-OD perform well in some cases, they cannot learn a proper node set for the graph. In addition, these methods build the graph solely based on the visual features and do not consider the linguistic knowledge carried by the semantic prototypes, e.g., dataset labels. To overcome these problems, we propose a cross-modality graph reasoning adaptation (CIGAR) method to take advantage of both visual and linguistic knowledge. Specifically, our method performs cross-modality graph reasoning between the linguistic modality graph and visual modality graphs to enhance their representations. We also propose a discriminative feature selector to find the most discriminative features and take them as the nodes of the visual graph for both efficiency and effectiveness. In addition, we employ the linguistic graph matching loss to regulate the update of linguistic graphs and maintain their semantic representation during the training process. Comprehensive experiments validate the effectiveness of our proposed CIGAR. + + + + HOOD: Hierarchical Graphs for Generalized Modelling of Clothing Dynamics + http://openaccess.thecvf.com//content/CVPR2023/papers/Grigorev_HOOD_Hierarchical_Graphs_for_Generalized_Modelling_of_Clothing_Dynamics_CVPR_2023_paper.pdf + We propose a method that leverages graph neural networks, multi-level message passing, and unsupervised training to enable real-time prediction of realistic clothing dynamics. 
Whereas existing methods based on linear blend skinning must be trained for specific garments, our method is agnostic to body shape and applies to tight-fitting garments as well as loose, free-flowing clothing. Our method furthermore handles changes in topology (e.g., garments with buttons or zippers) and material properties at inference time. As one key contribution, we propose a hierarchical message-passing scheme that efficiently propagates stiff stretching modes while preserving local detail. We empirically show that our method outperforms strong baselines quantitatively and that its results are perceived as more realistic than state-of-the-art methods. + + + + HyperReel: High-Fidelity 6-DoF Video With Ray-Conditioned Sampling + http://openaccess.thecvf.com//content/CVPR2023/papers/Attal_HyperReel_High-Fidelity_6-DoF_Video_With_Ray-Conditioned_Sampling_CVPR_2023_paper.pdf + Volumetric scene representations enable photorealistic view synthesis for static scenes and form the basis of several existing 6-DoF video techniques. However, the volume rendering procedures that drive these representations necessitate careful trade-offs in terms of quality, rendering speed, and memory efficiency. In particular, existing methods fail to simultaneously achieve real-time performance, small memory footprint, and high-quality rendering for challenging real-world scenes. To address these issues, we present HyperReel --- a novel 6-DoF video representation. The two core components of HyperReel are: (1) a ray-conditioned sample prediction network that enables high-fidelity, high frame rate rendering at high resolutions and (2) a compact and memory-efficient dynamic volume representation. Our 6-DoF video pipeline achieves the best performance compared to prior and contemporary approaches in terms of visual quality with small memory requirements, while also rendering at up to 18 frames-per-second at megapixel resolution without any custom CUDA code. + + + + PCR: Proxy-Based Contrastive Replay for Online Class-Incremental Continual Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Lin_PCR_Proxy-Based_Contrastive_Replay_for_Online_Class-Incremental_Continual_Learning_CVPR_2023_paper.pdf + Online class-incremental continual learning is a specific task of continual learning. It aims to continuously learn new classes from a data stream whose samples are seen only once, and it suffers from the catastrophic forgetting issue, i.e., forgetting historical knowledge of old classes. Existing replay-based methods effectively alleviate this issue by saving and replaying part of the old data in a proxy-based or contrastive-based replay manner. Although these two replay manners are effective, the former tends to be biased toward new classes due to class imbalance, and the latter is unstable and hard to converge because of the limited number of samples. In this paper, we conduct a comprehensive analysis of these two replay manners and find that they can be complementary. Inspired by this finding, we propose a novel replay-based method called proxy-based contrastive replay (PCR). The key operation is to replace the contrastive samples of anchors with corresponding proxies in the contrastive-based way. It alleviates catastrophic forgetting by effectively addressing the imbalance issue, while also maintaining faster model convergence.
We conduct extensive experiments on three real-world benchmark datasets, and empirical results consistently demonstrate the superiority of PCR over various state-of-the-art methods. + + + + Token Boosting for Robust Self-Supervised Visual Transformer Pre-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Token_Boosting_for_Robust_Self-Supervised_Visual_Transformer_Pre-Training_CVPR_2023_paper.pdf + Learning with large-scale unlabeled data has become a powerful tool for pre-training Visual Transformers (VTs). However, prior works tend to overlook that, in real-world scenarios, the input data may be corrupted and unreliable. Pre-training VTs on such corrupted data can be challenging, especially when we pre-train via the masked autoencoding approach, where both the inputs and masked "ground truth" targets can potentially be unreliable in this case. To address this limitation, we introduce the Token Boosting Module (TBM) as a plug-and-play component for VTs that effectively allows the VT to learn to extract clean and robust features during masked autoencoding pre-training. We provide theoretical analysis to show how TBM improves model pre-training with more robust and generalizable representations, thus benefiting downstream tasks. We conduct extensive experiments to analyze TBM's effectiveness, and results on four corrupted datasets demonstrate that TBM consistently improves performance on downstream tasks. + + + + MaskCon: Masked Contrastive Learning for Coarse-Labelled Dataset + http://openaccess.thecvf.com//content/CVPR2023/papers/Feng_MaskCon_Masked_Contrastive_Learning_for_Coarse-Labelled_Dataset_CVPR_2023_paper.pdf + Deep learning has achieved great success in recent years with the aid of advanced neural network structures and large-scale human-annotated datasets. However, it is often costly and difficult to accurately and efficiently annotate large-scale datasets, especially for some specialized domains where fine-grained labels are required. In this setting, coarse labels are much easier to acquire as they do not require expert knowledge. In this work, we propose a contrastive learning method, called masked contrastive learning (MaskCon), to address the under-explored problem setting where we learn with a coarse-labelled dataset in order to address a finer labelling problem. More specifically, within the contrastive learning framework, for each sample our method generates soft-labels with the aid of coarse labels against other samples and another augmented view of the sample in question. In contrast to self-supervised contrastive learning, where only the sample's augmentations are considered hard positives, and to supervised contrastive learning, where only samples with the same coarse labels are considered hard positives, we propose soft labels based on sample distances that are masked by the coarse labels. This allows us to utilize both inter-sample relations and coarse labels. We demonstrate that our method can obtain many existing state-of-the-art works as special cases and that it provides tighter bounds on the generalization error. Experimentally, our method achieves significant improvement over the current state-of-the-art on various datasets, including CIFAR10, CIFAR100, ImageNet-1K, Stanford Online Products and Stanford Cars196. Code and annotations are available at https://github.com/MrChenFeng/MaskCon_CVPR2023.
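As an illustration of the coarse-label masking idea described in the MaskCon abstract above, the following sketch builds similarity-based soft labels for an anchor and zeroes out every bank sample with a different coarse label; the temperature value and tensor names are assumptions of this sketch, not the authors' implementation.

import torch
import torch.nn.functional as F

def maskcon_soft_labels(anchor, bank, anchor_coarse, bank_coarse, temperature=0.1):
    # anchor: (D,) query embedding; bank: (N, D) embeddings of other samples
    # anchor_coarse: () coarse label of the anchor; bank_coarse: (N,) coarse labels
    anchor = F.normalize(anchor, dim=0)
    bank = F.normalize(bank, dim=1)
    sim = bank @ anchor / temperature                 # (N,) scaled cosine similarities
    same_coarse = bank_coarse == anchor_coarse        # (N,) bool mask from coarse labels
    logits = sim.masked_fill(~same_coarse, float('-inf'))
    return torch.softmax(logits, dim=0)               # soft targets over the bank, sum to 1

# toy usage: only bank samples sharing coarse label 1 receive non-zero soft labels
soft = maskcon_soft_labels(torch.randn(8), torch.randn(16, 8),
                           torch.tensor(1), torch.tensor([0, 1, 1, 2] * 4))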
+ + + + AGAIN: Adversarial Training With Attribution Span Enlargement and Hybrid Feature Fusion + http://openaccess.thecvf.com//content/CVPR2023/papers/Yin_AGAIN_Adversarial_Training_With_Attribution_Span_Enlargement_and_Hybrid_Feature_CVPR_2023_paper.pdf + Deep neural networks (DNNs) trained by adversarial training (AT) usually suffer from a significant robust generalization gap, i.e., DNNs achieve high training robustness but low test robustness. In this paper, we propose a generic method to boost the robust generalization of AT methods from the novel perspective of attribution span. To this end, we discover that, compared with standard DNNs, adversarially trained DNNs have a smaller attribution span on the input image, which causes the generalization gap. In other words, adversarially trained DNNs tend to focus on specific visual concepts in training images, which limits their test robustness. Accordingly, to enhance robustness, we propose an effective method to enlarge the learned attribution span. Besides, we use hybrid feature statistics for feature fusion to enrich the diversity of features. Extensive experiments show that our method effectively improves the robustness of adversarially trained DNNs, outperforming previous SOTA methods. Furthermore, we provide a theoretical analysis of our method to prove its effectiveness. + + + + ABLE-NeRF: Attention-Based Rendering With Learnable Embeddings for Neural Radiance Field + http://openaccess.thecvf.com//content/CVPR2023/papers/Tang_ABLE-NeRF_Attention-Based_Rendering_With_Learnable_Embeddings_for_Neural_Radiance_Field_CVPR_2023_paper.pdf + Neural Radiance Field (NeRF) is a popular method for representing 3D scenes by optimising a continuous volumetric scene function. Its great success, which lies in applying volumetric rendering (VR), is also its Achilles' heel in producing view-dependent effects. As a consequence, glossy and transparent surfaces often appear murky. A remedy to reduce these artefacts is to constrain this VR equation by excluding volumes with back-facing normals. While this approach has some success in rendering glossy surfaces, translucent objects are still poorly represented. In this paper, we present an alternative to the physics-based VR approach by introducing a self-attention-based framework on volumes along a ray. In addition, inspired by modern game engines which utilise Light Probes to store local lighting passing through the scene, we incorporate Learnable Embeddings to capture view-dependent effects within the scene. Our method, which we call ABLE-NeRF, significantly reduces 'blurry' glossy surfaces in rendering and produces realistic translucent surfaces that are lacking in prior art. On the Blender dataset, ABLE-NeRF achieves SOTA results and surpasses Ref-NeRF in all 3 image quality metrics: PSNR, SSIM, and LPIPS. + + + + WinCLIP: Zero-/Few-Shot Anomaly Classification and Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Jeong_WinCLIP_Zero-Few-Shot_Anomaly_Classification_and_Segmentation_CVPR_2023_paper.pdf + Visual anomaly classification and segmentation are vital for automating industrial quality inspection. The focus of prior research in the field has been on training custom models for each quality inspection task, which requires task-specific images and annotation. In this paper, we move away from this regime, addressing zero-shot and few-normal-shot anomaly classification and segmentation.
Recently CLIP, a vision-language model, has shown revolutionary generality with competitive zero/few-shot performance in comparison to full-supervision. But CLIP falls short on anomaly classification and segmentation tasks. Hence, we propose window-based CLIP (WinCLIP) with (1) a compositional ensemble on state words and prompt templates and (2) efficient extraction and aggregation of window/patch/image-level features aligned with text. We also propose its few-normal-shot extension WinCLIP+, which uses complementary information from normal images. In MVTec-AD (and VisA), without further tuning, WinCLIP achieves 91.8%/85.1% (78.1%/79.6%) AUROC in zero-shot anomaly classification and segmentation while WinCLIP+ does 93.1%/95.2% (83.8%/96.4%) in 1-normal-shot, surpassing state-of-the-art by large margins. + + + + TriDet: Temporal Action Detection With Relative Boundary Modeling + http://openaccess.thecvf.com//content/CVPR2023/papers/Shi_TriDet_Temporal_Action_Detection_With_Relative_Boundary_Modeling_CVPR_2023_paper.pdf + In this paper, we present a one-stage framework TriDet for temporal action detection. Existing methods often suffer from imprecise boundary predictions due to the ambiguous action boundaries in videos. To alleviate this problem, we propose a novel Trident-head to model the action boundary via an estimated relative probability distribution around the boundary. In the feature pyramid of TriDet, we propose a Scalable-Granularity Perception (SGP) layer to aggregate information across different temporal granularities, which is much more efficient than the recent transformer-based feature pyramid. Benefiting from the Trident-head and the SGP-based feature pyramid, TriDet achieves state-of-the-art performance on three challenging benchmarks: THUMOS14, HACS and EPIC-KITCHEN 100, with lower computational costs, compared to previous methods. For example, TriDet hits an average mAP of 69.3% on THUMOS14, outperforming the previous best by 2.5%, but with only 74.6% of its latency. + + + + Dream3D: Zero-Shot Text-to-3D Synthesis Using 3D Shape Prior and Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Dream3D_Zero-Shot_Text-to-3D_Synthesis_Using_3D_Shape_Prior_and_Text-to-Image_CVPR_2023_paper.pdf + Recent CLIP-guided 3D optimization methods, such as DreamFields and PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D synthesis. However, due to scratch training and random initialization without prior knowledge, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the CLIP-guided 3D optimization process. Specifically, we first generate a high-quality 3D shape from the input text in the text-to-shape stage as a 3D shape prior. We then use it as the initialization of a neural radiance field and optimize it with the full prompt. To address the challenging text-to-shape generation task, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between the images synthesized by the text-to-image diffusion model and shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. 
Our method, Dream3D, is capable of generating imaginative 3D content with superior visual quality and shape accuracy compared to state-of-the-art methods. Our project page is at https://bluestyle97.github.io/dream3d/. + + + + Reinforcement Learning-Based Black-Box Model Inversion Attacks + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_Reinforcement_Learning-Based_Black-Box_Model_Inversion_Attacks_CVPR_2023_paper.pdf + Model inversion attacks are a type of privacy attack that reconstructs private data used to train a machine learning model, solely by accessing the model. Recently, white-box model inversion attacks leveraging Generative Adversarial Networks (GANs) to distill knowledge from public datasets have been receiving great attention because of their excellent attack performance. On the other hand, current black-box model inversion attacks that utilize GANs suffer from issues such as being unable to guarantee the completion of the attack process within a predetermined number of query accesses or achieve the same level of performance as white-box attacks. To overcome these limitations, we propose a reinforcement learning-based black-box model inversion attack. We formulate the latent space search as a Markov Decision Process (MDP) problem and solve it with reinforcement learning. Our method utilizes the confidence scores of the generated images to provide rewards to an agent. Finally, the private data can be reconstructed using the latent vectors found by the agent trained in the MDP. The experiment results on various datasets and models demonstrate that our attack successfully recovers the private information of the target model by achieving state-of-the-art attack performance. We emphasize the importance of studies on privacy-preserving machine learning by proposing a more advanced black-box model inversion attack. + + + + Learning a Deep Color Difference Metric for Photographic Images + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_Learning_a_Deep_Color_Difference_Metric_for_Photographic_Images_CVPR_2023_paper.pdf + Most well-established and widely used color difference (CD) metrics are handcrafted and subject-calibrated against uniformly colored patches, which do not generalize well to photographic images characterized by natural scene complexities. Constructing CD formulae for photographic images is still an active research topic in imaging/illumination, vision science, and color science communities. In this paper, we aim to learn a deep CD metric for photographic images with four desirable properties. First, it well aligns with the observations in vision science that color and form are linked inextricably in visual cortical processing. Second, it is a proper metric in the mathematical sense. Third, it computes accurate CDs between photographic images, differing mainly in color appearances. Fourth, it is robust to mild geometric distortions (e.g., translation or due to parallax), which are often present in photographic images of the same scene captured by different digital cameras. We show that all these properties can be satisfied at once by learning a multi-scale autoregressive normalizing flow for feature transform, followed by the Euclidean distance which is linearly proportional to the human perceptual CD. Quantitative and qualitative experiments on the large-scale SPCD dataset demonstrate the promise of the learned CD metric. 
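The color-difference abstract above defines the learned CD as a Euclidean distance computed after a feature transform; the minimal sketch below uses a hypothetical stand-in encoder instead of the paper's multi-scale autoregressive normalizing flow, simply to show why such a construction is automatically symmetric and satisfies the triangle inequality.

import torch
import torch.nn as nn

class DeepColorDifference(nn.Module):
    # d(x, y) = ||f(x) - f(y)||_2 with a learned transform f, so the metric
    # axioms (apart from identity of indiscernibles) hold by construction.
    def __init__(self, feat_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(                 # placeholder transform f
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(32 * 8 * 8, feat_dim),
        )

    def forward(self, img_a, img_b):
        fa, fb = self.encoder(img_a), self.encoder(img_b)
        return torch.norm(fa - fb, dim=1)             # one CD value per image pair

cd = DeepColorDifference()
print(cd(torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)))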
+ + + + 1000 FPS HDR Video With a Spike-RGB Hybrid Camera + http://openaccess.thecvf.com//content/CVPR2023/papers/Chang_1000_FPS_HDR_Video_With_a_Spike-RGB_Hybrid_Camera_CVPR_2023_paper.pdf + Capturing high frame rate and high dynamic range (HFR&HDR) color videos in high-speed scenes with conventional frame-based cameras is very challenging. A higher frame rate usually requires a shorter exposure time, so the captured video is severely corrupted by noise. Alternating exposures could alleviate the noise issue but sacrifice frame rate because they involve long-exposure frames. The neuromorphic spiking camera records high-speed scenes of high dynamic range without colors using a completely different sensing mechanism and visual representation. We introduce a hybrid camera system composed of a spiking and an alternating-exposure RGB camera to capture HFR&HDR scenes with high fidelity. Our insight is to bring each camera's superiority into full play. The spike frames, with accurate fast motion information encoded, are first reconstructed for motion representation, from which the spike-based optical flows guide the recovery of missing temporal information for middle- and long-exposure RGB images while retaining their reliable color appearances. With the strong temporal constraint estimated from spike trains, both missing and distorted colors across RGB frames are recovered to generate time-consistent and HFR color frames. We collect a new Spike-RGB dataset that contains 300 sequences of synthetic data and 20 groups of real-world data to demonstrate 1000 FPS HDR videos outperforming HDR video reconstruction methods and commercial high-speed cameras. + + + + DINN360: Deformable Invertible Neural Network for Latitude-Aware 360deg Image Rescaling + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_DINN360_Deformable_Invertible_Neural_Network_for_Latitude-Aware_360deg_Image_Rescaling_CVPR_2023_paper.pdf + With the rapid development of virtual reality, 360deg images have gained increasing popularity. Their wide field of view necessitates high resolution to ensure image quality. This, however, makes it harder to acquire, store and even process such 360deg images. To alleviate this issue, we propose the first attempt at 360deg image rescaling, which refers to downscaling a 360deg image to a visually valid low-resolution (LR) counterpart and then upscaling to a high-resolution (HR) 360deg image given the LR variant. Specifically, we first analyze two 360deg image datasets and observe several findings that characterize how 360deg images typically change along their latitudes. Inspired by these findings, we propose a novel deformable invertible neural network (INN), named DINN360, for latitude-aware 360deg image rescaling. In DINN360, a deformable INN is designed to downscale the HR image and project the high-frequency (HF) component to the latent space by adaptively handling various deformations occurring at different latitude regions. Given the downscaled LR image, the high-quality HR image is then reconstructed in a conditional latitude-aware manner by recovering the structure-related HF component from the latent space. Extensive experiments over four public datasets show that our DINN360 method performs considerably better than other state-of-the-art methods for 2x, 4x and 8x 360deg image rescaling.
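Since the DINN360 abstract above hinges on an invertible rescaling network, here is a minimal sketch of the standard affine coupling block that generic invertible rescaling networks are built from; it is not the paper's deformable, latitude-aware variant, only an illustration of why the forward mapping can be inverted exactly.

import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, channels, hidden=64):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(                      # predicts scale and shift from one half
            nn.Conv2d(half, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, 2 * half, 3, padding=1),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(torch.tanh(log_s)) + t     # tanh keeps the scale bounded
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-torch.tanh(log_s))  # exact inverse of forward
        return torch.cat([y1, x2], dim=1)

layer = AffineCoupling(4)
x = torch.randn(1, 4, 16, 16)
assert torch.allclose(layer.inverse(layer(x)), x, atol=1e-5)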
+ + + + Learning Geometric-Aware Properties in 2D Representation Using Lightweight CAD Models, or Zero Real 3D Pairs + http://openaccess.thecvf.com//content/CVPR2023/papers/Arsomngern_Learning_Geometric-Aware_Properties_in_2D_Representation_Using_Lightweight_CAD_Models_CVPR_2023_paper.pdf + Cross-modal training using 2D-3D paired datasets, such as those containing multi-view images and 3D scene scans, presents an effective way to enhance 2D scene understanding by introducing geometric and view-invariance priors into 2D features. However, the need for large-scale scene datasets can impede scalability and further improvements. This paper explores an alternative learning method by leveraging a lightweight and publicly available type of 3D data in the form of CAD models. We construct a 3D space with geometric-aware alignment where the similarity in this space reflects the geometric similarity of CAD models based on the Chamfer distance. The acquired geometric-aware properties are then induced into 2D features, which boost performance on downstream tasks more effectively than existing RGB-CAD approaches. Our technique is not limited to paired RGB-CAD datasets. By training exclusively on pseudo pairs generated from CAD-based reconstruction methods, we enhance the performance of SOTA 2D pre-trained models that use ResNet-50 or ViT-B backbones on various 2D understanding tasks. We also achieve comparable results to SOTA methods trained on scene scans on four tasks in NYUv2, SUNRGB-D, indoor ADE20k, and indoor/outdoor COCO, despite using lightweight CAD models or pseudo data. + + + + Few-Shot Learning With Visual Distribution Calibration and Cross-Modal Distribution Alignment + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Few-Shot_Learning_With_Visual_Distribution_Calibration_and_Cross-Modal_Distribution_Alignment_CVPR_2023_paper.pdf + Pre-trained vision-language models have inspired much research on few-shot learning. However, with only a few training images, there exist two crucial problems: (1) the visual feature distributions are easily distracted by class-irrelevant information in images, and (2) the alignment between the visual and language feature distributions is difficult. To deal with the distraction problem, we propose a Selective Attack module, which consists of trainable adapters that generate spatial attention maps of images to guide the attacks on class-irrelevant image areas. By messing up these areas, the critical features are captured and the visual distributions of image features are calibrated. To better align the visual and language feature distributions that describe the same object class, we propose a cross-modal distribution alignment module, in which we introduce a vision-language prototype for each class to align the distributions, and adopt the Earth Mover's Distance (EMD) to optimize the prototypes. For efficient computation, the upper bound of EMD is derived. In addition, we propose an augmentation strategy to increase the diversity of the images and the text prompts, which can reduce overfitting to the few-shot training images. Extensive experiments on 11 datasets demonstrate that our method consistently outperforms prior arts in few-shot learning. 
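The CAD-alignment abstract above ("Learning Geometric-Aware Properties in 2D Representation Using Lightweight CAD Models") defines geometric similarity between CAD models via the Chamfer distance; a brute-force sketch of that quantity is shown below (one common convention with unsquared distances; squared variants are also widespread).

import torch

def chamfer_distance(p, q):
    # p: (N, 3) and q: (M, 3) point clouds sampled from two shapes
    d = torch.cdist(p, q)                              # (N, M) pairwise Euclidean distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

print(chamfer_distance(torch.rand(1024, 3), torch.rand(2048, 3)))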
+ + + + Finetune Like You Pretrain: Improved Finetuning of Zero-Shot Vision Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Goyal_Finetune_Like_You_Pretrain_Improved_Finetuning_of_Zero-Shot_Vision_Models_CVPR_2023_paper.pdf + Finetuning image-text models such as CLIP achieves state-of-the-art accuracies on a variety of benchmarks. However, recent works (Kumar et al., 2022; Wortsman et al., 2021) have shown that even subtle differences in the finetuning process can lead to surprisingly large differences in the final performance, both for in-distribution (ID) and out-of-distribution (OOD) data. In this work, we show that a natural and simple approach of mimicking contrastive pretraining consistently outperforms alternative finetuning approaches. Specifically, we cast downstream class labels as text prompts and continue optimizing the contrastive loss between image embeddings and class-descriptive prompt embeddings (contrastive finetuning). Our method consistently outperforms baselines across 7 distribution shift, 6 transfer learning, and 3 few-shot learning benchmarks. On WILDS-iWILDCam, our proposed approach FLYP outperforms the top of the leaderboard by 2.3% ID and 2.7% OOD, giving the highest reported accuracy. Averaged across 7 OOD datasets (2 WILDS and 5 ImageNet-associated shifts), FLYP gives gains of 4.2% OOD over standard finetuning and outperforms the current state-of-the-art (LP-FT) by more than 1% both ID and OOD. Similarly, on 3 few-shot learning benchmarks, FLYP gives gains of up to 4.6% over standard finetuning and 4.4% over the state-of-the-art. Thus, we establish our proposed method of contrastive finetuning as a simple and intuitive state-of-the-art for supervised finetuning of image-text models like CLIP. Code is available at https://github.com/locuslab/FLYP. + + + + Weakly Supervised Temporal Sentence Grounding With Uncertainty-Guided Self-Training + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Weakly_Supervised_Temporal_Sentence_Grounding_With_Uncertainty-Guided_Self-Training_CVPR_2023_paper.pdf + The task of weakly supervised temporal sentence grounding aims at finding the corresponding temporal moments of a language description in the video, given video-language correspondence only at the video level. Most existing works select mismatched video-language pairs as negative samples and train the model to generate better positive proposals that are distinct from the negative ones. However, due to the complex temporal structure of videos, proposals distinct from the negative ones may correspond to several video segments but not necessarily the correct ground truth. To alleviate this problem, we propose an uncertainty-guided self-training technique to provide extra self-supervision signals to guide the weakly-supervised learning. The self-training process is based on teacher-student mutual learning with weak-strong augmentation, which enables the teacher network to generate relatively more reliable outputs compared to the student network, so that the student network can learn from the teacher's output. Since directly applying existing self-training methods to this task easily causes error accumulation, we specifically design two techniques in our self-training method: (1) we construct a Bayesian teacher network, leveraging its uncertainty as a weight to suppress the noisy teacher supervisory signals; (2) we leverage the cycle consistency brought by temporal data augmentation to perform mutual learning between the two networks.
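The FLYP abstract above describes finetuning with exactly the pretraining objective: each image is paired with a text prompt built from its class label, and the symmetric CLIP contrastive loss is optimized. The sketch below writes that loss out; the batch-pairing convention, tensor names and temperature are assumptions of this illustration rather than the released code.

import torch
import torch.nn.functional as F

def flyp_style_loss(image_feats, prompt_feats, temperature=0.07):
    # image_feats: (B, D) image embeddings; prompt_feats: (B, D) embeddings of
    # the class prompt generated from each image's label
    image_feats = F.normalize(image_feats, dim=1)
    prompt_feats = F.normalize(prompt_feats, dim=1)
    logits = image_feats @ prompt_feats.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +        # image -> matching prompt
                  F.cross_entropy(logits.t(), targets))     # prompt -> matching image

loss = flyp_style_loss(torch.randn(8, 512), torch.randn(8, 512))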
Experiments demonstrate our method's superiority on the Charades-STA and ActivityNet Captions datasets. We also show experimentally that our self-training method can be applied to improve the performance of multiple backbone methods. + + + + AutoRecon: Automated 3D Object Discovery and Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_AutoRecon_Automated_3D_Object_Discovery_and_Reconstruction_CVPR_2023_paper.pdf + A fully automated object reconstruction pipeline is crucial for digital content creation. While the area of 3D reconstruction has witnessed profound developments, the removal of background to obtain a clean object model still relies on different forms of manual labor, such as bounding box labeling, mask annotations, and mesh manipulations. In this paper, we propose a novel framework named AutoRecon for the automated discovery and reconstruction of an object from multi-view images. We demonstrate that foreground objects can be robustly located and segmented from SfM point clouds by leveraging self-supervised 2D vision transformer features. Then, we reconstruct decomposed neural scene representations with dense supervision provided by the decomposed point clouds, resulting in accurate object reconstruction and segmentation. Experiments on the DTU, BlendedMVS and CO3D-V2 datasets demonstrate the effectiveness and robustness of AutoRecon. The code and supplementary material are available on the project page: https://zju3dv.github.io/autorecon/. + + + + Learning a Practical SDR-to-HDRTV Up-Conversion Using New Dataset and Degradation Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Guo_Learning_a_Practical_SDR-to-HDRTV_Up-Conversion_Using_New_Dataset_and_Degradation_CVPR_2023_paper.pdf + In the media industry, the demand for SDR-to-HDRTV up-conversion arises when users possess HDR-WCG (high dynamic range-wide color gamut) TVs while most off-the-shelf footage is still in SDR (standard dynamic range). The research community has started tackling this low-level vision task with learning-based approaches. Yet, when applied to real SDR, current methods tend to produce dim and desaturated results, making nearly no improvement to the viewing experience. Different from other network-oriented methods, we attribute this deficiency to the training set (HDR-SDR pairs). Consequently, we propose a new HDRTV dataset (dubbed HDRTV4K) and new HDR-to-SDR degradation models. This dataset is then used to train a luminance-segmented network (LSN) consisting of a global mapping trunk and two Transformer branches for the bright and dark luminance ranges. We also update the assessment criteria with tailored metrics and a subjective experiment. Finally, ablation studies are conducted to verify the effectiveness of our design. Our work is available at: https://github.com/AndreGuo/HDRTVDM. + + + + Learning To Fuse Monocular and Multi-View Cues for Multi-Frame Depth Estimation in Dynamic Scenes + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Learning_To_Fuse_Monocular_and_Multi-View_Cues_for_Multi-Frame_Depth_CVPR_2023_paper.pdf + Multi-frame depth estimation generally achieves high accuracy by relying on multi-view geometric consistency. When applied in dynamic scenes, e.g., autonomous driving, this consistency is usually violated in the dynamic areas, leading to corrupted estimations. Many multi-frame methods handle dynamic areas by identifying them with explicit masks and compensating the multi-view cues with monocular cues represented as local monocular depth or features.
The improvements are limited due to the uncontrolled quality of the masks and the underutilized benefits of the fusion of the two types of cues. In this paper, we propose a novel method to learn to fuse the multi-view and monocular cues encoded as volumes without needing the heuristically crafted masks. As unveiled in our analyses, the multi-view cues capture more accurate geometric information in static areas, and the monocular cues capture more useful contexts in dynamic areas. To let the geometric perception learned from multi-view cues in static areas propagate to the monocular representation in dynamic areas and let monocular cues enhance the representation of multi-view cost volume, we propose a cross-cue fusion (CCF) module, which includes the cross-cue attention (CCA) to encode the spatially non-local relative intra-relations from each source to enhance the representation of the other. Experiments on real-world datasets prove the significant effectiveness and generalization ability of the proposed method. + + + + Bias-Eliminating Augmentation Learning for Debiased Federated Learning + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_Bias-Eliminating_Augmentation_Learning_for_Debiased_Federated_Learning_CVPR_2023_paper.pdf + Learning models trained on biased datasets tend to observe correlations between categorical and undesirable features, which result in degraded performances. Most existing debiased learning models are designed for centralized machine learning, which cannot be directly applied to distributed settings like federated learning (FL), which collects data at distinct clients with privacy preserved. To tackle the challenging task of debiased federated learning, we present a novel FL framework of Bias-Eliminating Augmentation Learning (FedBEAL), which learns to deploy Bias-Eliminating Augmenters (BEA) for producing client-specific bias-conflicting samples at each client. Since the bias types or attributes are not known in advance, a unique learning strategy is presented to jointly train BEA with the proposed FL framework. Extensive image classification experiments on datasets with various bias types confirm the effectiveness and applicability of our FedBEAL, which performs favorably against state-of-the-art debiasing and FL methods for debiased FL. + + + + Understanding the Robustness of 3D Object Detection With Bird's-Eye-View Representations in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhu_Understanding_the_Robustness_of_3D_Object_Detection_With_Birds-Eye-View_Representations_CVPR_2023_paper.pdf + 3D object detection is an essential perception task in autonomous driving to understand the environments. The Bird's-Eye-View (BEV) representations have significantly improved the performance of 3D detectors with camera inputs on popular benchmarks. However, there still lacks a systematic understanding of the robustness of these vision-dependent BEV models, which is closely related to the safety of autonomous driving systems. In this paper, we evaluate the natural and adversarial robustness of various representative models under extensive settings, to fully understand their behaviors influenced by explicit BEV features compared with those without BEV. In addition to the classic settings, we propose a 3D consistent patch attack by applying adversarial patches in the 3D space to guarantee the spatiotemporal consistency, which is more realistic for the scenario of autonomous driving. 
With substantial experiments, we draw several findings: 1) BEV models tend to be more stable than previous methods under different natural conditions and common corruptions due to the expressive spatial representations; 2) BEV models are more vulnerable to adversarial noises, mainly caused by the redundant BEV features; 3) Camera-LiDAR fusion models have superior performance under different settings with multi-modal inputs, but BEV fusion model is still vulnerable to adversarial noises of both point cloud and image. These findings alert the safety issue in the applications of BEV detectors and could facilitate the development of more robust models. + + + + Generalist: Decoupling Natural and Robust Generalization + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Generalist_Decoupling_Natural_and_Robust_Generalization_CVPR_2023_paper.pdf + Deep neural networks obtained by standard training have been constantly plagued by adversarial examples. Although adversarial training demonstrates its capability to defend against adversarial examples, unfortunately, it leads to an inevitable drop in the natural generalization. To address the issue, we decouple the natural generalization and the robust generalization from joint training and formulate different training strategies for each one. Specifically, instead of minimizing a global loss on the expectation over these two generalization errors, we propose a bi-expert framework called Generalist where we simultaneously train base learners with task-aware strategies so that they can specialize in their own fields. The parameters of base learners are collected and combined to form a global learner at intervals during the training process. The global learner is then distributed to the base learners as initialized parameters for continued training. Theoretically, we prove that the risks of Generalist will get lower once the base learners are well trained. Extensive experiments verify the applicability of Generalist to achieve high accuracy on natural examples while maintaining considerable robustness to adversarial ones. Code is available at https://github.com/PKU-ML/Generalist. + + + + Explicit Visual Prompting for Low-Level Structure Segmentations + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_Explicit_Visual_Prompting_for_Low-Level_Structure_Segmentations_CVPR_2023_paper.pdf + We consider the generic problem of detecting low-level structures in images, which includes segmenting the manipulated parts, identifying out-of-focus pixels, separating shadow regions, and detecting concealed objects. Whereas each such topic has been typically addressed with a domain-specific solution, we show that a unified approach performs well across all of them. We take inspiration from the widely-used pre-training and then prompt tuning protocols in NLP and propose a new visual prompting model, named Explicit Visual Prompting (EVP). Different from the previous visual prompting which is typically a dataset-level implicit embedding, our key insight is to enforce the tunable parameters focusing on the explicit visual content from each individual image, i.e., the features from frozen patch embeddings and the input's high-frequency components. The proposed EVP significantly outperforms other parameter-efficient tuning protocols under the same amount of tunable parameters (5.7% extra trainable parameters of each task). EVP also achieves state-of-the-art performances on diverse low-level structure segmentation tasks compared to task-specific solutions. 
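The EVP abstract above conditions its prompts on the input's high-frequency components; a minimal way to obtain such a component is an FFT high-pass filter, sketched below (the square mask and the mask_ratio value are assumptions of this sketch, not the paper's exact recipe).

import torch

def high_frequency_component(img, mask_ratio=0.25):
    # img: (B, C, H, W) float tensor; zero a central low-frequency square of the
    # shifted spectrum and transform back, keeping only high-frequency content
    B, C, H, W = img.shape
    spec = torch.fft.fftshift(torch.fft.fft2(img), dim=(-2, -1))
    h, w = int(H * mask_ratio / 2), int(W * mask_ratio / 2)
    cy, cx = H // 2, W // 2
    spec[..., cy - h:cy + h, cx - w:cx + w] = 0
    return torch.fft.ifft2(torch.fft.ifftshift(spec, dim=(-2, -1))).real

hfc = high_frequency_component(torch.rand(1, 3, 224, 224))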
Our code is available at: https://github.com/NiFangBaAGe/Explicit-Visual-Prompt. + + + + Practical Network Acceleration With Tiny Sets + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Practical_Network_Acceleration_With_Tiny_Sets_CVPR_2023_paper.pdf + Due to data privacy issues, accelerating networks with tiny training sets has become a critical need in practice. Previous methods mainly adopt filter-level pruning to accelerate networks with scarce training samples. In this paper, we reveal that dropping blocks is a fundamentally superior approach in this scenario. It enjoys a higher acceleration ratio and results in a better latency-accuracy performance under the few-shot setting. To choose which blocks to drop, we propose a new concept namely recoverability to measure the difficulty of recovering the compressed network. Our recoverability is efficient and effective for choosing which blocks to drop. Finally, we propose an algorithm named PRACTISE to accelerate networks using only tiny sets of training images. PRACTISE outperforms previous methods by a significant margin. For 22% latency reduction, PRACTISE surpasses previous methods by on average 7% on ImageNet-1k. It also enjoys high generalization ability, working well under data-free or out-of-domain data settings, too. Our code is at https://github.com/DoctorKey/Practise. + + + + NeRF-RPN: A General Framework for Object Detection in NeRFs + http://openaccess.thecvf.com//content/CVPR2023/papers/Hu_NeRF-RPN_A_General_Framework_for_Object_Detection_in_NeRFs_CVPR_2023_paper.pdf + This paper presents the first significant object detection framework, NeRF-RPN, which directly operates on NeRF. Given a pre-trained NeRF model, NeRF-RPN aims to detect all bounding boxes of objects in a scene. By exploiting a novel voxel representation that incorporates multi-scale 3D neural volumetric features, we demonstrate it is possible to regress the 3D bounding boxes of objects in NeRF directly without rendering the NeRF at any viewpoint. NeRF-RPN is a general framework and can be applied to detect objects without class labels. We experimented NeRF-RPN with various backbone architectures, RPN head designs, and loss functions. All of them can be trained in an end-to-end manner to estimate high quality 3D bounding boxes. To facilitate future research in object detection for NeRF, we built a new benchmark dataset which consists of both synthetic and real-world data with careful labeling and clean up. Code and dataset are available at https://github.com/lyclyc52/NeRF_RPN. + + + + Masked Wavelet Representation for Compact Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2023/papers/Rho_Masked_Wavelet_Representation_for_Compact_Neural_Radiance_Fields_CVPR_2023_paper.pdf + Neural radiance fields (NeRF) have demonstrated the potential of coordinate-based neural representation (neural fields or implicit neural representation) in neural rendering. However, using a multi-layer perceptron (MLP) to represent a 3D scene or object requires enormous computational resources and time. There have been recent studies on how to reduce these computational inefficiencies by using additional data structures, such as grids or trees. Despite the promising performance, the explicit data structure necessitates a substantial amount of memory. In this work, we present a method to reduce the size without compromising the advantages of having additional data structures. In detail, we propose using the wavelet transform on grid-based neural fields. 
Grid-based neural fields provide fast convergence, and the wavelet transform, whose efficiency has been demonstrated in high-performance standard codecs, improves the parameter efficiency of grids. Furthermore, in order to achieve a higher sparsity of grid coefficients while maintaining reconstruction quality, we present a novel trainable masking approach. Experimental results demonstrate that non-spatial grid coefficients, such as wavelet coefficients, are capable of attaining a higher level of sparsity than spatial grid coefficients, resulting in a more compact representation. With our proposed mask and compression pipeline, we achieved state-of-the-art performance within a memory budget of 2 MB. Our code is available at https://github.com/daniel03c1/masked_wavelet_nerf. + + + + ObjectStitch: Object Compositing With Diffusion Model + http://openaccess.thecvf.com//content/CVPR2023/papers/Song_ObjectStitch_Object_Compositing_With_Diffusion_Model_CVPR_2023_paper.pdf + Object compositing based on 2D images is a challenging problem since it typically involves multiple processing stages such as color harmonization, geometry correction and shadow generation to generate realistic results. Furthermore, annotating training data pairs for compositing requires substantial manual effort from professionals, and is hardly scalable. Thus, with the recent advances in generative models, in this work, we propose a self-supervised framework for object compositing by leveraging the power of conditional diffusion models. Our framework can holistically address the object compositing task in a unified model, transforming the viewpoint, geometry, color and shadow of the generated object while requiring no manual labeling. To preserve the input object's characteristics, we introduce a content adaptor that helps to maintain categorical semantics and object appearance. A data augmentation method is further adopted to improve the fidelity of the generator. Our method outperforms relevant baselines in both realism and faithfulness of the synthesized result images in a user study on various real-world images. + + + + Anchor3DLane: Learning To Regress 3D Anchors for Monocular 3D Lane Detection + http://openaccess.thecvf.com//content/CVPR2023/papers/Huang_Anchor3DLane_Learning_To_Regress_3D_Anchors_for_Monocular_3D_Lane_CVPR_2023_paper.pdf + Monocular 3D lane detection is a challenging task due to its lack of depth information. A popular solution is to first transform the front-viewed (FV) images or features into the bird's-eye-view (BEV) space with inverse perspective mapping (IPM) and detect lanes from BEV features. However, the reliance of IPM on the flat-ground assumption and the loss of context information make it inaccurate to restore 3D information from BEV representations. An attempt has been made to get rid of BEV and predict 3D lanes from FV representations directly, but it still underperforms other BEV-based methods given its lack of a structured representation for 3D lanes. In this paper, we define 3D lane anchors in the 3D space and propose a BEV-free method named Anchor3DLane to predict 3D lanes directly from FV representations. 3D lane anchors are projected onto the FV features to extract features that contain both structural and context information for accurate predictions. In addition, we develop a global optimization method that makes use of the equal-width property between lanes to reduce the lateral error of predictions.
Extensive experiments on three popular 3D lane detection benchmarks show that our Anchor3DLane outperforms previous BEV-based methods and achieves state-of-the-art performance. The code is available at: https://github.com/tusen-ai/Anchor3DLane. + + + + Class-Balancing Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Qin_Class-Balancing_Diffusion_Models_CVPR_2023_paper.pdf + In recent studies, diffusion-based models have shown the merit of generating high-quality visual data while preserving better diversity. However, such observations are only justified on curated data distributions, where the data samples are nicely pre-processed to be uniformly distributed in terms of their labels. In practice, long-tailed data distributions are more common, and how diffusion models perform on such class-imbalanced data remains unknown. In this work, we first investigate this problem and observe significant degradation in both diversity and fidelity when the diffusion model is trained on datasets with class-imbalanced distributions. Especially in tail classes, the generations largely lose diversity and we observe severe mode-collapse issues. To tackle this problem, we start from the hypothesis that the data distribution is not class-balanced, and propose Class-Balancing Diffusion Models (CBDM) that are trained with a distribution adjustment regularizer as a solution. Experiments show that images generated by CBDM exhibit higher diversity and quality in both quantitative and qualitative ways. We benchmark the generation results on the CIFAR100/CIFAR100LT datasets and show outstanding performance on the downstream recognition task. + + + + AstroNet: When Astrocyte Meets Artificial Neural Network + http://openaccess.thecvf.com//content/CVPR2023/papers/Han_AstroNet_When_Astrocyte_Meets_Artificial_Neural_Network_CVPR_2023_paper.pdf + Network structure learning aims to optimize network architectures and make them more efficient without compromising performance. In this paper, we first study astrocytes, a new mechanism for regulating connections in the classic M-P neuron. Then, building on astrocytes, we propose AstroNet, which can adaptively optimize neuron connections and thereby performs structure learning for higher accuracy and efficiency. AstroNet is based on our Astrocyte-Neuron model, with a temporal regulation mechanism and a global connection mechanism, which is inspired by the bidirectional communication property of astrocytes. With this model, the proposed AstroNet uses a neural network (NN) for performing tasks, and an astrocyte network (AN) to continuously optimize the connections of the NN, i.e., adaptively assigning weights to the neuron units in the NN. Experiments on the classification task demonstrate that our AstroNet can efficiently optimize the network structure while achieving state-of-the-art (SOTA) accuracy. + + + + Feature Alignment and Uniformity for Test Time Adaptation + http://openaccess.thecvf.com//content/CVPR2023/papers/Wang_Feature_Alignment_and_Uniformity_for_Test_Time_Adaptation_CVPR_2023_paper.pdf + Test-time adaptation (TTA) aims to adapt deep neural networks when receiving out-of-distribution test domain samples. In this setting, the model can only access online unlabeled test samples and models pre-trained on the training domains. We first address TTA as a feature revision problem due to the domain gap between the source domains and the target domains.
After that, we follow the two measurements alignment and uniformity to discuss the test time feature revision. For test time feature uniformity, we propose a test time self-distillation strategy to guarantee the consistency of uniformity between representations of the current batch and all the previous batches. For test time feature alignment, we propose a memorized spatial local clustering strategy to align the representations among the neighborhood samples for the upcoming batch. To deal with the common noisy label problem, we propound the entropy and consistency filters to select and drop the possible noisy labels. To prove the scalability and efficacy of our method, we conduct experiments on four domain generalization benchmarks and four medical image segmentation tasks with various backbones. Experiment results show that our method not only improves baseline stably but also outperforms existing state-of-the-art test time adaptation methods. + + + + Balanced Product of Calibrated Experts for Long-Tailed Recognition + http://openaccess.thecvf.com//content/CVPR2023/papers/Aimar_Balanced_Product_of_Calibrated_Experts_for_Long-Tailed_Recognition_CVPR_2023_paper.pdf + Many real-world recognition problems are characterized by long-tailed label distributions. These distributions make representation learning highly challenging due to limited generalization over the tail classes. If the test distribution differs from the training distribution, e.g. uniform versus long-tailed, the problem of the distribution shift needs to be addressed. A recent line of work proposes learning multiple diverse experts to tackle this issue. Ensemble diversity is encouraged by various techniques, e.g. by specializing different experts in the head and the tail classes. In this work, we take an analytical approach and extend the notion of logit adjustment to ensembles to form a Balanced Product of Experts (BalPoE). BalPoE combines a family of experts with different test-time target distributions, generalizing several previous approaches. We show how to properly define these distributions and combine the experts in order to achieve unbiased predictions, by proving that the ensemble is Fisher-consistent for minimizing the balanced error. Our theoretical analysis shows that our balanced ensemble requires calibrated experts, which we achieve in practice using mixup. We conduct extensive experiments and our method obtains new state-of-the-art results on three long-tailed datasets: CIFAR-100-LT, ImageNet-LT, and iNaturalist-2018. Our code is available at https://github.com/emasa/BalPoE-CalibratedLT. + + + + PanoSwin: A Pano-Style Swin Transformer for Panorama Understanding + http://openaccess.thecvf.com//content/CVPR2023/papers/Ling_PanoSwin_A_Pano-Style_Swin_Transformer_for_Panorama_Understanding_CVPR_2023_paper.pdf + In panorama understanding, the widely used equirectangular projection (ERP) entails boundary discontinuity and spatial distortion. It severely deteriorates the conventional CNNs and vision Transformers on panoramas. In this paper, we propose a simple yet effective architecture named PanoSwin to learn panorama representations with ERP. To deal with the challenges brought by equirectangular projection, we explore a pano-style shift windowing scheme and novel pitch attention to address the boundary discontinuity and the spatial distortion, respectively. 
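The BalPoE abstract above extends logit adjustment to an ensemble of experts with different target distributions; the single-model form it builds on is sketched below (the textbook logit-adjusted cross-entropy, not the paper's multi-expert, mixup-calibrated code).

import torch
import torch.nn.functional as F

def logit_adjusted_loss(logits, labels, class_counts, tau=1.0):
    # Adding tau * log(prior) to the logits during training biases the softmax
    # toward the empirical label prior, so dropping the adjustment at test time
    # yields predictions suited to a balanced test distribution.
    prior = class_counts / class_counts.sum()
    adjusted = logits + tau * torch.log(prior).unsqueeze(0)
    return F.cross_entropy(adjusted, labels)

counts = torch.tensor([5000., 500., 50.])              # long-tailed class counts
loss = logit_adjusted_loss(torch.randn(4, 3), torch.tensor([0, 1, 2, 0]), counts)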
Besides, based on spherical distance and Cartesian coordinates, we adapt absolute positional encodings and relative positional biases for panoramas to enhance panoramic geometry information. Realizing that planar image understanding might share some common knowledge with panorama understanding, we devise a novel two-stage learning framework to facilitate knowledge transfer from the planar images to panoramas. We conduct experiments against the state-of-the-art on various panoramic tasks, i.e., panoramic object detection, panoramic classification, and panoramic layout estimation. The experimental results demonstrate the effectiveness of PanoSwin in panorama understanding. + + + + Parameter Efficient Local Implicit Image Function Network for Face Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Sarkar_Parameter_Efficient_Local_Implicit_Image_Function_Network_for_Face_Segmentation_CVPR_2023_paper.pdf + Face parsing is defined as the per-pixel labeling of images containing human faces. The labels are defined to identify key facial regions like eyes, lips, nose, hair, etc. In this work, we make use of the structural consistency of the human face to propose a lightweight face-parsing method using a Local Implicit Function network, FP-LIIF. We propose a simple architecture having a convolutional encoder and a pixel MLP decoder that uses 1/26th number of parameters compared to the state-of-the-art models and yet matches or outperforms state-of-the-art models on multiple datasets, like CelebAMask-HQ and LaPa. We do not use any pretraining, and compared to other works, our network can also generate segmentation at different resolutions without any changes in the input resolution. This work enables the use of facial segmentation on low-compute or low-bandwidth devices because of its higher FPS and smaller model size. + + + + Referring Image Matting + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Referring_Image_Matting_CVPR_2023_paper.pdf + Different from conventional image matting, which either requires user-defined scribbles/trimap to extract a specific foreground object or directly extracts all the foreground objects in the image indiscriminately, we introduce a new task named Referring Image Matting (RIM) in this paper, which aims to extract the meticulous alpha matte of the specific object that best matches the given natural language description, thus enabling a more natural and simpler instruction for image matting. First, we establish a large-scale challenging dataset RefMatte by designing a comprehensive image composition and expression generation engine to automatically produce high-quality images along with diverse text attributes based on public datasets. RefMatte consists of 230 object categories, 47,500 images, 118,749 expression-region entities, and 474,996 expressions. Additionally, we construct a real-world test set with 100 high-resolution natural images and manually annotate complex phrases to evaluate the out-of-domain generalization abilities of RIM methods. Furthermore, we present a novel baseline method CLIPMat for RIM, including a context-embedded prompt, a text-driven semantic pop-up, and a multi-level details extractor. Extensive experiments on RefMatte in both keyword and expression settings validate the superiority of CLIPMat over representative methods. We hope this work could provide novel insights into image matting and encourage more follow-up studies. The dataset, code and models are available at https://github.com/JizhiziLi/RIM. 
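Because the FP-LIIF abstract above couples a convolutional encoder with a pixel MLP queried at continuous coordinates, a generic local-implicit-image-function decoder is sketched below; layer sizes, the class count and the use of grid_sample are assumptions of this illustration, not the paper's architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalImplicitDecoder(nn.Module):
    def __init__(self, feat_dim=64, num_classes=11):
        super().__init__()
        self.mlp = nn.Sequential(                       # pixel MLP over feature + coordinate
            nn.Linear(feat_dim + 2, 128), nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, feat_map, coords):
        # feat_map: (B, C, H, W) encoder features; coords: (B, N, 2) in [-1, 1]
        sampled = F.grid_sample(feat_map, coords.unsqueeze(1), align_corners=False)
        sampled = sampled.squeeze(2).permute(0, 2, 1)   # (B, N, C) features at the queries
        return self.mlp(torch.cat([sampled, coords], dim=-1))  # (B, N, num_classes) logits

dec = LocalImplicitDecoder()
logits = dec(torch.randn(2, 64, 32, 32), torch.rand(2, 100, 2) * 2 - 1)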
+ + + + Modality-Invariant Visual Odometry for Embodied Vision + http://openaccess.thecvf.com//content/CVPR2023/papers/Memmel_Modality-Invariant_Visual_Odometry_for_Embodied_Vision_CVPR_2023_paper.pdf + Effectively localizing an agent in a realistic, noisy setting is crucial for many embodied vision tasks. Visual Odometry (VO) is a practical substitute for unreliable GPS and compass sensors, especially in indoor environments. While SLAM-based methods show a solid performance without large data requirements, they are less flexible and robust w.r.t. to noise and changes in the sensor suite compared to learning-based approaches. Recent deep VO models, however, limit themselves to a fixed set of input modalities, e.g., RGB and depth, while training on millions of samples. When sensors fail, sensor suites change, or modalities are intentionally looped out due to available resources, e.g., power consumption, the models fail catastrophically. Furthermore, training these models from scratch is even more expensive without simulator access or suitable existing models that can be fine-tuned. While such scenarios get mostly ignored in simulation, they commonly hinder a model's reusability in real-world applications. We propose a Transformer-based modality-invariant VO approach that can deal with diverse or changing sensor suites of navigation agents. Our model outperforms previous methods while training on only a fraction of the data. We hope this method opens the door to a broader range of real-world applications that can benefit from flexible and learned VO models. + + + + What You Can Reconstruct From a Shadow + http://openaccess.thecvf.com//content/CVPR2023/papers/Liu_What_You_Can_Reconstruct_From_a_Shadow_CVPR_2023_paper.pdf + 3D reconstruction is a fundamental problem in computer vision, and the task is especially challenging when the object to reconstruct is partially or fully occluded. We introduce a method that uses the shadows cast by an unobserved object in order to infer the possible 3D volumes under occlusion. We create a differentiable image formation model that allows us to jointly infer the 3D shape of an object, its pose, and the position of a light source. Since the approach is end-to-end differentiable, we are able to integrate learned priors of object geometry in order to generate realistic 3D shapes of different object categories. Experiments and visualizations show that the method is able to generate multiple possible solutions that are consistent with the observation of the shadow. Our approach works even when the position of the light source and object pose are both unknown. Our approach is also robust to real-world images where ground-truth shadow mask is unknown. + + + + Lite DETR: An Interleaved Multi-Scale Encoder for Efficient DETR + http://openaccess.thecvf.com//content/CVPR2023/papers/Li_Lite_DETR_An_Interleaved_Multi-Scale_Encoder_for_Efficient_DETR_CVPR_2023_paper.pdf + Recent DEtection TRansformer-based (DETR) models have obtained remarkable performance. Its success cannot be achieved without the re-introduction of multi-scale feature fusion in the encoder. However, the excessively increased tokens in multi-scale features, especially for about 75% of low-level features, are quite computationally inefficient, which hinders real applications of DETR models. In this paper, we present Lite DETR, a simple yet efficient end-to-end object detection framework that can effectively reduce the GFLOPs of the detection head by 60% while keeping 99% of the original performance. 
Specifically, we design an efficient encoder block to update high-level features (corresponding to small-resolution feature maps) and low-level features (corresponding to large-resolution feature maps) in an interleaved way. In addition, to better fuse cross-scale features, we develop a key-aware deformable attention to predict more reliable attention weights. Comprehensive experiments validate the effectiveness and efficiency of the proposed Lite DETR, and the efficient encoder strategy can generalize well across existing DETR-based models. The code will be released after the blind review. + + + + MobileNeRF: Exploiting the Polygon Rasterization Pipeline for Efficient Neural Field Rendering on Mobile Architectures + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_MobileNeRF_Exploiting_the_Polygon_Rasterization_Pipeline_for_Efficient_Neural_Field_CVPR_2023_paper.pdf + Neural Radiance Fields (NeRFs) have demonstrated amazing ability to synthesize images of 3D scenes from novel views. However, they rely upon specialized volumetric rendering algorithms based on ray marching that are mismatched to the capabilities of widely deployed graphics hardware. This paper introduces a new NeRF representation based on textured polygons that can synthesize novel images efficiently with standard rendering pipelines. The NeRF is represented as a set of polygons with textures representing binary opacities and feature vectors. Traditional rendering of the polygons with a z-buffer yields an image with features at every pixel, which are interpreted by a small, view-dependent MLP running in a fragment shader to produce a final pixel color. This approach enables NeRFs to be rendered with the traditional polygon rasterization pipeline, which provides massive pixel-level parallelism, achieving interactive frame rates on a wide range of compute platforms, including mobile phones. + + + + Pseudo-Label Guided Contrastive Learning for Semi-Supervised Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Basak_Pseudo-Label_Guided_Contrastive_Learning_for_Semi-Supervised_Medical_Image_Segmentation_CVPR_2023_paper.pdf + Although recent works in semi-supervised learning (SemiSL) have accomplished significant success in natural image segmentation, the task of learning discriminative representations from limited annotations has been an open problem in medical images. Contrastive Learning (CL) frameworks use the notion of similarity measure which is useful for classification problems, however, they fail to transfer these quality representations for accurate pixel-level segmentation. To this end, we propose a novel semi-supervised patch-based CL framework for medical image segmentation without using any explicit pretext task. We harness the power of both CL and SemiSL, where the pseudo-labels generated from SemiSL aid CL by providing additional guidance, whereas discriminative class information learned in CL leads to accurate multi-class segmentation. Additionally, we formulate a novel loss that synergistically encourages inter-class separability and intra-class compactness among the learned representations. A new inter-patch semantic disparity mapping using average patch entropy is employed for a guided sampling of positives and negatives in the proposed CL framework. Experimental analysis on three publicly available datasets of multiple modalities reveals the superiority of our proposed method as compared to the state-of-the-art methods. 
Code is available at: https://github.com/hritam-98/PatchCL-MedSeg. + + + + Self-Supervised Geometry-Aware Encoder for Style-Based 3D GAN Inversion + http://openaccess.thecvf.com//content/CVPR2023/papers/Lan_Self-Supervised_Geometry-Aware_Encoder_for_Style-Based_3D_GAN_Inversion_CVPR_2023_paper.pdf + StyleGAN has achieved great progress in 2D face reconstruction and semantic editing via image inversion and latent editing. While studies over extending 2D StyleGAN to 3D faces have emerged, a corresponding generic 3D GAN inversion framework is still missing, limiting the applications of 3D face reconstruction and semantic editing. In this paper, we study the challenging problem of 3D GAN inversion where a latent code is predicted given a single face image to faithfully recover its 3D shapes and detailed textures. The problem is ill-posed: innumerable compositions of shape and texture could be rendered to the current image. Furthermore, with the limited capacity of a global latent code, 2D inversion methods cannot preserve faithful shape and texture at the same time when applied to 3D models. To solve this problem, we devise an effective self-training scheme to constrain the learning of inversion. The learning is done efficiently without any real-world 2D-3D training pairs but proxy samples generated from a 3D GAN. In addition, apart from a global latent code that captures the coarse shape and texture information, we augment the generation network with a local branch, where pixel-aligned features are added to faithfully reconstruct face details. We further consider a new pipeline to perform 3D view-consistent editing. Extensive experiments show that our method outperforms state-of-the-art inversion methods in both shape and texture reconstruction quality. + + + + POEM: Reconstructing Hand in a Point Embedded Multi-View Stereo + http://openaccess.thecvf.com//content/CVPR2023/papers/Yang_POEM_Reconstructing_Hand_in_a_Point_Embedded_Multi-View_Stereo_CVPR_2023_paper.pdf + Enabling neural networks to capture 3D geometry-aware features is essential in multi-view based vision tasks. Previous methods usually encode the 3D information of multi-view stereo into the 2D features. In contrast, we present a novel method, named POEM, that directly operates on the 3D POints Embedded in the Multi-view stereo for reconstructing the hand mesh in it. Point is a natural form of 3D information and an ideal medium for fusing features across views, as it has different projections on different views. Our method is thus built on a simple yet effective idea: a complex 3D hand mesh can be represented by a set of 3D points that 1) are embedded in the multi-view stereo, 2) carry features from the multi-view images, and 3) encircle the hand. To leverage the power of points, we design two operations: point-based feature fusion and a cross-set point attention mechanism. Evaluation on three challenging multi-view datasets shows that POEM outperforms the state-of-the-art in hand mesh reconstruction. Code and models are available for research at github.com/lixiny/POEM + + + + Progressively Optimized Local Radiance Fields for Robust View Synthesis + http://openaccess.thecvf.com//content/CVPR2023/papers/Meuleman_Progressively_Optimized_Local_Radiance_Fields_for_Robust_View_Synthesis_CVPR_2023_paper.pdf + We present an algorithm for reconstructing the radiance field of a large-scale scene from a single casually captured video. The task poses two core challenges.
First, most existing radiance field reconstruction approaches rely on accurate pre-estimated camera poses from Structure-from-Motion algorithms, which frequently fail on in-the-wild videos. Second, using a single, global radiance field with finite representational capacity does not scale to longer trajectories in an unbounded scene. For handling unknown poses, we jointly estimate the camera poses with radiance field in a progressive manner. We show that progressive optimization significantly improves the robustness of the reconstruction. For handling large unbounded scenes, we dynamically allocate new local radiance fields trained with frames within a temporal window. This further improves robustness (e.g., performs well even under moderate pose drifts) and allows us to scale to large scenes. Our extensive evaluation on the Tanks and Temples dataset and our collected outdoor dataset, Static Hikes, show that our approach compares favorably with the state-of-the-art. + + + + GeoMVSNet: Learning Multi-View Stereo With Geometry Perception + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_GeoMVSNet_Learning_Multi-View_Stereo_With_Geometry_Perception_CVPR_2023_paper.pdf + Recent cascade Multi-View Stereo (MVS) methods can efficiently estimate high-resolution depth maps through narrowing hypothesis ranges. However, previous methods ignored the vital geometric information embedded in coarse stages, leading to vulnerable cost matching and sub-optimal reconstruction results. In this paper, we propose a geometry awareness model, termed GeoMVSNet, to explicitly integrate geometric clues implied in coarse stages for delicate depth estimation. In particular, we design a two-branch geometry fusion network to extract geometric priors from coarse estimations to enhance structural feature extraction at finer stages. Besides, we embed the coarse probability volumes, which encode valuable depth distribution attributes, into the lightweight regularization network to further strengthen depth-wise geometry intuition. Meanwhile, we apply the frequency domain filtering to mitigate the negative impact of the high-frequency regions and adopt the curriculum learning strategy to progressively boost the geometry integration of the model. To intensify the full-scene geometry perception of our model, we present the depth distribution similarity loss based on the Gaussian-Mixture Model assumption. Extensive experiments on DTU and Tanks and Temples (T&T) datasets demonstrate that our GeoMVSNet achieves state-of-the-art results and ranks first on the T&T-Advanced set. Code is available at https://github.com/doubleZ0108/GeoMVSNet. + + + + TeSLA: Test-Time Self-Learning With Automatic Adversarial Augmentation + http://openaccess.thecvf.com//content/CVPR2023/papers/Tomar_TeSLA_Test-Time_Self-Learning_With_Automatic_Adversarial_Augmentation_CVPR_2023_paper.pdf + Most recent test-time adaptation methods focus on only classification tasks, use specialized network architectures, destroy model calibration or rely on lightweight information from the source domain. To tackle these issues, this paper proposes a novel Test-time Self-Learning method with automatic Adversarial augmentation dubbed TeSLA for adapting a pre-trained source model to the unlabeled streaming test data. In contrast to conventional self-learning methods based on cross-entropy, we introduce a new test-time loss function through an implicitly tight connection with the mutual information and online knowledge distillation. 
Furthermore, we propose a learnable efficient adversarial augmentation module that further enhances online knowledge distillation by simulating high entropy augmented images. Our method achieves state-of-the-art classification and segmentation results on several benchmarks and types of domain shifts, particularly on challenging measurement shifts of medical images. TeSLA also benefits from several desirable properties compared to competing methods in terms of calibration, uncertainty metrics, insensitivity to model architectures, and source training strategies, all supported by extensive ablations. Our code and models are available at https://github.com/devavratTomar/TeSLA. + + + + RefTeacher: A Strong Baseline for Semi-Supervised Referring Expression Comprehension + http://openaccess.thecvf.com//content/CVPR2023/papers/Sun_RefTeacher_A_Strong_Baseline_for_Semi-Supervised_Referring_Expression_Comprehension_CVPR_2023_paper.pdf + Referring expression comprehension (REC) often requires a large number of instance-level annotations for fully supervised learning, which are laborious and expensive. In this paper, we present the first attempt of semi-supervised learning for REC and propose a strong baseline method called RefTeacher. Inspired by the recent progress in computer vision, RefTeacher adopts a teacher-student learning paradigm, where the teacher REC network predicts pseudo-labels for optimizing the student one. This paradigm allows REC models to exploit massive unlabeled data based on a small fraction of labeled. In particular, we also identify two key challenges in semi-supervised REC, namely, sparse supervision signals and worse pseudo-label noise. To address these issues, we equip RefTeacher with two novel designs called Attention-based Imitation Learning (AIL) and Adaptive Pseudo-label Weighting (APW). AIL can help the student network imitate the recognition behaviors of the teacher, thereby obtaining sufficient supervision signals. APW can help the model adaptively adjust the contributions of pseudo-labels with varying qualities, thus avoiding confirmation bias. To validate RefTeacher, we conduct extensive experiments on three REC benchmark datasets. Experimental results show that RefTeacher obtains obvious gains over the fully supervised methods. More importantly, using only 10% labeled data, our approach allows the model to achieve near 100% fully supervised performance, e.g., only -2.78% on RefCOCO. + + + + Handwritten Text Generation From Visual Archetypes + http://openaccess.thecvf.com//content/CVPR2023/papers/Pippi_Handwritten_Text_Generation_From_Visual_Archetypes_CVPR_2023_paper.pdf + Generating synthetic images of handwritten text in a writer-specific style is a challenging task, especially in the case of unseen styles and new words, and even more when these latter contain characters that are rarely encountered during training. While emulating a writer's style has been recently addressed by generative models, the generalization towards rare characters has been disregarded. In this work, we devise a Transformer-based model for Few-Shot styled handwritten text generation and focus on obtaining a robust and informative representation of both the text and the style. In particular, we propose a novel representation of the textual content as a sequence of dense vectors obtained from images of symbols written as standard GNU Unifont glyphs, which can be considered their visual archetypes. 
This strategy is more suitable for generating characters that, despite having been seen rarely during training, possibly share visual details with the frequently observed ones. As for the style, we obtain a robust representation of unseen writers' calligraphy by exploiting specific pre-training on a large synthetic dataset. Quantitative and qualitative results demonstrate the effectiveness of our proposal in generating words in unseen styles and with rare characters more faithfully than existing approaches relying on independent one-hot encodings of the characters. + + + + Unicode Analogies: An Anti-Objectivist Visual Reasoning Challenge + http://openaccess.thecvf.com//content/CVPR2023/papers/Spratley_Unicode_Analogies_An_Anti-Objectivist_Visual_Reasoning_Challenge_CVPR_2023_paper.pdf + Analogical reasoning enables agents to extract relevant information from scenes, and efficiently navigate them in familiar ways. While progressive-matrix problems (PMPs) are becoming popular for the development and evaluation of analogical reasoning in computer vision, we argue that the dominant methodology in this area struggles to expose the lack of meaningful generalisation in solvers, and reinforces an objectivist stance on perception -- that objects can only be seen one way -- which we believe to be counter-productive. In this paper, we introduce the Unicode Analogies challenge, consisting of polysemic, character-based PMPs to benchmark fluid conceptualisation ability in vision systems. Writing systems have evolved characters at multiple levels of abstraction, from iconic through to symbolic representations, producing both visually interrelated yet exceptionally diverse images when compared to those exhibited by existing PMP datasets. Our framework has been designed to challenge models by presenting tasks much harder to complete without robust feature extraction, while remaining largely solvable by human participants. We therefore argue that Unicode Analogies elegantly captures and tests for a facet of human visual reasoning that is severely lacking in current-generation AI. + + + + FFF: Fragment-Guided Flexible Fitting for Building Complete Protein Structures + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_FFF_Fragment-Guided_Flexible_Fitting_for_Building_Complete_Protein_Structures_CVPR_2023_paper.pdf + Cryo-electron microscopy (cryo-EM) is a technique for reconstructing the 3-dimensional (3D) structure of biomolecules (especially large protein complexes and molecular assemblies). As the resolution increases to the near-atomic scale, building protein structures de novo from cryo-EM maps becomes possible. Recently, recognition-based de novo building methods have shown the potential to streamline this process. However, it cannot build a complete structure due to the low signal-to-noise ratio (SNR) problem. At the same time, AlphaFold has led to a great breakthrough in predicting protein structures. This has inspired us to combine fragment recognition and structure prediction methods to build a complete structure. In this paper, we propose a new method named FFF that bridges protein structure prediction and protein structure recognition with flexible fitting. First, a multi-level recognition network is used to capture various structural features from the input 3D cryo-EM map. Next, protein structural fragments are generated using pseudo peptide vectors and a protein sequence alignment method based on these extracted features. 
Finally, a complete structural model is constructed using the predicted protein fragments via flexible fitting. Based on our benchmark tests, FFF outperforms the baseline methods for building complete protein structures. + + + + Align Your Latents: High-Resolution Video Synthesis With Latent Diffusion Models + http://openaccess.thecvf.com//content/CVPR2023/papers/Blattmann_Align_Your_Latents_High-Resolution_Video_Synthesis_With_Latent_Diffusion_Models_CVPR_2023_paper.pdf + Latent Diffusion Models (LDMs) enable high-quality image synthesis while avoiding excessive compute demands by training a diffusion model in a compressed lower-dimensional latent space. Here, we apply the LDM paradigm to high-resolution video generation, a particularly resource-intensive task. We first pre-train an LDM on images only; then, we turn the image generator into a video generator by introducing a temporal dimension to the latent space diffusion model and fine-tuning on encoded image sequences, i.e., videos. Similarly, we temporally align diffusion model upsamplers, turning them into temporally consistent video super resolution models. We focus on two relevant real-world applications: Simulation of in-the-wild driving data and creative content creation with text-to-video modeling. In particular, we validate our Video LDM on real driving videos of resolution 512x1024, achieving state-of-the-art performance. Furthermore, our approach can easily leverage off-the-shelf pre-trained image LDMs, as we only need to train a temporal alignment model in that case. Doing so, we turn the publicly available, state-of-the-art text-to-image LDM Stable Diffusion into an efficient and expressive text-to-video model with resolution up to 1280x2048. We show that the temporal layers trained in this way generalize to different fine-tuned text-to-image LDMs. Utilizing this property, we show the first results for personalized text-to-video generation, opening exciting directions for future content creation. Project page: https://nv-tlabs.github.io/VideoLDM/ + + + + Implicit 3D Human Mesh Recovery Using Consistency With Pose and Shape From Unseen-View + http://openaccess.thecvf.com//content/CVPR2023/papers/Cho_Implicit_3D_Human_Mesh_Recovery_Using_Consistency_With_Pose_and_CVPR_2023_paper.pdf + From an image of a person, we can easily infer the natural 3D pose and shape of the person even if ambiguity exists. This is because we have a mental model that allows us to imagine a person's appearance at different viewing directions from a given image and utilize the consistency between them for inference. However, existing human mesh recovery methods only consider the direction in which the image was taken due to their structural limitations. Hence, we propose "Implicit 3D Human Mesh Recovery (ImpHMR)" that can implicitly imagine a person in 3D space at the feature-level via Neural Feature Fields. In ImpHMR, feature fields are generated by a CNN-based image encoder for a given image. Then, the 2D feature map is volume-rendered from the feature field for a given viewing direction, and the pose and shape parameters are regressed from the feature. To utilize consistency with pose and shape from unseen-view, if there are 3D labels, the model predicts results including the silhouette from an arbitrary direction and makes it equal to the rotated ground-truth. In the case of only 2D labels, we perform self-supervised learning through the constraint that the pose and shape parameters inferred from different directions should be the same.
Extensive evaluations show the efficacy of the proposed method. + + + + Teleidoscopic Imaging System for Microscale 3D Shape Reconstruction + http://openaccess.thecvf.com//content/CVPR2023/papers/Kawahara_Teleidoscopic_Imaging_System_for_Microscale_3D_Shape_Reconstruction_CVPR_2023_paper.pdf + This paper proposes a practical method for capturing microscale 3D shape with a teleidoscopic imaging system. The main challenge in microscale 3D shape reconstruction is to capture the target from multiple viewpoints with a large enough depth-of-field. Our idea is to employ a teleidoscopic measurement system consisting of three planar mirrors and a monocentric lens. The planar mirrors virtually define multiple viewpoints by multiple reflections, and the monocentric lens realizes high magnification with less blur and a surround view even in close-up imaging. Our contributions include a structured ray-pixel camera model which handles refractive and reflective projection rays efficiently, analytical evaluations of the depth of field of our teleidoscopic imaging system, and a practical calibration algorithm for the teleidoscopic imaging system. Evaluations with real images prove the concept of our measurement system. + + + + UV Volumes for Real-Time Rendering of Editable Free-View Human Performance + http://openaccess.thecvf.com//content/CVPR2023/papers/Chen_UV_Volumes_for_Real-Time_Rendering_of_Editable_Free-View_Human_Performance_CVPR_2023_paper.pdf + Neural volume rendering enables photo-realistic renderings of a human performer in free-view, a critical task in immersive VR/AR applications. But the practice is severely limited by high computational costs in the rendering process. To solve this problem, we propose the UV Volumes, a new approach that can render an editable free-view video of a human performer in real-time. It separates the high-frequency (i.e., non-smooth) human appearance from the 3D volume, and encodes them into 2D neural texture stacks (NTS). The smooth UV volumes allow much smaller and shallower neural networks to obtain densities and texture coordinates in 3D while capturing detailed appearance in 2D NTS. For editability, the mapping between the parameterized human model and the smooth texture coordinates allows for better generalization to novel poses and shapes. Furthermore, the use of NTS enables interesting applications, e.g., retexturing. Extensive experiments on CMU Panoptic, ZJU Mocap, and H36M datasets show that our model can render 960 x 540 images at 30 FPS on average with comparable photo-realism to state-of-the-art methods. + + + + JacobiNeRF: NeRF Shaping With Mutual Information Gradients + http://openaccess.thecvf.com//content/CVPR2023/papers/Xu_JacobiNeRF_NeRF_Shaping_With_Mutual_Information_Gradients_CVPR_2023_paper.pdf + We propose a method that trains a neural radiance field (NeRF) to encode not only the appearance of the scene but also semantic correlations between scene points, regions, or entities -- aiming to capture their mutual co-variation patterns. In contrast to the traditional first-order photometric reconstruction objective, our method explicitly regularizes the learning dynamics to align the Jacobians of highly-correlated entities, which proves to maximize the mutual information between them under random scene perturbations. By paying attention to this second-order information, we can shape a NeRF to express semantically meaningful synergies when the network weights are changed by a delta along the gradient of a single entity, region, or even a point.
To demonstrate the merit of this mutual information modeling, we leverage the coordinated behavior of scene entities that emerges from our shaping to perform label propagation for semantic and instance segmentation. Our experiments show that a JacobiNeRF is more efficient in propagating annotations among 2D pixels and 3D points compared to NeRFs without mutual information shaping, especially in extremely sparse label regimes -- thus reducing annotation burden. The same machinery can further be used for entity selection or scene modifications. Our code is available at https://github.com/xxm19/jacobinerf. + + + + Open-Set Representation Learning Through Combinatorial Embedding + http://openaccess.thecvf.com//content/CVPR2023/papers/Kim_Open-Set_Representation_Learning_Through_Combinatorial_Embedding_CVPR_2023_paper.pdf + Visual recognition tasks are often limited to dealing with a small subset of classes simply because the labels for the remaining classes are unavailable. We are interested in identifying novel concepts in a dataset through representation learning based on both labeled and unlabeled examples, and extending the horizon of recognition to both known and novel classes. To address this challenging task, we propose a combinatorial learning approach, which naturally clusters the examples in unseen classes using the compositional knowledge given by multiple supervised meta-classifiers on heterogeneous label spaces. The representations given by the combinatorial embedding are made more robust by unsupervised pairwise relation learning. The proposed algorithm discovers novel concepts via a joint optimization for enhancing the discriminativeness of unseen classes as well as learning the representations of known classes generalizable to novel ones. Our extensive experiments demonstrate remarkable performance gains by the proposed approach on public datasets for image retrieval and image categorization with novel class discovery. + + + + Multi-View Stereo Representation Revist: Region-Aware MVSNet + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_Multi-View_Stereo_Representation_Revist_Region-Aware_MVSNet_CVPR_2023_paper.pdf + Deep learning-based multi-view stereo has emerged as a powerful paradigm for reconstructing complete, geometrically detailed objects from multiple views. Most of the existing approaches only estimate the pixel-wise depth value by minimizing the gap between the predicted point and the intersection of ray and surface, usually ignoring the surface topology. This topology is essential for textureless regions and surface boundaries, which otherwise cannot be properly reconstructed. To address this issue, we suggest taking advantage of the point-to-surface distance so that the model is able to perceive a wider range of surfaces. To this end, we predict the distance volume from the cost volume to estimate the signed distance of points around the surface. Our proposed RA-MVSNet is patch-aware, since the perception range is enhanced by associating hypothetical planes with a patch of surface. Therefore, it can improve the completeness of textureless regions and reduce the outliers at the boundary. Moreover, the mesh topologies with fine details can be generated by the introduced distance volume. Compared to conventional deep learning-based multi-view stereo methods, our proposed RA-MVSNet approach obtains more complete reconstruction results by taking advantage of signed distance supervision.
The experiments on both the DTU and Tanks & Temples datasets demonstrate that our proposed approach achieves state-of-the-art results. + + + + A Unified HDR Imaging Method With Pixel and Patch Level + http://openaccess.thecvf.com//content/CVPR2023/papers/Yan_A_Unified_HDR_Imaging_Method_With_Pixel_and_Patch_Level_CVPR_2023_paper.pdf + Mapping Low Dynamic Range (LDR) images with different exposures to High Dynamic Range (HDR) remains nontrivial and challenging on dynamic scenes due to ghosting caused by object motion or camera jitter. With the success of Deep Neural Networks (DNNs), several DNN-based methods have been proposed to alleviate ghosting, but they cannot generate satisfactory results when motion and saturation occur. To generate visually pleasing HDR images in various cases, we propose a hybrid HDR deghosting network, called HyHDRNet, to learn the complicated relationship between reference and non-reference images. The proposed HyHDRNet consists of a content alignment subnetwork and a Transformer-based fusion subnetwork. Specifically, to effectively avoid ghosting from the source, the content alignment subnetwork uses patch aggregation and ghost attention to integrate similar content from other non-reference images at the patch level and suppress undesired components at the pixel level. To achieve mutual guidance between patch-level and pixel-level, we leverage a gating module to sufficiently swap useful information in both ghosted and saturated regions. Furthermore, to obtain a high-quality HDR image, the Transformer-based fusion subnetwork uses a Residual Deformable Transformer Block (RDTB) to adaptively merge information for different exposed regions. We examined the proposed method on four widely used public HDR image deghosting datasets. Experiments demonstrate that HyHDRNet outperforms state-of-the-art methods both quantitatively and qualitatively, achieving appealing HDR visualization with unified textures and colors. + + + + Partial Network Cloning + http://openaccess.thecvf.com//content/CVPR2023/papers/Ye_Partial_Network_Cloning_CVPR_2023_paper.pdf + In this paper, we study a novel task that enables partial knowledge transfer from pre-trained models, which we term Partial Network Cloning (PNC). Unlike prior methods that update all or at least part of the parameters in the target network throughout the knowledge transfer process, PNC conducts partial parametric "cloning" from a source network and then injects the cloned module into the target, without modifying its parameters. Thanks to the transferred module, the target network is expected to gain additional functionality, such as inference on new classes; whenever needed, the cloned module can be readily removed from the target, with its original parameters and competence kept intact. Specifically, we introduce an innovative learning scheme that allows us to simultaneously identify the component to be cloned from the source and the position where it should be inserted within the target network, so as to ensure optimal performance. Experimental results on several datasets demonstrate that our method yields a significant improvement of 5% in accuracy and 50% in locality when compared with parameter-tuning based methods.
+ + + + MOTRv2: Bootstrapping End-to-End Multi-Object Tracking by Pretrained Object Detectors + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhang_MOTRv2_Bootstrapping_End-to-End_Multi-Object_Tracking_by_Pretrained_Object_Detectors_CVPR_2023_paper.pdf + In this paper, we propose MOTRv2, a simple yet effective pipeline to bootstrap end-to-end multi-object tracking with a pretrained object detector. Existing end-to-end methods, e.g. MOTR and TrackFormer are inferior to their tracking-by-detection counterparts mainly due to their poor detection performance. We aim to improve MOTR by elegantly incorporating an extra object detector. We first adopt the anchor formulation of queries and then use an extra object detector to generate proposals as anchors, providing detection prior to MOTR. The simple modification greatly eases the conflict between joint learning detection and association tasks in MOTR. MOTRv2 keeps the end-to-end feature and scales well on large-scale benchmarks. MOTRv2 achieves the top performance (73.4% HOTA) among all existing methods on the DanceTrack dataset. Moreover, MOTRv2 reaches state-of-the-art performance on the BDD100K dataset. We hope this simple and effective pipeline can provide some new insights to the end-to-end MOT community. The code will be released in the near future. + + + + Principles of Forgetting in Domain-Incremental Semantic Segmentation in Adverse Weather Conditions + http://openaccess.thecvf.com//content/CVPR2023/papers/Kalb_Principles_of_Forgetting_in_Domain-Incremental_Semantic_Segmentation_in_Adverse_Weather_CVPR_2023_paper.pdf + Deep neural networks for scene perception in automated vehicles achieve excellent results for the domains they were trained on. However, in real-world conditions, the domain of operation and its underlying data distribution are subject to change. Adverse weather conditions, in particular, can significantly decrease model performance when such data are not available during training. Additionally, when a model is incrementally adapted to a new domain, it suffers from catastrophic forgetting, causing a significant drop in performance on previously observed domains. Despite recent progress in reducing catastrophic forgetting, its causes and effects remain obscure. Therefore, we study how the representations of semantic segmentation models are affected during domain-incremental learning in adverse weather conditions. Our experiments and representational analyses indicate that catastrophic forgetting is primarily caused by changes to low-level features in domain-incremental learning and that learning more general features on the source domain using pre-training and image augmentations leads to efficient feature reuse in subsequent tasks, which drastically reduces catastrophic forgetting. These findings highlight the importance of methods that facilitate generalized features for effective continual learning algorithms. + + + + Neural Texture Synthesis With Guided Correspondence + http://openaccess.thecvf.com//content/CVPR2023/papers/Zhou_Neural_Texture_Synthesis_With_Guided_Correspondence_CVPR_2023_paper.pdf + Markov random fields (MRFs) are the cornerstone of classical approaches to example-based texture synthesis. Yet, it is not fully valued in the deep learning era. This paper aims to re-promote the combination of MRFs and neural networks, i.e., the CNNMRF model, for texture synthesis, with two key observations made. 
We first propose to compute the Guided Correspondence Distance in the nearest neighbor search, based on which a Guided Correspondence loss is defined to measure the similarity of the output texture to the example. Experiments show that our approach surpasses existing neural approaches in uncontrolled and controlled texture synthesis. More importantly, the Guided Correspondence loss can function as a general textural loss in, e.g., training generative networks for real-time controlled synthesis and inversion-based single-image editing. In contrast, existing textural losses, such as the Sliced Wasserstein loss, cannot work on these challenging tasks. + + + + Interactive Cartoonization With Controllable Perceptual Factors + http://openaccess.thecvf.com//content/CVPR2023/papers/Ahn_Interactive_Cartoonization_With_Controllable_Perceptual_Factors_CVPR_2023_paper.pdf + Cartoonization is a task that renders natural photos into cartoon styles. Previous deep methods have focused only on end-to-end translation, preventing artists from manipulating the results. To tackle this, in this work, we propose a novel solution that enables editing of texture and color features based on the cartoon creation process. To do that, we design a model architecture with separate texture and color decoders to decouple these attributes. In the texture decoder, we propose a texture controller, which enables a user to control stroke style and abstraction to generate diverse cartoon textures. We also introduce an HSV color augmentation to induce the networks to generate consistent color translation. To the best of our knowledge, our work is the first method to control the cartoonization during the inference step, generating high-quality results compared to baselines. + + + + \ No newline at end of file diff --git a/rss/CVPR2024.xml b/rss/CVPR2024.xml new file mode 100644 index 0000000..a2c26d6 --- /dev/null +++ b/rss/CVPR2024.xml @@ -0,0 +1,16297 @@ + + + + CVPR 2024 + + + DPMesh: Exploiting Diffusion Prior for Occluded Human Mesh Recovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_DPMesh_Exploiting_Diffusion_Prior_for_Occluded_Human_Mesh_Recovery_CVPR_2024_paper.pdf + The recovery of occluded human meshes poses challenges for current methods due to the difficulty in extracting effective image features under severe occlusion. In this paper we introduce DPMesh an innovative framework for occluded human mesh recovery that capitalizes on the profound knowledge about object structure and spatial relationships embedded in a pre-trained text-to-image diffusion model. Unlike previous methods reliant on conventional backbones for vanilla feature extraction DPMesh seamlessly integrates the pre-trained denoising U-Net with potent priors as its image backbone and performs a single-step inference to provide occlusion-aware information. To enhance the perception capability for occluded poses DPMesh incorporates judicious guidance via condition injection which produces effective controls from 2D observations for the denoising U-Net. Furthermore we explore a dedicated noisy key-point reasoning approach to mitigate disturbances arising from occlusion and crowded scenarios. This strategy fully unleashes the perceptual capability of the diffusion prior thereby enhancing accuracy.
Extensive quantitative and qualitative experiments affirm the efficacy of our framework as we outperform state-of-the-art methods on both occlusion-specific and standard datasets underscoring its ability to achieve precise and robust 3D human mesh recovery particularly in challenging scenarios involving occlusion and crowded scenes. Code is available at https://github.com/EternalEvan/DPMesh. + + + + HEAL-SWIN: A Vision Transformer On The Sphere + http://openaccess.thecvf.com//content/CVPR2024/papers/Carlsson_HEAL-SWIN_A_Vision_Transformer_On_The_Sphere_CVPR_2024_paper.pdf + High-resolution wide-angle fisheye images are becoming more and more important for robotics applications such as autonomous driving. However using ordinary convolutional neural networks or vision transformers on this data is problematic due to projection and distortion losses introduced when projecting to a rectangular grid on the plane. We introduce the HEAL-SWIN transformer which combines the highly uniform Hierarchical Equal Area iso-Latitude Pixelation (HEALPix) grid used in astrophysics and cosmology with the Hierarchical Shifted-Window (SWIN) transformer to yield an efficient and flexible model capable of training on high-resolution distortion-free spherical data. In HEAL-SWIN the nested structure of the HEALPix grid is used to perform the patching and windowing operations of the SWIN transformer enabling the network to process spherical representations with minimal computational overhead. We demonstrate the superior performance of our model on both synthetic and real automotive datasets as well as a selection of other image datasets for semantic segmentation depth regression and classification tasks. Our code is publicly available. + + + + 3D Paintbrush: Local Stylization of 3D Shapes with Cascaded Score Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Decatur_3D_Paintbrush_Local_Stylization_of_3D_Shapes_with_Cascaded_Score_CVPR_2024_paper.pdf + We present 3D Paintbrush a technique for automatically texturing local semantic regions on meshes via text descriptions. Our method is designed to operate directly on meshes producing texture maps which seamlessly integrate into standard graphics pipelines. We opt to simultaneously produce a localization map (to specify the edit region) and a texture map which conforms to it. This approach improves the quality of both the localization and the stylization. To enhance the details and resolution of the textured area we leverage multiple stages of a cascaded diffusion model to supervise our local editing technique with generative priors learned from images at different resolutions. Our technique referred to as Cascaded Score Distillation (CSD) simultaneously distills scores at multiple resolutions in a cascaded fashion enabling control over both the granularity and global understanding of the supervision. We demonstrate the effectiveness of 3D Paintbrush to locally texture different semantic regions on a variety of shapes. + + + + Guided Slot Attention for Unsupervised Video Object Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Guided_Slot_Attention_for_Unsupervised_Video_Object_Segmentation_CVPR_2024_paper.pdf + Unsupervised video object segmentation aims to segment the most prominent object in a video sequence. However the existence of complex backgrounds and multiple foreground objects make this task challenging. 
To address this issue we propose a guided slot attention network to reinforce spatial structural information and obtain better foreground-background separation. The foreground and background slots which are initialized with query guidance are iteratively refined based on interactions with template information. Furthermore to improve slot-template interaction and effectively fuse global and local features in the target and reference frames K-nearest neighbors filtering and a feature aggregation transformer are introduced. The proposed model achieves state-of-the-art performance on two popular datasets. Additionally we demonstrate the robustness of the proposed model in challenging scenes through various comparative experiments. + + + + Programmable Motion Generation for Open-Set Motion Control Tasks + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Programmable_Motion_Generation_for_Open-Set_Motion_Control_Tasks_CVPR_2024_paper.pdf + Character animation in real-world scenarios necessitates a variety of constraints such as trajectories key-frames interactions etc. Existing methodologies typically treat single or a finite set of these constraint(s) as separate control tasks. These methods are often specialized and the tasks they address are rarely extendable or customizable. We categorize these as solutions to the close-set motion control problem. In response to the complexity of practical motion control we propose and attempt to solve the open-set motion control problem. This problem is characterized by an open and fully customizable set of motion control tasks. To address this we introduce a new paradigm programmable motion generation. In this paradigm any given motion control task is broken down into a combination of atomic constraints. These constraints are then programmed into an error function that quantifies the degree to which a motion sequence adheres to them. We utilize a pre-trained motion generation model and optimize its latent code to minimize the error function of the generated motion. Consequently the generated motion not only inherits the prior of the generative model but also satisfies the requirements of the compounded constraints. Our experiments demonstrate that our approach can generate high-quality motions when addressing a wide range of unseen tasks. These tasks encompass motion control by motion dynamics geometric constraints physical laws interactions with scenes objects or the character's own body parts etc. All of these are achieved in a unified approach without the need for ad-hoc paired training data collection or specialized network designs. During the programming of novel tasks we observed the emergence of new skills beyond those of the prior model. With the assistance of large language models we also achieved automatic programming. We hope that this work will pave the way for the motion control of general AI agents. + + + + SCE-MAE: Selective Correspondence Enhancement with Masked Autoencoder for Self-Supervised Landmark Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_SCE-MAE_Selective_Correspondence_Enhancement_with_Masked_Autoencoder_for_Self-Supervised_Landmark_CVPR_2024_paper.pdf + Self-supervised landmark estimation is a challenging task that demands the formation of locally distinct feature representations to identify sparse facial landmarks in the absence of annotated data. 
To tackle this task existing state-of-the-art (SOTA) methods (1) extract coarse features from backbones that are trained with instance-level self-supervised learning (SSL) paradigms which neglect the dense prediction nature of the task (2) aggregate them into memory-intensive hypercolumn formations and (3) supervise lightweight projector networks to naively establish full local correspondences among all pairs of spatial features. In this paper we introduce SCE-MAE a framework that (1) leverages the MAE, a region-level SSL method that naturally better suits the landmark prediction task (2) operates on the vanilla feature map instead of on expensive hypercolumns and (3) employs a Correspondence Approximation and Refinement Block (CARB) that utilizes a simple density peak clustering algorithm and our proposed Locality-Constrained Repellence Loss to directly hone only select local correspondences. We demonstrate through extensive experiments that SCE-MAE is highly effective and robust outperforming existing SOTA methods by large margins of 20%-44% on the landmark matching and 9%-15% on the landmark detection tasks. + + + + LAKE-RED: Camouflaged Images Generation by Latent Background Knowledge Retrieval-Augmented Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_LAKE-RED_Camouflaged_Images_Generation_by_Latent_Background_Knowledge_Retrieval-Augmented_Diffusion_CVPR_2024_paper.pdf + Camouflaged vision perception is an important vision task with numerous practical applications. Due to the expensive collection and labeling costs this community struggles with a major bottleneck that the species category of its datasets is limited to a small number of object species. However the existing camouflaged generation methods require specifying the background manually thus failing to extend the camouflaged sample diversity in a low-cost manner. In this paper we propose a Latent Background Knowledge Retrieval-Augmented Diffusion (LAKE-RED) for camouflaged image generation. To our knowledge our contributions mainly include: (1) For the first time we propose a camouflaged generation paradigm that does not need to receive any background inputs. (2) Our LAKE-RED is the first knowledge retrieval-augmented method with interpretability for camouflaged generation in which we propose an idea that knowledge retrieval and reasoning enhancement are separated explicitly to alleviate the task-specific challenges. Moreover our method is not restricted to specific foreground targets or backgrounds offering a potential for extending camouflaged vision perception to more diverse domains. (3) Experimental results demonstrate that our method outperforms the existing approaches generating more realistic camouflage images. + + + + TIGER: Time-Varying Denoising Model for 3D Point Cloud Generation with Diffusion Process + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_TIGER_Time-Varying_Denoising_Model_for_3D_Point_Cloud_Generation_with_CVPR_2024_paper.pdf + Recently diffusion models have emerged as a new powerful generative method for 3D point cloud generation tasks. However few works study the effect of the architecture of the diffusion model in the 3D point cloud resorting to the typical UNet model developed for 2D images. Inspired by the wide adoption of Transformers we study the complementary role of convolution (from UNet) and attention (from Transformers). We discover that their respective importance change according to the timestep in the diffusion process.
At early stage attention has an outsized influence because Transformers are found to generate the overall shape more quickly and at later stages when adding fine detail convolution starts having a larger impact on the generated point cloud's local surface quality. In light of this observation we propose a time-varying two-stream denoising model combined with convolution layers and transformer blocks. We generate an optimizable mask from each timestep to reweigh global and local features obtaining time-varying fused features. Experimentally we demonstrate that our proposed method quantitatively outperforms other state-of-the-art methods regarding visual quality and diversity. Code is available at github.com/Zhiyuan-R/Tiger-Time-varying-Diffusion-Model-for-Point-Cloud-Generation. + + + + ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Pang_ASH_Animatable_Gaussian_Splats_for_Efficient_and_Photoreal_Human_Rendering_CVPR_2024_paper.pdf + Real-time rendering of photorealistic and controllable human avatars stands as a cornerstone in Computer Vision and Graphics. While recent advances in neural implicit rendering have unlocked unprecedented photorealism for digital avatars real-time performance has mostly been demonstrated for static scenes only. To address this we propose ASH an animatable Gaussian splatting approach for photorealistic rendering of dynamic humans in real time. We parameterize the clothed human as animatable 3D Gaussians which can be efficiently splatted into image space to generate the final rendering. However naively learning the Gaussian parameters in 3D space poses a severe challenge in terms of compute. Instead we attach the Gaussians onto a deformable character model and learn their parameters in 2D texture space which allows leveraging efficient 2D convolutional architectures that easily scale with the required number of Gaussians. We benchmark ASH with competing methods on pose-controllable avatars demonstrating that our method outperforms existing real-time methods by a large margin and shows comparable or even better results than offline methods. + + + + ArtAdapter: Text-to-Image Style Transfer using Multi-Level Style Encoder and Explicit Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_ArtAdapter_Text-to-Image_Style_Transfer_using_Multi-Level_Style_Encoder_and_Explicit_CVPR_2024_paper.pdf + This work introduces ArtAdapter a transformative text-to-image (T2I) style transfer framework that transcends traditional limitations of color brushstrokes and object shape capturing high-level style elements such as composition and distinctive artistic expression. The integration of a multi-level style encoder with our proposed explicit adaptation mechanism enables ArtAdapter to achieve unprecedented fidelity in style transfer ensuring close alignment with textual descriptions. Additionally the incorporation of an Auxiliary Content Adapter (ACA) effectively separates content from style alleviating the borrowing of content from style references. Moreover our novel fast finetuning approach could further enhance zero-shot style representation while mitigating the risk of overfitting. Comprehensive evaluations confirm that ArtAdapter surpasses current state-of-the-art methods.
+ + + + Activity-Biometrics: Person Identification from Daily Activities + http://openaccess.thecvf.com//content/CVPR2024/papers/Azad_Activity-Biometrics_Person_Identification_from_Daily_Activities_CVPR_2024_paper.pdf + In this work we study a novel problem which focuses on person identification while performing daily activities. Learning biometric features from RGB videos is challenging due to spatio-temporal complexity and presence of appearance biases such as clothing color and background. We propose ABNet a novel framework which leverages disentanglement of biometric and non-biometric features to perform effective person identification from daily activities. ABNet relies on a bias-less teacher to learn biometric features from RGB videos and explicitly disentangle non-biometric features with the help of biometric distortion. In addition ABNet also exploits activity prior for biometrics which is enabled by joint biometric and activity learning. We perform comprehensive evaluation of the proposed approach across five different datasets which are derived from existing activity recognition benchmarks. Furthermore we extensively compare ABNet with existing works in person identification and demonstrate its effectiveness for activity-based biometrics across all five datasets. The code and dataset can be accessed at: https://github.com/sacrcv/Activity-Biometrics/ + + + + Z*: Zero-shot Style Transfer via Attention Reweighting + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_Z_Zero-shot_Style_Transfer_via_Attention_Reweighting_CVPR_2024_paper.pdf + Despite the remarkable progress in image style transfer formulating style in the context of art is inherently subjective and challenging. In contrast to existing methods this study shows that vanilla diffusion models can directly extract style information and seamlessly integrate the generative prior into the content image without retraining. Specifically we adopt dual denoising paths to represent content/style references in latent space and then guide the content image denoising process with style latent codes. We further reveal that the cross-attention mechanism in latent diffusion models tends to blend the content and style images resulting in stylized outputs that deviate from the original content image. To overcome this limitation we introduce a cross-attention reweighting strategy. Through theoretical analysis and experiments we demonstrate the effectiveness and superiority of the diffusion-based zero-shot style transfer via attention reweighting Z-STAR. + + + + Learning Continuous 3D Words for Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_Learning_Continuous_3D_Words_for_Text-to-Image_Generation_CVPR_2024_paper.pdf + Current controls over diffusion models (e.g. through text or ControlNet) for image generation fall short in recognizing abstract continuous attributes like illumination direction or non-rigid shape change. In this paper we present an approach for allowing users of text-to-image models to have fine-grained control of several attributes in an image. We do this by engineering special sets of input tokens that can be transformed in a continuous manner we call them Continuous 3D Words. These attributes can for example be represented as sliders and applied jointly with text prompts for fine-grained control over image generation. 
Given only a single mesh and a rendering engine we show that our approach can be adopted to provide continuous user control over several 3D-aware attributes including time-of-day illumination bird wing orientation dollyzoom effect and object poses. Our method is capable of conditioning image creation with multiple Continuous 3D Words and text descriptions simultaneously while adding no overhead to the generative process. + + + + MarkovGen: Structured Prediction for Efficient Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Jayasumana_MarkovGen_Structured_Prediction_for_Efficient_Text-to-Image_Generation_CVPR_2024_paper.pdf + Modern text-to-image generation models produce high-quality images that are both photorealistic and faithful to the text prompts. However this quality comes at significant computational cost: nearly all of these models are iterative and require running sampling multiple times with large models. This iterative process is needed to ensure that different regions of the image are not only aligned with the text prompt but also compatible with each other. In this work we propose a light-weight approach to achieving this compatibility between different regions of an image using a Markov Random Field (MRF) model. We demonstrate the effectiveness of this method on top of the latent token-based Muse text-to-image model. The MRF richly encodes the compatibility among image tokens at different spatial locations to improve quality and significantly reduce the required number of Muse sampling steps. Inference with the MRF is significantly cheaper and its parameters can be quickly learned through back-propagation by modeling MRF inference as a differentiable neural-network layer. Our full model MarkovGen uses this proposed MRF model to both speed up Muse by 1.5xand produce higher quality images by decreasing undesirable image artifacts. + + + + HashPoint: Accelerated Point Searching and Sampling for Neural Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_HashPoint_Accelerated_Point_Searching_and_Sampling_for_Neural_Rendering_CVPR_2024_paper.pdf + In this paper we address the problem of efficient point searching and sampling for volume neural rendering. Within this realm two typical approaches are employed: rasterization and ray tracing. The rasterization-based methods enable real-time rendering at the cost of increased memory and lower fidelity. In contrast the ray-tracing-based methods yield superior quality but demand longer rendering time. We solve this problem by our HashPoint method combining these two strategies leveraging rasterization for efficient point searching and sampling and ray marching for rendering. Our method optimizes point searching by rasterizing points within the camera's view organizing them in a hash table and facilitating rapid searches. Notably we accelerate the rendering process by adaptive sampling on the primary surface encountered by the ray. Our approach yields substantial speed-up for a range of state-of-the-art ray-tracing-based methods maintaining equivalent or superior accuracy across synthetic and real test datasets. 
The code will be available at https://jiahao-ma.github.io/hashpoint/ + + + + MFP: Making Full Use of Probability Maps for Interactive Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_MFP_Making_Full_Use_of_Probability_Maps_for_Interactive_Image_CVPR_2024_paper.pdf + In recent interactive segmentation algorithms previous probability maps are used as network input to help predictions in the current segmentation round. However despite the utilization of previous masks useful information contained in the probability maps is not well propagated to the current predictions. In this paper to overcome this limitation we propose a novel and effective algorithm for click-based interactive image segmentation called MFP which attempts to make full use of probability maps. We first modulate previous probability maps to enhance their representations of user-specified objects. Then we feed the modulated probability maps as additional input to the segmentation network. We implement the proposed MFP algorithm based on the ResNet-34 HRNet-18 and ViT-B backbones and assess the performance extensively on various datasets. It is demonstrated that MFP meaningfully outperforms the existing algorithms using identical backbones. The source codes are available at https://github.com/cwlee00/MFP. + + + + StyLitGAN: Image-Based Relighting via Latent Control + http://openaccess.thecvf.com//content/CVPR2024/papers/Bhattad_StyLitGAN_Image-Based_Relighting_via_Latent_Control_CVPR_2024_paper.pdf + We describe a novel method StyLitGAN for relighting and resurfacing images in the absence of labeled data. StyLitGAN generates images with realistic lighting effects including cast shadows soft shadows inter-reflections and glossy effects without the need for paired or CGI data. StyLitGAN uses an intrinsic image method to decompose an image followed by a search of the latent space of a pretrained StyleGAN to identify a set of directions. By prompting the model to fix one component (e.g. albedo) and vary another (e.g. shading) we generate relighted images by adding the identified directions to the latent style codes. Quantitative metrics of change in albedo and lighting diversity allow us to choose effective directions using a forward selection process. Qualitative evaluation confirms the effectiveness of our method. + + + + MoMask: Generative Masked Modeling of 3D Human Motions + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_MoMask_Generative_Masked_Modeling_of_3D_Human_Motions_CVPR_2024_paper.pdf + We introduce MoMask a novel masked modeling framework for text-driven 3D human motion generation. In MoMask a hierarchical quantization scheme is employed to represent human motion as multi-layer discrete motion tokens with high-fidelity details. Starting at the base layer with a sequence of motion tokens obtained by vector quantization the residual tokens of increasing orders are derived and stored at the subsequent layers of the hierarchy. This is consequently followed by two distinct bidirectional transformers. For the base-layer motion tokens a Masked Transformer is designated to predict randomly masked motion tokens conditioned on text input at training stage. During generation (i.e. inference) stage starting from an empty sequence our Masked Transformer iteratively fills up the missing tokens; Subsequently a Residual Transformer learns to progressively predict the next-layer tokens based on the results from current layer. 
Extensive experiments demonstrate that MoMask outperforms the state-of-the-art methods on the text-to-motion generation task with an FID of 0.045 (vs e.g. 0.141 of T2M-GPT) on the HumanML3D dataset and 0.228 (vs 0.514) on KIT-ML respectively. MoMask can also be seamlessly applied in related tasks without further model fine-tuning such as text-guided temporal inpainting. + + + + Fitting Flats to Flats + http://openaccess.thecvf.com//content/CVPR2024/papers/Dogadov_Fitting_Flats_to_Flats_CVPR_2024_paper.pdf + Affine subspaces of Euclidean spaces are also referred to as flats. A standard task in computer vision or more generally in engineering and applied sciences is fitting a flat to a set of points which is commonly solved using the PCA. We generalize this technique to enable fitting a flat to a set of other flats possibly of varying dimensions based on representing the flats as squared distance fields. Compared to previous approaches such as Riemannian centers of mass in the manifold of affine Grassmannians our approach is conceptually much simpler and computationally more efficient yet offers desirable properties such as respecting symmetries and being equivariant to rigid transformations leading to more intuitive and useful results in practice. We demonstrate these claims in a number of synthetic experiments and a multi-view reconstruction task of line-like objects. + + + + Coupled Laplacian Eigenmaps for Locally-Aware 3D Rigid Point Cloud Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Bastico_Coupled_Laplacian_Eigenmaps_for_Locally-Aware_3D_Rigid_Point_Cloud_Matching_CVPR_2024_paper.pdf + Point cloud matching a crucial technique in computer vision medical and robotics fields is primarily concerned with finding correspondences between pairs of point clouds or voxels. In some practical scenarios emphasizing local differences is crucial for accurately identifying a correct match thereby enhancing the overall robustness and reliability of the matching process. Commonly used shape descriptors have several limitations and often fail to provide meaningful local insights about the paired geometries. In this work we propose a new technique based on graph Laplacian eigenmaps to match point clouds by taking into account fine local structures. To deal with the order and sign ambiguity of Laplacian eigenmaps we introduce a new operator called Coupled Laplacian that allows us to easily generate aligned eigenspaces for multiple registered geometries. We show that the similarity between those aligned high-dimensional spaces provides a locally meaningful score to match shapes. We firstly evaluate the performance of the proposed technique in a point-wise manner focusing on the task of object anomaly localization on the MVTec 3D-AD dataset. Additionally we define a new medical task called automatic Bone Side Estimation (BSE) which we address through a global similarity score derived from coupled eigenspaces. In order to test it we propose a benchmark collecting bone surface structures from various public datasets. Our matching technique based on Coupled Laplacian outperforms other methods by reaching an impressive accuracy on both tasks.
+ + + + Scaling Up Video Summarization Pretraining with Large Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Argaw_Scaling_Up_Video_Summarization_Pretraining_with_Large_Language_Models_CVPR_2024_paper.pdf + Long-form video content constitutes a significant portion of internet traffic making automated video summarization an essential research problem. However existing video summarization datasets are notably limited in their size constraining the effectiveness of state-of-the-art methods for generalization. Our work aims to overcome this limitation by capitalizing on the abundance of long-form videos with dense speech-to-video alignment and the remarkable capabilities of recent large language models (LLMs) in summarizing long text. We introduce an automated and scalable pipeline for generating a large-scale video summarization dataset using LLMs as Oracle summarizers. By leveraging the generated dataset we analyze the limitations of existing approaches and propose a new video summarization model that effectively addresses them. To facilitate further research in the field our work also presents a new benchmark dataset that contains 1200 long videos each with high-quality summaries annotated by professionals. Extensive experiments clearly indicate that our proposed approach sets a new state-of-the-art in video summarization across several benchmarks. + + + + Continuous Optical Zooming: A Benchmark for Arbitrary-Scale Image Super-Resolution in Real World + http://openaccess.thecvf.com//content/CVPR2024/papers/Fu_Continuous_Optical_Zooming_A_Benchmark_for_Arbitrary-Scale_Image_Super-Resolution_in_CVPR_2024_paper.pdf + Most current arbitrary-scale image super-resolution (SR) methods have commonly relied on simulated data generated by simple synthetic degradation models (e.g. bicubic downsampling) at continuous various scales thereby falling short in capturing the complex degradation of real-world images. This limitation hinders the visual quality of these methods when applied to real-world images. To address this issue we propose the Continuous Optical Zooming dataset (COZ) by constructing an automatic imaging system to collect images at fine-grained various focal lengths within a specific range and providing strict image pair alignment. The COZ dataset serves as a benchmark to provide real-world data for training and testing arbitrary-scale SR models. To enhance the model's robustness against real-world image degradation we propose a Local Mix Implicit network (LMI) based on the MLP-mixer architecture and meta-learning which directly learns the local texture information by simultaneously mixing features and coordinates of multiple independent points. The extensive experiments demonstrate the superior performance of the arbitrary-scale SR models trained on the COZ dataset compared to models trained on simulated data. Our LMI model exhibits superior effectiveness compared to other models. This study is of great significance in developing more efficient algorithms and improving the performance of arbitrary-scale image SR methods in practical applications. Our dataset and codes are available at https://github.com/pf0607/COZ. + + + + Sharingan: A Transformer Architecture for Multi-Person Gaze Following + http://openaccess.thecvf.com//content/CVPR2024/papers/Tafasca_Sharingan_A_Transformer_Architecture_for_Multi-Person_Gaze_Following_CVPR_2024_paper.pdf + Gaze is a powerful form of non-verbal communication that humans develop from an early age.
As such modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular the gaze following task in computer vision is defined as the prediction of the 2D pixel coordinates where a person in the image is looking. Previous attempts in this area have primarily centered on CNN-based architectures but they have been constrained by the need to process one person at a time which proves to be highly inefficient. In this paper we introduce a novel and effective multi-person transformer-based architecture for gaze prediction. While there exist prior works using transformers for multi-person gaze prediction they use a fixed set of learnable embeddings to decode both the person and its gaze target which requires a matching step afterward to link the predictions with the annotations. Thus it is difficult to quantitatively evaluate these methods reliably with the available benchmarks or integrate them into a larger human behavior understanding system. Instead we are the first to propose a multi-person transformer-based architecture that maintains the original task formulation and ensures control over the people fed as input. Our main contribution lies in encoding the person-specific information into a single controlled token to be processed alongside image tokens and using its output for prediction based on a novel multiscale decoding mechanism. Our new architecture achieves state-of-the-art results on the GazeFollow VideoAttentionTarget and ChildPlay datasets and outperforms comparable multi-person architectures with a notable margin. Our code checkpoints and data extractions will be made publicly available soon. + + + + Open-Vocabulary Segmentation with Semantic-Assisted Calibration + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Open-Vocabulary_Segmentation_with_Semantic-Assisted_Calibration_CVPR_2024_paper.pdf + This paper studies open-vocabulary segmentation (OVS) through calibrating in-vocabulary and domain-biased embedding space with generalized contextual prior of CLIP. As the core of open-vocabulary understanding alignment of visual content with the semantics of unbounded text has become the bottleneck of this field. To address this challenge recent works propose to utilize CLIP as an additional classifier and aggregate model predictions with CLIP classification results. Despite their remarkable progress performance of OVS methods in relevant scenarios is still unsatisfactory compared with supervised counterparts. We attribute this to the in-vocabulary embedding and domain-biased CLIP prediction. To this end we present a Semantic-assisted CAlibration Network (SCAN). In SCAN we incorporate generalized semantic prior of CLIP into proposal embedding to avoid collapsing on known categories. Besides a contextual shift strategy is applied to mitigate the lack of global context and unnatural background noise. With above designs SCAN achieves state-of-the-art performance on all popular open-vocabulary segmentation benchmarks. Furthermore we also focus on the problem of existing evaluation system that ignores semantic duplication across categories and propose a new metric called Semantic-Guided IoU (SG-IoU). 
+ + + + Towards a Perceptual Evaluation Framework for Lighting Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Giroux_Towards_a_Perceptual_Evaluation_Framework_for_Lighting_Estimation_CVPR_2024_paper.pdf + Progress in lighting estimation is tracked by computing existing image quality assessment (IQA) metrics on images from standard datasets. While this may appear to be a reasonable approach we demonstrate that doing so does not correlate to human preference when the estimated lighting is used to relight a virtual scene into a real photograph. To study this we design a controlled psychophysical experiment where human observers must choose their preference amongst rendered scenes lit using a set of lighting estimation algorithms selected from the recent literature and use it to analyse how these algorithms perform according to human perception. Then we demonstrate that none of the most popular IQA metrics from the literature taken individually correctly represent human perception. Finally we show that by learning a combination of existing IQA metrics we can more accurately represent human preference. This provides a new perceptual framework to help evaluate future lighting estimation algorithms. To encourage future research all (anonymised) perceptual data and code are available at https://lvsn.github.io/PerceptionMetric/. + + + + On Exact Inversion of DPM-Solvers + http://openaccess.thecvf.com//content/CVPR2024/papers/Hong_On_Exact_Inversion_of_DPM-Solvers_CVPR_2024_paper.pdf + Diffusion probabilistic models (DPMs) are a key component in modern generative models. DPM-solvers have achieved reduced latency and enhanced quality significantly but have posed challenges to find the exact inverse (i.e. finding the initial noise from the given image). Here we investigate the exact inversions for DPM-solvers and propose algorithms to perform them when samples are generated by the first-order as well as higher-order DPM-solvers. For each explicit denoising step in DPM-solvers we formulated the inversions using implicit methods such as gradient descent or forward step method to ensure the robustness to large classifier-free guidance unlike the prior approach using fixed-point iteration. Experimental results demonstrated that our proposed exact inversion methods significantly reduced the error of both image and noise reconstructions greatly enhanced the ability to distinguish invisible watermarks and well prevented unintended background changes consistently during image editing. + + + + CAMEL: CAusal Motion Enhancement Tailored for Lifting Text-driven Video Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_CAMEL_CAusal_Motion_Enhancement_Tailored_for_Lifting_Text-driven_Video_Editing_CVPR_2024_paper.pdf + Text-driven video editing poses significant challenges in exhibiting flicker-free visual continuity while preserving the inherent motion patterns of original videos. Existing methods operate under a paradigm where motion and appearance are intricately intertwined. This coupling leads to the network either over-fitting appearance content -- failing to capture motion patterns -- or focusing on motion patterns at the expense of content generalization to diverse textual scenarios. Inspired by the pivotal role of wavelet transform in dissecting video sequences we propose CAusal Motion Enhancement tailored for Lifting text-driven video editing (CAMEL) a novel technique with two core designs. 
First we introduce motion prompts designed to summarize motion concepts from video templates through direct optimization. The optimized prompts are purposefully integrated into latent representations of diffusion models to enhance the motion fidelity of generated results. Second to enhance motion coherence and extend the generalization of appearance content to creative textual prompts we propose the causal motion-enhanced attention mechanism. This mechanism is implemented in tandem with a novel causal motion filter synergistically enhancing the motion coherence of disentangled high-frequency components and concurrently preserving the generalization of appearance content across various textual scenarios. Extensive experimental results show the superior performance of CAMEL. + + + + FocSAM: Delving Deeply into Focused Objects in Segmenting Anything + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_FocSAM_Delving_Deeply_into_Focused_Objects_in_Segmenting_Anything_CVPR_2024_paper.pdf + The Segment Anything Model (SAM) marks a notable milestone in segmentation models highlighted by its robust zero-shot capabilities and ability to handle diverse prompts. SAM follows a pipeline that separates interactive segmentation into image preprocessing through a large encoder and interactive inference via a lightweight decoder ensuring efficient real-time performance. However SAM faces stability issues in challenging samples upon this pipeline. These issues arise from two main factors. Firstly the image preprocessing disables SAM to dynamically use image-level zoom-in strategies to refocus on the target object during interaction. Secondly the lightweight decoder struggles to sufficiently integrate interactive information with image embeddings. To address these two limitations we propose FocSAM with a pipeline redesigned on two pivotal aspects. First we propose Dynamic Window Multi-head Self-Attention (Dwin-MSA) to dynamically refocus SAM's image embeddings on the target object. Dwin-MSA localizes attention computations around the target object enhancing object-related embeddings with minimal computational overhead. Second we propose Pixel-wise Dynamic ReLU (P-DyReLU) to enable sufficient integration of interactive information from a few initial clicks that have significant impacts on the overall segmentation results. Experimentally FocSAM augments SAM's interactive segmentation performance to match the existing state-of-the-art method in segmentation quality requiring only about 5.6% of this method's inference time on CPUs. Code is available at https://github.com/YouHuang67/focsam. + + + + PRDP: Proximal Reward Difference Prediction for Large-Scale Reward Finetuning of Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_PRDP_Proximal_Reward_Difference_Prediction_for_Large-Scale_Reward_Finetuning_of_CVPR_2024_paper.pdf + Reward finetuning has emerged as a promising approach to aligning foundation models with downstream objectives. Remarkable success has been achieved in the language domain by using reinforcement learning (RL) to maximize rewards that reflect human preference. However in the vision domain existing RL-based reward finetuning methods are limited by their instability in large-scale training rendering them incapable of generalizing to complex unseen prompts. 
In this paper we propose Proximal Reward Difference Prediction (PRDP) enabling stable black-box reward finetuning for diffusion models for the first time on large-scale prompt datasets with over 100K prompts. Our key innovation is the Reward Difference Prediction (RDP) objective that has the same optimal solution as the RL objective while enjoying better training stability. Specifically the RDP objective is a supervised regression objective that tasks the diffusion model with predicting the reward difference of generated image pairs from their denoising trajectories. We theoretically prove that the diffusion model that obtains perfect reward difference prediction is exactly the maximizer of the RL objective. We further develop an online algorithm with proximal updates to stably optimize the RDP objective. In experiments we demonstrate that PRDP can match the reward maximization ability of well-established RL-based methods in small-scale training. Furthermore through large-scale training on text prompts from the Human Preference Dataset v2 and the Pick-a-Pic v1 dataset PRDP achieves superior generation quality on a diverse set of complex unseen prompts whereas RL-based methods completely fail. + + + + Task-Customized Mixture of Adapters for General Image Fusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Task-Customized_Mixture_of_Adapters_for_General_Image_Fusion_CVPR_2024_paper.pdf + General image fusion aims at integrating important information from multi-source images. However due to the significant cross-task gap the respective fusion mechanism varies considerably in practice resulting in limited performance across subtasks. To handle this problem we propose a novel task-customized mixture of adapters (TC-MoA) for general image fusion adaptively prompting various fusion tasks in a unified model. We borrow the insight from the mixture of experts (MoE) taking the experts as efficient tuning adapters to prompt a pre-trained foundation model. These adapters are shared across different tasks and constrained by mutual information regularization ensuring compatibility with different tasks while complementarity for multi-source images. The task-specific routing networks customize these adapters to extract task-specific information from different sources with dynamic dominant intensity performing adaptive visual feature prompt fusion. Notably our TC-MoA controls the dominant intensity bias for different fusion tasks successfully unifying multiple fusion tasks in a single model. Extensive experiments show that TC-MoA outperforms the competing approaches in learning commonalities while retaining compatibility for general image fusion (multi-modal multi-exposure and multi-focus) and also demonstrating striking controllability on more generalization experiments. The code is available at https://github.com/YangSun22/TC-MoA. + + + + Artist-Friendly Relightable and Animatable Neural Heads + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Artist-Friendly_Relightable_and_Animatable_Neural_Heads_CVPR_2024_paper.pdf + An increasingly common approach for creating photo-realistic digital avatars is through the use of volumetric neural fields. The original neural radiance field (NeRF) allowed for impressive novel view synthesis of static heads when trained on a set of multi-view images and follow up methods showed that these neural representations can be extended to dynamic avatars. 
Recently new variants also surpassed the usual drawback of baked-in illumination in neural representations showing that static neural avatars can be relit in any environment. In this work we simultaneously tackle both the motion and illumination problem proposing a new method for relightable and animatable neural heads. Our method builds on a proven dynamic avatar approach based on a mixture of volumetric primitives combined with a recently-proposed lightweight hardware setup for relightable neural fields and includes a novel architecture that allows relighting dynamic neural avatars performing unseen expressions in any environment even with nearfield illumination and viewpoints. + + + + From Feature to Gaze: A Generalizable Replacement of Linear Layer for Gaze Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Bao_From_Feature_to_Gaze_A_Generalizable_Replacement_of_Linear_Layer_CVPR_2024_paper.pdf + Deep-learning-based gaze estimation approaches often suffer from notable performance degradation in unseen target domains. One of the primary reasons is that the Fully Connected layer is highly prone to overfitting when mapping the high-dimensional image feature to 3D gaze. In this paper we propose Analytical Gaze Generalization framework (AGG) to improve the generalization ability of gaze estimation models without touching target domain data. The AGG consists of two modules the Geodesic Projection Module (GPM) and the Sphere-Oriented Training (SOT). GPM is a generalizable replacement of FC layer which projects high-dimensional image features to 3D space analytically to extract the principle components of gaze. Then we propose Sphere-Oriented Training (SOT) to incorporate the GPM into the training process and further improve cross-domain performances. Experimental results demonstrate that the AGG effectively alleviate the overfitting problem and consistently improves the cross-domain gaze estimation accuracy in 12 cross-domain settings without requiring any target domain data. The insight from the Analytical Gaze Generalization framework has the potential to benefit other regression tasks with physical meanings. + + + + Boosting Image Restoration via Priors from Pre-trained Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Boosting_Image_Restoration_via_Priors_from_Pre-trained_Models_CVPR_2024_paper.pdf + Pre-trained models with large-scale training data such as CLIP and Stable Diffusion have demonstrated remarkable performance in various high-level computer vision tasks such as image understanding and generation from language descriptions. Yet their potential for low-level tasks such as image restoration remains relatively unexplored. In this paper we explore such models to enhance image restoration. As off-the-shelf features (OSF) from pre-trained models do not directly serve image restoration we propose to learn an additional lightweight module called Pre-Train-Guided Refinement Module (PTG-RM) to refine restoration results of a target restoration network with OSF. PTG-RM consists of two components Pre-Train-Guided Spatial-Varying Enhancement (PTG-SVE) and Pre-Train-Guided Channel-Spatial Attention (PTG-CSA). PTG-SVE enables optimal short- and long-range neural operations while PTG-CSA enhances spatial-channel attention for restoration-related learning. 
Extensive experiments demonstrate that PTG-RM with its compact size (<1M parameters) effectively enhances restoration performance of various models across different tasks including low-light enhancement deraining deblurring and denoising. + + + + VRetouchEr: Learning Cross-frame Feature Interdependence with Imperfection Flow for Face Retouching in Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Xue_VRetouchEr_Learning_Cross-frame_Feature_Interdependence_with_Imperfection_Flow_for_Face_CVPR_2024_paper.pdf + Face Video Retouching is a complex task that often requires labor-intensive manual editing. Conventional image retouching methods perform less satisfactorily in terms of generalization performance and stability when applied to videos without exploiting the correlation among frames. To address this issue we propose a Video Retouching transformEr to remove facial imperfections in videos which is referred to as VRetouchEr. Specifically we estimate the apparent motion of imperfections between two consecutive frames and the resulting displacement vectors are used to refine the imperfection map which is synthesized from the current frame together with the corresponding encoder features. The flow-based imperfection refinement is critical for precise and stable retouching across frames. To leverage the temporal contextual information we inject the refined imperfection map into each transformer block for multi-frame masked attention computation such that we can capture the interdependence between the current frame and multiple reference frames. As a result the imperfection regions can be replaced with normal skin with high fidelity while at the same time keeping the other regions unchanged. Extensive experiments are performed to verify the superiority of VRetouchEr over state-of-the-art image retouching methods in terms of fidelity and stability. + + + + Arbitrary-Scale Image Generation and Upsampling using Latent Diffusion Model and Implicit Neural Decoder + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Arbitrary-Scale_Image_Generation_and_Upsampling_using_Latent_Diffusion_Model_and_CVPR_2024_paper.pdf + Super-resolution (SR) and image generation are important tasks in computer vision and are widely adopted in real-world applications. Most existing methods however generate images only at fixed-scale magnification and suffer from over-smoothing and artifacts. Additionally they do not offer enough diversity of output images nor image consistency at different scales. Most relevant work applied Implicit Neural Representation (INR) to the denoising diffusion model to obtain continuous-resolution yet diverse and high-quality SR results. Since this model operates in the image space the larger the resolution of image is produced the more memory and inference time is required and it also does not maintain scale-specific consistency. We propose a novel pipeline that can super-resolve an input image or generate from a random noise a novel image at arbitrary scales. The method consists of a pretrained auto-encoder a latent diffusion model and an implicit neural decoder and their learning strategies. The proposed method adopts diffusion processes in a latent space thus efficient yet aligned with output image space decoded by MLPs at arbitrary scales. More specifically our arbitrary-scale decoder is designed by the symmetric decoder w/o up-scaling from the pretrained auto-encoder and Local Implicit Image Function (LIIF) in series. 
The latent diffusion process is learnt by the denoising and the alignment losses jointly. Errors in output images are backpropagated via the fixed decoder improving the quality of output images. In the extensive experiments using multiple public benchmarks on the two tasks i.e. image super-resolution and novel image generation at arbitrary scales the proposed method outperforms relevant methods in metrics of image quality diversity and scale consistency. It is significantly better than the relevant prior-art in the inference speed and memory usage. + + + + Cache Me if You Can: Accelerating Diffusion Models through Block Caching + http://openaccess.thecvf.com//content/CVPR2024/papers/Wimbauer_Cache_Me_if_You_Can_Accelerating_Diffusion_Models_through_Block_CVPR_2024_paper.pdf + Diffusion models have recently revolutionized the field of image synthesis due to their ability to generate photorealistic images. However one of the major drawbacks of diffusion models is that the image generation process is costly. A large image-to-image network has to be applied many times to iteratively refine an image from random noise. While many recent works propose techniques to reduce the number of required steps they generally treat the underlying denoising network as a black box. In this work we investigate the behavior of the layers within the network and find that 1) the layers' output changes smoothly over time 2) the layers show distinct patterns of change and 3) the change from step to step is often very small. We hypothesize that many layer computations in the denoising network are redundant. Leveraging this we introduce Block Caching in which we reuse outputs from layer blocks of previous steps to speed up inference. Furthermore we propose a technique to automatically determine caching schedules based on each block's changes over timesteps. In our experiments we show through FID human evaluation and qualitative analysis that Block Caching allows us to generate images with higher visual quality at the same computational cost. We demonstrate this for different state-of-the-art models (LDM and EMU) and solvers (DDIM and DPM). + + + + Identifying Important Group of Pixels using Interactions + http://openaccess.thecvf.com//content/CVPR2024/papers/Sumiyasu_Identifying_Important_Group_of_Pixels_using_Interactions_CVPR_2024_paper.pdf + To better understand the behavior of image classifiers it is useful to visualize the contribution of individual pixels to the model prediction. In this study we propose a method MoXI (Model eXplanation by Interactions) that efficiently and accurately identifies a group of pixels with high prediction confidence. The proposed method employs game-theoretic concepts Shapley values and interactions taking into account the effects of individual pixels and the cooperative influence of pixels on model confidence. Theoretical analysis and experiments demonstrate that our method better identifies the pixels that are highly contributing to the model outputs than widely-used methods such as Grad-CAM Attention rollout and Shapley value. While prior studies have suffered from the exponential computational cost in the computation of Shapley value and interactions we show that this can be reduced to quadratic cost for our task. The code is available at https://github.com/KosukeSumiyasu/MoXI.
+ + + + DIOD: Self-Distillation Meets Object Discovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Kara_DIOD_Self-Distillation_Meets_Object_Discovery_CVPR_2024_paper.pdf + Instance segmentation demands substantial labeling resources. This has prompted increased interest to explore the object discovery task as an unsupervised alternative. In particular promising results were achieved in localizing instances using motion supervision only. However the motion signal introduces complexities due to its inherent noise and sparsity which constrains the effectiveness of current methodologies. In the present paper we propose DIOD (self DIstillation meets Object Discovery) the first method that places the motion-guided object discovery within a framework of continuous improvement through knowledge distillation providing solutions to existing limitations: (i) DIOD robustly eliminates the noise present in the exploited motion maps providing accurate motion-supervision (ii) DIOD leverages the discovered objects within an iterative pseudo-labeling framework enriching the initial motion-supervision with static objects which results in a cost-efficient increase in performance. Through experiments on synthetic and real-world datasets we demonstrate the benefits of bridging the gap between object discovery and distillation by significantly improving the state-of-the-art. This enhancement is also sustained across other demanding metrics so far reserved for supervised tasks. + + + + GoMAvatar: Efficient Animatable Human Modeling from Monocular Video Using Gaussians-on-Mesh + http://openaccess.thecvf.com//content/CVPR2024/papers/Wen_GoMAvatar_Efficient_Animatable_Human_Modeling_from_Monocular_Video_Using_Gaussians-on-Mesh_CVPR_2024_paper.pdf + We introduce GoMAvatar a novel approach for real-time memory-efficient high-quality animatable human modeling. GoMAvatar takes as input a single monocular video to create a digital avatar capable of re-articulation in new poses and real-time rendering from novel viewpoints while seamlessly integrating with rasterization-based graphics pipelines. Central to our method is the Gaussians-on-Mesh (GoM) representation a hybrid 3D model combining rendering quality and speed of Gaussian splatting with geometry modeling and compatibility of deformable meshes. We assess GoMAvatar on ZJU-MoCap PeopleSnapshot and various YouTube videos. GoMAvatar matches or surpasses current monocular human modeling algorithms in rendering quality and significantly outperforms them in computational efficiency (43 FPS) while being memory-efficient (3.63 MB per subject). + + + + Neural Redshift: Random Networks are not Random Functions + http://openaccess.thecvf.com//content/CVPR2024/papers/Teney_Neural_Redshift_Random_Networks_are_not_Random_Functions_CVPR_2024_paper.pdf + Our understanding of the generalization capabilities of neural networks (NNs) is still incomplete. Prevailing explanations are based on implicit biases of gradient descent (GD) but they cannot account for the capabilities of models from gradient-free methods nor the simplicity bias recently observed in untrained networks. This paper seeks other sources of generalization in NNs.
To understand the inductive biases provided by architectures independently from GD we examine untrained random-weight networks. Even simple MLPs show strong inductive biases: uniform sampling in weight space yields a very biased distribution of functions in terms of complexity. But unlike common wisdom NNs do not have an inherent simplicity bias. This property depends on components such as ReLUs residual connections and layer normalizations. Alternative architectures can be built with a bias for any level of complexity. Transformers also inherit all these properties from their building blocks. We provide a fresh explanation for the success of deep learning independent from gradient-based training. It points at promising avenues for controlling the solutions implemented by trained models. + + + + HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_HumanGaussian_Text-Driven_3D_Human_Generation_with_Gaussian_Splatting_CVPR_2024_paper.pdf + Realistic 3D human generation from text prompts is a desirable yet challenging task. Existing methods optimize 3D representations like mesh or neural fields via score distillation sampling (SDS) which suffers from inadequate fine details or excessive training time. In this paper we propose an efficient yet effective framework HumanGaussian that generates high-quality 3D humans with fine-grained geometry and realistic appearance. Our key insight is that 3D Gaussian Splatting is an efficient renderer with periodic Gaussian shrinkage or growing where such adaptive density control can be naturally guided by intrinsic human structures. Specifically 1) we first propose a Structure-Aware SDS that simultaneously optimizes human appearance and geometry. The multi-modal score function from both RGB and depth space is leveraged to distill the Gaussian densification and pruning process. 2) Moreover we devise an Annealed Negative Prompt Guidance by decomposing SDS into a noisier generative score and a cleaner classifier score which well addresses the over-saturation issue. The floating artifacts are further eliminated based on Gaussian size in a prune-only phase to enhance generation smoothness. Extensive experiments demonstrate the superior efficiency and competitive quality of our framework rendering vivid 3D humans under diverse scenarios. + + + + CosmicMan: A Text-to-Image Foundation Model for Humans + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_CosmicMan_A_Text-to-Image_Foundation_Model_for_Humans_CVPR_2024_paper.pdf + We present CosmicMan a text-to-image foundation model specialized for generating high-fidelity human images. Unlike current general-purpose foundation models that are stuck in the dilemma of inferior quality and text-image misalignment for humans CosmicMan enables generating photo-realistic human images with meticulous appearance reasonable structure and precise text-image alignment with detailed dense descriptions. At the heart of CosmicMan's success are the new reflections and perspectives on data and models: (1) We found that data quality and a scalable data production flow are essential for the final results from trained models. Hence we propose a new data production paradigm Annotate Anyone which serves as a perpetual data flywheel to produce high-quality data with accurate yet cost-effective annotations over time.
Based on this we constructed a large-scale dataset CosmicMan-HQ 1.0 with 6 Million high-quality real-world human images in a mean resolution of 1488x1255 and attached with precise text annotations deriving from 115 Million attributes in diverse granularities. (2) We argue that a text-to-image foundation model specialized for humans must be pragmatic - easy to integrate into down-streaming tasks while effective in producing high-quality human images. Hence we propose to model the relationship between dense text descriptions and image pixels in a decomposed manner and present Decomposed-Attention-Refocusing (Daring) training framework. It seamlessly decomposes the cross-attention features in existing text-to-image diffusion model and enforces attention refocusing without adding extra modules. Through Daring we show that explicitly discretizing continuous text space into several basic groups that align with human body structure is the key to tackling the misalignment problem in a breeze. Project page: https://cosmicman-cvpr2024.github.io/. + + + + JDEC: JPEG Decoding via Enhanced Continuous Cosine Coefficients + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_JDEC_JPEG_Decoding_via_Enhanced_Continuous_Cosine_Coefficients_CVPR_2024_paper.pdf + We propose a practical approach to JPEG image decoding utilizing a local implicit neural representation with continuous cosine formulation. The JPEG algorithm significantly quantizes discrete cosine transform (DCT) spectra to achieve a high compression rate inevitably resulting in quality degradation while encoding an image. We have designed a continuous cosine spectrum estimator to address the quality degradation issue that restores the distorted spectrum. By leveraging local DCT formulations our network has the privilege to exploit dequantization and upsampling simultaneously. Our proposed model enables decoding compressed images directly across different quality factors using a single pre-trained model without relying on a conventional JPEG decoder. As a result our proposed network achieves state-of-the-art performance in flexible color image JPEG artifact removal tasks. Our source code is available at https://github.com/WooKyoungHan/JDEC + + + + HOI-M^3: Capture Multiple Humans and Objects Interaction within Contextual Environment + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_HOI-M3_Capture_Multiple_Humans_and_Objects_Interaction_within_Contextual_Environment_CVPR_2024_paper.pdf + Humans naturally interact with both others and the surrounding multiple objects engaging in various social activities. However recent advances in modeling human-object interactions mostly focus on perceiving isolated individuals and objects due to fundamental data scarcity. In this paper we introduce HOI-M^3 a novel large-scale dataset for modeling the interactions of Multiple huMans and Multiple objects. Notably it provides accurate 3D tracking for both humans and objects from dense RGB and object-mounted IMU inputs covering 199 sequences and 181M frames of diverse humans and objects under rich activities. With the unique HOI-M^3 dataset we introduce two novel data-driven tasks with companion strong baselines: monocular capture and unstructured generation of multiple human-object interactions. Extensive experiments demonstrate that our dataset is challenging and worthy of further research about multiple human-object interactions and behavior analysis. 
Our HOI-M^3 dataset corresponding codes and pre-trained models will be disseminated to the community for future research. + + + + Interactive3D: Create What You Want by Interactive 3D Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_Interactive3D_Create_What_You_Want_by_Interactive_3D_Generation_CVPR_2024_paper.pdf + 3D object generation has undergone significant advancements yielding high-quality results. However current approaches fall short in achieving precise user control often yielding results that do not align with user expectations thus limiting their applicability. User-envisioning 3D object generation faces significant challenges in realizing its concepts using current generative models due to limited interaction capabilities. Existing methods mainly offer two approaches: (i) interpreting textual instructions with constrained controllability or (ii) reconstructing 3D objects from 2D images. Both of them limit customization to the confines of the 2D reference and potentially introduce undesirable artifacts during the 3D lifting process restricting the scope for direct and versatile 3D modifications. In this work we introduce Interactive3D an innovative framework for interactive 3D generation that grants users precise control over the generative process through extensive 3D interaction capabilities. Interactive3D is constructed in two cascading stages utilizing distinct 3D representations. The first stage employs Gaussian Splatting for direct user interaction allowing modifications and guidance of the generative direction at any intermediate step through (i) Adding and Removing components (ii) Deformable and Rigid Dragging (iii) Geometric Transformations and (iv) Semantic Editing. Subsequently the Gaussian splats are transformed into InstantNGP. We introduce a novel (v) Interactive Hash Refinement module to further add details and extract the geometry in the second stage. Our experiments demonstrate that the proposed Interactive3D markedly improves the controllability and quality of 3D generation. Our project webpage is available at https://interactive-3d.github.io/. + + + + OmniLocalRF: Omnidirectional Local Radiance Fields from Dynamic Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Choi_OmniLocalRF_Omnidirectional_Local_Radiance_Fields_from_Dynamic_Videos_CVPR_2024_paper.pdf + Omnidirectional cameras are extensively used in various applications to provide a wide field of vision. However they face a challenge in synthesizing novel views due to the inevitable presence of dynamic objects including the photographer in their wide field of view. In this paper we introduce a new approach called Omnidirectional Local Radiance Fields (OmniLocalRF) that can render static-only scene views removing and inpainting dynamic objects simultaneously. Our approach combines the principles of local radiance fields with the bidirectional optimization of omnidirectional rays. Our input is an omnidirectional video and we evaluate the mutual observations of the entire angle between the previous and current frames. To reduce ghosting artifacts of dynamic objects and inpaint occlusions we devise a multi-resolution motion mask prediction module. Unlike existing methods that primarily separate dynamic components through the temporal domain our method uses multi-resolution neural feature planes for precise segmentation which is more suitable for long 360-degree videos.
Our experiments validate that OmniLocalRF outperforms existing methods in both qualitative and quantitative metrics especially in scenarios with complex real-world scenes. In particular our approach eliminates the need for manual interaction such as drawing motion masks by hand and additional pose estimation making it a highly effective and efficient solution. + + + + Semantic Human Mesh Reconstruction with Textures + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhan_Semantic_Human_Mesh_Reconstruction_with_Textures_CVPR_2024_paper.pdf + The field of 3D detailed human mesh reconstruction has made significant progress in recent years. However current methods still face challenges when used in industrial applications due to unstable results low-quality meshes and a lack of UV unwrapping and skinning weights. In this paper we present SHERT a novel pipeline that can reconstruct semantic human meshes with textures and high-precision details. SHERT applies semantic- and normal-based sampling between the detailed surface (e.g. mesh and SDF) and the corresponding SMPL-X model to obtain a partially sampled semantic mesh and then generates the complete semantic mesh by our specifically designed self-supervised completion and refinement networks. Using the complete semantic mesh as a basis we employ a texture diffusion model to create human textures that are driven by both images and texts. Our reconstructed meshes have stable UV unwrapping high-quality triangle meshes and consistent semantic information. The given SMPL-X model provides semantic information and shape priors allowing SHERT to perform well even with incorrect and incomplete inputs. The semantic information also makes it easy to substitute and animate different body parts such as the face body and hands. Quantitative and qualitative experiments demonstrate that SHERT is capable of producing high-fidelity and robust semantic meshes that outperform state-of-the-art methods. + + + + PIA: Your Personalized Image Animator via Plug-and-Play Modules in Text-to-Image Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_PIA_Your_Personalized_Image_Animator_via_Plug-and-Play_Modules_in_Text-to-Image_CVPR_2024_paper.pdf + Recent advancements in personalized text-to-image (T2I) models have revolutionized content creation empowering non-experts to generate stunning images with unique styles. While promising animating these personalized images with realistic motions poses significant challenges in preserving distinct styles high-fidelity details and achieving motion controllability by text. In this paper we present PIA a Personalized Image Animator that excels in aligning with condition images achieving motion controllability by text and the compatibility with various personalized T2I models without specific tuning. To achieve these goals PIA builds upon a base T2I model with well-trained temporal alignment layers allowing for the seamless transformation of any personalized T2I model into an image animation model. A key component of PIA is the introduction of the condition module which takes as inputs the condition frame and inter-frame affinity. This module leverages the affinity hint to transfer appearance information from the condition frame to individual frames in the latent space. This design mitigates the challenges of appearance-related frame alignment within PIA and allows for a stronger focus on aligning with motion-related guidance. 
To address the lack of a benchmark for this field we introduce AnimateBench a comprehensive benchmark comprising diverse personalized T2I models curated images and motion-related prompts. We show extensive evaluations and applications on AnimateBench to verify the superiority of PIA. + + + + NeRF Analogies: Example-Based Visual Attribute Transfer for NeRFs + http://openaccess.thecvf.com//content/CVPR2024/papers/Fischer_NeRF_Analogies_Example-Based_Visual_Attribute_Transfer_for_NeRFs_CVPR_2024_paper.pdf + A Neural Radiance Field (NeRF) encodes the specific relation of 3D geometry and appearance of a scene. We here ask the question whether we can transfer the appearance from a source NeRF onto a target 3D geometry in a semantically meaningful way such that the resulting new NeRF retains the target geometry but has an appearance that is an analogy to the source NeRF. To this end we generalize classic image analogies from 2D images to NeRFs. We leverage correspondence transfer along semantic affinity that is driven by semantic features from large pre-trained 2D image models to achieve multi-view consistent appearance transfer. Our method allows exploring the mix-and-match product space of 3D geometry and appearance. We show that our method outperforms traditional stylization-based methods and that a large majority of users prefer our method over several typical baselines. Project page: https://mfischer-ucl.github.io/nerf_analogies + + + + Texture-Preserving Diffusion Models for High-Fidelity Virtual Try-On + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Texture-Preserving_Diffusion_Models_for_High-Fidelity_Virtual_Try-On_CVPR_2024_paper.pdf + Image-based virtual try-on is an increasingly important task for online shopping. It aims to synthesize images of a specific person wearing a specified garment. Diffusion model-based approaches have recently become popular as they are excellent at image synthesis tasks. However these approaches usually employ additional image encoders and rely on the cross-attention mechanism for texture transfer from the garment to the person image which affects the try-on's efficiency and fidelity. To address these issues we propose an Texture-Preserving Diffusion (TPD) model for virtual try-on which enhances the fidelity of the results and introduces no additional image encoders. Accordingly we make contributions from two aspects. First we propose to concatenate the masked person and reference garment images along the spatial dimension and utilize the resulting image as the input for the diffusion model's denoising UNet. This enables the original self-attention layers contained in the diffusion model to achieve efficient and accurate texture transfer. Second we propose a novel diffusion-based method that predicts a precise inpainting mask based on the person and reference garment images further enhancing the reliability of the try-on results. In addition we integrate mask prediction and image synthesis into a single compact model. The experimental results show that our approach can be applied to various try-on tasks e.g. garment-to-person and person-to-person try-ons and significantly outperforms state-of-the-art methods on popular VITON VITON-HD databases. Code is available at https://github.com/Gal4way/TPD. 
+ + + + Towards Robust Event-guided Low-Light Image Enhancement: A Large-Scale Real-World Event-Image Dataset and Novel Approach + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_Towards_Robust_Event-guided_Low-Light_Image_Enhancement_A_Large-Scale_Real-World_Event-Image_CVPR_2024_paper.pdf + Event camera has recently received much attention for low-light image enhancement (LIE) thanks to their distinct advantages such as high dynamic range. However current research is prohibitively restricted by the lack of large-scale real-world and spatial-temporally aligned event-image datasets. To this end we propose a real-world (indoor and outdoor) dataset comprising over 30K pairs of images and events under both low and normal illumination conditions. To achieve this we utilize a robotic arm that traces a consistent non-linear trajectory to curate the dataset with spatial alignment precision under 0.03mm. We then introduce a matching alignment strategy rendering 90% of our dataset with errors less than 0.01s. Based on the dataset we propose a novel event-guided LIE approach called EvLight towards robust performance in real-world low-light scenes. Specifically we first design the multi-scale holistic fusion branch to extract holistic structural and textural information from both events and images. To ensure robustness against variations in the regional illumination and noise we then introduce a Signal-to-Noise-Ratio (SNR)-guided regional feature selection to selectively fuse features of images from regions with high SNR and enhance those with low SNR by extracting regional structural information from events. our EvLight significantly surpasses the frame-based methods e.g. Retinexformer by 1.14 dB and 2.62 dB respectively. Code and datasets are available at https://vlislab22.github.io/eg-lowlight/. + + + + From a Bird's Eye View to See: Joint Camera and Subject Registration without the Camera Calibration + http://openaccess.thecvf.com//content/CVPR2024/papers/Qian_From_a_Birds_Eye_View_to_See_Joint_Camera_and_CVPR_2024_paper.pdf + We tackle a new problem of multi-view camera and subject registration in the bird's eye view (BEV) without pre-given camera calibration which promotes the multi-view subject registration problem to a new calibration-free stage. This greatly alleviates the limitation in many practical applications. However this is a very challenging problem since its only input is several RGB images from different first-person views (FPVs) without the BEV image and the calibration of the FPVs while the output is a unified plane aggregated from all views with the positions and orientations of both the subjects and cameras in a BEV. For this purpose we propose an end-to-end framework solving camera and subject registration together by taking advantage of their mutual dependence whose main idea is as below: i) creating a subject view-transform module (VTM) to project each pedestrian from FPV to a virtual BEV ii) deriving a multi-view geometry-based spatial alignment module (SAM) to estimate the relative camera pose in a unified BEV iii) selecting and refining the subject and camera registration results within the unified BEV. We collect a new large-scale synthetic dataset with rich annotations for training and evaluation. Additionally we also collect a real dataset for cross-domain evaluation. The experimental results show the remarkable effectiveness of our method. The code and proposed datasets are available at https://github.com/zekunqian/BEVSee. 
+ + + + Enhancing Video Super-Resolution via Implicit Resampling-based Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Enhancing_Video_Super-Resolution_via_Implicit_Resampling-based_Alignment_CVPR_2024_paper.pdf + In video super-resolution it is common to use a frame-wise alignment to support the propagation of information over time. The role of alignment is well-studied for low-level enhancement in video but existing works overlook a critical step -- resampling. We show through extensive experiments that for alignment to be effective the resampling should preserve the reference frequency spectrum while minimizing spatial distortions. However most existing works simply use a default choice of bilinear interpolation for resampling even though bilinear interpolation has a smoothing effect and hinders super-resolution. From these observations we propose an implicit resampling-based alignment. The sampling positions are encoded by a sinusoidal positional encoding while the value is estimated with a coordinate network and a window-based cross-attention. We show that bilinear interpolation inherently attenuates high-frequency information while an MLP-based coordinate network can approximate more frequencies. Experiments on synthetic and real-world datasets show that alignment with our proposed implicit resampling enhances the performance of state-of-the-art frameworks with minimal impact on both compute and parameters. + + + + Parameter Efficient Fine-tuning via Cross Block Orchestration for Segment Anything Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_Parameter_Efficient_Fine-tuning_via_Cross_Block_Orchestration_for_Segment_Anything_CVPR_2024_paper.pdf + Parameter-efficient fine-tuning (PEFT) is an effective methodology to unleash the potential of large foundation models in novel scenarios with limited training data. In the computer vision community PEFT has shown effectiveness in image classification but little research has studied its ability for image segmentation. Fine-tuning segmentation models usually require a heavier adjustment of parameters to align the proper projection directions in the parameter space for new scenarios. This raises a challenge to existing PEFT algorithms as they often inject a limited number of individual parameters into each block which prevents substantial adjustment of the projection direction of the parameter space due to the limitation of Hidden Markov Chain along blocks. In this paper we equip PEFT with a cross-block orchestration mechanism to enable the adaptation of the Segment Anything Model (SAM) to various downstream scenarios. We introduce a novel inter-block communication module which integrates a learnable relation matrix to facilitate communication among different coefficient sets of each PEFT block's parameter space. Moreover we propose an intra-block enhancement module which introduces a linear projection head whose weights are generated from a hyper-complex layer further enhancing the impact of the adjustment of projection directions on the entire parameter space. Extensive experiments on diverse benchmarks demonstrate that our proposed approach consistently improves the segmentation performance significantly on novel scenarios with only around 1K additional parameters. 
+ + + + Masked and Shuffled Blind Spot Denoising for Real-World Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Chihaoui_Masked_and_Shuffled_Blind_Spot_Denoising_for_Real-World_Images_CVPR_2024_paper.pdf + We introduce a novel approach to single image denoising based on the Blind Spot Denoising principle which we call MAsked and SHuffled Blind Spot Denoising (MASH). We focus on the case of correlated noise which often plagues real images. MASH is the result of a careful analysis to determine the relationships between the level of blindness (masking) of the input and the (unknown) noise correlation. Moreover we introduce a shuffling technique to weaken the local correlation of noise which in turn yields an additional denoising performance improvement. We evaluate MASH via extensive experiments on real-world noisy image datasets. We demonstrate state-of-the-art results compared to existing self-supervised denoising methods. + + + + DiffusionAvatars: Deferred Diffusion for High-fidelity 3D Head Avatars + http://openaccess.thecvf.com//content/CVPR2024/papers/Kirschstein_DiffusionAvatars_Deferred_Diffusion_for_High-fidelity_3D_Head_Avatars_CVPR_2024_paper.pdf + DiffusionAvatars synthesizes a high-fidelity 3D head avatar of a person offering intuitive control over both pose and expression. We propose a diffusion-based neural renderer that leverages generic 2D priors to produce compelling images of faces. For coarse guidance of the expression and head pose we render a neural parametric head model (NPHM) from the target viewpoint which acts as a proxy geometry of the person. Additionally to enhance the modeling of intricate facial expressions we condition DiffusionAvatars directly on the expression codes obtained from NPHM via cross-attention. Finally to synthesize consistent surface details across different viewpoints and expressions we rig learnable spatial features to the head's surface via TriPlane lookup in NPHM's canonical space. We train DiffusionAvatars on RGB videos and corresponding fitted NPHM meshes of a person and test the obtained avatars in both self-reenactment and animation scenarios. Our experiments demonstrate that DiffusionAvatars generates temporally consistent and visually appealing videos for novel poses and expressions of a person outperforming existing approaches. + + + + Data-Free Quantization via Pseudo-label Filtering + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Data-Free_Quantization_via_Pseudo-label_Filtering_CVPR_2024_paper.pdf + Quantization for model compression can efficiently reduce the network complexity and storage requirement but the original training data is necessary to remedy the performance loss caused by quantization. The Data-Free Quantization (DFQ) methods have been proposed to handle the absence of original training data with synthetic data. However there are differences between the synthetic and original training data which affects the performance of the quantized network but none of the existing methods considers the differences. In this paper we propose an efficient data-free quantization via pseudo-label filtering which is the first to evaluate the synthetic data before quantization. We design a new metric for evaluating synthetic data using self-entropy which indicates the reliability of synthetic data. The synthetic data can be categorized with the metric into high- and low-reliable datasets for the following training process. 
Besides the multiple pseudo-labels are designed to label the synthetic data with different reliability which can provide valuable supervision information and avoid misleading training by low-reliable samples. Extensive experiments are implemented on several datasets including CIFAR-10 CIFAR-100 and ImageNet with various models. The experimental results show that our method can perform excellently and outperform existing methods in accuracy. + + + + Generative Powers of Ten + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Generative_Powers_of_Ten_CVPR_2024_paper.pdf + We present a method that uses a text-to-image model to generate consistent content across multiple image scales enabling extreme semantic zooms into a scene e.g. ranging from a wide-angle landscape view of a forest to a macro shot of an insect sitting on one of the tree branches. We achieve this through a joint multi-scale diffusion sampling approach that encourages consistency across different scales while preserving the integrity of each individual sampling process. Since each generated scale is guided by a different text prompt our method enables deeper levels of zoom than traditional super-resolution methods that may struggle to create new contextual structure at vastly different scales. We compare our method qualitatively with alternative techniques in image super-resolution and outpainting and show that our method is most effective at generating consistent multi-scale content. + + + + Text-conditional Attribute Alignment across Latent Spaces for 3D Controllable Face Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Text-conditional_Attribute_Alignment_across_Latent_Spaces_for_3D_Controllable_Face_CVPR_2024_paper.pdf + With the advent of generative models and vision language pretraining significant improvement has been made in text-driven face manipulation. The text embedding can be used as target supervision for expression control. However it is non-trivial to associate with its 3D attributes i.e. pose and illumination. To address these issues we propose a Text-conditional Attribute aLignment approach for 3D controllable face image synthesis and our model is referred to as TcALign. Specifically since the 3D rendered image can be precisely controlled with the 3D face representation we first propose a Text-conditional 3D Editor to produce the target face representation to realize text-driven manipulation in the 3D space. An attribute embedding space spanned by the target-related attributes embeddings is also introduced to infer the disentangled task-specific direction. Next we train a cross-modal latent mapping network conditioned on the derived difference of 3D representation to infer a correct vector in the latent space of StyleGAN. This correction vector learning design can accurately transfer the attribute manipulation on 3D images to 2D images. We show that the proposed method delivers more precise text-driven multi-attribute manipulation for 3D controllable face image synthesis. Extensive qualitative and quantitative experiments verify the effectiveness and superiority of our method over the other competing methods. 
+ + + + Correcting Diffusion Generation through Resampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Correcting_Diffusion_Generation_through_Resampling_CVPR_2024_paper.pdf + Despite diffusion models' superior capabilities in modeling complex distributions there are still non-trivial distributional discrepancies between generated and ground-truth images which has resulted in several notable problems in image generation including missing object errors in text-to-image generation and low image quality. Existing methods that attempt to address these problems mostly do not tend to address the fundamental cause behind these problems which is the distributional discrepancies and hence achieve sub-optimal results. In this paper we propose a particle filtering framework that can effectively address both problems by explicitly reducing the distributional discrepancies. Specifically our method relies on a set of external guidance including a small set of real images and a pre-trained object detector to gauge the distribution gap and then design the resampling weight accordingly to correct the gap. Experiments show that our methods can effectively correct missing object errors and improve image quality in various image generation tasks. Notably our method outperforms the existing strongest baseline by 5% in object occurrence and 1.0 in FID on MS-COCO. Our code is available at https://github.com/UCSB-NLP-Chang/diffusion_resampling.git. + + + + AirPlanes: Accurate Plane Estimation via 3D-Consistent Embeddings + http://openaccess.thecvf.com//content/CVPR2024/papers/Watson_AirPlanes_Accurate_Plane_Estimation_via_3D-Consistent_Embeddings_CVPR_2024_paper.pdf + Extracting planes from a 3D scene is useful for downstream tasks in robotics and augmented reality. In this paper we tackle the problem of estimating the planar surfaces in a scene from posed images. Our first finding is that a surprisingly competitive baseline results from combining popular clustering algorithms with recent improvements in 3D geometry estimation. However such purely geometric methods are understandably oblivious to plane semantics which are crucial to discerning distinct planes. To overcome this limitation we propose a method that predicts multi-view consistent plane embeddings that complement geometry when clustering points into planes. We show through extensive evaluation on the ScanNetV2 dataset that our new method outperforms existing approaches and our strong geometric baseline for the task of plane estimation. + + + + Blur2Blur: Blur Conversion for Unsupervised Image Deblurring on Unknown Domains + http://openaccess.thecvf.com//content/CVPR2024/papers/Pham_Blur2Blur_Blur_Conversion_for_Unsupervised_Image_Deblurring_on_Unknown_Domains_CVPR_2024_paper.pdf + This paper presents an innovative framework designed to train an image deblurring algorithm tailored to a specific camera device. This algorithm works by transforming a blurry input image which is challenging to deblur into another blurry image that is more amenable to deblurring. The transformation process from one blurry state to another leverages unpaired data consisting of sharp and blurry images captured by the target camera device. Learning this blur-to-blur transformation is inherently simpler than direct blur-to-sharp conversion as it primarily involves modifying blur patterns rather than the intricate task of reconstructing fine image details. 
The efficacy of the proposed approach has been demonstrated through comprehensive experiments on various benchmarks where it significantly outperforms state-of-the-art methods both quantitatively and qualitatively. Our code and data are available at https://github.com/VinAIResearch/Blur2Blur + + + + Exploring Vision Transformers for 3D Human Motion-Language Models with Motion Patches + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Exploring_Vision_Transformers_for_3D_Human_Motion-Language_Models_with_Motion_CVPR_2024_paper.pdf + To build a cross-modal latent space between 3D human motion and language acquiring large-scale and high-quality human motion data is crucial. However unlike the abundance of image data the scarcity of motion data has limited the performance of existing motion-language models. To counter this we introduce "motion patches" a new representation of motion sequences and propose using Vision Transformers (ViT) as motion encoders via transfer learning aiming to extract useful knowledge from the image domain and apply it to the motion domain. These motion patches created by dividing and sorting skeleton joints based on body parts in motion sequences are robust to varying skeleton structures and can be regarded as color image patches in ViT. We find that transfer learning with pre-trained weights of ViT obtained through training with 2D image data can boost the performance of motion analysis presenting a promising direction for addressing the issue of limited motion data. Our extensive experiments show that the proposed motion patches used jointly with ViT achieve state-of-the-art performance in the benchmarks of text-to-motion retrieval and other novel challenging tasks such as cross-skeleton recognition zero-shot motion classification and human interaction recognition which are currently impeded by the lack of data. + + + + Clustering for Protein Representation Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Quan_Clustering_for_Protein_Representation_Learning_CVPR_2024_paper.pdf + Protein representation learning is a challenging task that aims to capture the structure and function of proteins from their amino acid sequences. Previous methods largely ignored the fact that not all amino acids are equally important for protein folding and activity. In this article we propose a neural clustering framework that can automatically discover the critical components of a protein by considering both its primary and tertiary structure information. Our framework treats a protein as a graph where each node represents an amino acid and each edge represents a spatial or sequential connection between amino acids. We then apply an iterative clustering strategy to group the nodes into clusters based on their 1D and 3D positions and assign scores to each cluster. We select the highest-scoring clusters and use their medoid nodes for the next iteration of clustering until we obtain a hierarchical and informative representation of the protein. We evaluate on four protein-related tasks: protein fold classification enzyme reaction classification gene ontology term prediction and enzyme commission number prediction. Experimental results demonstrate that our method achieves state-of-the-art performance. 
+ + + + CorrMatch: Label Propagation via Correlation Matching for Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_CorrMatch_Label_Propagation_via_Correlation_Matching_for_Semi-Supervised_Semantic_Segmentation_CVPR_2024_paper.pdf + This paper presents a simple but performant semi-supervised semantic segmentation approach called CorrMatch. Previous approaches mostly employ complicated training strategies to leverage unlabeled data but overlook the role of correlation maps in modeling the relationships between pairs of locations. We observe that the correlation maps not only enable clustering pixels of the same category easily but also contain good shape information which previous works have omitted. Motivated by these we aim to improve the use efficiency of unlabeled data by designing two novel label propagation strategies. First we propose to conduct pixel propagation by modeling the pairwise similarities of pixels to spread the high-confidence pixels and dig out more. Then we perform region propagation to enhance the pseudo labels with accurate class-agnostic masks extracted from the correlation maps. CorrMatch achieves great performance on popular segmentation benchmarks. Taking the DeepLabV3+ with ResNet-101 backbone as our segmentation model we receive a 76%+ mIoU score on the Pascal VOC 2012 dataset with only 92 annotated images. Code is available at https://github.com/BBBBchan/CorrMatch . + + + + Estimating Extreme 3D Image Rotations using Cascaded Attention + http://openaccess.thecvf.com//content/CVPR2024/papers/Dekel_Estimating_Extreme_3D_Image_Rotations_using_Cascaded_Attention_CVPR_2024_paper.pdf + Estimating large extreme inter-image rotations is critical for numerous computer vision domains involving images related by limited or non-overlapping fields of view. In this work we propose an attention-based approach with a pipeline of novel algorithmic components. First as rotation estimation pertains to image pairs we introduce an inter-image distillation scheme using Decoders to improve embeddings. Second whereas contemporary methods compute a 4D correlation volume (4DCV) encoding inter-image relationships we propose an Encoder-based cross-attention approach between activation maps to compute an enhanced equivalent of the 4DCV. Finally we present a cascaded Decoder-based technique for alternately refining the cross-attention and the rotation query. Our approach outperforms current state-of-the-art methods on extreme rotation estimation. We make our code publicly available. + + + + Adapt or Perish: Adaptive Sparse Transformer with Attentive Feature Refinement for Image Restoration + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Adapt_or_Perish_Adaptive_Sparse_Transformer_with_Attentive_Feature_Refinement_CVPR_2024_paper.pdf + Transformer-based approaches have achieved promising performance in image restoration tasks given their ability to model long-range dependencies which is crucial for recovering clear images. Though diverse efficient attention mechanism designs have addressed the intensive computations associated with using transformers they often involve redundant information and noisy interactions from irrelevant regions by considering all available tokens. In this work we propose an Adaptive Sparse Transformer (AST) to mitigate the noisy interactions of irrelevant areas and remove feature redundancy in both spatial and channel domains. AST comprises two core designs i.e. 
an Adaptive Sparse Self-Attention (ASSA) block and a Feature Refinement Feed-forward Network (FRFN). Specifically ASSA is adaptively computed using a two-branch paradigm where the sparse branch is introduced to filter out the negative impacts of low query-key matching scores for aggregating features while the dense one ensures sufficient information flow through the network for learning discriminative representations. Meanwhile FRFN employs an enhance-and-ease scheme to eliminate feature redundancy in channels enhancing the restoration of clear latent images. Experimental results on commonly used benchmarks have demonstrated the versatility and competitive performance of our method in several tasks including rain streak removal real haze removal and raindrop removal. The code and pre-trained models are available at https://github.com/joshyZhou/AST. + + + + VINECS: Video-based Neural Character Skinning + http://openaccess.thecvf.com//content/CVPR2024/papers/Liao_VINECS_Video-based_Neural_Character_Skinning_CVPR_2024_paper.pdf + Rigging and skinning clothed human avatars is a challenging task and traditionally requires a lot of manual work and expertise. Recent methods addressing it either generalize across different characters or focus on capturing the dynamics of a single character observed under different pose configurations. However the former methods typically predict solely static skinning weights which perform poorly for highly articulated poses and the latter ones either require dense 3D character scans in different poses or cannot generate an explicit mesh with vertex correspondence over time. To address these challenges we propose a fully automated approach for creating a fully rigged character with pose-dependent skinning weights which can be solely learned from multi-view video. Therefore we first acquire a rigged template which is then statically skinned. Next a coordinate-based MLP learns a skinning weights field parameterized over the position in a canonical pose space and the respective pose. Moreover we introduce our pose- and view-dependent appearance field allowing us to differentiably render and supervise the posed mesh using multi-view imagery. We show that our approach outperforms state-of-the-art while not relying on dense 4D scans. More details can be found on our project page. + + + + Your Student is Better Than Expected: Adaptive Teacher-Student Collaboration for Text-Conditional Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Starodubcev_Your_Student_is_Better_Than_Expected_Adaptive_Teacher-Student_Collaboration_for_CVPR_2024_paper.pdf + Knowledge distillation methods have recently shown to be a promising direction to speedup the synthesis of large-scale diffusion models by requiring only a few inference steps. While several powerful distillation methods were recently proposed the overall quality of student samples is typically lower compared to the teacher ones which hinders their practical usage. In this work we investigate the relative quality of samples produced by the teacher text-to-image diffusion model and its distilled student version. As our main empirical finding we discover that a noticeable portion of student samples exhibit superior fidelity compared to the teacher ones despite the approximate nature of the student. Based on this finding we propose an adaptive collaboration between student and teacher diffusion models for effective text-to-image synthesis. 
Specifically the distilled model produces an initial image sample and then an oracle decides whether it needs further improvements with the teacher model. Extensive experiments demonstrate that the designed pipeline surpasses state-of-the-art text-to-image alternatives for various inference budgets in terms of human preference. Furthermore the proposed approach can be naturally used in popular applications such as text-guided image editing and controllable generation. + + + + SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design + http://openaccess.thecvf.com//content/CVPR2024/papers/Yun_SHViT_Single-Head_Vision_Transformer_with_Memory_Efficient_Macro_Design_CVPR_2024_paper.pdf + Recently efficient Vision Transformers have shown great performance with low latency on resource-constrained devices. Conventionally they use 4x4 patch embeddings and a 4-stage structure at the macro level while utilizing sophisticated attention with multi-head configuration at the micro level. This paper aims to address computational redundancy at all design levels in a memory-efficient manner. We discover that using larger-stride patchify stem not only reduces memory access costs but also achieves competitive performance by leveraging token representations with reduced spatial redundancy from the early stages. Furthermore our preliminary analyses suggest that attention layers in the early stages can be substituted with convolutions and several attention heads in the latter stages are computationally redundant. To handle this we introduce a single-head attention module that inherently prevents head redundancy and simultaneously boosts accuracy by parallelly combining global and local information. Building upon our solutions we introduce SHViT a Single-Head Vision Transformer that obtains the state-of-the-art speed-accuracy tradeoff. For example on ImageNet-1k our SHViT-S4 is 3.3x 8.1x and 2.4x faster than MobileViTv2x1.0 on GPU CPU and iPhone12 mobile device respectively while being 1.3% more accurate. For object detection and instance segmentation on MS COCO using Mask-RCNN head our model achieves performance comparable to FastViT-SA12 while exhibiting 3.8x and 2.0x lower backbone latency on GPU and mobile device respectively. + + + + CommonCanvas: Open Diffusion Models Trained on Creative-Commons Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Gokaslan_CommonCanvas_Open_Diffusion_Models_Trained_on_Creative-Commons_Images_CVPR_2024_paper.pdf + We train a set of open text-to-image (T2I) diffusion models on a dataset of curated Creative-Commons-licensed (CC) images which yields models that are competitive with Stable Diffusion 2 (SD2). This task presents two challenges: (1) high-resolution CC images lack the captions necessary to train T2I models; (2) CC images are relatively scarce. To address these challenges we use an intuitive transfer learning technique to produce a set of high-quality synthetic captions paired with our assembled CC images. We then develop a data- and compute-efficient training recipe that requires as little as 3% of the LAION data (i.e. roughly 70 million examples) needed to train existing SD2 models but obtains the same quality. These results indicate that we have a sufficient number of CC images (also roughly 70 million) for training high-quality models. Our recipe also implements a variety of optimizations that achieve 2.71x training speed-ups enabling rapid model iteration. 
We leverage this recipe to train several high-quality T2I models which we dub the CommonCanvas family. Our largest model achieves comparable performance to SD2 on human evaluation even though we use a synthetically captioned CC-image dataset that is only <3% the size of LAION for training. We release our models data and code on GitHub. + + + + Prompt-Driven Referring Image Segmentation with Instance Contrasting + http://openaccess.thecvf.com//content/CVPR2024/papers/Shang_Prompt-Driven_Referring_Image_Segmentation_with_Instance_Contrasting_CVPR_2024_paper.pdf + Referring image segmentation (RIS) aims to segment the target referent described by natural language. Recently large-scale pre-trained models e.g. CLIP and SAM have been successfully applied in many downstream tasks but they are not well adapted to RIS task due to inter-task differences. In this paper we propose a new prompt-driven framework named Prompt-RIS which bridges CLIP and SAM end-to-end and transfers their rich knowledge and powerful capabilities to RIS task through prompt learning. To adapt CLIP to pixel-level task we first propose a Cross-Modal Prompting method which acquires more comprehensive vision-language interaction and fine-grained text-to-pixel alignment by performing bidirectional prompting. Then the prompt-tuned CLIP generates masks points and text prompts for SAM to generate more accurate mask predictions. Moreover we further propose Instance Contrastive Learning to improve the model's discriminability to different instances and robustness to diverse languages describing the same instance. Extensive experiments demonstrate that the performance of our method outperforms the state-of-the-art methods consistently in both general and open-vocabulary settings. + + + + Image Sculpting: Precise Object Editing with 3D Geometry Control + http://openaccess.thecvf.com//content/CVPR2024/papers/Yenphraphai_Image_Sculpting_Precise_Object_Editing_with_3D_Geometry_Control_CVPR_2024_paper.pdf + We present Image Sculpting a new framework for editing 2D images by incorporating tools from 3D geometry and graphics. This approach differs markedly from existing methods which are confined to 2D spaces and typically rely on textual instructions leading to ambiguity and limited control. Image Sculpting converts 2D objects into 3D enabling direct interaction with their 3D geometry. Post-editing these objects are re-rendered into 2D merging into the original image to produce high-fidelity results through a coarse-to-fine enhancement process. The framework supports precise quantifiable and physically-plausible editing options such as pose editing rotation translation 3D composition carving and serial addition. It marks an initial step towards combining the creative freedom of generative models with the precision of graphics pipelines. + + + + PFStorer: Personalized Face Restoration and Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Varanka_PFStorer_Personalized_Face_Restoration_and_Super-Resolution_CVPR_2024_paper.pdf + Recent developments in face restoration have achieved remarkable results in producing high-quality and lifelike outputs. The stunning results however often fail to be faithful with respect to the identity of the person as the models lack necessary context. In this paper we explore the potential of personalized face restoration with diffusion models. 
In our approach a restoration model is personalized using a few images of the identity leading to tailored restoration with respect to the identity while retaining fine-grained details. By using independent trainable blocks for personalization the rich prior of a base restoration model can be exploited to its fullest. To avoid the model relying on parts of identity left in the conditioning low-quality images a generative regularizer is employed. With a learnable parameter the model learns to balance between the details generated based on the input image and the degree of personalization. Moreover we improve the training pipeline of face restoration models to enable an alignment-free approach. We showcase the robust capabilities of our approach in several real-world scenarios with multiple identities demonstrating our method's ability to generate fine-grained details with faithful restoration. In the user study we evaluate the perceptual quality and faithfulness of the generated details with our method being voted best 61% of the time compared to the second best with 25% of the votes. + + + + TextureDreamer: Image-Guided Texture Synthesis Through Geometry-Aware Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Yeh_TextureDreamer_Image-Guided_Texture_Synthesis_Through_Geometry-Aware_Diffusion_CVPR_2024_paper.pdf + We present TextureDreamer a novel image-guided texture synthesis method to transfer relightable textures from a small number of input images (3 to 5) to target 3D shapes across arbitrary categories. Texture creation is a pivotal challenge in vision and graphics. Industrial companies hire experienced artists to manually craft textures for 3D assets. Classical methods require densely sampled views and accurately aligned geometry while learning-based methods are confined to category-specific shapes within the dataset. In contrast TextureDreamer can transfer highly detailed intricate textures from real-world environments to arbitrary objects with only a few casually captured images potentially significantly democratizing texture creation. Our core idea personalized geometry-aware score distillation (PGSD) draws inspiration from recent advancements in diffuse models including personalized modeling for texture information extraction score distillation for detailed appearance synthesis and explicit geometry guidance with ControlNet. Our integration and several essential modifications substantially improve the texture quality. Experiments on real images spanning different categories show that TextureDreamer can successfully transfer highly realistic semantic meaningful texture to arbitrary objects surpassing the visual quality of previous state-of-the-art. Project page: https://texturedreamer.github.io + + + + Boosting Image Quality Assessment through Efficient Transformer Adaptation with Local Feature Enhancement + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Boosting_Image_Quality_Assessment_through_Efficient_Transformer_Adaptation_with_Local_CVPR_2024_paper.pdf + Image Quality Assessment (IQA) constitutes a fundamental task within the field of computer vision yet it remains an unresolved challenge owing to the intricate distortion conditions diverse image contents and limited availability of data. Recently the community has witnessed the emergence of numerous large-scale pretrained foundation models. However it remains an open problem whether the scaling law in high-level tasks is also applicable to IQA tasks which are closely related to low-level clues. 
In this paper we demonstrate that with a proper injection of local distortion features a larger pretrained vision transformer (ViT) foundation model performs better in IQA tasks. Specifically for the lack of local distortion structure and inductive bias of the large-scale pretrained ViT we use another pretrained convolution neural networks (CNNs) which is well known for capturing the local structure to extract multi-scale image features. Further we propose a local distortion extractor to obtain local distortion features from the pretrained CNNs and a local distortion injector to inject the local distortion features into ViT. By only training the extractor and injector our method can benefit from the rich knowledge in the powerful foundation models and achieve state-of-the-art performance on popular IQA datasets indicating that IQA is not only a low-level problem but also benefits from stronger high-level features drawn from large-scale pretrained models. Codes are publicly available at: https://github.com/NeosXu/LoDa. + + + + Attention Calibration for Disentangled Text-to-Image Personalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Attention_Calibration_for_Disentangled_Text-to-Image_Personalization_CVPR_2024_paper.pdf + Recent thrilling progress in large-scale text-to-image (T2I) models has unlocked unprecedented synthesis quality of AI-generated content (AIGC) including image generation 3D and video composition. Further personalized techniques enable appealing customized production of a novel concept given only several images as reference. However an intriguing problem persists: Is it possible to capture multiple novel concepts from one single reference image? In this paper we identify that existing approaches fail to preserve visual consistency with the reference image and eliminate cross-influence from concepts. To alleviate this we propose an attention calibration mechanism to improve the concept-level understanding of the T2I model. Specifically we first introduce new learnable modifiers bound with classes to capture attributes of multiple concepts. Then the classes are separated and strengthened following the activation of the cross-attention operation ensuring comprehensive and self-contained concepts. Additionally we suppress the attention activation of different classes to mitigate mutual influence among concepts. Together our proposed method dubbed DisenDiff can learn disentangled multiple concepts from one single image and produce novel customized images with learned concepts. We demonstrate that our method outperforms the current state of the art in both qualitative and quantitative evaluations. More importantly our proposed techniques are compatible with LoRA and inpainting pipelines enabling more interactive experiences. + + + + One-Shot Structure-Aware Stylized Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Cho_One-Shot_Structure-Aware_Stylized_Image_Synthesis_CVPR_2024_paper.pdf + While GAN-based models have been successful in image stylization tasks they often struggle with structure preservation while stylizing a wide range of input images. Recently diffusion models have been adopted for image stylization but still lack the capability to maintain the original quality of input images. Building on this we propose OSASIS: a novel one-shot stylization method that is robust in structure preservation. 
We show that OSASIS is able to effectively disentangle the semantics from the structure of an image allowing it to control the level of content and style implemented to a given input. We apply OSASIS to various experimental settings including stylization with out-of-domain reference images and stylization with text-driven manipulation. Results show that OSASIS outperforms other stylization methods especially for input images that were rarely encountered during training providing a promising solution to stylization via diffusion models. + + + + MR-VNet: Media Restoration using Volterra Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Roheda_MR-VNet_Media_Restoration_using_Volterra_Networks_CVPR_2024_paper.pdf + This research paper presents a novel class of restoration network architecture based on the Volterra series formulation. By incorporating non-linearity into the system response function through higher order convolutions instead of traditional activation functions we introduce a general framework for image/video restoration. Through extensive experimentation we demonstrate that our proposed architecture achieves state-of-the-art (SOTA) performance in the field of Image/Video Restoration. Moreover we establish that the recently introduced Non-Linear Activation Free Network (NAF-NET) can be considered a special case within the broader class of Volterra Neural Networks. These findings highlight the potential of Volterra Neural Networks as a versatile and powerful tool for addressing complex restoration tasks in computer vision. + + + + Single Mesh Diffusion Models with Field Latents for Texture Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Mitchel_Single_Mesh_Diffusion_Models_with_Field_Latents_for_Texture_Generation_CVPR_2024_paper.pdf + We introduce a framework for intrinsic latent diffusion models operating directly on the surfaces of 3D shapes with the goal of synthesizing high-quality textures. Our approach is underpinned by two contributions: Field Latents a latent representation encoding textures as discrete vector fields on the mesh vertices and Field Latent Diffusion Models which learn to denoise a diffusion process in the learned latent space on the surface. We consider a single-textured-mesh paradigm where our models are trained to generate variations of a given texture on a mesh. We show the synthesized textures are of superior fidelity compared those from existing single-textured-mesh generative models. Our models can also be adapted for user-controlled editing tasks such as inpainting and label-guided generation. The efficacy of our approach is due in part to the equivariance of our proposed framework under isometries allowing our models to seamlessly reproduce details across locally similar regions and opening the door to a notion of generative texture transfer. Code and visualizations are available at https://single-mesh-diffusion.github.io/. + + + + SAI3D: Segment Any Instance in 3D Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_SAI3D_Segment_Any_Instance_in_3D_Scenes_CVPR_2024_paper.pdf + Advancements in 3D instance segmentation have traditionally been tethered to the availability of annotated datasets limiting their application to a narrow spectrum of object categories. 
Recent efforts have sought to harness vision-language models like CLIP for open-set semantic reasoning yet these methods struggle to distinguish between objects of the same categories and rely on specific prompts that are not universally applicable. In this paper we introduce SAI3D a novel zero-shot 3D instance segmentation approach that synergistically leverages geometric priors and semantic cues derived from Segment Anything Model (SAM). Our method partitions a 3D scene into geometric primitives which are then progressively merged into 3D instance segmentations that are consistent with the multi-view SAM masks. Moreover we design a hierarchical region-growing algorithm with a dynamic thresholding mechanism which largely improves the robustness of fine-grained 3D scene parsing. Empirical evaluations on ScanNet Matterport3D and the more challenging ScanNet++ datasets demonstrate the superiority of our approach. Notably SAI3D outperforms existing open-vocabulary baselines and even surpasses fully-supervised methods in class-agnostic segmentation on ScanNet++. Our project page is at https://yd-yin.github.io/SAI3D/. + + + + TexOct: Generating Textures of 3D Models with Octree-based Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_TexOct_Generating_Textures_of_3D_Models_with_Octree-based_Diffusion_CVPR_2024_paper.pdf + This paper focuses on synthesizing high-quality and complete textures directly on the surface of 3D models within 3D space. 2D diffusion-based methods face challenges in generating 2D texture maps due to the infinite possibilities of UV mapping for a given 3D mesh. Utilizing point clouds helps circumvent variations arising from diverse mesh topologies and UV mappings. Nevertheless achieving dense point clouds to accurately represent texture details poses a challenge due to limited computational resources. To address these challenges we propose an efficient octree-based diffusion pipeline called TexOct. Our method starts by sampling a point cloud from the surface of a given 3D model with each point containing texture noise values. We utilize an octree structure to efficiently represent this point cloud. Additionally we introduce an innovative octree-based diffusion model that leverages the denoising capabilities of the Denoising Diffusion Probabilistic Model (DDPM). This model gradually reduces the texture noise on the octree nodes resulting in the restoration of fine texture. Experimental results on ShapeNet demonstrate that TexOct effectively generates high-quality 3D textures in both unconditional and text / image-conditional scenarios. + + + + Anatomically Constrained Implicit Face Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Chandran_Anatomically_Constrained_Implicit_Face_Models_CVPR_2024_paper.pdf + Coordinate based implicit neural representations have gained rapid popularity in recent years as they have been successfully used in image geometry and scene modeling tasks. In this work we present a novel use case for such implicit representations in the context of learning anatomically constrained face models. Actor specific anatomically constrained face models are the state of the art in both facial performance capture and performance retargeting. Despite their practical success these anatomical models are slow to evaluate and often require extensive data capture to be built. 
We propose the anatomical implicit face model; an ensemble of implicit neural networks that jointly learn to model the facial anatomy and the skin surface with high-fidelity and can readily be used as a drop in replacement to conventional blendshape models. Given an arbitrary set of skin surface meshes of an actor and only a neutral shape with estimated skull and jaw bones our method can recover a dense anatomical substructure which constrains every point on the facial surface. We demonstrate the usefulness of our approach in several tasks ranging from shape fitting shape editing and performance retargeting. + + + + Capturing Closely Interacted Two-Person Motions with Reaction Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Fang_Capturing_Closely_Interacted_Two-Person_Motions_with_Reaction_Priors_CVPR_2024_paper.pdf + In this paper we focus on capturing closely interacted two-person motions from monocular videos an important yet understudied topic. Unlike less-interacted motions closely interacted motions contain frequently occurring inter-human occlusions which pose significant challenges to existing capturing algorithms. To address this problem our key observation is that close physical interactions between two subjects typically happen under very specific situations (e.g. handshake hug etc.) and such situational contexts contain strong prior semantics to help infer the poses of occluded joints. In this spirit we introduce reaction priors which are invertible neural networks that bi-directionally model the pose probability distributions of one person given the pose of the other. The learned reaction priors are then incorporated into a query-based pose estimator which is a decoder-only Transformer with self-attentions on both intra-joint and inter-joint relationships. We demonstrate that our design achieves considerably higher performance than previous methods on multiple benchmarks. What's more as existing datasets lack sufficient cases of close human-human interactions we also build a new dataset called Dual-Human to better evaluate different methods. Dual-Human contains around 2k sequences of closely interacted two-person motions each with synthetic multi-view renderings contact annotations and text descriptions. We believe that this new public dataset can significantly promote further research in this area. + + + + RobustSAM: Segment Anything Robustly on Degraded Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_RobustSAM_Segment_Anything_Robustly_on_Degraded_Images_CVPR_2024_paper.pdf + Segment Anything Model (SAM) has emerged as a transformative approach in image segmentation acclaimed for its robust zero-shot segmentation capabilities and flexible prompting system. Nonetheless its performance is challenged by images with degraded quality. Addressing this limitation we propose the Robust Segment Anything Model (RobustSAM) which enhances SAM's performance on low-quality images while preserving its promptability and zero-shot generalization. Our method leverages the pre-trained SAM model with only marginal parameter increments and computational requirements. The additional parameters of RobustSAM can be optimized within 30 hours on eight GPUs demonstrating its feasibility and practicality for typical research laboratories. We also introduce the Robust-Seg dataset a collection of 688K image-mask pairs with different degradations designed to train and evaluate our model optimally. 
Extensive experiments across various segmentation tasks and datasets confirm RobustSAM's superior performance especially under zero-shot conditions underscoring its potential for extensive real-world application. Additionally our method has been shown to effectively improve the performance of SAM-based downstream tasks such as single image dehazing and deblurring. + + + + In-N-Out: Faithful 3D GAN Inversion with Volumetric Decomposition for Face Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_In-N-Out_Faithful_3D_GAN_Inversion_with_Volumetric_Decomposition_for_Face_CVPR_2024_paper.pdf + 3D-aware GANs offer new capabilities for view synthesis while preserving the editing functionalities of their 2D counterparts. GAN inversion is a crucial step that seeks the latent code to reconstruct input images or videos subsequently enabling diverse editing tasks through manipulation of this latent code. However a model pre-trained on a particular dataset (e.g. FFHQ) often has difficulty reconstructing images with out-of-distribution (OOD) objects such as faces with heavy make-up or occluding objects. We address this issue by explicitly modeling OOD objects from the input in 3D-aware GANs. Our core idea is to represent the image using two individual neural radiance fields: one for the in-distribution content and the other for the out-of-distribution object. The final reconstruction is achieved by optimizing the composition of these two radiance fields with carefully designed regularization. We demonstrate that our explicit decomposition alleviates the inherent trade-off between reconstruction fidelity and editability. We evaluate reconstruction accuracy and editability of our method on challenging real face images and videos and showcase favorable results against other baselines. + + + + Combining Frame and GOP Embeddings for Neural Video Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Saethre_Combining_Frame_and_GOP_Embeddings_for_Neural_Video_Representation_CVPR_2024_paper.pdf + Implicit neural representations (INRs) were recently proposed as a new video compression paradigm with existing approaches performing on par with HEVC. However such methods only perform well in limited settings e.g. specific model sizes fixed aspect ratios and low-motion videos. We address this issue by proposing T-NeRV a hybrid video INR that combines frame-specific embeddings with GOP-specific features providing a lever for content-specific fine-tuning. We employ entropy-constrained training to jointly optimize our model for rate and distortion and demonstrate that T-NeRV can thereby automatically adjust this lever during training effectively fine-tuning itself to the target content. We evaluate T-NeRV on the UVG dataset where it achieves state-of-the-art results on the video representation task outperforming previous works by up to 3dB PSNR on challenging high-motion sequences. Further our method improves on the compression performance of previous methods and is the first video INR to outperform HEVC on all UVG sequences. + + + + Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Fantastic_Animals_and_Where_to_Find_Them_Segment_Any_Marine_CVPR_2024_paper.pdf + As an important pillar of underwater intelligence Marine Animal Segmentation (MAS) involves segmenting animals within marine environments. 
Previous methods don't excel in extracting long-range contextual features and overlook the connectivity between discrete pixels. Recently Segment Anything Model (SAM) offers a universal framework for general segmentation tasks. Unfortunately trained with natural images SAM does not obtain the prior knowledge from marine images. In addition the single-position prompt of SAM is very insufficient for prior guidance. To address these issues we propose a novel feature learning framework named Dual-SAM for high-performance MAS. To this end we first introduce a dual structure with SAM's paradigm to enhance feature learning of marine images. Then we propose a Multi-level Coupled Prompt (MCP) strategy to instruct comprehensive underwater prior information and enhance the multi-level features of SAM's encoder with adapters. Subsequently we design a Dilated Fusion Attention Module (DFAM) to progressively integrate multi-level features from SAM's encoder. Finally instead of directly predicting the masks of marine animals we propose a Criss-Cross Connectivity Prediction (C3P) paradigm to capture the inter-connectivity between discrete pixels. With dual decoders it generates pseudo-labels and achieves mutual supervision for complementary feature representations resulting in considerable improvements over previous techniques. Extensive experiments verify that our proposed method achieves state-of-the-art performances on five widely-used MAS datasets. The code is available at https://github.com/Drchip61/Dual SAM. + + + + Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners + http://openaccess.thecvf.com//content/CVPR2024/papers/Xing_Seeing_and_Hearing_Open-domain_Visual-Audio_Generation_with_Diffusion_Latent_Aligners_CVPR_2024_paper.pdf + Video and audio content creation serves as the core technique for the movie industry and professional users. Recently existing diffusion-based methods tackle video and audio generation separately which hinders the technique transfer from academia to industry. In this work we aim at filling the gap with a carefully designed optimization-based framework for cross-visual-audio and joint-visual-audio generation. We observe the powerful generation ability of off-the-shelf video or audio generation models. Thus instead of training the giant models from scratch we propose to bridge the existing strong models with a shared latent representation space. Specifically we propose a multimodality latent aligner with the pre-trained ImageBind model. Our latent aligner shares a similar core as the classifier guidance that guides the diffusion denoising process during inference time. Through carefully designed optimization strategy and loss functions we show the superior performance of our method on joint video-audio generation visual-steered audio generation and audio-steered visual generation tasks. The project website can be found at https://yzxing87.github.io/Seeing-and-Hearing/. + + + + Objects as Volumes: A Stochastic Geometry View of Opaque Solids + http://openaccess.thecvf.com//content/CVPR2024/papers/Miller_Objects_as_Volumes_A_Stochastic_Geometry_View_of_Opaque_Solids_CVPR_2024_paper.pdf + We develop a theory for the representation of opaque solids as volumes. Starting from a stochastic representation of opaque solids as random indicator functions we prove the conditions under which such solids can be modeled using exponential volumetric transport. 
We also derive expressions for the volumetric attenuation coefficient as a functional of the probability distributions of the underlying indicator functions. We generalize our theory to account for isotropic and anisotropic scattering at different parts of the solid and for representations of opaque solids as stochastic implicit surfaces. We derive our volumetric representation from first principles which ensures that it satisfies physical constraints such as reciprocity and reversibility. We use our theory to explain compare and correct previous volumetric representations as well as propose meaningful extensions that lead to improved performance in 3D reconstruction tasks. + + + + Improving Subject-Driven Image Synthesis with Subject-Agnostic Guidance + http://openaccess.thecvf.com//content/CVPR2024/papers/Chan_Improving_Subject-Driven_Image_Synthesis_with_Subject-Agnostic_Guidance_CVPR_2024_paper.pdf + In subject-driven text-to-image synthesis the synthesis process tends to be heavily influenced by the reference images provided by users often overlooking crucial attributes detailed in the text prompt. In this work we propose Subject-Agnostic Guidance (SAG) a simple yet effective solution to remedy the problem. We show that through constructing a subject-agnostic condition and applying our proposed dual classifier-free guidance one could obtain outputs consistent with both the given subject and input text prompts. We validate the efficacy of our approach through both optimization-based and encoder-based methods. Additionally we demonstrate its applicability in second-order customization methods where an encoder-based model is fine-tuned with DreamBooth. Our approach is conceptually simple and requires only minimal code modifications but leads to substantial quality improvements as evidenced by our evaluations and user studies. + + + + Diffusion Model Alignment Using Direct Preference Optimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Wallace_Diffusion_Model_Alignment_Using_Direct_Preference_Optimization_CVPR_2024_paper.pdf + Large language models (LLMs) are fine-tuned using human comparison data with Reinforcement Learning from Human Feedback (RLHF) methods to make them better aligned with users' preferences. In contrast to LLMs human preference learning has not been widely explored in text-to-image diffusion models; the best existing approach is to fine-tune a pretrained model using carefully curated high quality images and captions to improve visual appeal and text alignment. We propose Diffusion-DPO a method to align diffusion models to human preferences by directly optimizing on human comparison data. Diffusion-DPO is adapted from the recently developed Direct Preference Optimization (DPO) a simpler alternative to RLHF which directly optimizes a policy that best satisfies human preferences under a classification objective. We re-formulate DPO to account for a diffusion model notion of likelihood utilizing the evidence lower bound to derive a differentiable objective. Using the Pick-a-pic dataset of 851K crowdsourced pairwise preferences we fine-tune the base model of the state-of-the-art Stable Diffusion XL (SDXL)-1.0 model with Diffusion-DPO. Our fine-tuned base model significantly outperforms both base SDXL-1.0 and the larger SDXL-1.0 model consisting of an additional refinement model in human evaluation improving visual appeal and prompt alignment. 
We also develop a variant that uses AI feedback and has comparable performance to training on human preferences opening the door for scaling of diffusion model alignment methods. + + + + ZeroNVS: Zero-Shot 360-Degree View Synthesis from a Single Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Sargent_ZeroNVS_Zero-Shot_360-Degree_View_Synthesis_from_a_Single_Image_CVPR_2024_paper.pdf + We introduce a 3D-aware diffusion model ZeroNVS for single-image novel view synthesis for in-the-wild scenes. While existing methods are designed for single objects with masked backgrounds we propose new techniques to address challenges introduced by in-the-wild multi-object scenes with complex backgrounds. Specifically we train a generative prior on a mixture of data sources that capture object-centric indoor and outdoor scenes. To address issues from data mixture such as depth-scale ambiguity we propose a novel camera conditioning parameterization and normalization scheme. Further we observe that Score Distillation Sampling (SDS) tends to truncate the distribution of complex backgrounds during distillation of 360-degree scenes and propose "SDS anchoring" to improve the diversity of synthesized novel views. Our model sets a new state-of-the-art result in LPIPS on the DTU dataset in the zero-shot setting even outperforming methods specifically trained on DTU. We further adapt the challenging Mip-NeRF 360 dataset as a new benchmark for single-image novel view synthesis and demonstrate strong performance in this setting. Code and models will be publicly available. + + + + Restoration by Generation with Constrained Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_Restoration_by_Generation_with_Constrained_Priors_CVPR_2024_paper.pdf + The inherent generative power of denoising diffusion models makes them well-suited for image restoration tasks where the objective is to find the optimal high-quality image within the generative space that closely resembles the input image. We propose a method to adapt a pretrained diffusion model for image restoration by simply adding noise to the input image to be restored and then denoise. Our method is based on the observation that the space of a generative model needs to be constrained. We impose this constraint by finetuning the generative model with a set of anchor images that capture the characteristics of the input image. With the constrained space we can then leverage the sampling strategy used for generation to do image restoration. We evaluate against previous methods and show superior performances on multiple real-world restoration datasets in preserving identity and image quality. We also demonstrate an important and practical application on personalized restoration where we use a personal album as the anchor images to constrain the generative space. This approach allows us to produce results that accurately preserve high-frequency details which previous works are unable to do. Project webpage: https://gen2res.github.io. + + + + Blur-aware Spatio-temporal Sparse Transformer for Video Deblurring + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Blur-aware_Spatio-temporal_Sparse_Transformer_for_Video_Deblurring_CVPR_2024_paper.pdf + Video deblurring relies on leveraging information from other frames in the video sequence to restore the blurred regions in the current frame. 
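A minimal, hedged sketch of the pairwise objective described in the "Diffusion Model Alignment Using Direct Preference Optimization" entry above: the trainable model should lower its denoising error on the preferred image more than on the rejected one, relative to a frozen reference copy. The epsilon-prediction interface, tensor shapes and beta value are illustrative assumptions, not the authors' code.

# Illustrative Diffusion-DPO-style loss (assumptions: epsilon-prediction models,
# a frozen reference copy, and noisy latents for the preferred/rejected images
# at the same timestep).
import torch
import torch.nn as nn
import torch.nn.functional as F

def diffusion_dpo_loss(model, ref_model, x_w_t, x_l_t, t, noise, cond, beta=5000.0):
    """x_w_t / x_l_t: noisy latents of the human-preferred and rejected images."""
    err_w = (model(x_w_t, t, cond) - noise).pow(2).mean(dim=(1, 2, 3))
    err_l = (model(x_l_t, t, cond) - noise).pow(2).mean(dim=(1, 2, 3))
    with torch.no_grad():  # reference errors, no gradients
        ref_err_w = (ref_model(x_w_t, t, cond) - noise).pow(2).mean(dim=(1, 2, 3))
        ref_err_l = (ref_model(x_l_t, t, cond) - noise).pow(2).mean(dim=(1, 2, 3))
    # Logistic preference objective on the relative improvement of winner vs. loser.
    logits = -beta * ((err_w - ref_err_w) - (err_l - ref_err_l))
    return -F.logsigmoid(logits).mean()

# Toy usage with stand-in denoisers, only to show the shapes involved.
net, ref = nn.Conv2d(4, 4, 3, padding=1), nn.Conv2d(4, 4, 3, padding=1)
x = torch.randn(2, 4, 8, 8)
loss = diffusion_dpo_loss(lambda z, t, c: net(z), lambda z, t, c: ref(z),
                          x, x.clone(), t=None, noise=torch.randn_like(x), cond=None)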
Mainstream approaches employ bidirectional feature propagation spatio-temporal transformers or a combination of both to extract information from the video sequence. However limitations in memory and computational resources constrain the temporal window length of the spatio-temporal transformer preventing the extraction of longer temporal contextual information from the video sequence. Additionally bidirectional feature propagation is highly sensitive to inaccurate optical flow in blurry frames leading to error accumulation during the propagation process. To address these issues we propose BSSTNet a Blur-aware Spatio-temporal Sparse Transformer Network. It introduces the blur map which converts the originally dense attention into a sparse form enabling a more extensive utilization of information throughout the entire video sequence. Specifically BSSTNet (1) uses a longer temporal window in the transformer leveraging information from more distant frames to restore the blurry pixels in the current frame. (2) introduces bidirectional feature propagation guided by blur maps which reduces error accumulation caused by blurry frames. The experimental results demonstrate that the proposed BSSTNet outperforms the state-of-the-art methods on the GoPro and DVD datasets. + + + + DiffusionPoser: Real-time Human Motion Reconstruction From Arbitrary Sparse Sensors Using Autoregressive Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Van_Wouwe_DiffusionPoser_Real-time_Human_Motion_Reconstruction_From_Arbitrary_Sparse_Sensors_Using_CVPR_2024_paper.pdf + Motion capture from a limited number of body-worn sensors such as inertial measurement units (IMUs) and pressure insoles has important applications in health human performance and entertainment. Recent work has focused on accurately reconstructing whole-body motion from a specific sensor configuration using six IMUs. While a common goal across applications is to use the minimal number of sensors to achieve required accuracy the optimal arrangement of the sensors might differ from application to application. We propose a single diffusion model DiffusionPoser which reconstructs human motion in real-time from an arbitrary combination of sensors including IMUs placed at specified locations and pressure insoles. Unlike existing methods our model grants users the flexibility to determine the number and arrangement of sensors tailored to the specific activity of interest without the need for retraining. A novel autoregressive inferencing scheme ensures real-time motion reconstruction that closely aligns with measured sensor signals. The generative nature of DiffusionPoser ensures realistic behavior even for degrees-of-freedom not directly measured. Qualitative results can be found on our project website. + + + + MANUS: Markerless Grasp Capture using Articulated 3D Gaussians + http://openaccess.thecvf.com//content/CVPR2024/papers/Pokhariya_MANUS_Markerless_Grasp_Capture_using_Articulated_3D_Gaussians_CVPR_2024_paper.pdf + Understanding how we grasp objects with our hands has important applications in areas like robotics and mixed reality. However this challenging problem requires accurate modeling of the contact between hands and objects. To capture grasps existing methods use skeletons meshes or parametric models that do not represent hand shape accurately resulting in inaccurate contacts. We present MANUS a method for Markerless Hand-Object Grasp Capture using Articulated 3D Gaussians.
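A hedged sketch of how a blur map can turn dense attention into a sparse form, in the spirit of the BSSTNet entry above: key positions whose blur score exceeds a threshold are masked out so that attention gathers information from sharper pixels across the sequence. The threshold, token layout and masking rule are assumptions for illustration, not the paper's exact mechanism.

import torch

def blur_sparse_attention(q, k, v, blur_map, blur_thresh=0.5):
    """q, k, v: (B, N, C) token features; blur_map: (B, N) in [0, 1], 1 = most blurred.
    Assumes at least one key per sample stays below the threshold."""
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale            # (B, N, N) dense scores
    key_mask = (blur_map > blur_thresh).unsqueeze(1)    # (B, 1, N) keys to drop
    attn = attn.masked_fill(key_mask, float("-inf"))    # sparsify: ignore blurred keys
    return attn.softmax(dim=-1) @ v

# Example: 256 spatio-temporal tokens with 64 channels.
q = k = v = torch.randn(1, 256, 64)
out = blur_sparse_attention(q, k, v, blur_map=torch.rand(1, 256))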
We build a novel articulated 3D Gaussians representation that extends 3D Gaussian splatting for high-fidelity representation of articulating hands. Since our representation uses Gaussian primitives optimized from the multi-view pixel-aligned losses it enables us to efficiently and accurately estimate contacts between the hand and the object. For the most accurate results our method requires tens of camera views that current datasets do not provide. We therefore build MANUS-Grasps a new dataset that contains hand-object grasps viewed from 50+ cameras across 30+ scenes and 3 subjects comprising over 7M frames. In addition to extensive qualitative results we also show that our method outperforms others on a quantitative contact evaluation method that uses paint transfer from the object to the hand. + + + + BerfScene: Bev-conditioned Equivariant Radiance Fields for Infinite 3D Scene Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_BerfScene_Bev-conditioned_Equivariant_Radiance_Fields_for_Infinite_3D_Scene_Generation_CVPR_2024_paper.pdf + Generating large-scale 3D scenes cannot be done by simply applying existing 3D object synthesis techniques since 3D scenes usually hold complex spatial configurations and consist of a number of objects at varying scales. We thus propose a practical and efficient 3D representation that incorporates an equivariant radiance field with the guidance of a bird's-eye view (BEV) map. Concretely objects of synthesized 3D scenes could be easily manipulated through steering the corresponding BEV maps. Moreover by adequately incorporating positional encoding and low-pass filters into the generator the representation becomes equivariant to the given BEV map. Such equivariance allows us to produce large-scale even infinite-scale 3D scenes via synthesizing local scenes and then stitching them with smooth consistency. Extensive experiments on 3D scene datasets demonstrate the effectiveness of our approach. Our project website is at: https://zqh0253.github.io/BerfScene. + + + + 3D Facial Expressions through Analysis-by-Neural-Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Retsinas_3D_Facial_Expressions_through_Analysis-by-Neural-Synthesis_CVPR_2024_paper.pdf + While existing methods for 3D face reconstruction from in-the-wild images excel at recovering the overall face shape they commonly miss subtle extreme asymmetric or rarely observed expressions. We improve upon these methods with SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics) which faithfully reconstructs expressive 3D faces from images. We identify two key limitations in existing methods: shortcomings in their self-supervised training formulation and a lack of expression diversity in the training images. For training most methods employ differentiable rendering to compare a predicted face mesh with the input image along with a plethora of additional loss functions. This differentiable rendering loss not only has to provide supervision to optimize for 3D face geometry camera albedo and lighting which is an ill-posed optimization problem but the domain gap between rendering and input image further hinders the learning process. Instead SMIRK replaces the differentiable rendering with a neural rendering module that given the rendered predicted mesh geometry and sparsely sampled pixels of the input image generates a face image.
As the neural rendering gets color information from sampled image pixels supervising with neural rendering-based reconstruction loss can focus solely on the geometry. Further it enables us to generate images of the input identity with varying expressions while training. These are then utilized as input to the reconstruction model and used as supervision with ground truth geometry. This effectively augments the training data and enhances the generalization for diverse expressions. Our qualitative quantitative and particularly our perceptual evaluations demonstrate that SMIRK achieves the new state-of-the art performance on accurate expression reconstruction. For our method's source code demo video and more please visit our project webpage: https://georgeretsi.github.io/smirk/. + + + + Unleashing the Potential of SAM for Medical Adaptation via Hierarchical Decoding + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_Unleashing_the_Potential_of_SAM_for_Medical_Adaptation_via_Hierarchical_CVPR_2024_paper.pdf + The Segment Anything Model (SAM) has garnered significant attention for its versatile segmentation abilities and intuitive prompt-based interface. However its application in medical imaging presents challenges requiring either substantial training costs and extensive medical datasets for full model fine-tuning or high-quality prompts for optimal performance. This paper introduces H-SAM: a prompt-free adaptation of SAM tailored for efficient fine-tuning of medical images via a two-stage hierarchical decoding procedure. In the initial stage H-SAM employs SAM's original decoder to generate a prior probabilistic mask guiding a more intricate decoding process in the second stage. Specifically we propose two key designs: 1) A class-balanced mask-guided self-attention mechanism addressing the unbalanced label distribution enhancing image embedding; 2) A learnable mask cross-attention mechanism spatially modulating the interplay among different image regions based on the prior mask. Moreover the inclusion of a hierarchical pixel decoder in H-SAM enhances its proficiency in capturing fine-grained and localized details. This approach enables SAM to effectively integrate learned medical priors facilitating enhanced adaptation for medical image segmentation with limited samples. Our H-SAM demonstrates a 4.78% improvement in average Dice compared to existing prompt-free SAM variants for multi-organ segmentation using only 10% of 2D slices. Notably without using any unlabeled data H-SAM even outperforms state-of-the-art semi-supervised models relying on extensive unlabeled training data across various medical datasets. Our code is available at https://github.com/Cccccczh404/H-SAM. + + + + Puff-Net: Efficient Style Transfer with Pure Content and Style Feature Fusion Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Puff-Net_Efficient_Style_Transfer_with_Pure_Content_and_Style_Feature_CVPR_2024_paper.pdf + Style transfer aims to render an image with the artistic features of a style image while maintaining the original structure. Various methods have been put forward for this task but some challenges still exist. For instance it is difficult for CNN-based methods to handle global information and long-range dependencies between input images for which transformer-based methods have been proposed. Although transformer can better model the relationship between content and style images they require high-cost hardware and time-consuming inference. 
To address these issues we design a novel transformer model that includes only encoders thus significantly reducing the computational cost. In addition we also find that existing style transfer methods may lead to images being under-stylized or missing content. In order to achieve better stylization we design a content feature extractor and a style feature extractor. Then we can feed pure content and style images into the transformer. Finally we propose a network model termed Puff-Net i.e. efficient style transfer with pure content and style feature fusion network. Through qualitative and quantitative experiments we demonstrate the advantages of our model compared to state-of-the-art ones in the literature. The code is available at https://github.com/ZszYmy9/Puff-Net. + + + + Towards Progressive Multi-Frequency Representation for Image Warping + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_Towards_Progressive_Multi-Frequency_Representation_for_Image_Warping_CVPR_2024_paper.pdf + Image warping a classic task in computer vision aims to use geometric transformations to change the appearance of images. Recent methods learn the resampling kernels for warping through neural networks to estimate missing values in irregular grids which however fail to capture local variations in deformed content and produce images with distortion and less high-frequency details. To address this issue this paper proposes an effective method namely MFR to learn Multi-Frequency Representations from input images for image warping. Specifically we propose a progressive filtering network to learn image representations from different frequency subbands and generate deformable images in a coarse-to-fine manner. Furthermore we employ learnable Gabor wavelet filters to improve the model's capability to learn local spatial-frequency representations. Comprehensive experiments including homography transformation equirectangular to perspective projection and asymmetric image super-resolution demonstrate that the proposed MFR significantly outperforms state-of-the-art image warping methods. Our method also showcases superior generalization to out-of-distribution domains where the generated images are equipped with rich details and less distortion and thereby high visual quality. The source code is available at https://github.com/junxiao01/MFR. + + + + Learning to Control Camera Exposure via Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Learning_to_Control_Camera_Exposure_via_Reinforcement_Learning_CVPR_2024_paper.pdf + Adjusting camera exposure in arbitrary lighting conditions is the first step to ensure the functionality of computer vision applications. Poorly adjusted camera exposure often leads to critical failure and performance degradation. Traditional camera exposure control methods require multiple convergence steps and time-consuming processes making them unsuitable for dynamic lighting conditions. In this paper we propose a new camera exposure control framework that rapidly controls camera exposure while performing real-time processing by exploiting deep reinforcement learning.
The proposed framework consists of four contributions: 1) a simplified training ground to simulate real-world's diverse and dynamic lighting changes 2) flickering and image attribute-aware reward design along with lightweight state design for real-time processing 3) a static-to-dynamic lighting curriculum to gradually improve the agent's exposure-adjusting capability and 4) domain randomization techniques to alleviate the limitation of the training ground and achieve seamless generalization in the wild. As a result our proposed method rapidly reaches a desired exposure level within five steps with real-time processing (1 ms). Also the acquired images are well-exposed and show superiority in various computer vision tasks such as feature extraction and object detection. + + + + RNb-NeuS: Reflectance and Normal-based Multi-View 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Brument_RNb-NeuS_Reflectance_and_Normal-based_Multi-View_3D_Reconstruction_CVPR_2024_paper.pdf + This paper introduces a versatile paradigm for integrating multi-view reflectance (optional) and normal maps acquired through photometric stereo. Our approach employs a pixel-wise joint re-parameterization of reflectance and normal considering them as a vector of radiances rendered under simulated varying illumination. This re-parameterization enables the seamless integration of reflectance and normal maps as input data in neural volume rendering-based 3D reconstruction while preserving a single optimization objective. In contrast recent multi-view photometric stereo (MVPS) methods depend on multiple potentially conflicting objectives. Despite its apparent simplicity our proposed approach outperforms state-of-the-art approaches in MVPS benchmarks across F-score Chamfer distance and mean angular error metrics. Notably it significantly improves the detailed 3D reconstruction of areas with high curvature or low visibility. + + + + Scaling Up Dynamic Human-Scene Interaction Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Scaling_Up_Dynamic_Human-Scene_Interaction_Modeling_CVPR_2024_paper.pdf + Confronting the challenges of data scarcity and advanced motion synthesis in human-scene interaction modeling we introduce the TRUMANS dataset alongside a novel HSI motion synthesis method. TRUMANS stands as the most comprehensive motion-captured HSI dataset currently available encompassing over 15 hours of human interactions across 100 indoor scenes. It intricately captures whole-body human motions and part-level object dynamics focusing on the realism of contact. This dataset is further scaled up by transforming physical environments into exact virtual models and applying extensive augmentations to appearance and motion for both humans and objects while maintaining interaction fidelity. Utilizing TRUMANS we devise a diffusion-based autoregressive model that efficiently generates HSI sequences of any length taking into account both scene context and intended actions. In experiments our approach shows remarkable zero-shot generalizability on a range of 3D scene datasets (e.g. PROX Replica ScanNet ScanNet++) producing motions that closely mimic original motion-captured sequences as confirmed by quantitative experiments and human studies. 
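A hedged sketch of the pixel-wise re-parameterization described in the RNb-NeuS entry above: each (reflectance, normal) pair is mapped to a vector of radiances rendered under a set of simulated directional lights, which can then be fed to a volume-rendering reconstruction pipeline. Lambertian shading and the particular light directions are assumptions for illustration.

import torch
import torch.nn.functional as F

def radiance_vector(albedo, normals, light_dirs):
    """albedo: (N, 3) reflectance, normals: (N, 3) unit normals,
    light_dirs: (L, 3) unit light directions. Returns (N, L, 3) radiances."""
    shading = F.relu(normals @ light_dirs.T)            # clamped cosine terms, (N, L)
    return albedo.unsqueeze(1) * shading.unsqueeze(-1)  # per-light RGB radiance

# Example with three axis-aligned simulated lights.
lights = torch.eye(3)
albedo = torch.rand(1024, 3)
normals = F.normalize(torch.randn(1024, 3), dim=-1)
radiances = radiance_vector(albedo, normals, lights)    # (1024, 3, 3)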
+ + + + Semantic-aware SAM for Point-Prompted Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_Semantic-aware_SAM_for_Point-Prompted_Instance_Segmentation_CVPR_2024_paper.pdf + Single-point annotation in visual tasks with the goal of minimizing labeling costs is becoming increasingly prominent in research. Recently visual foundation models such as Segment Anything (SAM) have gained widespread usage due to their robust zero-shot capabilities and exceptional annotation performance. However SAM's class-agnostic output and high confidence in local segmentation introduce semantic ambiguity posing a challenge for precise category-specific segmentation. In this paper we introduce a cost-effective category-specific segmenter using SAM. To tackle this challenge we have devised a Semantic-Aware Instance Segmentation Network (SAPNet) that integrates Multiple Instance Learning (MIL) with matching capability and SAM with point prompts. SAPNet strategically selects the most representative mask proposals generated by SAM to supervise segmentation with a specific focus on object category information. Moreover we introduce the Point Distance Guidance and Box Mining Strategy to mitigate inherent challenges: group and local issues in weakly supervised segmentation. These strategies serve to further enhance the overall segmentation performance. The experimental results on Pascal VOC and COCO demonstrate the promising performance of our proposed SAPNet emphasizing its semantic matching capabilities and its potential to advance point-prompted instance segmentation. The code is available at https://github.com/zhaoyangwei123/SAPNet. + + + + Make Pixels Dance: High-Dynamic Video Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_Make_Pixels_Dance_High-Dynamic_Video_Generation_CVPR_2024_paper.pdf + Creating high-dynamic videos such as motion-rich actions and sophisticated visual effects poses a significant challenge in the field of artificial intelligence. Unfortunately current state-of-the-art video generation methods primarily focusing on text-to-video generation tend to produce video clips with minimal motions despite maintaining high fidelity. We argue that relying solely on text instructions is insufficient and suboptimal for video generation. In this paper we introduce PixelDance a novel approach based on diffusion models that incorporates image instructions for both the first and last frames in conjunction with text instructions for video generation. Comprehensive experimental results demonstrate that PixelDance trained with public data exhibits significantly better proficiency in synthesizing videos with complex scenes and intricate motions setting a new standard for video generation. + + + + A&B BNN: Add&Bit-Operation-Only Hardware-Friendly Binary Neural Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_AB_BNN_AddBit-Operation-Only_Hardware-Friendly_Binary_Neural_Network_CVPR_2024_paper.pdf + Binary neural networks utilize 1-bit quantized weights and activations to reduce both the model's storage demands and computational burden. However advanced binary architectures still incorporate millions of inefficient and nonhardware-friendly full-precision multiplication operations. 
A&B BNN is proposed to directly remove part of the multiplication operations in a traditional BNN and replace the rest with an equal number of bit operations introducing the mask layer and the quantized RPReLU structure based on the normalizer-free network architecture. The mask layer can be removed during inference by leveraging the intrinsic characteristics of BNN with straightforward mathematical transformations to avoid the associated multiplication operations. The quantized RPReLU structure enables more efficient bit operations by constraining its slope to be integer powers of 2. Experimental results achieved 92.30% 69.35% and 66.89% on the CIFAR-10 CIFAR-100 and ImageNet datasets respectively which are competitive with the state-of-the-art. Ablation studies have verified the efficacy of the quantized RPReLU structure leading to a 1.14% enhancement on the ImageNet compared to using a fixed slope RLeakyReLU. The proposed add&bit-operation-only BNN offers an innovative approach for hardware-friendly network architecture. + + + + Task-aligned Part-aware Panoptic Segmentation through Joint Object-Part Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/de_Geus_Task-aligned_Part-aware_Panoptic_Segmentation_through_Joint_Object-Part_Representations_CVPR_2024_paper.pdf + Part-aware panoptic segmentation (PPS) requires (a) that each foreground object and background region in an image is segmented and classified and (b) that all parts within foreground objects are segmented classified and linked to their parent object. Existing methods approach PPS by separately conducting object-level and part-level segmentation. However their part-level predictions are not linked to individual parent objects. Therefore their learning objective is not aligned with the PPS task objective which harms the PPS performance. To solve this and make more accurate PPS predictions we propose Task-Aligned Part-aware Panoptic Segmentation (TAPPS). This method uses a set of shared queries to jointly predict (a) object-level segments and (b) the part-level segments within those same objects. As a result TAPPS learns to predict part-level segments that are linked to individual parent objects aligning the learning objective with the task objective and allowing TAPPS to leverage joint object-part representations. With experiments we show that TAPPS considerably outperforms methods that predict objects and parts separately and achieves new state-of-the-art PPS results. + + + + From Activation to Initialization: Scaling Insights for Optimizing Neural Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Saratchandran_From_Activation_to_Initialization_Scaling_Insights_for_Optimizing_Neural_Fields_CVPR_2024_paper.pdf + In the realm of computer vision Neural Fields have gained prominence as a contemporary tool harnessing neural networks for signal representation. Despite the remarkable progress in adapting these networks to solve a variety of problems the field still lacks a comprehensive theoretical framework. This article aims to address this gap by delving into the intricate interplay between initialization and activation providing a foundational basis for the robust optimization of Neural Fields. Our theoretical insights reveal a deep-seated connection among network initialization architectural choices and the optimization process emphasizing the need for a holistic approach when designing cutting-edge Neural Fields. 
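A hedged sketch of an RPReLU-style activation whose negative slope is snapped to an integer power of two, as mentioned for the quantized RPReLU in the A&B BNN entry above, so that the remaining multiplication can be realized as a bit shift. The channel-wise shifts and the straight-through rounding are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class QuantizedRPReLU(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.shift_in = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.shift_out = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.slope = nn.Parameter(torch.full((1, channels, 1, 1), 0.25))

    def forward(self, x):
        # Snap the learnable negative slope to the nearest power of two; keep
        # gradients flowing to the underlying parameter via a straight-through trick.
        slope = self.slope.clamp(min=1e-4)
        q_slope = 2.0 ** torch.log2(slope).round()
        slope = slope + (q_slope - slope).detach()
        x = x - self.shift_in
        return torch.where(x >= 0, x, slope * x) + self.shift_out

y = QuantizedRPReLU(32)(torch.randn(2, 32, 8, 8))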
+ + + + DiffAvatar: Simulation-Ready Garment Optimization with Differentiable Simulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_DiffAvatar_Simulation-Ready_Garment_Optimization_with_Differentiable_Simulation_CVPR_2024_paper.pdf + The realism of digital avatars is crucial in enabling telepresence applications with self-expression and customization. While physical simulations can produce realistic motions for clothed humans they require high-quality garment assets with associated physical parameters for cloth simulations. However manually creating these assets and calibrating their parameters is labor-intensive and requires specialized expertise. Current methods focus on reconstructing geometry but don't generate complete assets for physics-based applications. To address this gap we propose DiffAvatar a novel approach that performs body and garment co-optimization using differentiable simulation. By integrating physical simulation into the optimization loop and accounting for the complex nonlinear behavior of cloth and its intricate interaction with the body our framework recovers body and garment geometry and extracts important material parameters in a physically plausible way. Our experiments demonstrate that our approach generates realistic clothing and body shape suitable for downstream applications. We provide additional insights and results on our webpage: people.csail.mit.edu/liyifei/publication/diffavatar. + + + + AlignSAM: Aligning Segment Anything Model to Open Context via Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_AlignSAM_Aligning_Segment_Anything_Model_to_Open_Context_via_Reinforcement_CVPR_2024_paper.pdf + Powered by massive curated training data Segment Anything Model (SAM) has demonstrated its impressive generalization capabilities in open-world scenarios with the guidance of prompts. However the vanilla SAM is class-agnostic and heavily relies on user-provided prompts to segment objects of interest. Adapting this method to diverse tasks is crucial for accurate target identification and to avoid suboptimal segmentation results. In this paper we propose a novel framework termed AlignSAM designed for automatic prompting for aligning SAM to an open context through reinforcement learning. Anchored by an agent AlignSAM enables the generality of the SAM model across diverse downstream tasks while keeping its parameters frozen. Specifically AlignSAM initiates a prompting agent to iteratively refine segmentation predictions by interacting with the foundational model. It integrates a reinforcement learning policy network to provide informative prompts to the foundational models. Additionally a semantic recalibration module is introduced to provide fine-grained labels of prompts enhancing the model's proficiency in handling tasks encompassing explicit and implicit semantics. Experiments conducted on various challenging segmentation tasks among existing foundation models demonstrate the superiority of the proposed AlignSAM over state-of-the-art approaches. Project page: https://github.com/Duojun-Huang/AlignSAM-CVPR2024. + + + + Learning Spatial Adaptation and Temporal Coherence in Diffusion Models for Video Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Learning_Spatial_Adaptation_and_Temporal_Coherence_in_Diffusion_Models_for_CVPR_2024_paper.pdf + Diffusion models are just at a tipping point for image super-resolution task. 
Nevertheless it is not trivial to capitalize on diffusion models for video super-resolution which necessitates not only the preservation of visual appearance from low-resolution to high-resolution videos but also the temporal consistency across video frames. In this paper we propose a novel approach pursuing Spatial Adaptation and Temporal Coherence (SATeCo) for video super-resolution. SATeCo pivots on learning spatial-temporal guidance from low-resolution videos to calibrate both latent-space high-resolution video denoising and pixel-space video reconstruction. Technically SATeCo freezes all the parameters of the pre-trained UNet and VAE and only optimizes two deliberately-designed spatial feature adaptation (SFA) and temporal feature alignment (TFA) modules in the decoder of UNet and VAE. SFA modulates frame features via adaptively estimating affine parameters for each pixel guaranteeing pixel-wise guidance for high-resolution frame synthesis. TFA delves into feature interaction within a 3D local window (tubelet) through self-attention and executes cross-attention between tubelet and its low-resolution counterpart to guide temporal feature alignment. Extensive experiments conducted on the REDS4 and Vid4 datasets demonstrate the effectiveness of our approach. + + + + Denoising Point Clouds in Latent Space via Graph Convolution and Invertible Neural Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Mao_Denoising_Point_Clouds_in_Latent_Space_via_Graph_Convolution_and_CVPR_2024_paper.pdf + Point clouds frequently contain noise and outliers presenting obstacles for downstream applications. In this work we introduce a novel denoising method for point clouds. By leveraging the latent space we explicitly uncover noise components allowing for the extraction of a clean latent code. This in turn facilitates the restoration of clean points via inverse transformation. A key component in our network is a new multi-level graph convolution network for capturing rich geometric structural features at various scales from local to global. These features are then integrated into the invertible neural network which bijectively maps the latent space to guide the noise disentanglement process. Additionally we employ an invertible monotone operator to model the transformation process effectively enhancing the representation of integrated geometric features. This enhancement allows our network to precisely differentiate between noise factors and the intrinsic clean points in the latent code by projecting them onto separate channels. Both qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art methods at various noise levels. The source code is available at https://github.com/yanbiao1/PD-LTS. + + + + HIR-Diff: Unsupervised Hyperspectral Image Restoration Via Improved Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Pang_HIR-Diff_Unsupervised_Hyperspectral_Image_Restoration_Via_Improved_Diffusion_Models_CVPR_2024_paper.pdf + Hyperspectral image (HSI) restoration aims at recovering clean images from degraded observations and plays a vital role in downstream tasks. Existing model-based methods have limitations in accurately modeling the complex image characteristics with handcraft priors and deep learning-based methods suffer from poor generalization ability. 
To alleviate these issues this paper proposes an unsupervised HSI restoration framework with a pre-trained diffusion model (HIR-Diff) which restores the clean HSIs from the product of two low-rank components i.e. the reduced image and the coefficient matrix. Specifically the reduced image which has a low spectral dimension lies in the image field and can be inferred from our improved diffusion model where a new guidance function with total variation (TV) prior is designed to ensure that the reduced image can be well sampled. The coefficient matrix can be effectively pre-estimated based on singular value decomposition (SVD) and rank-revealing QR (RRQR) factorization. Furthermore a novel exponential noise schedule is proposed to accelerate the restoration process (about 5x acceleration for denoising) with little performance decrease. Extensive experimental results validate the superiority of our method in both performance and speed on a variety of HSI restoration tasks including HSI denoising noisy HSI super-resolution and noisy HSI inpainting. The code is available at https://github.com/LiPang/HIRDiff. + + + + FreeDrag: Feature Dragging for Reliable Point-based Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Ling_FreeDrag_Feature_Dragging_for_Reliable_Point-based_Image_Editing_CVPR_2024_paper.pdf + To serve the intricate and varied demands of image editing precise and flexible manipulation in image content is indispensable. Recently Drag-based editing methods have achieved impressive performance. However these methods predominantly center on point dragging resulting in two noteworthy drawbacks namely "miss tracking" where difficulties arise in accurately tracking the predetermined handle points and "ambiguous tracking" where tracked points are potentially positioned in wrong regions that closely resemble the handle points. To address the above issues we propose FreeDrag a feature dragging methodology designed to free the burden on point tracking. The FreeDrag incorporates two key designs i.e. template feature via adaptive updating and line search with backtracking the former improves the stability against drastic content change by elaborately controlling the feature updating scale after each dragging while the latter alleviates the misguidance from similar points by actively restricting the search area in a line. These two technologies together contribute to a more stable semantic dragging with higher efficiency. Comprehensive experimental results substantiate that our approach significantly outperforms pre-existing methodologies offering reliable point-based editing even in various complex scenarios. + + + + Confronting Ambiguity in 6D Object Pose Estimation via Score-Based Diffusion on SE(3) + http://openaccess.thecvf.com//content/CVPR2024/papers/Hsiao_Confronting_Ambiguity_in_6D_Object_Pose_Estimation_via_Score-Based_Diffusion_CVPR_2024_paper.pdf + Addressing pose ambiguity in 6D object pose estimation from single RGB images presents a significant challenge particularly due to object symmetries or occlusions. In response we introduce a novel score-based diffusion method applied to the SE(3) group marking the first application of diffusion models to SE(3) within the image domain specifically tailored for pose estimation tasks. Extensive evaluations demonstrate the method's efficacy in handling pose ambiguity mitigating perspective-induced ambiguity and showcasing the robustness of our surrogate Stein score formulation on SE(3).
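A hedged sketch of the low-rank split used by the HIR-Diff entry above: a degraded hyperspectral cube is approximated as the product of a reduced image and a spectral coefficient matrix, with the coefficients pre-estimated from the observation itself. Plain truncated SVD is used here in place of the paper's SVD-plus-RRQR band selection, and the rank is an illustrative assumption.

import torch

def lowrank_split(hsi, rank=3):
    """hsi: (H, W, B) degraded hyperspectral image. Returns a reduced image
    (H, W, rank) and coefficients (rank, B) whose product approximates hsi."""
    H, W, B = hsi.shape
    flat = hsi.reshape(-1, B)                          # (H*W, B)
    U, S, Vh = torch.linalg.svd(flat, full_matrices=False)
    coeff = Vh[:rank]                                  # (rank, B) spectral basis
    reduced = (flat @ coeff.T).reshape(H, W, rank)     # project onto that basis
    return reduced, coeff

hsi = torch.rand(64, 64, 31)
reduced, coeff = lowrank_split(hsi)
recon = reduced.reshape(-1, coeff.shape[0]) @ coeff    # coarse low-rank reconstruction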
This formulation not only improves the convergence of the denoising process but also enhances computational efficiency. Thus we pioneer a promising strategy for 6D object pose estimation. + + + + DiffInDScene: Diffusion-based High-Quality 3D Indoor Scene Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ju_DiffInDScene_Diffusion-based_High-Quality_3D_Indoor_Scene_Generation_CVPR_2024_paper.pdf + We present DiffInDScene a novel framework for tackling the problem of high-quality 3D indoor scene generation which is challenging due to the complexity and diversity of the indoor scene geometry. Although diffusion-based generative models have previously demonstrated impressive performance in image generation and object-level 3D generation they have not yet been applied to room-level 3D generation due to their computationally intensive costs. In DiffInDScene we propose a cascaded 3D diffusion pipeline that is efficient and possesses strong generative performance for Truncated Signed Distance Function (TSDF). The whole pipeline is designed to run on a sparse occupancy space in a coarse-to-fine fashion. Inspired by KinectFusion's incremental alignment and fusion of local TSDF volumes we propose a diffusion-based SDF fusion approach that iteratively diffuses and fuses local TSDF volumes facilitating the generation of an entire room environment. The generated results demonstrate that our work is capable of achieving high-quality room generation directly in three-dimensional space starting from scratch. In addition to the scene generation the final part of DiffInDScene can be used as a post-processing module to refine the 3D reconstruction results from multi-view stereo. According to the user study the mesh quality generated by our DiffInDScene can even outperform the ground truth mesh provided by ScanNet. + + + + MAPSeg: Unified Unsupervised Domain Adaptation for Heterogeneous Medical Image Segmentation Based on 3D Masked Autoencoding and Pseudo-Labeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_MAPSeg_Unified_Unsupervised_Domain_Adaptation_for_Heterogeneous_Medical_Image_Segmentation_CVPR_2024_paper.pdf + Robust segmentation is critical for deriving quantitative measures from large-scale multi-center and longitudinal medical scans. Manually annotating medical scans however is expensive and labor-intensive and may not always be available in every domain. Unsupervised domain adaptation (UDA) is a well-studied technique that alleviates this label-scarcity problem by leveraging available labels from another domain. In this study we introduce Masked Autoencoding and Pseudo-Labeling Segmentation (MAPSeg) a unified UDA framework with great versatility and superior performance for heterogeneous and volumetric medical image segmentation. To the best of our knowledge this is the first study that systematically reviews and develops a framework to tackle four different domain shifts in medical image segmentation. More importantly MAPSeg is the first framework that can be applied to centralized federated and test-time UDA while maintaining comparable performance. We compare MAPSeg with previous state-of-the-art methods on a private infant brain MRI dataset and a public cardiac CT-MRI dataset and MAPSeg outperforms others by a large margin (10.5 Dice improvement on the private MRI dataset and 5.7 on the public CT-MRI dataset). MAPSeg offers great practical value and can be applied to real-world problems. GitHub: https://github.com/XuzheZ/MAPSeg/.
+ + + + DaReNeRF: Direction-aware Representation for Dynamic Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Lou_DaReNeRF_Direction-aware_Representation_for_Dynamic_Scenes_CVPR_2024_paper.pdf + Addressing the intricate challenge of modeling and re-rendering dynamic scenes most recent approaches have sought to simplify these complexities using plane-based explicit representations overcoming the slow training time issues associated with methods like Neural Radiance Fields (NeRF) and implicit representations. However the straightforward decomposition of 4D dynamic scenes into multiple 2D plane-based representations proves insufficient for re-rendering high-fidelity scenes with complex motions. In response we present a novel direction-aware representation (DaRe) approach that captures scene dynamics from six different directions. This learned representation undergoes an inverse dual-tree complex wavelet transformation (DTCWT) to recover plane-based information. DaReNeRF computes features for each space-time point by fusing vectors from these recovered planes. Combining DaReNeRF with a tiny MLP for color regression and leveraging volume rendering in training yield state-of-the-art performance in novel view synthesis for complex dynamic scenes. Notably to address redundancy introduced by the six real and six imaginary direction-aware wavelet coefficients we introduce a trainable masking approach mitigating storage issues without significant performance decline. Moreover DaReNeRF maintains a 2x reduction in training time compared to prior art while delivering superior performance. + + + + SfmCAD: Unsupervised CAD Reconstruction by Learning Sketch-based Feature Modeling Operations + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_SfmCAD_Unsupervised_CAD_Reconstruction_by_Learning_Sketch-based_Feature_Modeling_Operations_CVPR_2024_paper.pdf + This paper introduces SfmCAD a novel unsupervised network that reconstructs 3D shapes by learning the Sketch-based Feature Modeling operations commonly used in modern CAD workflows. Given a 3D shape represented as voxels SfmCAD learns a neural-typed sketch+path parameterized representation including 2D sketches of feature primitives and their 3D sweeping paths without supervision for inferring feature-based CAD programs. SfmCAD employs 2D sketches for local detail representation and 3D paths to capture the overall structure achieving a clear separation between shape details and structure. This conversion into parametric forms enables users to seamlessly adjust the shape's geometric and structural features thus enhancing interpretability and user control. We demonstrate the effectiveness of our method by applying SfmCAD to many different types of objects such as CAD parts ShapeNet objects and tree shapes. Extensive comparisons show that SfmCAD produces compact and faithful 3D reconstructions with superior quality compared to alternatives. The code is released at https://github.com/BunnySoCrazy/SfmCAD. + + + + Learning Degradation-unaware Representation with Prior-based Latent Transformations for Blind Face Restoration + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_Learning_Degradation-unaware_Representation_with_Prior-based_Latent_Transformations_for_Blind_Face_CVPR_2024_paper.pdf + Blind face restoration focuses on restoring high-fidelity details from images subjected to complex and unknown degradations while preserving identity information. 
In this paper we present a Prior-based Latent Transformation approach (PLTrans) which is specifically designed to learn a degradation-unaware representation thereby allowing the restoration network to effectively generalize to real-world degradation. Toward this end PLTrans learns a degradation-unaware query via a latent diffusion-based regularization module. Furthermore conditioned on the features of a degraded face image a latent dictionary that captures the priors of HQ face images is leveraged to refine the features by mapping the top-d nearest elements. The refined version will be used to build key and value for the cross-attention computation which is tailored to each degraded image and exhibits reduced sensitivity to different degradation factors. Conditioned on the resulting representation we train a decoding network that synthesizes face images with authentic details and identity preservation. Through extensive experiments we verify the effectiveness of the design elements and demonstrate the generalization ability of our proposed approach for both synthetic and unknown degradations. We finally demonstrate the applicability of PLTrans in other vision tasks. + + + + Faces that Speak: Jointly Synthesising Talking Face and Speech from Text + http://openaccess.thecvf.com//content/CVPR2024/papers/Jang_Faces_that_Speak_Jointly_Synthesising_Talking_Face_and_Speech_from_CVPR_2024_paper.pdf + The goal of this work is to simultaneously generate natural talking faces and speech outputs from text. We achieve this by integrating Talking Face Generation (TFG) and Text-to-Speech (TTS) systems into a unified framework. We address the main challenges of each task: (1) generating a range of head poses representative of real-world scenarios and (2) ensuring voice consistency despite variations in facial motion for the same identity. To tackle these issues we introduce a motion sampler based on conditional flow matching which is capable of high-quality motion code generation in an efficient way. Moreover we introduce a novel conditioning method for the TTS system which utilises motion-removed features from the TFG model to yield uniform speech outputs. Our extensive experiments demonstrate that our method effectively creates natural-looking talking faces and speech that accurately match the input text. To our knowledge this is the first effort to build a multimodal synthesis system that can generalise to unseen identities. + + + + DiffusionRegPose: Enhancing Multi-Person Pose Estimation using a Diffusion-Based End-to-End Regression Approach + http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_DiffusionRegPose_Enhancing_Multi-Person_Pose_Estimation_using_a_Diffusion-Based_End-to-End_Regression_CVPR_2024_paper.pdf + This paper presents the DiffusionRegPose a novel approach to multi-person pose estimation that converts a one-stage end-to-end keypoint regression model into a diffusion-based sampling process. Existing one-stage deterministic regression methods though efficient are often prone to missed or false detections in crowded or occluded scenes due to their inability to reason pose ambiguity. To address these challenges we handle ambiguous poses in a generative fashion i.e. sampling from the image-conditioned pose distributions characterized by a diffusion probabilistic model. Specifically with initial pose tokens extracted from the image noisy pose candidates are progressively refined by interacting with the initial tokens via attention layers. 
Extensive evaluations on the COCO and CrowdPose datasets show that DiffusionRegPose clearly improves the pose accuracy in crowded scenarios as evidenced by a notable 4.0 AP increase in the AP_H metric on the CrowdPose dataset. This demonstrates the model's potential for robust and precise human pose estimation in real-world applications. + + + + Memory-Scalable and Simplified Functional Map Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Magnet_Memory-Scalable_and_Simplified_Functional_Map_Learning_CVPR_2024_paper.pdf + Deep functional maps have emerged in recent years as a prominent learning-based framework for non-rigid shape matching problems. While early methods in this domain only focused on learning in the functional domain the latest techniques have demonstrated that promoting consistency between functional and pointwise maps leads to significant improvements in accuracy. Unfortunately existing approaches rely heavily on the computation of large dense matrices arising from soft pointwise maps which compromises their efficiency and scalability. To address this limitation we introduce a novel memory-scalable and efficient functional map learning pipeline. By leveraging the specific structure of functional maps we offer the possibility to achieve identical results without ever storing the pointwise map in memory. Furthermore based on the same approach we present a differentiable map refinement layer adapted from an existing axiomatic refinement algorithm. Unlike many functional map learning methods which use this algorithm as a post-processing step ours can be easily used at train time enabling us to enforce consistency between the refined and initial versions of the map. Our resulting approach is simpler more efficient and more numerically stable by avoiding differentiation through a linear system while achieving close to state-of-the-art results in challenging scenarios. + + + + Gaussian Head Avatar: Ultra High-fidelity Head Avatar via Dynamic Gaussians + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Gaussian_Head_Avatar_Ultra_High-fidelity_Head_Avatar_via_Dynamic_Gaussians_CVPR_2024_paper.pdf + Creating high-fidelity 3D head avatars has always been a research hotspot but there remains a great challenge under lightweight sparse view setups. In this paper we propose Gaussian Head Avatar represented by controllable 3D Gaussians for high-fidelity head avatar modeling. We optimize the neutral 3D Gaussians and a fully learned MLP-based deformation field to capture complex expressions. The two parts benefit each other so that our method can model fine-grained dynamic details while ensuring expression accuracy. Furthermore we devise a well-designed geometry-guided initialization strategy based on implicit SDF and Deep Marching Tetrahedra for the stability and convergence of the training procedure. Experiments show our approach outperforms other state-of-the-art sparse-view methods achieving ultra high-fidelity rendering quality at 2K resolution even under exaggerated expressions. Project page: https://yuelangx.github.io/gaussianheadavatar. + + + + Stratified Avatar Generation from Sparse Observations + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_Stratified_Avatar_Generation_from_Sparse_Observations_CVPR_2024_paper.pdf + Estimating 3D full-body avatars from AR/VR devices is essential for creating immersive experiences in AR/VR applications.
This task is challenging due to the limited input from Head Mounted Devices which capture only sparse observations from the head and hands. Predicting the full-body avatars particularly the lower body from these sparse observations presents significant difficulties. In this paper we are inspired by the inherent property of the kinematic tree defined in the Skinned Multi-Person Linear (SMPL) model where the upper body and lower body share only one common ancestor node bringing the potential of decoupled reconstruction. We propose a stratified approach to decouple the conventional full-body avatar reconstruction pipeline into two stages with the reconstruction of the upper body first and a subsequent reconstruction of the lower body conditioned on the previous stage. To implement this straightforward idea we leverage the latent diffusion model as a powerful probabilistic generator and train it to follow the latent distribution of decoupled motions explored by a VQ-VAE encoder-decoder model. Extensive experiments on the AMASS mocap dataset demonstrate our state-of-the-art performance in the reconstruction of full-body motions. + + + + Rewrite the Stars + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Rewrite_the_Stars_CVPR_2024_paper.pdf + Recent studies have drawn attention to the untapped potential of the "star operation" (element-wise multiplication) in network design. While intuitive explanations abound the foundational rationale behind its application remains largely unexplored. Our study attempts to reveal the star operation's ability to map inputs into high-dimensional non-linear feature spaces--akin to kernel tricks--without widening the network. We further introduce StarNet a simple yet powerful prototype demonstrating impressive performance and low latency under compact network structure and efficient budget. Like stars in the sky the star operation appears unremarkable but holds a vast universe of potential. Our work encourages further exploration across tasks with codes available at https://github.com/ma-xu/Rewrite-the-Stars. + + + + PairDETR : Joint Detection and Association of Human Bodies and Faces + http://openaccess.thecvf.com//content/CVPR2024/papers/Ali_PairDETR__Joint_Detection_and_Association_of_Human_Bodies_and_CVPR_2024_paper.pdf + Image and video analysis requires not only accurate object detection but also the understanding of relationships among detected objects. Common solutions to relation modeling typically resort to stand-alone object detectors followed by non-differentiable post-processing techniques. Recently introduced detection transformers (DETR) perform end-to-end object detection based on a bipartite matching loss. Such methods however lack the ability to jointly detect objects and resolve object associations. In this paper we build on the DETR approach and extend it to the joint detection of objects and their relationships by introducing an approximated bipartite matching. While our method can generalize to an arbitrary number of objects we here focus on the modeling of object pairs and their relations. In particular we apply our method PairDETR to the problem of detecting human bodies and faces and associating them for the same person. Our approach not only eliminates the need for hand-designed post-processing but also achieves excellent results for body-face associations. We evaluate PairDETR on the challenging CrowdHuman and CityPersons datasets and demonstrate a large improvement over the state of the art.
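A hedged sketch of the "star operation" studied in the Rewrite the Stars entry above: two linear projections of the same input are fused by element-wise multiplication, which implicitly lifts features into a higher-dimensional space without widening the network. The surrounding block layout (depthwise convolution, two 1x1 branches, residual) is a simplification for illustration, not StarNet's exact design.

import torch
import torch.nn as nn

class StarBlock(nn.Module):
    def __init__(self, dim, expand=4):
        super().__init__()
        self.dw = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)  # local mixing
        self.f1 = nn.Conv2d(dim, dim * expand, 1)
        self.f2 = nn.Conv2d(dim, dim * expand, 1)
        self.proj = nn.Conv2d(dim * expand, dim, 1)
        self.act = nn.ReLU6()

    def forward(self, x):
        h = self.dw(x)
        h = self.act(self.f1(h)) * self.f2(h)   # the star operation
        return x + self.proj(h)                 # residual connection

out = StarBlock(32)(torch.randn(1, 32, 56, 56))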
Our training code and pre-trained models are available online. + + + + Text2HOI: Text-guided 3D Motion Generation for Hand-Object Interaction + http://openaccess.thecvf.com//content/CVPR2024/papers/Cha_Text2HOI_Text-guided_3D_Motion_Generation_for_Hand-Object_Interaction_CVPR_2024_paper.pdf + This paper introduces the first text-guided work for generating the sequence of hand-object interaction in 3D. The main challenge arises from the lack of labeled data where existing ground-truth datasets are nowhere near generalizable in interaction type and object category which inhibits the modeling of diverse 3D hand-object interaction with the correct physical implication (e.g. contacts and semantics) from text prompts. To address this challenge we propose to decompose the interaction generation task into two subtasks: hand-object contact generation; and hand-object motion generation. For contact generation a VAE-based network takes as input a text and an object mesh and generates the probability of contacts between the surfaces of hands and the object during the interaction. The network learns a variety of local geometry structure of diverse objects that is independent of the objects' category and thus it is applicable to general objects. For motion generation a Transformer-based diffusion model utilizes this 3D contact map as a strong prior for generating physically plausible hand-object motion as a function of text prompts by learning from the augmented labeled dataset; where we annotate text labels from many existing 3D hand and object motion data. Finally we further introduce a hand refiner module that minimizes the distance between the object surface and hand joints to improve the temporal stability of the object-hand contacts and to suppress the penetration artifacts. In the experiments we demonstrate that our method can generate more realistic and diverse interactions compared to other baseline methods. We also show that our method is applicable to unseen objects. We will release our model and newly labeled data as a strong foundation for future research. Codes and data are available in: https://github.com/JunukCha/Text2HOI. + + + + MACE: Mass Concept Erasure in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_MACE_Mass_Concept_Erasure_in_Diffusion_Models_CVPR_2024_paper.pdf + The rapid expansion of large-scale text-to-image diffusion models has raised growing concerns regarding their potential misuse in creating harmful or misleading content. In this paper we introduce MACE a finetuning framework for the task of mass concept erasure. This task aims to prevent models from generating images that embody unwanted concepts when prompted. Existing concept erasure methods are typically restricted to handling fewer than five concepts simultaneously and struggle to find a balance between erasing concept synonyms (generality) and maintaining unrelated concepts (specificity). In contrast MACE differs by successfully scaling the erasure scope up to 100 concepts and by achieving an effective balance between generality and specificity. This is achieved by leveraging closed-form cross-attention refinement along with LoRA finetuning collectively eliminating the information of undesirable concepts. Furthermore MACE integrates multiple LoRAs without mutual interference. We conduct extensive evaluations of MACE against prior methods across four different tasks: object erasure celebrity erasure explicit content erasure and artistic style erasure. 
Our results reveal that MACE surpasses prior methods in all evaluated tasks. Code is available at https://github.com/Shilin-LU/MACE. + + + + PeLK: Parameter-efficient Large Kernel ConvNets with Peripheral Convolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_PeLK_Parameter-efficient_Large_Kernel_ConvNets_with_Peripheral_Convolution_CVPR_2024_paper.pdf + Recently, some large kernel convnets strike back with appealing performance and efficiency. However, given the square complexity of convolution, scaling up kernels can bring about an enormous amount of parameters, and the proliferated parameters can induce severe optimization problems. Due to these issues, current CNNs compromise to scale up to 51x51 in the form of stripe convolution (i.e. 51x5+5x51) and start to saturate as the kernel size continues growing. In this paper, we delve into addressing these vital issues and explore whether we can continue scaling up kernels for more performance gains. Inspired by human vision, we propose a human-like peripheral convolution that efficiently reduces over 90% of the parameter count of dense grid convolution through parameter sharing and manages to scale up the kernel size to extremely large. Our peripheral convolution behaves highly similarly to human vision, reducing the complexity of convolution from O(K^2) to O(logK) without backfiring performance. Built on this, we propose Parameter-efficient Large Kernel Network (PeLK). Our PeLK outperforms modern vision Transformers and ConvNet architectures like Swin, ConvNeXt, RepLKNet and SLaK on various vision tasks, including ImageNet classification, semantic segmentation on ADE20K and object detection on MS COCO. For the first time, we successfully scale up the kernel size of CNNs to an unprecedented 101x101 and demonstrate consistent improvements. + + + + AiOS: All-in-One-Stage Expressive Human Pose and Shape Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_AiOS_All-in-One-Stage_Expressive_Human_Pose_and_Shape_Estimation_CVPR_2024_paper.pdf + Expressive human pose and shape estimation (a.k.a. 3D whole-body mesh recovery) involves the human body, hand and expression estimation. Most existing methods have tackled this task in a two-stage manner, first detecting the human body part with an off-the-shelf detection model and then inferring the different human body parts individually. Despite the impressive results achieved, these methods suffer from 1) loss of valuable contextual information via cropping, 2) introducing distractions, and 3) lacking inter-association among different persons and body parts, inevitably causing performance degradation, especially for crowded scenes. To address these issues, we introduce a novel all-in-one-stage framework, AiOS, for multiple expressive human pose and shape recovery without an additional human detection step. Specifically, our method is built upon DETR, which treats the multi-person whole-body mesh recovery task as a progressive set prediction problem with various sequential detection. We devise the decoder tokens and extend them to our task. Specifically, we first employ a human token to probe a human location in the image and encode global features for each instance, which provides a coarse location for the later transformer block. Then we introduce a joint-related token to probe the human joint in the image and encode a fine-grained local feature, which collaborates with the global feature to regress the whole-body mesh.
This straightforward but effective model outperforms previous state-of-the-art methods by a 9% reduction in NMVE on AGORA, a 30% reduction in PVE on EHF, a 10% reduction in PVE on ARCTIC and a 3% reduction in PVE on EgoBody. + + + + Design2Cloth: 3D Cloth Generation from 2D Masks + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Design2Cloth_3D_Cloth_Generation_from_2D_Masks_CVPR_2024_paper.pdf + In recent years, there has been a significant shift in the field of digital avatar research towards modeling, animating and reconstructing clothed human representations, as a key step towards creating realistic avatars. However, current 3D cloth generation methods are garment specific or trained completely on synthetic data, hence lacking fine details and realism. In this work, we make a step towards automatic realistic garment design and propose Design2Cloth, a high fidelity 3D generative model trained on a real world dataset from more than 2000 subject scans. To provide vital contribution to the fashion industry, we developed a user-friendly adversarial model capable of generating diverse and detailed clothes simply by drawing a 2D cloth mask. Under a series of both qualitative and quantitative experiments, we showcase that Design2Cloth outperforms current state-of-the-art cloth generative models by a large margin. In addition to the generative properties of our network, we showcase that the proposed method can be used to achieve high quality reconstructions from single in-the-wild images and 3D scans. Dataset, code and pre-trained model will become publicly available. + + + + Amodal Completion via Progressive Mixed Context Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Amodal_Completion_via_Progressive_Mixed_Context_Diffusion_CVPR_2024_paper.pdf + Our brain can effortlessly recognize objects even when partially hidden from view. Seeing the visible of the hidden is called amodal completion; however, this task remains a challenge for generative AI despite rapid progress. We propose to sidestep many of the difficulties of existing approaches, which typically involve a two-step process of predicting amodal masks and then generating pixels. Our method involves thinking outside the box, literally! We go outside the object bounding box to use its context to guide a pre-trained diffusion inpainting model, and then progressively grow the occluded object and trim the extra background. We overcome two technical challenges: 1) how to be free of unwanted co-occurrence bias, which tends to regenerate similar occluders, and 2) how to judge if an amodal completion has succeeded. Our amodal completion method exhibits improved photorealistic completion results compared to existing approaches in numerous successful completion cases. And the best part? It doesn't require any special training or fine-tuning of models. Project page and code: https://k8xu.github.io/amodal/ + + + + Diffusion 3D Features (Diff3F): Decorating Untextured Shapes with Distilled Semantic Features + http://openaccess.thecvf.com//content/CVPR2024/papers/Dutt_Diffusion_3D_Features_Diff3F_Decorating_Untextured_Shapes_with_Distilled_Semantic_CVPR_2024_paper.pdf + We present Diff3F as a simple, robust and class-agnostic feature descriptor that can be computed for untextured input shapes (meshes or point clouds). Our method distills diffusion features from image foundational models onto input shapes. Specifically, we use the input shapes to produce depth and normal maps as guidance for conditional image synthesis.
In the process we produce (diffusion) features in 2D that we subsequently lift and aggregate on the original surface. Our key observation is that even if the conditional image generations obtained from multi-view rendering of the input shapes are inconsistent the associated image features are robust and hence can be directly aggregated across views. This produces semantic features on the input shapes without requiring additional data or training. We perform extensive experiments on multiple benchmarks (SHREC'19 SHREC'20 FAUST and TOSCA) and demonstrate that our features being semantic instead of geometric produce reliable correspondence across both isometric and non-isometrically related shape families. Code is available via the project webpage at https://diff3f.github.io/ + + + + Cinematic Behavior Transfer via NeRF-based Differentiable Filming + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Cinematic_Behavior_Transfer_via_NeRF-based_Differentiable_Filming_CVPR_2024_paper.pdf + In the evolving landscape of digital media and video production the precise manipulation and reproduction of visual elements like camera movements and character actions are highly desired. Existing SLAM methods face limitations in dynamic scenes and human pose estimation often focuses on 2D projections neglecting 3D statuses. To address these issues we first introduce a reverse filming behavior estimation technique. It optimizes camera trajectories by leveraging NeRF as a differentiable renderer and refining SMPL tracks. We then introduce a cinematic transfer pipeline that is able to transfer various shot types to a new 2D video or a 3D virtual environment. The incorporation of 3D engine workflow enables superior rendering and control abilities which also achieves a higher rating in the user study. + + + + Text-Driven Image Editing via Learnable Regions + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_Text-Driven_Image_Editing_via_Learnable_Regions_CVPR_2024_paper.pdf + Language has emerged as a natural interface for image editing. In this paper we introduce a method for region-based image editing driven by textual prompts without the need for user-provided masks or sketches. Specifically our approach leverages an existing pre-trained text-to-image model and introduces a bounding box generator to identify the editing regions that are aligned with the textual prompts. We show that this simple approach enables flexible editing that is compatible with current image generation models and is able to handle complex prompts featuring multiple objects complex sentences or lengthy paragraphs. We conduct an extensive user study to compare our method against state-of-the-art methods. The experiments demonstrate the competitive performance of our method in manipulating images with high fidelity and realism that correspond to the provided language descriptions. Our project webpage can be found at: https://yuanzelin.me/LearnableRegions_page. + + + + Relation Rectification in Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Relation_Rectification_in_Diffusion_Model_CVPR_2024_paper.pdf + Despite their exceptional generative abilities large T2I diffusion models much like skilled but careless artists often struggle with accurately depicting visual relationships between objects. This issue as we uncover through careful analysis arises from a misaligned text encoder that struggles to interpret specific relationships and differentiate the logical order of associated objects. 
To resolve this we introduce a novel task termed Relation Rectification aiming to refine the model to accurately represent a given relationship it initially fails to generate. To address this we propose an innovative solution utilizing a Heterogeneous Graph Convolutional Network (HGCN). It models the directional relationships between relation terms and corresponding objects within the input prompts. Specifically we optimize the HGCN on a pair of prompts with identical relational words but reversed object orders supplemented by a few reference images. The lightweight HGCN adjusts the text embeddings generated by the text encoder ensuring accurate reflection of the textual relation in the embedding space. Crucially our method retains the parameters of the text encoder and diffusion model preserving the model's robust performance on unrelated descriptions. We validated our approach on a newly curated dataset of diverse relational data demonstrating both quantitative and qualitative enhancements in generating images with precise visual relations. Project page: https://wuyinwei-hah.github.io/rrnet.github.io/ . + + + + Mocap Everyone Everywhere: Lightweight Motion Capture With Smartwatches and a Head-Mounted Camera + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Mocap_Everyone_Everywhere_Lightweight_Motion_Capture_With_Smartwatches_and_a_CVPR_2024_paper.pdf + We present a lightweight and affordable motion capture method based on two smartwatches and a head-mounted camera. In contrast to the existing approaches that use six or more expert-level IMU devices our approach is much more cost-effective and convenient. Our method can make wearable motion capture accessible to everyone everywhere enabling 3D full-body motion capture in diverse environments. As a key idea to overcome the extreme sparsity and ambiguities of sensor inputs with different modalities we integrate 6D head poses obtained from the head-mounted cameras for motion estimation. To enable capture in expansive indoor and outdoor scenes we propose an algorithm to track and update floor level changes to define head poses coupled with a multi-stage Transformer-based regression module. We also introduce novel strategies leveraging visual cues of egocentric images to further enhance the motion capture quality while reducing ambiguities. We demonstrate the performance of our method on various challenging scenarios including complex outdoor environments and everyday motions including object interactions and social interactions among multiple individuals. + + + + Fast ODE-based Sampling for Diffusion Models in Around 5 Steps + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Fast_ODE-based_Sampling_for_Diffusion_Models_in_Around_5_Steps_CVPR_2024_paper.pdf + Sampling from diffusion models can be treated as solving the corresponding ordinary differential equations (ODEs) with the aim of obtaining an accurate solution with as few number of function evaluations (NFE) as possible. Recently various fast samplers utilizing higher-order ODE solvers have emerged and achieved better performance than the initial first-order one. However these numerical methods inherently result in certain approximation errors which significantly degrades sample quality with extremely small NFE (e.g. around 5). 
In contrast based on the geometric observation that each sampling trajectory almost lies in a two-dimensional subspace embedded in the ambient space we propose Approximate MEan-Direction Solver (AMED-Solver) that eliminates truncation errors by directly learning the mean direction for fast diffusion sampling. Besides our method can be easily used as a plugin to further improve existing ODE-based samplers. Extensive experiments on image synthesis with the resolution ranging from 32 to 512 demonstrate the effectiveness of our method. With only 5 NFE we achieve 6.61 FID on CIFAR-10 10.74 FID on ImageNet 64x64 and 13.20 FID on LSUN Bedroom. Our code is available at https://github.com/zju-pi/diff-sampler. + + + + CLiC: Concept Learning in Context + http://openaccess.thecvf.com//content/CVPR2024/papers/Safaee_CLiC_Concept_Learning_in_Context_CVPR_2024_paper.pdf + This paper addresses the challenge of learning a local visual pattern of an object from one image and generating images depicting objects with that pattern. Learning a localized concept and placing it on an object in a target image is a nontrivial task as the objects may have different orientations and shapes. Our approach builds upon recent advancements in visual concept learning. It involves acquiring a visual concept (e.g. an ornament) from a source image and subsequently applying it to an object (e.g. a chair) in a target image. Our key idea is to perform in-context concept learning acquiring the local visual concept within the broader context of the objects they belong to. To localize the concept learning we employ soft masks that contain both the concept within the mask and the surrounding image area. We demonstrate our approach through object generation within an image showcasing plausible embedding of in-context learned concepts. We also introduce methods for directing acquired concepts to specific locations within target images employing cross-attention mechanisms and establishing correspondences between source and target objects. The effectiveness of our method is demonstrated through quantitative and qualitative experiments along with comparisons against baseline techniques. + + + + CAD-SIGNet: CAD Language Inference from Point Clouds using Layer-wise Sketch Instance Guided Attention + http://openaccess.thecvf.com//content/CVPR2024/papers/Khan_CAD-SIGNet_CAD_Language_Inference_from_Point_Clouds_using_Layer-wise_Sketch_CVPR_2024_paper.pdf + Reverse engineering in the realm of Computer-Aided Design (CAD) has been a longstanding aspiration though not yet entirely realized. Its primary aim is to uncover the CAD process behind a physical object given its 3D scan. We propose CAD-SIGNet an end-to-end trainable and auto-regressive architecture to recover the design history of a CAD model represented as a sequence of sketch- and-extrusion from an input point cloud. Our model learns CAD visual-language representations by layer-wise cross-attention between point cloud and CAD language embedding. In particular a new Sketch instance Guided Attention (SGA) module is proposed in order to reconstruct the fine- grained details of the sketches. Thanks to its auto-regressive nature CAD-SIGNet not only reconstructs a unique full design history of the corresponding CAD model given an in- put point cloud but also provides multiple plausible design choices. This allows for an interactive reverse engineering scenario by providing designers with multiple next step choices along with the design process. 
Extensive experiments on publicly available CAD datasets showcase the effectiveness of our approach against existing baseline models in two settings namely full design history recovery and conditional auto-completion from point clouds. + + + + CLIB-FIQA: Face Image Quality Assessment with Confidence Calibration + http://openaccess.thecvf.com//content/CVPR2024/papers/Ou_CLIB-FIQA_Face_Image_Quality_Assessment_with_Confidence_Calibration_CVPR_2024_paper.pdf + Face Image Quality Assessment (FIQA) is pivotal for guaranteeing the accuracy of face recognition in unconstrained environments. Recent progress in deep quality-fitting-based methods that train models to align with quality anchors has shown promise in FIQA. However these methods heavily depend on a recognition model to yield quality anchors and indiscriminately treat the confidence of inaccurate anchors as equivalent to that of accurate ones during the FIQA model training leading to a fitting bottleneck issue. This paper seeks a solution by putting forward the Confidence-Calibrated Face Image Quality Assessment (CLIB-FIQA) approach underpinned by the synergistic interplay between the quality anchors and objective quality factors such as blur pose expression occlusion and illumination. Specifically we devise a joint learning framework built upon the vision-language alignment model which leverages the joint distribution with multiple quality factors to facilitate the quality fitting of the FIQA model. Furthermore to alleviate the issue of the model placing excessive trust in inaccurate quality anchors we propose a confidence calibration method to correct the quality distribution by exploiting to the fullest extent of these objective quality factors characterized as the merged-factor distribution during training. Experimental results on eight datasets reveal the superior performance of the proposed method. + + + + Predicated Diffusion: Predicate Logic-Based Attention Guidance for Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Sueyoshi_Predicated_Diffusion_Predicate_Logic-Based_Attention_Guidance_for_Text-to-Image_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion models have achieved remarkable success in generating high-quality diverse and creative images. However in text-based image generation they often struggle to accurately capture the intended meaning of the text. For instance a specified object might not be generated or an adjective might incorrectly alter unintended objects. Moreover we found that relationships indicating possession between objects are frequently overlooked. Despite the diversity of users' intentions in text existing methods often focus on only some aspects of these intentions. In this paper we propose Predicated Diffusion a unified framework designed to more effectively express users' intentions. It represents the intended meaning as propositions using predicate logic and treats the pixels in attention maps as fuzzy predicates. This approach provides a differentiable loss function that offers guidance for the image generation process to better fulfill the propositions. Comparative evaluations with existing methods demonstrated that Predicated Diffusion excels in generating images faithful to various text prompts while maintaining high image quality as validated by human evaluators and pretrained image-text models. 
+ + + + MoML: Online Meta Adaptation for 3D Human Motion Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_MoML_Online_Meta_Adaptation_for_3D_Human_Motion_Prediction_CVPR_2024_paper.pdf + In the academic field the research on human motion prediction tasks mainly focuses on exploiting the observed information to forecast human movements accurately in the near future horizon. However a significant gap appears when it comes to the application field as current models are all trained offline with fixed parameters that are inherently suboptimal to handle the complex yet ever-changing nature of human behaviors. To bridge this gap in this paper we introduce the task of online meta adaptation for human motion prediction based on the insight that finding "smart weights" capable of swift adjustments to suit different motion contexts along the time is a key to improving predictive accuracy. We propose MoML which ingeniously borrows the bilevel optimization spirit of model-agnostic meta-learning to transform previous predictive mistakes into strong inductive biases to guide online adaptation. This is achieved by our MoAdapter blocks that can learn error information by facilitating efficient adaptation via a few gradient steps which fine-tunes our meta-learned "smart" initialization produced by the generic predictor. Considering real-time requirements in practice we further propose Fast-MoML a more efficient variant of MoML that features a closed-form solution instead of conventional gradient update. Experimental results show that our approach can effectively bring many existing offline motion prediction models online and improves their predictive accuracy. + + + + CAT-DM: Controllable Accelerated Virtual Try-on with Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_CAT-DM_Controllable_Accelerated_Virtual_Try-on_with_Diffusion_Model_CVPR_2024_paper.pdf + Generative Adversarial Networks (GANs) dominate the research field in image-based virtual try-on but have not resolved problems such as unnatural deformation of garments and the blurry generation quality. While the generative quality of diffusion models is impressive achieving controllability poses a significant challenge when applying it to virtual try-on and multiple denoising iterations limit its potential for real-time applications. In this paper we propose Controllable Accelerated virtual Try-on with Diffusion Model (CAT-DM). To enhance the controllability a basic diffusion-based virtual try-on network is designed which utilizes ControlNet to introduce additional control conditions and improves the feature extraction of garment images. In terms of acceleration CAT-DM initiates a reverse denoising process with an implicit distribution generated by a pre-trained GAN-based model. Compared with previous try-on methods based on diffusion models CAT-DM not only retains the pattern and texture details of the in-shop garment but also reduces the sampling steps without compromising generation quality. Extensive experiments demonstrate the superiority of CAT-DM against both GAN-based and diffusion-based methods in producing more realistic images and accurately reproducing garment patterns. 
+ + + + Synergistic Global-space Camera and Human Reconstruction from Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Synergistic_Global-space_Camera_and_Human_Reconstruction_from_Videos_CVPR_2024_paper.pdf + Remarkable strides have been made in reconstructing static scenes or human bodies from monocular videos. Yet the two problems have largely been approached independently without much synergy. Most visual SLAM methods can only reconstruct camera trajectories and scene structures up to scale while most HMR methods reconstruct human meshes in metric scale but fall short in reasoning with cameras and scenes. This work introduces Synergistic Camera and Human Reconstruction (SynCHMR) to marry the best of both worlds. Specifically we design Human-aware Metric SLAM to reconstruct metric-scale camera poses and scene point clouds using camera-frame HMR as a strong prior addressing depth scale and dynamic ambiguities. Conditioning on the dense scene recovered we further learn a Scene-aware SMPL Denoiser to enhance world-frame HMR by incorporating spatiotemporal coherency and dynamic scene constraints. Together they lead to consistent reconstructions of camera trajectories human meshes and dense scene point clouds in a common world frame. + + + + 3D Face Reconstruction with the Geometric Guidance of Facial Part Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_3D_Face_Reconstruction_with_the_Geometric_Guidance_of_Facial_Part_CVPR_2024_paper.pdf + 3D Morphable Models (3DMMs) provide promising 3D face reconstructions in various applications. However existing methods struggle to reconstruct faces with extreme expressions due to deficiencies in supervisory signals such as sparse or inaccurate landmarks. Segmentation information contains effective geometric contexts for face reconstruction. Certain attempts intuitively depend on differentiable renderers to compare the rendered silhouettes of reconstruction with segmentation which is prone to issues like local optima and gradient instability. In this paper we fully utilize the facial part segmentation geometry by introducing Part Re-projection Distance Loss (PRDL). Specifically PRDL transforms facial part segmentation into 2D points and re-projects the reconstruction onto the image plane. Subsequently by introducing grid anchors and computing different statistical distances from these anchors to the point sets PRDL establishes geometry descriptors to optimize the distribution of the point sets for face reconstruction. PRDL exhibits a clear gradient compared to the renderer-based methods and presents state-of-the-art reconstruction performance in extensive quantitative and qualitative experiments. Our project is available at https://github.com/wang-zidu/3DDFA-V3. + + + + FreeU: Free Lunch in Diffusion U-Net + http://openaccess.thecvf.com//content/CVPR2024/papers/Si_FreeU_Free_Lunch_in_Diffusion_U-Net_CVPR_2024_paper.pdf + In this paper we uncover the untapped potential of diffusion U-Net which serves as a "free lunch" that substantially improves the generation quality on the fly. We initially investigate the key contributions of the U-Net architecture to the denoising process and identify that its main backbone primarily contributes to denoising whereas its skip connections mainly introduce high-frequency features into the decoder module causing the potential neglect of crucial functions intrinsic to the backbone network. 
Capitalizing on this discovery we propose a simple yet effective method termed "FreeU" which enhances generation quality without additional training or finetuning. Our key insight is to strategically re-weight the contributions sourced from the U-Net's skip connections and backbone feature maps to leverage the strengths of both components of the U-Net architecture. Promising results on image and video generation tasks demonstrate that our FreeU can be readily integrated to existing diffusion models e.g. Stable Diffusion DreamBooth and ControlNet to improve the generation quality with only a few lines of code. All you need is to adjust two scaling factors during inference. + + + + ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Hollein_ViewDiff_3D-Consistent_Image_Generation_with_Text-to-Image_Models_CVPR_2024_paper.pdf + 3D asset generation is getting massive amounts of attention inspired by the recent success on text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data which often results in non-photorealistic 3D objects without backgrounds. In this paper we present a method that leverages pretrained text-to-image models as a prior and learn to generate multi-view images in a single denoising process from real-world data. Concretely we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover we design an autoregressive generation that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capabilities to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to the existing methods the results generated by our method are consistent and have favorable visual quality (-30% FID -37% KID). + + + + Diffusion Models Without Attention + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_Diffusion_Models_Without_Attention_CVPR_2024_paper.pdf + In recent advancements in high-fidelity image generation Denoising Diffusion Probabilistic Models (DDPMs) have emerged as a key player. However their application at high resolutions presents significant computational challenges. Current methods such as patchifying expedite processes in UNet and Transformer architectures but at the expense of representational capacity. Addressing this we introduce the Diffusion State Space Model (DiffuSSM) an architecture that supplants attention mechanisms with a more scalable state space model backbone. This approach effectively handles higher resolutions without resorting to global compression thus preserving detailed image representation throughout the diffusion process. Our focus on FLOP-efficient architectures in diffusion training marks a significant step forward. Comprehensive evaluations on both ImageNet and LSUN datasets at two resolutions demonstrate that DiffuSSMs are on par or even outperform existing diffusion models with attention modules in FID and Inception Score metrics while significantly reducing total FLOP usage. 
+ + + + Emotional Speech-driven 3D Body Animation via Disentangled Latent Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Chhatre_Emotional_Speech-driven_3D_Body_Animation_via_Disentangled_Latent_Diffusion_CVPR_2024_paper.pdf + Existing methods for synthesizing 3D human gestures from speech have shown promising results but they do not explicitly model the impact of emotions on the generated gestures. Instead these methods directly output animations from speech without control over the expressed emotion. To address this limitation we present AMUSE an emotional speech-driven body animation model based on latent diffusion. Our observation is that content (i.e. gestures related to speech rhythm and word utterances) emotion and personal style are separable. To account for this AMUSE maps the driving audio to three disentangled latent vectors: one for content one for emotion and one for personal style. A latent diffusion model trained to generate gesture motion sequences is then conditioned on these latent vectors. Once trained AMUSE synthesizes 3D human gestures directly from speech with control over the expressed emotions and style by combining the content from the driving speech with the emotion and style of another speech sequence. Randomly sampling the noise of the diffusion model further generates variations of the gesture with the same emotional expressivity. Qualitative quantitative and perceptual evaluations demonstrate that AMUSE outputs realistic gesture sequences. Compared to the state of the art the generated gestures are better synchronized with the speech content and better represent the emotion expressed by the input speech. Our code is available at amuse.is.tue.mpg.de. + + + + Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Horita_Retrieval-Augmented_Layout_Transformer_for_Content-Aware_Layout_Generation_CVPR_2024_paper.pdf + Content-aware graphic layout generation aims to automatically arrange visual elements along with a given content such as an e-commerce product image. In this paper we argue that the current layout generation approaches suffer from the limited training data for the high-dimensional layout structure. We show that a simple retrieval augmentation can significantly improve the generation quality. Our model which is named Retrieval-Augmented Layout Transformer (RALF) retrieves nearest neighbor layout examples based on an input image and feeds these results into an autoregressive generator. Our model can apply retrieval augmentation to various controllable generation tasks and yield high-quality layouts within a unified architecture. Our extensive experiments show that RALF successfully generates content-aware layouts in both constrained and unconstrained settings and significantly outperforms the baselines. + + + + InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_InstantBooth_Personalized_Text-to-Image_Generation_without_Test-Time_Finetuning_CVPR_2024_paper.pdf + Recent advances in personalized image generation have enabled pre-trained text-to-image models to learn new concepts from specific image sets. However these methods often necessitate extensive test-time finetuning for each new concept leading to inefficiencies in both time and scalability. 
To address this challenge we introduce InstantBooth an innovative approach leveraging existing text-to-image models for instantaneous text-guided image personalization eliminating the need for test-time finetuning. This efficiency is achieved through two primary innovations. Firstly we utilize an image encoder that transforms input images into a global embedding to grasp the general concept. Secondly we integrate new adapter layers into the pre-trained model enhancing its ability to capture intricate identity details while maintaining language coherence. Significantly our model is trained exclusively on text-image pairs without reliance on concept-specific paired images. When benchmarked against existing finetuning-based personalization techniques like DreamBooth and Textual-Inversion InstantBooth not only shows comparable proficiency in aligning language with image maintaining image quality and preserving identity but also boasts a 100-fold increase in processing speed. + + + + SD2Event:Self-supervised Learning of Dynamic Detectors and Contextual Descriptors for Event Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_SD2EventSelf-supervised_Learning_of_Dynamic_Detectors_and_Contextual_Descriptors_for_Event_CVPR_2024_paper.pdf + Event cameras offer many advantages over traditional frame-based cameras such as high dynamic range and low latency. Therefore event cameras are widely applied in diverse computer vision applications where event-based keypoint detection is a fundamental task. However achieving robust event-based keypoint detection remains challenging because the ground truth of event keypoints is difficult to obtain descriptors extracted by CNN usually lack discriminative ability in the presence of intense noise and fixed keypoint detectors are limited in detecting varied keypoint patterns. To address these challenges a novel event-based keypoint detection method is proposed by learning dynamic detectors and contextual descriptors in a self-supervised manner (SD2Event) including a contextual feature descriptor learning (CFDL) module and a dynamic keypoint detector learning (DKDL) module. The proposed SD2Event enjoys several merits. First the proposed CFDL module can model long-range contexts efficiently and effectively. Second the DKDL module generates dynamic keypoint detectors which can detect keypoints with diverse patterns across various event streams. Third the proposed self-supervised signals can guide the model's adaptation to event data. Extensive experimental results on three challenging benchmarks show that our proposed method significantly outperforms stateof-the-art event-based keypoint detection methods. + + + + PaReNeRF: Toward Fast Large-scale Dynamic NeRF with Patch-based Reference + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_PaReNeRF_Toward_Fast_Large-scale_Dynamic_NeRF_with_Patch-based_Reference_CVPR_2024_paper.pdf + With photo-realistic image generation Neural Radiance Field (NeRF) is widely used for large-scale dynamic scene reconstruction as autonomous driving simulator. However large-scale scene reconstruction still suffers from extremely long training time and rendering time. Low-resolution (LR) rendering combined with upsampling can alleviate this problem but it degrades image quality. In this paper we design a lightweight reference decoder which exploits prior information from known views to improve image reconstruction quality of new views. 
In addition to speed up prior information search we propose an optical flow and structural similarity based prior information search method. Results on KITTI and VKITTI2 datasets show that our method significantly outperforms the baseline method in terms of training speed rendering speed and rendering quality. + + + + Affine Equivariant Networks Based on Differential Invariants + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Affine_Equivariant_Networks_Based_on_Differential_Invariants_CVPR_2024_paper.pdf + Convolutional neural networks benefit from translation equivariance achieving tremendous success. Equivariant networks further extend this property to other transformation groups. However most existing methods require discretization or sampling of groups leading to increased model sizes for larger groups such as the affine group. In this paper we build affine equivariant networks based on differential invariants from the viewpoint of symmetric PDEs without discretizing or sampling the group. To address the division-by-zero issue arising from fractional differential invariants of the affine group we construct a new kind of affine invariants by normalizing polynomial relative differential invariants to replace classical differential invariants. For further flexibility we design an equivariant layer which can be directly integrated into convolutional networks of various architectures. Moreover our framework for the affine group is also applicable to its continuous subgroups. We implement equivariant networks for the scale group the rotation-scale group and the affine group. Numerical experiments demonstrate the outstanding performance of our framework across classification tasks involving transformations of these groups. Remarkably under the out-of-distribution setting our model achieves a 3.37% improvement in accuracy over the main counterpart affConv on the affNIST dataset. + + + + Selectively Informative Description can Reduce Undesired Embedding Entanglements in Text-to-Image Personalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Selectively_Informative_Description_can_Reduce_Undesired_Embedding_Entanglements_in_Text-to-Image_CVPR_2024_paper.pdf + In text-to-image personalization a timely and crucial challenge is the tendency of generated images overfitting to the biases present in the reference images. We initiate our study with a comprehensive categorization of the biases into background nearby-object tied-object substance (in style re-contextualization) and pose biases. These biases manifest in the generated images due to their entanglement into the subject embedding. This undesired embedding entanglement not only results in the reflection of biases from the reference images into the generated images but also notably diminishes the alignment of the generated images with the given generation prompt. To address this challenge we propose SID (Selectively Informative Description) a text description strategy that deviates from the prevalent approach of only characterizing the subject's class identification. SID is generated utilizing multimodal GPT-4 and can be seamlessly integrated into optimization-based models. We present comprehensive experimental results along with analyses of cross-attention maps subject-alignment non-subject-disentanglement and text-alignment. 
+ + + + Smooth Diffusion: Crafting Smooth Latent Spaces in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_Smooth_Diffusion_Crafting_Smooth_Latent_Spaces_in_Diffusion_Models_CVPR_2024_paper.pdf + Recently diffusion models have made remarkable progress in text-to-image (T2I) generation synthesizing images with high fidelity and diverse contents. Despite this advancement latent space smoothness within diffusion models remains largely unexplored. Smooth latent spaces ensure that a perturbation on an input latent corresponds to a steady change in the output image. This property proves beneficial in downstream tasks including image interpolation inversion and editing. In this work we expose the non-smoothness of diffusion latent spaces by observing noticeable visual fluctuations resulting from minor latent variations. To tackle this issue we propose Smooth Diffusion a new category of diffusion models that can be simultaneously high-performing and smooth. Specifically we introduce Step-wise Variation Regularization to enforce the proportion between the variations of an arbitrary input latent and that of the output image is a constant at any diffusion training step. In addition we devise an interpolation standard deviation (ISTD) metric to effectively assess the latent space smoothness of a diffusion model. Extensive quantitative and qualitative experiments demonstrate that Smooth Diffusion stands out as a more desirable solution not only in T2I generation but also across various downstream tasks. Smooth Diffusion is implemented as a plug-and-play Smooth-LoRA to work with various community models. Code is available at https://github.com/SHI-Labs/Smooth-Diffusion. + + + + FlowIE: Efficient Image Enhancement via Rectified Flow + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_FlowIE_Efficient_Image_Enhancement_via_Rectified_Flow_CVPR_2024_paper.pdf + Image enhancement holds extensive applications in real-world scenarios due to complex environments and limitations of imaging devices. Conventional methods are often constrained by their tailored models resulting in diminished robustness when confronted with challenging degradation conditions. In response we propose FlowIE a simple yet highly effective flow-based image enhancement framework that estimates straight-line paths from an elementary distribution to high-quality images. Unlike previous diffusion-based methods that suffer from long-time inference FlowIE constructs a linear many-to-one transport mapping via conditioned rectified flow. The rectification straightens the trajectories of probability transfer accelerating inference by an order of magnitude. This design enables our FlowIE to fully exploit rich knowledge in the pre-trained diffusion model rendering it well-suited for various real-world applications. Moreover we devise a faster inference algorithm inspired by Lagrange's Mean Value Theorem harnessing midpoint tangent direction to optimize path estimation ultimately yielding visually superior results. Thanks to these designs our FlowIE adeptly manages a diverse range of enhancement tasks within a concise sequence of fewer than 5 steps. Our contributions are rigorously validated through comprehensive experiments on synthetic and real-world datasets unveiling the compelling efficacy and efficiency of our proposed FlowIE. 
+ + + + Improving Training Efficiency of Diffusion Models via Multi-Stage Framework and Tailored Multi-Decoder Architecture + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Improving_Training_Efficiency_of_Diffusion_Models_via_Multi-Stage_Framework_and_CVPR_2024_paper.pdf + Diffusion models emerging as powerful deep generative tools excel in various applications. They operate through a two-steps process: introducing noise into training samples and then employing a model to convert random noise into new samples (e.g. images). However their remarkable generative performance is hindered by slow training and sampling. This is due to the necessity of tracking extensive forward and reverse diffusion trajectories and employing a large model with numerous parameters across multiple timesteps (i.e. noise levels). To tackle these challenges we present a multi-stage framework inspired by our empirical findings. These observations indicate the advantages of employing distinct parameters tailored to each timestep while retaining universal parameters shared across all time steps. Our approach involves segmenting the time interval into multiple stages where we employ custom multi-decoder U-net architecture that blends time-dependent models with a universally shared encoder. Our framework enables the efficient distribution of computational resources and mitigates inter-stage interference which substantially improves training efficiency. Extensive numerical experiments affirm the effectiveness of our framework showcasing significant training and sampling efficiency enhancements on three state-of-the-art diffusion models including large-scale latent diffusion models. Furthermore our ablation studies illustrate the impact of two important components in our framework: (i) a novel timestep clustering algorithm for stage division and (ii) an innovative multi-decoder U-net architecture seamlessly integrating universal and customized hyperparameters. + + + + In-Context Matting + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_In-Context_Matting_CVPR_2024_paper.pdf + We introduce in-context matting a novel task setting of image matting. Given a reference image of a certain foreground and guided priors such as points scribbles and masks in-context matting enables automatic alpha estimation on a batch of target images of the same foreground category without additional auxiliary input. This setting marries good performance in auxiliary input-based matting and ease of use in automatic matting which finds a good trade-off between customization and automation. To overcome the key challenge of accurate foreground matching we introduce IconMatting an in-context matting model built upon a pre-trained text-to-image diffusion model. Conditioned on inter- and intra-similarity matching IconMatting can make full use of reference context to generate accurate target alpha mattes. To benchmark the task we also introduce a novel testing dataset ICM-57 covering 57 groups of real-world images. Quantitative and qualitative results on the ICM-57 testing set show that IconMatting rivals the accuracy of trimap-based matting while retaining the automation level akin to automatic matting. Code is available at https://github.com/tiny-smart/in-context-matting. 
+ + + + DemoCaricature: Democratising Caricature Generation with a Rough Sketch + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_DemoCaricature_Democratising_Caricature_Generation_with_a_Rough_Sketch_CVPR_2024_paper.pdf + In this paper we democratise caricature generation empowering individuals to effortlessly craft personalised caricatures with just a photo and a conceptual sketch. Our objective is to strike a delicate balance between abstraction and identity while preserving the creativity and subjectivity inherent in a sketch. To achieve this we present Explicit Rank-1 Model Editing alongside single-image personalisation selectively applying nuanced edits to cross-attention layers for a seamless merge of identity and style. Additionally we propose Random Mask Reconstruction to enhance robustness directing the model to focus on distinctive identity and style features. Crucially our aim is not to replace artists but to eliminate accessibility barriers allowing enthusiasts to engage in the artistry. + + + + CapHuman: Capture Your Moments in Parallel Universes + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_CapHuman_Capture_Your_Moments_in_Parallel_Universes_CVPR_2024_paper.pdf + We concentrate on a novel human-centric image synthesis task that is given only one reference facial photograph it is expected to generate specific individual images with diverse head positions poses facial expressions and illuminations in different contexts. To accomplish this goal we argue that our generative model should be capable of the following favorable characteristics: (1) a strong visual and semantic understanding of our world and human society for basic object and human image generation. (2) generalizable identity preservation ability. (3) flexible and fine-grained head control. Recently large pre-trained text-to-image diffusion models have shown remarkable results serving as a powerful generative foundation. As a basis we aim to unleash the above two capabilities of the pre-trained model. In this work we present a new framework named CapHuman. We embrace the "encode then learn to align" paradigm which enables generalizable identity preservation for new individuals without cumbersome tuning at inference. CapHuman encodes identity features and then learns to align them into the latent space. Moreover we introduce the 3D facial prior to equip our model with control over the human head in a flexible and 3D-consistent manner. Extensive qualitative and quantitative analyses demonstrate our CapHuman can produce well-identity-preserved photo-realistic and high-fidelity portraits with content-rich representations and various head renditions superior to established baselines. Code and checkpoint will be released at https://github.com/VamosC/CapHuman. + + + + SDPose: Tokenized Pose Estimation via Circulation-Guide Self-Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_SDPose_Tokenized_Pose_Estimation_via_Circulation-Guide_Self-Distillation_CVPR_2024_paper.pdf + Recently transformer-based methods have achieved state-of-the-art prediction quality on human pose estimation(HPE). Nonetheless most of these top-performing transformer-based models are too computation-consuming and storage-demanding to deploy on edge computing platforms. Those transformer-based models that require fewer resources are prone to under-fitting due to their smaller scale and thus perform notably worse than their larger counterparts. 
Given this conundrum we introduce SDPose a new self-distillation method for improving the performance of small transformer-based models. To mitigate the problem of under-fitting we design a transformer module named Multi-Cycled Transformer(MCT) based on multiple-cycled forwards to more fully exploit the potential of small model parameters. Further in order to prevent the additional inference compute-consuming brought by MCT we introduce a self-distillation scheme extracting the knowledge from the MCT module to a naive forward model. Specifically on the MSCOCO validation dataset SDPose-T obtains 69.7% mAP with 4.4M parameters and 1.8 GFLOPs. Furthermore SDPose-S-V2 obtains 73.5% mAP on the MSCOCO validation dataset with 6.2M parameters and 4.7 GFLOPs achieving a new state-of-the-art among predominant tiny neural network methods. + + + + Authentic Hand Avatar from a Phone Scan via Universal Hand Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Moon_Authentic_Hand_Avatar_from_a_Phone_Scan_via_Universal_Hand_CVPR_2024_paper.pdf + The authentic 3D hand avatar with every identifiable information such as hand shapes and textures is necessary for immersive experiences in AR/VR. In this paper we present a universal hand model (UHM) which 1) can universally represent high-fidelity 3D hand meshes of arbitrary identities (IDs) and 2) can be adapted to each person with a short phone scan for the authentic hand avatar. For effective universal hand modeling we perform tracking and modeling at the same time while previous 3D hand models perform them separately. The conventional separate pipeline suffers from the accumulated errors from the tracking stage which cannot be recovered in the modeling stage. On the other hand ours does not suffer from the accumulated errors while having a much more concise overall pipeline. We additionally introduce a novel image matching loss function to address a skin sliding during the tracking and modeling while existing works have not focused on it much. Finally using learned priors from our UHM we effectively adapt our UHM to each person's short phone scan for the authentic hand avatar. + + + + Open-World Semantic Segmentation Including Class Similarity + http://openaccess.thecvf.com//content/CVPR2024/papers/Sodano_Open-World_Semantic_Segmentation_Including_Class_Similarity_CVPR_2024_paper.pdf + Interpreting camera data is key for autonomously acting systems such as autonomous vehicles. Vision systems that operate in real-world environments must be able to understand their surroundings and need the ability to deal with novel situations. This paper tackles open-world semantic segmentation i.e. the variant of interpreting image data in which objects occur that have not been seen during training. We propose a novel approach that performs accurate closed-world semantic segmentation and at the same time can identify new categories without requiring any additional training data. Our approach additionally provides a similarity measure for every newly discovered class in an image to a known category which can be useful information in downstream tasks such as planning or mapping. Through extensive experiments we show that our model achieves state-of-the-art results on classes known from training data as well as for anomaly segmentation and can distinguish between different unknown classes. 
+ + + + Towards Memorization-Free Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Towards_Memorization-Free_Diffusion_Models_CVPR_2024_paper.pdf + Pretrained diffusion models and their outputs are widely accessible due to their exceptional capacity for synthesizing high-quality images and their open-source nature. The users, however, may face litigation risks owing to the models' tendency to memorize and regurgitate training data during inference. To address this, we introduce Anti-Memorization Guidance (AMG), a novel framework employing three targeted guidance strategies for the main causes of memorization: image and caption duplication, and highly specific user prompts. Consequently, AMG ensures memorization-free outputs while maintaining high image quality and text alignment, leveraging the synergy of its guidance methods, each indispensable in its own right. AMG also features an innovative automatic detection system for potential memorization during each step of the inference process, allowing selective application of guidance strategies while minimally interfering with the original sampling process to preserve output utility. We applied AMG to pretrained Denoising Diffusion Probabilistic Models (DDPM) and Stable Diffusion across various generation tasks. The results demonstrate that AMG is the first approach to successfully eradicate all instances of memorization with no or marginal impacts on image quality and text-alignment, as evidenced by FID and CLIP scores. + + + + IQ-VFI: Implicit Quadratic Motion Estimation for Video Frame Interpolation + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_IQ-VFI_Implicit_Quadratic_Motion_Estimation_for_Video_Frame_Interpolation_CVPR_2024_paper.pdf + Advanced video frame interpolation (VFI) algorithms approximate intermediate motions between two input frames to synthesize the intermediate frame. However, they struggle to handle complex scenarios with curvilinear motions since they overlook the latent acceleration information between the input frames. Moreover, the supervision of predicted motions is tricky because ground-truth motions are not available. To this end, we propose a novel framework for implicit quadratic video frame interpolation (IQ-VFI), which explores latent acceleration information and accurate intermediate motions via knowledge distillation. Specifically, the proposed IQ-VFI consists of an implicit acceleration estimation network (IANet) and a VFI backbone; the former fully leverages spatio-temporal information to explore latent acceleration priors between two input frames, which are then used to progressively modulate linear motions from the latter into quadratic motions in a coarse-to-fine manner. Furthermore, to encourage both components to distill more acceleration and motion cues oriented towards VFI, we propose a knowledge distillation strategy in which an implicit acceleration distillation loss and an implicit motion distillation loss are employed to adaptively guide latent acceleration priors and intermediate motion learning, respectively. Extensive experiments show that our proposed IQ-VFI can achieve state-of-the-art performances on various benchmark datasets. + + + + KeyPoint Relative Position Encoding for Face Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_KeyPoint_Relative_Position_Encoding_for_Face_Recognition_CVPR_2024_paper.pdf + In this paper, we address the challenge of making ViT models more robust to unseen affine transformations.
Such robustness becomes useful in various recognition tasks such as face recognition when image alignment failures occur. We propose a novel method called KP-RPE which leverages key points (e.g.facial landmarks) to make ViT more resilient to scale translation and pose variations. We begin with the observation that Relative Position Encoding (RPE) is a good way to bring affine transform generalization to ViTs. RPE however can only inject the model with prior knowledge that nearby pixels are more important than far pixels. Keypoint RPE (KP-RPE) is an extension of this principle where the significance of pixels is not solely dictated by their proximity but also by their relative positions to specific keypoints within the image. By anchoring the significance of pixels around keypoints the model can more effectively retain spatial relationships even when those relationships are disrupted by affine transformations. We show the merit of KP-RPE in face and gait recognition. The experimental results demonstrate the effectiveness in improving face recognition performance from low-quality images particularly where alignment is prone to failure. Code and pre-trained models are available. + + + + Hyper-MD: Mesh Denoising with Customized Parameters Aware of Noise Intensity and Geometric Characteristics + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Hyper-MD_Mesh_Denoising_with_Customized_Parameters_Aware_of_Noise_Intensity_CVPR_2024_paper.pdf + Mesh denoising (MD) is a critical task in geometry processing as meshes from scanning or AIGC techniques are susceptible to noise contamination. The challenge of MD lies in the diverse nature of mesh facets in terms of geometric characteristics and noise distributions. Despite recent advancements in deep learning-based MD methods existing MD networks typically neglect the consideration of geometric characteristics and noise distributions. In this paper we propose Hyper-MD a hyper-network-based approach that addresses this limitation by dynamically customizing denoising parameters for each facet based on its noise intensity and geometric characteristics. Specifically Hyper-MD is composed of a hyper-network and an MD network. For each noisy facet the hyper-network takes two angles as input to customize parameters for the MD network. These two angles are specially defined to reveal the noise intensity and geometric characteristics of the current facet respectively. The MD network receives a facet patch as input and outputs the denoised normal using the customized parameters. Experimental results on synthetic and real-scanned meshes demonstrate that Hyper-MD outperforms state-of-the-art mesh denoising methods. + + + + Beyond First-Order Tweedie: Solving Inverse Problems using Latent Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Rout_Beyond_First-Order_Tweedie_Solving_Inverse_Problems_using_Latent_Diffusion_CVPR_2024_paper.pdf + Sampling from the posterior distribution in latent diffusion models for inverse problems is computationally challenging. Existing methods often rely on Tweedie's first-order moments that tend to induce biased results. Second-order approximations are computationally prohibitive making standard reverse diffusion processes intractable for posterior sampling. This paper presents Second-order Tweedie sampler from Surrogate Loss (STSL) a novel sampler offering efficiency comparable to first-order Tweedie while enabling tractable reverse processes using second-order approximation. 
Theoretical results reveal that our approach, utilizing the trace of the Hessian with only O(1) compute, establishes a lower bound through a surrogate loss and enables a tractable reverse process. We show that STSL outperforms SoTA solvers PSLD and P2L by reducing neural function evaluations by 4X and 8X, respectively, while enhancing sampling quality on FFHQ, ImageNet and COCO benchmarks. Moreover, STSL extends to text-guided image editing and mitigates residual distortions in corrupted images. To the best of our knowledge, this is the first work to offer an efficient second-order approximation for solving inverse problems using latent diffusion and editing real-world images with corruptions. + + + + Rethinking the Objectives of Vector-Quantized Tokenizers for Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Gu_Rethinking_the_Objectives_of_Vector-Quantized_Tokenizers_for_Image_Synthesis_CVPR_2024_paper.pdf + Vector-Quantized (VQ-based) generative models usually consist of two basic components, i.e. VQ tokenizers and generative transformers. Prior research focuses on improving the reconstruction fidelity of VQ tokenizers but rarely examines how the improvement in reconstruction affects the generation ability of generative transformers. In this paper we find that improving the reconstruction fidelity of VQ tokenizers does not necessarily improve the generation. Instead, learning to compress semantic features within VQ tokenizers significantly improves generative transformers' ability to capture textures and structures. We thus highlight two competing objectives of VQ tokenizers for image synthesis: semantic compression and details preservation. Different from previous work that prioritizes better details preservation, we propose Semantic-Quantized GAN (SeQ-GAN) with two learning phases to balance the two objectives. In the first phase, we propose a semantic-enhanced perceptual loss for better semantic compression. In the second phase, we fix the encoder and codebook but finetune the decoder to achieve better details preservation. Our proposed SeQ-GAN significantly improves VQ-based generative models for both unconditional and conditional image generation. Specifically, SeQ-GAN achieves a Frechet Inception Distance (FID) of 6.25 and an Inception Score (IS) of 140.9 on 256x256 ImageNet generation, a remarkable improvement over VIT-VQGAN, which obtains 11.2 FID and 97.2 IS. + + + + Continuous Pose for Monocular Cameras in Neural Implicit Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Continuous_Pose_for_Monocular_Cameras_in_Neural_Implicit_Representation_CVPR_2024_paper.pdf + In this paper we showcase the effectiveness of optimizing monocular camera poses as a continuous function of time. The camera poses are represented using an implicit neural function which maps the given time to the corresponding camera pose. The mapped camera poses are then used for the downstream tasks where joint camera pose optimization is also required. While doing so, the network parameters - that implicitly represent camera poses - are optimized. We exploit the proposed method in four diverse experimental settings, namely (1) NeRF from noisy poses; (2) NeRF from asynchronous Events; (3) Visual Simultaneous Localization and Mapping (vSLAM); and (4) vSLAM with IMUs. In all four settings, the proposed method performs significantly better than the compared baselines and the state-of-the-art methods.
Additionally, under the assumption of continuous motion, we also realize that changes in pose may actually live in a manifold with fewer than 6 degrees of freedom (DOF). We call this low-DOF motion representation the intrinsic motion and use the approach in vSLAM settings, showing impressive camera tracking performance. + + + + D^4: Dataset Distillation via Disentangled Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Su_D4_Dataset_Distillation_via_Disentangled_Diffusion_Model_CVPR_2024_paper.pdf + Dataset distillation offers a lightweight synthetic dataset for fast network training with promising test accuracy. To imitate the performance of the original dataset, most approaches employ bi-level optimization, and the distillation space relies on the matching architecture. Nevertheless, these approaches either suffer significant computational costs on large-scale datasets or experience performance decline on cross-architectures. We advocate for designing an economical dataset distillation framework that is independent of the matching architectures. With empirical observations, we argue that constraining the consistency of the real and synthetic image spaces will enhance cross-architecture generalization. Motivated by this, we introduce Dataset Distillation via Disentangled Diffusion Model (D^4M), an efficient framework for dataset distillation. Compared to architecture-dependent methods, D^4M employs a latent diffusion model to guarantee consistency and incorporates label information into category prototypes. The distilled datasets are versatile, eliminating the need for repeated generation of distinct datasets for various architectures. Through comprehensive experiments, D^4M demonstrates superior performance and robust generalization, surpassing the SOTA methods across most aspects. + + + + 360DVD: Controllable Panorama Video Generation with 360-Degree Video Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_360DVD_Controllable_Panorama_Video_Generation_with_360-Degree_Video_Diffusion_Model_CVPR_2024_paper.pdf + Panorama video has recently attracted more interest in both research and applications, courtesy of its immersive experience. Due to the expensive cost of capturing 360-degree panoramic videos, generating desirable panorama videos from prompts is urgently required. Lately, emerging text-to-video (T2V) diffusion methods have demonstrated notable effectiveness in standard video generation. However, due to the significant gap in content and motion patterns between panoramic and standard videos, these methods encounter challenges in yielding satisfactory 360-degree panoramic videos. In this paper, we propose a pipeline named 360-Degree Video Diffusion model (360DVD) for generating 360-degree panoramic videos based on the given prompts and motion conditions. Specifically, we introduce a lightweight 360-Adapter accompanied by 360 Enhancement Techniques to transform pre-trained T2V models for panorama video generation. We further propose a new panorama dataset named WEB360, consisting of panoramic video-text pairs, for training 360DVD, addressing the absence of captioned panoramic video datasets. Extensive experiments demonstrate the superiority and effectiveness of 360DVD for panorama video generation.
+ + + + RankMatch: Exploring the Better Consistency Regularization for Semi-supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Mai_RankMatch_Exploring_the_Better_Consistency_Regularization_for_Semi-supervised_Semantic_Segmentation_CVPR_2024_paper.pdf + The key to semi-supervised semantic segmentation lies in how to fully exploit substantial unlabeled data to improve the model's generalization performance by constructing effective supervision signals. Most methods tend to directly apply contrastive learning to seek additional supervision to complement independent regular pixel-wise consistency regularization. However, these methods tend not to be preferred, owing to their complicated designs, heavy memory footprints, and susceptibility to confirmation bias. In this paper, we analyze the bottlenecks that exist in contrastive learning-based methods and offer a fresh perspective on inter-pixel correlations to construct safer and more effective supervision signals, which is in line with the nature of semantic segmentation. To this end, we develop a coherent RankMatch network, including the construction of representative agents to model inter-pixel correlation beyond regular individual pixel-wise consistency, and further unlock the potential of agents by modeling inter-agent relationships in pursuit of rank-aware correlation consistency. Extensive experimental results on multiple benchmarks, including mitochondria segmentation, demonstrate that RankMatch performs favorably against state-of-the-art methods. Particularly in the low-data regimes, RankMatch achieves significant improvements. + + + + DuPL: Dual Student with Trustworthy Progressive Learning for Robust Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_DuPL_Dual_Student_with_Trustworthy_Progressive_Learning_for_Robust_Weakly_CVPR_2024_paper.pdf + Recently, one-stage Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has gained increasing interest due to its simplification over the cumbersome multi-stage counterpart. Limited by the inherent ambiguity of Class Activation Maps (CAM), we observe that one-stage pipelines often encounter confirmation bias caused by incorrect CAM pseudo-labels, impairing their final segmentation performance. Although recent works discard many unreliable pseudo-labels to implicitly alleviate this issue, they fail to exploit sufficient supervision for their models. To this end, we propose a dual student framework with trustworthy progressive learning (DuPL). Specifically, we propose a dual student network with a discrepancy loss to yield diverse CAMs for each sub-net. The two sub-nets generate supervision for each other, mitigating the confirmation bias caused by learning their own incorrect pseudo-labels. In this process, we progressively introduce more trustworthy pseudo-labels to be involved in the supervision through dynamic threshold adjustment with an adaptive noise filtering strategy. Moreover, we believe that every pixel, even one discarded from supervision due to its unreliability, is important for WSSS. Thus, we develop consistency regularization on these discarded regions, providing supervision for every pixel. Experimental results demonstrate the superiority of the proposed DuPL over recent state-of-the-art alternatives on the PASCAL VOC 2012 and MS COCO datasets. Code is available at https://github.com/Wu0409/DuPL.
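For readers who want a concrete picture of the dual-student idea described above, here is a minimal PyTorch-style sketch of mutual pseudo-supervision between two sub-nets plus a discrepancy term. It is an illustrative simplification, not the DuPL implementation; the function name, the fixed confidence threshold, and the loss weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def dual_student_losses(logits_a, logits_b, conf_threshold=0.7, disc_weight=0.1):
    # logits_a, logits_b: (B, C, H, W) segmentation logits from the two sub-nets.
    prob_a, prob_b = logits_a.softmax(dim=1), logits_b.softmax(dim=1)

    # Each sub-net is supervised by the other's confident pseudo-labels.
    conf_a, pseudo_a = prob_a.detach().max(dim=1)
    conf_b, pseudo_b = prob_b.detach().max(dim=1)
    ce_a = F.cross_entropy(logits_a, pseudo_b, reduction="none")
    ce_b = F.cross_entropy(logits_b, pseudo_a, reduction="none")
    loss_cross = (ce_a * (conf_b > conf_threshold)).mean() + \
                 (ce_b * (conf_a > conf_threshold)).mean()

    # Discrepancy term: keep the two students from collapsing onto
    # identical (possibly wrong) predictions.
    loss_disc = -F.mse_loss(prob_a, prob_b)

    return loss_cross + disc_weight * loss_disc
```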
+ + + + SurMo: Surface-based 4D Motion Modeling for Dynamic Human Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_SurMo_Surface-based_4D_Motion_Modeling_for_Dynamic_Human_Rendering_CVPR_2024_paper.pdf + Dynamic human rendering from video sequences has achieved remarkable progress by formulating the rendering as a mapping from static poses to human images. However existing methods focus on the human appearance reconstruction of every single frame while the temporal motion relations are not fully explored. In this paper we propose a new 4D motion modeling paradigm SurMo that jointly models the temporal dynamics and human appearances in a unified framework with three key designs: 1) Surface-based motion encoding that models 4D human motions with an efficient compact surface-based triplane. It encodes both spatial and temporal motion relations on the dense surface manifold of a statistical body template which inherits body topology priors for generalizable novel view synthesis with sparse training observations. 2) Physical motion decoding that is designed to encourage physical motion learning by decoding the motion triplane features at timestep t to predict both spatial derivatives and temporal derivatives at the next timestep t+1 in the training stage. 3) 4D appearance decoding that renders the motion triplanes into images by an efficient volumetric surface-conditioned renderer that focuses on the rendering of body surfaces with motion learning conditioning. Extensive experiments validate the state-of-the-art performance of our new paradigm and illustrate the expressiveness of surface-based motion triplanes for rendering high-fidelity view-consistent humans with fast motions and even motion-dependent shadows. Our project page is at: https://taohuumd.github.io/projects/SurMo. + + + + Hierarchical Spatio-temporal Decoupling for Text-to-Video Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Qing_Hierarchical_Spatio-temporal_Decoupling_for_Text-to-Video_Generation_CVPR_2024_paper.pdf + Despite diffusion models having shown powerful abilities to generate photorealistic images generating videos that are realistic and diverse still remains in its infancy. One of the key reasons is that current methods intertwine spatial content and temporal dynamics together leading to a notably increased complexity of text-to-video generation (T2V). In this work we propose HiGen a diffusion model-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives i.e. structure level and content level. At the structure level we decompose the T2V task into two steps including spatial reasoning and temporal reasoning using a unified denoiser. Specifically we generate spatially coherent priors using text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level we extract two subtle cues from the content of the input video that can express motion and appearance changes respectively. These two cues then guide the model's training for generating videos enabling flexible content variations and enhancing temporal stability. Through the decoupled paradigm HiGen can effectively reduce the complexity of this task and generate realistic videos with semantics accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over the state-of-the-art T2V methods. We have released our source code and models. 
+ + + + PLACE: Adaptive Layout-Semantic Fusion for Semantic Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Lv_PLACE_Adaptive_Layout-Semantic_Fusion_for_Semantic_Image_Synthesis_CVPR_2024_paper.pdf + Recent advancements in large-scale pre-trained text-to-image models have led to remarkable progress in semantic image synthesis. Nevertheless synthesizing high-quality images with consistent semantics and layout remains a challenge. In this paper we propose the adaPtive LAyout-semantiC fusion modulE (PLACE) that harnesses pre-trained models to alleviate the aforementioned issues. Specifically we first employ the layout control map to faithfully represent layouts in the feature space. Subsequently we combine the layout and semantic features in a timestep-adaptive manner to synthesize images with realistic details. During fine-tuning we propose the Semantic Alignment (SA) loss to further enhance layout alignment. Additionally we introduce the Layout-Free Prior Preservation (LFP) loss which leverages unlabeled data to maintain the priors of pre-trained models thereby improving the visual quality and semantic consistency of synthesized images. Extensive experiments demonstrate that our approach performs favorably in terms of visual quality semantic consistency and layout alignment. The source code and model are available at \href https://github.com/cszy98/PLACE/tree/main PLACE . + + + + Exploring Efficient Asymmetric Blind-Spots for Self-Supervised Denoising in Real-World Scenarios + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Exploring_Efficient_Asymmetric_Blind-Spots_for_Self-Supervised_Denoising_in_Real-World_Scenarios_CVPR_2024_paper.pdf + Self-supervised denoising has attracted widespread attention due to its ability to train without clean images. However noise in real-world scenarios is often spatially correlated which causes many self-supervised algorithms that assume pixel-wise independent noise to perform poorly. Recent works have attempted to break noise correlation with downsampling or neighborhood masking. However denoising on downsampled subgraphs can lead to aliasing effects and loss of details due to a lower sampling rate. Furthermore the neighborhood masking methods either come with high computational complexity or do not consider local spatial preservation during inference. Through the analysis of existing methods we point out that the key to obtaining high-quality and texture-rich results in real-world self-supervised denoising tasks is to train at the original input resolution structure and use asymmetric operations during training and inference. Based on this we propose Asymmetric Tunable Blind-Spot Network (AT-BSN) where the blind-spot size can be freely adjusted thus better balancing noise correlation suppression and image local spatial destruction during training and inference. In addition we regard the pre-trained AT-BSN as a meta-teacher network capable of generating various teacher networks by sampling different blind-spots. We propose a blind-spot based multi-teacher distillation strategy to distill a lightweight network significantly improving performance. Experimental results on multiple datasets prove that our method achieves state-of-the-art and is superior to other self-supervised algorithms in terms of computational overhead and visual effects. 
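To make the notion of an adjustable blind spot more tangible, the sketch below masks the central window of a convolution kernel so that an output pixel never depends on that pixel or its immediate, noise-correlated neighbours. This is only a schematic illustration of a tunable blind spot, not the AT-BSN architecture; the class name, kernel size, and masking strategy are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TunableBlindSpotConv(nn.Module):
    """Convolution whose kernel centre is zeroed over a (2*blind_spot+1)^2
    window, so each output pixel ignores its own spatially correlated noise."""
    def __init__(self, in_ch, out_ch, kernel_size=5, blind_spot=1):
        super().__init__()
        assert kernel_size % 2 == 1 and 2 * blind_spot + 1 <= kernel_size
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        mask = torch.ones(1, 1, kernel_size, kernel_size)
        c = kernel_size // 2
        mask[..., c - blind_spot:c + blind_spot + 1,
                  c - blind_spot:c + blind_spot + 1] = 0.0
        self.register_buffer("mask", mask)

    def forward(self, x):
        # Apply the mask at every forward pass so the blind spot survives training.
        return F.conv2d(x, self.conv.weight * self.mask, self.conv.bias,
                        padding=self.conv.padding)
```

A larger blind_spot could be used during training and a smaller one at inference, loosely mirroring the asymmetric training/inference idea discussed in the abstract.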
+ + + + Efficient Multi-scale Network with Learnable Discrete Wavelet Transform for Blind Motion Deblurring + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Efficient_Multi-scale_Network_with_Learnable_Discrete_Wavelet_Transform_for_Blind_CVPR_2024_paper.pdf + Coarse-to-fine schemes are widely used in traditional single-image motion deblur; however in the context of deep learning existing multi-scale algorithms not only require the use of complex modules for feature fusion of low-scale RGB images and deep semantics but also manually generate low-resolution pairs of images that do not have sufficient confidence. In this work we propose a multi-scale network based on single-input and multiple-outputs(SIMO) for motion deblurring. This simplifies the complexity of algorithms based on a coarse-to-fine scheme. To alleviate restoration defects impacting detail information brought about by using a multi-scale architecture we combine the characteristics of real-world blurring trajectories with a learnable wavelet transform module to focus on the directional continuity and frequency features of the step-by-step transitions between blurred images to sharp images. In conclusion we propose a multi-scale network with a learnable discrete wavelet transform (MLWNet) which exhibits state-of-the-art performance on multiple real-world deblurred datasets in terms of both subjective and objective quality as well as computational efficiency. + + + + MaskPLAN: Masked Generative Layout Planning from Partial Input + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_MaskPLAN_Masked_Generative_Layout_Planning_from_Partial_Input_CVPR_2024_paper.pdf + Layout planning spanning from architecture to interior design is a slow iterative exploration of ill-defined problems adopting a "I'll know it when I see it" approach to potential solutions. Recent advances in generative models promise automating layout generation yet often overlook the crucial role of user-guided iteration cannot generate full solutions from incomplete design ideas and do not learn for the inter-dependency of layout attributes. To address these limitations we propose MaskPLAN a novel generative model based on Graph-structured Dynamic Masked Autoencoders (GDMAE) featuring five transformers generating a blend of graph-based and image-based layout attributes. MaskPLAN lets users generate and adjust layouts with partial attribute definitions create alternatives for preferences and practice new composition-driven or functionality-driven workflows. Through cross-attribute learning and the user input as a global conditional prior we ensure that design synthesis is calibrated at every intermediate stage maintaining its feasibility and practicality. Extensive evaluations show MaskPLAN's superior performance over existing methods across multiple metrics. + + + + HMD-Poser: On-Device Real-time Human Motion Tracking from Scalable Sparse Observations + http://openaccess.thecvf.com//content/CVPR2024/papers/Dai_HMD-Poser_On-Device_Real-time_Human_Motion_Tracking_from_Scalable_Sparse_Observations_CVPR_2024_paper.pdf + It is especially challenging to achieve real-time human motion tracking on a standalone VR Head-Mounted Display (HMD) such as Meta Quest and PICO. In this paper we propose HMD-Poser the first unified approach to recover full-body motions using scalable sparse observations from HMD and body-worn IMUs. In particular it can support a variety of input scenarios such as HMD HMD+2IMUs HMD+3IMUs etc. 
The scalability of inputs may accommodate users' choices for both high tracking accuracy and easy-to-wear. A lightweight temporal-spatial feature learning network is proposed in HMD-Poser to guarantee that the model runs in real-time on HMDs. Furthermore HMD-Poser presents online body shape estimation to improve the position accuracy of body joints. Extensive experimental results on the challenging AMASS dataset show that HMD-Poser achieves new state-of-the-art results in both accuracy and real-time performance. We also build a new free-dancing motion dataset to evaluate HMD-Poser's on-device performance and investigate the performance gap between synthetic data and real-captured sensor data. Finally we demonstrate our HMD-Poser with a real-time Avatar-driving application on a commercial HMD. Our code and free-dancing motion dataset are available \href https://pico-ai-team.github.io/hmd-poser here . + + + + Flexible Biometrics Recognition: Bridging the Multimodality Gap through Attention Alignment and Prompt Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Tiong_Flexible_Biometrics_Recognition_Bridging_the_Multimodality_Gap_through_Attention_Alignment_CVPR_2024_paper.pdf + Periocular and face are complementary biometrics for identity management albeit with inherent limitations notably in scenarios involving occlusion due to sunglasses or masks. In response to these challenges we introduce Flexible Biometric Recognition (FBR) a novel framework designed to advance conventional face periocular and multimodal face-periocular biometrics across both intra- and cross-modality recognition tasks. FBR strategically utilizes the Multimodal Fusion Attention (MFA) and Multimodal Prompt Tuning (MPT) mechanisms within the Vision Transformer architecture. MFA facilitates the fusion of modalities ensuring cohesive alignment between facial and periocular embeddings while incorporating soft-biometrics to enhance the model's ability to discriminate between individuals. The fusion of three modalities is pivotal in exploring interrelationships between different modalities. Additionally MPT serves as a unifying bridge intertwining inputs and promoting cross-modality interactions while preserving their distinctive characteristics. The collaborative synergy of MFA and MPT enhances the shared features of the face and periocular with a specific emphasis on the ocular region yielding exceptional performance in both intra- and cross-modality recognition tasks. Rigorous experimentation across four benchmark datasets validates the noteworthy performance of the FBR model. The source code is available at https://github.com/MIS-DevWorks/FBR. + + + + Multi-scale Dynamic and Hierarchical Relationship Modeling for Facial Action Units Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Multi-scale_Dynamic_and_Hierarchical_Relationship_Modeling_for_Facial_Action_Units_CVPR_2024_paper.pdf + Human facial action units (AUs) are mutually related in a hierarchical manner as not only they are associated with each other in both spatial and temporal domains but also AUs located in the same/close facial regions show stronger relationships than those of different facial regions. While none of existing approach thoroughly model such hierarchical inter-dependencies among AUs this paper proposes to comprehensively model multi-scale AU-related dynamic and hierarchical spatio-temporal relationship among AUs for their occurrences recognition. 
Specifically we first propose a novel multi-scale temporal differencing network with an adaptive weighting block to explicitly capture facial dynamics across frames at different spatial scales which specifically considers the heterogeneity of range and magnitude in different AUs' activation. Then a two-stage strategy is introduced to hierarchically model the relationship among AUs based on their spatial distribution (i.e. local and cross-region AU relationship modelling). Experimental results achieved on BP4D and DISFA show that our approach is the new state-of-the-art in the field of AU occurrence recognition. Our code is publicly available at https://github.com/CVI-SZU/MDHR. + + + + EventEgo3D: 3D Human Motion Capture from Egocentric Event Streams + http://openaccess.thecvf.com//content/CVPR2024/papers/Millerdurai_EventEgo3D_3D_Human_Motion_Capture_from_Egocentric_Event_Streams_CVPR_2024_paper.pdf + Monocular egocentric 3D human motion capture is a challenging and actively researched problem. Existing methods use synchronously operating visual sensors (e.g. RGB cameras) and often fail under low lighting and fast motions which can be restricting in many applications involving head-mounted devices. In response to the existing limitations this paper 1) introduces a new problem i.e. 3D human motion capture from an egocentric monocular event camera with a fisheye lens and 2) proposes the first approach to it called EventEgo3D (EE3D). Event streams have high temporal resolution and provide reliable cues for 3D human motion capture under high-speed human motions and rapidly changing illumination. The proposed EE3D framework is specifically tailored for learning with event streams in the LNES representation enabling high 3D reconstruction accuracy. We also design a prototype of a mobile head-mounted device with an event camera and record a real dataset with event observations and the ground-truth 3D human poses (in addition to the synthetic dataset). Our EE3D demonstrates robustness and superior 3D accuracy compared to existing solutions across various challenging experiments while supporting real-time 3D pose update rates of 140Hz. + + + + A Call to Reflect on Evaluation Practices for Age Estimation: Comparative Analysis of the State-of-the-Art and a Unified Benchmark + http://openaccess.thecvf.com//content/CVPR2024/papers/Paplham_A_Call_to_Reflect_on_Evaluation_Practices_for_Age_Estimation_CVPR_2024_paper.pdf + Comparing different age estimation methods poses a challenge due to the unreliability of published results stemming from inconsistencies in the benchmarking process. Previous studies have reported continuous performance improvements over the past decade using specialized methods; however our findings challenge these claims. This paper identifies two trivial yet persistent issues with the currently used evaluation protocol and describes how to resolve them. We offer an extensive comparative analysis for state-of-the-art facial age estimation methods. Surprisingly we find that the performance differences between the methods are negligible compared to the effect of other factors such as facial alignment facial coverage image resolution model architecture or the amount of data used for pretraining. We use the gained insights to propose using FaRL as the backbone model and demonstrate its effectiveness on all public datasets. We make the source code and exact data splits public on GitHub and in the supplementary material. 
+ + + + CosalPure: Learning Concept from Group Images for Robust Co-Saliency Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_CosalPure_Learning_Concept_from_Group_Images_for_Robust_Co-Saliency_Detection_CVPR_2024_paper.pdf + Co-salient object detection (CoSOD) aims to identify the common and salient (usually in the foreground) regions across a given group of images. Although achieving significant progress state-of-the-art CoSODs could be easily affected by some adversarial perturbations leading to substantial accuracy reduction. The adversarial perturbations can mislead CoSODs but do not change the high-level semantic information (e.g. concept) of the co-salient objects. In this paper we propose a novel robustness enhancement framework by first learning the concept of the co-salient objects based on the input group images and then leveraging this concept to purify adversarial perturbations which are subsequently fed to CoSODs for robustness enhancement. Specifically we propose CosalPure containing two modules i.e. group-image concept learning and concept-guided diffusion purification. For the first module we adopt a pre-trained text-to-image diffusion model to learn the concept of co-salient objects within group images where the learned concept is robust to adversarial examples. For the second module we map the adversarial image to the latent space and then perform diffusion generation by embedding the learned concept into the noise prediction function as an extra condition. Our method can effectively alleviate the influence of the SOTA adversarial attack containing different adversarial patterns including exposure and noise. The extensive results demonstrate that our method could enhance the robustness of CoSODs significantly. + + + + MRFP: Learning Generalizable Semantic Segmentation from Sim-2-Real with Multi-Resolution Feature Perturbation + http://openaccess.thecvf.com//content/CVPR2024/papers/Udupa_MRFP_Learning_Generalizable_Semantic_Segmentation_from_Sim-2-Real_with_Multi-Resolution_Feature_CVPR_2024_paper.pdf + Deep neural networks have shown exemplary performance on semantic scene understanding tasks on source domains but due to the absence of style diversity during training enhancing performance on unseen target domains using only single source domain data remains a challenging task. Generation of simulated data is a feasible alternative to retrieving large style-diverse real-world datasets as it is a cumbersome and budget-intensive process. However the large domain-specific inconsistencies between simulated and real-world data pose a significant generalization challenge in semantic segmentation. In this work to alleviate this problem we propose a novel Multi-Resolution Feature Perturbation (MRFP) technique to randomize domain-specific fine-grained features and perturb style of coarse features. Our experimental results on various urban-scene segmentation datasets clearly indicate that along with the perturbation of style-information perturbation of fine-feature components is paramount to learn domain invariant robust feature maps for semantic segmentation models. MRFP is a simple and computationally efficient transferable module with no additional learnable parameters or objective functions that helps state-of-the-art deep neural networks to learn robust domain invariant features for simulation-to-real semantic segmentation. Code is available at https://github.com/airl-iisc/MRFP. 
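As a loose illustration of feature-level style perturbation of the kind discussed above, the sketch below jitters the per-channel mean and standard deviation of a feature map. It is a generic stand-in, not the MRFP module; the function name, the Gaussian jitter, and the alpha scale are assumptions.

```python
import torch

def perturb_feature_style(feat: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    # feat: (B, C, H, W) feature map; randomly jitter its per-channel statistics.
    mu = feat.mean(dim=(2, 3), keepdim=True)
    sigma = feat.std(dim=(2, 3), keepdim=True) + 1e-6
    normalized = (feat - mu) / sigma
    mu_new = mu * (1.0 + alpha * torch.randn_like(mu))
    sigma_new = sigma * (1.0 + alpha * torch.randn_like(sigma))
    return normalized * sigma_new + mu_new
```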
+ + + + MotionEditor: Editing Video Motion via Content-Aware Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Tu_MotionEditor_Editing_Video_Motion_via_Content-Aware_Diffusion_CVPR_2024_paper.pdf + Existing diffusion-based video editing models have made gorgeous advances for editing attributes of a source video over time but struggle to manipulate the motion information while preserving the original protagonist's appearance and background. To address this we propose MotionEditor the first diffusion model for video motion editing. MotionEditor incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence. While ControlNet enables direct generation based on skeleton poses it encounters challenges when modifying the source motion in the inverted noise due to contradictory signals between the noise (source) and the condition (reference). Our adapter complements ControlNet by involving source content to transfer adapted control signals seamlessly. Further we build up a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism facilitating branch interaction. This mechanism enables the editing branch to query the key and value from the reconstruction branch in a decoupled manner making the editing branch retain the original background and protagonist appearance. We also propose a skeleton alignment algorithm to address the discrepancies in pose size and position. Experiments demonstrate the promising motion editing ability of MotionEditor both qualitatively and quantitatively. To the best of our knowledge MotionEditor is the first to use diffusion models specifically for video motion editing considering the origin dynamic background and camera movement. + + + + Doubly Abductive Counterfactual Inference for Text-based Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_Doubly_Abductive_Counterfactual_Inference_for_Text-based_Image_Editing_CVPR_2024_paper.pdf + We study text-based image editing (TBIE) of a single image by counterfactual inference because it is an elegant formulation to precisely address the requirement: the edited image should retain the fidelity of the original one. Through the lens of the formulation we find that the crux of TBIE is that existing techniques hardly achieve a good trade-off between editability and fidelity mainly due to the overfitting of the single-image fine-tuning. To this end we propose a Doubly Abductive Counterfactual inference framework (DAC). We first parameterize an exogenous variable as a UNet LoRA whose abduction can encode all the image details. Second we abduct another exogenous variable parameterized by a text encoder LoRA which recovers the lost editability caused by the overfitted first abduction. Thanks to the second abduction which exclusively encodes the visual transition from post-edit to pre-edit its inversion---subtracting the LoRA---effectively reverts pre-edit back to post-edit thereby accomplishing the edit. Through extensive experiments our DAC achieves a good trade-off between editability and fidelity. Thus we can support a wide spectrum of user editing intents including addition removal manipulation replacement style transfer and facial change which are extensively validated in both qualitative and quantitative evaluations. Codes are in https://github.com/xuesong39/DAC. 
+ + + + Normalizing Flows on the Product Space of SO(3) Manifolds for Probabilistic Human Pose Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Dunkel_Normalizing_Flows_on_the_Product_Space_of_SO3_Manifolds_for_CVPR_2024_paper.pdf + Normalizing flows have proven their efficacy for density estimation in Euclidean space but their application to rotational representations crucial in various domains such as robotics or human pose modeling remains underexplored. Probabilistic models of the human pose can benefit from approaches that rigorously consider the rotational nature of human joints. For this purpose we introduce HuProSO3 a normalizing flow model that operates on a high-dimensional product space of SO(3) manifolds modeling the joint distribution for human joints with three degrees of freedom. HuProSO3's advantage over state-of-the-art approaches is demonstrated through its superior modeling accuracy in three different applications and its capability to evaluate the exact likelihood. This work not only addresses the technical challenge of learning densities on SO(3) manifolds but it also has broader implications for domains where the probabilistic regression of correlated 3D rotations is of importance. Code will be available at https://github.com/odunkel/HuProSO. + + + + ReGenNet: Towards Human Action-Reaction Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_ReGenNet_Towards_Human_Action-Reaction_Synthesis_CVPR_2024_paper.pdf + Humans constantly interact with their surrounding environments. Current human-centric generative models mainly focus on synthesizing humans plausibly interacting with static scenes and objects while the dynamic human action-reaction synthesis for ubiquitous causal human-human interactions is less explored. Human-human interactions can be regarded as asymmetric with actors and reactors in atomic interaction periods. In this paper we comprehensively analyze the asymmetric dynamic synchronous and detailed nature of human-human interactions and propose the first multi-setting human action-reaction synthesis benchmark to generate human reactions conditioned on given human actions. To begin with we propose to annotate the actor-reactor order of the interaction sequences for the NTU120 InterHuman and Chi3D datasets. Based on them a diffusion-based generative model with a Transformer decoder architecture called ReGenNet together with an explicit distance-based interaction loss is proposed to predict human reactions in an online manner where the future states of actors are unavailable to reactors. Quantitative and qualitative results show that our method can generate instant and plausible human reactions compared to the baselines and can generalize to unseen actor motions and viewpoint changes. + + + + A Simple Baseline for Efficient Hand Mesh Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_A_Simple_Baseline_for_Efficient_Hand_Mesh_Reconstruction_CVPR_2024_paper.pdf + Hand mesh reconstruction has attracted considerable attention in recent years with various approaches and techniques being proposed. Some of these methods incorporate complex components and designs which while effective may complicate the model and hinder efficiency. In this paper we decompose the mesh decoder into token generator and mesh regressor. 
Through extensive ablation experiments we found that the token generator should select discriminating and representative points while the mesh regressor needs to upsample sparse keypoints into dense meshes in multiple stages. Given these functionalities we can achieve high performance with minimal computational resources. Based on this observation we propose a simple yet effective baseline that outperforms state-of-the-art methods by a large margin while maintaining real-time efficiency. Our method outperforms existing solutions achieving state-of-the-art (SOTA) results across multiple datasets. On the FreiHAND dataset our approach produced a PA-MPJPE of 5.8mm and a PA-MPVPE of 6.1mm. Similarly on the DexYCB dataset we observed a PA-MPJPE of 5.5mm and a PA-MPVPE of 5.5mm. As for performance speed our method reached up to 33 frames per second (fps) when using HRNet and up to 70 fps when employing FastViT-MA36. Code will be made available. + + + + PhotoMaker: Customizing Realistic Human Photos via Stacked ID Embedding + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_PhotoMaker_Customizing_Realistic_Human_Photos_via_Stacked_ID_Embedding_CVPR_2024_paper.pdf + Recent advances in text-to-image generation have made remarkable progress in synthesizing realistic human photos conditioned on given text prompts. However existing personalized generation methods cannot simultaneously satisfy the requirements of high efficiency promising identity (ID) fidelity and flexible text controllability. In this work we introduce PhotoMaker an efficient personalized text-to-image generation method which mainly encodes an arbitrary number of input ID images into a stack ID embedding for preserving ID information. Such an embedding also empowers our method to be applied in many interesting scenarios such as when replacing the corresponding class word and when combining the characteristics of different identities. Besides to better drive the training of our PhotoMaker we propose an ID-oriented data creation pipeline to assemble the training data. Under the nourishment of the dataset constructed through the proposed pipeline our PhotoMaker demonstrates comparable performance to test-time fine-tuning-based methods yet provides significant speed improvements strong generalization capabilities and a wide range of applications. + + + + Score-Guided Diffusion for 3D Human Recovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Stathopoulos_Score-Guided_Diffusion_for_3D_Human_Recovery_CVPR_2024_paper.pdf + We present Score-Guided Human Mesh Recovery (ScoreHMR) an approach for solving inverse problems for 3D human pose and shape reconstruction. These inverse problems involve fitting a human body model to image observations traditionally solved through optimization techniques. ScoreHMR mimics model fitting approaches but alignment with the image observation is achieved through score guidance in the latent space of a diffusion model. The diffusion model is trained to capture the conditional distribution of the human model parameters given an input image. By guiding its denoising process with a task-specific score ScoreHMR effectively solves inverse problems for various applications without the need for retraining the task-agnostic diffusion model. We evaluate our approach on three settings/applications. These are: (i) single-frame model fitting; (ii) reconstruction from multiple uncalibrated views; (iii) reconstructing humans in video sequences. 
ScoreHMR consistently outperforms all optimization baselines on popular benchmarks across all settings. We make our code and models available on the project website: https://statho.github.io/ScoreHMR. + + + + Check Locate Rectify: A Training-Free Layout Calibration System for Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Gong_Check_Locate_Rectify_A_Training-Free_Layout_Calibration_System_for_Text-to-Image_CVPR_2024_paper.pdf + Diffusion models have recently achieved remarkable progress in generating realistic images. However challenges remain in accurately understanding and synthesizing the layout requirements in the textual prompts. To align the generated image with layout instructions we present a training-free layout calibration system SimM that intervenes in the generative process on the fly during inference time. Specifically following a "check-locate-rectify" pipeline the system first analyses the prompt to generate the target layout and compares it with the intermediate outputs to automatically detect errors. Then by moving the located activations and making intra- and inter-map adjustments the rectification process can be performed with negligible computational overhead. To evaluate SimM over a range of layout requirements we present a benchmark SimMBench that compensates for the lack of superlative spatial relations in existing datasets. And both quantitative and qualitative results demonstrate the effectiveness of the proposed SimM in calibrating the layout inconsistencies. Our project page is at https://simm-t2i.github.io/SimM. + + + + Pose-Transformed Equivariant Network for 3D Point Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Pose-Transformed_Equivariant_Network_for_3D_Point_Trajectory_Prediction_CVPR_2024_paper.pdf + Predicting 3D point trajectory is a fundamental learning task which commonly should be equivariant under Euclidean transformation e.g. SE(3). The existing equivariant models are commonly based on the group equivariant convolution equivariant message passing vector neuron frame averaging etc. In this paper we propose a novel pose-transformed equivariant network in which the points are firstly uniquely normalized and then transformed by the learned pose transformations upon which the points after motion are predicted and aggregated. Under each transformed pose we design the point position predictor consisting of multiple Pose-Transformed Points Prediction blocks in which the global and local motions are estimated and aggregated. This framework can be proven to be equivariant to SE(3) transformation over 3D points. We evaluate the pose-transformed equivariant network on extensive datasets including human motion capture molecular dynamics modeling and dynamics simulation. Extensive experimental comparisons demonstrated our SOTA performance compared with the existing equivariant networks for 3D point trajectory prediction. + + + + Revisiting Sampson Approximations for Geometric Estimation Problems + http://openaccess.thecvf.com//content/CVPR2024/papers/Rydell_Revisiting_Sampson_Approximations_for_Geometric_Estimation_Problems_CVPR_2024_paper.pdf + Many problems in computer vision can be formulated as geometric estimation problems i.e. given a collection of measurements (e.g. point correspondences) we wish to fit a model (e.g. an essential matrix) that agrees with our observations. This necessitates some measure of how much an observation "agrees" with a given model. 
A natural choice is to consider the smallest perturbation that makes the observation exactly satisfy the constraints. However for many problems this metric is expensive or otherwise intractable to compute. The so-called Sampson error approximates this geometric error through a linearization scheme. For epipolar geometry the Sampson error is a popular choice and in practice known to yield very tight approximations of the corresponding geometric residual (the reprojection error). In this paper we revisit the Sampson approximation and provide new theoretical insights as to why and when this approximation works as well as provide explicit bounds on the tightness under some mild assumptions. Our theoretical results are validated in several experiments on real data and in the context of different geometric estimation tasks. + + + + Fixed Point Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Bai_Fixed_Point_Diffusion_Models_CVPR_2024_paper.pdf + We introduce the Fixed Point Diffusion Model (FPDM) a novel approach to image generation that integrates the concept of fixed point solving into the framework of diffusion-based generative modeling. Our approach embeds an implicit fixed point solving layer into the denoising network of a diffusion model transforming the diffusion process into a sequence of closely-related fixed point problems. Combined with a new stochastic training method this approach significantly reduces model size reduces memory usage and accelerates training. Moreover it enables the development of two new techniques to improve sampling efficiency: reallocating computation across timesteps and reusing fixed point solutions between timesteps. We conduct extensive experiments with state-of-the-art models on ImageNet FFHQ CelebA-HQ and LSUN-Church demonstrating substantial improvements in performance and efficiency. Compared to the state-of-the-art DiT model FPDM contains 87% fewer parameters consumes 60% less memory during training and improves image generation quality in situations where sampling computation or time is limited. + + + + Residual Learning in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Residual_Learning_in_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion models (DMs) have achieved remarkable generative performance particularly with the introduction of stochastic differential equations (SDEs). Nevertheless a gap emerges in the model sampling trajectory constructed by reverse-SDE due to the accumulation of score estimation and discretization errors. This gap results in a residual in the generated images adversely impacting the image quality. To remedy this we propose a novel residual learning framework built upon a correction function. The optimized function enables to improve image quality via rectifying the sampling trajectory effectively. Importantly our framework exhibits transferable residual correction ability i.e. a correction function optimized for one pre-trained DM can also enhance the sampling trajectory constructed by other different DMs on the same dataset. Experimental results on four widely-used datasets demonstrate the effectiveness and transferable capability of our framework. 
+ + + + Beyond Textual Constraints: Learning Novel Diffusion Conditions with Fewer Examples + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Beyond_Textual_Constraints_Learning_Novel_Diffusion_Conditions_with_Fewer_Examples_CVPR_2024_paper.pdf + In this paper we delve into a novel aspect of learning novel diffusion conditions with datasets an order of magnitude smaller. The rationale behind our approach is the elimination of textual constraints during the few-shot learning process. To that end we implement two optimization strategies. The first prompt-free conditional learning utilizes a prompt-free encoder derived from a pre-trained Stable Diffusion model. This strategy is designed to adapt new conditions to the diffusion process by minimizing the textual-visual correlation thereby ensuring a more precise alignment between the generated content and the specified conditions. The second strategy entails condition-specific negative rectification which addresses the inconsistencies typically brought about by Classifier-free guidance in few-shot training contexts. Our extensive experiments across a variety of condition modalities demonstrate the effectiveness and efficiency of our framework yielding results comparable to those obtained with datasets a thousand times larger. Our codes are available at https://github.com/Yuyan9Yu/BeyondTextConstraint. + + + + Exploiting Style Latent Flows for Generalizing Deepfake Video Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Choi_Exploiting_Style_Latent_Flows_for_Generalizing_Deepfake_Video_Detection_CVPR_2024_paper.pdf + This paper presents a new approach for the detection of fake videos based on the analysis of style latent vectors and their abnormal behavior in temporal changes in the generated videos. We discovered that the generated facial videos suffer from the temporal distinctiveness in the temporal changes of style latent vectors which are inevitable during the generation of temporally stable videos with various facial expressions and geometric transformations. Our framework utilizes the StyleGRU module trained by contrastive learning to represent the dynamic properties of style latent vectors. Additionally we introduce a style attention module that integrates StyleGRU-generated features with content-based features enabling the detection of visual and temporal artifacts. We demonstrate our approach across various benchmark scenarios in deepfake detection showing its superiority in cross-dataset and cross-manipulation scenarios. Through further analysis we also validate the importance of using temporal changes of style latent vectors to improve the generality of deepfake video detection. + + + + Video-P2P: Video Editing with Cross-attention Control + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Video-P2P_Video_Editing_with_Cross-attention_Control_CVPR_2024_paper.pdf + Video-P2P is the first framework for real-world video editing with cross-attention control. While attention control has proven effective for image editing with pre-trained image generation models there are currently no large-scale video generation models publicly available. Video-P2P addresses this limitation by adapting an image generation diffusion model to complete various video editing tasks. Specifically we propose to first tune a Text-to-Set (T2S) model to complete an approximate inversion and then optimize a shared unconditional embedding to achieve accurate video inversion with a small memory cost. 
We further prove that it is crucial for consistent video editing. For attention control we introduce a novel decoupled-guidance strategy which uses different guidance strategies for the source and target prompts. The optimized unconditional embedding for the source prompt improves reconstruction ability while an initialized unconditional embedding for the target prompt enhances editability. Incorporating the attention maps of these two branches enables detailed editing. These technical designs enable various text-driven editing applications including word swap prompt refinement and attention re-weighting. Video-P2P works well on real-world videos for generating new characters while optimally preserving their original poses and scenes. It significantly outperforms previous approaches. + + + + Hunting Attributes: Context Prototype-Aware Learning for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Hunting_Attributes_Context_Prototype-Aware_Learning_for_Weakly_Supervised_Semantic_Segmentation_CVPR_2024_paper.pdf + Recent weakly supervised semantic segmentation (WSSS) methods strive to incorporate contextual knowledge to improve the completeness of class activation maps (CAM). In this work we argue that the knowledge bias between instances and contexts affects the capability of the prototype to sufficiently understand instance semantics. Inspired by prototype learning theory we propose leveraging prototype awareness to capture diverse and fine-grained feature attributes of instances. The hypothesis is that contextual prototypes might erroneously activate similar and frequently co-occurring object categories due to this knowledge bias. Therefore we propose to enhance the prototype representation ability by mitigating the bias to better capture spatial coverage in semantic object regions. With this goal we present a Context Prototype-Aware Learning (CPAL) strategy which leverages semantic context to enrich instance comprehension. The core of this method is to accurately capture intra-class variations in object features through context-aware prototypes facilitating the adaptation to the semantic attributes of various instances. We design feature distribution alignment to optimize prototype awareness aligning instance feature distributions with dense features. In addition a unified training framework is proposed to combine label-guided classification supervision and prototypes-guided self-supervision. Experimental results on PASCAL VOC 2012 and MS COCO 2014 show that CPAL significantly improves off-the-shelf methods and achieves state-of-the-art performance. The project is available at \href https://github.com/Barrett-python/CPAL https://github.com/Barrett-python/CPAL. + + + + PIE-NeRF: Physics-based Interactive Elastodynamics with NeRF + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_PIE-NeRF_Physics-based_Interactive_Elastodynamics_with_NeRF_CVPR_2024_paper.pdf + We show that physics-based simulations can be seamlessly integrated with NeRF to generate high-quality elastodynamics of real-world objects. Unlike existing methods we discretize nonlinear hyperelasticity in a meshless way obviating the necessity for intermediate auxiliary shape proxies like a tetrahedral mesh or voxel grid. A quadratic generalized moving least square is employed to capture nonlinear dynamics and large deformation on the implicit model. Such meshless integration enables versatile simulations of complex and codimensional shapes. 
We adaptively place the least-square kernels according to the NeRF density field to significantly reduce the complexity of the nonlinear simulation. As a result physically realistic animations can be conveniently synthesized using our method for a wide range of hyperelastic materials at an interactive rate. For more information please visit https://fytalon.github.io/pienerf. + + + + FlashAvatar: High-fidelity Head Avatar with Efficient Gaussian Embedding + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiang_FlashAvatar_High-fidelity_Head_Avatar_with_Efficient_Gaussian_Embedding_CVPR_2024_paper.pdf + We propose FlashAvatar a novel and lightweight 3D animatable avatar representation that could reconstruct a digital avatar from a short monocular video sequence in minutes and render high-fidelity photo-realistic images at 300FPS on a consumer-grade GPU. To achieve this we maintain a uniform 3D Gaussian field embedded in the surface of a parametric face model and learn extra spatial offset to model non-surface regions and subtle facial details. While full use of geometric priors can capture high-frequency facial details and preserve exaggerated expressions proper initialization can help reduce the number of Gaussians thus enabling super-fast rendering speed. Extensive experimental results demonstrate that FlashAvatar outperforms existing works regarding visual quality and personalized details and is almost an order of magnitude faster in rendering speed. Project page: https://ustc3dv.github.io/FlashAvatar/ + + + + ZERO-IG: Zero-Shot Illumination-Guided Joint Denoising and Adaptive Enhancement for Low-Light Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_ZERO-IG_Zero-Shot_Illumination-Guided_Joint_Denoising_and_Adaptive_Enhancement_for_Low-Light_CVPR_2024_paper.pdf + This paper presents a novel zero-shot method for jointly denoising and enhancing real-world low-light images. The proposed method is independent of training data and noise distribution. Guided by illumination we integrate denoising and enhancing processes seamlessly enabling end-to-end training. Pairs of downsampled images are extracted from a single original low-light image and processed to preliminarily reduce noise. Based on the smoothness of illumination near-authentic illumination can be estimated from the denoised low-light image. Specifically the illumination is constrained by the denoised image's brightness uniformly amplifying pixels to raise overall brightness to normal-light level. We simultaneously restrict the illumination by scaling each pixel of the denoised image based on its intensity controlling the enhancement amplitude for different pixels. Applying the illumination to the original low-light image yields an adaptively enhanced reflection. This prevents under-enhancement and localized overexposure. Notably we concatenate the reflection with the illumination preserving their computational relationship to ultimately remove noise from the original low-light image in the form of reflection. This provides sufficient image information for the denoising procedure without changing the noise characteristics. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods. The source code is available at https://github.com/Doyle59217/ZeroIG. 
+ + + + FinePOSE: Fine-Grained Prompt-Driven 3D Human Pose Estimation via Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_FinePOSE_Fine-Grained_Prompt-Driven_3D_Human_Pose_Estimation_via_Diffusion_Models_CVPR_2024_paper.pdf + The 3D Human Pose Estimation (3D HPE) task uses 2D images or videos to predict human joint coordinates in 3D space. Despite recent advancements in deep learning-based methods they mostly ignore the capability of coupling accessible texts and naturally feasible knowledge of humans missing out on valuable implicit supervision to guide the 3D HPE task. Moreover previous efforts often study this task from the perspective of the whole human body neglecting fine-grained guidance hidden in different body parts. To this end we present a new Fine-Grained Prompt-Driven Denoiser based on a diffusion model for 3D HPE named FinePOSE. It consists of three core blocks enhancing the reverse process of the diffusion model: (1) Fine-grained Part-aware Prompt learning (FPP) block constructs fine-grained part-aware prompts via coupling accessible texts and naturally feasible knowledge of body parts with learnable prompts to model implicit guidance. (2) Fine-grained Prompt-pose Communication (FPC) block establishes fine-grained communications between learned part-aware prompts and poses to improve the denoising quality. (3) Prompt-driven Timestamp Stylization (PTS) block integrates learned prompt embedding and temporal information related to the noise level to enable adaptive adjustment at each denoising step. Extensive experiments on public single-human pose estimation datasets show that FinePOSE outperforms state-of-the-art methods. We further extend FinePOSE to multi-human pose estimation. Achieving 34.3mm average MPJPE on the EgoHumans dataset demonstrates the potential of FinePOSE to deal with complex multi-human scenarios. Code is available at https://github.com/PKU-ICST-MIPL/FinePOSE_CVPR2024. + + + + DreamPropeller: Supercharge Text-to-3D Generation with Parallel Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_DreamPropeller_Supercharge_Text-to-3D_Generation_with_Parallel_Sampling_CVPR_2024_paper.pdf + Recent methods such as Score Distillation Sampling (SDS) and Variational Score Distillation (VSD) using 2D diffusion models for text-to-3D generation have demonstrated impressive generation quality. However the long generation time of such algorithms significantly degrades the user experience. To tackle this problem we propose DreamPropeller a drop-in acceleration algorithm that can be wrapped around any existing text-to-3D generation pipeline based on score distillation. Our framework generalizes Picard iterations a classical algorithm for parallel sampling an ODE path and can account for non-ODE paths such as momentum-based gradient updates and changes in dimensions during the optimization process as in many cases of 3D generation. We show that our algorithm trades parallel compute for wallclock time and empirically achieves up to 4.7x speedup with a negligible drop in generation quality for all tested frameworks. 
+ + + + Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs + http://openaccess.thecvf.com//content/CVPR2024/papers/Fei_Dysen-VDM_Empowering_Dynamics-aware_Text-to-Video_Diffusion_with_LLMs_CVPR_2024_paper.pdf + Text-to-video (T2V) synthesis has gained increasing attention in the community in which the recently emerged diffusion models (DMs) have promisingly shown stronger performance than the past approaches. While existing state-of-the-art DMs are competent to achieve high-resolution video generation they may largely suffer from key limitations (e.g. action occurrence disorders crude video motions) with respect to the intricate temporal dynamics modeling one of the cruxes of video synthesis. In this work we investigate strengthening the awareness of video dynamics for DMs for high-quality T2V generation. Inspired by human intuition we design an innovative dynamic scene manager (dubbed as Dysen) module which includes (step-1) extracting from input text the key actions with proper time-order arrangement (step-2) transforming the action schedules into the dynamic scene graph (DSG) representations and (step-3) enriching the scenes in the DSG with sufficient and reasonable details. Taking advantage of the existing powerful LLMs (e.g. ChatGPT) via in-context learning Dysen realizes (nearly) human-level temporal dynamics understanding. Finally the resulting video DSG with rich action scene details is encoded as fine-grained spatio-temporal features integrated into the backbone T2V DM for video generating. Experiments on popular T2V datasets suggest that our Dysen-VDM consistently outperforms prior arts with significant margins especially in scenarios with complex actions. + + + + General Object Foundation Model for Images and Videos at Scale + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_General_Object_Foundation_Model_for_Images_and_Videos_at_Scale_CVPR_2024_paper.pdf + We present GLEE in this work an object-level foundation model for locating and identifying objects in images and videos. Through a unified framework GLEE accomplishes detection segmentation tracking grounding and identification of arbitrary objects in the open world scenario for various object perception tasks. Adopting a cohesive learning strategy GLEE acquires knowledge from diverse data sources with varying supervision levels to formulate general object representations excelling in zero-shot transfer to new data and tasks. Specifically we employ an image encoder text encoder and visual prompter to handle multi-modal inputs enabling it to simultaneously solve various object-centric downstream tasks while maintaining state-of-the-art performance. Demonstrated through extensive training on over five million images from diverse benchmarks GLEE exhibits remarkable versatility and improved generalization performance efficiently tackling downstream tasks without the need for task-specific adaptation. By integrating large volumes of automatically labeled data we further enhance its zero-shot generalization capabilities. Additionally GLEE is capable of being integrated into Large Language Models serving as a foundational model to provide universal object-level information for multi-modal tasks. We hope that the versatility and universality of our method will mark a significant step in the development of efficient visual foundation models for AGI systems. The models and code are released at https://github.com/FoundationVision/GLEE. 
+ + + + Inlier Confidence Calibration for Point Cloud Registration + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_Inlier_Confidence_Calibration_for_Point_Cloud_Registration_CVPR_2024_paper.pdf + Inliers estimation constitutes a pivotal step in partially overlapping point cloud registration. Existing methods broadly obey coordinate-based scheme where inlier confidence is scored through simply capturing coordinate differences in the context. However this scheme results in massive inlier misinterpretation readily consequently affecting the registration performance. In this paper we explore to extend a new definition called inlier confidence calibration (ICC) to alleviate the above issues. Firstly we provide finely initial correspondences for ICC in order to generate high quality reference point cloud copy corresponding to the source point cloud. In particular we develop a soft assignment matrix optimization theorem that offers faster speed and greater precision compared to Sinkhorn. Benefiting from the high quality reference copy we argue the neighborhood patch formed by inlier and its neighborhood should have consistency between source point cloud and its reference copy. Based on this insight we construct transformation-invariant geometric constraints and capture geometric structure consistency to calibrate inlier confidence for estimated correspondences between source point cloud and its reference copy. Finally transformation is further calculated by the weighted SVD algorithm with the calibrated inlier confidence. Our model is trained in an unsupervised manner and extensive experiments on synthetic and real-world datasets illustrate the effectiveness of the proposed method. + + + + Readout Guidance: Learning Control from Diffusion Features + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_Readout_Guidance_Learning_Control_from_Diffusion_Features_CVPR_2024_paper.pdf + We present Readout Guidance a method for controlling text-to-image diffusion models with learned signals. Readout Guidance uses readout heads lightweight networks trained to extract signals from the features of a pre-trained frozen diffusion model at every timestep. These readouts can encode single-image properties such as pose depth and edges; or higher-order properties that relate multiple images such as correspondence and appearance similarity. Furthermore by comparing the readout estimates to a user-defined target and back-propagating the gradient through the readout head these estimates can be used to guide the sampling process. Compared to prior methods for conditional generation Readout Guidance requires significantly fewer added parameters and training samples and offers a convenient and simple recipe for reproducing different forms of conditional control under a single framework with a single architecture and sampling procedure. We showcase these benefits in the applications of drag-based manipulation identity-consistent generation and spatially aligned control. + + + + A Unified Approach for Text- and Image-guided 4D Scene Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_A_Unified_Approach_for_Text-_and_Image-guided_4D_Scene_Generation_CVPR_2024_paper.pdf + Large-scale diffusion generative models are greatly simplifying image video and 3D asset creation from user provided text prompts and images. However the challenging problem of text-to-4D dynamic 3D scene generation with diffusion guidance remains largely unexplored. 
We propose Dream-in-4D which features a novel two-stage approach for text-to-4D synthesis leveraging (1) 3D and 2D diffusion guidance to effectively learn a high-quality static 3D asset in the first stage; (2) a deformable neural radiance field that explicitly disentangles the learned static asset from its deformation preserving quality during motion learning; and (3) a multi-resolution feature grid for the deformation field with a displacement total variation loss to effectively learn motion with video diffusion guidance in the second stage. Through a user preference study we demonstrate that our approach significantly advances image and motion quality 3D consistency and text fidelity for text-to-4D generation compared to baseline approaches. Thanks to its motion-disentangled representation Dream-in-4D can also be easily adapted for controllable generation where appearance is defined by one or multiple images without the need to modify the motion learning stage. Thus our method offers for the first time a unified approach for text-to-4D image-to-4D and personalized 4D generation tasks. + + + + GaussianAvatar: Towards Realistic Human Avatar Modeling from a Single Video via Animatable 3D Gaussians + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_GaussianAvatar_Towards_Realistic_Human_Avatar_Modeling_from_a_Single_Video_CVPR_2024_paper.pdf + We present GaussianAvatar an efficient approach to creating realistic human avatars with dynamic 3D appearances from a single video. We start by introducing animatable 3D Gaussians to explicitly represent humans in various poses and clothing styles. Such an explicit and animatable representation can fuse 3D appearances more efficiently and consistently from 2D observations. Our representation is further augmented with dynamic properties to support pose-dependent appearance modeling where a dynamic appearance network along with an optimizable feature tensor is designed to learn the motion-to-appearance mapping. Moreover by leveraging the differentiable motion condition our method enables a joint optimization of motions and appearances during avatar modeling which helps to tackle the long-standing issue of inaccurate motion estimation in monocular settings. The efficacy of GaussianAvatar is validated on both the public dataset and our collected dataset demonstrating its superior performances in terms of appearance quality and rendering efficiency. + + + + Mosaic-SDF for 3D Generative Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yariv_Mosaic-SDF_for_3D_Generative_Models_CVPR_2024_paper.pdf + Current diffusion or flow-based generative models for 3D shapes divide into two: distilling pre-trained 2D image diffusion models and training directly on 3D shapes. When training a diffusion or flow model on 3D shapes a crucial design choice is the shape representation. An effective shape representation needs to adhere to three design principles: it should allow an efficient conversion of large 3D datasets to the representation form; it should provide a good tradeoff of approximation power versus number of parameters; and it should have a simple tensorial form that is compatible with existing powerful neural architectures. While standard 3D shape representations such as volumetric grids and point clouds do not adhere to all these principles simultaneously we advocate in this paper a new representation that does. 
We introduce Mosaic-SDF (M-SDF): a simple 3D shape representation that approximates the Signed Distance Function (SDF) of a given shape by using a set of local grids spread near the shape's boundary. The M-SDF representation is fast to compute for each shape individually making it readily parallelizable; it is parameter efficient as it only covers the space around the shape's boundary; and it has a simple matrix form compatible with Transformer-based architectures. We demonstrate the efficacy of the M-SDF representation by using it to train a 3D generative flow model including class-conditioned generation with the ShapeNetCore-V2 (3D Warehouse) dataset and text-to-3D generation using a dataset of about 600k caption-shape pairs. + + + + Diffusion Handles Enabling 3D Edits for Diffusion Models by Lifting Activations to 3D + http://openaccess.thecvf.com//content/CVPR2024/papers/Pandey_Diffusion_Handles_Enabling_3D_Edits_for_Diffusion_Models_by_Lifting_CVPR_2024_paper.pdf + Diffusion handles is a novel approach to enable 3D object edits on diffusion images requiring only existing pre-trained diffusion models depth estimation without any fine-tuning or 3D object retrieval. The edited results remain plausible photo-real and preserve object identity. Diffusion handles address a critically missing facet of generative image-based creative design. Our key insight is to lift diffusion activations for a selected object to 3D using a proxy depth 3D-transform the depth and associated activations and project them back to image space. The diffusion process guided by the manipulated activations produces plausible edited images showing complex 3D occlusion and lighting effects. We evaluate diffusion handles: quantitatively on a large synthetic data benchmark; and qualitatively by a user study showing our output to be more plausible and better than prior art at both 3D editing and identity control. + + + + Friendly Sharpness-Aware Minimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Friendly_Sharpness-Aware_Minimization_CVPR_2024_paper.pdf + Sharpness-Aware Minimization (SAM) has been instrumental in improving deep neural network training by minimizing both training loss and loss sharpness. Despite the practical success the mechanisms behind SAM's generalization enhancements remain elusive limiting its progress in deep learning optimization. In this work we investigate SAM's core components for generalization improvement and introduce "Friendly-SAM" (F-SAM) to further enhance SAM's generalization. Our investigation reveals the key role of batch-specific stochastic gradient noise within the adversarial perturbation i.e. the current minibatch gradient which significantly influences SAM's generalization performance. By decomposing the adversarial perturbation in SAM into full gradient and stochastic gradient noise components we discover that relying solely on the full gradient component degrades generalization while excluding it leads to improved performance. The possible reason lies in the full gradient component's increase in sharpness loss for the entire dataset creating inconsistencies with the subsequent sharpness minimization step solely on the current minibatch data. Inspired by these insights F-SAM aims to mitigate the negative effects of the full gradient component. It removes the full gradient estimated by an exponentially moving average (EMA) of historical stochastic gradients and then leverages stochastic gradient noise for improved generalization. 
Moreover we provide theoretical validation for the EMA approximation and prove the convergence of F-SAM on non-convex problems. Extensive experiments demonstrate the superior generalization performance and robustness of F-SAM over vanilla SAM. Code is available at https://github.com/nblt/F-SAM. + + + + BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_BIVDiff_A_Training-Free_Framework_for_General-Purpose_Video_Synthesis_via_Bridging_CVPR_2024_paper.pdf + Diffusion models have made tremendous progress in text-driven image and video generation. Now text-to-image foundation models are widely applied to various downstream image synthesis tasks such as controllable image generation and image editing while downstream video synthesis tasks are less explored for several reasons. First it requires huge memory and computation overhead to train a video generation foundation model. Even with video foundation models additional costly training is still required for downstream video synthesis tasks. Second although some works extend image diffusion models into videos in a training-free manner temporal consistency cannot be well preserved. Finally these adaptation methods are specifically designed for one task and fail to generalize to different tasks. To mitigate these issues we propose a training-free general-purpose video synthesis framework coined as BIVDiff via bridging specific image diffusion models and general text-to-video foundation diffusion models. Specifically we first use a specific image diffusion model (e.g. ControlNet and Instruct Pix2Pix) for frame-wise video generation then perform Mixed Inversion on the generated video and finally input the inverted latents into the video diffusion models (e.g. VidRD and ZeroScope) for temporal smoothing. This decoupled framework enables flexible image model selection for different purposes with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff we perform a wide range of video synthesis tasks including controllable video generation video editing video inpainting and outpainting. + + + + NC-TTT: A Noise Contrastive Approach for Test-Time Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Osowiechi_NC-TTT_A_Noise_Constrastive_Approach_for_Test-Time_Training_CVPR_2024_paper.pdf + Despite their exceptional performance in vision tasks deep learning models often struggle when faced with domain shifts during testing. Test-Time Training (TTT) methods have recently gained popularity by their ability to enhance the robustness of models through the addition of an auxiliary objective that is jointly optimized with the main task. Being strictly unsupervised this auxiliary objective is used at test time to adapt the model without any access to labels. In this work we propose Noise-Contrastive Test-Time Training (NC-TTT) a novel unsupervised TTT technique based on the discrimination of noisy feature maps. By learning to classify noisy views of projected feature maps and then adapting the model accordingly on new domains classification performance can be recovered by an important margin. Experiments on several popular test-time adaptation baselines demonstrate the advantages of our method compared to recent approaches for this task. 
The code can be found at: https://github.com/GustavoVargasHakim/NCTTT.git + + + + Small Scale Data-Free Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Small_Scale_Data-Free_Knowledge_Distillation_CVPR_2024_paper.pdf + Data-free knowledge distillation is able to utilize the knowledge learned by a large teacher network to augment the training of a smaller student network without accessing the original training data avoiding privacy security and proprietary risks in real applications. In this line of research existing methods typically follow an inversion-and-distillation paradigm in which a generative adversarial network on-the-fly trained with the guidance of the pre-trained teacher network is used to synthesize a large-scale sample set for knowledge distillation. In this paper we reexamine this common data-free knowledge distillation paradigm showing that there is considerable room to improve the overall training efficiency through a lens of "small-scale inverted data for knowledge distillation". In light of three empirical observations indicating the importance of how to balance class distributions in terms of synthetic sample diversity and difficulty during both data inversion and distillation processes we propose Small Scale Data-free Knowledge Distillation (SSD-KD). In formulation SSD-KD introduces a modulating function to balance synthetic samples and a priority sampling function to select proper samples facilitated by a dynamic replay buffer and a reinforcement learning strategy. As a result SSD-KD can perform distillation training conditioned on an extremely small scale of synthetic samples (e.g. 10x less than the original training data scale) making the overall training efficiency one or two orders of magnitude faster than many mainstream methods while retaining superior or competitive model performance as demonstrated on popular image classification and semantic segmentation benchmarks. The code is available at https://github.com/OSVAI/SSD-KD. + + + + CFPL-FAS: Class Free Prompt Learning for Generalizable Face Anti-spoofing + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_CFPL-FAS_Class_Free_Prompt_Learning_for_Generalizable_Face_Anti-spoofing_CVPR_2024_paper.pdf + Domain generalization (DG) based Face Anti-Spoofing (FAS) aims to improve the model's performance on unseen domains. Existing methods either rely on domain labels to align domain-invariant feature spaces or disentangle generalizable features from the whole sample which inevitably lead to the distortion of semantic feature structures and achieve limited generalization. In this work we make use of large-scale VLMs like CLIP and leverage the textual feature to dynamically adjust the classifier's weights for exploring generalizable visual features. Specifically we propose a novel Class Free Prompt Learning (CFPL) paradigm for DG FAS which utilizes two lightweight transformers namely Content Q-Former (CQF) and Style Q-Former (SQF) to learn the different semantic prompts conditioned on content and style features by using a set of learnable query vectors respectively. Thus the generalizable prompt can be learned by two improvements: (1) A Prompt-Text Matched (PTM) supervision is introduced to ensure CQF learns visual representation that is most informative of the content description. (2) A Diversified Style Prompt (DSP) technology is proposed to diversify the learning of style prompts by mixing feature statistics between instance-specific styles. 
Finally the learned text features modulate visual features to generalization through the designed Prompt Modulation (PM). Extensive experiments show that the CFPL is effective and outperforms the state-of-the-art methods on several cross-domain datasets. + + + + Open Vocabulary Semantic Scene Sketch Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Bourouis_Open_Vocabulary_Semantic_Scene_Sketch_Understanding_CVPR_2024_paper.pdf + We study the underexplored but fundamental vision problem of machine understanding of abstract freehand scene sketches. We introduce a sketch encoder that results in semantically-aware feature space which we evaluate by testing its performance on a semantic sketch segmentation task. To train our model we rely only on the availability of bitmap sketches with their brief captions and do not require any pixel-level annotations. To obtain generalization to a large set of sketches and categories we build on a vision transformer encoder pretrained with the CLIP model. We freeze the text encoder and perform visual-prompt tuning of the visual encoder branch while introducing a set of critical modifications. Firstly we augment the classical key-query (k-q) self-attention blocks with value-value (v-v) self-attention blocks. Central to our model is a two-level hierarchical network design that enables efficient semantic disentanglement: The first level ensures holistic scene sketch encoding and the second level focuses on individual categories. We then in the second level of the hierarchy introduce a cross-attention between textual and visual branches. Our method outperforms zero-shot CLIP pixel accuracy of segmentation results by 37 points reaching an accuracy of 85.5% on the FS-COCO sketch dataset. Finally we conduct a user study that allows us to identify further improvements needed over our method to reconcile machine and human understanding of scene sketches. + + + + IntrinsicAvatar: Physically Based Inverse Rendering of Dynamic Humans from Monocular Videos via Explicit Ray Tracing + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_IntrinsicAvatar_Physically_Based_Inverse_Rendering_of_Dynamic_Humans_from_Monocular_CVPR_2024_paper.pdf + We present IntrinsicAvatar a novel approach to recovering the intrinsic properties of clothed human avatars including geometry albedo material and environment lighting from only monocular videos. Recent advancements in human-based neural rendering have enabled high-quality geometry and appearance reconstruction of clothed humans from just monocular videos. However these methods bake intrinsic properties such as albedo material and environment lighting into a single entangled neural representation. On the other hand only a handful of works tackle the problem of estimating geometry and disentangled appearance properties of clothed humans from monocular videos. They usually achieve limited quality and disentanglement due to approximations of secondary shading effects via learned MLPs. In this work we propose to model secondary shading effects explicitly via Monte-Carlo ray tracing. We model the rendering process of clothed humans as a volumetric scattering process and combine ray tracing with body articulation. Our approach can recover high-quality geometry albedo material and lighting properties of clothed humans from a single monocular video without requiring supervised pre-training using ground truth materials. 
Furthermore since we explicitly model the volumetric scattering process and ray tracing our model naturally generalizes to novel poses enabling animation of the reconstructed avatar in novel lighting conditions. + + + + Efficient Detection of Long Consistent Cycles and its Application to Distributed Synchronization + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Efficient_Detection_of_Long_Consistent_Cycles_and_its_Application_to_CVPR_2024_paper.pdf + Group synchronization plays a crucial role in global pipelines for Structure from Motion (SfM). Its formulation is nonconvex and it is faced with highly corrupted measurements. Cycle consistency has been effective in addressing these challenges. However computationally efficient solutions are needed for cycles longer than three especially in practical scenarios where 3-cycles are unavailable. To overcome this computational bottleneck we propose an algorithm for group synchronization that leverages information from cycles of lengths ranging from three to six with a complexity of O(n^3) (or O(n^2.373) when using a faster matrix multiplication algorithm). We establish non-trivial theory for this and related methods that achieves competitive sample complexity assuming the uniform corruption model. To advocate the practical need for our method we consider distributed group synchronization which requires at least 4-cycles and we illustrate state-of-the-art performance by our method in this context. + + + + Vlogger: Make Your Dream A Vlog + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhuang_Vlogger_Make_Your_Dream_A_Vlog_CVPR_2024_paper.pdf + In this work we present Vlogger a generic AI system for generating a minute-level video blog (i.e. vlog) of user descriptions. Different from short videos with a few seconds vlog often contains a complex storyline with diversified scenes which is challenging for most existing video generation approaches. To break through this bottleneck our Vlogger smartly leverages Large Language Model (LLM) as Director and decomposes a long video generation task of vlog into four key stages where we invoke various foundation models to play the critical roles of vlog professionals including (1) Script (2) Actor (3) ShowMaker and (4) Voicer. With such a design of mimicking human beings our Vlogger can generate vlogs through explainable cooperation of top-down planning and bottom-up shooting. Moreover we introduce a novel video diffusion model ShowMaker which serves as a videographer in our Vlogger for generating the video snippet of each shooting scene. By incorporating Script and Actor attentively as textual and visual prompts it can effectively enhance spatial-temporal coherence in the snippet. Besides we design a concise mixed training paradigm for ShowMaker boosting its capacity for both T2V generation and prediction. Finally the extensive experiments show that our method achieves state-of-the-art performance on zero-shot T2V generation and prediction tasks. More importantly Vlogger can generate over 5-minute vlogs from open-world descriptions without loss of video coherence on script and actor. + + + + Neural 3D Strokes: Creating Stylized 3D Scenes with Vectorized 3D Strokes + http://openaccess.thecvf.com//content/CVPR2024/papers/Duan_Neural_3D_Strokes_Creating_Stylized_3D_Scenes_with_Vectorized_3D_CVPR_2024_paper.pdf + We present Neural 3D Strokes a novel technique to generate stylized images of a 3D scene at arbitrary novel views from multi-view 2D images. 
Different from existing methods which apply stylization to trained neural radiance fields at the voxel level our approach draws inspiration from image-to-painting methods simulating the progressive painting process of human artwork with vector strokes. We develop a palette of stylized 3D strokes from basic primitives and splines and consider the 3D scene stylization task as a multi-view reconstruction process based on these 3D stroke primitives. Instead of directly searching for the parameters of these 3D strokes which would be too costly we introduce a differentiable renderer that allows optimizing stroke parameters using gradient descent and propose a training scheme to alleviate the vanishing gradient issue. The extensive evaluation demonstrates that our approach effectively synthesizes 3D scenes with significant geometric and aesthetic stylization while maintaining a consistent appearance across different views. Our method can be further integrated with style loss and image-text contrastive models to extend its applications including color transfer and text-driven 3D scene drawing. Results and code are available at http://buaavrcg.github.io/Neural3DStrokes. + + + + Multi-Object Tracking in the Dark + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Multi-Object_Tracking_in_the_Dark_CVPR_2024_paper.pdf + Low-light scenes are prevalent in real-world applications (e.g. autonomous driving and surveillance at night). Recently multi-object tracking in various practical use cases have received much attention but multi-object tracking in dark scenes is rarely considered. In this paper we focus on multi-object tracking in dark scenes. To address the lack of datasets we first build a Low-light Multi-Object Tracking (LMOT) dataset. LMOT provides well-aligned low-light video pairs captured by our dual-camera system and high-quality multi-object tracking annotations for all videos. Then we propose a low-light multi-object tracking method termed as LTrack. We introduce the adaptive low-pass downsample module to enhance low-frequency components of images outside the sensor noises. The degradation suppression learning strategy enables the model to learn invariant information under noise disturbance and image quality degradation. These components improve the robustness of multi-object tracking in dark scenes. We conducted a comprehensive analysis of our LMOT dataset and proposed LTrack. Experimental results demonstrate the superiority of the proposed method and its competitiveness in real night low-light scenes. Dataset and Code: https://github.com/ying-fu/LMOT + + + + UniHuman: A Unified Model For Editing Human Images in the Wild + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_UniHuman_A_Unified_Model_For_Editing_Human_Images_in_the_CVPR_2024_paper.pdf + Human image editing includes tasks like changing a person's pose their clothing or editing the image according to a text prompt. However prior work often tackles these tasks separately overlooking the benefit of mutual reinforcement from learning them jointly. In this paper we propose UniHuman a unified model that addresses multiple facets of human image editing in real-world settings. To enhance the model's generation quality and generalization capacity we leverage guidance from human visual encoders and introduce a lightweight pose-warping module that can exploit different pose representations accommodating unseen textures and patterns. 
Furthermore to bridge the disparity between existing human editing benchmarks with real-world data we curated 400K high-quality human image-text pairs for training and collected 2K human images for out-of-domain testing both encompassing diverse clothing styles backgrounds and age groups. Experiments on both in-domain and out-of-domain test sets demonstrate that UniHuman outperforms task-specific models by a significant margin. In user studies UniHuman is preferred by the users in an average of 77% of cases. Our project is available at https://github.com/NannanLi999/UniHuman. + + + + DiffAgent: Fast and Accurate Text-to-Image API Selection with Large Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_DiffAgent_Fast_and_Accurate_Text-to-Image_API_Selection_with_Large_Language_CVPR_2024_paper.pdf + Text-to-image (T2I) generative models have attracted significant attention and found extensive applications within and beyond academic research. For example the Civitai community a platform for T2I innovation currently hosts an impressive array of 74492 distinct models. However this diversity presents a formidable challenge in selecting the most appropriate model and parameters a process that typically requires numerous trials. Drawing inspiration from the tool usage research of large language models (LLMs) we introduce DiffAgent an LLM agent designed to screen the accurate selection in seconds via API calls. DiffAgent leverages a novel two-stage training framework SFTA enabling it to accurately align T2I API responses with user input in accordance with human preferences. To train and evaluate DiffAgent's capabilities we present DABench a comprehensive dataset encompassing an extensive range of T2I APIs from the community. Our evaluations reveal that DiffAgent not only excels in identifying the appropriate T2I API but also underscores the effectiveness of the SFTA training framework. Codes are available at https://github.com/OpenGVLab/DiffAgent. + + + + In Search of a Data Transformation That Accelerates Neural Field Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Seo_In_Search_of_a_Data_Transformation_That_Accelerates_Neural_Field_CVPR_2024_paper.pdf + Neural field is an emerging paradigm in data representation that trains a neural network to approximate the given signal. A key obstacle that prevents its widespread adoption is the encoding speed---generating neural fields requires an overfitting of a neural network which can take a significant number of SGD steps to reach the desired fidelity level. In this paper we delve into the impacts of data transformations on the speed of neural field training specifically focusing on how permuting pixel locations affect the convergence speed of SGD. Counterintuitively we find that randomly permuting the pixel locations can considerably accelerate the training. To explain this phenomenon we examine the neural field training through the lens of PSNR curves loss landscapes and error patterns. Our analyses suggest that the random pixel permutations remove the easy-to-fit patterns which facilitate easy optimization in the early stage but hinder capturing fine details of the signal. 
+ + + + Zero-Painter: Training-Free Layout Control for Text-to-Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Ohanyan_Zero-Painter_Training-Free_Layout_Control_for_Text-to-Image_Synthesis_CVPR_2024_paper.pdf + We present Zero-Painter a novel training-free framework for layout-conditional text-to-image synthesis that facilitates the creation of detailed and controlled imagery from textual prompts. Our method utilizes object masks and individual descriptions coupled with a global text prompt to generate images with high fidelity. Zero-Painter employs a two-stage process involving our novel Prompt-Adjusted Cross-Attention (PACA) and Region-Grouped Cross-Attention (ReGCA) blocks ensuring precise alignment of generated objects with textual prompts and mask shapes. Our extensive experiments demonstrate that Zero-Painter surpasses current state-of-the-art methods in preserving textual details and adhering to mask shapes. We will make the codes and the models publicly available. + + + + Towards 3D Vision with Low-Cost Single-Photon Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Mu_Towards_3D_Vision_with_Low-Cost_Single-Photon_Cameras_CVPR_2024_paper.pdf + We present a method for reconstructing 3D shape of arbitrary Lambertian objects based on measurements by miniature energy-efficient low-cost single-photon cameras. These cameras operating as time resolved image sensors illuminate the scene with a very fast pulse of diffuse light and record the shape of that pulse as it returns back from the scene at a high temporal resolution. We propose to model this image formation process account for its non-idealities and adapt neural rendering to reconstruct 3D geometry from a set of spatially distributed sensors with known poses. We show that our approach can successfully recover complex 3D shapes from simulated data. We further demonstrate 3D object reconstruction from real-world captures utilizing measurements from a commodity proximity sensor. Our work draws a connection between image-based modeling and active range scanning and offers a step towards 3D vision with single-photon cameras. + + + + WonderJourney: Going from Anywhere to Everywhere + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_WonderJourney_Going_from_Anywhere_to_Everywhere_CVPR_2024_paper.pdf + We introduce WonderJourney a modular framework for perpetual 3D scene generation. Unlike prior work on view generation that focuses on a single type of scenes we start at any user-provided location (by a text description or an image) and generate a journey through a long sequence of diverse yet coherently connected 3D scenes. We leverage an LLM to generate textual descriptions of the scenes in this journey a text-driven point cloud generation pipeline to make a compelling and coherent sequence of 3D scenes and a large VLM to verify the generated scenes. We show compelling diverse visual results across various scene types and styles forming imaginary "wonderjourneys". Project website: https://kovenyu.com/WonderJourney. + + + + 4D-fy: Text-to-4D Generation Using Hybrid Score Distillation Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Bahmani_4D-fy_Text-to-4D_Generation_Using_Hybrid_Score_Distillation_Sampling_CVPR_2024_paper.pdf + Recent breakthroughs in text-to-4D generation rely on pre-trained text-to-image and text-to-video models to generate dynamic 3D scenes. 
However current text-to-4D methods face a three-way tradeoff between the quality of scene appearance 3D structure and motion. For example text-to-image models and their 3D-aware variants are trained on internet-scale image datasets and can be used to produce scenes with realistic appearance and 3D structure---but no motion. Text-to-video models are trained on relatively smaller video datasets and can produce scenes with motion but poorer appearance and 3D structure. While these models have complementary strengths they also have opposing weaknesses making it difficult to combine them in a way that alleviates this three-way tradeoff. Here we introduce hybrid score distillation sampling an alternating optimization procedure that blends supervision signals from multiple pre-trained diffusion models and incorporates benefits of each for high-fidelity text-to-4D generation. Using hybrid SDS we demonstrate synthesis of 4D scenes with compelling appearance 3D structure and motion. + + + + FreeControl: Training-Free Spatial Control of Any Text-to-Image Diffusion Model with Any Condition + http://openaccess.thecvf.com//content/CVPR2024/papers/Mo_FreeControl_Training-Free_Spatial_Control_of_Any_Text-to-Image_Diffusion_Model_with_CVPR_2024_paper.pdf + Recent approaches such as ControlNet offer users fine-grained spatial control over text-to-image (T2I) diffusion models. However auxiliary modules have to be trained for each spatial condition type model architecture and checkpoint putting them at odds with the diverse intents and preferences a human designer would like to convey to the AI models during the content creation process. In this work we present FreeControl a training-free approach for controllable T2I generation that supports multiple conditions architectures and checkpoints simultaneously. FreeControl enforces structure guidance to facilitate the global alignment with a guidance image and appearance guidance to collect visual details from images generated without control. Extensive qualitative and quantitative experiments demonstrate the superior performance of FreeControl across a variety of pre-trained T2I models. In particular FreeControl enables convenient training-free control over many different architectures and checkpoints allows the challenging input conditions on which most of the existing training-free methods fail and achieves competitive synthesis quality compared to training-based approaches. Project page:https://genforce.github.io/freecontrol/. + + + + VMC: Video Motion Customization using Temporal Attention Adaption for Text-to-Video Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Jeong_VMC_Video_Motion_Customization_using_Temporal_Attention_Adaption_for_Text-to-Video_CVPR_2024_paper.pdf + Text-to-video diffusion models have advanced video generation significantly. However customizing these models to generate videos with tailored motions presents a substantial challenge. In specific they encounter hurdles in (1) accurately reproducing motion from a target video and (2) creating diverse visual variations. For example straightforward extensions of static image customization methods to video often lead to intricate entanglements of appearance and motion data. To tackle this here we present the Video Motion Customization (VMC) framework a novel one-shot tuning approach crafted to adapt temporal attention layers within video diffusion models. 
Our approach introduces a novel motion distillation objective using residual vectors between consecutive noisy latent frames as a motion reference. The diffusion process then preserves low-frequency motion trajectories while mitigating high-frequency motion-unrelated noise in image space. We validate our method against state-of-the-art video generative models across diverse real-world motions and contexts. Our codes and data can be found at: https://video-motion-customization.github.io/ + + + + DistriFusion: Distributed Parallel Inference for High-Resolution Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_DistriFusion_Distributed_Parallel_Inference_for_High-Resolution_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion models have achieved great success in synthesizing high-quality images. However generating high-resolution images with diffusion models is still challenging due to the enormous computational costs resulting in a prohibitive latency for interactive applications. In this paper we propose DistriFusion to tackle this problem by leveraging parallelism across multiple GPUs. Our method splits the model input into multiple patches and assigns each patch to a GPU. However naively implementing such an algorithm breaks the interaction between patches and loses fidelity while incorporating such an interaction will incur tremendous communication overhead. To overcome this dilemma we observe the high similarity between the input from adjacent diffusion steps and propose Displaced Patch Parallelism which takes advantage of the sequential nature of the diffusion process by reusing the pre-computed feature maps from the previous timestep to provide context for the current step. Therefore our method supports asynchronous communication which can be pipelined by computation. Extensive experiments show that our method can be applied to recent Stable Diffusion XL with no quality degradation and achieve up to a 6.1x speedup on eight NVIDIA A100s compared to one. Our code is publicly available at https://github.com/mit-han-lab/distrifuser. + + + + AZ-NAS: Assembling Zero-Cost Proxies for Network Architecture Search + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_AZ-NAS_Assembling_Zero-Cost_Proxies_for_Network_Architecture_Search_CVPR_2024_paper.pdf + Training-free network architecture search (NAS) aims to discover high-performing networks with zero-cost proxies capturing network characteristics related to the final performance. However network rankings estimated by previous training-free NAS methods have shown weak correlations with the performance. To address this issue we propose AZ-NAS a novel approach that leverages the ensemble of various zero-cost proxies to enhance the correlation between a predicted ranking of networks and the ground truth substantially in terms of the performance. To achieve this we introduce four novel zero-cost proxies that are complementary to each other analyzing distinct traits of architectures in the views of expressivity progressivity trainability and complexity. The proxy scores can be obtained simultaneously within a single forward and backward pass making an overall NAS process highly efficient. In order to integrate the rankings predicted by our proxies effectively we introduce a non-linear ranking aggregation method that highlights the networks highly-ranked consistently across all the proxies. 
Experimental results conclusively demonstrate the efficacy and efficiency of AZ-NAS outperforming state-of-the-art methods on standard benchmarks all while maintaining a reasonable runtime cost. + + + + Improving Physics-Augmented Continuum Neural Radiance Field-Based Geometry-Agnostic System Identification with Lagrangian Particle Optimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Kaneko_Improving_Physics-Augmented_Continuum_Neural_Radiance_Field-Based_Geometry-Agnostic_System_Identification_with_CVPR_2024_paper.pdf + Geometry-agnostic system identification is a technique for identifying the geometry and physical properties of an object from video sequences without any geometric assumptions. Recently physics-augmented continuum neural radiance fields (PAC-NeRF) has demonstrated promising results for this technique by utilizing a hybrid Eulerian-Lagrangian representation in which the geometry is represented by the Eulerian grid representations of NeRF the physics is described by a material point method (MPM) and they are connected via Lagrangian particles. However a notable limitation of PAC-NeRF is that its performance is sensitive to the learning of the geometry from the first frames owing to its two-step optimization. First the grid representations are optimized with the first frames of video sequences and then the physical properties are optimized through video sequences utilizing the fixed first-frame grid representations. This limitation can be critical when learning of the geometric structure is difficult for example in a few-shot (sparse view) setting. To overcome this limitation we propose Lagrangian particle optimization (LPO) in which the positions and features of particles are optimized through video sequences in Lagrangian space. This method allows for the optimization of the geometric structure across the entire video sequence within the physical constraints imposed by the MPM. The experimental results demonstrate that the LPO is useful for geometric correction and physical identification in sparse-view settings. + + + + Beyond Image Super-Resolution for Image Recognition with Task-Driven Perceptual Loss + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Beyond_Image_Super-Resolution_for_Image_Recognition_with_Task-Driven_Perceptual_Loss_CVPR_2024_paper.pdf + In real-world scenarios image recognition tasks such as semantic segmentation and object detection often pose greater challenges due to the lack of information available within low-resolution (LR) content. Image super-resolution (SR) is one of the promising solutions for addressing the challenges. However due to the ill-posed property of SR it is challenging for typical SR methods to restore task-relevant high-frequency contents which may dilute the advantage of utilizing the SR method. Therefore in this paper we propose Super-Resolution for Image Recognition (SR4IR) that effectively guides the generation of SR images beneficial to achieving satisfactory image recognition performance when processing LR images. The critical component of our SR4IR is the task-driven perceptual (TDP) loss that enables the SR network to acquire task-specific knowledge from a network tailored for a specific task. Moreover we propose a cross-quality patch mix and an alternate training framework that significantly enhances the efficacy of the TDP loss by addressing potential problems when employing the TDP loss. 
Through extensive experiments we demonstrate that our SR4IR achieves outstanding task performance by generating SR images useful for a specific image recognition task including semantic segmentation object detection and image classification. The implementation code is available at https://github.com/JaehaKim97/SR4IR. + + + + XCube: Large-Scale 3D Generative Modeling using Sparse Voxel Hierarchies + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_XCube_Large-Scale_3D_Generative_Modeling_using_Sparse_Voxel_Hierarchies_CVPR_2024_paper.pdf + We present XCube a novel generative model for high-resolution sparse 3D voxel grids with arbitrary attributes. Our model can generate millions of voxels with a finest effective resolution of up to 1024^3 in a feed-forward fashion without time-consuming test-time optimization. To achieve this we employ a hierarchical voxel latent diffusion model which generates progressively higher resolution grids in a coarse-to-fine manner using a custom framework built on the highly efficient VDB data structure. Apart from generating high-resolution objects we demonstrate the effectiveness of XCube on large outdoor scenes at scales of 100m x 100m with a voxel size as small as 10cm. We observe clear qualitative and quantitative improvements over past approaches. In addition to unconditional generation we show that our model can be used to solve a variety of tasks such as user-guided editing scene completion from a single scan and text-to-3D. + + + + Reconstruction-free Cascaded Adaptive Compressive Sensing + http://openaccess.thecvf.com//content/CVPR2024/papers/Qiu_Reconstruction-free_Cascaded_Adaptive_Compressive_Sensing_CVPR_2024_paper.pdf + Scene-aware Adaptive Compressive Sensing (ACS) has constituted a persistent pursuit holding substantial promise for the enhancement of Compressive Sensing (CS) performance. Cascaded ACS furnishes a proficient multi-stage framework for adaptively allocating the CS sampling based on previous CS measurements. However reconstruction is commonly required for analyzing and steering the successive CS sampling which bottlenecks the ACS speed and impedes the practical application in time-sensitive scenarios. Addressing this challenge we propose a reconstruction-free cascaded ACS method which requires NO reconstruction during the adaptive sampling process. A lightweight Score Network (ScoreNet) is proposed to directly determine the ACS allocation with previous CS measurements and a differentiable adaptive sampling module is proposed for end-to-end training. For image reconstruction we propose a Multi-Grid Spatial-Attention Network (MGSANet) that could facilitate efficient multi-stage training and inferencing. By introducing the reconstruction-fidelity supervision outside the loop of the multi-stage sampling process ACS can be efficiently optimized and achieve high imaging fidelity. The effectiveness of the proposed method is demonstrated with extensive quantitative and qualitative experiments compared with the state-of-the-art CS algorithms. + + + + USE: Universal Segment Embeddings for Open-Vocabulary Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_USE_Universal_Segment_Embeddings_for_Open-Vocabulary_Image_Segmentation_CVPR_2024_paper.pdf + The open-vocabulary image segmentation task involves partitioning images into semantically meaningful segments and classifying them with flexible text-defined categories. 
The recent vision-based foundation models such as the Segment Anything Model (SAM) have shown superior performance in generating class-agnostic image segments. The main challenge in open-vocabulary image segmentation now lies in accurately classifying these segments into text-defined categories. In this paper we introduce the Universal Segment Embedding (USE) framework to address this challenge. This framework is comprised of two key components: 1) a data pipeline designed to efficiently curate a large amount of segment-text pairs at various granularities and 2) a universal segment embedding model that enables precise segment classification into a vast range of text-defined categories. The USE model can not only help open-vocabulary image segmentation but also facilitate other downstream tasks (e.g. querying and ranking). Through comprehensive experimental studies on semantic segmentation and part segmentation benchmarks we demonstrate that the USE framework outperforms state-of-the-art open-vocabulary segmentation methods. + + + + Functional Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Functional_Diffusion_CVPR_2024_paper.pdf + We propose functional diffusion a generative diffusion model focused on infinite-dimensional function data samples. In contrast to previous work functional diffusion works on samples that are represented by functions with a continuous domain. Functional diffusion can be seen as an extension of classical diffusion models to an infinite-dimensional domain. Functional diffusion is very versatile as images videos audio 3D shapes deformations etc. can be handled by the same framework with minimal changes. In addition functional diffusion is especially suited for irregular data or data defined in non-standard domains. In our work we derive the necessary foundations for functional diffusion and propose a first implementation based on the transformer architecture. We show generative results on complicated signed distance functions and deformation functions defined on 3D surfaces. + + + + Wired Perspectives: Multi-View Wire Art Embraces Generative AI + http://openaccess.thecvf.com//content/CVPR2024/papers/Qu_Wired_Perspectives_Multi-View_Wire_Art_Embraces_Generative_AI_CVPR_2024_paper.pdf + Creating multi-view wire art (MVWA) a static 3D sculpture with diverse interpretations from different viewpoints is a complex task even for skilled artists. In response we present DreamWire an AI system enabling everyone to craft MVWA easily. Users express their vision through text prompts or scribbles freeing them from intricate 3D wire organisation. Our approach synergises 3D Bezier curves Prim's algorithm and knowledge distillation from diffusion models or their variants (e.g. ControlNet). This blend enables the system to represent 3D wire art ensuring spatial continuity and overcoming data scarcity. Extensive evaluation and analysis are conducted to shed insight on the inner workings of the proposed system including the trade-off between connectivity and visual aesthetics. + + + + Leveraging Camera Triplets for Efficient and Accurate Structure-from-Motion + http://openaccess.thecvf.com//content/CVPR2024/papers/Manam_Leveraging_Camera_Triplets_for_Efficient_and_Accurate_Structure-from-Motion_CVPR_2024_paper.pdf + In Structure-from-Motion (SfM) the underlying viewgraphs of unordered image collections generally have a highly redundant set of edges that can be sparsified for efficiency without significant loss of reconstruction quality. 
Often there are also false edges due to incorrect image retrieval and repeated structures (symmetries) that give rise to ghosting and superimposed reconstruction artifacts. We present a unified method to simultaneously sparsify the viewgraph and remove false edges. We propose a scoring mechanism based on camera triplets that identifies edge redundancy as well as false edges. Our edge selection is formulated as an optimization problem which can be provably solved using a simple thresholding scheme. This results in a highly efficient algorithm which can be incorporated as a pre-processing step into any SfM pipeline making it practically usable. We demonstrate the utility of our method on generic and ambiguous datasets that cover the range of small medium and large-scale datasets all with different statistical properties. Sparsification of generic datasets using our method significantly reduces reconstruction time while maintaining the accuracy of the reconstructions as well as removing ghosting artifacts. For ambiguous datasets our method removes false edges thereby avoiding incorrect superimposed reconstructions. + + + + SimDA: Simple Diffusion Adapter for Efficient Video Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Xing_SimDA_Simple_Diffusion_Adapter_for_Efficient_Video_Generation_CVPR_2024_paper.pdf + The recent wave of AI-generated content has witnessed the great development and success of Text-to-Image (T2I) technologies. By contrast Text-to-Video (T2V) still falls short of expectations though attracting increasing interest. Existing works either train from scratch or adapt large T2I model to videos both of which are computation and resource expensive. In this work we propose a Simple Diffusion Adapter (SimDA) that fine-tunes only 24M out of 1.1B parameters of a strong T2I model adapting it to video generation in a parameter-efficient way. In particular we turn the T2I model for T2V by designing light-weight spatial and temporal adapters for transfer learning. Besides we change the original spatial attention to the proposed Latent-Shift Attention (LSA) for temporal consistency. With a similar model architecture we further train a video super-resolution model to generate high-definition (1024 x 1024) videos. In addition to T2V generation in the wild SimDA could also be utilized in one-shot video editing with only 2 minutes tuning. Doing so our method could minimize the training effort with extremely few tunable parameters for model adaptation. + + + + Multi-view Aggregation Network for Dichotomous Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Multi-view_Aggregation_Network_for_Dichotomous_Image_Segmentation_CVPR_2024_paper.pdf + Dichotomous Image Segmentation (DIS) has recently emerged towards high-precision object segmentation from high-resolution natural images. When designing an effective DIS model the main challenge is how to balance the semantic dispersion of high-resolution targets in the small receptive field and the loss of high-precision details in the large receptive field. Existing methods rely on tedious multiple encoder-decoder streams and stages to gradually complete the global localization and local refinement. Human visual system captures regions of interest by observing them from multiple views. 
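A minimal sketch of the kind of light-weight adapter SimDA attaches to a frozen T2I block for parameter-efficient transfer. The bottleneck width, placement, and zero-initialised up-projection are common adapter conventions assumed here for illustration rather than details taken from the paper.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual add."""
    def __init__(self, dim, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)  # start as identity so the frozen model is unchanged
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))

tokens = torch.rand(1, 77, 320)   # token sequence from a frozen block (batch, tokens, channels)
adapted = Adapter(320)(tokens)    # only the adapter's small parameter set is trained
print(adapted.shape)              # torch.Size([1, 77, 320])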
Inspired by it we model DIS as a multi-view object perception problem and provide a parsimonious multi-view aggregation network (MVANet) which unifies the feature fusion of the distant view and close-up view into a single stream with one encoder-decoder structure. With the help of the proposed multi-view complementary localization and refinement modules our approach established long-range profound visual interactions across multiple views allowing the features of the detailed close-up view to focus on highly slender structures. Experiments on the popular DIS-5K dataset show that our MVANet significantly outperforms state-of-the-art methods in both accuracy and speed. The source code and datasets will be publicly available at \href https://github.com/qianyu-dlut/MVANet MVANet . + + + + A Recipe for Scaling up Text-to-Video Generation with Text-free Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_A_Recipe_for_Scaling_up_Text-to-Video_Generation_with_Text-free_Videos_CVPR_2024_paper.pdf + Diffusion-based text-to-video generation has witnessed impressive progress in the past year yet still falls behind text-to-image generation. One of the key reasons is the limited scale of publicly available data (e.g. 10M video-text pairs in WebVid10M vs. 5B image-text pairs in LAION) considering the high cost of video captioning. Instead it could be far easier to collect unlabeled clips from video platforms like YouTube. Motivated by this we come up with a novel text-to-video generation framework termed TF-T2V which can directly learn with text-free videos. The rationale behind is to separate the process of text decoding from that of temporal modeling. To this end we employ a content branch and a motion branch which are jointly optimized with weights shared. Following such a pipeline we study the effect of doubling the scale of training set (i.e. video-only WebVid10M) with some randomly collected text-free videos and are encouraged to observe the performance improvement (FID from 9.67 to 8.19 and FVD from 484 to 441) demonstrating the scalability of our approach. We also find that our model could enjoy sustainable performance gain (FID from 8.19 to 7.64 and FVD from 441 to 366) after reintroducing some text labels for training. Finally we validate the effectiveness and generalizability of our ideology on both native text-to-video generation and compositional video synthesis paradigms. Code and models will be publicly available at here. + + + + Molecular Data Programming: Towards Molecule Pseudo-labeling with Systematic Weak Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Juan_Molecular_Data_Programming_Towards_Molecule_Pseudo-labeling_with_Systematic_Weak_Supervision_CVPR_2024_paper.pdf + The premise for the great advancement of molecular machine learning is dependent on a considerable amount of labeled data. In many real-world scenarios the labeled molecules are limited in quantity or laborious to derive. Recent pseudo-labeling methods are usually designed based on a single domain knowledge thereby failing to understand the comprehensive molecular configurations and limiting their adaptability to generalize across diverse biochemical context. To this end we introduce an innovative paradigm for dealing with the molecule pseudo-labeling named as Molecular Data Programming (MDP). 
In particular we adopt systematic supervision sources via crafting multiple graph labeling functions which covers various molecular structural knowledge of graph kernels molecular fingerprints and topological features. Each of them creates an uncertain and biased labels for the unlabeled molecules. To address the decision conflicts among the diverse pseudo-labels we design a label synchronizer to differentiably model confidences and correlations between the labeling functions which yields probabilistic molecular labels to adapt for specific applications. These probabilistic molecular labels are used to train a molecular classifier for improving its generalization capability. On eight benchmark datasets we empirically demonstrate the effectiveness of MDP on the weakly supervised molecule classification tasks. + + + + Residual Denoising Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Residual_Denoising_Diffusion_Models_CVPR_2024_paper.pdf + We propose residual denoising diffusion models (RDDM) a novel dual diffusion process that decouples the traditional single denoising diffusion process into residual diffusion and noise diffusion. This dual diffusion framework expands the denoising-based diffusion models initially uninterpretable for image restoration into a unified and interpretable model for both image generation and restoration by introducing residuals. Specifically our residual diffusion represents directional diffusion from the target image to the degraded input image and explicitly guides the reverse generation process for image restoration while noise diffusion represents random perturbations in the diffusion process. The residual prioritizes certainty while the noise emphasizes diversity enabling RDDM to effectively unify tasks with varying certainty or diversity requirements such as image generation and restoration. We demonstrate that our sampling process is consistent with that of DDPM and DDIM through coefficient transformation and propose a partially path-independent generation process to better understand the reverse process. Notably our RDDM enables a generic UNet trained with only an L1 loss and a batch size of 1 to compete with state-of-the-art image restoration methods. We provide code and pre-trained models to encourage further exploration application and development of our innovative framework (https://github.com/nachifur/RDDM). + + + + Towards Accurate and Robust Architectures via Neural Architecture Search + http://openaccess.thecvf.com//content/CVPR2024/papers/Ou_Towards_Accurate_and_Robust_Architectures_via_Neural_Architecture_Search_CVPR_2024_paper.pdf + To defend deep neural networks from adversarial attacks adversarial training has been drawing increasing attention for its effectiveness. However the accuracy and robustness resulting from the adversarial training are limited by the architecture because adversarial training improves accuracy and robustness by adjusting the weight connection affiliated to the architecture. In this work we propose ARNAS to search for accurate and robust architectures for adversarial training. First we design an accurate and robust search space in which the placement of the cells and the proportional relationship of the filter numbers are carefully determined. With the design the architectures can obtain both accuracy and robustness by deploying accurate and robust structures to their sensitive positions respectively. 
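A schematic of the dual (residual plus noise) forward perturbation that RDDM describes, written with toy linear schedules purely as a reading aid; the actual coefficient schedules and sampling rules come from the paper and differ from these assumptions.

import torch

def forward_perturb(x0, x_in, t, T):
    """Perturb the clean image with a growing residual term and a growing noise term."""
    residual = x_in - x0          # directional residual toward the degraded input (certainty)
    alpha_bar = t / T             # toy residual schedule, 0 -> 1
    beta_bar = (t / T) ** 0.5     # toy noise schedule (diversity)
    noise = torch.randn_like(x0)
    return x0 + alpha_bar * residual + beta_bar * noise

x0 = torch.rand(1, 3, 32, 32)     # clean target image
x_in = torch.rand(1, 3, 32, 32)   # degraded input image (e.g. noisy or low-light)
x_t = forward_perturb(x0, x_in, t=500, T=1000)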
Then we propose a differentiable multi-objective search strategy performing gradient descent towards directions that are beneficial for both natural loss and adversarial loss thus the accuracy and robustness can be guaranteed at the same time. We conduct comprehensive experiments in terms of white-box attacks black-box attacks and transferability. Experimental results show that the searched architecture has the strongest robustness with the competitive accuracy and breaks the traditional idea that NAS-based architectures cannot transfer well to complex tasks in robustness scenarios. By analyzing outstanding architectures searched we also conclude that accurate and robust neural architectures tend to deploy different structures near the input and output which has great practical significance on both hand-crafting and automatically designing of accurate and robust architectures. + + + + Closely Interactive Human Reconstruction with Proxemics and Physics-Guided Adaption + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Closely_Interactive_Human_Reconstruction_with_Proxemics_and_Physics-Guided_Adaption_CVPR_2024_paper.pdf + Existing multi-person human reconstruction approaches mainly focus on recovering accurate poses or avoiding penetration but overlook the modeling of close interactions. In this work we tackle the task of reconstructing closely interactive humans from a monocular video. The main challenge of this task comes from insufficient visual information caused by depth ambiguity and severe inter-person occlusion. In view of this we propose to leverage knowledge from proxemic behavior and physics to compensate the lack of visual information. This is based on the observation that human interaction has specific patterns following the social proxemics. Specifically we first design a latent representation based on Vector Quantised-Variational AutoEncoder (VQ-VAE) to model human interaction. A proxemics and physics guided diffusion model is then introduced to denoise the initial distribution. We design the diffusion model as dual branch with each branch representing one individual such that the interaction can be modeled via cross attention. With the learned priors of VQ-VAE and physical constraint as the additional information our proposed approach is capable of estimating accurate poses that are also proxemics and physics plausible. Experimental results on Hi4D 3DPW and CHI3D demonstrate that our method outperforms existing approaches. The code is available at https://github.com/boycehbz/HumanInteraction. + + + + Taming Stable Diffusion for Text to 360 Panorama Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Taming_Stable_Diffusion_for_Text_to_360_Panorama_Image_Generation_CVPR_2024_paper.pdf + Generative models e.g. Stable Diffusion have enabled the creation of photorealistic images from text prompts. Yet the generation of 360-degree panorama images from text remains a challenge particularly due to the dearth of paired text-panorama data and the domain gap between panorama and perspective images. In this paper we introduce a novel dual-branch diffusion model named PanFusion to generate a 360-degree image from a text prompt. We leverage the stable diffusion model as one branch to provide prior knowledge in natural image generation and register it to another panorama branch for holistic image generation. We propose a unique cross-attention mechanism with projection awareness to minimize distortion during the collaborative denoising process. 
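To make the combined "natural loss and adversarial loss" objective concrete, here is a generic adversarial-training step using a single-step FGSM attack; this is standard adversarial training shown for illustration, with an arbitrary toy model and epsilon, not ARNAS's search strategy itself.

import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x = torch.rand(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))

# Craft adversarial examples with a one-step FGSM attack (epsilon = 8/255 is illustrative).
x_adv = x.clone().requires_grad_(True)
attack_loss = F.cross_entropy(model(x_adv), y)
grad = torch.autograd.grad(attack_loss, x_adv)[0]
x_adv = (x + 8 / 255 * grad.sign()).clamp(0, 1).detach()

# A weighted sum of natural and adversarial losses drives the weight (or architecture) update.
natural_loss = F.cross_entropy(model(x), y)
adv_loss = F.cross_entropy(model(x_adv), y)
total_loss = 0.5 * natural_loss + 0.5 * adv_loss
total_loss.backward()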
Our experiments validate that PanFusion surpasses existing methods and, thanks to its dual-branch structure, can integrate additional constraints like room layout for customized panorama outputs. + + + + Modular Blind Video Quality Assessment + http://openaccess.thecvf.com//content/CVPR2024/papers/Wen_Modular_Blind_Video_Quality_Assessment_CVPR_2024_paper.pdf + Blind video quality assessment (BVQA) plays a pivotal role in evaluating and improving the viewing experience of end-users across a wide range of video-based platforms and services. Contemporary deep learning-based models primarily analyze video content in its aggressively subsampled format, while being blind to the impact of the actual spatial resolution and frame rate on video quality. In this paper, we propose a modular BVQA model and a method of training it to improve its modularity. Our model comprises a base quality predictor, a spatial rectifier, and a temporal rectifier, responding to the visual content and distortion, spatial resolution, and frame rate changes on video quality, respectively. During training, the spatial and temporal rectifiers are dropped out with some probabilities to render the base quality predictor a standalone BVQA model, which should work better with the rectifiers. Extensive experiments on both professionally-generated content and user-generated content video databases show that our quality model achieves superior or comparable performance to current methods. Additionally, the modularity of our model offers an opportunity to analyze existing video quality databases in terms of their spatial and temporal complexity. + + + + RELI11D: A Comprehensive Multimodal Human Motion Dataset and Method + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_RELI11D_A_Comprehensive_Multimodal_Human_Motion_Dataset_and_Method_CVPR_2024_paper.pdf + Comprehensive capturing of human motions requires both accurate captures of complex poses and precise localization of the human within scenes. Most HPE datasets and methods primarily rely on RGB, LiDAR, or IMU data. However, solely using these modalities, or a combination of them, may not be adequate for HPE, particularly for complex and fast movements. For holistic human motion understanding, we present RELI11D, a high-quality multimodal human motion dataset that involves LiDAR, an IMU system, an RGB camera, and an Event camera. It records the motions of 10 actors performing 5 sports in 7 scenes, including 3.32 hours of synchronized LiDAR point clouds, IMU measurement data, RGB videos, and Event streams. Through extensive experiments, we demonstrate that RELI11D presents considerable challenges and opportunities, as it contains many rapid and complex motions that require precise localization. To address the challenge of integrating different modalities, we propose LEIR, a multimodal baseline that effectively utilizes the LiDAR point cloud, Event stream, and RGB through our cross-attention fusion strategy. We show that LEIR exhibits promising results for rapid motions and daily motions, and that utilizing the characteristics of multiple modalities can indeed improve HPE performance. Both the dataset and source code will be released publicly to the research community, fostering collaboration and enabling further exploration in this field.
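A toy sketch of the modular idea described for the BVQA model above: a base predictor produces a quality score and optional spatial/temporal rectifiers adjust it, with the rectifiers randomly dropped during training so the base predictor also works standalone. The multiplicative-adjustment form, feature sizes, and drop probability are assumptions made for this sketch.

import torch
import torch.nn as nn

class ModularBVQA(nn.Module):
    def __init__(self, feat_dim=128, p_drop=0.5):
        super().__init__()
        self.base = nn.Linear(feat_dim, 1)           # base quality predictor
        self.spatial_rect = nn.Linear(feat_dim, 1)   # resolution-aware rectifier
        self.temporal_rect = nn.Linear(feat_dim, 1)  # frame-rate-aware rectifier
        self.p_drop = p_drop

    def forward(self, content_feat, spatial_feat, temporal_feat):
        q = self.base(content_feat)
        # Each rectifier is dropped with probability p_drop during training.
        if not self.training or torch.rand(1).item() > self.p_drop:
            q = q * torch.sigmoid(self.spatial_rect(spatial_feat))
        if not self.training or torch.rand(1).item() > self.p_drop:
            q = q * torch.sigmoid(self.temporal_rect(temporal_feat))
        return q

model = ModularBVQA()
score = model(torch.rand(4, 128), torch.rand(4, 128), torch.rand(4, 128))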
+ + + + One-Class Face Anti-spoofing via Spoof Cue Map-Guided Feature Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_One-Class_Face_Anti-spoofing_via_Spoof_Cue_Map-Guided_Feature_Learning_CVPR_2024_paper.pdf + Many face anti-spoofing (FAS) methods have focused on learning discriminative features from both live and spoof training data to strengthen the security of face recognition systems. However since not every possible attack type is available in the training stage these FAS methods usually fail to detect unseen attacks in the inference stage. In comparison one-class FAS where the training data are from only live faces aims to detect whether a test face image belongs to the live class or not. In this paper we propose a novel One-Class Spoof Cue Map estimation Network (OC-SCMNet) to address the one-class FAS detection problem. Our first goal is to learn to extract latent spoof features from live images so that their estimated Spoof Cue Maps (SCMs) should have zero responses. To avoid trapping to a trivial solution we devise a novel SCM-guided feature learning by combining many SCMs as pseudo ground-truths to guide a conditional generator to generate latent spoof features for spoof data. Our second goal is to approximately simulate the potential out-of-distribution spoof attacks. To this end we propose using a memory bank to dynamically preserve a set of sufficiently "independent" latent spoof features to encourage the generator to probe the latent spoof feature space. Extensive experiments conducted on eight FAS benchmark datasets demonstrate that the proposed OC-SCMNet not only outperforms previous one-class FAS methods but also achieves comparable performances to state-of-the-art two-class FAS method. The codes are available at https://github.com/Pei-KaiHuang/CVPR24_OC_SCMNet. + + + + InteractDiffusion: Interaction Control in Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Hoe_InteractDiffusion_Interaction_Control_in_Text-to-Image_Diffusion_Models_CVPR_2024_paper.pdf + Large-scale text-to-image (T2I) diffusion models have showcased incredible capabilities in generating coherent images based on textual descriptions enabling vast applications in content generation. While recent advancements have introduced control over factors such as object localization posture and image contours a crucial gap remains in our ability to control the interactions between objects in the generated content. Well-controlling interactions in generated images could yield meaningful applications such as creating realistic scenes with interacting characters. In this work we study the problems of conditioning T2I diffusion models with Human-Object Interaction (HOI) information consisting of a triplet label (person action object) and corresponding bounding boxes. We propose a pluggable interaction control model called InteractDiffusion that extends existing pre-trained T2I diffusion models to enable them being better conditioned on interactions. Specifically we tokenize the HOI information and learn their relationships via interaction embeddings. A conditioning self-attention layer is trained to map HOI tokens to visual tokens thereby conditioning the visual tokens better in existing T2I diffusion models. Our model attains the ability to control the interaction and location on existing T2I diffusion models which outperforms existing baselines by a large margin in HOI detection score as well as fidelity in FID and KID. 
Project page: https://jiuntian.github.io/interactdiffusion. + + + + Emergent Open-Vocabulary Semantic Segmentation from Off-the-shelf Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_Emergent_Open-Vocabulary_Semantic_Segmentation_from_Off-the-shelf_Vision-Language_Models_CVPR_2024_paper.pdf + From image-text pairs large-scale vision-language models (VLMs) learn to implicitly associate image regions with words which prove effective for tasks like visual question answering. However leveraging the learned association for open-vocabulary semantic segmentation remains a challenge. In this paper we propose a simple yet extremely effective training-free technique Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) for this task. PnP-OVSS leverages a VLM with direct text-to-image cross-attention and an image-text matching loss. To balance between over-segmentation and under-segmentation we introduce Salience Dropout; by iteratively dropping patches that the model is most attentive to we are able to better resolve the entire extent of the segmentation mask. PnP-OVSS does not require any neural network training and performs hyperparameter tuning without the need for any segmentation annotations even for a validation set. PnP-OVSS demonstrates substantial improvements over comparable baselines (+29.4% mIoU on Pascal VOC +13.2% mIoU on Pascal Context +14.0% mIoU on MS COCO +2.4% mIoU on COCO Stuff) and even outperforms most baselines that conduct additional network training on top of pretrained VLMs. Our codebase is at https://github.com/letitiabanana/PnP-OVSS. + + + + SelfPose3d: Self-Supervised Multi-Person Multi-View 3d Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Srivastav_SelfPose3d_Self-Supervised_Multi-Person_Multi-View_3d_Pose_Estimation_CVPR_2024_paper.pdf + We present a new self-supervised approach SelfPose3d for estimating 3d poses of multiple persons from multiple camera views. Unlike current state-of-the-art fully-supervised methods our approach does not require any 2d or 3d ground-truth poses and uses only the multi-view input images from a calibrated camera setup and 2d pseudo poses generated from an off-the-shelf 2d human pose estimator. We propose two self-supervised learning objectives: self-supervised person localization in 3d space and self-supervised 3d pose estimation. We achieve self-supervised 3d person localization by training the model on synthetically generated 3d points serving as 3d person root positions and on the projected root-heatmaps in all the views. We then model the 3d poses of all the localized persons with a bottleneck representation map them onto all views obtaining 2d joints and render them using 2d Gaussian heatmaps in an end-to-end differentiable manner. Afterwards we use the corresponding 2d joints and heatmaps from the pseudo 2d poses for learning. To alleviate the intrinsic inaccuracy of the pseudo labels we propose an adaptive supervision attention mechanism to guide the self-supervision. Our experiments and analysis on three public benchmark datasets including Panoptic Shelf and Campus show the effectiveness of our approach which is comparable to fully-supervised methods. Code is available at https://github.com/CAMMA-public/SelfPose3D. 
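The step in SelfPose3d of rendering 2D Gaussian heatmaps from projected joints in an end-to-end differentiable manner can be pictured in a few lines; the grid size and sigma below are arbitrary example values.

import torch

def gaussian_heatmap(joints_2d, height=64, width=64, sigma=2.0):
    """joints_2d: (N, 2) pixel coordinates -> (N, H, W) differentiable heatmaps."""
    ys = torch.arange(height, dtype=torch.float32).view(1, height, 1)
    xs = torch.arange(width, dtype=torch.float32).view(1, 1, width)
    x0 = joints_2d[:, 0].view(-1, 1, 1)
    y0 = joints_2d[:, 1].view(-1, 1, 1)
    d2 = (xs - x0) ** 2 + (ys - y0) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))  # gradients flow back into joints_2d

joints = torch.tensor([[20.0, 30.0], [40.0, 10.0]], requires_grad=True)
heatmaps = gaussian_heatmap(joints)   # shape (2, 64, 64)
heatmaps.sum().backward()             # joints.grad is populated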
+ + + + Joint2Human: High-Quality 3D Human Generation via Compact Spherical Embedding of 3D Joints + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Joint2Human_High-Quality_3D_Human_Generation_via_Compact_Spherical_Embedding_of_CVPR_2024_paper.pdf + 3D human generation is increasingly significant in various applications. However the direct use of 2D generative methods in 3D generation often results in losing local details while methods that reconstruct geometry from generated images struggle with global view consistency. In this work we introduce Joint2Human a novel method that leverages 2D diffusion models to generate detailed 3D human geometry directly ensuring both global structure and local details. To achieve this we employ the Fourier occupancy field (FOF) representation enabling the direct generation of 3D shapes as preliminary results with 2D generative models. With the proposed high-frequency enhancer and the multi-view recarving strategy our method can seamlessly integrate the details from different views into a uniform global shape. To better utilize the 3D human prior and enhance control over the generated geometry we introduce a compact spherical embedding of 3D joints. This allows for an effective guidance of pose during the generation process. Additionally our method can generate 3D humans guided by textual inputs. Our experimental results demonstrate the capability of our method to ensure global structure local details high resolution and low computational cost simultaneously. More results and the code can be found on our project page at http://cic.tju.edu.cn/faculty/likun/projects/Joint2Human. + + + + Prompt-Free Diffusion: Taking "Text" out of Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Prompt-Free_Diffusion_Taking_Text_out_of_Text-to-Image_Diffusion_Models_CVPR_2024_paper.pdf + Text-to-image (T2I) research has grown explosively in the past year owing to the large-scale pre-trained diffusion models and many emerging personalization and editing approaches. Yet one pain point persists: the text prompt engineering and searching high-quality text prompts for customized results is more art than science. Moreover as commonly argued: "an image is worth a thousand words" - the attempt to describe a desired image with texts often ends up being ambiguous and cannot comprehensively cover delicate visual details hence necessitating more additional controls from the visual domain. In this paper we take a bold step forward: taking "Text" out of a pretrained T2I diffusion model to reduce the burdensome prompt engineering efforts for users. Our proposed framework Prompt-Free Diffusion relies on only visual inputs to generate new images: it takes a reference image as "context" an optional image structural conditioning and an initial noise with absolutely no text prompt. The core architecture behind the scene is Semantic Context Encoder (SeeCoder) substituting the commonly used CLIP-based or LLM-based text encoder. The reusability of SeeCoder also makes it a convenient drop-in component: one can also pre-train a SeeCoder in one T2I model and reuse it for another. 
Through extensive experiments Prompt-Free Diffusion is experimentally found to (i) outperform prior exemplar-based image synthesis approaches; (ii) perform on par with state-of-the-art T2I models using prompts following the best practice; and (iii) be naturally extensible to other downstream applications such as anime figure generation and virtual try-on with promising quality. Our code and models will be open-sourced. + + + + Multi-agent Long-term 3D Human Pose Forecasting via Interaction-aware Trajectory Conditioning + http://openaccess.thecvf.com//content/CVPR2024/papers/Jeong_Multi-agent_Long-term_3D_Human_Pose_Forecasting_via_Interaction-aware_Trajectory_Conditioning_CVPR_2024_paper.pdf + Human pose forecasting garners attention for its diverse applications. However challenges in modeling the multi-modal nature of human motion and intricate interactions among agents persist particularly with longer timescales and more agents. In this paper we propose an interaction-aware trajectory-conditioned long-term multi-agent human pose forecasting model utilizing a coarse-to-fine prediction approach: multi-modal global trajectories are initially forecasted followed by respective local pose forecasts conditioned on each mode. In doing so our Trajectory2Pose model introduces a graph-based agent-wise interaction module for a reciprocal forecast of local motion-conditioned global trajectory and trajectory-conditioned local pose. Our model effectively handles the multi-modality of human motion and the complexity of long-term multi-agent interactions improving performance in complex environments. Furthermore we address the lack of long-term (6s+) multi-agent (5+) datasets by constructing a new dataset from real-world images and 2D annotations enabling a comprehensive evaluation of our proposed model. State-of-the-art prediction performance on both complex and simpler datasets confirms the generalized effectiveness of our method. The code is available at https://github.com/Jaewoo97/T2P. + + + + CLOAF: CoLlisiOn-Aware Human Flow + http://openaccess.thecvf.com//content/CVPR2024/papers/Davydov_CLOAF_CoLlisiOn-Aware_Human_Flow_CVPR_2024_paper.pdf + Even the best current algorithms for estimating body 3D shape and pose yield results that include body self-intersections. In this paper we present CLOAF which exploits the diffeomorphic nature of Ordinary Differential Equations to eliminate such self-intersections while still imposing body shape constraints. We show that unlike earlier approaches to addressing this issue ours completely eliminates the self-intersections without compromising the accuracy of the reconstructions. Being differentiable CLOAF can be used to fine-tune pose and shape estimation baselines to improve their overall performance and eliminate self-intersections in their predictions. Furthermore we demonstrate how our CLOAF strategy can be applied to practically any motion field induced by the user. CLOAF also makes it possible to edit motion to interact with the environment without worrying about potential collision or loss of body-shape prior. + + + + Hybrid Functional Maps for Crease-Aware Non-Isometric Shape Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Bastian_Hybrid_Functional_Maps_for_Crease-Aware_Non-Isometric_Shape_Matching_CVPR_2024_paper.pdf + Non-isometric shape correspondence remains a fundamental challenge in computer vision. 
Traditional methods using Laplace-Beltrami operator (LBO) eigenmodes face limitations in characterizing high-frequency extrinsic shape changes like bending and creases. We propose a novel approach of combining the non-orthogonal extrinsic basis of eigenfunctions of the elastic thin-shell hessian with the intrinsic ones of the LBO creating a hybrid spectral space in which we construct functional maps. To this end we present a theoretical framework to effectively integrate non-orthogonal basis functions into descriptor- and learning-based functional map methods. Our approach can be incorporated easily into existing functional map pipelines across varying applications and is able to handle complex deformations beyond isometries. We show extensive evaluations across various supervised and unsupervised settings and demonstrate significant improvements. Notably our approach achieves up to 15% better mean geodesic error for non-isometric correspondence settings and up to 45% improvement in scenarios with topological noise. + + + + Density-Guided Semi-Supervised 3D Semantic Segmentation with Dual-Space Hardness Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Density-Guided_Semi-Supervised_3D_Semantic_Segmentation_with_Dual-Space_Hardness_Sampling_CVPR_2024_paper.pdf + Densely annotating the large-scale point clouds is laborious. To alleviate the annotation burden contrastive learning has attracted increasing attention for tackling semi-supervised 3D semantic segmentation. However existing point-to-point contrastive learning techniques in literature are generally sensitive to outliers resulting in insufficient modeling of the point-wise representations. To address this problem we propose a method named DDSemi for semi-supervised 3D semantic segmentation where a density-guided contrastive learning technique is explored. This technique calculates the contrastive loss in a point-to-anchor manner by estimating an anchor for each class from the memory bank based on the finding that the cluster centers tend to be located in dense regions. In this technique an inter-contrast loss is derived from the perturbed unlabeled point cloud pairs while an intra-contrast loss is derived from a single unlabeled point cloud. The derived losses could enhance the discriminability of the features and implicitly constrain the semantic consistency between the perturbed unlabeled point cloud pairs. In addition we propose a dual-space hardness sampling strategy to pay more attention to the hard samples located in sparse regions of both the geometric space and feature space by reweighting the point-wise intra-contrast loss. Experimental results on both indoor-scene and outdoor-scene datasets demonstrate that the proposed method outperforms the comparative state-of-the-art semi-supervised methods. + + + + ElasticDiffusion: Training-free Arbitrary Size Image Generation through Global-Local Content Separation + http://openaccess.thecvf.com//content/CVPR2024/papers/Haji-Ali_ElasticDiffusion_Training-free_Arbitrary_Size_Image_Generation_through_Global-Local_Content_Separation_CVPR_2024_paper.pdf + Diffusion models have revolutionized image generation in recent years yet they are still limited to a few sizes and aspect ratios. We propose ElasticDiffusion a novel training-free decoding method that enables pretrained text-to-image diffusion models to generate images with various sizes. ElasticDiffusion attempts to decouple the generation trajectory of a pretrained model into local and global signals. 
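For readers unfamiliar with functional maps, the basic descriptor-based estimation that the hybrid approach above plugs its combined basis into reduces to a small least-squares problem. The random coefficient matrices here only demonstrate the shapes; a hybrid (LBO plus elastic) basis would change how A and B are built, not the solve.

import torch

k_src, k_tgt, n_desc = 30, 30, 50

# Descriptor functions expressed in each shape's (possibly hybrid) spectral basis.
A = torch.rand(k_src, n_desc)   # source coefficients
B = torch.rand(k_tgt, n_desc)   # target coefficients

# Functional map C minimizing ||C A - B||_F, via least squares on the transposed system.
C = torch.linalg.lstsq(A.T, B.T).solution.T   # shape (k_tgt, k_src)

print(torch.norm(C @ A - B))    # residual of the fitted map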
The local signal controls low-level pixel information and can be estimated on local patches while the global signal is used to maintain overall structural consistency and is estimated with a reference image. We test our method on CelebA-HQ (faces) and LAION-COCO (objects/indoor/outdoor scenes). Our experiments and qualitative results show superior image coherence quality across aspect ratios compared to MultiDiffusion and the standard decoding strategy of Stable Diffusion. Project Webpage: https://elasticdiffusion.github.io + + + + Locally Adaptive Neural 3D Morphable Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Tarasiou_Locally_Adaptive_Neural_3D_Morphable_Models_CVPR_2024_paper.pdf + We present the Locally Adaptive Morphable Model (LAMM) a highly flexible Auto-Encoder (AE) framework for learning to generate and manipulate 3D meshes. We train our architecture following a simple self-supervised training scheme in which input displacements over a set of sparse control vertices are used to overwrite the encoded geometry in order to transform one training sample into another. During inference our model produces a dense output that adheres locally to the specified sparse geometry while maintaining the overall appearance of the encoded object. This approach results in state-of-the-art performance in both disentangling manipulated geometry and 3D mesh reconstruction. To the best of our knowledge LAMM is the first end-to-end framework that enables direct local control of 3D vertex geometry in a single forward pass. A very efficient computational graph allows our network to train with only a fraction of the memory required by previous methods and run faster during inference generating 12k vertex meshes at >60fps on a single CPU thread. We further leverage local geometry control as a primitive for higher level editing operations and present a set of derivative capabilities such as swapping and sampling object parts. Code and pretrained models can be found at https://github.com/michaeltrs/LAMM. + + + + ICON: Incremental CONfidence for Joint Pose and Radiance Field Optimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_ICON_Incremental_CONfidence_for_Joint_Pose_and_Radiance_Field_Optimization_CVPR_2024_paper.pdf + Neural Radiance Fields (NeRF) exhibit remarkable performance for Novel View Synthesis (NVS) given a set of 2D images. However NeRF training requires accurate camera pose for each input view typically obtained by Structure-from-Motion (SfM) pipelines. Recent works have attempted to relax this constraint but they still often rely on decent initial poses which they can refine. Here we aim at removing the requirement for pose initialization. We present Incremental CONfidence (ICON) an optimization procedure for training NeRFs from 2D video frames. ICON only assumes smooth camera motion to estimate initial guess for poses. Further ICON introduces "confidence": an adaptive measure of model quality used to dynamically reweight gradients. ICON relies on high-confidence poses to learn NeRF and high-confidence 3D structure (as encoded by NeRF) to learn poses. We show that ICON without prior pose initialization achieves superior performance in both CO3D and HO3D versus methods which use SfM pose. 
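The "confidence used to dynamically reweight gradients" idea in ICON can be illustrated by weighting a photometric loss per frame; the exponential mapping from error to confidence and the temperature are assumptions made for this sketch.

import torch

def confidence_weighted_loss(pred, target, tau=0.1):
    """Down-weight frames whose current reconstruction error is high (low confidence)."""
    per_frame_err = (pred - target).abs().mean(dim=(1, 2, 3))   # (B,)
    confidence = torch.exp(-per_frame_err.detach() / tau)       # no gradient through the weights
    return (confidence * per_frame_err).mean()

pred = torch.rand(4, 3, 32, 32, requires_grad=True)   # renders from the current model/poses
target = torch.rand(4, 3, 32, 32)                     # observed video frames
confidence_weighted_loss(pred, target).backward()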
+ + + + Learned Scanpaths Aid Blind Panoramic Video Quality Assessment + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Learned_Scanpaths_Aid_Blind_Panoramic_Video_Quality_Assessment_CVPR_2024_paper.pdf + Panoramic videos have the advantage of providing an immersive and interactive viewing experience. Nevertheless, their spherical nature gives rise to varied and uncertain user viewing behaviors, which poses significant challenges for panoramic video quality assessment (PVQA). In this work, we propose an end-to-end optimized blind PVQA method with explicit modeling of user viewing patterns through visual scanpaths. Our method consists of two modules: a scanpath generator and a quality assessor. The scanpath generator is initially trained to predict future scanpaths by minimizing their expected code length, and is then jointly optimized with the quality assessor for quality prediction. Our blind PVQA method enables direct quality assessment of panoramic images by treating them as videos composed of identical frames. Experiments on three public panoramic image and video quality datasets, encompassing both synthetic and authentic distortions, validate the superiority of our blind PVQA model over existing methods. + + + + TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Ni_TI2V-Zero_Zero-Shot_Image_Conditioning_for_Text-to-Video_Diffusion_Models_CVPR_2024_paper.pdf + Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g. a woman's photo) and a text description (e.g. "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize the Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks, such as video infilling and prediction, when provided with more images. Its autoregressive design also supports long video generation. + + + + iToF-flow-based High Frame Rate Depth Imaging + http://openaccess.thecvf.com//content/CVPR2024/papers/Meng_iToF-flow-based_High_Frame_Rate_Depth_Imaging_CVPR_2024_paper.pdf + iToF is a prevalent, cost-effective technology for 3D perception. However, its reliance on multiple measurements commonly leads to reduced performance in dynamic environments.
Based on the analysis of the physical iToF imaging process we propose the iToF flow composed of crossmode transformation and uni-mode photometric correction to model the variation of measurements caused by different measurement modes and 3D motion respectively. We propose a local linear transform (LLT) based cross-mode transfer module (LCTM) for mode-varying and pixel shift compensation of cross-mode flow and uni-mode photometric correct module (UPCM) for estimating the depth-wise motion caused photometric residual of uni-mode flow. The iToF flow-based depth extraction network is proposed which could facilitate the estimation of the 4-phase measurements at each individual time for high framerate and accurate depth estimation. Extensive experiments including both simulation and real-world experiments are conducted to demonstrate the effectiveness of the proposed methods. Compared with the SOTA method our approach reduces the computation time by 75% while improving the performance by 38%. The code and database are available at https://github.com/ComputationalPerceptionLab/iToF_flow. + + + + Relightful Harmonization: Lighting-aware Portrait Background Replacement + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_Relightful_Harmonization_Lighting-aware_Portrait_Background_Replacement_CVPR_2024_paper.pdf + Portrait harmonization aims to composite a subject into a new background adjusting its lighting and color to ensure harmony with the background scene. Existing harmonization techniques often only focus on adjusting the global color and brightness of the foreground and ignore crucial illumination cues from the background such as apparent lighting direction leading to unrealistic compositions. We introduce Relightful Harmonization a lighting-aware diffusion model designed to seamlessly harmonize sophisticated lighting effect for the foreground portrait using any background image. Our approach unfolds in three stages. First we introduce a lighting representation module that allows our diffusion model to encode lighting information from target image background. Second we introduce an alignment network that aligns lighting features learned from image background with lighting features learned from panorama environment maps which is a complete representation for scene illumination. Last to further boost the photorealism of the proposed method we introduce a novel data simulation pipeline that generates synthetic training pairs from a diverse range of natural images which are used to refine the model. Our method outperforms existing benchmarks in visual fidelity and lighting coherence showing superior generalization in real-world testing scenarios highlighting its versatility and practicality. + + + + Mitigating Motion Blur in Neural Radiance Fields with Events and Frames + http://openaccess.thecvf.com//content/CVPR2024/papers/Cannici_Mitigating_Motion_Blur_in_Neural_Radiance_Fields_with_Events_and_CVPR_2024_paper.pdf + Neural Radiance Fields (NeRFs) have shown great potential in novel view synthesis. However they struggle to render sharp images when the data used for training is affected by motion blur. On the other hand event cameras excel in dynamic scenes as they measure brightness changes with microsecond resolution and are thus only marginally affected by blur. Recent methods attempt to enhance NeRF reconstructions under camera motion by fusing frames and events. 
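As background for the "4-phase measurements" mentioned in the iToF-flow abstract, textbook continuous-wave iToF depth recovery from four phase-shifted correlation samples looks like the following; the modulation frequency is an example value, sign conventions for the phase offsets differ between sensors, and real devices add calibration terms.

import numpy as np

C = 299_792_458.0   # speed of light, m/s
F_MOD = 20e6        # example modulation frequency, Hz

def itof_depth(a0, a90, a180, a270):
    """Depth from four correlation samples at 0/90/180/270 degree phase offsets."""
    phase = np.arctan2(a90 - a270, a0 - a180)   # wrapped phase
    phase = np.mod(phase, 2 * np.pi)            # map to [0, 2*pi)
    return C * phase / (4 * np.pi * F_MOD)      # half the round-trip path length

# Synthetic measurements for a target whose true phase is 1.0 rad.
a = [np.cos(1.0 - k * np.pi / 2) for k in range(4)]   # a0, a90, a180, a270
print(itof_depth(*a))   # ~1.19 m at 20 MHz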
However they face challenges in recovering accurate color content or constrain the NeRF to a set of predefined camera poses harming reconstruction quality in challenging conditions. This paper proposes a novel formulation addressing these issues by leveraging both model- and learning-based modules. We explicitly model the blur formation process exploiting the event double integral as an additional model-based prior. Additionally we model the event-pixel response using an end-to-end learnable response function allowing our method to adapt to non-idealities in the real event-camera sensor. We show on synthetic and real data that the proposed approach outperforms existing deblur NeRFs that use only frames as well as those that combine frames and events by +6.13dB and +2.48dB respectively. + + + + TokenHMR: Advancing Human Mesh Recovery with a Tokenized Pose Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Dwivedi_TokenHMR_Advancing_Human_Mesh_Recovery_with_a_Tokenized_Pose_Representation_CVPR_2024_paper.pdf + We address the problem of regressing 3D human pose and shape from a single image with a focus on 3D accuracy. The current best methods leverage large datasets of 3D pseudo-ground-truth (p-GT) and 2D keypoints leading to robust performance. With such methods however we observe a paradoxical decline in 3D pose accuracy with increasing 2D accuracy. This is caused by biases in the p-GT and the use of an approximate camera projection model. We quantify the error induced by current camera models and show that fitting 2D keypoints and p-GT accurately causes incorrect 3D poses. Our analysis defines the invalid distances within which minimizing 2D and p-GT losses is detrimental. We use this to formulate a new loss "Threshold-Adaptive Loss Scaling" (TALS) that penalizes gross 2D and p-GT errors but not smaller ones. With such a loss there are many 3D poses that could equally explain the 2D evidence. To reduce this ambiguity we need a prior over valid human poses but such priors can introduce unwanted bias. To address this we exploit a tokenized representation of human pose and reformulate the problem as token prediction. This restricts the estimated poses to the space of valid poses effectively improving robustness to occlusion. Extensive experiments on the EMDB and 3DPW datasets show that our reformulated loss and tokenization allows us to train on in-the-wild data while improving 3D accuracy over the state-of-the-art. Our models and code are available for research at https://tokenhmr.is.tue.mpg.de. + + + + FaceCom: Towards High-fidelity 3D Facial Shape Completion via Optimization and Inpainting Guidance + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_FaceCom_Towards_High-fidelity_3D_Facial_Shape_Completion_via_Optimization_and_CVPR_2024_paper.pdf + We propose FaceCom a method for 3D facial shape completion which delivers high-fidelity results for incomplete facial inputs of arbitrary forms. Unlike end-to-end shape completion methods based on point clouds or voxels our approach relies on a mesh-based generative network that is easy to optimize enabling it to handle shape completion for irregular facial scans. We first train a shape generator on a mixed 3D facial dataset containing 2405 identities. Based on the incomplete facial input we fit complete faces using an optimization approach under image inpainting guidance. The completion results are refined through a post-processing step. 
FaceCom demonstrates the ability to effectively and naturally complete facial scan data with varying missing regions and degrees of missing areas. Our method can be used in medical prosthetic fabrication and the registration of deficient scanning data. Our experimental results demonstrate that FaceCom achieves exceptional performance in fitting and shape completion tasks. + + + + LightOctree: Lightweight 3D Spatially-Coherent Indoor Lighting Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_LightOctree_Lightweight_3D_Spatially-Coherent_Indoor_Lighting_Estimation_CVPR_2024_paper.pdf + We present a lightweight solution for estimating spatially-coherent indoor lighting from a single RGB image. Previous methods for estimating illumination using volumetric representations have overlooked the sparse distribution of light sources in space necessitating substantial memory and computational resources for achieving high-quality results. We introduce a unified voxel octree-based illumination estimation framework to produce 3D spatially-coherent lighting. Additionally a differentiable voxel octree cone tracing rendering layer is proposed to eliminate regular volumetric representation throughout the entire process and ensure the retention of features across different frequency domains. This reduction significantly decreases spatial usage and required floating-point operations without substantially compromising precision. Experimental results demonstrate that our approach achieves high-quality coherent estimation with minimal cost compared to previous methods. + + + + FaceLift: Semi-supervised 3D Facial Landmark Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Ferman_FaceLift_Semi-supervised_3D_Facial_Landmark_Localization_CVPR_2024_paper.pdf + 3D facial landmark localization has proven to be of particular use for applications such as face tracking 3D face modeling and image-based 3D face reconstruction. In the supervised learning case such methods usually rely on 3D landmark datasets derived from 3DMM-based registration that often lack spatial definition alignment as compared with that chosen by hand-labeled human consensus e.g. how are eyebrow landmarks defined? This creates a gap between landmark datasets generated via high-quality 2D human labels and 3DMMs and it ultimately limits their effectiveness. To address this issue we introduce a novel semi-supervised learning approach that learns 3D landmarks by directly lifting (visible) hand-labeled 2D landmarks and ensures better definition alignment without the need for 3D landmark datasets. To lift 2D landmarks to 3D we leverage 3D-aware GANs for better multi-view consistency learning and in-the-wild multi-frame videos for robust cross-generalization. Empirical experiments demonstrate that our method not only achieves better definition alignment between 2D-3D landmarks but also outperforms other supervised learning 3D landmark localization methods on both 3DMM labeled and photogrammetric ground truth evaluation datasets. Project Page: https://davidcferman.github.io/FaceLift + + + + PSDPM: Prototype-based Secondary Discriminative Pixels Mining for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_PSDPM_Prototype-based_Secondary_Discriminative_Pixels_Mining_for_Weakly_Supervised_Semantic_CVPR_2024_paper.pdf + Image-level Weakly Supervised Semantic Segmentation (WSSS) has received increasing attention due to its low annotation cost. 
Class Activation Mapping (CAM) generated through classifier weights in WSSS inevitably ignores certain useful cues while the CAM generated through class prototypes can alleviate that. However because of the different goals of image classification and semantic segmentation the class prototypes still focus on activating primary discriminative pixels learned from classification loss leading to incomplete CAM. In this paper we propose a plugand-play Prototype-based Secondary Discriminative Pixels Mining (PSDPM) framework for enabling class prototypes to activate more secondary discriminative pixels thus generating a more complete CAM. Specifically we introduce a Foreground Pixel Estimation Module (FPEM) for estimating potential foreground pixels based on the correlations between primary and secondary discriminative pixels and the semantic segmentation results of baseline methods. Then we enable WSSS model to learn discriminative features from secondary discriminative pixels through a consistency loss calculated between FPEM result and class-prototype CAM. Experimental results show that our PSDPM improves various baseline methods significantly and achieves new state-of-the-art performances on WSSS benchmarks. Codes are available at https://github.com/xinqiaozhao/PSDPM. + + + + Frozen CLIP: A Strong Backbone for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Frozen_CLIP_A_Strong_Backbone_for_Weakly_Supervised_Semantic_Segmentation_CVPR_2024_paper.pdf + Weakly supervised semantic segmentation has witnessed great achievements with image-level labels. Several recent approaches use the CLIP model to generate pseudo labels for training an individual segmentation model while there is no attempt to apply the CLIP model as the backbone to directly segment objects with image-level labels. In this paper we propose WeCLIP a CLIP-based single-stage pipeline for weakly supervised semantic segmentation. Specifically the frozen CLIP model is applied as the backbone for semantic feature extraction and a new decoder is designed to interpret extracted semantic features for final prediction. Meanwhile we utilize the above frozen backbone to generate pseudo labels for training the decoder. Such labels cannot be optimized during training. We then propose a refinement module (RFM) to rectify them dynamically. Our architecture enforces the proposed decoder and RFM to benefit from each other to boost the final performance. Extensive experiments show that our approach significantly outperforms other approaches with less training cost. Additionally our WeCLIP also obtains promising results for fully supervised settings. The code is available at https://github.com/zbf1991/WeCLIP. + + + + LAFS: Landmark-based Facial Self-supervised Learning for Face Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_LAFS_Landmark-based_Facial_Self-supervised_Learning_for_Face_Recognition_CVPR_2024_paper.pdf + In this work we focus on learning facial representations that can be adapted to train effective face recognition models particularly in the absence of labels. Firstly compared with existing labelled face datasets a vastly larger magnitude of unlabeled faces exists in the real world. We explore the learning strategy of these unlabeled facial images through self-supervised pretraining to transfer generalized face recognition performance. 
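A compact illustration of the class-prototype CAM that PSDPM builds on: the activation map is simply the cosine similarity between each spatial feature and a class prototype vector. The feature sizes and the prototype (here just a mean of features) are placeholders for illustration.

import torch
import torch.nn.functional as F

feat = torch.rand(1, 256, 32, 32)         # backbone feature map (B, C, H, W)
prototype = feat.flatten(2).mean(dim=2)   # stand-in class prototype (B, C)

# Cosine similarity between every spatial position and the prototype -> CAM.
feat_n = F.normalize(feat, dim=1)
proto_n = F.normalize(prototype, dim=1)[:, :, None, None]
cam = (feat_n * proto_n).sum(dim=1).clamp(min=0)              # (B, H, W)
cam = cam / cam.amax(dim=(1, 2), keepdim=True).clamp(min=1e-6)  # normalize to [0, 1]
print(cam.shape)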
Moreover motivated by one recent finding that is the face saliency area is critical for face recognition in contrast to utilizing random cropped blocks of images for constructing augmentations in pretraining we utilize patches localized by extracted facial landmarks. This enables our method - namely Landmark-based Facial Self-supervised learning (LAFS) to learn key representation that is more critical for face recognition. We also incorporate two landmark-specific augmentations which introduce more diversity of landmark information to further regularize the learning. With learned landmark-based facial representations we further adapt the representation for face recognition with regularization mitigating variations in landmark positions. Our method achieves significant improvement over the state-of-the-art on multiple face recognition benchmarks especially on more challenging few-shot scenarios. The code is available at https://github.com/szlbiubiubiu/LAFS_CVPR2024 + + + + SED: A Simple Encoder-Decoder for Open-Vocabulary Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_SED_A_Simple_Encoder-Decoder_for_Open-Vocabulary_Semantic_Segmentation_CVPR_2024_paper.pdf + Open-vocabulary semantic segmentation strives to distinguish pixels into different semantic groups from an open set of categories. Most existing methods explore utilizing pre-trained vision-language models in which the key is to adopt the image-level model for pixel-level segmentation task. In this paper we propose a simple encoder-decoder named SED for open-vocabulary semantic segmentation which comprises a hierarchical encoder-based cost map generation and a gradual fusion decoder with category early rejection. The hierarchical encoder-based cost map generation employs hierarchical backbone instead of plain transformer to predict pixel-level image-text cost map. Compared to plain transformer hierarchical backbone better captures local spatial information and has linear computational complexity with respect to input size. Our gradual fusion decoder employs a top-down structure to combine cost map and the feature maps of different backbone levels for segmentation. To accelerate inference speed we introduce a category early rejection scheme in the decoder that rejects many no-existing categories at the early layer of decoder resulting in at most 4.7 times acceleration without accuracy degradation. Experiments are performed on multiple open-vocabulary semantic segmentation datasets which demonstrates the efficacy of our SED method. When using ConvNeXt-B our SED method achieves mIoU score of 31.6% on ADE20K with 150 categories at 82 millisecond (ms) per image on a single A6000. Our source code is available at https://github.com/xb534/SED. + + + + GPLD3D: Latent Diffusion of 3D Shape Generative Models by Enforcing Geometric and Physical Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_GPLD3D_Latent_Diffusion_of_3D_Shape_Generative_Models_by_Enforcing_CVPR_2024_paper.pdf + State-of-the-art man-made shape generative models usually adopt established generative models under a suitable implicit shape representation. A common theme is to perform distribution alignment which does not explicitly model important shape priors. As a result many synthetic shapes are not connected. Other synthetic shapes present problems of physical stability and geometric feasibility. 
This paper introduces a novel latent diffusion shape-generative model regularized by a quality checker that outputs a score of a latent code. The scoring function employs a learned function that provides a geometric feasibility score and a deterministic procedure to quantify a physical stability score. The key to our approach is a new diffusion procedure that combines the discrete empirical data distribution and a continuous distribution induced by the quality checker. We introduce a principled approach to determine the tradeoff parameters for learning the denoising network at different noise levels. Experimental results show that our approach outperforms state-of-the-art shape generations quantitatively and qualitatively on ShapeNet-v2. + + + + Self-correcting LLM-controlled Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Self-correcting_LLM-controlled_Diffusion_Models_CVPR_2024_paper.pdf + Text-to-image generation has witnessed significant progress with the advent of diffusion models. Despite the ability to generate photorealistic images current text-to-image diffusion models still often struggle to accurately interpret and follow complex input text prompts. In contrast to existing models that aim to generate images only with their best effort we introduce Self-correcting LLM-controlled Diffusion (SLD). SLD is a framework that generates an image from the input prompt assesses its alignment with the prompt and performs self-corrections on the inaccuracies in the generated image. Steered by an LLM controller SLD turns text-to-image generation into an iterative closed-loop process ensuring correctness in the resulting image. SLD is not only training-free but can also be seamlessly integrated with diffusion models behind API access such as DALL-E 3 to further boost the performance of state-of-the-art diffusion models. Experimental results show that our approach can rectify a majority of incorrect generations particularly in generative numeracy attribute binding and spatial relationships. Furthermore by simply adjusting the instructions to the LLM SLD can perform image editing tasks bridging the gap between text-to-image generation and image editing pipelines. Our code is available at: https://self-correcting-llm-diffusion.github.io. + + + + PACER+: On-Demand Pedestrian Animation Controller in Driving Scenarios + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_PACER_On-Demand_Pedestrian_Animation_Controller_in_Driving_Scenarios_CVPR_2024_paper.pdf + We address the challenge of content diversity and controllability in pedestrian simulation for driving scenarios. Recent pedestrian animation frameworks have a significant limitation wherein they primarily focus on either following trajectory or the content of the reference video consequently overlooking the potential diversity of human motion within such scenarios. This limitation restricts the ability to generate pedestrian behaviors that exhibit a wider range of variations and realistic motions and therefore restricts its usage to provide rich motion content for other components in the driving simulation system e.g. suddenly changed motion to which the autonomous vehicle should respond. In our approach we strive to surpass the limitation by showcasing diverse human motions obtained from various sources such as generated human motions in addition to following the given trajectory. 
The fundamental contribution of our framework lies in combining the motion tracking task with trajectory following which enables the tracking of specific motion parts (e.g. upper body) while simultaneously following the given trajectory by a single policy. This way we significantly enhance both the diversity of simulated human motion within the given scenario and the controllability of the content including language-based control. Our framework facilitates the generation of a wide range of human motions contributing to greater realism and adaptability in pedestrian simulations for driving scenarios. + + + + LTM: Lightweight Textured Mesh Extraction and Refinement of Large Unbounded Scenes for Efficient Storage and Real-time Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Choi_LTM_Lightweight_Textured_Mesh_Extraction_and_Refinement_of_Large_Unbounded_CVPR_2024_paper.pdf + Advancements in neural signed distance fields (SDFs) have enabled modeling 3D surface geometry from a set of 2D images of real-world scenes. Baking neural SDFs can extract explicit mesh with appearance baked into texture maps as neural features. The baked meshes still have a large memory footprint and require a powerful GPU for real-time rendering. Neural optimization of such large meshes with differentiable rendering pose significant challenges. We propose a method to produce optimized meshes for large unbounded scenes with low triangle budget and high fidelity of geometry and appearance. We achieve this by combining advancements in baking neural SDFs with classical mesh simplification techniques and proposing a joint appearance-geometry refinement step. The visual quality is comparable to or better than state-of-the-art neural meshing and baking methods with high geometric accuracy despite significant reduction in triangle count making the produced meshes efficient for storage transmission and rendering on mobile hardware. We validate the effectiveness of the proposed method on large unbounded scenes from mip-NeRF 360 Tanks & Temples and Deep Blending datasets achieving at-par rendering quality with 73x reduced triangles and 11x reduction in memory footprint. + + + + Don't Drop Your Samples! Coherence-Aware Training Benefits Conditional Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Dufour_Dont_Drop_Your_Samples_Coherence-Aware_Training_Benefits_Conditional_Diffusion_CVPR_2024_paper.pdf + Conditional diffusion models are powerful generative models that can leverage various types of conditional information such as class labels segmentation masks or text captions. However in many real-world scenarios conditional information may be noisy or unreliable due to human annotation errors or weak alignment. In this paper we propose the Coherence-Aware Diffusion (CAD) a novel method to integrate confidence in conditional information into diffusion models allowing them to learn from noisy annotations without discarding data. We assume that each data point has an associated confidence score that reflects the quality of the conditional information. We then condition the diffusion model on both the conditional information and the confidence score. In this way the model learns to ignore or discount the conditioning when the confidence is low. We show that our method is theoretically sound and empirically effective on various conditional generation tasks. 
Moreover we show that leveraging confidence generates realistic and diverse samples that respect conditional information better than models trained on cleaned datasets where samples with low confidence have been discarded. + + + + What Do You See in Vehicle? Comprehensive Vision Solution for In-Vehicle Gaze Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_What_Do_You_See_in_Vehicle_Comprehensive_Vision_Solution_for_CVPR_2024_paper.pdf + Driver's eye gaze holds a wealth of cognitive and intentional cues crucial for intelligent vehicles. Despite its significance, research on in-vehicle gaze estimation remains limited due to the scarcity of comprehensive and well-annotated datasets in real driving scenarios. In this paper we present three novel elements to advance in-vehicle gaze research. Firstly, we introduce IVGaze, a pioneering dataset capturing in-vehicle gaze, collected from 125 individuals and covering a large range of gaze and head within vehicles. Conventional gaze collection systems are inadequate for in-vehicle use. In this dataset, we propose a new vision-based solution for in-vehicle gaze collection, introducing a refined gaze target calibration method to tackle annotation challenges. Second, our research focuses on in-vehicle gaze estimation leveraging the IVGaze. Images of in-vehicle faces often suffer from low resolution, prompting our introduction of a gaze pyramid transformer that harnesses transformer-based multilevel feature integration. Expanding upon this, we introduce the dual-stream gaze pyramid transformer (GazeDPTR). Employing perspective transformation, we rotate virtual cameras to normalize images, utilizing camera pose to merge normalized and original images for accurate gaze estimation. GazeDPTR showcases state-of-the-art performance on the IVGaze dataset. Thirdly, we explore a novel strategy for gaze zone classification by extending the GazeDPTR. A foundational tri-plane is newly defined, and gaze is projected onto these planes. Leveraging both positional features from the projection points and visual attributes from images, we achieve superior performance compared to relying solely on visual features, thereby substantiating the advantage of gaze estimation. The project is available at https://yihua.zone/work/ivgaze + + + + UFORecon: Generalizable Sparse-View Surface Reconstruction from Arbitrary and Unfavorable Sets + http://openaccess.thecvf.com//content/CVPR2024/papers/Na_UFORecon_Generalizable_Sparse-View_Surface_Reconstruction_from_Arbitrary_and_Unfavorable_Sets_CVPR_2024_paper.pdf + Generalizable neural implicit surface reconstruction aims to obtain an accurate underlying geometry given a limited number of multi-view images from unseen scenes. However, existing methods select only informative and relevant views using predefined scores for training and testing phases. This constraint renders the model impractical in real-world scenarios where the availability of favorable combinations cannot always be ensured. We introduce and validate a view-combination score to indicate the effectiveness of the input view combination. We observe that previous methods output degenerate solutions under arbitrary and unfavorable sets. Building upon this finding, we propose UFORecon, a robust view-combination generalizable surface reconstruction framework. To achieve this, we apply cross-view matching transformers to model interactions between source images and build correlation frustums to capture global correlations.
Additionally we explicitly encode pairwise feature similarities as view-consistent priors. Our proposed framework significantly outperforms previous methods in terms of view-combination generalizability and also in the conventional generalizable protocol trained with favorable view-combinations. The code is available at https://github.com/Youngju-Na/UFORecon. + + + + Breathing Life Into Sketches Using Text-to-Video Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Gal_Breathing_Life_Into_Sketches_Using_Text-to-Video_Priors_CVPR_2024_paper.pdf + A sketch is one of the most intuitive and versatile tools humans use to convey their ideas visually. An animated sketch opens another dimension to the expression of ideas and is widely used by designers for a variety of purposes. Animating sketches is a laborious process requiring extensive experience and professional design skills. In this work we present a method that automatically adds motion to a single-subject sketch (hence "breathing life into it") merely by providing a text prompt indicating the desired motion. The output is a short animation provided in vector representation which can be easily edited. Our method does not require extensive training but instead leverages the motion prior of a large pretrained text-to-video diffusion model using a score-distillation loss to guide the placement of strokes. To promote natural and smooth motion and to better preserve the sketch's appearance we model the learned motion through two components. The first governs small local deformations and the second controls global affine transformations. Surprisingly we find that even models that struggle to generate sketch videos on their own can still serve as a useful backbone for animating abstract representations. + + + + Learning Diffusion Texture Priors for Image Restoration + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_Learning_Diffusion_Texture_Priors_for_Image_Restoration_CVPR_2024_paper.pdf + Diffusion Models have shown remarkable performance in image generation tasks which are capable of generating diverse and realistic image content. When adopting diffusion models for image restoration the crucial challenge lies in how to preserve high-level image fidelity in the randomness diffusion process and generate accurate background structures and realistic texture details. In this paper we propose a general framework and develop a Diffusion Texture Prior Model (DTPM) for image restoration tasks. DTPM explicitly models high-quality texture details through the diffusion process rather than global contextual content. In phase one of the training stage we pre-train DTPM on approximately 55K high-quality image samples after which we freeze most of its parameters. In phase two we insert conditional guidance adapters into DTPM and equip it with an initial predictor thereby facilitating its rapid adaptation to downstream image restoration tasks. Our DTPM could mitigate the randomness of traditional diffusion models by utilizing encapsulated rich and diverse texture knowledge and background structural information provided by the initial predictor during the sampling process. Our comprehensive evaluations of five image restoration tasks demonstrate DTPM's superiority over existing regression and diffusion-based image restoration methods in perceptual quality and its exceptional generalization capabilities. 
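The Breathing Life Into Sketches abstract above factors the learned motion into small local deformations plus a global affine transformation of the stroke control points. A minimal NumPy sketch of that composition; the shapes, names and toy values are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def compose_frame(points, affine, local_offsets):
    """Combine a per-frame global affine transform with small per-point
    local deformations of sketch stroke control points."""
    homog = np.concatenate([points, np.ones((points.shape[0], 1))], axis=1)  # (N, 3)
    global_part = homog @ affine.T                                           # (N, 2), affine = [A | t]
    return global_part + local_offsets                                       # add the local wiggle

# Toy usage: a slight rotation/translation plus small random offsets.
pts = np.random.rand(16, 2)                       # 16 control points of one stroke
theta = np.deg2rad(3.0)
affine = np.array([[np.cos(theta), -np.sin(theta), 0.01],
                   [np.sin(theta),  np.cos(theta), 0.00]])
frame = compose_frame(pts, affine, 0.005 * np.random.randn(16, 2))
```

Predicting the two components separately keeps coarse layout changes and fine stroke deformations disentangled, which is the intent the abstract describes.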
+ + + + Entangled View-Epipolar Information Aggregation for Generalizable Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Min_Entangled_View-Epipolar_Information_Aggregation_for_Generalizable_Neural_Radiance_Fields_CVPR_2024_paper.pdf + Generalizable NeRF can directly synthesize novel views across new scenes, eliminating the need for scene-specific retraining in vanilla NeRF. A critical enabling factor in these approaches is the extraction of a generalizable 3D representation by aggregating source-view features. In this paper we propose an Entangled View-Epipolar Information Aggregation method dubbed EVE-NeRF. Different from existing methods that consider cross-view and along-epipolar information independently, EVE-NeRF conducts the view-epipolar feature aggregation in an entangled manner by injecting the scene-invariant appearance continuity and geometry consistency priors into the aggregation process. Our approach effectively mitigates the potential lack of inherent geometric and appearance constraints resulting from one-dimensional interactions, thus further boosting the 3D representation generalizability. EVE-NeRF attains state-of-the-art performance across various evaluation scenarios. Extensive experiments demonstrate that, compared to prevailing single-dimensional aggregation, the entangled network excels in the accuracy of 3D scene geometry and appearance reconstruction. Our code is publicly available at https://github.com/tatakai1/EVENeRF. + + + + YolOOD: Utilizing Object Detection Concepts for Multi-Label Out-of-Distribution Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zolfi_YolOOD_Utilizing_Object_Detection_Concepts_for_Multi-Label_Out-of-Distribution_Detection_CVPR_2024_paper.pdf + Out-of-distribution (OOD) detection has attracted a large amount of attention from the machine learning research community in recent years due to its importance in deployed systems. Most of the previous studies focused on the detection of OOD samples in the multi-class classification task. However, OOD detection in the multi-label classification task, a more common real-world use case, remains an underexplored domain. In this research we propose YolOOD - a method that utilizes concepts from the object detection domain to perform OOD detection in the multi-label classification task. Object detection models have an inherent ability to distinguish between objects of interest (in-distribution data) and irrelevant objects (OOD data) in images that contain multiple objects belonging to different class categories. These abilities allow us to convert a regular object detection model into an image classifier with inherent OOD detection capabilities with just minor changes. We compare our approach to state-of-the-art OOD detection methods and demonstrate YolOOD's ability to outperform these methods on a comprehensive suite of in-distribution and OOD benchmark datasets. + + + + Collaborating Foundation Models for Domain Generalized Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Benigmim_Collaborating_Foundation_Models_for_Domain_Generalized_Semantic_Segmentation_CVPR_2024_paper.pdf + Domain Generalized Semantic Segmentation (DGSS) deals with training a model on a labeled source domain with the aim of generalizing to unseen domains during inference. Existing DGSS methods typically effectuate robust features by means of Domain Randomization (DR). Such an approach is often limited as it can only account for style diversification and not content.
In this work we take an orthogonal approach to DGSS and propose to use an assembly of CoLlaborative FOUndation models for Domain Generalized Semantic Segmentation (CLOUDS). In detail CLOUDS is a framework that integrates Foundation Models of various kinds: (i) CLIP backbone for its robust feature representation (ii) Diffusion Model to diversify the content thereby covering various modes of the possible target distribution and (iii) Segment Anything Model (SAM) for iteratively refining the predictions of the segmentation model. Extensive experiments show that our CLOUDS excels in adapting from synthetic to real DGSS benchmarks and under varying weather conditions notably outperforming prior methods by 5.6% and 6.7% on averaged mIoU respectively. Our code is available at https://github.com/yasserben/CLOUDS + + + + Towards Variable and Coordinated Holistic Co-Speech Motion Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Towards_Variable_and_Coordinated_Holistic_Co-Speech_Motion_Generation_CVPR_2024_paper.pdf + This paper addresses the problem of generating lifelike holistic co-speech motions for 3D avatars focusing on two key aspects: variability and coordination. Variability allows the avatar to exhibit a wide range of motions even with similar speech content while coordination ensures a harmonious alignment among facial expressions hand gestures and body poses. We aim to achieve both with ProbTalk a unified probabilistic framework designed to jointly model facial hand and body movements in speech. ProbTalk builds on the variational autoencoder (VAE) architecture and incorporates three core designs. First we introduce product quantization (PQ) to the VAE which enriches the representation of complex holistic motion. Second we devise a novel non-autoregressive model that embeds 2D positional encoding into the product-quantized representation thereby preserving essential structure information of the PQ codes. Last we employ a secondary stage to refine the preliminary prediction further sharpening the high-frequency details. Coupling these three designs enables ProbTalk to generate natural and diverse holistic co-speech motions outperforming several state-of-the-art methods in qualitative and quantitative evaluations particularly in terms of realism. Our code and model will be released for research purposes at https://feifeifeiliu.github.io/probtalk/. + + + + AllSpark: Reborn Labeled Features from Unlabeled in Transformer for Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_AllSpark_Reborn_Labeled_Features_from_Unlabeled_in_Transformer_for_Semi-Supervised_CVPR_2024_paper.pdf + Semi-supervised semantic segmentation (SSSS) has been proposed to alleviate the burden of time-consuming pixel-level manual labeling which leverages limited labeled data along with larger amounts of unlabeled data. Current state-of-the-art methods train the labeled data with ground truths and unlabeled data with pseudo labels. However the two training flows are separate which allows labeled data to dominate the training process resulting in low-quality pseudo labels and consequently sub-optimal results. To alleviate this issue we present AllSpark which reborns the labeled features from unlabeled ones with the channel-wise cross-attention mechanism. We further introduce a Semantic Memory along with a Channel Semantic Grouping strategy to ensure that unlabeled features adequately represent labeled features. 
The AllSpark shed new light on the architecture level designs of SSSS rather than framework level which avoids increasingly complicated training pipeline designs. It can also be regarded as a flexible bottleneck module that can be seamlessly integrated into a general transformer-based segmentation model. The proposed AllSpark outperforms existing methods across all evaluation protocols on Pascal Cityscapes and COCO benchmarks without bells-and-whistles. Code and model weights are available at: https://github.com/xmed-lab/AllSpark. + + + + SIGNeRF: Scene Integrated Generation for Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Dihlmann_SIGNeRF_Scene_Integrated_Generation_for_Neural_Radiance_Fields_CVPR_2024_paper.pdf + Advances in image diffusion models have recently led to notable improvements in the generation of high-quality images. In combination with Neural Radiance Fields (NeRFs) they enabled new opportunities in 3D generation. However most generative 3D approaches are object-centric and applying them to editing existing photorealistic scenes is not trivial. We propose SIGNeRF a novel approach for fast and controllable NeRF scene editing and scene-integrated object generation. A new generative update strategy ensures 3D consistency across the edited images without requiring iterative optimization. We find that depth-conditioned diffusion models inherently possess the capability to generate 3D consistent views by requesting a grid of images instead of single views. Based on these insights we introduce a multi-view reference sheet of modified images. Our method updates an image collection consistently based on the reference sheet and refines the original NeRF with the newly generated image set in one go. By exploiting the depth conditioning mechanism of the image diffusion model we gain fine control over the spatial location of the edit and enforce shape guidance by a selected region or an external mesh. + + + + Generating Illustrated Instructions + http://openaccess.thecvf.com//content/CVPR2024/papers/Menon_Generating_Illustrated_Instructions_CVPR_2024_paper.pdf + We introduce a new task of generating "Illustrated Instructions" i.e. visual instructions customized to a user's needs. We identify desiderata unique to this task and formalize it through a suite of automatic and human evaluation metrics designed to measure the validity consistency and efficacy of the generations. We combine the power of large language models (LLMs) together with strong text-to-image generation diffusion models to propose a simple approach called StackedDiffusion which generates such illustrated instructions given text as input. The resulting model strongly outperforms baseline approaches and state-of-the-art multimodal LLMs; and in 30% of cases users even prefer it to human-generated articles. Most notably it enables various new and exciting applications far beyond what static articles on the web can provide such as personalized instructions complete with intermediate steps and pictures in response to a user's individual situation. + + + + Robust Image Denoising through Adversarial Frequency Mixup + http://openaccess.thecvf.com//content/CVPR2024/papers/Ryou_Robust_Image_Denoising_through_Adversarial_Frequency_Mixup_CVPR_2024_paper.pdf + Image denoising approaches based on deep neural networks often struggle with overfitting to specific noise distributions present in training data. 
This challenge persists in existing real-world denoising networks which are trained using a limited spectrum of real noise distributions and thus show poor robustness to out-of-distribution real noise types. To alleviate this issue we develop a novel training framework called Adversarial Frequency Mixup (AFM). AFM leverages mixup in the frequency domain to generate noisy images with distinctive and challenging noise characteristics all the while preserving the properties of authentic real-world noise. Subsequently incorporating these noisy images into the training pipeline enhances the denoising network's robustness to variations in noise distributions. Extensive experiments and analyses conducted on a wide range of real noise benchmarks demonstrate that denoising networks trained with our proposed framework exhibit significant improvements in robustness to unseen noise distributions. The code is available at https://github.com/dhryougit/AFM. + + + + AnyScene: Customized Image Synthesis with Composited Foreground + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_AnyScene_Customized_Image_Synthesis_with_Composited_Foreground_CVPR_2024_paper.pdf + Recent advancements in text-to-image technology have significantly advanced the field of image customization. Among various applications the task of customizing diverse scenes for user-specified composited elements holds great application value but has not been extensively explored. Addressing this gap we propose AnyScene a specialized framework designed to create varied scenes from composited foreground using textual prompts. AnyScene addresses the primary challenges inherent in existing methods particularly scene disharmony due to a lack of foreground semantic understanding and distortion of foreground elements. Specifically we develop a foreground injection module that guides a pre-trained diffusion model to generate cohesive scenes in visual harmony with the provided foreground. To enhance robust generation we implement a layout control strategy that prevents distortions of foreground elements. Furthermore an efficient image blending mechanism seamlessly reintegrates foreground details into the generated scenes producing outputs with overall visual harmony and precise foreground details. In addition we propose a new benchmark and a series of quantitative metrics to evaluate this proposed image customization task. Extensive experimental results demonstrate the effectiveness of AnyScene which confirms its potential in various applications. + + + + Training Generative Image Super-Resolution Models by Wavelet-Domain Losses Enables Better Control of Artifacts + http://openaccess.thecvf.com//content/CVPR2024/papers/Korkmaz_Training_Generative_Image_Super-Resolution_Models_by_Wavelet-Domain_Losses_Enables_Better_CVPR_2024_paper.pdf + Super-resolution (SR) is an ill-posed inverse problem where the size of the set of feasible solutions that are consistent with a given low-resolution image is very large. Many algorithms have been proposed to find a "good" solution among the feasible solutions that strike a balance between fidelity and perceptual quality. Unfortunately all known methods generate artifacts and hallucinations while trying to reconstruct high-frequency (HF) image details. A fundamental question is: Can a model learn to distinguish genuine image details from artifacts? Although some recent works focused on the differentiation of details and artifacts this is a very challenging problem and a satisfactory solution is yet to be found. 
This paper shows that the characterization of genuine HF details versus artifacts can be better learned by training GAN-based SR models using wavelet-domain loss functions compared to RGB-domain or Fourier-space losses. Although wavelet-domain losses have been used in the literature before they have not been used in the context of the SR task. More specifically we train the discriminator only on the HF wavelet sub-bands instead of on RGB images and the generator is trained by a fidelity loss over wavelet subbands to make it sensitive to the scale and orientation of structures. Extensive experimental results demonstrate that our model achieves better perception-distortion trade-off according to multiple objective measures and visual evaluations. + + + + Monocular Identity-Conditioned Facial Reflectance Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_Monocular_Identity-Conditioned_Facial_Reflectance_Reconstruction_CVPR_2024_paper.pdf + Recent 3D face reconstruction methods have made remarkable advancements yet there remain huge challenges in monocular high-quality facial reflectance reconstruction. Existing methods rely on a large amount of light-stage captured data to learn facial reflectance models. However the lack of subject diversity poses challenges in achieving good generalization and widespread applicability. In this paper we learn the reflectance prior in image space rather than UV space and present a framework named ID2Reflectance. Our framework can directly estimate the reflectance maps of a single image while using limited reflectance data for training. Our key insight is that reflectance data shares facial structures with RGB faces which enables obtaining expressive facial prior from inexpensive RGB data thus reducing the dependency on reflectance data. We first learn a high-quality prior for facial reflectance. Specifically we pretrain multi-domain facial feature codebooks and design a codebook fusion method to align the reflectance and RGB domains. Then we propose an identity-conditioned swapping module that injects facial identity from the target image into the pre-trained auto-encoder to modify the identity of the source reflectance image. Finally we stitch multi-view swapped reflectance images to obtain renderable assets. Extensive experiments demonstrate that our method exhibits excellent generalization capability and achieves state-of-the-art facial reflectance reconstruction results for in-the-wild faces. + + + + C3: High-Performance and Low-Complexity Neural Compression from a Single Image or Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_C3_High-Performance_and_Low-Complexity_Neural_Compression_from_a_Single_Image_CVPR_2024_paper.pdf + Most neural compression models are trained on large datasets of images or videos in order to generalize to unseen data. Such generalization typically requires large and expressive architectures with a high decoding complexity. Here we introduce C3 a neural compression method with strong rate-distortion (RD) performance that instead overfits a small model to each image or video separately. The resulting decoding complexity of C3 can be an order of magnitude lower than neural baselines with similar RD performance. C3 builds on COOL-CHIC [Ladune et al 2023] and makes several simple and effective improvements for images. We further develop new methodology to apply C3 to videos. 
On the CLIC2020 image benchmark we match the RD performance of VTM, the reference implementation of the H.266 codec, with less than 3k MACs/pixel for decoding. On the UVG video benchmark we match the RD performance of the Video Compression Transformer [Mentzer et al 2022], a well-established neural video codec, with less than 5k MACs/pixel for decoding. + + + + Revisiting Non-Autoregressive Transformers for Efficient Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Ni_Revisiting_Non-Autoregressive_Transformers_for_Efficient_Image_Synthesis_CVPR_2024_paper.pdf + The field of image synthesis is currently flourishing due to the advancements in diffusion models. While diffusion models have been successful, their computational intensity has prompted the pursuit of more efficient alternatives. As a representative work, non-autoregressive Transformers (NATs) have been recognized for their rapid generation. However, a major drawback of these models is their inferior performance compared to diffusion models. In this paper we aim to re-evaluate the full potential of NATs by revisiting the design of their training and inference strategies. Specifically, we identify the complexities in properly configuring these strategies and indicate the possible sub-optimality in existing heuristic-driven designs. Recognizing this, we propose to go beyond existing methods by directly solving the optimal strategies in an automatic framework. The resulting method, named AutoNAT, advances the performance boundaries of NATs notably and is able to perform comparably with the latest diffusion models with a significantly reduced inference cost. The effectiveness of AutoNAT is comprehensively validated on four benchmark datasets, i.e. ImageNet-256 & 512, MS-COCO and CC3M. Code and pre-trained models will be available at https://github.com/LeapLabTHU/ImprovedNAT. + + + + ANIM: Accurate Neural Implicit Model for Human Reconstruction from a single RGB-D Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Pesavento_ANIM_Accurate_Neural_Implicit_Model_for_Human_Reconstruction_from_a_CVPR_2024_paper.pdf + Recent progress in human shape learning shows that neural implicit models are effective in generating 3D human surfaces from a limited number of views and even from a single RGB image. However, existing monocular approaches still struggle to recover fine geometric details such as face, hands or cloth wrinkles. They are also easily prone to depth ambiguities that result in distorted geometries along the camera optical axis. In this paper we explore the benefits of incorporating depth observations in the reconstruction process by introducing ANIM, a novel method that reconstructs arbitrary 3D human shapes from single-view RGB-D images with an unprecedented level of accuracy. Our model learns geometric details from both multi-resolution pixel-aligned and voxel-aligned features to leverage depth information and enable spatial relationships, mitigating depth ambiguities. We further enhance the quality of the reconstructed shape by introducing a depth-supervision strategy, which improves the accuracy of the signed distance field estimation of points that lie on the reconstructed surface. Experiments demonstrate that ANIM outperforms state-of-the-art works that use RGB, surface normals, point cloud or RGB-D data as input.
In addition we introduce ANIM-Real a new multi-modal dataset comprising high-quality scans paired with consumer-grade RGB-D camera and our protocol to fine-tune ANIM enabling high-quality reconstruction from real-world human capture. + + + + Real-Time Simulated Avatar from Head-Mounted Sensors + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_Real-Time_Simulated_Avatar_from_Head-Mounted_Sensors_CVPR_2024_paper.pdf + We present SimXR a method for controlling a simulated avatar from information (headset pose and cameras) obtained from AR / VR headsets. Due to the challenging viewpoint of head-mounted cameras the human body is often clipped out of view making traditional image-based egocentric pose estimation challenging. On the other hand headset poses provide valuable information about overall body motion but lack fine-grained details about the hands and feet. To synergize headset poses with cameras we control a humanoid to track headset movement while analyzing input images to decide body movement. When body parts are seen the movements of hands and feet will be guided by the images; when unseen the laws of physics guide the controller to generate plausible motion. We design an end-to-end method that does not rely on any intermediate representations and learns to directly map from images and headset poses to humanoid control signals. To train our method we also propose a large-scale synthetic dataset created using camera configurations compatible with a commercially available VR headset (Quest 2) and show promising results on real-world captures. To demonstrate the applicability of our framework we also test it on an AR headset with a forward-facing camera. + + + + Seamless Human Motion Composition with Blended Positional Encodings + http://openaccess.thecvf.com//content/CVPR2024/papers/Barquero_Seamless_Human_Motion_Composition_with_Blended_Positional_Encodings_CVPR_2024_paper.pdf + Conditional human motion generation is an important topic with many applications in virtual reality gaming and robotics. While prior works have focused on generating motion guided by text music or scenes these typically result in isolated motions confined to short durations. Instead we address the generation of long continuous sequences guided by a series of varying textual descriptions. In this context we introduce FlowMDM the first diffusion-based model that generates seamless Human Motion Compositions (HMC) without any postprocessing or redundant denoising steps. For this we introduce the Blended Positional Encodings a technique that leverages both absolute and relative positional encodings in the denoising chain. More specifically global motion coherence is recovered at the absolute stage whereas smooth and realistic transitions are built at the relative stage. As a result we achieve state-of-the-art results in terms of accuracy realism and smoothness on the Babel and HumanML3D datasets. FlowMDM excels when trained with only a single description per motion sequence thanks to its Pose-Centric Cross-ATtention which makes it robust against varying text descriptions at inference time. Finally to address the limitations of existing HMC metrics we propose two new metrics: the Peak Jerk and the Area Under the Jerk to detect abrupt transitions. 
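The FlowMDM abstract above proposes Peak Jerk and Area Under the Jerk to flag abrupt transitions; the exact definitions are in the paper, but a rough finite-difference version can be sketched as follows (the frame rate, tensor shapes and joint-averaging step are assumptions):

```python
import numpy as np

def jerk_statistics(joints, fps=20.0):
    """Rough Peak-Jerk / Area-Under-the-Jerk style statistics for a motion clip.
    joints: (T, J, 3) joint positions over T frames."""
    dt = 1.0 / fps
    jerk = np.diff(joints, n=3, axis=0) / dt ** 3            # third finite difference, (T-3, J, 3)
    jerk_mag = np.linalg.norm(jerk, axis=-1).mean(axis=1)    # average joint jerk per frame, (T-3,)
    peak_jerk = jerk_mag.max()
    area_under_jerk = jerk_mag.sum() * dt                    # simple rectangle-rule integral
    return peak_jerk, area_under_jerk

# A smooth synthetic trajectory should give small values for both statistics.
t = np.linspace(0.0, 2.0, 40)[:, None, None]
smooth_motion = np.concatenate([np.sin(t), np.cos(t), np.zeros_like(t)], axis=-1)  # (40, 1, 3)
print(jerk_statistics(smooth_motion))
```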
+ + + + FedUV: Uniformity and Variance for Heterogeneous Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Son_FedUV_Uniformity_and_Variance_for_Heterogeneous_Federated_Learning_CVPR_2024_paper.pdf + Federated learning is a promising framework to train neural networks with widely distributed data. However performance degrades heavily with heterogeneously distributed data. Recent work has shown this is due to the final layer of the network being most prone to local bias some finding success freezing the final layer as an orthogonal classifier. We investigate the training dynamics of the classifier by applying SVD to the weights motivated by the observation that freezing weights results in constant singular values. We find that there are differences when training in IID and non-IID settings. Based on this finding we introduce two regularization terms for local training to continuously emulate IID settings: (1) variance in the dimension-wise probability distribution of the classifier and (2) hyperspherical uniformity of representations of the encoder. These regularizations promote local models to act as if it were in an IID setting regardless of the local data distribution thus offsetting proneness to bias while being flexible to the data. On extensive experiments in both label-shift and feature-shift settings we verify that our method achieves highest performance by a large margin especially in highly non-IID cases in addition to being scalable to larger models and datasets. + + + + GAvatar: Animatable 3D Gaussian Avatars with Implicit Mesh Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_GAvatar_Animatable_3D_Gaussian_Avatars_with_Implicit_Mesh_Learning_CVPR_2024_paper.pdf + Gaussian splatting has emerged as a powerful 3D representation that harnesses the advantages of both explicit (mesh) and implicit (NeRF) 3D representations. In this paper we seek to leverage Gaussian splatting to generate realistic animatable avatars from textual descriptions addressing the limitations (e.g. efficiency and flexibility) imposed by mesh or NeRF-based representations. However a naive application of Gaussian splatting cannot generate high-quality animatable avatars and suffers from learning instability; it also cannot capture fine avatar geometries and often leads to degenerate body parts. To tackle these problems we first propose a primitive-based 3D Gaussian representation where Gaussians are defined inside pose-driven primitives to facilitate animations. Second to stabilize and amortize the learning of millions of Gaussians we propose to use implicit neural fields to predict the Gaussian attributes (e.g. colors). Finally to capture fine avatar geometries and extract detailed meshes we propose a novel SDF-based implicit mesh learning approach for 3D Gaussians that regularizes the underlying geometries and extracts highly detailed textured meshes. Our proposed method GAvatar enables the large-scale generation of diverse animatable avatars using only text prompts. GAvatar significantly surpasses existing methods in terms of both appearance and geometry quality and achieves extremely fast rendering (100 fps) at 1K resolution. 
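The FedUV abstract above regularizes local training with (1) variance in the dimension-wise probability distribution of the classifier and (2) hyperspherical uniformity of encoder representations. A hedged PyTorch sketch of plausible forms for the two terms; the hinge threshold, temperature and exact formulations are assumptions rather than the paper's definitions:

```python
import torch
import torch.nn.functional as F

def variance_term(logits, eps=1e-4):
    """Encourage spread in the per-class probability mass across a local batch."""
    probs = F.softmax(logits, dim=1)                         # (B, C)
    std = probs.var(dim=0, unbiased=False).add(eps).sqrt()   # per-class std over the batch
    target = 1.0 / logits.shape[1]                           # assumed target spread of 1/C
    return F.relu(target - std).mean()                       # hinge: penalize collapsed dimensions

def uniformity_term(features, t=2.0):
    """Hyperspherical uniformity: log-mean-exp of pairwise Gaussian potentials
    between L2-normalized embeddings (a Wang & Isola-style formulation)."""
    z = F.normalize(features, dim=1)                         # (B, D) on the unit sphere
    sq_dists = torch.pdist(z, p=2).pow(2)                    # pairwise squared distances
    return sq_dists.mul(-t).exp().mean().log()

# local_loss = ce_loss + mu1 * variance_term(logits) + mu2 * uniformity_term(embeddings)
```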
+ + + + Grounding Everything: Emerging Localization Properties in Vision-Language Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Bousselham_Grounding_Everything_Emerging_Localization_Properties_in_Vision-Language_Transformers_CVPR_2024_paper.pdf + Vision-language foundation models have shown remarkable performance in various zero-shot settings such as image retrieval, classification or captioning. But so far those models seem to fall behind when it comes to zero-shot localization of referential expressions and objects in images. As a result they need to be fine-tuned for this task. In this paper we show that pretrained vision-language (VL) models allow for zero-shot open-vocabulary object localization without any fine-tuning. To leverage those capabilities we propose a Grounding Everything Module (GEM) that generalizes the idea of value-value attention introduced by CLIPSurgery to a self-self attention path. We show that the concept of self-self attention corresponds to clustering, thus enforcing groups of tokens arising from the same object to be similar while preserving the alignment with the language space. To further guide the group formation, we propose a set of regularizations that allows the model to finally generalize across datasets and backbones. We evaluate the proposed GEM framework on various benchmark tasks and datasets for semantic segmentation. GEM not only outperforms other training-free open-vocabulary localization methods but also achieves state-of-the-art results on the recently proposed OpenImagesV7 large-scale segmentation benchmark. Code is available at https://github.com/WalBouss/GEM + + + + Mean-Shift Feature Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Kobayashi_Mean-Shift_Feature_Transformer_CVPR_2024_paper.pdf + Transformer models developed in NLP make a great impact on computer vision fields, producing promising performance on various tasks. While multi-head attention, a characteristic mechanism of the transformer, attracts keen research interest such as for reducing computation cost, we analyze the transformer model from a viewpoint of feature transformation based on a distribution of input feature tokens. The analysis inspires us to derive a novel transformation method from the mean-shift update, which is an effective gradient ascent to seek a local mode of distinctive representation on the token distribution. We also present an efficient projection approach to reduce the parameter size of the linear projections constituting the proposed multi-head feature transformation. In the experiments on the ImageNet-1K dataset, the proposed methods embedded into various network models exhibit favorable performance improvement in place of the transformer module. + + + + Domain Separation Graph Neural Networks for Saliency Object Ranking + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Domain_Separation_Graph_Neural_Networks_for_Saliency_Object_Ranking_CVPR_2024_paper.pdf + Saliency object ranking (SOR) has attracted significant attention recently. Previous methods usually failed to explicitly explore the saliency degree-related relationships between objects. In this paper we propose a novel Domain Separation Graph Neural Network (DSGNN), which starts with separately extracting the shape and texture cues from each object and builds a shape graph as well as a texture graph for all objects in the given image.
Then we propose a Shape-Texture Graph Domain Separation (STGDS) module to separate the task-relevant and irrelevant information of target objects by explicitly modelling the relationship between each pair of objects in terms of their shapes and textures respectively. Furthermore a Cross Image Graph Domain Separation (CIGDS) module is introduced to explore the saliency degree subspace that is robust to different scenes aiming to create a unified representation for targets with the same saliency levels in different images. Importantly our DSGNN automatically learns a multi-dimensional feature to represent each graph edge allowing complex diverse and ranking-related relationships to be modelled. Experimental results show that our DSGNN achieved the new state-of-the-art performance on both ASSR and IRSR datasets with large improvements of 5.2% and 4.1% SA-SOR respectively. Our code is provided in https://github.com/Wu-ZJ/DSGNN. + + + + RAM-Avatar: Real-time Photo-Realistic Avatar from Monocular Videos with Full-body Control + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_RAM-Avatar_Real-time_Photo-Realistic_Avatar_from_Monocular_Videos_with_Full-body_Control_CVPR_2024_paper.pdf + This paper focuses on advancing the applicability of human avatar learning methods by proposing RAM-Avatar which learns a Real-time photo-realistic Avatar that supports full-body control from Monocular videos. To achieve this goal RAM-Avatar leverages two statistical templates responsible for modeling the facial expression and hand gesture variations while a sparsely computed dual attention module is introduced upon another body template to facilitate high-fidelity texture rendering for the torsos and limbs. Building on this foundation we deploy a lightweight yet powerful StyleUnet along with a temporal-aware discriminator to achieve real-time realistic rendering. To enable robust animation for out-of-distribution poses we propose a Motion Distribution Align module to compensate for the discrepancies between the training and testing motion distribution. Results and extensive experiments conducted in various experimental settings demonstrate the superiority of our proposed method and a real-time live system is proposed to further push research into applications. The training and testing code will be released for research purposes. + + + + Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes + http://openaccess.thecvf.com//content/CVPR2024/papers/Shrivastava_Video_Prediction_by_Modeling_Videos_as_Continuous_Multi-Dimensional_Processes_CVPR_2024_paper.pdf + Diffusion models have made significant strides in image generation mastering tasks such as unconditional image synthesis text-image translation and image-to-image conversions. However their capability falls short in the realm of video prediction mainly because they treat videos as a collection of independent images relying on external constraints such as temporal attention mechanisms to enforce temporal coherence. In our paper we introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. Through extensive experimentation we establish state-of-the-art performance in video prediction validated on benchmark datasets including KTH BAIR Human3.6M and UCF101. 
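The DSGNN abstract above builds a shape graph and a texture graph over the detected objects and lets every edge carry a learned multi-dimensional feature. As a generic illustration only (the fully connected topology, layer sizes and edge MLP are assumptions, not the authors' module), such edge features could be produced as:

```python
import torch
import torch.nn as nn

class EdgeFeatureGraph(nn.Module):
    """Fully connected object graph whose edges carry a learned feature vector."""

    def __init__(self, node_dim=256, edge_dim=64):
        super().__init__()
        self.edge_mlp = nn.Sequential(
            nn.Linear(2 * node_dim, 128), nn.ReLU(),
            nn.Linear(128, edge_dim),
        )

    def forward(self, nodes):                                 # nodes: (N, node_dim), one row per object
        n = nodes.shape[0]
        src = nodes.unsqueeze(1).expand(n, n, -1)             # (N, N, node_dim)
        dst = nodes.unsqueeze(0).expand(n, n, -1)             # (N, N, node_dim)
        return self.edge_mlp(torch.cat([src, dst], dim=-1))   # (N, N, edge_dim)

# One instance per cue, e.g. shape_graph(shape_feats) and texture_graph(texture_feats).
shape_graph = EdgeFeatureGraph()
edges = shape_graph(torch.randn(5, 256))                      # pairwise edge features for 5 objects
```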
+ + + + PICTURE: PhotorealistIC virtual Try-on from UnconstRained dEsigns + http://openaccess.thecvf.com//content/CVPR2024/papers/Ning_PICTURE_PhotorealistIC_virtual_Try-on_from_UnconstRained_dEsigns_CVPR_2024_paper.pdf + In this paper we propose a novel virtual try-on from unconstrained designs (ucVTON) task to enable photorealistic synthesis of personalized composite clothing on input human images. Unlike prior arts constrained by specific input types our method allows flexible specification of style (text or image) and texture (full garment cropped sections or texture patches) conditions. To address the entanglement challenge when using full garment images as conditions we develop a two-stage pipeline with explicit disentanglement of style and texture. In the first stage we generate a human parsing map reflecting the desired style conditioned on the input. In the second stage we composite textures onto the parsing map areas based on the texture input. To represent complex and non-stationary textures that have never been achieved in previous fashion editing works we first propose extracting hierarchical and balanced CLIP features and applying position encoding in VTON. Experiments demonstrate superior synthesis quality and personalization enabled by our method. The flexible control over style and texture mixing brings virtual try-on to a new level of user experience for online shopping and fashion design. + + + + Towards Robust 3D Pose Transfer with Adversarial Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Towards_Robust_3D_Pose_Transfer_with_Adversarial_Learning_CVPR_2024_paper.pdf + 3D pose transfer that aims to transfer the desired pose to a target mesh is one of the most challenging 3D generation tasks. Previous attempts rely on well-defined parametric human models or skeletal joints as driving pose sources. However to obtain those clean pose sources cumbersome but necessary pre-processing pipelines are inevitable hindering implementations of the real-time applications. This work is driven by the intuition that the robustness of the model can be enhanced by introducing adversarial samples into the training leading to a more invulnerable model to the noisy inputs which even can be further extended to directly handling the real-world data like raw point clouds/scans without intermediate processing. Furthermore we propose a novel 3D pose Masked Autoencoder (3D-PoseMAE) a customized MAE that effectively learns 3D extrinsic presentations (i.e. pose). 3D-PoseMAE facilitates learning from the aspect of extrinsic attributes by simultaneously generating adversarial samples that perturb the model and learning the arbitrary raw noisy poses via a multi-scale masking strategy. Both qualitative and quantitative studies show that the transferred meshes given by our network result in much better quality. Besides we demonstrate the strong generalizability of our method on various poses different domains and even raw scans. Experimental results also show meaningful insights that the intermediate adversarial samples generated in the training can successfully attack the existing pose transfer models. 
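The pose-transfer abstract above ("Towards Robust 3D Pose Transfer with Adversarial Learning") trains on adversarial samples that perturb the raw inputs. A generic PGD-style sketch of that idea; the model and loss interfaces, step count and perturbation budget are placeholders, not the paper's exact procedure:

```python
import torch

def adversarial_points(model, src_points, driving_pose, target_mesh, loss_fn,
                       eps=0.01, steps=3):
    """Perturb the source point cloud by a few gradient-ascent steps on the
    pose-transfer loss, then return it as an extra (adversarial) training sample."""
    delta = torch.zeros_like(src_points, requires_grad=True)
    for _ in range(steps):
        pred = model(src_points + delta, driving_pose)        # placeholder interface
        loss = loss_fn(pred, target_mesh)
        grad, = torch.autograd.grad(loss, delta)
        with torch.no_grad():
            delta += (eps / steps) * grad.sign()              # ascend the loss
            delta.clamp_(-eps, eps)                           # keep the perturbation small
    return (src_points + delta).detach()

# Training then mixes clean and perturbed inputs, e.g.
# loss = loss_fn(model(adversarial_points(...), driving_pose), target_mesh)
```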
+ + + + EAGLE: Eigen Aggregation Learning for Object-Centric Unsupervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_EAGLE_Eigen_Aggregation_Learning_for_Object-Centric_Unsupervised_Semantic_Segmentation_CVPR_2024_paper.pdf + Semantic segmentation has innately relied on extensive pixel-level annotated data leading to the emergence of unsupervised methodologies. Among them leveraging self-supervised Vision Transformers for unsupervised semantic segmentation (USS) has been making steady progress with expressive deep features. Yet for semantically segmenting images with complex objects a predominant challenge remains: the lack of explicit object-level semantic encoding in patch-level features. This technical limitation often leads to inadequate segmentation of complex objects with diverse structures. To address this gap we present a novel approach EAGLE which emphasizes object-centric representation learning for unsupervised semantic segmentation. Specifically we introduce EiCue a spectral technique providing semantic and structural cues through an eigenbasis derived from the semantic similarity matrix of deep image features and color affinity from an image. Further by incorporating our object-centric contrastive loss with EiCue we guide our model to learn object-level representations with intra- and inter-image object-feature consistency thereby enhancing semantic accuracy. Extensive experiments on COCO-Stuff Cityscapes and Potsdam-3 datasets demonstrate the state-of-the-art USS results of EAGLE with accurate and consistent semantic segmentation across complex scenes. + + + + AVID: Any-Length Video Inpainting with Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_AVID_Any-Length_Video_Inpainting_with_Diffusion_Model_CVPR_2024_paper.pdf + Recent advances in diffusion models have successfully enabled text-guided image inpainting. While it seems straightforward to extend such editing capability into the video domain there have been fewer works regarding text-guided video inpainting. Given a video a masked region at its initial frame and an editing prompt it requires a model to do infilling at each frame following the editing guidance while keeping the out-of-mask region intact. There are three main challenges in text-guided video inpainting: (i) temporal consistency of the edited video (ii) supporting different inpainting types at different structural fidelity levels and (iii) dealing with variable video length. To address these challenges we introduce Any-Length Video Inpainting with Diffusion Model dubbed as AVID. At its core our model is equipped with effective motion modules and adjustable structure guidance for fixed-length video inpainting. Building on top of that we propose a novel Temporal MultiDiffusion sampling pipeline with a middle-frame attention guidance mechanism facilitating the generation of videos with any desired duration. Our comprehensive experiments show our model can robustly deal with various inpainting types at different video duration ranges with high quality. + + + + NoiseCollage: A Layout-Aware Text-to-Image Diffusion Model Based on Noise Cropping and Merging + http://openaccess.thecvf.com//content/CVPR2024/papers/Shirakawa_NoiseCollage_A_Layout-Aware_Text-to-Image_Diffusion_Model_Based_on_Noise_Cropping_CVPR_2024_paper.pdf + Layout-aware text-to-image generation is a task to generate multi-object images that reflect layout conditions in addition to text conditions. 
The current layout-aware text-to-image diffusion models still have several issues including mismatches between the text and layout conditions and quality degradation of generated images. This paper proposes a novel layout-aware text-to-image diffusion model called NoiseCollage to tackle these issues. During the denoising process NoiseCollage independently estimates noises for individual objects and then crops and merges them into a single noise. This operation helps avoid condition mismatches; in other words it can put the right objects in the right places. Qualitative and quantitative evaluations show that NoiseCollage outperforms several state-of-the-art models. These successful results indicate that the crop-and-merge operation of noises is a reasonable strategy to control image generation. We also show that NoiseCollage can be integrated with ControlNet to use edges sketches and pose skeletons as additional conditions. Experimental results show that this integration boosts the layout accuracy of ControlNet. The code is available at https://github.com/univ-esuty/noisecollage. + + + + Arbitrary Motion Style Transfer with Multi-condition Motion Latent Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_Arbitrary_Motion_Style_Transfer_with_Multi-condition_Motion_Latent_Diffusion_Model_CVPR_2024_paper.pdf + Computer animation's quest to bridge content and style has historically been a challenging venture with previous efforts often leaning toward one at the expense of the other. This paper tackles the inherent challenge of content-style duality ensuring a harmonious fusion where the core narrative of the content is both preserved and elevated through stylistic enhancements. We propose a novel Multi-condition Motion Latent Diffusion Model (MCM-LDM) for Arbitrary Motion Style Transfer (AMST). Our MCM-LDM significantly emphasizes preserving trajectories recognizing their fundamental role in defining the essence and fluidity of motion content. Our MCM-LDM's cornerstone lies in its ability first to disentangle and then intricately weave together motion's tripartite components: motion trajectory motion content and motion style. The critical insight of MCM-LDM is to embed multiple conditions with distinct priorities. The content channel serves as the primary flow guiding the overall structure and movement while the trajectory and style channels act as auxiliary components and synchronize with the primary one dynamically. This mechanism ensures that multi-conditions can seamlessly integrate into the main flow enhancing the overall animation without overshadowing the core content. Empirical evaluations underscore the model's proficiency in achieving fluid and authentic motion style transfers setting a new benchmark in the realm of computer animation. The source code and model are available at https://github.com/XingliangJin/MCM-LDM.git. + + + + ViT-CoMer: Vision Transformer with Convolutional Multi-scale Feature Interaction for Dense Predictions + http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_ViT-CoMer_Vision_Transformer_with_Convolutional_Multi-scale_Feature_Interaction_for_Dense_CVPR_2024_paper.pdf + Although Vision Transformer (ViT) has achieved significant success in computer vision it does not perform well in dense prediction tasks due to the lack of inner-patch information interaction and the limited diversity of feature scale. 
Most existing studies are devoted to designing vision-specific transformers to solve the above problems which introduce additional pre-training costs. Therefore we present a plain pre-training-free and feature-enhanced ViT backbone with Convolutional Multi-scale feature interaction named ViT-CoMer which facilitates bidirectional interaction between CNN and transformer. Compared to the state-of-the-art ViT-CoMer has the following advantages: (1) We inject spatial pyramid multi-receptive field convolutional features into the ViT architecture which effectively alleviates the problems of limited local information interaction and single-feature representation in ViT. (2) We propose a simple and efficient CNN-Transformer bidirectional fusion interaction module that performs multi-scale fusion across hierarchical features which is beneficial for handling dense prediction tasks. (3) We evaluate the performance of ViT-CoMer across various dense prediction tasks different frameworks and multiple advanced pre-training. Notably our ViT-CoMer-L achieves 64.3% AP on COCO val2017 without extra training data and 62.1% mIoU on ADE20K val both of which are comparable to state-of-the-art methods. We hope ViT-CoMer can serve as a new backbone for dense prediction tasks to facilitate future research. The code will be released at https://github.com/Traffic-X/ViT-CoMer. + + + + PromptCoT: Align Prompt Distribution via Adapted Chain-of-Thought + http://openaccess.thecvf.com//content/CVPR2024/papers/Yao_PromptCoT_Align_Prompt_Distribution_via_Adapted_Chain-of-Thought_CVPR_2024_paper.pdf + Diffusion-based generative models have exhibited remarkable capability in the production of high-fidelity visual content such as images and videos. However their performance is significantly contingent upon the quality of textual inputs commonly referred to as "prompts". The process of traditional prompt engineering while effective necessitates empirical expertise and poses challenges for inexperienced users. In this paper we introduce PromptCoT an innovative enhancer that autonomously refines prompts for users. PromptCoT is designed based on the observation that prompts which resemble the textual information of high-quality images in the training set often lead to superior generation performance. Therefore we fine-tune the pre-trained Large Language Models (LLM) using a curated text dataset that solely comprises descriptions of high-quality visual content. By doing so the LLM can capture the distribution of high-quality training texts enabling it to generate aligned continuations and revisions to boost the original texts. Nonetheless one drawback of pre-trained LLMs is their tendency to generate extraneous or irrelevant information. We employ the Chain-of-Thought (CoT) mechanism to improve the alignment between the original text prompts and their refined versions. CoT can extract and amalgamate crucial information from the aligned continuation and revision enabling reasonable inferences based on the contextual cues to produce a more comprehensive and nuanced final output. Considering computational efficiency instead of allocating a dedicated LLM for prompt enhancement to each individual model or dataset we integrate adapters that facilitate dataset-specific adaptation leveraging a shared pre-trained LLM as the foundation for this process. With independent fine-tuning of these adapters we can adapt PromptCoT to new datasets while minimally increasing training costs and memory usage. 
We evaluate the effectiveness of PromptCoT by assessing its performance on widely-used latent diffusion models for image and video generation. The results demonstrate significant improvements in key performance metrics. + + + + Anomaly Score: Evaluating Generative Models and Individual Generated Images based on Complexity and Vulnerability + http://openaccess.thecvf.com//content/CVPR2024/papers/Hwang_Anomaly_Score_Evaluating_Generative_Models_and_Individual_Generated_Images_based_CVPR_2024_paper.pdf + With the advancement of generative models the assessment of generated images becomes increasingly more important. Previous methods measure distances between features of reference and generated images from trained vision models. In this paper we conduct an extensive investigation into the relationship between the representation space and input space around generated images. We first propose two measures related to the presence of unnatural elements within images: complexity which indicates how non-linear the representation space is and vulnerability which is related to how easily the extracted feature changes by adversarial input changes. Based on these we introduce a new metric to evaluating image-generative models called anomaly score (AS). Moreover we propose AS-i (anomaly score for individual images) that can effectively evaluate generated images individually. Experimental results demonstrate the validity of the proposed approach. + + + + GeneAvatar: Generic Expression-Aware Volumetric Head Avatar Editing from a Single Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Bao_GeneAvatar_Generic_Expression-Aware_Volumetric_Head_Avatar_Editing_from_a_Single_CVPR_2024_paper.pdf + Recently we have witnessed the explosive growth of various volumetric representations in modeling animatable head avatars. However due to the diversity of frameworks there is no practical method to support high-level applications like 3D head avatar editing across different representations. In this paper we propose a generic avatar editing approach that can be universally applied to various 3DMM driving volumetric head avatars. To achieve this goal we design a novel expression-aware modification generative model which enables lift 2D editing from a single image to a consistent 3D modification field. To ensure the effectiveness of the generative modification process we develop several techniques including an expression-dependent modification distillation scheme to draw knowledge from the large-scale head avatar model and 2D facial texture editing tools implicit latent space guidance to enhance model convergence and a segmentation-based loss reweight strategy for fine-grained texture inversion. Extensive experiments demonstrate that our method delivers high-quality and consistent results across multiple expression and viewpoints. Project page: https://zju3dv.github.io/ geneavatar/. + + + + Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Learn_to_Rectify_the_Bias_of_CLIP_for_Unsupervised_Semantic_CVPR_2024_paper.pdf + Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task where only images without annotations are available. However we observe that when adopting CLIP to such a pixel-level understanding task unexpected bias occurs. Previous works don't explicitly model such bias which largely constrains the segmentation performance. 
In this paper we propose to explicitly model and rectify the bias existing in CLIP to facilitate the unsupervised semantic segmentation. Specifically we design a learnable "Reference" prompt to encode class-preference bias and project the positional embedding of vision transformer to represent space-preference bias. Via a simple element-wise subtraction we rectify the logits of CLIP classifier. Based on the rectified logits we generate a segmentation mask via a Gumbel-Softmax operation. Then a contrastive loss between masked visual feature and the text features of different classes is imposed to facilitate the effective bias modeling. To further improve the segmentation we distill the knowledge from the rectified CLIP to the advanced segmentation architecture via minimizing our designed mask-guided feature-guided and text-guided loss terms. Extensive experiments on standard benchmarks demonstrate that our method performs favorably against previous state-of-the-arts. The implementation is available at https://github.com/dogehhh/ReCLIP. + + + + Unlocking Pre-trained Image Backbones for Semantic Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Ifriqi_Unlocking_Pre-trained_Image_Backbones_for_Semantic_Image_Synthesis_CVPR_2024_paper.pdf + Semantic image synthesis i.e. generating images from user-provided semantic label maps is an important conditional image generation task as it allows to control both the content as well as the spatial layout of generated images. Although diffusion models have pushed the state of the art in generative image modeling the iterative nature of their inference process makes them computationally demanding. Other approaches such as GANs are more efficient as they only need a single feed-forward pass for generation but the image quality tends to suffer when modeling large and diverse datasets. In this work we propose a new class of GAN discriminators for semantic image synthesis that generates highly realistic images by exploiting feature backbones pre-trained for tasks such as image classification. We also introduce a new generator architecture with better context modeling and using cross-attention to inject noise into latent variables leading to more diverse generated images. Our model which we dub DP-SIMS achieves state-of-the-art results in terms of image quality and consistency with the input label maps on ADE-20K COCO-Stuff and Cityscapes surpassing recent diffusion models while requiring two orders of magnitude less compute for inference. + + + + TexTile: A Differentiable Metric for Texture Tileability + http://openaccess.thecvf.com//content/CVPR2024/papers/Rodriguez-Pardo_TexTile_A_Differentiable_Metric_for_Texture_Tileability_CVPR_2024_paper.pdf + We introduce TexTile a novel differentiable metric to quantify the degree upon which a texture image can be concatenated with itself without introducing repeating artifacts (i.e. the tileability). Existing methods for tileable texture synthesis focus on general texture quality but lack explicit analysis of the intrinsic repeatability properties of a texture. In contrast our TexTile metric effectively evaluates the tileable properties of a texture opening the door to more informed synthesis and analysis of tileable textures. 
Under the hood, TexTile is formulated as a binary classifier carefully built from a large dataset of textures of different styles, semantics, regularities, and human annotations. Key to our method is a set of architectural modifications to baseline pre-trained image classifiers to overcome their shortcomings at measuring tileability, along with a custom data augmentation and training regime aimed at increasing robustness and accuracy. We demonstrate that TexTile can be plugged into different state-of-the-art texture synthesis methods, including diffusion-based strategies, and generate tileable textures while keeping or even improving the overall texture quality. Furthermore, we show that TexTile can objectively evaluate any tileable texture synthesis method, whereas the current mix of existing metrics produces uncorrelated scores, which heavily hinders progress in the field. + + + + Improving Image Restoration through Removing Degradations in Textual Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_Improving_Image_Restoration_through_Removing_Degradations_in_Textual_Representations_CVPR_2024_paper.pdf + In this paper, we introduce a new perspective for improving image restoration by removing degradation in the textual representations of a given degraded image. Intuitively, restoration is much easier in the text modality than in the image one. For example, it can be easily conducted by removing degradation-related words while keeping the content-aware words. Hence, we combine the advantages of images in detailed description and those of text in degradation removal to perform restoration. To address the cross-modal assistance, we propose to map the degraded images into textual representations for removing the degradations, and then convert the restored textual representations into a guidance image for assisting image restoration. In particular, we ingeniously embed an image-to-text mapper and a text restoration module into CLIP-equipped text-to-image models to generate the guidance. Then, we adopt a simple coarse-to-fine approach to dynamically inject multi-scale information from the guidance into image restoration networks. Extensive experiments are conducted on various image restoration tasks, including deblurring, dehazing, deraining, denoising, and all-in-one image restoration. The results showcase that our method outperforms state-of-the-art ones across all these tasks. The codes and models are available at https://github.com/mrluin/TextualDegRemoval. + + + + ZONE: Zero-Shot Instruction-Guided Local Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_ZONE_Zero-Shot_Instruction-Guided_Local_Editing_CVPR_2024_paper.pdf + Recent advances in vision-language models like Stable Diffusion have shown remarkable power in creative image synthesis and editing. However, most existing text-to-image editing methods encounter two obstacles: First, the text prompt needs to be carefully crafted to achieve good results, which is not intuitive or user-friendly. Second, they are insensitive to local edits and can irreversibly affect non-edited regions, leaving obvious editing traces. To tackle these problems, we propose a Zero-shot instructiON-guided local image Editing approach, termed ZONE. We first convert the editing intent from the user-provided instruction (e.g. "make his tie blue") into specific image editing regions through InstructPix2Pix. We then propose a Region-IoU scheme for precise image layer extraction from an off-the-shelf segmentation model.
We further develop an edge smoother based on FFT for seamless blending between the layer and the image. Our method allows for arbitrary manipulation of a specific region with a single instruction while preserving the rest. Extensive experiments demonstrate that our ZONE achieves remarkable local editing results and user-friendliness, outperforming state-of-the-art methods. Code is available at https://github.com/lsl001006/ZONE. + + + + U-VAP: User-specified Visual Appearance Personalization via Decoupled Self Augmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_U-VAP_User-specified_Visual_Appearance_Personalization_via_Decoupled_Self_Augmentation_CVPR_2024_paper.pdf + Concept personalization methods enable large text-to-image models to learn specific subjects (e.g. objects/poses/3D models) and synthesize renditions in new contexts. Given that the image references are highly biased towards visual attributes, state-of-the-art personalization models tend to overfit the whole subject and cannot disentangle visual characteristics in pixel space. In this study, we propose a more challenging setting, namely fine-grained visual appearance personalization. Different from existing methods, we allow users to provide a sentence describing the desired attributes. A novel decoupled self-augmentation strategy is proposed to generate target-related and non-target samples to learn user-specified visual attributes. These augmented data allow for refining the model's understanding of the target attribute while mitigating the impact of unrelated attributes. At the inference stage, adjustments are conducted in the semantic space through the learned target and non-target embeddings to further enhance the disentanglement of target attributes. Extensive experiments on various kinds of visual attributes with SOTA personalization methods show the ability of the proposed method to mimic target visual appearance in novel contexts, thus improving the controllability and flexibility of personalization. + + + + HHMR: Holistic Hand Mesh Recovery by Enhancing the Multimodal Controllability of Graph Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_HHMR_Holistic_Hand_Mesh_Recovery_by_Enhancing_the_Multimodal_Controllability_CVPR_2024_paper.pdf + Recent years have witnessed a trend of deep integration of the generation and reconstruction paradigms. In this paper, we extend the ability of controllable generative models to a more comprehensive hand mesh recovery task: direct hand mesh generation, inpainting, reconstruction, and fitting in a single framework, which we name Holistic Hand Mesh Recovery (HHMR). Our key observation is that different kinds of hand mesh recovery tasks can be achieved by a single generative model with strong multimodal controllability, and in such a framework, realizing different tasks only requires giving different signals as conditions. To achieve this goal, we propose an all-in-one diffusion framework based on graph convolution and attention mechanisms for holistic hand mesh recovery. In order to achieve strong control generation capability while ensuring the decoupling of multimodal control signals, we map different modalities to a shared feature space and apply cross-scale random masking at both the modality and feature levels. In this way, the correlation between different modalities can be fully exploited during the learning of hand priors.
Furthermore we propose Condition-aligned Gradient Guidance to enhance the alignment of the generated model with the control signals which significantly improves the accuracy of the hand mesh reconstruction and fitting. Experiments show that our novel framework can realize multiple hand mesh recovery tasks simultaneously and outperform the existing methods in different tasks which provides more possibilities for subsequent downstream applications including gesture recognition pose generation mesh editing and so on. + + + + Robust Self-calibration of Focal Lengths from the Fundamental Matrix + http://openaccess.thecvf.com//content/CVPR2024/papers/Kocur_Robust_Self-calibration_of_Focal_Lengths_from_the_Fundamental_Matrix_CVPR_2024_paper.pdf + The problem of self-calibration of two cameras from a given fundamental matrix is one of the basic problems in geometric computer vision. Under the assumption of known principal points and square pixels the Bougnoux formula offers a means to compute the two unknown focal lengths. However in many practical situations the formula yields inaccurate results due to commonly occurring singularities. Moreover the estimates are sensitive to noise in the computed fundamental matrix and to the assumed positions of the principal points. In this paper we therefore propose an efficient and robust iterative method to estimate the focal lengths along with the principal points of the cameras given a fundamental matrix and priors for the estimated camera intrinsics. In addition we study a computationally efficient check of models generated within RANSAC that improves the accuracy of the estimated models while reducing the total computational time. Extensive experiments on real and synthetic data show that our iterative method brings significant improvements in terms of the accuracy of the estimated focal lengths over the Bougnoux formula and other state-of-the-art methods even when relying on inaccurate priors. The code for the methods and experiments is available at https://github.com/kocurvik/robust_self_calibration + + + + PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Umam_PartDistill_3D_Shape_Part_Segmentation_by_Vision-Language_Model_Distillation_CVPR_2024_paper.pdf + This paper proposes a cross-modal distillation framework PartDistill which transfers 2D knowledge from vision-language models (VLMs) to facilitate 3D shape part segmentation. PartDistill addresses three major challenges in this task: the lack of 3D segmentation in invisible or undetected regions in the 2D projections inconsistent 2D predictions by VLMs and the lack of knowledge accumulation across different 3D shapes. PartDistill consists of a teacher network that uses a VLM to make 2D predictions and a student network that learns from the 2D predictions while extracting geometrical features from multiple 3D shapes to carry out 3D part segmentation. A bi-directional distillation including forward and backward distillations is carried out within the framework where the former forward distills the 2D predictions to the student network and the latter improves the quality of the 2D predictions which subsequently enhances the final 3D segmentation. Moreover PartDistill can exploit generative models that facilitate effortless 3D shape creation for generating knowledge sources to be distilled. 
Through extensive experiments PartDistill boosts the existing methods with substantial margins on widely used ShapeNetPart and PartNetE datasets by more than 15% and 12% higher mIoU scores respectively. The code for this work is available at https://github.com/ardianumam/PartDistill. + + + + DragDiffusion: Harnessing Diffusion Models for Interactive Point-based Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_DragDiffusion_Harnessing_Diffusion_Models_for_Interactive_Point-based_Image_Editing_CVPR_2024_paper.pdf + Accurate and controllable image editing is a challenging task that has attracted significant attention recently. Notably DragGAN developed by Pan et al. (2023) is an interactive point-based image editing framework that achieves impressive editing results with pixel-level precision. However due to its reliance on generative adversarial networks (GANs) its generality is limited by the capacity of pretrained GAN models. In this work we extend this editing framework to diffusion models and propose a novel approach DragDiffusion. By harnessing large-scale pretrained diffusion models we greatly enhance the applicability of interactive point-based editing on both real and diffusion-generated images. Unlike other diffusion-based editing methods that provide guidance on diffusion latents of multiple time steps our approach achieves efficient yet accurate spatial control by optimizing the latent of only one time step. This novel design is motivated by our observations that UNet features at a specific time step provides sufficient semantic and geometric information to support the drag-based editing. Moreover we introduce two additional techniques namely identity-preserving fine-tuning and reference-latent-control to further preserve the identity of the original image. Lastly we present a challenging benchmark dataset called DragBench---the first benchmark to evaluate the performance of interactive point-based image editing methods. Experiments across a wide range of challenging cases (e.g. images with multiple objects diverse object categories various styles etc.) demonstrate the versatility and generality of DragDiffusion. Code and the DragBench dataset: https://github.com/Yujun-Shi/DragDiffusion. + + + + Addressing Background Context Bias in Few-Shot Segmentation through Iterative Modulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Addressing_Background_Context_Bias_in_Few-Shot_Segmentation_through_Iterative_Modulation_CVPR_2024_paper.pdf + Existing few-shot segmentation methods usually extract foreground prototypes from support images to guide query image segmentation. However different background contexts of support and query images can cause their foreground features to be misaligned. This phenomenon known as background context bias can hinder the effectiveness of support prototypes in guiding query image segmentation. In this work we propose a novel framework with an iterative structure to address this problem. In each iteration of the framework we first generate a query prediction based on a support foreground feature. Next we extract background context from the query image to modulate the support foreground feature thus eliminating the foreground feature misalignment caused by the different backgrounds. After that we design a confidence-biased attention to eliminate noise and cleanse information. 
By integrating these components through an iterative structure we create a novel network that can leverage the synergies between different modules to improve their performance in a mutually reinforcing manner. Through these carefully designed components and structures our network can effectively eliminate background context bias in few-shot segmentation thus achieving outstanding performance. We conduct extensive experiments on the PASCAL-5^ i and COCO-20^ i datasets and achieve state-of-the-art (SOTA) results which demonstrate the effectiveness of our approach. + + + + TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_TiNO-Edit_Timestep_and_Noise_Optimization_for_Robust_Diffusion-Based_Image_Editing_CVPR_2024_paper.pdf + Despite many attempts to leverage pre-trained text-to-image models (T2I) like Stable Diffusion (SD) for controllable image editing producing good predictable results remains a challenge. Previous approaches have focused on either fine-tuning pre-trained T2I models on specific datasets to generate certain kinds of images (e.g. with a specific object or person) or on optimizing the weights text prompts and/or learning features for each input image in an attempt to coax the image generator to produce the desired result. However these approaches all have shortcomings and fail to produce good results in a predictable and controllable manner. To address this problem we present TiNO-Edit an SD-based method that focuses on optimizing the noise patterns and diffusion timesteps during editing something previously unexplored in the literature. With this simple change we are able to generate results that both better align with the original images and reflect the desired result. Furthermore we propose a set of new loss functions that operate in the latent domain of SD greatly speeding up the optimization when compared to prior losses which operate in the pixel domain. Our method can be easily applied to variations of SD including Textual Inversion and DreamBooth that encode new concepts and incorporate them into the edited results. We present a host of image-editing capabilities enabled by our approach. Our code is publicly available at https://github.com/SherryXTChen/TiNO-Edit. + + + + AdaShift: Learning Discriminative Self-Gated Neural Feature Activation With an Adaptive Shift Factor + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_AdaShift_Learning_Discriminative_Self-Gated_Neural_Feature_Activation_With_an_Adaptive_CVPR_2024_paper.pdf + Nonlinearities are decisive in neural representation learning. Traditional Activation (Act) functions impose fixed inductive biases on neural networks with oriented biological intuitions. Recent methods leverage self-gated curves to compensate for the rigid traditional Act paradigms in fitting flexibility. However substantial improvements are still impeded by the norm-induced mismatched feature re-calibrations (see Section 1) i.e. the actual importance of a feature can be inconsistent with its explicit intensity such that violates the basic intention of a direct self-gated feature re-weighting. To address this problem we propose to learn discriminative neural feature Act with a novel prototype namely AdaShift which enhances typical self-gated Act by incorporating an adaptive shift factor into the re-weighting function of Act. 
AdaShift casts dynamic translations on the inputs of a re-weighting function by exploiting comprehensive feature-filter context cues of different ranges in a simple yet effective manner. We obtain the new intuitions of AdaShift by rethinking the feature-filter relationships from a common Softmax-based classification and by generalizing the new observations to a common learning layer that encodes features with updatable filters. Our practical AdaShifts built upon the new Act prototype demonstrate significant improvements to the popular/SOTA Act functions on different vision benchmarks. By simply replacing ReLU with AdaShifts ResNets can match advanced Transformer counterparts (e.g. ResNet-50 vs. Swin-T) with lower cost and fewer parameters. + + + + SCEdit: Efficient and Controllable Image Diffusion Generation via Skip Connection Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_SCEdit_Efficient_and_Controllable_Image_Diffusion_Generation_via_Skip_Connection_CVPR_2024_paper.pdf + Image diffusion models have been utilized in various tasks such as text-to-image generation and controllable image synthesis. Recent research has introduced tuning methods that make subtle adjustments to the original models yielding promising results in specific adaptations of foundational generative diffusion models. Rather than modifying the main backbone of the diffusion model we delve into the role of skip connection in U-Net and reveal that hierarchical features aggregating long-distance information across encoder and decoder make a significant impact on the content and quality of image generation. Based on the observation we propose an efficient generative tuning framework dubbed SCEdit which integrates and edits Skip Connection using a lightweight tuning module named SC-Tuner. Furthermore the proposed framework allows for straightforward extension to controllable image synthesis by injecting different conditions with Controllable SC-Tuner simplifying and unifying the network design for multi-condition inputs. Our SCEdit substantially reduces training parameters memory usage and computational expense due to its lightweight tuners with backward propagation only passing to the decoder blocks. Extensive experiments conducted on text-to-image generation and controllable image synthesis tasks demonstrate the superiority of our method in terms of efficiency and performance. Project page: https://scedit.github.io/. + + + + BA-SAM: Scalable Bias-Mode Attention Mask for Segment Anything Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_BA-SAM_Scalable_Bias-Mode_Attention_Mask_for_Segment_Anything_Model_CVPR_2024_paper.pdf + In this paper we address the challenge of image resolution variation for the Segment Anything Model (SAM). SAM known for its zero-shot generalizability exhibits a performance degradation when faced with datasets with varying image sizes. Previous approaches tend to resize the image to a fixed size or adopt structure modifications hindering the preservation of SAM's rich prior knowledge. Besides such task-specific tuning necessitates a complete retraining of the model which is cost-expensive and unacceptable for deployment in the downstream tasks. In this paper we reformulate this challenge as a length extrapolation problem where token sequence length varies while maintaining a consistent patch size for images with different sizes. 
To this end we propose a Scalable Bias-Mode Attention Mask (BA-SAM) to enhance SAM's adaptability to varying image resolutions while eliminating the need for structure modifications. Firstly we introduce a new scaling factor to ensure consistent magnitude in the attention layer's dot product values when the token sequence length changes. Secondly we present a bias-mode attention mask that allows each token to prioritize neighboring information mitigating the impact of untrained distant information. Our BA-SAM demonstrates efficacy in two scenarios: zero-shot and fine-tuning. Extensive evaluation of diverse datasets including DIS5K DUTS ISIC COD10K and COCO reveals its ability to significantly mitigate performance degradation in the zero-shot setting and achieve state-of-the-art performance with minimal fine-tuning. Furthermore we propose a generalized model and benchmark showcasing BA-SAM's generalizability across all four datasets simultaneously. + + + + Deciphering 'What' and 'Where' Visual Pathways from Spectral Clustering of Layer-Distributed Neural Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Deciphering_What_and_Where_Visual_Pathways_from_Spectral_Clustering_of_CVPR_2024_paper.pdf + We present an approach for analyzing grouping information contained within a neural network's activations permitting extraction of spatial layout and semantic segmentation from the behavior of large pre-trained vision models. Unlike prior work our method conducts a wholistic analysis of a network's activation state leveraging features from all layers and obviating the need to guess which part of the model contains relevant information. Motivated by classic spectral clustering we formulate this analysis in terms of an optimization objective involving a set of affinity matrices each formed by comparing features within a different layer. Solving this optimization problem using gradient descent allows our technique to scale from single images to dataset-level analysis including in the latter both intra- and inter-image relationships. Analyzing a pre-trained generative transformer provides insight into the computational strategy learned by such models. Equating affinity with key-query similarity across attention layers yields eigenvectors encoding scene spatial layout whereas defining affinity by value vector similarity yields eigenvectors encoding object identity. This result suggests that key and query vectors coordinate attentional information flow according to spatial proximity (a `where' pathway) while value vectors refine a semantic category representation (a `what' pathway). + + + + Real-Time Exposure Correction via Collaborative Transformations and Adaptive Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Real-Time_Exposure_Correction_via_Collaborative_Transformations_and_Adaptive_Sampling_CVPR_2024_paper.pdf + Most of the previous exposure correction methods learn dense pixel-wise transformations to achieve promising results but consume huge computational resources. Recently Learnable 3D lookup tables (3D LUTs) have demonstrated impressive performance and efficiency for image enhancement. However these methods can only perform global transformations and fail to finely manipulate local regions. Moreover they uniformly downsample the input image which loses the rich color information and limits the learning of color transformation capabilities. 
In this paper, we present a collaborative transformation framework (CoTF) for real-time exposure correction, which integrates global transformation with pixel-wise transformations in an efficient manner. Specifically, the global transformation adjusts the overall appearance using image-adaptive 3D LUTs to provide decent global contrast and sharp details, while the pixel transformation compensates for local context. Then, a relation-aware modulation module is designed to combine these two components effectively. In addition, we propose an adaptive sampling strategy to preserve more color information by predicting the sampling intervals, thus providing higher-quality input data for the learning of 3D LUTs. Extensive experiments demonstrate that our method can process high-resolution images in real time on GPUs while achieving comparable performance against current state-of-the-art methods. The code is available at https://github.com/HUST-IAL/CoTF. + + + + Lodge: A Coarse to Fine Diffusion Network for Long Dance Generation Guided by the Characteristic Dance Primitives + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Lodge_A_Coarse_to_Fine_Diffusion_Network_for_Long_Dance_CVPR_2024_paper.pdf + We propose Lodge, a network capable of generating extremely long dance sequences conditioned on given music. We design Lodge as a two-stage coarse-to-fine diffusion architecture and propose characteristic dance primitives that possess significant expressiveness as intermediate representations between the two diffusion models. The first stage is global diffusion, which focuses on comprehending the coarse-level music-dance correlation and producing characteristic dance primitives. In contrast, the second stage is local diffusion, which generates detailed motion sequences in parallel under the guidance of the dance primitives and choreographic rules. In addition, we propose a Foot Refine Block to optimize the contact between the feet and the ground, enhancing the physical realism of the motion. Code available at https://li-ronghui.github.io/lodge + + + + Transcending Forgery Specificity with Latent Space Augmentation for Generalizable Deepfake Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_Transcending_Forgery_Specificity_with_Latent_Space_Augmentation_for_Generalizable_Deepfake_CVPR_2024_paper.pdf + Deepfake detection faces a critical generalization hurdle, with performance deteriorating when there is a mismatch between the distributions of training and testing data. A broadly received explanation is the tendency of these detectors to be overfitted to forgery-specific artifacts rather than learning features that are widely applicable across various forgeries. To address this issue, we propose a simple yet effective detector called LSDA (Latent Space Data Augmentation), which is based on a heuristic idea: representations trained with a wider variety of forgeries should be able to learn a more generalizable decision boundary, thereby mitigating the overfitting of method-specific features (see Fig. 1). Following this idea, we propose to enlarge the forgery space by constructing and simulating variations within and across forgery features in the latent space. This approach encompasses the acquisition of enriched domain-specific features and the facilitation of smoother transitions between different forgery types, effectively bridging domain gaps.
Our approach culminates in refining a binary classifier that leverages the distilled knowledge from the enhanced features striving for a generalizable deepfake detector. Comprehensive experiments show that our proposed method is surprisingly effective and transcends state-of-the-art detectors across several widely used benchmarks. + + + + Scaling Laws of Synthetic Images for Model Training ... for Now + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Scaling_Laws_of_Synthetic_Images_for_Model_Training_..._for_CVPR_2024_paper.pdf + Recent significant advances in text-to-image models unlock the possibility of training vision systems using synthetic images potentially overcoming the difficulty of collecting curated data at scale. It is unclear however how these models behave at scale as more synthetic data is added to the training set. In this paper we study the scaling laws of synthetic images generated by state of the art text-to-image models for the training of supervised models: image classifiers with label supervision and CLIP with language supervision. We identify several factors including text prompts classifier-free guidance scale and types of text-to-image models that significantly affect scaling behavior. After tuning these factors we observe that synthetic images demonstrate a scaling trend similar to but slightly less effective than real images in CLIP training while they significantly underperform in scaling when training supervised image classifiers. Our analysis indicates that the main reason for this underperformance is the inability of off-the-shelf text-to-image models to generate certain concepts a limitation that significantly impairs the training of image classifiers. Our findings also suggest that scaling synthetic data can be particularly effective in scenarios such as: (1) when there is a limited supply of real images for a supervised problem (e.g. fewer than 0.5 million images in ImageNet) (2) when the evaluation dataset diverges significantly from the training data indicating the out-of-distribution scenario or (3) when synthetic data is used in conjunction with real images as demonstrated in the training of CLIP models. + + + + State Space Models for Event Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Zubic_State_Space_Models_for_Event_Cameras_CVPR_2024_paper.pdf + Today state-of-the-art deep neural networks that process event-camera data first convert a temporal window of events into dense grid-like input representations. As such they exhibit poor generalizability when deployed at higher inference frequencies (i.e. smaller temporal windows) than the ones they were trained on. We address this challenge by introducing state-space models (SSMs) with learnable timescale parameters to event-based vision. This design adapts to varying frequencies without the need to retrain the network at different frequencies. Additionally we investigate two strategies to counteract aliasing effects when deploying the model at higher frequencies. We comprehensively evaluate our approach against existing methods based on RNN and Transformer architectures across various benchmarks including Gen1 and 1 Mpx event camera datasets. Our results demonstrate that SSM-based models train 33% faster and also exhibit minimal performance degradation when tested at higher frequencies than the training input. 
Traditional RNN and Transformer models exhibit performance drops of more than 20 mAP with SSMs having a drop of 3.31 mAP highlighting the effectiveness of SSMs in event-based vision tasks. + + + + TeTriRF: Temporal Tri-Plane Radiance Fields for Efficient Free-Viewpoint Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_TeTriRF_Temporal_Tri-Plane_Radiance_Fields_for_Efficient_Free-Viewpoint_Video_CVPR_2024_paper.pdf + Neural Radiance Fields (NeRF) revolutionize the realm of visual media by providing photorealistic Free-Viewpoint Video (FVV) experiences offering viewers unparalleled immersion and interactivity. However the technology's significant storage requirements and the computational complexity involved in generation and rendering currently limit its broader application. To close this gap this paper presents Temporal Tri-Plane Radiance Fields (TeTriRF) a novel technology that significantly reduces the storage size for Free-Viewpoint Video (FVV) while maintaining low-cost generation and rendering. TeTriRF introduces a hybrid representation with tri-planes and voxel grids to support scaling up to long-duration sequences and scenes with complex motions or rapid changes. We propose a group training scheme tailored to achieving high training efficiency and yielding temporally consistent low-entropy scene representations on feature domain. Leveraging these properties of the representations we introduce a compression pipeline with off-the-shelf video codecs achieving an order of magnitude less storage size compared to the state-of-the-art. Our experiments demonstrate that TeTriRF can achieve competitive quality with a higher compression rate. + + + + Event-assisted Low-Light Video Object Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Event-assisted_Low-Light_Video_Object_Segmentation_CVPR_2024_paper.pdf + In the realm of video object segmentation (VOS) the challenge of operating under low-light conditions persists resulting in notably degraded image quality and compromised accuracy when comparing query and memory frames for similarity computation. Event cameras characterized by their high dynamic range and ability to capture motion information of objects offer promise in enhancing object visibility and aiding VOS methods under such low-light conditions. This paper introduces a pioneering framework tailored for low-light VOS leveraging event camera data to elevate segmentation accuracy. Our approach hinges on two pivotal components: the Adaptive Cross-Modal Fusion (ACMF) module aimed at extracting pertinent features while fusing image and event modalities to mitigate noise interference and the Event-Guided Memory Matching (EGMM) module designed to rectify the issue of inaccurate matching prevalent in low-light settings. Additionally we present the creation of a synthetic LLE-DAVIS dataset and the curation of a real-world LLE-VOS dataset encompassing frames and events. Experimental evaluations corroborate the efficacy of our method across both datasets affirming its effectiveness in low-light scenarios. The datasets are available at https://github.com/HebeiFast/EventLowLightVOS. 
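The State Space Models for Event Cameras entry above credits learnable timescale parameters for robustness to higher inference frequencies. As a hedged illustration, the sketch below uses the standard zero-order-hold discretization of a diagonal SSM (S4/S5-style), which is assumed here and not necessarily the paper's exact formulation; it shows where the timescale dt enters and why shrinking it corresponds to a smaller temporal window without retraining.

```python
import torch

def discretize_zoh(A, B, dt):
    # Zero-order-hold discretization of a diagonal linear SSM,
    #   x'(t) = A x(t) + B u(t)  ->  x_k = Ad * x_{k-1} + Bd * u_k.
    # dt is the (learnable) timescale; adjusting it at inference time is what
    # lets the model follow a different event-window duration without retraining.
    Ad = torch.exp(A * dt)
    Bd = (Ad - 1.0) / A * B
    return Ad, Bd

def ssm_scan(u, A, B, C, dt):
    # u: (T, D) per-step features aggregated from events; A, B, C: (D,) diagonal parameters.
    Ad, Bd = discretize_zoh(A, B, dt)
    x = torch.zeros_like(A)
    ys = []
    for u_t in u:
        x = Ad * x + Bd * u_t
        ys.append(C * x)
    return torch.stack(ys)

# Toy usage (dimensions and values are hypothetical).
D = 8
A = -torch.rand(D) - 0.1              # stable, negative-real diagonal dynamics
B, C = torch.ones(D), torch.randn(D)
y = ssm_scan(torch.randn(100, D), A, B, C, dt=torch.tensor(1e-2))
```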
+ + + + VidToMe: Video Token Merging for Zero-Shot Video Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_VidToMe_Video_Token_Merging_for_Zero-Shot_Video_Editing_CVPR_2024_paper.pdf + Diffusion models have made significant advances in generating high-quality images, but their application to video generation has remained challenging due to the complexity of temporal motion. Zero-shot video editing offers a solution by utilizing pre-trained image diffusion models to translate source videos into new ones. Nevertheless, existing methods struggle to maintain strict temporal consistency and efficient memory consumption. In this work, we propose a novel approach to enhance temporal consistency in generated videos by merging self-attention tokens across frames. By aligning and compressing temporally redundant tokens across frames, our method improves temporal coherence and reduces memory consumption in self-attention computations. The merging strategy matches and aligns tokens according to the temporal correspondence between frames, facilitating natural temporal consistency in generated video frames. To manage the complexity of video processing, we divide videos into chunks and develop intra-chunk local token merging and inter-chunk global token merging, ensuring both short-term video continuity and long-term content consistency. Our video editing approach seamlessly extends the advancements in image editing to video editing, rendering favorable results in temporal consistency over state-of-the-art methods. + + + + FaceChain-SuDe: Building Derived Class to Inherit Category Attributes for One-shot Subject-Driven Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Qiao_FaceChain-SuDe_Building_Derived_Class_to_Inherit_Category_Attributes_for_One-shot_CVPR_2024_paper.pdf + Recently, subject-driven generation has garnered significant interest due to its ability to personalize text-to-image generation. Typical works focus on learning the new subject's private attributes. However, an important fact has been overlooked: a subject is not an isolated new concept, but should be a specialization of a certain category in the pre-trained model. This results in the subject failing to comprehensively inherit the attributes in its category, causing poor attribute-related generations. In this paper, motivated by object-oriented programming, we model the subject as a derived class whose base class is its semantic category. This modeling enables the subject to inherit public attributes from its category while learning its private attributes from the user-provided example. Specifically, we propose a plug-and-play method, Subject-Derived regularization (SuDe). It constructs the base-derived class modeling by constraining the subject-driven generated images to semantically belong to the subject's category. Extensive experiments under three baselines and two backbones on various subjects show that our SuDe enables imaginative attribute-related generations while maintaining subject fidelity. For the code, please refer to https://github.com/modelscope/facechain.
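The VidToMe entry above hinges on merging temporally redundant self-attention tokens across frames. The sketch below is a rough, greedy approximation of that idea, matching tokens of the current frame to the previous frame by cosine similarity and averaging the most similar pairs; the paper's matching, chunking, and unmerging rules are more elaborate.

```python
import torch
import torch.nn.functional as F

def merge_redundant_tokens(prev, curr, keep_ratio=0.5):
    # prev, curr: (N, C) self-attention tokens from two consecutive frames.
    # Greedy sketch: each token in `curr` is matched to its most similar token
    # in `prev`; the most redundant matches are averaged into `prev`, the rest
    # are kept as-is. A simplification of ToMe-style matching, not VidToMe's
    # exact merge/unmerge procedure.
    sim = F.normalize(curr, dim=-1) @ F.normalize(prev, dim=-1).T    # (N, N) cosine similarity
    best_sim, best_idx = sim.max(dim=-1)
    n_merge = int(curr.shape[0] * (1.0 - keep_ratio))
    merge_src = best_sim.argsort(descending=True)[:n_merge]          # most redundant curr tokens
    merged = prev.clone()
    merged[best_idx[merge_src]] = 0.5 * (prev[best_idx[merge_src]] + curr[merge_src])
    keep = torch.ones(curr.shape[0], dtype=torch.bool)
    keep[merge_src] = False
    return torch.cat([merged, curr[keep]], dim=0)                    # reduced joint token set

tokens = merge_redundant_tokens(torch.randn(64, 320), torch.randn(64, 320))
```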
+ + + + StableVITON: Learning Semantic Correspondence with Latent Diffusion Model for Virtual Try-On + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_StableVITON_Learning_Semantic_Correspondence_with_Latent_Diffusion_Model_for_Virtual_CVPR_2024_paper.pdf + Given a clothing image and a person image, an image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing image. In this work, we aim to expand the applicability of the pre-trained diffusion model so that it can be utilized independently for the virtual try-on task. The main challenge is to preserve the clothing details while effectively utilizing the robust generative capability of the pre-trained model. In order to tackle these issues, we propose StableVITON, which learns the semantic correspondence between the clothing and the human body within the latent space of the pre-trained diffusion model in an end-to-end manner. Our proposed zero cross-attention blocks not only preserve the clothing details by learning the semantic correspondence, but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process. Through our proposed novel attention total variation loss and applied augmentation, we achieve a sharp attention map, resulting in a more precise representation of clothing details. StableVITON outperforms the baselines in qualitative and quantitative evaluation, showing promising quality on arbitrary person images. Our code is available at https://github.com/rlawjdghek/StableVITON. + + + + Make-Your-Anchor: A Diffusion-based 2D Avatar Generation Framework + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Make-Your-Anchor_A_Diffusion-based_2D_Avatar_Generation_Framework_CVPR_2024_paper.pdf + Despite the remarkable progress of talking-head-based avatar-creation solutions, directly generating anchor-style videos with full-body motions remains challenging. In this study, we propose Make-Your-Anchor, a novel system that requires only a one-minute video clip of an individual for training, subsequently enabling the automatic generation of anchor-style videos with precise torso and hand movements. Specifically, we fine-tune a proposed structure-guided diffusion model on the input video to render 3D mesh conditions into human appearances. We adopt a two-stage training strategy for the diffusion model, effectively binding movements with specific appearances. To produce arbitrarily long temporal videos, we extend the 2D U-Net in the frame-wise diffusion model to a 3D style without additional training cost, and a simple yet effective batch-overlapped temporal denoising module is proposed to bypass the constraints on video length during inference. Finally, a novel identity-specific face enhancement module is introduced to improve the visual quality of facial regions in the output videos. Comparative experiments demonstrate the effectiveness and superiority of the system in terms of visual quality, temporal coherence, and identity preservation, outperforming SOTA diffusion/non-diffusion methods. Project page: https://github.com/ICTMCG/Make-Your-Anchor.
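The StableVITON entry above conditions a latent diffusion model through zero cross-attention blocks between person and clothing features. A common way to realize a "zero" conditioning block is to zero-initialize its output projection so that it starts as an identity mapping; the sketch below follows that convention and is an assumption about the design rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ZeroCrossAttention(nn.Module):
    # Cross-attention from person-side tokens (queries) to clothing-side tokens
    # (keys/values) with a zero-initialized output projection, so the block is
    # an identity mapping at the start of fine-tuning. A generic sketch of the
    # "zero cross-attention" idea; sizes and placement are assumptions.
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, person_tokens, cloth_tokens):
        out, _ = self.attn(person_tokens, cloth_tokens, cloth_tokens)
        return person_tokens + self.proj(out)     # residual; initially a no-op

block = ZeroCrossAttention(dim=320)
fused = block(torch.randn(1, 256, 320), torch.randn(1, 77, 320))
```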
+ + + + Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Learning_Dynamic_Tetrahedra_for_High-Quality_Talking_Head_Synthesis_CVPR_2024_paper.pdf + Recent works in implicit representations such as Neural Radiance Fields (NeRF) have advanced the generation of realistic and animatable head avatars from video sequences. These implicit methods are still confronted by visual artifacts and jitters, since the lack of explicit geometric constraints poses a fundamental challenge in accurately modeling complex facial deformations. In this paper, we introduce Dynamic Tetrahedra (DynTet), a novel hybrid representation that encodes explicit dynamic meshes by neural networks to ensure geometric consistency across various motions and viewpoints. DynTet is parameterized by coordinate-based networks which learn signed distance, deformation, and material texture, anchoring the training data to a predefined tetrahedra grid. Leveraging Marching Tetrahedra, DynTet efficiently decodes textured meshes with a consistent topology, enabling fast rendering through a differentiable rasterizer and supervision via a pixel loss. To enhance training efficiency, we incorporate classical 3D Morphable Models to facilitate geometry learning and define a canonical space to simplify texture learning. These advantages are readily achievable owing to the effective geometric representation employed in DynTet. Compared with prior works, DynTet demonstrates significant improvements in fidelity, lip synchronization, and real-time performance according to various metrics. Beyond producing stable and visually appealing synthesized videos, our method also outputs dynamic meshes, which is promising for enabling many emerging applications. Code is available at https://github.com/zhangzc21/DynTet. + + + + 3D Geometry-Aware Deformable Gaussian Splatting for Dynamic View Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_3D_Geometry-Aware_Deformable_Gaussian_Splatting_for_Dynamic_View_Synthesis_CVPR_2024_paper.pdf + In this paper, we propose a 3D geometry-aware deformable Gaussian Splatting method for dynamic view synthesis. Existing neural radiance field (NeRF)-based solutions learn the deformation in an implicit manner, which cannot incorporate 3D scene geometry. Therefore, the learned deformation is not necessarily geometrically coherent, which results in unsatisfactory dynamic view synthesis and 3D dynamic reconstruction. Recently, 3D Gaussian Splatting has provided a new representation of the 3D scene, building upon which the 3D geometry can be exploited in learning the complex 3D deformation. Specifically, the scene is represented as a collection of 3D Gaussians, where each 3D Gaussian is optimized to move and rotate over time to model the deformation. To enforce the 3D scene geometry constraint during deformation, we explicitly extract 3D geometry features and integrate them into learning the 3D deformation. In this way, our solution achieves 3D geometry-aware deformation modeling, which enables improved dynamic view synthesis and 3D dynamic reconstruction. Extensive experimental results on both synthetic and real datasets prove the superiority of our solution, which achieves new state-of-the-art performance. The project is available at https://npucvr.github.io/GaGS/.
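The 3D Geometry-Aware Deformable Gaussian Splatting entry above optimizes each 3D Gaussian to move and rotate over time, with extracted geometry features in the loop. The sketch below is a minimal time-conditioned deformation head under assumed inputs (canonical centers, a precomputed local geometry feature, a timestamp); the paper's feature extraction and network design are richer than this.

```python
import torch
import torch.nn as nn

class GaussianDeformation(nn.Module):
    # Minimal deformation head for dynamic Gaussian Splatting: an MLP maps a
    # Gaussian's canonical center, an (assumed precomputed) local geometry
    # feature, and the timestamp to position and rotation offsets. This is a
    # sketch of the general idea, not the paper's architecture.
    def __init__(self, feat_dim: int = 32, hidden: int = 128):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + feat_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4),              # delta xyz + delta quaternion
        )

    def forward(self, xyz, geom_feat, t):
        t = t.expand(xyz.shape[0], 1)              # broadcast scalar time to all Gaussians
        d = self.mlp(torch.cat([xyz, geom_feat, t], dim=-1))
        return xyz + d[:, :3], d[:, 3:]            # deformed centers, rotation residual

net = GaussianDeformation()
new_xyz, d_rot = net(torch.randn(1000, 3), torch.randn(1000, 32), torch.tensor([[0.3]]))
```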
+ + + + Person-in-WiFi 3D: End-to-End Multi-Person 3D Pose Estimation with Wi-Fi + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_Person-in-WiFi_3D_End-to-End_Multi-Person_3D_Pose_Estimation_with_Wi-Fi_CVPR_2024_paper.pdf + Wi-Fi signals in contrast to cameras offer privacy protection and occlusion resilience for some practical scenarios such as smart homes elderly care and virtual reality. Recent years have seen remarkable progress in the estimation of single-person 2D pose single-person 3D pose and multi-person 2D pose. This paper takes a step forward by introducing Person-in-WiFi 3D a pioneering Wi-Fi system that accomplishes multi-person 3D pose estimation. Person-in-WiFi 3D has two main updates. Firstly it has a greater number of Wi-Fi devices to enhance the capability for capturing spatial reflections from multiple individuals. Secondly it leverages the Transformer for end-to-end estimation. Compared to its predecessor Person-in-WiFi 3D is storage-efficient and fast. We deployed a proof-of-concept system in 4mx3.5m areas and collected a dataset of over 97K frames with seven volunteers. Person-in-WiFi 3D attains 3D joint localization errors of 91.7mm (1-person) 108.1mm (2-person) and 125.3mm (3-person) comparable to cameras and millimeter-wave radars. + + + + Fairy: Fast Parallelized Instruction-Guided Video-to-Video Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Fairy_Fast_Parallelized_Instruction-Guided_Video-to-Video_Synthesis_CVPR_2024_paper.pdf + In this paper we introduce Fairy a minimalist yet robust adaptation of image-editing diffusion models enhancing them for video editing applications. Our approach centers on the concept of anchor-based cross-frame attention a mechanism that implicitly propagates diffusion features across frames ensuring superior temporal coherence and high-fidelity synthesis. Fairy not only addresses limitations of previous models including memory and processing speed. It also improves temporal consistency through a unique data augmentation strategy. This strategy renders the model equivariant to affine transformations in both source and target images. Remarkably efficient Fairy generates 120-frame 512x384 videos (4-second duration at 30 FPS) in just 14 seconds outpacing prior works by at least 44x. A comprehensive user study involving 1000 generated samples confirms that our approach delivers superior quality decisively outperforming established methods. + + + + SmartEdit: Exploring Complex Instruction-based Image Editing with Multimodal Large Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_SmartEdit_Exploring_Complex_Instruction-based_Image_Editing_with_Multimodal_Large_Language_CVPR_2024_paper.pdf + Current instruction-based image editing methods such as InstructPix2Pix often fail to produce satisfactory results in complex scenarios due to their dependence on the simple CLIP text encoder in diffusion models. To rectify this this paper introduces SmartEdit a novel approach of instruction-based image editing that leverages Multimodal Large Language Models (MLLMs) to enhance its understanding and reasoning capabilities. However direct integration of these elements still faces challenges in situations requiring complex reasoning. To mitigate this we propose a Bidirectional Interaction Module (BIM) that enables comprehensive bidirectional information interactions between the input image and the MLLM output. 
During training, we initially incorporate perception data to boost the perception and understanding capabilities of diffusion models. Subsequently, we demonstrate that a small amount of complex instruction editing data can effectively stimulate SmartEdit's editing capabilities for more complex instructions. We further construct a new evaluation dataset, Reason-Edit, specifically tailored for complex instruction-based image editing. Both quantitative and qualitative results on this evaluation dataset indicate that our SmartEdit surpasses previous methods, paving the way for the practical application of complex instruction-based image editing. + + + + It's All About Your Sketch: Democratising Sketch Control in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Koley_Its_All_About_Your_Sketch_Democratising_Sketch_Control_in_Diffusion_CVPR_2024_paper.pdf + This paper unravels the potential of sketches for diffusion models, addressing the deceptive promise of direct sketch control in generative AI. We importantly democratise the process, enabling amateur sketches to generate precise images, living up to the commitment of "what you sketch is what you get". A pilot study underscores the necessity, revealing that deformities in existing models stem from spatial conditioning. To rectify this, we propose an abstraction-aware framework utilising a sketch adapter, adaptive time-step sampling and discriminative guidance from a pre-trained fine-grained sketch-based image retrieval model, working synergistically to reinforce the fine-grained sketch-photo association. Our approach operates seamlessly during inference without the need for textual prompts; a simple rough sketch, akin to what you and I can create, suffices! We welcome everyone to examine the results presented in the paper and its supplementary. Contributions include democratising sketch control, introducing an abstraction-aware framework and leveraging discriminative guidance, validated through extensive experiments. + + + + When StyleGAN Meets Stable Diffusion: a W+ Adapter for Personalized Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_When_StyleGAN_Meets_Stable_Diffusion_a_W_Adapter_for_Personalized_CVPR_2024_paper.pdf + Text-to-image diffusion models have remarkably excelled in producing diverse, high-quality and photo-realistic images. This advancement has spurred a growing interest in incorporating specific identities into generated content. Most current methods employ an inversion approach to embed a target visual concept into the text embedding space using a single reference image. However, the newly synthesized faces either closely resemble the reference image in terms of facial attributes, such as expression, or exhibit a reduced capacity for identity preservation. Text descriptions intended to guide the facial attributes of the synthesized face may fall short, owing to the intricate entanglement of identity information with identity-irrelevant facial attributes derived from the reference image. To address these issues, we present the novel use of the extended StyleGAN embedding space W+ to achieve enhanced identity preservation and disentanglement for diffusion models. By aligning this semantically meaningful human face latent space with text-to-image diffusion models, we succeed in maintaining high fidelity in identity preservation, coupled with the capacity for semantic editing.
Additionally, we propose new training objectives to balance the influences of both prompt and identity conditions, ensuring that the identity-irrelevant background remains negligibly affected during facial attribute modifications. Extensive experiments reveal that our method adeptly generates personalized text-to-image outputs that are not only compatible with prompt descriptions but also amenable to common StyleGAN editing directions in diverse settings. Our code and model are available at https://github.com/csxmli2016/w-plus-adapter. + + + + CAM Back Again: Large Kernel CNNs from a Weakly Supervised Object Localization Perspective + http://openaccess.thecvf.com//content/CVPR2024/papers/Yasuki_CAM_Back_Again_Large_Kernel_CNNs_from_a_Weakly_Supervised_CVPR_2024_paper.pdf + Recently, convolutional neural networks (CNNs) with large kernels have attracted much attention in the computer vision field, following the success of Vision Transformers. Large kernel CNNs have been reported to perform well in downstream vision tasks as well as in classification performance. The high performance of large kernel CNNs in downstream tasks has been attributed to the large effective receptive field (ERF) produced by large kernels, but this view has not been fully tested. We therefore revisit the performance of large kernel CNNs in downstream tasks, focusing on the weakly supervised object localization (WSOL) task. WSOL, a difficult downstream task that is not fully supervised, provides a new angle to explore the capabilities of large kernel CNNs. Our study compares the modern large kernel CNNs ConvNeXt, RepLKNet and SLaK to test the validity of the naive expectation that ERF size is important for improving downstream task performance. Our analysis of the factors contributing to high performance provides a different perspective, in which the main factor is feature map improvement. Furthermore, we find that modern CNNs are robust to the CAM problem of only local regions of objects being activated, which has long been discussed in WSOL. CAM is the most classic WSOL method, but because of the above-mentioned problem it is often used as a baseline method for comparison. However, experiments on the CUB-200-2011 dataset show that simply combining a large kernel CNN, CAM and simple data augmentation methods can achieve performance (90.99% MaxBoxAcc) comparable to the latest WSOL method, which is CNN-based and requires special training or complex post-processing. + + + + Putting the Object Back into Video Object Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_Putting_the_Object_Back_into_Video_Object_Segmentation_CVPR_2024_paper.pdf + We present Cutie, a video object segmentation (VOS) network with object-level memory reading, which puts the object representation from memory back into the video object segmentation result. Recent works on VOS employ bottom-up pixel-level memory reading, which struggles due to matching noise, especially in the presence of distractors, resulting in lower performance on more challenging data. In contrast, Cutie performs top-down object-level memory reading by adapting a small set of object queries. Via those, it interacts with the bottom-up pixel features iteratively with a query-based object transformer (qt, hence Cutie). The object queries act as a high-level summary of the target object, while high-resolution feature maps are retained for accurate segmentation.
Together with foreground-background masked attention Cutie cleanly separates the semantics of the foreground object from the background. On the challenging MOSE dataset Cutie improves by 8.7 J&F over XMem with a similar running time and improves by 4.2 J&F over DeAOT while being three times faster. Code is available at: hkchengrex.github.io/Cutie + + + + Concept Weaver: Enabling Multi-Concept Fusion in Text-to-Image Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Kwon_Concept_Weaver_Enabling_Multi-Concept_Fusion_in_Text-to-Image_Models_CVPR_2024_paper.pdf + While there has been significant progress in customizing text-to-image generation models generating images that combine multiple personalized concepts remains challenging. In this work we introduce Concept Weaver a method for composing customized text-to-image diffusion models at inference time. Specifically the method breaks the process into two steps: creating a template image aligned with the semantics of input prompts and then personalizing the template using a concept fusion strategy. The fusion strategy incorporates the appearance of the target concepts into the template image while retaining its structural details. The results indicate that our method can generate multiple custom concepts with higher identity fidelity compared to alternative approaches. Furthermore the method is shown to seamlessly handle more than two concepts and closely follow the semantic meaning of the input prompt without blending appearances across different subjects. + + + + Cross-Domain Few-Shot Segmentation via Iterative Support-Query Correspondence Mining + http://openaccess.thecvf.com//content/CVPR2024/papers/Nie_Cross-Domain_Few-Shot_Segmentation_via_Iterative_Support-Query_Correspondence_Mining_CVPR_2024_paper.pdf + Cross-Domain Few-Shot Segmentation (CD-FSS) poses the challenge of segmenting novel categories from a distinct domain using only limited exemplars. In this paper we undertake a comprehensive study of CD-FSS and uncover two crucial insights: (i) the necessity of a fine-tuning stage to effectively transfer the learned meta-knowledge across domains and (ii) the overfitting risk during the naive fine-tuning due to the scarcity of novel category examples. With these insights we propose a novel cross-domain fine-tuning strategy that addresses the challenging CD-FSS tasks. We first design Bi-directional Few-shot Prediction (BFP) which establishes support-query correspondence in a bi-directional manner crafting augmented supervision to reduce the overfitting risk. Then we further extend BFP into Iterative Few-shot Adaptor (IFA) which is a recursive framework to capture the support-query correspondence iteratively targeting maximal exploitation of supervisory signals from the sparse novel category samples. Extensive empirical evaluations show that our method significantly outperforms the state-of-the-arts (+7.8%) which verifies that IFA tackles the cross-domain challenges and mitigates the overfitting simultaneously. The code is available at: https://github.com/niejiahao1998/IFA. + + + + DiffSHEG: A Diffusion-Based Approach for Real-Time Speech-driven Holistic 3D Expression and Gesture Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_DiffSHEG_A_Diffusion-Based_Approach_for_Real-Time_Speech-driven_Holistic_3D_Expression_CVPR_2024_paper.pdf + We propose DiffSHEG a Diffusion-based approach for Speech-driven Holistic 3D Expression and Gesture generation. 
While previous works focused on co-speech gesture or expression generation individually the joint generation of synchronized expressions and gestures remains barely explored. To address this our diffusion-based co-speech motion generation Transformer enables uni-directional information flow from expression to gesture facilitating improved matching of joint expression-gesture distributions. Furthermore we introduce an outpainting-based sampling strategy for arbitrary long sequence generation in diffusion models offering flexibility and computational efficiency. Our method provides a practical solution that produces high-quality synchronized expression and gesture generation driven by speech. Evaluated on two public datasets our approach achieves state-of-the-art performance both quantitatively and qualitatively. Additionally a user study confirms the superiority of our method over prior approaches. By enabling the real-time generation of expressive and synchronized motions our method showcases its potential for various applications in the development of digital humans and embodied agents. + + + + Animating General Image with Large Visual Motion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Animating_General_Image_with_Large_Visual_Motion_Model_CVPR_2024_paper.pdf + We present the pioneering Large Visual Motion Model (LVMM) meticulously engineered to analyze the intrinsic dynamics encapsulated within real-world imagery. Our model fortified with a wealth of prior knowledge extracted from billions of image pairs demonstrates promising results in predicting a diverse spectrum of scene dynamics. As a result it can infuse any generic image with authentic dynamic effects enhancing its visual allure. + + + + DIRECT-3D: Learning Direct Text-to-3D Generation on Massive Noisy 3D Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_DIRECT-3D_Learning_Direct_Text-to-3D_Generation_on_Massive_Noisy_3D_Data_CVPR_2024_paper.pdf + We present DIRECT-3D a diffusion-based 3D generative model for creating high-quality 3D assets (represented by Neural Radiance Fields) from text prompts. Unlike recent 3D generative models that rely on clean and well-aligned 3D data limiting them to single or few-class generation our model is directly trained on extensive noisy and unaligned `in-the-wild' 3D assets mitigating the key challenge (i.e. data scarcity) in large-scale 3D generation. In particular DIRECT-3D is a tri-plane diffusion model that integrates two innovations: 1) A novel learning framework where noisy data are filtered and aligned automatically during the training process. Specifically after an initial warm-up phase using a small set of clean data an iterative optimization is introduced in the diffusion process to explicitly estimate the 3D pose of objects and select beneficial data based on conditional density. 2) An efficient 3D representation that is achieved by disentangling object geometry and color features with two separate conditional diffusion models that are optimized hierarchically. Given a prompt input our model generates high-quality high-resolution realistic and complex 3D objects with accurate geometric details in seconds. We achieve state-of-the-art performance in both single-class generation and text-to-3D generation. We also demonstrate that DIRECT-3D can serve as a useful 3D geometric prior of objects for example to alleviate the well-known Janus problem in 2D-lifting methods such as DreamFusion. 
+ + + + OHTA: One-shot Hand Avatar via Data-driven Implicit Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_OHTA_One-shot_Hand_Avatar_via_Data-driven_Implicit_Priors_CVPR_2024_paper.pdf + In this paper we delve into the creation of one-shot hand avatars attaining high-fidelity and drivable hand representations swiftly from a single image. With the burgeoning domains of the digital human the need for quick and personalized hand avatar creation has become increasingly critical. Existing techniques typically require extensive input data and may prove cumbersome or even impractical in certain scenarios. To enhance accessibility we present a novel method OHTA (One-shot Hand avaTAr) that enables the creation of detailed hand avatars from merely one image. OHTA tackles the inherent difficulties of this data-limited problem by learning and utilizing data-driven hand priors. Specifically we design a hand prior model initially employed for 1) learning various hand priors with available data and subsequently for 2) the inversion and fitting of the target identity with prior knowledge. OHTA demonstrates the capability to create high-fidelity hand avatars with consistent animatable quality solely relying on a single image. Furthermore we illustrate the versatility of OHTA through diverse applications encompassing text-to-avatar conversion hand editing and identity latent space manipulation. + + + + Human Motion Prediction Under Unexpected Perturbation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yue_Human_Motion_Prediction_Under_Unexpected_Perturbation_CVPR_2024_paper.pdf + We investigate a new task in human motion prediction which is predicting motions under unexpected physical perturbation potentially involving multiple people. Compared with existing research this task involves predicting less controlled unpremeditated and pure reactive motions in response to external impact and how such motions can propagate through people. It brings new challenges such as data scarcity and predicting complex interactions. To this end we propose a new method capitalizing differentiable physics and deep neural networks leading to an explicit Latent Differentiable Physics (LDP) model. Through experiments we demonstrate that LDP has high data efficiency outstanding prediction accuracy strong generalizability and good explainability. Since there is no similar research a comprehensive comparison with 11 adapted baselines from several relevant domains is conducted showing LDP outperforming existing research both quantitatively and qualitatively improving prediction accuracy by as much as 70% and demonstrating significantly stronger generalization. + + + + Text-to-3D Generation with Bidirectional Diffusion using both 2D and 3D priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_Text-to-3D_Generation_with_Bidirectional_Diffusion_using_both_2D_and_3D_CVPR_2024_paper.pdf + Most 3D generation research focuses on up-projecting 2D foundation models into the 3D space either by minimizing 2D Score Distillation Sampling (SDS) loss or fine-tuning on multi-view datasets. Without explicit 3D priors these methods often lead to geometric anomalies and multi-view inconsistency. Recently researchers have attempted to improve the genuineness of 3D objects by directly training on 3D datasets albeit at the cost of low-quality texture generation due to the limited texture diversity in 3D datasets. 
To harness the advantages of both approaches we propose Bidirectional Diffusion (BiDiff) a unified framework that incorporates both a 3D and a 2D diffusion process to preserve both 3D fidelity and 2D texture richness respectively. Moreover as a simple combination may yield inconsistent generation results we further bridge them with novel bidirectional guidance. In addition our method can be used as an initialization of optimization-based models to further improve the quality of 3D model and efficiency of optimization reducing the process from 3.4 hours to 20 minutes. Experimental results have shown that our model achieves high-quality diverse and scalable 3D generation. Project website https://bidiff.github.io/. + + + + Make-It-Vivid: Dressing Your Animatable Biped Cartoon Characters from Text + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Make-It-Vivid_Dressing_Your_Animatable_Biped_Cartoon_Characters_from_Text_CVPR_2024_paper.pdf + Creating and animating 3D biped cartoon characters is crucial and valuable in various applications. Compared with geometry the diverse texture design plays an important role in making 3D biped cartoon characters vivid and charming. Therefore we focus on automatic texture design for cartoon characters based on input instructions. This is challenging for domain-specific requirements and a lack of high-quality data. To address this challenge we propose Make-It-Vivid the first attempt to enable high-quality texture generation from text in UV space. We prepare a detailed text-texture paired data for 3D characters by using vision-question-answering agents. Then we customize a pretrained text-to-image model to generate texture map with template structure while preserving the natural 2D image knowledge. Furthermore to enhance fine-grained details we propose a novel adversarial learning scheme to shorten the domain gap between original dataset and realistic texture domain. Extensive experiments show that our approach outperforms current texture generation methods resulting in efficient character texturing and faithful generation with prompts. Besides we showcase various applications such as out of domain generation and texture stylization. We also provide an efficient generation system for automatic text-guided textured character generation and animation. + + + + Neural Sign Actors: A Diffusion Model for 3D Sign Language Production from Text + http://openaccess.thecvf.com//content/CVPR2024/papers/Baltatzis_Neural_Sign_Actors_A_Diffusion_Model_for_3D_Sign_Language_CVPR_2024_paper.pdf + Sign Languages (SL) serve as the primary mode of communication for the Deaf and Hard of Hearing communities. Deep learning methods for SL recognition and translation have achieved promising results. However Sign Language Production (SLP) poses a challenge as the generated motions must be realistic and have precise semantic meaning. Most SLP methods rely on 2D data which hinders their realism. In this work a diffusion-based SLP model is trained on a curated large-scale dataset of 4D signing avatars and their corresponding text transcripts. The proposed method can generate dynamic sequences of 3D avatars from an unconstrained domain of discourse using a diffusion process formed on a novel and anatomically informed graph neural network defined on the SMPL-X body skeleton. Through quantitative and qualitative experiments we show that the proposed method considerably outperforms previous methods of SLP. 
This work makes an important step towards realistic neural sign avatars bridging the communication gap between Deaf and hearing communities. + + + + On the Diversity and Realism of Distilled Dataset: An Efficient Dataset Distillation Paradigm + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_On_the_Diversity_and_Realism_of_Distilled_Dataset_An_Efficient_CVPR_2024_paper.pdf + Contemporary machine learning which involves training large neural networks on massive datasets faces significant computational challenges. Dataset distillation as a recent emerging strategy aims to compress real-world datasets for efficient training. However this line of research currently struggles with large-scale and high-resolution datasets hindering its practicality and feasibility. Thus we re-examine existing methods and identify three properties essential for real-world applications: realism diversity and efficiency. As a remedy we propose RDED a novel computationally-efficient yet effective data distillation paradigm to enable both diversity and realism of the distilled data. Extensive empirical results over various model architectures and datasets demonstrate the advancement of RDED: we can distill a dataset to 10 images per class from full ImageNet-1K within 7 minutes achieving a notable 42% accuracy with ResNet-18 on a single RTX-4090 GPU (while the SOTA only achieves 21% but requires 6 hours). Code: https://github.com/LINs-lab/RDED. + + + + Semantics-aware Motion Retargeting with Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Semantics-aware_Motion_Retargeting_with_Vision-Language_Models_CVPR_2024_paper.pdf + Capturing and preserving motion semantics is essential to motion retargeting between animation characters. However most of the previous works neglect the semantic information or rely on human-designed joint-level representations. Here we present a novel Semantics-aware Motion reTargeting (SMT) method with the advantage of vision-language models to extract and maintain meaningful motion semantics. We utilize a differentiable module to render 3D motions. Then the high-level motion semantics are incorporated into the motion retargeting process by feeding the vision-language model with the rendered images and aligning the extracted semantic embeddings. To ensure the preservation of fine-grained motion details and high-level semantics we adopt a two-stage pipeline consisting of skeleton-aware pre-training and fine-tuning with semantics and geometry constraints. Experimental results show the effectiveness of the proposed method in producing high-quality motion retargeting results while accurately preserving motion semantics. Project page can be found at https://sites.google.com/view/smtnet. + + + + Unsupervised Semantic Segmentation Through Depth-Guided Feature Correlation and Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Sick_Unsupervised_Semantic_Segmentation_Through_Depth-Guided_Feature_Correlation_and_Sampling_CVPR_2024_paper.pdf + Traditionally training neural networks to perform semantic segmentation requires expensive human-made annotations. But more recently advances in the field of unsupervised learning have made significant progress on this issue and towards closing the gap to supervised algorithms. To achieve this semantic knowledge is distilled by learning to correlate randomly sampled features from images across an entire dataset. 
In this work, we build upon these advances by incorporating information about the structure of the scene into the training process through the use of depth information. We achieve this by (1) learning depth-feature correlation by spatially correlating the feature maps with the depth maps to induce knowledge about the structure of the scene and (2) exploiting farthest-point sampling to more effectively select relevant features by utilizing 3D sampling techniques on the depth information of the scene. Finally, we demonstrate the effectiveness of our technical contributions through extensive experimentation and present significant improvements in performance across multiple benchmark datasets. + + + + RAVE: Randomized Noise Shuffling for Fast and Consistent Video Editing with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Kara_RAVE_Randomized_Noise_Shuffling_for_Fast_and_Consistent_Video_Editing_CVPR_2024_paper.pdf + Recent advancements in diffusion-based models have demonstrated significant success in generating images from text. However, video editing models have not yet reached the same level of visual quality and user control. To address this, we introduce RAVE, a zero-shot video editing method that leverages pre-trained text-to-image diffusion models without additional training. RAVE takes an input video and a text prompt to produce high-quality videos while preserving the original motion and semantic structure. It employs a novel noise shuffling strategy, leveraging spatio-temporal interactions between frames to produce temporally consistent videos faster than existing methods. It is also efficient in terms of memory requirements, allowing it to handle longer videos. RAVE is capable of a wide range of edits, from local attribute modifications to shape transformations. In order to demonstrate the versatility of RAVE, we create a comprehensive video evaluation dataset ranging from object-focused scenes to complex human activities like dancing and typing, and dynamic scenes featuring swimming fish and boats. Our qualitative and quantitative experiments highlight the effectiveness of RAVE in diverse video editing scenarios compared to existing methods. Our code, dataset and videos can be found at https://rave-video-edit.github.io/. + + + + Video-Based Human Pose Regression via Decoupled Space-Time Aggregation + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Video-Based_Human_Pose_Regression_via_Decoupled_Space-Time_Aggregation_CVPR_2024_paper.pdf + By leveraging temporal dependency in video sequences, multi-frame human pose estimation algorithms have demonstrated remarkable results in complicated situations such as occlusion, motion blur and video defocus. These algorithms are predominantly based on heatmaps, resulting in high computation and storage requirements per frame, which limits their flexibility and real-time application in video scenarios, particularly on edge devices. In this paper, we develop an efficient and effective video-based human pose regression method which bypasses intermediate representations such as heatmaps and instead directly maps the input to the output joint coordinates. Despite the inherent spatial correlation among adjacent joints of the human pose, the temporal trajectory of each individual joint exhibits relative independence.
In light of this we propose a novel Decoupled Space-Time Aggregation network (DSTA) to separately capture the spatial contexts between adjacent joints and the temporal cues of each individual joint thereby avoiding the conflation of spatiotemporal dimensions. Concretely DSTA learns a dedicated feature token for each joint to facilitate the modeling of their spatiotemporal dependencies. With the proposed joint-wise local-awareness attention mechanism our method is capable of efficiently and flexibly utilizing the spatial dependency of adjacent joints and the temporal dependency of each joint itself. Extensive experiments demonstrate the superiority of our method. Compared to previous regression-based single-frame human pose estimation methods DSTA significantly enhances performance achieving an 8.9 mAP improvement on PoseTrack2017. Furthermore our approach either surpasses or is on par with the state-of-the-art heatmap-based multi-frame human pose estimation methods. Project page: https://github.com/zgspose/DSTA. + + + + L-MAGIC: Language Model Assisted Generation of Images with Coherence + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_L-MAGIC_Language_Model_Assisted_Generation_of_Images_with_Coherence_CVPR_2024_paper.pdf + In the current era of generative AI breakthroughs generating panoramic scenes from a single input image remains a key challenge. Most existing methods use diffusion-based iterative or simultaneous multi-view inpainting. However the lack of global scene layout priors leads to subpar outputs with duplicated objects (e.g. multiple beds in a bedroom) or requires time-consuming human text inputs for each view. We propose L-MAGIC a novel method leveraging large language models for guidance while diffusing multiple coherent views of 360 degree panoramic scenes. L-MAGIC harnesses pre-trained diffusion and language models without fine-tuning ensuring zero-shot performance. The output quality is further enhanced by super-resolution and multi-view fusion techniques. Extensive experiments demonstrate that the resulting panoramic scenes feature better scene layouts and perspective view rendering quality compared to related works with >70% preference in human evaluations. Combined with conditional diffusion models L-MAGIC can accept various input modalities including but not limited to text depth maps sketches and colored scripts. Applying depth estimation further enables 3D point cloud generation and dynamic scene exploration with fluid camera motion. + + + + 3D Face Tracking from 2D Video through Iterative Dense UV to Image Flow + http://openaccess.thecvf.com//content/CVPR2024/papers/Taubner_3D_Face_Tracking_from_2D_Video_through_Iterative_Dense_UV_CVPR_2024_paper.pdf + When working with 3D facial data improving fidelity and avoiding the uncanny valley effect is critically dependent on accurate 3D facial performance capture. Because such methods are expensive and due to the widespread availability of 2D videos recent methods have focused on how to perform monocular 3D face tracking. However these methods often fall short in capturing precise facial movements due to limitations in their network architecture training and evaluation processes. Addressing these challenges we propose a novel face tracker FlowFace that introduces an innovative 2D alignment network for dense per-vertex alignment. Unlike prior work FlowFace is trained on high-quality 3D scan annotations rather than weak supervision or synthetic data. 
Our 3D model fitting module jointly fits a 3D face model from one or many observations integrating existing neutral shape priors for enhanced identity and expression disentanglement and per-vertex deformations for detailed facial feature reconstruction. Additionally we propose a novel metric and benchmark for assessing tracking accuracy. Our method exhibits superior performance on both custom and publicly available benchmarks. We further validate the effectiveness of our tracker by generating high-quality 3D data from 2D videos which leads to performance gains on downstream tasks. + + + + Carve3D: Improving Multi-view Reconstruction Consistency for Diffusion Models with RL Finetuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_Carve3D_Improving_Multi-view_Reconstruction_Consistency_for_Diffusion_Models_with_RL_CVPR_2024_paper.pdf + Multi-view diffusion models obtained by applying Supervised Finetuning (SFT) to text-to-image diffusion models have driven recent breakthroughs in text-to-3D research. However due to the limited size and quality of existing 3D datasets they still suffer from multi-view inconsistencies and Neural Radiance Field (NeRF) reconstruction artifacts. We argue that multi-view diffusion models can benefit from further Reinforcement Learning Finetuning (RLFT) which allows models to learn from the data generated by themselves and improve beyond their dataset limitations during SFT. To this end we introduce Carve3D an improved RLFT algorithm coupled with a novel Multi-view Reconstruction Consistency (MRC) metric to enhance the consistency of multi-view diffusion models. To measure the MRC metric on a set of multi-view images we compare them with their corresponding NeRF renderings at the same camera viewpoints. The resulting model which we denote as Carve3DM demonstrates superior multi-view consistency and NeRF reconstruction quality than existing models. Our results suggest that pairing SFT with Carve3D's RLFT is essential for developing multi-view-consistent diffusion models mirroring the standard Large Language Model (LLM) alignment pipeline. Our code training and testing data and video results are available at: https://desaixie.github.io/carve-3d. + + + + Shadow Generation for Composite Image Using Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Shadow_Generation_for_Composite_Image_Using_Diffusion_Model_CVPR_2024_paper.pdf + In the realm of image composition generating realistic shadow for the inserted foreground remains a formidable challenge. Previous works have developed image-to-image translation models which are trained on paired training data. However they are struggling to generate shadows with accurate shapes and intensities hindered by data scarcity and inherent task complexity. In this paper we resort to foundation model with rich prior knowledge of natural shadow images. Specifically we first adapt ControlNet to our task and then propose intensity modulation modules to improve the shadow intensity. Moreover we extend the small-scale DESOBA dataset to DESOBAv2 using a novel data acquisition pipeline. Experimental results on both DESOBA and DESOBAv2 datasets as well as real composite images demonstrate the superior capability of our model for shadow generation task. The dataset code and model are released at https://github.com/bcmi/Object-Shadow-Generation-Dataset-DESOBAv2. 
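The Multi-view Reconstruction Consistency (MRC) idea in the Carve3D entry above compares a set of generated multi-view images against NeRF renderings produced at the same camera viewpoints. The snippet below is a deliberately simplified sketch of that comparison, using mean PSNR as a stand-in image distance; the metric actually used by Carve3D involves further details (such as the choice of image distance and normalization) that are not modelled here, and the function name is an assumption.

```python
import numpy as np

def multiview_consistency_score(generated_views, nerf_renderings):
    """Rough consistency check between generated views and NeRF re-renderings.

    Both arguments are lists of HxWx3 float arrays in [0, 1]; entry i of the
    NeRF renderings is produced from the same camera pose as generated view i.
    Consistency is summarised here as the mean PSNR over all view pairs.
    """
    psnrs = []
    for gen, ren in zip(generated_views, nerf_renderings):
        mse = float(np.mean((gen - ren) ** 2))
        psnrs.append(10.0 * np.log10(1.0 / max(mse, 1e-10)))
    return float(np.mean(psnrs))
```

A higher score means the NeRF fitted to the generated views can reproduce them closely, i.e. the views agree with a single underlying 3D scene.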
+ + + + DisCo: Disentangled Control for Realistic Human Dance Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_DisCo_Disentangled_Control_for_Realistic_Human_Dance_Generation_CVPR_2024_paper.pdf + Generative AI has made significant strides in computer vision, particularly in text-driven image/video synthesis (T2I/T2V). Despite the notable advancements, it remains challenging in human-centric content synthesis, such as realistic dance generation. Current methodologies, primarily tailored for human motion transfer, encounter difficulties when confronted with real-world dance scenarios (e.g. social media dance), which require generalizing across a wide spectrum of poses and intricate human details. In this paper, we depart from the traditional paradigm of human motion transfer and emphasize two additional critical attributes for the synthesis of human dance content in social media contexts: (i) Generalizability: the model should be able to generalize beyond generic human viewpoints as well as unseen human subjects, backgrounds and poses; (ii) Compositionality: it should allow for the seamless composition of seen/unseen subjects, backgrounds and poses from different sources. To address these challenges, we introduce DISCO, which includes a novel model architecture with disentangled control to improve the compositionality of dance synthesis, and an effective human attribute pre-training for better generalizability to unseen humans. Extensive qualitative and quantitative results demonstrate that DISCO can generate high-quality human dance images and videos with diverse appearances and flexible motions. Code is available at https://disco-dance.github.io/. + + + + GaussianShader: 3D Gaussian Splatting with Shading Functions for Reflective Surfaces + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_GaussianShader_3D_Gaussian_Splatting_with_Shading_Functions_for_Reflective_Surfaces_CVPR_2024_paper.pdf + The advent of neural 3D Gaussians has recently brought about a revolution in the field of neural rendering, facilitating the generation of high-quality renderings at real-time speeds. However, the explicit and discrete representation encounters challenges when applied to scenes featuring reflective surfaces. In this paper, we present GaussianShader, a novel method that applies a simplified shading function to 3D Gaussians to enhance neural rendering in scenes with reflective surfaces while preserving training and rendering efficiency. The main challenge in applying the shading function lies in accurate normal estimation on discrete 3D Gaussians. Specifically, we propose a novel normal estimation framework based on the shortest axis directions of the 3D Gaussians, with a delicately designed loss that enforces consistency between the normals and the geometries of the Gaussian spheres. Experiments show that GaussianShader strikes a commendable balance between efficiency and visual quality. Our method surpasses Gaussian Splatting in PSNR on specular object datasets, exhibiting an improvement of 1.57 dB. When compared to prior works handling reflective surfaces, such as Ref-NeRF, our optimization time is significantly accelerated (23 h vs. 0.58 h).
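The normal estimation in the GaussianShader entry above takes the shortest scaling axis of each 3D Gaussian as a proxy for its surface normal. Below is a minimal numpy sketch of that idea under the usual quaternion-plus-scale parameterization of 3D Gaussians; the function name is an assumption, and the sign disambiguation and the consistency loss described in the abstract are not shown.

```python
import numpy as np

def shortest_axis_normals(quaternions, scales):
    """Normal proxies from the shortest axis of each Gaussian (illustrative).

    quaternions: (N, 4) unit quaternions in (w, x, y, z) order.
    scales: (N, 3) per-axis scales of the Gaussians.
    Returns (N, 3) unit vectors: the rotated axis with the smallest scale.
    """
    w, x, y, z = quaternions.T
    # rotation matrices built row-wise from the quaternions, shape (N, 3, 3)
    R = np.stack([
        np.stack([1 - 2 * (y**2 + z**2), 2 * (x*y - w*z), 2 * (x*z + w*y)], axis=-1),
        np.stack([2 * (x*y + w*z), 1 - 2 * (x**2 + z**2), 2 * (y*z - w*x)], axis=-1),
        np.stack([2 * (x*z - w*y), 2 * (y*z + w*x), 1 - 2 * (x**2 + y**2)], axis=-1),
    ], axis=-2)
    shortest = np.argmin(scales, axis=-1)                    # (N,)
    # column `shortest[n]` of R[n] is the world-space direction of that axis
    normals = np.take_along_axis(R, shortest[:, None, None], axis=-1)[..., 0]
    return normals / np.linalg.norm(normals, axis=-1, keepdims=True)
```

In GaussianShader the selected direction is further disambiguated in sign and tied to the Gaussian geometry through the consistency loss mentioned in the abstract; only the axis-selection step is shown here.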
+ + + + pix2gestalt: Amodal Segmentation by Synthesizing Wholes + http://openaccess.thecvf.com//content/CVPR2024/papers/Ozguroglu_pix2gestalt_Amodal_Segmentation_by_Synthesizing_Wholes_CVPR_2024_paper.pdf + We introduce pix2gestalt a framework for zero-shot amodal segmentation which learns to estimate the shape and appearance of whole objects that are only partially visible behind occlusions. By capitalizing on large-scale diffusion models and transferring their representations to this task we learn a conditional diffusion model for reconstructing whole objects in challenging zero-shot cases including examples that break natural and physical priors such as art. As training data we use a synthetically curated dataset containing occluded objects paired with their whole counterparts. Experiments show that our approach outperforms supervised baselines on established benchmarks. Our model can furthermore be used to significantly improve the performance of existing object recognition and 3D reconstruction methods in the presence of occlusions. + + + + Weakly Supervised Point Cloud Semantic Segmentation via Artificial Oracle + http://openaccess.thecvf.com//content/CVPR2024/papers/Kweon_Weakly_Supervised_Point_Cloud_Semantic_Segmentation_via_Artificial_Oracle_CVPR_2024_paper.pdf + Manual annotation of every point in a point cloud is a costly and labor-intensive process. While weakly supervised point cloud semantic segmentation (WSPCSS) with sparse annotation shows promise the limited information from initial sparse labels can place an upper bound on performance. As a new research direction for WSPCSS we propose a novel Region Exploration via Artificial Labeling (REAL) framework. It leverages a foundational image model as an artificial oracle within the active learning context eliminating the need for manual annotation by a human oracle. To integrate the 2D model into the 3D domain we first introduce a Projection-based Point-toSegment (PP2S) module designed to enable prompt segmentation of 3D data without additional training. The REAL framework samples query points based on model predictions and requests annotations from PP2S dynamically refining labels and improving model training. Furthermore to overcome several challenges of employing an artificial model as an oracle we formulate effective query sampling and label updating strategies. Our comprehensive experiments and comparisons demonstrate that the REAL framework significantly outperforms existing methods across various benchmarks. The code is available at https://github.com/jihun1998/AO. + + + + Forecasting of 3D Whole-body Human Poses with Grasping Objects + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_Forecasting_of_3D_Whole-body_Human_Poses_with_Grasping_Objects_CVPR_2024_paper.pdf + In the context of computer vision and human-robot interaction forecasting 3D human poses is crucial for understanding human behavior and enhancing the predictive capabilities of intelligent systems. While existing methods have made significant progress they often focus on predicting major body joints overlooking fine-grained gestures and their interaction with objects. Human hand movements particularly during object interactions play a pivotal role and provide more precise expressions of human poses. This work fills this gap and introduces a novel paradigm: forecasting 3D whole-body human poses with a focus on grasping objects. 
This task involves predicting activities across all joints in the body and hands encompassing the complexities of internal heterogeneity and external interactivity. To tackle these challenges we also propose a novel approach: C^3HOST cross-context cross-modal consolidation for 3D whole-body pose forecasting effectively handles the complexities of internal heterogeneity and external interactivity. C^3HOST involves distinct steps including the heterogeneous content encoding and alignment and cross-modal feature learning and interaction. These enable us to predict activities across all body and hand joints ensuring high-precision whole-body human pose prediction even during object grasping. Extensive experiments on two benchmarks demonstrate that our model significantly enhances the accuracy of whole-body human motion prediction. The project page is available at https://sites.google.com/view/c3host. + + + + Accelerating Diffusion Sampling with Optimized Time Steps + http://openaccess.thecvf.com//content/CVPR2024/papers/Xue_Accelerating_Diffusion_Sampling_with_Optimized_Time_Steps_CVPR_2024_paper.pdf + Diffusion probabilistic models (DPMs) have shown remarkable performance in high-resolution image synthesis but their sampling efficiency is still to be desired due to the typically large number of sampling steps. Recent advancements in high-order numerical ODE solvers for DPMs have enabled the generation of high-quality images with much fewer sampling steps. While this is a significant development most sampling methods still employ uniform time steps which is not optimal when using a small number of steps. To address this issue we propose a general framework for designing an optimization problem that seeks more appropriate time steps for a specific numerical ODE solver for DPMs. This optimization problem aims to minimize the distance between the ground-truth solution to the ODE and an approximate solution corresponding to the numerical solver. It can be efficiently solved using the constrained trust region method taking less than 15 seconds. Our extensive experiments on both unconditional and conditional sampling using pixel- and latent-space DPMs demonstrate that when combined with the state-of-the-art sampling method UniPC our optimized time steps significantly improve image generation performance in terms of FID scores for datasets such as CIFAR-10 and ImageNet compared to using uniform time steps. + + + + Unsupervised Template-assisted Point Cloud Shape Correspondence Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_Unsupervised_Template-assisted_Point_Cloud_Shape_Correspondence_Network_CVPR_2024_paper.pdf + Unsupervised point cloud shape correspondence aims to establish point-wise correspondences between source and target point clouds. Existing methods obtain correspondences directly by computing point-wise feature similarity between point clouds. However non-rigid objects possess strong deformability and unusual shapes making it a longstanding challenge to directly establish correspondences between point clouds with unconventional shapes. To address this challenge we propose an unsupervised Template-Assisted point cloud shape correspondence Network termed TANet including a template generation module and a template assistance module. The proposed TANet enjoys several merits. Firstly the template generation module establishes a set of learnable templates with explicit structures. 
Secondly we introduce a template assistance module that extensively leverages the generated templates to establish more accurate shape correspondences from multiple perspectives. Extensive experiments on four human and animal datasets demonstrate that TANet achieves favorable performance against state-of-the-art methods. + + + + Finsler-Laplace-Beltrami Operators with Application to Shape Analysis + http://openaccess.thecvf.com//content/CVPR2024/papers/Weber_Finsler-Laplace-Beltrami_Operators_with_Application_to_Shape_Analysis_CVPR_2024_paper.pdf + The Laplace-Beltrami operator (LBO) emerges from studying manifolds equipped with a Riemannian metric. It is often called the swiss army knife of geometry processing as it allows to capture intrinsic shape information and gives rise to heat diffusion geodesic distances and a multitude of shape descriptors. It also plays a central role in geometric deep learning. In this work we explore Finsler manifolds as a generalization of Riemannian manifolds. We revisit the Finsler heat equation and derive a Finsler heat kernel and a Finsler-Laplace-Beltrami Operator (FLBO): a novel theoretically justified anisotropic Laplace-Beltrami operator (ALBO). In experimental evaluations we demonstrate that the proposed FLBO is a valuable alternative to the traditional Riemannian-based LBO and ALBOs for spatial filtering and shape correspondence estimation. We hope that the proposed Finsler heat kernel and the FLBO will inspire further exploration of Finsler geometry in the computer vision community. + + + + Minimal Perspective Autocalibration + http://openaccess.thecvf.com//content/CVPR2024/papers/Dal_Cin_Minimal_Perspective_Autocalibration_CVPR_2024_paper.pdf + We introduce a new family of minimal problems for reconstruction from multiple views. Our primary focus is a novel approach to autocalibration a long-standing problem in computer vision. Traditional approaches to this problem such as those based on Kruppa's equations or the modulus constraint rely explicitly on the knowledge of multiple fundamental matrices or a projective reconstruction. In contrast we consider a novel formulation involving constraints on image points the unknown depths of 3D points and a partially specified calibration matrix K. For 2 and 3 views we present a comprehensive taxonomy of minimal autocalibration problems obtained by relaxing some of these constraints. These problems are organized into classes according to the number of views and any assumed prior knowledge of K. Within each class we determine problems with the fewest---or a relatively small number of---solutions. From this zoo of problems we devise three practical solvers. Experiments with synthetic and real data and interfacing our solvers with COLMAP demonstrate that we achieve superior accuracy compared to state-of-the-art calibration methods. The code is available at https://github.com/andreadalcin/MinimalPerspectiveAutocalibration. + + + + Time- Memory- and Parameter-Efficient Visual Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Mercea_Time-_Memory-_and_Parameter-Efficient_Visual_Adaptation_CVPR_2024_paper.pdf + As foundation models become more popular there is a growing need to efficiently finetune them for downstream tasks. Although numerous adaptation methods have been proposed they are designed to be efficient only in terms of how many parameters are trained. 
They however typically still require backpropagating gradients throughout the model meaning that their training-time and -memory cost does not reduce as significantly. We propose an adaptation method which does not backpropagate gradients through the backbone. We achieve this by designing a lightweight network in parallel that operates on features from the frozen pretrained backbone. As a result our method is efficient not only in terms of parameters but also in training-time and memory usage. Our approach achieves state-of-the-art accuracy-parameter trade-offs on the popular VTAB benchmark and we further show how we outperform prior works with respect to training-time and -memory usage too. We further demonstrate the training efficiency and scalability of our method by adapting a vision transformer backbone of 4 billion parameters for the computationally demanding task of video classification without any intricate model parallelism. Here we outperform a prior adaptor-based method which could only scale to a 1 billion parameter backbone or fully-finetuning a smaller backbone with the same GPU and less training time. + + + + Suppress and Rebalance: Towards Generalized Multi-Modal Face Anti-Spoofing + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_Suppress_and_Rebalance_Towards_Generalized_Multi-Modal_Face_Anti-Spoofing_CVPR_2024_paper.pdf + Face Anti-Spoofing (FAS) is crucial for securing face recognition systems against presentation attacks. With advancements in sensor manufacture and multi-modal learning techniques many multi-modal FAS approaches have emerged. However they face challenges in generalizing to unseen attacks and deployment conditions. These challenges arise from (1) modality unreliability where some modality sensors like depth and infrared undergo significant domain shifts in varying environments leading to the spread of unreliable information during cross-modal feature fusion and (2) modality imbalance where training overly relies on a dominant modality hinders the convergence of others reducing effectiveness against attack types that are indistinguishable by sorely using the dominant modality. To address modality unreliability we propose the Uncertainty-Guided Cross-Adapter (U-Adapter) to recognize unreliably detected regions within each modality and suppress the impact of unreliable regions on other modalities. For modality imbalance we propose a Rebalanced Modality Gradient Modulation (ReGrad) strategy to rebalance the convergence speed of all modalities by adaptively adjusting their gradients. Besides we provide the first large-scale benchmark for evaluating multi-modal FAS performance under domain generalization scenarios. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. Source codes and protocols are released on https://github.com/OMGGGGG/mmdg. + + + + Universal Segmentation at Arbitrary Granularity with Language Instruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Universal_Segmentation_at_Arbitrary_Granularity_with_Language_Instruction_CVPR_2024_paper.pdf + This paper aims to achieve universal segmentation of arbitrary semantic level. Despite significant progress in recent years specialist segmentation approaches are limited to specific tasks and data distribution. Retraining a new model for adaptation to new scenarios or settings takes expensive computation and time cost which raises the demand for versatile and universal segmentation model that can cater to various granularity. 
Although some attempts have been made for unifying different segmentation tasks or generalization to various scenarios limitations in the definition of paradigms and input-output spaces make it difficult for them to achieve accurate understanding of content at arbitrary granularity. To this end we present UniLSeg a universal segmentation model that can perform segmentation at any semantic level with the guidance of language instructions. For training UniLSeg we reorganize a group of tasks from original diverse distributions into a unified data format where images with texts describing segmentation targets as input and corresponding masks are output. Combined with a automatic annotation engine for utilizing numerous unlabeled data UniLSeg achieves excellent performance on various tasks and settings surpassing both specialist and unified segmentation models. + + + + Layout-Agnostic Scene Text Image Synthesis with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhangli_Layout-Agnostic_Scene_Text_Image_Synthesis_with_Diffusion_Models_CVPR_2024_paper.pdf + While diffusion models have significantly advanced the quality of image generation their capability to accurately and coherently render text within these images remains a substantial challenge. Conventional diffusion-based methods for scene text generation are typically limited by their reliance on an intermediate layout output. This dependency often results in a constrained diversity of text styles and fonts an inherent limitation stemming from the deterministic nature of the layout generation phase. To address these challenges this paper introduces SceneTextGen a novel diffusion-based model specifically designed to circumvent the need for a predefined layout stage. By doing so SceneTextGen facilitates a more natural and varied representation of text. The novelty of SceneTextGen lies in its integration of three key components: a character-level encoder for capturing detailed typographic properties coupled with a character-level instance segmentation model and a word-level spotting model to address the issues of unwanted text generation and minor character inaccuracies. We validate the performance of our method by demonstrating improved character recognition rates on generated images across different public visual text datasets in comparison to both standard diffusion based methods and text specific methods. + + + + SmartMask: Context Aware High-Fidelity Mask Generation for Fine-grained Object Insertion and Layout Control + http://openaccess.thecvf.com//content/CVPR2024/papers/Singh_SmartMask_Context_Aware_High-Fidelity_Mask_Generation_for_Fine-grained_Object_Insertion_CVPR_2024_paper.pdf + The field of generative image inpainting and object insertion has made significant progress with the recent advent of latent diffusion models. Utilizing a precise object mask can greatly enhance these applications. However due to the challenges users encounter in creating high-fidelity masks there is a tendency for these methods to rely on more coarse masks (e.g. bounding box) for these applications. This results in limited control and compromised background content preservation. To overcome these limitations we introduce SmartMask which allows any novice user to create detailed masks for precise object insertion. Combined with a ControlNet-Inpaint model our experiments demonstrate that SmartMask achieves superior object insertion quality preserving the background content more effectively than previous methods. 
Notably, unlike prior works, the proposed approach can also be used even without user-mask guidance, which allows it to perform mask-free object insertion at diverse positions and scales. Furthermore, we find that when used iteratively with a novel instruction-tuning-based planning model, SmartMask can be used to design detailed layouts from scratch. Compared with user-scribble-based layout design, we observe that SmartMask allows for better-quality outputs with layout-to-image generation methods. + + + + Customization Assistant for Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Customization_Assistant_for_Text-to-Image_Generation_CVPR_2024_paper.pdf + Customizing pre-trained text-to-image generation models has attracted massive research interest recently due to their huge potential in real-world applications. Although existing methods are able to generate creative content for a novel concept contained in a single user-input image, their capabilities are still far from perfect. Specifically, most existing methods require fine-tuning the generative model on testing images. Some existing methods do not require fine-tuning, but their performance is unsatisfactory. Furthermore, the interaction between users and models is still limited to directive and descriptive prompts such as instructions and captions. In this work, we build a customization assistant based on a pre-trained large language model and a diffusion model, which can not only perform customized generation in a tuning-free manner but also enable more user-friendly interactions: users can chat with the assistant and input either ambiguous text or clear instructions. Specifically, we propose a new framework that consists of a new model design and a novel training strategy. The resulting assistant can perform customized generation in 2-5 seconds without any test-time fine-tuning. Extensive experiments are conducted, and competitive results have been obtained across different domains, illustrating the effectiveness of the proposed method. + + + + GenHowTo: Learning to Generate Actions and State Transformations from Instructional Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Souek_GenHowTo_Learning_to_Generate_Actions_and_State_Transformations_from_Instructional_CVPR_2024_paper.pdf + We address the task of generating temporally consistent and physically plausible images of actions and object state transformations. Given an input image and a text prompt describing the targeted transformation, our generated images preserve the environment and transform objects in the initial image. Our contributions are threefold. First, we leverage a large body of instructional videos and automatically mine a dataset of triplets of consecutive frames corresponding to initial object states, actions, and resulting object transformations. Second, equipped with this data, we develop and train a conditioned diffusion model dubbed GenHowTo. Third, we evaluate GenHowTo on a variety of objects and actions and show superior performance compared to existing methods. In particular, we introduce a quantitative evaluation where GenHowTo achieves 88% and 74% on seen and unseen interaction categories respectively, outperforming prior work by a large margin. 
+ + + + Paint-it: Text-to-Texture Synthesis via Deep Convolutional Texture Map Optimization and Physically-Based Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Youwang_Paint-it_Text-to-Texture_Synthesis_via_Deep_Convolutional_Texture_Map_Optimization_and_CVPR_2024_paper.pdf + We present Paint-it a text-driven high-fidelity texture map synthesis method for 3D meshes via neural re-parameterized texture optimization. Paint-it synthesizes texture maps from a text description by synthesis-through-optimization exploiting the Score-Distillation Sampling (SDS). We observe that directly applying SDS yields undesirable texture quality due to its noisy gradients. We reveal the importance of texture parameterization when using SDS. Specifically we propose Deep Convolutional Physically-Based Rendering (DC-PBR) parameterization which re-parameterizes the physically-based rendering (PBR) texture maps with randomly initialized convolution-based neural kernels instead of a standard pixel-based parameterization. We show that DC-PBR inherently schedules the optimization curriculum according to texture frequency and naturally filters out the noisy signals from SDS. In experiments Paint-it obtains remarkable quality PBR texture maps within 15 min. given only a text description. We demonstrate the generalizability and practicality of Paint-it by synthesizing high-quality texture maps for large-scale mesh datasets and showing test-time applications such as relighting and material control using a popular graphics engine. + + + + Physics-Aware Hand-Object Interaction Denoising + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_Physics-Aware_Hand-Object_Interaction_Denoising_CVPR_2024_paper.pdf + The credibility and practicality of a reconstructed hand-object interaction sequence depend largely on its physical plausibility. However due to high occlusions during hand-object interaction physical plausibility remains a challenging criterion for purely vision-based tracking methods. To address this issue and enhance the results of existing hand trackers this paper proposes a novel physically-aware hand motion de-noising method. Specifically we introduce two learned loss terms that explicitly capture two crucial aspects of physical plausibility: grasp credibility and manipulation feasibility. These terms are used to train a physically-aware de-noising network. Qualitative and quantitative experiments demonstrate that our approach significantly improves both fine-grained physical plausibility and overall pose accuracy surpassing current state-of-the-art de-noising methods. + + + + VastGaussian: Vast 3D Gaussians for Large Scene Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_VastGaussian_Vast_3D_Gaussians_for_Large_Scene_Reconstruction_CVPR_2024_paper.pdf + Existing NeRF-based methods for large scene reconstruction often have limitations in visual quality and rendering speed. While the recent 3D Gaussian Splatting works well on small-scale and object-centric scenes scaling it up to large scenes poses challenges due to limited video memory long optimization time and noticeable appearance variations. To address these challenges we present VastGaussian the first method for high-quality reconstruction and real-time rendering on large scenes based on 3D Gaussian Splatting. 
We propose a progressive partitioning strategy to divide a large scene into multiple cells where the training cameras and point cloud are properly distributed with an airspace-aware visibility criterion. These cells are merged into a complete scene after parallel optimization. We also introduce decoupled appearance modeling into the optimization process to reduce appearance variations in the rendered images. Our approach outperforms existing NeRF-based methods and achieves state-of-the-art results on multiple large scene datasets enabling fast optimization and high-fidelity real-time rendering. + + + + Edit One for All: Interactive Batch Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_Edit_One_for_All_Interactive_Batch_Image_Editing_CVPR_2024_paper.pdf + In recent years image editing has advanced remarkably. With increased human control it is now possible to edit an image in a plethora of ways; from specifying in text what we want to change to straight up dragging the contents of the image in an interactive point-based manner. However most of the focus has remained on editing single images at a time. Whether and how we can simultaneously edit large batches of images has remained understudied. With the goal of minimizing human supervision in the editing process this paper presents a novel method for interactive batch image editing using StyleGAN as the medium. Given an edit specified by users in an example image (e.g. make the face frontal) our method can automatically transfer that edit to other test images so that regardless of their initial state (pose) they all arrive at the same final state (e.g. all facing front). Extensive experiments demonstrate that edits performed using our method have similar visual quality to existing single-image-editing methods while having more visual consistency and saving significant time and human effort. + + + + Deformable One-shot Face Stylization via DINO Semantic Guidance + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Deformable_One-shot_Face_Stylization_via_DINO_Semantic_Guidance_CVPR_2024_paper.pdf + This paper addresses the complex issue of one-shot face stylization focusing on the simultaneous consideration of appearance and structure where previous methods have fallen short. We explore deformation-aware face stylization that diverges from traditional single-image style reference opting for a real-style image pair instead. The cornerstone of our method is the utilization of a self-supervised vision transformer specifically DINO-ViT to establish a robust and consistent facial structure representation across both real and style domains. Our stylization process begins by adapting the StyleGAN generator to be deformation-aware through the integration of spatial transformers (STN). We then introduce two innovative constraints for generator fine-tuning under the guidance of DINO semantics: i) a directional deformation loss that regulates directional vectors in DINO space and ii) a relative structural consistency constraint based on DINO token self-similarities ensuring diverse generation. Additionally style-mixing is employed to align the color generation with the reference minimizing inconsistent correspondences. This framework delivers enhanced deformability for general one-shot face stylization achieving notable efficiency with a fine-tuning duration of approximately 10 minutes. 
Extensive qualitative and quantitative comparisons demonstrate our superiority over state-of-the-art one-shot face stylization methods. Code is available at https://github.com/zichongc/DoesFS + + + + Coarse-to-Fine Latent Diffusion for Pose-Guided Person Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_Coarse-to-Fine_Latent_Diffusion_for_Pose-Guided_Person_Image_Synthesis_CVPR_2024_paper.pdf + Diffusion model is a promising approach to image generation and has been employed for Pose-Guided Person Image Synthesis (PGPIS) with competitive performance. While existing methods simply align the person appearance to the target pose they are prone to overfitting due to the lack of a high-level semantic understanding on the source person image. In this paper we propose a novel Coarse-to-Fine Latent Diffusion (CFLD) method for PGPIS. In the absence of image-caption pairs and textual prompts we develop a novel training paradigm purely based on images to control the generation process of a pre-trained text-to-image diffusion model. A perception-refined decoder is designed to progressively refine a set of learnable queries and extract semantic understanding of person images as a coarse-grained prompt. This allows for the decoupling of fine-grained appearance and pose information controls at different stages and thus circumventing the potential overfitting problem. To generate more realistic texture details a hybrid-granularity attention module is proposed to encode multi-scale fine-grained appearance features as bias terms to augment the coarse-grained prompt. Both quantitative and qualitative experimental results on the DeepFashion benchmark demonstrate the superiority of our method over the state of the arts for PGPIS. Code is available at https://github.com/YanzuoLu/CFLD. + + + + OMG: Towards Open-vocabulary Motion Generation via Mixture of Controllers + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_OMG_Towards_Open-vocabulary_Motion_Generation_via_Mixture_of_Controllers_CVPR_2024_paper.pdf + We have recently seen tremendous progress in realistic text-to-motion generation. Yet the existing methods often fail or produce implausible motions with unseen text inputs which limits the applications. In this paper we present OMG a novel framework which enables compelling motion generation from zero-shot open-vocabulary text prompts. Our key idea is to carefully tailor the pretrain-then-finetune paradigm into the text-to-motion generation. At the pre-training stage our model improves the generation ability by learning the rich out-of-domain inherent motion traits. To this end we scale up a large unconditional diffusion model up to 1B parameters so as to utilize the massive unlabeled motion data up to over 20M motion instances. At the subsequent fine-tuning stage we introduce motion ControlNet which incorporates text prompts as conditioning information through a trainable copy of the pre-trained model and the proposed novel Mixture-of-Controllers (MoC) block. MoC block adaptively recognizes various ranges of the sub-motions with a cross-attention mechanism and processes them separately with the text-token-specific experts. Such a design effectively aligns the CLIP token embeddings of text prompts to various ranges of compact and expressive motion features. Extensive experiments demonstrate that our OMG achieves significant improvements over the state-of-the-art methods on zero-shot text-to-motion generation. Project page: https://tr3e.github.io/omg-page. 
+ + + + Align Your Gaussians: Text-to-4D with Dynamic 3D Gaussians and Composed Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Ling_Align_Your_Gaussians_Text-to-4D_with_Dynamic_3D_Gaussians_and_Composed_CVPR_2024_paper.pdf + Text-guided diffusion models have revolutionized image and video generation and have also been successfully used for optimization-based 3D object synthesis. Here we instead focus on the underexplored text-to-4D setting and synthesize dynamic animated 3D objects using score distillation methods with an additional temporal dimension. Compared to previous work we pursue a novel compositional generation-based approach and combine text-to-image text-to-video and 3D-aware multiview diffusion models to provide feedback during 4D object optimization thereby simultaneously enforcing temporal consistency high-quality visual appearance and realistic geometry. Our method called Align Your Gaussians (AYG) leverages dynamic 3D Gaussian Splatting with deformation fields as 4D representation. Crucial to AYG is a novel method to regularize the distribution of the moving 3D Gaussians and thereby stabilize the optimization and induce motion. We also propose a motion amplification mechanism as well as a new autoregressive synthesis scheme to generate and combine multiple 4D sequences for longer generation. These techniques allow us to synthesize vivid dynamic scenes outperform previous work qualitatively and quantitatively and achieve state-of-the-art text-to-4D performance. Due to the Gaussian 4D representation different 4D animations can be seamlessly combined as we demonstrate. AYG opens up promising avenues for animation simulation and digital content creation as well as synthetic data generation. + + + + PDF: A Probability-Driven Framework for Open World 3D Point Cloud Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_PDF_A_Probability-Driven_Framework_for_Open_World_3D_Point_Cloud_CVPR_2024_paper.pdf + Existing point cloud semantic segmentation networks cannot identify unknown classes and update their knowledge due to a closed-set and static perspective of the real world which would induce the intelligent agent to make bad decisions. To address this problem we propose a Probability-Driven Framework (PDF) for open world semantic segmentation that includes (i) a lightweight U-decoder branch to identify unknown classes by estimating the uncertainties (ii) a flexible pseudo-labeling scheme to supply geometry features along with probability distribution features of unknown classes by generating pseudo labels and (iii) an incremental knowledge distillation strategy to incorporate novel classes into the existing knowledge base gradually. Our framework enables the model to behave like human beings which could recognize unknown objects and incrementally learn them with the corresponding knowledge. Experimental results on the S3DIS and ScanNetv2 datasets demonstrate that the proposed PDF outperforms other methods by a large margin in both important tasks of open world semantic segmentation. + + + + Test-Time Domain Generalization for Face Anti-Spoofing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Test-Time_Domain_Generalization_for_Face_Anti-Spoofing_CVPR_2024_paper.pdf + Face Anti-Spoofing (FAS) is pivotal in safeguarding facial recognition systems against presentation attacks. 
While domain generalization (DG) methods have been developed to enhance FAS performance they predominantly focus on learning domain-invariant features during training which may not guarantee generalizability to unseen data that differs largely from the source distributions. Our insight is that testing data can serve as a valuable resource to enhance the generalizability beyond mere evaluation for DG FAS. In this paper we introduce a novel Test-Time Domain Generalization (TTDG) framework for FAS which leverages the testing data to boost the model's generalizability. Our method consisting of Test-Time Style Projection (TTSP) and Diverse Style Shifts Simulation (DSSS) effectively projects the unseen data to the seen domain space. In particular we first introduce the innovative TTSP to project the styles of the arbitrarily unseen samples of the testing distribution to the known source space of the training distributions. We then design the efficient DSSS to synthesize diverse style shifts via learnable style bases with two specifically designed losses in a hyperspherical feature space. Our method eliminates the need for model updates at the test time and can be seamlessly integrated into not only the CNN but also ViT backbones. Comprehensive experiments on widely used cross-domain FAS benchmarks demonstrate our method's state-of-the-art performance and effectiveness. + + + + Real-time 3D-aware Portrait Video Relighting + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_Real-time_3D-aware_Portrait_Video_Relighting_CVPR_2024_paper.pdf + Synthesizing realistic videos of talking faces under custom lighting conditions and viewing angles benefits various downstream applications like video conferencing. However most existing relighting methods are either time-consuming or unable to adjust the viewpoints. In this paper we present the first real-time 3D-aware method for relighting in-the-wild videos of talking faces based on Neural Radiance Fields (NeRF). Given an input portrait video our method can synthesize talking faces under both novel views and novel lighting conditions with a photo-realistic and disentangled 3D representation. Specifically we infer an albedo tri-plane as well as a shading tri-plane based on a desired lighting condition for each video frame with fast dual-encoders. We also leverage a temporal consistency network to ensure smooth transitions and reduce flickering artifacts. Our method runs at 32.98 fps on consumer-level hardware and achieves state-of-the-art results in terms of reconstruction quality lighting error lighting instability temporal consistency and inference speed. We demonstrate the effectiveness and interactivity of our method on various portrait videos with diverse lighting and viewing conditions. + + + + 3DGS-Avatar: Animatable Avatars via Deformable 3D Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Qian_3DGS-Avatar_Animatable_Avatars_via_Deformable_3D_Gaussian_Splatting_CVPR_2024_paper.pdf + We introduce an approach that creates animatable human avatars from monocular videos using 3D Gaussian Splatting (3DGS). Existing methods based on neural radiance fields (NeRFs) achieve high-quality novel-view/novel-pose image synthesis but often require days of training and are extremely slow at inference time. Recently the community has explored fast grid structures for efficient training of clothed avatars. Albeit being extremely fast at training these methods can barely achieve an interactive rendering frame rate with around 15 FPS. 
In this paper we use 3D Gaussian Splatting and learn a non-rigid deformation network to reconstruct animatable clothed human avatars that can be trained within 30 minutes and rendered at real-time frame rates (50+ FPS). Given the explicit nature of our representation we further introduce as-isometric-as-possible regularizations on both the Gaussian mean vectors and the covariance matrices enhancing the generalization of our model on highly articulated unseen poses. Experimental results show that our method achieves comparable and even better performance compared to state-of-the-art approaches on animatable avatar creation from a monocular input while being 400x and 250x faster in training and inference respectively. + + + + Style Aligned Image Generation via Shared Attention + http://openaccess.thecvf.com//content/CVPR2024/papers/Hertz_Style_Aligned_Image_Generation_via_Shared_Attention_CVPR_2024_paper.pdf + Large-scale Text-to-Image (T2I) models have rapidly gained prominence across creative fields generating visually compelling outputs from textual prompts. However controlling these models to ensure consistent style remains challenging with existing methods necessitating fine-tuning and manual intervention to disentangle content and style. In this paper we introduce StyleAligned a novel technique designed to establish style alignment among a series of generated images. By employing minimal `attention sharing' during the diffusion process our method maintains style consistency across images within T2I models. This approach allows for the creation of style-consistent images using a reference style through a straightforward inversion operation. Our method's evaluation across diverse styles and text prompts demonstrates high-quality synthesis and fidelity underscoring its efficacy in achieving consistent style across various inputs. + + + + Back to 3D: Few-Shot 3D Keypoint Detection with Back-Projected 2D Features + http://openaccess.thecvf.com//content/CVPR2024/papers/Wimmer_Back_to_3D_Few-Shot_3D_Keypoint_Detection_with_Back-Projected_2D_CVPR_2024_paper.pdf + With the immense growth of dataset sizes and computing resources in recent years so-called foundation models have become popular in NLP and vision tasks. In this work we propose to explore foundation models for the task of keypoint detection on 3D shapes. A unique characteristic of keypoint detection is that it requires semantic and geometric awareness while demanding high localization accuracy. To address this problem we propose first to back-project features from large pre-trained 2D vision models onto 3D shapes and employ them for this task. We show that we obtain robust 3D features that contain rich semantic information and analyze multiple candidate features stemming from different 2D foundation models. Second we employ a keypoint candidate optimization module which aims to match the average observed distribution of keypoints on the shape and is guided by the back-projected features. The resulting approach achieves a new state of the art for few-shot keypoint detection on the KeyPointNet dataset almost doubling the performance of the previous best methods. + + + + Neural Markov Random Field for Stereo Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Guan_Neural_Markov_Random_Field_for_Stereo_Matching_CVPR_2024_paper.pdf + Stereo matching is a core task for many computer vision and robotics applications. 
Despite their dominance in traditional stereo methods, hand-crafted Markov Random Field (MRF) models lack sufficient modeling accuracy compared to end-to-end deep models. While deep learning representations have greatly improved the unary terms of the MRF models, the overall accuracy is still severely limited by the hand-crafted pairwise terms and message passing. To address these issues, we propose a neural MRF model where both potential functions and message passing are designed using data-driven neural networks. Our fully data-driven model is built on the foundation of variational inference theory to prevent convergence issues and retain stereo MRF's graph inductive bias. To make the inference tractable and scale well to high-resolution images, we also propose a Disparity Proposal Network (DPN) to adaptively prune the search space of disparity. The proposed approach ranks 1st on both the KITTI 2012 and 2015 leaderboards among all published methods while running in under 100 ms. This approach significantly outperforms prior global methods, e.g. lowering the D1 metric by more than 50% on KITTI 2015. In addition, our method exhibits strong cross-domain generalization and can recover sharp edges. The code is available at https://github.com/aeolusguan/NMRF. + + + + PoseIRM: Enhance 3D Human Pose Estimation on Unseen Camera Settings via Invariant Risk Minimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_PoseIRM_Enhance_3D_Human_Pose_Estimation_on_Unseen_Camera_Settings_CVPR_2024_paper.pdf + Camera-parameter-free multi-view pose estimation is an emerging technique for 3D human pose estimation (HPE). Such methods can infer the camera settings implicitly or explicitly to mitigate the impact of depth uncertainty, showcasing significant potential in real applications. However, due to the limited camera-setting diversity in the available datasets, the inferred camera parameters are always simply hardcoded into the model during training and are not adaptable to the input at inference, so the learned models cannot generalize well under unseen camera settings. A natural solution is to artificially synthesize samples, i.e. 2D-3D pose pairs, under massive new camera settings. Unfortunately, to prevent over-fitting to the existing camera setting, the number of synthesized samples for each new camera setting should be comparable with that for the existing one, which multiplies the scale of training and can even make it computationally prohibitive. In this paper, we propose a novel HPE approach under the invariant risk minimization (IRM) paradigm. Precisely, we first synthesize 2D poses from myriad camera settings. We then train our model under the IRM paradigm, which targets learning a common optimal model across all camera settings and thus forces the model to automatically learn the camera parameters based on the input data. This allows the model to accurately infer 3D poses on unseen data by training on only a handful of samples from each synthesized setting, thus avoiding the unbearable increase in training cost. Another appealing feature of our method is that, benefiting from the capability of IRM to identify invariant features, its performance on the seen camera settings is enhanced as well. Comprehensive experiments verify the superiority of our approach. 
+ + + + CCEdit: Creative and Controllable Video Editing via Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_CCEdit_Creative_and_Controllable_Video_Editing_via_Diffusion_Models_CVPR_2024_paper.pdf + In this paper we present CCEdit a versatile generative video editing framework based on diffusion models. Our approach employs a novel trident network structure that separates structure and appearance control ensuring precise and creative editing capabilities. Utilizing the foundational ControlNet architecture we maintain the structural integrity of the video during editing. The incorporation of an additional appearance branch enables users to exert fine-grained control over the edited key frame. These two side branches seamlessly integrate into the main branch which is constructed upon existing text-to-image (T2I) generation models through learnable temporal layers. The versatility of our framework is demonstrated through a diverse range of choices in both structure representations and personalized T2I models as well as the option to provide the edited key frame. To facilitate comprehensive evaluation we introduce the BalanceCC benchmark dataset comprising 100 videos and 4 target prompts for each video. Our extensive user studies compare CCEdit with eight state-of-the-art video editing methods. The outcomes demonstrate CCEdit's substantial superiority over all other methods. + + + + HAVE-FUN: Human Avatar Reconstruction from Few-Shot Unconstrained Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_HAVE-FUN_Human_Avatar_Reconstruction_from_Few-Shot_Unconstrained_Images_CVPR_2024_paper.pdf + As for human avatar reconstruction contemporary techniques commonly necessitate the acquisition of costly data and struggle to achieve satisfactory results from a small number of casual images. In this paper we investigate this task from a few-shot unconstrained photo album. The reconstruction of human avatars from such data sources is challenging because of limited data amount and dynamic articulated poses. For handling dynamic data we integrate a skinning mechanism with deep marching tetrahedra (DMTet) to form a drivable tetrahedral representation which drives arbitrary mesh topologies generated by the DMTet for the adaptation of unconstrained images. To effectively mine instructive information from few-shot data we devise a two-phase optimization method with few-shot reference and few-shot guidance. The former focuses on aligning avatar identity with reference images while the latter aims to generate plausible appearances for unseen regions. Overall our framework called HaveFun can undertake avatar reconstruction rendering and animation. Extensive experiments on our developed benchmarks demonstrate that HaveFun exhibits substantially superior performance in reconstructing the human body and hand. + + + + DiffMorpher: Unleashing the Capability of Diffusion Models for Image Morphing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_DiffMorpher_Unleashing_the_Capability_of_Diffusion_Models_for_Image_Morphing_CVPR_2024_paper.pdf + Diffusion models have achieved remarkable image generation quality surpassing previous generative models. However a notable limitation of diffusion models in comparison to GANs is their difficulty in smoothly interpolating between two image samples due to their highly unstructured latent space. Such a smooth interpolation is intriguing as it naturally serves as a solution for the image morphing task with many applications. 
In this work we address this limitation via DiffMorpher an approach that enables smooth and natural image interpolation by harnessing the prior knowledge of a pre-trained diffusion model. Our key idea is to capture the semantics of the two images by fitting two LoRAs to them respectively and interpolate between both the LoRA parameters and the latent noises to ensure a smooth semantic transition where correspondence automatically emerges without the need for annotation. In addition we propose an attention interpolation and injection technique an adaptive normalization adjustment method and a new sampling schedule to further enhance the smoothness between consecutive images. Extensive experiments demonstrate that DiffMorpher achieves starkly better image morphing effects than previous methods across a variety of object categories bridging a critical functional gap that distinguished diffusion models from GANs. + + + + Towards Real-World HDR Video Reconstruction: A Large-Scale Benchmark Dataset and A Two-Stage Alignment Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Shu_Towards_Real-World_HDR_Video_Reconstruction_A_Large-Scale_Benchmark_Dataset_and_CVPR_2024_paper.pdf + As an important and practical way to obtain high dynamic range (HDR) video HDR video reconstruction from sequences with alternating exposures is still less explored mainly due to the lack of large-scale real-world datasets. Existing methods are mostly trained on synthetic datasets which perform poorly in real scenes. In this work to facilitate the development of real-world HDR video reconstruction we present Real-HDRV a large-scale real-world benchmark dataset for HDR video reconstruction featuring various scenes diverse motion patterns and high-quality labels. Specifically our dataset contains 500 LDRs-HDRs video pairs comprising about 28000 LDR frames and 4000 HDR labels covering daytime nighttime indoor and outdoor scenes. To our best knowledge our dataset is the largest real-world HDR video reconstruction dataset. Correspondingly we propose an end-to-end network for HDR video reconstruction where a novel two-stage strategy is designed to perform alignment sequentially. Specifically the first stage performs global alignment with the adaptively estimated global offsets reducing the difficulty of subsequent alignment. The second stage implicitly performs local alignment in a coarse-to-fine manner at the feature level using the adaptive separable convolution. Extensive experiments demonstrate that: (1) models trained on our dataset can achieve better performance on real scenes than those trained on synthetic datasets; (2) our method outperforms previous state-of-the-art methods. Our dataset is available at https://github.com/yungsyu99/Real-HDRV. + + + + Efficient 3D Implicit Head Avatar with Mesh-anchored Hash Table Blendshapes + http://openaccess.thecvf.com//content/CVPR2024/papers/Bai_Efficient_3D_Implicit_Head_Avatar_with_Mesh-anchored_Hash_Table_Blendshapes_CVPR_2024_paper.pdf + 3D head avatars built with neural implicit volumetric representations have achieved unprecedented levels of photorealism. However the computational cost of these methods remains a significant barrier to their widespread adoption particularly in real-time applications such as virtual reality and teleconferencing. 
While attempts have been made to develop fast neural rendering approaches for static scenes these methods cannot be simply employed to support realistic facial expressions such as in the case of a dynamic facial performance. To address these challenges we propose a novel fast 3D neural implicit head avatar model that achieves real-time rendering while maintaining fine-grained controllability and high rendering quality. Our key idea lies in the introduction of local hash table blendshapes which are learned and attached to the vertices of an underlying face parametric model. These per-vertex hash-tables are linearly merged with weights predicted via a CNN resulting in expression dependent embeddings. Our novel representation enables efficient density and color predictions using a lightweight MLP which is further accelerated by a hierarchical nearest neighbor search method. Extensive experiments show that our approach runs in real-time while achieving comparable rendering quality to state-of-the-arts and decent results on challenging expressions. + + + + No Time to Train: Empowering Non-Parametric Networks for Few-shot 3D Scene Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_No_Time_to_Train_Empowering_Non-Parametric_Networks_for_Few-shot_3D_CVPR_2024_paper.pdf + To reduce the reliance on large-scale datasets recent works in 3D segmentation resort to few-shot learning. Current 3D few-shot segmentation methods first pre-train models on 'seen' classes and then evaluate their generalization performance on 'unseen' classes. However the prior pre-training stage not only introduces excessive time overhead but also incurs a significant domain gap on 'unseen' classes. To tackle these issues we propose a Non-parametric Network for few-shot 3D Segmentation Seg-NN and its Parametric variant Seg-PN. Without training Seg-NN extracts dense representations by hand-crafted filters and achieves comparable performance to existing parameterized models. Due to the elimination of pre-training Seg-NN can alleviate the domain gap issue and save a substantial amount of time. Based on Seg-NN Seg-PN only requires training a lightweight QUEry-Support Transferring (QUEST) module which enhances the interaction between the support set and query set. Experiments suggest that Seg-PN outperforms previous state-of-the-art method by +4.19% and +7.71% mIoU on S3DIS and ScanNet datasets respectively while reducing training time by -90% indicating its effectiveness and efficiency. Code is available https://github.com/yangyangyang127/Seg-NN. + + + + PhysGaussian: Physics-Integrated 3D Gaussians for Generative Dynamics + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_PhysGaussian_Physics-Integrated_3D_Gaussians_for_Generative_Dynamics_CVPR_2024_paper.pdf + We introduce PhysGaussian a new method that seamlessly integrates physically grounded Newtonian dynamics within 3D Gaussians to achieve high-quality novel motion synthesis. Employing a customized Material Point Method (MPM) our approach enriches 3D Gaussian kernels with physically meaningful kinematic deformation and mechanical stress attributes all evolved in line with continuum mechanics principles. A defining characteristic of our method is the seamless integration between physical simulation and visual rendering: both components utilize the same 3D Gaussian kernels as their discrete representations. 
This negates the necessity for triangle/tetrahedron meshing marching cubes cage meshes or any other geometry embedding highlighting the principle of "what you see is what you simulate (WS^2)". Our method demonstrates exceptional versatility across a wide variety of materials--including elastic entities plastic metals non-Newtonian fluids and granular materials--showcasing its strong capabilities in creating diverse visual content with novel viewpoints and movements. + + + + Spatio-Temporal Turbulence Mitigation: A Translational Perspective + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Spatio-Temporal_Turbulence_Mitigation_A_Translational_Perspective_CVPR_2024_paper.pdf + Recovering images distorted by atmospheric turbulence is a challenging inverse problem due to the stochastic nature of turbulence. Although numerous turbulence mitigation (TM) algorithms have been proposed their efficiency and generalization to real-world dynamic scenarios remain severely limited. Building upon the intuitions of classical TM algorithms we present the Deep Atmospheric TUrbulence Mitigation network (DATUM). DATUM aims to overcome major challenges when transitioning from classical to deep learning approaches. By carefully integrating the merits of classical multi-frame TM methods into a deep network structure we demonstrate that DATUM can efficiently perform long-range temporal aggregation using a recurrent fashion while deformable attention and temporal-channel attention seamlessly facilitate pixel registration and lucky imaging. With additional supervision tilt and blur degradation can be jointly mitigated. These inductive biases empower DATUM to significantly outperform existing methods while delivering a tenfold increase in processing speed. A large-scale training dataset ATSyn is presented as a co-invention to enable the generalization to real turbulence. Our code and datasets are available at https://xg416.github.io/DATUM/ + + + + Grounded Text-to-Image Synthesis with Attention Refocusing + http://openaccess.thecvf.com//content/CVPR2024/papers/Phung_Grounded_Text-to-Image_Synthesis_with_Attention_Refocusing_CVPR_2024_paper.pdf + Driven by the scalable diffusion models trained on large-scale datasets text-to-image synthesis methods have shown compelling results. However these models still fail to precisely follow the text prompt involving multiple objects attributes or spatial compositions. In this paper we reveal the potential causes of the diffusion model's cross-attention and self-attention layers. We propose two novel losses to refocus attention maps according to a given spatial layout during sampling. Creating the layouts manually requires additional effort and can be tedious. Therefore we explore using large language models (LLM) to produce these layouts for our method. We conduct extensive experiments on the DrawBench HRS and TIFA benchmarks to evaluate our proposed method. We show that our proposed attention refocusing effectively improves the controllability of existing approaches. + + + + IReNe: Instant Recoloring of Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Mazzucchelli_IReNe_Instant_Recoloring_of_Neural_Radiance_Fields_CVPR_2024_paper.pdf + Advances in NERFs have allowed for 3D scene reconstructions and novel view synthesis. Yet efficiently editing these representations while retaining photorealism is an emerging challenge. 
Recent methods face three primary limitations: they're slow for interactive use lack precision at object boundaries and struggle to ensure multi-view consistency. We introduce IReNe to address these limitations enabling swift near real-time color editing in NeRF. Leveraging a pre-trained NeRF model and a single training image with user-applied color edits IReNe swiftly adjusts network parameters in seconds. This adjustment allows the model to generate new scene views accurately representing the color changes from the training image while also controlling object boundaries and view-specific effects. Object boundary control is achieved by integrating a trainable segmentation module into the model. The process gains efficiency by retraining only the weights of the last network layer. We observed that neurons in this layer can be classified into those responsible for view-dependent appearance and those contributing to diffuse appearance. We introduce an automated classification approach to identify these neuron types and exclusively fine-tune the weights of the diffuse neurons. This further accelerates training and ensures consistent color edits across different views. A thorough validation on a new dataset with edited object colors shows significant quantitative and qualitative advancements over competitors accelerating speeds by 5x and 500x. + + + + Class Tokens Infusion for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yoon_Class_Tokens_Infusion_for_Weakly_Supervised_Semantic_Segmentation_CVPR_2024_paper.pdf + Weakly Supervised Semantic Segmentation (WSSS) relies on Class Activation Maps (CAMs) to extract spatial information from image-level labels. With the success of Vision Transformer (ViT) the migration of ViT is actively conducted in WSSS. This work proposes a novel WSSS framework with Class Token Infusion (CTI). By infusing the class tokens from images we guide class tokens to possess class-specific distinct characteristics and global-local consistency. For this we devise two kinds of token infusion: 1) Intra-image Class Token Infusion (I-CTI) and 2) Cross-Image Class Token Infusion (C-CTI). In I-CTI we infuse the class tokens from the same but differently augmented images and thus make CAMs consistent among various deformations (view color). In C-CTI by infusing the class tokens from the other images and imposing the resulting CAMs to be similar it learns class-specific distinct characteristics. Besides the CTI we bring the background (BG) concept into ViT with the BG token to reduce the false positive activation of CAMs. We demonstrate the effectiveness of our method on PASCAL VOC 2012 and MS COCO 2014 datasets achieving state-of-the-art results in weakly supervised semantic segmentation. The code is available at https://github.com/yoon307/CTI + + + + FedHCA2: Towards Hetero-Client Federated Multi-Task Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_FedHCA2_Towards_Hetero-Client_Federated_Multi-Task_Learning_CVPR_2024_paper.pdf + Federated Learning (FL) enables joint training across distributed clients using their local data privately. Federated Multi-Task Learning (FMTL) builds on FL to handle multiple tasks assuming model congruity that identical model architecture is deployed in each client. To relax this assumption and thus extend real-world applicability we introduce a novel problem setting Hetero-Client Federated Multi-Task Learning (HC-FMTL) to accommodate diverse task setups. 
The main challenge of HC-FMTL is the model incongruity issue that invalidates conventional aggregation methods. It also escalates the difficulties in model aggregation to deal with data and task heterogeneity inherent in FMTL. To address these challenges we propose the FedHCA^2 framework which allows for federated training of personalized models by modeling relationships among heterogeneous clients. Drawing on our theoretical insights into the difference between multi-task and federated optimization we propose the Hyper Conflict-Averse Aggregation scheme to mitigate conflicts during encoder updates. Additionally inspired by task interaction in MTL the Hyper Cross Attention Aggregation scheme uses layer-wise cross attention to enhance decoder interactions while alleviating model incongruity. Moreover we employ learnable Hyper Aggregation Weights for each client to customize personalized parameter updates. Extensive experiments demonstrate the superior performance of FedHCA^2 in various HC-FMTL scenarios compared to representative methods. Code is available at https://github.com/innovator-zero/FedHCA2. + + + + Motion Diversification Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Motion_Diversification_Networks_CVPR_2024_paper.pdf + We introduce Motion Diversification Networks a novel framework for learning to generate realistic and diverse 3D human motion. Despite recent advances in deep generative motion modeling existing models often fail to produce samples that capture the full range of plausible and natural 3D human motion within a given context. The lack of diversity becomes even more apparent in applications where subtle and multi-modal 3D human forecasting is crucial for safety such as robotics and autonomous driving. Towards more realistic and functional 3D motion models we highlight limitations in existing generative modeling techniques particularly in overly simplistic latent code sampling strategies. We then introduce a transformer-based diversification mechanism that learns to effectively guide sampling in the latent space. Our proposed attention-based module queries multiple stochastic samples to flexibly predict a diverse set of latent codes which can be subsequently decoded into motion samples. The proposed framework achieves state-of-the-art diversity and accuracy prediction performance across a range of benchmarks and settings particularly when used to forecast intricate in-the-wild 3D human motion within complex urban environments. Our models datasets and code are available at https://mdncvpr.github.io/. + + + + Telling Left from Right: Identifying Geometry-Aware Semantic Correspondence + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Telling_Left_from_Right_Identifying_Geometry-Aware_Semantic_Correspondence_CVPR_2024_paper.pdf + While pre-trained large-scale vision models have shown significant promise for semantic correspondence their features often struggle to grasp the geometry and orientation of instances. This paper identifies the importance of being geometry-aware for semantic correspondence and reveals a limitation of the features of current foundation models under simple post-processing. We show that incorporating this information can markedly enhance semantic correspondence performance with simple but effective solutions in both zero-shot and supervised settings. We also construct a new challenging benchmark for semantic correspondence built from an existing animal pose estimation dataset for both pre-training validating models. 
Our method achieves a PCK@0.10 score of 65.4 (zero-shot) and 85.6 (supervised) on the challenging SPair-71k dataset outperforming the state of the art by 5.5p and 11.0p absolute gains respectively. Our code and datasets are publicly available at: https://telling-left-from-right.github.io. + + + + PAIR Diffusion: A Comprehensive Multimodal Object-Level Image Editor + http://openaccess.thecvf.com//content/CVPR2024/papers/Goel_PAIR_Diffusion_A_Comprehensive_Multimodal_Object-Level_Image_Editor_CVPR_2024_paper.pdf + Generative image editing has recently witnessed extremely fast-paced growth. Some works use high-level conditioning such as text while others use low-level conditioning. Nevertheless most of them lack fine-grained control over the properties of the different objects present in the image i.e. object-level image editing. In this work we tackle the task by perceiving the images as an amalgamation of various objects and aim to control the properties of each object in a fine-grained manner. Out of these properties we identify structure and appearance as the most intuitive to understand and useful for editing purposes. We propose PAIR Diffusion a generic framework that enables a diffusion model to control the structure and appearance properties of each object in the image. We show that having control over the properties of each object in an image leads to comprehensive editing capabilities. Our framework allows for various object-level editing operations on real images such as reference image-based appearance editing free-form shape editing adding objects and variations. Thanks to our design we do not require any inversion step. Additionally we propose multimodal classifier-free guidance which enables editing images using both reference images and text when using our approach with foundational diffusion models. We validate the above claims by extensively evaluating our framework on both unconditional and foundational diffusion models. + + + + TokenCompose: Text-to-Image Diffusion with Token-level Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_TokenCompose_Text-to-Image_Diffusion_with_Token-level_Supervision_CVPR_2024_paper.pdf + We present TokenCompose a Latent Diffusion Model for text-to-image generation that achieves enhanced consistency between user-specified text prompts and model-generated images. Despite its tremendous success the standard denoising process in the Latent Diffusion Model takes text prompts as conditions only absent explicit constraint for the consistency between the text prompts and the image contents leading to unsatisfactory results for composing multiple object categories. Our proposed TokenCompose aims to improve multi-category instance composition by introducing the token-wise consistency terms between the image content and object segmentation maps in the finetuning stage. TokenCompose can be applied directly to the existing training pipeline of text-conditioned diffusion models without extra human labeling information. By finetuning Stable Diffusion with our approach the model exhibits significant improvements in multi-category instance composition and enhanced photorealism for its generated images. 
+ + + + FINER: Flexible Spectral-bias Tuning in Implicit NEural Representation by Variable-periodic Activation Functions + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_FINER_Flexible_Spectral-bias_Tuning_in_Implicit_NEural_Representation_by_Variable-periodic_CVPR_2024_paper.pdf + Implicit Neural Representation (INR) which utilizes a neural network to map coordinate inputs to corresponding attributes is causing a revolution in the field of signal processing. However current INR techniques suffer from a restricted capability to tune their supported frequency set resulting in imperfect performance when representing complex signals with multiple frequencies. We have identified that this frequency-related problem can be greatly alleviated by introducing variable-periodic activation functions for which we propose FINER. By initializing the bias of the neural network within different ranges sub-functions with various frequencies in the variable-periodic function are selected for activation. Consequently the supported frequency set of FINER can be flexibly tuned leading to improved performance in signal representation. We demonstrate the capabilities of FINER in the contexts of 2D image fitting 3D signed distance field representation and 5D neural radiance fields optimization and we show that it outperforms existing INRs. + + + + TextCraftor: Your Text Encoder Can be Image Quality Controller + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_TextCraftor_Your_Text_Encoder_Can_be_Image_Quality_Controller_CVPR_2024_paper.pdf + Diffusion-based text-to-image generative models e.g. Stable Diffusion have revolutionized the field of content generation enabling significant advancements in areas like image editing and video synthesis. Despite their formidable capabilities these models are not without their limitations. It is still challenging to synthesize an image that aligns well with the input text and multiple runs with carefully crafted prompts are required to achieve satisfactory results. To mitigate these limitations numerous studies have endeavored to fine-tune the pre-trained diffusion models i.e.. UNet utilizing various technologies. Yet amidst these efforts a pivotal question of text-to-image diffusion model training has remained largely unexplored: Is it possible and feasible to fine-tune the text encoder to improve the performance of text-to-image diffusion models? Our findings reveal that instead of replacing the CLIP text encoder used in Stable Diffusion with other large language models we can enhance it through our proposed fine-tuning approach TextCraftor leading to substantial improvements in quantitative benchmarks and human assessments. Interestingly our technique also empowers controllable image generation through the interpolation of different text encoders fine-tuned with various rewards. We also demonstrate that TextCraftor is orthogonal to UNet finetuning and can be combined to further improve generative quality. + + + + IMPRINT: Generative Object Compositing by Learning Identity-Preserving Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_IMPRINT_Generative_Object_Compositing_by_Learning_Identity-Preserving_Representation_CVPR_2024_paper.pdf + Generative object compositing emerges as a promising new avenue for compositional image editing. However the requirement of object identity preservation poses a significant challenge limiting practical usage of most existing methods. 
In response this paper introduces IMPRINT a novel diffusion-based generative model trained with a two-stage learning framework that decouples learning of identity preservation from that of compositing. The first stage is targeted for context-agnostic identity-preserving pretraining of the object encoder enabling the encoder to learn an embedding that is both view-invariant and conducive to enhanced detail preservation. The subsequent stage leverages this representation to learn seamless harmonization of the object composited to the background. In addition IMPRINT incorporates a shape-guidance mechanism offering user-directed control over the compositing process. Extensive experiments demonstrate that IMPRINT significantly outperforms existing methods and various baselines on identity preservation and composition quality. + + + + Portrait4D: Learning One-Shot 4D Head Avatar Synthesis using Synthetic Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_Portrait4D_Learning_One-Shot_4D_Head_Avatar_Synthesis_using_Synthetic_Data_CVPR_2024_paper.pdf + Existing one-shot 4D head synthesis methods usually learn from monocular videos with the aid of 3DMM reconstruction yet the latter is evenly challenging which restricts them from reasonable 4D head synthesis. We present a method to learn one-shot 4D head synthesis via large-scale synthetic data. The key is to first learn a part-wise 4D generative model from monocular images via adversarial learning to synthesize multi-view images of diverse identities and full motions as training data; then leverage a transformer-based animatable triplane reconstructor to learn 4D head reconstruction using the synthetic data. A novel learning strategy is enforced to enhance the generalizability to real images by disentangling the learning process of 3D reconstruction and reenactment. Experiments demonstrate our superiority over the prior art. + + + + ConvoFusion: Multi-Modal Conversational Diffusion for Co-Speech Gesture Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Mughal_ConvoFusion_Multi-Modal_Conversational_Diffusion_for_Co-Speech_Gesture_Synthesis_CVPR_2024_paper.pdf + Gestures play a key role in human communication. Recent methods for co-speech gesture generation while managing to generate beat-aligned motions struggle generating gestures that are semantically aligned with the utterance. Compared to beat gestures that align naturally to the audio signal semantically coherent gestures require modeling the complex interactions between the language and human motion and can be controlled by focusing on certain words. Therefore we present ConvoFusion a diffusion-based approach for multi-modal gesture synthesis which can not only generate gestures based on multi-modal speech inputs but can also facilitate controllability in gesture synthesis. Our method proposes two guidance objectives that allow the users to modulate the impact of different conditioning modalities (e.g. audio vs text) as well as to choose certain words to be emphasized during gesturing. Our method is versatile in that it can be trained either for generating monologue gestures or even the conversational gestures. To further advance the research on multi-party interactive gestures the DnD Group Gesture dataset is released which contains 6 hours of gesture data showing 5 people interacting with one another. We compare our method with several recent works and demonstrate effectiveness of our method on a variety of tasks. 
We urge the reader to watch our supplementary video at https://vcai.mpi-inf.mpg.de/projects/ConvoFusion/. + + + + Boosting Neural Representations for Videos with a Conditional Decoder + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Boosting_Neural_Representations_for_Videos_with_a_Conditional_Decoder_CVPR_2024_paper.pdf + Implicit neural representations (INRs) have emerged as a promising approach for video storage and processing showing remarkable versatility across various video tasks. However existing methods often fail to fully leverage their representation capabilities primarily due to inadequate alignment of intermediate features during target frame decoding. This paper introduces a universal boosting framework for current implicit video representation approaches. Specifically we utilize a conditional decoder with a temporal-aware affine transform module which uses the frame index as a prior condition to effectively align intermediate features with target frames. Besides we introduce a sinusoidal NeRV-like block to generate diverse intermediate features and achieve a more balanced parameter distribution thereby enhancing the model's capacity. With a high-frequency information-preserving reconstruction loss our approach successfully boosts multiple baseline INRs in the reconstruction quality and convergence speed for video regression and exhibits superior inpainting and interpolation results. Further we integrate a consistent entropy minimization technique and develop video codecs based on these boosted INRs. Experiments on the UVG dataset confirm that our enhanced codecs significantly outperform baseline INRs and offer competitive rate-distortion performance compared to traditional and learning-based codecs. Code is available at https://github.com/Xinjie-Q/Boosting-NeRV. + + + + From Audio to Photoreal Embodiment: Synthesizing Humans in Conversations + http://openaccess.thecvf.com//content/CVPR2024/papers/Ng_From_Audio_to_Photoreal_Embodiment_Synthesizing_Humans_in_Conversations_CVPR_2024_paper.pdf + We present a framework for generating full-bodied photorealistic avatars that gesture according to the conversational dynamics of a dyadic interaction. Given speech audio we output multiple possibilities of gestural motion for an individual including face body and hands. The key behind our method is in combining the benefits of sample diversity from vector quantization with the high-frequency details obtained through diffusion to generate more dynamic expressive motion. We visualize the generated motion using highly photorealistic avatars that can express crucial nuances in gestures (e.g. sneers and smirks). To facilitate this line of research we introduce a first-of-its-kind multi-view conversational dataset that allows for photorealistic reconstruction. Experiments show our model generates appropriate and diverse gestures outperforming both diffusion- and VQ-only methods. Furthermore our perceptual evaluation highlights the importance of photorealism (vs. meshes) in accurately assessing subtle motion details in conversational gestures. Code and dataset available on project page. 
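+ The preceding Audio-to-Photoreal entry combines sample diversity from vector quantization with detail obtained through diffusion. The paper's model is not reproduced in this feed; the short Python sketch below only illustrates the generic vector-quantization step (nearest-codebook lookup) with made-up codebook and feature sizes, as a rough reference for how the coarse discrete stage of such a pipeline can be expressed.
+
+ import numpy as np
+
+ def vector_quantize(features, codebook):
+     """Map each feature vector to its nearest codebook entry (generic VQ step)."""
+     # Pairwise squared distances between features (N, D) and codes (K, D).
+     d2 = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
+     indices = d2.argmin(axis=1)              # one discrete code index per feature
+     return codebook[indices], indices
+
+ rng = np.random.default_rng(0)
+ codebook = rng.standard_normal((64, 16))     # 64 hypothetical codes of dimension 16
+ features = rng.standard_normal((10, 16))     # 10 hypothetical coarse motion features
+ quantized, indices = vector_quantize(features, codebook)
+ print(quantized.shape, indices[:5])          # (10, 16) and the first few code ids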
+ + + + Single-View Scene Point Cloud Human Grasp Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Single-View_Scene_Point_Cloud_Human_Grasp_Generation_CVPR_2024_paper.pdf + In this work we explore a novel task of generating human grasps based on single-view scene point clouds which more accurately mirrors the typical real-world situation of observing objects from a single viewpoint. Due to the incompleteness of object point clouds and the presence of numerous scene points the generated hand is prone to penetrating into the invisible parts of the object and the model is easily affected by scene points. Thus we introduce S2HGrasp a framework composed of two key modules: the Global Perception module that globally perceives partial object point clouds and the DiffuGrasp module designed to generate high-quality human grasps based on complex inputs that include scene points. Additionally we introduce the S2HGD dataset which comprises approximately 99000 single-object single-view scene point clouds of 1668 unique objects each annotated with one human grasp. Our extensive experiments demonstrate that S2HGrasp can not only generate natural human grasps regardless of scene points but also effectively prevent penetration between the hand and invisible parts of the object. Moreover our model showcases strong generalization capability when applied to unseen objects. Our code and dataset are available at https://github.com/iSEE-Laboratory/S2HGrasp. + + + + One-step Diffusion with Distribution Matching Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_One-step_Diffusion_with_Distribution_Matching_Distillation_CVPR_2024_paper.pdf + Diffusion models generate high-quality images but require dozens of forward passes. We introduce Distribution Matching Distillation (DMD) a procedure to transform a diffusion model into a one-step image generator with minimal impact on image quality. We enforce the one-step image generator to match the diffusion model at the distribution level by minimizing an approximate KL divergence whose gradient can be expressed as the difference between two score functions one of the target distribution and the other of the synthetic distribution being produced by our one-step generator. The score functions are parameterized as two diffusion models trained separately on each distribution. Combined with a simple regression loss matching the large-scale structure of the multi-step diffusion outputs our method outperforms all published few-step diffusion approaches reaching 2.62 FID on ImageNet 64x64 and 11.49 FID on zero-shot COCO-30k comparable to Stable Diffusion but orders of magnitude faster. Utilizing FP16 inference our model can generate images at 20 FPS on modern hardware. + + + + Rethinking Human Motion Prediction with Symplectic Integral + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Rethinking_Human_Motion_Prediction_with_Symplectic_Integral_CVPR_2024_paper.pdf + Long-term and accurate forecasting is the long-standing pursuit of the human motion prediction task. Existing methods typically suffer from dramatic degradation in prediction accuracy with the increasing prediction horizon. It comes down to two reasons: 1) insufficient numerical stability: unforeseen high noise and complex feature relationships in the data; 2) inadequate modeling stability: 
unreasonable step sizes and undesirable parameter updates in the prediction. In this paper we design a novel symplectic integral-inspired framework named symplectic integral neural network (SINN) which engages symplectic trajectories to optimize the pose representation and employs a stable symplectic operator to alternately model the dynamic context. Specifically we design a Symplectic Representation Encoder that performs on enhanced human pose representation to obtain trajectories on the symplectic manifold ensuring numerical stability based on Hamiltonian mechanics and a symplectic spatial splitting algorithm. We further present the Symplectic Temporal Aggregation module in the light of the symplectic temporal splitting algorithm which splits the long-term prediction into multiple accurate short-term predictions generated by a symplectic operator to secure modeling stability. Moreover our approach is model-agnostic and can be efficiently integrated with different physical dynamics models. The experimental results demonstrate that our method achieves the new state-of-the-art outperforming existing methods by large margins: 20.1% on Human3.6M 16.7% on CMU Mocap and 10.2% on 3DPW. + + + + CPGA: Coding Priors-Guided Aggregation Network for Compressed Video Quality Enhancement + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_CPGA_Coding_Priors-Guided_Aggregation_Network_for_Compressed_Video_Quality_Enhancement_CVPR_2024_paper.pdf + Recently numerous approaches have achieved notable success in compressed video quality enhancement (VQE). However these methods usually ignore the utilization of valuable coding priors inherently embedded in compressed videos such as motion vectors and residual frames which carry abundant temporal and spatial information. To remedy this problem we propose the Coding Priors-Guided Aggregation (CPGA) network to utilize temporal and spatial information from coding priors. The CPGA mainly consists of an inter-frame temporal aggregation (ITA) module and a multi-scale non-local aggregation (MNA) module. Specifically the ITA module aggregates temporal information from consecutive frames and coding priors while the MNA module globally captures spatial information guided by residual frames. In addition to facilitate research in the VQE task we newly construct the Video Coding Priors (VCP) dataset comprising 300 videos with various coding priors extracted from corresponding bitstreams. It remedies previous datasets' lack of coding information. Experimental results demonstrate the superiority of our method compared to existing state-of-the-art methods. The code and dataset will be released at https://github.com/VQE-CPGA/CPGA. + + + + MicroCinema: A Divide-and-Conquer Approach for Text-to-Video Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_MicroCinema_A_Divide-and-Conquer_Approach_for_Text-to-Video_Generation_CVPR_2024_paper.pdf + We present MicroCinema a straightforward yet effective framework for high-quality and coherent text-to-video generation. Unlike existing approaches that align text prompts with video directly MicroCinema introduces a Divide-and-Conquer strategy which divides text-to-video generation into a two-stage process: text-to-image generation and image&text-to-video generation. This strategy offers two significant advantages. a) It allows us to take full advantage of the recent advances in text-to-image models such as Stable Diffusion Midjourney and DALLE to generate photorealistic and highly detailed images. 
b) Leveraging the generated image the model can allocate less focus to fine-grained appearance details prioritizing the efficient learning of motion dynamics. To implement this strategy effectively we introduce two core designs. First we propose the Appearance Injection Network enhancing the preservation of the appearance of the given image. Second we introduce the Appearance Noise Prior a novel mechanism aimed at maintaining the capabilities of pre-trained 2D diffusion models. These design elements empower MicroCinema to generate high-quality videos with precise motion guided by the provided text prompts. Extensive experiments demonstrate the superiority of the proposed framework. Concretely MicroCinema achieves SOTA zero-shot FVD of 342.86 on UCF-101 and 377.40 on MSR-VTT. + + + + Structure Matters: Tackling the Semantic Discrepancy in Diffusion Models for Image Inpainting + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Structure_Matters_Tackling_the_Semantic_Discrepancy_in_Diffusion_Models_for_CVPR_2024_paper.pdf + Denoising diffusion probabilistic models (DDPMs) for image inpainting aim to add the noise to the texture of the image during the forward process and recover the masked regions with the unmasked ones of the texture via the reverse denoising process. Despite the meaningful semantics generation the existing arts suffer from the semantic discrepancy between the masked and unmasked regions since the semantically dense unmasked texture fails to be completely degraded while the masked regions turn to the pure noise in diffusion process leading to the large discrepancy between them. In this paper we aim to answer how the unmasked semantics guide the texture denoising process; together with how to tackle the semantic discrepancy to facilitate the consistent and meaningful semantics generation. To this end we propose a novel structure-guided diffusion model for image inpainting named StrDiffusion to reformulate the conventional texture denoising process under the structure guidance to derive a simplified denoising objective for image inpainting while revealing: 1) the semantically sparse structure is beneficial to tackle the semantic discrepancy in the early stage while the dense texture generates the reasonable semantics in the late stage; 2) the semantics from the unmasked regions essentially offer the time-dependent structure guidance for the texture denoising process benefiting from the time-dependent sparsity of the structure semantics. For the denoising process a structure-guided neural network is trained to estimate the simplified denoising objective by exploiting the consistency of the denoised structure between masked and unmasked regions. Besides we devise an adaptive resampling strategy as a formal criterion as whether the structure is competent to guide the texture denoising process while regulate their semantic correlations. Extensive experiments validate the merits of StrDiffusion over the state-of-the-arts. Our code is available at https://github.com/htyjers/StrDiffusion. + + + + Makeup Prior Models for 3D Facial Makeup Estimation and Applications + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Makeup_Prior_Models_for_3D_Facial_Makeup_Estimation_and_Applications_CVPR_2024_paper.pdf + In this work we introduce two types of makeup prior models to extend existing 3D face prior models: PCA-based and StyleGAN2-based priors. The PCA-based prior model is a linear model that is easy to construct and is computationally efficient. 
However it retains only low-frequency information. Conversely the StyleGAN2-based model can represent high-frequency information with relatively higher computational cost than the PCA-based model. Although there is a trade-off between the two models both are applicable to 3D facial makeup estimation and related applications. By leveraging makeup prior models and designing a makeup consistency module we effectively address the challenges that previous methods faced in robustly estimating makeup particularly in the context of handling self-occluded faces. In experiments we demonstrate that our approach reduces computational costs by several orders of magnitude achieving speeds up to 180 times faster. In addition by improving the accuracy of the estimated makeup we confirm that our methods are highly advantageous for various 3D facial makeup applications such as 3D makeup face reconstruction user-friendly makeup editing makeup transfer and interpolation. + + + + I'M HOI: Inertia-aware Monocular Capture of 3D Human-Object Interactions + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_IM_HOI_Inertia-aware_Monocular_Capture_of_3D_Human-Object_Interactions_CVPR_2024_paper.pdf + We are living in a world surrounded by diverse and "smart" devices with rich modalities of sensing ability. Conveniently capturing the interactions between us humans and these objects remains far-reaching. In this paper we present I'm-HOI a monocular scheme to faithfully capture the 3D motions of both the human and object in a novel setting: using a minimal amount of RGB camera and object-mounted Inertial Measurement Unit (IMU). It combines general motion inference and category-aware refinement. For the former we introduce a holistic human-object tracking method to fuse the IMU signals and the RGB stream and progressively recover the human motions and subsequently the companion object motions. For the latter we tailor a category-aware motion diffusion model which is conditioned on both the raw IMU observations and the results from the previous stage under over-parameterization representation. It significantly refines the initial results and generates vivid body hand and object motions. Moreover we contribute a large dataset with ground truth human and object motions dense RGB inputs and rich object-mounted IMU measurements. Extensive experiments demonstrate the effectiveness of I'm-HOI under a hybrid capture setting. Our dataset and code will be released to the community. + + + + Dynamic Policy-Driven Adaptive Multi-Instance Learning for Whole Slide Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Dynamic_Policy-Driven_Adaptive_Multi-Instance_Learning_for_Whole_Slide_Image_Classification_CVPR_2024_paper.pdf + Multi-Instance Learning (MIL) has shown impressive performance for histopathology whole slide image (WSI) analysis using bags or pseudo-bags. It involves instance sampling feature representation and decision-making. However existing MIL-based technologies at least suffer from one or more of the following problems: 1) requiring high storage and intensive pre-processing for numerous instances (sampling); 2) potential over-fitting with limited knowledge to predict bag labels (feature representation); 3) pseudo-bag counts and prior biases affect model robustness and generalizability (decision-making). Inspired by clinical diagnostics using the past sampling instances can facilitate the final WSI analysis but it is barely explored in prior technologies. 
To break free these limitations we integrate the dynamic instance sampling and reinforcement learning into a unified framework to improve the instance selection and feature aggregation forming a novel Dynamic Policy Instance Selection (DPIS) scheme for better and more credible decision-making. Specifically the measurement of feature distance and reward function are employed to boost continuous instance sampling. To alleviate the over-fitting we explore the latent global relations among instances for more robust and discriminative feature representation while establishing reward and punishment mechanisms to correct biases in pseudo-bags using contrastive learning. These strategies form the final Dynamic Policy-Driven Adaptive Multi-Instance Learning (PAMIL) method for WSI tasks. Extensive experiments reveal that our PAMIL method outperforms the state-of-the-art by 3.8% on CAMELYON16 and 4.4% on TCGA lung cancer datasets. + + + + LiDAR4D: Dynamic Neural Fields for Novel Space-time View LiDAR Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_LiDAR4D_Dynamic_Neural_Fields_for_Novel_Space-time_View_LiDAR_Synthesis_CVPR_2024_paper.pdf + Although neural radiance fields (NeRFs) have achieved triumphs in image novel view synthesis (NVS) LiDAR NVS remains largely unexplored. Previous LiDAR NVS methods employ a simple shift from image NVS methods while ignoring the dynamic nature and the large-scale reconstruction problem of LiDAR point clouds. In light of this we propose LiDAR4D a differentiable LiDAR-only framework for novel space-time LiDAR view synthesis. In consideration of the sparsity and large-scale characteristics we design a 4D hybrid representation combined with multi-planar and grid features to achieve effective reconstruction in a coarse-to-fine manner. Furthermore we introduce geometric constraints derived from point clouds to improve temporal consistency. For the realistic synthesis of LiDAR point clouds we incorporate the global optimization of ray-drop probability to preserve cross-region patterns. Extensive experiments on KITTI-360 and NuScenes datasets demonstrate the superiority of our method in accomplishing geometry-aware and time-consistent dynamic reconstruction. Codes are available at https://github.com/ispc-lab/LiDAR4D. + + + + Exploiting Diffusion Prior for Generalizable Dense Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Exploiting_Diffusion_Prior_for_Generalizable_Dense_Prediction_CVPR_2024_paper.pdf + Contents generated by recent advanced Text-to-Image (T2I) diffusion models are sometimes too imaginative for existing off-the-shelf dense predictors to estimate due to the immitigable domain gap. We introduce DMP a pipeline utilizing pre-trained T2I models as a prior for dense prediction tasks. To address the misalignment between deterministic prediction tasks and stochastic T2I models we reformulate the diffusion process through a sequence of interpolations establishing a deterministic mapping between input RGB images and output prediction distributions. To preserve generalizability we use low-rank adaptation to fine-tune pre-trained models. Extensive experiments across five tasks including 3D property estimation semantic segmentation and intrinsic image decomposition showcase the efficacy of the proposed method. Despite limited-domain training data the approach yields faithful estimations for arbitrary images surpassing existing state-of-the-art algorithms. 
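+ The DMP entry above states that low-rank adaptation is used to fine-tune the pre-trained text-to-image model so that its generalizability is preserved. The authors' code is not part of this feed; the snippet below is only a minimal generic low-rank adapter wrapped around a single linear layer (hypothetical sizes, PyTorch), sketching the idea of freezing pre-trained weights and training a small additive update.
+
+ import torch
+ import torch.nn as nn
+
+ class LoRALinear(nn.Module):
+     """A frozen linear layer plus a trainable low-rank residual update."""
+     def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 1.0):
+         super().__init__()
+         self.base = base
+         for p in self.base.parameters():   # keep the pre-trained weights fixed
+             p.requires_grad = False
+         self.down = nn.Linear(base.in_features, rank, bias=False)
+         self.up = nn.Linear(rank, base.out_features, bias=False)
+         nn.init.zeros_(self.up.weight)     # start as a zero (identity-preserving) update
+         self.scale = alpha / rank
+
+     def forward(self, x):
+         return self.base(x) + self.scale * self.up(self.down(x))
+
+ layer = LoRALinear(nn.Linear(64, 64), rank=4)
+ print(layer(torch.randn(2, 64)).shape)     # torch.Size([2, 64])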
+ + + + Orthogonal Adaptation for Modular Customization of Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Po_Orthogonal_Adaptation_for_Modular_Customization_of_Diffusion_Models_CVPR_2024_paper.pdf + Customization techniques for text-to-image models have paved the way for a wide range of previously unattainable applications enabling the generation of specific concepts across diverse contexts and styles. While existing methods facilitate high-fidelity customization for individual concepts or a limited pre-defined set of them they fall short of achieving scalability where a single model can seamlessly render countless concepts. In this paper we address a new problem called Modular Customization with the goal of efficiently merging customized models that were fine-tuned independently for individual concepts. This allows the merged model to jointly synthesize concepts in one image without compromising fidelity or incurring any additional computational costs. To address this problem we introduce Orthogonal Adaptation a method designed to encourage the customized models which do not have access to each other during fine-tuning to have orthogonal residual weights. This ensures that during inference time the customized models can be summed with minimal interference. Our proposed method is both simple and versatile applicable to nearly all optimizable weights in the model architecture. Through an extensive set of quantitative and qualitative evaluations our method consistently outperforms relevant baselines in terms of efficiency and identity preservation demonstrating a significant leap toward scalable customization of diffusion models. + + + + Optimizing Diffusion Noise Can Serve As Universal Motion Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Karunratanakul_Optimizing_Diffusion_Noise_Can_Serve_As_Universal_Motion_Priors_CVPR_2024_paper.pdf + We propose Diffusion Noise Optimization (DNO) a new method that effectively leverages existing motion diffusion models as motion priors for a wide range of motion-related tasks. Instead of training a task-specific diffusion model for each new task DNO operates by optimizing the diffusion latent noise of an existing pre-trained text-to-motion model. Given the corresponding latent noise of a human motion it propagates the gradient from the target criteria defined on the motion space through the whole denoising process to update the diffusion latent noise. As a result DNO supports any use cases where criteria can be defined as a function of motion. In particular we show that for motion editing and control DNO outperforms existing methods in both achieving the objective and preserving the motion content. DNO accommodates a diverse range of editing modes including changing trajectory pose joint locations or avoiding newly added obstacles. In addition DNO is effective in motion denoising and completion producing smooth and realistic motion from noisy and partial inputs. DNO achieves these results at inference time without the need for model retraining offering great versatility for any defined reward or loss function on the motion representation. 
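+ The DNO entry above describes propagating the gradient of a criterion defined on the generated motion through the whole denoising process to update the diffusion latent noise. The toy loop below is not the paper's model; it only sketches that pattern with a differentiable stand-in for a frozen sampler and a fabricated target criterion, to show where the gradient flows.
+
+ import torch
+
+ def toy_denoise_chain(z, steps=8):
+     # Stand-in for a frozen pre-trained sampler: a fixed differentiable map
+     # from latent noise to a "motion" tensor.
+     x = z
+     for _ in range(steps):
+         x = x - 0.1 * torch.tanh(x)
+     return x
+
+ def criterion(motion):
+     # Hypothetical motion-space objective: pull the first channel toward 1.0.
+     return ((motion[:, 0] - 1.0) ** 2).mean()
+
+ z = torch.randn(120, 22, requires_grad=True)   # 120 frames x 22 values (toy shape)
+ opt = torch.optim.Adam([z], lr=0.05)
+ for _ in range(200):
+     opt.zero_grad()
+     loss = criterion(toy_denoise_chain(z))     # gradient flows back into z
+     loss.backward()
+     opt.step()
+ print(float(loss))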
+ + + + OVFoodSeg: Elevating Open-Vocabulary Food Image Segmentation via Image-Informed Textual Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_OVFoodSeg_Elevating_Open-Vocabulary_Food_Image_Segmentation_via_Image-Informed_Textual_Representation_CVPR_2024_paper.pdf + In the realm of food computing segmenting ingredients from images poses substantial challenges due to the large intra-class variance among the same ingredients the emergence of new ingredients and the high annotation costs associated with large food segmentation datasets. Existing approaches primarily utilize a closed-vocabulary and static text embeddings setting. These methods often fall short in effectively handling the ingredients particularly new and diverse ones. In response to these limitations we introduce OVFoodSeg a framework that adopts an open-vocabulary setting and enhances text embeddings with visual context. By integrating vision-language models (VLMs) our approach enriches text embedding with image-specific information through two innovative modules e.g. an image-to-text learner FoodLearner and an Image-Informed Text Encoder. The training process of OVFoodSeg is divided into two stages: the pre-training of FoodLearner and the subsequent learning phase for segmentation. The pre-training phase equips FoodLearner with the capability to align visual information with corresponding textual representations that are specifically related to food while the second phase adapts both the FoodLearner and the Image-Informed Text Encoder for the segmentation task. By addressing the deficiencies of previous models OVFoodSeg demonstrates a significant improvement achieving an 4.9% increase in mean Intersection over Union (mIoU) on the FoodSeg103 dataset setting a new milestone for food image segmentation. + + + + XFeat: Accelerated Features for Lightweight Image Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Potje_XFeat_Accelerated_Features_for_Lightweight_Image_Matching_CVPR_2024_paper.pdf + We introduce a lightweight and accurate architecture for resource-efficient visual correspondence. Our method dubbed XFeat (Accelerated Features) revisits fundamental design choices in convolutional neural networks for detecting extracting and matching local features. Our new model satisfies a critical need for fast and robust algorithms suitable to resource-limited devices. In particular accurate image matching requires sufficiently large image resolutions -- for this reason we keep the resolution as large as possible while limiting the number of channels in the network. Besides our model is designed to offer the choice of matching at the sparse or semi-dense levels each of which may be more suitable for different downstream applications such as visual navigation and augmented reality. Our model is the first to offer semi-dense matching efficiently leveraging a novel match refinement module that relies on coarse local descriptors. XFeat is versatile and hardware-independent surpassing current deep learning-based local features in speed (up to 5x faster) with comparable or better accuracy proven in pose estimation and visual localization. We showcase it running in real-time on an inexpensive laptop CPU without specialized hardware optimizations. Code and weights are available at verlab.dcc.ufmg.br/descriptors/xfeat_cvpr24. 
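+ The XFeat entry above revolves around detecting, extracting and matching local features on resource-limited devices; its match refinement module is not detailed in this feed. The snippet below only sketches the standard mutual-nearest-neighbour step that sparse matching pipelines commonly start from, using random stand-in descriptors rather than XFeat outputs.
+
+ import numpy as np
+
+ def mutual_nn_matches(desc_a, desc_b):
+     """Keep keypoint pairs that are each other's nearest neighbour in descriptor space."""
+     a = desc_a / np.linalg.norm(desc_a, axis=1, keepdims=True)
+     b = desc_b / np.linalg.norm(desc_b, axis=1, keepdims=True)
+     sim = a @ b.T                            # cosine similarity matrix
+     nn_ab = sim.argmax(axis=1)               # best match in B for each A descriptor
+     nn_ba = sim.argmax(axis=0)               # best match in A for each B descriptor
+     idx_a = np.arange(len(a))
+     keep = nn_ba[nn_ab] == idx_a             # mutual consistency check
+     return np.stack([idx_a[keep], nn_ab[keep]], axis=1)
+
+ rng = np.random.default_rng(0)
+ desc_a = rng.standard_normal((500, 64)).astype(np.float32)
+ desc_b = rng.standard_normal((480, 64)).astype(np.float32)
+ print(mutual_nn_matches(desc_a, desc_b).shape)   # (num_matches, 2)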
+ + + + VideoRF: Rendering Dynamic Radiance Fields as 2D Feature Video Streams + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_VideoRF_Rendering_Dynamic_Radiance_Fields_as_2D_Feature_Video_Streams_CVPR_2024_paper.pdf + Neural Radiance Fields (NeRFs) excel in photorealistically rendering static scenes. However rendering dynamic long-duration radiance fields on ubiquitous devices remains challenging due to data storage and computational constraints. In this paper we introduce VideoRF the first approach to enable real-time streaming and rendering of dynamic human-centric radiance fields on mobile platforms. At the core is a serialized 2D feature image stream representing the 4D radiance field all in one. We introduce a tailored training scheme directly applied to this 2D domain to impose the temporal and spatial redundancy of the feature image stream. By leveraging the redundancy we show that the feature image stream can be efficiently compressed by 2D video codecs which allows us to exploit video hardware accelerators to achieve real-time decoding. On the other hand based on the feature image stream we propose a novel rendering pipeline for VideoRF which has specialized space mappings to query radiance properties efficiently. Paired with a deferred shading model VideoRF has the capability of real-time rendering on mobile devices thanks to its efficiency. We have developed a real-time interactive player that enables online streaming and rendering of dynamic scenes offering a seamless and immersive free-viewpoint experience across a range of devices from desktops to mobile phones. Our project page is available at https://aoliao12138.github.io/VideoRF/. + + + + DPHMs: Diffusion Parametric Head Models for Depth-based Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_DPHMs_Diffusion_Parametric_Head_Models_for_Depth-based_Tracking_CVPR_2024_paper.pdf + We introduce Diffusion Parametric Head Models (DPHMs) a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models such as NPHMs can now excel in representing high-fidelity head geometries tracking and reconstructing heads from real-world single-view depth sequences remains very challenging as the fitting to partial and noisy observations is underconstrained. To tackle these challenges we propose a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. This prior-based regularizer effectively constrains the identity and expression codes to lie on the underlying latent manifold which represents plausible head shapes. To evaluate the effectiveness of the diffusion-based prior we collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions. We compare our method to state-of-the-art tracking methods and demonstrate improved head identity reconstruction as well as robust expression tracking. + + + + DetDiffusion: Synergizing Generative and Perceptive Models for Enhanced Data Generation and Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_DetDiffusion_Synergizing_Generative_and_Perceptive_Models_for_Enhanced_Data_Generation_CVPR_2024_paper.pdf + Current perceptive models heavily depend on resource-intensive datasets prompting the need for innovative solutions. Leveraging recent advances in diffusion models synthetic data by constructing image inputs from various annotations proves beneficial for downstream tasks. 
While prior methods have separately addressed generative and perceptive models DetDiffusion for the first time harmonizes both tackling the challenges in generating effective data for perceptive models. To enhance image generation with perceptive models we introduce perception-aware loss (P.A. loss) through segmentation improving both quality and controllability. To boost the performance of specific perceptive models our method customizes data augmentation by extracting and utilizing perception-aware attribute (P.A. Attr) during generation. Experimental results from the object detection task highlight DetDiffusion's superior performance establishing a new state-of-the-art in layout-guided generation. Furthermore image syntheses from DetDiffusion can effectively augment training data significantly enhancing downstream detection performance. + + + + Perception-Oriented Video Frame Interpolation via Asymmetric Blending + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Perception-Oriented_Video_Frame_Interpolation_via_Asymmetric_Blending_CVPR_2024_paper.pdf + Previous methods for Video Frame Interpolation (VFI) have encountered challenges notably the manifestation of blur and ghosting effects. These issues can be traced back to two pivotal factors: unavoidable motion errors and misalignment in supervision. In practice motion estimates often prove to be error-prone resulting in misaligned features. Furthermore the reconstruction loss tends to bring blurry results particularly in misaligned regions. To mitigate these challenges we propose a new paradigm called PerVFI (Perception-oriented Video Frame Interpolation). Our approach incorporates an Asymmetric Synergistic Blending module (ASB) that utilizes features from both sides to synergistically blend intermediate features. One reference frame emphasizes primary content while the other contributes complementary information. To impose a stringent constraint on the blending process we introduce a self-learned sparse quasi-binary mask which effectively mitigates ghosting and blur artifacts in the output. Additionally we employ a normalizing flow-based generator and utilize the negative log-likelihood loss to learn the conditional distribution of the output which further facilitates the generation of clear and fine details. Experimental results validate the superiority of PerVFI demonstrating significant improvements in perceptual quality compared to existing methods. Codes are available at https://github.com/mulns/PerVFI + + + + DUDF: Differentiable Unsigned Distance Fields with Hyperbolic Scaling + http://openaccess.thecvf.com//content/CVPR2024/papers/Fainstein_DUDF_Differentiable_Unsigned_Distance_Fields_with_Hyperbolic_Scaling_CVPR_2024_paper.pdf + In recent years there has been a growing interest in training Neural Networks to approximate Unsigned Distance Fields (UDFs) for representing open surfaces in the context of 3D reconstruction. However UDFs are non-differentiable at the zero level set which leads to significant errors in distances and gradients generally resulting in fragmented and discontinuous surfaces. In this paper we propose to learn a hyperbolic scaling of the unsigned distance field which defines a new Eikonal problem with distinct boundary conditions. This allows our formulation to integrate seamlessly with state-of-the-art continuously differentiable implicit neural representation networks largely applied in the literature to represent signed distance fields. 
Our approach not only addresses the challenge of open surface representation but also demonstrates significant improvement in reconstruction quality and training performance. Moreover the unlocked field's differentiability allows the accurate computation of essential topological properties such as normal directions and curvatures pervasive in downstream tasks such as rendering. Through extensive experiments we validate our approach across various data sets and against competitive baselines. The results demonstrate enhanced accuracy and up to an order of magnitude increase in speed compared to previous methods. + + + + 2S-UDF: A Novel Two-stage UDF Learning Method for Robust Non-watertight Model Reconstruction from Multi-view Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_2S-UDF_A_Novel_Two-stage_UDF_Learning_Method_for_Robust_Non-watertight_CVPR_2024_paper.pdf + Recently building on the foundation of neural radiance field various techniques have emerged to learn unsigned distance fields (UDF) to reconstruct 3D non-watertight models from multi-view images. Yet a central challenge in UDF-based volume rendering is formulating a proper way to convert unsigned distance values into volume density ensuring that the resulting weight function remains unbiased and sensitive to occlusions. Falling short on these requirements often results in incorrect topology or large reconstruction errors in resulting models. This paper addresses this challenge by presenting a novel two-stage algorithm 2S-UDF for learning a high-quality UDF from multi-view images. Initially the method applies an easily trainable density function that while slightly biased and transparent aids in coarse reconstruction. The subsequent stage then refines the geometry and appearance of the object to achieve a high-quality reconstruction by directly adjusting the weight function used in volume rendering to ensure that it is unbiased and occlusion-aware. Decoupling density and weight in two stages makes our training stable and robust distinguishing our technique from existing UDF learning approaches. Evaluations on the DeepFashion3D DTU and BlendedMVS datasets validate the robustness and effectiveness of our proposed approach. In both quantitative metrics and visual quality the results indicate our superior performance over other UDF learning techniques in reconstructing 3D non-watertight models from multi-view images. Our code is available at https://bitbucket.org/jkdeng/2sudf/. + + + + UniVS: Unified and Universal Video Segmentation with Prompts as Queries + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_UniVS_Unified_and_Universal_Video_Segmentation_with_Prompts_as_Queries_CVPR_2024_paper.pdf + Despite the recent advances in unified image segmentation (IS) developing a unified video segmentation (VS) model remains a challenge. This is mainly because generic category-specified VS tasks need to detect all objects and track them across consecutive frames while prompt-guided VS tasks require re-identifying the target with visual/text prompts throughout the entire video making it hard to handle the different tasks with the same architecture. We make an attempt to address these issues and present a novel unified VS architecture namely UniVS by using prompts as queries. UniVS averages the prompt features of the target from previous frames as its initial query to explicitly decode masks and introduces a target-wise prompt cross-attention layer in the mask decoder to integrate prompt features in the memory pool. 
By taking the predicted masks of entities from previous frames as their visual prompts UniVS converts different VS tasks into prompt-guided target segmentation eliminating the heuristic inter-frame matching process. Our framework not only unifies the different VS tasks but also naturally achieves universal training and testing ensuring robust performance across different scenarios. UniVS shows a commendable balance between performance and universality on 10 challenging VS benchmarks covering video instance semantic panoptic object and referring segmentation tasks. Code can be found at https://github.com/MinghanLi/UniVS. + + + + Efficiently Assemble Normalization Layers and Regularization for Federated Domain Generalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Le_Efficiently_Assemble_Normalization_Layers_and_Regularization_for_Federated_Domain_Generalization_CVPR_2024_paper.pdf + Domain shift is a formidable issue in Machine Learning that causes a model to suffer from performance degradation when tested on unseen domains. Federated Domain Generalization (FedDG) attempts to train a global model using collaborative clients in a privacy-preserving manner that can generalize well to unseen clients possibly with domain shift. However most existing FedDG methods either cause additional privacy risks of data leakage or induce significant costs in client communication and computation which are major concerns in the Federated Learning paradigm. To circumvent these challenges here we introduce a novel architectural method for FedDG namely gPerXAN which relies on a normalization scheme working with a guiding regularizer. In particular we carefully design Personalized eXplicitly Assembled Normalization to enforce client models selectively filtering domain-specific features that are biased towards local data while retaining discrimination of those features. Then we incorporate a simple yet effective regularizer to guide these models in directly capturing domain-invariant representations that the global model's classifier can leverage. Extensive experimental results on two benchmark datasets i.e. PACS and Office-Home and a real-world medical dataset Camelyon17 indicate that our proposed method outperforms other existing methods in addressing this particular problem. + + + + Depth Information Assisted Collaborative Mutual Promotion Network for Single Image Dehazing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Depth_Information_Assisted_Collaborative_Mutual_Promotion_Network_for_Single_Image_CVPR_2024_paper.pdf + Recovering a clear image from a single hazy image is an open inverse problem. Although significant research progress has been made most existing methods ignore the effect that downstream tasks play in promoting upstream dehazing. From the perspective of the haze generation mechanism there is a potential relationship between the depth information of the scene and the hazy image. Based on this we propose a dual-task collaborative mutual promotion framework to achieve the dehazing of a single image. This framework integrates depth estimation and dehazing by a dual-task interaction mechanism and achieves mutual enhancement of their performance. To realize the joint optimization of the two tasks an alternative implementation mechanism with the difference perception is developed. 
On the one hand the difference perception between the depth maps of the dehazing result and the ideal image is proposed to promote the dehazing network to pay attention to the non-ideal areas of the dehazing result. On the other hand by improving the depth estimation performance in the difficult-to-recover areas of the hazy image the dehazing network can explicitly use the depth information of the hazy image to assist the clear image recovery. To promote the depth estimation we propose to use the difference between the dehazed image and the ground truth to guide the depth estimation network to focus on the non-ideal areas of the dehazed image. It allows dehazing and depth estimation to leverage their strengths in a mutually reinforcing manner. Experimental results show that the proposed method can achieve better performance than that of the state-of-the-art approaches. The source code is released at https://github.com/zhoushen1/DCMPNet. + + + + Unlocking the Potential of Pre-trained Vision Transformers for Few-Shot Semantic Segmentation through Relationship Descriptors + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Unlocking_the_Potential_of_Pre-trained_Vision_Transformers_for_Few-Shot_Semantic_CVPR_2024_paper.pdf + The recent advent of pre-trained vision transformers has unveiled a promising property: their inherent capability to group semantically related visual concepts. In this paper we explore harnessing this emergent feature to tackle few-shot semantic segmentation a task focused on classifying pixels in a test image with a few example data. A critical hurdle in this endeavor is preventing overfitting to the limited classes seen during training the few-shot segmentation model. As our main discovery we find that the concept of "relationship descriptors" initially conceived for enhancing the CLIP model for zero-shot semantic segmentation offers a potential solution. We adapt and refine this concept to craft a relationship descriptor construction tailored for few-shot semantic segmentation extending its application across multiple layers to enhance performance. Building upon this adaptation we propose a few-shot semantic segmentation framework that is not only easy to implement and train but also effectively scales with the number of support examples and categories. Through rigorous experimentation across various datasets including PASCAL-5^i and COCO-20^i we demonstrate a clear advantage of our method in diverse few-shot semantic segmentation scenarios and a range of pre-trained vision transformer models. The findings clearly show that our method significantly outperforms current state-of-the-art techniques highlighting the effectiveness of harnessing the emerging capabilities of vision transformers for few-shot semantic segmentation. We release the code at https://github.com/ZiqinZhou66/FewSegwithRD.git. + + + + CustomListener: Text-guided Responsive Interaction for User-friendly Listening Head Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_CustomListener_Text-guided_Responsive_Interaction_for_User-friendly_Listening_Head_Generation_CVPR_2024_paper.pdf + Listening head generation aims to synthesize a non-verbal responsive listener head by modeling the correlation between the speaker and the listener in dynamic conversation. The applications of listener agent generation in virtual interaction have promoted many works achieving diverse and fine-grained motion generation. 
However they can only manipulate motions through simple emotional labels but cannot freely control the listener's motions. Since listener agents should have human-like attributes (e.g. identity personality) which can be freely customized by users this limits their realism. In this paper we propose a user-friendly framework called CustomListener to realize the free-form text prior guided listener generation. To achieve speaker-listener coordination we design a Static to Dynamic Portrait module (SDP) which interacts with speaker information to transform static text into dynamic portrait token with completion rhythm and amplitude information. To achieve coherence between segments we design a Past Guided Generation module (PGG) to maintain the consistency of customized listener attributes through the motion prior and utilize a diffusion-based structure conditioned on the portrait token and the motion prior to realize the controllable generation. To train and evaluate our model we have constructed two text-annotated listening head datasets based on ViCo and RealTalk which provide text-video paired labels. Extensive experiments have verified the effectiveness of our model. + + + + Fun with Flags: Robust Principal Directions via Flag Manifolds + http://openaccess.thecvf.com//content/CVPR2024/papers/Mankovich_Fun_with_Flags_Robust_Principal_Directions_via_Flag_Manifolds_CVPR_2024_paper.pdf + Principal component analysis (PCA) along with its extensions to manifolds and outlier contaminated data have been indispensable in computer vision and machine learning. In this work we present a unifying formalism for PCA and its variants and introduce a framework based on the flags of linear subspaces i.e. a hierarchy of nested linear subspaces of increasing dimension which not only allows for a common implementation but also yields novel variants not explored previously. We begin by generalizing traditional PCA methods that either maximize variance or minimize reconstruction error. We expand these interpretations to develop a wide array of new dimensionality reduction algorithms by accounting for outliers and the data manifold. To devise a common computational approach we recast robust and dual forms of PCA as optimization problems on flag manifolds. We then integrate tangent space approximations of principal geodesic analysis (tangent-PCA) into this flag-based framework creating novel robust and dual geodesic PCA variations. The remarkable flexibility offered by the `flagification' introduced here enables even more algorithmic variants identified by specific flag types. Last but not least we propose an effective convergent solver for these flag-formulations employing the Stiefel manifold. Our empirical results on both real-world and synthetic scenarios demonstrate the superiority of our novel algorithms especially in terms of robustness to outliers on manifolds. + + + + Generating Non-Stationary Textures using Self-Rectification + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Generating_Non-Stationary_Textures_using_Self-Rectification_CVPR_2024_paper.pdf + This paper addresses the challenge of example-based non-stationary texture synthesis. We introduce a novel two-step approach wherein users first modify a reference texture using standard image editing tools yielding an initial rough target for the synthesis. 
Subsequently our proposed method termed "self-rectification" automatically refines this target into a coherent seamless texture while faithfully preserving the distinct visual characteristics of the reference exemplar. Our method leverages a pre-trained diffusion network and uses self-attention mechanisms to gradually align the synthesized texture with the reference ensuring the retention of the structures in the provided target. Through experimental validation our approach exhibits exceptional proficiency in handling non-stationary textures demonstrating significant advancements in texture synthesis when compared to existing state-of-the-art techniques. Code is available at https://github.com/xiaorongjun000/Self-Rectification + + + + SPU-PMD: Self-Supervised Point Cloud Upsampling via Progressive Mesh Deformation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_SPU-PMD_Self-Supervised_Point_Cloud_Upsampling_via_Progressive_Mesh_Deformation_CVPR_2024_paper.pdf + Despite the success of recent upsampling approaches generating high-resolution point sets with uniform distribution and meticulous structures is still challenging. Unlike existing methods that only take spatial information of the raw data into account we regard point cloud upsampling as generating dense point clouds from deformable topology. Motivated by this we present SPU-PMD a self-supervised topological mesh deformation network for 3D densification. As a cascaded framework our architecture is formulated by a series of coarse mesh interpolators and mesh deformers. At each stage the mesh interpolator first produces the initial dense point clouds via mesh interpolation which allows the model to perceive the primitive topology better. Meanwhile the deformer infers the morphing by estimating the movements of mesh nodes and reconstructs the descriptive topology structure. By associating mesh deformation with feature expansion this module progressively refines point clouds' surface uniformity and structural details. To demonstrate the effectiveness of the proposed method extensive quantitative and qualitative experiments are conducted on synthetic and real-scanned 3D data. Also we compare it with state-of-the-art techniques to further illustrate the superiority of our network. The project page is: https://github.com/lyz21/SPU-PMD + + + + Snap Video: Scaled Spatiotemporal Transformers for Text-to-Video Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Menapace_Snap_Video_Scaled_Spatiotemporal_Transformers_for_Text-to-Video_Synthesis_CVPR_2024_paper.pdf + Contemporary models for generating images show remarkable quality and versatility. Swayed by these advantages the research community repurposes them to generate videos. Since video content is highly redundant we argue that naively bringing advances of image models to the video generation domain reduces motion fidelity visual quality and impairs scalability. In this work we build Snap Video a video-first model that systematically addresses these challenges. To do that we first extend the EDM framework to take into account spatially and temporally redundant pixels and naturally support video generation. Second we show that a U-Net--a workhorse behind image generation--scales poorly when generating videos requiring significant computational overhead. Hence we propose a new transformer-based architecture that trains 3.31 times faster than U-Nets (and is 4.5 times faster at inference). 
This allows us to efficiently train a text-to-video model with billions of parameters for the first time reach state-of-the-art results on a number of benchmarks and generate videos with substantially higher quality temporal consistency and motion complexity. The user studies showed that our model was favored by a large margin over the most recent methods. + + + + JointSQ: Joint Sparsification-Quantization for Distributed Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_JointSQ_Joint_Sparsification-Quantization_for_Distributed_Learning_CVPR_2024_paper.pdf + Gradient sparsification and quantization offer a promising prospect to alleviate the communication overhead problem in distributed learning. However direct combination of the two results in suboptimal solutions due to the fact that sparsification and quantization haven't been learned together. In this paper we propose Joint Sparsification-Quantization (JointSQ) inspired by the discovery that sparsification can be treated as 0-bit quantization regardless of architectures. Specifically we mathematically formulate JointSQ as a mixed-precision quantization problem expanding the solution space. It can be solved by the designed MCKP-Greedy algorithm. Theoretical analysis demonstrates the minimal compression noise of JointSQ and extensive experiments on various network architectures including CNN RNN and Transformer also validate this point. With computation overhead consistent with or even lower than that of previous methods JointSQ achieves a compression ratio of 1000x on different models while maintaining near-lossless accuracy and brings a 1.4x to 2.9x speedup over existing methods. + + + + A Unified Framework for Human-centric Point Cloud Video Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_A_Unified_Framework_for_Human-centric_Point_Cloud_Video_Understanding_CVPR_2024_paper.pdf + Human-centric Point Cloud Video Understanding (PVU) is an emerging field focused on extracting and interpreting human-related features from sequences of human point clouds further advancing downstream human-centric tasks and applications. Previous works usually focus on tackling one specific task and rely on huge labeled data which has poor generalization capability. Considering that humans have specific characteristics including the structural semantics of the human body and the dynamics of human motions we propose a unified framework to make full use of the prior knowledge and explore the inherent features in the data itself for generalized human-centric point cloud video understanding. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various human-related tasks including action recognition and 3D pose estimation. All datasets and code will be released soon. + + + + Shadow-Enlightened Image Outpainting + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Shadow-Enlightened_Image_Outpainting_CVPR_2024_paper.pdf + Conventional image outpainting methods usually treat unobserved areas as unknown and extend the scene only in terms of semantic consistency thus overlooking the hidden information in shadows cast by unobserved areas such as the invisible shapes and semantics. In this paper we propose to extract and utilize the hidden information of unobserved areas from their shadows to enhance image outpainting. To this end we propose an end-to-end deep approach that explicitly looks into the shadows within the image. 
Specifically we extract shadows from the input image and identify instance-level shadow regions cast by the unobserved areas. Then the instance-level shadow representations are concatenated to predict the scene layout of each unobserved instance and outpaint the unobserved areas. Finally two discriminators are implemented to enhance alignment between the extended semantics and their shadows. In the experiments we show that our proposed approach provides complementary cues for outpainting and achieves considerable improvement on all datasets by adopting our approach as a plug-in module. + + + + BOTH2Hands: Inferring 3D Hands from Both Text Prompts and Body Dynamics + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_BOTH2Hands_Inferring_3D_Hands_from_Both_Text_Prompts_and_Body_CVPR_2024_paper.pdf + The recently emerging text-to-motion advances have inspired numerous attempts for convenient and interactive human motion generation. Yet existing methods are largely limited to generating body motions only without considering the rich two-hand motions let alone handling various conditions like body dynamics or texts. To break the data bottleneck we propose BOTH57M a novel multi-modal dataset for two-hand motion generation. Our dataset includes accurate motion tracking for the human body and hands and provides pair-wise finger-level hand annotations and body descriptions. We further provide a strong baseline method BOTH2Hands for the novel task: generating vivid two-hand motions from both implicit body dynamics and explicit text prompts. We first warm up two parallel body-to-hand and text-to-hand diffusion models and then utilize the cross-attention transformer for motion blending. Extensive experiments and cross-validations demonstrate the effectiveness of our approach and dataset for generating convincing two-hand motions from the hybrid body-and-textual conditions. Our dataset and code will be disseminated to the community for future research which can be found at https://github.com/Godheritage/BOTH2Hands. + + + + DreamAvatar: Text-and-Shape Guided 3D Human Avatar Generation via Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_DreamAvatar_Text-and-Shape_Guided_3D_Human_Avatar_Generation_via_Diffusion_Models_CVPR_2024_paper.pdf + We present DreamAvatar a text-and-shape guided framework for generating high-quality 3D human avatars with controllable poses. While encouraging results have been reported by recent methods on text-guided 3D common object generation generating high-quality human avatars remains an open challenge due to the complexity of the human body's shape pose and appearance. We propose DreamAvatar to tackle this challenge which utilizes a trainable NeRF for predicting density and color for 3D points and pretrained text-to-image diffusion models for providing 2D self-supervision. Specifically we leverage the SMPL model to provide shape and pose guidance for the generation. We introduce a dual-observation-space design that involves the joint optimization of a canonical space and a posed space that are related by a learnable deformation field. This facilitates the generation of more complete textures and geometry faithful to the target pose. We also jointly optimize the losses computed from the full body and from the zoomed-in 3D head to alleviate the common multi-face "Janus" problem and improve facial details in the generated avatars. 
Extensive evaluations demonstrate that DreamAvatar significantly outperforms existing methods establishing a new state-of-the-art for text-and-shape guided 3D human avatar generation. + + + + Bidirectional Autoregessive Diffusion Model for Dance Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Bidirectional_Autoregessive_Diffusion_Model_for_Dance_Generation_CVPR_2024_paper.pdf + Dance serves as a powerful medium for expressing human emotions but the lifelike generation of dance is still a considerable challenge. Recently diffusion models have showcased remarkable generative abilities across various domains. They hold promise for human motion generation due to their adaptable many-to-many nature. Nonetheless current diffusion-based motion generation models often create entire motion sequences directly and unidirectionally lacking focus on the motion with local and bidirectional enhancement. When choreographing high-quality dance movements people need to take into account not only the musical context but also the nearby music-aligned dance motions. To authentically capture human behavior we propose a Bidirectional Autoregressive Diffusion Model (BADM) for music-to-dance generation where a bidirectional encoder is built to enforce that the generated dance is harmonious in both the forward and backward directions. To make the generated dance motion smoother a local information decoder is built for local motion enhancement. The proposed framework is able to generate new motions based on the input conditions and nearby motions which foresees individual motion slices iteratively and consolidates all predictions. To further refine the synchronicity between the generated dance and the beat the beat information is incorporated as an input to generate better music-aligned dance movements. Experimental results demonstrate that the proposed model achieves state-of-the-art performance compared to existing unidirectional approaches on the prominent benchmark for music-to-dance generation. + + + + FRESCO: Spatial-Temporal Correspondence for Zero-Shot Video Translation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_FRESCO_Spatial-Temporal_Correspondence_for_Zero-Shot_Video_Translation_CVPR_2024_paper.pdf + The remarkable efficacy of text-to-image diffusion models has motivated extensive exploration of their potential application in video domains. Zero-shot methods seek to extend image diffusion models to videos without necessitating model training. Recent methods mainly focus on incorporating inter-frame correspondence into attention mechanisms. However the soft constraint imposed on determining where to attend to valid features can sometimes be insufficient resulting in temporal inconsistency. In this paper we introduce FRESCO intra-frame correspondence alongside inter-frame correspondence to establish a more robust spatial-temporal constraint. This enhancement ensures a more consistent transformation of semantically similar content across frames. Beyond mere attention guidance our approach involves an explicit update of features to achieve high spatial-temporal consistency with the input video significantly improving the visual coherence of the resulting translated videos. Extensive experiments demonstrate the effectiveness of our proposed framework in producing high-quality coherent videos marking a notable improvement over existing zero-shot methods. 
+ + + + SplattingAvatar: Realistic Real-Time Human Avatars with Mesh-Embedded Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Shao_SplattingAvatar_Realistic_Real-Time_Human_Avatars_with_Mesh-Embedded_Gaussian_Splatting_CVPR_2024_paper.pdf + We present SplattingAvatar a hybrid 3D representation of photorealistic human avatars with Gaussian Splatting embedded on a triangle mesh which renders over 300 FPS on a modern GPU and 30 FPS on a mobile device. We disentangle the motion and appearance of a virtual human with explicit mesh geometry and implicit appearance modeling with Gaussian Splatting. The Gaussians are defined by barycentric coordinates and displacement on a triangle mesh as Phong surfaces. We extend lifted optimization to simultaneously optimize the parameters of the Gaussians while walking on the triangle mesh. SplattingAvatar is a hybrid representation of virtual humans where the mesh represents low-frequency motion and surface deformation while the Gaussians take over the high-frequency geometry and detailed appearance. Unlike existing deformation methods that rely on an MLP-based linear blend skinning (LBS) field for motion we control the rotation and translation of the Gaussians directly by mesh which empowers its compatibility with various animation techniques e.g. skeletal animation blend shapes and mesh editing. Trainable from monocular videos for both full-body and head avatars SplattingAvatar shows state-of-the-art rendering quality across multiple datasets. + + + + MoSAR: Monocular Semi-Supervised Model for Avatar Reconstruction using Differentiable Shading + http://openaccess.thecvf.com//content/CVPR2024/papers/Dib_MoSAR_Monocular_Semi-Supervised_Model_for_Avatar_Reconstruction_using_Differentiable_Shading_CVPR_2024_paper.pdf + Reconstructing an avatar from a portrait image has many applications in multimedia but remains a challenging research problem. Extracting reflectance maps and geometry from one image is ill-posed: recovering geometry is a one-to-many mapping problem and reflectance and light are difficult to disentangle. Accurate geometry and reflectance can be captured under the controlled conditions of a light stage but it is costly to acquire large datasets in this fashion. Moreover training solely with this type of data leads to poor generalization with in-the-wild images. This motivates the introduction of MoSAR a method for 3D avatar generation from monocular images. We propose a semi-supervised training scheme that improves generalization by learning from both light stage and in-the-wild datasets. This is achieved using a novel differentiable shading formulation. We show that our approach effectively disentangles the intrinsic face parameters producing relightable avatars. As a result MoSAR estimates a richer set of skin reflectance maps and generates more realistic avatars than existing state-of-the-art methods. We also release a new dataset that provides intrinsic face attributes (diffuse specular ambient occlusion and translucency maps) for 10k subjects. + + + + RankED: Addressing Imbalance and Uncertainty in Edge Detection Using Ranking-based Losses + http://openaccess.thecvf.com//content/CVPR2024/papers/Cetinkaya_RankED_Addressing_Imbalance_and_Uncertainty_in_Edge_Detection_Using_Ranking-based_CVPR_2024_paper.pdf + Detecting edges in images suffers from the problems of (P1) heavy imbalance between positive and negative classes as well as (P2) label uncertainty owing to disagreement between different annotators. 
Existing solutions address P1 using class-balanced cross-entropy loss and dice loss and P2 by only predicting edges agreed upon by most annotators. In this paper we propose RankED a unified ranking-based approach that addresses both the imbalance problem (P1) and the uncertainty problem (P2). RankED tackles these two problems with two components: One component which ranks positive pixels over negative pixels and the second which promotes high confidence edge pixels to have more label certainty. We show that RankED outperforms previous studies and sets a new state-of-the-art on NYUD-v2 BSDS500 and Multi-cue datasets. Code is available at https://ranked-cvpr24.github.io. + + + + DiffHuman: Probabilistic Photorealistic 3D Reconstruction of Humans + http://openaccess.thecvf.com//content/CVPR2024/papers/Sengupta_DiffHuman_Probabilistic_Photorealistic_3D_Reconstruction_of_Humans_CVPR_2024_paper.pdf + We present DiffHuman a probabilistic method for photorealistic 3D human reconstruction from a single RGB image. Despite the ill-posed nature of this problem most methods are deterministic and output a single solution often resulting in a lack of geometric detail and blurriness in unseen or uncertain regions. In contrast DiffHuman predicts a probability distribution over 3D reconstructions conditioned on an input 2D image which allows us to sample multiple detailed 3D avatars that are consistent with the image. DiffHuman is implemented as a conditional diffusion model that denoises pixel-aligned 2D observations of an underlying 3D shape representation. During inference we may sample 3D avatars by iteratively denoising 2D renders of the predicted 3D representation. Furthermore we introduce a generator neural network that approximates rendering with considerably reduced runtime (55x speed up) resulting in a novel dual-branch diffusion framework. Our experiments show that DiffHuman can produce diverse and detailed reconstructions for the parts of the person that are unseen or uncertain in the input image while remaining competitive with the state-of-the-art when reconstructing visible surfaces. + + + + Permutation Equivariance of Transformers and Its Applications + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Permutation_Equivariance_of_Transformers_and_Its_Applications_CVPR_2024_paper.pdf + Revolutionizing the field of deep learning Transformer-based models have achieved remarkable performance in many tasks. Recent research has recognized these models are robust to shuffling but are limited to inter-token permutation in the forward propagation. In this work we propose our definition of permutation equivariance a broader concept covering both inter- and intra- token permutation in the forward and backward propagation of neural networks. We rigorously proved that such permutation equivariance property can be satisfied on most vanilla Transformer-based models with almost no adaptation. We examine the property over a range of state-of-the-art models including ViT Bert GPT and others with experimental validations. Further as a proof-of-concept we explore how real-world applications including privacy-enhancing split learning and model authorization could exploit the permutation equivariance property which implicates wider intriguing application scenarios. 
The code is available at https://github.com/Doby-Xu/ST + + + + SVDTree: Semantic Voxel Diffusion for Single Image Tree Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_SVDTree_Semantic_Voxel_Diffusion_for_Single_Image_Tree_Reconstruction_CVPR_2024_paper.pdf + Efficiently representing and reconstructing the 3D geometry of biological trees remains a challenging problem in computer vision and graphics. We propose a novel approach for generating realistic tree models from single-view photographs. We cast the 3D information inference problem to a semantic voxel diffusion process which converts an input image of a tree to a novel Semantic Voxel Structure (SVS) in 3D space. The SVS encodes the geometric appearance and semantic structural information (e.g. classifying trunks branches and leaves) which retains the intricate internal tree features. Tailored to the SVS we present SVDTree a new hybrid tree modeling approach by combining structure-oriented branch reconstruction and self-organization-based foliage reconstruction. We validate SVDTree by using images from both synthetic and real trees. The comparison results show that our approach can better preserve tree details and achieve more realistic and accurate reconstruction results than previous methods. + + + + Rethinking FID: Towards a Better Evaluation Metric for Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Jayasumana_Rethinking_FID_Towards_a_Better_Evaluation_Metric_for_Image_Generation_CVPR_2024_paper.pdf + As with many machine learning problems the progress of image generation methods hinges on good evaluation metrics. One of the most popular is the Frechet Inception Distance (FID). FID estimates the distance between a distribution of Inception-v3 features of real images and those of images generated by the algorithm. We highlight important drawbacks of FID: Inception's poor representation of the rich and varied content generated by modern text-to-image models incorrect normality assumptions and poor sample complexity. We call for a reevaluation of FID's use as the primary quality metric for generated images. We empirically demonstrate that FID contradicts human raters it does not reflect gradual improvement of iterative text-to-image models it does not capture distortion levels and that it produces inconsistent results when varying the sample size. We also propose an alternative new metric CMMD based on richer CLIP embeddings and the maximum mean discrepancy distance with the Gaussian RBF kernel. It is an unbiased estimator that does not make any assumptions on the probability distribution of the embeddings and is sample efficient. Through extensive experiments and analysis we demonstrate that FID-based evaluations of text-to-image models may be unreliable and that CMMD offers a more robust and reliable assessment of image quality. + + + + SuperPrimitive: Scene Reconstruction at a Primitive Level + http://openaccess.thecvf.com//content/CVPR2024/papers/Mazur_SuperPrimitive_Scene_Reconstruction_at_a_Primitive_Level_CVPR_2024_paper.pdf + Joint camera pose and dense geometry estimation from a set of images or a monocular video remains a challenging problem due to its computational complexity and inherent visual ambiguities. Most dense incremental reconstruction systems operate directly on image pixels and solve for their 3D positions using multi-view geometry cues. Such pixel-level approaches suffer from ambiguities or violations of multi-view consistency (e.g. 
caused by textureless or specular surfaces). We address this issue with a new image representation, which we call a SuperPrimitive. SuperPrimitives are obtained by splitting images into semantically correlated local regions and enhancing them with estimated surface normal directions, both of which are predicted by state-of-the-art single image neural networks. This provides a local geometry estimate per SuperPrimitive, while their relative positions are adjusted based on multi-view observations. We demonstrate the versatility of our new representation by addressing three 3D reconstruction tasks: depth completion, few-view structure from motion and monocular dense visual odometry. Project page: https://makezur.github.io/SuperPrimitive/ + + + + TFMQ-DM: Temporal Feature Maintenance Quantization for Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_TFMQ-DM_Temporal_Feature_Maintenance_Quantization_for_Diffusion_Models_CVPR_2024_paper.pdf + The Diffusion model, a prevalent framework for image generation, encounters significant challenges in terms of broad applicability due to its extended inference times and substantial memory requirements. Efficient Post-training Quantization (PTQ) is pivotal for addressing these issues in traditional models. Different from traditional models, diffusion models heavily depend on the time-step t to achieve satisfactory multi-round denoising. Usually, t from the finite set {1, ..., T} is encoded to a temporal feature by a few modules, totally irrespective of the sampling data. However, existing PTQ methods do not optimize these modules separately. They adopt inappropriate reconstruction targets and complex calibration methods, resulting in a severe disturbance of the temporal feature and denoising trajectory, as well as low compression efficiency. To solve these issues, we propose a Temporal Feature Maintenance Quantization (TFMQ) framework built upon a Temporal Information Block, which is related only to the time-step t and unrelated to the sampling data. Powered by the pioneering block design, we devise temporal information aware reconstruction (TIAR) and finite set calibration (FSC) to align the full-precision temporal features in a limited time. Equipped with the framework, we can maintain most of the temporal information and ensure end-to-end generation quality. Extensive experiments on various datasets and diffusion models prove our state-of-the-art results. Remarkably, our quantization approach for the first time achieves model performance nearly on par with the full-precision model under 4-bit weight quantization. Additionally, our method incurs almost no extra computational cost and accelerates quantization time by 2.0x on LSUN-Bedrooms 256x256 compared to previous works. Our code is publicly available at https://github.com/ModelTC/TFMQ-DM. + + + + CONFORM: Contrast is All You Need for High-Fidelity Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Meral_CONFORM_Contrast_is_All_You_Need_for_High-Fidelity_Text-to-Image_Diffusion_CVPR_2024_paper.pdf + Images produced by text-to-image diffusion models might not always faithfully represent the semantic intent of the provided text prompt, where the model might overlook or entirely fail to produce certain objects. While recent studies propose various solutions, they often require custom-tailored functions for each of these problems, leading to sub-optimal results, especially for complex prompts.
Our work introduces a novel perspective by tackling this challenge in a contrastive context. Our approach intuitively promotes the segregation of objects in attention maps while also maintaining that pairs of related attributes are kept close to each other. We conducted extensive experiments across a wide variety of scenarios each involving unique combinations of objects attributes and scenes. These experiments effectively showcase the versatility efficiency and flexibility of our method in working with both latent and pixel-based diffusion models including Stable Diffusion and Imagen. Moreover we publicly share our source code to facilitate further research. + + + + Self-Supervised Facial Representation Learning with Facial Region Awareness + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Self-Supervised_Facial_Representation_Learning_with_Facial_Region_Awareness_CVPR_2024_paper.pdf + Self-supervised pre-training has been proved to be effective in learning transferable representations that benefit various visual tasks. This paper asks this question: can self-supervised pre-training learn general facial representations for various facial analysis tasks? Recent efforts toward this goal are limited to treating each face image as a whole i.e. learning consistent facial representations at the image-level which overlooks the consistency of local facial representations (i.e. facial regions like eyes nose etc). In this work we make a first attempt to propose a novel self-supervised facial representation learning framework to learn consistent global and local facial representations Facial Region Awareness (FRA). Specifically we explicitly enforce the consistency of facial regions by matching the local facial representations across views which are extracted with learned heatmaps highlighting the facial regions. Inspired by the mask prediction in supervised semantic segmentation we obtain the heatmaps via cosine similarity between the per-pixel projection of feature maps and facial mask embeddings computed from learnable positional embeddings which leverage the attention mechanism to globally look up the facial image for facial regions. To learn such heatmaps we formulate the learning of facial mask embeddings as a deep clustering problem by assigning the pixel features from the feature maps to them. The transfer learning results on facial classification and regression tasks show that our FRA outperforms previous pre-trained models and more importantly using ResNet as the unified backbone for various tasks our FRA achieves comparable or even better performance compared with SOTA methods in facial analysis tasks. + + + + GaussianDreamer: Fast Generation from Text to 3D Gaussians by Bridging 2D and 3D Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yi_GaussianDreamer_Fast_Generation_from_Text_to_3D_Gaussians_by_Bridging_CVPR_2024_paper.pdf + In recent times the generation of 3D assets from text prompts has shown impressive results. Both 2D and 3D diffusion models can help generate decent 3D objects based on prompts. 3D diffusion models have good 3D consistency but their quality and generalization are limited as trainable 3D data is expensive and hard to obtain. 2D diffusion models enjoy strong abilities of generalization and fine generation but 3D consistency is hard to guarantee. This paper attempts to bridge the power from the two types of diffusion models via the recent explicit and efficient 3D Gaussian splatting representation. 
A fast 3D object generation framework named as GaussianDreamer is proposed where the 3D diffusion model provides priors for initialization and the 2D diffusion model enriches the geometry and appearance. Operations of noisy point growing and color perturbation are introduced to enhance the initialized Gaussians. Our GaussianDreamer can generate a high-quality 3D instance or 3D avatar within 15 minutes on one GPU much faster than previous methods while the generated instances can be directly rendered in real time. Demos and code are available at https://taoranyi.com/gaussiandreamer/. + + + + Open-Vocabulary Attention Maps with Token Optimization for Semantic Segmentation in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Marcos-Manchon_Open-Vocabulary_Attention_Maps_with_Token_Optimization_for_Semantic_Segmentation_in_CVPR_2024_paper.pdf + Diffusion models represent a new paradigm in text-to-image generation. Beyond generating high-quality images from text prompts models such as Stable Diffusion have been successfully extended to the joint generation of semantic segmentation pseudo-masks. However current extensions primarily rely on extracting attentions linked to prompt words used for image synthesis. This approach limits the generation of segmentation masks derived from word tokens not contained in the text prompt. In this work we introduce Open-Vocabulary Attention Maps (OVAM)--a training-free method for text-to-image diffusion models that enables the generation of attention maps for any word. In addition we propose a lightweight optimization process based on OVAM for finding tokens that generate accurate attention maps for an object class with a single annotation. We evaluate these tokens within existing state-of-the-art Stable Diffusion extensions. The best-performing model improves its mIoU from 52.1 to 86.6 for the synthetic images' pseudo-masks demonstrating that our optimized tokens are an efficient way to improve the performance of existing methods without architectural changes or retraining. + + + + DreamComposer: Controllable 3D Object Generation via Multi-View Conditions + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_DreamComposer_Controllable_3D_Object_Generation_via_Multi-View_Conditions_CVPR_2024_paper.pdf + Utilizing pre-trained 2D large-scale generative models recent works are capable of generating high-quality novel views from a single in-the-wild image. However due to the lack of information from multiple views these works encounter difficulties in generating controllable novel views. In this paper we present DreamComposer a flexible and scalable framework that can enhance existing view-aware diffusion models by injecting multi-view conditions. Specifically DreamComposer first uses a view-aware 3D lifting module to obtain 3D representations of an object from multiple views. Then it renders the latent features of the target view from 3D representations with the multi-view feature fusion module. Finally the target view features extracted from multi-view inputs are injected into a pre-trained diffusion model. Experiments show that DreamComposer is compatible with state-of-the-art diffusion models for zero-shot novel view synthesis further enhancing them to generate high-fidelity novel view images with multi-view conditions ready for controllable 3D object reconstruction and various other applications. 
+ + + + Self-Calibrating Vicinal Risk Minimisation for Model Calibration + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Self-Calibrating_Vicinal_Risk_Minimisation_for_Model_Calibration_CVPR_2024_paper.pdf + Model calibration measuring the alignment between the prediction accuracy and model confidence is an important metric reflecting model trustworthiness. Existing dense binary classification methods without proper regularisation of model confidence are prone to being over-confident. To calibrate Deep Neural Networks (DNNs) we propose a Self-Calibrating Vicinal Risk Minimisation (SCVRM) that explores the vicinity space of labeled data where vicinal images that are farther away from labeled images adopt the groundtruth label with decreasing label confidence. We prove that in the logistic regression problem SCVRM can be seen as a Vicinal Risk Minimisation plus a regularisation term that penalises the over-confident predictions. In practical implementation SCVRM is approximated using Monte Carlo sampling that samples additional augmented training images and labels from the vicinal distributions. Experimental results demonstrate that SCVRM can significantly enhance model calibration for different dense classification tasks on both in-distribution and out-of-distribution data. Code is available at https://github.com/Carlisle-Liu/SCVRM. + + + + LPSNet: End-to-End Human Pose and Shape Estimation with Lensless Imaging + http://openaccess.thecvf.com//content/CVPR2024/papers/Ge_LPSNet_End-to-End_Human_Pose_and_Shape_Estimation_with_Lensless_Imaging_CVPR_2024_paper.pdf + Human pose and shape (HPS) estimation with lensless imaging is not only beneficial to privacy protection but also can be used in covert surveillance scenarios due to the small size and simple structure of this device. However this task presents significant challenges due to the inherent ambiguity of the captured measurements and lacks effective methods for directly estimating human pose and shape from lensless data. In this paper we propose the first end-to-end framework to recover 3D human poses and shapes from lensless measurements to our knowledge. We specifically design a multi-scale lensless feature decoder to decode the lensless measurements through the optically encoded mask for efficient feature extraction. We also propose a double-head auxiliary supervision mechanism to improve the estimation accuracy of human limb ends. Besides we establish a lensless imaging system and verify the effectiveness of our method on various datasets acquired by our lensless imaging system. The code and dataset are available at https://cic.tju.edu.cn/faculty/likun/projects/LPSNet. + + + + Towards a Simultaneous and Granular Identity-Expression Control in Personalized Face Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Towards_a_Simultaneous_and_Granular_Identity-Expression_Control_in_Personalized_Face_CVPR_2024_paper.pdf + In human-centric content generation the pre-trained text-to-image models struggle to produce user-wanted portrait images which retain the identity of individuals while exhibiting diverse expressions. This paper introduces our efforts towards personalized face generation. To this end we propose a novel multi-modal face generation framework capable of simultaneous identity-expression control and more fine-grained expression synthesis. Our expression control is so sophisticated that it can be specialized by the fine-grained emotional vocabulary. 
We devise a novel diffusion model that can undertake the task of simultaneous face swapping and reenactment. Due to the entanglement of identity and expression, separately and precisely controlling them within one framework is a nontrivial task and thus has not been explored yet. To overcome this, we propose several innovative designs in the conditional diffusion model, including a balancing identity and expression encoder, improved midpoint sampling and explicit background conditioning. Extensive experiments have demonstrated the controllability and scalability of the proposed framework in comparison with state-of-the-art text-to-image, face swapping and face reenactment methods. + + + + PEEKABOO: Interactive Video Generation via Masked-Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Jain_PEEKABOO_Interactive_Video_Generation_via_Masked-Diffusion_CVPR_2024_paper.pdf + Modern video generation models like Sora have achieved remarkable success in producing high-quality videos. However, a significant limitation is their inability to offer interactive control to users, a feature that promises to open up unprecedented applications and creativity. In this work, we introduce the first solution to equip diffusion-based video generation models with spatio-temporal control. We present Peekaboo, a novel masked attention module which seamlessly integrates with current video generation models, offering control without the need for additional training or inference overhead. To facilitate future research, we also introduce a comprehensive benchmark for interactive video generation. This benchmark offers a standardized framework for the community to assess the efficacy of emerging interactive video generation models. Our extensive qualitative and quantitative assessments reveal that Peekaboo achieves up to a 3.8x improvement in mIoU over baseline models, all while maintaining the same latency. Code and benchmark are available on the webpage. + + + + High-fidelity Person-centric Subject-to-Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_High-fidelity_Person-centric_Subject-to-Image_Synthesis_CVPR_2024_paper.pdf + Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn the semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation over-fit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline to eliminate the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e. the Text-driven Diffusion Model (TDM) and the Subject-augmented Diffusion Model (SDM), for scene and person generation respectively. The sampling process is divided into three sequential stages, i.e. semantic scene construction, subject-scene fusion and subject enhancement. The first and last stages are performed by TDM and SDM respectively.
The subject-scene fusion stage, that is, the collaboration, is achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. In each time step, SNF leverages the unique strengths of each model and allows for the spatial blending of predicted noises from both models automatically in a saliency-aware manner, all of which can be seamlessly integrated into the DDIM sampling process. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser in generating high-fidelity person images depicting multiple unseen persons with varying contexts. Code is available at https://github.com/CodeGoat24/Face-diffuser. + + + + JeDi: Joint-Image Diffusion Models for Finetuning-Free Personalized Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_JeDi_Joint-Image_Diffusion_Models_for_Finetuning-Free_Personalized_Text-to-Image_Generation_CVPR_2024_paper.pdf + Personalized text-to-image generation models enable users to create images that depict their individual possessions in diverse scenes, finding applications in various domains. To achieve the personalization capability, existing methods rely on finetuning a text-to-image foundation model on a user's custom dataset, which can be non-trivial for general users, resource-intensive and time-consuming. Despite attempts to develop finetuning-free methods, their generation quality is much lower than that of their finetuning counterparts. In this paper, we propose Joint-Image Diffusion (JeDi), an effective technique for learning a finetuning-free personalization model. Our key idea is to learn the joint distribution of multiple related text-image pairs that share a common subject. To facilitate learning, we propose a scalable synthetic dataset generation technique. Once trained, our model enables fast and easy personalization at test time by simply using reference images as input during the sampling process. Our approach does not require any expensive optimization process or additional modules and can faithfully preserve the identity represented by any number of reference images. Experimental results show that our model achieves state-of-the-art generation quality, both quantitatively and qualitatively, significantly outperforming both the prior finetuning-based and finetuning-free personalization baselines. + + + + HandDiff: 3D Hand Pose Estimation with Diffusion on Image-Point Cloud + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_HandDiff_3D_Hand_Pose_Estimation_with_Diffusion_on_Image-Point_Cloud_CVPR_2024_paper.pdf + Extracting keypoint locations from input hand frames, known as 3D hand pose estimation, is a critical task in various human-computer interaction applications. Essentially, 3D hand pose estimation can be regarded as a 3D point subset generative problem conditioned on input frames. Thanks to the recent significant progress of diffusion-based generative models, hand pose estimation can also benefit from the diffusion model to estimate keypoint locations with high quality. However, directly deploying existing diffusion models to solve hand pose estimation is non-trivial, since they cannot achieve the complex permutation mapping and precise localization.
Based on this motivation this paper proposes HandDiff a diffusion-based hand pose estimation model that iteratively denoises accurate hand pose conditioned on hand-shaped image-point clouds. In order to recover keypoint permutation and accurate location we further introduce joint-wise condition and local detail condition. Experimental results demonstrate that the proposed HandDiff significantly outperforms the existing approaches on four challenging hand pose benchmark datasets. Codes and pre-trained models are publicly available at https://github.com/cwc1260/HandDiff. + + + + VP3D: Unleashing 2D Visual Prompt for Text-to-3D Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_VP3D_Unleashing_2D_Visual_Prompt_for_Text-to-3D_Generation_CVPR_2024_paper.pdf + Recent innovations on text-to-3D generation have featured Score Distillation Sampling (SDS) which enables the zero-shot learning of implicit 3D models (NeRF) by directly distilling prior knowledge from 2D diffusion models. However current SDS-based models still struggle with intricate text prompts and commonly result in distorted 3D models with unrealistic textures or cross-view inconsistency issues. In this work we introduce a novel Visual Prompt-guided text-to-3D diffusion model (VP3D) that explicitly unleashes the visual appearance knowledge in 2D visual prompt to boost text-to-3D generation. Instead of solely supervising SDS with text prompt VP3D first capitalizes on 2D diffusion model to generate a high-quality image from input text which subsequently acts as visual prompt to strengthen SDS optimization with explicit visual appearance. Meanwhile we couple the SDS optimization with additional differentiable reward function that encourages rendering images of 3D models to better visually align with 2D visual prompt and semantically match with text prompt. Through extensive experiments we show that the 2D Visual Prompt in our VP3D significantly eases the learning of visual appearance of 3D models and thus leads to higher visual fidelity with more detailed textures. It is also appealing in view that when replacing the self-generating visual prompt with a given reference image VP3D is able to trigger a new task of stylized text-to-3D generation. Our project page is available at https://vp3d-cvpr24.github.io. + + + + Content-Style Decoupling for Unsupervised Makeup Transfer without Generating Pseudo Ground Truth + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Content-Style_Decoupling_for_Unsupervised_Makeup_Transfer_without_Generating_Pseudo_Ground_CVPR_2024_paper.pdf + The absence of real targets to guide the model training is one of the main problems with the makeup transfer task. Most existing methods tackle this problem by synthesizing pseudo ground truths (PGTs). However the generated PGTs are often sub-optimal and their imprecision will eventually lead to performance degradation. To alleviate this issue in this paper we propose a novel Content-Style Decoupled Makeup Transfer (CSD-MT) method which works in a purely unsupervised manner and thus eliminates the negative effects of generating PGTs. Specifically based on the frequency characteristics analysis we assume that the low-frequency (LF) component of a face image is more associated with its makeup style information while the high-frequency (HF) component is more related to its content details. This assumption allows CSD-MT to decouple the content and makeup style information in each face image through the frequency decomposition. 
After that CSD-MT realizes makeup transfer by maximizing the consistency of these two types of information between the transferred result and input images respectively. Two newly designed loss functions are also introduced to further improve the transfer performance. Extensive quantitative and qualitative analyses show the effectiveness of our CSD-MT method. Our code is available at https://github.com/Snowfallingplum/CSD-MT. + + + + You Only Need Less Attention at Each Stage in Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_You_Only_Need_Less_Attention_at_Each_Stage_in_Vision_CVPR_2024_paper.pdf + The advent of Vision Transformers (ViTs) marks a substantial paradigm shift in the realm of computer vision. ViTs capture the global information of images through self-attention modules which perform dot product computations among patchified image tokens. While self-attention modules empower ViTs to capture long-range dependencies the computational complexity grows quadratically with the number of tokens which is a major hindrance to the practical application of ViTs. Moreover the self-attention mechanism in deep ViTs is also susceptible to the attention saturation issue. Accordingly we argue against the necessity of computing the attention scores in every layer and we propose the Less-Attention Vision Transformer (LaViT) which computes only a few attention operations at each stage and calculates the subsequent feature alignments in other layers via attention transformations that leverage the previously calculated attention scores. This novel approach can mitigate two primary issues plaguing traditional self-attention modules: the heavy computational burden and attention saturation. Our proposed architecture offers superior efficiency and ease of implementation merely requiring matrix multiplications that are highly optimized in contemporary deep learning frameworks. Moreover our architecture demonstrates exceptional performance across various vision tasks including classification detection and segmentation. + + + + Generalizable Novel-View Synthesis using a Stereo Camera + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Generalizable_Novel-View_Synthesis_using_a_Stereo_Camera_CVPR_2024_paper.pdf + In this paper we propose the first generalizable view synthesis approach that specifically targets multi-view stereo-camera images. Since recent stereo matching has demonstrated accurate geometry prediction we introduce stereo matching into novel-view synthesis for high-quality geometry reconstruction. To this end this paper proposes a novel framework dubbed StereoNeRF which integrates stereo matching into a NeRF-based generalizable view synthesis approach. StereoNeRF is equipped with three key components to effectively exploit stereo matching in novel-view synthesis: a stereo feature extractor a depth-guided plane-sweeping and a stereo depth loss. Moreover we propose the StereoNVS dataset the first multi-view dataset of stereo-camera images encompassing a wide variety of both real and synthetic scenes. Our experimental results demonstrate that StereoNeRF surpasses previous approaches in generalizable view synthesis. 
+ + + + Digital Life Project: Autonomous 3D Characters with Social Intelligence + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_Digital_Life_Project_Autonomous_3D_Characters_with_Social_Intelligence_CVPR_2024_paper.pdf + In this work we present Digital Life Project a framework utilizing language as the universal medium to build autonomous 3D characters who are capable of engaging in social interactions and expressing with articulated body motions thereby simulating life in a digital environment. Our framework comprises two primary components: 1) SocioMind: a meticulously crafted digital brain that models personalities with systematic few-shot exemplars incorporates a reflection process based on psychology principles and emulates autonomy by initiating dialogue topics; 2) MoMat-MoGen: a text-driven motion synthesis paradigm for controlling the character's digital body. It integrates motion matching a proven industry technique to ensure motion quality with cutting-edge advancements in motion generation for diversity. Extensive experiments demonstrate that each module achieves state-of-the-art performance in its respective domain. Collectively they enable virtual characters to initiate and sustain dialogues autonomously while evolving their socio-psychological states. Concurrently these characters can perform contextually relevant bodily movements. Additionally an extension of DLP enables a virtual character to recognize and appropriately respond to human players' actions. + + + + Rethinking Prior Information Generation with CLIP for Few-Shot Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Rethinking_Prior_Information_Generation_with_CLIP_for_Few-Shot_Segmentation_CVPR_2024_paper.pdf + Few-shot segmentation remains challenging due to the limitations of its labeling information for unseen classes. Most previous approaches rely on extracting high-level feature maps from the frozen visual encoder to compute the pixel-wise similarity as a key prior guidance for the decoder. However such a prior representation suffers from coarse granularity and poor generalization to new classes since these high-level feature maps have obvious category bias. In this work we propose to replace the visual prior representation with the visual-text alignment capacity to capture more reliable guidance and enhance the model generalization. Specifically we design two kinds of training-free prior information generation strategy that attempts to utilize the semantic alignment capability of the Contrastive Language-Image Pre-training model (CLIP) to locate the target class. Besides to acquire more accurate prior guidance we build a high-order relationship of attention maps and utilize it to refine the initial prior information. Experiments on both the PASCAL-5i and COCO-20i datasets show that our method obtains a clearly substantial improvement and reaches the new state-of-the-art performance. The code is available on the project website. + + + + Generative Rendering: Controllable 4D-Guided Video Generation with 2D Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_Generative_Rendering_Controllable_4D-Guided_Video_Generation_with_2D_Diffusion_Models_CVPR_2024_paper.pdf + Traditional 3D content creation tools empower users to bring their imagination to life by giving them direct control over a scene's geometry appearance motion and camera path. 
Creating computer-generated videos, however, is a tedious manual process, which can be automated by emerging text-to-video diffusion models. Despite great promise, video diffusion models are difficult to control, hindering users from applying their creativity rather than amplifying it. To address this challenge, we present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. For this purpose, our approach takes an animated low-fidelity rendered mesh as input and injects the ground truth correspondence information obtained from the dynamic mesh into various stages of a pre-trained text-to-image generation model to output high-quality and temporally consistent frames. We demonstrate our approach on various examples where motion can be obtained by animating rigged assets or changing the camera path. + + + + Relightable Gaussian Codec Avatars + http://openaccess.thecvf.com//content/CVPR2024/papers/Saito_Relightable_Gaussian_Codec_Avatars_CVPR_2024_paper.pdf + The fidelity of relighting is bounded by both geometry and appearance representations. For geometry, both mesh and volumetric approaches have difficulty modeling intricate structures like 3D hair geometry. For appearance, existing relighting models are limited in fidelity and often too slow to render in real time with high-resolution continuous environments. In this work, we present Relightable Gaussian Codec Avatars, a method to build high-fidelity relightable head avatars that can be animated to generate novel expressions. Our geometry model based on 3D Gaussians can capture 3D-consistent sub-millimeter details such as hair strands and pores on dynamic face sequences. To support diverse materials of human heads, such as the eyes, skin and hair, in a unified manner, we present a novel relightable appearance model based on learnable radiance transfer. Together with global illumination-aware spherical harmonics for the diffuse components, we achieve real-time relighting with all-frequency reflections using spherical Gaussians. This appearance model can be efficiently relit under both point light and continuous illumination. We further improve the fidelity of eye reflections and enable explicit gaze control by introducing relightable explicit eye models. Our method outperforms existing approaches without compromising real-time performance. We also demonstrate real-time relighting of avatars on a tethered consumer VR headset, showcasing the efficiency and fidelity of our avatars. + + + + Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Single-to-Dual-View_Adaptation_for_Egocentric_3D_Hand_Pose_Estimation_CVPR_2024_paper.pdf + The pursuit of accurate 3D hand pose estimation stands as a keystone for understanding human activity in the realm of egocentric vision. The majority of existing estimation methods still rely on single-view images as input, leading to potential limitations, e.g. limited field-of-view and ambiguity in depth. To address these problems, adding another camera to better capture the shape of hands is a practical direction. However, existing multi-view hand pose estimation methods suffer from two main drawbacks: 1) requiring multi-view annotations for training, which are expensive; 2) during testing, the model becomes inapplicable if camera parameters/layout are not the same as those used in training.
In this paper, we propose a novel Single-to-Dual-view adaptation (S2DHand) solution that adapts a pre-trained single-view estimator to dual views. Compared with existing multi-view training methods, 1) our adaptation process is unsupervised, eliminating the need for multi-view annotation; 2) moreover, our method can handle arbitrary dual-view pairs with unknown camera parameters, making the model applicable to diverse camera settings. Specifically, S2DHand is built on certain stereo constraints, including pair-wise cross-view consensus and invariance of transformation between both views. These two stereo constraints are used in a complementary manner to generate pseudo-labels, allowing reliable adaptation. Evaluation results reveal that S2DHand achieves significant improvements on arbitrary camera pairs under both in-dataset and cross-dataset settings, and outperforms existing adaptation methods with leading performance. Project page: https://github.com/ut-vision/S2DHand. + + + + Animate Anyone: Consistent and Controllable Image-to-Video Synthesis for Character Animation + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_Animate_Anyone_Consistent_and_Controllable_Image-to-Video_Synthesis_for_Character_Animation_CVPR_2024_paper.pdf + Character Animation aims to generate character videos from still images through driving signals. Currently, diffusion models have become the mainstream in visual generation research owing to their robust generative capabilities. However, challenges persist in the realm of image-to-video, especially in character animation, where temporally maintaining consistency with detailed information from the character remains a formidable problem. In this paper, we leverage the power of diffusion models and propose a novel framework tailored for character animation. To preserve consistency of intricate appearance features from the reference image, we design ReferenceNet to merge detail features via spatial attention. To ensure controllability and continuity, we introduce an efficient pose guider to direct the character's movements and employ an effective temporal modeling approach to ensure smooth inter-frame transitions between video frames. By expanding the training data, our approach can animate arbitrary characters, yielding superior results in character animation compared to other image-to-video methods. Furthermore, we evaluate our method on image animation benchmarks, achieving state-of-the-art results. + + + + FreeCustom: Tuning-Free Customized Image Generation for Multi-Concept Composition + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_FreeCustom_Tuning-Free_Customized_Image_Generation_for_Multi-Concept_Composition_CVPR_2024_paper.pdf + Benefiting from large-scale pre-trained text-to-image (T2I) generative models, impressive progress has been achieved in customized image generation, which aims to generate user-specified concepts. Existing approaches have extensively focused on single-concept customization and still encounter challenges when it comes to complex scenarios that involve combining multiple concepts. These approaches often require retraining/fine-tuning using a few images, leading to time-consuming training processes and impeding their swift implementation. Furthermore, the reliance on multiple images to represent a singular concept increases the difficulty of customization. To this end, we propose FreeCustom, a novel tuning-free method to generate customized images of multi-concept composition based on reference concepts, using only one image per concept as input.
Specifically, we introduce a new multi-reference self-attention (MRSA) mechanism and a weighted mask strategy that enables the generated image to access and focus more on the reference concepts. In addition, MRSA leverages our key finding that input concepts are better preserved when providing images with context interactions. Experiments show that the images produced by our method are consistent with the given concepts and better aligned with the input text. Our method outperforms or performs on par with other training-based methods in terms of multi-concept composition and single-concept customization, but is simpler. Codes can be found at https://github.com/aim-uofa/FreeCustom. + + + + MaskINT: Video Editing via Interpolative Non-autoregressive Masked Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_MaskINT_Video_Editing_via_Interpolative_Non-autoregressive_Masked_Transformers_CVPR_2024_paper.pdf + Recent advances in generative AI have significantly enhanced image and video editing, particularly in the context of text prompt control. State-of-the-art approaches predominantly rely on diffusion models to accomplish these tasks. However, the computational demands of diffusion-based methods are substantial, often necessitating large-scale paired datasets for training, and therefore challenging to deploy in real applications. To address these issues, this paper breaks down the text-based video editing task into two stages. First, we leverage a pre-trained text-to-image diffusion model to simultaneously edit a few keyframes in a zero-shot way. Second, we introduce an efficient model called MaskINT, which is built on non-autoregressive masked generative transformers and specializes in frame interpolation between the edited keyframes using the structural guidance from intermediate frames. Experimental results suggest that our MaskINT achieves comparable performance with diffusion-based methodologies while significantly improving the inference time. This research offers a practical solution for text-based video editing and showcases the potential of non-autoregressive masked generative transformers in this domain. + + + + Learning Multi-Dimensional Human Preference for Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Learning_Multi-Dimensional_Human_Preference_for_Text-to-Image_Generation_CVPR_2024_paper.pdf + Current metrics for text-to-image models typically rely on statistical metrics, which inadequately represent the real preference of humans. Although recent work attempts to learn these preferences via human-annotated images, it reduces the rich tapestry of human preference to a single overall score. However, the preference results vary when humans evaluate images with respect to different aspects. Therefore, to learn the multi-dimensional human preferences, we propose the Multi-dimensional Preference Score (MPS), the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS introduces a preference condition module upon the CLIP model to learn these diverse preferences. It is trained on our Multi-dimensional Human Preference (MHP) Dataset, which comprises 918,315 human preference choices across four dimensions (i.e. aesthetics, semantic alignment, detail quality and overall assessment) on 607,541 images. The images are generated by a wide range of the latest text-to-image models.
The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions, making it a promising metric for evaluating and improving text-to-image generation. The model and dataset will be made publicly available to facilitate future research. + + + + ViVid-1-to-3: Novel View Synthesis with Video Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Kwak_ViVid-1-to-3_Novel_View_Synthesis_with_Video_Diffusion_Models_CVPR_2024_paper.pdf + Generating novel views of an object from a single image is a challenging task. It requires an understanding of the underlying 3D structure of the object from an image and rendering high-quality, spatially consistent new views. While recent methods for view synthesis based on diffusion have shown great progress, achieving consistency among various view estimates while at the same time abiding by the desired camera pose remains a critical problem yet to be solved. In this work, we demonstrate a strikingly simple method, where we utilize a pre-trained video diffusion model to solve this problem. Our key idea is that synthesizing a novel view could be reformulated as synthesizing a video of a camera going around the object of interest---a scanning video---which then allows us to leverage the powerful priors that a video diffusion model would have learned. Thus, to perform novel-view synthesis, we create a smooth camera trajectory to the target view that we wish to render, and denoise using both a view-conditioned diffusion model and a video diffusion model. By doing so, we obtain highly consistent novel view synthesis, outperforming the state of the art. + + + + Generating Human Motion in 3D Scenes from Text Descriptions + http://openaccess.thecvf.com//content/CVPR2024/papers/Cen_Generating_Human_Motion_in_3D_Scenes_from_Text_Descriptions_CVPR_2024_paper.pdf + Generating human motions from textual descriptions has gained growing research interest due to its wide range of applications. However, only a few works consider human-scene interactions together with text conditions, which is crucial for visual and physical realism. This paper focuses on the task of generating human motions in 3D indoor scenes given text descriptions of the human-scene interactions. This task presents challenges due to the multimodal nature of text, scene and motion, as well as the need for spatial reasoning. To address these challenges, we propose a new approach that decomposes the complex problem into two more manageable sub-problems: (1) language grounding of the target object and (2) object-centric motion generation. For language grounding of the target object, we leverage the power of large language models. For motion generation, we design an object-centric scene representation for the generative model to focus on the target object, thereby reducing the scene complexity and facilitating the modeling of the relationship between human motions and the object. Experiments demonstrate the better motion quality of our approach compared to baselines and validate our design choices. Code will be available at https://zju3dv.github.io/text_scene_motion.
+ + + + QDFormer: Towards Robust Audiovisual Segmentation in Complex Environments with Quantization-based Semantic Decomposition + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_QDFormer_Towards_Robust_Audiovisual_Segmentation_in_Complex_Environments_with_Quantization-based_CVPR_2024_paper.pdf + Audiovisual segmentation (AVS) is a challenging task that aims to segment visual objects in videos according to their associated acoustic cues. With multiple sound sources and background disturbances involved establishing robust correspondences between audio and visual contents poses unique challenges due to (1) complex entanglement across sound sources and (2) frequent changes in the occurrence of distinct sound events. Assuming sound events occur independently the multi-source semantic space can be represented as the Cartesian product of single-source sub-spaces. We are motivated to decompose the multi-source audio semantics into single-source semantics for more effective interactions with visual content. We propose a semantic decomposition method based on product quantization where the multi-source semantics can be decomposed and represented by several disentangled and noise-suppressed single-source semantics. Furthermore we introduce a global-to-local quantization mechanism which distills knowledge from stable global (clip-level) features into local (frame-level) ones to handle frequent changes in audio semantics. Extensive experiments demonstrate that our semantically decomposed audio representation significantly improves AVS performance eg +21.2% mIoU on the challenging AVS-Semantic benchmark with ResNet50 backbone. + + + + Fast Adaptation for Human Pose Estimation via Meta-Optimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_Fast_Adaptation_for_Human_Pose_Estimation_via_Meta-Optimization_CVPR_2024_paper.pdf + Domain shift is a challenge for supervised human pose estimation where the source data and target data come from different distributions. This is why pose estimation methods generally perform worse on the test set than on the training set. Recently test-time adaptation has proven to be an effective way to deal with domain shift in human pose estimation. Although the performance on the target domain has been improved existing methods require a large number of weight updates for convergence which is time-consuming and brings catastrophic forgetting. To solve these issues we propose a meta-auxiliary learning method to achieve fast adaptation for domain shift during inference. Specifically we take human pose estimation as the supervised primary task and propose body-specific image inpainting as a self-supervised auxiliary task. First we jointly train the primary and auxiliary tasks to get a pre-trained model on the source domain. Then meta-training correlates the performance of the two tasks to learn a good weight initialization. Finally meta-testing adapts the meta-learned model to the target data through self-supervised learning. Benefiting from the meta-learning paradigm the proposed method enables fast adaptation to the target domain while preserving the source domain knowledge. The carefully designed auxiliary task better pays attention to human-related semantics in a single image. Extensive experiments demonstrate the effectiveness of our test-time fast adaptation. 
+ + + + WOUAF: Weight Modulation for User Attribution and Fingerprinting in Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_WOUAF_Weight_Modulation_for_User_Attribution_and_Fingerprinting_in_Text-to-Image_CVPR_2024_paper.pdf + The rapid advancement of generative models facilitating the creation of hyper-realistic images from textual descriptions has concurrently escalated critical societal concerns such as misinformation. Although providing some mitigation traditional fingerprinting mechanisms fall short in attributing responsibility for the malicious use of synthetic images. This paper introduces a novel approach to model fingerprinting that assigns responsibility for the generated images thereby serving as a potential countermeasure to model misuse. Our method modifies generative models based on each user's unique digital fingerprint imprinting a unique identifier onto the resultant content that can be traced back to the user. This approach incorporating fine-tuning into Text-to-Image (T2I) tasks using the Stable Diffusion Model demonstrates near-perfect attribution accuracy with a minimal impact on output quality. Through extensive evaluation we show that our method outperforms baseline methods with an average improvement of 11% in handling image post-processes. Our method presents a promising and novel avenue for accountable model distribution and responsible use. Our code is available in https://github.com/kylemin/WOUAF. + + + + Text-Conditioned Generative Model of 3D Strand-based Human Hairstyles + http://openaccess.thecvf.com//content/CVPR2024/papers/Sklyarova_Text-Conditioned_Generative_Model_of_3D_Strand-based_Human_Hairstyles_CVPR_2024_paper.pdf + We present HAAR a new strand-based generative model for 3D human hairstyles. Specifically based on textual inputs HAAR produces 3D hairstyles that could be used as production-level assets in modern computer graphics engines. Current AI-based generative models take advantage of powerful 2D priors to reconstruct 3D content in the form of point clouds meshes or volumetric functions. However by using the 2D priors they are intrinsically limited to only recovering the visual parts. Highly occluded hair structures can not be reconstructed with those methods and they only model the "outer shell" which is not ready to be used in physics-based rendering or simulation pipelines. In contrast we propose a first text-guided generative method that uses 3D hair strands as an underlying representation. Leveraging 2D visual question-answering (VQA) systems we automatically annotate synthetic hair models that are generated from a small set of artist-created hairstyles. This allows us to train a latent diffusion model that operates in a common hairstyle UV space. In qualitative and quantitative studies we demonstrate the capabilities of the proposed model and compare it to existing hairstyle generation approaches. For results please refer to our project page https://haar.is.tue.mpg.de/. + + + + Skeleton-in-Context: Unified Skeleton Sequence Modeling with In-Context Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Skeleton-in-Context_Unified_Skeleton_Sequence_Modeling_with_In-Context_Learning_CVPR_2024_paper.pdf + In-context learning provides a new perspective for multi-task modeling for vision and NLP. Under this setting the model can perceive tasks from prompts and accomplish them without any extra task-specific head predictions or model fine-tuning. 
However skeleton sequence modeling via in-context learning remains unexplored. Directly applying existing in-context models from other areas onto skeleton sequences fails due to the similarity between inter-frame and cross-task poses which makes it exceptionally hard to perceive the task correctly from a subtle context. To address this challenge we propose Skeleton-in-Context (SiC) an effective framework for in-context skeleton sequence modeling. Our SiC is able to handle multiple skeleton-based tasks simultaneously after a single training process and accomplish each task from context according to the given prompt. It can further generalize to new unseen tasks according to customized prompts. To facilitate context perception we additionally propose a task-unified prompt which adaptively learns tasks of different natures such as partial joint-level generation sequence-level prediction or 2D-to-3D motion prediction. We conduct extensive experiments to evaluate the effectiveness of our SiC on multiple tasks including motion prediction pose estimation joint completion and future pose estimation. We also evaluate its generalization capability on unseen tasks such as motion-in-between. These experiments show that our model achieves state-of-the-art multi-task performance and even outperforms single-task methods on certain tasks. + + + + DemoFusion: Democratising High-Resolution Image Generation With No $$$ http://openaccess.thecvf.com//content/CVPR2024/papers/Du_DemoFusion_Democratising_High-Resolution_Image_Generation_With_No__CVPR_2024_paper.pdf High-resolution image generation with Generative Artificial Intelligence (GenAI) has immense potential but due to the enormous capital investment required for training it is increasingly centralised to a few large corporations and hidden behind paywalls. This paper aims to democratise high-resolution GenAI by advancing the frontier of high-resolution generation while remaining accessible to a broad audience. We demonstrate that existing Latent Diffusion Models (LDMs) possess untapped potential for higher-resolution image generation. Our novel DemoFusion framework seamlessly extends open-source GenAI models employing Progressive Upscaling Skip Residual and Dilated Sampling mechanisms to achieve higher-resolution image generation. The progressive nature of DemoFusion requires more passes but the intermediate results can serve as "previews" facilitating rapid prompt iteration. + + + + Total Selfie: Generating Full-Body Selfies http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Total_Selfie_Generating_Full-Body_Selfies_CVPR_2024_paper.pdf We present a method to generate full-body selfies from photographs originally taken at arm's length. Because self-captured photos are typically taken close up they have limited field of view and exaggerated perspective that distorts facial shapes. We instead seek to generate the photo someone else would take of you from a few feet away. Our approach takes as input four selfies of your face and body, a background image, and generates a full-body selfie in a desired target pose. We introduce a novel diffusion-based approach to combine all of this information into high-quality well-composed photos of you with the desired pose and background.
+ + + + Learning Structure-from-Motion with Graph Attention Networks http://openaccess.thecvf.com//content/CVPR2024/papers/Brynte_Learning_Structure-from-Motion_with_Graph_Attention_Networks_CVPR_2024_paper.pdf In this paper we tackle the problem of learning Structure-from-Motion (SfM) through the use of graph attention networks. SfM is a classic computer vision problem that is solved through iterative minimization of reprojection errors referred to as Bundle Adjustment (BA) starting from a good initialization. In order to obtain a good enough initialization for BA conventional methods rely on a sequence of sub-problems (such as pairwise pose estimation pose averaging or triangulation) which provide an initial solution that can then be refined using BA. In this work we replace these sub-problems by learning a model that takes as input the 2D keypoints detected across multiple views and outputs the corresponding camera poses and 3D keypoint coordinates. Our model takes advantage of graph neural networks to learn SfM-specific primitives and we show that it can be used for fast inference of the reconstruction for new and unseen sequences. The experimental results show that the proposed model outperforms competing learning-based methods and challenges COLMAP while having lower runtime. Our code is available at: https://github.com/lucasbrynte/gasfm/. + + + + Geometry Transfer for Stylizing Radiance Fields http://openaccess.thecvf.com//content/CVPR2024/papers/Jung_Geometry_Transfer_for_Stylizing_Radiance_Fields_CVPR_2024_paper.pdf Shape and geometric patterns are essential in defining stylistic identity. However current 3D style transfer methods predominantly focus on transferring colors and textures often overlooking geometric aspects. In this paper we introduce Geometry Transfer a novel method that leverages geometric deformation for 3D style transfer. This technique employs depth maps to extract a style guide subsequently applied to stylize the geometry of radiance fields. Moreover we propose new techniques that utilize geometric cues from the 3D scene thereby enhancing aesthetic expressiveness and more accurately reflecting intended styles. Our extensive experiments show that Geometry Transfer enables a broader and more expressive range of stylizations thereby significantly expanding the scope of 3D style transfer. + + + + Holoported Characters: Real-time Free-viewpoint Rendering of Humans from Sparse RGB Cameras http://openaccess.thecvf.com//content/CVPR2024/papers/Shetty_Holoported_Characters_Real-time_Free-viewpoint_Rendering_of_Humans_from_Sparse_RGB_CVPR_2024_paper.pdf We present the first approach to render highly realistic free-viewpoint videos of a human actor in general apparel from sparse multi-view recordings to display in real-time at an unprecedented 4K resolution. At inference our method only requires four camera views of the moving actor and the respective 3D skeletal pose. It handles actors in wide clothing and reproduces even fine-scale dynamic detail e.g. clothing wrinkles face expressions and hand gestures. At training time our learning-based approach expects dense multi-view video and a rigged static surface scan of the actor. Our method comprises three main stages. Stage 1 is a skeleton-driven neural approach for high-quality capture of the detailed dynamic mesh geometry. Stage 2 is a novel solution to create a view-dependent texture using four test-time camera views as input.
Finally stage 3 comprises a new image-based refinement network rendering the final 4K image given the output from the previous stages. Our approach establishes a new benchmark for real-time rendering resolution and quality using sparse input camera views unlocking possibilities for immersive telepresence. + + + + SEAS: ShapE-Aligned Supervision for Person Re-Identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_SEAS_ShapE-Aligned_Supervision_for_Person_Re-Identification_CVPR_2024_paper.pdf + We introduce SEAS using ShapE-Aligned Supervision to enhance appearance-based person re-identification. When recognizing an individual's identity existing methods primarily rely on appearance which can be influenced by the background environment due to a lack of body shape awareness. Although some methods attempt to incorporate other modalities such as gait or body shape they encode the additional modality separately resulting in extra computational costs and lacking an inherent connection with appearance. In this paper we explore the use of implicit 3-D body shape representations as pixel-level guidance to augment the extraction of identity features with body shape knowledge in addition to appearance. Using body shape as supervision rather than as input provides shape-aware enhancements without any increase in computational cost and delivers coherent integration with pixel-wise appearance features. Moreover for video-based person re-identification we align pixel-level features across frames with shape awareness to ensure temporal consistency. Our results demonstrate that incorporating body shape as pixel-level supervision reduces rank-1 errors by 1.4% for frame-based and by 2.5% for video-based re-identification tasks respectively and can also be generalized to other existing appearance-based person re-identification methods. + + + + Making Vision Transformers Truly Shift-Equivariant + http://openaccess.thecvf.com//content/CVPR2024/papers/Rojas-Gomez_Making_Vision_Transformers_Truly_Shift-Equivariant_CVPR_2024_paper.pdf + In the field of computer vision Vision Transformers (ViTs) have emerged as a prominent deep learning architecture. Despite being inspired by Convolutional Neural Networks (CNNs) ViTs are susceptible to small spatial shifts in the input data - they lack shift-equivariance. To address this shortcoming we introduce novel data-adaptive designs for each of the ViT modules that break shift-equivariance such as tokenization self-attention patch merging and positional encoding. With our proposed modules we achieve perfect circular shift-equivariance across four prominent ViT architectures: Swin SwinV2 CvT and MViTv2. Additionally we leverage our design to further enhance consistency under standard shifts. We evaluate our adaptive ViT models on image classification and semantic segmentation tasks. Our models achieve competitive performance across three diverse datasets showcasing perfect (100%) circular shift consistency while improving standard shift consistency. + + + + SpikeNeRF: Learning Neural Radiance Fields from Continuous Spike Stream + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_SpikeNeRF_Learning_Neural_Radiance_Fields_from_Continuous_Spike_Stream_CVPR_2024_paper.pdf + Spike cameras leveraging spike-based integration sampling and high temporal resolution offer distinct advantages over standard cameras. However existing approaches reliant on spike cameras often assume optimal illumination a condition frequently unmet in real-world scenarios. 
To address this we introduce SpikeNeRF the first work that derives a NeRF-based volumetric scene representation from spike camera data. Our approach leverages NeRF's multi-view consistency to establish robust self-supervision effectively eliminating erroneous measurements and uncovering coherent structures within exceedingly noisy input amidst diverse real-world illumination scenarios. The framework comprises two core elements: a spike generation model incorporating an integrate-and-fire neuron layer and parameters accounting for non-idealities such as threshold variation and a spike rendering loss capable of generalizing across varying illumination conditions. We describe how to effectively optimize neural radiance fields to render photorealistic novel views from the novel continuous spike stream demonstrating advantages over other vision sensors in certain scenes. Empirical evaluations conducted on both real and novel realistically simulated sequences affirm the efficacy of our methodology. The dataset and source code are released at https://github.com/BIT-Vision/SpikeNeRF. + + + + A Semi-supervised Nighttime Dehazing Baseline with Spatial-Frequency Aware and Realistic Brightness Constraint + http://openaccess.thecvf.com//content/CVPR2024/papers/Cong_A_Semi-supervised_Nighttime_Dehazing_Baseline_with_Spatial-Frequency_Aware_and_Realistic_CVPR_2024_paper.pdf + Existing research based on deep learning has extensively explored the problem of daytime image dehazing. However few studies have considered the characteristics of nighttime hazy scenes. There are two distinctions between nighttime and daytime haze. First there may be multiple active colored light sources with lower illumination intensity in nighttime scenes which may cause haze glow and noise with localized coupled and frequency inconsistent characteristics. Second due to the domain discrepancy between simulated and real-world data unrealistic brightness may occur when applying a dehazing model trained on simulated data to real-world data. To address the above two issues we propose a semi-supervised model for real-world nighttime dehazing. First the spatial attention and frequency spectrum filtering are implemented as a spatial-frequency domain information interaction module to handle the first issue. Second a pseudo-label-based retraining strategy and a local window-based brightness loss for semi-supervised training process is designed to suppress haze and glow while achieving realistic brightness. Experiments on public benchmarks validate the effectiveness of the proposed method and its superiority over state-of-the-art methods. The source code and Supplementary Materials are placed in the https://github.com/Xiaofeng-life/SFSNiD. + + + + Deep Equilibrium Diffusion Restoration with Parallel Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Deep_Equilibrium_Diffusion_Restoration_with_Parallel_Sampling_CVPR_2024_paper.pdf + Diffusion model-based image restoration (IR) aims to use diffusion models to recover high-quality (HQ) images from degraded images achieving promising performance. Due to the inherent property of diffusion models most existing methods need long serial sampling chains to restore HQ images step-by-step resulting in expensive sampling time and high computation costs. Moreover such long sampling chains hinder understanding the relationship between inputs and restoration results since it is hard to compute the gradients in the whole chains. 
In this work we aim to rethink diffusion model-based IR models from a different perspective i.e. a deep equilibrium (DEQ) fixed point system called DeqIR. Specifically we derive an analytical solution by modeling the entire sampling chain in these IR models as a joint multivariate fixed point system. Based on the analytical solution we can conduct parallel sampling and restore HQ images without training. Furthermore we compute fast gradients via DEQ inversion and find that initialization optimization can boost image quality and control the generation direction. Extensive experiments on benchmarks demonstrate the effectiveness of our method on typical IR tasks and real-world settings. + + + + Gaussian Shell Maps for Efficient 3D Human Generation http://openaccess.thecvf.com//content/CVPR2024/papers/Abdal_Gaussian_Shell_Maps_for_Efficient_3D_Human_Generation_CVPR_2024_paper.pdf Efficient generation of 3D digital humans is important in several industries including virtual reality social media and cinematic production. 3D generative adversarial networks (GANs) have demonstrated state-of-the-art (SOTA) quality and diversity for generated assets. Current 3D GAN architectures however typically rely on volume representations which are slow to render thereby hampering the GAN training and requiring multi-view-inconsistent 2D upsamplers. Here we introduce Gaussian Shell Maps (GSMs) as a framework that connects SOTA generator network architectures with emerging 3D Gaussian rendering primitives using an articulable multi-shell-based scaffold. In this setting a CNN generates a 3D texture stack with features that are mapped to the shells. The latter represent inflated and deflated versions of a template surface of a digital human in a canonical body pose. Instead of rasterizing the shells directly we sample 3D Gaussians on the shells whose attributes are encoded in the texture features. These Gaussians are efficiently and differentiably rendered. The ability to articulate the shells is important during GAN training and at inference time to deform a body into arbitrary user-defined poses. Our efficient rendering scheme bypasses the need for view-inconsistent upsamplers and achieves high-quality multi-view consistent renderings at a native resolution of 512 x 512 pixels. We demonstrate that GSMs successfully generate 3D humans when trained on single-view datasets including SHHQ and DeepFashion. + + + + MoST: Motion Style Transformer Between Diverse Action Contents http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_MoST_Motion_Style_Transformer_Between_Diverse_Action_Contents_CVPR_2024_paper.pdf While existing motion style transfer methods are effective between two motions with identical content their performance significantly diminishes when transferring style between motions with different contents. This challenge lies in the lack of clear separation between content and style of a motion. To tackle this challenge we propose a novel motion style transformer that effectively disentangles style from content and generates a plausible motion with transferred style from a source motion. Our distinctive approach to achieving the goal of disentanglement is twofold: (1) a new architecture for motion style transformer with 'part-attentive style modulator across body parts' and 'Siamese encoders that encode style and content features separately'; (2) style disentanglement loss.
Our method outperforms existing methods and demonstrates exceptionally high quality particularly in motion pairs with different contents without the need for heuristic post-processing. Codes are available at https://github.com/Boeun-Kim/MoST. + + + + Prompting Hard or Hardly Prompting: Prompt Inversion for Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Mahajan_Prompting_Hard_or_Hardly_Prompting_Prompt_Inversion_for_Text-to-Image_Diffusion_CVPR_2024_paper.pdf + The quality of the prompts provided to text-to-image diffusion models determines how faithful the generated content is to the user's intent often requiring `prompt engineering'. To harness visual concepts from target images without prompt engineering current approaches largely rely on embedding inversion by optimizing and then mapping them to pseudo-tokens. However working with such high-dimensional vector representations is challenging because they lack semantics and interpretability and only allow simple vector operations when using them. Instead this work focuses on inverting the diffusion model to obtain interpretable language prompts directly. The challenge of doing this lies in the fact that the resulting optimization problem is fundamentally discrete and the space of prompts is exponentially large; this makes using standard optimization techniques such as stochastic gradient descent difficult. To this end we utilize a delayed projection scheme to optimize for prompts representative of the vocabulary space in the model. Further we leverage the findings that different timesteps of the diffusion process cater to different levels of detail in an image. The later noisy timesteps of the forward diffusion process correspond to the semantic information and therefore prompt inversion in this range provides tokens representative of the image semantics. We show that our approach can identify semantically interpretable and meaningful prompts for a target image which can be used to synthesize diverse images with similar content. We further illustrate the application of the optimized prompts in evolutionary image generation and concept removal. + + + + Unmixing Before Fusion: A Generalized Paradigm for Multi-Source-based Hyperspectral Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Unmixing_Before_Fusion_A_Generalized_Paradigm_for_Multi-Source-based_Hyperspectral_Image_CVPR_2024_paper.pdf + In the realm of AI data serves as a pivotal resource. Real-world hyperspectral images (HSIs) bearing wide spectral characteristics are particularly valuable. However the acquisition of HSIs is always costly and time-intensive resulting in a severe data-thirsty issue in HSI research and applications. Current solutions have not been able to generate a sufficient volume of diverse and reliable synthetic HSIs. To this end our study formulates a novel generalized paradigm for HSI synthesis i.e. unmixing before fusion that initiates with unmixing across multi-source data and follows by fusion-based synthesis. By integrating unmixing this work maps unpaired HSI and RGB data to a low-dimensional abundance space greatly alleviating the difficulty of generating high-dimensional samples. Moreover incorporating abundances inferred from unpaired RGB images into generative models allows for cost-effective supplementation of various realistic spatial distributions in abundance synthesis. 
Our proposed paradigm can be instrumental with a series of deep generative models filling a significant gap in the field and enabling the generation of vast high-quality HSI samples for large-scale downstream tasks. Extensive experiments on downstream tasks demonstrate the effectiveness of synthesized HSIs. The code is available at: HSI-Synthesis.github.io. + + + + CoDi: Conditional Diffusion Distillation for Higher-Fidelity and Faster Image Generation http://openaccess.thecvf.com//content/CVPR2024/papers/Mei_CoDi_Conditional_Diffusion_Distillation_for_Higher-Fidelity_and_Faster_Image_Generation_CVPR_2024_paper.pdf Large generative diffusion models have revolutionized text-to-image generation and offer immense potential for conditional generation tasks such as image enhancement restoration editing and compositing. However their widespread adoption is hindered by the high computational cost which limits their real-time application. To address this challenge we introduce a novel method dubbed CoDi that adapts a pre-trained latent diffusion model to accept additional image conditioning inputs while significantly reducing the sampling steps required to achieve high-quality results. Our method can leverage architectures such as ControlNet to incorporate conditioning inputs without compromising the model's prior knowledge gained during large scale pre-training. Additionally a conditional consistency loss enforces consistent predictions across diffusion steps effectively compelling the model to generate high-quality images with conditions in a few steps. Our conditional-task learning and distillation approach outperforms previous distillation methods achieving a new state-of-the-art in producing high-quality images with very few steps (e.g. 1-4) across multiple tasks including super-resolution text-guided image editing and depth-to-image generation. + + + + X-Adapter: Adding Universal Compatibility of Plugins for Upgraded Diffusion Model http://openaccess.thecvf.com//content/CVPR2024/papers/Ran_X-Adapter_Adding_Universal_Compatibility_of_Plugins_for_Upgraded_Diffusion_Model_CVPR_2024_paper.pdf We introduce X-Adapter a universal upgrader to enable the pretrained plug-and-play modules (e.g. ControlNet LoRA) to work directly with the upgraded text-to-image diffusion model (e.g. SDXL) without further retraining. We achieve this goal by training an additional network to control the frozen upgraded model with the new text-image data pairs. In detail X-Adapter keeps a frozen copy of the old model to preserve the connectors of different plugins. Additionally X-Adapter adds trainable mapping layers that bridge the decoders from models of different versions for feature remapping. The remapped features will be used as guidance for the upgraded model. To enhance the guidance ability of X-Adapter we employ a -text training strategy for the upgraded model. After training we also introduce a two-stage denoising strategy to align the initial latents of X-Adapter and the upgraded model. Thanks to our strategies X-Adapter demonstrates universal compatibility with various plugins and also enables plugins of different versions to work together thereby expanding the functionalities of the diffusion community. To verify the effectiveness of the proposed method we conduct extensive experiments and the results show that X-Adapter may facilitate wider application in the upgraded foundational diffusion model. Project page at: https://showlab.github.io/X-Adapter.
+ + + + CADTalk: An Algorithm and Benchmark for Semantic Commenting of CAD Programs + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_CADTalk_An_Algorithm_and_Benchmark_for_Semantic_Commenting_of_CAD_CVPR_2024_paper.pdf + CAD programs are a popular way to compactly encode shapes as a sequence of operations that are easy to parametrically modify. However without sufficient semantic comments and structure such programs can be challenging to understand let alone modify. We introduce the problem of semantic commenting CAD programs wherein the goal is to segment the input program into code blocks corresponding to semantically meaningful shape parts and assign a semantic label to each block. We solve the problem by combining program parsing with visual-semantic analysis afforded by recent advances in foundational language and vision models. Specifically by executing the input programs we create shapes which we use to generate conditional photorealistic images to make use of semantic annotators for such images. We then distill the information across the images and link back to the original programs to semantically comment on them. Additionally we collected and annotated a benchmark dataset CADTalk consisting of 5288 machine-made programs and 45 human-made programs with ground truth semantic comments. We extensively evaluated our approach compared it to a GPT-based baseline and an open-set shape segmentation baseline and reported an 83.24% accuracy on the new CADTalk dataset. Code and data: https://enigma-li.github.io/CADTalk/. + + + + Inversion-Free Image Editing with Language-Guided Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Inversion-Free_Image_Editing_with_Language-Guided_Diffusion_Models_CVPR_2024_paper.pdf + Despite recent advances in inversion-based editing text-guided image manipulation remains challenging for diffusion models. The primary bottlenecks include 1) the time-consuming nature of the inversion process; 2) the struggle to balance consistency with accuracy; 3) the lack of compatibility with efficient consistency sampling methods used in consistency models. To address the above issues we start by asking ourselves if the inversion process can be eliminated for editing. We show that when the initial sample is known a special variance schedule reduces the denoising step to the same form as the multi-step consistency sampling. We name this Denoising Diffusion Consistent Model (DDCM) and note that it implies a virtual inversion strategy without explicit inversion in sampling. We further unify the attention control mechanisms in a tuning-free framework for text-guided editing. Combining them we present inversion-free editing (InfEdit) which allows for consistent and faithful editing for both rigid and non-rigid semantic changes catering to intricate modifications without compromising on the image's integrity and explicit inversion. Through extensive experiments InfEdit shows strong performance in various editing tasks and also maintains a seamless workflow (less than 3 seconds on one single A40) demonstrating the potential for real-time applications. + + + + HumMUSS: Human Motion Understanding using State Space Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Mondal_HumMUSS_Human_Motion_Understanding_using_State_Space_Models_CVPR_2024_paper.pdf + Understanding human motion from video is essential for a range of applications including pose estimation mesh recovery and action recognition. 
While state-of-the-art methods predominantly rely on transformer-based architectures these approaches have limitations in practical scenarios. Transformers are slower when sequentially predicting on a continuous stream of frames in real-time and do not generalize to new frame rates. In light of these constraints we propose a novel attention-free spatiotemporal model for human motion understanding building upon recent advancements in state space models. Our model not only matches the performance of transformer-based models in various motion understanding tasks but also brings added benefits like adaptability to different video frame rates and enhanced training speed when working with longer sequence of keypoints. Moreover the proposed model supports both offline and real-time applications. For real-time sequential prediction our model is both memory efficient and several times faster than transformer-based approaches while maintaining their high accuracy. + + + + Drag Your Noise: Interactive Point-based Editing via Diffusion Semantic Propagation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Drag_Your_Noise_Interactive_Point-based_Editing_via_Diffusion_Semantic_Propagation_CVPR_2024_paper.pdf + Point-based interactive editing serves as an essential tool to complement the controllability of existing generative models. A concurrent work DragDiffusion updates the diffusion latent map in response to user inputs causing global latent map alterations. This results in imprecise preservation of the original content and unsuccessful editing due to gradient vanishing. In contrast we present DragNoise offering robust and accelerated editing without retracing the latent map. The core rationale of DragNoise lies in utilizing the predicted noise output of each U-Net as a semantic editor. This approach is grounded in two critical observations: firstly the bottleneck features of U-Net inherently possess semantically rich features ideal for interactive editing; secondly high-level semantics established early in the denoising process show minimal variation in subsequent stages. Leveraging these insights DragNoise edits diffusion semantics in a single denoising step and efficiently propagates these changes ensuring stability and efficiency in diffusion editing. Comparative experiments reveal that DragNoise achieves superior control and semantic retention reducing the optimization time by over 50% compared to DragDiffusion. Our codes are available at https://github.com/haofengl/DragNoise. + + + + ContextSeg: Sketch Semantic Segmentation by Querying the Context with Attention + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_ContextSeg_Sketch_Semantic_Segmentation_by_Querying_the_Context_with_Attention_CVPR_2024_paper.pdf + Sketch semantic segmentation is a well-explored and pivotal problem in computer vision involving the assignment of predefined part labels to individual strokes. This paper presents ContextSeg - a simple yet highly effective approach to tackling this problem with two stages. In the first stage to better encode the shape and positional information of strokes we propose to predict an extra dense distance field in an autoencoder network to reinforce structural information learning. In the second stage we treat an entire stroke as a single entity and label a group of strokes within the same semantic part using an autoregressive Transformer with the default attention mechanism. 
By group-based labeling our method can fully leverage the context information when making decisions for the remaining groups of strokes. Our method achieves the best segmentation accuracy compared with state-of-the-art approaches on two representative datasets and has been extensively evaluated demonstrating its superior performance. Additionally we offer insights into solving part imbalance in training data and the preliminary experiment on cross-category training which can inspire future research in this field. + + + + Taming the Tail in Class-Conditional GANs: Knowledge Sharing via Unconditional Training at Lower Resolutions + http://openaccess.thecvf.com//content/CVPR2024/papers/Khorram_Taming_the_Tail_in_Class-Conditional_GANs_Knowledge_Sharing_via_Unconditional_CVPR_2024_paper.pdf + Despite extensive research on training generative adversarial networks (GANs) with limited training data learning to generate images from long-tailed training distributions remains fairly unexplored. In the presence of imbalanced multi-class training data GANs tend to favor classes with more samples leading to the generation of low quality and less diverse samples in tail classes. In this study we aim to improve the training of class-conditional GANs with long-tailed data. We propose a straightforward yet effective method for knowledge sharing allowing tail classes to borrow from the rich information from classes with more abundant training data. More concretely we propose modifications to existing class-conditional GAN architectures to ensure that the lower-resolution layers of the generator are trained entirely unconditionally while reserving class-conditional generation for the higher-resolution layers. Experiments on several long-tail benchmarks and GAN architectures demonstrate a significant improvement over existing methods in both the diversity and fidelity of the generated images. The code is available at https://github.com/khorrams/utlo. + + + + VideoSwap: Customized Video Subject Swapping with Interactive Semantic Point Correspondence + http://openaccess.thecvf.com//content/CVPR2024/papers/Gu_VideoSwap_Customized_Video_Subject_Swapping_with_Interactive_Semantic_Point_Correspondence_CVPR_2024_paper.pdf + Current diffusion-based video editing primarily focuses on structure-preserved editing by utilizing various dense correspondences to ensure temporal consistency and motion alignment. However these approaches are often ineffective when the target edit involves a shape change. To embark on video editing with shape change we explore customized video subject swapping in this work where we aim to replace the main subject in a source video with a target subject having a distinct identity and potentially different shape. In contrast to previous methods that rely on dense correspondences we introduce the VideoSwap framework that exploits semantic point correspondences inspired by our observation that only a small number of semantic points are necessary to align the subject's motion trajectory and modify its shape. We also introduce various user-point interactions (e.g. removing points and dragging points) to address various semantic point correspondence. Extensive experiments demonstrate state-of-the-art video subject swapping results across a variety of real-world videos. 
+ + + + Hierarchical Histogram Threshold Segmentation - Auto-terminating High-detail Oversegmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chang_Hierarchical_Histogram_Threshold_Segmentation_-_Auto-terminating_High-detail_Oversegmentation_CVPR_2024_paper.pdf + Superpixels play a crucial role in image processing by partitioning an image into clusters of pixels with similar visual attributes. This facilitates subsequent image processing tasks offering computational advantages over the manipulation of individual pixels. While numerous oversegmentation techniques have emerged in recent years many rely on predefined initialization and termination criteria. In this paper a novel top-down superpixel segmentation algorithm called Hierarchical Histogram Threshold Segmentation (HHTS) is introduced. It eliminates the need for initialization and implements auto-termination outperforming state-of-the-art methods w.r.t boundary recall. This is achieved by iteratively partitioning individual pixel segments into foreground and background and applying intensity thresholding across multiple color channels. The underlying iterative process constructs a superpixel hierarchy that adapts to local detail distributions until color information exhaustion. Experimental results demonstrate the superiority of the proposed approach in terms of boundary adherence while maintaining competitive runtime performance on the BSDS500 and NYUV2 datasets. Furthermore an application of HHTS in refining machine learning-based semantic segmentation masks produced by the Segment Anything Foundation Model (SAM) is presented. + + + + Once for Both: Single Stage of Importance and Sparsity Search for Vision Transformer Compression + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_Once_for_Both_Single_Stage_of_Importance_and_Sparsity_Search_CVPR_2024_paper.pdf + Recent Vision Transformer Compression (VTC) works mainly follow a two-stage scheme where the importance score of each model unit is first evaluated or preset in each submodule followed by the sparsity score evaluation according to the target sparsity constraint. Such a separate evaluation process induces the gap between importance and sparsity score distributions thus causing high search costs for VTC. In this work for the first time we investigate how to integrate the evaluations of importance and sparsity scores into a single stage searching the optimal subnets in an efficient manner. Specifically we present OFB a cost-efficient approach that simultaneously evaluates both importance and sparsity scores termed Once for Both (OFB) for VTC. First a bi-mask scheme is developed by entangling the importance score and the differentiable sparsity score to jointly determine the pruning potential (prunability) of each unit. Such a bi-mask search strategy is further used together with a proposed adaptive one-hot loss to realize the progressive-and-efficient search for the most important subnet. Finally Progressive Masked Image Modeling (PMIM) is proposed to regularize the feature space to be more representative during the search process which may be degraded by the dimension reduction. Extensive experiments demonstrate that OFB can achieve superior compression performance over state-of-the-art searching-based and pruning-based methods under various Vision Transformer architectures meanwhile promoting search efficiency significantly e.g. costing one GPU search day for the compression of DeiT-S on ImageNet-1K. 
+ + + + As-Plausible-As-Possible: Plausibility-Aware Mesh Deformation Using 2D Diffusion Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Yoo_As-Plausible-As-Possible_Plausibility-Aware_Mesh_Deformation_Using_2D_Diffusion_Priors_CVPR_2024_paper.pdf + We present As-Plausible-as-Possible (APAP) mesh deformation technique that leverages 2D diffusion priors to preserve the plausibility of a mesh under user-controlled deformation. Our framework uses per-face Jacobians to represent mesh deformations where mesh vertex coordinates are computed via a differentiable Poisson Solve. The deformed mesh is rendered and the resulting 2D image is used in the Score Distillation Sampling (SDS) process which enables extracting meaningful plausibility priors from a pretrained 2D diffusion model. To better preserve the identity of the edited mesh we fine-tune our 2D diffusion model with LoRA. Gradients extracted by SDS and a user-prescribed handle displacement are then backpropagated to the per-face Jacobians and we use iterative gradient descent to compute the final deformation that balances between the user edit and the output plausibility. We evaluate our method with 2D and 3D meshes and demonstrate qualitative and quantitative improvements when using plausibility priors over geometry-preservation or distortion-minimization priors used by previous techniques. Our project page is at: https://as-plausible-aspossible.github.io/ + + + + ECLIPSE: Efficient Continual Learning in Panoptic Segmentation with Visual Prompt Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_ECLIPSE_Efficient_Continual_Learning_in_Panoptic_Segmentation_with_Visual_Prompt_CVPR_2024_paper.pdf + Panoptic segmentation combining semantic and instance segmentation stands as a cutting-edge computer vision task. Despite recent progress with deep learning models the dynamic nature of real-world applications necessitates continual learning where models adapt to new classes (plasticity) over time without forgetting old ones (catastrophic forgetting). Current continual segmentation methods often rely on distillation strategies like knowledge distillation and pseudo-labeling which are effective but result in increased training complexity and computational overhead. In this paper we introduce a novel and efficient method for continual panoptic segmentation based on Visual Prompt Tuning dubbed ECLIPSE. Our approach involves freezing the base model parameters and fine-tuning only a small set of prompt embeddings addressing both catastrophic forgetting and plasticity and significantly reducing the trainable parameters. To mitigate inherent challenges such as error propagation and semantic drift in continual segmentation we propose logit manipulation to effectively leverage common knowledge across the classes. Experiments on ADE20K continual panoptic segmentation benchmark demonstrate the superiority of ECLIPSE notably its robustness against catastrophic forgetting and its reasonable plasticity achieving a new state-of-the-art. The code is available at https://github.com/clovaai/ECLIPSE. + + + + MaGGIe: Masked Guided Gradual Human Instance Matting + http://openaccess.thecvf.com//content/CVPR2024/papers/Huynh_MaGGIe_Masked_Guided_Gradual_Human_Instance_Matting_CVPR_2024_paper.pdf + Human matting is a foundation task in image and video processing where human foreground pixels are extracted from the input. 
Prior works either improve the accuracy by additional guidance or improve the temporal consistency of a single instance across frames. We propose a new framework MaGGIe Masked Guided Gradual Human Instance Matting which predicts alpha mattes progressively for each human instance while maintaining the computational cost precision and consistency. Our method leverages modern architectures including transformer attention and sparse convolution to output all instance mattes simultaneously without exploding memory and latency. While keeping constant inference costs in the multiple-instance scenario our framework achieves robust and versatile performance on our proposed synthesized benchmarks. With the higher quality image and video matting benchmarks the novel multi-instance synthesis approach from publicly available sources is introduced to increase the generalization of models in real-world scenarios. Our code and datasets are available at https://maggie-matt.github.io + + + + Towards the Uncharted: Density-Descending Feature Perturbation for Semi-supervised Semantic Segmentation http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Towards_the_Uncharted_Density-Descending_Feature_Perturbation_for_Semi-supervised_Semantic_Segmentation_CVPR_2024_paper.pdf Semi-supervised semantic segmentation allows the model to mine effective supervision from unlabeled data to complement label-guided training. Recent research has primarily focused on consistency regularization techniques exploring perturbation-invariant training at both the image and feature levels. In this work we propose a novel feature-level consistency learning framework named Density-Descending Feature Perturbation (DDFP). Inspired by the low-density separation assumption in semi-supervised learning our key insight is that feature density can shed light on the most promising direction for the segmentation classifier to explore which is the regions with lower density. We propose to shift features with confident predictions towards lower-density regions by perturbation injection. The perturbed features are then supervised by the predictions on the original features thereby compelling the classifier to explore less dense regions to effectively regularize the decision boundary. Central to our method is the estimation of feature density. To this end we introduce a lightweight density estimator based on normalizing flow allowing for efficient capture of the feature density distribution in an online manner. By extracting gradients from the density estimator we can determine the direction towards less dense regions for each feature. The proposed DDFP outperforms other designs on feature-level perturbations and shows state-of-the-art performance on both the Pascal VOC and Cityscapes datasets under various partition protocols. The project is available at https://github.com/Gavinwxy/DDFP. + + + + RTMO: Towards High-Performance One-Stage Real-Time Multi-Person Pose Estimation http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_RTMO_Towards_High-Performance_One-Stage_Real-Time_Multi-Person_Pose_Estimation_CVPR_2024_paper.pdf Real-time multi-person pose estimation presents significant challenges in balancing speed and precision. While two-stage top-down methods slow down as the number of people in the image increases existing one-stage methods often fail to simultaneously deliver high accuracy and real-time performance.
This paper introduces RTMO a one-stage pose estimation framework that seamlessly integrates coordinate classification by representing keypoints using dual 1-D heatmaps within the YOLO architecture achieving accuracy comparable to top-down methods while maintaining high speed. We propose a dynamic coordinate classifier and a tailored loss function for heatmap learning specifically designed to address the incompatibilities between coordinate classification and dense prediction models. RTMO outperforms state-of-the-art one-stage pose estimators achieving 1.1% higher AP on COCO while operating about 9 times faster with the same backbone. Our largest model RTMO-l attains 74.8% AP on COCO val2017 and 141 FPS on a single V100 GPU demonstrating its efficiency and accuracy. The code and models are available at https://github.com/open-mmlab/mmpose/tree/main/projects/rtmo. + + + + WaveFace: Authentic Face Restoration with Efficient Frequency Recovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Miao_WaveFace_Authentic_Face_Restoration_with_Efficient_Frequency_Recovery_CVPR_2024_paper.pdf + Although diffusion models are rising as a powerful solution for blind face restoration they are criticized for two problems: 1) slow training and inference speed and 2) failure in preserving identity and recovering fine-grained facial details. In this work we propose WaveFace to solve the problems in the frequency domain where low- and high-frequency components decomposed by wavelet transformation are considered individually to maximize authenticity as well as efficiency. The diffusion model is applied to recover the low-frequency component only which presents general information of the original image but 1/16 in size. To preserve the original identity the generation is conditioned on the low-frequency component of low-quality images at each denoising step. Meanwhile high-frequency components at multiple decomposition levels are handled by a unified network which recovers complex facial details in a single step. Evaluations on four benchmark datasets show that: 1) WaveFace outperforms state-of-the-art methods in authenticity especially in terms of identity preservation and 2) authentic images are restored with the efficiency 10x faster than existing diffusion model-based BFR methods. + + + + UltrAvatar: A Realistic Animatable 3D Avatar Diffusion Model with Authenticity Guided Textures + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_UltrAvatar_A_Realistic_Animatable_3D_Avatar_Diffusion_Model_with_Authenticity_CVPR_2024_paper.pdf + Recent advances in 3D avatar generation have gained significant attention. These breakthroughs aim to produce more realistic animatable avatars narrowing the gap between virtual and real-world experiences. Most of existing works employ Score Distillation Sampling (SDS) loss combined with a differentiable renderer and text condition to guide a diffusion model in generating 3D avatars. However SDS often generates over-smoothed results with few facial details thereby lacking the diversity compared with ancestral sampling. On the other hand other works generate 3D avatar from a single image where the challenges of unwanted lighting effects perspective views and inferior image quality make them difficult to reliably reconstruct the 3D face meshes with the aligned complete textures. 
In this paper we propose a novel 3D avatar generation approach termed UltrAvatar with enhanced fidelity of geometry and superior quality of physically based rendering (PBR) textures without unwanted lighting. To this end the proposed approach presents a diffuse color extraction model and an authenticity guided texture diffusion model. The former removes the unwanted lighting effects to reveal true diffuse colors so that the generated avatars can be rendered under various lighting conditions. The latter follows two gradient-based guidances for generating PBR textures to render diverse face-identity features and details better aligning with 3D mesh geometry. We demonstrate the effectiveness and robustness of the proposed method outperforming the state-of-the-art methods by a large margin in the experiments. + + + + Attention-Propagation Network for Egocentric Heatmap to 3D Pose Lifting + http://openaccess.thecvf.com//content/CVPR2024/papers/Kang_Attention-Propagation_Network_for_Egocentric_Heatmap_to_3D_Pose_Lifting_CVPR_2024_paper.pdf + We present EgoTAP a heatmap-to-3D pose lifting method for highly accurate stereo egocentric 3D pose estimation. Severe self-occlusion and out-of-view limbs in egocentric camera views make accurate pose estimation a challenging problem. To address the challenge prior methods employ joint heatmaps-probabilistic 2D representations of the body pose but heatmap-to-3D pose conversion still remains an inaccurate process. We propose a novel heatmap-to-3D lifting method composed of the Grid ViT Encoder and the Propagation Network. The Grid ViT Encoder summarizes joint heatmaps into effective feature embedding using self-attention. Then the Propagation Network estimates the 3D pose by utilizing skeletal information to better estimate the position of obscure joints. Our method significantly outperforms the previous state-of-the-art qualitatively and quantitatively demonstrated by a 23.9% reduction of error in an MPJPE metric. Our source code is available on GitHub. + + + + OmniMotionGPT: Animal Motion Generation with Limited Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_OmniMotionGPT_Animal_Motion_Generation_with_Limited_Data_CVPR_2024_paper.pdf + Our paper aims to generate diverse and realistic animal motion sequences from textual descriptions without a large-scale animal text-motion dataset. While the task of text-driven human motion synthesis is already extensively studied and benchmarked it remains challenging to transfer this success to other skeleton structures with limited data. In this work we design a model architecture that imitates Generative Pretraining Transformer (GPT) utilizing prior knowledge learned from human data to the animal domain. We jointly train motion autoencoders for both animal and human motions and at the same time optimize through the similarity scores among human motion encoding animal motion encoding and text CLIP embedding. Presenting the first solution to this problem we are able to generate animal motions with high diversity and fidelity quantitatively and qualitatively outperforming the results of training human motion generation baselines on animal data. Additionally we introduce AnimalML3D the first text-animal motion dataset with 1240 animation sequences spanning 36 different animal identities. We hope this dataset would mediate the data scarcity problem in text-driven animal motion generation providing a new playground for the research community. 
+ + + + InstanceDiffusion: Instance-level Control for Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_InstanceDiffusion_Instance-level_Control_for_Image_Generation_CVPR_2024_paper.pdf + Text-to-image diffusion models produce high quality images but do not offer control over individual instances in the image. We introduce InstanceDiffusion that adds precise instance-level control to text-to-image diffusion models. InstanceDiffusion supports free-form language conditions per instance and allows flexible ways to specify instance locations such as simple single points scribbles bounding boxes or intricate instance segmentation masks and combinations thereof. We propose three major changes to text-to-image models that enable precise instance-level control. Our UniFusion block enables instance-level conditions for text-to-image models the ScaleU block improves image fidelity and our Multi-instance Sampler improves generations for multiple instances. InstanceDiffusion significantly surpasses specialized state-of-the-art models for each location condition. Notably on the COCO dataset we outperform previous state-of-the-art by 20.4% AP50box for box inputs and 25.4% IoU for mask inputs. + + + + Unifying Top-down and Bottom-up Scanpath Prediction Using Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Unifying_Top-down_and_Bottom-up_Scanpath_Prediction_Using_Transformers_CVPR_2024_paper.pdf + Most models of visual attention aim at predicting either top-down or bottom-up control as studied using different visual search and free-viewing tasks. In this paper we propose the Human Attention Transformer (HAT) a single model that predicts both forms of attention control. HAT uses a novel transformer-based architecture and a simplified foveated retina that collectively create a spatio-temporal awareness akin to the dynamic visual working memory of humans. HAT not only establishes a new state-of-the-art in predicting the scanpath of fixations made during target-present and target-absent visual search and "taskless" free viewing but also makes human gaze behavior interpretable. Unlike previous methods that rely on a coarse grid of fixation cells and experience information loss due to fixation discretization HAT features a sequential dense prediction architecture and outputs a dense heatmap for each fixation thus avoiding discretizing fixations. HAT sets a new standard in computational attention which emphasizes effectiveness generality and interpretability. HAT's demonstrated scope and applicability will likely inspire the development of new attention models that can better predict human behavior in various attention-demanding scenarios. Code is available at https://github.com/cvlab-stonybrook/HAT. + + + + 3D-Aware Face Editing via Warping-Guided Latent Direction Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_3D-Aware_Face_Editing_via_Warping-Guided_Latent_Direction_Learning_CVPR_2024_paper.pdf + 3D facial editing a longstanding task in computer vision with broad applications is expected to fast and intuitively manipulate any face from arbitrary viewpoints following the user's will. Existing works have limitations in terms of intuitiveness generalization and efficiency. To overcome these challenges we propose FaceEdit3D which allows users to directly manipulate 3D points to edit a 3D face achieving natural and rapid face editing. 
After one or several points are manipulated by users we propose the tri-plane warping to directly manipulate the view-independent 3D representation. To address the problem of distortion caused by tri-plane warping we train a warp-aware encoder to project the warped face onto a standardized latent space. In this space we further propose directional latent editing to mitigate the identity bias caused by the encoder and realize the disentangled editing of various attributes. Extensive experiments show that our method achieves superior results with rich facial details and nice identity preservation. Our approach also supports general applications like multi-attribute continuous editing and cat/car editing. The project website is https://cyh-sj.github.io/FaceEdit3D/. + + + + CAT-Seg: Cost Aggregation for Open-Vocabulary Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cho_CAT-Seg_Cost_Aggregation_for_Open-Vocabulary_Semantic_Segmentation_CVPR_2024_paper.pdf + Open-vocabulary semantic segmentation presents the challenge of labeling each pixel within an image based on a wide range of text descriptions. In this work we introduce a novel cost-based approach to adapt vision-language foundation models notably CLIP for the intricate task of semantic segmentation. Through aggregating the cosine similarity score i.e. the cost volume between image and text embeddings our method potently adapts CLIP for segmenting seen and unseen classes by fine-tuning its encoders addressing the challenges faced by existing methods in handling unseen classes. Building upon this we explore methods to effectively aggregate the cost volume considering its multi-modal nature of being established between image and text embeddings. Furthermore we examine various methods for efficiently fine-tuning CLIP. + + + + Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_Focus_on_Your_Instruction_Fine-grained_and_Multi-instruction_Image_Editing_by_CVPR_2024_paper.pdf + Recently diffusion-based methods like InstructPix2Pix (IP2P) have achieved effective instruction-based image editing requiring only natural language instructions from the user. However these methods often inadvertently alter unintended areas and struggle with multi-instruction editing resulting in compromised outcomes. To address these issues we introduce the Focus on Your Instruction (FoI) a method designed to ensure precise and harmonious editing across multiple instructions without extra training or test-time optimization. In the FoI we primarily emphasize two aspects: (1) precisely extracting regions of interest for each instruction and (2) guiding the denoising process to concentrate within these regions of interest. For the first objective we identify the implicit grounding capability of IP2P from the cross-attention between instruction and image then develop an effective mask extraction method. + + + + AvatarGPT: All-in-One Framework for Motion Understanding Planning Generation and Beyond + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_AvatarGPT_All-in-One_Framework_for_Motion_Understanding_Planning_Generation_and_Beyond_CVPR_2024_paper.pdf + Large Language Models (LLMs) have shown remarkable emergent abilities in unifying almost all (if not every) NLP tasks. In the human motion-related realm however researchers still develop siloed models for each task. Inspired by InstructGPT [??]
and the generalist concept behind Gato [??] we introduce AvatarGPT an All-in-One framework for motion understanding planning generations as well as other tasks such as motion in-between synthesis. AvatarGPT treats each task as one type of instruction fine-tuned on the shared LLM. All the tasks are seamlessly interconnected with language as the universal interface constituting a closed-loop within the framework. To achieve this human motion sequences are first encoded as discrete tokens which serve as the extended vocabulary of LLM. Then an unsupervised pipeline to generate natural language descriptions of human action sequences from in-the-wild videos is developed. Finally all tasks are jointly trained. Extensive experiments show that AvatarGPT achieves SOTA on low-level tasks and promising results on high-level tasks demonstrating the effectiveness of our proposed All-in-One framework. Moreover for the first time AvatarGPT enables a principled approach by iterative traversal of the tasks within the closed-loop for unlimited long-motion synthesis. + + + + Co-Speech Gesture Video Generation via Motion-Decoupled Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Co-Speech_Gesture_Video_Generation_via_Motion-Decoupled_Diffusion_Model_CVPR_2024_paper.pdf + Co-speech gestures if presented in the lively form of videos can achieve superior visual effects in human-machine interaction. While previous works mostly generate structural human skeletons resulting in the omission of appearance information we focus on the direct generation of audio-driven co-speech gesture videos in this work. There are two main challenges: 1) A suitable motion feature is needed to describe complex human movements with crucial appearance information. 2) Gestures and speech exhibit inherent dependencies and should be temporally aligned even of arbitrary length. To solve these problems we present a novel motion-decoupled framework to generate co-speech gesture videos. Specifically we first introduce a well-designed nonlinear TPS transformation to obtain latent motion features preserving essential appearance information. Then a transformer-based diffusion model is proposed to learn the temporal correlation between gestures and speech and performs generation in the latent motion space followed by an optimal motion selection module to produce long-term coherent and consistent gesture videos. For better visual perception we further design a refinement network focusing on missing details of certain areas. Extensive experimental results show that our proposed framework significantly outperforms existing approaches in both motion and video-related evaluations. Our code demos and more resources are available at https://github.com/thuhcsi/S2G-MDDiffusion. + + + + CDFormer: When Degradation Prediction Embraces Diffusion Model for Blind Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_CDFormer_When_Degradation_Prediction_Embraces_Diffusion_Model_for_Blind_Image_CVPR_2024_paper.pdf + Existing Blind image Super-Resolution (BSR) methods focus on estimating either kernel or degradation information but have long overlooked the essential content details. In this paper we propose a novel BSR approach Content-aware Degradation-driven Transformer (CDFormer) to capture both degradation and content representations. 
However low-resolution images cannot provide enough content details and thus we introduce a diffusion-based module CDFormer_diff to first learn Content Degradation Prior (CDP) in both low- and high-resolution images and then approximate the real distribution given only low-resolution information. Moreover we apply an adaptive SR network CDFormer_SR that effectively utilizes CDP to refine features. Compared to previous diffusion-based SR methods we treat the diffusion model as an estimator that can overcome the limitations of expensive sampling time and excessive diversity. Experiments show that CDFormer can outperform existing methods establishing a new state-of-the-art performance on various benchmarks under blind settings. Codes and models will be available at https://github.com/I2-Multimedia-Lab/CDFormer. + + + + HumanRef: Single Image to 3D Human Generation via Reference-Guided Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_HumanRef_Single_Image_to_3D_Human_Generation_via_Reference-Guided_Diffusion_CVPR_2024_paper.pdf + Generating a 3D human model from a single reference image is challenging because it requires inferring textures and geometries in invisible views while maintaining consistency with the reference image. Previous methods utilizing 3D generative models are limited by the availability of 3D training data. Optimization-based methods that lift text-to-image diffusion models to 3D generation often fail to preserve the texture details of the reference image resulting in inconsistent appearances in different views. In this paper we propose HumanRef a 3D human generation framework from a single-view input. To ensure the generated 3D model is photorealistic and consistent with the input image HumanRef introduces a novel method called reference-guided score distillation sampling (Ref-SDS) which effectively incorporates image guidance into the generation process. Furthermore we introduce region-aware attention to Ref-SDS ensuring accurate correspondence between different body regions. Experimental results demonstrate that HumanRef outperforms state-of-the-art methods in generating 3D clothed humans with fine geometry photorealistic textures and view-consistent appearances. Code and model are available at https://eckertzhang.github.io/HumanRef.github.io/. + + + + Rethinking Interactive Image Segmentation with Low Latency High Quality and Diverse Prompts + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Rethinking_Interactive_Image_Segmentation_with_Low_Latency_High_Quality_and_CVPR_2024_paper.pdf + The goal of interactive image segmentation is to delineate specific regions within an image via visual or language prompts. Low-latency and high-quality interactive segmentation with diverse prompts remain challenging for existing specialist and generalist models. Specialist models with their limited prompts and task-specific designs experience high latency because the image must be recomputed every time the prompt is updated due to the joint encoding of image and visual prompts. Generalist models exemplified by the Segment Anything Model (SAM) have recently excelled in prompt diversity and efficiency lifting image segmentation to the foundation model era. However for high-quality segmentations SAM still lags behind state-of-the-art specialist models despite SAM being trained with x100 more segmentation masks. In this work we delve deep into the architectural differences between the two types of models.
We observe that dense representation and fusion of visual prompts are the key design choices contributing to the high segmentation quality of specialist models. In light of this we reintroduce this dense design into the generalist models to facilitate the development of generalist models with high segmentation quality. To densely represent diverse visual prompts we propose to use a dense map to capture five types: clicks boxes polygons scribbles and masks. Thus we propose SegNext a next-generation interactive segmentation approach offering low latency high quality and diverse prompt support. Our method outperforms current state-of-the-art methods on HQSeg-44K and DAVIS quantitatively and qualitatively. + + + + DITTO: Dual and Integrated Latent Topologies for Implicit 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Shim_DITTO_Dual_and_Integrated_Latent_Topologies_for_Implicit_3D_Reconstruction_CVPR_2024_paper.pdf + We propose a novel concept of dual and integrated latent topologies (DITTO in short) for implicit 3D reconstruction from noisy and sparse point clouds. Most existing methods predominantly focus on single latent type such as point or grid latents. In contrast the proposed DITTO leverages both point and grid latents (i.e. dual latent) to enhance their strengths the stability of grid latents and the detail-rich capability of point latents. Concretely DITTO consists of dual latent encoder and integrated implicit decoder. In the dual latent encoder a dual latent layer which is the key module block composing the encoder refines both latents in parallel maintaining their distinct shapes and enabling recursive interaction. Notably a newly proposed dynamic sparse point transformer within the dual latent layer effectively refines point latents. Then the integrated implicit decoder systematically combines these refined latents achieving high-fidelity 3D reconstruction and surpassing previous state-of-the-art methods on object- and scene-level datasets especially in thin and detailed structures. + + + + HIT: Estimating Internal Human Implicit Tissues from the Body Surface + http://openaccess.thecvf.com//content/CVPR2024/papers/Keller_HIT_Estimating_Internal_Human_Implicit_Tissues_from_the_Body_Surface_CVPR_2024_paper.pdf + The creation of personalized anatomical digital twins is important in the fields of medicine computer graphics sports science and biomechanics. To observe a subject's anatomy expensive medical devices (MRI or CT) are required and the creation of the digital model is often time-consuming and involves manual effort. Instead we leverage the fact that the shape of the body surface is correlated with the internal anatomy; e.g. from surface observations alone one can predict body composition and skeletal structure. In this work we go further and learn to infer the 3D location of three important anatomic tissues: subcutaneous adipose tissue (fat) lean tissue (muscles and organs) and long bones. To learn to infer these tissues we tackle several key challenges. We first create a dataset of human tissues by segmenting full-body MRI scans and registering the SMPL body mesh to the body surface. With this dataset we train HIT (Human Implicit Tissues) an implicit function that given a point inside a body predicts its tissue class. HIT leverages the SMPL body model shape and pose parameters to canonicalize the medical data. 
Unlike SMPL which is trained from upright 3D scans MRI scans are acquired with subjects lying on a table resulting in significant soft-tissue deformation. Consequently HIT uses a learned volumetric deformation field that undoes these deformations. Since HIT is parameterized by SMPL we can repose bodies or change the shape of subjects and the internal structures deform appropriately. We perform extensive experiments to validate HIT's ability to predict a plausible internal structure for novel subjects. The dataset and HIT model are available at https://hit.is.tue.mpg.de to foster future research in this direction. + + + + DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_DanceCamera3D_3D_Camera_Movement_Synthesis_with_Music_and_Dance_CVPR_2024_paper.pdf + Choreographers determine what the dances look like while cameramen determine the final presentation of dances. Recently various methods and datasets have showcased the feasibility of dance synthesis. However camera movement synthesis with music and dance remains an unsolved challenging problem due to the scarcity of paired data. Thus we present DCM a new multi-modal 3D dataset which for the first time combines camera movement with dance motion and music audio. This dataset encompasses 108 dance sequences (3.2 hours) of paired dance-camera-music data from the anime community covering 4 music genres. With this dataset we uncover that dance camera movement is multifaceted and human-centric and possesses multiple influencing factors making dance camera synthesis a more challenging task compared to camera or dance synthesis alone. To overcome these difficulties we propose DanceCamera3D a transformer-based diffusion model that incorporates a novel body attention loss and a condition separation strategy. For evaluation we devise new metrics measuring camera movement quality diversity and dancer fidelity. Utilizing these metrics we conduct extensive experiments on our DCM dataset providing both quantitative and qualitative evidence showcasing the effectiveness of our DanceCamera3D model. Code and video demos are available at https://github.com/Carmenw1203/DanceCamera3D-Official. + + + + Cross Initialization for Face Personalization of Text-to-Image Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Pang_Cross_Initialization_for_Face_Personalization_of_Text-to-Image_Models_CVPR_2024_paper.pdf + Recently there has been a surge in face personalization techniques benefiting from the advanced capabilities of pretrained text-to-image diffusion models. Among these a notable method is Textual Inversion which generates personalized images by inverting given images into textual embeddings. However methods based on Textual Inversion still struggle with balancing the trade-off between reconstruction quality and editability. In this study we examine this issue through the lens of initialization. Upon closely examining traditional initialization methods we identified a significant disparity between the initial and learned embeddings in terms of both scale and orientation. The scale of the learned embedding can be up to 100 times greater than that of the initial embedding. Such a significant change in the embedding could increase the risk of overfitting thereby compromising the editability. Driven by this observation we introduce a novel initialization method termed Cross Initialization that significantly narrows the gap between the initial and learned embeddings.
This method not only improves both reconstruction and editability but also reduces the optimization steps from 5000 to 320. Furthermore we apply a regularization term to keep the learned embedding close to the initial embedding. We show that when combined with Cross Initialization this regularization term can effectively improve editability. We provide comprehensive empirical evidence to demonstrate the superior performance of our method compared to the baseline methods. Notably in our experiments Cross Initialization is the only method that successfully edits an individual's facial expression. Additionally a fast version of our method allows for capturing an input image in roughly 26 seconds while surpassing the baseline methods in terms of both reconstruction and editability. Code is available at https://github.com/lyuPang/CrossInitialization. + + + + LEDITS++: Limitless Image Editing using Text-to-Image Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Brack_LEDITS_Limitless_Image_Editing_using_Text-to-Image_Models_CVPR_2024_paper.pdf + Text-to-image diffusion models have recently received increasing interest for their astonishing ability to produce high-fidelity images from solely text inputs. Subsequent research efforts aim to exploit and apply their capabilities to real image editing. However existing image-to-image methods are often inefficient imprecise and of limited versatility. They either require time-consuming fine-tuning deviate unnecessarily strongly from the input image and/or lack support for multiple simultaneous edits. To address these issues we introduce LEDITS++ an efficient yet versatile and precise textual image manipulation technique. LEDITS++'s novel inversion approach requires no tuning nor optimization and produces high-fidelity results with a few diffusion steps. Second our methodology supports multiple simultaneous edits and is architecture-agnostic. Third we use a novel implicit masking technique that limits changes to relevant image regions. We propose the novel TEdBench++ benchmark as part of our exhaustive evaluation. Our results demonstrate the capabilities of LEDITS++ and its improvements over previous methods. + + + + Video Interpolation with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Jain_Video_Interpolation_with_Diffusion_Models_CVPR_2024_paper.pdf + We present VIDIM a generative model for video interpolation which creates short videos given a start and end frame. In order to achieve high fidelity and generate motions unseen in the input data VIDIM uses cascaded diffusion models to first generate the target video at low resolution and then generate the high-resolution video conditioned on the low-resolution generated video. We compare VIDIM to previous state-of-the-art methods on video interpolation and demonstrate how such works fail in most settings where the underlying motion is complex nonlinear or ambiguous while VIDIM can easily handle such cases. We additionally demonstrate how classifier-free guidance on the start and end frame and conditioning the superresolution model on the original high-resolution frames without additional parameters unlocks high-fidelity results. VIDIM is fast to sample from as it jointly denoises all the frames to be generated requires less than a billion parameters per diffusion model to produce compelling results and still enjoys scalability and improved quality at larger parameter counts. Please see our project page at vidiminterpolation.github.io. 
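The Cross Initialization entry above describes a regularization term that keeps the learned embedding close to the initial embedding. Below is a minimal sketch of that idea, with a stubbed reconstruction loss and an assumed `lambda_reg` weight; neither the names nor the values are from the paper.

```python
import torch

def personalization_step(embedding: torch.Tensor,
                         init_embedding: torch.Tensor,
                         recon_loss: torch.Tensor,
                         lambda_reg: float = 1e-3) -> torch.Tensor:
    """Total loss = diffusion reconstruction loss + closeness-to-initialization regularizer."""
    reg = (embedding - init_embedding.detach()).pow(2).sum()
    return recon_loss + lambda_reg * reg

# Usage sketch: `embedding` is the optimized token embedding, `init_embedding`
# its (cross-)initialization, and `recon_loss` the standard denoising loss
# computed elsewhere in the textual-inversion training loop.
embedding = torch.randn(1, 768, requires_grad=True)
init_embedding = embedding.detach().clone()
recon_loss = torch.tensor(0.5)  # placeholder for the diffusion reconstruction loss
loss = personalization_step(embedding, init_embedding, recon_loss)
loss.backward()
```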
+ + + + Learning Adaptive Spatial Coherent Correlations for Speech-Preserving Facial Expression Manipulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Learning_Adaptive_Spatial_Coherent_Correlations_for_Speech-Preserving_Facial_Expression_Manipulation_CVPR_2024_paper.pdf + Speech-preserving facial expression manipulation (SPFEM) aims to modify facial emotions while meticulously maintaining the mouth animation associated with spoken content. Current works depend on inaccessible paired training samples for the person where two aligned frames exhibit the same speech content yet differ in emotional expression limiting the SPFEM applications in real-world scenarios. In this work we discover that speakers who convey the same content with different emotions exhibit highly correlated local facial animations providing valuable supervision for SPFEM. To capitalize on this insight we propose a novel adaptive spatial coherent correlation learning (ASCCL) algorithm which models the aforementioned correlation as an explicit metric and integrates the metric to supervise manipulating facial expression and meanwhile better preserving the facial animation of spoken contents. To this end it first learns a spatial coherent correlation metric ensuring the visual disparities of adjacent local regions of the image belonging to one emotion are similar to those of the corresponding counterpart of the image belonging to another emotion. Recognizing that visual disparities are not uniform across all regions we have also crafted a disparity-aware adaptive strategy that prioritizes regions that present greater challenges. During SPFEM model training we construct the adaptive spatial coherent correlation metric between corresponding local regions of the input and output images as addition loss to supervise the generation process. We conduct extensive experiments on variant datasets and the results demonstrate the effectiveness of the proposed ASCCL algorithm. Code is publicly available at https://github.com/jianmanlincjx/ASCCL + + + + WHAM: Reconstructing World-grounded Humans with Accurate 3D Motion + http://openaccess.thecvf.com//content/CVPR2024/papers/Shin_WHAM_Reconstructing_World-grounded_Humans_with_Accurate_3D_Motion_CVPR_2024_paper.pdf + The estimation of 3D human motion from video has progressed rapidly but current methods still have several key limitations. First most methods estimate the human in camera coordinates. Second prior work on estimating humans in global coordinates often assumes a flat ground plane and produces foot sliding. Third the most accurate methods rely on computationally expensive optimization pipelines limiting their use to offline applications. Finally existing video-based methods are surprisingly less accurate than single-frame methods. We address these limitations with WHAM (World-grounded Humans with Accurate Motion) which accurately and efficiently reconstructs 3D human motion in a global coordinate system from video. WHAM learns to lift 2D keypoint sequences to 3D using motion capture data and fuses this with video features integrating motion context and visual information. WHAM exploits camera angular velocity estimated from a SLAM method together with human motion to estimate the body's global trajectory. We combine this with a contact-aware trajectory refinement method that lets WHAM capture human motion in diverse conditions such as climbing stairs. WHAM outperforms all existing 3D human motion recovery methods across multiple in-the-wild benchmarks. 
Code is available for research purposes at http://wham.is.tue.mpg.de/. + + + + DiffPerformer: Iterative Learning of Consistent Latent Guidance for Diffusion-based Human Video Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_DiffPerformer_Iterative_Learning_of_Consistent_Latent_Guidance_for_Diffusion-based_Human_CVPR_2024_paper.pdf + Existing diffusion models for pose-guided human video generation mostly suffer from temporal inconsistency in the generated appearance and poses due to the inherent randomization nature of the generation process. In this paper we propose a novel framework DiffPerformer to synthesize high-fidelity and temporally consistent human video. Without complex architecture modification or costly training DiffPerformer finetunes a pretrained diffusion model on a single video of the target character and introduces an implicit video representation as a proxy to learn temporally consistent guidance for the diffusion model. The guidance is encoded into VAE latent space and an iterative optimization loop is constructed between the implicit video representation and the diffusion model allowing to harness the smooth property of the implicit video representation and the generative capabilities of the diffusion model in a mutually beneficial way. Moreover we propose 3D-aware human flow as a temporal constraint during the optimization to explicitly model the correspondence between driving poses and human appearance. This alleviates the misalignment between guided poses and target performer and therefore maintains the appearance coherence under various motions. Extensive experiments demonstrate that our method outperforms the state-of-the-art methods. + + + + Category-Level Multi-Part Multi-Joint 3D Shape Assembly + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Category-Level_Multi-Part_Multi-Joint_3D_Shape_Assembly_CVPR_2024_paper.pdf + Shape assembly composes complex shapes geometries by arranging simple part geometries and has wide applications in autonomous robotic assembly and CAD modeling. Existing works focus on geometry reasoning and neglect the actual physical assembly process of matching and fitting joints which are the contact surfaces connecting different parts. In this paper we consider contacting joints for the task of multi-part assembly. A successful joint-optimized assembly needs to satisfy the bilateral objectives of shape structure and joint alignment. We propose a hierarchical graph learning approach composed of two levels of graph representation learning. The part graph takes part geometries as input to build the desired shape structure. The joint-level graph uses part joints information and focuses on matching and aligning joints. The two kinds of information are combined to achieve the bilateral objectives. Extensive experiments demonstrate that our method outperforms previous methods achieving better shape structure and higher joint alignment accuracy. + + + + One-Shot Open Affordance Learning with Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_One-Shot_Open_Affordance_Learning_with_Foundation_Models_CVPR_2024_paper.pdf + We introduce One-shot Open Affordance Learning (OOAL) where a model is trained with just one example per base object category but is expected to identify novel objects and affordances. While vision-language models excel at recognizing novel objects and scenes they often struggle to understand finer levels of granularity such as affordances. 
To handle this issue we conduct a comprehensive analysis of existing foundation models to explore their inherent understanding of affordances and assess the potential for data-limited affordance learning. We then propose a vision-language framework with simple and effective designs that boost the alignment between visual features and affordance text embeddings. Experiments on two affordance segmentation benchmarks show that the proposed method outperforms state-of-the-art models with less than 1% of the full training data and exhibits reasonable generalization capability on unseen objects and affordances. Project page: https://reagan1311.github.io/ooal. + + + + Don't Look into the Dark: Latent Codes for Pluralistic Image Inpainting + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Dont_Look_into_the_Dark_Latent_Codes_for_Pluralistic_Image_CVPR_2024_paper.pdf + We present a method for large-mask pluralistic image inpainting based on the generative framework of discrete latent codes. Our method learns latent priors discretized as tokens by only performing computations at the visible locations of the image. This is realized by a restrictive partial encoder that predicts the token label for each visible block a bidirectional transformer that infers the missing labels by only looking at these tokens and a dedicated synthesis network that couples the tokens with the partial image priors to generate coherent and pluralistic complete image even under extreme mask settings. Experiments on public benchmarks validate our design choices as the proposed method outperforms strong baselines in both visual quality and diversity metrics. + + + + DiffEditor: Boosting Accuracy and Flexibility on Diffusion-based Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Mou_DiffEditor_Boosting_Accuracy_and_Flexibility_on_Diffusion-based_Image_Editing_CVPR_2024_paper.pdf + Large-scale Text-to-Image (T2I) diffusion models have revolutionized image generation over the last few years. Although owning diverse and high-quality generation capabilities translating these abilities to fine-grained image editing remains challenging. In this paper we propose DiffEditor to rectify two weaknesses in existing diffusion-based image editing: (1) in complex scenarios editing results often lack editing accuracy and exhibit unexpected artifacts; (2) lack of flexibility to harmonize editing operations e.g. imagine new content. In our solution we introduce image prompts in fine-grained image editing cooperating with the text prompt to better describe the editing content. To increase the flexibility while maintaining content consistency we locally combine stochastic differential equation (SDE) into the ordinary differential equation (ODE) sampling. In addition we incorporate regional score-based gradient guidance and a time travel strategy into the diffusion sampling further improving the editing quality. Extensive experiments demonstrate that our method can efficiently achieve state-of-the-art performance on various fine-grained image editing tasks including editing within a single image (e.g. object moving resizing and content dragging) and across images (e.g. appearance replacing and object pasting). Our source code is released at https://github.com/MC-E/DragonDiffusion. 
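The OOAL entry above boosts alignment between visual features and affordance text embeddings. The sketch below shows that kind of dense vision-language alignment in its simplest form, assuming CLIP-style features, a plain cosine-similarity head, and an arbitrary temperature; none of these choices are claimed to match the paper's design.

```python
import torch
import torch.nn.functional as F

def affordance_logits(visual_feats: torch.Tensor,
                      text_embeds: torch.Tensor,
                      temperature: float = 0.07) -> torch.Tensor:
    """visual_feats: (B, D, H, W); text_embeds: (K, D) -> per-pixel logits (B, K, H, W)."""
    v = F.normalize(visual_feats, dim=1)
    t = F.normalize(text_embeds, dim=1)
    return torch.einsum("bdhw,kd->bkhw", v, t) / temperature

feats = torch.randn(2, 512, 32, 32)   # e.g. dense features from a frozen foundation model
texts = torch.randn(5, 512)           # e.g. embeddings of 5 affordance text prompts
masks = affordance_logits(feats, texts).softmax(dim=1)  # per-pixel affordance map
print(masks.shape)  # torch.Size([2, 5, 32, 32])
```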
+ + + + InstructVideo: Instructing Video Diffusion Models with Human Feedback + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_InstructVideo_Instructing_Video_Diffusion_Models_with_Human_Feedback_CVPR_2024_paper.pdf + Diffusion models have emerged as the de facto paradigm for video generation. However their reliance on web-scale data of varied quality often yields results that are visually unappealing and misaligned with the textual prompts. To tackle this problem we propose InstructVideo to instruct text-to-video diffusion models with human feedback by reward fine-tuning. InstructVideo has two key ingredients: 1) To ameliorate the cost of reward fine-tuning induced by generating through the full DDIM sampling chain we recast reward fine-tuning as editing. By leveraging the diffusion process to corrupt a sampled video InstructVideo requires only partial inference of the DDIM sampling chain reducing fine-tuning cost while improving fine-tuning efficiency. 2) To mitigate the absence of a dedicated video reward model for human preferences we repurpose established image reward models e.g. HPSv2. To this end we propose Segmental Video Reward a mechanism to provide reward signals based on segmental sparse sampling and Temporally Attenuated Reward a method that mitigates temporal modeling degradation during fine-tuning. Extensive experiments both qualitative and quantitative validate the practicality and efficacy of using image reward models in InstructVideo significantly enhancing the visual quality of generated videos without compromising generalization capabilities. Code and models can be accessed through our project page https://instructvideo.github.io/. + + + + On the Content Bias in Frechet Video Distance + http://openaccess.thecvf.com//content/CVPR2024/papers/Ge_On_the_Content_Bias_in_Frechet_Video_Distance_CVPR_2024_paper.pdf + Frechet Video Distance (FVD) a prominent metric for evaluating video generation models is known to conflict with human perception occasionally. In this paper we aim to explore the extent of FVD's bias toward frame quality over temporal realism and identify its sources. We first quantify the FVD's sensitivity to the temporal axis by decoupling the frame and motion quality and find that the FVD only increases slightly with larger temporal corruption. We then analyze the generated videos and show that via careful sampling from a large set of generated videos that do not contain motions one can drastically decrease FVD without improving the temporal quality. Both studies suggest FVD's bias toward the quality of individual frames. We show that FVD with features extracted from the recent large-scale self-supervised video models is less biased toward image quality. Finally we revisit a few real-world examples to validate our hypothesis. + + + + Image Neural Field Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Image_Neural_Field_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion models have shown an impressive ability to model complex data distributions with several key advantages over GANs such as stable training better coverage of the training distribution's modes and the ability to solve inverse problems without extra training. However most diffusion models learn the distribution of fixed-resolution images. We propose to learn the distribution of continuous images by training diffusion models on image neural fields which can be rendered at any resolution and show its advantages over fixed-resolution models.
To achieve this a key challenge is to obtain a latent space that represents photorealistic image neural fields. We propose a simple and effective method inspired by several recent techniques but with key changes to make the image neural fields photorealistic. Our method can be used to convert existing latent diffusion autoencoders into image neural field autoencoders. We show that image neural field diffusion models can be trained using mixed-resolution image datasets outperform fixed-resolution diffusion models followed by super-resolution models and can solve inverse problems with conditions applied at different scales efficiently. + + + + Discriminative Probing and Tuning for Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Qu_Discriminative_Probing_and_Tuning_for_Text-to-Image_Generation_CVPR_2024_paper.pdf + Despite advancements in text-to-image generation (T2I) prior methods often face text-image misalignment problems such as relation confusion in generated images. Existing solutions involve cross-attention manipulation for better compositional understanding or integrating large language models for improved layout planning. However the inherent alignment capabilities of T2I models are still inadequate. By reviewing the link between generative and discriminative modeling we posit that T2I models' discriminative abilities may reflect their text-image alignment proficiency during generation. In this light we advocate bolstering the discriminative abilities of T2I models to achieve more precise text-to-image alignment for generation. We present a discriminative adapter built on T2I models to probe their discriminative abilities on two representative tasks and leverage discriminative fine-tuning to improve their text-image alignment. As a bonus of the discriminative adapter a self-correction mechanism can leverage discriminative gradients to better align generated images to text prompts during inference. Comprehensive evaluations across three benchmark datasets including both in-distribution and out-of-distribution scenarios demonstrate our method's superior generation performance. Meanwhile it achieves state-of-the-art discriminative performance on the two discriminative tasks compared to other generative models. The code is available at https://dpt-t2i.github.io/. + + + + Towards More Accurate Diffusion Model Acceleration with A Timestep Tuner + http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_Towards_More_Accurate_Diffusion_Model_Acceleration_with_A_Timestep_Tuner_CVPR_2024_paper.pdf + A diffusion model which is formulated to produce an image using thousands of denoising steps usually suffers from a slow inference speed. Existing acceleration algorithms simplify the sampling by skipping most steps yet exhibit considerable performance degradation. By viewing the generation of diffusion models as a discretized integral process we argue that the quality drop is partly caused by applying an inaccurate integral direction to a timestep interval. To rectify this issue we propose a timestep tuner that helps find a more accurate integral direction for a particular interval at the minimum cost. Specifically at each denoising step we replace the original parameterization by conditioning the network on a new timestep enforcing the sampling distribution towards the real one. 
Extensive experiments show that our plug-in design can be trained efficiently and boost the inference performance of various state-of-the-art acceleration methods especially when there are few denoising steps. For example when using 10 denoising steps on LSUN Bedroom dataset we improve the FID of DDIM from 9.65 to 6.07 simply by adopting our method for a more appropriate set of timesteps. Code is available at https://github.com/THU-LYJ-Lab/time-tuner. + + + + Rethinking Generalizable Face Anti-spoofing via Hierarchical Prototype-guided Distribution Refinement in Hyperbolic Space + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_Rethinking_Generalizable_Face_Anti-spoofing_via_Hierarchical_Prototype-guided_Distribution_Refinement_in_CVPR_2024_paper.pdf + Generalizable face anti-spoofing (FAS) approaches have drawn growing attention due to their robustness for diverse presentation attacks in unseen scenarios. Most previous methods always utilize domain generalization (DG) frameworks via directly aligning diverse source samples into a common feature space. However these methods neglect the hierarchical relations in FAS samples which may hinder the generalization ability by direct alignment. To address these issues we propose a novel Hierarchical Prototype-guided Distribution Refinement (HPDR) framework to learn embedding in hyperbolic space which facilitates the hierarchical relation construction. We also collaborate with prototype learning for hierarchical distribution refinement in hyperbolic space. In detail we propose the Hierarchical Prototype Learning to simultaneously guide domain alignment and improve the discriminative ability via constraining the multi-level relations between prototypes and instances in hyperbolic space. Moreover we design a Prototype-oriented Classifier which further considers relations between the sample and prototypes to improve the robustness of the final decision. Extensive experiments and visualizations demonstrate the effectiveness of our method against previous competitors. + + + + GenesisTex: Adapting Image Denoising Diffusion to Texture Space + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_GenesisTex_Adapting_Image_Denoising_Diffusion_to_Texture_Space_CVPR_2024_paper.pdf + We present GenesisTex a novel method for synthesizing textures for 3D geometries from text descriptions. GenesisTex adapts the pretrained image diffusion model to texture space by texture space sampling. Specifically we maintain a latent texture map for each viewpoint which is updated with predicted noise on the rendering of the corresponding viewpoint. The sampled latent texture maps are then decoded into a final texture map. During the sampling process we focus on both global and local consistency across multiple viewpoints: global consistency is achieved through the integration of style consistency mechanisms within the noise prediction network and low-level consistency is achieved by dynamically aligning latent textures. Finally we apply reference-based inpainting and img2img on denser views for texture refinement. Our approach overcomes the limitations of slow optimization in distillation-based methods and instability in inpainting-based methods. Experiments on meshes from various sources demonstrate that our method surpasses the baseline methods quantitatively and qualitatively.
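The HPDR entry above constrains relations between prototypes and instances in hyperbolic space. Below is a small sketch of the underlying operation, the geodesic distance on the Poincaré ball, with nearest-prototype assignment standing in, as a simplification, for the paper's hierarchical prototype learning.

```python
import torch

def poincare_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Geodesic distance on the unit Poincare ball; x, y: (..., D) with norm < 1."""
    sq_diff = (x - y).pow(2).sum(dim=-1)
    denom = (1 - x.pow(2).sum(dim=-1)).clamp_min(eps) * (1 - y.pow(2).sum(dim=-1)).clamp_min(eps)
    return torch.acosh(1 + 2 * sq_diff / denom)

# Usage sketch: assign each embedding to the nearest prototype in hyperbolic space.
embeddings = 0.1 * torch.randn(8, 16)   # instance embeddings kept inside the ball
prototypes = 0.1 * torch.randn(4, 16)   # one prototype per (sub)class
d = poincare_distance(embeddings[:, None, :], prototypes[None, :, :])  # (8, 4) distances
assignment = d.argmin(dim=1)
```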
+ + + + Image-to-Image Matching via Foundation Models: A New Perspective for Open-Vocabulary Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Image-to-Image_Matching_via_Foundation_Models_A_New_Perspective_for_Open-Vocabulary_CVPR_2024_paper.pdf + Open-vocabulary semantic segmentation (OVS) aims to segment images of arbitrary categories specified by class labels or captions. However most previous best-performing methods whether pixel grouping methods or region recognition methods suffer from false matches between image features and category labels. We attribute this to the natural gap between the textual features and visual features. In this work we rethink how to mitigate false matches from the perspective of image-to-image matching and propose a novel relation-aware intra-modal matching (RIM) framework for OVS based on visual foundation models. RIM achieves robust region classification by firstly constructing diverse image-modal reference features and then matching them with region features based on relation-aware ranking distribution. The proposed RIM enjoys several merits. First the intra-modal reference features are better aligned circumventing potential ambiguities that may arise in cross-modal matching. Second the ranking-based matching process harnesses the structure information implicit in the inter-class relationships making it more robust than comparing individually. Extensive experiments on three benchmarks demonstrate that RIM outperforms previous state-of-the-art methods by large margins obtaining a lead of more than 10% in mIoU on the PASCAL VOC benchmark. + + + + BigGait: Learning Gait Representation You Want by Large Vision Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_BigGait_Learning_Gait_Representation_You_Want_by_Large_Vision_Models_CVPR_2024_paper.pdf + Gait recognition stands as one of the most pivotal remote identification technologies and progressively expands across research and industry communities. However existing gait recognition methods heavily rely on task-specific upstream driven by supervised learning to provide explicit gait representations like silhouette sequences which inevitably introduce expensive annotation costs and potential error accumulation. Escaping from this trend this work explores effective gait representations based on the all-purpose knowledge produced by task-agnostic Large Vision Models (LVMs) and proposes a simple yet efficient gait framework termed BigGait. Specifically the Gait Representation Extractor (GRE) within BigGait draws upon design principles from established gait representations effectively transforming all-purpose knowledge into implicit gait representations without requiring third-party supervision signals. Experiments on CCPG CASIA-B* and SUSTech1K indicate that BigGait significantly outperforms the previous methods in both within-domain and cross-domain tasks in most cases and provides a more practical paradigm for learning the next-generation gait representation. Finally we delve into prospective challenges and promising directions in LVMs-based gait recognition aiming to inspire future work in this emerging topic. The source code is available at https://github.com/ShiqiYu/OpenGait.
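The RIM entry above classifies regions by matching them against image-modal reference features instead of text embeddings. The sketch below is a deliberately simplified, hedged version of that idea: a plain max-cosine match to per-class reference banks stands in for the paper's relation-aware ranking distribution.

```python
import torch
import torch.nn.functional as F

def classify_regions(region_feats: torch.Tensor,
                     reference_banks: list[torch.Tensor]) -> torch.Tensor:
    """region_feats: (R, D); reference_banks: list of (M_c, D) per class -> labels (R,)."""
    r = F.normalize(region_feats, dim=-1)
    scores = []
    for bank in reference_banks:                       # one bank of visual exemplars per class
        b = F.normalize(bank, dim=-1)
        scores.append((r @ b.t()).max(dim=-1).values)  # best-matching exemplar per region
    return torch.stack(scores, dim=-1).argmax(dim=-1)

regions = torch.randn(10, 256)                         # pooled region features
banks = [torch.randn(5, 256), torch.randn(3, 256)]     # e.g. exemplars for 2 classes
labels = classify_regions(regions, banks)              # (10,) predicted class indices
```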
+ + + + HOIST-Former: Hand-held Objects Identification Segmentation and Tracking in the Wild + http://openaccess.thecvf.com//content/CVPR2024/papers/Narasimhaswamy_HOIST-Former_Hand-held_Objects_Identification_Segmentation_and_Tracking_in_the_Wild_CVPR_2024_paper.pdf + We address the challenging task of identifying segmenting and tracking hand-held objects which is crucial for applications such as human action segmentation and performance evaluation. This task is particularly challenging due to heavy occlusion rapid motion and the transitory nature of objects being hand-held where an object may be held released and subsequently picked up again. To tackle these challenges we have developed a novel transformer-based architecture called HOIST-Former. HOIST-Former is adept at spatially and temporally segmenting hands and objects by iteratively pooling features from each other ensuring that the processes of identification segmentation and tracking of hand-held objects depend on the hands' positions and their contextual appearance. We further refine HOIST-Former with a contact loss that focuses on areas where hands are in contact with objects. Moreover we also contribute an in-the-wild video dataset called HOIST which comprises 4125 videos complete with bounding boxes segmentation masks and tracking IDs for hand-held objects. Through experiments on the HOIST dataset and two additional public datasets we demonstrate the efficacy of HOIST-Former in segmenting and tracking hand-held objects. + + + + Contextrast: Contextual Contrastive Learning for Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Sung_Contextrast_Contextual_Contrastive_Learning_for_Semantic_Segmentation_CVPR_2024_paper.pdf + Despite great improvements in semantic segmentation challenges persist because of the lack of local/global contexts and the relationship between them. In this paper we propose Contextrast a contrastive learning-based semantic segmentation method that allows to capture local/global contexts and comprehend their relationships. Our proposed method comprises two parts: a) contextual contrastive learning (CCL) and b) boundary-aware negative (BANE) sampling. Contextual contrastive learning obtains local/global context from multi-scale feature aggregation and inter/intra-relationship of features for better discrimination capabilities. Meanwhile BANE sampling selects embedding features along the boundaries of incorrectly predicted regions to employ them as harder negative samples on our contrastive learning resolving segmentation issues along the boundary region by exploiting fine-grained details. We demonstrate that our Contextrast substantially enhances the performance of semantic segmentation networks outperforming state-of-the-art contrastive learning approaches on diverse public datasets e.g. Cityscapes CamVid PASCAL-C COCO-Stuff and ADE20K without an increase in computational cost during inference. + + + + AUEditNet: Dual-Branch Facial Action Unit Intensity Manipulation with Implicit Disentanglement + http://openaccess.thecvf.com//content/CVPR2024/papers/Jin_AUEditNet_Dual-Branch_Facial_Action_Unit_Intensity_Manipulation_with_Implicit_Disentanglement_CVPR_2024_paper.pdf + Facial action unit (AU) intensity plays a pivotal role in quantifying fine-grained expression behaviors which is an effective condition for facial expression manipulation. 
However publicly available datasets containing intensity annotations for multiple AUs remain severely limited often featuring a restricted number of subjects. This limitation places challenges to the AU intensity manipulation in images due to disentanglement issues leading researchers to resort to other large datasets with pretrained AU intensity estimators for pseudo labels. In addressing this constraint and fully leveraging manual annotations of AU intensities for precise manipulation we introduce AUEditNet. Our proposed model achieves impressive intensity manipulation across 12 AUs trained effectively with only 18 subjects. Utilizing a dual-branch architecture our approach achieves comprehensive disentanglement of facial attributes and identity without necessitating additional loss functions or implementing with large batch sizes. This approach offers a potential solution to achieve desired facial attribute editing despite the dataset's limited subject count. Our experiments demonstrate AUEditNet's superior accuracy in editing AU intensities affirming its capability in disentangling facial attributes and identity within a limited subject pool. AUEditNet allows conditioning by either intensity values or target images eliminating the need for constructing AU combinations for specific facial expression synthesis. Moreover AU intensity estimation as a downstream task validates the consistency between real and edited images confirming the effectiveness of our proposed AU intensity manipulation method. + + + + BodyMAP - Jointly Predicting Body Mesh and 3D Applied Pressure Map for People in Bed + http://openaccess.thecvf.com//content/CVPR2024/papers/Tandon_BodyMAP_-_Jointly_Predicting_Body_Mesh_and_3D_Applied_Pressure_CVPR_2024_paper.pdf + Accurately predicting the 3D human posture and the pressure exerted on the body for people resting in bed visualized as a body mesh (3D pose & shape) with a 3D pressure map holds significant promise for healthcare applications particularly in the prevention of pressure ulcers. Current methods focus on singular facets of the problem---predicting only 2D/3D poses generating 2D pressure images predicting pressure only for certain body regions instead of the full body or forming indirect approximations to the 3D pressure map. In contrast we introduce BodyMAP which jointly predicts the human body mesh and 3D applied pressure map across the entire human body. Our network leverages multiple visual modalities incorporating both a depth image of a person in bed and its corresponding 2D pressure image acquired from a pressure-sensing mattress. The 3D pressure map is represented as a pressure value at each mesh vertex and thus allows for precise localization of high-pressure regions on the body. Additionally we present BodyMAP-WS a new formulation of pressure prediction in which we implicitly learn pressure in 3D by aligning sensed 2D pressure images with a differentiable 2D projection of the predicted 3D pressure maps. In evaluations with real-world human data our method outperforms the current state-of-the-art technique by 25% on both body mesh and 3D applied pressure map prediction tasks for people in bed. 
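The BodyMAP-WS formulation above learns 3D pressure by aligning a differentiable 2D projection of per-vertex predictions with the sensed pressure image. Below is a hedged sketch of that weak-supervision loop, using nearest-pixel splatting and an assumed mat resolution rather than the paper's projection.

```python
import torch
import torch.nn.functional as F

def project_pressure(vertices: torch.Tensor, pressure: torch.Tensor, hw=(64, 27)) -> torch.Tensor:
    """vertices: (N, 3) in [0, 1]^3 mat coordinates; pressure: (N,) -> 2D image (H, W)."""
    h, w = hw
    col = (vertices[:, 0] * (w - 1)).round().long().clamp(0, w - 1)
    row = (vertices[:, 1] * (h - 1)).round().long().clamp(0, h - 1)
    img = torch.zeros(h * w, device=pressure.device)
    img = img.scatter_add(0, row * w + col, pressure)   # accumulate vertex pressure per pixel
    return img.view(h, w)

# Usage sketch: the loss aligns the projected 3D pressure with the sensed 2D map,
# so per-vertex pressure is learned without direct 3D supervision.
verts = torch.rand(6890, 3)                    # e.g. an SMPL-sized vertex set in mat coordinates
pred_pressure = torch.rand(6890, requires_grad=True)
sensed = torch.rand(64, 27)                    # 2D pressure image from the mattress
loss = F.mse_loss(project_pressure(verts, pred_pressure), sensed)
loss.backward()
```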
+ + + + KPConvX: Modernizing Kernel Point Convolution with Kernel Attention + http://openaccess.thecvf.com//content/CVPR2024/papers/Thomas_KPConvX_Modernizing_Kernel_Point_Convolution_with_Kernel_Attention_CVPR_2024_paper.pdf + In the field of deep point cloud understanding KPConv is a unique architecture that uses kernel points to locate convolutional weights in space instead of relying on Multi-Layer Perceptron (MLP) encodings. While it initially achieved success it has since been surpassed by recent MLP networks that employ updated designs and training strategies. Building upon the kernel point principle we present two novel designs: KPConvD (depthwise KPConv) a lighter design that enables the use of deeper architectures and KPConvX an innovative design that scales the depthwise convolutional weights of KPConvD with kernel attention values. Using KPConvX with a modern architecture and training strategy we are able to outperform current state-of-the-art approaches on the ScanObjectNN Scannetv2 and S3DIS datasets. We validate our design choices through ablation studies and release our code and models. + + + + Clockwork Diffusion: Efficient Generation With Model-Step Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Habibian_Clockwork_Diffusion_Efficient_Generation_With_Model-Step_Distillation_CVPR_2024_paper.pdf + This work aims to improve the efficiency of text-to-image diffusion models. While diffusion models use computationally expensive UNet-based denoising operations in every generation step we identify that not all operations are equally relevant for the final output quality. In particular we observe that UNet layers operating on high-res feature maps are relatively sensitive to small perturbations. In contrast low-res feature maps influence the semantic layout of the final image and can often be perturbed with no noticeable change in the output. Based on this observation we propose Clockwork Diffusion a method that periodically reuses computation from preceding denoising steps to approximate low-res feature maps at one or more subsequent steps. For multiple baselines and for both text-to-image generation and image editing we demonstrate that Clockwork leads to comparable or improved perceptual scores with drastically reduced computational complexity. As an example for Stable Diffusion v1.5 with 8 DPM++ steps we save 32% of FLOPs with negligible FID and CLIP change. We release code at https://github.com/Qualcomm-AI-research/clockwork-diffusion + + + + Pick-or-Mix: Dynamic Channel Sampling for ConvNets + http://openaccess.thecvf.com//content/CVPR2024/papers/Kumar_Pick-or-Mix_Dynamic_Channel_Sampling_for_ConvNets_CVPR_2024_paper.pdf + Channel pruning approaches for convolutional neural networks (ConvNets) deactivate the channels statically or dynamically and require special implementation. In addition channel squeezing in representative ConvNets is carried out via 1 x 1 convolutions which dominates a large portion of computations and network parameters. Given these challenges we propose an effective multi-purpose module for dynamic channel sampling namely Pick-or-Mix (PiX) which does not require special implementation. PiX divides a set of channels into subsets and then picks from them where the picking decision is dynamically made per each pixel based on the input activations. We plug PiX into prominent ConvNet architectures and verify its multi-purpose utilities.
After replacing 1 x 1 channel squeezing layers in ResNet with PiX the network becomes 25% faster without losing accuracy. We show that PiX allows ConvNets to learn better data representation than widely adopted approaches to enhance networks' representation power (e.g. SE CBAM AFF SKNet and DWP). We also show that PiX achieves state-of-the-art performance on network downscaling and dynamic channel pruning applications. + + + + DyBluRF: Dynamic Neural Radiance Fields from Blurry Monocular Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_DyBluRF_Dynamic_Neural_Radiance_Fields_from_Blurry_Monocular_Video_CVPR_2024_paper.pdf + Recent advancements in dynamic neural radiance field methods have yielded remarkable outcomes. However these approaches rely on the assumption of sharp input images. When faced with motion blur existing dynamic NeRF methods often struggle to generate high-quality novel views. In this paper we propose DyBluRF a dynamic radiance field approach that synthesizes sharp novel views from a monocular video affected by motion blur. To account for motion blur in input images we simultaneously capture the camera trajectory and object Discrete Cosine Transform (DCT) trajectories within the scene. Additionally we employ a global cross-time rendering approach to ensure consistent temporal coherence across the entire scene. We curate a dataset comprising diverse dynamic scenes that are specifically tailored for our task. Experimental results on our dataset demonstrate that our method outperforms existing approaches in generating sharp novel views from motion-blurred inputs while maintaining spatial-temporal consistency of the scene. + + + + AAMDM: Accelerated Auto-regressive Motion Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_AAMDM_Accelerated_Auto-regressive_Motion_Diffusion_Model_CVPR_2024_paper.pdf + Interactive motion synthesis is essential in creating immersive experiences in entertainment applications such as video games and virtual reality. However generating animations that are both high-quality and contextually responsive remains a challenge. Traditional techniques in the game industry can produce high-fidelity animations but suffer from high computational costs and poor scalability. Trained neural network models alleviate the memory and speed issues yet fall short on generating diverse motions. Diffusion models offer diverse motion synthesis with low memory usage but require expensive reverse diffusion processes. This paper introduces the Accelerated Auto-regressive Motion Diffusion Model (AAMDM) a novel motion synthesis framework designed to achieve quality diversity and efficiency all together. AAMDM integrates Denoising Diffusion GANs as a fast Generation Module and an Auto-regressive Diffusion Model as a Polishing Module. Furthermore AAMDM operates in a lower-dimensional embedded space rather than the full-dimensional pose space which reduces the training complexity as well as further improves the performance. We show that AAMDM outperforms existing methods in motion quality diversity and runtime efficiency through comprehensive quantitative analyses and visual comparisons. We also demonstrate the effectiveness of each algorithmic component through ablation studies. 
+ + + + Towards Understanding Cross and Self-Attention in Stable Diffusion for Text-Guided Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Towards_Understanding_Cross_and_Self-Attention_in_Stable_Diffusion_for_Text-Guided_CVPR_2024_paper.pdf + Deep Text-to-Image Synthesis (TIS) models such as Stable Diffusion have recently gained significant popularity for creative text-to-image generation. However for domain-specific scenarios tuning-free Text-guided Image Editing (TIE) is of greater importance for application developers. This approach modifies objects or object properties in images by manipulating feature components in attention layers during the generation process. Nevertheless little is known about the semantic meanings that these attention layers have learned and which parts of the attention maps contribute to the success of image editing. In this paper we conduct an in-depth probing analysis and demonstrate that cross-attention maps in Stable Diffusion often contain object attribution information which can result in editing failures. In contrast self-attention maps play a crucial role in preserving the geometric and shape details of the source image during the transformation to the target image. Our analysis offers valuable insights into understanding cross and self-attention mechanisms in diffusion models. Furthermore based on our findings we propose a simplified yet more stable and efficient tuning-free procedure that modifies only the self-attention maps of specified attention layers during the denoising process. Experimental results show that our simplified method consistently surpasses the performance of popular approaches on multiple datasets. + + + + DiverGen: Improving Instance Segmentation by Learning Wider Data Distribution with More Diverse Generative Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_DiverGen_Improving_Instance_Segmentation_by_Learning_Wider_Data_Distribution_with_CVPR_2024_paper.pdf + Instance segmentation is data-hungry and as model capacity increases data scale becomes crucial for improving the accuracy. Most instance segmentation datasets today require costly manual annotation limiting their data scale. Models trained on such data are prone to overfitting on the training set especially for those rare categories. While recent works have delved into exploiting generative models to create synthetic datasets for data augmentation these approaches do not efficiently harness the full potential of generative models. To address these issues we introduce a more efficient strategy to construct generative datasets for data augmentation termed DiverGen. Firstly we provide an explanation of the role of generative data from the perspective of distribution discrepancy. We investigate the impact of different data on the distribution learned by the model. We argue that generative data can expand the data distribution that the model can learn thus mitigating overfitting. Additionally we find that the diversity of generative data is crucial for improving model performance and enhance it through various strategies including category diversity prompt diversity and generative model diversity. With these strategies we can scale the data to millions while maintaining the trend of model performance improvement. On the LVIS dataset DiverGen significantly outperforms the strong model X-Paste achieving +1.1 box AP and +1.1 mask AP across all categories and +1.9 box AP and +2.5 mask AP for rare categories. 
Our codes are available at https://github.com/aim-uofa/DiverGen. + + + + Learning Disentangled Identifiers for Action-Customized Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Learning_Disentangled_Identifiers_for_Action-Customized_Text-to-Image_Generation_CVPR_2024_paper.pdf + This study focuses on a novel task in text-to-image (T2I) generation namely action customization. The objective of this task is to learn the co-existing action from limited data and generalize it to unseen humans or even animals. Experimental results show that existing subject-driven customization methods fail to learn the representative characteristics of actions and struggle in decoupling actions from context features including appearance. To overcome the preference for low-level features and the entanglement of high-level features we propose an inversion-based method Action-Disentangled Identifier (ADI) to learn action-specific identifiers from the exemplar images. ADI first expands the semantic conditioning space by introducing layer-wise identifier tokens thereby increasing the representational richness while distributing the inversion across different features. Then to block the inversion of action-agnostic features ADI extracts the gradient invariance from the constructed sample triples and masks the updates of irrelevant channels. To comprehensively evaluate the task we present an ActionBench that includes a variety of actions each accompanied by meticulously selected samples. Both quantitative and qualitative results show that our ADI outperforms existing baselines in action-customized T2I generation. Our project page is at https://adi-t2i.github.io/ADI. + + + + Automatic Controllable Colorization via Imagination + http://openaccess.thecvf.com//content/CVPR2024/papers/Cong_Automatic_Controllable_Colorization_via_Imagination_CVPR_2024_paper.pdf + We propose a framework for automatic colorization that allows for iterative editing and modifications. The core of our framework lies in an imagination module: by understanding the content within a grayscale image we utilize a pre-trained image generation model to generate multiple images that contain the same content. These images serve as references for coloring mimicking the process of human experts. As the synthesized images can be imperfect or different from the original grayscale image we propose a Reference Refinement Module to select the optimal reference composition. Unlike most previous end-to-end automatic colorization algorithms our framework allows for iterative and localized modifications of the colorization results because we explicitly model the coloring samples. Extensive experiments demonstrate the superiority of our framework over existing automatic colorization algorithms in editability and flexibility. Project page: https://xy-cong.github.io/imagine-colorization/. + + + + EMOPortraits: Emotion-enhanced Multimodal One-shot Head Avatars + http://openaccess.thecvf.com//content/CVPR2024/papers/Drobyshev_EMOPortraits_Emotion-enhanced_Multimodal_One-shot_Head_Avatars_CVPR_2024_paper.pdf + Head avatars animated by visual signals have gained popularity particularly in cross-driving synthesis where the driver differs from the animated character a challenging but highly practical approach. The recently presented MegaPortraits model has demonstrated state-of-the-art results in this domain. 
We conduct a deep examination and evaluation of this model with a particular focus on its latent space for facial expression descriptors and uncover several limitations with its ability to express intense face motions. To address these limitations we propose substantial changes in both training pipeline and model architecture to introduce our EMOPortraits model where we: Enhance the model's capability to faithfully support intense asymmetric face expressions setting a new state-of-the-art result in the emotion transfer task surpassing previous methods in both metrics and quality. Incorporate a speech-driven mode into our model achieving top-tier performance in audio-driven facial animation making it possible to drive source identity through diverse modalities including visual signal audio or a blend of both. Furthermore we propose a novel multi-view video dataset featuring a wide range of intense and asymmetric facial expressions filling the gap left by the absence of such data in existing datasets. + + + + Open3DIS: Open-Vocabulary 3D Instance Segmentation with 2D Mask Guidance http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_Open3DIS_Open-Vocabulary_3D_Instance_Segmentation_with_2D_Mask_Guidance_CVPR_2024_paper.pdf We introduce Open3DIS a novel solution designed to tackle the problem of Open-Vocabulary Instance Segmentation within 3D scenes. Objects within 3D environments exhibit diverse shapes scales and colors making precise instance-level identification a challenging task. Recent advancements in Open-Vocabulary scene understanding have made significant strides in this area by employing class-agnostic 3D instance proposal networks for object localization and learning queryable features for each 3D mask. While these methods produce high-quality instance proposals they struggle with identifying small-scale and geometrically ambiguous objects. The key idea of our method is a new module that aggregates 2D instance masks across frames and maps them to geometrically coherent point cloud regions as high-quality object proposals addressing the above limitations. These are then combined with 3D class-agnostic instance proposals to include a wide range of objects in the real world. To validate our approach we conducted experiments on three prominent datasets including ScanNet200 S3DIS and Replica demonstrating significant performance gains in segmenting objects with diverse categories over the state-of-the-art approaches. + + + + ControlRoom3D: Room Generation using Semantic Proxy Rooms http://openaccess.thecvf.com//content/CVPR2024/papers/Schult_ControlRoom3D_Room_Generation_using_Semantic_Proxy_Rooms_CVPR_2024_paper.pdf Manually creating 3D environments for AR/VR applications is a complex process requiring expert knowledge in 3D modeling software. Pioneering works facilitate this process by generating room meshes conditioned on textual style descriptions.
Yet many of these automatically generated 3D meshes do not adhere to typical room layouts compromising their plausibility e.g. by placing several beds in one bedroom. To address these challenges we present ControlRoom3D a novel method to generate high-quality room meshes. Central to our approach is a user-defined 3D semantic proxy room that outlines a rough room layout based on semantic bounding boxes and a textual description of the overall room style. Our key insight is that when rendered to 2D this 3D representation provides valuable geometric and semantic information to control powerful 2D models to generate 3D consistent textures and geometry that aligns well with the proxy room. Backed up by an extensive study including quantitative metrics and qualitative user evaluations our method generates diverse and globally plausible 3D room meshes thus empowering users to design 3D rooms effortlessly without specialized knowledge. + + + + UniPTS: A Unified Framework for Proficient Post-Training Sparsity + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_UniPTS_A_Unified_Framework_for_Proficient_Post-Training_Sparsity_CVPR_2024_paper.pdf + Post-training Sparsity (PTS) is a recently emerged avenue that chases efficient network sparsity with limited data in need. Existing PTS methods however undergo significant performance degradation compared with traditional methods that retrain the sparse networks via the whole dataset especially at high sparsity ratios. In this paper we attempt to reconcile this disparity by transposing three cardinal factors that profoundly alter the performance of conventional sparsity into the context of PTS. Our endeavors particularly comprise (1) A base-decayed sparsity objective that promotes efficient knowledge transferring from dense network to the sparse counterpart. (2) A reducing-regrowing search algorithm designed to ascertain the optimal sparsity distribution while circumventing overfitting to the small calibration set in PTS. (3) The employment of dynamic sparse training predicated on the preceding aspects aimed at comprehensively optimizing the sparsity structure while ensuring training stability. Our proposed framework termed UniPTS is validated to be much superior to existing PTS methods across extensive benchmarks. As an illustration it amplifies the performance of POT a recently proposed recipe from 3.9% to 68.6% when pruning ResNet-50 at 90% sparsity ratio on ImageNet. + + + + HumanNorm: Learning Normal Diffusion Model for High-quality and Realistic 3D Human Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_HumanNorm_Learning_Normal_Diffusion_Model_for_High-quality_and_Realistic_3D_CVPR_2024_paper.pdf + Recent text-to-3D methods employing diffusion models have made significant advancements in 3D human generation. However these approaches face challenges due to the limitations of text-to-image diffusion models which lack an understanding of 3D structures. Consequently these methods struggle to achieve high-quality human generation resulting in smooth geometry and cartoon-like appearances. In this paper we propose HumanNorm a novel approach for high-quality and realistic 3D human generation. The main idea is to enhance the model's 2D perception of 3D geometry by learning a normal-adapted diffusion model and a normal-aligned diffusion model. The normal-adapted diffusion model can generate high-fidelity normal maps corresponding to user prompts with view-dependent and body-aware text. 
The normal-aligned diffusion model learns to generate color images aligned with the normal maps thereby transforming physical geometry details into realistic appearance. Leveraging the proposed normal diffusion model we devise a progressive geometry generation strategy and a multi-step Score Distillation Sampling (SDS) loss to enhance the performance of 3D human generation. Comprehensive experiments substantiate HumanNorm's ability to generate 3D humans with intricate geometry and realistic appearances. HumanNorm outperforms existing text-to-3D methods in both geometry and texture quality. The project page of HumanNorm is https://humannorm.github.io/. + + + + Cross-view and Cross-pose Completion for 3D Human Understanding http://openaccess.thecvf.com//content/CVPR2024/papers/Armando_Cross-view_and_Cross-pose_Completion_for_3D_Human_Understanding_CVPR_2024_paper.pdf Human perception and understanding is a major domain of computer vision which like many other vision subdomains recently stands to gain from the use of large models pre-trained on large datasets. We hypothesize that the most common pre-training strategy of relying on general purpose object-centric image datasets such as ImageNet is limited by an important domain shift. On the other hand collecting domain-specific ground truth such as 2D or 3D labels does not scale well. Therefore we propose a pre-training approach based on self-supervised learning that works on human-centric data using only images. Our method uses pairs of images of humans: the first is partially masked and the model is trained to reconstruct the masked parts given the visible ones and a second image. It relies on both stereoscopic (cross-view) pairs and temporal (cross-pose) pairs taken from videos in order to learn priors about 3D as well as human motion. We pre-train a model for body-centric tasks and one for hand-centric tasks. With a generic transformer architecture these models outperform existing self-supervised pre-training methods on a wide set of human-centric downstream tasks and obtain state-of-the-art performance for instance when fine-tuning for model-based and model-free human mesh recovery. + + + + Efficient Scene Recovery Using Luminous Flux Prior http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Efficient_Scene_Recovery_Using_Luminous_Flux_Prior_CVPR_2024_paper.pdf Scene recovery the restoration of images degraded by adverse weather conditions presents significant challenges for existing methods. Physical models constrained by their inherent assumptions often fail when these assumptions are not met; while deep learning models are powerful they are limited by the diversity of their training datasets leading to poor generalization and high computational demands. To address these limitations we propose the Luminous Flux Prior (LFP) to recover degraded images under diverse adverse weather without learning. Luminous flux a physical measure that reflects image brightness has a rate of change that demonstrates a significant correlation with transmission. Consequently we leverage this rate of change in luminous flux as prior knowledge to estimate transmission which in turn assists in image recovery. This approach reduces dependency on physical parameters and enhances adaptability to various weather. Experimental validation under diverse conditions such as sandstorms underwater environments and haze attests to the robustness of LFP in restoring clear images.
With a time complexity of O(N log N) LFP enables real-time recovery making it suitable for devices with limited computational resources. + + + + Customize your NeRF: Adaptive Source Driven 3D Scene Editing via Local-Global Iterative Training http://openaccess.thecvf.com//content/CVPR2024/papers/He_Customize_your_NeRF_Adaptive_Source_Driven_3D_Scene_Editing_via_CVPR_2024_paper.pdf In this paper we target the adaptive source driven 3D scene editing task by proposing a CustomNeRF model that unifies a text description or a reference image as the editing prompt. However obtaining desired editing results conformed with the editing prompt is nontrivial since there exist two significant challenges including accurate editing of only foreground regions and multi-view consistency given a single-view reference image. To tackle the first challenge we propose a Local-Global Iterative Editing (LGIE) training scheme that alternates between foreground region editing and full-image editing aimed at foreground-only manipulation while preserving the background. For the second challenge we also design a class-guided regularization that exploits class priors within the generation model to alleviate the inconsistency problem among different views in image-driven editing. Extensive experiments show that our CustomNeRF produces precise editing results under various real scenes for both text- and image-driven settings. The code is available at: https://github.com/hrz2000/CustomNeRF. + + + + Spherical Mask: Coarse-to-Fine 3D Point Cloud Instance Segmentation with Spherical Representation http://openaccess.thecvf.com//content/CVPR2024/papers/Shin_Spherical_Mask_Coarse-to-Fine_3D_Point_Cloud_Instance_Segmentation_with_Spherical_CVPR_2024_paper.pdf Coarse-to-fine 3D instance segmentation methods show weak performance compared to recent Grouping-based Kernel-based and Transformer-based methods. We argue that this is due to two limitations: 1) Instance size overestimation by axis-aligned bounding box (AABB) 2) False negative error accumulation from inaccurate box to the refinement phase. In this work we introduce Spherical Mask a novel coarse-to-fine approach based on spherical representation overcoming those two limitations with several benefits. Specifically our coarse detection estimates each instance with a 3D polygon using center and radial distance predictions which avoids excessive size estimation of AABB. To cut the error propagation in the existing coarse-to-fine approaches we virtually migrate points based on the polygon allowing all foreground points including false negatives to be refined. During inference the proposal and point migration modules run in parallel and are assembled to form binary masks of instances. We also introduce two margin-based losses for the point migration to enforce corrections for the false positives/negatives and cohesion of foreground points significantly improving the performance. Experimental results from three datasets such as ScanNetV2 S3DIS and STPLS3D show that our proposed method outperforms existing works demonstrating the effectiveness of the new instance representation with spherical coordinates.
The code is available at: https://github.com/yunshin/SphericalMask + + + + FSRT: Facial Scene Representation Transformer for Face Reenactment from Factorized Appearance Head-pose and Facial Expression Features + http://openaccess.thecvf.com//content/CVPR2024/papers/Rochow_FSRT_Facial_Scene_Representation_Transformer_for_Face_Reenactment_from_Factorized_CVPR_2024_paper.pdf + The task of face reenactment is to transfer the head motion and facial expressions from a driving video to the appearance of a source image which may be of a different person (cross-reenactment). Most existing methods are CNN-based and estimate optical flow from the source image to the current driving frame which is then inpainted and refined to produce the output animation. We propose a transformer-based encoder for computing a set-latent representation of the source image(s). We then predict the output color of a query pixel using a transformer-based decoder which is conditioned with keypoints and a facial expression vector extracted from the driving frame. Latent representations of the source person are learned in a self-supervised manner that factorize their appearance head pose and facial expressions. Thus they are perfectly suited for cross-reenactment. In contrast to most related work our method naturally extends to multiple source images and can thus adapt to person-specific facial dynamics. We also propose data augmentation and regularization schemes that are necessary to prevent overfitting and support generalizability of the learned representations. We evaluated our approach in a randomized user study. The results indicate superior performance compared to the state-of-the-art in terms of motion transfer quality and temporal consistency. + + + + TetraSphere: A Neural Descriptor for O(3)-Invariant Point Cloud Analysis + http://openaccess.thecvf.com//content/CVPR2024/papers/Melnyk_TetraSphere_A_Neural_Descriptor_for_O3-Invariant_Point_Cloud_Analysis_CVPR_2024_paper.pdf + In many practical applications 3D point cloud analysis requires rotation invariance. In this paper we present a learnable descriptor invariant under 3D rotations and reflections i.e. the O(3) actions utilizing the recently introduced steerable 3D spherical neurons and vector neurons. Specifically we propose an embedding of the 3D spherical neurons into 4D vector neurons which leverages end-to-end training of the model. In our approach we perform TetraTransform---an equivariant embedding of the 3D input into 4D constructed from the steerable neurons---and extract deeper O(3)-equivariant features using vector neurons. This integration of the TetraTransform into the VN-DGCNN framework termed TetraSphere negligibly increases the number of parameters by less than 0.0002%. TetraSphere sets a new state-of-the-art performance classifying randomly rotated real-world object scans of the challenging subsets of ScanObjectNN. Additionally TetraSphere outperforms all equivariant methods on randomly rotated synthetic data: classifying objects from ModelNet40 and segmenting parts of the ShapeNet shapes. Thus our results reveal the practical value of steerable 3D spherical neurons for learning in 3D Euclidean space. The code is available at https://github.com/pavlo-melnyk/tetrasphere. 
+ + + + WANDR: Intention-guided Human Motion Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Diomataris_WANDR_Intention-guided_Human_Motion_Generation_CVPR_2024_paper.pdf + Synthesizing natural human motions that enable a 3D human avatar to walk and reach for arbitrary goals in 3D space remains an unsolved problem with many applications. Existing methods (data-driven or using reinforcement learning) are limited in terms of generalization and motion naturalness. A primary obstacle is the scarcity of training data that combines locomotion with goal reaching. To address this we introduce WANDR a data-driven model that takes an avatar's initial pose and a goal's 3D position and generates natural human motions that place the end effector (wrist) on the goal location. To solve this we introduce novel intention features that drive rich goal-oriented movement. Intention guides the agent to the goal and interactively adapts the generation to novel situations without needing to define sub-goals or the entire motion path. Crucially intention allows training on datasets that have goal-oriented motions as well as those that do not. WANDR is a conditional Variational Auto-Encoder (c-VAE) which we train using the AMASS and CIRCLE datasets. We evaluate our method extensively and demonstrate its ability to generate natural and long-term motions that reach 3D goals and generalize to unseen goal locations. Our models and code are available for research purposes at wandr.is.tue.mpg.de + + + + GroupContrast: Semantic-aware Self-supervised Representation Learning for 3D Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_GroupContrast_Semantic-aware_Self-supervised_Representation_Learning_for_3D_Understanding_CVPR_2024_paper.pdf + Self-supervised 3D representation learning aims to learn effective representations from large-scale unlabeled point clouds. Most existing approaches adopt point discrimination as the pretext task which assigns matched points in two distinct views as positive pairs and unmatched points as negative pairs. However this approach often results in semantically identical points having dissimilar representations leading to a high number of false negatives and introducing a semantic conflict problem. To address this issue we propose GroupContrast a novel approach that combines segment grouping and semantic-aware contrastive learning. Segment grouping partitions points into semantically meaningful regions which enhances semantic coherence and provides semantic guidance for the subsequent contrastive representation learning. Semantic-aware contrastive learning augments the semantic information extracted from segment grouping and helps to alleviate the issue of semantic conflict. We conducted extensive experiments on multiple 3D scene understanding tasks. The results demonstrate that GroupContrast learns semantically meaningful representations and achieves promising transfer learning performance. + + + + Privacy-Preserving Face Recognition Using Trainable Feature Subtraction + http://openaccess.thecvf.com//content/CVPR2024/papers/Mi_Privacy-Preserving_Face_Recognition_Using_Trainable_Feature_Subtraction_CVPR_2024_paper.pdf + The widespread adoption of face recognition has led to increasing privacy concerns as unauthorized access to face images can expose sensitive personal information. This paper explores face image protection against viewing and recovery attacks. 
Inspired by image compression we propose creating a visually uninformative face image through feature subtraction between an original face and its model-produced regeneration. Recognizable identity features within the image are encouraged by co-training a recognition model on its high-dimensional feature representation. To enhance privacy the high-dimensional representation is crafted through random channel shuffling resulting in randomized recognizable images devoid of attacker-leverageable texture details. We distill our methodologies into a novel privacy-preserving face recognition method MinusFace. Experiments demonstrate its high recognition accuracy and effective privacy protection. Its code is available at https://github.com/Tencent/TFace. + + + + Learning Visual Prompt for Gait Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Learning_Visual_Prompt_for_Gait_Recognition_CVPR_2024_paper.pdf + Gait a prevalent and complex form of human motion plays a significant role in the field of long-range pedestrian retrieval due to the unique characteristics inherent in individual motion patterns. However gait recognition in real-world scenarios is challenging due to the limitations of capturing comprehensive cross-viewing and cross-clothing data. Additionally distractors such as occlusions directional changes and lingering movements further complicate the problem. The widespread application of deep learning techniques has led to the development of various potential gait recognition methods. However these methods utilize convolutional networks to extract shared information across different views and attire conditions. Once trained the parameters and non-linear function become constrained to fixed patterns limiting their adaptability to various distractors in real-world scenarios. In this paper we present a unified gait recognition framework to extract global motion patterns and develop a novel dynamic transformer to generate representative gait features. Specifically we develop a trainable part-based prompt pool with numerous key-value pairs that can dynamically select prompt templates to incorporate into the gait sequence thereby providing task-relevant shared knowledge information. Furthermore we specifically design dynamic attention to extract robust motion patterns and address the length generalization issue. Extensive experiments on four widely recognized gait datasets i.e. Gait3D GREW OUMVLP and CASIA-B reveal that the proposed method yields substantial improvements compared to current state-of-the-art approaches. + + + + SC-GS: Sparse-Controlled Gaussian Splatting for Editable Dynamic Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_SC-GS_Sparse-Controlled_Gaussian_Splatting_for_Editable_Dynamic_Scenes_CVPR_2024_paper.pdf + Novel view synthesis for dynamic scenes is still a challenging problem in computer vision and graphics. Recently Gaussian splatting has emerged as a robust technique to represent static scenes and enable high-quality and real-time novel view synthesis. Building upon this technique we propose a new representation that explicitly decomposes the motion and appearance of dynamic scenes into sparse control points and dense Gaussians respectively. Our key idea is to use sparse control points significantly fewer in number than the Gaussians to learn compact 6 DoF transformation bases which can be locally interpolated through learned interpolation weights to yield the motion field of 3D Gaussians. 
We employ a deformation MLP to predict time-varying 6 DoF transformations for each control point which reduces learning complexities enhances learning abilities and facilitates obtaining temporally and spatially coherent motion patterns. Then we jointly learn the 3D Gaussians the canonical space locations of control points and the deformation MLP to reconstruct the appearance geometry and dynamics of 3D scenes. During learning the location and number of control points are adaptively adjusted to accommodate varying motion complexities in different regions and an ARAP loss following the principle of as rigid as possible is developed to enforce spatial continuity and local rigidity of learned motions. Finally thanks to the explicit sparse motion representation and its decomposition from appearance our method can enable user-controlled motion editing while retaining high-fidelity appearances. Extensive experiments demonstrate that our approach outperforms existing approaches on novel view synthesis with a high rendering speed and enables novel appearance-preserved motion editing applications. + + + + Tri-Modal Motion Retrieval by Learning a Joint Embedding Space http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_Tri-Modal_Motion_Retrieval_by_Learning_a_Joint_Embedding_Space_CVPR_2024_paper.pdf Text-to-motion tasks have been the focus of recent advancements in the human motion domain. However the performance of text-to-motion tasks has not reached its potential primarily due to the lack of motion datasets and the pronounced gap between the text and motion modalities. To mitigate this challenge we introduce VLMA a novel Video-Language-Motion Alignment method. This approach leverages human-centric videos as an intermediary modality effectively bridging the divide between text and motion. By employing contrastive learning we construct a cohesive embedding space across the three modalities. Furthermore we incorporate a motion reconstruction branch ensuring that the resulting motion remains closely aligned with its original trajectory. Experimental evaluations on the HumanML3D and KIT-ML datasets demonstrate the superiority of our method in comparison to existing approaches. Furthermore we introduce a novel task termed video-to-motion retrieval designed to facilitate the seamless extraction of corresponding 3D motions from an RGB video. Supplementary experiments demonstrate that our model is extensible to real-world human-centric videos offering a valuable complement to the pose estimation task. + + + + Geometry-aware Reconstruction and Fusion-refined Rendering for Generalizable Neural Radiance Fields http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Geometry-aware_Reconstruction_and_Fusion-refined_Rendering_for_Generalizable_Neural_Radiance_Fields_CVPR_2024_paper.pdf Generalizable NeRF aims to synthesize novel views for unseen scenes. Common practices involve constructing variance-based cost volumes for geometry reconstruction and encoding 3D descriptors for decoding novel views. However existing methods show limited generalization ability in challenging conditions due to inaccurate geometry sub-optimal descriptors and decoding strategies. We address these issues point by point. First we find the variance-based cost volume exhibits failure patterns as the features of pixels corresponding to the same point can be inconsistent across different views due to occlusions or reflections.
We introduce an Adaptive Cost Aggregation (ACA) approach to amplify the contribution of consistent pixel pairs and suppress inconsistent ones. Unlike previous methods that solely fuse 2D features into descriptors our approach introduces a Spatial-View Aggregator (SVA) to incorporate 3D context into descriptors through spatial and inter-view interaction. When decoding the descriptors we observe the two existing decoding strategies excel in different areas which are complementary. A Consistency-Aware Fusion (CAF) strategy is proposed to leverage the advantages of both. We incorporate the above ACA SVA and CAF into a coarse-to-fine framework termed Geometry-aware Reconstruction and Fusion-refined Rendering (GeFu). GeFu attains state-of-the-art performance across multiple datasets. + + + + VideoBooth: Diffusion-based Video Generation with Image Prompts + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_VideoBooth_Diffusion-based_Video_Generation_with_Image_Prompts_CVPR_2024_paper.pdf + Text-driven video generation witnesses rapid progress. However merely using text prompts is not enough to depict the desired subject appearance that accurately aligns with users' intents especially for customized content creation. In this paper we study the task of video generation with image prompts which provide more accurate and direct content control beyond the text prompts. Specifically we propose a feed-forward framework VideoBooth with two dedicated designs: 1) We propose to embed image prompts in a coarse-to-fine manner. Coarse visual embeddings from image encoder provide high-level encodings of image prompts while fine visual embeddings from the proposed attention injection module provide multi-scale and detailed encoding of image prompts. These two complementary embeddings can faithfully capture the desired appearance. 2) In the attention injection module at fine level multi-scale image prompts are fed into different cross-frame attention layers as additional keys and values. This extra spatial information refines the details in the first frame and then it is propagated to the remaining frames which maintains temporal consistency. Extensive experiments demonstrate that VideoBooth achieves state-of-the-art performance in generating customized high-quality videos with subjects specified in image prompts. Notably VideoBooth is a generalizable framework where a single model works for a wide range of image prompts with only feed-forward passes. + + + + SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes + http://openaccess.thecvf.com//content/CVPR2024/papers/Sanyal_SCULPT_Shape-Conditioned_Unpaired_Learning_of_Pose-dependent_Clothed_and_Textured_Human_CVPR_2024_paper.pdf + We present SCULPT a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that there exist medium-sized 3D scan datasets like CAPE as well as large-scale 2D image datasets of clothed humans and multiple appearances can be mapped to a single geometry. To effectively learn from the two data modalities we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. Specifically we learn a pose-dependent geometry space from 3D scan data. 
We represent this as per vertex displacements w.r.t. the SMPL model. Next we train a geometry conditioned texture generator in an unsupervised way using the 2D image data. We use intermediate activations of the learned geometry model to condition our texture generator. To alleviate entanglement between pose and clothing type and pose and clothing appearance we condition both the texture and geometry generators with attribute labels such as clothing types for the geometry and clothing colors for the texture generator. We automatically generated these conditioning labels for the 2D images based on the visual question-answering model BLIP and CLIP. We validate our method on the SCULPT dataset and compare to state-of-the-art 3D generative models for clothed human bodies. Our code and data can be found at https://sculpt.is.tue.mpg.de. + + + + EasyDrag: Efficient Point-based Manipulation on Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Hou_EasyDrag_Efficient_Point-based_Manipulation_on_Diffusion_Models_CVPR_2024_paper.pdf + Generative models are gaining increasing popularity and the demand for precisely generating images is on the rise. However generating an image that perfectly aligns with users' expectations is extremely challenging. The shapes of objects the poses of animals the structures of landscapes and more may not match the user's desires and this applies to real images as well. This is where point-based image editing becomes essential. An excellent image editing method needs to meet the following criteria: user-friendly interaction high performance and good generalization capability. Due to the limitations of StyleGAN DragGAN exhibits limited robustness across diverse scenarios while DragDiffusion lacks user-friendliness due to the necessity of LoRA fine-tuning and masks. In this paper we introduce a novel interactive point-based image editing framework called EasyDrag that leverages pretrained diffusion models to achieve high-quality editing outcomes and user-friendship. Extensive experimentation demonstrates that our approach surpasses DragDiffusion in terms of both image quality and editing precision for point-based image manipulation tasks. + + + + InterHandGen: Two-Hand Interaction Generation via Cascaded Reverse Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_InterHandGen_Two-Hand_Interaction_Generation_via_Cascaded_Reverse_Diffusion_CVPR_2024_paper.pdf + We present InterHandGen a novel framework that learns the generative prior of two-hand interaction. Sampling from our model yields plausible and diverse two-hand shapes in close interaction with or without an object. Our prior can be incorporated into any optimization or learning methods to reduce ambiguity in an ill-posed setup. Our key observation is that directly modeling the joint distribution of multiple instances imposes high learning complexity due to its combinatorial nature. Thus we propose to decompose the modeling of joint distribution into the modeling of factored unconditional and conditional single instance distribution. In particular we introduce a diffusion model that learns the single-hand distribution unconditional and conditional to another hand via conditioning dropout. For sampling we combine anti-penetration and classifier-free guidance to enable plausible generation. Furthermore we establish the rigorous evaluation protocol of two-hand synthesis where our method significantly outperforms baseline generative models in terms of plausibility and diversity. 
We also demonstrate that our diffusion prior can boost the performance of two-hand reconstruction from monocular in-the-wild images achieving new state-of-the-art accuracy. + + + + Video2Game: Real-time Interactive Realistic and Browser-Compatible Environment from a Single Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_Video2Game_Real-time_Interactive_Realistic_and_Browser-Compatible_Environment_from_a_Single_CVPR_2024_paper.pdf + Creating high-quality and interactive virtual environments such as games and simulators often involves complex and costly manual modeling processes. In this paper we present Video2Game a novel approach that automatically converts videos of real-world scenes into realistic and interactive game environments. At the heart of our system are three core components: (i) a neural radiance fields (NeRF) module that effectively captures the geometry and visual appearance of the scene; (ii) a mesh module that distills the knowledge from NeRF for faster rendering; and (iii) a physics module that models the interactions and physical dynamics among the objects. By following the carefully designed pipeline one can construct an interactable and actionable digital replica of the real world. We benchmark our system on both indoor and large-scale outdoor scenes. We show that we can not only produce highly-realistic renderings in real-time but also build interactive games on top. + + + + Tackling the Singularities at the Endpoints of Time Intervals in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Tackling_the_Singularities_at_the_Endpoints_of_Time_Intervals_in_CVPR_2024_paper.pdf + Most diffusion models assume that the reverse process adheres to a Gaussian distribution. However this approximation has not been rigorously validated especially at singularities where t=0 and t=1. Improperly dealing with such singularities leads to an average brightness issue in applications and limits the generation of images with extreme brightness or darkness. We primarily focus on tackling singularities from both theoretical and practical perspectives. Initially we establish the error bounds for the reverse process approximation and showcase its Gaussian characteristics at singularity time steps. Based on this theoretical insight we confirm the singularity at t=1 is conditionally removable while it at t=0 is an inherent property. Upon these significant conclusions we propose a novel plug-and-play method SingDiffusion to address the initial singular time step sampling which not only effectively resolves the average brightness issue for a wide range of diffusion models without extra training efforts but also enhances their generation capability in achieving notable lower FID scores. + + + + CHAIN: Enhancing Generalization in Data-Efficient GANs via lipsCHitz continuity constrAIned Normalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Ni_CHAIN_Enhancing_Generalization_in_Data-Efficient_GANs_via_lipsCHitz_continuity_constrAIned_CVPR_2024_paper.pdf + Generative Adversarial Networks (GANs) significantly advanced image generation but their performance heavily depends on abundant training data. In scenarios with limited data GANs often struggle with discriminator overfitting and unstable training. Batch Normalization (BN) despite being known for enhancing generalization and training stability has rarely been used in the discriminator of Data-Efficient GANs. 
Our work addresses this gap by identifying a critical flaw in BN: the tendency for gradient explosion during the centering and scaling steps. To tackle this issue we present CHAIN (lipsCHitz continuity constrAIned Normalization) which replaces the conventional centering step with zero-mean regularization and integrates a Lipschitz continuity constraint in the scaling step. CHAIN further enhances GAN training by adaptively interpolating the normalized and unnormalized features effectively avoiding discriminator overfitting. Our theoretical analyses firmly establish CHAIN's effectiveness in reducing gradients in latent features and weights improving stability and generalization in GAN training. Empirical evidence supports our theory. CHAIN achieves state-of-the-art results in data-limited scenarios on CIFAR-10/100 ImageNet five low-shot and seven high-resolution few-shot image datasets. + + + + High-Quality Facial Geometry and Appearance Capture at Home http://openaccess.thecvf.com//content/CVPR2024/papers/Han_High-Quality_Facial_Geometry_and_Appearance_Capture_at_Home_CVPR_2024_paper.pdf Facial geometry and appearance capture have demonstrated tremendous success in 3D scanning real humans in studios. Recent works propose to democratize this technique while keeping the results high quality. However they are still inconvenient for daily usage. In addition they focus on an easier problem of only capturing facial skin. This paper proposes a novel method for high-quality face capture featuring an easy-to-use system and the capability to model the complete face with skin mouth interior hair and eyes. We reconstruct facial geometry and appearance from a single co-located smartphone flashlight sequence captured in a dim room where the flashlight is the dominant light source (e.g. rooms with curtains or at night). To model the complete face we propose a novel hybrid representation to effectively model both eyes and other facial regions along with novel techniques to learn it from images. We apply a combined lighting model to compactly represent real illuminations and exploit a morphable face albedo model as a reflectance prior to disentangle diffuse and specular. Experiments show that our method can capture high-quality 3D relightable scans. Our code will be released. + + + + Your Image is My Video: Reshaping the Receptive Field via Image-To-Video Differentiable AutoAugmentation and Fusion http://openaccess.thecvf.com//content/CVPR2024/papers/Casarin_Your_Image_is_My_Video_Reshaping_the_Receptive_Field_via_CVPR_2024_paper.pdf The landscape of deep learning research is moving towards innovative strategies to harness the true potential of data. Traditionally emphasis has been on scaling model architectures resulting in large and complex neural networks which can be difficult to train with limited computational resources. However independently of the model size data quality (i.e. amount and variability) is still a major factor that affects model generalization. In this work we propose a novel technique to exploit available data through the use of automatic data augmentation for the tasks of image classification and semantic segmentation. We introduce the first Differentiable Augmentation Search method (DAS) to generate variations of images that can be processed as videos. Compared to previous approaches DAS is extremely fast and flexible allowing the search on very large search spaces in less than a GPU day.
Our intuition is that the increased receptive field in the temporal dimension provided by DAS could lead to benefits also to the spatial receptive field. More specifically we leverage DAS to guide the reshaping of the spatial receptive field by selecting task-dependant transformations. As a result compared to standard augmentation alternatives we improve in terms of accuracy on ImageNet Cifar10 Cifar100 Tiny-ImageNet Pascal-VOC-2012 and CityScapes datasets when plugging-in our DAS over different light-weight video backbones. + + + + SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_SpikingResformer_Bridging_ResNet_and_Vision_Transformer_in_Spiking_Neural_Networks_CVPR_2024_paper.pdf + The remarkable success of Vision Transformers in Artificial Neural Networks (ANNs) has led to a growing interest in incorporating the self-attention mechanism and transformer-based architecture into Spiking Neural Networks (SNNs). While existing methods propose spiking self-attention mechanisms that are compatible with SNNs they lack reasonable scaling methods and the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting local features. To address these challenges we propose a novel spiking self-attention mechanism named Dual Spike Self-Attention (DSSA) with a reasonable scaling method. Based on DSSA we propose a novel spiking Vision Transformer architecture called SpikingResformer which combines the ResNet-based multi-stage architecture with our proposed DSSA to improve both performance and energy efficiency while reducing parameters. Experimental results show that SpikingResformer achieves higher accuracy with fewer parameters and lower energy consumption than other spiking Vision Transformer counterparts. Notably our SpikingResformer-L achieves 79.40% top-1 accuracy on ImageNet with 4 time-steps which is the state-of-the-art result in the SNN field. + + + + Self-Supervised Dual Contouring + http://openaccess.thecvf.com//content/CVPR2024/papers/Sundararaman_Self-Supervised_Dual_Contouring_CVPR_2024_paper.pdf + Learning-based isosurface extraction methods have recently emerged as a robust and efficient alternative to axiomatic techniques. However the vast majority of such approaches rely on supervised training with axiomatically computed ground truths thus potentially inheriting biases and data artefacts of the corresponding axiomatic methods. Steering away from such dependencies we propose a self-supervised training scheme to the Neural Dual Contouring meshing framework resulting in our method: Self-Supervised Dual Contouring (SDC). Instead of optimizing predicted mesh vertices with supervised training we use two novel self-supervised loss functions that encourage the consistency between distances to the generated mesh up to the first order. Meshes reconstructed by SDC surpass existing data-driven methods in capturing intricate details while being more robust to possible irregularities in the input. Furthermore we use the same self-supervised training objective linking inferred mesh and input SDF to regularize the training process of Deep Implicit Networks (DINs). We demonstrate that the resulting DINs produce higher-quality implicit functions ultimately leading to more accurate and detail-preserving surfaces compared to prior baselines for different input modalities. 
Finally we demonstrate that our self-supervised losses improve meshing performance in the single-view reconstruction task by enabling joint training of predicted SDF and resulting output mesh. + + + + GSVA: Generalized Segmentation via Multimodal Large Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_GSVA_Generalized_Segmentation_via_Multimodal_Large_Language_Models_CVPR_2024_paper.pdf + Generalized Referring Expression Segmentation (GRES) extends the scope of classic RES to refer to multiple objects in one expression or identify the empty targets absent in the image. GRES poses challenges in modeling the complex spatial relationships of the instances in the image and identifying non-existing referents. Multimodal Large Language Models (MLLMs) have recently shown tremendous progress in these complicated vision-language tasks. Connecting Large Language Models (LLMs) and vision models MLLMs are proficient in understanding contexts with visual inputs. Among them LISA as a representative adopts a special [SEG] token to prompt a segmentation mask decoder e.g. SAM to enable MLLMs in the RES task. However existing solutions to GRES remain unsatisfactory since current segmentation MLLMs cannot correctly handle the cases where users might reference multiple subjects in a singular prompt or provide descriptions incongruent with any image target. In this paper we propose Generalized Segmentation Vision Assistant (GSVA) to address this gap. Specifically GSVA reuses the [SEG] token to prompt the segmentation model towards supporting multiple mask references simultaneously and innovatively learns to generate a [REJ] token to reject the targets explicitly. Experiments validate GSVA's efficacy in resolving the GRES issue marking a notable enhancement and setting a new record on the GRES benchmark gRefCOCO dataset. GSVA also proves effective across various classic referring segmentation and comprehension tasks. + + + + AdaBM: On-the-Fly Adaptive Bit Mapping for Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Hong_AdaBM_On-the-Fly_Adaptive_Bit_Mapping_for_Image_Super-Resolution_CVPR_2024_paper.pdf + Although image super-resolution (SR) problem has experienced unprecedented restoration accuracy with deep neural networks it has yet limited versatile applications due to the substantial computational costs. Since different input images for SR face different restoration difficulties adapting computational costs based on the input image referred to as adaptive inference has emerged as a promising solution to compress SR networks. Specifically adapting the quantization bit-widths has successfully reduced the inference and memory cost without sacrificing the accuracy. However despite the benefits of the resultant adaptive network existing works rely on time-intensive quantization-aware training with full access to the original training pairs to learn the appropriate bit allocation policies which limits its ubiquitous usage. To this end we introduce the first on-the-fly adaptive quantization framework that accelerates the processing time from hours to seconds. We formulate the bit allocation problem with only two bit mapping modules: one to map the input image to the image-wise bit adaptation factor and one to obtain the layer-wise adaptation factors. These bit mappings are calibrated and fine-tuned using only a small number of calibration images. 
We achieve competitive performance with the previous adaptive quantization methods while the processing time is accelerated by x2000. Codes are available at https://github.com/Cheeun/AdaBM. + + + + SVGDreamer: Text Guided SVG Generation with Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Xing_SVGDreamer_Text_Guided_SVG_Generation_with_Diffusion_Model_CVPR_2024_paper.pdf + Recently text-guided scalable vector graphics (SVGs) synthesis has shown promise in domains such as iconography and sketch. However existing text-to-SVG generation methods lack editability and struggle with visual quality and result diversity. To address these limitations we propose a novel text-guided vector graphics synthesis method called SVGDreamer. SVGDreamer incorporates a semantic-driven image vectorization (SIVE) process that enables the decomposition of synthesis into foreground objects and background thereby enhancing editability. Specifically the SIVE process introduces attention-based primitive control and an attention-mask loss function for effective control and manipulation of individual elements. Additionally we propose a Vectorized Particle-based Score Distillation (VPSD) approach to address issues of shape over-smoothing color over-saturation limited diversity and slow convergence of the existing text-to-SVG generation methods by modeling SVGs as distributions of control points and colors. Furthermore VPSD leverages a reward model to re-weight vector particles which improves aesthetic appeal and accelerates convergence. Extensive experiments are conducted to validate the effectiveness of SVGDreamer demonstrating its superiority over baseline methods in terms of editability visual quality and diversity. Project page: \href https://ximinng.github.io/SVGDreamer-project/ https://ximinng.github.io/SVGDreamer-project/ + + + + BlockGCN: Redefine Topology Awareness for Skeleton-Based Action Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_BlockGCN_Redefine_Topology_Awareness_for_Skeleton-Based_Action_Recognition_CVPR_2024_paper.pdf + Graph Convolutional Networks (GCNs) have long set the state-of-the-art in skeleton-based action recognition leveraging their ability to unravel the complex dynamics of human joint topology through the graph's adjacency matrix. However an inherent flaw has come to light in these cutting-edge models: they tend to optimize the adjacency matrix jointly with the model weights. This process while seemingly efficient causes a gradual decay of bone connectivity data resulting in a model indifferent to the very topology it sought to represent. To remedy this we propose a two-fold strategy: (1) We introduce an innovative approach that encodes bone connectivity by harnessing the power of graph distances to describe the physical topology; we further incorporate action-specific topological representation via persistent homology analysis to depict systemic dynamics. This preserves the vital topological nuances often lost in conventional GCNs. (2) Our investigation also reveals the redundancy in existing GCNs for multi-relational modeling which we address by proposing an efficient refinement to Graph Convolutions (GC) - the BlockGC. This significantly reduces parameters while improving performance beyond original GCNs. Our full model BlockGCN establishes new benchmarks in skeleton-based action recognition across all model categories. 
Its high accuracy and lightweight design most notably on the large-scale NTU RGB+D 120 dataset stand as strong validation of the efficacy of BlockGCN. + + + + Structure-Guided Adversarial Training of Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Structure-Guided_Adversarial_Training_of_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion models have demonstrated exceptional efficacy in various generative applications. While existing models focus on minimizing a weighted sum of denoising score matching losses for data distribution modeling their training primarily emphasizes instance-level optimization overlooking valuable structural information within each mini-batch indicative of pair-wise relationships among samples. To address this limitation we introduce Structure-guided Adversarial training of Diffusion Models (SADM). In this pioneering approach we compel the model to learn manifold structures between samples in each training batch. To ensure the model captures authentic manifold structures in the data distribution we advocate adversarial training of the diffusion generator against a novel structure discriminator in a minimax game distinguishing real manifold structures from the generated ones. SADM substantially outperforms existing methods in image generation and cross-domain fine-tuning tasks across 12 datasets establishing a new state-of-the-art FID of 1.58 and 2.11 on ImageNet for class-conditional image generation at resolutions of 256x256 and 512x512 respectively. + + + + NIFTY: Neural Object Interaction Fields for Guided Human Motion Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Kulkarni_NIFTY_Neural_Object_Interaction_Fields_for_Guided_Human_Motion_Synthesis_CVPR_2024_paper.pdf + We address the problem of generating realistic 3D motions of humans interacting with objects in a scene. Our key idea is to create a neural interaction field attached to a specific object which outputs the distance to the valid interaction manifold given a human pose as input. This interaction field guides the sampling of an object-conditioned human motion diffusion model so as to encourage plausible contacts and affordance semantics. To support interactions with scarcely available data we propose an automated synthetic data pipeline. For this we seed a pre-trained motion model which has priors for the basics of human movement with interaction-specific anchor poses extracted from limited motion capture data. Using our guided diffusion model trained on generated synthetic data we synthesize realistic motions for sitting and lifting with several objects outperforming alternative approaches in terms of motion quality and successful action completion. We call our framework NIFTY: Neural Interaction Fields for Trajectory sYnthesis. + + + + Can Language Beat Numerical Regression? Language-Based Multimodal Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Bae_Can_Language_Beat_Numerical_Regression_Language-Based_Multimodal_Trajectory_Prediction_CVPR_2024_paper.pdf + Language models have demonstrated impressive ability in context understanding and generative performance. Inspired by the recent success of language foundation models in this paper we propose LMTraj (Language-based Multimodal Trajectory predictor) which recasts the trajectory prediction task into a sort of question-answering problem. 
Departing from traditional numerical regression models which treat the trajectory coordinate sequence as continuous signals we consider them as discrete signals like text prompts. Specially we first transform an input space for the trajectory coordinate into the natural language space. Here the entire time-series trajectories of pedestrians are converted into a text prompt and scene images are described as text information through image captioning. The transformed numerical and image data are then wrapped into the question-answering template for use in a language model. Next to guide the language model in understanding and reasoning high-level knowledge such as scene context and social relationships between pedestrians we introduce an auxiliary multi-task question and answering. We then train a numerical tokenizer with the prompt data. We encourage the tokenizer to separate the integer and decimal parts well and leverage it to capture correlations between the consecutive numbers in the language model. Lastly we train the language model using the numerical tokenizer and all of the question-answer prompts. Here we propose a beam-search-based most-likely prediction and a temperature-based multimodal prediction to implement both deterministic and stochastic inferences. Applying our LMTraj we show that the language-based model can be a powerful pedestrian trajectory predictor and outperforms existing numerical-based predictor methods. Extensive experiments show that our LMTraj can successfully understand social relationships and accurately extrapolate the multimodal futures on the public pedestrian trajectory prediction benchmark. Code is publicly available at https://github.com/inhwanbae/LMTrajectory. + + + + Building Optimal Neural Architectures using Interpretable Knowledge + http://openaccess.thecvf.com//content/CVPR2024/papers/Mills_Building_Optimal_Neural_Architectures_using_Interpretable_Knowledge_CVPR_2024_paper.pdf + Neural Architecture Search is a costly practice. The fact that a search space can span a vast number of design choices with each architecture evaluation taking nontrivial overhead makes it hard for an algorithm to sufficiently explore candidate networks. In this paper we propose AutoBuild a scheme which learns to align the latent embeddings of operations and architecture modules with the ground-truth performance of the architectures they appear in. By doing so AutoBuild is capable of assigning interpretable importance scores to architecture modules such as individual operation features and larger macro operation sequences such that high-performance neural networks can be constructed without any need for search. Through experiments performed on state-of-the-art image classification segmentation and Stable Diffusion models we show that by mining a relatively small set of evaluated architectures AutoBuild can learn to build high-quality architectures directly or help to reduce search space to focus on relevant areas finding better architectures that outperform both the original labeled ones and ones found by search baselines. Code available at https://github.com/Ascend-Research/AutoBuild + + + + Holo-Relighting: Controllable Volumetric Portrait Relighting from a Single Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Mei_Holo-Relighting_Controllable_Volumetric_Portrait_Relighting_from_a_Single_Image_CVPR_2024_paper.pdf + At the core of portrait photography is the search for ideal lighting and viewpoint. 
The process often requires advanced knowledge in photography and an elaborate studio setup. In this work we propose Holo-Relighting a volumetric relighting method that is capable of synthesizing novel viewpoints and novel lighting from a single image. Holo-Relighting leverages the pretrained 3D GAN (EG3D) to reconstruct geometry and appearance from an input portrait as a set of 3D-aware features. We design a relighting module conditioned on a given lighting to process these features and predict a relit 3D representation in the form of a tri-plane which can render to an arbitrary viewpoint through volume rendering. Besides viewpoint and lighting control Holo-Relighting also takes the head pose as a condition to enable head-pose-dependent lighting effects. With these novel designs Holo-Relighting can generate complex non-Lambertian lighting effects (e.g. specular highlights and cast shadows) without using any explicit physical lighting priors. We train Holo-Relighting with data captured with a light stage and propose two data-rendering techniques to improve the data quality for training the volumetric relighting system. Through quantitative and qualitative experiments we demonstrate Holo-Relighting can achieve state-of-the-art relighting quality with better photorealism 3D consistency and controllability. + + + + Noisy One-point Homographies are Surprisingly Good + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_Noisy_One-point_Homographies_are_Surprisingly_Good_CVPR_2024_paper.pdf + Two-view homography estimation is a classic and fundamental problem in computer vision. While conceptually simple the problem quickly becomes challenging when multiple planes are visible in the image pair. Even with correct matches each individual plane (homography) might have a very low number of inliers when compared to the set of all correspondences. In practice this requires a large number of RANSAC iterations to generate a good model hypothesis. The current state-of-the-art methods therefore seek to reduce the sample size from four point correspondences originally by including additional information such as keypoint orientation/angles or local affine information. In this work we continue in this direction and propose a novel one-point solver that leverages different approximate constraints derived from the same auxiliary information. In experiments we obtain state-of-the-art results with execution time speed-ups on large benchmark datasets and show that it is more beneficial for the solver to be sample efficient compared to generating more accurate homographies. + + + + Panacea: Panoramic and Controllable Video Generation for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Wen_Panacea_Panoramic_and_Controllable_Video_Generation_for_Autonomous_Driving_CVPR_2024_paper.pdf + The field of autonomous driving increasingly demands high-quality annotated training data. In this paper we propose Panacea an innovative approach to generate panoramic and controllable videos in driving scenarios capable of yielding an unlimited number of diverse annotated samples pivotal for autonomous driving advancements. Panacea addresses two critical challenges: 'Consistency' and 'Controllability.' Consistency ensures temporal and cross-view coherence while Controllability ensures the alignment of generated content with corresponding annotations.
Our approach integrates a novel 4D attention and a two-stage generation pipeline to maintain coherence supplemented by the ControlNet framework for meticulous control by the Bird's-Eye-View (BEV) layouts. Extensive qualitative and quantitative evaluations of Panacea on the nuScenes dataset prove its effectiveness in generating high-quality multi-view driving-scene videos. This work notably propels the field of autonomous driving by effectively augmenting the training dataset used for advanced BEV perception techniques. + + + + DreamMatcher: Appearance Matching Self-Attention for Semantically-Consistent Text-to-Image Personalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Nam_DreamMatcher_Appearance_Matching_Self-Attention_for_Semantically-Consistent_Text-to-Image_Personalization_CVPR_2024_paper.pdf + The objective of text-to-image (T2I) personalization is to customize a diffusion model to a user-provided reference concept generating diverse images of the concept aligned with the target prompts. Conventional methods representing the reference concepts using unique text embeddings often fail to accurately mimic the appearance of the reference. To address this one solution may be explicitly conditioning the reference images into the target denoising process known as key-value replacement. However prior works are constrained to local editing since they disrupt the structure path of the pre-trained T2I model. To overcome this we propose a novel plug-in method called DreamMatcher which reformulates T2I personalization as semantic matching. Specifically DreamMatcher replaces the target values with reference values aligned by semantic matching while leaving the structure path unchanged to preserve the versatile capability of pre-trained T2I models for generating diverse structures. We also introduce a semantic-consistent masking strategy to isolate the personalized concept from irrelevant regions introduced by the target prompts. Compatible with existing T2I models DreamMatcher shows significant improvements in complex scenarios. Intensive analyses demonstrate the effectiveness of our approach. + + + + PolarMatte: Fully Computational Ground-Truth-Quality Alpha Matte Extraction for Images and Video using Polarized Screen Matting + http://openaccess.thecvf.com//content/CVPR2024/papers/Enomoto_PolarMatte_Fully_Computational_Ground-Truth-Quality_Alpha_Matte_Extraction_for_Images_and_CVPR_2024_paper.pdf + The creation of high-quality alpha mattes as ground-truth data for video matting is typically a laborious task. The trade-off between accuracy manual corrections and capture constraints often produces erroneous results or is cost prohibitive. We propose PolarMatte a fully computational alpha matte extraction method for images and video without compromise between quality and practicality. A single polarization camera is used to capture dynamic scenes backlit by an off-the-shelf LCD monitor. PolarMatte exploits the polarization channel to compute the per-pixel opacity of the target scene including the transparency of fine-details translucent objects and optical/motion blur. We leverage polarization clues to robustly detect indistinguishable pixels and extract the alpha matte value at polarized foreground reflections with a polarimetric matting Laplacian. Quantitative and qualitative evaluation demonstrate our ability to computationally extract ground-truth-quality alpha mattes without human labour. 
+ + + + HOIDiffusion: Generating Realistic 3D Hand-Object Interaction Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_HOIDiffusion_Generating_Realistic_3D_Hand-Object_Interaction_Data_CVPR_2024_paper.pdf + 3D hand-object interaction data is scarce due to the hardware constraints in scaling up the data collection process. In this paper we propose HOIDiffusion for generating realistic and diverse 3D hand-object interaction data. Our model is a conditional diffusion model that takes both the 3D hand-object geometric structure and text description as inputs for image synthesis. This offers a more controllable and realistic synthesis as we can specify the structure and style inputs in a disentangled manner. HOIDiffusion is trained by leveraging a diffusion model pre-trained on large-scale natural images and a few 3D human demonstrations. Beyond controllable image synthesis we adopt the generated 3D data for learning 6D object pose estimation and show its effectiveness in improving perception systems. Project page: https://mq-zhang1.github.io/HOIDiffusion. + + + + VecFusion: Vector Font Generation with Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Thamizharasan_VecFusion_Vector_Font_Generation_with_Diffusion_CVPR_2024_paper.pdf + We present VecFusion a new neural architecture that can generate vector fonts with varying topological structures and precise control point positions. Our approach is a cascaded diffusion model which consists of a raster diffusion model followed by a vector diffusion model. The raster model generates low-resolution rasterized fonts with auxiliary control point information capturing the global style and shape of the font while the vector model synthesizes vector fonts conditioned on the low-resolution raster fonts from the first stage. To synthesize long and complex curves our vector diffusion model uses a transformer architecture and a novel vector representation that enables the modeling of diverse vector geometry and the precise prediction of control points. Our experiments show that in contrast to previous generative models for vector graphics our new cascaded vector diffusion model generates higher quality vector fonts with complex structures and diverse styles. + + + + Towards Text-guided 3D Scene Composition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Towards_Text-guided_3D_Scene_Composition_CVPR_2024_paper.pdf + We are witnessing significant breakthroughs in the technology for generating 3D objects from text. Existing approaches either leverage large text-to-image models to optimize a 3D representation or train 3D generators on object-centric datasets. Generating entire scenes however remains very challenging as a scene contains multiple 3D objects diverse and scattered. In this work we introduce SceneWiz3D - a novel approach to synthesize high-fidelity 3D scenes from text. We marry the locality of objects with globality of scenes by introducing a hybrid 3D representation - explicit for objects and implicit for scenes. Remarkably an object being represented explicitly can be either generated from text using conventional text-to-3D approaches or provided by users. To configure the layout of the scene and automatically place objects we apply the Particle Swarm Optimization technique during the optimization process. Furthermore it is difficult for certain parts of the scene (e.g. corners occlusion) to receive multi-view supervision leading to inferior geometry. 
We incorporate an RGBD panorama diffusion model to mitigate it resulting in high-quality geometry. Extensive evaluation supports that our approach achieves superior quality over previous approaches enabling the generation of detailed and view-consistent 3D scenes. Our project website is at https://zqh0253.github.io/SceneWiz3D.\\ + + + + EMAGE: Towards Unified Holistic Co-Speech Gesture Generation via Expressive Masked Audio Gesture Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_EMAGE_Towards_Unified_Holistic_Co-Speech_Gesture_Generation_via_Expressive_Masked_CVPR_2024_paper.pdf + We propose EMAGE a framework to generate full-body human gestures from audio and masked gestures encompassing facial local body hands and global movements. To achieve this we first introduce BEAT2 (BEAT-SMPLX-FLAME) a new mesh-level holistic co-speech dataset. BEAT2 combines a MoShed SMPL-X body with FLAME head parameters and further refines the modeling of head neck and finger movements offering a community-standardized high-quality 3D motion captured dataset. EMAGE leverages masked body gesture priors during training to boost inference performance. It involves a Masked Audio Gesture Transformer facilitating joint training on audio-to-gesture generation and masked gesture reconstruction to effectively encode audio and body gesture hints. Encoded body hints from masked gestures are then separately employed to generate facial and body movements. Moreover EMAGE adaptively merges speech features from the audio's rhythm and content and utilizes four compositional VQ-VAEs to enhance the results' fidelity and diversity. Experiments demonstrate that EMAGE generates holistic gestures with state-of-the-art performance and is flexible in accepting predefined spatial-temporal gesture inputs generating complete audio-synchronized results. Our code and dataset are available. https://pantomatrix.github.io/EMAGE/ + + + + Adversarial Text to Continuous Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Haydarov_Adversarial_Text_to_Continuous_Image_Generation_CVPR_2024_paper.pdf + Existing GAN-based text-to-image models treat images as 2D pixel arrays. In this paper we approach the text-to-image task from a different perspective where a 2D image is represented as an implicit neural representation (INR). We show that straightforward conditioning of the unconditional INR-based GAN method on text inputs is not enough to achieve good performance. We propose a word-level attention-based weight modulation operator that controls the generation process of INR-GAN based on hypernetworks. Our experiments on benchmark datasets show that HyperCGAN achieves competitive performance to existing pixel-based methods and retains the properties of continuous generative models. + + + + HumanNeRF-SE: A Simple yet Effective Approach to Animate HumanNeRF with Diverse Poses + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_HumanNeRF-SE_A_Simple_yet_Effective_Approach_to_Animate_HumanNeRF_with_CVPR_2024_paper.pdf + We present HumanNeRF-SE a simple yet effective method that synthesizes diverse novel pose images with simple input. Previous HumanNeRF works require a large number of optimizable parameters to fit the human images. Instead we reload these approaches by combining explicit and implicit human representations to design both generalized rigid deformation and specific non-rigid deformation. 
Our key insight is that explicit shape can reduce the sampling points used to fit implicit representation and frozen blending weights from SMPL constructing a generalized rigid deformation can effectively avoid overfitting and improve pose generalization performance. Our architecture involving both explicit and implicit representation is simple yet effective. Experiments demonstrate our model can synthesize images under arbitrary poses with few-shot input and increase the speed of synthesizing images by 15 times through a reduction in computational complexity without using any existing acceleration modules. Compared to the state-of-the-art HumanNeRF studies HumanNeRF-SE achieves better performance with fewer learnable parameters and less training time. + + + + HOLD: Category-agnostic 3D Reconstruction of Interacting Hands and Objects from Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_HOLD_Category-agnostic_3D_Reconstruction_of_Interacting_Hands_and_Objects_from_CVPR_2024_paper.pdf + Since humans interact with diverse objects every day the holistic 3D capture of these interactions is important to understand and model human behaviour. However most existing methods for hand-object reconstruction from RGB either assume pre-scanned object templates or heavily rely on limited 3D hand-object data restricting their ability to scale and generalize to more unconstrained interaction settings. To address this we introduce HOLD -- the first category-agnostic method that reconstructs an articulated hand and an object jointly from a monocular interaction video. We develop a compositional articulated implicit model that can reconstruct disentangled 3D hands and objects from 2D images. We also further incorporate hand-object constraints to improve hand-object poses and consequently the reconstruction quality. Our method does not rely on any 3D hand-object annotations while significantly outperforming fully-supervised baselines in both in-the-lab and challenging in-the-wild settings. Moreover we qualitatively show its robustness in reconstructing from in-the-wild videos. See https://github.com/zc-alexfan/hold for code data models and updates. + + + + Continual Segmentation with Disentangled Objectness Learning and Class Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Gong_Continual_Segmentation_with_Disentangled_Objectness_Learning_and_Class_Recognition_CVPR_2024_paper.pdf + Most continual segmentation methods tackle the problem as a per-pixel classification task. However such a paradigm is very challenging and we find query-based segmenters with built-in objectness have inherent advantages compared with per-pixel ones as objectness has strong transfer ability and forgetting resistance. Based on these findings we propose CoMasTRe by disentangling continual segmentation into two stages: forgetting-resistant continual objectness learning and well-researched continual classification. CoMasTRe uses a two-stage segmenter learning class-agnostic mask proposals at the first stage and leaving recognition to the second stage. During continual learning a simple but effective distillation is adopted to strengthen objectness. To further mitigate the forgetting of old classes we design a multi-label class distillation strategy suited for segmentation. We assess the effectiveness of CoMasTRe on PASCAL VOC and ADE20K. Extensive experiments show that our method outperforms per-pixel and query-based methods on both datasets. 
Code will be available at https://github.com/jordangong/CoMasTRe. + + + + ASAM: Boosting Segment Anything Model with Adversarial Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_ASAM_Boosting_Segment_Anything_Model_with_Adversarial_Tuning_CVPR_2024_paper.pdf + In the evolving landscape of computer vision foundation models have emerged as pivotal tools exhibiting exceptional adaptability to a myriad of tasks. Among these the Segment Anything Model (SAM) by Meta AI has distinguished itself in image segmentation. However SAM like its counterparts encounters limitations in specific niche applications prompting a quest for enhancement strategies that do not compromise its inherent capabilities. This paper introduces ASAM a novel methodology that amplifies SAM's performance through adversarial tuning. We harness the potential of natural adversarial examples inspired by their successful implementation in natural language processing. By utilizing a stable diffusion model we augment a subset (1%) of the SA-1B dataset generating adversarial instances that are more representative of natural variations rather than conventional imperceptible perturbations. Our approach maintains the photorealism of adversarial examples and ensures alignment with original mask annotations thereby preserving the integrity of the segmentation task. The fine-tuned ASAM demonstrates significant improvements across a diverse range of segmentation tasks without necessitating additional data or architectural modifications. The results of our extensive evaluations confirm that ASAM establishes new benchmarks in segmentation tasks thereby contributing to the advancement of foundational models in computer vision. Our project page is in https://asam2024.github.io/. + + + + Dynamic Support Information Mining for Category-Agnostic Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_Dynamic_Support_Information_Mining_for_Category-Agnostic_Pose_Estimation_CVPR_2024_paper.pdf + Category-agnostic pose estimation (CAPE) aims to predict the pose of a query image based on few support images with pose annotations. Existing methods achieve the localization of arbitrary keypoints through similarity matching between support keypoint features and query image features. However these methods primarily focus on mining information from the query images neglecting the fact that support samples with keypoint annotations contain rich category-specific fine-grained semantic information and prior structural information. In this paper we propose a Support-based Dynamic Perception Network (SDPNet) for the robust and accurate CAPE. On the one hand SDPNet models complex dependencies between support keypoints constructing category-specific prior structure to guide the interaction of query keypoints. On the other hand SDPNet extracts fine-grained semantic information from support samples dynamically modulating the refinement process of query. Our method outperforms existing methods on MP-100 dataset by a large margin. + + + + Taming Mode Collapse in Score Distillation for Text-to-3D Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Taming_Mode_Collapse_in_Score_Distillation_for_Text-to-3D_Generation_CVPR_2024_paper.pdf + Despite the remarkable performance of score distillation in text-to-3D generation such techniques notoriously suffer from view inconsistency issues also known as "Janus" artifact where the generated objects fake each view with multiple front faces. 
Although empirically effective methods have approached this problem via score debiasing or prompt engineering a more rigorous perspective to explain and tackle this problem remains elusive. In this paper we reveal that the existing score distillation-based text-to-3D generation frameworks degenerate to maximal likelihood seeking on each view independently and thus suffer from the mode collapse problem manifesting as the Janus artifact in practice. To tame mode collapse we improve score distillation by re-establishing the entropy term in the corresponding variational objective which is applied to the distribution of rendered images. Maximizing the entropy encourages diversity among different views in generated 3D assets thereby mitigating the Janus problem. Based on this new objective we derive a new update rule for 3D score distillation dubbed Entropic Score Distillation (ESD). We theoretically reveal that ESD can be simplified and implemented by just adopting the classifier-free guidance trick upon variational score distillation. Although embarrassingly straightforward our extensive experiments demonstrate that ESD can be an effective treatment for Janus artifacts in score distillation. + + + + MagicAnimate: Temporally Consistent Human Image Animation using Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_MagicAnimate_Temporally_Consistent_Human_Image_Animation_using_Diffusion_Model_CVPR_2024_paper.pdf + This paper studies the human image animation task which aims to generate a video of a certain reference identity following a particular motion sequence. Existing animation works typically employ the frame-warping technique to animate the reference image towards the target motion. Despite achieving reasonable results these approaches face challenges in maintaining temporal consistency throughout the animation due to the lack of temporal modeling and poor preservation of reference identity. In this work we introduce MagicAnimate a diffusion-based framework that aims at enhancing temporal consistency preserving reference image faithfully and improving animation fidelity. To achieve this we first develop a video diffusion model to encode temporal information. Second to maintain the appearance coherence across frames we introduce a novel appearance encoder to retain the intricate details of the reference image. Leveraging these two innovations we further employ a simple video fusion technique to encourage smooth transitions for long video animation. Empirical results demonstrate the superiority of our method over baseline approaches on two benchmarks. Notably our approach outperforms the strongest baseline by over 38% in terms of video fidelity on the challenging TikTok dancing dataset. Code and model will be made available at https://showlab.github.io/magicanimate. + + + + From Correspondences to Pose: Non-minimal Certifiably Optimal Relative Pose without Disambiguation + http://openaccess.thecvf.com//content/CVPR2024/papers/Tirado-Garin_From_Correspondences_to_Pose_Non-minimal_Certifiably_Optimal_Relative_Pose_without_CVPR_2024_paper.pdf + Estimating the relative camera pose from n \geq 5 correspondences between two calibrated views is a fundamental task in computer vision. This process typically involves two stages: 1) estimating the essential matrix between the views and 2) disambiguating among the four candidate relative poses that satisfy the epipolar geometry. In this paper we demonstrate a novel approach that for the first time bypasses the second stage. 
Specifically we show that it is possible to directly estimate the correct relative camera pose from correspondences without needing a post-processing step to enforce the cheirality constraint on the correspondences. Building on recent advances in certifiable non-minimal optimization we frame the relative pose estimation as a Quadratically Constrained Quadratic Program (QCQP). By applying the appropriate constraints we ensure the estimation of a camera pose that corresponds to a valid 3D geometry and that is globally optimal when certified. We validate our method through exhaustive synthetic and real-world experiments confirming the efficacy efficiency and accuracy of the proposed approach. Code is available at https://github.com/javrtg/C2P. + + + + Loose Inertial Poser: Motion Capture with IMU-attached Loose-Wear Jacket + http://openaccess.thecvf.com//content/CVPR2024/papers/Zuo_Loose_Inertial_Poser_Motion_Capture_with_IMU-attached_Loose-Wear_Jacket_CVPR_2024_paper.pdf + Existing wearable motion capture methods typically demand tight on-body fixation (often using straps) for reliable sensing limiting their application in everyday life. In this paper we introduce Loose Inertial Poser a novel motion capture solution with high wearing comfortableness by integrating four Inertial Measurement Units (IMUs) into a loose-wear jacket. Specifically we address the challenge of scarce loose-wear IMU training data by proposing a Secondary Motion AutoEncoder (SeMo-AE) that learns to model and synthesize the effects of secondary motion between the skin and loose clothing on IMU data. SeMo-AE is leveraged to generate a diverse synthetic dataset of loose-wear IMU data to augment training for the pose estimation network and significantly improve its accuracy. For validation we collected a dataset with various subjects and 2 wearing styles (zipped and unzipped). Experimental results demonstrate that our approach maintains high-quality real-time posture estimation even in loose-wear scenarios. + + + + Training-Free Pretrained Model Merging + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Training-Free_Pretrained_Model_Merging_CVPR_2024_paper.pdf + Recently model merging techniques have surfaced as a solution to combine multiple single-talent models into a single multi-talent model. However previous endeavors in this field have either necessitated additional training or fine-tuning processes or require that the models possess the same pre-trained initialization. In this work we identify a common drawback in prior works w.r.t. the inconsistency of unit similarity in the weight space and the activation space. To address this inconsistency we propose an innovative model merging framework coined as merging under dual-space constraints (MuDSC). Specifically instead of solely maximizing the objective of a single space we advocate for the exploration of permutation matrices situated in a region with a unified high similarity in the dual space achieved through the linear combination of activation and weight similarity matrices. In order to enhance usability we have also incorporated adaptations for group structure including Multi-Head Attention and Group Normalization. Comprehensive experimental comparisons demonstrate that MuDSC can significantly boost the performance of merged models with various task combinations and architectures. 
Furthermore the visualization of the merged model within the multi-task loss landscape reveals that MuDSC enables the merged model to reside in the overlapping segment featuring a unified lower loss for each task. Our code is publicly available at https://github.com/zju-vipa/training_free_model_merging. + + + + NC-SDF: Enhancing Indoor Scene Reconstruction Using Neural SDFs with View-Dependent Normal Compensation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_NC-SDF_Enhancing_Indoor_Scene_Reconstruction_Using_Neural_SDFs_with_View-Dependent_CVPR_2024_paper.pdf + State-of-the-art neural implicit surface representations have achieved impressive results in indoor scene reconstruction by incorporating monocular geometric priors as additional supervision. However we have observed that multi-view inconsistency between such priors poses a challenge for high-quality reconstructions. In response we present NC-SDF a neural signed distance field (SDF) 3D reconstruction framework with view-dependent normal compensation (NC). Specifically we integrate view-dependent biases in monocular normal priors into the neural implicit representation of the scene. By adaptively learning and correcting the biases our NC-SDF effectively mitigates the adverse impact of inconsistent supervision enhancing both the global consistency and local details in the reconstructions. To further refine the details we introduce an informative pixel sampling strategy to pay more attention to intricate geometry with higher information content. Additionally we design a hybrid geometry modeling approach to improve the neural implicit representation. Experiments on synthetic and real-world datasets demonstrate that NC-SDF outperforms existing approaches in terms of reconstruction quality. + + + + Person in Place: Generating Associative Skeleton-Guidance Maps for Human-Object Interaction Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Person_in_Place_Generating_Associative_Skeleton-Guidance_Maps_for_Human-Object_Interaction_CVPR_2024_paper.pdf + Recently there were remarkable advances in image editing tasks in various ways. Nevertheless existing image editing models are not designed for Human-Object Interaction (HOI) image editing. One of these approaches (e.g. ControlNet) employs the skeleton guidance to offer precise representations of human showing better results in HOI image editing. However using conventional methods manually creating HOI skeleton guidance is necessary. This paper proposes the object interactive diffuser with associative attention that considers both the interaction with objects and the joint graph structure automating the generation of HOI skeleton guidance. Additionally we propose the HOI loss with novel scaling parameter demonstrating its effectiveness in generating skeletons that interact better. To evaluate generated object-interactive skeletons we propose two metrics top-N accuracy and skeleton probabilistic distance. Our framework integrates object interactive diffuser that generates object-interactive skeletons with previous methods demonstrating the outstanding results in HOI image editing. Finally we present potentials of our framework beyond HOI image editing as applications to human-to-human interaction skeleton editing and 3D mesh optimization. 
The code is available at https://github.com/YangChangHee/CVPR2024_Person-In-Place_RELEASE + + + + ChatPose: Chatting about 3D Human Pose + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_ChatPose_Chatting_about_3D_Human_Pose_CVPR_2024_paper.pdf + We introduce ChatPose a framework employing Large Language Models (LLMs) to understand and reason about 3D human poses from images or textual descriptions. Our work is motivated by the human ability to intuitively understand postures from a single image or a brief description a process that intertwines image interpretation world knowledge and an understanding of body language. Traditional human pose estimation and generation methods often operate in isolation lacking semantic understanding and reasoning abilities. ChatPose addresses these limitations by embedding SMPL poses as distinct signal tokens within a multimodal LLM enabling the direct generation of 3D body poses from both textual and visual inputs. Leveraging the powerful capabilities of multimodal LLMs ChatPose unifies classical 3D human pose and generation tasks while offering user interactions. Additionally ChatPose empowers LLMs to apply their extensive world knowledge in reasoning about human poses leading to two advanced tasks: speculative pose generation and reasoning about pose estimation. These tasks involve reasoning about humans to generate 3D poses from subtle text queries possibly accompanied by images. We establish benchmarks for these tasks moving beyond traditional 3D pose generation and estimation methods. Our results show that ChatPose out-performs existing multimodal LLMs and task-specific methods on these newly proposed tasks. Furthermore ChatPose's ability to understand and generate 3D human poses based on complex reasoning opens new directions in human pose analysis. Code and data are available for research at https://yfeng95.github.io/ChatPose. + + + + Distilling ODE Solvers of Diffusion Models into Smaller Steps + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Distilling_ODE_Solvers_of_Diffusion_Models_into_Smaller_Steps_CVPR_2024_paper.pdf + Abstract Diffusion models have recently gained prominence as a novel category of generative models. Despite their success these models face a notable drawback in terms of slow sampling speeds requiring a high number of function evaluations (NFE) in the order of hundreds or thousands. In response both learning-free and learning-based sampling strategies have been explored to expedite the sampling process. Learning-free sampling employs various ordinary differential equation (ODE) solvers based on the formulation of diffusion ODEs. However it encounters challenges in faithfully tracking the true sampling trajectory particularly for small NFE. Conversely learning-based sampling methods such as knowledge distillation demand extensive additional training limiting their practical applicability. To overcome these limitations we introduce Distilled-ODE solvers (D-ODE solvers) a straightforward distillation approach grounded in ODE solver formulations. Our method seamlessly integrates the strengths of both learning-free and learning-based sampling. D-ODE solvers are constructed by introducing a single parameter adjustment to existing ODE solvers. Furthermore we optimize D-ODE solvers with smaller steps using knowledge distillation from ODE solvers with larger steps across a batch of samples. 
Comprehensive experiments demonstrate the superior performance of D-ODE solvers compared to existing ODE solvers including DDIM PNDM DPM-Solver DEIS and EDM particularly in scenarios with fewer NFE. Notably our method incurs negligible computational overhead compared to previous distillation techniques facilitating straightforward and rapid integration with existing samplers. Qualitative analysis reveals that D-ODE solvers not only enhance image quality but also faithfully follow the target ODE trajectory. + + + + LightIt: Illumination Modeling and Control for Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Kocsis_LightIt_Illumination_Modeling_and_Control_for_Diffusion_Models_CVPR_2024_paper.pdf + We introduce LightIt a method for explicit illumination control for image generation. Recent generative methods lack lighting control which is crucial to numerous artistic aspects of image generation such as setting the overall mood or cinematic appearance. To overcome these limitations we propose to condition the generation on shading and normal maps. We model the lighting with single bounce shading which includes cast shadows. We first train a shading estimation module to generate a dataset of real-world images and shading pairs. Then we train a control network using the estimated shading and normals as input. Our method demonstrates high-quality image generation and lighting control in numerous scenes. Additionally we use our generated dataset to train an identity-preserving relighting model conditioned on an image and a target shading. Our method is the first that enables the generation of images with controllable consistent lighting and performs on par with specialized relighting state-of-the-art methods. + + + + Neural Lineage + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Neural_Lineage_CVPR_2024_paper.pdf + Given a well-behaved neural network, is it possible to identify its parent based on which it was tuned? In this paper we introduce a novel task known as neural lineage detection aiming at discovering lineage relationships between parent and child models. Specifically from a set of parent models neural lineage detection predicts which parent model a child model has been fine-tuned from. We propose two approaches to address this task. (1) For practical convenience we introduce a learning-free approach which integrates an approximation of the finetuning process into the neural network representation similarity metrics leading to a similarity-based lineage detection scheme. (2) For the pursuit of accuracy we introduce a learning-based lineage detector comprising encoders and a transformer detector. Through experimentation we have validated that our proposed learning-free and learning-based methods outperform the baseline in various learning settings and are adaptable to a variety of visual models. Moreover they also exhibit the ability to trace cross-generational lineage identifying not only parent models but also their ancestors. + + + + Visual Layout Composer: Image-Vector Dual Diffusion Model for Design Layout Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Shabani_Visual_Layout_Composer_Image-Vector_Dual_Diffusion_Model_for_Design_Layout_CVPR_2024_paper.pdf + This paper proposes an image-vector dual diffusion model for generative layout design.
Distinct from prior efforts that mostly ignore element-level visual information our approach integrates the power of a pre-trained large image diffusion model to guide layout composition in a vector diffusion model by providing enhanced salient region understanding and high-level inter-element relationship reasoning. Our proposed model simultaneously operates in two domains: it generates the overall design appearance in the image domain while optimizing the size and position of each design element in the vector domain. The proposed method achieves the state-of-the-art results on several datasets and enables new layout design applications. + + + + 3D Multi-frame Fusion for Video Stabilization + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_3D_Multi-frame_Fusion_for_Video_Stabilization_CVPR_2024_paper.pdf + In this paper we present RStab a novel framework for video stabilization that integrates 3D multi-frame fusion through volume rendering. Departing from conventional methods we introduce a 3D multi-frame perspective to generate stabilized images addressing the challenge of full-frame generation while preserving structure. The core of our RStab framework lies in Stabilized Rendering (SR) a volume rendering module fusing multi-frame information in 3D space. Specifically SR involves warping features and colors from multiple frames by projection fusing them into descriptors to render the stabilized image. However the precision of warped information depends on the projection accuracy a factor significantly influenced by dynamic regions. In response we introduce the Adaptive Ray Range (ARR) module to integrate depth priors adaptively defining the sampling range for the projection process. Additionally we propose Color Correction (CC) assisting geometric constraints with optical flow for accurate color aggregation. Thanks to the three modules our RStab demonstrates superior performance compared with previous stabilizers in the field of view (FOV) image quality and video stability across various datasets. + + + + Local-consistent Transformation Learning for Rotation-invariant Point Cloud Analysis + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Local-consistent_Transformation_Learning_for_Rotation-invariant_Point_Cloud_Analysis_CVPR_2024_paper.pdf + Rotation invariance is an important requirement for point shape analysis. To achieve this current state-of-the-art methods attempt to construct the local rotation-invariant representation through learning or defining the local reference frame (LRF). Although efficient these LRF-based methods suffer from perturbation of local geometric relations resulting in suboptimal local rotation invariance. To alleviate this issue we propose a Local-consistent Transformation (LocoTrans) learning strategy. Specifically we first construct the local-consistent reference frame (LCRF) by considering the symmetry of the two axes in LRF. In comparison with previous LRFs our LCRF is able to preserve local geometric relationships better through performing local-consistent transformation. However as the consistency only exists in local regions the relative pose information is still lost in the intermediate layers of the network. We mitigate such a relative pose issue by developing a relative pose recovery (RPR) module. RPR aims to restore the relative pose between adjacent transformed patches. 
Equipped with LCRF and RPR our LocoTrans is capable of learning local-consistent transformation and preserving local geometry which benefits rotation invariance learning. Competitive performance under arbitrary rotations on both shape classification and part segmentation tasks and ablations can demonstrate the effectiveness of our method. Code will be available publicly at https://github.com/wdttt/LocoTrans. + + + + Tailored Visions: Enhancing Text-to-Image Generation with Personalized Prompt Rewriting + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Tailored_Visions_Enhancing_Text-to-Image_Generation_with_Personalized_Prompt_Rewriting_CVPR_2024_paper.pdf + Despite significant progress in the field it is still challenging to create personalized visual representations that align closely with the desires and preferences of individual users. This process requires users to articulate their ideas in words that are both comprehensible to the models and accurately capture their vision posing difficulties for many users. In this paper we tackle this challenge by leveraging historical user interactions with the system to enhance user prompts. We propose a novel approach that involves rewriting user prompts based on a newly collected large-scale text-to-image dataset with over 300k prompts from 3115 users. Our rewriting model enhances the expressiveness and alignment of user prompts with their intended visual outputs. Experimental results demonstrate the superiority of our methods over baseline approaches as evidenced in our new offline evaluation method and online tests. Our code and dataset are available at https://github.com/zzjchen/Tailored-Visions + + + + Efficient Deformable ConvNets: Rethinking Dynamic and Sparse Operator for Vision Applications + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiong_Efficient_Deformable_ConvNets_Rethinking_Dynamic_and_Sparse_Operator_for_Vision_CVPR_2024_paper.pdf + We introduce Deformable Convolution v4 (DCNv4) a highly efficient and effective operator designed for a broad spectrum of vision applications. DCNv4 addresses the limitations of its predecessor DCNv3 with two key enhancements: 1. removing softmax normalization in spatial aggregation to enhance its dynamic property and expressive power and 2. optimizing memory access to minimize redundant operations for speedup. These improvements result in a significantly faster convergence compared to DCNv3 and a substantial increase in processing speed with DCNv4 achieving more than three times the forward speed. DCNv4 demonstrates exceptional performance across various tasks including image classification instance and semantic segmentation and notably image generation. When integrated into generative models like U-Net in the latent diffusion model DCNv4 outperforms its baseline underscoring its possibility to enhance generative models. In practical applications replacing DCNv3 with DCNv4 in the InternImage model to create FlashInternImage results in up to 80% speed increase and further performance improvement without further modifications. The advancements in speed and efficiency of DCNv4 combined with its robust performance across diverse vision tasks show its potential as a foundational building block for future vision models. 
+ + + + CoDe: An Explicit Content Decoupling Framework for Image Restoration + http://openaccess.thecvf.com//content/CVPR2024/papers/Gu_CoDe_An_Explicit_Content_Decoupling_Framework_for_Image_Restoration_CVPR_2024_paper.pdf + The performance of image restoration (IR) is highly dependent on the reconstruction quality of diverse contents with varying complexity. However most IR approaches model the mapping between various complexity contents of inputs and outputs through the repeated feature calculation propagation mechanism in a unified pipeline which leads to unsatisfactory results. To address this issue we propose an explicit Content Decoupling framework for IR dubbed CoDe to end-to-end model the restoration process by utilizing decoupled content components in a divide-and-conquer-like architecture. Specifically a Content Decoupling Module is first designed to decouple content components of inputs and outputs according to the frequency spectra adaptively generated from the transform domain. In addition in order to harness the divide-and-conquer strategy for reconstructing decoupled content components we propose an IR Network Container. It contains an optimized version which is a streamlining of an arbitrary IR network comprising the cascaded modulated subnets and a Reconstruction Layers Pool. Finally a Content Consistency Loss is designed from the transform domain perspective to supervise the restoration process of each content component and further guide the feature fusion process. Extensive experiments on several IR tasks such as image super-resolution image denoising and image blurring covering both real and synthetic settings demonstrate that the proposed paradigm can effectively take the performance of the original network to a new state-of-the-art level in multiple benchmark datasets (e.g. 0.34dB@Set5 x4 over DAT). + + + + DreamVideo: Composing Your Dream Videos with Customized Subject and Motion + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_DreamVideo_Composing_Your_Dream_Videos_with_Customized_Subject_and_Motion_CVPR_2024_paper.pdf + Customized generation using diffusion models has made impressive progress in image generation but remains unsatisfactory in the challenging video generation task as it requires the controllability of both subjects and motions. To that end we present DreamVideo a novel approach to generating personalized videos from a few static images of the desired subject and a few videos of target motion. DreamVideo decouples this task into two stages subject learning and motion learning by leveraging a pre-trained video diffusion model. The subject learning aims to accurately capture the fine appearance of the subject from provided images which is achieved by combining textual inversion and fine-tuning of our carefully designed identity adapter. In motion learning we architect a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern. Combining these two lightweight and efficient adapters allows for flexible customization of any subject with any motion. Extensive experimental results demonstrate the superior performance of our DreamVideo over the state-of-the-art methods for customized video generation. Our project page is at https://dreamvideo-t2v.github.io. 
+ + + + Using Human Feedback to Fine-tune Diffusion Models without Any Reward Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Using_Human_Feedback_to_Fine-tune_Diffusion_Models_without_Any_Reward_CVPR_2024_paper.pdf + Using reinforcement learning with human feedback (RLHF) has shown significant promise in fine-tuning diffusion models. Previous methods start by training a reward model that aligns with human preferences then leverage RL techniques to fine-tune the underlying models. However crafting an efficient reward model demands extensive datasets optimal architecture and manual hyperparameter tuning making the process both time and cost-intensive. The direct preference optimization (DPO) method effective in fine-tuning large language models eliminates the necessity for a reward model. However the extensive GPU memory requirement of the diffusion model's denoising process hinders the direct application of the DPO method. To address this issue we introduce the Direct Preference for Denoising Diffusion Policy Optimization (D3PO) method to directly fine-tune diffusion models. The theoretical analysis demonstrates that although D3PO omits training a reward model it effectively functions as the optimal reward model trained using human feedback data to guide the learning process. This approach requires no training of a reward model proving to be more direct cost-effective and minimizing computational overhead. In experiments our method uses the relative scale of objectives as a proxy for human preference delivering comparable results to methods using ground-truth rewards. Moreover D3PO demonstrates the ability to reduce image distortion rates and generate safer images overcoming challenges lacking robust reward models. Our code is publicly available at https://github.com/yk7333/D3PO. + + + + SynSP: Synergy of Smoothness and Precision in Pose Sequences Refinement + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_SynSP_Synergy_of_Smoothness_and_Precision_in_Pose_Sequences_Refinement_CVPR_2024_paper.pdf + Predicting human pose sequences via existing pose estimators often encounters various estimation errors. Motion refinement methods aim to optimize the predicted human pose sequences from pose estimators while ensuring minimal computational overhead and latency. Prior investigations have primarily concentrated on striking a balance between the two objectives i.e. smoothness and precision while optimizing the predicted pose sequences. However it has come to our attention that the tension between these two objectives can provide additional quality cues about the predicted pose sequences. These cues in turn are able to aid the network in optimizing lower-quality poses. To leverage this quality information we propose a motion refinement network termed SynSP to achieve a Synergy of Smoothness and Precision in the sequence refinement tasks. Moreover SynSP can also address multi-view poses of one person simultaneously fixing inaccuracies in predicted poses through heightened attention to similar poses from other views thereby amplifying the resultant quality cues and overall performance. Compared with previous methods SynSP benefits from both pose quality and multi-view information with a much shorter input sequence length achieving state-of-the-art results among four challenging datasets involving 2D 3D and SMPL pose representations in both single-view and multi-view scenes. Github code: https://github.com/InvertedForest/SynSP. 
+ + + + Learned Representation-Guided Diffusion Models for Large-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Graikos_Learned_Representation-Guided_Diffusion_Models_for_Large-Image_Generation_CVPR_2024_paper.pdf + To synthesize high-fidelity samples diffusion models typically require auxiliary data to guide the generation process. However it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition we construct larger images by assembling spatially consistent patches inferred from SSL embeddings preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger image-scale classification tasks. Our models are effective even on datasets not encountered during training demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image or sampled from an auxiliary model conditioned on any related modality (e.g. class labels text genomic data). As proof of concept we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions. + + + + Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_Ranni_Taming_Text-to-Image_Diffusion_for_Accurate_Instruction_Following_CVPR_2024_paper.pdf + Existing text-to-image (T2I) diffusion models usually struggle in interpreting complex prompts especially those with quantity object-attribute binding and multi-subject descriptions. In this work we introduce a semantic panel as the middleware in decoding texts to images supporting the generator to better follow instructions. The panel is obtained through arranging the visual concepts parsed from the input text by the aid of large language models and then injected into the denoising network as a detailed control signal to complement the text condition. To facilitate text-to-panel learning we come up with a carefully designed semantic formatting protocol accompanied by a fully-automatic data preparation pipeline. Thanks to such a design our approach which we call Ranni manages to enhance a pre-trained T2I generator regarding its textual controllability. More importantly the introduction of the generative middleware brings a more convenient form of interaction (i.e. directly adjusting the elements in the panel or using language instructions) and further allows users to finely customize their generation based on which we develop a practical system and showcase its potential in continuous generation and chatting-based editing. 
+ + + + Direct2.5: Diverse Text-to-3D Generation via Multi-view 2.5D Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_Direct2.5_Diverse_Text-to-3D_Generation_via_Multi-view_2.5D_Diffusion_CVPR_2024_paper.pdf + Recent advances in generative AI have unveiled significant potential for the creation of 3D content. However current methods either apply a pre-trained 2D diffusion model with the time-consuming score distillation sampling (SDS) or a direct 3D diffusion model trained on limited 3D data losing generation diversity. In this work we approach the problem by employing a multi-view 2.5D diffusion fine-tuned from a pre-trained 2D diffusion model. The multi-view 2.5D diffusion directly models the structural distribution of 3D data while still maintaining the strong generalization ability of the original 2D diffusion model filling the gap between 2D diffusion-based and direct 3D diffusion-based methods for 3D content generation. During inference multi-view normal maps are generated using the 2.5D diffusion and a novel differentiable rasterization scheme is introduced to fuse the almost consistent multi-view normal maps into a consistent 3D model. We further design a normal-conditioned multi-view image generation module for fast appearance generation given the 3D geometry. Our method is a one-pass diffusion process and does not require any SDS optimization as post-processing. We demonstrate through extensive experiments that our direct 2.5D generation with the specially-designed fusion scheme can achieve diverse mode-seeking-free and high-fidelity 3D content generation in only 10 seconds. + + + + MatFuse: Controllable Material Generation with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Vecchio_MatFuse_Controllable_Material_Generation_with_Diffusion_Models_CVPR_2024_paper.pdf + Creating high-quality materials in computer graphics is a challenging and time-consuming task which requires great expertise. To simplify this process we introduce MatFuse a unified approach that harnesses the generative power of diffusion models for creation and editing of 3D materials. Our method integrates multiple sources of conditioning including color palettes sketches text and pictures enhancing creative possibilities and granting fine-grained control over material synthesis. Additionally MatFuse enables map-level material editing capabilities through latent manipulation by means of a multi-encoder compression model which learns a disentangled latent representation for each map. We demonstrate the effectiveness of MatFuse under multiple conditioning settings and explore the potential of material editing. Finally we assess the quality of the generated materials both quantitatively in terms of CLIP-IQA and FID scores and qualitatively by conducting a user study. Source code for training MatFuse and supplemental materials are publicly available at https://gvecchio.com/matfuse. + + + + Training Vision Transformers for Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_Training_Vision_Transformers_for_Semi-Supervised_Semantic_Segmentation_CVPR_2024_paper.pdf + We present S4Former a novel approach to training Vision Transformers for Semi-Supervised Semantic Segmentation (S4). 
At its core, S4Former employs a Vision Transformer within a classic teacher-student framework and then leverages three novel technical ingredients: PatchShuffle as a parameter-free perturbation technique, Patch-Adaptive Self-Attention (PASA) as a fine-grained feature modulation method, and the innovative Negative Class Ranking (NCR) regularization loss. Based on these regularization modules, aligned with Transformer-specific characteristics across the image input, feature, and output dimensions, S4Former exploits the Transformer's ability to capture and differentiate consistent global contextual information in unlabeled images. Overall, S4Former not only defines a new state of the art in S4 but also maintains a streamlined and scalable architecture. Being readily compatible with existing frameworks, S4Former achieves strong improvements (up to 4.9%) on benchmarks like Pascal VOC 2012, COCO, and Cityscapes with varying numbers of labeled data. The code is at https://github.com/JoyHuYY1412/S4Former. + + + + Quantifying Task Priority for Multi-Task Optimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Jeong_Quantifying_Task_Priority_for_Multi-Task_Optimization_CVPR_2024_paper.pdf + The goal of multi-task learning is to learn diverse tasks within a single unified network. As each task has its own unique objective function, conflicts emerge during training, resulting in negative transfer among them. Earlier research identified these conflicting gradients in shared parameters between tasks and attempted to realign them in the same direction. However, we prove that such optimization strategies lead to sub-optimal Pareto solutions due to their inability to accurately determine the individual contributions of each parameter across various tasks. In this paper, we propose the concept of task priority to evaluate parameter contributions across different tasks. To learn task priority, we identify the type of connections related to links between parameters influenced by task-specific losses during backpropagation. The strength of connections is gauged by the magnitude of parameters to determine task priority. Based on these, we present a new method named connection strength-based optimization for multi-task learning, which consists of two phases. The first phase learns the task priority within the network, while the second phase modifies the gradients while upholding this priority. This ultimately leads to finding new Pareto optimal solutions for multiple tasks. Through extensive experiments, we show that our approach greatly enhances multi-task performance in comparison to earlier gradient manipulation methods. + + + + On the Scalability of Diffusion-based Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_On_the_Scalability_of_Diffusion-based_Text-to-Image_Generation_CVPR_2024_paper.pdf + Scaling up model and data size has been quite successful for the evolution of LLMs. However, the scaling law for diffusion-based text-to-image (T2I) models is not fully explored. It is also unclear how to efficiently scale the model for better performance at reduced cost. The different training settings and expensive training cost make a fair model comparison extremely difficult. In this work, we empirically study the scaling properties of diffusion-based T2I models by performing extensive and rigorous ablations on scaling both denoising backbones and the training set, including training scaled UNet and Transformer variants ranging from 0.4B to 4B parameters on datasets of up to 600M images. 

For model scaling we find the location and amount of cross attention distinguishes the performance of existing UNet designs. And increasing the transformer blocks is more parameter-efficient for improving text-image alignment than increasing channel numbers. We then identify an efficient UNet variant which is 45% smaller and 28% faster than SDXL's UNet. On the data scaling side we show the quality and diversity of the training set matters more than simply dataset size. Increasing caption density and diversity improves text-image alignment performance and the learning efficiency. Finally we provide scaling functions to predict the text-image alignment performance as functions of the scale of model size compute and dataset size. + + + + AnySkill: Learning Open-Vocabulary Physical Skill for Interactive Agents + http://openaccess.thecvf.com//content/CVPR2024/papers/Cui_AnySkill_Learning_Open-Vocabulary_Physical_Skill_for_Interactive_Agents_CVPR_2024_paper.pdf + Traditional approaches in physics-based motion generation centered around imitation learning and reward shaping often struggle to adapt to new scenarios. To tackle this limitation we propose AnySkill a novel hierarchical method that learns physically plausible interactions following open-vocabulary instructions. Our approach begins by developing a set of atomic actions via a low-level controller trained via imitation learning. Upon receiving an open-vocabulary textual instruction AnySkill employs a high-level policy that selects and integrates these atomic actions to maximize the CLIP similarity between the agent's rendered images and the text. An important feature of our method is the use of image-based rewards for the high-level policy which allows the agent to learn interactions with objects without manual reward engineering. We demonstrate AnySkill's capability to generate realistic and natural motion sequences in response to unseen instructions of varying lengths marking it the first method capable of open-vocabulary physical skill learning for interactive humanoid agents. + + + + Generative Unlearning for Any Identity + http://openaccess.thecvf.com//content/CVPR2024/papers/Seo_Generative_Unlearning_for_Any_Identity_CVPR_2024_paper.pdf + Recent advances in generative models trained on large-scale datasets have made it possible to synthesize high-quality samples across various domains. Moreover the emergence of strong inversion networks enables not only a reconstruction of real-world images but also the modification of attributes through various editing methods. However in certain domains related to privacy issues e.g. human faces advanced generative models along with strong inversion methods can lead to potential misuses. In this paper we propose an essential yet under-explored task called generative identity unlearning which steers the model not to generate an image of a specific identity. In the generative identity unlearning we target the following objectives: (i) preventing the generation of images with a certain identity and (ii) preserving the overall quality of the generative model. To satisfy these goals we propose a novel framework Generative Unlearning for Any Identity (GUIDE) which prevents the reconstruction of a specific identity by unlearning the generator with only a single image. GUIDE consists of two parts: (i) finding a target point for optimization that un-identifies the source latent code and (ii) novel loss functions that facilitate the unlearning procedure while less affecting the learned distribution. 
Our extensive experiments demonstrate that our proposed method achieves state-of-the-art performance in the generative machine unlearning task. The code is available at https://github.com/KHU-AGI/GUIDE. + + + + FlowVid: Taming Imperfect Optical Flows for Consistent Video-to-Video Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_FlowVid_Taming_Imperfect_Optical_Flows_for_Consistent_Video-to-Video_Synthesis_CVPR_2024_paper.pdf + Diffusion models have transformed the image-to-image (I2I) synthesis and are now permeating into videos. However the advancement of video-to-video (V2V) synthesis has been hampered by the challenge of maintaining temporal consistency across video frames. This paper proposes a consistent V2V synthesis framework by jointly leveraging spatial conditions and temporal optical flow clues within the source video. Contrary to prior methods that strictly adhere to optical flow our approach harnesses its benefits while handling the imperfection in flow estimation. We encode the optical flow via warping from the first frame and serve it as a supplementary reference in the diffusion model. This enables our model for video synthesis by editing the first frame with any prevalent I2I models and then propagating edits to successive frames. Our V2V model FlowVid demonstrates remarkable properties: (1) Flexibility: FlowVid works seamlessly with existing I2I models facilitating various modifications including stylization object swaps and local edits. (2) Efficiency: Generation of a 4-second video with 30 FPS and 512x512 resolution takes only 1.5 minutes which is 3.1x 7.2x and 10.5x faster than CoDeF Rerender and TokenFlow respectively. (3) High-quality: In user studies our FlowVid is preferred 45.7% of the time outperforming CoDeF (3.5%) Rerender (10.2%) and TokenFlow (40.4%). + + + + StyleCineGAN: Landscape Cinemagraph Generation using a Pre-trained StyleGAN + http://openaccess.thecvf.com//content/CVPR2024/papers/Choi_StyleCineGAN_Landscape_Cinemagraph_Generation_using_a_Pre-trained_StyleGAN_CVPR_2024_paper.pdf + We propose a method that can generate cinemagraphs automatically from a still landscape image using a pre-trained StyleGAN. Inspired by the success of recent unconditional video generation we leverage a powerful pre-trained image generator to synthesize high-quality cinemagraphs. Unlike previous approaches that mainly utilize the latent space of a pre-trained StyleGAN our approach utilizes its deep feature space for both GAN inversion and cinemagraph generation. Specifically we propose multi-scale deep feature warping (MSDFW) which warps the intermediate features of a pre-trained StyleGAN at different resolutions. By using MSDFW the generated cinemagraphs are of high resolution and exhibit plausible looping animation. We demonstrate the superiority of our method through user studies and quantitative comparisons with state-of-the-art cinemagraph generation methods and a video generation method that uses a pre-trained StyleGAN. + + + + Laplacian-guided Entropy Model in Neural Codec with Blur-dissipated Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Khoshkhahtinat_Laplacian-guided_Entropy_Model_in_Neural_Codec_with_Blur-dissipated_Synthesis_CVPR_2024_paper.pdf + While replacing Gaussian decoders with a conditional diffusion model enhances the perceptual quality of reconstructions in neural image compression their lack of inductive bias for image data restricts their ability to achieve state-of-the-art perceptual levels. 
To address this limitation we adopt a non-isotropic diffusion model at the decoder side. This model imposes an inductive bias aimed at distinguishing between frequency contents thereby facilitating the generation of high-quality images. Moreover our framework is equipped with a novel entropy model that accurately models the probability distribution of latent representation by exploiting spatio-channel correlations in latent space while accelerating the entropy decoding step. This channel-wise entropy model leverages both local and global spatial contexts within each channel chunk. The global spatial context is built upon the Transformer which is specifically designed for image compression tasks. The designed Transformer employs a Laplacian-shaped positional encoding the learnable parameters of which are adaptively adjusted for each channel cluster. Our experiments demonstrate that our proposed framework yields better perceptual quality compared to cutting-edge generative-based codecs and the proposed entropy model contributes to notable bitrate savings. The code is available at https://github.com/Atefeh-Khoshtinat/Blur-dissipated-compression. + + + + RMT: Retentive Networks Meet Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_RMT_Retentive_Networks_Meet_Vision_Transformers_CVPR_2024_paper.pdf + Vision Transformer (ViT) has gained increasing attention in the computer vision community in recent years. However the core component of ViT Self-Attention lacks explicit spatial priors and bears a quadratic computational complexity thereby constraining the applicability of ViT. To alleviate these issues we draw inspiration from the recent Retentive Network (RetNet) in the field of NLP and propose RMT a strong vision backbone with explicit spatial prior for general purposes. Specifically we extend the RetNet's temporal decay mechanism to the spatial domain and propose a spatial decay matrix based on the Manhattan distance to introduce the explicit spatial prior to Self-Attention. Additionally an attention decomposition form that adeptly adapts to explicit spatial prior is proposed aiming to reduce the computational burden of modeling global information without disrupting the spatial decay matrix. Based on the spatial decay matrix and the attention decomposition form we can flexibly integrate explicit spatial prior into the vision backbone with linear complexity. Extensive experiments demonstrate that RMT exhibits exceptional performance across various vision tasks. Specifically without extra training data RMT achieves 84.8% and 86.1% top-1 acc on ImageNet-1k with 27M/4.5GFLOPs and 96M/18.2GFLOPs. For downstream tasks RMT achieves 54.5 box AP and 47.2 mask AP on the COCO detection task and 52.8 mIoU on the ADE20K semantic segmentation task. + + + + Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Multimodal_Pathway_Improve_Transformers_with_Irrelevant_Data_from_Other_Modalities_CVPR_2024_paper.pdf + We propose to improve transformers of a specific modality with irrelevant data from other modalities e.g. improve an ImageNet model with audio or point cloud datasets. We would like to highlight that the data samples of the target modality are irrelevant to the other modalities which distinguishes our method from other works utilizing paired (e.g. CLIP) or interleaved data of different modalities. 
We propose a methodology named Multimodal Pathway - given a target modality and a transformer designed for it we use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models so that data of the target modality can be processed by both models. In this way we utilize the universal sequence-to-sequence modeling abilities of transformers obtained from two modalities. As a concrete implementation we use a modality-specific tokenizer and task-specific head as usual but utilize the transformer blocks of the auxiliary model via a proposed method named Cross-Modal Re-parameterization which exploits the auxiliary weights without any inference costs. On the image point cloud video and audio recognition tasks we observe significant and consistent performance improvements with irrelevant data from other modalities. The code and models are available at https://github.com/AILab-CVC/M2PT. + + + + FaceChain-ImagineID: Freely Crafting High-Fidelity Diverse Talking Faces from Disentangled Audio + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_FaceChain-ImagineID_Freely_Crafting_High-Fidelity_Diverse_Talking_Faces_from_Disentangled_Audio_CVPR_2024_paper.pdf + In this paper we abstract the process of people hearing speech extracting meaningful cues and creating various dynamically audio-consistent talking faces termed Listening and Imagining into the task of high-fidelity diverse talking faces generation from a single audio. Specifically it involves two critical challenges: one is to effectively decouple identity content and emotion from entangled audio and the other is to maintain intra-video diversity and inter-video consistency. To tackle the issues we first dig out the intricate relationships among facial factors and simplify the decoupling process tailoring a Progressive Audio Disentanglement for accurate facial geometry and semantics learning where each stage incorporates a customized training module responsible for a specific factor. Secondly to achieve visually diverse and audio-synchronized animation solely from input audio within a single model we introduce the Controllable Coherent Frame generation which involves the flexible integration of three trainable adapters with frozen Latent Diffusion Models (LDMs) to focus on maintaining facial geometry and semantics as well as texture and temporal coherence between frames. In this way we inherit high-quality diverse generation from LDMs while significantly improving their controllability at a low training cost. Extensive experiments demonstrate the flexibility and effectiveness of our method in handling this paradigm. The codes will be released at https://github.com/modelscope/facechain. + + + + SSR-Encoder: Encoding Selective Subject Representation for Subject-Driven Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_SSR-Encoder_Encoding_Selective_Subject_Representation_for_Subject-Driven_Generation_CVPR_2024_paper.pdf + Recent advancements in subject-driven image generation have led to zero-shot generation yet precise selection and focus on crucial subject representations remain challenging. Addressing this we introduce the SSR-Encoder a novel architecture designed for selectively capturing any subject from single or multiple reference images. It responds to various query modalities including text and masks without necessitating test-time fine-tuning. 
The SSR-Encoder combines a Token-to-Patch Aligner that aligns query inputs with image patches and a Detail-Preserving Subject Encoder for extracting and preserving fine features of the subjects thereby generating subject embeddings. These embeddings used in conjunction with original text embeddings condition the generation process. Characterized by its model generalizability and efficiency the SSR-Encoder adapts to a range of custom models and control modules. Enhanced by the Embedding Consistency Regularization Loss for improved training our extensive experiments demonstrate its effectiveness in versatile and high-quality image generation indicating its broad applicability. + + + + MVIP-NeRF: Multi-view 3D Inpainting on NeRF Scenes via Diffusion Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_MVIP-NeRF_Multi-view_3D_Inpainting_on_NeRF_Scenes_via_Diffusion_Prior_CVPR_2024_paper.pdf + Despite the emergence of successful NeRF inpainting methods built upon explicit RGB and depth 2D inpainting supervisions these methods are inherently constrained by the capabilities of their underlying 2D inpainters. This is due to two key reasons: (i) independently inpainting constituent images results in view-inconsistent imagery and (ii) 2D inpainters struggle to ensure high-quality geometry completion and alignment with inpainted RGB images. To overcome these limitations we propose a novel approach called MVIP-NeRF that harnesses the potential of diffusion priors for NeRF inpainting addressing both appearance and geometry aspects. MVIP-NeRF performs joint inpainting across multiple views to reach a consistent solution which is achieved via an iterative optimization process based on Score Distillation Sampling (SDS). Apart from recovering the rendered RGB images we also extract normal maps as a geometric representation and define a normal SDS loss that motivates accurate geometry inpainting and alignment with the appearance. Additionally we formulate a multi-view SDS score function to distill generative priors simultaneously from different view images ensuring consistent visual completion when dealing with large view variations. Our experimental results show better appearance and geometry recovery than previous NeRF inpainting methods. + + + + StegoGAN: Leveraging Steganography for Non-Bijective Image-to-Image Translation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_StegoGAN_Leveraging_Steganography_for_Non-Bijective_Image-to-Image_Translation_CVPR_2024_paper.pdf + Most image-to-image translation models postulate that a unique correspondence exists between the semantic classes of the source and target domains. However this assumption does not always hold in real-world scenarios due to divergent distributions different class sets and asymmetrical information representation. As conventional GANs attempt to generate images that match the distribution of the target domain they may hallucinate spurious instances of classes absent from the source domain thereby diminishing the usefulness and reliability of translated images. CycleGAN-based methods are also known to hide the mismatched information in the generated images to bypass cycle consistency objectives a process known as steganography. In response to the challenge of non-bijective image translation we introduce StegoGAN a novel model that leverages steganography to prevent spurious features in generated images. 
Our approach enhances the semantic consistency of the translated images without requiring additional postprocessing or supervision. Our experimental evaluations demonstrate that StegoGAN outperforms existing GAN-based models across various non-bijective image-to-image translation tasks both qualitatively and quantitatively. Our code and pretrained models are accessible at https://github.com/sian-wusidi/StegoGAN. + + + + M&M VTO: Multi-Garment Virtual Try-On and Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_MM_VTO_Multi-Garment_Virtual_Try-On_and_Editing_CVPR_2024_paper.pdf + We present M&M VTO-a mix and match virtual try-on method that takes as input multiple garment images text description for garment layout and an image of a person. An example input includes: an image of a shirt an image of a pair of pants "rolled sleeves shirt tucked in" and an image of a person. The output is a visualization of how those garments (in the desired layout) would look like on the given person. Key contributions of our method are: 1) a single stage diffusion based model with no super resolution cascading that allows to mix and match multiple garments at 1024x512 resolution preserving and warping intricate garment details 2) architecture design (VTO UNet Diffusion Transformer) to disentangle denoising from person specific features allowing for a highly effective finetuning strategy for identity preservation (6MB model per individual vs 4GB achieved with e.g. dreambooth finetuning); solving a common identity loss problem in current virtual try-on methods 3) layout control for multiple garments via text inputs finetuned over PaLI-3 for virtual try-on task. Experimental results indicate that M&M VTO achieves state-of-the-art performance both qualitatively and quantitatively as well as opens up new opportunities for virtual try-on via language-guided and multi-garment try-on. + + + + Dynamic Inertial Poser (DynaIP): Part-Based Motion Dynamics Learning for Enhanced Human Pose Estimation with Sparse Inertial Sensors + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Dynamic_Inertial_Poser_DynaIP_Part-Based_Motion_Dynamics_Learning_for_Enhanced_CVPR_2024_paper.pdf + This paper introduces a novel human pose estimation approach using sparse inertial sensors addressing the shortcomings of previous methods reliant on synthetic data. It leverages a diverse array of real inertial motion capture data from different skeleton formats to improve motion diversity and model generalization. This method features two innovative components: a pseudo-velocity regression model for dynamic motion capture with inertial sensors and a part-based model dividing the body and sensor data into three regions each focusing on their unique characteristics. The approach demonstrates superior performance over state-of-the-art models across five public datasets notably reducing pose error by 19% on the DIP-IMU dataset thus representing a significant improvement in inertial sensor-based human pose estimation. Our codes are available at https://github.com/dx118/dynaip + + + + GraCo: Granularity-Controllable Interactive Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_GraCo_Granularity-Controllable_Interactive_Segmentation_CVPR_2024_paper.pdf + Interactive Segmentation (IS) segments specific objects or parts in the image according to user input. Current IS pipelines fall into two categories: single-granularity output and multi-granularity output. 
The latter aims to alleviate the spatial ambiguity present in the former. However the multi-granularity output pipeline suffers from limited interaction flexibility and produces redundant results. In this work we introduce Granularity-Controllable Interactive Segmentation (GraCo) a novel approach that allows precise control of prediction granularity by introducing additional parameters to input. This enhances the customization of the interactive system and eliminates redundancy while resolving ambiguity. Nevertheless the exorbitant cost of annotating multi-granularity masks and the lack of available datasets with granularity annotations make it difficult for models to acquire the necessary guidance to control output granularity. To address this problem we design an any-granularity mask generator that exploits the semantic property of the pre-trained IS model to automatically generate abundant mask-granularity pairs without requiring additional manual annotation. Based on these pairs we propose a granularity-controllable learning strategy that efficiently imparts the granularity controllability to the IS model. Extensive experiments on intricate scenarios at object and part levels demonstrate that our GraCo has significant advantages over previous methods. This highlights the potential of GraCo to be a flexible annotation tool capable of adapting to diverse segmentation scenarios. The project page: https://zhao-yian.github.io/GraCo. + + + + G-HOP: Generative Hand-Object Prior for Interaction Reconstruction and Grasp Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_G-HOP_Generative_Hand-Object_Prior_for_Interaction_Reconstruction_and_Grasp_Synthesis_CVPR_2024_paper.pdf + We propose G-HOP a denoising diffusion based generative prior for hand-object interactions that allows modeling both the 3D object and a human hand conditioned on the object category. To learn a 3D spatial diffusion model that can capture this joint distribution we represent the human hand via a skeletal distance field to obtain a representation aligned with the (latent) signed distance field for the object. We show that this hand-object prior can then serve as a generic guidance to facilitate other tasks like reconstruction from interaction clip and human grasp synthesis. We believe that our model trained by aggregating several diverse real-world interaction datasets spanning 155 categories represents a first approach that allows jointly generating both hand and object. Our empirical evaluations demonstrate the benefit of this joint prior in video-based reconstruction and human grasp synthesis outperforming current task-specific baselines. + + + + Contrastive Denoising Score for Text-guided Latent Diffusion Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Nam_Contrastive_Denoising_Score_for_Text-guided_Latent_Diffusion_Image_Editing_CVPR_2024_paper.pdf + With the remarkable advent of text-to-image diffusion models image editing methods have become more diverse and continue to evolve. A promising recent approach in this realm is Delta Denoising Score (DDS) - an image editing technique based on Score Distillation Sampling (SDS) framework that leverages the rich generative prior of text-to-image diffusion models. However relying solely on the difference between scoring functions is insufficient for preserving specific structural elements from the original image a crucial aspect of image editing. 
To address this, here we present an embarrassingly simple yet very powerful modification of DDS, called Contrastive Denoising Score (CDS), for latent diffusion models (LDM). Inspired by the similarities and differences between DDS and the contrastive learning for unpaired image-to-image translation (CUT), we introduce a straightforward approach using the CUT loss within the DDS framework. Rather than employing auxiliary networks as in the original CUT approach, we leverage the intermediate features of the LDM, specifically those from the self-attention layers, which possess rich spatial information. Our approach enables zero-shot image-to-image translation and neural radiance field (NeRF) editing, achieving structural correspondence between the input and output while maintaining content controllability. Qualitative results and comparisons demonstrate the effectiveness of our proposed method. + + + + Neural Point Cloud Diffusion for Disentangled 3D Shape and Appearance Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Schroppel_Neural_Point_Cloud_Diffusion_for_Disentangled_3D_Shape_and_Appearance_CVPR_2024_paper.pdf + Controllable generation of 3D assets is important for many practical applications like content creation in movies, games, and engineering, as well as in AR/VR. Recently, diffusion models have shown remarkable results in the generation quality of 3D objects. However, none of the existing models enable disentangled generation to control the shape and appearance separately. For the first time, we present a suitable representation for 3D diffusion models to enable such disentanglement by introducing a hybrid point cloud and neural radiance field approach. We model a diffusion process over point positions jointly with a high-dimensional feature space for a local density and radiance decoder. While the point positions represent the coarse shape of the object, the point features allow modeling the geometry and appearance details. This disentanglement enables us to sample both independently and therefore to control both separately. Our approach sets a new state of the art in generation compared to previous disentanglement-capable methods, with FID scores reduced by 30-90%, and is on par with other non-disentanglement-capable state-of-the-art methods. + + + + VAREN: Very Accurate and Realistic Equine Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Zuffi_VAREN_Very_Accurate_and_Realistic_Equine_Network_CVPR_2024_paper.pdf + Data-driven three-dimensional parametric shape models of the human body have gained enormous popularity, both for the analysis of visual data and for the generation of synthetic humans. Following a similar approach for animals does not scale to the multitude of existing animal species, not to mention the difficulty of accessing subjects to scan in 3D. However, we argue that for domestic species of great importance, like the horse, it is a highly valuable investment to put effort into gathering a large dataset of real 3D scans and learn a realistic 3D articulated shape model. We introduce VAREN, a novel 3D articulated parametric shape model learned from 3D scans of many real horses. VAREN bridges synthesis and analysis tasks, as the generated model instances have unprecedented realism while being able to represent horses of different sizes and shapes. Differently from previous body models, VAREN has two resolutions, an anatomical skeleton, and interpretable, learned pose-dependent deformations, which are related to the body muscles. 

We show with experiments that this formulation has superior performance with respect to previous strategies for modeling pose-dependent deformations in the human body case while also being more compact and allowing an analysis of the relationship between articulation and muscle deformation during articulated motion. + + + + SD-DiT: Unleashing the Power of Self-supervised Discrimination in Diffusion Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_SD-DiT_Unleashing_the_Power_of_Self-supervised_Discrimination_in_Diffusion_Transformer_CVPR_2024_paper.pdf + Diffusion Transformer (DiT) has emerged as the new trend of generative diffusion models on image generation. In view of extremely slow convergence in typical DiT recent breakthroughs have been driven by mask strategy that significantly improves the training efficiency of DiT with additional intra-image contextual learning. Despite this progress mask strategy still suffers from two inherent limitations: (a) training-inference discrepancy and (b) fuzzy relations between mask reconstruction & generative diffusion process resulting in sub-optimal training of DiT. In this work we address these limitations by novelly unleashing the self-supervised discrimination knowledge to boost DiT training. Technically we frame our DiT in a teacher-student manner. The teacher-student discriminative pairs are built on the diffusion noises along the same Probability Flow Ordinary Differential Equation (PF-ODE). Instead of applying mask reconstruction loss over both DiT encoder and decoder we decouple DiT encoder and decoder to separately tackle discriminative and generative objectives. In particular by encoding discriminative pairs with student and teacher DiT encoders a new discriminative loss is designed to encourage the inter-image alignment in the self-supervised embedding space. After that student samples are fed into student DiT decoder to perform the typical generative diffusion task. Extensive experiments are conducted on ImageNet dataset and our method achieves a competitive balance between training cost and generative capacity. + + + + MedBN: Robust Test-Time Adaptation against Malicious Test Samples + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_MedBN_Robust_Test-Time_Adaptation_against_Malicious_Test_Samples_CVPR_2024_paper.pdf + Test-time adaptation (TTA) has emerged as a promising solution to address performance decay due to unforeseen distribution shifts between training and test data. While recent TTA methods excel in adapting to test data variations such adaptability exposes a model to vulnerability against malicious examples an aspect that has received limited attention. Previous studies have uncovered security vulnerabilities within TTA even when a small proportion of the test batch is maliciously manipulated. In response to the emerging threat we propose median batch normalization (MedBN) leveraging the robustness of the median for statistics estimation within the batch normalization layer during test-time inference. Our method is algorithm-agnostic thus allowing seamless integration with existing TTA frameworks. Our experimental results on benchmark datasets including CIFAR10-C CIFAR100-C and ImageNet-C consistently demonstrate that MedBN outperforms existing approaches in maintaining robust performance across different attack scenarios encompassing both instant and cumulative attacks. 
Through extensive experiments, we show that our approach sustains the performance even in the absence of attacks, achieving a practical balance between robustness and performance. + + + + Unsupervised Gaze Representation Learning from Multi-view Face Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Bao_Unsupervised_Gaze_Representation_Learning_from_Multi-view_Face_Images_CVPR_2024_paper.pdf + Annotating gaze is an expensive and time-consuming endeavor, requiring costly eye-trackers or complex geometric calibration procedures. Although some eye-based unsupervised gaze representation learning methods have been proposed, the quality of the gaze representation extracted by these methods degrades severely when the head pose is large. In this paper, we present the Multi-View Dual-Encoder (MV-DE), a framework designed to learn gaze representations from unlabeled multi-view face images. Through the proposed Dual-Encoder architecture and the multi-view gaze representation swapping strategy, the MV-DE successfully disentangles gaze from general facial information and derives gaze representations closely tied to the subject's eyeball rotation without gaze labels. Experimental results illustrate that the gaze representations learned by the MV-DE can be used in downstream tasks, including gaze estimation and redirection. Gaze estimation results indicate that the proposed MV-DE displays notably higher robustness to uncontrolled head movements when compared to state-of-the-art (SOTA) unsupervised learning methods. + + + + AEROBLADE: Training-Free Detection of Latent Diffusion Images Using Autoencoder Reconstruction Error + http://openaccess.thecvf.com//content/CVPR2024/papers/Ricker_AEROBLADE_Training-Free_Detection_of_Latent_Diffusion_Images_Using_Autoencoder_Reconstruction_CVPR_2024_paper.pdf + With recent text-to-image models, anyone can generate deceptively realistic images with arbitrary content, fueling the growing threat of visual disinformation. A key enabler for generating high-resolution images with low computational cost has been the development of latent diffusion models (LDMs). In contrast to conventional diffusion models, LDMs perform the denoising process in the low-dimensional latent space of a pre-trained autoencoder (AE) instead of the high-dimensional image space. Despite their relevance, the forensic analysis of LDMs is still in its infancy. In this work, we propose AEROBLADE, a novel detection method which exploits an inherent component of LDMs: the AE used to transform images between image and latent space. We find that generated images can be more accurately reconstructed by the AE than real images, allowing for a simple detection approach based on the reconstruction error. Most importantly, our method is easy to implement and does not require any training, yet it nearly matches the performance of detectors that rely on extensive training. We empirically demonstrate that AEROBLADE is effective against state-of-the-art LDMs, including Stable Diffusion and Midjourney. Beyond detection, our approach allows for the qualitative analysis of images, which can be leveraged for identifying inpainted regions. We release our code and data at https://github.com/jonasricker/aeroblade. 

+ + + + Point2CAD: Reverse Engineering CAD Models from 3D Point Clouds + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Point2CAD_Reverse_Engineering_CAD_Models_from_3D_Point_Clouds_CVPR_2024_paper.pdf + Computer-Aided Design (CAD) model reconstruction from point clouds is an important problem at the intersection of computer vision, graphics, and machine learning; it saves the designer significant time when iterating on in-the-wild objects. Recent advancements in this direction achieve relatively reliable semantic segmentation but still struggle to produce an adequate topology of the CAD model. In this work, we analyze the current state of the art for that ill-posed task and identify shortcomings of existing methods. We propose a hybrid analytic-neural reconstruction scheme that bridges the gap between segmented point clouds and structured CAD models and can be readily combined with different segmentation backbones. Moreover, to power the surface fitting stage, we propose a novel implicit neural representation of freeform surfaces, driving up the performance of our overall CAD reconstruction scheme. We extensively evaluate our method on the popular ABC benchmark of CAD models and set a new state of the art for that dataset. Code is available at https://github.com/YujiaLiu76/point2cad. + + + + LocLLM: Exploiting Generalizable Human Keypoint Localization via Large Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_LocLLM_Exploiting_Generalizable_Human_Keypoint_Localization_via_Large_Language_Model_CVPR_2024_paper.pdf + The capacity of existing human keypoint localization models is limited by the keypoint priors provided by the training data. To alleviate this restriction and pursue a more general model, this work studies keypoint localization from a different perspective by reasoning about locations based on keypoint clues in text descriptions. We propose LocLLM, the first Large Language Model (LLM) based keypoint localization model that takes images and text instructions as inputs and outputs the desired keypoint coordinates. LocLLM leverages the strong reasoning capability of LLMs and clues about keypoint type, location, and relationship in textual descriptions for keypoint localization. To effectively tune LocLLM, we construct localization-based instruction conversations to connect keypoint descriptions with corresponding coordinates in the input image and fine-tune the whole model in a parameter-efficient training pipeline. LocLLM shows remarkable performance on standard 2D/3D keypoint localization benchmarks. Moreover, incorporating language clues into the localization makes LocLLM show superior flexibility and generalizable capability in cross-dataset keypoint localization, and even in detecting novel types of keypoints unseen during training. + + + + MMA-Diffusion: MultiModal Attack on Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_MMA-Diffusion_MultiModal_Attack_on_Diffusion_Models_CVPR_2024_paper.pdf + In recent years, Text-to-Image (T2I) models have seen remarkable advancements, gaining widespread adoption. However, this progress has inadvertently opened avenues for potential misuse, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces MMA-Diffusion, a framework that presents a significant and realistic threat to the security of T2I models by effectively circumventing current defensive measures in both open-source models and commercial online services. 

Unlike previous approaches, MMA-Diffusion leverages both textual and visual modalities to bypass safeguards like prompt filters and post-hoc safety checkers, thus exposing and highlighting the vulnerabilities in existing defense mechanisms. Our code is available at https://github.com/cure-lab/MMA-Diffusion. + + + + HanDiffuser: Text-to-Image Generation With Realistic Hand Appearances + http://openaccess.thecvf.com//content/CVPR2024/papers/Narasimhaswamy_HanDiffuser_Text-to-Image_Generation_With_Realistic_Hand_Appearances_CVPR_2024_paper.pdf + Text-to-image generative models can generate high-quality humans, but realism is lost when generating hands. Common artifacts include irregular hand poses, shapes, incorrect numbers of fingers, and physically implausible finger orientations. To generate images with realistic hands, we propose a novel diffusion-based architecture called HanDiffuser that achieves realism by injecting hand embeddings into the generative process. HanDiffuser consists of two components: a Text-to-Hand-Params diffusion model to generate SMPL-Body and MANO-Hand parameters from input text prompts, and a Text-Guided Hand-Params-to-Image diffusion model to synthesize images by conditioning on the prompts and hand parameters generated by the previous component. We incorporate multiple aspects of hand representation, including 3D shapes and joint-level finger positions, orientations, and articulations, for robust learning and reliable performance during inference. We conduct extensive quantitative and qualitative experiments and perform user studies to demonstrate the efficacy of our method in generating images with high-quality hands. + + + + Hierarchical Patch Diffusion Models for High-Resolution Video Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Skorokhodov_Hierarchical_Patch_Diffusion_Models_for_High-Resolution_Video_Generation_CVPR_2024_paper.pdf + Diffusion models have demonstrated remarkable performance in image and video synthesis. However, scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components, limiting scalability and complicating downstream applications. In this work, we study patch diffusion models (PDMs) -- a diffusion paradigm which models the distribution of patches rather than whole inputs, keeping up to 0.7% of the original pixels. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First, to enforce consistency between patches, we develop deep context fusion -- an architectural technique that propagates the context information from low-scale to high-scale patches in a hierarchical manner. Second, to accelerate training and inference, we propose adaptive computation, which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 256x256, surpassing recent methods by more than 100%. Then we show that it can be rapidly fine-tuned from a base 36x64 low-resolution generator for high-resolution 64x288x512 text-to-video synthesis. To the best of our knowledge, our model is the first diffusion-based architecture which is trained on such high resolutions entirely end-to-end. Project webpage: https://snap-research.github.io/hpdm. 

+ + + + Neural Implicit Morphing of Face Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Schardong_Neural_Implicit_Morphing_of_Face_Images_CVPR_2024_paper.pdf + Face morphing is a problem in computer graphics with numerous artistic and forensic applications. It is challenging due to variations in pose lighting gender and ethnicity. This task consists of a warping for feature alignment and a blending for a seamless transition between the warped images. We propose to leverage coord-based neural networks to represent such warpings and blendings of face images. During training we exploit the smoothness and flexibility of such networks by combining energy functionals employed in classical approaches without discretizations. Additionally our method is time-dependent allowing a continuous warping/blending of the images. During morphing inference we need both direct and inverse transformations of the time-dependent warping. The first (second) is responsible for warping the target (source) image into the source (target) image. Our neural warping stores those maps in a single network dismissing the need for inverting them. The results of our experiments indicate that our method is competitive with both classical and generative models under the lens of image quality and face-morphing detectors. Aesthetically the resulting images present a seamless blending of diverse faces not yet usual in the literature. + + + + UniGS: Unified Representation for Image Generation and Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Qi_UniGS_Unified_Representation_for_Image_Generation_and_Segmentation_CVPR_2024_paper.pdf + This paper introduces a novel unified representation of diffusion models for image generation and segmentation. Specifically we use a colormap to represent entity-level masks addressing the challenge of varying entity numbers while aligning the representation closely with the image RGB domain. Two novel modules including the location-aware color palette and progressive dichotomy module are proposed to support our mask representation. On the one hand a location-aware palette guarantees the colors' consistency to entities' locations. On the other hand the progressive dichotomy module can efficiently decode the synthesized colormap to high-quality entity-level masks in a depth-first binary search without knowing the cluster numbers. To tackle the issue of lacking large-scale segmentation training data we employ an inpainting pipeline and then improve the flexibility of diffusion models across various tasks including inpainting image synthesis referring segmentation and entity segmentation. Comprehensive experiments validate the efficiency of our approach demonstrating comparable segmentation mask quality to state-of-the-art and adaptability to multiple tasks. + + + + Training-Free Open-Vocabulary Segmentation with Offline Diffusion-Augmented Prototype Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Barsellotti_Training-Free_Open-Vocabulary_Segmentation_with_Offline_Diffusion-Augmented_Prototype_Generation_CVPR_2024_paper.pdf + Open-vocabulary semantic segmentation aims at segmenting arbitrary categories expressed in textual form. Previous works have trained over large amounts of image-caption pairs to enforce pixel-level multimodal alignments. However captions provide global information about the semantics of a given image but lack direct localization of individual concepts. 
Further training on large-scale datasets inevitably brings significant computational costs. In this paper we propose FreeDA a training-free diffusion-augmented method for open-vocabulary semantic segmentation which leverages the ability of diffusion models to visually localize generated concepts and local-global similarities to match class-agnostic regions with semantic classes. Our approach involves an offline stage in which textual-visual reference embeddings are collected starting from a large set of captions and leveraging visual and semantic contexts. At test time these are queried to support the visual matching process which is carried out by jointly considering class-agnostic regions and global semantic similarities. Extensive analyses demonstrate that FreeDA achieves state-of-the-art performance on five datasets surpassing previous methods by more than 7.0 average points in terms of mIoU and without requiring any training. Our source code is available at https://aimagelab.github.io/freeda/. + + + + HUGS: Human Gaussian Splats + http://openaccess.thecvf.com//content/CVPR2024/papers/Kocabas_HUGS_Human_Gaussian_Splats_CVPR_2024_paper.pdf + Recent advances in neural rendering have improved both training and rendering times by orders of magnitude. While these methods demonstrate state-of-the-art quality and speed they are designed for photogrammetry of static scenes and do not generalize well to freely moving humans in the environment. In this work we introduce Human Gaussian Splats (HUGS) that represents an animatable human together with the scene using 3D Gaussian Splatting (3DGS). Our method takes only a monocular video with a small number of (50-100) frames and it automatically learns to disentangle the static scene and a fully animatable human avatar within 30 minutes. We utilize the SMPL body model to initialize the human Gaussians. To capture details that are not modeled by SMPL (e.g cloth hairs) we allow the 3D Gaussians to deviate from the human body model. Utilizing 3D Gaussians for animated humans brings new challenges including the artifacts created when articulating the Gaussians. We propose to jointly optimize the linear blend skinning weights to coordinate the movements of individual Gaussians during animation. Our approach enables novel-pose synthesis of human and novel view synthesis of both the human and the scene. We achieve state-of-the-art rendering quality with a rendering speed of 60 FPS while being 100x faster to train over previous work. + + + + PhysPT: Physics-aware Pretrained Transformer for Estimating Human Dynamics from Monocular Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_PhysPT_Physics-aware_Pretrained_Transformer_for_Estimating_Human_Dynamics_from_Monocular_CVPR_2024_paper.pdf + While current methods have shown promising progress on estimating 3D human motion from monocular videos their motion estimates are often physically unrealistic because they mainly consider kinematics. In this paper we introduce Physics-aware Pretrained Transformer (PhysPT) which improves kinematics-based motion estimates and infers motion forces. PhysPT exploits a Transformer encoder-decoder backbone to effectively learn human dynamics in a self-supervised manner. Moreover it incorporates physics principles governing human motion. Specifically we build a physics-based body representation and contact force model. We leverage them to impose novel physics-inspired training losses (i.e. 
force loss contact loss and Euler-Lagrange loss) enabling PhysPT to capture physical properties of the human body and the forces it experiences. Experiments demonstrate that once trained PhysPT can be directly applied to kinematics-based estimates to significantly enhance their physical plausibility and generate favourable motion forces. Furthermore we show that these physically meaningful quantities translate into improved accuracy of an important downstream task: human action recognition. + + + + EfficientDreamer: High-Fidelity and Robust 3D Creation via Orthogonal-view Diffusion Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_EfficientDreamer_High-Fidelity_and_Robust_3D_Creation_via_Orthogonal-view_Diffusion_Priors_CVPR_2024_paper.pdf + While image diffusion models have made significant progress in text-driven 3D content creation they often fail to accurately capture the intended meaning of text prompts especially for view information. This limitation leads to the Janus problem where multi-faced 3D models are generated under the guidance of such diffusion models. In this paper we propose a robust high-quality 3D content generation pipeline by exploiting orthogonal-view image guidance. First we introduce a novel 2D diffusion model that generates an image consisting of four orthogonal-view sub-images based on the given text prompt. Then the 3D content is created using this diffusion model. Notably the generated orthogonal-view image provides strong geometric structure priors and thus improves 3D consistency. As a result it effectively resolves the Janus problem and significantly enhances the quality of 3D content creation. Additionally we present a 3D synthesis fusion network that can further improve the details of the generated 3D contents. Both quantitative and qualitative evaluations demonstrate that our method surpasses previous text-to-3D techniques. Project page: https://efficientdreamer.github.io. + + + + HOIAnimator: Generating Text-prompt Human-object Animations using Novel Perceptive Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_HOIAnimator_Generating_Text-prompt_Human-object_Animations_using_Novel_Perceptive_Diffusion_Models_CVPR_2024_paper.pdf + To date the quest to rapidly and effectively produce human-object interaction (HOI) animations directly from textual descriptions stands at the forefront of computer vision research. The underlying challenge demands both a discriminating interpretation of language and a comprehensive physics-centric model supporting real-world dynamics. To ameliorate this paper advocates HOIAnimator a novel and interactive diffusion model with perception ability and also ingeniously crafted to revolutionize the animation of complex interactions from linguistic narratives. The effectiveness of our model is anchored in two ground-breaking innovations: (1) Our Perceptive Diffusion Models (PDM) brings together two types of models: one focused on human movements and the other on objects. This combination allows for animations where humans and objects move in concert with each other making the overall motion more realistic. Additionally we propose a Perceptive Message Passing (PMP) mechanism to enhance the communication bridging the two models ensuring that the animations are smooth and unified; (2) We devise an Interaction Contact Field (ICF) a sophisticated model that implicitly captures the essence of HOIs. 
Beyond mere predictive contact points the ICF assesses the proximity of human and object to their respective environment informed by a probabilistic distribution of interactions learned throughout the denoising phase. Our comprehensive evaluation showcases HOIanimator's superior ability to produce dynamic context-aware animations that surpass existing benchmarks in text-driven animation synthesis. + + + + SyncTalk: The Devil is in the Synchronization for Talking Head Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_SyncTalk_The_Devil_is_in_the_Synchronization_for_Talking_Head_CVPR_2024_paper.pdf + Achieving high synchronization in the synthesis of realistic speech-driven talking head videos presents a significant challenge. Traditional Generative Adversarial Networks (GAN) struggle to maintain consistent facial identity while Neural Radiance Fields (NeRF) methods although they can address this issue often produce mismatched lip movements inadequate facial expressions and unstable head poses. A lifelike talking head requires synchronized coordination of subject identity lip movements facial expressions and head poses. The absence of these synchronizations is a fundamental flaw leading to unrealistic and artificial outcomes. To address the critical issue of synchronization identified as the "devil" in creating realistic talking heads we introduce SyncTalk. This NeRF-based method effectively maintains subject identity enhancing synchronization and realism in talking head synthesis. SyncTalk employs a Face-Sync Controller to align lip movements with speech and innovatively uses a 3D facial blendshape model to capture accurate facial expressions. Our HeadSync Stabilizer optimizes head poses achieving more natural head movements. The Portrait-Sync Generator restores hair details and blends the generated head with the torso for a seamless visual experience. Extensive experiments and user studies demonstrate that SyncTalk outperforms state-of-the-art methods in synchronization and realism. We recommend watching the supplementary video: https://ziqiaopeng.github.io/synctalk + + + + DreamSalon: A Staged Diffusion Framework for Preserving Identity-Context in Editable Face Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_DreamSalon_A_Staged_Diffusion_Framework_for_Preserving_Identity-Context_in_Editable_CVPR_2024_paper.pdf + While large-scale pre-trained text-to-image models can synthesize diverse and high-quality human-centered images novel challenges arise with a nuanced task of "identity fine editing" - precisely modifying specific features of a subject while maintaining its inherent identity and context. Existing personalization methods either require time-consuming optimization or learning additional encoders adept in "identity re-contextualization". However they often struggle with detailed and sensitive tasks like human face editing. To address these challenges we introduce DreamSalon a noise-guided staged-editing framework uniquely focusing on detailed image manipulations and identity-context preservation. By discerning editing and boosting stages via the frequency and gradient of predicted noises DreamSalon first performs detailed manipulations on specific features in the editing stage guided by high-frequency information and then employs stochastic denoising in the boosting stage to improve image quality. 
For more precise editing DreamSalon semantically mixes source and target textual prompts guided by differences in their embedding covariances to direct the model's focus on specific manipulation areas. Our experiments demonstrate DreamSalon's ability to efficiently and faithfully edit fine details on human faces outperforming existing methods both qualitatively and quantitatively. + + + + Neural Super-Resolution for Real-time Rendering with Radiance Demodulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Neural_Super-Resolution_for_Real-time_Rendering_with_Radiance_Demodulation_CVPR_2024_paper.pdf + It is time-consuming to render high-resolution images in applications such as video games and virtual reality and thus super-resolution technologies become increasingly popular for real-time rendering. However it is challenging to preserve sharp texture details keep the temporal stability and avoid the ghosting artifacts in real-time super-resolution rendering. To address this issue we introduce radiance demodulation to separate the rendered image or radiance into a lighting component and a material component considering the fact that the light component is smoother than the rendered image so that the high-resolution material component with detailed textures can be easily obtained. We perform the super-resolution on the lighting component only and re-modulate it with the high-resolution material component to obtain the final super-resolution image with more texture details. A reliable warping module is proposed by explicitly marking the occluded regions to avoid the ghosting artifacts. To further enhance the temporal stability we design a frame-recurrent neural network and a temporal loss to aggregate the previous and current frames which can better capture the spatial-temporal consistency among reconstructed frames. As a result our method is able to produce temporally stable results in real-time rendering with high-quality details even in the challenging 4 x4 super-resolution scenarios. Code is available at: \href https://github.com/Riga2/NSRD https://github.com/Riga2/NSRD . + + + + MMM: Generative Masked Motion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Pinyoanuntapong_MMM_Generative_Masked_Motion_Model_CVPR_2024_paper.pdf + Recent advances in text-to-motion generation using diffusion and autoregressive models have shown promising results. However these models often suffer from a trade-off between real-time performance high fidelity and motion editability. To address this gap we introduce MMM a novel yet simple motion generation paradigm based on Masked Motion Model. MMM consists of two key components: (1) a motion tokenizer that transforms 3D human motion into a sequence of discrete tokens in latent space and (2) a conditional masked motion transformer that learns to predict randomly masked motion tokens conditioned on the pre-computed text tokens. By attending to motion and text tokens in all directions MMM explicitly captures inherent dependency among motion tokens and semantic mapping between motion and text tokens. During inference this allows parallel and iterative decoding of multiple motion tokens that are highly consistent with fine-grained text descriptions therefore simultaneously achieving high-fidelity and high-speed motion generation. In addition MMM has innate motion editability. 
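The masked-motion objective in the MMM entry above follows the familiar BERT-style recipe: motion is quantized into discrete tokens, a random subset is replaced by a [MASK] token, and a bidirectional transformer is trained to predict the masked entries with cross-entropy. The sketch below shows that training step under assumptions; the vocabulary size, mask ratio, and tiny transformer are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

VOCAB, MASK_ID, SEQ_LEN, DIM = 512, 512, 64, 128   # assumed sizes; MASK_ID is an extra token

class MaskedMotionModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB + 1, DIM)            # +1 for [MASK]
        self.pos = nn.Parameter(torch.zeros(1, SEQ_LEN, DIM))
        layer = nn.TransformerEncoderLayer(DIM, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(DIM, VOCAB)                     # predict original tokens

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens) + self.pos)
        return self.head(h)

def masked_training_step(model, tokens, mask_ratio=0.4):
    """tokens: (B, SEQ_LEN) long tensor of motion-VQ indices."""
    mask = torch.rand_like(tokens, dtype=torch.float) < mask_ratio
    corrupted = tokens.masked_fill(mask, MASK_ID)
    logits = model(corrupted)
    # Supervise only the masked positions, as in masked language modeling.
    loss = nn.functional.cross_entropy(logits[mask], tokens[mask])
    return loss

if __name__ == "__main__":
    model = MaskedMotionModel()
    fake_tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))
    print(masked_training_step(model, fake_tokens).item())
```

Editing then amounts to placing [MASK] tokens over the span to change and decoding only those positions, which is why the next sentence of the abstract can treat in-filling and in-betweening as the same operation.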
By simply placing mask tokens in the place that needs editing MMM automatically fills the gaps while guaranteeing smooth transitions between editing and non-editing parts. Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that MMM surpasses current leading methods in generating high-quality motion (evidenced by superior FID scores of 0.08 and 0.429) while offering advanced editing features such as body-part modification motion in-betweening and the synthesis of long motion sequences. In addition MMM is two orders of magnitude faster on a single mid-range GPU than editable motion diffusion models. Our project page is available at https://exitudio.github.io/MMM-page/. + + + + PEGASUS: Personalized Generative 3D Avatars with Composable Attributes + http://openaccess.thecvf.com//content/CVPR2024/papers/Cha_PEGASUS_Personalized_Generative_3D_Avatars_with_Composable_Attributes_CVPR_2024_paper.pdf + We present PEGASUS a method for constructing a personalized generative 3D face avatar from monocular video sources. Our generative 3D avatar enables disentangled controls to selectively alter the facial attributes (e.g. hair or nose) while preserving the identity. Our approach consists of two stages: synthetic database generation and constructing a personalized generative avatar. We generate a synthetic video collection of the target identity with varying facial attributes where the videos are synthesized by borrowing the attributes from monocular videos of diverse identities. Then we build a person-specific generative 3D avatar that can modify its attributes continuously while preserving its identity. Through extensive experiments we demonstrate that our method of generating a synthetic database and creating a 3D generative avatar is the most effective in preserving identity while achieving high realism. Subsequently we introduce a zero-shot approach to achieve the same goal of generative modeling more efficiently by leveraging a previously constructed personalized generative model. + + + + Diff-Plugin: Revitalizing Details for Diffusion-based Low-level Tasks + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Diff-Plugin_Revitalizing_Details_for_Diffusion-based_Low-level_Tasks_CVPR_2024_paper.pdf + Diffusion models trained on large-scale datasets have achieved remarkable progress in image synthesis. However due to the randomness in the diffusion process they often struggle with handling diverse low-level tasks that require details preservation. To overcome this limitation we present a new Diff-Plugin framework to enable a single pre-trained diffusion model to generate high-fidelity results across a variety of low-level tasks. Specifically we first propose a lightweight Task-Plugin module with a dual branch design to provide task-specific priors guiding the diffusion process in preserving image content. We then propose a Plugin-Selector that can automatically select different Task-Plugins based on the text instruction allowing users to edit images by indicating multiple low-level tasks with natural language. We conduct extensive experiments on 8 low-level vision tasks. The results demonstrate the superiority of Diff-Plugin over existing methods particularly in real-world scenarios. Our ablations further validate that Diff-Plugin is stable schedulable and supports robust training across different dataset sizes. 
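The Diff-Plugin entry above leaves the routing mechanism abstract ("select different Task-Plugins based on the text instruction"). One way to picture such routing is retrieval by text similarity, sketched below; the plugin names, the one-line descriptions, the generic `encode_text` function, and the cosine-similarity rule are all assumptions for illustration, while the paper's actual Plugin-Selector is a learned module.

```python
import torch
import torch.nn.functional as F

# Hypothetical plugin registry: names and descriptions are illustrative only.
PLUGINS = {
    "derain":   "remove rain streaks from the photo",
    "dehaze":   "remove haze and fog from the photo",
    "lowlight": "brighten a dark low-light photo",
}

def select_plugin(instruction: str, encode_text) -> str:
    """Pick the Task-Plugin whose description best matches the instruction.

    `encode_text(str) -> (D,) tensor` is an assumed text encoder (any CLIP-style
    text tower would do); the routing rule here is plain cosine similarity.
    """
    q = F.normalize(encode_text(instruction), dim=-1)
    names = list(PLUGINS)
    refs = F.normalize(torch.stack([encode_text(PLUGINS[n]) for n in names]), dim=-1)
    scores = refs @ q                       # (num_plugins,)
    return names[int(scores.argmax())]

if __name__ == "__main__":
    # Stand-in encoder so the sketch runs on its own (selections are arbitrary).
    def toy_encode(text: str, dim: int = 64) -> torch.Tensor:
        g = torch.Generator().manual_seed(abs(hash(text)) % (2**31))
        return torch.randn(dim, generator=g)

    print(select_plugin("please clean up the rain in this image", toy_encode))
```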
+ + + + Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Intelligent_Grimm_-_Open-ended_Visual_Storytelling_via_Latent_Diffusion_Models_CVPR_2024_paper.pdf + Generative models have recently exhibited exceptional capabilities in text-to-image generation but still struggle to generate image sequences coherently. In this work we focus on a novel yet challenging task of generating a coherent image sequence based on a given storyline denoted as open-ended visual storytelling. We make the following three contributions: (i) to fulfill the task of visual storytelling we propose a learning-based auto-regressive image generation model termed as StoryGen with a novel vision-language context module that enables to generate the current frame by conditioning on the corresponding text prompt and preceding image-caption pairs; (ii) to address the data shortage of visual storytelling we collect paired image-text sequences by sourcing from online videos and open-source E-books establishing processing pipeline for constructing a large-scale dataset with diverse characters storylines and artistic styles named StorySalon; (iii) Quantitative experiments and human evaluations have validated the superiority of our StoryGen where we show it can generalize to unseen characters without any optimization and generate image sequences with coherent content and consistent character. Code dataset and models are available at https://haoningwu3639.github.io/StoryGen_Webpage/ + + + + GenTron: Diffusion Transformers for Image and Video Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_GenTron_Diffusion_Transformers_for_Image_and_Video_Generation_CVPR_2024_paper.pdf + In this study we explore Transformer based diffusion models for image and video generation. Despite the dominance of Transformer architectures in various fields due to their flexibility and scalability the visual generative domain primarily utilizes CNN-based U-Net architectures particularly in diffusion-based models. We introduce GenTron a family of Generative models employing Transformer-based diffusion to address this gap. Our initial step was to adapt Diffusion Transformers (DiTs) from class to text conditioning a process involving thorough empirical exploration of the conditioning mechanism. We then scale GenTron from approximately 900M to over 3B parameters observing improvements in visual quality. Furthermore we extend GenTron to text-to-video generation incorporating novel motion-free guidance to enhance video quality. In human evaluations against SDXL GenTron achieves a 51.1% win rate in visual quality (with a 19.8% draw rate) and a 42.3% win rate in text alignment (with a 42.9% draw rate). GenTron notably performs well in T2I-CompBench highlighting its compositional generation ability. We hope GenTron could provide meaningful insights and serve as a valuable reference for future research. Please refer to the arXiv version for the most up-to-date results: https://arxiv.org/abs/2312.04557. + + + + TRIP: Temporal Residual Learning with Image Noise Prior for Image-to-Video Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_TRIP_Temporal_Residual_Learning_with_Image_Noise_Prior_for_Image-to-Video_CVPR_2024_paper.pdf + Recent advances in text-to-video generation have demonstrated the utility of powerful diffusion models. Nevertheless the problem is not trivial when shaping diffusion models to animate static image (i.e. 
image-to-video generation). The difficulty originates from the aspect that the diffusion process of subsequent animated frames should not only preserve the faithful alignment with the given image but also pursue temporal coherence among adjacent frames. To alleviate this we present TRIP a new recipe of image-to-video diffusion paradigm that pivots on image noise prior derived from static image to jointly trigger inter-frame relational reasoning and ease the coherent temporal modeling via temporal residual learning. Technically the image noise prior is first attained through one-step backward diffusion process based on both static image and noised video latent codes. Next TRIP executes a residual-like dual-path scheme for noise prediction: 1) a shortcut path that directly takes image noise prior as the reference noise of each frame to amplify the alignment between the first frame and subsequent frames; 2) a residual path that employs 3D-UNet over noised video and static image latent codes to enable inter-frame relational reasoning thereby easing the learning of the residual noise for each frame. Furthermore both reference and residual noise of each frame are dynamically merged via attention mechanism for final video generation. Extensive experiments on WebVid-10M DTDB and MSR-VTT datasets demonstrate the effectiveness of our TRIP for image-to-video generation. Please see our project page at https://trip-i2v.github.io/TRIP/. + + + + TexVocab: Texture Vocabulary-conditioned Human Avatars + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_TexVocab_Texture_Vocabulary-conditioned_Human_Avatars_CVPR_2024_paper.pdf + To adequately utilize the available image evidence in multi-view video-based avatar modeling we propose TexVocab a novel avatar representation that constructs a texture vocabulary and associates body poses with texture maps for animation. Given multi-view RGB videos our method initially back-projects all the available images in the training videos to the posed SMPL surface producing texture maps in the SMPL UV domain. Then we construct pairs of human poses and texture maps to establish a texture vocabulary for encoding dynamic human appearances under various poses. Unlike the commonly used joint-wise manner we further design a body-part-wise encoding strategy to learn the structural effects of the kinematic chain. Given a driving pose we query the pose feature hierarchically by decomposing the pose vector into several body parts and interpolating the texture features for synthesizing fine-grained human dynamics. Overall our method is able to create animatable human avatars with detailed and dynamic appearances from RGB videos and the experiments show that our method outperforms state-of-the-art approaches. + + + + KITRO: Refining Human Mesh by 2D Clues and Kinematic-tree Rotation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_KITRO_Refining_Human_Mesh_by_2D_Clues_and_Kinematic-tree_Rotation_CVPR_2024_paper.pdf + 2D keypoints are commonly used as an additional cue to refine estimated 3D human meshes. Current methods optimize the pose and shape parameters with a reprojection loss on the provided 2D keypoints. Such an approach while simple and intuitive has limited effectiveness because the optimal solution is hard to find in ambiguous parameter space and may sacrifice depth. Additionally divergent gradients from distal joints complicate and deviate the refinement of proximal joints in the kinematic chain. 
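The refinement baseline that the KITRO entry above argues against is straightforward to write down: pose parameters are updated by gradient descent on a 2D reprojection loss between projected 3D joints and detected keypoints. A minimal sketch follows; the weak-perspective camera, the free-joint parameterization (standing in for SMPL pose/shape), and the optimizer settings are illustrative assumptions.

```python
import torch

def weak_perspective_project(joints3d, scale, trans):
    """joints3d: (J, 3) -> (J, 2) under an assumed weak-perspective camera."""
    return scale * joints3d[:, :2] + trans

def refine_by_reprojection(joints3d_init, keypoints2d, conf, steps=100, lr=1e-2):
    """Gradient-based refinement on a confidence-weighted 2D reprojection loss.

    joints3d_init: (J, 3) initial 3D joints (e.g. from a regressor).
    keypoints2d:   (J, 2) detected 2D keypoints; conf: (J,) confidences.
    """
    joints3d = joints3d_init.clone().requires_grad_(True)
    scale = torch.ones(1, requires_grad=True)
    trans = torch.zeros(2, requires_grad=True)
    opt = torch.optim.Adam([joints3d, scale, trans], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        proj = weak_perspective_project(joints3d, scale, trans)
        loss = (conf[:, None] * (proj - keypoints2d) ** 2).mean()
        loss.backward()
        opt.step()
    return joints3d.detach(), loss.item()

if __name__ == "__main__":
    J = 17
    gt3d = torch.randn(J, 3)
    kp2d = weak_perspective_project(gt3d, torch.tensor(1.2), torch.tensor([0.1, -0.2]))
    refined, final = refine_by_reprojection(gt3d + 0.1 * torch.randn(J, 3), kp2d, torch.ones(J))
    print("final reprojection loss:", final)
```

The abstract's complaints map directly onto this loop: the loss is blind to depth, and gradients from distal joints flow back through the whole kinematic chain.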
To address these we introduce Kinematic-Tree Rotation (KITRO) a novel mesh refinement strategy that explicitly models depth and human kinematic-tree structure. KITRO treats refinement from a bone-wise perspective. Unlike previous methods which perform gradient-based optimizations our method calculates bone directions in closed form. By accounting for the 2D pose bone length and parent joint's depth the calculation results in two possible directions for each child joint. We then use a decision tree to trace binary choices for all bones along the human skeleton's kinematic-tree to select the most probable hypothesis. Our experiments across various datasets and baseline models demonstrate that KITRO significantly improves 3D joint estimation accuracy and achieves an ideal 2D fit simultaneously. Our code available at: https://github.com/MartaYang/KITRO. + + + + SuGaR: Surface-Aligned Gaussian Splatting for Efficient 3D Mesh Reconstruction and High-Quality Mesh Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Guedon_SuGaR_Surface-Aligned_Gaussian_Splatting_for_Efficient_3D_Mesh_Reconstruction_and_CVPR_2024_paper.pdf + We propose a method to allow precise and extremely fast mesh extraction from 3D Gaussian Splatting. Gaussian Splatting has recently become very popular as it yields realistic rendering while being significantly faster to train than NeRFs. It is however challenging to extract a mesh from the millions of tiny 3D Gaussians as these Gaussians tend to be unorganized after optimization and no method has been proposed so far. Our first key contribution is a regularization term that encourages the Gaussians to align well with the surface of the scene. We then introduce a method that exploits this alignment to extract a mesh from the Gaussians using Poisson reconstruction which is fast scalable and preserves details in contrast to the Marching Cubes algorithm usually applied to extract meshes from Neural SDFs. Finally we introduce an optional refinement strategy that binds Gaussians to the surface of the mesh and jointly optimizes these Gaussians and the mesh through Gaussian splatting rendering. This enables easy editing sculpting animating and relighting of the Gaussians by manipulating the mesh instead of the Gaussians themselves. Retrieving such an editable mesh for realistic rendering is done within minutes with our method compared to hours with the state-of-the-art method on SDFs while providing a better rendering quality. + + + + Towards Effective Usage of Human-Centric Priors in Diffusion Models for Text-based Human Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Towards_Effective_Usage_of_Human-Centric_Priors_in_Diffusion_Models_for_CVPR_2024_paper.pdf + Vanilla text-to-image diffusion models struggle with generating accurate human images commonly resulting in imperfect anatomies such as unnatural postures or disproportionate limbs. Existing methods address this issue mostly by fine-tuning the model with extra images or adding additional controls --- human-centric priors such as pose or depth maps --- during the image generation phase. This paper explores the integration of these human-centric priors directly into the model fine-tuning stage essentially eliminating the need for extra conditions at the inference stage. We realize this idea by proposing a human-centric alignment loss to strengthen human-related information from the textual prompts within the cross-attention maps. 
To ensure semantic detail richness and human structural accuracy during fine-tuning we introduce scale-aware and step-wise constraints within the diffusion process according to an in-depth analysis of the cross-attention layer. Extensive experiments show that our method largely improves over state-of-the-art text-to-image models to synthesize high-quality human images based on user-written prompts. + + + + A Video is Worth 256 Bases: Spatial-Temporal Expectation-Maximization Inversion for Zero-Shot Video Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_A_Video_is_Worth_256_Bases_Spatial-Temporal_Expectation-Maximization_Inversion_for_CVPR_2024_paper.pdf + This paper presents a video inversion approach for zero-shot video editing which models the input video with low-rank representation during the inversion process. The existing video editing methods usually apply the typical 2D DDIM inversion or naive spatial-temporal DDIM inversion before editing which leverages time-varying representation for each frame to derive noisy latent. Unlike most existing approaches we propose a Spatial-Temporal Expectation-Maximization (STEM) inversion which formulates the dense video feature under an expectation-maximization manner and iteratively estimates a more compact basis set to represent the whole video. Each frame applies the fixed and global representation for inversion which is more friendly for temporal consistency during reconstruction and editing. Extensive qualitative and quantitative experiments demonstrate that our STEM inversion can achieve consistent improvement on two state-of-the-art video editing methods. Project page: https://stem-inv.github.io/page/. + + + + URHand: Universal Relightable Hands + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_URHand_Universal_Relightable_Hands_CVPR_2024_paper.pdf + Existing photorealistic relightable hand models require extensive identity-specific observations in different views poses and illuminations and face challenges in generalizing to natural illuminations and novel identities. To bridge this gap we present URHand the first universal relightable hand model that generalizes across viewpoints poses illuminations and identities. Our model allows few-shot personalization using images captured with a mobile phone and is ready to be photorealistically rendered under novel illuminations. To simplify the personalization process while retaining photorealism we build a powerful universal relightable prior based on neural relighting from multi-view images of hands captured in a light stage with hundreds of identities. The key challenge is scaling the cross-identity training while maintaining personalized fidelity and sharp details without compromising generalization under natural illuminations. To this end we propose a spatially varying linear lighting model as the neural renderer that takes physics-inspired shading as input feature. By removing non-linear activations and bias our specifically designed lighting model explicitly keeps the linearity of light transport. This enables single-stage training from light-stage data while generalizing to real-time rendering under arbitrary continuous illuminations across diverse identities. In addition we introduce the joint learning of a physically based model and our neural relighting model which further improves fidelity and generalization. Extensive experiments show that our approach achieves superior performance over existing methods in terms of both quality and generalizability. 
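The "spatially varying linear lighting model" in the URHand entry above can be pictured as a renderer whose output is exactly linear in the incident lighting: with no bias and no nonlinearity, the response to a sum of lights equals the sum of per-light responses. The sketch below only demonstrates that linearity property; the feature dimensions and the single bias-free layer are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class LinearLightingHead(nn.Module):
    """Maps per-pixel, per-light shading features to RGB with a bias-free
    linear layer, so the rendered radiance is linear in the lighting."""
    def __init__(self, feat_dim=16):
        super().__init__()
        self.to_rgb = nn.Linear(feat_dim, 3, bias=False)   # no bias, no activation

    def forward(self, shading_feats):
        # shading_feats: (num_lights, num_pixels, feat_dim); sum over lights.
        return self.to_rgb(shading_feats).sum(dim=0)        # (num_pixels, 3)

if __name__ == "__main__":
    torch.manual_seed(0)
    head = LinearLightingHead()
    a = torch.rand(4, 1024, 16)   # shading features under light set A
    b = torch.rand(4, 1024, 16)   # shading features under light set B
    lhs = head(a + b)             # render under the combined lighting
    rhs = head(a) + head(b)       # sum of the individual renders
    print(torch.allclose(lhs, rhs, atol=1e-5))  # True: light transport stays linear
```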
We also demonstrate quick personalization of URHand from a short phone scan of an unseen identity. + + + + Named Entity Driven Zero-Shot Image Manipulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_Named_Entity_Driven_Zero-Shot_Image_Manipulation_CVPR_2024_paper.pdf + We introduced StyleEntity a zero-shot image manipulation model that utilizes named entities as proxies during its training phase. This strategy enables our model to manipulate images using unseen textual descriptions during inference all within a single training phase. Additionally we proposed an inference technique termed Prompt Ensemble Latent Averaging (PELA). PELA averages the manipulation directions derived from various named entities during inference effectively eliminating the noise directions thus achieving stable manipulation. In our experiments StyleEntity exhibited superior performance in a zero-shot setting compared to other methods. The code model weights and datasets is available at https://github.com/feng-zhida/StyleEntity. + + + + ESR-NeRF: Emissive Source Reconstruction Using LDR Multi-view Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Jeong_ESR-NeRF_Emissive_Source_Reconstruction_Using_LDR_Multi-view_Images_CVPR_2024_paper.pdf + Existing NeRF-based inverse rendering methods suppose that scenes are exclusively illuminated by distant light sources neglecting the potential influence of emissive sources within a scene. In this work we confront this limitation using LDR multi-view images captured with emissive sources turned on and off. Two key issues must be addressed: 1) ambiguity arising from the limited dynamic range along with unknown lighting details and 2) the expensive computational cost in volume rendering to backtrace the paths leading to final object colors. We present a novel approach ESR-NeRF leveraging neural networks as learnable functions to represent ray-traced fields. By training networks to satisfy light transport segments we regulate outgoing radiances progressively identifying emissive sources while being aware of reflection areas. The results on scenes encompassing emissive sources with various properties demonstrate the superiority of ESR-NeRF in qualitative and quantitative ways. Our approach also extends its applicability to the scenes devoid of emissive sources achieving lower CD metrics on the DTU dataset. + + + + Infer from What You Have Seen Before: Temporally-dependent Classifier for Semi-supervised Video Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhuang_Infer_from_What_You_Have_Seen_Before_Temporally-dependent_Classifier_for_CVPR_2024_paper.pdf + Due to high expense of human labor one major challenge for semantic segmentation in real-world scenarios is the lack of sufficient pixel-level labels which is more serious when processing video data. To exploit unlabeled data for model training semi-supervised learning methods attempt to construct pseudo labels or various auxiliary constraints as supervision signals. However most of them just process video data as a set of independent images in a per-frame manner. The rich temporal relationships are ignored which can serve as valuable clues for representation learning. Besides this per-frame recognition paradigm is quite different from that of humans. Actually benefited from the internal temporal relevance of video data human would wisely use the distinguished semantic concepts in historical frames to aid the recognition of the current frame. 
Motivated by this observation we propose a novel temporally-dependent classifier (TDC) to mimic the human-like recognition procedure. Comparing to the conventional classifier TDC can guide the model to learn a group of temporally-consistent semantic concepts across frames which essentially provides an implicit and effective constraint. We conduct extensive experiments on Cityscapes and CamVid and the results demonstrate the superiority of our proposed method to previous state-of-the-art methods. The code is available at https://github.com/jfzhuang/TDC. + + + + Video Frame Interpolation via Direct Synthesis with the Event-based Reference + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Video_Frame_Interpolation_via_Direct_Synthesis_with_the_Event-based_Reference_CVPR_2024_paper.pdf + Video Frame Interpolation (VFI) has witnessed a surge in popularity due to its abundant downstream applications. Event-based VFI (E-VFI) has recently propelled the advancement of VFI. Thanks to the high temporal resolution benefits event cameras can bridge the informational void present between successive video frames. Most state-of-the-art E-VFI methodologies follow the conventional VFI paradigm which pivots on motion estimation between consecutive frames to generate intermediate frames through a process of warping and refinement. However this reliance engenders a heavy dependency on the quality and consistency of keyframes rendering these methods susceptible to challenges in extreme real-world scenarios such as missing moving objects and severe occlusion dilemmas. This study proposes a novel E-VFI framework that directly synthesize intermediate frames leveraging event-based reference obviating the necessity for explicit motion estimation and substantially enhancing the capacity to handle motion occlusion. Given the sparse and inherently noisy nature of event data we prioritize the reliability of the event-based reference leading to the development of an innovative event-aware reconstruction strategy for accurate reference generation. Besides we implement a bi-directional event-guided alignment from keyframes to the reference using the introduced E-PCD module. Finally a transformer-based decoder is adopted for prediction refinement. Comprehensive experimental evaluations on both synthetic and real-world datasets underscore the superiority of our approach and its potential to execute high-quality VFI tasks. + + + + DSL-FIQA: Assessing Facial Image Quality via Dual-Set Degradation Learning and Landmark-Guided Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_DSL-FIQA_Assessing_Facial_Image_Quality_via_Dual-Set_Degradation_Learning_and_CVPR_2024_paper.pdf + Generic Face Image Quality Assessment (GFIQA) evaluates the perceptual quality of facial images which is crucial in improving image restoration algorithms and selecting high-quality face images for downstream tasks. We present a novel transformer-based method for GFIQA which is aided by two unique mechanisms. First a novel Dual-Set Degradation Representation Learning (DSL) mechanism uses facial images with both synthetic and real degradations to decouple degradation from content ensuring generalizability to real-world scenarios. This self-supervised method learns degradation features on a global scale providing a robust alternative to conventional methods that use local patch information in degradation learning. 
Second our transformer leverages facial landmarks to emphasize visually salient parts of a face image in evaluating its perceptual quality. We also introduce a balanced and diverse Comprehensive Generic Face IQA (CGFIQA-40k) dataset of 40K images carefully designed to overcome the biases in particular the imbalances in skin tone and gender representation in existing datasets. Extensive analysis and evaluation demonstrate the robustness of our method marking a significant improvement over prior methods. + + + + FMA-Net: Flow-Guided Dynamic Filtering and Iterative Feature Refinement with Multi-Attention for Joint Video Super-Resolution and Deblurring + http://openaccess.thecvf.com//content/CVPR2024/papers/Youk_FMA-Net_Flow-Guided_Dynamic_Filtering_and_Iterative_Feature_Refinement_with_Multi-Attention_CVPR_2024_paper.pdf + We present a joint learning scheme of video super-resolution and deblurring called VSRDB to restore clean high-resolution (HR) videos from blurry low-resolution (LR) ones. This joint restoration problem has drawn much less attention compared to single restoration problems. In this paper we propose a novel flow-guided dynamic filtering (FGDF) and iterative feature refinement with multi-attention (FRMA) which constitutes our VSRDB framework denoted as FMA-Net. Specifically our proposed FGDF enables precise estimation of both spatio-temporally-variant degradation and restoration kernels that are aware of motion trajectories through sophisticated motion representation learning. Compared to conventional dynamic filtering the FGDF enables the FMA-Net to effectively handle large motions into the VSRDB. Additionally the stacked FRMA blocks trained with our novel temporal anchor (TA) loss which temporally anchors and sharpens features refine features in a coarse-to-fine manner through iterative updates. Extensive experiments demonstrate the superiority of the proposed FMA-Net over state-of-the-art methods in terms of both quantitative and qualitative quality. Codes and pre-trained models are available at: https://kaist-viclab.github.io/fmanet-site. + + + + Hourglass Tokenizer for Efficient Transformer-Based 3D Human Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Hourglass_Tokenizer_for_Efficient_Transformer-Based_3D_Human_Pose_Estimation_CVPR_2024_paper.pdf + Transformers have been successfully applied in the field of video-based 3D human pose estimation. However the high computational costs of these video pose transformers (VPTs) make them impractical on resource-constrained devices. In this paper we present a plug-and-play pruning-and-recovering framework called Hourglass Tokenizer (HoT) for efficient transformer-based 3D human pose estimation from videos. Our HoT begins with pruning pose tokens of redundant frames and ends with recovering full-length tokens resulting in a few pose tokens in the intermediate transformer blocks and thus improving the model efficiency. To effectively achieve this we propose a token pruning cluster (TPC) that dynamically selects a few representative tokens with high semantic diversity while eliminating the redundancy of video frames. In addition we develop a token recovering attention (TRA) to restore the detailed spatio-temporal information based on the selected tokens thereby expanding the network output to the original full-length temporal resolution for fast inference. Extensive experiments on two benchmark datasets (i.e. 
Human3.6M and MPI-INF-3DHP) demonstrate that our method can achieve both high efficiency and estimation accuracy compared to the original VPT models. For instance applying to MotionBERT and MixSTE on Human3.6M our HoT can save nearly 50% FLOPs without sacrificing accuracy and nearly 40% FLOPs with only 0.2% accuracy drop respectively. Code and models are available at https://github.com/NationalGAILab/HoT. + + + + Boosting Diffusion Models with Moving Average Sampling in Frequency Domain + http://openaccess.thecvf.com//content/CVPR2024/papers/Qian_Boosting_Diffusion_Models_with_Moving_Average_Sampling_in_Frequency_Domain_CVPR_2024_paper.pdf + Diffusion models have recently brought a powerful revolution in image generation. Despite showing impressive generative capabilities most of these models rely on the current sample to denoise the next one possibly resulting in denoising instability. In this paper we reinterpret the iterative denoising process as model optimization and leverage a moving average mechanism to ensemble all the prior samples. Instead of simply applying moving average to the denoised samples at different timesteps we first map the denoised samples to data space and then perform moving average to avoid distribution shift across timesteps. In view that diffusion models evolve the recovery from low-frequency components to high-frequency details we further decompose the samples into different frequency components and execute moving average separately on each component. We name the complete approach "Moving Average Sampling in Frequency domain (MASF)". MASF could be seamlessly integrated into mainstream pre-trained diffusion models and sampling schedules. Extensive experiments on both unconditional and conditional diffusion models demonstrate that our MASF leads to superior performances compared to the baselines with almost negligible additional complexity cost. + + + + Bi-Causal: Group Activity Recognition via Bidirectional Causality + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Bi-Causal_Group_Activity_Recognition_via_Bidirectional_Causality_CVPR_2024_paper.pdf + Current approaches in Group Activity Recognition (GAR) predominantly emphasize Human Relations (HRs) while often neglecting the impact of Human-Object Interactions (HOIs). This study prioritizes the consideration of both HRs and HOIs emphasizing their interdependence. Notably employing Granger Causality Tests reveals the presence of bidirectional causality between HRs and HOIs. Leveraging this insight we propose a Bidirectional-Causal GAR network. This network establishes a causality communication channel while modeling relations and interactions enabling reciprocal enhancement between human-object interactions and human relations ensuring their mutual consistency. Additionally an Interaction Module is devised to effectively capture the dynamic nature of human-object interactions. Comprehensive experiments conducted on two publicly available datasets showcase the superiority of our proposed method over state-of-the-art approaches. + + + + Space-Time Diffusion Features for Zero-Shot Text-Driven Motion Transfer + http://openaccess.thecvf.com//content/CVPR2024/papers/Yatim_Space-Time_Diffusion_Features_for_Zero-Shot_Text-Driven_Motion_Transfer_CVPR_2024_paper.pdf + We present a new method for text-driven motion transfer - synthesizing a video that complies with an input text prompt describing the target objects and scene while maintaining an input video's motion and scene layout. 
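A minimal sketch of the frequency-split moving average described in the MASF entry above: the denoised estimate at each step is split into low- and high-frequency bands with an FFT mask, each band keeps its own exponential moving average across timesteps, and the bands are recombined. The cutoff radius and decay values are illustrative assumptions, and a real integration would first map the noisy latent to a predicted clean sample.

```python
import torch

def freq_split(x, cutoff=0.1):
    """Split (B, C, H, W) into low/high-frequency parts with a radial FFT mask."""
    B, C, H, W = x.shape
    fy = torch.fft.fftfreq(H).view(H, 1)
    fx = torch.fft.fftfreq(W).view(1, W)
    low_mask = ((fy ** 2 + fx ** 2).sqrt() <= cutoff).to(x.dtype)
    spec = torch.fft.fft2(x)
    low = torch.fft.ifft2(spec * low_mask).real
    return low, x - low

class FreqEMA:
    """Keeps a separate moving average for the low and high bands of the sample."""
    def __init__(self, decay_low=0.9, decay_high=0.5):
        self.decay = (decay_low, decay_high)
        self.state = None

    def update(self, x0_pred):
        bands = freq_split(x0_pred)
        if self.state is None:
            self.state = list(bands)
        else:
            self.state = [d * s + (1 - d) * b
                          for d, s, b in zip(self.decay, self.state, bands)]
        return self.state[0] + self.state[1]    # recombined, smoothed estimate

if __name__ == "__main__":
    ema = FreqEMA()
    for _ in range(10):                          # stand-in for denoising steps
        x0_pred = torch.rand(1, 3, 32, 32)
        smoothed = ema.update(x0_pred)
    print(smoothed.shape)
```

Averaging in data space rather than on the raw noisy samples is what avoids mixing estimates that live at different noise levels.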
Prior methods are confined to transferring motion across two subjects within the same or closely related object categories and are applicable for limited domains (e.g. humans). In this work we consider a significantly more challenging setting in which the target and source objects differ drastically in shape and fine-grained motion characteristics (e.g. translating a jumping dog into a dolphin). To this end we leverage a pre-trained and fixed text-to-video diffusion model which provides us with generative and motion priors. The pillar of our method is a new space-time feature loss derived directly from the model. This loss guides the generation process to preserve the overall motion of the input video while complying with the target object in terms of shape and fine-grained motion traits. + + + + MIGC: Multi-Instance Generation Controller for Text-to-Image Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_MIGC_Multi-Instance_Generation_Controller_for_Text-to-Image_Synthesis_CVPR_2024_paper.pdf + We present a Multi-Instance Generation (MIG) task simultaneously generating multiple instances with diverse controls in one image. Given a set of predefined coordinates and their corresponding descriptions the task is to ensure that generated instances are accurately at the designated locations and that all instances' attributes adhere to their corresponding description. This broadens the scope of current research on Single-instance generation elevating it to a more versatile and practical dimension. Inspired by the idea of divide and conquer we introduce an innovative approach named Multi-Instance Generation Controller (MIGC) to address the challenges of the MIG task. Initially we break down the MIG task into several subtasks each involving the shading of a single instance. To ensure precise shading for each instance we introduce an instance enhancement attention mechanism. Lastly we aggregate all the shaded instances to provide the necessary information for accurately generating multiple instances in stable diffusion (SD). To evaluate how well generation models perform on the MIG task we provide a COCO-MIG benchmark along with an evaluation pipeline. Extensive experiments were conducted on the proposed COCO-MIG benchmark as well as on various commonly used benchmarks. The evaluation results illustrate the exceptional control capabilities of our model in terms of quantity position attribute and interaction. Code and demos will be released at https://migcproject.github.io/. + + + + Distilling CLIP with Dual Guidance for Learning Discriminative Human Body Shape Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Distilling_CLIP_with_Dual_Guidance_for_Learning_Discriminative_Human_Body_CVPR_2024_paper.pdf + Person Re-Identification (ReID) holds critical importance in computer vision with pivotal applications in public safety and crime prevention. Traditional ReID methods reliant on appearance attributes such as clothing and color encounter limitations in long-term scenarios and dynamic environments. To address these challenges we propose CLIP3DReID an innovative approach that enhances person ReID by integrating linguistic descriptions with visual perception leveraging pretrained CLIP model for knowledge distillation. Our method first employs CLIP to automatically label body shapes with linguistic descriptors. 
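The first step of CLIP3DReID quoted above ("employs CLIP to automatically label body shapes with linguistic descriptors") amounts to zero-shot scoring of a descriptor list against a person crop. The sketch below shows such scoring with the Hugging Face CLIP interface; the descriptor phrases are hypothetical examples, and the paper may use a different prompt set and CLIP variant.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Hypothetical body-shape descriptors; the actual prompt set is the paper's choice.
DESCRIPTORS = ["a slim person", "a muscular person", "a broad-shouldered person",
               "a short person", "a tall person"]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def label_body_shape(image: Image.Image, top_k: int = 2):
    """Return the top-k descriptors for one person crop by CLIP similarity."""
    inputs = processor(text=DESCRIPTORS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    scores, idx = probs.topk(top_k)
    return [(DESCRIPTORS[i], float(s)) for i, s in zip(idx, scores)]

if __name__ == "__main__":
    # Any RGB person crop works here; a gray placeholder keeps the sketch self-contained.
    print(label_body_shape(Image.new("RGB", (224, 224), color=128)))
```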
We then apply optimal transport theory to align the student model's local visual features with shape-aware tokens derived from CLIP's linguistic output. Additionally we align the student model's global visual features with those from the CLIP image encoder and the 3D SMPL identity space fostering enhanced domain robustness. CLIP3DReID notably excels in discerning discriminative body shape features achieving state-of-the-art results in person ReID. Our approach represents a significant advancement in ReID offering robust solutions to existing challenges and setting new directions for future research. + + + + LLaFS: When Large Language Models Meet Few-Shot Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_LLaFS_When_Large_Language_Models_Meet_Few-Shot_Segmentation_CVPR_2024_paper.pdf + This paper proposes LLaFS the first attempt to leverage large language models (LLMs) in few-shot segmentation. In contrast to the conventional few-shot segmentation methods that only rely on the limited and biased information from the annotated support images LLaFS leverages the vast prior knowledge gained by LLM as an effective supplement and directly uses the LLM to segment images in a few-shot manner. To enable the text-based LLM to handle image-related tasks we carefully design an input instruction that allows the LLM to produce segmentation results represented as polygons and propose a region-attribute table to simulate the human visual mechanism and provide multi-modal guidance. We also synthesize pseudo samples and use curriculum learning for pretraining to augment data and achieve better optimization. LLaFS achieves state-of-the-art results on multiple datasets showing the potential of using LLMs for few-shot computer vision tasks. + + + + Kernel Adaptive Convolution for Scene Text Detection via Distance Map Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Kernel_Adaptive_Convolution_for_Scene_Text_Detection_via_Distance_Map_CVPR_2024_paper.pdf + Segmentation-based scene text detection algorithms that are accurate to the pixel level can satisfy the detection of arbitrary shape scene text and have received widespread attention. On the one hand due to the complexity and diversity of the scene text the convolution with a fixed kernel size has some limitations in extracting the visual features of the scene text. On the other hand most of the existing segmentation-based algorithms only segment the center of the text losing information such as the edges and directions of the text with limited detection accuracy. There are also some improved algorithms that use iterative corrections or introduce other multiple information to improve text detection accuracy but at the expense of efficiency. To address these issues this paper proposes a simple and effective scene text detection method the Kernel Adaptive Convolution which is designed with a Kernel Adaptive Convolution Module for scene text detection via predicting the distance map. Specifically first we design an extensible kernel adaptive convolution module (KACM) to extract visual features from multiple convolutions with different kernel sizes in an adaptive manner. Secondly our method predicts the text distance map under the supervision of a priori information (including direction map and foreground segmentation map) and completes the text detection from the predicted distance map. 
Experiments on four publicly available datasets prove the effectiveness of our algorithm in which the accuracy and efficiency of both the Total-Text and TD500 outperform the state-of-the-art algorithm. The algorithm efficiency is improved while the accuracy is competitive on ArT and CTW1500. + + + + Adaptive Multi-Modal Cross-Entropy Loss for Stereo Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Adaptive_Multi-Modal_Cross-Entropy_Loss_for_Stereo_Matching_CVPR_2024_paper.pdf + Despite the great success of deep learning in stereo matching recovering accurate disparity maps is still challenging. Currently L1 and cross-entropy are the two most widely used losses for stereo network training. Compared with the former the latter usually performs better thanks to its probability modeling and direct supervision to the cost volume. However how to accurately model the stereo ground-truth for cross-entropy loss remains largely under-explored. Existing works simply assume that the ground-truth distributions are uni-modal which ignores the fact that most of the edge pixels can be multi-modal. In this paper a novel adaptive multi-modal cross-entropy loss (ADL) is proposed to guide the networks to learn different distribution patterns for each pixel. Moreover we optimize the disparity estimator to further alleviate the bleeding or misalignment artifacts in inference. Extensive experimental results show that our method is generic and can help classic stereo networks regain state-of-the-art performance. In particular GANet with our method ranks 1st on both the KITTI 2015 and 2012 benchmarks among the published methods. Meanwhile excellent synthetic-to-realistic generalization performance can be achieved by simply replacing the traditional loss with ours. Code is available at https://github.com/xxxupeng/ADL. + + + + Unlocking the Potential of Prompt-Tuning in Bridging Generalized and Personalized Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_Unlocking_the_Potential_of_Prompt-Tuning_in_Bridging_Generalized_and_Personalized_CVPR_2024_paper.pdf + Vision Transformers (ViT) and Visual Prompt Tuning (VPT) achieve state-of-the-art performance with improved efficiency in various computer vision tasks. This suggests a promising paradigm shift of adapting pre-trained ViT models to Federated Learning (FL) settings. However the challenge of data heterogeneity among FL clients presents a significant hurdle in effectively deploying ViT models. Existing Generalized FL (GFL) and Personalized FL (PFL) methods have limitations in balancing performance across both global and local data distributions. In this paper we present a novel algorithm SGPT that integrates GFL and PFL approaches by employing a unique combination of both shared and group-specific prompts. This design enables SGPT to capture both common and group-specific features. A key feature of SGPT is its prompt selection module which facilitates the training of a single global model capable of automatically adapting to diverse local client data distributions without the need for local fine-tuning. To effectively train the prompts we utilize block coordinate descent (BCD) learning from common feature information (shared prompts) and then more specialized knowledge (group prompts) iteratively. Theoretically we justify that learning the proposed prompts can reduce the gap between global and local performance. 
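The cross-entropy supervision that the stereo-matching entry above builds on can be written as a soft target distribution over disparity bins compared against the network's cost-volume softmax. The sketch below uses the common uni-modal Laplacian-shaped target that the abstract identifies as the default choice (ADL itself replaces it with an adaptive multi-modal target); the bin count and temperature are illustrative.

```python
import torch
import torch.nn.functional as F

def laplacian_target(gt_disp, max_disp=192, b=1.0):
    """Uni-modal soft target over disparity bins, peaked at the ground truth.

    gt_disp: (N,) float disparities. Returns (N, max_disp) distributions.
    """
    bins = torch.arange(max_disp, dtype=gt_disp.dtype)
    logits = -(bins[None, :] - gt_disp[:, None]).abs() / b
    return logits.softmax(dim=-1)

def stereo_cross_entropy(cost_volume_logits, gt_disp):
    """cost_volume_logits: (N, max_disp) per-pixel scores along the disparity axis."""
    target = laplacian_target(gt_disp, cost_volume_logits.shape[-1])
    log_prob = F.log_softmax(cost_volume_logits, dim=-1)
    return -(target * log_prob).sum(dim=-1).mean()

if __name__ == "__main__":
    N, D = 1024, 192
    logits = torch.randn(N, D, requires_grad=True)
    gt = torch.rand(N) * (D - 1)
    loss = stereo_cross_entropy(logits, gt)
    loss.backward()
    print(float(loss))
```

Because edge pixels mix foreground and background disparities, a single peak like this one is exactly the assumption ADL relaxes.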
Empirically we conduct experiments on both label and feature heterogeneity settings in comparison with state-of-the-art baselines along with extensive ablation studies to substantiate the superior performance of SGPT. + + + + GALA: Generating Animatable Layered Assets from a Single Scan + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_GALA_Generating_Animatable_Layered_Assets_from_a_Single_Scan_CVPR_2024_paper.pdf + We present GALA a framework that takes as input a single-layer clothed 3D human mesh and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars with any pose. Existing reconstruction approaches often treat clothed humans as a single-layer of geometry and overlook the inherent compositionality of humans with hairstyles clothing and accessories thereby limiting the utility of the meshes for down-stream applications. Decomposing a single-layer mesh into separate layers is a challenging task because it requires the synthesis of plausible geometry and texture for the severely occluded regions. Moreover even with successful decomposition meshes are not normalized in terms of poses and body shapes failing coherent composition with novel identities and poses. To address these challenges we propose to leverage the general knowledge of a pretrained 2D diffusion model as geometry and appearance prior for humans and other assets. We first separate the input mesh using the 3D surface segmentation extracted from multi-view 2D segmentations. Then we synthesize the missing geometry of different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once we complete inpainting high-fidelity 3D geometry we also apply the same SDS loss to its texture to obtain the complete appearance including the initially occluded regions. Through a series of decomposition steps we obtain multiple layers of 3D assets in a shared canonical space normalized in terms of poses and human shapes hence supporting effortless composition to novel identities and reanimation with novel poses. Our experiments demonstrate the effectiveness of our approach for decomposition canonicalization and composition tasks compared to existing solutions. + + + + LeGO: Leveraging a Surface Deformation Network for Animatable Stylized Face Generation with One Example + http://openaccess.thecvf.com//content/CVPR2024/papers/Yoon_LeGO_Leveraging_a_Surface_Deformation_Network_for_Animatable_Stylized_Face_CVPR_2024_paper.pdf + Recent advances in 3D face stylization have made significant strides in few to zero-shot settings. However the degree of stylization achieved by existing methods is often not sufficient for practical applications because they are mostly based on statistical 3D Morphable Models (3DMM) with limited variations. To this end we propose a method that can produce a highly stylized 3D face model with desired topology. Our methods train a surface deformation network with 3DMM and translate its domain to the target style using a paired exemplar. The network achieves stylization of the 3D face mesh by mimicking the style of the target using a differentiable renderer and directional CLIP losses. Additionally during the inference process we utilize a Mesh Agnostic Encoder (MAGE) that takes deformation target a mesh of diverse topologies as input to the stylization process and encodes its shape into our latent space. 
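Score Distillation Sampling, which the GALA entry above applies in a pose-guided form, can be implemented so that the gradient pushed into the optimized latent equals w(t) * (eps_pred - eps). The sketch below realizes that gradient with the common mean-squared trick and a stand-in noise predictor; a real setup would call a frozen, conditioned diffusion UNet and its actual noise schedule.

```python
import torch
import torch.nn.functional as F

def sds_loss(latents, noise_pred_fn, alphas_cumprod, t):
    """One SDS step: the gradient of the returned loss w.r.t. `latents`
    is w(t) * (eps_pred - eps), with gradients blocked through the predictor.

    noise_pred_fn(noisy, t) -> predicted noise; any callable works here,
    in practice a frozen conditional diffusion UNet.
    """
    a_t = alphas_cumprod[t]
    eps = torch.randn_like(latents)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * eps
    with torch.no_grad():
        eps_pred = noise_pred_fn(noisy, t)
    w = 1.0 - a_t                                    # a common weighting choice
    grad = w * (eps_pred - eps)
    # 0.5 * ||x - (x - grad).detach()||^2 has gradient `grad` w.r.t. x.
    return 0.5 * F.mse_loss(latents, (latents - grad).detach(), reduction="sum")

if __name__ == "__main__":
    latents = torch.randn(1, 4, 32, 32, requires_grad=True)
    alphas = torch.linspace(0.9999, 0.01, 1000)       # toy cumulative schedule
    toy_unet = lambda x, t: 0.1 * x                   # stand-in noise predictor
    loss = sds_loss(latents, toy_unet, alphas, t=torch.randint(0, 1000, (1,)).item())
    loss.backward()
    print(latents.grad.abs().mean().item())
```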
The resulting stylized face model can be animated by commonly used 3DMM blend shapes. A set of quantitative and qualitative evaluations demonstrate that our method can produce highly stylized face meshes according to a given style and output them in a desired topology. We also demonstrate example applications of our method including image-based stylized avatar generation linear interpolation of geometric styles and facial animation of stylized avatars. + + + + Frequency-Adaptive Dilated Convolution for Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Frequency-Adaptive_Dilated_Convolution_for_Semantic_Segmentation_CVPR_2024_paper.pdf + Dilated convolution which expands the receptive field by inserting gaps between its consecutive elements is widely employed in computer vision. In this study we propose three strategies to improve individual phases of dilated convolution from the view of spectrum analysis. Departing from the conventional practice of fixing a global dilation rate as a hyperparameter we introduce Frequency-Adaptive Dilated Convolution (FADC) which dynamically adjusts dilation rates spatially based on local frequency components. Subsequently we design two plug-in modules to directly enhance effective bandwidth and receptive field size. The Adaptive Kernel (AdaKern) module decomposes convolution weights into low-frequency and high-frequency components dynamically adjusting the ratio between these components on a per-channel basis. By increasing the high-frequency part of convolution weights AdaKern captures more high-frequency components thereby improving effective bandwidth. The Frequency Selection (FreqSelect) module optimally balances high- and low-frequency components in feature representations through spatially variant reweighting. It suppresses high frequencies in the background to encourage FADC to learn a larger dilation thereby increasing the receptive field for an expanded scope. Extensive experiments on segmentation and object detection consistently validate the efficacy of our approach. The code is made publicly available at https://github.com/Linwei-Chen/FADC. + + + + Multiple View Geometry Transformers for 3D Human Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liao_Multiple_View_Geometry_Transformers_for_3D_Human_Pose_Estimation_CVPR_2024_paper.pdf + In this work we aim to improve the 3D reasoning ability of Transformers in multi-view 3D human pose estimation. Recent works have focused on end-to-end learning-based transformer designs which struggle to resolve geometric information accurately particularly during occlusion. Instead we propose a novel hybrid model MVGFormer which has a series of geometric and appearance modules organized in an iterative manner. The geometry modules are learning-free and handle all viewpoint-dependent 3D tasks geometrically which notably improves the model's generalization ability. The appearance modules are learnable and are dedicated to estimating 2D poses from image signals end-to-end which enables them to achieve accurate estimates even when occlusion occurs leading to a model that is both accurate and generalizable to new cameras and geometries. We evaluate our approach for both in-domain and out-of-domain settings where our model consistently outperforms state-of-the-art methods and especially does so by a significant margin in the out-of-domain setting. We will release the code and models: https://github.com/XunshanMan/MVGFormer. 
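The MVGFormer entry above relies on learning-free geometric modules that lift 2D detections from calibrated views into 3D. As a rough illustration of what such a module computes (standard DLT triangulation, not the paper's actual code), here is a minimal sketch; the toy cameras and joint coordinates are made-up inputs.

```python
import numpy as np

def triangulate_joint(points_2d, proj_mats):
    """Linear (DLT) triangulation of one 3D joint.

    points_2d : (V, 2) pixel coordinates of the joint in V calibrated views.
    proj_mats : (V, 3, 4) camera projection matrices P = K [R | t].
    Returns the 3D point minimizing the algebraic reprojection error.
    """
    rows = []
    for (u, v), P in zip(points_2d, proj_mats):
        # Each view gives two linear constraints on the homogeneous 3D point X:
        #   u * (P[2] @ X) - P[0] @ X = 0   and   v * (P[2] @ X) - P[1] @ X = 0
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                  # (2V, 4)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                          # right singular vector of the smallest singular value
    return X[:3] / X[3]                 # dehomogenize

# Toy usage with two hypothetical cameras observing the point (0, 0, 5).
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                   # camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])   # shifted one unit along x
X_true = np.array([0.0, 0.0, 5.0, 1.0])
uv1 = (P1 @ X_true)[:2] / (P1 @ X_true)[2]
uv2 = (P2 @ X_true)[:2] / (P2 @ X_true)[2]
print(triangulate_joint(np.stack([uv1, uv2]), np.stack([P1, P2])))  # ~ [0, 0, 5]
```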
+ + + + SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Ho_SiTH_Single-view_Textured_Human_Reconstruction_with_Image-Conditioned_Diffusion_CVPR_2024_paper.pdf + A long-standing goal of 3D human reconstruction is to create lifelike and fully detailed 3D humans from single-view images. The main challenge lies in inferring unknown body shapes appearances and clothing details in areas not visible in the images. To address this we propose SiTH a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. At the core of our method lies the decomposition of the challenging single-view reconstruction problem into generative hallucination and reconstruction subproblems. For the former we employ a powerful generative diffusion model to hallucinate unseen back-view appearance based on the input images. For the latter we leverage skinned body meshes as guidance to recover full-body texture meshes from the input and back-view images. SiTH requires as few as 500 3D human scans for training while maintaining its generality and robustness to diverse images. Extensive evaluations on two 3D human benchmarks including our newly created one highlighted our method's superior accuracy and perceptual quality in 3D textured human reconstruction. Our code and evaluation benchmark is available at https://ait.ethz.ch/sith. + + + + DynVideo-E: Harnessing Dynamic NeRF for Large-Scale Motion- and View-Change Human-Centric Video Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_DynVideo-E_Harnessing_Dynamic_NeRF_for_Large-Scale_Motion-_and_View-Change_Human-Centric_CVPR_2024_paper.pdf + Despite recent progress in diffusion-based video editing existing methods are limited to short-length videos due to the contradiction between long-range consistency and frame-wise editing. Prior attempts to address this challenge by introducing video-2D representations encounter significant difficulties with large motion- and view-change videos especially in human-centric scenarios. To overcome this we propose to introduce the dynamic Neural Radiance Fields (NeRF) as the innovative video representation where the editing can be performed in the 3D spaces and propagated to the entire video via the deformation field. To provide consistent and controllable editing we propose the image-based video-NeRF editing pipeline with a set of innovative designs including multi-view multi-pose Score Distillation Sampling (SDS) from both the 2D personalized diffusion prior and 3D diffusion prior reconstruction losses text-guided local parts super-resolution and style transfer. Extensive experiments demonstrate that our method dubbed as DynVideo-E significantly outperforms SOTA approaches on two challenging datasets by a large margin of 50% 95% for human preference. Code will be released at https://showlab.github.io/DynVideo-E/. + + + + Real-Time Neural BRDF with Spherically Distributed Primitives + http://openaccess.thecvf.com//content/CVPR2024/papers/Dou_Real-Time_Neural_BRDF_with_Spherically_Distributed_Primitives_CVPR_2024_paper.pdf + We propose a neural reflectance model (NeuBRDF) that offers highly versatile material representation yet with light memory and neural computation consumption towards achieving real-time rendering. The results depicted in Fig. 
1 rendered at full HD resolution on a contemporary desktop machine demonstrate that our system achieves real-time performance with a wide variety of appearances which is approached by the following two designs. Firstly recognizing that the bidirectional reflectance is distributed in a sparse high-dimensional space we propose to project the BRDF into two low-dimensional components i.e. two hemisphere feature-grids for incoming and outgoing directions respectively. Secondly we distribute learnable neural reflectance primitives on our highly-tailored spherical surface grid. These primitives offer informative features for each hemisphere component and reduce the complexity of the feature learning network leading to fast evaluation. These primitives are centrally stored in a codebook and can be shared across multiple grids and even across materials based on low-cost indices stored in material-specific spherical surface grids. Our NeuBRDF agnostic to the material provides a unified framework for representing a variety of materials consistently. Comprehensive experimental results on measured BRDF compression Monte Carlo simulated BRDF acceleration and extension to spatially varying effects demonstrate the superior quality and generalizability achieved by the proposed scheme. + + + + VideoCrafter2: Overcoming Data Limitations for High-Quality Video Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_VideoCrafter2_Overcoming_Data_Limitations_for_High-Quality_Video_Diffusion_Models_CVPR_2024_paper.pdf + Text-to-video generation aims to produce a video based on a given prompt. Recently several commercial video models have been able to generate plausible videos with minimal noise excellent details and high aesthetic scores. However these models rely on large-scale well-filtered high-quality videos that are not accessible to the community. Many existing research works which train models using the low-quality WebVid-10M dataset struggle to generate high-quality videos because the models are optimized to fit WebVid-10M. In this work we explore the training scheme of video models extended from Stable Diffusion and investigate the feasibility of leveraging low-quality videos and synthesized high-quality images to obtain a high-quality video model. We first analyze the connection between the spatial and temporal modules of video models and the distribution shift to low-quality videos. We observe that full training of all modules results in a stronger coupling between spatial and temporal modules than only training temporal modules. Based on this stronger coupling we shift the distribution to higher quality without motion degradation by finetuning spatial modules with high-quality images resulting in a generic high-quality video model. Evaluations are conducted to demonstrate the superiority of the proposed method particularly in picture quality motion and concept composition. + + + + Style Injection in Diffusion: A Training-free Approach for Adapting Large-scale Diffusion Models for Style Transfer + http://openaccess.thecvf.com//content/CVPR2024/papers/Chung_Style_Injection_in_Diffusion_A_Training-free_Approach_for_Adapting_Large-scale_CVPR_2024_paper.pdf + Despite the impressive generative capabilities of diffusion models existing diffusion model-based style transfer methods require inference-stage optimization (e.g. fine-tuning or textual inversion of style) which is time-consuming or fails to leverage the generative ability of large-scale diffusion models. 
To address these issues we introduce a novel artistic style transfer method based on a pre-trained large-scale diffusion model without any optimization. Specifically we manipulate the features of self-attention layers as the way the cross-attention mechanism works; in the generation process substituting the key and value of content with those of style image. This approach provides several desirable characteristics for style transfer including 1) preservation of content by transferring similar styles into similar image patches and 2) transfer of style based on similarity of local texture (e.g. edge) between content and style images. Furthermore we introduce query preservation and attention temperature scaling to mitigate the issue of disruption of original content and initial latent Adaptive Instance Normalization (AdaIN) to deal with the disharmonious color (failure to transfer the colors of style). Our experimental results demonstrate that our proposed method surpasses state-of-the-art methods in both conventional and diffusion-based style transfer baselines. + + + + OrthCaps: An Orthogonal CapsNet with Sparse Attention Routing and Pruning + http://openaccess.thecvf.com//content/CVPR2024/papers/Geng_OrthCaps_An_Orthogonal_CapsNet_with_Sparse_Attention_Routing_and_Pruning_CVPR_2024_paper.pdf + Redundancy is a persistent challenge in Capsule Networks (CapsNet) leading to high computational costs and parameter counts. Although previous studies have introduced pruning after the initial capsule layer dynamic routing's fully connected nature and non-orthogonal weight matrices reintroduce redundancy in deeper layers. Besides dynamic routing requires iterating to converge further increasing computational demands. In this paper we propose an Orthogonal Capsule Network (OrthCaps) to reduce redundancy improve routing performance and decrease parameter counts. Firstly an efficient pruned capsule layer is introduced to discard redundant capsules. Secondly dynamic routing is replaced with orthogonal sparse attention routing eliminating the need for iterations and fully connected structures. Lastly weight matrices during routing are orthogonalized to sustain low capsule similarity which is the first approach to use Householder orthogonal decomposition to enforce orthogonality in CapsNet. Our experiments on baseline datasets affirm the efficiency and robustness of OrthCaps in classification tasks in which ablation studies validate the criticality of each component. OrthCaps-Shallow outperforms other Capsule Network benchmarks on four datasets utilizing only 110k parameters - a mere 1.25% of a standard Capsule Network's total. To the best of our knowledge it achieves the smallest parameter count among existing Capsule Networks. Similarly OrthCaps-Deep demonstrates competitive performance across four datasets utilizing only 1.2% of the parameters required by its counterparts. + + + + Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_Florence-2_Advancing_a_Unified_Representation_for_a_Variety_of_Vision_CVPR_2024_paper.pdf + We introduce Florence-2 a novel vision foundation model with a unified prompt-based representation for various computer vision and vision-language tasks. While existing large vision models excel in transfer learning they struggle to perform diverse tasks with simple instructions a capability that implies handling the complexity of various spatial hierarchy and semantic granularity. 
Florence-2 was designed to take text-prompt as task instructions and generate desirable results in text forms whether it be captioning object detection grounding or segmentation. This multi-task learning setup demands large-scale high-quality annotated data. To this end we co-developed FLD-5B that consists of 5.4 billion comprehensive visual annotations on 126 million images using an iterative strategy of automated image annotation and model refinement. We adopted a sequence-to-sequence structure to train Florence-2 to perform versatile and comprehensive vision tasks. Extensive evaluations on numerous tasks demonstrated Florence-2 to be a strong vision foundation model contender with unprecedented zero-shot and fine-tuning capabilities. + + + + NeRF On-the-go: Exploiting Uncertainty for Distractor-free NeRFs in the Wild + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_NeRF_On-the-go_Exploiting_Uncertainty_for_Distractor-free_NeRFs_in_the_Wild_CVPR_2024_paper.pdf + Neural Radiance Fields (NeRFs) have shown remarkable success in synthesizing photorealistic views from multi-view images of static scenes but face challenges in dynamic real-world environments with distractors like moving objects shadows and lighting changes. Existing methods manage controlled environments and low occlusion ratios but fall short in render quality especially under high occlusion scenarios. In this paper we introduce NeRF On-the-go a simple yet effective approach that enables the robust synthesis of novel views in complex in-the-wild scenes from only casually captured image sequences. Delving into uncertainty our method not only efficiently eliminates distractors even when they are predominant in captures but also achieves a notably faster convergence speed. Through comprehensive experiments on various scenes our method demonstrates a significant improvement over state-of-the-art techniques. This advancement opens new avenues for NeRF in diverse and dynamic real-world applications. + + + + 3D Human Pose Perception from Egocentric Stereo Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Akada_3D_Human_Pose_Perception_from_Egocentric_Stereo_Videos_CVPR_2024_paper.pdf + While head-mounted devices are becoming more compact they provide egocentric views with significant self-occlusions of the device user. Hence existing methods often fail to accurately estimate complex 3D poses from egocentric views. In this work we propose a new transformer-based framework to improve egocentric stereo 3D human pose estimation which leverages the scene information and temporal context of egocentric stereo videos. Specifically we utilize 1) depth features from our 3D scene reconstruction module with uniformly sampled windows of egocentric stereo frames and 2) human joint queries enhanced by temporal features of the video inputs. Our method is able to accurately estimate human poses even in challenging scenarios such as crouching and sitting. Furthermore we introduce two new benchmark datasets i.e. UnrealEgo2 and UnrealEgo-RW (RealWorld). UnrealEgo2 is a large-scale in-the-wild dataset captured in synthetic 3D scenes. UnrealEgo-RW is a real-world dataset captured with our newly developed device. The proposed datasets offer a much larger number of egocentric stereo views with a wider variety of human motions than the existing datasets allowing comprehensive evaluation of existing and upcoming methods. Our extensive experiments show that the proposed approach significantly outperforms previous methods. 
UnrealEgo2 UnrealEgo-RW and trained models are available on our project page and Benchmark Challenge. + + + + Grid Diffusion Models for Text-to-Video Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Grid_Diffusion_Models_for_Text-to-Video_Generation_CVPR_2024_paper.pdf + Recent advances in the diffusion models have significantly improved text-to-image generation. However generating videos from text is a more challenging task than generating images from text due to the much larger dataset and higher computational cost required. Most existing video generation methods use either a 3D U-Net architecture that considers the temporal dimension or autoregressive generation. These methods require large datasets and are limited in terms of computational costs compared to text-to-image generation. To tackle these challenges we propose a simple but effective novel grid diffusion for text-to-video generation without temporal dimension in architecture and a large text-video paired dataset. We can generate a high-quality video using a fixed amount of GPU memory regardless of the number of frames by representing the video as a grid image. Additionally since our method reduces the dimensions of the video to the dimensions of the image various image-based methods can be applied to videos such as text-guided video manipulation from image manipulation. Our proposed method outperforms the existing methods in both quantitative and qualitative evaluations demonstrating the suitability of our model for real-world video generation. + + + + LucidDreamer: Towards High-Fidelity Text-to-3D Generation via Interval Score Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_LucidDreamer_Towards_High-Fidelity_Text-to-3D_Generation_via_Interval_Score_Matching_CVPR_2024_paper.pdf + The recent advancements in text-to-3D generation mark a significant milestone in generative models unlocking new possibilities for creating imaginative 3D assets across various real-world scenarios. While recent advancements in text-to-3D generation have shown promise they often fall short in rendering detailed and high-quality 3D models. This problem is especially prevalent as many methods base themselves on Score Distillation Sampling (SDS). This paper identifies a notable deficiency in SDS that it brings inconsistent and low-quality updating direction for the 3D model causing the over-smoothing effect. To address this we propose a novel approach called Interval Score Matching (ISM). ISM employs deterministic diffusing trajectories and utilizes interval-based score matching to counteract over-smoothing. Furthermore we incorporate 3D Gaussian Splatting into our text-to-3D generation pipeline. Extensive experiments show that our model largely outperforms the state-of-the-art in quality and training efficiency. + + + + PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_PTM-VQA_Efficient_Video_Quality_Assessment_Leveraging_Diverse_PreTrained_Models_from_CVPR_2024_paper.pdf + Video quality assessment (VQA) is a challenging problem due to the numerous factors that can affect the perceptual quality of a video e.g. content attractiveness distortion type motion pattern and level. However annotating the Mean opinion score (MOS) for videos is expensive and time-consuming which limits the scale of VQA datasets and poses a significant obstacle for deep learning-based methods. 
In this paper we propose a VQA method named PTM-VQA which leverages PreTrained Models to transfer knowledge from models pretrained on various pre-tasks enabling benefits for VQA from different aspects. Specifically we extract features of videos from different pretrained models with frozen weights and integrate them to generate representation. Since these models possess various fields of knowledge and are often trained with labels irrelevant to quality we propose an Intra-Consistency and Inter-Divisibility (ICID) loss to impose constraints on features extracted by multiple pretrained models. The intra-consistency constraint ensures that features extracted by different pretrained models are in the same unified quality-aware latent space while the inter-divisibility introduces pseudo clusters based on the annotation of samples and tries to separate features of samples from different clusters. Furthermore with a constantly growing number of pretrained models it is crucial to determine which models to use and how to use them. To address this problem we propose an efficient scheme to select suitable candidates. Models with better clustering performance on VQA datasets are chosen to be our candidates. Extensive experiments demonstrate the effectiveness of the proposed method. + + + + REACTO: Reconstructing Articulated Objects from a Single Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_REACTO_Reconstructing_Articulated_Objects_from_a_Single_Video_CVPR_2024_paper.pdf + In this paper we address the challenge of reconstructing general articulated 3D objects from a single video. Existing works employing dynamic neural radiance fields have advanced the modeling of articulated objects like humans and animals from videos but face challenges with piece-wise rigid general articulated objects due to limitations in their deformation models. To tackle this we propose Quasi-Rigid Blend Skinning a novel deformation model that enhances the rigidity of each part while maintaining flexible deformation of the joints. Our primary insight combines three distinct approaches: 1) an enhanced bone rigging system for improved component modeling 2) the use of quasi-sparse skinning weights to boost part rigidity and reconstruction fidelity and 3) the application of geodesic point assignment for precise motion and seamless deformation. Our method outperforms previous works in producing higher-fidelity 3D reconstructions of general articulated objects as demonstrated on both real and synthetic datasets. Project page: https://chaoyuesong.github.io/REACTO. + + + + Egocentric Whole-Body Motion Capture with FisheyeViT and Diffusion-Based Motion Refinement + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Egocentric_Whole-Body_Motion_Capture_with_FisheyeViT_and_Diffusion-Based_Motion_Refinement_CVPR_2024_paper.pdf + In this work we explore egocentric whole-body motion capture using a single fisheye camera which simultaneously estimates human body and hand motion. This task presents significant challenges due to three factors: the lack of high-quality datasets fisheye camera distortion and human body self-occlusion. To address these challenges we propose a novel approach that leverages FisheyeViT to extract fisheye image features which are subsequently converted into pixel-aligned 3D heatmap representations for 3D human body pose prediction. For hand tracking we incorporate dedicated hand detection and hand pose estimation networks for regressing 3D hand poses. 
Finally we develop a diffusion-based whole-body motion prior model to refine the estimated whole-body motion while accounting for joint uncertainties. To train these networks we collect a large synthetic dataset EgoWholeBody comprising 840000 high-quality egocentric images captured across a diverse range of whole-body motion sequences. Quantitative and qualitative evaluations demonstrate the effectiveness of our method in producing high-quality whole-body motion estimates from a single egocentric camera. + + + + Language Embedded 3D Gaussians for Open-Vocabulary Scene Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_Language_Embedded_3D_Gaussians_for_Open-Vocabulary_Scene_Understanding_CVPR_2024_paper.pdf + Open-vocabulary querying in 3D space is challenging but essential for scene understanding tasks such as object localization and segmentation. Language-embedded scene representations have made progress by incorporating language features into 3D spaces. However their efficacy heavily depends on neural networks that are resource-intensive in training and rendering. Although recent 3D Gaussians offer efficient and high-quality novel view synthesis directly embedding language features in them leads to prohibitive memory usage and decreased performance. In this work we introduce Language Embedded 3D Gaussians a novel scene representation for open-vocabulary query tasks. Instead of embedding high-dimensional raw semantic features on 3D Gaussians we propose a dedicated quantization scheme that drastically alleviates the memory requirement and a novel embedding procedure that achieves smoother yet high accuracy query countering the multi-view feature inconsistencies and the high-frequency inductive bias in point-based representations. Our comprehensive experiments show that our representation achieves the best visual quality and language querying accuracy across current language-embedded representations while maintaining real-time rendering frame rates on a single desktop GPU. + + + + Towards Automated Movie Trailer Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Argaw_Towards_Automated_Movie_Trailer_Generation_CVPR_2024_paper.pdf + Movie trailers are an essential tool for promoting films and attracting audiences. However the process of creating trailers can be time-consuming and expensive. To streamline this process we propose an automatic trailer generation framework that generates plausible trailers from a full movie by automating shot selection and composition. Our approach draws inspiration from machine translation techniques and models the movies and trailers as sequences of shots thus formulating the trailer generation problem as a sequence-to-sequence task. We introduce Trailer Generation Transformer (TGT) a deep-learning framework utilizing an encoder-decoder architecture. TGT movie encoder is tasked with contextualizing each movie shot representation via self-attention while the autoregressive trailer decoder predicts the feature representation of the next trailer shot accounting for the relevance of shots' temporal order in trailers. Our TGT significantly outperforms previous methods on a comprehensive suite of metrics. 
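The Trailer Generation Transformer (TGT) entry above casts trailer creation as a sequence-to-sequence problem over shot features. The sketch below shows that framing with a plain PyTorch encoder-decoder; the feature dimension, layer counts, teacher-forced MSE objective, and the ShotSeq2Seq name are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn

class ShotSeq2Seq(nn.Module):
    """Toy movie-to-trailer model: encode movie shot features, then autoregressively
    predict the feature of the next trailer shot."""
    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(dim, heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, layers)
        self.decoder = nn.TransformerDecoder(dec_layer, layers)
        self.head = nn.Linear(dim, dim)   # regresses the next shot feature

    def forward(self, movie_shots, trailer_shots):
        # movie_shots: (B, S, D) features of all movie shots
        # trailer_shots: (B, T, D) trailer shot features seen so far (teacher forcing)
        memory = self.encoder(movie_shots)
        T = trailer_shots.size(1)
        causal = torch.triu(                         # mask out future trailer shots
            torch.full((T, T), float("-inf"), device=trailer_shots.device), diagonal=1)
        dec = self.decoder(trailer_shots, memory, tgt_mask=causal)
        return self.head(dec)                        # (B, T, D)

model = ShotSeq2Seq()
movie = torch.randn(2, 100, 512)           # 100 movie shots
trailer = torch.randn(2, 12, 512)          # 12 trailer shots
pred = model(movie, trailer[:, :-1])       # predict shots 2..12 from shots 1..11
loss = nn.functional.mse_loss(pred, trailer[:, 1:])
print(loss.item())
```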
+ + + + Sheared Backpropagation for Fine-tuning Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Sheared_Backpropagation_for_Fine-tuning_Foundation_Models_CVPR_2024_paper.pdf + Fine-tuning is the process of extending the training of pre-trained models on specific target tasks thereby significantly enhancing their performance across various applications. However fine-tuning often demands large memory consumption posing a challenge for low-memory devices that some previous memory-efficient fine-tuning methods attempted to mitigate by pruning activations for gradient computation albeit at the cost of significant computational overhead from the pruning processes during training. To address these challenges we introduce PreBackRazor a novel activation pruning scheme offering both computational and memory efficiency through a sparsified backpropagation strategy which judiciously avoids unnecessary activation pruning and storage and gradient computation. Before activation pruning our approach samples a probability of selecting a portion of parameters to freeze utilizing a bandit method for updates to prioritize impactful gradients on convergence. During the feed-forward pass each model layer adjusts adaptively based on parameter activation status obviating the need for sparsification and storage of redundant activations for subsequent backpropagation. Benchmarking on fine-tuning foundation models our approach maintains baseline accuracy across diverse tasks yielding over 20% speedup and around 10% memory reduction. Moreover integrating with an advanced CUDA kernel achieves up to 60% speedup without extra memory costs or accuracy loss significantly enhancing the efficiency of fine-tuning foundation models on memory-constrained devices. + + + + Misalignment-Robust Frequency Distribution Loss for Image Transformation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ni_Misalignment-Robust_Frequency_Distribution_Loss_for_Image_Transformation_CVPR_2024_paper.pdf + This paper aims to address a common challenge in deep learning-based image transformation methods such as image enhancement and super-resolution which heavily rely on precisely aligned paired datasets with pixel-level alignments. However creating precisely aligned paired images presents significant challenges and hinders the advancement of methods trained on such data. To overcome this challenge this paper introduces a novel and simple Frequency Distribution Loss (FDL) for computing distribution distance within the frequency domain. Specifically we transform image features into the frequency domain using Discrete Fourier Transformation (DFT). Subsequently frequency components (amplitude and phase) are processed separately to form the FDL loss function. Our method is empirically proven effective as a training constraint due to the thoughtful utilization of global information in the frequency domain. Extensive experimental evaluations focusing on image enhancement and super-resolution tasks demonstrate that FDL outperforms existing misalignment-robust loss functions. Furthermore we explore the potential of our FDL for image style transfer that relies solely on completely misaligned data. 
Our code is available at: https://github.com/eezkni/FDL + + + + Degrees of Freedom Matter: Inferring Dynamics from Point Trajectories + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Degrees_of_Freedom_Matter_Inferring_Dynamics_from_Point_Trajectories_CVPR_2024_paper.pdf + Understanding the dynamics of generic 3D scenes is fundamentally challenging in computer vision essential in enhancing applications related to scene reconstruction motion tracking and avatar creation. In this work we address the task as the problem of inferring dense long-range motion of 3D points. By observing a set of point trajectories we aim to learn an implicit motion field parameterized by a neural network to predict the movement of novel points within the same domain without relying on any data-driven or scene-specific priors. To achieve this our approach builds upon the recently introduced dynamic point field model that learns smooth deformation fields between the canonical frame and individual observation frames. However temporal consistency between consecutive frames is neglected and the number of required parameters increases linearly with the sequence length due to per-frame modeling. To address these shortcomings we exploit the intrinsic regularization provided by SIREN and modify the input layer to produce a spatiotemporally smooth motion field. Additionally we analyze the motion field Jacobian matrix and discover that the motion degrees of freedom (DOFs) in an infinitesimal area around a point and the network hidden variables have different behaviors to affect the model's representational power. This enables us to improve the model representation capability while retaining the model compactness. Furthermore to reduce the risk of overfitting we introduce a regularization term based on the assumption of piece-wise motion smoothness. Our experiments assess the model's performance in predicting unseen point trajectories and its application in temporal mesh alignment with guidance. The results demonstrate its superiority and effectiveness. The code and data for the project are publicly available at https://yz-cnsdqz.github.io/eigenmotion/DOMA. + + + + Low-Latency Neural Stereo Streaming + http://openaccess.thecvf.com//content/CVPR2024/papers/Hou_Low-Latency_Neural_Stereo_Streaming_CVPR_2024_paper.pdf + The rise of new video modalities like virtual reality or autonomous driving has increased the demand for efficient multi-view video compression methods both in terms of rate-distortion (R-D) performance and in terms of delay and runtime. While most recent stereo video compression approaches have shown promising performance they compress left and right views sequentially leading to poor parallelization and runtime performance. This work presents Low-Latency neural codec for Stereo video Streaming (LLSS) a novel parallel stereo video coding method designed for fast and efficient low-latency stereo video streaming. Instead of using a sequential cross-view motion compensation like existing methods LLSS introduces a bidirectional feature shifting module to directly exploit mutual information among views and encode them effectively with a joint cross-view prior model for entropy coding. Thanks to this design LLSS processes left and right views in parallel minimizing latency; all while substantially improving R-D performance compared to both existing neural and conventional codecs. 
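Among the entries above, the Frequency Distribution Loss (FDL) is described concretely enough to sketch: features are moved into the frequency domain with a DFT and their amplitude and phase are compared separately. The snippet below follows that recipe, but the sorted-value L1 is only a simple stand-in for the paper's actual distribution distance, and the feature extractor is left abstract (random tensors here).

```python
import torch

def frequency_distribution_loss(feat_pred, feat_target):
    """Crude frequency-domain comparison of two feature maps of shape (B, C, H, W).

    Amplitude and phase of the 2D DFT are compared separately; sorting the values
    per channel before the L1 makes the term insensitive to exact spatial
    alignment, which is the point of a misalignment-robust loss.
    """
    def amp_phase(x):
        spec = torch.fft.fft2(x, norm="ortho")
        return spec.abs(), spec.angle()

    loss = 0.0
    for a, b in zip(amp_phase(feat_pred), amp_phase(feat_target)):
        a = a.flatten(2).sort(dim=-1).values      # (B, C, H*W), sorted per channel
        b = b.flatten(2).sort(dim=-1).values
        loss = loss + (a - b).abs().mean()
    return loss

# Toy usage: features would normally come from a frozen backbone; random here.
pred = torch.randn(1, 64, 32, 32)
target = torch.roll(pred, shifts=(3, 3), dims=(2, 3))    # circularly misaligned copy
print(frequency_distribution_loss(pred, target).item())  # amplitude term ignores the shift
```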
+ + + + Intrinsic Image Diffusion for Indoor Single-view Material Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kocsis_Intrinsic_Image_Diffusion_for_Indoor_Single-view_Material_Estimation_CVPR_2024_paper.pdf + We present Intrinsic Image Diffusion a generative model for appearance decomposition of indoor scenes. Given a single input view we sample multiple possible material explanations represented as albedo roughness and metallic maps. Appearance decomposition poses a considerable challenge in computer vision due to the inherent ambiguity between lighting and material properties and the lack of real datasets. To address this issue we advocate for a probabilistic formulation where instead of attempting to directly predict the true material properties we employ a conditional generative model to sample from the solution space. Furthermore we show that utilizing the strong learned prior of recent diffusion models trained on large-scale real-world images can be adapted to material estimation and highly improves the generalization to real images. Our method produces significantly sharper more consistent and more detailed materials outperforming state-of-the-art methods by 1.5dB on PSNR and by 45% better FID score on albedo prediction. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets. + + + + Material Palette: Extraction of Materials from a Single Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Lopes_Material_Palette_Extraction_of_Materials_from_a_Single_Image_CVPR_2024_paper.pdf + Physically-Based Rendering (PBR) is key to modeling the interaction between light and materials and finds extensive applications across computer graphics domains. However acquiring PBR materials is costly and requires special apparatus. In this paper we propose a method to extract PBR materials from a single real-world image. We do so in two steps: first we map regions of the image to material concept tokens using a diffusion model allowing the sampling of texture images resembling each material in the scene. Second we leverage a separate network to decompose the generated textures into spatially varying BRDFs (SVBRDFs) offering us readily usable materials for rendering applications. Our approach relies on existing synthetic material libraries with SVBRDF ground truth. It exploits a diffusion-generated RGB texture dataset to allow generalization to new samples using unsupervised domain adaptation (UDA). Our contributions are thoroughly evaluated on synthetic and real-world datasets. We further demonstrate the applicability of our method for editing 3D scenes with materials estimated from real photographs. Along with video we share code and models as open-source on the project page: https://github.com/astra-vision/MaterialPalette + + + + RealCustom: Narrowing Real Text Word for Real-Time Open-Domain Text-to-Image Customization + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_RealCustom_Narrowing_Real_Text_Word_for_Real-Time_Open-Domain_Text-to-Image_Customization_CVPR_2024_paper.pdf + Text-to-image customization which aims to synthesize text-driven images for the given subjects has recently revolutionized content creation. Existing works follow the pseudo-word paradigm i.e. represent the given subjects as pseudo-words and then compose them with the given text. However the inherent entangled influence scope of pseudo-words with the given text results in a dual-optimum paradox i.e. 
the similarity of the given subjects and the controllability of the given text could not be optimal simultaneously. We present RealCustom that for the first time disentangles similarity from controllability by precisely limiting subject influence to relevant parts only achieved by gradually narrowing real text word from its general connotation to the specific subject and using its cross-attention to distinguish relevance. Specifically RealCustom introduces a novel "train-inference" decoupled framework: (1) during training RealCustom learns general alignment between visual conditions to original textual conditions by a novel adaptive scoring module to adaptively modulate influence quantity; (2) during inference a novel adaptive mask guidance strategy is proposed to iteratively update the influence scope and influence quantity of the given subjects to gradually narrow the generation of the real text word. Comprehensive experiments demonstrate the superior real-time customization ability of RealCustom in the open domain achieving both unprecedented similarity of the given subjects and controllability of the given text for the first time. The project page is https://corleone-huang.github.io/realcustom/. + + + + Text2QR: Harmonizing Aesthetic Customization and Scanning Robustness for Text-Guided QR Code Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Text2QR_Harmonizing_Aesthetic_Customization_and_Scanning_Robustness_for_Text-Guided_QR_CVPR_2024_paper.pdf + In the digital era QR codes serve as a linchpin connecting virtual and physical realms. Their pervasive integration across various applications highlights the demand for aesthetically pleasing codes without compromised scannability. However prevailing methods grapple with the intrinsic challenge of balancing customization and scannability. Notably stable-diffusion models have ushered in an epoch of high-quality customizable content generation. This paper introduces Text2QR a pioneering approach leveraging these advancements to address a fundamental challenge: concurrently achieving user-defined aesthetics and scanning robustness. To ensure stable generation of aesthetic QR codes we introduce the QR Aesthetic Blueprint (QAB) module generating a blueprint image exerting control over the entire generation process. Subsequently the Scannability Enhancing Latent Refinement (SELR) process refines the output iteratively in the latent space enhancing scanning robustness. This approach harnesses the potent generation capabilities of stable-diffusion models navigating the trade-off between image aesthetics and QR code scannability. Our experiments demonstrate the seamless fusion of visual appeal with the practical utility of aesthetic QR codes markedly outperforming prior methods. Codes are available at https://github.com/mulns/Text2QR + + + + ECLIPSE: A Resource-Efficient Text-to-Image Prior for Image Generations + http://openaccess.thecvf.com//content/CVPR2024/papers/Patel_ECLIPSE_A_Resource-Efficient_Text-to-Image_Prior_for_Image_Generations_CVPR_2024_paper.pdf + Text-to-image (T2I) diffusion models notably the unCLIP models (e.g. DALL-E-2) achieve state-of-the-art (SOTA) performance on various compositional T2I benchmarks at the cost of significant computational resources. The unCLIP stack comprises T2I prior and diffusion image decoder. The T2I prior model alone adds a billion parameters compared to the Latent Diffusion Models which increases the computational and high-quality data requirements. 
We introduce ECLIPSE a novel contrastive learning method that is both parameter and data-efficient. ECLIPSE leverages pre-trained vision-language models (e.g. CLIP) to distill the knowledge into the prior model. We demonstrate that the ECLIPSE trained prior with only 3.3% of the parameters and trained on a mere 2.8% of the data surpasses the baseline T2I priors with an average of 71.6% preference score under resource-limited setting. It also attains performance on par with SOTA big models achieving an average of 63.36% preference score in terms of the ability to follow the text compositions. Extensive experiments on two unCLIP diffusion image decoders Karlo and Kandinsky affirm that ECLIPSE priors consistently deliver high performance while significantly reducing resource dependency. Project page: https://eclipse-t2i.vercel.app/ + + + + Adaptive Bidirectional Displacement for Semi-Supervised Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chi_Adaptive_Bidirectional_Displacement_for_Semi-Supervised_Medical_Image_Segmentation_CVPR_2024_paper.pdf + Consistency learning is a central strategy to tackle unlabeled data in semi-supervised medical image segmentation (SSMIS) which enforces the model to produce consistent predictions under the perturbation. However most current approaches solely focus on utilizing a specific single perturbation which can only cope with limited cases while employing multiple perturbations simultaneously is hard to guarantee the quality of consistency learning. In this paper we propose an Adaptive Bidirectional Displacement (ABD) approach to solve the above challenge. Specifically we first design a bidirectional patch displacement based on reliable prediction confidence for unlabeled data to generate new samples which can effectively suppress uncontrollable regions and still retain the influence of input perturbations. Meanwhile to enforce the model to learn the potentially uncontrollable content a bidirectional displacement operation with inverse confidence is proposed for the labeled images which generates samples with more unreliable information to facilitate model learning. Extensive experiments show that ABD achieves new state-of-the-art performances for SSMIS significantly improving different baselines. Source code is available at https://github.com/chy-upc/ABD. + + + + Accurate Training Data for Occupancy Map Prediction in Automated Driving Using Evidence Theory + http://openaccess.thecvf.com//content/CVPR2024/papers/Kalble_Accurate_Training_Data_for_Occupancy_Map_Prediction_in_Automated_Driving_CVPR_2024_paper.pdf + Automated driving fundamentally requires knowledge about the surrounding geometry of the scene. Modern approaches use only captured images to predict occupancy maps that represent the geometry. Training these approaches requires accurate data that may be acquired with the help of LiDAR scanners. We show that the techniques used for current benchmarks and training datasets to convert LiDAR scans into occupancy grid maps yield very low quality and subsequently present a novel approach using evidence theory that yields more accurate reconstructions. We demonstrate that these are superior by a large margin both qualitatively and quantitatively and that we additionally obtain meaningful uncertainty estimates. 
When converting the occupancy maps back to depth estimates and comparing them with the raw LiDAR measurements, our method yields an MAE improvement of 30% to 52% on nuScenes and 53% on Waymo over other occupancy ground-truth data. Finally, we use the improved occupancy maps to train a state-of-the-art occupancy prediction method and demonstrate that it improves the MAE by 25% on nuScenes. + + + + DiffusionLight: Light Probes for Free by Painting a Chrome Ball + http://openaccess.thecvf.com//content/CVPR2024/papers/Phongthawee_DiffusionLight_Light_Probes_for_Free_by_Painting_a_Chrome_Ball_CVPR_2024_paper.pdf + We present a simple yet effective technique to estimate lighting in a single input image. Current techniques rely heavily on HDR panorama datasets to train neural networks to regress an input with limited field-of-view to a full environment map. However, these approaches often struggle with real-world, uncontrolled settings due to the limited diversity and size of their datasets. To address this problem, we leverage diffusion models trained on billions of standard images to render a chrome ball into the input image. Despite its simplicity, this task remains challenging: the diffusion models often insert incorrect or inconsistent objects and cannot readily generate chrome balls in HDR format. Our research uncovers a surprising relationship between the appearance of chrome balls and the initial diffusion noise map, which we utilize to consistently generate high-quality chrome balls. We further fine-tune an LDR diffusion model (Stable Diffusion XL) with LoRA, enabling it to perform exposure bracketing for HDR light estimation. Our method produces convincing light estimates across diverse settings and demonstrates superior generalization to in-the-wild scenarios. + + + + Rethinking the Spatial Inconsistency in Classifier-Free Diffusion Guidance + http://openaccess.thecvf.com//content/CVPR2024/papers/Shen_Rethinking_the_Spatial_Inconsistency_in_Classifier-Free_Diffusion_Guidance_CVPR_2024_paper.pdf + Classifier-Free Guidance (CFG) has been widely used in text-to-image diffusion models, where the CFG scale is introduced to control the strength of text guidance over the whole image space. However, we argue that a global CFG scale results in spatial inconsistency across varying semantic strengths and suboptimal image quality. To address this problem, we present a novel approach, Semantic-aware Classifier-Free Guidance (S-CFG), to customize the guidance degrees for different semantic units in text-to-image diffusion models. Specifically, we first design a training-free semantic segmentation method to partition the latent image into relatively independent semantic regions at each denoising step. In particular, the cross-attention map in the denoising U-Net backbone is renormalized to assign each patch to the corresponding token, while the self-attention map is used to complete the semantic regions. Then, to balance the amplification of diverse semantic units, we adaptively adjust the CFG scales across different semantic regions to rescale the text guidance degrees to a uniform level. Finally, extensive experiments demonstrate the superiority of S-CFG over the original CFG strategy on various text-to-image diffusion models, without requiring any extra training cost. Our code is available at https://github.com/SmilesDZgk/S-CFG.
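The S-CFG entry above replaces the single global classifier-free guidance scale with per-region guidance of comparable strength. Below is one hedged reading of that rescaling step, assuming a region label map has already been derived (e.g. from renormalized cross-attention maps); the function name, the mean-magnitude normalization, and the toy regions are illustrative, not the authors' implementation.

```python
import torch

def region_rescaled_cfg(eps_uncond, eps_text, region_map, scale=7.5, eps=1e-6):
    """Classifier-free guidance with per-region normalization.

    eps_uncond, eps_text : (B, C, H, W) unconditional / text-conditional noise predictions.
    region_map           : (B, H, W) integer labels of semantic regions.
    The guidance term is rescaled so every region ends up with the same mean
    guidance magnitude before the global CFG scale is applied.
    """
    guidance = eps_text - eps_uncond
    magnitude = guidance.abs().mean(dim=1, keepdim=True)        # (B, 1, H, W)
    global_mean = magnitude.mean(dim=(2, 3), keepdim=True)      # (B, 1, 1, 1)

    rescaled = torch.zeros_like(guidance)
    for r in region_map.unique():
        mask = (region_map == r).unsqueeze(1)                   # (B, 1, H, W) bool
        region_mean = (magnitude * mask).sum(dim=(2, 3), keepdim=True) / (
            mask.sum(dim=(2, 3), keepdim=True) + eps)
        rescaled = rescaled + mask * guidance * (global_mean / (region_mean + eps))
    return eps_uncond + scale * rescaled

# Toy usage: two fake regions (left half / right half of the latent).
e_u, e_t = torch.randn(1, 4, 64, 64), torch.randn(1, 4, 64, 64)
regions = torch.zeros(1, 64, 64, dtype=torch.long)
regions[:, :, 32:] = 1
print(region_rescaled_cfg(e_u, e_t, regions).shape)   # torch.Size([1, 4, 64, 64])
```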
+ + + + KTPFormer: Kinematics and Trajectory Prior Knowledge-Enhanced Transformer for 3D Human Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_KTPFormer_Kinematics_and_Trajectory_Prior_Knowledge-Enhanced_Transformer_for_3D_Human_CVPR_2024_paper.pdf + This paper presents a novel Kinematics and Trajectory Prior Knowledge-Enhanced Transformer (KTPFormer) which overcomes the weakness in existing transformer-based methods for 3D human pose estimation that the derivation of Q K V vectors in their self-attention mechanisms are all based on simple linear mapping. We propose two prior attention modules namely Kinematics Prior Attention (KPA) and Trajectory Prior Attention (TPA) to take advantage of the known anatomical structure of the human body and motion trajectory information to facilitate effective learning of global dependencies and features in the multi-head self-attention. KPA models kinematic relationships in the human body by constructing a topology of kinematics while TPA builds a trajectory topology to learn the information of joint motion trajectory across frames. Yielding Q K V vectors with prior knowledge the two modules enable KTPFormer to model both spatial and temporal correlations simultaneously. Extensive experiments on three benchmarks (Human3.6M MPI-INF-3DHP and HumanEva) show that KTPFormer achieves superior performance in comparison to state-of-the-art methods. More importantly our KPA and TPA modules have lightweight plug-and-play designs and can be integrated into various transformer-based networks (i.e. diffusion-based) to improve the performance with only a very small increase in the computational overhead. The code is available at: https://github.com/JihuaPeng/KTPFormer. + + + + Differentiable Micro-Mesh Construction + http://openaccess.thecvf.com//content/CVPR2024/papers/Dou_Differentiable_Micro-Mesh_Construction_CVPR_2024_paper.pdf + Micro-mesh (u-mesh) is a new graphics primitive for compact representation of extreme geometry consisting of a low-polygon base mesh enriched by per micro-vertex displacement. A new generation of GPUs supports this structure with hardware evolution on u-mesh ray tracing achieving real-time rendering in pixel level geometric details. In this article we present a differentiable framework to convert standard meshes into this efficient format offering a holistic scheme in contrast to the previous stage-based methods. In our construction context a u-mesh is defined where each base triangle is a parametric primitive which is then reparameterized with Laplacian operators for efficient geometry optimization. Our framework offers numerous advantages for high-quality u-mesh production: (i) end-to-end geometry optimization and displacement baking; (ii) enabling the differentiation of renderings with respect to umesh for faithful reprojectability; (iii) high scalability for integrating useful features for u-mesh production and rendering such as minimizing shell volume maintaining the isotropy of the base mesh and visual-guided adaptive level of detail. Extensive experiments on u-mesh construction for a large set of high-resolution meshes demonstrate the superior quality achieved by the proposed scheme. 
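The KTPFormer entry above argues that Q/K/V should carry kinematic structure rather than come from a plain linear mapping. The speculative sketch below injects such a prior by mixing per-joint features over a simplified, hypothetical skeleton adjacency before standard multi-head attention; the bone list, dimensions, and module layout are assumptions and are not the paper's actual KPA.

```python
import torch
import torch.nn as nn

# Hypothetical 17-joint skeleton as (child, parent) bone pairs; a real model
# would use the dataset's own kinematic tree.
BONES = [(1, 0), (2, 1), (3, 2), (4, 0), (5, 4), (6, 5), (7, 0), (8, 7),
         (9, 8), (10, 9), (11, 8), (12, 11), (13, 12), (14, 8), (15, 14), (16, 15)]

def kinematic_adjacency(num_joints=17):
    A = torch.eye(num_joints)
    for c, p in BONES:
        A[c, p] = A[p, c] = 1.0
    return A / A.sum(dim=-1, keepdim=True)          # row-normalized adjacency

class KinematicsPriorAttention(nn.Module):
    def __init__(self, dim=256, heads=8, num_joints=17):
        super().__init__()
        self.register_buffer("adj", kinematic_adjacency(num_joints))
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                            # x: (B, J, D) per-joint features
        x_kin = self.adj @ x                         # mix features along the skeleton
        out, _ = self.attn(x_kin, x_kin, x_kin)      # Q, K, V now carry the prior
        return self.proj(out)

feats = torch.randn(2, 17, 256)
print(KinematicsPriorAttention()(feats).shape)       # torch.Size([2, 17, 256])
```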
+ + + + SNED: Superposition Network Architecture Search for Efficient Video Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_SNED_Superposition_Network_Architecture_Search_for_Efficient_Video_Diffusion_Model_CVPR_2024_paper.pdf + While AI-generated content has garnered significant attention achieving photo-realistic video synthesis remains a formidable challenge. Despite the promising advances in diffusion models for video generation quality the complex model architecture and substantial computational demands for both training and inference create a significant gap between these models and real-world applications. This paper presents SNED a superposition network architecture search method for efficient video diffusion model. Our method employs a supernet training paradigm that targets various model cost and resolution options using a weight-sharing method. Moreover we propose the supernet training sampling warm-up for fast training optimization. To showcase the flexibility of our method we conduct experiments involving both pixel-space and latent-space video diffusion models. The results demonstrate that our framework consistently produces comparable results across different model options with high efficiency. According to the experiment for the pixel-space video diffusion model we can achieve consistent video generation results simultaneously across 64 x 64 to 256 x 256 resolutions with a large range of model sizes from 640M to 1.6B number of parameters for pixel-space video diffusion models. + + + + LeftRefill: Filling Right Canvas based on Left Reference through Generalized Text-to-Image Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_LeftRefill_Filling_Right_Canvas_based_on_Left_Reference_through_Generalized_CVPR_2024_paper.pdf + This paper introduces LeftRefill an innovative approach to efficiently harness large Text-to-Image (T2I) diffusion models for reference-guided image synthesis. As the name implies LeftRefill horizontally stitches reference and target views together as a whole input. The reference image occupies the left side while the target canvas is positioned on the right. Then LeftRefill paints the right-side target canvas based on the left-side reference and specific task instructions. Such a task formulation shares some similarities with contextual inpainting akin to the actions of a human painter. This novel formulation efficiently learns both structural and textured correspondence between reference and target without other image encoders or adapters. We inject task and view information through cross-attention modules in T2I models and further exhibit multi-view reference ability via the re-arranged self-attention modules. These enable LeftRefill to perform consistent generation as a generalized model without requiring test-time fine-tuning or model modifications. Thus LeftRefill can be seen as a simple yet unified framework to address reference-guided synthesis. As an exemplar we leverage LeftRefill to address two different challenges: reference-guided inpainting and novel view synthesis based on the pre-trained StableDiffusion. Codes and models are released at https://github.com/ewrfcas/LeftRefill. 
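The LeftRefill entry above formats reference-guided synthesis as inpainting the right half of a canvas whose left half holds the reference. The sketch below reproduces only that input formatting and hands it to a stock inpainting pipeline; the diffusers pipeline, model id, prompt, and file paths are assumed placeholders, and LeftRefill itself adapts the T2I model's attention modules rather than calling an off-the-shelf inpainter.

```python
from PIL import Image
import torch
from diffusers import StableDiffusionInpaintPipeline   # assumed off-the-shelf inpainter

def left_reference_canvas(reference: Image.Image, size=512):
    """Stitch the reference on the left and an empty target canvas on the right,
    plus a mask exposing only the right half to the inpainting model."""
    ref = reference.resize((size, size))
    canvas = Image.new("RGB", (2 * size, size))
    canvas.paste(ref, (0, 0))                                     # left: reference view
    mask = Image.new("L", (2 * size, size), 0)
    mask.paste(Image.new("L", (size, size), 255), (size, 0))      # right: to be filled
    return canvas, mask

if __name__ == "__main__":
    reference = Image.open("reference.jpg")                       # placeholder path
    canvas, mask = left_reference_canvas(reference)
    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting",                   # placeholder model id
        torch_dtype=torch.float16).to("cuda")
    result = pipe(prompt="the same scene seen from a nearby viewpoint",
                  image=canvas, mask_image=mask,
                  width=canvas.width, height=canvas.height).images[0]
    result.crop((canvas.width // 2, 0, canvas.width, canvas.height)).save("filled_right.png")
```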
+ + + + Personalized Residuals for Concept-Driven Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ham_Personalized_Residuals_for_Concept-Driven_Text-to-Image_Generation_CVPR_2024_paper.pdf + We present personalized residuals and localized attention-guided sampling for efficient concept-driven generation using text-to-image diffusion models. Our method first represents concepts by freezing the weights of a pretrained text-conditioned diffusion model and learning low-rank residuals for a small subset of the model's layers. The residual-based approach then directly enables application of our proposed sampling technique, which applies the learned residuals only in areas where the concept is localized via cross-attention and applies the original diffusion weights in all other regions. Localized sampling therefore combines the learned identity of the concept with the existing generative prior of the underlying diffusion model. We show that personalized residuals effectively capture the identity of a concept in 3 minutes on a single GPU, without the use of regularization images and with fewer parameters than previous models, and localized sampling allows using the original model as a strong prior for large parts of the image. + + + + Condition-Aware Neural Network for Controlled Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_Condition-Aware_Neural_Network_for_Controlled_Image_Generation_CVPR_2024_paper.pdf + We present Condition-Aware Neural Network (CAN), a new method for adding control to image generative models. In parallel to prior conditional control methods, CAN controls the image generation process by dynamically manipulating the weights of the neural network. This is achieved by introducing a condition-aware weight generation module that generates conditional weights for convolution/linear layers based on the input condition. We test CAN on class-conditional image generation on ImageNet and text-to-image generation on COCO. CAN consistently delivers significant improvements for diffusion transformer models, including DiT and UViT. In particular, CAN combined with EfficientViT (CaT) achieves 2.78 FID on ImageNet 512x512, surpassing DiT-XL/2 while requiring 52x fewer MACs per sampling step. + + + + Prompt Augmentation for Self-supervised Text-guided Image Manipulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Bodur_Prompt_Augmentation_for_Self-supervised_Text-guided_Image_Manipulation_CVPR_2024_paper.pdf + Text-guided image editing finds applications in various creative and practical fields. While recent studies in image generation have advanced the field, they often struggle with the dual challenges of coherent image transformation and context preservation. In response, our work introduces prompt augmentation, a method that amplifies a single input prompt into several target prompts, strengthening textual context and enabling localised image editing. Specifically, we use the augmented prompts to delineate the intended manipulation area. We propose a Contrastive Loss tailored to driving effective image editing by displacing edited areas and drawing preserved regions closer. Acknowledging the continuous nature of image manipulations, we further refine our approach by incorporating the similarity concept, creating a Soft Contrastive Loss. The new losses are incorporated into the diffusion model, demonstrating improved or competitive image editing results over state-of-the-art approaches on public datasets and generated images.
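The Condition-Aware Neural Network (CAN) entry above controls generation by generating layer weights from the condition rather than modulating activations. A minimal sketch of a condition-aware linear layer in that spirit follows; the hypernetwork shape, the einsum-based per-sample weights, and where such a layer would sit in a diffusion transformer are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionAwareLinear(nn.Module):
    """Linear layer whose weight is produced per sample by a small hypernetwork
    conditioned on, e.g., a class or timestep embedding."""
    def __init__(self, in_dim, out_dim, cond_dim, hidden=256):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        self.weight_gen = nn.Sequential(
            nn.Linear(cond_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, out_dim * in_dim + out_dim))   # flattened weights + bias

    def forward(self, x, cond):
        # x: (B, N, in_dim) tokens, cond: (B, cond_dim) condition embedding
        params = self.weight_gen(cond)
        W = params[:, : self.out_dim * self.in_dim].view(-1, self.out_dim, self.in_dim)
        b = params[:, self.out_dim * self.in_dim:]
        return torch.einsum("bni,boi->bno", x, W) + b.unsqueeze(1)

layer = ConditionAwareLinear(in_dim=128, out_dim=128, cond_dim=64)
tokens = torch.randn(4, 16, 128)          # 16 tokens per sample
condition = torch.randn(4, 64)            # class / timestep embedding
print(layer(tokens, condition).shape)     # torch.Size([4, 16, 128])
```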
+ + + + Guess The Unseen: Dynamic 3D Scene Reconstruction from Partial 2D Glimpses + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Guess_The_Unseen_Dynamic_3D_Scene_Reconstruction_from_Partial_2D_CVPR_2024_paper.pdf + In this paper we present a method to reconstruct the world and multiple dynamic humans in 3D from a monocular video input. As a key idea we represent both the world and multiple humans via the recently emerging 3D Gaussian Splatting (3D-GS) representation enabling to conveniently and efficiently compose and render them together. In particular we address the scenarios with severely limited and sparse observations in 3D human reconstruction a common challenge encountered in the real world. To tackle this challenge we introduce a novel approach to optimize the 3D-GS representation in a canonical space by fusing the sparse cues in the common space where we leverage a pre-trained 2D diffusion model to synthesize unseen views while keeping the consistency with the observed 2D appearances. We demonstrate our method can reconstruct high-quality animatable 3D humans in various challenging examples in the presence of occlusion image crops few-shot and extremely sparse observations. After reconstruction our method is capable of not only rendering the scene in any novel views at arbitrary time instances but also editing the 3D scene by removing individual humans or applying different motions for each human. Through various experiments we demonstrate the quality and efficiency of our methods over alternative existing approaches. + + + + HyperDreamBooth: HyperNetworks for Fast Personalization of Text-to-Image Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Ruiz_HyperDreamBooth_HyperNetworks_for_Fast_Personalization_of_Text-to-Image_Models_CVPR_2024_paper.pdf + Personalization has emerged as a prominent aspect within the field of generative AI enabling the synthesis of individuals in diverse contexts and styles while retaining high-fidelity to their identities. However the process of personalization presents inherent challenges in terms of time and memory requirements. Fine-tuning each personalized model needs considerable GPU time investment and storing a personalized model per subject can be demanding in terms of storage capacity. To overcome these challenges we propose HyperDreamBooth - a hypernetwork capable of efficiently generating a small set of personalized weights from a single image of a person. By composing these weights into the diffusion model coupled with fast finetuning HyperDreamBooth can generate a person's face in various contexts and styles with high subject details while also preserving the model's crucial knowledge of diverse styles and semantic modifications. Our method achieves personalization on faces in roughly 20 seconds 25x faster than DreamBooth and 125x faster than Textual Inversion using as few as one reference image with the same quality and style diversity as DreamBooth. Also our method yields a model that is 10000x smaller than a normal DreamBooth model. + + + + HardMo: A Large-Scale Hardcase Dataset for Motion Capture + http://openaccess.thecvf.com//content/CVPR2024/papers/Liao_HardMo_A_Large-Scale_Hardcase_Dataset_for_Motion_Capture_CVPR_2024_paper.pdf + Recent years have witnessed rapid progress in monocular human mesh recovery. Despite their impressive performance on public benchmarks existing methods are vulnerable to unusual poses which prevents them from deploying to challenging scenarios such as dance and martial arts. 
This issue is mainly attributed to the domain gap induced by the data scarcity in relevant cases. Most existing datasets are captured in constrained scenarios and lack samples of such complex movements. For this reason we propose a data collection pipeline comprising automatic crawling precise annotation and hardcase mining. Based on this pipeline we establish a large dataset in a short time. The dataset named HardMo contains 7M images along with precise annotations covering 15 categories of dance and 14 categories of martial arts. Empirically we find that the prediction failure in dance and martial arts is mainly characterized by the misalignment of hand-wrist and foot-ankle. To dig deeper into the two hardcases we leverage the proposed automatic pipeline to filter collected data and construct two subsets named HardMo-Hand and HardMo-Foot. Extensive experiments demonstrate the effectiveness of the annotation pipeline and the data-driven solution to failure cases. Specifically after being trained on HardMo HMR an early pioneering method can even outperform the current state of the art 4DHumans on our benchmarks. + + + + Separate and Conquer: Decoupling Co-occurrence via Decomposition and Representation for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Separate_and_Conquer_Decoupling_Co-occurrence_via_Decomposition_and_Representation_for_CVPR_2024_paper.pdf + Weakly supervised semantic segmentation (WSSS) with image-level labels aims to achieve segmentation tasks without dense annotations. However attributed to the frequent coupling of co-occurring objects and the limited supervision from image-level labels the challenging co-occurrence problem is widely present and leads to false activation of objects in WSSS. In this work we devise a 'Separate and Conquer' scheme SeCo to tackle this issue from dimensions of image space and feature space. In the image space we propose to 'separate' the co-occurring objects with image decomposition by subdividing images into patches. Importantly we assign each patch a category tag from Class Activation Maps (CAMs) which spatially helps remove the co-context bias and guide the subsequent representation. In the feature space we propose to 'conquer' the false activation by enhancing semantic representation with multi-granularity knowledge contrast. To this end a dual-teacher-single-student architecture is designed and tag-guided contrast is conducted which guarantee the correctness of knowledge and further facilitate the discrepancy among co-contexts. We streamline the multi-staged WSSS pipeline end-to-end and tackle this issue without external supervision. Extensive experiments are conducted validating the efficiency of our method and the superiority over previous single-staged and even multi-staged competitors on PASCAL VOC and MS COCO. Code is available at https://github.com/zwyang6/SeCo.git. + + + + BiPer: Binary Neural Networks using a Periodic Function + http://openaccess.thecvf.com//content/CVPR2024/papers/Vargas_BiPer_Binary_Neural_Networks_using_a_Periodic_Function_CVPR_2024_paper.pdf + Quantized neural networks employ reduced precision representations for both weights and activations. This quantization process significantly reduces the memory requirements and computational complexity of the network. Binary Neural Networks (BNNs) are the extreme quantization case representing values with just one bit. 
Since the sign function is typically used to map real values to binary values smooth approximations are introduced to mimic the gradients during error backpropagation. Thus the mismatch between the forward and backward models corrupts the direction of the gradient causing training inconsistency problems and performance degradation. In contrast to current BNN approaches we propose to employ a binary periodic (BiPer) function during binarization. Specifically we use a square wave for the forward pass to obtain the binary values and employ the trigonometric sine function with the same period of the square wave as a differentiable surrogate during the backward pass. We demonstrate that this approach can control the quantization error by using the frequency of the periodic function and improves network performance. Extensive experiments validate the effectiveness of BiPer in benchmark datasets and network architectures with improvements of up to 1% and 0.69% with respect to state-of-the-art methods in the classification task over CIFAR-10 and ImageNet respectively. Our code is publicly available at https://github.com/edmav4/BiPer. + + + + Segment Any Event Streams via Weighted Adaptation of Pivotal Tokens + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Segment_Any_Event_Streams_via_Weighted_Adaptation_of_Pivotal_Tokens_CVPR_2024_paper.pdf + In this paper we delve into the nuanced challenge of tailoring the Segment Anything Models (SAMs) for integration with event data with the overarching objective of attaining robust and universal object segmentation within the event-centric domain. One pivotal issue at the heart of this endeavor is the precise alignment and calibration of embeddings derived from event-centric data such that they harmoniously coincide with those originating from RGB imagery. Capitalizing on the vast repositories of datasets with paired events and RGB images our proposition is to harness and extrapolate the profound knowledge encapsulated within the pre-trained SAM framework. As a cornerstone to achieving this we introduce a multi-scale feature distillation methodology. This methodology rigorously optimizes the alignment of token embeddings originating from event data with their RGB image counterparts thereby preserving and enhancing the robustness of the overall architecture. Considering the distinct significance that token embeddings from intermediate layers hold for higher-level embeddings our strategy is centered on accurately calibrating the pivotal token embeddings. This targeted calibration is aimed at effectively managing the discrepancies in high-level embeddings originating from both the event and image domains. Extensive experiments on different datasets demonstrate the effectiveness of the proposed distillation method. Code in https://github.com/happychenpipi/EventSAM. + + + + AnyDoor: Zero-shot Object-level Image Customization + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_AnyDoor_Zero-shot_Object-level_Image_Customization_CVPR_2024_paper.pdf + This work presents AnyDoor a diffusion-based image generator with the power to teleport target objects to new scenes at user-specified locations with desired shapes. Instead of tuning parameters for each object our model is trained only once and effortlessly generalizes to diverse object-scene combinations at the inference stage. Such a challenging zero-shot setting requires an adequate characterization of a certain object. 
To this end, we complement the commonly used identity feature with detail features, which are carefully designed to maintain appearance details yet allow versatile local variations (e.g., lighting, orientation, posture, etc.), supporting the object in favorably blending with different surroundings. We further propose to borrow knowledge from video datasets, where we can observe various forms (i.e., along the time axis) of a single object, leading to stronger model generalizability and robustness. Extensive experiments demonstrate the superiority of our approach over existing alternatives as well as its great potential in real-world applications such as virtual try-on, shape editing, and object swapping. + + + + Clustering Propagation for Universal Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_Clustering_Propagation_for_Universal_Medical_Image_Segmentation_CVPR_2024_paper.pdf + Prominent solutions for medical image segmentation are typically tailored for automatic or interactive setups, posing challenges in transferring progress achieved in one task to another. This also necessitates separate models for each task, duplicating both training time and parameters. To address the above issues, we introduce S2VNet, a universal framework that leverages Slice-to-Volume propagation to unify automatic/interactive segmentation within a single model and one training session. Inspired by clustering-based segmentation techniques, S2VNet makes full use of the slice-wise structure of volumetric data by initializing cluster centers from the clustering results of the previous slice. This enables knowledge acquired from prior slices to assist in the segmentation of the current slice, further efficiently bridging the communication between remote slices using mere 2D networks. Moreover, such a framework readily accommodates interactive segmentation with no architectural change, simply by initializing centroids from user inputs. S2VNet distinguishes itself by swift inference speeds and reduced memory consumption compared to prevailing 3D solutions. It can also handle multi-class interactions, with each of them serving to initialize different centroids. Experiments on three benchmarks demonstrate that S2VNet surpasses task-specific solutions in both automatic and interactive setups. + + + + Garment Recovery with Shape and Deformation Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Garment_Recovery_with_Shape_and_Deformation_Priors_CVPR_2024_paper.pdf + While modeling people wearing tight-fitting clothing has made great strides in recent years, loose-fitting clothing remains a challenge. We propose a method that delivers realistic garment models from real-world images, regardless of garment shape or deformation. To this end, we introduce a fitting approach that utilizes shape and deformation priors learned from synthetic data to accurately capture garment shapes and deformations, including large ones. Not only does our approach recover the garment geometry accurately, it also yields models that can be directly used by downstream applications such as animation and simulation. + + + + Psychometry: An Omnifit Model for Image Reconstruction from Human Brain Activity + http://openaccess.thecvf.com//content/CVPR2024/papers/Quan_Psychometry_An_Omnifit_Model_for_Image_Reconstruction_from_Human_Brain_CVPR_2024_paper.pdf + Reconstructing the viewed images from human brain activity bridges human and computer vision through the Brain-Computer Interface.
The inherent variability in brain function between individuals leads existing literature to focus on acquiring separate models for each individual using their respective brain signal data ignoring commonalities between these data. In this article we devise Psychometry an omnifit model for reconstructing images from functional Magnetic Resonance Imaging (fMRI) obtained from different subjects. Psychometry incorporates an omni mixture-of-experts (Omni MoE) module where all the experts work together to capture the inter-subject commonalities while each expert associated with subject-specific parameters copes with the individual differences. Moreover Psychometry is equipped with a retrieval-enhanced inference strategy termed Ecphory which aims to enhance the learned fMRI representation via retrieving from prestored subject-specific memories. These designs collectively render Psychometry omnifit and efficient enabling it to capture both inter-subject commonality and individual specificity across subjects. As a result the enhanced fMRI representations serve as conditional signals to guide a generation model to reconstruct high-quality and realistic images establishing Psychometry as state-of-the-art in terms of both high-level and low-level metrics. + + + + Exploring Regional Clues in CLIP for Zero-Shot Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Exploring_Regional_Clues_in_CLIP_for_Zero-Shot_Semantic_Segmentation_CVPR_2024_paper.pdf + CLIP has demonstrated marked progress in visual recognition due to its powerful pre-training on large-scale image-text pairs. However it still remains a critical challenge: how to transfer image-level knowledge into pixel-level understanding tasks such as semantic segmentation. In this paper to solve the mentioned challenge we analyze the gap between the capability of the CLIP model and the requirement of the zero-shot semantic segmentation task. Based on our analysis and observations we propose a novel method for zero-shot semantic segmentation dubbed CLIP-RC (CLIP with Regional Clues) bringing two main insights. On the one hand a region-level bridge is necessary to provide fine-grained semantics. On the other hand overfitting should be mitigated during the training stage. Benefiting from the above discoveries CLIP-RC achieves state-of-the-art performance on various zero-shot semantic segmentation benchmarks including PASCAL VOC PASCAL Context and COCO-Stuff 164K. Code will be available at https://github.com/Jittor/JSeg. + + + + Move as You Say Interact as You Can: Language-guided Human Motion Generation with Scene Affordance + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Move_as_You_Say_Interact_as_You_Can_Language-guided_Human_CVPR_2024_paper.pdf + Despite significant advancements in text-to-motion synthesis generating language-guided human motion within 3D environments poses substantial challenges. These challenges stem primarily from (i) the absence of powerful generative models capable of jointly modeling natural language 3D scenes and human motion and (ii) the generative models' intensive data requirements contrasted with the scarcity of comprehensive high-quality language-scene-motion datasets. To tackle these issues we introduce a novel two-stage framework that employs scene affordance as an intermediate representation effectively linking 3D scene grounding and conditional motion generation. 
Our framework comprises an Affordance Diffusion Model (ADM) for predicting explicit affordance map and an Affordance-to-Motion Diffusion Model (AMDM) for generating plausible human motions. By leveraging scene affordance maps our method overcomes the difficulty in generating human motion under multimodal condition signals especially when training with limited data lacking extensive language-scene-motion pairs. Our extensive experiments demonstrate that our approach consistently outperforms all baselines on established benchmarks including HumanML3D and HUMANISE. Additionally we validate our model's exceptional generalization capabilities on a specially curated evaluation set featuring previously unseen descriptions and scenes. + + + + Generalizable Face Landmarking Guided by Conditional Face Warping + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_Generalizable_Face_Landmarking_Guided_by_Conditional_Face_Warping_CVPR_2024_paper.pdf + As a significant step for human face modeling editing and generation face landmarking aims at extracting facial keypoints from images. A generalizable face landmarker is required in practice because real-world facial images e.g. the avatars in animations and games are often stylized in various ways. However achieving generalizable face landmarking is challenging due to the diversity of facial styles and the scarcity of labeled stylized faces. In this study we propose a simple but effective paradigm to learn a generalizable face landmarker based on labeled real human faces and unlabeled stylized faces. Our method learns the face landmarker as the key module of a conditional face warper. Given a pair of real and stylized facial images the conditional face warper predicts a warping field from the real face to the stylized one in which the face landmarker predicts the ending points of the warping field and provides us with high-quality pseudo landmarks for the corresponding stylized facial images. Applying an alternating optimization strategy we learn the face landmarker to minimize i) the discrepancy between the stylized faces and the warped real ones and ii) the prediction errors of both real and pseudo landmarks. Experiments on various datasets show that our method outperforms existing state-of-the-art domain adaptation methods in face landmarking tasks leading to a face landmarker with better generalizability. Code is available at https://plustwo0.github.io/project-face-landmarker. + + + + Sat2Scene: 3D Urban Scene Generation from Satellite Images with Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Sat2Scene_3D_Urban_Scene_Generation_from_Satellite_Images_with_Diffusion_CVPR_2024_paper.pdf + Directly generating scenes from satellite imagery offers exciting possibilities for integration into applications like games and map services. However challenges arise from significant view changes and scene scale. Previous efforts mainly focused on image or video generation lacking exploration into the adaptability of scene generation for arbitrary views. Existing 3D generation works either operate at the object level or are difficult to utilize the geometry obtained from satellite imagery. To overcome these limitations we propose a novel architecture for direct 3D scene generation by introducing diffusion models into 3D sparse representations and combining them with neural rendering techniques. 
Specifically our approach generates texture colors at the point level for a given geometry using a 3D diffusion model first which is then transformed into a scene representation in a feed-forward manner. The representation can be utilized to render arbitrary views which would excel in both single-frame quality and inter-frame consistency. Experiments in two city-scale datasets show that our model demonstrates proficiency in generating photo-realistic street-view image sequences and cross-view urban scenes from satellite imagery. + + + + Control4D: Efficient 4D Portrait Editing with Text + http://openaccess.thecvf.com//content/CVPR2024/papers/Shao_Control4D_Efficient_4D_Portrait_Editing_with_Text_CVPR_2024_paper.pdf + We introduce Control4D an innovative framework for editing dynamic 4D portraits using text instructions. Our method addresses the prevalent challenges in 4D editing notably the inefficiencies of existing 4D representations and the inconsistent editing effect caused by diffusion-based editors. We first propose GaussianPlanes a novel 4D representation that makes Gaussian Splatting more structured by applying plane-based decomposition in 3D space and time. This enhances both efficiency and robustness in 4D editing. Furthermore we propose to leverage a 4D generator to learn a more continuous generation space from inconsistent edited images produced by the diffusion-based editor which effectively improves the consistency and quality of 4D editing. Comprehensive evaluation demonstrates the superiority of Control4D including significantly reduced training time high-quality rendering and spatial-temporal consistency in 4D portrait editing. + + + + CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_CLIPtone_Unsupervised_Learning_for_Text-based_Image_Tone_Adjustment_CVPR_2024_paper.pdf + Recent image tone adjustment (or enhancement) approaches have predominantly adopted supervised learning for learning human-centric perceptual assessment. However these approaches are constrained by intrinsic challenges of supervised learning. Primarily the requirement for expertly-curated or retouched images escalates the data acquisition expenses. Moreover their coverage of target styles is confined to stylistic variants inferred from the training data. To surmount the above challenges we propose an unsupervised learning-based approach for text-based image tone adjustment CLIPtone that extends an existing image enhancement method to accommodate natural language descriptions. Specifically we design a hyper-network to adaptively modulate the pretrained parameters of a backbone model based on a text description. To assess whether an adjusted image aligns with its text description without a ground-truth image we utilize CLIP which is trained on a vast set of language-image pairs and thus encompasses the knowledge of human perception. The major advantages of our approach are threefold: (i) minimal data collection expenses (ii) support for a range of adjustments and (iii) the ability to handle novel text descriptions unseen in training. The efficacy of the proposed method is demonstrated through comprehensive experiments including a user study. 
+ + + + Codebook Transfer with Part-of-Speech for Vector-Quantized Image Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Codebook_Transfer_with_Part-of-Speech_for_Vector-Quantized_Image_Modeling_CVPR_2024_paper.pdf + Vector-Quantized Image Modeling (VQIM) is a fundamental research problem in image synthesis, which aims to represent an image with a discrete token sequence. Existing studies effectively address this problem by learning a discrete codebook from scratch and in a code-independent manner to quantize continuous representations into discrete tokens. However, learning a codebook from scratch and in a code-independent manner is highly challenging, which may be a key reason for codebook collapse, i.e., some code vectors are rarely optimized, without regard to the relationship between codes or good codebook priors, and thus finally die off. In this paper, inspired by pretrained language models, we find that these language models have actually pretrained a superior codebook via a large text corpus, but such information is rarely exploited in VQIM. To this end, we propose a novel codebook transfer framework with part-of-speech, called VQCT, which aims to transfer a well-trained codebook from pretrained language models to VQIM for robust codebook learning. Specifically, we first introduce a pretrained codebook from language models and part-of-speech knowledge as priors. Then, we construct a vision-related codebook with these priors for achieving codebook transfer. Finally, a novel codebook transfer network is designed to exploit the abundant semantic relationships between codes contained in pretrained codebooks for robust VQIM codebook learning. Experimental results on four datasets show that our VQCT method achieves superior VQIM performance over previous state-of-the-art methods. + + + + InceptionNeXt: When Inception Meets ConvNeXt + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_InceptionNeXt_When_Inception_Meets_ConvNeXt_CVPR_2024_paper.pdf + Inspired by the long-range modeling ability of ViTs, large-kernel convolutions have recently been widely studied and adopted to enlarge the receptive field and improve model performance, as in the remarkable work ConvNeXt, which employs 7x7 depthwise convolution. Although such a depthwise operator only consumes a few FLOPs, it largely harms model efficiency on powerful computing devices due to the high memory access costs. For example, ConvNeXt-T has similar FLOPs to ResNet-50 but only achieves 60% of its throughput when trained on A100 GPUs with full precision. Although reducing the kernel size of ConvNeXt can improve speed, it results in significant performance degradation, which poses a challenging problem: how to speed up large-kernel-based CNN models while preserving their performance. To tackle this issue, inspired by Inceptions, we propose to decompose large-kernel depthwise convolution into four parallel branches along the channel dimension, i.e., a small square kernel, two orthogonal band kernels, and an identity mapping. With this new Inception depthwise convolution, we build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance. For instance, InceptionNeXt-T achieves 1.6x higher training throughput than ConvNeXt-T, as well as a 0.2% top-1 accuracy improvement on ImageNet-1K. We anticipate InceptionNeXt can serve as an economical baseline for future architecture design to reduce carbon footprint.
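The four-branch decomposition in the InceptionNeXt entry above translates directly into a small module. The sketch below splits the channels into an identity group, a small-square-kernel group, and two orthogonal band-kernel groups; the split ratio and kernel sizes (3 and 11) are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch of an Inception-style depthwise convolution: identity branch plus a
# 3x3 square kernel and two orthogonal band kernels, applied to channel groups
# and re-concatenated (branch ratio and kernel sizes are assumptions).
import torch
import torch.nn as nn

class InceptionDWConv2d(nn.Module):
    def __init__(self, dim: int, square_k: int = 3, band_k: int = 11, branch_ratio: float = 0.125):
        super().__init__()
        gc = int(dim * branch_ratio)                      # channels per conv branch
        self.split_sizes = (dim - 3 * gc, gc, gc, gc)     # identity branch gets the rest
        self.dwconv_hw = nn.Conv2d(gc, gc, square_k, padding=square_k // 2, groups=gc)
        self.dwconv_w = nn.Conv2d(gc, gc, (1, band_k), padding=(0, band_k // 2), groups=gc)
        self.dwconv_h = nn.Conv2d(gc, gc, (band_k, 1), padding=(band_k // 2, 0), groups=gc)

    def forward(self, x):
        x_id, x_hw, x_w, x_h = torch.split(x, self.split_sizes, dim=1)
        return torch.cat(
            (x_id, self.dwconv_hw(x_hw), self.dwconv_w(x_w), self.dwconv_h(x_h)), dim=1
        )

y = InceptionDWConv2d(96)(torch.randn(1, 96, 56, 56))     # shape preserved: (1, 96, 56, 56)
```

Leaving most channels on the identity branch is what recovers throughput: only a small fraction of channels pay for the (band-shaped) large receptive field.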
+ + + + LiveHPS: LiDAR-based Scene-level Human Pose and Shape Estimation in Free Environment + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_LiveHPS_LiDAR-based_Scene-level_Human_Pose_and_Shape_Estimation_in_Free_CVPR_2024_paper.pdf + For human-centric large-scale scenes fine-grained modeling for 3D human global pose and shape is significant for scene understanding and can benefit many real-world applications. In this paper we present LiveHPS a novel single-LiDAR-based approach for scene-level human pose and shape estimation without any limitation of light conditions and wearable devices. In particular we design a distillation mechanism to mitigate the distribution-varying effect of LiDAR point clouds and exploit the temporal-spatial geometric and dynamic information existing in consecutive frames to solve the occlusion and noise disturbance. LiveHPS with its efficient configuration and high-quality output is well-suited for real-world applications. Moreover we propose a huge human motion dataset named FreeMotion which is collected in various scenarios with diverse human poses shapes and translations. It consists of multi-modal and multi-view acquisition data from calibrated and synchronized LiDARs cameras and IMUs. Extensive experiments on our new dataset and other public datasets demonstrate the SOTA performance and robustness of our approach. We will release our code and dataset soon. + + + + Segment Every Out-of-Distribution Object + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Segment_Every_Out-of-Distribution_Object_CVPR_2024_paper.pdf + Semantic segmentation models while effective for in-distribution categories face challenges in real-world deployment due to encountering out-of-distribution (OoD) objects. Detecting these OoD objects is crucial for safety-critical applications. Existing methods rely on anomaly scores but choosing a suitable threshold for generating masks presents difficulties and can lead to fragmentation and inaccuracy. This paper introduces a method to convert anomaly Score To segmentation Mask called S2M a simple and effective framework for OoD detection in semantic segmentation. Unlike assigning anomaly scores to pixels S2M directly segments the entire OoD object. By transforming anomaly scores into prompts for a promptable segmentation model S2M eliminates the need for threshold selection. Extensive experiments demonstrate that S2M outperforms the state-of-the-art by approximately 20% in IoU and 40% in mean F1 score on average across various benchmarks including Fishyscapes Segment-Me-If-You-Can and RoadAnomaly datasets. + + + + Wavelet-based Fourier Information Interaction with Frequency Diffusion Adjustment for Underwater Image Restoration + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Wavelet-based_Fourier_Information_Interaction_with_Frequency_Diffusion_Adjustment_for_Underwater_CVPR_2024_paper.pdf + Underwater images are subject to intricate and diverse degradation inevitably affecting the effectiveness of underwater visual tasks. However most approaches primarily operate in the raw pixel space of images which limits the exploration of the frequency characteristics of underwater images leading to an inadequate utilization of deep models' representational capabilities in producing high-quality images. In this paper we introduce a novel Underwater Image Enhancement (UIE) framework named WF-Diff designed to fully leverage the characteristics of frequency domain information and diffusion models. 
WF-Diff consists of two detachable networks: Wavelet-based Fourier information interaction network (WFI2-net) and Frequency Residual Diffusion Adjustment Module (FRDAM). With our full exploration of the frequency domain information WFI2-net aims to achieve preliminary enhancement of frequency information in the wavelet space. Our proposed FRDAM can further refine the high- and low-frequency information of the initial enhanced images which can be viewed as a plug-and-play universal module to adjust the detail of the underwater images. With the above techniques our algorithm can show SOTA performance on real-world underwater image datasets and achieves competitive performance in visual quality. + + + + PoNQ: a Neural QEM-based Mesh Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Maruani_PoNQ_a_Neural_QEM-based_Mesh_Representation_CVPR_2024_paper.pdf + Although polygon meshes have been a standard representation in geometry processing their irregular and combinatorial nature hinders their suitability for learning-based applications. In this work we introduce a novel learnable mesh representation through a set of local 3D sample Points and their associated Normals and Quadric error metrics (QEM) w.r.t. the underlying shape which we denote PoNQ. A global mesh is directly derived from PoNQ by efficiently leveraging the knowledge of the local quadric errors. Besides marking the first use of QEM within a neural shape representation our contribution guarantees both topological and geometrical properties by ensuring that a PoNQ mesh does not self-intersect and is always the boundary of a volume. Notably our representation does not rely on a regular grid is supervised directly by the target surface alone and also handles open surfaces with boundaries and/or sharp features. We demonstrate the efficacy of PoNQ through a learning-based mesh prediction from SDF grids and show that our method surpasses recent state-of-the-art techniques in terms of both surface and edge-based metrics. + + + + Boosting Order-Preserving and Transferability for Neural Architecture Search: a Joint Architecture Refined Search and Fine-tuning Approach + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Boosting_Order-Preserving_and_Transferability_for_Neural_Architecture_Search_a_Joint_CVPR_2024_paper.pdf + Supernet is a core component in many recent Neural Architecture Search (NAS) methods. It not only helps embody the search space but also provides a (relative) estimation of the final performance of candidate architectures. Thus it is critical that the top architectures ranked by a supernet should be consistent with those ranked by true performance which is known as the order-preserving ability. In this work we analyze the order-preserving ability on the whole search space (global) and a sub-space of top architectures (local) and empirically show that the local order-preserving for current two-stage NAS methods still need to be improved. To rectify this we propose a novel concept of Supernet Shifting a refined search strategy combining architecture searching with supernet fine-tuning. Specifically apart from evaluating the training loss is also accumulated in searching and the supernet is updated every iteration. Since superior architectures are sampled more frequently in evolutionary searching the supernet is encouraged to focus on top architectures thus improving local order-preserving. Besides a pre-trained supernet is often un-reusable for one-shot methods. 
We show that Supernet Shifting can fulfill transferring supernet to a new dataset. Specifically the last classifier layer will be unset and trained through evolutionary searching. Comprehensive experiments show that our method has better order-preserving ability and can find a dominating architecture. Moreover the pre-trained supernet can be easily transferred into a new dataset with no loss of performance. + + + + Dr. Bokeh: DiffeRentiable Occlusion-aware Bokeh Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Sheng_Dr._Bokeh_DiffeRentiable_Occlusion-aware_Bokeh_Rendering_CVPR_2024_paper.pdf + Bokeh is widely used in photography to draw attention to the subject while effectively isolating distractions in the background. Computational methods can simulate bokeh effects without relying on a physical camera lens but the inaccurate lens modeling in existing filtering-based methods leads to artifacts that need post-processing or learning-based methods to fix. We propose Dr.Bokeh a novel rendering method that addresses the issue by directly correcting the defect that violates the physics in the current filtering-based bokeh rendering equation. Dr.Bokeh first preprocesses the input RGBD to obtain a layered scene representation. Dr.Bokeh then takes the layered representation and user-defined lens parameters to render photo-realistic lens blur based on the novel occlusion-aware bokeh rendering method. Experiments show that the non-learning based renderer Dr.Bokeh outperforms state-of-the-art bokeh rendering algorithms in terms of photo-realism. In addition extensive quantitative and qualitative evaluations show the more accurate lens model further pushes the limit of a closely related field depth-from-defocus. + + + + LAENeRF: Local Appearance Editing for Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Radl_LAENeRF_Local_Appearance_Editing_for_Neural_Radiance_Fields_CVPR_2024_paper.pdf + Due to the omnipresence of Neural Radiance Fields (NeRFs) the interest towards editable implicit 3D representations has surged over the last years. However editing implicit or hybrid representations as used for NeRFs is difficult due to the entanglement of appearance and geometry encoded in the model parameters. Despite these challenges recent research has shown first promising steps towards photorealistic and non-photorealistic appearance edits. The main open issues of related work include limited interactivity a lack of support for local edits and large memory requirements rendering them less useful in practice. We address these limitations with LAENeRF a unified framework for photorealistic and non-photorealistic appearance editing of NeRFs. To tackle local editing we leverage a voxel grid as starting point for region selection. We learn a mapping from expected ray terminations to final output color which can optionally be supervised by a style loss resulting in a framework which can perform photorealistic and non-photorealistic appearance editing of selected regions. Relying on a single point per ray for our mapping we limit memory requirements and enable fast optimization. To guarantee interactivity we compose the output color using a set of learned modifiable base colors composed with additive layer mixing. Compared to concurrent work LAENeRF enables recoloring and stylization while keeping processing time low. Furthermore we demonstrate that our approach surpasses baseline methods both quantitatively and qualitatively. 
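The LAENeRF entry above composes the output color from a set of learned, modifiable base colors via additive mixing. The following is a tiny, heavily simplified sketch of that palette idea only; it omits the ray-termination mapping and style supervision, and the feature dimension, palette size, and weight predictor are assumptions, not the paper's design.

```python
# Sketch of additive mixing over a learned, user-editable palette of base colors
# (illustrative only; not the LAENeRF implementation).
import torch
import torch.nn as nn

class PaletteColorHead(nn.Module):
    def __init__(self, feat_dim: int = 32, num_base_colors: int = 8):
        super().__init__()
        self.base_colors = nn.Parameter(torch.rand(num_base_colors, 3))  # editable after training
        self.to_weights = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(), nn.Linear(64, num_base_colors)
        )

    def forward(self, feats):                                 # feats: (N, feat_dim) per-ray features
        w = torch.softmax(self.to_weights(feats), dim=-1)     # convex mixing weights
        return w @ torch.sigmoid(self.base_colors)            # (N, 3) RGB in [0, 1]

head = PaletteColorHead()
rgb = head(torch.randn(4096, 32))
# Recoloring then amounts to editing `head.base_colors` without retraining the weights.
```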
+ + + + Adversarial Score Distillation: When score distillation meets GAN + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_Adversarial_Score_Distillation_When_score_distillation_meets_GAN_CVPR_2024_paper.pdf + Existing score distillation methods are sensitive to classifier-free guidance (CFG) scale manifested as over-smoothness or instability at small CFG scales while over-saturation at large ones. To explain and analyze these issues we revisit the derivation of Score Distillation Sampling (SDS) and decipher existing score distillation with the Wasserstein Generative Adversarial Network (WGAN) paradigm. With the WGAN paradigm we find that existing score distillation either employs a fixed sub-optimal discriminator or conducts incomplete discriminator optimization resulting in the scale-sensitive issue. We propose the Adversarial Score Distillation (ASD) which maintains an optimizable discriminator and updates it using the complete optimization objective. Experiments show that the proposed ASD performs favorably in 2D distillation and text-to-3D tasks against existing methods. Furthermore to explore the generalization ability of our paradigm we extend ASD to the image editing task which achieves competitive results. The project page and code are at https://github.com/2y7c3/ASD + + + + Vector Graphics Generation via Mutually Impulsed Dual-domain Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Vector_Graphics_Generation_via_Mutually_Impulsed_Dual-domain_Diffusion_CVPR_2024_paper.pdf + Intelligent generation of vector graphics has very promising applications in the fields of advertising and logo design artistic painting animation production etc. However current mainstream vector image generation methods lack the encoding of image appearance information that is associated with the original vector representation and therefore lose valid supervision signal from the strong correlation between the discrete vector parameter (drawing instruction) sequence and the target shape/structure of the corresponding pixel image. On the one hand the generation process based on pure vector domain completely ignores the similarity measurement between shape parameter (and their combination) and the paired pixel image appearance pattern; on the other hand two-stage methods (i.e. generation-and-vectorization) based on pixel diffusion followed by differentiable image-to-vector translation suffer from wrong error-correction signal caused by approximate gradients. To address the above issues we propose a novel generation framework based on dual-domain (vector-pixel) diffusion with cross-modality impulse signals from each other. First in each diffusion step the current representation extracted from the other domain is used as a condition variable to constrain the subsequent sampling operation yielding shape-aware new parameterizations; second independent supervision signals from both domains avoid the gradient error accumulation problem caused by cross-domain representation conversion. Extensive experimental results on popular benchmarks including font and icon datasets demonstrate the great advantages of our proposed framework in terms of generated shape quality. 
+ + + + ScoreHypo: Probabilistic Human Mesh Estimation with Hypothesis Scoring + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_ScoreHypo_Probabilistic_Human_Mesh_Estimation_with_Hypothesis_Scoring_CVPR_2024_paper.pdf + Monocular 3D human mesh estimation is an ill-posed problem characterized by inherent ambiguity and occlusion. While recent probabilistic methods propose generating multiple solutions, little attention is paid to obtaining high-quality estimates from them. To address this limitation, we introduce ScoreHypo, a versatile framework that first leverages our novel HypoNet to generate multiple hypotheses, followed by a meticulously designed scorer, ScoreNet, which evaluates and selects high-quality estimates. ScoreHypo formulates the estimation process as a reverse denoising process, where HypoNet produces a diverse set of plausible estimates that effectively align with the image cues. Subsequently, ScoreNet is employed to rigorously evaluate and rank these estimates based on their quality and finally identify superior ones. Experimental results demonstrate that HypoNet outperforms existing state-of-the-art probabilistic methods as a multi-hypothesis mesh estimator. Moreover, the estimates selected by ScoreNet significantly outperform random generation or simple averaging. Notably, the trained ScoreNet exhibits generalizability, as it can effectively score existing methods and significantly reduce their errors by more than 15%. Code and models are available at https://xy02-05.github.io/ScoreHypo. + + + + MeshPose: Unifying DensePose and 3D Body Mesh Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Le_MeshPose_Unifying_DensePose_and_3D_Body_Mesh_Reconstruction_CVPR_2024_paper.pdf + DensePose provides a pixel-accurate association of images with 3D mesh coordinates but does not provide a 3D mesh, while Human Mesh Reconstruction (HMR) systems have high 2D reprojection error as measured by DensePose localization metrics. In this work, we introduce MeshPose to jointly tackle DensePose and HMR. For this, we first introduce new losses that allow us to use weak DensePose supervision to accurately localize in 2D a subset of the mesh vertices ('VertexPose'). We then lift these vertices to 3D, yielding a low-poly body mesh ('MeshPose'). Our system is trained in an end-to-end manner and is the first HMR method to attain competitive DensePose accuracy while also being lightweight and amenable to efficient inference, making it suitable for real-time AR applications. + + + + Unsupervised Salient Instance Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Tian_Unsupervised_Salient_Instance_Detection_CVPR_2024_paper.pdf + The significant amount of manual effort in annotating pixel-level labels has triggered the advancement of unsupervised saliency learning. However, without supervision signals, state-of-the-art methods can only infer region-level saliency. In this paper, we propose to explore the unsupervised salient instance detection (USID) problem for a more fine-grained visual understanding. Our key observation is that self-supervised transformer features may exhibit local similarities as well as different levels of contrast to other regions, which provide informative cues to identify salient instances. Hence, we propose SCoCo, a novel network that models saliency coherence and contrast for USID.
SCoCo includes two novel modules: (1) a global background adaptation (GBA) module with a scene-level contrastive loss to extract salient regions from the scene by searching the adaptive "saliency threshold" in the self-supervised transformer features, and (2) a locality-aware similarity (LAS) module with an instance-level contrastive loss to group salient regions into instances by modeling the in-region saliency coherence and cross-region saliency contrasts. Extensive experiments show that SCoCo outperforms state-of-the-art weakly-supervised SID methods and carefully designed unsupervised baselines, and has comparable performance to fully-supervised SID methods. + + + + Move Anything with Layered Scene Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_Move_Anything_with_Layered_Scene_Diffusion_CVPR_2024_paper.pdf + Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, and cloning, and layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second. + + + + Human Gaussian Splatting: Real-time Rendering of Animatable Avatars + http://openaccess.thecvf.com//content/CVPR2024/papers/Moreau_Human_Gaussian_Splatting_Real-time_Rendering_of_Animatable_Avatars_CVPR_2024_paper.pdf + This work addresses the problem of real-time rendering of photorealistic human body avatars learned from multi-view videos. While the classical approaches to model and render virtual humans generally use a textured mesh, recent research has developed neural body representations that achieve impressive visual quality. However, these models are difficult to render in real time, and their quality degrades when the character is animated with body poses different from the training observations. We propose an animatable human model based on 3D Gaussian Splatting, which has recently emerged as a very efficient alternative to neural radiance fields. The body is represented by a set of Gaussian primitives in a canonical space, which is deformed with a coarse-to-fine approach that combines forward skinning and local non-rigid refinement. We describe how to learn our Human Gaussian Splatting (HuGS) model in an end-to-end fashion from multi-view observations and evaluate it against state-of-the-art approaches for novel pose synthesis of clothed bodies. Our method achieves a 1.5 dB PSNR improvement over the state-of-the-art on the THuman4 dataset while being able to render in real time (≈80 fps at 512x512 resolution).
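For the Human Gaussian Splatting entry above, the coarse-to-fine deformation that combines forward skinning with local non-rigid refinement can be written, in a generic form (notation is ours, not necessarily the paper's), as a linear-blend-skinning transform applied to a refined canonical position:

```latex
% Generic coarse-to-fine deformation: a canonical Gaussian center x_c is first
% displaced by a pose-dependent non-rigid offset, then transformed by the
% skinning bones (illustrative formulation, not the paper's exact notation).
\mathbf{x}_p \;=\; \sum_{k=1}^{K} w_k(\mathbf{x}_c)\, \mathbf{T}_k(\theta)\,
\bigl(\mathbf{x}_c + \Delta\mathbf{x}(\mathbf{x}_c, \theta)\bigr),
\qquad \sum_{k=1}^{K} w_k(\mathbf{x}_c) = 1
```

Here w_k are per-point skinning weights, T_k(θ) are the bone transforms for pose θ, and Δx is the learned local refinement; the coarse skinning handles articulation while Δx captures pose-dependent cloth and soft-tissue deformation.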
+ + + + The Devil is in the Details: StyleFeatureEditor for Detail-Rich StyleGAN Inversion and High Quality Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Bobkov_The_Devil_is_in_the_Details_StyleFeatureEditor_for_Detail-Rich_StyleGAN_CVPR_2024_paper.pdf + The task of manipulating real image attributes through StyleGAN inversion has been extensively researched. This process involves searching latent variables from a well-trained StyleGAN generator that can synthesize a real image modifying these latent variables and then synthesizing an image with the desired edits. A balance must be struck between the quality of the reconstruction and the ability to edit. Earlier studies utilized the low-dimensional W-space for latent search which facilitated effective editing but struggled with reconstructing intricate details. More recent research has turned to the high-dimensional feature space F which successfully inverses the input image but loses much of the detail during editing. In this paper we introduce StyleFeatureEditor -- a novel method that enables editing in both w-latents and F-latents. This technique not only allows for the reconstruction of finer image details but also ensures their preservation during editing. We also present a new training pipeline specifically designed to train our model to accurately edit F-latents. Our method is compared with state-of-the-art encoding approaches demonstrating that our model excels in terms of reconstruction quality and is capable of editing even challenging out-of-domain examples. + + + + Unbiased Estimator for Distorted Conics in Camera Calibration + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_Unbiased_Estimator_for_Distorted_Conics_in_Camera_Calibration_CVPR_2024_paper.pdf + In the literature points and conics have been major features for camera geometric calibration. Although conics are more informative features than points the loss of the conic property under distortion has critically limited the utility of conic features in camera calibration. Many existing approaches addressed conic-based calibration by ignoring distortion or introducing 3D spherical targets to circumvent this limitation. In this paper we present a novel formulation for conic-based calibration using moments. Our derivation is based on the mathematical finding that the first moment can be estimated without bias even under distortion. This allows us to track moment changes during projection and distortion ensuring the preservation of the first moment of the distorted conic. With an unbiased estimator the circular patterns can be accurately detected at the sub-pixel level and can now be fully exploited for an entire calibration pipeline resulting in significantly improved calibration. The entire code is readily available from https://github.com/ChaehyeonSong/discocal. + + + + MultiPhys: Multi-Person Physics-aware 3D Motion Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ugrinovic_MultiPhys_Multi-Person_Physics-aware_3D_Motion_Estimation_CVPR_2024_paper.pdf + We introduce MultiPhys a method designed for recovering multi-person motion from monocular videos. Our focus lies in capturing coherent spatial placement between pairs of individuals across varying degrees of engagement. MultiPhys being physically aware exhibits robustness to jittering and occlusions and effectively eliminates penetration issues between the two individuals. 
We devise a pipeline in which the motion estimated by a kinematic-based method is fed into a physics simulator in an autoregressive manner. We introduce distinct components that enable our model to harness the simulator's properties without compromising the accuracy of the kinematic estimates. This results in final motion estimates that are both kinematically coherent and physically compliant. Extensive evaluations on three challenging datasets characterized by substantial inter-person interaction show that our method significantly reduces errors associated with penetration and foot skating while performing competitively with the state-of-the-art on motion accuracy and smoothness. + + + + NIVeL: Neural Implicit Vector Layers for Text-to-Vector Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Thamizharasan_NIVeL_Neural_Implicit_Vector_Layers_for_Text-to-Vector_Generation_CVPR_2024_paper.pdf + The success of denoising diffusion models in representing rich data distributions over 2D raster images has prompted research on extending them to other data representations such as vector graphics. Unfortunately due to their variable structure and scarcity of vector training data directly applying diffusion models on this domain remains a challenging problem. Using workarounds like optimization via Score Distillation Sampling (SDS) is also fraught with difficulty as vector representations are non-trivial to directly optimize and tend to result in implausible geometries such as redundant or self-intersecting shapes. NIVeL addresses these challenges by reinterpreting the problem on an alternative intermediate domain which preserves the desirable properties of vector graphics - mainly sparsity of representation and resolution-independence. This alternative domain is based on neural implicit fields expressed in a set of decomposable editable layers. Based on our experiments NIVeL produces text-to-vector graphics results of significantly better quality than the state-of-the-art. + + + + OAKINK2: A Dataset of Bimanual Hands-Object Manipulation in Complex Task Completion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhan_OAKINK2_A_Dataset_of_Bimanual_Hands-Object_Manipulation_in_Complex_Task_CVPR_2024_paper.pdf + We present OAKINK2 a dataset of bimanual object manipulation tasks for complex daily activities. In pursuit of constructing the complex tasks into a structured representation OAKINK2 introduces three level of abstraction to organize the manipulation tasks: Affordance Primitive Task and Complex Task. OAKINK2 features on an object-centric perspective for decoding the complex tasks treating them as a sequence of object affordance fulfillment. The first level Affordance outlines the functionalities that objects in the scene can afford the second level Primitive Task describes the minimal interaction units that humans interact with the object to achieve its affordance and the third level Complex Task illustrates how Primitive Tasks are composed and interdependent. OAKINK2 dataset provides multi-view image streams and precise pose annotations for the human body hands and various interacting objects. This extensive collection supports applications such as interaction reconstruction and motion synthesis. Based on the 3-level abstraction of OAKINK2 we explore a task-oriented framework for Complex Task Completion (CTC). CTC aims to generate a sequence of bimanual manipulation to achieve task objectives. 
Within the CTC framework we employ Large Language Models (LLMs) to decompose the complex task objectives into sequences of Primitive Tasks and have developed a Motion Fulfillment Model that generates bimanual hand motion for each Primitive Task. OAKINK2 datasets and models are available at https://oakink.net/v2. + + + + Text-Guided 3D Face Synthesis - From Generation to Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Text-Guided_3D_Face_Synthesis_-_From_Generation_to_Editing_CVPR_2024_paper.pdf + Text-guided 3D face synthesis has achieved remarkable results by leveraging text-to-image (T2I) diffusion models. However most existing works focus solely on the direct generation ignoring the editing restricting them from synthesizing customized 3D faces through iterative adjustments. In this paper we propose a unified text-guided framework from face generation to editing. In the generation stage we propose a geometry-texture decoupled generation to mitigate the loss of geometric details caused by coupling. Besides decoupling enables us to utilize the generated geometry as a condition for texture generation yielding highly geometry-texture aligned results. We further employ a fine-tuned texture diffusion model to enhance texture quality in both RGB and YUV space. In the editing stage we first employ a pre-trained diffusion model to update facial geometry or texture based on the texts. To enable sequential editing we introduce a UV domain consistency preservation regularization preventing unintentional changes to irrelevant facial attributes. Besides we propose a self-guided consistency weight strategy to improve editing efficacy while preserving consistency. Through comprehensive experiments we showcase our method's superiority in face synthesis. + + + + Multiplane Prior Guided Few-Shot Aerial Scene Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Multiplane_Prior_Guided_Few-Shot_Aerial_Scene_Rendering_CVPR_2024_paper.pdf + Neural Radiance Fields (NeRF) have been successfully applied in various aerial scenes yet they face challenges with sparse views due to limited supervision. The acquisition of dense aerial views is often prohibitive as unmanned aerial vehicles (UAVs) may encounter constraints in perspective range and energy constraints. In this work we introduce Multiplane Prior guided NeRF (MPNeRF) a novel approach tailored for few-shot aerial scene rendering--marking a pioneering effort in this domain. Our key insight is that the intrinsic geometric regularities specific to aerial imagery could be leveraged to enhance NeRF in sparse aerial scenes. By investigating NeRF's and Multiplane Image (MPI)'s behavior we propose to guide the training process of NeRF with a Multiplane Prior. The proposed Multiplane Prior draws upon MPI's benefits and incorporates advanced image comprehension through a SwinV2 Transformer pre-trained via SimMIM. Our extensive experiments demonstrate that MPNeRF outperforms existing state-of-the-art methods applied in non-aerial contexts by tripling the performance in SSIM and LPIPS even with three views available. We hope our work offers insights into the development of NeRF-based applications in aerial scenes with limited data. 
+ + + + MAS: Multi-view Ancestral Sampling for 3D Motion Generation Using 2D Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Kapon_MAS_Multi-view_Ancestral_Sampling_for_3D_Motion_Generation_Using_2D_CVPR_2024_paper.pdf + We introduce Multi-view Ancestral Sampling (MAS) a method for 3D motion generation using 2D diffusion models that were trained on motions obtained from in-the-wild videos. As such MAS opens opportunities to exciting and diverse fields of motion previously under-explored as 3D data is scarce and hard to collect. MAS works by simultaneously denoising multiple 2D motion sequences representing different views of the same 3D motion. It ensures consistency across all views at each diffusion step by combining the individual generations into a unified 3D sequence and projecting it back to the original views. We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers rhythmic gymnastic performances featuring a ball apparatus and horse races. In each of these domains 3D motion capture is arduous and yet MAS generates diverse and realistic 3D sequences. Unlike the Score Distillation approach which optimizes each sample by repeatedly applying small fixes our method uses a sampling process that was constructed for the diffusion framework. As we demonstrate MAS avoids common issues such as out-of-domain sampling and mode-collapse. https://guytevet.github.io/mas-page/ + + + + Bilateral Event Mining and Complementary for Event Stream Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Bilateral_Event_Mining_and_Complementary_for_Event_Stream_Super-Resolution_CVPR_2024_paper.pdf + Event Stream Super-Resolution (ESR) aims to address the challenge of insufficient spatial resolution in event streams which holds great significance for the application of event cameras in complex scenarios. Previous works for ESR often process positive and negative events in a mixed paradigm. This paradigm limits their ability to effectively model the unique characteristics of each event and mutually refine each other by considering their correlations. In this paper we propose a bilateral event mining and complementary network (BMCNet) to fully leverage the potential of each event and capture the shared information to complement each other simultaneously. Specifically we resort to a two-stream network to accomplish comprehensive mining of each type of events individually. To facilitate the exchange of information between two streams we propose a bilateral information exchange (BIE) module. This module is layer-wisely embedded between two streams enabling the effective propagation of hierarchical global information while alleviating the impact of invalid information brought by inherent characteristics of events. The experimental results demonstrate that our approach outperforms the previous state-of-the-art methods in ESR achieving performance improvements of over 11% on both real and synthetic datasets. Moreover our method significantly enhances the performance of event-based downstream tasks such as object recognition and video reconstruction. Our code is available at https://github.com/Lqm26/BMCNet-ESR. 
+ + + + SANeRF-HQ: Segment Anything for NeRF in High Quality + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_SANeRF-HQ_Segment_Anything_for_NeRF_in_High_Quality_CVPR_2024_paper.pdf + Recently the Segment Anything Model (SAM) has showcased remarkable capabilities of zero-shot segmentation while NeRF (Neural Radiance Fields) has gained popularity as a method for various 3D problems beyond novel view synthesis. Though there exist initial attempts to incorporate these two methods into 3D segmentation they face the challenge of accurately and consistently segmenting objects in complex scenarios. In this paper we introduce the Segment Anything for NeRF in High Quality (SANeRF-HQ) to achieve high-quality 3D segmentation of any target object in a given scene. SANeRF-HQ utilizes SAM for open-world object segmentation guided by user-supplied prompts while leveraging NeRF to aggregate information from different viewpoints. To overcome the aforementioned challenges we employ density field and RGB similarity to enhance the accuracy of segmentation boundary during the aggregation. Emphasizing on segmentation accuracy we evaluate our method on multiple NeRF datasets where high-quality ground-truths are available or manually annotated. SANeRF-HQ shows a significant quality improvement over state-of-the-art methods in NeRF object segmentation provides higher flexibility for object localization and enables more consistent object segmentation across multiple views. + + + + Transcending the Limit of Local Window: Advanced Super-Resolution Transformer with Adaptive Token Dictionary + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Transcending_the_Limit_of_Local_Window_Advanced_Super-Resolution_Transformer_with_CVPR_2024_paper.pdf + Single Image Super-Resolution is a classic computer vision problem that involves estimating high-resolution (HR) images from low-resolution (LR) ones. Although deep neural networks (DNNs) especially Transformers for super-resolution have seen significant advancements in recent years challenges still remain particularly in limited receptive field caused by window-based self-attention. To address these issues we introduce a group of auxiliary Adaptive Token Dictionary to SR Transformer and establish an ATD-SR method. The introduced token dictionary could learn prior information from training data and adapt the learned prior to specific testing image through an adaptive refinement step. The refinement strategy could not only provide global information to all input tokens but also group image tokens into categories. Based on category partitions we further propose a category-based self-attention mechanism designed to leverage distant but similar tokens for enhancing input features. The experimental results show that our method achieves the best performance on various single image super-resolution benchmarks. + + + + Mixed-Precision Quantization for Federated Learning on Resource-Constrained Heterogeneous Devices + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Mixed-Precision_Quantization_for_Federated_Learning_on_Resource-Constrained_Heterogeneous_Devices_CVPR_2024_paper.pdf + While federated learning (FL) systems often utilize quantization to battle communication and computational bottlenecks they have heretofore been limited to deploying fixed-precision quantization schemes. 
Meanwhile the concept of mixed-precision quantization (MPQ) where different layers of a deep learning model are assigned varying bit-width remains unexplored in the FL settings. We present a novel FL algorithm FedMPQ which introduces mixed-precision quantization to resource-heterogeneous FL systems. Specifically local models quantized so as to satisfy bit-width constraint are trained by optimizing an objective function that includes a regularization term which promotes reduction of precision in some of the layers without significant performance degradation. The server collects local model updates de-quantizes them into full-precision models and then aggregates them into a global model. To initialize the next round of local training the server relies on the information learned in the previous training round to customize bit-width assignments of the models delivered to different clients. In extensive benchmarking experiments on several model architectures and different datasets in both iid and non-iid settings FedMPQ outperformed the baseline FL schemes that utilize fixed-precision quantization while incurring only a minor computational overhead on the participating devices. + + + + Neural Fields as Distributions: Signal Processing Beyond Euclidean Space + http://openaccess.thecvf.com//content/CVPR2024/papers/Rebain_Neural_Fields_as_Distributions_Signal_Processing_Beyond_Euclidean_Space_CVPR_2024_paper.pdf + Neural fields have emerged as a powerful and broadly applicable method for representing signals. However in contrast to classical discrete digital signal processing the portfolio of tools to process such representations is still severely limited and restricted to Euclidean domains. In this paper we address this problem by showing how a probabilistic re-interpretation of neural fields can enable their training and inference processes to become "filter-aware". The formulation we propose not only merges training and filtering in an efficient way but also generalizes beyond the familiar Euclidean coordinate spaces to the more general set of smooth manifolds and convolutions induced by the actions of Lie groups. We demonstrate how this framework can enable novel integrations of signal processing techniques for neural field applications on both Euclidean domains such as images and audio as well as non-Euclidean domains such as rotations and rays. A noteworthy benefit of our method is its applicability. Our method can be summarized as primarily a modification of the loss function and in most cases does not require changes to the network architecture or the inference process. + + + + Style Blind Domain Generalized Semantic Segmentation via Covariance Alignment and Semantic Consistence Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Ahn_Style_Blind_Domain_Generalized_Semantic_Segmentation_via_Covariance_Alignment_and_CVPR_2024_paper.pdf + Deep learning models for semantic segmentation often experience performance degradation when deployed to unseen target domains unidentified during the training phase. This is mainly due to variations in image texture (i.e. style) from different data sources. To tackle this challenge existing domain generalized semantic segmentation (DGSS) methods attempt to remove style variations from the feature. However these approaches struggle with the entanglement of style and content which may lead to the unintentional removal of crucial content information causing performance degradation. 
This study addresses this limitation by proposing BlindNet a novel DGSS approach that blinds the style without external modules or datasets. The main idea behind our proposed approach is to alleviate the effect of style in the encoder whilst facilitating robust segmentation in the decoder. To achieve this BlindNet comprises two key components: covariance alignment and semantic consistency contrastive learning. Specifically the covariance alignment trains the encoder to uniformly recognize various styles and preserve the content information of the feature rather than removing the style-sensitive factor. Meanwhile semantic consistency contrastive learning enables the decoder to construct discriminative class embedding space and disentangles features that are vulnerable to misclassification. Through extensive experiments our approach outperforms existing DGSS methods exhibiting robustness and superior performance for semantic segmentation on unseen target domains. + + + + X-3D: Explicit 3D Structure Modeling for Point Cloud Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_X-3D_Explicit_3D_Structure_Modeling_for_Point_Cloud_Recognition_CVPR_2024_paper.pdf + Numerous prior studies predominantly emphasize constructing relation vectors for individual neighborhood points and generating dynamic kernels for each vector and embedding these into high-dimensional spaces to capture implicit local structures. However we contend that such an implicit high-dimensional structure modeling approach inadequately represents the local geometric structure of point clouds due to the absence of explicit structural information. Hence we introduce X-3D an explicit 3D structure modeling approach. X-3D functions by capturing the explicit local structural information within the input 3D space and employing it to produce dynamic kernels with shared weights for all neighborhood points within the current local region. This modeling approach introduces an effective geometric prior and significantly diminishes the disparity between the local structure of the embedding space and the original input point cloud thereby improving the extraction of local features. Experiments show that our method can be applied to a variety of methods and achieves state-of-the-art performance on segmentation classification and detection tasks with lower extra computational cost: 90.7% on ScanObjectNN for classification 79.2% on S3DIS 6-fold and 74.3% on S3DIS Area 5 for segmentation 76.3% on ScanNetV2 for segmentation and 64.5% mAP_25 and 46.9% mAP_50 on SUN RGB-D and 69.0% mAP_25 and 51.1% mAP_50 on ScanNetV2. Our code is available at https://github.com/sunshuofeng/X-3D. + + + + One More Step: A Versatile Plug-and-Play Module for Rectifying Diffusion Schedule Flaws and Enhancing Low-Frequency Controls + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_One_More_Step_A_Versatile_Plug-and-Play_Module_for_Rectifying_Diffusion_CVPR_2024_paper.pdf + It is well known that many open-released foundational diffusion models have difficulty in generating images that substantially depart from average brightness despite such images being present in the training data. This is due to an inconsistency: while denoising starts from pure Gaussian noise during inference the training noise schedule retains residual data even in the final timestep distribution due to difficulties in numerical conditioning in mainstream formulation leading to unintended bias during inference. 
To mitigate this issue certain eps-prediction models are combined with an ad-hoc offset-noise methodology. In parallel some contemporary models have adopted zero-terminal SNR noise schedules together with v-prediction which necessitate major alterations to pre-trained models. However such changes risk destabilizing a large multitude of community-driven applications anchored on these pre-trained models. In light of this our investigation revisits the fundamental causes leading to our proposal of an innovative and principled remedy called One More Step (OMS). By integrating a compact network and incorporating an additional simple yet effective step during inference OMS elevates image fidelity and harmonizes the dichotomy between training and inference while preserving original model parameters. Once trained various pre-trained diffusion models with the same latent domain can share the same OMS module. + + + + HIVE: Harnessing Human Feedback for Instructional Visual Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_HIVE_Harnessing_Human_Feedback_for_Instructional_Visual_Editing_CVPR_2024_paper.pdf + Incorporating human feedback has been shown to be crucial to align text generated by large language models to human preferences. We hypothesize that state-of-the-art instructional image editing models where outputs are generated based on an input image and an editing instruction could similarly benefit from human feedback as their outputs may not adhere to the correct instructions and preferences of users. In this paper we present a novel framework to harness human feedback for instructional visual editing (HIVE). Specifically we collect human feedback on the edited images and learn a reward function to capture the underlying user preferences. We then introduce scalable diffusion model fine-tuning methods that can incorporate human preferences based on the estimated reward. Besides to mitigate the bias brought by the limitation of data we contribute a new 1.1M training dataset a 3.6K reward dataset for rewards learning and a 1K evaluation dataset to boost the performance of instructional image editing. We conduct extensive empirical experiments quantitatively and qualitatively showing that HIVE is favored over previous state-of-the-art instructional image editing approaches by a large margin. + + + + StrokeFaceNeRF: Stroke-based Facial Appearance Editing in Neural Radiance Field + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_StrokeFaceNeRF_Stroke-based_Facial_Appearance_Editing_in_Neural_Radiance_Field_CVPR_2024_paper.pdf + Current 3D-aware facial NeRF generation approaches control the facial appearance by text lighting conditions or reference images limiting precise manipulation of local facial regions and interactivity. Color stroke a user-friendly and effective tool to depict appearance is challenging to edit 3D faces because of the lack of texture coarse geometry representation and detailed editing operations. To solve the above problems we introduce StrokeFaceNeRF a novel stroke-based method for editing facial NeRF appearance. In order to infer the missing texture and 3D geometry information 2D edited stroke maps are firstly encoded into the EG3D's latent space followed by a transformer-based editing module to achieve effective appearance changes while preserving the original geometry in editing regions. Notably we design a novel geometry loss function to ensure surface density remains consistent during training. 
To further enhance the local manipulation accuracy we propose a stereo fusion approach which lifts the 2D mask (inferred from strokes or drawn by users) into 3D mask volume allowing explicit blending of the original and edited faces. Extensive experiments validate that the proposed method outperforms existing 2D and 3D methods in both editing reality and geometry retention. + + + + ProxyCap: Real-time Monocular Full-body Capture in World Space via Human-Centric Proxy-to-Motion Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_ProxyCap_Real-time_Monocular_Full-body_Capture_in_World_Space_via_Human-Centric_CVPR_2024_paper.pdf + Learning-based approaches to monocular motion capture have recently shown promising results by learning to regress in a data-driven manner. However due to the challenges in data collection and network designs it remains challenging to achieve real-time full-body capture while being accurate in world space. In this work we introduce ProxyCap a human-centric proxy-to-motion learning scheme to learn world-space motions from a proxy dataset of 2D skeleton sequences and 3D rotational motions. Such proxy data enables us to build a learning-based network with accurate world-space supervision while also mitigating the generalization issues. For more accurate and physically plausible predictions in world space our network is designed to learn human motions from a human-centric perspective which enables the understanding of the same motion captured with different camera trajectories. Moreover a contact-aware neural motion descent module is proposed to improve foot-ground contact and motion misalignment with the proxy observations. With the proposed learning-based solution we demonstrate the first real-time monocular full-body capture system with plausible foot-ground contact in world space even using hand-held cameras. + + + + On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chatterjee_On_the_Robustness_of_Language_Guidance_for_Low-Level_Vision_Tasks_CVPR_2024_paper.pdf + Recent advances in monocular depth estimation have been made by incorporating natural language as additional guidance. Although yielding impressive results the impact of the language prior particularly in terms of generalization and robustness remains unexplored. In this paper we address this gap by quantifying the impact of this prior and introduce methods to benchmark its effectiveness across various settings. We generate "low-level" sentences that convey object-centric three-dimensional spatial relationships incorporate them as additional language priors and evaluate their downstream impact on depth estimation. Our key finding is that current language-guided depth estimators perform optimally only with scene-level descriptions and counter-intuitively fare worse with low level descriptions. Despite leveraging additional data these methods are not robust to directed adversarial attacks and decline in performance with an increase in distribution shift. Finally to provide a foundation for future research we identify points of failures and offer insights to better understand these shortcomings. With an increasing number of methods using language for depth estimation our findings highlight the opportunities and pitfalls that require careful consideration for effective deployment in real-world settings. 
+ + + + UFOGen: You Forward Once Large Scale Text-to-Image Generation via Diffusion GANs + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_UFOGen_You_Forward_Once_Large_Scale_Text-to-Image_Generation_via_Diffusion_CVPR_2024_paper.pdf + Text-to-image diffusion models have demonstrated remarkable capabilities in transforming text prompts into coherent images yet the computational cost of the multi-step inference remains a persistent challenge. To address this issue we present UFOGen a novel generative model designed for ultra-fast one-step text-to-image generation. In contrast to conventional approaches that focus on improving samplers or employing distillation techniques for diffusion models UFOGen adopts a hybrid methodology integrating diffusion models with a GAN objective. Leveraging a newly introduced diffusion-GAN objective and initialization with pre-trained diffusion models UFOGen excels in efficiently generating high-quality images conditioned on textual descriptions in a single step. Beyond traditional text-to-image generation UFOGen showcases versatility in applications. Notably UFOGen stands among the pioneering models enabling one-step text-to-image generation and diverse downstream tasks presenting a significant advancement in the landscape of efficient generative models. + + + + A Dual-Augmentor Framework for Domain Generalization in 3D Human Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_A_Dual-Augmentor_Framework_for_Domain_Generalization_in_3D_Human_Pose_CVPR_2024_paper.pdf + 3D human pose data collected in controlled laboratory settings present challenges for pose estimators that generalize across diverse scenarios. To address this domain generalization is employed. Current methodologies in domain generalization for 3D human pose estimation typically utilize adversarial training to generate synthetic poses for training. Nonetheless these approaches exhibit several limitations. First the lack of prior information about the target domain complicates the application of suitable augmentation through a single pose augmentor affecting generalization on target domains. Moreover adversarial training's discriminator tends to enforce similarity between source and synthesized poses impeding the exploration of out-of-source distributions. Furthermore the pose estimator's optimization is not exposed to domain shifts limiting its overall generalization ability. To address these limitations we propose a novel framework featuring two pose augmentors: the weak and the strong augmentors. Our framework employs differential strategies for generation and discrimination processes facilitating the preservation of knowledge related to source poses and the exploration of out-of-source distributions without prior information about target poses. Besides we leverage meta-optimization to simulate domain shifts in the optimization process of the pose estimator thereby improving its generalization ability. Our proposed approach significantly outperforms existing methods as demonstrated through comprehensive experiments on various benchmark datasets. + + + + ACT-Diffusion: Efficient Adversarial Consistency Training for One-step Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Kong_ACT-Diffusion_Efficient_Adversarial_Consistency_Training_for_One-step_Diffusion_Models_CVPR_2024_paper.pdf + Though diffusion models excel in image generation their step-by-step denoising leads to slow generation speeds. 
Consistency training addresses this issue with single-step sampling but often produces lower-quality generations and incurs high training costs. In this paper we show that optimizing consistency training loss minimizes the Wasserstein distance between target and generated distributions. As the timestep increases the upper bound accumulates previous consistency training losses. Therefore larger batch sizes are needed to reduce both current and accumulated losses. We propose Adversarial Consistency Training (ACT) which directly minimizes the Jensen-Shannon (JS) divergence between distributions at each timestep using a discriminator. Theoretically ACT enhances generation quality and convergence. By incorporating a discriminator into the consistency training framework our method achieves improved FID scores on CIFAR10 and ImageNet 64x64 and LSUN Cat 256x256 datasets retains zero-shot image inpainting capabilities and uses less than 1/6 of the original batch size and fewer than 1/2 of the model parameters and training steps compared to the baseline method. This leads to a substantial reduction in resource consumption. Our code is available: https://github.com/kong13661/ACT + + + + Spectral Meets Spatial: Harmonising 3D Shape Matching and Interpolation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Spectral_Meets_Spatial_Harmonising_3D_Shape_Matching_and_Interpolation_CVPR_2024_paper.pdf + Although 3D shape matching and interpolation are highly interrelated they are often studied separately and applied sequentially to relate different 3D shapes thus resulting in sub-optimal performance. In this work we present a unified framework to predict both point-wise correspondences and shape interpolation between 3D shapes. To this end we combine the deep functional map framework with classical surface deformation models to map shapes in both spectral and spatial domains. On the one hand by incorporating spatial maps our method obtains more accurate and smooth point-wise correspondences compared to previous functional map methods for shape matching. On the other hand by introducing spectral maps our method gets rid of commonly used but computationally expensive geodesic distance constraints that are only valid for near-isometric shape deformations. Furthermore we propose a novel test-time adaptation scheme to capture both pose-dominant and shape-dominant deformations. Using different challenging datasets we demonstrate that our method outperforms previous state-of-the-art methods for both shape matching and interpolation even compared to supervised approaches. + + + + Emu Edit: Precise Image Editing via Recognition and Generation Tasks + http://openaccess.thecvf.com//content/CVPR2024/papers/Sheynin_Emu_Edit_Precise_Image_Editing_via_Recognition_and_Generation_Tasks_CVPR_2024_paper.pdf + Instruction-based image editing holds immense potential for a variety of applications as it enables users to perform any editing operation using a natural language instruction. However current models in this domain often struggle with accurately executing user instructions. We present Emu Edit a multi-task image editing model which sets state-of-the-art results in instruction-based image editing. To develop Emu Edit we train it to multi-task across an unprecedented range of tasks such as region-based editing free-form editing and Computer Vision tasks all of which are formulated as generative tasks. 
Additionally to enhance Emu Edit's multi-task learning abilities we provide it with learned task embeddings which guide the generation process towards the correct edit type. Both these elements are essential for Emu Edit's outstanding performance. Furthermore we show that Emu Edit can generalize to new tasks such as image inpainting super-resolution and compositions of editing tasks with just a few labeled examples. This capability offers a significant advantage in scenarios where high-quality samples are scarce. Lastly to facilitate a more rigorous and informed assessment of instructable image editing models we release a new challenging and versatile benchmark that includes seven different image editing tasks. + + + + Face2Diffusion for Fast and Editable Face Personalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Shiohara_Face2Diffusion_for_Fast_and_Editable_Face_Personalization_CVPR_2024_paper.pdf + Face personalization aims to insert specific faces taken from images into pretrained text-to-image diffusion models. However it is still challenging for previous methods to preserve both the identity similarity and editability due to overfitting to training samples. In this paper we propose Face2Diffusion (F2D) for high-editability face personalization. The core idea behind F2D is that removing identity-irrelevant information from the training pipeline prevents the overfitting problem and improves editability of encoded faces. F2D consists of the following three novel components: 1) Multi-scale identity encoder provides well-disentangled identity features while keeping the benefits of multi-scale information which improves the diversity of camera poses. 2) Expression guidance disentangles face expressions from identities and improves the controllability of face expressions. 3) Class-guided denoising regularization encourages models to learn how faces should be denoised which boosts the text-alignment of backgrounds. Extensive experiments on the FaceForensics++ dataset and diverse prompts demonstrate our method greatly improves the trade-off between the identity- and text-fidelity compared to previous state-of-the-art methods. Code is available at https://github.com/mapooon/Face2Diffusion. + + + + Dancing with Still Images: Video Distillation via Static-Dynamic Disentanglement + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Dancing_with_Still_Images_Video_Distillation_via_Static-Dynamic_Disentanglement_CVPR_2024_paper.pdf + Recently dataset distillation has paved the way towards efficient machine learning especially for image datasets. However the distillation for videos characterized by an exclusive temporal dimension remains an underexplored domain. In this work we provide the first systematic study of video distillation and introduce a taxonomy to categorize temporal compression. Our investigation reveals that the temporal information is usually not well learned during distillation and the temporal dimension of synthetic data contributes little. The observations motivate our unified framework of disentangling the dynamic and static information in the videos. It first distills the videos into still images as static memory and then compensates the dynamic and motion information with a learnable dynamic memory block. Our method achieves state-of-the-art on video datasets at different scales with notably smaller memory storage budget. Our code is available at https://github.com/yuz1wan/video_distillation. 
+ + + + UniRepLKNet: A Universal Perception Large-Kernel ConvNet for Audio Video Point Cloud Time-Series and Image Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_UniRepLKNet_A_Universal_Perception_Large-Kernel_ConvNet_for_Audio_Video_Point_CVPR_2024_paper.pdf + Large-kernel convolutional neural networks (ConvNets) have recently received extensive research attention but two unresolved and critical issues demand further investigation. 1) The architectures of existing large-kernel ConvNets largely follow the design principles of conventional ConvNets or transformers while the architectural design for large-kernel ConvNets remains under-addressed. 2) As transformers have dominated multiple modalities it remains to be investigated whether ConvNets also have a strong universal perception ability in domains beyond vision. In this paper we contribute from two aspects. 1) We propose four architectural guidelines for designing large-kernel ConvNets the core of which is to exploit the essential characteristics of large kernels that distinguish them from small kernels - they can see wide without going deep. Following such guidelines our proposed large-kernel ConvNet shows leading performance in image recognition (ImageNet accuracy of 88.0% ADE20K mIoU of 55.6% and COCO box AP of 56.4%) demonstrating better performance and higher speed than the recent powerful competitors. 2) We discover large kernels are the key to unlocking the exceptional performance of ConvNets in domains where they were originally not proficient. With certain modality-related preprocessing approaches the proposed model achieves state-of-the-art performance on time-series forecasting and audio recognition tasks even without modality-specific customization to the architecture. All the code and models are publicly available on GitHub and Huggingface. + + + + SwiftBrush: One-Step Text-to-Image Diffusion Model with Variational Score Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_SwiftBrush_One-Step_Text-to-Image_Diffusion_Model_with_Variational_Score_Distillation_CVPR_2024_paper.pdf + Despite their ability to generate high-resolution and diverse images from text prompts text-to-image diffusion models often suffer from slow iterative sampling processes. Model distillation is one of the most effective directions to accelerate these models. However previous distillation methods fail to retain the generation quality while requiring a significant amount of images for training either from real data or synthetically generated by the teacher model. In response to this limitation we present a novel image-free distillation scheme named SwiftBrush. Drawing inspiration from text-to-3D synthesis in which a 3D neural radiance field that aligns with the input prompt can be obtained from a 2D text-to-image diffusion prior via a specialized loss without the use of any 3D data ground-truth our approach re-purposes that same loss for distilling a pretrained multi-step text-to-image model to a student network that can generate high-fidelity images with just a single inference step. In spite of its simplicity our model stands as one of the first one-step text-to-image generators that can produce images of comparable quality to Stable Diffusion without reliance on any training image data. Remarkably SwiftBrush achieves an FID score of 16.67 and a CLIP score of 0.29 on the COCO-30K benchmark achieving competitive results or even substantially surpassing existing state-of-the-art distillation techniques. 
+ + + + DEADiff: An Efficient Stylization Diffusion Model with Disentangled Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/Qi_DEADiff_An_Efficient_Stylization_Diffusion_Model_with_Disentangled_Representations_CVPR_2024_paper.pdf + The diffusion-based text-to-image model harbors immense potential in transferring reference style. However current encoder-based approaches significantly impair the text controllability of text-to-image models while transferring styles. In this paper we introduce DEADiff to address this issue using the following two strategies: 1) a mechanism to decouple the style and semantics of reference images. The decoupled feature representations are first extracted by Q-Formers which are instructed by different text descriptions. Then they are injected into mutually exclusive subsets of cross-attention layers for better disentanglement. 2) A non-reconstructive learning method. The Q-Formers are trained using paired images rather than the identical target in which the reference image and the ground-truth image are with the same style or semantics. We show that DEADiff attains the best visual stylization results and optimal balance between the text controllability inherent in the text-to-image model and style similarity to the reference image as demonstrated both quantitatively and qualitatively. Our project page is https://tianhao-qi.github.io/DEADiff/. + + + + Exact Fusion via Feature Distribution Matching for Few-shot Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Exact_Fusion_via_Feature_Distribution_Matching_for_Few-shot_Image_Generation_CVPR_2024_paper.pdf + Few-shot image generation as an important yet challenging visual task still suffers from the trade-off between generation quality and diversity. According to the principle of feature-matching learning existing fusion-based methods usually fuse different features by using similarity measurements or attention mechanisms which may match features inaccurately and lead to artifacts in the texture and structure of generated images. In this paper we propose an exact Fusion via Feature Distribution matching Generative Adversarial Network (F2DGAN) for few-shot image generation. The rationale behind this is that feature distribution matching is much more reliable than feature matching to explore the statistical characters in image feature space for limited real-world data. To model feature distributions from only a few examples for feature fusion we design a novel variational feature distribution matching fusion module to perform exact fusion by empirical cumulative distribution functions. Specifically we employ a variational autoencoder to transform deep image features into distributions and fuse different features exactly by applying histogram matching. Additionally we formulate two effective losses to guide the matching process for better fitting our fusion strategy. Extensive experiments compared with state-of-the-art methods on three public datasets demonstrate the superiority of F2DGAN for few-shot image generation in terms of generation quality and diversity and the effectiveness of data augmentation in downstream classification tasks. 
+ + + + CoDeF: Content Deformation Fields for Temporally Consistent Video Processing + http://openaccess.thecvf.com//content/CVPR2024/papers/Ouyang_CoDeF_Content_Deformation_Fields_for_Temporally_Consistent_Video_Processing_CVPR_2024_paper.pdf + We present the content deformation field (CoDeF) as a new type of video representation which consists of a canonical content field aggregating the static contents in the entire video and a temporal deformation field recording the transformations from the canonical image (i.e. rendered from the canonical content field) to each individual frame along the time axis. Given a target video these two fields are jointly optimized to reconstruct it through a carefully tailored rendering pipeline. We advisedly introduce some regularizations into the optimization process urging the canonical content field to inherit semantics (e.g. the object shape) from the video. With such a design CoDeF naturally supports lifting image algorithms for video processing in the sense that one can apply an image algorithm to the canonical image and effortlessly propagate the outcomes to the entire video with the aid of the temporal deformation field. We experimentally show that CoDeF is able to lift image-to-image translation to video-to-video translation and lift keypoint detection to keypoint tracking without any training. More importantly thanks to our lifting strategy that deploys the algorithms on only one image we achieve superior cross-frame consistency in processed videos compared to existing video-to-video translation approaches and even manage to track non-rigid objects like water and smog. Code will be made publicly available. + + + + QUADify: Extracting Meshes with Pixel-level Details and Materials from Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Fruhauf_QUADify_Extracting_Meshes_with_Pixel-level_Details_and_Materials_from_Images_CVPR_2024_paper.pdf + Despite exciting progress in automatic 3D reconstruction from images excessive and irregular triangular faces in the resulting meshes still constitute a significant challenge when it comes to adoption in practical artist workflows. Therefore we propose a method to extract regular quad-dominant meshes from posed images. More specifically we generate a high-quality 3D model through decomposition into an easily editable quad-dominant mesh with pixel-level details such as displacement materials and lighting. To enable end-to-end learning of shape and quad topology we QUADify a neural implicit representation using our novel differentiable re-meshing objective. Distinct from previous work our method exploits artifact-free Catmull-Clark subdivision combined with vertex displacement to extract pixel-level details linked to the base geometry. Finally we apply differentiable rendering techniques for material and lighting decomposition to optimize for image reconstruction. Our experiments show the benefits of end-to-end re-meshing and that our method yields state-of-the-art geometric accuracy while providing lightweight meshes with displacements and textures that are directly compatible with professional renderers and game engines. + + + + RecDiffusion: Rectangling for Image Stitching with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_RecDiffusion_Rectangling_for_Image_Stitching_with_Diffusion_Models_CVPR_2024_paper.pdf + Image stitching from different captures often results in non-rectangular boundaries which is often considered unappealing. 
To solve non-rectangular boundaries current solutions involve cropping which discards image content inpainting which can introduce unrelated content or warping which can distort non-linear features and introduce artifacts. To overcome these issues we introduce a novel diffusion-based learning framework RecDiffusion for image stitching rectangling. This framework combines Motion Diffusion Models (MDM) to generate motion fields effectively transitioning from the stitched image's irregular borders to a geometrically corrected intermediary. Followed by Content Diffusion Models (CDM) for image detail refinement. Notably our sampling process utilizes a weighted map to identify regions needing correction during each iteration of CDM. Our RecDiffusion ensures geometric accuracy and overall visual appeal surpassing all previous methods in both quantitative and qualitative measures when evaluated on public benchmarks. Code is released at https://github.com/lhaippp/RecDiffusion. + + + + Eclipse: Disambiguating Illumination and Materials using Unintended Shadows + http://openaccess.thecvf.com//content/CVPR2024/papers/Verbin_Eclipse_Disambiguating_Illumination_and_Materials_using_Unintended_Shadows_CVPR_2024_paper.pdf + Decomposing an object's appearance into representations of its materials and the surrounding illumination is difficult even when the object's 3D shape is known beforehand. This problem is especially challenging for diffuse objects: it is ill-conditioned because diffuse materials severely blur incoming light and it is ill-posed because diffuse materials under high-frequency lighting can be indistinguishable from shiny materials under low-frequency lighting. We show that it is possible to recover precise materials and illumination---even from diffuse objects---by exploiting unintended shadows like the ones cast onto an object by the photographer who moves around it. These shadows are a nuisance in most previous inverse rendering pipelines but here we exploit them as signals that improve conditioning and help resolve material-lighting ambiguities. We present a method based on differentiable Monte Carlo ray tracing that uses images of an object to jointly recover its spatially-varying materials the surrounding illumination environment and the shapes of the unseen light occluders who inadvertently cast shadows upon it. + + + + Balancing Act: Distribution-Guided Debiasing in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Parihar_Balancing_Act_Distribution-Guided_Debiasing_in_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion Models (DMs) have emerged as powerful generative models with unprecedented image generation capability. These models are widely used for data augmentation and creative applications. However DMs reflect the biases present in the training datasets. This is especially concerning in the context of faces where the DM prefers one demographic subgroup vs others (eg. female vs male). In this work we present a method for debiasing DMs without relying on additional reference data or model retraining. Specifically we propose Distribution Guidance which enforces the generated images to follow the prescribed attribute distribution. To realize this we build on the key insight that the latent features of denoising UNet hold rich demographic semantics and the same can be leveraged to guide debiased generation. We train Attribute Distribution Predictor (ADP) - a small mlp that maps the latent features to the distribution of attributes. 
ADP is trained with pseudo labels generated from existing attribute classifiers. The proposed Distribution Guidance with ADP enables us to do fair generation. Our method reduces bias across single/multiple attributes and outperforms the baseline by a significant margin for unconditional and text-conditional diffusion models. Further we present a downstream task of training a fair attribute classifier by augmenting the training set with our generated data. + + + + Differentiable Point-based Inverse Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Chung_Differentiable_Point-based_Inverse_Rendering_CVPR_2024_paper.pdf + We present differentiable point-based inverse rendering DPIR an analysis-by-synthesis method that processes images captured under diverse illuminations to estimate shape and spatially-varying BRDF. To this end we adopt point-based rendering eliminating the need for multiple samplings per ray typical of volumetric rendering thus significantly enhancing the speed of inverse rendering. To realize this idea we devise a hybrid point-volumetric representation for geometry and a regularized basis-BRDF representation for reflectance. The hybrid geometric representation enables fast rendering through point-based splatting while retaining the geometric details and stability inherent to SDF-based representations. The regularized basis-BRDF mitigates the ill-posedness of inverse rendering stemming from limited light-view angular samples. We also propose an efficient shadow detection method using point-based shadow map rendering. Our extensive evaluations demonstrate that DPIR outperforms prior works in terms of reconstruction accuracy computational efficiency and memory footprint. Furthermore our explicit point-based representation and rendering enables intuitive geometry and reflectance editing. + + + + A Unified and Interpretable Emotion Representation and Expression Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Paskaleva_A_Unified_and_Interpretable_Emotion_Representation_and_Expression_Generation_CVPR_2024_paper.pdf + Canonical emotions such as happy sad and fear are easy to understand and annotate. However emotions are often compound e.g. happily surprised and can be mapped to the action units (AUs) used for expressing emotions and trivially to the canonical ones. Intuitively emotions are continuous as represented by the arousal-valence (AV) model. An interpretable unification of these four modalities --namely Canonical Compound AUs and AV-- is highly desirable for a better representation and understanding of emotions. However such unification remains to be unknown in the current literature. In this work we propose an interpretable and unified emotion model referred as C2A2. We also develop a method that leverages labels of the non-unified models to annotate the novel unified one. Finally we modify the text-conditional diffusion models to understand continuous numbers which are then used to generate continuous expressions using our unified emotion model. Through quantitative and qualitative experiments we show that our generated images are rich and capture subtle expressions. Our work allows a fine-grained generation of expressions in conjunction with other textual inputs and offers a new label space for emotions at the same time. 
+ + + + Upscale-A-Video: Temporal-Consistent Diffusion Model for Real-World Video Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Upscale-A-Video_Temporal-Consistent_Diffusion_Model_for_Real-World_Video_Super-Resolution_CVPR_2024_paper.pdf + Text-based diffusion models have exhibited remarkable success in generation and editing showing great promise for enhancing visual content with their generative prior. However applying these models to video super-resolution remains challenging due to the high demands for output fidelity and temporal consistency which is complicated by the inherent randomness in diffusion models. Our study introduces Upscale-A-Video a text-guided latent diffusion framework for video upscaling. This framework ensures temporal coherence through two key mechanisms: locally it integrates temporal layers into U-Net and VAE-Decoder maintaining consistency within short sequences; globally without training a flow-guided recurrent latent propagation module is introduced to enhance overall video stability by propagating and fusing latent across the entire sequences. Thanks to the diffusion paradigm our model also offers greater flexibility by allowing text prompts to guide texture creation and adjustable noise levels to balance restoration and generation enabling a trade-off between fidelity and quality. Extensive experiments show that Upscale-A-Video surpasses existing methods in both synthetic and real-world benchmarks as well as in AI-generated videos showcasing impressive visual realism and temporal consistency. + + + + 4D-DRESS: A 4D Dataset of Real-World Human Clothing With Semantic Annotations + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_4D-DRESS_A_4D_Dataset_of_Real-World_Human_Clothing_With_Semantic_CVPR_2024_paper.pdf + The studies of human clothing for digital avatars have predominantly relied on synthetic datasets. While easy to collect synthetic data often fall short in realism and fail to capture authentic clothing dynamics. Addressing this gap we introduce 4D-DRESS the first real-world 4D dataset advancing human clothing research with its high-quality 4D textured scans and garment meshes. 4D-DRESS captures 64 outfits in 520 human motion sequences amounting to 78k textured scans. Creating a real-world clothing dataset is challenging particularly in annotating and segmenting the extensive and complex 4D human scans. To address this we develop a semi-automatic 4D human parsing pipeline. We efficiently combine a human-in-the-loop process with automation to accurately label 4D scans in diverse garments and body movements. Leveraging precise annotations and high-quality garment meshes we establish several benchmarks for clothing simulation and reconstruction. 4D-DRESS offers realistic and challenging data that complements synthetic sources paving the way for advancements in research of lifelike human clothing. Website: https://ait.ethz.ch/4d-dress + + + + Specularity Factorization for Low-Light Enhancement + http://openaccess.thecvf.com//content/CVPR2024/papers/Saini_Specularity_Factorization_for_Low-Light_Enhancement_CVPR_2024_paper.pdf + We present a new additive image factorization technique that treats images to be composed of multiple latent specular components which can be simply estimated recursively by modulating the sparsity during decomposition. Our model-driven RSFNet estimates these factors by unrolling the optimization into network layers requiring only a few scalars to be learned. 
The resultant factors are interpretable by design and can be fused for different image enhancement tasks via a network or combined directly by the user in a controllable fashion. Based on RSFNet we detail a zero-reference Low Light Enhancement (LLE) application trained without paired or unpaired supervision. Our system improves the state-of-the-art performance on standard benchmarks and achieves better generalization on multiple other datasets. We also integrate our factors with other task specific fusion networks for applications like deraining deblurring and dehazing with negligible overhead thereby highlighting the multi-domain and multi-task generalizability of our proposed RSFNet. The code and data is released for reproducibility on the project homepage. + + + + Paint3D: Paint Anything 3D with Lighting-Less Texture Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_Paint3D_Paint_Anything_3D_with_Lighting-Less_Texture_Diffusion_Models_CVPR_2024_paper.pdf + This paper presents Paint3D a novel coarse-to-fine generative framework that is capable of producing high-resolution lighting-less and diverse 2K UV texture maps for untextured 3D meshes conditioned on text or image inputs. The key challenge addressed is generating high-quality textures without embedded illumination information which allows the textures to be re-lighted or re-edited within modern graphics pipelines. To achieve this our method first leverages a pre-trained depth-aware 2D diffusion model to generate view-conditional images and perform multi-view texture fusion producing an initial coarse texture map. However as 2D models cannot fully represent 3D shapes and disable lighting effects the coarse texture map exhibits incomplete areas and illumination artifacts. To resolve this we train separate UV Inpainting and UVHD diffusion models specialized for the shape-aware refinement of incomplete areas and the removal of illumination artifacts. Through this coarse-to-fine process Paint3D can produce high-quality 2K UV textures that maintain semantic consistency while being lighting-less significantly advancing the state-of-the-art in texturing 3D objects. + + + + MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_MS-MANO_Enabling_Hand_Pose_Tracking_with_Biomechanical_Constraints_CVPR_2024_paper.pdf + This work proposes a novel learning framework for visual hand dynamics analysis that takes into account the physiological aspects of hand motion. The existing models which are simplified joint-actuated systems often produce unnatural motions. To address this we integrate a musculoskeletal system with a learnable parametric hand model MANO to create a new model MS-MANO. This model emulates the dynamics of muscles and tendons to drive the skeletal system imposing physiologically realistic constraints on the resulting torque trajectories. We further propose a simulation-in-the-loop pose refinement framework BioPR that refines the initial estimated pose through a multi-layer perceptron (MLP) network. Our evaluation of the accuracy of MS-MANO and the efficacy of the BioPR is conducted in two separate parts. The accuracy of MS-MANO is compared with MyoSuite while the efficacy of BioPR is benchmarked against two large-scale public datasets and two recent state-of-the-art methods. The results demonstrate that our approach consistently improves the baseline methods both quantitatively and qualitatively. 
+ + + + Generate Like Experts: Multi-Stage Font Generation by Incorporating Font Transfer Process into Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Fu_Generate_Like_Experts_Multi-Stage_Font_Generation_by_Incorporating_Font_Transfer_CVPR_2024_paper.pdf + Few-shot font generation (FFG) produces stylized font images with a limited number of reference samples which can significantly reduce labor costs in manual font designs. Most existing FFG methods follow the style-content disentanglement paradigm and employ the Generative Adversarial Network (GAN) to generate target fonts by combining the decoupled content and style representations. The complicated structure and detailed style are simultaneously generated in those methods which may be the sub-optimal solutions for FFG task. Inspired by most manual font design processes of expert designers in this paper we model font generation as a multi-stage generative process. Specifically as the injected noise and the data distribution in diffusion models can be well-separated into different sub-spaces we are able to incorporate the font transfer process into these models. Based on this observation we generalize diffusion methods to model font generative process by separating the reverse diffusion process into three stages with different functions: The structure construction stage first generates the structure information for the target character based on the source image and the font transfer stage subsequently transforms the source font to the target font. Finally the font refinement stage enhances the appearances and local details of the target font images. Based on the above multi-stage generative process we construct our font generation framework named MSD-Font with a dual-network approach to generate font images. The superior performance demonstrates the effectiveness of our model. The code is available at: https://github.com/fubinfb/MSD-Font . + + + + Diffuse Attend and Segment: Unsupervised Zero-Shot Segmentation using Stable Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Tian_Diffuse_Attend_and_Segment_Unsupervised_Zero-Shot_Segmentation_using_Stable_Diffusion_CVPR_2024_paper.pdf + Producing quality segmentation masks for images is a fundamental problem in computer vision. Recent research has explored large-scale supervised training to enable zero-shot transfer segmentation on virtually any image style and unsupervised training to enable segmentation without dense annotations. However constructing a model capable of segmenting anything in a zero-shot manner without any annotations is still challenging. In this paper we propose to utilize the self-attention layers in stable diffusion models to achieve this goal because the pre-trained stable diffusion model has learned inherent concepts of objects within its attention layers. Specifically we introduce a simple yet effective iterative merging process based on measuring KL divergence among attention maps to merge them into valid segmentation masks. The proposed method does not require any training or language dependency to extract quality segmentation for any images. On COCO-Stuff-27 our method surpasses the prior unsupervised zero-shot transfer SOTA method by an absolute 26% in pixel accuracy and 17% in mean IoU. 
+ + + + Implicit Discriminative Knowledge Learning for Visible-Infrared Person Re-Identification http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_Implicit_Discriminative_Knowledge_Learning_for_Visible-Infrared_Person_Re-Identification_CVPR_2024_paper.pdf Visible-Infrared Person Re-identification (VI-ReID) is a challenging cross-modal pedestrian retrieval task due to significant intra-class variations and cross-modal discrepancies among different cameras. Existing works mainly focus on embedding images of different modalities into a unified space to mine modality-shared features. They only seek distinctive information within these shared features while ignoring the identity-aware useful information that is implicit in the modality-specific features. To address this issue, we propose a novel Implicit Discriminative Knowledge Learning (IDKL) network to uncover and leverage the implicit discriminative information contained within the modality-specific features. First, we extract modality-specific and modality-shared features using a novel dual-stream network. Then, the modality-specific features undergo purification to reduce their modality style discrepancies while preserving identity-aware discriminative knowledge. Subsequently, this kind of implicit knowledge is distilled into the modality-shared feature to enhance its distinctiveness. Finally, an alignment loss is proposed to minimize modality discrepancy on the enhanced modality-shared features. Extensive experiments on multiple public datasets demonstrate the superiority of the IDKL network over state-of-the-art methods. + + + + Gradient Alignment for Cross-Domain Face Anti-Spoofing http://openaccess.thecvf.com//content/CVPR2024/papers/Le_Gradient_Alignment_for_Cross-Domain_Face_Anti-Spoofing_CVPR_2024_paper.pdf Recent advancements in domain generalization (DG) for face anti-spoofing (FAS) have garnered considerable attention. Traditional methods have focused on designing learning objectives and additional modules to isolate domain-specific features while retaining domain-invariant characteristics in their representations. However, such approaches often lack guarantees of consistent maintenance of domain-invariant features or the complete removal of domain-specific features. Furthermore, most prior works of DG for FAS do not ensure convergence to a local flat minimum, which has been shown to be advantageous for DG. In this paper, we introduce GAC-FAS, a novel learning objective that encourages the model to converge towards an optimal flat minimum without necessitating additional learning modules. Unlike conventional sharpness-aware minimizers, GAC-FAS identifies ascending points for each domain and regulates the generalization gradient updates at these points to align coherently with empirical risk minimization (ERM) gradient updates. This unique approach specifically guides the model to be robust against domain shifts. We demonstrate the efficacy of GAC-FAS through rigorous testing on challenging cross-domain FAS datasets, where it establishes state-of-the-art performance. + + + + OpticalDR: A Deep Optical Imaging Model for Privacy-Protective Depression Recognition http://openaccess.thecvf.com//content/CVPR2024/papers/Pan_OpticalDR_A_Deep_Optical_Imaging_Model_for_Privacy-Protective_Depression_Recognition_CVPR_2024_paper.pdf Depression Recognition (DR) poses a considerable challenge, especially in the context of the growing concerns surrounding privacy.
Traditional automatic DR diagnosis technology necessitates the use of facial images, which undoubtedly exposes patient identity features and poses privacy risks. In order to mitigate the potential risks associated with the inappropriate disclosure of patient facial images, we design a new imaging system to erase the identity information of captured facial images while retaining disease-relevant features. The captured images are irreversible with respect to identity recovery while preserving the essential disease-related characteristics necessary for accurate DR. More specifically, we record a de-identified facial image (erasing the identifiable features as much as possible) through a learnable lens, which is optimized in conjunction with the downstream DR task as well as a range of face-analysis-related auxiliary tasks in an end-to-end manner. These strategies form our final Optical deep Depression Recognition network (OpticalDR). Experiments on the CelebA, AVEC 2013 and AVEC 2014 datasets demonstrate that OpticalDR achieves state-of-the-art privacy protection performance with an average AUC of 0.51 on popular facial recognition models, and competitive results for DR with MAE/RMSE of 7.53/8.48 on AVEC 2013 and 7.89/8.82 on AVEC 2014, respectively. Code is available at https://github.com/divertingPan/OpticalDR. + + + + Observation-Guided Diffusion Probabilistic Models http://openaccess.thecvf.com//content/CVPR2024/papers/Kang_Observation-Guided_Diffusion_Probabilistic_Models_CVPR_2024_paper.pdf We propose a novel diffusion-based image generation method called the observation-guided diffusion probabilistic model (OGDM), which effectively addresses the tradeoff between quality control and fast sampling. Our approach reestablishes the training objective by integrating the guidance of the observation process with the Markov chain in a principled way. This is achieved by introducing an additional loss term derived from the observation, based on a conditional discriminator on noise level, which employs a Bernoulli distribution indicating whether its input lies on the (noisy) real manifold or not. This strategy allows us to optimize the more accurate negative log-likelihood induced in the inference stage, especially when the number of function evaluations is limited. The proposed training scheme is also advantageous even when incorporated only into the fine-tuning process, and it is compatible with various fast inference strategies, since our method yields better denoising networks using exactly the same inference procedure without incurring extra computational cost. We demonstrate the effectiveness of our training algorithm using diverse inference techniques on strong diffusion model baselines. Our implementation is available at https://github.com/Junoh-Kang/OGDM_edm. + + + + Spatial-Aware Regression for Keypoint Localization http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Spatial-Aware_Regression_for_Keypoint_Localization_CVPR_2024_paper.pdf Regression-based keypoint localization shows advantages of high efficiency and better robustness to quantization errors than heatmap-based methods. However, existing regression-based methods discard the spatial location prior in the input image with a global pooling, leading to inferior accuracy, and are limited to single-instance localization tasks. We study regression-based keypoint localization from a new perspective by leveraging the spatial location prior.
Instead of regressing on the pooled feature, the proposed Spatial-Aware Regression (SAR) maintains the spatial location map and outputs spatial coordinates and a confidence score for each grid cell, which are optimized with a unified objective. Benefiting from the location prior, these spatial-aware outputs can be efficiently optimized, resulting in better localization performance. Moreover, incorporating the spatial prior makes SAR more general, so it can be applied to various keypoint localization tasks. We test the proposed method on 4 keypoint localization tasks, including single/multi-person 2D/3D pose estimation and whole-body pose estimation. Extensive experiments demonstrate its promising performance, e.g. consistently outperforming recent regression-based methods. + + + + EFormer: Enhanced Transformer towards Semantic-Contour Features of Foreground for Portraits Matting http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_EFormer_Enhanced_Transformer_towards_Semantic-Contour_Features_of_Foreground_for_Portraits_CVPR_2024_paper.pdf The portrait matting task aims to extract an alpha matte with complete semantics and finely detailed contours. In comparison to CNN-based approaches, transformers with a self-attention module have a better capacity to capture long-range dependencies and low-frequency semantic information of a portrait. However, recent research shows that the self-attention mechanism struggles with modeling high-frequency contour information and capturing fine contour details, which can lead to bias while predicting the portrait's contours. To deal with this issue, we propose EFormer to enhance the model's attention towards both the low-frequency semantic and high-frequency contour features. For the high-frequency contours, our research demonstrates that a cross-attention module between different resolutions can guide our model to allocate attention appropriately to these contour regions. Supported by this, we can successfully extract the high-frequency detail information around the portrait's contours, which was previously ignored by self-attention. Based on the cross-attention module, we further build a semantic and contour detector (SCD) to accurately capture both the low-frequency semantic and high-frequency contour features. We also design a contour-edge extraction branch and a semantic extraction branch to extract refined high-frequency contour features and complete low-frequency semantic information, respectively. Finally, we fuse the two kinds of features and leverage the segmentation head to generate a predicted portrait matte. Experiments on the VideoMatte240K (JPEG SD Format) and Adobe Image Matting (AIM) datasets demonstrate that EFormer outperforms previous portrait matting methods. + + + + MultiPly: Reconstruction of Multiple People from Monocular Video in the Wild http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_MultiPly_Reconstruction_of_Multiple_People_from_Monocular_Video_in_the_CVPR_2024_paper.pdf We present MultiPly, a novel framework to reconstruct multiple people in 3D from monocular in-the-wild videos. Reconstructing multiple individuals moving and interacting naturally from monocular in-the-wild videos poses a challenging task. Addressing it necessitates precise pixel-level disentanglement of individuals without any prior knowledge about the subjects. Moreover, it requires recovering intricate and complete 3D human shapes from short video sequences, intensifying the level of difficulty.
To tackle these challenges we first define a layered neural representation for the entire scene composited by individual human and background models. We learn the layered neural representation from videos via our layer-wise differentiable volume rendering. This learning process is further enhanced by our hybrid instance segmentation approach which combines the self-supervised 3D segmentation and the promptable 2D segmentation module yielding reliable instance segmentation supervision even under close human interaction. A confidence-guided optimization formulation is introduced to optimize the human poses and shape/appearance alternately. We incorporate effective objectives to refine human poses via photometric information and impose physically plausible constraints on human dynamics leading to temporally consistent 3D reconstructions with high fidelity. The evaluation of our method shows the superiority over prior art on publicly available datasets and in-the-wild videos. + + + + ConsistNet: Enforcing 3D Consistency for Multi-view Images Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_ConsistNet_Enforcing_3D_Consistency_for_Multi-view_Images_Diffusion_CVPR_2024_paper.pdf + Given a single image of a 3D object this paper proposes a novel method (named ConsistNet) that can generate multiple images of the same object as if they are captured from different viewpoints while the 3D (multi-view) consistencies among those multiple generated images are effectively exploited. Central to our method is a lightweight multi-view consistency block that enables information exchange across multiple single-view diffusion processes based on the underlying multi-view geometry principles. ConsistNet is an extension to the standard latent diffusion model and it consists of two submodules: (a) a view aggregation module that unprojects multi-view features into global 3D volumes and infers consistency and (b) a ray aggregation module that samples and aggregates 3D consistent features back to each view to enforce consistency. Our approach departs from previous methods in multi-view image generation in that it can be easily dropped in pre-trained LDMs without requiring explicit pixel correspondences or depth prediction. Experiments show that our method effectively learns 3D consistency over a frozen Zero123-XL backbone and can generate 16 surrounding views of the object within 11 seconds on a single A100 GPU. + + + + GenN2N: Generative NeRF2NeRF Translation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_GenN2N_Generative_NeRF2NeRF_Translation_CVPR_2024_paper.pdf + We present GenN2N a unified NeRF-to-NeRF translation framework for various NeRF translation tasks such as text-driven NeRF editing colorization super-resolution inpainting etc. Unlike previous methods designed for individual translation tasks with task-specific schemes GenN2N achieves all these NeRF editing tasks by employing a plug-and-play image-to-image translator to perform editing in the 2D domain and lifting 2D edits into the 3D NeRF space. Since the 3D consistency of 2D edits may not be assured we propose to model the distribution of the underlying 3D edits through a generative model that can cover all possible edited NeRFs. To model the distribution of 3D edited NeRFs from 2D edited images we carefully design a VAE-GAN that encodes images while decoding NeRFs. The latent space is trained to align with a Gaussian distribution and the NeRFs are supervised through an adversarial loss on its renderings. 
To ensure that the latent code does not depend on 2D viewpoints but truly reflects the 3D edits, we also regularize the latent code through a contrastive learning scheme. Extensive experiments on various editing tasks show that GenN2N, as a universal framework, performs as well as or better than task-specific specialists while possessing flexible generative power. More results on our project page: https://xiangyueliu.github.io/GenN2N/. + + + + Universal Robustness via Median Randomized Smoothing for Real-World Super-Resolution http://openaccess.thecvf.com//content/CVPR2024/papers/Chaouai_Universal_Robustness_via_Median_Randomized_Smoothing_for_Real-World_Super-Resolution_CVPR_2024_paper.pdf Most of the recent literature on image Super-Resolution (SR) can be classified into two main approaches. The first one involves learning a corruption model tailored to a specific dataset, aiming to mimic the noise and corruption in low-resolution images, such as sensor noise. However, this approach is data-specific, tends to lack adaptability, and its accuracy diminishes when faced with unseen types of image corruptions. A second and more recent approach, referred to as Robust Super-Resolution (RSR), proposes to improve real-world SR by harnessing the generalization capabilities of a model by making it robust to adversarial attacks. To delve further into this second approach, our paper explores the universality of various methods for enhancing the robustness of deep learning SR models. In other words, we inquire: "Which robustness method exhibits the highest degree of adaptability when dealing with a wide range of adversarial attacks?" Our extensive experimentation on both synthetic and real-world images empirically demonstrates that median randomized smoothing (MRS) is more general in terms of robustness compared to adversarial learning techniques, which tend to focus on specific types of attacks. Furthermore, as expected, we also illustrate that the proposed universal robust method enables the SR model to handle standard corruptions more effectively, such as blur and Gaussian noise, and notably corruptions naturally present in real-world images. These results support the significance of shifting the paradigm in the development of real-world SR methods towards RSR, especially via MRS. + + + + One-dimensional Adapter to Rule Them All: Concepts Diffusion Models and Erasing Applications http://openaccess.thecvf.com//content/CVPR2024/papers/Lyu_One-dimensional_Adapter_to_Rule_Them_All_Concepts_Diffusion_Models_and_CVPR_2024_paper.pdf The prevalent use of commercial and open-source diffusion models (DMs) for text-to-image generation prompts risk mitigation to prevent undesired behaviors. Existing concept erasing methods in academia are all based on full-parameter or specification-based fine-tuning, from which we observe the following issues: 1) Generation alteration towards erosion: parameter drift during target elimination causes alterations and potential deformations across all generations, even eroding other concepts to varying degrees, which is more evident when multiple concepts are erased; 2) Transfer inability & deployment inefficiency: previous model-specific erasure impedes the flexible combination of concepts and the training-free transfer towards other models, resulting in linear cost growth as the deployment scenarios increase.
To achieve non-invasive, precise, customizable and transferable elimination, we ground our erasing framework on one-dimensional adapters to erase multiple concepts from most DMs at once across versatile erasing applications. The concept-SemiPermeable structure is injected as a Membrane (SPM) into any DM to learn targeted erasing, and meanwhile the alteration and erosion phenomena are effectively mitigated via a novel Latent Anchoring fine-tuning strategy. Once obtained, SPMs can be flexibly combined and used plug-and-play in other DMs without specific re-tuning, enabling timely and efficient adaptation to diverse scenarios. During generation, our Facilitated Transport mechanism dynamically regulates the permeability of each SPM to respond to different input prompts, further minimizing the impact on other concepts. Quantitative and qualitative results across 40 concepts, 7 DMs and 4 erasing applications demonstrate the superior erasing capability of SPM. Our code and pre-tuned SPMs are available on the project page https://lyumengyao.github.io/projects/spm. + + + + Kandinsky Conformal Prediction: Efficient Calibration of Image Segmentation Algorithms http://openaccess.thecvf.com//content/CVPR2024/papers/Brunekreef_Kandinsky_Conformal_Prediction_Efficient_Calibration_of_Image_Segmentation_Algorithms_CVPR_2024_paper.pdf Image segmentation algorithms can be understood as a collection of pixel classifiers for which the outcomes of nearby pixels are correlated. Classifier models can be calibrated using Inductive Conformal Prediction, but this requires holding back a sufficiently large calibration dataset for computing the distribution of non-conformity scores of the model's predictions. If one requires only marginal calibration at the image level, this calibration set consists of all individual pixels in the images available for calibration. However, if the goal is to attain proper calibration for each individual pixel classifier, the calibration set consists of individual images. In a scenario where data are scarce (such as the medical domain), it may not always be possible to set aside sufficiently many images for this pixel-level calibration. The method we propose, dubbed "Kandinsky calibration", makes use of the spatial structure present in the distribution of natural images to simultaneously calibrate the classifiers of "similar" pixels. This can be seen as an intermediate approach between marginal (imagewise) and conditional (pixelwise) calibration, where non-conformity scores are aggregated over similar image regions, thereby making more efficient use of the images available for calibration. We run experiments on segmentation algorithms trained and calibrated on subsets of the public MS-COCO and Medical Decathlon datasets, demonstrating that the Kandinsky calibration method can significantly improve the coverage. When compared to both pixelwise and imagewise calibration on little data, the Kandinsky method achieves much lower coverage errors, indicating the data efficiency of Kandinsky calibration. + + + + Diversity-aware Channel Pruning for StyleGAN Compression http://openaccess.thecvf.com//content/CVPR2024/papers/Chung_Diversity-aware_Channel_Pruning_for_StyleGAN_Compression_CVPR_2024_paper.pdf StyleGAN has shown remarkable performance in unconditional image generation. However, its high computational cost poses a significant challenge for practical applications.
Although recent efforts have been made to compress StyleGAN while preserving its performance existing compressed models still lag behind the original model particularly in terms of sample diversity. To overcome this we propose a novel channel pruning method that leverages varying sensitivities of channels to latent vectors which is a key factor in sample diversity. Specifically by assessing channel importance based on their sensitivities to latent vector perturbations our method enhances the diversity of samples in the compressed model. Since our method solely focuses on the channel pruning stage it has complementary benefits with prior training schemes without additional training cost. Extensive experiments demonstrate that our method significantly enhances sample diversity across various datasets. Moreover in terms of FID scores our method not only surpasses state-of-the-art by a large margin but also achieves comparable scores with only half training iterations. + + + + Neural Clustering based Visual Representation Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Neural_Clustering_based_Visual_Representation_Learning_CVPR_2024_paper.pdf + We investigate a fundamental aspect of machine vision: the measurement of features by revisiting clustering one of the most classic approaches in machine learning and data analysis. Existing visual feature extractors including ConvNets ViTs and MLPs represent an image as rectangular regions. Though prevalent such a grid-style paradigm is built upon engineering practice and lacks explicit modeling of data distribution. In this work we propose feature extraction with clustering (FEC) a conceptually elegant yet surprisingly ad-hoc interpretable neural clustering framework which views feature extraction as a process of selecting representatives from data and thus automatically captures the underlying data distribution. Given an image FEC alternates between grouping pixels into individual clusters to abstract representatives and updating the deep features of pixels with current representatives. Such an iterative working mechanism is implemented in the form of several neural layers and the final representatives can be used for downstream tasks. The cluster assignments across layers which can be viewed and inspected by humans make the forward process of FEC fully transparent and empower it with promising ad-hoc interpretability. Extensive experiments on various visual recognition models and tasks verify the effectiveness generality and interpretability of FEC. We expect this work will provoke a rethink of the current de facto grid-style paradigm. + + + + Sparse Semi-DETR: Sparse Learnable Queries for Semi-Supervised Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Shehzadi_Sparse_Semi-DETR_Sparse_Learnable_Queries_for_Semi-Supervised_Object_Detection_CVPR_2024_paper.pdf + In this paper we address the limitations of the DETR-based semi-supervised object detection (SSOD) framework particularly focusing on the challenges posed by the quality of object queries. In DETR-based SSOD the one-to-one assignment strategy provides inaccurate pseudo-labels while the one-to-many assignments strategy leads to overlapping predictions. These issues compromise training efficiency and degrade model performance especially in detecting small or occluded objects. We introduce Sparse Semi-DETR a novel transformer-based end-to-end semi-supervised object detection solution to overcome these challenges. 
Sparse Semi-DETR incorporates a Query Refinement Module to enhance the quality of object queries significantly improving detection capabilities for small and partially obscured objects. Additionally we integrate a Reliable Pseudo-Label Filtering Module that selectively filters high-quality pseudo-labels thereby enhancing detection accuracy and consistency. On the MS-COCO and Pascal VOC object detection benchmarks Sparse Semi-DETR achieves a significant improvement over current state-of-the-art methods that highlight Sparse Semi-DETR's effectiveness in semi-supervised object detection particularly in challenging scenarios involving small or partially obscured objects. + + + + Uncertainty-Aware Source-Free Adaptive Image Super-Resolution with Wavelet Augmentation Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Ai_Uncertainty-Aware_Source-Free_Adaptive_Image_Super-Resolution_with_Wavelet_Augmentation_Transformer_CVPR_2024_paper.pdf + Unsupervised Domain Adaptation (UDA) can effectively address domain gap issues in real-world image Super-Resolution (SR) by accessing both the source and target data. Considering privacy policies or transmission restrictions of source data in practical scenarios we propose a SOurce-free Domain Adaptation framework for image SR (SODA-SR) to address this issue i.e. adapt a source-trained model to a target domain with only unlabeled target data. SODA-SR leverages the source-trained model to generate refined pseudo-labels for teacher-student learning. To better utilize pseudo-labels we propose a novel wavelet-based augmentation method named Wavelet Augmentation Transformer (WAT) which can be flexibly incorporated with existing networks to implicitly produce useful augmented data. WAT learns low-frequency information of varying levels across diverse samples which is aggregated efficiently via deformable attention. Furthermore an uncertainty-aware self-training mechanism is proposed to improve the accuracy of pseudo-labels with inaccurate predictions being rectified by uncertainty estimation. To acquire better SR results and avoid overfitting pseudo-labels several regularization losses are proposed to constrain target LR and SR images in the frequency domain. Experiments show that without accessing source data SODA-SR outperforms state-of-the-art UDA methods in both synthetic->real and real->real adaptation settings and is not constrained by specific network architectures. + + + + Spacetime Gaussian Feature Splatting for Real-Time Dynamic View Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Spacetime_Gaussian_Feature_Splatting_for_Real-Time_Dynamic_View_Synthesis_CVPR_2024_paper.pdf + Novel view synthesis of dynamic scenes has been an intriguing yet challenging problem. Despite recent advancements simultaneously achieving high-resolution photorealistic results real-time rendering and compact storage remains a formidable task. To address these challenges we propose Spacetime Gaussian Feature Splatting as a novel dynamic scene representation composed of three pivotal components. First we formulate expressive Spacetime Gaussians by enhancing 3D Gaussians with temporal opacity and parametric motion/rotation. This enables Spacetime Gaussians to capture static dynamic as well as transient content within a scene. Second we introduce splatted feature rendering which replaces spherical harmonics with neural features. These features facilitate the modeling of view- and time-dependent appearance while maintaining small size. 
Third we leverage the guidance of training error and coarse depth to sample new Gaussians in areas that are challenging to converge with existing pipelines. Experiments on several established real-world datasets demonstrate that our method achieves state-of-the-art rendering quality and speed while retaining compact storage. At 8K resolution our lite-version model can render at 60 FPS on an Nvidia RTX 4090 GPU. + + + + Instruct-Imagen: Image Generation with Multi-modal Instruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_Instruct-Imagen_Image_Generation_with_Multi-modal_Instruction_CVPR_2024_paper.pdf + This paper presents Instruct-Imagen a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g. text edge style subject etc.) such that abundant generation intents can be standardized in a uniform format. We then build Instruct-Imagen by fine-tuning a pre-trained text-to-image diffusion model with two stages. First we adapt the model using the retrieval-augmented training to enhance model's capabilities to ground its generation on external multi-modal context. Subsequently we fine-tune the adapted model on diverse image generation tasks that requires vision-language understanding (e.g. subject-driven generation etc.) each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that Instruct-Imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. Our evaluation suite will be made publicly available. + + + + Rethinking Few-shot 3D Point Cloud Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/An_Rethinking_Few-shot_3D_Point_Cloud_Semantic_Segmentation_CVPR_2024_paper.pdf + This paper revisits few-shot 3D point cloud semantic segmentation (FS-PCS) with a focus on two significant issues in the state-of-the-art: foreground leakage and sparse point distribution. The former arises from non-uniform point sampling allowing models to distinguish the density disparities between foreground and background for easier segmentation. The latter results from sampling only 2048 points limiting semantic information and deviating from the real-world practice. To address these issues we introduce a standardized FS-PCS setting upon which a new benchmark is built. Moreover we propose a novel FS-PCS model. While previous methods are based on feature optimization by mainly refining support features to enhance prototypes our method is based on correlation optimization referred to as Correlation Optimization Segmentation (COSeg). Specifically we compute Class-specific Multi-prototypical Correlation (CMC) for each query point representing its correlations to category prototypes. Then we propose the Hyper Correlation Augmentation (HCA) module to enhance CMC. Furthermore tackling the inherent property of few-shot training to incur base susceptibility for models we propose to learn non-parametric prototypes for the base classes during training. The learned base prototypes are used to calibrate correlations for the background class through a Base Prototypes Calibration (BPC) module. Experiments on popular datasets demonstrate the superiority of COSeg over existing methods. 
The code is available at github.com/ZhaochongAn/COSeg. + + + + GreedyViG: Dynamic Axial Graph Construction for Efficient Vision GNNs + http://openaccess.thecvf.com//content/CVPR2024/papers/Munir_GreedyViG_Dynamic_Axial_Graph_Construction_for_Efficient_Vision_GNNs_CVPR_2024_paper.pdf + Vision graph neural networks (ViG) offer a new avenue for exploration in computer vision. A major bottleneck in ViGs is the inefficient k-nearest neighbor (KNN) operation used for graph construction. To solve this issue we propose a new method for designing ViGs Dynamic Axial Graph Construction (DAGC) which is more efficient than KNN as it limits the number of considered graph connections made within an image. Additionally we propose a novel CNN-GNN architecture GreedyViG which uses DAGC. Extensive experiments show that GreedyViG beats existing ViG CNN and ViT architectures in terms of accuracy GMACs and parameters on image classification object detection instance segmentation and semantic segmentation tasks. Our smallest model GreedyViG-S achieves 81.1% top-1 accuracy on ImageNet-1K 2.9% higher than Vision GNN and 2.2% higher than Vision HyperGraph Neural Network (ViHGNN) with less GMACs and a similar number of parameters. Our largest model GreedyViG-B obtains 83.9% top-1 accuracy 0.2% higher than Vision GNN with a 66.6% decrease in parameters and a 69% decrease in GMACs. GreedyViG-B also obtains the same accuracy as ViHGNN with a 67.3% decrease in parameters and a 71.3% decrease in GMACs. Our work shows that hybrid CNN-GNN architectures not only provide a new avenue for designing efficient models but that they can also exceed the performance of current state-of-the-art models. + + + + Relightable and Animatable Neural Avatar from Sparse-View Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Relightable_and_Animatable_Neural_Avatar_from_Sparse-View_Video_CVPR_2024_paper.pdf + This paper tackles the problem of creating relightable and animatable neural avatars from sparse-view (or monocular) videos of dynamic humans under unknown illumination. Previous neural human reconstruction methods produce animatable avatars from sparse views using deformed Signed Distance Fields (SDF) but are non-relightable. While differentiable inverse rendering methods have succeeded in the material recovery of static objects it is not straightforward to extend them to dynamic humans since it is computationally intensive to compute pixel-surface intersection and light visibility on deformed SDFs for relighting. To solve this challenge we propose a Hierarchical Distance Query (HDQ) algorithm to approximate the world space SDF under arbitrary human poses. Specifically we estimate coarse SDF based on a parametric human model and compute fine SDF by exploiting the invariance of SDF w.r.t. local deformation. Based on HDQ we leverage sphere tracing to efficiently estimate the surface intersection and light visibility. This allows us to develop the first system to recover relightable and animatable neural avatars from sparse or monocular inputs. Experiments show that our approach produces superior results compared to state-of-the-art methods. Our project page is available at https://zju3dv.github.io/relightable_avatar. + + + + Pose Adapted Shape Learning for Large-Pose Face Reenactment + http://openaccess.thecvf.com//content/CVPR2024/papers/Hsu_Pose_Adapted_Shape_Learning_for_Large-Pose_Face_Reenactment_CVPR_2024_paper.pdf + We propose the Pose Adapted Shape Learning (PASL) for large-pose face reenactment. 
The PASL framework consists of three modules, namely the Pose-Adapted face Encoder (PAE), the Cycle-consistent Shape Generator (CSG) and the Attention-Embedded Generator (AEG). Different from previous approaches that use a single face encoder for identity preservation, we propose multiple Pose-Adapted face Encoders (PAEs) to better preserve facial identity across large poses. Given a source face and a reference face, the CSG generates a recomposed shape that fuses the source identity and reference action in the shape space and meets the cycle consistency requirement. Taking the shape code and the source as inputs, the AEG learns the attention within the shape code and between the shape code and source style to enhance the generation of the desired target face. As existing benchmark datasets are inappropriate for evaluating large-pose face reenactment, we propose a scheme to compose large-pose face pairs and introduce the MPIE-LP (Large Pose) and VoxCeleb2-LP datasets as the new large-pose benchmarks. We compare our approach with state-of-the-art methods on MPIE-LP and VoxCeleb2-LP for large-pose performance and on VoxCeleb1 for the common scope of pose variation. + + + + NRDF: Neural Riemannian Distance Fields for Learning Articulated Pose Priors http://openaccess.thecvf.com//content/CVPR2024/papers/He_NRDF_Neural_Riemannian_Distance_Fields_for_Learning_Articulated_Pose_Priors_CVPR_2024_paper.pdf Faithfully modeling the space of articulations is a crucial task that allows recovery and generation of realistic poses and remains a notorious challenge. To this end, we introduce Neural Riemannian Distance Fields (NRDFs), data-driven priors modeling the space of plausible articulations, represented as the zero-level-set of a neural field in a high-dimensional product-quaternion space. To train NRDFs only on positive examples, we introduce a new sampling algorithm ensuring that the geodesic distances follow a desired distribution, yielding a principled distance field learning paradigm. We then devise a projection algorithm to map any random pose onto the level-set by an adaptive-step Riemannian optimizer, adhering to the product manifold of joint rotations at all times. NRDFs can compute the Riemannian gradient via backpropagation and, by mathematical analogy, are related to Riemannian flow matching, a recent generative model. We conduct a comprehensive evaluation of NRDF against other pose priors in various downstream tasks, i.e. pose generation, image-based pose estimation and solving inverse kinematics, highlighting NRDF's superior performance. Besides humans, NRDF's versatility extends to hand and animal poses, as it can effectively represent any articulation. + + + + RepAn: Enhanced Annealing through Re-parameterization http://openaccess.thecvf.com//content/CVPR2024/papers/Fei_RepAn_Enhanced_Annealing_through_Re-parameterization_CVPR_2024_paper.pdf The simulated annealing algorithm aims to improve model convergence through multiple restarts of training. However, existing annealing algorithms overlook the correlation between different cycles, neglecting the potential for incremental learning. We contend that a fixed network structure prevents the model from recognizing distinct features at different training stages. To this end, we propose RepAn, redesigning the irreversible re-parameterization (Rep) method and integrating it with annealing to enhance training.
Specifically the network goes through Rep expansion restoration and backpropagation operations during training and iterating through these processes in each annealing round. Such a method exhibits good generalization and is easy to apply and we provide theoretical explanations for its effectiveness. Experiments demonstrate that our method improves baseline performance by 6.38% on the CIFAR-100 dataset and 2.80% on ImageNet achieving state-of-the-art performance in the Rep field. The code is available at https://github.com/xfey/RepAn. + + + + DreamControl: Control-Based Text-to-3D Generation with 3D Self-Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_DreamControl_Control-Based_Text-to-3D_Generation_with_3D_Self-Prior_CVPR_2024_paper.pdf + 3D generation has raised great attention in recent years. With the success of text-to-image diffusion models the 2D-lifting technique becomes a promising route to controllable 3D generation. However these methods tend to present inconsistent geometry which is also known as the Janus problem. We observe that the problem is caused mainly by two aspects i.e. viewpoint bias in 2D diffusion models and overfitting of the optimization objective. To address it we propose a two-stage 2D-lifting framework namely DreamControl which optimizes coarse NeRF scenes as 3D self-prior and then generates fine-grained objects with control-based score distillation. Specifically adaptive viewpoint sampling and boundary integrity metric are proposed to ensure the consistency of generated priors. The priors are then regarded as input conditions to maintain reasonable geometries in which conditional LoRA and weighted score are further proposed to optimize detailed textures. DreamControl can generate high-quality 3D content in terms of both geometry consistency and texture fidelity. Moreover our control-based optimization guidance is applicable to more downstream tasks including user-guided generation and 3D animation. + + + + ODIN: A Single Model for 2D and 3D Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Jain_ODIN_A_Single_Model_for_2D_and_3D_Segmentation_CVPR_2024_paper.pdf + State-of-the-art models on contemporary 3D segmentation benchmarks like ScanNet consume and label dataset-provided 3D point clouds obtained through post processing of sensed multiview RGB-D images. They are typically trained in-domain forego large-scale 2D pre-training and outperform alternatives that featurize the posed RGB-D multiview images instead. The gap in performance between methods that consume posed images versus post-processed 3D point clouds has fueled the belief that 2D and 3D perception require distinct model architectures. In this paper we challenge this view and propose ODIN (Omni-Dimensional INstance segmentation) a model that can segment and label both 2D RGB images and 3D point clouds using a transformer architecture that alternates between 2D within-view and 3D cross-view information fusion. Our model differentiates 2D and 3D feature operations through the positional encodings of the tokens involved which capture pixel coordinates for 2D patch tokens and 3D coordinates for 3D feature tokens. ODIN achieves state-of-the-art performance on ScanNet200 Matterport3D and AI2THOR 3D instance segmentation benchmarks and competitive performance on ScanNet S3DIS and COCO. It outperforms all previous works by a wide margin when the sensed 3D point cloud is used in place of the point cloud sampled from 3D mesh. 
When used as the 3D perception engine in an instructable embodied agent architecture it sets a new state-of-the-art on the TEACh action-from-dialogue benchmark. Our code and checkpoints can be found at the project website: https://odin-seg.github.io. + + + + InitNO: Boosting Text-to-Image Diffusion Models via Initial Noise Optimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_InitNO_Boosting_Text-to-Image_Diffusion_Models_via_Initial_Noise_Optimization_CVPR_2024_paper.pdf + Recent strides in the development of diffusion models exemplified by advancements such as Stable Diffusion have underscored their remarkable prowess in generating visually compelling images. However the imperative of achieving a seamless alignment between the generated image and the provided prompt persists as a formidable challenge. This paper traces the root of these difficulties to invalid initial noise and proposes a solution in the form of Initial Noise Optimization (InitNO) a paradigm that refines this noise. Considering text prompts not all random noises are effective in synthesizing semantically-faithful images. We design the cross-attention response score and the self-attention conflict score to evaluate the initial noise bifurcating the initial latent space into valid and invalid sectors. A strategically crafted noise optimization pipeline is developed to guide the initial noise towards valid regions. Our method validated through rigorous experimentation shows a commendable proficiency in generating images in strict accordance with text prompts. Our code is available at https://github.com/xiefan-guo/initno. + + + + Multimodal Sense-Informed Forecasting of 3D Human Motions + http://openaccess.thecvf.com//content/CVPR2024/papers/Lou_Multimodal_Sense-Informed_Forecasting_of_3D_Human_Motions_CVPR_2024_paper.pdf + Predicting future human pose is a fundamental application for machine intelligence which drives robots to plan their behavior and paths ahead of time to seamlessly accomplish human-robot collaboration in real-world 3D scenarios. Despite encouraging results existing approaches rarely consider the effects of the external scene on the motion sequence leading to pronounced artifacts and physical implausibilities in the predictions. To address this limitation this work introduces a novel multi-modal sense-informed motion prediction approach which conditions high-fidelity generation on two modal information: external 3D scene and internal human gaze and is able to recognize their salience for future human activity. Furthermore the gaze information is regarded as the human intention and combined with both motion and scene features we construct a ternary intention-aware attention to supervise the generation to match where the human wants to reach. Meanwhile we introduce semantic coherence-aware attention to explicitly distinguish the salient point clouds and the underlying ones to ensure a reasonable interaction of the generated sequence with the 3D scene. On two real-world benchmarks the proposed method achieves state-of-the-art performance both in 3D human pose and trajectory prediction. More detailed results are available on the page: https://sites.google.com/view/cvpr2024sif3d. 
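The InitNO entry above frames the initial latent noise as a variable to be optimized: attention-derived scores decide whether a noise sample lies in the valid region, and gradient steps move it there before ordinary sampling begins. The sketch below shows only that outer optimization loop; toy_score is a hypothetical stand-in for the cross-attention response and self-attention conflict scores, which require access to a diffusion model's attention maps.

```python
# Illustrative sketch of gradient-based refinement of an initial latent noise.
# The real scores come from a diffusion model's attention maps; toy_score is a stand-in.
import torch

def optimize_initial_noise(score_fn, shape=(1, 4, 64, 64), steps=50, lr=1e-2):
    latent = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = score_fn(latent)   # lower score = "more valid" initial noise
        loss.backward()
        opt.step()
    return latent.detach()

def toy_score(latent):
    # hypothetical stand-in: keep the latent statistics close to a standard Gaussian
    return latent.mean().abs() + (latent.std() - 1.0).abs()

noise = optimize_initial_noise(toy_score)
print(noise.shape)  # torch.Size([1, 4, 64, 64])
```

In practice the optimization would run in the latent space of the text-to-image diffusion model, and the optimized noise would then seed the usual denoising loop.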
+ + + + FlowerFormer: Empowering Neural Architecture Encoding using a Flow-aware Graph Transformer http://openaccess.thecvf.com//content/CVPR2024/papers/Hwang_FlowerFormer_Empowering_Neural_Architecture_Encoding_using_a_Flow-aware_Graph_Transformer_CVPR_2024_paper.pdf The success of a specific neural network architecture is closely tied to the dataset and task it tackles; there is no one-size-fits-all solution. Thus, considerable efforts have been made to quickly and accurately estimate the performances of neural architectures without full training or evaluation for given tasks and datasets. Neural architecture encoding has played a crucial role in this estimation, and graph-based methods, which treat an architecture as a graph, have shown prominent performance. For enhanced representation learning of neural architectures, we introduce FlowerFormer, a powerful graph transformer that incorporates the information flows within a neural architecture. FlowerFormer consists of two key components: (a) bidirectional asynchronous message passing inspired by the flows; (b) global attention built on flow-based masking. Our extensive experiments demonstrate the superiority of FlowerFormer over existing neural encoding methods, and its effectiveness extends beyond computer vision models to include graph neural networks and automatic speech recognition models. Our code is available at http://github.com/y0ngjaenius/CVPR2024_FLOWERFormer. + + + + EmoGen: Emotional Image Content Generation with Text-to-Image Diffusion Models http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_EmoGen_Emotional_Image_Content_Generation_with_Text-to-Image_Diffusion_Models_CVPR_2024_paper.pdf Recent years have witnessed remarkable progress in the image generation task, where users can create visually astonishing images with high quality. However, existing text-to-image diffusion models are proficient in generating concrete concepts (dogs) but encounter challenges with more abstract ones (emotions). Several efforts have been made to modify image emotions with color and style adjustments, facing limitations in effectively conveying emotions with fixed image contents. In this work, we introduce Emotional Image Content Generation (EICG), a new task to generate semantically clear and emotion-faithful images given emotion categories. Specifically, we propose an emotion space and construct a mapping network to align it with the powerful Contrastive Language-Image Pre-training (CLIP) space, providing a concrete interpretation of abstract emotions. An attribute loss and emotion confidence are further proposed to ensure the semantic diversity and emotion fidelity of the generated images. Our method outperforms state-of-the-art text-to-image approaches both quantitatively and qualitatively, where we derive three custom metrics, i.e. emotion accuracy, semantic clarity and semantic diversity. In addition to generation, our method can help emotion understanding and inspire emotional art design. Project page: https://vcc.tech/research/2024/EmoGen. + + + + Neural Implicit Representation for Building Digital Twins of Unknown Articulated Objects http://openaccess.thecvf.com//content/CVPR2024/papers/Weng_Neural_Implicit_Representation_for_Building_Digital_Twins_of_Unknown_Articulated_CVPR_2024_paper.pdf We address the problem of building digital twins of unknown articulated objects from two RGBD scans of the object at different articulation states. We decompose the problem into two stages, each addressing distinct aspects.
Our method first reconstructs object-level shape at each state then recovers the underlying articulation model including part segmentation and joint articulations that associate the two states. By explicitly modeling point-level correspondences and exploiting cues from images 3D reconstructions and kinematics our method yields more accurate and stable results compared to prior work. It also handles more than one movable part and does not rely on any object shape or structure priors. Project page: https://github.com/NVlabs/DigitalTwinArt + + + + Vanishing-Point-Guided Video Semantic Segmentation of Driving Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_Vanishing-Point-Guided_Video_Semantic_Segmentation_of_Driving_Scenes_CVPR_2024_paper.pdf + The estimation of implicit cross-frame correspondences and the high computational cost have long been major challenges in video semantic segmentation (VSS) for driving scenes. Prior works utilize keyframes feature propagation or cross-frame attention to address these issues. By contrast we are the first to harness vanishing point (VP) priors for more effective segmentation. Intuitively objects near VPs (i.e. away from the vehicle) are less discernible. Moreover they tend to move radially away from the VP over time in the usual case of a forward-facing camera a straight road and linear forward motion of the vehicle. Our novel efficient network for VSS named VPSeg incorporates two modules that utilize exactly this pair of static and dynamic VP priors: sparse-to-dense feature mining (DenseVP) and VP-guided motion fusion (MotionVP). MotionVP employs VP-guided motion estimation to establish explicit correspondences across frames and help attend to the most relevant features from neighboring frames while DenseVP enhances weak dynamic features in distant regions around VPs. These modules operate within a context-detail framework which separates contextual features from high-resolution local features at different input resolutions to reduce computational costs. Contextual and local features are integrated through contextualized motion attention (CMA) for the final prediction. Extensive experiments on two popular driving segmentation benchmarks Cityscapes and ACDC demonstrate that VPSeg outperforms previous SOTA methods with only modest computational overhead. + + + + LAMP: Learn A Motion Pattern for Few-Shot Video Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_LAMP_Learn_A_Motion_Pattern_for_Few-Shot_Video_Generation_CVPR_2024_paper.pdf + In this paper we present a few-shot text-to-video framework LAMP which enables a text-to-image diffusion model to Learn A specific Motion Pattern with 8 16 videos on a single GPU. Unlike existing methods which require a large number of training resources or learn motions that are precisely aligned with template videos it achieves a trade-off between the degree of generation freedom and the resource costs for model training. Specifically we design a motion-content decoupled pipeline that uses an off-the-shelf text-to-image model for content generation so that our tuned video diffusion model mainly focuses on motion learning. The well-developed text-to-image techniques can provide visually pleasing and diverse content as generation conditions which highly improves video quality and generation freedom. 
To capture the features of the temporal dimension, we expand the pre-trained 2D convolution layers of the T2I model to our novel temporal-spatial motion learning layers and modify the attention blocks to the temporal level. Additionally, we develop an effective inference trick, shared-noise sampling, which can improve the stability of videos without extra computational cost. Our method can also be flexibly applied to other tasks, e.g. real-world image animation and video editing. Extensive experiments demonstrate that LAMP can effectively learn the motion pattern on limited data and generate high-quality videos. The code and models are available at https://rq-wu.github.io/projects/LAMP. + + + + Language-driven Object Fusion into Neural Radiance Fields with Pose-Conditioned Dataset Updates http://openaccess.thecvf.com//content/CVPR2024/papers/Shum_Language-driven_Object_Fusion_into_Neural_Radiance_Fields_with_Pose-Conditioned_Dataset_CVPR_2024_paper.pdf Neural radiance field (NeRF) is an emerging technique for 3D scene reconstruction and modeling. However, current NeRF-based methods are limited in their capabilities of adding or removing objects. This paper fills the aforementioned gap by proposing a new language-driven method for object manipulation in NeRFs through dataset updates. Specifically, to insert an object represented by a set of multi-view images into a background NeRF, we use a text-to-image diffusion model to blend the object into the given background across views. The generated images are then used to update the NeRF so that we can render view-consistent images of the object within the background. To ensure view consistency, we propose a dataset update strategy that prioritizes the radiance field training based on camera poses in a pose-ordered manner. We validate our method in two case studies: object insertion and object removal. Experimental results show that our method can generate photo-realistic results and achieves state-of-the-art performance in NeRF editing. + + + + DREAM: Diffusion Rectification and Estimation-Adaptive Models http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_DREAM_Diffusion_Rectification_and_Estimation-Adaptive_Models_CVPR_2024_paper.pdf We present DREAM, a novel training framework representing Diffusion Rectification and Estimation-Adaptive Models, requiring minimal code changes (just three lines) yet significantly enhancing the alignment of training with sampling in diffusion models. DREAM features two components: diffusion rectification, which adjusts training to reflect the sampling process, and estimation adaptation, which balances perception against distortion. When applied to image super-resolution (SR), DREAM adeptly navigates the tradeoff between minimizing distortion and preserving high image quality. Experiments demonstrate DREAM's superiority over standard diffusion-based SR methods, showing faster training convergence and a reduction in the number of sampling steps needed to achieve comparable or superior results. We hope DREAM will inspire a rethinking of diffusion model training paradigms. + + + + Seeing the World through Your Eyes http://openaccess.thecvf.com//content/CVPR2024/papers/Alzayer_Seeing_the_World_through_Your_Eyes_CVPR_2024_paper.pdf The reflective nature of the human eye is an under-appreciated source of information about what the world around us looks like. By imaging the eyes of a moving person, we capture multiple views of a scene outside the camera's direct line of sight through the reflections in the eyes.
In this paper we reconstruct a radiance field beyond the camera's line of sight using portrait images containing eye reflections. This task is challenging due to 1) the difficulty of accurately estimating eye poses and 2) the entangled appearance of the iris textures and the scene reflections. To address these our method jointly optimizes the cornea poses the radiance field depicting the scene and the observer's eye iris texture. We further present a regularization prior on the iris texture to improve scene reconstruction quality. Through various experiments on synthetic and real-world captures featuring people with varied eye colors and lighting conditions we demonstrate the feasibility of our approach to recover the radiance field using cornea reflections. + + + + Ungeneralizable Examples + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_Ungeneralizable_Examples_CVPR_2024_paper.pdf + The training of contemporary deep learning models heavily relies on publicly available data posing a risk of unauthorized access to online data and raising concerns about data privacy. Current approaches to creating unlearnable data involve incorporating small specially designed noises but these methods strictly limit data usability overlooking its potential usage in authorized scenarios. In this paper we extend the concept of unlearnable data to conditional data learnability and introduce UnGeneralizable Examples (UGEs). UGEs exhibit learnability for authorized users while maintaining unlearnability for potential hackers. The protector defines the authorized network and optimizes UGEs to match the gradients of the original data and its ungeneralizable version ensuring learnability. To prevent unauthorized learning UGEs are trained by maximizing a designated distance loss in a common feature space. Additionally to further safeguard the authorized side from potential attacks we introduce additional undistillation optimization. Experimental results on multiple datasets and various networks demonstrate that the proposed UGEs framework preserves data usability while reducing training performance on hacker networks even under different types of attacks. + + + + LaneCPP: Continuous 3D Lane Detection using Physical Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Pittner_LaneCPP_Continuous_3D_Lane_Detection_using_Physical_Priors_CVPR_2024_paper.pdf + Monocular 3D lane detection has become a fundamental problem in the context of autonomous driving which comprises the tasks of finding the road surface and locating lane markings. One major challenge lies in a flexible but robust line representation capable of modeling complex lane structures while still avoiding unpredictable behavior. While previous methods rely on fully data-driven approaches we instead introduce a novel approach LaneCPP that uses a continuous 3D lane detection model leveraging physical prior knowledge about the lane structure and road geometry. While our sophisticated lane model is capable of modeling complex road structures it also shows robust behavior since physical constraints are incorporated by means of a regularization scheme that can be analytically applied to our parametric representation. Moreover we incorporate prior knowledge about the road geometry into the 3D feature space by modeling geometry-aware spatial features guiding the network to learn an internal road surface representation. 
In our experiments we show the benefits of our contributions and prove the meaningfulness of using priors to make 3D lane detection more robust. The results show that LaneCPP achieves state-of-the-art performance in terms of F-Score and geometric errors. + + + + CityDreamer: Compositional Generative Model of Unbounded 3D Cities + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_CityDreamer_Compositional_Generative_Model_of_Unbounded_3D_Cities_CVPR_2024_paper.pdf + 3D city generation is a desirable yet challenging task since humans are more sensitive to structural distortions in urban environments. Additionally generating 3D cities is more complex than 3D natural scenes since buildings as objects of the same class exhibit a wider range of appearances compared to the relatively consistent appearance of objects like trees in natural scenes. To address these challenges we propose CityDreamer a compositional generative model designed specifically for unbounded 3D cities. Our key insight is that 3D city generation should be a composition of different types of neural fields: 1) various building instances and 2) background stuff such as roads and green lands. Specifically we adopt the bird's eye view scene representation and employ a volumetric render for both instance-oriented and stuff-oriented neural fields. The generative hash grid and periodic positional embedding are tailored as scene parameterization to suit the distinct characteristics of building instances and background stuff. Furthermore we contribute a suite of CityGen Datasets including OSM and GoogleEarth which comprises a vast amount of real-world city imagery to enhance the realism of the generated 3D cities both in their layouts and appearances. CityDreamer achieves state-of-the-art performance not only in generating realistic 3D cities but also in localized editing within the generated cities. + + + + Action Detection via an Image Diffusion Process + http://openaccess.thecvf.com//content/CVPR2024/papers/Foo_Action_Detection_via_an_Image_Diffusion_Process_CVPR_2024_paper.pdf + Action detection aims to localize the starting and ending points of action instances in untrimmed videos and predict the classes of those instances. In this paper we make the observation that the outputs of the action detection task can be formulated as images. Thus from a novel perspective we tackle action detection via a three-image generation process to generate starting point ending point and action-class predictions as images via our proposed Action Detection Image Diffusion (ADI-Diff) framework. Furthermore since our images differ from natural images and exhibit special properties we further explore a Discrete Action-Detection Diffusion Process and a Row-Column Transformer design to better handle their processing. Our ADI-Diff framework achieves state-of-the-art results on two widely-used datasets. + + + + ConTex-Human: Free-View Rendering of Human from a Single Image with Texture-Consistent Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_ConTex-Human_Free-View_Rendering_of_Human_from_a_Single_Image_with_CVPR_2024_paper.pdf + In this work we propose a method to address the challenge of rendering a 3D human from a single image in a free-view manner. Some existing approaches could achieve this by using generalizable pixel-aligned implicit fields to reconstruct a textured mesh of a human or by employing a 2D diffusion model as guidance with the Score Distillation Sampling (SDS) method to lift the 2D image into 3D space. 
However a generalizable implicit field often results in an over-smooth texture field while the SDS method tends to lead to a texture-inconsistent novel view with the input image. In this paper we introduce a texture-consistent back view synthesis method that could transfer the reference image content to the back view through depth-guided mutual self-attention. With this method we could achieve high-fidelity and texture-consistent human rendering from a single image. Moreover to alleviate the color distortion that occurs in the side region we propose a visibility-aware patch consistency regularization combined with the synthesized back view texture. Experiments conducted on both real and synthetic data demonstrate the effectiveness of our method and show that our approach outperforms previous baseline methods. + + + + Streaming Dense Video Captioning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Streaming_Dense_Video_Captioning_CVPR_2024_paper.pdf + An ideal model for dense video captioning -- predicting captions localized temporally in a video -- should be able to handle long input videos predict rich detailed textual descriptions and be able to produce outputs before processing the entire video. Current state-of-the-art models however process a fixed number of downsampled frames and make a single full prediction after seeing the whole video. We propose a streaming dense video captioning model that consists of two novel components: First we propose a new memory module based on clustering incoming tokens which can handle arbitrarily long videos as the memory is of a fixed size. Second we develop a streaming decoding algorithm that enables our model to make predictions before the entire video has been processed. Our model achieves this streaming ability and significantly improves the state-of-the-art on three dense video captioning benchmarks: ActivityNet YouCook2 and ViTT. Our code is released at https://github.com/google-research/scenic. + + + + Rethinking Inductive Biases for Surface Normal Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Bae_Rethinking_Inductive_Biases_for_Surface_Normal_Estimation_CVPR_2024_paper.pdf + Despite the growing demand for accurate surface normal estimation models existing methods use general-purpose dense prediction models adopting the same inductive biases as other tasks. In this paper we discuss the inductive biases needed for surface normal estimation and propose to (1) utilize the per-pixel ray direction and (2) encode the relationship between neighboring surface normals by learning their relative rotation. The proposed method can generate crisp - yet piecewise smooth - predictions for challenging in-the-wild images of arbitrary resolution and aspect ratio. Compared to a recent ViT-based state-of-the-art model our method shows a stronger generalization ability despite being trained on an orders of magnitude smaller dataset. The code is available at https://github.com/baegwangbin/DSINE. + + + + Fair Federated Learning under Domain Skew with Local Consistency and Domain Diversity + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Fair_Federated_Learning_under_Domain_Skew_with_Local_Consistency_and_CVPR_2024_paper.pdf + Federated learning (FL) has emerged as a new paradigm for privacy-preserving collaborative training. Under domain skew the current FL approaches are biased and face two fairness problems. 
1) Parameter Update Conflict: data disparity among clients leads to varying parameter importance and inconsistent update directions. These two disparities cause important parameters to potentially be overwhelmed by unimportant ones of dominant updates. It consequently results in significant performance decreases for lower-performing clients. 2) Model Aggregation Bias: existing FL approaches introduce unfair weight allocation and neglect domain diversity. It leads to biased model convergence objective and distinct performance among domains. We discover a pronounced directional update consistency in Federated Learning and propose a novel framework to tackle above issues. First leveraging the discovered characteristic we selectively discard unimportant parameter updates to prevent updates from clients with lower performance overwhelmed by unimportant parameters resulting in fairer generalization performance. Second we propose a fair aggregation objective to prevent global model bias towards some domains ensuring that the global model continuously aligns with an unbiased model. The proposed method is generic and can be combined with other existing FL methods to enhance fairness. Comprehensive experiments on Digits and Office-Caltech demonstrate the high fairness and performance of our method. + + + + HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_HIG_Hierarchical_Interlacement_Graph_Approach_to_Scene_Graph_Generation_in_CVPR_2024_paper.pdf + Visual interactivity understanding within visual scenes presents a significant challenge in computer vision. Existing methods focus on complex interactivities while leveraging a simple relationship model. These methods however struggle with a diversity of appearance situation position interaction and relation in videos. This limitation hinders the ability to fully comprehend the interplay within the complex visual dynamics of subjects. In this paper we delve into interactivities understanding within visual content by deriving scene graph representations from dense interactivities among humans and objects. To achieve this goal we first present a new dataset containing Appearance-Situation-Position-Interaction-Relation predicates named ASPIRe offering an extensive collection of videos marked by a wide range of interactivities. Then we propose a new approach named Hierarchical Interlacement Graph (HIG) which leverages a unified layer and graph within a hierarchical structure to provide deep insights into scene changes across five distinct tasks. Our approach demonstrates superior performance to other methods through extensive experiments conducted in various scenarios. + + + + OOSTraj: Out-of-Sight Trajectory Prediction With Vision-Positioning Denoising + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_OOSTraj_Out-of-Sight_Trajectory_Prediction_With_Vision-Positioning_Denoising_CVPR_2024_paper.pdf + Trajectory prediction is fundamental in computer vision and autonomous driving particularly for understanding pedestrian behavior and enabling proactive decision-making. Existing approaches in this field often assume precise and complete observational data neglecting the challenges associated with out-of-view objects and the noise inherent in sensor data due to limited camera range physical obstructions and the absence of ground truth for denoised sensor data. 
Such oversights are critical safety concerns as they can result in missing essential non-visible objects. To bridge this gap we present a novel method for out-of-sight trajectory prediction that leverages a vision-positioning technique. Our approach denoises noisy sensor observations in an unsupervised manner and precisely maps sensor-based trajectories of out-of-sight objects into visual trajectories. This method has demonstrated state-of-the-art performance in out-of-sight noisy sensor trajectory denoising and prediction on the Vi-Fi and JRDB datasets. By enhancing trajectory prediction accuracy and addressing the challenges of out-of-sight objects our work significantly contributes to improving the safety and reliability of autonomous driving in complex environments. Our work represents the first initiative towards Out-Of-Sight Trajectory prediction (OOSTraj) setting a new benchmark for future research. + + + + FADES: Fair Disentanglement with Sensitive Relevance + http://openaccess.thecvf.com//content/CVPR2024/papers/Jang_FADES_Fair_Disentanglement_with_Sensitive_Relevance_CVPR_2024_paper.pdf + Learning fair representation in deep learning is essential to mitigate discriminatory outcomes and enhance trustworthiness. However previous research has been commonly established on inappropriate assumptions prone to unrealistic counterfactuals and performance degradation. Although some proposed alternative approaches such as employing correlation-aware causal graphs or proxies for mutual information these methods are less practical and not applicable in general. In this work we propose FAir DisEntanglement with Sensitive relevance (FADES) a novel approach that leverages conditional mutual information from the information theory perspective to address these challenges. We employ sensitive relevant code to direct correlated information between target labels and sensitive attributes by imposing conditional independence allowing better separation of the features of interest in the latent space. Utilizing an intuitive disentangling approach FADES consistently achieves superior performance and fairness both quantitatively and qualitatively with its straightforward structure. Specifically the proposed method outperforms existing works in downstream classification and counterfactual generations on various benchmarks. + + + + Self-Supervised Class-Agnostic Motion Prediction with Spatial and Temporal Consistency Regularizations + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Self-Supervised_Class-Agnostic_Motion_Prediction_with_Spatial_and_Temporal_Consistency_Regularizations_CVPR_2024_paper.pdf + The perception of motion behavior in a dynamic environment holds significant importance for autonomous driving systems wherein class-agnostic motion prediction methods directly predict the motion of the entire point cloud. While most existing methods rely on fully-supervised learning the manual labeling of point cloud data is laborious and time-consuming. Therefore several annotation-efficient methods have been proposed to address this challenge. Although effective these methods rely on weak annotations or additional multi-modal data like images and the potential benefits inherent in the point cloud sequence are still underexplored. To this end we explore the feasibility of self-supervised motion prediction with only unlabeled LiDAR point clouds. 
Initially we employ an optimal transport solver to establish coarse correspondences between current and future point clouds as the coarse pseudo motion labels. Training models directly using such coarse labels leads to noticeable spatial and temporal prediction inconsistencies. To mitigate these issues we introduce three simple spatial and temporal regularization losses which facilitate the self-supervised training process effectively. Experimental results demonstrate the significant superiority of our approach over the state-of-the-art self-supervised methods. Code will be available. + + + + CAT: Exploiting Inter-Class Dynamics for Domain Adaptive Object Detection http://openaccess.thecvf.com//content/CVPR2024/papers/Kennerley_CAT_Exploiting_Inter-Class_Dynamics_for_Domain_Adaptive_Object_Detection_CVPR_2024_paper.pdf Domain adaptive object detection aims to adapt detection models to domains where annotated data is unavailable. Existing methods have been proposed to address the domain gap using the semi-supervised student-teacher framework. However a fundamental issue arises from the class imbalance in the labelled training set which can result in inaccurate pseudo-labels. The relationship between classes especially where one class is a majority and the other minority has a large impact on class bias. We propose Class-Aware Teacher (CAT) to address the class bias issue in the domain adaptation setting. In our work we approximate the class relationships with our Inter-Class Relation module (ICRm) and exploit it to reduce the bias within the model. In this way we are able to apply augmentations to highly related classes both inter- and intra-domain to boost the performance of minority classes while having minimal impact on majority classes. We further reduce the bias by implementing a class-relation weight to our classification loss. Experiments conducted on various datasets and ablation studies show that our method is able to address the class bias in the domain adaptation setting. On the Cityscapes → Foggy Cityscapes dataset we attained a 52.5 mAP a substantial improvement over the 51.2 mAP achieved by the state-of-the-art method. + + + + An Empirical Study of Scaling Law for Scene Text Recognition http://openaccess.thecvf.com//content/CVPR2024/papers/Rang_An_Empirical_Study_of_Scaling_Law_for_Scene_Text_Recognition_CVPR_2024_paper.pdf The laws of model size data volume computation and model performance have been extensively studied in the field of Natural Language Processing (NLP). However the scaling laws in Scene Text Recognition (STR) have not yet been investigated. To address this we conducted comprehensive studies that involved examining the correlations between performance and the scale of models data volume and computation in the field of text recognition. Conclusively the study demonstrates smooth power laws between performance and model size as well as training data volume when other influencing factors are held constant. Additionally we have constructed a large-scale dataset called REBU-Syn which comprises 6 million real samples and 18 million synthetic samples. Based on our scaling law and new dataset we have successfully trained a scene text recognition model achieving a new state-of-the-art on 6 common test benchmarks with a top-1 average accuracy of 97.42%. The models and dataset are publicly available at https://github.com/large-ocr-model/large-ocr-model.github.io. 
+ + + + Text2Loc: 3D Point Cloud Localization from Natural Language http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_Text2Loc_3D_Point_Cloud_Localization_from_Natural_Language_CVPR_2024_paper.pdf We tackle the problem of 3D point cloud localization based on a few natural linguistic descriptions and introduce a novel neural network Text2Loc that fully interprets the semantic relationship between points and text. Text2Loc follows a coarse-to-fine localization pipeline: text-submap global place recognition followed by fine localization. In global place recognition relational dynamics among each textual hint are captured in a hierarchical transformer with max-pooling (HTM) whereas a balance between positive and negative pairs is maintained using text-submap contrastive learning. Moreover we propose a novel matching-free fine localization method to further refine the location predictions which completely removes the need for complicated text-instance matching and is lighter faster and more accurate than previous methods. Extensive experiments show that Text2Loc improves the localization accuracy by up to 2x over the state-of-the-art on the KITTI360Pose dataset. Our project page is publicly available at: https://yan-xia.github.io/projects/text2loc/. + + + + Decomposing Disease Descriptions for Enhanced Pathology Detection: A Multi-Aspect Vision-Language Pre-training Framework http://openaccess.thecvf.com//content/CVPR2024/papers/Phan_Decomposing_Disease_Descriptions_for_Enhanced_Pathology_Detection_A_Multi-Aspect_Vision-Language_CVPR_2024_paper.pdf Medical vision language pre-training (VLP) has emerged as a frontier of research enabling zero-shot pathological recognition by comparing the query image with the textual descriptions for each disease. Due to the complex semantics of biomedical texts current methods struggle to align medical images with key pathological findings in unstructured reports. This leads to the misalignment with the target disease's textual representation. In this paper we introduce a novel VLP framework designed to dissect disease descriptions into their fundamental aspects leveraging prior knowledge about the visual manifestations of pathologies. This is achieved by consulting a large language model and medical experts. Integrating a Transformer module our approach aligns an input image with the diverse elements of a disease generating aspect-centric image representations. By consolidating the matches from each aspect we improve the compatibility between an image and its associated disease. Additionally capitalizing on the aspect-oriented representations we present a dual-head Transformer tailored to process known and unknown diseases optimizing the comprehensive detection efficacy. Conducting experiments on seven downstream datasets ours improves the accuracy of recent methods by up to 8.56% and 17.26% for seen and unseen categories respectively. Our code is released at https://github.com/HieuPhan33/MAVL. + + + + Fusing Personal and Environmental Cues for Identification and Segmentation of First-Person Camera Wearers in Third-Person Views http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Fusing_Personal_and_Environmental_Cues_for_Identification_and_Segmentation_of_CVPR_2024_paper.pdf As wearable cameras become more popular an important question emerges: how to identify camera wearers within the perspective of conventional static cameras. 
The drastic difference between first-person (egocentric) and third-person (exocentric) camera views makes this a challenging task. We present PersonEnvironmentNet (PEN) a framework designed to integrate information from both the individuals in the two views and geometric cues inferred from the background environment. To facilitate research in this direction we also present TF2023 a novel dataset comprising synchronized first-person and third-person views along with masks of camera wearers and labels associating these masks with the respective first-person views. In addition we propose a novel quantitative metric designed to measure a model's ability to comprehend the relationship between the two views. Our experiments reveal that PEN outperforms existing methods. The code and dataset are available at https://github.com/ziweizhao1993/PEN. + + + + Desigen: A Pipeline for Controllable Design Template Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Weng_Desigen_A_Pipeline_for_Controllable_Design_Template_Generation_CVPR_2024_paper.pdf + Templates serve as a good starting point to implement a design (e.g. banner slide) but it takes great effort from designers to manually create. In this paper we present Desigen an automatic template creation pipeline which generates background images as well as harmonious layout elements over the background. Different from natural images a background image should preserve enough non-salient space for the overlaying layout elements. To equip existing advanced diffusion-based models with stronger spatial control we propose two simple but effective techniques to constrain the saliency distribution and reduce the attention weight in desired regions during the background generation process. Then conditioned on the background we synthesize the layout with a Transformer-based autoregressive generator. To achieve a more harmonious composition we propose an iterative inference strategy to adjust the synthesized background and layout in multiple rounds. We constructed a design dataset with more than 40k advertisement banners to verify our approach. Extensive experiments demonstrate that the proposed pipeline generates high-quality templates comparable to human designers. More than a single-page design we further show an application of presentation generation that outputs a set of theme-consistent slides. The data and code are available at https://whaohan.github.io/desigen. + + + + Multi-criteria Token Fusion with One-step-ahead Attention for Efficient Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Multi-criteria_Token_Fusion_with_One-step-ahead_Attention_for_Efficient_Vision_Transformers_CVPR_2024_paper.pdf + Vision Transformer (ViT) has emerged as a prominent backbone for computer vision. For more efficient ViTs recent works lessen the quadratic cost of the self-attention layer by pruning or fusing the redundant tokens. However these works faced the speed-accuracy trade-off caused by the loss of information. Here we argue that token fusion needs to consider diverse relations between tokens to minimize information loss. In this paper we propose a Multi-criteria Token Fusion (MCTF) that gradually fuses the tokens based on multi-criteria (i.e. similarity informativeness and size of fused tokens). Further we utilize the one-step-ahead attention which is the improved approach to capture the informativeness of the tokens. 
By training the model equipped with MCTF using a token reduction consistency we achieve the best speed-accuracy trade-off in the image classification (ImageNet1K). Experimental results prove that MCTF consistently surpasses the previous reduction methods with and without training. Specifically DeiT-T and DeiT-S with MCTF reduce FLOPs by about 44% while improving the performance (+0.5% and +0.3%) over the base model respectively. We also demonstrate the applicability of MCTF in various Vision Transformers (e.g. T2T-ViT LV-ViT) achieving at least 31% speedup without performance degradation. Code is available at https://github.com/mlvlab/MCTF. + + + + ViewFusion: Towards Multi-View Consistency via Interpolated Denoising + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_ViewFusion_Towards_Multi-View_Consistency_via_Interpolated_Denoising_CVPR_2024_paper.pdf + Novel-view synthesis through diffusion models has demonstrated remarkable potential for generating diverse and high-quality images. Yet the independent process of image generation in these prevailing methods leads to challenges in maintaining multiple-view consistency. To address this we introduce ViewFusion a novel training-free algorithm that can be seamlessly integrated into existing pre-trained diffusion models. Our approach adopts an auto-regressive method that implicitly leverages previously generated views as context for the next view generation ensuring robust multi-view consistency during the novel-view generation process. Through a diffusion process that fuses known-view information via interpolated denoising our framework successfully extends single-view conditioned models to work in multiple-view conditional settings without any additional fine-tuning. Extensive experimental results demonstrate the effectiveness of ViewFusion in generating consistent and detailed novel views. + + + + SketchINR: A First Look into Sketches as Implicit Neural Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/Bandyopadhyay_SketchINR_A_First_Look_into_Sketches_as_Implicit_Neural_Representations_CVPR_2024_paper.pdf + We propose SketchINR to advance the representation of vector sketches with implicit neural models. A variable length vector sketch is compressed into a latent space of fixed dimension that implicitly encodes the underlying shape as a function of time and strokes. The learned function predicts the xy point coordinates in a sketch at each time and stroke. Despite its simplicity SketchINR outperforms existing representations at multiple tasks: (i) Encoding an entire sketch dataset into a fixed size latent vector SketchINR gives 60x and 10x data compression over raster and vector sketches respectively. (ii) SketchINR's auto-decoder provides a much higher-fidelity representation than other learned vector sketch representations and is uniquely able to scale to complex vector sketches such as FS-COCO. (iii) SketchINR supports parallelisation that can decode/render 100x faster than other learned vector representations such as SketchRNN. (iv) SketchINR for the first time emulates the human ability to reproduce a sketch with varying abstraction in terms of number and complexity of strokes. As a first look at implicit sketches SketchINR's compact high-fidelity representation will support future work in modelling long and complex sketches. 
+ + + + MatchU: Matching Unseen Objects for 6D Pose Estimation from RGB-D Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_MatchU_Matching_Unseen_Objects_for_6D_Pose_Estimation_from_RGB-D_CVPR_2024_paper.pdf + Recent learning methods for object pose estimation require resource-intensive training for each individual object instance or category hampering their scalability in real applications when confronted with previously unseen objects. In this paper we propose MatchU a Fuse-Describe-Match strategy for 6D pose estimation from RGB-D images. MatchU is a generic approach that fuses 2D texture and 3D geometric cues for 6D pose prediction of unseen objects. We rely on learning geometric 3D descriptors that are rotation-invariant by design. By encoding pose-agnostic geometry the learned descriptors naturally generalize to unseen objects and capture symmetries. To tackle ambiguous associations using 3D geometry only we fuse additional RGB information into our descriptor. This is achieved through a novel attention-based mechanism that fuses cross-modal information together with a matching loss that leverages the latent space learned from RGB data to guide the descriptor learning process. Extensive experiments reveal the generalizability of both the RGB-D fusion strategy as well as the descriptor efficacy. Benefiting from the novel designs MatchU surpasses all existing methods by a significant margin in terms of both accuracy and speed even without the requirement of expensive re-training or rendering. + + + + Towards High-fidelity Artistic Image Vectorization via Texture-Encapsulated Shape Parameterization + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Towards_High-fidelity_Artistic_Image_Vectorization_via_Texture-Encapsulated_Shape_Parameterization_CVPR_2024_paper.pdf + We develop a novel vectorized image representation scheme accommodating both shape/geometry and texture in a decoupled way particularly tailored for reconstruction and editing tasks of artistic/design images such as Emojis and Cliparts. In the heart of this representation is a set of sparsely and unevenly located 2D control points. On one hand these points constitute a collection of parametric/vectorized geometric primitives (e.g. curves and closed shapes) describing the shape characteristics of the target image. On the other hand local texture codes in terms of implicit neural network parameters are spatially distributed into each control point yielding local coordinate-to-RGB mappings within the anchored region of each control point. In the meantime a zero-shot learning algorithm is developed to decompose an arbitrary raster image into the above representation for the sake of high-fidelity image vectorization with convenient editing ability. Extensive experiments on a series of image vectorization and editing tasks well demonstrate the high accuracy offered by our proposed method with a significantly higher image compression ratio over prior art. + + + + EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiong_EfficientSAM_Leveraged_Masked_Image_Pretraining_for_Efficient_Segment_Anything_CVPR_2024_paper.pdf + Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives the impressive performance for zero-shot transfer and high versatility is a super large Transformer model trained on the extensive high-quality SA-1B dataset. 
While beneficial the huge computation cost of SAM model has limited its applications to wider real-world applications. To address this limitation we propose EfficientSAMs light-weight SAM models that exhibits decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining SAMI which learns to reconstruct features from SAM image encoder for effective visual representation learning. Further we take SAMI-pretrained light-weight image encoders and mask decoder to build EfficientSAMs and finetune the models on SA-1B for segment anything task. We perform evaluations on multiple vision tasks including image classification object detection instance segmentation and semantic segmentation and find that our proposed pretraining method SAMI consistently outperforms other masked image pretraining methods. On segment anything task such as zero-shot instance segmentation our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g. 4 AP on COCO/LVIS) over other fast SAM models. Our EfficientSAM code and models are available at https://github.com/yformer/EfficientSAM. + + + + ChatScene: Knowledge-Enabled Safety-Critical Scenario Generation for Autonomous Vehicles + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_ChatScene_Knowledge-Enabled_Safety-Critical_Scenario_Generation_for_Autonomous_Vehicles_CVPR_2024_paper.pdf + We present ChatScene a Large Language Model (LLM)-based agent that leverages the capabilities of LLMs to generate safety-critical scenarios for autonomous vehicles. Given unstructured language instructions the agent first generates textually described traffic scenarios using LLMs. These scenario descriptions are subsequently broken down into several sub-descriptions for specified details such as behaviors and locations of vehicles. The agent then distinctively transforms the textually described sub-scenarios into domain-specific languages which then generate actual code for prediction and control in simulators facilitating the creation of diverse and complex scenarios within the CARLA simulation environment. A key part of our agent is a comprehensive knowledge retrieval component which efficiently translates specific textual descriptions into corresponding domain-specific code snippets by training a knowledge database containing the scenario description and code pairs. Extensive experimental results underscore the efficacy of ChatScene in improving the safety of autonomous vehicles. For instance the scenarios generated by ChatScene show a 15% increase in collision rates compared to state-of-the-art baselines when tested against different reinforcement learning-based ego vehicles. Furthermore we show that by using our generated safety-critical scenarios to fine-tune different RL-based autonomous driving models they can achieve a 9% reduction in collision rates surpassing current SOTA methods. ChatScene effectively bridges the gap between textual descriptions of traffic scenarios and practical CARLA simulations providing a unified way to conveniently generate safety-critical scenarios for safety testing and improvement for AVs. 
+ + + + Teeth-SEG: An Efficient Instance Segmentation Framework for Orthodontic Treatment based on Multi-Scale Aggregation and Anthropic Prior Knowledge + http://openaccess.thecvf.com//content/CVPR2024/papers/Zou_Teeth-SEG_An_Efficient_Instance_Segmentation_Framework_for_Orthodontic_Treatment_based_CVPR_2024_paper.pdf + Teeth localization segmentation and labeling in 2D images have great potential in modern dentistry to enhance dental diagnostics treatment planning and population-based studies on oral health. However general instance segmentation frameworks are incompetent due to 1) the subtle differences between some teeth' shapes (e.g. maxillary first premolar and second premolar) 2) the teeth's position and shape variation across subjects and 3) the presence of abnormalities in the dentition (e.g. caries and edentulism). To address these problems we propose a ViT-based framework named TeethSEG which consists of stacked Multi-Scale Aggregation (MSA) blocks and an Anthropic Prior Knowledge (APK) layer. Specifically to compose the two modules we design 1) a unique permutation-based upscaler to ensure high efficiency while establishing clear segmentation boundaries with 2) multi-head self/cross-gating layers to emphasize particular semantics meanwhile maintaining the divergence between token embeddings. Besides we collect 3) the first open-sourced intraoral image dataset IO150K which comprises over 150k intraoral photos and all photos are annotated by orthodontists using a human-machine hybrid algorithm. Experiments on IO150K demonstrate that our TeethSEG outperforms the state-of-the-art segmentation models on dental image segmentation. + + + + Bayesian Diffusion Models for 3D Shape Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Bayesian_Diffusion_Models_for_3D_Shape_Reconstruction_CVPR_2024_paper.pdf + We present Bayesian Diffusion Models (BDM) a prediction algorithm that performs effective Bayesian inference by tightly coupling the top-down (prior) information with the bottom-up (data-driven) procedure via joint diffusion processes. We demonstrate the application of BDM on the 3D shape reconstruction task. Compared to standard deep learning data-driven approaches relying on supervised data our BDM can bring in rich prior information trained in an unsupervised manner to improve the bottom-up 3D reconstruction. As opposed to the traditional Bayesian frameworks where explicitly learned prior and data-driven distributions are required for gradient computation and combination BDM performs a seamless fusion of the two via coupled diffusion processes with learned gradient computation networks. The specialty of our Bayesian Diffusion Models (BDM) lies in its capability to engage the active and effective information exchange and fusion of the top-down and bottom-up processes where each itself is a diffusion process. We demonstrate state-of-the-art results on both synthetic and real-world benchmarks for 3D shape reconstruction. Project link: https://mlpc-ucsd.github.io/BDM + + + + CrossKD: Cross-Head Knowledge Distillation for Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_CrossKD_Cross-Head_Knowledge_Distillation_for_Object_Detection_CVPR_2024_paper.pdf + Knowledge Distillation (KD) has been validated as an effective model compression technique for learning compact object detectors. Existing state-of-the-art KD methods for object detection are mostly based on feature imitation. 
In this paper we present a general and effective prediction mimicking distillation scheme called CrossKD which delivers the intermediate features of the student's detection head to the teacher's detection head. The resulting cross-head predictions are then forced to mimic the teacher's predictions. This manner relieves the student's head from receiving contradictory supervision signals from the annotations and the teacher's predictions greatly improving the student's detection performance. Moreover as mimicking the teacher's predictions is the target of KD CrossKD offers more task-oriented information in contrast with feature imitation. On MS COCO with only prediction mimicking losses applied our CrossKD boosts the average precision of GFL ResNet-50 with 1x training schedule from 40.2 to 43.7 outperforming all existing KD methods. In addition our method also works well when distilling detectors with heterogeneous backbones. + + + + Bi-level Learning of Task-Specific Decoders for Joint Registration and One-Shot Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Bi-level_Learning_of_Task-Specific_Decoders_for_Joint_Registration_and_One-Shot_CVPR_2024_paper.pdf + One-shot medical image segmentation (MIS) aims to cope with the expensive time-consuming and inherent human bias annotations. One prevalent method to address one-shot MIS is joint registration and segmentation (JRS) with a shared encoder which mainly explores the voxel-wise correspondence between the labeled data and unlabeled data for better segmentation. However this method omits underlying connections between task-specific decoders for segmentation and registration leading to unstable training. In this paper we propose a novel Bi-level Learning of Task-Specific Decoders for one-shot MIS employing a pretrained fixed shared encoder that is proved to be more quickly adapted to brand-new datasets than existing JRS without fixed shared encoder paradigm. To be more specific we introduce a bi-level optimization training strategy considering registration as a major objective and segmentation as a learnable constraint by leveraging inter-task coupling dependencies. Furthermore we design an appearance conformity constraint strategy that learns the backward transformations generating the fake labeled data used to perform data augmentation instead of the labeled image to avoid performance degradation caused by inconsistent styles between unlabeled data and labeled data in previous methods. Extensive experiments on the brain MRI task across ABIDE ADNI and PPMI datasets demonstrate that the proposed Bi-JROS outperforms state-of-the-art one-shot MIS methods for both segmentation and registration tasks. The code will be available at https://github.com/Coradlut/Bi-JROS. + + + + EscherNet: A Generative Model for Scalable View Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Kong_EscherNet_A_Generative_Model_for_Scalable_View_Synthesis_CVPR_2024_paper.pdf + We introduce EscherNet a multi-view conditioned diffusion model for view synthesis. EscherNet learns implicit and generative 3D representations coupled with a specialised camera positional encoding allowing precise and continuous relative control of the camera transformation between an arbitrary number of reference and target views. 
EscherNet offers exceptional generality flexibility and scalability in view synthesis --- it can generate more than 100 consistent target views simultaneously on a single consumer-grade GPU despite being trained with a fixed number of 3 reference views to 3 target views. As a result EscherNet not only addresses zero-shot novel view synthesis but also naturally unifies single- and multi-image 3D reconstruction combining these diverse tasks into a single cohesive framework. Our extensive experiments demonstrate that EscherNet achieves state-of-the-art performance in multiple benchmarks even when compared to methods specifically tailored for each individual problem. This remarkable versatility opens up new directions for designing scalable neural architectures for 3D vision. Project page: https://kxhit.github.io/EscherNet. + + + + MeaCap: Memory-Augmented Zero-shot Image Captioning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_MeaCap_Memory-Augmented_Zero-shot_Image_Captioning_CVPR_2024_paper.pdf + Zero-shot image captioning (IC) without well-paired image-text data can be categorized into two main types: training-free and text-only-training methods. While both types integrate pre-trained vision-language models such as CLIP for image-text similarity evaluation and a pre-trained language model (LM) for caption generation their distinction lies in the utilization of textual corpus for LM training. Despite achieving promising performance on certain metrics existing methods commonly suffer from drawbacks. Training-free methods often generate hallucinations whereas text-only-training methods may lack generalization capability. To address these challenges we propose a novel Memory-Augmented zero-shot image Captioning framework (MeaCap). This framework equipped with a textual memory incorporates a retrieve-then-filter module to extract key concepts highly relevant to the image. By leveraging our proposed memory-augmented visual-related fusion score within a keywords-to-sentence LM MeaCap generates concept-centered captions that exhibit high consistency with the image with reduced hallucinations and enriched world knowledge. MeaCap achieves state-of-the-art performance across various zero-shot IC settings. Our code is publicly available at https://github.com/joeyz0z/MeaCap. + + + + Elite360D: Towards Efficient 360 Depth Estimation via Semantic- and Distance-Aware Bi-Projection Fusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Ai_Elite360D_Towards_Efficient_360_Depth_Estimation_via_Semantic-_and_Distance-Aware_CVPR_2024_paper.pdf + 360 depth estimation has recently received great attention for 3D reconstruction owing to its omnidirectional field of view (FoV). Recent approaches are predominantly focused on cross-projection fusion with geometry-based re-projection: they fuse 360 images with equirectangular projection (ERP) and another projection type e.g. cubemap projection to estimate depth with the ERP format. However these methods suffer from 1) limited local receptive fields making it hardly possible to capture large FoV scenes and 2) prohibitive computational cost caused by the complex cross-projection fusion module design. In this paper we propose Elite360D a novel framework that inputs the ERP image and icosahedron projection (ICOSAP) point set which is undistorted and spatially continuous. Elite360D is superior in its capacity in learning a representation from a local-with-global perspective. 
With a flexible ERP image encoder it includes an ICOSAP point encoder and a Bi-projection Bi-attention Fusion (B2F) module (totally 1M parameters). Specifically the ERP image encoder can take various perspective image-trained backbones (e.g. ResNet Transformer) to extract local features. The point encoder extracts the global features from the ICOSAP. Then the B2F module captures the semantic- and distance-aware dependencies between each pixel of the ERP feature and the entire ICOSAP feature set. Without specific backbone design and obvious computational cost increase Elite360D outperforms the prior arts on several benchmark datasets. + + + + Curriculum Point Prompting for Weakly-Supervised Referring Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Dai_Curriculum_Point_Prompting_for_Weakly-Supervised_Referring_Image_Segmentation_CVPR_2024_paper.pdf + Referring image segmentation (RIS) aims to precisely segment referents in images through corresponding natural language expressions yet relying on cost-intensive mask annotations. Weakly supervised RIS thus learns from image-text pairs to pixel-level semantics which is challenging for segmenting fine-grained masks. A natural approach to enhancing segmentation precision is to empower weakly supervised RIS with the image segmentation foundation model SAM. Nevertheless we observe that simply integrating SAM yields limited benefits and can even lead to performance regression due to the inevitable noise issues and challenges in excessive focus on object parts. In this paper we present an innovative framework Point PrompTing (PPT) incorporated with the proposed multi-source curriculum learning strategy to address these challenges. Specifically the core of PPT is a point generator that not only harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability but also generates negative point prompts to address the noisy and excessive focus issues inherently and effectively. In addition we introduce a curriculum learning strategy with object-centric images to help PPT gradually learn from simpler yet precise semantic alignment to more complex RIS. Experiments demonstrate that our PPT significantly and consistently outperforms prior weakly supervised techniques on mIoU by 11.34% 14.14% and 6.97% across RefCOCO RefCOCO+ and G-Ref respectively. + + + + EventDance: Unsupervised Source-free Cross-modal Adaptation for Event-based Object Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_EventDance_Unsupervised_Source-free_Cross-modal_Adaptation_for_Event-based_Object_Recognition_CVPR_2024_paper.pdf + In this paper we make the first attempt at achieving the cross-modal (i.e. image-to-events) adaptation for event-based object recognition without accessing any labeled source image data owning to privacy and commercial issues. Tackling this novel problem is non-trivial due to the novelty of event cameras and the distinct modality gap between images and events. In particular as only the source model is available a hurdle is how to extract the knowledge from the source model by only using the unlabeled target event data while achieving knowledge transfer. To this end we propose a novel framework dubbed EventDance for this unsupervised source-free cross-modal adaptation problem. Importantly inspired by event-to-video reconstruction methods we propose a reconstruction-based modality bridging (RMB) module which reconstructs intensity frames from events in a self-supervised manner. 
This makes it possible to build up the surrogate images to extract the knowledge (i.e. labels) from the source model. We then propose a multi-representation knowledge adaptation (MKA) module that transfers the knowledge to target models learning events with multiple representation types for fully exploring the spatiotemporal information of events. The two modules connecting the source and target models are mutually updated so as to achieve the best performance. Experiments on three benchmark datasets with two adaption settings show that EventDance is on par with prior methods utilizing the source data. + + + + CycleINR: Cycle Implicit Neural Representation for Arbitrary-Scale Volumetric Super-Resolution of Medical Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Fang_CycleINR_Cycle_Implicit_Neural_Representation_for_Arbitrary-Scale_Volumetric_Super-Resolution_of_CVPR_2024_paper.pdf + In the realm of medical 3D data such as CT and MRI images prevalent anisotropic resolution is characterized by high intra-slice but diminished inter-slice resolution. The lowered resolution between adjacent slices poses challenges hindering optimal viewing experiences and impeding the development of robust downstream analysis algorithms. Various volumetric super-resolution algorithms aim to surmount these challenges enhancing inter-slice resolution and overall 3D medical imaging quality. However existing approaches confront inherent challenges: 1) often tailored to specific upsampling factors lacking flexibility for diverse clinical scenarios; 2) newly generated slices frequently suffer from over-smoothing degrading fine details and leading to inter-slice inconsistency. In response this study presents CycleINR a novel enhanced Implicit Neural Representation model for 3D medical data volumetric super-resolution. Leveraging the continuity of the learned implicit function the CycleINR model can achieve results with arbitrary up-sampling rates eliminating the need for separate training. Additionally we enhance the grid sampling in CycleINR with a local attention mechanism and mitigate over-smoothing by integrating cycle-consistent loss. We introduce a new metric Slice-wise Noise Level Inconsistency (SNLI) to quantitatively assess inter-slice noise level inconsistency. The effectiveness of our approach is demonstrated through image quality evaluations on an in-house dataset and a downstream task analysis on the Medical Segmentation Decathlon liver tumor dataset. + + + + Holistic Autonomous Driving Understanding by Bird's-Eye-View Injected Multi-Modal Large Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_Holistic_Autonomous_Driving_Understanding_by_Birds-Eye-View_Injected_Multi-Modal_Large_Models_CVPR_2024_paper.pdf + The rise of multimodal large language models (MLLMs) has spurred interest in language-based driving tasks. However existing research typically focuses on limited tasks and often omits key multi-view and temporal information which is crucial for robust autonomous driving. To bridge these gaps we introduce NuInstruct a novel dataset with 91K multi-view video-QA pairs across 17 subtasks where each task demands holistic information (e.g. temporal multi-view and spatial) significantly elevating the challenge level. To obtain NuInstruct we propose a novel SQL-based method to generate instruction-response pairs automatically which is inspired by the driving logical progression of humans. 
We further present BEV-InMLLM an end-to-end method for efficiently deriving instruction-aware Bird's-Eye-View (BEV) features language-aligned for large language models. BEV-InMLLM integrates multi-view spatial awareness and temporal semantics to enhance MLLMs' capabilities on NuInstruct tasks. Moreover our proposed BEV injection module is a plug-and-play method for existing MLLMs. Our experiments on NuInstruct demonstrate that BEV-InMLLM significantly outperforms existing MLLMs e.g 9% improvement on various tasks. We release our NuInstruct at https://github.com/xmed-lab/NuInstruct. + + + + Extreme Point Supervised Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Extreme_Point_Supervised_Instance_Segmentation_CVPR_2024_paper.pdf + This paper introduces a novel approach to learning instance segmentation using extreme points i.e. the topmost leftmost bottommost and rightmost points of each object. These points are readily available in the modern bounding box annotation process while offering strong clues for precise segmentation and thus allows to improve performance at the same annotation cost with box-supervised methods. Our work considers extreme points as a part of the true instance mask and propagates them to identify potential foreground and background points which are all together used for training a pseudo label generator. Then pseudo labels given by the generator are in turn used for supervised learning of our final model. On three public benchmarks our method significantly outperforms existing box-supervised methods further narrowing the gap with its fully supervised counterpart. In particular our model generates high-quality masks when a target object is separated into multiple parts where previous box-supervised methods often fail. + + + + MedM2G: Unifying Medical Multi-Modal Generation via Cross-Guided Diffusion with Visual Invariant + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhan_MedM2G_Unifying_Medical_Multi-Modal_Generation_via_Cross-Guided_Diffusion_with_Visual_CVPR_2024_paper.pdf + Medical generative models acknowledged for their high-quality sample generation ability have accelerated the fast growth of medical applications. However recent works concentrate on separate medical generation models for distinct medical tasks and are restricted to inadequate medical multi-modal knowledge constraining medical comprehensive diagnosis. In this paper we propose MedM2G a Medical Multi-Modal Generative framework with the key innovation to align extract and generate medical multi-modal within a unified model. Extending beyond single or two medical modalities we efficiently align medical multi-modal through the central alignment approach in the unified space. Significantly our framework extracts valuable clinical knowledge by preserving the medical visual invariant of each imaging modal thereby enhancing specific medical information for multi-modal generation. By conditioning the adaptive cross-guided parameters into the multi-flow diffusion framework our model promotes flexible interactions among medical multi-modal for generation. MedM2G is the first medical generative model that unifies medical generation tasks of text-to-image image-to-text and unified generation of medical modalities (CT MRI X-ray). It performs 5 medical generation tasks across 10 datasets consistently outperforming various state-of-the-art works. 
+ + + + Neural Parametric Gaussians for Monocular Non-Rigid Object Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Das_Neural_Parametric_Gaussians_for_Monocular_Non-Rigid_Object_Reconstruction_CVPR_2024_paper.pdf + Reconstructing dynamic objects from monocular videos is a severely underconstrained and challenging problem and recent work has approached it in various directions. However owing to the ill-posed nature of this problem there has been no solution that can provide consistent high-quality novel views from camera positions that are significantly different from the training views. In this work we introduce Neural Parametric Gaussians (NPGs) to take on this challenge by imposing a two-stage approach: first we fit a low-rank neural deformation model which then is used as regularization for non-rigid reconstruction in the second stage. The first stage learns the object's deformations such that it preserves consistency in novel views. The second stage obtains high reconstruction quality by optimizing 3D Gaussians that are driven by the coarse model. To this end we introduce a local 3D Gaussian representation where temporally shared Gaussians are anchored in and deformed by local oriented volumes. The resulting combined model can be rendered as radiance fields resulting in high-quality photo-realistic reconstructions of the non-rigidly deforming objects. We demonstrate that NPGs achieve superior results compared to previous works especially in challenging scenarios with few multi-view cues. + + + + PH-Net: Semi-Supervised Breast Lesion Segmentation via Patch-wise Hardness + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_PH-Net_Semi-Supervised_Breast_Lesion_Segmentation_via_Patch-wise_Hardness_CVPR_2024_paper.pdf + We present a novel semi-supervised framework for breast ultrasound (BUS) image segmentation which is a very challenging task owing to (1) large scale and shape variations of breast lesions and (2) extremely ambiguous boundaries caused by massive speckle noise and artifacts in BUS images. While existing models achieved certain progress in this task we believe the main bottleneck nowadays for further improvement is that we still cannot deal with hard cases well. Our framework aims to break through this bottleneck which includes two innovative components: an adaptive patch augmentation scheme and a hard-patch contrastive learning module. We first identify hard patches by computing the average entropy of each patch and then shield hard patches to prevent them from being cropped out while performing random patch cutmix. Such a scheme is able to prevent hard regions from being inadequately trained under strong augmentation. We further develop a new hard-patch contrastive learning algorithm to direct model attention to hard regions by applying extra contrast to pixels in hard patches further improving segmentation performance on hard cases. We demonstrate the superiority of our framework to state-of-the-art approaches on two famous BUS datasets achieving better performance under different labeling conditions. The code is available at https://github.com/jjjsyyy/PH-Net. 
+ + + + ExACT: Language-guided Conceptual Reasoning and Uncertainty Estimation for Event-based Action Recognition and More + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_ExACT_Language-guided_Conceptual_Reasoning_and_Uncertainty_Estimation_for_Event-based_Action_CVPR_2024_paper.pdf + Event cameras have recently been shown beneficial for practical vision tasks such as action recognition thanks to their high temporal resolution power efficiency and reduced privacy concerns. However current research is hindered by 1) the difficulty in processing events because of their prolonged duration and dynamic actions with complex and ambiguous semantics and 2) the redundant action depiction of the event frame representation with fixed stacks. We find language naturally conveys abundant semantic information rendering it stunningly superior in reducing semantic uncertainty. In light of this we propose ExACT a novel approach that for the first time tackles event-based action recognition from a cross-modal conceptualizing perspective. Our ExACT brings two technical contributions. Firstly we propose an adaptive fine-grained event (AFE) representation to adaptively filter out the repeated events for the stationary objects while preserving dynamic ones. This subtly enhances the performance of ExACT without extra computational cost. Then we propose a conceptual reasoning-based uncertainty estimation module which simulates the recognition process to enrich the semantic representation. In particular conceptual reasoning builds the temporal relation based on the action semantics and uncertainty estimation tackles the semantic uncertainty of actions based on the distributional representation. Experiments show that our ExACT achieves superior recognition accuracy of 94.83%(+2.23%) 90.10%(+37.47%) and 67.24% on PAF HARDVS and our SeAct datasets respectively. + + + + Improving Visual Recognition with Hyperbolical Visual Hierarchy Mapping + http://openaccess.thecvf.com//content/CVPR2024/papers/Kwon_Improving_Visual_Recognition_with_Hyperbolical_Visual_Hierarchy_Mapping_CVPR_2024_paper.pdf + Visual scenes are naturally organized in a hierarchy where a coarse semantic is recursively comprised of several fine details. Exploring such a visual hierarchy is crucial to recognize the complex relations of visual elements leading to a comprehensive scene understanding. In this paper we propose a Visual Hierarchy Mapper (Hi-Mapper) a novel approach for enhancing the structured understanding of the pre-trained Deep Neural Networks (DNNs). Hi-Mapper investigates the hierarchical organization of the visual scene by 1) pre-defining a hierarchy tree through the encapsulation of probability densities; and 2) learning the hierarchical relations in hyperbolic space with a novel hierarchical contrastive loss. The pre-defined hierarchy tree recursively interacts with the visual features of the pre-trained DNNs through hierarchy decomposition and encoding procedures thereby effectively identifying the visual hierarchy and enhancing the recognition of an entire scene. Extensive experiments demonstrate that Hi-Mapper significantly enhances the representation capability of DNNs leading to an improved performance on various tasks including image classification and dense prediction tasks. 
+ + + + ParameterNet: Parameters Are All You Need for Large-scale Visual Pretraining of Mobile Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_ParameterNet_Parameters_Are_All_You_Need_for_Large-scale_Visual_Pretraining_CVPR_2024_paper.pdf + Large-scale visual pretraining has significantly improved the performance of large vision models. However we observe the low FLOPs pitfall that the existing low-FLOPs models cannot benefit from large-scale pretraining. In this paper we introduce a novel design principle termed ParameterNet aimed at augmenting the number of parameters in large-scale visual pretraining models while minimizing the increase in FLOPs. We leverage dynamic convolutions to incorporate additional parameters into the networks with only a marginal rise in FLOPs. The ParameterNet approach allows low-FLOPs networks to take advantage of large-scale visual pretraining. Furthermore we extend the ParameterNet concept to the language domain to enhance inference results while preserving inference speed. Experiments on the large-scale ImageNet-22K have shown the superiority of our ParameterNet scheme. For example ParameterNet-600M can achieve higher accuracy than the widely-used Swin Transformer (81.6% vs. 80.9%) and has much lower FLOPs (0.6G vs. 4.5G). The code will be released at https://parameternet.github.io/. + + + + Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ke_Repurposing_Diffusion-Based_Image_Generators_for_Monocular_Depth_Estimation_CVPR_2024_paper.pdf + Monocular depth estimation is a fundamental computer vision task. Recovering 3D depth from a single image is geometrically ill-posed and requires scene understanding so it is not surprising that the rise of deep learning has led to a breakthrough. The impressive progress of monocular depth estimators has mirrored the growth in model capacity from relatively modest CNNs to large Transformer architectures. Still monocular depth estimators tend to struggle when presented with images with unfamiliar content and layout since their knowledge of the visual world is restricted by the data seen during training and challenged by zero-shot generalization to new domains. This motivates us to explore whether the extensive priors captured in recent generative diffusion models can enable better more generalizable depth estimation. We introduce Marigold a method for affine-invariant monocular depth estimation that is derived from Stable Diffusion and retains its rich prior knowledge. The estimator can be fine-tuned in a couple of days on a single GPU using only synthetic training data. It delivers state-of-the-art performance across a wide range of datasets including over 20% performance gains in specific cases. Project page: https://marigoldmonodepth.github.io. + + + + LLMs are Good Sign Language Translators + http://openaccess.thecvf.com//content/CVPR2024/papers/Gong_LLMs_are_Good_Sign_Language_Translators_CVPR_2024_paper.pdf + Sign Language Translation (SLT) is a challenging task that aims to translate sign videos into spoken language. Inspired by the strong translation capabilities of large language models (LLMs) that are trained on extensive multilingual text corpora we aim to harness off-the-shelf LLMs to handle SLT.
In this paper we regularize the sign videos to embody linguistic characteristics of spoken language and propose a novel SignLLM framework to transform sign videos into a language-like representation for improved readability by off-the-shelf LLMs. SignLLM comprises two key modules: (1) The Vector-Quantized Visual Sign module converts sign videos into a sequence of discrete character-level sign tokens and (2) the Codebook Reconstruction and Alignment module converts these character-level tokens into word-level sign representations using an optimal transport formulation. A sign-text alignment loss further bridges the gap between sign and text tokens enhancing semantic compatibility. We achieve state-of-the-art gloss-free results on two widely-used SLT benchmarks. + + + + Revisiting the Domain Shift and Sample Uncertainty in Multi-source Active Domain Transfer + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Revisiting_the_Domain_Shift_and_Sample_Uncertainty_in_Multi-source_Active_CVPR_2024_paper.pdf + Active Domain Adaptation (ADA) aims to maximally boost model adaptation in a new target domain by actively selecting a limited number of target data to annotate. This setting neglects the more practical scenario where training data are collected from multiple sources. This motivates us to extend ADA from a single source domain to multiple source domains termed Multi-source Active Domain Adaptation (MADA). Not surprisingly we find that most traditional ADA methods cannot work directly in such a setting mainly due to the excessive domain gap introduced by all the source domains. Considering this we propose a Detective framework that comprehensively considers the domain shift between multi-source domains and target domains to detect the informative target samples. Specifically the Detective leverages a dynamic Domain Adaptation (DA) model that learns how to adapt the model's parameters to fit the union of multi-source domains. This enables an approximate single-source domain modeling by the dynamic model. We then comprehensively measure both domain uncertainty and predictive uncertainty in the target domain to detect informative target samples using evidential deep learning thereby mitigating uncertainty miscalibration. Experiments demonstrate that our solution outperforms existing methods by a considerable margin on three domain adaptation benchmarks. + + + + Learning Continual Compatible Representation for Re-indexing Free Lifelong Person Re-identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Cui_Learning_Continual_Compatible_Representation_for_Re-indexing_Free_Lifelong_Person_Re-identification_CVPR_2024_paper.pdf + Lifelong Person Re-identification (L-ReID) aims to learn from sequentially collected data to match a person across different scenes. Once an L-ReID model is updated using new data all historical images in the gallery are required to be re-calculated to obtain new features for testing known as "re-indexing". However it is infeasible when raw images in the gallery are unavailable due to data privacy concerns resulting in incompatible retrieval between the query and the gallery features calculated by different models which causes significant performance degradation. In this paper we focus on a new task called Re-indexing Free Lifelong Person Re-identification (RFL-ReID) which requires achieving effective L-ReID without re-indexing raw images in the gallery. 
To this end we propose a Continual Compatible Representation (C2R) method which facilitates the query feature calculated by the continuously updated model to effectively retrieve the gallery feature calculated by the old model in a compatible manner. Specifically we design a Continual Compatible Transfer (CCT) network to continuously transfer and consolidate the old gallery feature into the new feature space. Besides a Balanced Compatible Distillation module is introduced to achieve compatibility by aligning the transferred feature space with the new feature space. Finally a Balanced Anti-forgetting Distillation module is proposed to eliminate the accumulated forgetting of old knowledge during the continual compatible transfer. Extensive experiments on several benchmark L-ReID datasets demonstrate the effectiveness of our method against state-of-the-art methods for both RFL-ReID and L-ReID tasks. The source code of this paper is available at https://github.com/PKU-ICST-MIPL/C2R_CVPR2024. + + + + CORES: Convolutional Response-based Score for Out-of-distribution Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_CORES_Convolutional_Response-based_Score_for_Out-of-distribution_Detection_CVPR_2024_paper.pdf + Deep neural networks (DNNs) often display overconfidence when encountering out-of-distribution (OOD) samples posing significant challenges in real-world applications. Capitalizing on the observation that responses on convolutional kernels are generally more pronounced for in-distribution (ID) samples than for OOD ones this paper proposes the COnvolutional REsponse-based Score (CORES) to exploit these discrepancies for OOD detection. Initially CORES delves into the extremities of convolutional responses by considering both their magnitude and the frequency of significant values. Moreover through backtracking from the most prominent predictions CORES effectively pinpoints sample-relevant kernels across different layers. These kernels which exhibit a strong correlation to input samples are integral to CORES's OOD detection capability. Comprehensive experiments across various ID and OOD settings demonstrate CORES's effectiveness in OOD detection and its superiority to the state-of-the-art methods. + + + + Accurate Spatial Gene Expression Prediction by Integrating Multi-Resolution Features + http://openaccess.thecvf.com//content/CVPR2024/papers/Chung_Accurate_Spatial_Gene_Expression_Prediction_by_Integrating_Multi-Resolution_Features_CVPR_2024_paper.pdf + Recent advancements in Spatial Transcriptomics (ST) technology have facilitated detailed gene expression analysis within tissue contexts. However the high costs and methodological limitations of ST necessitate a more robust predictive model. In response this paper introduces TRIPLEX a novel deep learning framework designed to predict spatial gene expression from Whole Slide Images (WSIs). TRIPLEX uniquely harnesses multi-resolution features capturing cellular morphology at individual spots the local context around these spots and the global tissue organization. By integrating these features through an effective fusion strategy TRIPLEX achieves accurate gene expression prediction. Our comprehensive benchmark study conducted on three public ST datasets and supplemented with Visium data from 10X Genomics demonstrates that TRIPLEX outperforms current state-of-the-art models in Mean Squared Error (MSE) Mean Absolute Error (MAE) and Pearson Correlation Coefficient (PCC). 
The model's predictions align closely with ground truth gene expression profiles and tumor annotations underscoring TRIPLEX's potential in advancing cancer diagnosis and treatment. + + + + Behind the Veil: Enhanced Indoor 3D Scene Reconstruction with Occluded Surfaces Completion + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Behind_the_Veil_Enhanced_Indoor_3D_Scene_Reconstruction_with_Occluded_CVPR_2024_paper.pdf + In this paper we present a novel indoor 3D reconstruction method with occluded surface completion given a sequence of depth readings. Prior state-of-the-art (SOTA) methods only focus on the reconstruction of the visible areas in a scene neglecting the invisible areas due to the occlusions e.g. the contact surface between furniture occluded wall and floor. Our method tackles the task of completing the occluded scene surfaces resulting in a complete 3D scene mesh. The core idea of our method is learning 3D geometry prior from various complete scenes to infer the occluded geometry of an unseen scene from solely depth measurements. We design a coarse-fine hierarchical octree representation coupled with a dual-decoder architecture i.e. Geo-decoder and 3D Inpainter which jointly reconstructs the complete 3D scene geometry. The Geo-decoder with detailed representation at fine levels is optimized online for each scene to reconstruct visible surfaces. The 3D Inpainter with abstract representation at coarse levels is trained offline using various scenes to complete occluded surfaces. As a result while the Geo-decoder is specialized for an individual scene the 3D Inpainter can be generally applied across different scenes. We evaluate the proposed method on the 3D Completed Room Scene (3D-CRS) and iTHOR datasets significantly outperforming the SOTA methods by a gain of 16.8% and 24.2% in terms of the completeness of 3D reconstruction. 3D-CRS dataset including a complete 3D mesh of each scene is provided at project webpage. + + + + VideoGrounding-DINO: Towards Open-Vocabulary Spatio-Temporal Video Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Wasim_VideoGrounding-DINO_Towards_Open-Vocabulary_Spatio-Temporal_Video_Grounding_CVPR_2024_paper.pdf + Video grounding aims to localize a spatio-temporal section in a video corresponding to an input text query. This paper addresses a critical limitation in current video grounding methodologies by introducing an Open-Vocabulary Spatio-Temporal Video Grounding task. Unlike prevalent closed-set approaches that struggle with open-vocabulary scenarios due to limited training data and predefined vocabularies our model leverages pre-trained representations from foundational spatial grounding models. This empowers it to effectively bridge the semantic gap between natural language and diverse visual content achieving strong performance in closed-set and open-vocabulary settings. Our contributions include a novel spatio-temporal video grounding model surpassing state-of-the-art results in closed-set evaluations on multiple datasets and demonstrating superior performance in open-vocabulary scenarios. Notably the proposed model outperforms state-of-the-art methods in closed-set settings on VidSTG (Declarative and Interrogative) and HC-STVG (V1 and V2) datasets. 
Furthermore in open-vocabulary evaluations on HC-STVG V1 and YouCook-Interactions our model surpasses the recent best-performing models by 4.88 m_vIoU and 1.83 accuracy demonstrating its efficacy in handling diverse linguistic and visual concepts for improved video understanding. Our codes will be publicly released. + + + + Think Twice Before Selection: Federated Evidential Active Learning for Medical Image Analysis with Domain Shifts + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Think_Twice_Before_Selection_Federated_Evidential_Active_Learning_for_Medical_CVPR_2024_paper.pdf + Federated learning facilitates the collaborative learning of a global model across multiple distributed medical institutions without centralizing data. Nevertheless the expensive cost of annotation on local clients remains an obstacle to effectively utilizing local data. To mitigate this issue federated active learning methods suggest leveraging local and global model predictions to select a relatively small amount of informative local data for annotation. However existing methods mainly focus on all local data sampled from the same domain making them unreliable in realistic medical scenarios with domain shifts among different clients. In this paper we make the first attempt to assess the informativeness of local data derived from diverse domains and propose a novel methodology termed Federated Evidential Active Learning (FEAL) to calibrate the data evaluation under domain shift. Specifically we introduce a Dirichlet prior distribution in both local and global models to treat the prediction as a distribution over the probability simplex and capture both aleatoric and epistemic uncertainties by using the Dirichlet-based evidential model. Then we employ the epistemic uncertainty to calibrate the aleatoric uncertainty. Afterward we design a diversity relaxation strategy to reduce data redundancy and maintain data diversity. Extensive experiments and analysis on five real multi-center medical image datasets demonstrate the superiority of FEAL over the state-of-the-art active learning methods in federated scenarios with domain shifts. The code will be available at https://github.com/JiayiChen815/FEAL. + + + + ViTamin: Designing Scalable Vision Models in the Vision-Language Era + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_ViTamin_Designing_Scalable_Vision_Models_in_the_Vision-Language_Era_CVPR_2024_paper.pdf + Recent breakthroughs in vision-language models (VLMs) start a new page in the vision community. The VLMs provide stronger and more generalizable feature embeddings compared to those from ImageNet-pretrained models thanks to the training on the large-scale Internet image-text pairs. However despite the amazing achievement from the VLMs vanilla Vision Transformers (ViTs) remain the default choice for the image encoder. Although pure transformer proves its effectiveness in the text encoding area it remains questionable whether it is also the case for image encoding especially considering that various types of networks are proposed on the ImageNet benchmark which unfortunately are rarely studied in VLMs. Due to small data/model scale the original conclusions of model design on ImageNet can be limited and biased. In this paper we aim at building an evaluation protocol of vision models in the vision-language era under the contrastive language-image pretraining (CLIP) framework. 
We provide a comprehensive way to benchmark different vision models covering their zero-shot performance and scalability in both model and training data sizes. To this end we introduce ViTamin a new vision model tailored for VLMs. ViTamin-L significantly outperforms ViT-L by 2.0% ImageNet zero-shot accuracy when using the same publicly available DataComp-1B dataset and the same OpenCLIP training scheme. ViTamin-L presents promising results on 60 diverse benchmarks including classification retrieval open-vocabulary detection and segmentation and large multi-modal models. When further scaling up the model size our ViTamin-XL with only 436M parameters attains 82.9% ImageNet zero-shot accuracy surpassing 82.0% achieved by EVA-E that has ten times more parameters (4.4B). + + + + Seeing the Unseen: Visual Common Sense for Semantic Placement + http://openaccess.thecvf.com//content/CVPR2024/papers/Ramrakhya_Seeing_the_Unseen_Visual_Common_Sense_for_Semantic_Placement_CVPR_2024_paper.pdf + Computer vision tasks typically involve describing what is visible in an image (e.g. classification detection segmentation and captioning). We study a visual common sense task that requires understanding 'what is not visible'. Specifically given an image (e.g. of a living room) and a name of an object ("cushion") a vision system is asked to predict semantically-meaningful regions (masks or bounding boxes) in the image where that object could be placed or is likely to be placed by humans (e.g. on the sofa). We call this task: Semantic Placement (SP) and believe that such common-sense visual understanding is critical for assistive robots (tidying a house) AR devices (automatically rendering an object in the user's space) and visually-grounded chatbots with common sense. Studying the invisible is hard. Datasets for image description are typically constructed by curating relevant images (e.g. via image search with object names) and asking humans to annotate the contents of the image; neither of those two steps is straightforward for objects not present in the image. We overcome this challenge by operating in the opposite direction: we start with an image of an object in context (which is easy to find online) and remove that object from the image via inpainting. This automated pipeline converts unstructured web data into a paired with/without object dataset. With this proposed data generation pipeline we collect a novel dataset containing 1.3M images across 9 object categories. We then train an SP prediction model called CLIP-UNet on our dataset. The CLIP-UNet outperforms existing VLMs and baselines that combine semantic priors with object detectors generalizes well to real-world and simulated images exhibits semantics-aware reasoning for object placement and enables downstream applications like tidying robots in indoor environments. + + + + LLaMA-Excitor: General Instruction Tuning via Indirect Feature Interaction + http://openaccess.thecvf.com//content/CVPR2024/papers/Zou_LLaMA-Excitor_General_Instruction_Tuning_via_Indirect_Feature_Interaction_CVPR_2024_paper.pdf + Existing methods to fine-tune LLMs like Adapter Prefix-tuning and LoRA which introduce extra modules or additional input sequences to inject new skills or knowledge may compromise the innate abilities of LLMs. In this paper we propose LLaMA-Excitor a lightweight method that stimulates the LLMs' potential to better follow instructions by gradually paying more attention to worthwhile information.
Specifically the LLaMA-Excitor does not directly change the intermediate hidden state during the self-attention calculation of the transformer structure. We designed the Excitor block as a bypass module for the similarity score computation in LLMs' self-attention to reconstruct keys and change the importance of values by learnable prompts. LLaMA-Excitor ensures a self-adaptive allocation of additional attention to input instructions thus effectively preserving LLMs' pre-trained knowledge when fine-tuning LLMs on low-quality instruction-following datasets. Furthermore we unify the modeling of multi-modal tuning and language-only tuning extending LLaMA-Excitor to a powerful visual instruction follower without the need for complex multi-modal alignment. Our proposed approach is evaluated in language-only and multi-modal tuning experimental scenarios. Notably LLaMA-Excitor is the only method that maintains basic capabilities while achieving a significant improvement (+6%) on the MMLU benchmark. In the visual instruction tuning we achieve a new state-of-the-art image captioning performance of 157.5 CIDEr on MSCOCO and a comparable performance (88.39%) on ScienceQA to cutting-edge models with more parameters and extensive vision-language pretraining. + + + + Steerers: A Framework for Rotation Equivariant Keypoint Descriptors + http://openaccess.thecvf.com//content/CVPR2024/papers/Bokman_Steerers_A_Framework_for_Rotation_Equivariant_Keypoint_Descriptors_CVPR_2024_paper.pdf + Image keypoint descriptions that are discriminative and matchable over large changes in viewpoint are vital for 3D reconstruction. However descriptions output by learned descriptors are typically not robust to camera rotation. While they can be made more robust by e.g. data augmentation this degrades performance on upright images. Another approach is test-time augmentation which incurs a significant increase in runtime. Instead we learn a linear transform in description space that encodes rotations of the input image. We call this linear transform a steerer since it allows us to transform the descriptions as if the image was rotated. From representation theory we know all possible steerers for the rotation group. Steerers can be optimized (A) given a fixed descriptor (B) jointly with a descriptor or (C) we can optimize a descriptor given a fixed steerer. We perform experiments in these three settings and obtain state-of-the-art results on the rotation invariant image matching benchmarks AIMS and Roto-360. + + + + Efficient Dataset Distillation via Minimax Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Gu_Efficient_Dataset_Distillation_via_Minimax_Diffusion_CVPR_2024_paper.pdf + Dataset distillation reduces the storage and computational consumption of training a network by generating a small surrogate dataset that encapsulates rich information of the original large-scale one. However previous distillation methods heavily rely on the sample-wise iterative optimization scheme. As the images-per-class (IPC) setting or image resolution grows larger the necessary computation will demand overwhelming time and resources. In this work we intend to incorporate generative diffusion techniques for computing the surrogate dataset. Observing that key factors for constructing an effective surrogate dataset are representativeness and diversity we design additional minimax criteria in the generative training to enhance these facets for the generated images of diffusion models.
We present a theoretical model of the process as hierarchical diffusion control demonstrating the flexibility of the diffusion process to target these criteria without jeopardizing the faithfulness of the sample to the desired distribution. The proposed method achieves state-of-the-art validation performance while demanding much less computational resources. Under the 100-IPC setting on ImageWoof our method requires less than one-twentieth the distillation time of previous methods yet yields even better performance. Source code and generated data are available in https://github.com/vimar-gu/MinimaxDiffusion. + + + + Posterior Distillation Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Koo_Posterior_Distillation_Sampling_CVPR_2024_paper.pdf + We introduce Posterior Distillation Sampling (PDS) a novel optimization method for parametric image editing based on diffusion models. Existing optimization-based methods which leverage the powerful 2D prior of diffusion models to handle various parametric images have mainly focused on generation. Unlike generation editing requires a balance between conforming to the target attribute and preserving the identity of the source content. Recent 2D image editing methods have achieved this balance by leveraging the stochastic latent encoded in the generative process of diffusion models. To extend the editing capabilities of diffusion models shown in pixel space to parameter space we reformulate the 2D image editing method into an optimization form named PDS. PDS matches the stochastic latents of the source and the target enabling the sampling of targets in diverse parameter spaces that align with a desired attribute while maintaining the source's identity. We demonstrate that this optimization resembles running a generative process with the target attribute but aligning this process with the trajectory of the source's generative process. Extensive editing results in Neural Radiance Fields and Scalable Vector Graphics representations demonstrate that PDS is capable of sampling targets to fulfill the aforementioned balance across various parameter spaces. + + + + HOISDF: Constraining 3D Hand-Object Pose Estimation with Global Signed Distance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Qi_HOISDF_Constraining_3D_Hand-Object_Pose_Estimation_with_Global_Signed_Distance_CVPR_2024_paper.pdf + Human hands are highly articulated and versatile at handling objects. Jointly estimating the 3D poses of a hand and the object it manipulates from a monocular camera is challenging due to frequent occlusions. Thus existing methods often rely on intermediate 3D shape representations to increase performance. These representations are typically explicit such as 3D point clouds or meshes and thus provide information in the direct surroundings of the intermediate hand pose estimate. To address this we introduce HOISDF a Signed Distance Field (SDF) guided hand-object pose estimation network which jointly exploits hand and object SDFs to provide a global implicit representation over the complete reconstruction volume. Specifically the role of the SDFs is threefold: equip the visual encoder with implicit shape information help to encode hand-object interactions and guide the hand and object pose regression via SDF-based sampling and by augmenting the feature representations. We show that HOISDF achieves state-of-the-art results on hand-object pose estimation benchmarks (DexYCB and HO3Dv2). Code is available https://github.com/amathislab/HOISDF. 
+ + + + DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Gu_DiffPortrait3D_Controllable_Diffusion_for_Zero-Shot_Portrait_View_Synthesis_CVPR_2024_paper.pdf + We present DiffPortrait3D a conditional diffusion model that is capable of synthesizing 3D-consistent photo-realistic novel views from as few as a single in-the-wild portrait. Specifically given a single RGB input we aim to synthesize plausible but consistent facial details rendered from novel camera views while retaining both identity and facial expression. In lieu of time-consuming optimization and fine-tuning our zero-shot method generalizes well to arbitrary face portraits with unposed camera views extreme facial expressions and diverse artistic depictions. At its core we leverage the generative prior of 2D diffusion models pre-trained on large-scale image datasets as our rendering backbone while the denoising is guided with disentangled attentive control of appearance and camera pose. To achieve this we first inject the appearance context from the reference image into the self-attention layers of the frozen UNets. The rendering view is then manipulated with a novel conditional control module that interprets the camera pose by watching a condition image of a crossed subject from the same view. Furthermore we insert a trainable cross-view attention module to enhance view consistency which is further strengthened with a novel 3D-aware noise generation process during inference. We demonstrate state-of-the-art results both qualitatively and quantitatively on our challenging in-the-wild and multi-view benchmarks. + + + + H-ViT: A Hierarchical Vision Transformer for Deformable Image Registration + http://openaccess.thecvf.com//content/CVPR2024/papers/Ghahremani_H-ViT_A_Hierarchical_Vision_Transformer_for_Deformable_Image_Registration_CVPR_2024_paper.pdf + This paper introduces a novel top-down representation approach for deformable image registration which estimates the deformation field by capturing various short- and long-range flow features at different scale levels. As a Hierarchical Vision Transformer (H-ViT) we propose a dual self-attention and cross-attention mechanism that uses high-level features in the deformation field to represent low-level ones enabling information streams in the deformation field across all voxel patch embeddings irrespective of their spatial proximity. Since high-level features contain abstract flow patterns such patterns are expected to effectively contribute to the representation of the deformation field in lower scales. When the self-attention module utilizes within-scale short-range patterns for representation the cross-attention modules dynamically look for the key tokens across different scales to further interact with the local query voxel patches. Our method shows superior accuracy and visual quality over the state-of-the-art registration methods in five publicly available datasets highlighting a substantial enhancement in the performance of medical imaging registration. The project link is available at https://mogvision.github.io/hvit. + + + + VideoLLM-online: Online Video Large Language Model for Streaming Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_VideoLLM-online_Online_Video_Large_Language_Model_for_Streaming_Video_CVPR_2024_paper.pdf + Large Language Models (LLMs) have been enhanced with vision capabilities enabling them to comprehend images videos and interleaved vision-language content.
However the learning methods of these large multimodal models (LMMs) typically treat videos as predetermined clips rendering them less effective and efficient at handling streaming video inputs. In this paper we propose a novel Learning-In-Video-Stream (LIVE) framework which enables temporally aligned long-context and real-time dialogue within a continuous video stream. Our LIVE framework comprises comprehensive approaches to achieve video streaming dialogue encompassing: (1) a training objective designed to perform language modeling for continuous streaming inputs (2) a data generation scheme that converts offline temporal annotations into a streaming dialogue format and (3) an optimized inference pipeline to speed up interactive chat in real-world video streams. With our LIVE framework we develop a simplified model called VideoLLM-online and demonstrate its significant advantages in processing streaming videos. For instance our VideoLLM-online-7B model can operate at over 10 FPS on an A100 GPU for a 5-minute video clip from Ego4D narration. Moreover VideoLLM-online also showcases state-of-the-art performance on public offline video benchmarks such as recognition captioning and forecasting. The code model data and demo have been made available at showlab.github.io/videollm-online. + + + + Towards Better Vision-Inspired Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Towards_Better_Vision-Inspired_Vision-Language_Models_CVPR_2024_paper.pdf + Vision-language (VL) models have achieved unprecedented success recently in which the connection module is the key to bridge the modality gap. Nevertheless the abundant visual clues are not sufficiently exploited in most existing methods. On the vision side most existing approaches only use the last feature of the vision tower without using the low-level features. On the language side most existing methods only introduce shallow vision-language interactions. In this paper we present a vision-inspired vision-language connection module dubbed as VIVL which efficiently exploits the vision cue for VL models. To take advantage of the lowerlevel information from the vision tower a feature pyramid extractor (FPE) is introduced to combine features from different intermediate layers which enriches the visual cue with negligible parameters and computation overhead. To enhance VL interactions we propose deep vision-conditioned prompts (DVCP) that allows deep interactions of vision and language features efficiently. Our VIVL exceeds the previous state-of-the-art method by 18.1 CIDEr when training from scratch on the COCO caption task which greatly improves the data efficiency. When used as a plug-in module VIVL consistently improves the performance for various backbones and VL frameworks delivering new state-of-the-art results on multiple benchmarks e.g. NoCaps and VQAv2. + + + + VSRD: Instance-Aware Volumetric Silhouette Rendering for Weakly Supervised 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_VSRD_Instance-Aware_Volumetric_Silhouette_Rendering_for_Weakly_Supervised_3D_Object_CVPR_2024_paper.pdf + Monocular 3D object detection poses a significant challenge in 3D scene understanding due to its inherently ill-posed nature in monocular depth estimation. Existing methods heavily rely on supervised learning using abundant 3D labels typically obtained through expensive and labor-intensive annotation on LiDAR point clouds. 
To tackle this problem we propose a novel weakly supervised 3D object detection framework named VSRD (Volumetric Silhouette Rendering for Detection) to train 3D object detectors without any 3D supervision but only weak 2D supervision. VSRD consists of multi-view 3D auto-labeling and subsequent training of monocular 3D object detectors using the pseudo labels generated in the auto-labeling stage. In the auto-labeling stage we represent the surface of each instance as a signed distance field (SDF) and render its silhouette as an instance mask through our proposed instance-aware volumetric silhouette rendering. To directly optimize the 3D bounding boxes through rendering we decompose the SDF of each instance into the SDF of a cuboid and the residual distance field (RDF) that represents the residual from the cuboid. This mechanism enables us to optimize the 3D bounding boxes in an end-to-end manner by comparing the rendered instance masks with the ground truth instance masks. The optimized 3D bounding boxes serve as effective training data for 3D object detection. We conduct extensive experiments on the KITTI-360 dataset demonstrating that our method outperforms the existing weakly supervised 3D object detection methods. The code is available at https://github.com/skmhrk1209/VSRD. + + + + RILA: Reflective and Imaginative Language Agent for Zero-Shot Semantic Audio-Visual Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_RILA_Reflective_and_Imaginative_Language_Agent_for_Zero-Shot_Semantic_Audio-Visual_CVPR_2024_paper.pdf + We leverage Large Language Models (LLM) for zeroshot Semantic Audio Visual Navigation (SAVN). Existing methods utilize extensive training demonstrations for reinforcement learning yet achieve relatively low success rates and lack generalizability. The intermittent nature of auditory signals further poses additional obstacles to inferring the goal information. To address this challenge we present the Reflective and Imaginative Language Agent (RILA). By employing multi-modal models to process sensory data we instruct an LLM-based planner to actively explore the environment. During the exploration our agent adaptively evaluates and dismisses inaccurate perceptual descriptions. Additionally we introduce an auxiliary LLMbased assistant to enhance global environmental comprehension by mapping room layouts and providing strategic insights. Through comprehensive experiments and analysis we show that our method outperforms relevant baselines without training demonstrations from the environment and complementary semantic information. + + + + Endow SAM with Keen Eyes: Temporal-spatial Prompt Learning for Video Camouflaged Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Hui_Endow_SAM_with_Keen_Eyes_Temporal-spatial_Prompt_Learning_for_Video_CVPR_2024_paper.pdf + The Segment Anything Model (SAM) a prompt-driven foundational model has demonstrated remarkable performance in natural image segmentation. However its application in video camouflaged object detection (VCOD) encounters challenges chiefly stemming from the overlooked temporal-spatial associations and the unreliability of user-provided prompts for camouflaged objects that are difficult to discern with the naked eye. To tackle the above issues we endow SAM with keen eyes and propose the Temporal-spatial Prompt SAM (TSP-SAM) a novel approach tailored for VCOD via an ingenious prompted learning scheme. 
Firstly motion-driven self-prompt learning is employed to capture the camouflaged object thereby bypassing the need for user-provided prompts. With the detected subtle motion cues across consecutive video frames the overall movement of the camouflaged object is captured for more precise spatial localization. Subsequently to eliminate the prompt bias resulting from inter-frame discontinuities the long-range consistency within the video sequences is taken into account to promote the robustness of the self-prompts. It is also injected into the encoder of SAM to enhance the representational capabilities. Extensive experimental results on two benchmarks demonstrate that the proposed TSP-SAM achieves a significant improvement over the state-of-the-art methods. With the mIoU metric increasing by 7.8% and 9.6% TSP-SAM emerges as a groundbreaking step forward in the field of VCOD. + + + + Forgery-aware Adaptive Transformer for Generalizable Synthetic Image Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Forgery-aware_Adaptive_Transformer_for_Generalizable_Synthetic_Image_Detection_CVPR_2024_paper.pdf + In this paper we study the problem of generalizable synthetic image detection aiming to detect forgery images from diverse generative methods e.g. GANs and diffusion models. Cutting-edge solutions start to explore the benefits of pre-trained models and mainly follow the fixed paradigm of solely training an attached classifier e.g. combining frozen CLIP-ViT with a learnable linear layer in UniFD. However our analysis shows that such a fixed paradigm is prone to yield detectors with insufficient learning regarding forgery representations. We attribute the key challenge to the lack of forgery adaptation and present a novel forgery-aware adaptive transformer approach namely FatFormer. Based on the pre-trained vision-language spaces of CLIP FatFormer introduces two core designs for the adaption to build generalized forgery representations. First motivated by the fact that both image and frequency analysis are essential for synthetic image detection we develop a forgery-aware adapter to adapt image features to discern and integrate local forgery traces within image and frequency domains. Second we find that considering the contrastive objectives between adapted image features and text prompt embeddings a previously overlooked aspect results in a nontrivial generalization improvement. Accordingly we introduce language-guided alignment to supervise the forgery adaptation with image and text prompts in FatFormer. Experiments show that by coupling these two designs our approach tuned on 4-class ProGAN data attains a remarkable detection performance achieving an average of 98% accuracy to unseen GANs and surprisingly generalizes to unseen diffusion models with 95% accuracy. + + + + PostureHMR: Posture Transformation for 3D Human Mesh Recovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_PostureHMR_Posture_Transformation_for_3D_Human_Mesh_Recovery_CVPR_2024_paper.pdf + Human Mesh Recovery (HMR) aims to estimate the 3D human body from 2D images which is a challenging task due to inherent ambiguities in translating 2D observations to 3D space. A novel approach called PostureHMR is proposed to leverage a multi-step diffusion-style process which converts this task into a posture transformation from an SMPL T-pose mesh to the target mesh. 
To inject the learning process of posture transformation with the physical structure of the human body model a kinematics-based forward process is proposed to interpolate the intermediate state with pose and shape decomposition. Moreover a mesh-to-posture (M2P) decoder is designed by combining the input of 3D and 2D mesh constraints estimated from the image to model the posture changes in the reverse process. It mitigates the difficulties of posture change learning directly from RGB pixels. To overcome the limitation of pixel-level misalignment of modeling results with the input image a new trimap-based rendering loss is designed to highlight the areas with poor recognition. Experiments conducted on three widely used datasets demonstrate that the proposed approach outperforms the state-of-the-art methods. + + + + Dynamic Adapter Meets Prompt Tuning: Parameter-Efficient Transfer Learning for Point Cloud Analysis + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Dynamic_Adapter_Meets_Prompt_Tuning_Parameter-Efficient_Transfer_Learning_for_Point_CVPR_2024_paper.pdf + Point cloud analysis has achieved outstanding performance by transferring point cloud pre-trained models. However existing methods for model adaptation usually update all model parameters i.e. full fine-tuning paradigm which is inefficient as it relies on high computational costs (e.g. training GPU memory) and massive storage space. In this paper we aim to study parameter-efficient transfer learning for point cloud analysis with an ideal trade-off between task performance and parameter efficiency. To achieve this goal we freeze the parameters of the default pre-trained models and then propose the Dynamic Adapter which generates a dynamic scale for each token considering the token significance to the downstream task. We further seamlessly integrate Dynamic Adapter with Prompt Tuning (DAPT) by constructing Internal Prompts capturing the instance-specific features for interaction. Extensive experiments conducted on five challenging datasets demonstrate that the proposed DAPT achieves superior performance compared to the full fine-tuning counterparts while significantly reducing the trainable parameters and training GPU memory by 95% and 35% respectively. Code is available at https://github.com/LMD0311/DAPT. + + + + Wonder3D: Single Image to 3D using Cross-Domain Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Long_Wonder3D_Single_Image_to_3D_using_Cross-Domain_Diffusion_CVPR_2024_paper.pdf + In this work we introduce Wonder3D a novel method for generating high-fidelity textured meshes from single-view images with remarkable efficiency. Recent methods based on the Score Distillation Sampling (SDS) loss methods have shown the potential to recover 3D geometry from 2D diffusion priors but they typically suffer from time-consuming per-shape optimization and inconsistent geometry. In contrast certain works directly produce 3D information via fast network inferences but their results are often of low quality and lack geometric details. To holistically improve the quality consistency and efficiency of image-to-3D tasks we propose a cross-domain diffusion model that generates multi-view normal maps and the corresponding color images. To ensure consistency we employ a multi-view cross-domain attention mechanism that facilitates information exchange across views and modalities. 
Lastly we introduce a geometry-aware normal fusion algorithm that extracts high-quality surfaces from the multi-view 2D representations in only 2 to 3 minutes. Our extensive evaluations demonstrate that our method achieves high-quality reconstruction results robust generalization and remarkable efficiency compared to prior works. + + + + RichDreamer: A Generalizable Normal-Depth Diffusion Model for Detail Richness in Text-to-3D + http://openaccess.thecvf.com//content/CVPR2024/papers/Qiu_RichDreamer_A_Generalizable_Normal-Depth_Diffusion_Model_for_Detail_Richness_in_CVPR_2024_paper.pdf + Lifting 2D diffusion for 3D generation is a challenging problem due to the lack of geometric prior and the complex entanglement of materials and lighting in natural images. Existing methods have shown promise by first creating the geometry through score-distillation sampling (SDS) applied to rendered surface normals followed by appearance modeling. However relying on a 2D RGB diffusion model to optimize surface normals is suboptimal due to the distribution discrepancy between natural images and normal maps leading to instability in optimization. In this paper recognizing that the normal and depth information effectively describes scene geometry and can be automatically estimated from images we propose to learn a generalizable Normal-Depth diffusion model for 3D generation. We achieve this by training on the large-scale LAION dataset together with the generalizable image-to-depth and normal prior models. In an attempt to alleviate the mixed illumination effects in the generated materials we introduce an albedo diffusion model to impose data-driven constraints on the albedo component. Our experiments show that when integrated into existing text-to-3D pipelines our models significantly enhance the detail richness achieving state-of-the-art results. Our project page is at https://aigc3d.github.io/richdreamer/. + + + + Zero-shot Referring Expression Comprehension via Structural Similarity Between Images and Captions + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_Zero-shot_Referring_Expression_Comprehension_via_Structural_Similarity_Between_Images_and_CVPR_2024_paper.pdf + Zero-shot referring expression comprehension aims at localizing bounding boxes in an image corresponding to provided textual prompts which requires: (i) a fine-grained disentanglement of complex visual scene and textual context and (ii) a capacity to understand relationships among disentangled entities. Unfortunately existing large vision-language alignment (VLA) models e.g. CLIP struggle with both aspects so cannot be directly used for this task. To mitigate this gap we leverage large foundation models to disentangle both images and texts into triplets in the format of (subject predicate object). After that grounding is accomplished by calculating the structural similarity matrix between visual and textual triplets with a VLA model and subsequently propagating it to an instance-level similarity matrix. Furthermore to equip VLA models with the ability of relationship understanding we design a triplet-matching objective to fine-tune the VLA models on a collection of curated datasets containing abundant entity relationships. Experiments demonstrate a visual grounding performance increase of up to 19.5% over the SOTA zero-shot model on RefCOCO/+/g. On the more challenging Who's Waldo dataset our zero-shot approach achieves comparable accuracy to the fully supervised model. Code is available at https://github.com/Show-han/Zeroshot_REC.
+ + + + Triplane Meets Gaussian Splatting: Fast and Generalizable Single-View 3D Reconstruction with Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Zou_Triplane_Meets_Gaussian_Splatting_Fast_and_Generalizable_Single-View_3D_Reconstruction_CVPR_2024_paper.pdf + Recent advancements in 3D reconstruction from single images have been driven by the evolution of generative models. Prominent among these are methods based on Score Distillation Sampling (SDS) and the adaptation of diffusion models in the 3D domain. Despite their progress these techniques often face limitations due to slow optimization or rendering processes leading to extensive training and optimization times. In this paper we introduce a novel approach for single-view reconstruction that efficiently generates a 3D model from a single image via feed-forward inference. Our method utilizes two transformer-based networks namely a point decoder and a triplane decoder to reconstruct 3D objects using a hybrid Triplane-Gaussian intermediate representation. This hybrid representation strikes a balance achieving a faster rendering speed compared to implicit representations while simultaneously delivering superior rendering quality than explicit representations. The point decoder is designed for generating point clouds from single images offering an explicit representation which is then utilized by the triplane decoder to query Gaussian features for each point. This design choice addresses the challenges associated with directly regressing explicit 3D Gaussian attributes characterized by their non-structural nature. Subsequently the 3D Gaussians are decoded by an MLP to enable rapid rendering through splatting. Both decoders are built upon a scalable transformer-based architecture and have been efficiently trained on large-scale 3D datasets. The evaluations conducted on both synthetic datasets and real-world images demonstrate that our method not only achieves higher quality but also ensures a faster runtime in comparison to previous state-of-the-art techniques. Please see our project page at https://zouzx.github.io/TriplaneGaussian/ + + + + WateRF: Robust Watermarks in Radiance Fields for Protection of Copyrights + http://openaccess.thecvf.com//content/CVPR2024/papers/Jang_WateRF_Robust_Watermarks_in_Radiance_Fields_for_Protection_of_Copyrights_CVPR_2024_paper.pdf + The advances in the Neural Radiance Fields (NeRF) research offer extensive applications in diverse domains but protecting their copyrights has not yet been researched in depth. Recently NeRF watermarking has been considered one of the pivotal solutions for safely deploying NeRF-based 3D representations. However existing methods are designed to apply only to implicit or explicit NeRF representations. In this work we introduce an innovative watermarking method that can be employed in both representations of NeRF. This is achieved by fine-tuning NeRF to embed binary messages in the rendering process. In detail we propose utilizing the discrete wavelet transform in the NeRF space for watermarking. Furthermore we adopt a deferred back-propagation technique and introduce a combination with the patch-wise loss to improve rendering quality and bit accuracy with minimum trade-offs. We evaluate our method in three different aspects: capacity invisibility and robustness of the embedded watermarks in the 2D-rendered images. Our method achieves state-of-the-art performance with faster training speed over the compared state-of-the-art methods. 
Project page: https://kuai-lab.github.io/cvpr2024waterf/ + + + + Instance-aware Contrastive Learning for Occluded Human Mesh Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Gwon_Instance-aware_Contrastive_Learning_for_Occluded_Human_Mesh_Reconstruction_CVPR_2024_paper.pdf + A simple yet effective method for occlusion-robust 3D human mesh reconstruction from a single image is presented in this paper. Although many recent studies have shown the remarkable improvement in human mesh reconstruction it is still difficult to generate accurate meshes when person-to-person occlusion occurs due to the ambiguity of who a body part belongs to. To address this problem we propose an instance-aware contrastive learning scheme. Specifically joint features belonging to the target human are trained to be proximate with the anchor feature (i.e. feature extracted from the body center position). On the other hand anchor features of different human instances are forced to be far apart so that joint features of each person can be clearly distinguished from others. By interpreting the joint possession based on such contrastive learning scheme the proposed method easily understands the spatial occupancy of body parts for each person in a given image thus can reconstruct reliable human meshes even with severely overlapped cases between multiple persons. Experimental results on benchmark datasets demonstrate the robustness of the proposed method compared to previous approaches under person-to-person occlusions. The code and model are publicly available at: https://github.com/DCVL-3D/InstanceHMR_release. + + + + Robust Noisy Correspondence Learning with Equivariant Similarity Consistency + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Robust_Noisy_Correspondence_Learning_with_Equivariant_Similarity_Consistency_CVPR_2024_paper.pdf + The surge in multi-modal data has propelled cross-modal matching to the forefront of research interest. However the challenge lies in the laborious and expensive process of curating a large and accurately matched multimodal dataset. Commonly sourced from the Internet these datasets often suffer from a significant presence of mismatched data impairing the performance of matching models. To address this problem we introduce a novel regularization approach named Equivariant Similarity Consistency (ESC) which can facilitate robust clean and noisy data separation and improve the training for cross-modal matching. Intuitively our method posits that the semantic variations caused by image changes should be proportional to those caused by text changes for any two matched samples. Accordingly we first calculate the ESC by comparing image and text semantic variations between a set of elaborated anchor points and other undivided training data. Then pairs with high ESC are filtered out as noisy correspondence pairs. We implement our method by combining the ESC with a traditional hinge-based triplet loss. Extensive experiments on three widely used datasets including Flickr30K MS-COCO and Conceptual Captions verify the effectiveness of our method. + + + + Compositional Video Understanding with Spatiotemporal Structure-based Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Yun_Compositional_Video_Understanding_with_Spatiotemporal_Structure-based_Transformers_CVPR_2024_paper.pdf + In this paper we suggest a new novel method to understand complex semantic structures through long video inputs. 
Conventional methods for understanding videos have been focused on short-term clips and trained to get visual representations for the short clips using convolutional neural networks or transformer architectures. However most real-world videos are composed of long videos ranging from minutes to hours therefore it essentially brings limitations to understanding the overall semantic structures of the long videos by dividing them into small clips and learning the representations of them. We suggest a new algorithm to learn the multi-granular semantic structures of videos by defining spatiotemporal high-order relationships among object-based representations as semantic units. The proposed method includes a new transformer architecture capable of learning spatiotemporal graphs and a compositional learning method to learn disentangled features for each semantic unit. Using the suggested method we resolve the challenging video task which is compositional generalization understanding of unseen videos. In experiments we demonstrate new state-of-the-art performances for two challenging video datasets. + + + + 3D LiDAR Mapping in Dynamic Environments using a 4D Implicit Neural Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhong_3D_LiDAR_Mapping_in_Dynamic_Environments_using_a_4D_Implicit_CVPR_2024_paper.pdf + Building accurate maps is a key building block to enable reliable localization planning and navigation of autonomous vehicles. We propose a novel approach for building accurate 3D maps of dynamic environments utilizing a sequence of LiDAR scans. To this end we propose encoding the 4D scene into a novel spatio-temporal implicit neural map representation by fitting a time-dependent truncated signed distance function to each point. Using our representation we can extract the static map by filtering the dynamic parts. Our neural representation is based on sparse feature grids a globally shared decoder and time-dependent basis functions which can be jointly optimized in an unsupervised fashion. To learn this representation from a sequence of LiDAR scans we design a simple yet efficient loss function to supervise the map optimization in a piecewise way. We evaluate our approach on various scenes containing moving objects in terms of the reconstruction quality of static maps and the segmentation of dynamic point clouds. The experimental results demonstrate that our method is capable of removing the dynamic part of the input point clouds while reconstructing accurate and complete large-scale 3D maps outperforming several state-of-the-art methods for static map generation and scene reconstruction. + + + + What When and Where? Self-Supervised Spatio-Temporal Grounding in Untrimmed Multi-Action Videos from Narrated Instructions + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_What_When_and_Where_Self-Supervised_Spatio-Temporal_Grounding_in_Untrimmed_Multi-Action_CVPR_2024_paper.pdf + Spatio-temporal grounding describes the task of localizing events in space and time e.g. in video data based on verbal descriptions only. Models for this task are usually trained with human-annotated sentences and bounding box supervision. This work addresses this task from a multimodal supervision perspective proposing a framework for spatio-temporal action grounding trained on loose video and subtitle supervision only without human annotation. 
To this end we combine local representation learning which focuses on leveraging fine-grained spatial information with a global representation encoding that captures higher-level representations and incorporates both in a joint approach. To evaluate this challenging task in a real-life setting a new benchmark dataset is proposed providing dense spatio-temporal grounding annotations in long untrimmed multi-action instructional videos for over 5K events. We evaluate the proposed approach and other methods on the proposed and standard downstream tasks showing that our method improves over current baselines in various settings including spatial temporal and untrimmed multi-action spatio-temporal grounding. + + + + FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects + http://openaccess.thecvf.com//content/CVPR2024/papers/Wen_FoundationPose_Unified_6D_Pose_Estimation_and_Tracking_of_Novel_Objects_CVPR_2024_paper.pdf + We present FoundationPose a unified foundation model for 6D object pose estimation and tracking supporting both model-based and model-free setups. Our approach can be instantly applied at test-time to a novel object without finetuning as long as its CAD model is given or a small number of reference images are captured. Thanks to the unified framework the downstream pose estimation modules are the same in both setups with a neural implicit representation used for efficient novel view synthesis when no CAD model is available. Strong generalizability is achieved via large-scale synthetic training aided by a large language model (LLM) a novel transformer-based architecture and contrastive learning formulation. Extensive evaluation on multiple public datasets involving challenging scenarios and objects indicate our unified approach outperforms existing methods specialized for each task by a large margin. In addition it even achieves comparable results to instance-level methods despite the reduced assumptions. Project page: https://nvlabs.github.io/FoundationPose/ + + + + Hyperbolic Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Hyperbolic_Anomaly_Detection_CVPR_2024_paper.pdf + Anomaly detection is a challenging computer vision task in industrial scenario. Advancements in deep learning constantly revolutionize vision-based anomaly detection methods and considerable progress has been made in both supervised and self-supervised anomaly detection. The commonly-used pipeline is to optimize the model by constraining the feature embeddings using a distance-based loss function. However these methods work in Euclidean space and they cannot well exploit the data lied in non-Euclidean space. In this paper we are the first to explore anomaly detection task in hyperbolic space that is a representative of non-Euclidean space and propose a hyperbolic anomaly detection (HypAD) method. Specifically we first extract image features and then map them from Euclidean space to hyperbolic space where the hyperbolic distance metric is employed to optimize the proposed HypAD. Extensive experiments on the benchmarking datasets including MVTec AD and VisA show that our HypAD approach obtains the state-of-the-art performance demonstrating the effectiveness of our HypAD and the promise of investigating anomaly detection in hyperbolic space. 
+ + + + VLP: Vision Language Planning for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Pan_VLP_Vision_Language_Planning_for_Autonomous_Driving_CVPR_2024_paper.pdf + Autonomous driving is a complex and challenging task that aims at safe motion planning through scene understanding and reasoning. While vision-only autonomous driving methods have recently achieved notable performance through enhanced scene understanding several key issues including lack of reasoning low generalization performance and long-tail scenarios still need to be addressed. In this paper we present VLP a novel Vision-Language-Planning framework that exploits language models to bridge the gap between linguistic understanding and autonomous driving. VLP enhances autonomous driving systems by strengthening both the source memory foundation and the self-driving car's contextual understanding. VLP achieves state-of-the-art end-to-end planning performance on the challenging NuScenes dataset by achieving 35.9% and 60.5% reduction in terms of average L2 error and collision rates respectively compared to the previous best method. Moreover VLP shows improved performance in challenging long-tail scenarios and strong generalization capabilities when faced with new urban environments. + + + + ProMark: Proactive Diffusion Watermarking for Causal Attribution + http://openaccess.thecvf.com//content/CVPR2024/papers/Asnani_ProMark_Proactive_Diffusion_Watermarking_for_Causal_Attribution_CVPR_2024_paper.pdf + Generative AI (GenAI) is transforming creative workflows through the capability to synthesize and manipulate images via high-level prompts. Yet creatives are not well supported to receive recognition or reward for the use of their content in GenAI training. To this end we propose ProMark a causal attribution technique to attribute a synthetically generated image to its training data concepts like objects motifs templates artists or styles. The concept information is proactively embedded into the input training images using imperceptible watermarks and the diffusion models (unconditional or conditional) are trained to retain the corresponding watermarks in generated images. We show that we can embed as many as 2^16 unique watermarks into the training data and each training image can contain more than one watermark. ProMark can maintain image quality whilst outperforming correlation-based attribution. Finally several qualitative examples are presented providing the confidence that the presence of the watermark conveys a causative relationship between training data and synthetic images. + + + + Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering + http://openaccess.thecvf.com//content/CVPR2024/papers/Khan_Consistency_and_Uncertainty_Identifying_Unreliable_Responses_From_Black-Box_Vision-Language_Models_CVPR_2024_paper.pdf + The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model require retraining a model or study only unimodal models. However the most powerful models (e.g. GPT-4) are typically only available as black boxes with inaccessible internals are not retrainable by end-users and are frequently used for multimodal tasks.
We study the possibility of selective prediction for vision-language models in a realistic black-box setting. We propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model in question answering tasks. We hypothesize that given only a visual question and model response the consistency of the model's responses over the neighborhood of a visual question will indicate reliability. It is impossible to directly sample neighbors in feature space in a black-box setting. Instead we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable even in adversarial settings or settings that are out-of-distribution to the proxy model. + + + + Implicit Motion Function + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Implicit_Motion_Function_CVPR_2024_paper.pdf + Recent advancements in video modeling extensively rely on optical flow to represent the relationships across frames but this approach often lacks efficiency and fails to model the probability of the intrinsic motion of objects. In addition conventional encoder-decoder frameworks in video processing focus on modeling the correlation in the encoder leading to limited generative capabilities and redundant intermediate representations. To address these challenges this paper proposes a novel Implicit Motion Function (IMF) method. Our approach utilizes a low-dimensional latent token as the implicit representation along with the use of cross-attention to implicitly model the correlation between frames. This enables the implicit modeling of temporal correlations and understanding of object motions. Our method not only improves sparsity and efficiency in representation but also explores the generative capabilities of the decoder by integrating correlation modeling within it. The IMF framework facilitates video editing and other generative tasks by allowing the direct manipulation of latent tokens. We validate the effectiveness of IMF through extensive experiments on multiple video tasks demonstrating superior performance in terms of reconstructed video quality compression efficiency and generation ability. + + + + MultiDiff: Consistent Novel View Synthesis from a Single Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Muller_MultiDiff_Consistent_Novel_View_Synthesis_from_a_Single_Image_CVPR_2024_paper.pdf + We introduce MultiDiff a novel approach for consistent novel view synthesis of scenes from a single RGB image. The task of synthesizing novel views from a single reference image is highly ill-posed by nature as there exist multiple plausible explanations for unobserved areas. To address this issue we incorporate strong priors in form of monocular depth predictors and video-diffusion models. Monocular depth enables us to condition our model on warped reference images for the target views increasing geometric stability. The video-diffusion prior provides a strong proxy for 3D scenes allowing the model to learn continuous and pixel-accurate correspondences across generated images. 
In contrast to approaches relying on autoregressive image generation that are prone to drifts and error accumulation MultiDiff jointly synthesizes a sequence of frames yielding high-quality and multi-view consistent results -- even for long-term scene generation with large camera movements while reducing inference time by an order of magnitude. For additional consistency and image quality improvements we introduce a novel structured noise distribution. Our experimental results demonstrate that MultiDiff outperforms state-of-the-art methods on the challenging real-world datasets RealEstate10K and ScanNet. Finally our model naturally supports multi-view consistent editing without the need for further tuning. + + + + Atom-Level Optical Chemical Structure Recognition with Limited Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Oldenhof_Atom-Level_Optical_Chemical_Structure_Recognition_with_Limited_Supervision_CVPR_2024_paper.pdf + Identifying the chemical structure from a graphical representation or image of a molecule is a challenging pattern recognition task that would greatly benefit drug development. Yet existing methods for chemical structure recognition do not typically generalize well and show diminished effectiveness when confronted with domains where data is sparse or costly to generate such as hand-drawn molecule images. To address this limitation we propose a new chemical structure recognition tool that delivers state-of-the-art performance and can adapt to new domains with a limited number of data samples and supervision. Unlike previous approaches our method provides atom-level localization and can therefore segment the image into the different atoms and bonds. Our model is the first model to perform OCSR with atom-level entity detection with only SMILES supervision. Through rigorous and extensive benchmarking we demonstrate the preeminence of our chemical structure recognition approach in terms of data efficiency accuracy and atom-level entity prediction. + + + + LiDAR-based Person Re-identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_LiDAR-based_Person_Re-identification_CVPR_2024_paper.pdf + Camera-based person re-identification (ReID) systems have been widely applied in the field of public security. However cameras often lack the perception of 3D morphological information of human and are susceptible to various limitations such as inadequate illumination complex background and personal privacy. In this paper we propose a LiDAR-based ReID framework ReID3D that utilizes pre-training strategy to retrieve features of 3D body shape and introduces Graph-based Complementary Enhancement Encoder for extracting comprehensive features. Due to the lack of LiDAR datasets we build LReID the first LiDAR-based person ReID dataset which is collected in several outdoor scenes with variations in natural conditions. Additionally we introduce LReID-sync a simulated pedestrian dataset designed for pre-training encoders with tasks of point cloud completion and shape parameter learning. Extensive experiments on LReID show that ReID3D achieves exceptional performance with a rank-1 accuracy of 94.0 highlighting the significant potential of LiDAR in addressing person ReID tasks. To the best of our knowledge we are the first to propose a solution for LiDAR-based ReID. The code and dataset are available at https://github.com/GWxuan/ReID3D. 
+ + + + Model Adaptation for Time Constrained Embodied Control + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_Model_Adaptation_for_Time_Constrained_Embodied_Control_CVPR_2024_paper.pdf + When adopting a deep learning model for embodied agents it is required that the model structure be optimized for specific tasks and operational conditions. Such optimization can be static such as model compression or dynamic such as adaptive inference. Yet these techniques have not been fully investigated for embodied control systems subject to time constraints which necessitate sequential decision-making for multiple tasks each with distinct inference latency limitations. In this paper we present MoDeC a time constraint-aware embodied control framework using the modular model adaptation. We formulate model adaptation to varying operational conditions on resource and time restrictions as dynamic routing on a modular network incorporating these conditions as part of multi-task objectives. Our evaluation across several vision-based embodied environments demonstrates the robustness of MoDeC showing that it outperforms other model adaptation methods in both performance and adherence to time constraints in robotic manipulation and autonomous driving applications. + + + + ActiveDC: Distribution Calibration for Active Finetuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_ActiveDC_Distribution_Calibration_for_Active_Finetuning_CVPR_2024_paper.pdf + The pretraining-finetuning paradigm has gained popularity in various computer vision tasks. In this paradigm the emergence of active finetuning arises due to the abundance of large-scale data and costly annotation requirements. Active finetuning involves selecting a subset of data from an unlabeled pool for annotation facilitating subsequent finetuning. However the use of a limited number of training samples can lead to a biased distribution potentially resulting in model overfitting. In this paper we propose a new method called ActiveDC for the active finetuning tasks. Firstly we select samples for annotation by optimizing the distribution similarity between the subset to be selected and the entire unlabeled pool in continuous space. Secondly we calibrate the distribution of the selected samples by exploiting implicit category information in the unlabeled pool. The feature visualization provides an intuitive sense of the effectiveness of our approach to distribution calibration. We conducted extensive experiments on three image classification datasets with different sampling ratios. The results indicate that ActiveDC consistently outperforms the baseline performance in all image classification tasks. The improvement is particularly significant when the sampling ratio is low with performance gains of up to 10%. Our code will be released. + + + + Seeing Unseen: Discover Novel Biomedical Concepts via Geometry-Constrained Probabilistic Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Seeing_Unseen_Discover_Novel_Biomedical_Concepts_via_Geometry-Constrained_Probabilistic_Modeling_CVPR_2024_paper.pdf + Machine learning holds tremendous promise for transforming the fundamental practice of scientific discovery by virtue of its data-driven nature. With the ever-increasing stream of research data collection it would be appealing to autonomously explore patterns and insights from observational data for discovering novel classes of phenotypes and concepts. 
However in the biomedical domain there are several challenges inherently presented in the cumulated data which hamper the progress of novel class discovery. The non-i.i.d. data distribution accompanied by the severe imbalance among different groups of classes essentially leads to ambiguous and biased semantic representations. In this work we present a geometry-constrained probabilistic modeling treatment to resolve the identified issues. First we propose to parameterize the approximated posterior of instance embedding as a marginal von Mises-Fisher distribution to account for the interference of distributional latent bias. Then we incorporate a suite of critical geometric properties to impose proper constraints on the layout of constructed embedding space which in turn minimizes the uncontrollable risk for unknown class learning and structuring. Furthermore a spectral graph-theoretic method is devised to estimate the number of potential novel classes. It inherits two intriguing merits compared to existent approaches namely high computational efficiency and flexibility for taxonomy-adaptive estimation. Extensive experiments across various biomedical scenarios substantiate the effectiveness and general applicability of our method. + + + + Communication-Efficient Federated Learning with Accelerated Client Gradient + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Communication-Efficient_Federated_Learning_with_Accelerated_Client_Gradient_CVPR_2024_paper.pdf + Federated learning often suffers from slow and unstable convergence due to the heterogeneous characteristics of participating client datasets. Such a tendency is aggravated when the client participation ratio is low since the information collected from the clients has large variations. To address this challenge we propose a simple but effective federated learning framework which improves the consistency across clients and facilitates the convergence of the server model. This is achieved by making the server broadcast a global model with a lookahead gradient. This strategy enables the proposed approach to convey the projected global update information to participants effectively without additional client memory and extra communication costs. We also regularize local updates by aligning each client with the overshot global model to reduce bias and improve the stability of our algorithm. We provide the theoretical convergence rate of our algorithm and demonstrate remarkable performance gains in terms of accuracy and communication efficiency compared to the state-of-the-art methods especially with low client participation rates. The source code is available at our project page. + + + + LLMs are Good Action Recognizers + http://openaccess.thecvf.com//content/CVPR2024/papers/Qu_LLMs_are_Good_Action_Recognizers_CVPR_2024_paper.pdf + Skeleton-based action recognition has attracted lots of research attention. Recently to build an accurate skeleton-based action recognizer a variety of works have been proposed. Among them some works use large model architectures as backbones of their recognizers to boost the skeleton data representation capability while some other works pre-train their recognizers on external data to enrich the knowledge. In this work we observe that large language models which have been extensively used in various natural language processing tasks generally hold both large model architectures and rich implicit knowledge. 
Motivated by this we propose a novel LLM-AR framework in which we investigate treating the Large Language Model as an Action Recognizer. In our framework we propose a linguistic projection process to project each input action signal (i.e. each skeleton sequence) into its "sentence format" (i.e. an "action sentence"). Moreover we also incorporate our framework with several designs to further facilitate this linguistic projection process. Extensive experiments demonstrate the efficacy of our proposed framework. + + + + Interactive Continual Learning: Fast and Slow Thinking + http://openaccess.thecvf.com//content/CVPR2024/papers/Qi_Interactive_Continual_Learning_Fast_and_Slow_Thinking_CVPR_2024_paper.pdf + Advanced life forms sustained by the synergistic interaction of neural cognitive mechanisms continually acquire and transfer knowledge throughout their lifespan. In contrast contemporary machine learning paradigms exhibit limitations in emulating the facets of continual learning (CL). Nonetheless the emergence of large language models (LLMs) presents promising avenues for realizing CL via interactions with these models. Drawing on Complementary Learning System theory this paper presents a novel Interactive Continual Learning (ICL) framework enabled by collaborative interactions among models of various sizes. Specifically we assign the ViT model as System1 and multimodal LLM as System2. To enable the memory module to deduce tasks from class information and enhance Set2Set retrieval we propose the Class-Knowledge-Task Multi-Head Attention (CKT-MHA). Additionally to improve memory retrieval in System1 through enhanced geometric representation we introduce the CL-vMF mechanism based on the von Mises-Fisher (vMF) distribution. Meanwhile we introduce the von Mises-Fisher Outlier Detection and Interaction (vMF-ODI) strategy to identify hard examples thus enhancing collaboration between System1 and System2 for complex reasoning realization. Comprehensive evaluation of our proposed ICL demonstrates significant resistance to forgetting and superior performance relative to existing methods. Code is available at github.com/ICL. + + + + Towards Learning a Generalist Model for Embodied Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Towards_Learning_a_Generalist_Model_for_Embodied_Navigation_CVPR_2024_paper.pdf + Building a generalist agent that can interact with the world is an ultimate goal for humans thus spurring the research for embodied navigation where an agent is required to navigate according to instructions or respond to queries. Despite the major progress attained previous works primarily focus on task-specific agents and lack generalizability to unseen scenarios. Recently LLMs have presented remarkable capabilities across various fields and provided a promising opportunity for embodied navigation. Drawing on this we propose the first generalist model for embodied navigation NaviLLM. It adapts LLMs to embodied navigation by introducing schema-based instruction. The schema-based instruction flexibly casts various tasks into generation problems thereby unifying a wide range of tasks. This approach allows us to integrate diverse data sources from various datasets into the training equipping NaviLLM with a wide range of capabilities required by embodied navigation. We conduct extensive experiments to evaluate the performance and generalizability of our model. 
The experimental results demonstrate that our unified model achieves state-of-the-art performance on CVDN SOON and ScanQA. Specifically it surpasses the previous state-of-the-art method by a significant margin of 29% in goal progress on CVDN. Moreover our model also demonstrates strong generalizability and presents impressive results on unseen tasks e.g. embodied question answering and 3D captioning. + + + + Splatter Image: Ultra-Fast Single-View 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Szymanowicz_Splatter_Image_Ultra-Fast_Single-View_3D_Reconstruction_CVPR_2024_paper.pdf + We introduce the Splatter Image an ultra-efficient approach for monocular 3D object reconstruction. Splatter Image is based on Gaussian Splatting which allows fast and high-quality reconstruction of 3D scenes from multiple images. We apply Gaussian Splatting to monocular reconstruction by learning a neural network that at test time performs reconstruction in a feed-forward manner at 38 FPS. Our main innovation is the surprisingly straightforward design of this network which using 2D operators maps the input image to one 3D Gaussian per pixel. The resulting set of Gaussians thus has the form of an image the Splatter Image. We further extend the method to take several images as input via cross-view attention. Owing to the speed of the renderer (588 FPS) we use a single GPU for training while generating entire images at each iteration to optimize perceptual metrics like LPIPS. On several synthetic real multi-category and large-scale benchmark datasets we achieve better results in terms of PSNR LPIPS and other metrics while training and evaluating much faster than prior works. Code models and more results are available at https://szymanowiczs.github.io/splatter-image. + + + + Modeling Collaborator: Enabling Subjective Vision Classification With Minimal Human Effort via LLM Tool-Use + http://openaccess.thecvf.com//content/CVPR2024/papers/Toubal_Modeling_Collaborator_Enabling_Subjective_Vision_Classification_With_Minimal_Human_Effort_CVPR_2024_paper.pdf + From content moderation to wildlife conservation the number of applications that require models to recognize nuanced or subjective visual concepts is growing. Traditionally developing classifiers for such concepts requires substantial manual effort measured in hours days or even months to identify and annotate data needed for training. Even with recently proposed Agile Modeling techniques which enable rapid bootstrapping of image classifiers users are still required to spend 30 minutes or more of monotonous repetitive data labeling just to train a single classifier. Drawing on Fiske's Cognitive Miser theory we propose a new framework that alleviates manual effort by replacing human labeling with natural language interactions reducing the total effort required to define a concept by an order of magnitude: from labeling 2000 images to only 100 plus some natural language interactions. Our framework leverages recent advances in foundation models both large language models and vision-language models to carve out the concept space through conversation and by automatically labeling training data points. Most importantly our framework eliminates the need for crowd-sourced annotations. Moreover our framework ultimately produces lightweight classification models that are deployable in cost-sensitive scenarios.
Across 15 subjective concepts and across 2 public image classification datasets our trained models outperform traditional Agile Modeling as well as state-of-the-art zero-shot classification models like ALIGN CLIP CuPL and large visual question answering models like PaLI-X. + + + + GeoReF: Geometric Alignment Across Shape Variation for Category-level Object Pose Refinement + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_GeoReF_Geometric_Alignment_Across_Shape_Variation_for_Category-level_Object_Pose_CVPR_2024_paper.pdf + Object pose refinement is essential for robust object pose estimation. Previous work has made significant progress towards instance-level object pose refinement. Yet category-level pose refinement is a more challenging problem due to large shape variations within a category and the discrepancies between the target object and the shape prior. To address these challenges we introduce a novel architecture for category-level object pose refinement. Our approach integrates an HS-layer and learnable affine transformations which aims to enhance the extraction and alignment of geometric information. Additionally we introduce a cross-cloud transformation mechanism that efficiently merges diverse data sources. Finally we push the limits of our model by incorporating the shape prior information for translation and size error prediction. We conducted extensive experiments to demonstrate the effectiveness of the proposed framework. Through extensive quantitative experiments we demonstrate significant improvement over the baseline method by a large margin across all metrics. + + + + Learning Group Activity Features Through Person Attribute Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Nakatani_Learning_Group_Activity_Features_Through_Person_Attribute_Prediction_CVPR_2024_paper.pdf + This paper proposes Group Activity Feature (GAF) learning in which features of multi-person activity are learned as a compact latent vector. Unlike prior work in which the manual annotation of group activities is required for supervised learning our method learns the GAF through person attribute prediction without group activity annotations. By learning the whole network in an end-to-end manner so that the GAF is required for predicting the person attributes of people in a group the GAF is trained as the features of multi-person activity. As a person attribute we propose to use a person's action class and appearance features because the former is easy to annotate due to its simpleness and the latter requires no manual annotation. In addition we introduce a location-guided attribute prediction to disentangle the complex GAF for extracting the features of each target person properly. Various experimental results validate that our method outperforms SOTA methods quantitatively and qualitatively on two public datasets. Visualization of our GAF also demonstrates that our method learns the GAF representing fine-grained group activity classes. Code: https://github.com/chihina/GAFL-CVPR2024. + + + + Plug-and-Play Diffusion Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Hsiao_Plug-and-Play_Diffusion_Distillation_CVPR_2024_paper.pdf + Diffusion models have shown tremendous results in image generation. However due to the iterative nature of the diffusion process and its reliance on classifier-free guidance inference times are slow.
In this paper we propose a new distillation approach for guided diffusion models in which an external lightweight guide model is trained while the original text-to-image model remains frozen. We show that our method reduces the inference computation of classifier-free guided latent-space diffusion models by almost half and only requires 1% trainable parameters of the base model. Furthermore once trained our guide model can be applied to various fine-tuned domain-specific versions of the base diffusion model without the need for additional training: this "plug-and-play" functionality drastically improves inference computation while maintaining the visual fidelity of generated images. Empirically we show that our approach is able to produce visually appealing results and achieve a comparable FID score to the teacher with as few as 8 to 16 steps. + + + + MindBridge: A Cross-Subject Brain Decoding Framework + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_MindBridge_A_Cross-Subject_Brain_Decoding_Framework_CVPR_2024_paper.pdf + Brain decoding a pivotal field in neuroscience aims to reconstruct stimuli from acquired brain signals primarily utilizing functional magnetic resonance imaging (fMRI). Currently brain decoding is confined to a per-subject-per-model paradigm limiting its applicability to the same individual for whom the decoding model is trained. This constraint stems from three key challenges: 1) the inherent variability in input dimensions across subjects due to differences in brain size; 2) the unique intrinsic neural patterns influencing how different individuals perceive and process sensory information; 3) limited data availability for new subjects in real-world scenarios hampers the performance of decoding models. In this paper we present a novel approach MindBridge that achieves cross-subject brain decoding by employing only one model. Our proposed framework establishes a generic paradigm capable of addressing these challenges by introducing a biologically-inspired aggregation function and a novel cyclic fMRI reconstruction mechanism for subject-invariant representation learning. Notably by cycle reconstruction of fMRI MindBridge can enable novel fMRI synthesis which also can serve as pseudo data augmentation. Within the framework we also devise a novel reset-tuning method for adapting a pretrained model to a new subject. Experimental results demonstrate MindBridge's ability to reconstruct images for multiple subjects which is competitive with dedicated subject-specific models. Furthermore with limited data for a new subject we achieve a high level of decoding accuracy surpassing that of subject-specific models. This advancement in cross-subject brain decoding suggests promising directions for wider applications in neuroscience and indicates potential for more efficient utilization of limited fMRI data in real-world scenarios. Project page: https://littlepure2333.github.io/MindBridge + + + + MM-Narrator: Narrating Long-form Videos with Multimodal In-Context Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_MM-Narrator_Narrating_Long-form_Videos_with_Multimodal_In-Context_Learning_CVPR_2024_paper.pdf + We present MM-Narrator a novel system leveraging GPT-4 with multimodal in-context learning for the generation of audio descriptions (AD).
Unlike previous methods that primarily focused on downstream fine-tuning with short video clips MM-Narrator excels in generating precise audio descriptions for videos of extensive lengths even beyond hours in an autoregressive manner. This capability is made possible by the proposed memory-augmented generation process which effectively utilizes both the short-term textual context and long-term visual memory through an efficient register-and-recall mechanism. These contextual memories compile pertinent past information including storylines and character identities ensuring an accurate tracking and depicting of story-coherent and character-centric audio descriptions. Maintaining the training-free design of MM-Narrator we further propose a complexity-based demonstration selection strategy to largely enhance its multi-step reasoning capability via few-shot multimodal in-context learning (MM-ICL). Experimental results on MAD-eval dataset demonstrate that MM-Narrator consistently outperforms both the existing fine-tuning-based approaches and LLM-based approaches in most scenarios as measured by standard evaluation metrics. Additionally we introduce the first segment-based evaluator for recurrent text generation. Empowered by GPT-4 this evaluator comprehensively reasons and marks AD generation performance in various extendable dimensions. + + + + Morphable Diffusion: 3D-Consistent Diffusion for Single-image Avatar Creation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Morphable_Diffusion_3D-Consistent_Diffusion_for_Single-image_Avatar_Creation_CVPR_2024_paper.pdf + Recent advances in generative diffusion models have enabled the previously unfeasible capability of generating 3D assets from a single input image or a text prompt. In this work we aim to enhance the quality and functionality of these models for the task of creating controllable photorealistic human avatars. We achieve this by integrating a 3D morphable model into the state-of-the-art multi-view-consistent diffusion approach. We demonstrate that accurate conditioning of a generative pipeline on the articulated 3D model enhances the baseline model performance on the task of novel view synthesis from a single image. More importantly this integration facilitates a seamless and accurate incorporation of facial expression and body pose control into the generation process. To the best of our knowledge our proposed framework is the first diffusion model to enable the creation of fully 3D-consistent animatable and photorealistic human avatars from a single image of an unseen subject; extensive quantitative and qualitative evaluations demonstrate the advantages of our approach over existing state-of-the-art avatar creation models on both novel view and novel expression synthesis tasks. The code for our project is publicly available. + + + + Fully Convolutional Slice-to-Volume Reconstruction for Single-Stack MRI + http://openaccess.thecvf.com//content/CVPR2024/papers/Young_Fully_Convolutional_Slice-to-Volume_Reconstruction_for_Single-Stack_MRI_CVPR_2024_paper.pdf + In magnetic resonance imaging (MRI) slice-to-volume reconstruction (SVR) refers to computational reconstruction of an unknown 3D magnetic resonance volume from stacks of 2D slices corrupted by motion. While promising current SVR methods require multiple slice stacks for accurate 3D reconstruction leading to long scans and limiting their use in time-sensitive applications such as fetal fMRI. 
Here we propose an SVR method that overcomes the shortcomings of previous work and produces state-of-the-art reconstructions in the presence of extreme inter-slice motion. Inspired by the recent success of single-view depth estimation methods we formulate SVR as a single-stack motion estimation task and train a fully convolutional network to predict a motion stack for a given slice stack producing a 3D reconstruction as a byproduct of the predicted motion. Extensive experiments on the SVR of adult and fetal brains demonstrate that our fully convolutional method is twice as accurate as previous SVR methods. Our code is available at github.com/seannz/svr. + + + + Enhance Image Classification via Inter-Class Image Mixup with Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Enhance_Image_Classification_via_Inter-Class_Image_Mixup_with_Diffusion_Model_CVPR_2024_paper.pdf + Text-to-image (T2I) generative models have recently emerged as a powerful tool enabling the creation of photo-realistic images and giving rise to a multitude of applications. However the effective integration of T2I models into fundamental image classification tasks remains an open question. A prevalent strategy to bolster image classification performance is through augmenting the training set with synthetic images generated by T2I models. In this study we scrutinize the shortcomings of both current generative and conventional data augmentation techniques. Our analysis reveals that these methods struggle to produce images that are both faithful (in terms of foreground objects) and diverse (in terms of background contexts) for domain-specific concepts. To tackle this challenge we introduce an innovative inter-class data augmentation method known as Diff-Mix (https://github.com/Zhicaiwww/Diff-Mix) which enriches the dataset by performing image translations between classes. Our empirical results demonstrate that Diff-Mix achieves a better balance between faithfulness and diversity leading to a marked improvement in performance across diverse image classification scenarios including few-shot conventional and long-tail classifications for domain-specific datasets. + + + + Alpha-CLIP: A CLIP Model Focusing on Wherever You Want + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Alpha-CLIP_A_CLIP_Model_Focusing_on_Wherever_You_Want_CVPR_2024_paper.pdf + Contrastive Language-Image Pre-training (CLIP) plays an essential role in extracting valuable content information from images across diverse tasks. It aligns textual and visual modalities to comprehend the entire image including all the details even those irrelevant to specific tasks. However for a finer understanding and controlled editing of images it becomes crucial to focus on specific regions of interest which can be indicated as points masks or boxes by humans or perception models. To fulfill the requirements we introduce Alpha-CLIP an enhanced version of CLIP with an auxiliary alpha channel to suggest attentive regions and fine-tuned with constructed millions of RGBA region-text pairs. Alpha-CLIP not only preserves the visual recognition ability of CLIP but also enables precise control over the emphasis of image contents. It demonstrates effectiveness in various tasks including but not limited to open-world recognition multimodal large language models and conditional 2D / 3D generation. It has a strong potential to serve as a versatile tool for image-related tasks.
+ + + + ADA-Track: End-to-End Multi-Camera 3D Multi-Object Tracking with Alternating Detection and Association + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_ADA-Track_End-to-End_Multi-Camera_3D_Multi-Object_Tracking_with_Alternating_Detection_and_CVPR_2024_paper.pdf + Many query-based approaches for 3D Multi-Object Tracking (MOT) adopt the tracking-by-attention paradigm utilizing track queries for identity-consistent detection and object queries for identity-agnostic track spawning. Tracking-by-attention however entangles detection and tracking queries in one embedding for both the detection and tracking task which is sub-optimal. Other approaches resemble the tracking-by-detection paradigm detecting objects using decoupled track and detection queries followed by a subsequent association. These methods however do not leverage synergies between the detection and association task. Combining the strengths of both paradigms we introduce ADA-Track a novel end-to-end framework for 3D MOT from multi-view cameras. We introduce a learnable data association module based on edge-augmented cross-attention leveraging appearance and geometric features. Furthermore we integrate this association module into the decoder layer of a DETR-based 3D detector enabling simultaneous DETR-like query-to-image cross-attention for detection and query-to-query cross-attention for data association. By stacking these decoder layers queries are refined for the detection and association task alternately effectively harnessing the task dependencies. We evaluate our method on the nuScenes dataset and demonstrate the advantage of our approach compared to the two previous paradigms. Code is available at https://github.com/dsx0511/ADA-Track. + + + + Mind The Edge: Refining Depth Edges in Sparsely-Supervised Monocular Depth Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Talker_Mind_The_Edge_Refining_Depth_Edges_in_Sparsely-Supervised_Monocular_Depth_CVPR_2024_paper.pdf + Monocular Depth Estimation (MDE) is a fundamental problem in computer vision with numerous applications. Recently LIDAR-supervised methods have achieved remarkable per-pixel depth accuracy in outdoor scenes. However significant errors are typically found in the proximity of depth discontinuities i.e. depth edges which often hinder the performance of depth-dependent applications that are sensitive to such inaccuracies e.g. novel view synthesis and augmented reality. Since direct supervision for the location of depth edges is typically unavailable in sparse LIDAR-based scenes encouraging the MDE model to produce correct depth edges is not straightforward. To the best of our knowledge this paper is the first attempt to address the depth edges issue for LIDAR-supervised scenes. In this work we propose to learn to detect the location of depth edges from densely-supervised synthetic data and use it to generate supervision for the depth edges in the MDE training. To quantitatively evaluate our approach and due to the lack of depth edges GT in LIDAR-based scenes we manually annotated subsets of the KITTI and the DDAD datasets with depth edges ground truth. We demonstrate significant gains in the accuracy of the depth edges with comparable per-pixel depth accuracy on several challenging datasets. Code and datasets are available at https://github.com/liortalker/MindTheEdge. 
+ + + + Attention-Driven Training-Free Efficiency Enhancement of Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Attention-Driven_Training-Free_Efficiency_Enhancement_of_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion models (DMs) have exhibited superior performance in generating high-quality and diverse images. However this exceptional performance comes at the cost of expensive generation process particularly due to the heavily used attention module in leading models. Existing works mainly adopt a retraining process to enhance DM efficiency. This is computationally expensive and not very scalable. To this end we introduce the Attention-driven Training-free Efficient Diffusion Model (AT-EDM) framework that leverages attention maps to perform run-time pruning of redundant tokens without the need for any retraining. Specifically for single-denoising-step pruning we develop a novel ranking algorithm Generalized Weighted Page Rank (G-WPR) to identify redundant tokens and a similarity-based recovery method to restore tokens for the convolution operation. In addition we propose a Denoising-Steps-Aware Pruning (DSAP) approach to adjust the pruning budget across different denoising timesteps for better generation quality. Extensive evaluations show that AT-EDM performs favorably against prior art in terms of efficiency (e.g. 38.8% FLOPs saving and up to 1.53x speed-up over Stable Diffusion XL) while maintaining nearly the same FID and CLIP scores as the full model. + + + + CPR: Retrieval Augmented Generation for Copyright Protection + http://openaccess.thecvf.com//content/CVPR2024/papers/Golatkar_CPR_Retrieval_Augmented_Generation_for_Copyright_Protection_CVPR_2024_paper.pdf + Retrieval Augmented Generation (RAG) is emerging as a flexible and robust technique to adapt models to private users data without training to handle credit attribution and to allow efficient machine unlearning at scale. However RAG techniques for image generation may lead to parts of the retrieved samples being copied in the model's output. To reduce risks of leaking private information contained in the retrieved set we introduce Copy-Protected generation with Retrieval (CPR) a new method for RAG with strong copyright protection guarantees in a mixed-private setting for diffusion models. CPR allows to condition the output of diffusion models on a set of retrieved images while also guaranteeing that unique identifiable information about those examples is not exposed in the generated outputs. In particular it does so by sampling from a mixture of public (safe) distribution and private (user) distribution by merging their diffusion scores at inference. We prove that CPR satisfies Near Access Freeness (NAF) which bounds the amount of information an attacker may be able to extract from the generated images. We provide two algorithms for copyright protection CPR-KL and CPR-Choose. Unlike previously proposed rejection-sampling-based NAF methods our methods enable efficient copyright-protected sampling with a single run of backward diffusion. We show that our method can be applied to any pre-trained conditional diffusion model such as Stable Diffusion or unCLIP. In particular we empirically show that applying CPR on top of unCLIP improves quality and text-to-image alignment of the generated results (81.4 to 83.17 on TIFA benchmark) while enabling credit attribution copyright protection and deterministic constant time unlearning.
+ + + + Vision-and-Language Navigation via Causal Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Vision-and-Language_Navigation_via_Causal_Learning_CVPR_2024_paper.pdf + In the pursuit of robust and generalizable environment perception and language understanding the ubiquitous challenge of dataset bias continues to plague vision-and-language navigation (VLN) agents hindering their performance in unseen environments. This paper introduces the generalized cross-modal causal transformer (GOAT) a pioneering solution rooted in the paradigm of causal inference. By delving into both observable and unobservable confounders within vision language and history we propose the back-door and front-door adjustment causal learning (BACL and FACL) modules to promote unbiased learning by comprehensively mitigating potential spurious correlations. Additionally to capture global confounder features we propose a cross-modal feature pooling (CFP) module supervised by contrastive learning which is also shown to be effective in improving cross-modal representations during pre-training. Extensive experiments across multiple VLN datasets (R2R REVERIE RxR and SOON) underscore the superiority of our proposed method over previous state-of-the-art approaches. Code is available at https://github.com/CrystalSixone/VLN-GOAT. + + + + Unveiling Parts Beyond Objects: Towards Finer-Granularity Referring Expression Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Unveiling_Parts_Beyond_Objects_Towards_Finer-Granularity_Referring_Expression_Segmentation_CVPR_2024_paper.pdf + Referring expression segmentation (RES) aims at segmenting the foreground masks of the entities that match the descriptive natural language expression. Previous datasets and methods for classic RES task heavily rely on the prior assumption that one expression must refer to object-level targets. In this paper we take a step further to finer-grained part-level RES task. To promote the object-level RES task towards finer-grained vision-language understanding we put forward a new multi-granularity referring expression segmentation (MRES) task and construct an evaluation benchmark called RefCOCOm by manual annotations. By employing our automatic model-assisted data engine we build the largest visual grounding dataset namely MRES-32M which comprises over 32.2M high-quality masks and captions on the provided 1M images. Besides a simple yet strong model named UniRES is designed to accomplish the unified object-level and part-level grounding task. Extensive experiments on our RefCOCOm for MRES and three datasets (i.e. RefCOCO(+/g)) for classic RES task demonstrate the superiority of our method over previous state-of-the-art methods. To foster future research into fine-grained visual grounding our benchmark RefCOCOm the MRES-32M dataset and model UniRES will be publicly available at https://github.com/Rubics-Xuan/MRES. + + + + Differentiable Display Photometric Stereo + http://openaccess.thecvf.com//content/CVPR2024/papers/Choi_Differentiable_Display_Photometric_Stereo_CVPR_2024_paper.pdf + Photometric stereo leverages variations in illumination conditions to reconstruct surface normals. Display photometric stereo which employs a conventional monitor as an illumination source has the potential to overcome limitations often encountered in bulky and difficult-to-use conventional setups. 
In this paper we present differentiable display photometric stereo (DDPS) addressing an often overlooked challenge in display photometric stereo: the design of display patterns. Departing from using heuristic display patterns DDPS learns the display patterns that yield accurate normal reconstruction for a target system in an end-to-end manner. To this end we propose a differentiable framework that couples basis-illumination image formation with analytic photometric-stereo reconstruction. The differentiable framework facilitates the effective learning of display patterns via auto-differentiation. Also for training supervision we propose to use 3D printing for creating a real-world training dataset enabling accurate reconstruction on the target real-world setup. Finally we exploit that conventional LCD monitors emit polarized light which allows for the optical separation of diffuse and specular reflections when combined with a polarization camera leading to accurate normal reconstruction. Extensive evaluation of DDPS shows improved normal-reconstruction accuracy compared to heuristic patterns and demonstrates compelling properties such as robustness to pattern initialization calibration errors and simplifications in image formation and reconstruction. + + + + In-distribution Public Data Synthesis with Diffusion Models for Differentially Private Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_In-distribution_Public_Data_Synthesis_with_Diffusion_Models_for_Differentially_Private_CVPR_2024_paper.pdf + To alleviate the utility degradation of deep learning image classification with differential privacy (DP) employing extra public data or pre-trained models has been widely explored. Recently the use of in-distribution public data has been investigated where tiny subsets of datasets are released publicly. In this paper we investigate a framework that leverages recent diffusion models to amplify the information of public data. Subsequently we identify data diversity and generalization gap between public and private data as critical factors addressing the limited public data. While assuming 4% of training data as public our method achieves 85.48% on CIFAR-10 with a privacy budget of ε=2 without employing extra public data for training. + + + + LSK3DNet: Towards Effective and Efficient 3D Perception with Large Sparse Kernels + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_LSK3DNet_Towards_Effective_and_Efficient_3D_Perception_with_Large_Sparse_CVPR_2024_paper.pdf + Autonomous systems need to process large-scale sparse and irregular point clouds with limited compute resources. Consequently it is essential to develop LiDAR perception methods that are both efficient and effective. Although naively enlarging 3D kernel size can enhance performance it will also lead to a cubically-increasing overhead. Therefore it is crucial to develop streamlined 3D large kernel designs that eliminate redundant weights and work effectively with larger kernels. In this paper we propose an efficient and effective Large Sparse Kernel 3D Neural Network (LSK3DNet) that leverages dynamic pruning to amplify the 3D kernel size. Our method comprises two core components: Spatial-wise Dynamic Sparsity (SDS) and Channel-wise Weight Selection (CWS). SDS dynamically prunes and regrows volumetric weights from the beginning to learn a large sparse 3D kernel. It not only boosts performance but also significantly reduces model size and computational cost.
Moreover CWS selects the most important channels for 3D convolution during training and subsequently prunes the redundant channels to accelerate inference for 3D vision tasks. We demonstrate the effectiveness of LSK3DNet on three benchmark datasets and five tracks compared with classical models and large kernel designs. Notably LSK3DNet achieves the state-of-the-art performance on SemanticKITTI (i.e. 75.6% on single-scan and 63.4% on multi-scan) with roughly 40% model size reduction and 60% computing operations reduction compared to the naive large 3D kernel model. + + + + Diversified and Personalized Multi-rater Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Diversified_and_Personalized_Multi-rater_Medical_Image_Segmentation_CVPR_2024_paper.pdf + Annotation ambiguity due to inherent data uncertainties such as blurred boundaries in medical scans and different observer expertise and preferences has become a major obstacle for training deep-learning based medical image segmentation models. To address it the common practice is to gather multiple annotations from different experts leading to the setting of multi-rater medical image segmentation. Existing works aim to either merge different annotations into the "groundtruth" that is often unattainable in numerous medical contexts or generate diverse results or produce personalized results corresponding to individual expert raters. Here we bring up a more ambitious goal for multi-rater medical image segmentation i.e. obtaining both diversified and personalized results. Specifically we propose a two-stage framework named D-Persona (first Diversification and then Personalization). In Stage I we exploit multiple given annotations to train a Probabilistic U-Net model with a bound-constrained loss to improve the prediction diversity. In this way a common latent space is constructed in Stage I where different latent codes denote diversified expert opinions. Then in Stage II we design multiple attention-based projection heads to adaptively query the corresponding expert prompts from the shared latent space and then perform the personalized medical image segmentation. We evaluated the proposed model on our in-house Nasopharyngeal Carcinoma dataset and the public lung nodule dataset (i.e. LIDC-IDRI). Extensive experiments demonstrated our D-Persona can provide diversified and personalized results at the same time achieving new SOTA performance for multi-rater medical image segmentation. Our code will be released at https://github.com/ycwu1997/D-Persona. + + + + Discover and Mitigate Multiple Biased Subgroups in Image Classifiers + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Discover_and_Mitigate_Multiple_Biased_Subgroups_in_Image_Classifiers_CVPR_2024_paper.pdf + Machine learning models can perform well on in-distribution data but often fail on biased subgroups that are underrepresented in the training data hindering the robustness of models for reliable applications. Such subgroups are typically unknown due to the absence of subgroup labels. Discovering biased subgroups is the key to understanding models' failure modes and further improving models' robustness. Most previous works of subgroup discovery make an implicit assumption that models only underperform on a single biased subgroup which does not hold on in-the-wild data where multiple biased subgroups exist. 
In this work we propose Decomposition Interpretation and Mitigation (DIM) a novel method to address a more challenging but also more practical problem of discovering multiple biased subgroups in image classifiers. Our approach decomposes the image features into multiple components that represent multiple subgroups. This decomposition is achieved via a bilinear dimension reduction method Partial Least Square (PLS) guided by useful supervision from the image classifier. We further interpret the semantic meaning of each subgroup component by generating natural language descriptions using vision-language foundation models. Finally DIM mitigates multiple biased subgroups simultaneously via two strategies including the data- and model-centric strategies. Extensive experiments on CIFAR-100 and Breeds datasets demonstrate the effectiveness of DIM in discovering and mitigating multiple biased subgroups. Furthermore DIM uncovers the failure modes of the classifier on Hard ImageNet showcasing its broader applicability to understanding model bias in image classifiers. + + + + ExMap: Leveraging Explainability Heatmaps for Unsupervised Group Robustness to Spurious Correlations http://openaccess.thecvf.com//content/CVPR2024/papers/Chakraborty_ExMap_Leveraging_Explainability_Heatmaps_for_Unsupervised_Group_Robustness_to_Spurious_CVPR_2024_paper.pdf Group robustness strategies aim to mitigate learned biases in deep learning models that arise from spurious correlations present in their training datasets. However most existing methods rely on the access to the label distribution of the groups which is time-consuming and expensive to obtain. As a result unsupervised group robustness strategies are sought. Based on the insight that a trained model's classification strategies can be inferred accurately based on explainability heatmaps we introduce ExMap an unsupervised two-stage mechanism designed to enhance group robustness in traditional classifiers. ExMap utilizes a clustering module to infer pseudo-labels based on a model's explainability heatmaps which are then used during training in lieu of actual labels. Our empirical studies validate the efficacy of ExMap - we demonstrate that it bridges the performance gap with its supervised counterparts and outperforms existing partially supervised and unsupervised methods. Additionally ExMap can be seamlessly integrated with existing group robustness learning strategies. Finally we demonstrate its potential in tackling the emerging issue of multiple shortcut mitigation. + + + + Learning to Segment Referred Objects from Narrated Egocentric Videos http://openaccess.thecvf.com//content/CVPR2024/papers/Shen_Learning_to_Segment_Referred_Objects_from_Narrated_Egocentric_Videos_CVPR_2024_paper.pdf Egocentric videos provide a first-person perspective of the wearer's activities involving simultaneous interactions with multiple objects. In this work we propose the task of weakly-supervised Narration-based Video Object Segmentation (NVOS). Given an egocentric video clip and a narration of the wearer's activities our aim is to segment object instances mentioned in the narration without using any spatial annotations during training. Existing weakly-supervised video object grounding methods typically yield bounding boxes for referred objects. In contrast we propose ROSA a weakly-supervised pixel-level grounding framework learning alignments between referred objects and segmentation mask proposals. 
Our model harnesses vision-language models pre-trained on image-text pairs to embed region masks and object phrases. During training we combine (a) a video-narration contrastive loss that implicitly supervises the alignment between regions and phrases and (b) a region-phrase contrastive loss based on inferred latent alignments. To address the lack of annotated NVOS datasets in egocentric videos we create a new evaluation benchmark VISOR-NVOS leveraging existing annotations of segmentation masks from VISOR alongside 14.6k newly-collected object-based video clip narrations. Our approach achieves state-of-the-art zero-shot pixel-level grounding performance compared to strong baselines under similar supervision. Additionally we demonstrate generalization capabilities for zero-shot video object grounding on YouCook2 a third-person instructional video dataset. + + + + Adapting Visual-Language Models for Generalizable Anomaly Detection in Medical Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Adapting_Visual-Language_Models_for_Generalizable_Anomaly_Detection_in_Medical_Images_CVPR_2024_paper.pdf + Recent advancements in large-scale visual-language pre-trained models have led to significant progress in zero-/few-shot anomaly detection within natural image domains. However the substantial domain divergence between natural and medical images limits the effectiveness of these methodologies in medical anomaly detection. This paper introduces a novel lightweight multi-level adaptation and comparison framework to repurpose the CLIP model for medical anomaly detection. Our approach integrates multiple residual adapters into the pre-trained visual encoder enabling a stepwise enhancement of visual features across different levels. This multi-level adaptation is guided by multi-level pixel-wise visual-language feature alignment loss functions which recalibrate the model's focus from object semantics in natural imagery to anomaly identification in medical images. The adapted features exhibit improved generalization across various medical data types even in zero-shot scenarios where the model encounters unseen medical modalities and anatomical regions during training. Our experiments on medical anomaly detection benchmarks demonstrate that our method significantly surpasses current state-of-the-art models with an average AUC improvement of 6.24% and 7.33% for anomaly classification 2.03% and 2.37% for anomaly segmentation under the zero-shot and few-shot settings respectively. Source code is available at: https://github.com/MediaBrain-SJTU/MVFA-AD + + + + Depth-aware Test-Time Training for Zero-shot Video Object Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Depth-aware_Test-Time_Training_for_Zero-shot_Video_Object_Segmentation_CVPR_2024_paper.pdf + Zero-shot Video Object Segmentation (ZSVOS) aims at segmenting the primary moving object without any human annotations. Mainstream solutions mainly focus on learning a single model on large-scale video datasets which struggle to generalize to unseen videos. In this work we introduce a test-time training (TTT) strategy to address the problem. Our key insight is to enforce the model to predict consistent depth during the TTT process. In detail we first train a single network to perform both segmentation and depth prediction tasks. This can be effectively learned with our specifically designed depth modulation layer. 
Then for the TTT process the model is updated by predicting consistent depth maps for the same frame under different data augmentations. In addition we explore different TTT weight update strategies. Our empirical results suggest that the momentum-based weight initialization and looping-based training scheme lead to more stable improvements. Experiments show that the proposed method achieves clear improvements on ZSVOS. Our proposed video TTT strategy provides significant superiority over state-of-the-art TTT methods. Our code is available at: https://nifangbaage.github.io/DATTT/. + + + + RMem: Restricted Memory Banks Improve Video Object Segmentation http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_RMem_Restricted_Memory_Banks_Improve_Video_Object_Segmentation_CVPR_2024_paper.pdf With recent video object segmentation (VOS) benchmarks evolving to challenging scenarios we revisit a simple but overlooked strategy: restricting the size of memory banks. This diverges from the prevalent practice of expanding memory banks to accommodate extensive historical information. Our specially designed "memory deciphering" study offers a pivotal insight underpinning such a strategy: expanding memory banks while seemingly beneficial actually increases the difficulty for VOS modules to decode relevant features due to the confusion from redundant information. By restricting memory banks to a limited number of essential frames we achieve a notable improvement in VOS accuracy. This process balances the importance and freshness of frames to maintain an informative memory bank within a bounded capacity. Additionally restricted memory banks reduce the training-inference discrepancy in memory lengths compared with continuous expansion. This fosters new opportunities in temporal reasoning and enables us to introduce the previously overlooked "temporal positional embedding." Finally our insights are embodied in "RMem" ("R" for restricted) a simple yet effective VOS modification that excels at challenging VOS scenarios and establishes new state of the art for object state changes (VOST dataset) and long videos (the Long Videos dataset). Our codes are available at https://github.com/Restricted-Memory/RMem and our demo can be watched on https://youtu.be/V3tCFQsJrrM. + + + + Zero-TPrune: Zero-Shot Token Pruning through Leveraging of the Attention Graph in Pre-Trained Transformers http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Zero-TPrune_Zero-Shot_Token_Pruning_through_Leveraging_of_the_Attention_Graph_CVPR_2024_paper.pdf Deployment of Transformer models on edge devices is becoming increasingly challenging due to the exponentially growing inference cost that scales quadratically with the number of tokens in the input sequence. Token pruning is an emerging solution to address this challenge due to its ease of deployment on various Transformer backbones. However most token pruning methods require computationally expensive fine-tuning which is undesirable in many edge deployment cases. In this work we propose Zero-TPrune the first zero-shot method that considers both the importance and similarity of tokens in performing token pruning. It leverages the attention graph of pre-trained Transformer models to produce an importance distribution for tokens via our proposed Weighted Page Rank (WPR) algorithm. This distribution further guides token partitioning for efficient similarity-based pruning. 
Due to the elimination of the fine-tuning overhead Zero-TPrune can prune large models at negligible computational cost switch between different pruning configurations at no computational cost and perform hyperparameter tuning efficiently. We evaluate the performance of Zero-TPrune on vision tasks by applying it to various vision Transformer backbones and testing them on ImageNet. Without any fine-tuning Zero-TPrune reduces the FLOPs cost of DeiT-S by 34.7% and improves its throughput by 45.3% with only 0.4% accuracy loss. Compared with state-of-the-art pruning methods that require fine-tuning Zero-TPrune not only eliminates the need for fine-tuning after pruning but also does so with only 0.1% accuracy loss. Compared with state-of-the-art fine-tuning-free pruning methods Zero-TPrune reduces accuracy loss by up to 49% with the same or higher throughput. + + + + DIBS: Enhancing Dense Video Captioning with Unlabeled Videos via Pseudo Boundary Enrichment and Online Refinement + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_DIBS_Enhancing_Dense_Video_Captioning_with_Unlabeled_Videos_via_Pseudo_CVPR_2024_paper.pdf + We present Dive Into the Boundaries (DIBS) a novel pretraining framework for dense video captioning (DVC) that elaborates on improving the quality of the generated event captions and their associated pseudo event boundaries from unlabeled videos. By leveraging the capabilities of diverse large language models (LLMs) we generate rich DVC-oriented caption candidates and optimize the corresponding pseudo boundaries under several meticulously designed objectives considering diversity event-centricity temporal ordering and coherence. Moreover we further introduce a novel online boundary refinement strategy that iteratively improves the quality of pseudo boundaries during training. Comprehensive experiments have been conducted to examine the effectiveness of the proposed technique components. By leveraging a substantial amount of unlabeled video data such as HowTo100M we achieve a remarkable advancement on standard DVC datasets like YouCook2 and ActivityNet. We outperform the previous state-of-the-art Vid2Seq across a majority of metrics achieving this with just 0.4% of the unlabeled video data used for pre-training by Vid2Seq. + + + + SOK-Bench: A Situated Video Reasoning Benchmark with Aligned Open-World Knowledge + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_SOK-Bench_A_Situated_Video_Reasoning_Benchmark_with_Aligned_Open-World_Knowledge_CVPR_2024_paper.pdf + Reasoning from visual dynamics scenes has many real world applications. However existing video reasoning benchmarks are still inadequate since they were mainly designed for factual or situated reasoning and rarely involve broader knowledge in the real world. Our work aims to delve deeper into reasoning evaluations specifically within dynamic open-world and structured context knowledge. We propose a new benchmark (SOK-Bench) consisting of 44K questions and 10K situations with instance-level annotations depicted in the videos. The reasoning process is required to understand and apply situated knowledge and general knowledge for problem-solving. To create such a dataset we propose an automatic and scalable generation method to generate question-answer pairs knowledge graphs and rationales by instructing the combinations of LLMs and MLLMs. 
Concretely we first extract observable situated entities relations and processes from videos for situated knowledge and then extend to open-world knowledge beyond the visible content. The task generation is facilitated through multiple dialogues as iterations and subsequently corrected and refined by our designed self-promptings and demonstrations. With a corpus of both explicit situated facts and implicit commonsense we generate associated question-answer pairs and reasoning processes finally followed by manual reviews for quality assurance. We evaluated recent mainstream large vision language models on the benchmark and found several insightful conclusions. For more information please refer to our benchmark at www.bobbywu.com/SOKBench. + + + + LORS: Low-rank Residual Structure for Parameter-Efficient Network Stacking http://openaccess.thecvf.com//content/CVPR2024/papers/Li_LORS_Low-rank_Residual_Structure_for_Parameter-Efficient_Network_Stacking_CVPR_2024_paper.pdf Deep learning models particularly those based on transformers often employ numerous stacked structures which possess identical architectures and perform similar functions. While effective this stacking paradigm leads to a substantial increase in the number of parameters posing challenges for practical applications. In today's landscape of increasingly large models stacking depth can even reach dozens further exacerbating this issue. To mitigate this problem we introduce LORS (LOw-rank Residual Structure). LORS allows stacked modules to share the majority of parameters requiring a much smaller number of unique ones per module to match or even surpass the performance of using entirely distinct ones thereby significantly reducing parameter usage. We validate our method by applying it to the stacked decoders of a query-based object detector and conduct extensive experiments on the widely used MS COCO dataset. Experimental results demonstrate the effectiveness of our method as even with a 70% reduction in the parameters of the decoder our method still enables the model to achieve comparable or even better performance than its original. + + + + Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Multi-modal_In-Context_Learning_Makes_an_Ego-evolving_Scene_Text_Recognizer_CVPR_2024_paper.pdf Scene text recognition (STR) in the wild frequently encounters challenges when coping with domain variations font diversity shape deformations etc. A straightforward solution is performing model fine-tuning tailored to a specific scenario but it is computationally intensive and requires multiple model copies for various scenarios. Recent studies indicate that large language models (LLMs) can learn from a few demonstration examples in a training-free manner termed "In-Context Learning" (ICL). Nevertheless applying LLMs as a text recognizer is unacceptably resource-consuming. Moreover our pilot experiments on LLMs show that ICL fails in STR mainly attributed to the insufficient incorporation of contextual information from diverse samples in the training stage. To this end we introduce E2STR a STR model trained with context-rich scene text sequences where the sequences are generated via our proposed in-context training strategy. E2STR demonstrates that a regular-sized model is sufficient to achieve effective ICL capabilities in STR. 
Extensive experiments show that E2STR exhibits remarkable training-free adaptation in various scenarios and outperforms even the fine-tuned state-of-the-art approaches on public benchmarks. The code is released at https://github.com/bytedance/E2STR. + + + + Training Diffusion Models Towards Diverse Image Generation with Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Miao_Training_Diffusion_Models_Towards_Diverse_Image_Generation_with_Reinforcement_Learning_CVPR_2024_paper.pdf + Diffusion models have demonstrated unprecedented capabilities in image generation. Yet they incorporate and amplify the data bias (e.g. gender age) from the original training set limiting the diversity of generated images. In this paper we propose a diversity-oriented fine-tuning method using reinforcement learning (RL) for diffusion models under the guidance of an image-set-based reward function. Specifically the proposed reward function denoted as Diversity Reward utilizes a set of generated images to evaluate the coverage of the current generative distribution w.r.t. the reference distribution represented by a set of unbiased images. Built on top of the probabilistic method of distribution discrepancy estimation Diversity Reward can measure the relative distribution gap with a small set of images efficiently. We further formulate the diffusion process as a multi-step decision-making problem (MDP) and apply policy gradient methods to fine-tune diffusion models by maximizing the Diversity Reward. The proposed rewards are validated on a post-sampling selection task where a subset of the most diverse images are selected based on Diversity Reward values. We also show the effectiveness of our RL fine-tuning framework on enhancing the diversity of image generation with different types of diffusion models including class-conditional models and text-conditional models e.g. StableDiffusion. + + + + LASIL: Learner-Aware Supervised Imitation Learning For Long-term Microscopic Traffic Simulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_LASIL_Learner-Aware_Supervised_Imitation_Learning_For_Long-term_Microscopic_Traffic_Simulation_CVPR_2024_paper.pdf + Microscopic traffic simulation plays a crucial role in transportation engineering by providing insights into individual vehicle behavior and overall traffic flow. However creating a realistic simulator that accurately replicates human driving behaviors in various traffic conditions presents significant challenges. Traditional simulators relying on heuristic models often fail to deliver accurate simulations due to the complexity of real-world traffic environments. Due to the covariate shift issue existing imitation learning-based simulators often fail to generate stable long-term simulations. In this paper we propose a novel approach called learner-aware supervised imitation learning to address the covariate shift problem in multi-agent imitation learning. By leveraging a variational autoencoder simultaneously modeling the expert and learner state distribution our approach augments expert states such that the augmented state is aware of learner state distribution. Our method applied to urban traffic simulation demonstrates significant improvements over existing state-of-the-art baselines in both short-term microscopic and long-term macroscopic realism when evaluated on the real-world dataset pNEUMA. 
+ + + + SeaBird: Segmentation in Bird's View with Dice Loss Improves Monocular 3D Detection of Large Objects + http://openaccess.thecvf.com//content/CVPR2024/papers/Kumar_SeaBird_Segmentation_in_Birds_View_with_Dice_Loss_Improves_Monocular_CVPR_2024_paper.pdf + Monocular 3D detectors achieve remarkable performance on cars and smaller objects. However their performance drops on larger objects leading to fatal accidents. Some attribute the failures to training data scarcity or the receptive field requirements of large objects. In this paper we highlight this understudied problem of generalization to large objects. We find that modern frontal detectors struggle to generalize to large objects even on nearly balanced datasets. We argue that the cause of failure is the sensitivity of depth regression losses to noise of larger objects. To bridge this gap we comprehensively investigate regression and dice losses examining their robustness under varying error levels and object sizes. We mathematically prove that the dice loss leads to superior noise-robustness and model convergence for large objects compared to regression losses for a simplified case. Leveraging our theoretical insights we propose SeaBird (Segmentation in Bird's View) as the first step towards generalizing to large objects. SeaBird effectively integrates BEV segmentation on foreground objects for 3D detection with the segmentation head trained with the dice loss. SeaBird achieves SoTA results on the KITTI-360 leaderboard and improves existing detectors on the nuScenes leaderboard particularly for large objects. + + + + NOPE: Novel Object Pose Estimation from a Single Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_NOPE_Novel_Object_Pose_Estimation_from_a_Single_Image_CVPR_2024_paper.pdf + The practicality of 3D object pose estimation remains limited for many applications due to the need for prior knowledge of a 3D model and a training period for new objects. To address this limitation we propose an approach that takes a single image of a new object as input and predicts the relative pose of this object in new images without prior knowledge of the object's 3D model and without requiring training time for new objects and categories. We achieve this by training a model to directly predict discriminative embeddings for viewpoints surrounding the object. This prediction is done using a simple U-Net architecture with attention and conditioned on the desired pose which yields extremely fast inference. We compare our approach to state-of-the-art methods and show it outperforms them both in terms of accuracy and robustness. + + + + Dual-View Visual Contextualization for Web Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kil_Dual-View_Visual_Contextualization_for_Web_Navigation_CVPR_2024_paper.pdf + Automatic web navigation aims to build a web agent that can follow language instructions to execute complex and diverse tasks on real-world websites. Existing work primarily takes HTML documents as input which define the contents and action spaces (i.e. actionable elements and operations) of webpages. Nevertheless HTML documents may not provide a clear task-related context for each element making it hard to select the right (sequence of) actions. In this paper we propose to contextualize HTML elements through their "dual views" in webpage screenshots: each HTML element has its corresponding bounding box and visual content in the screenshot. 
We build upon the insight---web developers tend to arrange task-related elements nearby on webpages to enhance user experiences---and propose to contextualize each element with its neighbor elements using both textual and visual features. The resulting representations of HTML elements are more informative for the agent to take action. We validate our method on the recently released Mind2Web dataset which features diverse navigation domains and tasks on real-world websites. Our method consistently outperforms the baseline in all the scenarios including cross-task cross-website and cross-domain ones. + + + + Language-driven Grasp Detection http://openaccess.thecvf.com//content/CVPR2024/papers/Vuong_Language-driven_Grasp_Detection_CVPR_2024_paper.pdf Grasp detection is a persistent and intricate challenge with various industrial applications. Recently many methods and datasets have been proposed to tackle the grasp detection problem. However most of them do not consider using natural language as a condition to detect the grasp poses. In this paper we introduce Grasp-Anything++ a new language-driven grasp detection dataset featuring 1M samples over 3M objects and upwards of 10M grasping instructions. We utilize foundation models to create a large-scale scene corpus with corresponding images and grasp prompts. We approach the language-driven grasp detection task as a conditional generation problem. Drawing on the success of diffusion models in generative tasks and given that language plays a vital role in this task we propose a new language-driven grasp detection method based on diffusion models. Our key contribution is the contrastive training objective which explicitly contributes to the denoising process to detect the grasp pose given the language instructions. We illustrate that our approach is theoretically supportive. The intensive experiments show that our method outperforms state-of-the-art approaches and allows real-world robotic grasping. Finally we demonstrate our large-scale dataset enables zero-shot grasp detection and is a challenging benchmark for future work. + + + + Towards Modern Image Manipulation Localization: A Large-Scale Dataset and Novel Methods http://openaccess.thecvf.com//content/CVPR2024/papers/Qu_Towards_Modern_Image_Manipulation_Localization_A_Large-Scale_Dataset_and_Novel_CVPR_2024_paper.pdf In recent years image manipulation localization has attracted increasing attention due to its pivotal role in ensuring social media security. However effectively identifying forged regions remains an open challenge. The high acquisition cost and the severe scarcity of high-quality data are major factors hindering the performance improvement of modern image manipulation localization systems. To address this issue we propose a novel paradigm termed as CAAA to automatically and accurately annotate the manually forged images from the web at the pixel-level. We further propose a novel metric termed as QES to assist in filtering out unreliable annotations. With CAAA and QES we construct a large-scale diverse and high-quality dataset comprising 123150 manually forged images with mask annotations. Furthermore we develop a new model termed as APSC-Net for accurate image manipulation localization. According to extensive experiments our method outperforms previous state-of-the-art methods and our dataset significantly improves the performance of various models on the widely-used benchmarks. The dataset and codes are publicly available at https://github.com/qcf-568/MIML. 
+ + + + Object Recognition as Next Token Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Yue_Object_Recognition_as_Next_Token_Prediction_CVPR_2024_paper.pdf + We present an approach to pose object recognition as next token prediction. The idea is to apply a language decoder that auto-regressively predicts the text tokens from image embeddings to form labels. To ground this prediction process in auto-regression we customize a non-causal attention mask for the decoder incorporating two key features: modeling tokens from different labels to be independent and treating image tokens as a prefix. This masking mechanism inspires an efficient method -- one-shot sampling -- to simultaneously sample tokens of multiple labels in parallel and rank generated labels by their probabilities during inference. To further enhance the efficiency we propose a simple strategy to construct a compact decoder by simply discarding the intermediate blocks of a pretrained language model. This approach yields a decoder that matches the full model's performance while being notably more efficient. The code is available at https://github.com/kaiyuyue/nxtp. + + + + + + CLIP-BEVFormer: Enhancing Multi-View Image-Based BEV Detector with Ground Truth Flow + http://openaccess.thecvf.com//content/CVPR2024/papers/Pan_CLIP-BEVFormer_Enhancing_Multi-View_Image-Based_BEV_Detector_with_Ground_Truth_Flow_CVPR_2024_paper.pdf + Autonomous driving stands as a pivotal domain in computer vision shaping the future of transportation. Within this paradigm the backbone of the system plays a crucial role in interpreting the complex environment. However a notable challenge has been the loss of clear supervision when it comes to Bird's Eye View elements. To address this limitation we introduce CLIP-BEVFormer a novel approach that leverages the power of contrastive learning techniques to enhance the multi-view image-derived BEV backbones with ground truth information flow. We conduct extensive experiments on the challenging nuScenes dataset and showcase significant and consistent improvements over the SOTA. Specifically CLIP-BEVFormer achieves an impressive 8.5% and 9.2% enhancement in terms of NDS and mAP respectively over the previous best BEV model on the 3D object detection task. + + + + CLOVA: A Closed-LOop Visual Assistant with Tool Usage and Update + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_CLOVA_A_Closed-LOop_Visual_Assistant_with_Tool_Usage_and_Update_CVPR_2024_paper.pdf + Utilizing large language models (LLMs) to compose off-the-shelf visual tools represents a promising avenue of research for developing robust visual assistants capable of addressing diverse visual tasks. However these methods often overlook the potential for continual learning typically by freezing the utilized tools thus limiting their adaptation to environments requiring new knowledge. To tackle this challenge we propose CLOVA a Closed-Loop Visual Assistant which operates within a framework encompassing inference reflection and learning phases. During the inference phase LLMs generate programs and execute corresponding tools to complete assigned tasks. In the reflection phase a multimodal global-local reflection scheme analyzes human feedback to determine which tools require updating. Lastly the learning phase employs three flexible approaches to automatically gather training data and introduces a novel prompt tuning scheme to update the tools allowing CLOVA to efficiently acquire new knowledge. 
Experimental findings demonstrate that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning by 10% in knowledge tagging and by 20% in image editing. These results underscore the significance of the continual learning capability in general visual assistants. + + + + Depth Prompting for Sensor-Agnostic Depth Estimation http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Depth_Prompting_for_Sensor-Agnostic_Depth_Estimation_CVPR_2024_paper.pdf Dense depth maps have been used as a key element of visual perception tasks. There have been tremendous efforts to enhance the depth quality ranging from optimization-based to learning-based methods. Despite the remarkable progress for a long time their applicability in the real world is limited due to systematic measurement biases such as density sensing pattern and scan range. It is well-known that the biases make it difficult for these methods to achieve their generalization. We observe that learning a joint representation for input modalities (e.g. images and depth) which most recent methods adopt is sensitive to the biases. In this work we disentangle those modalities to mitigate the biases with prompt engineering. For this we design a novel depth prompt module to allow the desirable feature representation according to new depth distributions from either sensor types or scene configurations. Our depth prompt can be embedded into foundation models for monocular depth estimation. Through this embedding process our method helps the pretrained model to be free from restraint of depth scan range and to provide absolute scale depth maps. We demonstrate the effectiveness of our method through extensive evaluations. Source code is publicly available at https://github.com/JinhwiPark/DepthPrompting. + + + + G3DR: Generative 3D Reconstruction in ImageNet http://openaccess.thecvf.com//content/CVPR2024/papers/Reddy_G3DR_Generative_3D_Reconstruction_in_ImageNet_CVPR_2024_paper.pdf We introduce a novel 3D generative method Generative 3D Reconstruction (G3DR) in ImageNet capable of generating diverse and high-quality 3D objects from single images addressing the limitations of existing methods. At the heart of our framework is a novel depth regularization technique that enables the generation of scenes with high-geometric fidelity. G3DR also leverages a pretrained language-vision model such as CLIP to enable reconstruction in novel views and improve the visual realism of generations. Additionally G3DR designs a simple but effective sampling procedure to further improve the quality of generations. G3DR offers diverse and efficient 3D asset generation based on class or text conditioning. Despite its simplicity G3DR is able to beat state-of-the-art methods improving over them by up to 22% in perceptual metrics and 90% in geometry scores while needing only half of the training time. Code is available at https://github.com/preddy5/G3DR + + + + Hyperspherical Classification with Dynamic Label-to-Prototype Assignment http://openaccess.thecvf.com//content/CVPR2024/papers/Saadabadi_Hyperspherical_Classification_with_Dynamic_Label-to-Prototype_Assignment_CVPR_2024_paper.pdf Aiming to enhance the utilization of metric space by the parametric softmax classifier recent studies suggest replacing it with a non-parametric alternative. Although a non-parametric classifier may provide better metric space utilization it introduces the challenge of capturing inter-class relationships. 
A shared characteristic among prior non-parametric classifiers is the static assignment of labels to prototypes during the training i.e. each prototype consistently represents a class throughout the training course. Orthogonal to previous works we present a simple yet effective method to optimize the category assigned to each prototype (label-to-prototype assignment) during the training. To this aim we formalize the problem as a two-step optimization objective over network parameters and label-to-prototype assignment mapping. We solve this optimization using a sequential combination of gradient descent and bipartite matching. We demonstrate the benefits of the proposed approach by conducting experiments on balanced and long-tail classification problems using different backbone network architectures. In particular our method outperforms its competitors by 1.22% accuracy on CIFAR-100 and 2.15% on ImageNet-200 using a metric space dimension half of the size of its competitors. Code is available at https://github.com/msed-Ebrahimi/DL2PA_CVPR24. + + + + VTimeLLM: Empower LLM to Grasp Video Moments http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_VTimeLLM_Empower_LLM_to_Grasp_Video_Moments_CVPR_2024_paper.pdf Large language models (LLMs) have shown remarkable text understanding capabilities which have been extended as Video LLMs to handle video data for comprehending visual details. However existing Video LLMs can only provide a coarse description of the entire video failing to capture the precise start and end time boundary of specific events. In this paper we solve this issue via proposing VTimeLLM a novel Video LLM designed for fine-grained video moment understanding and reasoning with respect to time boundary. Specifically our VTimeLLM adopts a boundary-aware three-stage training strategy which respectively utilizes image-text pairs for feature alignment multiple-event videos to increase temporal-boundary awareness and high-quality video-instruction tuning to further improve temporal understanding ability as well as align with human intents. Extensive experiments demonstrate that in fine-grained time-related comprehension tasks for videos such as Temporal Video Grounding and Dense Video Captioning VTimeLLM significantly outperforms existing Video LLMs. Besides benefits from the fine-grained temporal understanding of the videos further enable VTimeLLM to beat existing Video LLMs in video dialogue benchmark showing its superior cross-modal understanding and reasoning abilities. + + + + FLHetBench: Benchmarking Device and State Heterogeneity in Federated Learning http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_FLHetBench_Benchmarking_Device_and_State_Heterogeneity_in_Federated_Learning_CVPR_2024_paper.pdf Federated learning (FL) is a powerful technology that enables collaborative training of machine learning models without sharing private data among clients. The fundamental challenge in FL lies in learning over extremely heterogeneous data distributions device capacities and device state availabilities all of which adversely impact performance and communication efficiency. While data heterogeneity has been well-studied in the literature this paper introduces FLHetBench the first FL benchmark targeted toward understanding device and state heterogeneity. FLHetBench comprises two new sampling methods to generate real-world device and state databases with varying heterogeneity and new metrics for quantifying the success of FL methods under these real-world constraints. 
Using FLHetBench we conduct a comprehensive evaluation of existing methods and find that they struggle under these settings which inspires us to propose BiasPrompt+ a new method employing staleness-aware aggregation and fast weights to tackle these new heterogeneity challenges. Experiments on various FL tasks and datasets validate the effectiveness of our BiasPrompt+ method and highlight the value of FLHetBench in fostering the development of more efficient and robust FL solutions under real-world device and state constraints. + + + + Privacy-Preserving Optics for Enhancing Protection in Face De-Identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Lopez_Privacy-Preserving_Optics_for_Enhancing_Protection_in_Face_De-Identification_CVPR_2024_paper.pdf + The modern surge in camera usage alongside widespread computer vision technology applications poses significant privacy and security concerns. Current artificial intelligence (AI) technologies aid in recognizing relevant events and assisting in daily tasks in homes offices hospitals etc. The need to access or process personal information for these purposes raises privacy concerns. While software-level solutions like face de-identification provide a good privacy/utility trade-off they present vulnerabilities to sniffing attacks. In this paper we propose a hardware-level face de-identification method to solve this vulnerability. Specifically our approach first learns an optical encoder along with a regression model to obtain a face heatmap while hiding the face identity from the source image. We also propose an anonymization framework that generates a new face using the privacy-preserving image face heatmap and a reference face image from a public dataset as input. We validate our approach with extensive simulations and hardware experiments. + + + + SmartRefine: A Scenario-Adaptive Refinement Framework for Efficient Motion Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_SmartRefine_A_Scenario-Adaptive_Refinement_Framework_for_Efficient_Motion_Prediction_CVPR_2024_paper.pdf + Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic human-robot-mixed environments. Context information such as road maps and surrounding agents' states provides crucial geometric and semantic information for motion behavior prediction. To this end recent works explore two-stage prediction frameworks where coarse trajectories are first proposed and then used to select critical context information for trajectory refinement. However they either incur a large amount of computation or bring limited improvement if not both. In this paper we introduce a novel scenario-adaptive refinement strategy named SmartRefine to refine prediction with minimal additional computation. Specifically SmartRefine can comprehensively adapt refinement configurations based on each scenario's properties and smartly chooses the number of refinement iterations by introducing a quality score to measure the prediction quality and remaining refinement potential of each scenario. SmartRefine is designed as a generic and flexible approach that can be seamlessly integrated into most state-of-the-art motion prediction models. Experiments on Argoverse (1 & 2) show that our method consistently improves the prediction accuracy of multiple state-of-the-art prediction models. 
Specifically by adding SmartRefine to QCNet we outperform all published ensemble-free works on the Argoverse 2 leaderboard (single agent track) at submission. Comprehensive studies are also conducted to ablate design choices and explore the mechanism behind multi-iteration refinement. Codes are available at https://github.com/opendilab/SmartRefine/. + + + + Multi-Scale Video Anomaly Detection by Multi-Grained Spatio-Temporal Representation Learning http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Multi-Scale_Video_Anomaly_Detection_by_Multi-Grained_Spatio-Temporal_Representation_Learning_CVPR_2024_paper.pdf Recent progress in video anomaly detection suggests that the features of appearance and motion play crucial roles in distinguishing abnormal patterns from normal ones. However we note that the effect of spatial scales of anomalies is ignored. In fact many abnormal events occur in limited localized regions and severe background noise interferes with the learning of anomalous changes. Meanwhile most existing methods are limited by coarse-grained modeling approaches which are inadequate for learning highly discriminative features to discriminate subtle differences between small-scale anomalies and normal patterns. To this end this paper addresses multi-scale video anomaly detection by multi-grained spatio-temporal representation learning. We utilize video continuity to design three proxy tasks to perform feature learning at both coarse-grained and fine-grained levels i.e. continuity judgment discontinuity localization and missing frame estimation. In particular we formulate missing frame estimation as a contrastive learning task in feature space instead of a reconstruction task in RGB space to learn highly discriminative features. Experiments show that our proposed method outperforms state-of-the-art methods on four datasets especially in scenes with small-scale anomalies. + + + + Generative Multimodal Models are In-Context Learners http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Generative_Multimodal_Models_are_In-Context_Learners_CVPR_2024_paper.pdf Humans can easily solve multimodal tasks in context with only a few demonstrations or simple instructions which current multimodal systems largely struggle to imitate. In this work we demonstrate that by effectively scaling up generative multimodal models their task-agnostic in-context learning capabilities can be significantly enhanced. We introduce Emu2 a generative multimodal model with 37 billion parameters which serves as a base model and general-purpose interface for a variety of multimodal tasks. Emu2 not only achieves strong performance in few-shot setting but can also be instruct-tuned to follow specific instructions such as visual question answering and object-grounded image generation. Emu2 even emerges to solve tasks that require on-the-fly reasoning such as visual prompting which existing models are unlikely to handle. We identify additional tasks where Emu2's in-context learning can further improve and discuss its broader societal impact. Our code and models will be made publicly available to facilitate future research. 
+ + + + Feature Re-Embedding: Towards Foundation Model-Level Performance in Computational Pathology + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Feature_Re-Embedding_Towards_Foundation_Model-Level_Performance_in_Computational_Pathology_CVPR_2024_paper.pdf + Multiple instance learning (MIL) is the most widely used framework in computational pathology encompassing sub-typing diagnosis prognosis and more. However the existing MIL paradigm typically requires an offline instance feature extractor such as a pre-trained ResNet or a foundation model. This approach lacks the capability for feature fine-tuning within the specific downstream tasks limiting its adaptability and performance. To address this issue we propose a Re-embedded Regional Transformer (RRT) for re-embedding the instance features online which captures fine-grained local features and establishes connections across different regions. Unlike existing works that focus on pre-training powerful feature extractor or designing sophisticated instance aggregator RRT is tailored to re-embed instance features online. It serves as a portable module that can seamlessly integrate into mainstream MIL models. Extensive experimental results on common computational pathology tasks validate that: 1) feature re-embedding improves the performance of MIL models based on ResNet-50 features to the level of foundation model features and further enhances the performance of foundation model features; 2) the RRT can introduce more significant performance improvements to various MIL models; 3) RRT-MIL as an RRT-enhanced AB-MIL outperforms other latest methods by a large margin. The code is available at: https://github.com/DearCaat/RRT-MIL. + + + + Text Prompt with Normality Guidance for Weakly Supervised Video Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Text_Prompt_with_Normality_Guidance_for_Weakly_Supervised_Video_Anomaly_CVPR_2024_paper.pdf + Weakly supervised video anomaly detection (WSVAD) is a challenging task. Generating fine-grained pseudo-labels based on weak-label and then self-training a classifier is currently a promising solution. However since the existing methods use only RGB visual modality and the utilization of category text information is neglected thus limiting the generation of more accurate pseudo-labels and affecting the performance of self-training. Inspired by the manual labeling process based on the event description in this paper we propose a novel pseudo-label generation and self-training framework based on Text Prompt with Normality Guidance (TPWNG) for WSVAD. Our idea is to transfer the rich language-visual knowledge of the contrastive language-image pre-training (CLIP) model for aligning the video event description text and corresponding video frames to generate pseudo-labels. Specifically We first fine-tune the CLIP for domain adaptation by designing two ranking losses and a distributional inconsistency loss. Further we propose a learnable text prompt mechanism with the assist of a normality visual prompt to further improve the matching accuracy of video event description text and video frames. Then we design a pseudo-label generation module based on the normality guidance to infer reliable frame-level pseudo-labels. Finally we introduce a temporal context self-adaptive learning module to learn the temporal dependencies of different video events more flexibly and accurately. 
Extensive experiments show that our method achieves state-of-the-art performance on two benchmark datasets UCF-Crime and XD-Violence demonstrating the effectiveness of our proposed method. + + + + SparseOcc: Rethinking Sparse Latent Representation for Vision-Based Semantic Occupancy Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_SparseOcc_Rethinking_Sparse_Latent_Representation_for_Vision-Based_Semantic_Occupancy_Prediction_CVPR_2024_paper.pdf + Vision-based perception for autonomous driving requires an explicit modeling of a 3D space where 2D latent representations are mapped and subsequent 3D operators are applied. However operating on dense latent spaces introduces a cubic time and space complexity which limits scalability in terms of perception range or spatial resolution. Existing approaches compress the dense representation using projections like Bird's Eye View (BEV) or Tri-Perspective View (TPV). Although efficient these projections result in information loss especially for tasks like semantic occupancy prediction. To address this we propose SparseOcc an efficient occupancy network inspired by sparse point cloud processing. It utilizes a lossless sparse latent representation with three key innovations. Firstly a 3D sparse diffuser performs latent completion using spatially decomposed 3D sparse convolutional kernels. Secondly a feature pyramid and sparse interpolation enhance scales with information from others. Finally the transformer head is redesigned as a sparse variant. SparseOcc achieves a remarkable 74.9% reduction on FLOPs over the dense baseline. Interestingly it also improves accuracy from 12.8% to 14.1% mIOU which in part can be attributed to the sparse representation's ability to avoid hallucinations on empty voxels. + + + + Frequency Decoupling for Motion Magnification via Multi-Level Isomorphic Architecture + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Frequency_Decoupling_for_Motion_Magnification_via_Multi-Level_Isomorphic_Architecture_CVPR_2024_paper.pdf + Video Motion Magnification (VMM) aims to reveal subtle and imperceptible motion information of objects in the macroscopic world. Prior methods directly model the motion field from the Eulerian perspective by Representation Learning that separates shape and texture or Multi-domain Learning from phase fluctuations. Inspired by the frequency spectrum we observe that the low-frequency components with stable energy always possess spatial structure and less noise making them suitable for modeling the subtle motion field. To this end we present FD4MM a new paradigm of Frequency Decoupling for Motion Magnification with a Multi-level Isomorphic Architecture to capture multi-level high-frequency details and a stable low-frequency structure (motion field) in video space. Since high-frequency details and subtle motions are susceptible to information degradation due to their inherent subtlety and unavoidable external interference from noise we carefully design Sparse High/Low-pass Filters to enhance the integrity of details and motion structures and a Sparse Frequency Mixer to promote seamless recoupling. Besides we innovatively design a contrastive regularization for this task to strengthen the model's ability to discriminate irrelevant features reducing undesired motion magnification. Extensive experiments on both Real-world and Synthetic Datasets show that our FD4MM outperforms SOTA methods. 
Meanwhile FD4MM reduces FLOPs by 1.63x and boosts inference speed by 1.68x compared to the latest method. Our code is available at https://github.com/Jiafei127/FD4MM. + + + + Hyperbolic Learning with Synthetic Captions for Open-World Detection http://openaccess.thecvf.com//content/CVPR2024/papers/Kong_Hyperbolic_Learning_with_Synthetic_Captions_for_Open-World_Detection_CVPR_2024_paper.pdf Open-world detection poses significant challenges as it requires the detection of any object using either object class labels or free-form texts. Existing related works often use large-scale manual annotated caption datasets for training which are extremely expensive to collect. Instead we propose to transfer knowledge from vision-language models (VLMs) to enrich the open-vocabulary descriptions automatically. Specifically we bootstrap dense synthetic captions using pre-trained VLMs to provide rich descriptions on different regions in images and incorporate these captions to train a novel detector that generalizes to novel concepts. To mitigate the noise caused by hallucination in synthetic captions we also propose a novel hyperbolic vision-language learning approach to impose a hierarchy between visual and caption embeddings. We call our detector "HyperLearner". We conduct extensive experiments on a wide variety of open-world detection benchmarks (COCO LVIS Object Detection in the Wild RefCOCO) and our results show that our model consistently outperforms existing state-of-the-art methods such as GLIP GLIPv2 and Grounding DINO when using the same backbone. + + + + Interpretable Measures of Conceptual Similarity by Complexity-Constrained Descriptive Auto-Encoding http://openaccess.thecvf.com//content/CVPR2024/papers/Achille_Interpretable_Measures_of_Conceptual_Similarity_by_Complexity-Constrained_Descriptive_Auto-Encoding_CVPR_2024_paper.pdf Quantifying the degree of similarity between images is a key copyright issue for image-based machine learning. In legal doctrine however determining the degree of similarity between works requires subjective analysis and fact-finders (judges and juries) can demonstrate considerable variability in these subjective judgement calls. Images that are structurally similar can be deemed dissimilar whereas images of completely different scenes can be deemed similar enough to support a claim of copying. We seek to define and compute a notion of "conceptual similarity" among images that captures high-level relations even among images that do not share repeated elements or visually similar components. The idea is to use a base multi-modal model to generate "explanations" (captions) of visual data at increasing levels of complexity. Then similarity can be measured by the length of the caption needed to discriminate between the two images: Two highly dissimilar images can be discriminated early in their description whereas conceptually dissimilar ones will need more detail to be distinguished. We operationalize this definition and show that it correlates with subjective (averaged human evaluation) assessment and beats existing baselines on both image-to-image and text-to-text similarity benchmarks. Beyond just providing a number our method also offers interpretability by pointing to the specific level of granularity of the description where the source data is differentiated. 
+ + + + 3D Feature Tracking via Event Camera + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_3D_Feature_Tracking_via_Event_Camera_CVPR_2024_paper.pdf + This paper presents the first 3D feature tracking method with the corresponding dataset. Our proposed method takes event streams from stereo event cameras as input to predict 3D trajectories of the target features with high-speed motion. To achieve this our method leverages a joint framework to predict the 2D feature motion offsets and the 3D feature spatial position simultaneously. A motion compensation module is leveraged to overcome the feature deformation. A patch matching module based on bi-polarity hypergraph modeling is proposed to robustly estimate the feature spatial position. Meanwhile we collect the first 3D feature tracking dataset with high-speed moving objects and ground truth 3D feature trajectories at 250 FPS named E-3DTrack which can be used as the first high-speed 3D feature tracking benchmark. Our code and dataset could be found at: https://github.com/lisiqi19971013/E-3DTrack. + + + + MaxQ: Multi-Axis Query for N:M Sparsity Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiang_MaxQ_Multi-Axis_Query_for_NM_Sparsity_Network_CVPR_2024_paper.pdf + N:M sparsity has received increasing attention due to its remarkable performance and latency trade-off compared with structured and unstructured sparsity. However existing N:M sparsity methods do not differentiate the relative importance of weights among blocks and leave important weights underappreciated. Besides they directly apply N:M sparsity to the whole network which will cause severe information loss. Thus they are still sub-optimal. In this paper we propose an efficient and effective Multi-Axis Query methodology dubbed as MaxQ to rectify these problems. During the training MaxQ employs a dynamic approach to generate soft N:M masks considering the weight importance across multiple axes. This method enhances the weights with more importance and ensures more effective updates. Meanwhile a sparsity strategy that gradually increases the percentage of N:M weight blocks is applied which allows the network to heal from the pruning-induced damage progressively. During the runtime the N:M soft masks can be precomputed as constants and folded into weights without causing any distortion to the sparse pattern and incurring additional computational overhead. Comprehensive experiments demonstrate that MaxQ achieves consistent improvements across diverse CNN architectures in various computer vision tasks including image classification object detection and instance segmentation. For ResNet50 with 1:16 sparse pattern MaxQ can achieve 74.6% top-1 accuracy on ImageNet and improve by over 2.8% over the state-of-the-art. Codes and checkpoints are available at https://github.com/JingyangXiang/MaxQ. + + + + Part-aware Unified Representation of Language and Skeleton for Zero-shot Action Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Part-aware_Unified_Representation_of_Language_and_Skeleton_for_Zero-shot_Action_CVPR_2024_paper.pdf + While remarkable progress has been made on supervised skeleton-based action recognition the challenge of zero-shot recognition remains relatively unexplored. In this paper we argue that relying solely on aligning label-level semantics and global skeleton features is insufficient to effectively transfer locally consistent visual knowledge from seen to unseen classes. 
To address this limitation we introduce Part-aware Unified Representation between Language and Skeleton (PURLS) to explore visual-semantic alignment at both local and global scales. PURLS introduces a new prompting module and a novel partitioning module to generate aligned textual and visual representations across different levels. The former leverages a pre-trained GPT-3 to infer refined descriptions of the global and local (body-part-based and temporal-interval-based) movements from the original action labels. The latter employs an adaptive sampling strategy to group visual features from all body joint movements that are semantically relevant to a given description. Our approach is evaluated on various skeleton/language backbones and three large-scale datasets i.e. NTU-RGB+D 60 NTU-RGB+D 120 and a newly curated dataset Kinetics-skeleton 200. The results showcase the universality and superior performance of PURLS surpassing prior skeleton-based solutions and standard baselines from other domains. The source codes can be accessed at https://github.com/azzh1/PURLS. + + + + Composing Object Relations and Attributes for Image-Text Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Pham_Composing_Object_Relations_and_Attributes_for_Image-Text_Matching_CVPR_2024_paper.pdf + We study the visual semantic embedding problem for image-text matching. Most existing work utilizes a tailored cross-attention mechanism to perform local alignment across the two image and text modalities. This is computationally expensive even though it is more powerful than the unimodal dual-encoder approach. This work introduces a dual-encoder image-text matching model leveraging a scene graph to represent captions with nodes for objects and attributes interconnected by relational edges. Utilizing a graph attention network our model efficiently encodes object-attribute and object-object semantic relations resulting in a robust and fast-performing system. Representing caption as a scene graph offers the ability to utilize the strong relational inductive bias of graph neural networks to learn object-attribute and object-object relations effectively. To train the model we propose losses that align the image and caption both at the holistic level (image-caption) and the local level (image-object entity) which we show is key to the success of the model. Our model is termed Composition model for Object Relations and Attributes CORA. Experimental results on two prominent image-text retrieval benchmarks Flickr30K and MS-COCO demonstrate that CORA outperforms existing state-of-the-art computationally expensive cross-attention methods regarding recall score while achieving fast computation speed of the dual encoder. Our code is available at https://github.com/vkhoi/cora_cvpr24 + + + + Previously on ... From Recaps to Story Summarization + http://openaccess.thecvf.com//content/CVPR2024/papers/Singh_Previously_on_..._From_Recaps_to_Story_Summarization_CVPR_2024_paper.pdf + We introduce multimodal story summarization by leveraging TV episode recaps - short video sequences interweaving key story moments from previous episodes to bring viewers up to speed. We propose PlotSnap a dataset featuring two crime thriller TV shows with rich recaps and long episodes of 40 minutes. Story summarization labels are unlocked by matching recap shots to corresponding sub-stories in the episode. 
We propose a hierarchical model TaleSumm that processes entire episodes by creating compact shot and dialog representations and predicts importance scores for each video shot and dialog utterance by enabling interactions between local story groups. Unlike traditional summarization our method extracts multiple plot points from long videos. We present a thorough evaluation on story summarization including promising cross-series generalization. TaleSumm also shows good results on classic video summarization benchmarks. + + + + mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_mPLUG-Owl2_Revolutionizing_Multi-modal_Large_Language_Model_with_Modality_Collaboration_CVPR_2024_paper.pdf + Multi-modal Large Language Models (MLLMs) have demonstrated impressive instruction abilities across various open-ended tasks. However previous methods have primarily focused on enhancing multi-modal capabilities. In this work we introduce a versatile multi-modal large language model mPLUG-Owl2 which effectively leverages modality collaboration to improve performance in both text and multi-modal tasks. mPLUG-Owl2 utilizes a modularized network design with the language decoder acting as a universal interface for managing different modalities. Specifically mPLUG-Owl2 incorporates shared functional modules to facilitate modality collaboration and introduces a modality-adaptive module that preserves modality-specific features. Extensive experiments reveal that mPLUG-Owl2 is capable of generalizing both text tasks and multi-modal tasks while achieving state-of-the-art performances with a single generalized model. Notably mPLUG-Owl2 is the first MLLM model that demonstrates the modality collaboration phenomenon in both pure-text and multi-modal scenarios setting a pioneering path in the development of future multi-modal foundation models. + + + + Learning by Correction: Efficient Tuning Task for Zero-Shot Generative Vision-Language Reasoning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Learning_by_Correction_Efficient_Tuning_Task_for_Zero-Shot_Generative_Vision-Language_CVPR_2024_paper.pdf + Generative vision-language models (VLMs) have shown impressive performance in zero-shot vision-language tasks like image captioning and visual question answering. However improving their zero-shot reasoning typically requires second-stage instruction tuning which relies heavily on human-labeled or large language model-generated annotation incurring high labeling costs. To tackle this challenge we introduce Image-Conditioned Caption Correction (ICCC) a novel pre-training task designed to enhance VLMs' zero-shot performance without the need for labeled task-aware data. The ICCC task compels VLMs to rectify mismatches between visual and language concepts thereby enhancing instruction following and text generation conditioned on visual inputs. Leveraging language structure and a lightweight dependency parser we construct data samples of ICCC task from image-text datasets with low labeling and computation costs. Experimental results on BLIP-2 and InstructBLIP demonstrate significant improvements in zero-shot image-text generation-based VL tasks through ICCC instruction tuning. 
+ + + + Supervised Anomaly Detection for Complex Industrial Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Baitieva_Supervised_Anomaly_Detection_for_Complex_Industrial_Images_CVPR_2024_paper.pdf + Automating visual inspection in industrial production lines is essential for increasing product quality across various industries. Anomaly detection (AD) methods serve as robust tools for this purpose. However existing public datasets primarily consist of images without anomalies limiting the practical application of AD methods in production settings. To address this challenge we present (1) the Valeo Anomaly Dataset (VAD) a novel real-world industrial dataset comprising 5000 images including 2000 instances of challenging real defects across more than 20 subclasses. Acknowledging that traditional AD methods struggle with this dataset we introduce (2) Segmentation-based Anomaly Detector (SegAD). First SegAD leverages anomaly maps as well as segmentation maps to compute local statistics. Next SegAD uses these statistics and an optional supervised classifier score as input features for a Boosted Random Forest (BRF) classifier yielding the final anomaly score. Our SegAD achieves state-of-the-art performance on both VAD (+2.1% AUROC) and the VisA dataset (+0.4% AUROC). The code and the models are publicly available. + + + + Open3DSG: Open-Vocabulary 3D Scene Graphs from Point Clouds with Queryable Objects and Open-Set Relationships + http://openaccess.thecvf.com//content/CVPR2024/papers/Koch_Open3DSG_Open-Vocabulary_3D_Scene_Graphs_from_Point_Clouds_with_Queryable_CVPR_2024_paper.pdf + Current approaches for 3D scene graph prediction rely on labeled datasets to train models for a fixed set of known object classes and relationship categories. We present Open3DSG an alternative approach to learn 3D scene graph prediction in an open world without requiring labeled scene graph data. We co-embed the features from a 3D scene graph prediction backbone with the feature space of powerful open world 2D vision language foundation models. This enables us to predict 3D scene graphs from 3D point clouds in a zero-shot manner by querying object classes from an open vocabulary and predicting the inter-object relationships from a grounded LLM with scene graph features and queried object classes as context. Open3DSG is the first 3D point cloud method to predict not only explicit open-vocabulary object classes but also open-set relationships that are not limited to a predefined label set making it possible to express rare as well as specific objects and relationships in the predicted 3D scene graph. Our experiments show that Open3DSG is effective at predicting arbitrary object classes as well as their complex inter-object relationships describing spatial supportive semantic and comparative relationships. + + + + SURE: SUrvey REcipes for building reliable and robust deep networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_SURE_SUrvey_REcipes_for_building_reliable_and_robust_deep_networks_CVPR_2024_paper.pdf + In this paper we revisit techniques for uncertainty estimation within deep neural networks and consolidate a suite of techniques to enhance their reliability. Our investigation reveals that an integrated application of diverse techniques--spanning model regularization classifier and optimization--substantially improves the accuracy of uncertainty predictions in image classification tasks. The synergistic effect of these techniques culminates in our novel SURE approach. 
We rigorously evaluate SURE against the benchmark of failure prediction a critical testbed for uncertainty estimation efficacy. Our results showcase a consistently better performance than models that individually deploy each technique across various datasets and model architectures. When applied to real-world challenges such as data corruption label noise and long-tailed class distribution SURE exhibits remarkable robustness delivering results that are superior or on par with current state-of-the-art specialized methods. Particularly on Animal-10N and Food-101N for learning with noisy labels SURE achieves state-of-the-art performance without any task-specific adjustments. This work not only sets a new benchmark for robust uncertainty estimation but also paves the way for its application in diverse real-world scenarios where reliability is paramount. Our code is available at https://yutingli0606.github.io/SURE/. + + + + PolarRec: Improving Radio Interferometric Data Reconstruction Using Polar Coordinates + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_PolarRec_Improving_Radio_Interferometric_Data_Reconstruction_Using_Polar_Coordinates_CVPR_2024_paper.pdf + In radio astronomy visibility data which are measurements of wave signals from radio telescopes are transformed into images for observation of distant celestial objects. However these resultant images usually contain both real sources and artifacts due to signal sparsity and other factors. One way to obtain cleaner images is to reconstruct samples into dense forms before imaging. Unfortunately existing reconstruction methods often miss some components of visibility in frequency domain so blurred object edges and persistent artifacts remain in the images. Furthermore the computation overhead is high on irregular visibility samples due to the data skew. To address these problems we propose PolarRec a transformer-encoder-conditioned reconstruction pipeline with visibility samples converted into the polar coordinate system. This coordinate system matches the way in which radio telescopes observe a celestial area as the Earth rotates. As a result visibility samples distribute in the polar system more uniformly than in the Cartesian space. Therefore we propose to use radial distance in the loss function to help reconstruct complete visibility effectively. Also we group visibility samples by their polar angles and propose a group-based encoding scheme to improve the efficiency. Our experiments demonstrate that PolarRec markedly improves imaging results by faithfully reconstructing all frequency components in the visibility domain while significantly reducing the computation cost in visibility data encoding. The code is available at https://github.com/RapidsAtHKUST/PolarRec. + + + + Summarize the Past to Predict the Future: Natural Language Descriptions of Context Boost Multimodal Object Interaction Anticipation + http://openaccess.thecvf.com//content/CVPR2024/papers/Pasca_Summarize_the_Past_to_Predict_the_Future_Natural_Language_Descriptions_CVPR_2024_paper.pdf + We study object interaction anticipation in egocentric videos. This task requires an understanding of the spatio-temporal context formed by past actions on objects coined "action context". We propose TransFusion a multimodal transformer-based architecture for short-term object interaction anticipation. 
Our method exploits the representational power of language by summarizing the action context textually after leveraging pre-trained vision-language foundation models to extract the action context from past video frames. The summarized action context and the last observed video frame are processed by the multimodal fusion module to forecast the next object interaction. Experiments on the Ego4D next active object interaction dataset show the effectiveness of our multimodal fusion model and highlight the benefits of using the power of foundation models and language-based context summaries in a task where vision may appear to suffice. Our novel approach outperforms all state-of-the-art methods on both versions of the Ego4D dataset. + + + + Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Towards_CLIP-driven_Language-free_3D_Visual_Grounding_via_2D-3D_Relational_Enhancement_CVPR_2024_paper.pdf + 3D visual grounding plays a crucial role in scene understanding with extensive applications in AR/VR. Despite the significant progress made in recent methods the requirement of dense textual descriptions for each individual object which is time-consuming and costly hinders their scalability. To mitigate reliance on text annotations during training researchers have explored language-free training paradigms in the 2D field via explicit text generation or implicit feature substitution. Nevertheless unlike 2D images the complexity of spatial relations in 3D coupled with the absence of robust 3D visual language pre-trained models makes it challenging to directly transfer previous strategies. To tackle the above issues in this paper we introduce a language-free training framework for 3D visual grounding. By utilizing the visual-language joint embedding in 2D large cross-modality model as a bridge we can expediently produce the pseudo-language features by leveraging the features of 2D images which are equivalent to that of real textual descriptions. We further develop a relation injection scheme with a Neighboring Relation-aware Modeling module and a Cross-modality Relation Consistency module aiming to enhance and preserve the complex relationships between the 2D and 3D embedding space. Extensive experiments demonstrate that our proposed language-free 3D visual grounding approach can obtain promising performance across three widely used datasets --ScanRefer Nr3D and Sr3D. Our codes are available at https://github.com/xibi777/3DLFVG + + + + Optimal Transport Aggregation for Visual Place Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Izquierdo_Optimal_Transport_Aggregation_for_Visual_Place_Recognition_CVPR_2024_paper.pdf + The task of Visual Place Recognition (VPR) aims to match a query image against references from an extensive database of images from different places relying solely on visual cues. State-of-the-art pipelines focus on the aggregation of features extracted from a deep backbone in order to form a global descriptor for each image. In this context we introduce SALAD (Sinkhorn Algorithm for Locally Aggregated Descriptors) which reformulates NetVLAD's soft-assignment of local features to clusters as an optimal transport problem. In SALAD we consider both feature-to-cluster and cluster-to-feature relations and we also introduce a dustbin cluster designed to selectively discard features deemed non-informative enhancing the overall descriptor quality. 
Additionally we leverage and fine-tune DINOv2 as a backbone which provides enhanced description power for the local features and dramatically reduces the required training time. As a result our single-stage method not only surpasses single-stage baselines in public VPR datasets but also surpasses two-stage methods that add a re-ranking with significantly higher cost. + + + + Aligning and Prompting Everything All at Once for Universal Visual Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/Shen_Aligning_and_Prompting_Everything_All_at_Once_for_Universal_Visual_CVPR_2024_paper.pdf + Vision foundation models have been explored recently to build general-purpose vision systems. However predominant paradigms driven by casting instance-level tasks as an object-word alignment bring heavy cross-modality interaction which is not effective in prompting object detection and visual grounding. Another line of work that focuses on pixel-level tasks often encounters a large annotation gap of things and stuff and suffers from mutual interference between foreground-object and background-class segmentation. In stark contrast to the prevailing methods we present APE a universal visual perception model for aligning and prompting everything all at once in an image to perform diverse tasks i.e. detection segmentation and grounding as an instance-level sentence-object matching paradigm. Specifically APE advances the convergence of detection and grounding by reformulating language-guided grounding as open-vocabulary detection which efficiently scales up model prompting to thousands of category vocabularies and region descriptions while maintaining the effectiveness of cross-modality fusion. To bridge the granularity gap of different pixel-level tasks APE equalizes semantic and panoptic segmentation to proxy instance learning by considering any isolated regions as individual instances. APE aligns vision and language representation on broad data with natural and challenging characteristics all at once without task-specific fine-tuning. The extensive experiments on over 160 datasets demonstrate that with only one-suit of weights APE outperforms (or is on par with) the state-of-the-art models proving that an effective yet universal perception for anything aligning and prompting is indeed feasible. Codes and trained models are released at https://github.com/shenyunhang/APE. + + + + Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Correlation-Decoupled_Knowledge_Distillation_for_Multimodal_Sentiment_Analysis_with_Incomplete_Modalities_CVPR_2024_paper.pdf + Multimodal sentiment analysis (MSA) aims to understand human sentiment through multimodal data. Most MSA efforts are based on the assumption of modality completeness. However in real-world applications some practical factors cause uncertain modality missingness which drastically degrades the model's performance. To this end we propose a Correlation-decoupled Knowledge Distillation (CorrKD) framework for the MSA task under uncertain missing modalities. Specifically we present a sample-level contrastive distillation mechanism that transfers comprehensive knowledge containing cross-sample correlations to reconstruct missing semantics. 
Moreover a category-guided prototype distillation mechanism is introduced to capture cross-category correlations using category prototypes to align feature distributions and generate favorable joint representations. Eventually we design a response-disentangled consistency distillation strategy to optimize the sentiment decision boundaries of the student network through response disentanglement and mutual information maximization. Comprehensive experiments on three datasets indicate that our framework can achieve favorable improvements compared with several baselines. + + + + LoSh: Long-Short Text Joint Prediction Network for Referring Video Object Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_LoSh_Long-Short_Text_Joint_Prediction_Network_for_Referring_Video_Object_CVPR_2024_paper.pdf + Referring video object segmentation (RVOS) aims to segment the target instance referred by a given text expression in a video clip. The text expression normally contains sophisticated description of the instance's appearance action and relation with others. It is therefore rather difficult for a RVOS model to capture all these attributes correspondingly in the video; in fact the model often favours more on the action- and relation-related visual attributes of the instance. This can end up with partial or even incorrect mask prediction of the target instance. We tackle this problem by taking a subject-centric short text expression from the original long text expression. The short one retains only the appearance-related information of the target instance so that we can use it to focus the model's attention on the instance's appearance. We let the model make joint predictions using both long and short text expressions; and insert a long-short cross-attention module to interact the joint features and a long-short predictions intersection loss to regulate the joint predictions. Besides the improvement on the linguistic part we also introduce a forward-backward visual consistency loss which utilizes optical flows to warp visual features between the annotated frames and their temporal neighbors for consistency. We build our method on top of two state of the art pipelines. Extensive experiments on A2D-Sentences Refer-YouTube-VOS JHMDB-Sentences and Refer-DAVIS17 show impressive improvements of our method. Code is available here. + + + + Dual Prototype Attention for Unsupervised Video Object Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cho_Dual_Prototype_Attention_for_Unsupervised_Video_Object_Segmentation_CVPR_2024_paper.pdf + Unsupervised video object segmentation (VOS) aims to detect and segment the most salient object in videos. The primary techniques used in unsupervised VOS are 1) the collaboration of appearance and motion information; and 2) temporal fusion between different frames. This paper proposes two novel prototype-based attention mechanisms inter-modality attention (IMA) and inter-frame attention (IFA) to incorporate these techniques via dense propagation across different modalities and frames. IMA densely integrates context information from different modalities based on a mutual refinement. IFA injects global context of a video to the query frame enabling a full utilization of useful properties from multiple frames. Experimental results on public benchmark datasets demonstrate that our proposed approach outperforms all existing methods by a substantial margin. The proposed two components are also thoroughly validated via ablative study. 
+ + + + Navigate Beyond Shortcuts: Debiased Learning Through the Lens of Neural Collapse + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Navigate_Beyond_Shortcuts_Debiased_Learning_Through_the_Lens_of_Neural_CVPR_2024_paper.pdf + Recent studies have noted an intriguing phenomenon termed Neural Collapse that is when the neural networks establish the right correlation between feature spaces and the training targets their last-layer features together with the classifier weights will collapse into a stable and symmetric structure. In this paper we extend the investigation of Neural Collapse to the biased datasets with imbalanced attributes. We observe that models will easily fall into the pitfall of shortcut learning and form a biased non-collapsed feature space at the early period of training which is hard to reverse and limits the generalization capability. To tackle the root cause of biased classification we follow the recent inspiration of prime training and propose an avoid-shortcut learning framework without additional training complexity. With well-designed shortcut primes based on Neural Collapse structure the models are encouraged to skip the pursuit of simple shortcuts and naturally capture the intrinsic correlations. Experimental results demonstrate that our method induces a better convergence property during training and achieves state-of-the-art generalization performance on both synthetic and real-world biased datasets. + + + + A Subspace-Constrained Tyler's Estimator and its Applications to Structure from Motion + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_A_Subspace-Constrained_Tylers_Estimator_and_its_Applications_to_Structure_from_CVPR_2024_paper.pdf + We present the subspace-constrained Tyler's estimator (STE) designed for recovering a low-dimensional subspace within a dataset that may be highly corrupted with outliers. STE is a fusion of the Tyler's M-estimator (TME) and a variant of the fast median subspace. Our theoretical analysis suggests that under a common inlier-outlier model STE can effectively recover the underlying subspace even when it contains a smaller fraction of inliers relative to other methods in the field of robust subspace recovery. We apply STE in the context of Structure from Motion (SfM) in two ways: for robust estimation of the fundamental matrix and for the removal of outlying cameras enhancing the robustness of the SfM pipeline. Numerical experiments confirm the state-of-the-art performance of our method in these applications. This research makes significant contributions to the field of robust subspace recovery particularly in the context of computer vision and 3D reconstruction. + + + + CAD: Photorealistic 3D Generation via Adversarial Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wan_CAD_Photorealistic_3D_Generation_via_Adversarial_Distillation_CVPR_2024_paper.pdf + The increased demand for 3D data in AR/VR robotics and gaming applications gave rise to powerful generative pipelines capable of synthesizing high-quality 3D objects. Most of these models rely on the Score Distillation Sampling (SDS) algorithm to optimize a 3D representation such that the rendered image maintains a high likelihood as evaluated by a pre-trained diffusion model. However this distillation process involves finding a correct mode in the high-dimensional and large-variance distribution produced by the diffusion model. 
This task is challenging and often leads to issues such as over-saturation over-smoothing and Janus-like artifacts in the 3D generation. In this paper we propose a novel learning paradigm for 3D synthesis that utilizes pre-trained diffusion models. Instead of focusing on mode-seeking our method directly models the distribution discrepancy between multi-view renderings and diffusion priors in an adversarial manner which unlocks the generation of high-fidelity and photorealistic 3D content conditioned on a single image and prompt. Moreover by harnessing the latent space of GANs and expressive diffusion model priors our method enables a wide variety of 3D applications including single-view reconstruction high diversity generation and continuous 3D interpolation in open domain. Our experiments demonstrate the superiority of our pipeline compared to previous works in terms of generation quality and diversity. + + + + Enhancing Vision-Language Pre-training with Rich Supervisions + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Enhancing_Vision-Language_Pre-training_with_Rich_Supervisions_CVPR_2024_paper.pdf + We propose Strongly Supervised pre-training with ScreenShots (S4) - a novel pre-training paradigm for Vision-Language Models using data from large-scale web screenshot rendering. Using web screenshots unlocks a treasure trove of visual and textual cues that are not present in using image-text pairs. In S4 we leverage the inherent tree-structured hierarchy of HTML elements and the spatial localization to carefully design 10 pre-training tasks with large scale annotated data. These tasks resemble downstream tasks across different domains and the annotations are cheap to obtain. We demonstrate that compared to current screenshot pre-training objectives our innovative pre-training method significantly enhances performance of image-to-text model in nine varied and popular downstream tasks - up to 76.1% improvements on Table Detection and at least 1% on Widget Captioning. + + + + Adaptive VIO: Deep Visual-Inertial Odometry with Online Continual Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Pan_Adaptive_VIO_Deep_Visual-Inertial_Odometry_with_Online_Continual_Learning_CVPR_2024_paper.pdf + Visual-inertial odometry (VIO) has demonstrated remarkable success due to its low-cost and complementary sensors. However existing VIO methods lack the generalization ability to adjust to different environments and sensor attributes. In this paper we propose Adaptive VIO a new monocular visual-inertial odometry that combines online continual learning with traditional nonlinear optimization. Adaptive VIO comprises two networks to predict visual correspondence and IMU bias. Unlike end-to-end approaches that use networks to fuse the features from two modalities (camera and IMU) and predict poses directly we combine neural networks with visual-inertial bundle adjustment in our VIO system. The optimized estimates will be fed back to the visual and IMU bias networks refining the networks in a self-supervised manner. Such a learning-optimization-combined framework and feedback mechanism enable the system to perform online continual learning. Experiments demonstrate that our Adaptive VIO manifests adaptive capability on EuRoC and TUM-VI datasets. The overall performance exceeds the currently known learning-based VIO methods and is comparable to the state-of-the-art optimization-based methods. 
+ + + + Generalized Large-Scale Data Condensation via Various Backbone and Statistical Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Shao_Generalized_Large-Scale_Data_Condensation_via_Various_Backbone_and_Statistical_Matching_CVPR_2024_paper.pdf + The lightweight "local-match-global" matching introduced by SRe2L successfully creates a distilled dataset with comprehensive information on the full 224x224 ImageNet-1k. However this one-sided approach is limited to a particular backbone layer and statistics which limits the improvement of the generalization of a distilled dataset. We suggest that sufficient and various "local-match-global" matching are more precise and effective than a single one and have the ability to create a distilled dataset with richer information and better generalization. We call this perspective "generalized matching" and propose Generalized Various Backbone and Statistical Matching (G-VBSM) in this work which aims to create a synthetic dataset with densities ensuring consistency with the complete dataset across various backbones layers and statistics. As experimentally demonstrated G-VBSM is the first algorithm to obtain strong performance across both small-scale and large-scale datasets. Specifically G-VBSM achieves a performance of 38.7% on CIFAR-100 with 128-width ConvNet 47.6% on Tiny-ImageNet with ResNet18 and 31.4% on the full 224x224 ImageNet-1k with ResNet18 under images per class (IPC) 10 50 and 10 respectively. These results surpass all SOTA methods by margins of 3.9% 6.5% and 10.1% respectively. + + + + On Train-Test Class Overlap and Detection for Image Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_On_Train-Test_Class_Overlap_and_Detection_for_Image_Retrieval_CVPR_2024_paper.pdf + How important is it for training and evaluation sets to not have class overlap in image retrieval? We revisit Google Landmarks v2 clean the most popular training set by identifying and removing class overlap with Revisited Oxford and Paris the most popular evaluation set. By comparing the original and the new RGLDv2-clean on a benchmark of reproduced state-of-the-art methods our findings are striking. Not only is there a dramatic drop in performance but it is inconsistent across methods changing the ranking. What does it take to focus on objects of interest and ignore background clutter when indexing? Do we need to analyze the evaluation set? Do we need to train an object detector and the representation separately? Do we need location supervision? We introduce Single-stage Detect-to-Retrieve (CiDeR) an end-to-end single-stage pipeline to detect objects of interest and extract a global image representation. We outperform previous state-of-the-art on both existing training sets and the new RGLDv2-clean. + + + + AttriHuman-3D: Editable 3D Human Avatar Generation with Attribute Decomposition and Indexing + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_AttriHuman-3D_Editable_3D_Human_Avatar_Generation_with_Attribute_Decomposition_and_CVPR_2024_paper.pdf + Editable 3D-aware generation which supports user-interacted editing has witnessed rapid development recently. However existing editable 3D GANs either fail to achieve high-accuracy local editing or suffer from huge computational costs. We propose AttriHuman-3D an editable 3D human generation model which addresses the aforementioned problems with attribute decomposition and indexing. The core idea of the proposed model is to generate all attributes (e.g. 
human body hair clothes and so on) in an overall attribute space with six feature planes which are then decomposed and manipulated with different attribute indexes. To precisely extract features of different attributes from the generated feature planes we propose a novel attribute indexing method as well as an orthogonal projection regularization to enhance the disentanglement. We also introduce a hyper-latent training strategy and an attribute-specific sampling strategy to avoid style entanglement and misleading punishment from the discriminator. Our method allows users to interactively edit selected attributes in the generated 3D human avatars while keeping others fixed. Both qualitative and quantitative experiments demonstrate that our model provides a strong disentanglement between different attributes allows fine-grained image editing and generates high-quality 3D human avatars. + + + + Learning Object State Changes in Videos: An Open-World Perspective + http://openaccess.thecvf.com//content/CVPR2024/papers/Xue_Learning_Object_State_Changes_in_Videos_An_Open-World_Perspective_CVPR_2024_paper.pdf + Object State Changes (OSCs) are pivotal for video understanding. While humans can effortlessly generalize OSC understanding from familiar to unknown objects current approaches are confined to a closed vocabulary. Addressing this gap we introduce a novel open-world formulation for the video OSC problem. The goal is to temporally localize the three stages of an OSC---the object's initial state its transitioning state and its end state---whether or not the object has been observed during training. Towards this end we develop VidOSC a holistic learning approach that: (1) leverages text and vision-language models for supervisory signals to obviate manually labeling OSC training data and (2) abstracts fine-grained shared state representations from objects to enhance generalization. Furthermore we present HowToChange the first open-world benchmark for video OSC localization which offers an order of magnitude increase in the label space and annotation volume compared to the best existing benchmark. Experimental results demonstrate the efficacy of our approach in both traditional closed-world and open-world scenarios. + + + + SCoFT: Self-Contrastive Fine-Tuning for Equitable Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_SCoFT_Self-Contrastive_Fine-Tuning_for_Equitable_Image_Generation_CVPR_2024_paper.pdf + Accurate representation in media is known to improve the well-being of the people who consume it. Generative image models trained on large web-crawled datasets such as LAION are known to produce images with harmful stereotypes and misrepresentations of cultures. We improve inclusive representation in generated images by (1) engaging with communities to collect a culturally representative dataset that we call the Cross-Cultural Understanding Benchmark (CCUB) and (2) proposing a novel Self-Contrastive Fine-Tuning (SCoFT pronounced /soft/) method that leverages the model's known biases to self-improve. SCoFT is designed to prevent overfitting on small datasets encode only high-level information from the data and shift the generated distribution away from misrepresentations encoded in a pretrained model. 
Our user study conducted on 51 participants from 5 different countries based on their self-selected national cultural affiliation shows that fine-tuning on CCUB consistently generates images with higher cultural relevance and fewer stereotypes when compared to the Stable Diffusion baseline which is further improved with our SCoFT technique. + + + + Iterated Learning Improves Compositionality in Large Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Iterated_Learning_Improves_Compositionality_in_Large_Vision-Language_Models_CVPR_2024_paper.pdf + A fundamental characteristic common to both human vision and natural language is their compositional nature. Yet despite the performance gains contributed by large vision and language pretraining recent investigations find that most--if not all--our state-of-the-art vision-language models struggle at compositionality. They are unable to distinguish between images of "a girl in white facing a man in black" and "a girl in black facing a man in white". Moreover prior work suggests that compositionality doesn't arise with scale: larger model sizes or training data don't help. This paper develops a new iterated training algorithm that incentivizes compositionality. We draw on decades of cognitive science research that identifies cultural transmission--the need to teach a new generation--as a necessary inductive prior that incentivizes humans to develop compositional languages. Specifically we reframe vision-language contrastive learning as the Lewis Signaling Game between a vision agent and a language agent and operationalize cultural transmission by iteratively resetting one of the agent's weights during training. After every iteration this training paradigm induces representations that become "easier to learn" a property of compositional languages: e.g. our model trained on CC3M and CC12M improves standard CLIP by 4.7% and 4.0% respectively in the SugarCrepe benchmark. + + + + Event Stream-based Visual Object Tracking: A High-Resolution Benchmark Dataset and A Novel Baseline + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Event_Stream-based_Visual_Object_Tracking_A_High-Resolution_Benchmark_Dataset_and_CVPR_2024_paper.pdf + Tracking with bio-inspired event cameras has garnered increasing interest in recent years. Existing works either utilize aligned RGB and event data for accurate tracking or directly learn an event-based tracker. The former incurs higher inference costs while the latter may be susceptible to the impact of noisy events or sparse spatial resolution. In this paper we propose a novel hierarchical knowledge distillation framework that can fully utilize multi-modal / multi-view information during training to facilitate knowledge transfer enabling us to achieve high-speed and low-latency visual tracking during testing by using only event signals. Specifically a teacher Transformer-based multi-modal tracking framework is first trained by feeding the RGB frame and event stream simultaneously. Then we design a new hierarchical knowledge distillation strategy which includes pairwise similarity feature representation and response maps-based knowledge distillation to guide the learning of the student Transformer network. In particular since existing event-based tracking datasets are all low-resolution (346 * 260) we propose the first large-scale high-resolution (1280 * 720) dataset named EventVOT. 
It contains 1141 videos and covers a wide range of categories such as pedestrians vehicles UAVs ping pong etc. Extensive experiments on both low-resolution (FE240hz VisEvent COESOT) and our newly proposed high-resolution EventVOT dataset fully validated the effectiveness of our proposed method. The dataset evaluation toolkit and source code will be released. + + + + Dual DETRs for Multi-Label Temporal Action Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Dual_DETRs_for_Multi-Label_Temporal_Action_Detection_CVPR_2024_paper.pdf + Temporal Action Detection (TAD) aims to identify the action boundaries and the corresponding category within untrimmed videos. Inspired by the success of DETR in object detection several methods have adapted the query-based framework to the TAD task. However these approaches primarily followed DETR to predict actions at the instance level (i.e. identify each action by its center point) leading to sub-optimal boundary localization. To address this issue we propose a new Dual-level query-based TAD framework namely DualDETR to detect actions from both instance-level and boundary-level. Decoding at different levels requires semantics of different granularity therefore we introduce a two-branch decoding structure. This structure builds distinctive decoding processes for different levels facilitating explicit capture of temporal cues and semantics at each level. On top of the two-branch design we present a joint query initialization strategy to align queries from both levels. Specifically we leverage encoder proposals to match queries from each level in a one-to-one manner. Then the matched queries are initialized using position and content prior from the matched action proposal. The aligned dual-level queries can refine the matched proposal with complementary cues during subsequent decoding. We evaluate DualDETR on three challenging multi-label TAD benchmarks. The experimental results demonstrate the superior performance of DualDETR to the existing state-of-the-art methods achieving a substantial improvement under det-mAP and delivering impressive results under seg-mAP. + + + + Virtual Immunohistochemistry Staining for Histological Images Assisted by Weakly-supervised Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Virtual_Immunohistochemistry_Staining_for_Histological_Images_Assisted_by_Weakly-supervised_Learning_CVPR_2024_paper.pdf + Recently virtual staining technology has greatly promoted the advancement of histopathology. Despite the practical successes achieved the outstanding performance of most virtual staining methods relies on hard-to-obtain paired images in training. In this paper we propose a method for virtual immunohistochemistry (IHC) staining named confusion-GAN which does not require paired images and can achieve comparable performance to supervised algorithms. Specifically we propose a multi-branch discriminator which judges if the features of generated images can be embedded into the feature pool of target domain images to improve the visual quality of generated images. Meanwhile we also propose a novel patch-level pathology information extractor which is assisted by multiple instance learning to ensure pathological consistency during virtual staining. Extensive experiments were conducted on three types of IHC images including a high-resolution hepatocellular carcinoma immunohistochemical dataset proposed by us. 
The results demonstrated that our proposed confusion-GAN can generate highly realistic images that are capable of deceiving even experienced pathologists. Furthermore compared to using H&E images directly the downstream diagnosis achieved higher accuracy when using images generated by confusion-GAN. Our dataset and codes will be available at https://github.com/jiahanli2022/confusion-GAN. + + + + DeCoTR: Enhancing Depth Completion with 2D and 3D Attentions + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_DeCoTR_Enhancing_Depth_Completion_with_2D_and_3D_Attentions_CVPR_2024_paper.pdf + In this paper we introduce a novel approach that harnesses both 2D and 3D attentions to enable highly accurate depth completion without requiring iterative spatial propagations. Specifically we first enhance a baseline convolutional depth completion model by applying attention to 2D features in the bottleneck and skip connections. This effectively improves the performance of this simple network and sets it on par with the latest complex transformer-based models. Leveraging the initial depths and features from this network we uplift the 2D features to form a 3D point cloud and construct a 3D point transformer to process it allowing the model to explicitly learn and exploit 3D geometric features. In addition we propose normalization techniques to process the point cloud which improves learning and leads to better accuracy than directly using point transformers off the shelf. Furthermore we incorporate global attention on downsampled point cloud features which enables long-range context while still being computationally feasible. We evaluate our method DeCoTR on established depth completion benchmarks including NYU Depth V2 and KITTI showcasing that it sets new state-of-the-art performance. We further conduct zero-shot evaluations on ScanNet and DDAD benchmarks and demonstrate that DeCoTR has superior generalizability compared to existing approaches. + + + + Utility-Fairness Trade-Offs and How to Find Them + http://openaccess.thecvf.com//content/CVPR2024/papers/Dehdashtian_Utility-Fairness_Trade-Offs_and_How_to_Find_Them_CVPR_2024_paper.pdf + When building classification systems with demographic fairness considerations there are two objectives to satisfy: 1) maximizing utility for the specific task and 2) ensuring fairness w.r.t. a known demographic attribute. These objectives often compete so optimizing both can lead to a trade-off between utility and fairness. While existing works acknowledge the trade-offs and study their limits two questions remain unanswered: 1) What are the optimal tradeoffs between utility and fairness? and 2) How can we numerically quantify these trade-offs from data for a desired prediction task and demographic attribute of interest? This paper addresses these questions. We introduce two utility-fairness trade-offs: the Data-Space and Label-Space Trade-off. The trade-offs reveal three regions within the utility-fairness plane delineating what is fully and partially possible and impossible. We propose U-FaTE a method to numerically quantify the trade-offs for a given prediction task and group fairness definition from data samples. Based on the trade-offs we introduce a new scheme for evaluating representations. 
An extensive evaluation of fair representation learning methods and representations from over 1000 pre-trained models revealed that most current approaches are far from the estimated and achievable fairness-utility trade-offs across multiple datasets and prediction tasks. + + + + SAOR: Single-View Articulated Object Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Aygun_SAOR_Single-View_Articulated_Object_Reconstruction_CVPR_2024_paper.pdf + We introduce SAOR a novel approach for estimating the 3D shape texture and viewpoint of an articulated object from a single image captured in the wild. Unlike prior approaches that rely on pre-defined category-specific 3D templates or tailored 3D skeletons SAOR learns to articulate shapes from single-view image collections with a skeleton-free part-based model without requiring any 3D object shape priors. To prevent ill-posed solutions we propose a cross-instance consistency loss that exploits disentangled object shape deformation and articulation. This is helped by a new silhouette-based sampling mechanism to enhance viewpoint diversity during training. Our method only requires estimated object silhouettes and relative depth maps from off-the-shelf pre-trained networks during training. At inference time given a single-view image it efficiently outputs an explicit mesh representation. We obtain improved qualitative and quantitative results on challenging quadruped animals compared to relevant existing work. + + + + A Theory of Joint Light and Heat Transport for Lambertian Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Ramanagopal_A_Theory_of_Joint_Light_and_Heat_Transport_for_Lambertian_CVPR_2024_paper.pdf + We present a novel theory that establishes the relationship between light transport in visible and thermal infrared and heat transport in solids. We show that heat generated due to light absorption can be estimated by modeling heat transport using a thermal camera. For situations where heat conduction is negligible we analytically solve the heat transport equation to derive a simple expression relating the change in thermal image intensity to the absorbed light intensity and heat capacity of the material. Next we prove that intrinsic image decomposition for Lambertian scenes becomes a well-posed problem if one has access to the absorbed light. Our theory generalizes to arbitrary shapes and unstructured illumination. Our theory is based on applying energy conservation principle at each pixel independently. We validate our theory using real-world experiments on diffuse objects made of different materials that exhibit both direct and global components (inter-reflections) of light transport under unknown complex lighting. + + + + iKUN: Speak to Trackers without Retraining + http://openaccess.thecvf.com//content/CVPR2024/papers/Du_iKUN_Speak_to_Trackers_without_Retraining_CVPR_2024_paper.pdf + Referring multi-object tracking (RMOT) aims to track multiple objects based on input textual descriptions. Previous works realize it by simply integrating an extra textual module into the multi-object tracker. However they typically need to retrain the entire framework and have difficulties in optimization. In this work we propose an insertable Knowledge Unification Network termed iKUN to enable communication with off-the-shelf trackers in a plug-and-play manner. Concretely a knowledge unification module (KUM) is designed to adaptively extract visual features based on textual guidance. 
Meanwhile to improve the localization accuracy we present a neural version of Kalman filter (NKF) to dynamically adjust process noise and observation noise based on the current motion status. Moreover to address the problem of open-set long-tail distribution of textual descriptions a test-time similarity calibration method is proposed to refine the confidence score with pseudo frequency. Extensive experiments on Refer-KITTI dataset verify the effectiveness of our framework. Finally to speed up the development of RMOT we also contribute a more challenging dataset Refer-Dance by extending public DanceTrack dataset with motion and dressing descriptions. The codes and dataset are available at https://github.com/dyhBUPT/iKUN. + + + + Facial Identity Anonymization via Intrinsic and Extrinsic Attention Distraction + http://openaccess.thecvf.com//content/CVPR2024/papers/Kuang_Facial_Identity_Anonymization_via_Intrinsic_and_Extrinsic_Attention_Distraction_CVPR_2024_paper.pdf + The unprecedented capture and application of face images raise increasing concerns on anonymization to fight against privacy disclosure. Most existing methods may suffer from the problem of excessive change of the identity-independent information or insufficient identity protection. In this paper we present a new face anonymization approach by distracting the intrinsic and extrinsic identity attentions. On the one hand we anonymize the identity information in the feature space by distracting the intrinsic identity attention. On the other we anonymize the visual clues (i.e. appearance and geometry structure) by distracting the extrinsic identity attention. Our approach allows for flexible and intuitive manipulation of face appearance and geometry structure to produce diverse results and it can also be used to instruct users to perform personalized anonymization. We conduct extensive experiments on multiple datasets and demonstrate that our approach outperforms state-of-the-art methods. + + + + 3D-SceneDreamer: Text-Driven 3D-Consistent Scene Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_3D-SceneDreamer_Text-Driven_3D-Consistent_Scene_Generation_CVPR_2024_paper.pdf + Text-driven 3D scene generation techniques have made rapid progress in recent years. Their success is mainly attributed to using existing generative models to iteratively perform image warping and inpainting to generate 3D scenes. However these methods heavily rely on the outputs of existing models leading to error accumulation in geometry and appearance that prevent the models from being used in various scenarios (e.g. outdoor and unreal scenarios). To address this limitation we generatively refine the newly generated local views by querying and aggregating global 3D information and then progressively generate the 3D scene. Specifically we employ a tri-plane features-based NeRF as a unified representation of the 3D scene to constrain global 3D consistency and propose a generative refinement network to synthesize new contents with higher quality by exploiting the natural image prior from 2D diffusion model as well as the global 3D information of the current scene. Our extensive experiments demonstrate that in comparison to previous methods our approach supports wide variety of scene generation and arbitrary camera trajectories with improved visual quality and 3D consistency. 
+ + + + VMINer: Versatile Multi-view Inverse Rendering with Near- and Far-field Light Sources + http://openaccess.thecvf.com//content/CVPR2024/papers/Fei_VMINer_Versatile_Multi-view_Inverse_Rendering_with_Near-_and_Far-field_Light_CVPR_2024_paper.pdf + This paper introduces a versatile multi-view inverse rendering framework with near- and far-field light sources. Tackling the fundamental challenge of inherent ambiguity in inverse rendering our framework adopts a lightweight yet inclusive lighting model for different near- and far-field lights thus is able to make use of input images under varied lighting conditions available during capture. It leverages observations under each lighting to disentangle the intrinsic geometry and material from the external lighting using both neural radiance field rendering and physically-based surface rendering on the 3D implicit fields. After training the reconstructed scene is extracted to a textured triangle mesh for seamless integration into industrial rendering software for various applications. Quantitatively and qualitatively tested on synthetic and real-world scenes our method shows superiority to state-of-the-art multi-view inverse rendering methods in both speed and quality. + + + + RoHM: Robust Human Motion Reconstruction via Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_RoHM_Robust_Human_Motion_Reconstruction_via_Diffusion_CVPR_2024_paper.pdf + We propose RoHM an approach for robust 3D human motion reconstruction from monocular RGB(-D) videos in the presence of noise and occlusions. Most previous approaches either train neural networks to directly regress motion in 3D or learn data-driven motion priors and combine them with optimization at test time. RoHM is a novel diffusion-based motion model that conditioned on noisy and occluded input data reconstructs complete plausible motions in consistent global coordinates. Given the complexity of the problem -- requiring one to address different tasks (denoising and infilling) in different solution spaces (local and global motion) -- we decompose it into two sub-tasks and learn two models one for global trajectory and one for local motion. To capture the correlations between the two we then introduce a novel conditioning module combining it with an iterative inference scheme. We apply RoHM to a variety of tasks -- from motion reconstruction and denoising to spatial and temporal infilling. Extensive experiments on three popular datasets show that our method outperforms state-of-the-art approaches qualitatively and quantitatively while being faster at test time. The code is available at https://sanweiliti.github.io/ROHM/ROHM.html. + + + + Do You Remember? Dense Video Captioning with Cross-Modal Memory Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Do_You_Remember_Dense_Video_Captioning_with_Cross-Modal_Memory_Retrieval_CVPR_2024_paper.pdf + There has been significant attention to the research on dense video captioning which aims to automatically localize and caption all events within untrimmed video. Several studies introduce methods by designing dense video captioning as a multitasking problem of event localization and event captioning to consider inter-task relations. However addressing both tasks using only visual input is challenging due to the lack of semantic content. In this study we address this by proposing a novel framework inspired by the cognitive information processing of humans. 
Our model utilizes external memory to incorporate prior knowledge. The memory retrieval method is proposed with cross-modal video-to-text matching. To effectively incorporate retrieved text features the versatile encoder and the decoder with visual and textual cross-attention modules are designed. Comparative experiments have been conducted to show the effectiveness of the proposed method on ActivityNet Captions and YouCook2 datasets. Experimental results show promising performance of our model without extensive pretraining from a large video dataset. Our code is available at https://github.com/ailab-kyunghee/CM2_DVC. + + + + SPAD: Spatially Aware Multi-View Diffusers + http://openaccess.thecvf.com//content/CVPR2024/papers/Kant_SPAD_Spatially_Aware_Multi-View_Diffusers_CVPR_2024_paper.pdf + We present SPAD a novel approach for creating consistent multi-view images from text prompts or single images. To enable multi-view generation we repurpose a pretrained 2D diffusion model by extending its self-attention layers with cross-view interactions and fine-tune it on a high quality subset of Objaverse. We find that a naive extension of the self-attention proposed in prior work (e.g. MVDream) leads to content copying between views. Therefore we explicitly constrain the cross-view attention based on epipolar geometry. To further enhance 3D consistency we utilize Plücker coordinates derived from camera rays and inject them as positional encoding. This enables SPAD to reason over spatial proximity in 3D well. Compared to concurrent works that can only generate views at fixed azimuth and elevation (e.g. MVDream SyncDreamer) SPAD offers full camera control and achieves state-of-the-art results in novel view synthesis on unseen objects from the Objaverse and Google Scanned Objects datasets. Finally we demonstrate that text-to-3D generation using SPAD prevents the multi-face Janus issue. + + + + Gradient Reweighting: Towards Imbalanced Class-Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Gradient_Reweighting_Towards_Imbalanced_Class-Incremental_Learning_CVPR_2024_paper.pdf + Class-Incremental Learning (CIL) trains a model to continually recognize new classes from non-stationary data while retaining learned knowledge. A major challenge of CIL arises when applying to real-world data characterized by non-uniform distribution which introduces a dual imbalance problem involving (i) disparities between stored exemplars of old tasks and new class data (inter-phase imbalance) and (ii) severe class imbalances within each individual task (intra-phase imbalance). We show that this dual imbalance issue causes skewed gradient updates with biased weights in FC layers thus inducing over/under-fitting and catastrophic forgetting in CIL. Our method addresses it by reweighting the gradients towards balanced optimization and unbiased classifier learning. Additionally we observe imbalanced forgetting where paradoxically the instance-rich classes suffer higher performance degradation during CIL due to a larger amount of training data becoming unavailable in subsequent learning phases. To tackle this we further introduce a distribution-aware knowledge distillation loss to mitigate forgetting by aligning output logits proportionally with the distribution of lost training data.
We validate our method on CIFAR-100 ImageNetSubset and Food101 across various evaluation protocols and demonstrate consistent improvements compared to existing works showing great potential to apply CIL in real-world scenarios with enhanced robustness and effectiveness. + + + + Gaussian Splatting SLAM + http://openaccess.thecvf.com//content/CVPR2024/papers/Matsuki_Gaussian_Splatting_SLAM_CVPR_2024_paper.pdf + We present the first application of 3D Gaussian Splatting in monocular SLAM the most fundamental but the hardest setup for Visual SLAM. Our method which runs live at 3fps utilises Gaussians as the only 3D representation unifying the required representation for accurate efficient tracking mapping and high-quality rendering. Designed for challenging monocular settings our approach is seamlessly extendable to RGB-D SLAM when an external depth sensor is available. Several innovations are required to continuously reconstruct 3D scenes with high fidelity from a live camera. First to move beyond the original 3DGS algorithm which requires accurate poses from an offline Structure from Motion (SfM) system we formulate camera tracking for 3DGS using direct optimisation against the 3D Gaussians and show that this enables fast and robust tracking with a wide basin of convergence. Second by utilising the explicit nature of the Gaussians we introduce geometric verification and regularisation to handle the ambiguities occurring in incremental 3D dense reconstruction. Finally we introduce a full SLAM system which not only achieves state-of-the-art results in novel view synthesis and trajectory estimation but also reconstruction of tiny and even transparent objects. + + + + Not All Classes Stand on Same Embeddings: Calibrating a Semantic Distance with Metric Tensor + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Not_All_Classes_Stand_on_Same_Embeddings_Calibrating_a_Semantic_CVPR_2024_paper.pdf + The consistency training (CT)-based semi-supervised learning (SSL) achieves state-of-the-art performance on SSL-based image classification. However the existing CT-based SSL methods do not highlight the non-Euclidean characteristics and class-wise varieties of embedding spaces in an SSL model thus they cannot fully utilize the effectiveness of CT. Thus we propose a metric tensor-based consistency regularization exploiting the class-variant geometrical structure of embeddings on the high-dimensional feature space. The proposed method not only minimizes the prediction discrepancy between different views of a given image but also estimates the intrinsic geometric curvature of embedding spaces by employing the global and local metric tensors. The global metric tensor is used to globally estimate the class-invariant embeddings from the whole data distribution while the local metric tensor is exploited to estimate the class-variant embeddings of each cluster. The two metric tensors are optimized by the consistency regularization based on the weak and strong augmentation strategy. The proposed method provides the highest classification accuracy on average compared to the existing state-of-the-art SSL methods on conventional datasets. + + + + A Simple Recipe for Contrastively Pre-training Video-First Encoders Beyond 16 Frames + http://openaccess.thecvf.com//content/CVPR2024/papers/Papalampidi_A_Simple_Recipe_for_Contrastively_Pre-training_Video-First_Encoders_Beyond_16_CVPR_2024_paper.pdf + Understanding long real-world videos requires modeling of long-range visual dependencies.
To this end we explore video-first architectures building on the common paradigm of transferring large-scale image--text models to video via shallow temporal fusion. However we expose two limitations to the approach: (1) decreased spatial capabilities likely due to poor video--language alignment in standard video datasets and (2) higher memory consumption bottlenecking the number of frames that can be processed. To mitigate the memory bottleneck we systematically analyze the memory/accuracy trade-off of various efficient methods: factorized attention parameter-efficient image-to-video adaptation input masking and multi-resolution patchification. Surprisingly simply masking large portions of the video (up to 75%) during contrastive pre-training proves to be one of the most robust ways to scale encoders to videos up to 4.3 minutes at 1 FPS. Our simple approach for training long video-to-text models which scales to 1B parameters does not add new architectural complexity and is able to outperform the popular paradigm of using much larger LLMs as an information aggregator over segment-based information on benchmarks with long-range temporal dependencies (YouCook2 EgoSchema). + + + + Hierarchical Diffusion Policy for Kinematics-Aware Multi-Task Robotic Manipulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Hierarchical_Diffusion_Policy_for_Kinematics-Aware_Multi-Task_Robotic_Manipulation_CVPR_2024_paper.pdf + This paper introduces Hierarchical Diffusion Policy (HDP) a hierarchical agent for multi-task robotic manipulation. HDP factorises a manipulation policy into a hierarchical structure: a high-level task-planning agent which predicts a distant next-best end-effector pose (NBP) and a low-level goal-conditioned diffusion policy which generates optimal motion trajectories. The factorised policy representation allows HDP to tackle both long-horizon task planning while generating fine-grained low-level actions. To generate context-aware motion trajectories while satisfying robot kinematics constraints we present a novel kinematics-aware goal-conditioned control agent Robot Kinematics Diffuser (RK-Diffuser). Specifically RK-Diffuser learns to generate both the end-effector pose and joint position trajectories and distill the accurate but kinematics-unaware end-effector pose diffuser to the kinematics-aware but less accurate joint position diffuser via differentiable kinematics. Empirically we show that HDP achieves a significantly higher success rate than the state-of-the-art methods in both simulation and real-world. + + + + Benchmarking the Robustness of Temporal Action Detection Models Against Temporal Corruptions + http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_Benchmarking_the_Robustness_of_Temporal_Action_Detection_Models_Against_Temporal_CVPR_2024_paper.pdf + Temporal action detection (TAD) aims to locate action positions and recognize action categories in long-term untrimmed videos. Although many methods have achieved promising results their robustness has not been thoroughly studied. In practice we observe that temporal information in videos can be occasionally corrupted such as missing or blurred frames. Interestingly existing methods often incur a significant performance drop even if only one frame is affected. To formally evaluate the robustness we establish two temporal corruption robustness benchmarks namely THUMOS14-C and ActivityNet-v1.3-C. 
In this paper we extensively analyze the robustness of seven leading TAD methods and obtain some interesting findings: 1) Existing methods are particularly vulnerable to temporal corruptions and end-to-end methods are often more susceptible than those with a pre-trained feature extractor; 2) Vulnerability mainly comes from localization error rather than classification error; 3) When corruptions occur in the middle of an action instance TAD models tend to yield the largest performance drop. Besides building a benchmark we further develop a simple but effective robust training method to defend against temporal corruptions through the FrameDrop augmentation and Temporal-Robust Consistency loss. Remarkably our approach not only improves robustness but also yields promising improvements on clean data. We believe that this study will serve as a benchmark for future research in robust video analysis. Source code and models are available at https://github.com/Alvin-Zeng/temporal-robustness-benchmark. + + + + Open-World Human-Object Interaction Detection via Multi-modal Prompts + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Open-World_Human-Object_Interaction_Detection_via_Multi-modal_Prompts_CVPR_2024_paper.pdf + In this paper we develop MP-HOI a powerful Multi-modal Prompt-based HOI detector designed to leverage both textual descriptions for open-set generalization and visual exemplars for handling high ambiguity in descriptions realizing HOI detection in the open world. Specifically it integrates visual prompts into existing language-guided-only HOI detectors to handle situations where textual descriptions face difficulties in generalization and to address complex scenarios with high interaction ambiguity. To facilitate MP-HOI training we build a large-scale HOI dataset named Magic-HOI which gathers six existing datasets into a unified label space forming over 186K images with 2.4K objects 1.2K actions and 20K HOI interactions. Furthermore to tackle the long-tail issue within the Magic-HOI dataset we introduce an automated pipeline for generating realistically annotated HOI images and present SynHOI a high-quality synthetic HOI dataset containing 100K images. Leveraging these two datasets MP-HOI optimizes the HOI task as a similarity learning process between multi-modal prompts and objects/interactions via a unified contrastive loss to learn generalizable and transferable objects/interactions representations from large-scale data. MP-HOI could serve as a generalist HOI detector surpassing the HOI vocabulary of existing expert models by more than 30 times. Concurrently our results demonstrate that MP-HOI exhibits remarkable zero-shot capability in real-world scenarios and consistently achieves a new state-of-the-art performance across various benchmarks. Our project homepage is available at https://MP-HOI.github.io/. + + + + UniMODE: Unified Monocular 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_UniMODE_Unified_Monocular_3D_Object_Detection_CVPR_2024_paper.pdf + Realizing unified monocular 3D object detection including both indoor and outdoor scenes holds great importance in applications like robot navigation. However involving various scenarios of data to train models poses challenges due to their significantly different characteristics e.g. diverse geometry properties and heterogeneous domain distributions. 
To address these challenges we build a detector based on the bird's-eye-view (BEV) detection paradigm where the explicit feature projection is beneficial to addressing the geometry learning ambiguity when employing multiple scenarios of data to train detectors. Then we split the classical BEV detection architecture into two stages and propose an uneven BEV grid design to handle the convergence instability caused by the aforementioned challenges. Moreover we develop a sparse BEV feature projection strategy to reduce computational cost and a unified domain alignment method to handle heterogeneous domains. Combining these techniques a unified detector UniMODE is derived which surpasses the previous state-of-the-art on the challenging Omni3D dataset (a large-scale dataset including both indoor and outdoor scenes) by 4.9% AP_3D revealing the first successful generalization of a BEV detector to unified 3D object detection. + + + + Multi-agent Collaborative Perception via Motion-aware Robust Communication Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Hong_Multi-agent_Collaborative_Perception_via_Motion-aware_Robust_Communication_Network_CVPR_2024_paper.pdf + Collaborative perception allows for information sharing between multiple agents such as vehicles and infrastructure to obtain a comprehensive view of the environment through communication and fusion. Current research on multi-agent collaborative perception systems often assumes ideal communication and perception environments and neglects the effect of real-world noise such as pose noise motion blur and perception noise. To address this gap in this paper we propose a novel motion-aware robust communication network (MRCNet) that mitigates noise interference and achieves accurate and robust collaborative perception. MRCNet consists of two main components: multi-scale robust fusion (MRF) addresses pose noise by developing cross-semantic multi-scale enhanced aggregation to fuse features of different scales while motion enhanced mechanism (MEM) captures motion context to compensate for information blurring caused by moving objects. Experimental results on popular collaborative 3D object detection datasets demonstrate that MRCNet outperforms competing methods in noisy scenarios with improved perception performance using less bandwidth. + + + + The Manga Whisperer: Automatically Generating Transcriptions for Comics + http://openaccess.thecvf.com//content/CVPR2024/papers/Sachdeva_The_Manga_Whisperer_Automatically_Generating_Transcriptions_for_Comics_CVPR_2024_paper.pdf + In the past few decades Japanese comics commonly referred to as Manga have transcended both cultural and linguistic boundaries to become a true worldwide sensation. Yet the inherent reliance on visual cues and illustration within manga renders it largely inaccessible to individuals with visual impairments. In this work we seek to address this substantial barrier with the aim of ensuring that manga can be appreciated and actively engaged by everyone. Specifically we tackle the problem of diarisation i.e. generating a transcription of who said what and when in a fully automatic way.
To this end we make the following contributions: (1) we present a unified model Magi that is able to (a) detect panels text boxes and character boxes (b) cluster characters by identity (without knowing the number of clusters a priori) and (c) associate dialogues to their speakers; (2) we propose a novel approach that is able to sort the detected text boxes in their reading order and generate a dialogue transcript; (3) we annotate an evaluation benchmark for this task using publicly available [English] manga pages. + + + + Exploring Region-Word Alignment in Built-in Detector for Open-Vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Exploring_Region-Word_Alignment_in_Built-in_Detector_for_Open-Vocabulary_Object_Detection_CVPR_2024_paper.pdf + Open-vocabulary object detection aims to detect novel categories that are independent from the base categories used during training. Most modern methods adhere to the paradigm of learning vision-language space from a large-scale multi-modal corpus and subsequently transferring the acquired knowledge to off-the-shelf detectors like Faster-RCNN. However information attenuation or destruction may occur during the process of knowledge transfer due to the domain gap hampering the generalization ability on novel categories. To mitigate this predicament in this paper we present a novel framework named BIND standing for Built-IN Detector to eliminate the need for module replacement or knowledge transfer to off-the-shelf detectors. Specifically we design a two-stage training framework with an Encoder-Decoder structure. In the first stage an image-text dual encoder is trained to learn region-word alignment from a corpus of image-text pairs. In the second stage a DETR-style decoder is trained to perform detection on annotated object detection datasets. In contrast to conventional manually designed non-adaptive anchors which generate numerous redundant proposals we develop an anchor proposal network that generates anchor proposals with high likelihood based on candidates adaptively thereby substantially improving detection efficiency. Experimental results on two public benchmarks COCO and LVIS demonstrate that our method stands as a state-of-the-art approach for open-vocabulary object detection. + + + + MovieChat: From Dense Token to Sparse Memory for Long Video Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_MovieChat_From_Dense_Token_to_Sparse_Memory_for_Long_Video_CVPR_2024_paper.pdf + Recently integrating video foundation models and large language models to build a video understanding system can overcome the limitations of specific pre-defined vision tasks. Yet existing systems can only handle videos with very few frames. For long videos the computation complexity memory cost and long-term temporal connection impose additional challenges. Taking advantage of the Atkinson-Shiffrin memory model with tokens in Transformers being employed as the carriers of memory in combination with our specially designed memory mechanism we propose the MovieChat to overcome these challenges. MovieChat achieves state-of-the-art performance in long video understanding along with the released MovieChat-1K benchmark with 1K long video and 14K manual annotations for validation of the effectiveness of our method. The code models and data can be found in https://rese1f.github.io/MovieChat.
+ + + + Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Comparing_the_Decision-Making_Mechanisms_by_Transformers_and_CNNs_via_Explanation_CVPR_2024_paper.pdf + In order to gain insights about the decision-making of different visual recognition backbones we propose two methodologies sub-explanation counting and cross-testing that systematically applies deep explanation algorithms on a dataset-wide basis and compares the statistics generated from the amount and nature of the explanations. These methodologies reveal the difference among networks in terms of two properties called compositionality and disjunctivism. Transformers and ConvNeXt are found to be more compositional in the sense that they jointly consider multiple parts of the image in building their decisions whereas traditional CNNs and distilled transformers are less compositional and more disjunctive which means that they use multiple diverse but smaller set of parts to achieve a confident prediction. Through further experiments we pinpointed the choice of normalization to be especially important in the compositionality of a model in that batch normalization leads to less compositionality while group and layer normalization lead to more. Finally we also analyze the features shared by different backbones and plot a landscape of different models based on their feature-use similarity. + + + + Atlantis: Enabling Underwater Depth Estimation with Stable Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Atlantis_Enabling_Underwater_Depth_Estimation_with_Stable_Diffusion_CVPR_2024_paper.pdf + Monocular depth estimation has experienced significant progress on terrestrial images in recent years thanks to deep learning advancements. But it remains inadequate for underwater scenes primarily due to data scarcity. Given the inherent challenges of light attenuation and backscatter in water acquiring clear underwater images or precise depth is notably difficult and costly. To mitigate this issue learning-based approaches often rely on synthetic data or turn to self- or unsupervised manners. Nonetheless their performance is often hindered by domain gap and looser constraints. In this paper we propose a novel pipeline for generating photorealistic underwater images using accurate terrestrial depth. This approach facilitates the supervised training of models for underwater depth estimation effectively reducing the performance disparity between terrestrial and underwater environments. Contrary to previous synthetic datasets that merely apply style transfer to terrestrial images without scene content change our approach uniquely creates vivid non-existent underwater scenes by leveraging terrestrial depth data through the innovative Stable Diffusion model. Specifically we introduce a specialized Depth2Underwater ControlNet trained on prepared {Underwater, Depth, Text} data triplets for this generation task. Our newly developed dataset Atlantis enables terrestrial depth estimation models to achieve considerable improvements on unseen underwater scenes surpassing their terrestrial pretrained counterparts both quantitatively and qualitatively. Moreover we further show its practical utility by applying the improved depth in underwater image enhancement and its smaller domain gap from the LLVM perspective. Code and dataset are publicly available at https://github.com/zkawfanx/Atlantis.
+ + + + Matching Anything by Segmenting Anything + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Matching_Anything_by_Segmenting_Anything_CVPR_2024_paper.pdf + The robust association of the same objects across video frames in complex scenes is crucial for many applications especially object tracking. Current methods predominantly rely on labeled domain-specific video datasets which limits cross-domain generalization of learned similarity embeddings. We propose MASA a novel method for robust instance association learning capable of matching any objects within videos across diverse domains without tracking labels. Leveraging the rich object segmentation from the Segment Anything Model (SAM) MASA learns instance-level correspondence through exhaustive data transformations. We treat the SAM outputs as dense object region proposals and learn to match those regions from a vast image collection. We further design a universal MASA adapter which can work in tandem with foundational segmentation or detection models and enable them to track any detected objects. Those combinations present strong zero-shot tracking ability in complex domains. Extensive tests on multiple challenging MOT and MOTS benchmarks indicate that the proposed method using only unlabelled static images achieves even better performance than state-of-the-art methods trained with fully annotated in-domain video sequences in zero-shot association. Our code is available at https://github.com/siyuanliii/masa. + + + + Decoupled Pseudo-labeling for Semi-Supervised Monocular 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Decoupled_Pseudo-labeling_for_Semi-Supervised_Monocular_3D_Object_Detection_CVPR_2024_paper.pdf + We delve into pseudo-labeling for semi-supervised monocular 3D object detection (SSM3OD) and discover two primary issues: a misalignment between the prediction quality of 3D and 2D attributes and the tendency of depth supervision derived from pseudo-labels to be noisy leading to significant optimization conflicts with other reliable forms of supervision. To tackle these issues we introduce a novel decoupled pseudo-labeling (DPL) approach for SSM3OD. Our approach features a Decoupled Pseudo-label Generation (DPG) module designed to efficiently generate pseudo-labels by separately processing 2D and 3D attributes. This module incorporates a unique homography-based method for identifying dependable pseudo-labels in Bird's Eye View (BEV) space specifically for 3D attributes. Additionally we present a Depth Gradient Projection (DGP) module to mitigate optimization conflicts caused by noisy depth supervision of pseudo-labels effectively decoupling the depth gradient and removing conflicting gradients. This dual decoupling strategy--at both the pseudo-label generation and gradient levels--significantly improves the utilization of pseudo-labels in SSM3OD. Our comprehensive experiments on the KITTI benchmark demonstrate the superiority of our method over existing approaches. + + + + Temporally Consistent Unbalanced Optimal Transport for Unsupervised Action Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Temporally_Consistent_Unbalanced_Optimal_Transport_for_Unsupervised_Action_Segmentation_CVPR_2024_paper.pdf + We propose a novel approach to the action segmentation task for long untrimmed videos based on solving an optimal transport problem.
By encoding a temporal consistency prior into a Gromov-Wasserstein problem we are able to decode a temporally consistent segmentation from a noisy affinity/matching cost matrix between video frames and action classes. Unlike previous approaches our method does not require knowing the action order for a video to attain temporal consistency. Furthermore our resulting (fused) Gromov-Wasserstein problem can be efficiently solved on GPUs using a few iterations of projected mirror descent. We demonstrate the effectiveness of our method in an unsupervised learning setting where our method is used to generate pseudo-labels for self-training. We evaluate our segmentation approach and unsupervised learning pipeline on the Breakfast 50-Salads YouTube Instructions and Desktop Assembly datasets yielding state-of-the-art results for the unsupervised video action segmentation task. + + + + Learning Transferable Negative Prompts for Out-of-Distribution Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Learning_Transferable_Negative_Prompts_for_Out-of-Distribution_Detection_CVPR_2024_paper.pdf + Existing prompt learning methods have shown certain capabilities in Out-of-Distribution (OOD) detection but the lack of OOD images in the target dataset in their training can lead to mismatches between OOD images and In-Distribution (ID) categories resulting in a high false positive rate. To address this issue we introduce a novel OOD detection method named 'NegPrompt' to learn a set of negative prompts each representing a negative connotation of a given class label for delineating the boundaries between ID and OOD images. It learns such negative prompts with ID data only without any reliance on external outlier data. Further current methods assume the availability of samples of all ID classes rendering them ineffective in open-vocabulary learning scenarios where the inference stage can contain novel ID classes not present during training. In contrast our learned negative prompts are transferable to novel class labels. Experiments on various ImageNet benchmarks show that NegPrompt surpasses state-of-the-art prompt-learning-based OOD detection methods and maintains a consistent lead in hard OOD detection in closed- and open-vocabulary classification scenarios. Code is available at https://github.com/mala-lab/negprompt. + + + + Holistic Features are almost Sufficient for Text-to-Video Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Tian_Holistic_Features_are_almost_Sufficient_for_Text-to-Video_Retrieval_CVPR_2024_paper.pdf + For text-to-video retrieval (T2VR) which aims to retrieve unlabeled videos by ad-hoc textual queries CLIP-based methods currently lead the way. Compared to CLIP4Clip which is efficient and compact state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching putting their scalability for large-scale T2VR applications into doubt. We propose TeachCLIP enabling a CLIP4Clip based student network to learn from more advanced yet computationally intensive models. In order to create a learning channel to convey fine-grained cross-modal knowledge from a heavy model to the student we add to CLIP4Clip a simple Attentional frame-Feature Aggregation (AFA) block which by design adds no extra storage / computation overhead at the retrieval stage. Frame-text relevance scores calculated by the teacher network are used as soft labels to supervise the attentive weights produced by AFA. 
Extensive experiments on multiple public datasets justify the viability of the proposed method. TeachCLIP has the same efficiency and compactness as CLIP4Clip yet has near-SOTA effectiveness. + + + + Uncertainty-aware Action Decoupling Transformer for Action Anticipation + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_Uncertainty-aware_Action_Decoupling_Transformer_for_Action_Anticipation_CVPR_2024_paper.pdf + Human action anticipation aims at predicting what people will do in the future based on past observations. In this paper we introduce Uncertainty-aware Action Decoupling Transformer (UADT) for action anticipation. Unlike existing methods that directly predict action in a verb-noun pair format we decouple the action anticipation task into verb and noun anticipations separately. The objective is to make the two decoupled tasks assist each other and eventually improve the action anticipation task. Specifically we propose a two-stream Transformer-based architecture which is composed of a verb-to-noun model and a noun-to-verb model. The verb-to-noun model leverages the verb information to improve the noun prediction and the other way around. We extend the model in a probabilistic manner and quantify the predictive uncertainty of each decoupled task to select features. In this way the noun prediction leverages the most informative and redundancy-free verb features and verb prediction works similarly. Finally the two streams are combined dynamically based on their uncertainties to make the joint action anticipation. We demonstrate the efficacy of our method by achieving state-of-the-art performance on action anticipation benchmarks including EPIC-KITCHENS EGTEA Gaze+ and 50-Salads. + + + + One-Prompt to Segment All Medical Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_One-Prompt_to_Segment_All_Medical_Images_CVPR_2024_paper.pdf + Large foundation models known for their strong zero-shot generalization have excelled in visual and language applications. However applying them to medical image segmentation a domain with diverse imaging types and target labels remains an open challenge. Current approaches such as adapting interactive segmentation models like Segment Anything Model (SAM) require user prompts for each sample during inference. Alternatively transfer learning methods like few/one-shot models demand labeled samples leading to high costs. This paper introduces a new paradigm toward the universal medical image segmentation termed 'One-Prompt Segmentation.' One-Prompt Segmentation combines the strengths of one-shot and interactive methods. In the inference stage with just one prompted sample it can adeptly handle the unseen task in a single forward pass. We train One-Prompt Model on 64 open-source medical datasets accompanied by the collection of over 3000 clinician-labeled prompts. Tested on 14 previously unseen datasets the One-Prompt Model showcases superior zero-shot segmentation capabilities outperforming a wide range of related methods. The code and data is released as https://github.com/KidsWithTokens/one-prompt. + + + + GROUNDHOG: Grounding Large Language Models to Holistic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_GROUNDHOG_Grounding_Large_Language_Models_to_Holistic_Segmentation_CVPR_2024_paper.pdf + Most multimodal large language models (MLLMs) learn language-to-object grounding through causal language modeling where grounded objects are captured by bounding boxes as sequences of location tokens. 
This paradigm lacks pixel-level representations that are important for fine-grained visual understanding and diagnosis. In this work we introduce GROUNDHOG an MLLM developed by grounding Large Language Models to holistic segmentation. GROUNDHOG incorporates a masked feature extractor and converts extracted features into visual entity tokens for the MLLM backbone which then connects groundable phrases to unified grounding masks by retrieving and merging the entity masks. To train GROUNDHOG we carefully curated M3G2 a grounded visual instruction tuning dataset with Multi-Modal Multi-Grained Grounding by harvesting a collection of segmentation-grounded datasets with rich annotations. Our experimental results show that GROUNDHOG achieves superior performance on various language grounding tasks without task-specific fine-tuning and significantly reduces object hallucination. GROUNDHOG also demonstrates better grounding towards complex forms of visual input and provides easy-to-understand diagnosis in failure cases. + + + + Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Omni-SMoLA_Boosting_Generalist_Multimodal_Models_with_Soft_Mixture_of_Low-rank_CVPR_2024_paper.pdf + In this work we present Omni-SMoLA a multimodal architecture that mixes many multi-modal experts efficiently and achieves both high specialist and generalist performance. In contrast to previous models for which we see performance degradation on average when training the models on a wide range of tasks we show that the SMoLA low-rank experts are able to model different skills and tasks and overall improve the performance of a generalist model. This finding indicates that simple LMM fine-tuning is suboptimal for handling a wide range of tasks and that pairing the act of fine-tuning with specifically-designed architecture changes leads to better performing models. + + + + SeMoLi: What Moves Together Belongs Together + http://openaccess.thecvf.com//content/CVPR2024/papers/Seidenschwarz_SeMoLi_What_Moves_Together_Belongs_Together_CVPR_2024_paper.pdf + We tackle semi-supervised object detection based on motion cues. Recent results suggest that heuristic-based clustering methods in conjunction with object trackers can be used to pseudo-label instances of moving objects and use these as supervisory signals to train 3D object detectors in Lidar data without manual supervision. We re-think this approach and suggest that both object detection as well as motion-inspired pseudo-labeling can be tackled in a data-driven manner. We leverage recent advances in scene flow estimation to obtain point trajectories from which we extract long-term class-agnostic motion patterns. Revisiting correlation clustering in the context of message passing networks we learn to group those motion patterns to cluster points to object instances. By estimating the full extent of the objects we obtain per-scan 3D bounding boxes that we use to supervise a Lidar object detection network. Our method not only outperforms prior heuristic-based approaches (57.5 AP +14 improvement over prior work) more importantly we show we can pseudo-label and train object detectors across datasets.
+ + + + Context-Guided Spatio-Temporal Video Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Gu_Context-Guided_Spatio-Temporal_Video_Grounding_CVPR_2024_paper.pdf + Spatio-temporal video grounding (or STVG) task aims at locating a spatio-temporal tube for a specific instance given a text query. Despite advancements current methods easily suffer the distractors or heavy object appearance variations in videos due to insufficient object information from the text leading to degradation. Addressing this we propose a novel framework context-guided STVG (CG-STVG) which mines discriminative instance context for object in videos and applies it as a supplementary guidance for target localization. The key of CG-STVG lies in two specially designed modules including instance context generation (ICG) which focuses on discovering visual context information (in both appearance and motion) of the instance and instance context refinement (ICR) which aims to improve the instance context from ICG by eliminating irrelevant or even harmful information from the context. During grounding ICG together with ICR are deployed at each decoding stage of a Transformer architecture for instance context learning. Particularly instance context learned from one decoding stage is fed to the next stage and leveraged as a guidance containing rich and discriminative object feature to enhance the target-awareness in decoding feature which conversely benefits generating better new instance context for improving localization finally. Compared to existing methods CG-STVG enjoys object information in text query and guidance from mined instance visual context for more accurate target localization. In our experiments on three benchmarks including HCSTVG-v1/-v2 and VidSTG CG-STVG sets new state-of-the-arts in m_tIoU and m_vIoU on all of them showing efficacy. Code is released at https://github.com/HengLan/CGSTVG. + + + + Explaining the Implicit Neural Canvas: Connecting Pixels to Neurons by Tracing their Contributions + http://openaccess.thecvf.com//content/CVPR2024/papers/Padmanabhan_Explaining_the_Implicit_Neural_Canvas_Connecting_Pixels_to_Neurons_by_CVPR_2024_paper.pdf + The many variations of Implicit Neural Representations (INRs) where a neural network is trained as a continuous representation of a signal have tremendous practical utility for downstream tasks including novel view synthesis video compression and image super-resolution. Unfortunately the inner workings of these networks are seriously understudied. Our work eXplaining the Implicit Neural Canvas (XINC) is a unified framework for explaining properties of INRs by examining the strength of each neuron's contribution to each output pixel. We call the aggregate of these contribution maps the Implicit Neural Canvas and we use this concept to demonstrate that the INRs we study learn to "see" the frames they represent in surprising ways. For example INRs tend to have highly distributed representations. While lacking high-level object semantics they have a significant bias for color and edges and are almost entirely space-agnostic. We arrive at our conclusions by examining how objects are represented across time in video INRs using clustering to visualize similar neurons across layers and architectures and show that this is dominated by motion. These insights demonstrate the general usefulness of our analysis framework. 
+ + + + Adapting to Length Shift: FlexiLength Network for Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Adapting_to_Length_Shift_FlexiLength_Network_for_Trajectory_Prediction_CVPR_2024_paper.pdf + Trajectory prediction plays an important role in various applications including autonomous driving robotics and scene understanding. Existing approaches mainly focus on developing compact neural networks to increase prediction precision on public datasets typically employing a standardized input duration. However a notable issue arises when these models are evaluated with varying observation lengths leading to a significant performance drop a phenomenon we term the Observation Length Shift. To address this issue we introduce a general and effective framework the FlexiLength Network (FLN) to enhance the robustness of existing trajectory prediction techniques against varying observation periods. Specifically FLN integrates trajectory data with diverse observation lengths incorporates FlexiLength Calibration (FLC) to acquire temporal invariant representations and employs FlexiLength Adaptation (FLA) to further refine these representations for more accurate future trajectory predictions. Comprehensive experiments on multiple datasets i.e. ETH/UCY nuScenes and Argoverse 1 demonstrate the effectiveness and flexibility of our proposed FLN framework. + + + + WorDepth: Variational Language Prior for Monocular Depth Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_WorDepth_Variational_Language_Prior_for_Monocular_Depth_Estimation_CVPR_2024_paper.pdf + Three-dimensional (3D) reconstruction from a single image is an ill-posed problem with inherent ambiguities i.e. scale. Predicting a 3D scene from text description(s) is similarly ill-posed i.e. spatial arrangements of objects described. We investigate the question of whether two inherently ambiguous modalities can be used in conjunction to produce metric-scaled reconstructions. To test this we focus on monocular depth estimation the problem of predicting a dense depth map from a single image but with an additional text caption describing the scene. To this end we begin by encoding the text caption as a mean and standard deviation; using a variational framework we learn the distribution of the plausible metric reconstructions of 3D scenes corresponding to the text captions as a prior. To "select" a specific reconstruction or depth map we encode the given image through a conditional sampler that samples from the latent space of the variational text encoder which is then decoded to the output depth map. Our approach is trained alternatingly between the text and image branches: in one optimization step we predict the mean and standard deviation from the text description and sample from a standard Gaussian and in the other we sample using a (image) conditional sampler. Once trained we directly predict depth from the encoded text using the conditional sampler. We demonstrate our approach on indoor (NYUv2) and outdoor (KITTI) scenarios where we show that language can consistently improve performance in both. Code: https://github.com/Adonis-galaxy/WorDepth. 
+ + + + A Unified Framework for Microscopy Defocus Deblur with Multi-Pyramid Transformer and Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_A_Unified_Framework_for_Microscopy_Defocus_Deblur_with_Multi-Pyramid_Transformer_CVPR_2024_paper.pdf + Defocus blur is a persistent problem in microscope imaging that poses harm to pathology interpretation and medical intervention in cell microscopy and microscope surgery. To address this problem a unified framework including the multi-pyramid transformer (MPT) and extended frequency contrastive regularization (EFCR) is proposed to tackle two outstanding challenges in microscopy deblur: longer attention span and data deficiency. The MPT employs an explicit pyramid structure at each network stage that integrates the cross-scale window attention (CSWA) the intra-scale channel attention (ISCA) and the feature-enhancing feed-forward network (FEFN) to capture long-range cross-scale spatial interaction and global channel context. The EFCR addresses the data deficiency problem by exploring latent deblur signals from different frequency bands. It also enables deblur knowledge transfer to learn cross-domain information from extra data improving deblur performance for labeled and unlabeled data. Extensive experiments and downstream task validation show the framework achieves state-of-the-art performance across multiple datasets. Project page: https://github.com/PieceZhang/MPT-CataBlur. + + + + Frozen Feature Augmentation for Few-Shot Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Bar_Frozen_Feature_Augmentation_for_Few-Shot_Image_Classification_CVPR_2024_paper.pdf + Training a linear classifier or lightweight model on top of pretrained vision model outputs so-called 'frozen features' leads to impressive performance on a number of downstream few-shot tasks. Currently frozen features are not modified during training. On the other hand when networks are trained directly on images data augmentation is a standard recipe that improves performance with no substantial overhead. In this paper we conduct an extensive pilot study on few-shot image classification that explores applying data augmentations in the frozen feature space dubbed 'frozen feature augmentation (FroFA)' covering twenty augmentations in total. Our study demonstrates that adopting a deceptively simple pointwise FroFA such as brightness can improve few-shot performance consistently across three network architectures three large pretraining datasets and eight transfer datasets. + + + + Incorporating Geo-Diverse Knowledge into Prompting for Increased Geographical Robustness in Object Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Buettner_Incorporating_Geo-Diverse_Knowledge_into_Prompting_for_Increased_Geographical_Robustness_in_CVPR_2024_paper.pdf + Existing object recognition models have been shown to lack robustness in diverse geographical scenarios due to domain shifts in design and context. Class representations need to be adapted to more accurately reflect an object concept under these shifts. In the absence of training data from target geographies we hypothesize that geographically diverse descriptive knowledge of categories can enhance robustness. For this purpose we explore the feasibility of probing a large language model for geography-based object knowledge and we examine the effects of integrating knowledge into zero-shot and learnable soft prompting with CLIP. 
Within this exploration we propose geography knowledge regularization to ensure that soft prompts trained on a source set of geographies generalize to an unseen target set. Accuracy gains over prompting baselines on DollarStreet while training only on Europe data are up to +2.8/1.2/1.6 on target data from Africa/Asia/Americas and +4.6 overall on the hardest classes. Competitive performance is shown vs. few-shot target training and analysis is provided to direct future study of geographical robustness. + + + + PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs + http://openaccess.thecvf.com//content/CVPR2024/papers/Dorkenwald_PIN_Positional_Insert_Unlocks_Object_Localisation_Abilities_in_VLMs_CVPR_2024_paper.pdf + Vision-Language Models (VLMs) such as Flamingo and GPT-4V have shown immense potential by integrating large language models with vision systems. Nevertheless these models face challenges in the fundamental computer vision task of object localisation due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom supervised training pipelines with bounding box annotations that integrate with VLMs these result in specialized and hard-to-scale models. In this paper we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen and ii) not using any supervised detection data. To this end we introduce an input-agnostic Positional Insert (PIN) a learnable spatial prompt containing a minimal set of parameters that are slid inside the frozen VLM unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performances on a variety of images including Pascal VOC COCO LVIS and diverse images like paintings or cartoons. + + + + UniGarmentManip: A Unified Framework for Category-Level Garment Manipulation via Dense Visual Correspondence + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_UniGarmentManip_A_Unified_Framework_for_Category-Level_Garment_Manipulation_via_Dense_CVPR_2024_paper.pdf + Garment manipulation (e.g. unfolding folding and hanging clothes) is essential for future robots to accomplish home-assistant tasks while highly challenging due to the diversity of garment configurations geometries and deformations. Although able to manipulate similar shaped garments in a certain task previous works mostly have to design different policies for different tasks could not generalize to garments with diverse geometries and often rely heavily on human-annotated data. In this paper we leverage the property that garments in a certain category have similar structures and then learn the topological dense (point-level) visual correspondence among garments in the category level with different deformations in the self-supervised manner. The topological correspondence can be easily adapted to the functional correspondence to guide the manipulation policies for various downstream tasks within only one or few-shot demonstrations. Experiments over garments in 3 different categories on 3 representative tasks in diverse scenarios using one or two arms taking one or more steps inputting flat or messy garments demonstrate the effectiveness of our proposed method. 
Project page: https://warshallrho.github.io/unigarmentmanip. + + + + Multi-Attribute Interactions Matter for 3D Visual Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Multi-Attribute_Interactions_Matter_for_3D_Visual_Grounding_CVPR_2024_paper.pdf + 3D visual grounding aims to localize 3D objects described by free-form language sentences. Following the detection-then-matching paradigm existing methods mainly focus on embedding object attributes in unimodal feature extraction and multimodal feature fusion to enhance the discriminability of the proposal feature for accurate grounding. However most of them ignore the explicit interaction of multiple attributes causing a bias in unimodal representation and misalignment in multimodal fusion. In this paper we propose a multi-attribute aware Transformer for 3D visual grounding learning the multi-attribute interactions to refine the intra-modal and inter-modal grounding cues. Specifically we first develop an attribute causal analysis module to quantify the causal effect of different attributes for the final prediction which provides powerful supervision to correct the misleading attributes and adaptively capture other discriminative features. Then we design an exchanging-based multimodal fusion module which dynamically replaces tokens with low attribute attention between modalities before directly integrating low-dimensional global features. This ensures an attribute-level multimodal information fusion and helps align the language and vision details more efficiently for fine-grained multimodal features. Extensive experiments show that our method can achieve state-of-the-art performance on ScanRefer and Sr3D/Nr3D datasets. + + + + SCINeRF: Neural Radiance Fields from a Snapshot Compressive Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_SCINeRF_Neural_Radiance_Fields_from_a_Snapshot_Compressive_Image_CVPR_2024_paper.pdf + In this paper we explore the potential of Snapshot Compressive Imaging (SCI) technique for recovering the underlying 3D scene representation from a single temporal compressed image. SCI is a cost-effective method that enables the recording of high-dimensional data such as hyperspectral or temporal information into a single image using low-cost 2D imaging sensors. To achieve this a series of specially designed 2D masks are usually employed which not only reduces storage requirements but also offers potential privacy protection. Inspired by this to take one step further our approach builds upon the powerful 3D scene representation capabilities of neural radiance fields (NeRF). Specifically we formulate the physical imaging process of SCI as part of the training of NeRF allowing us to exploit its impressive performance in capturing complex scene structures. To assess the effectiveness of our method we conduct extensive evaluations using both synthetic data and real data captured by our SCI system. Extensive experimental results demonstrate that our proposed approach surpasses the state-of-the-art methods in terms of image reconstruction and novel view image synthesis. Moreover our method also exhibits the ability to restore high frame-rate multi-view consistent images by leveraging SCI and the rendering capabilities of NeRF. The code is available at https://github.com/WU-CVGL/SCINeRF.
+ + + + Improved Visual Grounding through Self-Consistent Explanations + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Improved_Visual_Grounding_through_Self-Consistent_Explanations_CVPR_2024_paper.pdf + Vision-and-language models trained to match images with text can be combined with visual explanation methods to point to the locations of specific objects in an image. Our work shows that the localization --"grounding"-- abilities of these models can be further improved by finetuning for self-consistent visual explanations. We propose a strategy for augmenting existing text-image datasets with paraphrases using a large language model and SelfEQ a weakly-supervised strategy on visual explanation maps for paraphrases that encourages self-consistency. Specifically for an input textual phrase we attempt to generate a paraphrase and finetune the model so that the phrase and paraphrase map to the same region in the image. We posit that this both expands the vocabulary that the model is able to handle and improves the quality of the object locations highlighted by gradient-based visual explanation methods (e.g. GradCAM). We demonstrate that SelfEQ improves performance on Flickr30k ReferIt and RefCOCO+ over a strong baseline method and several prior works. Particularly compared to other methods that do not use any type of box annotations we obtain 84.07% on Flickr30k (an absolute improvement of 4.69%) 67.40% on ReferIt (an absolute improvement of 7.68%) and 75.10% and 55.49% on RefCOCO+ test sets A and B respectively (an absolute improvement of 3.74% on average). + + + + DifFlow3D: Toward Robust Uncertainty-Aware Scene Flow Estimation with Iterative Diffusion-Based Refinement + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_DifFlow3D_Toward_Robust_Uncertainty-Aware_Scene_Flow_Estimation_with_Iterative_Diffusion-Based_CVPR_2024_paper.pdf + Scene flow estimation which aims to predict per-point 3D displacements of dynamic scenes is a fundamental task in the computer vision field. However previous works commonly suffer from unreliable correlation caused by locally constrained searching ranges and struggle with accumulated inaccuracy arising from the coarse-to-fine structure. To alleviate these problems we propose a novel uncertainty-aware scene flow estimation network (DifFlow3D) with the diffusion probabilistic model. Iterative diffusion-based refinement is designed to enhance the correlation robustness and resilience to challenging cases e.g. dynamics noisy inputs repetitive patterns etc. To restrain the generation diversity three key flow-related features are leveraged as conditions in our diffusion model. Furthermore we also develop an uncertainty estimation module within diffusion to evaluate the reliability of estimated scene flow. Our DifFlow3D achieves state-of-the-art performance with 24.0% and 29.1% EPE3D reduction respectively on FlyingThings3D and KITTI 2015 datasets. Notably our method achieves an unprecedented millimeter-level accuracy (0.0078m in EPE3D) on the KITTI dataset. Additionally our diffusion-based refinement paradigm can be readily integrated as a plug-and-play module into existing scene flow networks significantly increasing their estimation accuracy. Codes are released at https://github.com/IRMVLab/DifFlow3D.
+ + + + FlashEval: Towards Fast and Accurate Evaluation of Text-to-image Diffusion Generative Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_FlashEval_Towards_Fast_and_Accurate_Evaluation_of_Text-to-image_Diffusion_Generative_CVPR_2024_paper.pdf + In recent years there has been significant progress in the development of text-to-image generative models. Evaluating the quality of the generative models is one essential step in the development process. Unfortunately the evaluation process could consume a significant amount of computational resources making the required periodic evaluation of model performance (e.g. monitoring training progress) impractical. Therefore we seek to improve the evaluation efficiency by selecting the representative subset of the text-image dataset. We systematically investigate the design choices including the selection criteria (textual features or image-based metrics) and the selection granularity (prompt-level or set-level). We find that the insights from prior work on subset selection for training data do not generalize to this problem and we propose FlashEval an iterative search algorithm tailored to evaluation data selection. We demonstrate the effectiveness of FlashEval on ranking diffusion models with various configurations including architectures quantization levels and sampler schedules on COCO and DiffusionDB datasets. Our searched 50-item subset could achieve comparable evaluation quality to the randomly sampled 500-item subset for COCO annotations on unseen models achieving a 10x evaluation speedup. We release the condensed subset of these commonly used datasets to help facilitate diffusion algorithm design and evaluation and open-source FlashEval as a tool for condensing future datasets accessible at https://github.com/thu-nics/FlashEval. + + + + View From Above: Orthogonal-View aware Cross-view Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_View_From_Above_Orthogonal-View_aware_Cross-view_Localization_CVPR_2024_paper.pdf + This paper presents a novel aerial-to-ground feature aggregation strategy tailored for the task of cross-view image-based geo-localization. Conventional vision-based methods heavily rely on matching ground-view image features with a pre-recorded image database often through establishing planar homography correspondences via a planar ground assumption. As such they tend to ignore features that are off-ground and not suited for handling visual occlusions leading to unreliable localization in challenging scenarios. We propose a Top-to-Ground Aggregation module that capitalizes on aerial orthographic views to aggregate features down to the ground level leveraging reliable off-ground information to improve feature alignment. Furthermore we introduce a Cycle Domain Adaptation loss that ensures feature extraction robustness across domain changes. Additionally an Equidistant Re-projection loss is introduced to equalize the impact of all keypoints on orientation error leading to a more extended distribution of keypoints which benefits orientation estimation. On both KITTI and Ford Multi-AV datasets our method consistently achieves the lowest mean longitudinal and lateral translations across different settings and obtains the smallest orientation error when the initial pose is less accurate, a more challenging setting. Further it can complete an entire route through continual vehicle pose estimation with initial vehicle pose given only at the starting point.
+ + + + PeVL: Pose-Enhanced Vision-Language Model for Fine-Grained Human Action Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_PeVL_Pose-Enhanced_Vision-Language_Model_for_Fine-Grained_Human_Action_Recognition_CVPR_2024_paper.pdf + Recent progress in Vision-Language (VL) foundation models has revealed the great advantages of cross-modality learning. However due to a large gap between vision and text they might not be able to sufficiently utilize the benefits of cross-modality information. In the field of human action recognition the additional pose modality may bridge the gap between vision and text to improve the effectiveness of cross-modality learning. In this paper we propose a novel framework called the Pose-enhanced Vision-Language (PeVL) model to adapt the VL model with pose modality to learn effective knowledge of fine-grained human actions. Our PeVL model includes two novel components: an Unsymmetrical Cross-Modality Refinement (UCMR) block and a Semantic-Guided Multi-level Contrastive (SGMC) module. The UCMR block includes Pose-guided Visual Refinement (P2V-R) and Visual-enriched Pose Refinement (V2P-R) for effective cross-modality learning. The SGMC module includes Multi-level Contrastive Associations of vision-text and pose-text at both action and sub-action levels and a Semantic-Guided Loss enabling effective contrastive learning with text. Built upon a pre-trained VL foundation model our model integrates trainable adapters and can be trained end-to-end. Our novel PeVL design over VL foundation model yields remarkable performance gains on four fine-grained human action recognition datasets achieving a new SOTA with a significantly small number of FLOPs for low-cost re-training. + + + + DeepCache: Accelerating Diffusion Models for Free + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_DeepCache_Accelerating_Diffusion_Models_for_Free_CVPR_2024_paper.pdf + Diffusion models have recently gained unprecedented attention in the field of image synthesis due to their remarkable generative capabilities. Notwithstanding their prowess these models often incur substantial computational costs primarily attributed to the sequential denoising process and cumbersome model size. Traditional methods for compressing diffusion models typically involve extensive retraining presenting cost and feasibility challenges. In this paper we introduce DeepCache a novel training-free paradigm that accelerates diffusion models from the perspective of model architecture. DeepCache capitalizes on the inherent temporal redundancy observed in the sequential denoising steps of diffusion models which caches and retrieves features across adjacent denoising stages thereby curtailing redundant computations. Utilizing the property of the U-Net we reuse the high-level features while updating the low-level features in a very cheap way. This innovative strategy in turn enables a speedup factor of 2.3x for Stable Diffusion v1.5 with only a 0.05 decline in CLIP Score and 4.1x for LDM-4-G with a slight decrease of 0.22 in FID on ImageNet. Our experiments also demonstrate DeepCache's superiority over existing pruning and distillation methods that necessitate retraining and its compatibility with current sampling techniques. Furthermore we find that under the same throughput DeepCache effectively achieves comparable or even marginally improved results with DDIM or PLMS.
+ + + + Learning Correlation Structures for Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Learning_Correlation_Structures_for_Vision_Transformers_CVPR_2024_paper.pdf + We introduce a new attention mechanism dubbed structural self-attention (StructSA) that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts object motion and inter-object relations. Using StructSA as a main building block we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks achieving state-of-the-art results on ImageNet-1K Kinetics-400 Something-Something V1 & V2 Diving-48 and FineGym. + + + + PrPSeg: Universal Proposition Learning for Panoramic Renal Pathology Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_PrPSeg_Universal_Proposition_Learning_for_Panoramic_Renal_Pathology_Segmentation_CVPR_2024_paper.pdf + Understanding the anatomy of renal pathology is crucial for advancing disease diagnostics treatment evaluation and clinical research. The complex kidney system comprises various components across multiple levels including regions (cortex medulla) functional units (glomeruli tubules) and cells (podocytes mesangial cells in glomerulus). Prior studies have predominantly overlooked the intricate spatial interrelations among objects from clinical knowledge. In this research we introduce a novel universal proposition learning approach called panoramic renal pathology segmentation (PrPSeg) designed to segment comprehensively panoramic structures within kidney by integrating extensive knowledge of kidney anatomy. In this paper we propose (1) the design of a comprehensive universal proposition matrix for renal pathology facilitating the incorporation of classification and spatial relationships into the segmentation process; (2) a token-based dynamic head single network architecture with the improvement of the partial label image segmentation and capability for future data enlargement; and (3) an anatomy loss function quantifying the inter-object relationships across the kidney. + + + + Weakly-Supervised Audio-Visual Video Parsing with Prototype-based Pseudo-Labeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Rachavarapu_Weakly-Supervised_Audio-Visual_Video_Parsing_with_Prototype-based_Pseudo-Labeling_CVPR_2024_paper.pdf + In this paper we address the weakly-supervised Audio-Visual Video Parsing (AVVP) problem which aims at labeling events in a video as audible visible or both and temporally localizing and classifying them into known categories. This is challenging since we only have access to video-level (weak) event labels when training but need to predict event labels at the segment (frame) level at test time. Recent methods employ multiple-instance learning (MIL) techniques that tend to focus solely on the most discriminative segments resulting in frequent misclassifications. Our idea is to first construct several prototype features for each event class by clustering key segments identified for the event in the training data.
We then assign pseudo labels to all training segments based on their feature similarities with these prototypes and re-train the model under weak and strong supervision. We facilitate this by structuring the feature space with contrastive learning using pseudo labels. Experiments show that we outperform existing methods for weakly-supervised AVVP. We also show that learning with weak and iteratively re-estimated pseudo labels can be interpreted as an expectation-maximization (EM) algorithm providing further insight for our training procedure. + + + + Intraoperative 2D/3D Image Registration via Differentiable X-ray Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Gopalakrishnan_Intraoperative_2D3D_Image_Registration_via_Differentiable_X-ray_Rendering_CVPR_2024_paper.pdf + Surgical decisions are informed by aligning rapid portable 2D intraoperative images (e.g. X-rays) to a high-fidelity 3D preoperative reference scan (e.g. CT). However 2D/3D registration can often fail in practice: conventional optimization methods are prohibitively slow and susceptible to local minima while neural networks trained on small datasets fail on new patients or require impractical landmark supervision. We present DiffPose a self-supervised approach that leverages patient-specific simulation and differentiable physics-based rendering to achieve accurate 2D/3D registration without relying on manually labeled data. Preoperatively a CNN is trained to regress the pose of a randomly oriented synthetic X-ray rendered from the preoperative CT. The CNN then initializes rapid intraoperative test-time optimization that uses the differentiable X-ray renderer to refine the solution. Our work further proposes several geometrically principled methods for sampling camera poses from SE(3) for sparse differentiable rendering and for driving registration in the tangent space se(3) with geodesic and multiscale locality-sensitive losses. DiffPose achieves sub-millimeter accuracy across surgical datasets at intraoperative speeds improving upon existing unsupervised methods by an order of magnitude and even outperforming supervised baselines. Our implementation is at https://github.com/eigenvivek/DiffPose. + + + + MICap: A Unified Model for Identity-Aware Movie Descriptions + http://openaccess.thecvf.com//content/CVPR2024/papers/Raajesh_MICap_A_Unified_Model_for_Identity-Aware_Movie_Descriptions_CVPR_2024_paper.pdf + Characters are an important aspect of any storyline and identifying and including them in descriptions is necessary for story understanding. While previous work has largely ignored identity and generated captions with someone (anonymized names) recent work formulates id-aware captioning as a fill-in-the-blanks (FITB) task where given a caption with blanks the goal is to predict person id labels. However to predict captions with ids a two-stage approach is required: first predict captions with someone then fill in identities. In this work we present a new single stage approach that can seamlessly switch between id-aware caption generation or FITB when given a caption with blanks. Our model Movie-Identity Captioner (MICap) uses a shared auto-regressive decoder that benefits from training with FITB and full-caption generation objectives while the encoder can benefit from or disregard captions with blanks as input. Another challenge with id-aware captioning is the lack of a metric to capture subtle differences between person ids. 
To this end we introduce iSPICE a caption evaluation metric that focuses on identity tuples created through intermediate scene graphs. We evaluate MICap on Large-Scale Movie Description Challenge (LSMDC) where we show a 4.2% improvement in FITB accuracy and a 1-2% bump in classic captioning metrics. + + + + MonoDiff: Monocular 3D Object Detection and Pose Estimation with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Ranasinghe_MonoDiff_Monocular_3D_Object_Detection_and_Pose_Estimation_with_Diffusion_CVPR_2024_paper.pdf + 3D object detection and pose estimation from a single-view image is challenging due to the high uncertainty caused by the absence of 3D perception. As a solution recent monocular 3D detection methods leverage additional modalities such as stereo image pairs and LiDAR point clouds to enhance image features at the expense of additional annotation costs. We propose using diffusion models to learn effective representations for monocular 3D detection without additional modalities or training data. We present MonoDiff a novel framework that employs the reverse diffusion process to estimate 3D bounding box and orientation. But considering the variability in bounding box sizes along different dimensions it is ineffective to sample noise from a standard Gaussian distribution. Hence we adopt a Gaussian mixture model to sample noise during the forward diffusion process and initialize the reverse diffusion process. Furthermore since the diffusion model generates the 3D parameters for a given object image we leverage 2D detection information to provide additional supervision by maintaining the correspondence between 3D/2D projection. Finally depending on the signal-to-noise ratio we incorporate a dynamic weighting scheme to account for the level of uncertainty in the supervision by projection at different timesteps. MonoDiff outperforms current state-of-the-art monocular 3D detection methods on the KITTI and Waymo benchmarks without additional depth priors. MonoDiff project is available at: https://dylran.github.io/monodiff.github.io. + + + + An Upload-Efficient Scheme for Transferring Knowledge From a Server-Side Pre-trained Generator to Clients in Heterogeneous Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_An_Upload-Efficient_Scheme_for_Transferring_Knowledge_From_a_Server-Side_Pre-trained_CVPR_2024_paper.pdf + Heterogeneous Federated Learning (HtFL) enables collaborative learning on multiple clients with different model architectures while preserving privacy. Despite recent research progress knowledge sharing in HtFL is still difficult due to data and model heterogeneity. To tackle this issue we leverage the knowledge stored in pre-trained generators and propose a new upload-efficient knowledge transfer scheme called Federated Knowledge-Transfer Loop (FedKTL). Our FedKTL can produce client-task-related prototypical image-vector pairs via the generator's inference on the server. With these pairs each client can transfer pre-existing knowledge from the generator to its local model through an additional supervised local task. We conduct extensive experiments on four datasets under two types of data heterogeneity with 14 kinds of models including CNNs and ViTs. Results show that our upload-efficient FedKTL surpasses seven state-of-the-art methods by up to 7.31% in accuracy. Moreover our knowledge transfer scheme is applicable in scenarios with only one edge client. 
Code: https://github.com/TsingZ0/FedKTL + + + + Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lei_Instance-aware_Exploration-Verification-Exploitation_for_Instance_ImageGoal_Navigation_CVPR_2024_paper.pdf + As a new embodied vision task Instance ImageGoal Navigation (IIN) aims to navigate to a specified object depicted by a goal image in an unexplored environment. The main challenge of this task lies in identifying the target object from different viewpoints while rejecting similar distractors. Existing ImageGoal Navigation methods usually adopt the simple Exploration-Exploitation framework and ignore the identification of specific instances during navigation. In this work we propose to imitate the human behaviour of "getting closer to confirm" when distinguishing objects from a distance. Specifically we design a new modular navigation framework named Instance-aware Exploration-Verification-Exploitation (IEVE) for instance-level image goal navigation. Our method allows for active switching among the exploration verification and exploitation actions thereby facilitating the agent in making reasonable decisions under different situations. On the challenging Habitat-Matterport 3D semantic (HM3DSEM) dataset our method surpasses previous state-of-the-art work with a classical segmentation model (0.684 vs. 0.561 success) or a robust model (0.702 vs. 0.561 success). Our code will be made publicly available at https://github.com/XiaohanLei/IEVE. + + + + One-2-3-45++: Fast Single Image to 3D Objects with Consistent Multi-View Generation and 3D Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_One-2-3-45_Fast_Single_Image_to_3D_Objects_with_Consistent_Multi-View_CVPR_2024_paper.pdf + Recent advancements in open-world 3D object generation have been remarkable with image-to-3D methods offering superior fine-grained control over their text-to-3D counterparts. However most existing models fall short in simultaneously providing rapid generation speeds and high fidelity to input images - two features essential for practical applications. In this paper we present One-2-3-45++ an innovative method that transforms a single image into a detailed 3D textured mesh in approximately one minute. Our approach aims to fully harness the extensive knowledge embedded in 2D diffusion models and priors from valuable yet limited 3D data. This is achieved by initially finetuning a 2D diffusion model for consistent multi-view image generation followed by elevating these images to 3D with the aid of multi-view-conditioned 3D native diffusion models. Extensive experimental evaluations demonstrate that our method can produce high-quality diverse 3D assets that closely mirror the original input image. + + + + Let's Think Outside the Box: Exploring Leap-of-Thought in Large Language Models with Creative Humor Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhong_Lets_Think_Outside_the_Box_Exploring_Leap-of-Thought_in_Large_Language_CVPR_2024_paper.pdf + Chain-of-Thought (CoT) guides large language models (LLMs) to reason step-by-step and can motivate their logical reasoning ability. While effective for logical tasks CoT is not conducive to creative problem-solving which often requires out-of-box thoughts and is crucial for innovation advancements.
In this paper we explore the Leap-of-Thought (LoT) abilities within LLMs -- a non-sequential creative paradigm involving strong associations and knowledge leaps. To this end we study LLMs on the popular Oogiri game which needs participants to have good creativity and strong associative thinking for responding unexpectedly and humorously to the given image text or both and thus is suitable for LoT study. Then to investigate LLMs' LoT ability in the Oogiri game we first build a multimodal and multilingual Oogiri-GO dataset which contains over 130000 samples from the Oogiri game and observe the insufficient LoT ability or failures of most existing LLMs on the Oogiri game. Accordingly we introduce a creative Leap-of-Thought (CLoT) paradigm to improve LLM's LoT ability. CLoT first formulates the Oogiri-GO dataset into LoT-oriented instruction tuning data to train pretrained LLM for achieving certain LoT humor generation and discrimination abilities. Then CLoT designs an explorative self-refinement that encourages the LLM to generate more creative LoT data via exploring parallels between seemingly unrelated concepts and selects high-quality data to train itself for self-refinement. CLoT not only excels in humor generation in the Oogiri game as shown in Fig. 1 but also boosts creative abilities in various tasks like "cloud guessing game" and "divergent association task". These findings advance our understanding and offer a pathway to improve LLMs' creative capacities for innovative applications across domains. The dataset code and models have been released online: https://zhongshsh.github.io/CLoT. + + + + SceneFun3D: Fine-Grained Functionality and Affordance Understanding in 3D Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Delitzas_SceneFun3D_Fine-Grained_Functionality_and_Affordance_Understanding_in_3D_Scenes_CVPR_2024_paper.pdf + Existing 3D scene understanding methods are heavily focused on 3D semantic and instance segmentation. However identifying objects and their parts only constitutes an intermediate step towards a more fine-grained goal which is effectively interacting with the functional interactive elements (e.g. handles knobs buttons) in the scene to accomplish diverse tasks. To this end we introduce SceneFun3D a large-scale dataset with more than 14.8k highly accurate interaction annotations for 710 high-resolution real-world 3D indoor scenes. We accompany the annotations with motion parameter information describing how to interact with these elements and a diverse set of natural language descriptions of tasks that involve manipulating them in the scene context. To showcase the value of our dataset we introduce three novel tasks namely functionality segmentation task-driven affordance grounding and 3D motion estimation and adapt existing state-of-the-art methods to tackle them. Our experiments show that solving these tasks in real 3D scenes remains challenging despite recent progress in closed-set and open-set 3D scene understanding methods. + + + + Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Enhanced_Motion-Text_Alignment_for_Image-to-Video_Transfer_Learning_CVPR_2024_paper.pdf + Extending large image-text pre-trained models (e.g. CLIP) for video understanding has made significant advancements. To enable the capability of CLIP to perceive dynamic information in videos existing works are dedicated to equipping the visual encoder with various temporal modules. 
However these methods exhibit "asymmetry" between the visual and textual sides with neither temporal descriptions in input texts nor temporal modules in text encoder. This limitation hinders the potential of language supervision emphasized in CLIP and restricts the learning of temporal features as the text encoder has demonstrated limited proficiency in motion understanding. To address this issue we propose leveraging "MoTion-Enhanced Descriptions" (MoTED) to facilitate the extraction of distinctive temporal features in videos. Specifically we first generate discriminative motion-related descriptions via querying GPT-4 to compare easily confused action categories. Then we incorporate both the visual and textual encoders with additional perception modules to process the video frames and generated descriptions respectively. Finally we adopt a contrastive loss to align the visual and textual motion features. Extensive experiments on five benchmarks show that MoTED surpasses state-of-the-art methods with convincing gaps laying a solid foundation for empowering CLIP with strong temporal modeling. + + + + UV-IDM: Identity-Conditioned Latent Diffusion Model for Face UV-Texture Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_UV-IDM_Identity-Conditioned_Latent_Diffusion_Model_for_Face_UV-Texture_Generation_CVPR_2024_paper.pdf + 3D face reconstruction aims at generating high-fidelity 3D face shapes and textures from single-view or multi-view images. However current prevailing facial texture generation methods generally suffer from low-quality texture identity information loss and inadequate handling of occlusions. To solve these problems we introduce an Identity-Conditioned Latent Diffusion Model for face UV-texture generation (UV-IDM) to generate photo-realistic textures based on the Basel Face Model (BFM). UV-IDM leverages the powerful texture generation capacity of a latent diffusion model (LDM) to obtain detailed facial textures. To preserve the identity during the reconstruction procedure we design an identity-conditioned module that can utilize any in-the-wild image as a robust condition for the LDM to guide texture generation. UV-IDM can be easily adapted to different BFM-based methods as a high-fidelity texture generator. Furthermore in light of the limited accessibility of most existing UV-texture datasets we build a large-scale and publicly available UV-texture dataset based on BFM termed BFM-UV. Extensive experiments show that our UV-IDM can generate high-fidelity textures in 3D face reconstruction within seconds while maintaining image consistency bringing new state-of-the-art performance in facial texture generation. + + + + A Pedestrian is Worth One Prompt: Towards Language Guidance Person Re-Identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_A_Pedestrian_is_Worth_One_Prompt_Towards_Language_Guidance_Person_CVPR_2024_paper.pdf + Extensive advancements have been made in person ReID through the mining of semantic information. Nevertheless existing methods that utilize semantic-parts from a single image modality do not explicitly achieve this goal. Witnessing the impressive capabilities in multimodal understanding of Vision Language Foundation Model CLIP a recent two-stage CLIP-based method employs automated prompt engineering to obtain specific textual labels for classifying pedestrians. However we note that the predefined soft prompts may be inadequate in expressing the entire visual context and struggle to generalize to unseen classes.
This paper presents an end-to-end Prompt-driven Semantic Guidance (PromptSG) framework that harnesses the rich semantics inherent in CLIP. Specifically we guide the model to attend to regions that are semantically faithful to the prompt. To provide personalized language descriptions for specific individuals we propose learning pseudo tokens that represent specific visual contexts. This design not only facilitates learning fine-grained attribute information but also can inherently leverage language prompts during inference. Without requiring additional labeling efforts our PromptSG achieves state-of-the-art by over 10% on MSMT17 and nearly 5% on the Market-1501 benchmark. + + + + NetTrack: Tracking Highly Dynamic Objects with a Net + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_NetTrack_Tracking_Highly_Dynamic_Objects_with_a_Net_CVPR_2024_paper.pdf + The complex dynamicity of open-world objects presents non-negligible challenges for multi-object tracking (MOT) often manifested as severe deformations fast motion and occlusions. Most methods that solely depend on coarse-grained object cues such as boxes and the overall appearance of the object are susceptible to degradation due to distorted internal relationships of dynamic objects. To address this problem this work proposes NetTrack an efficient generic and affordable tracking framework to introduce fine-grained learning that is robust to dynamicity. Specifically NetTrack constructs a dynamicity-aware association with a fine-grained Net leveraging point-level visual cues. Correspondingly a fine-grained sampler and matching method have been incorporated. Furthermore NetTrack learns object-text correspondence for fine-grained localization. To evaluate MOT in extremely dynamic open-world scenarios a bird flock tracking (BFT) dataset is constructed which exhibits high dynamicity with diverse species and open-world scenarios. Comprehensive evaluation on BFT validates the effectiveness of fine-grained learning on object dynamicity and thorough transfer experiments on challenging open-world benchmarks i.e. TAO TAO-OW AnimalTrack and GMOT-40 validate the strong generalization ability of NetTrack even without finetuning. + + + + Grounded Question-Answering in Long Egocentric Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Di_Grounded_Question-Answering_in_Long_Egocentric_Videos_CVPR_2024_paper.pdf + Existing approaches to video understanding mainly designed for short videos from a third-person perspective are limited in their applicability in certain fields such as robotics. In this paper we delve into open-ended question-answering (QA) in long egocentric videos which allows individuals or robots to inquire about their own past visual experiences. This task presents unique challenges including the complexity of temporally grounding queries within extensive video content the high resource demands for precise data annotation and the inherent difficulty of evaluating open-ended answers due to their ambiguous nature. Our proposed approach tackles these challenges by (i) integrating query grounding and answering within a unified model to reduce error propagation; (ii) employing large language models for efficient and scalable data synthesis; and (iii) introducing a close-ended QA task for evaluation to manage answer ambiguity. Extensive experiments demonstrate the effectiveness of our method which also achieves state-of-the-art performance on the QAEgo4D and Ego4D-NLQ benchmarks. 
Code data and models are open-sourced at https://github.com/Becomebright/GroundVQA. + + + + HPNet: Dynamic Trajectory Forecasting with Historical Prediction Attention + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_HPNet_Dynamic_Trajectory_Forecasting_with_Historical_Prediction_Attention_CVPR_2024_paper.pdf + Predicting the trajectories of road agents is essential for autonomous driving systems. The recent mainstream methods follow a static paradigm which predicts the future trajectory by using a fixed duration of historical frames. These methods make the predictions independently even at adjacent time steps which leads to potential instability and temporal inconsistency. As successive time steps have largely overlapping historical frames their forecasting should have intrinsic correlation such as overlapping predicted trajectories should be consistent or be different but share the same motion goal depending on the road situation. Motivated by this in this work we introduce HPNet a novel dynamic trajectory forecasting method. Aiming for stable and accurate trajectory forecasting our method leverages not only historical frames including maps and agent states but also historical predictions. Specifically we newly design a Historical Prediction Attention module to automatically encode the dynamic relationship between successive predictions. Besides it also extends the attention range beyond the currently visible window benefitting from the use of historical predictions. The proposed Historical Prediction Attention together with the Agent Attention and Mode Attention is further formulated as the Triple Factorized Attention module serving as the core design of HPNet. Experiments on the Argoverse and INTERACTION datasets show that HPNet achieves state-of-the-art performance and generates accurate and stable future trajectories. Our code is available at https://github.com/XiaolongTang23/HPNet. + + + + SI-MIL: Taming Deep MIL for Self-Interpretability in Gigapixel Histopathology + http://openaccess.thecvf.com//content/CVPR2024/papers/Kapse_SI-MIL_Taming_Deep_MIL_for_Self-Interpretability_in_Gigapixel_Histopathology_CVPR_2024_paper.pdf + Introducing interpretability and reasoning into Multiple Instance Learning (MIL) methods for Whole Slide Image (WSI) analysis is challenging given the complexity of gigapixel slides. Traditionally MIL interpretability is limited to identifying salient regions deemed pertinent for downstream tasks offering little insight to the end-user (pathologist) regarding the rationale behind these selections. To address this we propose Self-Interpretable MIL (SI-MIL) a method intrinsically designed for interpretability from the very outset. SI-MIL employs a deep MIL framework to guide an interpretable branch grounded on handcrafted pathological features facilitating linear predictions. Beyond identifying salient regions SI-MIL uniquely provides feature-level interpretations rooted in pathological insights for WSIs. Notably SI-MIL with its linear prediction constraints challenges the prevalent myth of an inevitable trade-off between model interpretability and performance demonstrating competitive results compared to state-of-the-art methods on WSI-level prediction tasks across three cancer types. In addition we thoroughly benchmark the local- and global-interpretability of SI-MIL in terms of statistical analysis a domain expert study and desiderata of interpretability namely user-friendliness and faithfulness.
+ + + + LayoutFormer: Hierarchical Text Detection Towards Scene Text Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_LayoutFormer_Hierarchical_Text_Detection_Towards_Scene_Text_Understanding_CVPR_2024_paper.pdf + Existing scene text detectors generally focus on accurately detecting single-level (i.e. word-level line-level or paragraph-level) text entities without exploring the relationships among different levels of text entities. To comprehensively understand scene texts detecting multi-level texts while exploring their contextual information is critical. To this end we propose a unified framework (dubbed LayoutFormer) for hierarchical text detection which simultaneously conducts multi-level text detection and predicts the geometric layouts for promoting scene text understanding. In LayoutFormer WordDecoder LineDecoder and ParaDecoder are proposed to be responsible for word-level text prediction line-level text prediction and paragraph-level text prediction respectively. Meanwhile WordDecoder and ParaDecoder adaptively learn word-line and line-paragraph relationships respectively. In addition we propose a Prior Location Sampler to be used on multi-scale features to adaptively select a few representative foreground features for updating text queries. It can improve hierarchical detection performance while significantly reducing the computational cost. Comprehensive experiments verify that our method achieves state-of-the-art performance on single-level and hierarchical text detection. + + + + GLOW: Global Layout Aware Attacks on Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Bao_GLOW_Global_Layout_Aware_Attacks_on_Object_Detection_CVPR_2024_paper.pdf + Adversarial attacks aim to perturb images such that a predictor outputs incorrect results. Due to the limited research in structured attacks imposing consistency checks on natural multi-object scenes is a practical defense against conventional adversarial attacks. More desired attacks should be able to fool defenses with such consistency checks. Therefore we present the first approach GLOW that copes with various attack requests by generating global layout-aware adversarial attacks in which both categorical and geometric layout constraints are explicitly established. Specifically we focus on object detection tasks and given a victim image GLOW first localizes victim objects according to target labels. And then it generates multiple attack plans together with their context-consistency scores. GLOW on the one hand is capable of handling various types of requests including single or multiple victim objects with or without specified victim objects. On the other hand it produces a consistency score for each attack plan reflecting the overall contextual consistency that both semantic category and global scene layout are considered. We conduct our experiments on MS COCO and Pascal. Extensive experimental results demonstrate that we can achieve about 30% average relative improvement compared to state-of-the-art methods in conventional single object attack request; Moreover such superiority is also valid across more generic attack requests under both white-box and zero-query black-box settings. Finally we conduct comprehensive human analysis which not only validates our claim further but also provides strong evidence that our evaluation metrics reflect human reviews well. 
+ + + + SIRA: Scalable Inter-frame Relation and Association for Radar Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/Yataka_SIRA_Scalable_Inter-frame_Relation_and_Association_for_Radar_Perception_CVPR_2024_paper.pdf + Conventional radar feature extraction faces limitations due to low spatial resolution noise multipath reflection the presence of ghost targets and motion blur. Such limitations can be exacerbated by nonlinear object motion particularly from an ego-centric viewpoint. It becomes evident that to address these challenges the key lies in exploiting temporal feature relation over an extended horizon and enforcing spatial motion consistence for effective association. To this end this paper proposes SIRA (Scalable Inter-frame Relation and Association) with two designs. First inspired by Swin Transformer we introduce extended temporal relation generalizing the existing temporal relation layer from two consecutive frames to multiple inter-frames with temporally regrouped window attention for scalability. Second we propose motion consistency track with the concept of a pseudo-tracklet generated from observational data for better trajectory prediction and subsequent object association. Our approach achieves 58.11 mAP@0.5 for oriented object detection and 47.79 MOTA for multiple object tracking on the Radiate dataset surpassing previous state-of-the-art by a margin of +4.11 mAP@0.5 and +9.94 MOTA respectively. + + + + VOODOO 3D: Volumetric Portrait Disentanglement For One-Shot 3D Head Reenactment + http://openaccess.thecvf.com//content/CVPR2024/papers/Tran_VOODOO_3D_Volumetric_Portrait_Disentanglement_For_One-Shot_3D_Head_Reenactment_CVPR_2024_paper.pdf + We present a 3D-aware one-shot head reenactment method based on a fully volumetric neural disentanglement framework for source appearance and driver expressions. Our method is real-time and produces high-fidelity and view-consistent output suitable for 3D teleconferencing systems based on holographic displays. Existing cutting-edge 3D-aware reenactment methods often use neural radiance fields or 3D meshes to produce view-consistent appearance encoding but at the same time they rely on linear face models such as 3DMM to achieve its disentanglement with facial expressions. As a result their reenactment results often exhibit identity leakage from the driver or have unnatural expressions. To address these problems we propose a neural self-supervised disentanglement approach that lifts both the source image and driver video frame into a shared 3D volumetric representation based on tri-planes. This representation can then be freely manipulated with expression tri-planes extracted from the driving images and rendered from an arbitrary view using neural radiance fields. We achieve this disentanglement via self-supervised learning on a large in-the-wild video dataset. We further introduce a highly effective fine-tuning approach to improve the generalizability of the 3D lifting using the same real-world data. We demonstrate state-of-the-art performance on a wide range of datasets and also showcase high-quality 3D-aware head reenactment on highly challenging and diverse subjects including non-frontal head poses and complex expressions for both source and driver. 
+ + + + Visual Fact Checker: Enabling High-Fidelity Detailed Caption Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ge_Visual_Fact_Checker_Enabling_High-Fidelity_Detailed_Caption_Generation_CVPR_2024_paper.pdf + Existing automatic captioning methods for visual content face challenges such as lack of detail content hallucination and poor instruction following. In this work we propose VisualFactChecker (VFC) a flexible training-free pipeline that generates high-fidelity and detailed captions for both 2D images and 3D objects. VFC consists of three steps: 1) proposal where image-to-text captioning models propose multiple initial captions; 2) verification where a large language model (LLM) utilizes tools such as object detection and VQA models to fact-check proposed captions; 3) captioning where an LLM generates the final caption by summarizing caption proposals and the fact check verification results. In this step VFC can flexibly generate captions in various styles following complex instructions. We conduct comprehensive captioning evaluations using four metrics: 1) CLIP-Score for image-text similarity; 2) CLIP-Image-Score for measuring the image-image similarity between the original and the reconstructed image generated by a text-to-image model using the caption. 3) human study on Amazon Mechanical Turk; 4) GPT-4V for fine-grained evaluation. Evaluation results show that VFC outperforms state-of-the-art open-sourced captioning methods for 2D images on the COCO dataset and 3D assets on the Objaverse dataset. Our study demonstrates that by combining open-source models into a pipeline we can attain captioning capability comparable to proprietary models such as GPT-4V despite being over 10x smaller in model size. + + + + Communication-Efficient Collaborative Perception via Information Filling with Codebook + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_Communication-Efficient_Collaborative_Perception_via_Information_Filling_with_Codebook_CVPR_2024_paper.pdf + Collaborative perception empowers each agent to improve its perceptual ability through the exchange of perceptual messages with other agents. It inherently results in a fundamental trade-off between perception ability and communication cost. To address this bottleneck issue our core idea is to optimize the collaborative messages from two key aspects: representation and selection. The proposed codebook-based message representation enables the transmission of integer codes rather than high-dimensional feature maps. The proposed information-filling-driven message selection optimizes local messages to collectively fill each agent's information demand preventing information overflow among multiple agents. By integrating these two designs we propose CodeFilling a novel communication-efficient collaborative perception system which significantly advances the perception-communication trade-off and is inclusive to both homogeneous and heterogeneous collaboration settings. We evaluate CodeFilling in both a real-world dataset DAIR-V2X and a new simulation dataset OPV2VH+. Results show that CodeFilling outperforms previous SOTA Where2comm on DAIR-V2X/OPV2VH+ with 1333/1206x lower communication volume. Our code is available at https://github.com/PhyllisH/CodeFilling. 
+ + + + MVD-Fusion: Single-view 3D via Depth-consistent Multi-view Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_MVD-Fusion_Single-view_3D_via_Depth-consistent_Multi-view_Generation_CVPR_2024_paper.pdf + We present MVD-Fusion: a method for single-view 3D inference via generative modeling of multi-view-consistent RGB-D images. While recent methods pursuing 3D inference advocate learning novel-view generative models these generations are not 3D-consistent and require a distillation process to generate a 3D output. We instead cast the task of 3D inference as directly generating mutually-consistent multiple views and build on the insight that additionally inferring depth can provide a mechanism for enforcing this consistency. Specifically we train a denoising diffusion model to generate multi-view RGB-D images given a single RGB input image and leverage the (intermediate noisy) depth estimates to obtain reprojection-based conditioning to maintain multi-view consistency. We train our model using the large-scale synthetic dataset Objaverse as well as the real-world CO3D dataset comprising generic camera viewpoints. We demonstrate that our approach can yield more accurate synthesis compared to recent state-of-the-art including distillation-based 3D inference and prior multi-view generation methods. We also evaluate the geometry induced by our multi-view depth prediction and find that it yields a more accurate representation than other direct 3D inference approaches. + + + + Effective Video Mirror Detection with Inconsistent Motion Cues + http://openaccess.thecvf.com//content/CVPR2024/papers/Warren_Effective_Video_Mirror_Detection_with_Inconsistent_Motion_Cues_CVPR_2024_paper.pdf + Image-based mirror detection has recently undergone rapid research due to its significance in applications such as robotic navigation semantic segmentation and scene reconstruction. Recently VMD-Net was proposed as the first video mirror detection technique by modeling dual correspondences between the inside and outside of the mirror both spatially and temporally. However this approach is not reliable as correspondences can occur completely inside or outside of the mirrors. In addition the proposed dataset VMD-D contains many small mirrors limiting its applicability to real-world scenarios. To address these problems we developed a more challenging dataset that includes mirrors of various shapes and sizes at different locations of the frames providing a better reflection of real-world scenarios. Next we observed that the motions between the inside and outside of the mirror are often inconsistent. For instance when moving in front of a mirror the motion inside the mirror is often much smaller than the motion outside due to increased depth perception. With these observations we propose modeling inconsistent motion cues to detect mirrors and a new network with two novel modules. The Motion Attention Module (MAM) explicitly models inconsistent motions around mirrors via optical flow and the Motion-Guided Edge Detection Module (MEDM) uses motions to guide mirror edge feature learning. Experimental results on our proposed dataset show that our method outperforms state-of-the-art methods. The code and dataset are available at https://github.com/AlexAnthonyWarren/MG-VMD.
+ + + + DiffLoc: Diffusion Model for Outdoor LiDAR Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_DiffLoc_Diffusion_Model_for_Outdoor_LiDAR_Localization_CVPR_2024_paper.pdf + Absolute pose regression (APR) estimates global pose in an end-to-end manner achieving impressive results in learning-based LiDAR localization. However compared to the top-performing methods reliant on 3D-3D correspondence matching APR's accuracy still has room for improvement. We recognize that APR's lack of robust feature learning and an iterative denoising process leads to suboptimal results. In this paper we propose DiffLoc a novel framework that formulates LiDAR localization as a conditional generation of poses. First we propose to utilize the foundation model and static-object-aware pool to learn robust features. Second we incorporate the iterative denoising process into APR via a diffusion model conditioned on the learned geometrically robust features. In addition due to the unique nature of diffusion models we propose to adapt our models to two additional applications: (1) using multiple inferences to evaluate pose uncertainty and (2) seamlessly introducing geometric constraints on denoising steps to improve prediction accuracy. Extensive experiments conducted on the Oxford Radar RobotCar and NCLT datasets demonstrate that DiffLoc outperforms the state-of-the-art methods. Especially on the NCLT dataset we achieve 35% and 34.7% improvement on position and orientation accuracy respectively. Our code is released at https://github.com/liw95/DiffLoc. + + + + On Scaling Up a Multilingual Vision and Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_On_Scaling_Up_a_Multilingual_Vision_and_Language_Model_CVPR_2024_paper.pdf + We explore the boundaries of scaling up a multilingual vision and language model both in terms of size of the components and the breadth of its training task mixture. Our model achieves new levels of performance on a wide range of varied and complex tasks including multiple image-based captioning and question-answering tasks image-based document understanding and few-shot (in-context) learning as well as object detection video question answering and video captioning. Our model advances the state-of-the-art on most vision-and-language benchmarks considered (20+ of them). Finally we observe emerging capabilities such as complex counting and multilingual object detection tasks that are not explicitly in the training mix. + + + + Day-Night Cross-domain Vehicle Re-identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Day-Night_Cross-domain_Vehicle_Re-identification_CVPR_2024_paper.pdf + Previous advances in vehicle re-identification (ReID) are mostly reported under favorable lighting conditions while cross-day-and-night performance is neglected which greatly hinders the development of related traffic intelligence applications. This work instead develops a novel Day-Night Dual-domain Modulation (DNDM) vehicle re-identification framework for day-night cross-domain traffic scenarios. Specifically a unique night-domain glare suppression module is provided to attenuate the headlight glare from raw nighttime vehicle images. To enhance vehicle features under low-light environments we propose a dual-domain structure enhancement module in the feature extractor which enhances geometric structures between appearance features.
To alleviate day-night domain discrepancies we develop a cross-domain class awareness module that facilitates the interaction between appearance and structure features in both domains. In this work we address the Day-Night cross-domain ReID (DN-ReID) problem and provide a new cross-domain dataset named DN-Wild including day and night images of 2286 identities giving in total 85945 daytime images and 54952 nighttime images. Furthermore we also take into account the matter of balance between day and night samples and provide a dataset called DN-348. Exhaustive experiments demonstrate the robustness of the proposed framework in the DN-ReID problem. The code and benchmark are released at https://github.com/chenjingong/DN-ReID. + + + + Holodeck: Language Guided Generation of 3D Embodied AI Environments + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Holodeck_Language_Guided_Generation_of_3D_Embodied_AI_Environments_CVPR_2024_paper.pdf + 3D simulated environments play a critical role in Embodied AI but their creation requires expertise and extensive manual effort restricting their diversity and scope. To mitigate this limitation we present Holodeck a system that generates 3D environments to match a user-supplied prompt fully automatedly. Holodeck can generate diverse scenes e.g. arcades spas and museums adjust the designs for styles and can capture the semantics of complex queries such as "apartment for a researcher with a cat" and "office of a professor who is a fan of Star Wars". Holodeck leverages a large language model (i.e. GPT-4) for common sense knowledge about what the scene might look like and uses a large collection of 3D assets from Objaverse to populate the scene with diverse objects. To address the challenge of positioning objects correctly we prompt GPT-4 to generate spatial relational constraints between objects and then optimize the layout to satisfy those constraints. Our large-scale human evaluation shows that annotators prefer Holodeck over manually designed procedural baselines in residential scenes and that Holodeck can produce high-quality outputs for diverse scene types. We also demonstrate an exciting application of Holodeck in Embodied AI training agents to navigate in novel scenes like music rooms and daycares without human-constructed data which is a significant step forward in developing general-purpose embodied agents. + + + + Distilled Datamodel with Reverse Gradient Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_Distilled_Datamodel_with_Reverse_Gradient_Matching_CVPR_2024_paper.pdf + The proliferation of large-scale AI models trained on extensive datasets has revolutionized machine learning. With these models taking on increasingly central roles in various applications the need to understand their behavior and enhance interpretability has become paramount. To investigate the impact of changes in training data on a pre-trained model a common approach is leave-one-out retraining. This entails systematically altering the training dataset by removing specific samples to observe resulting changes within the model. However retraining the model for each altered dataset presents a significant computational challenge given the need to perform this operation for every dataset variation. In this paper we introduce an efficient framework for assessing data impact comprising offline training and online evaluation stages. 
During the offline training phase we approximate the influence of training data on the target model through a distilled synset formulated as a reversed gradient matching problem. For online evaluation we expedite the leave-one-out process using the synset which is then utilized to compute the attribution matrix based on the evaluation objective. Experimental evaluations including training data attribution and assessments of data quality demonstrate that our proposed method achieves comparable model behavior evaluation while significantly speeding up the process compared to the direct retraining method. + + + + Pseudo Label Refinery for Unsupervised Domain Adaptation on Cross-dataset 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Pseudo_Label_Refinery_for_Unsupervised_Domain_Adaptation_on_Cross-dataset_3D_CVPR_2024_paper.pdf + Recent self-training techniques have shown notable improvements in unsupervised domain adaptation for 3D object detection (3D UDA). These techniques typically select pseudo labels i.e. 3D boxes to supervise models for the target domain. However this selection process inevitably introduces unreliable 3D boxes in which 3D points cannot be definitively assigned as foreground or background. Previous techniques mitigate this by reweighting these boxes as pseudo labels but these boxes can still poison the training process. To resolve this problem in this paper we propose a novel pseudo label refinery framework. Specifically in the selection process to improve the reliability of pseudo boxes we propose a complementary augmentation strategy. This strategy involves either removing all points within an unreliable box or replacing it with a high-confidence box. Moreover the point numbers of instances in high-beam datasets are considerably higher than those in low-beam datasets also degrading the quality of pseudo labels during the training process. We alleviate this issue by generating additional proposals and aligning RoI features across different domains. Experimental results demonstrate that our method effectively enhances the quality of pseudo labels and consistently surpasses the state-of-the-art methods on six autonomous driving benchmarks. Code will be available at https://github.com/Zhanwei-Z/PERE. + + + + Reconstructing Hands in 3D with Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Pavlakos_Reconstructing_Hands_in_3D_with_Transformers_CVPR_2024_paper.pdf + We present an approach that can reconstruct hands in 3D from monocular input. Our approach for Hand Mesh Recovery HaMeR follows a fully transformer-based architecture and can analyze hands with significantly increased accuracy and robustness compared to previous work. The key to HaMeR's success lies in scaling up both the data used for training and the capacity of the deep network for hand reconstruction. For training data we combine multiple datasets that contain 2D or 3D hand annotations. For the deep model we use a large scale Vision Transformer architecture. Our final model consistently outperforms the previous baselines on popular 3D hand pose benchmarks. To further evaluate the effect of our design in non-controlled settings we annotate existing in-the-wild datasets with 2D hand keypoint annotations. On this newly collected dataset of annotations HInt we demonstrate significant improvements over existing baselines. We will make our code data and models publicly available upon publication. 
We make our code, data, and models available on the project website: https://geopavlakos.github.io/hamer/. + + + + PELA: Learning Parameter-Efficient Models with Low-Rank Approximation + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_PELA_Learning_Parameter-Efficient_Models_with_Low-Rank_Approximation_CVPR_2024_paper.pdf + Applying a pre-trained large model to downstream tasks is prohibitive under resource-constrained conditions. Recent dominant approaches for addressing efficiency issues involve adding a few learnable parameters to the fixed backbone model. This strategy, however, leads to more challenges in loading large models for downstream fine-tuning with limited resources. In this paper, we propose a novel method for increasing the parameter efficiency of pre-trained models by introducing an intermediate pre-training stage. To this end, we first employ low-rank approximation to compress the original large model and then devise a feature distillation module and a weight perturbation regularization module. These modules are specifically designed to enhance the low-rank model. In particular, we update only the low-rank model while freezing the backbone parameters during pre-training. This allows for direct and efficient utilization of the low-rank model for downstream fine-tuning tasks. The proposed method achieves efficiency in terms of both required parameters and computation time, while maintaining comparable results with minimal modifications to the backbone architecture. Specifically, when applied to three vision-only and one vision-language Transformer models, our approach often demonstrates a mere 0.6-point decrease in performance while reducing the original parameter size by 1/3 to 2/3. + + + + Auto-Train-Once: Controller Network Guided Automatic Network Pruning from Scratch + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Auto-Train-Once_Controller_Network_Guided_Automatic_Network_Pruning_from_Scratch_CVPR_2024_paper.pdf + Current techniques for deep neural network (DNN) pruning often involve intricate multi-step processes that require domain-specific expertise, making their widespread adoption challenging. To address this limitation, Only-Train-Once (OTO) and OTOv2 were proposed to eliminate the need for additional fine-tuning steps by directly training and compressing a general DNN from scratch. Nevertheless, the static design of optimizers (in OTO) can lead to convergence issues of local optima. In this paper, we propose Auto-Train-Once (ATO), an innovative network pruning algorithm designed to automatically reduce the computational and storage costs of DNNs. During the model training phase, our approach not only trains the target model but also leverages a controller network as an architecture generator to guide the learning of target model weights. Furthermore, we develop a novel stochastic gradient algorithm that enhances the coordination between model training and controller network training, thereby improving pruning performance. We provide a comprehensive convergence analysis as well as extensive experiments, and the results show that our approach achieves state-of-the-art performance across various model architectures (including ResNet18, ResNet34, ResNet50, ResNet56, and MobileNetv2) on standard benchmark datasets (CIFAR-10, CIFAR-100, and ImageNet).
+ + + + Constructing and Exploring Intermediate Domains in Mixed Domain Semi-supervised Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Constructing_and_Exploring_Intermediate_Domains_in_Mixed_Domain_Semi-supervised_Medical_CVPR_2024_paper.pdf + Both limited annotation and domain shift are prevalent challenges in medical image segmentation. Traditional semi-supervised segmentation and unsupervised domain adaptation methods address one of these issues separately. However the coexistence of limited annotation and domain shift is quite common which motivates us to introduce a novel and challenging scenario: Mixed Domain Semi-supervised medical image Segmentation (MiDSS). In this scenario we handle data from multiple medical centers with limited annotations available for a single domain and a large amount of unlabeled data from multiple domains. We found that the key to solving the problem lies in how to generate reliable pseudo labels for the unlabeled data in the presence of domain shift with labeled data. To tackle this issue we employ Unified Copy-Paste (UCP) between images to construct intermediate domains facilitating the knowledge transfer from the domain of labeled data to the domains of unlabeled data. To fully utilize the information within the intermediate domain we propose a symmetric Guidance training strategy (SymGD) which additionally offers direct guidance to unlabeled data by merging pseudo labels from intermediate samples. Subsequently we introduce a Training Process aware Random Amplitude MixUp (TP-RAM) to progressively incorporate style-transition components into intermediate samples. Compared with existing state-of-the-art approaches our method achieves a notable 13.57% improvement in Dice score on Prostate dataset as demonstrated on three public datasets. Our code is available at https://github.com/MQinghe/MiDSS + + + + From Isolated Islands to Pangea: Unifying Semantic Space for Human Action Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_From_Isolated_Islands_to_Pangea_Unifying_Semantic_Space_for_Human_CVPR_2024_paper.pdf + Action understanding matters for intelligent agents and has attracted long-term attention. It can be formed as the mapping from the action physical space to the semantic space. Typically researchers built action datasets according to idiosyncratic choices to define classes and push the envelope of benchmarks respectively. Thus datasets are incompatible with each other like "Isolated Islands" due to semantic gaps and various class granularities e.g. do housework in dataset A and wash plate in dataset B. We argue that a more principled semantic space is an urgent need to concentrate the community efforts and enable us to use all datasets together to pursue generalizable action learning. To this end we design a structured action semantic space in view of verb taxonomy hierarchy and covering massive actions. By aligning the classes of previous datasets to our semantic space we gather (image/video/skeleton/MoCap) datasets into a unified database in a unified label system i.e. bridging "isolated islands" into a "Pangea". Accordingly we propose a novel model mapping from the physical space to semantic space to fully use Pangea. In extensive experiments our new system shows significant superiority especially in transfer learning. Our code and data will be made public at https://mvig-rhos.com/pangea. 
+ + + + Bootstrapping Autonomous Driving Radars with Self-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Hao_Bootstrapping_Autonomous_Driving_Radars_with_Self-Supervised_Learning_CVPR_2024_paper.pdf + The perception of autonomous vehicles using radars has attracted increased research interest due its ability to operate in fog and bad weather. However training radar models is hindered by the cost and difficulty of annotating large-scale radar data. To overcome this bottleneck we propose a self-supervised learning framework to leverage the large amount of unlabeled radar data to pre-train radar-only embeddings for self-driving perception tasks. The proposed method combines radar-to-radar and radar-to-vision contrastive losses to learn a general representation from unlabeled radar heatmaps paired with their corresponding camera images. When used for downstream object detection we demonstrate that the proposed self-supervision framework can improve the accuracy of state-of-the-art supervised baselines by 5.8% in mAP. Code is available at https://github.com/yiduohao/Radical. + + + + Weakly Supervised Monocular 3D Detection with a Single-View Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Weakly_Supervised_Monocular_3D_Detection_with_a_Single-View_Image_CVPR_2024_paper.pdf + Monocular 3D detection (M3D) aims for precise 3D object localization from a single-view image which usually involves labor-intensive annotation of 3D detection boxes. Weakly supervised M3D has recently been studied to obviate the 3D annotation process by leveraging many existing 2D annotations but it often requires extra training data such as LiDAR point clouds or multi-view images which greatly degrades its applicability and usability in various applications. We propose SKD-WM3D a weakly supervised monocular 3D detection framework that exploits depth information to achieve M3D with a single-view image exclusively without any 3D annotations or other training data. One key design in SKD-WM3D is a self-knowledge distillation framework which transforms image features into 3D-like representations by fusing depth information and effectively mitigates the inherent depth ambiguity in monocular scenarios with little computational overhead in inference. In addition we design an uncertainty-aware distillation loss and a gradient-targeted transfer modulation strategy which facilitate knowledge acquisition and knowledge transfer respectively. Extensive experiments show that SKD-WM3D surpasses the state-of-the-art clearly and is even on par with many fully supervised methods. + + + + Blind Image Quality Assessment Based on Geometric Order Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Shin_Blind_Image_Quality_Assessment_Based_on_Geometric_Order_Learning_CVPR_2024_paper.pdf + A novel approach to blind image quality assessment called quality comparison network (QCN) is proposed in this paper which sorts the feature vectors of input images according to their quality scores in an embedding space. QCN employs comparison transformers (CTs) and score pivots which act as the centroids of feature vectors of similar-quality images. Each CT updates the score pivots and the feature vectors of input images based on their ordered correlation. To this end we adopt four loss functions. Then we estimate the quality score of a test image by searching the nearest score pivot to its feature vector in the embedding space. 
Extensive experiments show that the proposed QCN algorithm yields excellent image quality assessment performances on various datasets. Furthermore QCN achieves great performances in cross-dataset evaluation demonstrating its superb generalization capability. The source codes are available at https://github.com/nhshin-mcl/QCN. + + + + Generalizable Whole Slide Image Classification with Fine-Grained Visual-Semantic Interaction + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Generalizable_Whole_Slide_Image_Classification_with_Fine-Grained_Visual-Semantic_Interaction_CVPR_2024_paper.pdf + Whole Slide Image (WSI) classification is often formulated as a Multiple Instance Learning (MIL) problem. Recently Vision-Language Models (VLMs) have demonstrated remarkable performance in WSI classification. However existing methods leverage coarse-grained pathogenetic descriptions for visual representation supervision which are insufficient to capture the complex visual appearance of pathogenetic images hindering the generalizability of models on diverse downstream tasks. Additionally processing high-resolution WSIs can be computationally expensive. In this paper we propose a novel "Fine-grained Visual-Semantic Interaction" (FiVE) framework for WSI classification. It is designed to enhance the model's generalizability by leveraging the interaction between localized visual patterns and fine-grained pathological semantics. Specifically with meticulously designed queries we start by utilizing a large language model to extract fine-grained pathological descriptions from various non-standardized raw reports. The output descriptions are then reconstructed into fine-grained labels used for training. By introducing a Task-specific Fine-grained Semantics (TFS) module we enable prompts to capture crucial visual information in WSIs which enhances representation learning and augments generalization capabilities significantly. Furthermore given that pathological visual patterns are redundantly distributed across tissue slices we sample a subset of visual instances during training. Our method demonstrates robust generalizability and strong transferability dominantly outperforming the counterparts on the TCGA Lung Cancer dataset with at least 9.19% higher accuracy in few-shot experiments. The code is available at: https://github.com/ls1rius/WSI_FiVE. + + + + Generalizing 6-DoF Grasp Detection via Domain Prior Knowledge + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Generalizing_6-DoF_Grasp_Detection_via_Domain_Prior_Knowledge_CVPR_2024_paper.pdf + We focus on the generalization ability of the 6-DoF grasp detection method in this paper. While learning-based grasp detection methods can predict grasp poses for unseen objects using the grasp distribution learned from the training set they often exhibit a significant performance drop when encountering objects with diverse shapes and structures. To enhance the grasp detection methods' generalization ability we incorporate domain prior knowledge of robotic grasping enabling better adaptation to objects with significant shape and structure differences. More specifically we employ the physical constraint regularization during the training phase to guide the model towards predicting grasps that comply with the physical rule on grasping. For the unstable grasp poses predicted on novel objects we design a contact-score joint optimization using the projection contact map to refine these poses in cluttered scenarios. 
Extensive experiments conducted on the GraspNet-1billion benchmark demonstrate a substantial performance gain on the novel object set and the real-world grasping experiments also demonstrate the effectiveness of our generalizing 6-DoF grasp detection method. + + + + RadSimReal: Bridging the Gap Between Synthetic and Real Data in Radar Object Detection With Simulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Bialer_RadSimReal_Bridging_the_Gap_Between_Synthetic_and_Real_Data_in_CVPR_2024_paper.pdf + Object detection in radar imagery with neural networks shows great potential for improving autonomous driving. However obtaining annotated datasets from real radar images crucial for training these networks is challenging especially in scenarios with long-range detection and adverse weather and lighting conditions where radar performance excels. To address this challenge we present RadSimReal an innovative physical radar simulation capable of generating synthetic radar images with accompanying annotations for various radar types and environmental conditions all without the need for real data collection. Remarkably our findings demonstrate that training object detection models on RadSimReal data and subsequently evaluating them on real-world data produce performance levels comparable to models trained and tested on real data from the same dataset and even achieves better performance when testing across different real datasets. RadSimReal offers advantages over other physical radar simulations that it does not necessitate knowledge of the radar design details which are often not disclosed by radar suppliers and has faster run-time. This innovative tool has the potential to advance the development of computer vision algorithms for radar-based autonomous driving applications. + + + + 3DSFLabelling: Boosting 3D Scene Flow Estimation by Pseudo Auto-labelling + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_3DSFLabelling_Boosting_3D_Scene_Flow_Estimation_by_Pseudo_Auto-labelling_CVPR_2024_paper.pdf + Learning 3D scene flow from LiDAR point clouds presents significant difficulties including poor generalization from synthetic datasets to real scenes scarcity of real-world 3D labels and poor performance on real sparse LiDAR point clouds. We present a novel approach from the perspective of auto-labelling aiming to generate a large number of 3D scene flow pseudo labels for real-world LiDAR point clouds. Specifically we employ the assumption of rigid body motion to simulate potential object-level rigid movements in autonomous driving scenarios. By updating different motion attributes for multiple anchor boxes the rigid motion decomposition is obtained for the whole scene. Furthermore we developed a novel 3D scene flow data augmentation method for global and local motion. By perfectly synthesizing target point clouds based on augmented motion parameters we easily obtain lots of 3D scene flow labels in point clouds highly consistent with real scenarios. On multiple real-world datasets including LiDAR KITTI nuScenes and Argoverse our method outperforms all previous supervised and unsupervised methods without requiring manual labelling. Impressively our method achieves a tenfold reduction in EPE3D metric on the LiDAR KITTI dataset reducing it from 0.190m to a mere 0.008m error. 
+ + + + Question Aware Vision Transformer for Multimodal Reasoning + http://openaccess.thecvf.com//content/CVPR2024/papers/Ganz_Question_Aware_Vision_Transformer_for_Multimodal_Reasoning_CVPR_2024_paper.pdf + Vision-Language (VL) models have gained significant research focus enabling remarkable advances in multimodal reasoning. These architectures typically comprise a vision encoder a Large Language Model (LLM) and a projection module that aligns visual features with the LLM's representation space. Despite their success a critical limitation persists: the vision encoding process remains decoupled from user queries often in the form of image-related questions. Consequently the resulting visual features may not be optimally attuned to the query-specific elements of the image. To address this we introduce QA-ViT a Question Aware Vision Transformer approach for multimodal reasoning which embeds question awareness directly within the vision encoder. This integration results in dynamic visual features focusing on relevant image aspects to the posed question. QA-ViT is model-agnostic and can be incorporated efficiently into any VL architecture. Extensive experiments demonstrate the effectiveness of applying our method to various multimodal architectures leading to consistent improvement across diverse tasks and showcasing its potential for enhancing visual and scene-text understanding. + + + + OST: Refining Text Knowledge with Optimal Spatio-Temporal Descriptor for General Video Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_OST_Refining_Text_Knowledge_with_Optimal_Spatio-Temporal_Descriptor_for_General_CVPR_2024_paper.pdf + Due to the resource-intensive nature of training vision-language models on expansive video data a majority of studies have centered on adapting pre-trained image-language models to the video domain. Dominant pipelines propose to tackle the visual discrepancies with additional temporal learners while overlooking the substantial discrepancy for web-scaled descriptive narratives and concise action category names leading to less distinct semantic space and potential performance limitations. In this work we prioritize the refinement of text knowledge to facilitate generalizable video recognition. To address the limitations of the less distinct semantic space of category names we prompt a large language model (LLM) to augment action class names into Spatio-Temporal Descriptors thus bridging the textual discrepancy and serving as a knowledge base for general recognition. Moreover to assign the best descriptors with different video instances we propose Optimal Descriptor Solver forming the video recognition problem as solving the optimal matching flow across frame-level representations and descriptors. Comprehensive evaluations in zero-shot few-shot and fully supervised video recognition highlight the effectiveness of our approach. Our best model achieves a state-of-the-art zero-shot accuracy of 75.1% on Kinetics-600. + + + + Habitat Synthetic Scenes Dataset (HSSD-200): An Analysis of 3D Scene Scale and Realism Tradeoffs for ObjectGoal Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Khanna_Habitat_Synthetic_Scenes_Dataset_HSSD-200_An_Analysis_of_3D_Scene_CVPR_2024_paper.pdf + We contribute the Habitat Synthetic Scene Dataset a dataset of 211 high-quality 3D scenes and use it to test navigation agent generalization to realistic 3D environments. 
Our dataset represents real interiors and contains a diverse set of 18656 models of real-world objects. We investigate the impact of synthetic 3D scene dataset scale and realism on the task of training embodied agents to find and navigate to objects (ObjectGoal navigation). By comparing to synthetic 3D scene datasets from prior work we find that scale helps in generalization but the benefits quickly saturate making visual fidelity and correlation to real-world scenes more important. Our experiments show that agents trained on our smaller-scale dataset can match or outperform agents trained on much larger datasets. Surprisingly we observe that agents trained on just 122 scenes from our dataset outperform agents trained on 10000 scenes from the ProcTHOR-10K dataset in terms of zero-shot generalization in real-world scanned environments. + + + + NViST: In the Wild New View Synthesis from a Single Image with Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Jang_NViST_In_the_Wild_New_View_Synthesis_from_a_Single_CVPR_2024_paper.pdf + We propose NViST a transformer-based model for efficient and generalizable novel-view synthesis from a single image for real-world scenes. In contrast to many methods that are trained on synthetic data object-centred scenarios or in a category-specific manner NViST is trained on MVImgNet a large-scale dataset of casually-captured real-world videos of hundreds of object categories with diverse backgrounds. NViST transforms image inputs directly into a radiance field conditioned on camera parameters via adaptive layer normalisation. In practice NViST exploits fine-tuned masked autoencoder (MAE) features and translates them to 3D output tokens via cross-attention while addressing occlusions with self-attention. To move away from object-centred datasets and enable full scene synthesis NViST adopts a 6-DOF camera pose model and only requires relative pose dropping the need for canonicalization of the training data which removes a substantial barrier to it being used on casually captured datasets. We show results on unseen objects and categories from MVImgNet and even generalization to casual phone captures. We conduct qualitative and quantitative evaluations on MVImgNet and ShapeNet to show that our model represents a step forward towards enabling true in-the-wild generalizable novel-view synthesis from a single image. Project webpage: https://wbjang.github.io/nvist_webpage. + + + + Step Differences in Instructional Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Nagarajan_Step_Differences_in_Instructional_Video_CVPR_2024_paper.pdf + Comparing a user video to a reference how-to video is a key requirement for AR/VR technology delivering personalized assistance tailored to the user's progress. However current approaches for language-based assistance can only answer questions about a single video. We propose an approach that first automatically generates large amounts of visual instruction tuning data involving pairs of videos from HowTo100M by leveraging existing step annotations and accompanying narrations and then trains a video-conditioned language model to jointly reason across multiple raw videos. Our model achieves state-of-the-art performance at identifying differences between video pairs and ranking videos based on the severity of these differences and shows promising ability to perform general reasoning over multiple videos. 
+ + + + Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Depth_Anything_Unleashing_the_Power_of_Large-Scale_Unlabeled_Data_CVPR_2024_paper.pdf + This work presents Depth Anything, a highly practical solution for robust monocular depth estimation. Without pursuing novel technical modules, we aim to build a simple yet powerful foundation model dealing with any images under any circumstances. To this end, we scale up the dataset by designing a data engine to collect and automatically annotate large-scale unlabeled data (~62M), which significantly enlarges the data coverage and thus is able to reduce the generalization error. We investigate two simple yet effective strategies that make data scaling-up promising. First, a more challenging optimization target is created by leveraging data augmentation tools. It compels the model to actively seek extra visual knowledge and acquire robust representations. Second, an auxiliary supervision is developed to enforce the model to inherit rich semantic priors from pre-trained encoders. We evaluate its zero-shot capabilities extensively, including on six public datasets and randomly captured photos. It demonstrates impressive generalization ability. Further, through fine-tuning it with metric depth information from NYUv2 and KITTI, new SOTAs are set. Our better depth model also results in a better depth-conditioned ControlNet. Our models are released at https://github.com/LiheYoung/Depth-Anything. + + + + MPOD123: One Image to 3D Content Generation Using Mask-enhanced Progressive Outline-to-Detail Optimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_MPOD123_One_Image_to_3D_Content_Generation_Using_Mask-enhanced_Progressive_CVPR_2024_paper.pdf + Recent advancements in single image driven 3D content generation have been propelled by leveraging prior knowledge from pretrained 2D diffusion models. However, the 3D content generated by existing methods often exhibits distorted outline shapes and inadequate details. To solve this problem, we propose a novel framework called Mask-enhanced Progressive Outline-to-Detail optimization (aka. MPOD123), which consists of two stages. Specifically, in the first stage, MPOD123 utilizes the pretrained view-conditioned diffusion model to guide the outline shape optimization of the 3D content. Given a certain viewpoint, we estimate outline shape priors in the form of a 2D mask from the 3D content by leveraging opacity calculation. In the second stage, MPOD123 incorporates Detail Appearance Inpainting (DAI) to guide the refinement of local geometry and texture with the shape priors. The essence of DAI lies in the Mask Rectified Cross-Attention (MRCA), which can be conveniently plugged into the stable diffusion model. The MRCA module utilizes the mask to rectify the attention map from each cross-attention layer. Accompanied by this new module, DAI is capable of guiding the detail refinement of the 3D content while better preserving the outline shape. To assess the applicability in practical scenarios, we contribute a new dataset modeled on real-world e-commerce environments. Extensive quantitative and qualitative experiments on this dataset and open benchmarks demonstrate the effectiveness of MPOD123 over the state-of-the-art.
+ + + + UnionFormer: Unified-Learning Transformer with Multi-View Representation for Image Manipulation Detection and Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_UnionFormer_Unified-Learning_Transformer_with_Multi-View_Representation_for_Image_Manipulation_Detection_CVPR_2024_paper.pdf + We present UnionFormer, a novel framework that integrates tampering clues across three views by unified learning for image manipulation detection and localization. Specifically, we construct a BSFI-Net to extract tampering features from RGB and noise views, achieving enhanced responsiveness to boundary artifacts while modulating spatial consistency at different scales. Additionally, to explore the inconsistency between objects as a new view of clues, we combine object consistency modeling with tampering detection and localization into a three-task unified learning process, allowing them to promote and improve each other. Therefore, we acquire a unified manipulation discriminative representation under multi-scale supervision that consolidates information from three views. This integration facilitates highly effective concurrent detection and localization of tampering. We perform extensive experiments on diverse datasets, and the results show that the proposed approach outperforms state-of-the-art methods in tampering detection and localization. + + + + Situational Awareness Matters in 3D Vision Language Reasoning + http://openaccess.thecvf.com//content/CVPR2024/papers/Man_Situational_Awareness_Matters_in_3D_Vision_Language_Reasoning_CVPR_2024_paper.pdf + Being able to carry out complicated vision language reasoning tasks in 3D space represents a significant milestone in developing household robots and human-centered embodied AI. In this work, we demonstrate that a critical and distinct challenge in 3D vision language reasoning is situational awareness, which incorporates two key components: (1) The autonomous agent grounds its self-location based on a language prompt. (2) The agent answers open-ended questions from the perspective of its calculated position. To address this challenge, we introduce SIG3D, an end-to-end Situation-Grounded model for 3D vision language reasoning. We tokenize the 3D scene into a sparse voxel representation and propose a language-grounded situation estimator, followed by a situated question answering module. Experiments on the SQA3D and ScanQA datasets show that SIG3D outperforms state-of-the-art models in situational estimation and question answering by a large margin (e.g. an enhancement of over 30% on situation accuracy). Subsequent analysis corroborates our architectural design choices, explores the distinct functions of visual and textual tokens, and highlights the importance of situational awareness in the domain of 3D question-answering. Project page is available at https://yunzeman.github.io/situation3d. + + + + RCBEVDet: Radar-camera Fusion in Bird's Eye View for 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_RCBEVDet_Radar-camera_Fusion_in_Birds_Eye_View_for_3D_Object_CVPR_2024_paper.pdf + Three-dimensional object detection is one of the key tasks in autonomous driving. To reduce costs in practice, low-cost multi-view cameras for 3D object detection are proposed to replace the expensive LiDAR sensors. However, it is difficult to achieve highly accurate and robust 3D object detection by relying solely on cameras.
An effective solution to this issue is combining multi-view cameras with the economical millimeter-wave radar sensor to achieve more reliable multi-modal 3D object detection. In this paper, we introduce RCBEVDet, a radar-camera fusion 3D object detection method in the bird's eye view (BEV). Specifically, we first design RadarBEVNet for radar BEV feature extraction. RadarBEVNet consists of a dual-stream radar backbone and a Radar Cross-Section (RCS) aware BEV encoder. In the dual-stream radar backbone, a point-based encoder and a transformer-based encoder are proposed to extract radar features, with an injection and extraction module to facilitate communication between the two encoders. The RCS-aware BEV encoder takes RCS as the object size prior to scatter the point feature in BEV. Besides, we present the Cross-Attention Multi-layer Fusion module to automatically align the multi-modal BEV features from radar and camera with the deformable attention mechanism, and then fuse the features with channel and spatial fusion layers. Experimental results show that RCBEVDet achieves new state-of-the-art radar-camera fusion results on the nuScenes and View-of-Delft (VoD) 3D object detection benchmarks. Furthermore, RCBEVDet achieves better 3D detection results than all real-time camera-only and radar-camera 3D object detectors with a faster inference speed at 21-28 FPS. The source code will be released at https://github.com/VDIGPKU/RCBEVDet. + + + + Adaptive Softassign via Hadamard-Equipped Sinkhorn + http://openaccess.thecvf.com//content/CVPR2024/papers/Shen_Adaptive_Softassign_via_Hadamard-Equipped_Sinkhorn_CVPR_2024_paper.pdf + Softassign is a pivotal method in graph matching and other learning tasks. Many softassign-based algorithms exhibit performance sensitivity to a parameter in the softassign. However, tuning the parameter is challenging and almost always done empirically. This paper proposes an adaptive softassign method for graph matching by analyzing the relationship between the objective score and the parameter. This method can automatically tune the parameter based on a given error bound to guarantee accuracy. The Hadamard-Equipped Sinkhorn formulas introduced in this study significantly enhance the efficiency and stability of the adaptive softassign. Moreover, these formulas can also be used in optimal transport problems. The resulting adaptive softassign graph matching algorithm enjoys significantly higher accuracy than previous state-of-the-art large graph matching algorithms while maintaining comparable efficiency. + + + + Re-thinking Data Availability Attacks Against Deep Neural Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Fang_Re-thinking_Data_Availability_Attacks_Against_Deep_Neural_Networks_CVPR_2024_paper.pdf + The unauthorized use of personal data for commercial purposes and the covert acquisition of private data for training machine learning models continue to raise concerns. To address these issues, researchers have proposed availability attacks that aim to render data unexploitable. However, many availability attack methods can be easily disrupted by adversarial training. Although some robust methods can resist adversarial training, their protective effects are limited. In this paper, we re-examine the existing availability attack methods and propose a novel two-stage min-max-min optimization paradigm to generate robust unlearnable noise.
The inner min stage is utilized to generate unlearnable noise while the outer min-max stage simulates the training process of the poisoned model. Additionally we formulate the attack effects and use it to constrain the optimization objective. Comprehensive experiments have revealed that the noise generated by our method can lead to a decline in test accuracy for adversarially trained poisoned models by up to approximately 30% in comparison to SOTA methods. + + + + SHiNe: Semantic Hierarchy Nexus for Open-vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_SHiNe_Semantic_Hierarchy_Nexus_for_Open-vocabulary_Object_Detection_CVPR_2024_paper.pdf + Open-vocabulary object detection (OvOD) has transformed detection into a language-guided task empowering users to freely define their class vocabularies of interest during inference. However our initial investigation indicates that existing OvOD detectors exhibit significant variability when dealing with vocabularies across various semantic granularities posing a concern for real-world deployment. To this end we introduce Semantic Hierarchy Nexus (SHiNe) a novel classifier that uses semantic knowledge from class hierarchies. It runs offline in three steps: i) it retrieves relevant super-/sub-categories from a hierarchy for each target class; ii) it integrates these categories into hierarchy-aware sentences; iii) it fuses these sentence embeddings to generate the nexus classifier vector. Our evaluation on various detection benchmarks demonstrates that SHiNe enhances robustness across diverse vocabulary granularities achieving up to +31.9% mAP50 with ground truth hierarchies while retaining improvements using hierarchies generated by large language models. Moreover when applied to open-vocabulary classification on ImageNet-1k SHiNe improves the CLIP zero-shot baseline by +2.8% accuracy. SHiNe is training-free and can be seamlessly integrated with any off-the-shelf OvOD detector without incurring additional computational overhead during inference. The code is open source. + + + + Ranking Distillation for Open-Ended Video Question Answering with Insufficient Labels + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_Ranking_Distillation_for_Open-Ended_Video_Question_Answering_with_Insufficient_Labels_CVPR_2024_paper.pdf + This paper focuses on open-ended video question answering which aims to find the correct answers from a large answer set in response to a video-related question. This is essentially a multi-label classification task since a question may have multiple answers. However due to annotation costs the labels in existing benchmarks are always extremely insufficient typically one answer per question. As a result existing works tend to directly treat all the unlabeled answers as negative labels leading to limited ability for generalization. In this work we introduce a simple yet effective ranking distillation framework (RADI) to mitigate this problem without additional manual annotation. RADI employs a teacher model trained with incomplete labels to generate rankings for potential answers which contain rich knowledge about label priority as well as label-associated visual cues thereby enriching the insufficient labeling information. 
To avoid overconfidence in the imperfect teacher model we further present two robust and parameter-free ranking distillation approaches: a pairwise approach which introduces adaptive soft margins to dynamically refine the optimization constraints on various pairwise rankings and a listwise approach which adopts sampling-based partial listwise learning to resist the bias in teacher ranking. Extensive experiments on five popular benchmarks consistently show that both our pairwise and listwise RADIs outperform state-of-the-art methods. Further analysis demonstrates the effectiveness of our methods on the insufficient labeling problem. + + + + Depth-Aware Concealed Crop Detection in Dense Agricultural Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Depth-Aware_Concealed_Crop_Detection_in_Dense_Agricultural_Scenes_CVPR_2024_paper.pdf + Concealed Object Detection (COD) aims to identify objects visually embedded in their background. Existing COD datasets and methods predominantly focus on animals or humans ignoring the agricultural domain which often contains numerous small and concealed crops with severe occlusions. In this paper we introduce Concealed Crop Detection (CCD) which extends classic COD to agricultural domains. Experimental study shows that unimodal data provides insufficient information for CCD. To address this gap we first collect a large-scale RGB-D dataset ACOD-12K containing high-resolution crop images and depth maps. Then we propose a foundational framework named Recurrent Iterative Segmentation Network (RISNet). To tackle the challenge of dense objects we employ multi-scale receptive fields to capture objects of varying sizes thus enhancing the detection performance for dense objects. By fusing depth features our method can acquire spatial information about concealed objects to mitigate disturbances caused by intricate backgrounds and occlusions. Furthermore our model adopts a multi-stage iterative approach using predictions from each stage as gate attention to reinforce position information thereby improving the detection accuracy for small objects. Extensive experimental results demonstrate that our RISNet achieves new state-of-the-art performance on both newly proposed CCD and classic COD tasks. All resources will be available at https://github.com/Kki2Eve/RISNet. + + + + Solving the Catastrophic Forgetting Problem in Generalized Category Discovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Solving_the_Catastrophic_Forgetting_Problem_in_Generalized_Category_Discovery_CVPR_2024_paper.pdf + Generalized Category Discovery (GCD) aims to identify a mix of known and novel categories within unlabeled data sets providing a more realistic setting for image recognition. Essentially GCD needs to remember existing patterns thoroughly to recognize novel categories. Recent state-of-the-art method SimGCD transfers the knowledge from known-class data to the learning of novel classes through debiased learning. However some patterns are catastrophically forgot during adaptation and thus lead to poor performance in novel categories classification. To address this issue we propose a novel learning approach LegoGCD which is seamlessly integrated into previous methods to enhance the discrimination of novel classes while maintaining performance on previously encountered known classes. 
Specifically, we design two types of techniques, termed Local Entropy Regularization (LER) and Dual-views Kullback-Leibler divergence constraint (DKL). The LER optimizes the distribution of potential known class samples in unlabeled data, thus ensuring the preservation of knowledge related to known categories while learning novel classes. Meanwhile, DKL introduces Kullback-Leibler divergence to encourage the model to produce a similar prediction distribution for two view samples from the same image. In this way, it successfully avoids mismatched predictions and generates more reliable potential known class samples simultaneously. Extensive experiments validate that the proposed LegoGCD effectively addresses the known category forgetting issue across all datasets, e.g. delivering a 7.74% and 2.51% accuracy boost on known and novel classes in CUB, respectively. Our code is available at: https://github.com/Cliffia123/LegoGCD. + + + + Data-Efficient Unsupervised Interpolation Without Any Intermediate Frame for 4D Medical Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Data-Efficient_Unsupervised_Interpolation_Without_Any_Intermediate_Frame_for_4D_Medical_CVPR_2024_paper.pdf + 4D medical images, which represent 3D images with temporal information, are crucial in clinical practice for capturing dynamic changes and monitoring long-term disease progression. However, acquiring 4D medical images poses challenges due to factors such as radiation exposure and imaging duration, necessitating a balance between achieving high temporal resolution and minimizing adverse effects. Given these circumstances, not only is data acquisition challenging, but increasing the frame rate for each dataset also proves difficult. To address this challenge, this paper proposes a simple yet effective Unsupervised Volumetric Interpolation framework, UVI-Net. This framework facilitates temporal interpolation without the need for any intermediate frames, distinguishing it from the majority of other existing unsupervised methods. Experiments on benchmark datasets demonstrate significant improvements across diverse evaluation metrics compared to unsupervised and supervised baselines. Remarkably, our approach achieves this superior performance even when trained with a dataset as small as one, highlighting its exceptional robustness and efficiency in scenarios with sparse supervision. This positions UVI-Net as a compelling alternative for 4D medical imaging, particularly in settings where data availability is limited. The source code is available at https://github.com/jungeun122333/UVI-Net. + + + + Learning the 3D Fauna of the Web + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Learning_the_3D_Fauna_of_the_Web_CVPR_2024_paper.pdf + Learning 3D models of all animals in nature requires massively scaling up existing solutions. With this ultimate goal in mind, we develop 3D-Fauna, an approach that learns a pan-category deformable 3D animal model for more than 100 animal species jointly. One crucial bottleneck of modeling animals is the limited availability of training data, which we overcome by learning our model from 2D Internet images. We show that prior approaches, which are category-specific, fail to generalize to rare species with limited training images.
We address this challenge by introducing the Semantic Bank of Skinned Models (SBSM) which automatically discovers a small set of base animal shapes by combining geometric inductive priors with semantic knowledge implicitly captured by an off-the-shelf self-supervised feature extractor. To train such a model we also contribute a new large-scale dataset of diverse animal species. At inference time given a single image of any quadruped animal our model reconstructs an articulated 3D mesh in a feed-forward manner in seconds. + + + + LISA: Reasoning Segmentation via Large Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Lai_LISA_Reasoning_Segmentation_via_Large_Language_Model_CVPR_2024_paper.pdf + Although perception systems have made remarkable advancements in recent years they still rely on explicit human instruction or pre-defined categories to identify the target objects before executing visual recognition tasks. Such systems cannot actively reason and comprehend implicit user intention. In this work we propose a new segmentation task --- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore we establish a benchmark comprising over one thousand image-instruction-mask data samples incorporating intricate reasoning and world knowledge for evaluation purposes. Finally we present LISA: large Language Instructed Segmentation Assistant which inherits the language generation capabilities of multimodal Large Language Models (LLMs) while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably LISA can handle cases involving complex reasoning and world knowledge. Also it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition fine-tuning the model with merely 239 reasoning segmentation data samples results in further performance enhancement. Both quantitative and qualitative experiments show our method effectively unlocks new reasoning segmentation capabilities for multimodal LLMs. Code models and data are available at github.com/dvlab-research/LISA. + + + + Bridging the Gap: A Unified Video Comprehension Framework for Moment Retrieval and Highlight Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_Bridging_the_Gap_A_Unified_Video_Comprehension_Framework_for_Moment_CVPR_2024_paper.pdf + Video Moment Retrieval (MR) and Highlight Detection (HD) have attracted significant attention due to the growing demand for video analysis. Recent approaches treat MR and HD as similar video grounding problems and address them together with transformer-based architecture. However we observe that the emphasis of MR and HD differs with one necessitating the perception of local relationships and the other prioritizing the understanding of global contexts. Consequently the lack of task-specific design will inevitably lead to limitations in associating the intrinsic specialty of two tasks. To tackle the issue we propose a Unified Video COMprehension framework (UVCOM) to bridge the gap and jointly solve MR and HD effectively. By performing progressive integration on intra and inter-modality across multi-granularity UVCOM achieves the comprehensive understanding in processing a video. 
Moreover we present multi-aspect contrastive learning to consolidate the local relation modeling and global knowledge accumulation via well aligned multi-modal space. Extensive experiments on QVHighlights Charades-STA TACoS YouTube Highlights and TVSum datasets demonstrate the effectiveness and rationality of UVCOM which outperforms the state-of-the-art methods by a remarkable margin. + + + + MuseChat: A Conversational Music Recommendation System for Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_MuseChat_A_Conversational_Music_Recommendation_System_for_Videos_CVPR_2024_paper.pdf + Music recommendation for videos attracts growing interest in multi-modal research. However existing systems focus primarily on content compatibility often ignoring the users' preferences. Their inability to interact with users for further refinements or to provide explanations leads to a less satisfying experience. We address these issues with MuseChat a first-of-its-kind dialogue-based recommendation system that personalizes music suggestions for videos. Our system consists of two key functionalities with associated modules: recommendation and reasoning. The recommendation module takes a video along with optional information including previous suggested music and user's preference as inputs and retrieves an appropriate music matching the context. The reasoning module equipped with the power of Large Language Model (Vicuna-7B) and extended to multi-modal inputs is able to provide reasonable explanation for the recommended music. To evaluate the effectiveness of MuseChat we build a large-scale dataset conversational music recommendation for videos that simulates a two-turn interaction between a user and a recommender based on accurate music track information. Experiment results show that MuseChat achieves significant improvements over existing video-based music retrieval methods as well as offers strong interpretability and interactability. The dataset of this work is available at https://dongzhikang.github.io/musechat. + + + + Device-Wise Federated Network Pruning + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Device-Wise_Federated_Network_Pruning_CVPR_2024_paper.pdf + Neural network pruning particularly channel pruning is a widely used technique for compressing deep learning models to enable their deployment on edge devices with limited resources. Typically redundant weights or structures are removed to achieve the target resource budget. Although data-driven pruning approaches have proven to be more effective they cannot be directly applied to federated learning (FL) which has emerged as a popular technique in edge computing applications because of distributed and confidential datasets. In response to this challenge we design a new network pruning method for FL. We propose device-wise sub-networks for each device assuming that the data distribution is similar within each device. These sub-networks are generated through sub-network embeddings and a hypernetwork. To further minimize memory usage and communication costs we permanently prune the full model to remove weights that are not useful for all devices. During the FL process we simultaneously train the device-wise sub-networks and the base sub-network to facilitate the pruning process. We then finetune the pruned model with device-wise sub-networks to regain performance. Moreover we provided the theoretical guarantee of convergence for our method. 
Our method achieves better performance and resource trade-off than other well-established network pruning baselines as demonstrated through extensive experiments on CIFAR-10 CIFAR-100 and TinyImageNet. + + + + MoReVQA: Exploring Modular Reasoning Models for Video Question Answering + http://openaccess.thecvf.com//content/CVPR2024/papers/Min_MoReVQA_Exploring_Modular_Reasoning_Models_for_Video_Question_Answering_CVPR_2024_paper.pdf + This paper addresses the task of video question answering (videoQA) via a decomposed multi-stage modular reasoning framework. Previous modular methods have shown promise with a single planning stage ungrounded in visual content. However through a simple and effective baseline we find that such systems can lead to brittle behavior in practice for challenging videoQA settings. Thus unlike traditional single-stage planning methods we propose a multi-stage system consisting of an event parser a grounding stage and a final reasoning stage in conjunction with an external memory. All stages are training-free and performed using few-shot prompting of large models creating interpretable intermediate outputs at each stage. By decomposing the underlying planning and task complexity our method MoReVQA improves over prior work on standard videoQA benchmarks (NExT-QA iVQA EgoSchema and ActivityNet-QA) with state-of-the-art results and extensions to related tasks (grounded videoQA paragraph captioning). + + + + Low-Rank Rescaled Vision Transformer Fine-Tuning: A Residual Design Approach + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_Low-Rank_Rescaled_Vision_Transformer_Fine-Tuning_A_Residual_Design_Approach_CVPR_2024_paper.pdf + Parameter-efficient fine-tuning for pre-trained Vision Transformers aims to adeptly tailor a model to downstream tasks by learning a minimal set of new adaptation parameters while preserving the frozen majority of pre-trained parameters. Striking a balance between retaining the generalizable representation capacity of the pre-trained model and acquiring task-specific features poses a key challenge. Currently there is a lack of focus on guiding this delicate trade-off. In this study we approach the problem from the perspective of Singular Value Decomposition (SVD) of pre-trained parameter matrices providing insights into the tuning dynamics of existing methods. Building upon this understanding we propose a Residual-based Low-Rank Rescaling (RLRR) fine-tuning strategy. This strategy not only enhances flexibility in parameter tuning but also ensures that new parameters do not deviate excessively from the pre-trained model through a residual design. Extensive experiments demonstrate that our method achieves competitive performance across various downstream image classification tasks all while maintaining comparable new parameters. We believe this work takes a step forward in offering a unified perspective for interpreting existing methods and serves as motivation for the development of new approaches that move closer to effectively considering the crucial trade-off mentioned above. Our code is available at https://github.com/zstarN70/RLRR.git. + + + + Distribution-aware Knowledge Prototyping for Non-exemplar Lifelong Person Re-identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Distribution-aware_Knowledge_Prototyping_for_Non-exemplar_Lifelong_Person_Re-identification_CVPR_2024_paper.pdf + Lifelong person re-identification (LReID) suffers from the catastrophic forgetting problem when learning from non-stationary data. 
Existing exemplar-based and knowledge distillation-based LReID methods encounter data privacy and limited acquisition capacity respectively. In this paper we instead introduce the prototype which is under-investigated in LReID to better balance knowledge forgetting and acquisition. Existing prototype-based works primarily focus on the classification task where the prototypes are set as discrete points or statistical distributions. However they either discard the distribution information or omit instance-level diversity which are crucial fine-grained clues for LReID. To address the above problems we propose Distribution-aware Knowledge Prototyping (DKP) where the instance-level diversity of each sample is modeled to transfer comprehensive fine-grained knowledge for prototyping and facilitating LReID learning. Specifically an Instance-level Distribution Modeling network is proposed to capture the local diversity of each instance. Then the Distribution-oriented Prototype Generation algorithm transforms the instance-level diversity into identity-level distributions as prototypes which is further explored by the designed Prototype-based Knowledge Transfer module to enhance the knowledge anti-forgetting and acquisition capacity of the LReID model. Extensive experiments verify that our method achieves superior plasticity and stability balancing and outperforms existing LReID methods by 8.1%/9.1% average mAP/R@1 improvement. The code is available at https://github.com/zhoujiahuan1991/CVPR2024-DKP + + + + Generating Enhanced Negatives for Training Language-Based Object Detectors + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Generating_Enhanced_Negatives_for_Training_Language-Based_Object_Detectors_CVPR_2024_paper.pdf + The recent progress in language-based open-vocabulary object detection can be largely attributed to finding better ways of leveraging large-scale data with free-form text annotations. Training such models with a discriminative objective function has proven successful but requires good positive and negative samples. However the free-form nature and the open vocabulary of object descriptions make the space of negatives extremely large. Prior works randomly sample negatives or use rule-based techniques to build them. In contrast we propose to leverage the vast knowledge built into modern generative models to automatically build negatives that are more relevant to the original data. Specifically we use large-language-models to generate negative text descriptions and text-to-image diffusion models to also generate corresponding negative images. Our experimental analysis confirms the relevance of the generated negative data and its use in language-based detectors improves performance on two complex benchmarks. Code is available at https://github.com/xiaofeng94/Gen-Enhanced-Negs. + + + + FedAS: Bridging Inconsistency in Personalized Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_FedAS_Bridging_Inconsistency_in_Personalized_Federated_Learning_CVPR_2024_paper.pdf + Personalized Federated Learning (PFL) is primarily designed to provide customized models for each client to better fit the non-iid distributed client data which is a inherent challenge in Federated Learning. However current PFL methods suffer from inconsistencies in both intra-client and inter-client levels: 1) The intra-client inconsistency stems from the asynchronous update strategy for personalized and shared parameters. 
In PFL clients update their shared parameters to communicate and learn from others while keeping personalized parts unchanged leading to poor coordination between these two components. 2) The Inter-client inconsistency arises from "stragglers" - inactive clients that communicate and train with the server less frequently. This results in their under-trained personalized models and impedes the collaborative training stage for other clients. In this paper we present a novel PFL framework named FedAS which uses Federated Parameter-Alignment and Client-Synchronization to overcome above challenges. Initially we enhance the localization of global parameters by infusing them with local insights. We make the shared parts learn from previous model thereby increasing their local relevance and reducing the impact of parameter inconsistency. Furthermore we design a robust aggregation method to mitigate the impact of stragglers by preventing the incorporation of their under-trained knowledge into aggregated model. Experimental results on Cifar10 and Cifar100 validate the effectiveness of our FedAS in achieving better performance and robustness against data heterogeneity. + + + + MoST: Multi-Modality Scene Tokenization for Motion Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Mu_MoST_Multi-Modality_Scene_Tokenization_for_Motion_Prediction_CVPR_2024_paper.pdf + Many existing motion prediction approaches rely on symbolic perception outputs to generate agent trajectories such as bounding boxes road graph information and traffic lights. This symbolic representation is a high-level abstraction of the real world which may render the motion prediction model vulnerable to perception errors (e.g. failures in detecting open-vocabulary obstacles) while missing salient information from the scene context (e.g. poor road conditions). An alternative paradigm is end-to-end learning from raw sensors. However this approach suffers from the lack of interpretability and requires significantly more training resources. In this work we propose tokenizing the visual world into a compact set of scene elements and then leveraging pre-trained image foundation models and LiDAR neural networks to encode all the scene elements in an open-vocabulary manner. The image foundation model enables our scene tokens to encode the general knowledge of the open world while the LiDAR neural network encodes geometry information. Our proposed representation can efficiently encode the multi-frame multi-modality observations with a few hundred tokens and is compatible with most transformer-based architectures. To evaluate our method we have augmented Waymo Open Motion Dataset with camera embeddings. Experiments over Waymo Open Motion Dataset show that our approach leads to significant performance improvements over the state-of-the-art. + + + + PIGEON: Predicting Image Geolocations + http://openaccess.thecvf.com//content/CVPR2024/papers/Haas_PIGEON_Predicting_Image_Geolocations_CVPR_2024_paper.pdf + Planet-scale image geolocalization remains a challenging problem due to the diversity of images originating from anywhere in the world. Although approaches based on vision transformers have made significant progress in geolocalization accuracy success in prior literature is constrained to narrow distributions of images of landmarks and performance has not generalized to unseen places. We present a new geolocalization system that combines semantic geocell creation multi-task contrastive pretraining and a novel loss function. 
Additionally our work is the first to perform retrieval over location clusters for guess refinements. We train two models for evaluations on street-level data and general-purpose image geolocalization; the first model PIGEON is trained on data from the game of GeoGuessr and is capable of placing over 40% of its guesses within 25 kilometers of the target location globally. We also develop a bot and deploy PIGEON in a blind experiment against humans ranking in the top 0.01% of players. We further challenge one of the world's foremost professional GeoGuessr players to a series of six matches with millions of viewers winning all six games. Our second model PIGEOTTO differs in that it is trained on a dataset of images from Flickr and Wikipedia achieving state-of-the-art results on a wide range of image geolocalization benchmarks outperforming the previous SOTA by up to 7.7 percentage points on the city accuracy level and up to 38.8 percentage points on the country level. Our findings suggest that PIGEOTTO is the first image geolocalization model that effectively generalizes to unseen places and that our approach can pave the way for highly accurate planet-scale image geolocalization systems. Our code is available on GitHub. + + + + Flow-Guided Online Stereo Rectification for Wide Baseline Stereo + http://openaccess.thecvf.com//content/CVPR2024/papers/Kumar_Flow-Guided_Online_Stereo_Rectification_for_Wide_Baseline_Stereo_CVPR_2024_paper.pdf + Stereo rectification is widely considered "solved" due to the abundance of traditional approaches to perform rectification. However autonomous vehicles and robots in-the-wild require constant re-calibration due to exposure to various environmental factors including vibration and structural stress when cameras are arranged in a wide-baseline configuration. Conventional rectification methods fail in these challenging scenarios: especially for larger vehicles such as autonomous freight trucks and semi-trucks the resulting incorrect rectification severely affects the quality of downstream tasks that use stereo/multi-view data. To tackle these challenges we propose an online rectification approach that operates at real-time rates while achieving high accuracy. We propose a novel learning-based online calibration approach that utilizes stereo correlation volumes built from a feature representation obtained from cross-image attention. Our model is trained to minimize vertical optical flow as proxy rectification constraint and predicts the relative rotation between the stereo pair. The method is real-time and even outperforms conventional methods used for offline calibration and substantially improves downstream stereo depth post-rectification. We release two public datasets (https://light.princeton.edu/online-stereo-recification/) a synthetic and experimental wide baseline dataset to foster further research. + + + + Driving Everywhere with Large Language Model Policy Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Driving_Everywhere_with_Large_Language_Model_Policy_Adaptation_CVPR_2024_paper.pdf + Adapting driving behavior to new environments customs and laws is a long-standing problem in autonomous driving precluding the widespread deployment of autonomous vehicles (AVs). In this paper we present LLaDA a simple yet powerful tool that enables human drivers and autonomous vehicles alike to drive everywhere by adapting their tasks and motion plans to traffic rules in new locations. 
LLaDA achieves this by leveraging the impressive zero-shot generalizability of large language models (LLMs) in interpreting the traffic rules in the local driver handbook. Through an extensive user study we show that LLaDA's instructions are useful in disambiguating in-the-wild unexpected situations. We also demonstrate LLaDA's ability to adapt AV motion planning policies in real-world datasets; LLaDA outperforms baseline planning approaches on all our metrics. Please check our website for more details: https://boyiliee.github.io/llada. + + + + Koala: Key Frame-Conditioned Long Video-LLM http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_Koala_Key_Frame-Conditioned_Long_Video-LLM_CVPR_2024_paper.pdf Long video question answering is a challenging task that involves recognizing short-term activities and reasoning about their fine-grained relationships. State-of-the-art video Large Language Models (vLLMs) hold promise as a viable solution due to their demonstrated emergent capabilities on new tasks. However despite being trained on millions of short seconds-long videos vLLMs are unable to understand minutes-long videos and accurately answer questions about them. To address this limitation we propose a lightweight and self-supervised approach Key frame-conditioned long video-LLM (Koala) that introduces learnable spatiotemporal queries to adapt pretrained vLLMs for generalizing to longer videos. Our approach introduces two new tokenizers that condition on visual tokens computed from sparse video key frames for understanding short and long video moments. We train our proposed approach on HowTo100M and demonstrate its effectiveness on zero-shot long video understanding benchmarks where it outperforms state-of-the-art large models by 3 - 6% in absolute accuracy across all tasks. Surprisingly we also empirically show that our approach not only helps a pretrained vLLM to understand long videos but also improves its accuracy on short-term action recognition. + + + + HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models http://openaccess.thecvf.com//content/CVPR2024/papers/Guan_HallusionBench_An_Advanced_Diagnostic_Suite_for_Entangled_Language_Hallucination_and_CVPR_2024_paper.pdf We introduce "HallusionBench" a comprehensive benchmark designed for the evaluation of image-context reasoning. This benchmark presents significant challenges to advanced large visual-language models (LVLMs) such as GPT-4V(ision) Gemini Pro Vision Claude 3 and LLaVA-1.5 by emphasizing nuanced understanding and interpretation of visual data. The benchmark comprises 346 images paired with 1129 questions all meticulously crafted by human experts. We introduce a novel structure for these visual questions designed to establish control groups. This structure enables us to conduct a quantitative analysis of the models' response tendencies logical consistency and various failure modes. In our evaluation on HallusionBench we benchmarked 15 different models highlighting a 31.42% question-pair accuracy achieved by the state-of-the-art GPT-4V. Notably all other evaluated models achieve accuracy below 16%. Moreover our analysis not only highlights the observed failure modes including language hallucination and visual illusion but also deepens an understanding of these pitfalls. Our comprehensive case studies within HallusionBench shed light on the challenges of hallucination and illusion in LVLMs. 
Based on these insights we suggest potential pathways for their future improvement. The benchmark and codebase can be accessed at https://github.com/tianyilab/HallusionBench. + + + + + + Jack of All Tasks Master of Many: Designing General-Purpose Coarse-to-Fine Vision-Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Pramanick_Jack_of_All_Tasks_Master_of_Many_Designing_General-Purpose_Coarse-to-Fine_CVPR_2024_paper.pdf + The ability of large language models (LLMs) to process visual inputs has given rise to general-purpose vision systems unifying various vision-language (VL) tasks by instruction tuning. However due to the enormous diversity in input-output formats in the vision domain existing general-purpose models fail to successfully integrate segmentation and multi-image inputs with coarse-level tasks into a single framework. In this work we introduce VistaLLM a powerful visual system that addresses coarse- and fine grained VL tasks over single and multiple input images using a unified framework. VistaLLM utilizes an instruction-guided image tokenizer that filters global embeddings using task descriptions to extract compressed and refined features from numerous images. Moreover VistaLLM employs a gradient-aware adaptive sampling technique to represent binary segmentation masks as sequences significantly improving over previously used uniform sampling. To bolster the desired capability of VistaLLM we curate CoinIt a comprehensive coarse-to-fine instruction tuning dataset with 6.8M samples. We also address the lack of multi-image grounding datasets by introducing a novel task AttCoSeg (Attribute-level Co Segmentation) which boosts the model's reasoning and grounding capability over multiple input images. Extensive experiments on a wide range of V- and VL tasks demonstrate the effectiveness of VistaLLM by achieving consistent state-of-the-art performance over strong baselines across many downstream tasks. Our project page can be found at https://shramanpramanick.github.io/VistaLLM/ + + + + SchurVINS: Schur Complement-Based Lightweight Visual Inertial Navigation System + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_SchurVINS_Schur_Complement-Based_Lightweight_Visual_Inertial_Navigation_System_CVPR_2024_paper.pdf + Accuracy and computational efficiency are the most important metrics to Visual Inertial Navigation System (VINS). The existing VINS algorithms with either high accuracy or low computational complexity are difficult to provide the high precision localization in resource-constrained devices. To this end we propose a novel filter-based VINS framework named SchurVINS (SV) which could guarantee both high accuracy by building a complete residual model and low computational complexity with Schur complement. Technically we first formulate the full residual model where Gradient Hessian and observation covariance are explicitly modeled. Then Schur complement is employed to decompose the full model into ego-motion residual model and landmark residual model. Finally Extended Kalman Filter (EKF) update is implemented in these two models with high efficiency. Experiments on EuRoC and TUM-VI datasets show that our method notably outperforms state-of-the-art (SOTA) methods in both accuracy and computational complexity. The experimental code of SchurVINS is available at https://github.com/bytedance/SchurVINS. 
+ + + + ViP-LLaVA: Making Large Multimodal Models Understand Arbitrary Visual Prompts + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_ViP-LLaVA_Making_Large_Multimodal_Models_Understand_Arbitrary_Visual_Prompts_CVPR_2024_paper.pdf + While existing large vision-language multimodal models focus on whole image understanding there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge we introduce a novel multimodal model capable of decoding arbitrary (free-form) visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow'". Our simple design directly overlays visual markers onto the RGB image eliminating the need for complex region encodings yet achieves state-of-the-art performance on region-understanding tasks like Visual7W PointQA and Visual Commonsense Reasoning benchmark. Furthermore we present ViP-Bench a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions enabling future research in this domain. Code data and model are publicly available. + + + + OVER-NAV: Elevating Iterative Vision-and-Language Navigation with Open-Vocabulary Detection and StructurEd Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_OVER-NAV_Elevating_Iterative_Vision-and-Language_Navigation_with_Open-Vocabulary_Detection_and_StructurEd_CVPR_2024_paper.pdf + Recent advances in Iterative Vision-and-Language Navigation(IVLN) introduce a more meaningful and practical paradigm of VLN by maintaining the agent's memory across tours of scenes. Although the long-term memory aligns better with the persistent nature of the VLN task it poses more challenges on how to utilize the highly unstructured navigation memory with extremely sparse supervision. Towards this end we propose OVER-NAV which aims to go over and beyond the current arts of IVLN techniques. In particular we propose to incorporate LLMs and open-vocabulary detectors to distill key information and establish correspondence between multi-modal signals. Such a mechanism introduces reliable cross-modal supervision and enables on-the-fly generalization to unseen scenes without the need of extra annotation and re-training. To fully exploit the interpreted navigation data we further introduce a structured representation coded Omnigraph to effectively integrate multi-modal information along the tour. Accompanied with a novel omnigraph fusion mechanism OVER-NAV is able to extract the most relevant knowledge from omnigraph for a more accurate navigating action. In addition OVER-NAV seamlessly supports both discrete and continuous environments under a unified framework. We demonstrate the superiority of OVER-NAV in extensive experiments. + + + + All Rivers Run to the Sea: Private Learning with Asymmetric Flows + http://openaccess.thecvf.com//content/CVPR2024/papers/Niu_All_Rivers_Run_to_the_Sea_Private_Learning_with_Asymmetric_CVPR_2024_paper.pdf + Data privacy is of great concern in cloud machine-learning service platforms when sensitive data are exposed to service providers. While private computing environments (e.g. secure enclaves) and cryptographic approaches (e.g. homomorphic encryption) provide strong privacy protection their computing performance still falls short compared to cloud GPUs. 
To achieve privacy protection with high computing performance we propose Delta a new private training and inference framework with comparable model performance as non-private centralized training. Delta features two asymmetric data flows: the main information-sensitive flow and the residual flow. The main part flows into a small model while the residuals are offloaded to a large model. Specifically Delta embeds the information-sensitive representations into a low-dimensional space while pushing the information-insensitive part into high-dimension residuals. To ensure privacy protection the low-dimensional information-sensitive part is secured and fed to a small model in a private environment. On the other hand the residual part is sent to fast cloud GPUs and processed by a large model. To further enhance privacy and reduce the communication cost Delta applies a random binary quantization technique along with a DP-based technique to the residuals before sharing them with the public platform. We theoretically show that Delta guarantees differential privacy in the public environment and greatly reduces the complexity in the private environment. We conduct empirical analyses on CIFAR-10 CIFAR-100 and ImageNet datasets and ResNet-18 and ResNet-34 showing that Delta achieves strong privacy protection fast training and inference without significantly compromising the model utility. + + + + HandBooster: Boosting 3D Hand-Mesh Reconstruction by Conditional Synthesis and Sampling of Hand-Object Interactions + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_HandBooster_Boosting_3D_Hand-Mesh_Reconstruction_by_Conditional_Synthesis_and_Sampling_CVPR_2024_paper.pdf + Reconstructing 3D hand mesh robustly from a single image is very challenging due to the lack of diversity in existing real-world datasets. While data synthesis helps relieve the issue the syn-to-real gap still hinders its usage. In this work we present HandBooster a new approach to uplift the data diversity and boost the 3D hand-mesh reconstruction performance by training a conditional generative space on hand-object interactions and purposely sampling the space to synthesize effective data samples. First we construct versatile content-aware conditions to guide a diffusion model to produce realistic images with diverse hand appearances poses views and backgrounds; favorably accurate 3D annotations are obtained for free. Then we design a novel condition creator based on our similarity-aware distribution sampling strategies to deliberately find novel and realistic interaction poses that are distinctive from the training set. Equipped with our method several baselines can be significantly improved beyond the SOTA on the HO3D and DexYCB benchmarks. Our code will be released on https://github.com/hxwork/HandBooster_Pytorch. + + + + A-Teacher: Asymmetric Network for 3D Semi-Supervised Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_A-Teacher_Asymmetric_Network_for_3D_Semi-Supervised_Object_Detection_CVPR_2024_paper.pdf + This work proposes the first online asymmetric semi-supervised framework namely A-Teacher for LiDAR-based 3D object detection. Our motivation stems from the observation that 1) existing symmetric teacher-student methods for semi-supervised 3D object detection have characterized simplicity but impede the distillation performance between teacher and student because of the demand for an identical model structure and input data format. 
2) The offline asymmetric methods with a complex teacher model constructed differently can generate more precise pseudo-labels but it is challenging to jointly optimize the teacher and student model. Consequently in this paper we devise a different path from the conventional paradigm which can harness the capacity of a strong teacher while preserving the advantages of online teacher model updates. The essence is the proposed attention-based refinement model that can be seamlessly integrated into a vanilla teacher. The refinement model works in the divide-and-conquer manner that respectively handles three challenging scenarios including 1) objects detected in the current timestamp but with suboptimal box quality 2) objects are missed in the current timestamp but are detected in past or future frames 3) objects are neglected in all frames. It is worth noting that even while tackling these complex cases our model retains the efficiency of the online teacher-student semi-supervised framework. Experimental results on Waymo show that our method outperforms previous state-of-the-art HSSDA by 4.7 on mAP (L1) while consuming fewer training resources. + + + + Visual Objectification in Films: Towards a New AI Task for Video Interpretation http://openaccess.thecvf.com//content/CVPR2024/papers/Tores_Visual_Objectification_in_Films_Towards_a_New_AI_Task_for_CVPR_2024_paper.pdf In film gender studies the concept of "male gaze" refers to the way the characters are portrayed on-screen as objects of desire rather than subjects. In this article we introduce a novel video-interpretation task to detect character objectification in films. The purpose is to reveal and quantify the usage of complex temporal patterns operated in cinema to produce the cognitive perception of objectification. We introduce the ObyGaze12 dataset made of 1914 movie clips densely annotated by experts for objectification concepts identified in film studies and psychology. We evaluate recent vision models show the feasibility of the task and where the challenges remain with concept bottleneck models. Our new dataset and code are made available to the community. + + + + BiTT: Bi-directional Texture Reconstruction of Interacting Two Hands from a Single Image http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_BiTT_Bi-directional_Texture_Reconstruction_of_Interacting_Two_Hands_from_a_CVPR_2024_paper.pdf Creating personalized hand avatars is important to offer a realistic experience to users on AR / VR platforms. While most prior studies focused on reconstructing 3D hand shapes some recent work has tackled the reconstruction of hand textures on top of shapes. However these methods are often limited to capturing pixels on the visible side of a hand requiring diverse views of the hand in a video or multiple images as input. In this paper we propose a novel method BiTT (Bi-directional Texture reconstruction of Two hands) which is the first end-to-end trainable method for relightable pose-free texture reconstruction of two interacting hands taking only a single RGB image by three novel components: 1) bi-directional (left ↔ right) texture reconstruction using the texture symmetry of left / right hands 2) utilizing a texture parametric model for hand texture recovery and 3) the overall coarse-to-fine stage pipeline for reconstructing personalized texture of two interacting hands. 
BiTT first estimates the scene light condition and albedo image from an input image then reconstructs the texture of both hands through the texture parametric model and bi-directional texture reconstructor. In experiments using InterHand2.6M and RGB2Hands datasets our method significantly outperforms state-of-the-art hand texture reconstruction methods quantitatively and qualitatively. The code is available at https://github.com/yunminjin2/BiTT. + + + + Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs http://openaccess.thecvf.com//content/CVPR2024/papers/Ranasinghe_Learning_to_Localize_Objects_Improves_Spatial_Reasoning_in_Visual-LLMs_CVPR_2024_paper.pdf Integration of Large Language Models (LLMs) into visual domain tasks resulting in visual-LLMs (V-LLMs) has enabled exceptional performance in vision-language tasks particularly for visual question answering (VQA). However existing V-LLMs (e.g. BLIP-2 LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers these models fail at simple tasks like distinguishing a left vs right location. In this work we explore how image-space coordinate based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations data-efficient instruction fine-tuning objectives and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally our resulting model improves VQA across image and video domains reduces undesired hallucination and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework. + + + + Self-Distilled Masked Auto-Encoders are Efficient Video Anomaly Detectors http://openaccess.thecvf.com//content/CVPR2024/papers/Ristea_Self-Distilled_Masked_Auto-Encoders_are_Efficient_Video_Anomaly_Detectors_CVPR_2024_paper.pdf We propose an efficient abnormal event detection model based on a lightweight masked auto-encoder (AE) applied at the video frame level. The novelty of the proposed model is threefold. First we introduce an approach to weight tokens based on motion gradients thus shifting the focus from the static background scene to the foreground objects. Second we integrate a teacher decoder and a student decoder into our architecture leveraging the discrepancy between the outputs given by the two decoders to improve anomaly detection. Third we generate synthetic abnormal events to augment the training videos and task the masked AE model to jointly reconstruct the original frames (without anomalies) and the corresponding pixel-level anomaly maps. Our design leads to an efficient and effective model as demonstrated by the extensive experiments carried out on four benchmarks: Avenue ShanghaiTech UBnormal and UCSD Ped2. The empirical results show that our model achieves an excellent trade-off between speed and accuracy obtaining competitive AUC scores while processing 1655 FPS. Hence our model is between 8 and 70 times faster than competing methods. We also conduct an ablation study to justify our design. Our code is freely available at: https://github.com/ristea/aed-mae. 
+ + + + Distilling Vision-Language Models on Millions of Videos http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Distilling_Vision-Language_Models_on_Millions_of_Videos_CVPR_2024_paper.pdf The recent advance in vision-language models is largely attributed to the abundance of image-text data. We aim to replicate this success for video-language models but there simply is not enough human-curated video-text data available. We thus resort to fine-tuning a video-language model from a strong image-language baseline with synthesized instructional data. The resulting video model by video-instruction-tuning (VIIT) is then used to auto-label millions of videos to generate high-quality captions. We show the adapted video-language model performs well on a wide range of video-language benchmarks. For instance it surpasses the best prior result on open-ended NExT-QA by 2.8%. Besides our model generates detailed descriptions for previously unseen videos which provide better textual supervision than existing methods. Experiments show that a video-language dual-encoder model contrastively trained on these auto-generated captions is 3.8% better than the strongest baseline that also leverages vision-language models. Our best model outperforms state-of-the-art methods on MSR-VTT zero-shot text-to-video retrieval by 6%. As a side product we generate the largest video caption dataset to date. + + + + Generalized Predictive Model for Autonomous Driving http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Generalized_Predictive_Model_for_Autonomous_Driving_CVPR_2024_paper.pdf In this paper we introduce the first large-scale video prediction model in the autonomous driving discipline. To eliminate the restriction of high-cost data collection and empower the generalization ability of our model we acquire massive data from the web and pair it with diverse and high-quality text descriptions. The resultant dataset accumulates over 2000 hours of driving videos spanning areas all over the world with diverse weather conditions and traffic scenarios. Inheriting the merits from recent latent diffusion models our model dubbed GenAD handles the challenging dynamics in driving scenes with novel temporal reasoning blocks. We showcase that it can generalize to various unseen driving datasets in a zero-shot manner surpassing general or driving-specific video prediction counterparts. Furthermore GenAD can be adapted into an action-conditioned prediction model or a motion planner holding great potential for real-world driving applications. + + + + FACT: Frame-Action Cross-Attention Temporal Modeling for Efficient Action Segmentation http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_FACT_Frame-Action_Cross-Attention_Temporal_Modeling_for_Efficient_Action_Segmentation_CVPR_2024_paper.pdf We study supervised action segmentation whose goal is to predict framewise action labels of a video. To capture temporal dependencies over long horizons prior works either improve framewise features with transformer or refine framewise predictions with learned action features. However they are computationally costly and ignore that frame and action features contain complementary information which can be leveraged to enhance both features and improve temporal modeling. 
Therefore we propose an efficient Frame-Action Cross-attention Temporal modeling (FACT) framework that performs temporal modeling with frame and action features in parallel and leverages this parallelism to achieve iterative bidirectional information transfer between the features and refine them. FACT network contains (i) a frame branch to learn frame-level information with convolutions and frame features (ii) an action branch to learn action-level dependencies with transformers and action tokens and (iii) cross-attentions to allow communication between the two branches. We also propose a new matching loss to ensure each action token uniquely encodes an action segment thus better captures its semantics. Thanks to our architecture we can also leverage textual transcripts of videos to help action segmentation. We evaluate FACT on four video datasets (two egocentric and two third-person) for action segmentation with and without transcripts showing that it significantly improves the state-of-the-art accuracy while enjoying lower computational cost (3 times faster) than existing transformer-based methods. + + + + Test-Time Zero-Shot Temporal Action Localization http://openaccess.thecvf.com//content/CVPR2024/papers/Liberatori_Test-Time_Zero-Shot_Temporal_Action_Localization_CVPR_2024_paper.pdf Zero-Shot Temporal Action Localization (ZS-TAL) seeks to identify and locate actions in untrimmed videos unseen during training. Existing ZS-TAL methods involve fine-tuning a model on a large amount of annotated training data. While effective training-based ZS-TAL approaches assume the availability of labeled data for supervised learning which can be impractical in some applications. Furthermore the training process naturally induces a domain bias into the learned model which may adversely affect the model's generalization ability to arbitrary videos. These considerations prompt us to approach the ZS-TAL problem from a radically novel perspective relaxing the requirement for training data. To this aim we introduce a novel method that performs Test-Time adaptation for Temporal Action Localization (T3AL). In a nutshell T3AL adapts a pre-trained Vision and Language Model (VLM). T3AL operates in three steps. First a video-level pseudo-label of the action category is computed by aggregating information from the entire video. Then action localization is performed adopting a novel procedure inspired by self-supervised learning. Finally frame-level textual descriptions extracted with a state-of-the-art captioning model are employed for refining the action region proposals. We validate the effectiveness of T3AL by conducting experiments on the THUMOS14 and the ActivityNet-v1.3 datasets. Our results demonstrate that T3AL significantly outperforms zero-shot baselines based on state-of-the-art VLMs confirming the benefit of a test-time adaptation approach. + + + + AM-RADIO: Agglomerative Vision Foundation Model Reduce All Domains Into One http://openaccess.thecvf.com//content/CVPR2024/papers/Ranzinger_AM-RADIO_Agglomerative_Vision_Foundation_Model_Reduce_All_Domains_Into_One_CVPR_2024_paper.pdf A handful of visual foundation models (VFMs) have recently emerged as the backbones for numerous downstream tasks. VFMs like CLIP DINOv2 SAM are trained with distinct objectives exhibiting unique characteristics for various downstream tasks. We find that despite their conceptual differences these models can be effectively merged into a unified model through multi-teacher distillation. 
We name this approach AM-RADIO (Agglomerative Model -- Reduce All Domains Into One). This integrative approach not only surpasses the performance of individual teacher models but also amalgamates their distinctive features such as zero-shot vision-language comprehension detailed pixel-level understanding and open vocabulary segmentation capabilities. Additionally in pursuit of the most hardware-efficient backbone we evaluated numerous architectures in our multi-teacher distillation pipeline using the same training recipe. This led to the development of a novel architecture (E-RADIO) that exceeds the performance of its predecessors and is at least 6x faster than the teacher models at matched resolution. Our comprehensive benchmarking process covers downstream tasks including ImageNet classification semantic segmentation linear probing COCO object detection and integration into LLaVa-1.5. + + + + FastMAC: Stochastic Spectral Sampling of Correspondence Graph + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_FastMAC_Stochastic_Spectral_Sampling_of_Correspondence_Graph_CVPR_2024_paper.pdf + 3D correspondence i.e. a pair of 3D points is a fundamental concept in computer vision. A set of 3D correspondences when equipped with compatibility edges forms a correspondence graph. This graph is a critical component in several state-of-the-art 3D point cloud registration approaches e.g. the one based on maximal cliques (MAC). However its properties have not been well understood. So we present the first study that introduces graph signal processing into the domain of correspondence graph. We exploit the generalized degree signal on correspondence graph and pursue sampling strategies that preserve high-frequency components of this signal. To address time-consuming singular value decomposition in deterministic sampling we resort to a stochastic approximate sampling strategy. As such the core of our method is the stochastic spectral sampling of correspondence graph. As an application we build a complete 3D registration algorithm termed as FastMAC that reaches real-time speed while leading to little to none performance drop. Through extensive experiments we validate that FastMAC works for both indoor and outdoor benchmarks. For example FastMAC can accelerate MAC by 80 times while maintaining high registration success rate on KITTI. Codes are publicly available at https://github.com/Forrest-110/FastMAC. + + + + FedSOL: Stabilized Orthogonal Learning with Proximal Restrictions in Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_FedSOL_Stabilized_Orthogonal_Learning_with_Proximal_Restrictions_in_Federated_Learning_CVPR_2024_paper.pdf + Federated Learning (FL) aggregates locally trained models from individual clients to construct a global model. While FL enables learning a model with data privacy it often suffers from significant performance degradation when clients have heterogeneous data distributions. This data heterogeneity causes the model to forget the global knowledge acquired from previously sampled clients after being trained on local datasets. Although the introduction of proximal objectives in local updates helps to preserve global knowledge it can also hinder local learning by interfering with local objectives. Inspired by Continual Learning (CL) we adopt an orthogonal learning strategy to balance these two conflicting objectives. However we observe that directly negating the proximal gradient in the local gradient significantly undermines local learning. 
To address the problem we propose a novel method Federated Stabilized Orthogonal Learning (FedSOL). FedSOL is designed to identify gradients of local objectives that are inherently orthogonal to directions affecting the proximal objective. Specifically FedSOL targets parameter regions where learning on the local objective is minimally influenced by proximal weight perturbations. Our experiments demonstrate that FedSOL consistently achieves state-of-the-art performance across various scenarios. + + + + A Category Agnostic Model for Visual Rearrangment + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_A_Category_Agnostic_Model_for_Visual_Rearrangment_CVPR_2024_paper.pdf + This paper presents a novel category agnostic model for visual rearrangement task which can help an embodied agent to physically recover the shuffled scene configuration without any category concepts to the goal configuration. Previous methods usually follow a similar architecture completing the rearrangement task by aligning the scene changes of the goal and shuffled configuration according to the semantic scene graphs. However constructing scene graphs requires the inference of category labels which not only causes the accuracy drop of the entire task but also limits the application in real world scenario. In this paper we delve deep into the essence of visual rearrangement task and focus on the two most essential issues scene change detection and scene change matching. We utilize the movement and the protrusion of point cloud to accurately identify the scene changes and match these changes depending on the similarity of category agnostic appearance feature. Moreover to assist the agent to explore the environment more efficiently and comprehensively we propose a closer-aligned-retrace exploration policy aiming to observe more details of the scene at a closer distance. We conduct extensive experiments on AI2THOR Rearrangement Challenge based on RoomR dataset and a new multi-room multi-instance dataset MrMiR collected by us. The experimental results demonstrate the effectiveness of our proposed method. + + + + Representing Part-Whole Hierarchies in Foundation Models by Learning Localizability Composability and Decomposability from Anatomy via Self Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Taher_Representing_Part-Whole_Hierarchies_in_Foundation_Models_by_Learning_Localizability_Composability_CVPR_2024_paper.pdf + Humans effortlessly interpret images by parsing them into part-whole hierarchies; deep learning excels in learning multi-level feature spaces but they often lack explicit coding of part-whole relations a prominent property of medical imaging. To overcome this limitation we introduce Adam-v2 a new self-supervised learning framework extending Adam [68] by explicitly incorporating part-whole hierarchies into its learning objectives through three key branches: (1) Localizability acquiring discriminative representations to distinguish different anatomical patterns; (2) Composability learning each anatomical structure in a parts-to-whole manner; and (3) Decomposability comprehending each anatomical structure in a whole-to-parts manner. Experimental results across 10 tasks compared to 11 baselines in zero-shot few-shot transfer and full fine-tuning settings showcase Adam-v2's superior performance over large-scale medical models and existing SSL methods across diverse downstream tasks. 
The higher generality and robustness of Adam-v2's representations originate from its explicit construction of hierarchies for distinct anatomical structures from unlabeled medical images. Adam-v2 preserves a semantic balance of anatomical diversity and harmony in its embedding yielding representations that are both generic and semantically meaningful yet overlooked in existing SSL methods. All code and pretrained models are available at GitHub.com/JLiangLab/Eden. + + + + Efficient Test-Time Adaptation of Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Karmanov_Efficient_Test-Time_Adaptation_of_Vision-Language_Models_CVPR_2024_paper.pdf + Test-time adaptation with pre-trained vision-language models has attracted increasing attention for tackling distribution shifts during the test time. Though prior studies have achieved very promising performance they involve intensive computation which is severely unaligned with test-time adaptation. We design TDA a training-free dynamic adapter that enables effective and efficient test-time adaptation with vision-language models. TDA works with a lightweight key-value cache that maintains a dynamic queue with few-shot pseudo labels as values and the corresponding test-sample features as keys. Leveraging the key-value cache TDA allows adapting to test data gradually via progressive pseudo label refinement which is super-efficient without incurring any backpropagation. In addition we introduce negative pseudo labeling that alleviates the adverse impact of pseudo label noises by assigning pseudo labels to certain negative classes when the model is uncertain about its pseudo label predictions. Extensive experiments over two benchmarks demonstrate TDA's superior effectiveness and efficiency as compared with the state-of-the-art. The code has been released in https://kdiaaa.github.io/tda/. + + + + Eyes Wide Shut? Exploring the Visual Shortcomings of Multimodal LLMs + http://openaccess.thecvf.com//content/CVPR2024/papers/Tong_Eyes_Wide_Shut_Exploring_the_Visual_Shortcomings_of_Multimodal_LLMs_CVPR_2024_paper.pdf + Is vision good enough for language? Recent advancements in multimodal models primarily stem from the powerful reasoning abilities of large language models (LLMs). However the visual component typically depends only on the instance-level contrastive language-image pre-training (CLIP). Our research reveals that the visual capabilities in recent MultiModal LLMs (MLLMs) still exhibit systematic shortcomings. To understand the roots of these errors we explore the gap between the visual embedding space of CLIP and vision-only self-supervised learning. We identify "CLIP-blind pairs" - images that CLIP perceives as similar despite their clear visual differences. With these pairs we construct the Multimodal Visual Patterns (MMVP) benchmark. MMVP exposes areas where state-of-the-art systems including GPT-4V struggle with straightforward questions across nine basic visual patterns often providing incorrect answers and hallucinated explanations. We further evaluate various CLIP-based vision-and-language models and found a notable correlation between visual patterns that challenge CLIP models and those problematic for multimodal LLMs. As an initial effort to address these issues we propose a Mixture of Features (MoF) approach demonstrating that integrating vision self-supervised learning features with MLLMs can significantly enhance their visual grounding capabilities. 
Together our research suggests visual representation learning remains an open challenge and accurate visual grounding is crucial for future successful multimodal systems. + + + + Mind Marginal Non-Crack Regions: Clustering-Inspired Representation Learning for Crack Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Mind_Marginal_Non-Crack_Regions_Clustering-Inspired_Representation_Learning_for_Crack_Segmentation_CVPR_2024_paper.pdf + Crack segmentation datasets make great efforts to obtain the ground truth crack or non-crack labels as clearly as possible. However it can be observed that ambiguities are still inevitable when considering the marginal non-crack region due to low contrast and heterogeneous texture. To solve this problem we propose a novel clustering-inspired representation learning framework which contains a two-phase strategy for automatic crack segmentation. In the first phase a pre-process is proposed to localize the marginal non-crack region. Then we propose an ambiguity-aware segmentation loss (Aseg Loss) that enables crack segmentation models to capture ambiguities in the above regions via learning segmentation variance which allows us to further localize ambiguous regions. In the second phase to learn the discriminative features of the above regions we propose a clustering-inspired loss (CI Loss) that alters the supervision learning of these regions into an unsupervised clustering manner. We demonstrate that the proposed method could surpass the existing crack segmentation models on various datasets and our constructed CrackSeg5k dataset. + + + + RegionGPT: Towards Region Understanding Vision Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_RegionGPT_Towards_Region_Understanding_Vision_Language_Model_CVPR_2024_paper.pdf + Vision language models (VLMs) have experienced rapid advancements through the integration of large language models (LLMs) with image-text pairs yet they struggle with detailed regional visual understanding due to limited spatial awareness of the vision encoder and the use of coarse-grained training data that lacks detailed region-specific captions. To address this we introduce RegionGPT (short as RGPT) a novel framework designed for complex region-level captioning and understanding. RGPT enhances the spatial awareness of regional representation with simple yet effective modifications to existing visual encoders in VLMs. We further improve performance on tasks requiring a specific output scope by integrating task-guided instruction prompts during both training and inference phases while maintaining the model's versatility for general-purpose tasks. Additionally we develop an automated region caption data generation pipeline enriching the training set with detailed region-level captions. We demonstrate that a universal RGPT model can be effectively applied and significantly enhancing performance across a range of region-level tasks including but not limited to complex region descriptions reasoning object classification and referring expressions comprehension. + + + + Error Detection in Egocentric Procedural Task Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Error_Detection_in_Egocentric_Procedural_Task_Videos_CVPR_2024_paper.pdf + We present a new egocentric procedural error dataset containing videos with various types of errors as well as normal videos and propose a new framework for procedural error detection using error-free training videos only. 
Our framework consists of an action segmentation model and a contrastive step prototype learning module to segment actions and learn useful features for error detection. Based on the observation that interactions between hands and objects often inform action and error understanding we propose to combine holistic frame features with relations features which we learn by building a graph using active object detection followed by a Graph Convolutional Network. To handle errors unseen during training we use our contrastive step prototype learning to learn multiple prototypes for each step capturing variations of error-free step executions. At inference time we use feature-prototype similarities for error detection. By experiments on three datasets we show that our proposed framework outperforms state-of-the-art video anomaly detection methods for error detection and provides smooth action and error predictions. + + + + Uncertainty-Guided Never-Ending Learning to Drive + http://openaccess.thecvf.com//content/CVPR2024/papers/Lai_Uncertainty-Guided_Never-Ending_Learning_to_Drive_CVPR_2024_paper.pdf + We present a highly scalable self-training framework for incrementally adapting vision-based end-to-end autonomous driving policies in a semi-supervised manner i.e. over a continual stream of incoming video data. To facilitate large-scale model training (e.g. open web or unlabeled data) we do not assume access to ground-truth labels and instead estimate pseudo-label policy targets for each video. Our framework comprises three key components: knowledge distillation a sample purification module and an exploration and knowledge retention mechanism. First given sequential image frames we pseudo-label the data and estimate uncertainty using an ensemble of inverse dynamics models. The uncertainty is used to select the most informative samples to add to an experience replay buffer. We specifically select high-uncertainty pseudo-labels to facilitate the exploration and learning of new and diverse driving skills. However in contrast to prior work in continual learning that assumes ground-truth labeled samples the uncertain pseudo-labels can introduce significant noise. Thus we also pair the exploration with a label refinement module which makes use of consistency constraints to re-label the noisy exploratory samples and effectively learn from diverse data. Trained as a complete never-ending learning system we demonstrate state-of-the-art performance on training from domain-changing data as well as millions of images from the open web. + + + + FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Cazenavette_FakeInversion_Learning_to_Detect_Images_from_Unseen_Text-to-Image_Models_by_CVPR_2024_paper.pdf + Due to the high potential for abuse of GenAI systems the task of detecting synthetic images has recently become of great interest to the research community. Unfortunately existing image space detectors quickly become obsolete as new high-fidelity text-to-image models are developed at blinding speed. In this work we propose a new synthetic image detector that uses features obtained by inverting an open-source pre-trained Stable Diffusion model. We show that these inversion features enable our detector to generalize well to unseen generators of high visual fidelity (e.g. DALL*E 3) even when the detector is trained only on lower fidelity fake images generated via Stable Diffusion. 
This detector achieves new state-of-the-art across multiple training and evaluation setups. Moreover we introduce a new challenging evaluation protocol that uses reverse image search to mitigate stylistic and thematic biases in the detector evaluation. We show that the resulting evaluation scores align well with detectors' in-the-wild performance and release these datasets as public benchmarks for future research. + + + + Attribute-Guided Pedestrian Retrieval: Bridging Person Re-ID with Internal Attribute Variability + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Attribute-Guided_Pedestrian_Retrieval_Bridging_Person_Re-ID_with_Internal_Attribute_Variability_CVPR_2024_paper.pdf + In various domains such as surveillance and smart retail pedestrian retrieval centering on person re-identification (Re-ID) plays a pivotal role. Existing Re-ID methodologies often overlook subtle internal attribute variations which are crucial for accurately identifying individuals with changing appearances. In response our paper introduces the Attribute-Guided Pedestrian Retrieval (AGPR) task focusing on integrating specified attributes with query images to refine retrieval results. Although there has been progress in attribute-driven image retrieval there remains a notable gap in effectively blending robust Re-ID models with intra-class attribute variations. To bridge this gap we present the Attribute-Guided Transformer-based Pedestrian Retrieval (ATPR) framework. ATPR adeptly merges global ID recognition with local attribute learning ensuring a cohesive linkage between the two. Furthermore to effectively handle the complexity of attribute interconnectivity ATPR organizes attributes into distinct groups and applies both inter-group correlation and intra-group decorrelation regularizations. Our extensive experiments on a newly established benchmark using the RAP dataset demonstrate the effectiveness of ATPR within the AGPR paradigm. + + + + Text Is MASS: Modeling as Stochastic Embedding for Text-Video Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Text_Is_MASS_Modeling_as_Stochastic_Embedding_for_Text-Video_Retrieval_CVPR_2024_paper.pdf + The increasing prevalence of video clips has sparked growing interest in text-video retrieval. Recent advances focus on establishing a joint embedding space for text and video relying on consistent embedding representations to compute similarity. However the text content in existing datasets is generally short and concise making it hard to fully describe the redundant semantics of a video. Correspondingly a single text embedding may be less expressive to capture the video embedding and empower the retrieval. In this study we propose a new stochastic text modeling method T-MASS i.e. text is modeled as a stochastic embedding to enrich text embedding with a flexible and resilient semantic range yielding a text mass. To be specific we introduce a similarity-aware radius module to adapt the scale of the text mass upon the given text-video pairs. Plus we design and develop a support text regularization to further control the text mass during the training. The inference pipeline is also tailored to fully exploit the text mass for accurate retrieval. Empirical evidence suggests that T-MASS not only effectively attracts relevant text-video pairs while distancing irrelevant ones but also enables the determination of precise text embeddings for relevant pairs. 
Our experimental results show a substantial improvement of T-MASS over the baseline (3% to 6.3% by R@1). Also T-MASS achieves state-of-the-art performance on five benchmark datasets including MSRVTT LSMDC DiDeMo VATEX and Charades. + + + + Know Your Neighbors: Improving Single-View Reconstruction via Spatial Vision-Language Reasoning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Know_Your_Neighbors_Improving_Single-View_Reconstruction_via_Spatial_Vision-Language_Reasoning_CVPR_2024_paper.pdf + Recovering the 3D scene geometry from a single view is a fundamental yet ill-posed problem in computer vision. While classical depth estimation methods infer only a 2.5D scene representation limited to the image plane recent approaches based on radiance fields reconstruct a full 3D representation. However these methods still struggle with occluded regions since inferring geometry without visual observation requires (i) semantic knowledge of the surroundings and (ii) reasoning about spatial context. We propose KYN a novel method for single-view scene reconstruction that reasons about semantic and spatial context to predict each point's density. We introduce a vision-language modulation module to enrich point features with fine-grained semantic information. We aggregate point representations across the scene through a language-guided spatial attention mechanism to yield per-point density predictions aware of the 3D semantic context. We show that KYN improves 3D shape recovery compared to predicting density for each 3D point in isolation. We achieve state-of-the-art results in scene and object reconstruction on KITTI-360 and show improved zero-shot generalization compared to prior work. Project page: https://ruili3.github.io/kyn + + + + Preserving Fairness Generalization in Deepfake Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_Preserving_Fairness_Generalization_in_Deepfake_Detection_CVPR_2024_paper.pdf + Although effective deepfake detection models have been developed in recent years, recent studies have revealed that these models can result in unfair performance disparities among demographic groups such as race and gender. This can lead to particular groups facing unfair targeting or exclusion from detection potentially allowing misclassified deepfakes to manipulate public opinion and undermine trust in the model. The existing method for addressing this problem is providing a fair loss function. It shows good fairness performance for intra-domain evaluation but does not maintain fairness for cross-domain testing. This highlights the significance of fairness generalization in the fight against deepfakes. In this work we propose the first method to address the fairness generalization problem in deepfake detection by simultaneously considering features loss and optimization aspects. Our method employs disentanglement learning to extract demographic and domain-agnostic forgery features fusing them to encourage fair learning across a flattened loss landscape. Extensive experiments on prominent deepfake datasets demonstrate our method's effectiveness surpassing state-of-the-art approaches in preserving fairness during cross-domain deepfake detection. The code is available at https://github.com/Purdue-M2/Fairness-Generalization.
+ + + + Structure-Aware Sparse-View X-ray 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_Structure-Aware_Sparse-View_X-ray_3D_Reconstruction_CVPR_2024_paper.pdf + X-ray, known for its ability to reveal internal structures of objects, is expected to provide richer information for 3D reconstruction than visible light. Yet existing NeRF algorithms overlook this nature of X-ray leading to their limitations in capturing structural contents of imaged objects. In this paper we propose a framework Structure-Aware X-ray Neural Radiodensity Fields (SAX-NeRF) for sparse-view X-ray 3D reconstruction. Firstly we design a Line Segment-based Transformer (Lineformer) as the backbone of SAX-NeRF. Lineformer captures internal structures of objects in 3D space by modeling the dependencies within each line segment of an X-ray. Secondly we present a Masked Local-Global (MLG) ray sampling strategy to extract contextual and geometric information in 2D projection. Plus we collect a larger-scale dataset X3D covering wider X-ray applications. Experiments on X3D show that SAX-NeRF surpasses previous NeRF-based methods by 12.56 and 2.49 dB on novel view synthesis and CT reconstruction. https://github.com/caiyuanhao1998/SAX-NeRF + + + + Dexterous Grasp Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Dexterous_Grasp_Transformer_CVPR_2024_paper.pdf + In this work we propose a novel discriminative framework for dexterous grasp generation named Dexterous Grasp TRansformer (DGTR) capable of predicting a diverse set of feasible grasp poses by processing the object point cloud with only one forward pass. We formulate dexterous grasp generation as a set prediction task and design a transformer-based grasping model for it. However we identify that this set prediction paradigm encounters several optimization challenges in the field of dexterous grasping and results in restricted performance. To address these issues we propose progressive strategies for both the training and testing phases. First the dynamic-static matching training (DSMT) strategy is presented to enhance the optimization stability during the training phase. Second we introduce the adversarial-balanced test-time adaptation (AB-TTA) with a pair of adversarial losses to improve grasping quality during the testing phase. Experimental results on the DexGraspNet dataset demonstrate the capability of DGTR to predict dexterous grasp poses with both high quality and diversity. Notably while keeping high quality the diversity of grasp poses predicted by DGTR significantly outperforms previous works in multiple metrics without any data pre-processing. Codes are available at https://github.com/iSEE-Laboratory/DGTR. + + + + EgoThink: Evaluating First-Person Perspective Thinking Capability of Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_EgoThink_Evaluating_First-Person_Perspective_Thinking_Capability_of_Vision-Language_Models_CVPR_2024_paper.pdf + Vision-language models (VLMs) have recently shown promising results in traditional downstream tasks. Evaluation studies have emerged to assess their abilities with the majority focusing on the third-person perspective and only a few addressing specific tasks from the first-person perspective. However the capability of VLMs to "think" from a first-person perspective a crucial attribute for advancing autonomous agents and robotics remains largely unexplored.
To bridge this research gap we introduce EgoThink a novel visual question-answering benchmark that encompasses six core capabilities with twelve detailed dimensions. The benchmark is constructed using selected clips from egocentric videos with manually annotated question-answer pairs containing first-person information. To comprehensively assess VLMs we evaluate twenty-one popular VLMs on EgoThink. Moreover given the open-ended format of the answers we use GPT-4 as the automatic judge to compute single-answer grading. Experimental results indicate that although GPT-4V leads in numerous dimensions all evaluated VLMs still possess considerable potential for improvement in first-person perspective tasks. Meanwhile enlarging the number of trainable parameters has the most significant impact on model performance on EgoThink. In conclusion EgoThink serves as a valuable addition to existing evaluation benchmarks for VLMs providing an indispensable resource for future research in the realm of embodied artificial intelligence and robotics. + + + + Hearing Anything Anywhere + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Hearing_Anything_Anywhere_CVPR_2024_paper.pdf + Recent years have seen immense progress in 3D computer vision and computer graphics with emerging tools that can virtualize real-world 3D environments for numerous Mixed Reality (XR) applications. However alongside immersive visual experiences immersive auditory experiences are equally vital to our holistic perception of an environment. In this paper we aim to reconstruct the spatial acoustic characteristics of an arbitrary environment given only a sparse set of (roughly 12) room impulse response (RIR) recordings and a planar reconstruction of the scene a setup that is easily achievable by ordinary users. To this end we introduce DiffRIR a differentiable RIR rendering framework with interpretable parametric models of salient acoustic features of the scene including sound source directivity and surface reflectivity. This allows us to synthesize novel auditory experiences through the space with any source audio. To evaluate our method we collect a dataset of RIR recordings and music in four diverse real environments. We show that our model outperforms state-of-the-art baselines on rendering monaural and binaural RIRs and music at unseen locations and learns physically interpretable parameters characterizing acoustic properties of the sound source and surfaces in the scene. + + + + PatchFusion: An End-to-End Tile-Based Framework for High-Resolution Monocular Metric Depth Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_PatchFusion_An_End-to-End_Tile-Based_Framework_for_High-Resolution_Monocular_Metric_Depth_CVPR_2024_paper.pdf + Single image depth estimation is a foundational task in computer vision and generative modeling. However prevailing depth estimation models grapple with accommodating the increasing resolutions commonplace in today's consumer cameras and devices. Existing high-resolution strategies show promise but they often face limitations ranging from error propagation to the loss of high-frequency details. 
We present PatchFusion a novel tile-based framework with three key components to improve the current state of the art: (1) A patch-wise fusion network that fuses a globally-consistent coarse prediction with finer inconsistent tiled predictions via high-level feature guidance (2) A Global-to-Local (G2L) module that adds vital context to the fusion network discarding the need for patch selection heuristics and (3) A Consistency-Aware Training (CAT) and Inference (CAI) approach emphasizing patch overlap consistency and thereby eradicating the necessity for post-processing. Experiments on UnrealStereo4K MVS-Synth and Middlebury 2014 demonstrate that our framework can generate high-resolution depth maps with intricate details. PatchFusion is independent of the base model for depth estimation. Notably our framework built on top of SOTA ZoeDepth brings improvements for a total of 17.3% and 29.4% in terms of the root mean squared error (RMSE) on UnrealStereo4K and MVS-Synth respectively. + + + + Retrieval-Augmented Egocentric Video Captioning + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Retrieval-Augmented_Egocentric_Video_Captioning_CVPR_2024_paper.pdf + Understanding human actions from videos of first-person view poses significant challenges. Most prior approaches explore representation learning on egocentric videos only while overlooking the potential benefit of exploiting existing large-scale third-person videos. In this paper (1) we develop EgoInstructor a retrieval-augmented multimodal captioning model that automatically retrieves semantically relevant third-person instructional videos to enhance the video captioning of egocentric videos (2) for training the cross-view retrieval module we devise an automatic pipeline to discover ego-exo video pairs from distinct large-scale egocentric and exocentric datasets (3) we train the cross-view retrieval module with a novel EgoExoNCE loss that pulls egocentric and exocentric video features closer by aligning them to shared text features that describe similar actions (4) through extensive experiments our cross-view retrieval module demonstrates superior performance across seven benchmarks. Regarding egocentric video captioning EgoInstructor exhibits significant improvements by leveraging third-person videos as references. + + + + SkillDiffuser: Interpretable Hierarchical Planning via Skill Abstractions in Diffusion-Based Task Execution + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_SkillDiffuser_Interpretable_Hierarchical_Planning_via_Skill_Abstractions_in_Diffusion-Based_Task_CVPR_2024_paper.pdf + Diffusion models have demonstrated strong potential for robotic trajectory planning. However generating coherent trajectories from high-level instructions remains challenging especially for long-range composition tasks requiring multiple sequential skills. We propose SkillDiffuser an end-to-end hierarchical planning framework integrating interpretable skill learning with conditional diffusion planning to address this problem. At the higher level the skill abstraction module learns discrete human-understandable skill representations from visual observations and language instructions. These learned skill embeddings are then used to condition the diffusion model to generate customized latent trajectories aligned with the skills. This allows generating diverse state trajectories that adhere to the learnable skills.
By integrating skill learning with conditional trajectory generation SkillDiffuser produces coherent behavior following abstract instructions across diverse tasks. Experiments on multi-task robotic manipulation benchmarks like Meta-World and LOReL demonstrate state-of-the-art performance and human-interpretable skill representations from SkillDiffuser. More visualization results and information could be found on https://skilldiffuser.github.io/. + + + + TE-TAD: Towards Full End-to-End Temporal Action Detection via Time-Aligned Coordinate Expression + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_TE-TAD_Towards_Full_End-to-End_Temporal_Action_Detection_via_Time-Aligned_Coordinate_CVPR_2024_paper.pdf + In this paper we investigate that the normalized coordinate expression is a key factor as reliance on hand-crafted components in query-based detectors for temporal action detection (TAD). Despite significant advancements towards an end-to-end framework in object detection query-based detectors have been limited in achieving full end-to-end modeling in TAD. To address this issue we propose TE-TAD a full end-to-end temporal action detection transformer that integrates time-aligned coordinate expression. We reformulate coordinate expression utilizing actual timeline values ensuring length-invariant representations from the extremely diverse video duration environment. Furthermore our proposed adaptive query selection dynamically adjusts the number of queries based on video length providing a suitable solution for varying video durations compared to a fixed query set. Our approach not only simplifies the TAD process by eliminating the need for hand-crafted components but also significantly improves the performance of query-based detectors. Our TE-TAD outperforms the previous query-based detectors and achieves competitive performance compared to state-of-the-art methods on popular benchmark datasets. Code is available at: https://github.com/Dotori-HJ/TE-TAD. + + + + PointBeV: A Sparse Approach for BeV Predictions + http://openaccess.thecvf.com//content/CVPR2024/papers/Chambon_PointBeV_A_Sparse_Approach_for_BeV_Predictions_CVPR_2024_paper.pdf + Bird's-eye View (BeV) representations have emerged as the de-facto shared space in driving applications offering a unified space for sensor data fusion and supporting various downstream tasks. However conventional models use grids with fixed resolution and range and face computational inefficiencies due to the uniform allocation of resources across all cells. To address this we propose PointBeV a novel sparse BeV segmentation model operating on sparse BeV cells instead of dense grids. This approach offers precise control over memory usage enabling the use of long temporal contexts and accommodating memory-constrained platforms. PointBeV employs an efficient two-pass strategy for training enabling focused computation on regions of interest. At inference time it can be used with various memory/performance trade-offs and flexibly adjusts to new specific use cases. PointBeV achieves state-of-the-art results on the nuScenes dataset for vehicle pedestrian and lane segmentation showcasing superior performance in static and temporal settings despite being trained solely with sparse signals. We release our code with two new efficient modules used in the architecture: Sparse Feature Pulling designed for the effective extraction of features from images to BeV and Submanifold Attention which enables efficient temporal modeling. 
The code is available at https://github.com/valeoai/PointBeV. + + + + From-Ground-To-Objects: Coarse-to-Fine Self-supervised Monocular Depth Estimation of Dynamic Objects with Ground Contact Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Moon_From-Ground-To-Objects_Coarse-to-Fine_Self-supervised_Monocular_Depth_Estimation_of_Dynamic_Objects_with_CVPR_2024_paper.pdf + Self-supervised monocular depth estimation (DE) is an approach to learning depth without costly depth ground truths. However it often struggles with moving objects that violate the static scene assumption during training. To address this issue we introduce a coarse-to-fine training strategy leveraging the ground contacting prior based on the observation that most moving objects in outdoor scenes contact the ground. In the coarse training stage we exclude the objects in dynamic classes from the reprojection loss calculation to avoid inaccurate depth learning. To provide precise supervision on the depth of the objects we present a novel Ground-contacting-prior Disparity Smoothness Loss (GDS-Loss) that encourages a DE network to align the depth of the objects with their ground-contacting points. Subsequently in the fine training stage we refine the DE network to learn the detailed depth of the objects from the reprojection loss while ensuring accurate DE on the moving object regions by employing our regularization loss with a cost-volume-based weighting factor. Our overall coarse-to-fine training strategy can easily be integrated with existing DE methods without any modifications significantly enhancing DE performance on challenging Cityscapes and KITTI datasets especially in the moving object regions. + + + + SRTube: Video-Language Pre-Training with Action-Centric Video Tube Features and Semantic Role Labeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_SRTube_Video-Language_Pre-Training_with_Action-Centric_Video_Tube_Features_and_Semantic_CVPR_2024_paper.pdf + In recent years large-scale video-language pre-training (VidLP) has received considerable attention for its effectiveness in relevant tasks. In this paper we propose a novel action-centric VidLP framework that employs video tube features for temporal modeling and language features based on semantic role labeling (SRL). Our video encoder generates multiple tube features along object trajectories identifying action-related regions within videos to overcome the limitations of existing temporal attention mechanisms. Additionally our text encoder incorporates high-level action-related language knowledge previously underutilized in current VidLP models. The SRL captures action-verbs and related semantics among objects in sentences and enhances the ability to perform instance-level text matching thus enriching the cross-modal (CM) alignment process. We also introduce two novel pre-training objectives and a self-supervision strategy to produce a more faithful CM representation. Experimental results demonstrate that our method outperforms existing VidLP frameworks in various downstream tasks and datasets establishing our model a baseline in the modern VidLP framework. + + + + Prompt Highlighter: Interactive Control for Multi-Modal LLMs + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Prompt_Highlighter_Interactive_Control_for_Multi-Modal_LLMs_CVPR_2024_paper.pdf + This study targets a critical aspect of multi-modal LLMs' (LLMs&VLMs) inference: explicit controllable text generation. 
Multi-modal LLMs empower multi-modality understanding with the capability of semantic generation yet bring less explainability and heavier reliance on prompt contents due to their autoregressive generative nature. While manipulating prompt formats could improve outputs designing specific and precise prompts per task can be challenging and ineffective. To tackle this issue we introduce a novel inference method Prompt Highlighter which enables users to highlight specific prompt spans to interactively control the focus during generation. Motivated by the classifier-free diffusion guidance we form regular and unconditional context pairs based on highlighted tokens demonstrating that the autoregressive generation in models can be guided in a classifier-free way. Notably we find that during inference guiding the models with highlighted tokens through the attention weights leads to more desired outputs. Our approach is compatible with current LLMs and VLMs achieving impressive customized generation results without training. Experiments confirm its effectiveness in focusing on input contexts and generating reliable content. Without tuning on LLaVA-v1.5 our method secured 70.7 in the MMBench test and 1552.5 in MME-perception. + + + + Continual Learning for Motion Prediction Model via Meta-Representation Learning and Optimal Memory Buffer Retention Strategy + http://openaccess.thecvf.com//content/CVPR2024/papers/Kang_Continual_Learning_for_Motion_Prediction_Model_via_Meta-Representation_Learning_and_CVPR_2024_paper.pdf + Embodied AI such as autonomous vehicles suffers from insufficient long-tailed data because it must be obtained from the physical world. In fact data must be continuously obtained in a series of small batches and the model must also be continuously trained to achieve generalizability and scalability by improving the biased data distribution. This paper addresses the training cost and catastrophic forgetting problems when continuously updating models to adapt to incoming small batches from various environments for real-world motion prediction in autonomous driving. To this end we propose a novel continual motion prediction (CMP) learning framework based on sparse meta-representation learning and an optimal memory buffer retention strategy. In meta-representation learning a model explicitly learns a sparse representation of each driving environment from road geometry to vehicle states by training to reduce catastrophic forgetting based on an augmented modulation network with sparsity regularization. Also in the adaptation phase We develop an Optimal Memory Buffer Retention strategy that smartly preserves diverse samples by focusing on representation similarity. This approach handles the nuanced task distribution shifts characteristic of motion prediction datasets ensuring our model stays responsive to evolving input variations without requiring extensive resources. The experiment results demonstrate that the proposed method shows superior adaptation performance to the conventional continual learning approach which is developed using a synthetic dataset for the continual learning problem. + + + + EditGuard: Versatile Image Watermarking for Tamper Localization and Copyright Protection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_EditGuard_Versatile_Image_Watermarking_for_Tamper_Localization_and_Copyright_Protection_CVPR_2024_paper.pdf + In the era of AI-generated content (AIGC) malicious tampering poses imminent threats to copyright integrity and information security. 
Current deep image watermarking while widely accepted for safeguarding visual content can only protect copyright and ensure traceability. They fall short in localizing increasingly realistic image tampering potentially leading to trust crises privacy violations and legal disputes. To solve this challenge we propose an innovative proactive forensics framework EditGuard to unify copyright protection and tamper-agnostic localization especially for AIGC-based editing methods. It can offer a meticulous embedding of imperceptible watermarks and precise decoding of tampered areas and copyright information. Leveraging our observed fragility and locality of image-into-image steganography the realization of EditGuard can be converted into a united image-bit steganography issue thus completely decoupling the training process from the tampering types. Extensive experiments verify that our EditGuard balances the tamper localization accuracy copyright recovery precision and generalizability to various AIGC-based tampering methods especially for image forgery that is difficult for the naked eye to detect. + + + + FairRAG: Fair Human Generation via Fair Retrieval Augmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Shrestha_FairRAG_Fair_Human_Generation_via_Fair_Retrieval_Augmentation_CVPR_2024_paper.pdf + Existing text-to-image generative models reflect or even amplify societal biases ingrained in their training data. This is especially concerning for human image generation where models are biased against certain demographic groups. Existing attempts to rectify this issue are hindered by the inherent limitations of the pre-trained models and fail to substantially improve demographic diversity. In this work we introduce Fair Retrieval Augmented Generation (FairRAG) a novel framework that conditions pre-trained generative models on reference images retrieved from an external image database to improve fairness in human generation. FairRAG enables conditioning through a lightweight linear module that projects reference images into the textual space. To enhance fairness FairRAG applies simple-yet-effective debiasing strategies providing images from diverse demographic groups during the generative process. Extensive experiments demonstrate that FairRAG outperforms existing methods in terms of demographic diversity image-text alignment and image fidelity while incurring minimal computational overhead during inference. + + + + Template Free Reconstruction of Human-object Interaction with Procedural Interaction Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_Template_Free_Reconstruction_of_Human-object_Interaction_with_Procedural_Interaction_Generation_CVPR_2024_paper.pdf + Reconstructing human-object interaction in 3D from a single RGB image is a challenging task and existing data driven methods do not generalize beyond the objects present in the carefully curated 3D interaction datasets. Capturing large-scale real data to learn strong interaction and 3D shape priors is very expensive due to the combinatorial nature of human-object interactions. In this paper we propose ProciGen (Procedural interaction Generation) a method to procedurally generate datasets with both plausible interaction and diverse object variation. We generate 1M+ human-object interaction pairs in 3D and leverage this large-scale data to train our HDM (Hierarchical Diffusion Model) a novel method to reconstruct interacting human and unseen object instances without any templates. 
Our HDM is an image-conditioned diffusion model that learns both realistic interaction and highly accurate human and object shapes. Experiments show that our HDM trained with ProciGen significantly outperforms prior methods that require template meshes and our dataset allows training methods with strong generalization ability to unseen object instances. Our code and data are released. + + + + Open-Vocabulary Video Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Open-Vocabulary_Video_Anomaly_Detection_CVPR_2024_paper.pdf + Current video anomaly detection (VAD) approaches with weak supervision are inherently limited to a closed-set setting and may struggle in open-world applications where there can be anomaly categories in the test data unseen during training. A few recent studies attempt to tackle a more realistic setting open-set VAD which aims to detect unseen anomalies given seen anomalies and normal videos. However such a setting focuses on predicting frame anomaly scores having no ability to recognize the specific categories of anomalies despite the fact that this ability is essential for building more informed video surveillance systems. This paper takes a step further and explores open-vocabulary video anomaly detection (OVVAD) in which we aim to leverage pre-trained large models to detect and categorize seen and unseen anomalies. To this end we propose a model that decouples OVVAD into two mutually complementary tasks - class-agnostic detection and class-specific classification - and jointly optimizes both tasks. Particularly we devise a semantic knowledge injection module to introduce semantic knowledge from large language models for the detection task and design a novel anomaly synthesis module to generate pseudo unseen anomaly videos with the help of large vision generation models for the classification task. This semantic knowledge and these synthesized anomalies substantially extend our model's capability in detecting and categorizing a variety of seen and unseen anomalies. Extensive experiments on three widely-used benchmarks demonstrate our model achieves state-of-the-art performance on the OVVAD task. + + + + ODM: A Text-Image Further Alignment Pre-training Approach for Scene Text Detection and Spotting + http://openaccess.thecvf.com//content/CVPR2024/papers/Duan_ODM_A_Text-Image_Further_Alignment_Pre-training_Approach_for_Scene_Text_CVPR_2024_paper.pdf + In recent years text-image joint pre-training techniques have shown promising results in various tasks. However in Optical Character Recognition (OCR) tasks aligning text instances with their corresponding text regions in images poses a challenge as it requires effective alignment between text and OCR-Text (referring to the text in images as OCR-Text to distinguish from the text in natural language) rather than a holistic understanding of the overall image content. In this paper we propose a new pre-training method called OCR-Text Destylization Modeling (ODM) that transfers diverse styles of text found in images to a uniform style based on the text prompt. With ODM we achieve better alignment between text and OCR-Text and enable pre-trained models to adapt to the complex and diverse styles of scene text detection and spotting tasks.
Additionally we have designed a new labeling generation method specifically for ODM and combined it with our proposed Text-Controller module to address the challenge of annotation costs in OCR tasks allowing a larger amount of unlabeled data to participate in pre-training. Extensive experiments on multiple public datasets demonstrate that our method significantly improves performance and outperforms current pre-training methods in scene text detection and spotting tasks. Code is available at https://github.com/PriNing/ODM. + + + + Epistemic Uncertainty Quantification For Pre-Trained Neural Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Epistemic_Uncertainty_Quantification_For_Pre-Trained_Neural_Networks_CVPR_2024_paper.pdf + Epistemic uncertainty quantification (UQ) identifies where models lack knowledge. Traditional UQ methods often based on Bayesian neural networks are not suitable for pre-trained non-Bayesian models. Our study addresses quantifying epistemic uncertainty for any pre-trained model which does not need the original training data or model modifications and can ensure broad applicability regardless of network architectures or training techniques. Specifically we propose a gradient-based approach to assess epistemic uncertainty analyzing the gradients of outputs relative to model parameters and thereby indicating necessary model adjustments to accurately represent the inputs. We first explore theoretical guarantees of gradient-based methods for epistemic UQ questioning the view that this uncertainty is only calculable through differences between multiple models. We further improve gradient-driven UQ by using class-specific weights for integrating gradients and emphasizing distinct contributions from neural network layers. Additionally we enhance UQ accuracy by combining gradient and perturbation methods to refine the gradients. We evaluate our approach on out-of-distribution detection uncertainty calibration and active learning demonstrating its superiority over current state-of-the-art UQ methods for pre-trained models. + + + + Diffusion-ES: Gradient-free Planning with Diffusion for Autonomous and Instruction-guided Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Diffusion-ES_Gradient-free_Planning_with_Diffusion_for_Autonomous_and_Instruction-guided_Driving_CVPR_2024_paper.pdf + Diffusion models excel at modeling complex and multimodal trajectory distributions for decision-making and control. Reward-gradient guided denoising has been recently proposed to generate trajectories that maximize both a differentiable reward function and the likelihood under the data distribution captured by a diffusion model. Reward-gradient guided denoising requires a differentiable reward function fitted to both clean and noised samples limiting its applicability as a general trajectory optimizer. In this paper we propose DiffusionES a method that combines gradient-free optimization with trajectory denoising to optimize black-box non-differentiable objectives while staying in the data manifold. Diffusion-ES samples trajectories during evolutionary search from a diffusion model and scores them using a black-box reward function. It mutates high-scoring trajectories using a truncated diffusion process that applies a small number of noising and denoising steps allowing for much more efficient exploration of the solution space. 
We show that DiffusionES achieves state-of-the-art performance on nuPlan an established closed-loop planning benchmark for autonomous driving. Diffusion-ES outperforms existing sampling-based planners reactive deterministic or diffusion-based policies and reward-gradient guidance. Additionally we show that unlike prior guidance methods our method can optimize non-differentiable language-shaped reward functions generated by few-shot LLM prompting. When guided by a human teacher that issues instructions to follow our method can generate novel highly complex behaviors such as aggressive lane weaving which are not present in the training data. This allows us to solve the hardest nuPlan scenarios which are beyond the capabilities of existing trajectory optimization methods and driving policies. + + + + MRC-Net: 6-DoF Pose Estimation with MultiScale Residual Correlation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_MRC-Net_6-DoF_Pose_Estimation_with_MultiScale_Residual_Correlation_CVPR_2024_paper.pdf + We propose a single-shot approach to determining 6-DoF pose of an object with available 3D computer-aided design (CAD) model from a single RGB image. Our method dubbed MRC-Net comprises two stages. The first performs pose classification and renders the 3D object in the classified pose. The second stage performs regression to predict fine-grained residual pose within class. Connecting the two stages is a novel multi-scale residual correlation (MRC) layer that captures high-and-low level correspondences between the input image and rendering from first stage. MRC-Net employs a Siamese network with shared weights between both stages to learn embeddings for input and rendered images. To mitigate ambiguity when predicting discrete pose class labels on symmetric objects we use soft probabilistic labels to define pose class in the first stage. We demonstrate state-of-the-art accuracy outperforming all competing RGB-based methods on four challenging BOP benchmark datasets: T-LESS LM-O YCB-V and ITODD. Our method is non-iterative and requires no complex post-processing. Our code and pretrained models are available at https://github.com/amzn/mrc-net-6d-pose + + + + MonoCD: Monocular 3D Object Detection with Complementary Depths + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_MonoCD_Monocular_3D_Object_Detection_with_Complementary_Depths_CVPR_2024_paper.pdf + Monocular 3D object detection has attracted widespread attention due to its potential to accurately obtain object 3D localization from a single image at a low cost. Depth estimation is an essential but challenging subtask of monocular 3D object detection due to the ill-posedness of 2D to 3D mapping. Many methods explore multiple local depth clues such as object heights and keypoints and then formulate the object depth estimation as an ensemble of multiple depth predictions to mitigate the insufficiency of single-depth information. However the errors of existing multiple depths tend to have the same sign which hinders them from neutralizing each other and limits the overall accuracy of combined depth. To alleviate this problem we propose to increase the complementarity of depths with two novel designs. First we add a new depth prediction branch named complementary depth that utilizes global and efficient depth clues from the entire image rather than the local clues to reduce the correlation of depth predictions. 
Second we propose to fully exploit the geometric relations between multiple depth clues to achieve complementarity in form. Benefiting from these designs our method achieves higher complementarity. Experiments on the KITTI benchmark demonstrate that our method achieves state-of-the-art performance without introducing extra data. In addition complementary depth can also be a lightweight and plug-and-play module to boost multiple existing monocular 3d object detectors. Code is available at https://github.com/elvintanhust/MonoCD. + + + + Consistent3D: Towards Consistent High-Fidelity Text-to-3D Generation with Deterministic Sampling Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Consistent3D_Towards_Consistent_High-Fidelity_Text-to-3D_Generation_with_Deterministic_Sampling_Prior_CVPR_2024_paper.pdf + Score distillation sampling (SDS) and its variants have greatly boosted the development of text-to-3D generation but are vulnerable to geometry collapse and poor textures yet. To solve this issue we first deeply analyze the SDS and find that its distillation sampling process indeed corresponds to the trajectory sampling of a stochastic differential equation (SDE): SDS samples along an SDE trajectory to yield a less noisy sample which then serves as a guidance to optimize a 3D model. However the randomness in SDE sampling often leads to a diverse and unpredictable sample which is not always less noisy and thus is not a consistently correct guidance explaining the vulnerability of SDS. Since for any SDE there always exists an ordinary differential equation (ODE) whose trajectory sampling can deterministically and consistently converge to the desired target point as the SDE we propose a novel and effective "Consistent3D" method that explores the ODE deterministic sampling prior for text-to-3D generation. Specifically at each training iteration given a rendered image by a 3D model we first estimate its desired 3D score function by a pre-trained 2D diffusion model and build an ODE for trajectory sampling. Next we design a consistency distillation sampling loss which samples along the ODE trajectory to generate two adjacent samples and uses the less noisy sample to guide another more noisy one for distilling the deterministic prior into the 3D model. Experimental results show the efficacy of our Consistent3D in generating high-fidelity and diverse 3D objects and large-scale scenes as shown in Fig. 1. The codes are available at https://github.com/sail-sg/Consistent3D. + + + + ManipLLM: Embodied Multimodal Large Language Model for Object-Centric Robotic Manipulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_ManipLLM_Embodied_Multimodal_Large_Language_Model_for_Object-Centric_Robotic_Manipulation_CVPR_2024_paper.pdf + Robot manipulation relies on accurately predicting contact points and end-effector directions to ensure successful operation. However learning-based robot manipulation trained on a limited category within a simulator often struggles to achieve generalizability especially when confronted with extensive categories. Therefore we introduce an innovative approach for robot manipulation that leverages the robust reasoning capabilities of Multimodal Large Language Models (MLLMs) to enhance the stability and generalization of manipulation. By fine-tuning the injected adapters we preserve the inherent common sense and reasoning ability of the MLLMs while equipping them with the ability for manipulation. 
The fundamental insight lies in the introduced fine-tuning paradigm encompassing object category understanding affordance prior reasoning and object-centric pose prediction to stimulate the reasoning ability of MLLM in manipulation. During inference our approach utilizes an RGB image and text prompt to predict the end effector's pose in chain of thoughts. After the initial contact is established an active impedance adaptation policy is introduced to plan the upcoming waypoints in a closed-loop manner. Moreover in the real world we design a test-time adaptation (TTA) strategy for manipulation to enable the model to better adapt to the current real-world scene configuration. Experiments in simulation and the real world show the promising performance of ManipLLM. More details and demonstrations can be found at https://sites.google.com/view/manipllm. + + + + GLaMM: Pixel Grounding Large Multimodal Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Rasheed_GLaMM_Pixel_Grounding_Large_Multimodal_Model_CVPR_2024_paper.pdf + Large Multimodal Models (LMMs) extend Large Language Models to the vision domain. Initial LMMs used holistic images and text prompts to generate ungrounded textual responses. Recently region-level LMMs have been used to generate visually grounded responses. However they are limited to only referring to a single object category at a time require users to specify the regions or cannot offer dense pixel-wise object grounding. In this work we present Grounding LMM (GLaMM) the first model that can generate natural language responses seamlessly intertwined with corresponding object segmentation masks. GLaMM not only grounds objects appearing in the conversations but is flexible enough to accept both textual and optional visual prompts (region of interest) as input. This empowers users to interact with the model at various levels of granularity both in textual and visual domains. Due to the lack of standard benchmarks for the novel setting of visually Grounded Conversation Generation (GCG) we introduce a comprehensive evaluation protocol with our curated grounded conversations. Our proposed GCG task requires densely grounded concepts in natural scenes at a large scale. To this end we propose a densely annotated Grounding-anything Dataset (GranD) using our proposed automated annotation pipeline that encompasses 7.5M unique concepts grounded in a total of 810M regions available with segmentation masks. Besides GCG GLaMM also performs effectively on several downstream tasks e.g. referring expression segmentation image and region-level captioning and vision-language conversations. + + + + Incremental Residual Concept Bottleneck Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Shang_Incremental_Residual_Concept_Bottleneck_Models_CVPR_2024_paper.pdf + Concept Bottleneck Models (CBMs) map the black-box visual representations extracted by deep neural networks onto a set of interpretable concepts and use the concepts to make predictions enhancing the transparency of the decision-making process. Multimodal pre-trained models can match visual representations with textual concept embeddings allowing for obtaining the interpretable concept bottleneck without expert concept annotations. Recent research has focused on the concept bank establishment and the high-quality concept selection. However it is challenging to construct a comprehensive concept bank through humans or large language models which severely limits the performance of CBMs.
In this work we propose the Incremental Residual Concept Bottleneck Model (Res-CBM) to address the challenge of concept completeness. Specifically the residual concept bottleneck model employs a set of optimizable vectors to complete missing concepts then the incremental concept discovery module converts the complemented vectors with unclear meanings into potential concepts in the candidate concept bank. Our approach can be applied to any user-defined concept bank as a post-hoc processing method to enhance the performance of any CBMs. Furthermore to measure the descriptive efficiency of CBMs the Concept Utilization Efficiency (CUE) metric is proposed. Experiments show that the Res-CBM outperforms the current state-of-the-art methods in terms of both accuracy and efficiency and achieves comparable performance to black-box models across multiple datasets. + + + + SPOC: Imitating Shortest Paths in Simulation Enables Effective Navigation and Manipulation in the Real World + http://openaccess.thecvf.com//content/CVPR2024/papers/Ehsani_SPOC_Imitating_Shortest_Paths_in_Simulation_Enables_Effective_Navigation_and_CVPR_2024_paper.pdf + Reinforcement learning (RL) with dense rewards and imitation learning (IL) with human-generated trajectories are the most widely used approaches for training modern embodied agents. RL requires extensive reward shaping and auxiliary losses and is often too slow and ineffective for long-horizon tasks. While IL with human supervision is effective collecting human trajectories at scale is extremely expensive. In this work we show that imitating shortest-path planners in simulation produces agents that given a language instruction can proficiently navigate explore and manipulate objects in both simulation and in the real world using only RGB sensors (no depth map or GPS coordinates). This surprising result is enabled by our end-to-end transformer-based SPOC architecture powerful visual encoders paired with extensive image augmentation and the dramatic scale and diversity of our training data: millions of frames of shortest-path-expert trajectories collected inside approximately 200000 procedurally generated houses containing 40000 unique 3D assets. Our models data training code and newly proposed 10-task benchmarking suite CHORES are available at https://spoc-robot.github.io/. + + + + LoCoNet: Long-Short Context Network for Active Speaker Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_LoCoNet_Long-Short_Context_Network_for_Active_Speaker_Detection_CVPR_2024_paper.pdf + Active Speaker Detection (ASD) aims to identify who is speaking in each frame of a video. Solving ASD involves using audio and visual information in two complementary contexts: long-term intra-speaker context models the temporal dependencies of the same speaker while short-term inter-speaker context models the interactions of speakers in the same scene. Motivated by these observations we propose LoCoNet a simple but effective Long-Short Context Network that leverages Long-term Intra-speaker Modeling (LIM) and Short-term Inter-speaker Modeling (SIM) in an interleaved manner. LIM employs self-attention for long-range temporal dependencies modeling and cross-attention for audio-visual interactions modeling. SIM incorporates convolutional blocks that capture local patterns for short-term inter-speaker context. 
Experiments show that LoCoNet achieves state-of-the-art performance on multiple datasets with 95.2% (+0.3%) mAP on AVA-ActiveSpeaker 97.2% (+2.7%) mAP on Talkies and 68.4% (+7.7%) mAP on Ego4D. Moreover in challenging cases where multiple speakers are present LoCoNet outperforms previous state-of-the-art methods by 3.0% mAP on AVA-ActiveSpeaker. The code is available at https://github.com/SJTUwxz/LoCoNet_ASD. + + + + D3still: Decoupled Differential Distillation for Asymmetric Image Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_D3still_Decoupled_Differential_Distillation_for_Asymmetric_Image_Retrieval_CVPR_2024_paper.pdf + Existing methods for asymmetric image retrieval employ a rigid pairwise similarity constraint between the query network and the larger gallery network. However these one-to-one constraint approaches often fail to maintain retrieval order consistency especially when the query network has limited representational capacity. To overcome this problem we introduce the Decoupled Differential Distillation (D3still) framework. This framework shifts from absolute one-to-one supervision to optimizing the relational differences in pairwise similarities produced by the query and gallery networks thereby preserving a consistent retrieval order across both networks. Our method involves computing a pairwise similarity differential matrix within the gallery domain which is then decomposed into three components: feature representation knowledge inconsistent pairwise similarity differential knowledge and consistent pairwise similarity differential knowledge. This strategic decomposition aligns the retrieval ranking of the query network with the gallery network effectively. Extensive experiments on various benchmark datasets reveal that D3still surpasses state-of-the-art methods in asymmetric image retrieval. Code is available at https://github.com/SCY-X/D3still. + + + + Learning Triangular Distribution in Visual World + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Learning_Triangular_Distribution_in_Visual_World_CVPR_2024_paper.pdf + Convolutional neural networks are successful in pervasive vision tasks including label distribution learning which usually takes the form of learning an injection from the non-linear visual features to the well-defined labels. However how the discrepancy between features is mapped to the label discrepancy is ambiguous and its correctness is not guaranteed. To address these problems we study the mathematical connection between a feature and its label presenting a general and simple framework for label distribution learning. We propose a so-called Triangular Distribution Transform (TDT) to build an injective function between feature and label guaranteeing that any symmetric feature discrepancy linearly reflects the difference between labels. The proposed TDT can be used as a plug-in in mainstream backbone networks to address different label distribution learning tasks. Experiments on Facial Age Recognition Illumination Chromaticity Estimation and Aesthetics assessment show that TDT achieves on-par or better results than the prior arts. + + + + DiaLoc: An Iterative Approach to Embodied Dialog Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_DiaLoc_An_Iterative_Approach_to_Embodied_Dialog_Localization_CVPR_2024_paper.pdf + Multimodal learning has advanced the performance for many vision-language tasks.
However most existing works in embodied dialog research focus on navigation and leave the localization task understudied. The few existing dialog-based localization approaches assume the availability of the entire dialog prior to localization which is impractical for deployed dialog-based localization. In this paper we propose DiaLoc a new dialog-based localization framework which aligns with real human operator behavior. Specifically we produce an iterative refinement of location predictions which can visualize current pose beliefs after each dialog turn. DiaLoc effectively utilizes the multimodal data for multi-shot localization where a fusion encoder fuses vision and dialog information iteratively. We achieve state-of-the-art results on the embodied dialog-based localization task in single-shot (+7.08% in Acc5@valUnseen) and multi-shot settings (+10.85% in Acc5@valUnseen). DiaLoc narrows the gap between simulation and real-world applications opening doors for future research on collaborative localization and navigation. + + + + Self-Training Large Language Models for Improved Visual Program Synthesis With Visual Reinforcement + http://openaccess.thecvf.com//content/CVPR2024/papers/Khan_Self-Training_Large_Language_Models_for_Improved_Visual_Program_Synthesis_With_CVPR_2024_paper.pdf + Visual program synthesis is a promising approach to exploit the reasoning abilities of large language models for compositional computer vision tasks. Previous work has used few-shot prompting with frozen LLMs to synthesize visual programs. Training an LLM to write better visual programs is an attractive prospect but it is unclear how to accomplish this. No dataset of visual programs for training exists and acquisition of a visual program dataset cannot be easily crowdsourced due to the need for expert annotators. To get around the lack of direct supervision we explore improving the program synthesis abilities of an LLM using feedback from interactive experience. We propose a method where we exploit existing annotations for a vision-language task to improvise a coarse reward signal for that task treat the LLM as a policy and apply reinforced self-training to improve the visual program synthesis ability of the LLM for that task. We describe a series of experiments on object detection compositional visual question answering and image-text retrieval and show that in each case the self-trained LLM outperforms or performs on par with few-shot frozen LLMs that are an order of magnitude larger. Website: https://zaidkhan.me/ViReP + + + + MLIP: Enhancing Medical Visual Representation with Divergence Encoder and Knowledge-guided Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_MLIP_Enhancing_Medical_Visual_Representation_with_Divergence_Encoder_and_Knowledge-guided_CVPR_2024_paper.pdf + The scarcity of annotated data has sparked significant interest in unsupervised pre-training methods that leverage medical reports as auxiliary signals for medical visual representation learning. However existing research overlooks the multi-granularity nature of medical visual representation and lacks suitable contrastive learning techniques to improve the models' generalizability across different granularities leading to the underutilization of image-text information. To address this we propose MLIP a novel framework leveraging domain-specific medical knowledge as guiding signals to integrate language information into the visual domain through image-text contrastive learning.
Our model includes global contrastive learning with our designed divergence encoder local token-knowledge-patch alignment contrastive learning and knowledge-guided category-level contrastive learning with expert knowledge. Experimental evaluations reveal the efficacy of our model in enhancing transfer performance for tasks such as image classification object detection and semantic segmentation. Notably MLIP surpasses state-of-the-art methods even with limited annotated data highlighting the potential of multimodal pre-training in advancing medical representation learning. + + + + Collaborative Learning of Anomalies with Privacy (CLAP) for Unsupervised Video Anomaly Detection: A New Baseline + http://openaccess.thecvf.com//content/CVPR2024/papers/Al-lahham_Collaborative_Learning_of_Anomalies_with_Privacy_CLAP_for_Unsupervised_Video_CVPR_2024_paper.pdf + Unsupervised (US) video anomaly detection (VAD) in surveillance applications is gaining more popularity lately due to its practical real-world applications. Due to the extremely challenging nature of this task where learning is carried out without any annotations privacy-critical collaborative learning of US-VAD systems has not been studied yet. As surveillance videos are privacy sensitive and the availability of large-scale video data may enable better US-VAD systems collaborative learning can be highly rewarding in this setting. In this paper we propose a new baseline for anomaly detection capable of localizing anomalous events in complex surveillance scenarios in a fully unsupervised fashion without any labels on a privacy-retaining participant-based distributed training configuration. Additionally we propose three new evaluation protocols to extensively evaluate anomaly detection approaches on various scenarios of collaborations and data availability. Moreover based on these protocols we modify existing VAD datasets to extensively evaluate our approach as well as existing US SOTA methods on two large-scale datasets including UCF-Crime and XD-Violence. All proposed evaluation protocols dataset splits and codes are available here: https://github.com/AnasEmad11/CLAP. + + + + Resource-Efficient Transformer Pruning for Finetuning of Large Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Ilhan_Resource-Efficient_Transformer_Pruning_for_Finetuning_of_Large_Models_CVPR_2024_paper.pdf + With the recent advances in vision transformers and large language models (LLMs) finetuning costly large models on downstream learning tasks poses significant challenges under limited computational resources. This paper presents a REsource and ComputAtion-efficient Pruning framework (RECAP) for the finetuning of transformer-based large models. RECAP by design bridges the gap between efficiency and performance through an iterative process cycling between pruning finetuning and updating stages to explore different chunks of the given large-scale model. At each iteration we first prune the model with Taylor-approximation-based importance estimation and then only update a subset of the pruned model weights based on the Fisher-information criterion. In this way RECAP achieves two synergistic and yet conflicting goals: reducing the GPU memory footprint while maintaining model performance unlike most existing pruning methods that require the model to be finetuned beforehand for better preservation of model performance.
We perform extensive experiments with a wide range of large transformer-based architectures on various computer vision and natural language understanding tasks. Compared to recent pruning techniques we demonstrate that RECAP offers significant improvements in GPU memory efficiency capable of reducing the footprint by up to 65%. + + + + Multimodal Industrial Anomaly Detection by Crossmodal Feature Mapping + http://openaccess.thecvf.com//content/CVPR2024/papers/Costanzino_Multimodal_Industrial_Anomaly_Detection_by_Crossmodal_Feature_Mapping_CVPR_2024_paper.pdf + Recent advancements have shown the potential of leveraging both point clouds and images to localize anomalies. Nevertheless their applicability in industrial manufacturing is often constrained by significant drawbacks such as the use of memory banks which leads to a substantial increase in terms of memory footprint and inference times. We propose a novel light and fast framework that learns to map features from one modality to the other on nominal samples and detect anomalies by pinpointing inconsistencies between observed and mapped features. Extensive experiments show that our approach achieves state-of-the-art detection and segmentation performance in both the standard and few-shot settings on the MVTec 3D-AD dataset while achieving faster inference and occupying less memory than previous multimodal AD methods. Furthermore we propose a layer pruning technique to improve memory and time efficiency with a marginal sacrifice in performance. + + + + FFF: Fixing Flawed Foundations in Contrastive Pre-Training Results in Very Strong Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Bulat_FFF_Fixing_Flawed_Foundations_in_Contrastive_Pre-Training_Results_in_Very_CVPR_2024_paper.pdf + Despite noise and caption quality having been acknowledged as important factors impacting vision-language contrastive pre-training in this paper we show that the full potential of improving the training process by addressing such issues is yet to be realized. Specifically we firstly study and analyze two issues affecting training: incorrect assignment of negative pairs and low caption quality and diversity. Then we devise effective solutions for addressing both problems which essentially require training with multiple true positive pairs. Finally we propose training with sigmoid loss to address such a requirement. We show very large gains over the current state-of-the-art for both image recognition ( +6% on average over 11 datasets) and image retrieval ( +19% on Flickr30k and +15% on MSCOCO). + + + + Low-power Continuous Remote Behavioral Localization with Event Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Hamann_Low-power_Continuous_Remote_Behavioral_Localization_with_Event_Cameras_CVPR_2024_paper.pdf + Researchers in natural science need reliable methods for quantifying animal behavior. Recently numerous computer vision methods emerged to automate the process. However observing wild species at remote locations remains a challenging task due to difficult lighting conditions and constraints on power supply and data storage. Event cameras offer unique advantages for battery-dependent remote monitoring due to their low power consumption and high dynamic range capabilities. We use this novel sensor to quantify a behavior in Chinstrap penguins called ecstatic display. We formulate the problem as a temporal action detection task determining the start and end times of the behavior. 
For this purpose we recorded a colony of breeding penguins in Antarctica for several weeks and labeled event data on 16 nests. The developed method consists of a generator of candidate time intervals (proposals) and a classifier of the actions within them. The experiments show that the event cameras' natural response to motion is effective for continuous behavior monitoring and detection reaching a mean average precision (mAP) of 58% (which increases to 63% in good weather conditions). The results also demonstrate the robustness against various lighting conditions contained in the challenging dataset. The low-power capabilities of the event camera allow it to record significantly longer than with a conventional camera. This work pioneers the use of event cameras for remote wildlife observation opening new interdisciplinary opportunities. https://tub-rip.github.io/eventpenguins/ + + + + SportsHHI: A Dataset for Human-Human Interaction Detection in Sports Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_SportsHHI_A_Dataset_for_Human-Human_Interaction_Detection_in_Sports_Videos_CVPR_2024_paper.pdf + Video-based visual relation detection tasks such as video scene graph generation play important roles in fine-grained video understanding. However current video visual relation detection datasets have two main limitations that hinder the progress of research in this area. First they do not explore complex human-human interactions in multi-person scenarios. Second the relation types of existing datasets have relatively low-level semantics and can be often recognized by appearance or simple prior information without the need for detailed spatio-temporal context reasoning. Nevertheless comprehending high-level interactions between humans is crucial for understanding complex multi-person videos such as sports and surveillance videos. To address this issue we propose a new video visual relation detection task: video human-human interaction detection and build a dataset named SportsHHI for it. SportsHHI contains 34 high-level interaction classes from basketball and volleyball sports. 118075 human bounding boxes and 50649 interaction instances are annotated on 11398 keyframes. To benchmark this we propose a two-stage baseline method and conduct extensive experiments to reveal the key factors for a successful human-human interaction detector. We hope that SportsHHI can stimulate research on human interaction understanding in videos and promote the development of spatio-temporal context modeling techniques in video visual relation detection. + + + + CrowdDiff: Multi-hypothesis Crowd Density Estimation using Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Ranasinghe_CrowdDiff_Multi-hypothesis_Crowd_Density_Estimation_using_Diffusion_Models_CVPR_2024_paper.pdf + Crowd counting is a fundamental problem in crowd analysis which is typically accomplished by estimating a crowd density map and summing over the density values. However this approach suffers from background noise accumulation and loss of density due to the use of broad Gaussian kernels to create the ground truth density maps. This issue can be overcome by narrowing the Gaussian kernel. However existing approaches perform poorly when trained with ground truth density maps with broad kernels. To deal with this limitation we propose using conditional diffusion models to predict density maps as diffusion models show high fidelity to training data during generation. 
With that we present CrowdDiff that generates the crowd density map as a reverse diffusion process. Furthermore as the intermediate time steps of the diffusion process are noisy we incorporate a regression branch for direct crowd estimation only during training to improve the feature learning. In addition owing to the stochastic nature of the diffusion model we introduce producing multiple density maps to improve the counting performance contrary to the existing crowd counting pipelines. We conduct extensive experiments on publicly available datasets to validate the effectiveness of our method. CrowdDiff outperforms existing state-of-the-art crowd counting methods on several public crowd analysis benchmarks with significant improvements. CrowdDiff project is available at: https://dylran.github.io/crowddiff.github.io. + + + + Diffusion-FOF: Single-View Clothed Human Reconstruction via Diffusion-Based Fourier Occupancy Field + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Diffusion-FOF_Single-View_Clothed_Human_Reconstruction_via_Diffusion-Based_Fourier_Occupancy_Field_CVPR_2024_paper.pdf + Reconstructing a clothed human from a single-view image has several challenging issues including flexibly representing various body shapes and poses estimating complete 3D geometry and consistent texture and achieving more fine-grained details. To address them we propose a new diffusion-based Fourier occupancy field method to improve the human representing ability and the geometry generating ability. First we estimate the back-view image from the given reference image by incorporating a style consistency constraint. Then we extract multi-scale features of the two images as conditional and design a diffusion model to generate the Fourier occupancy field in the wavelet domain. We refine the initial estimated Fourier occupancy field with image features as conditions to improve the geometric accuracy. Finally the reference and estimated back-view images are mapped onto the human model creating a textured clothed human model. Substantial experiments are conducted and the experimental results show that our method outperforms the state-of-the-art methods in geometry and texture reconstruction performance. + + + + ToNNO: Tomographic Reconstruction of a Neural Network's Output for Weakly Supervised Segmentation of 3D Medical Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Schmidt-Mengin_ToNNO_Tomographic_Reconstruction_of_a_Neural_Networks_Output_for_Weakly_CVPR_2024_paper.pdf + Annotating lots of 3D medical images for training segmentation models is time-consuming. The goal of weakly supervised semantic segmentation is to train segmentation models without using any ground truth segmentation masks. Our work addresses the case where only image-level categorical labels indicating the presence or absence of a particular region of interest (such as tumours or lesions) are available. Most existing methods rely on class activation mapping (CAM). We propose a novel approach ToNNO which is based on the Tomographic reconstruction of a Neural Network's Output. Our technique extracts stacks of slices with different angles from the input 3D volume feeds these slices to a 2D encoder and applies the inverse Radon transform in order to reconstruct a 3D heatmap of the encoder's predictions. This generic method allows us to perform dense prediction tasks on 3D volumes using any 2D image encoder. 
We apply it to weakly supervised medical image segmentation by training the 2D encoder to output high values for slices containing the regions of interest. We test it on four large scale medical image datasets and outperform 2D CAM methods. We then extend ToNNO by combining tomographic reconstruction with CAM methods proposing Averaged CAM and Tomographic CAM which obtain even better results. + + + + Learning to Navigate Efficiently and Precisely in Real Environments + http://openaccess.thecvf.com//content/CVPR2024/papers/Bono_Learning_to_Navigate_Efficiently_and_Precisely_in_Real_Environments_CVPR_2024_paper.pdf + In the context of autonomous navigation of terrestrial robots the creation of realistic models for agent dynamics and sensing is a widespread habit in the robotics literature and in commercial applications where they are used for model based control and/or for localization and mapping. The more recent Embodied AI literature on the other hand focuses on modular or end-to-end agents trained in simulators like Habitat or AI-Thor where the emphasis is put on photo-realistic rendering and scene diversity but high-fidelity robot motion is assigned a less privileged role. The resulting sim2real gap significantly impacts transfer of the trained models to real robotic platforms. In this work we explore end-to-end training of agents in simulation in settings which minimize the sim2real gap both in sensing and in actuation. Our agent directly predicts (discretized) velocity commands which are maintained through closed-loop control in the real robot. The behavior of the real robot (including the underlying low-level controller) is identified and simulated in a modified Habitat simulator. Noise models for odometry and localization further contribute in lowering the sim2real gap. We evaluate on real navigation scenarios explore different localization and point goal calculation methods and report significant gains in performance and robustness compared to prior work. + + + + VkD: Improving Knowledge Distillation using Orthogonal Projections + http://openaccess.thecvf.com//content/CVPR2024/papers/Miles_VkD_Improving_Knowledge_Distillation_using_Orthogonal_Projections_CVPR_2024_paper.pdf + Knowledge distillation is an effective method for training small and efficient deep learning models. However the efficacy of a single method can degenerate when transferring to other tasks modalities or even other architectures. To address this limitation we propose a novel constrained feature distillation method. This method is derived from a small set of core principles which results in two emerging components: an orthogonal projection and a task-specific normalisation. Equipped with both of these components our transformer models can outperform all previous methods on ImageNet and reach up to a 4.4% relative improvement over the previous state-of-the-art methods. To further demonstrate the generality of our method we apply it to object detection and image generation whereby we obtain consistent and substantial performance improvements over state-of-the-art. Code and models are publicly available. 
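The orthogonal-projection idea in the VkD abstract above can be illustrated compactly. Below is a minimal Python sketch, assuming PyTorch, that distills student features onto a teacher's feature space through a semi-orthogonal map obtained by QR decomposition; the layer sizes, the layer-norm stand-in for the task-specific normalisation and the MSE objective are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch, assuming PyTorch: feature distillation through a semi-orthogonal projection.
import torch
import torch.nn.functional as F

class OrthogonalProjector(torch.nn.Module):
    """Maps student features to the teacher dimension along orthonormal directions."""
    def __init__(self, d_student: int, d_teacher: int):
        super().__init__()
        assert d_teacher >= d_student
        self.weight = torch.nn.Parameter(torch.randn(d_teacher, d_student))

    def forward(self, x):
        # QR re-orthogonalises the parameter on every call, so the resulting map
        # preserves distances between student features.
        q, _ = torch.linalg.qr(self.weight)   # (d_teacher, d_student), orthonormal columns
        return x @ q.t()                      # (batch, d_teacher)

def distill_loss(student_feats, teacher_feats, projector):
    t = F.layer_norm(teacher_feats, teacher_feats.shape[-1:])   # simple normalisation stand-in
    return F.mse_loss(projector(student_feats), t)

# Toy usage with random tensors standing in for backbone outputs.
proj = OrthogonalProjector(d_student=384, d_teacher=768)
loss = distill_loss(torch.randn(8, 384), torch.randn(8, 768), proj)
loss.backward()
```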
+ + + + LaRE^2: Latent Reconstruction Error Based Method for Diffusion-Generated Image Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_LaRE2_Latent_Reconstruction_Error_Based_Method_for_Diffusion-Generated_Image_Detection_CVPR_2024_paper.pdf + The evolution of Diffusion Models has dramatically improved image generation quality making it increasingly difficult to differentiate between real and generated images. This development while impressive also raises significant privacy and security concerns. In response to this we propose a novel Latent REconstruction error guided feature REfinement method (LaRE^2) for detecting the diffusion-generated images. We come up with the Latent Reconstruction Error (LaRE) the first reconstruction-error based feature in the latent space for generated image detection. LaRE surpasses existing methods in terms of feature extraction efficiency while preserving crucial cues required to differentiate between the real and the fake. To exploit LaRE we propose an Error-Guided feature REfinement module (EGRE) which can refine the image feature guided by LaRE to enhance the discriminativeness of the feature. Our EGRE utilizes an align-then-refine mechanism which effectively refines the image feature for generated-image detection from both spatial and channel perspectives. Extensive experiments on the large-scale GenImage benchmark demonstrate the superiority of our LaRE^2 which surpasses the best SoTA method by up to 11.9%/12.1% average ACC/AP across 8 different image generators. LaRE also surpasses existing methods in terms of feature extraction cost delivering an impressive speed enhancement of 8 times. + + + + T4P: Test-Time Training of Trajectory Prediction via Masked Autoencoder and Actor-specific Token Memory + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_T4P_Test-Time_Training_of_Trajectory_Prediction_via_Masked_Autoencoder_and_CVPR_2024_paper.pdf + Trajectory prediction is a challenging problem that requires considering interactions among multiple actors and the surrounding environment. While data-driven approaches have been used to address this complex problem they suffer from unreliable predictions under distribution shifts during test time. Accordingly several online learning methods have been proposed using regression loss from the ground truth of observed data leveraging the auto-labeling nature of trajectory prediction task. We mainly tackle the following two issues. First previous works underfit and overfit as they only optimize the last layer of motion decoder. To this end we employ the masked autoencoder (MAE) for representation learning to encourage complex interaction modeling in shifted test distribution for updating deeper layers. Second utilizing the sequential nature of driving data we propose an actor-specific token memory that enables the test-time learning of actor-wise motion characteristics. Our proposed method has been validated across various challenging cross-dataset distribution shift scenarios including nuScenes Lyft Waymo and Interaction. Our method surpasses the performance of existing state-of-the-art online learning methods in terms of both prediction accuracy and computational efficiency. The code is available at https://github.com/daeheepark/T4P. 
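The latent-reconstruction-error feature described in the LaRE^2 abstract above can be sketched model-agnostically. In the Python sketch below, `encode_to_latent` and `denoise_step` are hypothetical hooks standing in for a pre-trained VAE encoder and a diffusion denoiser; the toy noise schedule and per-channel pooling are assumptions, not the paper's recipe.

```python
# Minimal sketch, assuming PyTorch, of a latent reconstruction error used as a detection feature.
import torch

@torch.no_grad()
def latent_reconstruction_error(images, encode_to_latent, denoise_step, t=500, seed=0):
    """Return a coarse per-sample error feature that a small detector head could consume."""
    g = torch.Generator().manual_seed(seed)
    z0 = encode_to_latent(images)                        # (B, C, H, W) clean latents (hypothetical hook)
    noise = torch.randn(z0.shape, generator=g)
    alpha = 1.0 - t / 1000.0                             # toy linear schedule over 1000 steps
    zt = alpha ** 0.5 * z0 + (1.0 - alpha) ** 0.5 * noise
    z0_hat = denoise_step(zt, t)                         # one-step estimate of the clean latent (hypothetical hook)
    return ((z0 - z0_hat) ** 2).mean(dim=(2, 3))         # (B, C) latent reconstruction error
```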
+ + + + InstaGen: Enhancing Object Detection by Training on Synthetic Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_InstaGen_Enhancing_Object_Detection_by_Training_on_Synthetic_Dataset_CVPR_2024_paper.pdf + In this paper we present a novel paradigm to enhance the ability of object detector e.g. expanding categories or improving detection performance by training on synthetic dataset generated from diffusion models. Specifically we integrate an instance-level grounding head into a pre-trained generative diffusion model to augment it with the ability of localising instances in the generated images. The grounding head is trained to align the text embedding of category names with the regional visual feature of the diffusion model using supervision from an off-the-shelf object detector and a novel self-training scheme on (novel) categories not covered by the detector. We conduct thorough experiments to show that this enhanced version of diffusion model termed as InstaGen can serve as a data synthesizer to enhance object detectors by training on its generated samples demonstrating superior performance over existing state-of-the-art methods in open-vocabulary (+4.5 AP) and data-sparse (+1.2 ~ 5.2 AP) scenarios. + + + + Visual Point Cloud Forecasting enables Scalable Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Visual_Point_Cloud_Forecasting_enables_Scalable_Autonomous_Driving_CVPR_2024_paper.pdf + In contrast to extensive studies on general vision, pre-training for scalable visual autonomous driving remains seldom explored. Visual autonomous driving applications require features encompassing semantics 3D geometry and temporal information simultaneously for joint perception prediction and planning posing dramatic challenges for pre-training. To resolve this we bring up a new pre-training task termed as visual point cloud forecasting - predicting future point clouds from historical visual input. The key merit of this task captures the synergic learning of semantics 3D structures and temporal dynamics. Hence it shows superiority in various downstream tasks. To cope with this new problem we present ViDAR a general model to pre-train downstream visual encoders. It first extracts historical embeddings by the encoder. These representations are then transformed to 3D geometric space via a novel Latent Rendering operator for future point cloud prediction. Experiments show significant gain in downstream tasks e.g. 3.1% NDS on 3D detection 10% error reduction on motion forecasting and 15% less collision rate on planning. + + + + Synthesize Step-by-Step: Tools Templates and LLMs as Data Generators for Reasoning-Based Chart VQA + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Synthesize_Step-by-Step_Tools_Templates_and_LLMs_as_Data_Generators_for_CVPR_2024_paper.pdf + Understanding data visualizations like charts and plots requires reasoning about both visual elements and numerics. Although strong in extractive questions current chart visual question answering (chart VQA) models suffer on complex reasoning questions. In this work we address the lack of reasoning ability by data augmentation. We leverage Large Language Models (LLMs) which have been shown to have strong reasoning ability as an automatic data annotator that generates question-answer annotations for chart images. 
The key innovation in our method lies in the Synthesize Step-by-Step strategy: our LLM-based data generator learns to decompose the complex question into step-by-step sub-questions (rationales) which are then used to derive the final answer using external tools i.e. Python. This step-wise generation procedure is trained on synthetic data generated using a template-based QA generation pipeline. Experimental results highlight the significance of the proposed step-by-step generation. By training with the LLM-augmented data (LAMENDA) we significantly enhance the chart VQA models achieving the state-of-the-art accuracy on the ChartQA and PlotQA datasets. In particular our approach improves the accuracy of the previous state-of-the-art approach from 38% to 54% on the human-written questions in the ChartQA dataset which needs strong reasoning. We hope our work underscores the potential of synthetic data and encourages further exploration of data augmentation using LLMs for reasoning-heavy tasks. + + + + LayoutLLM: Layout Instruction Tuning with Large Language Models for Document Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_LayoutLLM_Layout_Instruction_Tuning_with_Large_Language_Models_for_Document_CVPR_2024_paper.pdf + Recently leveraging large language models (LLMs) or multimodal large language models (MLLMs) for document understanding has been proven very promising. However previous works that employ LLMs/MLLMs for document understanding have not fully explored and utilized the document layout information which is vital for precise document understanding. In this paper we propose LayoutLLM an LLM/MLLM based method for document understanding. The core of LayoutLLM is a layout instruction tuning strategy which is specially designed to enhance the comprehension and utilization of document layouts. The proposed layout instruction tuning strategy consists of two components: Layout-aware Pre-training and Layout-aware Supervised Fine-tuning. To capture the characteristics of document layout in Layout-aware Pre-training three groups of pre-training tasks corresponding to document-level region-level and segment-level information are introduced. Furthermore a novel module called layout chain-of-thought (LayoutCoT) is devised to enable LayoutLLM to focus on regions relevant to the question and generate accurate answers. LayoutCoT is effective for boosting the performance of document understanding. Meanwhile it brings a certain degree of interpretability which could facilitate manual inspection and correction. Experiments on standard benchmarks show that the proposed LayoutLLM significantly outperforms existing methods that adopt open-source 7B LLMs/MLLMs for document understanding. + + + + ProTeCt: Prompt Tuning for Taxonomic Open Set Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_ProTeCt_Prompt_Tuning_for_Taxonomic_Open_Set_Classification_CVPR_2024_paper.pdf + Visual-language foundation models like CLIP learn generalized representations that enable zero-shot open-set classification. Few-shot adaptation methods based on prompt tuning have been shown to further improve performance on downstream datasets. However these methods do not fare well in the taxonomic open set (TOS) setting where the classifier is asked to make prediction from label set across different levels of semantic granularity. Frequently they infer incorrect labels at coarser taxonomic class levels even when the inference at the leaf level (original class labels) is correct. 
To address this problem we propose a prompt tuning technique that calibrates the hierarchical consistency of model predictions. A set of metrics of hierarchical consistency the Hierarchical Consistent Accuracy (HCA) and the Mean Treecut Accuracy (MTA) are first proposed to evaluate TOS model performance. A new Prompt Tuning for Hierarchical Consistency (ProTeCt) technique is then proposed to calibrate classification across label set granularities. Results show that ProTeCt can be combined with existing prompt tuning methods to significantly improve TOS classification without degrading the leaf level classification performance. + + + + Masked Autoencoders for Microscopy are Scalable Learners of Cellular Biology + http://openaccess.thecvf.com//content/CVPR2024/papers/Kraus_Masked_Autoencoders_for_Microscopy_are_Scalable_Learners_of_Cellular_Biology_CVPR_2024_paper.pdf + Featurizing microscopy images for use in biological research remains a significant challenge especially for large-scale experiments spanning millions of images. This work explores the scaling properties of weakly supervised classifiers and self-supervised masked autoencoders (MAEs) when training with increasingly larger model backbones and microscopy datasets. Our results show that ViT-based MAEs outperform weakly supervised classifiers on a variety of tasks achieving as much as an 11.5% relative improvement when recalling known biological relationships curated from public databases. Additionally we develop a new channel-agnostic MAE architecture (CA-MAE) that allows for inputting images of different numbers and orders of channels at inference time. We demonstrate that CA-MAEs effectively generalize by inferring and evaluating on a microscopy image dataset (JUMP-CP) generated under different experimental conditions with a different channel structure than our pretraining data (RPI-93M). Our findings motivate continued research into scaling self-supervised learning on microscopy data in order to create powerful foundation models of cellular biology that have the potential to catalyze advancements in drug discovery and beyond. + + + + Segment and Caption Anything + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Segment_and_Caption_Anything_CVPR_2024_paper.pdf + We propose a method to efficiently equip the Segment Anything Model (SAM) with the ability to generate regional captions. SAM presents strong generalizability to segment anything while falling short of semantic understanding. By introducing a lightweight query-based feature mixer we align the region-specific features with the embedding space of language models for later caption generation. As the number of trainable parameters is small (typically in the order of tens of millions) it costs less computation less memory usage and less communication bandwidth resulting in both fast and scalable training. To address the scarcity problem of regional caption data we propose to first pre-train our model on object detection and segmentation tasks. We call this step weak supervision pretraining since the pretraining data only contains category names instead of full-sentence descriptions. The weak supervision pretraining allows us to leverage many publicly available object detection and segmentation datasets. We conduct extensive experiments to demonstrate the superiority of our method and validate each design choice. 
This work serves as a stepping stone towards scaling up regional captioning data and sheds light on exploring efficient ways to augment SAM with regional semantics. + + + + Adversarial Backdoor Attack by Naturalistic Data Poisoning on Trajectory Prediction in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Pourkeshavarz_Adversarial_Backdoor_Attack_by_Naturalistic_Data_Poisoning_on_Trajectory_Prediction_CVPR_2024_paper.pdf + In autonomous driving behavior prediction is fundamental for safe motion planning hence the security and robustness of prediction models against adversarial attacks are of paramount importance. We propose a novel adversarial backdoor attack against trajectory prediction models as a means of studying their potential vulnerabilities. Our attack affects the victim at training time via naturalistic hence stealthy poisoned samples crafted using a novel two-step approach. First the triggers are crafted by perturbing the trajectory of attacking vehicle and then disguised by transforming the scene using a bi-level optimization technique. The proposed attack does not depend on a particular model architecture and operates in a black-box manner thus can be effective without any knowledge of the victim model. We conduct extensive empirical studies using state-of-the-art prediction models on two benchmark datasets using metrics customized for trajectory prediction. We show that the proposed attack is highly effective as it can significantly hinder the performance of prediction models unnoticeable by the victims and efficient as it forces the victim to generate malicious behavior even under constrained conditions. Via ablative studies we analyze the impact of different attack design choices followed by an evaluation of existing defence mechanisms against the proposed attack. + + + + Low-Rank Approximation for Sparse Attention in Multi-Modal LLMs + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_Low-Rank_Approximation_for_Sparse_Attention_in_Multi-Modal_LLMs_CVPR_2024_paper.pdf + This paper focuses on the high computational complexity in Large Language Models (LLMs) a significant challenge in both natural language processing (NLP) and multi-modal tasks. We propose Low-Rank Approximation for Sparse Attention (LoRA-Sparse) an innovative approach that strategically reduces this complexity. LoRA-Sparse introduces low-rank linear projection layers for sparse attention approximation. It utilizes an order-mimic training methodology which is crucial for efficiently approximating the self-attention mechanism in LLMs. We empirically show that sparse attention not only reduces computational demands but also enhances model performance in both NLP and multi-modal tasks. This surprisingly shows that redundant attention in LLMs might be non-beneficial. We extensively validate LoRA-Sparse through rigorous empirical studies in both NLP and multi-modal tasks demonstrating its effectiveness and general applicability. Based on LLaMA and LLaVA models our methods can reduce more than half of the self-attention computation with even better performance than full-attention baselines. + + + + TASeg: Temporal Aggregation Network for LiDAR Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_TASeg_Temporal_Aggregation_Network_for_LiDAR_Semantic_Segmentation_CVPR_2024_paper.pdf + Training deep models for LiDAR semantic segmentation is challenging due to the inherent sparsity of point clouds. 
Utilizing temporal data is a natural remedy against the sparsity problem as it makes the input signal denser. However previous multi-frame fusion algorithms fall short in utilizing sufficient temporal information due to the memory constraint and they also ignore the informative temporal images. To fully exploit rich information hidden in long-term temporal point clouds and images we present the Temporal Aggregation Network termed TASeg. Specifically we propose a Temporal LiDAR Aggregation and Distillation (TLAD) algorithm which leverages historical priors to assign different aggregation steps for different classes. It can largely reduce memory and time overhead while achieving higher accuracy. Besides TLAD trains a teacher injected with gt priors to distill the model further boosting the performance. To make full use of temporal images we design a Temporal Image Aggregation and Fusion (TIAF) module which can greatly expand the camera FOV and enhance the present features. Temporal LiDAR points in the camera FOV are used as mediums to transform temporal image features to the present coordinate for temporal multi-modal fusion. Moreover we develop a Static-Moving Switch Augmentation (SMSA) algorithm which utilizes sufficient temporal information to enable objects to switch their motion states freely thus greatly increasing static and moving training samples. Our TASeg ranks 1st on three challenging tracks i.e. SemanticKITTI single-scan track multi-scan track and nuScenes LiDAR segmentation track strongly demonstrating the superiority of our method. Codes are available at https://github.com/LittlePey/TASeg. + + + + Bootstrapping SparseFormers from Vision Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Bootstrapping_SparseFormers_from_Vision_Foundation_Models_CVPR_2024_paper.pdf + The recently proposed SparseFormer architecture provides an alternative approach to visual understanding by utilizing a significantly lower number of visual tokens via adjusting RoIs greatly reducing computational costs while still achieving promising performance. However training SparseFormers from scratch is still expensive and scaling up the number of parameters can be challenging. In this paper we propose to bootstrap SparseFormers from ViT-based vision foundation models in a simple and efficient way. Since the majority of SparseFormer blocks are the standard transformer ones we can inherit weights from large-scale pre-trained vision transformers and freeze them as much as possible. Therefore we only need to train the SparseFormer-specific lightweight focusing transformer to adjust token RoIs and fine-tune a few early pre-trained blocks to align the final token representation. In such a way we can bootstrap SparseFormer architectures from various large-scale pre-trained models (e.g. IN-21K pre-trained AugRegs or CLIPs) using a rather smaller amount of training samples (e.g. IN-1K) and without labels or captions within just a few hours. As a result the bootstrapped unimodal SparseFormer (from AugReg-ViT-L/16-384) can reach 84.9% accuracy on IN-1K with only 49 tokens and the multimodal SparseFormer from CLIPs also demonstrates notable zero-shot performance with highly reduced computational cost without seeing any caption during the bootstrapping procedure. In addition CLIP-bootstrapped SparseFormers which align the output space with language without seeing a word can serve as efficient vision encoders in multimodal large language models. 
Code and models are available at https://github.com/showlab/sparseformer + + + + EventPS: Real-Time Photometric Stereo Using an Event Camera + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_EventPS_Real-Time_Photometric_Stereo_Using_an_Event_Camera_CVPR_2024_paper.pdf + Photometric stereo is a well-established technique to estimate the surface normal of an object. However the requirement of capturing multiple high dynamic range images under different illumination conditions limits the speed and real-time applications. This paper introduces EventPS a novel approach to real-time photometric stereo using an event camera. Capitalizing on the exceptional temporal resolution dynamic range and low bandwidth characteristics of event cameras EventPS estimates surface normal only from the radiance changes significantly enhancing data efficiency. EventPS seamlessly integrates with both optimization-based and deep-learning-based photometric stereo techniques to offer a robust solution for non-Lambertian surfaces. Extensive experiments validate the effectiveness and efficiency of EventPS compared to frame-based counterparts. Our algorithm runs at over 30 fps in real-world scenarios unleashing the potential of EventPS in time-sensitive and high-speed downstream applications. + + + + On the Road to Portability: Compressing End-to-End Motion Planner for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_On_the_Road_to_Portability_Compressing_End-to-End_Motion_Planner_for_CVPR_2024_paper.pdf + End-to-end motion planning models equipped with deep neural networks have shown great potential for enabling full autonomous driving. However the oversized neural networks render them impractical for deployment on resource-constrained systems which unavoidably requires more computational time and resources during inference. To handle this knowledge distillation offers a promising approach that compresses models by enabling a smaller student model to learn from a larger teacher model. Nevertheless how to apply knowledge distillation to compress motion planners has not been explored so far. In this paper we propose PlanKD the first knowledge distillation framework tailored for compressing end-to-end motion planners. First considering that driving scenes are inherently complex often containing planning-irrelevant or even noisy information transferring such information is not beneficial for the student planner. Thus we design an information bottleneck based strategy to only distill planning-relevant information rather than transfer all information indiscriminately. Second different waypoints in an output planned trajectory may hold varying degrees of importance for motion planning where a slight deviation in certain crucial waypoints might lead to a collision. Therefore we devise a safety-aware waypoint-attentive distillation module that assigns adaptive weights to different waypoints based on the importance to encourage the student to accurately mimic more crucial waypoints thereby improving overall safety. Experiments demonstrate that our PlanKD can boost the performance of smaller planners by a large margin and significantly reduce their inference time. 
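The waypoint-attentive distillation described in the PlanKD abstract above can be sketched as a weighted trajectory-matching loss. In the Python sketch below, the small weighting head and the plain L2 objective are illustrative assumptions; the information-bottleneck branch described in the abstract is omitted.

```python
# Minimal sketch, assuming PyTorch, of a safety-aware waypoint-weighted distillation loss.
import torch

class WaypointWeigher(torch.nn.Module):
    def __init__(self, d_scene: int, n_waypoints: int):
        super().__init__()
        self.score = torch.nn.Linear(d_scene, n_waypoints)

    def forward(self, scene_feat):
        # One importance weight per waypoint, normalised over the planned trajectory.
        return torch.softmax(self.score(scene_feat), dim=-1)        # (B, T)

def waypoint_distill_loss(student_traj, teacher_traj, scene_feat, weigher):
    # student_traj / teacher_traj: (B, T, 2) planned (x, y) waypoints.
    w = weigher(scene_feat)
    per_waypoint = ((student_traj - teacher_traj) ** 2).sum(dim=-1)  # (B, T)
    return (w * per_waypoint).sum(dim=-1).mean()

# Toy usage with random tensors standing in for planner outputs and scene features.
weigher = WaypointWeigher(d_scene=256, n_waypoints=10)
loss = waypoint_distill_loss(torch.randn(4, 10, 2), torch.randn(4, 10, 2),
                             torch.randn(4, 256), weigher)
loss.backward()
```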
+ + + + PredToken: Predicting Unknown Tokens and Beyond with Coarse-to-Fine Iterative Decoding + http://openaccess.thecvf.com//content/CVPR2024/papers/Nie_PredToken_Predicting_Unknown_Tokens_and_Beyond_with_Coarse-to-Fine_Iterative_Decoding_CVPR_2024_paper.pdf + Predictive learning models which aim to predict future frames based on past observations are crucial to constructing world models. These models need to maintain low-level consistency and capture high-level dynamics in unannotated spatiotemporal data. Transitioning from frame-wise to token-wise prediction presents a viable strategy for addressing these needs. How to improve token representation and optimize token decoding presents significant challenges. This paper introduces PredToken a novel predictive framework that addresses these issues by decoupling space-time tokens into distinct components for iterative cascaded decoding. Concretely we first design a "decomposition quantization and reconstruction" schema based on VQGAN to improve the token representation. This scheme disentangles low- and high-frequency representations and employs a dimension-aware quantization model allowing more low-level details to be preserved. Building on this we present a "coarse-to-fine iterative decoding" method. It leverages dynamic soft decoding to refine coarse tokens and static soft decoding for fine tokens enabling more high-level dynamics to be captured. These designs make PredToken produce high-quality predictions. Extensive experiments demonstrate the superiority of our method on various real-world spatiotemporal predictive benchmarks. Furthermore PredToken can also be extended to other visual generative tasks to yield realistic outcomes. + + + + FairCLIP: Harnessing Fairness in Vision-Language Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_FairCLIP_Harnessing_Fairness_in_Vision-Language_Learning_CVPR_2024_paper.pdf + Fairness is a critical concern in deep learning especially in healthcare where these models influence diagnoses and treatment decisions. Although fairness has been investigated in the vision-only domain the fairness of medical vision-language (VL) models remains unexplored due to the scarcity of medical VL datasets for studying fairness. To bridge this research gap we introduce the first fair vision-language medical dataset (Harvard-FairVLMed) that provides detailed demographic attributes ground-truth labels and clinical notes to facilitate an in-depth examination of fairness within VL foundation models. Using Harvard-FairVLMed we conduct a comprehensive fairness analysis of two widely-used VL models (CLIP and BLIP2) pre-trained on both natural and medical domains across four different protected attributes. Our results highlight significant biases in all VL models with Asian Male Non-Hispanic and Spanish being the preferred subgroups across the protected attributes of race gender ethnicity and language respectively. In order to alleviate these biases we propose FairCLIP an optimal-transport-based approach that achieves a favorable trade-off between performance and fairness by reducing the Sinkhorn distance between the overall sample distribution and the distributions corresponding to each demographic group. As the first VL dataset of its kind Harvard-FairVLMed holds the potential to catalyze advancements in the development of machine learning models that are both ethically aware and clinically effective. Our dataset and code are available at https://ophai.hms.harvard.edu/datasets/harvard-fairvlmed10k. 
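The Sinkhorn-distance regulariser mentioned in the FairCLIP abstract above can be illustrated with a small entropic optimal-transport routine. In the Python sketch below, the cosine cost, epsilon and iteration count are illustrative assumptions; it penalises the distance between a batch's features and those of one demographic subgroup.

```python
# Minimal sketch, assuming PyTorch, of a Sinkhorn-distance fairness regulariser.
import torch
import torch.nn.functional as F

def sinkhorn_distance(x, y, eps=0.1, n_iter=50):
    # x: (n, d), y: (m, d) feature sets treated as uniform empirical distributions.
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    cost = 1.0 - x @ y.t()                                   # cosine cost in [0, 2]
    a = torch.full((x.size(0),), 1.0 / x.size(0))
    b = torch.full((y.size(0),), 1.0 / y.size(0))
    K = torch.exp(-cost / eps)
    u = torch.ones_like(a)
    for _ in range(n_iter):                                  # Sinkhorn-Knopp iterations
        v = b / (K.t() @ u)
        u = a / (K @ v)
    plan = u.unsqueeze(1) * K * v.unsqueeze(0)               # entropic transport plan
    return (plan * cost).sum()

# Toy usage: random features stand in for image-text features of a batch and a subgroup.
batch_feats = torch.randn(64, 512)
group_feats = batch_feats[:16]                               # one protected subgroup (toy split)
fairness_reg = sinkhorn_distance(batch_feats, group_feats)   # added to the contrastive loss
```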
+ + + + StreamingFlow: Streaming Occupancy Forecasting with Asynchronous Multi-modal Data Streams via Neural Ordinary Differential Equation + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_StreamingFlow_Streaming_Occupancy_Forecasting_with_Asynchronous_Multi-modal_Data_Streams_via_CVPR_2024_paper.pdf + Predicting the future occupancy states of the surrounding environment is a vital task for autonomous driving. However current best-performing single-modality methods or multi-modality fusion perception methods are only able to predict uniform snapshots of future occupancy states and require strictly synchronized sensory data for sensor fusion. We propose a novel framework StreamingFlow to lift these strong limitations. StreamingFlow is a novel BEV occupancy predictor that ingests asynchronous multi-sensor data streams for fusion and performs streaming forecasting of the future occupancy map at any future timestamps. By integrating neural ordinary differential equations (N-ODE) into recurrent neural networks StreamingFlow learns derivatives of BEV features over temporal horizons updates the implicit sensor's BEV features as part of the fusion process and propagates BEV states to the desired future time point. It shows good zero-shot generalization ability of prediction reflected in the interpolation of the observed prediction time horizon and the reasonable inference of the unseen farther future period. Extensive experiments on two large-scale datasets nuScenes and Lyft L5 demonstrate that StreamingFlow significantly outperforms previous vision-based LiDAR-based methods and shows superior performance compared to state-of-the-art fusion-based methods. + + + + Language Model Guided Interpretable Video Action Reasoning + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Language_Model_Guided_Interpretable_Video_Action_Reasoning_CVPR_2024_paper.pdf + Although neural networks excel in video action recognition tasks their "black-box" nature makes it challenging to understand the rationale behind their decisions. Recent approaches used inherently interpretable models to analyze video actions in a manner akin to human reasoning. However it has been observed that these interpretable models tend to underperform when compared to their black-box counterparts. In this work we present a new framework called Language-guided Interpretable Action Recognition framework (LaIAR). This framework leverages knowledge from language models to enhance both the recognition capabilities and the interpretability of video models. In essence we reframe the challenge of understanding video model decisions as a task of aligning video and language models. Using the logical reasoning captured by the language model we steer the training of the video model. This integrated approach not only improves the video model's adaptability to different domains but also boosts its overall performance. Extensive experiments on Charades and CAD-120 datasets demonstrate the superior performance and interpretability of our proposed method. The code of LaIAR is available at https://github.com/NingWang2049/LaIAR. 
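The ODE-style propagation of BEV features described in the StreamingFlow abstract above can be sketched with explicit Euler integration. In the Python sketch below, a small conv net plays the role of the learned derivative and Euler steps replace a full ODE solver; both are illustrative assumptions rather than the paper's architecture.

```python
# Minimal sketch, assuming PyTorch, of propagating BEV features to an arbitrary future timestamp.
import torch

class BEVDerivative(torch.nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(channels, channels, 3, padding=1),
            torch.nn.ReLU(),
            torch.nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, bev):                  # approximates d(BEV)/dt
        return self.net(bev)

def propagate(bev, derivative, t0: float, t1: float, n_steps: int = 8):
    """Integrate BEV features from t0 to an arbitrary (possibly asynchronous) timestamp t1."""
    dt = (t1 - t0) / n_steps
    for _ in range(n_steps):
        bev = bev + dt * derivative(bev)     # explicit Euler step
    return bev

f = BEVDerivative()
bev_now = torch.randn(1, 64, 128, 128)                 # fused BEV state at t = 0.00 s
bev_future = propagate(bev_now, f, t0=0.0, t1=0.35)    # forecast at a non-uniform timestamp
```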
+ + + + See Say and Segment: Teaching LMMs to Overcome False Premises + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_See_Say_and_Segment_Teaching_LMMs_to_Overcome_False_Premises_CVPR_2024_paper.pdf + Current open-source Large Multimodal Models (LMMs) excel at tasks such as open-vocabulary language grounding and segmentation but can suffer under false premises when queries imply the existence of something that is not actually present in the image. We observe that existing methods that fine-tune an LMM to segment images significantly degrade their ability to reliably determine ("see") if an object is present and to interact naturally with humans ("say") a form of catastrophic forgetting. In this work we propose a cascading and joint training approach for LMMs to solve this task avoiding catastrophic forgetting of previous skills. Our resulting model can "see" by detecting whether objects are present in an image "say" by telling the user if they are not proposing alternative queries or correcting semantic errors in the query and finally "segment" by outputting the mask of the desired objects if they exist. Additionally we introduce a novel False Premise Correction benchmark dataset an extension of existing RefCOCO(+/g) referring segmentation datasets (which we call FP-RefCOCO(+/g)). The results show that our method not only detects false premises up to 55% better than existing approaches but under false premise conditions produces relative cIOU improvements of more than 31% over baselines and produces natural language feedback judged helpful up to 67% of the time. + + + + Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving? + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Is_Ego_Status_All_You_Need_for_Open-Loop_End-to-End_Autonomous_CVPR_2024_paper.pdf + End-to-end autonomous driving recently emerged as a promising research direction to target autonomy from a full-stack perspective. Along this line many of the latest works follow an open-loop evaluation setting on nuScenes to study the planning behavior. In this paper we delve deeper into the problem by conducting thorough analyses and demystifying more devils in the details. We initially observed that the nuScenes dataset characterized by relatively simple driving scenarios leads to an under-utilization of perception information in end-to-end models incorporating ego status such as the ego vehicle's velocity. These models tend to rely predominantly on the ego vehicle's status for future path planning. Beyond the limitations of the dataset we also note that current metrics do not comprehensively assess the planning quality leading to potentially biased conclusions drawn from existing benchmarks. To address this issue we introduce a new metric to evaluate whether the predicted trajectories adhere to the road. We further propose a simple baseline able to achieve competitive results without relying on perception annotations. Given the current limitations on the benchmark and metrics we suggest the community reassess relevant prevailing research and be cautious about whether the continued pursuit of state-of-the-art would yield convincing and universal conclusions. Code and models are available at https://github.com/NVlabs/BEV-Planner. 
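The road-adherence metric motivated in the "Is Ego Status All You Need..." abstract above can be approximated by counting how many predicted waypoints fall on a drivable-area mask. The Python sketch below uses NumPy; the grid resolution and origin are illustrative assumptions, not the paper's definition.

```python
# Minimal sketch, assuming NumPy, of a road-adherence rate for a planned trajectory.
import numpy as np

def on_road_rate(traj_xy, drivable_mask, resolution=0.5, origin=(-50.0, -50.0)):
    # traj_xy: (T, 2) ego-frame waypoints in metres; drivable_mask: (H, W) boolean BEV grid.
    cols = ((traj_xy[:, 0] - origin[0]) / resolution).astype(int)
    rows = ((traj_xy[:, 1] - origin[1]) / resolution).astype(int)
    inside = (rows >= 0) & (rows < drivable_mask.shape[0]) & \
             (cols >= 0) & (cols < drivable_mask.shape[1])
    hits = drivable_mask[rows[inside], cols[inside]]
    return float(hits.mean()) if hits.size else 0.0

mask = np.ones((200, 200), dtype=bool)                        # toy drivable area
traj = np.stack([np.linspace(0.0, 10.0, 6), np.zeros(6)], axis=1)
print(on_road_rate(traj, mask))                               # 1.0 for this toy example
```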
+ + + + CGI-DM: Digital Copyright Authentication for Diffusion Models via Contrasting Gradient Inversion + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_CGI-DM_Digital_Copyright_Authentication_for_Diffusion_Models_via_Contrasting_Gradient_CVPR_2024_paper.pdf + Diffusion Models (DMs) have evolved into advanced image generation tools especially for few-shot generation where a pre-trained model is fine-tuned on a small set of images to capture a specific style or object. Despite their success concerns exist about potential copyright violations stemming from the use of unauthorized data in this process. In response we present Contrasting Gradient Inversion for Diffusion Models (CGI-DM) a novel method featuring vivid visual representations for digital copyright authentication. Our approach involves removing partial information of an image and recovering missing details by exploiting conceptual differences between the pre-trained and fine-tuned models. We formulate the differences as KL divergence between latent variables of the two models when given the same input image which can be maximized through Monte Carlo sampling and Projected Gradient Descent (PGD). The similarity between original and recovered images serves as a strong indicator of potential infringements. Extensive experiments on the WikiArt and Dreambooth datasets demonstrate the high accuracy of CGI-DM in digital copyright authentication surpassing alternative validation techniques. Code implementation is available at https://github.com/Nicholas0228/Revelio. + + + + Making Visual Sense of Oracle Bones for You and Me + http://openaccess.thecvf.com//content/CVPR2024/papers/Qiao_Making_Visual_Sense_of_Oracle_Bones_for_You_and_Me_CVPR_2024_paper.pdf + Visual perception evolves over time. This is particularly the case for oracle bone scripts where visual glyphs that seemed intuitive to people in the distant past prove difficult to understand in contemporary eyes. While semantic correspondence of an oracle can be found via a dictionary lookup this proves to be not enough for public viewers to connect the dots i.e. why does this oracle mean that? A common solution relies on a laborious curation process to collect a visual guide for each oracle (Fig.1) which hinges on the case-by-case effort and taste of curators. This paper delves into one natural follow-up question: can AI take over? Beginning with a comprehensive human study we show participants could indeed make better sense of an oracle glyph subjected to a proper visual guide and its efficacy can be approximated via a novel metric termed TransOV (Transferable Oracle Visuals). We then define a new conditional visual generation task based on an oracle glyph and its semantic meaning and importantly approach it by circumventing any form of model training in the presence of a fatal lack of oracle data. At its heart is leveraging a foundation model like GPT-4V to reason about the visual cues hidden inside an oracle and take advantage of an existing text-to-image model for final visual guide generation. Extensive empirical evidence shows our AI-enabled visual guides achieve comparable TransOV performance to those collected under manual efforts. Finally we demonstrate the versatility of our system under a more complex setting where it is required to work alongside an AI image denoiser to cope with raw oracle scan image inputs (cf. processed clean oracle glyphs). Code is available at https://github.com/RQ-Lab/OBS-Visual. 
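The contrasting-gradient-inversion idea in the CGI-DM abstract above amounts to projected gradient ascent on the disagreement between two models. In the Python sketch below, `model_pre` and `model_ft` are hypothetical callables standing in for the pre-trained and fine-tuned backbones, the squared difference is a surrogate for the KL term, and the step sizes are illustrative assumptions.

```python
# Minimal sketch, assuming PyTorch, of PGD that maximises the gap between two models on a masked input.
import torch

def contrast_inversion(x_masked, model_pre, model_ft, steps=40, step_size=2/255, eps=8/255):
    x0 = x_masked.clone()
    x = x_masked.clone().requires_grad_(True)
    for _ in range(steps):
        gap = (model_ft(x) - model_pre(x)).pow(2).mean()   # surrogate for the divergence term
        grad, = torch.autograd.grad(gap, x)
        with torch.no_grad():
            x += step_size * grad.sign()                   # gradient ascent on the gap
            x.copy_(x0 + (x - x0).clamp(-eps, eps))        # project back into the L_inf ball
            x.clamp_(0.0, 1.0)
    return x.detach()                                      # compared against the original image
```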
+ + + + MOHO: Learning Single-view Hand-held Object Reconstruction with Multi-view Occlusion-Aware Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_MOHO_Learning_Single-view_Hand-held_Object_Reconstruction_with_Multi-view_Occlusion-Aware_Supervision_CVPR_2024_paper.pdf + Previous works concerning single-view hand-held object reconstruction typically rely on supervision from 3D ground-truth models which are hard to collect in real world. In contrast readily accessible hand-object videos offer a promising training data source but they only give heavily occluded object observations. In this paper we present a novel synthetic-to-real framework to exploit Multi-view Occlusion-aware supervision from hand-object videos for Hand-held Object reconstruction (MOHO) from a single image tackling two predominant challenges in such setting: hand-induced occlusion and object's self-occlusion. First in the synthetic pre-training stage we render a large-scaled synthetic dataset SOMVideo with hand-object images and multi-view occlusion-free supervisions adopted to address hand-induced occlusion in both 2D and 3D spaces. Second in the real-world finetuning stage MOHO leverages the amodal-mask-weighted geometric supervision to mitigate the unfaithful guidance caused by the hand-occluded supervising views in real world. Moreover domain-consistent occlusion-aware features are amalgamated in MOHO to resist object's self-occlusion for inferring the complete object shape. Extensive experiments on HO3D and DexYCB datasets demonstrate 2D-supervised MOHO gains superior results against 3D-supervised methods by a large margin. + + + + SecondPose: SE(3)-Consistent Dual-Stream Feature Fusion for Category-Level Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_SecondPose_SE3-Consistent_Dual-Stream_Feature_Fusion_for_Category-Level_Pose_Estimation_CVPR_2024_paper.pdf + Category-level object pose estimation aiming to predict the 6D pose and 3D size of objects from known categories typically struggles with large intra-class shape variation. Existing works utilizing mean shapes often fall short of capturing this variation. To address this issue we present SecondPose a novel approach integrating object-specific geometric features with semantic category priors from DINOv2. Leveraging the advantage of DINOv2 in providing SE(3)-consistent semantic features we hierarchically extract two types of SE(3)-invariant geometric features to further encapsulate local-to-global object-specific information. These geometric features are then point-aligned with DINOv2 features to establish a consistent object representation under SE(3) transformations facilitating the mapping from camera space to the pre-defined canonical space thus further enhancing pose estimation. Extensive experiments on NOCS-REAL275 demonstrate that SecondPose achieves a 12.4% leap forward over the state-of-the-art. Moreover on a more complex dataset HouseCat6D which provides photometrically challenging objects SecondPose still surpasses other competitors by a large margin. Code is released at https://github.com/NOrangeeroli/SecondPose.git. + + + + EgoGen: An Egocentric Synthetic Data Generator + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_EgoGen_An_Egocentric_Synthetic_Data_Generator_CVPR_2024_paper.pdf + Understanding the world in first-person view is fundamental in Augmented Reality (AR). This immersive perspective brings dramatic visual changes and unique challenges compared to third-person views. 
Synthetic data has empowered third-person-view vision models but its application to embodied egocentric perception tasks remains largely unexplored. A critical challenge lies in simulating natural human movements and behaviors that effectively steer the embodied cameras to capture a faithful egocentric representation of the 3D world. To address this challenge we introduce EgoGen a new synthetic data generator that can produce accurate and rich ground-truth training data for egocentric perception tasks. At the heart of EgoGen is a novel human motion synthesis model that directly leverages egocentric visual inputs of a virtual human to sense the 3D environment. Combined with collision-avoiding motion primitives and a two-stage reinforcement learning approach our motion synthesis model offers a closed-loop solution where the embodied perception and movement of the virtual human are seamlessly coupled. Compared to previous works our model eliminates the need for a pre-defined global path and is directly applicable to dynamic environments. Combined with our easy-to-use and scalable data generation pipeline we demonstrate EgoGen's efficacy in three tasks: mapping and localization for head-mounted cameras egocentric camera tracking and human mesh recovery from egocentric views. EgoGen will be fully open-sourced offering a practical solution for creating realistic egocentric training data and aiming to serve as a useful tool for egocentric computer vision research. + + + + Video ReCap: Recursive Captioning of Hour-Long Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Islam_Video_ReCap_Recursive_Captioning_of_Hour-Long_Videos_CVPR_2024_paper.pdf + Most video captioning models are designed to process short video clips of few seconds and output text describing low-level visual concepts (e.g. objects scenes atomic actions). However most real-world videos last for minutes or hours and have a complex hierarchical structure spanning different temporal granularities. We propose Video ReCap a recursive video captioning model that can process video inputs of dramatically different lengths (from 1 second to 2 hours) and output video captions at multiple hierarchy levels. The recursive video-language architecture exploits the synergy between different video hierarchies and can process hour-long videos efficiently. We utilize a curriculum learning training scheme to learn the hierarchical structure of videos starting from clip-level captions describing atomic actions then focusing on segment-level descriptions and concluding with generating summaries for hour-long videos. Furthermore we introduce Ego4D-HCap dataset by augmenting Ego4D with 8267 manually collected long-range video summaries. Our recursive model can flexibly generate captions at different hierarchy levels while also being useful for other complex video understanding tasks such as VideoQA on EgoSchema. Data code and models are publicly available at https://sites.google.com/view/vidrecap. + + + + Towards Realistic Scene Generation with LiDAR Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Ran_Towards_Realistic_Scene_Generation_with_LiDAR_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion models (DMs) excel in photo-realistic image synthesis but their adaptation to LiDAR scene generation poses a substantial hurdle. This is primarily because DMs operating in the point space struggle to preserve the curve-like patterns and 3D geometry of LiDAR scenes which consumes much of their representation power. 
In this paper we propose LiDAR Diffusion Models (LiDMs) to generate LiDAR-realistic scenes from a latent space tailored to capture the realism of LiDAR scenes by incorporating geometric priors into the learning pipeline. Our method targets three major desiderata: pattern realism, geometry realism and object realism. Specifically we introduce curve-wise compression to simulate real-world LiDAR patterns, point-wise coordinate supervision to learn scene geometry, and patch-wise encoding for a full 3D object context. With these three core designs our method achieves competitive performance on unconditional LiDAR generation in the 64-beam scenario and state-of-the-art performance on conditional LiDAR generation while maintaining high efficiency compared to point-based DMs (up to 107x faster). Furthermore by compressing LiDAR scenes into a latent space we enable the controllability of DMs with various conditions such as semantic maps, camera views and text prompts. Our code and pretrained weights are available at https://github.com/hancyran/LiDAR-Diffusion. + + + + Diffusion Reflectance Map: Single-Image Stochastic Inverse Rendering of Illumination and Reflectance + http://openaccess.thecvf.com//content/CVPR2024/papers/Enyo_Diffusion_Reflectance_Map_Single-Image_Stochastic_Inverse_Rendering_of_Illumination_and_CVPR_2024_paper.pdf + Reflectance bounds the frequency spectrum of illumination in the object appearance. In this paper we introduce the first stochastic inverse rendering method which recovers the attenuated frequency spectrum of an illumination jointly with the reflectance of an object of known geometry from a single image. Our key idea is to solve this blind inverse problem in the reflectance map, an appearance representation invariant to the underlying geometry, by learning to reverse the image formation with a novel diffusion model which we refer to as the Diffusion Reflectance Map Network (DRMNet). Given an observed reflectance map converted and completed from the single input image DRMNet generates a reflectance map corresponding to a perfect mirror sphere while jointly estimating the reflectance. The forward process can be understood as gradually filtering a natural illumination with lower and lower frequency reflectance and additive Gaussian noise. DRMNet learns to invert this process with two subnetworks IllNet and RefNet which work in concert towards this joint estimation. The network is trained on an extensive synthetic dataset and is demonstrated to generalize to real images showing state-of-the-art accuracy on established datasets. + + + + MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI + http://openaccess.thecvf.com//content/CVPR2024/papers/Yue_MMMU_A_Massive_Multi-discipline_Multimodal_Understanding_and_Reasoning_Benchmark_for_CVPR_2024_paper.pdf + We introduce MMMU: a new benchmark designed to evaluate multimodal models on massive multi-discipline tasks demanding college-level subject knowledge and deliberate reasoning. MMMU includes 11.5K meticulously collected multimodal questions from college exams, quizzes and textbooks, covering six core disciplines: Art & Design, Business, Science, Health & Medicine, Humanities & Social Science, and Tech & Engineering. These questions span 30 subjects and 183 subfields, comprising 30 highly heterogeneous image types such as charts, diagrams, maps, tables, music sheets and chemical structures.
Unlike existing benchmarks MMMU focuses on advanced perception and reasoning with domain-specific knowledge challenging models to perform tasks akin to those faced by experts. The evaluation of 28 open-source LMMs as well as the proprietary GPT-4V(ision) and Gemini highlights the substantial challenges posed by MMMU. Even the advanced GPT-4V and Gemini Ultra only achieve accuracies of 56% and 59% respectively indicating significant room for improvement. We believe MMMU will stimulate the community to build next-generation multimodal foundation models towards expert artificial general intelligence. + + + + EarthLoc: Astronaut Photography Localization by Indexing Earth from Space + http://openaccess.thecvf.com//content/CVPR2024/papers/Berton_EarthLoc_Astronaut_Photography_Localization_by_Indexing_Earth_from_Space_CVPR_2024_paper.pdf + Astronaut photography spanning six decades of human spaceflight presents a unique Earth observations dataset with immense value for both scientific research and disaster response. Despite its significance accurately localizing the geographical extent of these images crucial for effective utilization poses substantial challenges. Current manual localization efforts are time-consuming motivating the need for automated solutions. We propose a novel approach - leveraging image retrieval - to address this challenge efficiently. We introduce innovative training techniques including Year-Wise Data Augmentation and a Neutral-Aware Multi-Similarity Loss which contribute to the development of a high-performance model EarthLoc. We develop six evaluation datasets and perform a comprehensive benchmark comparing EarthLoc to existing methods showcasing its superior efficiency and accuracy. Our approach marks a significant advancement in automating the localization of astronaut photography which will help bridge a critical gap in Earth observations data. Code and datasets are available at this https://github.com/gmberton/EarthLoc + + + + Text-Image Alignment for Diffusion-Based Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/Kondapaneni_Text-Image_Alignment_for_Diffusion-Based_Perception_CVPR_2024_paper.pdf + Diffusion models are generative models with impressive text-to-image synthesis capabilities and have spurred a new wave of creative methods for classical machine learning tasks. However the best way to harness the perceptual knowledge of these generative models for visual tasks is still an open question. Specifically it is unclear how to use the prompting interface when applying diffusion backbones to vision tasks. We find that automatically generated captions can improve text-image alignment and significantly enhance a model's cross-attention maps leading to better perceptual performance. Our approach improves upon the current state-of-the-art in diffusion-based semantic segmentation on ADE20K and the current overall SOTA for depth estimation on NYUv2. Furthermore our method generalizes to the cross-domain setting. We use model personalization and caption modifications to align our model to the target domain and find improvements over unaligned baselines. Our cross-domain object detection model trained on Pascal VOC achieves SOTA results on Watercolor2K. Our cross-domain segmentation method trained on Cityscapes achieves SOTA results on Dark Zurich-val and Nighttime Driving. Project page: vision.caltech.edu/TADP/. 
Code: github.com/damaggu/TADP + + + + MemFlow: Optical Flow Estimation and Prediction with Memory + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_MemFlow_Optical_Flow_Estimation_and_Prediction_with_Memory_CVPR_2024_paper.pdf + Optical flow is a classical task that is important to the vision community. Classical optical flow estimation uses two frames as input whilst some recent methods consider multiple frames to explicitly model long-range information. The former ones limit their ability to fully leverage temporal coherence along the video sequence; and the latter ones incur heavy computational overhead typically not possible for real-time flow estimation. Some multi-frame-based approaches even necessitate unseen future frames for current estimation compromising real-time applicability in safety-critical scenarios. To this end we present MemFlow a real-time method for optical flow estimation and prediction with memory. Our method enables memory read-out and update modules for aggregating historical motion information in real-time. Furthermore we integrate resolution-adaptive re-scaling to accommodate diverse video resolutions. Besides our approach seamlessly extends to the future prediction of optical flow based on past observations. Leveraging effective historical motion aggregation our method outperforms VideoFlow with fewer parameters and faster inference speed on Sintel and KITTI-15 datasets in terms of generalization performance. At the time of submission MemFlow also leads in performance on the 1080p Spring dataset. Codes and models will be available at: https://dqiaole.github.io/MemFlow/. + + + + Novel Class Discovery for Ultra-Fine-Grained Visual Categorization + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Novel_Class_Discovery_for_Ultra-Fine-Grained_Visual_Categorization_CVPR_2024_paper.pdf + Ultra-fine-grained visual categorization (Ultra-FGVC) aims at distinguishing highly similar sub-categories within fine-grained objects such as different soybean cultivars. Compared to traditional fine-grained visual categorization Ultra-FGVC encounters more hurdles due to the small inter-class and large intra-class variation. Given these challenges relying on human annotation for Ultra-FGVC is impractical. To this end our work introduces a novel task termed Ultra-Fine-Grained Novel Class Discovery (UFG-NCD) which leverages partially annotated data to identify new categories of unlabeled images for Ultra-FGVC. To tackle this problem we devise a Region-Aligned Proxy Learning (RAPL) framework which comprises a Channel-wise Region Alignment (CRA) module and a Semi-Supervised Proxy Learning (SemiPL) strategy. The CRA module is designed to extract and utilize discriminative features from local regions facilitating knowledge transfer from labeled to unlabeled classes. Furthermore SemiPL strengthens representation learning and knowledge transfer with proxy-guided supervised learning and proxy-guided contrastive learning. Such techniques leverage class distribution information in the embedding space improving the mining of subtle differences between labeled and unlabeled ultra-fine-grained classes. Extensive experiments demonstrate that RAPL significantly outperforms baselines across various datasets indicating its effectiveness in handling the challenges of UFG-NCD. Code is available at https://github.com/SSDUT-Caiyq/UFG-NCD. 
+ + + + DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Lei_DiffusionGAN3D_Boosting_Text-guided_3D_Generation_and_Domain_Adaptation_by_Combining_CVPR_2024_paper.pdf + Text-guided domain adaptation and generation of 3D-aware portraits find many applications in various fields. However due to the lack of training data and the challenges in handling the high variety of geometry and appearance the existing methods for these tasks suffer from issues like inflexibility, instability and low fidelity. In this paper we propose a novel framework DiffusionGAN3D which boosts text-guided 3D domain adaptation and generation by combining 3D GANs and diffusion priors. Specifically we integrate the pre-trained 3D generative models (e.g. EG3D) and text-to-image diffusion models. The former provides a strong foundation for stable and high-quality avatar generation from text, while the diffusion models in turn offer powerful priors and guide the 3D generator finetuning with an informative direction to achieve flexible and efficient text-guided domain adaptation. To enhance the diversity in domain adaptation and the generation capability in text-to-avatar we introduce the relative distance loss and a case-specific learnable triplane, respectively. Besides we design a progressive texture refinement module to improve the texture quality for both tasks above. Extensive experiments demonstrate that the proposed framework achieves excellent results in both domain adaptation and text-to-avatar tasks outperforming existing methods in terms of generation quality and efficiency. The project homepage is at https://younglbw.github.io/DiffusionGAN3D-homepage/. + + + + Rethinking Boundary Discontinuity Problem for Oriented Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Rethinking_Boundary_Discontinuity_Problem_for_Oriented_Object_Detection_CVPR_2024_paper.pdf + Oriented object detection has developed rapidly in the past few years where rotation equivariance is crucial for detectors to predict rotated boxes. It is expected that the prediction can maintain the corresponding rotation when objects rotate but severe mutation in angular prediction is sometimes observed when objects rotate near the boundary angle which is the well-known boundary discontinuity problem. The problem has long been believed to be caused by the sharp loss increase at the angular boundary and widely used joint-optim IoU-like methods deal with this problem by loss-smoothing. However we experimentally find that even state-of-the-art IoU-like methods actually fail to solve the problem. On further analysis we find that the key to the solution lies in the encoding mode of the smoothing function rather than in joint or independent optimization. In existing IoU-like methods the model essentially attempts to fit the angular relationship between box and object where the break point at the angular boundary makes the predictions highly unstable. To deal with this issue we propose a dual-optimization paradigm for angles. We decouple reversibility and joint-optim from a single smoothing function into two distinct entities which for the first time achieves the objectives of both correcting the angular boundary and blending the angle with other parameters. Extensive experiments on multiple datasets show that the boundary discontinuity problem is well-addressed.
Moreover typical IoU-like methods are improved to the same level without obvious performance gap. The code is available at https://github.com/hangxu-cv/cvpr24acm. + + + + SleepVST: Sleep Staging from Near-Infrared Video Signals using Pre-Trained Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Carter_SleepVST_Sleep_Staging_from_Near-Infrared_Video_Signals_using_Pre-Trained_Transformers_CVPR_2024_paper.pdf + Advances in camera-based physiological monitoring have enabled the robust non-contact measurement of respiration and the cardiac pulse which are known to be indicative of the sleep stage. This has led to research into camera-based sleep monitoring as a promising alternative to "gold-standard" polysomnography which is cumbersome expensive to administer and hence unsuitable for longer-term clinical studies. In this paper we introduce SleepVST a transformer model which enables state-of-the-art performance in camera-based sleep stage classification (sleep staging). After pre-training on contact sensor data SleepVST outperforms existing methods for cardio-respiratory sleep staging on the SHHS and MESA datasets achieving total Cohen's kappa scores of 0.75 and 0.77 respectively. We then show that SleepVST can be successfully transferred to cardio-respiratory waveforms extracted from video enabling fully contact-free sleep staging. Using a video dataset of 50 nights we achieve a total accuracy of 78.8% and a Cohen's \kappa of 0.71 in four-class video-based sleep staging setting a new state-of-the-art in the domain. + + + + TimeChat: A Time-sensitive Multimodal Large Language Model for Long Video Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_TimeChat_A_Time-sensitive_Multimodal_Large_Language_Model_for_Long_Video_CVPR_2024_paper.pdf + This work proposes TimeChat a time-sensitive multimodal large language model specifically designed for long video understanding. Our model incorporates two key architectural contributions: (1) a timestamp-aware frame encoder that binds visual content with the timestamp of each frame and (2) a sliding video Q-Former that produces a video token sequence of varying lengths to accommodate videos of various durations. Additionally we construct an instruction-tuning dataset encompassing 6 tasks and a total of 125K instances to further enhance TimeChat's instruction-following performance. Experiment results across various video understanding tasks such as dense captioning temporal grounding and highlight detection demonstrate TimeChat's strong zero-shot temporal localization and reasoning capabilities. For example it achieves +9.2 F1 score and +2.8 CIDEr on YouCook2 +5.8 HIT@1 on QVHighlights and +27.5 R@1 (IoU=0.5) on Charades-STA compared to state-of-the-art video large language models holding the potential to serve as a versatile video assistant for long-form video comprehension tasks and satisfy realistic user requirements. + + + + ManiFPT: Defining and Analyzing Fingerprints of Generative Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_ManiFPT_Defining_and_Analyzing_Fingerprints_of_Generative_Models_CVPR_2024_paper.pdf + Recent works have shown that generative models leave traces of their underlying generative process on the generated samples broadly referred to as fingerprints of a generative model and have studied their utility in detecting synthetic images from real ones. 
However the extent to which these fingerprints can distinguish between various types of synthetic images and help identify the underlying generative process remains under-explored. In particular the very definition of a fingerprint remains unclear to our knowledge. To that end in this work we formalize the definition of artifact and fingerprint in generative models, propose an algorithm for computing them in practice and finally study its effectiveness in distinguishing a large array of different generative models. We find that using our proposed definition can significantly improve the performance on the task of identifying the underlying generative process from samples (model attribution) compared to existing methods. Additionally we study the structure of the fingerprints and observe that it is very predictive of the effect of different design choices on the generative process. + + + + Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Seyfioglu_Quilt-LLaVA_Visual_Instruction_Tuning_by_Extracting_Localized_Narratives_from_Open-Source_CVPR_2024_paper.pdf + Diagnosis in histopathology requires a global analysis of whole slide images (WSIs), requiring pathologists to compound evidence from different WSI patches. The gigapixel scale of WSIs poses a challenge for histopathology multi-modal models. Training multi-modal models for histopathology requires instruction tuning datasets which currently contain information for individual image patches without a spatial grounding of the concepts within each patch and without a wider view of the WSI. To bridge this gap we introduce QUILT-INSTRUCT, a large-scale dataset of 107131 histopathology-specific instruction question/answer pairs grounded within diagnostically relevant image patches that make up the WSI. Our dataset is collected by leveraging educational histopathology videos from YouTube which provides spatial localization of narrations by automatically extracting the narrators' cursor positions. QUILT-INSTRUCT supports contextual reasoning by extracting diagnosis and supporting facts from the entire WSI. Using QUILT-INSTRUCT we train QUILT-LLAVA which can reason beyond the given single image patch enabling diagnostic reasoning across patches. To evaluate QUILT-LLAVA we propose a comprehensive evaluation dataset created from 985 images and 1283 human-generated question-answers. We also thoroughly evaluate QUILT-LLAVA using public histopathology datasets where QUILT-LLAVA significantly outperforms SOTA by over 10% on relative GPT-4 score and 4% and 9% on open and closed set VQA. + + + + E-GPS: Explainable Geometry Problem Solving via Top-Down Solver and Bottom-Up Generator + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_E-GPS_Explainable_Geometry_Problem_Solving_via_Top-Down_Solver_and_Bottom-Up_CVPR_2024_paper.pdf + Geometry Problem Solving has drawn growing attention recently due to its application prospects in the intelligent education field. However existing methods are still inadequate to meet the needs of practical application, suffering from the following limitations: 1) explainability is not ensured which is essential in real teaching scenarios; 2) the small scale and incomplete annotation of existing datasets make it hard for the model to comprehend geometric knowledge. To tackle the above problems we propose a novel method called Explainable Geometry Problem Solving (E-GPS).
E-GPS first parses the geometric diagram and problem text into unified formal language representations. Then the answer and explainable reasoning and solving steps are obtained by a Top-Down Problem Solver (TD-PS) which innovatively solves the problem from the target and focuses on what is needed. To alleviate the data issues a Bottom-Up Problem Generator (BU-PG) is devised to augment the data set with various well-annotated constructed geometry problems. It enables us to train an enhanced theorem predictor with a better grasp of theorem knowledge which further improves the efficiency of TD-PS. Extensive experiments demonstrate that E-GPS maintains comparable solving performances with fewer steps and provides outstanding explainability. + + + + Driving into the Future: Multiview Visual Forecasting and Planning with World Model for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Driving_into_the_Future_Multiview_Visual_Forecasting_and_Planning_with_CVPR_2024_paper.pdf + In autonomous driving predicting future events in advance and evaluating the foreseeable risks empowers autonomous vehicles to plan their actions enhancing safety and efficiency on the road. To this end we propose Drive-WM the first driving world model compatible with existing end-to-end planning models. Through a joint spatial-temporal modeling facilitated by view factorization our model is the first to generate high-fidelity multiview videos. Building on its powerful generation ability we showcase the potential of applying the world model for safe driving planning for the first time. Our Drive-WM enables driving into multiple futures based on distinct driving maneuvers and determines the optimal trajectory according to the image-based rewards. Evaluation on real-world driving datasets verifies that our method could generate high-quality consistent and controllable multiview videos opening up possibilities for real-world simulations and safe planning. + + + + OpenESS: Event-based Semantic Scene Understanding with Open Vocabularies + http://openaccess.thecvf.com//content/CVPR2024/papers/Kong_OpenESS_Event-based_Semantic_Scene_Understanding_with_Open_Vocabularies_CVPR_2024_paper.pdf + Event-based semantic segmentation (ESS) is a fundamental yet challenging task for event camera sensing. The difficulties in interpreting and annotating event data limit its scalability. While domain adaptation from images to event data can help to mitigate this issue there exist data representational differences that require additional effort to resolve. In this work for the first time we synergize information from image text and event-data domains and introduce OpenESS to enable scalable ESS in an open-world annotation-efficient manner. We achieve this goal by transferring the semantically rich CLIP knowledge from image-text pairs to event streams. To pursue better cross-modality adaptation we propose a frame-to-event contrastive distillation and a text-to-event semantic consistency regularization. Experimental results on popular ESS benchmarks showed our approach outperforms existing methods. Notably we achieve 53.93% and 43.31% mIoU on DDD17 and DSEC-Semantic without using either event or frame labels. + + + + Do Vision and Language Encoders Represent the World Similarly? 
+ http://openaccess.thecvf.com//content/CVPR2024/papers/Maniparambil_Do_Vision_and_Language_Encoders_Represent_the_World_Similarly_CVPR_2024_paper.pdf + Aligned text-image encoders such as CLIP have become the de-facto model for vision-language tasks. Furthermore modality-specific encoders achieve impressive performances in their respective domains. This raises a central question: does an alignment exist between uni-modal vision and language encoders since they fundamentally represent the same physical world? Analyzing the latent spaces structure of vision and language models on image-caption benchmarks using the Centered Kernel Alignment (CKA) we find that the representation spaces of unaligned and aligned encoders are semantically similar. In the absence of statistical similarity in aligned encoders like CLIP we show that a possible matching of unaligned encoders exists without any training. We frame this as a seeded graph-matching problem exploiting the semantic similarity between graphs and propose two methods - a Fast Quadratic Assignment Problem optimization and a novel localized CKA metric-based matching/retrieval. We demonstrate the effectiveness of this on several downstream tasks including cross-lingual cross-domain caption matching and image classification. Code available at github.com/mayug/0-shot-llm-vision. + + + + MGMap: Mask-Guided Learning for Online Vectorized HD Map Construction + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_MGMap_Mask-Guided_Learning_for_Online_Vectorized_HD_Map_Construction_CVPR_2024_paper.pdf + Currently high-definition (HD) map construction leans towards a lightweight online generation tendency which aims to preserve timely and reliable road scene information. However map elements contain strong shape priors. Subtle and sparse annotations make current detection-based frameworks ambiguous in locating relevant feature scopes and cause the loss of detailed structures in prediction. To alleviate these problems we propose MGMap a mask-guided approach that effectively highlights the informative regions and achieves precise map element localization by introducing the learned masks. Specifically MGMap employs learned masks based on the enhanced multi-scale BEV features from two perspectives. At the instance level we propose the Mask-activated instance (MAI) decoder which incorporates global instance and structural information into instance queries by the activation of instance masks. At the point level a novel position-guided mask patch refinement (PG-MPR) module is designed to refine point locations from a finer-grained perspective enabling the extraction of point-specific patch information. Compared to the baselines our proposed MGMap achieves a notable improvement of around 10 mAP for different input modalities. Extensive experiments also demonstrate that our approach showcases strong robustness and generalization capabilities. Our code can be found at https://github.com/xiaolul2/MGMap. + + + + VidLA: Video-Language Alignment at Scale + http://openaccess.thecvf.com//content/CVPR2024/papers/Rizve_VidLA_Video-Language_Alignment_at_Scale_CVPR_2024_paper.pdf + In this paper we propose VidLA an approach for video-language alignment at scale. There are two major limitations of previous video-language alignment approaches. First they do not capture both short-range and long-range temporal dependencies and typically employ complex hierarchical deep network architectures that are hard to integrate with existing pretrained image-text foundation models. 
To effectively address this limitation we instead keep the network architecture simple and use a set of data tokens that operate at different temporal resolutions in a hierarchical manner accounting for the temporally hierarchical nature of videos. By employing a simple two-tower architecture we are able to initialize our video-language model with pretrained image-text foundation models thereby boosting the final performance. Second existing video-language alignment works struggle due to the lack of semantically aligned large-scale training data. To overcome it we leverage recent LLMs to curate the largest video-language dataset to date with better visual grounding. Furthermore unlike existing video-text datasets which only contain short clips our dataset is enriched with video clips of varying durations to aid our temporally hierarchical data tokens in extracting better representations at varying temporal scales. Overall empirical results show that our proposed approach surpasses state-of-the-art methods on multiple retrieval benchmarks especially on longer videos and performs competitively on classification benchmarks. + + + + ERMVP: Communication-Efficient and Collaboration-Robust Multi-Vehicle Perception in Challenging Environments + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_ERMVP_Communication-Efficient_and_Collaboration-Robust_Multi-Vehicle_Perception_in_Challenging_Environments_CVPR_2024_paper.pdf + Collaborative perception enhances perception performance by enabling autonomous vehicles to exchange complementary information. Despite its potential to revolutionize the mobile industry challenges in various environments such as communication bandwidth limitations localization errors and information aggregation inefficiencies hinder its implementation in practical applications. In this work we propose ERMVP a communication-Efficient and collaboration-Robust Multi-Vehicle Perception method in challenging environments. Specifically ERMVP has three distinct strengths: i) It utilizes the hierarchical feature sampling strategy to abstract a representative set of feature vectors using less communication overhead for efficient communication; ii) It employs the sparse consensus features to execute precise spatial location calibrations effectively mitigating the implications of vehicle localization errors; iii) A pioneering feature fusion and interaction paradigm is introduced to integrate holistic spatial semantics among different vehicles and data sources. To thoroughly validate our method we conduct extensive experiments on real-world and simulated datasets. The results demonstrate that the proposed ERMVP is significantly superior to the state-of-the-art collaborative perception methods. + + + + PikeLPN: Mitigating Overlooked Inefficiencies of Low-Precision Neural Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Neseem_PikeLPN_Mitigating_Overlooked_Inefficiencies_of_Low-Precision_Neural_Networks_CVPR_2024_paper.pdf + Low-precision quantization is recognized for its efficacy in neural network optimization. Our analysis reveals that non-quantized elementwise operations which are prevalent in layers such as parameterized activation functions batch normalization and quantization scaling dominate the inference cost of low-precision models. These non-quantized elementwise operations are commonly overlooked in SOTA efficiency metrics such as Arithmetic Computation Effort (ACE). 
In this paper we propose ACEv2 - an extended version of ACE which offers a better alignment with the inference cost of quantized models and their energy consumption on ML hardware. Moreover we introduce PikeLPN a model that addresses these efficiency issues by applying quantization to both elementwise operations and multiply-accumulate operations. In particular we present a novel quantization technique for batch normalization layers named QuantNorm which allows for quantizing the batch normalization parameters without compromising the model performance. Additionally we propose applying Double Quantization where the quantization scaling parameters are quantized. Furthermore we recognize and resolve the issue of distribution mismatch in Separable Convolution layers by introducing Distribution-Heterogeneous Quantization which enables quantizing them to low-precision. PikeLPN achieves Pareto-optimality in efficiency-accuracy trade-off with up to 3X efficiency improvement compared to SOTA low-precision models. + + + + CAGE: Controllable Articulation GEneration + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_CAGE_Controllable_Articulation_GEneration_CVPR_2024_paper.pdf + We address the challenge of generating 3D articulated objects in a controllable fashion. Currently modeling articulated 3D objects is either achieved through laborious manual authoring or using methods from prior work that are hard to scale and control directly. We leverage the interplay between part shape connectivity and motion using a denoising diffusion-based method with attention modules designed to extract correlations between part attributes. Our method takes an object category label and a part connectivity graph as input and generates an object's geometry and motion parameters. The generated objects conform to user-specified constraints on the object category part shape and part articulation. Our experiments show that our method outperforms the state-of-the-art in articulated object generation producing more realistic objects while conforming better to user constraints. + + + + FocusMAE: Gallbladder Cancer Detection from Ultrasound Videos with Focused Masked Autoencoders + http://openaccess.thecvf.com//content/CVPR2024/papers/Basu_FocusMAE_Gallbladder_Cancer_Detection_from_Ultrasound_Videos_with_Focused_Masked_CVPR_2024_paper.pdf + In recent years automated Gallbladder Cancer (GBC) detection has gained the attention of researchers. Current state-of-the-art (SOTA) methodologies relying on ultrasound sonography (US) images exhibit limited generalization emphasizing the need for transformative approaches. We observe that individual US frames may lack sufficient information to capture disease manifestation. This study advocates for a paradigm shift towards video-based GBC detection leveraging the inherent advantages of spatiotemporal representations. Employing the Masked Autoencoder (MAE) for representation learning we address shortcomings in conventional image-based methods. We propose a novel design called FocusMAE to systematically bias the selection of masking tokens from high-information regions fostering a more refined representation of malignancy. Additionally we contribute the most extensive US video dataset for GBC detection. We also note that this is the first study on US video-based GBC detection. 
We validate the proposed methods on the curated dataset and report a new SOTA accuracy of 96.4% for the GBC detection problem against an accuracy of 84% by the current image-based SOTA methods, GBCNet and RadFormer, and 94.7% by the video-based SOTA, AdaMAE. We further demonstrate the generality of the proposed FocusMAE on a public CT-based Covid detection dataset reporting an improvement in accuracy by 3.3% over current baselines. Project page with source code, trained models and data is available at: https://gbc-iitd.github.io/focusmae. + + + + Visual Concept Connectome (VCC): Open World Concept Discovery and their Interlayer Connections in Deep Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Kowal_Visual_Concept_Connectome_VCC_Open_World_Concept_Discovery_and_their_CVPR_2024_paper.pdf + Understanding what deep network models capture in their learned representations is a fundamental challenge in computer vision. We present a new methodology for understanding such vision models, the Visual Concept Connectome (VCC), which discovers human-interpretable concepts and their interlayer connections in a fully unsupervised manner. Our approach simultaneously reveals fine-grained concepts at a layer, connection weightings across all layers, and is amenable to global analysis of network structure (e.g. branching pattern of hierarchical concept assemblies). Previous work yielded ways to extract interpretable concepts from single layers and examine their impact on classification but did not afford multilayer concept analysis across an entire network architecture. Quantitative and qualitative empirical results show the effectiveness of VCCs in the domain of image classification. Also we leverage VCCs for the application of failure mode debugging to reveal where mistakes arise in deep networks. + + + + GRAM: Global Reasoning for Multi-Page VQA + http://openaccess.thecvf.com//content/CVPR2024/papers/Blau_GRAM_Global_Reasoning_for_Multi-Page_VQA_CVPR_2024_paper.pdf + The increasing use of transformer-based large language models brings forward the challenge of processing long sequences. In document visual question answering (DocVQA) leading methods focus on the single-page setting while documents can span hundreds of pages. We present GRAM, a method that seamlessly extends pre-trained single-page models to the multi-page setting without requiring computationally-heavy pretraining. To do so we leverage a single-page encoder for local page-level understanding and enhance it with document-level designated layers and learnable tokens facilitating the flow of information across pages for global reasoning. To ensure that our model utilizes the newly introduced document tokens we propose a tailored bias adaptation method. For additional computational savings during decoding we introduce an optional compression stage using our compression-transformer (CFormer), reducing the encoded sequence length thereby allowing a tradeoff between quality and latency. Extensive experiments showcase GRAM's state-of-the-art performance on the benchmarks for multi-page DocVQA demonstrating the effectiveness of our approach. + + + + MS-DETR: Efficient DETR Training with Mixed Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_MS-DETR_Efficient_DETR_Training_with_Mixed_Supervision_CVPR_2024_paper.pdf + DETR accomplishes end-to-end object detection through iteratively generating multiple object candidates based on image features and promoting one candidate for each ground-truth object.
The traditional training procedure using one-to-one supervision in the original DETR lacks direct supervision for the object detection candidates. We aim at improving the DETR training efficiency by explicitly supervising the candidate generation procedure through mixing one-to-one supervision and one-to-many supervision. Our approach, namely MS-DETR, is simple and applies one-to-many supervision to the object queries of the primary decoder that is used for inference. In comparison to existing DETR variants with one-to-many supervision such as Group DETR and Hybrid DETR our approach does not need additional decoder branches or object queries. The object queries of the primary decoder in our approach directly benefit from one-to-many supervision and thus are superior in object candidate prediction. Experimental results show that our approach outperforms related DETR variants such as DN-DETR, Hybrid DETR and Group DETR, and the combination with related DETR variants further improves the performance. + + + + BEVSpread: Spread Voxel Pooling for Bird's-Eye-View Representation in Vision-based Roadside 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_BEVSpread_Spread_Voxel_Pooling_for_Birds-Eye-View_Representation_in_Vision-based_Roadside_CVPR_2024_paper.pdf + Vision-based roadside 3D object detection has attracted rising attention in the autonomous driving domain since it encompasses inherent advantages in reducing blind spots and expanding the perception range. However previous work mainly focuses on accurately estimating depth or height for 2D-to-3D mapping, ignoring the position approximation error in the voxel pooling process. Inspired by this insight we propose a novel voxel pooling strategy to reduce such error, dubbed BEVSpread. Specifically instead of bringing the image features contained in a frustum point to a single BEV grid BEVSpread considers each frustum point as a source and spreads the image features to the surrounding BEV grids with adaptive weights. To achieve superior propagation performance a specific weight function is designed to dynamically control the decay speed of the weights according to distance and depth. Aided by customized CUDA parallel acceleration BEVSpread achieves inference time comparable to the original voxel pooling. Extensive experiments on two large-scale roadside benchmarks demonstrate that as a plug-in BEVSpread can significantly improve the performance of existing frustum-based BEV methods by a large margin of (1.12, 5.26, 3.01) AP for vehicles, pedestrians and cyclists. + + + + DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Min_DriveWorld_4D_Pre-trained_Scene_Understanding_via_World_Models_for_Autonomous_CVPR_2024_paper.pdf + Vision-centric autonomous driving has recently drawn wide attention due to its lower cost. Pre-training is essential for extracting a universal representation. However current vision-centric pre-training typically relies on either 2D or 3D pretext tasks overlooking the temporal characteristics of autonomous driving as a 4D scene understanding task. In this paper we address this challenge by introducing a world model-based autonomous driving 4D representation learning framework, dubbed DriveWorld, which is capable of pre-training from multi-camera driving videos in a spatio-temporal fashion.
Specifically we propose a Memory State-Space Model for spatio-temporal modelling which consists of a Dynamic Memory Bank module for learning temporal-aware latent dynamics to predict future changes and a Static Scene Propagation module for learning spatial-aware latent statics to offer comprehensive scene contexts. We additionally introduce a Task Prompt to decouple task-aware features for various downstream tasks. The experiments demonstrate that DriveWorld delivers promising results on various autonomous driving tasks. When pre-trained with the OpenScene dataset DriveWorld achieves a 7.5% increase in mAP for 3D object detection a 3.0% increase in IoU for online mapping a 5.0% increase in AMOTA for multi-object tracking a 0.1m decrease in minADE for motion forecasting a 3.0% increase in IoU for occupancy prediction and a 0.34m reduction in average L2 error for planning. + + + + Bridging the Gap Between End-to-End and Two-Step Text Spotting + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Bridging_the_Gap_Between_End-to-End_and_Two-Step_Text_Spotting_CVPR_2024_paper.pdf + Modularity plays a crucial role in the development and maintenance of complex systems. While end-to-end text spotting efficiently mitigates the issues of error accumulation and sub-optimal performance seen in traditional two-step methodologies the two-step methods continue to be favored in many competitions and practical settings due to their superior modularity. In this paper we introduce Bridging Text Spotting a novel approach that resolves the error accumulation and suboptimal performance issues in two-step methods while retaining modularity. To achieve this we adopt a well-trained detector and recognizer that are developed and trained independently and then lock their parameters to preserve their already acquired capabilities. Subsequently we introduce a Bridge that connects the locked detector and recognizer through a zero-initialized neural network. This zero-initialized neural network initialized with weights set to zeros ensures seamless integration of the large receptive field features in detection into the locked recognizer. Furthermore since the fixed detector and recognizer cannot naturally acquire end-to-end optimization features we adopt the Adapter to facilitate their efficient learning of these features. We demonstrate the effectiveness of the proposed method through extensive experiments: Connecting the latest detector and recognizer through Bridging Text Spotting we achieved an accuracy of 83.3% on Total-Text 69.8% on CTW1500 and 89.5% on ICDAR 2015. The code is available at https://github.com/mxin262/Bridging-Text-Spotting. + + + + SUGAR: Pre-training 3D Visual Representations for Robotics + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_SUGAR_Pre-training_3D_Visual_Representations_for_Robotics_CVPR_2024_paper.pdf + Learning generalizable visual representations from Internet data has yielded promising results for robotics. Yet prevailing approaches focus on pre-training 2D representations being sub-optimal to deal with occlusions and accurately localize objects in complex 3D scenes. Meanwhile 3D representation learning has been limited to single-object understanding. To address these limitations we introduce a novel 3D pre-training framework for robotics named SUGAR that captures semantic geometric and affordance properties of objects through 3D point clouds. 
We underscore the importance of cluttered scenes in 3D representation learning and automatically construct a multi-object dataset benefiting from cost-free supervision in simulation. SUGAR employs a versatile transformer-based model to jointly address five pre-training tasks namely cross-modal knowledge distillation for semantic learning masked point modeling to understand geometry structures grasping pose synthesis for object affordance 3D instance segmentation and referring expression grounding to analyze cluttered scenes. We evaluate our learned representation on three robotic-related tasks namely zero-shot 3D object recognition referring expression grounding and language-driven robotic manipulation. Experimental results show that SUGAR's 3D representation outperforms state-of-the-art 2D and 3D representations. + + + + PairAug: What Can Augmented Image-Text Pairs Do for Radiology? + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_PairAug_What_Can_Augmented_Image-Text_Pairs_Do_for_Radiology_CVPR_2024_paper.pdf + Current vision-language pre-training (VLP) methodologies predominantly depend on paired image-text datasets a resource that is challenging to acquire in radiology due to privacy considerations and labelling complexities. Data augmentation provides a practical solution to overcome the issue of data scarcity however most augmentation methods exhibit a limited focus prioritising either image or text augmentation exclusively. Acknowledging this limitation our objective is to devise a framework capable of concurrently augmenting medical image and text data. We design a Pairwise Augmentation (PairAug) approach that contains an Inter-patient Augmentation (InterAug) branch and an Intra-patient Augmentation (IntraAug) branch. Specifically the InterAug branch of our approach generates radiology images using synthesised yet plausible reports derived from a Large Language Model (LLM). The generated pairs can be considered a collection of new patient cases since they are artificially created and may not exist in the original dataset. In contrast the IntraAug branch uses newly generated reports to manipulate images. This process allows us to create new paired data for each individual with diverse medical conditions. Our extensive experiments on various downstream tasks covering medical image classification zero-shot and fine-tuning analysis demonstrate that our PairAug concurrently expanding both image and text data substantially outperforms image-/text-only expansion baselines and advanced medical VLP baselines. Our code is released at https://github.com/YtongXie/PairAug. + + + + Harnessing Large Language Models for Training-free Video Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zanella_Harnessing_Large_Language_Models_for_Training-free_Video_Anomaly_Detection_CVPR_2024_paper.pdf + Video anomaly detection (VAD) aims to temporally locate abnormal events in a video. Existing works mostly rely on training deep models to learn the distribution of normality with either video-level supervision one-class supervision or in an unsupervised setting. Training-based methods are prone to be domain-specific thus being costly for practical deployment as any domain change will involve data collection and model training. 
In this paper we radically depart from previous efforts and propose LAnguage-based VAD (LAVAD) a method tackling VAD in a novel training-free paradigm exploiting the capabilities of pre-trained large language models (LLMs) and existing vision-language models (VLMs). We leverage VLM-based captioning models to generate textual descriptions for each frame of any test video. With the textual scene description we then devise a prompting mechanism to unlock the capability of LLMs in terms of temporal aggregation and anomaly score estimation turning LLMs into an effective video anomaly detector. We further leverage modality-aligned VLMs and propose effective techniques based on cross-modal similarity for cleaning noisy captions and refining the LLM-based anomaly scores. We evaluate LAVAD on two large datasets featuring real-world surveillance scenarios (UCF-Crime and XD-Violence) showing that it outperforms both unsupervised and one-class methods without requiring any training or data collection. + + + + FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_FineParser_A_Fine-grained_Spatio-temporal_Action_Parser_for_Human-centric_Action_Quality_CVPR_2024_paper.pdf + Existing action quality assessment (AQA) methods mainly learn deep representations at the video level for scoring diverse actions. Due to the lack of a fine-grained understanding of actions in videos they harshly suffer from low credibility and interpretability thus insufficient for stringent applications such as Olympic diving events. We argue that a fine-grained understanding of actions requires the model to perceive and parse actions in both time and space which is also the key to the credibility and interpretability of the AQA technique. Based on this insight we propose a new fine-grained spatial-temporal action parser named FineParser. It learns human-centric foreground action representations by focusing on target action regions within each frame and exploiting their fine-grained alignments in time and space to minimize the impact of invalid backgrounds during the assessment. In addition we construct fine-grained annotations of human-centric foreground action masks for the FineDiving dataset called FineDiving-HM. With refined annotations on diverse target action procedures FineDiving-HM can promote the development of real-world AQA systems. Through extensive experiments we demonstrate the effectiveness of FineParser which outperforms state-of-the-art methods while supporting more tasks of fine-grained action understanding. Data and code are available at https://github.com/PKU-ICST-MIPL/FineParser_CVPR2024. + + + + Language Models as Black-Box Optimizers for Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Language_Models_as_Black-Box_Optimizers_for_Vision-Language_Models_CVPR_2024_paper.pdf + Vision-language models (VLMs) pre-trained on web-scale datasets have demonstrated remarkable capabilities on downstream tasks when fine-tuned with minimal data. However many VLMs rely on proprietary data and are not open-source which restricts the use of white-box approaches for fine-tuning. As such we aim to develop a black-box approach to optimize VLMs through natural language prompts thereby avoiding the need to access model parameters feature embeddings or even output logits. We propose employing chat-based LLMs to search for the best text prompt for VLMs. 
Specifically we adopt an automatic "hill-climbing" procedure that converges to an effective prompt by evaluating the performance of current prompts and asking LLMs to refine them based on textual feedback all within a conversational process without human-in-the-loop. In a challenging 1-shot image classification setup our simple approach surpasses the white-box continuous prompting method (CoOp) by an average of 1.5% across 11 datasets including ImageNet. Our approach also outperforms both human-engineered and LLM-generated prompts. We highlight the advantage of conversational feedback that incorporates both positive and negative prompts suggesting that LLMs can utilize the implicit "gradient" direction in textual feedback for a more efficient search. In addition we find that the text prompts generated through our strategy are not only more interpretable but also transfer well across different VLM architectures in a black-box manner. Lastly we demonstrate our framework on a state-of-the-art black-box VLM (DALL-E 3) for text-to-image optimization. + + + + Exploring Orthogonality in Open World Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Exploring_Orthogonality_in_Open_World_Object_Detection_CVPR_2024_paper.pdf + Open world object detection aims to identify objects of unseen categories and incrementally recognize them once their annotations are provided. In distinction to the traditional paradigm that is limited to predefined categories this setting promises a continual and generalizable way of estimating objectness using class-agnostic information. However achieving such decorrelation between objectness and class information proves challenging. Without explicit consideration existing methods usually exhibit low recall on unknown objects and can misclassify them into known classes. To address this problem we exploit three levels of orthogonality in the detection process: First the objectness and classification heads are disentangled by operating on separate sets of features that are orthogonal to each other in a devised polar coordinate system. Secondly a prediction decorrelation loss is introduced to guide the detector towards more general and class-independent prediction. Furthermore we propose a calibration scheme that helps maintain orthogonality throughout the training process to mitigate catastrophic interference and facilitate incremental learning of previously unseen objects. Our method is comprehensively evaluated on open world and incremental object detection benchmarks demonstrating its effectiveness in detecting both known and unknown objects. Code and models are available at https://github.com/feifeiobama/OrthogonalDet. + + + + Mitigating Object Hallucinations in Large Vision-Language Models through Visual Contrastive Decoding + http://openaccess.thecvf.com//content/CVPR2024/papers/Leng_Mitigating_Object_Hallucinations_in_Large_Vision-Language_Models_through_Visual_Contrastive_CVPR_2024_paper.pdf + Large Vision-Language Models (LVLMs) have advanced considerably intertwining visual recognition and language understanding to generate content that is not only coherent but also contextually attuned. Despite their success LVLMs still suffer from the issue of object hallucinations where models generate plausible yet incorrect outputs that include objects that do not exist in the images. 
To mitigate this issue we introduce Visual Contrastive Decoding (VCD) a simple and training-free method that contrasts output distributions derived from original and distorted visual inputs. The proposed VCD effectively reduces the over-reliance on statistical bias and unimodal priors two essential causes of object hallucinations. This adjustment ensures the generated content is closely grounded to visual inputs resulting in contextually accurate outputs. Our experiments show that VCD without either additional training or the usage of external tools significantly mitigates the object hallucination issue across different LVLM families. Beyond mitigating object hallucinations VCD also excels in general LVLM benchmarks highlighting its wide-ranging applicability. + + + + Sculpt3D: Multi-View Consistent Text-to-3D Generation with Sparse 3D Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Sculpt3D_Multi-View_Consistent_Text-to-3D_Generation_with_Sparse_3D_Prior_CVPR_2024_paper.pdf + Recent works on text-to-3d generation show that using only 2D diffusion supervision for 3D generation tends to produce results with inconsistent appearances (e.g. faces on the back view) and inaccurate shapes (e.g. animals with extra legs). Existing methods mainly address this issue by retraining diffusion models with images rendered from 3D data to ensure multi-view consistency while struggling to balance 2D generation quality with 3D consistency. In this paper we present a new framework Sculpt3D that equips the current pipeline with explicit injection of 3D priors from retrieved reference objects without re-training the 2D diffusion model. Specifically we demonstrate that high-quality and diverse 3D geometry can be guaranteed by keypoints supervision through a sparse ray sampling approach. Moreover to ensure accurate appearances of different views we further modulate the output of the 2D diffusion model to the correct patterns of the template views without altering the generated object's style. These two decoupled designs effectively harness 3D information from reference objects to generate 3D objects while preserving the generation quality of the 2D diffusion model. Extensive experiments show our method can largely improve the multi-view consistency while retaining fidelity and diversity. + + + + ScanFormer: Referring Expression Comprehension by Iteratively Scanning + http://openaccess.thecvf.com//content/CVPR2024/papers/Su_ScanFormer_Referring_Expression_Comprehension_by_Iteratively_Scanning_CVPR_2024_paper.pdf + Referring Expression Comprehension (REC) aims to localize the target objects specified by free-form natural language descriptions in images. While state-of-the-art methods achieve impressive performance they perform a dense perception of images which incorporates redundant visual regions unrelated to linguistic queries leading to additional computational overhead. This inspires us to explore a question: can we eliminate linguistic-irrelevant redundant visual regions to improve the efficiency of the model? Existing relevant methods primarily focus on fundamental visual tasks with limited exploration in vision-language fields. To address this we propose a coarse-to-fine iterative perception framework called ScanFormer. It can iteratively exploit the image scale pyramid to extract linguistic-relevant visual patches from top to bottom. In each iteration irrelevant patches are discarded by our designed informativeness prediction. 
Furthermore we propose a patch selection strategy for discarded patches to accelerate inference. Experiments on widely used datasets namely RefCOCO RefCOCO+ RefCOCOg and ReferItGame verify the effectiveness of our method which can strike a balance between accuracy and efficiency. + + + + Model Inversion Robustness: Can Transfer Learning Help? + http://openaccess.thecvf.com//content/CVPR2024/papers/Ho_Model_Inversion_Robustness_Can_Transfer_Learning_Help_CVPR_2024_paper.pdf + Model Inversion (MI) attacks aim to reconstruct private training data by abusing access to machine learning models. Contemporary MI attacks have achieved impressive attack performance posing serious threats to privacy. Meanwhile all existing MI defense methods rely on regularization that is in direct conflict with the training objective resulting in noticeable degradation in model utility. In this work we take a different perspective and propose a novel and simple Transfer Learning-based Defense against Model Inversion (TL-DMI) to render MI-robust models. Particularly by leveraging TL we limit the number of layers encoding sensitive information from private training dataset thereby degrading the performance of MI attack. We conduct an analysis using Fisher Information to justify our method. Our defense is remarkably simple to implement. Without bells and whistles we show in extensive experiments that TL-DMI achieves state-of-the-art (SOTA) MI robustness. Our code pre-trained models demo and inverted data are available at: https://hosytuyen.github.io/projects/TL-DMI + + + + RLHF-V: Towards Trustworthy MLLMs via Behavior Alignment from Fine-grained Correctional Human Feedback + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_RLHF-V_Towards_Trustworthy_MLLMs_via_Behavior_Alignment_from_Fine-grained_Correctional_CVPR_2024_paper.pdf + Multimodal Large Language Models (MLLMs) have recently demonstrated impressive capabilities in multimodal understanding reasoning and interaction. However existing MLLMs prevalently suffer from serious hallucination problems generating text that is not factually grounded in associated images. The problem makes existing MLLMs untrustworthy and thus impractical in real-world (especially high-stakes) applications. To address the challenge we present RLHF-V which enhances MLLM trustworthiness via behavior alignment from fine-grained correctional human feedback. Specifically RLHF-V collects human preference in the form of segment-level corrections on hallucinations and performs dense direct preference optimization over the human feedback. Comprehensive experiments on five benchmarks in both automatic and human evaluation show that RLHF-V can enable substantially more trustworthy MLLM behaviors with promising data and computation efficiency. Remarkably using 1.4k annotated data samples RLHF-V significantly reduces the hallucination rate of the base MLLM by 34.8% outperforming the concurrent LLaVA-RLHF trained on 10k annotated data. The final model achieves state-of-the-art performance in trustworthiness among open-source MLLMs and shows better robustness than GPT-4V in preventing hallucinations aroused from over-generalization. All the data code and model weights will be released to facilitate future research. + + + + ZeroShape: Regression-based Zero-shot Shape Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_ZeroShape_Regression-based_Zero-shot_Shape_Reconstruction_CVPR_2024_paper.pdf + We study the problem of single-image zero-shot 3D shape reconstruction. 
Recent works learn zero-shot shape reconstruction through generative modeling of 3D assets but these models are computationally expensive at train and inference time. In contrast the traditional approach to this problem is regression-based where deterministic models are trained to directly regress the object shape. Such regression methods possess much higher computational efficiency than generative methods. This raises a natural question: is generative modeling necessary for high performance or conversely are regression-based approaches still competitive? To answer this we design a strong regression-based model called ZeroShape based on the converging findings in this field and a novel insight. We also curate a large real-world evaluation benchmark with objects from three different real-world 3D datasets. This evaluation benchmark is more diverse and an order of magnitude larger than what prior works use to quantitatively evaluate their models aiming at reducing the evaluation variance in our field. We show that ZeroShape not only achieves superior performance over state-of-the-art methods but also demonstrates significantly higher computational and data efficiency. + + + + The STVchrono Dataset: Towards Continuous Change Recognition in Time + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_The_STVchrono_Dataset_Towards_Continuous_Change_Recognition_in_Time_CVPR_2024_paper.pdf + Recognizing continuous changes offers valuable insights into past historical events supports current trend analysis and facilitates future planning. This knowledge is crucial for a variety of fields such as meteorology and agriculture environmental science urban planning and construction tourism and cultural preservation. Currently available datasets in the field of scene change understanding primarily concentrate on two main tasks: the detection of changed regions within a scene and the linguistic description of the change content. Existing datasets focus on recognizing discrete changes such as adding or deleting an object from two images and largely rely on artificially generated images. Consequently the existing change understanding methods primarily focus on identifying distinct object differences overlooking the importance of continuous gradual changes occurring over extended time intervals. To address the above issues we propose a novel benchmark dataset STVchrono targeting the localization and description of long-term continuous changes in real-world scenes. The dataset consists of 71900 photographs from Google Street View API taken over an 18-year span across 50 cities all over the world. Our STVchrono dataset is designed to support real-world continuous change recognition and description in both image pairs and extended image sequences while also enabling the segmentation of changed regions. We conduct experiments to evaluate state-of-the-art methods on continuous change description and segmentation as well as multimodal Large Language Models for describing changes. Our findings reveal that even the most advanced methods lag human performance emphasizing the need to adapt them to continuously changing real-world scenarios. We hope that our benchmark dataset will further facilitate the research of temporal change recognition in a dynamic world. The STVchrono dataset is available at STVchrono Dataset. 
+ + + + SocialCircle: Learning the Angle-based Social Interaction Representation for Pedestrian Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Wong_SocialCircle_Learning_the_Angle-based_Social_Interaction_Representation_for_Pedestrian_Trajectory_CVPR_2024_paper.pdf + Analyzing and forecasting trajectories of agents like pedestrians and cars in complex scenes has become more and more significant in many intelligent systems and applications. The diversity and uncertainty in socially interactive behaviors among a rich variety of agents make this task more challenging than other deterministic computer vision tasks. Researchers have made a lot of efforts to quantify the effects of these interactions on future trajectories through different mathematical models and network structures but this problem has not been well solved. Inspired by marine animals that localize the positions of their companions underwater through echoes we build a new angle-based trainable social interaction representation named SocialCircle for continuously reflecting the context of social interactions at different angular orientations relative to the target agent. We validate the effect of the proposed SocialCircle by training it along with several newly released trajectory prediction models and experiments show that the SocialCircle not only quantitatively improves the prediction performance but also qualitatively helps better simulate social interactions when forecasting pedestrian trajectories in a way that is consistent with human intuitions. + + + + Neighbor Relations Matter in Video Scene Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_Neighbor_Relations_Matter_in_Video_Scene_Detection_CVPR_2024_paper.pdf + Video scene detection aims to temporally link shots for obtaining semantically compact scenes. It is essential for this task to capture scene-distinguishable affinity among shots by similarity assessment. However most methods rely on ordinary shot-to-shot similarities which may inveigle similar shots into being linked even though they are from different scenes and meanwhile hinder dissimilar shots from being blended into a complete scene. In this paper we propose NeighborNet to inject shot contexts into shot-to-shot similarities through carefully exploring the relations between semantic/temporal neighbors of shots over a local time period. In this way shot-to-shot similarities are remeasured as semantic/temporal neighbor-aware similarities so that NeighborNet can learn context embedding into shot features using graph convolutional network. As a result not only do the learned shot features suppress the affinity among similar shots from different scenes but they also promote the affinity among dissimilar shots in the same scene. Experimental results on public benchmark datasets show that our proposed NeighborNet yields substantial improvements in video scene detection especially outperforming released state-of-the-arts by at least 6% in Average Precision (AP). The code is available at https://github.com/ExMorgan-Alter/NeighborNet. + + + + Text-to-Image Diffusion Models are Great Sketch-Photo Matchmakers + http://openaccess.thecvf.com//content/CVPR2024/papers/Koley_Text-to-Image_Diffusion_Models_are_Great_Sketch-Photo_Matchmakers_CVPR_2024_paper.pdf + This paper for the first time explores text-to-image diffusion models for Zero-Shot Sketch-based Image Retrieval (ZS-SBIR). &#13;
We highlight a pivotal discovery: the capacity of text-to-image diffusion models to seamlessly bridge the gap between sketches and photos. This proficiency is underpinned by their robust cross-modal capabilities and shape bias findings that are substantiated through our pilot studies. In order to harness pre-trained diffusion models effectively we introduce a straightforward yet powerful strategy focused on two key aspects: selecting optimal feature layers and utilising visual and textual prompts. For the former we identify which layers are most enriched with information and are best suited for the specific retrieval requirements (category-level or fine-grained). Then we employ visual and textual prompts to guide the model's feature extraction process enabling it to generate more discriminative and contextually relevant cross-modal representations. Extensive experiments on several benchmark datasets validate significant performance improvements. + + + + Mudslide: A Universal Nuclear Instance Segmentation Method + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Mudslide_A_Universal_Nuclear_Instance_Segmentation_Method_CVPR_2024_paper.pdf + Nuclear instance segmentation has played a critical role in pathology image analysis. The main challenges arise from the difficulty in accurately segmenting densely overlapping instances and the high cost of precise mask-level annotations. Existing fully-supervised nuclear instance segmentation methods such as boundary-based methods struggle to capture differences between overlapping instances and thus fail in densely distributed blurry regions. They also face challenges transitioning to point supervision where annotations are simple and effective. Inspired by natural mudslides we propose a universal method called Mudslide that uses simple representations to characterize differences between different instances and can easily be extended from fully-supervised to point-supervised. Concretely we introduce a collapse field and leverage it to construct a force map and initial boundary enabling a distinctive representation for each instance. Each pixel is assigned a collapse force with distinct directions between adjacent instances. Starting from the initial boundary Mudslide executes a pixel-by-pixel collapse along various force directions. Pixels that collapse into the same region are considered as one instance concurrently accounting for both inter-instance distinctions and intra-instance coherence. Experiments on public datasets show superior performance in both fully-supervised and point-supervised tasks. + + + + Modeling Multimodal Social Interactions: New Challenges and Baselines with Densely Aligned Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Modeling_Multimodal_Social_Interactions_New_Challenges_and_Baselines_with_Densely_CVPR_2024_paper.pdf + Understanding social interactions involving both verbal and non-verbal cues is essential for effectively interpreting social situations. However most prior works on multimodal social cues focus predominantly on single-person behaviors or rely on holistic visual representations that are not aligned to utterances in multi-party environments. Consequently they are limited in modeling the intricate dynamics of multi-party interactions. In this paper we introduce three new challenging tasks to model the fine-grained dynamics between multiple people: speaking target identification pronoun coreference resolution and mentioned player prediction. &#13;
We contribute extensive data annotations to curate these new challenges in social deduction game settings. Furthermore we propose a novel multimodal baseline that leverages densely aligned language-visual representations by synchronizing visual features with their corresponding utterances. This facilitates concurrently capturing verbal and non-verbal cues pertinent to social reasoning. Experiments demonstrate the effectiveness of the proposed approach with densely aligned multimodal representations in modeling fine-grained social interactions. Project website: https://sangmin-git.github.io/projects/MMSI. + + + + Prompt-Driven Dynamic Object-Centric Learning for Single Domain Generalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Prompt-Driven_Dynamic_Object-Centric_Learning_for_Single_Domain_Generalization_CVPR_2024_paper.pdf + Single-domain generalization aims to learn a model from single source domain data attaining generalized performance on other unseen target domains. Existing works primarily focus on improving the generalization ability of static networks. However static networks are unable to dynamically adapt to the diverse variations in different image scenes leading to limited generalization capability. Different scenes exhibit varying levels of complexity and the complexity of images further varies significantly in cross-domain scenarios. In this paper we propose a dynamic object-centric perception network based on prompt learning aiming to adapt to the variations in image complexity. Specifically we propose an object-centric gating module based on prompt learning to focus attention on the object-centric features guided by the various scene prompts. Then with the object-centric gating masks the dynamic selective module dynamically selects highly correlated feature regions in both spatial and channel dimensions enabling the model to adaptively perceive object-centric relevant features thereby enhancing the generalization capability. Extensive experiments were conducted on single-domain generalization tasks in image classification and object detection. The experimental results demonstrate that our approach outperforms state-of-the-art methods which validates the effectiveness and versatility of our proposed method. + + + + Dual Pose-invariant Embeddings: Learning Category and Object-specific Discriminative Representations for Recognition and Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Sarkar_Dual_Pose-invariant_Embeddings_Learning_Category_and_Object-specific_Discriminative_Representations_for_CVPR_2024_paper.pdf + In the context of pose-invariant object recognition and retrieval we demonstrate that it is possible to achieve significant improvements in performance if both the category-based and the object-identity-based embeddings are learned simultaneously during training. In hindsight that sounds intuitive because learning about the categories is more fundamental than learning about the individual objects that correspond to those categories. However to the best of what we know no prior work in pose invariant learning has demonstrated this effect. This paper presents an attention-based dual-encoder architecture with specially designed loss functions that optimize the inter- and intra-class distances simultaneously in two different embedding spaces one for the category embeddings and the other for the object level embeddings. 
The loss functions we have proposed are pose-invariant ranking losses that are designed to minimize the intra-class distances and maximize the inter-class distances in the dual representation spaces. We demonstrate the power of our approach with three challenging multi-view datasets ModelNet-40 ObjectPI and FG3D. With our dual approach for single view object recognition we outperform the previous best by 20.0% on ModelNet40 2.0% on ObjectPI and 46.5% on FG3D. On the other hand for single-view object retrieval we outperform the previous best by 33.7% on ModelNet40 18.8% on ObjectPI and 56.9% on FG3D. + + + + vid-TLDR: Training Free Token Merging for Light-weight Video Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Choi_vid-TLDR_Training_Free_Token_Merging_for_Light-weight_Video_Transformer_CVPR_2024_paper.pdf + Video Transformers have become the prevalent solution for various video downstream tasks with superior expressive power and flexibility. However these video transformers suffer from heavy computational costs induced by the massive number of tokens across the entire video frames which has been the major barrier to training the model. Further the patches irrelevant to the main contents e.g. backgrounds degrade the generalization performance of models. To tackle these issues we propose training free token merging for lightweight video Transformer (vid-TLDR) that aims to enhance the efficiency of video Transformers by merging the background tokens without additional training. For vid-TLDR we introduce a novel approach to capture the salient regions in videos only with the attention map. Further we introduce the saliency-aware token merging strategy by dropping the background tokens and sharpening the object scores. Our experiments show that vid-TLDR significantly mitigates the computational complexity of video Transformers while achieving competitive performance compared to the base model without vid-TLDR. Code is available at https://github.com/mlvlab/vid-TLDR. + + + + DRESS: Instructing Large Vision-Language Models to Align and Interact with Humans via Natural Language Feedback + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_DRESS_Instructing_Large_Vision-Language_Models_to_Align_and_Interact_with_CVPR_2024_paper.pdf + We present DRESS a large vision language model (LVLM) that innovatively exploits Natural Language feedback (NLF) from Large Language Models to enhance its alignment and interactions by addressing two key limitations in the state-of-the-art LVLMs. First prior LVLMs generally rely only on the instruction finetuning stage to enhance alignment with human preferences. Without incorporating extra feedback they are still prone to generate unhelpful hallucinated or harmful responses. Second while the visual instruction tuning data is generally structured in a multi-turn dialogue format the connections and dependencies among consecutive conversational turns are weak. This reduces the capacity for effective multi-turn interactions. To tackle these we propose a novel categorization of the NLF into two key types: critique and refinement. The critique NLF identifies the strengths and weaknesses of the responses and is used to align the LVLMs with human preferences. The refinement NLF offers concrete suggestions for improvement and is adopted to improve the interaction ability of the LVLMs-- which focuses on LVLMs' ability to refine responses by incorporating feedback in multi-turn interactions. 
To address the non-differentiable nature of NLF we generalize conditional reinforcement learning for training. Our experimental results demonstrate that DRESS can generate more helpful (9.76%) honest (11.52%) and harmless (21.03%) responses and more effectively learn from feedback during multi-turn interactions compared to SOTA LVLMs. + + + + Salience DETR: Enhancing Detection Transformer with Hierarchical Salience Filtering Refinement + http://openaccess.thecvf.com//content/CVPR2024/papers/Hou_Salience_DETR_Enhancing_Detection_Transformer_with_Hierarchical_Salience_Filtering_Refinement_CVPR_2024_paper.pdf + DETR-like methods have significantly increased detection performance in an end-to-end manner. The mainstream two-stage frameworks of them perform dense self-attention and select a fraction of queries for sparse cross-attention which is proven effective for improving performance but also introduces a heavy computational burden and high dependence on stable query selection. This paper demonstrates that suboptimal two-stage selection strategies result in scale bias and redundancy due to the mismatch between selected queries and objects in two-stage initialization. To address these issues we propose hierarchical salience filtering refinement which performs transformer encoding only on filtered discriminative queries for a better trade-off between computational efficiency and precision. The filtering process overcomes scale bias through a novel scale-independent salience supervision. To compensate for the semantic misalignment among queries we introduce elaborate query refinement modules for stable two-stage initialization. Based on above improvements the proposed Salience DETR achieves significant improvements of +4.0% AP +0.2% AP +4.4% AP on three challenging task-specific detection datasets as well as 49.2% AP on COCO 2017 with less FLOPs. The code is available at https://github.com/xiuqhou/Salience-DETR. + + + + Towards More Unified In-context Visual Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Sheng_Towards_More_Unified_In-context_Visual_Understanding_CVPR_2024_paper.pdf + The rapid advancement of large language models (LLMs) has accelerated the emergence of in-context learning (ICL) as a cutting-edge approach in the natural language processing domain. Recently ICL has been employed in visual understanding tasks such as semantic segmentation and image captioning yielding promising results. However existing visual ICL framework can not enable producing content across multiple modalities which limits their potential usage scenarios. To address this issue we present a new ICL framework for visual understanding with multi-modal output enabled. First we quantize and embed both text and visual prompt into a unified representational space structured as interleaved in-context sequences. Then a decoder-only sparse transformer architecture is employed to perform generative modeling on them facilitating in-context learning. Thanks to this design the model is capable of handling in-context vision understanding tasks with multimodal output in a unified pipeline. Experimental results demonstrate that our model achieves competitive performance compared with specialized models and previous ICL baselines. Overall our research takes a further step toward unified multimodal in-context learning. 
+ + + + F3Loc: Fusion and Filtering for Floorplan Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_F3Loc_Fusion_and_Filtering_for_Floorplan_Localization_CVPR_2024_paper.pdf + In this paper we propose an efficient data-driven solution to self-localization within a floorplan. Floorplan data is readily available long-term persistent and inherently robust to changes in the visual appearance. Our method does not require retraining per map and location or demand a large database of images of the area of interest. We propose a novel probabilistic model consisting of an observation and a novel temporal filtering module. Operating internally with an efficient ray-based representation the observation module consists of a single and a multiview module to predict horizontal depth from images and fuses their results to benefit from advantages offered by either methodology. Our method operates on conventional consumer hardware and overcomes a common limitation of competing methods that often demand upright images. Our full system meets real-time requirements while outperforming the state-of-the-art by a significant margin. + + + + Multi-View Attentive Contextualization for Multi-View 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Multi-View_Attentive_Contextualization_for_Multi-View_3D_Object_Detection_CVPR_2024_paper.pdf + We present Multi-View Attentive Contextualization (MvACon) a simple yet effective method for improving 2D-to-3D feature lifting in query-based multi-view 3D (MV3D) object detection. Despite remarkable progress witnessed in the field of query-based MV3D object detection prior art often suffers from either the lack of exploiting high-resolution 2D features in dense attention-based lifting due to high computational costs or from insufficiently dense grounding of 3D queries to multi-scale 2D features in sparse attention-based lifting. Our proposed MvACon hits the two birds with one stone using a representationally dense yet computationally sparse attentive feature contextualization scheme that is agnostic to specific 2D-to-3D feature lifting approaches. In experiments the proposed MvACon is thoroughly tested on the nuScenes benchmark using both the BEVFormer and its recent 3D deformable attention (DFA3D) variant as well as the PETR showing consistent detection performance improvement especially in enhancing performance in location orientation and velocity prediction. It is also tested on the Waymo-mini benchmark using BEVFormer with similar improvement. We qualitatively and quantitatively show that global cluster-based contexts effectively encode dense scene-level contexts for MV3D object detection. The promising results of our proposed MvACon reinforces the adage in computer vision "(contextualized) feature matters". + + + + MemSAM: Taming Segment Anything Model for Echocardiography Video Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_MemSAM_Taming_Segment_Anything_Model_for_Echocardiography_Video_Segmentation_CVPR_2024_paper.pdf + We propose a novel echocardiographical video segmentation model by adapting SAM to medical videos to address some long-standing challenges in ultrasound video segmentation including (1) massive speckle noise and artifacts (2) extremely ambiguous boundaries and (3) large variations of targeting objects across frames. The core technique of our model is a temporal-aware and noise-resilient prompting scheme. 
Specifically we employ a space-time memory that contains both spatial and temporal information to prompt the segmentation of the current frame and thus we call the proposed model MemSAM. In prompting the memory carrying temporal cues sequentially prompts the video segmentation frame by frame. Meanwhile as the memory prompt propagates high-level features it avoids the issue of misidentification caused by mask propagation and improves representation consistency. To address the challenge of speckle noise we further propose a memory reinforcement mechanism which leverages predicted masks to improve the quality of the memory before storing it. We extensively evaluate our method on two public datasets and demonstrate state-of-the-art performance compared to existing models. Particularly our model achieves comparable performance with fully supervised approaches with limited annotations. Codes are available at https://github.com/dengxl0520/MemSAM. + + + + Language-conditioned Detection Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Cho_Language-conditioned_Detection_Transformer_CVPR_2024_paper.pdf + We present a new open-vocabulary detection framework. Our framework uses both image-level labels and detailed detection annotations when available. Our framework proceeds in three steps. We first train a language-conditioned object detector on fully-supervised detection data. This detector gets to see the presence or absence of ground truth classes during training and conditions prediction on the set of present classes. We use this detector to pseudo-label images with image-level labels. Our detector provides much more accurate pseudo-labels than prior approaches with its conditioning mechanism. Finally we train an unconditioned open-vocabulary detector on the pseudo-annotated images. The resulting detector named DECOLA shows strong zero-shot performance in open-vocabulary LVIS benchmark as well as direct zero-shot transfer benchmarks on LVIS COCO Object365 and OpenImages. DECOLA outperforms the prior arts by 17.1 AP-rare and 9.4 mAP on zero-shot LVIS benchmark. DECOLA achieves state-of-the-art results in various model sizes architectures and datasets by only training on open-sourced data and academic-scale computing. Code is available at https://github.com/janghyuncho/DECOLA. + + + + Improving Single Domain-Generalized Object Detection: A Focus on Diversification and Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Danish_Improving_Single_Domain-Generalized_Object_Detection_A_Focus_on_Diversification_and_CVPR_2024_paper.pdf + In this work we tackle the problem of domain generalization for object detection specifically focusing on the scenario where only a single source domain is available. We propose an effective approach that involves two key steps: diversifying the source domain and aligning detections based on class prediction confidence and localization. Firstly we demonstrate that by carefully selecting a set of augmentations a base detector can outperform existing methods for single domain generalization by a good margin. This highlights the importance of domain diversification in improving the performance of object detectors. Secondly we introduce a method to align detections from multiple views considering both classification and localization outputs. This alignment procedure leads to better generalized and well-calibrated object detector models which are crucial for accurate decision-making in safety-critical applications. &#13;
Our approach is detector-agnostic and can be seamlessly applied to both single-stage and two-stage detectors. To validate the effectiveness of our proposed methods we conduct extensive experiments and ablations on challenging domain-shift scenarios. The results consistently demonstrate the superiority of our approach compared to existing methods. + + + + ARTrackV2: Prompting Autoregressive Tracker Where to Look and How to Describe + http://openaccess.thecvf.com//content/CVPR2024/papers/Bai_ARTrackV2_Prompting_Autoregressive_Tracker_Where_to_Look_and_How_to_CVPR_2024_paper.pdf + We present ARTrackV2 which integrates two pivotal aspects of tracking: determining where to look (localization) and how to describe (appearance analysis) the target object across video frames. Building on the foundation of its predecessor ARTrackV2 extends the concept by introducing a unified generative framework to "read out" object's trajectory and "retell" its appearance in an autoregressive manner. This approach fosters a time-continuous methodology that models the joint evolution of motion and visual features guided by previous estimates. Furthermore ARTrackV2 stands out for its efficiency and simplicity obviating the less efficient intra-frame autoregression and hand-tuned parameters for appearance updates. Despite its simplicity ARTrackV2 achieves state-of-the-art performance on prevailing benchmark datasets while demonstrating a remarkable efficiency improvement. In particular ARTrackV2 achieves an AO score of 79.5% on GOT-10k and an AUC of 86.1% on TrackingNet while being 3.6x faster than ARTrack. + + + + A Vision Check-up for Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Sharma_A_Vision_Check-up_for_Language_Models_CVPR_2024_paper.pdf + What does learning to model relationships between strings teach Large Language Models (LLMs) about the visual world? We systematically evaluate LLMs' abilities to generate and recognize an assortment of visual concepts of increasing complexity and then demonstrate how a preliminary visual representation learning system can be trained using models of text. As language models lack the ability to consume or output visual information as pixels we use code to represent images in our study. Although LLM-generated images do not look like natural images results on image generation and the ability of models to correct these generated images indicate that precise modeling of strings can teach language models about numerous aspects of the visual world. Furthermore experiments on self-supervised visual representation learning utilizing images generated with text models highlight the potential to train vision models capable of making semantic assessments of natural images using just LLMs. + + + + SyncMask: Synchronized Attentional Masking for Fashion-centric Vision-Language Pretraining + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_SyncMask_Synchronized_Attentional_Masking_for_Fashion-centric_Vision-Language_Pretraining_CVPR_2024_paper.pdf + Vision-language models (VLMs) have made significant strides in cross-modal understanding through large-scale paired datasets. However in fashion domain datasets often exhibit a disparity between the information conveyed in image and text. This issue stems from datasets containing multiple images of a single fashion item all paired with one text leading to cases where some textual details are not visible in individual images. &#13;
This mismatch particularly when non-co-occurring elements are masked undermines the training of conventional VLM objectives like Masked Language Modeling and Masked Image Modeling thereby hindering the model's ability to accurately align fine-grained visual and textual features. Addressing this problem we propose Synchronized attentional Masking (SyncMask) which generates masks that pinpoint the image patches and word tokens where the information co-occurs in both image and text. This synchronization is accomplished by harnessing cross-attentional features obtained from a momentum model ensuring a precise alignment between the two modalities. Additionally we enhance grouped batch sampling with semi-hard negatives effectively mitigating false negative issues in Image-Text Matching and Image-Text Contrastive learning objectives within fashion datasets. Our experiments demonstrate the effectiveness of the proposed approach outperforming existing methods in three downstream tasks. + + + + Countering Personalized Text-to-Image Generation with Influence Watermarks + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Countering_Personalized_Text-to-Image_Generation_with_Influence_Watermarks_CVPR_2024_paper.pdf + State-of-the-art personalized text-to-image generation systems are usually trained on a few reference images to learn novel visual representations. However this is likely to incur infringement of copyright for the reference image owners when these images are personal and publicly available. Recent progress has been made in protecting these images from unauthorized use by adding protective noises. Yet current protection methods work under the assumption that these protected images are not changed which is in contradiction to the fact that most public platforms intend to modify user-uploaded content e.g. image compression. This paper introduces a robust watermarking method namely InMark to protect images from unauthorized learning. Inspired by influence functions the proposed method forges protective watermarks on more important pixels for these reference images from both heuristic and statistical perspectives. In this way the personal semantics of these images are under protection even if these images are modified to some extent. Extensive experiments demonstrate that the proposed InMark outperforms previous state-of-the-art methods in both protective performance and robustness. + + + + PromptAD: Learning Prompts with only Normal Samples for Few-Shot Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_PromptAD_Learning_Prompts_with_only_Normal_Samples_for_Few-Shot_Anomaly_CVPR_2024_paper.pdf + The vision-language model has brought great improvement to few-shot industrial anomaly detection which usually needs to design hundreds of prompts through prompt engineering. For automated scenarios we first use conventional prompt learning with many-class paradigm as the baseline to automatically learn prompts but found that it cannot work well in one-class anomaly detection. To address the above problem this paper proposes a one-class prompt learning method for few-shot anomaly detection termed PromptAD. First we propose semantic concatenation which can transpose normal prompts into anomaly prompts by concatenating normal prompts with anomaly suffixes thus constructing a large number of negative samples used to guide prompt learning in one-class setting. &#13;
Furthermore to mitigate the training challenge caused by the absence of anomaly images we introduce the concept of explicit anomaly margin which is used to explicitly control the margin between normal prompt features and anomaly prompt features through a hyper-parameter. For image-level/pixel-level anomaly detection PromptAD achieves first place in 11/12 few-shot settings on MVTec and VisA. + + + + DETRs Beat YOLOs on Real-time Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_DETRs_Beat_YOLOs_on_Real-time_Object_Detection_CVPR_2024_paper.pdf + The YOLO series has become the most popular framework for real-time object detection due to its reasonable trade-off between speed and accuracy. However we observe that the speed and accuracy of YOLOs are negatively affected by the NMS. Recently end-to-end Transformer-based detectors (DETRs) have provided an alternative to eliminating NMS. Nevertheless the high computational cost limits their practicality and hinders them from fully exploiting the advantage of excluding NMS. In this paper we propose the Real-Time DEtection TRansformer (RT-DETR) the first real-time end-to-end object detector to our best knowledge that addresses the above dilemma. We build RT-DETR in two steps drawing on the advanced DETR: first we focus on maintaining accuracy while improving speed followed by maintaining speed while improving accuracy. Specifically we design an efficient hybrid encoder to expeditiously process multi-scale features by decoupling intra-scale interaction and cross-scale fusion to improve speed. Then we propose the uncertainty-minimal query selection to provide high-quality initial queries to the decoder thereby improving accuracy. In addition RT-DETR supports flexible speed tuning by adjusting the number of decoder layers to adapt to various scenarios without retraining. Our RT-DETR-R50 / R101 achieves 53.1% / 54.3% AP on COCO and 108 / 74 FPS on T4 GPU outperforming previously advanced YOLOs in both speed and accuracy. We also develop scaled RT-DETRs that outperform the lighter YOLO detectors (S and M models). Furthermore RT-DETR-R50 outperforms DINO-R50 by 2.2% AP in accuracy and about 21 times in FPS. After pre-training with Objects365 RT-DETR-R50 / R101 achieves 55.3% / 56.2% AP. The project page: https://zhao-yian.github.io/RTDETR. + + + + An Asymmetric Augmented Self-Supervised Learning Method for Unsupervised Fine-Grained Image Hashing + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_An_Asymmetric_Augmented_Self-Supervised_Learning_Method_for_Unsupervised_Fine-Grained_Image_CVPR_2024_paper.pdf + Unsupervised fine-grained image hashing aims to learn compact binary hash codes in unsupervised settings addressing challenges posed by large-scale datasets and dependence on supervision. In this paper we first identify a granularity gap between generic and fine-grained datasets for unsupervised hashing methods highlighting the inadequacy of conventional self-supervised learning for fine-grained visual objects. To bridge this gap we propose the Asymmetric Augmented Self-Supervised Learning (A^2-SSL) method comprising three modules. The asymmetric augmented SSL module employs suitable augmentation strategies for positive/negative views preventing fine-grained category confusion inherent in conventional SSL. Part-oriented dense contrastive learning utilizes the Fisher Vector framework to capture and model fine-grained object parts enhancing unsupervised representations through part-level dense contrastive learning. 
Self-consistent hash code learning introduces a reconstruction task aligned with the self-consistency principle guiding the model to emphasize comprehensive features particularly fine-grained patterns. Experimental results on five benchmark datasets demonstrate the superiority of A^2-SSL over existing methods affirming its efficacy in unsupervised fine-grained image hashing. + + + + Exploring Pose-Aware Human-Object Interaction via Hybrid Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Exploring_Pose-Aware_Human-Object_Interaction_via_Hybrid_Learning_CVPR_2024_paper.pdf + Human-Object Interaction (HOI) detection plays a crucial role in visual scene comprehension. In recent advancements two-stage detectors have taken a prominent position. However they are encumbered by two primary challenges. First the misalignment between feature representation and relation reasoning gives rise to a deficiency in discriminative features crucial for interaction detection. Second due to sparse annotation the second-stage interaction head generates numerous candidate <human object> pairs with only a small fraction receiving supervision. Towards these issues we propose a hybrid learning method based on pose-aware HOI feature refinement. Specifically we devise pose-aware feature refinement that encodes spatial features by considering human body pose characteristics. It can direct attention towards key regions ultimately offering a wealth of fine-grained features imperative for HOI detection. Further we introduce a hybrid learning method that combines HOI triplets with probabilistic soft labels supervision which is regenerated from decoupled verb-object pairs. This method explores the implicit connections between the interactions enhancing model generalization without requiring additional data. Our method establishes state-of-the-art performance on HICO-DET benchmark and excels notably in detecting rare HOIs. + + + + Density-Adaptive Model Based on Motif Matrix for Multi-Agent Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Wen_Density-Adaptive_Model_Based_on_Motif_Matrix_for_Multi-Agent_Trajectory_Prediction_CVPR_2024_paper.pdf + Multi-agent trajectory prediction is essential in autonomous driving risk avoidance and traffic flow control. However the heterogeneous traffic density on interactions which is caused by physical laws social norms and so on is often overlooked in existing methods. When the density varies the number of agents involved in interactions and the corresponding interaction probability change dynamically. To tackle this issue we propose a new method called Density-Adaptive Model based on Motif Matrix for Multi-Agent Trajectory Prediction (DAMM) to gain insights into multi-agent systems. Here we leverage the motif matrix to represent dynamic connectivity in a higher-order pattern and distill the interaction information from the perspectives of the spatial and the temporal dimensions. Specifically in spatial dimension we utilize multi-scale feature fusion to adaptively select the optimal range of neighbors participating in interactions for each time slot. In temporal dimension we extract the temporal interaction features and adapt a pyramidal pooling layer to generate the interaction probability for each agent. Experimental results demonstrate that our approach surpasses state-of-the-art methods on autonomous driving dataset. &#13;
+ + + + Contrastive Learning for DeepFake Classification and Localization via Multi-Label Ranking + http://openaccess.thecvf.com//content/CVPR2024/papers/Hong_Contrastive_Learning_for_DeepFake_Classification_and_Localization_via_Multi-Label_Ranking_CVPR_2024_paper.pdf + We propose a unified approach to simultaneously addressing the conventional setting of binary deepfake classification and a more challenging scenario of uncovering what facial components have been forged as well as the exact order of the manipulations. To solve the former task we consider multiple instance learning (MIL) that takes each image as a bag and its patches as instances. A positive bag corresponds to a forged image that includes at least one manipulated patch (i.e. a pixel in the feature map). The formulation allows us to estimate the probability of an input image being a fake one and establish the corresponding contrastive MIL loss. On the other hand tackling the component-wise deepfake problem can be reduced to solving multi-label prediction but the requirement to recover the manipulation order further complicates the learning task into a multi-label ranking problem. We resolve this difficulty by designing a tailor-made loss term to enforce that the rank order of the predicted multi-label probabilities respects the ground-truth order of the sequential modifications of a deepfake image. Through extensive experiments and comparisons with other relevant techniques we provide extensive results and ablation studies to demonstrate that the proposed method is an overall more comprehensive solution to deepfake detection. + + + + Enhancing the Power of OOD Detection via Sample-Aware Model Selection + http://openaccess.thecvf.com//content/CVPR2024/papers/Xue_Enhancing_the_Power_of_OOD_Detection_via_Sample-Aware_Model_Selection_CVPR_2024_paper.pdf + In this work we present a novel perspective on detecting out-of-distribution (OOD) samples and propose an algorithm for sample-aware model selection to enhance the effectiveness of OOD detection. Our algorithm determines for each test input which pre-trained models in the model zoo are capable of identifying the test input as an OOD sample. If no such models exist in the model zoo the test input is classified as an in-distribution (ID) sample. We theoretically demonstrate that our method maintains the true positive rate of ID samples and accurately identifies OOD samples with high probability when there are a sufficient number of diverse pre-trained models in the model zoo. Extensive experiments were conducted to validate our method demonstrating that it leverages the complementarity among single-model detectors to consistently improve the effectiveness of OOD sample identification. Compared to baseline methods our approach improved the relative performance by 65.40% and 37.25% on the CIFAR10 and ImageNet benchmarks respectively. + + + + Collaborative Semantic Occupancy Prediction with Hybrid Feature Fusion in Connected Automated Vehicles + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_Collaborative_Semantic_Occupancy_Prediction_with_Hybrid_Feature_Fusion_in_Connected_CVPR_2024_paper.pdf + Collaborative perception in automated vehicles leverages the exchange of information between agents aiming to elevate perception results. Previous camera-based collaborative 3D perception methods typically employ 3D bounding boxes or bird's eye views as representations of the environment. &#13;
However these approaches fall short in offering a comprehensive 3D environmental prediction. To bridge this gap we introduce the first method for collaborative 3D semantic occupancy prediction. Particularly it improves local 3D semantic occupancy predictions by hybrid fusion of (i) semantic and occupancy task features and (ii) compressed orthogonal attention features shared between vehicles. Additionally due to the lack of a collaborative perception dataset designed for semantic occupancy prediction we augment a current collaborative perception dataset to include 3D collaborative semantic occupancy labels for a more robust evaluation. The experimental findings highlight that: (i) our collaborative semantic occupancy predictions excel above the results from single vehicles by over 30% and (ii) models anchored on semantic occupancy outpace state-of-the-art collaborative 3D detection techniques in subsequent perception applications showcasing enhanced accuracy and enriched semantic-awareness in road environments. + + + + Towards Generalizable Tumor Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Towards_Generalizable_Tumor_Synthesis_CVPR_2024_paper.pdf + Tumor synthesis enables the creation of artificial tumors in medical images facilitating the training of AI models for tumor detection and segmentation. However success in tumor synthesis hinges on creating visually realistic tumors that are generalizable across multiple organs and furthermore the resulting AI models being capable of detecting real tumors in images sourced from different domains (e.g. hospitals). This paper made a progressive stride toward generalizable tumor synthesis by leveraging a critical observation: early-stage tumors (< 2cm) tend to have similar imaging characteristics in computed tomography (CT) whether they originate in the liver pancreas or kidneys. We have ascertained that generative AI models e.g. Diffusion Models can create realistic tumors generalized to a range of organs even when trained on a limited number of tumor examples from only one organ. Moreover we have shown that AI models trained on these synthetic tumors can be generalized to detect and segment real tumors from CT volumes encompassing a broad spectrum of patient demographics imaging protocols and healthcare facilities. + + + + EpiDiff: Enhancing Multi-View Synthesis via Localized Epipolar-Constrained Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_EpiDiff_Enhancing_Multi-View_Synthesis_via_Localized_Epipolar-Constrained_Diffusion_CVPR_2024_paper.pdf + Generating multiview images from a single view facilitates the rapid generation of a 3D mesh conditioned on a single image. Recent methods that introduce 3D global representation into diffusion models have shown the potential to generate consistent multiviews but they have reduced generation speed and face challenges in maintaining generalizability and quality. To address this issue we propose EpiDiff a localized interactive multiview diffusion model. At the core of the proposed approach is to insert a lightweight epipolar attention block into the frozen diffusion model leveraging epipolar constraints to enable cross-view interaction among feature maps of neighboring views. The newly initialized 3D modeling module preserves the original feature distribution of the diffusion model exhibiting compatibility with a variety of base diffusion models. 
Experiments show that EpiDiff generates 16 multiview images in just 12 seconds and it surpasses previous methods in quality evaluation metrics including PSNR SSIM and LPIPS. Additionally EpiDiff can generate a more diverse distribution of views improving the reconstruction quality from generated multiviews. Please see the project page at https://huanngzh.github.io/EpiDiff/. + + + + On the Faithfulness of Vision Transformer Explanations + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_On_the_Faithfulness_of_Vision_Transformer_Explanations_CVPR_2024_paper.pdf + To interpret Vision Transformers post-hoc explanations assign salience scores to input pixels providing human-understandable heatmaps. However whether these interpretations reflect true rationales behind the model's output is still underexplored. To address this gap we study the faithfulness criterion of explanations: the assigned salience scores should represent the influence of the corresponding input pixels on the model's predictions. To evaluate faithfulness we introduce Salience-guided Faithfulness Coefficient (SaCo) a novel evaluation metric leveraging essential information of salience distribution. Specifically we conduct pair-wise comparisons among distinct pixel groups and then aggregate the differences in their salience scores resulting in a coefficient that indicates the explanation's degree of faithfulness. Our explorations reveal that current metrics struggle to differentiate between advanced explanation methods and Random Attribution thereby failing to capture the faithfulness property. In contrast our proposed SaCo offers a reliable faithfulness measurement establishing a robust metric for interpretations. Furthermore our SaCo demonstrates that the use of gradient and multi-layer aggregation can markedly enhance the faithfulness of attention-based explanation shedding light on potential paths for advancing Vision Transformer explainability. + + + + Pixel-level Semantic Correspondence through Layout-aware Representation Learning and Multi-scale Matching Integration + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Pixel-level_Semantic_Correspondence_through_Layout-aware_Representation_Learning_and_Multi-scale_Matching_CVPR_2024_paper.pdf + Establishing precise semantic correspondence across object instances in different images is a fundamental and challenging task in computer vision. In this task difficulty arises often due to three challenges: confusing regions with similar appearance inconsistent object scale and indistinguishable nearby pixels. Recognizing these challenges our paper proposes a novel semantic matching pipeline named LPMFlow toward extracting fine-grained semantics and geometry layouts for building pixel-level semantic correspondences. LPMFlow consists of three modules each addressing one of the aforementioned challenges. The layout-aware representation learning module uniformly encodes source and target tokens to distinguish pixels or regions with similar appearances but different geometry semantics. The progressive feature superresolution module outputs four sets of 4D correlation tensors to generate accurate semantic flow between objects in different scales. Finally the matching flow integration and refinement module is exploited to fuse matching flow in different scales to give the final flow predictions. The whole pipeline can be trained end-to-end with a balance of computational cost and correspondence details. 
Extensive experiments based on benchmarks such as SPair-71K PF-PASCAL and PF-WILLOW have proved that the proposed method can well tackle the three challenges and outperform the previous methods especially in more stringent settings. Code is available at https://github.com/YXSUNMADMAX/LPMFlow. + + + + Dynamic Graph Representation with Knowledge-aware Attention for Histopathology Whole Slide Image Analysis + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Dynamic_Graph_Representation_with_Knowledge-aware_Attention_for_Histopathology_Whole_Slide_CVPR_2024_paper.pdf + Histopathological whole slide images (WSIs) classification has become a foundation task in medical microscopic imaging processing. Prevailing approaches involve learning WSIs as instance-bag representations emphasizing significant instances but struggling to capture the interactions between instances. Additionally conventional graph representation methods utilize explicit spatial positions to construct topological structures but restrict the flexible interaction capabilities between instances at arbitrary locations particularly when spatially distant. In response we propose a novel dynamic graph representation algorithm that conceptualizes WSIs as a form of the knowledge graph structure. Specifically we dynamically construct neighbors and directed edge embeddings based on the head and tail relationships between instances. Then we devise a knowledge-aware attention mechanism that can update the head node features by learning the joint attention score of each neighbor and edge. Finally we obtain a graph-level embedding through the global pooling process of the updated head serving as an implicit representation for the WSI classification. Our end-to-end graph representation learning approach has outperformed the state-of-the-art WSI analysis methods on three TCGA benchmark datasets and in-house test sets. Our code is available at https://github.com/WonderLandxD/WiKG. + + + + Align Before Adapt: Leveraging Entity-to-Region Alignments for Generalizable Video Action Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Align_Before_Adapt_Leveraging_Entity-to-Region_Alignments_for_Generalizable_Video_Action_CVPR_2024_paper.pdf + Large-scale visual-language pre-trained models have achieved significant success in various video tasks. However most existing methods follow an "adapt then align" paradigm which adapts pre-trained image encoders to model video-level representations and utilizes one-hot or text embedding of the action labels for supervision. This paradigm overlooks the challenge of mapping from static images to complicated activity concepts. In this paper we propose a novel "Align before Adapt" (ALT) paradigm. Prior to adapting to video representation learning we exploit the entity-to-region alignments for each frame. The alignments are fulfilled by matching the region-aware image embeddings to an offline-constructed text corpus. With the aligned entities we feed their text embeddings to a transformer-based video adapter as the queries which can help extract the semantics of the most important entities from a video to a vector. This paradigm reuses the visual-language alignment of VLP during adaptation and tries to explain an action by the underlying entities. This helps understand actions by bridging the gap with complex activity semantics particularly when facing unfamiliar or unseen categories. ALT demonstrates competitive performance while maintaining remarkably low computational costs. 
In fully supervised experiments it achieves 88.1% top-1 accuracy on Kinetics-400 with only 4947 GFLOPs. Moreover ALT outperforms the previous state-of-the-art methods in both zero-shot and few-shot experiments emphasizing its superior generalizability across various learning scenarios. + + + + Towards Robust 3D Object Detection with LiDAR and 4D Radar Fusion in Various Weather Conditions + http://openaccess.thecvf.com//content/CVPR2024/papers/Chae_Towards_Robust_3D_Object_Detection_with_LiDAR_and_4D_Radar_CVPR_2024_paper.pdf + Detecting objects in 3D under various (normal and adverse) weather conditions is essential for safe autonomous driving systems. Recent approaches have focused on employing weather-insensitive 4D radar sensors and leveraging them with other modalities such as LiDAR. However they fuse multi-modal information without considering the sensor characteristics and weather conditions and lose some height information which could be useful for localizing 3D objects. In this paper we propose a novel framework for robust LiDAR and 4D radar-based 3D object detection. Specifically we propose a 3D-LRF module that considers the distinct patterns they exhibit in 3D space (e.g. precise 3D mapping of LiDAR and wide-range weather-insensitive measurement of 4D radar) and extract fusion features based on their 3D spatial relationship. Then our weather-conditional radar-flow gating network modulates the information flow of fusion features depending on weather conditions and obtains enhanced feature that effectively incorporates the strength of two domains under various weather conditions. The extensive experiments demonstrate that our model achieves SoTA performance for 3D object detection under various weather conditions. + + + + Enhancing 3D Fidelity of Text-to-3D using Cross-View Correspondences + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Enhancing_3D_Fidelity_of_Text-to-3D_using_Cross-View_Correspondences_CVPR_2024_paper.pdf + Leveraging multi-view diffusion models as priors for 3D optimization have alleviated the problem of 3D consistency e.g. the Janus face problem or the content drift problem in zero-shot text-to-3D models. However the 3D geometric fidelity of the output remains an unresolved issue; albeit the rendered 2D views are realistic the underlying geometry may contain errors such as unreasonable concavities. In this work we propose CorrespondentDream an effective method to leverage annotation-free cross-view correspondences yielded from the diffusion U-Net to provide additional 3D prior to the NeRF optimization process. We find that these correspondences are strongly consistent with human perception and by adopting it in our loss design we are able to produce NeRF models with geometries that are more coherent with common sense e.g. more smoothed object surface yielding higher 3D fidelity. We demonstrate the efficacy of our approach through various comparative qualitative results and a solid user study. + + + + Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs + http://openaccess.thecvf.com//content/CVPR2024/papers/Blayney_Bezier_Everywhere_All_at_Once_Learning_Drivable_Lanes_as_Bezier_CVPR_2024_paper.pdf + Knowledge of lane topology is a core problem in autonomous driving. Aerial imagery can provide high resolution quickly updatable lane source data but detecting lanes from such data has so far been an expensive manual process or where automated solutions exist undrivable and requiring of downstream processing. 
We propose a method for large-scale lane topology extraction from aerial imagery while ensuring that the resulting lanes are realistic and drivable by introducing a novel Bezier Graph shared parameterisation of Bezier curves. We develop a transformer-based model to predict these Bezier Graphs from input aerial images demonstrating competitive results on the UrbanLaneGraph dataset. We demonstrate that our method generates realistic lane graphs which require both minimal input and minimal downstream processing. We make our code publicly available at https://github.com/driskai/BGFormer + + + + Can I Trust Your Answer? Visually Grounded Video Question Answering + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_Can_I_Trust_Your_Answer_Visually_Grounded_Video_Question_Answering_CVPR_2024_paper.pdf + We study visually grounded VideoQA in response to the emerging trends of utilizing pretraining techniques for video- language understanding. Specifically by forcing vision- language models (VLMs) to answer questions and simultane- ously provide visual evidence we seek to ascertain the extent to which the predictions of such techniques are genuinely anchored in relevant video content versus spurious corre- lations from language or irrelevant visual context. Towards this we construct NExT-GQA - an extension of NExT-QA with 10.5K temporal grounding (or location) labels tied to the original QA pairs. With NExT-GQA we scrutinize a series of state-of-the-art VLMs. Through post-hoc atten- tion analysis we find that these models are extremely weak in substantiating the answers despite their strong QA per- formance. This exposes the limitation of current VLMs in making reliable predictions. As a remedy we further explore and propose a grounded-QA method via Gaussian mask optimization and cross-modal learning. Experiments with different backbones demonstrate that this grounding mechanism improves both grounding and QA. With these efforts we aim to push towards trustworthy VLMs in VQA systems. Our dataset and code are available at https://github.com/doc-doc/NExT-GQA. + + + + Polos: Multimodal Metric Learning from Human Feedback for Image Captioning + http://openaccess.thecvf.com//content/CVPR2024/papers/Wada_Polos_Multimodal_Metric_Learning_from_Human_Feedback_for_Image_Captioning_CVPR_2024_paper.pdf + Establishing an automatic evaluation metric that closely aligns with human judgments is essential for effectively developing image captioning models. Recent data-driven metrics have demonstrated a stronger correlation with human judgments than classic metrics such as CIDEr; however they lack sufficient capabilities to handle hallucinations and generalize across diverse images and texts partially because they compute scalar similarities merely using embeddings learned from tasks unrelated to image captioning evaluation. In this study we propose Polos a supervised automatic evaluation metric for image captioning models. Polos computes scores from multimodal inputs using a parallel feature extraction mechanism that leverages embeddings trained through large-scale contrastive learning. To train Polos we introduce Multimodal Metric Learning from Human Feedback (M2LHF) a framework for developing metrics based on human feedback. We constructed the Polaris dataset which comprises 131K human judgments from 550 evaluators which is approximately ten times larger than standard datasets. 
Our approach achieved state-of-the-art performance on Composite, Flickr8K-Expert, Flickr8K-CF, PASCAL-50S, FOIL and the Polaris dataset, thereby demonstrating its effectiveness and robustness. + + + + Detours for Navigating Instructional Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Ashutosh_Detours_for_Navigating_Instructional_Videos_CVPR_2024_paper.pdf + We introduce the video detours problem for navigating instructional videos. Given a source video and a natural language query asking to alter the how-to video's current path of execution in a certain way, the goal is to find a related "detour video" that satisfies the requested alteration. To address this challenge, we propose VidDetours, a novel video-language approach that learns to retrieve the targeted temporal segments from a large repository of how-to's using video-and-text conditioned queries. Furthermore, we devise a language-based pipeline that exploits how-to video narration text to create weakly supervised training data. We demonstrate our idea applied to the domain of how-to cooking videos, where a user can detour from their current recipe to find steps with alternate ingredients, tools and techniques. Validating on a ground-truth annotated dataset of 16K samples, we show our model's significant improvements over best available methods for video retrieval and question answering, with recall rates exceeding the state of the art by 35%. + + + + Discontinuity-preserving Normal Integration with Auxiliary Edges + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Discontinuity-preserving_Normal_Integration_with_Auxiliary_Edges_CVPR_2024_paper.pdf + Many surface reconstruction methods incorporate normal integration, which is a process to obtain a depth map from surface gradients. In this process, the input may represent a surface with discontinuities, e.g. due to self-occlusion. To reconstruct an accurate depth map from the input normal map, hidden surface gradients arising from the jumps must be handled. To model these jumps correctly, we design a novel discretization for the domain of normal integration. Our key idea is to introduce auxiliary edges, which bridge between piecewise-smooth planes in the domain so that the magnitude of hidden jumps can be explicitly expressed on finite elements. Using the auxiliary edges, we design a novel algorithm to optimize the discontinuity and the depth map from the input normal map. Our method optimizes discontinuities by using a combination of iterative re-weighted least squares and iterative filtering of the jump magnitudes on auxiliary edges to provide strong sparsity regularization. Compared to previous discontinuity-preserving normal integration methods, which model the magnitude of jumps only implicitly, our method reconstructs subtle discontinuities accurately thanks to our explicit representation allowing for strong sparsity regularization. + + + + Self-Supervised Multi-Object Tracking with Path Consistency + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_Self-Supervised_Multi-Object_Tracking_with_Path_Consistency_CVPR_2024_paper.pdf + In this paper, we propose a novel concept of path consistency to learn robust object matching without using manual object identity supervision. Our key idea is that, to track an object through frames, we can obtain multiple different association results from a model by varying the frames it can observe, i.e. skipping frames in observation. 
As the differences in observations do not alter the identities of objects the obtained association results should be consistent. Based on this rationale we generate multiple observation paths each specifying a different set of frames to be skipped and formulate the Path Consistency Loss that enforces the association results are consistent across different observation paths. We use the proposed loss to train our object matching model with only self-supervision. By extensive experiments on three tracking datasets (MOT17 PersonPath22 KITTI) we demonstrate that our method outperforms existing unsupervised methods with consistent margins on various evaluation metrics and even achieves performance close to supervised methods. + + + + Improving Distant 3D Object Detection Using 2D Box Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Improving_Distant_3D_Object_Detection_Using_2D_Box_Supervision_CVPR_2024_paper.pdf + Improving the detection of distant 3d objects is an important yet challenging task. For camera-based 3D perception the annotation of 3d bounding relies heavily on LiDAR for accurate depth information. As such the distance of annotation is often limited due to the sparsity of LiDAR points on distant objects which hampers the capability of existing detectors for long-range scenarios. We address this challenge by considering only 2D box supervision for distant objects since they are easy to annotate. We propose LR3D a framework that learns to recover the missing depth of distant objects. LR3D adopts an implicit projection head to learn the generation of mapping between 2D boxes and depth using the 3D supervision on close objects. This mapping allows the depth estimation of distant objects conditioned on their 2D boxes making long-range 3D detection with 2D supervision feasible. Experiments show that without distant 3D annotations LR3D allows camera-based methods to detect distant objects (over 200m) with comparable accuracy to full 3D supervision. Our framework is general and could widely benefit 3D detection methods to a large extent. + + + + HDQMF: Holographic Feature Decomposition Using Quantum Algorithms + http://openaccess.thecvf.com//content/CVPR2024/papers/Poduval_HDQMF_Holographic_Feature_Decomposition_Using_Quantum_Algorithms_CVPR_2024_paper.pdf + This paper addresses the decomposition of holographic feature vectors in Hyperdimensional Computing (HDC) aka Vector Symbolic Architectures (VSA). HDC uses high-dimensional vectors with brain-like properties to represent symbolic information and leverages efficient operators to construct and manipulate complexly structured data in a cognitive fashion. Existing models face challenges in decomposing these structures a process crucial for understanding and interpreting a composite hypervector. We address this challenge by proposing the HDC Memorized-Factorization Problem that captures the common patterns of construction in HDC models. To solve this problem efficiently we introduce HDQMF a HyperDimensional Quantum Memorized-Factorization algorithm. HDQMF is unique in its approach utilizing quantum computing to offer efficient solutions. It modifies crucial steps in Grover's algorithm to achieve hypervector decomposition achieving quadratic speed-up. 
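The HDQMF abstract above frames decomposition as recovering, from a composite hypervector, the codebook entries that were bound together. As a point of reference only, the following is a minimal classical sketch of that memorized-factorization setup, using standard VSA conventions (bipolar hypervectors, binding by elementwise multiplication) and an exhaustive-search baseline; the dimension, codebook sizes and helper names are illustrative assumptions, and nothing here reflects the paper's quantum (Grover-based) procedure.

```python
# Toy illustration of the hypervector factorization problem that HDQMF targets.
# Classical brute-force baseline over known codebooks, NOT the paper's quantum
# algorithm; dimensions and codebook sizes are arbitrary assumptions.
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
D, M = 4096, 16                       # hypervector dimension, codebook size

# Three codebooks of random bipolar hypervectors (+1/-1), one per factor slot.
codebooks = [rng.choice([-1, 1], size=(M, D)) for _ in range(3)]

# Compose a hypervector by binding (elementwise multiplication) one entry from
# each codebook -- the structure a decomposition method must invert.
true_idx = (3, 7, 11)
composite = np.prod([cb[i] for cb, i in zip(codebooks, true_idx)], axis=0)

def factorize_bruteforce(c, codebooks):
    """Return the index tuple whose re-bound hypervector best matches c."""
    best, best_sim = None, -np.inf
    for idx in product(*[range(len(cb)) for cb in codebooks]):
        cand = np.prod([cb[i] for cb, i in zip(codebooks, idx)], axis=0)
        sim = cand @ c / len(c)       # cosine similarity for bipolar vectors
        if sim > best_sim:
            best, best_sim = idx, sim
    return best, best_sim

print(factorize_bruteforce(composite, codebooks))  # expected: ((3, 7, 11), 1.0)
```

The brute-force search scales with the product of the codebook sizes, which is the cost a quantum search formulation is meant to cut quadratically, per the abstract.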
+ + + + UniPAD: A Universal Pre-training Paradigm for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_UniPAD_A_Universal_Pre-training_Paradigm_for_Autonomous_Driving_CVPR_2024_paper.pdf + In the context of autonomous driving the significance of effective feature learning is widely acknowledged. While conventional 3D self-supervised pre-training methods have shown widespread success most methods follow the ideas originally designed for 2D images. In this paper we present UniPAD a novel self-supervised learning paradigm applying 3D volumetric differentiable rendering. UniPAD implicitly encodes 3D space facilitating the reconstruction of continuous 3D shape structures and the intricate appearance characteristics of their 2D projections. The flexibility of our method enables seamless integration into both 2D and 3D frameworks enabling a more holistic comprehension of the scenes. We manifest the feasibility and effectiveness of UniPAD by conducting extensive experiments on various 3D perception tasks. Our method significantly improves lidar- camera- and lidar-camera-based baseline by 9.1 7.7 and 6.9 NDS respectively. Notably our pre-training pipeline achieves 73.2 NDS for 3D object detection and 79.4 mIoU for 3D semantic segmentation on the nuScenes validation set achieving state-of-the-art results in comparison with previous methods. + + + + SocialCounterfactuals: Probing and Mitigating Intersectional Social Biases in Vision-Language Models with Counterfactual Examples + http://openaccess.thecvf.com//content/CVPR2024/papers/Howard_SocialCounterfactuals_Probing_and_Mitigating_Intersectional_Social_Biases_in_Vision-Language_Models_CVPR_2024_paper.pdf + While vision-language models (VLMs) have achieved remarkable performance improvements recently there is growing evidence that these models also posses harmful biases with respect to social attributes such as gender and race. Prior studies have primarily focused on probing such bias attributes individually while ignoring biases associated with intersections between social attributes. This could be due to the difficulty of collecting an exhaustive set of image-text pairs for various combinations of social attributes. To address this challenge we employ text-to-image diffusion models to produce counterfactual examples for probing intersectional social biases at scale. Our approach utilizes Stable Diffusion with cross attention control to produce sets of counterfactual image-text pairs that are highly similar in their depiction of a subject (e.g. a given occupation) while differing only in their depiction of intersectional social attributes (e.g. race & gender). Through our over-generate-then-filter methodology we produce SocialCounterfactuals a high-quality dataset containing 171k image-text pairs for probing intersectional biases related to gender race and physical characteristics. We conduct extensive experiments to demonstrate the usefulness of our generated dataset for probing and mitigating intersectional social biases in state-of-the-art VLMs. + + + + Efficient Privacy-Preserving Visual Localization Using 3D Ray Clouds + http://openaccess.thecvf.com//content/CVPR2024/papers/Moon_Efficient_Privacy-Preserving_Visual_Localization_Using_3D_Ray_Clouds_CVPR_2024_paper.pdf + The recent success in revealing scene details from sparse 3D point clouds obtained via structure-from-motion has raised significant privacy concerns in visual localization. 
One prominent approach for mitigating this issue is to lift 3D points to 3D lines thereby reducing the effectiveness of the scene inversion attacks but this comes at the cost of increased algorithmic complexity for camera localization due to weaker geometric constraints induced by line clouds. To overcome this limitation we propose a new lifting approach called "ray cloud" whereby each lifted 3D line intersects at one of two predefined locations depicting omnidirectional rays from two cameras. This yields two benefits i) camera localization can now be cast as relative pose estimation between the query image and the calibrated rig of two perspective cameras which can be efficiently solved using a variant of the 5-point algorithm and ii) the ray cloud introduces erroneous estimations for the density-based inversion attack degrading the quality of scene recovery. Moreover we explore possible modifications of the inversion attack to better recover scenes from the ray clouds and propose a ray sampling technique to reduce the effectiveness of the modified attack. Experimental results on two public datasets show real-time localization speed as well as enhanced privacy-preserving capability over the state-of-the-art without overly sacrificing the localization accuracy. + + + + CNC-Net: Self-Supervised Learning for CNC Machining Operations + http://openaccess.thecvf.com//content/CVPR2024/papers/Yavartanoo_CNC-Net_Self-Supervised_Learning_for_CNC_Machining_Operations_CVPR_2024_paper.pdf + CNC manufacturing is a process that employs computer numerical control (CNC) machines to govern the movements of various industrial tools and machinery encompassing equipment ranging from grinders and lathes to mills and CNC routers. However the reliance on manual CNC programming has become a bottleneck and the requirement for expert knowledge can result in significant costs. Therefore we introduce a pioneering approach named CNC-Net representing the use of deep neural networks (DNNs) to simulate CNC machines and grasp intricate operations when supplied with raw materials. CNC-Net constitutes a self-supervised framework that exclusively takes an input 3D model and subsequently generates the essential operation parameters required by the CNC machine to construct the object. Our method has the potential to transformative automation in manufacturing by offering a cost-effective alternative to the high costs of manual CNC programming while maintaining exceptional precision in 3D object production. Our experiments underscore the effectiveness of our CNC-Net in constructing the desired 3D objects through the utilization of CNC operations. Notably it excels in preserving finer local details exhibiting a marked enhancement in precision compared to the state-of-the-art 3D CAD reconstruction approaches. The codes are available at https://github.com/myavartanoo/CNC-Net_PyTorch. + + + + OPERA: Alleviating Hallucination in Multi-Modal Large Language Models via Over-Trust Penalty and Retrospection-Allocation + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_OPERA_Alleviating_Hallucination_in_Multi-Modal_Large_Language_Models_via_Over-Trust_CVPR_2024_paper.pdf + Hallucination posed as a pervasive challenge of multi-modal large language models (MLLMs) has significantly impeded their real-world usage that demands precise judgment. Existing methods mitigate this issue with either training with specific designed data or inferencing with external knowledge from other sources incurring inevitable additional costs. 
In this paper, we present OPERA, a novel MLLM decoding method grounded in an Over-trust Penalty and a Retrospection-Allocation strategy, serving as a nearly free lunch to alleviate the hallucination issue without additional data, knowledge or training. Our approach begins with an interesting observation that most hallucinations are closely tied to the knowledge aggregation patterns manifested in the self-attention matrix, i.e. MLLMs tend to generate new tokens by focusing on a few summary tokens but not all the previous tokens. Such partial over-trust inclination results in the neglecting of image tokens and leads to describing the image content with hallucinations. Based on this observation, OPERA introduces a penalty term on the model logits during the beam-search decoding to mitigate the over-trust issue, along with a rollback strategy that retrospects the presence of summary tokens in the previously generated tokens and re-allocates the token selection if necessary. With extensive experiments, OPERA shows significant hallucination-mitigating performance on different MLLMs and metrics, proving its effectiveness and generality. Our code is available at: https://github.com/shikiw/OPERA. + + + + Volumetric Environment Representation for Vision-Language Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Volumetric_Environment_Representation_for_Vision-Language_Navigation_CVPR_2024_paper.pdf + Vision-language navigation (VLN) requires an agent to navigate through a 3D environment based on visual observations and natural language instructions. It is clear that the pivotal factor for successful navigation lies in comprehensive scene understanding. Previous VLN agents employ monocular frameworks to extract 2D features of perspective views directly. Though straightforward, they struggle to capture 3D geometry and semantics, leading to a partial and incomplete environment representation. To achieve a comprehensive 3D representation with fine-grained details, we introduce a Volumetric Environment Representation (VER), which voxelizes the physical world into structured 3D cells. For each cell, VER aggregates multi-view 2D features into such a unified 3D space via 2D-3D sampling. Through coarse-to-fine feature extraction and multi-task learning for VER, our agent predicts 3D occupancy, 3D room layout and 3D bounding boxes jointly. Based on online collected VERs, our agent performs volume state estimation and builds episodic memory for predicting the next step. Experimental results show our environment representations from multi-task learning lead to evident performance gains on VLN. Our model achieves state-of-the-art performance across VLN benchmarks (R2R, REVERIE and R4R). + + + + NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_NeRFDeformer_NeRF_Transformation_from_a_Single_View_via_3D_Scene_CVPR_2024_paper.pdf + We present a method for automatically modifying a NeRF representation based on a single observation of a non-rigidly transformed version of the original scene. Our method defines the transformation as a 3D flow, specifically as a weighted linear blending of rigid transformations of 3D anchor points that are defined on the surface of the scene. In order to identify anchor points, we introduce a novel correspondence algorithm that first matches RGB-based pairs, then leverages multi-view information and 3D reprojection to robustly filter false positives in two steps. 
We also introduce a new dataset for exploring the problem of modifying a NeRF scene through a single observation. Our dataset contains 113 scenes leveraging 47 3D assets. We show that our proposed method outperforms NeRF editing methods as well as diffusion-based methods, and we also explore different methods for filtering correspondences. + + + + DiffusionTrack: Point Set Diffusion Model for Visual Object Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_DiffusionTrack_Point_Set_Diffusion_Model_for_Visual_Object_Tracking_CVPR_2024_paper.pdf + Existing Siamese or transformer trackers commonly pose visual object tracking as a one-shot detection problem, i.e. locating the target object in a single forward evaluation scheme. Despite the demonstrated success, these trackers may easily drift towards distractors with similar appearance due to the single forward evaluation scheme lacking self-correction. To address this issue, we cast visual tracking as a point set based denoising diffusion process and propose a novel generative learning based tracker dubbed DiffusionTrack. Our DiffusionTrack possesses two appealing properties: 1) It follows a novel noise-to-target tracking paradigm that leverages multiple denoising diffusion steps to localize the target in a dynamic searching manner per frame. 2) It models the diffusion process using a point set representation, which can better handle appearance variations for more precise localization. One side benefit is that DiffusionTrack greatly simplifies the post-processing, e.g. removing the window penalty scheme. Without bells and whistles, our DiffusionTrack achieves leading performance over the state-of-the-art trackers and runs in real time. The code is available at https://github.com/VISION-SJTU/DiffusionTrack. + + + + Scaling Diffusion Models to Real-World 3D LiDAR Scene Completion + http://openaccess.thecvf.com//content/CVPR2024/papers/Nunes_Scaling_Diffusion_Models_to_Real-World_3D_LiDAR_Scene_Completion_CVPR_2024_paper.pdf + Computer vision techniques play a central role in the perception stack of autonomous vehicles. Such methods are employed to perceive the vehicle surroundings given sensor data. 3D LiDAR sensors are commonly used to collect sparse 3D point clouds from the scene. However, compared to human perception, such systems struggle to deduce the unseen parts of the scene given those sparse point clouds. In this context, the scene completion task aims at predicting the gaps in the LiDAR measurements to achieve a more complete scene representation. Given the promising results of recent diffusion models as generative models for images, we propose extending them to achieve scene completion from a single 3D LiDAR scan. Previous works used diffusion models over range images extracted from LiDAR data, directly applying image-based diffusion methods. In contrast, we propose to operate directly on the points, reformulating the noising and denoising diffusion process such that it can efficiently work at scene scale. Together with our approach, we propose a regularization loss to stabilize the noise predicted during the denoising process. Our experimental evaluation shows that our method can complete the scene given a single LiDAR scan as input, producing a scene with more details compared to state-of-the-art scene completion methods. We believe that our proposed diffusion process formulation can support further research in diffusion models applied to scene-scale point cloud data. 
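The scene-completion abstract above describes running the noising and denoising diffusion process directly on LiDAR points, together with a regularization loss on the predicted noise. The sketch below is a minimal, generic DDPM-style training step on raw point coordinates, included only to make that idea concrete; the tiny MLP denoiser, the linear beta schedule and the particular regularizer (penalizing the mean of the predicted noise) are assumptions for illustration, not the paper's architecture or exact loss.

```python
# Minimal sketch of a diffusion training step operating directly on point
# coordinates. The MLP, schedule and noise-mean regulariser are illustrative
# assumptions, not the paper's implementation.
import torch
import torch.nn as nn

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

class PointDenoiser(nn.Module):
    """Predicts per-point noise from noisy coordinates and a timestep."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3 + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 3))
    def forward(self, x_t, t):
        t_feat = (t.float() / T).view(-1, 1, 1).expand(-1, x_t.shape[1], 1)
        return self.net(torch.cat([x_t, t_feat], dim=-1))

def training_step(model, x0, lam=0.1):
    """One denoising-diffusion step on a (B, N, 3) point cloud."""
    B = x0.shape[0]
    t = torch.randint(0, T, (B,))
    a_bar = alphas_cumprod[t].view(B, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise  # forward noising
    pred = model(x_t, t)
    loss_eps = ((pred - noise) ** 2).mean()                # standard DDPM loss
    loss_reg = pred.mean(dim=1).abs().mean()               # assumed stabilising term
    return loss_eps + lam * loss_reg

model = PointDenoiser()
print(training_step(model, torch.randn(2, 2048, 3)))       # scalar training loss
```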
+ + + + Physical Backdoor: Towards Temperature-based Backdoor Attacks in the Physical World + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_Physical_Backdoor_Towards_Temperature-based_Backdoor_Attacks_in_the_Physical_World_CVPR_2024_paper.pdf + Backdoor attacks have been well-studied in visible light object detection (VLOD) in recent years. However VLOD can not effectively work in dark and temperature-sensitive scenarios. Instead thermal infrared object detection (TIOD) is the most accessible and practical in such environments. In this paper our team is the first to investigate the security vulnerabilities associated with TIOD in the context of backdoor attacks spanning both the digital and physical realms. We introduce two novel types of backdoor attacks on TIOD each offering unique capabilities: Object-affecting Attack and Range-affecting Attack. We conduct a comprehensive analysis of key factors influencing trigger design which include temperature size material and concealment. These factors especially temperature significantly impact the efficacy of backdoor attacks on TIOD. A thorough understanding of these factors will serve as a foundation for designing physical triggers and temperature controlling experiments. Our study includes extensive experiments conducted in both digital and physical environments. In the digital realm we evaluate our approach using benchmark datasets for TIOD achieving an Attack Success Rate (ASR) of up to 98.21%. In the physical realm we test our approach in two real-world settings: a traffic intersection and a parking lot using a thermal infrared camera. Here we attain an ASR of up to 98.38%. + + + + Make Me a BNN: A Simple Strategy for Estimating Bayesian Uncertainty from Pre-trained Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Franchi_Make_Me_a_BNN_A_Simple_Strategy_for_Estimating_Bayesian_CVPR_2024_paper.pdf + Deep Neural Networks (DNNs) are powerful tools for various computer vision tasks yet they often struggle with reliable uncertainty quantification -a critical requirement for real-world applications. Bayesian Neural Networks (BNN) are equipped for uncertainty estimation but cannot scale to large DNNs where they are highly unstable to train. To address this challenge we introduce the Adaptable Bayesian Neural Network (ABNN) a simple and scalable strategy to seamlessly transform DNNs into BNNs in a post-hoc manner with minimal computational and training overheads. ABNN preserves the main predictive properties of DNNs while enhancing their uncertainty quantification abilities through simple BNN adaptation layers (attached to normalization layers) and a few fine-tuning steps on pre-trained models. We conduct extensive experiments across multiple datasets for image classification and semantic segmentation tasks and our results demonstrate that ABNN achieves state-of-the-art performance without the computational budget typically associated with ensemble methods. + + + + Language-only Training of Zero-shot Composed Image Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Gu_Language-only_Training_of_Zero-shot_Composed_Image_Retrieval_CVPR_2024_paper.pdf + Composed image retrieval (CIR) task takes a composed query of image and text aiming to search relative images for both conditions. Conventional CIR approaches need a training dataset composed of triplets of query image query text and target image which is very expensive to collect. 
Several recent works have worked on the zero-shot (ZS) CIR paradigm to tackle the issue without using pre-collected triplets. However the existing ZS-CIR methods show limited backbone scalability and generalizability due to the lack of diversity of the input texts during training. We propose a novel CIR framework only using language for its training. Our LinCIR (Language-only training for CIR) can be trained only with text datasets by a novel self-supervision named self-masking projection (SMP). We project the text latent embedding to the token embedding space and construct a new text by replacing the keyword tokens of the original text. Then we let the new and original texts have the same latent embedding vector. With this simple strategy LinCIR is surprisingly efficient and highly effective; LinCIR with CLIP ViT-G backbone is trained in 48 minutes and shows the best ZS-CIR performances on four different CIR benchmarks CIRCO GeneCIS FashionIQ and CIRR even outperforming supervised method on FashionIQ. Code is available at https://github.com/navervision/lincir + + + + Efficient and Effective Weakly-Supervised Action Segmentation via Action-Transition-Aware Boundary Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Efficient_and_Effective_Weakly-Supervised_Action_Segmentation_via_Action-Transition-Aware_Boundary_Alignment_CVPR_2024_paper.pdf + Weakly-supervised action segmentation is a task of learning to partition a long video into several action segments where training videos are only accompanied by transcripts (ordered list of actions). Most of existing methods need to infer pseudo segmentation for training by serial alignment between all frames and the transcript which is time-consuming and hard to be parallelized while training. In this work we aim to escape from this inefficient alignment with massive but redundant frames and instead to directly localize a few action transitions for pseudo segmentation generation where a transition refers to the change from an action segment to its next adjacent one in the transcript. As the true transitions are submerged in noisy boundaries due to intra-segment visual variation we propose a novel Action-Transition-Aware Boundary Alignment (ATBA) framework to efficiently and effectively filter out noisy boundaries and detect transitions. In addition to boost the semantic learning in the case that noise is inevitably present in the pseudo segmentation we also introduce video-level losses to utilize the trusted video-level supervision. Extensive experiments show the effectiveness of our approach on both performance and training speed. + + + + Pixel-Aligned Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Pixel-Aligned_Language_Model_CVPR_2024_paper.pdf + Large language models have achieved great success in recent years so as their variants in vision. Existing vision-language models can describe images in natural languages answer visual-related questions or perform complex reasoning about the image. However it is yet unclear how localization tasks such as word grounding or referring localization can be performed using large language models. In this work we aim to develop a vision-language model that can take locations for example a set of points or boxes as either inputs or outputs. When taking locations as inputs the model performs location-conditioned captioning which generates captions for the indicated object or region. 
When generating locations as outputs our model regresses pixel coordinates for each output word generated by the language model and thus performs dense word grounding. Our model is pre-trained on the Localized Narrative dataset which contains pixel-word-aligned captioning from human attention. We show our model can be applied to various location-aware vision-language tasks including referring localization location-conditioned captioning and dense object captioning archiving state-of-the-art performance on RefCOCO and Visual Genome. + + + + SNIDA: Unlocking Few-Shot Object Detection with Non-linear Semantic Decoupling Augmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_SNIDA_Unlocking_Few-Shot_Object_Detection_with_Non-linear_Semantic_Decoupling_Augmentation_CVPR_2024_paper.pdf + Once only a few-shot annotated samples are available the performance of learning-based object detection would be heavily dropped. Many few-shot object detection (FSOD) methods have been proposed to tackle this issue by adopting image-level augmentations in linear manners. Nevertheless those handcrafted enhancements often suffer from limited diversity and lack of semantic awareness resulting in unsatisfactory performance. To this end we propose a Semantic-guided Non-linear Instance-level Data Augmentation method (SNIDA) for FSOD by decoupling the foreground and background to increase their diversities respectively. We design a semantic awareness enhancement strategy to separate objects from backgrounds. Concretely masks of instances are extracted by an unsupervised semantic segmentation module. Then the diversity of samples would be improved by fusing instances into different backgrounds. Considering the shortcomings of augmenting images in a limited transformation space of existing traditional data augmentation methods we introduce an object reconstruction enhancement module. The aim of this module is to generate sufficient diversity and non-linear training data at the instance level through a semantic-guided masked autoencoder. In this way the potential of data can be fully exploited in various object detection scenarios. Extensive experiments on PASCAL VOC and MS-COCO demonstrate that the proposed method outperforms baselines by a large margin and achieves new state-of-the-art results under different shot settings. + + + + Not All Voxels Are Equal: Hardness-Aware Semantic Scene Completion with Self-Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Not_All_Voxels_Are_Equal_Hardness-Aware_Semantic_Scene_Completion_with_CVPR_2024_paper.pdf + Semantic scene completion also known as semantic occupancy prediction can provide dense geometric and semantic information for autonomous vehicles which attracts the increasing attention of both academia and industry. Unfortunately existing methods usually formulate this task as a voxel-wise classification problem and treat each voxel equally in 3D space during training. As the hard voxels have not been paid enough attention the performance in some challenging regions is limited. The 3D dense space typically contains a large number of empty voxels which are easy to learn but require amounts of computation due to handling all the voxels uniformly for the existing models. Furthermore the voxels in the boundary region are more challenging to differentiate than those in the interior. In this paper we propose HASSC approach to train the semantic scene completion model with hardness-aware design. 
The global hardness from the network optimization process is defined for dynamical hard voxel selection. Then the local hardness with geometric anisotropy is adopted for voxel-wise refinement. Besides self-distillation strategy is introduced to make training process stable and consistent. Extensive experiments show that our HASSC scheme can effectively promote the accuracy of the baseline model without incurring the extra inference cost. Source code is available at: https://github.com/songw-zju/HASSC. + + + + 3D-LFM: Lifting Foundation Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Dabhi_3D-LFM_Lifting_Foundation_Model_CVPR_2024_paper.pdf + The lifting of a 3D structure and camera from 2D landmarks is at the cornerstone of the discipline of computer vision. Traditional methods have been confined to specific rigid objects such as those in Perspective-n-Point (PnP) problems but deep learning has expanded our capability to reconstruct a wide range of object classes (e.g. C3DPO [??] and PAUL [??]) with resilience to noise occlusions and perspective distortions. However all these techniques have been limited by the fundamental need to establish correspondences across the 3D training data significantly limiting their utility to applications where one has an abundance of "in-correspondence" 3D data. Our approach harnesses the inherent permutation equivariance of transformers to manage varying numbers of points per 3D data instance withstands occlusions and generalizes to unseen categories. We demonstrate state-of-the-art performance across 2D-3D lifting task benchmarks. Since our approach can be trained across such a broad class of structures we refer to it simply as a 3D Lifting Foundation Model (3D-LFM) -- the first of its kind. + + + + Quantifying Uncertainty in Motion Prediction with Variational Bayesian Mixture + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_Quantifying_Uncertainty_in_Motion_Prediction_with_Variational_Bayesian_Mixture_CVPR_2024_paper.pdf + Safety and robustness are crucial factors in developing trustworthy autonomous vehicles. One essential aspect of addressing these factors is to equip vehicles with the capability to predict future trajectories for all moving objects in the surroundings and quantify prediction uncertainties. In this paper we propose the Sequential Neural Variational Agent (SeNeVA) a generative model that describes the distribution of future trajectories for a single moving object. Our approach can distinguish Out-of-Distribution data while quantifying uncertainty and achieving competitive performance compared to state-of-the-art methods on the Argoverse 2 and INTERACTION datasets. Specifically a 0.446 meters minimum Final Displacement Error a 0.203 meters minimum Average Displacement Error and a 5.35% Miss Rate are achieved on the INTERACTION test set. Extensive qualitative and quantitative analysis is also provided to evaluate the proposed model. Our open-source code is available at https://github.com/PurdueDigitalTwin/seneva. + + + + Explaining CLIP's Performance Disparities on Data from Blind/Low Vision Users + http://openaccess.thecvf.com//content/CVPR2024/papers/Massiceti_Explaining_CLIPs_Performance_Disparities_on_Data_from_BlindLow_Vision_Users_CVPR_2024_paper.pdf + Large multi-modal models (LMMs) hold the potential to usher in a new era of automated visual assistance for people who are blind or low vision (BLV). Yet these models have not been systematically evaluated on data captured by BLV users. 
We address this by empirically assessing CLIP a widely-used LMM likely to underpin many assistive technologies. Testing 25 CLIP variants in a zero-shot classification task we find that their accuracy is 15 percentage points lower on average for images captured by BLV users than web-crawled images. This disparity stems from CLIP's sensitivities to 1) image content (e.g. not recognizing disability objects as well as other objects); 2) image quality (e.g. not being robust to lighting variation); and 3) text content (e.g. not recognizing objects described by tactile adjectives as well as visual ones). We delve deeper with a textual analysis of three common pre-training datasets: LAION-400M LAION-2B and DataComp-1B showing that disability content is rarely mentioned. We then provide three examples that illustrate how the performance disparities extend to three downstream models underpinned by CLIP: OWL-ViT CLIPSeg and DALL-E2. We find that few-shot learning with as few as 5 images can mitigate CLIP's quality-of-service disparities for BLV users in some scenarios which we discuss alongside a set of other possible mitigations. + + + + SingularTrajectory: Universal Trajectory Predictor Using Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Bae_SingularTrajectory_Universal_Trajectory_Predictor_Using_Diffusion_Model_CVPR_2024_paper.pdf + There are five types of trajectory prediction tasks: deterministic stochastic domain adaptation momentary observation and few-shot. These associated tasks are defined by various factors such as the length of input paths data split and pre-processing methods. Interestingly even though they commonly take sequential coordinates of observations as input and infer future paths in the same coordinates as output designing specialized architectures for each task is still necessary. For the other task generality issues can lead to sub-optimal performances. In this paper we propose SingularTrajectory a diffusion-based universal trajectory prediction framework to reduce the performance gap across the five tasks. The core of SingularTrajectory is to unify a variety of human dynamics representations on the associated tasks. To do this we first build a Singular space to project all types of motion patterns from each task into one embedding space. We next propose an adaptive anchor working in the Singular space. Unlike traditional fixed anchor methods that sometimes yield unacceptable paths our adaptive anchor enables correct anchors which are put into a wrong location based on a traversability map. Finally we adopt a diffusion-based predictor to further enhance the prototype paths using a cascaded denoising process. Our unified framework ensures the generality across various benchmark settings such as input modality and trajectory lengths. Extensive experiments on five public benchmarks demonstrate that SingularTrajectory substantially outperforms existing models highlighting its effectiveness in estimating general dynamics of human movements. Code is publicly available at https://github.com/inhwanbae/SingularTrajectory. + + + + Generating Handwritten Mathematical Expressions From Symbol Graphs: An End-to-End Pipeline + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Generating_Handwritten_Mathematical_Expressions_From_Symbol_Graphs_An_End-to-End_Pipeline_CVPR_2024_paper.pdf + In this paper we explore a novel challenging generation task i.e. Handwritten Mathematical Expression Generation (HMEG) from symbolic sequences. 
Since symbolic sequences are naturally graph-structured data we formulate HMEG as a graph-to-image (G2I) generation problem. Unlike the generation of natural images HMEG requires critic layout clarity for synthesizing correct and recognizable formulas but has no real masks available to supervise the learning process. To alleviate this challenge we propose a novel end-to-end G2I generation pipeline (i.e. graph - layout - mask - image) which requires no real masks or nondifferentiable alignment between layouts and masks. Technically to boost the capacity of predicting detailed relations among adjacent symbols we propose a Less-is-More (LiM) learning strategy. In addition we design a differentiable layout refinement module which maps bounding boxes to pixel-level soft masks so as to further alleviate ambiguous layout areas. Our whole model including layout prediction mask refinement and image generation can be jointly optimized in an end-to-end manner. Experimental results show that our model can generate high-quality HME images and outperforms previous generative methods. Besides a series of ablations study demonstrate effectiveness of the proposed techniques. Finally we validate that our generated images promisingly boosts the performance of HME recognition models through data augmentation. Our code and results are available at: https://github.com/AiArt-HDU/HMEG. + + + + Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Nagasinghe_Why_Not_Use_Your_Textbook_Knowledge-Enhanced_Procedure_Planning_of_Instructional_CVPR_2024_paper.pdf + In this paper we explore the capability of an agent to construct a logical sequence of action steps thereby assembling a strategic procedural plan. This plan is crucial for navigating from an initial visual observation to a target visual outcome as depicted in real-life instructional videos. Existing works have attained partial success by extensively leveraging various sources of information available in the datasets such as heavy intermediate visual observations procedural names or natural language step-by-step instructions for features or supervision signals. However the task remains formidable due to the implicit causal constraints in the sequencing of steps and the variability inherent in multiple feasible plans. To tackle these intricacies that previous efforts have overlooked we propose to enhance the agent's capabilities by infusing it with procedural knowledge. This knowledge sourced from training procedure plans and structured as a directed weighted graph equips the agent to better navigate the complexities of step sequencing and its potential variations. We coin our approach KEPP a novel Knowledge-Enhanced Procedure Planning system which harnesses a probabilistic procedural knowledge graph extracted from training data effectively acting as a comprehensive textbook for the training domain. Experimental evaluations across three widely-used datasets under settings of varying complexity reveal that KEPP attains superior state-of-the-art results while requiring only minimal supervision. 
Code and trained model are available at https://github.com/Ravindu-Yasas-Nagasinghe/KEPP + + + + FreeKD: Knowledge Distillation via Semantic Frequency Prompt + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_FreeKD_Knowledge_Distillation_via_Semantic_Frequency_Prompt_CVPR_2024_paper.pdf + Knowledge distillation (KD) has been applied to various tasks successfully and mainstream methods typically boost the student model via spatial imitation losses. However the consecutive downsamplings induced in the spatial domain of teacher model is a type of corruption hindering the student from analyzing what specific information needs to be imitated which results in accuracy degradation. To better understand the underlying pattern of corrupted feature maps we shift our attention to the frequency domain. During frequency distillation we encounter a new challenge: the low-frequency bands convey general but minimal context while the high are more informative but also introduce noise. Not each pixel within the frequency bands contributes equally to the performance. To address the above problem: (1) We propose the Frequency Prompt plugged into the teacher model absorbing the semantic frequency context during finetuning. (2) During the distillation period a pixel-wise frequency mask is generated via Frequency Prompt to localize those pixel of interests (PoIs) in various frequency bands. Additionally we employ a position-aware relational frequency loss for dense prediction tasks delivering a high-order spatial enhancement to the student model. We dub our Frequency Knowledge Distillation method as FreeKD which determines the optimal localization and extent for the frequency distillation. Extensive experiments demonstrate that FreeKD not only outperforms spatial-based distillation methods consistently on dense prediction tasks (e.g. FreeKD brings 3.8 AP gains for RepPoints-R50 on COCO2017 and 4.55 mIoU gains for PSPNet-R18 on Cityscapes) but also conveys more robustness to the student. Notably we also validate the generalization of our approach on large-scale vision models (e.g. DINO and SAM). + + + + Can't Make an Omelette Without Breaking Some Eggs: Plausible Action Anticipation Using Large Video-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Mittal_Cant_Make_an_Omelette_Without_Breaking_Some_Eggs_Plausible_Action_CVPR_2024_paper.pdf + We introduce PlausiVL a large video-language model for anticipating action sequences that are plausible in the real-world. While significant efforts have been made towards anticipating future actions prior approaches do not take into account the aspect of plausibility in an action sequence. To address this limitation we explore the generative capability of a large video-language model in our work and further develop the understanding of plausibility in an action sequence by introducing two objective functions a counterfactual-based plausible action sequence learning loss and a long-horizon action repetition loss. We utilize temporal logical constraints as well as verb-noun action pair logical constraints to create implausible/counterfactual action sequences and use them to train the model with plausible action sequence learning loss. This loss helps the model to differentiate between plausible and not plausible action sequences and also helps the model to learn implicit temporal cues crucial for the task of action anticipation. 
The long-horizon action repetition loss puts a higher penalty on the actions that are more prone to repetition over a longer temporal window. With this penalization the model is able to generate diverse plausible action sequences. We evaluate our approach on two large-scale datasets Ego4D and EPIC-Kitchens-100 and show improvements on the task of action anticipation. + + + + On the Estimation of Image-matching Uncertainty in Visual Place Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zaffar_On_the_Estimation_of_Image-matching_Uncertainty_in_Visual_Place_Recognition_CVPR_2024_paper.pdf + In Visual Place Recognition (VPR) the pose of a query image is estimated by comparing the image to a map of reference images with known reference poses. As is typical for image retrieval problems a feature extractor maps the query and reference images to a feature space where a nearest neighbor search is then performed. However till recently little attention has been given to quantifying the confidence that a retrieved reference image is a correct match. Highly certain but incorrect retrieval can lead to catastrophic failure of VPR-based localization pipelines. This work compares for the first time the main approaches for estimating the image-matching uncertainty including the traditional retrieval-based uncertainty estimation more recent data-driven aleatoric uncertainty estimation and the compute-intensive geometric verification. We further formulate a simple baseline method "SUE" which unlike the other methods considers the freely-available poses of the reference images in the map. Our experiments reveal that a simple L2-distance between the query and reference descriptors is already a better estimate of image-matching uncertainty than current data-driven approaches. SUE outperforms the other efficient uncertainty estimation methods and its uncertainty estimates complement the computationally expensive geometric verification approach. Future works for uncertainty estimation in VPR should consider the baselines discussed in this work. + + + + Prompt-Enhanced Multiple Instance Learning for Weakly Supervised Video Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Prompt-Enhanced_Multiple_Instance_Learning_for_Weakly_Supervised_Video_Anomaly_Detection_CVPR_2024_paper.pdf + Weakly-supervised Video Anomaly Detection (wVAD) aims to detect frame-level anomalies using only video-level labels in training. Due to the limitation of coarse-grained labels Multi-Instance Learning (MIL) is prevailing in wVAD. However MIL suffers from insufficiency of binary supervision to model diverse abnormal patterns. Besides the coupling between abnormality and its context hinders the learning of clear abnormal event boundary. In this paper we propose prompt-enhanced MIL to detect various abnormal events while ensuring clear event boundaries. Concretely we design the abnormal-aware prompts by using abnormal class annotations together with learnable prompt which can incorporate semantic priors into video features dynamically. The detector can utilize the semantic-rich features to capture diverse abnormal patterns. In addition normal context prompt is introduced to amplify the distinction between abnormality and its context facilitating the generation of clear boundary. With the mutual enhancement of abnormal-aware and normal context prompt the model can construct discriminative representations to detect divergent anomalies without ambiguous event boundaries. 
Extensive experiments demonstrate our method achieves SOTA performance on three public benchmarks. The code is available at https://github.com/Junxi-Chen/PE-MIL. + + + + Non-autoregressive Sequence-to-Sequence Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_Non-autoregressive_Sequence-to-Sequence_Vision-Language_Models_CVPR_2024_paper.pdf + Sequence-to-sequence vision-language models are showing promise but their applicability is limited by their inference latency due to their autoregressive way of generating predictions. We propose a parallel decoding sequence-to-sequence vision-language model trained with a Query-CTC loss that marginalizes over multiple inference paths in the decoder. This allows us to model the joint distribution of tokens rather than restricting to conditional distribution as in an autoregressive model. The resulting model NARVL achieves performance on-par with its state-of-the-art autoregressive counterpart but is faster at inference time reducing from the linear complexity associated with the sequential generation of tokens to a paradigm of constant time joint inference. + + + + Active Object Detection with Knowledge Aggregation and Distillation from Large Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Active_Object_Detection_with_Knowledge_Aggregation_and_Distillation_from_Large_CVPR_2024_paper.pdf + Accurately detecting active objects undergoing state changes is essential for comprehending human interactions and facilitating decision-making. The existing methods for active object detection (AOD) primarily rely on visual appearance of the objects within input such as changes in size shape and relationship with hands. However these visual changes can be subtle posing challenges particularly in scenarios with multiple distracting no-change instances of the same category. We observe that the state changes are often the result of an interaction being performed upon the object thus propose to use informed priors about object related plausible interactions (including semantics and visual appearance) to provide more reliable cues for AOD. Specifically we propose a knowledge aggregation procedure to integrate the aforementioned informed priors into oracle queries within the teacher decoder offering more object affordance commonsense to locate the active object. To streamline the inference process and reduce extra knowledge inputs we propose a knowledge distillation approach that encourages the student decoder to mimic the detection capabilities of the teacher decoder using the oracle query by replicating its predictions and attention. Our proposed framework achieves state-of-the-art performance on four datasets namely Ego4D Epic-Kitchens MECCANO and 100DOH which demonstrates the effectiveness of our approach in improving AOD. The code and models are available at https://github.com/idejie/KAD.git. + + + + Weak-to-Strong 3D Object Detection with X-Ray Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Gambashidze_Weak-to-Strong_3D_Object_Detection_with_X-Ray_Distillation_CVPR_2024_paper.pdf + This paper addresses the critical challenges of sparsity and occlusion in LiDAR-based 3D object detection. Current methods often rely on supplementary modules or specific architectural designs potentially limiting their applicability to new and evolving architectures. 
To our knowledge we are the first to propose a versatile technique that seamlessly integrates into any existing framework for 3D Object Detection marking the first instance of Weak-to-Strong generalization in 3D computer vision. We introduce a novel framework X-Ray Distillation with Object-Complete Frames suitable for both supervised and semi-supervised settings that leverages the temporal aspect of point cloud sequences. This method extracts crucial information from both previous and subsequent LiDAR frames creating Object-Complete frames that represent objects from multiple viewpoints thus addressing occlusion and sparsity. Given the limitation of not being able to generate Object-Complete frames during online inference we utilize Knowledge Distillation within a Teacher-Student framework. This technique encourages the strong Student model to emulate the behavior of the weaker Teacher which processes simple and informative Object-Complete frames effectively offering a comprehensive view of objects as if seen through X-ray vision. Our proposed methods surpass state-of-the-art in semi-supervised learning by 1-1.5 mAP and enhance the performance of five established supervised models by 1-2 mAP on standard autonomous driving datasets even with default hyperparameters. Code for Object-Complete frames is available here: https://github.com/sakharok13/X-Ray-Teacher-Patching-Tools. + + + + Active Open-Vocabulary Recognition: Let Intelligent Moving Mitigate CLIP Limitations + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Active_Open-Vocabulary_Recognition_Let_Intelligent_Moving_Mitigate_CLIP_Limitations_CVPR_2024_paper.pdf + Active recognition which allows intelligent agents to explore observations for better recognition performance serves as a prerequisite for various embodied AI tasks such as grasping navigation and room arrangements. Given the evolving environment and the multitude of object classes it is impractical to include all possible classes during the training stage. In this paper we aim at advancing active open-vocabulary recognition empowering embodied agents to actively perceive and classify arbitrary objects. However directly adopting recent open-vocabulary classification models like Contrastive Language Image Pretraining (CLIP) poses its unique challenges. Specifically we observe that CLIP's performance is heavily affected by the viewpoint and occlusions compromising its reliability in unconstrained embodied perception scenarios. Further the sequential nature of observations in agent-environment interactions necessitates an effective method for integrating features that maintains discriminative strength for open-vocabulary classification. To address these issues we introduce a novel agent for active open-vocabulary recognition. The proposed method leverages inter-frame and inter-concept similarities to navigate agent movements and to fuse features without relying on class-specific knowledge. Compared to baseline CLIP model with 29.6% accuracy on ShapeNet dataset the proposed agent could achieve 53.3% accuracy for open-vocabulary recognition without any fine-tuning to the equipped CLIP model. Additional experiments conducted with the Habitat simulator further affirm the efficacy of our method. 
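As a rough illustration of the class-agnostic feature fusion idea in the active open-vocabulary recognition abstract above, the sketch below weights multiple observations by their inter-frame agreement before scoring the fused feature against open-vocabulary text embeddings. The random stand-in features, dimensions, and softmax weighting are assumptions for illustration only; a real agent would use CLIP image and text encoders.

# Illustrative sketch of class-agnostic multi-view fusion for open-vocabulary recognition
# (stand-in random features; a real system would use CLIP image/text encoders).
import torch
import torch.nn.functional as F

def fuse_views(view_feats: torch.Tensor) -> torch.Tensor:
    """view_feats: (num_views, dim) L2-normalized per-view image features."""
    sim = view_feats @ view_feats.T                   # inter-frame similarity matrix
    weights = F.softmax(sim.mean(dim=1), dim=0)       # views that agree with others weigh more
    return F.normalize((weights[:, None] * view_feats).sum(dim=0), dim=0)

views = F.normalize(torch.randn(5, 512), dim=1)       # 5 observations of the same object
text = F.normalize(torch.randn(10, 512), dim=1)       # 10 open-vocabulary class prompts
logits = fuse_views(views) @ text.T                   # zero-shot scores per class
print(logits.argmax().item())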
+ + + + Efficient Meshflow and Optical Flow Estimation from Event Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_Efficient_Meshflow_and_Optical_Flow_Estimation_from_Event_Cameras_CVPR_2024_paper.pdf + In this paper we explore the problem of event-based meshflow estimation a novel task that involves predicting a spatially smooth sparse motion field from event cameras. To start we generate a large-scale High-Resolution Event Meshflow (HREM) dataset which showcases its superiority by encompassing the merits of high resolution at 1280x720 handling dynamic objects and complex motion patterns and offering both optical flow and meshflow labels. These aspects have not been fully explored in previous works. Besides we propose Efficient Event-based MeshFlow (EEMFlow) network a lightweight model featuring a specially crafted encoder-decoder architecture to facilitate swift and accurate meshflow estimation. Furthermore we upgrade EEMFlow network to support dense event optical flow in which a Confidence-induced Detail Completion (CDC) module is proposed to preserve sharp motion boundaries. We conduct comprehensive experiments to show the exceptional performance and runtime efficiency (39x faster) of our EEMFlow model compared to recent state-of-the-art flow methods. Our code is available at https://github.com/boomluo02/EEMFlow. + + + + Visual Program Distillation: Distilling Tools and Programmatic Reasoning into Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_Visual_Program_Distillation_Distilling_Tools_and_Programmatic_Reasoning_into_Vision-Language_CVPR_2024_paper.pdf + Solving complex visual tasks such as "Who invented the musical instrument on the right?" involves a composition of skills: understanding space recognizing instruments and also retrieving prior knowledge. Recent work shows promise by decomposing such tasks using a large language model (LLM) into an executable program that invokes specialized vision models. However generated programs are error-prone: they omit necessary steps include spurious ones and are unable to recover when the specialized models give incorrect outputs. Moreover they require loading multiple models incurring high latency and computation costs. We propose Visual Program Distillation (VPD) an instruction-tuning framework that produces a vision-language model (VLM) capable of solving complex visual tasks with a single forward pass. VPD distills the reasoning ability of LLMs by using them to sample multiple candidate programs which are then executed and verified to identify the correct one. It translates each correct program into a language description of the reasoning steps which are then distilled into a VLM. Extensive experiments show that VPD improves the VLM's ability to count understand spatial relations and reason compositionally. Our VPD-trained PaLI-X outperforms all prior VLMs achieving state-of-the-art performance across complex vision tasks including MMBench OK-VQA A-OKVQA TallyQA POPE and Hateful Memes. An evaluation with human annotators also confirms that VPD improves model response factuality and consistency. Finally experiments on content moderation demonstrate that VPD is also helpful for adaptation to real-world applications with limited data. 
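The Visual Program Distillation abstract above describes a sample-execute-verify loop that turns LLM-generated programs into instruction-tuning data. The sketch below is only a schematic of that loop under heavy assumptions: call_llm and execute_program are hypothetical placeholders standing in for an actual LLM and tool-execution backend, and the output record format is invented for illustration.

# Schematic of a sample-execute-verify distillation loop (placeholders, not the VPD code).
def call_llm(prompt: str, n: int) -> list[str]:
    # Placeholder: a real system would sample n candidate programs from an LLM.
    return [f"count_objects(image, 'instrument')  # candidate {i}" for i in range(n)]

def execute_program(program: str, image) -> str:
    # Placeholder: a real system would run the program with specialized vision tools.
    return "2"

def build_distillation_example(question: str, image, answer: str, n_samples: int = 5):
    """Keep only programs whose execution matches the reference answer, then turn the
    surviving program into a natural-language reasoning trace for VLM fine-tuning."""
    for program in call_llm(question, n_samples):
        if execute_program(program, image) == answer:          # verification step
            rationale = f"Ran {program.split('#')[0].strip()} and obtained {answer}."
            return {"question": question, "rationale": rationale, "answer": answer}
    return None   # discard examples where no sampled program verifies

print(build_distillation_example("How many instruments are on the right?", None, "2"))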
+ + + + A Backpack Full of Skills: Egocentric Video Understanding with Diverse Task Perspectives + http://openaccess.thecvf.com//content/CVPR2024/papers/Peirone_A_Backpack_Full_of_Skills_Egocentric_Video_Understanding_with_Diverse_CVPR_2024_paper.pdf + Human comprehension of a video stream is naturally broad: in a few instants we are able to understand what is happening the relevance and relationship of objects and forecast what will follow in the near future everything all at once. We believe that - to effectively transfer such an holistic perception to intelligent machines - an important role is played by learning to correlate concepts and to abstract knowledge coming from different tasks to synergistically exploit them when learning novel skills. To accomplish this we look for a unified approach to video understanding which combines shared temporal modelling of human actions with minimal overhead to support multiple downstream tasks and enable cooperation when learning novel skills. We then propose EgoPack a solution that creates a collection of task perspectives that can be carried across downstream tasks and used as a potential source of additional insights as a backpack of skills that a robot can carry around and use when needed. We demonstrate the effectiveness and efficiency of our approach on four Ego4D benchmarks outperforming current state-of-the-art methods. Project webpage: https://sapeirone.github.io/EgoPack. + + + + Visual In-Context Prompting + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Visual_In-Context_Prompting_CVPR_2024_paper.pdf + In-context prompting in large language models (LLMs) has become a prevalent approach to improve zero-shot capabilities but this idea is less explored in the vision domain. Existing visual prompting methods focus on referring segmentation to segment the most relevant object falling short of addressing many generic vision tasks like open-set segmentation and detection. In this paper we introduce a universal visual in-context prompting framework for both tasks as shown in Fig.1. In particular we build on top of an encoder-decoder architecture and develop a versatile prompt encoder to support a variety of prompts like strokes boxes and points. We further enhance it to take an arbitrary number of reference image segments as the context. Our extensive explorations show that the proposed visual in-context prompting elicits extraordinary referring and generic segmentation capabilities to refer and detect yielding competitive performance to close-set in-domain datasets and showing promising results on many open-set segmentation datasets. By joint training on COCO and SA-1B DINOv achieves 57.7 PQ on COCO and 23.2 PQ on ADE20K. Code will be available at https://github.com/UX-Decoder/DINOv + + + + Instruct-ReID: A Multi-purpose Person Re-identification Task with Instructions + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Instruct-ReID_A_Multi-purpose_Person_Re-identification_Task_with_Instructions_CVPR_2024_paper.pdf + Human intelligence can retrieve any person according to both visual and language descriptions. However the current computer vision community studies specific person re-identification (ReID) tasks in different scenarios separately which limits the applications in the real world. This paper strives to resolve this problem by proposing a new instruct-ReID task that requires the model to retrieve images according to the given image or language instructions. 
Our instruct-ReID is a more general ReID setting where existing 6 ReID tasks can be viewed as special cases by designing different instructions. We propose a large-scale OmniReID benchmark and an adaptive triplet loss as a baseline method to facilitate research in this new setting. Experimental results show that the proposed multi-purpose ReID model trained on our OmniReID benchmark without finetuning can improve +0.5% +0.6% +7.7% mAP on Market1501 MSMT17 CUHK03 for traditional ReID +6.4% +7.1% +11.2% mAP on PRCC VC-Clothes LTCC for clothes-changing ReID +11.7% mAP on COCAS+ real2 for clothes template based clothes-changing ReID when using only RGB images +24.9% mAP on COCAS+ real2 for our newly defined language-instructed ReID +4.3% on LLCM for visible-infrared ReID +2.6% on CUHK-PEDES for text-to-image ReID. The datasets the model and code are available at https://github.com/hwz-zju/Instruct-ReID. + + + + IBD-SLAM: Learning Image-Based Depth Fusion for Generalizable SLAM + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_IBD-SLAM_Learning_Image-Based_Depth_Fusion_for_Generalizable_SLAM_CVPR_2024_paper.pdf + In this paper we address the challenging problem of visual SLAM with neural scene representations. Recently neural scene representations have shown promise for SLAM to produce dense 3D scene reconstruction with high quality. However existing methods require scene-specific optimization leading to time-consuming mapping processes for each individual scene. To overcome this limitation we propose IBD-SLAM an Image-Based Depth fusion framework for generalizable SLAM. In particular we adopt a Neural Radiance Field (NeRF) for scene representation. Inspired by multi-view image-based rendering instead of learning a fixed-grid scene representation we propose to learn an image-based depth fusion model that fuses depth maps of multiple reference views into a xyz-map representation. Once trained this model can be applied to new uncalibrated monocular RGBD videos of unseen scenes without the need for retraining and reconstructs full 3D scenes efficiently with a light-weight pose optimization procedure. We thoroughly evaluate IBD-SLAM on public visual SLAM benchmarks outperforming the previous state-of-the-art while being 10x faster in the mapping stage. Project page: https://visual-ai.github.io/ibd-slam. + + + + CPLIP: Zero-Shot Learning for Histopathology with Comprehensive Vision-Language Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Javed_CPLIP_Zero-Shot_Learning_for_Histopathology_with_Comprehensive_Vision-Language_Alignment_CVPR_2024_paper.pdf + This paper proposes Comprehensive Pathology Language Image Pre-training (CPLIP) a new unsupervised technique designed to enhance the alignment of images and text in histopathology for tasks such as classification and segmentation. This methodology enriches vision language models by leveraging extensive data without needing ground truth annotations. CPLIP involves constructing a pathology-specific dictionary generating textual descriptions for images using language models and retrieving relevant images for each text snippet via a pre-trained model. The model is then fine-tuned using a many-to-many contrastive learning method to align complex interrelated concepts across both modalities. 
Evaluated across multiple histopathology tasks, CPLIP shows notable improvements in zero-shot learning scenarios, outperforming existing methods in both interpretability and robustness and setting a higher benchmark for the application of vision-language models in the field. To encourage further research and replication, the code for CPLIP is available on GitHub at https://cplip.github.io/ + + + + Reg-PTQ: Regression-specialized Post-training Quantization for Fully Quantized Object Detector + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_Reg-PTQ_Regression-specialized_Post-training_Quantization_for_Fully_Quantized_Object_Detector_CVPR_2024_paper.pdf + Although deep learning based object detection is of great significance for various applications it faces challenges when deployed on edge devices due to the computation and energy limitations. Post-training quantization (PTQ) can improve inference efficiency through integer computing. However, PTQ methods suffer from severe performance degradation when performing full quantization due to overlooking the unique characteristics of regression tasks in object detection. In this paper we are the first to explore regression-friendly quantization and conduct full quantization on various detectors. We reveal the intrinsic reason behind the difficulty of quantizing regressors with empirical and theoretical justifications and introduce a novel Regression-specialized Post-Training Quantization (Reg-PTQ) scheme. It includes Filtered Global Loss Integration Calibration to combine the global loss with a two-step filtering mechanism mitigating the adverse impact of false positive bounding boxes and Learnable Logarithmic-Affine Quantizer tailored for the non-uniform distributed parameters in regression structures. Extensive experiments on prevalent detectors showcase the effectiveness of the well-designed Reg-PTQ. Notably our Reg-PTQ achieves 7.6 times and 5.4 times reduction in computation and storage consumption under INT4 with little performance degradation which indicates the immense potential of fully quantized detectors in real-world object detection applications. + + + + Action Scene Graphs for Long-Form Understanding of Egocentric Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Rodin_Action_Scene_Graphs_for_Long-Form_Understanding_of_Egocentric_Videos_CVPR_2024_paper.pdf + We present Egocentric Action Scene Graphs (EASGs) a new representation for long-form understanding of egocentric videos. EASGs extend standard manually-annotated representations of egocentric videos such as verb-noun action labels by providing a temporally evolving graph-based description of the actions performed by the camera wearer including interacted objects their relationships and how actions unfold in time. Through a novel annotation procedure we extend the Ego4D dataset adding manually labeled Egocentric Action Scene Graphs which offer a rich set of annotations for long-form egocentric video understanding. We hence define the EASG generation task and provide a baseline approach establishing preliminary benchmarks. Experiments on two downstream tasks action anticipation and activity summarization highlight the effectiveness of EASGs for long-form egocentric video understanding. We will release the dataset and code to replicate experiments and annotations.
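To make the graph-based annotation described in the EASG abstract above more concrete, here is a toy Python representation of a temporally evolving action scene graph. The field names and relations are illustrative assumptions, not the released Ego4D/EASG schema.

# Toy representation of a temporally evolving action scene graph (Python 3.9+).
# Field names and relations are invented for illustration, not the dataset schema.
from dataclasses import dataclass, field

@dataclass
class EASGFrame:
    timestamp: float
    verb: str
    nodes: list[str]                                  # camera wearer + interacted objects
    edges: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)

graph = [
    EASGFrame(3.2, "pick-up", ["camera_wearer", "knife", "cutting_board"],
              [("camera_wearer", "holds", "knife")]),
    EASGFrame(5.7, "cut", ["camera_wearer", "knife", "onion"],
              [("camera_wearer", "holds", "knife"), ("knife", "acts_on", "onion")]),
]
for frame in graph:
    print(frame.timestamp, frame.verb, frame.edges)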
+ + + + De-confounded Data-free Knowledge Distillation for Handling Distribution Shifts + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_De-confounded_Data-free_Knowledge_Distillation_for_Handling_Distribution_Shifts_CVPR_2024_paper.pdf + Data-Free Knowledge Distillation (DFKD) is a promising task to train high-performance small models to enhance actual deployment without relying on the original training data. Existing methods commonly avoid relying on private data by utilizing synthetic or sampled data. However a long-overlooked issue is that the severe distribution shifts between their substitution and original data which manifests as huge differences in the quality of images and class proportions. The harmful shifts are essentially the confounder that significantly causes performance bottlenecks. To tackle the issue this paper proposes a novel perspective with causal inference to disentangle the student models from the impact of such shifts. By designing a customized causal graph we first reveal the causalities among the variables in the DFKD task. Subsequently we propose a Knowledge Distillation Causal Intervention (KDCI) framework based on the backdoor adjustment to de-confound the confounder. KDCI can be flexibly combined with most existing state-of-the-art baselines. Experiments in combination with six representative DFKD methods demonstrate the effectiveness of our KDCI which can obviously help existing methods under almost all settings e.g. improving the baseline by up to 15.54% accuracy on the CIFAR-100 dataset. + + + + Siamese Learning with Joint Alignment and Regression for Weakly-Supervised Video Paragraph Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_Siamese_Learning_with_Joint_Alignment_and_Regression_for_Weakly-Supervised_Video_CVPR_2024_paper.pdf + Video Paragraph Grounding (VPG) is an emerging task in video-language understanding which aims at localizing multiple sentences with semantic relations and temporal order from an untrimmed video. However existing VPG approaches are heavily reliant on a considerable number of temporal labels that are laborious and time-consuming to acquire. In this work we introduce and explore Weakly-Supervised Video Paragraph Grounding (WSVPG) to eliminate the need of temporal annotations. Different from previous weakly-supervised grounding frameworks based on multiple instance learning or reconstruction learning for two-stage candidate ranking we propose a novel siamese learning framework that jointly learns the cross-modal feature alignment and temporal coordinate regression without timestamp labels to achieve concise one-stage localization for WSVPG. Specifically we devise a Siamese Grounding TRansformer (SiamGTR) consisting of two weight-sharing branches for learning complementary supervision. An Augmentation Branch is utilized for directly regressing the temporal boundaries of a complete paragraph within a pseudo video and an Inference Branch is designed to capture the order-guided feature correspondence for localizing multiple sentences in a normal video. We demonstrate by extensive experiments that our paradigm has superior practicability and flexibility to achieve efficient weakly-supervised or semi-supervised learning outperforming state-of-the-art methods trained with the same or stronger supervision. 
+ + + + LEOD: Label-Efficient Object Detection for Event Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_LEOD_Label-Efficient_Object_Detection_for_Event_Cameras_CVPR_2024_paper.pdf + Object detection with event cameras benefits from the sensor's low latency and high dynamic range. However it is costly to fully label event streams for supervised training due to their high temporal resolution. To reduce this cost we present LEOD the first method for label-efficient event-based detection. Our approach unifies weakly- and semi-supervised object detection with a self-training mechanism. We first utilize a detector pre-trained on limited labels to produce pseudo ground truth on unlabeled events. Then the detector is re-trained with both real and generated labels. Leveraging the temporal consistency of events we run bi-directional inference and apply tracking-based post-processing to enhance the quality of pseudo labels. To stabilize training against label noise we further design a soft anchor assignment strategy. We introduce new experimental protocols to evaluate the task of label-efficient event-based detection on Gen1 and 1Mpx datasets. LEOD consistently outperforms supervised baselines across various labeling ratios. For example on Gen1 it improves mAP by 8.6% and 7.8% for RVT-S trained with 1% and 2% labels. On 1Mpx, RVT-S with 10% labels even surpasses its fully-supervised counterpart using 100% labels. LEOD maintains its effectiveness even when all labeled data are available reaching new state-of-the-art results. Finally we show that our method readily scales to improve larger detectors as well. Code is released at https://github.com/Wuziyi616/LEOD. + + + + Morphological Prototyping for Unsupervised Slide Representation Learning in Computational Pathology + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_Morphological_Prototyping_for_Unsupervised_Slide_Representation_Learning_in_Computational_Pathology_CVPR_2024_paper.pdf + Representation learning of pathology whole-slide images (WSIs) has primarily relied on weak supervision with Multiple Instance Learning (MIL). However the slide representations resulting from this approach are highly tailored to specific clinical tasks which limits their expressivity and generalization particularly in scenarios with limited data. Instead we hypothesize that morphological redundancy in tissue can be leveraged to build a task-agnostic slide representation in an unsupervised fashion. To this end we introduce PANTHER a prototype-based approach rooted in the Gaussian mixture model that summarizes the set of WSI patches into a much smaller set of morphological prototypes. Specifically each patch is assumed to have been generated from a mixture distribution where each mixture component represents a morphological exemplar. Utilizing the estimated mixture parameters we then construct a compact slide representation that can be readily used for a wide range of downstream tasks. By performing an extensive evaluation of PANTHER on subtyping and survival tasks using 13 datasets we show that 1) PANTHER outperforms or is on par with supervised MIL baselines and 2) the analysis of morphological prototypes brings new qualitative and quantitative insights into model interpretability. The code is available at https://github.com/mahmoodlab/Panther.
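A minimal sketch of the prototype-based slide summarization idea described in the PANTHER abstract above: fit a Gaussian mixture over a slide's patch embeddings and build a fixed-length slide representation from the estimated mixture parameters. The use of scikit-learn's GaussianMixture, the 16 prototypes, and the embedding size are assumptions for illustration, not the authors' implementation.

# Illustrative sketch of prototype-based slide summarization (not the official PANTHER code).
# Assumptions: patch embeddings are precomputed; 16 prototypes; scikit-learn GaussianMixture.
import numpy as np
from sklearn.mixture import GaussianMixture

def summarize_slide(patch_embeddings: np.ndarray, n_prototypes: int = 16) -> np.ndarray:
    """Fit a GMM over one slide's patch embeddings and return a fixed-length
    slide representation built from the estimated mixture parameters."""
    gmm = GaussianMixture(n_components=n_prototypes, covariance_type="diag", random_state=0)
    gmm.fit(patch_embeddings)                       # each component plays the role of a morphological exemplar
    # Concatenate mixture weights, means, and (diagonal) variances into one vector.
    return np.concatenate([gmm.weights_, gmm.means_.ravel(), gmm.covariances_.ravel()])

# Example: 5000 patches with 384-d embeddings -> one compact slide vector.
slide_feat = summarize_slide(np.random.randn(5000, 384).astype(np.float32))
print(slide_feat.shape)   # (16 + 16*384 + 16*384,) = (12304,)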
+ + + + Dense Optical Tracking: Connecting the Dots + http://openaccess.thecvf.com//content/CVPR2024/papers/Le_Moing_Dense_Optical_Tracking_Connecting_the_Dots_CVPR_2024_paper.pdf + Recent approaches to point tracking are able to recover the trajectory of any scene point through a large portion of a video despite the presence of occlusions. They are however too slow in practice to track every point observed in a single frame in a reasonable amount of time. This paper introduces DOT a novel simple and efficient method for solving this problem. It first extracts a small set of tracks from key regions at motion boundaries using an off-the-shelf point tracking algorithm. Given source and target frames DOT then computes rough initial estimates of a dense flow field and visibility mask through nearest-neighbor interpolation before refining them using a learnable optical flow estimator that explicitly handles occlusions and can be trained on synthetic data with ground-truth correspondences. We show that DOT is significantly more accurate than current optical flow techniques outperforms sophisticated "universal" trackers like OmniMotion and is on par with or better than the best point tracking algorithms like CoTracker while being at least two orders of magnitude faster. Quantitative and qualitative experiments with synthetic and real videos validate the promise of the proposed approach. Code data and videos showcasing the capabilities of our approach are available in the project webpage: https://16lemoing.github.io/dot . + + + + A Stealthy Wrongdoer: Feature-Oriented Reconstruction Attack against Split Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_A_Stealthy_Wrongdoer_Feature-Oriented_Reconstruction_Attack_against_Split_Learning_CVPR_2024_paper.pdf + Split Learning (SL) is a distributed learning framework renowned for its privacy-preserving features and minimal computational requirements. Previous research consistently highlights the potential privacy breaches in SL systems by server adversaries reconstructing training data. However these studies often rely on strong assumptions or compromise system utility to enhance attack performance. This paper introduces a new semi-honest Data Reconstruction Attack on SL named Feature-Oriented Reconstruction Attack (FORA). In contrast to prior works FORA relies on limited prior knowledge specifically that the server utilizes auxiliary samples from the public without knowing any client's private information. This allows FORA to conduct the attack stealthily and achieve robust performance. The key vulnerability exploited by FORA is the revelation of the model representation preference in the smashed data output by victim client. FORA constructs a substitute client through feature-level transfer learning aiming to closely mimic the victim client's representation preference. Leveraging this substitute client the server trains the attack model to effectively reconstruct private data. Extensive experiments showcase FORA's superior performance compared to state-of-the-art methods. Furthermore the paper systematically evaluates the proposed method's applicability across diverse settings and advanced defense strategies. 
+ + + + TULIP: Transformer for Upsampling of LiDAR Point Clouds + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_TULIP_Transformer_for_Upsampling_of_LiDAR_Point_Clouds_CVPR_2024_paper.pdf + LiDAR Upsampling is a challenging task for the perception systems of robots and autonomous vehicles due to the sparse and irregular structure of large-scale scene contexts. Recent works propose to solve this problem by converting LiDAR data from 3D Euclidean space into an image super-resolution problem in 2D image space. Although their methods can generate high-resolution range images with fine-grained details the resulting 3D point clouds often blur out details and predict invalid points. In this paper we propose TULIP a new method to reconstruct high-resolution LiDAR point clouds from low-resolution LiDAR input. We also follow a range image-based approach but specifically modify the patch and window geometries of a Swin-Transformer-based network to better fit the characteristics of range images. We conducted several experiments on three public real-world and simulated datasets. TULIP outperforms state-of-the-art methods in all relevant metrics and generates robust and more realistic point clouds than prior works. + + + + BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_BT-Adapter_Video_Conversation_is_Feasible_Without_Video_Instruction_Tuning_CVPR_2024_paper.pdf + The recent progress in Large Language Models (LLM) has spurred various advancements in image-language conversation agents while how to build a proficient video-based dialogue system is still under exploration. Considering the extensive scale of LLM and visual backbone minimal GPU memory is left for facilitating effective temporal modeling which is crucial for comprehending and providing feedback on videos. To this end we propose Branching Temporal Adapter (BT-Adapter) a novel method for extending image-language pretrained models into the video domain. Specifically BT-Adapter serves as a plug-and-use temporal modeling branch alongside the pretrained visual encoder which is tuned while keeping the backbone frozen. Just pretrained once BT-Adapter can be seamlessly integrated into all image conversation models using this version of CLIP enabling video conversations without the need for video instructions. Besides we develop a unique asymmetric token masking strategy inside the branch with tailor-made training tasks for BT-Adapter facilitating faster convergence and better results. Thanks to BT-Adapter we are able to empower existing multimodal dialogue models with strong video understanding capabilities without incurring excessive GPU costs. Without bells and whistles BT-Adapter achieves (1) state-of-the-art zero-shot results on various video tasks using thousands of fewer GPU hours. (2) better performance than current video chatbots without any video instruction tuning. (3) state-of-the-art results of video chatting using video instruction tuning outperforming previous SOTAs by a large margin. The code has been available at https://github.com/farewellthree/BT-Adapter. 
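The BT-Adapter abstract above hinges on keeping the pretrained visual backbone frozen while training only a temporal branch beside it. The PyTorch sketch below shows that general pattern; the stand-in linear "encoder", the plain TransformerEncoder branch, and the mean pooling are illustrative assumptions rather than the paper's architecture.

# Illustrative sketch of a trainable temporal branch beside a frozen image encoder
# (pattern described in the BT-Adapter abstract; sizes and fusion are assumptions).
import torch
import torch.nn as nn

class TemporalBranch(nn.Module):
    def __init__(self, dim: int = 512, depth: int = 2, heads: int = 8):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim) per-frame features from the frozen encoder
        return self.temporal(frame_feats).mean(dim=1)   # simple temporal pooling

frozen_encoder = nn.Linear(768, 512)                    # stand-in for a pretrained visual encoder
for p in frozen_encoder.parameters():
    p.requires_grad_(False)                             # backbone stays frozen
branch = TemporalBranch()                               # only the branch would be trained

frames = torch.randn(2, 8, 768)                         # (batch, frames, raw feature dim)
video_feat = branch(frozen_encoder(frames))             # (2, 512)
print(video_feat.shape)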
+ + + + Generate Subgoal Images before Act: Unlocking the Chain-of-Thought Reasoning in Diffusion Model for Robot Manipulation with Multimodal Prompts + http://openaccess.thecvf.com//content/CVPR2024/papers/Ni_Generate_Subgoal_Images_before_Act_Unlocking_the_Chain-of-Thought_Reasoning_in_CVPR_2024_paper.pdf + Robotics agents often struggle to understand and follow multi-modal prompts in complex manipulation scenes which are difficult to describe sufficiently and accurately by text alone. Moreover for long-horizon manipulation tasks the deviation from the general instruction tends to accumulate when intermediate guidance from high-level subgoals is lacking. We therefore ask whether generating subgoal images before acting can enhance instruction following in long-horizon manipulation with multi-modal prompts. Inspired by the great success of diffusion models in image generation tasks we propose a novel hierarchical framework named CoTDiffusion that incorporates a diffusion model as a high-level planner to convert the general and multi-modal prompts into coherent visual subgoal plans which further guide the low-level policy model before action execution. We design a semantic alignment module that can anchor the progress of generated keyframes along a coherent generation chain unlocking the chain-of-thought reasoning ability of the diffusion model. Additionally we propose a bi-directional generation and frame concat mechanism to further enhance the fidelity of generated subgoal images and the accuracy of instruction following. The experiments cover various robotics manipulation scenarios including visual reasoning visual rearrange and visual constraints. CoTDiffusion achieves outstanding performance gains compared to the baselines without explicit subgoal generation which proves that a subgoal image is worth a thousand words of instruction. + + + + Asymmetric Masked Distillation for Pre-Training Small Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Asymmetric_Masked_Distillation_for_Pre-Training_Small_Foundation_Models_CVPR_2024_paper.pdf + Self-supervised foundation models have shown great potential in computer vision thanks to the pre-training paradigm of masked autoencoding. Scale is a primary factor influencing the performance of these foundation models. However these large foundation models often result in high computational cost. This paper focuses on pre-training relatively small vision transformer models that could be efficiently adapted to downstream tasks. Specifically taking inspiration from knowledge distillation in model compression we propose a new asymmetric masked distillation (AMD) framework for pre-training relatively small models with autoencoding. The core of AMD is to devise an asymmetric masking strategy where the teacher model is enabled to see more context information with a lower masking ratio while the student model is still equipped with a high masking ratio. We design customized multi-layer feature alignment between the teacher encoder and student encoder to regularize the pre-training of student MAE. To demonstrate the effectiveness and versatility of AMD we apply it to both ImageMAE and VideoMAE for pre-training relatively small ViT models. AMD achieved 84.6% classification accuracy on IN1K using the ViT-B model. And AMD achieves 73.3% classification accuracy using the ViT-B model on the Something-Something V2 dataset, a 3.7% improvement over the original ViT-B model from VideoMAE.
We also transfer AMD pre-trained models to downstream tasks and obtain consistent performance improvement over the original masked autoencoding. The code and models are available at https://github.com/MCG-NJU/AMD. + + + + MP5: A Multi-modal Open-ended Embodied System in Minecraft via Active Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/Qin_MP5_A_Multi-modal_Open-ended_Embodied_System_in_Minecraft_via_Active_CVPR_2024_paper.pdf + It is a long-lasting goal to design an embodied system that can solve long-horizon open-world tasks in human-like ways. However existing approaches usually struggle with compound difficulties caused by the logic-aware decomposition and context-aware execution of these tasks. To this end we introduce MP5 an open-ended multimodal embodied system built upon the challenging Minecraft simulator which can decompose feasible sub-objectives design sophisticated situation-aware plans and perform embodied action control with frequent communication with a goal-conditioned active perception scheme. Specifically MP5 is developed on top of recent advances in Multimodal Large Language Models (MLLMs) and the system is modulated into functional modules that can be scheduled and collaborated to ultimately solve pre-defined context- and process-dependent tasks. Extensive experiments prove that MP5 can achieve a 22% success rate on difficult process-dependent tasks and a 91% success rate on tasks that heavily depend on the context. Moreover MP5 exhibits a remarkable ability to address many open-ended tasks that are entirely novel. + + + + Uncovering What Why and How: A Comprehensive Benchmark for Causation Understanding of Video Anomaly + http://openaccess.thecvf.com//content/CVPR2024/papers/Du_Uncovering_What_Why_and_How_A_Comprehensive_Benchmark_for_Causation_CVPR_2024_paper.pdf + Video anomaly understanding (VAU) aims to automatically comprehend unusual occurrences in videos thereby enabling various applications such as traffic surveillance and industrial manufacturing. While existing VAU benchmarks primarily concentrate on anomaly detection and localization our focus is on more practicality prompting us to raise the following crucial questions: "what anomaly occurred?" "why did it happen?" and "how severe is this abnormal event?". In pursuit of these answers we present a comprehensive benchmark for Causation Understanding of Video Anomaly (CUVA). Specifically each instance of the proposed benchmark involves three sets of human annotations to indicate the "what" "why" and "how" of an anomaly including 1) anomaly type start and end times and event descriptions 2) natural language explanations for the cause of an anomaly and 3) free text reflecting the effect of the abnormality. In addition we also introduce MMEval a novel evaluation metric designed to better align with human preferences for CUVA facilitating the measurement of existing LLMs in comprehending the underlying cause and corresponding effect of video anomalies. Finally we propose a novel prompt-based method that can serve as a baseline approach for the challenging CUVA. We conduct extensive experiments to show the superiority of our evaluation metric and the prompt-based approach. 
+ + + + MiKASA: Multi-Key-Anchor & Scene-Aware Transformer for 3D Visual Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Chang_MiKASA_Multi-Key-Anchor__Scene-Aware_Transformer_for_3D_Visual_Grounding_CVPR_2024_paper.pdf + 3D visual grounding involves matching natural language descriptions with their corresponding objects in 3D spaces. Existing methods often face challenges with accuracy in object recognition and struggle in interpreting complex linguistic queries particularly with descriptions that involve multiple anchors or are view-dependent. In response we present the MiKASA (Multi-Key-Anchor Scene-Aware) Transformer. Our novel end-to-end trained model integrates a self-attention-based scene-aware object encoder and an original multi-key-anchor technique enhancing object recognition accuracy and the understanding of spatial relationships. Furthermore MiKASA improves the explainability of decision-making facilitating error diagnosis. Our model achieves the highest overall accuracy in the Referit3D challenge for both the Sr3D and Nr3D datasets particularly excelling by a large margin in categories that require viewpoint-dependent descriptions. + + + + ZePT: Zero-Shot Pan-Tumor Segmentation via Query-Disentangling and Self-Prompting + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_ZePT_Zero-Shot_Pan-Tumor_Segmentation_via_Query-Disentangling_and_Self-Prompting_CVPR_2024_paper.pdf + The long-tailed distribution problem in medical image analysis reflects a high prevalence of common conditions and a low prevalence of rare ones which poses a significant challenge in developing a unified model capable of identifying rare or novel tumor categories not encountered during training. In this paper we propose a new Zero-shot Pan-Tumor segmentation framework (ZePT) based on query-disentangling and self-prompting to segment unseen tumor categories beyond the training set. ZePT disentangles the object queries into two subsets and trains them in two stages. Initially it learns a set of fundamental queries for organ segmentation through an object-aware feature grouping strategy which gathers organ-level visual features. Subsequently it refines the other set of advanced queries that focus on the auto-generated visual prompts for unseen tumor segmentation. Moreover we introduce query-knowledge alignment at the feature level to enhance each query's discriminative representation and generalizability. Extensive experiments on various tumor segmentation tasks demonstrate the performance superiority of ZePT which surpasses the previous counterparts and evidences the promising ability for zero-shot tumor segmentation in real-world settings. + + + + Task-Driven Exploration: Decoupling and Inter-Task Feedback for Joint Moment Retrieval and Highlight Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Task-Driven_Exploration_Decoupling_and_Inter-Task_Feedback_for_Joint_Moment_Retrieval_CVPR_2024_paper.pdf + Video moment retrieval and highlight detection are two highly valuable tasks in video understanding, but only recently have they been studied jointly. Although existing studies have recently made impressive advances, they predominantly follow the data-driven bottom-up paradigm. Such a paradigm overlooks task-specific and inter-task effects, resulting in poor model performance. In this paper we propose a novel task-driven top-down framework TaskWeave for joint moment retrieval and highlight detection.
The framework introduces a task-decoupled unit to capture task-specific and common representations. To investigate the interplay between the two tasks we propose an inter-task feedback mechanism which transforms the results of one task as guiding masks to assist the other task. Different from existing methods we present a task-dependent joint loss function to optimize the model. Comprehensive experiments and in-depth ablation studies on QVHighlights TVSum and Charades-STA datasets corroborate the effectiveness and flexibility of the proposed framework. Codes are available at https://github.com/EdenGabriel/TaskWeave. + + + + MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Vasu_MobileCLIP_Fast_Image-Text_Models_through_Multi-Modal_Reinforced_Training_CVPR_2024_paper.pdf + Contrastive pre-training of image-text foundation models such as CLIP demonstrated excellent zero-shot performance and improved robustness on a wide range of downstream tasks. However these models utilize large transformer-based encoders with significant memory and latency overhead which pose challenges for deployment on mobile devices. In this work we introduce MobileCLIP - a new family of efficient image-text models optimized for runtime performance along with a novel and efficient training approach namely multi-modal reinforced training. The proposed training approach leverages knowledge transfer from an image captioning model and an ensemble of strong CLIP encoders to improve the accuracy of efficient models. Our approach avoids train-time compute overhead by storing the additional knowledge in a reinforced dataset. MobileCLIP sets a new state-of-the-art latency-accuracy tradeoff for zero-shot classification and retrieval tasks on several datasets. Our MobileCLIP-S2 variant is 2.3x faster while more accurate compared to previous best CLIP model based on ViT-B/16. We further demonstrate the effectiveness of our multi-modal reinforced training by training a CLIP model based on ViT-B/16 image backbone and achieving +2.9% average performance improvement on 38 evaluation benchmarks compared to the previous best. Moreover we show that the proposed approach achieves 10x-1000x improved learning efficiency when compared with non- reinforced CLIP training. Code and models are available at https://github.com/apple/ml-mobileclip + + + + VideoCon: Robust Video-Language Alignment via Contrast Captions + http://openaccess.thecvf.com//content/CVPR2024/papers/Bansal_VideoCon_Robust_Video-Language_Alignment_via_Contrast_Captions_CVPR_2024_paper.pdf + Despite being (pre)trained on a massive amount of data state-of-the-art video-language alignment models are not robust to semantically-plausible contrastive changes in the video captions. Our work addresses this by identifying a broad spectrum of contrast misalignments such as replacing entities actions and flipping event order which alignment models should be robust against. To this end we introduce the VideoCon a video-language alignment dataset constructed by a large language model that generates plausible contrast video captions and explanations for differences between original and contrast video captions. Then a generative video-language model is finetuned with VideoCon to assess video-language entailment and generate explanations. Our VideoCon-based alignment model significantly outperforms current models. 
It exhibits a 12-point increase in AUC for the video-language alignment task on human-generated contrast captions. Finally our model sets new state-of-the-art zero-shot performance in temporally-extensive video-language tasks such as text-to-video retrieval (SSv2-Temporal) and video question answering (ATP-Hard). Moreover our model shows superior performance on novel videos and human-crafted captions and explanations. + + + + Discovering and Mitigating Visual Biases through Keyword Explanation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Discovering_and_Mitigating_Visual_Biases_through_Keyword_Explanation_CVPR_2024_paper.pdf + Addressing biases in computer vision models is crucial for real-world AI deployments. However mitigating visual biases is challenging due to their unexplainable nature often identified indirectly through visualization or sample statistics which necessitates additional human supervision for interpretation. To tackle this issue we propose the Bias-to-Text (B2T) framework which interprets visual biases as keywords. Specifically we extract common keywords from the captions of mispredicted images to identify potential biases in the model. We then validate these keywords by measuring their similarity to the mispredicted images using a vision-language scoring model. The keyword explanation form of visual bias offers several advantages such as a clear group naming for bias discovery and a natural extension for debiasing using these group names. Our experiments demonstrate that B2T can identify known biases such as gender bias in CelebA background bias in Waterbirds and distribution shifts in ImageNet-R/C. Additionally B2T uncovers novel biases in larger datasets such as Dollar Street and ImageNet. For example we discovered a contextual bias between "bee" and "flower" in ImageNet. We also highlight various applications of B2T keywords including debiased training CLIP prompting and model comparison. + + + + Robust Emotion Recognition in Context Debiasing + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Robust_Emotion_Recognition_in_Context_Debiasing_CVPR_2024_paper.pdf + Context-aware emotion recognition (CAER) has recently boosted the practical applications of affective computing techniques in unconstrained environments. Mainstream CAER methods invariably extract ensemble representations from diverse contexts and subject-centred characteristics to perceive the target person's emotional state. Despite advancements the biggest challenge remains due to context bias interference. The harmful bias forces the models to rely on spurious correlations between background contexts and emotion labels in likelihood estimation causing severe performance bottlenecks and confounding valuable context priors. In this paper we propose a counterfactual emotion inference (CLEF) framework to address the above issue. Specifically we first formulate a generalized causal graph to decouple the causal relationships among the variables in CAER. Following the causal graph CLEF introduces a non-invasive context branch to capture the adverse direct effect caused by the context bias. During the inference we eliminate the direct context effect from the total causal effect by comparing factual and counterfactual outcomes resulting in bias mitigation and robust prediction. As a model-agnostic framework CLEF can be readily integrated into existing methods bringing consistent performance gains.
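The CLEF abstract above removes the direct context effect from the total causal effect at inference time. Below is a minimal sketch of that subtraction, assuming a factual context-plus-subject branch and a context-only branch; the linear heads and the alpha blending factor are placeholders, not the paper's model.

# Minimal sketch of counterfactual debiasing by effect subtraction (placeholders, not CLEF itself).
import torch
import torch.nn as nn

num_classes, ctx_dim, subj_dim = 7, 128, 128
fusion_head = nn.Linear(ctx_dim + subj_dim, num_classes)    # factual branch: context + subject
context_head = nn.Linear(ctx_dim, num_classes)               # non-invasive context-only branch

def debiased_logits(context: torch.Tensor, subject: torch.Tensor, alpha: float = 1.0):
    total_effect = fusion_head(torch.cat([context, subject], dim=-1))
    direct_context_effect = context_head(context)             # counterfactual: subject removed
    return total_effect - alpha * direct_context_effect       # keep the debiased, indirect effect

ctx, subj = torch.randn(4, ctx_dim), torch.randn(4, subj_dim)
print(debiased_logits(ctx, subj).shape)   # (4, 7)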
+ + + + CAPE: CAM as a Probabilistic Ensemble for Enhanced DNN Interpretation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chowdhury_CAPE_CAM_as_a_Probabilistic_Ensemble_for_Enhanced_DNN_Interpretation_CVPR_2024_paper.pdf + Deep Neural Networks (DNNs) are widely used for visual classification tasks but their complex computation process and black-box nature hinder decision transparency and interpretability. Class activation maps (CAMs) and recent variants provide ways to visually explain the DNN decision-making process by displaying 'attention' heatmaps of the DNNs. Nevertheless the CAM explanation only offers relative attention information that is on an attention heatmap we can interpret which image region is more or less important than the others. However these regions cannot be meaningfully compared across classes and the contribution of each region to the model's class prediction is not revealed. To address these challenges that ultimately lead to better DNN Interpretation in this paper we propose CAPE a novel reformulation of CAM that provides a unified and probabilistically meaningful assessment of the contributions of image regions. We quantitatively and qualitatively compare CAPE with state-of-the-art CAM methods on CUB and ImageNet benchmark datasets to demonstrate enhanced interpretability. We also test on a cytology imaging dataset depicting a challenging Chronic Myelomonocytic Leukemia (CMML) diagnosis problem. Code is available at:https://github.com/AIML-MED/CAPE. + + + + Multi-Space Alignments Towards Universal LiDAR Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Multi-Space_Alignments_Towards_Universal_LiDAR_Segmentation_CVPR_2024_paper.pdf + A unified and versatile LiDAR segmentation model with strong robustness and generalizability is desirable for safe autonomous driving perception. This work presents M3Net a one-of-a-kind framework for fulfilling multi-task multi-dataset multi-modality LiDAR segmentation in a universal manner using just a single set of parameters. To better exploit data volume and diversity we first combine large-scale driving datasets acquired by different types of sensors from diverse scenes and then conduct alignments in three spaces namely data feature and label spaces during the training. As a result M3Net is capable of taming heterogeneous data for training state-of-the-art LiDAR segmentation models. Extensive experiments on twelve LiDAR segmentation datasets verify our effectiveness. Notably using a shared set of parameters M3Net achieves 75.1% 83.1% and 72.4% mIoU scores respectively on the official benchmarks of SemanticKITTI nuScenes and Waymo Open. + + + + FlowDiffuser: Advancing Optical Flow Estimation with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_FlowDiffuser_Advancing_Optical_Flow_Estimation_with_Diffusion_Models_CVPR_2024_paper.pdf + Optical flow estimation a process of predicting pixel-wise displacement between consecutive frames has commonly been approached as a regression task in the age of deep learning. Despite notable advancements this de facto paradigm unfortunately falls short in generalization performance when trained on synthetic or constrained data. Pioneering a paradigm shift we reformulate optical flow estimation as a conditional flow generation challenge unveiling FlowDiffuser --- a new family of optical flow models that could have stronger learning and generalization capabilities. 
FlowDiffuser estimates optical flow through a `noise-to-flow' strategy progressively eliminating noise from randomly generated flows conditioned on the provided pairs. To optimize accuracy and efficiency our FlowDiffuser incorporates a novel Conditional Recurrent Denoising Decoder (Conditional-RDD) streamlining the flow estimation process. It incorporates a unique Hidden State Denoising (HSD) paradigm effectively leveraging the information from previous time steps. Moreover FlowDiffuser can be easily integrated into existing flow networks leading to significant improvements in performance metrics compared to conventional implementations. Experiments on challenging benchmarks including Sintel and KITTI demonstrate the effectiveness of our FlowDiffuser with superior performance to existing state-of-the-art models. Code is available at https://github.com/LA30/FlowDiffuser. + + + + Free3D: Consistent Novel View Synthesis without 3D Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Free3D_Consistent_Novel_View_Synthesis_without_3D_Representation_CVPR_2024_paper.pdf + We introduce Free3D a simple accurate method for monocular open-set novel view synthesis (NVS). Similar to Zero-1-to-3 we start from a pre-trained 2D image generator for generalization and fine-tune it for NVS. Compared to other works that took a similar approach we obtain significant improvements without resorting to an explicit 3D representation which is slow and memory-consuming and without training an additional network for 3D reconstruction. Our key contribution is to improve the way the target camera pose is encoded in the network which we do by introducing a new ray conditioning normalization (RCN) layer. The latter injects pose information in the underlying 2D image generator by telling each pixel its viewing direction. We further improve multi-view consistency by using light-weight multi-view attention layers and by sharing generation noise between the different views. We train Free3D on the Objaverse dataset and demonstrate excellent generalization to new categories in new datasets including OmniObject3D and GSO. The project page is available at https://chuanxiaz.com/free3d/. + + + + WALT3D: Generating Realistic Training Data from Time-Lapse Imagery for Reconstructing Dynamic Objects Under Occlusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Vuong_WALT3D_Generating_Realistic_Training_Data_from_Time-Lapse_Imagery_for_Reconstructing_CVPR_2024_paper.pdf + Current methods for 2D and 3D object understanding struggle with severe occlusions in busy urban environments partly due to the lack of large-scale labeled ground-truth annotations for learning occlusion. In this work we introduce a novel framework for automatically generating a large realistic dataset of dynamic objects under occlusions using freely available time-lapse imagery. By leveraging off-the-shelf 2D (bounding box segmentation keypoint) and 3D (pose shape) predictions as pseudo-groundtruth unoccluded 3D objects are identified automatically and composited into the background in a clip-art style ensuring realistic appearances and physically accurate occlusion configurations. The resulting clip-art image with pseudo-groundtruth enables efficient training of object reconstruction methods that are robust to occlusions. Our method demonstrates significant improvements in both 2D and 3D reconstruction particularly in scenarios with heavily occluded objects like vehicles and people in urban scenes. 
+ + + + Towards Language-Driven Video Inpainting via Multimodal Large Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Towards_Language-Driven_Video_Inpainting_via_Multimodal_Large_Language_Models_CVPR_2024_paper.pdf + We introduce a new task -- language-driven video inpainting which uses natural language instructions to guide the inpainting process. This approach overcomes the limitations of traditional video inpainting methods that depend on manually labeled binary masks a process often tedious and labor-intensive. We present the Remove Objects from Videos by Instructions (ROVI) dataset containing 5650 videos and 9091 inpainting results to support training and evaluation for this task. We also propose a novel diffusion-based language-driven video inpainting framework the first end-to-end baseline for this task integrating Multimodal Large Language Models to understand and execute complex language-based inpainting requests effectively. Our comprehensive results showcase the dataset's versatility and the model's effectiveness in various language-instructed inpainting scenarios. We have made datasets code and models publicly available at https://github.com/jianzongwu/Language-Driven-Video-Inpainting. + + + + CLIP-KD: An Empirical Study of CLIP Model Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_CLIP-KD_An_Empirical_Study_of_CLIP_Model_Distillation_CVPR_2024_paper.pdf + Contrastive Language-Image Pre-training (CLIP) has become a promising language-supervised visual pre-training framework. This paper aims to distill small CLIP models supervised by a large teacher CLIP model. We propose several distillation strategies including relation feature gradient and contrastive paradigms to examine the effectiveness of CLIP-Knowledge Distillation (KD). We show that a simple feature mimicry with Mean Squared Error loss works surprisingly well. Moreover interactive contrastive learning across teacher and student encoders is also effective in performance improvement. We explain that the success of CLIP-KD can be attributed to maximizing the feature similarity between teacher and student. The unified method is applied to distill several student models trained on CC3M+12M. CLIP-KD improves student CLIP models consistently over zero-shot ImageNet classification and cross-modal retrieval benchmarks. When using ViT-L/14 pretrained on Laion-400M as the teacher CLIP-KD achieves 57.5% and 55.4% zero-shot top-1 ImageNet accuracy over ViT-B/16 and ResNet-50 surpassing the original CLIP without KD by 20.5% and 20.1% margins respectively. Our code is released on https://github.com/winycg/CLIP-KD. + + + + OneTracker: Unifying Visual Object Tracking with Foundation Models and Efficient Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Hong_OneTracker_Unifying_Visual_Object_Tracking_with_Foundation_Models_and_Efficient_CVPR_2024_paper.pdf + Visual object tracking aims to localize the target object of each frame based on its initial appearance in the first frame. Depending on the input modality, tracking tasks can be divided into RGB tracking and RGB+X (e.g. RGB+N and RGB+D) tracking. Despite the different input modalities the core aspect of tracking is the temporal matching. Based on this common ground we present a general framework to unify various tracking tasks termed as OneTracker. OneTracker first performs a large-scale pre-training on an RGB tracker called Foundation Tracker.
This pretraining phase equips the Foundation Tracker with a stable ability to estimate the location of the target object. Then, we regard other modality information as a prompt and build Prompt Tracker upon Foundation Tracker. Through freezing the Foundation Tracker and only adjusting some additional trainable parameters, Prompt Tracker inherits the strong localization ability from Foundation Tracker and achieves parameter-efficient finetuning on downstream RGB+X tracking tasks. To evaluate the effectiveness of our general framework OneTracker, which consists of Foundation Tracker and Prompt Tracker, we conduct extensive experiments on 6 popular tracking tasks across 11 benchmarks, and our OneTracker outperforms other models and achieves state-of-the-art performance. + + + + SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yue_SC-Tune_Unleashing_Self-Consistent_Referential_Comprehension_in_Large_Vision_Language_Models_CVPR_2024_paper.pdf + Recent trends in Large Vision Language Models (LVLMs) research have been increasingly focusing on advancing beyond general image understanding towards more nuanced object-level referential comprehension. In this paper, we present and delve into the self-consistency capability of LVLMs, a crucial aspect that reflects the models' ability to both generate informative captions for specific objects and subsequently utilize these captions to accurately re-identify the objects in a closed-loop process. This capability significantly mirrors the precision and reliability of fine-grained visual-language understanding. Our findings reveal that the self-consistency level of existing LVLMs falls short of expectations, posing limitations on their practical applicability and potential. To address this gap, we introduce a novel fine-tuning paradigm named Self-Consistency Tuning (SC-Tune). It features the synergistic learning of a cyclic describer-locator system. This paradigm is not only data-efficient but also exhibits generalizability across multiple LVLMs. Through extensive experiments, we demonstrate that SC-Tune significantly elevates performance across a spectrum of object-level vision-language benchmarks and maintains competitive or improved performance on image-level vision-language benchmarks. Both our model and code will be publicly available at https://github.com/ivattyue/SC-Tune. + + + + NeRSP: Neural 3D Reconstruction for Reflective Objects with Sparse Polarized Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_NeRSP_Neural_3D_Reconstruction_for_Reflective_Objects_with_Sparse_Polarized_CVPR_2024_paper.pdf + We present NeRSP, a Neural 3D reconstruction technique for Reflective surfaces with Sparse Polarized images. Reflective surface reconstruction is extremely challenging as specular reflections are view-dependent and thus violate the multiview consistency for multiview stereo. On the other hand, sparse image inputs, as a practical capture setting, commonly cause incomplete or distorted results due to the lack of correspondence matching. This paper jointly handles the challenges from sparse inputs and reflective surfaces by leveraging polarized images. We derive photometric and geometric cues from the polarimetric image formation model and multiview azimuth consistency, which jointly optimize the surface geometry modeled via implicit neural representation.
Based on the experiments on our synthetic and real datasets we achieve the state-of-the-art surface reconstruction results with only 6 views as input. + + + + Retrieval-Augmented Embodied Agents + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Retrieval-Augmented_Embodied_Agents_CVPR_2024_paper.pdf + Embodied agents operating in complex and uncertain environments face considerable challenges. While some advanced agents handle complex manipulation tasks with proficiency their success often hinges on extensive training data to develop their capabilities. In contrast humans typically rely on recalling past experiences and analogous situations to solve new problems. Aiming to emulate this human approach in robotics we introduce the Retrieval-Augmented Embodied Agent (RAEA). This innovative system equips robots with a form of shared memory significantly enhancing their performance. Our approach integrates a policy retriever allowing robots to access relevant strategies from an external policy memory bank based on multi-modal inputs. Additionally a policy generator is employed to assimilate these strategies into the learning process enabling robots to formulate effective responses to tasks. Extensive testing of RAEA in both simulated and real-world scenarios demonstrates its superior performance over traditional methods representing a major leap forward in robotic technology. + + + + SAFDNet: A Simple and Effective Network for Fully Sparse 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_SAFDNet_A_Simple_and_Effective_Network_for_Fully_Sparse_3D_CVPR_2024_paper.pdf + LiDAR-based 3D object detection plays an essential role in autonomous driving. Existing high-performing 3D object detectors usually build dense feature maps in the backbone network and prediction head. However the computational costs introduced by the dense feature maps grow quadratically as the perception range increases making these models hard to scale up to long-range detection. Some recent works have attempted to construct fully sparse detectors to solve this issue; nevertheless the resulting models either rely on a complex multi-stage pipeline or exhibit inferior performance. In this work we propose a fully sparse adaptive feature diffusion network (SAFDNet) for LiDAR-based 3D object detection. In SAFDNet an adaptive feature diffusion strategy is designed to address the center feature missing problem. We conducted extensive experiments on Waymo Open nuScenes and Argoverse2 datasets. SAFDNet performed slightly better than the previous SOTA on the first two datasets but much better on the last dataset which features long-range detection verifying the efficacy of SAFDNet in scenarios where long-range detection is required. Notably on Argoverse2 SAFDNet surpassed the previous best hybrid detector HEDNet by 2.6% mAP while being 2.1x faster and yielded 2.1% mAP gains over the previous best sparse detector FSDv2 while being 1.3x faster. The code will be available at https://github.com/zhanggang001/HEDNet. + + + + HINTED: Hard Instance Enhanced Detector with Mixed-Density Feature Fusion for Sparsely-Supervised 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_HINTED_Hard_Instance_Enhanced_Detector_with_Mixed-Density_Feature_Fusion_for_CVPR_2024_paper.pdf + Current sparsely-supervised object detection methods largely depend on high threshold settings to derive high-quality pseudo labels from detector predictions. 
However hard instances within point clouds frequently display incomplete structures causing decreased confidence scores in their assigned pseudo-labels. Previous methods inevitably result in inadequate positive supervision for these instances. To address this problem we propose a novel Hard INsTance Enhanced Detector HINTED for sparsely-supervised 3D object detection. Firstly we design a self-boosting teacher SBT model to generate more potential pseudo-labels enhancing the effectiveness of information transfer. Then we introduce a mixed-density student MDS model to concentrate on hard instances during the training phase thereby improving detection accuracy. Our extensive experiments on the KITTI dataset validate our method's superior performance. Compared with leading sparsely-supervised methods HINTED significantly improves the detection performance on hard instances notably outperforming fully-supervised methods in detecting challenging categories like cyclists. HINTED also significantly outperforms the state-of-the-art semi-supervised method on challenging categories. The code is available at https://github.com/xmuqimingxia/HINTED. + + + + Structured Gradient-based Interpretations via Norm-Regularized Adversarial Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Gong_Structured_Gradient-based_Interpretations_via_Norm-Regularized_Adversarial_Training_CVPR_2024_paper.pdf + Gradient-based saliency maps have been widely used to explain the decisions of deep neural network classifiers. However standard gradient-based interpretation maps including the simple gradient and integrated gradient algorithms often lack desired structures such as sparsity and connectedness in their application to real-world computer vision models. A common approach to induce sparsity-based structures into gradient-based saliency maps is to modify the simple gradient scheme using sparsification or norm-based regularization. However one drawback with such post-processing approaches is the potentially significant loss in fidelity to the original simple gradient map. In this work we propose to apply adversarial training as an in-processing scheme to train neural networks with structured simple gradient maps. We demonstrate an existing duality between the regularized norms of the adversarial perturbations and gradient-based maps whereby we design adversarial training schemes promoting sparsity and group-sparsity properties in simple gradient maps. We present comprehensive numerical results to show the influence of our proposed norm-based adversarial training methods on the standard gradient-based maps of standard neural network architectures on benchmark image datasets. + + + + 3DFIRES: Few Image 3D REconstruction for Scenes with Hidden Surfaces + http://openaccess.thecvf.com//content/CVPR2024/papers/Jin_3DFIRES_Few_Image_3D_REconstruction_for_Scenes_with_Hidden_Surfaces_CVPR_2024_paper.pdf + This paper introduces 3DFIRES a novel system for scene-level 3D reconstruction from posed images. Designed to work with as few as one view 3DFIRES reconstructs the complete geometry of unseen scenes including hidden surfaces. With multiple view inputs our method produces full reconstruction within all camera frustums. A key feature of our approach is the fusion of multi-view information at the feature level enabling the production of coherent and comprehensive 3D reconstruction. We train our system on non-watertight scans from large-scale real scene dataset. 
We show it matches the efficacy of single-view reconstruction methods with only one input and surpasses existing techniques in both quantitative and qualitative measures for sparse-view 3D reconstruction. + + + + MCPNet: An Interpretable Classifier via Multi-Level Concept Prototypes + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_MCPNet_An_Interpretable_Classifier_via_Multi-Level_Concept_Prototypes_CVPR_2024_paper.pdf + Recent advancements in post-hoc and inherently interpretable methods have markedly enhanced the explanations of black box classifier models. These methods operate either through post-analysis or by integrating concept learning during model training. Although being effective in bridging the semantic gap between a model's latent space and human interpretation these explanation methods only partially reveal the model's decision-making process. The outcome is typically limited to high-level semantics derived from the last feature map. We argue that the explanations lacking insights into the decision processes at low and mid-level features are neither fully faithful nor useful. Addressing this gap we introduce the Multi-Level Concept Prototypes Classifier (MCPNet) an inherently interpretable model. MCPNet autonomously learns meaningful concept prototypes across multiple feature map levels using Centered Kernel Alignment (CKA) loss and an energy-based weighted PCA mechanism and it does so without reliance on predefined concept labels. Further we propose a novel classifier paradigm that learns and aligns multi-level concept prototype distributions for classification purposes via Class-aware Concept Distribution (CCD) loss. Our experiments reveal that our proposed MCPNet while being adaptable to various model architectures offers comprehensive multi-level explanations while maintaining classification accuracy. Additionally its concept distribution-based classification approach shows improved generalization capabilities in few-shot classification scenarios. + + + + ALGM: Adaptive Local-then-Global Token Merging for Efficient Semantic Segmentation with Plain Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Norouzi_ALGM_Adaptive_Local-then-Global_Token_Merging_for_Efficient_Semantic_Segmentation_with_CVPR_2024_paper.pdf + This work presents Adaptive Local-then-Global Merging (ALGM) a token reduction method for semantic segmentation networks that use plain Vision Transformers. ALGM merges tokens in two stages: (1) In the first network layer it merges similar tokens within a small local window and (2) halfway through the network it merges similar tokens across the entire image. This is motivated by an analysis in which we found that in those situations tokens with a high cosine similarity can likely be merged without a drop in segmentation quality. With extensive experiments across multiple datasets and network configurations we show that ALGM not only significantly improves the throughput by up to 100% but can also enhance the mean IoU by up to +1.1 thereby achieving a better trade-off between segmentation quality and efficiency than existing methods. Moreover our approach is adaptive during inference meaning that the same model can be used for optimal efficiency or accuracy depending on the application. Code is available at https://tue-mps.github.io/ALGM. 
+ + + + Single-Model and Any-Modality for Video Object Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Single-Model_and_Any-Modality_for_Video_Object_Tracking_CVPR_2024_paper.pdf + In the realm of video object tracking auxiliary modalities such as depth thermal or event data have emerged as valuable assets to complement the RGB trackers. In practice most existing RGB trackers learn a single set of parameters to use them across datasets and applications. However a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs -- each with modality-specific representations the scarcity of multi-modal datasets and the absence of all the modalities at all times. In this work we introduce Un-Track a Unified Tracker of a single set of parameters for any modality. To handle any modality our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly we use only the RGB-X pairs to learn the common latent space. This unique shared representation seamlessly binds all modalities together enabling effective unification and accommodating any missing modality all within a single transformer-based architecture. Our Un-Track achieves +8.1 absolute F-score gain on the DepthTrack dataset by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific counterparts validating our effectiveness and practicality. The source code is publicly available at https://github.com/Zongwei97/UnTrack. + + + + FlowTrack: Revisiting Optical Flow for Long-Range Dense Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Cho_FlowTrack_Revisiting_Optical_Flow_for_Long-Range_Dense_Tracking_CVPR_2024_paper.pdf + In the domain of video tracking existing methods often grapple with a trade-off between spatial density and temporal range. Current approaches in dense optical flow estimators excel in providing spatially dense tracking but are limited to short temporal spans. Conversely recent advancements in long-range trackers offer extended temporal coverage but at the cost of spatial sparsity. This paper introduces FlowTrack a novel framework designed to bridge this gap. FlowTrack combines the strengths of both paradigms by 1) chaining confident flow predictions to maximize efficiency and 2) automatically switching to an error compensation module in instances of flow prediction inaccuracies. This dual strategy not only offers efficient dense tracking over extended temporal spans but also ensures robustness against error accumulations and occlusions common pitfalls of naive flow chaining. Furthermore we demonstrate that chained flow itself can serve as an effective guide for an error compensation module even for occluded points. Our framework achieves state-of-the-art accuracy for long-range tracking on the DAVIS dataset and renders 50% speed-up when performing dense tracking. + + + + Synthesize Diagnose and Optimize: Towards Fine-Grained Vision-Language Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_Synthesize_Diagnose_and_Optimize_Towards_Fine-Grained_Vision-Language_Understanding_CVPR_2024_paper.pdf + Vision language models (VLM) have demonstrated remarkable performance across various downstream tasks. 
However understanding fine-grained visual-linguistic concepts such as attributes and inter-object relationships remains a significant challenge. While several benchmarks aim to evaluate VLMs in finer granularity their primary focus remains on the linguistic aspect neglecting the visual dimension. Here we highlight the importance of evaluating VLMs from both a textual and visual perspective. We introduce a progressive pipeline to synthesize images that vary in a specific attribute while ensuring consistency in all other aspects. Utilizing this data engine we carefully design a benchmark SPEC to diagnose the comprehension of object size position existence and count. Subsequently we conduct a thorough evaluation of four leading VLMs on SPEC. Surprisingly their performance is close to random guess revealing significant limitations. With this in mind we propose a simple yet effective approach to optimize VLMs in fine-grained understanding achieving significant improvements on SPEC without compromising the zero-shot performance. Results on two additional fine-grained benchmarks also show consistent improvements further validating the transferability of our approach. Code and data are available at https://github.com/wjpoom/SPEC. + + + + WildlifeMapper: Aerial Image Analysis for Multi-Species Detection and Identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Kumar_WildlifeMapper_Aerial_Image_Analysis_for_Multi-Species_Detection_and_Identification_CVPR_2024_paper.pdf + We introduce WildlifeMapper (WM) a flexible model designed to detect locate and identify multiple species in aerial imagery. It addresses the limitations of traditional labor-intensive wildlife population assessments that are central to advancing environmental conservation efforts worldwide. While a number of methods exist to automate this process they are often limited in their ability to generalize to different species or landscapes due to the dominance of homogeneous backgrounds and/or poorly captured local image structures. WM introduces two novel modules that help to capture the local structure and context of objects of interest to accurately localize and identify them achieving a state-of-the-art (SOTA) detection rate of 0.56 mAP. Further we introduce a large aerial imagery dataset with more than 11k Images and 28k annotations verified by trained experts. WM also achieves SOTA performance on 3 other publicly available aerial survey datasets collected across 4 different countries improving mAP by 42%. Source code and trained models are available at Github + + + + Tune-An-Ellipse: CLIP Has Potential to Find What You Want + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_Tune-An-Ellipse_CLIP_Has_Potential_to_Find_What_You_Want_CVPR_2024_paper.pdf + Visual prompting of large vision language models such as CLIP exhibits intriguing zero-shot capabilities. A manually drawn red circle commonly used for highlighting can guide CLIP's attention to the surrounding region to identify specific objects within an image. Without precise object proposals however it is insufficient for localization. Our novel simple yet effective approach i.e. Differentiable Visual Prompting enables CLIP to zero-shot localize: given an image and a text prompt describing an object we first pick a rendered ellipse from uniformly distributed anchor ellipses on the image grid via visual prompting then use three loss functions to tune the ellipse coefficients to encapsulate the target region gradually. 
This yields promising experimental results for referring expression comprehension without precisely specified object proposals. In addition, we systematically present the limitations of visual prompting inherent in CLIP and discuss potential solutions. + + + + Incremental Nuclei Segmentation from Histopathological Images via Future-class Awareness and Compatibility-inspired Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Incremental_Nuclei_Segmentation_from_Histopathological_Images_via_Future-class_Awareness_and_CVPR_2024_paper.pdf + We present a novel semantic segmentation approach for incremental nuclei segmentation from histopathological images, which is a very challenging task, as we have to incrementally optimize existing models to make them perform well in both old and new classes without using training samples of old classes. Yet, it is an indispensable component of computer-aided diagnosis systems. The proposed approach has two key techniques. First, we propose a new future-class awareness mechanism by separating some potential regions for future classes from background based on their similarities to both old and new classes in the representation space. With this mechanism, we can not only reserve more parameter space for future updates but also enhance the representation capability of learned features. We further propose an innovative compatibility-inspired distillation scheme to make our model take full advantage of the knowledge learned by the old model. We conducted extensive experiments on two famous histopathological datasets, and the results demonstrate that the proposed approach achieves much better performance than state-of-the-art approaches. The code is available at https://github.com/why19991/InSeg. + + + + DiffMOT: A Real-time Diffusion-based Multiple Object Tracker with Non-linear Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Lv_DiffMOT_A_Real-time_Diffusion-based_Multiple_Object_Tracker_with_Non-linear_Prediction_CVPR_2024_paper.pdf + In Multiple Object Tracking, objects often exhibit non-linear motion of acceleration and deceleration, with irregular direction changes. Tracking-by-detection (TBD) trackers with Kalman Filter motion prediction work well in pedestrian-dominant scenarios but fall short in complex situations when multiple objects perform non-linear and diverse motion simultaneously. To tackle the complex non-linear motion, we propose a real-time diffusion-based MOT approach named DiffMOT. Specifically, for the motion predictor component, we propose a novel Decoupled Diffusion-based Motion Predictor (D^2MP). It models the entire distribution of various motion presented by the data as a whole. It also predicts an individual object's motion conditioning on an individual's historical motion information. Furthermore, it optimizes the diffusion process with much fewer sampling steps. As a MOT tracker, DiffMOT is real-time at 22.7 FPS and also outperforms the state-of-the-art on the DanceTrack and SportsMOT datasets with 62.3% and 76.2% in HOTA metrics, respectively. To the best of our knowledge, DiffMOT is the first to introduce a diffusion probabilistic model into MOT to tackle non-linear motion prediction. + + + + Just Add π!
Pose Induced Video Transformers for Understanding Activities of Daily Living + http://openaccess.thecvf.com//content/CVPR2024/papers/Reilly_Just_Add__Pose_Induced_Video_Transformers_for_Understanding_Activities_CVPR_2024_paper.pdf + Video transformers have become the de facto standard for human action recognition, yet their exclusive reliance on the RGB modality still limits their adoption in certain domains. One such domain is Activities of Daily Living (ADL), where RGB alone is not sufficient to distinguish between visually similar actions or actions observed from multiple viewpoints. To facilitate the adoption of video transformers for ADL, we hypothesize that the augmentation of RGB with human pose information, known for its sensitivity to fine-grained motion and multiple viewpoints, is essential. Consequently, we introduce the first Pose Induced Video Transformer: PI-ViT (or π-ViT), a novel approach that augments the RGB representations learned by video transformers with 2D and 3D pose information. The key elements of π-ViT are two plug-in modules, the 2D Skeleton Induction Module and the 3D Skeleton Induction Module, that are responsible for inducing 2D and 3D pose information into the RGB representations. These modules operate by performing pose-aware auxiliary tasks, a design choice that allows π-ViT to discard the modules during inference. Notably, π-ViT achieves state-of-the-art performance on three prominent ADL datasets, encompassing both real-world and large-scale RGB-D datasets, without requiring poses or additional computational overhead at inference. + + + + ViLa-MIL: Dual-scale Vision-Language Multiple Instance Learning for Whole Slide Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_ViLa-MIL_Dual-scale_Vision-Language_Multiple_Instance_Learning_for_Whole_Slide_Image_CVPR_2024_paper.pdf + The multiple instance learning (MIL)-based framework has become the mainstream for processing whole slide images (WSIs) with giga-pixel size and hierarchical image context in digital pathology. However, these methods heavily depend on a substantial number of bag-level labels and solely learn from the original slides, which are easily affected by variations in data distribution. Recently, vision language model (VLM)-based methods introduced the language prior by pre-training on large-scale pathological image-text pairs. However, the previous text prompt lacks the consideration of pathological prior knowledge and therefore does not substantially boost the model's performance. Moreover, the collection of such pairs and the pre-training process are very time-consuming and resource-intensive. To solve the above problems, we propose a dual-scale vision-language multiple instance learning (ViLa-MIL) framework for whole slide image classification. Specifically, we propose a dual-scale visual descriptive text prompt based on a frozen large language model (LLM) to boost the performance of the VLM effectively. To transfer the VLM to process WSIs efficiently, for the image branch, we propose a prototype-guided patch decoder to aggregate the patch features progressively by grouping similar patches into the same prototype; for the text branch, we introduce a context-guided text decoder to enhance the text features by incorporating the multi-granular image contexts. Extensive studies on three multi-cancer and multi-center subtyping datasets demonstrate the superiority of ViLa-MIL.
+ + + + CapsFusion: Rethinking Image-Text Data at Scale + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_CapsFusion_Rethinking_Image-Text_Data_at_Scale_CVPR_2024_paper.pdf + Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions which have been largely obscured by their initial benchmark success. Upon closer examination we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data we propose CapsFusion an advanced framework that leverages large language models to consolidate and refine information from both web-based image-text pairs and synthetic captions. Extensive experiments show that CapsFusion captions exhibit remarkable all-round superiority over existing captions in terms of model performance (e.g. 18.8 and 18.3 improvements in CIDEr score on COCO and NoCaps) sample efficiency (requiring 11-16 times less computation than baselines) world knowledge depth and scalability. These effectiveness efficiency and scalability advantages position CapsFusion as a promising candidate for future scaling of LMM training. + + + + Tumor Micro-environment Interactions Guided Graph Learning for Survival Analysis of Human Cancers from Whole-slide Pathological Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Shao_Tumor_Micro-environment_Interactions_Guided_Graph_Learning_for_Survival_Analysis_of_CVPR_2024_paper.pdf + The recent advance of deep learning technology brings the possibility of assisting the pathologist to predict the patients' survival from whole-slide pathological images (WSIs). However most of the prevalent methods only worked on the sampled patches in specifically or randomly selected tumor areas of WSIs which has very limited capability to capture the complex interactions between tumor and its surrounding micro-environment components. As a matter of fact tumor is supported and nurtured in the heterogeneous tumor micro-environment(TME) and the detailed analysis of TME and their correlation with tumors are important to in-depth analyze the mechanism of cancer development. In this paper we considered the spatial interactions among tumor and its two major TME components (i.e. lymphocytes and stromal fibrosis) and presented a Tumor Micro-environment Interactions Guided Graph Learning (TMEGL) algorithm for the prognosis prediction of human cancers. Specifically we firstly selected different types of patches as nodes to build graph for each WSI. Then a novel TME neighborhood organization guided graph embedding algorithm was proposed to learn node representations that can preserve their topological structure information. Finally a Gated Graph Attention Network is applied to capture the survival-associated intersections among tumor and different TME components for clinical outcome prediction. 
We tested TMEGL on three cancer cohorts derived from The Cancer Genome Atlas (TCGA), and the experimental results indicated that TMEGL not only outperforms the existing WSI-based survival analysis models but also has good explainable ability for survival prediction. + + + + Towards Generalizable Multi-Object Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Qin_Towards_Generalizable_Multi-Object_Tracking_CVPR_2024_paper.pdf + Multi-Object Tracking (MOT) encompasses various tracking scenarios, each characterized by unique traits. Effective trackers should demonstrate a high degree of generalizability across diverse scenarios. However, existing trackers struggle to accommodate all aspects or necessitate hypothesis and experimentation to customize the association information (motion and/or appearance) for a given scenario, leading to narrowly tailored solutions with limited generalizability. In this paper, we investigate the factors that influence trackers' generalization to different scenarios and concretize them into a set of tracking scenario attributes to guide the design of more generalizable trackers. Furthermore, we propose a "point-wise to instance-wise relation" framework for MOT, i.e. GeneralTrack, which can generalize across diverse scenarios while eliminating the need to balance motion and appearance. Thanks to its superior generalizability, our proposed GeneralTrack achieves state-of-the-art performance on multiple benchmarks and demonstrates the potential for domain generalization. + + + + Slice3D: Multi-Slice Occlusion-Revealing Single View 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Slice3D_Multi-Slice_Occlusion-Revealing_Single_View_3D_Reconstruction_CVPR_2024_paper.pdf + We introduce multi-slice reasoning, a new notion for single-view 3D reconstruction which challenges the current and prevailing belief that multi-view synthesis is the most natural conduit between single-view and 3D. Our key observation is that object slicing is a more direct, and hence more advantageous, means to reveal occluded structures than altering camera views. Specifically, slicing can peel through any occluder without obstruction, and in the limit (i.e. with infinitely many slices) it is guaranteed to unveil all hidden object parts. We realize our idea by developing Slice3D, a novel method for single-view 3D reconstruction which first predicts multi-slice images from a single RGB input image and then integrates the slices into a 3D model using a coordinate-based transformer network to produce a signed distance function. The slice images can be regressed or generated, both through a U-Net based network. For the former, we inject a learnable slice indicator code to designate each decoded image into a spatial slice location, while the slice generator is a denoising diffusion model operating on the entirety of slice images stacked on the input channels. We conduct extensive evaluation against state-of-the-art alternatives to demonstrate the superiority of our method, especially in recovering complex and severely occluded shape structures amid ambiguities. All Slice3D results were produced by networks trained on a single Nvidia A40 GPU, with an inference time of less than 20 seconds.
+ + + + IIRP-Net: Iterative Inference Residual Pyramid Network for Enhanced Image Registration + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_IIRP-Net_Iterative_Inference_Residual_Pyramid_Network_for_Enhanced_Image_Registration_CVPR_2024_paper.pdf + Deep learning-based image registration (DLIR) methods have achieved remarkable success in deformable image registration. We observe that iterative inference can exploit the well-trained registration network to the fullest extent. In this work we propose a novel Iterative Inference Residual Pyramid Network (IIRP-Net) to enhance registration performance without any additional training costs. In IIRP-Net we construct a streamlined pyramid registration network consisting of a feature extractor and residual flow estimators (RP-Net) to achieve generalized capabilities in feature extraction and registration. Then in the inference phase IIRP-Net employs an iterative inference strategy to enhance RP-Net by iteratively reutilizing residual flow estimators from coarse to fine. The number of iterations is adaptively determined by the proposed IterStop mechanism. We conduct extensive experiments on the FLARE and Mindboggle datasets and the results verify the effectiveness of the proposed method outperforming state-of-the-art deformable image registration methods. Our code is available at https://github.com/Torbjorn1997/IIRP-Net. + + + + SNIFFER: Multimodal Large Language Model for Explainable Out-of-Context Misinformation Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Qi_SNIFFER_Multimodal_Large_Language_Model_for_Explainable_Out-of-Context_Misinformation_Detection_CVPR_2024_paper.pdf + Misinformation is a prevalent societal issue due to its potential high risks. Out-Of-Context (OOC) misinformation where authentic images are repurposed with false text is one of the easiest and most effective ways to mislead audiences. Current methods focus on assessing image-text consistency but lack convincing explanations for their judgments which are essential for debunking misinformation. While Multimodal Large Language Models (MLLMs) have rich knowledge and innate capability for visual reasoning and explanation generation they still lack sophistication in understanding and discovering the subtle cross-modal differences. In this paper we introduce Sniffer a novel multimodal large language model specifically engineered for OOC misinformation detection and explanation. Sniffer employs two-stage instruction tuning on InstructBLIP. The first stage refines the model's concept alignment of generic objects with news-domain entities and the second stage leverages OOC-specific instruction data generated by language-only GPT-4 to fine-tune the model's discriminatory powers. Enhanced by external tools and retrieval Sniffer not only detects inconsistencies between text and image but also utilizes external knowledge for contextual verification. Our experiments show that Sniffer surpasses the original MLLM by over 40% and outperforms state-of-the-art methods in detection accuracy. Sniffer also provides accurate and persuasive explanations as validated by quantitative and human evaluations. 
+ + + + Beyond Seen Primitive Concepts and Attribute-Object Compositional Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Saini_Beyond_Seen_Primitive_Concepts_and_Attribute-Object_Compositional_Learning_CVPR_2024_paper.pdf + Learning from seen attribute-object pairs to generalize to unseen compositions has been studied extensively in Compositional Zero-Shot Learning (CZSL). However CZSL setup is still limited to seen attributes and objects and cannot generalize to unseen concepts and their compositions. To overcome this limitation we propose a new task Open Vocabulary-Compositional Zero-shot Learning (OV-CZSL) where unseen attributes objects and unseen compositions are evaluated. To show that OV-CZSL is a challenging yet solvable problem we propose three new benchmarks based on existing datasets MIT-States C-GQA and VAW-CZSL along with new baselines and evaluation setup. We use language embeddings and external vocabulary with our novel neighborhood expansion loss to allow any method to learn semantic correlations between seen and unseen primitives. + + + + Unleashing Network Potentials for Semantic Scene Completion + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Unleashing_Network_Potentials_for_Semantic_Scene_Completion_CVPR_2024_paper.pdf + Semantic scene completion (SSC) aims to predict complete 3D voxel occupancy and semantics from a single-view RGB-D image and recent SSC methods commonly adopt multi-modal inputs. However our investigation reveals two limitations: ineffective feature learning from single modalities and overfitting to limited datasets. To address these issues this paper proposes a novel SSC framework - Adversarial Modality Modulation Network (AMMNet) - with a fresh perspective of optimizing gradient updates. The proposed AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities and a customized adversarial training scheme leveraging dynamic gradient competition. Specifically the cross-modal modulation adaptively re-calibrates the features to better excite representation potentials from each single modality. The adversarial training employs a minimax game of evolving gradients with customized guidance to strengthen the generator's perception of visual fidelity from both geometric completeness and semantic correctness. Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin providing a promising direction for improving the effectiveness and generalization of SSC methods. + + + + Learning Occupancy for Monocular 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_Learning_Occupancy_for_Monocular_3D_Object_Detection_CVPR_2024_paper.pdf + Monocular 3D detection is a challenging task due to the lack of accurate 3D information. Existing approaches typically rely on geometry constraints and dense depth estimates to facilitate the learning but often fail to fully exploit the benefits of three-dimensional feature extraction in frustum and 3D space. In this paper we propose OccupancyM3D a method of learning occupancy for monocular 3D detection. It directly learns occupancy in frustum and 3D space leading to more discriminative and informative 3D features and representations. Specifically by using synchronized raw sparse LiDAR point clouds we define the space status and generate voxel-based occupancy labels. 
We formulate occupancy prediction as a simple classification problem and design associated occupancy losses. Resulting occupancy estimates are employed to enhance original frustum/3D features. As a result, experiments on the KITTI and Waymo Open datasets demonstrate that the proposed method achieves a new state of the art and surpasses other methods by a significant margin. + + + + LAA-Net: Localized Artifact Attention Network for Quality-Agnostic and Generalizable Deepfake Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_LAA-Net_Localized_Artifact_Attention_Network_for_Quality-Agnostic_and_Generalizable_Deepfake_CVPR_2024_paper.pdf + This paper introduces a novel approach for high-quality deepfake detection called Localized Artifact Attention Network (LAA-Net). Existing methods for high-quality deepfake detection are mainly based on a supervised binary classifier coupled with an implicit attention mechanism. As a result, they do not generalize well to unseen manipulations. To handle this issue, two main contributions are made. First, an explicit attention mechanism within a multi-task learning framework is proposed. By combining heatmap-based and self-consistency attention strategies, LAA-Net is forced to focus on a few small artifact-prone vulnerable regions. Second, an Enhanced Feature Pyramid Network (E-FPN) is proposed as a simple and effective mechanism for spreading discriminative low-level features into the final feature output, with the advantage of limiting redundancy. Experiments performed on several benchmarks show the superiority of our approach in terms of Area Under the Curve (AUC) and Average Precision (AP). The code is available at https://github.com/10Ring/LAA-Net. + + + + Rotation-Agnostic Image Representation Learning for Digital Pathology + http://openaccess.thecvf.com//content/CVPR2024/papers/Alfasly_Rotation-Agnostic_Image_Representation_Learning_for_Digital_Pathology_CVPR_2024_paper.pdf + This paper addresses complex challenges in histopathological image analysis through three key contributions. Firstly, it introduces a fast patch selection method, FPS, for whole-slide image (WSI) analysis, significantly reducing computational cost while maintaining accuracy. Secondly, it presents PathDino, a lightweight histopathology feature extractor with a minimal configuration of five Transformer blocks and only ~9 million parameters, markedly fewer than alternatives. Thirdly, it introduces a rotation-agnostic representation learning paradigm using self-supervised learning, effectively mitigating overfitting. We also show that our compact model outperforms existing state-of-the-art histopathology-specific vision transformers on 12 diverse datasets, including both internal datasets spanning four sites (breast, liver, skin and colorectal) and seven public datasets (PANDA, CAMELYON16, BRACS, DigestPath, Kather, PanNuke and WSSS4LUAD). Notably, even with a training dataset of ~6 million histopathology patches from The Cancer Genome Atlas (TCGA), our approach demonstrates an average 8.5% improvement in patch-level majority vote performance. These contributions provide a robust framework for enhancing image analysis in digital pathology, rigorously validated through extensive evaluation. + + + + EASE-DETR: Easing the Competition among Object Queries + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_EASE-DETR_Easing_the_Competition_among_Object_Queries_CVPR_2024_paper.pdf + This paper views DETR's non-duplicate detection ability as a competition result among object queries.
Around each object, there are usually multiple queries, within which only a single one can win the chance to become the final detection. Such a competition is hard: while some competing queries initially have very close prediction scores, their leading query has to dramatically enlarge its score superiority after several decoder layers. To help the leading query stand out, this paper proposes EASE-DETR, which eases the competition by introducing a bias that favours the leading one. EASE-DETR is very simple: in every intermediate decoder layer, we identify the "leading / trailing" relationship between any two queries and encode this binary relationship into the following decoder layer to amplify the superiority of the leading one. More concretely, the leading query is to be protected from mutual query suppression in the self-attention layer and encouraged to absorb more object features in the cross-attention layer, therefore accelerating to win. Experimental results show that EASE-DETR brings consistent and remarkable improvement to various DETRs. + + + + Self-Discovering Interpretable Diffusion Latent Directions for Responsible Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Self-Discovering_Interpretable_Diffusion_Latent_Directions_for_Responsible_Text-to-Image_Generation_CVPR_2024_paper.pdf + Diffusion-based models have gained significant popularity for text-to-image generation due to their exceptional image-generation capabilities. A risk with these models is the potential generation of inappropriate content, such as biased or harmful images. However, the underlying reasons for generating such undesired content from the perspective of the diffusion model's internal representation remain unclear. Previous work interprets vectors in an interpretable latent space of diffusion models as semantic concepts. However, existing approaches cannot discover directions for arbitrary concepts, such as those related to inappropriate concepts. In this work, we propose a novel self-supervised approach to find interpretable latent directions for a given concept. With the discovered vectors, we further propose a simple approach to mitigate inappropriate generation. Extensive experiments have been conducted to verify the effectiveness of our mitigation approach, namely for fair generation, safe generation and responsible text-enhancing generation. Project page: https://interpretdiffusion.github.io. + + + + HiLo: Detailed and Robust 3D Clothed Human Reconstruction with High- and Low-Frequency Information of Parametric Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_HiLo_Detailed_and_Robust_3D_Clothed_Human_Reconstruction_with_High-and_CVPR_2024_paper.pdf + Reconstructing 3D clothed humans involves creating a detailed geometry of individuals in clothing, with applications ranging from virtual try-on and movies to games. To enable practical and widespread applications, recent advances propose to generate a clothed human from an RGB image. However, they struggle to reconstruct detailed and robust avatars simultaneously. We empirically find that the high-frequency (HF) and low-frequency (LF) information from a parametric model has the potential to enhance geometry details and improve robustness to noise, respectively. Based on this, we propose HiLo, namely clothed human reconstruction with high- and low-frequency information, which contains two components.
1) To recover detailed geometry using HF information we propose a progressive HF Signed Distance Function to enhance the detailed 3D geometry of a clothed human. We analyze that our progressive learning manner alleviates large gradients that hinder model convergence. 2) To achieve robust reconstruction against inaccurate estimation of the parametric model by using LF information we propose a spatial interaction implicit function. This function effectively exploits the complementary spatial information from a low-resolution voxel grid of the parametric model. Experimental results demonstrate that HiLo outperforms the state-of-the-art methods by 10.43% and 9.54% in terms of Chamfer distance on the Thuman2.0 and CAPE datasets respectively. Additionally HiLo demonstrates robustness to noise from the parametric model challenging poses and various clothing styles. + + + + Promptable Behaviors: Personalizing Multi-Objective Rewards from Human Preferences + http://openaccess.thecvf.com//content/CVPR2024/papers/Hwang_Promptable_Behaviors_Personalizing_Multi-Objective_Rewards_from_Human_Preferences_CVPR_2024_paper.pdf + Customizing robotic behaviors to be aligned with diverse human preferences is an underexplored challenge in the field of embodied AI. In this paper we present Promptable Behaviors a novel framework that facilitates efficient personalization of robotic agents to diverse human preferences in complex environments. We use multi-objective reinforcement learning to train a single policy adaptable to a broad spectrum of preferences. We introduce three distinct methods to infer human preferences by leveraging different types of interactions: (1) human demonstrations (2) preference feedback on trajectory comparisons and (3) language instructions. We evaluate the proposed method in personalized object-goal navigation and flee navigation tasks in ProcTHOR and RoboTHOR demonstrating the ability to prompt agent behaviors to satisfy human preferences in various scenarios. + + + + Neural Underwater Scene Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Neural_Underwater_Scene_Representation_CVPR_2024_paper.pdf + Among the numerous efforts towards digitally recovering the physical world Neural Radiance Fields (NeRFs) have proved effective in most cases. However underwater scene introduces unique challenges due to the absorbing water medium the local change in lighting and the dynamic contents in the scene. We aim at developing a neural underwater scene representation for these challenges modeling the complex process of attenuation unstable in-scattering and moving objects during light transport. The proposed method can reconstruct the scenes from both established datasets and in-the-wild videos with outstanding fidelity. + + + + Progress-Aware Online Action Segmentation for Egocentric Procedural Task Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Shen_Progress-Aware_Online_Action_Segmentation_for_Egocentric_Procedural_Task_Videos_CVPR_2024_paper.pdf + We address the problem of online action segmentation for egocentric procedural task videos. While previous studies have mostly focused on offline action segmentation where entire videos are available for both training and inference the transition to online action segmentation is crucial for practical applications like AR/VR task assistants. Notably applying an offline-trained model directly to online inference results in a significant performance drop due to the inconsistency between training and inference. 
We propose an online action segmentation framework by first modifying existing architectures to make them causal. Second, we develop a novel action progress prediction module to dynamically estimate the progress of ongoing actions and use these estimates to refine the predictions of causal action segmentation. Third, we propose to learn task graphs from training videos and leverage them to obtain smooth and procedure-consistent segmentations. With the combination of progress and task graphs with causal action segmentation, our framework effectively addresses prediction uncertainty and oversegmentation in online action segmentation and achieves significant improvement on three egocentric datasets. + + + + Constrained Layout Generation with Factor Graphs + http://openaccess.thecvf.com//content/CVPR2024/papers/Dupty_Constrained_Layout_Generation_with_Factor_Graphs_CVPR_2024_paper.pdf + This paper addresses the challenge of object-centric layout generation under spatial constraints, seen in multiple domains including the floorplan design process. The design process typically involves specifying a set of spatial constraints that include object attributes like size and inter-object relations such as relative positioning. Existing works, which typically represent objects as single nodes, lack the granularity to accurately model complex interactions between objects. For instance, often only certain parts of an object, like a room's right wall, interact with adjacent objects. To address this gap, we introduce a factor graph based approach with four latent variable nodes for each room and a factor node for each constraint. The factor nodes represent dependencies among the variables to which they are connected, effectively capturing constraints that are potentially of a higher order. We then develop message-passing on the bipartite graph, forming a factor graph neural network that is trained to produce a floorplan that aligns with the desired requirements. Our approach is simple and generates layouts faithful to the user requirements, demonstrated by a large improvement in IOU scores over existing methods. Additionally, our approach, being inferential and accurate, is well-suited to the practical human-in-the-loop design process where specifications evolve iteratively, offering a practical and powerful tool for AI-guided design. + + + + SLICE: Stabilized LIME for Consistent Explanations for Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Bora_SLICE_Stabilized_LIME_for_Consistent_Explanations_for_Image_Classification_CVPR_2024_paper.pdf + Local Interpretable Model-agnostic Explanations (LIME) is a widely used post-hoc, model-agnostic explainable AI (XAI) technique. It works by training a simple transparent (surrogate) model using random samples drawn around the neighborhood of the instance (image) to be explained (IE). Explanations are then extracted for a black-box model and a given IE using the surrogate model. However, the explanations of LIME suffer from inconsistency across different runs for the same model and the same IE. We identify two main types of inconsistencies: variance in the sign and importance ranks of the segments (superpixels). These factors hinder LIME from obtaining consistent explanations. We analyze these inconsistencies and propose a new method, Stabilized LIME for Consistent Explanations (SLICE).
The proposed method handles the stabilization problem in two aspects: using a novel feature selection technique to eliminate spurious superpixels and an adaptive perturbation technique to generate perturbed images in the neighborhood of IE. Our results demonstrate that the explanations from SLICE exhibit significantly better consistency and fidelity than LIME (and its variant BayLime). + + + + Anomaly Heterogeneity Learning for Open-set Supervised Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Anomaly_Heterogeneity_Learning_for_Open-set_Supervised_Anomaly_Detection_CVPR_2024_paper.pdf + Open-set supervised anomaly detection (OSAD) - a recently emerging anomaly detection area - aims at utilizing a few samples of anomaly classes seen during training to detect unseen anomalies (i.e. samples from open-set anomaly classes) while effectively identifying the seen anomalies. Benefiting from the prior knowledge illustrated by the seen anomalies current OSAD methods can often largely reduce false positive errors. However these methods are trained in a closed-set setting and treat the anomaly examples as from a homogeneous distribution rendering them less effective in generalizing to unseen anomalies that can be drawn from any distribution. This paper proposes to learn heterogeneous anomaly distributions using the limited anomaly examples to address this issue. To this end we introduce a novel approach namely Anomaly Heterogeneity Learning (AHL) that simulates a diverse set of heterogeneous anomaly distributions and then utilizes them to learn a unified heterogeneous abnormality model in surrogate open-set environments. Further AHL is a generic framework that existing OSAD models can plug and play for enhancing their abnormality modeling. Extensive experiments on nine real-world anomaly detection datasets show that AHL can 1) substantially enhance different state-of-the-art OSAD models in detecting seen and unseen anomalies and 2) effectively generalize to unseen anomalies in new domains. Code is available at https://github.com/mala-lab/AHL. + + + + Revisiting Counterfactual Problems in Referring Expression Comprehension + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Revisiting_Counterfactual_Problems_in_Referring_Expression_Comprehension_CVPR_2024_paper.pdf + Traditional referring expression comprehension (REC) aims to locate the target referent in an image guided by a text query. Several previous methods have studied on the Counterfactual problem in REC (C-REC) where the objects for a given query cannot be found in the image. However these methods focus on the overall image-text or specific attribute mismatch only. In this paper we address the C-REC problem from a deep perspective of fine-grained attributes. To this aim we first propose a fine-grained counterfactual sample generation method to construct C-REC datasets. Specifically we leverage pre-trained language model such as BERT to modify the attribute words in the queries obtaining the corresponding counterfactual samples. Furthermore we propose a C-REC framework. We first adopt three encoders to extract image text and attribute features. Then our dual-branch attentive fusion module fuses these cross-modal features with two branches by an attention mechanism. At last two prediction heads generate a bounding box and a counterfactual label respectively. In addition we incorporate contrastive learning with the generated counterfactual samples as negatives to enhance the counterfactual perception. 
Extensive experiments show that our framework achieves promising performance on both public REC datasets RefCOCO/+/g and our constructed C-REC datasets C-RefCOCO/+/g. The code and data are available at https://github.com/Glacier0012/CREC. + + + + Compressed 3D Gaussian Splatting for Accelerated Novel View Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Niedermayr_Compressed_3D_Gaussian_Splatting_for_Accelerated_Novel_View_Synthesis_CVPR_2024_paper.pdf + Recently high-fidelity scene reconstruction with an optimized 3D Gaussian splat representation has been introduced for novel view synthesis from sparse image sets. Making such representations suitable for applications like network streaming and rendering on low-power devices requires significantly reduced memory consumption as well as improved rendering efficiency. We propose a compressed 3D Gaussian splat representation that utilizes sensitivity-aware vector clustering with quantization-aware training to compress directional colors and Gaussian parameters. The learned codebooks have low bitrates and achieve a compression rate of up to 31x on real-world scenes with only minimal degradation of visual quality. We demonstrate that the compressed splat representation can be efficiently rendered with hardware rasterization on lightweight GPUs at up to 4x higher framerates than reported via an optimized GPU compute pipeline. Extensive experiments across multiple datasets demonstrate the robustness and rendering speed of the proposed approach. + + + + Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language + http://openaccess.thecvf.com//content/CVPR2024/papers/Hamilton_Separating_the_Chirp_from_the_Chat_Self-supervised_Visual_Grounding_of_CVPR_2024_paper.pdf + We present DenseAV a novel dual encoder grounding architecture that learns high-resolution semantically meaningful and audio-visual aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast many other systems that learn "global" audio and video representations cannot localize words and sound. Finally we contribute two new datasets to improve the evaluation of AV representations through speech and sound prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech and sound prompted semantic segmentation. DenseAV outperforms the current state-of-the-art ImageBind on cross-modal retrieval using fewer than half of the parameters. Project Page: https://aka.ms/denseav + + + + MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/He_MA-LMM_Memory-Augmented_Large_Multimodal_Model_for_Long-Term_Video_Understanding_CVPR_2024_paper.pdf + With the success of large language models (LLMs) integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However existing LLM-based large multimodal models (e.g. Video-LLaMA VideoChat) can only take in a limited number of frames for short video understanding.
In this study we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks such as long-video understanding video question answering and video captioning and our model can achieve state-of-the-art performances across multiple datasets. + + + + Dr2Net: Dynamic Reversible Dual-Residual Networks for Memory-Efficient Finetuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Dr2Net_Dynamic_Reversible_Dual-Residual_Networks_for_Memory-Efficient_Finetuning_CVPR_2024_paper.pdf + Large pretrained models are increasingly crucial in modern computer vision tasks. These models are typically used in downstream tasks by end-to-end finetuning which is highly memory-intensive for tasks with high-resolution data e.g. video understanding small object detection and point cloud analysis. In this paper we propose Dynamic Reversible Dual-Residual Networks or Dr2Net a novel family of network architectures that acts as a surrogate network to finetune a pretrained model with substantially reduced memory consumption. Dr2Net contains two types of residual connections one maintaining the residual structure in the pretrained models and the other making the network reversible. Due to its reversibility intermediate activations which can be reconstructed from output are cleared from memory during training. We use two coefficients on either type of residual connections respectively and introduce a dynamic training strategy that seamlessly transitions the pretrained model to a reversible network with much higher numerical precision. We evaluate Dr2Net on various pretrained models and various tasks and show that it can reach comparable performance to conventional finetuning but with significantly less memory usage. + + + + PNeRV: Enhancing Spatial Consistency via Pyramidal Neural Representation for Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_PNeRV_Enhancing_Spatial_Consistency_via_Pyramidal_Neural_Representation_for_Videos_CVPR_2024_paper.pdf + The primary focus of Neural Representation for Videos (NeRV) is to effectively model its spatiotemporal consistency. However current NeRV systems often face a significant issue of spatial inconsistency leading to decreased perceptual quality. To address this issue we introduce the Pyramidal Neural Representation for Videos (PNeRV) which is built on a multi-scale information connection and comprises a lightweight rescaling operator Kronecker Fully-connected layer (KFc) and a Benign Selective Memory (BSM) mechanism. The KFc inspired by the tensor decomposition of the vanilla Fully-connected layer facilitates low-cost rescaling and global correlation modeling. BSM merges high-level features with granular ones adaptively. Furthermore we provide an analysis based on the Universal Approximation Theory of the NeRV system and validate the effectiveness of the proposed PNeRV. 
We conducted comprehensive experiments to demonstrate that PNeRV surpasses the performance of contemporary NeRV models achieving the best results in video regression on UVG and DAVIS under various metrics (PSNR SSIM LPIPS and FVD). Compared to vanilla NeRV PNeRV achieves a +4.49 dB gain in PSNR and a 231% increase in FVD on UVG along with a +3.28 dB PSNR and 634% FVD increase on DAVIS. + + + + Point Transformer V3: Simpler Faster Stronger + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Point_Transformer_V3_Simpler_Faster_Stronger_CVPR_2024_paper.pdf + This paper is not motivated to seek innovation within the attention mechanism. Instead it focuses on overcoming the existing trade-offs between accuracy and efficiency within the context of point cloud processing leveraging the power of scale. Drawing inspiration from recent advances in 3D large-scale representation learning we recognize that model performance is more influenced by scale than by intricate design. Therefore we present Point Transformer V3 (PTv3) which prioritizes simplicity and efficiency over the accuracy of certain mechanisms that are minor to the overall performance after scaling such as replacing the precise neighbor search by KNN with an efficient serialized neighbor mapping of point clouds organized with specific patterns. This principle enables significant scaling expanding the receptive field from 16 to 1024 points while remaining efficient (a 3x increase in processing speed and a 10x improvement in memory efficiency compared with its predecessor PTv2). PTv3 attains state-of-the-art results on over 20 downstream tasks that span both indoor and outdoor scenarios. Further enhanced with multi-dataset joint training PTv3 pushes these results to a higher level. + + + + Mask4Align: Aligned Entity Prompting with Color Masks for Multi-Entity Localization Problems + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Mask4Align_Aligned_Entity_Prompting_with_Color_Masks_for_Multi-Entity_Localization_CVPR_2024_paper.pdf + In Visual Question Answering (VQA) recognizing and localizing entities pose significant challenges. Pretrained vision-and-language models have addressed this problem by providing a text description as the answer. However in visual scenes with multiple entities textual descriptions struggle to distinguish the entities from the same category effectively. Consequently the VQA dataset is limited by the limitations of text description and cannot adequately cover scenarios involving multiple entities. To address this challenge we introduce a Mask for Align (Mask4Align) method which can determine the entity's position in the given image that best matches the user-input question. This method incorporates colored masks into the image enabling the VQA model to handle discrimination and localization challenges associated with multiple entities. To process an arbitrary number of similar entities Mask4Align is designed hierarchically to discern subtle differences achieving precise localization. Since Mask4Align directly utilizes pre-trained models it does not introduce additional training overhead. Extensive experiments conducted on both the gaze target prediction task dataset and our proposed multi-entity localization dataset showcase the superiority of Mask4Align. 
+ + + + RCL: Reliable Continual Learning for Unified Failure Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_RCL_Reliable_Continual_Learning_for_Unified_Failure_Detection_CVPR_2024_paper.pdf + Deep neural networks are known to be overconfident for what they don't know in the wild which is undesirable for decision-making in high-stakes applications. Despite quantities of existing works most of them focus on detecting out-of-distribution (OOD) samples from unseen classes while ignoring large parts of relevant failure sources like misclassified samples from known classes. In particular recent studies reveal that prevalent OOD detection methods are actually harmful for misclassification detection (MisD) indicating that there seems to be a tradeoff between those two tasks. In this paper we study the critical yet under-explored problem of unified failure detection which aims to detect both misclassified and OOD examples. Concretely we identify the failure of simply integrating learning objectives of misclassification and OOD detection and show the potential of sequence learning. Inspired by this we propose a reliable continual learning paradigm whose spirit is to equip the model with MisD ability first and then improve the OOD detection ability without degrading the already adequate MisD performance. Extensive experiments demonstrate that our method achieves strong unified failure detection performance. The code is available at https://github.com/Impression2805/RCL. + + + + Referring Image Editing: Object-level Image Editing via Referring Expressions + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Referring_Image_Editing_Object-level_Image_Editing_via_Referring_Expressions_CVPR_2024_paper.pdf + Significant advancements have been made in image editing with the recent advance of the Diffusion model. However most of the current methods primarily focus on global or subject-level modifications and often face limitations when it comes to editing specific objects when there are other objects coexisting in the scene given solely textual prompts. In response to this challenge we introduce an object-level generative task called Referring Image Editing (RIE) which enables the identification and editing of specific source objects in an image using text prompts. To tackle this task effectively we propose a tailored framework called ReferDiffusion. It aims to disentangle input prompts into multiple embeddings and employs a mixed-supervised multi-stage training strategy. To facilitate further research in this domain we introduce the RefCOCO-Edit dataset comprising images editing prompts source object segmentation masks and reference edited images for training and evaluation. Our extensive experiments demonstrate the effectiveness of our approach in identifying and editing target objects while conventional general image editing and region-based image editing methods have difficulties in this challenging task. + + + + Unsupervised Video Domain Adaptation with Masked Pre-Training and Collaborative Self-Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Reddy_Unsupervised_Video_Domain_Adaptation_with_Masked_Pre-Training_and_Collaborative_Self-Training_CVPR_2024_paper.pdf + In this work we tackle the problem of unsupervised domain adaptation (UDA) for video action recognition. Our approach which we call UNITE uses an image teacher model to adapt a video student model to the target domain. 
UNITE first employs self-supervised pre-training to promote discriminative feature learning on target domain videos using a teacher-guided masked distillation objective. We then perform self-training on masked target data using the video student model and image teacher model together to generate improved pseudolabels for unlabeled target videos. Our self-training process successfully leverages the strengths of both models to achieve strong transfer performance across domains. We evaluate our approach on multiple video domain adaptation benchmarks and observe significant improvements upon previously reported results. + + + + UniDepth: Universal Monocular Metric Depth Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Piccinelli_UniDepth_Universal_Monocular_Metric_Depth_Estimation_CVPR_2024_paper.pdf + Accurate monocular metric depth estimation (MMDE) is crucial to solving downstream tasks in 3D perception and modeling. However the remarkable accuracy of recent MMDE methods is confined to their training domains. These methods fail to generalize to unseen domains even in the presence of moderate domain gaps which hinders their practical applicability. We propose a new model UniDepth capable of reconstructing metric 3D scenes from solely single images across domains. Departing from the existing MMDE methods UniDepth directly predicts metric 3D points from the input image at inference time without any additional information striving for a universal and flexible MMDE solution. In particular UniDepth implements a self-promptable camera module predicting dense camera representation to condition depth features. Our model exploits a pseudo-spherical output representation which disentangles camera and depth representations. In addition we propose a geometric invariance loss that promotes the invariance of camera-prompted depth features. Thorough evaluations on ten datasets in a zero-shot regime consistently demonstrate the superior performance of UniDepth even when compared with methods directly trained on the testing domains. Code and models are available at: github.com/lpiccinelli-eth/unidepth + + + + NeuRAD: Neural Rendering for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Tonderski_NeuRAD_Neural_Rendering_for_Autonomous_Driving_CVPR_2024_paper.pdf + Neural radiance fields (NeRFs) have gained popularity in the autonomous driving (AD) community. Recent methods show NeRFs' potential for closed-loop simulation enabling testing of AD systems and as an advanced training data augmentation technique. However existing methods often require long training times dense semantic supervision or lack generalizability. This in turn hinders the application of NeRFs for AD at scale. In this paper we propose NeuRAD a robust novel view synthesis method tailored to dynamic AD data. Our method features simple network design extensive sensor modeling for both camera and lidar -- including rolling shutter beam divergence and ray dropping -- and is applicable to multiple datasets out of the box. We verify its performance on five popular AD datasets achieving state-of-the-art performance across the board. To encourage further development we openly release the NeuRAD source code at https://github.com/georghess/NeuRAD.
+ + + + Bootstrapping Chest CT Image Understanding by Distilling Knowledge from X-ray Expert Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Bootstrapping_Chest_CT_Image_Understanding_by_Distilling_Knowledge_from_X-ray_CVPR_2024_paper.pdf + Radiologists highly desire fully automated versatile AI for medical imaging interpretation. However the lack of extensively annotated large-scale multi-disease datasets has hindered the achievement of this goal. In this paper we explore the feasibility of leveraging language as a naturally high-quality supervision for chest CT imaging. In light of the limited availability of image-report pairs we bootstrap the understanding of 3D chest CT images by distilling chest-related diagnostic knowledge from an extensively pre-trained 2D X-ray expert model. Specifically we propose a language-guided retrieval method to match each 3D CT image with its semantically closest 2D X-ray image and perform pair-wise and semantic relation knowledge distillation. Subsequently we use contrastive learning to align images and reports within the same patient while distinguishing them from the other patients. However the challenge arises when patients have similar semantic diagnoses such as healthy patients potentially confusing if treated as negatives. We introduce a robust contrastive learning that identifies and corrects these false negatives. We train our model with over 12K pairs of chest CT images and radiology reports. Extensive experiments across multiple scenarios including zero-shot learning report generation and fine-tuning processes demonstrate the model's feasibility in interpreting chest CT images. + + + + Magic Tokens: Select Diverse Tokens for Multi-modal Object Re-Identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Magic_Tokens_Select_Diverse_Tokens_for_Multi-modal_Object_Re-Identification_CVPR_2024_paper.pdf + Single-modal object re-identification (ReID) faces great challenges in maintaining robustness within complex visual scenarios. In contrast multi-modal object ReID utilizes complementary information from diverse modalities showing great potentials for practical applications. However previous methods may be easily affected by irrelevant backgrounds and usually ignore the modality gaps. To address above issues we propose a novel learning framework named EDITOR to select diverse tokens from vision Transformers for multi-modal object ReID. We begin with a shared vision Transformer to extract tokenized features from different input modalities. Then we introduce a Spatial-Frequency Token Selection (SFTS) module to adaptively select object-centric tokens with both spatial and frequency information. Afterwards we employ a Hierarchical Masked Aggregation (HMA) module to facilitate feature interactions within and across modalities. Finally to further reduce the effect of backgrounds we propose a Background Consistency Constraint (BCC) and an Object-Centric Feature Refinement (OCFR). They are formulated as two new loss functions which improve the feature discrimination with background suppression. As a result our framework can generate more discriminative features for multi-modal object ReID. Extensive experiments on three multi-modal ReID benchmarks verify the effectiveness of our methods. The code is available at https://github.com/924973292/EDITOR. 
+ + + + SignGraph: A Sign Sequence is Worth Graphs of Nodes + http://openaccess.thecvf.com//content/CVPR2024/papers/Gan_SignGraph_A_Sign_Sequence_is_Worth_Graphs_of_Nodes_CVPR_2024_paper.pdf + Despite the recent success of sign language research the widely adopted CNN-based backbones are mainly migrated from other computer vision tasks in which the contours and texture of objects are crucial for identifying objects. They usually treat sign frames as grids and may fail to capture effective cross-region features. In fact sign language tasks need to focus on the correlation of different regions in one frame and the interaction of different regions among adjacent frames for identifying a sign sequence. In this paper we propose to represent a sign sequence as graphs and introduce a simple yet effective graph-based sign language processing architecture named SignGraph to extract cross-region features at the graph level. SignGraph consists of two basic modules: Local Sign Graph (LSG) module for learning the correlation of intra-frame cross-region features in one frame and Temporal Sign Graph (TSG) module for tracking the interaction of inter-frame cross-region features among adjacent frames. With LSG and TSG we build our model in a multiscale manner to ensure that the representation of nodes can capture cross-region features at different granularities. Extensive experiments on current public sign language datasets demonstrate the superiority of our SignGraph model. Our model achieves very competitive performances with the SOTA model while not using any extra cues. Code and models are available at: https://github.com/gswycf/SignGraph. + + + + DeconfuseTrack: Dealing with Confusion for Multi-Object Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_DeconfuseTrack_Dealing_with_Confusion_for_Multi-Object_Tracking_CVPR_2024_paper.pdf + Accurate data association is crucial in reducing confusion such as ID switches and assignment errors in multi-object tracking (MOT). However existing advanced methods often overlook the diversity among trajectories and the ambiguity and conflicts present in motion and appearance cues leading to confusion among detections trajectories and associations when performing simple global data association. To address this issue we propose a simple versatile and highly interpretable data association approach called Decomposed Data Association (DDA). DDA decomposes the traditional association problem into multiple sub-problems using a series of non-learning-based modules and selectively addresses the confusion in each sub-problem by incorporating targeted exploitation of new cues. Additionally we introduce Occlusion-aware Non-Maximum Suppression (ONMS) to retain more occluded detections thereby increasing opportunities for association with trajectories and indirectly reducing the confusion caused by missed detections. Finally based on DDA and ONMS we design a powerful multi-object tracker named DeconfuseTrack specifically focused on resolving confusion in MOT. Extensive experiments conducted on the MOT17 and MOT20 datasets demonstrate that our proposed DDA and ONMS significantly enhance the performance of several popular trackers. Moreover DeconfuseTrack achieves state-of-the-art performance on the MOT17 and MOT20 test sets significantly outperforms the baseline tracker ByteTrack in metrics such as HOTA IDF1 AssA. This validates that our tracking design effectively reduces confusion caused by simple global association. 
+ + + + HIMap: HybrId Representation Learning for End-to-end Vectorized HD Map Construction + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_HIMap_HybrId_Representation_Learning_for_End-to-end_Vectorized_HD_Map_Construction_CVPR_2024_paper.pdf + Vectorized High-Definition (HD) map construction requires predictions of the category and point coordinates of map elements (e.g. road boundary lane divider pedestrian crossing etc.). State-of-the-art methods are mainly based on point-level representation learning for regressing accurate point coordinates. However this pipeline has limitations in obtaining element-level information and handling element-level failures e.g. erroneous element shape or entanglement between elements. To tackle the above issues we propose a simple yet effective HybrId framework named HIMap to sufficiently learn and interact both point-level and element-level information. Concretely we introduce a hybrid representation called HIQuery to represent all map elements and propose a point-element interactor to interactively extract and encode the hybrid information of elements e.g. point position and element shape into the HIQuery. Additionally we present a point-element consistency constraint to enhance the consistency between the point-level and element-level information. Finally the output point-element integrated HIQuery can be directly converted into map elements' class point coordinates and mask. We conduct extensive experiments and consistently outperform previous methods on both nuScenes and Argoverse2 datasets. Notably our method achieves 77.8 mAP on the nuScenes dataset remarkably superior to previous SOTAs by 8.3 mAP at least. + + + + Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Unleashing_Unlabeled_Data_A_Paradigm_for_Cross-View_Geo-Localization_CVPR_2024_paper.pdf + This paper investigates the effective utilization of unlabeled data for large-area cross-view geo-localization (CVGL) encompassing both unsupervised and semi-supervised settings. Common approaches to CVGL rely on ground-satellite image pairs and employ label-driven supervised training. However the cost of collecting precise cross-view image pairs hinders the deployment of CVGL in real-life scenarios. Without the pairs CVGL will be more challenging to handle the significant imaging and spatial gaps between ground and satellite images. To this end we propose an unsupervised framework including a cross-view projection to guide the model for retrieving initial pseudo-labels and a fast re-ranking mechanism to refine the pseudo-labels by leveraging the fact that "the perfectly paired ground-satellite image is located in a unique and identical scene". The framework exhibits competitive performance compared with supervised works on three open-source benchmarks. Our code and models will be released on https://github.com/liguopeng0923/UCVGL. + + + + PanoOcc: Unified Occupancy Representation for Camera-based 3D Panoptic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_PanoOcc_Unified_Occupancy_Representation_for_Camera-based_3D_Panoptic_Segmentation_CVPR_2024_paper.pdf + Comprehensive modeling of the surrounding 3D world is crucial for the success of autonomous driving. However existing perception tasks like object detection road structure segmentation depth & elevation estimation and open-set object localization each only focus on a small facet of the holistic 3D scene understanding task. 
This divide-and-conquer strategy simplifies the algorithm development process but comes at the cost of losing an end-to-end unified solution to the problem. In this work we address this limitation by studying camera-based 3D panoptic segmentation aiming to achieve a unified occupancy representation for camera-only 3D scene understanding. To achieve this we introduce a novel method called PanoOcc which utilizes voxel queries to aggregate spatiotemporal information from multi-frame and multi-view images in a coarse-to-fine scheme integrating feature learning and scene representation into a unified occupancy representation. We have conducted extensive ablation studies to validate the effectiveness and efficiency of the proposed method. Our approach achieves new state-of-the-art results for camera-based semantic segmentation and panoptic segmentation on the nuScenes dataset. Furthermore our method can be easily extended to dense occupancy prediction and has demonstrated promising performance on the Occ3D benchmark. The code will be made available at https://github.com/Robertwyq/PanoOcc. + + + + Sparse Views Near Light: A Practical Paradigm for Uncalibrated Point-light Photometric Stereo + http://openaccess.thecvf.com//content/CVPR2024/papers/Brahimi_Sparse_Views_Near_Light_A_Practical_Paradigm_for_Uncalibrated_Point-light_CVPR_2024_paper.pdf + Neural approaches have shown a significant progress on camera-based reconstruction. But they require either a fairly dense sampling of the viewing sphere or pre-training on an existing dataset thereby limiting their generalizability. In contrast photometric stereo (PS) approaches have shown great potential for achieving high-quality reconstruction under sparse viewpoints. Yet they are impractical because they typically require tedious laboratory conditions are restricted to dark rooms and often multi-staged making them subject to accumulated errors. To address these shortcomings we propose an end-to-end uncalibrated multi-view PS framework for reconstructing high-resolution shapes acquired from sparse viewpoints in a real-world environment. We relax the dark room assumption and allow a combination of static ambient lighting and dynamic near LED lighting thereby enabling easy data capture outside the lab. Experimental validation confirms that it outperforms existing baseline approaches in the regime of sparse viewpoints by a large margin. This allows to bring high accuracy 3D reconstruction from the dark room to the real world while maintaining a reasonable data capture complexity. + + + + LQMFormer: Language-aware Query Mask Transformer for Referring Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Shah_LQMFormer_Language-aware_Query_Mask_Transformer_for_Referring_Image_Segmentation_CVPR_2024_paper.pdf + Referring Image Segmentation (RIS) aims to segment objects from an image based on a language description. Recent advancements have introduced transformer-based methods that leverage cross-modal dependencies significantly enhancing performance in referring segmentation tasks. These methods are designed such that each query predicts different masks. However RIS inherently requires a single-mask prediction leading to a phenomenon known as Query Collapse where all queries yield the same mask prediction. This reduces the generalization capability of the RIS model for complex or novel scenarios. 
To address this issue we propose a Multi-modal Query Feature Fusion technique characterized by two innovative designs: (1) Gaussian enhanced Multi-Modal Fusion a novel visual grounding mechanism that enhances overall representation by extracting rich local visual information and global visual-linguistic relationships and (2) A Dynamic Query Module that produces a diverse set of queries through a scoring network where the network selectively focuses on queries for objects referred to in the language description. Moreover we show that including an auxiliary loss to increase the distance between mask representations of different queries further enhances performance and mitigates query collapse. Extensive experiments conducted on four benchmark datasets validate the effectiveness of our framework. + + + + Omni-Q: Omni-Directional Scene Understanding for Unsupervised Visual Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Omni-Q_Omni-Directional_Scene_Understanding_for_Unsupervised_Visual_Grounding_CVPR_2024_paper.pdf + Unsupervised visual grounding methods alleviate the issue of expensive manual annotation of image-query pairs by generating pseudo-queries. However existing methods are prone to confusing the spatial relationships between objects and rely on designing complex prompt modules to generate query texts which severely impedes the ability to generate accurate and comprehensive queries due to ambiguous spatial relationships and manually-defined fixed templates. To tackle these challenges we propose a omni-directional language query generation approach for unsupervised visual grounding named Omni-Q. Specifically we develop a 3D spatial relation module to extend the 2D spatial representation to 3D thereby utilizing 3D location information to accurately determine the spatial position among objects. Besides we introduce a spatial graph module leveraging the power of graph structures to establish accurate and diverse object relationships and thus enhancing the flexibility of query generation. Extensive experiments on five public benchmark datasets demonstrate that our method significantly outperforms existing state-of-the-art unsupervised methods by up to 16.17%. In addition when applied in the supervised setting our method can freely save up to 60% human annotations without a loss of performance. + + + + VISTA-LLAMA: Reducing Hallucination in Video Language Models via Equal Distance to Visual Tokens + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_VISTA-LLAMA_Reducing_Hallucination_in_Video_Language_Models_via_Equal_Distance_CVPR_2024_paper.pdf + Recent advances in large video-language models have displayed promising outcomes in video comprehension. Current approaches straightforwardly convert video into language tokens and employ large language models for multi-modal tasks. However this method often leads to the generation of irrelevant content commonly known as "hallucination" as the length of the text increases and the impact of the video diminishes. To address this problem we propose Vista-LLaMA a novel framework that maintains the consistent distance between all visual tokens and any language tokens irrespective of the generated text length. Vista-LLaMA omits relative position encoding when determining attention weights between visual and text tokens retaining the position encoding for text and text tokens. This amplifies the effect of visual tokens on text generation especially when the relative distance is longer between visual and text tokens. 
The proposed attention mechanism significantly reduces the chance of producing irrelevant text related to the video content. Furthermore we present a sequential visual projector that projects the current video frame into tokens of language space with the assistance of the previous frame. This approach not only captures the temporal relationship within the video but also allows fewer visual tokens to encompass the entire video. Our approach significantly outperforms various previous methods (e.g. Video-ChatGPT MovieChat) on four challenging open-ended video question answering benchmarks. We reach an accuracy of 60.7 on the zero-shot NExT-QA and 60.5 on the zero-shot MSRVTT-QA setting a new state-of-the-art performance. + + + + Efficient Multitask Dense Predictor via Binarization + http://openaccess.thecvf.com//content/CVPR2024/papers/Shang_Efficient_Multitask_Dense_Predictor_via_Binarization_CVPR_2024_paper.pdf + Multi-task learning for dense prediction has emerged as a pivotal area in computer vision enabling simultaneous processing of diverse yet interrelated pixel-wise prediction tasks. However the substantial computational demands of state-of-the-art (SoTA) models often limit their widespread deployment. This paper addresses this challenge by introducing network binarization to compress resource-intensive multi-task dense predictors. Specifically our goal is to significantly accelerate multi-task dense prediction models via Binary Neural Networks (BNNs) while maintaining and even improving model performance at the same time. To reach this goal we propose a Binary Multi-task Dense Predictor Bi-MTDP and several variants of Bi-MTDP in which a multi-task dense predictor is constructed via specified binarized modules. Our systematic analysis of this predictor reveals that the performance drop from binarization is primarily caused by severe information degradation. To address this issue we introduce a deep information bottleneck layer that enforces representations for downstream tasks satisfying Gaussian distribution in forward propagation. Moreover we introduce a knowledge distillation mechanism to correct the direction of information flow in backward propagation. Intriguingly one variant of Bi-MTDP outperforms full-precision (FP) multi-task dense prediction SoTAs ARTC (CNN-based) and InvPT (ViT-based). This result indicates that Bi-MTDP is not merely a naive trade-off between performance and efficiency but is rather a benefit of the redundant information flow thanks to the multi-task architecture. + + + + Jointly Training and Pruning CNNs via Learnable Agent Guidance and Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Ganjdanesh_Jointly_Training_and_Pruning_CNNs_via_Learnable_Agent_Guidance_and_CVPR_2024_paper.pdf + Structural model pruning is a prominent approach used for reducing the computational cost of Convolutional Neural Networks (CNNs) before their deployment on resource-constrained devices. Yet the majority of proposed ideas require a pretrained model before pruning which is costly to secure. In this paper we propose a novel structural pruning approach to jointly learn the weights and structurally prune architectures of CNN models. The core element of our method is a Reinforcement Learning (RL) agent whose actions determine the pruning ratios of the CNN model's layers and the resulting model's accuracy serves as its reward.
We conduct the joint training and pruning by iteratively training the model's weights and the agent's policy and we regularize the model's weights to align with the selected structure by the agent. The evolving model's weights result in a dynamic reward function for the agent which prevents using prominent episodic RL methods with stationary environment assumption for our purpose. We address this challenge by designing a mechanism to model the complex changing dynamics of the reward function and provide a representation of it to the RL agent. To do so we take a learnable embedding for each training epoch and employ a recurrent model to calculate a representation of the changing environment. We train the recurrent model and embeddings using a decoder model to reconstruct observed rewards. Such a design empowers our agent to effectively leverage episodic observations along with the environment representations to learn a proper policy to determine performant sub-networks of the CNN model. Our extensive experiments on CIFAR-10 and ImageNet using ResNets and MobileNets demonstrate the effectiveness of our method. + + + + Diffusion-EDFs: Bi-equivariant Denoising Generative Modeling on SE(3) for Visual Robotic Manipulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ryu_Diffusion-EDFs_Bi-equivariant_Denoising_Generative_Modeling_on_SE3_for_Visual_Robotic_CVPR_2024_paper.pdf + Diffusion generative modeling has become a promising approach for learning robotic manipulation tasks from stochastic human demonstrations. In this paper we present Diffusion-EDFs a novel SE(3)-equivariant diffusion-based approach for visual robotic manipulation tasks. We show that our proposed method achieves remarkable data efficiency requiring only 5 to 10 human demonstrations for effective end-to-end training in less than an hour. Furthermore our benchmark experiments demonstrate that our approach has superior generalizability and robustness compared to state-of-the-art methods. Lastly we validate our methods with real hardware experiments. + + + + Contrasting Intra-Modal and Ranking Cross-Modal Hard Negatives to Enhance Visio-Linguistic Compositional Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Contrasting_Intra-Modal_and_Ranking_Cross-Modal_Hard_Negatives_to_Enhance_Visio-Linguistic_CVPR_2024_paper.pdf + Vision-Language Models (VLMs) such as CLIP exhibit strong image-text comprehension abilities facilitating advances in several downstream tasks such as zero-shot image classification image-text retrieval and text-to-image generation. However the compositional reasoning abilities of existing VLMs remains subpar. The root of this limitation lies in the inadequate alignment between the images and captions in the pretraining datasets. Additionally the current contrastive learning objective fails to focus on fine-grained grounding components like relations actions and attributes resulting in "bag-of-words" representations. We introduce a simple and effective method to improve compositional reasoning in VLMs. Our method better leverages available datasets by refining and expanding the standard image-text contrastive learning framework. Our approach does not require specific annotations and does not incur extra parameters. When integrated with CLIP our technique yields notable improvement over state-of-the-art baselines across five vision-language compositional benchmarks. 
 + + + + CMA: A Chromaticity Map Adapter for Robust Detection of Screen-Recapture Document Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_CMA_A_Chromaticity_Map_Adapter_for_Robust_Detection_of_Screen-Recapture_CVPR_2024_paper.pdf + The rebroadcasting of screen-recaptured document images introduces a significant risk to the confidential documents processed in government departments and commercial companies. However detecting recaptured document images subjected to distortions from online social networks (OSNs) is challenging since the common forensics cues such as moiré pattern are weakened during transmission. In this work we first devise a pixel-level distortion model of the screen-recaptured document image to identify the robust features of color artifacts. Then we extract a chromaticity map from the recaptured image to highlight the presence of color artifacts even under low-quality samples. Based on the prior understanding we design a chromaticity map adapter (CMA) to efficiently extract the chromaticity map and feed it into the transformer backbone as multi-modal prompt tokens. To evaluate the performance of the proposed method we collect a recaptured office document image dataset with over 10K diverse samples. Experimental results demonstrate that the proposed CMA method outperforms a SOTA approach (with RGB modality only) reducing the average EER from 26.82% to 16.78%. Robustness evaluation shows that our method achieves 0.8688 and 0.7554 AUCs under samples with JPEG compression (QF=70) and resolution as low as 534x503 pixels. + + + + VA3: Virtually Assured Amplification Attack on Probabilistic Copyright Protection for Text-to-Image Generative Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_VA3_Virtually_Assured_Amplification_Attack_on_Probabilistic_Copyright_Protection_for_CVPR_2024_paper.pdf + The booming use of text-to-image generative models has raised concerns about their high risk of producing copyright-infringing content. While probabilistic copyright protection methods provide a probabilistic guarantee against such infringement in this paper we introduce Virtually Assured Amplification Attack (VA3) a novel online attack framework that exposes the vulnerabilities of these protection mechanisms. The proposed framework significantly amplifies the probability of generating infringing content over sustained interactions with generative models and provides a non-trivial lower-bound on the success probability of each engagement. Our theoretical and experimental results demonstrate the effectiveness of our approach under various scenarios. These findings highlight the potential risk of implementing probabilistic copyright protection in practical applications of text-to-image generative models. Code is available at https://github.com/South7X/VA3. + + + + Light the Night: A Multi-Condition Diffusion Framework for Unpaired Low-Light Enhancement in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Light_the_Night_A_Multi-Condition_Diffusion_Framework_for_Unpaired_Low-Light_CVPR_2024_paper.pdf + Vision-centric perception systems for autonomous driving have gained considerable attention recently due to their cost-effectiveness and scalability especially compared to LiDAR-based systems. However these systems often struggle in low-light conditions potentially compromising their performance and safety.
To address this our paper introduces LightDiff a domain-tailored framework designed to enhance the low-light image quality for autonomous driving applications. Specifically we employ a multi-condition controlled diffusion model. LightDiff works without any human-collected paired data leveraging a dynamic data degradation process instead. It incorporates a novel multi-condition adapter that adaptively controls the input weights from different modalities including depth maps RGB images and text captions to effectively illuminate dark scenes while maintaining context consistency. Furthermore to align the enhanced images with the detection model's knowledge LightDiff employs perception-specific scores as rewards to guide the diffusion training process through reinforcement learning. Extensive experiments on the nuScenes datasets demonstrate that LightDiff can significantly improve the performance of several state-of-the-art 3D detectors in night-time conditions while achieving high visual quality scores highlighting its potential to safeguard autonomous driving. + + + + Delving into the Trajectory Long-tail Distribution for Muti-object Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Delving_into_the_Trajectory_Long-tail_Distribution_for_Muti-object_Tracking_CVPR_2024_paper.pdf + Multiple Object Tracking (MOT) is a critical area within computer vision with a broad spectrum of practical implementations. Current research has primarily focused on the development of tracking algorithms and enhancement of post-processing techniques. Yet there has been a lack of thorough examination concerning the nature of tracking data it self. In this study we pioneer an exploration into the distribution patterns of tracking data and identify a pronounced long-tail distribution issue within existing MOT datasets. We note a significant imbalance in the distribution of trajectory lengths across different pedestrians a phenomenon we refer to as "pedestrians trajectory long-tail distribution". Addressing this challenge we introduce a bespoke strategy designed to mitigate the effects of this skewed distribution. Specifically we propose two data augmentation strategies including Stationary Camera View Data Augmentation (SVA) and Dynamic Camera View Data Augmentation (DVA) designed for viewpoint states and the Group Softmax (GS) module for Re-ID. SVA is to backtrack and predict the pedestrian trajectory of tail classes and DVA is to use diffusion model to change the background of the scene. GS divides the pedestrians into unrelated groups and performs softmax operation on each group individually. Our proposed strategies can be integrated into numerous existing tracking systems and extensive experimentation validates the efficacy of our method in reducing the influence of long-tail distribution on multi-object tracking performance. The code is available at https://github.com/chen-si-jia/Trajectory-Long-tail-Distribution-for-MOT. + + + + Seg2Reg: Differentiable 2D Segmentation to 1D Regression Rendering for 360 Room Layout Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Seg2Reg_Differentiable_2D_Segmentation_to_1D_Regression_Rendering_for_360_CVPR_2024_paper.pdf + State-of-the-art single-view 360 room layout reconstruction methods formulate the problem as a high-level 1D (per-column) regression task. 
On the other hand traditional low-level 2D layout segmentation is simpler to learn and can represent occluded regions but it requires complex post-processing for the targeting layout polygon and sacrifices accuracy. We present Seg2Reg to render 1D layout depth regression from the 2D segmentation map in a differentiable and occlusion-aware way marrying the merits of both sides. Specifically our model predicts floor-plan density for the input equirectangular 360 image. Formulating the 2D layout representation as a density field enables us to employ 'flattened' volume rendering to form 1D layout depth regression. In addition we propose a novel 3D warping augmentation on layout to improve generalization. Finally we re-implement recent room layout reconstruction methods into our codebase for benchmarking and explore modern backbones and training techniques to serve as the strong baseline. The code is at https: //PanoLayoutStudio.github.io . + + + + UniMix: Towards Domain Adaptive and Generalizable LiDAR Semantic Segmentation in Adverse Weather + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_UniMix_Towards_Domain_Adaptive_and_Generalizable_LiDAR_Semantic_Segmentation_in_CVPR_2024_paper.pdf + LiDAR semantic segmentation (LSS) is a critical task in autonomous driving and has achieved promising progress. However prior LSS methods are conventionally investigated and evaluated on datasets within the same domain in clear weather. The robustness of LSS models in unseen scenes and all weather conditions is crucial for ensuring safety and reliability in real applications. To this end we propose UniMix a universal method that enhances the adaptability and generalizability of LSS models. UniMix first leverages physically valid adverse weather simulation to construct a Bridge Domain which serves to bridge the domain gap between the clear weather scenes and the adverse weather scenes. Then a Universal Mixing operator is defined regarding spatial intensity and semantic distributions to create the intermediate domain with mixed samples from given domains. Integrating the proposed two techniques into a teacher-student framework UniMix efficiently mitigates the domain gap and enables LSS models to learn weather-robust and domain-invariant representations. We devote UniMix to two main setups: 1) unsupervised domain adaption adapting the model from the clear weather source domain to the adverse weather target domain; 2) domain generalization learning a model that generalizes well to unseen scenes in adverse weather. Extensive experiments validate the effectiveness of UniMix across different tasks and datasets all achieving superior performance over state-of-the-art methods. The code will be released. + + + + Visual Delta Generator with Large Multi-modal Models for Semi-supervised Composed Image Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Jang_Visual_Delta_Generator_with_Large_Multi-modal_Models_for_Semi-supervised_Composed_CVPR_2024_paper.pdf + Composed Image Retrieval (CIR) is a task that retrieves images similar to a query based on a provided textual modification. Current techniques rely on supervised learning for CIR models using labeled triplets of the <reference image text target image>. These specific triplets are not as commonly available as simple image-text pairs limiting the widespread use of CIR and its scalability. 
On the other hand zero-shot CIR can be relatively easily trained with image-caption pairs without considering the image-to-image relation but this approach tends to yield lower accuracy. We propose a new semi-supervised CIR approach where we search for a reference and its related target images in auxiliary data and learn our large language model-based Visual Delta Generator (VDG) to generate text describing the visual difference (i.e. visual delta) between the two. VDG equipped with fluent language knowledge and being model agnostic can generate pseudo triplets to boost the performance of CIR models. Our approach significantly improves the existing supervised learning approaches and achieves state-of-the-art results on the CIR benchmarks. + + + + Selective Interpretable and Motion Consistent Privacy Attribute Obfuscation for Action Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Ilic_Selective_Interpretable_and_Motion_Consistent_Privacy_Attribute_Obfuscation_for_Action_CVPR_2024_paper.pdf + Concerns for the privacy of individuals captured in public imagery have led to privacy-preserving action recognition. Existing approaches often suffer from issues arising through obfuscation being applied globally and a lack of interpretability. Global obfuscation hides privacy sensitive regions but also contextual regions important for action recognition. Lack of interpretability erodes trust in these new technologies. We highlight the limitations of current paradigms and propose a solution: Human selected privacy templates that yield interpretability by design an obfuscation scheme that selectively hides attributes and also induces temporal consistency which is important in action recognition. Our approach is architecture agnostic and directly modifies input imagery while existing approaches generally require architecture training. Our approach offers more flexibility as no retraining is required and outperforms alternatives on three widely used datasets. + + + + HiPose: Hierarchical Binary Surface Encoding and Correspondence Pruning for RGB-D 6DoF Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_HiPose_Hierarchical_Binary_Surface_Encoding_and_Correspondence_Pruning_for_RGB-D_CVPR_2024_paper.pdf + In this work we present a novel dense-correspondence method for 6DoF object pose estimation from a single RGB-D image. While many existing data-driven methods achieve impressive performance they tend to be time-consuming due to their reliance on rendering-based refinement approaches. To circumvent this limitation we present HiPose which establishes 3D-3D correspondences in a coarse-to-fine manner with a hierarchical binary surface encoding. Unlike previous dense-correspondence methods we estimate the correspondence surface by employing point-to-surface matching and iteratively constricting the surface until it becomes a correspondence point while gradually removing outliers. Extensive experiments on public benchmarks LM-O YCB-V and T-Less demonstrate that our method surpasses all refinement-free methods and is even on par with expensive refinement-based approaches. Crucially our approach is computationally efficient and enables real-time critical applications with high accuracy requirements. 
+ + + + DiffForensics: Leveraging Diffusion Prior to Image Forgery Detection and Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_DiffForensics_Leveraging_Diffusion_Prior_to_Image_Forgery_Detection_and_Localization_CVPR_2024_paper.pdf + As manipulating images may lead to misinterpretation of the visual content addressing the image forgery detection and localization (IFDL) problem has drawn serious public concerns. In this work we propose a simple assumption that the effective forensic method should focus on the mesoscopic properties of images. Based on the assumption a novel two-stage self-supervised framework leveraging the diffusion model for IFDL task i.e. DiffForensics is proposed in this paper. The DiffForensics begins with self-supervised denoising diffusion paradigm equipped with the module of encoder-decoder structure by freezing the pre-trained encoder (e.g. in ADE-20K) to inherit macroscopic features for general image characteristics while encouraging the decoder to learn microscopic feature representation of images enforcing the whole model to focus the mesoscopic representations. The pre-trained model as a prior is then further fine-tuned for IFDL task with the customized Edge Cue Enhancement Module (ECEM) which progressively highlights the boundary features within the manipulated regions thereby refining tampered area localization with better precision. Extensive experiments on several public challenging datasets demonstrate the effectiveness of the proposed method compared with other state-of-the-art methods. The proposed DiffForensics could significantly improve the model's capabilities for both accurate tamper detection and precise tamper localization while concurrently elevating its generalization and robustness. + + + + Boosting Self-Supervision for Single-View Scene Completion via Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_Boosting_Self-Supervision_for_Single-View_Scene_Completion_via_Knowledge_Distillation_CVPR_2024_paper.pdf + Inferring scene geometry from images via Structure from Motion is a long-standing and fundamental problem in computer vision. While classical approaches and more recently depth map predictions only focus on the visible parts of a scene the task of scene completion aims to reason about geometry even in occluded regions. With the popularity of NeRF implicit representations also became popular for scene completion by predicting so-called density fields. Unlike explicit approaches e.g. voxel-based methods density fields also allow for accurate depth prediction and novel-view synthesis via image-based rendering. In this work we propose to fuse the scene reconstruction from multiple images and distill this knowledge into a more accurate single-view scene reconstruction. To this end we propose MVBTS to fuse density fields from multiple posed images trained fully self-supervised only from image data. Using knowledge distillation we use MVBTS to train a single-view scene completion network via direct supervision called KDBTS. It achieves state-of-the-art performance on occupancy prediction especially in occluded regions. + + + + Sparse Global Matching for Video Frame Interpolation with Large Motion + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Sparse_Global_Matching_for_Video_Frame_Interpolation_with_Large_Motion_CVPR_2024_paper.pdf + Large motion poses a critical challenge in Video Frame Interpolation (VFI) task. 
Existing methods are often constrained by limited receptive fields resulting in sub-optimal performance when handling scenarios with large motion. In this paper we introduce a new pipeline for VFI which can effectively integrate global-level information to alleviate issues associated with large motion. Specifically we first estimate a pair of initial intermediate flows using a high-resolution feature map for extracting local details. Then we incorporate a sparse global matching branch to compensate for flow estimation which consists of identifying flaws in initial flows and generating sparse flow compensation with a global receptive field. Finally we adaptively merge the initial flow estimation with global flow compensation yielding a more accurate intermediate flow. To evaluate the effectiveness of our method in handling large motion we carefully curate a more challenging subset from commonly used benchmarks. Our method demonstrates the state-of-the-art performance on these VFI subsets with large motion. + + + + ExtDM: Distribution Extrapolation Diffusion Model for Video Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_ExtDM_Distribution_Extrapolation_Diffusion_Model_for_Video_Prediction_CVPR_2024_paper.pdf + Video prediction is a challenging task due to its nature of uncertainty especially for forecasting a long period. To model the temporal dynamics advanced methods benefit from the recent success of diffusion models and repeatedly refine the predicted future frames with 3D spatiotemporal U-Net. However there exists a gap between the present and future and the repeated usage of U-Net brings a heavy computation burden. To address this we propose a diffusion-based video prediction method that predicts future frames by extrapolating the present distribution of features namely ExtDM. Specifically our method consists of three components: (i) a motion autoencoder conducts a bijection transformation between video frames and motion cues; (ii) a layered distribution adaptor module extrapolates the present features in the guidance of Gaussian distribution; (iii) a 3D U-Net architecture specialized for jointly fusing guidance and features among the temporal dimension by spatiotemporal-window attention. Extensive experiments on five popular benchmarks covering short- and long-term video prediction verify the effectiveness of ExtDM. + + + + Point Segment and Count: A Generalized Framework for Object Counting + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Point_Segment_and_Count_A_Generalized_Framework_for_Object_Counting_CVPR_2024_paper.pdf + Class-agnostic object counting aims to count all objects in an image with respect to example boxes or class names a.k.a few-shot and zero-shot counting. In this paper we propose a generalized framework for both few-shot and zero-shot object counting based on detection. Our framework combines the superior advantages of two foundation models without compromising their zero-shot capability: (i) SAM to segment all possible objects as mask proposals and (ii) CLIP to classify proposals to obtain accurate object counts. However this strategy meets the obstacles of efficiency overhead and the small crowded objects that cannot be localized and distinguished. To address these issues our framework termed PseCo follows three steps: point segment and count. 
Specifically we first propose a class-agnostic object localization to provide accurate yet minimal point prompts for SAM which consequently not only reduces computation costs but also avoids missing small objects. Furthermore we propose a generalized object classification that leverages CLIP image/text embeddings as the classifier following a hierarchical knowledge distillation to obtain discriminative classifications among hierarchical mask proposals. Extensive experimental results on FSC-147 COCO and LVIS demonstrate that PseCo achieves state-of-the-art performance in both few-shot/zero-shot object counting/detection. + + + + PTT: Point-Trajectory Transformer for Efficient Temporal 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_PTT_Point-Trajectory_Transformer_for_Efficient_Temporal_3D_Object_Detection_CVPR_2024_paper.pdf + Recent temporal LiDAR-based 3D object detectors achieve promising performance based on the two-stage proposal-based approach. They generate 3D box candidates from the first-stage dense detector followed by different temporal aggregation methods. However these approaches require per-frame objects or whole point clouds posing challenges related to memory bank utilization. Moreover point clouds and trajectory features are combined solely based on concatenation which may neglect effective interactions between them. In this paper we propose a point-trajectory transformer with long short-term memory for efficient temporal 3D object detection. To this end we only utilize point clouds of current-frame objects and their historical trajectories as input to minimize the memory bank storage requirement. Furthermore we introduce modules to encode trajectory features focusing on long short-term and future-aware perspectives and then effectively aggregate them with point cloud features. We conduct extensive experiments on the large-scale Waymo dataset to demonstrate that our approach performs well against state-of-the-art methods. The source code and trained models will be made publicly available at https://github.com/kuanchihhuang/PTT. + + + + Generative Proxemics: A Prior for 3D Social Interaction from Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Muller_Generative_Proxemics_A_Prior_for_3D_Social_Interaction_from_Images_CVPR_2024_paper.pdf + Social interaction is a fundamental aspect of human behavior and communication. The way individuals position themselves in relation to others also known as proxemics conveys social cues and affects the dynamics of social interaction. Reconstructing such interaction from images presents challenges because of mutual occlusion and the limited availability of large training datasets. To address this we present a novel approach that learns a prior over the 3D proxemics of two people in close social interaction and demonstrate its use for single-view 3D reconstruction. We start by creating 3D training data of interacting people using image datasets with contact annotations. We then model the proxemics using a novel denoising diffusion model called BUDDI that learns the joint distribution over the poses of two people in close social interaction. Sampling from our generative proxemics model produces realistic 3D human interactions which we validate through a perceptual study. We use BUDDI in reconstructing two people in close proximity from an image without any contact annotation via an optimization approach that uses the diffusion model as a prior.
Our approach recovers accurate 3D social interactions from noisy initial estimates outperforming state-of-the-art methods. Our code data and model are available at: muelea.github.io/buddi. + + + + A Simple and Effective Point-based Network for Event Camera 6-DOFs Pose Relocalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_A_Simple_and_Effective_Point-based_Network_for_Event_Camera_6-DOFs_CVPR_2024_paper.pdf + Event cameras exhibit remarkable attributes such as high dynamic range asynchronicity and low latency making them highly suitable for vision tasks that involve high-speed motion in challenging lighting conditions. These cameras implicitly capture movement and depth information in events making them appealing sensors for Camera Pose Relocalization (CPR) tasks. Nevertheless existing CPR networks based on events neglect the pivotal fine-grained temporal information in events resulting in unsatisfactory performance. Moreover the energy-efficient features are further compromised by the use of excessively complex models hindering efficient deployment on edge devices. In this paper we introduce PEPNet a simple and effective point-based network designed to regress six degrees of freedom (6-DOFs) event camera poses. We rethink the relationship between the event camera and CPR tasks leveraging the raw Point Cloud directly as network input to harness the high-temporal resolution and inherent sparsity of events. PEPNet is adept at abstracting the spatial and implicit temporal features through hierarchical structure and explicit temporal features by Attentive Bi-directional Long Short-Term Memory (A-Bi-LSTM). By employing a carefully crafted lightweight design PEPNet delivers state-of-the-art (SOTA) performance on both indoor and outdoor datasets with meager computational resources. Specifically PEPNet attains a significant 38% and 33% performance improvement on the random split IJRR and M3ED datasets respectively. Moreover the lightweight design version PEPNet-tiny accomplishes results comparable to the SOTA while employing a mere 0.5% of the parameters. + + + + Region-Based Representations Revisited + http://openaccess.thecvf.com//content/CVPR2024/papers/Shlapentokh-Rothman_Region-Based_Representations_Revisited_CVPR_2024_paper.pdf + We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches but pixel and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong unsupervised representations like DINOv2 and used for a wide variety of tasks including semantic segmentation object-based image retrieval and multi-image analysis. Once the masks and features are extracted these representations even with linear decoders enable competitive performance making them well suited to applications that require custom queries. The compactness of the representation also makes it well-suited to video analysis and other problems requiring inference across many images. + + + + GenH2R: Learning Generalizable Human-to-Robot Handover via Scalable Simulation Demonstration and Imitation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_GenH2R_Learning_Generalizable_Human-to-Robot_Handover_via_Scalable_Simulation_Demonstration_and_CVPR_2024_paper.pdf + This paper presents GenH2R a framework for learning generalizable vision-based human-to-robot (H2R) handover skills.
The goal is to equip robots with the ability to reliably receive objects with unseen geometry handed over by humans in various complex trajectories. We acquire such generalizability by learning H2R handover at scale with a comprehensive solution including procedural simulation assets creation automated demonstration generation and effective imitation learning. We leverage large-scale 3D model repositories dexterous grasp generation methods and curve-based 3D animation to create an H2R handover simulation environment named GenH2R-Sim surpassing the number of scenes in existing simulators by three orders of magnitude. We further introduce a distillation-friendly demonstration generation method that automatically generates a million high-quality demonstrations suitable for learning. Finally we present a 4D imitation learning method augmented by a future forecasting objective to distill demonstrations into a visuo-motor handover policy. Experimental evaluations in both simulators and the real world demonstrate significant improvements (at least +10% success rate) over baselines in all cases. + + + + Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration + http://openaccess.thecvf.com//content/CVPR2024/papers/Mok_Modality-Agnostic_Structural_Image_Representation_Learning_for_Deformable_Multi-Modality_Medical_Image_CVPR_2024_paper.pdf + Establishing dense anatomical correspondence across distinct imaging modalities is a foundational yet challenging procedure for numerous medical image analysis studies and image-guided radiotherapy. Existing multi-modality image registration algorithms rely on statistical-based similarity measures or local structural image representations. However the former is sensitive to locally varying noise while the latter is not discriminative enough to cope with complex anatomical structures in multimodal scans causing ambiguity in determining the anatomical correspondence across scans with different modalities. In this paper we propose a modality-agnostic structural representation learning method which leverages Deep Neighbourhood Self-similarity (DNS) and anatomy-aware contrastive learning to learn discriminative and contrast-invariance deep structural image representations (DSIR) without the need for anatomical delineations or pre-aligned training images. We evaluate our method on multiphase CT abdomen MR-CT and brain MR T1w-T2w registration. Comprehensive results demonstrate that our method is superior to the conventional local structural representation and statistical-based similarity measures in terms of discriminability and accuracy. + + + + Any-Shift Prompting for Generalization over Distributions + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_Any-Shift_Prompting_for_Generalization_over_Distributions_CVPR_2024_paper.pdf + Image-language models with prompt learning have shown remarkable advances in numerous downstream vision tasks. Nevertheless conventional prompt learning methods overfit the training distribution and lose the generalization ability on the test distributions. To improve the generalization across various distribution shifts we propose any-shift prompting: a general probabilistic inference framework that considers the relationship between training and test distributions during prompt learning. We explicitly connect training and test distributions in the latent space by constructing training and test prompts in a hierarchical architecture. 
Within this framework the test prompt exploits the distribution relationships to guide the generalization of the CLIP image-language model from training to any test distribution. To effectively encode the distribution information and their relationships we further introduce a transformer inference network with a pseudo-shift training mechanism. The network generates the tailored test prompt with both training and test information in a feedforward pass avoiding extra training costs at test time. Extensive experiments on twenty-three datasets demonstrate the effectiveness of any-shift prompting on the generalization over various distribution shifts. + + + + CPR-Coach: Recognizing Composite Error Actions based on Single-class Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_CPR-Coach_Recognizing_Composite_Error_Actions_based_on_Single-class_Training_CVPR_2024_paper.pdf + Fine-grained medical action analysis plays a vital role in improving medical skill training efficiency but it faces the problems of data and algorithm shortage. Cardiopulmonary Resuscitation (CPR) is an essential skill in emergency treatment. Currently the assessment of CPR skills mainly depends on dummies and trainers leading to high training costs and low efficiency. For the first time this paper constructs a vision-based system to complete error action recognition and skill assessment in CPR. Specifically we define 13 types of single-error actions and 74 types of composite error actions during external cardiac compression and then develop a video dataset named CPR-Coach. By taking the CPR-Coach as a benchmark this paper investigates and compares the performance of existing action recognition models based on different data modalities. To solve the unavoidable "Single-class Training & Multi-class Testing" problem we propose a human-cognition-inspired framework named ImagineNet to improve the model's multi-error recognition performance under restricted supervision. Extensive comparison and actual deployment experiments verify the effectiveness of the framework. We hope this work could bring new inspiration to the computer vision and medical skills training communities simultaneously. The dataset and the code are publicly available on https://github.com/Shunli-Wang/CPR-Coach. + + + + RTracker: Recoverable Tracking via PN Tree Structured Memory + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_RTracker_Recoverable_Tracking_via_PN_Tree_Structured_Memory_CVPR_2024_paper.pdf + Existing tracking methods mainly focus on learning better target representation or developing more robust prediction models to improve tracking performance. While tracking performance has significantly improved the target loss issue occurs frequently due to tracking failures complete occlusion or out-of-view situations. However considerably less attention is paid to the self-recovery issue of tracking methods which is crucial for practical applications. To this end we propose a recoverable tracking framework RTracker that uses a tree-structured memory to dynamically associate a tracker and a detector to enable self-recovery ability. Specifically we propose a Positive-Negative Tree-structured memory to chronologically store and maintain positive and negative target samples. Upon the PN tree memory we develop corresponding walking rules for determining the state of the target and define a set of control flows to unite the tracker and the detector in different tracking scenarios.
Our core idea is to use the support samples of positive and negative target categories to establish a relative distance-based criterion for a reliable assessment of target loss. The favorable performance in comparison against the state-of-the-art methods on numerous challenging benchmarks demonstrates the effectiveness of the proposed algorithm. All the source code and trained models will be released at https://github.com/NorahGreen/RTracker. + + + + DualAD: Disentangling the Dynamic and Static World for End-to-End Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Doll_DualAD_Disentangling_the_Dynamic_and_Static_World_for_End-to-End_Driving_CVPR_2024_paper.pdf + State-of-the-art approaches for autonomous driving integrate multiple sub-tasks of the overall driving task into a single pipeline that can be trained in an end-to-end fashion by passing latent representations between the different modules. In contrast to previous approaches that rely on a unified grid to represent the belief state of the scene we propose dedicated representations to disentangle dynamic agents and static scene elements. This allows us to explicitly compensate for the effect of both ego and object motion between consecutive time steps and to flexibly propagate the belief state through time. Furthermore dynamic objects can not only attend to the input camera images but also directly benefit from the inferred static scene structure via a novel dynamic-static cross-attention. Extensive experiments on the challenging nuScenes benchmark demonstrate the benefits of the proposed dual-stream design especially for modelling highly dynamic agents in the scene and highlight the improved temporal consistency of our approach. Our method titled DualAD not only outperforms independently trained single-task networks but also improves over previous state-of-the-art end-to-end models by a large margin on all tasks along the functional chain of driving. + + + + MULDE: Multiscale Log-Density Estimation via Denoising Score Matching for Video Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Micorek_MULDE_Multiscale_Log-Density_Estimation_via_Denoising_Score_Matching_for_Video_CVPR_2024_paper.pdf + We propose a novel approach to video anomaly detection: we treat feature vectors extracted from videos as realizations of a random variable with a fixed distribution and model this distribution with a neural network. This lets us estimate the likelihood of test videos and detect video anomalies by thresholding the likelihood estimates. We train our video anomaly detector using a modification of denoising score matching a method that injects training data with noise to facilitate modeling its distribution. To eliminate hyperparameter selection we model the distribution of noisy video features across a range of noise levels and introduce a regularizer that tends to align the models for different levels of noise. At test time we combine anomaly indications at multiple noise scales with a Gaussian mixture model. Running our video anomaly detector induces minimal delays as inference requires merely extracting the features and forward-propagating them through a shallow neural network and a Gaussian mixture model. Our experiments on five popular video anomaly detection benchmarks demonstrate state-of-the-art performance both in the object-centric and in the frame-centric setup. 
+ + + + PTQ4SAM: Post-Training Quantization for Segment Anything + http://openaccess.thecvf.com//content/CVPR2024/papers/Lv_PTQ4SAM_Post-Training_Quantization_for_Segment_Anything_CVPR_2024_paper.pdf + Segment Anything Model (SAM) has achieved impressive performance in many computer vision tasks. However as a large-scale model the immense memory and computation costs hinder its practical deployment. In this paper we propose a post-training quantization (PTQ) framework for Segment Anything Model namely PTQ4SAM. First we investigate the inherent bottleneck of SAM quantization attributed to the bimodal distribution in post-Key-Linear activations. We analyze its characteristics from both per-tensor and per-channel perspectives and propose a Bimodal Integration strategy which utilizes a mathematically equivalent sign operation to transform the bimodal distribution into a relatively easy-quantized normal distribution offline. Second SAM encompasses diverse attention mechanisms (i.e. self-attention and two-way cross-attention) resulting in substantial variations in the post-Softmax distributions. Therefore we introduce an Adaptive Granularity Quantization for Softmax through searching the optimal power-of-two base which is hardware-friendly. Extensive experimental results across various vision tasks (instance segmentation semantic segmentation and object detection) datasets and model variants show the superiority of PTQ4SAM. For example when quantizing SAM-L to 6-bit we achieve near-lossless accuracy for instance segmentation (about a 0.5% drop) with a theoretical 3.9x acceleration. The code is available at https://github.com/chengtao-lv/PTQ4SAM. + + + + Improving Bird's Eye View Semantic Segmentation by Task Decomposition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Improving_Birds_Eye_View_Semantic_Segmentation_by_Task_Decomposition_CVPR_2024_paper.pdf + Semantic segmentation in bird's eye view (BEV) plays a crucial role in autonomous driving. Previous methods usually follow an end-to-end pipeline directly predicting the BEV segmentation map from monocular RGB inputs. However the challenge arises because the RGB inputs and BEV targets are from distinct perspectives making direct point-to-point prediction hard to optimize. In this paper we decompose the original BEV segmentation task into two stages namely BEV map reconstruction and RGB-BEV feature alignment. In the first stage we train a BEV autoencoder to reconstruct the BEV segmentation maps given a corrupted noisy latent representation which urges the decoder to learn fundamental knowledge of typical BEV patterns. The second stage involves mapping RGB input images into the BEV latent space of the first stage directly optimizing the correlations between the two views at the feature level. Our approach simplifies the complexity of combining perception and generation into distinct steps equipping the model to handle intricate and challenging scenes effectively. Besides we propose to transform the BEV segmentation map from the Cartesian to the polar coordinate system to establish the column-wise correspondence between RGB images and BEV maps. Moreover our method requires neither multi-scale features nor camera intrinsic parameters for depth estimation and saves computational overhead. Extensive experiments on nuScenes and Argoverse show the effectiveness and efficiency of our method. Code is available at https://github.com/happytianhao/TaDe.
+ + + + Scene Adaptive Sparse Transformer for Event-based Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_Scene_Adaptive_Sparse_Transformer_for_Event-based_Object_Detection_CVPR_2024_paper.pdf + While recent Transformer-based approaches have shown impressive performances on event-based object detection tasks their high computational costs still diminish the low power consumption advantage of event cameras. Image-based works attempt to reduce these costs by introducing sparse Transformers. However they display inadequate sparsity and adaptability when applied to event-based object detection since these approaches cannot balance the fine granularity of token-level sparsification and the efficiency of window-based Transformers leading to reduced performance and efficiency. Furthermore they lack scene-specific sparsity optimization resulting in information loss and a lower recall rate. To overcome these limitations we propose the Scene Adaptive Sparse Transformer (SAST). SAST enables window-token co-sparsification significantly enhancing fault tolerance and reducing computational overhead. Leveraging the innovative scoring and selection modules along with the Masked Sparse Window Self-Attention SAST showcases remarkable scene-aware adaptability: It focuses only on important objects and dynamically optimizes sparsity level according to scene complexity maintaining a remarkable balance between performance and computational cost. The evaluation results show that SAST outperforms all other dense and sparse networks in both performance and efficiency on two large-scale event-based object detection datasets (1Mpx and Gen1). Code: https://github.com/Peterande/SAST + + + + CURSOR: Scalable Mixed-Order Hypergraph Matching with CUR Decomposition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_CURSOR_Scalable_Mixed-Order_Hypergraph_Matching_with_CUR_Decomposition_CVPR_2024_paper.pdf + To achieve greater accuracy hypergraph matching algorithms require exponential increases in computational resources. Recent kd-tree-based approximate nearest neighbor (ANN) methods despite the sparsity of their compatibility tensor still require exhaustive calculations for large-scale graph matching. This work utilizes CUR tensor decomposition and introduces a novel cascaded second and third-order hypergraph matching framework (CURSOR) for efficient hypergraph matching. A CUR-based second-order graph matching algorithm is used to provide a rough match and then the core of CURSOR a fiber-CUR-based tensor generation method directly calculates entries of the compatibility tensor by leveraging the initial second-order match result. This significantly decreases the time complexity and tensor density. A probability relaxation labeling (PRL)-based matching algorithm especially suitable for sparse tensors is developed. Experiment results on large-scale synthetic datasets and widely-adopted benchmark sets demonstrate the superiority of CURSOR over existing methods. The tensor generation method in CURSOR can be integrated seamlessly into existing hypergraph matching methods to improve their performance and lower their computational costs. 
+ + + + GigaTraj: Predicting Long-term Trajectories of Hundreds of Pedestrians in Gigapixel Complex Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_GigaTraj_Predicting_Long-term_Trajectories_of_Hundreds_of_Pedestrians_in_Gigapixel_CVPR_2024_paper.pdf + Pedestrian trajectory prediction is a well-established task with significant recent advancements. However existing datasets are unable to fulfill the demand for studying minute-level long-term trajectory prediction mainly due to the lack of high-resolution trajectory observation in the wide field of view (FoV). To bridge this gap we introduce a novel dataset named GigaTraj featuring videos capturing a wide FoV with ~4 x 10^4 m^2 and high-resolution imagery at the gigapixel level. Furthermore GigaTraj includes comprehensive annotations such as bounding boxes identity associations world coordinates group/interaction relationships and scene semantics. Leveraging these multimodal annotations we evaluate and validate the state-of-the-art approaches for minute-level long-term trajectory prediction in large-scale scenes. Extensive experiments and analyses have revealed that long-term prediction for pedestrian trajectories presents numerous challenges indicating a vital new direction for trajectory research. The dataset is available at www.gigavision.ai. + + + + C2KD: Bridging the Modality Gap for Cross-Modal Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Huo_C2KD_Bridging_the_Modality_Gap_for_Cross-Modal_Knowledge_Distillation_CVPR_2024_paper.pdf + Existing Knowledge Distillation (KD) methods typically focus on transferring knowledge from a large-capacity teacher to a low-capacity student model achieving substantial success in unimodal knowledge transfer. However existing methods can hardly be extended to Cross-Modal Knowledge Distillation (CMKD) where the knowledge is transferred from a teacher modality to a different student modality with inference only on the distilled student modality. We empirically reveal that the modality gap i.e. modality imbalance and soft label misalignment incurs the ineffectiveness of traditional KD in CMKD. As a solution we propose a novel Customized Crossmodal Knowledge Distillation (C^2KD). Specifically to alleviate the modality gap the pre-trained teacher performs bidirectional distillation with the student to provide customized knowledge. The On-the-Fly Selection Distillation (OFSD) strategy is applied to selectively filter out the samples with misaligned soft labels where we distill cross-modal knowledge from non-target classes to avoid the modality imbalance issue. To further provide receptive cross-modal knowledge proxy student and teacher inheriting unimodal and cross-modal knowledge is formulated to progressively transfer cross-modal knowledge through bidirectional distillation. Experimental results on audio-visual image-text and RGB-depth datasets demonstrate that our method can effectively transfer knowledge across modalities achieving superior performance against traditional KD by a large margin. + + + + Traceable Federated Continual Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Traceable_Federated_Continual_Learning_CVPR_2024_paper.pdf + Federated continual learning (FCL) is a typical mechanism to achieve collaborative model training among clients that own dynamic data.
While traditional FCL methods have been proved effective they do not consider the task repeatability and fail to achieve good performance under this practical scenario. In this paper we propose a new paradigm namely Traceable Federated Continual Learning (TFCL) aiming to cope with repetitive tasks by tracing and augmenting them. Following the new paradigm we develop TagFed a framework that enables accurate and effective Tracing augmentation and Federation for TFCL. The key idea is to decompose the whole model into a series of marked sub-models for optimizing each client task before conducting group-wise knowledge aggregation such that the repetitive tasks can be located precisely and federated selectively for improved performance. Extensive experiments on our constructed benchmark demonstrate the effectiveness and efficiency of the proposed framework. We will release our code at: https://github.com/P0werWeirdo/TagFCL. + + + + V*: Guided Visual Search as a Core Mechanism in Multimodal LLMs + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_V_Guided_Visual_Search_as_a_Core_Mechanism_in_Multimodal_CVPR_2024_paper.pdf + When we look around and perform complex tasks how we see and selectively process what we see is crucial. However the lack of this visual search mechanism in current multimodal LLMs (MLLMs) hinders their ability to focus on important visual details especially when handling high-resolution and visually crowded images. To address this we introduce V* an LLM-guided visual search mechanism that employs the world knowledge in LLMs for efficient visual querying. When combined with an MLLM this mechanism enhances collaborative reasoning contextual understanding and precise visual grounding. This integration results in a new MLLM meta-architecture named Show sEArch and TelL (SEAL). We further create V*Bench a benchmark specifically designed to evaluate MLLMs in their ability to process high-resolution images and focus on visual details. Our study highlights the necessity of incorporating visual search capabilities into multimodal systems. The code is available at https://github.com/penghao-wu/vstar + + + + Uncertainty Visualization via Low-Dimensional Posterior Projections + http://openaccess.thecvf.com//content/CVPR2024/papers/Yair_Uncertainty_Visualization_via_Low-Dimensional_Posterior_Projections_CVPR_2024_paper.pdf + In ill-posed inverse problems it is commonly desirable to obtain insight into the full spectrum of plausible solutions rather than extracting only a single reconstruction. Information about the plausible solutions and their likelihoods is encoded in the posterior distribution. However for high-dimensional data this distribution is challenging to visualize. In this work we introduce a new approach for estimating and visualizing posteriors by employing energy-based models (EBMs) over low-dimensional subspaces. Specifically we train a conditional EBM that receives an input measurement and a set of directions that span some low-dimensional subspace of solutions and outputs the probability density function of the posterior within that space. We demonstrate the effectiveness of our method across a diverse range of datasets and image restoration problems showcasing its strength in uncertainty quantification and visualization. As we show our method outperforms a baseline that projects samples from a diffusion-based posterior sampler while being orders of magnitude faster. Furthermore it is more accurate than a baseline that assumes a Gaussian posterior.
+ + + + VSCode: General Visual Salient and Camouflaged Object Detection with 2D Prompt Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_VSCode_General_Visual_Salient_and_Camouflaged_Object_Detection_with_2D_CVPR_2024_paper.pdf + Salient object detection (SOD) and camouflaged object detection (COD) are related yet distinct binary mapping tasks. These tasks involve multiple modalities sharing commonalities and unique cues. Existing research often employs intricate task-specific specialist models potentially leading to redundancy and suboptimal results. We introduce VSCode a generalist model with novel 2D prompt learning to jointly address four SOD tasks and three COD tasks. We utilize VST as the foundation model and introduce 2D prompts within the encoder-decoder architecture to learn domain and task-specific knowledge on two separate dimensions. A prompt discrimination loss helps disentangle peculiarities to benefit model optimization. VSCode outperforms state-of-the-art methods across six tasks on 26 datasets and exhibits zero-shot generalization to unseen tasks by combining 2D prompts such as RGB-D COD. Source code has been available at https://github.com/Sssssuperior/VSCode. + + + + PointInfinity: Resolution-Invariant Point Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_PointInfinity_Resolution-Invariant_Point_Diffusion_Models_CVPR_2024_paper.pdf + We present PointInfinity an efficient family of point cloud diffusion models. Our core idea is to use a transformer-based architecture with a fixed-size resolution-invariant latent representation. This enables efficient training with low-resolution point clouds while allowing high-resolution point clouds to be generated during inference. More importantly we show that scaling the test-time resolution beyond the training resolution improves the fidelity of generated point clouds and surfaces. We analyze this phenomenon and draw a link to classifier-free guidance commonly used in diffusion models demonstrating that both allow trading off fidelity and variability during inference. Experiments on CO3D show that PointInfinity can efficiently generate high-resolution point clouds (up to 131k points 31 times more than Point-E) with state-of-the-art quality. + + + + Structured Model Probing: Empowering Efficient Transfer Learning by Structured Regularization + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Structured_Model_Probing_Empowering_Efficient_Transfer_Learning_by_Structured_Regularization_CVPR_2024_paper.pdf + Despite encouraging results from recent developments in transfer learning for adapting pre-trained model to downstream tasks the performance of model probing is still lagging behind the state-of-the-art parameter efficient tuning methods. Our investigation reveals that existing model probing methods perform well for the easy case when the source domain (where models are pre-trained) and the adapted domain are similar but fail for the difficult case when the two domains are significantly different. Simply incorporating features extracted from multiple layers and increasing complexity of the probing model can mitigate the gap in the difficult case but degrades the performance in the easy case. To address this challenge we propose structured model probing (SMP) that is able to deliver good performance for both cases through structured regularization. 
The regularization performs feature selection leveraging model structure as a prior and controls the complexity of the probing model through the weights of selected structures. This enables us to construct a simple adaptation model with a small number of selected features and a linear prediction model for the easy case; and to automatically increase the complexity of adaptation model with a large number of selected features and a non-linear model for the difficult case. Our extensive empirical studies show that SMP significantly outperforms the state-of-the-art methods for parameter efficient tuning and at the same time still maintains the advantage of computational efficiency for probing-based methods. + + + + Multi-Modal Proxy Learning Towards Personalized Visual Multiple Clustering + http://openaccess.thecvf.com//content/CVPR2024/papers/Yao_Multi-Modal_Proxy_Learning_Towards_Personalized_Visual_Multiple_Clustering_CVPR_2024_paper.pdf + Multiple clustering has gained significant attention in recent years due to its potential to reveal multiple hidden structures of data from different perspectives. The advent of deep multiple clustering techniques has notably advanced the performance by uncovering complex patterns and relationships within large datasets. However a major challenge arises as users often do not need all the clusterings that algorithms generate and figuring out the one needed requires a substantial understanding of each clustering result. Traditionally aligning a user's brief keyword of interest with the corresponding vision components was challenging but the emergence of multi-modal and large language models (LLMs) has begun to bridge this gap. In response given unlabeled target visual data we propose Multi-Map a novel method employing a multi-modal proxy learning process. It leverages CLIP encoders to extract coherent text and image embeddings with GPT-4 integrating users' interests to formulate effective textual contexts. Moreover reference word constraint and concept-level constraint are designed to learn the optimal text proxy according to the user's interest. Multi-Map not only adeptly captures a user's interest via a keyword but also facilitates identifying relevant clusterings. Our extensive experiments show that Multi-Map consistently outperforms state-of-the-art methods in all benchmark multi-clustering vision tasks. Our code is available at https://github.com/Alexander-Yao/Multi-MaP. + + + + ChAda-ViT : Channel Adaptive Attention for Joint Representation Learning of Heterogeneous Microscopy Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Bourriez_ChAda-ViT__Channel_Adaptive_Attention_for_Joint_Representation_Learning_of_CVPR_2024_paper.pdf + Unlike color photography images which are consistently encoded into RGB channels biological images encompass various modalities where the type of microscopy and the meaning of each channel varies with each experiment. Importantly the number of channels can range from one to a dozen and their correlation is often comparatively much lower than RGB as each of them brings specific information content. This aspect is largely overlooked by methods designed out of the bioimage field and current solutions mostly focus on intra-channel spatial attention often ignoring the relationship between channels yet crucial in most biological applications. Importantly the variable channel type and count prevent the projection of several experiments to a unified representation for large scale pre-training. 
In this study we propose ChAda-ViT a novel Channel Adaptive Vision Transformer architecture employing an Inter-Channel Attention mechanism on images with an arbitrary number order and type of channels. We also introduce IDRCell100k a bioimage dataset with a rich set of 79 experiments covering 7 microscope modalities with a multitude of channel types and channel counts varying from 1 to 10 per experiment. Our proposed architecture trained in a self-supervised manner outperforms existing approaches in several biologically relevant downstream tasks. Additionally it can be used to bridge the gap for the first time between assays with different microscopes channel numbers or types by embedding various image and experimental modalities into a unified biological image representation. The latter should facilitate interdisciplinary studies and pave the way for better adoption of deep learning in biological image-based analyses. + + + + CARZero: Cross-Attention Alignment for Radiology Zero-Shot Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Lai_CARZero_Cross-Attention_Alignment_for_Radiology_Zero-Shot_Classification_CVPR_2024_paper.pdf + The advancement of Zero-Shot Learning in the medical domain has been driven forward by using pre-trained models on large-scale image-text pairs focusing on image-text alignment. However existing methods primarily rely on cosine similarity for alignment which may not fully capture the complex relationship between medical images and reports. To address this gap we introduce a novel approach called Cross-Attention Alignment for Radiology Zero-Shot Classification (CARZero). Our approach innovatively leverages cross-attention mechanisms to process image and report features creating a Similarity Representation that more accurately reflects the intricate relationships in medical semantics. This representation is then linearly projected to form an image-text similarity matrix for cross-modality alignment. Additionally recognizing the pivotal role of prompt selection in zero-shot learning CARZero incorporates a Large Language Model-based prompt alignment strategy. This strategy standardizes diverse diagnostic expressions into a unified format for both training and inference phases overcoming the challenges of manual prompt design. Our approach is simple yet effective demonstrating state-of-the-art performance in zero-shot classification on five official chest radiograph diagnostic test sets including remarkable results on datasets with long-tail distributions of rare diseases. This achievement is attributed to our new image-text alignment strategy which effectively addresses the complex relationship between medical images and reports. Code and models are available at https://github.com/laihaoran/CARZero. + + + + Multi-Modal Hallucination Control by Visual Information Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Favero_Multi-Modal_Hallucination_Control_by_Visual_Information_Grounding_CVPR_2024_paper.pdf + Generative Vision-Language Models (VLMs) are prone to generate plausible-sounding textual answers which however are not always grounded in the input image. We investigate this phenomenon usually referred to as "hallucination" and show that it stems from an excessive reliance on the language prior. In particular we show that as more tokens are generated the reliance on the visual prompt decreases and this behavior strongly correlates with the emergence of hallucinations. 
To reduce hallucinations we introduce Multi-Modal Mutual-Information Decoding (M3ID) a new sampling method for prompt amplification. M3ID amplifies the influence of the reference image over the language prior hence favoring the generation of tokens with higher mutual information with the visual prompt. M3ID can be applied to any pre-trained autoregressive VLM at inference time without necessitating further training and with minimal computational overhead. If training is an option we show that M3ID can be paired with Direct Preference Optimization (DPO) to improve the model's reliance on the prompt image without requiring any labels. Our empirical findings show that our algorithms maintain the fluency and linguistic capabilities of pre-trained VLMs while reducing hallucinations by mitigating visually ungrounded answers. Specifically for the LLaVA 13B model M3ID and M3ID+DPO reduce the percentage of hallucinated objects in captioning tasks by 25% and 28% respectively and improve the accuracy on VQA benchmarks such as POPE by 21% and 24%. + + + + The Neglected Tails in Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Parashar_The_Neglected_Tails_in_Vision-Language_Models_CVPR_2024_paper.pdf + Vision-language models (VLMs) excel in zero-shot recognition but their performance varies greatly across different visual concepts. For example although CLIP achieves impressive accuracy on ImageNet (60-80%) its performance drops below 10% for more than ten concepts like night snake presumably due to their limited presence in the pretraining data. However measuring the frequency of concepts in VLMs' large-scale datasets is challenging. We address this by using large language models (LLMs) to count the number of pretraining texts that contain synonyms of these concepts. Our analysis confirms that popular datasets such as LAION exhibit a long-tailed concept distribution yielding biased performance in VLMs. We also find that downstream applications of VLMs including visual chatbots (e.g. GPT-4V) and text-to-image models (e.g. Stable Diffusion) often fail to recognize or generate images of rare concepts identified by our method. To mitigate the imbalanced performance of zero-shot VLMs we propose REtrieval-Augmented Learning (REAL). First instead of prompting VLMs using the original class names REAL uses their most frequent synonyms found in pretraining texts. This simple change already outperforms costly human-engineered and LLM-enriched prompts over nine benchmark datasets. Second REAL trains a linear classifier on a small yet balanced set of pretraining data retrieved using concept synonyms. REAL surpasses the previous zero-shot SOTA using 400x less storage and 10000x less training time! + + + + Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Learning_Background_Prompts_to_Discover_Implicit_Knowledge_for_Open_Vocabulary_CVPR_2024_paper.pdf + Open vocabulary object detection (OVD) aims at seeking an optimal object detector capable of recognizing objects from both base and novel categories. Recent advances leverage knowledge distillation to transfer insightful knowledge from pre-trained large-scale vision-language models to the task of object detection significantly generalizing the powerful capabilities of the detector to identify more unknown object categories. 
However these methods face significant challenges in background interpretation and model overfitting and thus often result in the loss of crucial background knowledge giving rise to sub-optimal inference performance of the detector. To mitigate these issues we present a novel OVD framework termed LBP to propose learning background prompts to harness explored implicit background knowledge thus enhancing the detection performance w.r.t. base and novel categories. Specifically we devise three modules: Background Category-specific Prompt Background Object Discovery and Inference Probability Rectification to empower the detector to discover represent and leverage implicit object knowledge explored from background proposals. Evaluation on two benchmark datasets OV-COCO and OV-LVIS demonstrates the superiority of our proposed method over existing state-of-the-art approaches in handling the OVD tasks. + + + + Towards Accurate Post-training Quantization for Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Towards_Accurate_Post-training_Quantization_for_Diffusion_Models_CVPR_2024_paper.pdf + In this paper we propose an accurate post-training quantization framework of diffusion models (APQ-DM) for efficient image generation. Conventional quantization frameworks learn shared quantization functions for tensor discretization regardless of the generation timesteps in diffusion models while the activation distribution differs significantly across various timesteps. Meanwhile the calibration images are acquired in random timesteps which fail to provide sufficient information for generalizable quantization function learning. Both issues cause sizable quantization errors with obvious image generation performance degradation. On the contrary we design distribution-aware quantization functions for activation discretization in different timesteps and search the optimal timesteps for informative calibration image generation so that our quantized diffusion model can reduce the discretization errors with negligible computational overhead. Specifically we partition various timestep quantization functions into different groups according to the importance weights which are optimized by differentiable search algorithms. We also extend structural risk minimization principle for informative calibration image generation to enhance the generalization ability in the deployment of quantized diffusion model. Extensive experimental results show that our method outperforms the state-of-the-art post-training quantization of diffusion model by a sizable margin with similar computational cost. + + + + GOAT-Bench: A Benchmark for Multi-Modal Lifelong Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Khanna_GOAT-Bench_A_Benchmark_for_Multi-Modal_Lifelong_Navigation_CVPR_2024_paper.pdf + The Embodied AI community has recently made significant strides in visual navigation tasks exploring targets from 3D coordinates objects language description and images. However these navigation models often handle only a single input modality as the target. With the progress achieved so far it is time to move towards universal navigation models capable of handling various goal types enabling more effective user interaction with robots. To facilitate this goal we propose GOAT-Bench a benchmark for the universal navigation task referred to as GO to AnyThing (GOAT). 
In this task the agent is directed to navigate to a sequence of targets specified by the category name language description or instance image in an open-vocabulary fashion. We benchmark monolithic RL and modular methods on the GOAT task analyzing their performance across modalities the role of explicit and implicit scene memories their robustness to noise in goal specifications and the impact of memory in lifelong scenarios. + + + + Decoupling Static and Hierarchical Motion Perception for Referring Video Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Decoupling_Static_and_Hierarchical_Motion_Perception_for_Referring_Video_Segmentation_CVPR_2024_paper.pdf + Referring video segmentation relies on natural language expressions to identify and segment objects often emphasizing motion clues. Previous works treat a sentence as a whole and directly perform identification at the video-level mixing up static image-level cues with temporal motion cues. However image-level features cannot well comprehend motion cues in sentences and static cues are not crucial for temporal perception. In fact static cues can sometimes interfere with temporal perception by overshadowing motion cues. In this work we propose to decouple video-level referring expression understanding into static and motion perception with a specific emphasis on enhancing temporal comprehension. Firstly we introduce an expression-decoupling module to make static cues and motion cues perform their distinct role alleviating the issue of sentence embeddings overlooking motion cues. Secondly we propose a hierarchical motion perception module to capture temporal information effectively across varying timescales. Furthermore we employ contrastive learning to distinguish the motions of visually similar objects. These contributions yield state-of-the-art performance across five datasets including a remarkable 9.2% J&F improvement on the challenging MeViS dataset. + + + + Dense Vision Transformer Compression with Few Samples + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Dense_Vision_Transformer_Compression_with_Few_Samples_CVPR_2024_paper.pdf + Few-shot model compression aims to compress a large model into a more compact one with only a tiny training set (even without labels). Block-level pruning has recently emerged as a leading technique in achieving high accuracy and low latency in few-shot CNN compression. But few-shot compression for Vision Transformers (ViT) remains largely unexplored which presents a new challenge. In particular the issue of sparse compression exists in traditional CNN few-shot methods which can only produce very few compressed models of different model sizes. This paper proposes a novel framework for few-shot ViT compression named DC-ViT. Instead of dropping the entire block DC-ViT selectively eliminates the attention module while retaining and reusing portions of the MLP module. DC-ViT enables dense compression which outputs numerous compressed models that densely populate the range of model complexity. DC-ViT outperforms state-of-the-art few-shot compression methods by a significant margin of 10 percentage points along with lower latency in the compression of ViT and its variants. 
+ + + + Masked AutoDecoder is Effective Multi-Task Vision Generalist + http://openaccess.thecvf.com//content/CVPR2024/papers/Qiu_Masked_AutoDecoder_is_Effective_Multi-Task_Vision_Generalist_CVPR_2024_paper.pdf + Inspired by the success of general-purpose models in NLP recent studies attempt to unify different vision tasks in the same sequence format and employ autoregressive Transformers for sequence prediction. They apply uni-directional attention to capture sequential dependencies and generate task sequences recursively. However such autoregressive Transformers may not fit vision tasks well as vision task sequences usually lack the sequential dependencies typically observed in natural languages. In this work we design Masked AutoDecoder (MAD) an effective multi-task vision generalist. MAD consists of two core designs. First we develop a parallel decoding framework that introduces bi-directional attention to capture contextual dependencies comprehensively and decode vision task sequences in parallel. Second we design a masked sequence modeling approach that learns rich task contexts by masking and reconstructing task sequences. In this way MAD handles all the tasks by a single network branch and a simple cross-entropy loss with minimal task-specific designs. Extensive experiments demonstrate the great potential of MAD as a new paradigm for unifying various vision tasks. MAD achieves superior performance and inference efficiency compared to autoregressive counterparts while obtaining competitive accuracy with task-specific models. Code will be released at https://github.com/hanqiu-hq/MAD. + + + + Correlation-aware Coarse-to-fine MLPs for Deformable Medical Image Registration + http://openaccess.thecvf.com//content/CVPR2024/papers/Meng_Correlation-aware_Coarse-to-fine_MLPs_for_Deformable_Medical_Image_Registration_CVPR_2024_paper.pdf + Deformable image registration is a fundamental step for medical image analysis. Recently transformers have been used for registration and outperformed Convolutional Neural Networks (CNNs). Transformers can capture long-range dependence among image features which have been shown beneficial for registration. However due to the high computation/memory loads of self-attention transformers are typically used at downsampled feature resolutions and cannot capture fine-grained long-range dependence at the full image resolution. This limits deformable registration as it necessitates precise dense correspondence between each image pixel. Multi-layer Perceptrons (MLPs) without self-attention are efficient in computation/memory usage enabling the feasibility of capturing fine-grained long-range dependence at full resolution. Nevertheless MLPs have not been extensively explored for image registration and are lacking the consideration of inductive bias crucial for medical registration tasks. In this study we propose the first correlation-aware MLP-based registration network (CorrMLP) for deformable medical image registration. Our CorrMLP introduces a correlation-aware multi-window MLP block in a novel coarse-to-fine registration architecture which captures fine-grained multi-range dependence to perform correlation-aware coarse-to-fine registration. Extensive experiments with seven public medical datasets show that our CorrMLP outperforms state-of-the-art deformable registration methods. 
+ + + + Toward Generalist Anomaly Detection via In-context Residual Learning with Few-shot Sample Prompts + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Toward_Generalist_Anomaly_Detection_via_In-context_Residual_Learning_with_Few-shot_CVPR_2024_paper.pdf + This paper explores the problem of Generalist Anomaly Detection (GAD) aiming to train one single detection model that can generalize to detect anomalies in diverse datasets from different application domains without any further training on the target data. Some recent studies have shown that large pre-trained Visual-Language Models (VLMs) like CLIP have strong generalization capabilities on detecting industrial defects from various datasets but their methods rely heavily on handcrafted text prompts about defects making them difficult to generalize to anomalies in other applications e.g. medical image anomalies or semantic anomalies in natural images. In this work we propose to train a GAD model with few-shot normal images as sample prompts for AD on diverse datasets on the fly. To this end we introduce a novel approach that learns an in-context residual learning model for GAD termed InCTRL. It is trained on an auxiliary dataset to discriminate anomalies from normal samples based on a holistic evaluation of the residuals between query images and few-shot normal sample prompts. Regardless of the datasets per definition of anomaly larger residuals are expected for anomalies than normal samples thereby enabling InCTRL to generalize across different domains without further training. Comprehensive experiments on nine AD datasets are performed to establish a GAD benchmark that encapsulates the detection of industrial defect anomalies medical anomalies and semantic anomalies in both one-vs-all and multi-class settings on which InCTRL is the best performer and significantly outperforms state-of-the-art competing methods. Code is available at https://github.com/mala-lab/InCTRL. + + + + Fourier-basis Functions to Bridge Augmentation Gap: Rethinking Frequency Augmentation in Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Vaish_Fourier-basis_Functions_to_Bridge_Augmentation_Gap_Rethinking_Frequency_Augmentation_in_CVPR_2024_paper.pdf + Computer vision models normally witness degraded performance when deployed in real-world scenarios due to unexpected changes in inputs that were not accounted for during training. Data augmentation is commonly used to address this issue as it aims to increase data variety and reduce the distribution gap between training and test data. However common visual augmentations might not guarantee extensive robustness of computer vision models. In this paper we propose Auxiliary Fourier-basis Augmentation (AFA) a complementary technique targeting augmentation in the frequency domain and filling the robustness gap left by visual augmentations. We demonstrate the utility of augmentation via Fourier-basis additive noise in a straightforward and efficient adversarial setting. Our results show that AFA benefits the robustness of models against common corruptions OOD generalization and consistency of performance of models against increasing perturbations with negligible deficit to the standard performance of models. It can be seamlessly integrated with other augmentation techniques to further boost performance. Codes and models are available at https://github.com/nis-research/afa-augment.
+ + + + PlatoNeRF: 3D Reconstruction in Plato's Cave via Single-View Two-Bounce Lidar + http://openaccess.thecvf.com//content/CVPR2024/papers/Klinghoffer_PlatoNeRF_3D_Reconstruction_in_Platos_Cave_via_Single-View_Two-Bounce_Lidar_CVPR_2024_paper.pdf + 3D reconstruction from a single-view is challenging because of the ambiguity from monocular cues and lack of information about occluded regions. Neural radiance fields (NeRF) while popular for view synthesis and 3D reconstruction are typically reliant on multi-view images. Existing methods for single-view 3D reconstruction with NeRF rely on either data priors to hallucinate views of occluded regions which may not be physically accurate or shadows observed by RGB cameras which are difficult to detect in ambient light and low albedo backgrounds. We propose using time-of-flight data captured by a single-photon avalanche diode to overcome these limitations. Our method models two-bounce optical paths with NeRF using lidar transient data for supervision. By leveraging the advantages of both NeRF and two-bounce light measured by lidar we demonstrate that we can reconstruct visible and occluded geometry without data priors or reliance on controlled ambient lighting or scene albedo. In addition we demonstrate improved generalization under practical constraints on sensor spatial- and temporal-resolution. We believe our method is a promising direction as single-photon lidars become ubiquitous on consumer devices such as phones tablets and headsets. + + + + An Interactive Navigation Method with Effect-oriented Affordance + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_An_Interactive_Navigation_Method_with_Effect-oriented_Affordance_CVPR_2024_paper.pdf + Visual navigation is to let the agent reach the target according to the continuous visual input. In most previous works visual navigation is usually assumed to be done in a static and ideal environment: the target is always reachable with no need to alter the environment. However the "messy" environments are more general and practical in our daily lives where the agent may get blocked by obstacles. Thus Interactive Navigation (InterNav) is introduced to navigate to the objects in more realistic "messy" environments according to the object interaction. Prior work on InterNav learns short-term interaction through extensive trials with reinforcement learning. However interaction does not guarantee efficient navigation that is planning obstacle interactions that make shorter paths and consume less effort is also crucial. In this paper we introduce an effect-oriented affordance map to enable long-term interactive navigation extending the existing map-based navigation framework to the domain of dynamic environment. We train a set of affordance functions predicting available interactions and the time cost of removing obstacles which informatively support an interactive modular system to address interaction and long-term planning. Experiments on the ProcTHOR simulator demonstrate the capability of our affordance-driven system in long-term navigation in complex dynamic environments. + + + + PREGO: Online Mistake Detection in PRocedural EGOcentric Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Flaborea_PREGO_Online_Mistake_Detection_in_PRocedural_EGOcentric_Videos_CVPR_2024_paper.pdf + Promptly identifying procedural errors from egocentric videos in an online setting is highly challenging and valuable for detecting mistakes as soon as they happen. 
This capability has a wide range of applications across various fields such as manufacturing and healthcare. The nature of procedural mistakes is open-set since novel types of failures might occur which calls for one-class classifiers trained on correctly executed procedures. However no technique can currently detect open-set procedural mistakes online. We propose PREGO the first online one-class classification model for mistake detection in PRocedural EGOcentric videos. PREGO is based on an online action recognition component to model the current action and a symbolic reasoning module to predict the next actions. Mistake detection is performed by comparing the recognized current action with the expected future one. We evaluate PREGO on two procedural egocentric video datasets Assembly101 and Epic-tent which we adapt for online benchmarking of procedural mistake detection to establish suitable benchmarks thus defining the Assembly101-O and Epic-tent-O datasets respectively. The code is available at https://github.com/aleflabo/PREGO + + + + Logit Standardization in Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Logit_Standardization_in_Knowledge_Distillation_CVPR_2024_paper.pdf + Knowledge distillation involves transferring soft labels from a teacher to a student using a shared temperature-based softmax function. However the assumption of a shared temperature between teacher and student implies a mandatory exact match between their logits in terms of logit range and variance. This side-effect limits the performance of student considering the capacity discrepancy between them and the finding that the innate logit relations of teacher are sufficient for student to learn. To address this issue we propose setting the temperature as the weighted standard deviation of logit and performing a plug-and-play Z-score pre-process of logit standardization before applying softmax and Kullback-Leibler divergence. Our pre-process enables student to focus on essential logit relations from teacher rather than requiring a magnitude match and can improve the performance of existing logit-based distillation methods. We also show a typical case where the conventional setting of sharing temperature between teacher and student cannot reliably yield the authentic distillation evaluation; nonetheless this challenge is successfully alleviated by our Z-score. We extensively evaluate our method for various student and teacher models on CIFAR-100 and ImageNet showing its significant superiority. The vanilla knowledge distillation powered by our pre-process can achieve favorable performance against state-of-the-art methods and other distillation variants can obtain considerable gain with the assistance of our pre-process. The codes pre-trained models and logs are released on Github. + + + + Fine-grained Prototypical Voting with Heterogeneous Mixup for Semi-supervised 2D-3D Cross-modal Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Fine-grained_Prototypical_Voting_with_Heterogeneous_Mixup_for_Semi-supervised_2D-3D_Cross-modal_CVPR_2024_paper.pdf + This paper studies the problem of semi-supervised 2D-3D retrieval which aims to align both labeled and unlabeled 2D and 3D data into the same embedding space. The problem is challenging due to the complicated heterogeneous relationships between 2D and 3D data. Moreover label scarcity in real-world applications hinders from generating discriminative representations. 
In this paper we propose a semi-supervised approach named Fine-grained Prototypical Voting with Heterogeneous Mixup (FIVE) which maps both 2D and 3D data into a common embedding space for cross-modal retrieval. Specifically we generate fine-grained prototypes to model inter-class variation for both 2D and 3D data. Then considering each unlabeled sample as a query we retrieve relevant prototypes to vote for reliable and robust pseudo-labels which serve as guidance for discriminative learning under label scarcity. Furthermore to bridge the semantic gap between two modalities we mix cross-modal pairs with similar semantics in the embedding space and then perform similarity learning for cross-modal discrepancy reduction in a soft manner. The whole FIVE is optimized with the consideration of sharpness to mitigate the impact of potential label noise. Extensive experiments on benchmark datasets validate the superiority of FIVE compared with a range of baselines in different settings. On average FIVE outperforms the second-best approach by 4.74% on 3D MNIST 12.94% on ModelNet10 and 22.10% on ModelNet40. + + + + Leak and Learn: An Attacker's Cookbook to Train Using Leaked Data from Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Leak_and_Learn_An_Attackers_Cookbook_to_Train_Using_Leaked_CVPR_2024_paper.pdf + Federated learning is a decentralized learning paradigm introduced to preserve privacy of client data. Despite this prior work has shown that an attacker at the server can still reconstruct the private training data using only the client updates. These attacks are known as data reconstruction attacks and fall into two major categories: gradient inversion (GI) and linear layer leakage attacks (LLL). However despite demonstrating the effectiveness of these attacks in breaching privacy prior work has not investigated the usefulness of the reconstructed data for downstream tasks. In this work we explore data reconstruction attacks through the lens of training and improving models with leaked data. We demonstrate the effectiveness of both GI and LLL attacks in maliciously training models using the leaked data more accurately than a benign federated learning strategy. Counter-intuitively this bump in training quality can occur despite limited reconstruction quality or a small total number of leaked images. Finally we show the limitations of these attacks for downstream training individually for GI attacks and for LLL attacks. + + + + OCAI: Improving Optical Flow Estimation by Occlusion and Consistency Aware Interpolation + http://openaccess.thecvf.com//content/CVPR2024/papers/Jeong_OCAI_Improving_Optical_Flow_Estimation_by_Occlusion_and_Consistency_Aware_CVPR_2024_paper.pdf + The scarcity of ground-truth labels poses one major challenge in developing optical flow estimation models that are both generalizable and robust. While current methods rely on data augmentation they have yet to fully exploit the rich information available in labeled video sequences. We propose OCAI a method that supports robust frame interpolation by generating intermediate video frames alongside optical flows in between. Utilizing a forward warping approach OCAI employs occlusion awareness to resolve ambiguities in pixel values and fills in missing values by leveraging the forward-backward consistency of optical flows. Additionally we introduce a teacher-student style semi-supervised learning method on top of the interpolated frames.
Using a pair of unlabeled frames and the teacher model's predicted optical flow we generate interpolated frames and flows to train a student model. The teacher's weights are maintained using Exponential Moving Averaging of the student. Our evaluations demonstrate perceptually superior interpolation quality and enhanced optical flow accuracy on established benchmarks such as Sintel and KITTI. + + + + Doodle Your 3D: From Abstract Freehand Sketches to Precise 3D Shapes + http://openaccess.thecvf.com//content/CVPR2024/papers/Bandyopadhyay_Doodle_Your_3D_From_Abstract_Freehand_Sketches_to_Precise_3D_CVPR_2024_paper.pdf + In this paper we democratise 3D content creation enabling precise generation of 3D shapes from abstract sketches while overcoming limitations tied to drawing skills. We introduce a novel part-level modelling and alignment framework that facilitates abstraction modelling and cross-modal correspondence. Leveraging the same part-level decoder our approach seamlessly extends to sketch modelling by establishing correspondence between CLIPasso edgemaps and projected 3D part regions eliminating the need for a dataset pairing human sketches and 3D shapes. Additionally our method introduces a seamless in-position editing process as a byproduct of cross-modal part-aligned modelling. Operating in a low-dimensional implicit space our approach significantly reduces computational demands and processing time. + + + + Single View Refractive Index Tomography with Neural Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Single_View_Refractive_Index_Tomography_with_Neural_Fields_CVPR_2024_paper.pdf + Refractive Index Tomography is the inverse problem of reconstructing the continuously-varying 3D refractive index in a scene using 2D projected image measurements. Although a purely refractive field is not directly visible it bends light rays as they travel through space thus providing a signal for reconstruction. The effects of such fields appear in many scientific computer vision settings ranging from refraction due to transparent cells in microscopy to the lensing of distant galaxies caused by dark matter in astrophysics. Reconstructing these fields is particularly difficult due to the complex nonlinear effects of the refractive field on observed images. Furthermore while standard 3D reconstruction and tomography settings typically have access to observations of the scene from many viewpoints many refractive index tomography problem settings only have access to images observed from a single viewpoint. We introduce a method that leverages prior knowledge of light sources scattered throughout the refractive medium to help disambiguate the single-view refractive index tomography problem. We differentiably trace curved rays through a neural field representation of the refractive field and optimize its parameters to best reproduce the observed image. We demonstrate the efficacy of our approach by reconstructing simulated refractive fields analyze the effects of light source distribution on the recovered field and test our method on a simulated dark matter mapping problem where we successfully recover the 3D refractive field caused by a realistic dark matter distribution. 
+ + + + XFibrosis: Explicit Vessel-Fiber Modeling for Fibrosis Staging from Liver Pathology Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_XFibrosis_Explicit_Vessel-Fiber_Modeling_for_Fibrosis_Staging_from_Liver_Pathology_CVPR_2024_paper.pdf + The increasing prevalence of non-alcoholic fatty liver disease (NAFLD) has caused public concern in recent years. The high prevalence and risk of severe complications make monitoring NAFLD progression a public health priority. Fibrosis staging from liver biopsy images plays a key role in demonstrating the histological progression of NAFLD. Fibrosis mainly involves the deposition of fibers around vessels. Current deep learning-based fibrosis staging methods learn spatial relationships between tissue patches but do not explicitly consider the relationships between vessels and fibers leading to limited performance and poor interpretability. In this paper we propose an eXplicit vessel-fiber modeling method for Fibrosis staging from liver biopsy images namely XFibrosis. Specifically we transform vessels and fibers into graph-structured representations where their micro-structures are depicted by vessel-induced primal graphs and fiber-induced dual graphs respectively. Moreover the fiber-induced dual graphs also represent the connectivity information between vessels caused by fiber deposition. A primal-dual graph convolution module is designed to facilitate the learning of spatial relationships between vessels and fibers allowing for the joint exploration and interaction of their micro-structures. Experiments conducted on two datasets have shown that explicitly modeling the relationship between vessels and fibers leads to improved fibrosis staging and enhanced interpretability. + + + + UnO: Unsupervised Occupancy Fields for Perception and Forecasting + http://openaccess.thecvf.com//content/CVPR2024/papers/Agro_UnO_Unsupervised_Occupancy_Fields_for_Perception_and_Forecasting_CVPR_2024_paper.pdf + Perceiving the world and forecasting its future state is a critical task for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world --- traditionally with object detections and trajectory predictions or temporal bird's-eye-view (BEV) occupancy fields. However these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead we learn to perceive and forecast a continuous 4D (spatio-temporal) occupancy field with self-supervision from LiDAR data. This unsupervised world model can be easily and effectively transferred to downstream tasks. We tackle point cloud forecasting by adding a lightweight learned renderer and achieve state-of-the-art performance in Argoverse 2 nuScenes and KITTI. To further showcase its transferability we fine-tune our model for BEV semantic occupancy forecasting and show that it outperforms the fully supervised state-of-the-art especially when labeled data is scarce. Finally when compared to prior state-of-the-art on spatio-temporal geometric occupancy prediction our 4D world model achieves a much higher recall of objects from classes relevant to self-driving. 
+ + + + SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_SpatialVLM_Endowing_Vision-Language_Models_with_Spatial_Reasoning_Capabilities_CVPR_2024_paper.pdf + Understanding and reasoning about spatial relationships is crucial for Visual Question Answering (VQA) and robotics. Vision Language Models (VLMs) have shown impressive performance in some VQA benchmarks but struggle with 3D spatial reasoning such as recognizing distances or size differences between physical objects. This limitation may stem from a lack of 3D spatial knowledge in their training data. To address this we propose training VLMs with extensive spatial reasoning data from the internet. Our approach includes developing an automatic 3D spatial VQA data generation framework capable of creating 2 billion VQA examples from 10 million real-world images. We explore various factors in the training process such as data quality training pipeline and VLM architecture. Our work introduces the first Internet-scale 3D spatial reasoning dataset in metric space. By co-training a VLM with this dataset we significantly improve its performance in both qualitative and quantitative spatial VQA. Additionally this enhanced VLM enables new applications in chain-of-thought spatial reasoning and robotics particularly in quantitative estimation. + + + + InstructDiffusion: A Generalist Modeling Interface for Vision Tasks + http://openaccess.thecvf.com//content/CVPR2024/papers/Geng_InstructDiffusion_A_Generalist_Modeling_Interface_for_Vision_Tasks_CVPR_2024_paper.pdf + We present InstructDiffusion a unified and generic framework for aligning computer vision tasks with human instructions. Unlike existing approaches that integrate prior knowledge and pre-define the output space (e.g. categories and coordinates) for each vision task we cast diverse vision tasks into a human-intuitive image-manipulating process whose output space is a flexible and interactive pixel space. Concretely the model is built upon the diffusion process and is trained to predict pixels according to user instructions such as encircling the man's left shoulder in red or applying a blue mask to the left car. InstructDiffusion could handle a variety of vision tasks including understanding tasks (such as segmentation and keypoint detection) and generative tasks (such as editing and enhancement) and outperforms prior methods on novel datasets. This represents a solid step towards a generalist modeling interface for vision tasks advancing artificial general intelligence in the field of computer vision. + + + + Gated Fields: Learning Scene Reconstruction from Gated Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Ramazzina_Gated_Fields_Learning_Scene_Reconstruction_from_Gated_Videos_CVPR_2024_paper.pdf + Reconstructing outdoor 3D scenes from temporal observations is a challenge that recent work on neural fields has offered a new avenue for. However existing methods that recover scene properties such as geometry appearance or radiance solely from RGB captures often fail when handling poorly-lit or texture-deficient regions. Similarly recovering scenes with scanning lidar sensors is also difficult due to their low angular sampling rate which makes recovering expansive real-world scenes difficult. Tackling these gaps we introduce Gated Fields - a neural scene reconstruction method that utilizes active gated video sequences. 
To this end we propose a neural rendering approach that seamlessly incorporates time-gated capture and illumination. Our method exploits the intrinsic depth cues in the gated videos achieving precise and dense geometry reconstruction irrespective of ambient illumination conditions. We validate the method across day and night scenarios and find that Gated Fields compares favorably to RGB and LiDAR reconstruction methods. + + + + RadarDistill: Boosting Radar-based Object Detection Performance via Knowledge Distillation from LiDAR Features + http://openaccess.thecvf.com//content/CVPR2024/papers/Bang_RadarDistill_Boosting_Radar-based_Object_Detection_Performance_via_Knowledge_Distillation_from_CVPR_2024_paper.pdf + The inherent noisy and sparse characteristics of radar data pose challenges in finding effective representations for 3D object detection. In this paper we propose RadarDistill a novel knowledge distillation (KD) method which can improve the representation of radar data by leveraging LiDAR data. RadarDistill successfully transfers desirable characteristics of LiDAR features into radar features using three key components: Cross-Modality Alignment (CMA) Activation-based Feature Distillation (AFD) and Proposal-based Feature Distillation (PFD). CMA enhances the density of radar features by employing multiple layers of dilation operations effectively addressing the challenge of inefficient knowledge transfer from LiDAR to radar. AFD selectively transfers knowledge based on regions of the LiDAR features with a specific focus on areas where activation intensity exceeds a predefined threshold. PFD similarly guides the radar network to selectively mimic features from the LiDAR network within the object proposals. Our comparative analyses conducted on the nuScenes datasets demonstrate that RadarDistill achieves state-of-the-art (SOTA) performance for radar-only object detection task recording 20.5% in mAP and 43.7% in NDS. Also RadarDistill significantly improves the performance of the camera-radar fusion model. + + + + Diffusion-driven GAN Inversion for Multi-Modal Face Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Diffusion-driven_GAN_Inversion_for_Multi-Modal_Face_Image_Generation_CVPR_2024_paper.pdf + We present a new multi-modal face image generation method that converts a text prompt and a visual input such as a semantic mask or scribble map into a photo-realistic face image. To do this we combine the strengths of Generative Adversarial networks (GANs) and diffusion models (DMs) by employing the multi-modal features in the DM into the latent space of the pre-trained GANs. We present a simple mapping and a style modulation network to link two models and convert meaningful representations in feature maps and attention maps into latent codes. With GAN inversion the estimated latent codes can be used to generate 2D or 3D-aware facial images. We further present a multi-step training strategy that reflects textual and structural representations into the generated image. Our proposed network produces realistic 2D multi-view and stylized face images which align well with inputs. We validate our method by using pre-trained 2D and 3D GANs and our results outperform existing methods. Our project page is available at https://github.com/1211sh/Diffusiondriven_GAN-Inversion/.
+ + + + Low-Rank Knowledge Decomposition for Medical Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Low-Rank_Knowledge_Decomposition_for_Medical_Foundation_Models_CVPR_2024_paper.pdf + The popularity of large-scale pre-training has promoted the development of medical foundation models. However some studies have shown that although foundation models exhibit strong general feature extraction capabilities their performance on specific tasks is still inferior to task-specific methods. In this paper we explore a new perspective called "Knowledge Decomposition" to improve the performance on specific medical tasks which deconstructs the foundation model into multiple lightweight expert models each dedicated to a particular task with the goal of improving specialization while concurrently mitigating resource expenditure. To accomplish the above objective we design a novel framework named Low-Rank Knowledge Decomposition (LoRKD) which explicitly separates gradients by incorporating low-rank expert modules and the efficient knowledge separation convolution. Extensive experimental results demonstrate that the decomposed models perform well in terms of performance and transferability even surpassing the original foundation models. Source code is available at: https://github.com/MediaBrain-SJTU/LoRKD + + + + Steganographic Passport: An Owner and User Verifiable Credential for Deep Model IP Protection Without Retraining + http://openaccess.thecvf.com//content/CVPR2024/papers/Cui_Steganographic_Passport_An_Owner_and_User_Verifiable_Credential_for_Deep_CVPR_2024_paper.pdf + Ensuring the legal usage of deep models is crucial to promoting trustable accountable and responsible artificial intelligence innovation. Current passport-based methods that obfuscate model functionality for license-to-use and ownership verifications suffer from capacity and quality constraints as they require retraining the owner model for new users. They are also vulnerable to advanced Expanded Residual Block ambiguity attacks. We propose Steganographic Passport which uses an invertible steganographic network to decouple license-to-use from ownership verification by hiding the user's identity images into the owner-side passport and recovering them from their respective user-side passports. An irreversible and collision-resistant hash function is used to avoid exposing the owner-side passport from the derived user-side passports and increase the uniqueness of the model signature. To safeguard both the passport and model's weights against advanced ambiguity attacks an activation-level obfuscation is proposed for the verification branch of the owner's model. By jointly training the verification and deployment branches their weights become tightly coupled. The proposed method supports agile licensing of deep models by providing a strong ownership proof and license accountability without requiring a separate model retraining for the admission of every new user. Experiment results show that our Steganographic Passport outperforms other passport-based deep model protection methods in robustness against various known attacks. + + + + En3D: An Enhanced Generative Model for Sculpting 3D Humans from 2D Synthetic Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Men_En3D_An_Enhanced_Generative_Model_for_Sculpting_3D_Humans_from_CVPR_2024_paper.pdf + We present En3D an enhanced generative scheme for sculpting high-quality 3D human avatars.
Unlike previous works that rely on scarce 3D datasets or limited 2D collections with imbalanced viewing angles and imprecise pose priors our approach aims to develop a zero-shot 3D generative scheme capable of producing visually realistic geometrically accurate and content-wise diverse 3D humans without relying on pre-existing 3D or 2D assets. To address this challenge we introduce a meticulously crafted workflow that implements accurate physical modeling to learn the enhanced 3D generative model from synthetic 2D data. During inference we integrate optimization modules to bridge the gap between realistic appearances and coarse 3D shapes. Specifically En3D comprises three modules: a 3D generator that accurately models generalizable 3D humans with realistic appearance from synthesized balanced diverse and structured human images; a geometry sculptor that enhances shape quality using multi-view normal constraints for intricate human structure; and a texturing module that disentangles explicit texture maps with fidelity and editability leveraging semantical UV partitioning and a differentiable rasterizer. Experimental results show that our approach significantly outperforms prior works in terms of image quality geometry accuracy and content diversity. We also showcase the applicability of our generated avatars for animation and editing as well as the scalability of our approach for content-style free adaptation. + + + + Neural Visibility Field for Uncertainty-Driven Active Mapping + http://openaccess.thecvf.com//content/CVPR2024/papers/Xue_Neural_Visibility_Field_for_Uncertainty-Driven_Active_Mapping_CVPR_2024_paper.pdf + This paper presents Neural Visibility Field (NVF) a novel uncertainty quantification method for Neural Radiance Fields (NeRF) applied to active mapping. Our key insight is that regions not visible in the training views lead to inherently unreliable color predictions by NeRF at this region resulting in increased uncertainty in the synthesized views. To address this we propose to use Bayesian Networks to composite position-based field uncertainty into ray-based uncertainty in camera observations. Consequently NVF naturally assigns higher uncertainty to unobserved regions aiding robots to select the most informative next viewpoints. Extensive evaluations show that NVF excels not only in uncertainty quantification but also in scene reconstruction for active mapping outperforming existing methods. More details can be found at https://sites.google.com/view/nvf-cvpr24/. + + + + Tri-Perspective View Decomposition for Geometry-Aware Depth Completion + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_Tri-Perspective_View_Decomposition_for_Geometry-Aware_Depth_Completion_CVPR_2024_paper.pdf + Depth completion is a vital task for autonomous driving as it involves reconstructing the precise 3D geometry of a scene from sparse and noisy depth measurements. However most existing methods either rely only on 2D depth representations or directly incorporate raw 3D point clouds for compensation which are still insufficient to capture the fine-grained 3D geometry of the scene. To address this challenge we introduce Tri-Perspective View Decomposition (TPVD) a novel framework that can explicitly model 3D geometry. In particular (1) TPVD ingeniously decomposes the original point cloud into three 2D views one of which corresponds to the sparse depth input. 
(2) We design TPV Fusion to update the 2D TPV features through recurrent 2D-3D-2D aggregation where a Distance-Aware Spherical Convolution (DASC) is applied. (3) By adaptively choosing TPV affinitive neighbors the newly proposed Geometric Spatial Propagation Network (GSPN) further improves the geometric consistency. As a result our TPVD outperforms existing methods on KITTI NYUv2 and SUN RGBD. Furthermore we build a novel depth completion dataset named TOFDC which is acquired by the time-of-flight (TOF) sensor and the color camera on smartphones. + + + + Relaxed Contrastive Learning for Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Seo_Relaxed_Contrastive_Learning_for_Federated_Learning_CVPR_2024_paper.pdf + We propose a novel contrastive learning framework to effectively address the challenges of data heterogeneity in federated learning. We first analyze the inconsistency of gradient updates across clients during local training and establish its dependence on the distribution of feature representations leading to the derivation of the supervised contrastive learning (SCL) objective to mitigate local deviations. In addition we show that a naive integration of SCL into federated learning incurs representation collapse resulting in slow convergence and limited performance gains. To address this issue we introduce a relaxed contrastive learning loss that imposes a divergence penalty on excessively similar sample pairs within each class. This strategy prevents collapsed representations and enhances feature transferability facilitating collaborative training and leading to significant performance improvements. Our framework outperforms all existing federated learning approaches by significant margins on the standard benchmarks as demonstrated by extensive experimental results. The source code is available at our project page(https://github.com/skynbe/FedRCL). + + + + Harnessing the Power of MLLMs for Transferable Text-to-Image Person ReID + http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_Harnessing_the_Power_of_MLLMs_for_Transferable_Text-to-Image_Person_ReID_CVPR_2024_paper.pdf + Text-to-image person re-identification (ReID) retrieves pedestrian images according to textual descriptions. Manually annotating textual descriptions is time-consuming restricting the scale of existing datasets and therefore the generalization ability of ReID models. As a result we study the transferable text-to-image ReID problem where we train a model on our proposed large-scale database and directly deploy it to various datasets for evaluation. We obtain substantial training data via Multi-modal Large Language Models (MLLMs). Moreover we identify and address two key challenges in utilizing the obtained textual descriptions. First an MLLM tends to generate descriptions with similar structures causing the model to overfit specific sentence patterns. Thus we propose a novel method that uses MLLMs to caption images according to various templates. These templates are obtained using a multi-turn dialogue with a Large Language Model (LLM). Therefore we can build a large-scale dataset with diverse textual descriptions. Second an MLLM may produce incorrect descriptions. Hence we introduce a novel method that automatically identifies words in a description that do not correspond with the image. This method is based on the similarity between one text and all patch token embeddings in the image. 
Then we mask these words with a larger probability in the subsequent training epoch alleviating the impact of noisy textual descriptions. The experimental results demonstrate that our methods significantly boost the direct transfer text-to-image ReID performance. Benefiting from the pre-trained model weights we also achieve state-of-the-art performance in the traditional evaluation settings. + + + + Weakly Supervised Video Individual Counting + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Weakly_Supervised_Video_Individual_Counting_CVPR_2024_paper.pdf + Video Individual Counting (VIC) aims to predict the number of unique individuals in a single video. Existing methods learn representations based on trajectory labels for individuals which are annotation-expensive. To provide a more realistic reflection of the underlying practical challenge we introduce a weakly supervised VIC task wherein trajectory labels are not provided. Instead two types of labels are provided to indicate traffic entering the field of view (inflow) and leaving the field of view (outflow). We also propose the first solution as a baseline that formulates the task as a weakly supervised contrastive learning problem under group-level matching. In doing so we devise an end-to-end trainable soft contrastive loss to drive the network to distinguish inflow outflow and the remaining. To facilitate future study in this direction we generate annotations from the existing VIC datasets SenseCrowd and CroHD and also build a new dataset UAVVIC. Extensive results show that our baseline weakly supervised method outperforms supervised methods and thus little information is lost in the transition to the more practically relevant weakly supervised task. The code and trained model can be found at CGNet. + + + + Gaussian Shading: Provable Performance-Lossless Image Watermarking for Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Gaussian_Shading_Provable_Performance-Lossless_Image_Watermarking_for_Diffusion_Models_CVPR_2024_paper.pdf + Ethical concerns surrounding copyright protection and inappropriate content generation pose challenges for the practical implementation of diffusion models. One effective solution involves watermarking the generated images. However existing methods often compromise the model performance or require additional training which is undesirable for operators and users. To address this issue we propose Gaussian Shading a diffusion model watermarking technique that is both performance-lossless and training-free while serving the dual purpose of copyright protection and tracing of offending content. Our watermark embedding is free of model parameter modifications and thus is plug-and-play. We map the watermark to latent representations following a standard Gaussian distribution which is indistinguishable from latent representations obtained from the non-watermarked diffusion model. Therefore we can achieve watermark embedding with lossless performance for which we also provide theoretical proof. Furthermore since the watermark is intricately linked with image semantics it exhibits resilience to lossy processing and erasure attempts. The watermark can be extracted by Denoising Diffusion Implicit Models (DDIM) inversion and inverse sampling. We evaluate Gaussian Shading on multiple versions of Stable Diffusion and the results demonstrate that Gaussian Shading not only is performance-lossless but also outperforms existing methods in terms of robustness.
+ + + + DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_DocRes_A_Generalist_Model_Toward_Unifying_Document_Image_Restoration_Tasks_CVPR_2024_paper.pdf + Document image restoration is a crucial aspect of Document AI systems as the quality of document images significantly influences the overall performance. Prevailing methods address distinct restoration tasks independently leading to intricate systems and the incapability to harness the potential synergies of multi-task learning. To overcome this challenge we propose DocRes a generalist model that unifies five document image restoration tasks including dewarping deshadowing appearance enhancement deblurring and binarization. To instruct DocRes to perform various restoration tasks we propose a novel visual prompt approach called Dynamic Task-Specific Prompt (DTSPrompt). The DTSPrompt for different tasks comprises distinct prior features which are additional characteristics extracted from the input image. Beyond its role as a cue for task-specific execution DTSPrompt can also serve as supplementary information to enhance the model's performance. Moreover DTSPrompt is more flexible than prior visual prompt approaches as it can be seamlessly applied and adapted to inputs with high and variable resolutions. Experimental results demonstrate that DocRes achieves competitive or superior performance compared to existing state-of-the-art task-specific models. This underscores the potential of DocRes across a broader spectrum of document image restoration tasks. The source code is publicly available at https://github.com/ZZZHANG-jx/DocRes. + + + + Honeybee: Locality-enhanced Projector for Multimodal LLM + http://openaccess.thecvf.com//content/CVPR2024/papers/Cha_Honeybee_Locality-enhanced_Projector_for_Multimodal_LLM_CVPR_2024_paper.pdf + In Multimodal Large Language Models (MLLMs) a visual projector plays a crucial role in bridging pre-trained vision encoders with LLMs enabling profound visual understanding while harnessing the LLMs' robust capabilities. Despite the importance of the visual projector it has been relatively less explored. In this study we first identify two essential projector properties: (i) flexibility in managing the number of visual tokens crucial for MLLMs' overall efficiency and (ii) preservation of local context from visual features vital for spatial understanding. Based on these findings we propose a novel projector design that is both flexible and locality-enhanced effectively satisfying the two desirable properties. Additionally we present comprehensive strategies to effectively utilize multiple and multifaceted instruction datasets. Through extensive experiments we examine the impact of individual design choices. Finally our proposed MLLM Honeybee remarkably outperforms previous state-of-the-art methods across various benchmarks including MME MMBench SEED-Bench and LLaVA-Bench achieving significantly higher efficiency. Code and models are available at https://github.com/kakaobrain/honeybee. + + + + Learned Trajectory Embedding for Subspace Clustering + http://openaccess.thecvf.com//content/CVPR2024/papers/Lochman_Learned_Trajectory_Embedding_for_Subspace_Clustering_CVPR_2024_paper.pdf + Clustering multiple motions from observed point trajectories is a fundamental task in understanding dynamic scenes. 
Most motion models require multiple tracks to estimate their parameters hence identifying clusters when multiple motions are observed is a very challenging task. This is even aggravated for high-dimensional motion models. The starting point of our work is that this high-dimensionality of motion model can actually be leveraged to our advantage as sufficiently long trajectories identify the underlying motion uniquely in practice. Consequently we propose to learn a mapping from trajectories to embedding vectors that represent the generating motion. The obtained trajectory embeddings are useful for clustering multiple observed motions but are also trained to contain sufficient information to recover the parameters of the underlying motion by utilizing a geometric loss. We therefore are able to use only weak supervision from given motion segmentation to train this mapping. The entire algorithm consisting of trajectory embedding clustering and motion parameter estimation is highly efficient. We conduct experiments on the Hopkins155 Hopkins12 and KT3DMoSeg datasets and show state-of-the-art performance of our proposed method for trajectory-based motion segmentation on full sequences and its competitiveness on the occluded sequences. Project page: https://ylochman.github.io/trajectory-embedding. + + + + HarmonyView: Harmonizing Consistency and Diversity in One-Image-to-3D + http://openaccess.thecvf.com//content/CVPR2024/papers/Woo_HarmonyView_Harmonizing_Consistency_and_Diversity_in_One-Image-to-3D_CVPR_2024_paper.pdf + Recent progress in single-image 3D generation highlights the importance of multi-view coherency leveraging 3D priors from large-scale diffusion models pretrained on Internet-scale images. However the aspect of novel-view diversity remains underexplored within the research landscape due to the ambiguity in converting a 2D image into 3D content where numerous potential shapes can emerge. Here we aim to address this research gap by simultaneously addressing both consistency and diversity. Yet striking a balance between these two aspects poses a considerable challenge due to their inherent trade-offs. This work introduces HarmonyView a simple yet effective diffusion sampling technique adept at decomposing two intricate aspects in single-image 3D generation: consistency and diversity. This approach paves the way for a more nuanced exploration of the two critical dimensions within the sampling process. Moreover we propose a new evaluation metric based on CLIP image and text encoders to comprehensively assess the diversity of the generated views which closely aligns with human evaluators' judgments. In experiments HarmonyView achieves a harmonious balance demonstrating a win-win scenario in both consistency and diversity. + + + + UnSAMFlow: Unsupervised Optical Flow Guided by Segment Anything Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_UnSAMFlow_Unsupervised_Optical_Flow_Guided_by_Segment_Anything_Model_CVPR_2024_paper.pdf + Traditional unsupervised optical flow methods are vulnerable to occlusions and motion boundaries due to lack of object-level information. Therefore we propose UnSAMFlow an unsupervised flow network that also leverages object information from the latest foundation model Segment Anything Model (SAM). We first include a self-supervised semantic augmentation module tailored to SAM masks. We also analyze the poor gradient landscapes of traditional smoothness losses and propose a new smoothness definition based on homography instead. 
A simple yet effective mask feature module has also been added to further aggregate features on the object level. With all these adaptations our method produces clear optical flow estimation with sharp boundaries around objects which outperforms state-of-the-art methods on both KITTI and Sintel datasets. Our method also generalizes well across domains and runs very efficiently. + + + + Exploiting Inter-sample and Inter-feature Relations in Dataset Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_Exploiting_Inter-sample_and_Inter-feature_Relations_in_Dataset_Distillation_CVPR_2024_paper.pdf + Dataset distillation has emerged as a promising approach in deep learning enabling efficient training with small synthetic datasets derived from larger real ones. Particularly distribution matching-based distillation methods attract attention thanks to its effectiveness and low computational cost. However these methods face two primary limitations: the dispersed feature distribution within the same class in synthetic datasets reducing class discrimination and an exclusive focus on mean feature consistency lacking precision and comprehensiveness. To address these challenges we introduce two novel constraints: a class centralization constraint and a covariance matching constraint. The class centralization constraint aims to enhance class discrimination by more closely clustering samples within classes. The covariance matching constraint seeks to achieve more accurate feature distribution matching between real and synthetic datasets through local feature covariance matrices particularly beneficial when sample sizes are much smaller than the number of features. Experiments demonstrate notable improvements with these constraints yielding performance boosts of up to 6.6% on CIFAR10 2.9% on SVHN 2.5% on CIFAR100 and 2.5% on TinyImageNet compared to the state-of-the-art relevant methods. In addition our method maintains robust performance in cross-architecture settings with a maximum performance drop of 1.7% on four architectures. Code is available at https://github.com/VincenDen/IID. + + + + Context-based and Diversity-driven Specificity in Compositional Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Context-based_and_Diversity-driven_Specificity_in_Compositional_Zero-Shot_Learning_CVPR_2024_paper.pdf + Compositional Zero-Shot Learning (CZSL) aims to recognize unseen attribute-object pairs based on a limited set of observed examples. Current CZSL methodologies despite their advancements tend to neglect the distinct specificity levels present in attributes. For instance given images of sliced strawberries they may fail to prioritize `Sliced-Strawberry' over a generic `Red-Strawberry' despite the former being more informative. They also suffer from ballooning search space when shifting from Close-World (CW) to Open-World (OW) CZSL. To address the issues we introduce the Context-based and Diversity-driven Specificity learning framework for CZSL (CDS-CZSL). Our framework evaluates the specificity of attributes by considering the diversity of objects they apply to and their related context. This novel approach allows for more accurate predictions by emphasizing specific attribute-object pairs and improves composition filtering in OW-CZSL. We conduct experiments in both CW and OW scenarios and our model achieves state-of-the-art results across three datasets. 
+ + + + Rethinking Diffusion Model for Multi-Contrast MRI Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Rethinking_Diffusion_Model_for_Multi-Contrast_MRI_Super-Resolution_CVPR_2024_paper.pdf + Recently diffusion models (DM) have been applied in magnetic resonance imaging (MRI) super-resolution (SR) reconstruction exhibiting impressive performance especially with regard to detailed reconstruction. However the current DM-based SR reconstruction methods still face the following issues: (1) They require a large number of iterations to reconstruct the final image which is inefficient and consumes a significant amount of computational resources. (2) The results reconstructed by these methods are often misaligned with the real high-resolution images leading to remarkable distortion in the reconstructed MR images. To address the aforementioned issues we propose an efficient diffusion model for multi-contrast MRI SR named as DiffMSR. Specifically we apply DM in a highly compact low-dimensional latent space to generate prior knowledge with high-frequency detail information. The highly compact latent space ensures that DM requires only a few simple iterations to produce accurate prior knowledge. In addition we design the Prior-Guide Large Window Transformer (PLWformer) as the decoder for DM which can extend the receptive field while fully utilizing the prior knowledge generated by DM to ensure that the reconstructed MR image remains undistorted. Extensive experiments on public and clinical datasets demonstrate that our DiffMSR outperforms state-of-the-art methods. + + + + Unknown Prompt the only Lacuna: Unveiling CLIP's Potential for Open Domain Generalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Singha_Unknown_Prompt_the_only_Lacuna_Unveiling_CLIPs_Potential_for_Open_CVPR_2024_paper.pdf + We delve into Open Domain Generalization (ODG) marked by domain and category shifts between training's labeled source and testing's unlabeled target domains. Existing solutions to ODG face limitations due to constrained generalizations of traditional CNN backbones and errors in detecting target open samples in the absence of prior knowledge. Addressing these pitfalls we introduce ODG-CLIP harnessing the semantic prowess of the vision-language model CLIP. Our framework brings forth three primary innovations: Firstly distinct from prevailing paradigms we conceptualize ODG as a multi-class classification challenge encompassing both known and novel categories. Central to our approach is modeling a unique prompt tailored for detecting unknown class samples and to train this we employ a readily accessible stable diffusion model elegantly generating proxy images for the open class. Secondly aiming for domain-tailored classification (prompt) weights while ensuring a balance of precision and simplicity we devise a novel visual style-centric prompt learning mechanism. Finally we infuse images with class-discriminative knowledge derived from the prompt space to augment the fidelity of CLIP's visual embeddings. We introduce a novel objective to safeguard the continuity of this infused semantic intel across domains especially for the shared classes. Through rigorous testing on diverse datasets covering closed and open-set DG contexts ODG-CLIP demonstrates clear supremacy consistently outpacing peers with performance boosts between 8%-16%. Code will be available at https://github.com/mainaksingha01/ODG-CLIP. 
+ + + + From Coarse to Fine-Grained Open-Set Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Lang_From_Coarse_to_Fine-Grained_Open-Set_Recognition_CVPR_2024_paper.pdf + Open-set recognition (OSR) methods aim to identify whether or not a test example belongs to a category observed during training. Depending on how visually similar a test example is to the training categories the OSR task can be easy or extremely challenging. However the vast majority of previous work has studied OSR in the presence of large coarse-grained semantic shifts. In contrast many real-world problems are inherently fine-grained which means that test examples may be highly visually similar to the training categories. Motivated by this observation we investigate three aspects of OSR: label granularity similarity between the open- and closed-sets and the role of hierarchical supervision during training. To study these dimensions we curate new open-set splits of a large fine-grained visual categorization dataset. Our analysis results in several interesting findings including: (i) the best OSR method to use is heavily dependent on the degree of semantic shift present and (ii) hierarchical representation learning can improve coarse-grained OSR but has little effect on fine-grained OSR performance. To further enhance fine-grained OSR performance we propose a hierarchy-adversarial learning method to discourage hierarchical structure in the representation space which results in a perhaps counter-intuitive behaviour and a relative improvement in fine-grained OSR of up to 2% in AUROC and 7% in AUPR over standard training. Code and data are available: langnico.github.io/fine-grained-osr. + + + + OmniViD: A Generative Framework for Universal Video Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_OmniViD_A_Generative_Framework_for_Universal_Video_Understanding_CVPR_2024_paper.pdf + The core of video understanding tasks such as recognition captioning and tracking is to automatically detect objects or actions in a video and analyze their temporal evolution. Despite sharing a common goal different tasks often rely on distinct model architectures and annotation formats. In contrast natural language processing benefits from a unified output space i.e. text sequences which simplifies the training of powerful foundational language models such as GPT-3 with extensive training corpora. Inspired by this we seek to unify the output space of video understanding tasks by using languages as labels and additionally introducing time and box tokens. In this way a variety of video tasks could be formulated as video-grounded token generation. This enables us to address various types of video tasks including classification (such as action recognition) captioning (covering clip captioning video question answering and dense video captioning) and localization tasks (such as visual object tracking) within a fully shared encoder-decoder architecture following a generative framework. Through comprehensive experiments we demonstrate such a simple and straightforward idea is quite effective and can achieve state-of-the-art or competitive results on seven video benchmarks providing a novel perspective for more universal video understanding. Code is available at https://github.com/wangjk666/OmniVid.
+ + + + Naturally Supervised 3D Visual Grounding with Language-Regularized Concept Learners + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_Naturally_Supervised_3D_Visual_Grounding_with_Language-Regularized_Concept_Learners_CVPR_2024_paper.pdf + 3D visual grounding is a challenging task that often requires direct and dense supervision notably the semantic label for each object in the scene. In this paper we instead study the naturally supervised setting that learns from only 3D scene and QA pairs where prior works underperform. We propose the Language-Regularized Concept Learner (LARC) which uses constraints from language as regularization to significantly improve the accuracy of neuro-symbolic concept learners in the naturally supervised setting. Our approach is based on two core insights: the first is that language constraints (e.g. a word's relation to another) can serve as effective regularization for structured representations in neuro-symbolic models; the second is that we can query large language models to distill such constraints from language properties. We show that LARC improves performance of prior works in naturally supervised 3D visual grounding and demonstrates a wide range of 3D visual reasoning capabilities--from zero-shot composition to data efficiency and transferability. Our method represents a promising step towards regularizing structured visual reasoning frameworks with language-based priors for learning in settings without dense supervision. + + + + CA-Jaccard: Camera-aware Jaccard Distance for Person Re-identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_CA-Jaccard_Camera-aware_Jaccard_Distance_for_Person_Re-identification_CVPR_2024_paper.pdf + Person re-identification (re-ID) is a challenging task that aims to learn discriminative features for person retrieval. In person re-ID Jaccard distance is a widely used distance metric especially in re-ranking and clustering scenarios. However we discover that camera variation has a significant negative impact on the reliability of Jaccard distance. In particular Jaccard distance calculates the distance based on the overlap of relevant neighbors. Due to camera variation intra-camera samples dominate the relevant neighbors which reduces the reliability of the neighbors by introducing intra-camera negative samples and excluding inter-camera positive samples. To overcome this problem we propose a novel camera-aware Jaccard (CA-Jaccard) distance that leverages camera information to enhance the reliability of Jaccard distance. Specifically we design camera-aware k-reciprocal nearest neighbors (CKRNNs) to find k-reciprocal nearest neighbors on the intra-camera and inter-camera ranking lists which improves the reliability of relevant neighbors and guarantees the contribution of inter-camera samples in the overlap. Moreover we propose a camera-aware local query expansion (CLQE) to mine reliable samples in relevant neighbors by exploiting camera variation as a strong constraint and assign these samples higher weights in overlap further improving the reliability. Our CA-Jaccard distance is simple yet effective and can serve as a general distance metric for person re-ID methods with high reliability and low computational cost. Extensive experiments demonstrate the effectiveness of our method. Code is available at https://github.com/chen960/CA-Jaccard/. 
+ + + + AutoAD III: The Prequel - Back to the Pixels + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_AutoAD_III_The_Prequel_-_Back_to_the_Pixels_CVPR_2024_paper.pdf + Generating Audio Description (AD) for movies is a challenging task that requires fine-grained visual understanding and an awareness of the characters and their names. Currently visual language models for AD generation are limited by a lack of suitable training data and also their evaluation is hampered by using performance measures not specialized to the AD domain. In this paper we make three contributions: (i) We propose two approaches for constructing AD datasets with aligned video data and build training and evaluation datasets using these. These datasets will be publicly released; (ii) We develop a Q-former-based architecture which ingests raw video and generates AD using frozen pre-trained visual encoders and large language models; and (iii) We provide new evaluation metrics to benchmark AD quality that are well matched to human performance. Taken together we improve the state of the art on AD generation. + + + + Characteristics Matching Based Hash Codes Generation for Efficient Fine-grained Image Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Characteristics_Matching_Based_Hash_Codes_Generation_for_Efficient_Fine-grained_Image_CVPR_2024_paper.pdf + The rapidly growing scale of data in practice poses demands on the efficiency of retrieval models. However for fine-grained image retrieval task there are inherent contradictions in the design of hashing based efficient models. Firstly the limited information embedding capacity of low-dimensional binary hash codes coupled with the detailed information required to describe fine-grained categories results in a contradiction in feature learning. Secondly there is also a contradiction between the complexity of fine-grained feature extraction models and retrieval efficiency. To address these issues in this paper we propose the characteristics matching based hash codes generation method. Coupled with the cross-layer semantic information transfer module and the multi-region feature embedding module the proposed method can generate hash codes that effectively capture fine-grained differences among samples while ensuring efficient inference. Extensive experiments on widely used datasets demonstrate that our method can significantly outperform state-of-the-art methods. + + + + Matching 2D Images in 3D: Metric Relative Pose from Metric Correspondences + http://openaccess.thecvf.com//content/CVPR2024/papers/Barroso-Laguna_Matching_2D_Images_in_3D_Metric_Relative_Pose_from_Metric_CVPR_2024_paper.pdf + Given two images we can estimate the relative camera pose between them by establishing image-to-image correspondences. Usually correspondences are 2D-to-2D and the pose we estimate is defined only up to scale. Some applications aiming at instant augmented reality anywhere require scale-metric pose estimates and hence they rely on external depth estimators to recover the scale. We present MicKey a keypoint matching pipeline that is able to predict metric correspondences in 3D camera space. By learning to match 3D coordinates across images we are able to infer the metric relative pose without depth measurements. Depth measurements are also not required for training nor are scene reconstructions or image overlap information. MicKey is supervised only by pairs of images and their relative poses. 
MicKey achieves state-of-the-art performance on the Map-Free Relocalisation benchmark while requiring less supervision than competing approaches. + + + + M3-UDA: A New Benchmark for Unsupervised Domain Adaptive Fetal Cardiac Structure Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Pu_M3-UDA_A_New_Benchmark_for_Unsupervised_Domain_Adaptive_Fetal_Cardiac_CVPR_2024_paper.pdf + The anatomical structure detection of fetal cardiac views is crucial for diagnosing fetal congenital heart disease. In practice there is a large domain gap between different hospitals' data such as the variable data quality due to differences in acquisition equipment. In addition accurate annotation information provided by obstetrician experts is always very costly or even unavailable. This study explores the unsupervised domain adaptive fetal cardiac structure detection issue. Existing unsupervised domain adaptive object detection (UDAOD) approaches mainly focus on detecting objects in natural scenes such as Foggy Cityscapes where the structural relationships of natural scenes are uncertain. Unlike all previous UDAOD scenarios we first collected a Fetal Cardiac Structure dataset from two hospital centers called FCS and proposed a multi-matching UDA approach (M3-UDA) including Histogram Matching (HM) Sub-structure Matching (SM) and Global-structure Matching (GM) to better transfer the topological knowledge of anatomical structure for UDA detection in medical scenarios. HM mitigates the domain gap between the source and target caused by pixel transformation. SM fuses the different angle information of the sub-structure to obtain the local topological knowledge for bridging the domain gap of the internal sub-structure. GM is designed to align the global topological knowledge of the whole organ from the source and target domain. Extensive experiments on our collected FCS and on CardiacUDA show that M3-UDA outperforms existing UDAOD studies significantly. All datasets and source code are available at: https://github.com/xmed-lab/M3-UDA + + + + Chat-UniVi: Unified Visual Representation Empowers Large Language Models with Image and Video Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Jin_Chat-UniVi_Unified_Visual_Representation_Empowers_Large_Language_Models_with_Image_CVPR_2024_paper.pdf + Large language models have demonstrated impressive universal capabilities across a wide range of open-ended tasks and have extended their utility to encompass multimodal conversations. However existing methods encounter challenges in effectively handling both image and video understanding particularly with limited visual tokens. In this work we introduce Chat-UniVi a Unified Vision-language model capable of comprehending and engaging in conversations involving images and videos through a unified visual representation. Specifically we employ a set of dynamic visual tokens to uniformly represent images and videos. This representation framework empowers the model to efficiently utilize a limited number of visual tokens to simultaneously capture the spatial details necessary for images and the comprehensive temporal relationship required for videos. Moreover we leverage a multi-scale representation enabling the model to perceive both high-level semantic concepts and low-level visual details. Notably Chat-UniVi is trained on a mixed dataset containing both images and videos allowing direct application to tasks involving both mediums without requiring any modifications. 
Extensive experimental results demonstrate that Chat-UniVi consistently outperforms even existing methods exclusively designed for either images or videos. Code is available at https://github.com/PKU-YuanGroup/Chat-UniVi. + + + + Token Transformation Matters: Towards Faithful Post-hoc Explanation for Vision Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Token_Transformation_Matters_Towards_Faithful_Post-hoc_Explanation_for_Vision_Transformer_CVPR_2024_paper.pdf + While Transformers have rapidly gained popularity in various computer vision applications post-hoc explanations of their internal mechanisms remain largely unexplored. Vision Transformers extract visual information by representing image regions as transformed tokens and integrating them via attention weights. However existing post-hoc explanation methods merely consider these attention weights neglecting crucial information from the transformed tokens which fails to accurately illustrate the rationales behind the models' predictions. To incorporate the influence of token transformation into interpretation we propose TokenTM a novel post-hoc explanation method that utilizes our introduced measurement of token transformation effects. Specifically we quantify token transformation effects by measuring changes in token lengths and correlations in their directions pre- and post-transformation. Moreover we develop initialization and aggregation rules to integrate both attention weights and token transformation effects across all layers capturing holistic token contributions throughout the model. Experimental results on segmentation and perturbation tests demonstrate the superiority of our proposed TokenTM compared to state-of-the-art Vision Transformer explanation methods. + + + + Bayesian Differentiable Physics for Cloth Digitalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Gong_Bayesian_Differentiable_Physics_for_Cloth_Digitalization_CVPR_2024_paper.pdf + We propose a new method for cloth digitalization. Deviating from existing methods which learn from data captured under relatively casual settings we propose to learn from data captured in strictly tested measuring protocols and find plausible physical parameters of the cloths. However such data is currently absent so we first propose a new dataset with accurate cloth measurements. Further the data size is considerably smaller than the ones in current deep learning due to the nature of the data capture process. To learn from small data we propose a new Bayesian differentiable cloth model to estimate the complex material heterogeneity of real cloths. It can provide highly accurate digitalization from very limited data samples. Through exhaustive evaluation and comparison we show our method is accurate in cloth digitalization efficient in learning from limited data samples and general in capturing material variations. Code and data are available in: https://github.com/realcrane/Bayesian-Differentiable-Physics-for-Cloth-Digitalization + + + + Higher-order Relational Reasoning for Pedestrian Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Higher-order_Relational_Reasoning_for_Pedestrian_Trajectory_Prediction_CVPR_2024_paper.pdf + Social relations have substantial impacts on the potential trajectories of each individual. Modeling these dynamics has been a central solution for more precise and accurate trajectory forecasting. 
However previous works ignore the importance of 'social depth' meaning the influences flowing from different degrees of social relations. In this work we propose HighGraph a graph-based pedestrian relational reasoning method that captures the higher-order dynamics of social interactions. First we construct a collision-aware relation graph based on the agents' observed trajectories. Upon this graph structure we build our core module that aggregates the agent features from diverse social distances. As a result the network is able to model complex social relations thereby yielding more accurate and socially acceptable trajectories. Our HighGraph is a plug-and-play module that can be easily applied to any current trajectory predictors. Extensive experiments with ETH/UCY and SDD datasets demonstrate that our HighGraph noticeably improves the previous state-of-the-art baselines both quantitatively and qualitatively. + + + + RealNet: A Feature Selection Network with Realistic Synthetic Anomaly for Anomaly Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_RealNet_A_Feature_Selection_Network_with_Realistic_Synthetic_Anomaly_for_CVPR_2024_paper.pdf + Self-supervised feature reconstruction methods have shown promising advances in industrial image anomaly detection and localization. Despite this progress these methods still face challenges in synthesizing realistic and diverse anomaly samples as well as addressing the feature redundancy and pre-training bias of pre-trained features. In this work we introduce RealNet a feature reconstruction network with realistic synthetic anomaly and adaptive feature selection. It is incorporated with three key innovations: First we propose Strength-controllable Diffusion Anomaly Synthesis (SDAS) a diffusion process-based synthesis strategy capable of generating samples with varying anomaly strengths that mimic the distribution of real anomalous samples. Second we develop Anomaly-aware Features Selection (AFS) a method for selecting representative and discriminative pre-trained feature subsets to improve anomaly detection performance while controlling computational costs. Third we introduce Reconstruction Residuals Selection (RRS) a strategy that adaptively selects discriminative residuals for comprehensive identification of anomalous regions across multiple levels of granularity. We assess RealNet on four benchmark datasets and our results demonstrate significant improvements in both Image AUROC and Pixel AUROC compared to the current state-of-the-art methods. The code data and models are available at https://github.com/cnulab/RealNet. + + + + Multi-modal Instruction Tuned LLMs with Fine-grained Visual Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Multi-modal_Instruction_Tuned_LLMs_with_Fine-grained_Visual_Perception_CVPR_2024_paper.pdf + Multimodal Large Language Models (MLLMs) leverage Large Language Models as a cognitive framework for diverse visual-language tasks. Recent efforts have been made to equip MLLMs with visual perceiving and grounding capabilities. However there still remains a gap in providing fine-grained pixel-level perceptions and extending interactions beyond text-specific inputs. In this work we propose AnyRef a general MLLM model that can generate pixel-wise object perceptions and natural language descriptions from multi-modality references such as texts boxes images or audio. 
This innovation empowers users with greater flexibility to engage with the model beyond textual and regional prompts without modality-specific designs. Through our proposed refocusing mechanism the generated grounding output is guided to better focus on the referenced object implicitly incorporating additional pixel-level supervision. This simple modification utilizes attention scores generated during the inference of LLM eliminating the need for extra computations while exhibiting performance enhancements in both grounding masks and referring expressions. With only publicly available training data our model achieves state-of-the-art results across multiple benchmarks including diverse modality referring segmentation and region-level referring expression generation. Code and models are available at https://github.com/jwh97nn/AnyRef + + + + LaMPilot: An Open Benchmark Dataset for Autonomous Driving with Language Model Programs + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_LaMPilot_An_Open_Benchmark_Dataset_for_Autonomous_Driving_with_Language_CVPR_2024_paper.pdf + Autonomous driving (AD) has made significant strides in recent years. However existing frameworks struggle to interpret and execute spontaneous user instructions such as "overtake the car ahead." Large Language Models (LLMs) have demonstrated impressive reasoning capabilities showing potential to bridge this gap. In this paper we present LaMPilot a novel framework that integrates LLMs into AD systems enabling them to follow user instructions by generating code that leverages established functional primitives. We also introduce LaMPilot-Bench the first benchmark dataset specifically designed to quantitatively evaluate the efficacy of language model programs in AD. Adopting the LaMPilot framework we conduct extensive experiments to assess the performance of off-the-shelf LLMs on LaMPilot-Bench. Our results demonstrate the potential of LLMs in handling diverse driving scenarios and following user instructions in driving. To facilitate further research in this area we release our code and data at GitHub.com/PurdueDigitalTwin/LaMPilot. + + + + FairDeDup: Detecting and Mitigating Vision-Language Fairness Disparities in Semantic Dataset Deduplication + http://openaccess.thecvf.com//content/CVPR2024/papers/Slyman_FairDeDup_Detecting_and_Mitigating_Vision-Language_Fairness_Disparities_in_Semantic_Dataset_CVPR_2024_paper.pdf + Recent dataset deduplication techniques have demonstrated that content-aware dataset pruning can dramatically reduce the cost of training Vision-Language Pretrained (VLP) models without significant performance losses compared to training on the original dataset. These results have been based on pruning commonly used image-caption datasets collected from the web -- datasets that are known to harbor harmful social biases that may then be codified in trained models. In this work we evaluate how deduplication affects the prevalence of these biases in the resulting trained models and introduce an easy-to-implement modification to the recent SemDeDup algorithm that can reduce the negative effects that we observe. When examining CLIP-style models trained on deduplicated variants of LAION-400M we find our proposed FairDeDup algorithm consistently leads to improved fairness metrics over SemDeDup on the FairFace and FACET datasets while maintaining zero-shot performance on CLIP benchmarks. 
+ + + + Modality-agnostic Domain Generalizable Medical Image Segmentation by Multi-Frequency in Multi-Scale Attention + http://openaccess.thecvf.com//content/CVPR2024/papers/Nam_Modality-agnostic_Domain_Generalizable_Medical_Image_Segmentation_by_Multi-Frequency_in_Multi-Scale_CVPR_2024_paper.pdf + Generalizability in deep neural networks plays a pivotal role in medical image segmentation. However deep learning-based medical image analyses tend to overlook the importance of frequency variance which is a critical element for achieving a model that is both modality-agnostic and domain-generalizable. Additionally various models fail to account for the potential information loss that can arise from multi-task learning under deep supervision a factor that can impair the model's representation ability. To address these challenges we propose a Modality-agnostic Domain Generalizable Network (MADGNet) for medical image segmentation which comprises two key components: a Multi-Frequency in Multi-Scale Attention (MFMSA) block and an Ensemble Sub-Decoding Module (E-SDM). The MFMSA block refines the process of spatial feature extraction particularly in capturing boundary features by incorporating multi-frequency and multi-scale features thereby offering informative cues for tissue outline and anatomical structures. Moreover we propose E-SDM to mitigate information loss in multi-task learning with deep supervision especially during substantial upsampling from low resolution. We evaluate the segmentation performance of MADGNet across six modalities and fifteen datasets. Through extensive experiments we demonstrate that MADGNet consistently outperforms state-of-the-art models across various modalities showcasing superior segmentation performance. This affirms MADGNet as a robust solution for medical image segmentation that excels in diverse imaging scenarios. Our MADGNet code is available in GitHub Link. + + + + Auto MC-Reward: Automated Dense Reward Design with Large Language Models for Minecraft + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Auto_MC-Reward_Automated_Dense_Reward_Design_with_Large_Language_Models_CVPR_2024_paper.pdf + Many reinforcement learning environments (e.g. Minecraft) provide only sparse rewards that indicate task completion or failure with binary values. The challenge in exploration efficiency in such environments makes it difficult for reinforcement-learning-based agents to learn complex tasks. To address this, this paper introduces an advanced learning system named Auto MC-Reward that leverages Large Language Models (LLMs) to automatically design dense reward functions thereby enhancing the learning efficiency. Auto MC-Reward consists of three important components: Reward Designer Reward Critic and Trajectory Analyzer. Given the environment information and task descriptions the Reward Designer first designs the reward function by coding an executable Python function with predefined observation inputs. Then our Reward Critic will be responsible for verifying the code checking whether the code is self-consistent and free of syntax and semantic errors. Further the Trajectory Analyzer summarizes possible failure causes and provides refinement suggestions according to collected trajectories. In the next round the Reward Designer will further refine and iterate the dense reward function based on feedback. 
Experiments demonstrate a significant improvement in the success rate and learning efficiency of our agents in complex tasks in Minecraft such as obtaining diamond with the efficient ability to avoid lava and efficiently explore trees and animals that are sparse in the plains biome. + + + + GenFlow: Generalizable Recurrent Flow for 6D Pose Refinement of Novel Objects + http://openaccess.thecvf.com//content/CVPR2024/papers/Moon_GenFlow_Generalizable_Recurrent_Flow_for_6D_Pose_Refinement_of_Novel_CVPR_2024_paper.pdf + Despite the progress of learning-based methods for 6D object pose estimation the trade-off between accuracy and scalability for novel objects still exists. Specifically previous methods for novel objects do not make good use of the target object's 3D shape information since they focus on generalization by processing the shape indirectly making them less effective. We present GenFlow an approach that enables both accuracy and generalization to novel objects with the guidance of the target object's shape. Our method predicts optical flow between the rendered image and the observed image and refines the 6D pose iteratively. It boosts the performance by a constraint of the 3D shape and the generalizable geometric knowledge learned from an end-to-end differentiable system. We further improve our model by designing a cascade network architecture to exploit the multi-scale correlations and coarse-to-fine refinement. GenFlow ranked first on the unseen object pose estimation benchmarks in both the RGB and RGB-D cases. It also achieves performance competitive with existing state-of-the-art methods for the seen object pose estimation without any fine-tuning. + + + + Logarithmic Lenses: Exploring Log RGB Data for Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Maxwell_Logarithmic_Lenses_Exploring_Log_RGB_Data_for_Image_Classification_CVPR_2024_paper.pdf + The design of deep network architectures and training methods in computer vision has been well-explored. However in almost all cases the images have been used as provided with little exploration of pre-processing steps beyond normalization and data augmentation. Virtually all images posted on the web or captured by devices are processed for viewing by humans. Is the pipeline used for humans also best for use by computers and deep networks? The human visual system uses logarithmic sensors; differences and sums correspond to ratios and products. Features in log space will be invariant to intensity changes and robust to color balance changes. Log RGB space also reveals structure that is corrupted by typical pre-processing. We explore using linear and log RGB data for training standard backbone architectures on an image classification task using data derived directly from RAW images to guarantee its integrity. We found that networks trained on log RGB data exhibit improved performance on an unmodified test set and invariance to intensity and color balance modifications without additional training or data augmentation. Furthermore we found that the gains from using high quality log data could also be partially or fully realized from data in 8-bit sRGB-JPG format by inverting the sRGB transform and taking the log. These results imply existing databases may benefit from this type of pre-processing. While working with log data we found it was critical to retain the integrity of the log relationships and that networks using log data train best with meta-parameters different than those used for sRGB or linear data. 
Finally we introduce a new 10-category 10k RAW image data set (RAW10) for image classification and other purposes to further enable the exploration of log RGB as an input format for deep networks in computer vision. + + + + Scaled Decoupled Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_Scaled_Decoupled_Distillation_CVPR_2024_paper.pdf + Logit knowledge distillation attracts increasing attention due to its practicality in recent studies. However it often suffers from inferior performance compared to feature knowledge distillation. In this paper we argue that existing logit-based methods may be sub-optimal since they only leverage the global logit output that couples multiple semantic knowledge. This may transfer ambiguous knowledge to the student and mislead its learning. To this end we propose a simple but effective method i.e. Scale Decoupled Distillation (SDD) for logit knowledge distillation. SDD decouples the global logit output into multiple local logit outputs and establishes distillation pipelines for them. This helps the student to mine and inherit fine-grained and unambiguous logit knowledge. Moreover the decoupled knowledge can be further divided into consistent and complementary logit knowledge that transfers the semantic information and sample ambiguity respectively. By increasing the weight of complementary parts SDD can guide the student to focus more on ambiguous samples improving its discrimination ability. Extensive experiments on several benchmark datasets demonstrate the effectiveness of SDD for wide teacher-student pairs especially in the fine-grained classification task. Code is available at: https://github.com/shicaiwei123/SDD-CVPR2024 + + + + Cloud-Device Collaborative Learning for Multimodal Large Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Cloud-Device_Collaborative_Learning_for_Multimodal_Large_Language_Models_CVPR_2024_paper.pdf + The burgeoning field of Multimodal Large Language Models (MLLMs) has exhibited remarkable performance in diverse tasks such as captioning commonsense reasoning and visual scene understanding. However the deployment of these large-scale MLLMs on client devices is hindered by their extensive model parameters leading to a notable decline in generalization capabilities when these models are compressed for device deployment. Addressing this challenge we introduce a Cloud-Device Collaborative Continual Adaptation framework designed to enhance the performance of compressed device-deployed MLLMs by leveraging the robust capabilities of cloud-based larger-scale MLLMs. Our framework is structured into three key components: a device-to-cloud uplink for efficient data transmission cloud-based knowledge adaptation and an optimized cloud-to-device downlink for model deployment. In the uplink phase we employ an Uncertainty-guided Token Sampling (UTS) strategy to effectively filter out-of-distribution tokens thereby reducing transmission costs and improving training efficiency. On the cloud side we propose an Adapter-based Knowledge Distillation (AKD) method to transfer refined knowledge from large-scale to compressed pocket-size MLLMs. Furthermore we propose a Dynamic Weight update Compression (DWC) strategy for the downlink which adaptively selects and quantizes updated weight parameters enhancing transmission efficiency and reducing the representational disparity between cloud and device models. 
Extensive experiments on several multimodal benchmarks demonstrate the superiority of our proposed framework over prior Knowledge Distillation and device-cloud collaboration methods. Notably we also validate the feasibility of our approach in real-world experiments. + + + + KD-DETR: Knowledge Distillation for Detection Transformer with Consistent Distillation Points Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_KD-DETR_Knowledge_Distillation_for_Detection_Transformer_with_Consistent_Distillation_Points_CVPR_2024_paper.pdf + DETR is a novel end-to-end transformer architecture object detector which significantly outperforms classic detectors when scaling up. In this paper we focus on the compression of DETR with knowledge distillation. While knowledge distillation has been well-studied in classic detectors there is a lack of research on how to make it work effectively on DETR. We first provide experimental and theoretical analysis to point out that the main challenge in DETR distillation is the lack of consistent distillation points. Distillation points refer to the corresponding inputs of the predictions for the student to mimic which have different formulations in CNN detectors and DETR and reliable distillation requires sufficient distillation points which are consistent between teacher and student. Based on this observation we propose the first general knowledge distillation paradigm for DETR (KD-DETR) with consistent distillation points sampling for both homogeneous and heterogeneous distillation. Specifically we decouple detection and distillation tasks by introducing a set of specialized object queries to construct distillation points for DETR. We further propose a general-to-specific distillation points sampling strategy to explore the extensibility of KD-DETR. Extensive experiments validate the effectiveness and generalization of KD-DETR. For both single-scale DAB-DETR and multi-scale Deformable DETR and DINO KD-DETR boosts the performance of the student model with improvements of 2.6%-5.2%. We further extend KD-DETR to heterogeneous distillation and achieve a 2.1% improvement by distilling the knowledge from DINO to Faster R-CNN with ResNet-50 which is comparable with homogeneous distillation methods. + + + + EMCAD: Efficient Multi-scale Convolutional Attention Decoding for Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Rahman_EMCAD_Efficient_Multi-scale_Convolutional_Attention_Decoding_for_Medical_Image_Segmentation_CVPR_2024_paper.pdf + An efficient and effective decoding mechanism is crucial in medical image segmentation especially in scenarios with limited computational resources. However these decoding mechanisms usually come with high computational costs. To address this concern we introduce EMCAD a new efficient multi-scale convolutional attention decoder designed to optimize both performance and computational efficiency. EMCAD leverages a unique multi-scale depth-wise convolution block significantly enhancing feature maps through multi-scale convolutions. EMCAD also employs channel spatial and grouped (large-kernel) gated attention mechanisms which are highly effective at capturing intricate spatial relationships while focusing on salient regions. By employing group and depth-wise convolution EMCAD is very efficient and scales well (e.g. only 1.91M parameters and 0.381G FLOPs are needed when using a standard encoder). 
Our rigorous evaluations across 12 datasets that belong to six medical image segmentation tasks reveal that EMCAD achieves state-of-the-art (SOTA) performance with 79.4% and 80.3% reduction in #Params and #FLOPs respectively. Moreover EMCAD's adaptability to different encoders and versatility across segmentation tasks further establish EMCAD as a promising tool advancing the field towards more efficient and accurate medical image analysis. Our implementation is available at https://github.com/SLDGroup/EMCAD. + + + + MART: Masked Affective RepresenTation Learning via Masked Temporal Distribution Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_MART_Masked_Affective_RepresenTation_Learning_via_Masked_Temporal_Distribution_Distillation_CVPR_2024_paper.pdf + Limited training data is a long-standing problem for video emotion analysis (VEA). Existing works leverage the power of large-scale image datasets for transferring while failing to extract the temporal correlation of affective cues in the video. Inspired by psychology research and empirical theory we verify that the degree of emotion may vary in different segments of the video thus introducing the sentiment complementary and emotion intrinsic among temporal segments. We propose an MAE-style method for learning robust affective representation of videos via masking termed MART. First we extract the affective cues of the lexicon and verify the extracted one by computing its matching score with video content in terms of sentiment and emotion scores alongside the temporal dimension. Then with the verified cues we propose masked affective modeling to recover temporal emotion distribution. We present temporal affective complementary learning that pulls the complementary part and pushes the intrinsic one of masked multimodal features where the constraint is set with cross-modal attention among features to mask the video and recover the degree of emotion among segments. Extensive experiments on five benchmarks show the superiority of our method in video sentiment analysis video emotion recognition multimodal sentiment analysis and multimodal emotion recognition. + + + + MTLoRA: Low-Rank Adaptation Approach for Efficient Multi-Task Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Agiza_MTLoRA_Low-Rank_Adaptation_Approach_for_Efficient_Multi-Task_Learning_CVPR_2024_paper.pdf + Adapting models pre-trained on large-scale datasets to a variety of downstream tasks is a common strategy in deep learning. Consequently parameter-efficient fine-tuning methods have emerged as a promising way to adapt pre-trained models to different tasks while training only a minimal number of parameters. While most of these methods are designed for single-task adaptation parameter-efficient training in Multi-Task Learning (MTL) architectures is still unexplored. In this paper we introduce MTLoRA a novel framework for parameter-efficient training of MTL models. MTLoRA employs Task-Agnostic and Task-Specific Low-Rank Adaptation modules which effectively disentangle the parameter space in MTL fine-tuning thereby enabling the model to adeptly handle both task specialization and interaction within MTL contexts. We applied MTLoRA to hierarchical-transformer-based MTL architectures adapting them to multiple downstream dense prediction tasks. 
Our extensive experiments on the PASCAL dataset show that MTLoRA achieves higher accuracy on downstream tasks compared to fully fine-tuning the MTL model while reducing the number of trainable parameters by 3.6x. Furthermore MTLoRA establishes a Pareto-optimal trade-off between the number of trainable parameters and the accuracy of the downstream tasks outperforming current state-of-the-art parameter-efficient training methods in both accuracy and efficiency. + + + + Motion Blur Decomposition with Cross-shutter Guidance + http://openaccess.thecvf.com//content/CVPR2024/papers/Ji_Motion_Blur_Decomposition_with_Cross-shutter_Guidance_CVPR_2024_paper.pdf + Motion blur is a frequently observed image artifact especially under insufficient illumination where exposure time has to be prolonged so as to collect more photons for a bright enough image. Rather than simply removing such blurring effects recent researches have aimed at decomposing a blurry image into multiple sharp images with spatial and temporal coherence. Since motion blur decomposition itself is highly ambiguous priors from neighbouring frames or human annotation are usually needed for motion disambiguation. In this paper inspired by the complementary exposure characteristics of a global shutter (GS) camera and a rolling shutter (RS) camera we propose to utilize the ordered scanline-wise delay in a rolling shutter image to robustify motion decomposition of a single blurry image. To evaluate this novel dual imaging setting we construct a triaxial system to collect realistic data as well as a deep network architecture that explicitly addresses temporal and contextual information through reciprocal branches for cross-shutter motion blur decomposition. Experiment results have verified the effectiveness of our proposed algorithm as well as the validity of our dual imaging setting. + + + + Scene-adaptive and Region-aware Multi-modal Prompt for Open Vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Scene-adaptive_and_Region-aware_Multi-modal_Prompt_for_Open_Vocabulary_Object_Detection_CVPR_2024_paper.pdf + Open Vocabulary Object Detection (OVD) aims to detect objects from novel classes described by text inputs based on the generalization ability of trained classes. Existing methods mainly focus on transferring knowledge from large Vision and Language models (VLM) to detectors through knowledge distillation. However these approaches show weak ability in adapting to diverse classes and aligning between the image-level pre-training and region-level detection thereby impeding effective knowledge transfer. Motivated by the prompt tuning we propose scene-adaptive and region-aware multi-modal prompts to address these issues by effectively adapting class-aware knowledge from VLM to the detector at the region level. Specifically to enhance the adaptability to diverse classes we design a scene-adaptive prompt generator from a scene perspective to consider both the commonality and diversity of the class distributions and formulate a novel selection mechanism to facilitate the acquisition of common knowledge across all classes and specific insights relevant to each scene. Meanwhile to bridge the gap between the pre-trained model and the detector we present a region-aware multi-modal alignment module which employs the region prompt to incorporate the positional information for feature distillation and integrates textual prompts to align visual and linguistic representations. 
Extensive experimental results demonstrate that the proposed method significantly outperforms the state-of-the-art models on the OV-COCO and OV-LVIS datasets surpassing the current method by 3.0% mAP and 4.6% APr . + + + + Instance-Aware Group Quantization for Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Moon_Instance-Aware_Group_Quantization_for_Vision_Transformers_CVPR_2024_paper.pdf + Post-training quantization (PTQ) is an efficient model compression technique that quantizes a pretrained full-precision model using only a small calibration set of unlabeled samples without retraining. PTQ methods for convolutional neural networks (CNNs) provide quantization results comparable to full-precision counterparts. Directly applying them to vision transformers (ViTs) however incurs severe performance degradation mainly due to the differences in architectures between CNNs and ViTs. In particular the distribution of activations for each channel vary drastically according to input instances making PTQ methods for CNNs inappropriate for ViTs. To address this we introduce instance-aware group quantization for ViTs (IGQ-ViT). To this end we propose to split the channels of activation maps into multiple groups dynamically for each input instance such that activations within each group share similar statistical properties. We also extend our scheme to quantize softmax attentions across tokens. In addition the number of groups for each layer is adjusted to minimize the discrepancies between predictions from quantized and full-precision models under a bit-operation (BOP) constraint. We show extensive experimental results on image classification object detection and instance segmentation with various transformer architectures demonstrating the effectiveness of our approach. + + + + A General and Efficient Training for Transformer via Token Expansion + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_A_General_and_Efficient_Training_for_Transformer_via_Token_Expansion_CVPR_2024_paper.pdf + The remarkable performance of Vision Transformers (ViTs) typically requires an extremely large training cost. Existing methods have attempted to accelerate the training of ViTs yet typically disregard method universality with accuracy dropping. Meanwhile they break the training consistency of the original transformers including the consistency of hyper-parameters architecture and strategy which prevents them from being widely applied to different Transformer networks. In this paper we propose a novel token growth scheme Token Expansion (termed ToE) to achieve consistent training acceleration for ViTs. We introduce an "initialization-expansion-merging" pipeline to maintain the integrity of the intermediate feature distribution of original transformers preventing the loss of crucial learnable information in the training process. ToE can not only be seamlessly integrated into the training and fine-tuning process of transformers (e.g. DeiT and LV-ViT) but also effective for efficient training frameworks (e.g. EfficientTrain) without twisting the original training hyper-parameters architecture and introducing additional training strategies. Extensive experiments demonstrate that ToE achieves about 1.3x faster for the training of ViTs in a lossless manner or even with performance gains over the full-token training baselines. Code is available at https://github.com/Osilly/TokenExpansion. 
+ + + + Tyche: Stochastic In-Context Learning for Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Rakic_Tyche_Stochastic_In-Context_Learning_for_Medical_Image_Segmentation_CVPR_2024_paper.pdf + Existing learning-based solutions to medical image segmentation have two important shortcomings. First for most new segmentation tasks a new model has to be trained or fine-tuned. This requires extensive resources and machine-learning expertise and is therefore often infeasible for medical researchers and clinicians. Second most existing segmentation methods produce a single deterministic segmentation mask for a given image. In practice however there is often considerable uncertainty about what constitutes the correct segmentation and different expert annotators will often segment the same image differently. We tackle both of these problems with Tyche a framework that uses a context set to generate stochastic predictions for previously unseen tasks without the need to retrain. Tyche differs from other in-context segmentation methods in two important ways. (1) We introduce a novel convolution block architecture that enables interactions among predictions. (2) We introduce in-context test-time augmentation a new mechanism to provide prediction stochasticity. When combined with appropriate model design and loss functions Tyche can predict a set of plausible diverse segmentation candidates for new or unseen medical images and segmentation tasks without the need to retrain. Code available at: https://tyche.csail.mit.edu/. + + + + YOLO-World: Real-Time Open-Vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_YOLO-World_Real-Time_Open-Vocabulary_Object_Detection_CVPR_2024_paper.pdf + The You Only Look Once (YOLO) series of detectors have established themselves as efficient and practical tools. However their reliance on predefined and trained object categories limits their applicability in open scenarios. Addressing this limitation we introduce YOLO-World an innovative approach that enhances YOLO with open-vocabulary detection capabilities through vision-language modeling and pre-training on large-scale datasets. Specifically we propose a new Re-parameterizable Vision-Language Path Aggregation Network (RepVL-PAN) and region-text contrastive loss to facilitate the interaction between visual and linguistic information. Our method excels in detecting a wide range of objects in a zero-shot manner with high efficiency. On the challenging LVIS dataset YOLO-World achieves 35.4 AP with 52.0 FPS on V100 which outperforms many state-of-the-art methods in terms of both accuracy and speed. Furthermore the fine-tuned YOLO-World achieves remarkable performance on several downstream tasks including object detection and open-vocabulary instance segmentation. Code and models are available at https://github.com/AILab-CVC/YOLO-World + + + + Cross-Dimension Affinity Distillation for 3D EM Neuron Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Cross-Dimension_Affinity_Distillation_for_3D_EM_Neuron_Segmentation_CVPR_2024_paper.pdf + Accurate 3D neuron segmentation from electron microscopy (EM) volumes is crucial for neuroscience research. However the complex neuron morphology often leads to over-merge and over-segmentation results. 
Recent advancements utilize 3D CNNs to predict a 3D affinity map with improved accuracy but suffer from two challenges: high computational cost and limited input size especially for practical deployment for large-scale EM volumes. To address these challenges we propose a novel method to leverage lightweight 2D CNNs for efficient neuron segmentation. Our method employs a 2D Y-shape network to generate two embedding maps from adjacent 2D sections which are then converted into an affinity map by measuring their embedding distance. While the 2D network better captures pixel dependencies inside sections with larger input sizes it overlooks inter-section dependencies. To overcome this we introduce a cross-dimension affinity distillation (CAD) strategy that transfers inter-section dependency knowledge from a 3D teacher network to the 2D student network by ensuring consistency between their output affinity maps. Additionally we design a feature grafting interaction (FGI) module to enhance knowledge transfer by grafting embedding maps from the 2D student onto those from the 3D teacher. Extensive experiments on multiple EM neuron segmentation datasets including a newly built one by ourselves demonstrate that our method achieves superior performance over state-of-the-art methods with only 1/20 inference latency. + + + + Producing and Leveraging Online Map Uncertainty in Trajectory Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Gu_Producing_and_Leveraging_Online_Map_Uncertainty_in_Trajectory_Prediction_CVPR_2024_paper.pdf + High-definition (HD) maps have played an integral role in the development of modern autonomous vehicle (AV) stacks albeit with high associated labeling and maintenance costs. As a result many recent works have proposed methods for estimating HD maps online from sensor data enabling AVs to operate outside of previously-mapped regions. However current online map estimation approaches are developed in isolation of their downstream tasks complicating their integration in AV stacks. In particular they do not produce uncertainty or confidence estimates. In this work we extend multiple state-of-the-art online map estimation methods to additionally estimate uncertainty and show how this enables more tightly integrating online mapping with trajectory forecasting. In doing so we find that incorporating uncertainty yields up to 50% faster training convergence and up to 15% better prediction performance on the real-world nuScenes driving dataset. + + + + LASO: Language-guided Affordance Segmentation on 3D Object + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_LASO_Language-guided_Affordance_Segmentation_on_3D_Object_CVPR_2024_paper.pdf + Segmenting affordance in 3D data is key for bridging perception and action in robots. Existing efforts mostly focus on the visual side and overlook the affordance knowledge from a semantic aspect. This oversight not only limits their generalization to unseen objects but more importantly hinders their synergy with large language models (LLMs) which are excellent task planners that can decompose an overarching command into agent-actionable instructions. With this regard we propose a novel task Language-guided Affordance Segmentation on 3D Object (LASO) which challenges a model to segment a 3D object's part relevant to a given affordance question. To facilitate the task we contribute a dataset comprising 19751 point-question pairs covering 8434 object shapes and 870 expert-crafted questions. 
As a pioneer solution we further propose PointRefer which highlights an adaptive fusion module to identify target affordance regions at different scales. To ensure a text-aware segmentation we adopt a set of affordance queries conditioned on linguistic cues to generate dynamic kernels. These kernels are further used to convolute with point features and generate a segmentation mask. Comprehensive experiments and analyses validate PointRefer's effectiveness. With these efforts We hope that LASO can steer the direction of 3D affordance guiding it towards enhanced integration with the evolving capabilities of LLMs. + + + + Riemannian Multinomial Logistics Regression for SPD Neural Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Riemannian_Multinomial_Logistics_Regression_for_SPD_Neural_Networks_CVPR_2024_paper.pdf + Deep neural networks for learning Symmetric Positive Definite (SPD) matrices are gaining increasing attention in machine learning. Despite the significant progress most existing SPD networks use traditional Euclidean classifiers on an approximated space rather than intrinsic classifiers that accurately capture the geometry of SPD manifolds. Inspired by Hyperbolic Neural Networks (HNNs) we propose Riemannian Multinomial Logistics Regression (RMLR) for the classification layers in SPD networks. We introduce a unified framework for building Riemannian classifiers under the metrics pulled back from the Euclidean space and showcase our framework under the parameterized Log-Euclidean Metric (LEM) and Log-Cholesky Metric (LCM). Besides our framework offers a novel intrinsic explanation for the most popular LogEig classifier in existing SPD networks. The effectiveness of our method is demonstrated in three applications: radar recognition human action recognition and electroencephalography (EEG) classification. The code is available at https://github.com/GitZH-Chen/SPDMLR.git. + + + + What Sketch Explainability Really Means for Downstream Tasks? + http://openaccess.thecvf.com//content/CVPR2024/papers/Bandyopadhyay_What_Sketch_Explainability_Really_Means_for_Downstream_Tasks_CVPR_2024_paper.pdf + In this paper we explore the unique modality of sketch for explainability emphasising the profound impact of human strokes compared to conventional pixel-oriented studies. Beyond explanations of network behavior we discern the genuine implications of explainability across diverse downstream sketch-related tasks. We propose a lightweight and portable explainability solution -- a seamless plugin that integrates effortlessly with any pre-trained model eliminating the need for re-training. Demonstrating its adaptability we present four applications: highly studied retrieval and generation and completely novel assisted drawing and sketch adversarial attacks. The centrepiece to our solution is a stroke-level attribution map that takes different forms when linked with downstream tasks. By addressing the inherent non-differentiability of rasterisation we enable explanations at both coarse stroke level (SLA) and partial stroke level (P-SLA) each with its advantages for specific downstream tasks. + + + + Neural Exposure Fusion for High-Dynamic Range Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Onzon_Neural_Exposure_Fusion_for_High-Dynamic_Range_Object_Detection_CVPR_2024_paper.pdf + Computer vision in unconstrained outdoor scenarios must tackle challenging high dynamic range (HDR) scenes and rapidly changing illumination conditions. 
Existing methods address this problem with multi-capture HDR sensors and a hardware image signal processor (ISP) that produces a single fused image as input to a downstream neural network. The output of the HDR sensor is a set of low dynamic range (LDR) exposures and the fusion in the ISP is performed in image space and typically optimized for human perception on a display. Preferring tonemapped content with smooth transition regions over detail (and noise) in the resulting image this image fusion does typically not preserve all information from the LDR exposures that may be essential for downstream computer vision tasks. In this work we depart from conventional HDR image fusion and propose a learned task-driven fusion in the feature domain. Instead of using a single companded image we introduce a novel local cross-attention fusion mechanism that exploits semantic features from all exposures learned in an end-to-end fashion with supervision from downstream detection losses. The proposed method outperforms all tested conventional HDR exposure fusion and auto-exposure methods in challenging automotive HDR scenarios. + + + + SFOD: Spiking Fusion Object Detector + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_SFOD_Spiking_Fusion_Object_Detector_CVPR_2024_paper.pdf + Event cameras characterized by high temporal resolution high dynamic range low power consumption and high pixel bandwidth offer unique capabilities for object detection in specialized contexts. Despite these advantages the inherent sparsity and asynchrony of event data pose challenges to existing object detection algorithms. Spiking Neural Networks (SNNs) inspired by the way the human brain codes and processes information offer a potential solution to these difficulties. However their performance in object detection using event cameras is limited in current implementations. In this paper we propose the Spiking Fusion Object Detector (SFOD) a simple and efficient approach to SNN-based object detection. Specifically we design a Spiking Fusion Module achieving the first-time fusion of feature maps from different scales in SNNs applied to event cameras. Additionally through integrating our analysis and experiments conducted during the pretraining of the backbone network on the NCAR dataset we delve deeply into the impact of spiking decoding strategies and loss functions on model performance. Thereby we establish state-of-the-art classification results based on SNNs achieving 93.7% accuracy on the NCAR dataset. Experimental results on the GEN1 detection dataset demonstrate that the SFOD achieves a state-of-the-art mAP of 32.1% outperforming existing SNN-based approaches. Our research not only underscores the potential of SNNs in object detection with event cameras but also propels the advancement of SNNs. Code is available at https://github.com/yimeng-fan/SFOD. + + + + OpenEQA: Embodied Question Answering in the Era of Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Majumdar_OpenEQA_Embodied_Question_Answering_in_the_Era_of_Foundation_Models_CVPR_2024_paper.pdf + We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. An agent can achieve such an understanding by either drawing upon episodic memory exemplified by agents on smart glasses or by actively exploring the environment as in the case of mobile robots. 
We accompany our formulation with OpenEQA -- the first open-vocabulary benchmark dataset for EQA supporting both episodic memory and active exploration use cases. OpenEQA contains over 1600 high-quality human generated questions drawn from over 180 real-world environments. In addition to the dataset we also provide an automatic LLM-powered evaluation protocol that has excellent correlation with human judgement. Using this dataset and evaluation protocol we evaluate several state-of-the-art foundation models including GPT-4V and find that they significantly lag behind human-level performance. Consequently OpenEQA stands out as a straightforward measurable and practically relevant benchmark that poses a considerable challenge to current generation of foundation models. We hope this inspires and stimulates future research at the intersection of Embodied AI conversational agents and world models. + + + + DePT: Decoupled Prompt Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_DePT_Decoupled_Prompt_Tuning_CVPR_2024_paper.pdf + This work breaks through the Base-New Tradeoff (BNT) dilemma in prompt tuning i.e. the better the tuned model generalizes to the base (or target) task the worse it generalizes to new tasks and vice versa. Specifically through an in-depth analysis of the learned features of the base and new tasks we observe that the BNT stems from a channel bias issue--the vast majority of feature channels are occupied by base-specific knowledge leading to the collapse of task-shared knowledge important to new tasks. To address this we propose the Decoupled Prompt Tuning (DePT) framework which decouples base-specific knowledge from feature channels into an isolated feature space during prompt tuning so as to maximally preserve task-shared knowledge in the original feature space for achieving better zero-shot generalization on new tasks. Importantly our DePT is orthogonal to existing prompt tuning approaches and can enhance them with negligible additional computational cost. Extensive experiments on several datasets show the flexibility and effectiveness of DePT. + + + + A Generative Approach for Wikipedia-Scale Visual Entity Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Caron_A_Generative_Approach_for_Wikipedia-Scale_Visual_Entity_Recognition_CVPR_2024_paper.pdf + In this paper we address web-scale visual entity recognition specifically the task of mapping a given query image to one of the 6 million existing entities in Wikipedia. One way of approaching a problem of such scale is using dual encoder models (e.g. CLIP) where all the entity names and query images are embedded into a unified space paving the way for an approximate kNN search. Alternatively it is also possible to re-purpose a captioning model to directly generate the entity names for a given image. In contrast we introduce a novel Generative Entity Recognition (GER) framework which given an input image learns to auto-regressively decode a semantic and discriminative "code" identifying the target entity. Our experiments demonstrate the efficacy of this GER paradigm showcasing state-of-the-art performance on the challenging OVEN benchmark. GER surpasses strong captioning dual-encoder visual matching and hierarchical classification baselines affirming its advantage in tackling the complexities of web-scale recognition. 
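A minimal sketch of the auto-regressive "code" decoding idea in the GER entry above, for illustration only. This is not the authors' model: the tiny decoder, code length, vocabulary size, and the code-to-entity lookup are assumptions.

```python
# Minimal sketch of generative entity recognition via auto-regressive code decoding.
# Not the authors' GER implementation; all sizes and names below are illustrative.
import torch
import torch.nn as nn

class TinyCodeDecoder(nn.Module):
    """Auto-regressively decodes a short discrete 'entity code' from an image embedding."""
    def __init__(self, img_dim=512, vocab=1024, code_len=4, d_model=256):
        super().__init__()
        self.code_len = code_len
        self.img_proj = nn.Linear(img_dim, d_model)
        self.tok_emb = nn.Embedding(vocab + 1, d_model)   # +1 slot for a BOS token
        self.bos = vocab
        layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, vocab)

    @torch.no_grad()
    def decode(self, img_emb):
        """img_emb: (B, img_dim) pooled image features from any backbone."""
        memory = self.img_proj(img_emb).unsqueeze(1)                  # (B, 1, d_model)
        tokens = torch.full((img_emb.size(0), 1), self.bos,
                            dtype=torch.long, device=img_emb.device)
        for _ in range(self.code_len):                                # greedy decoding
            h = self.decoder(self.tok_emb(tokens), memory)
            nxt = self.head(h[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, nxt], dim=1)
        return tokens[:, 1:]                                          # (B, code_len)

# The decoded code would then be looked up in a precomputed code -> entity table,
# e.g. {(17, 803, 5, 299): "Douglas Adams", ...}, to name the recognized entity.
```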
+ + + + Open-Vocabulary Object 6D Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Corsetti_Open-Vocabulary_Object_6D_Pose_Estimation_CVPR_2024_paper.pdf + We introduce the new setting of open-vocabulary object 6D pose estimation, in which a textual prompt is used to specify the object of interest. In contrast to existing approaches, in our setting (i) the object of interest is specified solely through the textual prompt, (ii) no object model (e.g. CAD or video sequence) is required at inference, and (iii) the object is imaged from two RGBD viewpoints of different scenes. To operate in this setting, we introduce a novel approach that leverages a Vision-Language Model to segment the object of interest from the scenes and to estimate its relative 6D pose. The key to our approach is a carefully devised strategy to fuse object-level information provided by the prompt with local image features, resulting in a feature space that can generalize to novel concepts. We validate our approach on a new benchmark based on two popular datasets, REAL275 and Toyota-Light, which collectively encompass 34 object instances appearing in four thousand image pairs. The results demonstrate that our approach outperforms both a well-established hand-crafted method and a recent deep learning-based baseline in estimating the relative 6D pose of objects in different scenes. Code and dataset are available at https://jcorsetti.github.io/oryon. + + + + Plug and Play Active Learning for Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Plug_and_Play_Active_Learning_for_Object_Detection_CVPR_2024_paper.pdf + Annotating datasets for object detection is an expensive and time-consuming endeavor. To minimize this burden, active learning (AL) techniques are employed to select the most informative samples for annotation within a constrained "annotation budget". Traditional AL strategies typically rely on model uncertainty or sample diversity for query sampling, while more advanced methods have focused on developing AL-specific object detector architectures to enhance performance. However, these specialized approaches are not readily adaptable to different object detectors due to the significant engineering effort required for integration. To overcome this challenge, we introduce Plug and Play Active Learning (PPAL), a simple and effective AL strategy for object detection. PPAL is a two-stage method comprising uncertainty-based and diversity-based sampling phases. In the first stage, our Difficulty Calibrated Uncertainty Sampling leverages a category-wise difficulty coefficient that combines both classification and localisation difficulties to re-weight instance uncertainties, from which we sample a candidate pool for the subsequent diversity-based sampling. In the second stage, we propose Category Conditioned Matching Similarity to better compute the similarities of multi-instance images as ensembles of their instance similarities, which is used by the k-Means++ algorithm to sample the final AL queries. PPAL makes no change to model architectures or detector training pipelines; hence it can be easily generalized to different object detectors. We benchmark PPAL on the MS-COCO and Pascal VOC datasets using different detector architectures and show that our method outperforms prior work by a large margin.
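A rough sketch of the two-stage uncertainty-then-diversity selection described in the PPAL abstract above. This is not the released PPAL code: the per-image pooling, the difficulty coefficients, and the use of raw image embeddings for the diversity step are simplifying assumptions made only for the illustration.

```python
# Illustrative two-stage selection: difficulty-calibrated uncertainty, then
# k-means++-style diverse sampling. Not the released PPAL implementation.
import numpy as np

def select_queries(inst_unc, inst_cls, img_ids, img_feats, difficulty,
                   pool_size, budget, seed=0):
    """inst_unc / inst_cls / img_ids: per-instance uncertainty, class id, owning image.
    img_feats: (N_images, D) one embedding per unlabeled image.
    difficulty: per-class coefficients used to re-weight instance uncertainties."""
    rng = np.random.default_rng(seed)

    # Stage 1: difficulty-calibrated uncertainty, pooled per image.
    weighted = inst_unc * difficulty[inst_cls]
    img_scores = np.zeros(len(img_feats))
    np.add.at(img_scores, img_ids, weighted)
    candidates = np.argsort(-img_scores)[:pool_size]          # candidate pool

    # Stage 2: k-means++-style diverse sampling inside the candidate pool.
    feats = img_feats[candidates]
    chosen = [int(rng.integers(len(candidates)))]
    d2 = np.sum((feats - feats[chosen[0]]) ** 2, axis=1)
    while len(chosen) < budget:
        probs = d2 / max(d2.sum(), 1e-12)
        nxt = int(rng.choice(len(candidates), p=probs))
        chosen.append(nxt)
        d2 = np.minimum(d2, np.sum((feats - feats[nxt]) ** 2, axis=1))
    return candidates[np.array(chosen)]                       # image indices to annotate
```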
Code is available at https://github.com/ChenhongyiYang/PPAL + + + + LiSA: LiDAR Localization with Semantic Awareness + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_LiSA_LiDAR_Localization_with_Semantic_Awareness_CVPR_2024_paper.pdf + LiDAR localization is a fundamental task in robotics and computer vision which estimates the pose of a LiDAR point cloud within a global map. Scene Coordinate Regression (SCR) has demonstrated state-of-the-art performance in this task. In SCR a scene is represented as a neural network which outputs the world coordinates for each point in the input point cloud. However SCR treats all points equally during localization ignoring the fact that not all objects are beneficial for localization. For example dynamic objects and repeating structures often negatively impact SCR. To address this problem we introduce LiSA the first method that incorporates semantic awareness into SCR to boost the localization robustness and accuracy. To avoid extra computation or network parameters during inference we distill the knowledge from a segmentation model to the original SCR network. Experiments show the superior performance of LiSA on standard LiDAR localization benchmarks compared to state-of-the-art methods. Applying knowledge distillation not only preserves high efficiency but also achieves higher localization accuracy than introducing extra semantic segmentation modules. We also analyze the benefit of semantic information for LiDAR localization. Our code is released at https://github.com/Ybchun/LiSA. + + + + LMDrive: Closed-Loop End-to-End Driving with Large Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Shao_LMDrive_Closed-Loop_End-to-End_Driving_with_Large_Language_Models_CVPR_2024_paper.pdf + Despite significant recent progress in the field of autonomous driving modern methods still struggle and can incur serious accidents when encountering long-tail unforeseen events and challenging urban scenarios. On the one hand large language models (LLM) have shown impressive reasoning capabilities that approach "Artificial General Intelligence". On the other hand previous autonomous driving methods tend to rely on limited-format inputs (e.g. sensor data and navigation waypoints) restricting the vehicle's ability to understand language information and interact with humans. To this end this paper introduces LMDrive a novel language-guided end-to-end closed-loop autonomous driving framework. LMDrive uniquely processes and integrates multi-modal sensor data with natural language instructions enabling interaction with humans and navigation software in realistic instructional settings. To facilitate further research in language-based closed-loop autonomous driving we also publicly release the corresponding dataset which includes approximately 64K instruction-following data clips and the LangAuto benchmark that tests the system's ability to handle complex instructions and challenging driving scenarios. Extensive closed-loop experiments are conducted to demonstrate LMDrive's effectiveness. To the best of our knowledge we're the very first work to leverage LLMs for closed-loop end-to-end autonomous driving. 
Code is available at https://github.com/opendilab/LMDrive + + + + AHIVE: Anatomy-aware Hierarchical Vision Encoding for Interactive Radiology Report Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_AHIVE_Anatomy-aware_Hierarchical_Vision_Encoding_for_Interactive_Radiology_Report_Retrieval_CVPR_2024_paper.pdf + Automatic radiology report generation using deep learning models has been recently explored and found promising. Neural decoders are commonly used for the report generation where irrelevant and unfaithful contents are unavoidable. The retrieval-based approach alleviates the limitation by identifying reports which are relevant to the input to assist the generation. To achieve clinically accurate report retrieval we make reference to clinicians' diagnostic steps of examining a radiology image where anatomical and diagnostic details are typically focused and propose a novel hierarchical visual concept representation called anatomy-aware hierarchical vision encoding (AHIVE). To learn AHIVE we first derive a methodology to extract hierarchical diagnostic descriptions from radiology reports and develop a CLIP-based framework for the model training. Also the hierarchical architecture of AHIVE is designed to support interactive report retrieval so that report revision made at one layer can be propagated to the subsequent ones to trigger other necessary revisions. We conduct extensive experiments and show that AHIVE can outperform the SOTA vision-language retrieval methods in terms of clinical accuracy by a large margin. We provide also a case study to illustrate how it enables interactive report retrieval. + + + + CyberDemo: Augmenting Simulated Human Demonstration for Real-World Dexterous Manipulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_CyberDemo_Augmenting_Simulated_Human_Demonstration_for_Real-World_Dexterous_Manipulation_CVPR_2024_paper.pdf + We introduce CyberDemo a novel approach to robotic imitation learning that leverages simulated human demonstrations for real-world tasks. By incorporating extensive data augmentation in a simulated environment CyberDemo outperforms traditional in-domain real-world demonstrations when transferred to the real world handling diverse physical and visual conditions. Regardless of its affordability and convenience in data collection CyberDemo outperforms baseline methods in terms of success rates across various tasks and exhibits generalizability with previously unseen objects. For example it can rotate novel tetra-valve and penta-valve despite human demonstrations only involving tri-valves. Our research demonstrates the significant potential of simulated human demonstrations for real world dexterous manipulation tasks. More details can be found at https://cyber-demo.github.io/ + + + + MaskCLR: Attention-Guided Contrastive Learning for Robust Action Representation Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Abdelfattah_MaskCLR_Attention-Guided_Contrastive_Learning_for_Robust_Action_Representation_Learning_CVPR_2024_paper.pdf + Current transformer-based skeletal action recognition models tend to focus on a limited set of joints and low-level motion patterns to predict action classes. This results in significant performance degradation under small skeleton perturbations or changing the pose estimator between training and testing. In this work we introduce MaskCLR a new Masked Contrastive Learning approach for Robust skeletal action recognition. 
We propose an Attention-Guided Probabilistic Masking strategy to occlude the most important joints and encourage the model to explore a larger set of discriminative joints. Furthermore we propose a Multi-Level Contrastive Learning paradigm to enforce the representations of standard and occluded skeletons to be class-discriminative i.e. more compact within each class and more dispersed across different classes. Our approach helps the model capture the high-level action semantics instead of low-level joint variations and can be conveniently incorporated into transformer-based models. Without loss of generality we combine MaskCLR with three transformer backbones: the vanilla transformer DSTFormer and STTFormer. Extensive experiments on NTU60 NTU120 and Kinetics400 show that MaskCLR consistently outperforms previous state-of-the-art methods on standard and perturbed skeletons from different pose estimators showing improved accuracy generalization and robustness. Project website: https://maskclr.github.io. + + + + Narrative Action Evaluation with Prompt-Guided Multimodal Interaction + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Narrative_Action_Evaluation_with_Prompt-Guided_Multimodal_Interaction_CVPR_2024_paper.pdf + In this paper we investigate a new problem called narrative action evaluation (NAE). NAE aims to generate professional commentary that evaluates the execution of an action. Unlike traditional tasks such as score-based action quality assessment and video captioning involving superficial sentences NAE focuses on creating detailed narratives in natural language. These narratives provide intricate descriptions of actions along with objective evaluations. NAE is a more challenging task because it requires both narrative flexibility and evaluation rigor. One existing possible solution is to use multi-task learning where narrative language and evaluative information are predicted separately. However this approach results in reduced performance for individual tasks because of variations between tasks and differences in modality between language information and evaluation information. To address this we propose a prompt-guided multimodal interaction framework. This framework utilizes a pair of transformers to facilitate the interaction between different modalities of information. It also uses prompts to transform the score regression task into a video-text matching task thus enabling task interactivity. To support further research in this field we re-annotate the MTL-AQA and FineGym datasets with high-quality and comprehensive action narration. Additionally we establish benchmarks for NAE. Extensive experiment results prove that our method outperforms separate learning methods and naive multi-task learning methods. Data and code will be released at https://github.com/shiyi-zh0408/NAE_CVPR2024. + + + + R-Cyclic Diffuser: Reductive and Cyclic Latent Diffusion for 3D Clothed Human Digitalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Chan_R-Cyclic_Diffuser_Reductive_and_Cyclic_Latent_Diffusion_for_3D_Clothed_CVPR_2024_paper.pdf + Recently the authors of Zero-1-to-3 demonstrated that a latent diffusion model pretrained with Internet-scale data can not only address the single-view 3D object reconstruction task but can even attain SOTA results in it. 
However when applied to the task of single-view 3D clothed human reconstruction Zero-1-to-3 (and related models) are unable to compete with the corresponding SOTA methods in this field despite being trained on clothed human data. In this work we aim to tailor Zero-1-to-3's approach to the single-view 3D clothed human reconstruction task in a much more principled and structured manner. To this end we propose R-Cyclic Diffuser a framework that adapts Zero-1-to-3's novel approach to clothed human data by fusing it with a pixel-aligned implicit model. R-Cyclic Diffuser offers a total of three new contributions. The first and primary contribution is R-Cyclic Diffuser's cyclical conditioning mechanism for novel view synthesis. This mechanism directly addresses the view inconsistency problem faced by Zero-1-to-3 and related models. Secondly we further enhance this mechanism with two key features - Lateral Inversion Constraint and Cyclic Noise Selection. Both features are designed to regularize and restrict the randomness of outputs generated by a latent diffusion model. Thirdly we show how SMPL-X body priors can be incorporated in a latent diffusion model such that novel views of clothed human bodies can be generated much more accurately. Our experiments show that R-Cyclic Diffuser is able to outperform current SOTA methods in single-view 3D clothed human reconstruction both qualitatively and quantitatively. Our code is made publicly available at https://github.com/kcyt/r-cyclic-diffuser. + + + + Validating Privacy-Preserving Face Recognition under a Minimum Assumption + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Validating_Privacy-Preserving_Face_Recognition_under_a_Minimum_Assumption_CVPR_2024_paper.pdf + The widespread use of cloud-based face recognition technology raises privacy concerns as unauthorized access to face images can expose personal information or be exploited for fraudulent purposes. In response privacy-preserving face recognition (PPFR) schemes have emerged to hide visual information and thwart unauthorized access. However the validation methods employed by these schemes often rely on unrealistic assumptions leaving doubts about their true effectiveness in safeguarding facial privacy. In this paper we introduce a new approach to privacy validation called Minimum Assumption Privacy Protection Validation (Map^2V). This is the first exploration of formulating a privacy validation method utilizing deep image priors and zeroth-order gradient estimation with the potential to serve as a general framework for PPFR evaluation. Building upon Map^2V we comprehensively validate the privacy-preserving capability of PPFRs through a combination of human and machine vision. The experiment results and analysis demonstrate the effectiveness and generalizability of the proposed Map^2V showcasing its superiority over native privacy validation methods from PPFR works of literature. Additionally this work exposes privacy vulnerabilities in evaluated state-of-the-art PPFR schemes laying the foundation for the subsequent effective proposal of countermeasures. The source code is available at https://github.com/Beauty9882/MAP2V. + + + + Long-Tailed Anomaly Detection with Learnable Class Names + http://openaccess.thecvf.com//content/CVPR2024/papers/Ho_Long-Tailed_Anomaly_Detection_with_Learnable_Class_Names_CVPR_2024_paper.pdf + Anomaly detection (AD) aims to identify defective images and localize their defects (if any). 
Ideally AD models should be able to detect defects over many image classes; without relying on hard-coded class names that can be uninformative or inconsistent across datasets; learn without anomaly supervision; and be robust to the long-tailed distributions of real-world applications. To address these challenges we formulate the problem of long-tailed AD by introducing several datasets with different levels of class imbalance and metrics for performance evaluation. We then propose a novel method LTAD to detect defects from multiple and long-tailed classes without relying on dataset class names. LTAD combines AD by reconstruction and semantic AD modules. AD by reconstruction is implemented with a transformer-based reconstruction module. Semantic AD is implemented with a binary classifier which relies on learned pseudo class names and a pretrained foundation model. These modules are learned over two phases. Phase 1 learns the pseudo-class names and a variational autoencoder (VAE) for feature synthesis that augments the training data to combat long-tails. Phase 2 then learns the parameters of the reconstruction and classification modules of LTAD. Extensive experiments using the proposed long-tailed datasets show that LTAD substantially outperforms the state-of-the-art methods for most forms of dataset imbalance. The long-tailed dataset split is available at https://zenodo.org/records/10854201 + + + + Rapid 3D Model Generation with Intuitive 3D Input + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Rapid_3D_Model_Generation_with_Intuitive_3D_Input_CVPR_2024_paper.pdf + With the emergence of AR/VR 3D models are in tremendous demand. However conventional 3D modeling with Computer-Aided Design software requires much expertise and is difficult for novice users. We find that AR/VR devices in addition to serving as effective display mediums can offer a promising potential as an intuitive 3D model creation tool especially with the assistance of AI generative models. Here we propose Deep3DVRSketch the first 3D model generation network that inputs 3D VR sketches from novice users and generates highly consistent 3D models in multiple categories within seconds irrespective of the users' drawing abilities. We also contribute KO3D+ the largest 3D sketch-shape dataset. Our method pre-trains a conditional diffusion model on quality 3D data then fine-tunes an encoder to map 3D sketches onto the generator's manifold using an adaptive curriculum strategy for limited ground truths. In our experiment our approach achieves state-of-the-art performance in both model quality and fidelity with real-world input from novice users and users can even draw and obtain very detailed geometric structures. In our user study users were able to complete the 3D modeling tasks over 10 times faster using our approach compared to conventional CAD software tools. We believe that our Deep3DVRSketch and KO3D+ dataset can offer a promising solution for future 3D modeling in metaverse era. Check the project page at http://research.kokoni3d.com/Deep3DVRSketch. + + + + BoQ: A Place is Worth a Bag of Learnable Queries + http://openaccess.thecvf.com//content/CVPR2024/papers/Ali-bey_BoQ_A_Place_is_Worth_a_Bag_of_Learnable_Queries_CVPR_2024_paper.pdf + In visual place recognition accurately identifying and matching images of locations under varying environmental conditions and viewpoints remains a significant challenge. 
In this paper we introduce a new technique called Bag-of-Queries (BoQ), which learns a set of global queries designed to capture universal place-specific attributes. Unlike existing techniques that employ self-attention and generate the queries directly from the input, BoQ employs distinct learnable global queries which probe the input features via cross-attention, ensuring consistent information aggregation. In addition, this technique provides an interpretable attention mechanism and integrates with both CNN and Vision Transformer backbones. The performance of BoQ is demonstrated through extensive experiments on 14 large-scale benchmarks. It consistently outperforms current state-of-the-art techniques, including NetVLAD, MixVPR, and EigenPlaces. Moreover, despite being a global retrieval technique (one-stage), BoQ surpasses two-stage retrieval methods such as Patch-NetVLAD, TransVPR, and R2Former, all while being orders of magnitude faster and more efficient. The code and model weights are publicly available at https://github.com/amaralibey/Bag-of-Queries. + + + + GigaPose: Fast and Robust Novel Object Pose Estimation via One Correspondence + http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_GigaPose_Fast_and_Robust_Novel_Object_Pose_Estimation_via_One_CVPR_2024_paper.pdf + We present GigaPose, a fast, robust and accurate method for CAD-based novel object pose estimation in RGB images. GigaPose first leverages discriminative "templates", rendered images of the CAD models, to recover the out-of-plane rotation and then uses patch correspondences to estimate the four remaining parameters. Our approach samples templates in only a two-degrees-of-freedom space instead of the usual three, and matches the input image to the templates using fast nearest-neighbor search in feature space, resulting in a speedup factor of 35x compared to the state of the art. Moreover, GigaPose is significantly more robust to segmentation errors. Our extensive evaluation on the seven core datasets of the BOP challenge demonstrates that it achieves state-of-the-art accuracy and can be seamlessly integrated with existing refinement methods. Additionally, we show the potential of GigaPose with 3D models predicted by recent work on 3D reconstruction from a single image, relaxing the need for CAD models and making 6D object pose estimation much more convenient. Our source code and trained models are publicly available at https://github.com/nv-nguyen/gigaPose + + + + Imagine Before Go: Self-Supervised Generative Map for Object Goal Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Imagine_Before_Go_Self-Supervised_Generative_Map_for_Object_Goal_Navigation_CVPR_2024_paper.pdf + The Object Goal navigation (ObjectNav) task requires the agent to navigate to a specified target in an unseen environment. Since the environment layout is unknown, the agent needs to infer the unknown contextual objects from partial observations, thereby deducing the likely location of the target. Previous end-to-end RL methods capture contextual relationships through implicit representations, while they lack a notion of geometry. Alternatively, modular methods construct local maps for recording the observed geometric structure of the unseen environment; however, the lack of reasoning about contextual relations limits the exploration efficiency. In this work we propose the self-supervised generative map (SGM), a modular method that learns the explicit context relation via self-supervised learning.
The SGM is trained to leverage both episodic observations and general knowledge to reconstruct the masked pixels of a cropped global map. During navigation the agent maintains an incomplete local semantic map meanwhile the unknown regions of the local map are generated by the pre-trained SGM. Based on the generated map the agent sets the predicted location of the target as the goal and moves towards it. Experiments on Gibson MP3D and HM3D show the effectiveness of our method. + + + + HIPTrack: Visual Tracking with Historical Prompts + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_HIPTrack_Visual_Tracking_with_Historical_Prompts_CVPR_2024_paper.pdf + Trackers that follow Siamese paradigm utilize similarity matching between template and search region features for tracking. Many methods have been explored to enhance tracking performance by incorporating tracking history to better handle scenarios involving target appearance variations such as deformation and occlusion. However the utilization of historical information in existing methods is insufficient and incomprehensive which typically requires repetitive training and introduces a large amount of computation. In this paper we show that by providing a tracker that follows Siamese paradigm with precise and updated historical information a significant performance improvement can be achieved with completely unchanged parameters. Based on this we propose a historical prompt network that uses refined historical foreground masks and historical visual features of the target to provide comprehensive and precise prompts for the tracker. We build a novel tracker called HIPTrack based on the historical prompt network which achieves considerable performance improvements without the need to retrain the entire model. We conduct experiments on seven datasets and experimental results demonstrate that our method surpasses the current state-of-the-art trackers on LaSOT LaSOText GOT-10k and NfS. Furthermore the historical prompt network can seamlessly integrate as a plug-and-play module into existing trackers providing performance enhancements. The source code is available at https://github.com/WenRuiCai/HIPTrack. + + + + An N-Point Linear Solver for Line and Motion Estimation with Event Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_An_N-Point_Linear_Solver_for_Line_and_Motion_Estimation_with_CVPR_2024_paper.pdf + Event cameras respond primarily to edges---formed by strong gradients---and are thus particularly well-suited for line-based motion estimation. Recent work has shown that events generated by a single line each satisfy a polynomial constraint which describes a manifold in the space-time volume. Multiple such constraints can be solved simultaneously to recover the partial linear velocity and line parameters. In this work we show that with a suitable line parametrization this system of constraints is actually linear in the unknowns which allows us to design a novel linear solver. Unlike existing solvers our linear solver (i) is fast and numerically stable since it does not rely on expensive root finding (ii) can solve both minimal and overdetermined systems with more than 5 events and (iii) admits the characterization of all degenerate cases and multiple solutions. The found line parameters are singularity-free and have a fixed scale which eliminates the need for auxiliary constraints typically encountered in previous work. 
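A generic numerical sketch of the stack-linear-constraints-and-solve step just described for the event-based line solver. The per-event row built here is a placeholder assumption; the paper's actual space-time line parametrization is not reproduced.

```python
# Generic linear solve over per-event constraints: works for minimal and
# overdetermined sets alike, with no iterative root finding. The constraint
# row layout is an assumption for illustration only.
import numpy as np

def build_row(x, y, t):
    """Placeholder: map one event (pixel x, y at time t) to one linear constraint row."""
    return np.array([x, y, t, x * t, y * t, 1.0])

def solve_linear_system(events):
    """events: iterable of (x, y, t). The unknowns are recovered as the right
    singular vector of the constraint matrix with the smallest singular value."""
    A = np.stack([build_row(x, y, t) for (x, y, t) in events])   # (N, n_unknowns)
    _, s, Vt = np.linalg.svd(A)
    params = Vt[-1]                    # null-space direction minimizing ||A p||, fixed scale
    residual = s[-1] if A.shape[0] >= A.shape[1] else 0.0
    return params, residual

# Usage: params, res = solve_linear_system(event_list) with 5 or more (x, y, t) events.
```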
To recover the full linear camera velocity we fuse observations from multiple lines with a novel velocity averaging scheme that relies on a geometrically-motivated residual and thus solves the problem more efficiently than previous schemes which minimize an algebraic residual. Extensive experiments in synthetic and real-world settings demonstrate that our method surpasses the previous work in numerical stability and operates over 600 times faster. + + + + GenNBV: Generalizable Next-Best-View Policy for Active 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_GenNBV_Generalizable_Next-Best-View_Policy_for_Active_3D_Reconstruction_CVPR_2024_paper.pdf + While recent advances in neural radiance field enable realistic digitization for large-scale scenes the image-capturing process is still time-consuming and labor-intensive. Previous works attempt to automate this process using the Next-Best-View (NBV) policy for active 3D reconstruction. However the existing NBV policies heavily rely on hand-crafted criteria limited action space or per-scene optimized representations. These constraints limit their cross-dataset generalizability. To overcome them we propose GenNBV an end-to-end generalizable NBV policy. Our policy adopts a reinforcement learning (RL)-based framework and extends typical limited action space to 5D free space. It empowers our agent drone to scan from any viewpoint and even interact with unseen geometries during training. To boost the cross-dataset generalizability we also propose a novel multi-source state embedding including geometric semantic and action representations. We establish a benchmark using the Isaac Gym simulator with the Houses3K and OmniObject3D datasets to evaluate this NBV policy. Experiments demonstrate that our policy achieves a 98.26% and 97.12% coverage ratio on unseen building-scale objects from these datasets respectively outperforming prior solutions. + + + + Taming Self-Training for Open-Vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Taming_Self-Training_for_Open-Vocabulary_Object_Detection_CVPR_2024_paper.pdf + Recent studies have shown promising performance in open-vocabulary object detection (OVD) by utilizing pseudo labels (PLs) from pretrained vision and language models (VLMs). However teacher-student self-training a powerful and widely used paradigm to leverage PLs is rarely explored for OVD. This work identifies two challenges of using self-training in OVD: noisy PLs from VLMs and frequent distribution changes of PLs. To address these challenges we propose SAS-Det that tames self-training for OVD from two key perspectives. First we present a split-and-fusion (SAF) head that splits a standard detection into an open-branch and a closed-branch. This design can reduce noisy supervision from pseudo boxes. Moreover the two branches learn complementary knowledge from different training data significantly enhancing performance when fused together. Second in our view unlike in closed-set tasks the PL distributions in OVD are solely determined by the teacher model. We introduce a periodic update strategy to decrease the number of updates to the teacher thereby decreasing the frequency of changes in PL distributions which stabilizes the training process. Extensive experiments demonstrate SAS-Det is both efficient and effective. 
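A schematic of the periodic teacher-update schedule described in the SAS-Det abstract above. This is not the released SAS-Det code: the detector interface (training_step, the score field), the pseudo-label threshold, and the update period K are hypothetical stand-ins.

```python
# Self-training loop where the teacher that produces pseudo labels is refreshed
# only every K steps, keeping the pseudo-label distribution fixed in between.
import copy
import torch

def self_train(student, unlabeled_loader, optimizer, num_steps, K=2000, thresh=0.5):
    teacher = copy.deepcopy(student).eval()            # frozen copy produces pseudo labels
    for step, images in zip(range(num_steps), unlabeled_loader):
        with torch.no_grad():
            preds = teacher(images)                    # hypothetical: list of box dicts
            pseudo = [p for p in preds if p["score"] > thresh]
        loss = student.training_step(images, pseudo)   # hypothetical trainer API
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if (step + 1) % K == 0:                        # periodic, not per-step, refresh
            teacher.load_state_dict(student.state_dict())
            teacher.eval()
    return student
```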
SAS-Det outperforms recent models of the same scale by a clear margin and achieves 37.4 AP50 and 29.1 APr on novel categories of the COCO and LVIS benchmarks respectively. Code is available at https://github.com/xiaofeng94/SAS-Det. + + + + Bilateral Propagation Network for Depth Completion + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Bilateral_Propagation_Network_for_Depth_Completion_CVPR_2024_paper.pdf + Depth completion aims to derive a dense depth map from sparse depth measurements with a synchronized color image. Current state-of-the-art (SOTA) methods are predominantly propagation-based which work as an iterative refinement on the initial estimated dense depth. However the initial depth estimations mostly result from direct applications of convolutional layers on the sparse depth map. In this paper we present a Bilateral Propagation Network (BP-Net) that propagates depth at the earliest stage to avoid directly convolving on sparse data. Specifically our approach propagates the target depth from nearby depth measurements via a non-linear model whose coefficients are generated through a multi-layer perceptron conditioned on both radiometric difference and spatial distance. By integrating bilateral propagation with multi-modal fusion and depth refinement in a multi-scale framework our BP-Net demonstrates outstanding performance on both indoor and outdoor scenes. It achieves SOTA on the NYUv2 dataset and ranks 1st on the KITTI depth completion benchmark at the time of submission. Experimental results not only show the effectiveness of bilateral propagation but also emphasize the significance of early-stage propagation in contrast to the refinement stage. Our code and trained models will be available on the project page. + + + + Unleashing Channel Potential: Space-Frequency Selection Convolution for SAR Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Unleashing_Channel_Potential_Space-Frequency_Selection_Convolution_for_SAR_Object_Detection_CVPR_2024_paper.pdf + Deep Convolutional Neural Networks (DCNNs) have achieved remarkable performance in synthetic aperture radar (SAR) object detection but this comes at the cost of tremendous computational resources partly due to extracting redundant features within a single convolutional layer. Recent works either delve into model compression methods or focus on the carefully-designed lightweight models both of which result in performance degradation. In this paper we propose an efficient convolution module for SAR object detection called SFS-Conv which increases feature diversity within each convolutional layer through a shunt-perceive-select strategy. Specifically we shunt input feature maps into space and frequency aspects. The former perceives the context of various objects by dynamically adjusting receptive field while the latter captures abundant frequency variations and textural features via fractional Gabor transformer. To adaptively fuse features from space and frequency aspects a parameter-free feature selection module is proposed to ensure that the most representative and distinctive information are preserved. With SFS-Conv we build a lightweight SAR object detection network called SFS-CNet. Experimental results show that SFS-CNet outperforms state-of-the-art (SoTA) models on a series of SAR object detection benchmarks while simultaneously reducing both the model size and computational cost. 
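A toy illustration of the parameter-free space-frequency selection described in the SFS-Conv entry above. This is not the actual SFS-Conv module; the softmax-over-channel-energy weighting is an assumption used only to show a fusion step that introduces no learnable parameters.

```python
# Parameter-free fusion of two feature branches (spatial and frequency).
import torch

def parameter_free_select(space_feat, freq_feat):
    """space_feat, freq_feat: (B, C, H, W) outputs of the two branches."""
    stacked = torch.stack([space_feat, freq_feat], dim=1)        # (B, 2, C, H, W)
    energy = stacked.mean(dim=(3, 4))                            # (B, 2, C) channel energy
    weights = torch.softmax(energy, dim=1)                       # branches compete per channel
    fused = (weights[..., None, None] * stacked).sum(dim=1)      # (B, C, H, W)
    return fused

# Usage: fused = parameter_free_select(space_branch(x), freq_branch(x))
# No learnable parameters are introduced, so the selection adds negligible cost.
```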
+ + + + READ: Retrieval-Enhanced Asymmetric Diffusion for Motion Planning + http://openaccess.thecvf.com//content/CVPR2024/papers/Oba_READ_Retrieval-Enhanced_Asymmetric_Diffusion_for_Motion_Planning_CVPR_2024_paper.pdf + This paper proposes Retrieval-Enhanced Asymmetric Diffusion (READ) for image-based robot motion planning. Given an image of the scene READ retrieves an initial motion from a database of image-motion pairs and uses a diffusion model to refine the motion for the given scene. Unlike prior retrieval-based diffusion models that require long forward-reverse diffusion paths READ directly diffuses between the source (retrieved) and target motions resulting in an efficient diffusion path. A second contribution of READ is its use of asymmetric diffusion whereby it preserves the kinematic feasibility of the generated motion by forward diffusion in a low-dimensional latent space while achieving high-resolution motion by reverse diffusion in the original task space using cold diffusion. Experimental results on various manipulation tasks demonstrate that READ outperforms state-of-the-art planning methods while ablation studies elucidate the contributions of asymmetric diffusion. + + + + OVMR: Open-Vocabulary Recognition with Multi-Modal References + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_OVMR_Open-Vocabulary_Recognition_with_Multi-Modal_References_CVPR_2024_paper.pdf + The challenge of open-vocabulary recognition lies in the model has no clue of new categories it is applied to. Existing works have proposed different methods to embed category cues into the model e.g. through few-shot fine-tuning providing category names or textual descriptions to Vision-Language Models. Fine-tuning is time-consuming and degrades the generalization capability. Textual descriptions could be ambiguous and fail to depict visual details. This paper tackles open-vocabulary recognition from a different perspective by referring to multi-modal clues composed of textual descriptions and exemplar images. Our method named OVMR adopts two innovative components to pursue a more robust category cues embedding. A multi-modal classifier is first generated by dynamically complementing textual descriptions with image exemplars. A preference-based refinement module is hence applied to fuse uni-modal and multi-modal classifiers with the aim to alleviate issues of low-quality exemplar images or textual descriptions. The proposed OVMR is a plug-and-play module and works well with exemplar images randomly crawled from the Internet. Extensive experiments have demonstrated the promising performance of OVMR e.g. it outperforms existing methods across various scenarios and setups. Codes are publicly available at \href https://github.com/Zehong-Ma/OVMR https://github.com/Zehong-Ma/OVMR . + + + + Global and Local Prompts Cooperation via Optimal Transport for Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Global_and_Local_Prompts_Cooperation_via_Optimal_Transport_for_Federated_CVPR_2024_paper.pdf + Prompt learning in pretrained visual-language models has shown remarkable flexibility across various downstream tasks. Leveraging its inherent lightweight nature recent research attempted to integrate the powerful pretrained models into federated learning frameworks to simultaneously reduce communication costs and promote local training on insufficient data. 
Despite these efforts current federated prompt learning methods lack specialized designs to systematically address severe data heterogeneities e.g. data distribution with both label and feature shifts involved. To address this challenge we present Federated Prompts Cooperation via Optimal Transport (FedOTP) which introduces efficient collaborative prompt learning strategies to capture diverse category traits on a per-client basis. Specifically for each client we learn a global prompt to extract consensus knowledge among clients and a local prompt to capture client-specific category characteristics. Unbalanced Optimal Transport is then employed to align local visual features with these prompts striking a balance between global consensus and local personalization. By relaxing one of the equality constraints FedOTP enables prompts to focus solely on core image patch regions. Extensive experiments on datasets with various types of heterogeneities have demonstrated that our FedOTP outperforms the state-of-the-art methods. + + + + Retrieval-Augmented Open-Vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Retrieval-Augmented_Open-Vocabulary_Object_Detection_CVPR_2024_paper.pdf + Open-vocabulary object detection (OVD) has been studied with Vision-Language Models (VLMs) to detect novel objects beyond the pre-trained categories. Previous approaches improve the generalization ability to expand the knowledge of the detector using 'positive' pseudo-labels with additional 'class' names e.g. sock iPod and alligator. To extend the previous methods in two aspects we propose Retrieval-Augmented Losses and visual Features (RALF). Our method retrieves related 'negative' classes and augments loss functions. Also visual features are augmented with 'verbalized concepts' of classes e.g. worn on the feet handheld music player and sharp teeth. Specifically RALF consists of two modules: Retrieval Augmented Losses (RAL) and Retrieval-Augmented visual Features (RAF). RAL constitutes two losses reflecting the semantic similarity with negative vocabularies. In addition RAF augments visual features with the verbalized concepts from a large language model (LLM). Our experiments demonstrate the effectiveness of RALF on COCO and LVIS benchmark datasets. We achieve improvement up to 3.4 box AP_ 50 ^ \text N on novel categories of the COCO dataset and 3.6 mask AP_ \text r gains on the LVIS dataset. Code is available at https://github.com/mlvlab/RALF. + + + + MULTIFLOW: Shifting Towards Task-Agnostic Vision-Language Pruning + http://openaccess.thecvf.com//content/CVPR2024/papers/Farina_MULTIFLOW_Shifting_Towards_Task-Agnostic_Vision-Language_Pruning_CVPR_2024_paper.pdf + While excellent in transfer learning Vision-Language models (VLMs) come with high computational costs due to their large number of parameters. To address this issue removing parameters via model pruning is a viable solution. However existing techniques for VLMs are task-specific and thus require pruning the network from scratch for each new task of interest. In this work we explore a new direction: Task-Agnostic Vision-Language Pruning (TA-VLP). Given a pretrained VLM the goal is to find a unique pruned counterpart transferable to multiple unknown downstream tasks. In this challenging setting the transferable representations already encoded in the pretrained model are a key aspect to preserve. 
Thus we propose Multimodal Flow Pruning (MULTIFLOW) a first gradient-free pruning framework for TA-VLP where: (i) the importance of a parameter is expressed in terms of its magnitude and its information flow by incorporating the saliency of the neurons it connects; and (ii) pruning is driven by the emergent (multimodal) distribution of the VLM parameters after pretraining. We benchmark eight state-of-the-art pruning algorithms in the context of TA-VLP experimenting with two VLMs three vision-language tasks and three pruning ratios. Our experimental results show that MULTIFLOW outperforms recent sophisticated combinatorial competitors in the vast majority of the cases paving the way towards addressing TA-VLP. The code is publicly available at https://github.com/FarinaMatteo/multiflow. + + + + Spin-UP: Spin Light for Natural Light Uncalibrated Photometric Stereo + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Spin-UP_Spin_Light_for_Natural_Light_Uncalibrated_Photometric_Stereo_CVPR_2024_paper.pdf + Natural Light Uncalibrated Photometric Stereo (NaUPS) relieves the strict environment and light assumptions in classical Uncalibrated Photometric Stereo (UPS) methods. However due to the intrinsic ill-posedness and high-dimensional ambiguities addressing NaUPS is still an open question. Existing works impose strong assumptions on the environment lights and objects' material restricting the effectiveness in more general scenarios. Alternatively some methods leverage supervised learning with intricate models while lacking interpretability resulting in a biased estimation. In this work we proposed Spin Light Uncalibrated Photometric Stereo (Spin-UP) an unsupervised method to tackle NaUPS in various environment lights and objects. The proposed method uses a novel setup that captures the object's images on a rotatable platform which mitigates NaUPS's ill-posedness by reducing unknowns and provides reliable priors to alleviate NaUPS's ambiguities. Leveraging neural inverse rendering and the proposed training strategies Spin-UP recovers surface normals environment light and isotropic reflectance under complex natural light with low computational cost. Experiments have shown that Spin-UP outperforms other supervised / unsupervised NaUPS methods and achieves state-of-the-art performance on synthetic and real-world datasets. Codes and data are available at https://github.com/LMozart/CVPR2024-SpinUP. + + + + MemoNav: Working Memory Model for Visual Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_MemoNav_Working_Memory_Model_for_Visual_Navigation_CVPR_2024_paper.pdf + Image-goal navigation is a challenging task that requires an agent to navigate to a goal indicated by an image in unfamiliar environments. Existing methods utilizing diverse scene memories suffer from inefficient exploration since they use all historical observations for decision-making without considering the goal-relevant fraction. To address this limitation we present MemoNav a novel memory model for image-goal navigation which utilizes a working memory-inspired pipeline to improve navigation performance. Specifically we employ three types of navigation memory. The node features on a map are stored in the short-term memory (STM) as these features are dynamically updated. A forgetting module then retains the informative STM fraction to increase efficiency. We also introduce long-term memory (LTM) to learn global scene representations by progressively aggregating STM features. 
Subsequently a graph attention module encodes the retained STM and the LTM to generate working memory (WM) which contains the scene features essential for efficient navigation. The synergy among these three memory types boosts navigation performance by enabling the agent to learn and leverage goal-relevant scene features within a topological map. Our evaluation on multi-goal tasks demonstrates that MemoNav significantly outperforms previous methods across all difficulty levels in both Gibson and Matterport3D scenes. Qualitative results further illustrate that MemoNav plans more efficient routes. + + + + AssistGUI: Task-Oriented PC Graphical User Interface Automation + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_AssistGUI_Task-Oriented_PC_Graphical_User_Interface_Automation_CVPR_2024_paper.pdf + Graphical User Interface (GUI) automation holds significant promise for assisting users with complex tasks thereby boosting human productivity. Existing works leveraging Large Language Model (LLM) or LLM-based AI agents have shown capabilities in automating tasks on Android and Web platforms. However these tasks are primarily aimed at simple device usage and entertainment operations. This paper presents a novel benchmark AssistGUI to evaluate whether models are capable of manipulating the mouse and keyboard on the Windows platform in response to user-requested tasks. We carefully collected a set of 100 tasks from nine widely-used software applications such as After Effects and MS Word each accompanied by the necessary project files for better evaluation. Moreover we propose a multi-agent collaboration framework which incorporates four agents to perform task decomposition GUI parsing action generation and reflection. Our experimental results reveal that our multi-agent collaboration mechanism outshines existing methods in performance. Nevertheless the potential remains substantial with the best model attaining only a 46% success rate on our benchmark. We conclude with a thorough analysis of the current methods' limitations setting the stage for future breakthroughs in this domain. + + + + PaSCo: Urban 3D Panoptic Scene Completion with Uncertainty Awareness + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_PaSCo_Urban_3D_Panoptic_Scene_Completion_with_Uncertainty_Awareness_CVPR_2024_paper.pdf + We propose the task of Panoptic Scene Completion (PSC) which extends the recently popular Semantic Scene Completion (SSC) task with instance-level information to produce a richer understanding of the 3D scene. Our PSC proposal utilizes a hybrid mask-based technique on the nonempty voxels from sparse multi-scale completions. Whereas the SSC literature overlooks uncertainty which is critical for robotics applications we instead propose an efficient ensembling to estimate both voxel-wise and instance-wise uncertainties along PSC. This is achieved by building on a multi-input multi-output (MIMO) strategy while improving performance and yielding better uncertainty for little additional compute. Additionally we introduce a technique to aggregate permutation-invariant mask predictions. Our experiments demonstrate that our method surpasses all baselines in both Panoptic Scene Completion and uncertainty estimation on three large-scale autonomous driving datasets. Our code and data are available at https://astra-vision.github.io/PaSCo . 
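A minimal sketch of ensemble-style uncertainty estimation in the spirit of the PaSCo entry above (voxel-wise entropy of the averaged prediction, plus a per-instance score). This is not the PaSCo implementation; treating the MIMO outputs simply as M sets of per-voxel logits and averaging entropy over a mask are assumptions.

```python
# Voxel-wise and instance-wise uncertainty from an ensemble of predictions.
import torch

def voxel_uncertainty(logits_per_member):
    """logits_per_member: (M, N_voxels, C) logits from M MIMO outputs / ensemble members."""
    probs = torch.softmax(logits_per_member, dim=-1).mean(dim=0)      # (N_voxels, C)
    entropy = -(probs * torch.log(probs.clamp_min(1e-8))).sum(dim=-1)
    return probs.argmax(dim=-1), entropy                              # label + uncertainty

def instance_uncertainty(entropy, instance_mask):
    """instance_mask: (N_voxels,) boolean mask of one predicted instance."""
    return entropy[instance_mask].mean()
```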
+ + + + PhyScene: Physically Interactable 3D Scene Synthesis for Embodied AI + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_PhyScene_Physically_Interactable_3D_Scene_Synthesis_for_Embodied_AI_CVPR_2024_paper.pdf + With recent developments in Embodied Artificial Intelligence (EAI) research there has been a growing demand for high-quality large-scale interactive scene generation. While prior methods in scene synthesis have prioritized the naturalness and realism of the generated scenes the physical plausibility and interactivity of scenes have been largely left unexplored. To address this disparity we introduce PhyScene a novel method dedicated to generating interactive 3D scenes characterized by realistic layouts articulated objects and rich physical interactivity tailored for embodied agents. Based on a conditional diffusion model for capturing scene layouts we devise novel physics- and interactivity-based guidance mechanisms that integrate constraints from object collision room layout and object reachability. Through extensive experiments we demonstrate that PhyScene effectively leverages these guidance functions for physically interactable scene synthesis outperforming existing state-of-the-art scene synthesis methods by a large margin. Our findings suggest that the scenes generated by PhyScene hold considerable potential for facilitating diverse skill acquisition among agents within interactive environments thereby catalyzing further advancements in embodied AI research. + + + + Harnessing Meta-Learning for Improving Full-Frame Video Stabilization + http://openaccess.thecvf.com//content/CVPR2024/papers/Ali_Harnessing_Meta-Learning_for_Improving_Full-Frame_Video_Stabilization_CVPR_2024_paper.pdf + Video stabilization is a longstanding computer vision problem particularly pixel-level synthesis solutions for video stabilization which synthesize full frames add to the complexity of this task. These techniques aim to stabilize videos by synthesizing full frames while enhancing the stability of the considered video. This intensifies the complexity of the task due to the distinct mix of unique motion profiles and visual content present in each video sequence making robust generalization with fixed parameters difficult. In our study we introduce a novel approach to enhance the performance of pixel-level synthesis solutions for video stabilization by adapting these models to individual input video sequences. The proposed adaptation exploits low-level visual cues accessible during test-time to improve both the stability and quality of resulting videos. We highlight the efficacy of our methodology of "test-time adaptation" through simple fine-tuning of one of these models followed by significant stability gain via the integration of meta-learning techniques. Notably significant improvement is achieved with only a single adaptation step. The versatility of the proposed algorithm is demonstrated by consistently improving the performance of various pixel-level synthesis models for video stabilization in real-world scenarios. + + + + How to Handle Sketch-Abstraction in Sketch-Based Image Retrieval? + http://openaccess.thecvf.com//content/CVPR2024/papers/Koley_How_to_Handle_Sketch-Abstraction_in_Sketch-Based_Image_Retrieval_CVPR_2024_paper.pdf + In this paper we propose a novel abstraction-aware sketch-based image retrieval framework capable of handling sketch abstraction at varied levels. 
Prior works had mainly focused on tackling sub-factors such as drawing style and order; we instead attempt to model abstraction as a whole and propose feature-level and retrieval granularity-level designs so that the system builds into its DNA the necessary means to interpret abstraction. On learning abstraction-aware features, we for the first time harness the rich semantic embedding of a pre-trained StyleGAN model, together with a novel abstraction-level mapper that deciphers the level of abstraction and dynamically selects appropriate dimensions in the feature matrix correspondingly, to construct a feature matrix embedding that can be freely traversed to accommodate different levels of abstraction. For granularity-level abstraction understanding, we dictate that the retrieval model should not treat all abstraction levels equally and introduce a differentiable surrogate Acc.@q loss to inject that understanding into the system. Different to the gold-standard triplet loss, our Acc.@q loss uniquely allows a sketch to narrow/broaden its focus in terms of how stringent the evaluation should be - the more abstract a sketch, the less stringent (higher q). Extensive experiments show our method outperforming existing state-of-the-arts in standard SBIR tasks, along with challenging scenarios like early retrieval, forensic sketch-photo matching and style-invariant retrieval. + + + + ProS: Prompting-to-simulate Generalized knowledge for Universal Cross-Domain Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Fang_ProS_Prompting-to-simulate_Generalized_knowledge_for_Universal_Cross-Domain_Retrieval_CVPR_2024_paper.pdf + The goal of Universal Cross-Domain Retrieval (UCDR) is to achieve robust performance in generalized test scenarios, wherein data may belong to strictly unknown domains and categories during training. Recently, pre-trained models with prompt tuning have shown strong generalization capabilities and attained noteworthy achievements in various downstream tasks such as few-shot learning and video-text retrieval. However, applying them directly to UCDR may not be sufficient to handle both domain shift (i.e. adapting to unfamiliar domains) and semantic shift (i.e. transferring to unknown categories). To this end, we propose Prompting-to-Simulate (ProS), the first method to apply prompt tuning for UCDR. ProS employs a two-step process to simulate Content-aware Dynamic Prompts (CaDP), which can impact models to produce generalized features for UCDR. Concretely, in the Prompt Units Learning stage, we introduce two Prompt Units to individually capture domain and semantic knowledge in a mask-and-align way. Then, in the Context-aware Simulator Learning stage, we train a Content-aware Prompt Simulator under a simulated test scenario to produce the corresponding CaDP. Extensive experiments conducted on three benchmark datasets show that our method achieves new state-of-the-art performance without bringing excessive parameters. Code is available at https://github.com/fangkaipeng/ProS + + + + Boosting Object Detection with Zero-Shot Day-Night Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Du_Boosting_Object_Detection_with_Zero-Shot_Day-Night_Domain_Adaptation_CVPR_2024_paper.pdf + Detecting objects in low-light scenarios presents a persistent challenge, as detectors trained on well-lit data exhibit significant performance degradation on low-light data due to low visibility.
Previous methods mitigate this issue by exploring image enhancement or object detection techniques with real low-light image datasets. However the progress is impeded by the inherent difficulties about collecting and annotating low-light images. To address this challenge we propose to boost low-light object detection with zero-shot day-night domain adaptation which aims to generalize a detector from well-lit scenarios to low-light ones without requiring real low-light data. Revisiting Retinex theory in the low-level vision we first design a reflectance representation learning module to learn Retinex-based illumination invariance in images with a carefully designed illumination invariance reinforcement strategy. Next an interchange-redecomposition-coherence procedure is introduced to improve over the vanilla Retinex image decomposition process by performing two sequential image decompositions and introducing a redecomposition cohering loss. Extensive experiments on ExDark DARK FACE and CODaN datasets show strong low-light generalizability of our method. Our code is available at https://github.com/ZPDu/DAI-Net. + + + + Versatile Medical Image Segmentation Learned from Multi-Source Datasets via Model Self-Disambiguation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Versatile_Medical_Image_Segmentation_Learned_from_Multi-Source_Datasets_via_Model_CVPR_2024_paper.pdf + A versatile medical image segmentation model applicable to images acquired with diverse equipment and protocols can facilitate model deployment and maintenance. However building such a model typically demands a large diverse and fully annotated dataset which is challenging to obtain due to the labor-intensive nature of data curation. To address this challenge we propose a cost-effective alternative that harnesses multi-source data with only partial or sparse segmentation labels for training substantially reducing the cost of developing a versatile model. We devise strategies for model self-disambiguation prior knowledge incorporation and imbalance mitigation to tackle challenges associated with inconsistently labeled multi-source data including label ambiguity and modality dataset and class imbalances. Experimental results on a multi-modal dataset compiled from eight different sources for abdominal structure segmentation have demonstrated the effectiveness and superior performance of our method compared to state-of-the-art alternative approaches. We anticipate that its cost-saving features which optimize the utilization of existing annotated data and reduce annotation efforts for new data will have a significant impact in the field. + + + + Align and Aggregate: Compositional Reasoning with Video Alignment and Answer Aggregation for Video Question-Answering + http://openaccess.thecvf.com//content/CVPR2024/papers/Liao_Align_and_Aggregate_Compositional_Reasoning_with_Video_Alignment_and_Answer_CVPR_2024_paper.pdf + Despite the recent progress made in Video Question-Answering (VideoQA) these methods typically function as black-boxes making it difficult to understand their reasoning processes and perform consistent compositional reasoning. To address these challenges we propose a model-agnostic Video Alignment and Answer Aggregation (VA3) framework which is capable of enhancing both compositional consistency and accuracy of existing VidQA methods by integrating video aligner and answer aggregator modules. 
The video aligner hierarchically selects the relevant video clips based on the question while the answer aggregator deduces the answer to the question based on its sub-questions with compositional consistency ensured by the information flow along the question decompose graph and the contrastive learning strategy. We evaluate our framework on three settings of the AGQA-Decomp dataset with three baseline methods and propose new metrics to measure the compositional consistency of VidQA methods more comprehensively. Moreover we propose a large language model (LLM) based automatic question decompose pipeline to apply our framework on any VidQA data. We extend MSVD and NExT-QA datasets with it to evaluate such scheme and our VA3 framework on broader scenarios. Extensive experiments show that our framework improves both compositional consistency and accuracy of existing methods leading to more interpretable models in real-world applications. + + + + Action-slot: Visual Action-centric Representations for Multi-label Atomic Activity Recognition in Traffic Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Kung_Action-slot_Visual_Action-centric_Representations_for_Multi-label_Atomic_Activity_Recognition_in_CVPR_2024_paper.pdf + In this paper we study multi-label atomic activity recognition. Despite the notable progress in action recognition it is still challenging to recognize atomic activities due to a deficiency in holistic understanding of both multiple road users' motions and their contextual information. In this paper we introduce Action-slot a slot attention-based approach that learns visual action-centric representations capturing both motion and contextual information. Our key idea is to design action slots that are capable of paying attention to regions where atomic activities occur without the need for explicit perception guidance. To further enhance slot attention we introduce a background slot that competes with action slots aiding the training process in avoiding unnecessary focus on background regions devoid of activities. Yet the imbalanced class distribution in the existing dataset hampers the assessment of rare activities. To address the limitation we collect a synthetic dataset called TACO which is four times larger than OATS and features a balanced distribution of atomic activities. To validate the effectiveness of our method we conduct comprehensive experiments and ablation studies against various action recognition baselines. We also show that the performance of multi-label atomic activity recognition on real-world datasets can be improved by pretraining representations on TACO. + + + + Retraining-Free Model Quantization via One-Shot Weight-Coupling Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Retraining-Free_Model_Quantization_via_One-Shot_Weight-Coupling_Learning_CVPR_2024_paper.pdf + Quantization is of significance for compressing the over-parameterized deep neural models and deploying them on resource-limited devices. Fixed-precision quantization suffers from performance drop due to the limited numerical representation ability. Conversely mixed-precision quantization (MPQ) is advocated to compress the model effectively by allocating heterogeneous bit-width for layers. MPQ is typically organized into a searching-retraining two-stage process. Previous works only focus on determining the optimal bit-width configuration in the first stage efficiently while ignoring the considerable time costs in the second stage. 
However retraining always consumes hundreds of GPU-hours on the cutting-edge GPUs thus hindering deployment efficiency significantly. In this paper we devise a one-shot training-searching paradigm for mixed-precision model compression. Specifically in the first stage all potential bit-width configurations are coupled and thus optimized simultaneously within a set of shared weights. However our observations reveal a previously unseen and severe bit-width interference phenomenon among highly coupled weights during optimization leading to considerable performance degradation under a high compression ratio. To tackle this problem we first design a bit-width scheduler to dynamically freeze the most turbulent bit-width of layers during training to ensure the rest bit-widths converged properly. Then taking inspiration from information theory we present an information distortion mitigation technique to align the behaviour of the bad-performing bit-widths to the well-performing ones. In the second stage an inference-only greedy search scheme is devised to evaluate the goodness of configurations without introducing any additional training costs. Extensive experiments on three representative models and three datasets demonstrate the effectiveness of the proposed method. + + + + EVCap: Retrieval-Augmented Image Captioning with External Visual-Name Memory for Open-World Comprehension + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_EVCap_Retrieval-Augmented_Image_Captioning_with_External_Visual-Name_Memory_for_Open-World_CVPR_2024_paper.pdf + Large language models (LLMs)-based image captioning has the capability of describing objects not explicitly observed in training data; yet novel objects occur frequently necessitating the requirement of sustaining up-to-date object knowledge for open-world comprehension. Instead of relying on large amounts of data and/or scaling up network parameters we introduce a highly effective retrieval-augmented image captioning method that prompts LLMs with object names retrieved from External Visual--name memory (EVCap). We build ever-changing object knowledge memory using objects' visuals and names enabling us to (i) update the memory at a minimal cost and (ii) effortlessly augment LLMs with retrieved object names by utilizing a lightweight and fast-to-train model. Our model which was trained only on the COCO dataset can adapt to out-of-domain without requiring additional fine-tuning or re-training. Our experiments conducted on benchmarks and synthetic commonsense-violating data show that EVCap with only 3.97M trainable parameters exhibits superior performance compared to other methods based on frozen pre-trained LLMs. Its performance is also competitive to specialist SOTAs that require extensive training. + + + + SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_SIFU_Side-view_Conditioned_Implicit_Function_for_Real-world_Usable_Clothed_Human_CVPR_2024_paper.pdf + Creating high-quality 3D models of clothed humans from single images for real-world applications is crucial. Despite recent advancements accurately reconstructing humans in complex poses or with loose clothing from in-the-wild images along with predicting textures for unseen areas remains a significant challenge. A key limitation of previous methods is their insufficient prior guidance in transitioning from 2D to 3D and in texture prediction. 
In response we introduce SIFU (Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction) a novel approach combining a Side-view Decoupling Transformer with a 3D Consistent Texture Refinement pipeline. SIFU employs a cross-attention mechanism within the transformer using SMPL-X normals as queries to effectively decouple side-view features in the process of mapping 2D features to 3D. This method not only improves the precision of the 3D models but also their robustness especially when SMPL-X estimates are not perfect. Our texture refinement process leverages a text-to-image diffusion-based prior to generate realistic and consistent textures for invisible views. Through extensive experiments SIFU surpasses SOTA methods in both geometry and texture reconstruction showcasing enhanced robustness in complex scenarios and achieving an unprecedented Chamfer and P2S measurement. Our approach extends to practical applications such as 3D printing and scene building demonstrating its broad utility in real-world scenarios. + + + + Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_Autoregressive_Queries_for_Adaptive_Tracking_with_Spatio-Temporal_Transformers_CVPR_2024_paper.pdf + The rich spatio-temporal information is crucial to capture the complicated target appearance variations in visual tracking. However most top-performing tracking algorithms rely on many hand-crafted components for spatio-temporal information aggregation. Consequently the spatio-temporal information is far away from being fully explored. To alleviate this issue we propose an adaptive tracker with spatio-temporal transformers (named AQATrack) which adopts simple autoregressive queries to effectively learn spatio-temporal information without many hand-designed components. Firstly we introduce a set of learnable and autoregressive queries to capture the instantaneous target appearance changes in a sliding window fashion. Then we design a novel attention mechanism for the interaction of existing queries to generate a new query in the current frame. Finally based on the initial target template and learnt autoregressive queries a spatio-temporal information fusion module (STM) is designed for spatio-temporal information aggregation to locate a target object. Benefiting from the STM we can effectively combine the static appearance and instantaneous changes to guide robust tracking. Extensive experiments show that our method significantly improves the tracker's performance on six popular tracking benchmarks: LaSOT LaSOText TrackingNet GOT-10k TNL2K and UAV123. Code and models will be available at https://github.com/orgs/GXNU-ZhongLab. + + + + Lane2Seq: Towards Unified Lane Detection via Sequence Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Lane2Seq_Towards_Unified_Lane_Detection_via_Sequence_Generation_CVPR_2024_paper.pdf + In this paper we present a novel sequence generation-based framework for lane detection called Lane2Seq. It unifies various lane detection formats by casting lane detection as a sequence generation task. This is different from previous lane detection methods which depend on well-designed task-specific head networks and corresponding loss functions. Lane2Seq only adopts a plain transformer-based encoder-decoder architecture with a simple cross-entropy loss. &#13;
Additionally we propose a new multi-format model tuning based on reinforcement learning to incorporate the task-specific knowledge into Lane2Seq. Experimental results demonstrate that such a simple sequence generation paradigm not only unifies lane detection but also achieves competitive performance on benchmarks. For example Lane2Seq gets 97.95% and 97.42% F1 score on Tusimple and LLAMAS datasets establishing a new state-of-the-art result for two benchmarks. + + + + LEMON: Learning 3D Human-Object Interaction Relation from 2D Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_LEMON_Learning_3D_Human-Object_Interaction_Relation_from_2D_Images_CVPR_2024_paper.pdf + Learning 3D human-object interaction relation is pivotal to embodied AI and interaction modeling. Most existing methods approach the goal by learning to predict isolated interaction elements e.g. human contact object affordance and human-object spatial relation primarily from the perspective of either the human or the object. Such approaches underexploit certain correlations between the interaction counterparts (human and object) and struggle to address the uncertainty in interactions. Actually objects' functionalities potentially affect humans' interaction intentions which reveals what the interaction is. Meanwhile the interacting humans and objects exhibit matching geometric structures which presents how to interact. In light of this we propose harnessing these inherent correlations between interaction counterparts to mitigate the uncertainty and jointly anticipate the above interaction elements in 3D space. To achieve this we present LEMON (LEarning 3D huMan-Object iNteraction relation) a unified model that mines interaction intentions of the counterparts and employs curvatures to guide the extraction of geometric correlations combining them to anticipate the interaction elements. Besides the 3D Interaction Relation dataset (3DIR) is collected to serve as the test bed for training and evaluation. Extensive experiments demonstrate the superiority of LEMON over methods estimating each element in isolation. The code and dataset are available at https://yyvhang.github.io/LEMON/ + + + + Understanding Video Transformers via Universal Concept Discovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Kowal_Understanding_Video_Transformers_via_Universal_Concept_Discovery_CVPR_2024_paper.pdf + This paper studies the problem of concept-based interpretability of transformer representations for videos. Concretely we seek to explain the decision-making process of video transformers based on high-level spatiotemporal concepts that are automatically discovered. Prior research on concept-based interpretability has concentrated solely on image-level tasks. Comparatively video models deal with the added temporal dimension increasing complexity and posing challenges in identifying dynamic concepts over time. In this work we systematically address these challenges by introducing the first Video Transformer Concept Discovery (VTCD) algorithm. To this end we propose an efficient approach for unsupervised identification of units of video transformer representations - concepts and ranking their importance to the output of a model. The resulting concepts are highly interpretable revealing spatio-temporal reasoning mechanisms and object-centric representations in unstructured video models. &#13;
Performing this analysis jointly over a diverse set of supervised and self-supervised representations we discover that some of these mechanism are universal in video transformers. Finally we show that VTCD can be used for fine-grained action recognition and video object segmentation. + + + + PointOBB: Learning Oriented Object Detection via Single Point Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_PointOBB_Learning_Oriented_Object_Detection_via_Single_Point_Supervision_CVPR_2024_paper.pdf + Single point-supervised object detection is gaining attention due to its cost-effectiveness. However existing approaches focus on generating horizontal bounding boxes (HBBs) while ignoring oriented bounding boxes (OBBs) commonly used for objects in aerial images. This paper proposes PointOBB the first single Point-based OBB generation method for oriented object detection. PointOBB operates through the collaborative utilization of three distinctive views: an original view a resized view and a rotated/flipped (rot/flp) view. Upon the original view we leverage the resized and rot/flp views to build a scale augmentation module and an angle acquisition module respectively. In the former module a Scale-Sensitive Consistency (SSC) loss is designed to enhance the deep network's ability to perceive the object scale. For accurate object angle predictions the latter module incorporates self-supervised learning to predict angles which is associated with a scale-guided Dense-to-Sparse (DS) matching strategy for aggregating dense angles corresponding to sparse objects. The resized and rot/flp views are switched using a progressive multi-view switching strategy during training to achieve coupled optimization of scale and angle. Experimental results on the DIOR-R and DOTA-v1.0 datasets demonstrate that PointOBB achieves promising performance and significantly outperforms potential point-supervised baselines. + + + + OmniParser: A Unified Framework for Text Spotting Key Information Extraction and Table Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Wan_OmniParser_A_Unified_Framework_for_Text_Spotting_Key_Information_Extraction_CVPR_2024_paper.pdf + Recently visually-situated text parsing (VsTP) has experienced notable advancements driven by the increasing demand for automated document understanding and the emergence of Generative Large Language Models (LLMs) capable of processing document-based questions. Various methods have been proposed to address the challenging problem of VsTP. However due to the diversified targets and heterogeneous schemas previous works usually design task-specific architectures and objectives for individual tasks which inadvertently leads to modal isolation and complex workflow. In this paper we propose a unified paradigm for parsing visually-situated text across diverse scenarios. Specifically we devise a universal model called OmniParser which can simultaneously handle three typical visually-situated text parsing tasks: text spotting key information extraction and table recognition. In OmniParser all tasks share the unified encoder-decoder architecture the unified objective: point-conditioned text generation and the unified input & output representation: prompt & structured sequences. Extensive experiments demonstrate that the proposed OmniParser achieves state-of-the-art (SOTA) or highly competitive performances on 7 datasets for the three visually-situated text parsing tasks despite its unified concise design. 
The code is available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery. + + + + Training Like a Medical Resident: Context-Prior Learning Toward Universal Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Training_Like_a_Medical_Resident_Context-Prior_Learning_Toward_Universal_Medical_CVPR_2024_paper.pdf + A major focus of clinical imaging workflow is disease diagnosis and management leading to medical imaging datasets strongly tied to specific clinical objectives. This scenario has led to the prevailing practice of developing task-specific segmentation models without gaining insights from widespread imaging cohorts. Inspired by the training program of medical radiology residents we propose a shift towards universal medical image segmentation a paradigm aiming to build medical image understanding foundation models by leveraging the diversity and commonality across clinical targets body regions and imaging modalities. Towards this goal we develop Hermes a novel context-prior learning approach to address the challenges of data heterogeneity and annotation differences in medical image segmentation. In a large collection of eleven diverse datasets (2438 3D images) across five modalities (CT PET T1 T2 and cine MRI) and multiple body regions we demonstrate the merit of the universal paradigm over the traditional paradigm on addressing multiple tasks within a single model. By exploiting the synergy across tasks Hermes achieves state-of-the-art performance on all testing datasets and shows superior model scalability. Results on two additional datasets reveal Hermes' strong performance for transfer learning incremental learning and generalization to downstream tasks. Hermes's learned priors demonstrate an appealing trait to reflect the intricate relations among tasks and modalities which aligns with the established anatomical and imaging principles in radiology. The code is available. + + + + MicroDiffusion: Implicit Representation-Guided Diffusion for 3D Reconstruction from Limited 2D Microscopy Projections + http://openaccess.thecvf.com//content/CVPR2024/papers/Hui_MicroDiffusion_Implicit_Representation-Guided_Diffusion_for_3D_Reconstruction_from_Limited_2D_CVPR_2024_paper.pdf + Volumetric optical microscopy using non-diffracting beams enables rapid imaging of 3D volumes by projecting them axially to 2D images but lacks crucial depth information. Addressing this we introduce MicroDiffusion a pioneering tool facilitating high-quality depth-resolved 3D volume reconstruction from limited 2D projections. While existing Implicit Neural Representation (INR) models often yield incomplete outputs and Denoising Diffusion Probabilistic Models (DDPM) excel at capturing details our method integrates INR's structural coherence with DDPM's fine-detail enhancement capabilities. We pretrain an INR model to transform 2D axially-projected images into a preliminary 3D volume. This pretrained INR acts as a global prior guiding DDPM's generative process through a linear interpolation between INR outputs and noise inputs. This strategy enriches the diffusion process with structured 3D information enhancing detail and reducing noise in localized 2D images. By conditioning the diffusion model on the closest 2D projection MicroDiffusion substantially enhances fidelity in resulting 3D reconstructions surpassing INR and standard DDPM outputs with unparalleled image quality and structural fidelity. &#13;
Our code and dataset are available at https://github.com/UCSC-VLAA/MicroDiffusion. + + + + Task-Conditioned Adaptation of Visual Features in Multi-Task Policy Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Marza_Task-Conditioned_Adaptation_of_Visual_Features_in_Multi-Task_Policy_Learning_CVPR_2024_paper.pdf + Successfully addressing a wide variety of tasks is a core ability of autonomous agents requiring flexibly adapting the underlying decision-making strategies and as we argue in this work also adapting the perception modules. An analogical argument would be the human visual system which uses top-down signals to focus attention determined by the current task. Similarly we adapt pre-trained large vision models conditioned on specific downstream tasks in the context of multi-task policy learning. We introduce task-conditioned adapters that do not require finetuning any pre-trained weights combined with a single policy trained with behavior cloning and capable of addressing multiple tasks. We condition the visual adapters on task embeddings which can be selected at inference if the task is known or alternatively inferred from a set of example demonstrations. To this end we propose a new optimization-based estimator. We evaluate the method on a wide variety of tasks from the CortexBench benchmark and show that compared to existing work it can be addressed with a single policy. In particular we demonstrate that adapting visual features is a key design choice and that the method generalizes to unseen tasks given a few demonstrations. + + + + Hybrid Proposal Refiner: Revisiting DETR Series from the Faster R-CNN Perspective + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Hybrid_Proposal_Refiner_Revisiting_DETR_Series_from_the_Faster_R-CNN_CVPR_2024_paper.pdf + With the transformative impact of the Transformer DETR pioneered the application of the encoder-decoder architecture to object detection. A collection of follow-up research e.g. Deformable DETR aims to enhance DETR while adhering to the encoder-decoder design. In this work we revisit the DETR series through the lens of Faster R-CNN. We find that the DETR resonates with the underlying principles of Faster R-CNN's RPN-refiner design but benefits from end-to-end detection owing to the incorporation of Hungarian matching. We systematically adapt the Faster R-CNN towards the Deformable DETR by integrating or repurposing each component of Deformable DETR and note that Deformable DETR's improved performance over Faster R-CNN is attributed to the adoption of advanced modules such as a superior proposal refiner (e.g. deformable attention rather than RoI Align). When viewing the DETR through the RPN-refiner paradigm we delve into various proposal refinement techniques such as deformable attention cross attention and dynamic convolution. These proposal refiners cooperate well with each other; thus we synergistically combine them to establish a Hybrid Proposal Refiner (HPR). Our HPR is versatile and can be incorporated into various DETR detectors. For instance by integrating HPR to a strong DETR detector we achieve an AP of 54.9 on the COCO benchmark utilizing a ResNet-50 backbone and a 36-epoch training schedule. Code and models are available at https://github.com/ZhaoJingjing713/HPR. &#13;
+ + + + Video Harmonization with Triplet Spatio-Temporal Variation Patterns + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_Video_Harmonization_with_Triplet_Spatio-Temporal_Variation_Patterns_CVPR_2024_paper.pdf + Video harmonization is an important and challenging task that aims to obtain visually realistic composite videos by automatically adjusting the foreground's appearance to harmonize with the background. Inspired by the short-term and long-term gradual adjustment process of manual harmonization we present a Video Triplet Transformer framework to model three spatio-temporal variation patterns within videos i.e. short-term spatial as well as long-term global and dynamic for video-to-video tasks like video harmonization. Specifically for short-term harmonization we adjust foreground appearance to consist with background in spatial dimension based on the neighbor frames; for long-term harmonization we not only explore global appearance variations to enhance temporal consistency but also alleviate motion offset constraints to align similar contextual appearances dynamically. Extensive experiments and ablation studies demonstrate the effectiveness of our method achieving state-of-the-art performance in video harmonization video enhancement and video demoireing tasks. We also propose a temporal consistency metric to better evaluate the harmonized videos. Code is available at https://github.com/zhenglab/VideoTripletTransformer. + + + + Improved Zero-Shot Classification by Adapting VLMs with Text Descriptions + http://openaccess.thecvf.com//content/CVPR2024/papers/Saha_Improved_Zero-Shot_Classification_by_Adapting_VLMs_with_Text_Descriptions_CVPR_2024_paper.pdf + The zero-shot performance of existing vision-language models (VLMs) such as CLIP is limited by the availability of large-scale aligned image and text datasets in specific domains. In this work we leverage two complementary sources of information -- descriptions of categories generated by large language models (LLMs) and abundant fine-grained image classification datasets -- to improve the zero-shot classification performance of VLMs across fine-grained domains. On the technical side we develop methods to train VLMs with this "bag-level" image-text supervision. We find that simply using these attributes at test-time does not improve performance but our training strategy for example on the iNaturalist dataset leads to an average improvement of 4-5% in zero-shot classification accuracy for novel categories of birds and flowers. Similar improvements are observed in domains where a subset of the categories was used to fine-tune the model. By prompting LLMs in various ways we generate descriptions that capture visual appearance habitat and geographic regions and pair them with existing attributes such as the taxonomic structure of the categories. We systematically evaluate their ability to improve zero-shot categorization in natural domains. Our findings suggest that geographic priors can be just as effective and are complementary to visual appearance. Our method also outperforms prior work on prompt-based tuning of VLMs. We release the benchmark consisting of 14 datasets at https://github.com/cvl-umass/AdaptCLIPZS which will contribute to future research in zero-shot recognition. 
+ + + + CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_CricaVPR_Cross-image_Correlation-aware_Representation_Learning_for_Visual_Place_Recognition_CVPR_2024_paper.pdf + Over the past decade most methods in visual place recognition (VPR) have used neural networks to produce feature representations. These networks typically produce a global representation of a place image using only this image itself and neglect the cross-image variations (e.g. viewpoint and illumination) which limits their robustness in challenging scenes. In this paper we propose a robust global representation method with cross-image correlation awareness for VPR named CricaVPR. Our method uses the attention mechanism to correlate multiple images within a batch. These images can be taken in the same place with different conditions or viewpoints or even captured from different places. Therefore our method can utilize the cross-image variations as a cue to guide the representation learning which ensures more robust features are produced. To further facilitate the robustness we propose a multi-scale convolution-enhanced adaptation method to adapt pre-trained visual foundation models to the VPR task which introduces the multi-scale local information to further enhance the cross-image correlation-aware representation. Experimental results show that our method outperforms state-of-the-art methods by a large margin with significantly less training time. The code is released at https://github.com/Lu-Feng/CricaVPR. + + + + Instance-level Expert Knowledge and Aggregate Discriminative Attention for Radiology Report Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Bu_Instance-level_Expert_Knowledge_and_Aggregate_Discriminative_Attention_for_Radiology_Report_CVPR_2024_paper.pdf + Automatic radiology report generation can provide substantial advantages to clinical physicians by effectively reducing their workload and improving efficiency. Despite the promising potential of current methods challenges persist in effectively extracting and preventing degradation of prominent features as well as enhancing attention on pivotal regions. In this paper we propose an Instance-level Expert Knowledge and Aggregate Discriminative Attention framework (EKAGen) for radiology report generation. We convert expert reports into an embedding space and generate comprehensive representations for each disease which serve as Preliminary Knowledge Support (PKS). To prevent feature disruption we select the representations in the embedding space with the smallest distances to PKS as Rectified Knowledge Support (RKS). Then EKAGen diagnoses the diseases and retrieves knowledge from RKS creating Instance-level Expert Knowledge (IEK) for each query image boosting generation. Additionally we introduce Aggregate Discriminative Attention Map (ADM) which uses weak supervision to create maps of discriminative regions that highlight pivotal regions. For training we propose a Global Information Self-Distillation (GID) strategy using an iteratively optimized model to distill global knowledge into EKAGen. Extensive experiments and analyses on IU X-Ray and MIMIC-CXR datasets demonstrate that EKAGen outperforms previous state-of-the-art methods. 
+ + + + Each Test Image Deserves A Specific Prompt: Continual Test-Time Adaptation for 2D Medical Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Each_Test_Image_Deserves_A_Specific_Prompt_Continual_Test-Time_Adaptation_CVPR_2024_paper.pdf + Distribution shift widely exists in medical images acquired from different medical centres and poses a significant obstacle to deploying the pre-trained semantic segmentation model in real-world applications. Test-time adaptation has proven its effectiveness in tackling the cross-domain distribution shift during inference. However most existing methods achieve adaptation by updating the pre-trained models rendering them susceptible to error accumulation and catastrophic forgetting when encountering a series of distribution shifts (i.e. under the continual test-time adaptation setup). To overcome these challenges caused by updating the models in this paper we freeze the pre-trained model and propose the Visual Prompt-based Test-Time Adaptation (VPTTA) method to train a specific prompt for each test image to align the statistics in the batch normalization layers. Specifically we present the low-frequency prompt which is lightweight with only a few parameters and can be effectively trained in a single iteration. To enhance prompt initialization we equip VPTTA with a memory bank to benefit the current prompt from previous ones. Additionally we design a warm-up mechanism which mixes source and target statistics to construct warm-up statistics thereby facilitating the training process. Extensive experiments demonstrate the superiority of our VPTTA over other state-of-the-art methods on two medical image segmentation benchmark tasks. The code and weights of pre-trained source models are available at https://github.com/Chen-Ziyang/VPTTA. + + + + Versatile Navigation Under Partial Observability via Value-guided Diffusion Policy + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Versatile_Navigation_Under_Partial_Observability_via_Value-guided_Diffusion_Policy_CVPR_2024_paper.pdf + Route planning for navigation under partial observability plays a crucial role in modern robotics and autonomous driving. Existing route planning approaches can be categorized into two main classes: traditional autoregressive and diffusion-based methods. The former often fails due to its myopic nature while the latter either assumes full observability or struggles to adapt to unfamiliar scenarios due to strong couplings with behavior cloning from experts. To address these deficiencies we propose a versatile diffusion-based approach for both 2D and 3D route planning under partial observability. Specifically our value-guided diffusion policy first generates plans to predict actions across various timesteps providing ample foresight to the planning. It then employs a differentiable planner with state estimations to derive a value function directing the agent's exploration and goal-seeking behaviors without seeking experts while explicitly addressing partial observability. During inference our policy is further enhanced by a best-plan-selection strategy substantially boosting the planning success rate. Moreover we propose projecting point clouds derived from RGB-D inputs onto 2D grid-based bird-eye-view maps via semantic segmentation generalizing to 3D environments. This simple yet effective adaption enables zero-shot transfer from 2D-trained policy to 3D cutting across the laborious training for 3D policy and thus certifying our versatility. 
Experimental results demonstrate our superior performance particularly in navigating situations beyond expert demonstrations surpassing state-of-the-art autoregressive and diffusion-based baselines for both 2D and 3D scenarios. + + + + All in One Framework for Multimodal Re-identification in the Wild + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_All_in_One_Framework_for_Multimodal_Re-identification_in_the_Wild_CVPR_2024_paper.pdf + In Re-identification (ReID) recent advancements yield noteworthy progress in both unimodal and cross-modal retrieval tasks. However the challenge persists in developing a unified framework that could effectively handle varying multimodal data including RGB infrared sketches and textual information. Additionally the emergence of large-scale models shows promising performance in various vision tasks but the foundation model in ReID is still blank. In response to these challenges a novel multimodal learning paradigm for ReID is introduced referred to as All-in-One (AIO) which harnesses a frozen pre-trained big model as an encoder enabling effective multimodal retrieval without additional fine-tuning. The diverse multimodal data in AIO are seamlessly tokenized into a unified space allowing the modality-shared frozen encoder to extract identity-consistent features comprehensively across all modalities. Furthermore a meticulously crafted ensemble of cross-modality heads is designed to guide the learning trajectory. AIO is the first framework to perform all-in-one ReID encompassing four commonly used modalities. Experiments on cross-modal and multimodal ReID reveal that AIO not only adeptly handles various modal data but also excels in challenging contexts showcasing exceptional performance in zero-shot and domain generalization scenarios. Code will be available at: https://github.com/lihe404/AIO. + + + + Looking 3D: Anomaly Detection with 2D-3D Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Bhunia_Looking_3D_Anomaly_Detection_with_2D-3D_Alignment_CVPR_2024_paper.pdf + Automatic anomaly detection based on visual cues holds practical significance in various domains such as manufacturing and product quality assessment. This paper introduces a new conditional anomaly detection problem which involves identifying anomalies in a query image by comparing it to a reference shape. To address this challenge we have created a large dataset BrokenChairs-180K consisting of around 180K images with diverse anomalies geometries and textures paired with 8143 reference 3D shapes. To tackle this task we have proposed a novel transformer-based approach that explicitly learns the correspondence between the query image and reference 3D shape via feature alignment and leverages a customized attention mechanism for anomaly detection. Our approach has been rigorously evaluated through comprehensive experiments serving as a benchmark for future research in this domain. + + + + VS: Reconstructing Clothed 3D Human from Single Image via Vertex Shift + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_VS_Reconstructing_Clothed_3D_Human_from_Single_Image_via_Vertex_CVPR_2024_paper.pdf + Various applications require high-fidelity and artifact-free 3D human reconstructions. However current implicit function-based methods inevitably produce artifacts while existing deformation methods are difficult to reconstruct high-fidelity humans wearing loose clothing. 
In this paper we propose a two-stage deformation method named Vertex Shift (VS) for reconstructing clothed 3D humans from single images. Specifically VS first stretches the estimated SMPL-X mesh into a coarse 3D human model using shift fields inferred from normal maps then refines the coarse 3D human model into a detailed 3D human model via a graph convolutional network embedded with implicit-function-learned features. This "stretch-refine" strategy addresses large deformations required for reconstructing loose clothing and delicate deformations for recovering intricate and detailed surfaces achieving high-fidelity reconstructions that faithfully convey the pose clothing and surface details from the input images. The graph convolutional network's ability to exploit neighborhood vertices coupled with the advantages inherited from the deformation methods ensure VS rarely produces artifacts like distortions and non-human shapes and never produces artifacts like holes broken parts and dismembered limbs. As a result VS can reconstruct high-fidelity and artifact-less clothed 3D humans from single images even under scenarios of challenging poses and loose clothing. Experimental results on three benchmarks and two in-the-wild datasets demonstrate that VS significantly outperforms current state-of-the-art methods. The code and models of VS are available for research purposes at https://github.com/starVisionTeam/VS. + + + + PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Weng_PARA-Drive_Parallelized_Architecture_for_Real-time_Autonomous_Driving_CVPR_2024_paper.pdf + Recent works have proposed end-to-end autonomous vehicle (AV) architectures comprised of differentiable modules achieving state-of-the-art driving performance. While they provide advantages over the traditional perception-prediction-planning pipeline (e.g. removing information bottlenecks between components and alleviating integration challenges) they do so using a diverse combination of tasks modules and their interconnectivity. As of yet however there has been no systematic analysis of the necessity of these modules or the impact of their connectivity placement and internal representations on overall driving performance. Addressing this gap our work conducts a comprehensive exploration of the design space of end-to-end modular AV stacks. Our findings culminate in the development of PARA-Drive: a fully parallel end-to-end AV architecture. PARA-Drive not only achieves state-of-the-art performance in perception prediction and planning but also significantly enhances runtime speed by nearly 3x without compromising on interpretability or safety. + + + + Pink: Unveiling the Power of Referential Comprehension for Multi-modal LLMs + http://openaccess.thecvf.com//content/CVPR2024/papers/Xuan_Pink_Unveiling_the_Power_of_Referential_Comprehension_for_Multi-modal_LLMs_CVPR_2024_paper.pdf + Multi-modal Large Language Models (MLLMs) have shown remarkable capabilities in various multi-modal tasks. Nevertheless their performance in fine-grained image understanding tasks is still limited. To address this issue this paper proposes a new framework to enhance the fine-grained image understanding abilities of MLLMs. Specifically we present a new method for constructing the instruction tuning dataset at a low cost by leveraging annotations in existing datasets. 
A self-consistent bootstrapping method is also introduced to extend existing dense object annotations into high-quality referring-expression-bounding-box pairs. These methods enable the generation of high-quality instruction data which includes a wide range of fundamental abilities essential for fine-grained image perception. Moreover we argue that the visual encoder should be tuned during instruction tuning to mitigate the gap between full image perception and fine-grained image perception. Experimental results demonstrate the superior performance of our method. For instance our model exhibits a 5.2% accuracy improvement over Qwen-VL on GQA and surpasses the accuracy of Kosmos-2 by 24.7% on RefCOCO_val. We have also attained the top rank on the leaderboard of MMBench. This promising performance is achieved by training on only publicly available data making it easily reproducible. The models datasets and codes are publicly available at https://github.com/SY-Xuan/Pink. + + + + HalluciDoctor: Mitigating Hallucinatory Toxicity in Visual Instruction Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_HalluciDoctor_Mitigating_Hallucinatory_Toxicity_in_Visual_Instruction_Data_CVPR_2024_paper.pdf + Multi-modal Large Language Models (MLLMs) tuned on machine-generated instruction-following data have demonstrated remarkable performance in various multimodal understanding and generation tasks. However the hallucinations inherent in machine-generated data which could lead to hallucinatory outputs in MLLMs remain under-explored. This work aims to investigate various hallucinations (i.e. object relation attribute hallucinations) and mitigate those hallucinatory toxicities in large-scale machine-generated visual instruction datasets. Drawing on the human ability to identify factual errors we present a novel hallucination detection and elimination framework HalluciDoctor based on the cross-checking paradigm. We use our framework to identify and eliminate hallucinations in the training data automatically. Interestingly HalluciDoctor also indicates that spurious correlations arising from long-tail object co-occurrences contribute to hallucinations. Based on that we execute counterfactual visual instruction expansion to balance data distribution thereby enhancing MLLMs' resistance to hallucinations. Comprehensive experiments on hallucination evaluation benchmarks show that our method successfully mitigates 44.6% hallucinations relatively and maintains competitive performance compared to LLaVA. The data and code for this paper are publicly available. + + + + C^2RV: Cross-Regional and Cross-View Learning for Sparse-View CBCT Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_C2RV_Cross-Regional_and_Cross-View_Learning_for_Sparse-View_CBCT_Reconstruction_CVPR_2024_paper.pdf + Cone beam computed tomography (CBCT) is an important imaging technology widely used in medical scenarios such as diagnosis and preoperative planning. Using fewer projection views to reconstruct CT also known as sparse-view reconstruction can reduce ionizing radiation and further benefit interventional radiology. Compared with sparse-view reconstruction for traditional parallel/fan-beam CT CBCT reconstruction is more challenging due to the increased dimensionality caused by the measurement process based on cone-shaped X-ray beams. 
As a 2D-to-3D reconstruction problem although implicit neural representations have been introduced to enable efficient training only local features are considered and different views are processed equally in previous works resulting in spatial inconsistency and poor performance on complicated anatomies. To this end we propose C^2RV by leveraging explicit multi-scale volumetric representations to enable cross-regional learning in the 3D space. Additionally the scale-view cross-attention module is introduced to adaptively aggregate multi-scale and multi-view features. Extensive experiments demonstrate that our C^2RV achieves consistent and significant improvement over previous state-of-the-art methods on datasets with diverse anatomy. Code is available at https://github.com/xmed-lab/C2RV-CBCT. + + + + GLiDR: Topologically Regularized Graph Generative Network for Sparse LiDAR Point Clouds + http://openaccess.thecvf.com//content/CVPR2024/papers/Kumar_GLiDR_Topologically_Regularized_Graph_Generative_Network_for_Sparse_LiDAR_Point_CVPR_2024_paper.pdf + Sparse LiDAR point clouds cause severe loss of detail of static structures and reduce the density of static points available for navigation. Reduced density can be detrimental to navigation under several scenarios. We observe that despite high sparsity in most cases the global topology of LiDAR outlining the static structures can be inferred. We utilize this property to obtain a backbone skeleton of a LiDAR scan in the form of a single connected component that is a proxy to its global topology. We utilize the backbone to augment new points along static structures to overcome sparsity. Newly introduced points could correspond to existing static structures or to static points that were earlier obstructed by dynamic objects. To the best of our knowledge we are the first to use such a strategy for sparse LiDAR point clouds. Existing solutions close to our approach fail to identify and preserve the global static LiDAR topology and generate sub-optimal points. We propose GLiDR a Graph Generative network that is topologically regularized using 0-dimensional Persistent Homology (PH) constraints. This enables GLiDR to introduce newer static points along a topologically consistent global static LiDAR backbone. GLiDR generates precise static points using 32x sparser dynamic scans and performs better than the baselines across three datasets. GLiDR generates a valuable byproduct - an accurate binary segmentation mask of static and dynamic objects that are helpful for navigation planning and safety in constrained environments. The newly introduced static points allow GLiDR to outperform LiDAR-based navigation using SLAM in several settings. + + + + Commonsense Prototype for Outdoor Unsupervised 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Commonsense_Prototype_for_Outdoor_Unsupervised_3D_Object_Detection_CVPR_2024_paper.pdf + The prevalent approaches of unsupervised 3D object detection follow cluster-based pseudo-label generation and iterative self-training processes. However the challenge arises due to the sparsity of LiDAR scans which leads to pseudo-labels with erroneous size and position resulting in subpar detection performance. To tackle this problem this paper introduces a Commonsense Prototype-based Detector termed CPD for unsupervised 3D object detection. CPD first constructs Commonsense Prototype (CProto) characterized by high-quality bounding box and dense points based on commonsense intuition. 
Subsequently CPD refines the low-quality pseudo-labels by leveraging the size prior from CProto. Furthermore CPD enhances the detection accuracy of sparsely scanned objects by the geometric knowledge from CProto. CPD outperforms state-of-the-art unsupervised 3D detectors on Waymo Open Dataset (WOD) PandaSet and KITTI datasets by a large margin. Besides by training CPD on WOD and testing on KITTI CPD attains 90.85% and 81.01% 3D Average Precision on easy and moderate car classes respectively. These achievements position CPD in close proximity to fully supervised detectors highlighting the significance of our method. The code will be available at https://github.com/hailanyi/CPD. + + + + Lookahead Exploration with Neural Radiance Representation for Continuous Vision-Language Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Lookahead_Exploration_with_Neural_Radiance_Representation_for_Continuous_Vision-Language_Navigation_CVPR_2024_paper.pdf + Vision-and-language navigation (VLN) enables the agent to navigate to a remote location following the natural language instruction in 3D environments. At each navigation step the agent selects from possible candidate locations and then makes the move. For better navigation planning the lookahead exploration strategy aims to effectively evaluate the agent's next action by accurately anticipating the future environment of candidate locations. To this end some existing works predict RGB images for future environments while this strategy suffers from image distortion and high computational cost. To address these issues we propose the pre-trained hierarchical neural radiance representation model (HNR) to produce multi-level semantic features for future environments which are more robust and efficient than pixel-wise RGB reconstruction. Furthermore with the predicted future environmental representations our lookahead VLN model is able to construct the navigable future path tree and select the optimal path via efficient parallel evaluation. Extensive experiments on the VLN-CE datasets confirm the effectiveness of our method. + + + + Learning Vision from Models Rivals Learning Vision from Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Tian_Learning_Vision_from_Models_Rivals_Learning_Vision_from_Data_CVPR_2024_paper.pdf + We introduce SynCLR a novel approach for learning visual representations exclusively from synthetic images without any real data. We synthesize a large dataset of image captions using LLMs then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning treating images sharing the same caption as positive pairs. The resulting representations demonstrate remarkable transferability competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore in dense prediction tasks such as semantic segmentation SynCLR outperforms previous self-supervised methods by a significant margin e.g. improving over MAE and iBOT by 5.0 and 3.1 mIoU on ADE20k for ViT-B/16. 
+ + + + Adapting Short-Term Transformers for Action Detection in Untrimmed Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Adapting_Short-Term_Transformers_for_Action_Detection_in_Untrimmed_Videos_CVPR_2024_paper.pdf + Vision Transformer (ViT) has shown high potential in video recognition owing to its flexible design adaptable self-attention mechanisms and the efficacy of masked pre-training. Yet it remains unclear how to adapt these pre-trained short-term ViTs for temporal action detection (TAD) in untrimmed videos. The existing works treat them as off-the-shelf feature extractors for each short-trimmed snippet without capturing the fine-grained relation among different snippets in a broader temporal context. To mitigate this issue this paper focuses on designing a new mechanism for adapting these pre-trained ViT models as a unified long-form video transformer to fully unleash its modeling power in capturing inter-snippet relation while still keeping low computation overhead and memory consumption for efficient TAD. To this end we design effective cross-snippet propagation modules to gradually exchange short-term video information among different snippets from two levels. For inner-backbone information propagation we introduce a cross-snippet propagation strategy to enable multi-snippet temporal feature interaction inside the backbone. For post-backbone information propagation we propose temporal transformer layers for further clip-level modeling. With the plain ViT-B pre-trained with VideoMAE our end-to-end temporal action detector (ViT-TAD) yields a very competitive performance to previous temporal action detectors reaching up to 69.5 average mAP on THUMOS14 37.40 average mAP on ActivityNet-1.3 and 17.20 average mAP on FineAction. + + + + SOAC: Spatio-Temporal Overlap-Aware Multi-Sensor Calibration using Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Herau_SOAC_Spatio-Temporal_Overlap-Aware_Multi-Sensor_Calibration_using_Neural_Radiance_Fields_CVPR_2024_paper.pdf + In rapidly-evolving domains such as autonomous driving the use of multiple sensors with different modalities is crucial to ensure high operational precision and stability. To correctly exploit the information provided by each sensor in a single common frame it is essential for these sensors to be accurately calibrated. In this paper we leverage the ability of Neural Radiance Fields (NeRF) to represent different sensor modalities in a common volumetric representation to achieve robust and accurate spatio-temporal sensor calibration. By designing a partitioning approach based on the visible part of the scene for each sensor we formulate the calibration problem using only the overlapping areas. This strategy results in a more robust and accurate calibration that is less prone to failure. We demonstrate that our approach works on outdoor urban scenes by validating it on multiple established driving datasets. Results show that our method is able to get better accuracy and robustness compared to existing methods. + + + + G^3-LQ: Marrying Hyperbolic Alignment with Explicit Semantic-Geometric Modeling for 3D Visual Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_G3-LQ_Marrying_Hyperbolic_Alignment_with_Explicit_Semantic-Geometric_Modeling_for_3D_CVPR_2024_paper.pdf + Grounding referred objects in 3D scenes is a burgeoning vision-language task pivotal for propelling Embodied AI as it endeavors to connect the 3D physical world with free-form descriptions. &#13;
Compared to the 2D counterparts challenges posed by the variability of 3D visual grounding remain relatively unsolved in existing studies: 1) the underlying geometric and complex spatial relationships in 3D scene. 2) the inherent complexity of 3D grounded language. 3) the inconsistencies between text and geometric features. To tackle these issues we propose G^3-LQ a DEtection TRansformer-based model tailored for 3D visual grounding task. G^3-LQ explicitly models Geometric-aware visual representations and Generates fine-Grained Language-guided object Queries in an overarching framework which comprises two dedicated modules. Specifically the Position Adaptive Geometric Exploring (PAGE) unearths underlying information of 3D objects in the geometric details and spatial relationships perspectives. The Fine-grained Language-guided Query Selection (Flan-QS) delves into syntactic structure of texts and generates object queries that exhibit higher relevance towards fine-grained text features. Finally a pioneering Poincare Semantic Alignment (PSA) loss establishes semantic-geometry consistencies by modeling non-linear vision-text feature mappings and aligning them on a hyperbolic prototype--Poincare ball. Extensive experiments verify the superiority of our G^3-LQ method trumping the state-of-the-arts by a considerable margin. + + + + ToonerGAN: Reinforcing GANs for Obfuscating Automated Facial Indexing + http://openaccess.thecvf.com//content/CVPR2024/papers/Thakral_ToonerGAN_Reinforcing_GANs_for_Obfuscating_Automated_Facial_Indexing_CVPR_2024_paper.pdf + The rapid evolution of automatic facial indexing tech- nologies increases the risk of compromising personal and sensitive information. To address the issue we propose cre- ating cartoon avatars or 'toon avatars' designed to effec- tively obscure identity features. The primary objective is to deceive current AI systems preventing them from accu- rately identifying individuals while making minimal modi- fications to their facial features. Moreover we aim to en- sure that a human observer can still recognize the person depicted in these altered avatar images. To achieve this we introduce 'ToonerGAN' a novel approach that utilizes Generative Adversarial Networks (GANs) to craft person- alized cartoon avatars. The ToonerGAN framework con- sists of a style module and a de-identification module that work together to produce high-resolution realistic cartoon images. For the efficient training of our network we have developed an extensive dataset named 'ToonSet' compris- ing approximately 23000 facial images and their cartoon renditions. Through comprehensive experiments and bench- marking against existing datasets including CelebA-HQ our method demonstrates superior performance in obfus- cating identity while preserving the utility of data. Addi- tionally a user-centric study to explore the effectiveness of ToonerGAN has yielded some compelling observations. + + + + Editable Scene Simulation for Autonomous Driving via Collaborative LLM-Agents + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_Editable_Scene_Simulation_for_Autonomous_Driving_via_Collaborative_LLM-Agents_CVPR_2024_paper.pdf + Scene simulation in autonomous driving has gained significant attention because of its huge potential for generating customized data. However existing editable scene simulation approaches face limitations in terms of user interaction efficiency multi-camera photo-realistic rendering and external digital assets integration. 
To address these challenges this paper introduces ChatSim the first system that enables editable photo-realistic 3D driving scene simulations via natural language commands with external digital assets. To enable editing with high command flexibility ChatSim leverages a large language model (LLM) agent collaboration framework. To generate photo-realistic outcomes ChatSim employs a novel multi-camera neural radiance field method. Furthermore to unleash the potential of extensive high-quality digital assets ChatSim employs a novel multi-camera lighting estimation method to achieve scene-consistent assets' rendering. Our experiments on Waymo Open Dataset demonstrate that ChatSim can handle complex language commands and generate corresponding photo-realistic scene videos. Code can be accessed at: https://github.com/yifanlu0227/ChatSim. + + + + SnAG: Scalable and Accurate Video Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Mu_SnAG_Scalable_and_Accurate_Video_Grounding_CVPR_2024_paper.pdf + Temporal grounding of text descriptions in videos is a central problem in vision-language learning and video understanding. Existing methods often prioritize accuracy over scalability --- they have been optimized for grounding only a few text queries within short videos and fail to scale up to long videos with hundreds of queries. In this paper we study the effect of cross-modal fusion on the scalability of video grounding models. Our analysis establishes late fusion as a more cost-effective fusion scheme for long-form videos with many text queries. Moreover it leads us to a novel video-centric sampling scheme for efficient training. Based on these findings we present SnAG a simple baseline for scalable and accurate video grounding. Without bells and whistles SnAG is 43% more accurate and 1.5x faster than CONE a state of the art for long-form video grounding on the challenging MAD dataset while achieving highly competitive results on short videos. + + + + Building Vision-Language Models on Solid Foundations with Masked Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Sameni_Building_Vision-Language_Models_on_Solid_Foundations_with_Masked_Distillation_CVPR_2024_paper.pdf + Recent advancements in Vision-Language Models (VLMs) have marked a significant leap in bridging the gap between computer vision and natural language processing. However traditional VLMs trained through contrastive learning on limited and noisy image-text pairs often lack the spatial and linguistic understanding to generalize well to dense vision tasks or less common languages. Our approach Solid Foundation CLIP (SF-CLIP) circumvents this issue by implicitly building on the solid visual and language understanding of foundational models trained on vast amounts of unimodal data. SF-CLIP integrates contrastive image-text pretraining with a masked knowledge distillation from large foundational text and vision models. This methodology guides our VLM in developing robust text and image representations. As a result SF-CLIP shows exceptional zero-shot classification accuracy and enhanced image and text retrieval capabilities setting a new state of the art for ViT-B/16 trained on YFCC15M and CC12M. Moreover the dense per-patch supervision enhances our zero-shot and linear probe performance in semantic segmentation tasks. A remarkable aspect of our model is its multilingual proficiency evidenced by strong retrieval results in multiple languages despite being trained predominantly on English data. 
We achieve all of these improvements without sacrificing the training efficiency through our selective application of masked distillation and the inheritance of teacher word embeddings. + + + + TransLoc4D: Transformer-based 4D Radar Place Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_TransLoc4D_Transformer-based_4D_Radar_Place_Recognition_CVPR_2024_paper.pdf + Place Recognition is crucial for unmanned vehicles in terms of localization and mapping. Recent years have witnessed numerous explorations in the field where 2D cameras and 3D LiDARs are mostly employed. Despite their admirable performance they may encounter challenges in adverse weather such as rain and fog. Fortunately 4D millimeter-wave Radar emerges as a promising alternative as its longer wavelength makes it virtually immune to interference from tiny particles of fog and rain. Therefore in this work we propose a novel 4D Radar place recognition model TransLoc4D based on sparse convolution and Transformer structures. Specifically a Minkloc4D backbone is first proposed to leverage the geometric intensity and velocity information from 4D Radar scans. While mainstream 3D LiDAR solutions merely capture geometric structures of point clouds Minkloc4D explores the intensity and velocity properties of 4D Radar scans and demonstrates their effectiveness. After feature extraction a Transformer layer is introduced to enhance local features where linear self-attention captures the long-range dependency of the point cloud alleviating its sparsity and noise. To validate TransLoc4D we construct two datasets and set up benchmarks for 4D radar place recognition. Experiments show TransLoc4D is feasible and can robustly deal with dynamic and adverse environments. + + + + Multiscale Vision Transformers Meet Bipartite Matching for Efficient Single-stage Action Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Ntinou_Multiscale_Vision_Transformers_Meet_Bipartite_Matching_for_Efficient_Single-stage_Action_CVPR_2024_paper.pdf + Action Localization is a challenging problem that combines detection and recognition tasks which are often addressed separately. State-of-the-art methods rely on off-the-shelf bounding box detections pre-computed at high resolution and propose transformer models that focus on the classification task alone. Such two-stage solutions are prohibitive for real-time deployment. On the other hand single-stage methods target both tasks by devoting part of the network (generally the backbone) to sharing the majority of the workload compromising performance for speed. These methods build on adding a DETR head with learnable queries that after cross- and self-attention can be sent to corresponding MLPs for detecting a person's bounding box and action. However DETR-like architectures are challenging to train and can incur significant complexity. In this paper we observe that a straight bipartite matching loss can be applied to the output tokens of a vision transformer. This results in a backbone + MLP architecture that can do both tasks without the need for an extra encoder-decoder head and learnable queries. We show that a single MViTv2-S architecture trained with bipartite matching to perform both tasks surpasses the same MViTv2-S when trained with RoI align on pre-computed bounding boxes. With a careful design of token pooling and the proposed training pipeline our Bipartite-Matching Vision Transformer model BMViT achieves +3 mAP on AVA2.2. w.r.t. the two-stage MViTv2-S counterpart. 
Code is available at https://github.com/IoannaNti/BMViT + + + + Deep Single Image Camera Calibration by Heatmap Regression to Recover Fisheye Images Under Manhattan World Assumption + http://openaccess.thecvf.com//content/CVPR2024/papers/Wakai_Deep_Single_Image_Camera_Calibration_by_Heatmap_Regression_to_Recover_CVPR_2024_paper.pdf + A Manhattan world lying along cuboid buildings is useful for camera angle estimation. However accurate and robust angle estimation from fisheye images in the Manhattan world has remained an open challenge because general scene images tend to lack constraints such as lines arcs and vanishing points. To achieve higher accuracy and robustness we propose a learning-based calibration method that uses heatmap regression which is similar to pose estimation using keypoints to detect the directions of labeled image coordinates. Simultaneously our two estimators recover the rotation and remove fisheye distortion by remapping from a general scene image. Without considering vanishing-point constraints we find that additional points for learning-based methods can be defined. To compensate for the lack of vanishing points in images we introduce auxiliary diagonal points that have the optimal 3D arrangement of spatial uniformity. Extensive experiments demonstrated that our method outperforms conventional methods on large-scale datasets and with off-the-shelf cameras. + + + + CSTA: CNN-based Spatiotemporal Attention for Video Summarization + http://openaccess.thecvf.com//content/CVPR2024/papers/Son_CSTA_CNN-based_Spatiotemporal_Attention_for_Video_Summarization_CVPR_2024_paper.pdf + Video summarization aims to generate a concise representation of a video capturing its essential content and key moments while reducing its overall length. Although several methods employ attention mechanisms to handle long-term dependencies they often fail to capture the visual significance inherent in frames. To address this limitation we propose a CNN-based SpatioTemporal Attention (CSTA) method that stacks each feature of frames from a single video to form image-like frame representations and applies 2D CNN to these frame features. Our methodology relies on CNN to comprehend the inter and intra-frame relations and to find crucial attributes in videos by exploiting its ability to learn absolute positions within images. In contrast to previous work compromising efficiency by designing additional modules to focus on spatial importance CSTA requires minimal computational overhead as it uses CNN as a sliding window. Extensive experiments on two benchmark datasets (SumMe and TVSum) demonstrate that our proposed approach achieves state-of-the-art performance with fewer MACs compared to previous methods. Codes are available at https://github.com/thswodnjs3/CSTA. + + + + PEM: Prototype-based Efficient MaskFormer for Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cavagnero_PEM_Prototype-based_Efficient_MaskFormer_for_Image_Segmentation_CVPR_2024_paper.pdf + Recent transformer-based architectures have shown impressive results in the field of image segmentation. Thanks to their flexibility they obtain outstanding performance in multiple segmentation tasks such as semantic and panoptic under a single unified framework. To achieve such impressive performance these architectures employ intensive operations and require substantial computational resources which are often not available especially on edge devices. 
To fill this gap we propose Prototype-based Efficient MaskFormer (PEM) an efficient transformer-based architecture that can operate in multiple segmentation tasks. PEM proposes a novel prototype-based cross-attention which leverages the redundancy of visual features to restrict the computation and improve the efficiency without harming the performance. In addition PEM introduces an efficient multi-scale feature pyramid network capable of extracting features that have high semantic content in an efficient way thanks to the combination of deformable convolutions and context-based self-modulation. We benchmark the proposed PEM architecture on two tasks semantic and panoptic segmentation evaluated on two different datasets Cityscapes and ADE20K. PEM demonstrates outstanding performance on every task and dataset outperforming task-specific architectures while being comparable and even better than computationally expensive baselines. Code is available at https://github.com/NiccoloCavagnero/PEM. + + + + Referring Expression Counting + http://openaccess.thecvf.com//content/CVPR2024/papers/Dai_Referring_Expression_Counting_CVPR_2024_paper.pdf + Existing counting tasks are limited to the class level which don't account for fine-grained details within the class. In real applications it often requires in-context or referring human input for counting target objects. Take urban analysis as an example fine-grained information such as traffic flow in different directions pedestrians and vehicles waiting or moving at different sides of the junction is more beneficial. Current settings of both class-specific and class-agnostic counting treat objects of the same class indifferently which pose limitations in real use cases. To this end we propose a new task named Referring Expression Counting (REC) which aims to count objects with different attributes within the same class. To evaluate the REC task we create a novel dataset named REC-8K which contains 8011 images and 17122 referring expressions. Experiments on REC-8K show that our proposed method achieves state-of-the-art performance compared with several text-based counting methods and an open-set object detection model. We also outperform prior models on the class agnostic counting (CAC) benchmark [36] for the zero-shot setting and perform on par with the few-shot methods. Code and dataset is available at https://github.com/sydai/referring-expression-counting. + + + + Learning to Predict Activity Progress by Self-Supervised Video Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Donahue_Learning_to_Predict_Activity_Progress_by_Self-Supervised_Video_Alignment_CVPR_2024_paper.pdf + In this paper we tackle the problem of self-supervised video alignment and activity progress prediction using in-the-wild videos. Our proposed self-supervised representation learning method carefully addresses different action orderings redundant actions and background frames to generate improved video representations compared to previous methods. Our model generalizes temporal cycle-consistency learning to allow for more flexibility in determining cycle-consistent neighbors. More specifically to handle repeated actions we propose a multi-neighbor cycle consistency and a multi-cycle-back regression loss by finding multiple soft nearest neighbors using a Gaussian Mixture Model. To handle background and redundant frames we introduce a context-dependent drop function in our framework discouraging the alignment of droppable frames. 
On the other hand to learn from videos of multiple activities jointly we propose a multi-head crosstask network allowing us to embed a video and estimate progress without knowing its activity label. Experiments on multiple datasets show that our method outperforms the state-of-the-art for video alignment and progress prediction. + + + + VicTR: Video-conditioned Text Representations for Activity Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Kahatapitiya_VicTR_Video-conditioned_Text_Representations_for_Activity_Recognition_CVPR_2024_paper.pdf + Vision-Language models (VLMs) have excelled in the image-domain--- especially in zero-shot settings--- thanks to the availability of vast pretraining data (i.e. paired image-text samples). However for videos such paired data is not as abundant. Therefore video-VLMs are usually designed by adapting pretrained image-VLMs to the video-domain instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e. image --> video) often keeping text embeddings unchanged or even being discarded. In this paper we argue the contrary that better video-VLMs can be designed by focusing more on augmenting text rather than visual information. More specifically we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings creating a more-flexible contrastive latent space. Our model can further make use of freely-available semantic information in the form of visually-grounded auxiliary text (e.g. object or scene information). We evaluate our model on few-shot zero-shot (HMDB-51 UCF-101) short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks showing strong performance among video-VLMs. + + + + Label-Efficient Group Robustness via Out-of-Distribution Concept Curation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Label-Efficient_Group_Robustness_via_Out-of-Distribution_Concept_Curation_CVPR_2024_paper.pdf + Deep neural networks are prone to capture correlations between spurious attributes and class labels leading to low accuracy on some combinations of class labels and spurious attribute values. When a spurious attribute represents a protected class these low-accuracy groups can manifest discriminatory bias. Existing methods attempting to improve worst-group accuracy assume the training data validation data or both are reliably labeled by the spurious attribute. But a model may be perceived to be biased towards a concept that is not represented by pre-existing labels on the training data. In these situations the spurious attribute must be defined with external information. We propose Concept Correction a framework that represents a concept as a curated set of images from any source then labels each training sample by its similarity to the concept set to control spurious correlations. For example concept sets representing gender can be used to measure and control gender bias even without explicit labels. We demonstrate and evaluate an instance of the framework as Concept DRO which uses concept sets to estimate group labels then uses these labels to train with a state of the art distributively robust optimization objective. We show that Concept DRO outperforms existing methods that do not require labels of spurious attributes by up to 33.1% on three image classification datasets and is competitive with the best methods that assume access to labels. 
We consider how the size and quality of the concept set influences performance and find that even smaller manually curated sets of noisy AI-generated images are effective at controlling spurious correlations suggesting that high-quality reusable concept sets are easy to create and effective in reducing bias. + + + + 3DToonify: Creating Your High-Fidelity 3D Stylized Avatar Easily from 2D Portrait Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Men_3DToonify_Creating_Your_High-Fidelity_3D_Stylized_Avatar_Easily_from_2D_CVPR_2024_paper.pdf + Visual content creation has aroused a surge of interest given its applications in mobile photography and AR/VR. Portrait style transfer and 3D recovery from monocular images as two representative tasks have so far evolved independently. In this paper we make a connection between the two and tackle the challenging task of 3D portrait stylization - modeling high-fidelity 3D stylized avatars from captured 2D portrait images. However naively combining the techniques from the two isolated areas may suffer from either inadequate stylization or absence of 3D assets. To this end we propose 3DToonify a new framework that introduces a progressive training scheme to achieve 3D style adaption on spatial neural representation (SNR). SNR is constructed with implicit fields and they are dynamically optimized by the progressive training scheme which consists of three stages: guided prior learning deformable geometry adaption and explicit texture adaption. In this way stylized geometry and texture are learned in SNR in an explicit and structured way with only a single stylized exemplar needed. Moreover our method obtains style-adaptive underlying structures (i.e. deformable geometry and exaggerated texture) and view-consistent stylized avatar rendering from arbitrary novel viewpoints. Both qualitative and quantitative experiments have been conducted to demonstrate the effectiveness and superiority of our method for automatically generating exemplar-guided 3D stylized avatars. + + + + Investigating Compositional Challenges in Vision-Language Models for Visual Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_Investigating_Compositional_Challenges_in_Vision-Language_Models_for_Visual_Grounding_CVPR_2024_paper.pdf + Pre-trained vision-language models (VLMs) have achieved high performance on various downstream tasks which have been widely used for visual grounding tasks in a weakly supervised manner. However despite the performance gains contributed by large vision and language pre-training we find that state-of-the-art VLMs struggle with compositional reasoning on grounding tasks. To demonstrate this we propose Attribute Relation and Priority grounding (ARPGrounding) benchmark to test VLMs' compositional reasoning ability on visual grounding tasks. ARPGrounding contains 11425 samples and evaluates the compositional understanding of VLMs in three dimensions: 1) attribute denoting comprehension of objects' properties; 2) relation indicating an understanding of relation between objects; 3) priority reflecting an awareness of the part of speech associated with nouns. Using the ARPGrounding benchmark we evaluate several mainstream VLMs. We empirically find that these models perform quite well on conventional visual grounding datasets achieving performance comparable to or surpassing state-of-the-art methods but showing strong deficiencies in compositional reasoning. 
Furthermore we propose a composition-aware fine-tuning pipeline demonstrating the potential to leverage cost-effective image-text annotations for enhancing the compositional understanding of VLMs in grounding tasks. + + + + 6D-Diff: A Keypoint Diffusion Framework for 6D Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_6D-Diff_A_Keypoint_Diffusion_Framework_for_6D_Object_Pose_Estimation_CVPR_2024_paper.pdf + Estimating the 6D object pose from a single RGB image often involves noise and indeterminacy due to challenges such as occlusions and cluttered backgrounds. Meanwhile diffusion models have shown appealing performance in generating high-quality images from random noise with high indeterminacy through step-by-step denoising. Inspired by their denoising capability we propose a novel diffusion-based framework (6D-Diff) to handle the noise and indeterminacy in object pose estimation for better performance. In our framework to establish accurate 2D-3D correspondence we formulate 2D keypoints detection as a reverse diffusion (denoising) process. To facilitate such a denoising process we design a Mixture-of-Cauchy-based forward diffusion process and condition the reverse process on the object appearance features. Extensive experiments on the LM-O and YCB-V datasets demonstrate the effectiveness of our framework. + + + + Generative Region-Language Pretraining for Open-Ended Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_Generative_Region-Language_Pretraining_for_Open-Ended_Object_Detection_CVPR_2024_paper.pdf + In recent research significant attention has been devoted to the open-vocabulary object detection task aiming to generalize beyond the limited number of classes labeled during training and detect objects described by arbitrary category names at inference. Compared with conventional object detection open vocabulary object detection largely extends the object detection categories. However it relies on calculating the similarity between image regions and a set of arbitrary category names with a pretrained vision-and-language model. This implies that despite its open-set nature the task still needs the predefined object categories during the inference stage. This raises the question: What if we do not have exact knowledge of object categories during inference? In this paper we call such a new setting as generative open-ended object detection which is a more general and practical problem. To address it we formulate object detection as a generative problem and propose a simple framework named GenerateU which can detect dense objects and generate their names in a free-form way. Particularly we employ Deformable DETR as a region proposal generator with a language model translating visual regions to object names. To assess the free-form object detection task we introduce an evaluation method designed to quantitatively measure the performance of generative outcomes. Extensive experiments demonstrate strong zero-shot detection performance of our GenerateU. For example on the LVIS dataset our GenerateU achieves comparable results to the open-vocabulary object detection method GLIP even though the category names are not seen by GenerateU during inference. Code is available at: https://github.com/FoundationVision/GenerateU. 
+ + + + Enhancing Post-training Quantization Calibration through Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Shang_Enhancing_Post-training_Quantization_Calibration_through_Contrastive_Learning_CVPR_2024_paper.pdf + Post-training quantization (PTQ) converts a pre-trained full-precision (FP) model into a quantized model in a training-free manner. Determining suitable quantization parameters such as scaling factors and weight rounding is the primary strategy for mitigating the impact of quantization noise (calibration) and restoring the performance of the quantized models. However the existing activation calibration methods have never considered information degradation between pre- (FP) and post-quantized activations. In this study we introduce a well-defined distributional metric from information theory mutual information into PTQ calibration. We aim to calibrate the quantized activations by maximizing the mutual information between the pre- and post-quantized activations. To realize this goal we establish a contrastive learning (CL) framework for the PTQ calibration where the quantization parameters are optimized through a self-supervised proxy task. Specifically by leveraging CL during the PTQ process we can benefit from pulling the positive pairs of quantized and FP activations collected from the same input samples while pushing negative pairs from different samples. Thanks to the ingeniously designed critic function we avoid the unwanted but often-encountered collision solution in CL especially in calibration scenarios where the amount of calibration data is limited. Additionally we provide a theoretical guarantee that minimizing our designed loss is equivalent to maximizing the desired mutual information. Consequently the quantized activations retain more information which ultimately enhances the performance of the quantized network. Experimental results show that our method can effectively serve as an add-on module to existing SoTA PTQ methods. + + + + Enhancing Visual Document Understanding with Contrastive Learning in Large Visual-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Enhancing_Visual_Document_Understanding_with_Contrastive_Learning_in_Large_Visual-Language_CVPR_2024_paper.pdf + Recently the advent of Large Visual-Language Models (LVLMs) has received increasing attention across various domains particularly in the field of visual document understanding (VDU). Different from conventional vision-language tasks VDU is specifically concerned with text-rich scenarios containing abundant document elements. Nevertheless the importance of fine-grained features remains largely unexplored within the community of LVLMs leading to suboptimal performance in text-rich scenarios. In this paper we abbreviate it as the fine-grained feature collapse issue. With the aim of filling this gap we propose a contrastive learning framework termed Document Object COntrastive learning (DoCo) specifically tailored for the downstream tasks of VDU. DoCo leverages an auxiliary multimodal encoder to obtain the features of document objects and align them to the visual features generated by the vision encoder of LVLM which enhances visual representation in text-rich scenarios. 
This indicates that contrastive learning between the holistic visual representations and the fine-grained multimodal features of document objects can assist the vision encoder in acquiring more effective visual cues thereby enhancing the comprehension of text-rich documents in LVLMs. We also demonstrate that the proposed DoCo serves as a plug-and-play pre-training method which can be employed in the pre-training of various LVLMs without inducing any increase in computational complexity during the inference process. Extensive experimental results on multiple benchmarks of VDU reveal that LVLMs equipped with our proposed DoCo can achieve superior performance and mitigate the gap between VDU and generic vision-language tasks. + + + + Data Valuation and Detections in Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Data_Valuation_and_Detections_in_Federated_Learning_CVPR_2024_paper.pdf + Federated Learning (FL) enables collaborative model training while preserving the privacy of raw data. A challenge in this framework is the fair and efficient valuation of data which is crucial for incentivizing clients to contribute high-quality data in the FL task. In scenarios involving numerous data clients within FL it is often the case that only a subset of clients and datasets are pertinent to a specific learning task while others might have either a negative or negligible impact on the model training process. This paper introduces a novel privacy-preserving method for evaluating client contributions and selecting relevant datasets without a pre-specified training algorithm in an FL task. Our proposed approach FedBary utilizes Wasserstein distance within the federated context offering a new solution for data valuation in the FL framework. This method ensures transparent data valuation and efficient computation of the Wasserstein barycenter and reduces the dependence on validation datasets. Through extensive empirical experiments and theoretical analyses we demonstrate the advantages of this data valuation method as a promising avenue for FL research. + + + + Joint Reconstruction of 3D Human and Object via Contact-Based Refinement Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Nam_Joint_Reconstruction_of_3D_Human_and_Object_via_Contact-Based_Refinement_CVPR_2024_paper.pdf + Human-object contact serves as a strong cue to understand how humans physically interact with objects. Nevertheless it is not widely explored to utilize human-object contact information for the joint reconstruction of 3D human and object from a single image. In this work we present a novel joint 3D human-object reconstruction method (CONTHO) that effectively exploits contact information between humans and objects. There are two core designs in our system: 1) 3D-guided contact estimation and 2) contact-based 3D human and object refinement. First for accurate human-object contact estimation CONTHO initially reconstructs 3D humans and objects and utilizes them as explicit 3D guidance for contact estimation. Second to refine the initial reconstructions of 3D human and object we propose a novel contact-based refinement Transformer that effectively aggregates human features and object features based on the estimated human-object contact. The proposed contact-based refinement prevents the learning of erroneous correlation between human and object which enables accurate 3D reconstruction. 
As a result our CONTHO achieves state-of-the-art performance in both human-object contact estimation and joint reconstruction of 3D human and object. The codes are available in https://github.com/dqj5182/CONTHO_RELEASE. + + + + TIM: A Time Interval Machine for Audio-Visual Action Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Chalk_TIM_A_Time_Interval_Machine_for_Audio-Visual_Action_Recognition_CVPR_2024_paper.pdf + Diverse actions give rise to rich audio-visual signals in long videos. Recent works showcase that the two modalities of audio and video exhibit different temporal extents of events and distinct labels. We address the interplay between the two modalities in long videos by explicitly modelling the temporal extents of audio and visual events. We propose the Time Interval Machine (TIM) where a modality-specific time interval poses as a query to a transformer encoder that ingests a long video input. The encoder then attends to the specified interval as well as the surrounding context in both modalities in order to recognise the ongoing action. We test TIM on three long audio-visual video datasets: EPIC-KITCHENS Perception Test and AVE reporting state-of-the-art (SOTA) for recognition. On EPIC-KITCHENS we beat previous SOTA that utilises LLMs and significantly larger pre-training by 2.9% top-1 action recognition accuracy. Additionally we show that TIM can be adapted for action detection using dense multi-scale interval queries outperforming SOTA on EPIC-KITCHENS-100 for most metrics and showing strong performance on the Perception Test. Our ablations show the critical role of integrating the two modalities and modelling their time intervals in achieving this performance. Code and models at: https://github.com/JacobChalk/TIM. + + + + Would Deep Generative Models Amplify Bias in Future Models? + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Would_Deep_Generative_Models_Amplify_Bias_in_Future_Models_CVPR_2024_paper.pdf + We investigate the impact of deep generative models on potential social biases in upcoming computer vision models. As the internet witnesses an increasing influx of AI-generated images concerns arise regarding inherent biases that may accompany them potentially leading to the dissemination of harmful content. This paper explores whether a detrimental feedback loop resulting in bias amplification would occur if generated images were used as the training data for future models. We conduct simulations by progressively substituting original images in COCO and CC3M datasets with images generated through Stable Diffusion. The modified datasets are used to train OpenCLIP and image captioning models which we evaluate in terms of quality and bias. Contrary to expectations our findings indicate that introducing generated images during training does not uniformly amplify bias. Instead instances of bias mitigation across specific tasks are observed. We further explore the factors that may influence these phenomena such as artifacts in image generation (e.g. blurry faces) or pre-existing biases in the original datasets. + + + + CogAgent: A Visual Language Model for GUI Agents + http://openaccess.thecvf.com//content/CVPR2024/papers/Hong_CogAgent_A_Visual_Language_Model_for_GUI_Agents_CVPR_2024_paper.pdf + People are spending an enormous amount of time on digital devices through graphical user interfaces (GUIs) e.g. computer or smartphone screens. 
Large language models (LLMs) such as ChatGPT can assist people in tasks like writing emails but struggle to understand and interact with GUIs thus limiting their potential to increase automation levels. In this paper we introduce CogAgent an 18-billion-parameter visual language model (VLM) specializing in GUI understanding and navigation. By utilizing both low-resolution and high-resolution image encoders CogAgent supports input at a resolution of 1120*1120 enabling it to recognize tiny page elements and text. As a generalist visual language model CogAgent achieves the state of the art on five text-rich and four general VQA benchmarks including VQAv2 OK-VQA Text-VQA ST-VQA ChartQA infoVQA DocVQA MM-Vet and POPE. CogAgent using only screenshots as input outperforms LLM-based methods that consume extracted HTML text on both PC and Android GUI navigation tasks---Mind2Web and AITW advancing the state of the art. The model and codes are available at https://github.com/THUDM/CogVLM. + + + + AIDE: An Automatic Data Engine for Object Detection in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_AIDE_An_Automatic_Data_Engine_for_Object_Detection_in_Autonomous_CVPR_2024_paper.pdf + Autonomous vehicle (AV) systems rely on robust perception models as a cornerstone of safety assurance. However objects encountered on the road exhibit a long-tailed distribution with rare or unseen categories posing challenges to a deployed perception model. This necessitates an expensive process of continuously curating and annotating data with significant human effort. We propose to leverage recent advances in vision-language and large language models to design an Automatic Data Engine (AIDE) that automatically identifies issues efficiently curates data improves the model through auto-labeling and verifies the model through generation of diverse scenarios. This process operates iteratively allowing for continuous self-improvement of the model. We further establish a benchmark for open-world detection on AV datasets to comprehensively evaluate various learning paradigms demonstrating our method's superior performance at a reduced cost. + + + + Smart Help: Strategic Opponent Modeling for Proactive and Adaptive Robot Assistance in Households + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Smart_Help_Strategic_Opponent_Modeling_for_Proactive_and_Adaptive_Robot_CVPR_2024_paper.pdf + Despite the significant demand for assistive technology among vulnerable groups (e.g. the elderly children and the disabled) in daily tasks research into advanced AI-driven assistive solutions that genuinely accommodate their diverse needs remains sparse. Traditional human-machine interaction tasks often require machines to simply help without nuanced consideration of human abilities and feelings such as their opportunity for practice and learning sense of self-improvement and self-esteem. Addressing this gap we define a pivotal and novel challenge Smart Help which aims to provide proactive yet adaptive support to human agents with diverse disabilities and dynamic goals in various tasks and environments. To establish this challenge we leverage AI2-THOR to build a new interactive 3D realistic household environment for the Smart Help task. We introduce an innovative opponent modeling module that provides a nuanced understanding of the main agent's capabilities and goals in order to optimize the assisting agent's helping policy. 
Rigorous experiments validate the efficacy of our model components and show the superiority of our holistic approach against established baselines. Our findings illustrate the potential of AI-imbued assistive robots in improving the well-being of vulnerable groups. + + + + Rapid Motor Adaptation for Robotic Manipulator Arms + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_Rapid_Motor_Adaptation_for_Robotic_Manipulator_Arms_CVPR_2024_paper.pdf + Developing generalizable manipulation skills is a core challenge in embodied AI. This includes generalization across diverse task configurations encompassing variations in object shape density friction coefficient and external disturbances such as forces applied to the robot. Rapid Motor Adaptation (RMA) offers a promising solution to this challenge. It posits that essential hidden variables influencing an agent's task performance such as object mass and shape can be effectively inferred from the agent's action and proprioceptive history. Drawing inspiration from RMA in locomotion and in-hand rotation we use depth perception to develop agents tailored for rapid motor adaptation in a variety of manipulation tasks. We evaluated our agents on four challenging tasks from the Maniskill2 benchmark namely pick-and-place operations with hundreds of objects from the YCB and EGAD datasets peg insertion with precise position and orientation and operating a variety of faucets and handles with customized environment variations. Empirical results demonstrate that our agents surpass state-of-the-art methods like automatic domain randomization and vision-based policies obtaining better generalization performance and sample efficiency. + + + + WWW: A Unified Framework for Explaining What Where and Why of Neural Networks by Interpretation of Neuron Concepts + http://openaccess.thecvf.com//content/CVPR2024/papers/Ahn_WWW_A_Unified_Framework_for_Explaining_What_Where_and_Why_CVPR_2024_paper.pdf + Recent advancements in neural networks have showcased their remarkable capabilities across various domains. Despite these successes the "black box" problem still remains. To address this we propose a novel framework WWW that offers the 'what' 'where' and 'why' of the neural network decisions in human-understandable terms. Specifically WWW utilizes an adaptive selection for concept discovery employing adaptive cosine similarity and thresholding techniques to effectively explain 'what'. To address the 'where' and 'why' we proposed a novel combination of neuron activation maps (NAMs) with Shapley values generating localized concept maps and heatmaps for individual inputs. Furthermore WWW introduces a method for predicting uncertainty leveraging heatmap similarities to estimate the prediction's reliability. Experimental evaluations of WWW demonstrate superior performance in both quantitative and qualitative metrics outperforming existing methods in interpretability. WWW provides a unified solution for explaining 'what' 'where' and 'why' introducing a method for localized explanations from global interpretations and offering a plug-and-play solution adaptable to various architectures. 
The code is available at: https://github.com/ailab-kyunghee/WWW + + + + CaKDP: Category-aware Knowledge Distillation and Pruning Framework for Lightweight 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_CaKDP_Category-aware_Knowledge_Distillation_and_Pruning_Framework_for_Lightweight_3D_CVPR_2024_paper.pdf + Knowledge distillation (KD) possesses immense potential to accelerate the deep neural networks (DNNs) for LiDAR-based 3D detection. However in most of prevailing approaches the suboptimal teacher models and insufficient student architecture investigations limit the performance gains. To address these issues we propose a simple yet effective Category-aware Knowledge Distillation and Pruning (CaKDP) framework for compressing 3D detectors. Firstly CaKDP transfers the knowledge of two-stage detector to one-stage student one mitigating the impact of inadequate teacher models. To bridge the gap between the heterogeneous detectors we investigate their differences and then introduce the student-motivated category-aware KD to align the category prediction between distillation pairs. Secondly we propose a category-aware pruning scheme to obtain the customizable architecture of compact student model. The method calculates the category prediction gap before and after removing each filter to evaluate the importance of filters and retains the important filters. Finally to further improve the student performance a modified IOU-aware refinement module with negligible computations is leveraged to remove the redundant false positive predictions. Experiments demonstrate that CaKDP achieves the compact detector with high performance. For example on WOD CaKDP accelerates CenterPoint by half while boosting L2 mAPH by 1.61%. The code is available at https://github.com/zhnxjtu/CaKDP. + + + + ICP-Flow: LiDAR Scene Flow Estimation with ICP + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_ICP-Flow_LiDAR_Scene_Flow_Estimation_with_ICP_CVPR_2024_paper.pdf + Scene flow characterizes the 3D motion between two LiDAR scans captured by an autonomous vehicle at nearby timesteps. Prevalent methods consider scene flow as point-wise unconstrained flow vectors that can be learned by either large-scale training beforehand or time-consuming optimization at inference. However these methods do not take into account that objects in autonomous driving often move rigidly. We incorporate this rigid-motion assumption into our design where the goal is to associate objects over scans and then estimate the locally rigid transformations. We propose ICP-Flow a learning-free flow estimator. The core of our design is the conventional Iterative Closest Point (ICP) algorithm which aligns the objects over time and outputs the corresponding rigid transformations. Crucially to aid ICP we propose a histogram-based initialization that discovers the most likely translation thus providing a good starting point for ICP. The complete scene flow is then recovered from the rigid transformations. We outperform state-of-the-art baselines including supervised models on the Waymo dataset and perform competitively on Argoverse-v2 and nuScenes. Further we train a feedforward neural network supervised by the pseudo labels from our model and achieve top performance among all models capable of real-time inference. We validate the advantage of our model on scene flow estimation with longer temporal gaps up to 0.4 seconds where other models fail to deliver meaningful results. 
+ + + + MADTP: Multimodal Alignment-Guided Dynamic Token Pruning for Accelerating Vision-Language Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_MADTP_Multimodal_Alignment-Guided_Dynamic_Token_Pruning_for_Accelerating_Vision-Language_Transformer_CVPR_2024_paper.pdf + Vision-Language Transformers (VLTs) have shown great success recently but are meanwhile accompanied by heavy computation costs where a major reason can be attributed to the large number of visual and language tokens. Existing token pruning research for compressing VLTs mainly follows a single-modality-based scheme yet ignores the critical role of aligning different modalities for guiding the token pruning process causing the important tokens for one modality to be falsely pruned in another modality branch. Meanwhile existing VLT pruning works also lack the flexibility to dynamically compress each layer based on different input samples. To this end we propose a novel framework named Multimodal Alignment-Guided Dynamic Token Pruning (MADTP) for accelerating various VLTs. Specifically we first introduce a well-designed Multi-modality Alignment Guidance (MAG) module that can align features of the same semantic concept from different modalities to ensure the pruned tokens are less important for all modalities. We further design a novel Dynamic Token Pruning (DTP) module which can adaptively adjust the token compression ratio in each layer based on different input instances. Extensive experiments on various benchmarks demonstrate that MADTP significantly reduces the computational complexity of kinds of multimodal models while preserving competitive performance. Notably when applied to the BLIP model in the NLVR2 dataset MADTP can reduce the GFLOPs by 80% with less than 4% performance degradation. + + + + G-NeRF: Geometry-enhanced Novel View Synthesis from Single-View Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_G-NeRF_Geometry-enhanced_Novel_View_Synthesis_from_Single-View_Images_CVPR_2024_paper.pdf + Novel view synthesis aims to generate new view images of a given view image collection. Recent attempts address this problem relying on 3D geometry priors (e.g. shapes sizes and positions) learned from multi-view images. However such methods encounter the following limitations: 1) they require a set of multi-view images as training data for a specific scene (e.g. face car or chair) which is often unavailable in many real-world scenarios; 2) they fail to extract the geometry priors from single-view images due to the lack of multi-view supervision. In this paper we propose a Geometry-enhanced NeRF (G-NeRF) which seeks to enhance the geometry priors by a geometry-guided multi-view synthesis approach followed by a depth-aware training. In the synthesis process inspired that existing 3D GAN models can unconditionally synthesize high-fidelity multi-view images we seek to adopt off-the-shelf 3D GAN models such as EG3D as a free source to provide geometry priors through synthesizing multi-view data. Simultaneously to further improve the geometry quality of the synthetic data we introduce a truncation method to effectively sample latent codes within 3D GAN models. To tackle the absence of multi-view supervision for single-view images we design the depth-aware training approach incorporating a depth-aware discriminator to guide geometry priors through depth maps. Experiments demonstrate the effectiveness of our method in terms of both qualitative and quantitative results. 
+ + + + SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency + http://openaccess.thecvf.com//content/CVPR2024/papers/Roetzer_SpiderMatch_3D_Shape_Matching_with_Global_Optimality_and_Geometric_Consistency_CVPR_2024_paper.pdf + Finding shortest paths on product spaces is a popular approach to tackle numerous variants of matching problems including the dynamic time warping method for matching signals the matching of curves or the matching of a curve to a 3D shape. While these approaches admit the computation of globally optimal solutions in polynomial time their natural generalisation to 3D shape matching is widely known to be intractable. In this work we address this issue by proposing a novel path-based formalism for 3D shape matching. More specifically we consider an alternative shape discretisation in which one of the 3D shapes (the source shape) is represented as a SpiderCurve i.e. a long self-intersecting curve that traces the 3D shape surface. We then tackle the 3D shape matching problem as finding a shortest path in the product graph of the SpiderCurve and the target 3D shape. Our approach introduces a set of novel constraints that ensure a globally geometrically consistent matching. Overall our formalism leads to an integer linear programming problem for which we experimentally show that it can efficiently be solved to global optimality. We demonstrate that our approach is competitive with recent state-of-the-art shape matching methods while in addition guaranteeing geometric consistency. + + + + Evidential Active Recognition: Intelligent and Prudent Open-World Embodied Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Evidential_Active_Recognition_Intelligent_and_Prudent_Open-World_Embodied_Perception_CVPR_2024_paper.pdf + Active recognition enables robots to intelligently explore novel observations thereby acquiring more information while circumventing undesired viewing conditions. Recent approaches favor learning policies from simulated or collected data wherein appropriate actions are more frequently selected when the recognition is accurate. However most recognition modules are developed under the closed-world assumption which makes them ill-equipped to handle unexpected inputs such as the absence of the target object in the current observation. To address this issue we propose treating active recognition as a sequential evidence-gathering process providing by-step uncertainty quantification and reliable prediction under the evidence combination theory. Additionally the reward function developed in this paper effectively characterizes the merit of actions when operating in open-world environments. To evaluate the performance we collect a dataset from an indoor simulator encompassing various recognition challenges such as distance occlusion levels and visibility. Through a series of experiments on recognition and robustness analysis we demonstrate the necessity of introducing uncertainties to active recognition and the superior performance of the proposed method. + + + + The Unreasonable Effectiveness of Pre-Trained Features for Camera Pose Refinement + http://openaccess.thecvf.com//content/CVPR2024/papers/Trivigno_The_Unreasonable_Effectiveness_of_Pre-Trained_Features_for_Camera_Pose_Refinement_CVPR_2024_paper.pdf + Pose refinement is an interesting and practically relevant research direction. Pose refinement can be used to (1) obtain a more accurate pose estimate from an initial prior (e.g. 
from retrieval) (2) as pre-processing i.e. to provide a better starting point to a more expensive pose estimator (3) as post-processing of a more accurate localizer. Existing approaches focus on learning features / scene representations for the pose refinement task. This involves training an implicit scene representation or learning features while optimizing a camera pose-based loss. A natural question is whether training specific features / representations is truly necessary or whether similar results can be already achieved with more generic features. In this work we present a simple approach that combines pre-trained features with a particle filter and a renderable representation of the scene. Despite its simplicity it achieves state-of-the-art results demonstrating that one can easily build a pose refiner without the need for specific training. The code will be released upon acceptance. + + + + CLIP as RNN: Segment Countless Visual Concepts without Training Endeavor + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_CLIP_as_RNN_Segment_Countless_Visual_Concepts_without_Training_Endeavor_CVPR_2024_paper.pdf + Existing open-vocabulary image segmentation methods require a fine-tuning step on mask labels and/or image-text datasets. Mask labels are labor-intensive which limits the number of categories in segmentation datasets. Consequently the vocabulary capacity of pre-trained VLMs is severely reduced after fine-tuning. However without fine-tuning VLMs trained under weak image-text supervision tend to make suboptimal mask predictions. To alleviate these issues we introduce a novel recurrent framework that progressively filters out irrelevant texts and enhances mask quality without training efforts. The recurrent unit is a two-stage segmenter built upon a frozen VLM. Thus our model retains the VLM's broad vocabulary space and equips it with segmentation ability. Experiments show that our method outperforms not only the training-free counterparts but also those fine-tuned with millions of data samples and sets the new state-of-the-art records for both zero-shot semantic and referring segmentation. Concretely we improve the current record by 28.8 16.0 and 6.9 mIoU on Pascal VOC COCO Object and Pascal Context. + + + + Active Generalized Category Discovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Active_Generalized_Category_Discovery_CVPR_2024_paper.pdf + Generalized Category Discovery (GCD) is a pragmatic and challenging open-world task which endeavors to cluster unlabeled samples from both novel and old classes leveraging some labeled data of old classes. Given that knowledge learned from old classes is not fully transferable to new classes and that novel categories are fully unlabeled GCD inherently faces intractable problems including imbalanced classification performance and inconsistent confidence between old and new classes especially in the low-labeling regime. Hence some annotations of new classes are deemed necessary. However labeling new classes is extremely costly. To address this issue we take the spirit of active learning and propose a new setting called Active Generalized Category Discovery (AGCD). The goal is to improve the performance of GCD by actively selecting a limited amount of valuable samples for labeling from the oracle. To solve this problem we devise an adaptive sampling strategy which jointly considers novelty informativeness and diversity to adaptively select novel samples with proper uncertainty. 
However owing to the varied orderings of label indices caused by the clustering of novel classes the queried labels are not directly applicable to subsequent training. To overcome this issue we further propose a stable label mapping algorithm that transforms ground truth labels to the label space of the classifier thereby ensuring consistent training across different active selection stages. Our method achieves state-of-the-art performance on both generic and fine-grained datasets. Our code is available at https://github.com/mashijie1028/ActiveGCD + + + + OpenBias: Open-set Bias Detection in Text-to-Image Generative Models + http://openaccess.thecvf.com//content/CVPR2024/papers/DInca_OpenBias_Open-set_Bias_Detection_in_Text-to-Image_Generative_Models_CVPR_2024_paper.pdf + Text-to-image generative models are becoming increasingly popular and accessible to the general public. As these models see large-scale deployments it is necessary to deeply investigate their safety and fairness to not disseminate and perpetuate any kind of biases. However existing works focus on detecting closed sets of biases defined a priori limiting the studies to well-known concepts. In this paper we tackle the challenge of open-set bias detection in text-to-image generative models presenting OpenBias a new pipeline that identifies and quantifies the severity of biases agnostically without access to any precompiled set. OpenBias has three stages. In the first phase we leverage a Large Language Model (LLM) to propose biases given a set of captions. Secondly the target generative model produces images using the same set of captions. Lastly a Vision Question Answering model recognizes the presence and extent of the previously proposed biases. We study the behavior of Stable Diffusion 1.5 2 and XL emphasizing new biases never investigated before. Via quantitative experiments we demonstrate that OpenBias agrees with current closed-set bias detection methods and human judgement. + + + + 3DiffTection: 3D Object Detection with Geometry-Aware Diffusion Features + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_3DiffTection_3D_Object_Detection_with_Geometry-Aware_Diffusion_Features_CVPR_2024_paper.pdf + 3DiffTection introduces a novel method for 3D object detection from single images utilizing a 3D-aware diffusion model for feature extraction. Addressing the resource-intensive nature of annotating large-scale 3D image data our approach leverages pretrained diffusion models traditionally used for 2D tasks and adapts them for 3D detection through geometric and semantic tuning. Geometrically we enhance the model to perform view synthesis from single images incorporating an epipolar warp operator. This process utilizes easily accessible posed image data eliminating the need for manual annotation. Semantically the model is further refined on target detection data. Both stages utilize ControlNet ensuring the preservation of original feature capabilities. Through our methodology we obtain 3D-aware features that excel in identifying cross-view point correspondences. In 3D detection 3DiffTection substantially surpasses previous benchmarks e.g. Cube-RCNN by 9.43% in AP3D on the Omni3D-ARkitscene dataset. Furthermore 3DiffTection demonstrates robust label efficiency and generalizes well to cross-domain data nearly matching fully-supervised models in zero-shot scenarios. 
+ + + + LowRankOcc: Tensor Decomposition and Low-Rank Recovery for Vision-based 3D Semantic Occupancy Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_LowRankOcc_Tensor_Decomposition_and_Low-Rank_Recovery_for_Vision-based_3D_Semantic_CVPR_2024_paper.pdf + In this paper we present a tensor decomposition and low-rank recovery approach (LowRankOcc) for vision-based 3D semantic occupancy prediction. Conventional methods model outdoor scenes with fine-grained 3D grids but the sparsity of non-empty voxels introduces considerable spatial redundancy leading to potential overfitting risks. In contrast our approach leverages the intrinsic low-rank property of 3D occupancy data factorizing voxel representations into low-rank components to efficiently mitigate spatial redundancy without sacrificing performance. Specifically we present the Vertical-Horizontal (VH) decomposition block that factorizes 3D tensors into vertical vectors and horizontal matrices. With our "decomposition-encoding-recovery" framework we encode 3D contexts with only 1/2D convolutions and poolings and subsequently recover the encoded compact yet informative context features back to voxel representations. Experimental results demonstrate that LowRankOcc achieves state-of-the-art performances in semantic scene completion on the SemanticKITTI dataset and 3D occupancy prediction on the nuScenes dataset. + + + + Novel View Synthesis with View-Dependent Effects from a Single Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Bello_Novel_View_Synthesis_with_View-Dependent_Effects_from_a_Single_Image_CVPR_2024_paper.pdf + In this paper we address single image-based novel view synthesis (NVS) by firstly integrating view-dependent effects (VDE) into the process. Our approach leverages camera motion priors to model VDE treating negative disparity as the representation of these effects in the scene. By identifying that specularities align with camera motion we infuse VDEs into input images by aggregating pixel colors along the negative depth region of epipolar lines. Additionally we introduce a relaxed volumetric rendering approximation enhancing efficiency by computing densities in a single pass for NVS from single images. Notably our method learns single-image NVS from image sequences alone making it a fully self-supervised learning approach that requires no depth or camera pose annotations. We present extensive experimental results and show that our proposed method can learn NVS with VDEs outperforming the SOTA single-view NVS methods on the RealEstate10k and MannequinChallenge datasets. Visit our project site https://kaist-viclab.github.io/monovde-site. + + + + Point2RBox: Combine Knowledge from Synthetic Visual Patterns for End-to-end Oriented Object Detection with Single Point Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Point2RBox_Combine_Knowledge_from_Synthetic_Visual_Patterns_for_End-to-end_Oriented_CVPR_2024_paper.pdf + With the rapidly increasing demand for oriented object detection (OOD) recent research involving weakly-supervised detectors for learning rotated box (RBox) from the horizontal box (HBox) has attracted more and more attention. In this paper we explore a more challenging yet label-efficient setting namely single point-supervised OOD and present our approach called Point2RBox.
Specifically we propose to leverage two principles: 1) Synthetic pattern knowledge combination: By sampling around each labeled point on the image we spread the object feature to synthetic visual patterns with known boxes to provide the knowledge for box regression. 2) Transform self-supervision: With a transformed input image (e.g. scaled/rotated) the output RBoxes are trained to follow the same transformation so that the network can perceive the relative size/rotation between objects. The detector is further enhanced by a few devised techniques to cope with peripheral issues e.g. the anchor/layer assignment as the size of the object is not available in our point supervision setting. To our best knowledge Point2RBox is the first end-to-end solution for point-supervised OOD. In particular our method uses a lightweight paradigm yet it achieves a competitive performance among point-supervised alternatives 41.05%/27.62%/80.01% on DOTA/DIOR/HRSC datasets. + + + + HRVDA: High-Resolution Visual Document Assistant + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_HRVDA_High-Resolution_Visual_Document_Assistant_CVPR_2024_paper.pdf + Leveraging vast training data multimodal large language models (MLLMs) have demonstrated formidable general visual comprehension capabilities and achieved remarkable performance across various tasks. However their performance in visual document understanding still leaves much room for improvement. This discrepancy is primarily attributed to the fact that visual document understanding is a fine-grained prediction task. In natural scenes MLLMs typically use low-resolution images leading to a substantial loss of visual information. Furthermore general-purpose MLLMs do not excel in handling document-oriented instructions. In this paper we propose a High-Resolution Visual Document Assistant (HRVDA) which bridges the gap between MLLMs and visual document understanding. This model employs a content filtering mechanism and an instruction filtering module to separately filter out the content-agnostic visual tokens and instruction-agnostic visual tokens thereby achieving efficient model training and inference for high-resolution images. In addition we construct a document-oriented visual instruction tuning dataset and apply a multi-stage training strategy to enhance the model's document modeling capabilities. Extensive experiments demonstrate that our model achieves state-of-the-art performance across multiple document understanding datasets while maintaining training efficiency and inference speed comparable to low-resolution models. + + + + Learning for Transductive Threshold Calibration in Open-World Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Learning_for_Transductive_Threshold_Calibration_in_Open-World_Recognition_CVPR_2024_paper.pdf + In deep metric learning for visual recognition the calibration of distance thresholds is crucial for achieving desired model performance in the true positive rates (TPR) or true negative rates (TNR). However calibrating this threshold presents challenges in open-world scenarios where the test classes can be entirely disjoint from those encountered during training. We define the problem of finding distance thresholds for a trained embedding model to achieve target performance metrics over unseen open-world test classes as open-world threshold calibration.
Existing posthoc threshold calibration methods reliant on inductive inference and requiring a calibration dataset with a similar distance distribution as the test data often prove ineffective in open-world scenarios. To address this we introduce OpenGCN a Graph Neural Network-based transductive threshold calibration method with enhanced adaptability and robustness. OpenGCN learns to predict pairwise connectivity for the unlabeled test instances embedded in a graph to determine its TPR and TNR at various distance thresholds allowing for transductive inference of the distance thresholds which also incorporates test-time information. Extensive experiments across open-world visual recognition benchmarks validate OpenGCN's superiority over existing posthoc calibration methods for open-world threshold calibration. + + + + Weakly-Supervised Emotion Transition Learning for Diverse 3D Co-speech Gesture Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Qi_Weakly-Supervised_Emotion_Transition_Learning_for_Diverse_3D_Co-speech_Gesture_Generation_CVPR_2024_paper.pdf + Generating vivid and emotional 3D co-speech gestures is crucial for virtual avatar animation in human-machine interaction applications. While the existing methods enable generating the gestures to follow a single emotion label they overlook that long gesture sequence modeling with emotion transition is more practical in real scenes. In addition the lack of large-scale available datasets with emotional transition speech and corresponding 3D human gestures also limits the addressing of this task. To fulfill this goal we first incorporate the ChatGPT-4 and an audio inpainting approach to construct the high-fidelity emotion transition human speeches. Considering obtaining the realistic 3D pose annotations corresponding to the dynamically inpainted emotion transition audio is extremely difficult we propose a novel weakly supervised training strategy to encourage authority gesture transitions. Specifically to enhance the coordination of transition gestures w.r.t. different emotional ones we model the temporal association representation between two different emotional gesture sequences as style guidance and infuse it into the transition generation. We further devise an emotion mixture mechanism that provides weak supervision based on a learnable mixed emotion label for transition gestures. Last we present a keyframe sampler to supply effective initial posture cues in long sequences enabling us to generate diverse gestures. Extensive experiments demonstrate that our method outperforms the state-of-the-art models constructed by adapting single emotion-conditioned counterparts on our newly defined emotion transition task and datasets. Our code and dataset will be released on the project page: https://xingqunqi-lab.github.io/Emo-Transition-Gesture/ + + + + Causal-CoG: A Causal-Effect Look at Context Generation for Boosting Multi-modal Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Causal-CoG_A_Causal-Effect_Look_at_Context_Generation_for_Boosting_Multi-modal_CVPR_2024_paper.pdf + While Multi-modal Language Models (MLMs) demonstrate impressive multimodal ability they still struggle to provide factual and precise responses for tasks like visual question answering (VQA). In this paper we address this challenge from the perspective of contextual information.
We propose Causal Context Generation Causal-CoG which is a prompting strategy that engages contextual information to enhance precise VQA during inference. Specifically we prompt MLMs to generate contexts i.e. text descriptions of an image and engage the generated contexts for question answering. Moreover we investigate the advantage of contexts on VQA from a causality perspective introducing causality filtering to select samples for which contextual information is helpful. To show the effectiveness of Causal-CoG we run extensive experiments on 10 multimodal benchmarks and show consistent improvements e.g. +6.30% on POPE +13.69% on Vizwiz and +6.43% on VQAv2 compared to direct decoding surpassing existing methods. We hope Causal-CoG inspires explorations of context knowledge in multimodal models and serves as a plug-and-play strategy for MLM decoding. + + + + Brush2Prompt: Contextual Prompt Generator for Object Inpainting + http://openaccess.thecvf.com//content/CVPR2024/papers/Chiu_Brush2Prompt_Contextual_Prompt_Generator_for_Object_Inpainting_CVPR_2024_paper.pdf + Object inpainting is a task that involves adding objects to real images and seamlessly compositing them. With the recent commercialization of products like Stable Diffusion and Generative Fill inserting objects into images by using prompts has achieved impressive visual results. In this paper we propose a prompt suggestion model to simplify the process of prompt input. When the user provides an image and a mask our model predicts suitable prompts based on the partial contextual information in the masked image and the shape and location of the mask. Specifically we introduce a concept-diffusion in the CLIP space that predicts CLIP-text embeddings from a masked image. These diffused embeddings can be directly injected into open-source inpainting models like Stable Diffusion and its variants. Alternatively they can be decoded into natural language for use in other publicly available applications such as Generative Fill. Our prompt suggestion model demonstrates a balanced accuracy and diversity showing its capability to be both contextually aware and creatively adaptive. + + + + Joint-Task Regularization for Partially Labeled Multi-Task Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Nishi_Joint-Task_Regularization_for_Partially_Labeled_Multi-Task_Learning_CVPR_2024_paper.pdf + Multi-task learning has become increasingly popular in the machine learning field but its practicality is hindered by the need for large labeled datasets. Most multi-task learning methods depend on fully labeled datasets wherein each input example is accompanied by ground-truth labels for all target tasks. Unfortunately curating such datasets can be prohibitively expensive and impractical especially for dense prediction tasks which require per-pixel labels for each image. With this in mind we propose Joint-Task Regularization (JTR) an intuitive technique which leverages cross-task relations to simultaneously regularize all tasks in a single joint-task latent space to improve learning when data is not fully labeled for all tasks. JTR stands out from existing approaches in that it regularizes all tasks jointly rather than separately in pairs---therefore it achieves linear complexity relative to the number of tasks while previous methods scale quadratically. To demonstrate the validity of our approach we extensively benchmark our method across a wide variety of partially labeled scenarios based on NYU-v2 Cityscapes and Taskonomy.
+ + + + Shallow-Deep Collaborative Learning for Unsupervised Visible-Infrared Person Re-Identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Shallow-Deep_Collaborative_Learning_for_Unsupervised_Visible-Infrared_Person_Re-Identification_CVPR_2024_paper.pdf + Unsupervised visible-infrared person re-identification (US-VI-ReID) centers on learning a cross-modality retrieval model without labels reducing the reliance on expensive cross-modality manual annotation. Previous US-VI-ReID works gravitate toward learning cross-modality information with the deep features extracted from the ultimate layer. Nevertheless interfered by the multiple discrepancies solely relying on deep features is insufficient for accurately learning modality-invariant features resulting in negative optimization. The shallow feature from the shallow layers contains nuanced detail information which is critical for effective cross-modality learning but is disregarded regrettably by the existing methods. To address the above issues we design a Shallow-Deep Collaborative Learning (SDCL) framework based on the transformer with shallow-deep contrastive learning incorporating Collaborative Neighbor Learning (CNL) and Collaborative Ranking Association (CRA) modules. Specifically CNL unveils the intrinsic homogeneous and heterogeneous collaboration which are harnessed for neighbor alignment enhancing the robustness in a dynamic manner. Furthermore CRA associates the cross-modality labels with the ranking association between shallow and deep features furnishing valuable supervision for cross-modality learning. Extensive experiments validate the superiority of our method even outperforming certain supervised counterparts. + + + + Context-Aware Integration of Language and Visual References for Natural Language Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Shao_Context-Aware_Integration_of_Language_and_Visual_References_for_Natural_Language_CVPR_2024_paper.pdf + Tracking by natural language specification (TNL) aims to consistently localize a target in a video sequence given a linguistic description in the initial frame. Existing methodologies perform language-based and template-based matching for target reasoning separately and merge the matching results from two sources which suffer from tracking drift when language and visual templates misalign with the dynamic target state and ambiguity in the later merging stage. To tackle the issues we propose a joint multi-modal tracking framework with 1) a prompt modulation module to leverage the complementarity between temporal visual templates and language expressions enabling precise and context-aware appearance and linguistic cues and 2) a unified target decoding module to integrate the multi-modal reference cues and execute the integrated queries on the search image to predict the target location in an end-to-end manner directly. This design ensures spatio-temporal consistency by leveraging historical visual information and introduces an integrated solution generating predictions in a single step. Extensive experiments conducted on TNL2K OTB-Lang LaSOT and RefCOCOg validate the efficacy of our proposed approach. The results demonstrate competitive performance against state-of-the-art methods for both tracking and grounding.
Code is available at https://github.com/twotwo2/QueryNLT + + + + An Edit Friendly DDPM Noise Space: Inversion and Manipulations + http://openaccess.thecvf.com//content/CVPR2024/papers/Huberman-Spiegelglas_An_Edit_Friendly_DDPM_Noise_Space_Inversion_and_Manipulations_CVPR_2024_paper.pdf + Denoising diffusion probabilistic models (DDPMs) employ a sequence of white Gaussian noise samples to generate an image. In analogy with GANs those noise maps could be considered as the latent code associated with the generated image. However this native noise space does not possess a convenient structure and is thus challenging to work with in editing tasks. Here we propose an alternative latent noise space for DDPM that enables a wide range of editing operations via simple means and present an inversion method for extracting these edit-friendly noise maps for any given image (real or synthetically generated). As opposed to the native DDPM noise space the edit-friendly noise maps do not have a standard normal distribution and are not statistically independent across timesteps. However they allow perfect reconstruction of any desired image and simple transformations on them translate into meaningful manipulations of the output image (e.g. shifting color edits). Moreover in text-conditional models fixing those noise maps while changing the text prompt modifies semantics while retaining structure. We illustrate how this property enables text-based editing of real images via the diverse DDPM sampling scheme (in contrast to the popular non-diverse DDIM inversion). We also show how it can be used within existing diffusion-based editing methods to improve their quality and diversity. The code of the method is attached to this submission. + + + + RoDLA: Benchmarking the Robustness of Document Layout Analysis Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_RoDLA_Benchmarking_the_Robustness_of_Document_Layout_Analysis_Models_CVPR_2024_paper.pdf + Before developing a Document Layout Analysis (DLA) model in real-world applications conducting comprehensive robustness testing is essential. However the robustness of DLA models remains underexplored in the literature. To address this we are the first to introduce a robustness benchmark for DLA models which includes 450K document images of three datasets. To cover realistic corruptions we propose a perturbation taxonomy with 12 common document perturbations with 3 severity levels inspired by real-world document processing. Additionally to better understand document perturbation impacts we propose two metrics Mean Perturbation Effect (mPE) for perturbation assessment and Mean Robustness Degradation (mRD) for robustness evaluation. Furthermore we introduce a self-titled model i.e. Robust Document Layout Analyzer (RoDLA) which improves attention mechanisms to boost extraction of robust features. Experiments on the proposed benchmarks (PubLayNet-P DocLayNet-P and M6Doc-P) demonstrate that RoDLA obtains state-of-the-art mRD scores of 115.7 135.4 and 150.4 respectively. Compared to previous methods RoDLA achieves notable improvements in mAP of +3.8% +7.1% and +12.1% respectively. 
+ + + + BilevelPruning: Unified Dynamic and Static Channel Pruning for Convolutional Neural Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_BilevelPruning_Unified_Dynamic_and_Static_Channel_Pruning_for_Convolutional_Neural_CVPR_2024_paper.pdf + Most existing dynamic or runtime channel pruning methods have to store all weights to achieve efficient inference which brings extra storage costs. Static pruning methods can reduce storage costs directly but their performance is limited by using a fixed sub-network to approximate the original model. Most existing pruning works suffer from these drawbacks because they were designed to only conduct either static or dynamic pruning. In this paper we propose a novel method to solve both efficiency and storage challenges via simultaneously conducting dynamic and static channel pruning for convolutional neural networks. We propose a new bi-level optimization based model to naturally integrate the static and dynamic channel pruning. By doing so our method enjoys benefits from both sides and the disadvantages of dynamic and static pruning are reduced. After pruning we permanently remove redundant parameters and then finetune the model with dynamic flexibility. Experimental results on CIFAR-10 and ImageNet datasets suggest that our method can achieve state-of-the-art performance compared to existing dynamic and static channel pruning methods. + + + + IDGuard: Robust General Identity-centric POI Proactive Defense Against Face Editing Abuse + http://openaccess.thecvf.com//content/CVPR2024/papers/Dai_IDGuard_Robust_General_Identity-centric_POI_Proactive_Defense_Against_Face_Editing_CVPR_2024_paper.pdf + In this work we propose IDGuard a novel proactive defense method from the perspective of developers to protect Persons-of-Interest (POI) such as national leaders from face editing abuse. We build a bridge between identities and model behavior safeguarding POI identities rather than merely certain face images. Given a face editing model IDGuard enables it to reject editing any image containing POI identities while retaining its editing functionality for regular use. Specifically we insert an ID Normalization Layer into the original face editing model and introduce an ID Extractor to extract the identities of input images. To differentiate the editing behavior between POI and nonPOI we use a transformer-based ID Encoder to encode extracted POI identities as parameters of the ID Normalization Layer. Our method supports the simultaneous protection of multiple POI and allows for the addition of new POI in the inference stage without the need for retraining. Extensive experiments show that our method achieves 100% protection accuracy on POI images even if they are neither included in the training set nor subject to any preprocessing. Notably our method exhibits excellent robustness against image and model attacks and maintains 100% protection performance when generalized to various face editing models further demonstrating its practicality. + + + + Viewpoint-Aware Visual Grounding in 3D Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_Viewpoint-Aware_Visual_Grounding_in_3D_Scenes_CVPR_2024_paper.pdf + Referring expressions for visual objects often include descriptions of relative spatial arrangements to other objects -- e.g. "to the right of" -- that depend on the point of view of the speaker. In 2D referring expression tasks this viewpoint is captured unambiguously in the image. 
However grounding expressions with such spatial language in 3D without viewpoint annotations can be ambiguous. In this paper we investigate the significance of viewpoint information in 3D visual grounding -- introducing a model that explicitly predicts the speaker's viewpoint based on the referring expression and scene. We pretrain this model on a synthetically generated dataset that provides viewpoint annotations and then finetune on 3D referring expression datasets. Further we introduce an auxiliary uniform object representation loss to encourage viewpoint invariance in learned object representations. We find that our proposed ViewPoint Prediction Network (VPP-Net) achieves state-of-the-art performance on ScanRefer SR3D and NR3D -- improving Accuracy@0.25IoU by 1.06% 0.60% and 2.00% respectively compared to prior work. + + + + CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_CRKD_Enhanced_Camera-Radar_Object_Detection_with_Cross-modality_Knowledge_Distillation_CVPR_2024_paper.pdf + In the field of 3D object detection for autonomous driving LiDAR-Camera (LC) fusion is the top-performing sensor configuration. Still LiDAR is relatively high cost which hinders adoption of this technology for consumer automobiles. Alternatively camera and radar are commonly deployed on vehicles already on the road today but performance of Camera-Radar (CR) fusion falls behind LC fusion. In this work we propose Camera-Radar Knowledge Distillation (CRKD) to bridge the performance gap between LC and CR detectors with a novel cross-modality KD framework. We use the Bird's-Eye-View (BEV) representation as the shared feature space to enable effective knowledge distillation. To accommodate the unique cross-modality KD path we propose four distillation losses to help the student learn crucial features from the teacher model. We present extensive evaluations on the nuScenes dataset to demonstrate the effectiveness of the proposed CRKD framework. The project page for CRKD is https://song-jingyu.github.io/CRKD. + + + + CoG-DQA: Chain-of-Guiding Learning with Large Language Models for Diagram Question Answering + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_CoG-DQA_Chain-of-Guiding_Learning_with_Large_Language_Models_for_Diagram_Question_CVPR_2024_paper.pdf + Diagram Question Answering (DQA) is a challenging task requiring models to answer natural language questions based on visual diagram contexts. It serves as a crucial basis for academic tutoring technical support and more practical applications. DQA poses significant challenges such as the demand for domain-specific knowledge and the scarcity of annotated data which restrict the applicability of large-scale deep models. Previous approaches have explored external knowledge integration through pre-training but these methods are costly and can be limited by domain disparities. While Large Language Models (LLMs) show promise in question-answering there is still a gap in how to cooperate and interact with the diagram parsing process. In this paper we introduce the Chain-of-Guiding Learning Model for Diagram Question Answering (CoG-DQA) a novel framework that effectively addresses DQA challenges. CoG-DQA leverages LLMs to guide diagram parsing tools (DPTs) through the guiding chains enhancing the precision of diagram parsing while introducing rich background knowledge. 
Our experimental findings reveal that CoG-DQA surpasses all comparison models in various DQA scenarios achieving an average accuracy enhancement exceeding 5% and peaking at 11% across four datasets. These results underscore CoG-DQA's capacity to advance the field of visual question answering and promote the integration of LLMs into specialized domains. + + + + Transferable and Principled Efficiency for Open-Vocabulary Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Transferable_and_Principled_Efficiency_for_Open-Vocabulary_Segmentation_CVPR_2024_paper.pdf + Recent success of pre-trained foundation vision-language models makes Open-Vocabulary Segmentation (OVS) possible. Despite the promising performance this approach introduces heavy computational overheads for two challenges: 1) large model sizes of the backbone; 2) expensive costs during the fine-tuning. These challenges hinder this OVS strategy from being widely applicable and affordable in real-world scenarios. Although traditional methods such as model compression and efficient fine-tuning can address these challenges they often rely on heuristics. This means that their solutions cannot be easily transferred and necessitate re-training on different models which comes at a cost. In the context of efficient OVS we target achieving performance that is comparable to or even better than prior OVS works based on large vision-language foundation models by utilizing smaller models that incur lower training costs. The core strategy is to make our efficiency principled and thus seamlessly transferable from one OVS framework to others without further customization. Comprehensive experiments on diverse OVS benchmarks demonstrate our superior trade-off between segmentation accuracy and computation costs over previous works. Our code is available on https://github.com/Xujxyang/OpenTrans + + + + EvDiG: Event-guided Direct and Global Components Separation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_EvDiG_Event-guided_Direct_and_Global_Components_Separation_CVPR_2024_paper.pdf + Separating the direct and global components of a scene aids in shape recovery and basic material understanding. Conventional methods capture multiple frames under high frequency illumination patterns or shadows requiring the scene to keep stationary during the image acquisition process. Single-frame methods simplify the capture procedure but yield lower-quality separation results. In this paper we leverage the event camera to facilitate the separation of direct and global components enabling video-rate separation of high quality. In detail we adopt an event camera to record rapid illumination changes caused by the shadow of a line occluder sweeping over the scene and reconstruct the coarse separation results through event accumulation. We then design a network to resolve the noise in the coarse separation results and restore color information. A real-world dataset is collected using a hybrid camera system for network training and evaluation. Experimental results show superior performance over state-of-the-art methods. + + + + Feedback-Guided Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Feedback-Guided_Autonomous_Driving_CVPR_2024_paper.pdf + While behavior cloning has recently emerged as a highly successful paradigm for autonomous driving humans rarely learn to perform complex tasks such as driving via imitation or behavior cloning alone. 
In contrast learning in humans often involves additional detailed guidance throughout the interactive learning process i.e. where feedback often via language provides detailed information as to which part of their trial was performed incorrectly or suboptimally and why. Motivated by this observation we introduce an efficient feedback-based framework for improving behavior-cloning-based training of sensorimotor driving agents. Our key insight is to leverage recent advances in Large Language Models (LLMs) to provide corrective fine-grained feedback regarding the underlying reason behind driving prediction failures. Moreover our introduced network architecture is efficient enabling the first sensorimotor end-to-end training and evaluation of LLM-based driving models. The resulting agent achieves state-of-the-art performance in open-loop evaluation on nuScenes outperforming prior state-of-the-art by over 8.1% and 57.1% in accuracy and collision rate respectively. In CARLA our camera-based agent improves by 16.6% in driving score over prior LIDAR-based approaches. + + + + DiLiGenRT: A Photometric Stereo Dataset with Quantified Roughness and Translucency + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_DiLiGenRT_A_Photometric_Stereo_Dataset_with_Quantified_Roughness_and_Translucency_CVPR_2024_paper.pdf + Photometric stereo faces challenges from non-Lambertian reflectance in real-world scenarios. Systematically measuring the reliability of photometric stereo methods in handling such complex reflectance necessitates a real-world dataset with quantitatively controlled reflectances. This paper introduces DiLiGenRT the first real-world dataset for evaluating photometric stereo methods under quantified reflectances by manufacturing 54 hemispheres with varying degrees of two reflectance properties: Roughness and Translucency. Unlike qualitative and semantic labels such as diffuse and specular that have been used in previous datasets our quantified dataset allows comprehensive and systematic benchmark evaluations. In addition it facilitates selecting best-fit photometric stereo methods based on the quantitative reflectance properties. Our dataset and benchmark results are available at https://photometricstereo.github.io/diligentrt.html. + + + + De-Diffusion Makes Text a Strong Cross-Modal Interface + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_De-Diffusion_Makes_Text_a_Strong_Cross-Modal_Interface_CVPR_2024_paper.pdf + We demonstrate text as a strong cross-modal interface. Rather than relying on deep embeddings to connect image and language as the interface representation our approach represents an image as text from which we enjoy the interpretability and flexibility inherent to natural language. We employ an autoencoder that uses a pre-trained text-to-image diffusion model for decoding. The encoder is trained to transform an input image into text which is then fed into the fixed text-to-image diffusion decoder to reconstruct the original input a process we term De-Diffusion. Experiments validate both the precision and comprehensiveness of De-Diffusion text representing images such that it can be readily ingested by off-the-shelf text-to-image tools and LLMs for diverse multi-modal tasks. For example a single De-Diffusion model can generalize to provide transferable prompts for different text-to-image tools and also achieves a new state of the art on open-ended vision-language tasks by simply prompting large language models with few-shot examples. 
Project page: https://dediffusion.github.io/ + + + + End-to-End Spatio-Temporal Action Localisation with Video Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Gritsenko_End-to-End_Spatio-Temporal_Action_Localisation_with_Video_Transformers_CVPR_2024_paper.pdf + The most performant spatio-temporal action localisation models use external person proposals and complex external memory banks. We propose a fully end-to-end transformer based model that directly ingests an input video and outputs tubelets -- a sequence of bounding boxes and the action classes at each frame. Our flexible model can be trained with either sparse bounding-box supervision on individual frames or full tubelet annotations. And in both cases it predicts coherent tubelets as the output. Moreover our end-to-end model requires no additional pre-processing in the form of proposals or post-processing in terms of non-maximal suppression. We perform extensive ablation experiments and significantly advance the state-of-the-art on five different spatio-temporal action localisation benchmarks with both sparse keyframes and full tubelet annotations. + + + + End-to-End Temporal Action Detection with 1B Parameters Across 1000 Frames + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_End-to-End_Temporal_Action_Detection_with_1B_Parameters_Across_1000_Frames_CVPR_2024_paper.pdf + Recently temporal action detection (TAD) has seen significant performance improvement with end-to-end training. However due to the memory bottleneck only models with limited scales and limited data volumes can afford end-to-end training which inevitably restricts TAD performance. In this paper we reduce the memory consumption for end-to-end training and manage to scale up the TAD backbone to 1 billion parameters and the input video to 1536 frames leading to significant detection performance. The key to our approach lies in our proposed temporal-informative adapter (TIA) which is a novel lightweight module that reduces training memory. Using TIA we free the humongous backbone from learning to adapt to the TAD task by only updating the parameters in TIA. TIA also leads to better TAD representation by temporally aggregating context from adjacent frames throughout the backbone. We evaluate our model across four representative datasets. Owing to our efficient design we are able to train end-to-end on VideoMAEv2-giant and achieve 75.4% mAP on THUMOS14 being the first end-to-end model to outperform the best feature-based methods. + + + + TransNeXt: Robust Foveal Visual Perception for Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_TransNeXt_Robust_Foveal_Visual_Perception_for_Vision_Transformers_CVPR_2024_paper.pdf + Due to the depth degradation effect in residual connections many efficient Vision Transformers models that rely on stacking layers for information exchange often fail to form sufficient information mixing leading to unnatural visual perception. To address this issue in this paper we propose Aggregated Attention a biomimetic design-based token mixer that simulates biological foveal vision and continuous eye movement while enabling each token on the feature map to have a global perception. Furthermore we incorporate learnable tokens that interact with conventional queries and keys which further diversifies the generation of affinity matrices beyond merely relying on the similarity between queries and keys. 
Our approach does not rely on stacking for information exchange thus effectively avoiding depth degradation and achieving natural visual perception. Additionally we propose Convolutional GLU a channel mixer that bridges the gap between GLU and SE mechanism which empowers each token to have channel attention based on its nearest neighbor image features enhancing local modeling capability and model robustness. We combine aggregated attention and convolutional GLU to create a new visual backbone called TransNeXt. Extensive experiments demonstrate that our TransNeXt achieves state-of-the-art performance across multiple model sizes. At a resolution of 224^2 TransNeXt-Tiny attains an ImageNet accuracy of 84.0% surpassing ConvNeXt-B with 69% fewer parameters. Our TransNeXt-Base achieves an ImageNet accuracy of 86.2% and an ImageNet-A accuracy of 61.6% at a resolution of 384^2 a COCO object detection mAP of 57.1 and an ADE20K semantic segmentation mIoU of 54.7. + + + + Modeling Dense Multimodal Interactions Between Biological Pathways and Histology for Survival Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Jaume_Modeling_Dense_Multimodal_Interactions_Between_Biological_Pathways_and_Histology_for_CVPR_2024_paper.pdf + Integrating whole-slide images (WSIs) and bulk transcriptomics for predicting patient survival can improve our understanding of patient prognosis. However this multimodal task is particularly challenging due to the different nature of these data: WSIs represent a very high-dimensional spatial description of a tumor while bulk transcriptomics represent a global description of gene expression levels within that tumor. In this context our work aims to address two key challenges: (1) how can we tokenize transcriptomics in a semantically meaningful and interpretable way? and (2) how can we capture dense multimodal interactions between these two modalities? Here we propose to learn biological pathway tokens from transcriptomics that can encode specific cellular functions. Together with histology patch tokens that encode the slide morphology we argue that they form appropriate reasoning units for interpretability. We fuse both modalities using a memory-efficient multimodal Transformer that can model interactions between pathway and histology patch tokens. Our model SURVPATH achieves state-of-the-art performance when evaluated against unimodal and multimodal baselines on five datasets from The Cancer Genome Atlas. Our interpretability framework identifies key multimodal prognostic factors and as such can provide valuable insights into the interaction between genotype and phenotype. Code available at https://github.com/mahmoodlab/SurvPath. + + + + Mining Supervision for Dynamic Regions in Self-Supervised Monocular Depth Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_Mining_Supervision_for_Dynamic_Regions_in_Self-Supervised_Monocular_Depth_Estimation_CVPR_2024_paper.pdf + This paper focuses on self-supervised monocular depth estimation in dynamic scenes trained on monocular videos. Existing methods jointly estimate pixel-wise depth and motion relying mainly on an image reconstruction loss. Dynamic regions remain a critical challenge for these methods due to the inherent ambiguity in depth and motion estimation resulting in inaccurate depth estimation. This paper proposes a self-supervised training framework exploiting pseudo depth labels for dynamic regions from training data. 
The key contribution of our framework is to decouple depth estimation for static and dynamic regions of images in the training data. We start with an unsupervised depth estimation approach which provides reliable depth estimates for static regions and motion cues for dynamic regions and allows us to extract moving object information at the instance level. In the next stage we use an object network to estimate the depth of those moving objects assuming rigid motions. Then we propose a new scale alignment module to address the scale ambiguity between estimated depths for static and dynamic regions. We can then use the depth labels generated to train an end-to-end depth estimation network and improve its performance. Extensive experiments on the Cityscapes and KITTI datasets show that our self-training strategy consistently outperforms existing self-/unsupervised depth estimation methods. + + + + Physics-guided Shape-from-Template: Monocular Video Perception through Neural Surrogate Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Stotko_Physics-guided_Shape-from-Template_Monocular_Video_Perception_through_Neural_Surrogate_Models_CVPR_2024_paper.pdf + 3D reconstruction of dynamic scenes is a long-standing problem in computer graphics and increasingly difficult the less information is available. Shape-from-Template (SfT) methods aim to reconstruct a template-based geometry from RGB images or video sequences often leveraging just a single monocular camera without depth information such as regular smartphone recordings. Unfortunately existing reconstruction methods are either unphysical and noisy or slow in optimization. To solve this problem we propose a novel SfT reconstruction algorithm for cloth using a pre-trained neural surrogate model that is fast to evaluate stable and produces smooth reconstructions due to a regularizing physics simulation. Differentiable rendering of the simulated mesh enables pixel-wise comparisons between the reconstruction and a target video sequence that can be used for a gradient-based optimization procedure to extract not only shape information but also physical parameters such as stretching shearing or bending stiffness of the cloth. This allows us to retain a precise stable and smooth reconstructed geometry while reducing the runtime by a factor of 400-500 compared to φ-SfT a state-of-the-art physics-based SfT approach. + + + + You'll Never Walk Alone: A Sketch and Text Duet for Fine-Grained Image Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Koley_Youll_Never_Walk_Alone_A_Sketch_and_Text_Duet_for_CVPR_2024_paper.pdf + Two primary input modalities prevail in image retrieval: sketch and text. While text is widely used for inter-category retrieval tasks sketches have been established as the sole preferred modality for fine-grained image retrieval due to their ability to capture intricate visual details. In this paper we question the reliance on sketches alone for fine-grained image retrieval by simultaneously exploring the fine-grained representation capabilities of both sketch and text orchestrating a duet between the two. The end result enables precise retrievals previously unattainable allowing users to pose ever-finer queries and incorporate attributes like colour and contextual cues from text. For this purpose we introduce a novel compositionality framework effectively combining sketches and text using pre-trained CLIP models while eliminating the need for extensive fine-grained textual descriptions.
Last but not least our system extends to novel applications in composed image retrieval domain attribute transfer and fine-grained generation providing solutions for various real-world scenarios. + + + + Unsupervised 3D Structure Inference from Category-Specific Image Collections + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Unsupervised_3D_Structure_Inference_from_Category-Specific_Image_Collections_CVPR_2024_paper.pdf + Understanding 3D object structure from image collections of general object categories remains a long-standing challenge in computer vision. Due to the high relevance of image keypoints (e.g. for graph matching controlling generative models scene understanding etc.) in this work we specifically focus on inferring 3D structure in terms of sparse keypoints. Existing 3D keypoint inference approaches rely on strong priors such as spatio-temporal consistency multi-view images of the same object 3D shape priors (e.g. templates skeleton) or supervisory signals e.g. in the form of 2D keypoint annotations. In contrast we propose the first unsupervised 3D keypoint inference approach that can be trained for general object categories solely from an inhomogeneous image collection (containing different instances of objects from the same category). Our experiments show that our method not only improves upon unsupervised 2D keypoint inference but more importantly it also produces reasonable 3D structure for various object categories both qualitatively and quantitatively. + + + + DiG-IN: Diffusion Guidance for Investigating Networks - Uncovering Classifier Differences Neuron Visualisations and Visual Counterfactual Explanations + http://openaccess.thecvf.com//content/CVPR2024/papers/Augustin_DiG-IN_Diffusion_Guidance_for_Investigating_Networks_-_Uncovering_Classifier_Differences_CVPR_2024_paper.pdf + While deep learning has led to huge progress in complex image classification tasks like ImageNet unexpected failure modes e.g. via spurious features call into question how reliably these classifiers work in the wild. Furthermore for safety-critical tasks the black-box nature of their decisions is problematic and explanations or at least methods which make decisions plausible are needed urgently. In this paper we address these problems by generating images that optimize a classifier-derived objective using a framework for guided image generation. We analyze the decisions of image classifiers by visual counterfactual explanations (VCEs) detection of systematic mistakes by analyzing images where classifiers maximally disagree and visualization of neurons and spurious features. In this way we validate existing observations e.g. the shape bias of adversarially robust models as well as novel failure modes e.g. systematic errors of zero-shot CLIP classifiers. Moreover our VCEs outperform previous work while being more versatile. + + + + RepViT: Revisiting Mobile CNN From ViT Perspective + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_RepViT_Revisiting_Mobile_CNN_From_ViT_Perspective_CVPR_2024_paper.pdf + Recently lightweight Vision Transformers (ViTs) demonstrate superior performance and lower latency compared with lightweight Convolutional Neural Networks (CNNs) on resource-constrained mobile devices. Researchers have discovered many structural connections between lightweight ViTs and lightweight CNNs. However the notable architectural disparities in the block structure macro and micro designs between them have not been adequately examined. 
In this study we revisit the efficient design of lightweight CNNs from ViT perspective and emphasize their promising prospect for mobile devices. Specifically we incrementally enhance the mobile-friendliness of a standard lightweight CNN i.e. MobileNetV3 by integrating the efficient architectural designs of lightweight ViTs. This ends up with a new family of pure lightweight CNNs namely RepViT. Extensive experiments show that RepViT outperforms existing state-of-the-art lightweight ViTs and exhibits favorable latency in various vision tasks. Notably on ImageNet RepViT achieves over 80% top-1 accuracy with 1.0 ms latency on an iPhone 12 which is the first time for a lightweight model to the best of our knowledge. Besides when RepViT meets SAM our RepViT-SAM can achieve nearly 10x faster inference than the advanced MobileSAM. Codes and models are available at https://github.com/THU-MIG/RepViT. + + + + MonoNPHM: Dynamic Head Reconstruction from Monocular Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Giebenhain_MonoNPHM_Dynamic_Head_Reconstruction_from_Monocular_Videos_CVPR_2024_paper.pdf + We present Monocular Neural Parametric Head Models (MonoNPHM) for dynamic 3D head reconstructions from monocular RGB videos. To this end we propose a latent appearance space that parameterizes a texture field on top of a neural parametric model. We constrain predicted color values to be correlated with the underlying geometry such that gradients from RGB effectively influence latent geometry codes during inverse rendering. To increase the representational capacity of our expression space we augment our backward deformation field with hyper-dimensions thus improving color and geometry representation in topologically challenging expressions. Using MonoNPHM as a learned prior we approach the task of 3D head reconstruction using signed distance field based volumetric rendering. By numerically inverting our backward deformation field we incorporate a facial landmark loss that ties facial anchor points in our canonical geometry representation to observed 2D facial landmarks in posed space. To evaluate the task of dynamic face reconstruction from monocular RGB videos we record 20 challenging Kinect sequences under casual conditions. MonoNPHM outperforms all baselines by a significant margin and makes an important step towards easily accessible neural parametric face models through RGB tracking. + + + + Realigning Confidence with Temporal Saliency Information for Point-Level Weakly-Supervised Temporal Action Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_Realigning_Confidence_with_Temporal_Saliency_Information_for_Point-Level_Weakly-Supervised_Temporal_CVPR_2024_paper.pdf + Point-level weakly-supervised temporal action localization (P-TAL) aims to localize action instances in untrimmed videos through the use of single-point annotations in each instance. Existing methods predict the class activation sequences without any boundary information and the unreliable sequences result in a significant misalignment between the quality of proposals and their corresponding confidence. In this paper we surprisingly observe that the most salient frame tends to appear in the central region of each instance and is easily annotated by humans.
Guided by the temporal saliency information we present a novel proposal-level plug-in framework to relearn the aligned confidence of proposals generated by the base locators. The proposed approach consists of Center Score Learning (CSL) and Alignment-based Boundary Adaptation (ABA). In CSL we design a novel center label generated by the point annotations for predicting aligned center scores. During inference we first fuse the center scores with the predicted action probabilities to obtain the aligned confidence. ABA utilizes both the aligned confidence and IoU information to enhance localization completeness. Extensive experiments demonstrate the generalization and effectiveness of the proposed framework showcasing state-of-the-art or competitive performances across three benchmarks. Our code is available at https://github.com/zyxia1009/CVPR2024-TSPNet. + + + + Theoretically Achieving Continuous Representation of Oriented Bounding Boxes + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_Theoretically_Achieving_Continuous_Representation_of_Oriented_Bounding_Boxes_CVPR_2024_paper.pdf + Considerable efforts have been devoted to Oriented Object Detection (OOD). However one lasting issue regarding the discontinuity in Oriented Bounding Box (OBB) representation remains unresolved which is an inherent bottleneck for extant OOD methods. This paper endeavors to completely solve this issue in a theoretically guaranteed manner and puts an end to the ad-hoc efforts in this direction. Prior studies typically can only address one of the two cases of discontinuity: rotation and aspect ratio and often inadvertently introduce decoding discontinuity e.g. Decoding Incompleteness (DI) and Decoding Ambiguity (DA) as discussed in literature. Specifically we propose a novel representation method called Continuous OBB (COBB) which can be readily integrated into existing detectors e.g. Faster-RCNN as a plugin. It can theoretically ensure continuity in bounding box regression which to our best knowledge has not been achieved in literature for rectangle-based object representation. For fairness and transparency of experiments we have developed a modularized benchmark based on the open-source deep learning framework Jittor's detection toolbox JDet for OOD evaluation. On the popular DOTA dataset by integrating Faster-RCNN as the same baseline model our new method outperforms the peer method Gliding Vertex by 1.13% mAP50 (relative improvement 1.54%) and 2.46% mAP75 (relative improvement 5.91%) without any tricks. + + + + Learning Large-Factor EM Image Super-Resolution with Generative Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Shou_Learning_Large-Factor_EM_Image_Super-Resolution_with_Generative_Priors_CVPR_2024_paper.pdf + As the mainstream technique for capturing images of biological specimens at nanometer resolution electron microscopy (EM) is extremely time-consuming for scanning wide field-of-view (FOV) specimens. In this paper we investigate a challenging task of large-factor EM image super-resolution (EMSR) which holds great promise for reducing scanning time relaxing acquisition conditions and expanding imaging FOV. By exploiting the repetitive structures and volumetric coherence of EM images we propose the first generative learning-based framework for large-factor EMSR.
Specifically motivated by the predictability of repetitive structures and textures in EM images we first learn a discrete codebook in the latent space to represent high-resolution (HR) cell-specific priors and a latent vector indexer to map low-resolution (LR) EM images to their corresponding latent vectors in a generative manner. By incorporating the generative cell-specific priors from HR EM images through a multi-scale prior fusion module we then deploy multi-image feature alignment and fusion to further exploit the inter-section coherence in the volumetric EM data. Extensive experiments demonstrate that our proposed framework outperforms advanced single-image and video super-resolution methods for 8x and 16x EMSR (i.e. with 64 times and 256 times less data acquired respectively) achieving superior visual reconstruction quality and downstream segmentation accuracy on benchmark EM datasets. Code is available at https://github.com/jtshou/GPEMSR. + + + + Adaptive Fusion of Single-View and Multi-View Depth for Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_Adaptive_Fusion_of_Single-View_and_Multi-View_Depth_for_Autonomous_Driving_CVPR_2024_paper.pdf + Multi-view depth estimation has achieved impressive performance over various benchmarks. However almost all current multi-view systems rely on given ideal camera poses which are unavailable in many real-world scenarios such as autonomous driving. In this work we propose a new robustness benchmark to evaluate the depth estimation system under various noisy pose settings. Surprisingly we find current multi-view depth estimation methods or single-view and multi-view fusion methods will fail when given noisy pose settings. To address this challenge we propose a single-view and multi-view fused depth estimation system which adaptively integrates high-confident multi-view and single-view results for both robust and accurate depth estimations. The adaptive fusion module performs fusion by dynamically selecting high-confidence regions between two branches based on a wrapping confidence map. Thus the system tends to choose the more reliable branch when facing textureless scenes inaccurate calibration dynamic objects and other degradation or challenging conditions. Our method outperforms state-of-the-art multi-view and fusion methods under robustness testing. Furthermore we achieve state-of-the-art performance on challenging benchmarks (KITTI and DDAD) when given accurate pose estimations. Project website: https://github.com/Junda24/AFNet/ + + + + Continual Self-supervised Learning: Towards Universal Multi-modal Medical Data Representation Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_Continual_Self-supervised_Learning_Towards_Universal_Multi-modal_Medical_Data_Representation_Learning_CVPR_2024_paper.pdf + Self-supervised learning (SSL) is an efficient pre-training method for medical image analysis. However current research is mostly confined to certain modalities consuming considerable time and resources without achieving universality across different modalities. A straightforward solution is combining all modality data for joint SSL which poses practical challenges. Firstly our experiments reveal conflicts in representation learning as the number of modalities increases. Secondly multi-modal data collected in advance cannot cover all real-world scenarios. 
In this paper we reconsider versatile SSL from the perspective of continual learning and propose MedCoSS a continuous SSL approach for multi-modal medical data. Different from joint representation learning MedCoSS assigns varying data modalities to separate training stages creating a multi-stage pre-training process. We propose a rehearsal-based continual learning approach to manage modal conflicts and prevent catastrophic forgetting. Specifically we use the k-means sampling to retain and rehearse previous modality data during new modality learning. Moreover we apply feature distillation and intra-modal mixup on buffer data for knowledge retention bypassing pretext tasks. We conduct experiments on a large-scale multi-modal unlabeled dataset including clinical reports X-rays CT MRI and pathological images. Experimental results demonstrate MedCoSS's exceptional generalization ability across 9 downstream datasets and its significant scalability in integrating new modality data. The code and pre-trained model are available at https://github.com/yeerwen/MedCoSS. + + + + Towards Efficient Replay in Federated Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Towards_Efficient_Replay_in_Federated_Incremental_Learning_CVPR_2024_paper.pdf + In Federated Learning (FL) the data in each client is typically assumed fixed or static. However data often comes in an incremental manner in real-world applications where the data domain may increase dynamically. In this work we study catastrophic forgetting with data heterogeneity in Federated Incremental Learning (FIL) scenarios where edge clients may lack enough storage space to retain full data. We propose to employ a simple generic framework for FIL named Re-Fed which can coordinate each client to cache important samples for replay. More specifically when a new task arrives each client first caches selected previous samples based on their global and local importance. Then the client trains the local model with both the cached samples and the samples from the new task. Theoretically we analyze the ability of Re-Fed to discover important samples for replay thus alleviating the catastrophic forgetting problem. Moreover we empirically show that Re-Fed achieves competitive performance compared to state-of-the-art methods. + + + + SimAC: A Simple Anti-Customization Method for Protecting Face Privacy against Text-to-Image Synthesis of Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_SimAC_A_Simple_Anti-Customization_Method_for_Protecting_Face_Privacy_against_CVPR_2024_paper.pdf + Despite the success of diffusion-based customization methods on visual content creation increasing concerns have been raised about such techniques from both privacy and political perspectives. To tackle this issue several anti-customization methods have been proposed in very recent months predominantly grounded in adversarial attacks. Unfortunately most of these methods adopt straightforward designs such as end-to-end optimization with a focus on adversarially maximizing the original training loss thereby neglecting nuanced internal properties intrinsic to the diffusion model and even leading to ineffective optimization in some diffusion time steps. In this paper we strive to bridge this gap by undertaking a comprehensive exploration of these inherent properties to boost the performance of current anti-customization approaches. 
Two aspects of properties are investigated: 1) We examine the relationship between time step selection and the model's perception in the frequency domain of images and find that lower time steps contribute much more to the adversarial noise. This inspires us to propose an adaptive greedy search for optimal time steps that seamlessly integrates with existing anti-customization methods. 2) We scrutinize the roles of features at different layers during denoising and devise a sophisticated feature-based optimization framework for anti-customization. Experiments on facial benchmarks demonstrate that our approach significantly increases identity disruption thereby protecting user privacy and copyright. Our code is available at: https://github.com/somuchtome/SimAC. + + + + Fair-VPT: Fair Visual Prompt Tuning for Image Classification http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Fair-VPT_Fair_Visual_Prompt_Tuning_for_Image_Classification_CVPR_2024_paper.pdf Despite the remarkable success of Vision Transformers (ViT) across diverse fields in computer vision they have a clear drawback of expensive adaptation cost for downstream tasks due to the increased scale. To address this Visual Prompt Tuning (VPT) incorporates learnable parameters in the input space of ViT. While freezing the ViT backbone and tuning only the prompts it exhibits superior performance to full fine-tuning. However despite the outstanding advantage we point out that VPT may lead to serious unfairness in downstream classification. Initially we investigate the causes of unfairness in VPT identifying the biased pre-trained ViT as a principal factor. Motivated by this observation we propose a Fair Visual Prompt Tuning (Fair-VPT) which removes biased information in the pre-trained ViT while adapting it to downstream classification tasks. To this end we categorize prompts into "cleaner prompts" and "target prompts". Based on this we encode the class token in two different ways by either masking or not masking the target prompts in the self-attention process. These encoded tokens are trained with distinct objective functions resulting in the inclusion of different information in the target and cleaner prompts. Moreover we introduce a disentanglement loss based on contrastive learning to further decorrelate them. In experiments across diverse benchmarks the proposed method demonstrates the best performance in terms of balanced classification accuracy and fairness. + + + + CaDeT: a Causal Disentanglement Approach for Robust Trajectory Prediction in Autonomous Driving http://openaccess.thecvf.com//content/CVPR2024/papers/Pourkeshavarz_CaDeT_a_Causal_Disentanglement_Approach_for_Robust_Trajectory_Prediction_in_CVPR_2024_paper.pdf For safe motion planning in the real world, autonomous vehicles require behavior prediction models that are reliable and robust to distribution shifts. Recent studies suggest that existing learning-based trajectory prediction models do not possess such characteristics and are susceptible to small perturbations that are not present in the training data largely due to overfitting to spurious correlations while learning. In this paper we propose a causal disentanglement representation learning approach aiming to separate invariant (causal) and variant (spurious) features for more robust learning.
Our method benefits from a novel intervention mechanism in the latent space that estimates potential distribution shifts resulting from spurious correlations using uncertain feature statistics hence maintaining the realism of interventions. To facilitate learning we propose a novel invariance objective based on the variances of the distributions over uncertain statistics to induce the model to focus on invariant representations during training. We conduct extensive experiments on two large-scale autonomous driving datasets and show that besides achieving state-of-the-art performance our method can significantly improve prediction robustness to various distribution shifts in driving scenes. We further conduct ablative studies to evaluate the design choices in our proposed framework. + + + + Prompting Vision Foundation Models for Pathology Image Analysis http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_Prompting_Vision_Foundation_Models_for_Pathology_Image_Analysis_CVPR_2024_paper.pdf The rapid increase in cases of non-alcoholic fatty liver disease (NAFLD) in recent years has raised significant public concern. Accurately identifying tissue alteration regions is crucial for the diagnosis of NAFLD but this task presents challenges in pathology image analysis particularly with small-scale datasets. Recently the paradigm shift from full fine-tuning to prompting in adapting vision foundation models has offered a new perspective for small-scale data analysis. However existing prompting methods based on task-agnostic prompts are mainly developed for generic image recognition which fall short of providing instructive cues for complex pathology images. In this paper we propose Quantitative Attribute-based Prompting (QAP) a novel prompting method specifically for liver pathology image analysis. QAP is based on two quantitative attributes namely K-function-based spatial attributes and histogram-based morphological attributes which are aimed at quantitative assessment of tissue states. Moreover a conditional prompt generator is designed to turn these instance-specific attributes into visual prompts. Extensive experiments on three diverse tasks demonstrate that our task-specific prompting method achieves better diagnostic performance as well as better interpretability. Code is available at https://github.com/7LFB/QAP. + + + + SEED-Bench: Benchmarking Multimodal Large Language Models http://openaccess.thecvf.com//content/CVPR2024/papers/Li_SEED-Bench_Benchmarking_Multimodal_Large_Language_Models_CVPR_2024_paper.pdf Multimodal large language models (MLLMs) building upon the foundation of powerful large language models (LLMs) have recently demonstrated exceptional capabilities in generating not only texts but also images given interleaved multimodal inputs (acting like a combination of GPT-4V and DALL-E 3). However existing MLLM benchmarks remain limited to assessing only models' comprehension ability of single image-text inputs failing to keep up with the strides made in MLLMs. A comprehensive benchmark is imperative for investigating the progress and uncovering the limitations of current MLLMs. In this work we categorize the capabilities of MLLMs into hierarchical levels from L_0 to L_4 based on the modalities they can accept and generate and propose SEED-Bench a comprehensive benchmark that evaluates the hierarchical capabilities of MLLMs.
Specifically SEED-Bench comprises 24K multiple-choice questions with accurate human annotations which span 27 dimensions including the evaluation of both text and image generation. Multiple-choice questions with groundtruth options derived from human annotation enable an objective and efficient assessment of model performance eliminating the need for human or GPT intervention during evaluation. We further evaluate the performance of 22 prominent open-source MLLMs and summarize valuable observations. By revealing the limitations of existing MLLMs through extensive evaluations we aim for SEED-Bench to provide insights that will motivate future research towards the goal of General Artificial Intelligence. + + + + Object Pose Estimation via the Aggregation of Diffusion Features http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Object_Pose_Estimation_via_the_Aggregation_of_Diffusion_Features_CVPR_2024_paper.pdf Estimating the pose of objects from images is a crucial task of 3D scene understanding and recent approaches have shown promising results on very large benchmarks. However these methods experience a significant performance drop when dealing with unseen objects. We believe that this results from the limited generalizability of image features. To address this problem we conduct an in-depth analysis of the features of diffusion models e.g. Stable Diffusion which hold substantial potential for modeling unseen objects. Based on this analysis we then innovatively introduce these diffusion features for object pose estimation. To achieve this we propose three distinct architectures that can effectively capture and aggregate diffusion features of different granularity greatly improving the generalizability of object pose estimation. Our approach outperforms the state-of-the-art methods by a considerable margin on three popular benchmark datasets: LM, O-LM and T-LESS. In particular our method achieves higher accuracy than the previous best methods on unseen objects: 98.2% vs. 93.5% on Unseen LM and 85.9% vs. 76.3% on Unseen O-LM showing the strong generalizability of our method. Our code is released at https://github.com/Tianfu18/diff-feats-pose. + + + + Panda-70M: Captioning 70M Videos with Multiple Cross-Modality Teachers http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Panda-70M_Captioning_70M_Videos_with_Multiple_Cross-Modality_Teachers_CVPR_2024_paper.pdf The quality of the data and annotation upper-bounds the quality of a downstream model. While there exist large text corpora and image-text pairs high-quality video-text data is much harder to collect. First of all manual labeling is more time-consuming as it requires an annotator to watch an entire video. Second videos have a temporal dimension, consist of a number of scenes stacked together and show multiple actions. Accordingly to establish a video dataset with high-quality captions we propose an automatic approach leveraging multimodal inputs such as textual video description subtitles and individual video frames. Specifically we curate 3.8M high-resolution videos from the publicly available HD-VILA-100M dataset. We then split them into semantically consistent video clips and apply multiple cross-modality teacher models to obtain captions for each video. Next we finetune a retrieval model on a small subset where the best caption of each video is manually selected and then employ the model on the whole dataset to select the best caption as the annotation. In this way we get 70M videos paired with high-quality text captions.
We dub the dataset Panda-70M. We show the value of the proposed dataset on three downstream tasks: video captioning, video and text retrieval, and text-driven video generation. The models trained on the proposed data score substantially better on the majority of metrics across all the tasks. + + + + Infrared Small Target Detection with Scale and Location Sensitivity http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Infrared_Small_Target_Detection_with_Scale_and_Location_Sensitivity_CVPR_2024_paper.pdf Recently infrared small target detection (IRSTD) has been dominated by deep-learning-based methods. However these methods mainly focus on the design of complex model structures to extract discriminative features leaving the loss functions for IRSTD under-explored. For example the widely used Intersection over Union (IoU) and Dice losses lack sensitivity to the scales and locations of targets limiting the detection performance of detectors. In this paper we focus on boosting detection performance with a more effective loss but a simpler model structure. Specifically we first propose a novel Scale and Location Sensitive (SLS) loss to handle the limitations of existing losses: 1) for scale sensitivity we compute a weight for the IoU loss based on target scales to help the detector distinguish targets with different scales; 2) for location sensitivity we introduce a penalty term based on the center points of targets to help the detector localize targets more precisely. Then we design a simple Multi-Scale Head for the plain U-Net, yielding MSHNet. By applying SLS loss to each scale of the predictions our MSHNet outperforms existing state-of-the-art methods by a large margin. In addition the detection performance of existing detectors can be further improved when trained with our SLS loss demonstrating the effectiveness and generalization of our SLS loss. The code is available at https://github.com/ying-fu/MSHNet. + + + + Self-supervised Debiasing Using Low Rank Regularization http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Self-supervised_Debiasing_Using_Low_Rank_Regularization_CVPR_2024_paper.pdf Spurious correlations can cause strong biases in deep neural networks impairing generalization ability. While most existing debiasing methods require full supervision on either spurious attributes or target labels training a debiased model from a limited amount of both annotations is still an open question. To address this issue we investigate an interesting phenomenon using the spectral analysis of latent representations: spuriously correlated attributes make neural networks inductively biased towards encoding lower effective rank representations. We also show that a rank regularization can amplify this bias in a way that encourages highly correlated features. Leveraging these findings we propose a self-supervised debiasing framework potentially compatible with unlabeled samples. Specifically we first pretrain a biased encoder in a self-supervised manner with the rank regularization serving as a semantic bottleneck to force the encoder to learn the spuriously correlated attributes. This biased encoder is then used to discover and upweight bias-conflicting samples in a downstream task serving as a booster to effectively debias the main model. Remarkably the proposed debiasing framework significantly improves the generalization performance of self-supervised learning baselines and in some cases even outperforms state-of-the-art supervised debiasing approaches.
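Editor's note: the Scale and Location Sensitive (SLS) loss described in the MSHNet abstract above combines an IoU term weighted by target scale with a penalty on target center points. The abstract does not give the exact formulas, so the snippet below is only a rough PyTorch-style sketch under assumed definitions (a weight derived from the normalized target area and a penalty on the distance between soft centroids); the name sls_loss and both terms are hypothetical stand-ins, not the authors' implementation (see https://github.com/ying-fu/MSHNet for the official code).

```python
# Hypothetical sketch of a scale- and location-sensitive segmentation loss,
# loosely following the description in the MSHNet abstract (not the paper's code).
import torch


def sls_loss(pred, target, eps=1e-6, lambda_loc=1.0):
    """pred, target: (B, 1, H, W) tensors; pred holds probabilities in [0, 1]."""
    b, _, h, w = target.shape
    inter = (pred * target).sum(dim=(1, 2, 3))
    union = (pred + target - pred * target).sum(dim=(1, 2, 3))
    iou_loss = 1.0 - (inter + eps) / (union + eps)          # soft IoU loss per sample

    # Assumed scale weight: smaller targets (relative area) receive larger weights.
    area = target.sum(dim=(1, 2, 3)) / (h * w)
    scale_w = 1.0 - area                                     # close to 1 for tiny targets

    # Assumed location penalty: normalized distance between soft centroids.
    ys = torch.arange(h, device=pred.device, dtype=pred.dtype).view(1, 1, h, 1)
    xs = torch.arange(w, device=pred.device, dtype=pred.dtype).view(1, 1, 1, w)

    def centroid(m):
        mass = m.sum(dim=(1, 2, 3)) + eps
        cy = (m * ys).sum(dim=(1, 2, 3)) / mass / h
        cx = (m * xs).sum(dim=(1, 2, 3)) / mass / w
        return torch.stack([cy, cx], dim=1)                  # (B, 2) normalized centers

    loc_penalty = (centroid(pred) - centroid(target)).norm(dim=1)
    return (scale_w * iou_loss + lambda_loc * loc_penalty).mean()


if __name__ == "__main__":
    pred = torch.rand(2, 1, 64, 64)
    target = (torch.rand(2, 1, 64, 64) > 0.98).float()       # sparse "small targets"
    print(sls_loss(pred, target).item())
```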
+ + + + Finding Lottery Tickets in Vision Models via Data-driven Spectral Foresight Pruning + http://openaccess.thecvf.com//content/CVPR2024/papers/Iurada_Finding_Lottery_Tickets_in_Vision_Models_via_Data-driven_Spectral_Foresight_CVPR_2024_paper.pdf + Recent advances in neural network pruning have shown how it is possible to reduce the computational costs and memory demands of deep learning models before training. We focus on this framework and propose a new pruning at initialization algorithm that leverages the Neural Tangent Kernel (NTK) theory to align the training dynamics of the sparse network with that of the dense one. Specifically we show how the usually neglected data-dependent component in the NTK's spectrum can be taken into account by providing an analytical upper bound to the NTK's trace obtained by decomposing neural networks into individual paths. This leads to our Path eXclusion (PX) a foresight pruning method designed to preserve the parameters that mostly influence the NTK's trace. PX is able to find lottery tickets (i.e. good paths) even at high sparsity levels and largely reduces the need for additional training. When applied to pre-trained models it extracts subnetworks directly usable for several downstream tasks resulting in performance comparable to those of the dense counterpart but with substantial cost and computational savings. + + + + InNeRF360: Text-Guided 3D-Consistent Object Inpainting on 360-degree Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_InNeRF360_Text-Guided_3D-Consistent_Object_Inpainting_on_360-degree_Neural_Radiance_Fields_CVPR_2024_paper.pdf + We propose InNeRF360 an automatic system that accurately removes text-specified objects from 360-degree Neural Radiance Fields (NeRF). The challenge is to effectively remove objects while inpainting perceptually consistent content for the missing regions which is particularly demanding for existing NeRF models due to their implicit volumetric representation. Moreover unbounded scenes are more prone to floater artifacts in the inpainted region than frontal-facing scenes as the change of object appearance and background across views is more sensitive to inaccurate segmentations and inconsistent inpainting. With a trained NeRF and a text description our method efficiently removes specified objects and inpaints visually consistent content without artifacts. We apply depth-space warping to enforce consistency across multiview text-encoded segmentations and then refine the inpainted NeRF model using perceptual priors and 3D diffusion-based geometric priors to ensure visual plausibility. Through extensive experiments in segmentation and inpainting on 360-degree and frontal-facing NeRFs we show that InNeRF360 is effective and enhances NeRF's editability. Project page: https://ivrl.github.io/InNeRF360. + + + + IS-Fusion: Instance-Scene Collaborative Fusion for Multimodal 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_IS-Fusion_Instance-Scene_Collaborative_Fusion_for_Multimodal_3D_Object_Detection_CVPR_2024_paper.pdf + Bird's eye view (BEV) representation has emerged as a dominant solution for describing 3D space in autonomous driving scenarios. However objects in the BEV representation typically exhibit small sizes and the associated point cloud context is inherently sparse which leads to great challenges for reliable 3D perception. 
In this paper we propose IS-Fusion an innovative multimodal fusion framework that jointly captures the Instance- and Scene-level contextual information. IS-Fusion essentially differs from existing approaches that only focus on the BEV scene-level fusion by explicitly incorporating instance-level multimodal information thus facilitating the instance-centric tasks like 3D object detection. It comprises a Hierarchical Scene Fusion (HSF) module and an Instance-Guided Fusion (IGF) module. HSF applies Point-to-Grid and Grid-to-Region transformers to capture the multimodal scene context at different granularities. IGF mines instance candidates explores their relationships and aggregates the local multimodal context for each instance. These instances then serve as guidance to enhance the scene feature and yield an instance-aware BEV representation. On the challenging nuScenes benchmark IS-Fusion outperforms all the published multimodal works to date. + + + + Enhancing Intrinsic Features for Debiasing via Investigating Class-Discerning Common Attributes in Bias-Contrastive Pair + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Enhancing_Intrinsic_Features_for_Debiasing_via_Investigating_Class-Discerning_Common_Attributes_CVPR_2024_paper.pdf + In the image classification task deep neural networks frequently rely on bias attributes that are spuriously correlated with a target class in the presence of dataset bias resulting in degraded performance when applied to data without bias attributes. The task of debiasing aims to compel classifiers to learn intrinsic attributes that inherently define a target class rather than focusing on bias attributes. While recent approaches mainly focus on emphasizing the learning of data samples without bias attributes (i.e. bias-conflicting samples) compared to samples with bias attributes (i.e. bias-aligned samples) they fall short of directly guiding models where to focus for learning intrinsic features. To address this limitation this paper proposes a method that provides the model with explicit spatial guidance that indicates the region of intrinsic features. We first identify the intrinsic features by investigating the class-discerning common features between a bias-aligned (BA) sample and a bias-conflicting (BC) sample (i.e. bias-contrastive pair). Next we enhance the intrinsic features in the BA sample that are relatively under-exploited for prediction compared to the BC sample. To construct the bias-contrastive pair without using bias information we introduce a bias-negative score that distinguishes BC samples from BA samples employing a biased model. The experiments demonstrate that our method achieves state-of-the-art performance on synthetic and real-world datasets with various levels of bias severity. + + + + Compositional Chain-of-Thought Prompting for Large Multimodal Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Mitra_Compositional_Chain-of-Thought_Prompting_for_Large_Multimodal_Models_CVPR_2024_paper.pdf + The combination of strong visual backbones and Large Language Model (LLM) reasoning has led to Large Multimodal Models (LMMs) becoming the current standard for a wide range of vision and language (VL) tasks. However recent research has shown that even the most advanced LMMs still struggle to capture aspects of compositional visual reasoning such as attributes and relationships between objects. 
One solution is to utilize scene graphs (SGs)---a formalization of objects and their relations and attributes that has been extensively used as a bridge between the visual and textual domains. Yet scene graph data requires scene graph annotations which are expensive to collect and thus not easily scalable. Moreover finetuning an LMM based on SG data can lead to catastrophic forgetting of the pretraining objective. To overcome this inspired by chain-of-thought methods we propose Compositional Chain-of-Thought (CCoT) a novel zero-shot Chain-of-Thought prompting method that utilizes SG representations in order to extract compositional knowledge from an LMM. Specifically we first generate an SG using the LMM and then use that SG in the prompt to produce a response. Through extensive experiments we find that the proposed CCoT approach not only improves LMM performance on several vision and language VL compositional benchmarks but also improves the performance of several popular LMMs on general multimodal benchmarks without the need for fine-tuning or annotated ground-truth SGs. Code: https://github.com/chancharikmitra/CCoT + + + + Diffusion Time-step Curriculum for One Image to 3D Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yi_Diffusion_Time-step_Curriculum_for_One_Image_to_3D_Generation_CVPR_2024_paper.pdf + Score distillation sampling (SDS) has been widely adopted to overcome the absence of unseen views in reconstructing 3D objects from a single image. It leverages pre-trained 2D diffusion models as teacher to guide the reconstruction of student 3D models. Despite their remarkable success SDS-based methods often encounter geometric artifacts and texture saturation. We find out the crux is the overlooked indiscriminate treatment of diffusion time-steps during optimization: it unreasonably treats the student-teacher knowledge distillation to be equal at all time-steps and thus entangles coarse-grained and fine-grained modeling. Therefore we propose the Diffusion Time-step Curriculum one-image-to-3D pipeline (DTC123) which involves both the teacher and student models collaborating with the time-step curriculum in a coarse-to-fine manner. Extensive experiments on NeRF4 RealFusion15 GSO and Level50 benchmark demonstrate that DTC123 can produce multi-view consistent high-quality and diverse 3D assets. Codes and more generation demos will be released in https://github.com/yxymessi/DTC123. + + + + Adaptive Hyper-graph Aggregation for Modality-Agnostic Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Qi_Adaptive_Hyper-graph_Aggregation_for_Modality-Agnostic_Federated_Learning_CVPR_2024_paper.pdf + In Federated Learning (FL) the issue of statistical data heterogeneity has been a significant challenge to the field's ongoing development. This problem is further exacerbated when clients' data vary in modalities. In response to these issues of statistical heterogeneity and modality incompatibility we propose the Adaptive Hyper-graph Aggregation framework a novel solution for Modality-Agnostic Federated Learning. We design a Modular Architecture for Local Model with single modality setting the stage for efficient intra-modality sharing and inter-modality complementarity. An innovative Global Consensus Prototype Enhancer is crafted to assimilate and broadcast global consensus knowledge within the network. 
At the core of our approach lies the Adaptive Hyper-graph Learning Strategy which effectively tackles the inherent challenges of modality incompatibility and statistical heterogeneity within federated learning environments accomplishing this adaptively even without the server being aware of the clients' modalities. Our approach tested on three multimodal benchmark datasets demonstrated strong performance across diverse data distributions affirming its effectiveness in multimodal federated learning. + + + + SPIN: Simultaneous Perception Interaction and Navigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Uppal_SPIN_Simultaneous_Perception_Interaction_and_Navigation_CVPR_2024_paper.pdf + While there has been remarkable progress recently in the fields of manipulation and locomotion mobile manipulation remains a long-standing challenge. Compared to locomotion or static manipulation a mobile system must make a diverse range of long-horizon tasks feasible in unstructured and dynamic environments. While the applications are broad and interesting there are a plethora of challenges in developing these systems such as coordination between the base and arm reliance on onboard perception for perceiving and interacting with the environment and most importantly simultaneously integrating all these parts together. Prior works approach the problem using disentangled modular skills for mobility and manipulation that are trivially tied together. This causes several limitations such as compounding errors delays in decision-making and no whole-body coordination. In this work we present a reactive mobile manipulation framework that uses an active visual system to consciously perceive and react to its environment. Similar to how humans leverage whole-body and hand-eye coordination we develop a mobile manipulator that exploits its ability to move and see more specifically -- to move in order to see and to see in order to move. This allows it to not only move around and interact with its environment but also choose "when" to perceive "what" using an active visual system. We observe that such an agent learns to navigate around complex cluttered scenarios while displaying agile whole-body coordination using only ego-vision without needing to create environment maps. Videos are available at https://spin-robot.github.io + + + + Exploring the Potential of Large Foundation Models for Open-Vocabulary HOI Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Lei_Exploring_the_Potential_of_Large_Foundation_Models_for_Open-Vocabulary_HOI_CVPR_2024_paper.pdf + Open-vocabulary human-object interaction (HOI) detection which is concerned with the problem of detecting novel HOIs guided by natural language is crucial for understanding human-centric scenes. However prior zero-shot HOI detectors often employ the same levels of feature maps to model HOIs with varying distances leading to suboptimal performance in scenes containing human-object pairs with a wide range of distances. In addition these detectors primarily rely on category names and overlook the rich contextual information that language can provide which is essential for capturing open vocabulary concepts that are typically rare and not well-represented by category names alone. In this paper we introduce a novel end-to-end open vocabulary HOI detection framework with conditional multi-level decoding and fine-grained semantic enhancement (CMD-SE) harnessing the potential of Visual-Language Models (VLMs). 
Specifically we propose to model human-object pairs at different distances with different levels of feature maps by incorporating a soft constraint during the bipartite matching process. Furthermore by leveraging large language models (LLMs) such as GPT models we exploit their extensive world knowledge to generate descriptions of human body part states for various interactions. Then we integrate the generalizable and fine-grained semantics of human body parts to improve interaction recognition. Experimental results on two datasets, SWIG-HOI and HICO-DET, demonstrate that our proposed method achieves state-of-the-art results in open vocabulary HOI detection. The code and models are available at https://github.com/ltttpku/CMD-SE-release. + + + + Unmixing Diffusion for Self-Supervised Hyperspectral Image Denoising http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_Unmixing_Diffusion_for_Self-Supervised_Hyperspectral_Image_Denoising_CVPR_2024_paper.pdf Hyperspectral images (HSIs) have extensive applications in various fields such as medicine agriculture and industry. Nevertheless acquiring high signal-to-noise ratio HSI poses a challenge due to narrow-band spectral filtering. Consequently the importance of HSI denoising is substantial especially for snapshot hyperspectral imaging technology. While most previous HSI denoising methods are supervised creating supervised training datasets for the diverse scenes hyperspectral cameras and scan parameters is impractical. In this work we present Diff-Unmix a self-supervised denoising method for HSI using diffusion denoising generative models. Specifically Diff-Unmix addresses the challenge of recovering noise-degraded HSI through a fusion of Spectral Unmixing and conditional abundance generation. Firstly it employs a learnable block-based spectral unmixing strategy complemented by a pure transformer-based backbone. Then we introduce a self-supervised generative diffusion network to enhance abundance maps from the spectral unmixing block. This network reconstructs noise-free unmixing probability distributions effectively mitigating noise-induced degradations within these components. Finally the denoised HSI is reconstructed by blending the diffusion-adjusted abundance map with the spectral endmembers. Experimental results on both simulated and real-world noisy datasets show that Diff-Unmix achieves state-of-the-art performance. + + + + Test-Time Linear Out-of-Distribution Detection http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Test-Time_Linear_Out-of-Distribution_Detection_CVPR_2024_paper.pdf Out-of-Distribution (OOD) detection aims to address the overconfident predictions of neural networks by triggering an alert when the input sample deviates significantly from the training distribution (in-distribution) indicating that the output may not be reliable. Current OOD detection approaches explore all kinds of cues to identify OOD data such as finding irregular patterns in the feature space logit space gradient space or the raw image space. Surprisingly we observe a linear trend between the OOD score produced by current OOD detection algorithms and the network features on several datasets. We conduct a thorough investigation theoretically and empirically to analyze and understand the meaning of such a linear trend in OOD detection.
This paper proposes a Robust Test-time Linear method (RTL) to utilize such linear trends as a `free lunch' when we have a batch of data to perform OOD detection. By using a simple linear regression as a test-time adaptation we can make a more precise OOD prediction. We further propose an online variant of the proposed method which achieves promising performance and is more practical for real applications. Theoretical analysis is given to prove the effectiveness of our methods. Extensive experiments on several OOD datasets show the efficacy of RTL for OOD detection tasks significantly improving the results of base OOD detectors. The project will be available at https://github.com/kfan21/RTL. + + + + Unsupervised Blind Image Deblurring Based on Self-Enhancement http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Unsupervised_Blind_Image_Deblurring_Based_on_Self-Enhancement_CVPR_2024_paper.pdf Significant progress in image deblurring has been achieved by deep learning methods especially the remarkable performance of supervised models on paired synthetic data. However real-world quality degradation is more complex than that in synthetic datasets and acquiring paired data in real-world scenarios poses significant challenges. To address these challenges we propose a novel unsupervised image deblurring framework based on self-enhancement. The framework progressively generates improved pseudo-sharp and blurry image pairs without the need for real paired datasets and the generated image pairs of higher quality can be used to enhance the performance of the reconstructor. To ensure the generated blurry images are closer to the real blurry images we propose a novel re-degradation principal component consistency loss which enforces the principal components of the generated low-quality images to be similar to those of re-degraded images from the original sharp ones. Furthermore we introduce the self-enhancement strategy that significantly improves deblurring performance without increasing the computational complexity of the network during inference. Through extensive experiments on multiple real-world blurry datasets we demonstrate the superiority of our approach over other state-of-the-art unsupervised methods. + + + + UFineBench: Towards Text-based Person Retrieval with Ultra-fine Granularity http://openaccess.thecvf.com//content/CVPR2024/papers/Zuo_UFineBench_Towards_Text-based_Person_Retrieval_with_Ultra-fine_Granularity_CVPR_2024_paper.pdf Existing text-based person retrieval datasets often have relatively coarse-grained text annotations. This hinders the model from comprehending the fine-grained semantics of query texts in real scenarios. To address this problem we contribute a new benchmark named UFineBench for text-based person retrieval with ultra-fine granularity. Firstly we construct a new dataset named UFine6926. We collect a large number of person images and manually annotate each image with two detailed textual descriptions averaging 80.8 words each. The average word count is three to four times that of the previous datasets. In addition to standard in-domain evaluation we also propose a special evaluation paradigm more representative of real scenarios. It contains a new evaluation set with cross domains cross textual granularity and cross textual styles named UFine3C and a new evaluation metric for accurately measuring retrieval ability named mean Similarity Distribution (mSD).
Moreover we propose CFAM a more efficient algorithm especially designed for text-based person retrieval with ultra fine-grained texts. It achieves fine granularity mining by adopting a shared cross-modal granularity decoder and a hard negative match mechanism. With standard in-domain evaluation CFAM establishes competitive performance across various datasets especially on our ultra fine-grained UFine6926. Furthermore by evaluating on UFine3C we demonstrate that training on our UFine6926 significantly improves generalization to real scenarios compared with other coarse-grained datasets. The dataset and code will be made publicly available at https://github.com/Zplusdragon/UFineBench. + + + + Efficient Hyperparameter Optimization with Adaptive Fidelity Identification http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Efficient_Hyperparameter_Optimization_with_Adaptive_Fidelity_Identification_CVPR_2024_paper.pdf Hyperparameter Optimization and Neural Architecture Search are powerful in attaining state-of-the-art machine learning models with Bayesian Optimization (BO) standing out as a mainstream method. Extending BO into the multi-fidelity setting has been an emerging research topic in this field but faces the challenge of determining an appropriate fidelity for each hyperparameter configuration to fit the surrogate model. To tackle the challenge we propose a multi-fidelity BO method named FastBO which excels in adaptively deciding the fidelity for each configuration and providing strong performance while ensuring efficient resource usage. These advantages are achieved through our proposed techniques based on the concepts of efficient point and saturation point for each configuration which can be obtained from the empirical learning curve of the configuration estimated from early observations. Extensive experiments demonstrate FastBO's superior anytime performance and efficiency in identifying high-quality configurations and architectures. We also show that our method provides a way to extend any single-fidelity method to the multi-fidelity setting highlighting the wide applicability of our approach. + + + + Focus on Hiders: Exploring Hidden Threats for Enhancing Adversarial Training http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Focus_on_Hiders_Exploring_Hidden_Threats_for_Enhancing_Adversarial_Training_CVPR_2024_paper.pdf Adversarial training is often formulated as a min-max problem. However concentrating only on the worst adversarial examples causes alternating repetitive confusion of the model i.e. previously defended or correctly classified samples are not defensible or accurately classifiable in subsequent adversarial training. We characterize such non-ignorable samples as "hiders" which reveal the hidden high-risk regions within the secure area obtained through adversarial training and prevent the model from finding the real worst cases. We require the model to prevent hiders when defending against adversarial examples for improving accuracy and robustness simultaneously. By rethinking and redefining the min-max optimization problem for adversarial training we propose a generalized adversarial training algorithm called Hider-Focused Adversarial Training (HFAT). HFAT introduces the iterative evolution optimization strategy to simplify the optimization problem and employs an auxiliary model to reveal hiders effectively combining the optimization directions of standard adversarial training and the prevention of hiders.
Furthermore we introduce an adaptive weighting mechanism that facilitates the model in adaptively adjusting its focus between adversarial examples and hiders during different training periods. We demonstrate the effectiveness of our method based on extensive experiments and ensure that HFAT can provide higher robustness and accuracy. We will release the source code upon publication. + + + + GoodSAM: Bridging Domain and Capacity Gaps via Segment Anything Model for Distortion-aware Panoramic Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_GoodSAM_Bridging_Domain_and_Capacity_Gaps_via_Segment_Anything_Model_CVPR_2024_paper.pdf + This paper tackles a novel yet challenging problem: how to transfer knowledge from the emerging Segment Anything Model (SAM) -- which reveals impressive zero-shot instance segmentation capacity -- to learn a compact panoramic semantic segmentation model i.e. student without requiring any labeled data. This poses considerable challenges due to SAM's inability to provide semantic labels and the large capacity gap between SAM and the student. To this end we propose a novel framework called GoodSAM that introduces a teacher assistant (TA) to provide semantic information integrated with SAM to generate ensemble logits to achieve knowledge transfer. Specifically we propose a Distortion-Aware Rectification (DAR) module that first addresses the distortion problem of panoramic images by imposing prediction-level consistency and boundary enhancement. This subtly enhances TA's prediction capacity on panoramic images. DAR then incorporates a cross-task complementary fusion block to adaptively merge the predictions of SAM and TA to obtain more reliable ensemble logits. Moreover we introduce a Multi-level Knowledge Adaptation (MKA) module to efficiently transfer the multi-level feature knowledge from TA and ensemble logits to learn a compact student model. Extensive experiments on two benchmarks show that our GoodSAM achieves a remarkable +3.75% mIoU improvement over the state-of-the-art (SOTA) domain adaptation methods e.g. [41]. Also our most lightweight model achieves comparable performance to the SOTA methods with only 3.7M parameters. + + + + DYSON: Dynamic Feature Space Self-Organization for Online Task-Free Class Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/He_DYSON_Dynamic_Feature_Space_Self-Organization_for_Online_Task-Free_Class_Incremental_CVPR_2024_paper.pdf + In this paper we focus on a challenging Online Task-Free Class Incremental Learning (OTFCIL) problem. Different from the existing methods that continuously learn the feature space from data streams we propose a novel compute-and-align paradigm for the OTFCIL. It first computes an optimal geometry i.e. the class prototype distribution for classifying existing classes and updates it when new classes emerge and then trains a DNN model by aligning its feature space to the optimal geometry. To this end we develop a novel Dynamic Neural Collapse (DNC) algorithm to compute and update the optimal geometry. The DNC expands the geometry when new classes emerge without loss of the geometry optimality and guarantees the drift distance of old class prototypes with an explicit upper bound. 
Then we propose a novel Dynamic feature space Self-Organization (DYSON) method containing three major components including 1) a feature extractor 2) a Dynamic Feature-Geometry Alignment (DFGA) module aligning the feature space to the optimal geometry computed by DNC and 3) a training-free class-incremental classifier derived from the DNC geometry. Experimental comparison results on four benchmark datasets including CIFAR10 CIFAR100 CUB200 and CoRe50 demonstrate the efficiency and superiority of the DYSON method. The source code is provided in the supplementary material. + + + + Event-based Structure-from-Orbit http://openaccess.thecvf.com//content/CVPR2024/papers/Elms_Event-based_Structure-from-Orbit_CVPR_2024_paper.pdf Event sensors offer high temporal resolution visual sensing which makes them ideal for perceiving fast visual phenomena without suffering from motion blur. Certain applications in robotics and vision-based navigation require 3D perception of an object undergoing circular or spinning motion in front of a static camera such as recovering the angular velocity and shape of the object. The setting is equivalent to observing a static object with an orbiting camera. In this paper we propose event-based structure-from-orbit (eSfO) where the aim is to simultaneously reconstruct the 3D structure of a fast spinning object observed from a static event camera and recover the equivalent orbital motion of the camera. Our contributions are threefold: since state-of-the-art event feature trackers cannot handle periodic self-occlusion due to the spinning motion we develop a novel event feature tracker based on spatio-temporal clustering and data association that can better track the helical trajectories of valid features in the event data. The feature tracks are then fed to our novel factor graph-based structure-from-orbit back-end that calculates the orbital motion parameters (e.g. spin rate relative rotational axis) that minimize the reprojection error. For evaluation we produce a new event dataset of objects under spinning motion. Comparisons against ground truth indicate the efficacy of eSfO. + + + + LED: A Large-scale Real-world Paired Dataset for Event Camera Denoising http://openaccess.thecvf.com//content/CVPR2024/papers/Duan_LED_A_Large-scale_Real-world_Paired_Dataset_for_Event_Camera_Denoising_CVPR_2024_paper.pdf Event cameras have significant advantages in capturing dynamic scene information while being prone to noise interference particularly in challenging conditions like low threshold and low illumination. However most existing research focuses on gentle situations hindering event camera applications in realistic complex scenarios. To tackle this limitation and advance the field we construct a new paired real-world event denoising dataset (LED) including 3K sequences with 18K seconds of high-resolution (1200*680) event streams and showing three notable distinctions compared to others: diverse noise levels and scenes, larger scale with high resolution, and high-quality GT. Specifically it contains stepped parameters and varying illumination with diverse scenarios. Moreover based on the property of noise events' inconsistency and signal events' consistency we propose a novel effective denoising framework (DED) using homogeneous dual events to generate the GT, better separating noise from the raw events. Furthermore we design a bio-inspired baseline leveraging Leaky-Integrate-and-Fire (LIF) neurons with dynamic thresholds to realize accurate denoising.
The experimental results demonstrate the remarkable performance of the proposed approach on different datasets. The dataset and code are at https://github.com/Yee-Sing/led. + + + + SVDinsTN: A Tensor Network Paradigm for Efficient Structure Search from Regularized Modeling Perspective http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_SVDinsTN_A_Tensor_Network_Paradigm_for_Efficient_Structure_Search_from_CVPR_2024_paper.pdf Tensor network (TN) representation is a powerful technique for computer vision and machine learning. TN structure search (TN-SS) aims to search for a customized structure to achieve a compact representation which is a challenging NP-hard problem. Recent "sampling-evaluation"-based methods require sampling an extensive collection of structures and evaluating them one by one resulting in prohibitively high computational costs. To address this issue we propose a novel TN paradigm named SVD-inspired TN decomposition (SVDinsTN) which allows us to efficiently solve the TN-SS problem from a regularized modeling perspective eliminating the repeated structure evaluations. To be specific by inserting a diagonal factor for each edge of the fully-connected TN SVDinsTN allows us to calculate TN cores and diagonal factors simultaneously with the factor sparsity revealing a compact TN structure. In theory we prove a convergence guarantee for the proposed method. Experimental results demonstrate that the proposed method achieves approximately 100 to 1000 times acceleration compared to the state-of-the-art TN-SS methods while maintaining a comparable level of representation ability. + + + + Inverse Rendering of Glossy Objects via the Neural Plenoptic Function and Radiance Fields http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Inverse_Rendering_of_Glossy_Objects_via_the_Neural_Plenoptic_Function_CVPR_2024_paper.pdf Inverse rendering aims at recovering both geometry and materials of objects. It provides a more compatible reconstruction for conventional rendering engines compared with the neural radiance fields (NeRFs). On the other hand existing NeRF-based inverse rendering methods cannot handle glossy objects with local light interactions well as they typically oversimplify the illumination as a 2D environmental map which assumes infinite lights only. Observing the superiority of NeRFs in recovering radiance fields we propose a novel 5D Neural Plenoptic Function (NeP) based on NeRFs and ray tracing such that more accurate lighting-object interactions can be formulated via the rendering equation. We also design a material-aware cone sampling strategy to efficiently integrate lights inside the BRDF lobes with the help of pre-filtered radiance fields. Our method has two stages: the geometry of the target object and the pre-filtered environmental radiance fields are reconstructed in the first stage and materials of the target object are estimated in the second stage with the proposed NeP and material-aware cone sampling strategy. Extensive experiments on the proposed real-world and synthetic datasets demonstrate that our method can reconstruct high-fidelity geometry/materials of challenging glossy objects with complex lighting interactions from nearby objects.
Project webpage: https://whyy.site/paper/nep + + + + Split to Merge: Unifying Separated Modalities for Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Split_to_Merge_Unifying_Separated_Modalities_for_Unsupervised_Domain_Adaptation_CVPR_2024_paper.pdf + Large vision-language models (VLMs) like CLIP have demonstrated good zero-shot learning performance in the unsupervised domain adaptation task. Yet most transfer approaches for VLMs focus on either the language or visual branches overlooking the nuanced interplay between both modalities. In this work we introduce a Unified Modality Separation (UniMoS) framework for unsupervised domain adaptation. Leveraging insights from modality gap studies we craft a nimble modality separation network that distinctly disentangles CLIP's features into language-associated and vision-associated components. Our proposed Modality-Ensemble Training (MET) method fosters the exchange of modality-agnostic information while maintaining modality-specific nuances. We align features across domains using a modality discriminator. Comprehensive evaluations on three benchmarks reveal our approach sets a new state-of-the-art with minimal computational costs. Code: https://github.com/TL-UESTC/UniMoS. + + + + Overcoming Generic Knowledge Loss with Selective Parameter Update + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Overcoming_Generic_Knowledge_Loss_with_Selective_Parameter_Update_CVPR_2024_paper.pdf + Foundation models encompass an extensive knowledge base and offer remarkable transferability. However this knowledge becomes outdated or insufficient over time. The challenge lies in continuously updating foundation models to accommodate novel information while retaining their original capabilities. Leveraging the fact that foundation models have initial knowledge on various tasks and domains we propose a novel approach that instead of updating all parameters equally localizes the updates to a sparse set of parameters relevant to the task being learned. We strike a balance between efficiency and new task performance while maintaining the transferability and generalizability of foundation models. We extensively evaluate our method on foundational vision-language models with a diverse spectrum of continual learning tasks. Our method achieves improvements on the accuracy of the newly learned tasks up to 7% while preserving the pretraining knowledge with a negligible decrease of 0.9% on a representative control set accuracy. + + + + Diff-BGM: A Diffusion Model for Video Background Music Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Diff-BGM_A_Diffusion_Model_for_Video_Background_Music_Generation_CVPR_2024_paper.pdf + When editing a video a piece of attractive background music is indispensable. However video background music generation tasks face several challenges for example the lack of suitable training datasets and the difficulties in flexibly controlling the music generation process and sequentially aligning the video and music. In this work we first propose a high-quality music-video dataset BGM909 with detailed annotation and shot detection to provide multi-modal information about the video and music. We then present evaluation metrics to assess music quality including music diversity and alignment between music and video with retrieval precision metrics. 
Finally we propose the Diff-BGM framework to automatically generate the background music for a given video which uses different signals to control different aspects of the music during the generation process i.e. uses dynamic video features to control music rhythm and semantic features to control the melody and atmosphere. We propose to align the video and music sequentially by introducing a segment-aware cross-attention layer. Experiments verify the effectiveness of our proposed method. The code and models are available at https://github.com/sizhelee/Diff-BGM. + + + + Looking Similar Sounding Different: Leveraging Counterfactual Cross-Modal Pairs for Audiovisual Representation Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Singh_Looking_Similar_Sounding_Different_Leveraging_Counterfactual_Cross-Modal_Pairs_for_Audiovisual_CVPR_2024_paper.pdf + Audiovisual representation learning typically relies on the correspondence between sight and sound. However there are often multiple audio tracks that can correspond with a visual scene. Consider for example different conversations on the same crowded street. The effect of such counterfactual pairs on audiovisual representation learning has not been previously explored. To investigate this we use dubbed versions of movies and television shows to augment cross-modal contrastive learning. Our approach learns to represent alternate audio tracks differing only in speech similarly to the same video. Our results from a comprehensive set of experiments investigating different training strategies show this general approach improves performance on a range of downstream auditory and audiovisual tasks without majorly affecting linguistic task performance overall. These findings highlight the importance of considering speech variation when learning scene-level audiovisual correspondences and suggest that dubbed audio can be a useful augmentation technique for training audiovisual models toward more robust performance on diverse downstream tasks. + + + + Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings + http://openaccess.thecvf.com//content/CVPR2024/papers/Chang_Towards_HDR_and_HFR_Video_from_Rolling-Mixed-Bit_Spikings_CVPR_2024_paper.pdf + The spiking cameras offer the benefits of high dynamic range (HDR) high temporal resolution and low data redundancy. However reconstructing HDR videos in high-speed conditions using single-bit spikings presents challenges due to the limited bit depth. Increasing the bit depth of the spikings is advantageous for boosting HDR performance but the readout efficiency will be decreased which is unfavorable for achieving a high frame rate (HFR) video. To address these challenges we propose a readout mechanism to obtain rolling-mixed-bit (RMB) spikings which involves interleaving multi-bit spikings within the single-bit spikings in a rolling manner thereby combining the characteristics of high bit depth and efficient readout. Furthermore we introduce RMB-Net for reconstructing HDR and HFR videos. RMB-Net comprises a cross-bit attention block for fusing mixed-bit spikings and a cross-time attention block for achieving temporal fusion. Extensive experiments conducted on synthetic and real-synthetic data demonstrate the superiority of our method. For instance pure 3-bit spikings result in 3 times of data volume whereas our method achieves comparable performance with less than 2% increase in data volume. 
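To make the rolling-mixed-bit readout from "Towards HDR and HFR Video from Rolling-Mixed-Bit Spikings" above more tangible, the short NumPy sketch below simulates one possible schedule: every frame is read at 1-bit depth except for a narrow band of rows read at a higher depth, and that band rolls down the sensor from frame to frame. The 3-bit depth, the band width and the array sizes are illustrative assumptions, not values taken from the paper.

import numpy as np

def rolling_mixed_bit_readout(intensity, hi_bits=3, band_rows=8):
    """intensity: float array (T, H, W) in [0, 1]. Returns quantized frames plus a
    per-row bit-depth map for each frame (assumed readout policy, for illustration)."""
    T, H, W = intensity.shape
    frames, bit_maps = [], []
    for t in range(T):
        bits = np.ones(H, dtype=int)                      # default: single-bit rows
        rows = np.arange(t * band_rows, (t + 1) * band_rows) % H
        bits[rows] = hi_bits                              # rolling multi-bit band
        levels = (2 ** bits - 1).astype(float)            # quantization levels per row
        frames.append(np.round(intensity[t] * levels[:, None]) / levels[:, None])
        bit_maps.append(bits)
    return np.stack(frames), np.stack(bit_maps)

frames, bit_maps = rolling_mixed_bit_readout(np.random.rand(20, 64, 64))

Only band_rows of the rows pay the multi-bit readout cost in any single frame, which is the bit-depth versus readout-efficiency trade-off the abstract describes.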
+ + + + Bridging the Synthetic-to-Authentic Gap: Distortion-Guided Unsupervised Domain Adaptation for Blind Image Quality Assessment + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Bridging_the_Synthetic-to-Authentic_Gap_Distortion-Guided_Unsupervised_Domain_Adaptation_for_Blind_CVPR_2024_paper.pdf + The annotation of blind image quality assessment (BIQA) is labor-intensive and time-consuming especially for authentic images. Training on synthetic data is expected to be beneficial but synthetically trained models often suffer from poor generalization in real domains due to domain gaps. In this work we make a key observation that introducing more distortion types in the synthetic dataset may not improve or even be harmful to generalizing authentic image quality assessment. To solve this challenge we propose distortion-guided unsupervised domain adaptation for BIQA (DGQA) a novel framework that leverages adaptive multi-domain selection via prior knowledge from distortion to match the data distribution between the source domains and the target domain thereby reducing negative transfer from the outlier source domains. Extensive experiments on two cross-domain settings (synthetic distortion to authentic distortion and synthetic distortion to algorithmic distortion) have demonstrated the effectiveness of our proposed DGQA. Besides DGQA is orthogonal to existing model-based BIQA methods and can be used in combination with such models to improve performance with less training data. + + + + Coherent Temporal Synthesis for Incremental Action Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ding_Coherent_Temporal_Synthesis_for_Incremental_Action_Segmentation_CVPR_2024_paper.pdf + Data replay is a successful incremental learning technique for images. It prevents catastrophic forgetting by keeping a reservoir of previous data original or synthesized to ensure the model retains past knowledge while adapting to novel concepts. However its application in the video domain is rudimentary as it simply stores frame exemplars for action recognition. This paper presents the first exploration of video data replay techniques for incremental action segmentation focusing on action temporal modeling. We propose a Temporally Coherent Action (TCA) model which represents actions using a generative model instead of storing individual frames. The integration of a conditioning variable that captures temporal coherence allows our model to understand the evolution of action features over time. Therefore action segments generated by TCA for replay are diverse and temporally coherent. In a 10-task incremental setup on the Breakfast dataset our approach achieves significant increases in accuracy for up to 22% compared to the baselines. + + + + HiFi4G: High-Fidelity Human Performance Rendering via Compact Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_HiFi4G_High-Fidelity_Human_Performance_Rendering_via_Compact_Gaussian_Splatting_CVPR_2024_paper.pdf + We have recently seen tremendous progress in photo-real human modeling and rendering. Yet efficiently rendering realistic human performance and integrating it into the rasterization pipeline remains challenging. In this paper we present HiFi4G an explicit and compact Gaussian-based approach for high-fidelity human performance rendering from dense footage. Our core intuition is to marry the 3D Gaussian representation with non-rigid tracking achieving a compact and compression-friendly representation. 
We first propose a dual-graph mechanism to obtain motion priors with a coarse deformation graph for effective initialization and a fine-grained Gaussian graph to enforce subsequent constraints. Then we utilize a 4D Gaussian optimization scheme with adaptive spatial-temporal regularizers to effectively balance the non-rigid prior and Gaussian updating. We also present a companion compression scheme with residual compensation for immersive experiences on various platforms. It achieves a substantial compression rate of approximately 25 times with less than 2MB of storage per frame. Extensive experiments demonstrate the effectiveness of our approach which significantly outperforms existing approaches in terms of optimization speed rendering quality and storage overhead. + + + + G-FARS: Gradient-Field-based Auto-Regressive Sampling for 3D Part Grouping + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_G-FARS_Gradient-Field-based_Auto-Regressive_Sampling_for_3D_Part_Grouping_CVPR_2024_paper.pdf + This paper proposes a novel task named "3D part grouping". Suppose there is a mixed set containing scattered parts from various shapes. This task requires algorithms to find out every possible combination among all the parts. To address this challenge we propose the so called Gradient Field-based Auto-Regressive Sampling framework (G-FARS) tailored specifically for the 3D part grouping task. In our framework we design a gradient-field-based selection graph neural network (GNN) to learn the gradients of a log conditional probability density in terms of part selection where the condition is the given mixed part set. This innovative approach implemented through the gradient-field-based selection GNN effectively captures complex relationships among all the parts in the input. Upon completion of the training process our framework becomes capable of autonomously grouping 3D parts by iteratively selecting them from the mixed part set leveraging the knowledge acquired by the trained gradient-field-based selection GNN. Our code is available at: https://github.com/J-F-Cheng/G-FARS-3DPartGrouping. + + + + DMR: Decomposed Multi-Modality Representations for Frames and Events Fusion in Visual Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_DMR_Decomposed_Multi-Modality_Representations_for_Frames_and_Events_Fusion_in_CVPR_2024_paper.pdf + We explore visual reinforcement learning (RL) using two complementary visual modalities: frame-based RGB camera and event-based Dynamic Vision Sensor (DVS). Existing multi-modality visual RL methods often encounter challenges in effectively extracting task-relevant information from multiple modalities while suppressing the increased noise only using indirect reward signals instead of pixel-level supervision. To tackle this we propose a Decomposed Multi-Modality Representation (DMR) framework for visual RL. It explicitly decomposes the inputs into three distinct components: combined task-relevant features (co-features) RGB-specific noise and DVS-specific noise. The co-features represent the full information from both modalities that is relevant to the RL task; the two noise components each constrained by a data reconstruction loss to avoid information leak are contrasted with the co-features to maximize their difference. Extensive experiments demonstrate that by explicitly separating the different types of information our approach achieves substantially improved policy performance compared to state-of-the-art approaches. 
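The decomposition behind DMR above also lends itself to a compact sketch: one encoder produces task-relevant co-features from both modalities, two further encoders produce modality-specific residuals, reconstruction keeps those residuals informative, and a similarity penalty pushes them away from the co-features. All layer shapes, loss choices and weightings below are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DecomposedRepresentation(nn.Module):
    def __init__(self, rgb_dim, dvs_dim, feat_dim=128):
        super().__init__()
        self.co_enc = nn.Linear(rgb_dim + dvs_dim, feat_dim)   # shared task-relevant code
        self.rgb_noise_enc = nn.Linear(rgb_dim, feat_dim)      # RGB-specific residual
        self.dvs_noise_enc = nn.Linear(dvs_dim, feat_dim)      # DVS-specific residual
        self.rgb_dec = nn.Linear(2 * feat_dim, rgb_dim)
        self.dvs_dec = nn.Linear(2 * feat_dim, dvs_dim)

    def forward(self, rgb, dvs):
        co = self.co_enc(torch.cat([rgb, dvs], dim=-1))
        n_rgb, n_dvs = self.rgb_noise_enc(rgb), self.dvs_noise_enc(dvs)
        rgb_hat = self.rgb_dec(torch.cat([co, n_rgb], dim=-1))
        dvs_hat = self.dvs_dec(torch.cat([co, n_dvs], dim=-1))
        # Reconstruction keeps the residual branches informative, while the
        # similarity term maximizes their difference from the shared co-features.
        recon = F.mse_loss(rgb_hat, rgb) + F.mse_loss(dvs_hat, dvs)
        separation = F.cosine_similarity(co, n_rgb, dim=-1).mean() + F.cosine_similarity(co, n_dvs, dim=-1).mean()
        return co, recon + separation

In the full method the co-features would feed the policy network and these auxiliary losses would be optimized jointly with the reinforcement-learning objective; the linear layers above merely stand in for whatever encoders are actually used.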
+ + + + DiffuseMix: Label-Preserving Data Augmentation with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Islam_DiffuseMix_Label-Preserving_Data_Augmentation_with_Diffusion_Models_CVPR_2024_paper.pdf + Recently a number of image-mixing-based augmentation techniques have been introduced to improve the generalization of deep neural networks. In these techniques two or more randomly selected natural images are mixed together to generate an augmented image. Such methods may not only omit important portions of the input images but also introduce label ambiguities by mixing images across labels resulting in misleading supervisory signals. To address these limitations we propose DIFFUSEMIX a novel data augmentation technique that leverages a diffusion model to reshape training images supervised by our bespoke conditional prompts. First concatenation of a partial natural image and its generated counterpart is obtained which helps in avoiding the generation of unrealistic images or label ambiguities. Then to enhance resilience against adversarial attacks and improves safety measures a randomly selected structural pattern from a set of fractal images is blended into the concatenated image to form the final augmented image for training. Our empirical results on seven different datasets reveal that DIFFUSEMIX achieves superior performance compared to existing state- of-the-art methods on tasks including general classification fine-grained classification fine-tuning data scarcity and adversarial robustness. + + + + FREE: Faster and Better Data-Free Meta-Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_FREE_Faster_and_Better_Data-Free_Meta-Learning_CVPR_2024_paper.pdf + Data-Free Meta-Learning (DFML) aims to extract knowledge from a collection of pre-trained models without requiring the original data presenting practical benefits in contexts constrained by data privacy concerns. Current DFML methods primarily focus on the data recovery from these pre-trained models. However they suffer from slow recovery speed and overlook gaps inherent in heterogeneous pre-trained models. In response to these challenges we introduce the Faster and Better Data-Free Meta-Learning (FREE) framework which contains: (i) a meta-generator for rapidly recovering training tasks from pre-trained models; and (ii) a meta-learner for generalizing to new unseen tasks. Specifically within the module Faster Inversion via Meta-Generator each pre-trained model is perceived as a distinct task. The meta-generator can rapidly adapt to a specific task in just five steps significantly accelerating the data recovery. Furthermore we propose Better Generalization via Meta-Learner and introduce an implicit gradient alignment algorithm to optimize the meta-learner. This is achieved as aligned gradient directions alleviate potential conflicts among tasks from heterogeneous pre-trained models. Empirical experiments on multiple benchmarks affirm the superiority of our approach marking a notable speed-up (20x) and performance enhancement (1.42% 4.78%) in comparison to the state-of-the-art. + + + + Bi-SSC: Geometric-Semantic Bidirectional Fusion for Camera-based 3D Semantic Scene Completion + http://openaccess.thecvf.com//content/CVPR2024/papers/Xue_Bi-SSC_Geometric-Semantic_Bidirectional_Fusion_for_Camera-based_3D_Semantic_Scene_Completion_CVPR_2024_paper.pdf + Camera-based Semantic Scene Completion (SSC) is to infer the full geometry of objects and scenes from only 2D images. 
The task is particularly challenging for those invisible areas due to the inherent occlusions and lighting ambiguity. Existing works ignore the information missing or ambiguous in those shaded and occluded areas resulting in distorted geometric prediction. To address this issue we propose a novel method Bi-SSC bidirectional geometric semantic fusion for camera-based 3D semantic scene completion. The key insight is to use the neighboring structure of objects in the image and the spatial differences from different perspectives to compensate for the lack of information in occluded areas. Specifically we introduce a spatial sensory fusion module with multiple association attention to improve semantic correlation in geometric distributions. This module works within single view and across stereo views to achieve global spatial consistency. Experimental results demonstrate that Bi-SSC outperforms state-of-the-art camera-based methods on SemanticKITTI particularly excelling in those invisible and shaded areas. + + + + Parameter Efficient Self-Supervised Geospatial Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Scheibenreif_Parameter_Efficient_Self-Supervised_Geospatial_Domain_Adaptation_CVPR_2024_paper.pdf + As large-scale foundation models become publicly available for different domains efficiently adapting them to individual downstream applications and additional data modalities has turned into a central challenge. For example foundation models for geospatial and satellite remote sensing applications are commonly trained on large optical RGB or multi-spectral datasets although data from a wide variety of heterogeneous sensors are available in the remote sensing domain. This leads to significant discrepancies between pre-training and downstream target data distributions for many important applications. Fine-tuning large foundation models to bridge that gap incurs high computational cost and can be infeasible when target datasets are small. In this paper we address the question of how large pre-trained foundational transformer models can be efficiently adapted to downstream remote sensing tasks involving different data modalities or limited dataset size. We present a self-supervised adaptation method that boosts downstream linear evaluation accuracy of different foundation models by 4-6% (absolute) across 8 remote sensing datasets while outperforming full fine-tuning when training only 1-2% of the model parameters. Our method significantly improves label efficiency and increases few-shot accuracy by 6-10% on different datasets. + + + + Defense without Forgetting: Continual Adversarial Defense with Anisotropic & Isotropic Pseudo Replay + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Defense_without_Forgetting_Continual_Adversarial_Defense_with_Anisotropic__Isotropic_CVPR_2024_paper.pdf + Deep neural networks have demonstrated susceptibility to adversarial attacks. Adversarial defense techniques often focus on one-shot setting to maintain robustness against attack. However new attacks can emerge in sequences in real-world deployment scenarios. As a result it is crucial for a defense model to constantly adapt to new attacks but the adaptation process can lead to catastrophic forgetting of previously defended against attacks. 
In this paper we discuss for the first time the concept of continual adversarial defense under a sequence of attacks and propose a lifelong defense baseline called Anisotropic & Isotropic Replay (AIR) which offers three advantages: (1) Isotropic replay ensures model consistency in the neighborhood distribution of new data indirectly aligning the output preference between old and new tasks. (2) Anisotropic replay enables the model to learn a compromise data manifold with fresh mixed semantics for further replay constraints and potential future attacks. (3) A straightforward regularizer mitigates the 'plasticity-stability' trade-off by aligning model output between new and old tasks. Experiment results demonstrate that AIR can approximate or even exceed the empirical performance upper bounds achieved by Joint Training. + + + + Transferable Structural Sparse Adversarial Attack Via Exact Group Sparsity Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Ming_Transferable_Structural_Sparse_Adversarial_Attack_Via_Exact_Group_Sparsity_Training_CVPR_2024_paper.pdf + Deep neural networks (DNNs) are vulnerable to highly transferable adversarial attacks. Especially many studies have shown that sparse attacks pose a significant threat to DNNs on account of their exceptional imperceptibility. Current sparse attack methods mostly limit only the magnitude and number of perturbations while generally overlooking the location of the perturbations resulting in decreased performances on attack transferability. A subset of studies indicates that perturbations existing in the significant regions with rich classification-relevant features are more effective. Leveraging this insight we introduce the structural sparsity constraint in the framework of generative models to limit the perturbation positions. To ensure that the perturbations are generated towards classification-relevant regions we propose an exact group sparsity training method to learn pixel-level and group-level sparsity. For purpose of improving the effectiveness of sparse training we further put forward masked quantization network and multi-stage optimization algorithm in the training process. Utilizing CNNs as surrogate models extensive experiments demonstrate that our method has higher transferability in image classification attack compared to state-of-the-art methods at approximately same sparsity levels. In cross-model ViT object detection and semantic segmentation attack tasks we also achieve a better attack success rate. Code is available at https://github.com/MisterRpeng/EGS-TSSA. + + + + Unsupervised Occupancy Learning from Sparse Point Cloud + http://openaccess.thecvf.com//content/CVPR2024/papers/Ouasfi_Unsupervised_Occupancy_Learning_from_Sparse_Point_Cloud_CVPR_2024_paper.pdf + Implicit Neural Representations have gained prominence as a powerful framework for capturing complex data modalities encompassing a wide range from 3D shapes to images and audio. Within the realm of 3D shape representation Neural Signed Distance Functions (SDF) have demonstrated remarkable potential in faithfully encoding intricate shape geometry. However learning SDFs from 3D point clouds in the absence of ground truth supervision remains a very challenging task. In this paper we propose a method to infer occupancy fields instead of SDFs as they are easier to learn from sparse inputs. 
We leverage a margin-based uncertainty measure to differentiably sample from the decision boundary of the occupancy function and supervise the sampled boundary points using the input point cloud. We further stabilise the optimization process at the early stages of the training by biasing the occupancy function towards minimal entropy fields while maximizing its entropy at the input point cloud. Through extensive experiments and evaluations we illustrate the efficacy of our proposed method highlighting its capacity to improve implicit shape inference with respect to baselines and the state-of-the-art using synthetic and real data. + + + + 3DInAction: Understanding Human Actions in 3D Point Clouds + http://openaccess.thecvf.com//content/CVPR2024/papers/Ben-Shabat_3DInAction_Understanding_Human_Actions_in_3D_Point_Clouds_CVPR_2024_paper.pdf + We propose a novel method for 3D point cloud action recognition. Understanding human actions in RGB videos has been widely studied in recent years however its 3D point cloud counterpart remains under-explored despite the clear value that 3D information may bring. This is mostly due to the inherent limitation of the point cloud data modality---lack of structure permutation invariance and varying number of points---which makes it difficult to learn a spatio-temporal representation. To address this limitation we propose the 3DinAction pipeline that first estimates patches moving in time (t-patches) as a key building block alongside a hierarchical architecture that learns an informative spatio-temporal representation. We show that our method achieves improved performance on existing datasets including DFAUST and IKEA ASM. Code is publicly available at https://github.com/sitzikbs/3dincaction + + + + SDDGR: Stable Diffusion-based Deep Generative Replay for Class Incremental Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_SDDGR_Stable_Diffusion-based_Deep_Generative_Replay_for_Class_Incremental_Object_CVPR_2024_paper.pdf + In the field of class incremental learning (CIL) generative replay has become increasingly prominent as a method to mitigate the catastrophic forgetting alongside the continuous improvements in generative models. However its application in class incremental object detection (CIOD) has been significantly limited primarily due to the complexities of scenes involving multiple labels. In this paper we propose a novel approach called stable diffusion deep generative replay (SDDGR) for CIOD. Our method utilizes a diffusion-based generative model with pre-trained text-to-image diffusion networks to generate realistic and diverse synthetic images. SDDGR incorporates an iterative refinement strategy to produce high-quality images encompassing old classes. Additionally we adopt an L2 knowledge distillation technique to improve the retention of prior knowledge in synthetic images. Furthermore our approach includes pseudo-labeling for old objects within new task images preventing misclassification as background elements. Extensive experiments on the COCO 2017 dataset demonstrate that SDDGR significantly outperforms existing algorithms achieving a new state-of-the-art in various CIOD scenarios. 
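Some of the training signals described above fit in a few lines of code; the entropy bias from "Unsupervised Occupancy Learning from Sparse Point Cloud" is one example. The sketch below penalizes the entropy of the predicted occupancy at random query points, pushing the field towards crisp inside/outside decisions, while rewarding entropy at the input points so the decision boundary stays near the data. The sampling range, number of queries and weighting are assumptions, not the paper's settings.

import torch

def binary_entropy(p, eps=1e-6):
    p = p.clamp(eps, 1.0 - eps)
    return -(p * p.log() + (1.0 - p) * (1.0 - p).log())

def entropy_bias_loss(occ_net, input_points, num_random=2048, weight=0.1):
    """occ_net maps (N, 3) coordinates to occupancy probabilities in (0, 1);
    input_points is the sparse input cloud, assumed normalized to [-1, 1]^3."""
    random_queries = torch.rand(num_random, 3) * 2.0 - 1.0
    h_space = binary_entropy(occ_net(random_queries)).mean()    # minimize: crisp field
    h_surface = binary_entropy(occ_net(input_points)).mean()    # maximize: boundary at data
    return weight * (h_space - h_surface)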
+ + + + Physical 3D Adversarial Attacks against Monocular Depth Estimation in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Physical_3D_Adversarial_Attacks_against_Monocular_Depth_Estimation_in_Autonomous_CVPR_2024_paper.pdf + Deep learning-based monocular depth estimation (MDE) extensively applied in autonomous driving is known to be vulnerable to adversarial attacks. Previous physical attacks against MDE models rely on 2D adversarial patches so they only affect a small localized region in the MDE map but fail under various viewpoints. To address these limitations we propose 3D Depth Fool (3D^2Fool) the first 3D texture-based adversarial attack against MDE models. 3D^2Fool is specifically optimized to generate 3D adversarial textures agnostic to model types of vehicles and to have improved robustness in bad weather conditions such as rain and fog. Experimental results validate the superior performance of our 3D^2Fool across various scenarios including vehicles MDE models weather conditions and viewpoints. Real-world experiments with printed 3D textures on physical vehicle models further demonstrate that our 3D^2Fool can cause an MDE error of over 10 meters. + + + + Adaptive Random Feature Regularization on Fine-tuning Deep Neural Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Yamaguchi_Adaptive_Random_Feature_Regularization_on_Fine-tuning_Deep_Neural_Networks_CVPR_2024_paper.pdf + While fine-tuning is a de facto standard method for training deep neural networks it still suffers from overfitting when using small target datasets. Previous methods improve fine-tuning performance by maintaining knowledge of the source datasets or introducing regularization terms such as contrastive loss. However these methods require auxiliary source information (e.g. source labels or datasets) or heavy additional computations. In this paper we propose a simple method called adaptive random feature regularization (AdaRand). AdaRand helps the feature extractors of training models to adaptively change the distribution of feature vectors for downstream classification tasks without auxiliary source information and with reasonable computation costs. To this end AdaRand minimizes the gap between feature vectors and random reference vectors that are sampled from class conditional Gaussian distributions. Furthermore AdaRand dynamically updates the conditional distribution to follow the currently updated feature extractors and balance the distance between classes in feature spaces. Our experiments show that AdaRand outperforms the other fine-tuning regularization requiring auxiliary source information and heavy computation costs. + + + + Multimodal Prompt Perceiver: Empower Adaptiveness Generalizability and Fidelity for All-in-One Image Restoration + http://openaccess.thecvf.com//content/CVPR2024/papers/Ai_Multimodal_Prompt_Perceiver_Empower_Adaptiveness_Generalizability_and_Fidelity_for_All-in-One_CVPR_2024_paper.pdf + Despite substantial progress all-in-one image restoration (IR) grapples with persistent challenges in handling intricate real-world degradations. This paper introduces MPerceiver: a novel multimodal prompt learning approach that harnesses Stable Diffusion (SD) priors to enhance adaptiveness generalizability and fidelity for all-in-one image restoration. Specifically we develop a dual-branch module to master two types of SD prompts: textual for holistic representation and visual for multiscale detail representation. 
Both prompts are dynamically adjusted by degradation predictions from the CLIP image encoder enabling adaptive responses to diverse unknown degradations. Moreover a plug-in detail refinement module improves restoration fidelity via direct encoder-to-decoder information transformation. To assess our method MPerceiver is trained on 9 tasks for all-in-one IR and outperforms state-of-the-art task-specific methods across many tasks. Post multitask pre-training MPerceiver attains a generalized representation in low-level vision exhibiting remarkable zero-shot and few-shot capabilities in unseen tasks. Extensive experiments on 16 IR tasks underscore the superiority of MPerceiver in terms of adaptiveness generalizability and fidelity. + + + + Color Shift Estimation-and-Correction for Image Enhancement + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Color_Shift_Estimation-and-Correction_for_Image_Enhancement_CVPR_2024_paper.pdf + Images captured under sub-optimal illumination conditions may contain both over- and under-exposures. We observe that over- and under-exposed regions display opposite color tone distribution shifts which may not be easily normalized in joint modeling as they usually do not have "normal-exposed" regions/pixels as reference. In this paper we propose a novel method to enhance images with both over- and under-exposures by learning to estimate and correct such color shifts. Specifically we first derive the color feature maps of the brightened and darkened versions of the input image via a UNet-based network followed by a pseudo-normal feature generator to produce pseudo-normal color feature maps. We then propose a novel COlor Shift Estimation (COSE) module to estimate the color shifts between the derived brightened (or darkened) color feature maps and the pseudo-normal color feature maps. The COSE module corrects the estimated color shifts of the over- and under-exposed regions separately. We further propose a novel COlor MOdulation (COMO) module to modulate the separately corrected colors in the over- and under-exposed regions to produce the enhanced image. Comprehensive experiments show that our method outperforms existing approaches. + + + + Towards Scalable 3D Anomaly Detection and Localization: A Benchmark via 3D Anomaly Synthesis and A Self-Supervised Learning Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Towards_Scalable_3D_Anomaly_Detection_and_Localization_A_Benchmark_via_CVPR_2024_paper.pdf + Recently 3D anomaly detection a crucial problem involving fine-grained geometry discrimination is getting more attention. However the lack of abundant real 3D anomaly data limits the scalability of current models. To enable scalable anomaly data collection we propose a 3D anomaly synthesis pipeline to adapt existing large-scale 3D models for 3D anomaly detection. Specifically we construct a synthetic dataset i.e. Anomaly-ShapeNet based on ShapeNet. Anomaly-ShapeNet consists of 1600 point cloud samples under 40 categories which provides a rich and varied collection of data enabling efficient training and enhancing adaptability to industrial scenarios. Meanwhile to enable scalable representation learning for 3D anomaly localization we propose a self-supervised method i.e. Iterative Mask Reconstruction Network (IMRNet). During training we propose a geometry-aware sample module to preserve potentially anomalous local regions during point cloud down-sampling.
Then we randomly mask out point patches and send the visible patches to a transformer for reconstruction-based self-supervision. During testing the point cloud repeatedly goes through the Mask Reconstruction Network with each iteration's output becoming the next input. By merging and contrasting the final reconstructed point cloud with the initial input our method successfully locates anomalies. Experiments show that IMRNet outperforms previous state-of-the-art methods achieving 66.1% in I-AUC on our Anomaly-ShapeNet dataset and 72.5% in I-AUC on the Real3D-AD dataset. Our benchmark will be released at https://github.com/Chopper233/Anomaly-ShapeNet. + + + + Cam4DOcc: Benchmark for Camera-Only 4D Occupancy Forecasting in Autonomous Driving Applications + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Cam4DOcc_Benchmark_for_Camera-Only_4D_Occupancy_Forecasting_in_Autonomous_Driving_CVPR_2024_paper.pdf + Understanding how the surrounding environment changes is crucial for performing downstream tasks safely and reliably in autonomous driving applications. Recent occupancy estimation techniques using only camera images as input can provide dense occupancy representations of large-scale scenes based on the current observation. However they are mostly limited to representing the current 3D space and do not consider the future state of surrounding objects along the time axis. To extend camera-only occupancy estimation into spatiotemporal prediction we propose Cam4DOcc a new benchmark for camera-only 4D occupancy forecasting evaluating the surrounding scene changes in the near future. We build our benchmark based on multiple publicly available datasets including nuScenes nuScenes-Occupancy and Lyft-Level5 which provide sequential occupancy states of general movable and static objects as well as their 3D backward centripetal flow. To establish this benchmark for future research with comprehensive comparisons we introduce four baseline types from diverse camera-based perception and prediction implementations including a static-world occupancy model voxelization of point cloud prediction 2D-3D instance-based prediction and our proposed novel end-to-end 4D occupancy forecasting network. Furthermore the standardized evaluation protocol for preset multiple tasks is also provided to compare the performance of all the proposed baselines on present and future occupancy estimation with respect to objects of interest in autonomous driving scenarios. The dataset and our implementation of all four baselines in the proposed Cam4DOcc benchmark are released as open source at https://github.com/haomo-ai/Cam4DOcc. + + + + DIEM: Decomposition-Integration Enhancing Multimodal Insights + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_DIEM_Decomposition-Integration_Enhancing_Multimodal_Insights_CVPR_2024_paper.pdf + In image question answering due to the abundant and sometimes redundant information precisely matching and integrating the information from both text and images is a challenge. In this paper we propose the Decomposition-Integration Enhancing Multimodal Insight (DIEM) which initially decomposes the given question and image into multiple subquestions and several sub-images aiming to isolate specific elements for more focused analysis. We then integrate these sub-elements by matching each subquestion with its relevant sub-images while also retaining the original image to construct a comprehensive answer to the original question without losing sight of the overall context.
This strategy mirrors the human cognitive process of simplifying complex problems into smaller components for individual analysis followed by an integration of these insights. We implement DIEM on the LLaVA-v1.5 model and evaluate its performance on ScienceQA and MM-Vet. Experimental results indicate that our method boosts accuracy in most question classes of the ScienceQA (+2.03% in average) especially in the image modality (+3.40%). On MM-Vet our method achieves an improvement in MM-Vet scores increasing from 31.1 to 32.4. These findings highlight DIEM's effectiveness in harmonizing the complexities of multimodal data demonstrating its ability to enhance accuracy and depth in image question answering through its decomposition-integration process. + + + + Contrastive Pre-Training with Multi-View Fusion for No-Reference Point Cloud Quality Assessment + http://openaccess.thecvf.com//content/CVPR2024/papers/Shan_Contrastive_Pre-Training_with_Multi-View_Fusion_for_No-Reference_Point_Cloud_Quality_CVPR_2024_paper.pdf + No-reference point cloud quality assessment (NR-PCQA) aims to automatically evaluate the perceptual quality of distorted point clouds without available reference which have achieved tremendous improvements due to the utilization of deep neural networks. However learning-based NR-PCQA methods suffer from the scarcity of labeled data and usually perform suboptimally in terms of generalization. To solve the problem we propose a novel contrastive pre-training framework tailored for PCQA (CoPA) which enables the pre-trained model to learn quality-aware representations from unlabeled data. To obtain anchors in the representation space we project point clouds with different distortions into images and randomly mix their local patches to form mixed images with multiple distortions. Utilizing the generated anchors we constrain the pre-training process via a quality-aware contrastive loss following the philosophy that perceptual quality is closely related to both content and distortion. Furthermore in the model fine-tuning stage we propose a semantic-guided multi-view fusion module to effectively integrate the features of projected images from multiple perspectives. Extensive experiments show that our method outperforms the state-of-the-art PCQA methods on popular benchmarks. Further investigations demonstrate that CoPA can also benefit existing learning-based PCQA models. + + + + Revisiting Spatial-Frequency Information Integration from a Hierarchical Perspective for Panchromatic and Multi-Spectral Image Fusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_Revisiting_Spatial-Frequency_Information_Integration_from_a_Hierarchical_Perspective_for_Panchromatic_CVPR_2024_paper.pdf + Pan-sharpening is a super-resolution problem that essentially relies on spectra fusion of panchromatic (PAN) images and low-resolution multi-spectral (LRMS) images. The previous methods have validated the effectiveness of information fusion in the Fourier space of the whole image. However they haven't fully explored the Fourier relationships at different hierarchies between PAN and LRMS images. To this end we propose a Hierarchical Frequency Integration Network (HFIN) to facilitate hierarchical Fourier information integration for pan-sharpening. Specifically our network consists of two designs: information stratification and information integration. 
For information stratification we hierarchically decompose PAN and LRMS information into spatial global Fourier and local Fourier information and fuse them independently. For information integration the above hierarchical fused information is processed to further enhance their relationships and undergo comprehensive integration. Our method extend a new space for exploring the relationships of PAN and LRMS images enhancing the integration of spatial-frequency information. Extensive experiments robustly validate the effectiveness of the proposed network showcasing its superior performance compared to other state-of-the-art methods and generalization in real-world scenes and other fusion tasks as a general image fusion framework. Code is available at https://github.com/JosephTiTan/HFIN. + + + + BSNet: Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_BSNet_Box-Supervised_Simulation-assisted_Mean_Teacher_for_3D_Instance_Segmentation_CVPR_2024_paper.pdf + 3D instance segmentation (3DIS) is a crucial task but point-level annotations are tedious in fully supervised settings. Thus using bounding boxes (bboxes) as annotations has shown great potential. The current mainstream approach is a two-step process involving the generation of pseudo-labels from box annotations and the training of a 3DIS network with the pseudo-labels. However due to the presence of intersections among bboxes not every point has a determined instance label especially in overlapping areas. To generate higher quality pseudo-labels and achieve more precise weakly supervised 3DIS results we propose the Box-Supervised Simulation-assisted Mean Teacher for 3D Instance Segmentation (BSNet) which devises a novel pseudo-labeler called Simulation-assisted Transformer. The labeler consists of two main components. The first is Simulation-assisted Mean Teacher which introduces Mean Teacher for the first time in this task and constructs simulated samples to assist the labeler in acquiring prior knowledge about overlapping areas. To better model local-global structure we also propose Local-Global Aware Attention as the decoder for teacher and student labelers. Extensive experiments conducted on the ScanNetV2 and S3DIS datasets verify the superiority of our designs. + + + + Adaptive Slot Attention: Object Discovery with Dynamic Slot Number + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Adaptive_Slot_Attention_Object_Discovery_with_Dynamic_Slot_Number_CVPR_2024_paper.pdf + Object-centric learning (OCL) extracts the representation of objects with slots offering an exceptional blend of flexibility and interpretability for abstracting low-level perceptual features. A widely adopted method within OCL is slot attention which utilizes attention mechanisms to iteratively refine slot representations. However a major drawback of most object-centric models including slot attention is their reliance on predefining the number of slots. This not only necessitates prior knowledge of the dataset but also overlooks the inherent variability in the number of objects present in each instance. To overcome this fundamental limitation we present a novel complexity-aware object auto-encoder framework. Within this framework we introduce an adaptive slot attention (AdaSlot) mechanism that dynamically determines the optimal number of slots based on the content of the data. 
This is achieved by proposing a discrete slot sampling module that is responsible for selecting an appropriate number of slots from a candidate list. Furthermore we introduce a masked slot decoder that suppresses unselected slots during the decoding process. Our framework tested extensively on object discovery tasks with various datasets shows performance matching or exceeding top fixed-slot models. Moreover our analysis substantiates that our method exhibits the capability to dynamically adapt the slot number according to each instance's complexity offering the potential for further exploration in slot attention research. Project will be available at https://kfan21.github.io/AdaSlot/ + + + + Task-Driven Wavelets using Constrained Empirical Risk Minimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Marcus_Task-Driven_Wavelets_using_Constrained_Empirical_Risk_Minimization_CVPR_2024_paper.pdf + Deep Neural Networks (DNNs) are widely used for their ability to effectively approximate large classes of functions. This flexibility however makes the strict enforcement of constraints on DNNs a difficult problem. In contexts where it is critical to limit the function space to which certain network components belong such as wavelets employed in Multi-Resolution Analysis (MRA) naive constraints via additional terms in the loss function are inadequate. To address this we introduce a Convolutional Neural Network (CNN) wherein the convolutional filters are strictly constrained to be wavelets. This allows the filters to update to task-optimized wavelets during the training procedure. Our primary contribution lies in the rigorous formulation of these filters via a constrained empirical risk minimization framework thereby providing an exact mechanism to enforce these structural constraints. While our work is grounded in theory we investigate our approach empirically through applications in medical imaging particularly in the task of contour prediction around various organs achieving superior performance compared to baseline methods. + + + + DeiT-LT: Distillation Strikes Back for Vision Transformer Training on Long-Tailed Datasets + http://openaccess.thecvf.com//content/CVPR2024/papers/Rangwani_DeiT-LT_Distillation_Strikes_Back_for_Vision_Transformer_Training_on_Long-Tailed_CVPR_2024_paper.pdf + Vision Transformer (ViT) has emerged as a prominent architecture for various computer vision tasks. In ViT we divide the input image into patch tokens and process them through a stack of self-attention blocks. However unlike Convolutional Neural Network (CNN) ViT's simple architecture has no informative inductive bias (e.g. locality etc.). Due to this ViT requires a large amount of data for pre-training. Various data-efficient approaches (DeiT) have been proposed to train ViT on balanced datasets effectively. However limited literature discusses the use of ViT for datasets with long-tailed imbalances. In this work we introduce DeiT-LT to tackle the problem of training ViTs from scratch on long-tailed datasets. In DeiT-LT we introduce an efficient and effective way of distillation from CNN via distillation \texttt DIST token by using out-of-distribution images and re-weighting the distillation loss to enhance focus on tail classes. This leads to the learning of local CNN-like features in early ViT blocks improving generalization for tail classes. 
Further to mitigate overfitting we propose distilling from a flat CNN teacher which leads to learning low-rank generalizable features for DIST tokens across all ViT blocks. With the proposed DeiT-LT scheme the distillation DIST token becomes an expert on the tail classes and the classifier CLS token becomes an expert on the head classes. The experts help to effectively learn features corresponding to both the majority and minority classes using a distinct set of tokens within the same ViT architecture. We show the effectiveness of DeiT-LT for training ViT from scratch on datasets ranging from small-scale CIFAR-10 LT to large-scale iNaturalist-2018. Project Page: https://rangwani-harsh.github.io/DeiT-LT. + + + + FCS: Feature Calibration and Separation for Non-Exemplar Class Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_FCS_Feature_Calibration_and_Separation_for_Non-Exemplar_Class_Incremental_Learning_CVPR_2024_paper.pdf + Non-Exemplar Class Incremental Learning (NECIL) involves learning a classification model on a sequence of data without access to exemplars from previously encountered old classes. Such a stringent constraint always leads to catastrophic forgetting of the learned knowledge. Currently existing methods either employ knowledge distillation techniques or preserved class prototypes to sustain prior knowledge. However two critical issues still persist. On the one hand as the model is continually updated the preserved prototypes of old classes will inevitably derive from the suitable location in the feature space of the new model. On the other hand due to the lack of exemplars the features of new classes will take the place of similar old classes which breaks the classification boundary. To address these challenges we propose a Feature Calibration and Separation (FCS) method for NECIL. Our approach comprises a Feature Calibration Network (FCN) that adapts prototypes of old classes to the new model via optimal transport learning approximating the drift of prototypes caused by model evolution. Additionally we also propose a Prototype-Involved Contrastive Loss (PIC) that enhances feature separation among different classes. Specifically to mitigate the boundary distortion arising from the interplay of classes from different learning stages prototypes are involved in pushing the feature of new classes away from the old classes. Extensive experiments on three datasets with different settings have demonstrated the superiority of our FCS method against the state-of-the-art class incremental learning approaches. Code is available at https://github.com/zhoujiahuan1991/CVPR2024-FCS. + + + + Task2Box: Box Embeddings for Modeling Asymmetric Task Relationships + http://openaccess.thecvf.com//content/CVPR2024/papers/Daroya_Task2Box_Box_Embeddings_for_Modeling_Asymmetric_Task_Relationships_CVPR_2024_paper.pdf + Modeling and visualizing relationships between tasks or datasets is an important step towards solving various meta-tasks such as dataset discovery multi-tasking and transfer learning. However many relationships such as containment and transferability are naturally asymmetric and current approaches for representation and visualization (e.g. t-SNE) do not readily support this. We propose Task2Box an approach to represent tasks using box embeddings---axis-aligned hyperrectangles in low dimensional spaces---that can capture asymmetric relationships between them through volumetric overlaps. 
We show that Task2Box accurately predicts unseen hierarchical relationships between nodes in ImageNet and iNaturalist datasets as well as transferability between tasks in the Taskonomy benchmark. We also show that box embeddings estimated from task representations (e.g. CLIP Task2Vec or attribute based) can be used to predict relationships between unseen tasks more accurately than classifiers trained on the same representations as well as handcrafted asymmetric distances (e.g. KL divergence). This suggests that low-dimensional box embeddings can effectively capture these task relationships and have the added advantage of being interpretable. We use the approach to visualize relationships among publicly available image classification datasets on popular dataset hosting platform called Hugging Face. + + + + LoS: Local Structure-Guided Stereo Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_LoS_Local_Structure-Guided_Stereo_Matching_CVPR_2024_paper.pdf + Estimating disparities in challenging areas is difficult and limits the performance of stereo matching models. In this paper we exploit local structure information (LSI) to enhance stereo matching. Specifically our LSI comprises a series of key elements including the slant plane (parameterised by disparity gradients) disparity offset details and neighbouring relations. This LSI empowers our method to effectively handle intricate structures including object boundaries and curved surfaces. We bootstrap the LSI from monocular depth and subsequently iteratively refine it to better capture the underlying scene geometry constraints. Building upon the LSI we introduce the Local Structure-Guided Propagation (LSGP) which enhances the disparity initialization optimization and refinement processes. By combining LSGP with a Gated Recurrent Unit (GRU) we present our novel stereo matching method referred to as Local Structure-guided stereo matching (LoS). Remarkably LoS achieves top-ranking results on four widely recognized public benchmark datasets (ETH3D Middlebury KITTI 15 & 12) demonstrating the superior capabilities of our proposed model. + + + + Probing the 3D Awareness of Visual Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Banani_Probing_the_3D_Awareness_of_Visual_Foundation_Models_CVPR_2024_paper.pdf + Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task their intermediate representations are useful for other visual tasks such as detection and segmentation. Given that such models can classify delineate and localize objects in 2D we ask whether they also represent their 3D structure? In this work we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d. 
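The probing protocol in "Probing the 3D Awareness of Visual Foundation Models" above is easy to picture as code: freeze the foundation model, attach a small task-specific head and train only the head. The sketch below shows a linear per-patch depth probe; the backbone's output shape, the L1 objective and the optimizer details are assumptions for illustration rather than the paper's exact setup.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthProbe(nn.Module):
    """Linear probe that regresses per-patch depth from frozen backbone features."""
    def __init__(self, feat_dim):
        super().__init__()
        self.head = nn.Linear(feat_dim, 1)

    def forward(self, feats):                      # feats: (B, N_patches, feat_dim)
        return self.head(feats).squeeze(-1)        # (B, N_patches)

def probe_step(backbone, probe, optimizer, images, target_depth):
    with torch.no_grad():                          # the foundation model stays frozen
        feats = backbone(images)                   # assumed to return patch features
    pred = probe(feats)
    loss = F.l1_loss(pred, target_depth)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()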
+ + + + When Visual Grounding Meets Gigapixel-level Large-scale Scenes: Benchmark and Approach + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_When_Visual_Grounding_Meets_Gigapixel-level_Large-scale_Scenes_Benchmark_and_Approach_CVPR_2024_paper.pdf + Visual grounding refers to the process of associating natural language expressions with corresponding regions within an image. Existing benchmarks for visual grounding primarily operate within small-scale scenes with a few objects. Nevertheless recent advances in imaging technology have enabled the acquisition of gigapixel-level images providing high-resolution details in large-scale scenes containing numerous objects. To bridge this gap between imaging and computer vision benchmarks and make grounding more practically valuable we introduce a novel dataset named GigaGrounding designed to challenge visual grounding models in gigapixel-level large-scale scenes. We extensively analyze and compare the dataset with existing benchmarks demonstrating that GigaGrounding presents unique challenges such as large-scale scene understanding gigapixel-level resolution significant variations in object scales and the "multi-hop expressions". Furthermore we introduced a simple yet effective grounding approach which employs a "glance-to-zoom-in" paradigm and exhibits enhanced capabilities for addressing the GigaGrounding task. The dataset is available at www.gigavision.ai. + + + + Mind Artist: Creating Artistic Snapshots with Human Thought + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Mind_Artist_Creating_Artistic_Snapshots_with_Human_Thought_CVPR_2024_paper.pdf + We introduce Mind Artist (MindArt) a novel and efficient neural decoding architecture to snap artistic photographs from our mind in a controllable manner. Recently progress has been made in image reconstruction with non-invasive brain recordings but it's still difficult to generate realistic images with high semantic fidelity due to the scarcity of data annotations. Unlike previous methods this work casts the neural decoding into optimal transport (OT) and representation decoupling problems. Specifically under discrete OT theory we design a graph matching-guided neural representation learning framework to seek the underlying correspondences between conceptual semantics and neural signals which yields a natural and meaningful self-supervisory task. Moreover the proposed MindArt structured with multiple stand-alone modal branches enables the seamless incorporation of semantic representation into any visual style information thus leaving it to have multi-modal reconstruction and training-free semantic editing capabilities. By doing so the reconstructed images of MindArt have phenomenal realism both in terms of semantics and appearance. We compare our MindArt with leading alternatives and achieve SOTA performance in different decoding tasks. Importantly our approach can directly generate a series of stylized "mind snapshots" w/o extra optimizations which may open up more potential applications. Code is available at https://github.com/JxuanC/MindArt. + + + + Accept the Modality Gap: An Exploration in the Hyperbolic Space + http://openaccess.thecvf.com//content/CVPR2024/papers/Ramasinghe_Accept_the_Modality_Gap_An_Exploration_in_the_Hyperbolic_Space_CVPR_2024_paper.pdf + Recent advancements in machine learning have spotlighted the potential of hyperbolic spaces as they effectively learn hierarchical feature representations. 
While there has been progress in leveraging hyperbolic spaces in single-modality contexts their use in multimodal settings remains underexplored. Some recent efforts have sought to transpose Euclidean multimodal learning techniques to hyperbolic spaces by adopting geodesic distance based contrastive losses. However we show both theoretically and empirically that such a spatial proximity based contrastive loss significantly disrupts hierarchies in the latent space. To remedy this we advocate that the cross-modal representations should accept the inherent modality gap between text and images and introduce a novel approach to measure cross-modal similarity that does not enforce spatial proximity. Our approach shows remarkable capabilities in preserving unimodal hierarchies while aligning the two modalities. Our experiments on a series of downstream tasks demonstrate that a better latent structure emerges with our objective function while remaining superior in text-to-image and image-to-text retrieval tasks. + + + + Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Unraveling_Instance_Associations_A_Closer_Look_for_Audio-Visual_Segmentation_CVPR_2024_paper.pdf + Audio-visual segmentation (AVS) is a challenging task that involves accurately segmenting sounding objects based on audio-visual cues. The effectiveness of audio-visual learning critically depends on achieving accurate cross-modal alignment between sound and visual objects. Successful audio-visual learning requires two essential components: 1) a challenging dataset with high-quality pixel-level multi-class annotated images associated with audio files and 2) a model that can establish strong links between audio information and its corresponding visual object. However these requirements are only partially addressed by current methods with training sets containing biased audio-visual data and models that generalise poorly beyond this biased training set. In this work we propose a new cost-effective strategy to build challenging and relatively unbiased high-quality audio-visual segmentation benchmarks. We also propose a new informative sample mining method for audio-visual supervised contrastive learning to leverage discriminative contrastive samples to enforce cross-modal understanding. We show empirical results that demonstrate the effectiveness of our benchmark. Furthermore experiments conducted on existing AVS datasets and on our new benchmark show that our method achieves state-of-the-art (SOTA) segmentation accuracy. + + + + Few-Shot Object Detection with Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_Few-Shot_Object_Detection_with_Foundation_Models_CVPR_2024_paper.pdf + Few-shot object detection (FSOD) aims to detect objects with only a few training examples. Visual feature extraction and query-support similarity learning are the two critical components. Existing works are usually developed based on ImageNet pre-trained vision backbones and design sophisticated metric-learning networks for few-shot learning but still have inferior accuracy. In this work we study few-shot object detection using modern foundation models. First the vision-only contrastive pre-trained DINOv2 model is used as the vision backbone which shows strong transferable performance without tuning the parameters. Second a Large Language Model (LLM) is employed for contextualized few-shot learning with the input of all classes and query image proposals.
Language instructions are carefully designed to prompt the LLM to classify each proposal in context. The contextual information includes proposal-proposal relations, proposal-class relations and class-class relations, which can largely promote few-shot learning. We comprehensively evaluate the proposed model (FM-FSOD) on multiple FSOD benchmarks, achieving state-of-the-art performance. + + + + FedMef: Towards Memory-efficient Federated Dynamic Pruning http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_FedMef_Towards_Memory-efficient_Federated_Dynamic_Pruning_CVPR_2024_paper.pdf Federated learning (FL) promotes decentralized training while prioritizing data confidentiality. However, its application on resource-constrained devices is challenging due to the high demand for computation and memory resources to train deep learning models. Neural network pruning techniques, such as dynamic pruning, could enhance model efficiency, but directly adopting them in FL still poses substantial challenges, including post-pruning performance degradation, high activation memory usage, etc. To address these challenges, we propose FedMef, a novel and memory-efficient federated dynamic pruning framework. FedMef comprises two key components. First, we introduce budget-aware extrusion, which maintains pruning efficiency while preserving post-pruning performance by salvaging crucial information from parameters marked for pruning within a given budget. Second, we propose scaled activation pruning to effectively reduce activation memory footprints, which is particularly beneficial for deploying FL on memory-limited devices. Extensive experiments demonstrate the effectiveness of our proposed FedMef. In particular, it achieves a significant reduction of 28.5% in memory footprint compared to state-of-the-art methods while obtaining superior accuracy. + + + + PracticalDG: Perturbation Distillation on Vision-Language Models for Hybrid Domain Generalization http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_PracticalDG_Perturbation_Distillation_on_Vision-Language_Models_for_Hybrid_Domain_Generalization_CVPR_2024_paper.pdf Domain Generalization (DG) aims to resolve distribution shifts between source and target domains, and current DG methods default to the setting that data from source and target domains share identical categories. Nevertheless, unseen classes from target domains can appear in practical scenarios. To address this issue, Open Set Domain Generalization (OSDG) has emerged, and several methods have been proposed exclusively for it. However, most existing methods adopt complex architectures with only slight improvement over DG methods. Recently, vision-language models (VLMs) have been introduced in DG following the fine-tuning paradigm, but they consume huge training overhead with large vision models. Therefore, in this paper we innovate to transfer knowledge from VLMs to lightweight vision models and improve the robustness by introducing Perturbation Distillation (PD) from three perspectives, including Score, Class and Instance (SCI), named SCI-PD. Moreover, previous methods are oriented by benchmarks with identical and fixed splits, ignoring the divergence between source domains. These methods are revealed to suffer from sharp performance decay with our proposed new benchmark, Hybrid Domain Generalization (HDG), and a novel metric H^2-CV, which construct various splits to comprehensively assess the robustness of algorithms.
Extensive experiments demonstrate that our method outperforms state-of-the-art algorithms on multiple datasets especially improving the robustness when confronting data scarcity. + + + + SODA: Bottleneck Diffusion Models for Representation Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Hudson_SODA_Bottleneck_Diffusion_Models_for_Representation_Learning_CVPR_2024_paper.pdf + We introduce SODA a self-supervised diffusion model designed for representation learning. The model incorporates an image encoder which distills a source view into a compact representation that in turn guides the generation of related novel views. We show that by imposing a tight bottleneck between the encoder and a denoising decoder and leveraging novel view synthesis as a self-supervised objective we can turn diffusion models into strong representation learners capable of capturing visual semantics in an unsupervised manner. To the best of our knowledge SODA is the first diffusion model to succeed at ImageNet linear-probe classification and at the same time it accomplishes reconstruction editing and synthesis tasks across a wide range of datasets. Further investigation reveals the disentangled nature of its emergent latent space that serves as an effective interface to control and manipulate the produced images. All in all we aim to shed light on the exciting and promising potential of diffusion models not only for image generation but also for learning rich and robust representations. See our website at soda-diffusion.github.io. + + + + Zero-Reference Low-Light Enhancement via Physical Quadruple Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Zero-Reference_Low-Light_Enhancement_via_Physical_Quadruple_Priors_CVPR_2024_paper.pdf + Understanding illumination and reducing the need for supervision pose a significant challenge in low-light enhancement. Current approaches are highly sensitive to data usage during training and illumination-specific hyper-parameters limiting their ability to handle unseen scenarios. In this paper we propose a new zero-reference low-light enhancement framework trainable solely with normal light images. To accomplish this we devise an illumination-invariant prior inspired by the theory of physical light transfer. This prior serves as the bridge between normal and low-light images. Then we develop a prior-to-image framework trained without low-light data. During testing this framework is able to restore our illumination-invariant prior back to images automatically achieving low-light enhancement. Within this framework we leverage a pretrained generative diffusion model for model ability introduce a bypass decoder to handle detail distortion as well as offer a lightweight version for practicality. Extensive experiments demonstrate our framework's superiority in various scenarios as well as good interpretability robustness and efficiency. Code is available on our project homepage: http://daooshee.github.io/QuadPrior-Website/ + + + + NeRFCodec: Neural Feature Compression Meets Neural Radiance Fields for Memory-Efficient Scene Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_NeRFCodec_Neural_Feature_Compression_Meets_Neural_Radiance_Fields_for_Memory-Efficient_CVPR_2024_paper.pdf + The emergence of Neural Radiance Fields (NeRF) has greatly impacted 3D scene modeling and novel-view synthesis. As a kind of visual media for 3D scene representation compression with high rate-distortion performance is an eternal target. 
Motivated by advances in neural compression and neural field representation we propose NeRFCodec an end-to-end NeRF compression framework that integrates non-linear transform quantization and entropy coding for memory-efficient scene representation. Since training a non-linear transform directly on a large scale of NeRF feature planes is impractical we discover that pre-trained neural 2D image codec can be utilized for compressing the features when adding content-specific parameters. Specifically we reuse neural 2D image codec but modify its encoder and decoder heads while keeping the other parts of the pre-trained decoder frozen. This allows us to train the full pipeline via supervision of rendering loss and entropy loss yielding the rate-distortion balance by updating the content-specific parameters. At test time the bitstreams containing latent code feature decoder head and other side information are transmitted for communication. Experimental results demonstrate our method outperforms existing NeRF compression methods enabling high-quality novel view synthesis with a memory budget of 0.5 MB. + + + + Rethinking Transformers Pre-training for Multi-Spectral Satellite Imagery + http://openaccess.thecvf.com//content/CVPR2024/papers/Noman_Rethinking_Transformers_Pre-training_for_Multi-Spectral_Satellite_Imagery_CVPR_2024_paper.pdf + Recent advances in unsupervised learning have demonstrated the ability of large vision models to achieve promising results on downstream tasks by pre-training on large amount of unlabelled data. Such pre-training techniques have also been explored recently in the remote sensing domain due to the availability of large amount of unlabelled data. Different from standard natural image datasets remote sensing data is acquired from various sensor technologies and exhibit diverse range of scale variations as well as modalities. Existing satellite image pre-training methods either ignore the scale information present in the remote sensing imagery or restrict themselves to use only a single type of data modality. In this paper we re-visit transformers pre-training and leverage multi-scale information that is effectively utilized with multiple modalities. Our proposed approach named SatMAE++ performs multi-scale pre-training and utilizes convolution based upsampling blocks to reconstruct the image at higher scales making it extensible to include more scales. Compared to existing works the proposed SatMAE++ with multi-scale pre-training is equally effective for both optical as well as multi-spectral imagery. Extensive experiments on six datasets reveal the merits of proposed contributions leading to state-of-the-art performance on all datasets. SatMAE++ achieves mean average precision (mAP) gain of 2.5% for multi-label classification task on BigEarthNet dataset. + + + + LLM4SGG: Large Language Models for Weakly Supervised Scene Graph Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_LLM4SGG_Large_Language_Models_for_Weakly_Supervised_Scene_Graph_Generation_CVPR_2024_paper.pdf + Weakly-Supervised Scene Graph Generation (WSSGG) research has recently emerged as an alternative to the fully-supervised approach that heavily relies on costly annotations. In this regard studies on WSSGG have utilized image captions to obtain unlocalized triplets while primarily focusing on grounding the unlocalized triplets over image regions. 
However they have overlooked the two issues involved in the triplet formation process from the captions: 1) Semantic over-simplification issue arises when extracting triplets from captions where fine-grained predicates in captions are undesirably converted into coarse-grained predicates resulting in a long-tailed predicate distribution and 2) Low-density scene graph issue arises when aligning the triplets in the caption with entity/predicate classes of interest where many triplets are discarded and not used in training leading to insufficient supervision. To tackle the two issues we propose a new approach i.e. Large Language Model for weakly-supervised SGG (LLM4SGG) where we mitigate the two issues by leveraging the LLM's in-depth understanding of language and reasoning ability during the extraction of triplets from captions and alignment of entity/predicate classes with target data. To further engage the LLM in these processes we adopt the idea of Chain-of-Thought and the in-context few-shot learning strategy. To validate the effectiveness of LLM4SGG we conduct extensive experiments on Visual Genome and GQA datasets showing significant improvements in both Recall@K and mean Recall@K compared to the state-of-the-art WSSGG methods. A further appeal is that LLM4SGG is data-efficient enabling effective model training with a small amount of training images. + + + + Neural Directional Encoding for Efficient and Accurate View-Dependent Appearance Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Neural_Directional_Encoding_for_Efficient_and_Accurate_View-Dependent_Appearance_Modeling_CVPR_2024_paper.pdf + Novel-view synthesis of specular objects like shiny metals or glossy paints remains a significant challenge. Not only the glossy appearance but also global illumination effects including reflections of other objects in the environment are critical components to faithfully reproduce a scene. In this paper we present Neural Directional Encoding (NDE) a view-dependent appearance encoding of neural radiance fields (NeRF) for rendering specular objects. NDE transfers the concept of feature-grid-based spatial encoding to the angular domain significantly improving the ability to model high-frequency angular signals. In contrast to previous methods that use encoding functions with only angular input we additionally cone-trace spatial features to obtain a spatially varying directional encoding which addresses the challenging interreflection effects. Extensive experiments on both synthetic and real datasets show that a NeRF model with NDE (1) outperforms the state of the art on view synthesis of specular objects and (2) works with small networks to allow fast (real-time) inference. The source code is available at: https://github.com/lwwu2/nde + + + + Label Propagation for Zero-shot Classification with Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Stojni_Label_Propagation_for_Zero-shot_Classification_with_Vision-Language_Models_CVPR_2024_paper.pdf + Vision-Language Models (VLMs) have demonstrated impressive performance on zero-shot classification i.e. classification when provided merely with a list of class names. In this paper we tackle the case of zero-shot classification in the presence of unlabeled data. We leverage the graph structure of the unlabeled data and introduce ZLaP a method based on label propagation (LP) that utilizes geodesic distances for classification. 
We tailor LP to graphs containing both text and image features and further propose an efficient method for performing inductive inference based on a dual solution and a sparsification step. We perform extensive experiments to evaluate the effectiveness of our method on 14 common datasets and show that ZLaP outperforms the latest related works. Code: https://github.com/vladan-stojnic/ZLaP + + + + Revisiting Global Translation Estimation with Feature Tracks + http://openaccess.thecvf.com//content/CVPR2024/papers/Tao_Revisiting_Global_Translation_Estimation_with_Feature_Tracks_CVPR_2024_paper.pdf + Global translation estimation is a highly challenging step in the global structure from motion (SfM) algorithm. Many existing methods depend solely on relative translations leading to inaccuracies in low parallax scenes and degradation under collinear camera motion. While recent approaches aim to address these issues by incorporating feature tracks into objective functions they are often sensitive to outliers. In this paper we first revisit global translation estimation methods with feature tracks and categorize them into explicit and implicit methods. Then we highlight the superiority of the objective function based on the cross-product distance metric and propose a novel explicit global translation estimation framework that integrates both relative translations and feature tracks as input. To enhance the accuracy of input observations we re-estimate relative translations with the coplanarity constraint of the epipolar plane and propose a simple yet effective strategy to select reliable feature tracks. Finally the effectiveness of our approach is demonstrated through experiments on urban image sequences and unordered Internet images showcasing its superior accuracy and robustness compared to many state-of-the-art techniques. + + + + Open-Set Domain Adaptation for Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Choe_Open-Set_Domain_Adaptation_for_Semantic_Segmentation_CVPR_2024_paper.pdf + Unsupervised domain adaptation (UDA) for semantic segmentation aims to transfer the pixel-wise knowledge from the labeled source domain to the unlabeled target domain. However current UDA methods typically assume a shared label space between source and target limiting their applicability in real-world scenarios where novel categories may emerge in the target domain. In this paper we introduce Open-Set Domain Adaptation for Semantic Segmentation (OSDA-SS) for the first time where the target domain includes unknown classes. We identify two major problems in the OSDA-SS scenario as follows: 1) the existing UDA methods struggle to predict the exact boundary of the unknown classes and 2) they fail to accurately predict the shape of the unknown classes. To address these issues we propose Boundary and Unknown Shape-Aware open-set domain adaptation coined BUS. Our BUS can accurately discern the boundaries between known and unknown classes in a contrastive manner using a novel dilation-erosion-based contrastive loss. In addition we propose OpenReMix a new domain mixing augmentation method that guides our model to effectively learn domain and size-invariant features for improving the shape detection of the known and unknown classes. Through extensive experiments we demonstrate that our proposed BUS effectively detects unknown classes in the challenging OSDA-SS scenario compared to the previous methods by a large margin. 
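A minimal sketch of the dilation-erosion idea referenced in the open-set segmentation (BUS) entry above: morphological dilation and erosion of a predicted unknown-class mask yield inner and outer bands around its boundary, the kind of regions a boundary-aware contrastive loss could pull apart. The helper name, band width and the use of scipy are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch (not the authors' code): build boundary bands around a
# predicted "unknown" mask via morphological dilation and erosion.
import numpy as np
from scipy.ndimage import binary_dilation, binary_erosion

def boundary_bands(unknown_mask: np.ndarray, width: int = 3):
    """unknown_mask: (H, W) bool array, True where a pixel is predicted unknown."""
    structure = np.ones((3, 3), dtype=bool)
    dilated = binary_dilation(unknown_mask, structure=structure, iterations=width)
    eroded = binary_erosion(unknown_mask, structure=structure, iterations=width)
    outer_band = dilated & ~unknown_mask   # just outside the unknown region (known side)
    inner_band = unknown_mask & ~eroded    # just inside the unknown region
    return inner_band, outer_band

if __name__ == "__main__":
    mask = np.zeros((64, 64), dtype=bool)
    mask[20:40, 20:40] = True              # toy "unknown" prediction
    inner, outer = boundary_bands(mask)
    print(inner.sum(), outer.sum())        # pixel counts of the two bands
```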
+ + + + Sculpting Holistic 3D Representation in Contrastive Language-Image-3D Pre-training http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Sculpting_Holistic_3D_Representation_in_Contrastive_Language-Image-3D_Pre-training_CVPR_2024_paper.pdf Contrastive learning has emerged as a promising paradigm for 3D open-world understanding, i.e., aligning point cloud representation to image and text embedding space individually. In this paper, we introduce MixCon3D, a simple yet effective method aiming to sculpt holistic 3D representation in contrastive language-image-3D pre-training. In contrast to point cloud only, we develop the 3D object-level representation from complementary perspectives, e.g., multi-view rendered images together with the point cloud. Then, MixCon3D performs language-3D contrastive learning, comprehensively depicting real-world 3D objects and bolstering text alignment. Additionally, we pioneer the first thorough investigation of various training recipes for the 3D contrastive learning paradigm, building a solid baseline with improved performance. Extensive experiments conducted on three representative benchmarks reveal that our method significantly improves over the baseline, surpassing the previous state-of-the-art performance on the challenging 1156-category Objaverse-LVIS dataset by 5.7%. The versatility of MixCon3D is showcased in applications such as text-to-3D retrieval and point cloud captioning, further evidencing its efficacy in diverse scenarios. The code is available at https://github.com/UCSC-VLAA/MixCon3D. + + + + Probing Synergistic High-Order Interaction in Infrared and Visible Image Fusion http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Probing_Synergistic_High-Order_Interaction_in_Infrared_and_Visible_Image_Fusion_CVPR_2024_paper.pdf Infrared and visible image fusion aims to generate a fused image by integrating and distinguishing complementary information from multiple sources. While the cross-attention mechanism with global spatial interactions appears promising, it only captures second-order spatial interactions, neglecting higher-order interactions in both spatial and channel dimensions. This limitation hampers the exploitation of synergies between multi-modalities. To bridge this gap, we introduce a Synergistic High-order Interaction Paradigm (SHIP) designed to systematically investigate spatial fine-grained and global statistics collaborations between infrared and visible images across two fundamental dimensions: 1) Spatial dimension: we construct spatial fine-grained interactions through element-wise multiplication, mathematically equivalent to global interactions, and then foster high-order formats by iteratively aggregating and evolving complementary information, enhancing both efficiency and flexibility. 2) Channel dimension: expanding on channel interactions with first-order statistics (mean), we devise high-order channel interactions to facilitate the discernment of inter-dependencies between source images based on global statistics. Harnessing high-order interactions significantly enhances our model's ability to exploit multi-modal synergies, leading to superior performance over state-of-the-art alternatives, as shown through comprehensive experiments across various benchmarks.
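To make the "high-order interaction via element-wise multiplication" idea in the SHIP entry above concrete, here is a small NumPy sketch: feature maps from the two modalities are mixed by repeated element-wise (Hadamard) products, so each round raises the interaction order by one, and a per-channel mean acts as a first-order channel statistic. The function name, the simple aggregation rule and the mean-based gate are illustrative assumptions, not the paper's architecture.

```python
# Illustrative sketch of iterated element-wise (Hadamard) interactions between
# infrared and visible feature maps; each round multiplies the modalities in
# again, raising the interaction order (2nd-order, 3rd-order, ...).
import numpy as np

def high_order_interaction(feat_ir: np.ndarray, feat_vis: np.ndarray, order: int = 3):
    """feat_ir, feat_vis: (C, H, W) feature maps from the two modalities."""
    fused = feat_ir * feat_vis                    # 2nd-order spatial interaction
    for _ in range(order - 2):
        fused = fused * (feat_ir + feat_vis)      # aggregate and raise the order
    # simple channel interaction built from first-order statistics (per-channel mean)
    channel_gate = fused.mean(axis=(1, 2), keepdims=True)
    return fused * channel_gate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ir, vis = rng.standard_normal((2, 16, 32, 32))
    out = high_order_interaction(ir, vis, order=3)
    print(out.shape)  # (16, 32, 32)
```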
+ + + + ESCAPE: Encoding Super-keypoints for Category-Agnostic Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_ESCAPE_Encoding_Super-keypoints_for_Category-Agnostic_Pose_Estimation_CVPR_2024_paper.pdf + In this paper we tackle the task of category-agnostic pose estimation (CAPE) which aims to predict poses for objects of any category with few annotated samples. Previous works either rely on local matching between features of support and query samples or require support keypoint identifier. The former is prone to overfitting due to its sensitivity to sparse samples while the latter is impractical for the open-world nature of the task. To overcome these limitations we propose ESCAPE - a Bayesian framework that learns a prior over the features of keypoints. The prior can be expressed as a mixture of super-keypoints each being a high-level abstract keypoint that captures the statistics of semantically related keypoints from different categories. We estimate the super-keypoints from base categories and use them in adaptation to novel categories. The adaptation to an unseen category involves two steps: first we match each novel keypoint to a related super-keypoint; and second we transfer the knowledge encoded in the matched super-keypoints to the novel keypoints. For the first step we propose a learnable matching network to capture the relationship between the novel keypoints and the super-keypoints resulting in a more reliable matching. ESCAPE mitigates overfitting by directly transferring learned knowledge to novel categories while it does not use keypoint identifiers. We achieve state-of-the-art performance on the standard MP-100 benchmark. + + + + TULIP: Multi-camera 3D Precision Assessment of Parkinson's Disease + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_TULIP_Multi-camera_3D_Precision_Assessment_of_Parkinsons_Disease_CVPR_2024_paper.pdf + Parkinson's disease (PD) is a devastating movement disorder accelerating in global prevalence but a lack of precision symptom measurement has made the development of effective therapies challenging. The Unified Parkinson's Disease Rating Scale (UPDRS) is the gold-standard for assessing motor symptom severity yet its manual scoring criteria are vague and subjective resulting in coarse and noisy clinical assessments. Machine learning approaches have the potential to modernize PD symptom assessments by making them more quantitative objective and scalable. However the lack of benchmark video datasets for PD motor exams hinders model development. Here we introduce the TULIP dataset to bridge this gap. TULIP emphasizes precision and comprehensiveness comprising multi-view video recordings (6 cameras) of all 25 UPDRS motor exam components together with ratings by 3 clinical experts in a cohort of Parkinson's patients and healthy controls. The multi-view recordings enable 3D reconstructions of body movement that better capture disease signatures than more conventional 2D methods. Using the dataset we establish a baseline model for predicting UPDRS scores from 3D poses illustrating how existing diagnostics could be automated. Looking ahead TULIP could aid the development of new precision diagnostics that transcend UPDRS scores providing a deeper understanding of PD and its potential treatments. 
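As a toy illustration of the keypoint-to-super-keypoint matching step described in the ESCAPE entry above: features of novel-category keypoints are softly assigned to a small bank of super-keypoint prototypes by cosine similarity, and each novel keypoint inherits a weighted prototype as its prior. The prototype bank, cosine matching and temperature are assumptions for the sketch, not details taken from the paper.

```python
# Illustrative sketch: soft-match novel keypoint features to a bank of
# super-keypoint prototypes (cosine similarity + softmax), then use the
# matched prototypes as prior features for the novel keypoints.
import numpy as np

def match_to_super_keypoints(novel_feats, super_protos, temperature=0.1):
    """novel_feats: (K, D) features of K novel keypoints;
    super_protos: (S, D) super-keypoint prototypes."""
    a = novel_feats / np.linalg.norm(novel_feats, axis=1, keepdims=True)
    b = super_protos / np.linalg.norm(super_protos, axis=1, keepdims=True)
    sim = a @ b.T                                   # (K, S) cosine similarities
    logits = sim / temperature
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)   # soft assignment per keypoint
    priors = weights @ super_protos                 # (K, D) transferred priors
    return weights.argmax(axis=1), priors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    assignments, priors = match_to_super_keypoints(
        rng.standard_normal((17, 64)), rng.standard_normal((10, 64)))
    print(assignments.shape, priors.shape)  # (17,) (17, 64)
```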
+ + + + HybridNeRF: Efficient Neural Rendering via Adaptive Volumetric Surfaces + http://openaccess.thecvf.com//content/CVPR2024/papers/Turki_HybridNeRF_Efficient_Neural_Rendering_via_Adaptive_Volumetric_Surfaces_CVPR_2024_paper.pdf + Neural radiance fields provide state-of-the-art view synthesis quality but tend to be slow to render. One reason is that they make use of volume rendering thus requiring many samples (and model queries) per ray at render time. Although this representation is flexible and easy to optimize most real-world objects can be modeled more efficiently with surfaces instead of volumes requiring far fewer samples per ray. This observation has spurred considerable progress in surface representations such as signed distance functions but these may struggle to model semi-opaque and thin structures. We propose a method HybridNeRF that leverages the strengths of both representations by rendering most objects as surfaces while modeling the (typically) small fraction of challenging regions volumetrically. We evaluate HybridNeRF against the challenging Eyeful Tower dataset along with other commonly used view synthesis datasets. When comparing to state-of-the-art baselines including recent rasterization-based approaches we improve error rates by 15-30% while achieving real-time framerates (at least 36 FPS) for virtual-reality resolutions (2K x 2K). + + + + Motion-adaptive Separable Collaborative Filters for Blind Motion Deblurring + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Motion-adaptive_Separable_Collaborative_Filters_for_Blind_Motion_Deblurring_CVPR_2024_paper.pdf + Eliminating image blur produced by various kinds of motion has been a challenging problem. Dominant approaches rely heavily on model capacity to remove blurring by reconstructing residual from blurry observation in feature space. These practices not only prevent the capture of spatially variable motion in the real world but also ignore the tailored handling of various motions in image space. In this paper we propose a novel real-world deblurring filtering model called the Motion-adaptive Separable Collaborative (MISC) Filter. In particular we use a motion estimation network to capture motion information from neighborhoods thereby adaptively estimating spatially-variant motion flow mask kernels weights and offsets to obtain the MISC Filter. The MISC Filter first aligns the motion-induced blurring patterns to the motion middle along the predicted flow direction and then collaboratively filters the aligned image through the predicted kernels weights and offsets to generate the output. This design can handle more generalized and complex motion in a spatially differentiated manner. Furthermore we analyze the relationships between the motion estimation network and the residual reconstruction network. Extensive experiments on four widely used benchmarks demonstrate that our method provides an effective solution for real-world motion blur removal and achieves state-of-the-art performance. Code is available at https://github.com/ChengxuLiu/MISCFilter. + + + + DART: Implicit Doppler Tomography for Radar Novel View Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_DART_Implicit_Doppler_Tomography_for_Radar_Novel_View_Synthesis_CVPR_2024_paper.pdf + Simulation is an invaluable tool for radio-frequency system designers that enables rapid prototyping of various algorithms for imaging target detection classification and tracking. 
However simulating realistic radar scans is a challenging task that requires an accurate model of the scene radio frequency material properties and a corresponding radar synthesis function. Rather than specifying these models explicitly we propose DART - Doppler Aided Radar Tomography a Neural Radiance Field-inspired method which uses radar-specific physics to create a reflectance and transmittance-based rendering pipeline for range-Doppler images. We then evaluate DART by constructing a custom data collection platform and collecting a novel radar dataset together with accurate position and instantaneous velocity measurements from lidar-based localization. In comparison to state-of-the-art baselines DART synthesizes superior radar range-Doppler images from novel views across all datasets and additionally can be used to generate high quality tomographic images. + + + + Genuine Knowledge from Practice: Diffusion Test-Time Adaptation for Video Adverse Weather Removal + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Genuine_Knowledge_from_Practice_Diffusion_Test-Time_Adaptation_for_Video_Adverse_CVPR_2024_paper.pdf + Real-world vision tasks frequently suffer from the appearance of unexpected adverse weather conditions including rain haze snow and raindrops. In the last decade convolutional neural networks and vision transformers have yielded outstanding results in single-weather video removal. However due to the absence of appropriate adaptation most of them fail to generalize to other weather conditions. Although ViWS-Net is proposed to remove adverse weather conditions in videos with a single set of pre-trained weights it is seriously blinded by seen weather at train-time and degenerates when coming to unseen weather during test-time. In this work we introduce test-time adaptation into adverse weather removal in videos and propose the first framework that integrates test-time adaptation into the iterative diffusion reverse process. Specifically we devise a diffusion-based network with a novel temporal noise model to efficiently explore frame-correlated information in degraded video clips at training stage. During inference stage we introduce a proxy task named Diffusion Tubelet Self-Calibration to learn the primer distribution of test video stream and optimize the model by approximating the temporal noise model for online adaptation. Experimental results on benchmark datasets demonstrate that our Test-Time Adaptation method with Diffusion-based network(Diff-TTA) outperforms state-of-the-art methods in terms of restoring videos degraded by seen weather conditions. Its generalizable capability is validated with unseen weather conditions in synthesized and real-world videos. + + + + Gradient-based Parameter Selection for Efficient Fine-Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Gradient-based_Parameter_Selection_for_Efficient_Fine-Tuning_CVPR_2024_paper.pdf + With the growing size of pre-trained models full fine-tuning and storing all the parameters for various downstream tasks is costly and infeasible. In this paper we propose a new parameter-efficient fine-tuning method Gradient-based Parameter Selection (GPS) demonstrating that only tuning a few selected parameters from the pre-trained model while keeping the remainder of the model frozen can generate similar or better performance compared with the full model fine-tuning method. 
Different from the existing popular and state-of-the-art parameter-efficient fine-tuning approaches our method does not introduce any additional parameters and computational costs during both the training and inference stages. Another advantage is the model-agnostic and non-destructive property which eliminates the need for any other design specific to a particular model. Compared with the full fine-tuning GPS achieves 3.33% (91.78% vs. 88.45% FGVC) and 9.61% (73.1% vs. 65.57% VTAB) improvement of the accuracy with tuning only 0.36% parameters of the pre-trained model on average over 24 image classification tasks; it also demonstrates a significant improvement of 17% and 16.8% in mDice and mIoU respectively on medical image segmentation task. Moreover GPS achieves state-of-the-art performance compared with existing PEFT methods. The code will be available in https://github.com/FightingFighting/GPS.git. + + + + Domain Prompt Learning with Quaternion Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Domain_Prompt_Learning_with_Quaternion_Networks_CVPR_2024_paper.pdf + Prompt learning has emerged as an effective and data-efficient technique in large Vision-Language Models (VLMs). However when adapting VLMs to specialized domains such as remote sensing and medical imaging domain prompt learning remains underexplored. While large-scale domain-specific foundation models can help tackle this challenge their concentration on a single vision level makes it challenging to prompt both vision and language modalities. To overcome this we propose to leverage domain-specific knowledge from domain-specific foundation models to transfer the robust recognition ability of VLMs from generalized to specialized domains using quaternion networks. Specifically the proposed method involves using domain-specific vision features from domain-specific foundation models to guide the transformation of generalized contextual embeddings from the language branch into a specialized space within the quaternion networks. Moreover we present a hierarchical approach that generates vision prompt features by analyzing intermodal relationships between hierarchical language prompt features and domain-specific vision features. In this way quaternion networks can effectively mine the intermodal relationships in the specific domain facilitating domain-specific vision-language contrastive learning. Extensive experiments on domain-specific datasets show that our proposed method achieves new state-of-the-art results in prompt learning. + + + + BEHAVIOR Vision Suite: Customizable Dataset Generation via Simulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ge_BEHAVIOR_Vision_Suite_Customizable_Dataset_Generation_via_Simulation_CVPR_2024_paper.pdf + The systematic evaluation and understanding of computer vision models under varying conditions require large amounts of data with comprehensive and customized labels which real-world vision datasets rarely satisfy. While current synthetic data generators offer a promising alternative particularly for embodied AI tasks they often fall short for computer vision tasks due to low asset and rendering quality limited diversity and unrealistic physical properties. We introduce the BEHAVIOR Vision Suite (BVS) a set of tools and assets to generate fully customized synthetic data for systematic evaluation of computer vision models based on the newly developed embodied AI benchmark BEHAVIOR-1K. BVS supports a large number of adjustable parameters at the scene level (e.g. 
lighting, object placement), the object level (e.g., joint configuration, attributes such as "filled" and "folded") and the camera level (e.g., field of view, focal length). Researchers can arbitrarily vary these parameters during data generation to perform controlled experiments. We showcase three example application scenarios: systematically evaluating the robustness of models across different continuous axes of domain shift, evaluating scene understanding models on the same set of images, and training and evaluating simulation-to-real transfer for a novel vision task: unary and binary state prediction. Project website: https://behavior-vision-suite.github.io/ + + + + Gaussian-Flow: 4D Reconstruction with Dynamic 3D Gaussian Particle http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_Gaussian-Flow_4D_Reconstruction_with_Dynamic_3D_Gaussian_Particle_CVPR_2024_paper.pdf We introduce Gaussian-Flow, a novel point-based approach for fast dynamic scene reconstruction and real-time rendering from both multi-view and monocular videos. In contrast to the prevalent NeRF-based approaches hampered by slow training and rendering speeds, our approach harnesses recent advancements in point-based 3D Gaussian Splatting (3DGS). Specifically, a novel Dual-Domain Deformation Model (DDDM) is proposed to explicitly model attribute deformations of each Gaussian point, where the time-dependent residual of each attribute is captured by a polynomial fitting in the time domain and a Fourier series fitting in the frequency domain. The proposed DDDM is capable of modeling complex scene deformations across long video footage, eliminating the need to train separate 3DGS for each frame or to introduce an additional implicit neural field to model 3D dynamics. Moreover, the explicit deformation modeling for discretized Gaussian points ensures ultra-fast training and rendering of a 4D scene, which is comparable to the original 3DGS designed for static 3D reconstruction. Our proposed approach showcases a substantial efficiency improvement, achieving a 5x faster training speed compared to the per-frame 3DGS modeling. In addition, quantitative results demonstrate that the proposed Gaussian-Flow significantly outperforms previous leading methods in novel view rendering quality. + + + + DiVAS: Video and Audio Synchronization with Dynamic Frame Rates http://openaccess.thecvf.com//content/CVPR2024/papers/Fernandez-Labrador_DiVAS_Video_and_Audio_Synchronization_with_Dynamic_Frame_Rates_CVPR_2024_paper.pdf Synchronization issues between audio and video are one of the most disturbing quality defects in film production and live broadcasting. Even a discrepancy as short as 45 milliseconds can degrade the viewer's experience enough to warrant manual quality checks over entire movies. In this paper, we study the automatic discovery of such issues. Specifically, we focus on the alignment of lip movements with spoken words, targeting realistic production scenarios which can include background noise and music, intricate head poses, excessive makeup, or scenes with multiple individuals where the speaker is unknown. Our model's robustness also extends to various media specifications, including different video frame rates and audio sample rates. To address these challenges, we present a model fully based on transformers that encodes face crops or full video frames and raw audio using timestamp information, identifies the speaker, and provides highly accurate synchronization predictions much faster than previous methods.
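A minimal sketch of the dual-domain residual described in the Gaussian-Flow entry above: a per-attribute, time-dependent residual expressed as a low-order polynomial in time plus a truncated Fourier series. In the method the coefficients would be optimized per Gaussian; here they are random placeholders, and the function name and period are assumptions for illustration only.

```python
# Illustrative sketch: time-domain polynomial + frequency-domain Fourier series
# residual for one attribute of one Gaussian point.
import numpy as np

def dual_domain_residual(t, poly_coeffs, fourier_coeffs, period=1.0):
    """t: scalar or (T,) time values in [0, period];
    poly_coeffs: (P,) polynomial coefficients (low order first);
    fourier_coeffs: (F, 2) sine/cosine coefficients per frequency."""
    t = np.asarray(t, dtype=float)
    poly = sum(c * t**i for i, c in enumerate(poly_coeffs))
    fourier = np.zeros_like(t)
    for k, (a_k, b_k) in enumerate(fourier_coeffs, start=1):
        w = 2.0 * np.pi * k / period
        fourier = fourier + a_k * np.sin(w * t) + b_k * np.cos(w * t)
    return poly + fourier  # residual added to the static attribute value

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    ts = np.linspace(0.0, 1.0, 5)
    res = dual_domain_residual(ts, rng.standard_normal(3), rng.standard_normal((4, 2)))
    print(res.shape)  # (5,)
```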
+ + + + HDRFlow: Real-Time HDR Video Reconstruction with Large Motions + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_HDRFlow_Real-Time_HDR_Video_Reconstruction_with_Large_Motions_CVPR_2024_paper.pdf + Reconstructing High Dynamic Range (HDR) video from image sequences captured with alternating exposures is challenging especially in the presence of large camera or object motion. Existing methods typically align low dynamic range sequences using optical flow or attention mechanism for deghosting. However they often struggle to handle large complex motions and are computationally expensive. To address these challenges we propose a robust and efficient flow estimator tailored for real-time HDR video reconstruction named HDRFlow. HDRFlow has three novel designs: an HDR-domain alignment loss (HALoss) an efficient flow network with a multi-size large kernel (MLK) and a new HDR flow training scheme. The HALoss supervises our flow network to learn an HDR-oriented flow for accurate alignment in saturated and dark regions. The MLK can effectively model large motions at a negligible cost. In addition we incorporate synthetic data Sintel into our training dataset utilizing both its provided forward flow and backward flow generated by us to supervise our flow network enhancing our performance in large motion regions. Extensive experiments demonstrate that our HDRFlow outperforms previous methods on standard benchmarks. To the best of our knowledge HDRFlow is the first real-time HDR video reconstruction method for video sequences captured with alternating exposures capable of processing 720p resolution inputs at 25ms. + + + + SPIDeRS: Structured Polarization for Invisible Depth and Reflectance Sensing + http://openaccess.thecvf.com//content/CVPR2024/papers/Ichikawa_SPIDeRS_Structured_Polarization_for_Invisible_Depth_and_Reflectance_Sensing_CVPR_2024_paper.pdf + Can we capture shape and reflectance in stealth? Such capability would be valuable for many application domains in vision xR robotics and HCI. We introduce structured polarization for invisible depth and reflectance sensing (SPIDeRS) the first depth and reflectance sensing method using patterns of polarized light. The key idea is to modulate the angle of linear polarization (AoLP) of projected light at each pixel. The use of polarization makes it invisible and lets us recover not only depth but also directly surface normals and even reflectance. We implement SPIDeRS with a liquid crystal spatial light modulator (SLM) and a polarimetric camera. We derive a novel method for robustly extracting the projected structured polarization pattern from the polarimetric object appearance. We evaluate the effectiveness of SPIDeRS by applying it to a number of real-world objects. The results show that our method successfully reconstructs object shapes of various materials and is robust to diffuse reflection and ambient light. We also demonstrate relighting using recovered surface normals and reflectance. We believe SPIDeRS opens a new avenue of polarization use in visual sensing. + + + + SuperNormal: Neural Surface Reconstruction via Multi-View Normal Integration + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_SuperNormal_Neural_Surface_Reconstruction_via_Multi-View_Normal_Integration_CVPR_2024_paper.pdf + We present SuperNormal a fast high-fidelity approach to multi-view 3D reconstruction using surface normal maps. With a few minutes SuperNormal produces detailed surfaces on par with 3D scanners. 
We harness volume rendering to optimize a neural signed distance function (SDF) powered by multi-resolution hash encoding. To accelerate training, we propose directional finite difference and patch-based ray marching to approximate the SDF gradients numerically. While not compromising reconstruction quality, this strategy is nearly twice as efficient as analytical gradients and about three times faster than axis-aligned finite difference. Experiments on the benchmark dataset demonstrate the superiority of SuperNormal in efficiency and accuracy compared to existing multi-view photometric stereo methods. On our captured objects, SuperNormal produces more fine-grained geometry than recent neural 3D reconstruction methods. Our code is available at https://github.com/CyberAgentAILab/SuperNormal.git. + + + + ADFactory: An Effective Framework for Generalizing Optical Flow with NeRF http://openaccess.thecvf.com//content/CVPR2024/papers/Ling_ADFactory_An_Effective_Framework_for_Generalizing_Optical_Flow_with_NeRF_CVPR_2024_paper.pdf A significant challenge facing current optical flow methods is the difficulty in generalizing them well to the real world. This is mainly due to the lack of large-scale real-world datasets, and existing self-supervised methods are limited by indirect loss and occlusions, resulting in fuzzy outcomes. To address this challenge, we introduce a novel optical flow training framework: automatic data factory (ADF). ADF only requires RGB images as input to effectively train the optical flow network on the target data domain. Specifically, we use advanced NeRF technology to reconstruct scenes from photo groups collected by a monocular camera, and then calculate optical flow labels between camera pose pairs based on the rendering results. To eliminate erroneous labels caused by defects in the scene reconstructed by NeRF, we screen the generated labels from multiple aspects, such as optical flow matching accuracy, radiance field confidence and depth consistency. The filtered labels can be directly used for network supervision. Experimentally, the generalization ability of ADF on KITTI surpasses existing self-supervised optical flow and monocular scene flow algorithms. In addition, ADF achieves impressive results in real-world zero-point generalization evaluations and surpasses most supervised methods. + + + + How Far Can We Compress Instant-NGP-Based NeRF? http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_How_Far_Can_We_Compress_Instant-NGP-Based_NeRF_CVPR_2024_paper.pdf In recent years, Neural Radiance Field (NeRF) has demonstrated remarkable capabilities in representing 3D scenes. To expedite the rendering process, learnable explicit representations have been introduced for combination with implicit NeRF representation, which however results in a large storage space requirement. In this paper, we introduce the Context-based NeRF Compression (CNC) framework, which leverages highly efficient context models to provide a storage-friendly NeRF representation. Specifically, we excavate both level-wise and dimension-wise context dependencies to enable probability prediction for information entropy reduction. Additionally, we exploit hash collision and occupancy grids as strong prior knowledge for better context modeling. To the best of our knowledge, we are the first to construct and exploit context models for NeRF compression. We achieve a size reduction of 100X and 70X with improved fidelity against the baseline Instant-NGP on the Synthetic-NeRF and Tanks and Temples datasets, respectively.
Additionally we attain 86.7% and 82.3% storage size reduction against the SOTA NeRF compression method BiRF. Our code is available here: https://github.com/YihangChen-ee/CNC. + + + + GPT4Point: A Unified Framework for Point-Language Understanding and Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Qi_GPT4Point_A_Unified_Framework_for_Point-Language_Understanding_and_Generation_CVPR_2024_paper.pdf + Multimodal Large Language Models (MLLMs) have excelled in 2D image-text comprehension and image generation but their understanding of the 3D world is notably deficient limiting progress in 3D language understanding and generation. To solve this problem we introduce GPT4Point an innovative groundbreaking point-language multimodal model designed specifically for unified 3D object understanding and generation within the MLLM framework. GPT4Point as a powerful 3D MLLM seamlessly can execute a variety of point-text reference tasks such as point-cloud captioning and Q&A. Additionally GPT4Point is equipped with advanced capabilities for controllable 3D generation it can get high-quality results through a low-quality point-text feature maintaining the geometric shapes and colors. To support the expansive needs of 3D object-text pairs we develop Pyramid-XL a point-language dataset annotation engine. It constructs a large-scale database over 1M objects of varied text granularity levels from the Objaverse-XL dataset essential for training GPT4Point. A comprehensive benchmark has been proposed to evaluate 3D point-language understanding capabilities. In extensive evaluations GPT4Point has demonstrated superior performance in understanding and generation. + + + + SemCity: Semantic Scene Generation with Triplane Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_SemCity_Semantic_Scene_Generation_with_Triplane_Diffusion_CVPR_2024_paper.pdf + We present "SemCity" a 3D diffusion model for semantic scene generation in real-world outdoor environments. Most 3D diffusion models focus on generating a single object synthetic indoor scenes or synthetic outdoor scenes while the generation of real-world outdoor scenes is rarely addressed. In this paper we concentrate on generating a real-outdoor scene through learning a diffusion model on a real-world outdoor dataset. In contrast to synthetic data real-outdoor datasets often contain more empty spaces due to sensor limitations causing challenges in learning real-outdoor distributions. To address this issue we exploit a triplane representation as a proxy form of scene distributions to be learned by our diffusion model. Furthermore we propose a triplane manipulation that integrates seamlessly with our triplane diffusion model. The manipulation improves our diffusion model's applicability in a variety of downstream tasks related to outdoor scene generation such as scene inpainting scene outpainting and semantic scene completion refinements. In experimental results we demonstrate that our triplane diffusion model shows meaningful generation results compared with existing work in a real-outdoor dataset SemanticKITTI. We also show our triplane manipulation facilitates seamlessly adding removing or modifying objects within a scene. Further it also enables the expansion of scenes toward a city-level scale. Finally we evaluate our method on semantic scene completion refinements where our diffusion model enhances predictions of semantic scene completion networks by learning scene distribution. 
Our code is available at https://github.com/zoomin-lee/SemCity. + + + + Improving Semantic Correspondence with Viewpoint-Guided Spherical Maps + http://openaccess.thecvf.com//content/CVPR2024/papers/Mariotti_Improving_Semantic_Correspondence_with_Viewpoint-Guided_Spherical_Maps_CVPR_2024_paper.pdf + Recent self-supervised models produce visual features that are not only effective at encoding image-level but also pixel-level semantics. They have been reported to obtain impressive results for dense visual semantic correspondence estimation even outperforming fully-supervised methods. Nevertheless these models still fail in the presence of challenging image characteristics such as symmetries and repeated parts. To address these limitations we propose a new semantic correspondence estimation method that supplements state-of-the-art self-supervised features with 3D understanding via a weak geometric spherical prior. Compared to more involved 3D pipelines our model provides a simple and effective way of injecting informative geometric priors into the learned representation while requiring only weak viewpoint information. We also propose a new evaluation metric that better accounts for repeated part and symmetry-induced mistakes. We show that our method succeeds in distinguishing between symmetric views and repeated parts across many object categories in the challenging SPair-71k dataset and also in generalizing to previously unseen classes in the AwA dataset. + + + + Dual Memory Networks: A Versatile Adaptation Approach for Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Dual_Memory_Networks_A_Versatile_Adaptation_Approach_for_Vision-Language_Models_CVPR_2024_paper.pdf + With the emergence of pre-trained vision-language models like CLIP how to adapt them to various downstream classification tasks has garnered significant attention in recent research. The adaptation strategies can be typically categorized into three paradigms: zero-shot adaptation few-shot adaptation and the recently-proposed training-free few-shot adaptation. Most existing approaches are tailored for a specific setting and can only cater to one or two of these paradigms. In this paper we introduce a versatile adaptation approach that can effectively work under all three settings. Specifically we propose the dual memory networks that comprise dynamic and static memory components. The static memory caches training data knowledge enabling training-free few-shot adaptation while the dynamic memory preserves historical test features online during the testing process allowing for the exploration of additional data insights beyond the training set. This novel capability enhances model performance in the few-shot setting and enables model usability in the absence of training data. The two memory networks employ the same flexible memory interactive strategy which can operate in a training-free mode and can be further enhanced by incorporating learnable projection layers. Our approach is tested across 11 datasets under the three task settings. Remarkably in the zero-shot scenario it outperforms existing methods by over 3% and even shows superior results against methods utilizing external training data. Additionally our method exhibits robust performance against natural distribution shifts. 
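A minimal cache-style sketch in the spirit of the dual-memory entry above: a static memory holds few-shot training features with labels, a dynamic memory would accumulate test-time features with pseudo-labels, and a query is classified by similarity-weighted votes over both. The class and method names, cosine similarity and temperature are assumptions for illustration, not the paper's network.

```python
# Illustrative sketch of a static + dynamic feature-cache classifier.
import numpy as np

class CacheMemory:
    def __init__(self, num_classes, dim):
        self.keys = np.empty((0, dim))
        self.labels = np.empty((0,), dtype=int)
        self.num_classes = num_classes

    def add(self, feats, labels):
        self.keys = np.vstack([self.keys, feats])
        self.labels = np.concatenate([self.labels, labels])

    def logits(self, query, temperature=0.07):
        if len(self.keys) == 0:
            return np.zeros(self.num_classes)
        q = query / np.linalg.norm(query)
        k = self.keys / np.linalg.norm(self.keys, axis=1, keepdims=True)
        sims = np.exp((k @ q) / temperature)          # affinity to each cached item
        onehot = np.eye(self.num_classes)[self.labels]
        return sims @ onehot                          # similarity-weighted class votes

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    static_mem = CacheMemory(num_classes=3, dim=8)    # caches few-shot training features
    static_mem.add(rng.standard_normal((6, 8)), np.array([0, 0, 1, 1, 2, 2]))
    dynamic_mem = CacheMemory(num_classes=3, dim=8)   # would grow online with test features
    query = rng.standard_normal(8)
    scores = static_mem.logits(query) + dynamic_mem.logits(query)
    print(int(scores.argmax()))
```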
+ + + + LION: Empowering Multimodal Large Language Model with Dual-Level Visual Knowledge + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_LION_Empowering_Multimodal_Large_Language_Model_with_Dual-Level_Visual_Knowledge_CVPR_2024_paper.pdf + Multimodal Large Language Models (MLLMs) have endowed LLMs with the ability to perceive and understand multi-modal signals. However most of the existing MLLMs mainly adopt vision encoders pretrained on coarsely aligned image-text pairs leading to insufficient extraction and reasoning of visual knowledge. To address this issue we devise a dual-Level vIsual knOwledge eNhanced Multimodal Large Language Model (LION) which empowers the MLLM by injecting visual knowledge in two levels. 1) Progressive incorporation of fine-grained spatial-aware visual knowledge. We design a vision aggregator cooperated with region-level vision-language (VL) tasks to incorporate fine-grained spatial-aware visual knowledge into the MLLM. To alleviate the conflict between image-level and region-level VL tasks during incorporation we devise a dedicated stage-wise instruction-tuning strategy with mixture-of-adapters. This progressive incorporation scheme contributes to the mutual promotion between these two kinds of VL tasks. 2) Soft prompting of high-level semantic visual evidence. We facilitate the MLLM with high-level semantic visual evidence by leveraging diverse image tags. To mitigate the potential influence caused by imperfect predicted tags we propose a soft prompting method by embedding a learnable token into the tailored text instruction. Comprehensive experiments on several multi-modal benchmarks demonstrate the superiority of our model (e.g. improvement of 5% accuracy on VSR and 3% CIDEr on TextCaps over InstructBLIP 5% accuracy on RefCOCOg over Kosmos-2). + + + + Learning to Select Views for Efficient Multi-View Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Hou_Learning_to_Select_Views_for_Efficient_Multi-View_Understanding_CVPR_2024_paper.pdf + Multiple camera view (multi-view) setups have proven useful in many computer vision applications. However the high computational cost associated with multiple views creates a significant challenge for end devices with limited computational resources. In modern CPU pipelining breaks a longer job into steps and enables parallelism over sequential steps from multiple jobs. Inspired by this we study selective view pipelining for efficient multi-view understanding which breaks computation of multiple views into steps and only computes the most helpful views/steps in a parallel manner for the best efficiency. To this end we use reinforcement learning to learn a very light view selection module that analyzes the target object or scenario from initial views and selects the next-best-view for recognition or detection for pipeline computation. Experimental results on multi-view classification and detection tasks show that our approach achieves promising performance while using only 2 or 3 out of N available views significantly reducing computational costs while maintaining parallelism over GPU through selective view pipelining. + + + + Unified Entropy Optimization for Open-Set Test-Time Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Unified_Entropy_Optimization_for_Open-Set_Test-Time_Adaptation_CVPR_2024_paper.pdf + Test-time adaptation (TTA) aims at adapting a model pre-trained on the labeled source domain to the unlabeled target domain. 
Existing methods usually focus on improving TTA performance under covariate shifts while neglecting semantic shifts. In this paper we delve into a realistic open-set TTA setting where the target domain may contain samples from unknown classes. Many state-of-the-art closed-set TTA methods perform poorly when applied to open-set scenarios which can be attributed to the inaccurate estimation of data distribution and model confidence. To address these issues we propose a simple but effective framework called unified entropy optimization (UniEnt) which is capable of simultaneously adapting to covariate-shifted in-distribution (csID) data and detecting covariate-shifted out-of-distribution (csOOD) data. Specifically UniEnt first mines pseudo-csID and pseudo-csOOD samples from test data followed by entropy minimization on the pseudo-csID data and entropy maximization on the pseudo-csOOD data. Furthermore we introduce UniEnt+ to alleviate the noise caused by hard data partition leveraging sample-level confidence. Extensive experiments on CIFAR benchmarks and Tiny-ImageNet-C show the superiority of our framework. The code is available at https://github.com/gaozhengqing/UniEnt. + + + + Expandable Subspace Ensemble for Pre-Trained Model-Based Class-Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Expandable_Subspace_Ensemble_for_Pre-Trained_Model-Based_Class-Incremental_Learning_CVPR_2024_paper.pdf + Class-Incremental Learning (CIL) requires a learning system to continually learn new classes without forgetting. Despite the strong performance of Pre-Trained Models (PTMs) in CIL a critical issue persists: learning new classes often results in the overwriting of old ones. Excessive modification of the network causes forgetting while minimal adjustments lead to an inadequate fit for new classes. As a result it is desired to figure out a way of efficient model updating without harming former knowledge. In this paper we propose ExpAndable Subspace Ensemble (EASE) for PTM-based CIL. To enable model updating without conflict we train a distinct lightweight adapter module for each new task aiming to create task-specific subspaces. These adapters span a high-dimensional feature space enabling joint decision-making across multiple subspaces. As data evolves the expanding subspaces render the old class classifiers incompatible with new-stage spaces. Correspondingly we design a semantic-guided prototype complement strategy that synthesizes old classes' new features without using any old class instance. Extensive experiments on seven benchmark datasets verify EASE's state-of-the-art performance. Code is available at: https://github.com/sun-hailong/CVPR24-Ease + + + + L4D-Track: Language-to-4D Modeling Towards 6-DoF Tracking and Shape Reconstruction in 3D Point Cloud Stream + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_L4D-Track_Language-to-4D_Modeling_Towards_6-DoF_Tracking_and_Shape_Reconstruction_in_CVPR_2024_paper.pdf + 3D visual language multi-modal modeling plays an important role in actual human-computer interaction. However the inaccessibility of large-scale 3D-language pairs restricts their applicability in real-world scenarios. In this paper we aim to handle a real-time multi-task for 6-DoF pose tracking of unknown objects leveraging 3D-language pre-training scheme from a series of 3D point cloud video streams while simultaneously performing 3D shape reconstruction in current observation. 
To this end we present a generic Language-to-4D modeling paradigm termed L4D-Track that tackles zero-shot 6-DoF tracking and shape reconstruction by learning pairwise implicit 3D representation and multi-level multi-modal alignment. Our method consists of two core parts. 1) Pairwise Implicit 3D Space Representation that establishes spatial-temporal to language coherence descriptions across continuous 3D point cloud video. 2) Language-to-4D Association and Contrastive Alignment that enables multi-modal semantic connections between 3D point cloud video and language. Our method trained exclusively on the public NOCS-REAL275 dataset achieves promising results on two public benchmarks. This not only shows powerful generalization performance but also proves its remarkable capability in zero-shot inference. + + + + General Point Model Pretraining with Autoencoding and Autoregressive + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_General_Point_Model_Pretraining_with_Autoencoding_and_Autoregressive_CVPR_2024_paper.pdf + The pre-training architectures of large language models encompass various types including autoencoding models autoregressive models and encoder-decoder models. We posit that any modality can potentially benefit from a large language model as long as it undergoes vector quantization to become discrete tokens. Inspired by the General Language Model we propose a General Point Model (GPM) that seamlessly integrates autoencoding and autoregressive tasks in a point cloud transformer. This model is versatile allowing fine-tuning for downstream point cloud representation tasks as well as unconditional and conditional generation tasks. GPM enhances masked prediction in autoencoding through various forms of mask padding tasks leading to improved performance in point cloud understanding. Additionally GPM demonstrates highly competitive results in unconditional point cloud generation tasks even exhibiting the potential for conditional generation tasks by modifying the input's conditional information. Compared to models like Point-BERT MaskPoint and PointMAE our GPM achieves superior performance in point cloud understanding tasks. Furthermore the integration of autoregressive and autoencoding within the same transformer underscores its versatility across different downstream tasks. + + + + MVHumanNet: A Large-scale Dataset of Multi-view Daily Dressing Human Captures + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiong_MVHumanNet_A_Large-scale_Dataset_of_Multi-view_Daily_Dressing_Human_Captures_CVPR_2024_paper.pdf + In this era the success of large language models and text-to-image models can be attributed to the driving force of large-scale datasets. However in the realm of 3D vision while remarkable progress has been made with models trained on large-scale synthetic and real-captured object data like Objaverse and MVImgNet a similar level of progress has not been observed in the domain of human-centric tasks partially due to the lack of a large-scale human dataset. Existing datasets of high-fidelity 3D human capture continue to be mid-sized due to the significant challenges in acquiring large-scale high-quality 3D human data. To bridge this gap we present MVHumanNet a dataset that comprises multi-view human action sequences of 4500 human identities. &#13;
The primary focus of our work is on collecting human data that features a large number of diverse identities and everyday clothing using a multi-view human capture system which facilitates easily scalable data collection. Our dataset contains 9000 daily outfits 60000 motion sequences and 645 million frames with extensive annotations including human masks camera parameters 2D and 3D keypoints SMPL/SMPLX parameters and corresponding textual descriptions. To explore the potential of MVHumanNet in various 2D and 3D visual tasks we conducted pilot studies on view-consistent action recognition human NeRF reconstruction text-driven view-unconstrained human image generation as well as 2D view-unconstrained human image and 3D avatar generation. Extensive experiments demonstrate the performance improvements and effective applications enabled by the scale provided by MVHumanNet. As the current largest-scale 3D human dataset we hope that the release of MVHumanNet data with annotations will foster further innovations in the domain of 3D human-centric tasks at scale. + + + + NoiseCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions in Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Dalva_NoiseCLR_A_Contrastive_Learning_Approach_for_Unsupervised_Discovery_of_Interpretable_CVPR_2024_paper.pdf + Generative models have been very popular in the recent years for their image generation capabilities. GAN-based models are highly regarded for their disentangled latent space which is a key feature contributing to their success in controlled image editing. On the other hand diffusion models have emerged as powerful tools for generating high-quality images. However the latent space of diffusion models is not as thoroughly explored or understood. Existing methods that aim to explore the latent space of diffusion models usually relies on text prompts to pinpoint specific semantics. However this approach may be restrictive in areas such as art fashion or specialized fields like medicine where suitable text prompts might not be available or easy to conceive thus limiting the scope of existing work. In this paper we propose an unsupervised method to discover latent semantics in text-to-image diffusion models without relying on text prompts. Our method takes a small set of unlabeled images from specific domains such as faces or cats and a pre-trained diffusion model and discovers diverse semantics in unsupervised fashion using a contrastive learning objective. Moreover the learned directions can be applied simultaneously either within the same domain (such as various types of facial edits) or across different domains (such as applying cat and face edits within the same image) without interfering with each other. Our extensive experiments show that our method achieves highly disentangled edits outperforming existing approaches in both diffusion-based and GAN-based latent space editing methods. + + + + SpecNeRF: Gaussian Directional Encoding for Specular Reflections + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_SpecNeRF_Gaussian_Directional_Encoding_for_Specular_Reflections_CVPR_2024_paper.pdf + Neural radiance fields have achieved remarkable performance in modeling the appearance of 3D scenes. However existing approaches still struggle with the view-dependent appearance of glossy surfaces especially under complex lighting of indoor environments. 
Unlike existing methods which typically assume distant lighting like an environment map we propose a learnable Gaussian directional encoding to better model the view-dependent effects under near-field lighting conditions. Importantly our new directional encoding captures the spatially-varying nature of near-field lighting and emulates the behavior of prefiltered environment maps. As a result it enables the efficient evaluation of preconvolved specular color at any 3D location with varying roughness coefficients. We further introduce a data-driven geometry prior that helps alleviate the shape radiance ambiguity in reflection modeling. We show that our Gaussian directional encoding and geometry prior significantly improve the modeling of challenging specular reflections in neural radiance fields which helps decompose appearance into more physically meaningful components. + + + + Snapshot Lidar: Fourier Embedding of Amplitude and Phase for Single-Image Depth Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Friday_Snapshot_Lidar_Fourier_Embedding_of_Amplitude_and_Phase_for_Single-Image_CVPR_2024_paper.pdf + Amplitude modulated continuous-wave time-of-flight (AMCW-ToF) cameras are finding applications as flash Lidars in autonomous navigation robotics and AR/VR applications. A conventional CW-ToF camera requires illuminating the scene with a temporally varying light source and demodulating a set of quadrature measurements to recover the scene's depth and intensity. Capturing the four measurements in sequence renders the system slow invariably causing inaccuracies in depth estimates due to motion in the scene or the camera. To mitigate this problem we propose a snapshot Lidar that captures amplitude and phase simultaneously as a single time-of-flight hologram. Uniquely our approach requires minimal changes to existing CW-ToF imaging hardware. To demonstrate the efficacy of the proposed system we design and build a lab prototype and evaluate it under varying scene geometries illumination conditions and compare the reconstructed depth measurements against conventional techniques. We rigorously evaluate the robustness of our system on diverse real-world scenes to show that our technique results in a significant reduction in data bandwidth with minimal loss in reconstruction accuracy. As high-resolution CW-ToF cameras are becoming ubiquitous increasing their temporal resolution by four times enables robust real-time capture of geometries of dynamic scenes. + + + + Convolutional Prompting meets Language Models for Continual Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Roy_Convolutional_Prompting_meets_Language_Models_for_Continual_Learning_CVPR_2024_paper.pdf + Continual Learning (CL) enables machine learning models to learn from continuously shifting new training data in absence of data from old tasks. Recently pre-trained vision transformers combined with prompt tuning have shown promise for overcoming catastrophic forgetting in CL. These approaches rely on a pool of learnable prompts which can be inefficient in sharing knowledge across tasks leading to inferior performance. In addition the lack of fine-grained layer specific prompts does not allow these to fully express the strength of the prompts for CL. We address these limitations by proposing ConvPrompt a novel convolutional prompt creation mechanism that maintains layer-wise shared embeddings enabling both layer-specific learning and better concept transfer across tasks. 
The intelligent use of convolution enables us to maintain a low parameter overhead without compromising performance. We further leverage Large Language Models to generate fine-grained text descriptions of each category which are used to get task similarity and dynamically decide the number of prompts to be learned. Extensive experiments demonstrate the superiority of ConvPrompt which improves SOTA by 3% with significantly less parameter overhead. We also perform strong ablation over various modules to disentangle the importance of different components. + + + + Distilling Semantic Priors from SAM to Efficient Image Restoration Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Distilling_Semantic_Priors_from_SAM_to_Efficient_Image_Restoration_Models_CVPR_2024_paper.pdf + In image restoration (IR) leveraging semantic priors from segmentation models has been a common approach to improve performance. The recent segment anything model (SAM) has emerged as a powerful tool for extracting advanced semantic priors to enhance IR tasks. However the computational cost of SAM is prohibitive for IR compared to existing smaller IR models. The incorporation of SAM for extracting semantic priors considerably hampers the model inference efficiency. To address this issue we propose a general framework to distill SAM's semantic knowledge to boost existing IR models without interfering with their inference process. Specifically our proposed framework consists of the semantic priors fusion (SPF) scheme and the semantic priors distillation (SPD) scheme. SPF fuses two kinds of information between the restored image predicted by the original IR model and the semantic mask predicted by SAM for the refined restored image. SPD leverages a self-distillation scheme to distill the fused semantic priors and boost the performance of the original IR models. Additionally we design a semantic-guided relation (SGR) module for SPD which ensures semantic feature representation space consistency to fully distill the priors. We demonstrate the effectiveness of our framework across multiple IR models and tasks including deraining deblurring and denoising. + + + + Learning Intra-view and Cross-view Geometric Knowledge for Stereo Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Gong_Learning_Intra-view_and_Cross-view_Geometric_Knowledge_for_Stereo_Matching_CVPR_2024_paper.pdf + Geometric knowledge has been shown to be beneficial for the stereo matching task. However prior attempts to integrate geometric insights into stereo matching algorithms have largely focused on geometric knowledge from single images while crucial cross-view factors such as occlusion and matching uniqueness have been overlooked. To address this gap we propose a novel Intra-view and Cross-view Geometric knowledge learning Network (ICGNet) specifically crafted to assimilate both intra-view and cross-view geometric knowledge. ICGNet harnesses the power of interest points to serve as a channel for intra-view geometric understanding. Simultaneously it employs the correspondences among these points to capture cross-view geometric relationships. This dual incorporation empowers the proposed ICGNet to leverage both intra-view and cross-view geometric knowledge in its learning process substantially improving its ability to estimate disparities. Our extensive experiments demonstrate the superiority of the ICGNet over contemporary leading models. The code will be available at https://github.com/DFSDDDDD1199/ICGNet. &#13;
+ + + + Rethinking the Evaluation Protocol of Domain Generalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Rethinking_the_Evaluation_Protocol_of_Domain_Generalization_CVPR_2024_paper.pdf + Domain generalization aims to solve the challenge of Out-of-Distribution (OOD) generalization by leveraging common knowledge learned from multiple training domains to generalize to unseen test domains. To accurately evaluate the OOD generalization ability it is required that test data information is unavailable. However the current domain generalization protocol may still have potential test data information leakage. This paper examines the risks of test data information leakage from two aspects of the current evaluation protocol: supervised pretraining on ImageNet and oracle model selection. We propose modifications to the current protocol: we should employ self-supervised pretraining or train from scratch instead of the current supervised pretraining and we should use multiple test domains. These would result in a more precise evaluation of OOD generalization ability. We also rerun the algorithms with the modified protocol and introduce new leaderboards to encourage future research in domain generalization with a fairer comparison. + + + + Aligning Logits Generatively for Principled Black-Box Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Aligning_Logits_Generatively_for_Principled_Black-Box_Knowledge_Distillation_CVPR_2024_paper.pdf + Black-Box Knowledge Distillation (B2KD) is a formulated problem for cloud-to-edge model compression with invisible data and models hosted on the server. B2KD faces challenges such as limited Internet exchange and edge-cloud disparity of data distributions. In this paper we formalize a two-step workflow consisting of deprivatization and distillation and theoretically provide a new optimization direction from logits to cell boundary different from direct logits alignment. With its guidance we propose a new method Mapping-Emulation KD (MEKD) that distills a black-box cumbersome model into a lightweight one. Our method does not differentiate between treating soft or hard responses and consists of: 1) deprivatization: emulating the inverse mapping of the teacher function with a generator and 2) distillation: aligning low-dimensional logits of the teacher and student models by reducing the distance of high-dimensional image points. For different teacher-student pairs our method yields inspiring distillation performance on various benchmarks and outperforms the previous state-of-the-art approaches. + + + + HoloVIC: Large-scale Dataset and Benchmark for Multi-Sensor Holographic Intersection and Vehicle-Infrastructure Cooperative + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_HoloVIC_Large-scale_Dataset_and_Benchmark_for_Multi-Sensor_Holographic_Intersection_and_CVPR_2024_paper.pdf + Vehicle-to-everything (V2X) is a popular topic in the field of Autonomous Driving in recent years. Vehicle-infrastructure cooperation (VIC) has become one of the important research areas. The complexity of traffic conditions such as blind spots and occlusion greatly limits the perception capabilities of single-view roadside sensing systems. &#13;
To further enhance the accuracy of roadside perception and provide better information to the vehicle side in this paper we constructed holographic intersections with various layouts to build a large-scale multi-sensor holographic vehicle-infrastructure cooperation dataset called HoloVIC. Our dataset includes 3 different types of sensors (Camera Lidar Fisheye) and employs 4 sensor-layouts based on the different intersections. Each intersection is equipped with 6-18 sensors to capture synchronous data. Autonomous vehicles pass through these intersections to collect VIC data. In total HoloVIC contains 100k+ synchronous frames from different sensors. Additionally we annotated 3D bounding boxes based on Camera Fisheye and Lidar. We also associate the IDs of the same objects across different devices and consecutive frames in sequence. Based on HoloVIC we formulated four tasks to facilitate the development of related research. We also provide benchmarks for these tasks. + + + + LOTUS: Evasive and Resilient Backdoor Attacks through Sub-Partitioning + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_LOTUS_Evasive_and_Resilient_Backdoor_Attacks_through_Sub-Partitioning_CVPR_2024_paper.pdf + Backdoor attacks pose a significant security threat to Deep Learning applications. Existing attacks are often not evasive to established backdoor detection techniques. This susceptibility primarily stems from the fact that these attacks typically leverage a universal trigger pattern or transformation function such that the trigger can cause misclassification for any input. In response to this recent papers have introduced attacks using sample-specific invisible triggers crafted through special transformation functions. While these approaches manage to evade detection to some extent they reveal vulnerability to existing backdoor mitigation techniques. To address and enhance both evasiveness and resilience we introduce a novel backdoor attack LOTUS. Specifically it leverages a secret function to separate samples in the victim class into a set of partitions and applies unique triggers to different partitions. Furthermore LOTUS incorporates an effective trigger focusing mechanism ensuring only the trigger corresponding to the partition can induce the backdoor behavior. Extensive experimental results show that LOTUS can achieve a high attack success rate across 4 datasets and 7 model structures and effectively evade 13 backdoor detection and mitigation techniques. The code is available at https://github.com/Megum1/LOTUS. + + + + LAN: Learning to Adapt Noise for Image Denoising + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_LAN_Learning_to_Adapt_Noise_for_Image_Denoising_CVPR_2024_paper.pdf + Removing noise from images a.k.a. image denoising can be a very challenging task since the type and amount of noise can greatly vary for each image due to many factors including the camera model and capturing environment. While there have been striking improvements in image denoising with the emergence of advanced deep learning architectures and real-world datasets recent denoising networks struggle to maintain performance on images with noise that has not been seen during training. One typical approach to address the challenge would be to adapt a denoising network to a new noise distribution. Instead in this work we shift our attention to the input noise itself for adaptation rather than adapting a network. &#13;
Thus we keep a pretrained network frozen and adapt an input noise to capture the fine-grained deviations. As such we propose a new denoising algorithm dubbed Learning-to-Adapt-Noise (LAN) where a learnable noise offset is directly added to a given noisy image to bring a given input noise closer towards the noise distribution a denoising network is trained to handle. Consequently the proposed framework exhibits performance improvement on images with unseen noise displaying the potential of the proposed research direction. + + + + HUNTER: Unsupervised Human-centric 3D Detection via Transferring Knowledge from Synthetic Instances to Real Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Yao_HUNTER_Unsupervised_Human-centric_3D_Detection_via_Transferring_Knowledge_from_Synthetic_CVPR_2024_paper.pdf + Human-centric 3D scene understanding has recently drawn increasing attention driven by its critical impact on robotics. However human-centric real-life scenarios are extremely diverse and complicated and humans have intricate motions and interactions. With limited labeled data supervised methods are difficult to generalize to general scenarios hindering real-life applications. Mimicking human intelligence we propose an unsupervised 3D detection method for human-centric scenarios by transferring the knowledge from synthetic human instances to real scenes. To bridge the gap between the distinct data representations and feature distributions of synthetic models and real point clouds we introduce novel modules for effective instance-to-scene representation transfer and synthetic-to-real feature alignment. Remarkably our method exhibits superior performance compared to current state-of-the-art techniques achieving 87.8% improvement in mAP and closely approaching the performance of fully supervised methods (62.15 mAP vs. 69.02 mAP) on HuCenLife Dataset. + + + + Improving Transferable Targeted Adversarial Attacks with Model Self-Enhancement + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Improving_Transferable_Targeted_Adversarial_Attacks_with_Model_Self-Enhancement_CVPR_2024_paper.pdf + Various transfer attack methods have been proposed to evaluate the robustness of deep neural networks (DNNs). Although manifesting remarkable performance in generating untargeted adversarial perturbations existing proposals still fail to achieve high targeted transferability. In this work we discover that the adversarial perturbations' overfitting towards source models of mediocre generalization capability can hurt their targeted transferability. To address this issue we focus on enhancing the source model's generalization capability to improve its ability to conduct transferable targeted adversarial attacks. In pursuit of this goal we propose a novel model self-enhancement method that incorporates two major components: Sharpness-Aware Self-Distillation (SASD) and Weight Scaling (WS). Specifically SASD distills a fine-tuned auxiliary model which mirrors the source model's structure into the source model while flattening the source model's loss landscape. WS obtains an approximate ensemble of numerous pruned models to perform model augmentation which can be conveniently synergized with SASD to elevate the source model's generalization capability and thus improve the resultant targeted perturbations' transferability. Extensive experiments corroborate the effectiveness of the proposed method. 
Notably under the black-box setting our approach can outperform the state-of-the-art baselines by a significant margin of 12.2% on average in terms of the obtained targeted transferability. Code is available at https://github.com/g4alllf/SASD. + + + + Unsupervised Learning of Category-Level 3D Pose from Object-Centric Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Sommer_Unsupervised_Learning_of_Category-Level_3D_Pose_from_Object-Centric_Videos_CVPR_2024_paper.pdf + Category-level 3D pose estimation is a fundamentally important problem in computer vision and robotics e.g. for embodied agents or to train 3D generative models. However so far methods that estimate the category-level object pose require either large amounts of human annotations CAD models or input from RGB-D sensors. In contrast we tackle the problem of learning to estimate the category-level 3D pose only from casually taken object-centric videos without human supervision. We propose a two-step pipeline: First we introduce a multi-view alignment procedure that determines canonical camera poses across videos with a novel and robust cyclic distance formulation for geometric and appearance matching using reconstructed coarse meshes and DINOv2 features. In a second step the canonical poses and reconstructed meshes enable us to train a model for 3D pose estimation from a single image. In particular our model learns to estimate dense correspondences between images and a prototypical 3D template by predicting for each pixel in a 2D image a feature vector of the corresponding vertex in the template mesh. We demonstrate that our method outperforms all baselines at the unsupervised alignment of object-centric videos by a large margin and provides faithful and robust predictions in-the-wild on the Pascal3D+ and ObjectNet3D datasets. + + + + FutureHuman3D: Forecasting Complex Long-Term 3D Human Behavior from Video Observations + http://openaccess.thecvf.com//content/CVPR2024/papers/Diller_FutureHuman3D_Forecasting_Complex_Long-Term_3D_Human_Behavior_from_Video_Observations_CVPR_2024_paper.pdf + We present a generative approach to forecast long-term future human behavior in 3D requiring only weak supervision from readily available 2D human action data. This is a fundamental task enabling many downstream applications. The required ground-truth data is hard to capture in 3D (mocap suits expensive setups) but easy to acquire in 2D (simple RGB cameras). Thus we design our method to only require 2D RGB data at inference time while being able to generate 3D human motion sequences. We use a differentiable 2D projection scheme in an autoregressive manner for weak supervision and an adversarial loss for 3D regularization. Our method predicts long and complex human behavior sequences (e.g. cooking assembly) consisting of multiple sub-actions. We tackle this in a semantically hierarchical manner jointly predicting high-level coarse action labels together with their low-level fine-grained realizations as characteristic 3D human poses. We observe that these two action representations are coupled in nature and joint prediction benefits both action and pose forecasting. Our experiments demonstrate the complementary nature of joint action and 3D pose prediction: our joint approach outperforms each task treated individually enables robust longer-term sequence prediction and improves over alternative approaches to forecast actions and characteristic 3D poses. 
+ + + + NightCC: Nighttime Color Constancy via Adaptive Channel Masking + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_NightCC_Nighttime_Color_Constancy_via_Adaptive_Channel_Masking_CVPR_2024_paper.pdf + Nighttime conditions pose a significant challenge to color constancy due to the diversity of lighting conditions and the presence of substantial low-light noise. Existing color constancy methods struggle with nighttime scenes frequently leading to imprecise light color estimations. To tackle nighttime color constancy we propose a novel unsupervised domain adaptation approach that utilizes labeled daytime data to facilitate learning on unlabeled nighttime images. To specifically address the unique lighting conditions of nighttime and ensure the robustness of pseudo labels we propose adaptive channel masking and light uncertainty. By selectively masking channels that are less sensitive to lighting conditions adaptive channel masking directs the model to progressively focus on features less affected by variations in light colors and noise. Additionally our model leverages light uncertainty to provide a pixel-wise uncertainty estimation regarding light color prediction which helps avoid learning from incorrect labels. Our model demonstrates a significant improvement in accuracy achieving 21.5% lower Mean Angular Error (MAE) compared to the state-of-the-art method on our nighttime dataset. + + + + UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Rozenberszki_UnScene3D_Unsupervised_3D_Instance_Segmentation_for_Indoor_Scenes_CVPR_2024_paper.pdf + 3D instance segmentation is fundamental to geometric understanding of the world around us. Existing methods for instance segmentation of 3D scenes rely on supervision from expensive manual 3D annotations. We propose UnScene3D the first fully unsupervised 3D learning approach for class-agnostic 3D instance segmentation of indoor scans. UnScene3D first generates pseudo masks by leveraging self-supervised color and geometry features to find potential object regions. We operate on a basis of 3D segment primitives enabling efficient representation and learning on high-resolution 3D data. The coarse proposals are then refined through self-training our model on its predictions. Our approach improves over state-of-the-art unsupervised 3D instance segmentation methods by more than 300% Average Precision score demonstrating effective instance segmentation even in challenging cluttered 3D scenes. + + + + Nearest is Not Dearest: Towards Practical Defense against Quantization-conditioned Backdoor Attacks + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Nearest_is_Not_Dearest_Towards_Practical_Defense_against_Quantization-conditioned_Backdoor_CVPR_2024_paper.pdf + Model quantization is widely used to compress and accelerate deep neural networks. However recent studies have revealed the feasibility of weaponizing model quantization via implanting quantization-conditioned backdoors (QCBs). These special backdoors stay dormant on released full-precision models but will come into effect after standard quantization. Due to the peculiarity of QCBs existing defenses have minor effects on reducing their threats or are even infeasible. In this paper we conduct the first in-depth analysis of QCBs. We reveal that the activation of existing QCBs primarily stems from the nearest rounding operation and is closely related to the norms of neuron-wise truncation errors (i.e. 
the difference between the continuous full-precision weights and their quantized versions). Motivated by these insights we propose Error-guided Flipped Rounding with Activation Preservation (EFRAP) an effective and practical defense against QCBs. Specifically EFRAP learns a non-nearest rounding strategy with neuron-wise error norm and layer-wise activation preservation guidance flipping the rounding strategies of neurons crucial for backdoor effects but with minimal impact on clean accuracy. Extensive evaluations on benchmark datasets demonstrate that our EFRAP can defeat state-of-the-art QCB attacks under various settings. Code is available here. + + + + A Simple Recipe for Language-guided Domain Generalized Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Fahes_A_Simple_Recipe_for_Language-guided_Domain_Generalized_Segmentation_CVPR_2024_paper.pdf + Generalization to new domains not seen during training is one of the long-standing challenges in deploying neural networks in real-world applications. Existing generalization techniques either necessitate external images for augmentation and/or aim at learning invariant representations by imposing various alignment constraints. Large-scale pretraining has recently shown promising generalization capabilities along with the potential of binding different modalities. For instance the advent of vision-language models like CLIP has opened the doorway for vision models to exploit the textual modality. In this paper we introduce a simple framework for generalizing semantic segmentation networks by employing language as the source of randomization. Our recipe comprises three key ingredients: (i) the preservation of the intrinsic CLIP robustness through minimal fine-tuning (ii) language-driven local style augmentation and (iii) randomization by locally mixing the source and augmented styles during training. Extensive experiments report state-of-the-art results on various generalization benchmarks. + + + + Multiagent Multitraversal Multimodal Self-Driving: Open MARS Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Multiagent_Multitraversal_Multimodal_Self-Driving_Open_MARS_Dataset_CVPR_2024_paper.pdf + Large-scale datasets have fueled recent advancements in AI-based autonomous vehicle research. However these datasets are usually collected from a single vehicle's one-time pass of a certain location lacking multiagent interactions or repeated traversals of the same place. Such information could lead to transformative enhancements in autonomous vehicles' perception prediction and planning capabilities. To bridge this gap in collaboration with the self-driving company May Mobility we present the MARS dataset which unifies scenarios that enable MultiAgent multitraveRSal and multimodal autonomous vehicle research. More specifically MARS is collected with a fleet of autonomous vehicles driving within a certain geographical area. Each vehicle has its own route and different vehicles may appear at nearby locations. Each vehicle is equipped with a LiDAR and surround-view RGB cameras. We curate two subsets in MARS: one facilitates collaborative driving with multiple vehicles simultaneously present at the same location and the other enables memory retrospection through asynchronous traversals of the same location by multiple vehicles. We conduct experiments in place recognition and neural reconstruction. &#13;
More importantly MARS introduces new research opportunities and challenges such as multitraversal 3D reconstruction multiagent perception and unsupervised object discovery. Our data and codes can be found at https://ai4ce.github.io/MARS/. + + + + From Variance to Veracity: Unbundling and Mitigating Gradient Variance in Differentiable Bundle Adjustment Layers + http://openaccess.thecvf.com//content/CVPR2024/papers/Gurumurthy_From_Variance_to_Veracity_Unbundling_and_Mitigating_Gradient_Variance_in_CVPR_2024_paper.pdf + Various pose estimation and tracking problems in robotics can be decomposed into a correspondence estimation problem (often computed using a deep network) followed by a weighted least squares optimization problem to solve for the poses. Recent work has shown that coupling the two problems by iteratively refining one conditioned on the other's output yields SOTA results across domains. However training these models has proved challenging requiring a litany of tricks to stabilize and speed up training. In this work we take the visual odometry problem as an example and identify three plausible causes: (1) flow loss interference (2) linearization errors in the bundle adjustment (BA) layer and (3) dependence of weight gradients on the BA residual. We show how these issues result in noisy and higher variance gradients potentially leading to a slow down in training and instabilities. We then propose a simple solution to reduce the gradient variance by using the weights predicted by the network in the inner optimization loop to also weight the correspondence objective in the training problem. This helps the training objective 'focus' on the more important points thereby reducing the variance and mitigating the influence of outliers. We show that the resulting method leads to faster training and can be more flexibly trained in varying training setups without sacrificing performance. In particular we show 2-2.5x training speedups over a baseline visual odometry model we modify. + + + + Image-Text Co-Decomposition for Text-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Image-Text_Co-Decomposition_for_Text-Supervised_Semantic_Segmentation_CVPR_2024_paper.pdf + This paper addresses text-supervised semantic segmentation aiming to learn a model capable of segmenting arbitrary visual concepts within images by using only image-text pairs without dense annotations. Existing methods have demonstrated that contrastive learning on image-text pairs effectively aligns visual segments with the meanings of texts. We notice that there is a discrepancy between text alignment and semantic segmentation: A text often consists of multiple semantic concepts whereas semantic segmentation strives to create semantically homogeneous segments. To address this issue we propose a novel framework Image-Text Co-Decomposition (CoDe) where the paired image and text are jointly decomposed into a set of image regions and a set of word segments respectively and contrastive learning is developed to enforce region-word alignment. To work with a vision-language model we present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest with which more effective features can be extracted from that segment. Comprehensive experimental results demonstrate that our method performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets. 
+ + + + Orchestrate Latent Expertise: Advancing Online Continual Learning with Multi-Level Supervision and Reverse Self-Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_Orchestrate_Latent_Expertise_Advancing_Online_Continual_Learning_with_Multi-Level_Supervision_CVPR_2024_paper.pdf + To accommodate real-world dynamics artificial intelligence systems need to cope with sequentially arriving content in an online manner. Beyond regular Continual Learning (CL) attempting to address catastrophic forgetting with offline training of each task Online Continual Learning (OCL) is a more challenging yet realistic setting that performs CL in a one-pass data stream. Current OCL methods primarily rely on memory replay of old training samples. However a notable gap from CL to OCL stems from the additional overfitting-underfitting dilemma associated with the use of rehearsal buffers: the inadequate learning of new training samples (underfitting) and the repeated learning of a few old training samples (overfitting). To this end we introduce a novel approach Multi-level Online Sequential Experts (MOSE) which cultivates the model as stacked sub-experts integrating multi-level supervision and reverse self-distillation. Supervision signals across multiple stages facilitate appropriate convergence of the new task while gathering various strengths from experts by knowledge distillation mitigates the performance decline of old tasks. MOSE demonstrates remarkable efficacy in learning new samples and preserving past knowledge through multi-level experts thereby significantly advancing OCL performance over state-of-the-art baselines (e.g. up to 7.3% on Split CIFAR-100 and 6.1% on Split Tiny-ImageNet). + + + + Mitigating Object Dependencies: Improving Point Cloud Self-Supervised Learning through Object Exchange + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Mitigating_Object_Dependencies_Improving_Point_Cloud_Self-Supervised_Learning_through_Object_CVPR_2024_paper.pdf + In the realm of point cloud scene understanding particularly in indoor scenes objects are arranged following human habits resulting in objects of certain semantics being closely positioned and displaying notable inter-object correlations. This can create a tendency for neural networks to exploit these strong dependencies bypassing the individual object patterns. To address this challenge we introduce a novel self-supervised learning (SSL) strategy. Our approach leverages both object patterns and contextual cues to produce robust features. It begins with the formulation of an object-exchanging strategy where pairs of objects with comparable sizes are exchanged across different scenes effectively disentangling the strong contextual dependencies. Subsequently we introduce a context-aware feature learning strategy which encodes object patterns without relying on their specific context by aggregating object features across various scenes. Our extensive experiments demonstrate the superiority of our method over existing SSL techniques further showing its better robustness to environmental changes. Moreover we showcase the applicability of our approach by transferring pre-trained models to diverse point cloud datasets. 
+ + + + Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Geng_Visual_Anagrams_Generating_Multi-View_Optical_Illusions_with_Diffusion_Models_CVPR_2024_paper.pdf + We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation such as a flip or rotation. We propose a simple zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. During the reverse diffusion process we estimate the noise from different views of a noisy image and then combine these noise estimates together and denoise the image. A theoretical analysis suggests that this method works precisely for views that can be written as orthogonal transformations of which permutations are a subset. This leads to the idea of a visual anagram ---an image that changes appearance under some rearrangement of pixels. This includes rotations and flips but also more exotic pixel permutations such as a jigsaw rearrangement. Our approach also naturally extends to illusions with more than two views. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method. Please see our project webpage for additional visualizations and results: https://dangeng.github.io/visual_anagrams/ + + + + Leveraging Predicate and Triplet Learning for Scene Graph Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Leveraging_Predicate_and_Triplet_Learning_for_Scene_Graph_Generation_CVPR_2024_paper.pdf + Scene Graph Generation (SGG) aims to identify entities and predict the relationship triplets <subject predicate object> in visual scenes. Given the prevalence of large visual variations of subject-object pairs even in the same predicate it can be quite challenging to model and refine predicate representations directly across such pairs which is however a common strategy adopted by most existing SGG methods. We observe that visual variations within the identical triplet are relatively small and certain relation cues are shared in the same type of triplet which can potentially facilitate the relation learning in SGG. Moreover for the long-tail problem widely studied in SGG task it is also crucial to deal with the limited types and quantity of triplets in tail predicates. Accordingly in this paper we propose a Dual-granularity Relation Modeling (DRM) network to leverage fine-grained triplet cues besides the coarse-grained predicate ones. DRM utilizes contexts and semantics of predicate and triplet with Dual-granularity Constraints generating compact and balanced representations from two perspectives to facilitate relation recognition. Furthermore a Dual-granularity Knowledge Transfer (DKT) strategy is introduced to transfer variation from head predicates/triplets to tail ones aiming to enrich the pattern diversity of tail classes to alleviate the long-tail problem. Extensive experiments demonstrate the effectiveness of our method which establishes new state-of-the-art performance on Visual Genome Open Image and GQA datasets. Our code is available at https://github.com/jkli1998/DRM. 
+ + + + CoDi-2: In-Context Interleaved and Interactive Any-to-Any Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_CoDi-2_In-Context_Interleaved_and_Interactive_Any-to-Any_Generation_CVPR_2024_paper.pdf + We present CoDi-2 a Multimodal Large Language Model (MLLM) for learning in-context interleaved multimodal representations. By aligning modalities with language for both encoding and generation CoDi-2 empowers Large Language Models (LLMs) to understand modality-interleaved instructions and in-context examples and autoregressively generate grounded and coherent multimodal outputs in an any-to-any input-output modality paradigm. To train CoDi-2 we build a large-scale generation dataset encompassing in-context multimodal instructions across text vision and audio. CoDi-2 demonstrates a wide range of zero-shot and few-shot capabilities for tasks like editing exemplar learning composition reasoning etc. CoDi-2 surpasses previous domain-specific models on tasks such as subject-driven image generation vision transformation and audio editing and showcases a significant advancement for integrating diverse multimodal tasks with sequential generation. + + + + Tuning Stable Rank Shrinkage: Aiming at the Overlooked Structural Risk in Fine-tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Shen_Tuning_Stable_Rank_Shrinkage_Aiming_at_the_Overlooked_Structural_Risk_CVPR_2024_paper.pdf + Existing fine-tuning methods for computer vision tasks primarily focus on re-weighting the knowledge learned from the source domain during pre-training. They aim to retain beneficial knowledge for the target domain while suppressing unfavorable knowledge. During the pre-training and fine-tuning stages there is a notable disparity in the data scale. Consequently it is theoretically necessary to employ a model with reduced complexity to mitigate the potential structural risk. However our empirical investigation in this paper reveals that models fine-tuned using existing methods still manifest a high level of model complexity inherited from the pre-training stage leading to a suboptimal stability and generalization ability. This phenomenon indicates an issue that has been overlooked in fine-tuning: Structural Risk Minimization. To address this issue caused by data scale disparity during the fine-tuning stage we propose a simple yet effective approach called Tuning Stable Rank Shrinkage (TSRS). TSRS mitigates the structural risk during the fine-tuning stage by constraining the noise sensitivity of the target model based on stable rank theories. Through extensive experiments we demonstrate that incorporating TSRS into fine-tuning methods leads to improved generalization ability on various tasks regardless of whether the neural networks are based on convolution or transformer architectures. Additionally empirical analysis reveals that TSRS enhances the robustness convexity and smoothness of the loss landscapes in fine-tuned models. Code is available at https://github.com/WitGotFlg/TSRS. + + + + Towards Automatic Power Battery Detection: New Challenge Benchmark Dataset and Baseline + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Towards_Automatic_Power_Battery_Detection_New_Challenge_Benchmark_Dataset_and_CVPR_2024_paper.pdf + We conduct a comprehensive study on a new task named power battery detection (PBD) which aims to localize the dense cathode and anode plates endpoints from X-ray images to evaluate the quality of power batteries. 
Existing manufacturers usually rely on human eye observation to complete PBD which makes it difficult to balance the accuracy and efficiency of detection. To address this issue and drive more attention into this meaningful task we first elaborately collect a dataset called X-ray PBD which has 1500 diverse X-ray images selected from thousands of power batteries of 5 manufacturers with 7 different types of visual interference. Then we propose a novel segmentation-based solution for PBD termed multi-dimensional collaborative network (MDCNet). With the help of line and counting predictors the representation of the point segmentation branch can be improved in both semantic and detail aspects. Besides we design an effective distance-adaptive mask generation strategy which can alleviate the visual challenge caused by the inconsistent distribution density of plates to provide MDCNet with stable supervision. Without any bells and whistles our segmentation-based MDCNet consistently outperforms various other corner detection crowd counting and general/tiny object detection-based solutions making it a strong baseline that can help facilitate future research in PBD. Finally we share some potential difficulties and directions for future research. The source code and datasets will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD. + + + + AVFF: Audio-Visual Feature Fusion for Video Deepfake Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Oorloff_AVFF_Audio-Visual_Feature_Fusion_for_Video_Deepfake_Detection_CVPR_2024_paper.pdf + With the rapid growth in deepfake video content we require improved and generalizable methods to detect them. Most existing detection methods either use uni-modal cues or rely on supervised training to capture the dissonance between the audio and visual modalities. While the former disregards the audio-visual correspondences entirely the latter predominantly focuses on discerning audio-visual cues within the training corpus thereby potentially overlooking correspondences that can help detect unseen deepfakes. We present Audio-Visual Feature Fusion (AVFF) a two-stage cross-modal learning method that explicitly captures the correspondence between the audio and visual modalities for improved deepfake detection. The first stage pursues representation learning via self-supervision on real videos to capture the intrinsic audio-visual correspondences. To extract rich cross-modal representations we use contrastive learning and autoencoding objectives and introduce a novel audio-visual complementary masking and feature fusion strategy. The learned representations are tuned in the second stage where deepfake classification is pursued via supervised learning on both real and fake videos. Extensive experiments and analysis suggest that our novel representation learning paradigm is highly discriminative in nature. We report 98.6% accuracy and 99.1% AUC on the FakeAVCeleb dataset outperforming the current audio-visual state-of-the-art by 14.9% and 9.9% respectively. + + + + X-MIC: Cross-Modal Instance Conditioning for Egocentric Action Generalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Kukleva_X-MIC_Cross-Modal_Instance_Conditioning_for_Egocentric_Action_Generalization_CVPR_2024_paper.pdf + Lately there has been growing interest in adapting vision-language models (VLMs) to image and third-person video classification due to their success in zero-shot recognition. &#13;
However the adaptation of these models to egocentric videos has been largely unexplored. To address this gap we propose a simple yet effective cross-modal adaptation framework which we call X-MIC. Using a video adapter our pipeline learns to align frozen text embeddings to each egocentric video directly in the shared embedding space. Our novel adapter architecture retains and improves generalization of the pre-trained VLMs by disentangling learnable temporal modeling and frozen visual encoder. This results in an enhanced alignment of text embeddings to each egocentric video leading to a significant improvement in cross-dataset generalization. We evaluate our approach on the Epic-Kitchens Ego4D and EGTEA datasets for fine-grained cross-dataset action generalization demonstrating the effectiveness of our method. + + + + AV-RIR: Audio-Visual Room Impulse Response Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ratnarajah_AV-RIR_Audio-Visual_Room_Impulse_Response_Estimation_CVPR_2024_paper.pdf + Accurate estimation of Room Impulse Response (RIR) which captures an environment's acoustic properties is important for speech processing and AR/VR applications. We propose AV-RIR a novel multi-modal multi-task learning approach to accurately estimate the RIR from a given reverberant speech signal and the visual cues of its corresponding environment. AV-RIR builds on a novel neural codec-based architecture that effectively captures environment geometry and materials properties and solves speech dereverberation as an auxiliary task by using multi-task learning. We also propose Geo-Mat features that augment material information into visual cues and CRIP that improves late reverberation components in the estimated RIR via image-to-RIR retrieval by 86%. Empirical results show that AV-RIR quantitatively outperforms previous audio-only and visual-only approaches by achieving 36% - 63% improvement across various acoustic metrics in RIR estimation. Additionally it also achieves higher preference scores in human evaluation. As an auxiliary benefit dereverbed speech from AV-RIR shows competitive performance with the state-of-the-art in various spoken language processing tasks and outperforms reverberation time error score in the real-world AVSpeech dataset. Qualitative examples of both synthesized reverberant speech and enhanced speech are available online https://www.youtube.com/watch?v=tTsKhviukAE. + + + + Dual-Consistency Model Inversion for Non-Exemplar Class Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Qiu_Dual-Consistency_Model_Inversion_for_Non-Exemplar_Class_Incremental_Learning_CVPR_2024_paper.pdf + Non-exemplar class incremental learning (NECIL) aims to continuously assimilate new knowledge without forgetting previously acquired ones when historical data are unavailable. One of the generative NECIL methods is to invert the images of old classes for joint training. However these synthetic images suffer significant domain shifts compared with real data hampering the recognition of old classes. In this paper we present a novel method termed Dual-Consistency Model Inversion (DCMI) to generate better synthetic samples of old classes through two pivotal consistency alignments: (1) the semantic consistency between the synthetic images and the corresponding prototypes and (2) domain consistency between synthetic and real images of new classes. 
Besides we introduce Prototypical Routing (PR) to provide task-prior information and generate unbiased and accurate predictions. Our comprehensive experiments across diverse datasets consistently showcase the superiority of our method over previous state-of-the-art approaches. + + + + Not All Prompts Are Secure: A Switchable Backdoor Attack Against Pre-trained Vision Transfomers + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Not_All_Prompts_Are_Secure_A_Switchable_Backdoor_Attack_Against_CVPR_2024_paper.pdf + Given the power of vision transformers a new learning paradigm pre-training and then prompting makes it more efficient and effective to address downstream visual recognition tasks. In this paper we identify a novel security threat towards such a paradigm from the perspective of backdoor attacks. Specifically an extra prompt token called the switch token in this work can turn the backdoor mode on i.e. converting a benign model into a backdoored one. Once under the backdoor mode a specific trigger can force the model to predict a target class. It poses a severe risk to the users of cloud APIs since the malicious behavior cannot be activated or detected under the benign mode thus making the attack very stealthy. To attack a pre-trained model our proposed attack named SWARM learns a trigger and prompt tokens including a switch token. They are optimized with the clean loss which encourages the model to behave normally even when the trigger is present and the backdoor loss which ensures the backdoor can be activated by the trigger when the switch is on. Besides we utilize cross-mode feature distillation to reduce the effect of the switch token on clean samples. The experiments on diverse visual recognition tasks confirm the success of our switchable backdoor attack i.e. achieving a 95%+ attack success rate while being hard to detect and remove. Our code is available at https://github.com/20000yshust/SWARM. + + + + PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_PortraitBooth_A_Versatile_Portrait_Model_for_Fast_Identity-preserved_Personalization_CVPR_2024_paper.pdf + Recent advancements in personalized image generation using diffusion models have been noteworthy. However existing methods suffer from inefficiencies due to the requirement for subject-specific fine-tuning. This computationally intensive process hinders efficient deployment limiting practical usability. Moreover these methods often grapple with identity distortion and limited expression diversity. In light of these challenges we propose PortraitBooth an innovative approach designed for high efficiency robust identity preservation and expression-editable text-to-image generation without the need for fine-tuning. PortraitBooth leverages subject embeddings from a face recognition model for personalized image generation without fine-tuning. It eliminates computational overhead and mitigates identity distortion. The introduced dynamic identity preservation strategy further ensures close resemblance to the original image identity. Moreover PortraitBooth incorporates emotion-aware cross-attention control for diverse facial expressions in generated images supporting text-driven expression editing. Its scalability enables efficient and high-quality image creation including multi-subject generation. &#13;
Extensive results demonstrate superior performance over other state-of-the-art methods in both single and multiple image generation scenarios. + + + + Learn from View Correlation: An Anchor Enhancement Strategy for Multi-view Clustering + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Learn_from_View_Correlation_An_Anchor_Enhancement_Strategy_for_Multi-view_CVPR_2024_paper.pdf + In recent years anchor-based methods have achieved promising progress in multi-view clustering. The performances of these methods are significantly affected by the quality of the anchors. However the anchors generated by previous works solely rely on single-view information ignoring the correlation among different views. In particular we observe that similar patterns are more likely to exist between similar views so such correlation information can be leveraged to enhance the quality of the anchors which is also omitted. To this end we propose a novel plug-and-play anchor enhancement strategy through view correlation for multi-view clustering. Specifically we construct a view graph based on aligned initial anchor graphs to explore inter-view correlations. By learning from view correlation we enhance the anchors of the current view using the relationships between anchors and samples on neighboring views thereby narrowing the spatial distribution of anchors on similar views. Experimental results on seven datasets demonstrate the superiority of our proposed method over other existing methods. Furthermore extensive comparative experiments validate the effectiveness of the proposed anchor enhancement module when applied to various anchor-based methods. + + + + APSeg: Auto-Prompt Network for Cross-Domain Few-Shot Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/He_APSeg_Auto-Prompt_Network_for_Cross-Domain_Few-Shot_Semantic_Segmentation_CVPR_2024_paper.pdf + Few-shot semantic segmentation (FSS) endeavors to segment unseen classes with only a few labeled samples. Current FSS methods are commonly built on the assumption that their training and application scenarios share similar domains and their performances degrade significantly while applied to a distinct domain. To this end we propose to leverage the cutting-edge foundation model the Segment Anything Model (SAM) for generalization enhancement. The SAM however performs unsatisfactorily on domains that are distinct from its training data which primarily comprise natural scene images and it does not support automatic segmentation of specific semantics due to its interactive prompting mechanism. In our work we introduce APSeg a novel auto-prompt network for cross-domain few-shot semantic segmentation (CD-FSS) which is designed to be auto-prompted for guiding cross-domain segmentation. Specifically we propose a Dual Prototype Anchor Transformation (DPAT) module that fuses pseudo query prototypes extracted based on cycle-consistency with support prototypes allowing features to be transformed into a more stable domain-agnostic space. Additionally a Meta Prompt (MPG) module is introduced to automatically generate prompt embeddings eliminating the need for manual visual prompts. We build an efficient model which can be applied directly to target domains without fine-tuning. Extensive experiments on four cross-domain datasets show that our model outperforms the state-of-the-art CD-FSS method by 5.24% and 3.10% in average accuracy on 1-shot and 5-shot settings respectively. 
+ + + + Enhancing Visual Continual Learning with Language-Guided Supervision + http://openaccess.thecvf.com//content/CVPR2024/papers/Ni_Enhancing_Visual_Continual_Learning_with_Language-Guided_Supervision_CVPR_2024_paper.pdf + Continual learning (CL) aims to empower models to learn new tasks without forgetting previously acquired knowledge. Most prior works concentrate on the techniques of architectures, replay data, regularization, etc. However, the category name of each class is largely neglected. Existing methods commonly utilize the one-hot labels and randomly initialize the classifier head. We argue that the scarce semantic information conveyed by the one-hot labels hampers effective knowledge transfer across tasks. In this paper, we revisit the role of the classifier head within the CL paradigm and replace the classifier with semantic knowledge from pretrained language models (PLMs). Specifically, we use PLMs to generate semantic targets for each class, which are frozen and serve as supervision signals during training. Such targets fully consider the semantic correlation between all classes across tasks. Empirical studies show that our approach mitigates forgetting by alleviating representation drifting and facilitating knowledge transfer across tasks. The proposed method is simple to implement and can seamlessly be plugged into existing methods with negligible adjustments. Extensive experiments based on eleven mainstream baselines demonstrate the effectiveness and generalizability of our approach to various protocols. For example, under the class-incremental learning setting on ImageNet-100, our method significantly improves the Top-1 accuracy by 3.2% to 6.1% while reducing the forgetting rate by 2.6% to 13.1%. + + + + Revamping Federated Learning Security from a Defender's Perspective: A Unified Defense with Homomorphic Encrypted Data Space + http://openaccess.thecvf.com//content/CVPR2024/papers/Kumar_Revamping_Federated_Learning_Security_from_a_Defenders_Perspective_A_Unified_CVPR_2024_paper.pdf + Federated Learning (FL) enables clients to collaborate on training a shared machine learning model without exposing individual private data. Nonetheless, FL remains susceptible to utility and privacy attacks, notably evasion, data poisoning, and model inversion attacks, compromising the system's efficiency and data privacy. Existing FL defenses are often specialized to a particular single attack, lacking generality and a comprehensive defender's perspective. To address these challenges, we introduce Federated Cryptography Defense (FCD), a unified single framework aligning with the defender's perspective. FCD employs row-wise transposition cipher-based data encryption with a secret key to counter evasion, black-box data poisoning, and model inversion attacks. The crux of FCD lies in transferring the entire learning process into an encrypted data space and using a novel distillation loss guided by the Kullback-Leibler (KL) divergence. This measure compares the probability distributions of the local pretrained teacher model's predictions on normal data and the local student model's predictions on the same data in FCD's encrypted form. By working within this encrypted space, FCD eliminates the need for decryption at the server, resulting in reduced computational complexity. We demonstrate the practical feasibility of FCD and apply it to defend against the evasion utility attack on benchmark datasets (GTSRB, KBTS, CIFAR10, and EMNIST).
We further extend FCD for defending against model inversion attack in split FL on the CIFAR100 dataset. Our experiments across the diverse attack and FL settings demonstrate practical feasibility and robustness against utility evasion (impact >30) and privacy attacks (MSE >73) compared to the second best method. + + + + A Dynamic Kernel Prior Model for Unsupervised Blind Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_A_Dynamic_Kernel_Prior_Model_for_Unsupervised_Blind_Image_Super-Resolution_CVPR_2024_paper.pdf + Deep learning-based methods have achieved significant successes on solving the blind super-resolution (BSR) problem. However most of them request supervised pre-training on labelled datasets. This paper proposes an unsupervised kernel estimation model named dynamic kernel prior (DKP) to realize an unsupervised and pre-training-free learning-based algorithm for solving the BSR problem. DKP can adaptively learn dynamic kernel priors to realize real-time kernel estimation and thereby enables superior HR image restoration performances. This is achieved by a Markov chain Monte Carlo sampling process on random kernel distributions. The learned kernel prior is then assigned to optimize a blur kernel estimation network which entails a network-based Langevin dynamic optimization strategy. These two techniques ensure the accuracy of the kernel estimation. DKP can be easily used to replace the kernel estimation models in the existing methods such as Double-DIP and FKP-DIP or be added to the off-the-shelf image restoration model such as diffusion model. In this paper we incorporate our DKP model with DIP and diffusion model referring to DIP-DKP and Diff-DKP for validations. Extensive simulations on Gaussian and motion kernel scenarios demonstrate that the proposed DKP model can significantly improve the kernel estimation with comparable runtime and memory usage leading to state-of-the-art BSR results. The code is available at https://github.com/XYLGroup/DKP. + + + + Mitigating Noisy Correspondence by Geometrical Structure Consistency Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Mitigating_Noisy_Correspondence_by_Geometrical_Structure_Consistency_Learning_CVPR_2024_paper.pdf + Noisy correspondence that refers to mismatches in cross-modal data pairs is prevalent on human-annotated or web-crawled datasets. Prior approaches to leverage such data mainly consider the application of uni-modal noisy label learning without amending the impact on both cross-modal and intra-modal geometrical structures in multimodal learning. Actually we find that both structures are effective to discriminate noisy correspondence through structural differences when being well-established. Inspired by this observation we introduce a Geometrical Structure Consistency (GSC) method to infer the true correspondence. Specifically GSC ensures the preservation of geometrical structures within and between modalities allowing for the accurate discrimination of noisy samples based on structural differences. Utilizing these inferred true correspondence labels GSC refines the learning of geometrical structures by filtering out the noisy samples. Experiments across four cross-modal datasets confirm that GSC effectively identifies noisy samples and significantly outperforms the current leading methods. Source code is available at https://github.com/MediaBrain-SJTU/GSC. 
+ + + + DVMNet: Computing Relative Pose for Unseen Objects Beyond Hypotheses + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_DVMNet_Computing_Relative_Pose_for_Unseen_Objects_Beyond_Hypotheses_CVPR_2024_paper.pdf + Determining the relative pose of an object between two images is pivotal to the success of generalizable object pose estimation. Existing approaches typically approximate the continuous pose representation with a large number of discrete pose hypotheses, which incurs a computationally expensive process of scoring each hypothesis at test time. By contrast, we present a Deep Voxel Matching Network (DVMNet) that eliminates the need for pose hypotheses and computes the relative object pose in a single pass. To this end, we map the two input RGB images, reference and query, to their respective voxelized 3D representations. We then pass the resulting voxels through a pose estimation module, where the voxels are aligned and the pose is computed in an end-to-end fashion by solving a least-squares problem. To enhance robustness, we introduce a weighted closest voxel algorithm capable of mitigating the impact of noisy voxels. We conduct extensive experiments on the CO3D, LINEMOD, and Objaverse datasets, demonstrating that our method delivers more accurate relative pose estimates for novel objects at a lower computational cost compared to state-of-the-art methods. Our code is released at: https://github.com/sailor-z/DVMNet. + + + + MuRF: Multi-Baseline Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_MuRF_Multi-Baseline_Radiance_Fields_CVPR_2024_paper.pdf + We present Multi-Baseline Radiance Fields (MuRF), a general feed-forward approach to solving sparse view synthesis under multiple different baseline settings (small and large baselines, and different numbers of input views). To render a target novel view, we discretize the 3D space into planes parallel to the target image plane and accordingly construct a target view frustum volume. Such a target volume representation is spatially aligned with the target view, which effectively aggregates relevant information from the input views for high-quality rendering. It also facilitates subsequent radiance field regression with a convolutional network thanks to its axis-aligned nature. The 3D context modeled by the convolutional network enables our method to synthesize sharper scene structures than prior works. Our MuRF achieves state-of-the-art performance across multiple different baseline settings and diverse scenarios ranging from simple objects (DTU) to complex indoor and outdoor scenes (RealEstate10K and LLFF). We also show promising zero-shot generalization abilities on the Mip-NeRF 360 dataset, demonstrating the general applicability of MuRF. + + + + Flattening the Parent Bias: Hierarchical Semantic Segmentation in the Poincare Ball + http://openaccess.thecvf.com//content/CVPR2024/papers/Weber_Flattening_the_Parent_Bias_Hierarchical_Semantic_Segmentation_in_the_Poincare_CVPR_2024_paper.pdf + Hierarchy is a natural representation of semantic taxonomies, including the ones routinely used in image segmentation. Indeed, recent work on semantic segmentation reports improved accuracy from supervised training leveraging hierarchical label structures. Encouraged by these results, we revisit the fundamental assumptions behind that work. We postulate and then empirically verify that the reasons for the observed improvement in segmentation accuracy may be entirely unrelated to the use of the semantic hierarchy.
To demonstrate this we design a range of cross-domain experiments with a representative hierarchical approach. We find that on the new testing domains a flat (non-hierarchical) segmentation network in which the parents are inferred from the children has superior segmentation accuracy to the hierarchical approach across the board. Complementing these findings and inspired by the intrinsic properties of hyperbolic spaces we study a more principled approach to hierarchical segmentation using the Poincare ball model. The hyperbolic representation largely outperforms the previous (Euclidean) hierarchical approach as well and is on par with our flat Euclidean baseline in terms of segmentation accuracy. However it additionally exhibits surprisingly strong calibration quality of the parent nodes in the semantic hierarchy especially on the more challenging domains. Our combined analysis suggests that the established practice of hierarchical segmentation may be limited to in-domain settings whereas flat classifiers generalize substantially better especially if they are modeled in the hyperbolic space. + + + + MVBench: A Comprehensive Multi-modal Video Understanding Benchmark + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_MVBench_A_Comprehensive_Multi-modal_Video_Understanding_Benchmark_CVPR_2024_paper.pdf + With the rapid development of Multi-modal Large Language Models (MLLMs) a number of diagnostic benchmarks have recently emerged to evaluate the comprehension capabilities of these models. However most benchmarks predominantly assess spatial understanding in the static image tasks while overlooking temporal understanding in the dynamic video tasks. To alleviate this issue we introduce a comprehensive Multi-modal Video understanding Benchmark namely MVBench which covers 20 challenging video tasks that cannot be effectively solved with a single frame. Specifically we first introduce a novel static-to-dynamic method to define these temporal-related tasks. By transforming various static tasks into dynamic ones we enable the systematic generation of video tasks that require a broad spectrum of temporal skills ranging from perception to cognition. Then guided by the task definition we automatically convert public video annotations into multiple-choice QA to evaluate each task. On one hand such a distinct paradigm allows us to build MVBench efficiently without much manual intervention. On the other hand it guarantees evaluation fairness with ground-truth video annotations avoiding the biased scoring of LLMs. Moreover we further develop a robust video MLLM baseline i.e. VideoChat2 by progressive multi-modal training with diverse instruction-tuning data. The extensive results on our MVBench reveal that the existing MLLMs are far from satisfactory in temporal understanding while our VideoChat2 largely surpasses these leading models by over 15% on MVBench. + + + + An Aggregation-Free Federated Learning for Tackling Data Heterogeneity + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_An_Aggregation-Free_Federated_Learning_for_Tackling_Data_Heterogeneity_CVPR_2024_paper.pdf + The performance of Federated Learning (FL) hinges on the effectiveness of utilizing knowledge from distributed datasets. Traditional FL methods adopt an aggregate-then-adapt framework where clients update local models based on a global model aggregated by the server from the previous training round. 
This process can cause client drift, especially with significant cross-client data heterogeneity, impacting model performance and the convergence of the FL algorithm. To address these challenges, we introduce FedAF, a novel aggregation-free FL algorithm. In this framework, clients collaboratively learn condensed data by leveraging peer knowledge; the server subsequently trains the global model using the condensed data and soft labels received from the clients. FedAF inherently avoids the issue of client drift, enhances the quality of condensed data amid notable data heterogeneity, and improves the global model performance. Extensive numerical studies on several popular benchmark datasets show FedAF surpasses various state-of-the-art FL algorithms in handling label-skew and feature-skew data heterogeneity, leading to superior global model accuracy and faster convergence. + + + + Hierarchical Intra-modal Correlation Learning for Label-free 3D Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kang_Hierarchical_Intra-modal_Correlation_Learning_for_Label-free_3D_Semantic_Segmentation_CVPR_2024_paper.pdf + Recent methods for label-free 3D semantic segmentation aim to assist 3D model training by leveraging the open-world recognition ability of pre-trained vision language models. However, these methods usually suffer from inconsistent and noisy pseudo-labels provided by the vision language models. To address this issue, we present a hierarchical intra-modal correlation learning framework that captures visual and geometric correlations in 3D scenes at three levels: intra-set, intra-scene, and inter-scene, to help learn more compact 3D representations. We refine pseudo-labels using intra-set correlations within each geometric consistency set and align features of visually and geometrically similar points using intra-scene and inter-scene correlation learning. We also introduce a feedback mechanism to distill the correlation learning capability into the 3D model. Experiments on both indoor and outdoor datasets show the superiority of our method. We achieve a state-of-the-art 36.6% mIoU on the ScanNet dataset and a 23.0% mIoU on the nuScenes dataset, with improvements of 7.8% mIoU and 2.2% mIoU compared with the previous SOTA. We also provide theoretical analysis and qualitative visualization results to discuss the mechanism and conduct thorough ablation studies to support the effectiveness of our framework. + + + + DiffSal: Joint Audio and Video Learning for Diffusion Saliency Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiong_DiffSal_Joint_Audio_and_Video_Learning_for_Diffusion_Saliency_Prediction_CVPR_2024_paper.pdf + Audio-visual saliency prediction can draw support from diverse modality complements, but further performance enhancement is still challenged by customized architectures as well as task-specific loss functions. In recent studies, denoising diffusion models have shown more promise in unifying task frameworks owing to their inherent generalization ability. Following this motivation, a novel Diffusion architecture for generalized audio-visual Saliency prediction (DiffSal) is proposed in this work, which formulates the prediction problem as a conditional generative task of the saliency map by utilizing input audio and video as the conditions. Based on the spatio-temporal audio-visual features, an extra network, Saliency-UNet, is designed to perform multi-modal attention modulation for progressive refinement of the ground-truth saliency map from the noisy map.
Extensive experiments demonstrate that the proposed DiffSal can achieve excellent performance across six challenging audio-visual benchmarks, with an average relative improvement of 6.3% over the previous state-of-the-art results across six metrics. + + + + Revisiting Single Image Reflection Removal In the Wild + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Revisiting_Single_Image_Reflection_Removal_In_the_Wild_CVPR_2024_paper.pdf + This research focuses on the issue of single-image reflection removal (SIRR) in real-world conditions, examining it from two angles: the collection pipeline of real reflection pairs and the perception of real reflection locations. We devise an advanced reflection collection pipeline that is highly adaptable to a wide range of real-world reflection scenarios and incurs reduced costs in collecting large-scale aligned reflection pairs. In the process, we develop a large-scale, high-quality reflection dataset named Reflection Removal in the Wild (RRW). RRW contains over 14950 high-resolution real-world reflection pairs, a dataset forty-five times larger than its predecessors. Regarding the perception of reflection locations, we identify that numerous virtual reflection objects visible in reflection images are not present in the corresponding ground-truth images. This observation, drawn from the aligned pairs, leads us to conceive the Maximum Reflection Filter (MaxRF). The MaxRF can accurately and explicitly characterize reflection locations from pairs of images. Building upon this, we design a reflection location-aware cascaded framework specifically tailored for SIRR. Powered by these innovative techniques, our solution achieves superior performance over current leading methods across multiple real-world benchmarks. Codes and datasets are available at https://github.com/zhuyr97/Reflection_RemoVal_CVPR2024. + + + + SinSR: Diffusion-Based Image Super-Resolution in a Single Step + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_SinSR_Diffusion-Based_Image_Super-Resolution_in_a_Single_Step_CVPR_2024_paper.pdf + While super-resolution (SR) methods based on diffusion models exhibit promising results, their practical application is hindered by the substantial number of required inference steps. Recent methods utilize the degraded images in the initial state, thereby shortening the Markov chain. Nevertheless, these solutions either rely on a precise formulation of the degradation process or still necessitate a relatively lengthy generation path (e.g. 15 iterations). To enhance inference speed, we propose a simple yet effective method for achieving single-step SR generation, named SinSR. Specifically, we first derive a deterministic sampling process from the most recent state-of-the-art (SOTA) method for accelerating diffusion-based SR. This allows the mapping between the input random noise and the generated high-resolution image to be obtained in a reduced and acceptable number of inference steps during training. We show that this deterministic mapping can be distilled into a student model that performs SR within only one inference step. Additionally, we propose a novel consistency-preserving loss to simultaneously leverage the ground-truth image during the distillation process, ensuring that the performance of the student model is not solely bound by the feature manifold of the teacher model, resulting in further performance improvement.
Extensive experiments conducted on synthetic and real-world datasets demonstrate that the proposed method can achieve comparable or even superior performance to both previous SOTA methods and the teacher model in just one sampling step, resulting in a remarkable speedup of up to x10 for inference. Our code will be released at https://github.com/wyf0912/SinSR/. + + + + Systematic Comparison of Semi-supervised and Self-supervised Learning for Medical Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Systematic_Comparison_of_Semi-supervised_and_Self-supervised_Learning_for_Medical_Image_CVPR_2024_paper.pdf + In typical medical image classification problems, labeled data is scarce while unlabeled data is more available. Semi-supervised learning and self-supervised learning are two different research directions that can improve accuracy by learning from extra unlabeled data. Recent methods from both directions have reported significant gains on traditional benchmarks. Yet past benchmarks do not focus on medical tasks and rarely compare self- and semi-supervised methods together on an equal footing. Furthermore, past benchmarks often handle hyperparameter tuning suboptimally. First, they may not tune hyperparameters at all, leading to underfitting. Second, when tuning does occur, it often unrealistically uses a labeled validation set that is much larger than the training set. Therefore, currently published rankings might not always reflect practical utility. This study contributes a systematic evaluation of self- and semi-supervised methods with a unified experimental protocol intended to guide a practitioner with scarce overall labeled data and a limited compute budget. We answer two key questions: Can hyperparameter tuning be effective with realistic-sized validation sets? If so, when all methods are tuned well, which self- or semi-supervised methods achieve the best accuracy? Our study compares 13 representative semi- and self-supervised methods to strong labeled-set-only baselines on 4 medical datasets. From 20000+ GPU hours of computation, we provide valuable best practices to resource-constrained practitioners: hyperparameter tuning is effective, and the semi-supervised method known as MixMatch delivers the most reliable gains across the 4 datasets. + + + + MSU-4S - The Michigan State University Four Seasons Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Kent_MSU-4S_-_The_Michigan_State_University_Four_Seasons_Dataset_CVPR_2024_paper.pdf + Public datasets such as KITTI, nuScenes, and Waymo have played a key role in the research and development of autonomous vehicles and advanced driver assistance systems. However, many of these datasets fail to incorporate a full range of driving conditions; some datasets only contain clear-weather conditions, underrepresenting or entirely missing colder weather conditions such as snow or autumn scenes with bright, colorful foliage. In this paper, we present the Michigan State University Four Seasons (MSU-4S) Dataset, which contains real-world collections of autonomous vehicle data from varied types of driving scenarios. These scenarios were recorded throughout a full range of seasons and capture clear, rainy, snowy, and fall weather conditions at varying times of day. MSU-4S contains more than 100000 two- and three-dimensional frames for camera, lidar, and radar data, as well as Global Navigation Satellite System (GNSS), wheel speed, and steering data, all annotated with weather, time-of-day, and time-of-year.
Our data includes cluttered scenes that have large numbers of vehicles and pedestrians; and it also captures industrial scenes busy traffic thoroughfare with traffic lights and numerous signs and scenes with dense foliage. While providing a diverse set of scenes our data incorporate an important feature: virtually every scene and its corresponding lidar camera and radar frames were captured in four different seasons enabling unparalleled object detection analysis and testing of the domain shift problem across weather conditions. In that context we present detailed analyses for 3D and 2D object detection showing a strong domain shift effect among MSU-4S data segments collected across different conditions. MSU-4S will also enable advanced multimodal fusion research including different combinations of camera-lidar-radar fusion which continues to be of strong interest for the computer vision autonomous driving and ADAS development communities. The MSU-4S dataset is available online at https://egr.msu.edu/waves/msu4s. + + + + Improving Plasticity in Online Continual Learning via Collaborative Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Improving_Plasticity_in_Online_Continual_Learning_via_Collaborative_Learning_CVPR_2024_paper.pdf + Online Continual Learning (CL) solves the problem of learning the ever-emerging new classification tasks from a continuous data stream. Unlike its offline counterpart in online CL the training data can only be seen once. Most existing online CL research regards catastrophic forgetting (i.e. model stability) as almost the only challenge. In this paper we argue that the model's capability to acquire new knowledge (i.e. model plasticity) is another challenge in online CL. While replay-based strategies have been shown to be effective in alleviating catastrophic forgetting there is a notable gap in research attention toward improving model plasticity. To this end we propose Collaborative Continual Learning (CCL) a collaborative learning based strategy to improve the model's capability in acquiring new concepts. Additionally we introduce Distillation Chain (DC) a collaborative learning scheme to boost the training of the models. We adapt CCL-DC to existing representative online CL works. Extensive experiments demonstrate that even if the learners are well-trained with state-of-the-art online CL methods our strategy can still improve model plasticity dramatically and thereby improve the overall performance by a large margin. The source code of our work is available at https://github.com/maorong-wang/CCL-DC. + + + + Spectral and Polarization Vision: Spectro-polarimetric Real-world Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Jeon_Spectral_and_Polarization_Vision_Spectro-polarimetric_Real-world_Dataset_CVPR_2024_paper.pdf + Image datasets are essential not only in validating existing methods in computer vision but also in developing new methods. Many image datasets exist consisting of trichromatic intensity images taken with RGB cameras which are designed to replicate human vision. However polarization and spectrum the wave properties of light that animals in harsh environments and with limited brain capacity often rely on remain underrepresented in existing datasets. Although there are previous spectro-polarimetric datasets they have insufficient object diversity limited illumination conditions linear-only polarization data and inadequate image count. 
Here, we introduce two spectro-polarimetric datasets consisting of trichromatic Stokes images and hyperspectral Stokes images. These datasets encompass both linear and circular polarization; they introduce multiple spectral channels; and they feature a broad selection of real-world scenes. With our dataset in hand, we analyze the spectro-polarimetric image statistics, develop efficient representations of such high-dimensional data, and evaluate the spectral dependency of shape-from-polarization methods. As such, the proposed dataset promises a foundation for data-driven spectro-polarimetric imaging and vision research. + + + + Transfer CLIP for Generalizable Image Denoising + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_Transfer_CLIP_for_Generalizable_Image_Denoising_CVPR_2024_paper.pdf + Image denoising is a fundamental task in computer vision. While prevailing deep learning-based supervised and self-supervised methods have excelled in eliminating in-distribution noise, their susceptibility to out-of-distribution (OOD) noise remains a significant challenge. The recently emerged contrastive language-image pre-training (CLIP) model has showcased exceptional capabilities in open-world image recognition and segmentation. Yet the potential for leveraging CLIP to enhance the robustness of low-level tasks remains largely unexplored. This paper uncovers that certain dense features extracted from the frozen ResNet image encoder of CLIP exhibit distortion-invariant and content-related properties, which are highly desirable for generalizable denoising. Leveraging these properties, we devise an asymmetrical encoder-decoder denoising network, which incorporates dense features, including the noisy image and its multi-scale features from the frozen ResNet encoder of CLIP, into a learnable image decoder to achieve generalizable denoising. A progressive feature augmentation strategy is further proposed to mitigate feature overfitting and improve the robustness of the learnable decoder. Extensive experiments and comparisons conducted across diverse OOD noises, including synthetic noise, real-world sRGB noise, and low-dose CT image noise, demonstrate the superior generalization ability of our method. + + + + Revisiting Adversarial Training at Scale + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Revisiting_Adversarial_Training_at_Scale_CVPR_2024_paper.pdf + The machine learning community has witnessed a drastic change in the training pipeline, pivoted by those "foundation models" with unprecedented scales. However, the field of adversarial training is lagging behind, predominantly centered around small model sizes like ResNet-50 and tiny, low-resolution datasets like CIFAR-10. To bridge this transformation gap, this paper provides a modern re-examination of adversarial training, investigating its potential benefits when applied at scale. Additionally, we introduce an efficient and effective training strategy to enable adversarial training with giant models and web-scale data at an affordable computing cost. We denote this newly introduced framework as AdvXL. Empirical results demonstrate that AdvXL establishes new state-of-the-art robust accuracy records under AutoAttack on ImageNet-1K. For example, by training on the DataComp-1B dataset, our AdvXL empowers a vanilla ViT-g model to substantially surpass the previous records of l_infinity-, l_2-, and l_1-robust accuracy by margins of 11.4%, 14.2%, and 12.9%, respectively.
This achievement posits AdvXL as a pioneering approach charting a new trajectory for the efficient training of robust visual representations at significantly larger scales. Our code is available at https://github.com/UCSC-VLAA/AdvXL. + + + + Towards Fairness-Aware Adversarial Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Towards_Fairness-Aware_Adversarial_Learning_CVPR_2024_paper.pdf + Although adversarial training (AT) has proven effective in enhancing the model's robustness the recently revealed issue of fairness in robustness has not been well addressed i.e. the robust accuracy varies significantly among different categories. In this paper instead of uniformly evaluating the model's average class performance we delve into the issue of robust fairness by considering the worst-case distribution across various classes. We propose a novel learning paradigm named Fairness-Aware Adversarial Learning (FAAL). As a generalization of conventional AT we re-define the problem of adversarial training as a min-max-max framework to ensure both robustness and fairness of the trained model. Specifically by taking advantage of distributional robust optimization our method aims to find the worst distribution among different categories and the solution is guaranteed to obtain the upper bound performance with high probability. In particular FAAL can fine-tune an unfair robust model to be fair within only two epochs without compromising the overall clean and robust accuracies. Extensive experiments on various image datasets validate the superior performance and efficiency of the proposed FAAL compared to other state-of-the-art methods. + + + + MirageRoom: 3D Scene Segmentation with 2D Pre-trained Models by Mirage Projection + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_MirageRoom_3D_Scene_Segmentation_with_2D_Pre-trained_Models_by_Mirage_CVPR_2024_paper.pdf + Nowadays leveraging 2D images and pre-trained models to guide 3D point cloud feature representation has shown a remarkable potential to boost the performance of 3D fundamental models. While some works rely on additional data such as 2D real-world images and their corresponding camera poses recent studies target at using point cloud exclusively by designing 3D-to-2D projection. However in the indoor scene scenario existing 3D-to-2D projection strategies suffer from severe occlusions and incoherence which fail to contain sufficient information for fine-grained point cloud segmentation task. In this paper we argue that the crux of the matter resides in the basic premise of existing projection strategies that the medium is homogeneous thereby projection rays propagate along straight lines and behind objects are occluded by front ones. Inspired by the phenomenon of mirage where the occluded objects are exposed by distorted light rays due to heterogeneous medium refraction rate we propose MirageRoom by designing parametric mirage projection with heterogeneous medium to obtain series of projected images with various distorted degrees. We further develop a masked reprojection module across 2D and 3D latent space to bridge the gap between pre-trained 2D backbone and 3D point-wise features. Both quantitative and qualitative experimental results on S3DIS and ScanNet V2 demonstrate the effectiveness of our method. 
+ + + + In2SET: Intra-Inter Similarity Exploiting Transformer for Dual-Camera Compressive Hyperspectral Imaging + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_In2SET_Intra-Inter_Similarity_Exploiting_Transformer_for_Dual-Camera_Compressive_Hyperspectral_Imaging_CVPR_2024_paper.pdf + Dual-camera compressive hyperspectral imaging (DCCHI) offers the capability to reconstruct 3D hyperspectral image (HSI) by fusing compressive and panchromatic (PAN) image which has shown great potential for snapshot hyperspectral imaging in practice. In this paper we introduce a novel DCCHI reconstruction network intra-inter similarity exploiting Transformer (In2SET). Our key insight is to make full use of the PAN image to assist the reconstruction. To this end we propose to use the intra-similarity within the PAN image as a proxy for approximating the intra-similarity in the original HSI thereby offering an enhanced content prior for more accurate HSI reconstruction. Furthermore we propose to use the inter-similarity to align the features between HSI and PAN images thereby maintaining semantic consistency between the two modalities during the reconstruction process. By integrating In2SET into a PAN-guided deep unrolling (PGDU) framework our method substantially enhances the spatial-spectral fidelity and detail of the reconstructed images providing a more comprehensive and accurate depiction of the scene. Experiments conducted on both real and simulated datasets demonstrate that our approach consistently outperforms existing state-of-the-art methods in terms of reconstruction quality and computational complexity. The code is available at https://github.com/2JONAS/In2SET. + + + + Look-Up Table Compression for Efficient Image Restoration + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Look-Up_Table_Compression_for_Efficient_Image_Restoration_CVPR_2024_paper.pdf + Look-Up Table (LUT) has recently gained increasing attention for restoring High-Quality (HQ) images from Low-Quality (LQ) observations thanks to its high computational efficiency achieved through a "space for time" strategy of caching learned LQ-HQ pairs. However incorporating multiple LUTs for improved performance comes at the cost of a rapidly growing storage size which is ultimately restricted by the allocatable on-device cache size. In this work we propose a novel LUT compression framework to achieve a better trade-off between storage size and performance for LUT-based image restoration models. Based on the observation that most cached LQ image patches are distributed along the diagonal of a LUT we devise a Diagonal-First Compression (DFC) framework where diagonal LQ-HQ pairs are preserved and carefully re-indexed to maintain the representation capacity while non-diagonal pairs are aggressively subsampled to save storage. Extensive experiments on representative image restoration tasks demonstrate that our DFC framework significantly reduces the storage size of LUT-based models (including our new design) while maintaining their performance. For instance DFC saves up to 90% of storage at a negligible performance drop for x4 super-resolution. The source code is available on GitHub: https://github.com/leenas233/DFC. 
+ + + + TextNeRF: A Novel Scene-Text Image Synthesis Method based on Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Cui_TextNeRF_A_Novel_Scene-Text_Image_Synthesis_Method_based_on_Neural_CVPR_2024_paper.pdf + Acquiring large-scale well-annotated datasets is essential for training robust scene text detectors yet the process is often resource-intensive and time-consuming. While some efforts have been made to explore the synthesis of scene text images a notable gap remains between synthetic and authentic data. In this paper we introduce a novel method that utilizes Neural Radiance Fields (NeRF) to model real-world scenes and emulate the data collection process by rendering images from diverse camera perspectives enriching the variability and realism of the synthesized data. A semi-supervised learning framework is proposed to categorize semantic regions within 3D scenes ensuring consistent labeling of text regions across various viewpoints. Our method also models the pose and view-dependent appearance of text regions thereby offering precise control over camera poses and significantly improving the realism of text insertion and editing within scenes. Employing our technique on real-world scenes has led to the creation of a novel scene text image dataset. Compared to other existing benchmarks the proposed dataset is distinctive in providing not only standard annotations such as bounding boxes and transcriptions but also the information of 3D pose attributes for text regions enabling a more detailed evaluation of the robustness of text detection algorithms. Through extensive experiments we demonstrate the effectiveness of our proposed method in enhancing the performance of scene text detectors. + + + + Dr.Hair: Reconstructing Scalp-Connected Hair Strands without Pre-Training via Differentiable Rendering of Line Segments + http://openaccess.thecvf.com//content/CVPR2024/papers/Takimoto_Dr.Hair_Reconstructing_Scalp-Connected_Hair_Strands_without_Pre-Training_via_Differentiable_Rendering_CVPR_2024_paper.pdf + In the film and gaming industries achieving a realistic hair appearance typically involves the use of strands originating from the scalp. However reconstructing these strands from observed surface images of hair presents significant challenges. The difficulty in acquiring Ground Truth (GT) data has led state-of-the-art learning-based methods to rely on pre-training with manually prepared synthetic CG data. This process is not only labor-intensive and costly but also introduces complications due to the domain gap when compared to real-world data. In this study we propose an optimization-based approach that eliminates the need for pre-training. Our method represents hair strands as line segments growing from the scalp and optimizes them using a novel differentiable rendering algorithm. To robustly optimize a substantial number of slender explicit geometries we introduce 3D orientation estimation utilizing global optimization strand initialization based on Laplace's equation and reparameterization that leverages geometric connectivity and spatial proximity. Unlike existing optimization-based methods our method is capable of reconstructing internal hair flow in an absolute direction. Our method exhibits robust and accurate inverse rendering surpassing the quality of existing methods and significantly improving processing speed. 
+ + + + DiVa-360: The Dynamic Visual Dataset for Immersive Neural Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_DiVa-360_The_Dynamic_Visual_Dataset_for_Immersive_Neural_Fields_CVPR_2024_paper.pdf + Advances in neural fields are enabling high-fidelity capture of the shape and appearance of dynamic 3D scenes. However, their capabilities lag behind those offered by conventional representations such as 2D videos because of algorithmic challenges and the lack of large-scale multi-view real-world datasets. We address the dataset limitation with DiVa-360, a real-world 360° dynamic visual dataset that contains synchronized high-resolution and long-duration multi-view video sequences of table-scale scenes captured using a customized low-cost system with 53 cameras. It contains 21 object-centric sequences categorized by different motion types, 25 intricate hand-object interaction sequences, and 8 long-duration sequences, for a total of 17.4 M image frames. In addition, we provide foreground-background segmentation masks, synchronized audio, and text descriptions. We benchmark the state-of-the-art dynamic neural field methods on DiVa-360 and provide insights about existing methods and future challenges in long-duration neural field capture. + + + + FSC: Few-point Shape Completion + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_FSC_Few-point_Shape_Completion_CVPR_2024_paper.pdf + While previous studies have demonstrated successful 3D object shape completion with a sufficient number of points, they often fail in scenarios where only a few points, e.g. tens of points, are observed. Surprisingly, via entropy analysis, we find that even a few points, e.g. 64 points, could retain substantial information to help recover the 3D shape of the object. To address the challenge of shape completion with very sparse point clouds, we then propose the Few-point Shape Completion (FSC) model, which contains a novel dual-branch feature extractor for handling extremely sparse inputs, coupled with an extensive branch for maximal point utilization and a saliency branch for dynamic importance assignment. This model is further bolstered by a two-stage revision network that refines both the extracted features and the decoder output, enhancing the detail and authenticity of the completed point cloud. Our experiments demonstrate the feasibility of recovering 3D shapes from a few points. The proposed Few-point Shape Completion (FSC) model outperforms previous methods on both few-point inputs and many-point inputs and shows good generalizability to different object categories. + + + + T-VSL: Text-Guided Visual Sound Source Localization in Mixtures + http://openaccess.thecvf.com//content/CVPR2024/papers/Mahmud_T-VSL_Text-Guided_Visual_Sound_Source_Localization_in_Mixtures_CVPR_2024_paper.pdf + Visual sound source localization poses a significant challenge in identifying the semantic region of each sounding source within a video. Existing self-supervised and weakly supervised source localization methods struggle to accurately distinguish the semantic regions of each sounding object, particularly in multi-source mixtures. These methods often rely on audio-visual correspondence as guidance, which can lead to substantial performance drops in complex multi-source localization scenarios. The lack of access to individual source sounds in multi-source mixtures during training exacerbates the difficulty of learning effective audio-visual correspondence for localization.
To address this limitation in this paper we propose incorporating the text modality as an intermediate feature guide using tri-modal joint embedding models (e.g. AudioCLIP) to disentangle the semantic audio-visual source correspondence in multi-source mixtures. Our framework dubbed T-VSL begins by predicting the class of sounding entities in mixtures. Subsequently the textual representation of each sounding source is employed as guidance to disentangle fine-grained audio-visual source correspondence from multi-source mixtures leveraging the tri-modal AudioCLIP embedding. This approach enables our framework to handle a flexible number of sources and exhibits promising zero-shot transferability to unseen classes during test time. Extensive experiments conducted on the MUSIC VGGSound and VGGSound-Instruments datasets demonstrate significant performance improvements over state-of-the-art methods. Code is released at https://github.com/enyac-group/T-VSL/tree/main. + + + + VCoder: Versatile Vision Encoders for Multimodal Large Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Jain_VCoder_Versatile_Vision_Encoders_for_Multimodal_Large_Language_Models_CVPR_2024_paper.pdf + Humans possess the remarkable skill of Visual Perception the ability to see and understand the seen helping them make sense of the visual world and in turn reason. Multimodal Large Language Models (MLLM) have recently achieved impressive performance on vision-language tasks ranging from visual question-answering and image captioning to visual reasoning and image generation. However when prompted to identify or count (perceive) the entities in a given image existing MLLM systems fail. Working towards developing an accurate MLLM system for perception and reasoning we propose using Versatile vision enCoders (VCoder) as perception eyes for Multimodal LLMs. We feed the VCoder with perception modalities such as segmentation or depth maps improving the MLLM's perception abilities. Secondly we leverage the images from COCO and outputs from off-the-shelf vision perception models to create our COCO Segmentation Text (COST) dataset for training and evaluating MLLMs on the object perception task. Thirdly we introduce metrics to assess the object perception abilities in MLLMs on our COST dataset. Lastly we provide extensive experimental evidence proving the VCoder's improved object-level perception skills over existing Multimodal LLMs including GPT-4V. We open-source our dataset code and models to promote research. + + + + Event-based Visible and Infrared Fusion via Multi-task Collaboration + http://openaccess.thecvf.com//content/CVPR2024/papers/Geng_Event-based_Visible_and_Infrared_Fusion_via_Multi-task_Collaboration_CVPR_2024_paper.pdf + Visible and Infrared image Fusion (VIF) offers a comprehensive scene description by combining thermal infrared images with the rich textures from visible cameras. However conventional VIF systems may capture over/under exposure or blurry images in extreme lighting and high dynamic motion scenarios leading to degraded fusion results. To address these problems we propose a novel Event-based Visible and Infrared Fusion (EVIF) system that employs a visible event camera as an alternative to traditional frame-based cameras for the VIF task. With extremely low latency and high dynamic range event cameras can effectively address blurriness and are robust against diverse luminous ranges. 
To produce high-quality fused images we develop a multi-task collaborative framework that simultaneously performs event-based visible texture reconstruction event-guided infrared image deblurring and visible-infrared fusion. Rather than independently learning these tasks our framework capitalizes on their synergy leveraging cross-task event enhancement for efficient deblurring and bi-level min-max mutual information optimization to achieve higher fusion quality. Experiments on both synthetic and real data show that EVIF achieves remarkable performance in dealing with extreme lighting conditions and high-dynamic scenes ensuring high-quality fused images across a broad range of practical scenarios. + + + + RegionPLC: Regional Point-Language Contrastive Learning for Open-World 3D Scene Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_RegionPLC_Regional_Point-Language_Contrastive_Learning_for_Open-World_3D_Scene_Understanding_CVPR_2024_paper.pdf + We propose a lightweight and scalable Regional Point-Language Contrastive learning framework namely RegionPLC for open-world 3D scene understanding aiming to identify and recognize open-set objects and categories. Specifically based on our empirical studies we introduce a 3D-aware SFusion strategy that fuses 3D vision-language pairs derived from multiple 2D foundation models yielding high-quality dense region-level language descriptions without human 3D annotations. Subsequently we devise a region-aware point-discriminative contrastive learning objective to enable robust and effective 3D learning from dense regional language supervision. We carry out extensive experiments on ScanNet ScanNet200 and nuScenes datasets and our model outperforms prior 3D open-world scene understanding approaches by an average of 17.2% and 9.1% for semantic and instance segmentation respectively while maintaining greater scalability and lower resource demands. Furthermore our method has the flexibility to be effortlessly integrated with language models to enable open-ended grounded 3D reasoning without extra task-specific training. Code will be released. + + + + Three Pillars Improving Vision Foundation Model Distillation for Lidar + http://openaccess.thecvf.com//content/CVPR2024/papers/Puy_Three_Pillars_Improving_Vision_Foundation_Model_Distillation_for_Lidar_CVPR_2024_paper.pdf + Self-supervised image backbones can be used to address complex 2D tasks (e.g. semantic segmentation object discovery) very efficiently and with little or no downstream supervision. Ideally 3D backbones for lidar should be able to inherit these properties after distillation of these powerful 2D features. The most recent methods for image-to-lidar distillation on autonomous driving data show promising results obtained thanks to distillation methods that keep improving. Yet we still notice a large performance gap when measuring by linear probing the quality of distilled vs fully supervised features. In this work instead of focusing only on the distillation method we study the effect of three pillars for distillation: the 3D backbone the pretrained 2D backbone and the pretraining 2D+3D dataset. In particular thanks to our scalable distillation method named ScaLR we show that scaling the 2D and 3D backbones and pretraining on diverse datasets leads to a substantial improvement of the feature quality. 
This allows us to significantly reduce the gap between the quality of distilled and fully-supervised 3D features and to improve the robustness of the pretrained backbones to domain gaps and perturbations. + + + + ShapeWalk: Compositional Shape Editing Through Language-Guided Chains + http://openaccess.thecvf.com//content/CVPR2024/papers/Slim_ShapeWalk_Compositional_Shape_Editing_Through_Language-Guided_Chains_CVPR_2024_paper.pdf + Editing 3D shapes through natural language instructions is a challenging task that requires the comprehension of both language semantics and fine-grained geometric details. To bridge this gap we introduce ShapeWalk a carefully designed synthetic dataset designed to advance the field of language-guided shape editing. The dataset consists of 158K unique shapes connected through 26K edit chains with an average length of 14 chained shapes. Each consecutive pair of shapes is associated with precise language instructions describing the applied edits. We synthesize edit chains by reconstructing and interpolating shapes sampled from a realistic CAD-designed 3D dataset in the parameter space of the GeoCode shape program. We leverage rule-based methods and language models to generate accurate and realistic natural language prompts corresponding to each edit. To illustrate the practicality of our contribution we train neural editor modules in the latent space of shape autoencoders and demonstrate the ability of our dataset to enable a variety of language-guided shape edits. Finally we introduce multi-step editing metrics to benchmark the capacity of our models to perform recursive shape edits. We hope that our work will enable further study of compositional language-guided shape editing and finds application in 3D CAD design and interactive modeling. + + + + MESA: Matching Everything by Segmenting Anything + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_MESA_Matching_Everything_by_Segmenting_Anything_CVPR_2024_paper.pdf + Feature matching is a crucial task in the field of computer vision which involves finding correspondences between images. Previous studies achieve remarkable performance using learning-based feature comparison. However the pervasive presence of matching redundancy between images gives rise to unnecessary and error-prone computations in these methods imposing limitations on their accuracy. To address this issue we propose MESA a novel approach to establish precise area (or region) matches for efficient matching redundancy reduction. MESA first leverages the advanced image understanding capability of SAM a state-of-the-art foundation model for image segmentation to obtain image areas with implicit semantic. Then a multi-relational graph is proposed to model the spatial structure of these areas and construct their scale hierarchy. Based on graphical models derived from the graph the area matching is reformulated as an energy minimization task and effectively resolved. Extensive experiments demonstrate that MESA yields substantial precision improvement for multiple point matchers in indoor and outdoor downstream tasks e.g. +13.61% for DKM in indoor pose estimation. + + + + Learning Degradation-Independent Representations for Camera ISP Pipelines + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_Learning_Degradation-Independent_Representations_for_Camera_ISP_Pipelines_CVPR_2024_paper.pdf + Image signal processing (ISP) pipeline plays a fundamental role in digital cameras which converts raw Bayer sensor data to RGB images. 
However ISP-generated images usually suffer from imperfections due to the compounded degradations that stem from sensor noises demosaicing noises compression artifacts and possibly adverse effects of erroneous ISP hyperparameter settings such as ISO and gamma values. In a general sense these ISP imperfections can be considered as degradations. The highly complex mechanisms of ISP degradations some of which are even unknown pose great challenges to the generalization capability of deep neural networks (DNN) for image restoration and to their adaptability to downstream tasks. To tackle the issues we propose a novel DNN approach to learn degradation-independent representations (DiR) through the refinement of a self-supervised learned baseline representation. The proposed DiR learning technique has remarkable domain generalization capability and consequently it outperforms state-of-the-art methods across various downstream tasks including blind image restoration object detection and instance segmentation as verified in our experiments. + + + + OmniGlue: Generalizable Feature Matching with Foundation Model Guidance + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_OmniGlue_Generalizable_Feature_Matching_with_Foundation_Model_Guidance_CVPR_2024_paper.pdf + The image matching field has been witnessing a continuous emergence of novel learnable feature matching techniques with ever-improving performance on conventional benchmarks. However our investigation shows that despite these gains their potential for real-world applications is restricted by their limited generalization capabilities to novel image domains. In this paper we introduce OmniGlue the first learnable image matcher that is designed with generalization as a core principle. OmniGlue leverages broad knowledge from a vision foundation model to guide the feature matching process boosting generalization to domains not seen at training time. Additionally we propose a novel keypoint position-guided attention mechanism which disentangles spatial and appearance information leading to enhanced matching descriptors. We perform comprehensive experiments on a suite of 6 datasets with varied image domains including scene-level object-centric and aerial images. OmniGlue's novel components lead to relative gains on unseen domains of 20.9% with respect to a directly comparable reference model while also outperforming the recent LightGlue method by 9.5% relatively. Code and model can be found at https://hwjiang1510.github.io/OmniGlue. + + + + OmniSDF: Scene Reconstruction using Omnidirectional Signed Distance Functions and Adaptive Binoctrees + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_OmniSDF_Scene_Reconstruction_using_Omnidirectional_Signed_Distance_Functions_and_Adaptive_CVPR_2024_paper.pdf + We present a method to reconstruct indoor and outdoor static scene geometry and appearance from an omnidirectional video moving in a small circular sweep. This setting is challenging because of the small baseline and large depth ranges making it difficult to find ray crossings. To better constrain the optimization we estimate geometry as a signed distance field within a spherical binoctree data structure and use a complementary efficient tree traversal strategy based on a breadth-first search for sampling. Unlike regular grids or trees the shape of this structure well-matches the camera setting creating a better memory-quality trade-off. 
From an initial depth estimate the binoctree is adaptively subdivided throughout the optimization; previous methods use a fixed depth that leaves the scene undersampled. In comparison with three neural optimization methods and two non-neural methods ours shows decreased geometry error on average especially in a detailed scene while significantly reducing the required number of voxels to represent such details. + + + + Generating Content for HDR Deghosting from Frequency View + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_Generating_Content_for_HDR_Deghosting_from_Frequency_View_CVPR_2024_paper.pdf + Recovering ghost-free High Dynamic Range (HDR) images from multiple Low Dynamic Range (LDR) images becomes challenging when the LDR images exhibit saturation and significant motion. Recent Diffusion Models (DMs) have been introduced in HDR imaging field demonstrating promising performance particularly in achieving visually perceptible results compared to previous DNN-based methods. However DMs require extensive iterations with large models to estimate entire images resulting in inefficiency that hinders their practical application. To address this challenge we propose the Low-Frequency aware Diffusion (LF-Diff) model for ghost-free HDR imaging. The key idea of LF-Diff is implementing the DMs in a highly compacted latent space and integrating it into a regression-based model to enhance the details of reconstructed images. Specifically as low-frequency information is closely related to human visual perception we propose to utilize DMs to create compact low-frequency priors for the reconstruction process. In addition to take full advantage of the above low-frequency priors the Dynamic HDR Reconstruction Network (DHRNet) is carried out in a regression-based manner to obtain final HDR images. Extensive experiments conducted on synthetic and real-world benchmark datasets demonstrate that our LF-Diff performs favorably against several state-of-the-art methods and is 10x faster than previous DM-based methods. + + + + LiDAR-Net: A Real-scanned 3D Point Cloud Dataset for Indoor Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_LiDAR-Net_A_Real-scanned_3D_Point_Cloud_Dataset_for_Indoor_Scenes_CVPR_2024_paper.pdf + In this paper we present LiDAR-Net a new real-scanned indoor point cloud dataset containing nearly 3.6 billion precisely point-level annotated points covering an expansive area of 30000m^2. It encompasses three prevalent daily environments including learning scenes working scenes and living scenes. LiDAR-Net is characterized by its non-uniform point distribution e.g. scanning holes and scanning lines. Additionally it meticulously records and annotates scanning anomalies including reflection noise and ghost. These anomalies stem from specular reflections on glass or metal as well as distortions due to moving persons. LiDAR-Net's realistic representation of non-uniform distribution and anomalies significantly enhances the training of deep learning models leading to improved generalization in practical applications. We thoroughly evaluate the performance of state-of-the-art algorithms on LiDAR-Net and provide a detailed analysis of the results. Crucially our research identifies several fundamental challenges in understanding indoor point clouds contributing essential insights to future explorations in this field. 
Our dataset can be found online: http://lidar-net.njumeta.com + + + + Rich Human Feedback for Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_Rich_Human_Feedback_for_Text-to-Image_Generation_CVPR_2024_paper.pdf + Recent Text-to-Image (T2I) generation models such as Stable Diffusion and Imagen have made significant progress in generating high-resolution images based on text descriptions. However many generated images still suffer from issues such as artifacts/implausibility misalignment with text descriptions and low aesthetic quality. Inspired by the success of Reinforcement Learning with Human Feedback (RLHF) for large language models prior works collected human-provided scores as feedback on generated images and trained a reward model to improve the T2I generation. In this paper we enrich the feedback signal by (i) marking image regions that are implausible or misaligned with the text and (ii) annotating which words in the text prompt are misrepresented or missing on the image. We collect such rich human feedback on 18K generated images (RichHF-18K) and train a multimodal transformer to predict the rich feedback automatically. We show that the predicted rich human feedback can be leveraged to improve image generation for example by selecting high-quality training data to finetune and improve the generative models or by creating masks with predicted heatmaps to inpaint the problematic regions. Notably the improvements generalize to models (Muse) beyond those used to generate the images on which human feedback data were collected (Stable Diffusion variants). The RichHF-18K data set will be released in our GitHub repository: https://github.com/google-research/google-research/tree/master/richhf_18k. + + + + Map-Relative Pose Regression for Visual Re-Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Map-Relative_Pose_Regression_for_Visual_Re-Localization_CVPR_2024_paper.pdf + Pose regression networks predict the camera pose of a query image relative to a known environment. Within this family of methods absolute pose regression (APR) has recently shown promising accuracy in the range of a few centimeters in position error. APR networks encode the scene geometry implicitly in their weights. To achieve high accuracy they require vast amounts of training data that realistically can only be created using novel view synthesis in a days-long process. This process has to be repeated for each new scene again and again. We present a new approach to pose regression map-relative pose regression (marepo) that satisfies the data hunger of the pose regression network in a scene-agnostic fashion. We condition the pose regressor on a scene-specific map representation such that its pose predictions are relative to the scene map. This allows us to train the pose regressor across hundreds of scenes to learn the generic relation between a scene-specific map representation and the camera pose. Our map-relative pose regressor can be applied to new map representations immediately or after mere minutes of fine-tuning for the highest accuracy. Our approach outperforms previous pose regression methods by far on two public datasets indoor and outdoor. Code is available: https://nianticlabs.github.io/marepo. + + + + Implicit Event-RGBD Neural SLAM + http://openaccess.thecvf.com//content/CVPR2024/papers/Qu_Implicit_Event-RGBD_Neural_SLAM_CVPR_2024_paper.pdf + Implicit neural SLAM has achieved remarkable progress recently. 
Nevertheless existing methods face significant challenges in non-ideal scenarios such as motion blur or lighting variation which often leads to issues like convergence failures localization drifts and distorted mapping. To address these challenges we propose EN-SLAM the first event-RGBD implicit neural SLAM framework which effectively leverages the high rate and high dynamic range advantages of event data for tracking and mapping. Specifically EN-SLAM proposes a differentiable CRF (Camera Response Function) rendering technique to generate distinct RGB and event camera data via a shared radiance field which is optimized by learning a unified implicit representation with the captured event and RGBD supervision. Moreover based on the temporal difference property of events we propose a temporal aggregating optimization strategy for the event joint tracking and global bundle adjustment capitalizing on the consecutive difference constraints of events significantly enhancing tracking accuracy and robustness. Finally we construct the simulated dataset DEV-Indoors and real captured dataset DEV-Reals containing 6 scenes 17 sequences with practical motion blur and lighting changes for evaluations. Experimental results show that our method outperforms the SOTA methods in both tracking ATE and mapping ACC with a real-time 17 FPS in various challenging environments. Project page: https://delinqu.github.io/EN-SLAM. + + + + Domain-Specific Block Selection and Paired-View Pseudo-Labeling for Online Test-Time Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Domain-Specific_Block_Selection_and_Paired-View_Pseudo-Labeling_for_Online_Test-Time_Adaptation_CVPR_2024_paper.pdf + Test-time adaptation (TTA) aims to adapt a pre-trained model to a new test domain without access to source data after deployment. Existing approaches typically rely on self-training with pseudo-labels since ground-truth cannot be obtained from test data. Although the quality of pseudo labels is important for stable and accurate long-term adaptation it has not been previously addressed. In this work we propose DPLOT a simple yet effective TTA framework that consists of two components: (1) domain-specific block selection and (2) pseudo-label generation using paired-view images. Specifically we select blocks that involve domain-specific feature extraction and train these blocks by entropy minimization. After blocks are adjusted for current test domain we generate pseudo-labels by averaging given test images and corresponding flipped counterparts. By simply using flip augmentation we prevent a decrease in the quality of the pseudo-labels which can be caused by the domain gap resulting from strong augmentation. Our experimental results demonstrate that DPLOT outperforms previous TTA methods in CIFAR10-C CIFAR100-C and ImageNet-C benchmarks reducing error by up to 5.4% 9.1% and 2.9% respectively. Also we provide an extensive analysis to demonstrate effectiveness of our framework. Code is available at https://github.com/gist-ailab/domain-specific-block-selection-and-paired-view-pseudo-labeling-for-online-TTA. + + + + Aerial Lifting: Neural Urban Semantic and Building Instance Lifting from Aerial Imagery + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Aerial_Lifting_Neural_Urban_Semantic_and_Building_Instance_Lifting_from_CVPR_2024_paper.pdf + We present a neural radiance field method for urban-scale semantic and building-level instance segmentation from aerial images by lifting noisy 2D labels to 3D. 
This is a challenging problem due to two primary reasons. Firstly objects in urban aerial images exhibit substantial variations in size including buildings cars and roads which pose a significant challenge for accurate 2D segmentation. Secondly the 2D labels generated by existing segmentation methods suffer from the multi-view inconsistency problem especially in the case of aerial images where each image captures only a small portion of the entire scene. To overcome these limitations we first introduce a scale-adaptive semantic label fusion strategy that enhances the segmentation of objects of varying sizes by combining labels predicted from different altitudes harnessing the novel-view synthesis capabilities of NeRF. We then introduce a novel cross-view instance label grouping strategy based on the 3D scene representation to mitigate the multi-view inconsistency problem in the 2D instance labels. Furthermore we exploit multi-view reconstructed depth priors to improve the geometric quality of the reconstructed radiance field resulting in enhanced segmentation results. Experiments on multiple real-world urban-scale datasets demonstrate that our approach outperforms existing methods highlighting its effectiveness. The source code is available at https://github.com/zyqz97/Aerial_lifting. + + + + Learning with Structural Labels for Learning with Noisy Labels + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Learning_with_Structural_Labels_for_Learning_with_Noisy_Labels_CVPR_2024_paper.pdf + Deep Neural Networks (DNNs) have demonstrated remarkable performance across diverse domains and tasks with large-scale datasets. To reduce labeling costs for large-scale datasets semi-automated and crowdsourcing labeling methods are developed but their labels are inevitably noisy. Learning with Noisy Labels (LNL) approaches aim to train DNNs despite the presence of noisy labels. These approaches utilize the memorization effect to select correct labels and refine noisy ones which are then used for subsequent training. However these methods encounter a significant decrease in the model's generalization performance due to the inevitably existing noise labels. To overcome this limitation we propose a new approach to enhance learning with noisy labels by incorporating additional distribution information--structural labels. In order to leverage additional distribution information for generalization we employ a reverse k-NN which helps the model in achieving a better feature manifold and mitigating overfitting to noisy labels. The proposed method shows outperformed performance in multiple benchmark datasets with IDN and real-world noisy datasets. + + + + DeMatch: Deep Decomposition of Motion Field for Two-View Correspondence Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_DeMatch_Deep_Decomposition_of_Motion_Field_for_Two-View_Correspondence_Learning_CVPR_2024_paper.pdf + Two-view correspondence learning has recently focused on considering the coherence and smoothness of the motion field between an image pair. Dominant schemes include controlling the complexity of the field function with regularization or smoothing the field with local filters but the former suffers from heavy computational burden and the latter fails to accommodate discontinuities in the case of large scene disparities. In this paper inspired by Fourier expansion we propose a novel network called DeMatch which decomposes the motion field to retain its main "low-frequency" and smooth part. 
This achieves implicit regularization with lower computational cost and generates piecewise smoothness naturally. Specifically we first decompose the rough motion field that is contaminated by false matches into several different sub-fields which are highly smooth and contain the main energy of the original field. Then with these smooth sub-fields we recover a cleaner motion field from which correct motion vectors are subsequently derived. We also design a special masked decomposition strategy to further mitigate the negative influence of false matches. All the mentioned processes are finally implemented in a discrete and learnable manner avoiding the difficulty of calculating real dense fields. Extensive experiments reveal that DeMatch outperforms state-of-the-art methods in multiple tasks and shows promising low computational usage and piecewise smoothness property. The code and trained models are publicly available at https://github.com/SuhZhang/DeMatch. + + + + Sherpa3D: Boosting High-Fidelity Text-to-3D Generation via Coarse 3D Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Sherpa3D_Boosting_High-Fidelity_Text-to-3D_Generation_via_Coarse_3D_Prior_CVPR_2024_paper.pdf + Recently 3D content creation from text prompts has demonstrated remarkable progress by utilizing 2D and 3D diffusion models. While 3D diffusion models ensure great multi-view consistency their ability to generate high-quality and diverse 3D assets is hindered by the limited 3D data. In contrast 2D diffusion models find a distillation approach that achieves excellent generalization and rich details without any 3D data. However 2D lifting methods suffer from inherent view-agnostic ambiguity thereby leading to serious multi-face Janus issues where text prompts fail to provide sufficient guidance to learn coherent 3D results. Instead of retraining a costly viewpoint-aware model we study how to fully exploit easily accessible coarse 3D knowledge to enhance the prompts and guide 2D lifting optimization for refinement. In this paper we propose Sherpa3D a new text-to-3D framework that achieves high-fidelity generalizability and geometric consistency simultaneously. Specifically we design a pair of guiding strategies derived from the coarse 3D prior generated by the 3D diffusion model: a structural guidance for geometric fidelity and a semantic guidance for 3D coherence. Employing the two types of guidance the 2D diffusion model enriches the 3D content with diversified and high-quality results. Extensive experiments show the superiority of our Sherpa3D over the state-of-the-art text-to-3D methods in terms of quality and 3D consistency. + + + + A Unified Diffusion Framework for Scene-aware Human Motion Estimation from Sparse Signals + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_A_Unified_Diffusion_Framework_for_Scene-aware_Human_Motion_Estimation_from_CVPR_2024_paper.pdf + Estimating full-body human motion via sparse tracking signals from head-mounted displays and hand controllers in 3D scenes is crucial to applications in AR/VR. One of the biggest challenges to this task is the one-to-many mapping from sparse observations to dense full-body motions which endowed inherent ambiguities. To help resolve this ambiguous problem we introduce a new framework to combine rich contextual information provided by scenes to benefit full-body motion tracking from sparse observations. 
To estimate plausible human motions given sparse tracking signals and 3D scenes, we develop S^2Fusion, a unified framework fusing Scene and sparse Signals with a conditional difFusion model. S^2Fusion first extracts the spatial-temporal relations residing in the sparse signals via a periodic autoencoder, and then produces time-alignment feature embedding as additional inputs. Subsequently, by drawing initial noisy motion from a pre-trained prior, S^2Fusion utilizes conditional diffusion to fuse scene geometry and sparse tracking signals to generate full-body scene-aware motions. The sampling procedure of S^2Fusion is further guided by a specially designed scene-penetration loss and phase-matching loss, which effectively regularizes the motion of the lower body even in the absence of any tracking signals, making the generated motion much more plausible and coherent. Extensive experimental results have demonstrated that our S^2Fusion outperforms the state-of-the-art in terms of estimation quality and smoothness. + + + + Single Domain Generalization for Crowd Counting + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_Single_Domain_Generalization_for_Crowd_Counting_CVPR_2024_paper.pdf + Due to its promising results, density map regression has been widely employed for image-based crowd counting. The approach, however, often suffers from severe performance degradation when tested on data from unseen scenarios, the so-called "domain shift" problem. To address the problem, we investigate in this work single domain generalization (SDG) for crowd counting. The existing SDG approaches are mainly for image classification and segmentation, and can hardly be extended to our case due to its regression nature and label ambiguity (i.e. ambiguous pixel-level ground truths). We propose MPCount, a novel effective SDG approach even for narrow source distribution. MPCount stores diverse density values for density map regression and reconstructs domain-invariant features by means of only one memory bank, a content error mask and attention consistency loss. By partitioning the image into grids, it employs patch-wise classification as an auxiliary task to mitigate label ambiguity. Through extensive experiments on different datasets, MPCount is shown to significantly improve counting accuracy compared to the state of the art under diverse scenarios unobserved in the training data, characterized by narrow source distribution. Code is available at https://github.com/Shimmer93/MPCount. + + + + Task-Aware Encoder Control for Deep Video Compression + http://openaccess.thecvf.com//content/CVPR2024/papers/Ge_Task-Aware_Encoder_Control_for_Deep_Video_Compression_CVPR_2024_paper.pdf + Prior research on deep video compression (DVC) for machine tasks typically necessitates training a unique codec for each specific task, mandating a dedicated decoder per task. In contrast, traditional video codecs employ a flexible encoder controller, enabling the adaptation of a single codec to different tasks through mechanisms like mode prediction. Drawing inspiration from this, we introduce an innovative encoder controller for deep video compression for machines. This controller features a mode prediction and a Group of Pictures (GoP) selection module. Our approach centralizes control at the encoding stage, allowing for adaptable encoder adjustments across different tasks, such as detection and tracking, while maintaining compatibility with a standard pre-trained DVC decoder. &#13;
Empirical evidence demonstrates that our method is applicable across multiple tasks with various existing pre-trained DVCs. Moreover extensive experiments demonstrate that our method outperforms previous DVC by about 25% bitrate for different tasks with only one pre-trained decoder. + + + + Long-Tail Class Incremental Learning via Independent Sub-prototype Construction + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Long-Tail_Class_Incremental_Learning_via_Independent_Sub-prototype_Construction_CVPR_2024_paper.pdf + Long-tail class incremental learning (LT-CIL) is designed to perpetually acquire novel knowledge from an imbalanced and perpetually evolving data stream while ensuring the retention of previously acquired knowledge. The existing method only re-balances data distribution and ignores exploring the potential relationship between different samples causing non-robust representations and even severe forgetting in classes with few samples. In this paper we constructed two parallel spaces simultaneously: 1) Sub-prototype space and 2) Reminiscence space to learn robust representations while alleviating forgetfulness. Concretely we advance the concept of the sub-prototype space which amalgamates insights from diverse classes. This integration facilitates the mutual complementarity of varied knowledge thereby augmenting the attainment of more robust representations. Furthermore we introduce the reminiscence space which encapsulates each class distribution aiming to constraint model optimization and mitigate the phenomenon of forgetting. The tandem utilization of the two parallel spaces effectively alleviates the adverse consequences associated with imbalanced data distribution preventing forgetting without needing replay examples. Extensive experiments demonstrate that our method achieves state-of-the-art performance on various benchmarks. + + + + Learning with Unreliability: Fast Few-shot Voxel Radiance Fields with Relative Geometric Consistency + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Learning_with_Unreliability_Fast_Few-shot_Voxel_Radiance_Fields_with_Relative_CVPR_2024_paper.pdf + We propose a voxel-based optimization framework ReVoRF for few-shot radiance fields that strategically addresses the unreliability in pseudo novel view synthesis. Our method pivots on the insight that relative depth relationships within neighboring regions are more reliable than the absolute color values in disoccluded areas. Consequently we devise a bilateral geometric consistency loss that carefully navigates the trade-off between color fidelity and geometric accuracy in the context of depth consistency for uncertain regions. Moreover we present a reliability-guided learning strategy to discern and utilize the variable quality across synthesized views complemented by a reliability-aware voxel smoothing algorithm that smoothens the transition between reliable and unreliable data patches. Our approach allows for a more nuanced use of all available data promoting enhanced learning from regions previously considered unsuitable for high-quality reconstruction. Extensive experiments across diverse datasets reveal that our approach attains significant gains in efficiency and accuracy delivering rendering speeds of 3 FPS 7 mins to train a 360deg scene and a 5% improvement in PSNR over existing few-shot methods. 
Code is available at https://github.com/HKCLynn/ReVoRF + + + + Towards Understanding and Improving Adversarial Robustness of Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Jain_Towards_Understanding_and_Improving_Adversarial_Robustness_of_Vision_Transformers_CVPR_2024_paper.pdf + Recent literature has demonstrated that vision transformers (VITs) exhibit superior performance compared to convolutional neural networks (CNNs). The majority of recent research on adversarial robustness, however, has predominantly focused on CNNs. In this work, we bridge this gap by analyzing the effectiveness of existing attacks on VITs. We demonstrate that, due to the softmax computations in every attention block in VITs, they are inherently vulnerable to floating point underflow errors. This can lead to a gradient masking effect, resulting in suboptimal attack strength of well-known attacks like PGD, Carlini and Wagner (CW), GAMA, and Patch attacks. Motivated by this, we propose the Adaptive Attention Scaling (AAS) attack that can automatically find the optimal scaling factors of pre-softmax outputs using gradient-based optimization. We show that the proposed simple strategy can be incorporated with any existing adversarial attacks as well as adversarial training methods and achieves improved performance. On VIT-B16, we demonstrate an improved attack strength of up to 2.2% on CIFAR10 and up to 2.9% on CIFAR100 by incorporating the proposed AAS attack with state-of-the-art single attack methods like the GAMA attack. Further, we utilise the proposed AAS attack for every few epochs in existing adversarial training methods, which is termed Adaptive Attention Scaling Adversarial Training (AAS-AT). On incorporating AAS-AT with existing methods, we outperform them on VITs by 1.3-3.5% on CIFAR10. We observe improved performance on ImageNet-100 as well. + + + + S-DyRF: Reference-Based Stylized Radiance Fields for Dynamic Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_S-DyRF_Reference-Based_Stylized_Radiance_Fields_for_Dynamic_Scenes_CVPR_2024_paper.pdf + Current 3D stylization methods often assume static scenes, which violates the dynamic nature of our real world. To address this limitation, we present S-DyRF, a reference-based spatio-temporal stylization method for dynamic neural radiance fields. However, stylizing dynamic 3D scenes is inherently challenging due to the limited availability of stylized reference images along the temporal axis. Our key insight lies in introducing additional temporal cues besides the provided reference. To this end, we generate temporal pseudo-references from the given stylized reference. These pseudo-references facilitate the propagation of style information from the reference to the entire dynamic 3D scene. For coarse style transfer, we enforce novel views and times to mimic the style details present in pseudo-references at the feature level. To preserve high-frequency details, we create a collection of stylized temporal pseudo-rays from temporal pseudo-references. These pseudo-rays serve as detailed and explicit stylization guidance for achieving fine style transfer. Experiments on both synthetic and real-world datasets demonstrate that our method yields plausible stylized results of space-time view synthesis on dynamic 3D scenes. + + + + What, How, and When Should Object Detectors Update in Continually Changing Test Domains? &#13;
+ http://openaccess.thecvf.com//content/CVPR2024/papers/Yoo_What_How_and_When_Should_Object_Detectors_Update_in_Continually_CVPR_2024_paper.pdf + It is a well-known fact that the performance of deep learning models deteriorates when they encounter a distribution shift at test time. Test-time adaptation (TTA) algorithms have been proposed to adapt the model online while inferring test data. However, existing research predominantly focuses on classification tasks through the optimization of batch normalization layers or classification heads, but this approach limits its applicability to various model architectures like Transformers and makes it challenging to apply to other tasks such as object detection. In this paper, we propose a novel online adaptation approach for object detection in continually changing test domains, considering which part of the model to update, how to update it, and when to perform the update. By introducing architecture-agnostic and lightweight adaptor modules and only updating these while leaving the pre-trained backbone unchanged, we can rapidly adapt to new test domains in an efficient way and prevent catastrophic forgetting. Furthermore, we present a practical and straightforward class-wise feature aligning method for object detection to resolve domain shifts. Additionally, we enhance efficiency by determining when the model is sufficiently adapted or when additional adaptation is needed due to changes in the test distribution. Our approach surpasses baselines on widely used benchmarks, achieving improvements of up to 4.9%p and 7.9%p in mAP for COCO → COCO-corrupted and SHIFT, respectively, while maintaining about 20 FPS or higher. The implementation code is available at https://github.com/natureyoo/ContinualTTA_ObjectDetection. + + + + Bayesian Exploration of Pre-trained Models for Low-shot Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Miao_Bayesian_Exploration_of_Pre-trained_Models_for_Low-shot_Image_Classification_CVPR_2024_paper.pdf + Low-shot image classification is a fundamental task in computer vision, and the emergence of large-scale vision-language models such as CLIP has greatly advanced the forefront of research in this field. However, most existing CLIP-based methods lack the flexibility to effectively incorporate other pre-trained models that encompass knowledge distinct from CLIP. To bridge the gap, this work proposes a simple and effective probabilistic model ensemble framework based on Gaussian processes, which have previously demonstrated remarkable efficacy in processing small data. We achieve the integration of prior knowledge by specifying the mean function with CLIP and the kernel function with an ensemble of deep kernels built upon various pre-trained models. By regressing the classification label directly, our framework enables analytical inference, straightforward uncertainty quantification, and principled hyper-parameter tuning. Through extensive experiments on standard benchmarks, we demonstrate that our method consistently outperforms competitive ensemble baselines regarding predictive performance. Additionally, we assess the robustness of our method and the quality of the yielded uncertainty estimates on out-of-distribution datasets. We also illustrate that our method, despite relying on label regression, still enjoys superior model calibration compared to most deterministic baselines. &#13;
+ + + + RoMa: Robust Dense Feature Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Edstedt_RoMa_Robust_Dense_Feature_Matching_CVPR_2024_paper.pdf + Feature matching is an important computer vision task that involves estimating correspondences between two images of a 3D scene and dense methods estimate all such correspondences. The aim is to learn a robust model i.e. a model able to match under challenging real-world changes. In this work we propose such a model leveraging frozen pretrained features from the foundation model DINOv2. Although these features are significantly more robust than local features trained from scratch they are inherently coarse. We therefore combine them with specialized ConvNet fine features creating a precisely localizable feature pyramid. To further improve robustness we propose a tailored transformer match decoder that predicts anchor probabilities which enables it to express multimodality. Finally we propose an improved loss formulation through regression-by-classification with subsequent robust regression. We conduct a comprehensive set of experiments that show that our method RoMa achieves significant gains setting a new state-of-the-art. In particular we achieve a 36% improvement on the extremely challenging WxBS benchmark. Code is provided at github.com/Parskatt/RoMa. + + + + Insights from the Use of Previously Unseen Neural Architecture Search Datasets + http://openaccess.thecvf.com//content/CVPR2024/papers/Geada_Insights_from_the_Use_of_Previously_Unseen_Neural_Architecture_Search_CVPR_2024_paper.pdf + The boundless possibility of neural networks which can be used to solve a problem - each with different performance - leads to a situation where a Deep Learning expert is required to identify the best neural network. This goes against the hope of removing the need for experts. Neural Architecture Search (NAS) offers a solution to this by automatically identifying the best architecture. However to date NAS work has focused on a small set of datasets which we argue are not representative of real-world problems. We introduce eight new datasets created for a series of NAS Challenges: AddNIST Language MultNIST CIFARTile Gutenberg Isabella GeoClassing and Chesseract. These datasets and challenges are developed to direct attention to issues in NAS development and to encourage authors to consider how their models will perform on datasets unknown to them at development time. We present experimentation using standard Deep Learning methods as well as the best results from challenge participants + + + + Adversarially Robust Few-shot Learning via Parameter Co-distillation of Similarity and Class Concept Learners + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_Adversarially_Robust_Few-shot_Learning_via_Parameter_Co-distillation_of_Similarity_and_CVPR_2024_paper.pdf + Few-shot learning (FSL) facilitates a variety of computer vision tasks yet remains vulnerable to adversarial attacks. Existing adversarially robust FSL methods rely on either visual similarity learning or class concept learning. Our analysis reveals that these two learning paradigms are complementary exhibiting distinct robustness due to their unique decision boundary types (concepts clustering by the visual similarity label vs. classification by the class labels). To bridge this gap we propose a novel framework unifying adversarially robust similarity learning and class concept learning. 
Specifically we distill parameters from both network branches into a "unified embedding model" during robust optimization and redistribute them to individual network branches periodically. To capture generalizable robustness across diverse branches we initialize adversaries in each episode with cross-branch class-wise "global adversarial perturbations" instead of less informative random initialization. We also propose a branch robustness harmonization to modulate the optimization of similarity and class concept learners via their relative adversarial robustness. Extensive experiments demonstrate the state-of-the-art performance of our method in diverse few-shot scenarios. + + + + APISR: Anime Production Inspired Real-World Anime Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_APISR_Anime_Production_Inspired_Real-World_Anime_Super-Resolution_CVPR_2024_paper.pdf + While real-world anime super-resolution (SR) has gained increasing attention in the SR community existing methods still adopt techniques from the photorealistic domain. In this paper we analyze the anime production workflow and rethink how to use characteristics of it for the sake of the real-world anime SR. First we argue that video networks and datasets are not necessary for anime SR due to the repetition use of hand-drawing frames. Instead we propose an anime image collection pipeline by choosing the least compressed and the most informative frames from the video sources. Based on this pipeline we introduce the Anime Production-oriented Image (API) dataset. In addition we identify two anime-specific challenges of distorted and faint hand-drawn lines and unwanted color artifacts. We address the first issue by introducing a prediction-oriented compression module in the image degradation model and a pseudo-ground truth preparation with enhanced hand-drawn lines. In addition we introduce the balanced twin perceptual loss combining both anime and photorealistic high-level features to mitigate unwanted color artifacts and increase visual clarity. We evaluate our method through extensive experiments on the public benchmark showing our method outperforms state-of-the-art anime dataset-trained approaches. + + + + MVCPS-NeuS: Multi-view Constrained Photometric Stereo for Neural Surface Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Santo_MVCPS-NeuS_Multi-view_Constrained_Photometric_Stereo_for_Neural_Surface_Reconstruction_CVPR_2024_paper.pdf + Multi-view photometric stereo (MVPS) recovers a high-fidelity 3D shape of a scene by benefiting from both multi-view stereo and photometric stereo. While photometric stereo boosts detailed shape reconstruction it necessitates recording images under various light conditions for each viewpoint. In particular calibrating the light directions for each view significantly increases the cost of acquiring images. To make MVPS more accessible we introduce a practical and easy-to-implement setup multi-view constrained photometric stereo (MVCPS) where the light directions are unknown but constrained to move together with the camera. Unlike conventional multi-view uncalibrated photometric stereo our constrained setting reduces the ambiguities of surface normal estimates from per-view linear ambiguities to a single and global linear one thereby simplifying the disambiguation process. 
The proposed method integrates the ambiguous surface normal into neural surface reconstruction (NeuS) to simultaneously resolve the global ambiguity and estimate the detailed 3D shape. Experiments demonstrate that our method estimates accurate shapes under sparse viewpoints using only a few multi-view constrained light sources. + + + + ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Xue_ULIP-2_Towards_Scalable_Multimodal_Pre-training_for_3D_Understanding_CVPR_2024_paper.pdf + Recent advancements in multimodal pre-training have shown promising efficacy in 3D representation learning by aligning multimodal features across 3D shapes their 2D counterparts and language descriptions. However the methods used by existing frameworks to curate such multimodal data in particular language descriptions for 3D shapes are not scalable and the collected language descriptions are not diverse. To address this we introduce ULIP-2 a simple yet effective tri-modal pretraining framework that leverages large multimodal models to automatically generate holistic language descriptions for 3D shapes. It only needs 3D data as input eliminating the need for any manual 3D annotations and is therefore scalable to large datasets. ULIP-2 is also equipped with scaled-up backbones for better multi-modal representation learning. We conduct experiments on two large-scale 3D datasets Objaverse and ShapeNet and augment them with tri-modal datasets of 3D point clouds images and language for training ULIP-2. Experiments show that ULIP-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification standard 3D classification with fine-tuning and 3D captioning (3D-to-language generation). It achieves a new SOTA of 50.6% (top- 1) on Objaverse-LVIS and 84.7% (top-1) on ModelNet40 in zero-shot classification. In the ScanObjectNN benchmark for standard fine-tuning ULIP-2 reaches an overall accuracy of 91.5% with a compact model of only 1.4 million parameters. ULIP-2 sheds light on a new paradigm for scalable multimodal 3D representation learning without human annotations and shows significant improvements over existing baselines. The code and datasets are released at https://github.com/salesforce/ULIP. + + + + WaveMo: Learning Wavefront Modulations to See Through Scattering + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_WaveMo_Learning_Wavefront_Modulations_to_See_Through_Scattering_CVPR_2024_paper.pdf + Imaging through scattering media is a fundamental and pervasive challenge in fields ranging from medical diagnostics to astronomy. A promising strategy to overcome this challenge is wavefront modulation which induces measurement diversity during image acquisition. Despite its importance designing optimal wavefront modulations to image through scattering remains under-explored. This paper introduces a novel learning-based framework to address the gap. Our approach jointly optimizes wavefront modulations and a computationally lightweight feedforward "proxy" reconstruction network. This network is trained to recover scenes obscured by scattering using measurements that are modified by these modulations. The learned modulations produced by our framework generalize effectively to unseen scattering scenarios and exhibit remarkable versatility. During deployment the learned modulations can be decoupled from the proxy network to augment other more computationally expensive restoration algorithms. 
Through extensive experiments we demonstrate our approach significantly advances the state of the art in imaging through scattering media. Our project webpage is at https://wavemo-2024.github.io/. + + + + Integrating Efficient Optimal Transport and Functional Maps For Unsupervised Shape Correspondence Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Le_Integrating_Efficient_Optimal_Transport_and_Functional_Maps_For_Unsupervised_Shape_CVPR_2024_paper.pdf + In the realm of computer vision and graphics accurately establishing correspondences between geometric 3D shapes is pivotal for applications like object tracking registration texture transfer and statistical shape analysis. Moving beyond traditional hand-crafted and data-driven feature learning methods we incorporate spectral methods with deep learning focusing on functional maps (FMs) and optimal transport (OT). Traditional OT-based approaches often reliant on entropy regularization OT in learning-based framework face computational challenges due to their quadratic cost. Our key contribution is to employ the sliced Wasserstein distance (SWD) for OT which is a valid fast optimal transport metric in an unsupervised shape matching framework. This unsupervised framework integrates functional map regularizers with a novel OT-based loss derived from SWD enhancing feature alignment between shapes treated as discrete probability measures. We also introduce an adaptive refinement process utilizing entropy regularized OT further refining feature alignments for accurate point-to-point correspondences. Our method demonstrates superior performance in non-rigid shape matching including near-isometric and non-isometric scenarios and excels in downstream tasks like segmentation transfer. The empirical results on diverse datasets highlight our framework's effectiveness and generalization capabilities setting new standards in non-rigid shape matching with efficient OT metrics and an adaptive refinement module. + + + + ODCR: Orthogonal Decoupling Contrastive Regularization for Unpaired Image Dehazing + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_ODCR_Orthogonal_Decoupling_Contrastive_Regularization_for_Unpaired_Image_Dehazing_CVPR_2024_paper.pdf + Unpaired image dehazing (UID) holds significant research importance due to the challenges in acquiring haze/clear image pairs with identical backgrounds. This paper proposes a novel method for UID named Orthogonal Decoupling Contrastive Regularization (ODCR). Our method is grounded in the assumption that an image consists of both haze-related features which influence the degree of haze and haze-unrelated features such as texture and semantic information. ODCR aims to ensure that the haze-related features of the dehazing result closely resemble those of the clear image while the haze-unrelated features align with the input hazy image. To accomplish the motivation Orthogonal MLPs optimized geometrically on the Stiefel manifold are proposed which can project image features into an orthogonal space thereby reducing the relevance between different features. Furthermore a task-driven Depth-wise Feature Classifier (DWFC) is proposed which assigns weights to the orthogonal features based on the contribution of each channel's feature in predicting whether the feature source is hazy or clear in a self-supervised fashion. 
Finally a Weighted PatchNCE (WPNCE) loss is introduced to achieve the pulling of haze-related features in the output image toward those of clear images while bringing haze-unrelated features close to those of the hazy input. Extensive experiments demonstrate the superior performance of our ODCR method on UID. + + + + OmniSeg3D: Omniversal 3D Segmentation via Hierarchical Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Ying_OmniSeg3D_Omniversal_3D_Segmentation_via_Hierarchical_Contrastive_Learning_CVPR_2024_paper.pdf + Towards holistic understanding of 3D scenes a general 3D segmentation method is needed that can segment diverse objects without restrictions on object quantity or categories while also reflecting the inherent hierarchical structure. To achieve this we propose OmniSeg3D an omniversal segmentation method aims for segmenting anything in 3D all at once. The key insight is to lift multi-view inconsistent 2D segmentations into a consistent 3D feature field through a hierarchical contrastive learning framework which is accomplished by two steps. Firstly we design a novel hierarchical representation based on category-agnostic 2D segmentations to model the multi-level relationship among pixels. Secondly image features rendered from the 3D feature field are clustered at different levels which can be further drawn closer or pushed apart according to the hierarchical relationship between different levels. In tackling the challenges posed by inconsistent 2D segmentations this framework yields a global consistent 3D feature field which further enables hierarchical segmentation multi-object selection and global discretization. Extensive experiments demonstrate the effectiveness of our method on high-quality 3D segmentation and accurate hierarchical structure understanding. A graphical user interface further facilitates flexible interaction for omniversal 3D segmentation. + + + + Simple Semantic-Aided Few-Shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Simple_Semantic-Aided_Few-Shot_Learning_CVPR_2024_paper.pdf + Learning from a limited amount of data namely Few-Shot Learning stands out as a challenging computer vision task. Several works exploit semantics and design complicated semantic fusion mechanisms to compensate for rare representative features within restricted data. However relying on naive semantics such as class names introduces biases due to their brevity while acquiring extensive semantics from external knowledge takes a huge time and effort. This limitation severely constrains the potential of semantics in Few-Shot Learning. In this paper we design an automatic way called Semantic Evolution to generate high-quality semantics. The incorporation of high-quality semantics alleviates the need for complex network structures and learning algorithms used in previous works. Hence we employ a simple two-layer network termed Semantic Alignment Network to transform semantics and visual features into robust class prototypes with rich discriminative features for few-shot classification. The experimental results show our framework outperforms all previous methods on six benchmarks demonstrating a simple network with high-quality semantics can beat intricate multi-modal modules on few-shot classification tasks. Code is available at https://github.com/zhangdoudou123/SemFew. 
+ + + + Leveraging Cross-Modal Neighbor Representation for Improved CLIP Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Yi_Leveraging_Cross-Modal_Neighbor_Representation_for_Improved_CLIP_Classification_CVPR_2024_paper.pdf + CLIP showcases exceptional cross-modal matching capabilities due to its training on image-text contrastive learning tasks. However without specific optimization for unimodal scenarios its performance in single-modality feature extraction might be suboptimal. Despite this some studies have directly used CLIP's image encoder for tasks like few-shot classification introducing a misalignment between its pre-training objectives and feature extraction methods. This inconsistency can diminish the quality of the image's feature representation adversely affecting CLIP's effectiveness in target tasks. In this paper we view text features as precise neighbors of image features in CLIP's space and present a novel CrOss-moDal nEighbor Representation (CODER) based on the distance structure between images and their neighbor texts. This feature extraction method aligns better with CLIP's pre-training objectives thereby fully leveraging CLIP's robust cross-modal capabilities. The key to construct a high-quality CODER lies in how to create a vast amount of high-quality and diverse texts to match with images. We introduce the Auto Text Generator (ATG) to automatically produce the required text in a data-free and training-free manner. We apply CODER to CLIP's zero-shot and few-shot image classification tasks. Experiment results across various datasets and models confirm CODER's effectiveness. Code is available at: https://github.com/YCaigogogo/CVPR24-CODER. + + + + Revisiting Adversarial Training Under Long-Tailed Distributions + http://openaccess.thecvf.com//content/CVPR2024/papers/Yue_Revisiting_Adversarial_Training_Under_Long-Tailed_Distributions_CVPR_2024_paper.pdf + Deep neural networks are vulnerable to adversarial attacks leading to erroneous outputs. Adversarial training has been recognized as one of the most effective methods to counter such attacks. However existing adversarial training techniques have predominantly been evaluated on balanced datasets whereas real-world data often exhibit a long-tailed distribution casting doubt on the efficacy of these methods in practical scenarios. In this paper we delve into the performance of adversarial training under long-tailed distributions. Through an analysis of the prior method "RoBal" (Wu et al. CVPR'21) we discover that utilizing Balanced Softmax Loss (BSL) alone can obtain comparable performance to the complete RoBal approach while significantly reducing the training overhead. Then we reveal that adversarial training under long-tailed distributions also suffers from robust overfitting similar to uniform distributions. We explore utilizing data augmentation to mitigate this issue and unexpectedly discover that unlike results obtained with balanced data data augmentation not only effectively alleviates robust overfitting but also significantly improves robustness. We further identify that the improvement is attributed to the increased diversity of training data. Extensive experiments further corroborate that data augmentation alone can significantly improve robustness. Finally building on these findings we demonstrate that compared to RoBal the combination of BSL and data augmentation leads to a +6.66% improvement in model robustness under AutoAttack on CIFAR-10-LT. 
Our code is available at: https://github.com/NISPLab/AT-BSL. + + + + Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Monkey_Image_Resolution_and_Text_Label_Are_Important_Things_for_CVPR_2024_paper.pdf + Large Multimodal Models (LMMs) have shown promise in vision-language tasks but struggle with high-resolution input and detailed scene understanding. Addressing these challenges we introduce Monkey to enhance LMM capabilities. Firstly Monkey processes input images by dividing them into uniform patches each matching the size (e.g. 448x448) used in the original training of the well-trained vision encoder. Equipped with individual adapter for each patch Monkey can handle higher resolutions up to 1344x896 pixels enabling the detailed capture of complex visual information. Secondly it employs a multi-level description generation method enriching the context for scene-object associations. This two-part strategy ensures more effective learning from generated data: the higher resolution allows for a more detailed capture of visuals which in turn enhances the effectiveness of comprehensive descriptions. Extensive ablative results validate the effectiveness of our designs. Additionally experiments on 18 datasets further demonstrate that Monkey surpasses existing LMMs in many tasks like Image Captioning and various Visual Question Answering formats. Specially in qualitative tests focused on dense text question answering Monkey has exhibited encouraging results compared with GPT4V. Code is available at https://github.com/Yuliang-Liu/Monkey. + + + + Decompose-and-Compose: A Compositional Approach to Mitigating Spurious Correlation + http://openaccess.thecvf.com//content/CVPR2024/papers/Noohdani_Decompose-and-Compose_A_Compositional_Approach_to_Mitigating_Spurious_Correlation_CVPR_2024_paper.pdf + While standard Empirical Risk Minimization (ERM) training is proven effective for image classification on in-distribution data it fails to perform well on out-of-distribution samples. One of the main sources of distribution shift for image classification is the compositional nature of images. Specifically in addition to the main object or component(s) determining the label some other image components usually exist which may lead to the shift of input distribution between train and test environments. More importantly these components may have spurious correlations with the label. To address this issue we propose Decompose-and-Compose (DaC) which improves robustness to correlation shift by a compositional approach based on combining elements of images. Based on our observations models trained with ERM usually highly attend to either the causal components or the components having a high spurious correlation with the label (especially in datapoints on which models have a high confidence). In fact according to the amount of spurious correlation and the easiness of classification based on the causal or non-causal components the model usually attends to one of these more (on samples with high confidence). Following this we first try to identify the causal components of images using class activation maps of models trained with ERM. Afterward we intervene on images by combining them and retraining the model on the augmented data including the counterfactual ones. 
This work proposes a group-balancing method by intervening on images without requiring group labels or information regarding the spurious features during training. The method has an overall better worst group accuracy compared to previous methods with the same amount of supervision on the group labels in correlation shift. Our code is available at https://github.com/fhn98/DaC. + + + + BEM: Balanced and Entropy-based Mix for Long-Tailed Semi-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_BEM_Balanced_and_Entropy-based_Mix_for_Long-Tailed_Semi-Supervised_Learning_CVPR_2024_paper.pdf + Data mixing methods play a crucial role in semi-supervised learning (SSL) but their application is unexplored in long-tailed semi-supervised learning (LTSSL). The primary reason is that the in-batch mixing manner fails to address class imbalance. Furthermore existing LTSSL methods mainly focus on re-balancing data quantity but ignore class-wise uncertainty which is also vital for class balance. For instance some classes with sufficient samples might still exhibit high uncertainty due to indistinguishable features. To this end this paper introduces the Balanced and Entropy-based Mix (BEM) a pioneering mixing approach to re-balance the class distribution of both data quantity and uncertainty. Specifically we first propose a class balanced mix bank to store data of each class for mixing. This bank samples data based on the estimated quantity distribution thus re-balancing data quantity. Then we present an entropy-based learning approach to re-balance class-wise uncertainty including entropy-based sampling strategy entropy-based selection module and entropy-based class balanced loss. Our BEM first leverages data mixing for improving LTSSL and it can also serve as a complement to the existing re-balancing methods. Experimental results show that BEM significantly enhances various LTSSL frameworks and achieves state-of-the-art performances across multiple benchmarks. + + + + HUGS: Holistic Urban 3D Scene Understanding via Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_HUGS_Holistic_Urban_3D_Scene_Understanding_via_Gaussian_Splatting_CVPR_2024_paper.pdf + Holistic understanding of urban scenes based on RGB images is a challenging yet important problem. It encompasses understanding both the geometry and appearance to enable novel view synthesis parsing semantic labels and tracking moving objects. Despite considerable progress existing approaches often focus on specific aspects of this task and require additional inputs such as LiDAR scans or manually annotated 3D bounding boxes. In this paper we introduce a novel pipeline that utilizes 3D Gaussian Splatting for holistic urban scene understanding. Our main idea involves the joint optimization of geometry appearance semantics and motion using a combination of static and dynamic 3D Gaussians where moving object poses are regularized via physical constraints. Our approach offers the ability to render new viewpoints in real-time yielding 2D and 3D semantic information with high accuracy and reconstruct dynamic scenes even in scenarios where 3D bounding box detection are highly noisy. Experimental results on KITTI KITTI-360 and Virtual KITTI 2 demonstrate the effectiveness of our approach. Our project page is at https://xdimlab.github.io/hugs_website. 
+ + + + GeoAuxNet: Towards Universal 3D Representation Learning for Multi-sensor Point Clouds + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_GeoAuxNet_Towards_Universal_3D_Representation_Learning_for_Multi-sensor_Point_Clouds_CVPR_2024_paper.pdf + Point clouds captured by different sensors, such as RGB-D cameras and LiDAR, possess non-negligible domain gaps. Most existing methods design different network architectures and train separately on point clouds from various sensors. Typically, point-based methods achieve outstanding performance on evenly distributed dense point clouds from RGB-D cameras, while voxel-based methods are more efficient for large-range sparse LiDAR point clouds. In this paper, we propose geometry-to-voxel auxiliary learning to enable voxel representations to access point-level geometric information, which supports better generalisation of the voxel-based backbone with additional interpretations of multi-sensor point clouds. Specifically, we construct hierarchical geometry pools generated by a voxel-guided dynamic point network, which efficiently provide auxiliary fine-grained geometric information adapted to different stages of voxel features. We conduct experiments on joint multi-sensor datasets to demonstrate the effectiveness of GeoAuxNet. Enjoying elaborate geometric information, our method outperforms other models collectively trained on multi-sensor datasets and achieves competitive results with state-of-the-art experts on each single dataset. + + + + Unveiling the Power of Audio-Visual Early Fusion Transformers with Dense Interactions through Masked Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Mo_Unveiling_the_Power_of_Audio-Visual_Early_Fusion_Transformers_with_Dense_CVPR_2024_paper.pdf + Humans possess a remarkable ability to integrate auditory and visual information, enabling a deeper understanding of the surrounding environment. This early fusion of audio and visual cues, demonstrated through cognitive psychology and neuroscience research, offers promising potential for developing multimodal perception models. However, training early fusion architectures poses significant challenges, as the increased model expressivity requires robust learning frameworks to harness their enhanced capabilities. In this paper, we address this challenge by leveraging the masked reconstruction framework, previously successful in unimodal settings, to train audio-visual encoders with early fusion. Additionally, we propose an attention-based fusion module that captures interactions between local audio and visual representations, enhancing the model's ability to capture fine-grained interactions. While effective, this procedure can become computationally intractable as the number of local representations increases. Thus, to address the computational complexity, we propose an alternative procedure that factorizes the local representations before representing audio-visual interactions. Extensive evaluations on a variety of datasets demonstrate the superiority of our approach in audio-event classification, visual sound localization, sound separation and audio-visual segmentation. These contributions enable the efficient training of deeply integrated audio-visual models and significantly advance the usefulness of early fusion architectures.
+ + + + RepKPU: Point Cloud Upsampling with Kernel Point Representation and Deformation + http://openaccess.thecvf.com//content/CVPR2024/papers/Rong_RepKPU_Point_Cloud_Upsampling_with_Kernel_Point_Representation_and_Deformation_CVPR_2024_paper.pdf + In this work we present RepKPU an efficient network for point cloud upsampling. We propose to promote upsampling performance by exploiting better shape representation and point generation strategy. Inspired by KPConv we propose a novel representation called RepKPoints to effectively characterize the local geometry whose advantages over prior representations are as follows: (1) density-sensitive; (2) large receptive fields; (3) position-adaptive which makes RepKPoints a generalized form of previous representations. Moreover we propose a novel paradigm namely Kernel-to-Displacement generation for point generation where point cloud upsampling is reformulated as the deformation of kernel points. Specifically we propose KP-Queries which is a set of kernel points with predefined positions and learned features to serve as the initial state of upsampling. Using cross-attention mechanisms we achieve interactions between RepKPoints and KP-Queries and subsequently KP-Queries are converted to displacement features followed by a MLP to predict the new positions of KP-Queries which serve as the generated points. Extensive experimental results demonstrate that RepKPU outperforms state-of-the-art methods on several widely-used benchmark datasets with high efficiency. + + + + ConCon-Chi: Concept-Context Chimera Benchmark for Personalized Vision-Language Tasks + http://openaccess.thecvf.com//content/CVPR2024/papers/Rosasco_ConCon-Chi_Concept-Context_Chimera_Benchmark_for_Personalized_Vision-Language_Tasks_CVPR_2024_paper.pdf + While recent Vision-Language (VL) models excel at open-vocabulary tasks it is unclear how to use them with specific or uncommon concepts. Personalized Text-to-Image Retrieval (TIR) or Generation (TIG) are recently introduced tasks that represent this challenge where the VL model has to learn a concept from few images and respectively discriminate or generate images of the target concept in arbitrary contexts. We identify the ability to learn new meanings and their compositionality with known ones as two key properties of a personalized system. We show that the available benchmarks offer a limited validation of personalized textual concept learning from images with respect to the above properties and introduce ConCon-Chi as a benchmark for both personalized TIR and TIG designed to fill this gap. We modelled the new-meaning concepts by crafting chimeric objects and formulating a large varied set of contexts where we photographed each object. To promote the compositionality assessment of the learned concepts with known contexts we combined different contexts with the same concept and vice-versa. We carry out a thorough evaluation of state-of-the-art methods on the resulting dataset. Our study suggests that future work on personalized TIR and TIG methods should focus on the above key properties and we propose principles and a dataset for their performance assessment. Dataset: https://doi.org/10.48557/QJ1166 and code: https://github.com/hsp-iit/concon-chi_benchmark. 
+ + + + MeshGPT: Generating Triangle Meshes with Decoder-Only Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Siddiqui_MeshGPT_Generating_Triangle_Meshes_with_Decoder-Only_Transformers_CVPR_2024_paper.pdf + We introduce MeshGPT, a new approach for generating triangle meshes that reflects the compactness typical of artist-created meshes, in contrast to dense triangle meshes extracted by iso-surfacing methods from neural fields. Inspired by recent advances in powerful large language models, we adopt a sequence-based approach to autoregressively generate triangle meshes as sequences of triangles. We first learn a vocabulary of latent quantized embeddings using graph convolutions, which inform these embeddings of the local mesh geometry and topology. These embeddings are sequenced and decoded into triangles by a decoder, ensuring that they can effectively reconstruct the mesh. A transformer is then trained on this learned vocabulary to predict the index of the next embedding given previous embeddings. Once trained, our model can be autoregressively sampled to generate new triangle meshes, directly producing compact meshes with sharp edges that more closely imitate the efficient triangulation patterns of human-crafted meshes. MeshGPT demonstrates a notable improvement over state-of-the-art mesh generation methods, with a 9% increase in shape coverage and a 30-point enhancement in FID scores across various categories. + + + + Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance + http://openaccess.thecvf.com//content/CVPR2024/papers/Garber_Image_Restoration_by_Denoising_Diffusion_Models_with_Iteratively_Preconditioned_Guidance_CVPR_2024_paper.pdf + Training deep neural networks has become a common approach for addressing image restoration problems. An alternative to training a "task-specific" network for each observation model is to use pretrained deep denoisers for imposing only the signal's prior within iterative algorithms, without additional training. Recently, a sampling-based variant of this approach has become popular with the rise of diffusion/score-based generative models. Using denoisers for general-purpose restoration requires guiding the iterations to ensure agreement of the signal with the observations. In low-noise settings, guidance that is based on back-projection (BP) has been shown to be a promising strategy (used recently also under the names "pseudoinverse" or "range/null-space" guidance). However, the presence of noise in the observations hinders the gains from this approach. In this paper, we propose a novel guidance technique, based on preconditioning, that allows traversing from BP-based guidance to least-squares-based guidance along the restoration scheme. The proposed approach is robust to noise while still having a much simpler implementation than alternative methods (e.g., it does not require SVD or a large number of iterations). We use it within both an optimization scheme and a sampling-based scheme, and demonstrate its advantages over existing methods for image deblurring and super-resolution. + + + + MTMMC: A Large-Scale Real-World Multi-Modal Camera Tracking Benchmark + http://openaccess.thecvf.com//content/CVPR2024/papers/Woo_MTMMC_A_Large-Scale_Real-World_Multi-Modal_Camera_Tracking_Benchmark_CVPR_2024_paper.pdf + Multi-target multi-camera tracking is a crucial task that involves identifying and tracking individuals over time using video streams from multiple cameras.
This task has practical applications in various fields such as visual surveillance crowd behavior analysis and anomaly detection. However due to the difficulty and cost of collecting and labeling data existing datasets for this task are either synthetically generated or artificially constructed within a controlled camera network setting which limits their ability to model real-world dynamics and generalize to diverse camera configurations. To address this issue we present MTMMC a real-world large-scale dataset that includes long video sequences captured by 16 multi-modal cameras in two different environments - campus and factory - across various time weather and season conditions. This dataset provides a challenging test bed for studying multi-camera tracking under diverse real-world complexities and includes an additional input modality of spatially aligned and temporally synchronized RGB and thermal cameras which enhances the accuracy of multi-camera tracking. MTMMC is a super-set of existing datasets benefiting independent fields such as person detection re-identification and multiple object tracking. We provide baselines and new learning setups on this dataset and set the reference scores for future studies. The datasets models and test server will be made publicly available. + + + + DAP: A Dynamic Adversarial Patch for Evading Person Detectors + http://openaccess.thecvf.com//content/CVPR2024/papers/Guesmi_DAP_A_Dynamic_Adversarial_Patch_for_Evading_Person_Detectors_CVPR_2024_paper.pdf + Patch-based adversarial attacks were proven to compromise the robustness and reliability of computer vision systems. However their conspicuous and easily detectable nature challenge their practicality in real-world setting. To address this recent work has proposed using Generative Adversarial Networks (GANs) to generate naturalistic patches that may not attract human attention. However such approaches suffer from a limited latent space making it challenging to produce a patch that is efficient stealthy and robust to multiple real-world transformations. This paper introduces a novel approach that produces a Dynamic Adversarial Patch (DAP) designed to overcome these limitations. DAP maintains a naturalistic appearance while optimizing attack efficiency and robustness to real-world transformations. The approach involves redefining the optimization problem and introducing a novel objective function that incorporates a similarity metric to guide the patch's creation. Unlike GAN-based techniques the DAP directly modifies pixel values within the patch providing increased flexibility and adaptability to multiple transformations. Furthermore most clothing-based physical attacks assume static objects and ignore the possible transformations caused by non-rigid deformation due to changes in a person's pose. To address this limitation a `Creases Transformation' (CT) block is introduced enhancing the patch's resilience to a variety of real-world distortions. Experimental results demonstrate that the proposed approach outperforms state-of-the-art attacks achieving a success rate of up to 82.28% in the digital world when targeting the YOLOv7 detector and 65% in the physical world when targeting YOLOv3tiny detector deployed in edge-based smart cameras. 
+ + + + Learned Lossless Image Compression based on Bit Plane Slicing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Learned_Lossless_Image_Compression_based_on_Bit_Plane_Slicing_CVPR_2024_paper.pdf + Autoregressive Initial Bits (ArIB) a framework that combines subimage autoregression and latent variable models has shown its advantages in lossless image compression. However in current methods the image splitting makes the information of latent variables being uniformly distributed in each subimage and causes inadequate use of latent variables in addition to posterior collapse. To tackle these issues we introduce Bit Plane Slicing (BPS) splitting images in the bit plane dimension with the considerations on different importance for latent variables. Thus BPS provides a more effective representation by arranging subimages with decreasing importance for latent variables. To solve the problem of the increased number of dimensions caused by BPS we further propose a dimension-tailored autoregressive model that tailors autoregression methods for each dimension based on their characteristics efficiently capturing the dependencies in plane space and color dimensions. As shown in the extensive experimental results our method demonstrates the superior compression performance with comparable inference speed when compared to the state-of-the-art normalizing-flow-based methods. The code is at https://github.com/ZZ022/ArIB-BPS. + + + + Flexible Depth Completion for Sparse and Varying Point Densities + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Flexible_Depth_Completion_for_Sparse_and_Varying_Point_Densities_CVPR_2024_paper.pdf + While recent depth completion methods have achieved remarkable results filling in relatively dense depth maps (e.g. projected 64-line LiDAR on KITTI or 500 sampled points on NYUv2) with RGB guidance their performance on very sparse input (e.g. 4-line LiDAR or 32 depth point measurements) is unverified. These sparser regimes present new challenges as a 4-line LiDAR increases the distance between pixels without depth and their nearest depth point sixfold from 5 pixels to 30 pixels compared to 64 lines. Observing that existing methods struggle with sparse and variable distribution depth maps we propose an Affinity-Based Shift Correction (ASC) module that iteratively aligns depth predictions to input depth based on predicted affinities between image pixels and depth points. Our framework enables each depth point to adaptively influence and improve predictions across the image leading to largely improved results for fewer-line fewer-point and variable sparsity settings. Further we show improved performance in domain transfer from KITTI to nuScenes and from random sampling to irregular point distributions. Our correction module can easily be added to any depth completion or RGB-only depth estimation model notably allowing the latter to perform both completion and estimation with a single model. + + + + Shadows Don't Lie and Lines Can't Bend! Generative Models don't know Projective Geometry...for now + http://openaccess.thecvf.com//content/CVPR2024/papers/Sarkar_Shadows_Dont_Lie_and_Lines_Cant_Bend_Generative_Models_dont_CVPR_2024_paper.pdf + Generative models can produce impressively realistic images. This paper demonstrates that generated images have geometric features different from those of real images. We build a set of collections of generated images prequalified to fool simple signal-based classifiers into believing they are real. 
We then show that prequalified generated images can be identified reliably by classifiers that only look at geometric properties. We use three such classifiers. All three classifiers are denied access to image pixels and look only at derived geometric features. The first classifier looks at the perspective field of the image the second looks at lines detected in the image and the third looks at relations between detected objects and shadows. Our procedure detects generated images more reliably than SOTA local signal based detectors for images from a number of distinct generators. Saliency maps suggest that the classifiers can identify geometric problems reliably. We conclude that current generators cannot reliably reproduce geometric properties of real images. + + + + GEARS: Local Geometry-aware Hand-object Interaction Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_GEARS_Local_Geometry-aware_Hand-object_Interaction_Synthesis_CVPR_2024_paper.pdf + Generating realistic hand motion sequences in interaction with objects has gained increasing attention with the growing interest in digital humans. Prior work has illustrated the effectiveness of employing occupancy-based or distance-based virtual sensors to extract hand-object interaction features. Nonetheless these methods show limited generalizability across object categories shapes and sizes. We hypothesize that this is due to two reasons: 1) the limited expressiveness of employed virtual sensors and 2) scarcity of available training data. To tackle this challenge we introduce a novel joint-centered sensor designed to reason about local object geometry near potential interaction regions. The sensor queries for object surface points in the neighbourhood of each hand joint. As an important step towards mitigating the learning complexity we transform the points from global frame to hand template frame and use a shared module to process sensor features of each individual joint. This is followed by a spatio-temporal transformer network aimed at capturing correlation among the joints in different dimensions. Moreover we devise simple heuristic rules to augment the limited training sequences with vast static hand grasping samples. This leads to a broader spectrum of grasping types observed during training in turn enhancing our model's generalization capability. We evaluate on two public datasets GRAB and InterCap where our method shows superiority over baselines both quantitatively and perceptually. + + + + CodedEvents: Optimal Point-Spread-Function Engineering for 3D-Tracking with Event Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Shah_CodedEvents_Optimal_Point-Spread-Function_Engineering_for_3D-Tracking_with_Event_Cameras_CVPR_2024_paper.pdf + Point-spread-function (PSF) engineering is a well-established computational imaging technique that uses phase masks and other optical elements to embed extra information (e.g. depth) into the images captured by conventional CMOS image sensors. To date however PSF-engineering has not been applied to neuromorphic event cameras; a powerful new image sensing technology that responds to changes in the log-intensity of light. This paper establishes theoretical limits (Cramer Rao bounds) on 3D point localization and tracking with PSF-engineered event cameras. Using these bounds we first demonstrate that existing Fisher phase masks are already near-optimal for localizing static flashing point sources (e.g. blinking fluorescent molecules). 
We then demonstrate that existing designs are sub-optimal for tracking moving point sources and proceed to use our theory to design optimal phase masks and binary amplitude masks for this task. To overcome the non-convexity of the design problem we leverage novel implicit neural representation based parameterizations of the phase and amplitude masks. We demonstrate the efficacy of our designs through extensive simulations. We also validate our method with a simple prototype. + + + + Learning Discriminative Dynamics with Label Corruption for Noisy Label Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Learning_Discriminative_Dynamics_with_Label_Corruption_for_Noisy_Label_Detection_CVPR_2024_paper.pdf + Label noise commonly found in real-world datasets has a detrimental impact on a model's generalization. To effectively detect incorrectly labeled instances previous works have mostly relied on distinguishable training signals such as training loss as indicators to differentiate between clean and noisy labels. However they have limitations in that the training signals incompletely reveal the model's behavior and are not effectively generalized to various noise types resulting in limited detection accuracy. In this paper we propose DynaCor framework that distinguishes incorrectly labeled instances from correctly labeled ones based on the dynamics of the training signals. To cope with the absence of supervision for clean and noisy labels DynaCor first introduces a label corruption strategy that augments the original dataset with intentionally corrupted labels enabling indirect simulation of the model's behavior on noisy labels. Then DynaCor learns to identify clean and noisy instances by inducing two clearly distinguishable clusters from the latent representations of training dynamics. Our comprehensive experiments show that DynaCor outperforms the state-of-the-art competitors and shows strong robustness to various noise types and noise rates. + + + + DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Bai_DiPrompT_Disentangled_Prompt_Tuning_for_Multiple_Latent_Domain_Generalization_in_CVPR_2024_paper.pdf + Federated learning (FL) has emerged as a powerful paradigm for learning from decentralized data and federated domain generalization further considers the test dataset (target domain) is absent from the decentralized training data (source domains). However most existing FL methods assume that domain labels are provided during training and their evaluation imposes explicit constraints on the number of domains which must strictly match the number of clients. Because of the underutilization of numerous edge devices and additional cross-client domain annotations in the real world such restrictions may be impractical and involve potential privacy leaks. In this paper we propose an efficient and novel approach called Disentangled Prompt Tuning (DiPrompT) a method that tackles the above restrictions by learning adaptive prompts for domain generalization in a distributed manner. Specifically we first design two types of prompts i.e. global prompt to capture general knowledge across all clients and domain prompts to capture domain-specific knowledge. They eliminate the restriction on the one-to-one mapping between source domains and local clients. 
Furthermore a dynamic query metric is introduced to automatically search the suitable domain label for each sample which includes two-substep text-image alignments based on prompt tuning without labor-intensive annotation. Extensive experiments on multiple datasets demonstrate that our DiPrompT achieves superior domain generalization performance over state-of-the-art FL methods when domain labels are not provided and even outperforms many centralized learning methods using domain labels. + + + + Adversarial Distillation Based on Slack Matching and Attribution Region Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_Adversarial_Distillation_Based_on_Slack_Matching_and_Attribution_Region_Alignment_CVPR_2024_paper.pdf + Adversarial distillation (AD) is a highly effective method for enhancing the robustness of small models. Contrary to expectations a high-performing teacher model does not always result in a more robust student model. This is due to two main reasons. First when there are significant differences in predictions between the teacher model and the student model exact matching of predicted values using KL divergence interferes with training leading to poor performance of existing methods. Second matching solely based on the output prevents the student model from fully understanding the behavior of the teacher model. To address these challenges this paper proposes a novel AD method named SmaraAD. During the training process we facilitate the student model in better understanding the teacher model's behavior by aligning the attribution region that the student model focuses on with that of the teacher model. Concurrently we relax the condition of exact matching in KL divergence and replace it with a more flexible matching criterion thereby enhancing the model's robustness. Extensive experiments substantiate the effectiveness of our method in improving the robustness of small models outperforming previous SOTA methods. + + + + Boosting Spike Camera Image Reconstruction from a Perspective of Dealing with Spike Fluctuations + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Boosting_Spike_Camera_Image_Reconstruction_from_a_Perspective_of_Dealing_CVPR_2024_paper.pdf + As a bio-inspired vision sensor with ultra-high speed spike cameras exhibit great potential in recording dynamic scenes with high-speed motion or drastic light changes. Different from traditional cameras each pixel in spike cameras records the arrival of photons continuously by firing binary spikes at an ultra-fine temporal granularity. In this process multiple factors impact the imaging including the photons' Poisson arrival thermal noises from circuits and quantization effects in spike readout. These factors introduce fluctuations to spikes making the recorded spike intervals unstable and unable to reflect accurate light intensities. In this paper we present an approach to deal with spike fluctuations and boost spike camera image reconstruction. We first analyze the quantization effects and reveal the unbiased estimation attribute of the reciprocal of differential of spike firing time (DSFT). Based on this we propose a spike representation module to use DSFT with multiple orders for fluctuation suppression where DSFT with higher orders indicates spike integration duration between multiple spikes. We also propose a module for inter-moment feature alignment at multiple granularities. 
The coarser alignment is based on patch-level cross-attention with a local search strategy and the finer alignment is based on deformable convolution at the pixel level. Experimental results demonstrate the effectiveness of our method on both synthetic and real-captured data. The source code and dataset are available at https://github.com/ruizhao26/BSF. + + + + Text-guided Explorable Image Super-resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Gandikota_Text-guided_Explorable_Image_Super-resolution_CVPR_2024_paper.pdf + In this paper we introduce the problem of zero-shot text-guided exploration of the solutions to open-domain image super-resolution. Our goal is to allow users to explore diverse semantically accurate reconstructions that preserve data consistency with the low-resolution inputs for different large downsampling factors without explicitly training for these specific degradations. We propose two approaches for zero-shot text-guided super-resolution - i) modifying the generative process of text-to-image (T2I) diffusion models to promote consistency with low-resolution inputs and ii) incorporating language guidance into zero-shot diffusion-based restoration methods. We show that the proposed approaches result in diverse solutions that match the semantic meaning provided by the text prompt while preserving data consistency with the degraded inputs. We evaluate the proposed baselines for the task of extreme super-resolution and demonstrate advantages in terms of restoration quality diversity and explorability of solutions. + + + + Improving the Generalization of Segmentation Foundation Model under Distribution Shift via Weakly Supervised Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Improving_the_Generalization_of_Segmentation_Foundation_Model_under_Distribution_Shift_CVPR_2024_paper.pdf + The success of large language models has inspired the computer vision community to explore image segmentation foundation model that is able to zero/few-shot generalize through prompt engineering. Segment-Anything (SAM) among others is the state-of-the-art image segmentation foundation model demonstrating strong zero/few-shot generalization. Despite the success recent studies reveal the weakness of SAM under strong distribution shift. In particular SAM performs awkwardly on corrupted natural images camouflaged images medical images etc. Motivated by the observations we aim to develop a self-training based strategy to adapt SAM to target distribution. Given the unique challenges of large source dataset high computation cost and incorrect pseudo label we propose a weakly supervised self-training architecture with anchor regularization and low-rank finetuning to improve the robustness and computation efficiency of adaptation. We validate the effectiveness on 5 types of downstream segmentation tasks including natural clean/corrupted images medical images camouflaged images and robotic images. Our proposed method is task-agnostic in nature and outperforms pre-trained SAM and state-of-the-art domain adaptation methods on almost all downstream tasks with the same testing prompt inputs. 
+ + + + Correspondence-Free Non-Rigid Point Set Registration Using Unsupervised Clustering Analysis + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Correspondence-Free_Non-Rigid_Point_Set_Registration_Using_Unsupervised_Clustering_Analysis_CVPR_2024_paper.pdf + This paper presents a novel non-rigid point set registration method that is inspired by unsupervised clustering analysis. Unlike previous approaches that treat the source and target point sets as separate entities, we develop a holistic framework where they are formulated as clustering centroids and clustering members, separately. We then adopt Tikhonov regularization with an ℓ1-induced Laplacian kernel, instead of the commonly used Gaussian kernel, to ensure smooth and more robust displacement fields. Our formulation delivers closed-form solutions, theoretical guarantees, independence from dimensions, and the ability to handle large deformations. Subsequently, we introduce a clustering-improved Nyström method to effectively reduce the computational complexity and storage of the Gram matrix to linear, while providing a rigorous bound for the low-rank approximation. Our method achieves high-accuracy results across various scenarios and surpasses competitors by a significant margin, particularly on shapes with substantial deformations. Additionally, we demonstrate the versatility of our method in challenging tasks such as shape transfer and medical registration. + + + + BadCLIP: Trigger-Aware Prompt Learning for Backdoor Attacks on CLIP + http://openaccess.thecvf.com//content/CVPR2024/papers/Bai_BadCLIP_Trigger-Aware_Prompt_Learning_for_Backdoor_Attacks_on_CLIP_CVPR_2024_paper.pdf + Contrastive Vision-Language Pre-training, known as CLIP, has shown promising effectiveness in addressing downstream image recognition tasks. However, recent works revealed that the CLIP model can be implanted with a downstream-oriented backdoor. On downstream tasks, the victim model performs well on clean samples but predicts a specific target class whenever a specific trigger is present. For injecting a backdoor, existing attacks depend on a large amount of additional data to maliciously fine-tune the entire pre-trained CLIP model, which makes them inapplicable to data-limited scenarios. In this work, motivated by the recent success of learnable prompts, we address this problem by injecting a backdoor into the CLIP model in the prompt learning stage. Our method, named BadCLIP, is built on a novel and effective mechanism in backdoor attacks on CLIP, i.e., influencing both the image and text encoders with the trigger. It consists of a learnable trigger applied to images and a trigger-aware context generator, such that the trigger can change text features via trigger-aware prompts, resulting in a powerful and generalizable attack. Extensive experiments conducted on 11 datasets verify that the clean accuracy of BadCLIP is similar to that of advanced prompt learning methods and the attack success rate is higher than 99% in most cases. BadCLIP is also generalizable to unseen classes and shows a strong generalization capability under cross-dataset and cross-domain settings. The code is available at https://github.com/jiawangbai/BadCLIP.
+ + + + PixelRNN: In-pixel Recurrent Neural Networks for End-to-end-optimized Perception with Neural Sensors + http://openaccess.thecvf.com//content/CVPR2024/papers/So_PixelRNN_In-pixel_Recurrent_Neural_Networks_for_End-to-end-optimized_Perception_with_Neural_CVPR_2024_paper.pdf + Conventional image sensors digitize high-resolution images at fast frame rates producing a large amount of data that needs to be transmitted off the sensor for further processing. This is challenging for perception systems operating on edge devices because communication is power inefficient and induces latency. Fueled by innovations in stacked image sensor fabrication emerging sensor--processors offer programmability and processing capabilities directly on the sensor. We exploit these capabilities by developing an efficient recurrent neural network architecture PixelRNN that encodes spatio-temporal features on the sensor using purely binary operations. PixelRNN reduces the amount of data to be transmitted off the sensor by factors up to 256 compared to the raw sensor data while offering competitive accuracy for hand gesture recognition and lip reading tasks. We experimentally validate PixelRNN using a prototype implementation on the SCAMP-5 sensor--processor platform. + + + + DUSt3R: Geometric 3D Vision Made Easy + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_DUSt3R_Geometric_3D_Vision_Made_Easy_CVPR_2024_paper.pdf + Multi-view stereo reconstruction (MVS) in the wild requires to first estimate the camera intrinsic and extrinsic parameters. These are usually tedious and cumbersome to obtain yet they are mandatory to triangulate corresponding pixels in 3D space which is at the core of all best performing MVS algorithms. In this work we take an opposite stance and introduce DUSt3R a radically novel paradigm for Dense and Unconstrained Stereo 3D Reconstruction of arbitrary image collections operating without prior information about camera calibration nor viewpoint poses. We cast the pairwise reconstruction problem as a regression of pointmaps relaxing the hard constraints of usual projective camera models. We show that this formulation smoothly unifies the monocular and binocular reconstruction cases. In the case where more than two images are provided we further propose a simple yet effective global alignment strategy that expresses all pairwise pointmaps in a common reference frame. We base our network architecture on standard Transformer encoders and decoders allowing us to leverage powerful pretrained models. Our formulation directly provides a 3D model of the scene as well as depth information but interestingly we can seamlessly recover from it pixel matches focal lengths relative and absolute cameras. Extensive experiments on all these tasks showcase how DUSt3R effectively unifies various 3D vision tasks setting new performance records on monocular & multi-view depth estimation as well as relative pose estimation. In summary DUSt3R makes many geometric 3D vision tasks easy. Code and models at https://github.com/naver/dust3r + + + + Robust Distillation via Untargeted and Targeted Intermediate Adversarial Samples + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_Robust_Distillation_via_Untargeted_and_Targeted_Intermediate_Adversarial_Samples_CVPR_2024_paper.pdf + Adversarially robust knowledge distillation aims to compress large-scale models into lightweight models while preserving adversarial robustness and natural performance on a given dataset. 
Existing methods typically align probability distributions of natural and adversarial samples between teacher and student models but they overlook intermediate adversarial samples along the "adversarial path" formed by the multi-step gradient ascent of a sample towards the decision boundary. Such paths capture rich information about the decision boundary. In this paper we propose a novel adversarially robust knowledge distillation approach by incorporating such adversarial paths into the alignment process. Recognizing the diverse impacts of intermediate adversarial samples (ranging from benign to noisy) we propose an adaptive weighting strategy to selectively emphasize informative adversarial samples thus ensuring efficient utilization of lightweight model capacity. Moreover we propose a dual-branch mechanism exploiting two following insights: (i) complementary dynamics of adversarial paths obtained by targeted and untargeted adversarial learning and (ii) inherent differences between the gradient ascent path from class c_i towards the nearest class boundary and the gradient descent path from a specific class c_j towards the decision region of c_i (i \neq j). Comprehensive experiments demonstrate the effectiveness of our method on lightweight models under various settings. + + + + Soften to Defend: Towards Adversarial Robustness via Self-Guided Label Refinement + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Soften_to_Defend_Towards_Adversarial_Robustness_via_Self-Guided_Label_Refinement_CVPR_2024_paper.pdf + Adversarial training (AT) is currently one of the most effective ways to obtain the robustness of deep neural networks against adversarial attacks. However most AT methods suffer from robust overfitting i.e. a significant generalization gap in adversarial robustness between the training and testing curves. In this paper we first identify a connection between robust overfitting and the excessive memorization of noisy labels in AT from a view of gradient norm. As such label noise is mainly caused by a distribution mismatch and improper label assignments we are motivated to propose a label refinement approach for AT. Specifically our Self-Guided Label Refinement first self-refines a more accurate and informative label distribution from over-confident hard labels and then it calibrates the training by dynamically incorporating knowledge from self-distilled models into the current model and thus requiring no external teachers. Empirical results demonstrate that our method can simultaneously boost the standard accuracy and robust performance across multiple benchmark datasets attack types and architectures. In addition we also provide a set of analyses from the perspectives of information theory to dive into our method and suggest the importance of soft labels for robust generalization. + + + + Pose-Guided Self-Training with Two-Stage Clustering for Unsupervised Landmark Discovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Tourani_Pose-Guided_Self-Training_with_Two-Stage_Clustering_for_Unsupervised_Landmark_Discovery_CVPR_2024_paper.pdf + Unsupervised landmarks discovery (ULD) for an object category is a challenging computer vision problem. In pursuit of developing a robust ULD framework we explore the potential of a recent paradigm of self-supervised learning algorithms known as diffusion models. Some recent works have shown that these models implicitly contain important correspondence cues. 
Towards harnessing the potential of diffusion models for ULD task we make the following core contributions. First we propose a ZeroShot ULD baseline based on simple clustering of random pixel locations with nearest neighbour matching. It delivers better results than the existing ULD methods. Second motivated by the ZeroShot performance we develop a ULD algorithm based on diffusion features using self-training and clustering which also outperforms prior methods by notable margins. Third we introduce a new proxy task based on generating latent pose codes and also propose a two-stage clustering mechanism to facilitate effective pseudo-labeling resulting in a significant performance improvement. Overall our approach consistently outperforms state-of-the-art methods on four challenging benchmarks AFLW MAFL CatHeads and LS3D by significant margins. + + + + Learning from Synthetic Human Group Activities + http://openaccess.thecvf.com//content/CVPR2024/papers/Chang_Learning_from_Synthetic_Human_Group_Activities_CVPR_2024_paper.pdf + The study of complex human interactions and group activities has become a focal point in human-centric computer vision. However progress in related tasks is often hindered by the challenges of obtaining large-scale labeled datasets from real-world scenarios. To address the limitation we introduce M3Act a synthetic data generator for multi-view multi-group multi-person human atomic actions and group activities. Powered by Unity Engine M3Act features multiple semantic groups highly diverse and photorealistic images and a comprehensive set of annotations which facilitates the learning of human-centered tasks across single-person multi-person and multi-group conditions. We demonstrate the advantages of M3Act across three core experiments. The results suggest our synthetic dataset can significantly improve the performance of several downstream methods and replace real-world datasets to reduce cost. Notably M3Act improves the state-of-the-art MOTRv2 on DanceTrack dataset leading to a hop on the leaderboard from 10th to 2nd place. Moreover M3Act opens new research for controllable 3D group activity generation. We define multiple metrics and propose a competitive baseline for the novel task. Our code and data are available at our project page: http://cjerry1243.github.io/M3Act. + + + + Text Grouping Adapter: Adapting Pre-trained Text Detector for Layout Analysis + http://openaccess.thecvf.com//content/CVPR2024/papers/Bi_Text_Grouping_Adapter_Adapting_Pre-trained_Text_Detector_for_Layout_Analysis_CVPR_2024_paper.pdf + Significant progress has been made in scene text detection models since the rise of deep learning but scene text layout analysis which aims to group detected text instances as paragraphs has not kept pace. Previous works either treated text detection and grouping using separate models or train a model from scratch while using a unified one. All of them have not yet made full use of the already well-trained text detectors and easily obtainable detection datasets. In this paper we present Text Grouping Adapter (TGA) a module that can enable the utilization of various pre-trained text detectors to learn layout analysis allowing us to adopt a well-trained text detector right off the shelf or just fine-tune it efficiently. Designed to be compatible with various text detector architectures TGA takes detected text regions and image features as universal inputs to assemble text instance features. 
To capture broader contextual information for layout analysis, we propose to predict text group masks from text instance features by one-to-many assignment. Our comprehensive experiments demonstrate that, even with frozen pre-trained models, incorporating our TGA into various pre-trained text detectors and text spotters can achieve superior layout analysis performance while simultaneously inheriting generalized text detection ability from pre-training. In the case of full-parameter fine-tuning, we can further improve layout analysis performance. + + + + THRONE: An Object-based Hallucination Benchmark for the Free-form Generations of Large Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Kaul_THRONE_An_Object-based_Hallucination_Benchmark_for_the_Free-form_Generations_of_CVPR_2024_paper.pdf + Mitigating hallucinations in large vision-language models (LVLMs) remains an open problem. Recent benchmarks do not address hallucinations in open-ended free-form responses, which we term "Type I hallucinations". Instead, they focus on hallucinations responding to very specific question formats (typically a multiple-choice response regarding a particular object or attribute), which we term "Type II hallucinations". Additionally, such benchmarks often require external API calls to models which are subject to change. In practice, we observe that a reduction in Type II hallucinations does not lead to a reduction in Type I hallucinations; rather, the two forms of hallucinations are often anti-correlated. To address this, we propose THRONE, a novel object-based automatic framework for quantitatively evaluating Type I hallucinations in LVLM free-form outputs. We use public language models (LMs) to identify hallucinations in LVLM responses and compute informative metrics. By evaluating a large selection of recent LVLMs using public datasets, we show that an improvement in existing metrics does not lead to a reduction in Type I hallucinations, and that established benchmarks for measuring Type I hallucinations are incomplete. Finally, we provide a simple and effective data augmentation method to reduce Type I and Type II hallucinations as a strong baseline. + + + + LUWA Dataset: Learning Lithic Use-Wear Analysis on Microscopic Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_LUWA_Dataset_Learning_Lithic_Use-Wear_Analysis_on_Microscopic_Images_CVPR_2024_paper.pdf + Lithic Use-Wear Analysis (LUWA) using microscopic images is an underexplored vision-for-science research area. It seeks to distinguish the worked material, which is critical for understanding archaeological artifacts, material interactions, tool functionalities and dental records. However, this challenging task goes beyond the well-studied image classification problem for common objects. It is affected by many confounders owing to the complex wear mechanism and microscopic imaging, which makes it difficult even for human experts to identify the worked material successfully. In this paper, we investigate the following three questions on this unique vision task for the first time: (i) How well can state-of-the-art pre-trained models (like DINOv2) generalize to the rarely seen domain? (ii) How can few-shot learning be exploited for scarce microscopic images? (iii) How do the ambiguous magnification and sensing modality influence the classification accuracy?
To study these, we collaborated with archaeologists and built the first open-source and largest LUWA dataset, containing 23,130 microscopic images with different magnifications and sensing modalities. Extensive experiments show that existing pre-trained models notably outperform human experts but still leave a large gap for improvement. Most importantly, the LUWA dataset provides an underexplored opportunity for the vision and learning communities and complements existing image classification problems on common objects. + + + + The Audio-Visual Conversational Graph: From an Egocentric-Exocentric Perspective + http://openaccess.thecvf.com//content/CVPR2024/papers/Jia_The_Audio-Visual_Conversational_Graph_From_an_Egocentric-Exocentric_Perspective_CVPR_2024_paper.pdf + In recent years, the thriving development of research related to egocentric videos has provided a unique perspective for the study of conversational interactions, where both visual and audio signals play a crucial role. While most prior work focuses on learning about behaviors that directly involve the camera wearer, we introduce the Ego-Exocentric Conversational Graph Prediction problem, marking the first attempt to infer exocentric conversational interactions from egocentric videos. We propose a unified multi-modal framework, Audio-Visual Conversational Attention (AV-CONV), for the joint prediction of conversation behaviors (speaking and listening) for both the camera wearer as well as all other social partners present in the egocentric video. Specifically, we adopt the self-attention mechanism to model the representations across time, across subjects and across modalities. To validate our method, we conduct experiments on a challenging egocentric video dataset that includes multi-speaker and multi-conversation scenarios. Our results demonstrate the superior performance of our method compared to a series of baselines. We also present detailed ablation studies to assess the contribution of each component in our model. Check our project page: https://vjwq.github.io/AV-CONV/. + + + + Byzantine-robust Decentralized Federated Learning via Dual-domain Clustering and Trust Bootstrapping + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Byzantine-robust_Decentralized_Federated_Learning_via_Dual-domain_Clustering_and_Trust_Bootstrapping_CVPR_2024_paper.pdf + Decentralized federated learning (DFL) facilitates collaborative model training across multiple connected clients without a central coordination server, thereby avoiding the single point of failure in traditional centralized federated learning (CFL). However, DFL exhibits heightened susceptibility to Byzantine attacks owing to the lack of a responsible central server. Furthermore, a benign client in DFL may be dominated by Byzantine clients (more than half of its neighbors are malicious), posing significant challenges for robust model training. In this work, we propose DFL-Dual, a novel Byzantine-robust DFL method based on dual-domain client clustering and trust bootstrapping. Specifically, we first propose to leverage both data-domain and model-domain distance metrics to identify client discrepancies. Then, we design a trust evaluation mechanism centered on benign clients, which enables them to evaluate their neighbors. Building upon the dual-domain distance metric and trust evaluation mechanism, we further develop a two-stage clustering and trust bootstrapping technique to exclude Byzantine clients from local model aggregation.
We extensively evaluate the proposed DFL-Dual method through rigorous experimentation demonstrating its remarkable performance superiority over existing robust CFL and DFL schemes. + + + + No More Ambiguity in 360deg Room Layout via Bi-Layout Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Tsai_No_More_Ambiguity_in_360deg_Room_Layout_via_Bi-Layout_Estimation_CVPR_2024_paper.pdf + Inherent ambiguity in layout annotations poses significant challenges to developing accurate 360deg room layout estimation models. To address this issue we propose a novel Bi-Layout model capable of predicting two distinct layout types. One stops at ambiguous regions while the other extends to encompass all visible areas. Our model employs two global context embeddings where each embedding is designed to capture specific contextual information for each layout type. With our novel feature guidance module the image feature retrieves relevant context from these embeddings generating layout-aware features for precise bi-layout predictions. A unique property of our Bi-Layout model is its ability to inherently detect ambiguous regions by comparing the two predictions. To circumvent the need for manual correction of ambiguous annotations during testing we also introduce a new metric for disambiguating ground truth layouts. Our method demonstrates superior performance on benchmark datasets notably outperforming leading approaches. Specifically on the MatterportLayout dataset it improves 3DIoU from 81.70% to 82.57% across the full test set and notably from 54.80% to 59.97% in subsets with significant ambiguity. + + + + A Noisy Elephant in the Room: Is Your Out-of-Distribution Detector Robust to Label Noise? + http://openaccess.thecvf.com//content/CVPR2024/papers/Humblot-Renaux_A_Noisy_Elephant_in_the_Room_Is_Your_Out-of-Distribution_Detector_CVPR_2024_paper.pdf + The ability to detect unfamiliar or unexpected images is essential for safe deployment of computer vision systems. In the context of classification the task of detecting images outside of a model's training domain is known as out-of-distribution (OOD) detection. While there has been a growing research interest in developing post-hoc OOD detection methods there has been comparably little discussion around how these methods perform when the underlying classifier is not trained on a clean carefully curated dataset. In this work we take a closer look at 20 state-of-the-art OOD detection methods in the (more realistic) scenario where the labels used to train the underlying classifier are unreliable (e.g. crowd-sourced or web-scraped labels). Extensive experiments across different datasets noise types & levels architectures and checkpointing strategies provide insights into the effect of class label noise on OOD detection and show that poor separation between incorrectly classified ID samples vs. OOD samples is an overlooked yet important limitation of existing methods. Code: https://github.com/glhr/ood-labelnoise + + + + VideoMAC: Video Masked Autoencoders Meet ConvNets + http://openaccess.thecvf.com//content/CVPR2024/papers/Pei_VideoMAC_Video_Masked_Autoencoders_Meet_ConvNets_CVPR_2024_paper.pdf + Recently the advancement of self-supervised learning techniques like masked autoencoders (MAE) has greatly influenced visual representation learning for images and videos. 
Nevertheless it is worth noting that the predominant approaches in existing masked image / video modeling rely excessively on resource-intensive vision transformers (ViTs) as the feature encoder. In this paper we propose a new approach termed as VideoMAC which combines video masked autoencoders with resource-friendly ConvNets. Specifically VideoMAC employs symmetric masking on randomly sampled pairs of video frames. To prevent the issue of mask pattern dissipation we utilize ConvNets which are implemented with sparse convolutional operators as encoders. Simultaneously we present a simple yet effective masked video modeling (MVM) approach a dual encoder architecture comprising an online encoder and an exponential moving average target encoder aimed to facilitate inter-frame reconstruction consistency in videos. Additionally we demonstrate that VideoMAC empowering classical (ResNet) / modern (ConvNeXt) convolutional encoders to harness the benefits of MVM outperforms ViT-based approaches on downstream tasks including video object segmentation (+5.2% / 6.4% J&F) body part propagation (+6.3% / 3.1% mIoU) and human pose tracking (+10.2% / 11.1% PCK@0.1). + + + + Unsigned Orthogonal Distance Fields: An Accurate Neural Implicit Representation for Diverse 3D Shapes + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_Unsigned_Orthogonal_Distance_Fields_An_Accurate_Neural_Implicit_Representation_for_CVPR_2024_paper.pdf + Neural implicit representation of geometric shapes has witnessed considerable advancements in recent years. However common distance field based implicit representations specifically signed distance field (SDF) for watertight shapes or unsigned distance field (UDF) for arbitrary shapes routinely suffer from degradation of reconstruction accuracy when converting to explicit surface points and meshes. In this paper we introduce a novel neural implicit representation based on unsigned orthogonal distance fields (UODFs). In UODFs the minimal unsigned distance from any spatial point to the shape surface is defined solely in one orthogonal direction contrasting with the multi-directional determination made by SDF and UDF. Consequently every point in the 3D UODFs can directly access its closest surface points along three orthogonal directions. This distinctive feature leverages the accurate reconstruction of surface points without interpolation errors. We verify the effectiveness of UODFs through a range of reconstruction examples extending from simple watertight or non-watertight shapes to complex shapes that include hollows internal or assembling structures. + + + + OA-CNNs: Omni-Adaptive Sparse CNNs for 3D Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_OA-CNNs_Omni-Adaptive_Sparse_CNNs_for_3D_Semantic_Segmentation_CVPR_2024_paper.pdf + The booming of 3D recognition in the 2020s began with the introduction of point cloud transformers. They quickly overwhelmed sparse CNNs and became state-of-the-art models especially in 3D semantic segmentation. However sparse CNNs are still valuable networks due to their efficiency treasure and ease of application. In this work we reexamine the design distinctions and test the limits of what a sparse CNN can achieve. We discover that the key credit to the performance difference is adaptivity. Specifically we propose two key components i.e. adaptive receptive fields (spatially) and adaptive relation to bridge the gap. 
This exploration led to the creation of Omni-Adaptive 3D CNNs (OA-CNNs) a family of networks that integrates a lightweight module to greatly enhance the adaptivity of sparse CNNs at minimal computational cost. Without any self-attention modules OA-CNNs favorably surpass point transformers in terms of accuracy in both indoor and outdoor scenes with much less latency and memory cost. Notably it achieves 76.1% 78.9% and 70.6% mIoU on ScanNet v2 nuScenes and SemanticKITTI validation benchmarks respectively while maintaining at most 5x better speed than transformer counterparts. This revelation highlights the potential of pure sparse CNNs to outperform transformer-related networks. Our code is built upon Pointcept which is available at https://github.com/Pointcept/Pointcept. + + + + Generative Image Dynamics + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Generative_Image_Dynamics_CVPR_2024_paper.pdf + We present an approach to modeling an image-space prior on scene motion. Our prior is learned from a collection of motion trajectories extracted from real video sequences depicting natural oscillatory dynamics of objects such as trees flowers candles and clothes swaying in the wind. We model dense long-term motion in the Fourier domain as spectral volumes which we find are well-suited to prediction with diffusion models. Given a single image our trained model uses a frequency-coordinated diffusion sampling process to predict a spectral volume which can be converted into a motion texture that spans an entire video. Along with an image-based rendering module the predicted motion representation can be used for a number of downstream applications such as turning still images into seamlessly looping videos or allowing users to interact with objects in real images producing realistic simulated dynamics (by interpreting the spectral volumes as image-space modal bases). See our project page for more results: generative-dynamics.github.io + + + + On the Test-Time Zero-Shot Generalization of Vision-Language Models: Do We Really Need Prompt Learning? + http://openaccess.thecvf.com//content/CVPR2024/papers/Zanella_On_the_Test-Time_Zero-Shot_Generalization_of_Vision-Language_Models_Do_We_CVPR_2024_paper.pdf + The development of large vision-language models notably CLIP has catalyzed research into effective adaptation techniques with a particular focus on soft prompt tuning. Conjointly test-time augmentation which utilizes multiple augmented views of a single image to enhance zero-shot generalization is emerging as a significant area of interest. This has predominantly directed research efforts towards test-time prompt tuning. In contrast we introduce a robust MeanShift for Test-time Augmentation (MTA) which surpasses prompt-based methods without requiring this intensive training procedure. This positions MTA as an ideal solution for both standalone and API-based applications. Additionally our method does not rely on ad hoc rules (e.g. confidence threshold) used in some previous test-time augmentation techniques to filter the augmented views. Instead MTA incorporates a quality assessment variable for each view directly into its optimization process termed as the inlierness score. This score is jointly optimized with a density mode seeking process leading to an efficient training- and hyperparameter-free approach. We extensively benchmark our method on 15 datasets and demonstrate MTA's superiority and computational efficiency. 
Deployed easily as plug-and-play module on top of zero-shot models and state-of-the-art few-shot methods MTA shows systematic and consistent improvements. + + + + Beyond Text: Frozen Large Language Models in Visual Signal Comprehension + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Beyond_Text_Frozen_Large_Language_Models_in_Visual_Signal_Comprehension_CVPR_2024_paper.pdf + In this work we investigate the potential of a large language model (LLM) to directly comprehend visual signals without the necessity of fine-tuning on multi-modal datasets. The foundational concept of our method views an image as a linguistic entity and translates it to a set of discrete words derived from the LLM's vocabulary. To achieve this we present the Vision-to-Language Tokenizer abbreviated as V2T Tokenizer which transforms an image into a "foreign language" with the combined aid of an encoder-decoder the LLM vocabulary and a CLIP model. With this innovative image encoding the LLM gains the ability not only for visual comprehension but also for image denoising and restoration in an auto-regressive fashion--crucially without any fine-tuning. We undertake rigorous experiments to validate our method encompassing understanding tasks like image recognition image captioning and visual question answering as well as image denoising tasks like inpainting outpainting deblurring and shift restoration. Code and models are available at https://github.com/zh460045050/V2L-Tokenizer. + + + + Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Rotated_Multi-Scale_Interaction_Network_for_Referring_Remote_Sensing_Image_Segmentation_CVPR_2024_paper.pdf + Referring Remote Sensing Image Segmentation (RRSIS) is a new challenge that combines computer vision and natural language processing. Traditional Referring Image Segmentation (RIS) approaches have been impeded by the complex spatial scales and orientations found in aerial imagery leading to suboptimal segmentation results. To address these challenges we introduce the Rotated Multi-Scale Interaction Network (RMSIN) an innovative approach designed for the unique demands of RRSIS. RMSIN incorporates an Intra-scale Interaction Module (IIM) to effectively address the fine-grained detail required at multiple scales and a Cross-scale Interaction Module (CIM) for integrating these details coherently across the network. Furthermore RMSIN employs an Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects a novel contribution that significantly enhances segmentation accuracy. To assess the efficacy of RMSIN we have curated an expansive dataset comprising 17402 image-caption-mask triplets which is unparalleled in terms of scale and variety. This dataset not only presents the model with a wide range of spatial and rotational scenarios but also establishes a stringent benchmark for the RRSIS task ensuring a rigorous evaluation of performance. Experimental evaluations demonstrate the exceptional performance of RMSIN surpassing existing state-of-the-art models by a significant margin. Datasets and code are available at https://github.com/Lsan2401/RMSIN. 
+ + + + GLACE: Global Local Accelerated Coordinate Encoding + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_GLACE_Global_Local_Accelerated_Coordinate_Encoding_CVPR_2024_paper.pdf + Scene coordinate regression (SCR) methods are a family of visual localization methods that directly regress 2D-3D matches for camera pose estimation. They are effective in small-scale scenes but face significant challenges in large-scale scenes that are further amplified in the absence of ground truth 3D point clouds for supervision. Here the model can only rely on reprojection constraints and needs to implicitly triangulate the points. The challenges stem from a fundamental dilemma: The network has to be invariant to observations of the same landmark at different viewpoints and lighting conditions etc. but at the same time discriminate unrelated but similar observations. The latter becomes more relevant and severe in larger scenes. In this work we tackle this problem by introducing the concept of co-visibility to the network. We propose GLACE which integrates pre-trained global and local encodings and enables SCR to scale to large scenes with only a single small-sized network. Specifically we propose a novel feature diffusion technique that implicitly groups the reprojection constraints with co-visibility and avoids overfitting to trivial solutions. Additionally our position decoder parameterizes the output positions for large-scale scenes more effectively. Without using 3D models or depth maps for supervision our method achieves state-of-the-art results on large-scale scenes with a low-map-size model. On Cambridge landmarks with a single model we achieve a 17% lower median position error than Poker the ensemble variant of the state-of-the-art SCR method ACE. Code is available at: https://github.com/cvg/glace. + + + + Localization Is All You Evaluate: Data Leakage in Online Mapping Datasets and How to Fix It + http://openaccess.thecvf.com//content/CVPR2024/papers/Lilja_Localization_Is_All_You_Evaluate_Data_Leakage_in_Online_Mapping_CVPR_2024_paper.pdf + The task of online mapping is to predict a local map using current sensor observations e.g. from lidar and camera without relying on a pre-built map. State-of-the-art methods are based on supervised learning and are trained predominantly using two datasets: nuScenes and Argoverse 2. However these datasets revisit the same geographic locations across training validation and test sets. Specifically over 80% of nuScenes and 40% of Argoverse 2 validation and test samples are less than 5 m from a training sample. At test time the methods are thus evaluated more on how well they localize within a memorized implicit map built from the training data than on extrapolating to unseen locations. Naturally this data leakage causes inflated performance numbers and we propose geographically disjoint data splits to reveal the true performance in unseen environments. Experimental results show that methods perform considerably worse some dropping more than 45 mAP when trained and evaluated on proper data splits. Additionally a reassessment of prior design choices reveals diverging conclusions from those based on the original split. Notably the impact of lifting methods and the support from auxiliary tasks (e.g. depth supervision) on performance appears less substantial or follows a different trajectory than previously perceived. 
+ + + + Alchemist: Parametric Control of Material Properties with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Sharma_Alchemist_Parametric_Control_of_Material_Properties_with_Diffusion_Models_CVPR_2024_paper.pdf + We propose a method to control material attributes of objects like roughness metallic albedo and transparency in real images. Our method capitalizes on the generative prior of text-to-image models known for photorealism employing a scalar value and instructions to alter low-level material properties. Addressing the lack of datasets with controlled material attributes we generated an object-centric synthetic dataset with physically-based materials. Fine-tuning a modified pre-trained text-to-image model on this synthetic dataset enables us to edit material properties in real-world images while preserving all other attributes. We show the potential application of our model to material edited NeRFs. + + + + MoDE: CLIP Data Experts via Clustering + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_MoDE_CLIP_Data_Experts_via_Clustering_CVPR_2024_paper.pdf + The success of contrastive language-image pretraining (CLIP) relies on the supervision from the pairing between images and captions which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster being less sensitive to false negative noises in other clusters. At inference time we ensemble their outputs by applying weights determined through the correlation between task metadata and cluster conditions. To estimate the correlation precisely the samples in one cluster should be semantically similar but the number of data experts should still be reasonable for training and inference. As such we consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 by OpenAI CLIP and OpenCLIP on zero-shot image classification but with less (<35%) training cost. Meanwhile MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available here. + + + + FineSports: A Multi-person Hierarchical Sports Video Dataset for Fine-grained Action Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_FineSports_A_Multi-person_Hierarchical_Sports_Video_Dataset_for_Fine-grained_Action_CVPR_2024_paper.pdf + Fine-grained action analysis in multi-person sports is complex due to athletes' quick movements and intense physical confrontations which result in severe visual obstructions in most scenes. In addition accessible multi-person sports video datasets lack fine-grained action annotations in both space and time adding to the difficulty in fine-grained action analysis. To this end we construct a new multi-person basketball sports video dataset named FineSports which contains fine-grained semantic and spatial-temporal annotations on 10000 NBA game videos covering 52 fine-grained action types 16000 action instances and 123000 spatial-temporal bounding boxes. We also propose a new prompt-driven spatial-temporal action location approach called PoSTAL composed of a prompt-driven target action encoder (PTA) and an action tube-specific detector (ATD) to directly generate target action tubes with fine-grained action types without any off-line proposal generation. 
Extensive experiments on the FineSports dataset demonstrate that PoSTAL outperforms state-of-the-art methods. Data and code are available at https://github.com/PKU-ICST-MIPL/FineSports_CVPR2024. + + + + GARField: Group Anything with Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_GARField_Group_Anything_with_Radiance_Fields_CVPR_2024_paper.pdf + Grouping is inherently ambiguous due to the multiple levels of granularity in which one can decompose a scene --- should the wheels of an excavator be considered separate or part of the whole? We propose Group Anything with Radiance Fields (GARField) an approach for decomposing 3D scenes into a hierarchy of semantically meaningful groups from posed image inputs. To do this we embrace group ambiguity through physical scale: by optimizing a scale-conditioned 3D affinity feature field a point in the world can belong to different groups of different sizes. We optimize this field from a set of 2D masks provided by Segment Anything (SAM) in a way that respects coarse-to-fine hierarchy using scale to consistently fuse conflicting masks from different viewpoints. From this field we can derive a hierarchy of possible groupings via automatic tree construction or user interaction. We evaluate GARField on a variety of in-the-wild scenes and find it effectively extracts groups at many levels: clusters of objects objects and various subparts. GARField inherently represents multi-view consistent groupings and produces higher fidelity groups than the input SAM masks. GARField's hierarchical grouping could have exciting downstream applications such as 3D asset extraction or dynamic scene understanding. Project site: https://www.garfield.studio/ + + + + Learning Equi-angular Representations for Online Continual Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Seo_Learning_Equi-angular_Representations_for_Online_Continual_Learning_CVPR_2024_paper.pdf + Online continual learning suffers from an underfitted solution due to insufficient training for prompt model updates (e.g. single-epoch training). To address the challenge we propose an efficient online continual learning method using the neural collapse phenomenon. In particular we induce neural collapse to form a simplex equiangular tight frame (ETF) structure in the representation space so that the continuously learned model with a single epoch can better fit to the streamed data by proposing preparatory data training and residual correction in the representation space. With an extensive set of empirical validations using CIFAR-10/100 TinyImageNet ImageNet-200 and ImageNet-1K we show that our proposed method outperforms state-of-the-art methods by a noticeable margin in various online continual learning scenarios such as disjoint and Gaussian scheduled continuous (i.e. boundary-free) data setups. + + + + POCE: Primal Policy Optimization with Conservative Estimation for Multi-constraint Offline Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Guan_POCE_Primal_Policy_Optimization_with_Conservative_Estimation_for_Multi-constraint_Offline_CVPR_2024_paper.pdf + Multi-constraint offline reinforcement learning (RL) promises to learn policies that satisfy both cumulative and state-wise costs from offline datasets. This arrangement provides an effective approach for the widespread application of RL in high-risk scenarios where both cumulative and state-wise costs need to be considered simultaneously. 
However previous constrained offline RL algorithms are primarily designed to handle single-constraint problems related to cumulative cost and face challenges when addressing multi-constraint tasks that involve both cumulative and state-wise costs. In this work we propose a novel Primal policy Optimization with Conservative Estimation algorithm (POCE) to address the problem of multi-constraint offline RL. Concretely we reframe the objective of multi-constraint offline RL by introducing the concept of Maximum Markov Decision Processes (MMDP). Subsequently we present a primal policy optimization algorithm to confront the multi-constraint problems which improves the stability and convergence speed of model training. Furthermore we propose a conditional Bellman operator to estimate cumulative and state-wise Q-values reducing the extrapolation error caused by out-of-distribution (OOD) actions. Finally extensive experiments demonstrate that the POCE algorithm achieves competitive performance across multiple experimental tasks particularly outperforming baseline algorithms in terms of safety. Our code is available at https://github.com/guanjiayi/poce. + + + + Masked Spatial Propagation Network for Sparsity-Adaptive Depth Refinement + http://openaccess.thecvf.com//content/CVPR2024/papers/Jun_Masked_Spatial_Propagation_Network_for_Sparsity-Adaptive_Depth_Refinement_CVPR_2024_paper.pdf + The main function of depth completion is to compensate for an insufficient and unpredictable number of sparse depth measurements of hardware sensors. However existing research on depth completion assumes that the sparsity --- the number of points or LiDAR lines --- is fixed for training and testing. Hence the completion performance drops severely when the number of sparse depths changes significantly. To address this issue we propose the sparsity-adaptive depth refinement (SDR) framework which refines monocular depth estimates using sparse depth points. For SDR we propose the masked spatial propagation network (MSPN) to perform SDR with a varying number of sparse depths effectively by gradually propagating sparse depth information throughout the entire depth map. Experimental results demonstrate that MSPN achieves state-of-the-art performance on both SDR and conventional depth completion scenarios. + + + + C3Net: Compound Conditioned ControlNet for Multimodal Content Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_C3Net_Compound_Conditioned_ControlNet_for_Multimodal_Content_Generation_CVPR_2024_paper.pdf + We present Compound Conditioned ControlNet C3Net a novel generative neural architecture taking conditions from multiple modalities and synthesizing multimodal contents simultaneously (e.g. image text audio). C3Net adapts the ControlNet architecture to jointly train and make inferences on a production-ready diffusion model and its trainable copies. Specifically C3Net first aligns the conditions from multi-modalities to the same semantic latent space using modality-specific encoders based on contrastive training. Then it generates multimodal outputs based on the aligned latent space whose semantic information is combined using a ControlNet-like architecture called Control C3-UNet. Correspondingly with this system design our model offers an improved solution for joint-modality generation through learning and explaining multimodal conditions involving more than just linear interpolation within the latent space. 
Meanwhile as we align conditions to a unified latent space C3Net only requires one trainable Control C3-UNet to work on multimodal semantic information. Furthermore our model employs unimodal pretraining on the condition alignment stage outperforming the non-pretrained alignment even on relatively scarce training data and thus demonstrating high-quality compound condition generation. We contribute the first high-quality tri-modal validation set to validate quantitatively that C3Net outperforms or is on par with the first and contemporary state-of-the-art multimodal generation. Our codes and tri-modal dataset will be released. + + + + Adapt Before Comparison: A New Perspective on Cross-Domain Few-Shot Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Herzog_Adapt_Before_Comparison_A_New_Perspective_on_Cross-Domain_Few-Shot_Segmentation_CVPR_2024_paper.pdf + Few-shot segmentation performance declines substantially when facing images from a domain different than the training domain effectively limiting real-world use cases. To alleviate this recently cross-domain few-shot segmentation (CD-FSS) has emerged. Works that address this task mainly attempted to learn segmentation on a source domain in a manner that generalizes across domains. Surprisingly we can outperform these approaches while eliminating the training stage and removing their main segmentation network. We show test-time task-adaption is the key for successful CD-FSS instead. Task-adaption is achieved by appending small networks to the feature pyramid of a conventionally classification-pretrained backbone. To avoid overfitting to the few labeled samples in supervised fine-tuning consistency across augmented views of input images serves as guidance while learning the parameters of the attached layers. Despite our self-restriction not to use any images other than the few labeled samples at test time we achieve new state-of-the-art performance in CD-FSS evidencing the need to rethink approaches for the task. Code is available at https://github.com/Vision-Kek/ABCDFSS. + + + + Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_Insect-Foundation_A_Foundation_Model_and_Large-scale_1M_Dataset_for_Visual_CVPR_2024_paper.pdf + In precision agriculture the detection and recognition of insects play an essential role in the ability of crops to grow healthy and produce a high-quality yield. The current machine vision model requires a large volume of data to achieve high performance. However there are approximately 5.5 million different insect species in the world. None of the existing insect datasets can cover even a fraction of them due to varying geographic locations and acquisition costs. In this paper we introduce a novel "Insect-1M" dataset a game-changing resource poised to revolutionize insect-related foundation model training. Covering a vast spectrum of insect species our dataset including 1 million images with dense identification labels of taxonomy hierarchy and insect descriptions offers a panoramic view of entomology enabling foundation models to comprehend visual and semantic information about insects like never before. Then to efficiently establish an Insect Foundation Model we develop a micro-feature self-supervised learning method with a Patch-wise Relevant Attention mechanism capable of discerning the subtle differences among insect images. 
In addition we introduce Description Consistency loss to improve micro-feature modeling via insect descriptions. Through our experiments we illustrate the effectiveness of our proposed approach in insect modeling and achieve State-of-the-Art performance on standard benchmarks of insect-related tasks. Our Insect Foundation Model and Dataset promise to empower the next generation of insect-related vision models bringing them closer to the ultimate goal of precision agriculture. + + + + Data-Efficient Multimodal Fusion on a Single GPU + http://openaccess.thecvf.com//content/CVPR2024/papers/Vouitsis_Data-Efficient_Multimodal_Fusion_on_a_Single_GPU_CVPR_2024_paper.pdf + The goal of multimodal alignment is to learn a single latent space that is shared between multimodal inputs. The most powerful models in this space have been trained using massive datasets of paired inputs and large-scale computational resources making them prohibitively expensive to train in many practical scenarios. We surmise that existing unimodal encoders pre-trained on large amounts of unimodal data should provide an effective bootstrap to create multimodal models from unimodal ones at much lower costs. We therefore propose FuseMix a multimodal augmentation scheme that operates on the latent spaces of arbitrary pre-trained unimodal encoders. Using FuseMix for multimodal alignment we achieve competitive performance - and in certain cases outperform state-of-the-art methods - in both image-text and audio-text retrieval with orders of magnitude less compute and data: for example we outperform CLIP on the Flickr30K text-to-image retrieval task with ~600x fewer GPU days and ~80x fewer image-text pairs. Additionally we show how our method can be applied to convert pre-trained text-to-image generative models into audio-to-image ones. Code is available at: https://github.com/layer6ai-labs/fusemix. + + + + FedSelect: Personalized Federated Learning with Customized Selection of Parameters for Fine-Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Tamirisa_FedSelect_Personalized_Federated_Learning_with_Customized_Selection_of_Parameters_for_CVPR_2024_paper.pdf + Standard federated learning approaches suffer when client data distributions have sufficient heterogeneity. Recent methods addressed the client data heterogeneity issue via personalized federated learning (PFL) - a class of FL algorithms aiming to personalize learned global knowledge to better suit the clients' local data distributions. Existing PFL methods usually decouple global updates in deep neural networks by performing personalization on particular layers (i.e. classifier heads) and global aggregation for the rest of the network. However preselecting network layers for personalization may result in suboptimal storage of global knowledge. In this work we propose FedSelect a novel PFL algorithm inspired by the iterative subnetwork discovery procedure used for the Lottery Ticket Hypothesis. FedSelect incrementally expands subnetworks to personalize client parameters concurrently conducting global aggregations on the remaining parameters. This approach enables the personalization of both client parameters and subnetwork structure during the training process. Finally we show that FedSelect outperforms recent state-of-the-art PFL algorithms under challenging client data heterogeneity settings and demonstrates robustness to various real-world distributional shifts. 
+ + + + Bidirectional Multi-Scale Implicit Neural Representations for Image Deraining + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Bidirectional_Multi-Scale_Implicit_Neural_Representations_for_Image_Deraining_CVPR_2024_paper.pdf + How to effectively explore multi-scale representations of rain streaks is important for image deraining. In contrast to existing Transformer-based methods that depend mostly on single-scale rain appearance we develop an end-to-end multi-scale Transformer that leverages the potentially useful features in various scales to facilitate high-quality image reconstruction. To better explore the common degradation representations from spatially-varying rain streaks we incorporate intra-scale implicit neural representations based on pixel coordinates with the degraded inputs in a closed-loop design enabling the learned features to facilitate rain removal and improve the robustness of the model in complex scenarios. To ensure richer collaborative representation from different scales we embed a simple yet effective inter-scale bidirectional feedback operation into our multi-scale Transformer by performing coarse-to-fine and fine-to-coarse information communication. Extensive experiments demonstrate that our approach named as NeRD-Rain performs favorably against the state-of-the-art ones on both synthetic and real-world benchmark datasets. The source code and trained models are available at https://github.com/cschenxiang/NeRD-Rain. + + + + Enhancing Quality of Compressed Images by Mitigating Enhancement Bias Towards Compression Domain + http://openaccess.thecvf.com//content/CVPR2024/papers/Xing_Enhancing_Quality_of_Compressed_Images_by_Mitigating_Enhancement_Bias_Towards_CVPR_2024_paper.pdf + Existing quality enhancement methods for compressed images focus on aligning the enhancement domain with the raw domain to yield realistic images. However these methods exhibit a pervasive enhancement bias towards the compression domain inadvertently regarding it as more realistic than the raw domain. This bias makes enhanced images closely resemble their compressed counterparts thus degrading their perceptual quality. In this paper we propose a simple yet effective method to mitigate this bias and enhance the quality of compressed images. Our method employs a conditional discriminator with the compressed image as a key condition and then incorporates a domain-divergence regularization to actively distance the enhancement domain from the compression domain. Through this dual strategy our method enables the discrimination against the compression domain and brings the enhancement domain closer to the raw domain. Comprehensive quality evaluations confirm the superiority of our method over other state-of-the-art methods without incurring inference overheads. + + + + LangSplat: 3D Language Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Qin_LangSplat_3D_Language_Gaussian_Splatting_CVPR_2024_paper.pdf + Humans live in a 3D world and commonly use natural language to interact with a 3D scene. Modeling a 3D language field to support open-ended language queries in 3D has gained increasing attention recently. This paper introduces LangSplat which constructs a 3D language field that enables precise and efficient open-vocabulary querying within 3D spaces. 
Unlike existing methods that ground CLIP language embeddings in a NeRF model LangSplat advances the field by utilizing a collection of 3D Gaussians each encoding language features distilled from CLIP to represent the language field. By employing a tile-based splatting technique for rendering language features we circumvent the costly rendering process inherent in NeRF. Instead of directly learning CLIP embeddings LangSplat first trains a scene-wise language autoencoder and then learns language features on the scene-specific latent space thereby alleviating substantial memory demands imposed by explicit modeling. Existing methods struggle with imprecise and vague 3D language fields which fail to discern clear boundaries between objects. We delve into this issue and propose to learn hierarchical semantics using SAM thereby eliminating the need for extensively querying the language field across various scales and the regularization of DINO features. Extensive experimental results show that LangSplat significantly outperforms the previous state-of-the-art method LERF by a large margin. Notably LangSplat is extremely efficient achieving a 199 x speedup compared to LERF at the resolution of 1440 x 1080. We strongly recommend readers to check out our video results at https://langsplat.github.io/. + + + + Improving Spectral Snapshot Reconstruction with Spectral-Spatial Rectification + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Improving_Spectral_Snapshot_Reconstruction_with_Spectral-Spatial_Rectification_CVPR_2024_paper.pdf + How to effectively utilize the spectral and spatial characteristics of Hyperspectral Image (HSI) is always a key problem in spectral snapshot reconstruction. Recently the spectra-wise transformer has shown great potential in capturing inter-spectra similarities of HSI but the classic design of the transformer i.e. multi-head division in the spectral (channel) dimension hinders the modeling of global spectral information and results in mean effect. In addition previous methods adopt the normal spatial priors without taking imaging processes into account and fail to address the unique spatial degradation in snapshot spectral reconstruction. In this paper we analyze the influence of multi-head division and propose a novel Spectral-Spatial Rectification (SSR) method to enhance the utilization of spectral information and improve spatial degradation. Specifically SSR includes two core parts: Window-based Spectra-wise Self-Attention (WSSA) and spAtial Rectification Block (ARB). WSSA is proposed to capture global spectral information and account for local differences whereas ARB aims to mitigate the spatial degradation using a spatial alignment strategy. The experimental results on simulation and real scenes demonstrate the effectiveness of the proposed modules and we also provide models at multiple scales to demonstrate the superiority of our approach. + + + + DNGaussian: Optimizing Sparse-View 3D Gaussian Radiance Fields with Global-Local Depth Normalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_DNGaussian_Optimizing_Sparse-View_3D_Gaussian_Radiance_Fields_with_Global-Local_Depth_CVPR_2024_paper.pdf + Radiance fields have demonstrated impressive performance in synthesizing novel views from sparse input views yet prevailing methods suffer from high training costs and slow inference speed. 
This paper introduces DNGaussian a depth-regularized framework based on 3D Gaussian radiance fields offering real-time and high-quality few-shot novel view synthesis at low costs. Our motivation stems from the highly efficient representation and surprising quality of the recent 3D Gaussian Splatting even though it encounters geometry degradation when input views decrease. In the Gaussian radiance fields we find this degradation in scene geometry is primarily linked to the positioning of Gaussian primitives and can be mitigated by depth constraint. Consequently we propose a Hard and Soft Depth Regularization to restore accurate scene geometry under coarse monocular depth supervision while maintaining a fine-grained color appearance. To further refine detailed geometry reshaping we introduce Global-Local Depth Normalization enhancing the focus on small local depth changes. Extensive experiments on LLFF DTU and Blender datasets demonstrate that DNGaussian outperforms state-of-the-art methods achieving comparable or better results with significantly reduced memory cost a 25x reduction in training time and over 3000x faster rendering speed. Code is available at: https://github.com/Fictionarry/DNGaussian + + + + ColorPCR: Color Point Cloud Registration with Multi-Stage Geometric-Color Fusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Mu_ColorPCR_Color_Point_Cloud_Registration_with_Multi-Stage_Geometric-Color_Fusion_CVPR_2024_paper.pdf + Point cloud registration is still a challenging and open problem. For example when the overlap between two point clouds is extremely low geo-only features may not be sufficient. Therefore it is important to further explore how to utilize color data in this task. Under such circumstances we propose ColorPCR for color point cloud registration with multi-stage geometric-color fusion. We design a Hierarchical Color Enhanced Feature Extraction module to extract multi-level geometric-color features and a GeoColor Superpoint Matching Module to encode transformation-invariant geo-color global context for robust patch correspondences. In this way both geometric and color data can be used thus leading to robust performance even under extremely challenging scenarios such as low overlap between two point clouds. To evaluate the performance of our method we colorize 3DMatch/3DLoMatch datasets as Color3DMatch/Color3DLoMatch and evaluations on these datasets demonstrate the effectiveness of our proposed method. Our method achieves state-of-the-art registration recall of 97.5%/88.9% on them. + + + + HomoFormer: Homogenized Transformer for Image Shadow Removal + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_HomoFormer_Homogenized_Transformer_for_Image_Shadow_Removal_CVPR_2024_paper.pdf + The spatial non-uniformity and diverse patterns of shadow degradation conflict with the weight sharing manner of dominant models which may lead to an unsatisfactory compromise. To tackle this issue we present a novel strategy from the view of shadow transformation in this paper: directly homogenizing the spatial distribution of shadow degradation. Our key design is the random shuffle operation and its corresponding inverse operation. Specifically the random shuffle operation stochastically rearranges the pixels across spatial space and the inverse operation recovers the original order. After randomly shuffling the shadow diffuses in the whole image and the degradation appears in a homogenized way which can be effectively processed by the local self-attention layer. 
Moreover we further devise a new feed forward network with position modeling to exploit image structural information. Based on these elements we construct the final local window based transformer named HomoFormer for image shadow removal. Our HomoFormer can enjoy the linear complexity of local transformers while bypassing challenges of non-uniformity and diversity of shadow. Extensive experiments are conducted to verify the superiority of our HomoFormer across public datasets. + + + + What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_What_If_the_TV_Was_Off_Examining_Counterfactual_Reasoning_Abilities_CVPR_2024_paper.pdf + Counterfactual reasoning a fundamental aspect of human cognition involves contemplating alternatives to established facts or past events significantly enhancing our abilities in planning and decision-making. In light of the advancements in current multi-modal large language models we explore their effectiveness in counterfactual reasoning. To facilitate this investigation we introduce a novel dataset C-VQA specifically designed to examine the counterfactual reasoning capabilities of modern multi-modal large language models. This dataset is constructed by infusing original questions with counterfactual presuppositions spanning various types such as numerical and boolean queries. It encompasses a mix of real and synthetic data representing a wide range of difficulty levels. Our thorough evaluations of contemporary vision-language models using this dataset have revealed substantial performance drops with some models showing up to a 40% decrease highlighting a significant gap between current models and human-like vision reasoning capabilities. We hope our dataset will serve as a vital benchmark for evaluating the counterfactual reasoning capabilities of models. Code and dataset are publicly available at https://bzhao.me/C-VQA/. + + + + FAR: Flexible Accurate and Robust 6DoF Relative Camera Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Rockwell_FAR_Flexible_Accurate_and_Robust_6DoF_Relative_Camera_Pose_Estimation_CVPR_2024_paper.pdf + Estimating relative camera poses between images has been a central problem in computer vision. Methods that find correspondences and solve for the fundamental matrix offer high precision in most cases. Conversely methods predicting pose directly using neural networks are more robust to limited overlap and can infer absolute translation scale but at the expense of reduced precision. We show how to combine the best of both methods; our approach yields results that are both precise and robust while also accurately inferring translation scales. At the heart of our model lies a Transformer that (1) learns to balance between solved and learned pose estimations and (2) provides a prior to guide a solver. A comprehensive analysis supports our design choices and demonstrates that our method adapts flexibly to various feature extractors and correspondence estimators showing state-of-the-art performance in 6DoF pose estimation on Matterport3D InteriorNet StreetLearn and Map-free Relocalization. + + + + eTraM: Event-based Traffic Monitoring Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Verma_eTraM_Event-based_Traffic_Monitoring_Dataset_CVPR_2024_paper.pdf + Event cameras with their high temporal and dynamic range and minimal memory usage have found applications in various fields. 
However their potential in static traffic monitoring remains largely unexplored. To facilitate this exploration we present eTraM - a first-of-its-kind fully event-based traffic monitoring dataset. eTraM offers 10 hr of data from different traffic scenarios in various lighting and weather conditions providing a comprehensive overview of real-world situations. Providing 2M bounding box annotations it covers eight distinct classes of traffic participants ranging from vehicles to pedestrians and micro-mobility. eTraM's utility has been assessed using state-of-the-art methods for traffic participant detection including RVT RED and YOLOv8. We quantitatively evaluate the ability of event-based models to generalize on nighttime and unseen scenes. Our findings substantiate the compelling potential of leveraging event cameras for traffic monitoring opening new avenues for research and application. eTraM is available at https://eventbasedvision.github.io/eTraM. + + + + MoCha-Stereo: Motif Channel Attention Network for Stereo Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_MoCha-Stereo_Motif_Channel_Attention_Network_for_Stereo_Matching_CVPR_2024_paper.pdf + Learning-based stereo matching techniques have made significant progress. However existing methods inevitably lose geometrical structure information during the feature channel generation process resulting in edge detail mismatches. In this paper the Motif Channel Attention Stereo Matching Network (MoCha-Stereo) is designed to address this problem. We provide the Motif Channel Correlation Volume (MCCV) to determine more accurate edge matching costs. MCCV is achieved by projecting motif channels which capture common geometric structures in feature channels onto feature maps and cost volumes. In addition edge variations in the reconstruction error map also affect details matching so we propose the Reconstruction Error Motif Penalty (REMP) module to further refine the full-resolution disparity estimation. REMP integrates the frequency information of typical channel features from the reconstruction error. MoCha-Stereo ranks 1st on the KITTI-2015 and KITTI-2012 Reflective leaderboards. Our structure also shows excellent performance in Multi-View Stereo. Code is available at https://github.com/ZYangChen/MoCha-Stereo. + + + + Extend Your Own Correspondences: Unsupervised Distant Point Cloud Registration by Progressive Distance Extension + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Extend_Your_Own_Correspondences_Unsupervised_Distant_Point_Cloud_Registration_by_CVPR_2024_paper.pdf + Registration of point clouds collected from a pair of distant vehicles provides a comprehensive and accurate 3D view of the driving scenario which is vital for driving safety related applications yet existing literature suffers from the expensive pose label acquisition and the deficiency to generalize to new data distributions. In this paper we propose EYOC an unsupervised distant point cloud registration method that adapts to new point cloud distributions on the fly requiring no global pose labels. The core idea of EYOC is to train a feature extractor in a progressive fashion where in each round the feature extractor trained with near point cloud pairs can label slightly farther point cloud pairs enabling self-supervision on such far point cloud pairs. This process continues until the derived extractor can be used to register distant point clouds. 
Particularly to enable high-fidelity correspondence label generation we devise an effective spatial filtering scheme to select the most representative correspondences to register a point cloud pair and then utilize the aligned point clouds to discover more correct correspondences. Experiments show that EYOC can achieve comparable performance with state-of-the-art supervised methods at a lower training cost. Moreover it outwits supervised methods regarding generalization performance on new data distributions. + + + + Multi-modal Learning for Geospatial Vegetation Forecasting + http://openaccess.thecvf.com//content/CVPR2024/papers/Benson_Multi-modal_Learning_for_Geospatial_Vegetation_Forecasting_CVPR_2024_paper.pdf + Precise geospatial vegetation forecasting holds potential across diverse sectors including agriculture forestry humanitarian aid and carbon accounting. To leverage the vast availability of satellite imagery for this task various works have applied deep neural networks for predicting multispectral images in photorealistic quality. However the important area of vegetation dynamics has not been thoroughly explored. Our study introduces GreenEarthNet the first dataset specifically designed for high-resolution vegetation forecasting and Contextformer a novel deep learning approach for predicting vegetation greenness from Sentinel 2 satellite images with fine resolution across Europe. Our multi-modal transformer model Contextformer leverages spatial context through a vision backbone and predicts the temporal dynamics on local context patches incorporating meteorological time series in a parameter-efficient manner. The GreenEarthNet dataset features a learned cloud mask and an appropriate evaluation scheme for vegetation modeling. It also maintains compatibility with the existing satellite imagery forecasting dataset EarthNet2021 enabling cross-dataset model comparisons. Our extensive qualitative and quantitative analyses reveal that our methods outperform a broad range of baseline techniques. This includes surpassing previous state-of-the-art models on EarthNet2021 as well as adapted models from time series forecasting and video prediction. To the best of our knowledge this work presents the first models for continental-scale vegetation modeling at fine resolution able to capture anomalies beyond the seasonal cycle thereby paving the way for predicting vegetation health and behaviour in response to climate variability and extremes. We provide open source code and pre-trained weights to reproduce our experimental results under https://github.com/vitusbenson/greenearthnet. + + + + Bring Event into RGB and LiDAR: Hierarchical Visual-Motion Fusion for Scene Flow + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Bring_Event_into_RGB_and_LiDAR_Hierarchical_Visual-Motion_Fusion_for_CVPR_2024_paper.pdf + Single RGB or LiDAR is the mainstream sensor for the challenging scene flow which relies heavily on visual features to match motion features. Compared with single modality existing methods adopt a fusion strategy to directly fuse the cross-modal complementary knowledge in motion space. However these direct fusion methods may suffer the modality gap due to the visual intrinsic heterogeneous nature between RGB and LiDAR thus deteriorating motion features. We discover that event has the homogeneous nature with RGB and LiDAR in both visual and motion spaces. 
In this work we bring the event as a bridge between RGB and LiDAR and propose a novel hierarchical visual-motion fusion framework for scene flow which explores a homogeneous space to fuse the cross-modal complementary knowledge for physical interpretation. In visual fusion we discover that event has a complementarity (relative v.s. absolute) in luminance space with RGB for high dynamic imaging and has a complementarity (local boundary v.s. global shape) in scene structure space with LiDAR for structure integrity. In motion fusion we figure out that RGB event and LiDAR are complementary (spatial-dense temporal-dense v.s. spatiotemporal-sparse) to each other in correlation space which motivates us to fuse their motion correlations for motion continuity. The proposed hierarchical fusion can explicitly fuse the multimodal knowledge to progressively improve scene flow from visual space to motion space. Extensive experiments have been performed to verify the superiority of the proposed method. + + + + MMVP: A Multimodal MoCap Dataset with Vision and Pressure Sensors + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_MMVP_A_Multimodal_MoCap_Dataset_with_Vision_and_Pressure_Sensors_CVPR_2024_paper.pdf + Foot contact is an important cue for human motion capture understanding and generation. Existing datasets tend to annotate dense foot contact using visual matching with thresholding or incorporating pressure signals. However these approaches either suffer from low accuracy or are only designed for small-range and slow motion. There is still a lack of a vision-pressure multimodal dataset with large-range and fast human motion as well as accurate and dense foot-contact annotation. To fill this gap we propose a Multimodal MoCap Dataset with Vision and Pressure sensors named MMVP. MMVP provides accurate and dense plantar pressure signals synchronized with RGBD observations which is especially useful for both plausible shape estimation robust pose fitting without foot drifting and accurate global translation tracking. To validate the dataset we propose an RGBD-P SMPL fitting method and also a monocular-video-based baseline framework VP-MoCap for human motion capture. Experiments demonstrate that our RGBD-P SMPL Fitting results significantly outperform pure visual motion capture. Moreover VP-MoCap outperforms SOTA methods in foot-contact and global translation estimation accuracy. We believe the configuration of the dataset and the baseline frameworks will stimulate the research in this direction and also provide a good reference for MoCap applications in various domains. Project page: https://metaverse-ai-lab-thu.github.io/MMVP-Dataset/. + + + + JoAPR: Cleaning the Lens of Prompt Learning for Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_JoAPR_Cleaning_the_Lens_of_Prompt_Learning_for_Vision-Language_Models_CVPR_2024_paper.pdf + Leveraging few-shot datasets in prompt learning for Vision-Language Models eliminates the need for manual prompt engineering while highlighting the necessity of accurate annotations for the labels. However high-level or complex label noise challenges prompt learning for Vision-Language Models. Aiming at this issue we propose a new framework for improving its robustness. Specifically we introduce the Joint Adaptive Partitioning for Label Refurbishment (JoAPR) a structured framework encompassing two key steps. 1) Data Partitioning where we differentiate between clean and noisy data using joint adaptive thresholds. 
2) Label Refurbishment where we correct the labels based on the partition outcomes before retraining the network. Our comprehensive experiments confirm that JoAPR substantially enhances the robustness of prompt learning for Vision-Language Models against label noise offering a promising direction for future research. + + + + Open-Vocabulary 3D Semantic Segmentation with Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Open-Vocabulary_3D_Semantic_Segmentation_with_Foundation_Models_CVPR_2024_paper.pdf + In dynamic 3D environments the ability to recognize a diverse range of objects without the constraints of predefined categories is indispensable for real-world applications. In response to this need we introduce OV3D an innovative framework designed for open-vocabulary 3D semantic segmentation. OV3D leverages the broad open-world knowledge embedded in vision and language foundation models to establish a fine-grained correspondence between 3D points and textual entity descriptions. These entity descriptions are enriched with contextual information enabling a more open and comprehensive understanding. By seamlessly aligning 3D point features with entity text features OV3D empowers open-vocabulary recognition in the 3D domain achieving state-of-the-art open-vocabulary semantic segmentation performance across multiple datasets including ScanNet Matterport3D and nuScenes. + + + + 1-Lipschitz Layers Compared: Memory Speed and Certifiable Robustness + http://openaccess.thecvf.com//content/CVPR2024/papers/Prach_1-Lipschitz_Layers_Compared_Memory_Speed_and_Certifiable_Robustness_CVPR_2024_paper.pdf + The robustness of neural networks against input perturbations with bounded magnitude represents a serious concern in the deployment of deep learning models in safety-critical systems. Recently the scientific community has focused on enhancing certifiable robustness guarantees by crafting 1-Lipschitz neural networks that leverage Lipschitz bounded dense and convolutional layers. Different methods have been proposed in the literature to achieve this goal however comparing the performance of such methods is not straightforward since different metrics can be relevant (e.g. training time memory usage accuracy certifiable robustness) for different applications. Therefore this work provides a thorough comparison between different methods covering theoretical aspects such as computational complexity and memory requirements as well as empirical measurements of time per epoch required memory accuracy and certifiable robust accuracy. The paper also provides some guidelines and recommendations to support the user in selecting the methods that work best depending on the available resources. We provide code at github.com/berndprach/1LipschitzLayersCompared + + + + Construct to Associate: Cooperative Context Learning for Domain Adaptive Point Cloud Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Construct_to_Associate_Cooperative_Context_Learning_for_Domain_Adaptive_Point_CVPR_2024_paper.pdf + This paper tackles the domain adaptation problem in point cloud semantic segmentation which performs adaptation from a fully labeled domain (source domain) to an unlabeled target domain. Due to the unordered property of point clouds LiDAR scans typically show varying geometric structures across different regions in terms of density noises etc hence leading to increased dynamics on context. 
However such characteristics are not consistent across domains due to the difference in sensors environments etc thus hampering the effective scene comprehension across domains. To solve this we propose Cooperative Context Learning that performs context modeling and modulation from different aspects but in a cooperative manner. Specifically we first devise context embeddings to discover and model contextual relationships with close neighbors in a learnable manner. Then with the context embeddings from two domains we introduce a set of learnable prototypes to attend and associate them under the attention paradigm. As a result these prototypes naturally establish long-range dependency across regions and domains thereby encouraging the transfer of context knowledge and easing the adaptation. Moreover the attention in turn attunes and guides the local context modeling and urges them to focus on the domain-invariant context knowledge thus promoting the adaptation in a cooperative manner. Experiments on representative benchmarks verify that our method attains the new state-of-the-art. + + + + GoMVS: Geometrically Consistent Cost Aggregation for Multi-View Stereo + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_GoMVS_Geometrically_Consistent_Cost_Aggregation_for_Multi-View_Stereo_CVPR_2024_paper.pdf + Matching cost aggregation plays a fundamental role in learning-based multi-view stereo networks. However directly aggregating adjacent costs can lead to suboptimal results due to local geometric inconsistency. Related methods either seek selective aggregation or improve aggregated depth in the 2D space both are unable to handle geometric inconsistency in the cost volume effectively. In this paper we propose GoMVS to aggregate geometrically consistent costs yielding better utilization of adjacent geometries. More specifically we correspond and propagate adjacent costs to the reference pixel by leveraging the local geometric smoothness in conjunction with surface normals. We achieve this by the geometric consistent propagation (GCP) module. It computes the correspondence from the adjacent depth hypothesis space to the reference depth space using surface normals then uses the correspondence to propagate adjacent costs to the reference geometry followed by a convolution for aggregation. Our method achieves new state-of-the-art performance on DTU Tanks & Temple and ETH3D datasets. Notably our method ranks 1st on the Tanks & Temple Advanced benchmark. Code is available at https://github.com/Wuuu3511/GoMVS. + + + + Evaluating Transferability in Retrieval Tasks: An Approach Using MMD and Kernel Methods + http://openaccess.thecvf.com//content/CVPR2024/papers/Dai_Evaluating_Transferability_in_Retrieval_Tasks_An_Approach_Using_MMD_and_CVPR_2024_paper.pdf + Retrieval tasks play central roles in real-world machine learning systems such as search engines recommender systems and retrieval-augmented generation (RAG). Achieving decent performance in these tasks often requires fine-tuning various pre-trained models on specific datasets and selecting the best candidate a process that can be both time and resource-consuming. To tackle the problem we introduce a novel and efficient method called RetMMD that leverages Maximum Mean Discrepancy (MMD) and kernel methods to assess the transferability of pretrained models in retrieval tasks. RetMMD is calculated on pretrained model and target dataset without any fine-tuning involved. 
Specifically given some query we quantify the distribution discrepancy between relevant and irrelevant document embeddings by estimating the similarities within their mappings in the fine-tuned embedding space through kernel method. This discrepancy is averaged over multiple queries taking into account the distribution characteristics of the target dataset. Experiments suggest that the proposed metric calculated on pre-trained models closely aligns with retrieval performance post-fine-tuning. The observation holds across a variety of datasets including image text and multi-modal domains indicating the potential of using MMD and kernel methods for transfer learning evaluation in retrieval scenarios. In addition we also design a way of evaluating dataset transferability for retrieval tasks with experimental results demonstrating the effectiveness of the proposed approach. + + + + OMG-Seg: Is One Model Good Enough For All Segmentation? + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_OMG-Seg_Is_One_Model_Good_Enough_For_All_Segmentation_CVPR_2024_paper.pdf + In this work we address various segmentation tasks each traditionally tackled by distinct or partially unified models. We propose OMG-Seg One Model that is Good enough to efficiently and effectively handle all the segmentation tasks including image semantic instance and panoptic segmentation as well as their video counterparts open vocabulary settings prompt-driven interactive segmentation like SAM and video object segmentation. To our knowledge this is the first model to handle all these tasks in one model and achieve satisfactory performance. We show that OMG-Seg a transformer-based encoder-decoder architecture with task-specific queries and outputs can support over ten distinct segmentation tasks and yet significantly reduce computational and parameter overhead across various tasks and datasets. We rigorously evaluate the inter-task influences and correlations during co-training. Code and models are available at https://github.com/lxtGH/OMG-Seg. + + + + DetCLIPv3: Towards Versatile Generative Open-vocabulary Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Yao_DetCLIPv3_Towards_Versatile_Generative_Open-vocabulary_Object_Detection_CVPR_2024_paper.pdf + Existing open-vocabulary object detectors typically require a predefined set of categories from users significantly confining their application scenarios. In this paper we introduce DetCLIPv3 a high-performing detector that excels not only at both open-vocabulary object detection but also generating hierarchical labels for detected objects. DetCLIPv3 is characterized by three core designs: 1. Versatile model architecture: we derive a robust open-set detection framework which is further empowered with generation ability via the integration of a caption head. 2. High information density data: we develop an auto-annotation pipeline leveraging visual large language model to refine captions for large-scale image-text pairs providing rich multi-granular object labels to enhance the training. 3. Efficient training strategy: we employ a pre-training stage with low-resolution inputs that enables the object captioner to efficiently learn a broad spectrum of visual concepts from extensive image-text paired data. This is followed by a fine-tuning stage that leverages a small number of high-resolution samples to further enhance detection performance. With these effective designs DetCLIPv3 demonstrates superior open-vocabulary detection performance e.g. 
our Swin-T backbone model achieves a notable 47.0 zero-shot fixed AP on the LVIS minival benchmark, outperforming GLIPv2, GroundingDINO and DetCLIPv2 by 18.0/19.6/6.6 AP, respectively. DetCLIPv3 also achieves a state-of-the-art 19.7 AP in the dense captioning task on the VG dataset, showcasing its strong generative capability. + + + + UVEB: A Large-scale Benchmark and Baseline Towards Real-World Underwater Video Enhancement + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_UVEB_A_Large-scale_Benchmark_and_Baseline_Towards_Real-World_Underwater_Video_CVPR_2024_paper.pdf + Learning-based underwater image enhancement (UIE) methods have made great progress. However, the lack of large-scale and high-quality paired training samples has become the main bottleneck hindering the development of UIE. The inter-frame information in underwater videos can accelerate or optimize the UIE process. Thus, we constructed the first large-scale high-resolution underwater video enhancement benchmark (UVEB) to promote the development of underwater vision. It contains 1,308 pairs of video sequences and more than 453,000 high-resolution frame pairs, 38% of which are Ultra-High-Definition (UHD) 4K. UVEB comes from multiple countries, containing various scenes and video degradation types to adapt to diverse and complex underwater environments. We also propose the first supervised underwater video enhancement method, UVE-Net. UVE-Net converts the current frame information into convolutional kernels and passes them to adjacent frames for efficient inter-frame information exchange. By fully utilizing the redundant degraded information of underwater videos, UVE-Net completes video enhancement better. Experiments show the effective network design and good performance of UVE-Net. + + + + Discovering Syntactic Interaction Clues for Human-Object Interaction Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_Discovering_Syntactic_Interaction_Clues_for_Human-Object_Interaction_Detection_CVPR_2024_paper.pdf + Recently, the Vision-Language Model (VLM) has greatly advanced Human-Object Interaction (HOI) detection. The existing VLM-based HOI detectors typically adopt a hand-crafted template (e.g. a photo of a person [action] a/an [object]) to acquire text knowledge through the VLM text encoder. However, such approaches, which only encode the action-specific text prompts at the vocabulary level, may suffer from learning ambiguity without exploring the fine-grained clues from the perspective of interaction context. In this paper, we propose a novel method to discover Syntactic Interaction Clues for HOI detection (SICHOI) by using VLM. Specifically, we first investigate what the essential elements for an interaction context are, and then establish a syntactic interaction bank from three levels: spatial relationship, action-oriented posture and situational condition. Further, to align visual features with the syntactic interaction bank, we adopt a multi-view extractor to jointly aggregate visual features from instance, interaction and image levels accordingly. In addition, we also introduce a dual cross-attention decoder to perform context propagation between text knowledge and visual features, thereby enhancing HOI detection. Experimental results demonstrate that our proposed method achieves state-of-the-art performance on HICO-DET and V-COCO.
+ + + + Inter-X: Towards Versatile Human-Human Interaction Analysis + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Inter-X_Towards_Versatile_Human-Human_Interaction_Analysis_CVPR_2024_paper.pdf + The analysis of the ubiquitous human-human interactions is pivotal for understanding humans as social beings. Existing human-human interaction datasets typically suffer from inaccurate body motions, lack of hand gestures and fine-grained textual descriptions. To better perceive and generate human-human interactions, we propose Inter-X, currently the largest human-human interaction dataset with accurate body movements and diverse interaction patterns, together with detailed hand gestures. The dataset includes 11K interaction sequences and more than 8.1M frames. We also equip Inter-X with versatile annotations of more than 34K fine-grained human part-level textual descriptions, semantic interaction categories, interaction order and the relationship and personality of the subjects. Based on the elaborate annotations, we propose a unified benchmark composed of 4 categories of downstream tasks from both the perceptual and generative directions. Extensive experiments and comprehensive analysis show that Inter-X serves as a testbed for promoting the development of versatile human-human interaction analysis. Our dataset and benchmark will be publicly available for research purposes. + + + + MaskClustering: View Consensus based Mask Graph Clustering for Open-Vocabulary 3D Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_MaskClustering_View_Consensus_based_Mask_Graph_Clustering_for_Open-Vocabulary_3D_CVPR_2024_paper.pdf + Open-vocabulary 3D instance segmentation is cutting-edge for its ability to segment 3D instances without predefined categories. However, progress in 3D lags behind its 2D counterpart due to limited annotated 3D data. To address this, recent works first generate 2D open-vocabulary masks through 2D models and then merge them into 3D instances based on metrics calculated between two neighboring frames. In contrast to these local metrics, we propose a novel metric, the view consensus rate, to enhance the utilization of multi-view observations. The key insight is that two 2D masks should be deemed part of the same 3D instance if a significant number of other 2D masks from different views contain both these two masks. Using this metric as edge weight, we construct a global mask graph where each mask is a node. Through iterative clustering of masks showing high view consensus, we generate a series of clusters, each representing a distinct 3D instance. Notably, our model is training-free. Through extensive experiments on publicly available datasets, including ScanNet++, ScanNet200 and MatterPort3D, we demonstrate that our method achieves state-of-the-art performance in open-vocabulary 3D instance segmentation. Our project page is at https://pku-epic.github.io/MaskClustering/. + + + + PeerAiD: Improving Adversarial Distillation from a Specialized Peer Tutor + http://openaccess.thecvf.com//content/CVPR2024/papers/Jung_PeerAiD_Improving_Adversarial_Distillation_from_a_Specialized_Peer_Tutor_CVPR_2024_paper.pdf + Adversarial robustness of the neural network is a significant concern when it is applied to security-critical domains. In this situation, adversarial distillation is a promising option, which aims to distill the robustness of the teacher network to improve the robustness of a small student network.
Previous works pretrain the teacher network to make it robust against the adversarial examples aimed at itself. However the adversarial examples are dependent on the parameters of the target network. The fixed teacher network inevitably degrades its robustness against the unseen transferred adversarial examples which target the parameters of the student network in the adversarial distillation process. We propose PeerAiD to make a peer network learn the adversarial examples of the student network instead of adversarial examples aimed at itself. PeerAiD is an adversarial distillation that trains the peer network and the student network simultaneously in order to specialize the peer network for defending the student network. We observe that such peer networks surpass the robustness of the pretrained robust teacher model against adversarial examples aimed at the student network. With this peer network and adversarial distillation PeerAiD achieves significantly higher robustness of the student network with AutoAttack (AA) accuracy by up to 1.66%p and improves the natural accuracy of the student network by up to 4.72%p with ResNet-18 on TinyImageNet dataset. Code is available at https://github.com/jaewonalive/PeerAiD. + + + + Scaling Laws for Data Filtering-- Data Curation cannot be Compute Agnostic + http://openaccess.thecvf.com//content/CVPR2024/papers/Goyal_Scaling_Laws_for_Data_Filtering--_Data_Curation_cannot_be_Compute_CVPR_2024_paper.pdf + Vision-language models (VLMs) are trained for thousands of GPU hours on carefully selected subsets of massive web scrapes. For instance the LAION public dataset retained only about 10 percent of the total crawled data. In recent times data curation has gained prominence with several works developing strategies to retain high-quality subsets of raw scraped data. However these strategies are typically developed agnostic to the available compute for training. In this paper we demonstrate that making filtering decisions independent of training compute is often suboptimal: well-curated data rapidly loses its utility when repeated eventually decreasing below the utility of unseen but lower-quality data. While past research in neural scaling laws has considered web data to be homogenous real data is not. Our work bridges this important gap in the literature by developing scaling laws that characterize the differing utility of various data subsets and accounting for how this diminishes for a data point at its nth repetition. Our key message is that data curation can not be agnostic of the total compute a model will be trained for. Even without ever jointly training on multiple data buckets our scaling laws enable us to estimate model performance under this dynamic trade-off between quality and repetition. This allows us to curate the best possible pool for achieving top performance on Datacomp at various compute budgets carving out a pareto-frontier for data curation. + + + + Beyond Average: Individualized Visual Scanpath Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Beyond_Average_Individualized_Visual_Scanpath_Prediction_CVPR_2024_paper.pdf + Understanding how attention varies across individuals has significant scientific and societal impacts. However existing visual scanpath models treat attention uniformly neglecting individual differences. 
To bridge this gap this paper focuses on individualized scanpath prediction (ISP) a new attention modeling task that aims to accurately predict how different individuals shift their attention in diverse visual tasks. It proposes an ISP method featuring three novel technical components: (1) an observer encoder to characterize and integrate an observer's unique attention traits (2) an observer-centric feature integration approach that holistically combines visual features task guidance and observer-specific characteristics and (3) an adaptive fixation prioritization mechanism that refines scanpath predictions by dynamically prioritizing semantic feature maps based on individual observers' attention traits. These novel components allow scanpath models to effectively address the attention variations across different observers. Our method is generally applicable to different datasets model architectures and visual tasks offering a comprehensive tool for transforming general scanpath models into individualized ones. Comprehensive evaluations using value-based and ranking-based metrics verify the method's effectiveness and generalizability. + + + + Seeing Motion at Nighttime with an Event Camera + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Seeing_Motion_at_Nighttime_with_an_Event_Camera_CVPR_2024_paper.pdf + We focus on a very challenging task: imaging at nighttime dynamic scenes. Most previous methods rely on the low-light enhancement of a conventional RGB camera. However they would inevitably face a dilemma between the long exposure time of nighttime and the motion blur of dynamic scenes. Event cameras react to dynamic changes with higher temporal resolution (microsecond) and higher dynamic range (120dB) offering an alternative solution. In this work we present a novel nighttime dynamic imaging method with an event camera. Specifically we discover that the event at nighttime exhibits temporal trailing characteristics and spatial non-stationary distribution. Consequently we propose a nighttime event reconstruction network (NER-Net) which mainly includes a learnable event timestamps calibration module (LETC) to align the temporal trailing events and a non-uniform illumination aware module (NIAM) to stabilize the spatiotemporal distribution of events. Moreover we construct a paired real low-light event dataset (RLED) through a co-axial imaging system including 64200 spatially and temporally aligned image GTs and low-light events. Extensive experiments demonstrate that the proposed method outperforms state-of-the-art methods in terms of visual quality and generalization ability on real-world nighttime datasets. The project are available at: https://github.com/Liu-haoyue/NER-Net. + + + + FISBe: A Real-World Benchmark Dataset for Instance Segmentation of Long-Range Thin Filamentous Structures + http://openaccess.thecvf.com//content/CVPR2024/papers/Mais_FISBe_A_Real-World_Benchmark_Dataset_for_Instance_Segmentation_of_Long-Range_CVPR_2024_paper.pdf + Instance segmentation of neurons in volumetric light microscopy images of nervous systems enables groundbreaking research in neuroscience by facilitating joint functional and morphological analyses of neural circuits at cellular resolution. 
Yet said multi-neuron light microscopy data exhibits extremely challenging properties for the task of instance segmentation: Individual neurons have long-ranging thin filamentous and widely branching morphologies multiple neurons are tightly inter-weaved and partial volume effects uneven illumination and noise inherent to light microscopy severely impede local disentangling as well as long-range tracing of individual neurons. These properties reflect a current key challenge in machine learning research namely to effectively capture long-range dependencies in the data. While respective methodological research is buzzing to date methods are typically benchmarked on synthetic datasets. To address this gap we release the FlyLight Instance Segmentation Benchmark (FISBe) dataset the first publicly available multi-neuron light microscopy dataset with pixel-wise annotations. In addition we define a set of instance segmentation metrics for benchmarking that we designed to be meaningful with regard to downstream analyses. Lastly we provide three baselines to kick off a competition that we envision to both advance the field of machine learning regarding methodology for capturing long-range data dependencies and facilitate scientific discovery in basic neuroscience. + + + + LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding Reasoning and Planning + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_LL3DA_Visual_Interactive_Instruction_Tuning_for_Omni-3D_Understanding_Reasoning_and_CVPR_2024_paper.pdf + Recent progress in Large Multimodal Models (LMM) has opened up great possibilities for various applications in the field of human-machine interactions. However developing LMMs that can comprehend reason and plan in complex and diverse 3D environments remains a challenging topic especially considering the demand for understanding permutation-invariant point cloud representations of the 3D scene. Existing works seek help from multi-view images by projecting 2D features to 3D space which inevitably leads to huge computational overhead and performance degradation. In this paper we present LL3DA a Large Language 3D Assistant that takes point cloud as the direct input and responds to both text instructions and visual interactions. The additional visual interaction enables LMMs to better comprehend human interactions with the 3D environment and further remove the ambiguities within plain texts. Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering. + + + + 4D Gaussian Splatting for Real-Time Dynamic Scene Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_4D_Gaussian_Splatting_for_Real-Time_Dynamic_Scene_Rendering_CVPR_2024_paper.pdf + Representing and rendering dynamic scenes has been an important but challenging task. Especially to accurately model complex motions high efficiency is usually hard to guarantee. To achieve real-time dynamic scene rendering while also enjoying high training and storage efficiency we propose 4D Gaussian Splatting (4D-GS) as a holistic representation for dynamic scenes rather than applying 3D-GS for each individual frame. In 4D-GS a novel explicit representation containing both 3D Gaussians and 4D neural voxels is proposed. 
A decomposed neural voxel encoding algorithm inspired by HexPlane is proposed to efficiently build Gaussian features from 4D neural voxels and then a lightweight MLP is applied to predict Gaussian deformations at novel timestamps. Our 4D-GS method achieves real-time rendering under high resolutions 82 FPS at an 800*800 resolution on an RTX 3090 GPU while maintaining comparable or better quality than previous state-of-the-art methods. More demos and code are available at https://guanjunwu.github.io/4dgs. + + + + Selective-Stereo: Adaptive Frequency Information Selection for Stereo Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Selective-Stereo_Adaptive_Frequency_Information_Selection_for_Stereo_Matching_CVPR_2024_paper.pdf + Stereo matching methods based on iterative optimization like RAFT-Stereo and IGEV-Stereo have evolved into a cornerstone in the field of stereo matching. However these methods struggle to simultaneously capture high-frequency information in edges and low-frequency information in smooth regions due to the fixed receptive field. As a result they tend to lose details blur edges and produce false matches in textureless areas. In this paper we propose Selective Recurrent Unit (SRU) a novel iterative update operator for stereo matching. The SRU module can adaptively fuse hidden disparity information at multiple frequencies for edge and smooth regions. To perform adaptive fusion we introduce a new Contextual Spatial Attention (CSA) module to generate attention maps as fusion weights. The SRU empowers the network to aggregate hidden disparity information across multiple frequencies mitigating the risk of vital hidden disparity information loss during iterative processes. To verify SRU's universality we apply it to representative iterative stereo matching methods collectively referred to as Selective-Stereo. Our Selective-Stereo ranks first on KITTI 2012 KITTI 2015 ETH3D and Middlebury leaderboards among all published methods. Code is available at https://github.com/Windsrain/Selective-Stereo. + + + + PerAda: Parameter-Efficient Federated Learning Personalization with Generalization Guarantees + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_PerAda_Parameter-Efficient_Federated_Learning_Personalization_with_Generalization_Guarantees_CVPR_2024_paper.pdf + Personalized Federated Learning (pFL) has emerged as a promising solution to tackle data heterogeneity across clients in FL. However existing pFL methods either (1) introduce high computation and communication costs or (2) overfit to local data which can be limited in scope and vulnerable to evolved test samples with natural distribution shifts. In this paper we propose PerAda a parameter-efficient pFL framework that reduces communication and computational costs and exhibits superior generalization performance especially under test-time distribution shifts. PerAda reduces the costs by leveraging the power of pretrained models and only updates and communicates a small number of additional parameters from adapters. PerAda achieves high generalization by regularizing each client's personalized adapter with a global adapter while the global adapter uses knowledge distillation to aggregate generalized information from all clients. Theoretically we provide generalization bounds of PerAda and we prove its convergence to stationary points under non-convex settings. 
Empirically PerAda demonstrates higher personalized performance (+4.85% on CheXpert) and enables better out-of-distribution generalization (+5.23% on CIFAR-10-C) on different datasets across natural and medical domains compared with baselines while only updating 12.6% of parameters per model. Our code is available at https://github.com/NVlabs/PerAda. + + + + MAFA: Managing False Negatives for Vision-Language Pre-training + http://openaccess.thecvf.com//content/CVPR2024/papers/Byun_MAFA_Managing_False_Negatives_for_Vision-Language_Pre-training_CVPR_2024_paper.pdf + We consider a critical issue of false negatives in Vision- Language Pre-training (VLP) a challenge that arises from the inherent many-to-many correspondence of image-text pairs in large-scale web-crawled datasets. The presence of false negatives can impede achieving optimal performance and even lead to a significant performance drop. To address this challenge we propose MAFA (MAnaging FAlse nega- tives) which consists of two pivotal components building upon the recently developed GRouped mIni-baTch sampling (GRIT) strategy: 1) an efficient connection mining process that identifies and converts false negatives into positives and 2) label smoothing for the image-text contrastive (ITC) loss. Our comprehensive experiments verify the effectiveness of MAFA across multiple downstream tasks emphasizing the crucial role of addressing false negatives in VLP potentially even surpassing the importance of addressing false posi- tives. In addition the compatibility of MAFA with the recent BLIP-family model is also demonstrated. Code is available at https://github.com/jaeseokbyun/MAFA. + + + + InfLoRA: Interference-Free Low-Rank Adaptation for Continual Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_InfLoRA_Interference-Free_Low-Rank_Adaptation_for_Continual_Learning_CVPR_2024_paper.pdf + Continual learning requires the model to learn multiple tasks sequentially. In continual learning the model should possess the ability to maintain its performance on old tasks (stability) and the ability to adapt to new tasks continuously (plasticity). Recently parameter-efficient fine-tuning (PEFT) which involves freezing a pre-trained model and injecting a small number of learnable parameters to adapt to downstream tasks has gained increasing popularity in continual learning. Although existing continual learning methods based on PEFT have demonstrated superior performance compared to those not based on PEFT most of them do not consider how to eliminate the interference of the new task on the old tasks which inhibits the model from making a good trade-off between stability and plasticity. In this work we propose a new PEFT method called interference-free low-rank adaptation (InfLoRA) for continual learning. InfLoRA injects a small number of parameters to reparameterize the pre-trained weights and shows that fine-tuning these injected parameters is equivalent to fine-tuning the pre-trained weights within a subspace. Furthermore InfLoRA designs this subspace to eliminate the interference of the new task on the old tasks making a good trade-off between stability and plasticity. Experimental results show that InfLoRA outperforms existing state-of-the-art continual learning methods on multiple datasets. 
+ + + + PLGSLAM: Progressive Neural Scene Represenation with Local to Global Bundle Adjustment + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_PLGSLAM_Progressive_Neural_Scene_Represenation_with_Local_to_Global_Bundle_CVPR_2024_paper.pdf + Neural implicit scene representations have recently shown encouraging results in dense visual SLAM. However existing methods produce low-quality scene reconstruction and low-accuracy localization performance when scaling up to large indoor scenes and long sequences. These limitations are mainly due to their single global radiance field with finite capacity which does not adapt to large scenarios. Their end-to-end pose networks are also not robust enough with the growth of cumulative errors in large scenes. To this end we introduce PLGSLAM a neural visual SLAM system capable of high-fidelity surface reconstruction and robust camera tracking in real-time. To handle large-scale indoor scenes PLGSLAM proposes a progressive scene representation method which dynamically allocates new local scene representation trained with frames within a local sliding window. This allows us to scale up to larger indoor scenes and improves robustness (even under pose drifts). In local scene representation PLGSLAM utilizes tri-planes for local high-frequency features with multi-layer perceptron (MLP) networks for the low-frequency feature achieving smoothness and scene completion in unobserved areas. Moreover we propose local-to-global bundle adjustment method with a global keyframe database to address the increased pose drifts on long sequences. Experimental results demonstrate that PLGSLAM achieves state-of-the-art scene reconstruction results and tracking performance across various datasets and scenarios (both in small and large-scale indoor environments). + + + + Multi-Task Dense Prediction via Mixture of Low-Rank Experts + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Multi-Task_Dense_Prediction_via_Mixture_of_Low-Rank_Experts_CVPR_2024_paper.pdf + Previous multi-task dense prediction methods based on the Mixture of Experts (MoE) have received great performance but they neglect the importance of explicitly modeling the global relations among all tasks. In this paper we present a novel decoder-focused method for multi-task dense prediction called Mixture-of-Low-Rank-Experts (MLoRE). To model the global task relationships MLoRE adds a generic convolution path to the original MoE structure where each task feature can go through this path for explicit parameter sharing. Furthermore to control the parameters and computational cost brought by the increase in the number of experts we take inspiration from LoRA and propose to leverage the low-rank format of a vanilla convolution in the expert network. Since the low-rank experts have fewer parameters and can be dynamically parameterized into the generic convolution the parameters and computational cost do not change much with the increase of experts. Benefiting from this design we increase the number of experts and its reception field to enlarge the representation capacity facilitating multiple dense tasks learning in a unified network. Extensive experiments on the PASCAL-Context and NYUD-v2 benchmarks show that our MLoRE achieves superior performance compared to previous state-of-the-art methods on all metrics. Our code is available at https://github.com/YuqiYang213/MLoRE. 
+ + + + Binding Touch to Everything: Learning Unified Multimodal Tactile Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Binding_Touch_to_Everything_Learning_Unified_Multimodal_Tactile_Representations_CVPR_2024_paper.pdf + The ability to associate touch with other modalities has huge implications for humans and computational systems. However multimodal learning with touch remains challenging due to the expensive data collection process and non-standardized sensor outputs. We introduce UniTouch a unified tactile model for vision-based touch sensors connected to multiple modalities including vision language and sound. We achieve this by aligning our UniTouch embeddings to pretrained image embeddings already associated with a variety of other modalities. We further propose learnable sensor-specific tokens allowing the model to learn from a set of heterogeneous tactile sensors all at the same time. UniTouch is capable of conducting various touch sensing tasks in the zero-shot setting from robot grasping prediction to touch image question answering. To the best of our knowledge UniTouch is the first to demonstrate such capabilities. + + + + Your Transferability Barrier is Fragile: Free-Lunch for Transferring the Non-Transferable Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Hong_Your_Transferability_Barrier_is_Fragile_Free-Lunch_for_Transferring_the_Non-Transferable_CVPR_2024_paper.pdf + Recently non-transferable learning (NTL) was proposed to restrict models' generalization toward the target domain(s) which serves as state-of-the-art solutions for intellectual property (IP) protection. However the robustness of the established "transferability barrier" for degrading the target domain performance has not been well studied. In this paper we first show that the generalization performance of NTL models is widely impaired on third-party domains (i.e. the unseen domain in the NTL training stage). We explore the impairment patterns and find that: due to the dominant generalization of non-transferable task NTL models tend to make target-domain-consistent predictions on third-party domains even though only a slight distribution shift from the third-party domain to the source domain. Motivated by these findings we uncover the potential risks of NTL by proposing a simple but effective method (dubbed as TransNTL) to recover the target domain performance with few source domain data. Specifically by performing a group of different perturbations on the few source domain data we obtain diverse third-party domains that evoke the same impairment patterns as the unavailable target domain. Then we fine-tune the NTL model under an impairment-repair self-distillation framework where the source-domain predictions are used to teach the model itself how to predict on third-party domains thus repairing the impaired generalization. Empirically experiments on standard NTL benchmarks show that the proposed TransNTL reaches up to 72% target-domain improvements by using only 10% source domain data. Finally we also explore a feasible defense method and empirically demonstrate its effectiveness. 
+ + + + Complementing Event Streams and RGB Frames for Hand Mesh Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Complementing_Event_Streams_and_RGB_Frames_for_Hand_Mesh_Reconstruction_CVPR_2024_paper.pdf + Reliable hand mesh reconstruction (HMR) from commonly-used color and depth sensors is challenging, especially under scenarios with varied illuminations and fast motions. The event camera is a highly promising alternative for its high dynamic range and dense temporal resolution properties, but it lacks key texture appearance for hand mesh reconstruction. In this paper, we propose EvRGBHand -- the first approach for 3D hand mesh reconstruction with an event camera and an RGB camera compensating for each other. By fusing two modalities of data across time, space and information dimensions, EvRGBHand can tackle overexposure and motion blur issues in RGB-based HMR and foreground scarcity and background overflow issues in event-based HMR. We further propose EvRGBDegrader, which allows our model to generalize effectively in challenging scenes even when trained solely on standard scenes, thus reducing data acquisition costs. Experiments on real-world data demonstrate that EvRGBHand can effectively solve the challenging issues when using either type of camera alone by retaining the merits of both, and shows the potential of generalization to outdoor scenes and another type of event camera. Our code, models and dataset will be made public after acceptance. + + + + Empowering Resampling Operation for Ultra-High-Definition Image Enhancement with Model-Aware Guidance + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Empowering_Resampling_Operation_for_Ultra-High-Definition_Image_Enhancement_with_Model-Aware_Guidance_CVPR_2024_paper.pdf + Image enhancement algorithms have made remarkable advancements in recent years, but directly applying them to Ultra-high-definition (UHD) images presents intractable computational overheads. Therefore, previous straightforward solutions employ resampling techniques to reduce the resolution by adopting a "Downsampling-Enhancement-Upsampling" processing paradigm. However, this paradigm disentangles the resampling operators and inner enhancement algorithms, which results in the loss of information that is favored by the model, further leading to sub-optimal outcomes. In this paper, we propose a novel method of Learning Model-Aware Resampling (LMAR), which learns to customize resampling by extracting model-aware information from the UHD input image under the guidance of model knowledge. Specifically, our method consists of two core designs, namely compensatory kernel estimation and steganographic resampling. At the first stage, we dynamically predict compensatory kernels tailored to the specific input and resampling scales. At the second stage, the image-wise compensatory information is derived with the compensatory kernels and embedded into the rescaled input images. This promotes the representation of the newly derived downscaled inputs to be more consistent with the full-resolution UHD inputs as perceived by the model. Our LMAR enables model-aware and model-favored resampling while maintaining compatibility with existing resampling operators. Extensive experiments on multiple UHD image enhancement datasets and different backbones have shown consistent performance gains after correlating resizer and enhancer, e.g. up to 1.2dB PSNR gain for the x1.8 resampling scale on UHD-LOL4K.
The code is available at https://github.com/YPatrickW/LMAR. + + + + Hallucination Augmented Contrastive Learning for Multimodal Large Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Hallucination_Augmented_Contrastive_Learning_for_Multimodal_Large_Language_Model_CVPR_2024_paper.pdf + Multi-modal large language models (MLLMs) have been shown to efficiently integrate natural language with visual information to handle multi-modal tasks. However, MLLMs still face a fundamental limitation of hallucinations, where they tend to generate erroneous or fabricated information. In this paper, we address hallucinations in MLLMs from a novel perspective of representation learning. We first analyzed the representation distribution of textual and visual tokens in MLLMs, revealing two important findings: 1) there is a significant gap between textual and visual representations, indicating unsatisfactory cross-modal representation alignment; 2) representations of texts that contain and do not contain hallucinations are entangled, making it challenging to distinguish them. These two observations inspire us with a simple yet effective method to mitigate hallucinations. Specifically, we introduce contrastive learning into MLLMs and use text with hallucination as hard negative examples, naturally bringing representations of non-hallucinative text and visual samples closer while pushing away representations of non-hallucinating and hallucinative text. We evaluate our method quantitatively and qualitatively, showing its effectiveness in reducing hallucination occurrences and improving performance across multiple benchmarks. On the MMhal-Bench benchmark, our method obtains a 34.66%/29.5% improvement over the baseline MiniGPT-4/LLaVA. Our code is available on https://github.com/X-PLUG/mPLUG-HalOwl/tree/main/hacl. + + + + Cooperation Does Matter: Exploring Multi-Order Bilateral Relations for Audio-Visual Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Cooperation_Does_Matter_Exploring_Multi-Order_Bilateral_Relations_for_Audio-Visual_Segmentation_CVPR_2024_paper.pdf + Recently, an audio-visual segmentation (AVS) task has been introduced, aiming to group pixels with sounding objects within a given video. This task necessitates a first-ever audio-driven pixel-level understanding of the scene, posing significant challenges. In this paper, we propose an innovative audio-visual transformer framework termed COMBO, an acronym for COoperation of Multi-order Bilateral relatiOns. For the first time, our framework explores three types of bilateral entanglements within AVS: pixel entanglement, modality entanglement and temporal entanglement. Regarding pixel entanglement, we employ a Siam-Encoder Module (SEM) that leverages prior knowledge to generate more precise visual features from the foundational model. For modality entanglement, we design a Bilateral-Fusion Module (BFM), enabling COMBO to align corresponding visual and auditory signals bi-directionally. As for temporal entanglement, we introduce an innovative adaptive inter-frame consistency loss according to the inherent rules of temporal. Comprehensive experiments and ablation studies on AVSBench-object (84.7 mIoU on S4, 59.2 mIoU on MS3) and AVSBench-semantic (42.1 mIoU on AVSS) datasets demonstrate that COMBO surpasses previous state-of-the-art methods. Project page is available at https://yannqi.github.io/AVS-COMBO.
+ + + + Improved Self-Training for Test-Time Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Improved_Self-Training_for_Test-Time_Adaptation_CVPR_2024_paper.pdf + Test-time adaptation (TTA) is a technique to improve the performance of a pre-trained source model on a target distribution without using any labeled data. However, existing self-trained TTA methods often face the challenges of unreliable pseudo-labels and unstable model optimization. In this paper, we propose an Improved Self-Training (IST) approach, which addresses these challenges by enhancing the pseudo-label quality and stabilizing the adaptation process. Specifically, we use a simple augmentation strategy to generate multiple views of each test sample and construct a graph structure to correct the pseudo-labels based on the similarity of the latent features. Moreover, we adopt a parameter moving average scheme to smooth the model updates and prevent catastrophic forgetting. Instead of using a model with a fixed label space, we explore the adaptability of the foundation model CLIP to various downstream tasks at test time. Extensive experiments on various benchmarks show that IST can achieve significant and consistent improvements over the existing TTA methods in classification, detection and segmentation tasks. + + + + Unsupervised Feature Learning with Emergent Data-Driven Prototypicality + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_Unsupervised_Feature_Learning_with_Emergent_Data-Driven_Prototypicality_CVPR_2024_paper.pdf + Given a set of images, our goal is to map each image to a point in a feature space such that not only point proximity indicates visual similarity, but where it is located directly encodes how prototypical the image is according to the dataset. Our key insight is to perform unsupervised feature learning in hyperbolic instead of Euclidean space, where the distance between points still reflects image similarity, yet we gain additional capacity for representing prototypicality with the location of the point: the closer it is to the origin, the more prototypical it is. The latter property is simply emergent from optimizing the metric learning objective: the image similar to many training instances is best placed at the center of corresponding points in Euclidean space but closer to the origin in hyperbolic space. We propose an unsupervised feature learning algorithm in Hyperbolic space with sphere pACKing. HACK first generates uniformly packed particles in the Poincaré ball of hyperbolic space and then assigns each image uniquely to a particle. With our feature mapper simply trained to spread out training instances in hyperbolic space, we observe that images move closer to the origin with congealing - a warping process that aligns all the images and makes them appear more common and similar to each other, validating our idea of unsupervised prototypicality discovery. We demonstrate that our data-driven prototypicality provides an easy and superior unsupervised instance selection to reduce sample complexity, increase model generalization with atypical instances and robustness with typical ones.
+ + + + Improving Generalized Zero-Shot Learning by Exploring the Diverse Semantics from External Class Names + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Improving_Generalized_Zero-Shot_Learning_by_Exploring_the_Diverse_Semantics_from_CVPR_2024_paper.pdf + Generalized Zero-Shot Learning (GZSL) methods often assume that the unseen classes are similar to seen classes and thus perform poor when unseen classes are dissimilar to seen classes. Although some existing GZSL approaches can alleviate this issue by leveraging additional semantic information from test unseen classes their generalization ability to dissimilar unseen classes is still unsatisfactory. This motivates us to study GZSL in the more practical setting where unseen classes can be either similar or dissimilar to seen classes. In this paper we propose a simple yet effective GZSL framework by exploring diverse semantics from external class names (DSECN) which is simultaneously robust on the similar and dissimilar unseen classes. This is achieved by introducing diverse semantics from external class names and aligning the introduced semantics to visual space using the classification head of pre-trained network. Furthermore we show that the design idea of DSECN can easily be integrate into other advanced GZSL approaches such as the generative-based ones and enhance their robustness for dissimilar unseen classes. Extensive experiments in the practical setting including both similar and dissimilar unseen classes show that our method significantly outperforms the state-of-the-art approaches on all datasets and can be trained very efficiently. + + + + TeMO: Towards Text-Driven 3D Stylization for Multi-Object Meshes + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_TeMO_Towards_Text-Driven_3D_Stylization_for_Multi-Object_Meshes_CVPR_2024_paper.pdf + Recent progress in the text-driven 3D stylization of a single object has been considerably promoted by CLIP-based methods. However the stylization of multi-object 3D scenes is still impeded in that the image-text pairs used for pre-training CLIP mostly consist of an object. Meanwhile the local details of multiple objects may be susceptible to omission due to the existing supervision manner primarily relying on coarse-grained contrast of image-text pairs. To overcome these challenges we present a novel framework dubbed TeMO to parse multi-object 3D scenes and edit their styles under the contrast supervision at multiple levels. We first propose a Decoupled Graph Attention (DGA) module to distinguishably reinforce the features of 3D surface points. Particularly a cross-modal graph is constructed to align the object points accurately and noun phrases decoupled from the 3D mesh and textual description. Then we develop a Cross-Grained Contrast (CGC) supervision system where a fine-grained loss between the words in the textual description and the randomly rendered images are constructed to complement the coarse-grained loss. Extensive experiments show that our method can synthesize high-quality stylized content and outperform the existing methods over a wide range of multi-object 3D meshes. 
+ + + + GSNeRF: Generalizable Semantic Neural Radiance Fields with Enhanced 3D Scene Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Chou_GSNeRF_Generalizable_Semantic_Neural_Radiance_Fields_with_Enhanced_3D_Scene_CVPR_2024_paper.pdf + Utilizing multi-view inputs to synthesize novel-view images Neural Radiance Fields (NeRF) have emerged as a popular research topic in 3D vision. In this work we introduce a Generalizable Semantic Neural Radiance Field (GSNeRF) which uniquely takes image semantics into the synthesis process so that both novel view images and the associated semantic maps can be produced for unseen scenes. Our GSNeRF is composed of two stages: Semantic Geo-Reasoning and Depth-Guided Visual rendering. The former is able to observe multi-view image inputs to extract semantic and geometry features from a scene. Guided by the resulting image geometry information the latter performs both image and semantic rendering with improved performances. Our experiments not only confirm that GSNeRF performs favorably against prior works on both novel-view image and semantic segmentation synthesis but the effectiveness of our sampling strategy for visual rendering is further verified. + + + + Alpha Invariance: On Inverse Scaling Between Distance and Volume Density in Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Ahn_Alpha_Invariance_On_Inverse_Scaling_Between_Distance_and_Volume_Density_CVPR_2024_paper.pdf + Scale-ambiguity in 3D scene dimensions leads to magnitude-ambiguity of volumetric densities in neural radiance fields i.e. the densities double when scene size is halved and vice versa. We call this property alpha invariance. For NeRFs to better maintain alpha invariance we recommend 1) parameterizing both distance and volume densities in log space and 2) a discretization-agnostic initialization strategy to guarantee high ray transmittance. We revisit a few popular radiance field models and find that these systems use various heuristics to deal with issues arising from scene scaling. We test their behaviors and show our recipe to be more robust. + + + + D3T: Distinctive Dual-Domain Teacher Zigzagging Across RGB-Thermal Gap for Domain-Adaptive Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Phat_D3T_Distinctive_Dual-Domain_Teacher_Zigzagging_Across_RGB-Thermal_Gap_for_Domain-Adaptive_CVPR_2024_paper.pdf + Domain adaptation for object detection typically entails transferring knowledge from one visible domain to another visible domain. However there are limited studies on adapting from the visible to the thermal domain because the domain gap between the visible and thermal domains is much larger than expected and traditional domain adaptation can not successfully facilitate learning in this situation. To overcome this challenge we propose a Distinctive Dual-Domain Teacher (D3T) framework that employs distinct training paradigms for each domain. Specifically we segregate the source and target training sets for building dual-teachers and successively deploy exponential moving average to the student model to individual teachers of each domain. The framework further incorporates a zigzag learning method between dual teachers facilitating a gradual transition from the visible to thermal domains during training. We validate the superiority of our method through newly designed experimental protocols with well-known thermal datasets i.e. FLIR and KAIST. Source code is available at https://github.com/EdwardDo69/D3T. 
+ + + + Positive-Unlabeled Learning by Latent Group-Aware Meta Disambiguation + http://openaccess.thecvf.com//content/CVPR2024/papers/Long_Positive-Unlabeled_Learning_by_Latent_Group-Aware_Meta_Disambiguation_CVPR_2024_paper.pdf + Positive-Unlabeled (PU) learning aims to train a binary classifier using minimal positive data supplemented by a substantially larger pool of unlabeled data in the specific absence of explicitly annotated negatives. Despite its straightforward nature as a binary classification task the currently best-performing PU algorithms still largely lag behind the supervised counterpart. In this work we identify that the primary bottleneck lies in the difficulty of deriving discriminative representations under unreliable binary supervision with poor semantics which subsequently hinders the common label disambiguation procedures. To cope with this problem we propose a novel PU learning framework namely Latent Group-Aware Meta Disambiguation (LaGAM) which incorporates a hierarchical contrastive learning module to extract the underlying grouping semantics within PU data and produce compact representations. As a result LaGAM enables a more aggressive label disambiguation strategy where we enhance the robustness of training by iteratively distilling the true labels of unlabeled data directly through meta-learning. Extensive experiments show that LaGAM significantly outperforms the current state-of-the-art methods by an average of 6.8% accuracy on common benchmarks approaching the supervised baseline. We also provide comprehensive ablations as well as visualized analysis to verify the effectiveness of our LaGAM. + + + + Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Fu_Linguistic-Aware_Patch_Slimming_Framework_for_Fine-grained_Cross-Modal_Alignment_CVPR_2024_paper.pdf + Cross-modal alignment aims to build a bridge connecting vision and language. It is an important multi-modal task that efficiently learns the semantic similarities between images and texts. Traditional fine-grained alignment methods heavily rely on pre-trained object detectors to extract region features for subsequent region-word alignment thereby incurring substantial computational costs for region detection and error propagation issues for two-stage training. In this paper we focus on the mainstream vision transformer incorporating patch features for patch-word alignment while addressing the resultant issue of visual patch redundancy and patch ambiguity for semantic alignment. We propose a novel Linguistic-Aware Patch Slimming (LAPS) framework for fine-grained alignment which explicitly identifies redundant visual patches with language supervision and rectifies their semantic and spatial information to facilitate more effective and consistent patch-word alignment. Extensive experiments on various evaluation benchmarks and model backbones show LAPS outperforms the state-of-the-art fine-grained alignment methods by 5%-15% rSum. Our code is available at https://github.com/CrossmodalGroup/LAPS + + + + Domain-Rectifying Adapter for Cross-Domain Few-Shot Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Su_Domain-Rectifying_Adapter_for_Cross-Domain_Few-Shot_Segmentation_CVPR_2024_paper.pdf + Few-shot semantic segmentation (FSS) has achieved great success on segmenting objects of novel classes supported by only a few annotated samples. 
However existing FSS methods often underperform in the presence of domain shifts, especially when encountering new domain styles that are unseen during training. It is suboptimal to directly adapt or generalize the entire model to new domains in the few-shot scenario. Instead, our key idea is to adapt a small adapter for rectifying diverse target domain styles to the source domain. Consequently, the rectified target domain features can fittingly benefit from the well-optimized source domain segmentation model, which is intently trained on sufficient source domain data. Training the domain-rectifying adapter requires sufficiently diverse target domains. We thus propose a novel local-global style perturbation method to simulate diverse potential target domains by perturbing the feature channel statistics of the individual images and the collective statistics of the entire source domain, respectively. Additionally, we propose a cyclic domain alignment module to help the adapter effectively rectify domains using a reverse domain rectification supervision. The adapter is trained to rectify the image features from diverse synthesized target domains to align with the source domain. During testing on target domains, we start by rectifying the image features and then conduct few-shot segmentation on the domain-rectified features. Extensive experiments demonstrate the effectiveness of our method, achieving promising results on cross-domain few-shot semantic segmentation tasks. Our code is available at https://github.com/Matt-Su/DR-Adapter. + + + + CPP-Net: Embracing Multi-Scale Feature Fusion into Deep Unfolding CP-PPA Network for Compressive Sensing + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_CPP-Net_Embracing_Multi-Scale_Feature_Fusion_into_Deep_Unfolding_CP-PPA_Network_CVPR_2024_paper.pdf + In the domain of compressive sensing (CS), deep unfolding networks (DUNs) have garnered attention for their good performance and a certain degree of interpretability rooted in the CS domain, achieved by marrying traditional optimization solvers with deep networks. However, current DUNs are ill-suited for the intricate task of capturing fine-grained image details, leading to perceptible distortions and blurriness in reconstructed images, particularly at low CS ratios, e.g. 0.10 and below. In this paper we propose CPP-Net, a novel deep unfolding CS framework inspired by the primal-dual hybrid strategy of the Chambolle and Pock Proximal Point Algorithm (CP-PPA). First, we derive three iteration submodules, Xk, Vk and Yk, by incorporating customized deep learning modules to solve the sparse basis related proximal operator within CP-PPA. Second, we design the Dual Path Fusion Block (DPFB) to adeptly extract and fuse multi-scale feature information, enhancing sensitivity to feature information at different scales and improving detail reconstruction. Third, we introduce the Iteration Fusion Strategy (IFS) to effectively weight the fusion of outputs from diverse reconstruction stages, maximizing the utilization of feature information and mitigating the information loss during reconstruction stages. Extensive experiments demonstrate that CPP-Net effectively reduces distortion and blurriness while preserving richer image details, outperforming current state-of-the-art methods. Codes are available at https://github.com/ICSResearch/CPP-Net. 
+ + + + 3DGStream: On-the-Fly Training of 3D Gaussians for Efficient Streaming of Photo-Realistic Free-Viewpoint Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_3DGStream_On-the-Fly_Training_of_3D_Gaussians_for_Efficient_Streaming_of_CVPR_2024_paper.pdf + Constructing photo-realistic Free-Viewpoint Videos (FVVs) of dynamic scenes from multi-view videos remains a challenging endeavor. Despite the remarkable advancements achieved by current neural rendering techniques these methods generally require complete video sequences for offline training and are not capable of real-time rendering. To address these constraints we introduce 3DGStream a method designed for efficient FVV streaming of real-world dynamic scenes. Our method achieves fast on-the-fly per-frame reconstruction within 12 seconds and real-time rendering at 200 FPS. Specifically we utilize 3D Gaussians (3DGs) to represent the scene. Instead of the naive approach of directly optimizing 3DGs per-frame we employ a compact Neural Transformation Cache (NTC) to model the translations and rotations of 3DGs markedly reducing the training time and storage required for each FVV frame. Furthermore we propose an adaptive 3DG addition strategy to handle emerging objects in dynamic scenes. Experiments demonstrate that 3DGStream achieves competitive performance in terms of rendering speed image quality training time and model storage when compared with state-of-the-art methods. + + + + FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Aneja_FaceTalk_Audio-Driven_Motion_Diffusion_for_Neural_Parametric_Head_Models_CVPR_2024_paper.pdf + We introduce FaceTalk a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from input audio signal. To capture the expressive detailed nature of human heads including hair ears and finer-scale eye movements we propose to couple speech signal with the latent space of neural parametric head models to create high-fidelity temporally coherent motion sequences. We propose a new latent diffusion model for this task operating in the expression space of neural parametric head models to synthesize audio-driven realistic head sequences. In the absence of a dataset with corresponding NPHM expressions to audio we optimize for these correspondences to produce a dataset of temporally-optimized NPHM expressions fit to audio-video recordings of people talking. To the best of our knowledge this is the first work to propose a generative approach for realistic and high-quality motion synthesis of volumetric human heads representing a significant advancement in the field of audio-driven 3D animation. Notably our approach stands out in its ability to generate plausible motion sequences that can produce high-fidelity head animation coupled with the NPHM shape space. Our experimental results substantiate the effectiveness of FaceTalk consistently achieving superior and visually natural motion encompassing diverse facial expressions and styles outperforming existing methods by 75% in perceptual user study evaluation + + + + Mip-Splatting: Alias-free 3D Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Mip-Splatting_Alias-free_3D_Gaussian_Splatting_CVPR_2024_paper.pdf + Recently 3D Gaussian Splatting has demonstrated impressive novel view synthesis results reaching high fidelity and efficiency. 
However strong artifacts can be observed when changing the sampling rate, e.g. by changing focal length or camera distance. We find that the source for this phenomenon can be attributed to the lack of 3D frequency constraints and the usage of a 2D dilation filter. To address this problem we introduce a 3D smoothing filter to constrain the size of the 3D Gaussian primitives based on the maximal sampling frequency induced by the input views. It eliminates high-frequency artifacts when zooming in. Moreover, replacing 2D dilation with a 2D Mip filter, which simulates a 2D box filter, effectively mitigates aliasing and dilation issues. Our evaluation, including scenarios such as training on single-scale images and testing on multiple scales, validates the effectiveness of our approach. + + + + Learning Coupled Dictionaries from Unpaired Data for Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Learning_Coupled_Dictionaries_from_Unpaired_Data_for_Image_Super-Resolution_CVPR_2024_paper.pdf + The difficulty of acquiring high-resolution (HR) and low-resolution (LR) image pairs in real scenarios limits the performance of existing learning-based image super-resolution (SR) methods in the real world. To conduct training on real-world unpaired data, current methods focus on synthesizing pseudo LR images to associate unpaired images. However, the realness and diversity of pseudo LR images are vulnerable due to the large image space. In this paper we circumvent the difficulty of image generation and propose an alternative to build the connection between unpaired images in a compact proxy space. Specifically, we first construct coupled HR and LR dictionaries and then encode HR and LR images into a common latent code space using these dictionaries. In addition, we develop an autoencoder-based framework to couple these dictionaries during optimization by reconstructing input HR and LR images. The coupled dictionaries enable our method to employ a shallow network architecture with only 18 layers to achieve efficient image SR. Extensive experiments show that our method (DictSR) can effectively model the LR-to-HR mapping in coupled dictionaries and produce state-of-the-art performance on benchmark datasets. + + + + Deep Video Inverse Tone Mapping Based on Temporal Clues + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_Deep_Video_Inverse_Tone_Mapping_Based_on_Temporal_Clues_CVPR_2024_paper.pdf + Inverse tone mapping (ITM) aims to reconstruct high dynamic range (HDR) radiance from low dynamic range (LDR) content. Although many deep image ITM methods can generate impressive results, the field of video ITM is still to be explored. Processing video sequences by image ITM methods may cause temporal inconsistency. Besides, they are not able to exploit the potentially useful information in the temporal domain. In this paper we analyze the process of video filming and then propose a Global Sample and Local Propagate strategy to better find and utilize temporal clues. To better realize the proposed strategy, we design a two-stage pipeline which includes modules named Incremental Clue Aggregation Module and Feature and Clue Propagation Module. They can align and fuse frames effectively under the condition of brightness changes and propagate features and temporal clues to all frames efficiently. Our temporal-clue-based video ITM method can recover realistic and temporally consistent results with high fidelity in over-exposed regions. 
Qualitative and quantitative experiments on public datasets show that the proposed method has significant advantages over existing methods. + + + + NeRF-HuGS: Improved Neural Radiance Fields in Non-static Scenes Using Heuristics-Guided Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_NeRF-HuGS_Improved_Neural_Radiance_Fields_in_Non-static_Scenes_Using_Heuristics-Guided_CVPR_2024_paper.pdf + Neural Radiance Field (NeRF) has been widely recognized for its excellence in novel view synthesis and 3D scene reconstruction. However, its effectiveness is inherently tied to the assumption of static scenes, rendering it susceptible to undesirable artifacts when confronted with transient distractors such as moving objects or shadows. In this work we propose a novel paradigm, namely "Heuristics-Guided Segmentation" (HuGS), which significantly enhances the separation of static scenes from transient distractors by harmoniously combining the strengths of hand-crafted heuristics and state-of-the-art segmentation models, thus significantly transcending the limitations of previous solutions. Furthermore, we delve into the meticulous design of heuristics, introducing a seamless fusion of Structure-from-Motion (SfM)-based heuristics and color residual heuristics, catering to a diverse range of texture profiles. Extensive experiments demonstrate the superiority and robustness of our method in mitigating transient distractors for NeRFs trained in non-static scenes. Project page: https://cnhaox.github.io/NeRF-HuGS/ + + + + ImageNet-D: Benchmarking Neural Network Robustness on Diffusion Synthetic Object + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_ImageNet-D_Benchmarking_Neural_Network_Robustness_on_Diffusion_Synthetic_Object_CVPR_2024_paper.pdf + We establish rigorous benchmarks for visual perception robustness. Synthetic images such as ImageNet-C, ImageNet-9 and Stylized ImageNet provide a specific type of evaluation over synthetic corruptions, backgrounds and textures, yet those robustness benchmarks are restricted to specified variations and have low synthetic quality. In this work we introduce a generative model as a data source for synthesizing hard images that benchmark deep models' robustness. Leveraging diffusion models, we are able to generate images with more diversified backgrounds, textures and materials than any prior work; we term this benchmark ImageNet-D. Experimental results show that ImageNet-D causes a significant accuracy drop for a range of vision models, from the standard ResNet visual classifier to the latest foundation models like CLIP and MiniGPT-4, reducing their accuracy by up to 60%. Our work suggests that diffusion models can be an effective source to test vision models. The code and dataset are available at https://github.com/chenshuang-zhang/imagenet_d. + + + + Text-Enhanced Data-free Approach for Federated Class-Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Tran_Text-Enhanced_Data-free_Approach_for_Federated_Class-Incremental_Learning_CVPR_2024_paper.pdf + Federated Class-Incremental Learning (FCIL) is an underexplored yet pivotal issue involving the dynamic addition of new classes in the context of federated learning. In this field, Data-Free Knowledge Transfer (DFKT) plays a crucial role in addressing catastrophic forgetting and data privacy problems. 
However prior approaches lack the crucial synergy between DFKT and the model training phases causing DFKT to encounter difficulties in generating high-quality data from a non-anchored latent space of the old task model. In this paper we introduce LANDER (Label Text Centered Data-Free Knowledge Transfer) to address this issue by utilizing label text embeddings (LTE) produced by pretrained language models. Specifically during the model training phase our approach treats LTE as anchor points and constrains the feature embeddings of corresponding training samples around them enriching the surrounding area with more meaningful information. In the DFKT phase by using these LTE anchors LANDER can synthesize more meaningful samples thereby effectively addressing the forgetting problem. Additionally instead of tightly constraining embeddings toward the anchor the Bounding Loss is introduced to encourage sample embeddings to remain flexible within a defined radius. This approach preserves the natural differences in sample embeddings and mitigates the embedding overlap caused by heterogeneous federated settings. Extensive experiments conducted on CIFAR100 Tiny-ImageNet and ImageNet demonstrate that LANDER significantly outperforms previous methods and achieves state-of-the-art performance in FCIL. The code is available at https://github.com/tmtuan1307/lander. + + + + UDiFF: Generating Conditional Unsigned Distance Fields with Optimal Wavelet Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_UDiFF_Generating_Conditional_Unsigned_Distance_Fields_with_Optimal_Wavelet_Diffusion_CVPR_2024_paper.pdf + Diffusion models have shown remarkable results for image generation editing and inpainting. Recent works explore diffusion models for 3D shape generation with neural implicit functions i.e. signed distance function and occupancy function. However they are limited to shapes with closed surfaces which prevents them from generating diverse 3D real-world contents containing open surfaces. In this work we present UDiFF a 3D diffusion model for unsigned distance fields (UDFs) which is capable to generate textured 3D shapes with open surfaces from text conditions or unconditionally. Our key idea is to generate UDFs in spatial-frequency domain with an optimal wavelet transformation which produces a compact representation space for UDF generation. Specifically instead of selecting an appropriate wavelet transformation which requires expensive manual efforts and still leads to large information loss we propose a data-driven approach to learn the optimal wavelet transformation for UDFs. We evaluate UDiFF to show our advantages by numerical and visual comparisons with the latest methods on widely used benchmarks. Page: https://weiqi-zhang.github.io/UDiFF. + + + + Towards Large-scale 3D Representation Learning with Multi-dataset Point Prompt Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Towards_Large-scale_3D_Representation_Learning_with_Multi-dataset_Point_Prompt_Training_CVPR_2024_paper.pdf + The rapid advancement of deep learning models is often attributed to their ability to leverage massive training data. In contrast such privilege has not yet fully benefited 3D deep learning mainly due to the limited availability of large-scale 3D datasets. Merging multiple available data sources and letting them collaboratively train a single model is a potential solution. 
However due to the large domain gap between 3D point cloud datasets such mixed supervision could adversely affect the model's performance and lead to degenerated performance (i.e. negative transfer) compared to single-dataset training. In view of this challenge we introduce Point Prompt Training (PPT) a novel framework for multi-dataset synergistic learning in the context of 3D representation learning that supports multiple pre-training paradigms. Based on this framework we propose Prompt-driven Normalization which adapts the model to different datasets with domain-specific prompts and Language-guided Categorical Alignment that decently unifies the multiple-dataset label spaces by leveraging the relationship between label text. Extensive experiments verify that PPT can overcome the negative transfer associated with synergistic learning and produce generalizable representations. Notably it achieves state-of-the-art performance on each dataset using a single weight-shared model with supervised multi-dataset training. Moreover when served as a pre-training framework it outperforms other pre-training approaches regarding representation quality and attains remarkable state-of-the-art performance across over ten diverse downstream tasks spanning both indoor and outdoor 3D scenarios. + + + + EmbodiedScan: A Holistic Multi-Modal 3D Perception Suite Towards Embodied AI + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_EmbodiedScan_A_Holistic_Multi-Modal_3D_Perception_Suite_Towards_Embodied_AI_CVPR_2024_paper.pdf + In the realm of computer vision and robotics embodied agents are expected to explore their environment and carry out human instructions. This necessitates the ability to fully understand 3D scenes given their first-person observations and contextualize them into language for interaction. However traditional research focuses more on scene-level input and output setups from a global view. To address the gap we introduce EmbodiedScan a multi-modal ego-centric 3D perception dataset and benchmark for holistic 3D scene understanding. It encompasses over 5k scans encapsulating 1M ego-centric RGB-D views 1M language prompts 160k 3D-oriented boxes spanning over 760 categories some of which partially align with LVIS and dense semantic occupancy with 80 common categories. Building upon this database we introduce a baseline framework named Embodied Perceptron. It is capable of processing an arbitrary number of multi-modal inputs and demonstrates remarkable 3D perception capabilities both within the two series of benchmarks we set up i.e. fundamental 3D perception tasks and language-grounded tasks and in the wild. + + + + SHINOBI: Shape and Illumination using Neural Object Decomposition via BRDF Optimization In-the-wild + http://openaccess.thecvf.com//content/CVPR2024/papers/Engelhardt_SHINOBI_Shape_and_Illumination_using_Neural_Object_Decomposition_via_BRDF_CVPR_2024_paper.pdf + We present SHINOBI an end-to-end framework for the reconstruction of shape material and illumination from object images captured with varying lighting pose and background. Inverse rendering of an object based on unconstrained image collections is a long-standing challenge in computer vision and graphics and requires a joint optimization over shape radiance and pose. We show that an implicit shape representation based on a multi-resolution hash encoding enables faster and robust shape reconstruction with joint camera alignment optimization that outperforms prior work. 
Further to enable the editing of illumination and object reflectance (i.e. material) we jointly optimize BRDF and illumination together with the object's shape. Our method is class-agnostic and works on in-the-wild image collections of objects to produce relightable 3D assets for several use cases such as AR/VR movies games etc. + + + + ES3: Evolving Self-Supervised Learning of Robust Audio-Visual Speech Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_ES3_Evolving_Self-Supervised_Learning_of_Robust_Audio-Visual_Speech_Representations_CVPR_2024_paper.pdf + We propose a novel strategy ES3 for self-supervised learning of robust audio-visual speech representations from unlabeled talking face videos. While many recent approaches for this task primarily rely on guiding the learning process using the audio modality alone to capture information shared between audio and video we reframe the problem as the acquisition of shared unique (modality-specific) and synergistic speech information to address the inherent asymmetry between the modalities. Based on this formulation we propose a novel "evolving" strategy that progressively builds joint audio-visual speech representations that are strong for both uni-modal (audio & visual) and bi-modal (audio-visual) speech. First we leverage the more easily learnable audio modality to initialize audio and visual representations by capturing audio-unique and shared speech information. Next we incorporate video-unique speech information and bootstrap the audio-visual representations on top of the previously acquired shared knowledge. Finally we maximize the total audio-visual speech information including synergistic information to obtain robust and comprehensive representations. We implement ES3 as a simple Siamese framework and experiments on both English benchmarks and a newly contributed large-scale Mandarin dataset show its effectiveness. In particular on LRS2-BBC our smallest model is on par with SoTA models with only 1/2 parameters and 1/8 unlabeled data (223h). + + + + Motion2VecSets: 4D Latent Vector Set Diffusion for Non-rigid Shape Reconstruction and Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Motion2VecSets_4D_Latent_Vector_Set_Diffusion_for_Non-rigid_Shape_Reconstruction_CVPR_2024_paper.pdf + We introduce Motion2VecSets a 4D diffusion model for dynamic surface reconstruction from point cloud sequences. While existing state-of-the-art methods have demonstrated success in reconstructing non-rigid objects using neural field representations conventional feed-forward networks encounter challenges with ambiguous observations from noisy partial or sparse point clouds. To address these challenges we introduce a diffusion model that explicitly learns the shape and motion distribution of non-rigid objects through an iterative denoising process of compressed latent representations. The diffusion-based priors enable more plausible and probabilistic reconstructions when handling ambiguous inputs. We parameterize 4D dynamics with latent sets instead of using global latent codes. This novel 4D representation allows us to learn local shape and deformation patterns leading to more accurate non-linear motion capture and significantly improving generalizability to unseen motions and identities. For more temporally-coherent object tracking we synchronously denoise deformation latent sets and exchange information across multiple frames. 
To avoid computational overhead we designed an interleaved space and time attention block to alternately aggregate deformation latents along spatial and temporal domains. Extensive comparisons against state-of-the-art methods demonstrate the superiority of our Motion2VecSets in 4D reconstruction from various imperfect observations. + + + + A2XP: Towards Private Domain Generalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_A2XP_Towards_Private_Domain_Generalization_CVPR_2024_paper.pdf + Deep Neural Networks (DNNs) have become pivotal in various fields especially in computer vision outperforming previous methodologies. A critical challenge in their deployment is the bias inherent in data across different domains such as image style and environmental conditions leading to domain gaps. This necessitates techniques for learning general representations from biased training data known as domain generalization. This paper presents Attend to eXpert Prompts (A2XP) a novel approach for domain generalization that preserves the privacy and integrity of the network architecture. A2XP consists of two phases: Expert Adaptation and Domain Generalization. In the first phase prompts for each source domain are optimized to guide the model towards the optimal direction. In the second phase two embedder networks are trained to effectively amalgamate these expert prompts aiming for an optimal output. Our extensive experiments demonstrate that A2XP achieves state-of-the-art results over existing non-private domain generalization methods. The experimental results validate that the proposed approach not only tackles the domain generalization challenge in DNNs but also offers a privacy-preserving efficient solution to the broader field of computer vision. + + + + Active Domain Adaptation with False Negative Prediction for Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Nakamura_Active_Domain_Adaptation_with_False_Negative_Prediction_for_Object_Detection_CVPR_2024_paper.pdf + Domain adaptation adapts models to various scenes with different appearances. In this field active domain adaptation is crucial in effectively sampling a limited number of data in the target domain. We propose an active domain adaptation method for object detection focusing on quantifying the undetectability of objects. Existing methods for active sampling encounter challenges in considering undetected objects while estimating the uncertainty of model predictions. Our proposed active sampling strategy addresses this issue using an active learning approach that simultaneously accounts for uncertainty and undetectability. Our newly proposed False Negative Prediction Module evaluates the undetectability of images containing undetected objects enabling more informed active sampling. This approach considers previously overlooked undetected objects thereby reducing false negative errors. Moreover using unlabeled data our proposed method utilizes uncertainty-guided pseudo-labeling to enhance domain adaptation further. Extensive experiments demonstrate that the performance of our proposed method closely rivals that of fully supervised learning while requiring only a fraction of the labeling efforts needed for the latter. 
+ + + + Generative 3D Part Assembly via Part-Whole-Hierarchy Message Passing + http://openaccess.thecvf.com//content/CVPR2024/papers/Du_Generative_3D_Part_Assembly_via_Part-Whole-Hierarchy_Message_Passing_CVPR_2024_paper.pdf + Generative 3D part assembly involves understanding part relationships and predicting their 6-DoF poses for assembling a realistic 3D shape. Prior work often focuses on the geometry of individual parts, neglecting part-whole hierarchies of objects. Leveraging two key observations: 1) super-part poses provide strong hints about part poses and 2) predicting super-part poses is easier due to fewer super-parts, we propose a part-whole-hierarchy message passing network for efficient 3D part assembly. We first introduce super-parts by grouping geometrically similar parts without any semantic labels. Then we employ a part-whole hierarchical encoder wherein a super-part encoder predicts latent super-part poses based on input parts. Subsequently, we transform the point cloud using the latent poses, feeding it to the part encoder for aggregating super-part information and reasoning about part relationships to predict all part poses. In training, only ground-truth part poses are required. During inference, the predicted latent poses of super-parts enhance interpretability. Experimental results on the PartNet dataset show that our method achieves state-of-the-art performance in part and connectivity accuracy and enables an interpretable hierarchical part assembly. + + + + Benchmarking Segmentation Models with Mask-Preserved Attribute Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_Benchmarking_Segmentation_Models_with_Mask-Preserved_Attribute_Editing_CVPR_2024_paper.pdf + When deploying segmentation models in practice, it is critical to evaluate their behaviors in varied and complex scenes. Different from previous evaluation paradigms that only consider global attribute variations (e.g. adverse weather), we investigate both local and global attribute variations for robustness evaluation. To achieve this, we construct a mask-preserved attribute editing pipeline to edit visual attributes of real images with precise control of structural information. Therefore, the original segmentation labels can be reused for the edited images. Using our pipeline, we construct a benchmark covering both object and image attributes (e.g. color, material, pattern, style). We evaluate a broad variety of semantic segmentation models, spanning from conventional close-set models to recent open-vocabulary large models, on their robustness to different types of variations. We find that both local and global attribute variations affect segmentation performances, and the sensitivity of models diverges across different variation types. We argue that local attributes have the same importance as global attributes and should be considered in the robustness evaluation of segmentation models. Code: https://github.com/PRIS-CV/Pascal-EA. + + + + Analyzing and Improving the Training Dynamics of Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Karras_Analyzing_and_Improving_the_Training_Dynamics_of_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion models currently dominate the field of data-driven image synthesis with their unparalleled scaling to large datasets. In this paper we identify and rectify several causes for uneven and ineffective training in the popular ADM diffusion model architecture without altering its high-level structure. 
Observing uncontrolled magnitude changes and imbalances in both the network activations and weights over the course of training we redesign the network layers to preserve activation weight and update magnitudes on expectation. We find that systematic application of this philosophy eliminates the observed drifts and imbalances resulting in considerably better networks at equal computational complexity. Our modifications improve the previous record FID of 2.41 in ImageNet-512 synthesis to 1.81 achieved using fast deterministic sampling. As an independent contribution we present a method for setting the exponential moving average (EMA) parameters post-hoc i.e. after completing the training run. This allows precise tuning of EMA length without the cost of performing several training runs and reveals its surprising interactions with network architecture training time and guidance. + + + + Hierarchical Correlation Clustering and Tree Preserving Embedding + http://openaccess.thecvf.com//content/CVPR2024/papers/Chehreghani_Hierarchical_Correlation_Clustering_and_Tree_Preserving_Embedding_CVPR_2024_paper.pdf + We propose a hierarchical correlation clustering method that extends the well-known correlation clustering to produce hierarchical clusters applicable to both positive and negative pairwise dissimilarities. Then in the following we study unsupervised representation learning with such hierarchical correlation clustering. For this purpose we first investigate embedding the respective hierarchy to be used for tree preserving embedding and feature extraction. Thereafter we study the extension of minimax distance measures to correlation clustering as another representation learning paradigm. Finally we demonstrate the performance of our methods on several datasets. + + + + Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion? + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Can_Protective_Perturbation_Safeguard_Personal_Data_from_Being_Exploited_by_CVPR_2024_paper.pdf + Stable Diffusion has established itself as a foundation model in generative AI artistic applications receiving widespread research and application. Some recent fine-tuning methods have made it feasible for individuals to implant personalized concepts onto the basic Stable Diffusion model with minimal computational costs on small datasets. However these innovations have also given rise to issues like facial privacy forgery and artistic copyright infringement. In recent studies researchers have explored the addition of imperceptible adversarial perturbations to images to prevent potential unauthorized exploitation and infringements when personal data is used for fine-tuning Stable Diffusion. Although these studies have demonstrated the ability to protect images it is essential to consider that these methods may not be entirely applicable in real-world scenarios. In this paper we systematically evaluate the use of perturbations to protect images within a practical threat model. The results suggest that these approaches may not be sufficient to safeguard image privacy and copyright effectively. Furthermore we introduce a purification method capable of removing protected perturbations while preserving the original image structure to the greatest extent possible. Experiments reveal that Stable Diffusion can effectively learn from purified images over all protective methods. 
+ + + + MultiPLY: A Multisensory Object-Centric Embodied Large Language Model in 3D World + http://openaccess.thecvf.com//content/CVPR2024/papers/Hong_MultiPLY_A_Multisensory_Object-Centric_Embodied_Large_Language_Model_in_3D_CVPR_2024_paper.pdf + Human beings possess the capability to multiply a melange of multisensory cues while actively exploring and interacting with the 3D world. Current multi-modal large language models however passively absorb sensory data as inputs lacking the capacity to actively interact with the objects in the 3D environment and dynamically collect their multisensory information. To usher in the study of this area we propose MultiPLY a multisensory embodied large language model that could incorporate multisensory interactive data including visual audio tactile and thermal information into large language models thereby establishing the correlation among words actions and percepts. To this end we first collect Multisensory Universe a large-scale multisensory interaction dataset comprising 500k data by deploying an LLM-powered embodied agent to engage with the 3D environment. To perform instruction tuning with pre-trained LLM on such generated data we first encode the 3D scene as abstracted object-centric representations and then introduce action tokens denoting that the embodied agent takes certain actions within the environment as well as state tokens that represent the multisensory state observations of the agent at each time step. In the inference time MultiPLY could generate action tokens instructing the agent to take the action in the environment and obtain the next multisensory state observation. The observation is then appended back to the LLM via state tokens to generate subsequent text or action tokens. We demonstrate that MultiPLY outperforms baselines by a large margin through a diverse set of embodied tasks involving object retrieval tool use multisensory captioning and task decomposition. + + + + Learning to Visually Localize Sound Sources from Mixtures without Prior Source Knowledge + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Learning_to_Visually_Localize_Sound_Sources_from_Mixtures_without_Prior_CVPR_2024_paper.pdf + The goal of the multi-sound source localization task is to localize sound sources from the mixture individually. While recent multi-sound source localization methods have shown improved performance they face challenges due to their reliance on prior information about the number of objects to be separated. In this paper to overcome this limitation we present a novel multi-sound source localization method that can perform localization without prior knowledge of the number of sound sources. To achieve this goal we propose an iterative object identification (IOI) module which can recognize sound-making objects in an iterative manner. After finding the regions of sound-making objects we devise object similarity-aware clustering (OSC) loss to guide the IOI module to effectively combine regions of the same object but also distinguish between different objects and backgrounds. It enables our method to perform accurate localization of sound-making objects without any prior knowledge. Extensive experimental results on the MUSIC and VGGSound benchmarks show the significant performance improvements of the proposed method over the existing methods for both single and multi-source. 
Our code is available at: https://github.com/VisualAIKHU/NoPrior_MultiSSL + + + + Regressor-Segmenter Mutual Prompt Learning for Crowd Counting + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_Regressor-Segmenter_Mutual_Prompt_Learning_for_Crowd_Counting_CVPR_2024_paper.pdf + Crowd counting has achieved significant progress by training regressors to predict instance positions. In heavily crowded scenarios, however, regressors are challenged by uncontrollable annotation variance, which causes density map bias and context information inaccuracy. In this study we propose mutual prompt learning (mPrompt), which leverages a regressor and a segmenter as guidance for each other, solving bias and inaccuracy caused by annotation variance while distinguishing foreground from background. Specifically, mPrompt leverages point annotations to tune the segmenter and predict pseudo head masks in a way of point prompt learning. It then uses the predicted segmentation masks, which serve as a spatial constraint, to rectify biased point annotations as context prompt learning. mPrompt defines a way of mutual information maximization from prompt learning, mitigating the impact of annotation variance while improving model accuracy. Experiments show that mPrompt significantly reduces the Mean Average Error (MAE), demonstrating the potential to be a general framework for downstream vision tasks. Code is available at https://github.com/csguomy/mPrompt. + + + + Instantaneous Perception of Moving Objects in 3D + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Instantaneous_Perception_of_Moving_Objects_in_3D_CVPR_2024_paper.pdf + The perception of 3D motion of surrounding traffic participants is crucial for driving safety. While existing works primarily focus on general large motions, we contend that the instantaneous detection and quantification of subtle motions are equally important, as they indicate the nuances in driving behavior that may be safety-critical, such as behaviors near a stop sign or parking positions. We delve into this under-explored task, examining its unique challenges and developing our solution, accompanied by a carefully designed benchmark. Specifically, due to the lack of correspondences between consecutive frames of sparse Lidar point clouds, static objects might appear to be moving - the so-called swimming effect. This intertwines with the true object motion, thereby posing ambiguity in accurate estimation, especially for subtle motion. To address this, we propose to leverage local occupancy completion of object point clouds to densify the shape cue and mitigate the impact of swimming artifacts. The occupancy completion is learned in an end-to-end fashion together with the detection of moving objects and the estimation of their motion, instantaneously as soon as objects start to move. Extensive experiments demonstrate superior performance compared to standard 3D motion estimation approaches, particularly highlighting our method's specialized treatment of subtle motion. + + + + CORE-MPI: Consistency Object Removal with Embedding MultiPlane Image + http://openaccess.thecvf.com//content/CVPR2024/papers/Yoon_CORE-MPI_Consistency_Object_Removal_with_Embedding_MultiPlane_Image_CVPR_2024_paper.pdf + Novel view synthesis is attractive for social media but it often contains unwanted details such as personal information that needs to be edited out for a better experience. 
Multiplane image (MPI) is desirable for social media because of its generality but it is complex and computationally expensive making object removal challenging. To address these challenges we propose CORE-MPI which employs embedding images to improve the consistency and accessibility of MPI object removal. CORE-MPI allows for real-time transmission and interaction with embedding images on social media facilitating object removal with a single mask. However recovering the geometric information hidden in the embedding images is a significant challenge. Therefore we propose a dual-network approach where one network focuses on color restoration and the other on inpainting the embedding image including geometric information. For the training of CORE-MPI we introduce a pseudo-reference loss aimed at proficient color recovery even in complex scenes or with large masks. Furthermore we present a disparity consistency loss to preserve the geometric consistency of the inpainted region. We demonstrate the effectiveness of CORE-MPI on RealEstate10K and UCSD datasets. + + + + Backpropagation-free Network for 3D Test-time Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Backpropagation-free_Network_for_3D_Test-time_Adaptation_CVPR_2024_paper.pdf + Real-world systems often encounter new data over time which leads to experiencing target domain shifts. Existing Test-Time Adaptation (TTA) methods tend to apply computationally heavy and memory-intensive backpropagation-based approaches to handle this. Here we propose a novel method that uses a backpropagation-free approach for TTA for the specific case of 3D data. Our model uses a two-stream architecture to maintain knowledge about the source domain as well as complementary target-domain-specific information. The backpropagation-free property of our model helps address the well-known forgetting problem and mitigates the error accumulation issue. The proposed method also eliminates the need for the usually noisy process of pseudo-labeling and reliance on costly self-supervised training. Moreover our method leverages subspace learning effectively reducing the distribution variance between the two domains. Furthermore the source-domain-specific and the target-domain-specific streams are aligned using a novel entropy-based adaptive fusion strategy. Extensive experiments on popular benchmarks demonstrate the effectiveness of our method. The code will be available at https://github.com/abie-e/BFTT3D. + + + + ParamISP: Learned Forward and Inverse ISPs using Camera Parameters + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_ParamISP_Learned_Forward_and_Inverse_ISPs_using_Camera_Parameters_CVPR_2024_paper.pdf + RAW images are rarely shared mainly due to its excessive data size compared to their sRGB counterparts obtained by camera ISPs. Learning the forward and inverse processes of camera ISPs has been recently demonstrated enabling physically-meaningful RAW-level image processing on input sRGB images. However existing learning-based ISP methods fail to handle the large variations in the ISP processes with respect to camera parameters such as ISO and exposure time and have limitations when used for various applications. In this paper we propose ParamISP a learning-based method for forward and inverse conversion between sRGB and RAW images that adopts a novel neural-network module to utilize camera parameters which is dubbed as ParamNet. 
Given the camera parameters provided in the EXIF data, ParamNet converts them into a feature vector to control the ISP networks. Extensive experiments demonstrate that ParamISP achieves superior RAW and sRGB reconstruction results compared to previous methods, and it can be effectively used for a variety of applications such as deblurring, dataset synthesis, raw deblurring, HDR reconstruction and camera-to-camera transfer. + + + + Perturbing Attention Gives You More Bang for the Buck: Subtle Imaging Perturbations That Efficiently Fool Customized Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Perturbing_Attention_Gives_You_More_Bang_for_the_Buck_Subtle_CVPR_2024_paper.pdf + Diffusion models (DMs) embark on a new era of generative modeling and offer more opportunities for efficiently generating high-quality and realistic data samples. However, their widespread use has also brought forth new challenges in model security, which motivates the creation of more effective adversarial attackers on DMs to understand their vulnerability. We propose CAAT, a simple but generic and efficient approach that does not require costly training to effectively fool latent diffusion models (LDMs). The approach is based on the observation that cross-attention layers exhibit higher sensitivity to gradient change, allowing for leveraging subtle perturbations on published images to significantly corrupt the generated images. We show that a subtle perturbation on an image can significantly impact the cross-attention layers, thus changing the mapping between text and image during the fine-tuning of customized diffusion models. Extensive experiments demonstrate that CAAT is compatible with diverse diffusion models and outperforms baseline attack methods in a more effective (more noise) and efficient (twice as fast as Anti-DreamBooth and Mist) manner. + + + + SeNM-VAE: Semi-Supervised Noise Modeling with Hierarchical Variational Autoencoder + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_SeNM-VAE_Semi-Supervised_Noise_Modeling_with_Hierarchical_Variational_Autoencoder_CVPR_2024_paper.pdf + The data bottleneck has emerged as a fundamental challenge in learning-based image restoration methods. Researchers have attempted to generate synthesized training data using paired or unpaired samples to address this challenge. This study proposes SeNM-VAE, a semi-supervised noise modeling method that leverages both paired and unpaired datasets to generate realistic degraded data. Our approach is based on modeling the conditional distribution of degraded and clean images with a specially designed graphical model. Under the variational inference framework, we develop an objective function for handling both paired and unpaired data. We employ our method to generate paired training samples for real-world image denoising and super-resolution tasks. Our approach excels in the quality of synthetic degraded images compared to other unpaired and paired noise modeling methods. Furthermore, our approach demonstrates remarkable performance in downstream image restoration tasks, even with limited paired data. With more paired data, our method achieves the best performance on the SIDD dataset. + + + + Anchor-based Robust Finetuning of Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_Anchor-based_Robust_Finetuning_of_Vision-Language_Models_CVPR_2024_paper.pdf + We aim at finetuning a vision-language model without hurting its out-of-distribution (OOD) generalization. 
We address two types of OOD generalization, i.e. i) domain shift, such as natural to sketch images, and ii) zero-shot capability to recognize categories that were not contained in the finetune data. Arguably, the diminished OOD generalization after finetuning stems from the excessively simplified finetuning target, which only provides the class information, such as "a photo of a [CLASS]". This is distinct from the process in which CLIP was pretrained, where there is abundant text supervision with rich semantic information. Therefore, we propose to compensate for the finetune process using auxiliary supervision with rich semantic information, which acts as anchors to preserve the OOD generalization. Specifically, two types of anchors are elaborated in our method, including i) the text-compensated anchor, which uses the images from the finetune set but enriches the text supervision from a pretrained captioner, and ii) the image-text-pair anchor, which is retrieved from a dataset similar to the pretraining data of CLIP according to the downstream task, associating the original CLIP text with rich semantics. Those anchors are utilized as auxiliary semantic information to maintain the original feature space of CLIP, thereby preserving the OOD generalization capabilities. Comprehensive experiments demonstrate that our method achieves in-distribution performance akin to conventional finetuning while attaining new state-of-the-art results on domain shift and zero-shot learning benchmarks. + + + + DiSR-NeRF: Diffusion-Guided View-Consistent Super-Resolution NeRF + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_DiSR-NeRF_Diffusion-Guided_View-Consistent_Super-Resolution_NeRF_CVPR_2024_paper.pdf + We present DiSR-NeRF, a diffusion-guided framework for view-consistent super-resolution (SR) NeRF. Unlike prior works, we circumvent the requirement for high-resolution (HR) reference images by leveraging existing powerful 2D super-resolution models. Nonetheless, independent SR 2D images are often inconsistent across different views. We thus propose Iterative 3D Synchronization (I3DS) to mitigate the inconsistency problem via the inherent multi-view consistency property of NeRF. Specifically, our I3DS alternates between upscaling low-resolution (LR) rendered images with diffusion models and updating the underlying 3D representation with standard NeRF training. We further introduce Renoised Score Distillation (RSD), a novel score-distillation objective for 2D image super-resolution. Our RSD combines features from ancestral sampling and Score Distillation Sampling (SDS) to generate sharp images that are also LR-consistent. Qualitative and quantitative results on both synthetic and real-world datasets demonstrate that our DiSR-NeRF can achieve better results on NeRF super-resolution compared with existing works. Code and video results are available at the project website. + + + + Dispersed Structured Light for Hyperspectral 3D Imaging + http://openaccess.thecvf.com//content/CVPR2024/papers/Shin_Dispersed_Structured_Light_for_Hyperspectral_3D_Imaging_CVPR_2024_paper.pdf + Hyperspectral 3D imaging aims to acquire both depth and spectral information of a scene. However, existing methods are either prohibitively expensive and bulky or compromise on spectral and depth accuracy. In this paper we present Dispersed Structured Light (DSL), a cost-effective and compact method for accurate hyperspectral 3D imaging. DSL modifies a traditional projector-camera system by placing a sub-millimeter thick diffraction grating film in front of the projector. 
This configuration enables dispersing structured light based on light wavelength. To utilize the dispersed structured light we devise a model for dispersive projection image formation and a per-pixel hyperspectral 3D reconstruction method. We validate DSL by instantiating a compact experimental prototype. DSL achieves spectral accuracy of 18.8nm full-width half-maximum (FWHM) and depth error of 1mm outperforming prior work on practical hyperspectral 3D imaging. DSL promises accurate and practical hyperspectral 3D imaging for diverse application domains including computer vision and graphics cultural heritage geology and biology. + + + + GLID: Pre-training a Generalist Encoder-Decoder Vision Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_GLID_Pre-training_a_Generalist_Encoder-Decoder_Vision_Model_CVPR_2024_paper.pdf + This paper proposes a GeneraLIst encoder-Decoder (GLID) pre-training method for better handling various downstream computer vision tasks. While self-supervised pre-training approaches e.g. Masked Autoencoder have shown success in transfer learning task-specific sub-architectures are still required to be appended for different downstream tasks which cannot enjoy the benefits of large-scale pre-training. GLID overcomes this challenge by allowing the pre-trained generalist encoder-decoder to be fine-tuned on various vision tasks with minimal task-specific architecture modifications. In the GLID training scheme pre-training pretext task and other downstream tasks are modeled as "query-to-answer" problems including the pre-training pretext task and other downstream tasks. We pre-train a task-agnostic encoder-decoder with query-mask pairs. During fine-tuning GLID maintains the pre-trained encoder-decoder and queries only replacing the topmost linear transformation layer with task-specific linear heads. This minimizes the pretrain-finetune architecture inconsistency and enables the pre-trained model to better adapt to downstream tasks. GLID achieves competitive performance on various vision tasks including object detection image segmentation pose estimation and depth estimation outperforming or matching specialist models such as Mask2Former DETR ViTPose and BinsFormer. + + + + PKU-DyMVHumans: A Multi-View Video Benchmark for High-Fidelity Dynamic Human Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_PKU-DyMVHumans_A_Multi-View_Video_Benchmark_for_High-Fidelity_Dynamic_Human_Modeling_CVPR_2024_paper.pdf + High-quality human reconstruction and photo-realistic rendering of a dynamic scene is a long-standing problem in computer vision and graphics. Despite considerable efforts invested in developing various capture systems and reconstruction algorithms recent advancements still struggle with loose or oversized clothing and overly complex poses. In part this is due to the challenges of acquiring high-quality human datasets. To facilitate the development of these fields in this paper we present PKU-DyMVHumans a versatile human-centric dataset for high-fidelity reconstruction and rendering of dynamic human scenarios from dense multi-view videos. It comprises 8.2 million frames captured by more than 56 synchronized cameras across diverse scenarios. These sequences comprise 32 human subjects across 45 different scenarios each with a high-detailed appearance and realistic human motion. 
Inspired by recent advancements in neural radiance field (NeRF)-based scene representations we carefully set up an off-the-shelf framework that is easy to provide those state-of-the-art NeRF-based implementations and benchmark on PKU-DyMVHumans dataset. It is paving the way for various applications like fine-grained foreground/background decomposition high-quality human reconstruction and photo-realistic novel view synthesis of a dynamic scene. Extensive studies are performed on the benchmark demonstrating new observations and challenges that emerge from using such high-fidelity dynamic data. The project page and data is available at: https://pku-dymvhumans.github.io. + + + + CausalPC: Improving the Robustness of Point Cloud Classification by Causal Effect Identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_CausalPC_Improving_the_Robustness_of_Point_Cloud_Classification_by_Causal_CVPR_2024_paper.pdf + Deep neural networks have demonstrated remarkable performance in point cloud classification. However previous works show they are vulnerable to adversarial perturbations that can manipulate their predictions. Given the distinctive modality of point clouds various attack strategies have emerged posing challenges for existing defenses to achieve effective generalization. In this study we for the first time introduce causal modeling to enhance the robustness of point cloud classification models. Our insight is from the observation that adversarial examples closely resemble benign point clouds from the human perspective. In our causal modeling we incorporate two critical variables the structural information (standing for the key feature leading to the classification) and the hidden confounders (standing for the noise interfering with the classification). The resulting overall framework CausalPC consists of three sub-modules to identify the causal effect for robust classification. The framework is model-agnostic and adaptable for integration with various point cloud classifiers. Our approach significantly improves the adversarial robustness of three mainstream point cloud classification models on two benchmark datasets. For instance the classification accuracy for DGCNN on ModelNet40 increases from 29.2% to 72.0% with CausalPC whereas the best-performing baseline achieves only 42.4%. + + + + LASA: Instance Reconstruction from Real Scans using A Large-scale Aligned Shape Annotation Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_LASA_Instance_Reconstruction_from_Real_Scans_using_A_Large-scale_Aligned_CVPR_2024_paper.pdf + Instance shape reconstruction from a 3D scene involves recovering the full geometries of multiple objects at the semantic instance level. Many methods leverage data-driven learning due to the intricacies of scene complexity and significant indoor occlusions. Training these methods often requires a large-scale high-quality dataset with aligned and paired shape annotations with real-world scans. Existing datasets are either synthetic or misaligned restricting the performance of data-driven methods on real data. To this end we introduce LASA a Large-scale Aligned Shape Annotation Dataset comprising 10412 high-quality CAD annotations aligned with 920 real-world scene scans from ArkitScenes created manually by professional artists. On this top we propose a novel Diffusion-based Cross-Modal Shape Reconstruction (DisCo) method. 
It is empowered by a hybrid feature aggregation design to fuse multi-modal inputs and recover high-fidelity object geometries. Besides, we present an Occupancy-Guided 3D Object Detection (OccGOD) method and demonstrate that our shape annotations provide scene occupancy clues that can further improve 3D object detection. Supported by LASA, extensive experiments show that our methods achieve state-of-the-art performance in both instance-level scene reconstruction and 3D object detection tasks. + + + + DiffSCI: Zero-Shot Snapshot Compressive Imaging via Iterative Spectral Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Pan_DiffSCI_Zero-Shot_Snapshot_Compressive_Imaging_via_Iterative_Spectral_Diffusion_Model_CVPR_2024_paper.pdf + This paper endeavors to advance the precision of snapshot compressive imaging (SCI) reconstruction for multispectral images (MSI). To achieve this, we integrate the advantageous attributes of established SCI techniques and an image generative model, and propose a novel structured zero-shot diffusion model dubbed DiffSCI. DiffSCI leverages the structural insights from deep-prior and optimization-based methodologies, complemented by the generative capabilities offered by the contemporary denoising diffusion model. Specifically, we first employ a pre-trained diffusion model, which has been trained on a substantial corpus of RGB images, as the generative denoiser within the Plug-and-Play framework for the first time. This integration allows for the successful completion of SCI reconstruction, especially in cases that current methods struggle to address effectively. Secondly, we systematically account for spectral band correlations and introduce a robust methodology to mitigate wavelength mismatch, thus enabling seamless adaptation of the RGB diffusion model to MSIs. Thirdly, an accelerated algorithm is implemented to expedite the resolution of the data subproblem. This augmentation not only accelerates the convergence rate but also elevates the quality of the reconstruction process. We present extensive testing to show that DiffSCI exhibits discernible performance enhancements over prevailing self-supervised and zero-shot approaches, surpassing even supervised transformer counterparts across both simulated and real datasets. Code is at https://github.com/PAN083/DiffSCI. + + + + MeLFusion: Synthesizing Music from Image and Language Cues using Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Chowdhury_MeLFusion_Synthesizing_Music_from_Image_and_Language_Cues_using_Diffusion_CVPR_2024_paper.pdf + Music is a universal language that can communicate emotions and feelings. It forms an essential part of the whole spectrum of creative media, ranging from movies to social media posts. Machine learning models that can synthesize music are predominantly conditioned on textual descriptions of it. Inspired by how musicians compose music not just from a movie script but also through visualizations, we propose MeLFusion, a model that can effectively use cues from a textual description and the corresponding image to synthesize music. MeLFusion is a text-to-music diffusion model with a novel "visual synapse", which effectively infuses the semantics from the visual modality into the generated music. To facilitate research in this area, we introduce a new dataset, MeLBench, and propose a new evaluation metric, IMSM.
Our exhaustive experimental evaluation suggests that adding visual information to the music synthesis pipeline significantly improves the quality of generated music, measured both objectively and subjectively, with a relative gain of up to 67.98% on the FAD score. We hope that our work will draw attention to this pragmatic yet relatively under-explored research area. + + + + Noisy-Correspondence Learning for Text-to-Image Person Re-identification + http://openaccess.thecvf.com//content/CVPR2024/papers/Qin_Noisy-Correspondence_Learning_for_Text-to-Image_Person_Re-identification_CVPR_2024_paper.pdf + Text-to-image person re-identification (TIReID) is a compelling topic in the cross-modal community, which aims to retrieve the target person based on a textual query. Although numerous TIReID methods have been proposed and achieved promising performance, they implicitly assume the training image-text pairs are correctly aligned, which is not always the case in real-world scenarios. In practice, the image-text pairs are inevitably under-correlated or even falsely correlated, a.k.a. noisy correspondence (NC), due to the low quality of the images and annotation errors. To address this problem, we propose a novel Robust Dual Embedding method (RDE) that can learn robust visual-semantic associations even with NC. Specifically, RDE consists of two main components: 1) A Confident Consensus Division (CCD) module that leverages the dual-grained decisions of dual embedding modules to obtain a consensus set of clean training data, which enables the model to learn correct and reliable visual-semantic associations. 2) A Triplet Alignment Loss (TAL) that relaxes the conventional Triplet Ranking loss with the hardest negative samples to a log-exponential upper bound over all negative ones, thus preventing model collapse under NC while still focusing on hard-negative samples for promising performance. We conduct extensive experiments on three public benchmarks, namely CUHK-PEDES, ICFG-PEDES and RSTPReID, to evaluate the performance and robustness of our RDE. Our method achieves state-of-the-art results both with and without synthetic noisy correspondences on all three datasets. Code is available at https://github.com/QinYang79/RDE. + + + + PanoRecon: Real-Time Panoptic 3D Reconstruction from Monocular Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_PanoRecon_Real-Time_Panoptic_3D_Reconstruction_from_Monocular_Video_CVPR_2024_paper.pdf + We introduce the Panoptic 3D Reconstruction task, a unified and holistic scene understanding task for a monocular video, and we present PanoRecon, a novel framework to address this new task, which realizes online geometry reconstruction along with dense semantic and instance labeling. Specifically, PanoRecon incrementally performs panoptic 3D reconstruction for each video fragment, consisting of multiple consecutive key frames, from a volumetric feature representation using feed-forward neural networks. We adopt a depth-guided back-projection strategy to sparsify and purify the volumetric feature representation. We further introduce a voxel clustering module to obtain object instances in each local fragment, and then design a tracking and fusion algorithm for the integration of instances from different fragments to ensure temporal coherence. Such a design enables PanoRecon to yield a coherent and accurate panoptic 3D reconstruction.
Experiments on ScanNetV2 demonstrate a very competitive geometry reconstruction result compared with state-of-the-art reconstruction methods as well as promising 3D panoptic segmentation result with only RGB input while being real-time. Code is available at: https://github.com/Riser6/PanoRecon. + + + + Towards Transferable Targeted 3D Adversarial Attack in the Physical World + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Towards_Transferable_Targeted_3D_Adversarial_Attack_in_the_Physical_World_CVPR_2024_paper.pdf + Compared with transferable untargeted attacks transferable targeted adversarial attacks could specify the misclassification categories of adversarial samples posing a greater threat to security-critical tasks. In the meanwhile 3D adversarial samples due to their potential of multi-view robustness can more comprehensively identify weaknesses in existing deep learning systems possessing great application value. However the field of transferable targeted 3D adversarial attacks remains vacant. The goal of this work is to develop a more effective technique that could generate transferable targeted 3D adversarial examples filling the gap in this field. To achieve this goal we design a novel framework named TT3D that could rapidly reconstruct from few multi-view images into Transferable Targeted 3D textured meshes. While existing mesh-based texture optimization methods compute gradients in the high-dimensional mesh space and easily fall into local optima leading to unsatisfactory transferability and distinct distortions TT3D innovatively performs dual optimization towards both feature grid and Multi-layer Perceptron (MLP) parameters in the grid-based NeRF space which significantly enhances black-box transferability while enjoying naturalness. Experimental results show that TT3D not only exhibits superior cross-model transferability but also maintains considerable adaptability across different renders and vision tasks. More importantly we produce 3D adversarial examples with 3D printing techniques in the real world and verify their robust performance under various scenarios. + + + + SwitchLight: Co-design of Physics-driven Architecture and Pre-training Framework for Human Portrait Relighting + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_SwitchLight_Co-design_of_Physics-driven_Architecture_and_Pre-training_Framework_for_Human_CVPR_2024_paper.pdf + We introduce a co-designed approach for human portrait relighting that combines a physics-guided architecture with a pre-training framework. Drawing on the Cook-Torrance reflectance model we have meticulously configured the architecture design to precisely simulate light-surface interactions. Furthermore to overcome the limitation of scarce high-quality lightstage data we have developed a self-supervised pre-training strategy. This novel combination of accurate physical modeling and expanded training dataset establishes a new benchmark in relighting realism. + + + + Adapters Strike Back + http://openaccess.thecvf.com//content/CVPR2024/papers/Steitz_Adapters_Strike_Back_CVPR_2024_paper.pdf + Adapters provide an efficient and lightweight mechanism for adapting trained transformer models to a variety of different tasks. However they have often been found to be outperformed by other adaptation mechanisms including low-rank adaptation. In this paper we provide an in-depth study of adapters their internal structure as well as various implementation choices. 
We uncover pitfalls for using adapters and suggest a concrete improved adapter architecture called Adapter+ that not only outperforms previous adapter implementations but surpasses a number of other more complex adaptation mechanisms in several challenging settings. Despite this our suggested adapter is highly robust and unlike previous work requires little to no manual intervention when addressing a novel scenario. Adapter+ reaches state-of-the-art average accuracy on the VTAB benchmark even without a per-task hyperparameter optimization. + + + + CLIP-Driven Open-Vocabulary 3D Scene Graph Generation via Cross-Modality Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_CLIP-Driven_Open-Vocabulary_3D_Scene_Graph_Generation_via_Cross-Modality_Contrastive_Learning_CVPR_2024_paper.pdf + 3D Scene Graph Generation (3DSGG) aims to classify objects and their predicates within 3D point cloud scenes. However current 3DSGG methods struggle with two main challenges. 1) The dependency on labor-intensive ground-truth annotations. 2) Closed-set classes training hampers the recognition of novel objects and predicates. Addressing these issues our idea is to extract cross-modality features by CLIP from text and image data naturally related to 3D point clouds. Cross-modality features are used to train a robust 3D scene graph (3DSG) feature extractor. Specifically we propose a novel Cross-Modality Contrastive Learning 3DSGG (CCL-3DSGG) method. Firstly to align the text with 3DSG the text is parsed into word level that are consistent with the 3DSG annotation. To enhance robustness during the alignment adjectives are exchanged for different objects as negative samples. Then to align the image with 3DSG the camera view is treated as a positive sample and other views as negatives. Lastly the recognition of novel object and predicate classes is achieved by calculating the cosine similarity between prompts and 3DSG features. Our rigorous experiments confirm the superior open-vocabulary capability and applicability of CCL-3DSGG in real-world contexts. + + + + StraightPCF: Straight Point Cloud Filtering + http://openaccess.thecvf.com//content/CVPR2024/papers/de_Silva_Edirimuni_StraightPCF_Straight_Point_Cloud_Filtering_CVPR_2024_paper.pdf + Point cloud filtering is a fundamental 3D vision task which aims to remove noise while recovering the underlying clean surfaces. State-of-the-art methods remove noise by moving noisy points along stochastic trajectories to the clean surfaces. These methods often require regularization within the training objective and/or during post-processing to ensure fidelity. In this paper we introduce StraightPCF a new deep learning based method for point cloud filtering. It works by moving noisy points along straight paths thus reducing discretization errors while ensuring faster convergence to the clean surfaces. We model noisy patches as intermediate states between high noise patch variants and their clean counterparts and design the VelocityModule to infer a constant flow velocity from the former to the latter. This constant flow leads to straight filtering trajectories. In addition we introduce a DistanceModule that scales the straight trajectory using an estimated distance scalar to attain convergence near the clean surface. Our network is lightweight and only has 530K parameters being 17% of IterativePFN (a most recent point cloud filtering network). 
Extensive experiments on both synthetic and real-world data show our method achieves state-of-the-art results. Our method also demonstrates nice distributions of filtered points without the need for regularization. The implementation code can be found at: https://github.com/ddsediri/StraightPCF. + + + + Mirasol3B: A Multimodal Autoregressive Model for Time-Aligned and Contextual Modalities + http://openaccess.thecvf.com//content/CVPR2024/papers/Piergiovanni_Mirasol3B_A_Multimodal_Autoregressive_Model_for_Time-Aligned_and_Contextual_Modalities_CVPR_2024_paper.pdf + One of the main challenges of multimodal learning is the need to combine heterogeneous modalities (e.g. video audio text). For example video and audio are obtained at much higher rates than text and are roughly aligned in time. They are often not synchronized with text which comes as a global context e.g. a title or a description. Furthermore video and audio inputs are of much larger volumes and grow as the video length increases which naturally requires more compute dedicated to these modalities and makes modeling of long-range dependencies harder. We here decouple the multimodal modeling dividing it into separate autoregressive models processing the inputs according to the characteristics of the modalities. We propose a multimodal model consisting of an autoregressive component for the time-synchronized modalities (audio and video) and an autoregressive component for the context modalities which are not necessarily aligned in time but are still sequential. To address the long-sequences of the video-audio inputs we further partition the video and audio sequences in consecutive snippets and autoregressively process their representations. To that end we propose a Combiner mechanism which models the audio-video information jointly producing compact but expressive representations. This allows us to scale to 512 input video frames without increase in model parameters. Our approach achieves the state-of-the-art on multiple well established multimodal benchmarks. It effectively addresses the high computational demand of media inputs by learning compact representations controlling the sequence length of the audio-video feature representations and modeling their dependencies in time. + + + + Semantically-Shifted Incremental Adapter-Tuning is A Continual ViTransformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_Semantically-Shifted_Incremental_Adapter-Tuning_is_A_Continual_ViTransformer_CVPR_2024_paper.pdf + Class-incremental learning (CIL) aims to enable models to continuously learn new classes while overcoming catastrophic forgetting. The introduction of pre-trained models has brought new tuning paradigms to CIL. In this paper we revisit different parameter-efficient tuning (PET) methods within the context of continual learning. We observe that adapter tuning demonstrates superiority over prompt-based methods even without parameter expansion in each learning session. Motivated by this we propose incrementally tuning the shared adapter without imposing parameter update constraints enhancing the learning capacity of the backbone. Additionally we employ feature sampling from stored prototypes to retrain a unified classifier further improving its performance. We estimate the semantic shift of old prototypes without access to past samples and update stored prototypes session by session. Our proposed method eliminates model expansion and avoids retaining any image samples. 
It surpasses previous pre-trained model-based CIL methods and demonstrates remarkable continual learning capabilities. Experimental results on five CIL benchmarks validate the effectiveness of our approach, achieving state-of-the-art (SOTA) performance. + + + + Random Entangled Tokens for Adversarially Robust Vision Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Gong_Random_Entangled_Tokens_for_Adversarially_Robust_Vision_Transformer_CVPR_2024_paper.pdf + Vision Transformers (ViTs) have emerged as a compelling alternative to Convolutional Neural Networks (CNNs) in the realm of computer vision, showcasing tremendous potential. However, recent research has unveiled a susceptibility of ViTs to adversarial attacks, akin to their CNN counterparts. Adversarial training and randomization are two representative effective defenses for CNNs. Some researchers have attempted to apply adversarial training to ViTs and achieved comparable robustness to CNNs, while it is not easy to directly apply randomization to ViTs because of the architectural differences between CNNs and ViTs. In this paper, we delve into the structural intricacies of ViTs and propose a novel defense mechanism termed Random entangled image Transformer (ReiT), which seamlessly integrates adversarial training and randomization to bolster the adversarial robustness of ViTs. Recognizing the challenge posed by the structural disparities between ViTs and CNNs, we introduce a novel module, input-independent random entangled self-attention (II-ReSA). This module optimizes random entangled tokens that lead to "dissimilar" self-attention outputs by leveraging model parameters and the sampled random tokens, thereby synthesizing the self-attention module outputs and random entangled tokens to diminish adversarial similarity. ReiT incorporates two distinct random entangled tokens and employs dual randomization, offering an effective countermeasure against adversarial examples while ensuring comprehensive deduction guarantees. Through extensive experiments conducted on various ViT variants and benchmarks, we substantiate the superiority of our proposed method in enhancing the adversarial robustness of Vision Transformers. + + + + L2B: Learning to Bootstrap Robust Models for Combating Label Noise + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_L2B_Learning_to_Bootstrap_Robust_Models_for_Combating_Label_Noise_CVPR_2024_paper.pdf + Deep neural networks have shown great success in representation learning. However, when learning with noisy labels (LNL), they can easily overfit and fail to generalize to new data. This paper introduces a simple and effective method named Learning to Bootstrap (L2B), which enables models to bootstrap themselves using their own predictions without being adversely affected by erroneous pseudo-labels. It achieves this by dynamically adjusting the importance weight between real observed and generated labels, as well as between different samples, through meta-learning. Unlike existing instance reweighting methods, the key to our method lies in a new versatile objective that enables implicit relabeling concurrently, leading to significant improvements without incurring additional costs. L2B offers several benefits over the baseline methods. It yields more robust models that are less susceptible to the impact of noisy labels by guiding the bootstrapping procedure more effectively.
It better exploits the valuable information contained in corrupted instances by adapting the weights of both instances and labels. Furthermore L2B is compatible with existing LNL methods and delivers competitive results spanning natural and medical imaging tasks including classification and segmentation under both synthetic and real-world noise. Extensive experiments demonstrate that our method effectively mitigates the challenges of noisy labels often necessitating few to no validation samples and is well generalized to other tasks such as image segmentation. This not only positions it as a robust complement to existing LNL techniques but also underscores its practical applicability. The code and models are available at https://github.com/yuyinzhou/l2b. + + + + Tactile-Augmented Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Dou_Tactile-Augmented_Radiance_Fields_CVPR_2024_paper.pdf + We present a scene representation that brings vision and touch into a shared 3D space which we call a tactile-augmented radiance field. This representation capitalizes on two key insights: (i) ubiquitous vision-based touch sensors are built on perspective cameras and (ii) visually and structurally similar regions of a scene share the same tactile features. We use these insights to train a conditional diffusion model that provided with an RGB image and a depth map rendered from a neural radiance field generates its corresponding tactile "image". To train this diffusion model we collect the largest collection of spatially-aligned visual and tactile data. Through qualitative and quantitative experiments we demonstrate the accuracy of our cross-modal generative model and the utility of collected and rendered visual-tactile pairs across a range of downstream tasks. Project page: https://dou-yiming.github.io/TaRF + + + + Intensity-Robust Autofocus for Spike Camera + http://openaccess.thecvf.com//content/CVPR2024/papers/Su_Intensity-Robust_Autofocus_for_Spike_Camera_CVPR_2024_paper.pdf + Spike cameras a novel neuromorphic visual sensor can capture full-time spatial information through spike stream offering ultra-high temporal resolution and an extensive dynamic range. Autofocus control (AC) plays a pivotal role in a camera to efficiently capture information in challenging real-world scenarios. Nevertheless due to disparities in data modality and information characteristics compared to frame stream and event stream the current lack of efficient AC methods has made it challenging for spike cameras to adapt to intricate real-world conditions. To address this challenge we introduce a spike-based autofocus framework that includes a spike-specific focus measure called spike dispersion (SD) which effectively mitigates the influence of variations in scene light intensity during the focusing process by leveraging the spike camera's ability to record full-time spatial light intensity. Additionally the framework integrates a fast search strategy called spike-based golden fast search (SGFS) allowing rapid focal positioning without the need for a complete focus range traversal. To validate the performance of our method we have collected a spike-based autofocus dataset (SAD) containing synthetic data and real-world data under varying scene brightness and motion scenarios. Experimental results on these datasets demonstrate that our method offers state-of-the-art accuracy and efficiency. 
Furthermore experiments with data captured under varying scene brightness levels illustrate the robustness of our method to changes in light intensity during the focusing process. + + + + COTR: Compact Occupancy TRansformer for Vision-based 3D Occupancy Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_COTR_Compact_Occupancy_TRansformer_for_Vision-based_3D_Occupancy_Prediction_CVPR_2024_paper.pdf + The autonomous driving community has shown significant interest in 3D occupancy prediction driven by its exceptional geometric perception and general object recognition capabilities. To achieve this current works try to construct a Tri-Perspective View (TPV) or Occupancy (OCC) representation extending from the Bird-Eye-View perception. However compressed views like TPV representation lose 3D geometry information while raw and sparse OCC representation requires heavy but redundant computational costs. To address the above limitations we propose Compact Occupancy TRansformer (COTR) with a geometry-aware occupancy encoder and a semantic-aware group decoder to reconstruct a compact 3D OCC representation. The occupancy encoder first generates a compact geometrical OCC feature through efficient explicit-implicit view transformation. Then the occupancy decoder further enhances the semantic discriminability of the compact OCC representation by a coarse-to-fine semantic grouping strategy. Empirical experiments show that there are evident performance gains across multiple baselines e.g. COTR outperforms baselines with a relative improvement of 8%-15% demonstrating the superiority of our method. + + + + BANF: Band-Limited Neural Fields for Levels of Detail Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Shabanov_BANF_Band-Limited_Neural_Fields_for_Levels_of_Detail_Reconstruction_CVPR_2024_paper.pdf + Largely due to their implicit nature neural fields lack a direct mechanism for filtering as Fourier analysis from discrete signal processing is not directly applicable to these representations. Effective filtering of neural fields is critical to enable level-of-detail processing in downstream applications and support operations that involve sampling the field on regular grids (e.g. marching cubes). Existing methods that attempt to decompose neural fields in the frequency domain either resort to heuristics or require extensive modifications to the neural field architecture. We show that via a simple modification one can obtain neural fields that are low-pass filtered and in turn show how this can be exploited to obtain a frequency decomposition of the entire signal. We demonstrate the validity of our technique by investigating level-of-detail reconstruction and showing how coarser representations can be computed effectively. + + + + Physical Property Understanding from Language-Embedded Feature Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhai_Physical_Property_Understanding_from_Language-Embedded_Feature_Fields_CVPR_2024_paper.pdf + Can computers perceive the physical properties of objects solely through vision? Research in cognitive science and vision science has shown that humans excel at identifying materials and estimating their physical properties based purely on visual appearance. In this paper we present a novel approach for dense prediction of the physical properties of objects using a collection of images. 
Inspired by how humans reason about physics through vision we leverage large language models to propose candidate materials for each object. We then construct a language-embedded point cloud and estimate the physical properties of each 3D point using a zero-shot kernel regression approach. Our method is accurate annotation-free and applicable to any object in the open world. Experiments demonstrate the effectiveness of the proposed approach in various physical property reasoning tasks such as estimating the mass of common objects as well as other properties like friction and hardness. + + + + LEAD: Exploring Logit Space Evolution for Model Selection + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_LEAD_Exploring_Logit_Space_Evolution_for_Model_Selection_CVPR_2024_paper.pdf + The remarkable success of "pretrain-then-finetune" paradigm has led to a proliferation of available pre-trained models for vision tasks. This surge presents a significant challenge in efficiently choosing the most suitable pre-trained models for downstream tasks. The critical aspect of this challenge lies in effectively predicting the model transferability by considering the underlying fine-tuning dynamics. Existing methods often model fine-tuning dynamics in feature space with linear transformations which do not precisely align with the fine-tuning objective and fail to grasp the essential nonlinearity from optimization. To this end we present LEAD a finetuning-aligned approach based on the network output of logits. LEAD proposes a theoretical framework to model the optimization process and derives an ordinary differential equation (ODE) to depict the nonlinear evolution toward the final logit state. Additionally we design a class-aware decomposition method to consider the varying evolution dynamics across classes and further ensure practical applicability. Integrating the closely aligned optimization objective and nonlinear modeling capabilities derived from the differential equation our method offers a concise solution to effectively bridge the optimization gap in a single step bypassing the lengthy fine-tuning process. The comprehensive experiments on 24 supervised and self-supervised pre-trained models across 10 downstream datasets demonstrate impressive performances and showcase its broad adaptability even in low-data scenarios. + + + + GaussianAvatars: Photorealistic Head Avatars with Rigged 3D Gaussians + http://openaccess.thecvf.com//content/CVPR2024/papers/Qian_GaussianAvatars_Photorealistic_Head_Avatars_with_Rigged_3D_Gaussians_CVPR_2024_paper.pdf + We introduce GaussianAvatars a new method to create photorealistic head avatars that are fully controllable in terms of expression pose and viewpoint. The core idea is a dynamic 3D representation based on 3D Gaussian splats that are rigged to a parametric morphable face model. This combination facilitates photorealistic rendering while allowing for precise animation control via the underlying parametric model e.g. through expression transfer from a driving sequence or by manually changing the morphable model parameters. We parameterize each splat by a local coordinate frame of a triangle and optimize for explicit displacement offset to obtain a more accurate geometric representation. During avatar reconstruction we jointly optimize for the morphable model parameters and Gaussian splat parameters in an end-to-end fashion. We demonstrate the animation capabilities of our photorealistic avatar in several challenging scenarios. 
For instance we show reenactments from a driving video where our method outperforms existing works by a significant margin. + + + + GaussianEditor: Editing 3D Gaussians Delicately with Text Instructions + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_GaussianEditor_Editing_3D_Gaussians_Delicately_with_Text_Instructions_CVPR_2024_paper.pdf + Recently impressive results have been achieved in 3D scene editing with text instructions based on a 2D diffusion model. However current diffusion models primarily generate images by predicting noise in the latent space and the editing is usually applied to the whole image which makes it challenging to perform delicate especially localized editing for 3D scenes. Inspired by recent 3D Gaussian splatting we propose a systematic framework named GaussianEditor to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians we design a series of techniques to achieve delicate editing. Specifically we first extract the region of interest (RoI) corresponding to the text instruction aligning it to 3D Gaussians. The Gaussian RoI is further used to control the editing process. Our framework can achieve more delicate and precise editing of 3D scenes than previous methods while enjoying much faster training speed i.e. within 20 minutes on a single V100 GPU more than twice as fast as Instruct-NeRF2NeRF (45 minutes -- 2 hours). The project page is at GaussianEditor.github.io. + + + + HiKER-SGG: Hierarchical Knowledge Enhanced Robust Scene Graph Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_HiKER-SGG_Hierarchical_Knowledge_Enhanced_Robust_Scene_Graph_Generation_CVPR_2024_paper.pdf + Being able to understand visual scenes is a precursor for many downstream tasks including autonomous driving robotics and other vision-based approaches. A common approach enabling the ability to reason over visual data is Scene Graph Generation (SGG); however many existing approaches assume undisturbed vision i.e. the absence of real-world corruptions such as fog snow smoke as well as non-uniform perturbations like sun glare or water drops. In this work we propose a novel SGG benchmark containing procedurally generated weather corruptions and other transformations over the Visual Genome dataset. Further we introduce a corresponding approach Hierarchical Knowledge Enhanced Robust Scene Graph Generation (HiKER-SGG) providing a strong baseline for scene graph generation under such challenging setting. At its core HiKER-SGG utilizes a hierarchical knowledge graph in order to refine its predictions from coarse initial estimates to detailed predictions. In our extensive experiments we show that HiKER-SGG does not only demonstrate superior performance on corrupted images in a zero-shot manner but also outperforms current state-of-the-art methods on uncorrupted SGG tasks. Code is available at https://github.com/zhangce01/HiKER-SGG. + + + + Watermark-embedded Adversarial Examples for Copyright Protection against Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Watermark-embedded_Adversarial_Examples_for_Copyright_Protection_against_Diffusion_Models_CVPR_2024_paper.pdf + Diffusion Models (DMs) have shown remarkable capabilities in various image-generation tasks. However there are growing concerns that DMs could be used to imitate unauthorized creations and thus raise copyright issues. 
To address this issue we propose a novel framework that embeds personal watermarks in the generation of adversarial examples. Such examples can force DMs to generate images with visible watermarks and prevent DMs from imitating unauthorized images. We construct a generator based on conditional adversarial networks and design three losses (adversarial loss GAN loss and perturbation loss) to generate adversarial examples that have subtle perturbation but can effectively attack DMs to prevent copyright violations. Training a generator for a personal watermark by our method only requires 5-10 samples within 2-3 minutes and once the generator is trained it can generate adversarial examples with that watermark significantly fast (0.2s per image). We conduct extensive experiments in various conditional image-generation scenarios. Compared to existing methods that generate images with chaotic textures our method adds visible watermarks on the generated images which is a more straightforward way to indicate copyright violations. We also observe that our adversarial examples exhibit good transferability across unknown generative models. Therefore this work provides a simple yet powerful way to protect copyright from DM-based imitation. + + + + TCP:Textual-based Class-aware Prompt tuning for Visual-Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Yao_TCPTextual-based_Class-aware_Prompt_tuning_for_Visual-Language_Model_CVPR_2024_paper.pdf + Prompt tuning represents a valuable technique for adapting pre-trained visual-language models (VLM) to various downstream tasks. Recent advancements in CoOp-based methods propose a set of learnable domain-shared or image-conditional textual tokens to facilitate the generation of task-specific textual classifiers. However those textual tokens have a limited generalization ability regarding unseen domains as they cannot dynamically adjust to the distribution of testing classes. To tackle this issue we present a novel Textual-based Class-aware Prompt tuning(TCP) that explicitly incorporates prior knowledge about classes to enhance their discriminability. The critical concept of TCP involves leveraging Textual Knowledge Embedding (TKE) to map the high generalizability of class-level textual knowledge into class aware textual tokens. By seamlessly integrating these class-aware prompts into the Text Encoder a dynamic class-aware classifier is generated to enhance discriminability for unseen domains. During inference TKE dynamically generates class-aware prompts related to the unseen classes. Comprehensive evaluations demonstrate that TKE serves as a plug-and-play module effortlessly combinable with existing methods. Furthermore TCP consistently achieves superior performance while demanding less training time. + + + + DiffusionMTL: Learning Multi-Task Denoising Diffusion Model from Partially Annotated Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_DiffusionMTL_Learning_Multi-Task_Denoising_Diffusion_Model_from_Partially_Annotated_Data_CVPR_2024_paper.pdf + Recently there has been an increased interest in the practical problem of learning multiple dense scene understanding tasks from partially annotated data where each training sample is only labeled for a subset of the tasks. The missing of task labels in training leads to low-quality and noisy predictions as can be observed from state-of-the-art methods. 
To tackle this issue we reformulate the partially-labeled multi-task dense prediction as a pixel-level denoising problem and propose a novel multi-task denoising diffusion framework coined as DiffusionMTL. It designs a joint diffusion and denoising paradigm to model a potential noisy distribution in the task prediction or feature maps and generate rectified outputs for different tasks. To exploit multi-task consistency in denoising we further introduce a Multi-Task Conditioning strategy which can implicitly utilize the complementary nature of the tasks to help learn the unlabeled tasks leading to an improvement in the denoising performance of the different tasks. Extensive quantitative and qualitative experiments demonstrate that the proposed multi-task denoising diffusion model can significantly improve multi-task prediction maps and outperform the state-of-the-art methods on three challenging multi-task benchmarks under two different partial-labeling evaluation settings. The code is available at https://prismformore.github.io/diffusionmtl/. + + + + Spike-guided Motion Deblurring with Unknown Modal Spatiotemporal Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Spike-guided_Motion_Deblurring_with_Unknown_Modal_Spatiotemporal_Alignment_CVPR_2024_paper.pdf + The traditional frame-based cameras that rely on exposure windows for imaging experience motion blur in high-speed scenarios. Frame-based deblurring methods lack reliable motion cues to restore sharp images under extreme blur conditions. The spike camera is a novel neuromorphic visual sensor that outputs spike streams with ultra-high temporal resolution. It can supplement the temporal information lost in traditional cameras and guide motion deblurring. However in real-world scenarios aligning discrete RGB images and continuous spike streams along both temporal and spatial axes is challenging due to the complexity of calibrating their coordinates device displacements in vibrations and time deviations. Misalignment of pixels leads to severe degradation of deblurring. We introduce the first framework for spike-guided motion deblurring without knowing the spatiotemporal alignment between spikes and images. To address the problem we first propose a novel three-stage network containing a basic deblurring net a carefully designed bi-directional deformable aligning module and a flow-based multi-scale fusion net. Experimental results demonstrate that our approach can effectively guide the image deblurring with unknown alignment surpassing the performance of other methods. Public project page: https://github.com/Leozhangjiyuan/UaSDN. + + + + VRP-SAM: SAM with Visual Reference Prompt + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_VRP-SAM_SAM_with_Visual_Reference_Prompt_CVPR_2024_paper.pdf + In this paper we propose a novel Visual Reference Prompt (VRP) encoder that empowers the Segment Anything Model (SAM) to utilize annotated reference images as prompts for segmentation creating the VRP-SAM model. In essence VRP-SAM can utilize annotated reference images to comprehend specific objects and perform segmentation of specific objects in target image. It is note that the VRP encoder can support a variety of annotation formats for reference images including point box scribble and mask. VRP-SAM achieves a breakthrough within the SAM framework by extending its versatility and applicability while preserving SAM's inherent strengths thus enhancing user-friendliness. 
To enhance the generalization ability of VRP-SAM, the VRP encoder adopts a meta-learning strategy. To validate the effectiveness of VRP-SAM, we conducted extensive empirical studies on the Pascal and COCO datasets. Remarkably, VRP-SAM achieved state-of-the-art performance in visual reference segmentation with minimal learnable parameters. Furthermore, VRP-SAM demonstrates strong generalization capabilities, allowing it to perform segmentation of unseen objects and enabling cross-domain segmentation. The source code and models will be available at https://github.com/syp2ysy/VRP-SAM + + + + Discriminability-Driven Channel Selection for Out-of-Distribution Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_Discriminability-Driven_Channel_Selection_for_Out-of-Distribution_Detection_CVPR_2024_paper.pdf + Out-of-distribution (OOD) detection is essential for deploying machine learning models in open-world environments. Activation-based methods are a key approach in OOD detection, working to mitigate overconfident predictions on OOD data. These techniques rectify anomalous activations, enhancing the distinguishability between in-distribution (ID) data and OOD data. However, they assume by default that every channel is necessary for OOD detection and rectify anomalous activations in each channel. Empirical evidence has shown that there is a significant difference among various channels in OOD detection, and discarding some channels can greatly enhance the performance of OOD detection. Based on this insight, we propose Discriminability-Driven Channel Selection (DDCS), which leverages adaptive channel selection by estimating the discriminative score of each channel to boost OOD detection. The discriminative score takes the inter-class similarity and inter-class variance of training data into account. However, the estimation of the discriminative score itself is susceptible to anomalous activations. To better estimate the score, we mildly pre-rectify anomalous activations for each channel. The experimental results show that DDCS achieves state-of-the-art performance on CIFAR and ImageNet-1K benchmarks. Moreover, DDCS can generalize to different backbones and OOD scores. + + + + Traffic Scene Parsing through the TSP6K Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Traffic_Scene_Parsing_through_the_TSP6K_Dataset_CVPR_2024_paper.pdf + Traffic scene perception in computer vision is a critically important task for achieving intelligent cities. To date, most existing datasets focus on autonomous driving scenes. We observe that models trained on those driving datasets often yield unsatisfactory results on traffic monitoring scenes. However, little effort has been put into improving traffic monitoring scene understanding, mainly due to the lack of specific datasets. To fill this gap, we introduce a specialized traffic monitoring dataset, termed TSP6K, containing images from the traffic monitoring scenario with high-quality pixel-level and instance-level annotations. The TSP6K dataset captures more crowded traffic scenes with several times more traffic participants than the existing driving scenes. We perform a detailed analysis of the dataset and comprehensively evaluate previous popular scene parsing methods, instance segmentation methods and unsupervised domain adaptation methods.
Furthermore considering the vast difference in instance sizes we propose a detail refining decoder for scene parsing which recovers the details of different semantic regions in traffic scenes owing to the proposed TSP6K dataset. Experiments show its effectiveness in parsing the traffic monitoring scenes. Code and dataset are available at https://github.com/PengtaoJiang/TSP6K. + + + + Fourier Priors-Guided Diffusion for Zero-Shot Joint Low-Light Enhancement and Deblurring + http://openaccess.thecvf.com//content/CVPR2024/papers/Lv_Fourier_Priors-Guided_Diffusion_for_Zero-Shot_Joint_Low-Light_Enhancement_and_Deblurring_CVPR_2024_paper.pdf + Existing joint low-light enhancement and deblurring methods learn pixel-wise mappings from paired synthetic data which results in limited generalization in real-world scenes. While some studies explore the rich generative prior of pre-trained diffusion models they typically rely on the assumed degradation process and cannot handle unknown real-world degradations well. To address these problems we propose a novel zero-shot framework FourierDiff which embeds Fourier priors into a pre-trained diffusion model to harmoniously handle the joint degradation of luminance and structures. FourierDiff is appealing in its relaxed requirements on paired training data and degradation assumptions. The key zero-shot insight is motivated by image characteristics in the Fourier domain: most luminance information concentrates on amplitudes while structure and content information are closely related to phases. Based on this observation we decompose the sampled results of the reverse diffusion process in the Fourier domain and take advantage of the amplitude of the generative prior to align the enhanced brightness with the distribution of natural images. To yield a sharp and content-consistent enhanced result we further design a spatial-frequency alternating optimization strategy to progressively refine the phase of the input. Extensive experiments demonstrate the superior effectiveness of the proposed method especially in real-world scenes. + + + + Scaling Up to Excellence: Practicing Model Scaling for Photo-Realistic Image Restoration In the Wild + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Scaling_Up_to_Excellence_Practicing_Model_Scaling_for_Photo-Realistic_Image_CVPR_2024_paper.pdf + We introduce SUPIR (Scaling-UP Image Restoration) a groundbreaking image restoration method that harnesses generative prior and the power of model scaling up. Leveraging multi-modal techniques and advanced generative prior SUPIR marks a significant advance in intelligent and realistic image restoration. As a pivotal catalyst within SUPIR model scaling dramatically enhances its capabilities and demonstrates new potential for image restoration. We collect a dataset comprising 20 million high-resolution high-quality images for model training each enriched with descriptive text annotations. SUPIR provides the capability to restore images guided by textual prompts broadening its application scope and potential. Moreover we introduce negative-quality prompts to further improve perceptual quality. We also develop a restoration-guided sampling method to suppress the fidelity issue encountered in generative-based restoration. Experiments demonstrate SUPIR's exceptional restoration effects and its novel capacity to manipulate restoration through textual prompts. 
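The FourierDiff entry above rests on a Fourier-domain observation: most luminance information is concentrated in the amplitude spectrum, while structure and content are carried mainly by the phase. The snippet below is only a minimal numpy illustration of that observation, not the FourierDiff pipeline itself; the two image arrays are hypothetical stand-ins for a low-light capture and a normally exposed reference.

```python
import numpy as np

def swap_amplitude(dark_img: np.ndarray, bright_img: np.ndarray) -> np.ndarray:
    """Recombine the amplitude spectrum of `bright_img` with the phase of `dark_img`.

    Illustrates the amplitude-carries-luminance / phase-carries-structure
    intuition behind Fourier-prior-guided enhancement. Inputs are
    single-channel float arrays in [0, 1] of the same shape.
    """
    f_dark = np.fft.fft2(dark_img)
    f_bright = np.fft.fft2(bright_img)
    amplitude = np.abs(f_bright)          # luminance-related component
    phase = np.angle(f_dark)              # structure/content component
    recombined = amplitude * np.exp(1j * phase)
    out = np.real(np.fft.ifft2(recombined))
    return np.clip(out, 0.0, 1.0)

# Hypothetical usage: the result keeps the dark image's scene structure but
# inherits the brightness distribution of the bright image.
dark = 0.2 * np.random.rand(64, 64)       # stand-in for a low-light image
bright = np.random.rand(64, 64)           # stand-in for a normally exposed image
enhanced = swap_amplitude(dark, bright)
```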
+ + + + Q-Instruct: Improving Low-level Visual Abilities for Multi-modality Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Q-Instruct_Improving_Low-level_Visual_Abilities_for_Multi-modality_Foundation_Models_CVPR_2024_paper.pdf + Multi-modality large language models (MLLMs) as represented by GPT-4V have introduced a paradigm shift for visual perception and understanding tasks that a variety of abilities can be achieved within one foundation model. While current MLLMs demonstrate primary low-level visual abilities from the identification of low-level visual attributes (e.g. clarity brightness) to the evaluation on image quality there's still an imperative to further improve the accuracy of MLLMs to substantially alleviate human burdens. To address this we collect the first dataset consisting of human natural language feedback on low-level vision. Each feedback offers a comprehensive description of an image's low-level visual attributes culminating in an overall quality assessment. The constructed Q-Pathway dataset includes 58K detailed human feedbacks on 18973 multi-sourced images with diverse low-level appearance. To ensure MLLMs can adeptly handle diverse queries we further propose a GPT-participated transformation to convert these feedbacks into a rich set of 200K instruction-response pairs termed Q-Instruct. Experimental results indicate that the Q-Instruct consistently elevates various low-level visual capabilities across multiple base models. We anticipate that our datasets can pave the way for a future that foundation models can assist humans on low-level visual tasks. + + + + Zero-Shot Structure-Preserving Diffusion Model for High Dynamic Range Tone Mapping + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Zero-Shot_Structure-Preserving_Diffusion_Model_for_High_Dynamic_Range_Tone_Mapping_CVPR_2024_paper.pdf + Tone mapping techniques aiming to convert high dynamic range (HDR) images to high-quality low dynamic range (LDR) images for display play a more crucial role in real-world vision systems with the increasing application of HDR images. However obtaining paired HDR and high-quality LDR images is difficult posing a challenge to deep learning based tone mapping methods. To overcome this challenge we propose a novel zero-shot tone mapping framework that utilizes shared structure knowledge allowing us to transfer a pre-trained mapping model from the LDR domain to HDR fields without paired training data. Our approach involves decomposing both the LDR and HDR images into two components: structural information and tonal information. To preserve the original image's structure we modify the reverse sampling process of a diffusion model and explicitly incorporate the structure information into the intermediate results. Additionally for improved image details we introduce a dual-control network architecture that enables different types of conditional inputs to control different scales of the output. Experimental results demonstrate the effectiveness of our approach surpassing previous state-of-the-art methods both qualitatively and quantitatively. Moreover our model exhibits versatility and can be applied to other low-level vision tasks without retraining. The code is available at https://github.com/ZSDM-HDR/Zero-Shot-Diffusion-HDR. 
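The zero-shot tone-mapping entry above hinges on decomposing an image into structural and tonal components before injecting the structure into intermediate diffusion results. The abstract does not spell out the decomposition, so the sketch below uses a generic log-domain base/detail split (a Gaussian-filtered base as the tonal part, the residual as structure) purely to make the idea concrete; it is not the paper's method.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def structure_tone_split(luminance: np.ndarray, sigma: float = 8.0):
    """Split a luminance map into a smooth tonal base and a structural residual.

    A standard base/detail decomposition shown only for illustration; the
    actual decomposition used by the tone-mapping paper above may differ.
    """
    log_lum = np.log1p(luminance)                  # compress the dynamic range
    tonal = gaussian_filter(log_lum, sigma=sigma)  # low-frequency tonal component
    structure = log_lum - tonal                    # high-frequency structural detail
    return structure, tonal

# Hypothetical HDR luminance map (non-negative, possibly much larger than 1).
hdr_lum = 100.0 * np.random.rand(128, 128)
structure, tonal = structure_tone_split(hdr_lum)
# A tone-mapping operator could now compress `tonal` while preserving `structure`.
```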
+ + + + VoCo: A Simple-yet-Effective Volume Contrastive Learning Framework for 3D Medical Image Analysis + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_VoCo_A_Simple-yet-Effective_Volume_Contrastive_Learning_Framework_for_3D_Medical_CVPR_2024_paper.pdf + Self-Supervised Learning (SSL) has demonstrated promising results in 3D medical image analysis. However the lack of high-level semantics in pre-training still heavily hinders the performance of downstream tasks. We observe that 3D medical images contain relatively consistent contextual position information i.e. consistent geometric relations between different organs which leads to a potential way for us to learn consistent semantic representations in pre-training. In this paper we propose a simple-yet-effective Volume Contrast (VoCo) framework to leverage the contextual position priors for pre-training. Specifically we first generate a group of base crops from different regions while enforcing feature discrepancy among them where we employ them as class assignments of different regions. Then we randomly crop sub-volumes and predict them belonging to which class (located at which region) by contrasting their similarity to different base crops which can be seen as predicting contextual positions of different sub-volumes. Through this pretext task VoCo implicitly encodes the contextual position priors into model representations without the guidance of annotations enabling us to effectively improve the performance of downstream tasks that require high-level semantics. Extensive experimental results on six downstream tasks demonstrate the superior effectiveness of VoCo. Code will be available at https://github.com/Luffy03/VoCo. + + + + IPoD: Implicit Field Learning with Point Diffusion for Generalizable 3D Object Reconstruction from Single RGB-D Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_IPoD_Implicit_Field_Learning_with_Point_Diffusion_for_Generalizable_3D_CVPR_2024_paper.pdf + Generalizable 3D object reconstruction from single-view RGB-D images remains a challenging task particularly with real-world data. Current state-of-the-art methods develop Transformer-based implicit field learning necessitating an intensive learning paradigm that requires dense query-supervision uniformly sampled throughout the entire space. We propose a novel approach IPoD which harmonizes implicit field learning with point diffusion. This approach treats the query points for implicit field learning as a noisy point cloud for iterative denoising allowing for their dynamic adaptation to the target object shape. Such adaptive query points harness diffusion learning's capability for coarse shape recovery and also enhances the implicit representation's ability to delineate finer details. Besides an additional self-conditioning mechanism is designed to use implicit predictions as the guidance of diffusion learning leading to a cooperative system. Experiments conducted on the CO3D-v2 dataset affirm the superiority of IPoD achieving 7.8% improvement in F-score and 28.6% in Chamfer distance over existing methods. The generalizability of IPoD is also demonstrated on the MVImgNet dataset. Our project page is at https://yushuang-wu.github.io/IPoD. 
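The VoCo entry above describes its pre-training pretext task as classifying a randomly cropped sub-volume to the base crop (spatial region) it came from by contrasting embedding similarities. The toy function below sketches such a contrastive position objective in plain numpy; the embeddings, region count and temperature are hypothetical, and the actual formulation is the one defined in the VoCo paper.

```python
import numpy as np

def position_contrast_loss(subvol_emb: np.ndarray,
                           base_embs: np.ndarray,
                           region_id: int,
                           temperature: float = 0.1) -> float:
    """Cross-entropy over cosine similarities between a sub-volume embedding
    and one embedding per base crop, i.e. predict which region the
    sub-volume was cropped from."""
    z = subvol_emb / np.linalg.norm(subvol_emb)
    bases = base_embs / np.linalg.norm(base_embs, axis=1, keepdims=True)
    logits = bases @ z / temperature          # similarity to each region
    logits = logits - logits.max()            # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return float(-log_probs[region_id])       # small when the right region wins

# Hypothetical usage with 8 base crops of a volume and a 128-d embedding space.
rng = np.random.default_rng(0)
base_embs = rng.normal(size=(8, 128))
subvol_emb = base_embs[3] + 0.1 * rng.normal(size=128)  # sub-volume from region 3
loss = position_contrast_loss(subvol_emb, base_embs, region_id=3)
```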
+ + + + CurveCloudNet: Processing Point Clouds with 1D Structure + http://openaccess.thecvf.com//content/CVPR2024/papers/Stearns_CurveCloudNet_Processing_Point_Clouds_with_1D_Structure_CVPR_2024_paper.pdf + Modern depth sensors such as LiDAR operate by sweeping laser beams across the scene, resulting in a point cloud with notable 1D curve-like structures. In this work, we introduce a new point cloud processing scheme and backbone called CurveCloudNet, which takes advantage of the curve-like structure inherent to these sensors. While existing backbones discard the rich 1D traversal patterns and rely on generic 3D operations, CurveCloudNet parameterizes the point cloud as a collection of polylines (dubbed a "curve cloud"), establishing a local surface-aware ordering on the points. By reasoning along curves, CurveCloudNet captures lightweight curve-aware priors to efficiently and accurately reason in several diverse 3D environments. We evaluate CurveCloudNet on multiple synthetic and real datasets that exhibit distinct 3D size and structure. We demonstrate that CurveCloudNet outperforms both point-based and sparse-voxel backbones in various segmentation settings, notably scaling to large scenes better than point-based alternatives while exhibiting improved single-object performance over sparse-voxel alternatives. In all, CurveCloudNet is an efficient and accurate backbone that can handle a larger variety of 3D environments than past works. + + + + OpenStreetView-5M: The Many Roads to Global Visual Geolocation + http://openaccess.thecvf.com//content/CVPR2024/papers/Astruc_OpenStreetView-5M_The_Many_Roads_to_Global_Visual_Geolocation_CVPR_2024_paper.pdf + Determining the location of an image anywhere on Earth is a complex visual task, which makes it particularly relevant for evaluating computer vision algorithms. Yet the absence of standard, large-scale, open-access datasets with reliably localizable images has limited its potential. To address this issue, we introduce OpenStreetView-5M, a large-scale open-access dataset comprising over 5.1 million geo-referenced street view images covering 225 countries and territories. In contrast to existing benchmarks, we enforce a strict train/test separation, allowing us to evaluate the relevance of learned geographical features beyond mere memorization. To demonstrate the utility of our dataset, we conduct an extensive benchmark of various state-of-the-art image encoders, spatial representations and training strategies. All associated codes and models can be found at https://github.com/gastruc/osv5m. + + + + Text-IF: Leveraging Semantic Text Guidance for Degradation-Aware and Interactive Image Fusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Yi_Text-IF_Leveraging_Semantic_Text_Guidance_for_Degradation-Aware_and_Interactive_Image_CVPR_2024_paper.pdf + Image fusion aims to combine information from different source images to create a comprehensively representative image. Existing fusion methods are typically helpless in dealing with degradations in low-quality source images and non-interactive to multiple subjective and objective needs. To solve these problems, we introduce a novel approach that leverages a semantic text guided image fusion model for the degradation-aware and interactive image fusion task, termed Text-IF.
It innovatively extends the classical image fusion to the text guided image fusion along with the ability to harmoniously address the degradation and interaction issues during fusion. Through the text semantic encoder and semantic interaction fusion decoder Text-IF is accessible to the all-in-one infrared and visible image degradation-aware processing and the interactive flexible fusion outcomes. In this way Text-IF achieves not only multi-modal image fusion but also multi-modal information fusion. Extensive experiments prove that our proposed text guided image fusion strategy has obvious advantages over SOTA methods in the image fusion performance and degradation treatment. The code is available at https://github.com/XunpengYi/Text-IF. + + + + Learning to Produce Semi-dense Correspondences for Visual Localization http://openaccess.thecvf.com//content/CVPR2024/papers/Giang_Learning_to_Produce_Semi-dense_Correspondences_for_Visual_Localization_CVPR_2024_paper.pdf This study addresses the challenge of performing visual localization in demanding conditions such as night-time scenarios adverse weather and seasonal changes. While many prior studies have focused on improving image matching performance to facilitate reliable dense keypoint matching between images existing methods often heavily rely on predefined feature points on a reconstructed 3D model. Consequently they tend to overlook unobserved keypoints during the matching process. Therefore dense keypoint matches are not fully exploited leading to a notable reduction in accuracy particularly in noisy scenes. To tackle this issue we propose a novel localization method that extracts reliable semi-dense 2D-3D matching points based on dense keypoint matches. This approach involves regressing semi-dense 2D keypoints into 3D scene coordinates using a point inference network. The network utilizes both geometric and visual cues to effectively infer 3D coordinates for unobserved keypoints from the observed ones. The abundance of matching information significantly enhances the accuracy of camera pose estimation even in scenarios involving noisy or sparse 3D models. Comprehensive evaluations demonstrate that the proposed method outperforms other methods in challenging scenes and achieves competitive results in large-scale visual localization benchmarks. The code will be available at https://github.com/TruongKhang/DeViLoc + + + + Amodal Ground Truth and Completion in the Wild http://openaccess.thecvf.com//content/CVPR2024/papers/Zhan_Amodal_Ground_Truth_and_Completion_in_the_Wild_CVPR_2024_paper.pdf This paper studies amodal image segmentation: predicting entire object segmentation masks including both visible and invisible (occluded) parts. In previous work the amodal segmentation ground truth on real images is usually predicted by manual annotation and thus is subjective. In contrast we use 3D data to establish an automatic pipeline to determine authentic ground truth amodal masks for partially occluded objects in real images. This pipeline is used to construct an amodal completion evaluation benchmark MP3D-Amodal consisting of a variety of object categories and labels. To better handle the amodal completion task in the wild we explore two architecture variants: a two-stage model that first infers the occluder followed by amodal mask completion; and a one-stage model that exploits the representation power of Stable Diffusion for amodal segmentation across many categories. 
Without bells and whistles our method achieves a new state-of-the-art performance on Amodal segmentation datasets that cover a large variety of objects including COCOA and our new MP3D-Amodal dataset. The dataset model and code are available at https://www.robots.ox.ac.uk/~vgg/research/amodal/. + + + + NECA: Neural Customizable Human Avatar http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_NECA_Neural_Customizable_Human_Avatar_CVPR_2024_paper.pdf Human avatar has become a novel type of 3D asset with various applications. Ideally a human avatar should be fully customizable to accommodate different settings and environments. In this work we introduce NECA an approach capable of learning versatile human representation from monocular or sparse-view videos enabling granular customization across aspects such as pose shadow shape lighting and texture. The core of our approach is to represent humans in complementary dual spaces and predict disentangled neural fields of geometry albedo shadow as well as an external lighting from which we are able to derive realistic rendering with high-frequency details via volumetric rendering. Extensive experiments demonstrate the advantage of our method over the state-of-the-art methods in photorealistic rendering as well as various editing tasks such as novel pose synthesis and relighting. Our code is available at https://github.com/iSEE-Laboratory/NECA. + + + + Real-IAD: A Real-World Multi-View Dataset for Benchmarking Versatile Industrial Anomaly Detection http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Real-IAD_A_Real-World_Multi-View_Dataset_for_Benchmarking_Versatile_Industrial_Anomaly_CVPR_2024_paper.pdf Industrial anomaly detection (IAD) has garnered significant attention and experienced rapid development. However the recent development of IAD approach has encountered certain difficulties due to dataset limitations. On the one hand most of the state-of-the-art methods have achieved saturation (over 99% in AUROC) on mainstream datasets such as MVTec and the differences of methods cannot be well distinguished leading to a significant gap between public datasets and actual application scenarios. On the other hand the research on various new practical anomaly detection settings is limited by the scale of the dataset posing a risk of overfitting in evaluation results. Therefore we propose a large-scale Real-world and multi-view Industrial Anomaly Detection dataset named Real-IAD which contains 150K high-resolution images of 30 different objects an order of magnitude larger than existing datasets. It has a larger range of defect area and ratio proportions making it more challenging than previous datasets. To make the dataset closer to real application scenarios we adopted a multi-view shooting method and proposed sample-level evaluation metrics. In addition beyond the general unsupervised anomaly detection setting we propose a new setting for Fully Unsupervised Industrial Anomaly Detection (FUIAD) based on the observation that the yield rate in industrial production is usually greater than 60% which has more practical application value. Finally we report the results of popular IAD methods on the Real-IAD dataset providing a highly challenging benchmark to promote the development of the IAD field. 
+ + + + Boosting Adversarial Transferability by Block Shuffle and Rotation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Boosting_Adversarial_Transferability_by_Block_Shuffle_and_Rotation_CVPR_2024_paper.pdf + Adversarial examples mislead deep neural networks with imperceptible perturbations and have brought significant threats to deep learning. An important aspect is their transferability which refers to their ability to deceive other models thus enabling attacks in the black-box setting. Though various methods have been proposed to boost transferability the performance still falls short compared with white-box attacks. In this work we observe that existing input transformation based attacks one of the mainstream transfer-based attacks result in different attention heatmaps on various models which might limit the transferability. We also find that breaking the intrinsic relation of the image can disrupt the attention heatmap of the original image. Based on this finding we propose a novel input transformation based attack called block shuffle and rotation (BSR). Specifically BSR splits the input image into several blocks then randomly shuffles and rotates these blocks to construct a set of new images for gradient calculation. Empirical evaluations on the ImageNet dataset demonstrate that BSR could achieve significantly better transferability than the existing input transformation based methods under single-model and ensemble-model settings. Combining BSR with the current input transformation method can further improve the transferability which significantly outperforms the state-of-the-art methods. Code is available at https://github.com/Trustworthy-AI-Group/BSR. + + + + LidaRF: Delving into Lidar for Neural Radiance Field on Street Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_LidaRF_Delving_into_Lidar_for_Neural_Radiance_Field_on_Street_CVPR_2024_paper.pdf + Photorealistic simulation plays a crucial role in applications such as autonomous driving where advances in neural radiance fields (NeRFs) may allow better scalability through the automatic creation of digital 3D assets. However reconstruction quality suffers on street scenes due to largely collinear camera motions and sparser samplings at higher speeds. On the other hand the application often demands rendering from camera views that deviate from the inputs to accurately simulate behaviors like lane changes. In this paper we propose several insights that allow a better utilization of Lidar data to improve NeRF quality on street scenes. First our framework learns a geometric scene representation from Lidar which are fused with the implicit grid-based representation for radiance decoding thereby supplying stronger geometric information offered by explicit point cloud. Second we put forth a robust occlusion-aware depth supervision scheme which allows utilizing densified Lidar points by accumulation. Third we generate augmented training views from Lidar points for further improvement. Our insights translate to largely improved novel view synthesis under real driving scenes. + + + + Video Recognition in Portrait Mode + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_Video_Recognition_in_Portrait_Mode_CVPR_2024_paper.pdf + The creation of new datasets often presents new challenges for video recognition and can inspire novel ideas while addressing these challenges. 
While existing datasets mainly comprise landscape mode videos our paper seeks to introduce portrait mode videos to the research community and highlight the unique challenges associated with this video format. With the growing popularity of smartphones and social media applications recognizing portrait mode videos is becoming increasingly important. To this end we have developed the first dataset dedicated to portrait mode video recognition namely PortraitMode-400. The taxonomy of PortraitMode-400 was constructed in a data-driven manner comprising 400 fine-grained categories and rigorous quality assurance was implemented to ensure the accuracy of human annotations. In addition to the new dataset we conducted a comprehensive analysis of the impact of video format (portrait mode versus landscape mode) on recognition accuracy and spatial bias due to the different formats. Furthermore we designed extensive experiments to explore key aspects of portrait mode video recognition including the choice of data augmentation evaluation procedure the importance of temporal information and the role of audio modality. Building on the insights from our experimental results and the introduction of PortraitMode-400 our paper aims to inspire further research efforts in this emerging research area. + + + + Selective Hourglass Mapping for Universal Image Restoration Based on Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Selective_Hourglass_Mapping_for_Universal_Image_Restoration_Based_on_Diffusion_CVPR_2024_paper.pdf + Universal image restoration is a practical and potential computer vision task for real-world applications. The main challenge of this task is handling the different degradation distributions at once. Existing methods mainly utilize task-specific conditions (e.g. prompt) to guide the model to learn different distributions separately named multi-partite mapping. However it is not suitable for universal model learning as it ignores the shared information between different tasks. In this work we propose an advanced selective hourglass mapping strategy based on diffusion model termed DiffUIR. Two novel considerations make our DiffUIR non-trivial. Firstly we equip the model with strong condition guidance to obtain accurate generation direction of diffusion model (selective). More importantly DiffUIR integrates a flexible shared distribution term (SDT) into the diffusion algorithm elegantly and naturally which gradually maps different distributions into a shared one. In the reverse process combined with SDT and strong condition guidance DiffUIR iteratively guides the shared distribution to the task-specific distribution with high image quality (hourglass). Without bells and whistles by only modifying the mapping strategy we achieve state-of-the-art performance on five image restoration tasks 22 benchmarks in the universal setting and zero-shot generalization setting. Surprisingly by only using a lightweight model (only 0.89M) we could achieve outstanding performance. The source code and pre-trained models are available at https://github.com/iSEE-Laboratory/DiffUIR + + + + Audio-Visual Segmentation via Unlabeled Frame Exploitation + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Audio-Visual_Segmentation_via_Unlabeled_Frame_Exploitation_CVPR_2024_paper.pdf + Audio-visual segmentation (AVS) aims to segment the sounding objects in video frames. 
Although great progress has been witnessed we experimentally reveal that current methods reach marginal performance gain within the use of the unlabeled frames leading to the underutilization issue. To fully explore the potential of the unlabeled frames for AVS we explicitly divide them into two categories based on their temporal characteristics i.e. neighboring frame (NF) and distant frame (DF). NFs temporally adjacent to the labeled frame often contain rich motion information that assists in the accurate localization of sounding objects. Contrary to NFs DFs have long temporal distances from the labeled frame which share semantic-similar objects with appearance variations. Considering their unique characteristics we propose a versatile framework that effectively leverages them to tackle AVS. Specifically for NFs we exploit the motion cues as the dynamic guidance to improve the objectness localization. Besides we exploit the semantic cues in DFs by treating them as valid augmentations to the labeled frames which are then used to enrich data diversity in a self-training manner. Extensive experimental results demonstrate the versatility and superiority of our method unleashing the power of the abundant unlabeled frames. + + + + DriveTrack: A Benchmark for Long-Range Point Tracking in Real-World Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Balasingam_DriveTrack_A_Benchmark_for_Long-Range_Point_Tracking_in_Real-World_Videos_CVPR_2024_paper.pdf + This paper presents DriveTrack a new benchmark and data generation framework for long-range keypoint tracking in real-world videos. DriveTrack is motivated by the observation that the accuracy of state-of-the-art trackers depends strongly on visual attributes around the selected keypoints such as texture and lighting. The problem is that these artifacts are especially pronounced in real-world videos but these trackers are unable to train on such scenes due to a dearth of annotations. DriveTrack bridges this gap by building a framework to automatically annotate point tracks on autonomous driving datasets. We release a dataset consisting of 1 billion point tracks across 24 hours of video which is seven orders of magnitude greater than prior real-world benchmarks and on par with the scale of synthetic benchmarks. DriveTrack unlocks new use cases for point tracking in real-world videos. First we show that fine-tuning keypoint trackers on DriveTrack improves accuracy on real-world scenes by up to 7%. Second we analyze the sensitivity of trackers to visual artifacts in real scenes and motivate the idea of running assistive keypoint selectors alongside trackers. + + + + Infrared Adversarial Car Stickers + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Infrared_Adversarial_Car_Stickers_CVPR_2024_paper.pdf + Infrared physical adversarial examples are of great significance for studying the security of infrared AI systems that are widely used in our lives such as autonomous driving. Previous infrared physical attacks mainly focused on 2D infrared pedestrian detection which may not fully manifest its destructiveness to AI systems. In this work we propose a physical attack method against infrared detectors based on 3D modeling which is applied to a real car. The goal is to design a set of infrared adversarial stickers to make cars invisible to infrared detectors at various viewing angles distances and scenes. 
We build a 3D infrared car model with real infrared characteristics and propose an infrared adversarial pattern generation method based on 3D mesh shadow. We propose a 3D control points-based mesh smoothing algorithm and use a set of smoothness loss functions to enhance the smoothness of adversarial meshes and facilitate the sticker implementation. Besides, we designed the aluminum stickers and conducted physical experiments on two real Mercedes-Benz A200L cars. Our adversarial stickers hid the cars from Faster RCNN an object detector at various viewing angles distances and scenes. The attack success rate (ASR) was 91.49% for real cars. In comparison the ASRs of random stickers and no sticker were only 6.21% and 0.66% respectively. In addition the ASRs of the designed stickers against six unseen object detectors such as YOLOv3 and Deformable DETR were between 73.35%-95.80% showing good transferability of the attack performance across detectors. + + + + FreeMan: Towards Benchmarking 3D Human Pose Estimation under Real-World Conditions http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_FreeMan_Towards_Benchmarking_3D_Human_Pose_Estimation_under_Real-World_Conditions_CVPR_2024_paper.pdf Estimating the 3D structure of the human body from natural scenes is a fundamental aspect of visual perception. 3D human pose estimation is a vital step in advancing fields like AIGC and human-robot interaction serving as a crucial technique for understanding and interacting with human actions in real-world settings. However the current datasets often collected under single laboratory conditions using complex motion capture equipment and unvarying backgrounds are insufficient. The absence of datasets on variable conditions is stalling the progress of this crucial task. To facilitate the development of 3D pose estimation we present FreeMan the first large-scale multi-view dataset collected under the real-world conditions. FreeMan was captured by synchronizing 8 smartphones across diverse scenarios. It comprises 11M frames from 8000 sequences viewed from different perspectives. These sequences cover 40 subjects across 10 different scenarios each with varying lighting conditions. We have also established a semi-automated pipeline containing error detection to reduce the workload of manual check and ensure precise annotation. We provide comprehensive evaluation baselines for a range of tasks underlining the significant challenges posed by FreeMan. Further evaluations of standard indoor/outdoor human sensing datasets reveal that FreeMan offers robust representation transferability in real and complex scenes. FreeMan is publicly available at https://wangjiongw.github.io/freeman. + + + + GP-NeRF: Generalized Perception NeRF for Context-Aware 3D Scene Understanding http://openaccess.thecvf.com//content/CVPR2024/papers/Li_GP-NeRF_Generalized_Perception_NeRF_for_Context-Aware_3D_Scene_Understanding_CVPR_2024_paper.pdf Applying Neural Radiance Fields (NeRF) to downstream perception tasks for scene understanding and representation is becoming increasingly popular. Most existing methods treat semantic prediction as an additional rendering task i.e. the "label rendering" task to build semantic NeRFs. However by rendering semantic/instance labels per pixel without considering the contextual information of the rendered image these methods usually suffer from unclear boundary segmentation and abnormal segmentation of pixels within an object. 
To solve this problem we propose Generalized Perception NeRF (GP-NeRF) a novel pipeline that makes the widely used segmentation model and NeRF work compatibly under a unified framework for facilitating context-aware 3D scene perception. To accomplish this goal we introduce transformers to aggregate radiance as well as semantic embedding fields jointly for novel views and facilitate the joint volumetric rendering of both fields. In addition we propose two self-distillation mechanisms i.e. the Semantic Distill Loss and the Depth-Guided Semantic Distill Loss to enhance the discrimination and quality of the semantic field and the maintenance of geometric consistency. In evaluation as shown in Fig. 1 we conduct experimental comparisons under two perception tasks (i.e. semantic and instance segmentation) using both synthetic and real-world datasets. Notably our method outperforms SOTA approaches by 6.94% 11.76% and 8.47% on generalized semantic segmentation finetuning semantic segmentation and instance segmentation respectively + + + + Polarization Wavefront Lidar: Learning Large Scene Reconstruction from Polarized Wavefronts + http://openaccess.thecvf.com//content/CVPR2024/papers/Scheuble_Polarization_Wavefront_Lidar_Learning_Large_Scene_Reconstruction_from_Polarized_Wavefronts_CVPR_2024_paper.pdf + Lidar has become a cornerstone sensing modality for 3D vision especially for large outdoor scenarios and autonomous driving. Conventional lidar sensors are capable of providing centimeter-accurate distance information by emitting laser pulses into a scene and measuring the time-of-flight (ToF) of the reflection. However the polarization of the received light that depends on the surface orientation and material properties is usually not considered. As such the polarization modality has the potential to improve scene reconstruction beyond distance measurements. In this work we introduce a novel long-range polarization wavefront lidar sensor (PolLidar) that modulates the polarization of the emitted and received light. Departing from conventional lidar sensors PolLidar allows access to the raw time-resolved polarimetric wavefronts. We leverage polarimetric wavefronts to estimate normals distance and material properties in outdoor scenarios with a novel learned reconstruction method. To train and evaluate the method we introduce a simulated and real-world long-range dataset with paired raw lidar data ground truth distance and normal maps. We find that the proposed method improves normal and distance reconstruction by 53% mean angular error and 41% mean absolute error compared to existing shape-from-polarization (SfP) and ToF methods. Code and data are open-sourced here. + + + + GDA: Generalized Diffusion for Robust Test-time Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Tsai_GDA_Generalized_Diffusion_for_Robust_Test-time_Adaptation_CVPR_2024_paper.pdf + Machine learning models face generalization challenges when exposed to out-of-distribution (OOD) samples with unforeseen distribution shifts. Recent research reveals that for vision tasks test-time adaptation employing diffusion models can achieve state-of-the-art accuracy improvements on OOD samples by generating domain-aligned samples without altering the model's weights. Unfortunately those studies have primarily focused on pixel-level corruptions thereby lacking the generalization to adapt to a broader range of OOD types. 
We introduce Generalized Diffusion Adaptation (GDA) a novel diffusion-based test-time adaptation method robust against diverse OOD types. Specifically GDA iteratively guides the diffusion by applying a marginal entropy loss derived from the model in conjunction with style and content preservation losses during the reverse sampling process. In other words GDA considers the model's output behavior and the samples' semantic information as a whole reducing ambiguity in downstream tasks. Evaluation across various model architectures and OOD benchmarks indicates that GDA consistently surpasses previous diffusion-based adaptation methods. Notably it achieves the highest classification accuracy improvements ranging from 4.4% to 5.02% on ImageNet-C and 2.5% to 7.4% on Rendition Sketch and Stylized benchmarks. This performance highlights GDA's generalization to a broader range of OOD benchmarks. + + + + Continual-MAE: Adaptive Distribution Masked Autoencoders for Continual Test-Time Adaptation http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Continual-MAE_Adaptive_Distribution_Masked_Autoencoders_for_Continual_Test-Time_Adaptation_CVPR_2024_paper.pdf Continual Test-Time Adaptation (CTTA) is proposed to migrate a source pre-trained model to continually changing target distributions addressing real-world dynamism. Existing CTTA methods mainly rely on entropy minimization or teacher-student pseudo-labeling schemes for knowledge extraction in unlabeled target domains. However dynamic data distributions cause miscalibrated predictions and noisy pseudo-labels in existing self-supervised learning methods hindering the effective mitigation of error accumulation and catastrophic forgetting problems during the continual adaptation process. To tackle these issues we propose a continual self-supervised method Adaptive Distribution Masked Autoencoders (ADMA) which enhances the extraction of target domain knowledge while mitigating the accumulation of distribution shifts. Specifically we propose a Distribution-aware Masking (DaM) mechanism to adaptively sample masked positions followed by establishing consistency constraints between the masked target samples and the original target samples. Additionally for masked tokens we utilize an efficient decoder to reconstruct a hand-crafted feature descriptor (e.g. Histograms of Oriented Gradients) leveraging its invariant properties to boost task-relevant representations. Through conducting extensive experiments on four widely recognized benchmarks our proposed method attains state-of-the-art performance in both classification and segmentation CTTA tasks. + + + + Dual-Enhanced Coreset Selection with Class-wise Collaboration for Online Blurry Class Incremental Learning http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_Dual-Enhanced_Coreset_Selection_with_Class-wise_Collaboration_for_Online_Blurry_Class_CVPR_2024_paper.pdf Traditional online class incremental learning assumes class sets in different tasks are disjoint. However recent works have shifted towards a more realistic scenario where tasks have shared classes creating blurred task boundaries. Under this setting although existing approaches could be directly applied challenges like data imbalance and varying class-wise data volumes complicate the critical coreset selection used for replay. 
To tackle these challenges we introduce DECO (Dual-Enhanced Coreset Selection with Class-wise Collaboration) an approach that starts by establishing a class-wise balanced memory to address data imbalances followed by a tailored class-wise gradient-based similarity scoring system for refined coreset selection strategies with reasonable score guidance to all classes. DECO is distinguished by two main strategies: (1) Collaborative Diverse Score Guidance that mitigates biased knowledge in less-exposed classes through guidance from well-established classes simultaneously consolidating the knowledge in the established classes to enhance overall stability. (2) Adaptive Similarity Score Constraint that relaxes constraints between class types boosting learning plasticity for less-exposed classes and assisting well-established classes in defining clearer boundaries thereby improving overall plasticity. Overall DECO helps effectively identify critical coreset samples improving learning stability and plasticity across all classes. Extensive experiments are conducted on four benchmark datasets to demonstrate the effectiveness and superiority of DECO over other competitors under this online blurry class incremental learning setting. + + + + Cyclic Learning for Binaural Audio Generation and Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Cyclic_Learning_for_Binaural_Audio_Generation_and_Localization_CVPR_2024_paper.pdf + Binaural audio is obtained by simulating the biological structure of human ears which plays an important role in artificial immersive spaces. A promising approach is to utilize mono audio and corresponding vision to synthesize binaural audio thereby avoiding expensive binaural audio recording. However most existing methods directly use the entire scene as a guide ignoring the correspondence between sounds and sounding objects. In this paper we advocate generating binaural audio using fine-grained raw waveform and object-level visual information as guidance. Specifically we propose a Cyclic Locating-and-UPmixing (CLUP) framework that jointly learns visual sounding object localization and binaural audio generation. Visual sounding object localization establishes the correspondence between specific visual objects and sound modalities which provides object-aware guidance to improve binaural generation performance. Meanwhile the spatial information contained in the generated binaural audio can further improve the performance of sounding object localization. In this case visual sounding object localization and binaural audio generation can achieve cyclic learning and benefit from each other. Experimental results demonstrate that on the FAIR-Play benchmark dataset our method is significantly ahead of the existing baselines in multiple evaluation metrics (STFT\downarrow: 0.787 vs. 0.851 ENV\downarrow: 0.128 vs. 0.134 WAV\downarrow: 5.244 vs. 5.684 SNR\uparrow: 7.546 vs. 7.044). + + + + Learning Instance-Aware Correspondences for Robust Multi-Instance Point Cloud Registration in Cluttered Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Learning_Instance-Aware_Correspondences_for_Robust_Multi-Instance_Point_Cloud_Registration_in_CVPR_2024_paper.pdf + Multi-instance point cloud registration estimates the poses of multiple instances of a model point cloud in a scene point cloud. Extracting accurate point correspondences is to the center of the problem. Existing approaches usually treat the scene point cloud as a whole overlooking the separation of instances. 
Therefore point features could be easily polluted by other points from the background or different instances leading to inaccurate correspondences oblivious to separate instances especially in cluttered scenes. In this work we propose MIRETR Multi-Instance REgistration TRansformer a coarse-to-fine approach to the extraction of instance-aware correspondences. At the coarse level it jointly learns instance-aware superpoint features and predicts per-instance masks. With instance masks the influence from outside of the instance being concerned is minimized such that highly reliable superpoint correspondences can be extracted. The superpoint correspondences are then extended to instance candidates at the fine level according to the instance masks. At last an efficient candidate selection and refinement algorithm is devised to obtain the final registrations. Extensive experiments on three public benchmarks demonstrate the efficacy of our approach. In particular MIRETR outperforms the state of the arts by 16.6 points on F1 score on the challenging ROBI benchmark. Code and models are available at https://github.com/zhiyuanYU134/MIRETR + + + + COCONut: Modernizing COCO Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_COCONut_Modernizing_COCO_Segmentation_CVPR_2024_paper.pdf + In recent decades the vision community has witnessed remarkable progress in visual recognition partially owing to advancements in dataset benchmarks. Notably the established COCO benchmark has propelled the development of modern detection and segmentation systems. However the COCO segmentation benchmark has seen comparatively slow improvement over the last decade. Originally equipped with coarse polygon annotations for thing instances it gradually incorporated coarse superpixel annotations for stuff regions which were subsequently heuristically amalgamated to yield panoptic segmentation annotations. These annotations executed by different groups of raters have resulted not only in coarse segmentation masks but also in inconsistencies between segmentation types. In this study we undertake a comprehensive reevaluation of the COCO segmentation annotations. By enhancing the annotation quality and expanding the dataset to encompass 383K images with more than 5.18M panoptic masks we introduce COCONut the COCO Next Universal segmenTation dataset. COCONut harmonizes segmentation annotations across semantic instance and panoptic segmentation with meticulously crafted high-quality masks and establishes a robust benchmark for all segmentation tasks. To our knowledge COCONut stands as the inaugural large-scale universal segmentation dataset verified by human raters. We anticipate that the release of COCONut will significantly contribute to the community's ability to assess the progress of novel neural networks. + + + + Semantic Line Combination Detector + http://openaccess.thecvf.com//content/CVPR2024/papers/Ko_Semantic_Line_Combination_Detector_CVPR_2024_paper.pdf + A novel algorithm called semantic line combination detector (SLCD) to find an optimal combination of semantic lines is proposed in this paper. It processes all lines in each line combination at once to assess the overall harmony of the lines. First we generate various line combinations from reliable lines. Second we estimate the score of each line combination and determine the best one. Experimental results demonstrate that the proposed SLCD outperforms existing semantic line detectors on various datasets. 
Moreover it is shown that SLCD can be applied effectively to three vision tasks of vanishing point detection symmetry axis detection and composition-based image retrieval. Our codes are available at https://github.com/Jinwon-Ko/SLCD. + + + + ReconFusion: 3D Reconstruction with Diffusion Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_ReconFusion_3D_Reconstruction_with_Diffusion_Priors_CVPR_2024_paper.pdf + 3D reconstruction methods such as Neural Radiance Fields (NeRFs) excel at rendering photorealistic novel views of complex scenes. However recovering a high-quality NeRF typically requires tens to hundreds of input images resulting in a time-consuming capture process. We present ReconFusion to reconstruct real-world scenes using only a few photos. Our approach leverages a diffusion prior for novel view synthesis trained on synthetic and multiview datasets which regularizes a NeRF-based 3D reconstruction pipeline at novel camera poses beyond those captured by the set of input images. Our method synthesizes realistic geometry and texture in underconstrained regions while preserving the appearance of observed regions. We perform an extensive evaluation across various real-world datasets including forward-facing and 360-degree scenes demonstrating significant performance improvements over previous few-view NeRF reconstruction approaches. Please see our project page at reconfusion.github.io. + + + + InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_InternVL_Scaling_up_Vision_Foundation_Models_and_Aligning_for_Generic_CVPR_2024_paper.pdf + The exponential growth of large language models (LLMs) has opened up numerous possibilities for multi-modal AGI systems. However the progress in vision and vision-language foundation models which are also critical elements of multi-modal AGI has not kept pace with LLMs. In this work we design a large-scale vision-language foundation model (InternVL) which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM using web-scale image-text data from various sources. This model can be broadly applied to and achieve state-of-the-art performance on 32 generic visual-linguistic benchmarks including visual perception tasks such as image-level or pixel-level recognition vision-language tasks such as zero-shot image/video classification zero-shot image/video-text retrieval and link with LLMs to create multi-modal dialogue systems. It has powerful visual capabilities and can be a good alternative to the ViT-22B. We hope that our research could contribute to the development of multi-modal large models. + + + + PI3D: Efficient Text-to-3D Generation with Pseudo-Image Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_PI3D_Efficient_Text-to-3D_Generation_with_Pseudo-Image_Diffusion_CVPR_2024_paper.pdf + Diffusion models trained on large-scale text-image datasets have demonstrated a strong capability of controllable high-quality image generation from arbitrary text prompts. However the generation quality and generalization ability of 3D diffusion models is hindered by the scarcity of high-quality and large-scale 3D datasets. In this paper we present PI3D a framework that fully leverages the pre-trained text-to-image diffusion models' ability to generate high-quality 3D shapes from text prompts in minutes. 
The core idea is to connect the 2D and 3D domains by representing a 3D shape as a set of Pseudo RGB Images. We fine-tune an existing text-to-image diffusion model to produce such pseudo-images using a small number of text-3D pairs. Surprisingly we find that it can already generate meaningful and consistent 3D shapes given complex text descriptions. We further take the generated shapes as the starting point for a lightweight iterative refinement using score distillation sampling to achieve high-quality generation under a low budget. PI3D generates a single 3D shape from text in only 3 minutes and the quality is validated to outperform existing 3D generative models by a large margin. + + + + pixelSplat: 3D Gaussian Splats from Image Pairs for Scalable Generalizable 3D Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Charatan_pixelSplat_3D_Gaussian_Splats_from_Image_Pairs_for_Scalable_Generalizable_CVPR_2024_paper.pdf + We introduce pixelSplat a feed-forward model that learns to reconstruct 3D radiance fields parameterized by 3D Gaussian primitives from pairs of images. Our model features real-time and memory-efficient rendering for scalable training as well as fast 3D reconstruction at inference time. To overcome local minima inherent to sparse and locally supported representations we predict a dense probability distribution over 3D and sample Gaussian means from that probability distribution. We make this sampling operation differentiable via a reparameterization trick allowing us to back-propagate gradients through the Gaussian splatting representation. We benchmark our method on wide-baseline novel view synthesis on the real-world RealEstate10k and ACID datasets where we outperform state-of-the-art light field transformers and accelerate rendering by 2.5 orders of magnitude while reconstructing an interpretable and editable 3D radiance field. Additional materials can be found on the anonymous project website (pixelsplat.github.io). + + + + VBench: Comprehensive Benchmark Suite for Video Generative Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_VBench_Comprehensive_Benchmark_Suite_for_Video_Generative_Models_CVPR_2024_paper.pdf + Video generation has witnessed significant advancements yet evaluating these models remains a challenge. A comprehensive evaluation benchmark for video generation is indispensable for two reasons: 1) Existing metrics do not fully align with human perceptions; 2) An ideal evaluation system should provide insights to inform future developments of video generation. To this end we present VBench a comprehensive benchmark suite that dissects "video generation quality" into specific hierarchical and disentangled dimensions each with tailored prompts and evaluation methods. VBench has three appealing properties: 1) Comprehensive Dimensions: VBench comprises 16 dimensions in video generation (e.g. subject identity inconsistency motion smoothness temporal flickering and spatial relationship etc). The evaluation metrics with fine-grained levels reveal individual models' strengths and weaknesses. 2) Human Alignment: We also provide a dataset of human preference annotations to validate our benchmarks' alignment with human perception for each evaluation dimension respectively. 3) Valuable Insights: We look into current models' ability across various evaluation dimensions and various content types. We also investigate the gaps between video and image generation models. 
We will open-source VBench including all prompts evaluation methods generated videos and human preference annotations and also include more video generation models in VBench to drive forward the field of video generation. + + + + MAP: MAsk-Pruning for Source-Free Model Intellectual Property Protection + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_MAP_MAsk-Pruning_for_Source-Free_Model_Intellectual_Property_Protection_CVPR_2024_paper.pdf + Deep learning has achieved remarkable progress in various applications heightening the importance of safeguarding the intellectual property (IP) of well-trained models. It entails not only authorizing usage but also ensuring the deployment of models in authorized data domains i.e. making models exclusive to certain target domains. Previous methods necessitate concurrent access to source training data and target unauthorized data when performing IP protection making them risky and inefficient for decentralized private data. In this paper we target a practical setting where only a well-trained source model is available and investigate how we can realize IP protection. To achieve this we propose a novel MAsk Pruning (MAP) framework. MAP stems from an intuitive hypothesis i.e. there are target-related parameters in a well-trained model locating and pruning them is the key to IP protection. Technically MAP freezes the source model and learns a target-specific binary mask to prevent unauthorized data usage while minimizing performance degradation on authorized data. Moreover we introduce a new metric aimed at achieving a better balance between source and target performance degradation. To verify the effectiveness and versatility we have evaluated MAP in a variety of scenarios including vanilla source-available practical source-free and challenging data-free. Extensive experiments indicate that MAP yields new state-of-the-art performance. + + + + Visual Prompting for Generalized Few-shot Segmentation: A Multi-scale Approach + http://openaccess.thecvf.com//content/CVPR2024/papers/Hossain_Visual_Prompting_for_Generalized_Few-shot_Segmentation_A_Multi-scale_Approach_CVPR_2024_paper.pdf + The emergence of attention-based transformer models has led to their extensive use in various tasks due to their superior generalization and transfer properties. Recent research has demonstrated that such models when prompted appropriately are excellent for few-shot inference. However such techniques are under-explored for dense prediction tasks like semantic segmentation. In this work we examine the effectiveness of prompting a transformer-decoder with learned visual prompts for the generalized few-shot segmentation (GFSS) task. Our goal is to achieve strong performance not only on novel categories with limited examples but also to retain performance on base categories. We propose an approach to learn visual prompts with limited examples. These learned visual prompts are used to prompt a multiscale transformer decoder to facilitate accurate dense predictions. Additionally we introduce a unidirectional causal attention mechanism between the novel prompts learned with limited examples and the base prompts learned with abundant data. This mechanism enriches the novel prompts without deteriorating the base class performance. Overall this form of prompting helps us achieve state-of-the-art performance for GFSS on two different benchmark datasets: COCO-20^i and Pascal-5^i without the need for test-time optimization (or transduction). 
Furthermore test-time optimization leveraging unlabelled test data can be used to improve the prompts which we refer to as transductive prompt tuning. + + + + Memory-based Adapters for Online 3D Scene Perception http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Memory-based_Adapters_for_Online_3D_Scene_Perception_CVPR_2024_paper.pdf In this paper we propose a new framework for online 3D scene perception. Conventional 3D scene perception methods are offline i.e. take an already reconstructed 3D scene geometry as input which is not applicable in robotic applications where the input data is streaming RGB-D videos rather than a complete 3D scene reconstructed from pre-collected RGB-D videos. To deal with online 3D scene perception tasks where data collection and perception should be performed simultaneously the model should be able to process 3D scenes frame by frame and make use of the temporal information. To this end we propose an adapter-based plug-and-play module for the backbone of 3D scene perception model which constructs memory to cache and aggregate the extracted RGB-D features to empower offline models with temporal learning ability. Specifically we propose a queued memory mechanism to cache the supporting point cloud and image features. Then we devise aggregation modules which directly perform on the memory and pass temporal information to current frame. We further propose 3D-to-2D adapter to enhance image features with strong global context. Our adapters can be easily inserted into mainstream offline architectures of different tasks and significantly boost their performance on online tasks. Extensive experiments on ScanNet and SceneNN datasets demonstrate our approach achieves leading performance on three 3D scene perception tasks compared with state-of-the-art online methods by simply finetuning existing offline models without any model and task-specific designs. + + + + A Study of Dropout-Induced Modality Bias on Robustness to Missing Video Frames for Audio-Visual Speech Recognition http://openaccess.thecvf.com//content/CVPR2024/papers/Dai_A_Study_of_Dropout-Induced_Modality_Bias_on_Robustness_to_Missing_CVPR_2024_paper.pdf Advanced Audio-Visual Speech Recognition (AVSR) systems have been observed to be sensitive to missing video frames performing even worse than single-modality models. While applying the common dropout techniques to the video modality enhances robustness to missing frames it simultaneously results in a performance loss when dealing with complete data input. In this study we delve into this contrasting phenomenon through the lens of modality bias and uncover that an excessive modality bias towards the audio modality induced by dropout constitutes the fundamental cause. Next we present the Modality Bias Hypothesis (MBH) to systematically describe the relationship between the modality bias and the robustness against missing modality in multimodal systems. Building on these findings we propose a novel Multimodal Distribution Approximation with Knowledge Distillation (MDA-KD) framework to reduce over-reliance on the audio modality maintaining performance and robustness simultaneously. Finally to address an entirely missing modality we adopt adapters to dynamically switch decision strategies. The effectiveness of our proposed approach is evaluated through comprehensive experiments on the MISP2021 and MISP2022 datasets. Our code is available at https://github.com/dalision/ModalBiasAVSR. 
+ + + + A Conditional Denoising Diffusion Probabilistic Model for Point Cloud Upsampling http://openaccess.thecvf.com//content/CVPR2024/papers/Qu_A_Conditional_Denoising_Diffusion_Probabilistic_Model_for_Point_Cloud_Upsampling_CVPR_2024_paper.pdf Point cloud upsampling (PCU) enriches the representation of raw point clouds significantly improving the performance in downstream tasks such as classification and reconstruction. Most of the existing point cloud upsampling methods focus on sparse point cloud feature extraction and upsampling module design. In a different way we dive deeper into directly modelling the gradient of data distribution from dense point clouds. In this paper we proposed a conditional denoising diffusion probabilistic model (DDPM) for point cloud upsampling called PUDM. Specifically PUDM treats the sparse point cloud as a condition and iteratively learns the transformation relationship between the dense point cloud and the noise. Simultaneously PUDM aligns with a dual mapping paradigm to further improve the discernment of point features. In this context PUDM enables learning complex geometry details in the ground truth through the dominant features while avoiding an additional upsampling module design. Furthermore to generate high-quality arbitrary-scale point clouds during inference PUDM exploits the prior knowledge of the scale between sparse point clouds and dense point clouds during training by parameterizing a rate factor. Moreover PUDM exhibits strong noise robustness in experimental results. In the quantitative and qualitative evaluations on PU1K and PUGAN PUDM significantly outperformed existing methods in terms of Chamfer Distance (CD) and Hausdorff Distance (HD) achieving state of the art (SOTA) performance. + + + + GAFusion: Adaptive Fusing LiDAR and Camera with Multiple Guidance for 3D Object Detection http://openaccess.thecvf.com//content/CVPR2024/papers/Li_GAFusion_Adaptive_Fusing_LiDAR_and_Camera_with_Multiple_Guidance_for_CVPR_2024_paper.pdf Recent years have witnessed the remarkable progress of 3D multi-modality object detection methods based on the Bird's-Eye-View (BEV) perspective. However most of them overlook the complementary interaction and guidance between LiDAR and camera. In this work we propose a novel multi-modality 3D object detection method named GAFusion with LiDAR-guided global interaction and adaptive fusion. Specifically we introduce sparse depth guidance (SDG) and LiDAR occupancy guidance (LOG) to generate 3D features with sufficient depth information. In the following LiDAR-guided adaptive fusion transformer (LGAFT) is developed to adaptively enhance the interaction of different modal BEV features from a global perspective. Meanwhile additional downsampling with sparse height compression and multi-scale dual-path transformer (MSDPT) are designed to enlarge the receptive fields of different modal features. Finally a temporal fusion module is introduced to aggregate features from previous frames. GAFusion achieves state-of-the-art 3D object detection results with 73.6% mAP and 74.9% NDS on the nuScenes test set. + + + + Improving Graph Contrastive Learning via Adaptive Positive Sampling http://openaccess.thecvf.com//content/CVPR2024/papers/Zhuo_Improving_Graph_Contrastive_Learning_via_Adaptive_Positive_Sampling_CVPR_2024_paper.pdf Graph Contrastive Learning (GCL) a Self-Supervised Learning (SSL) architecture tailored for graphs has shown notable potential for mitigating label scarcity. 
Its core idea is to amplify feature similarities between the positive sample pairs and reduce them between the negative sample pairs. Unfortunately most existing GCLs consistently present suboptimal performances on both homophilic and heterophilic graphs. This is primarily attributed to two limitations of positive sampling that is incomplete local sampling and blind sampling. To address these limitations this paper introduces a novel GCL framework with an adaptive positive sampling module named grapH contrastivE Adaptive posiTive Samples (HEATS). Motivated by the observation that the affinity matrix corresponding to optimal positive sample sets has a block-diagonal structure with equal weights within each block a self-expressive learning objective incorporating the block and idempotent constraint is presented. This learning objective and the contrastive learning objective are iteratively optimized to improve the adaptability and robustness of HEATS. Extensive experiments on graphs and images validate the effectiveness and generality of HEATS. + + + + UFC-Net: Unrolling Fixed-point Continuous Network for Deep Compressive Sensing + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_UFC-Net_Unrolling_Fixed-point_Continuous_Network_for_Deep_Compressive_Sensing_CVPR_2024_paper.pdf + Deep unfolding networks (DUNs) renowned for their interpretability and superior performance have invigorated the realm of compressive sensing (CS). Nonetheless existing DUNs frequently suffer from issues related to insufficient feature extraction and feature attrition during the iterative steps. In this paper we propose Unrolling Fixed-point Continuous Network (UFC-Net) a novel deep CS framework motivated by the traditional fixed-point continuous optimization algorithm. Specifically we introduce Convolution-guided Attention Module (CAM) to serve as a critical constituent within the reconstruction phase encompassing tailored components such as Multi-head Attention Residual Block (MARB) Auxiliary Iterative Reconstruction Block (AIRB) etc. MARB effectively integrates multi-head attention mechanisms with convolution to reinforce feature extraction transcending the confinement of localized attributes and facilitating the apprehension of long-range correlations. Meanwhile AIRB introduces auxiliary variables significantly bolstering the preservation of features within each iterative stage. Extensive experiments demonstrate that our proposed UFC-Net achieves remarkable performance both on image CS and CS-magnetic resonance imaging (CS-MRI) in contrast to state-of-the-art methods. + + + + ECoDepth: Effective Conditioning of Diffusion Models for Monocular Depth Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Patni_ECoDepth_Effective_Conditioning_of_Diffusion_Models_for_Monocular_Depth_Estimation_CVPR_2024_paper.pdf + In the absence of parallax cues a learning-based single image depth estimation (SIDE) model relies heavily on shading and contextual cues in the image. While this simplicity is attractive it is necessary to train such models on large and varied datasets which are difficult to capture. It has been shown that using embeddings from pre-trained foundational models such as CLIP improves zero shot transfer in several applications. Taking inspiration from this in our paper we explore the use of global image priors generated from a pre-trained ViT model to provide more detailed contextual information. 
We argue that the embedding vector from a ViT model pre-trained on a large dataset captures greater relevant information for SIDE than the usual route of generating pseudo image captions followed by CLIP based text embeddings. Based on this idea we propose a new SIDE model using a diffusion backbone which is conditioned on ViT embeddings. Our proposed design establishes a new state-of-the-art (SOTA) for SIDE on NYUv2 dataset achieving Abs Rel error of 0.059(14% improvement) compared to 0.069 by the current SOTA (VPD). And on KITTI dataset achieving Sq Rel error of 0.139 (2% improvement) compared to 0.142 by the current SOTA (GEDepth). For zero-shot transfer with a model trained on NYUv2 we report mean relative improvement of (20% 23% 81% 25%) over NeWCRFs on (Sun-RGBD iBims1 DIODE HyperSim) datasets compared to (16% 18% 45% 9%) by ZoeDepth. The project page is available at https://ecodepth-iitd.github.io + + + + DL3DV-10K: A Large-Scale Scene Dataset for Deep Learning-based 3D Vision + http://openaccess.thecvf.com//content/CVPR2024/papers/Ling_DL3DV-10K_A_Large-Scale_Scene_Dataset_for_Deep_Learning-based_3D_Vision_CVPR_2024_paper.pdf + We have witnessed significant progress in deep learning-based 3D vision ranging from neural radiance field (NeRF) based 3D representation learning to applications in novel view synthesis (NVS). However existing scene-level datasets for deep learning-based 3D vision limited to either synthetic environments or a narrow selection of real-world scenes are quite insufficient. This insufficiency not only hinders a comprehensive benchmark of existing methods but also caps what could be explored in deep learning-based 3D analysis. To address this critical gap we present DL3DV-10K a large-scale scene dataset featuring 51.2 million frames from 10510 videos captured from 65 types of point-of-interest (POI) locations covering both bounded and unbounded scenes with different levels of reflection transparency and lighting. We conducted a comprehensive benchmark of recent NVS methods on DL3DV-10K which revealed valuable insights for future research in NVS. In addition we have obtained encouraging results in a pilot study to learn generalizable NeRF from DL3DV-10K which manifests the necessity of a large-scale scene-level dataset to forge a path toward a foundation model for learning 3D representation. Our DL3DV-10K dataset benchmark results and models will be publicly accessible. + + + + Bilateral Adaptation for Human-Object Interaction Detection with Occlusion-Robustness + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Bilateral_Adaptation_for_Human-Object_Interaction_Detection_with_Occlusion-Robustness_CVPR_2024_paper.pdf + Human-Object Interaction (HOI) Detection constitutes an important aspect of human-centric scene understanding which requires precise object detection and interaction recognition. Despite increasing advancement in detection recognizing subtle and intricate interactions remains challenging. Recent methods have endeavored to leverage the rich semantic representation from pre-trained CLIP yet fail to efficiently capture finer-grained spatial features that are highly informative for interaction discrimination. In this work instead of solely using representations from CLIP we fill the gap by proposing a spatial adapter that efficiently utilizes the multi-scale spatial information in the pre-trained detector. This leads to a bilateral adaptation that mutually produces complementary features. 
To further improve interaction recognition under occlusion which is common in crowded scenarios we propose an Occluded Part Extrapolation module that guides the model to recover the spatial details from manually occluded feature maps. Moreover we design a Conditional Contextual Mining module that further mines informative contextual clues from the spatial features via a tailored cross-attention mechanism. Extensive experiments on V-COCO and HICO-DET benchmarks demonstrate that our method significantly outperforms prior art on both standard and zero-shot settings resulting in new state-of-the-art performance. Additional ablation studies further validate the effectiveness of each component in our method. + + + + Projecting Trackable Thermal Patterns for Dynamic Computer Vision + http://openaccess.thecvf.com//content/CVPR2024/papers/Sheinin_Projecting_Trackable_Thermal_Patterns_for_Dynamic_Computer_Vision_CVPR_2024_paper.pdf + Adding artificial patterns to objects like QR codes can ease tasks such as object tracking robot navigation and conveying information (e.g. a label or a website link). However these patterns require a physical application and they alter the object's appearance. Conversely projected patterns can temporarily change the object's appearance aiding tasks like 3D scanning and retrieving object textures and shading. However projected patterns impede dynamic tasks like object tracking because they do not `stick' to the object's surface. Or do they? This paper introduces a novel approach combining the advantages of projected and persistent physical patterns. Our system projects heat patterns using a laser beam (similar in spirit to a LIDAR) which a thermal camera observes and tracks. Such thermal patterns enable tracking poorly-textured objects whose tracking is highly challenging with standard cameras while not affecting the object's appearance or physical properties. To avail these thermal patterns in existing vision frameworks we train a network to reverse heat diffusion's effects and remove inconsistent pattern points between different thermal frames. We prototyped and tested this approach on dynamic vision tasks like structure from motion optical flow and object tracking of everyday textureless objects. + + + + SG-PGM: Partial Graph Matching Network with Semantic Geometric Fusion for 3D Scene Graph Alignment and Its Downstream Tasks + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_SG-PGM_Partial_Graph_Matching_Network_with_Semantic_Geometric_Fusion_for_CVPR_2024_paper.pdf + Scene graphs have been recently introduced into 3D spatial understanding as a comprehensive representation of the scene. The alignment between 3D scene graphs is the first step of many downstream tasks such as scene graph aided point cloud registration mosaicking overlap checking and robot navigation. In this work we treat 3D scene graph alignment as a partial graph-matching problem and propose to solve it with a graph neural network. We reuse the geometric features learned by a point cloud registration method and associate the clustered point-level geometric features with the node-level semantic feature via our designed feature fusion module. Partial matching is enabled by using a learnable method to select the top-k similar node pairs. Subsequent downstream tasks such as point cloud registration are achieved by running a pre-trained registration network within the matched regions. 
We further propose a point-matching rescoring method that uses the node-wise alignment of the 3D scene graph to reweight the matching candidates from a pre-trained point cloud registration method. It reduces the false point correspondences, especially in low-overlapping cases. Experiments show that our method improves the alignment accuracy by 10-20% in low-overlap and random transformation scenarios and outperforms the existing work in multiple downstream tasks. Our code and models are available here (https://github.com/dfki-av/sg-pgm.git). + + + + Advancing Saliency Ranking with Human Fixations: Dataset Models and Benchmarks + http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_Advancing_Saliency_Ranking_with_Human_Fixations_Dataset_Models_and_Benchmarks_CVPR_2024_paper.pdf + Saliency ranking detection (SRD) has emerged as a challenging task in computer vision, aiming not only to identify salient objects within images but also to rank them based on their degree of saliency. Existing SRD datasets have been created primarily using mouse-trajectory data, which inadequately captures the intricacies of human visual perception. Addressing this gap, this paper introduces the first large-scale SRD dataset, SIFR, constructed using genuine human fixation data, thereby aligning more closely with real visual perceptual processes. To establish a baseline for this dataset, we propose QAGNet, a novel model that leverages salient instance query features from a transformer detector within a tri-tiered nested graph. Through extensive experiments, we demonstrate that our approach outperforms existing state-of-the-art methods across two widely used SRD datasets and our newly proposed dataset. Code and dataset are available at https://github.com/EricDengbowen/QAGNet. + + + + Unsupervised Deep Unrolling Networks for Phase Unwrapping + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Unsupervised_Deep_Unrolling_Networks_for_Phase_Unwrapping_CVPR_2024_paper.pdf + Phase unwrapping (PU) is a technique to reconstruct original phase images from their noisy wrapped counterparts, finding many applications in scientific imaging. Although supervised learning has shown promise in PU, its utility is limited in ground-truth (GT) scarce scenarios. This paper presents an unsupervised learning approach that eliminates the need for GTs during end-to-end training. Our approach leverages the insight that both the gradients and wrapped gradients of wrapped phases serve as noisy labels for GT phase gradients, along with sparse outliers induced by the wrapping operation. A recorruption-based self-reconstruction loss in the gradient domain is proposed to mitigate the adverse effects of label noise, complemented with a self-distillation loss for improved generalization. Additionally, by unfolding a variational model of PU that utilizes wrapped gradients of wrapped phases for its data-fitting term, we develop a deep unrolling network that encodes the physics of phase wrapping and incorporates special treatments on outliers. In experiments on three types of phase data, our approach outperforms existing GT-free methods and competes well against the supervised ones. + + + + Federated Generalized Category Discovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Pu_Federated_Generalized_Category_Discovery_CVPR_2024_paper.pdf + Generalized category discovery (GCD) aims at grouping unlabeled samples from known and unknown classes, given labeled data of known classes.
To meet the recent decentralization trend in the community we introduce a practical yet challenging task Federated GCD (Fed-GCD) where the training data are distributed in local clients and cannot be shared among clients. Fed-GCD aims to train a generic GCD model by client collaboration under the privacy-protected constraint. The Fed-GCD leads to two challenges: 1) representation degradation caused by training each client model with fewer data than centralized GCD learning and 2) highly heterogeneous label spaces across different clients. To this end we propose a novel Associated Gaussian Contrastive Learning (AGCL) framework based on learnable GMMs which consists of a Client Semantics Association (CSA) and a global-local GMM Contrastive Learning (GCL). On the server CSA aggregates the heterogeneous categories of local-client GMMs to generate a global GMM containing more comprehensive category knowledge. On each client GCL builds class-level contrastive learning with both local and global GMMs. The local GCL learns robust representation with limited local data. The global GCL encourages the model to produce more discriminative representation with the comprehensive category relationships that may not exist in local data. We build a benchmark based on six visual datasets to facilitate the study of Fed-GCD. Extensive experiments show that our AGCL outperforms multiple baselines on all datasets. + + + + Edge-Aware 3D Instance Segmentation Network with Intelligent Semantic Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Roh_Edge-Aware_3D_Instance_Segmentation_Network_with_Intelligent_Semantic_Prior_CVPR_2024_paper.pdf + While recent 3D instance segmentation approaches show promising results based on transformer architectures they often fail to correctly identify instances with similar appearances. They also ambiguously determine edges leading to multiple misclassifications of adjacent edge points. In this work we introduce a novel framework called EASE to overcome these challenges and improve the perception of complex 3D instances. We first propose a semantic guidance network to leverage rich semantic knowledge from a language model as intelligent priors enhancing the functional understanding of real-world instances beyond relying solely on geometrical information. We explicitly instruct the basic instance queries using text embeddings of each instance to learn deep semantic details. Further we utilize the edge prediction module encouraging the segmentation network to be edge-aware. We extract voxel-wise edge maps from point features and use them as auxiliary information for learning edge cues. In our extensive experiments on large-scale benchmarks ScanNetV2 ScanNet200 S3DIS and STPLS3D our EASE outperforms existing state-of-the-art models demonstrating its superior performance. + + + + Coherence As Texture - Passive Textureless 3D Reconstruction by Self-interference + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Coherence_As_Texture_-_Passive_Textureless_3D_Reconstruction_by_Self-interference_CVPR_2024_paper.pdf + Passive depth estimation based on stereo or defocus relies on the presence of the texture on an object to resolve its depth. Hence recovering the depth of a textureless object-- for example a large white wall--is not just hard but perhaps even impossible. Or is it? We show that spatial coherence a property of natural light sources can be used to resolve the depth of a scene point even when it is textureless. 
Our approach relies on the idea that natural light scattered off a scene point is locally coherent with itself while incoherent with the light scattered from other surface points; we use this insight to design an optical setup that uses self-interference as a texture feature for estimating depth. Our lab prototype is capable of resolving the depths of textureless objects in sunlight as well as under indoor lights. + + + + Generative Multi-modal Models are Good Class Incremental Learners + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Generative_Multi-modal_Models_are_Good_Class_Incremental_Learners_CVPR_2024_paper.pdf + In class incremental learning (CIL) scenarios, the phenomenon of catastrophic forgetting caused by the classifier's bias towards the current task has long posed a significant challenge. It is mainly caused by the characteristics of discriminative models. With the growing popularity of generative multi-modal models, we explore replacing discriminative models with generative ones for CIL. However, transitioning from discriminative to generative models requires addressing two key challenges. The primary challenge lies in transferring the generated textual information into the classification of distinct categories. Additionally, it requires formulating the task of CIL within a generative framework. To this end, we propose a novel generative multi-modal model (GMM) framework for class incremental learning. Our approach directly generates labels for images using an adapted generative model. After obtaining the detailed text, we use a text encoder to extract text features and employ feature matching to determine the most similar label as the classification prediction. In the conventional CIL settings, we achieve significantly better results in long-sequence task scenarios. Under the Few-shot CIL setting, we improve by at least 14% over the current state-of-the-art methods, with significantly less forgetting. + + + + Low-Resource Vision Challenges for Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Low-Resource_Vision_Challenges_for_Foundation_Models_CVPR_2024_paper.pdf + Low-resource settings are well-established in natural language processing, where many languages lack sufficient data for deep learning at scale. However, low-resource problems are under-explored in computer vision. In this paper, we address this gap and explore the challenges of low-resource image tasks with vision foundation models. We first collect a benchmark of genuinely low-resource image data, covering historic maps, circuit diagrams, and mechanical drawings. These low-resource settings all share three challenges: data scarcity, fine-grained differences, and the distribution shift from natural images to the specialized domain of interest. While existing foundation models have shown impressive generalizability, we find they cannot transfer well to our low-resource tasks. To begin to tackle the challenges of low-resource vision, we introduce one simple baseline per challenge. Specifically, we i) enlarge the data space by generative models, ii) adopt the best sub-kernels to encode local regions for fine-grained difference discovery, and iii) learn attention for specialized domains. Experiments on our three low-resource tasks demonstrate that our proposals already provide a better baseline than transfer learning, data augmentation, and fine-grained methods.
This highlights the unique characteristics and challenges of low-resource vision for foundation models that warrant further investigation. Project page: https://xiaobai1217.github.io/Low-Resource-Vision/. + + + + RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_RGBD_Objects_in_the_Wild_Scaling_Real-World_3D_Object_Learning_CVPR_2024_paper.pdf + We introduce a new RGB-D object dataset captured in the wild, called WildRGB-D. Unlike most existing real-world object-centric datasets, which only come with RGB capturing, the direct capture of the depth channel allows better 3D annotations and broader downstream applications. WildRGB-D comprises large-scale category-level RGB-D object videos, which are taken using an iPhone to go around the objects in 360 degrees. It contains around 8500 recorded objects and nearly 20000 RGB-D videos across 46 common object categories. These videos are taken with diverse cluttered backgrounds, with three setups to cover as many real-world scenarios as possible: (i) a single object in one video; (ii) multiple objects in one video; and (iii) an object with a static hand in one video. The dataset is annotated with object masks, real-world scale camera poses, and reconstructed aggregated point clouds from RGB-D videos. We benchmark four tasks with WildRGB-D, including novel view synthesis, camera pose estimation, object 6D pose estimation, and object surface reconstruction. Our experiments show that the large-scale capture of RGB-D objects provides great potential to advance 3D object learning. Our project page is https://wildrgbd.github.io/. + + + + Low-Res Leads the Way: Improving Generalization for Super-Resolution by Self-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Low-Res_Leads_the_Way_Improving_Generalization_for_Super-Resolution_by_Self-Supervised_CVPR_2024_paper.pdf + For image super-resolution (SR), bridging the gap between the performance on synthetic datasets and real-world degradation scenarios remains a challenge. This work introduces a novel "Low-Res Leads the Way" (LWay) training framework, merging Supervised Pre-training with Self-supervised Learning to enhance the adaptability of SR models to real-world images. Our approach utilizes a low-resolution (LR) reconstruction network to extract degradation embeddings from LR images, merging them with super-resolved outputs for LR reconstruction. Leveraging unseen LR images for self-supervised learning guides the model to adapt its modeling space to the target domain, facilitating fine-tuning of SR models without requiring paired high-resolution (HR) images. The integration of the Discrete Wavelet Transform (DWT) further refines the focus on high-frequency details. Extensive evaluations show that our method significantly improves the generalization and detail restoration capabilities of SR models on unseen real-world datasets, outperforming existing methods. Our training regime is universally compatible, requiring no network architecture modifications, making it a practical solution for real-world SR applications. + + + + Learning Spatial Features from Audio-Visual Correspondence in Egocentric Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Majumder_Learning_Spatial_Features_from_Audio-Visual_Correspondence_in_Egocentric_Videos_CVPR_2024_paper.pdf + We propose a self-supervised method for learning representations based on spatial audio-visual correspondences in egocentric videos.
Our method uses a masked auto-encoding framework to synthesize masked binaural audio through the synergy of audio and vision, thereby learning useful spatial relationships between the two modalities. We use our pretrained features to tackle two downstream video tasks requiring spatial understanding in social scenarios: active speaker detection and spatial audio denoising. Through extensive experiments, we show that our features are generic enough to improve over multiple state-of-the-art baselines on both tasks on two challenging egocentric video datasets that offer binaural audio, EgoCom and EasyCom. Project: http://vision.cs.utexas.edu/projects/ego_av_corr. + + + + Brain Decodes Deep Nets + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Brain_Decodes_Deep_Nets_CVPR_2024_paper.pdf + We developed a tool for visualizing and analyzing large pre-trained vision models by mapping them onto the brain, thus exposing what is hidden inside them. Our innovation arises from a surprising usage of brain encoding: predicting brain fMRI measurements in response to images. We report two findings. First, explicit mapping between the brain and deep-network features across the dimensions of space, layers, scales, and channels is crucial. This mapping method, FactorTopy, is plug-and-play for any deep network; with it, one can paint a picture of the network onto the brain (literally!). Second, our visualization shows how different training methods matter: they lead to remarkable differences in hierarchical organization and scaling behavior, growing with more data or network capacity. It also provides insight into fine-tuning: how pre-trained models change when adapting to small datasets. We found that brain-like, hierarchically organized networks suffer less from catastrophic forgetting after fine-tuning. + + + + Semantics Distortion and Style Matter: Towards Source-free UDA for Panoramic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Semantics_Distortion_and_Style_Matter_Towards_Source-free_UDA_for_Panoramic_CVPR_2024_paper.pdf + This paper addresses an interesting yet challenging problem--source-free unsupervised domain adaptation (SFUDA) for pinhole-to-panoramic semantic segmentation--given only a pinhole image-trained model (i.e. source) and unlabeled panoramic images (i.e. target). Tackling this problem is nontrivial due to the semantic mismatches, style discrepancies, and inevitable distortion of panoramic images. To this end, we propose a novel method that utilizes Tangent Projection (TP), as it has less distortion, and meanwhile splits the equirectangular projection (ERP) with a fixed FoV to mimic the pinhole images. Both projections are shown effective in extracting knowledge from the source model. However, the distinct projection discrepancies between source and target domains impede the direct knowledge transfer; thus, we propose a panoramic prototype adaptation module (PPAM) to integrate panoramic prototypes from the extracted knowledge for adaptation. We then impose the loss constraints on both predictions and prototypes, and propose a cross-dual attention module (CDAM) at the feature level to better align the spatial and channel characteristics across the domains and projections. Both knowledge extraction and transfer processes are synchronously updated to reach the best performance.
Extensive experiments on the synthetic and real-world benchmarks including outdoor and indoor scenarios demonstrate that our method achieves significantly better performance than prior SFUDA methods for pinhole-to-panoramic adaptation. + + + + GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_GOV-NeSF_Generalizable_Open-Vocabulary_Neural_Semantic_Fields_CVPR_2024_paper.pdf + Recent advancements in vision-language foundation models have significantly enhanced open-vocabulary 3D scene understanding. However the generalizability of existing methods is constrained due to their framework designs and their reliance on 3D data. We address this limitation by introducing Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF) a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics. We aggregate the geometry-aware features using a cost volume and propose a Multi-view Joint Fusion module to aggregate multi-view features through a cross-view attention mechanism which effectively predicts view-specific blending weights for both colors and open-vocabulary features. Remarkably our GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation eliminating the need for ground truth semantic labels or depth priors and effectively generalize across scenes and datasets without fine-tuning. + + + + Dual-Scale Transformer for Large-Scale Single-Pixel Imaging + http://openaccess.thecvf.com//content/CVPR2024/papers/Qu_Dual-Scale_Transformer_for_Large-Scale_Single-Pixel_Imaging_CVPR_2024_paper.pdf + Single-pixel imaging (SPI) is a potential computational imaging technique which produces image by solving an ill-posed reconstruction problem from few measurements captured by a single-pixel detector. Deep learning has achieved impressive success on SPI reconstruction. However previous poor reconstruction performance and impractical imaging model limit its real-world applications. In this paper we propose a deep unfolding network with hybrid-attention Transformer on Kronecker SPI model dubbed HATNet to improve the imaging quality of real SPI cameras. Specifically we unfold the computation graph of the iterative shrinkage-thresholding algorithm (ISTA) into two alternative modules: efficient tensor gradient descent and hybrid-attention multi-scale denoising. By virtue of Kronecker SPI the gradient descent module can avoid high computational overheads rooted in previous gradient descent modules based on vectorized SPI. The denoising module is an encoder-decoder architecture powered by dual-scale spatial attention for high- and low-frequency aggregation and channel attention for global information recalibration. Moreover we build a SPI prototype to verify the effectiveness of the proposed method. Extensive experiments on synthetic and real data demonstrate that our method achieves the state-of-the-art performance. The source code and pre-trained models are available at https://github.com/Gang-Qu/HATNet-SPI. + + + + Bridging Remote Sensors with Multisensor Geospatial Foundation Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_Bridging_Remote_Sensors_with_Multisensor_Geospatial_Foundation_Models_CVPR_2024_paper.pdf + In the realm of geospatial analysis the diversity of remote sensors encompassing both optical and microwave technologies offers a wealth of distinct observational capabilities. 
Recognizing this, we present msGFM, a multisensor geospatial foundation model that effectively unifies data from four key sensor modalities. This integration spans an expansive dataset of two million multisensor images. msGFM is uniquely adept at handling both paired and unpaired sensor data. For data originating from identical geolocations, our model employs an innovative cross-sensor pretraining approach in masked image modeling, enabling the synthesis of joint representations from diverse sensors. msGFM, incorporating four remote sensors, upholds strong performance, forming a comprehensive model adaptable to various sensor types. msGFM has demonstrated enhanced proficiency in a range of both single-sensor and multisensor downstream tasks. These include scene classification, segmentation, cloud removal, and pan-sharpening. A key discovery of our research is that representations derived from natural images are not always compatible with the distinct characteristics of geospatial remote sensors, underscoring the limitations of existing representations in this field. Our work can serve as a guide for developing multisensor geospatial pretraining models, paving the way for more advanced geospatial capabilities. Code can be found at https://github.com/boranhan/Geospatial_Foundation_Models + + + + SeeSR: Towards Semantics-Aware Real-World Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_SeeSR_Towards_Semantics-Aware_Real-World_Image_Super-Resolution_CVPR_2024_paper.pdf + Owing to their powerful generative priors, pre-trained text-to-image (T2I) diffusion models have become increasingly popular in solving the real-world image super-resolution problem. However, as a consequence of the heavy quality degradation of input low-resolution (LR) images, the destruction of local structures can lead to ambiguous image semantics. As a result, the content of the reproduced high-resolution image may have semantic errors, deteriorating the super-resolution performance. To address this issue, we present a semantics-aware approach to better preserve the semantic fidelity of generative real-world image super-resolution. First, we train a degradation-aware prompt extractor, which can generate accurate soft and hard semantic prompts even under strong degradation. The hard semantic prompts refer to the image tags, aiming to enhance the local perception ability of the T2I model, while the soft semantic prompts compensate for the hard ones to provide additional representation information. These semantic prompts encourage the T2I model to generate detailed and semantically accurate results. Furthermore, during the inference process, we integrate the LR images into the initial sampling noise to mitigate the diffusion model's tendency to generate excessive random details. The experiments show that our method can reproduce more realistic image details and better preserve the semantics. The source code of our method can be found at https://github.com/cswry/SeeSR + + + + DrivingGaussian: Composite Gaussian Splatting for Surrounding Dynamic Autonomous Driving Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_DrivingGaussian_Composite_Gaussian_Splatting_for_Surrounding_Dynamic_Autonomous_Driving_Scenes_CVPR_2024_paper.pdf + We present DrivingGaussian, an efficient and effective framework for surrounding dynamic autonomous driving scenes. For complex scenes with moving objects, we first sequentially and progressively model the static background of the entire scene with incremental static 3D Gaussians.
We then leverage a composite dynamic Gaussian graph to handle multiple moving objects individually reconstructing each object and restoring their accurate positions and occlusion relationships within the scene. We further use a LiDAR prior for Gaussian Splatting to reconstruct scenes with greater details and maintain panoramic consistency. DrivingGaussian outperforms existing methods in dynamic driving scene reconstruction and enables photorealistic surround-view synthesis with high-fidelity and multi-camera consistency. Our project page is at: https://github.com/VDIGPKU/DrivingGaussian. + + + + Unsupervised Keypoints from Pretrained Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Hedlin_Unsupervised_Keypoints_from_Pretrained_Diffusion_Models_CVPR_2024_paper.pdf + Unsupervised learning of keypoints and landmarks has seen significant progress with the help of modern neural network architectures but performance is yet to match the supervised counterpart making their practicability questionable. We leverage the emergent knowledge within text-to-image diffusion models towards more robust unsupervised keypoints. Our core idea is to find text embeddings that would cause the generative model to consistently attend to compact regions in images (i.e. keypoints). To do so we simply optimize the text embedding such that the cross-attention maps within the denoising network are localized as Gaussians with small standard deviations. We validate our performance on multiple datasets: the CelebA CUB-200-2011 Tai-Chi-HD DeepFashion and Human3.6m datasets. We achieve significantly improved accuracy sometimes even outperforming supervised ones particularly for data that is non-aligned and less curated. Our code is publicly available at https://stablekeypoints.github.io/. + + + + Resolution Limit of Single-Photon LiDAR + http://openaccess.thecvf.com//content/CVPR2024/papers/Chan_Resolution_Limit_of_Single-Photon_LiDAR_CVPR_2024_paper.pdf + Single-photon Light Detection and Ranging (LiDAR) systems are often equipped with an array of detectors for improved spatial resolution and sensing speed. However given a fixed amount of flux produced by the laser transmitter across the scene the per-pixel Signal-to-Noise Ratio (SNR) will decrease when more pixels are packed in a unit space. This presents a fundamental trade-off between the spatial resolution of the sensor array and the SNR received at each pixel. Theoretical characterization of this fundamental limit is explored. By deriving the photon arrival statistics and introducing a series of new approximation techniques the Mean Squared Error (MSE) of the maximum-likelihood estimator of the time delay is derived. The theoretical predictions align well with simulations and real data. + + + + Flatten Long-Range Loss Landscapes for Cross-Domain Few-Shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zou_Flatten_Long-Range_Loss_Landscapes_for_Cross-Domain_Few-Shot_Learning_CVPR_2024_paper.pdf + Cross-domain few-shot learning (CDFSL) aims to acquire knowledge from limited training data in the target domain by leveraging prior knowledge transferred from source domains with abundant training samples. CDFSL faces challenges in transferring knowledge across dissimilar domains and fine-tuning models with limited training data. 
To address these challenges we initially extend the analysis of loss landscapes from the parameter space to the representation space which allows us to simultaneously interpret the transferring and fine-tuning difficulties of CDFSL models. We observe that sharp minima in the loss landscapes of the representation space result in representations that are hard to transfer and fine-tune. Moreover existing flatness-based methods have limited generalization ability due to their short-range flatness. To enhance the transferability and facilitate fine-tuning we introduce a simple yet effective approach to achieve long-range flattening of the minima in the loss landscape. This approach considers representations that are differently normalized as minima in the loss landscape and flattens the high-loss region in the middle by randomly sampling interpolated representations. We implement this method as a new normalization layer that replaces the original one in both CNNs and ViTs. This layer is simple and lightweight introducing only a minimal number of additional parameters. Experimental results on 8 datasets demonstrate that our approach outperforms state-of-the-art methods in terms of average accuracy. Moreover our method achieves performance improvements of up to 9% compared to the current best approaches on individual datasets. Our code will be released. + + + + Diffusion-based Blind Text Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Diffusion-based_Blind_Text_Image_Super-Resolution_CVPR_2024_paper.pdf + Recovering degraded low-resolution text images is challenging especially for Chinese text images with complex strokes and severe degradation in real-world scenarios. Ensuring both text fidelity and style realness is crucial for high-quality text image super-resolution. Recently diffusion models have achieved great success in natural image synthesis and restoration due to their powerful data distribution modeling abilities and data generation capabilities. In this work we propose an Image Diffusion Model (IDM) to restore text images with realistic styles. For diffusion models they are not only suitable for modeling realistic image distribution but also appropriate for learning text distribution. Since text prior is important to guarantee the correctness of the restored text structure according to existing arts we also propose a Text Diffusion Model (TDM) for text recognition which can guide IDM to generate text images with correct structures. We further propose a Mixture of Multi-modality module (MoM) to make these two diffusion models cooperate with each other in all the diffusion steps. Extensive experiments on synthetic and real-world datasets demonstrate that our Diffusion-based Blind Text Image Super-Resolution (DiffTSR) can restore text images with more accurate text structures as well as more realistic appearances simultaneously. Code is available at https://github.com/YuzheZhang-1999/DiffTSR. + + + + Consistent Prompting for Rehearsal-Free Continual Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Consistent_Prompting_for_Rehearsal-Free_Continual_Learning_CVPR_2024_paper.pdf + Continual learning empowers models to adapt autonomously to the ever-changing environment or data streams without forgetting old knowledge. Prompt-based approaches are built on frozen pre-trained models to learn the task-specific prompts and classifiers efficiently. 
Existing prompt based methods are inconsistent between training and testing limiting their effectiveness. Two types of inconsistency are revealed. Test predictions are made from all classifiers while training only focuses on the current task classifier without holistic alignment leading to Classifier inconsistency. Prompt inconsistency indicates that the prompt selected during testing may not correspond to the one associated with this task during training. In this paper we propose a novel prompt-based method Consistent Prompting (CPrompt) for more aligned training and testing. Specifically all existing classifiers are exposed to prompt training resulting in classifier consistency learning. In addition prompt consistency learning is proposed to enhance prediction robustness and boost prompt selection accuracy. Our Consistent Prompting surpasses its prompt-based counterparts and achieves state-of-the-art performance on multiple continual learning benchmarks. Detailed analysis shows that improvements come from more consistent training and testing. + + + + SeD: Semantic-Aware Discriminator for Image Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_SeD_Semantic-Aware_Discriminator_for_Image_Super-Resolution_CVPR_2024_paper.pdf + Generative Adversarial Networks (GANs) have been widely used to recover vivid textures in image super-resolution (SR) tasks. In particular one discriminator is utilized to enable the SR network to learn the distribution of real-world high-quality images in an adversarial training manner. However the distribution learning is overly coarse-grained which is susceptible to virtual textures and causes counter-intuitive generation results. To mitigate this we propose the simple and effective Semantic-aware Discriminator (denoted as SeD) which encourages the SR network to learn the fine-grained distributions by introducing the semantics of images as a condition. Concretely we aim to excavate the semantics of images from a well-trained semantic extractor. Under different semantics the discriminator is able to distinguish the real-fake images individually and adaptively which guides the SR network to learn the more fine-grained semantic-aware textures. To obtain accurate and abundant semantics we take full advantage of recently popular pretrained vision models (PVMs) with extensive datasets and then incorporate its semantic features into the discriminator through a well-designed spatial cross-attention module. In this way our proposed semantic-aware discriminator empowered the SR network to produce more photo-realistic and pleasing images. Extensive experiments on two typical tasks i.e. SR and Real SR have demonstrated the effectiveness of our proposed methods. The code will be available at https://github.com/lbc12345/SeD. + + + + ReCoRe: Regularized Contrastive Representation Learning of World Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Poudel_ReCoRe_Regularized_Contrastive_Representation_Learning_of_World_Model_CVPR_2024_paper.pdf + While recent model-free Reinforcement Learning (RL) methods have demonstrated human-level effectiveness in gaming environments their success in everyday tasks like visual navigation has been limited particularly under significant appearance variations. This limitation arises from (i) poor sample efficiency and (ii) over-fitting to training scenarios. 
To address these challenges we present a world model that learns invariant features using (i) contrastive unsupervised learning and (ii) an intervention-invariant regularizer. Learning an explicit representation of the world dynamics i.e. a world model improves sample efficiency while contrastive learning implicitly enforces learning of invariant features which improves generalization. However the naive integration of contrastive loss to world models is not good enough as world-model-based RL methods independently optimize representation learning and agent policy. To overcome this issue we propose an intervention-invariant regularizer in the form of an auxiliary task such as depth prediction image denoising image segmentation etc. that explicitly enforces invariance to style interventions. Our method outperforms current state-of-the-art model-based and model-free RL methods and significantly improves on out-of-distribution point navigation tasks evaluated on the iGibson benchmark. With only visual observations we further demonstrate that our approach outperforms recent language-guided foundation models for point navigation which is essential for deployment on robots with limited computation capabilities. Finally we demonstrate that our proposed model excels at the sim-to-real transfer of its perception module on the Gibson benchmark. + + + + JRDB-PanoTrack: An Open-world Panoptic Segmentation and Tracking Robotic Dataset in Crowded Human Environments + http://openaccess.thecvf.com//content/CVPR2024/papers/Le_JRDB-PanoTrack_An_Open-world_Panoptic_Segmentation_and_Tracking_Robotic_Dataset_in_CVPR_2024_paper.pdf + Autonomous robot systems have attracted increasing research attention in recent years where environment understanding is a crucial step for robot navigation human-robot interaction and decision. Real-world robot systems usually collect visual data from multiple sensors and are required to recognize numerous objects and their movements in complex human-crowded settings. Traditional benchmarks with their reliance on single sensors and limited object classes and scenarios fail to provide the comprehensive environmental understanding robots need for accurate navigation interaction and decision-making. As an extension of JRDB dataset we unveil JRDB-PanoTrack a novel open-world panoptic segmentation and tracking benchmark towards more comprehensive environmental perception. JRDB-PanoTrack includes (1) various data involving indoor and outdoor crowded scenes as well as comprehensive 2D and 3D synchronized data modalities; (2) high-quality 2D spatial panoptic segmentation and temporal tracking annotations with additional 3D label projections for further spatial understanding; (3) diverse object classes for closed- and open-world recognition benchmarks with OSPA-based metrics for evaluation. Extensive evaluation of leading methods shows significant challenges posed by our dataset. + + + + Embracing Unimodal Aleatoric Uncertainty for Robust Multimodal Fusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Gao_Embracing_Unimodal_Aleatoric_Uncertainty_for_Robust_Multimodal_Fusion_CVPR_2024_paper.pdf + As a fundamental problem in multimodal learning multimodal fusion aims to compensate for the inherent limitations of a single modality. One challenge of multimodal fusion is that the unimodal data in their unique embedding space mostly contains potential noise which leads to corrupted cross-modal interactions. 
However in this paper we show that the potential noise in unimodal data could be well quantified and further employed to enhance more stable unimodal embeddings via contrastive learning. Specifically we propose a novel generic and robust multimodal fusion strategy termed Embracing Aleatoric Uncertainty (EAU) which is simple and can be applied to kinds of modalities. It consists of two key steps: (1) the Stable Unimodal Feature Augmentation (SUFA) that learns a stable unimodal representation by incorporating the aleatoric uncertainty into self-supervised contrastive learning. (2) Robust Multimodal Feature Integration (RMFI) leveraging an information-theoretic strategy to learn a robust compact joint representation. We evaluate our proposed EAU method on five multimodal datasets where the video RGB image text audio and depth image are involved. Extensive experiments demonstrate the EAU method is more noise-resistant than existing multimodal fusion strategies and establishes new state-of-the-art on several benchmarks. + + + + Unifying Correspondence Pose and NeRF for Generalized Pose-Free Novel View Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Hong_Unifying_Correspondence_Pose_and_NeRF_for_Generalized_Pose-Free_Novel_View_CVPR_2024_paper.pdf + This work delves into the task of pose-free novel view synthesis from stereo pairs a challenging and pioneering task in 3D vision. Our innovative framework unlike any before seamlessly integrates 2D correspondence matching camera pose estimation and NeRF rendering fostering a synergistic enhancement of these tasks. We achieve this through designing an architecture that utilizes a shared representation which serves as a foundation for enhanced 3D geometry understanding. Capitalizing on the inherent interplay between the tasks our unified framework is trained end-to-end with the proposed training strategy to improve overall model accuracy. Through extensive evaluations across diverse indoor and outdoor scenes from two real-world datasets we demonstrate that our approach achieves substantial improvement over previous methodologies especially in scenarios characterized by extreme viewpoint changes and the absence of accurate camera poses. + + + + Draw Step by Step: Reconstructing CAD Construction Sequences from Point Clouds via Multimodal Diffusion. + http://openaccess.thecvf.com//content/CVPR2024/papers/Ma_Draw_Step_by_Step_Reconstructing_CAD_Construction_Sequences_from_Point_CVPR_2024_paper.pdf + Reconstructing CAD construction sequences from raw 3D geometry serves as an interface between real-world objects and digital designs. In this paper we propose CAD-Diffuser a multimodal diffusion scheme aiming at integrating top-down design paradigm into generative reconstruction. In particular we unify CAD point clouds and CAD construction sequences at the token level guiding our proposed multimodal diffusion strategy to understand and link between the geometry and the design intent concentrated in construction sequences. Leveraging the strong decoding abilities of language models the forward process is modeled as a random walk between the original token and the [MASK] token while the reverse process naturally fits the masked token modeling scheme. A volume-based noise schedule is designed to encourage outline-first generation decomposing the top-down design methodology into a machine-understandable procedure. 
For tokenizing CAD data of multiple modalities, we introduce a tokenizer with a self-supervised face segmentation task to compress local and global geometric information for CAD point clouds, while the CAD construction sequence is transformed into a primitive token string. Experimental results show that our CAD-Diffuser can perceive geometric details, and the results are more likely to be reused by human designers. + + + + Discriminative Pattern Calibration Mechanism for Source-Free Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_Discriminative_Pattern_Calibration_Mechanism_for_Source-Free_Domain_Adaptation_CVPR_2024_paper.pdf + Source-free domain adaptation (SFDA) assumes that model adaptation only accesses the well-learned source model and unlabeled target instances for knowledge transfer. However, cross-domain distribution shift easily triggers invalid discriminative semantics from the source model when recognizing the target samples. Hence, understanding the specific content of discriminative patterns and adjusting their representation in the target domain becomes the key to overcoming SFDA. To achieve such a vision, this paper proposes a novel explanation paradigm, the "Discriminative Pattern Calibration (DPC)" mechanism, for solving the SFDA issue. Concretely, DPC first utilizes a learning network to infer the discriminative regions on the target images and specifically emphasizes them in the feature space to enhance their representation. Moreover, DPC relies on an attention-reversed mixup mechanism to augment more samples and improve the robustness of the classifier. Considerable experimental results and studies demonstrate the effectiveness of our DPC in enhancing the performance of existing SFDA baselines. + + + + Deep Generative Model based Rate-Distortion for Image Downscaling Assessment + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_Deep_Generative_Model_based_Rate-Distortion_for_Image_Downscaling_Assessment_CVPR_2024_paper.pdf + In this paper, we propose Image Downscaling Assessment by Rate-Distortion (IDA-RD), a novel measure to quantitatively evaluate image downscaling algorithms. In contrast to image-based methods that measure the quality of downscaled images, ours is process-based, drawing ideas from rate-distortion theory to measure the distortion incurred during downscaling. Our main idea is that downscaling and super-resolution (SR) can be viewed as the encoding and decoding processes in the rate-distortion model, respectively, and that a downscaling algorithm that preserves more details in the resulting low-resolution (LR) images should lead to less distorted high-resolution (HR) images in SR. In other words, the distortion should increase as the downscaling algorithm deteriorates. However, it is non-trivial to measure this distortion, as it requires the SR algorithm to be blind and stochastic. Our key insight is that such requirements can be met by recent SR algorithms based on deep generative models, which can find all matching HR images for a given LR image on their learned image manifolds. Extensive experimental results show the effectiveness of our IDA-RD measure.
+ + + + EFHQ: Multi-purpose ExtremePose-Face-HQ dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Dao_EFHQ_Multi-purpose_ExtremePose-Face-HQ_dataset_CVPR_2024_paper.pdf + The existing facial datasets while having plentiful images at near frontal views lack images with extreme head poses leading to the downgraded performance of deep learning models when dealing with profile or pitched faces. This work aims to address this gap by introducing a novel dataset named Extreme Pose Face High-Quality Dataset (EFHQ) which includes a maximum of 450k high-quality images of faces at extreme poses. To produce such a massive dataset we utilize a novel and meticulous dataset processing pipeline to curate two publicly available datasets VFHQ and CelebV-HQ which contain many high-resolution face videos captured in various settings. Our dataset can complement existing datasets on various facial-related tasks such as facial synthesis with 2D/3D-aware GAN diffusion-based text-to-image face generation and face reenactment. Specifically training with EFHQ helps models generalize well across diverse poses significantly improving performance in scenarios involving extreme views confirmed by extensive experiments. Additionally we utilize EFHQ to define a challenging cross-view face verification benchmark in which the performance of SOTA face recognition models drops 5-37% compared to frontal-to-frontal scenarios aiming to stimulate studies on face recognition under severe pose conditions in the wild. + + + + Dynamic Cues-Assisted Transformer for Robust Point Cloud Registration + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Dynamic_Cues-Assisted_Transformer_for_Robust_Point_Cloud_Registration_CVPR_2024_paper.pdf + Point Cloud Registration is a critical and challenging task in computer vision. Recent advancements have predominantly embraced a coarse-to-fine matching mechanism with the key to matching the superpoints located in patches with inter-frame consistent structures. However previous methods still face challenges with ambiguous matching because the interference information aggregated from irrelevant regions may disturb the capture of inter-frame consistency relations leading to wrong matches. To address this issue we propose Dynamic Cues-Assisted Transformer (DCATr). Firstly the interference from irrelevant regions is greatly reduced by constraining attention to certain cues i.e. regions with highly correlated structures of potential corresponding superpoints. Secondly cues-assisted attention is designed to mine the inter-frame consistency relations while more attention is assigned to pairs with high consistent confidence in feature aggregation. Finally a dynamic updating fashion is proposed to facilitate mining richer consistency information further improving aggregated features' distinctiveness and relieving matching ambiguity. Extensive evaluations on indoor and outdoor standard benchmarks demonstrate that DCATr outperforms all state-of-the-art methods. + + + + Patch2Self2: Self-supervised Denoising on Coresets via Matrix Sketching + http://openaccess.thecvf.com//content/CVPR2024/papers/Fadnavis_Patch2Self2_Self-supervised_Denoising_on_Coresets_via_Matrix_Sketching_CVPR_2024_paper.pdf + Diffusion MRI (dMRI) non-invasively maps brain white matter yet necessitates denoising due to low signal-to-noise ratios. 
Patch2Self (P2S), employing self-supervised techniques and regression on a Casorati matrix, effectively denoises dMRI images and has become the new de-facto standard in this field. P2S, however, is resource intensive, both in terms of running time and memory usage, as it uses all voxels (n) from the all-but-one held-in volumes (d-1) to learn a linear mapping \Phi : \mathbb{R}^{n \times (d-1)} \mapsto \mathbb{R}^{n} for denoising the held-out volume. The increasing size and dimensionality of higher-resolution dMRI acquisitions can make P2S infeasible for large-scale analyses. This work exploits the redundancy imposed by P2S to alleviate its performance issues and inspect regions that influence the noise disproportionately. Specifically, this study makes a three-fold contribution: (1) We present Patch2Self2 (P2S2), a method that uses matrix sketching to perform self-supervised denoising. By solving a sub-problem on a smaller sub-space, a so-called coreset, we show how P2S2 can yield a significant speedup in training time while using less memory. (2) We present a theoretical analysis of P2S2, focusing on determining the optimal sketch size through rank estimation, a key step in achieving a balance between denoising accuracy and computational efficiency. (3) We show how the so-called statistical leverage scores can be used to interpret the denoising of dMRI data, a process that was traditionally treated as a black box. Experimental results on both simulated and real data affirm that P2S2 maintains denoising quality while significantly enhancing speed and memory efficiency, achieved by training on a reduced data subset. + + + + The Devil is in the Fine-Grained Details: Evaluating Open-Vocabulary Object Detectors for Fine-Grained Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Bianchi_The_Devil_is_in_the_Fine-Grained_Details_Evaluating_Open-Vocabulary_Object_CVPR_2024_paper.pdf + Recent advancements in large vision-language models have enabled visual object detection in open-vocabulary scenarios, where object classes are defined in free-text formats during inference. In this paper, we aim to probe the state-of-the-art methods for open-vocabulary object detection to determine to what extent they understand fine-grained properties of objects and their parts. To this end, we introduce an evaluation protocol based on dynamic vocabulary generation to test whether models detect, discern, and assign the correct fine-grained description to objects in the presence of hard-negative classes. We contribute a benchmark suite of increasing difficulty, probing different properties like color, pattern, and material. We further enhance our investigation by evaluating several state-of-the-art open-vocabulary object detectors using the proposed protocol and find that most existing solutions, which shine in standard open-vocabulary benchmarks, struggle to accurately capture and distinguish finer object details. We conclude the paper by highlighting the limitations of current methodologies and exploring promising research directions to overcome the discovered drawbacks. Data and code are available at https://lorebianchi98.github.io/FG-OVD. + + + + Link-Context Learning for Multimodal LLMs + http://openaccess.thecvf.com//content/CVPR2024/papers/Tai_Link-Context_Learning_for_Multimodal_LLMs_CVPR_2024_paper.pdf + The ability to learn from context with novel concepts and deliver appropriate responses is essential in human conversations.
Despite current Multimodal Large Language Models (MLLMs) and Large Language Models (LLMs) being trained on mega-scale datasets recognizing unseen images or understanding novel concepts in a training-free manner remains a challenge. In-Context Learning (ICL) explores training-free few-shot learning where models are encouraged to "learn to learn" from limited tasks and generalize to unseen tasks. In this work we propose link-context learning (LCL) which emphasizes "reasoning from cause and effect" to augment the learning capabilities of MLLMs. LCL goes beyond traditional ICL by explicitly strengthening the causal relationship between the support set and the query set. By providing demonstrations with causal links LCL guides the model to discern not only the analogy but also the underlying causal associations between data points which empowers MLLMs to recognize unseen images and understand novel concepts more effectively. To facilitate the evaluation of this novel approach we introduce the ISEKAI dataset comprising exclusively of unseen generated image-label pairs designed for link-context learning. Extensive experiments show that our LCL-MLLM exhibits strong link-context learning capabilities to novel concepts over vanilla MLLMs. + + + + ConsistDreamer: 3D-Consistent 2D Diffusion for High-Fidelity Scene Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_ConsistDreamer_3D-Consistent_2D_Diffusion_for_High-Fidelity_Scene_Editing_CVPR_2024_paper.pdf + This paper proposes ConsistDreamer - a novel framework that lifts 2D diffusion models with 3D awareness and 3D consistency thus enabling high-fidelity instruction-guided scene editing. To overcome the fundamental limitation of missing 3D consistency in 2D diffusion models our key insight is to introduce three synergetic strategies that augment the input of the 2D diffusion model to become 3D-aware and to explicitly enforce 3D consistency during the training process. Specifically we design surrounding views as context-rich input for the 2D diffusion model and generate 3D-consistent structured noise instead of image-independent noise. Moreover we introduce self-supervised consistency-enforcing training within the per-scene editing procedure. Extensive evaluation shows that our ConsistDreamer achieves state-of-the-art performance for instruction-guided scene editing across various scenes and editing instructions particularly in complicated large-scale indoor scenes from ScanNet++ with significantly improved sharpness and fine-grained textures. Notably ConsistDreamer stands as the first work capable of successfully editing complex (e.g. plaid/checkered) patterns. + + + + On the Robustness of Large Multimodal Models Against Image Adversarial Attacks + http://openaccess.thecvf.com//content/CVPR2024/papers/Cui_On_the_Robustness_of_Large_Multimodal_Models_Against_Image_Adversarial_CVPR_2024_paper.pdf + Recent advances in instruction tuning have led to the development of State-of-the-Art Large Multimodal Models (LMMs). Given the novelty of these models the impact of visual adversarial attacks on LMMs has not been thoroughly examined. We conduct a comprehensive study of the robustness of various LMMs against different adversarial attacks evaluated across tasks including image classification image captioning and Visual Question Answer (VQA). We find that in general LMMs are not robust to visual adversarial inputs. 
However our findings suggest that context provided to the model via prompts--such as questions in a QA pair--helps to mitigate the effects of visual adversarial inputs. Notably the LMMs evaluated demonstrated remarkable resilience to such attacks on the ScienceQA task with only an 8.10% drop in performance compared to their visual counterparts which dropped 99.73%. We also propose a new approach to real-world image classification which we term query decomposition. By incorporating existence queries into our input prompt we observe diminished attack effectiveness and improvements in image classification accuracy. This research highlights a previously under explored facet of LMM robustness and sets the stage for future work aimed at strengthening the resilience of multimodal systems in adversarial environments. + + + + SoundingActions: Learning How Actions Sound from Narrated Egocentric Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_SoundingActions_Learning_How_Actions_Sound_from_Narrated_Egocentric_Videos_CVPR_2024_paper.pdf + We propose a novel self-supervised embedding to learn how actions sound from narrated in-the-wild egocentric videos. Whereas existing methods rely on curated data with known audio-visual correspondence our multimodal contrastive-consensus coding (MC3) embedding reinforces the associations between audio language and vision when all modality pairs agree while diminishing those associations when any one pair does not. We show our approach can successfully discover how the long tail of human actions sound from egocentric video outperforming an array of recent multimodal embedding techniques on two datasets (Ego4D and EPIC-Sounds) and multiple cross-modal tasks. + + + + MonoHair: High-Fidelity Hair Modeling from a Monocular Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_MonoHair_High-Fidelity_Hair_Modeling_from_a_Monocular_Video_CVPR_2024_paper.pdf + Undoubtedly high-fidelity 3D hair is crucial for achieving realism artistic expression and immersion in computer graphics. While existing 3D hair modeling methods have achieved impressive performance the challenge of achieving high-quality hair reconstruction persists: they either require strict capture conditions making practical applications difficult or heavily rely on learned prior data obscuring fine-grained details in images. To address these challenges we propose MonoHair a generic framework to achieve high-fidelity hair reconstruction from a monocular video without specific requirements for environments. Our approach bifurcates the hair modeling process into two main stages: precise exterior reconstruction and interior structure inference. The exterior is meticulously crafted using our Patch-based Multi-View Optimization PMVO. This method strategically collects and integrates hair information from multiple views independent of prior data to produce a high-fidelity exterior 3D line map. This map not only captures intricate details but also facilitates the inference of the hair's inner structure. For the interior we employ a data-driven multi-view 3D hair reconstruction method. This method utilizes 2D structural renderings derived from the reconstructed exterior mirroring the synthetic 2D inputs used during training. This alignment effectively bridges the domain gap between our training data and real-world data thereby enhancing the accuracy and reliability of our interior structure inference. 
Lastly we generate a strand model and resolve the directional ambiguity by our hair growth algorithm. Our experiments demonstrate that our method exhibits robustness across diverse hairstyles and achieves state-of-the-art performance. For more results please refer to our project page https://keyuwu-cs.github.io/MonoHair/ + + + + One Prompt Word is Enough to Boost Adversarial Robustness for Pre-trained Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_One_Prompt_Word_is_Enough_to_Boost_Adversarial_Robustness_for_CVPR_2024_paper.pdf + Large pre-trained Vision-Language Models (VLMs) like CLIP despite having remarkable generalization ability are highly vulnerable to adversarial examples. This work studies the adversarial robustness of VLMs from the novel perspective of the text prompt instead of the extensively studied model weights (frozen in this work). We first show that the effectiveness of both adversarial attack and defense are sensitive to the used text prompt. Inspired by this we propose a method to improve resilience to adversarial attacks by learning a robust text prompt for VLMs. The proposed method named Adversarial Prompt Tuning (APT) is effective while being both computationally and data efficient. Extensive experiments are conducted across 15 datasets and 4 data sparsity schemes (from 1-shot to full training data settings) to show APT's superiority over hand-engineered prompts and other state-of-the-art adaption methods. APT demonstrated excellent abilities in terms of the in-distribution performance and the generalization under input distribution shift and across datasets. Surprisingly by simply adding one learned word to the prompts APT can significantly boost the accuracy and robustness (epsilon=4/255) over the hand-engineered prompts by +13% and +8.5% on average respectively. The improvement further increases in our most effective setting to +26.4% for accuracy and +16.7% for robustness. Code is available at https://github.com/TreeLLi/APT. + + + + A Versatile Framework for Continual Test-Time Domain Adaptation: Balancing Discriminability and Generalizability + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_A_Versatile_Framework_for_Continual_Test-Time_Domain_Adaptation_Balancing_Discriminability_CVPR_2024_paper.pdf + Continual test-time domain adaptation (CTTA) aims to adapt the source pre-trained model to a continually changing target domain without additional data acquisition or labeling costs. This issue necessitates an initial performance enhancement within the present domain without labels while concurrently averting an excessive bias toward the current domain. Such bias exacerbates catastrophic forgetting and diminishes the generalization ability to future domains. To tackle the problem this paper designs a versatile framework to capture high-quality supervision signals from three aspects: 1) The adaptive thresholds are employed to determine the reliability of pseudo-labels; 2) The knowledge from the source pre-trained model is utilized to adjust the unreliable one and 3) By evaluating past supervision signals we calculate a diversity score to ensure subsequent generalization. In this way we form a complete supervisory signal generation framework which can capture the current domain discriminative and reserve generalization in future domains. Finally to avoid catastrophic forgetting we design a weighted soft parameter alignment method to explore the knowledge from the source model. 
Extensive experimental results demonstrate that our method performs well on several benchmark datasets. + + + + Sieve: Multimodal Dataset Pruning using Image Captioning Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Mahmoud_Sieve_Multimodal_Dataset_Pruning_using_Image_Captioning_Models_CVPR_2024_paper.pdf + Vision-Language Models (VLMs) are pretrained on large diverse and noisy web-crawled datasets. This underscores the critical need for dataset pruning as the quality of these datasets is strongly correlated with the performance of VLMs on downstream tasks. Using CLIPScore from a pretrained model to only train models using highly-aligned samples is one of the most successful methods for pruning. We argue that this approach suffers from multiple limitations including: false positives and negatives due to CLIP's pretraining on noisy labels. We propose a pruning signal Sieve that employs synthetic captions generated by image-captioning models pretrained on small diverse and well-aligned image-text pairs to evaluate the alignment of noisy image-text pairs. To bridge the gap between the limited diversity of generated captions and the high diversity of alternative text (alt-text) we estimate the semantic textual similarity in the embedding space of a language model pretrained on unlabeled text corpus. Using DataComp a multimodal dataset filtering benchmark when evaluating on 38 downstream tasks our pruning approach surpasses CLIPScore by 2.6% and 1.7% on medium and large scale respectively. In addition on retrieval tasks Sieve leads to a significant improvement of 2.7% and 4.5% on medium and large scale respectively. + + + + Dynamic LiDAR Re-simulation using Compositional Neural Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Dynamic_LiDAR_Re-simulation_using_Compositional_Neural_Fields_CVPR_2024_paper.pdf + We introduce DyNFL a novel neural field-based approach for high-fidelity re-simulation of LiDAR scans in dynamic driving scenes. DyNFL processes LiDAR measurements from dynamic environments accompanied by bounding boxes of moving objects to construct an editable neural field. This field comprising separately reconstructed static background and dynamic objects allows users to modify viewpoints adjust object positions and seamlessly add or remove objects in the re-simulated scene. A key innovation of our method is the neural field composition technique which effectively integrates reconstructed neural assets from various scenes through a ray drop test accounting for occlusions and transparent surfaces. Our evaluation with both synthetic and real-world environments demonstrates that DyNFL substantially improves dynamic scene LiDAR simulation offering a combination of physical fidelity and flexible editing capabilities. Project page: https://shengyuh.github.io/dynfl + + + + AETTA: Label-Free Accuracy Estimation for Test-Time Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_AETTA_Label-Free_Accuracy_Estimation_for_Test-Time_Adaptation_CVPR_2024_paper.pdf + Test-time adaptation (TTA) has emerged as a viable solution to adapt pre-trained models to domain shifts using unlabeled test data. However TTA faces challenges of adaptation failures due to its reliance on blind adaptation to unknown test samples in dynamic scenarios. Traditional methods for out-of-distribution performance estimation are limited by unrealistic assumptions in the TTA context such as requiring labeled data or re-training models. 
To address this issue we propose AETTA a label-free accuracy estimation algorithm for TTA. We propose the prediction disagreement as the accuracy estimate calculated by comparing the target model prediction with dropout inferences. We then improve the prediction disagreement to extend the applicability of AETTA under adaptation failures. Our extensive evaluation with four baselines and six TTA methods demonstrates that AETTA shows an average of 19.8%p more accurate estimation compared with the baselines. We further demonstrate the effectiveness of accuracy estimation with a model recovery case study showcasing the practicality of our model recovery based on accuracy estimation. The source code is available at https://github.com/taeckyung/AETTA. + + + + An Empirical Study of the Generalization Ability of Lidar 3D Object Detectors to Unseen Domains + http://openaccess.thecvf.com//content/CVPR2024/papers/Eskandar_An_Empirical_Study_of_the_Generalization_Ability_of_Lidar_3D_CVPR_2024_paper.pdf + 3D Object Detectors (3D-OD) are crucial for understanding the environment in many robotic tasks especially autonomous driving. Including 3D information via Lidar sensors improves accuracy greatly. However such detectors perform poorly on domains they were not trained on i.e. different locations sensors weather etc. limiting their reliability in safety-critical applications. There exist methods to adapt 3D-ODs to these domains; however these methods treat 3D-ODs as a black box neglecting underlying architectural decisions and source-domain training strategies. Instead we dive deep into the details of 3D-ODs focusing our efforts on fundamental factors that influence robustness prior to domain adaptation. We systematically investigate four design choices (and the interplay between them) often overlooked in 3D-OD robustness and domain adaptation: architecture voxel encoding data augmentations and anchor strategies. We assess their impact on the robustness of nine state-of-the-art 3D-ODs across six benchmarks encompassing three types of domain gaps - sensor type weather and location. Our main findings are: (1) transformer backbones with local point features are more robust than 3D CNNs (2) test-time anchor size adjustment is crucial for adaptation across geographical locations significantly boosting scores without retraining (3) source-domain augmentations allow the model to generalize to low-resolution sensors and (4) surprisingly robustness to bad weather is improved when training directly on more clean weather data than on training with bad weather data. We outline our main conclusions and findings to provide practical guidance on developing more robust 3D-ODs. + + + + Unsupervised Universal Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Niu_Unsupervised_Universal_Image_Segmentation_CVPR_2024_paper.pdf + Several unsupervised image segmentation approaches have been proposed which eliminate the need for dense manually-annotated segmentation masks; current models separately handle either semantic segmentation (e.g. STEGO) or class-agnostic instance segmentation (e.g. CutLER) but not both (i.e. panoptic segmentation). We propose an Unsupervised Universal Segmentation model (U2Seg) adept at performing various image segmentation tasks---instance semantic and panoptic---using a novel unified framework. 
U2Seg generates pseudo semantic labels for these segmentation tasks via leveraging self-supervised models followed by clustering; each cluster represents different semantic and/or instance membership of pixels. We then self-train the model on these pseudo semantic labels yielding substantial performance gains over specialized methods tailored to each task: a +2.6 APbox boost (vs. CutLER) in unsupervised instance segmentation on COCO and a +7.0 PixelAcc increase (vs. STEGO) in unsupervised semantic segmentation on COCOStuff. Moreover our method sets up a new baseline for unsupervised panoptic segmentation which has not been previously explored. U2Seg is also a strong pretrained model for few-shot segmentation surpassing CutLER by +5.0 APmask when trained on a low-data regime e.g. only 1% COCO labels. We hope our simple yet effective method can inspire more research on unsupervised universal image segmentation. + + + + A Closer Look at the Few-Shot Adaptation of Large Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Silva-Rodriguez_A_Closer_Look_at_the_Few-Shot_Adaptation_of_Large_Vision-Language_CVPR_2024_paper.pdf + Efficient transfer learning (ETL) is receiving increasing attention to adapt large pre-trained language-vision models on downstream tasks with a few labeled samples. While significant progress has been made we reveal that state-of-the-art ETL approaches exhibit strong performance only in narrowly-defined experimental setups and with a careful adjustment of hyperparameters based on a large corpus of labeled samples. In particular we make two interesting and surprising empirical observations. First to outperform a simple Linear Probing baseline these methods require to optimize their hyper-parameters on each target task. And second they typically underperform --sometimes dramatically-- standard zero-shot predictions in the presence of distributional drifts. Motivated by the unrealistic assumptions made in the existing literature i.e. access to a large validation set and case-specific grid-search for optimal hyperparameters we propose a novel approach that meets the requirements of real-world scenarios. More concretely we introduce a CLass-Adaptive linear Probe (CLAP) objective whose balancing term is optimized via an adaptation of the general Augmented Lagrangian method tailored to this context. We comprehensively evaluate CLAP on a broad span of datasets and scenarios demonstrating that it consistently outperforms SoTA approaches while yet being a much more efficient alternative. + + + + Global and Hierarchical Geometry Consistency Priors for Few-shot NeRFs in Indoor Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Global_and_Hierarchical_Geometry_Consistency_Priors_for_Few-shot_NeRFs_in_CVPR_2024_paper.pdf + It is challenging for Neural Radiance Fields (NeRFs) in the few-shot setting to reconstruct high-quality novel views and depth maps in 360^\circ outward-facing indoor scenes. The captured sparse views for these scenes usually contain large viewpoint variations. This greatly reduces the potential consistency between views leading NeRFs to degrade a lot in these scenarios. Existing methods usually leverage pretrained depth prediction models to improve NeRFs. However these methods cannot guarantee geometry consistency due to the inherent geometry ambiguity in the pretrained models thus limiting NeRFs' performance. 
In this work we present P^2NeRF to capture global and hierarchical geometry consistency priors from pretrained models thus facilitating few-shot NeRFs in 360^\circ outward-facing indoor scenes. On the one hand we propose a matching-based geometry warm-up strategy to provide global geometry consistency priors for NeRFs. This effectively avoids the overfitting of early training with sparse inputs. On the other hand we propose a group depth ranking loss and ray weight mask regularization based on the monocular depth estimation model. This provides hierarchical geometry consistency priors for NeRFs. As a result our approach can fully leverage the geometry consistency priors from pretrained models and help few-shot NeRFs achieve state-of-the-art performance on two challenging indoor datasets. Our code is released at https://github.com/XT5un/P2NeRF. + + + + Mask Grounding for Referring Image Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chng_Mask_Grounding_for_Referring_Image_Segmentation_CVPR_2024_paper.pdf + Referring Image Segmentation (RIS) is a challenging task that requires an algorithm to segment objects referred by free-form language expressions. Despite significant progress in recent years most state-of-the-art (SOTA) methods still suffer from considerable language-image modality gap at the pixel and word level. These methods generally 1) rely on sentence-level language features for language-image alignment and 2) lack explicit training supervision for fine-grained visual grounding. Consequently they exhibit weak object-level correspondence between visual and language features. Without well-grounded features prior methods struggle to understand complex expressions that require strong reasoning over relationships among multiple objects especially when dealing with rarely used or ambiguous clauses. To tackle this challenge we introduce a novel Mask Grounding auxiliary task that significantly improves visual grounding within language features by explicitly teaching the model to learn fine-grained correspondence between masked textual tokens and their matching visual objects. Mask Grounding can be directly used on prior RIS methods and consistently bring improvements. Furthermore to holistically address the modality gap we also design a cross-modal alignment loss and an accompanying alignment module. These additions work synergistically with Mask Grounding. With all these techniques our comprehensive approach culminates in MagNet (Mask-grounded Network) an architecture that significantly outperforms prior arts on three key benchmarks (RefCOCO RefCOCO+ and G-Ref) demonstrating our method's effectiveness in addressing current limitations of RIS algorithms. Our code and pre-trained weights will be released. + + + + Time-Efficient Light-Field Acquisition Using Coded Aperture and Events + http://openaccess.thecvf.com//content/CVPR2024/papers/Habuchi_Time-Efficient_Light-Field_Acquisition_Using_Coded_Aperture_and_Events_CVPR_2024_paper.pdf + We propose a computational imaging method for time-efficient light-field acquisition that combines a coded aperture with an event-based camera. Different from the conventional coded-aperture imaging method our method applies a sequence of coding patterns during a single exposure for an image frame. The parallax information which is related to the differences in coding patterns is recorded as events. 
The image frame and events all of which are measured in a single exposure are jointly used to computationally reconstruct a light field. We also designed an algorithm pipeline for our method that is end-to-end trainable on the basis of deep optics and compatible with real camera hardware. We experimentally showed that our method can achieve more accurate reconstruction than several other imaging methods with a single exposure. We also developed a hardware prototype with the potential to complete the measurement on the camera within 22 msec and demonstrated that light fields from real 3-D scenes can be obtained with convincing visual quality. Our software and supplementary video are available from our project website. + + + + EVS-assisted Joint Deblurring Rolling-Shutter Correction and Video Frame Interpolation through Sensor Inverse Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_EVS-assisted_Joint_Deblurring_Rolling-Shutter_Correction_and_Video_Frame_Interpolation_through_CVPR_2024_paper.pdf + Event-based Vision Sensors (EVS) gain popularity in enhancing CMOS Image Sensor (CIS) video capture. Nonidealities of EVS such as pixel or readout latency can significantly influence the quality of the enhanced images and warrant dedicated consideration in the design of fusion algorithms. A novel approach for jointly computing deblurred rolling-shutter artifact corrected high-speed videos with frame rates up to 10000 FPS using inherently blurry rolling shutter CIS frames of 120 FPS to 150 FPS in conjunction with EVS data from a hybrid CIS-EVS sensor is presented. EVS pixel latency readout latency and the sensor's refractory period are explicitly incorporated into the measurement model. This inverse function problem is solved on a per-pixel manner using an optimization-based framework. The interpolated images are subsequently processed by a novel refinement network. The proposed method is evaluated using simulated and measured datasets under natural and controlled environments. Extensive experiments show reduced shadowing effect a 4 dB increment in PSNR and a 12% improvement in LPIPS score compared to state-of-the-art methods. + + + + Active Prompt Learning in Vision Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Bang_Active_Prompt_Learning_in_Vision_Language_Models_CVPR_2024_paper.pdf + Pre-trained Vision Language Models (VLMs) have demonstrated notable progress in various zero-shot tasks such as classification and retrieval. Despite their performance because improving performance on new tasks requires task-specific knowledge their adaptation is essential. While labels are needed for the adaptation acquiring them is typically expensive. To overcome this challenge active learning a method of achieving a high performance by obtaining labels for a small number of samples from experts has been studied. Active learning primarily focuses on selecting unlabeled samples for labeling and leveraging them to train models. In this study we pose the question "how can the pre-trained VLMs be adapted under the active learning framework?" In response to this inquiry we observe that (1) simply applying a conventional active learning framework to pre-trained VLMs even may degrade performance compared to random selection because of the class imbalance in labeling candidates and (2) the knowledge of VLMs can provide hints for achieving the balance before labeling. Based on these observations we devise a novel active learning framework for VLMs denoted as PCB. 
To assess the effectiveness of our approach we conduct experiments on seven different real-world datasets and the results demonstrate that PCB surpasses conventional active learning and random sampling methods. + + + + NICE: Neurogenesis Inspired Contextual Encoding for Replay-free Class Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Gurbuz_NICE_Neurogenesis_Inspired_Contextual_Encoding_for_Replay-free_Class_Incremental_Learning_CVPR_2024_paper.pdf + Deep neural networks (DNNs) struggle to learn in dynamic settings because they mainly rely on static datasets. Continual learning (CL) aims to overcome this limitation by enabling DNNs to incrementally accumulate knowledge. A widely adopted scenario in CL is class-incremental learning (CIL) where DNNs are required to sequentially learn more classes. Among the various strategies in CL replay methods which revisit previous classes stand out as the only effective ones in CIL. Other strategies such as architectural modifications to segregate information across weights and protect them from change are ineffective in CIL. This is because they need additional information during testing to select the correct network parts to use. In this paper we propose NICE Neurogenesis Inspired Contextual Encoding a replay-free architectural method inspired by adult neurogenesis in the hippocampus. NICE groups neurons in the DNN based on different maturation stages and infers which neurons to use during testing without any additional signal. Through extensive experiments across 6 datasets and 3 architectures we show that NICE performs on par with or often outperforms replay methods. We also make the case that neurons exhibit highly distinctive activation patterns for the classes in which they specialize enabling us to determine when they should be used. The code is available at https://github.com/BurakGurbuz97/NICE. + + + + Backdoor Defense via Test-Time Detecting and Repairing + http://openaccess.thecvf.com//content/CVPR2024/papers/Guan_Backdoor_Defense_via_Test-Time_Detecting_and_Repairing_CVPR_2024_paper.pdf + Deep neural networks have played a crucial part in many critical domains such as autonomous driving face recognition and medical diagnosis. However deep neural networks are facing security threats from backdoor attacks and can be manipulated into attacker-decided behaviors by the backdoor attacker. To defend the backdoor prior research has focused on using clean data to remove backdoor attacks before model deployment. In this paper we investigate the possibility of defending against backdoor attacks by utilizing test-time partially poisoned data to remove the backdoor from the model. To address the problem a two-stage method TTBD is proposed. In the first stage we propose a backdoor sample detection method DDP to identify poisoned samples from a batch of mixed partially poisoned samples. Once the poisoned samples are detected we employ Shapley estimation to calculate the contribution of each neuron's significance in the network locate the poisoned neurons and prune them to remove backdoor in the models. Our experiments demonstrate that TTBD removes the backdoor successfully with only a batch of partially poisoned data across different model architectures and datasets against different types of backdoor attacks. 
+ + + + OneFormer3D: One Transformer for Unified Point Cloud Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kolodiazhnyi_OneFormer3D_One_Transformer_for_Unified_Point_Cloud_Segmentation_CVPR_2024_paper.pdf + Semantic instance and panoptic segmentation of 3D point clouds have been addressed using task-specific models of distinct design. Thereby the similarity of all segmentation tasks and the implicit relationship between them have not been utilized effectively. This paper presents a unified simple and effective model addressing all these tasks jointly. The model named OneFormer3D performs instance and semantic segmentation consistently using a group of learnable kernels where each kernel is responsible for generating a mask for either an instance or a semantic category. These kernels are trained with a transformer-based decoder with unified instance and semantic queries passed as an input. Such a design enables training a model end-to-end in a single run so that it achieves top performance on all three segmentation tasks simultaneously. Specifically our OneFormer3D ranks 1st and sets a new state-of-the-art (+2.1 mAP50) in the ScanNet test leaderboard. We also demonstrate the state-of-the-art results in semantic instance and panoptic segmentation of ScanNet (+21 PQ) ScanNet200 (+3.8 mAP50) and S3DIS (+0.8 mIoU) datasets. + + + + JRDB-Social: A Multifaceted Robotic Dataset for Understanding of Context and Dynamics of Human Interactions Within Social Groups + http://openaccess.thecvf.com//content/CVPR2024/papers/Jahangard_JRDB-Social_A_Multifaceted_Robotic_Dataset_for_Understanding_of_Context_and_CVPR_2024_paper.pdf + Understanding human social behaviour is crucial in computer vision and robotics. Micro-level observations like individual actions fall short necessitating a comprehensive approach that considers individual behaviour intra-group dynamics and social group levels for a thorough understanding. To address dataset limitations this paper introduces JRDB-Social an extension of JRDB. Designed to fill gaps in human understanding across diverse indoor and outdoor social contexts JRDB-Social provides annotations at three levels: individual attributes intra-group interactions and social group context. This dataset aims to enhance our grasp of human social dynamics for robotic applications. Utilizing the recent cutting-edge multi-modal large language models we evaluated our benchmark to explore their capacity to decipher social human behaviour. + + + + GPT-4V(ision) is a Human-Aligned Evaluator for Text-to-3D Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_GPT-4Vision_is_a_Human-Aligned_Evaluator_for_Text-to-3D_Generation_CVPR_2024_paper.pdf + Despite recent advances in text-to-3D generative methods there is a notable absence of reliable evaluation metrics. Existing metrics usually focus on a single criterion each such as how well the asset aligned with the input text. These metrics lack the flexibility to generalize to different evaluation criteria and might not align well with human preferences. Conducting user preference studies is an alternative that offers both adaptability and human-aligned results. User studies however can be very expensive to scale. This paper presents an automatic versatile and human-aligned evaluation metric for text-to-3D generative models. To this end we first develop a prompt generator using GPT-4V to generate evaluating prompts which serve as input to compare text-to-3D models. 
We further design a method instructing GPT-4V to compare two 3D assets according to user-defined criteria. Finally we use these pairwise comparison results to assign these models Elo ratings. Experimental results suggest our metric strongly align with human preference across different evaluation criteria. + + + + NTO3D: Neural Target Object 3D Reconstruction with Segment Anything + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_NTO3D_Neural_Target_Object_3D_Reconstruction_with_Segment_Anything_CVPR_2024_paper.pdf + Neural 3D reconstruction from multi-view images has recently attracted increasing attention from the community. Existing methods normally learn a neural field for the whole scene while it is still under-explored how to reconstruct a target object indicated by users. Considering the Segment Anything Model (SAM) has shown effectiveness in segmenting any 2D images in this paper we propose NTO3D a novel high-quality Neural Target Object 3D (NTO3D) reconstruction method which leverages the benefits of both neural field and SAM. We first propose a novel strategy to lift the multi-view 2D segmentation masks of SAM into a unified 3D occupancy field. The 3D occupancy field is then projected into 2D space and generates the new prompts for SAM. This process is iterative until convergence to separate the target object from the scene. After this we then lift the 2D features of the SAM encoder into a 3D feature field in order to improve the reconstruction quality of the target object. NTO3D lifts the 2D masks and features of SAM into the 3D neural field for high-quality neural target object 3D reconstruction. We conduct detailed experiments on several benchmark datasets to demonstrate the advantages of our method. The code will be available at: https://github.com/ucwxb/NTO3D. + + + + OmniMedVQA: A New Large-Scale Comprehensive Evaluation Benchmark for Medical LVLM + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_OmniMedVQA_A_New_Large-Scale_Comprehensive_Evaluation_Benchmark_for_Medical_LVLM_CVPR_2024_paper.pdf + Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities in various multimodal tasks. However their potential in the medical domain remains largely unexplored. A significant challenge arises from the scarcity of diverse medical images spanning various modalities and anatomical regions which is essential in real-world medical applications. To solve this problem in this paper we introduce OmniMedVQA a novel comprehensive medical Visual Question Answering (VQA) benchmark. This benchmark is collected from 73 different medical datasets including 12 different modalities and covering more than 20 distinct anatomical regions. Importantly all images in this benchmark are sourced from authentic medical scenarios ensuring alignment with the requirements of the medical field and suitability for evaluating LVLMs. Through our extensive experiments we have found that existing LVLMs struggle to address these medical VQA problems effectively. Moreover what surprises us is that medical-specialized LVLMs even exhibit inferior performance to those general-domain models calling for a more versatile and robust LVLM in the biomedical field. The evaluation results not only reveal the current limitations of LVLM in understanding real medical images but also highlight our dataset's significance. Our code with dataset are available at https://github.com/OpenGVLab/Multi-Modality-Arena. 
+ + + + Visual Programming for Zero-shot Open-Vocabulary 3D Visual Grounding + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_Visual_Programming_for_Zero-shot_Open-Vocabulary_3D_Visual_Grounding_CVPR_2024_paper.pdf + 3D Visual Grounding (3DVG) aims at localizing 3D object based on textual descriptions. Conventional supervised methods for 3DVG often necessitate extensive annotations and a predefined vocabulary which can be restrictive. To address this issue we propose a novel visual programming approach for zero-shot open-vocabulary 3DVG leveraging the capabilities of large language models (LLMs). Our approach begins with a unique dialog-based method engaging with LLMs to establish a foundational understanding of zero-shot 3DVG. Building on this we design a visual program that consists of three types of modules i.e. view-independent view-dependent and functional modules. Furthermore we develop an innovative language-object correlation module to extend the scope of existing 3D object detectors into open-vocabulary scenarios. Extensive experiments demonstrate that our zero-shot approach can outperform some supervised baselines marking a significant stride towards effective 3DVG. Code is available at https://curryyuan.github.io/ZSVG3D. + + + + Class Incremental Learning with Multi-Teacher Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wen_Class_Incremental_Learning_with_Multi-Teacher_Distillation_CVPR_2024_paper.pdf + Distillation strategies are currently the primary approaches for mitigating forgetting in class incremental learning (CIL). Existing methods generally inherit previous knowledge from a single teacher. However teachers with different mechanisms are talented at different tasks and inheriting diverse knowledge from them can enhance compatibility with new knowledge. In this paper we propose the MTD method to find multiple diverse teachers for CIL. Specifically we adopt weight permutation feature perturbation and diversity regularization techniques to ensure diverse mechanisms in teachers. To reduce time and memory consumption each teacher is represented as a small branch in the model. We adapt existing CIL distillation strategies with MTD and extensive experiments on CIFAR-100 ImageNet-100 and ImageNet-1000 show significant performance improvement. Our code is available at https://github.com/HaitaoWen/CLearning. + + + + AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_AMU-Tuning_Effective_Logit_Bias_for_CLIP-based_Few-shot_Learning_CVPR_2024_paper.pdf + Recently pre-trained vision-language models (e.g. CLIP) have shown great potential in few-shot learning and attracted a lot of research interest. Although efforts have been made to improve few-shot ability of CLIP key factors on the effectiveness of existing methods have not been well studied limiting further exploration of CLIP's potential in few-shot learning. In this paper we first introduce a unified formulation to analyze CLIP-based few-shot learning methods from a perspective of logit bias which encourages us to learn an effective logit bias for further improving performance of CLIP-based few-shot learning methods. To this end we disassemble three key components involved in computation of logit bias (i.e. logit features logit predictor and logit fusion) and empirically analyze the effect on performance of few-shot classification. 
Based on analysis of key components this paper proposes a novel AMU-Tuning method to learn effective logit bias for CLIP-based few-shot classification. Specifically our AMU-Tuning predicts logit bias by exploiting the appropriate Auxiliary features which are fed into an efficient feature-initialized linear classifier with Multi-branch training. Finally an Uncertainty-based fusion is developed to incorporate logit bias into CLIP for few-shot classification. The experiments are conducted on several widely used benchmarks and the results show AMU-Tuning clearly outperforms its counterparts while achieving state-of-the-art performance of CLIP-based few-shot learning without bells and whistles. + + + + Real-World Mobile Image Denoising Dataset with Efficient Baselines + http://openaccess.thecvf.com//content/CVPR2024/papers/Flepp_Real-World_Mobile_Image_Denoising_Dataset_with_Efficient_Baselines_CVPR_2024_paper.pdf + The recently increased role of mobile photography has raised the standards of on-device photo processing tremendously. Despite the latest advancements in camera hardware the mobile camera sensor area cannot be increased significantly due to physical constraints leading to a pixel size of 0.6–2.0 μm which results in strong image noise even in moderate lighting conditions. In the era of deep learning one can train a CNN model to perform robust image denoising. However there is still a lack of a substantially diverse dataset for this task. To address this problem we introduce a novel Mobile Image Denoising Dataset (MIDD) comprising over 400000 noisy / noise-free image pairs captured under various conditions by 20 different mobile camera sensors. Additionally we propose a new DPreview test set consisting of data from 294 different cameras for precise model evaluation. Furthermore we present the efficient baseline model SplitterNet for the considered mobile image denoising task that achieves high numerical and visual results while being able to process 8MP photos directly on smartphone GPUs in under one second thereby outperforming models with similar runtimes. This model is also compatible with recent mobile NPUs demonstrating an even higher speed when deployed on them. The conducted experiments demonstrate high robustness of the proposed solution when applied to images from previously unseen sensors showing its high generalizability. The datasets code and models can be found on the official project website. + + + + Fine-Grained Bipartite Concept Factorization for Clustering + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_Fine-Grained_Bipartite_Concept_Factorization_for_Clustering_CVPR_2024_paper.pdf + In this paper we propose a novel concept factorization method that seeks factor matrices using a cross-order positive semi-definite neighbor graph which provides comprehensive and complementary neighbor information of the data. The factor matrices are learned with bipartite graph partitioning which exploits explicit cluster structure of the data and is more geared towards clustering application. We develop an effective and efficient optimization algorithm for our method and provide elegant theoretical results about the convergence. Extensive experimental results confirm the effectiveness of the proposed method. 
+ + + + Language-Driven Anchors for Zero-Shot Adversarial Robustness + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Language-Driven_Anchors_for_Zero-Shot_Adversarial_Robustness_CVPR_2024_paper.pdf + Deep Neural Networks (DNNs) are known to be susceptible to adversarial attacks. Previous researches mainly focus on improving adversarial robustness in the fully supervised setting leaving the challenging domain of zero-shot adversarial robustness an open question. In this work we investigate this domain by leveraging the recent advances in large vision-language models such as CLIP to introduce zero-shot adversarial robustness to DNNs. We propose LAAT a Language-driven Anchor-based Adversarial Training strategy. LAAT utilizes the features of a text encoder for each category as fixed anchors (normalized feature embeddings) for each category which are then employed for adversarial training. By leveraging the semantic consistency of the text encoders LAAT aims to enhance the adversarial robustness of the image model on novel categories. However naively using text encoders leads to poor results. Through analysis we identified the issue to be the high cosine similarity between text encoders. We then design an expansion algorithm and an alignment cross-entropy loss to alleviate the problem. Our experimental results demonstrated that LAAT significantly improves zero-shot adversarial robustness over state-of-the-art methods. LAAT has the potential to enhance adversarial robustness by large-scale multimodal models especially when labeled data is unavailable during training. + + + + Fooling Polarization-Based Vision using Locally Controllable Polarizing Projection + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Fooling_Polarization-Based_Vision_using_Locally_Controllable_Polarizing_Projection_CVPR_2024_paper.pdf + Polarization is a fundamental property of light that encodes abundant information regarding surface shape material illumination and viewing geometry. The computer vision community has witnessed a blossom of polarization-based vision applications such as reflection removal shape-from-polarization (SfP) transparent object segmentation and color constancy partially due to the emergence of single-chip mono/color polarization sensors that make polarization data acquisition easier than ever. However is polarization-based vision vulnerable to adversarial attacks? If so is that possible to realize these adversarial attacks in the physical world without being perceived by human eyes? In this paper we warn the community of the vulnerability of polarization-based vision which can be more serious than RGB-based vision. By adapting a commercial LCD projector we achieve locally controllable polarizing projection which is successfully utilized to fool state-of-the-art polarization-based vision algorithms for glass segmentation and SfP. Compared with existing physical attacks on RGB-based vision which always suffer from the trade-off between attack efficacy and eye conceivability the adversarial attackers based on polarizing projection are contact-free and visually imperceptible since naked human eyes can rarely perceive the difference of viciously manipulated polarizing light and ordinary illumination. This poses unprecedented risks on polarization-based vision for which due attentions should be paid and counter measures be considered. 
+ + + + DiffAM: Diffusion-based Adversarial Makeup Transfer for Facial Privacy Protection + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_DiffAM_Diffusion-based_Adversarial_Makeup_Transfer_for_Facial_Privacy_Protection_CVPR_2024_paper.pdf + With the rapid development of face recognition (FR) systems the privacy of face images on social media is facing severe challenges due to the abuse of unauthorized FR systems. Some studies utilize adversarial attack techniques to defend against malicious FR systems by generating adversarial examples. However the generated adversarial examples i.e. the protected face images tend to suffer from subpar visual quality and low transferability. In this paper we propose a novel face protection approach dubbed DiffAM which leverages the powerful generative ability of diffusion models to generate high-quality protected face images with adversarial makeup transferred from reference images. To be specific we first introduce a makeup removal module to generate non-makeup images utilizing a fine-tuned diffusion model with guidance of textual prompts in CLIP space. As the inverse process of makeup transfer makeup removal can make it easier to establish the deterministic relationship between makeup domain and non-makeup domain regardless of elaborate text prompts. Then with this relationship a CLIP-based makeup loss along with an ensemble attack strategy is introduced to jointly guide the direction of adversarial makeup domain achieving the generation of protected face images with natural-looking makeup and high black-box transferability. Extensive experiments demonstrate that DiffAM achieves higher visual quality and attack success rates with a gain of 12.98% under black-box setting compared with the state of the arts. The code will be available at https://github.com/HansSunY/DiffAM. + + + + SlowFormer: Adversarial Attack on Compute and Energy Consumption of Efficient Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Navaneet_SlowFormer_Adversarial_Attack_on_Compute_and_Energy_Consumption_of_Efficient_CVPR_2024_paper.pdf + Recently there has been a lot of progress in reducing the computation of deep models at inference time. These methods can reduce both the computational needs and power usage of deep models. Some of these approaches adaptively scale the compute based on the input instance. We show that such models can be vulnerable to a universal adversarial patch attack where the attacker optimizes for a patch that when pasted on any image can increase the compute and power consumption of the model. We run experiments with three different efficient vision transformer methods showing that in some cases the attacker can increase the computation to the maximum possible level by simply pasting a patch that occupies only 8% of the image area. We also show that a standard adversarial training defense method can reduce some of the attack's success. We believe adaptive efficient methods will be necessary for the future to lower the power usage of expensive deep models so we hope our paper encourages the community to study the robustness of these methods and develop better defense methods for the proposed attack. Code is available at: https://github.com/UCDvision/SlowFormer. 
+ + + + How to Configure Good In-Context Sequence for Visual Question Answering + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_How_to_Configure_Good_In-Context_Sequence_for_Visual_Question_Answering_CVPR_2024_paper.pdf + Inspired by the success of Large Language Models in dealing with new tasks via In-Context Learning (ICL) in NLP researchers have also developed Large Vision-Language Models (LVLMs) with ICL capabilities. However when implementing ICL using these LVLMs researchers usually resort to the simplest way like random sampling to configure the in-context sequence thus leading to sub-optimal results. To enhance the ICL performance in this study we use Visual Question Answering (VQA) as case study to explore diverse in-context configurations to find the powerful ones. Additionally through observing the changes of the LVLM outputs by altering the in-context sequence we gain insights into the inner properties of LVLMs improving our understanding of them. Specifically to explore in-context configurations we design diverse retrieval methods and employ different strategies to manipulate the retrieved demonstrations. Through exhaustive experiments on three VQA datasets: VQAv2 VizWiz and OK-VQA we uncover three important inner properties of the applied LVLM and demonstrate which strategies can consistently improve the ICL VQA performance. Our code is provided in: https: //github.com/GaryJiajia/OFv2_ICL_VQA. + + + + Defense Against Adversarial Attacks on No-Reference Image Quality Models with Gradient Norm Regularization + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Defense_Against_Adversarial_Attacks_on_No-Reference_Image_Quality_Models_with_CVPR_2024_paper.pdf + The task of No-Reference Image Quality Assessment (NR-IQA) is to estimate the quality score of an input image without additional information. NR-IQA models play a crucial role in the media industry aiding in performance evaluation and optimization guidance. However these models are found to be vulnerable to adversarial attacks which introduce imperceptible perturbations to input images resulting in significant changes in predicted scores. In this paper we propose a defense method to mitigate the variability in predicted scores caused by small perturbations thus enhancing the adversarial robustness of NR-IQA models. To be specific we present theoretical evidence showing that the extent of score changes is related to the l_1 norm of the gradient of the predicted score with respect to the input image when adversarial perturbations are l_inf-bounded. Building on this theoretical foundation we propose a norm regularization training strategy aimed at reducing the l_1 norm of the gradient thereby boosting the adversarial robustness of NR-IQA models. Experiments conducted on four NR-IQA baseline models demonstrate the effectiveness of our strategy in reducing score changes in the presence of adversarial attacks. To the best of our knowledge this work marks the first attempt to defend against adversarial attacks on NR-IQA models. Our study offers valuable insights into the adversarial robustness of NR-IQA models and provides a foundation for future research in this area. 
+ + + + TACO: Benchmarking Generalizable Bimanual Tool-ACtion-Object Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_TACO_Benchmarking_Generalizable_Bimanual_Tool-ACtion-Object_Understanding_CVPR_2024_paper.pdf + Humans commonly work with multiple objects in daily life and can intuitively transfer manipulation skills to novel objects by understanding object functional regularities. However existing technical approaches for analyzing and synthesizing hand-object manipulation are mostly limited to handling a single hand and object due to the lack of data support. To address this we construct TACO an extensive bimanual hand-object-interaction dataset spanning a large variety of tool-action-object compositions for daily human activities. TACO contains 2.5K motion sequences paired with third-person and egocentric views precise hand-object 3D meshes and action labels. To rapidly expand the data scale we present a fully automatic data acquisition pipeline combining multi-view sensing with an optical motion capture system. With the vast research fields provided by TACO we benchmark three generalizable hand-object-interaction tasks: compositional action recognition generalizable hand-object motion forecasting and cooperative grasp synthesis. Extensive experiments reveal new insights challenges and opportunities for advancing the studies of generalizable hand-object motion analysis and synthesis. Our data and code are available at https://taco2024.github.io. + + + + AlignMiF: Geometry-Aligned Multimodal Implicit Field for LiDAR-Camera Joint Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Tao_AlignMiF_Geometry-Aligned_Multimodal_Implicit_Field_for_LiDAR-Camera_Joint_Synthesis_CVPR_2024_paper.pdf + Neural implicit fields have been a de facto standard in novel view synthesis. Recently there exist some methods exploring fusing multiple modalities within a single field aiming to share implicit features from different modalities to enhance reconstruction performance. However these modalities often exhibit misaligned behaviors: optimizing for one modality such as LiDAR can adversely affect another like camera performance and vice versa. In this work we conduct comprehensive analyses on the multimodal implicit field of LiDAR-camera joint synthesis revealing the underlying issue lies in the misalignment of different sensors. Furthermore we introduce AlignMiF a geometrically aligned multimodal implicit field with two proposed modules: Geometry-Aware Alignment (GAA) and Shared Geometry Initialization (SGI). These modules effectively align the coarse geometry across different modalities significantly enhancing the fusion process between LiDAR and camera data. Through extensive experiments across various datasets and scenes we demonstrate the effectiveness of our approach in facilitating better interaction between LiDAR and camera modalities within a unified neural field. Specifically our proposed AlignMiF achieves remarkable improvement over recent implicit fusion methods (+2.01 and +3.11 image PSNR on the KITTI-360 and Waymo datasets) and consistently surpasses single modality performance (13.8% and 14.2% reduction in LiDAR Chamfer Distance on the respective datasets). 
+ + + + Improving Unsupervised Hierarchical Representation with Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/An_Improving_Unsupervised_Hierarchical_Representation_with_Reinforcement_Learning_CVPR_2024_paper.pdf + Learning representations to capture the very fundamental understanding of the world is a key challenge in machine learning. The hierarchical structure of explanatory factors hidden in data is such a general representation and could be potentially achieved with a hierarchical VAE. However training a hierarchical VAE always suffers from the "posterior collapse" where the data information is hard to propagate to the higher-level latent variables hence resulting in a bad hierarchical representation. To address this issue we first analyze the shortcomings of existing methods for mitigating the "posterior collapse" from an information theory perspective then highlight the necessity of regularization for explicitly propagating data information to higher-level latent variables while maintaining the dependency between different levels. This naturally leads to formulating the inference of the hierarchical latent representation as a sequential decision process which could benefit from applying reinforcement learning (RL). Aligning RL's objective with the regularization we first introduce a "skip-generative path" to acquire a reward for evaluating the information content of an inferred latent representation and then the developed Q-value function based on it could have a consistent optimization direction of the regularization. Finally policy gradient one of the typical RL methods is employed to train a hierarchical VAE without introducing a gradient estimator. Experimental results firmly support our analysis and demonstrate that our proposed method effectively mitigates the "posterior collapse" issue learns an informative hierarchy acquires explainable latent representations and significantly outperforms other hierarchical VAE-based methods in downstream tasks. + + + + HPL-ESS: Hybrid Pseudo-Labeling for Unsupervised Event-based Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Jing_HPL-ESS_Hybrid_Pseudo-Labeling_for_Unsupervised_Event-based_Semantic_Segmentation_CVPR_2024_paper.pdf + Event-based semantic segmentation has gained popularity due to its capability to deal with scenarios under high-speed motion and extreme lighting conditions which cannot be addressed by conventional RGB cameras. Since it is hard to annotate event data previous approaches rely on event-to-image reconstruction to obtain pseudo labels for training. However this will inevitably introduce noise and learning from noisy pseudo labels especially when generated from a single source may reinforce the errors. This drawback is also called confirmation bias in pseudo-labeling. In this paper we propose a novel hybrid pseudo-labeling framework for unsupervised event-based semantic segmentation HPL-ESS to alleviate the influence of noisy pseudo labels. In particular we first employ a plain unsupervised domain adaptation framework as our baseline which can generate a set of pseudo labels through self-training. Then we incorporate offline event-to-image reconstruction into the framework and obtain another set of pseudo labels by predicting segmentation maps on the reconstructed images. A noisy label learning strategy is designed to mix the two sets of pseudo labels and enhance the quality. 
Moreover we propose a soft prototypical alignment module to further improve the consistency of target domain features. Extensive experiments show that our proposed method outperforms existing state-of-the-art methods by a large margin on the DSEC-Semantic dataset (+5.88% accuracy +10.32% mIoU) which even surpasses several supervised methods. + + + + Towards General Robustness Verification of MaxPool-based Convolutional Neural Networks via Tightening Linear Approximation + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_Towards_General_Robustness_Verification_of_MaxPool-based_Convolutional_Neural_Networks_via_CVPR_2024_paper.pdf + The robustness of convolutional neural networks (CNNs) is vital to modern AI-driven systems. It can be quantified by formal verification by providing a certified lower bound within which any perturbation does not alter the original input's classification result. It is challenging due to nonlinear components such as MaxPool. At present many verification methods are sound but risk losing some precision to enhance efficiency and scalability and thus a certified lower bound is a crucial criterion for evaluating the performance of verification tools. In this paper we present MaxLin a robustness verifier for MaxPool-based CNNs with tight Linear approximation. By tightening the linear approximation of the MaxPool function we can certify larger certified lower bounds of CNNs. We evaluate MaxLin with open-sourced benchmarks including LeNet and networks trained on the MNIST CIFAR-10 and Tiny ImageNet datasets. The results show that MaxLin outperforms state-of-the-art tools with up to 110.60% improvement regarding the certified lower bound and 5.13 X speedup for the same neural networks. Our code is available at https://github.com/xiaoyuanpigo/maxlin. + + + + Learning to Rematch Mismatched Pairs for Robust Cross-Modal Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_Learning_to_Rematch_Mismatched_Pairs_for_Robust_Cross-Modal_Retrieval_CVPR_2024_paper.pdf + Collecting well-matched multimedia datasets is crucial for training cross-modal retrieval models. However in real-world scenarios massive multimodal data are harvested from the Internet which inevitably contains Partially Mismatched Pairs (PMPs). Undoubtedly such semantical irrelevant data will remarkably harm the cross-modal retrieval performance. Previous efforts tend to mitigate this problem by estimating a soft correspondence to down-weight the contribution of PMPs. In this paper we aim to address this challenge from a new perspective: the potential semantic similarity among unpaired samples makes it possible to excavate useful knowledge from mismatched pairs. To achieve this we propose L2RM a general framework based on Optimal Transport (OT) that learns to rematch mismatched pairs. In detail L2RM aims to generate refined alignments by seeking a minimal-cost transport plan across different modalities. To formalize the rematching idea in OT first we propose a self-supervised cost function that automatically learns from explicit similarity-cost mapping relation. Second we present to model a partial OT problem while restricting the transport among false positives to further boost refined alignments. Extensive experiments on three benchmarks demonstrate our L2RM significantly improves the robustness against PMPs for existing models. The code is available at https://github.com/hhc1997/L2RM. 
+ + + + CDMAD: Class-Distribution-Mismatch-Aware Debiasing for Class-Imbalanced Semi-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_CDMAD_Class-Distribution-Mismatch-Aware_Debiasing_for_Class-Imbalanced_Semi-Supervised_Learning_CVPR_2024_paper.pdf + Pseudo-label-based semi-supervised learning (SSL) algorithms trained on a class-imbalanced set face two cascading challenges: 1) Classifiers tend to be biased towards majority classes and 2) Biased pseudo-labels are used for training. It is difficult to appropriately re-balance the classifiers in SSL because the class distribution of an unlabeled set is often unknown and could be mismatched with that of a labeled set. We propose a novel class-imbalanced SSL algorithm called class-distribution-mismatch-aware debiasing (CDMAD). For each iteration of training CDMAD first assesses the classifier's biased degree towards each class by calculating the logits on an image without any patterns (e.g. solid color image) which can be considered irrelevant to the training set. CDMAD then refines biased pseudo-labels of the base SSL algorithm by ensuring the classifier's neutrality. CDMAD uses these refined pseudo-labels during the training of the base SSL algorithm to improve the quality of the representations. In the test phase CDMAD similarly refines biased class predictions on test samples. CDMAD can be seen as an extension of post-hoc logit adjustment to address a challenge of incorporating the unknown class distribution of the unlabeled set for re-balancing the biased classifier under class distribution mismatch. CDMAD ensures Fisher consistency for the balanced error. Extensive experiments verify the effectiveness of CDMAD. + + + + PanoPose: Self-supervised Relative Pose Estimation for Panoramic Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Tu_PanoPose_Self-supervised_Relative_Pose_Estimation_for_Panoramic_Images_CVPR_2024_paper.pdf + Scaled relative pose estimation i.e. estimating relative rotation and scaled relative translation between two images has always been a major challenge in global Structure-from-Motion (SfM). This difficulty arises because the two-view relative translation computed by traditional geometric vision methods e.g. the five-point algorithm is scaleless. Many researchers have proposed diverse translation averaging methods to solve this problem. Instead of solving the problem in the motion averaging phase we focus on estimating scaled relative pose with the help of panoramic cameras and deep neural networks. In this paper a novel network namely PanoPose is proposed to estimate the relative motion in a fully self-supervised manner and a global SfM pipeline is built for panorama images. The proposed PanoPose comprises a depth-net and a pose-net with self-supervision achieved by reconstructing the reference image from its neighboring images based on the estimated depth and relative pose. To maintain precise pose estimation under large viewing angle differences we randomly rotate the panoramic images and pre-train the pose-net with images before and after the rotation. To enhance scale accuracy a fusion block is introduced to incorporate depth information into pose estimation. Extensive experiments on panoramic SfM datasets demonstrate the effectiveness of PanoPose compared with state-of-the-arts. 
+ + + + Describing Differences in Image Sets with Natural Language + http://openaccess.thecvf.com//content/CVPR2024/papers/Dunlap_Describing_Differences_in_Image_Sets_with_Natural_Language_CVPR_2024_paper.pdf + How do two sets of images differ? Discerning set-level differences is crucial for understanding model behaviors and analyzing datasets yet manually sifting through thousands of images is impractical. To aid in this discovery process we explore the task of automatically describing the differences between two sets of images which we term Set Difference Captioning. This task takes in image sets $\mathcal{D}_A$ and $\mathcal{D}_B$ and outputs a description that is more often true on $\mathcal{D}_A$ than $\mathcal{D}_B$. We outline a two-stage approach that first proposes candidate difference descriptions from image sets and then re-ranks the candidates by checking how well they can differentiate the two sets. We introduce VisDiff which first captions the images and prompts a language model to propose candidate descriptions then re-ranks these descriptions using CLIP. To evaluate VisDiff we collect VisDiffBench a dataset with 187 paired image sets with ground truth difference descriptions. We apply VisDiff to various domains such as comparing datasets (e.g. ImageNet vs. ImageNetV2) comparing classification models (e.g. zero-shot CLIP vs. supervised ResNet) characterizing differences between generative models (e.g. StableDiffusionV1 and V2) and discovering what makes images memorable. Using VisDiff we are able to find interesting and previously unknown differences in datasets and models demonstrating its utility in revealing nuanced insights. + + + + Fully Geometric Panoramic Localization + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Fully_Geometric_Panoramic_Localization_CVPR_2024_paper.pdf + We introduce a lightweight and accurate localization method that only utilizes the geometry of 2D-3D lines. Given a pre-captured 3D map our approach localizes a panorama image taking advantage of the holistic 360 degree view. The system mitigates potential privacy breaches or domain discrepancies by avoiding trained or hand-crafted visual descriptors. However as lines alone can be ambiguous we express distinctive yet compact spatial contexts from relationships between lines namely the dominant directions of parallel lines and the intersection between non-parallel lines. The resulting representations are efficient in processing time and memory compared to conventional visual descriptor-based methods. Given the groups of dominant line directions and their intersections we accelerate the search process to test thousands of pose candidates in less than a millisecond without sacrificing accuracy. We empirically show that the proposed 2D-3D matching can localize panoramas for challenging scenes with similar structures dramatic domain shifts or illumination changes. Our fully geometric approach does not involve extensive parameter tuning or neural network training making it a practical algorithm that can be readily deployed in the real world. Project page including the code is available through this link: https://82magnolia.github.io/fgpl/. + + + + NeRF Director: Revisiting View Selection in Neural Volume Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_NeRF_Director_Revisiting_View_Selection_in_Neural_Volume_Rendering_CVPR_2024_paper.pdf + Neural Rendering representations have significantly contributed to the field of 3D computer vision.
Given their potential considerable efforts have been invested to improve their performance. Nonetheless the essential question of selecting training views is yet to be thoroughly investigated. This key aspect plays a vital role in achieving high-quality results and aligns with the well-known tenet of deep learning: "garbage in garbage out". In this paper we first illustrate the importance of view selection by demonstrating how a simple rotation of the test views within the most pervasive NeRF dataset can lead to consequential shifts in the performance rankings of state-of-the-art techniques. To address this challenge we introduce a unified framework for view selection methods and devise a thorough benchmark to assess its impact. Significant improvements can be achieved without leveraging error or uncertainty estimation but focusing on uniform view coverage of the reconstructed object resulting in a training-free approach. Using this technique we show that high-quality renderings can be achieved faster by using fewer views. We conduct extensive experiments on both synthetic datasets and realistic data to demonstrate the effectiveness of our proposed method compared with random conventional error-based and uncertainty-guided view selection. + + + + SonicVisionLM: Playing Sound with Vision Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_SonicVisionLM_Playing_Sound_with_Vision_Language_Models_CVPR_2024_paper.pdf + There has been a growing interest in the task of generating sound for silent videos primarily because of its practicality in streamlining video post-production. However existing methods for video-sound generation attempt to directly create sound from visual representations which can be challenging due to the difficulty of aligning visual representations with audio representations. In this paper we present SonicVisionLM a novel framework aimed at generating a wide range of sound effects by leveraging vision-language models(VLMs). Instead of generating audio directly from video we use the capabilities of powerful VLMs. When provided with a silent video our approach first identifies events within the video using a VLM to suggest possible sounds that match the video content. This shift in approach transforms the challenging task of aligning image and audio into more well-studied sub-problems of aligning image-to-text and text-to-audio through the popular diffusion models. To improve the quality of audio recommendations with LLMs we have collected an extensive dataset that maps text descriptions to specific sound effects and developed a time-controlled audio adapter. Our approach surpasses current state-of-the-art methods for converting video to audio enhancing synchronization with the visuals and improving alignment between audio and video components. Project page: https://yusiissy.github.io/SonicVisionLM.github.io/ + + + + DiffuScene: Denoising Diffusion Models for Generative Indoor Scene Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_DiffuScene_Denoising_Diffusion_Models_for_Generative_Indoor_Scene_Synthesis_CVPR_2024_paper.pdf + We present DiffuScene for indoor 3D scene synthesis based on a novel scene configuration denoising diffusion model. It generates 3D instance properties stored in an unordered object set and retrieves the most similar geometry for each object configuration which is characterized as a concatenation of different attributes including location size orientation semantics and geometry features. 
We introduce a diffusion network to synthesize a collection of 3D indoor objects by denoising a set of unordered object attributes. Unordered parametrization simplifies and eases the joint distribution approximation. The shape feature diffusion facilitates natural object placements including symmetries. Our method enables many downstream applications including scene completion scene arrangement and text-conditioned scene synthesis. Experiments on the 3D-FRONT dataset show that our method can synthesize more physically plausible and diverse indoor scenes than state-of-the-art methods. Extensive ablation studies verify the effectiveness of our design choice in scene diffusion models. + + + + MCNet: Rethinking the Core Ingredients for Accurate and Efficient Homography Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_MCNet_Rethinking_the_Core_Ingredients_for_Accurate_and_Efficient_Homography_CVPR_2024_paper.pdf + We propose Multiscale Correlation searching homography estimation Network namely MCNet an iterative deep homography estimation architecture. Different from previous approaches that achieve iterative refinement by correlation searching within a single scale MCNet combines the multiscale strategy with correlation searching incurring nearly ignored computational overhead. Moreover MCNet adopts a Fine-Grained Optimization loss function named FGO loss to further boost the network training at the convergent stage which can improve the estimation accuracy without additional computational overhead. According to our experiments using the above two simple strategies can produce significant homography estimation accuracy with considerable efficiency. We show that MCNet achieves state-of-the-art performance on a variety of datasets including common scene MSCOCO cross-modal scene GoogleEarth and GoogleMap and dynamic scene SPID. Compared to the previous SOTA method 2-scale RHWF our MCNet reduces inference time FLOPs parameter cost and memory cost by 78.9% 73.5% 34.1% and 33.2% respectively while achieving 20.5% (MSCOCO) 43.4% (GoogleEarth) and 41.1% (GoogleMap) mean average corner error (MACE) reduction. Source code is available at https://github.com/zjuzhk/MCNet. + + + + Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_Boosting_Continual_Learning_of_Vision-Language_Models_via_Mixture-of-Experts_Adapters_CVPR_2024_paper.pdf + Continual learning can empower vision-language models to continuously acquire new knowledge without the need for access to the entire historical dataset. However mitigating the performance degradation in large-scale models is non-trivial due to (i) parameter shifts throughout lifelong learning and (ii) significant computational burdens associated with full-model tuning. In this work we present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models. Our approach involves the dynamic expansion of a pre-trained CLIP model through the integration of Mixture-of-Experts (MoE) adapters in response to new tasks. To preserve the zero-shot recognition capability of vision-language models we further introduce a Distribution Discriminative Auto-Selector (DDAS) that automatically routes in-distribution and out-of-distribution inputs to the MoE Adapter and the original CLIP respectively. 
Through extensive experiments across various settings our proposed method consistently outperforms previous state-of-the-art approaches while concurrently reducing parameter training burdens by 60%. + + + + Benchmarking Implicit Neural Representation and Geometric Rendering in Real-Time RGB-D SLAM + http://openaccess.thecvf.com//content/CVPR2024/papers/Hua_Benchmarking_Implicit_Neural_Representation_and_Geometric_Rendering_in_Real-Time_RGB-D_CVPR_2024_paper.pdf + Implicit neural representation (INR) in combination with geometric rendering has recently been employed in real-time dense RGB-D SLAM. Despite active research endeavors being made there lacks a unified protocol for fair evaluation impeding the evolution of this area. In this work we establish to our knowledge the first open-source benchmark framework to evaluate the performance of a wide spectrum of commonly used INRs and rendering functions for mapping and localization. The goal of our benchmark is to 1) gain an intuition of how different INRs and rendering functions impact mapping and localization and 2) establish a unified evaluation protocol w.r.t. the design choices that may impact the mapping and localization. With the framework we conduct a large suite of experiments offering various insights in choosing the INRs and geometric rendering functions: for example the dense feature grid outperforms other INRs (e.g. tri-plane and hash grid) even when geometric and color features are jointly encoded for memory efficiency. To extend the findings into the practical scenario a hybrid encoding strategy is proposed to bring the best of the accuracy and completion from the grid-based and decomposition-based INRs. We further propose explicit hybrid encoding for high-fidelity dense grid mapping to comply with the RGB-D SLAM system that puts the premise on robustness and computation efficiency. + + + + SuperSVG: Superpixel-based Scalable Vector Graphics Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_SuperSVG_Superpixel-based_Scalable_Vector_Graphics_Synthesis_CVPR_2024_paper.pdf + SVG (Scalable Vector Graphics) is a widely used graphics format that possesses excellent scalability and editability. Image vectorization that aims to convert raster images to SVGs is an important yet challenging problem in computer vision and graphics. Existing image vectorization methods either suffer from low reconstruction accuracy for complex images or require long computation time. To address this issue we propose SuperSVG a superpixel-based vectorization model that achieves fast and high-precision image vectorization. Specifically we decompose the input image into superpixels to help the model focus on areas with similar colors and textures. Then we propose a two-stage self-training framework where a coarse-stage model is employed to reconstruct the main structure and a refinement-stage model is used for enriching the details. Moreover we propose a novel dynamic path warping loss to help the refinement-stage model to inherit knowledge from the coarse-stage model. Extensive qualitative and quantitative experiments demonstrate the superior performance of our method in terms of reconstruction accuracy and inference time compared to state-of-the-art approaches. The code is available in https://github.com/sjtuplayer/SuperSVG. 
+ + + + AV2AV: Direct Audio-Visual Speech to Audio-Visual Speech Translation with Unified Audio-Visual Speech Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/Choi_AV2AV_Direct_Audio-Visual_Speech_to_Audio-Visual_Speech_Translation_with_Unified_CVPR_2024_paper.pdf + This paper proposes a novel direct Audio-Visual Speech to Audio-Visual Speech Translation (AV2AV) framework where the input and output of the system are multimodal (i.e. audio and visual speech). With the proposed AV2AV two key advantages can be brought: 1) We can perform real-like conversations with individuals worldwide in a virtual meeting by utilizing our own primary languages. In contrast to Speech-to-Speech Translation (A2A) which solely translates between audio modalities the proposed AV2AV directly translates between audio-visual speech. This capability enhances the dialogue experience by presenting synchronized lip movements along with the translated speech. 2) We can improve the robustness of the spoken language translation system. By employing the complementary information of audio-visual speech the system can effectively translate spoken language even in the presence of acoustic noise showcasing robust performance. To mitigate the problem of the absence of a parallel AV2AV translation dataset we propose to train our spoken language translation system with the audio-only dataset of A2A. This is done by learning unified audio-visual speech representations through self-supervised learning in advance to train the translation system. Moreover we propose an AV-Renderer that can generate raw audio and video in parallel. It is designed with zero-shot speaker modeling thus the speaker in source audio-visual speech can be maintained at the target translated audio-visual speech. The effectiveness of AV2AV is evaluated with extensive experiments in a many-to-many language translation setting. Demo page is available on choijeongsoo.github.io/av2av. + + + + Contrastive Mean-Shift Learning for Generalized Category Discovery + http://openaccess.thecvf.com//content/CVPR2024/papers/Choi_Contrastive_Mean-Shift_Learning_for_Generalized_Category_Discovery_CVPR_2024_paper.pdf + We address the problem of generalized category discovery (GCD) that aims to partition a partially labeled collection of images; only a small part of the collection is labeled and the total number of target classes is unknown. To address this generalized image clustering problem we revisit the mean-shift algorithm i.e. a classic powerful technique for mode seeking and incorporate it into a contrastive learning framework. The proposed method dubbed Contrastive Mean-Shift (CMS) learning trains an embedding network to produce representations with better clustering properties by an iterative process of mean shift and contrastive update. Experiments demonstrate that our method both in settings with and without the total number of clusters being known achieves state-of-the-art performance on six public GCD benchmarks without bells and whistles. + + + + Improving Depth Completion via Depth Feature Upsampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Improving_Depth_Completion_via_Depth_Feature_Upsampling_CVPR_2024_paper.pdf + The encoder-decoder network (ED-Net) is a commonly employed choice for existing depth completion methods but its working mechanism is ambiguous. In this paper we visualize the internal feature maps to analyze how the network densifies the input sparse depth. 
We find that the encoder features of ED-Net focus on the areas around the input depth points. To obtain a dense feature and thus estimate complete depth the decoder feature tends to complement and enhance the encoder feature by skip-connection to make the fused encoder-decoder feature dense resulting in the decoder feature also exhibiting sparsity. However ED-Net obtains the sparse decoder feature from the dense fused feature at the previous stage where the "dense to sparse" process destroys the completeness of features and loses information. To address this issue we present a depth feature upsampling network (DFU) that explicitly utilizes these dense features to guide the upsampling of a low-resolution (LR) depth feature to a high-resolution (HR) one. The completeness of features is maintained throughout the upsampling process thus avoiding information loss. Furthermore we propose a confidence-aware guidance module (CGM) which is confidence-aware and performs guidance with adaptive receptive fields (GARF) to fully exploit the potential of these dense features as guidance. Experimental results show that our DFU a plug-and-play module can significantly improve the performance of existing ED-Net based methods with limited computational overheads and new SOTA results are achieved. Besides the generalization capability on sparser depth is also enhanced. Project page: https://npucvr.github.io/DFU. + + + + SNI-SLAM: Semantic Neural Implicit SLAM + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_SNI-SLAM_Semantic_Neural_Implicit_SLAM_CVPR_2024_paper.pdf + We propose SNI-SLAM a semantic SLAM system utilizing neural implicit representation that simultaneously performs accurate semantic mapping high-quality surface reconstruction and robust camera tracking. In this system we introduce hierarchical semantic representation to allow multi-level semantic comprehension for top-down structured semantic mapping of the scene. In addition to fully utilize the correlation between multiple attributes of the environment we integrate appearance geometry and semantic features through cross-attention for feature collaboration. This strategy enables a more multifaceted understanding of the environment thereby allowing SNI-SLAM to remain robust even when a single attribute is defective. Then we design an internal fusion-based decoder to obtain semantic RGB Truncated Signed Distance Field (TSDF) values from multi-level features for accurate decoding. Furthermore we propose a feature loss to update the scene representation at the feature level. Compared with low-level losses such as RGB loss and depth loss our feature loss is capable of guiding the network optimization on a higher level. Our SNI-SLAM method demonstrates superior performance over all recent NeRF-based SLAM methods in terms of mapping and tracking accuracy on Replica and ScanNet datasets while also showing excellent capabilities in accurate semantic segmentation and real-time semantic mapping. Codes will be available at https://github.com/IRMVLab/SNI-SLAM. + + + + Building a Strong Pre-Training Baseline for Universal 3D Large-Scale Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Building_a_Strong_Pre-Training_Baseline_for_Universal_3D_Large-Scale_Perception_CVPR_2024_paper.pdf + An effective pre-training framework with universal 3D representations is extremely desired in perceiving large-scale dynamic scenes.
However establishing such an ideal framework that is both task-generic and label-efficient poses a challenge in unifying the representation of the same primitive across diverse scenes. The current contrastive 3D pre-training methods typically follow a frame-level consistency which focuses on the 2D-3D relationships in each detached image. Such inconsiderate consistency greatly hampers the promising path of reaching a universal pre-training framework: (1) The cross-scene semantic self-conflict i.e. the intense collision between primitive segments of the same semantics from different scenes; (2) Lacking a globally unified bond that pushes the cross-scene semantic consistency into 3D representation learning. To address the above challenges we propose a CSC framework that puts a scene-level semantic consistency at the heart bridging the connection of the similar semantic segments across various scenes. To achieve this goal we combine the coherent semantic cues provided by the vision foundation model and the knowledge-rich cross-scene prototypes derived from the complementary multi-modality information. These allow us to train a universal 3D pre-training model that facilitates various downstream tasks with less fine-tuning efforts. Empirically we achieve consistent improvements over SOTA pre-training approaches in semantic segmentation (+1.4% mIoU) object detection (+1.0% mAP) and panoptic segmentation (+3.0% PQ) using their task-specific 3D network on nuScenes. Code is released at https://github.com/chenhaomingbob/CSC hoping to inspire future research. + + + + DS-NeRV: Implicit Neural Video Representation with Decomposed Static and Dynamic Codes + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_DS-NeRV_Implicit_Neural_Video_Representation_with_Decomposed_Static_and_Dynamic_CVPR_2024_paper.pdf + Implicit neural representations for video (NeRV) have recently become a novel way for high-quality video representation. However existing works employ a single network to represent the entire video which implicitly confuses static and dynamic information. This leads to an inability to effectively compress the redundant static information and a lack of explicit modeling of global temporal-coherent dynamic details. To solve the above problems we propose DS-NeRV which decomposes videos into sparse learnable static codes and dynamic codes without the need for explicit optical flow or residual supervision. By setting different sampling rates for two codes and applying weighted sum and interpolation sampling methods DS-NeRV efficiently utilizes redundant static information while maintaining high-frequency details. Additionally we design a cross-channel attention-based (CCA) fusion module to efficiently fuse these two codes for frame decoding. Our approach achieves a high quality reconstruction of 31.2 PSNR with only 0.35M parameters thanks to separate static and dynamic codes representation and outperforms existing NeRV methods in many downstream tasks. Our project website is at https://haoyan14.github.io/DS-NeRV. + + + + SDSTrack: Self-Distillation Symmetric Adapter Learning for Multi-Modal Visual Object Tracking + http://openaccess.thecvf.com//content/CVPR2024/papers/Hou_SDSTrack_Self-Distillation_Symmetric_Adapter_Learning_for_Multi-Modal_Visual_Object_Tracking_CVPR_2024_paper.pdf + Multimodal Visual Object Tracking (VOT) has recently gained significant attention due to its robustness.
Early research focused on fully fine-tuning RGB-based trackers which was inefficient and lacked generalized representation due to the scarcity of multimodal data. Therefore recent studies have utilized prompt tuning to transfer pre-trained RGB-based trackers to multimodal data. However the modality gap limits pre-trained knowledge recall and the dominance of the RGB modality persists preventing the full utilization of information from other modalities. To address these issues we propose a novel symmetric multimodal tracking framework called SDSTrack. We introduce lightweight adaptation for efficient fine-tuning which directly transfers the feature extraction ability from RGB to other domains with a small number of trainable parameters and integrates multimodal features in a balanced symmetric manner. Furthermore we design a complementary masked patch distillation strategy to enhance the robustness of trackers in complex environments such as extreme weather poor imaging and sensor failure. Extensive experiments demonstrate that SDSTrack outperforms state-of-the-art methods in various multimodal tracking scenarios including RGB+Depth RGB+Thermal and RGB+Event tracking and exhibits impressive results in extreme conditions. Our source code is available at : https://github.com/hoqolo/SDSTrack. + + + + Semantic Shield: Defending Vision-Language Models Against Backdooring and Poisoning via Fine-grained Knowledge Alignment + http://openaccess.thecvf.com//content/CVPR2024/papers/Ishmam_Semantic_Shield_Defending_Vision-Language_Models_Against_Backdooring_and_Poisoning_via_CVPR_2024_paper.pdf + In recent years there has been enormous interest in vision-language models trained using self-supervised objectives. However the use of large-scale datasets scraped from the web for training also makes these models vulnerable to potential security threats such as backdooring and poisoning attacks. In this paper we propose a method for mitigating such attacks on contrastively trained vision-language models. Our approach leverages external knowledge extracted from a language model to prevent models from learning correlations between image regions which lack strong alignment with external knowledge. We do this by imposing constraints to enforce that attention paid by the model to visual regions is proportional to the alignment of those regions with external knowledge. We conduct extensive experiments using a variety of recent backdooring and poisoning attacks on multiple datasets and architectures. Our results clearly demonstrate that our proposed approach is highly effective at defending against such attacks across multiple settings while maintaining model utility and without requiring any changes at inference time. + + + + Rethinking the Up-Sampling Operations in CNN-based Generative Network for Generalizable Deepfake Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_Rethinking_the_Up-Sampling_Operations_in_CNN-based_Generative_Network_for_Generalizable_CVPR_2024_paper.pdf + Recently the proliferation of highly realistic synthetic images facilitated through a variety of GANs and Diffusions has significantly heightened the susceptibility to misuse. While the primary focus of deepfake detection has traditionally centered on the design of detection algorithms an investigative inquiry into the generator architectures has remained conspicuously absent in recent years. 
This paper contributes to this lacuna by rethinking the architectures of CNN-based generator thereby establishing a generalized representation of synthetic artifacts. Our findings illuminate that the up-sampling operator can beyond frequency-based artifacts produce generalized forgery artifacts. In particular the local interdependence among image pixels caused by upsampling operators is significantly demonstrated in synthetic images generated by GAN or diffusion. Building upon this observation we introduce the concept of Neighboring Pixel Relationships(NPR) as a means to capture and characterize the generalized structural artifacts stemming from up-sampling operations. A comprehensive analysis is conducted on an open-world dataset comprising samples generated by 28 distinct generative models. This analysis culminates in the establishment of a novel state-of-the-art performance showcasing a remarkable 12.8% improvement over existing methods. The code is available at https://github.com/chuangchuangtan/NPR-DeepfakeDetection. + + + + GlitchBench: Can Large Multimodal Models Detect Video Game Glitches? + http://openaccess.thecvf.com//content/CVPR2024/papers/Taesiri_GlitchBench_Can_Large_Multimodal_Models_Detect_Video_Game_Glitches_CVPR_2024_paper.pdf + Large multimodal models (LMMs) have evolved from large language models (LLMs) to integrate multiple input modalities such as visual inputs. This integration augments the capacity of LLMs for tasks requiring visual comprehension and reasoning. However the extent and limitations of their enhanced abilities are not fully understood especially when it comes to real-world tasks. To address this gap we introduce GlitchBench a novel benchmark derived from video game quality assurance tasks to test and evaluate the reasoning capabilities of LMMs. Our benchmark is curated from a variety of unusual and glitched scenarios from video games and aims to challenge both the visual and linguistic reasoning powers of LMMs in detecting and interpreting out-of-the-ordinary events. We evaluate multiple state-of-the-art LMMs and we show that GlitchBench presents a new challenge for these models. Code and data are available at: https://glitchbench.github.io/ + + + + Density-guided Translator Boosts Synthetic-to-Real Unsupervised Domain Adaptive Segmentation of 3D Point Clouds + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_Density-guided_Translator_Boosts_Synthetic-to-Real_Unsupervised_Domain_Adaptive_Segmentation_of_3D_CVPR_2024_paper.pdf + 3D synthetic-to-real unsupervised domain adaptive segmentation is crucial to annotating new domains. Self-training is a competitive approach for this task but its performance is limited by different sensor sampling patterns (i.e. variations in point density) and incomplete training strategies. In this work we propose a density-guided translator (DGT) which translates point density between domains and integrates it into a two-stage self-training pipeline named DGT-ST. First in contrast to existing works that simultaneously conduct data generation and feature/output alignment within unstable adversarial training we employ the non-learnable DGT to bridge the domain gap at the input level. Second to provide a well-initialized model for self-training we propose a category-level adversarial network in stage one that utilizes the prototype to prevent negative transfer. 
Finally by leveraging the designs above a domain-mixed self-training method with source-aware consistency loss is proposed in stage two to narrow the domain gap further. Experiments on two synthetic-to-real segmentation tasks (SynLiDAR → semanticKITTI and SynLiDAR → semanticPOSS) demonstrate that DGT-ST outperforms state-of-the-art methods achieving 9.4% and 4.3% mIoU improvements respectively. Code is available at https://github.com/yuan-zm/DGT-ST. + + + + Neural Spline Fields for Burst Image Fusion and Layer Separation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chugunov_Neural_Spline_Fields_for_Burst_Image_Fusion_and_Layer_Separation_CVPR_2024_paper.pdf + Each photo in an image burst can be considered a sample of a complex 3D scene: the product of parallax diffuse and specular materials scene motion and illuminant variation. While decomposing all of these effects from a stack of misaligned images is a highly ill-conditioned task the conventional align-and-merge burst pipeline takes the other extreme: blending them into a single image. In this work we propose a versatile intermediate representation: a two-layer alpha-composited image plus flow model constructed with neural spline fields -- networks trained to map input coordinates to spline control points. Our method is able to during test-time optimization jointly fuse a burst image capture into one high-resolution reconstruction and decompose it into transmission and obstruction layers. Then by discarding the obstruction layer we can perform a range of tasks including seeing through occlusions reflection suppression and shadow removal. Tested on complex in-the-wild captures we find that with no post-processing steps or learned priors our generalizable model is able to outperform existing dedicated single-image and multi-view obstruction removal approaches. + + + + NAPGuard: Towards Detecting Naturalistic Adversarial Patches + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_NAPGuard_Towards_Detecting_Naturalistic_Adversarial_Patches_CVPR_2024_paper.pdf + Recently the emergence of naturalistic adversarial patch (NAP) which possesses a deceptive appearance and various representations underscores the necessity of developing robust detection strategies. However existing approaches fail to differentiate the deep-seated natures in adversarial patches i.e. aggressiveness and naturalness leading to unsatisfactory precision and generalization against NAPs. To tackle this issue we propose NAPGuard to provide strong detection capability against NAPs via the elaborated critical feature modulation framework. For improving precision we propose the aggressive feature aligned learning to enhance the model's capability in capturing accurate aggressive patterns. Considering the challenge of inaccurate model learning caused by deceptive appearance we align the aggressive features by the proposed pattern alignment loss during training. Since the model could learn more accurate aggressive patterns it is able to detect deceptive patches more precisely. To enhance generalization we design the natural feature suppressed inference to universally mitigate the disturbance from different NAPs. Since various representations arise in diverse disturbing forms to hinder generalization we suppress the natural features in a unified approach via the feature shield module. Therefore the models could recognize NAPs within less disturbance and activate the generalized detection ability.
Extensive experiments show that our method surpasses state-of-the-art methods by large margins in detecting NAPs (improve 60.24% AP@0.5 on average). + + + + Unified Language-driven Zero-shot Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Unified_Language-driven_Zero-shot_Domain_Adaptation_CVPR_2024_paper.pdf + This paper introduces Unified Language-driven Zero-shot Domain Adaptation (ULDA) a novel task setting that enables a single model to adapt to diverse target domains without explicit domain-ID knowledge. We identify the constraints in the existing language-driven zero-shot domain adaptation task particularly the requirement for domain IDs and domain-specific models which may restrict flexibility and scalability. To overcome these issues we propose a new framework for ULDA consisting of Hierarchical Context Alignment (HCA) Domain Consistent Representation Learning (DCRL) and Text-Driven Rectifier (TDR). These components work synergistically to align simulated features with target text across multiple visual levels retain semantic correlations between different regional representations and rectify biases between simulated and real target visual features respectively. Our extensive empirical evaluations demonstrate that this framework achieves competitive performance in both settings surpassing even the model that requires domain-ID showcasing its superiority and generalization ability. The proposed method is not only effective but also maintains practicality and efficiency as it does not introduce additional computational costs during inference. The code is available on the project website. + + + + Equivariant Multi-Modality Image Fusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Equivariant_Multi-Modality_Image_Fusion_CVPR_2024_paper.pdf + Multi-modality image fusion is a technique that combines information from different sensors or modalities enabling the fused image to retain complementary features from each modality such as functional highlights and texture details. However effective training of such fusion models is challenging due to the scarcity of ground truth fusion data. To tackle this issue we propose the Equivariant Multi-Modality imAge fusion (EMMA) paradigm for end-to-end self-supervised learning. Our approach is rooted in the prior knowledge that natural imaging responses are equivariant to certain transformations. Consequently we introduce a novel training paradigm that encompasses a fusion module a pseudo-sensing module and an equivariant fusion module. These components enable the net training to follow the principles of the natural sensing-imaging process while satisfying the equivariant imaging prior. Extensive experiments confirm that EMMA yields high-quality fusion results for infrared-visible and medical images concurrently facilitating downstream multi-modal segmentation and detection tasks. The code is available at https://github.com/Zhaozixiang1228/MMIF-EMMA. + + + + NeLF-Pro: Neural Light Field Probes for Multi-Scale Novel View Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/You_NeLF-Pro_Neural_Light_Field_Probes_for_Multi-Scale_Novel_View_Synthesis_CVPR_2024_paper.pdf + We present NeLF-Pro a novel representation to model and reconstruct light fields in diverse natural scenes that vary in extent and spatial granularity. 
In contrast to previous fast reconstruction methods that represent the 3D scene globally we model the light field of a scene as a set of local light field feature probes parameterized with position and multi-channel 2D feature maps. Our central idea is to bake the scene's light field into spatially varying learnable representations and to query point features by weighted blending of probes close to the camera - allowing for mipmap representation and rendering. We introduce a novel vector-matrix-matrix (VMM) factorization technique that effectively represents the light field feature probes as products of core factors (i.e. VM) shared among local feature probes and a basis factor (i.e. M) - efficiently encoding internal relationships and patterns within the scene. Experimentally we demonstrate that NeLF-Pro significantly boosts the performance of feature grid-based representations and achieves fast reconstruction with better rendering quality while maintaining compact modeling. Project page: sinoyou.github.io/nelf-pro + + + + Solving Masked Jigsaw Puzzles with Diffusion Vision Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Solving_Masked_Jigsaw_Puzzles_with_Diffusion_Vision_Transformers_CVPR_2024_paper.pdf + Solving image and video jigsaw puzzles poses the challenging task of rearranging image fragments or video frames from unordered sequences to restore meaningful images and video sequences. Existing approaches often hinge on discriminative models tasked with predicting either the absolute positions of puzzle elements or the permutation actions applied to the original data. Unfortunately these methods face limitations in effectively solving puzzles with a large number of elements. In this paper we propose JPDVT an innovative approach that harnesses diffusion transformers to address this challenge. Specifically we generate positional information for image patches or video frames conditioned on their underlying visual content. This information is then employed to accurately assemble the puzzle pieces in their correct positions even in scenarios involving missing pieces. Our method achieves state-of-the-art performance on several datasets. + + + + Fully Exploiting Every Real Sample: SuperPixel Sample Gradient Model Stealing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Fully_Exploiting_Every_Real_Sample_SuperPixel_Sample_Gradient_Model_Stealing_CVPR_2024_paper.pdf + Model stealing (MS) involves querying and observing the output of a machine learning model to steal its capabilities. The quality of queried data is crucial yet obtaining a large amount of real data for MS is often challenging. Recent works have reduced reliance on real data by using generative models. However when high-dimensional query data is required these methods are impractical due to the high costs of querying and the risk of model collapse. In this work we propose using sample gradients (SG) to enhance the utility of each real sample as SG provides crucial guidance on the decision boundaries of the victim model. However utilizing SG in the model stealing scenario faces two challenges: 1. Pixel-level gradient estimation requires extensive query volume and is susceptible to defenses. 2. The estimation of sample gradients has a significant variance. This paper proposes Superpixel Sample Gradient stealing (SPSG) for model stealing under the constraint of limited real samples.
With the basic idea of imitating the victim model's low-variance patch-level gradients instead of pixel-level gradients SPSG achieves efficient sample gradient estimation through two steps. First we perform patch-wise perturbations on query images to estimate the average gradient in different regions of the image. Then we filter the gradients through a threshold strategy to reduce variance. Exhaustive experiments demonstrate that with the same number of real samples SPSG achieves accuracy agreements and adversarial success rate significantly surpassing the current state-of-the-art MS methods. Codes are available at https://github.com/zyl123456aB/SPSG_attack. + + + + Progressive Divide-and-Conquer via Subsampling Decomposition for Accelerated MRI + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Progressive_Divide-and-Conquer_via_Subsampling_Decomposition_for_Accelerated_MRI_CVPR_2024_paper.pdf + Deep unfolding networks (DUN) have emerged as a popular iterative framework for accelerated magnetic resonance imaging (MRI) reconstruction. However conventional DUN aims to reconstruct all the missing information within the entire space in each iteration. Thus it could be challenging when dealing with highly ill-posed degradation often resulting in subpar reconstruction. In this work we propose a Progressive Divide-And-Conquer (PDAC) strategy aiming to break down the subsampling process in the actual severe degradation and thus perform reconstruction sequentially. Starting from decomposing the original maximum-a-posteriori problem of accelerated MRI we present a rigorous derivation of the proposed PDAC framework which could be further unfolded into an end-to-end trainable network. Each PDAC iteration specifically targets a distinct segment of moderate degradation based on the decomposition. Furthermore as part of the PDAC iteration such decomposition is adaptively learned as an auxiliary task through a degradation predictor which provides an estimation of the decomposed sampling mask. Following this prediction the sampling mask is further integrated via a severity conditioning module to ensure awareness of the degradation severity at each stage. Extensive experiments demonstrate that our proposed method achieves superior performance on the publicly available fastMRI and Stanford2D FSE datasets in both multi-coil and single-coil settings. + + + + MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Jin_MV-Adapter_Multimodal_Video_Transfer_Learning_for_Video_Text_Retrieval_CVPR_2024_paper.pdf + State-of-the-art video-text retrieval (VTR) methods typically involve fully fine-tuning a pre-trained model (e.g. CLIP) on specific datasets. However this can result in significant storage costs in practical applications as a separate model per task must be stored. To address this issue we present our pioneering work that enables parameter-efficient VTR using a pre-trained model with only a small number of tunable parameters during training. Towards this goal we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. Specifically MV-Adapter utilizes bottleneck structures in both video and text branches along with two novel components. The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts. 
We also train weights calibrations to adjust to dynamic variations across frames. The second is Cross Modality Tying that generates weights for video/text branches through sharing cross modality factors for better aligning between modalities. Thanks to above innovations MV-Adapter can achieve comparable or better performance than standard fine-tuning with negligible parameters overhead. Notably MV-Adapter consistently outperforms various competing methods in V2T/T2V tasks with large margins on five widely used VTR benchmarks (MSR-VTT MSVD LSMDC DiDemo and ActivityNet). Codes will be released. + + + + Rethinking Multi-view Representation Learning via Distilled Disentangling + http://openaccess.thecvf.com//content/CVPR2024/papers/Ke_Rethinking_Multi-view_Representation_Learning_via_Distilled_Disentangling_CVPR_2024_paper.pdf + Multi-view representation learning aims to derive robust representations that are both view-consistent and view-specific from diverse data sources. This paper presents an in-depth analysis of existing approaches in this domain highlighting a commonly overlooked aspect: the redundancy between view-consistent and view-specific representations. To this end we propose an innovative framework for multi-view representation learning which incorporates a technique we term 'distilled disentangling'. Our method introduces the concept of masked cross-view prediction enabling the extraction of compact high-quality view-consistent representations from various sources without incurring extra computational overhead. Additionally we develop a distilled disentangling module that efficiently filters out consistency-related information from multi-view representations resulting in purer view-specific representations. This approach significantly reduces redundancy between view-consistent and view-specific representations enhancing the overall efficiency of the learning process. Our empirical evaluations reveal that higher mask ratios substantially improve the quality of view-consistent representations. Moreover we find that reducing the dimensionality of view-consistent representations relative to that of view-specific representations further refines the quality of the combined representations. + + + + Targeted Representation Alignment for Open-World Semi-Supervised Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_Targeted_Representation_Alignment_for_Open-World_Semi-Supervised_Learning_CVPR_2024_paper.pdf + Open-world Semi-Supervised Learning aims to classify unlabeled samples utilizing information from labeled data while unlabeled samples are not only from the labeled known categories but also from novel categories previously unseen. Despite the promise current approaches solely rely on hazardous similarity-based clustering algorithms and give unlabeled samples free rein to spontaneously group into distinct novel class clusters. Nevertheless due to the absence of novel class supervision these methods typically suffer from the representation collapse dilemma---features of different novel categories can get closely intertwined and indistinguishable even collapsing into the same cluster and leading to degraded performance. To alleviate this we propose a novel framework TRAILER which targets to attain an optimal feature arrangement revealed by the recently uncovered neural collapse phenomenon. To fulfill this we adopt targeted prototypes that are pre-assigned uniformly with maximum separation and then progressively align the representations to them. 
To further tackle the potential downsides of such stringent alignment we encapsulate a sample-target allocation mechanism with coarse-to-fine refinery that is able to infer label assignments with high quality. Extensive experiments demonstrate that TRAILER outperforms current state-of-the-art methods on generic and fine-grained benchmarks. The code is available at https://github.com/Justherozen/TRAILER. + + + + Efficient Solution of Point-Line Absolute Pose + http://openaccess.thecvf.com//content/CVPR2024/papers/Hruby_Efficient_Solution_of_Point-Line_Absolute_Pose_CVPR_2024_paper.pdf + We revisit certain problems of pose estimation based on 3D--2D correspondences between features which may be points or lines. Specifically we address the two previously-studied minimal problems of estimating camera extrinsics from $p \in \{1, 2\}$ point--point correspondences and $l = 3 - p$ line--line correspondences. To the best of our knowledge all of the previously-known practical solutions to these problems required computing the roots of degree $\ge 4$ (univariate) polynomials when $p = 2$ or degree $\ge 8$ polynomials when $p = 1$. We describe and implement two elementary solutions which reduce the degrees of the needed polynomials from 4 to 2 and from 8 to 4 respectively. We show experimentally that the resulting solvers are numerically stable and fast: when compared to the previous state-of-the-art we may obtain nearly an order of magnitude speedup. The code is available at https://github.com/petrhruby97/efficient_absolute + + + + Text-to-3D using Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Text-to-3D_using_Gaussian_Splatting_CVPR_2024_paper.pdf + Automatic text-to-3D generation that combines Score Distillation Sampling (SDS) with the optimization of volume rendering has achieved remarkable progress in synthesizing realistic 3D objects. Yet most existing text-to-3D methods by SDS and volume rendering suffer from inaccurate geometry e.g. the Janus issue since it is hard to explicitly integrate 3D priors into implicit 3D representations. Besides it is usually time-consuming for them to generate elaborate 3D models with rich colors. In response this paper proposes GSGEN a novel method that adopts Gaussian Splatting a recent state-of-the-art representation to text-to-3D generation. GSGEN aims at generating high-quality 3D objects and addressing existing shortcomings by exploiting the explicit nature of Gaussian Splatting that enables the incorporation of 3D prior. Specifically our method adopts a progressive optimization strategy which includes a geometry optimization stage and an appearance refinement stage. In geometry optimization a coarse representation is established under 3D point cloud diffusion prior along with the ordinary 2D SDS optimization ensuring a sensible and 3D-consistent rough shape. Subsequently the obtained Gaussians undergo an iterative appearance refinement to enrich texture details. In this stage we increase the number of Gaussians by compactness-based densification to enhance continuity and improve fidelity. With these designs our approach can generate 3D assets with delicate details and accurate geometry. Extensive evaluations demonstrate the effectiveness of our method especially for capturing high-frequency components.
+ + + + POPDG: Popular 3D Dance Generation with PopDanceSet + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_POPDG_Popular_3D_Dance_Generation_with_PopDanceSet_CVPR_2024_paper.pdf + Generating dances that are both lifelike and well-aligned with music continues to be a challenging task in the cross-modal domain. This paper introduces PopDanceSet the first dataset tailored to the preferences of young audiences enabling the generation of aesthetically oriented dances. And it surpasses the AIST++ dataset in music genre diversity and the intricacy and depth of dance movements. Moreover the proposed POPDG model within the iDDPM framework enhances dance diversity and through the Space Augmentation Algorithm strengthens spatial physical connections between human body joints ensuring that increased diversity does not compromise generation quality. A streamlined Alignment Module is also designed to improve the temporal alignment between dance and music. Extensive experiments show that POPDG achieves SOTA results on two datasets. Furthermore the paper also expands on current evaluation metrics. The dataset and code are available at https://github.com/Luke-Luo1/POPDG. + + + + Learning without Exact Guidance: Updating Large-scale High-resolution Land Cover Maps from Low-resolution Historical Labels + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Learning_without_Exact_Guidance_Updating_Large-scale_High-resolution_Land_Cover_Maps_CVPR_2024_paper.pdf + Large-scale high-resolution (HR) land-cover mapping is a vital task to survey the Earth's surface and resolve many challenges facing humanity. However it is still a non-trivial task hindered by complex ground details various landforms and the scarcity of accurate training labels over a wide-span geographic area. In this paper we propose an efficient weakly supervised framework (Paraformer) to guide large-scale HR land-cover mapping with easy-access historical land-cover data of low resolution (LR). Specifically existing land-cover mapping approaches reveal the dominance of CNNs in preserving local ground details but still suffer from insufficient global modeling in various landforms. Therefore we design a parallel CNN-Transformer feature extractor in Paraformer consisting of a downsampling-free CNN branch and a Transformer branch to jointly capture local and global contextual information. Besides facing the spatial mismatch of training data a pseudo-label-assisted training (PLAT) module is adopted to reasonably refine LR labels for weakly supervised semantic segmentation of HR images. Experiments on two large-scale datasets demonstrate the superiority of Paraformer over other state-of-the-art methods for automatically updating HR land-cover maps from LR historical labels. + + + + TTA-EVF: Test-Time Adaptation for Event-based Video Frame Interpolation via Reliable Pixel and Sample Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cho_TTA-EVF_Test-Time_Adaptation_for_Event-based_Video_Frame_Interpolation_via_Reliable_CVPR_2024_paper.pdf + Video Frame Interpolation (VFI) which aims at generating high-frame-rate videos from low-frame-rate inputs is a highly challenging task. The emergence of bio-inspired sensors known as event cameras which boast microsecond-level temporal resolution has ushered in a transformative era for VFI. Nonetheless the application of event-based VFI techniques in domains with distinct environments from the training data can be problematic. 
This is mainly because event camera data distribution can undergo substantial variations based on camera settings and scene conditions presenting challenges for effective adaptation. In this paper we propose a test-time adaptation method for event-based VFI to address the gap between the source and target domains. Our approach enables sequential learning in an online manner on the target domain which only provides low-frame-rate videos. We present an approach that leverages confident pixels as pseudo ground-truths enabling stable and accurate online learning from low-frame-rate videos. Furthermore to prevent overfitting during the continuous online process where the same scene is encountered repeatedly we propose a method of blending historical samples with current scenes. Extensive experiments validate the effectiveness of our method both in cross-domain and continuous domain shifting setups. The code is available at https://github.com/Chohoonhee/TTA-EVF. + + + + BEVNeXt: Reviving Dense BEV Frameworks for 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_BEVNeXt_Reviving_Dense_BEV_Frameworks_for_3D_Object_Detection_CVPR_2024_paper.pdf + Recently the rise of query-based Transformer decoders is reshaping camera-based 3D object detection. These query-based decoders are surpassing the traditional dense BEV (Bird's Eye View)-based methods. However we argue that dense BEV frameworks remain important due to their outstanding abilities in depth estimation and object localization depicting 3D scenes accurately and comprehensively. This paper aims to address the drawbacks of the existing dense BEV-based 3D object detectors by introducing our proposed enhanced components including a CRF-modulated depth estimation module enforcing object-level consistencies a long-term temporal aggregation module with extended receptive fields and a two-stage object decoder combining perspective techniques with CRF-modulated depth embedding. These enhancements lead to a "modernized" dense BEV framework dubbed BEVNeXt. On the nuScenes benchmark BEVNeXt outperforms both BEV-based and query-based frameworks under various settings achieving a state-of-the-art result of 64.2 NDS on the nuScenes test set. + + + + LEAD: Learning Decomposition for Source-free Universal Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Qu_LEAD_Learning_Decomposition_for_Source-free_Universal_Domain_Adaptation_CVPR_2024_paper.pdf + Universal Domain Adaptation (UniDA) targets knowledge transfer in the presence of both covariate and label shifts. Recently Source-free Universal Domain Adaptation (SF-UniDA) has emerged to achieve UniDA without access to source data which tends to be more practical due to data protection policies. The main challenge lies in determining whether covariate-shifted samples belong to target-private unknown categories. Existing methods tackle this either through hand-crafted thresholding or by developing time-consuming iterative clustering strategies. In this paper we propose a new idea of LEArning Decomposition (LEAD) which decouples features into source-known and -unknown components to identify target-private data. Technically LEAD initially leverages the orthogonal decomposition analysis for feature decomposition. Then LEAD builds instance-level decision boundaries to adaptively identify target-private data. Extensive experiments across various UniDA scenarios have demonstrated the effectiveness and superiority of LEAD. 
Notably in the OPDA scenario on VisDA dataset LEAD outperforms GLC by 3.5% overall H-score and reduces the time to derive pseudo-labeling decision boundaries by 75%. Besides LEAD is also appealing in that it is complementary to most existing methods. The code is available at https://github.com/ispc-lab/LEAD + + + + OneLLM: One Framework to Align All Modalities with Language + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_OneLLM_One_Framework_to_Align_All_Modalities_with_Language_CVPR_2024_paper.pdf + Multimodal large language models (MLLMs) have gained significant attention due to their strong multimodal understanding capability. However existing works rely heavily on modality-specific encoders which usually differ in architecture and are limited to common modalities. In this paper we present OneLLM an MLLM that aligns eight modalities to language using a unified framework. We achieve this through a unified multimodal encoder and a progressive multimodal alignment pipeline. In detail we first train an image projection module to connect a vision encoder with LLM. Then we build a universal projection module (UPM) by mixing multiple image projection modules and dynamic routing. Finally we progressively align more modalities to LLM with the UPM. To fully leverage the potential of OneLLM in following instructions we also curated a comprehensive multimodal instruction dataset including 2M items from image audio video point cloud depth/normal map IMU and fMRI brain activity. OneLLM is evaluated on 25 diverse benchmarks encompassing tasks such as multimodal captioning question answering and reasoning where it delivers excellent performance. Code data model and online demo are available at https://github.com/csuhan/OneLLM + + + + PAD: Patch-Agnostic Defense against Adversarial Patch Attacks + http://openaccess.thecvf.com//content/CVPR2024/papers/Jing_PAD_Patch-Agnostic_Defense_against_Adversarial_Patch_Attacks_CVPR_2024_paper.pdf + Adversarial patch attacks present a significant threat to real-world object detectors due to their practical feasibility. Existing defense methods which rely on attack data or prior knowledge struggle to effectively address a wide range of adversarial patches. In this paper we show two inherent characteristics of adversarial patches semantic independence and spatial heterogeneity independent of their appearance shape size quantity and location. Semantic independence indicates that adversarial patches operate autonomously within their semantic context while spatial heterogeneity manifests as distinct image quality of the patch area that differs from original clean image due to the independent generation process. Based on these observations we propose PAD a novel adversarial patch localization and removal method that does not require prior knowledge or additional training. PAD offers patch-agnostic defense against various adversarial patches compatible with any pre-trained object detectors. Our comprehensive digital and physical experiments involving diverse patch types such as localized noise printable and naturalistic patches exhibit notable improvements over state-of-the-art works. Our code is available at https://github.com/Lihua-Jing/PAD.
+ + + + MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Tudosiu_MULAN_A_Multi_Layer_Annotated_Dataset_for_Controllable_Text-to-Image_Generation_CVPR_2024_paper.pdf + Text-to-image generation has achieved astonishing results yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering scene layout conditioning or image editing techniques which often require hand drawn masks. Nonetheless pre-existing works struggle to take advantage of the natural instance-level compositionality of scenes due to the typically flat nature of rasterized RGB output images. Towards addressing this challenge we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multi-layer instance-wise RGBA decompositions and over 100K instance images. To build MuLAn we developed a training free pipeline which decomposes a monocular RGB image into a stack of RGBA layers comprising of background and isolated instances. We achieve this through the use of pretrained general-purpose models and by developing three modules: image decomposition for instance discovery and extraction instance completion to reconstruct occluded areas and image re-assembly. We use our pipeline to create MuLAn-COCO and MuLAn-LAION datasets which contain a variety of image decompositions in terms of style composition and complexity. With MuLAn we provide the first photorealistic resource providing instance decomposition and occlusion information for high quality images opening up new avenues for text-to-image generative AI research. With this we aim to encourage the development of novel generation and editing technology in particular layer-wise solutions. MuLAn data resources are available at https://MuLAn-dataset.github.io/ + + + + Unbiased Faster R-CNN for Single-source Domain Generalized Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Unbiased_Faster_R-CNN_for_Single-source_Domain_Generalized_Object_Detection_CVPR_2024_paper.pdf + Single-source domain generalization (SDG) for object detection is a challenging yet essential task as the distribution bias of the unseen domain degrades the algorithm performance significantly. However existing methods attempt to extract domain-invariant features neglecting that the biased data leads the network to learn biased features that are non-causal and poorly generalizable. To this end we propose an Unbiased Faster R-CNN (UFR) for generalizable feature learning. Specifically we formulate SDG in object detection from a causal perspective and construct a Structural Causal Model (SCM) to analyze the data bias and feature bias in the task which are caused by scene confounders and object attribute confounders. Based on the SCM we design a Global-Local Transformation module for data augmentation which effectively simulates domain diversity and mitigates the data bias. Additionally we introduce a Causal Attention Learning module that incorporates a designed attention invariance loss to learn image-level features that are robust to scene confounders. Moreover we develop a Causal Prototype Learning module with an explicit instance constraint and an implicit prototype constraint which further alleviates the negative impact of object attribute confounders. 
Experimental results on five scenes demonstrate the prominent generalization ability of our method with an improvement of 3.9% mAP on the Night-Clear scene. + + + + Super-Resolution Reconstruction from Bayer-Pattern Spike Streams + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_Super-Resolution_Reconstruction_from_Bayer-Pattern_Spike_Streams_CVPR_2024_paper.pdf + Spike camera is a neuromorphic vision sensor that can capture highly dynamic scenes by generating a continuous stream of binary spikes to represent the arrival of photons at very high temporal resolution. Equipped with Bayer color filter array (CFA) color spike camera (CSC) has been invented to capture color information. Although spike camera has already demonstrated great potential for high-speed imaging its spatial resolution is limited compared with conventional digital cameras. This paper proposes a Color Spike Camera Super-Resolution (CSCSR) network to super-resolve higher-resolution color images from spike camera streams with Bayer CFA. To be specific we first propose a representation for Bayer-pattern spike streams exploring local temporal information with global perception to represent the binary data. Then we exploit the CFA layout and sub-pixel level motion to collect temporal pixels for the spatial super-resolution of each color channel. In particular a residual-based module for feature refinement is developed to reduce the impact of motion estimation errors. Considering color correlation we jointly utilize the multi-stage temporal-pixel features of color channels to reconstruct the high-resolution color image. Experimental results demonstrate that the proposed scheme can reconstruct satisfactory color images with both high temporal and spatial resolution from low-resolution Bayer-pattern spike streams. The source codes are available at https://github.com/csycdong/CSCSR. + + + + Stationary Representations: Optimally Approximating Compatibility and Implications for Improved Model Replacements + http://openaccess.thecvf.com//content/CVPR2024/papers/Biondi_Stationary_Representations_Optimally_Approximating_Compatibility_and_Implications_for_Improved_Model_CVPR_2024_paper.pdf + Learning compatible representations enables the interchangeable use of semantic features as models are updated over time. This is particularly relevant in search and retrieval systems where it is crucial to avoid reprocessing of the gallery images with the updated model. While recent research has shown promising empirical evidence there is still a lack of comprehensive theoretical understanding about learning compatible representations. In this paper we demonstrate that the stationary representations learned by the d-Simplex fixed classifier optimally approximate compatibility representation according to the two inequality constraints of its formal definition. This not only establishes a solid foundation for future works in this line of research but also presents implications that can be exploited in practical learning scenarios. An exemplary application is the now-standard practice of downloading and fine-tuning new pre-trained models. Specifically we show the strengths and critical issues of stationary representations in the case in which a model undergoing sequential fine-tuning is asynchronously replaced by downloading a better-performing model pre-trained elsewhere. Such a representation enables seamless delivery of retrieval service (i.e. 
no reprocessing of gallery images) and offers improved performance without operational disruptions during model replacement. Code available at: https://github.com/miccunifi/iamcl2r. + + + + Towards Calibrated Multi-label Deep Neural Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_Towards_Calibrated_Multi-label_Deep_Neural_Networks_CVPR_2024_paper.pdf + The problem of calibrating deep neural networks (DNNs) for multi-label learning is considered. It is well-known that DNNs trained by cross-entropy for single-label or one-hot classification are poorly calibrated. Many calibration techniques have been proposed to address the problem. However little attention has been paid to the calibration of multi-label DNNs. In this literature the focus has been on improving labeling accuracy in the face of severe dataset imbalance. This is addressed by the introduction of asymmetric losses which have become very popular. However these losses do not induce well calibrated classifiers. In this work we first provide a theoretical explanation for this poor calibration performance by showing that these losses lack the strictly proper property a necessary condition for accurate probability estimation. To overcome this problem we propose a new Strictly Proper Asymmetric (SPA) loss. This is complemented by a Label Pair Regularizer (LPR) that increases the number of calibration constraints introduced per training example. The effectiveness of both contributions is validated by extensive experiments on various multi-label datasets. The resulting training method is shown to significantly decrease the calibration error while maintaining state-of-the-art accuracy. + + + + SceneTex: High-Quality Texture Synthesis for Indoor Scenes via Diffusion Priors + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_SceneTex_High-Quality_Texture_Synthesis_for_Indoor_Scenes_via_Diffusion_Priors_CVPR_2024_paper.pdf + We propose SceneTex a novel method for effectively generating high-quality and style-consistent textures for indoor scenes using depth-to-image diffusion priors. Unlike previous methods that either iteratively warp 2D views onto a mesh surface or distill diffusion latent features without accurate geometric and style cues SceneTex formulates the texture synthesis task as an optimization problem in the RGB space where style and geometry consistency are properly reflected. At its core SceneTex proposes a multiresolution texture field to implicitly encode the mesh appearance. We optimize the target texture via a score-distillation-based objective function in respective RGB renderings. To further secure the style consistency across views we introduce a cross-attention decoder to predict the RGB values by cross-attending to the pre-sampled reference locations in each instance. SceneTex enables various and accurate texture synthesis for 3D-FRONT scenes demonstrating significant improvements in visual quality and prompt fidelity over the prior texture generation methods. + + + + TUMTraf V2X Cooperative Perception Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Zimmer_TUMTraf_V2X_Cooperative_Perception_Dataset_CVPR_2024_paper.pdf + Cooperative perception offers several benefits for enhancing the capabilities of autonomous vehicles and improving road safety. Using roadside sensors in addition to onboard sensors increases reliability and extends the sensor range. External sensors offer higher situational awareness for automated vehicles and prevent occlusions.
We propose CoopDet3D a cooperative multi-modal fusion model and TUMTraf-V2X a perception dataset for the cooperative 3D object detection and tracking task. Our dataset contains 2000 labeled point clouds and 5000 labeled images from five roadside and four onboard sensors. It includes 30k 3D boxes with track IDs and precise GPS and IMU data. We labeled nine categories and covered occlusion scenarios with challenging driving maneuvers like traffic violations near-miss events overtaking and U-turns. Through multiple experiments we show that our CoopDet3D camera-LiDAR fusion model achieves an increase of +14.36 3D mAP compared to a vehicle camera-LiDAR fusion model. Finally we make our dataset model labeling tool and devkit publicly available on our website: https://tum-traffic-dataset.github.io/tumtraf-v2x. + + + + SPECAT: SPatial-spEctral Cumulative-Attention Transformer for High-Resolution Hyperspectral Image Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Yao_SPECAT_SPatial-spEctral_Cumulative-Attention_Transformer_for_High-Resolution_Hyperspectral_Image_Reconstruction_CVPR_2024_paper.pdf + Compressive spectral image reconstruction is a critical method for acquiring images with high spatial and spectral resolution. Current advanced methods which involve designing deeper networks or adding more self-attention modules are limited by the scope of attention modules and the irrelevance of attentions across different dimensions. This leads to difficulties in capturing non-local mutation features in the spatial-spectral domain and results in a significant parameter increase but only limited performance improvement. To address these issues we propose SPECAT a SPatial-spEctral Cumulative-Attention Transformer designed for high-resolution hyperspectral image reconstruction. SPECAT utilizes Cumulative-Attention Blocks (CABs) within an efficient hierarchical framework to extract features from non-local spatial-spectral details. Furthermore it employs a projection-object Dual-domain Loss Function (DLF) to integrate the optical path constraint a physical aspect often overlooked in current methodologies. Ultimately SPECAT not only significantly enhances the reconstruction quality of spectral details but also breaks through the bottleneck of mutual restriction between the cost and accuracy in existing algorithms. Our experimental results demonstrate the superiority of SPECAT achieving 40.3 dB in hyperspectral reconstruction benchmarks outperforming the state-of-the-art (SOTA) algorithms by 1.2 dB while using only 5% of the network parameters and 10% of the computational cost. The code is available at https://github.com/THU-luvision/SPECAT. + + + + Attentive Illumination Decomposition Model for Multi-Illuminant White Balancing + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Attentive_Illumination_Decomposition_Model_for_Multi-Illuminant_White_Balancing_CVPR_2024_paper.pdf + White balance (WB) algorithms in many commercial cameras assume single and uniform illumination leading to undesirable results when multiple lighting sources with different chromaticities exist in the scene. Prior research on multi-illuminant WB typically predicts illumination at the pixel level without fully grasping the scene's actual lighting conditions including the number and color of light sources. This often results in unnatural outcomes lacking in overall consistency. 
To handle this problem we present a deep white balancing model that leverages the slot attention where each slot is in charge of representing individual illuminants. This design enables the model to generate chromaticities and weight maps for individual illuminants which are then fused to compose the final illumination map. Furthermore we propose the centroid-matching loss which regulates the activation of each slot based on the color range thereby enhancing the model to separate illumination more effectively. Our method achieves the state-of-the-art performance on both single- and multi-illuminant WB benchmarks and also offers additional information such as the number of illuminants in the scene and their chromaticity. This capability allows for illumination editing an application not feasible with prior methods. + + + + Efficient Stitchable Task Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Efficient_Stitchable_Task_Adaptation_CVPR_2024_paper.pdf + The paradigm of pre-training and fine-tuning has laid the foundation for deploying deep learning models. However most fine-tuning methods are designed to meet a specific resource budget. Recently considering diverse deployment scenarios with various resource budgets SN-Net is introduced to quickly obtain numerous new networks (stitches) from the pre-trained models (anchors) in a model family via model stitching. Although promising SN-Net confronts new challenges when adapting it to new target domains including huge memory and storage requirements and a long and sub-optimal multistage adaptation process. In this work we present a novel framework Efficient Stitchable Task Adaptation (ESTA) to efficiently produce a palette of fine-tuned models that adhere to diverse resource constraints. Specifically we first tailor parameter-efficient fine-tuning to share low-rank updates among the stitches while maintaining independent bias terms. In this way we largely reduce fine-tuning memory burdens and mitigate the interference among stitches that arises in task adaptation. Furthermore we streamline a simple yet effective one-stage deployment pipeline which estimates the important stitches to deploy with training-time gradient statistics. By assigning higher sampling probabilities to important stitches we also get a boosted Pareto frontier. Extensive experiments on 25 downstream visual recognition tasks demonstrate that our ESTA is capable of generating stitches with smooth accuracy-efficiency trade-offs and surpasses the direct SN-Net adaptation by remarkable margins with significantly lower training time and fewer trainable parameters. Furthermore we demonstrate the flexibility and scalability of our ESTA framework by stitching LLMs from LLaMA family obtaining chatbot stitches of assorted sizes. + + + + Image Processing GNN: Breaking Rigidity in Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Tian_Image_Processing_GNN_Breaking_Rigidity_in_Super-Resolution_CVPR_2024_paper.pdf + Super-Resolution (SR) reconstructs high-resolution images from low-resolution ones. CNNs and window-attention methods are two major categories of canonical SR models. However these measures are rigid: in both operations each pixel gathers the same number of neighboring pixels hindering their effectiveness in SR tasks. Alternatively we leverage the flexibility of graphs and propose the Image Processing GNN (IPG) model to break the rigidity that dominates previous SR methods. 
Firstly SR is unbalanced in that most reconstruction efforts are concentrated to a small proportion of detail-rich image parts. Hence we leverage degree flexibility by assigning higher node degrees to detail-rich image nodes. Then in order to construct graphs for SR-effective aggregation we treat images as pixel node sets rather than patch nodes. Lastly we hold that both local and global information are crucial for SR performance. In the hope of gathering pixel information from both local and global scales efficiently via flexible graphs we search node connections within nearby regions to construct local graphs; and find connections within a strided sampling space of the whole image for global graphs. The flexibility of graphs boosts the SR performance of the IPG model. Experiment results on various datasets demonstrates that the proposed IPG outperforms State-of-the-Art baselines. Codes are available at https://github.com/huawei-noah/Efficient-Computing/tree/master/LowLevel/IPG. + + + + Towards Generalizing to Unseen Domains with Few Labels + http://openaccess.thecvf.com//content/CVPR2024/papers/Galappaththige_Towards_Generalizing_to_Unseen_Domains_with_Few_Labels_CVPR_2024_paper.pdf + We approach the challenge of addressing semi-supervised domain generalization (SSDG). Specifically our aim is to obtain a model that learns domain-generalizable features by leveraging a limited subset of labelled data alongside a substantially larger pool of unlabeled data. Existing domain generalization (DG) methods which are unable to exploit unlabeled data perform poorly compared to semi-supervised learning (SSL) methods under SSDG setting. Nevertheless SSL methods have considerable room for performance improvement when compared to fully-supervised DG training. To tackle this underexplored yet highly practical problem of SSDG we make the following core contributions. First we propose a feature-based conformity technique that matches the posterior distributions from the feature space with the pseudo-label from the model's output space. Second we develop a semantics alignment loss to learn semantically-compatible representations by regularizing the semantic structure in the feature space. Our method is plug-and-play and can be readily integrated with different SSL-based SSDG baselines without introducing any additional parameters. Extensive experimental results across five challenging DG benchmarks with four strong SSL baselines suggest that our method provides consistent and notable gains in two different SSDG settings. + + + + LTGC: Long-tail Recognition via Leveraging LLMs-driven Generated Content + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_LTGC_Long-tail_Recognition_via_Leveraging_LLMs-driven_Generated_Content_CVPR_2024_paper.pdf + Long-tail recognition is challenging because it requires the model to learn good representations from tail categories and address imbalances across all categories. In this paper we propose a novel generative and fine-tuning framework LTGC to handle long-tail recognition via leveraging generated content. Firstly inspired by the rich implicit knowledge in large-scale models (e.g. large language models LLMs) LTGC leverages the power of these models to parse and reason over the original tail data to produce diverse tail-class content. We then propose several novel designs for LTGC to ensure the quality of the generated data and to efficiently fine-tune the model using both the generated and original data. 
The visualization demonstrates the effectiveness of the generation module in LTGC which produces accurate and diverse tail data. Additionally the experimental results demonstrate that our LTGC outperforms existing state-of-the-art methods on popular long-tailed benchmarks. + + + + Neural Refinement for Absolute Pose Regression with Feature Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Neural_Refinement_for_Absolute_Pose_Regression_with_Feature_Synthesis_CVPR_2024_paper.pdf + Absolute Pose Regression (APR) methods use deep neural networks to directly regress camera poses from RGB images. However the predominant APR architectures only rely on 2D operations during inference resulting in limited accuracy of pose estimation due to the lack of 3D geometry constraints or priors. In this work we propose a test-time refinement pipeline that leverages implicit geometric constraints using a robust feature field to enhance the ability of APR methods to use 3D information during inference. We also introduce a novel Neural Feature Synthesizer (NeFeS) model which encodes 3D geometric features during training and directly renders dense novel view features at test time to refine APR methods. To enhance the robustness of our model we introduce a feature fusion module and a progressive training strategy. Our proposed method achieves state-of-the-art single-image APR accuracy on indoor and outdoor datasets. Code will be released at https://github.com/ActiveVisionLab/NeFeS. + + + + DiffCast: A Unified Framework via Residual Diffusion for Precipitation Nowcasting + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_DiffCast_A_Unified_Framework_via_Residual_Diffusion_for_Precipitation_Nowcasting_CVPR_2024_paper.pdf + Precipitation nowcasting is an important spatio-temporal prediction task to predict the radar echoes sequences based on current observations which can serve both meteorological science and smart city applications. Due to the chaotic evolution nature of the precipitation systems it is a very challenging problem. Previous studies address the problem either from the perspectives of deterministic modeling or probabilistic modeling. However their predictions suffer from the blurry high-value echoes fading away and position inaccurate issues. The root reason of these issues is that the chaotic evolutionary precipitation systems are not appropriately modeled. Inspired by the nature of the systems we propose to decompose and model them from the perspective of global deterministic motion and local stochastic variations with residual mechanism. A unified and flexible framework that can equip any type of spatio-temporal models is proposed based on residual diffusion which effectively tackles the shortcomings of previous methods. Extensive experimental results on four publicly available radar datasets demonstrate the effectiveness and superiority of the proposed framework compared to state-of-the-art techniques. Our code is publicly available at https://github.com/DeminYu98/DiffCast. + + + + Ego-Exo4D: Understanding Skilled Human Activity from First- and Third-Person Perspectives + http://openaccess.thecvf.com//content/CVPR2024/papers/Grauman_Ego-Exo4D_Understanding_Skilled_Human_Activity_from_First-_and_Third-Person_Perspectives_CVPR_2024_paper.pdf + We present Ego-Exo4D a diverse large-scale multimodal multiview video dataset and benchmark challenge. Ego-Exo4D centers around simultaneously-captured egocentric and exocentric video of skilled human activities (e.g. 
sports music dance bike repair). 740 participants from 13 cities worldwide performed these activities in 123 different natural scene contexts yielding long-form captures from 1 to 42 minutes each and 1286 hours of video combined. The multimodal nature of the dataset is unprecedented: the video is accompanied by multichannel audio eye gaze 3D point clouds camera poses IMU and multiple paired language descriptions---including a novel "expert commentary" done by coaches and teachers and tailored to the skilled-activity domain. To push the frontier of first-person video understanding of skilled human activity we also present a suite of benchmark tasks and their annotations including fine-grained activity understanding proficiency estimation cross-view translation and 3D hand/body pose. All resources are open sourced to fuel new research in the community. + + + + Point Cloud Pre-training with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Point_Cloud_Pre-training_with_Diffusion_Models_CVPR_2024_paper.pdf + Pre-training a model and then fine-tuning it on downstream tasks has demonstrated significant success in the 2D image and NLP domains. However due to the unordered and non-uniform density characteristics of point clouds it is non-trivial to explore the prior knowledge of point clouds and pre-train a point cloud backbone. In this paper we propose a novel pre-training method called Point cloud Diffusion pre-training PointDif. We consider the point cloud pre-training task as a conditional point-to-point generation problem and introduce a conditional point generator. This generator aggregates the features extracted by the backbone and employs them as the condition to guide the point-to-point recovery from the noisy point cloud thereby assisting the backbone in capturing both local and global geometric priors as well as the global point density distribution of the object. We also present a recurrent uniform sampling optimization strategy which enables the model to uniformly recover from various noise levels and learn from balanced supervision. Our PointDif achieves substantial improvement across various real-world datasets for diverse downstream tasks such as classification segmentation and detection. Specifically PointDif attains 70.0% mIoU on S3DIS Area 5 for the segmentation task and achieves an average improvement of 2.4% on ScanObjectNN for the classification task compared to TAP. Furthermore our pre-training framework can be flexibly applied to diverse point cloud backbones and bring considerable gains. Code is available at https://github.com/zhengxiaozx/PointDif + + + + CAMixerSR: Only Details Need More "Attention" + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_CAMixerSR_Only_Details_Need_More_Attention_CVPR_2024_paper.pdf + To satisfy the rapidly increasing demands on the large image (2K-8K) super-resolution (SR) prevailing methods follow two independent tracks: 1) accelerate existing networks by content-aware routing and 2) design better super-resolution networks via token mixer refining. Despite directness they encounter unavoidable defects (e.g. inflexible route or non-discriminative processing) limiting further improvements of quality-complexity trade-off. To erase the drawbacks we integrate these schemes by proposing a content-aware mixer (CAMixer) which assigns convolution for simple contexts and additional deformable window-attention for sparse textures. 
Specifically the CAMixer uses a learnable predictor to generate multiple bootstraps including offsets for windows warping a mask for classifying windows and convolutional attentions for endowing convolution with the dynamic property which modulates attention to include more useful textures self-adaptively and improves the representation capability of convolution. We further introduce a global classification loss to improve the accuracy of predictors. By simply stacking CAMixers we obtain CAMixerSR which achieves superior performance on large-image SR lightweight SR and omnidirectional-image SR. + + + + Towards Backward-Compatible Continual Learning of Image Compression + http://openaccess.thecvf.com//content/CVPR2024/papers/Duan_Towards_Backward-Compatible_Continual_Learning_of_Image_Compression_CVPR_2024_paper.pdf + This paper explores the possibility of extending the capability of pre-trained neural image compressors (e.g. adapting to new data or target bitrates) without breaking backward compatibility the ability to decode bitstreams encoded by the original model. We refer to this problem as continual learning of image compression. Our initial findings show that baseline solutions such as end-to-end fine-tuning do not preserve the desired backward compatibility. To tackle this we propose a knowledge replay training strategy that effectively addresses this issue. We also design a new model architecture that enables more effective continual learning than existing baselines. Experiments are conducted for two scenarios: data-incremental learning and rate-incremental learning. The main conclusion of this paper is that neural image compressors can be fine-tuned to achieve better performance (compared to their pre-trained version) on new data and rates without compromising backward compatibility. The code is publicly available online. + + + + Latent Modulated Function for Computational Optimal Continuous Image Representation + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Latent_Modulated_Function_for_Computational_Optimal_Continuous_Image_Representation_CVPR_2024_paper.pdf + The recent work Local Implicit Image Function (LIIF) and subsequent Implicit Neural Representation (INR) based works have achieved remarkable success in Arbitrary-Scale Super-Resolution (ASSR) by using MLP to decode Low-Resolution (LR) features. However these continuous image representations typically implement decoding in High-Resolution (HR) High-Dimensional (HD) space leading to a quadratic increase in computational cost and seriously hindering the practical applications of ASSR. To tackle this problem we propose a novel Latent Modulated Function (LMF) which decouples the HR-HD decoding process into shared latent decoding in LR-HD space and independent rendering in HR Low-Dimensional (LD) space thereby realizing the first computational optimal paradigm of continuous image representation. Specifically LMF utilizes an HD MLP in latent space to generate latent modulations of each LR feature vector. This enables a modulated LD MLP in render space to quickly adapt to any input feature vector and perform rendering at arbitrary resolution. Furthermore we leverage the positive correlation between modulation intensity and input image complexity to design a Controllable Multi-Scale Rendering (CMSR) algorithm offering the flexibility to adjust the decoding efficiency based on the rendering precision. 
Extensive experiments demonstrate that converting existing INR-based ASSR methods to LMF can reduce the computational cost by up to 99.9% accelerate inference by up to 57x and save up to 76% of parameters while maintaining competitive performance. The code is available at https://github.com/HeZongyao/LMF. + + + + VideoCutLER: Surprisingly Simple Unsupervised Video Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_VideoCutLER_Surprisingly_Simple_Unsupervised_Video_Instance_Segmentation_CVPR_2024_paper.pdf + Existing approaches to unsupervised video instance segmentation typically rely on motion estimates and experience difficulties tracking small or divergent motions. We present VideoCutLER a simple method for unsupervised multi-instance video segmentation without using motion-based learning signals like optical flow or training on natural videos. Our key insight is that using high-quality pseudo masks and a simple video synthesis method for model training is surprisingly sufficient to enable the resulting video model to effectively segment and track multiple instances across video frames. We show the first competitive unsupervised learning results on the challenging YouTubeVIS-2019 benchmark achieving 50.7% AP50 surpassing the previous state-of-the-art by a large margin. VideoCutLER can also serve as a strong pretrained model for supervised video instance segmentation tasks exceeding DINO by 15.9% on YouTubeVIS-2019 in terms of AP. + + + + PAPR in Motion: Seamless Point-level 3D Scene Interpolation + http://openaccess.thecvf.com//content/CVPR2024/papers/Peng_PAPR_in_Motion_Seamless_Point-level_3D_Scene_Interpolation_CVPR_2024_paper.pdf + We propose the problem of point-level 3D scene interpolation which aims to simultaneously reconstruct a 3D scene in two states from multiple views synthesize smooth point-level interpolations between them and render the scene from novel viewpoints all without any supervision between the states. The primary challenge is on achieving a smooth transition between states that may involve significant and non-rigid changes. To address these challenges we introduce "PAPR in Motion" a novel approach that builds upon the recent Proximity Attention Point Rendering (PAPR) technique which can deform a point cloud to match a significantly different shape and render a visually coherent scene even after non-rigid deformations. Our approach is specifically designed to maintain the temporal consistency of the geometric structure by introducing various regularization techniques for PAPR. The result is a method that can effectively bridge large scene changes and produce visually coherent and temporally smooth interpolations in both geometry and appearance. Evaluation across diverse motion types demonstrates that "PAPR in Motion" outperforms the leading neural renderer for dynamic scenes. For more results and code please visit our project website at https://niopeng.github.io/PAPR-in-Motion/. + + + + Causal Mode Multiplexer: A Novel Framework for Unbiased Multispectral Pedestrian Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Causal_Mode_Multiplexer_A_Novel_Framework_for_Unbiased_Multispectral_Pedestrian_CVPR_2024_paper.pdf + RGBT multispectral pedestrian detection has emerged as a promising solution for safety-critical applications that require day/night operations. However the modality bias problem remains unsolved as multispectral pedestrian detectors learn the statistical bias in datasets. 
Specifically datasets in multispectral pedestrian detection mainly distribute between ROTO (day) and RXTO (night) data; the majority of the pedestrian labels statistically co-occur with their thermal features. As a result multispectral pedestrian detectors show poor generalization ability on examples beyond this statistical correlation such as ROTX data. To address this problem we propose a novel Causal Mode Multiplexer (CMM) framework that effectively learns the causalities between multispectral inputs and predictions. Moreover we construct a new dataset (ROTX-MP) to evaluate modality bias in multispectral pedestrian detection. ROTX-MP mainly includes ROTX examples not presented in previous datasets. Extensive experiments demonstrate that our proposed CMM framework generalizes well on existing datasets (KAIST CVC-14 FLIR) and the new ROTX-MP. Our code and dataset are available open-source. + + + + LTA-PCS: Learnable Task-Agnostic Point Cloud Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_LTA-PCS_Learnable_Task-Agnostic_Point_Cloud_Sampling_CVPR_2024_paper.pdf + Recently many approaches directly operate on point clouds for different tasks. These approaches become more computation and storage demanding when point cloud size is large. To reduce the required computation and storage one possible solution is to sample the point cloud. In this paper we propose the first Learnable Task-Agnostic Point Cloud Sampling (LTA-PCS) framework. Existing task-agnostic point cloud sampling strategy (e.g. FPS) does not consider semantic information of point clouds causing degraded performance on downstream tasks. While learning-based point cloud sampling methods consider semantic information they are task-specific and require task-oriented ground-truth annotations. So they cannot generalize well on different downstream tasks. Our LTA-PCS achieves task-agnostic point cloud sampling without requiring task-oriented labels in which both the geometric and semantic information of points is considered in sampling. Extensive experiments on multiple downstream tasks demonstrate the effectiveness of our LTA-PCS. + + + + Non-Rigid Structure-from-Motion: Temporally-Smooth Procrustean Alignment and Spatially-Variant Deformation Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_Non-Rigid_Structure-from-Motion_Temporally-Smooth_Procrustean_Alignment_and_Spatially-Variant_Deformation_Modeling_CVPR_2024_paper.pdf + Even though Non-rigid Structure-from-Motion (NRSfM) has been extensively studied and great progress has been made there are still key challenges that hinder their broad real-world applications: 1) the inherent motion/rotation ambiguity requires either explicit camera motion recovery with extra constraint or complex Procrustean Alignment; 2) existing low-rank modeling of the global shape can over-penalize drastic deformations in the 3D shape sequence. This paper proposes to resolve the above issues from a spatial-temporal modeling perspective. First we propose a novel Temporally-smooth Procrustean Alignment module that estimates 3D deforming shapes and adjusts the camera motion by aligning the 3D shape sequence consecutively. Our new alignment module remedies the requirement of complex reference 3D shape during alignment which is more conductive to non-isotropic deformation modeling. Second we propose a spatial-weighted approach to enforce the low-rank constraint adaptively at different locations to accommodate drastic spatially-variant deformation reconstruction better. 
Our modeling outperform existing low-rank based methods and extensive experiments across different datasets validate the effectiveness of our method. + + + + ShapeMatcher: Self-Supervised Joint Shape Canonicalization Segmentation Retrieval and Deformation + http://openaccess.thecvf.com//content/CVPR2024/papers/Di_ShapeMatcher_Self-Supervised_Joint_Shape_Canonicalization_Segmentation_Retrieval_and_Deformation_CVPR_2024_paper.pdf + In this paper we present ShapeMatcher a unified self-supervised learning framework for joint shape canonicalization segmentation retrieval and deformation. Given a partially-observed object in an arbitrary pose we first canonicalize the object by extracting point-wise affine invariant features disentangling inherent structure of the object with its pose and size. These learned features are then leveraged to predict semantically consistent part segmentation and corresponding part centers. Next our lightweight retrieval module aggregates the features within each part as its retrieval token and compare all the tokens with source shapes from a pre-established database to identify the most geometrically similar shape. Finally we deform the retrieved shape in the deformation module to tightly fit the input object by harnessing part center guided neural cage deformation. The key insight of ShapeMaker is the simultaneous training of the four highly-associated processes: canonicalization segmentation retrieval and deformation leveraging cross-task consistency losses for mutual supervision. Extensive experiments on synthetic datasets PartNet ComplementMe and real-world dataset Scan2CAD demonstrate that ShapeMatcher surpasses competitors by a large margin. Code is released at https://github.com/Det1999/ShapeMaker. + + + + Global Latent Neural Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Tanay_Global_Latent_Neural_Rendering_CVPR_2024_paper.pdf + A recent trend among generalizable novel view synthesis methods is to learn a rendering operator acting over single camera rays. This approach is promising because it removes the need for explicit volumetric rendering but it effectively treats target images as collections of independent pixels. Here we propose to learn a global rendering operator acting over all camera rays jointly. We show that the right representation to enable such rendering is a 5-dimensional plane sweep volume consisting of the projection of the input images on a set of planes facing the target camera. Based on this understanding we introduce our Convolutional Global Latent Renderer (ConvGLR) an efficient convolutional architecture that performs the rendering operation globally in a low-resolution latent space. Experiments on various datasets under sparse and generalizable setups show that our approach consistently outperforms existing methods by significant margins. + + + + Meta-Point Learning and Refining for Category-Agnostic Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Meta-Point_Learning_and_Refining_for_Category-Agnostic_Pose_Estimation_CVPR_2024_paper.pdf + Category-agnostic pose estimation (CAPE) aims to predict keypoints for arbitrary classes given a few support images annotated with keypoints. Existing methods only rely on the features extracted at support keypoints to predict or refine the keypoints on query image but a few support feature vectors are local and inadequate for CAPE. 
Considering that human can quickly perceive potential keypoints of arbitrary objects we propose a novel framework for CAPE based on such potential keypoints (named as meta-points). Specifically we maintain learnable embeddings to capture inherent information of various keypoints which interact with image feature maps to produce meta-points without any support. The produced meta-points could serve as meaningful potential keypoints for CAPE. Due to the inevitable gap between inherency and annotation we finally utilize the identities and details offered by support keypoints to assign and refine meta-points to desired keypoints in query image. In addition we propose a progressive deformable point decoder and a slacked regression loss for better prediction and supervision. Our novel framework not only reveals the inherency of keypoints but also outperforms existing methods of CAPE. Comprehensive experiments and in-depth studies on large-scale MP-100 dataset demonstrate the effectiveness of our framework. + + + + Batch Normalization Alleviates the Spectral Bias in Coordinate Networks + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_Batch_Normalization_Alleviates_the_Spectral_Bias_in_Coordinate_Networks_CVPR_2024_paper.pdf + Representing signals using coordinate networks dominates the area of inverse problems recently and is widely applied in various scientific computing tasks. Still there exists an issue of spectral bias in coordinate networks limiting the capacity to learn high-frequency components. This problem is caused by the pathological distribution of the neural tangent kernel's (NTK's) eigenvalues of coordinate networks. We find that this pathological distribution could be improved using the classical batch normalization (BN) which is a common deep learning technique but rarely used in coordinate networks. BN greatly reduces the maximum and variance of NTK's eigenvalues while slightly modifies the mean value considering the max eigenvalue is much larger than the most this variance change results in a shift of eigenvalues' distribution from a lower one to a higher one therefore the spectral bias could be alleviated (see Fig. 1). This observation is substantiated by the significant improvements of applying BN-based coordinate networks to various tasks including the image compression computed tomography reconstruction shape representation magnetic resonance imaging and novel view synthesis. + + + + SplaTAM: Splat Track & Map 3D Gaussians for Dense RGB-D SLAM + http://openaccess.thecvf.com//content/CVPR2024/papers/Keetha_SplaTAM_Splat_Track__Map_3D_Gaussians_for_Dense_RGB-D_CVPR_2024_paper.pdf + Dense simultaneous localization and mapping (SLAM) is crucial for robotics and augmented reality applications. However current methods are often hampered by the non-volumetric or implicit way they represent a scene. This work introduces SplaTAM an approach that for the first time leverages explicit volumetric representations i.e. 3D Gaussians to enable high-fidelity reconstruction from a single unposed RGB-D camera surpassing the capabilities of existing methods. SplaTAM employs a simple online tracking and mapping system tailored to the underlying Gaussian representation. It utilizes a silhouette mask to elegantly capture the presence of scene density. This combination enables several benefits over prior representations including fast rendering and dense optimization quickly determining if areas have been previously mapped and structured map expansion by adding more Gaussians. 
Extensive experiments show that SplaTAM achieves up to 2x superior performance in camera pose estimation map construction and novel-view synthesis over existing methods paving the way for more immersive high-fidelity SLAM applications. + + + + Instance-based Max-margin for Practical Few-shot Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Fu_Instance-based_Max-margin_for_Practical_Few-shot_Recognition_CVPR_2024_paper.pdf + In order to mimic the human few-shot learning (FSL) ability better and to make FSL closer to real-world applications this paper proposes a practical FSL (pFSL) setting. pFSL is based on unsupervised pre-trained models (analogous to human prior knowledge) and recognizes many novel classes simultaneously. Compared to traditional FSL pFSL is simpler in its formulation easier to evaluate more challenging and more practical. To cope with the rarity of training examples this paper proposes IbM2 an instance-based max-margin method not only for the new pFSL setting but also works well in traditional FSL scenarios. Based on the Gaussian Annulus Theorem IbM2 converts random noise applied to the instances into a mechanism to achieve maximum margin in the many-way pFSL (or traditional FSL) recognition task. Experiments with various self-supervised pre-training methods and diverse many- or few-way FSL tasks show that IbM2 almost always leads to improvements compared to its respective baseline methods and in most cases the improvements are significant. With both the new pFSL setting and novel IbM2 method this paper shows that practical few-shot learning is both viable and promising. + + + + ZeroRF: Fast Sparse View 360deg Reconstruction with Zero Pretraining + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_ZeroRF_Fast_Sparse_View_360deg_Reconstruction_with_Zero_Pretraining_CVPR_2024_paper.pdf + We present ZeroRF a novel per-scene optimization method addressing the challenge of sparse view 360deg reconstruction in neural field representations. Current breakthroughs like Neural Radiance Fields (NeRF) have demonstrated high-fidelity image synthesis but struggle with sparse input views. Existing methods such as Generalizable NeRFs and per-scene optimization approaches face limitations in data dependency computational cost and generalization across diverse scenarios. To overcome these challenges we propose ZeroRF whose key idea is to integrate a tailored Deep Image Prior into a factorized NeRF representation. Unlike traditional methods ZeroRF parametrizes feature grids with a neural network generator enabling efficient sparse view 360deg reconstruction without any pretraining or additional regularization. Extensive experiments showcase ZeroRF's versatility and superiority in terms of both quality and speed achieving state-of-the-art results on benchmark datasets. ZeroRF's significance extends to applications in 3D content generation and editing. Project page: https://sarahweiii.github.io/zerorf/ + + + + RCooper: A Real-world Large-scale Dataset for Roadside Cooperative Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/Hao_RCooper_A_Real-world_Large-scale_Dataset_for_Roadside_Cooperative_Perception_CVPR_2024_paper.pdf + The value of roadside perception which could extend the boundaries of autonomous driving and traffic management has gradually become more prominent and acknowledged in recent years. 
However, existing roadside perception approaches focus only on single-infrastructure sensor systems, which cannot realize a comprehensive understanding of a traffic area because of their limited sensing range and blind spots. Toward high-quality roadside perception, we need Roadside Cooperative Perception (RCooper) to achieve practical area-coverage roadside perception for restricted traffic areas. RCooper has its own domain-specific challenges, but further exploration is hindered by the lack of datasets. We hence release the first real-world, large-scale RCooper dataset to foster research on practical roadside cooperative perception, including detection and tracking. The manually annotated dataset comprises 50k images and 30k point clouds, covering two representative traffic scenes (i.e., intersection and corridor). The constructed benchmarks prove the effectiveness of roadside cooperative perception and indicate directions for further research. Code and dataset can be accessed at: https://github.com/AIR-THU/DAIR-RCooper. + + + + TutteNet: Injective 3D Deformations by Composition of 2D Mesh Deformations + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_TutteNet_Injective_3D_Deformations_by_Composition_of_2D_Mesh_Deformations_CVPR_2024_paper.pdf + This work proposes a novel representation of injective deformations of 3D space, which overcomes existing limitations of injective methods, namely inaccuracy, lack of robustness, and incompatibility with general learning and optimization frameworks. Our core idea is to reduce the problem to a "deep" composition of multiple 2D mesh-based piecewise-linear maps. Namely, we build differentiable layers that produce mesh deformations through Tutte's embedding (guaranteed to be injective in 2D) and compose these layers over different planes to create complex 3D injective deformations of the 3D volume. We show our method provides the ability to efficiently and accurately optimize and learn complex deformations, outperforming other injective approaches. As a main application, we produce complex and artifact-free NeRF and SDF deformations. + + + + Estimating Noisy Class Posterior with Part-level Labels for Noisy Label Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Estimating_Noisy_Class_Posterior_with_Part-level_Labels_for_Noisy_Label_CVPR_2024_paper.pdf + In noisy label learning, estimating noisy class posteriors plays a fundamental role in developing consistent classifiers, as it forms the basis for estimating clean class posteriors and the transition matrix. Existing methods typically learn noisy class posteriors by training a classification model with noisy labels. However, when labels are incorrect, these models may be misled to overemphasize feature parts that do not reflect the instance characteristics, resulting in significant errors in the estimated noisy class posteriors. To address this issue, this paper proposes to augment the supervised information with part-level labels, encouraging the model to focus on and integrate richer information from various parts. Specifically, our method first partitions features into distinct parts by cropping instances, yielding part-level labels associated with these various parts. Subsequently, we introduce a novel single-to-multiple transition matrix to model the relationship between the noisy and part-level labels, which incorporates part-level labels into a classifier-consistent framework.
Utilizing this framework with part-level labels we can learn the noisy class posteriors more precisely by guiding the model to integrate information from various parts ultimately improving the classification performance. Our method is theoretically sound while experiments show that it is empirically effective in synthetic and real-world noisy benchmarks. + + + + Leveraging Vision-Language Models for Improving Domain Generalization in Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Addepalli_Leveraging_Vision-Language_Models_for_Improving_Domain_Generalization_in_Image_Classification_CVPR_2024_paper.pdf + Vision-Language Models (VLMs) such as CLIP are trained on large amounts of image-text pairs resulting in remarkable generalization across several data distributions. However in several cases their expensive training and data collection/curation costs do not justify the end application. This motivates a vendor-client paradigm where a vendor trains a large-scale VLM and grants only input-output access to clients on a pay-per-query basis in a black-box setting. The client aims to minimize inference cost by distilling the VLM to a student model using the limited available task-specific data and further deploying this student model in the downstream application. While naive distillation largely improves the In-Domain (ID) accuracy of the student it fails to transfer the superior out-of-distribution (OOD) generalization of the VLM teacher using the limited available labeled images. To mitigate this we propose Vision-Language to Vision - Align Distill Predict (VL2V-ADiP) which first aligns the vision and language modalities of the teacher model with the vision modality of a pre-trained student model and further distills the aligned VLM representations to the student. This maximally retains the pre-trained features of the student while also incorporating the rich representations of the VLM image encoder and the superior generalization of the text embeddings. The proposed approach achieves state-of-the-art results on the standard Domain Generalization benchmarks in a black-box teacher setting as well as a white-box setting where the weights of the VLM are accessible. + + + + Prompt Learning via Meta-Regularization + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Prompt_Learning_via_Meta-Regularization_CVPR_2024_paper.pdf + Pre-trained vision-language models have shown impressive success on various computer vision tasks with their zero-shot generalizability. Recently prompt learning approaches have been explored to efficiently and effectively adapt the vision-language models to a variety of downstream tasks. However most existing prompt learning methods suffer from task overfitting since the general knowledge of the pre-trained vision language models is forgotten while the prompts are finetuned on a small data set from a specific target task. To address this issue we propose a Prompt Meta-Regularization (ProMetaR) to improve the generalizability of prompt learning for vision-language models. Specifically ProMetaR meta-learns both the regularizer and the soft prompts to harness the task-specific knowledge from the downstream tasks and task-agnostic general knowledge from the vision-language models. Further ProMetaR augments the task to generate multiple virtual tasks to alleviate the meta-overfitting. In addition we provide the analysis to comprehend how ProMetaR improves the generalizability of prompt tuning in the perspective of the gradient alignment. 
Our extensive experiments demonstrate that our ProMetaR improves the generalizability of conventional prompt learning methods under base-to-base/base-to-new and domain generalization settings. The code of ProMetaR is available at https://github.com/mlvlab/ProMetaR. + + + + Embodied Multi-Modal Agent trained by an LLM from a Parallel TextWorld + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Embodied_Multi-Modal_Agent_trained_by_an_LLM_from_a_Parallel_CVPR_2024_paper.pdf + While large language models (LLMs) excel in a simulated world of texts they struggle to interact with the more realistic world without perceptions of other modalities such as visual or audio signals. Although vision-language models (VLMs) integrate LLM modules (1) aligned with static image features and (2) may possess prior knowledge of world dynamics (as demonstrated in the text world) they have not been trained in an embodied visual world and thus cannot align with its dynamics. On the other hand training an embodied agent in a noisy visual world without expert guidance is often challenging and inefficient. In this paper we train a VLM agent living in a visual world using an LLM agent excelling in a parallel text world. Specifically we distill LLM's reflection outcomes (improved actions by analyzing mistakes) in a text world's tasks to finetune the VLM on the same tasks of the visual world resulting in an Embodied Multi-Modal Agent (EMMA) quickly adapting to the visual world dynamics. Such cross-modality imitation learning between the two parallel worlds is achieved by a novel DAgger-DPO algorithm enabling EMMA to generalize to a broad scope of new tasks without any further guidance from the LLM expert. Extensive evaluations on the ALFWorld benchmark's diverse tasks highlight EMMA's superior performance to SOTA VLM-based agents e.g. 20%-70% improvement in the success rate. + + + + + + Intriguing Properties of Diffusion Models: An Empirical Study of the Natural Attack Capability in Text-to-Image Generative Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Sato_Intriguing_Properties_of_Diffusion_Models_An_Empirical_Study_of_the_CVPR_2024_paper.pdf + Denoising probabilistic diffusion models have shown breakthrough performance to generate more photo-realistic images or human-level illustrations than the prior models such as GANs. This high image-generation capability has stimulated the creation of many downstream applications in various areas. However we find that this technology is actually a double-edged sword: We identify a new type of attack called the Natural Denoising Diffusion (NDD) attack based on the finding that state-of-the-art deep neural network (DNN) models still hold their prediction even if we intentionally remove their robust features which are essential to the human visual system (HVS) through text prompts. The NDD attack shows a significantly high capability to generate low-cost model-agnostic and transferable adversarial attacks by exploiting the natural attack capability in diffusion models. To systematically evaluate the risk of the NDD attack we perform a large-scale empirical study with our newly created dataset the Natural Denoising Diffusion Attack (NDDA) dataset. We evaluate the natural attack capability by answering 6 research questions. Through a user study we find that it can achieve an 88% detection rate while being stealthy to 93% of human subjects; we also find that the non-robust features embedded by diffusion models contribute to the natural attack capability. 
To confirm the model-agnostic and transferable attack capability we perform the NDD attack against the Tesla Model 3 and find that 73% of the physically printed attacks can be detected as stop signs. Our hope is that the study and dataset can help our community be aware of the risks in diffusion models and facilitate further research toward robust DNN models. + + + + HouseCat6D - A Large-Scale Multi-Modal Category Level 6D Object Perception Dataset with Household Objects in Realistic Scenarios + http://openaccess.thecvf.com//content/CVPR2024/papers/Jung_HouseCat6D_-_A_Large-Scale_Multi-Modal_Category_Level_6D_Object_Perception_CVPR_2024_paper.pdf + Estimating 6D object poses is a major challenge in 3D computer vision. Building on successful instance-level approaches research is shifting towards category-level pose estimation for practical applications. Current category-level datasets however fall short in annotation quality and pose variety. Addressing this we introduce HouseCat6D a new category-level 6D pose dataset. It features 1) multi-modality with Polarimetric RGB and Depth (RGBD+P) 2) encompasses 194 diverse objects across 10 household categories including two photometrically challenging ones and 3) provides high-quality pose annotations with an error range of only 1.35 mm to 1.74 mm. The dataset also includes 4) 41 large-scale scenes with comprehensive viewpoint and occlusion coverage 5) a checkerboard-free environment and 6. dense 6D parallel-jaw robotic grasp annotations. Additionally we present benchmark results for leading category-level pose estimation networks. + + + + Towards Co-Evaluation of Cameras HDR and Algorithms for Industrial-Grade 6DoF Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kalra_Towards_Co-Evaluation_of_Cameras_HDR_and_Algorithms_for_Industrial-Grade_6DoF_CVPR_2024_paper.pdf + 6DoF Pose estimation has been gaining increased importance in vision for over a decade however it does not yet meet the reliability and accuracy standards for mass deployment in industrial robotics. To this effect we present the Industrial Plenoptic Dataset (IPD): the first dataset for the co-evaluation of cameras HDR and algorithms targeted at reliable high-accuracy industrial automation. Specifically we capture 2300 physical scenes of 20 industrial parts covering a 1mx1mx0.5m working volume resulting in over 100000 distinct object views. Each scene is captured with 13 well-calibrated multi-modal cameras including polarization and high-resolution structured light. In terms of lighting we capture each scene at 4 exposures and in 3 challenging lighting conditions ranging from 100 lux to 100000 lux. We also present validate and analyze robot consistency an evaluation method targeted at scalable high accuracy evaluation. We hope that vision systems that succeed on this dataset will have direct industry impact. The dataset and evaluation code are available at https://github.com/intrinsic-ai/ipd. + + + + MLP Can Be A Good Transformer Learner + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_MLP_Can_Be_A_Good_Transformer_Learner_CVPR_2024_paper.pdf + Self-attention mechanism is the key of the Transformer but often criticized for its computation demands. Previous token pruning works motivate their methods from the view of computation redundancy but still need to load the full network and require same memory costs. 
This paper introduces a novel strategy that simplifies vision transformers and reduces computational load through the selective removal of non-essential attention layers guided by entropy considerations. We identify that regarding the attention layer in bottom blocks their subsequent MLP layers i.e. two feed-forward layers can elicit the same entropy quantity. Meanwhile the accompanied MLPs are under-exploited since they exhibit smaller feature entropy compared to those MLPs in the top blocks. Therefore we propose to integrate the uninformative attention layers into their subsequent counterparts by degenerating them into identical mapping yielding only MLP in certain transformer blocks. Experimental results on ImageNet-1k show that the proposed method can remove 40% attention layer of DeiT-B improving throughput and memory bound without performance compromise. + + + + + + Visual-Augmented Dynamic Semantic Prototype for Generative Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Hou_Visual-Augmented_Dynamic_Semantic_Prototype_for_Generative_Zero-Shot_Learning_CVPR_2024_paper.pdf + Generative Zero-shot learning (ZSL) learns a generator to synthesize visual samples for unseen classes which is an effective way to advance ZSL. However existing generative methods rely on the conditions of Gaussian noise and the predefined semantic prototype which limit the generator only optimized on specific seen classes rather than characterizing each visual instance resulting in poor generalizations (e.g. overfitting to seen classes). To address this issue we propose a novel Visual-Augmented Dynamic Semantic prototype method (termed VADS) to boost the generator to learn accurate semantic-visual mapping by fully exploiting the visual-augmented knowledge into semantic conditions. In detail VADS consists of two modules: (1) Visual-aware Domain Knowledge Learning module (VDKL) learns the local bias and global prior of the visual features (referred to as domain visual knowledge) which replace pure Gaussian noise to provide richer prior noise information; (2) Vision-Oriented Semantic Updation module (VOSU) updates the semantic prototype according to the visual representations of the samples. Ultimately we concatenate their output as a dynamic semantic prototype which serves as the condition of the generator. Extensive experiments demonstrate that our VADS achieves superior CZSL and GZSL performances on three prominent datasets and outperforms other state-of-the-art methods with averaging increases by 6.4% 5.9% and 4.2% on SUN CUB and AWA2 respectively. + + + + Dynamic Prompt Optimizing for Text-to-Image Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Mo_Dynamic_Prompt_Optimizing_for_Text-to-Image_Generation_CVPR_2024_paper.pdf + Text-to-image generative models specifically those based on diffusion models like Imagen and Stable Diffusion have made substantial advancements. Recently there has been a surge of interest in the delicate refinement of text prompts. Users assign weights or alter the injection time steps of certain words in the text prompts to improve the quality of generated images. However the success of fine-control prompts depends on the accuracy of the text prompts and the careful selection of weights and time steps which requires significant manual intervention. To address this we introduce the Prompt Auto-Editing (PAE) method. 
Besides refining the original prompts for image generation, we further employ an online reinforcement learning strategy to explore the weights and injection time steps of each word, leading to dynamic fine-control prompts. The reward function during training encourages the model to consider aesthetic score, semantic consistency, and user preferences. Experimental results demonstrate that our proposed method effectively improves the original prompts, generating visually more appealing images while maintaining semantic alignment. Code is available at https://github.com/Mowenyii/PAE. + + + + 360Loc: A Dataset and Benchmark for Omnidirectional Visual Localization with Cross-device Queries + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_360Loc_A_Dataset_and_Benchmark_for_Omnidirectional_Visual_Localization_with_CVPR_2024_paper.pdf + Portable 360° cameras are becoming a cheap and efficient tool to establish large visual databases. By capturing omnidirectional views of a scene, these cameras could expedite building environment models that are essential for visual localization. However, such an advantage is often overlooked due to the lack of valuable datasets. This paper introduces a new benchmark dataset, 360Loc, composed of 360° images with ground truth poses for visual localization. We present a practical implementation of 360° mapping, combining 360° images with lidar data to generate the ground truth 6DoF poses. 360Loc is the first dataset and benchmark that explores the challenge of cross-device visual positioning, involving 360° reference frames and query frames from pinhole, ultra-wide FoV, fisheye, and 360° cameras. We propose a virtual camera approach to generate lower-FoV query frames from 360° images, which ensures a fair comparison of performance among different query types in visual localization tasks. We also extend this virtual camera approach to feature matching-based and pose regression-based methods to alleviate the performance loss caused by the cross-device domain gap, and evaluate its effectiveness against state-of-the-art baselines. We demonstrate that omnidirectional visual localization is more robust in challenging large-scale scenes with symmetries and repetitive structures. These results provide new insights into 360-camera mapping and omnidirectional visual localization with cross-device queries. Project page and dataset: https://huajianup.github.io/research/360Loc/. + + + + Domain Gap Embeddings for Generative Dataset Augmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Domain_Gap_Embeddings_for_Generative_Dataset_Augmentation_CVPR_2024_paper.pdf + The performance of deep learning models is intrinsically tied to the quality, volume, and relevance of their training data. Gathering ample data for production scenarios often demands significant time and resources. Among various strategies, data augmentation circumvents exhaustive data collection by generating new data points from existing ones. However, traditional augmentation techniques can be less effective amidst a shift between training and testing distributions. This paper explores the potential of synthetic data by leveraging large pre-trained models for data augmentation, especially when confronted with distribution shifts. Although recent advancements in generative models have enabled several prior works on cross-distribution data generation, they require model fine-tuning and a complex setup.
To bypass these shortcomings we introduce Domain Gap Embeddings (DoGE) a plug-and-play semantic data augmentation framework in a cross-distribution few-shot setting. Our method extracts disparities between source and desired data distributions in a latent form and subsequently steers a generative process to supplement the training set with endless diverse synthetic samples. Our evaluations conducted on a subpopulation shift and three domain adaptation scenarios under a few-shot paradigm reveal that our versatile method improves performance across tasks without needing hands-on intervention or intricate fine-tuning. DoGE paves the way to effortlessly generate realistic controllable synthetic datasets following the test distributions bolstering real-world efficacy for downstream task models. + + + + Geometrically-driven Aggregation for Zero-shot 3D Point Cloud Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Mei_Geometrically-driven_Aggregation_for_Zero-shot_3D_Point_Cloud_Understanding_CVPR_2024_paper.pdf + Zero-shot 3D point cloud understanding can be achieved via 2D Vision-Language Models (VLMs). Existing strategies directly map VLM representations from 2D pixels of rendered or captured views to 3D points overlooking the inherent and expressible point cloud geometric structure. Geometrically similar or close regions can be exploited for bolstering point cloud understanding as they are likely to share semantic information. To this end we introduce the first training-free aggregation technique that leverages the point cloud's 3D geometric structure to improve the quality of the transferred VLM representation. Our approach operates iteratively performing local-to-global aggregation based on geometric and semantic point-level reasoning. We benchmark our approach on three downstream tasks including classification part segmentation and semantic segmentation with a variety of datasets representing both synthetic/real-world and indoor/outdoor scenarios. Our approach achieves new state-of-the-art results in all benchmarks. + + + + Learning to Rank Patches for Unbiased Image Redundancy Reduction + http://openaccess.thecvf.com//content/CVPR2024/papers/Luo_Learning_to_Rank_Patches_for_Unbiased_Image_Redundancy_Reduction_CVPR_2024_paper.pdf + Images suffer from heavy spatial redundancy because pixels in neighboring regions are spatially correlated. Existing approaches strive to overcome this limitation by reducing less meaningful image regions. However current leading methods rely on supervisory signals. They may compel models to preserve content that aligns with labeled categories and discard content belonging to unlabeled categories. This categorical inductive bias makes these methods less effective in real-world scenarios. To address this issue we propose a self-supervised framework for image redundancy reduction called Learning to Rank Patches (LTRP). We observe that image reconstruction of masked image modeling models is sensitive to the removal of visible patches when the masking ratio is high (e.g. 90%). Building upon it we implement LTRP via two steps: inferring the semantic density score of each patch by quantifying variation between reconstructions with and without this patch and learning to rank the patches with the pseudo score. The entire process is self-supervised thus getting out of the dilemma of categorical inductive bias. We design extensive experiments on different datasets and tasks. 
The results demonstrate that LTRP outperforms both supervised and other self-supervised methods due to its fair assessment of image content. + + + + Going Beyond Multi-Task Dense Prediction with Synergy Embedding Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Going_Beyond_Multi-Task_Dense_Prediction_with_Synergy_Embedding_Models_CVPR_2024_paper.pdf + Multi-task visual scene understanding aims to leverage the relationships among a set of correlated tasks, which are solved simultaneously by embedding them within a unified network. However, most existing methods give rise to two primary concerns from a task-level perspective: (1) the lack of task-independent correspondences for distinct tasks, and (2) the neglect of explicit task-consensual dependencies among various tasks. To address these issues, we propose novel synergy embedding models (SEM), which go beyond multi-task dense prediction by leveraging two innovative designs: the intra-task hierarchy-adaptive module and the inter-task EM-interactive module. Specifically, the constructed intra-task module incorporates hierarchy-adaptive keys from multiple stages, enabling the efficient learning of specialized visual patterns with an optimal trade-off. In addition, the developed inter-task module learns interactions from a compact set of mutual bases among various tasks, benefiting from the expectation maximization (EM) algorithm. Extensive empirical evidence from two public benchmarks, NYUD-v2 and PASCAL-Context, demonstrates that SEM consistently outperforms state-of-the-art approaches across a range of metrics. + + + + Disentangled Pre-training for Human-Object Interaction Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Disentangled_Pre-training_for_Human-Object_Interaction_Detection_CVPR_2024_paper.pdf + Detecting human-object interaction (HOI) has long been limited by the amount of supervised data available. Recent approaches address this issue by pre-training according to pseudo-labels, which align object regions with HOI triplets parsed from image captions. However, pseudo-labeling is tricky and noisy, making HOI pre-training a complex process. Therefore, we propose an efficient disentangled pre-training method for HOI detection (DP-HOI) to address this problem. First, DP-HOI utilizes object detection and action recognition datasets to pre-train the detection and interaction decoder layers, respectively. Then, we arrange these decoder layers so that the pre-training architecture is consistent with the downstream HOI detection task. This facilitates efficient knowledge transfer. Specifically, the detection decoder identifies reliable human instances in each action recognition dataset image, generates one corresponding query, and feeds it into the interaction decoder for verb classification. Next, we combine the human instance verb predictions in the same image and impose image-level supervision. The DP-HOI structure can be easily adapted to the HOI detection task, enabling effective model parameter initialization. Therefore, it significantly enhances the performance of existing HOI detection models on a broad range of rare categories. The code and pre-trained weights are available at https://github.com/xingaoli/DP-HOI.
+ + + + MetaCloak: Preventing Unauthorized Subject-driven Text-to-image Diffusion-based Synthesis via Meta-learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_MetaCloak_Preventing_Unauthorized_Subject-driven_Text-to-image_Diffusion-based_Synthesis_via_Meta-learning_CVPR_2024_paper.pdf + Text-to-image diffusion models allow seamless generation of personalized images from scant reference photos. Yet these tools in the wrong hands can fabricate misleading or harmful content endangering individuals. To address this problem existing poisoning-based approaches perturb user images in an imperceptible way to render them "unlearnable" from malicious uses. We identify two limitations of these defending approaches: i) sub-optimal due to the hand-crafted heuristics for solving the intractable bilevel optimization and ii) lack of robustness against simple data transformations like Gaussian filtering. To solve these challenges we propose MetaCloak which solves the bi-level poisoning problem with a meta-learning framework with an additional transformation sampling process to craft transferable and robust perturbation. Specifically we employ a pool of surrogate diffusion models to craft transferable and model-agnostic perturbation. Furthermore by incorporating an additional transformation process we design a simple denoising-error maximization loss that is sufficient for causing transformation-robust semantic distortion and degradation in a personalized generation. Extensive experiments on the VGGFace2 and CelebA-HQ datasets show that MetaCloak outperforms existing approaches. Notably MetaCloak can successfully fool online training services like Replicate in a black-box manner demonstrating the effectiveness of MetaCloak in real-world scenarios. + + + + Neural Modes: Self-supervised Learning of Nonlinear Modal Subspaces + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Neural_Modes_Self-supervised_Learning_of_Nonlinear_Modal_Subspaces_CVPR_2024_paper.pdf + We propose a self-supervised approach for learning physics-based subspaces for real-time simulation. Existing learning-based methods construct subspaces by approximating pre-defined simulation data in a purely geometric way. However this approach tends to produce high-energy configurations leads to entangled latent space dimensions and generalizes poorly beyond the training set. To overcome these limitations we propose a self-supervised approach that directly minimizes the system's mechanical energy during training. We show that our method leads to learned subspaces that reflect physical equilibrium constraints resolve overfitting issues of previous methods and offer interpretable latent space parameters. + + + + How to Train Neural Field Representations: A Comprehensive Study and Benchmark + http://openaccess.thecvf.com//content/CVPR2024/papers/Papa_How_to_Train_Neural_Field_Representations_A_Comprehensive_Study_and_CVPR_2024_paper.pdf + Neural fields (NeFs) have recently emerged as a versatile method for modeling signals of various modalities including images shapes and scenes. Subsequently a number of works have explored the use of NeFs as representations for downstream tasks e.g. classifying an image based on the parameters of a NeF that has been fit to it. However the impact of the NeF hyperparameters on their quality as downstream representation is scarcely understood and remains largely unexplored. 
This is in part caused by the large amount of time required to fit datasets of neural fields. In this work, we propose a JAX-based library that leverages parallelization to enable fast optimization of large-scale NeF datasets, resulting in a significant speed-up. With this library, we perform a comprehensive study that investigates the effects of different hyperparameters on fitting NeFs for downstream tasks. In particular, we explore the use of a shared initialization, the effects of overtraining, and the expressiveness of the network architectures used. Our study provides valuable insights on how to train NeFs and offers guidance for optimizing their effectiveness in downstream applications. Finally, based on the proposed library and our analysis, we propose Neural Field Arena, a benchmark consisting of neural field variants of popular vision datasets, including MNIST, CIFAR, variants of ImageNet, and ShapeNetv2. Our library and the Neural Field Arena will be open-sourced to introduce standardized benchmarking and promote further research on neural fields. + + + + Strong Transferable Adversarial Attacks via Ensembled Asymptotically Normal Distribution Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Fang_Strong_Transferable_Adversarial_Attacks_via_Ensembled_Asymptotically_Normal_Distribution_Learning_CVPR_2024_paper.pdf + Strong adversarial examples are crucial for evaluating and enhancing the robustness of deep neural networks. However, the performance of popular attacks is usually sensitive, for instance, to minor image transformations, stemming from limited information -- typically only one input example, a handful of white-box source models, and undefined defense strategies. Hence, the crafted adversarial examples are prone to overfit the source model, which hampers their transferability to unknown architectures. In this paper, we propose an approach named Multiple Asymptotically Normal Distribution Attacks (MultiANDA), which explicitly characterizes adversarial perturbations from a learned distribution. Specifically, we approximate the posterior distribution over the perturbations by taking advantage of the asymptotic normality property of stochastic gradient ascent (SGA), then employ the deep ensemble strategy as an effective proxy for Bayesian marginalization in this process, aiming to estimate a mixture of Gaussians that facilitates a more thorough exploration of the potential optimization space. The approximated posterior essentially describes the stationary distribution of SGA iterations, which captures the geometric information around the local optimum. Thus, MultiANDA allows drawing an unlimited number of adversarial perturbations for each input and reliably maintains their transferability. Our proposed method outperforms ten state-of-the-art black-box attacks on deep learning models with or without defenses, as shown through extensive experiments on seven normally trained and seven defense models. + + + + Spanning Training Progress: Temporal Dual-Depth Scoring (TDDS) for Enhanced Dataset Pruning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Spanning_Training_Progress_Temporal_Dual-Depth_Scoring_TDDS_for_Enhanced_Dataset_CVPR_2024_paper.pdf + Dataset pruning aims to construct a coreset capable of achieving performance comparable to the original full dataset. Most existing dataset pruning methods rely on snapshot-based criteria to identify representative samples, often resulting in poor generalization across various pruning and cross-architecture scenarios.
Recent studies have addressed this issue by expanding the scope of training dynamics considered including factors such as forgetting event and probability change typically using an averaging approach. However these works struggle to integrate a broader range of training dynamics without overlooking well-generalized samples which may not be sufficiently highlighted in an averaging manner. In this study we propose a novel dataset pruning method termed as Temporal Dual-Depth Scoring (TDDS) to tackle this problem. TDDS utilizes a dual-depth strategy to achieve a balance between incorporating extensive training dynamics and identifying representative samples for dataset pruning. In the first depth we estimate the series of each sample's individual contributions spanning the training progress ensuring comprehensive integration of training dynamics. In the second depth we focus on the variability of the sample-wise contributions identified in the first depth to highlight well-generalized samples. Extensive experiments conducted on CIFAR and ImageNet datasets verify the superiority of TDDS over previous SOTA methods. Specifically on CIFAR-100 our method achieves 54.51% accuracy with only 10% training data surpassing baselines methods by more than 12.69%. Our codes are available at https://github.com/zhangxin-xd/Dataset-Pruning-TDDS. + + + + CoSeR: Bridging Image and Language for Cognitive Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_CoSeR_Bridging_Image_and_Language_for_Cognitive_Super-Resolution_CVPR_2024_paper.pdf + Existing super-resolution (SR) models primarily focus on restoring local texture details often neglecting the global semantic information within the scene. This oversight can lead to the omission of crucial semantic details or the introduction of inaccurate textures during the recovery process. In our work we introduce the Cognitive Super-Resolution (CoSeR) framework empowering SR models with the capacity to comprehend low-resolution images. We achieve this by marrying image appearance and language understanding to generate a cognitive embedding which not only activates prior information from large text-to-image diffusion models but also facilitates the generation of high-quality reference images to optimize the SR process. To further improve image fidelity we propose a novel condition injection scheme called "All-in-Attention" consolidating all conditional information into a single module. Consequently our method successfully restores semantically correct and photorealistic details demonstrating state-of-the-art performance across multiple benchmarks. Project page: https://coser-main.github.io/ + + + + PromptKD: Unsupervised Prompt Distillation for Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_PromptKD_Unsupervised_Prompt_Distillation_for_Vision-Language_Models_CVPR_2024_paper.pdf + Prompt learning has emerged as a valuable technique in enhancing vision-language models (VLMs) such as CLIP for downstream tasks in specific domains. Existing work mainly focuses on designing various learning forms of prompts neglecting the potential of prompts as effective distillers for learning from larger teacher models. In this paper we introduce an unsupervised domain prompt distillation framework which aims to transfer the knowledge of a larger teacher model to a lightweight target model through prompt-driven imitation using unlabeled domain images. Specifically our framework consists of two distinct stages. 
In the initial stage we pre-train a large CLIP teacher model using domain (few-shot) labels. After pre-training we leverage the unique decoupled-modality characteristics of CLIP by pre-computing and storing the text features as class vectors only once through the teacher text encoder. In the subsequent stage the stored class vectors are shared across teacher and student image encoders for calculating the predicted logits. Further we align the logits of both the teacher and student models via KL divergence encouraging the student image encoder to generate similar probability distributions to the teacher through the learnable prompts. The proposed prompt distillation process eliminates the reliance on labeled data enabling the algorithm to leverage a vast amount of unlabeled images within the domain. Finally the well-trained student image encoders and pre-stored text features (class vectors) are utilized for inference. To our best knowledge we are the first to (1) perform unsupervised domain-specific prompt-driven knowledge distillation for CLIP and (2) establish a practical pre-storing mechanism of text features as shared class vectors between teacher and student. Extensive experiments on 11 datasets demonstrate the effectiveness of our method. Code is publicly available at https://github.com/zhengli97/PromptKD. + + + + Robust Overfitting Does Matter: Test-Time Adversarial Purification With FGSM + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Robust_Overfitting_Does_Matter_Test-Time_Adversarial_Purification_With_FGSM_CVPR_2024_paper.pdf + Numerous studies have demonstrated the susceptibility of deep neural networks (DNNs) to subtle adversarial perturbations prompting the development of many advanced adversarial defense methods aimed at mitigating adversarial attacks. Current defense strategies usually train DNNs for a specific adversarial attack method and can achieve good robustness in defense against this type of adversarial attack. Nevertheless when subjected to evaluations involving unfamiliar attack modalities empirical evidence reveals a pronounced deterioration in the robustness of DNNs. Meanwhile there is a trade-off between the classification accuracy of clean examples and adversarial examples. Most defense methods often sacrifice the accuracy of clean examples in order to improve the adversarial robustness of DNNs. To alleviate these problems and enhance the overall robust generalization of DNNs we propose the Test-Time Pixel-Level Adversarial Purification (TPAP) method. This approach is based on the robust overfitting characteristic of DNNs to the fast gradient sign method (FGSM) on training and test datasets. It utilizes FGSM for adversarial purification to process images for purifying unknown adversarial perturbations from pixels at testing time in a "counter changes with changelessness" manner thereby enhancing the defense capability of DNNs against various unknown adversarial attacks. Extensive experimental results show that our method can effectively improve both overall robust generalization of DNNs notably over previous methods. Code is available https://github.com/tly18/TPAP. 
+ + + + Modality-Collaborative Test-Time Adaptation for Action Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiong_Modality-Collaborative_Test-Time_Adaptation_for_Action_Recognition_CVPR_2024_paper.pdf + Video-based Unsupervised Domain Adaptation (VUDA) method improves the generalization of the video model enabling it to be applied to action recognition tasks in different environments. However these methods require continuous access to source data during the adaptation process which are impractical in real scenarios where the source videos are not available with concerns in transmission efficiency or privacy issues. To address this problem in this paper we propose to solve the Multimodal Video Test-Time Adaptation task (MVTTA). Existing image-based TTA methods cannot be directly applied to this task because video have domain shift in multimodal and temporal which brings difficulties to adaptation. To address the above challenges we propose a Modality-Collaborative Test-Time Adaptation (MC-TTA) Network. We maintain teacher and student memory banks respectively for generating pseudo-prototypes and target-prototypes. In the teacher model we propose Self-assembled Source-friendly Feature Reconstruction (SSFR) module to encourage the teacher memory bank to store features that are more likely to be consistent with the source distribution. Through multimodal prototype alignment and cross-modal relative consistency our method can effectively alleviate domain shift in videos. We evaluate the proposed model on four public video datasets. The results show that our model outperforms existing state-of-the-art methods. + + + + Small Steps and Level Sets: Fitting Neural Surface Models with Point Guidance + http://openaccess.thecvf.com//content/CVPR2024/papers/Koneputugodage_Small_Steps_and_Level_Sets_Fitting_Neural_Surface_Models_with_CVPR_2024_paper.pdf + A neural signed distance function (SDF) is a convenient shape representation for many tasks such as surface reconstruction editing and generation. However neural SDFs are difficult to fit to raw point clouds such as those sampled from the surface of a shape by a scanner. A major issue occurs when the shape's geometry is very different from the structural biases implicit in the network's initialization. In this case we observe that the standard loss formulation does not guide the network towards the correct SDF values. We circumvent this problem by introducing guiding points and use them to steer the optimization towards the true shape via small incremental changes for which the loss formulation has a good descent direction. We show that this point-guided homotopy-based optimization scheme facilitates a deformation from an easy problem to the difficult reconstruction problem. We also propose a metric to quantify the difference in surface geometry between a target shape and an initial surface which helps indicate whether the standard loss formulation is guiding towards the target shape. Our method outperforms previous state-of-the-art approaches with large improvements on shapes identified by this metric as particularly challenging. 
+ + + + Domain-Agnostic Mutual Prompting for Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Du_Domain-Agnostic_Mutual_Prompting_for_Unsupervised_Domain_Adaptation_CVPR_2024_paper.pdf + Conventional Unsupervised Domain Adaptation (UDA) strives to minimize distribution discrepancy between domains which neglects to harness rich semantics from data and struggles to handle complex domain shifts. A promising technique is to leverage the knowledge of large-scale pre-trained vision-language models for more guided adaptation. Despite some endeavors current methods often learn textual prompts to embed domain semantics for source and target domains separately and perform classification within each domain limiting cross-domain knowledge transfer. Moreover prompting only the language branch lacks flexibility to adapt both modalities dynamically. To bridge this gap we propose Domain-Agnostic Mutual Prompting (DAMP) to exploit domain-invariant semantics by mutually aligning visual and textual embeddings. Specifically the image contextual information is utilized to prompt the language branch in a domain-agnostic and instance-conditioned way. Meanwhile visual prompts are imposed based on the domain-agnostic textual prompt to elicit domain-invariant visual embeddings. These two branches of prompts are learned mutually with a cross-attention module and regularized with a semantic-consistency loss and an instance-discrimination contrastive loss. Experiments on three UDA benchmarks demonstrate the superiority of DAMP over state-of-the-art approaches. + + + + Semantic-Aware Multi-Label Adversarial Attacks + http://openaccess.thecvf.com//content/CVPR2024/papers/Mahmood_Semantic-Aware_Multi-Label_Adversarial_Attacks_CVPR_2024_paper.pdf + Despite its importance generating attacks for multi label learning (MLL) models has received much less attention compared to multi-class recognition. Attacking an MLL model by optimizing a loss on the target set of labels has often the undesired consequence of changing the predictions for other labels. On the other hand adding a loss on the remaining labels to keep them fixed leads to highly negatively correlated gradient directions reducing the attack effectiveness. In this paper we develop a framework for crafting effective and semantic aware adversarial attacks for MLL. First to obtain an attack that leads to semantically consistent predictions across all labels we find a minimal superset of the target labels referred to as consistent target set. To do so we develop an efficient search algorithm over a knowledge graph which encodes label dependencies. Next we propose an optimization that searches for an attack that modifies the predictions of labels in the consistent target set while ensuring other labels will not get affected. This leads to an efficient algorithm that projects the gradient of the consistent target set loss onto the orthogonal direction of the gradient of the loss on other labels. Our framework can generate attacks on different target set sizes and for MLL with thousands of labels (as in OpenImages). Finally by extensive experiments on three datasets and several MLL models we show that our method generates both successful and semantically consistent attacks. + + + + MatSynth: A Modern PBR Materials Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Vecchio_MatSynth_A_Modern_PBR_Materials_Dataset_CVPR_2024_paper.pdf + We introduce MatSynth a dataset of 4000+ CC0 ultra-high resolution PBR materials. 
Materials are crucial components of virtual relightable assets defining the interaction of light at the surface of geometries. Given their importance significant research effort was dedicated to their representation creation and acquisition. However in the past 6 years most research in material acquisition or generation relied either on the same unique dataset or on company-owned huge library of procedural materials. With this dataset we propose a significantly larger more diverse and higher resolution set of materials than previously publicly available. We carefully discuss the data collection process and demonstrate the benefits of this dataset for material acquisition and generation applications. The complete data further contains metadata with each material's origin license category tags creation method and when available descriptions and physical size as well as 3M+ renderings of the augmented materials in 1K under various environment lightings. The MatSynth dataset is released through the project page at: https://www.gvecchio.com/matsynth. + + + + OTE: Exploring Accurate Scene Text Recognition Using One Token + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_OTE_Exploring_Accurate_Scene_Text_Recognition_Using_One_Token_CVPR_2024_paper.pdf + In this paper we propose a novel framework to fully exploit the potential of a single vector for scene text recognition (STR). Different from previous sequence-to-sequence methods that rely on a sequence of visual tokens to represent scene text images we prove that just one token is enough to characterize the entire text image and achieve accurate text recognition. Based on this insight we introduce a new paradigm for STR called One Token rEcognizer (OTE). Specifically we implement an image-to-vector encoder to extract the fine-grained global semantics eliminating the need for sequential features. Furthermore an elegant yet potent vector-to-sequence decoder is designed to adaptively diffuse global semantics to corresponding character locations enabling both autoregressive and non-autoregressive decoding schemes. By executing decoding within a high-level representational space our vector-to-sequence (V2S) approach avoids the alignment issues between visual tokens and character embeddings prevalent in traditional sequence-to-sequence methods. Remarkably due to introducing character-wise fine-grained information such global tokens also boost the performance of scene text retrieval tasks. Extensive experiments on synthetic and real datasets demonstrate the effectiveness of our method by achieving new state-of-the-art results on various public STR benchmarks. Our code is available at https://github.com/Xu-Jianjun/OTE. + + + + Gaussian Shadow Casting for Neural Characters + http://openaccess.thecvf.com//content/CVPR2024/papers/Bolanos_Gaussian_Shadow_Casting_for_Neural_Characters_CVPR_2024_paper.pdf + Neural character models can now reconstruct detailed geometry and texture from video but they lack explicit shadows and shading leading to artifacts when generating novel views and poses or during relighting. It is particularly difficult to include shadows as they are a global effect and the required casting of secondary rays is costly. We propose a new shadow model using a Gaussian density proxy that replaces sampling with a simple analytic formula. It supports dynamic motion and is tailored for shadow computation thereby avoiding the affine projection approximation and sorting required by the closely related Gaussian splatting. 
Combined with a deferred neural rendering model our Gaussian shadows enable Lambertian shading and shadow casting with minimal overhead. We demonstrate improved reconstructions with better separation of albedo shading and shadows in challenging outdoor scenes with direct sun light and hard shadows. Our method is able to optimize the light direction without any input from the user. As a result novel poses have fewer shadow artifacts and relighting in novel scenes is more realistic compared to the state-of-the-art methods providing new ways to pose neural characters in novel environments increasing their applicability. Code available at: https://github.com/LuisBolanos17/GaussianShadowCasting + + + + Federated Online Adaptation for Deep Stereo + http://openaccess.thecvf.com//content/CVPR2024/papers/Poggi_Federated_Online_Adaptation_for_Deep_Stereo_CVPR_2024_paper.pdf + We introduce a novel approach for adapting deep stereo networks in a collaborative manner. By building over principles of federated learning we develop a distributed framework allowing for demanding the optimization process to a number of clients deployed in different environments. This makes it possible for a deep stereo network running on resourced-constrained devices to capitalize on the adaptation process carried out by other instances of the same architecture and thus improve its accuracy in challenging environments even when it cannot carry out adaptation on its own. Experimental results show how federated adaptation performs equivalently to on-device adaptation and even better when dealing with challenging environments. + + + + Sequential Modeling Enables Scalable Learning for Large Vision Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Bai_Sequential_Modeling_Enables_Scalable_Learning_for_Large_Vision_Models_CVPR_2024_paper.pdf + We introduce a novel sequential modeling approach which enables learning a Large Vision Model (LVM) without making use of any linguistic data. To do this we define a common format "visual sentences" in which we can represent raw images and videos as well as annotated data sources such as semantic segmentations and depth reconstructions without needing any meta-knowledge beyond the pixels. Once this wide variety of visual data (comprising 420 billion tokens) is represented as sequences the model can be trained to minimize a cross-entropy loss for next token prediction. By training across various scales of model architecture and data diversity we provide empirical evidence that our models scale effectively. Many different vision tasks can be solved by designing suitable visual prompts at test time. + + + + Regularized Parameter Uncertainty for Improving Generalization in Reinforcement Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Moure_Regularized_Parameter_Uncertainty_for_Improving_Generalization_in_Reinforcement_Learning_CVPR_2024_paper.pdf + In order for reinforcement learning (RL) agents to be deployed in real-world environments they must be able to generalize to unseen environments. However RL struggles with out-of-distribution generalization often due to over-fitting the particulars of the training environment. Although regularization techniques from supervised learning can be applied to avoid over-fitting the differences between supervised learning and RL limit their application. To address this we propose the Signal-to-Noise Ratio regulated Parameter Uncertainty Network (SNR PUN) for RL. 
We introduce SNR as a new measure of regularizing the parameter uncertainty of a network and provide a formal analysis explaining why SNR regularization works well for RL. We demonstrate the effectiveness of our proposed method to generalize in several simulated environments; and in a physical system showing the possibility of using SNR PUN for applying RL to real-world applications. + + + + CoralSCOP: Segment any COral Image on this Planet + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_CoralSCOP_Segment_any_COral_Image_on_this_Planet_CVPR_2024_paper.pdf + Underwater visual understanding has recently gained increasing attention within the computer vision community for studying and monitoring underwater ecosystems. Among these coral reefs play an important and intricate role often referred to as the rainforests of the sea due to their rich biodiversity and crucial environmental impact. Existing coral analysis due to its technical complexity requires significant manual work from coral biologists therefore hindering scalable and comprehensive studies. In this paper we introduce CoralSCOP the first foundation model designed for the automatic dense segmentation of coral reefs. CoralSCOP is developed to accurately assign labels to different coral entities addressing the challenges in the semantic analysis of coral imagery. Its main objective is to identify and delineate the irregular boundaries between various coral individuals across different granularities such as coral/non-coral growth form and genus. This task is challenging due to the semantic agnostic nature or fixed limited semantic categories of previous generic segmentation methods which fail to adequately capture the complex characteristics of coral structures. By introducing a novel parallel semantic branch CoralSCOP can produce high-quality coral masks with semantics that enable a wide range of downstream coral reef analysis tasks. We demonstrate that CoralSCOP exhibits a strong zero-shot ability to segment unseen coral images. To effectively train our foundation model we propose CoralMask a new dataset with 41297 densely labeled coral images and 330144 coral masks. We have conducted comprehensive and extensive experiments to demonstrate the advantages of CoralSCOP over existing generalist segmentation algorithms and coral reef analytical approaches. + + + + Improved Baselines with Visual Instruction Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Improved_Baselines_with_Visual_Instruction_Tuning_CVPR_2024_paper.pdf + Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this paper we present the first systematic study to investigate the design choices of LMMs in a controlled setting under the LLaVA framework. We show that the fully-connected vision-language connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with response formatting prompts we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data and finishes full training in 1 day on a single 8-A100 node. Furthermore we present some early exploration of open problems in LMMs including scaling to higher resolution inputs compositional capabilities and model hallucination etc. We hope this makes state-of-the-art LMM research more accessible. 
Code and model will be publicly available. + + + + Unexplored Faces of Robustness and Out-of-Distribution: Covariate Shifts in Environment and Sensor Domains + http://openaccess.thecvf.com//content/CVPR2024/papers/Baek_Unexplored_Faces_of_Robustness_and_Out-of-Distribution_Covariate_Shifts_in_Environment_CVPR_2024_paper.pdf + Computer vision applications predict on digital images acquired by a camera from physical scenes through light. However conventional robustness benchmarks rely on perturbations in digitized images diverging from distribution shifts occurring in the image acquisition process. To bridge this gap we introduce a new distribution shift dataset ImageNet-ES comprising variations in environmental and camera sensor factors by directly capturing 202k images with a real camera in a controllable testbed. With the new dataset we evaluate out-of-distribution (OOD) detection and model robustness. We find that existing OOD detection methods do not cope with the covariate shifts in ImageNet-ES implying that the definition and detection of OOD should be revisited to embrace real-world distribution shifts. We also observe that the model becomes more robust in both ImageNet-C and -ES by learning environment and sensor variations in addition to existing digital augmentations. Lastly our results suggest that effective shift mitigation via camera sensor control can significantly improve performance without increasing model size. With these findings our benchmark may aid future research on robustness OOD and camera sensor control for computer vision. Our code and dataset are available at https://github.com/Edw2n/ImageNet-ES. + + + + GaussianEditor: Swift and Controllable 3D Editing with Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_GaussianEditor_Swift_and_Controllable_3D_Editing_with_Gaussian_Splatting_CVPR_2024_paper.pdf + 3D editing plays a crucial role in many areas such as gaming and virtual reality. Traditional 3D editing methods which rely on representations like meshes and point clouds often fall short in realistically depicting complex scenes. On the other hand methods based on implicit 3D representations like Neural Radiance Field (NeRF) render complex scenes effectively but suffer from slow processing speeds and limited control over specific scene areas. In response to these challenges our paper presents GaussianEditor the first 3D editing algorithm based on Gaussian Splatting (GS) a novel 3D representation. GaussianEditor enhances precision and control in editing through our proposed Gaussian semantic tracing which traces the editing target throughout the training process. Additionally we propose Hierarchical Gaussian splatting (HGS) to achieve stabilized and fine results under stochastic generative guidance from 2D diffusion models. We also develop editing strategies for efficient object removal and integration a challenging task for existing methods. Our comprehensive experiments demonstrate GaussianEditor's superior control effective and efficient performance marking a significant advancement in 3D editing. + + + + Open-Vocabulary Semantic Segmentation with Image Embedding Balancing + http://openaccess.thecvf.com//content/CVPR2024/papers/Shan_Open-Vocabulary_Semantic_Segmentation_with_Image_Embedding_Balancing_CVPR_2024_paper.pdf + Open-vocabulary semantic segmentation is a challenging task which requires the model to output semantic masks of an image beyond a close-set vocabulary. 
Although many efforts have been made to utilize powerful CLIP models to accomplish this task they are still easily overfitting to training classes due to the natural gaps in semantic information between training and new classes. To overcome this challenge we propose a novel framework for open-vocabulary semantic segmentation called EBSeg incorporating an Adaptively Balanced Decoder (AdaB Decoder) and a Semantic Structure Consistency loss (SSC Loss). The AdaB Decoder is designed to generate different image embeddings for both training and new classes. Subsequently these two types of embeddings are adaptively balanced to fully exploit their ability to recognize training classes and generalization ability for new classes. To learn a consistent semantic structure from CLIP the SSC Loss aligns the inter-classes affinity in the image feature space with that in the text feature space of CLIP thereby improving the generalization ability of our model. Furthermore we employ a frozen SAM image encoder to complement the spatial information that CLIP features lack due to the low training image resolution and image-level supervision inherent in CLIP. Extensive experiments conducted across various benchmarks demonstrate that the proposed EBSeg outperforms the state-of-the-art methods. Our code and trained models will be here: https://github.com/slonetime/EBSeg. + + + + Stronger Fewer & Superior: Harnessing Vision Foundation Models for Domain Generalized Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_Stronger_Fewer__Superior_Harnessing_Vision_Foundation_Models_for_Domain_CVPR_2024_paper.pdf + In this paper we first assess and harness various Vision Foundation Models (VFMs) in the context of Domain Generalized Semantic Segmentation (DGSS). Driven by the motivation that Leveraging Stronger pre-trained models and Fewer trainable parameters for Superior generalizability we introduce a robust fine-tuning approach namely "Rein" to parameter-efficiently harness VFMs for DGSS. Built upon a set of trainable tokens each linked to distinct instances Rein precisely refines and forwards the feature maps from each layer to the next layer within the backbone. This process produces diverse refinements for different categories within a single image. With fewer trainable parameters Rein efficiently fine-tunes VFMs for DGSS tasks surprisingly surpassing full parameter fine-tuning. Extensive experiments across various settings demonstrate that Rein significantly outperforms state-of-the-art methods. Remarkably with just an extra 1% of trainable parameters within the frozen backbone Rein achieves a mIoU of 68.1% on the Cityscapes without accessing any real urban-scene datasets. Code is available at https://github.com/w1oves/Rein.git. + + + + UniBind: LLM-Augmented Unified and Balanced Representation Space to Bind Them All + http://openaccess.thecvf.com//content/CVPR2024/papers/Lyu_UniBind_LLM-Augmented_Unified_and_Balanced_Representation_Space_to_Bind_Them_CVPR_2024_paper.pdf + We present UniBind a flexible and efficient approach that learns a unified representation space for seven diverse modalities-- images text audio point cloud thermal video and event data. Existing works eg. ImageBind treat the image as the central modality and build an image-centered representation space; however the space may be sub-optimal as it leads to an unbalanced representation space among all modalities. 
Moreover the category names are directly used to extract text embeddings for the downstream tasks making it hardly possible to represent the semantics of multi-modal data. The 'out-of-the-box' insight of our UniBind is to make the alignment center modality-agnostic and further learn a unified and balanced representation space empowered by the large language models (LLMs). UniBind is superior in its flexible application to all CLIP-style models and delivers remarkable performance boosts. To make this possible we 1) construct a knowledge base of text embeddings with the help of LLMs and multi-modal LLMs; 2) adaptively build LLM-augmented class-wise embedding center on top of the knowledge base and encoded visual embeddings; 3) align all the embeddings to the LLM-augmented embedding center via contrastive learning to achieve a unified and balanced representation space. UniBind shows strong zero-shot recognition performance gains over prior arts by an average of 6.36%. Finally we achieve new state-of-the-art performance eg. a 6.75% gain on ImageNet on the multi-modal fine-tuning setting while reducing 90% of the learnable parameters. + + + + Test-Time Adaptation for Depth Completion + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Test-Time_Adaptation_for_Depth_Completion_CVPR_2024_paper.pdf + It is common to observe performance degradation when transferring models trained on some (source) datasets to target testing data due to a domain gap between them. Existing methods for bridging this gap such as domain adaptation (DA) may require the source data on which the model was trained (often not available) while others i.e. source-free DA require many passes through the testing data. We propose an online test-time adaptation method for depth completion the task of inferring a dense depth map from a single image and associated sparse depth map that closes the performance gap in a single pass. We first present a study on how the domain shift in each data modality affects model performance. Based on our observations that the sparse depth modality exhibits a much smaller covariate shift than the image we design an embedding module trained in the source domain that preserves a mapping from features encoding only sparse depth to those encoding image and sparse depth. During test time sparse depth features are projected using this map as a proxy for source domain features and are used as guidance to train a set of auxiliary parameters (i.e. adaptation layer) to align image and sparse depth features from the target test domain to that of the source domain. We evaluate our method on indoor and outdoor scenarios and show that it improves over baselines by an average of 21.1%. Code available at https://github.com/seobbro/TTA-depth-completion. + + + + Binarized Low-light Raw Video Enhancement + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Binarized_Low-light_Raw_Video_Enhancement_CVPR_2024_paper.pdf + Recently deep neural networks have achieved excellent performance on low-light raw video enhancement. However they often come with high computational complexity and large memory costs which hinder their applications on resource-limited devices. In this paper we explore the feasibility of applying the extremely compact binary neural network (BNN) to low-light raw video enhancement. Nevertheless there are two main issues with binarizing video enhancement models. One is how to fuse the temporal information to improve low-light denoising without complex modules. 
The other is how to narrow the performance gap between binary convolutions with the full precision ones. To address the first issue we introduce a spatial-temporal shift operation which is easy-to-binarize and effective. The temporal shift efficiently aggregates the features of neighbor frames and the spatial shift handles the misalignment caused by the large motion in videos. For the second issue we present a distribution-aware binary convolution which captures the distribution characteristics of real-valued input and incorporates them into plain binary convolutions to alleviate the degradation in performance. Extensive quantitative and qualitative experiments have shown our high-efficiency binarized low-light raw video enhancement method can attain a promising performance. The code is available at https://github.com/ying-fu/BRVE. + + + + MorpheuS: Neural Dynamic 360deg Surface Reconstruction from Monocular RGB-D Video + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_MorpheuS_Neural_Dynamic_360deg_Surface_Reconstruction_from_Monocular_RGB-D_Video_CVPR_2024_paper.pdf + Neural rendering has demonstrated remarkable success in dynamic scene reconstruction. Thanks to the expressiveness of neural representations prior works can accurately capture the motion and achieve high-fidelity reconstruction of the target object. Despite this real-world video scenarios often feature large unobserved regions where neural representations struggle to achieve realistic completion. To tackle this challenge we introduce MorpheuS a framework for dynamic 360deg surface reconstruction from a casually captured RGB-D video. Our approach models the target scene as a canonical field that encodes its geometry and appearance in conjunction with a deformation field that warps points from the current frame to the canonical space. We leverage a view-dependent diffusion prior and distill knowledge from it to achieve realistic completion of unobserved regions. Experimental results on various real-world and synthetic datasets show that our method can achieve high-fidelity 360deg surface reconstruction of a deformable object from a monocular RGB-D video. + + + + Weakly Misalignment-free Adaptive Feature Alignment for UAVs-based Multimodal Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Weakly_Misalignment-free_Adaptive_Feature_Alignment_for_UAVs-based_Multimodal_Object_Detection_CVPR_2024_paper.pdf + Visible-infrared (RGB-IR) image fusion has shown great potentials in object detection based on unmanned aerial vehicles (UAVs). However the weakly misalignment problem between multimodal image pairs limits its performance in object detection. Most existing methods often ignore the modality gap and emphasize a strict alignment resulting in an upper bound of alignment quality and an increase of implementation costs. To address these challenges we propose a novel method named Offset-guided Adaptive Feature Alignment (OAFA) which could adaptively adjust the relative positions between multimodal features. Considering the impact of modality gap on the cross-modality spatial matching a Cross-modality Spatial Offset Modeling (CSOM) module is designed to establish a common subspace to estimate the precise feature-level offsets. Then an Offset-guided Deformable Alignment and Fusion (ODAF) module is utilized to implicitly capture optimal fusion positions for detection task rather than conducting a strict alignment. 
Comprehensive experiments demonstrate that our method not only achieves state-of-the-art performance in the UAVs-based object detection task but also shows strong robustness to the weakly misalignment problem. + + + + Passive Snapshot Coded Aperture Dual-Pixel RGB-D Imaging + http://openaccess.thecvf.com//content/CVPR2024/papers/Ghanekar_Passive_Snapshot_Coded_Aperture_Dual-Pixel_RGB-D_Imaging_CVPR_2024_paper.pdf + Passive compact single-shot 3D sensing is useful in many application areas such as microscopy medical imaging surgical navigation and autonomous driving where form factor time and power constraints can exist. Obtaining RGB-D scene information over a short imaging distance in an ultra-compact form factor and in a passive snapshot manner is challenging. Dual-pixel (DP) sensors are a potential solution to achieve the same. DP sensors collect light rays from two different halves of the lens in two interleaved pixel arrays thus capturing two slightly different views of the scene like a stereo camera system. However imaging with a DP sensor implies that the defocus blur size is directly proportional to the disparity seen between the views. This creates a trade-off between disparity estimation vs. deblurring accuracy. To improve this trade-off effect we propose CADS (Coded Aperture Dual-Pixel Sensing) in which we use a coded aperture in the imaging lens along with a DP sensor. In our approach we jointly learn an optimal coded pattern and the reconstruction algorithm in an end-to-end optimization setting. Our resulting CADS imaging system demonstrates improvement of >1.5dB PSNR in all-in-focus (AIF) estimates and 5-6% in depth estimation quality over naive DP sensing for a wide range of aperture settings. Furthermore we build the proposed CADS prototypes for DSLR photography settings and in an endoscope and a dermoscope form factor. Our novel coded dual-pixel sensing approach demonstrates accurate RGB-D reconstruction results in simulations and real-world experiments in a passive snapshot and compact manner. + + + + Instance Tracking in 3D Scenes from Egocentric Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Instance_Tracking_in_3D_Scenes_from_Egocentric_Videos_CVPR_2024_paper.pdf + Egocentric sensors such as AR/VR devices capture human-object interactions and offer the potential to provide task-assistance by recalling 3D locations of objects of interest in the surrounding environment. This capability requires instance tracking in real-world 3D scenes from egocentric videos (IT3DEgo). We explore this problem by first introducing a new benchmark dataset consisting of RGB and depth videos per-frame camera pose and instance-level annotations in both 2D camera and 3D world coordinates. We present an evaluation protocol which evaluates tracking performance in 3D coordinates with two settings for enrolling instances to track: (1) single-view online enrollment where an instance is specified on-the-fly based on the human wearer's interactions. and (2) multi-view pre-enrollment where images of an instance to be tracked are stored in memory ahead of time. To address IT3DEgo we first re-purpose methods from relevant areas e.g. single object tracking (SOT) -- running SOT methods to track instances in 2D frames and lifting them to 3D using camera pose and depth. We also present a simple method that leverages pretrained segmentation and detection models to generate proposals from RGB frames and match proposals with enrolled instance images. 
Our experiments show that our method (with no finetuning) significantly outperforms SOT-based approaches in the egocentric setting. We conclude by arguing that the problem of egocentric instance tracking is made easier by leveraging camera pose and using a 3D allocentric (world) coordinate representation. + + + + Learning to Transform Dynamically for Better Adversarial Transferability + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Learning_to_Transform_Dynamically_for_Better_Adversarial_Transferability_CVPR_2024_paper.pdf + Adversarial examples crafted by adding perturbations imperceptible to humans can deceive neural networks. Recent studies identify the adversarial transferability across various models i.e. the cross-model attack ability of adversarial samples. To enhance such adversarial transferability existing input transformation-based methods diversify input data with transformation augmentation. However their effectiveness is limited by the finite number of available transformations. In our study we introduce a novel approach named Learning to Transform (L2T). L2T increases the diversity of transformed images by selecting the optimal combination of operations from a pool of candidates consequently improving adversarial transferability. We conceptualize the selection of optimal transformation combinations as a trajectory optimization problem and employ a reinforcement learning strategy to effectively solve the problem. Comprehensive experiments on the ImageNet dataset as well as practical tests with Google Vision and GPT-4V reveal that L2T surpasses current methodologies in enhancing adversarial transferability thereby confirming its effectiveness and practical significance. + + + + PanoContext-Former: Panoramic Total Scene Understanding with a Transformer + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_PanoContext-Former_Panoramic_Total_Scene_Understanding_with_a_Transformer_CVPR_2024_paper.pdf + Panoramic images enable deeper understanding and more holistic perception of 360 surrounding environment which can naturally encode enriched scene context information compared to standard perspective image. Previous work has made lots of effort to solve the scene understanding task in a hybrid solution based on 2D-3D geometric reasoning thus each sub-task is processed separately and few correlations are explored in this procedure. In this paper we propose a fully 3D method for holistic indoor scene understanding which recovers the objects' shapes oriented bounding boxes and the 3D room layout simultaneously from a single panorama. To maximize the exploration of the rich context information we design a transformer-based context module to predict the representation and relationship among each component of the scene. In addition we introduce a new dataset for scene understanding including photo-realistic panoramas high-fidelity depth images accurately annotated room layouts oriented object bounding boxes and shapes. Experiments on the synthetic and new datasets demonstrate that our method outperforms previous panoramic scene understanding methods in terms of both layout estimation and 3D object detection. + + + + Prompt3D: Random Prompt Assisted Weakly-Supervised 3D Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Prompt3D_Random_Prompt_Assisted_Weakly-Supervised_3D_Object_Detection_CVPR_2024_paper.pdf + The prohibitive cost of annotations for fully supervised 3D indoor object detection limits its practicality. 
In this work, we propose Random Prompt Assisted Weakly-supervised 3D Object Detection, termed Prompt3D, a weakly-supervised approach that leverages position-level labels to overcome this challenge. Explicitly, our method focuses on enhancing labeling using synthetic scenes crafted from 3D shapes generated via random prompts. First, a Synthetic Scene Generation (SSG) module is introduced to assemble synthetic scenes with a curated collection of 3D shapes created via random prompts for each category. These scenes are enriched with automatically generated point-level annotations, providing a robust supervisory framework for training the detection algorithm. To enhance the transfer of knowledge from virtual to real datasets, we then introduce a Prototypical Proposal Feature Alignment (PPFA) module. This module effectively alleviates the domain gap by directly minimizing the distance between feature prototypes of the same class proposals across two domains. Compared with the state-of-the-art method BR, our method improves mAP by 5.4% and 8.7% with VoteNet and GroupFree3D serving as detectors, respectively, demonstrating the effectiveness of our proposed method. Code is available at: https://github.com/huishengye/prompt3d. + + + + Navigating Beyond Dropout: An Intriguing Solution towards Generalizable Image Super Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Navigating_Beyond_Dropout_An_Intriguing_Solution_towards_Generalizable_Image_Super_CVPR_2024_paper.pdf + Deep learning has led to a dramatic leap in Single Image Super-Resolution (SISR) performance in recent years. While most existing work assumes a simple and fixed degradation model (e.g., bicubic downsampling), research on Blind SR seeks to improve model generalization ability under unknown degradation. Recently, Kong et al. pioneered the investigation of a more suitable training strategy for Blind SR using Dropout. Although this method indeed brings substantial generalization improvements by mitigating overfitting, we argue that Dropout simultaneously introduces an undesirable side effect that compromises the model's capacity to faithfully reconstruct fine details. We present both theoretical and experimental analyses in our paper, and furthermore we present another easy yet effective training strategy that enhances the generalization ability of the model by simply modulating its first- and second-order feature statistics. Experimental results show that our method can serve as a model-agnostic regularization and outperforms Dropout on seven benchmark datasets, including both synthetic and real-world scenarios. + + + + FC-GNN: Recovering Reliable and Accurate Correspondences from Interferences + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_FC-GNN_Recovering_Reliable_and_Accurate_Correspondences_from_Interferences_CVPR_2024_paper.pdf + Finding correspondences between images is essential for many computer vision tasks, and sparse matching pipelines have been popular for decades. However, matching noise within and between images, along with inconsistent keypoint detection, frequently degrades the matching performance. We review these problems and thus propose: 1) a novel and unified Filtering and Calibrating (FC) approach that jointly rejects outliers and optimizes inliers, and 2) leveraging both the matching context and the underlying image texture to remove matching uncertainties. &#x0D;
Under the guidance of the above innovations, we construct the Filtering and Calibrating Graph Neural Network (FC-GNN), which follows the FC approach to recover reliable and accurate correspondences from various interferences. FC-GNN conducts an effectively combined inference of contextual and local information through careful embedding and multiple information aggregations, predicting confidence scores and calibration offsets for the input correspondences to jointly filter out outliers and improve pixel-level matching accuracy. Moreover, we exploit the local coherence of matches to perform inference on local graphs, thereby reducing computational complexity. Overall, FC-GNN operates at lightning speed and can greatly boost the performance of diverse matching pipelines across various tasks, showcasing the immense potential of such approaches to become standard and pivotal components of image matching. Code is available at https://github.com/xuy123456/fcgnn. + + + + Turb-Seg-Res: A Segment-then-Restore Pipeline for Dynamic Videos with Atmospheric Turbulence + http://openaccess.thecvf.com//content/CVPR2024/papers/Saha_Turb-Seg-Res_A_Segment-then-Restore_Pipeline_for_Dynamic_Videos_with_Atmospheric_Turbulence_CVPR_2024_paper.pdf + Tackling image degradation due to atmospheric turbulence, particularly in dynamic environments, remains a challenge for long-range imaging systems. Existing techniques have been primarily designed for static scenes or scenes with small motion. This paper presents the first segment-then-restore pipeline for restoring the videos of dynamic scenes in turbulent environments. We leverage mean optical flow with an unsupervised motion segmentation method to separate dynamic and static scene components prior to restoration. After camera shake compensation and segmentation, we introduce foreground/background enhancement leveraging the statistics of turbulence strength and a transformer model trained on a novel noise-based procedural turbulence generator for fast dataset augmentation. Benchmarked against existing restoration methods, our approach restores most of the geometric distortion and enhances the sharpness of videos. We make our code, simulator, and data publicly available to advance the field of video restoration from turbulence: riponcs.github.io/TurbSegRes + + + + Real-time Acquisition and Reconstruction of Dynamic Volumes with Neural Structured Illumination + http://openaccess.thecvf.com//content/CVPR2024/papers/Zeng_Real-time_Acquisition_and_Reconstruction_of_Dynamic_Volumes_with_Neural_Structured_CVPR_2024_paper.pdf + We propose a novel framework for real-time acquisition and reconstruction of temporally-varying 3D phenomena with high quality. The core of our framework is a deep neural network with an encoder that directly maps to the structured illumination during acquisition, a decoder that predicts a 1D density distribution from single-pixel measurements under the optimized lighting, and an aggregation module that combines the predicted densities for each camera into a single volume. It enables the automatic and joint optimization of physical acquisition and computational reconstruction, and is flexible enough to adapt to different hardware configurations. The effectiveness of our framework is demonstrated on a lightweight setup with an off-the-shelf projector and one or multiple cameras, achieving a performance of 40 volumes per second at a spatial resolution of 128^3. &#x0D;
We compare favorably with state-of-the-art techniques in real and synthetic experiments and evaluate the impact of various factors on our pipeline. + + + + Probabilistic Sampling of Balanced K-Means using Adiabatic Quantum Computing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zaech_Probabilistic_Sampling_of_Balanced_K-Means_using_Adiabatic_Quantum_Computing_CVPR_2024_paper.pdf + Adiabatic quantum computing (AQC) is a promising approach for discrete and often NP-hard optimization problems. Current AQCs make it possible to implement problems of research interest, which has sparked the development of quantum representations for many computer vision tasks. Despite requiring multiple measurements from the noisy AQC, current approaches only utilize the best measurement, discarding information contained in the remaining ones. In this work, we explore the potential of using this information for probabilistic balanced k-means clustering. Instead of discarding non-optimal solutions, we propose to use them to compute calibrated posterior probabilities with little additional compute cost. This allows us to identify ambiguous solutions and data points, which we demonstrate on a D-Wave AQC on synthetic tasks and real visual data. + + + + UniPT: Universal Parallel Tuning for Transfer Learning with Efficient Parameter and Memory + http://openaccess.thecvf.com//content/CVPR2024/papers/Diao_UniPT_Universal_Parallel_Tuning_for_Transfer_Learning_with_Efficient_Parameter_CVPR_2024_paper.pdf + Parameter-efficient transfer learning (PETL), i.e., fine-tuning a small portion of parameters, is an effective strategy for adapting pre-trained models to downstream domains. To further reduce the memory demand, recent PETL works focus on the more valuable memory-efficient characteristic. In this paper, we argue that the scalability, adaptability, and generalizability of state-of-the-art methods are hindered by structural dependency and pertinency on specific pre-trained backbones. To this end, we propose a new memory-efficient PETL strategy, Universal Parallel Tuning (UniPT), to mitigate these weaknesses. Specifically, we facilitate the transfer process via a lightweight and learnable parallel network, which consists of: 1) a parallel interaction module that decouples the sequential connections and processes the intermediate activations detachedly from the pre-trained network; and 2) a confidence aggregation module that learns optimal strategies adaptively for integrating cross-layer features. We evaluate UniPT with different backbones (e.g., T5, VSEinfinity, CLIP4Clip, Clip-ViL, and MDETR) on various vision-and-language and pure NLP tasks. Extensive ablations on 18 datasets have validated that UniPT can not only dramatically reduce memory consumption and outperform the best competitor, but also achieve competitive performance over other plain PETL methods with lower training memory overhead. Our code is publicly available at: https://github.com/Paranioar/UniPT. + + + + Composed Video Retrieval via Enriched Context and Discriminative Embeddings + http://openaccess.thecvf.com//content/CVPR2024/papers/Thawakar_Composed_Video_Retrieval_via_Enriched_Context_and_Discriminative_Embeddings_CVPR_2024_paper.pdf + Composed video retrieval (CoVR) is a challenging problem in computer vision which has recently highlighted the integration of modification text with visual queries for more sophisticated video search in large databases. Existing works predominantly rely on visual queries combined with modification text to distinguish relevant videos. &#x0D;
However, such a strategy struggles to fully preserve the rich query-specific context in retrieved target videos and only represents the target video using visual embedding. We introduce a novel CoVR framework that leverages detailed language descriptions to explicitly encode query-specific contextual information and learns discriminative embeddings of vision only, text only, and vision-text for better alignment to accurately retrieve matched target videos. Our proposed framework can be flexibly employed for both composed video (CoVR) and image (CoIR) retrieval tasks. Experiments on three datasets show that our approach obtains state-of-the-art performance for both CoVR and zero-shot CoIR tasks, achieving gains as high as around 7% in terms of recall@K=1 score. Our code and detailed language descriptions for the WebVid-CoVR dataset are available at https://github.com/OmkarThawakar/composed-video-retrieval. + + + + Perceptual Assessment and Optimization of HDR Image Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Perceptual_Assessment_and_Optimization_of_HDR_Image_Rendering_CVPR_2024_paper.pdf + High dynamic range (HDR) rendering has the ability to faithfully reproduce the wide luminance ranges in natural scenes, but how to accurately assess the rendering quality is relatively underexplored. Existing quality models are mostly designed for low dynamic range (LDR) images and do not align well with human perception of HDR image quality. To fill this gap, we propose a family of HDR quality metrics in which the key step is employing a simple inverse display model to decompose an HDR image into a stack of LDR images with varying exposures. Subsequently, these decomposed images are assessed through well-established LDR quality metrics. Our HDR quality models present three distinct benefits. First, they directly inherit the recent advancements of LDR quality metrics. Second, they do not rely on human perceptual data of HDR image quality for re-calibration. Third, they facilitate the alignment and prioritization of specific luminance ranges for more accurate and detailed quality assessment. Experimental results show that our HDR quality metrics consistently outperform existing models in terms of quality assessment on four HDR image quality datasets and perceptual optimization of HDR novel view synthesis. + + + + Multiview Aerial Visual RECognition (MAVREC): Can Multi-view Improve Aerial Visual Perception? + http://openaccess.thecvf.com//content/CVPR2024/papers/Dutta_Multiview_Aerial_Visual_RECognition_MAVREC_Can_Multi-view_Improve_Aerial_Visual_CVPR_2024_paper.pdf + Despite the commercial abundance of UAVs, aerial data acquisition remains challenging, and the existing Asia and North America-centric open-source UAV datasets are small-scale or low-resolution and lack diversity in scene contextuality. Additionally, the color content of the scenes, solar zenith angle, and population density of different geographies influence the data diversity. These factors conjointly render suboptimal aerial-visual perception of the deep neural network (DNN) models trained primarily on the ground-view data, including the open-world foundational models. To pave the way for a transformative era of aerial detection, we present Multiview Aerial Visual RECognition (MAVREC), a video dataset where we record synchronized scenes from different perspectives --- ground camera and drone-mounted camera. &#x0D;
MAVREC consists of around 2.5 hours of industry-standard 2.7K resolution video sequences more than 0.5 million frames and 1.1 million annotated bounding boxes. This makes MAVREC the largest ground and aerial view dataset and the fourth largest among all drone-based datasets across all modalities and tasks. Through our extensive benchmarking on MAVREC we recognize that augmenting object detectors with ground view images from the corresponding geographical location is a superior pre-training strategy for aerial detection. Building on this strategy we benchmark MAVREC with a curriculum-based semi-supervised object detection approach that leverages labeled (ground and aerial) and unlabeled (only aerial) images to enhance aerial detection. + + + + SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_SaCo_Loss_Sample-wise_Affinity_Consistency_for_Vision-Language_Pre-training_CVPR_2024_paper.pdf + Vision-language pre-training (VLP) aims to learn joint representations of vision and language modalities. The contrastive paradigm is currently dominant in this field. However we observe a notable misalignment phenomenon that is the affinity between samples has an obvious disparity across different modalities namely "Affinity Inconsistency Problem". Our intuition is that for a well-aligned model two images that look similar to each other should have the same level of similarity as their corresponding texts that describe them. In this paper we first investigate the reason of this inconsistency problem. We discover that the lack of consideration for sample-wise affinity consistency across modalities in existing training objectives is the central cause. To address this problem we propose a novel loss function named Sample-wise affinity Consistency (SaCo) loss which is designed to enhance such consistency by minimizing the distance between image embedding similarity and text embedding similarity for any two samples. Our SaCo loss can be easily incorporated into existing vision-language models as an additional loss due to its complementarity for most training objectives. In addition considering that pre-training from scratch is computationally expensive we also provide a more efficient way to continuously pre-train on a converged model by integrating our loss. Experimentally the model trained with our SaCo loss significantly outperforms the baseline on a variety of vision and language tasks. + + + + Stable Neighbor Denoising for Source-free Domain Adaptive Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Stable_Neighbor_Denoising_for_Source-free_Domain_Adaptive_Segmentation_CVPR_2024_paper.pdf + We study source-free unsupervised domain adaptation (SFUDA) for semantic segmentation which aims to adapt a source-trained model to the target domain without accessing the source data. Many works have been proposed to address this challenging problem among which uncertainty based self-training is a predominant approach. However without comprehensive denoising mechanisms they still largely fall into biased estimates when dealing with different domains and confirmation bias. In this paper we observe that pseudo-label noise is mainly contained in unstable samples in which the predictions of most pixels undergo significant variations during self-training. Inspired by this we propose a novel mechanism to denoise unstable samples with stable ones. 
Specifically, we introduce the Stable Neighbor Denoising (SND) approach, which effectively discovers highly correlated stable and unstable samples by nearest neighbor retrieval and guides the reliable optimization of unstable samples by bi-level learning. Moreover, we compensate for the stable set by object-level object paste, which can further eliminate the bias caused by less-learned classes. Our SND enjoys two advantages. First, SND does not require a specific segmentor structure, endowing it with universality. Second, SND simultaneously addresses the issues of class, domain, and confirmation biases during adaptation, ensuring its effectiveness. Extensive experiments show that SND consistently outperforms state-of-the-art methods in various SFUDA semantic segmentation settings. In addition, SND can be easily integrated with other approaches, obtaining further improvements. The source code will be publicly available. + + + + Boosting Adversarial Training via Fisher-Rao Norm-based Regularization + http://openaccess.thecvf.com//content/CVPR2024/papers/Yin_Boosting_Adversarial_Training_via_Fisher-Rao_Norm-based_Regularization_CVPR_2024_paper.pdf + Adversarial training is extensively utilized to improve the adversarial robustness of deep neural networks. Yet, mitigating the degradation of standard generalization performance in adversarially trained models remains an open problem. This paper attempts to resolve this issue through the lens of model complexity. First, we leverage the Fisher-Rao norm, a geometrically invariant metric for model complexity, to establish non-trivial bounds on the Cross-Entropy Loss-based Rademacher complexity for a ReLU-activated Multi-Layer Perceptron. Building upon this observation, we propose a novel regularization framework called Logit-Oriented Adversarial Training (LOAT), which can mitigate the trade-off between robustness and accuracy while imposing only a negligible increase in computational overhead. Our extensive experiments demonstrate that the proposed regularization strategy can boost the performance of the prevalent adversarial training algorithms, including PGD-AT, TRADES, TRADES (LSE), MART, and DM-AT, across various network architectures. Our code will be available at https://github.com/TrustAI/LOAT. + + + + DAVE - A Detect-and-Verify Paradigm for Low-Shot Counting + http://openaccess.thecvf.com//content/CVPR2024/papers/Pelhan_DAVE_-_A_Detect-and-Verify_Paradigm_for_Low-Shot_Counting_CVPR_2024_paper.pdf + Low-shot counters estimate the number of objects corresponding to a selected category based on only a few or no exemplars annotated in the image. The current state-of-the-art estimates the total count as the sum over the object location density map but does not provide object locations and sizes, which are crucial for many applications. This is addressed by detection-based counters, which, however, fall behind in total count accuracy. Furthermore, both approaches tend to overestimate the counts in the presence of other object classes due to many false positives. We propose DAVE, a low-shot counter based on a detect-and-verify paradigm that avoids the aforementioned issues by first generating a high-recall detection set and then verifying the detections to identify and remove the outliers. This jointly increases the recall and precision, leading to accurate counts. &#x0D;
DAVE outperforms the top density-based counters by ~20% in the total count MAE, outperforms the most recent detection-based counter by ~20% in detection quality, and sets a new state-of-the-art in zero-shot as well as text-prompt-based counting. + + + + Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Efficient_LoFTR_Semi-Dense_Local_Feature_Matching_with_Sparse-Like_Speed_CVPR_2024_paper.pdf + We present a novel method for efficiently producing semi-dense matches across images. The previous detector-free matcher LoFTR has shown remarkable matching capability in handling large viewpoint changes and texture-poor scenarios, but suffers from low efficiency. We revisit its design choices and derive multiple improvements for both efficiency and accuracy. One key observation is that performing the transformer over the entire feature map is redundant due to shared local information; therefore, we propose an aggregated attention mechanism with adaptive token selection for efficiency. Furthermore, we find that spatial variance exists in LoFTR's fine correlation module, which is adverse to matching accuracy. A novel two-stage correlation layer is proposed to achieve accurate subpixel correspondences for accuracy improvement. Our efficiency-optimized model is ~2.5x faster than LoFTR and can even surpass the state-of-the-art efficient sparse matching pipeline SuperPoint + LightGlue. Moreover, extensive experiments show that our method can achieve higher accuracy compared with competitive semi-dense matchers, with considerable efficiency benefits. This opens up exciting prospects for large-scale or latency-sensitive applications such as image retrieval and 3D reconstruction. Project page: https://zju3dv.github.io/efficientloftr/. + + + + Contextual Augmented Global Contrast for Multimodal Intent Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Sun_Contextual_Augmented_Global_Contrast_for_Multimodal_Intent_Recognition_CVPR_2024_paper.pdf + Multimodal intent recognition (MIR) aims to perceive the human intent polarity via language, visual, and acoustic modalities. The inherent intent ambiguity makes it challenging to recognize in multimodal scenarios. Existing MIR methods tend to model each video independently, ignoring global contextual information across videos. This learning manner inevitably introduces perception biases, exacerbated by the inconsistencies of the multimodal representation, amplifying the intent uncertainty. This challenge motivates us to explore effective global context modeling. Thus, we propose a context-augmented global contrast (CAGC) method to capture rich global context features by mining both intra- and cross-video context interactions for MIR. Concretely, we design a context-augmented transformer module to extract global context dependencies across videos. To further alleviate error accumulation and interference, we develop a cross-video bank that retrieves effective video sources by considering both intentional tendency and video similarity. Furthermore, we introduce a global context-guided contrastive learning scheme designed to mitigate inconsistencies arising from global context and individual modalities in different feature spaces. This scheme incorporates global cues as supervision to capture a robust multimodal intent representation. Experiments demonstrate that CAGC obtains superior performance compared to state-of-the-art MIR methods. &#x0D;
We also generalize our approach to a closely related task, multimodal sentiment analysis, achieving comparable performance. + + + + Pre-trained Model Guided Fine-Tuning for Zero-Shot Adversarial Robustness + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_Pre-trained_Model_Guided_Fine-Tuning_for_Zero-Shot_Adversarial_Robustness_CVPR_2024_paper.pdf + Large-scale pre-trained vision-language models like CLIP have demonstrated impressive performance across various tasks and exhibit remarkable zero-shot generalization capability, yet they are also vulnerable to imperceptible adversarial examples. Existing works typically employ adversarial training (fine-tuning) as a defense method against adversarial examples. However, direct application to the CLIP model may result in overfitting, compromising the model's capacity for generalization. In this paper, we propose the Pre-trained Model Guided Adversarial Fine-Tuning (PMG-AFT) method, which leverages supervision from the original pre-trained model by carefully designing an auxiliary branch to enhance the model's zero-shot adversarial robustness. Specifically, PMG-AFT minimizes the distance between the features of adversarial examples in the target model and those in the pre-trained model, aiming to preserve the generalization features already captured by the pre-trained model. Extensive experiments on 15 zero-shot datasets demonstrate that PMG-AFT significantly outperforms the state-of-the-art method, improving the top-1 robust accuracy by an average of 4.99%. Furthermore, our approach consistently improves clean accuracy by an average of 8.72%. + + + + CoGS: Controllable Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Yu_CoGS_Controllable_Gaussian_Splatting_CVPR_2024_paper.pdf + Capturing and re-animating the 3D structure of articulated objects present significant barriers. On one hand, methods requiring extensively calibrated multi-view setups are prohibitively complex and resource-intensive, limiting their practical applicability. On the other hand, while single-camera Neural Radiance Fields (NeRFs) offer a more streamlined approach, they have excessive training and rendering costs. 3D Gaussian Splatting would be a suitable alternative but for two issues: first, existing methods for 3D dynamic Gaussians require synchronized multi-view cameras, and second, they lack controllability in dynamic scenarios. We present CoGS, a method for Controllable Gaussian Splatting, that enables the direct manipulation of scene elements, offering real-time control of dynamic scenes without the prerequisite of pre-computing control signals. We evaluated CoGS using both synthetic and real-world datasets that include dynamic objects differing in degree of difficulty. In our evaluations, CoGS consistently outperformed existing dynamic and controllable neural representations in terms of visual fidelity. + + + + Partial-to-Partial Shape Matching with Geometric Consistency + http://openaccess.thecvf.com//content/CVPR2024/papers/Ehm_Partial-to-Partial_Shape_Matching_with_Geometric_Consistency_CVPR_2024_paper.pdf + Finding correspondences between 3D shapes is an important and long-standing problem in computer vision, graphics, and beyond. A prominent challenge is the partial-to-partial shape matching setting, which occurs when the shapes to match are only observed incompletely (e.g., from 3D scanning). Although partial-to-partial matching is a highly relevant setting in practice, it is rarely explored. &#x0D;
Our work bridges the gap between existing (rather artificial) 3D full shape matching and partial-to-partial real-world settings by exploiting geometric consistency as a strong constraint. We demonstrate that it is indeed possible to solve this challenging problem in a variety of settings. For the first time, we achieve geometric consistency for partial-to-partial matching, which is realized by a novel integer non-linear program formalism building on triangle product spaces, along with a new pruning algorithm based on linear integer programming. Further, we generate a new inter-class dataset for partial-to-partial shape matching. We show that our method outperforms current SOTA methods on both an established intra-class dataset and our novel inter-class dataset. + + + + Descriptor and Word Soups: Overcoming the Parameter Efficiency Accuracy Tradeoff for Out-of-Distribution Few-shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Liao_Descriptor_and_Word_Soups_Overcoming_the_Parameter_Efficiency_Accuracy_Tradeoff_CVPR_2024_paper.pdf + Over the past year, a large body of multimodal research has emerged around zero-shot evaluation using GPT descriptors. These studies boost the zero-shot accuracy of pretrained VL models with an ensemble of label-specific text generated by GPT. A recent study, WaffleCLIP, demonstrated that similar zero-shot accuracy can be achieved with an ensemble of random descriptors. However, both zero-shot methods are untrainable and consequently sub-optimal when some few-shot out-of-distribution (OOD) training data is available. Inspired by these prior works, we present two more flexible methods called descriptor and word soups, which do not require an LLM at test time and can leverage training data to increase OOD target accuracy. Descriptor soup greedily selects a small set of textual descriptors using generic few-shot training data, then calculates robust class embeddings using the selected descriptors. Word soup greedily assembles a chain of words in a similar manner. Compared to existing few-shot soft prompt tuning methods, word soup requires fewer parameters by construction and less GPU memory, since it does not require backpropagation. Both soups outperform current published few-shot methods, even when combined with SoTA zero-shot methods, on cross-dataset and domain generalization benchmarks. Compared with SoTA prompt and descriptor ensembling methods such as ProDA and WaffleCLIP, word soup achieves higher OOD accuracy with fewer ensemble members. Please check out our code: https://github.com/Chris210634/word_soups + + + + 360+x: A Panoptic Multi-modal Scene Understanding Dataset + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_360x_A_Panoptic_Multi-modal_Scene_Understanding_Dataset_CVPR_2024_paper.pdf + Human perception of the world is shaped by a multitude of viewpoints and modalities. While many existing datasets focus on scene understanding from a certain perspective (e.g., egocentric or third-person views), our dataset offers a panoptic perspective (i.e., multiple viewpoints with multiple data modalities). Specifically, we encapsulate third-person panoramic and front views, as well as egocentric monocular/binocular views, with rich modalities including video, multi-channel audio, directional binaural delay, location data, and textual scene descriptions within each scene captured, presenting a comprehensive observation of the world. &#x0D;
To the best of our knowledge this is the first database that covers multiple viewpoints with multiple data modalities to mimic how daily information is accessed in the real world. Through our benchmark analysis we presented 5 different scene understanding tasks on the proposed 360+x dataset to evaluate the impact and benefit of each data modality and perspective in panoptic scene understanding. We hope this unique dataset could broaden the scope of comprehensive scene understanding and encourage the community to approach these problems from more diverse perspectives. + + + + Generalized Event Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Sundar_Generalized_Event_Cameras_CVPR_2024_paper.pdf + Event cameras capture the world at high time resolution and with minimal bandwidth requirements. However event streams which only encode changes in brightness do not contain sufficient scene information to support a wide variety of downstream tasks. In this work we design generalized event cameras that inherently preserve scene intensity in a bandwidth-efficient manner. We generalize event cameras in terms of when an event is generated and what information is transmitted. To implement our designs we turn to single-photon sensors that provide digital access to individual photon detections; this modality gives us the flexibility to realize a rich space of generalized event cameras. Our single-photon event cameras are capable of high-speed high-fidelity imaging at low readout rates. Consequently these event cameras can support plug-and-play downstream inference without capturing new event datasets or designing specialized event-vision models. As a practical implication our designs which involve lightweight and near-sensor-compatible computations provide a way to use single-photon sensors without exorbitant bandwidth costs. + + + + 3D Neural Edge Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_3D_Neural_Edge_Reconstruction_CVPR_2024_paper.pdf + Real-world objects and environments are predominantly composed of edge features including straight lines and curves. Such edges are crucial elements for various applications such as CAD modeling surface meshing lane mapping etc. However existing traditional methods only prioritize lines over curves for simplicity in geometric modeling. To this end we introduce EMAP a new method for learning 3D edge representations with a focus on both lines and curves. Our method implicitly encodes 3D edge distance and direction in Unsigned Distance Functions (UDF) from multi-view edge maps. On top of this neural representation we propose an edge extraction algorithm that robustly abstracts parametric 3D edges from the inferred edge points and their directions. Comprehensive evaluations demonstrate that our method achieves better 3D edge reconstruction on multiple challenging datasets. We further show that our learned UDF field enhances neural surface reconstruction by capturing more details. + + + + DGC-GNN: Leveraging Geometry and Color Cues for Visual Descriptor-Free 2D-3D Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_DGC-GNN_Leveraging_Geometry_and_Color_Cues_for_Visual_Descriptor-Free_2D-3D_CVPR_2024_paper.pdf + Matching 2D keypoints in an image to a sparse 3D point cloud of the scene without requiring visual descriptors has garnered increased interest due to its low memory requirements inherent privacy preservation and reduced need for expensive 3D model maintenance compared to visual descriptor-based methods. 
However existing algorithms often compromise on performance resulting in a significant deterioration compared to their descriptor-based counterparts. In this paper we introduce DGC-GNN a novel algorithm that employs a global-to-local Graph Neural Network (GNN) that progressively exploits geometric and color cues to rep- resent keypoints thereby improving matching accuracy. Our procedure encodes both Euclidean and angular relations at a coarse level forming the geometric embedding to guide the point matching. We evaluate DGC-GNN on both indoor and outdoor datasets demonstrating that it not only doubles the accuracy of the state-of-the-art visual descriptor-free algorithm but also substantially narrows the performance gap between descriptor-based and descriptor free methods. + + + + CuVLER: Enhanced Unsupervised Object Discoveries through Exhaustive Self-Supervised Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Arica_CuVLER_Enhanced_Unsupervised_Object_Discoveries_through_Exhaustive_Self-Supervised_Transformers_CVPR_2024_paper.pdf + In this paper we introduce VoteCut an innovative method for unsupervised object discovery that leverages feature representations from multiple self-supervised models. VoteCut employs normalized-cut based graph partitioning clustering and a pixel voting approach. Additionally We present CuVLER (Cut-Vote-and-LEaRn) a zero-shot model trained using pseudo-labels generated by VoteCut and a novel soft target loss to refine segmentation accuracy. Through rigorous evaluations across multiple datasets and several unsupervised setups our methods demonstrate significant improvements in comparison to previous state-of-the-art models. Our ablation studies further highlight the contributions of each component revealing the robustness and efficacy of our approach. Collectively VoteCut and CuVLER pave the way for future advancements in image segmentation. + + + + Entity-NeRF: Detecting and Removing Moving Entities in Urban Scenes + http://openaccess.thecvf.com//content/CVPR2024/papers/Otonari_Entity-NeRF_Detecting_and_Removing_Moving_Entities_in_Urban_Scenes_CVPR_2024_paper.pdf + Recent advancements in the study of Neural Radiance Fields (NeRF) for dynamic scenes often involve explicit modeling of scene dynamics. However this approach faces challenges in modeling scene dynamics in urban environments where moving objects of various categories and scales are present. In such settings it becomes crucial to effectively eliminate moving objects to accurately reconstruct static backgrounds. Our research introduces an innovative method termed here as Entity-NeRF which combines the strengths of knowledge-based and statistical strategies. This approach utilizes entity-wise statistics leveraging entity segmentation and stationary entity classification through thing/stuff segmentation. To assess our methodology we created an urban scene dataset masked with moving objects. Our comprehensive experiments demonstrate that Entity-NeRF notably outperforms existing techniques in removing moving objects and reconstructing static urban backgrounds both quantitatively and qualitatively. 
+ + + + TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_TAMM_TriAdapter_Multi-Modal_Learning_for_3D_Shape_Understanding_CVPR_2024_paper.pdf + The limited scale of current 3D shape datasets hinders the advancements in 3D shape understanding and motivates multi-modal learning approaches which transfer learned knowledge from data-abundant 2D image and language modalities to 3D shapes. However even though the image and language representations have been aligned by cross-modal models like CLIP we find that the image modality fails to contribute as much as the language in existing multi-modal 3D representation learning methods. This is attributed to the domain shift in the 2D images and the distinct focus of each modality. To more effectively leverage both modalities in the pre-training we introduce TriAdapter Multi-Modal Learning (TAMM) - a novel two-stage learning approach based on three synergistic adapters. First our CLIP Image Adapter mitigates the domain gap between 3D-rendered images and natural images by adapting the visual representations of CLIP for synthetic image-text pairs. Subsequently our Dual Adapters decouple the 3D shape representation space into two complementary sub-spaces: one focusing on visual attributes and the other for semantic understanding which ensure a more comprehensive and effective multi-modal pre-training. Extensive experiments demonstrate that TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures pre-training datasets and downstream tasks. Notably we boost the zero-shot classification accuracy on Objaverse-LVIS from 46.8% to 50.7% and improve the 5-way 10-shot linear probing classification accuracy on ModelNet40 from 96.1% to 99.0%. Project page: https://alanzhangcs.github.io/tamm-page. + + + + GauHuman: Articulated Gaussian Splatting from Monocular Human Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Hu_GauHuman_Articulated_Gaussian_Splatting_from_Monocular_Human_Videos_CVPR_2024_paper.pdf + We present GauHuman a 3D human model with Gaussian Splatting for both fast training (1-2 minutes) and real-time rendering (up to 189 FPS) compared with existing NeRF-based implicit representation modelling frameworks demanding hours of training and seconds of rendering per frame. Specifically GauHuman encodes Gaussian Splatting in the canonical space and transforms 3D Gaussians from canonical space to posed space with linear blend skinning (LBS) in which effective pose and LBS refinement modules are designed to learn fine details of 3D humans under negligible computational cost. Moreover to enable fast optimization of GauHuman we initialize and prune 3D Gaussians with 3D human prior while splitting/cloning via KL divergence guidance along with a novel merge operation for further speeding up. Extensive experiments on ZJU_Mocap and MonoCap datasets demonstrate that GauHuman achieves state-of-the-art performance quantitatively and qualitatively with fast training and real-time rendering speed. Notably without sacrificing rendering quality GauHuman can rapidly model the 3D human performer with 13k 3D Gaussians. Our code is available at https://github.com/skhu101/GauHuman.
+ + + + EGTR: Extracting Graph from Transformer for Scene Graph Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Im_EGTR_Extracting_Graph_from_Transformer_for_Scene_Graph_Generation_CVPR_2024_paper.pdf + Scene Graph Generation (SGG) is a challenging task of detecting objects and predicting relationships between objects. After DETR was developed one-stage SGG models based on a one-stage object detector have been actively studied. However complex modeling is used to predict the relationship between objects and the inherent relationship between object queries learned in the multi-head self-attention of the object detector has been neglected. We propose a lightweight one-stage SGG model that extracts the relation graph from the various relationships learned in the multi-head self-attention layers of the DETR decoder. By fully utilizing the self-attention by-products the relation graph can be extracted effectively with a shallow relation extraction head. Considering the dependency of the relation extraction task on the object detection task we propose a novel relation smoothing technique that adjusts the relation label adaptively according to the quality of the detected objects. By relation smoothing the model is trained according to a continuous curriculum that focuses on the object detection task at the beginning of training and performs multi-task learning as the object detection performance gradually improves. Furthermore we propose a connectivity prediction task that predicts whether a relation exists between object pairs as an auxiliary task of the relation extraction. We demonstrate the effectiveness and efficiency of our method for the Visual Genome and Open Image V6 datasets. Our code is publicly available at https://github.com/naver-ai/egtr. + + + + Rethinking Multi-domain Generalization with A General Learning Objective + http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_Rethinking_Multi-domain_Generalization_with_A_General_Learning_Objective_CVPR_2024_paper.pdf + Multi-domain generalization (mDG) is universally aimed to minimize the discrepancy between training and testing distributions to enhance marginal-to-label distribution mapping. However existing mDG literature lacks a general learning objective paradigm and often imposes constraints on static target marginal distributions. In this paper we propose to leverage a Y-mapping to relax the constraint. We rethink the learning objective for mDG and design a new general learning objective to interpret and analyze most existing mDG wisdom. This general objective is bifurcated into two synergistic aims: learning domain-independent conditional features and maximizing a posterior. Explorations also extend to two effective regularization terms that incorporate prior information and suppress invalid causality alleviating the issues that come with relaxed constraints. We theoretically contribute an upper bound for the domain alignment of domain-independent conditional features disclosing that many previous mDG endeavors actually only partially optimize the objective and thus lead to limited performance. As such our study distills a general learning objective into four practical components providing a general robust and flexible mechanism to handle complex domain shifts. Extensive empirical results indicate that the proposed objective with Y-mapping leads to substantially better mDG performance in various downstream tasks including regression segmentation and classification.
Code is available at https://github.com/zhaorui-tan/GMDG/tree/main. + + + + Universal Novelty Detection Through Adaptive Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Mirzaei_Universal_Novelty_Detection_Through_Adaptive_Contrastive_Learning_CVPR_2024_paper.pdf + Novelty detection is a critical task for deploying machine learning models in the open world. A crucial property of novelty detection methods is universality which can be interpreted as generalization across various distributions of training or test data. More precisely for novelty detection distribution shifts may occur in the training set or the test set. Shifts in the training set refer to cases where we train a novelty detector on a new dataset and expect strong transferability. Conversely distribution shifts in the test set indicate the methods' performance when the trained model encounters a shifted test sample. We experimentally show that existing methods falter in maintaining universality which stems from their rigid inductive biases. Motivated by this we aim for more generalized techniques that have more adaptable inductive biases. In this context we leverage the fact that contrastive learning provides an efficient framework to easily switch and adapt to new inductive biases through the proper choice of augmentations in forming the negative pairs. We propose a novel probabilistic auto-negative pair generation method AutoAugOOD along with contrastive learning to yield a universal novelty detection method. Our experiments demonstrate the superiority of our method under different distribution shifts in various image benchmark datasets. Notably our method exhibits universality through its adaptability to different setups of novelty detection including one-class, unlabeled multi-class and labeled multi-class settings. + + + + Resurrecting Old Classes with New Data for Exemplar-Free Continual Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Goswami_Resurrecting_Old_Classes_with_New_Data_for_Exemplar-Free_Continual_Learning_CVPR_2024_paper.pdf + Continual learning methods are known to suffer from catastrophic forgetting a phenomenon that is particularly hard to counter for methods that do not store exemplars of previous tasks. Therefore to reduce potential drift in the feature extractor existing exemplar-free methods are typically evaluated in settings where the first task is significantly larger than subsequent tasks. Their performance drops drastically in more challenging settings starting with a smaller first task. To address this problem of feature drift estimation for exemplar-free methods we propose to adversarially perturb the current samples such that their embeddings are close to the old class prototypes in the old model embedding space. We then estimate the drift in the embedding space from the old to the new model using the perturbed images and compensate the prototypes accordingly. We exploit the fact that adversarial samples are transferable from the old to the new feature space in a continual learning setting. The generation of these images is simple and computationally cheap. We demonstrate in our experiments that the proposed approach better tracks the movement of prototypes in embedding space and outperforms existing methods on several standard continual learning benchmarks as well as on fine-grained datasets. Code is available at https://github.com/dipamgoswami/ADC.
+ + + + Poly Kernel Inception Network for Remote Sensing Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Cai_Poly_Kernel_Inception_Network_for_Remote_Sensing_Detection_CVPR_2024_paper.pdf + Object detection in remote sensing images (RSIs) often suffers from several increasing challenges including the large variation in object scales and the diverse-ranging context. Prior methods tried to address these challenges by expanding the spatial receptive field of the backbone either through large-kernel convolution or dilated convolution. However the former typically introduces considerable background noise while the latter risks generating overly sparse feature representations. In this paper we introduce the Poly Kernel Inception Network (PKINet) to handle the above challenges. PKINet employs multi-scale convolution kernels without dilation to extract object features of varying scales and capture local context. In addition a Context Anchor Attention (CAA) module is introduced in parallel to capture long-range contextual information. These two components work jointly to advance the performance of PKINet on four challenging remote sensing object detection benchmarks namely DOTA-v1.0 DOTA-v1.5 HRSC2016 and DIOR-R. + + + + Dual Prior Unfolding for Snapshot Compressive Imaging + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Dual_Prior_Unfolding_for_Snapshot_Compressive_Imaging_CVPR_2024_paper.pdf + Recently deep unfolding methods have achieved remarkable success in the realm of Snapshot Compressive Imaging (SCI) reconstruction. However the existing methods all follow the iterative framework of a single image prior which limits the efficiency of the unfolding methods and makes it a problem to use other priors simply and effectively. To break out of the box we derive an effective Dual Prior Unfolding (DPU) which achieves the joint utilization of multiple deep priors and greatly improves iteration efficiency. Our unfolding method is implemented through two parts i.e. Dual Prior Framework (DPF) and Focused Attention (FA). In brief in addition to the normal image prior DPF introduces a residual into the iteration formula and constructs a degraded prior for the residual by considering various degradations to establish the unfolding framework. To improve the effectiveness of the image prior based on self-attention FA adopts a novel mechanism inspired by PCA denoising to scale and filter attention which lets the attention focus more on effective features with little computation cost. Besides an asymmetric backbone is proposed to further improve the efficiency of hierarchical self-attention. Remarkably our 5-stage DPU achieves state-of-the-art (SOTA) performance with the least FLOPs and parameters compared to previous methods while our 9-stage DPU significantly outperforms other unfolding methods with less computational requirement. + + + + COLMAP-Free 3D Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Fu_COLMAP-Free_3D_Gaussian_Splatting_CVPR_2024_paper.pdf + While neural rendering has led to impressive advances in scene reconstruction and novel view synthesis it relies heavily on accurately pre-computed camera poses. To relax this constraint multiple efforts have been made to train Neural Radiance Fields (NeRFs) without pre-processed camera poses. However the implicit representations of NeRFs provide extra challenges to optimize the 3D structure and camera poses at the same time. 
On the other hand the recently proposed 3D Gaussian Splatting provides new opportunities given its explicit point cloud representations. This paper leverages both the explicit geometric representation and the continuity of the input video stream to perform novel view synthesis without any SfM preprocessing. We process the input frames in a sequential manner and progressively grow the 3D Gaussians set by taking one input frame at a time without the need to pre-compute the camera poses. Our method significantly improves over previous approaches in view synthesis and camera pose estimation under large motion changes. Our project page is: https://oasisyang.github.io/colmap-free-3dgs. + + + + BadCLIP: Dual-Embedding Guided Backdoor Attack on Multimodal Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_BadCLIP_Dual-Embedding_Guided_Backdoor_Attack_on_Multimodal_Contrastive_Learning_CVPR_2024_paper.pdf + While existing backdoor attacks have successfully infected multimodal contrastive learning models such as CLIP they can be easily countered by specialized backdoor defenses for MCL models. This paper reveals the threats in this practical scenario and introduces the BadCLIP attack which is resistant to backdoor detection and model fine-tuning defenses. To achieve this we draw motivations from the perspective of the Bayesian rule and propose a dual-embedding guided framework for backdoor attacks. Specifically we ensure that visual trigger patterns approximate the textual target semantics in the embedding space making it challenging to detect the subtle parameter variations induced by backdoor learning on such natural trigger patterns. Additionally we optimize the visual trigger patterns to align the poisoned samples with target vision features in order to hinder backdoor unlearning through clean fine-tuning. Our experiments show a significant improvement in attack success rate (+45.3% ASR) over current leading methods even against state-of-the-art backdoor defenses highlighting our attack's effectiveness in various scenarios including downstream tasks. Our codes can be found at https://github.com/LiangSiyuan21/BadCLIP. + + + + Efficient Vision-Language Pre-training by Cluster Masking + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_Efficient_Vision-Language_Pre-training_by_Cluster_Masking_CVPR_2024_paper.pdf + We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training we randomly mask clusters of visually similar image patches as measured by their raw pixel intensities. This provides an extra learning signal beyond the contrastive training itself since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our model by pre-training on a number of benchmarks finding that it outperforms other masking strategies such as FLIP on the quality of the learned representation. + + + + GPS-Gaussian: Generalizable Pixel-wise 3D Gaussian Splatting for Real-time Human Novel View Synthesis + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_GPS-Gaussian_Generalizable_Pixel-wise_3D_Gaussian_Splatting_for_Real-time_Human_Novel_CVPR_2024_paper.pdf + We present a new approach termed GPS-Gaussian for synthesizing novel views of a character in a real-time manner.
The proposed method enables 2K-resolution rendering under a sparse-view camera setting. Unlike the original Gaussian Splatting or neural implicit rendering methods that necessitate per-subject optimizations we introduce Gaussian parameter maps defined on the source views and regress directly Gaussian Splatting properties for instant novel view synthesis without any fine-tuning or optimization. To this end we train our Gaussian parameter regression module on a large amount of human scan data jointly with a depth estimation module to lift 2D parameter maps to 3D space. The proposed framework is fully differentiable and experiments on several datasets demonstrate that our method outperforms state-of-the-art methods while achieving an exceeding rendering speed. + + + + MAGICK: A Large-scale Captioned Dataset from Matting Generated Images using Chroma Keying + http://openaccess.thecvf.com//content/CVPR2024/papers/Burgert_MAGICK_A_Large-scale_Captioned_Dataset_from_Matting_Generated_Images_using_CVPR_2024_paper.pdf + We introduce MAGICK a large-scale dataset of generated objects with high-quality alpha mattes. While image generation methods have produced segmentations they cannot generate alpha mattes with accurate details in hair fur and transparencies. This is likely due to the small size of current alpha matting datasets and the difficulty in obtaining ground-truth alpha. We propose a scalable method for synthesizing images of objects with high-quality alpha that can be used as a ground-truth dataset. A key idea is to generate objects on a single-colored background so chroma keying approaches can be used to extract the alpha. However this faces several challenges including that current text-to-image generation methods cannot create images that can be easily chroma keyed and that chroma keying is an underconstrained problem that generally requires manual intervention for high-quality results. We address this using a combination of generation and alpha extraction methods. Using our method we generate a dataset of 150000 objects with alpha. We show the utility of our dataset by training an alpha-to-rgb generation method that outperforms baselines. Please see our project website at https://ryanndagreat.github.io/MAGICK/. + + + + Video Super-Resolution Transformer with Masked Inter&Intra-Frame Attention + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Video_Super-Resolution_Transformer_with_Masked_InterIntra-Frame_Attention_CVPR_2024_paper.pdf + Recently Vision Transformer has achieved great success in recovering missing details in low-resolution sequences i.e. the video super-resolution (VSR) task. Despite its superiority in VSR accuracy the heavy computational burden as well as the large memory footprint hinder the deployment of Transformer-based VSR models on constrained devices. In this paper we address the above issue by proposing a novel feature-level masked processing framework: VSR with Masked Intra and inter-frame Attention (MIA-VSR). The core of MIA-VSR is leveraging feature-level temporal continuity between adjacent frames to reduce redundant computations and make more rational use of previously enhanced SR features. Concretely we propose an intra-frame and inter-frame attention block which takes the respective roles of past features and input features into consideration and only exploits previously enhanced features to provide supplementary information. 
In addition an adaptive block-wise mask prediction module is developed to skip unimportant computations according to feature similarity between adjacent frames. We conduct detailed ablation studies to validate our contributions and compare the proposed method with recent state-of-the-art VSR approaches. The experimental results demonstrate that MIA-VSR improves the memory and computation efficiency over state-of-the-art methods without trading off PSNR accuracy. The code is available at https://github.com/LabShuHangGU/MIA-VSR. + + + + SurroundSDF: Implicit 3D Scene Understanding Based on Signed Distance Field + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_SurroundSDF_Implicit_3D_Scene_Understanding_Based_on_Signed_Distance_Field_CVPR_2024_paper.pdf + Vision-centric 3D environment understanding is both vital and challenging for autonomous driving systems. Recently object-free methods have attracted considerable attention. Such methods perceive the world by predicting the semantics of discrete voxel grids but fail to construct continuous and accurate obstacle surfaces. To this end in this paper we propose SurroundSDF to implicitly predict the signed distance field (SDF) and semantic field for the continuous perception from surround images. Specifically we introduce a query-based approach and utilize SDF constrained by the Eikonal formulation to accurately describe the surfaces of obstacles. Furthermore considering the absence of precise SDF ground truth we propose a novel weakly supervised paradigm for SDF referred to as the Sandwich Eikonal formulation which emphasizes applying correct and dense constraints on both sides of the surface thereby enhancing the perceptual accuracy of the surface. Experiments suggest that our method achieves SOTA for both occupancy prediction and 3D scene reconstruction tasks on the nuScenes dataset. + + + + Outdoor Scene Extrapolation with Hierarchical Generative Cellular Automata + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Outdoor_Scene_Extrapolation_with_Hierarchical_Generative_Cellular_Automata_CVPR_2024_paper.pdf + We aim to generate fine-grained 3D geometry from large-scale sparse LiDAR scans abundantly captured by autonomous vehicles (AV). Contrary to prior work on AV scene completion we aim to extrapolate fine geometry from unlabeled and beyond spatial limits of LiDAR scans taking a step towards generating realistic high-resolution simulation-ready 3D street environments. We propose hierarchical Generative Cellular Automata (hGCA) a spatially scalable conditional 3D generative model which grows geometry recursively with local kernels following GCAs in a coarse-to-fine manner equipped with a light-weight planner to induce global consistency. Experiments on synthetic scenes show that hGCA generates plausible scene geometry with higher fidelity and completeness compared to state-of-the-art baselines. Our model generalizes strongly from sim-to-real qualitatively outperforming baselines on the Waymo-open dataset. We also show anecdotal evidence of the ability to create novel objects from real-world geometric cues even when trained on limited synthetic content. 
+ + + + Instruct 4D-to-4D: Editing 4D Scenes as Pseudo-3D Scenes Using 2D Diffusion + http://openaccess.thecvf.com//content/CVPR2024/papers/Mou_Instruct_4D-to-4D_Editing_4D_Scenes_as_Pseudo-3D_Scenes_Using_2D_CVPR_2024_paper.pdf + This paper proposes Instruct 4D-to-4D that achieves 4D awareness and spatial-temporal consistency for 2D diffusion models to generate high-quality instruction-guided dynamic scene editing results. Traditional applications of 2D diffusion models in dynamic scene editing often result in inconsistency primarily due to their inherent frame-by-frame editing methodology. Addressing the complexities of extending instruction-guided editing to 4D our key insight is to treat a 4D scene as a pseudo-3D scene decoupled into two sub-problems: achieving temporal consistency in video editing and applying these edits to the pseudo-3D scene. Following this we first enhance the Instruct-Pix2Pix (IP2P) model with an anchor-aware attention module for batch processing and consistent editing. Additionally we integrate optical flow-guided appearance propagation in a sliding window fashion for more precise frame-to-frame editing and incorporate depth-based projection to manage the extensive data of pseudo-3D scenes followed by iterative editing to achieve convergence. We extensively evaluate our approach in various scenes and editing instructions and demonstrate that it achieves spatially and temporally consistent editing results with significantly enhanced detail and sharpness over the prior art. Notably Instruct 4D-to-4D is general and applicable to both monocular and challenging multi-camera scenes. + + + + Photo-SLAM: Real-time Simultaneous Localization and Photorealistic Mapping for Monocular Stereo and RGB-D Cameras + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Photo-SLAM_Real-time_Simultaneous_Localization_and_Photorealistic_Mapping_for_Monocular_Stereo_CVPR_2024_paper.pdf + The integration of neural rendering and the SLAM system recently showed promising results in joint localization and photorealistic view reconstruction. However existing methods fully relying on implicit representations are so resource-hungry that they cannot run on portable devices which deviates from the original intention of SLAM. In this paper we present Photo-SLAM a novel SLAM framework with a hyper primitives map. Specifically we simultaneously exploit explicit geometric features for localization and learn implicit photometric features to represent the texture information of the observed environment. In addition to actively densifying hyper primitives based on geometric features we further introduce a Gaussian-Pyramid-based training method to progressively learn multi-level features enhancing photorealistic mapping performance. The extensive experiments with monocular stereo and RGB-D datasets prove that our proposed system Photo-SLAM significantly outperforms current state-of-the-art SLAM systems for online photorealistic mapping e.g. PSNR is 30% higher and rendering speed is hundreds of times faster in the Replica dataset. Moreover the Photo-SLAM can run at real-time speed using an embedded platform such as Jetson AGX Orin showing the potential of robotics applications. Project Page and code: https://huajianup.github.io/research/Photo-SLAM/. 
+ + + + ProMotion: Prototypes As Motion Learners + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_ProMotion_Prototypes_As_Motion_Learners_CVPR_2024_paper.pdf + In this work we introduce ProMotion a unified prototypical transformer-based framework engineered to model fundamental motion tasks. ProMotion offers a range of compelling attributes that set it apart from current task-specific paradigms. 1. We adopt a prototypical perspective establishing a unified paradigm that harmonizes disparate motion learning approaches. This novel paradigm streamlines the architectural design enabling the simultaneous assimilation of diverse motion information. 2. We capitalize on a dual mechanism involving the feature denoiser and the prototypical learner to decipher the intricacies of motion. This approach effectively circumvents the pitfalls of ambiguity in pixel-wise feature matching significantly bolstering the robustness of motion representation. We demonstrate a profound degree of transferability across distinct motion patterns. This inherent versatility reverberates robustly across a comprehensive spectrum of both 2D and 3D downstream tasks. Empirical results demonstrate that ProMotion outperforms various well-known specialized architectures achieving 0.54 and 0.054 Abs Rel error on the Sintel and KITTI depth datasets 1.04 and 2.01 average endpoint error on the clean and final pass of Sintel flow benchmark and 4.30 F1-all error on the KITTI flow benchmark. Given its efficacy we hope our work can catalyze a paradigm shift in universal models in computer vision. + + + + SpatialTracker: Tracking Any 2D Pixels in 3D Space + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiao_SpatialTracker_Tracking_Any_2D_Pixels_in_3D_Space_CVPR_2024_paper.pdf + Recovering dense and long-range pixel motion in videos is a challenging problem. Part of the difficulty arises from the 3D-to-2D projection process leading to occlusions and discontinuities in the 2D motion domain. While 2D motion can be intricate we posit that the underlying 3D motion can often be simple and low-dimensional. In this work we propose to estimate point trajectories in 3D space to mitigate the issues caused by image projection. Our method named SpatialTracker lifts 2D pixels to 3D using monocular depth estimators represents the 3D content of each frame efficiently using a triplane representation and performs iterative updates using a transformer to estimate 3D trajectories. Tracking in 3D allows us to leverage as-rigid-as-possible (ARAP) constraints while simultaneously learning a rigidity embedding that clusters pixels into different rigid parts. Extensive evaluation shows that our approach achieves state-of-the-art tracking performance both qualitatively and quantitatively particularly in challenging scenarios such as out-of-plane rotation. And our project page is available at https://henry123-boy.github.io/SpaTracker/. + + + + CrossMAE: Cross-Modality Masked Autoencoders for Region-Aware Audio-Visual Pre-Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_CrossMAE_Cross-Modality_Masked_Autoencoders_for_Region-Aware_Audio-Visual_Pre-Training_CVPR_2024_paper.pdf + Learning joint and coordinated features across modalities is essential for many audio-visual tasks. Existing pre-training methods primarily focus on global information neglecting fine-grained features and positions leading to suboptimal performance in dense prediction tasks.
To address this issue we take a further step towards region-aware audio-visual pre-training and propose CrossMAE which excels in Cross-modality interaction and region alignment. Specifically we devise two masked autoencoding (MAE) pretext tasks at both pixel and embedding levels namely Cross-Conditioned Reconstruction and Cross-Embedding Reconstruction. Taking the visual modality as an example (the same goes for audio) in Cross-Conditioned Reconstruction the visual modality reconstructs the input image pixels conditioned on audio Attentive Tokens. As for the more challenging Cross-Embedding Reconstruction unmasked visual tokens reconstruct complete audio features under the guidance of learnable queries implying positional information which effectively enhances the interaction between modalities and exploits fine-grained semantics. Experimental results demonstrate that CrossMAE achieves state-of-the-art performance not only in classification and retrieval but also in dense prediction tasks. Furthermore we dive into the mechanism of modal interaction and region alignment of CrossMAE highlighting the effectiveness of the proposed components. + + + + Osprey: Pixel Understanding with Visual Instruction Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_Osprey_Pixel_Understanding_with_Visual_Instruction_Tuning_CVPR_2024_paper.pdf + Multimodal large language models (MLLMs) have recently achieved impressive general-purpose vision-language capabilities through visual instruction tuning. However current MLLMs primarily focus on image-level or box-level understanding falling short in achieving fine-grained vision-language alignment at pixel level. Besides the lack of mask-based instruction data limits their advancements. In this paper we propose Osprey a mask-text instruction tuning approach to extend MLLMs by incorporating fine-grained mask regions into language instruction aiming at achieving pixel-wise visual understanding. To achieve this goal we first meticulously curate a mask-based region-text dataset with 724K samples and then design a vision-language model by injecting pixel-level representation into LLM. Specifically Osprey adopts a convolutional CLIP backbone as the vision encoder and employs a mask-aware visual extractor to extract precise visual mask features from high resolution input. Experimental results demonstrate Osprey's superiority in various region understanding tasks showcasing its new capability for pixel-level instruction tuning. In particular Osprey can be integrated with Segment Anything Model (SAM) seamlessly to obtain multi-granularity semantics. The source code dataset and demo can be found at https://github.com/CircleRadon/Osprey. + + + + Few-shot Learner Parameterization by Diffusion Time-steps + http://openaccess.thecvf.com//content/CVPR2024/papers/Yue_Few-shot_Learner_Parameterization_by_Diffusion_Time-steps_CVPR_2024_paper.pdf + Even when using large multi-modal foundation models few-shot learning is still challenging -- if there is no proper inductive bias it is nearly impossible to keep the nuanced class attributes while removing the visually prominent attributes that spuriously correlate with class labels. To this end we find an inductive bias that the time-steps of a Diffusion Model (DM) can isolate the nuanced class attributes i.e. as the forward diffusion adds noise to an image at each time-step nuanced attributes are usually lost at an earlier time-step than the spurious attributes that are visually prominent. 
Building on this we propose Time-step Few-shot (TiF) learner. We train class-specific low-rank adapters for a text-conditioned DM to make up for the lost attributes such that images can be accurately reconstructed from their noisy ones given a prompt. Hence at a small time-step the adapter and prompt are essentially a parameterization of only the nuanced class attributes. For a test image we can use the parameterization to only extract the nuanced class attributes for classification. TiF learner significantly outperforms OpenCLIP and its adapters on a variety of fine-grained and customized few-shot learning tasks. Codes are in https://github.com/yue-zhongqi/tif. + + + + OrCo: Towards Better Generalization via Orthogonality and Contrast for Few-Shot Class-Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Ahmed_OrCo_Towards_Better_Generalization_via_Orthogonality_and_Contrast_for_Few-Shot_CVPR_2024_paper.pdf + Few-Shot Class-Incremental Learning (FSCIL) introduces a paradigm in which the problem space expands with limited data. FSCIL methods inherently face the challenge of catastrophic forgetting as data arrives incrementally making models susceptible to overwriting previously acquired knowledge. Moreover given the scarcity of labeled samples available at any given time models may be prone to overfitting and find it challenging to strike a balance between extensive pretraining and the limited incremental data. To address these challenges we propose the OrCo framework built on two core principles: features' orthogonality in the representation space and contrastive learning. In particular we improve the generalization of the embedding space by employing a combination of supervised and self-supervised contrastive losses during the pretraining phase. Additionally we introduce OrCo loss to address challenges arising from data limitations during incremental sessions. Through feature space perturbations and orthogonality between classes the OrCo loss maximizes margins and reserves space for the following incremental data. This in turn ensures the accommodation of incoming classes in the feature space without compromising previously acquired knowledge. Our experimental results showcase state-of-the-art performance across three benchmark datasets including mini-ImageNet CIFAR100 and CUB datasets. Code is available at https://github.com/noorahmedds/OrCo + + + + MuGE: Multiple Granularity Edge Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_MuGE_Multiple_Granularity_Edge_Detection_CVPR_2024_paper.pdf + Edge segmentation is well-known to be subjective due to personalized annotation styles and preferred granularity. However most existing deterministic edge detection methods produce only a single edge map for one input image. We argue that generating multiple edge maps is more reasonable than generating a single one considering the subjectivity and ambiguity of the edges. Thus motivated in this paper we propose multiple granularity edge detection called MuGE which can produce a wide range of edge maps from approximate object contours to fine texture edges. Specifically we first propose to design an edge granularity network to estimate the edge granularity from an individual edge annotation. Subsequently to guide the generation of diversified edge maps we integrate such edge granularity into the multi-scale feature maps in the spatial domain. 
Meanwhile we decompose the feature maps into low-frequency and high-frequency parts where the encoded edge granularity is further fused into the high-frequency part to achieve more precise control over the details of the produced edge maps. Compared to previous methods MuGE is able to not only generate multiple edge maps at different controllable granularities but also achieve a competitive performance on the BSDS500 and Multicue benchmark datasets. + + + + Real-World Efficient Blind Motion Deblurring via Blur Pixel Discretization + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Real-World_Efficient_Blind_Motion_Deblurring_via_Blur_Pixel_Discretization_CVPR_2024_paper.pdf + As recent advances in mobile camera technology have enabled the capability to capture high-resolution images such as 4K images the demand for an efficient deblurring model handling large motion has increased. In this paper we discover that the image residual errors i.e. blur-sharp pixel differences can be grouped into some categories according to their motion blur type and how complex their neighboring pixels are. Inspired by this we decompose the deblurring (regression) task into blur pixel discretization (pixel-level blur classification) and discrete-to-continuous conversion (regression with blur class map) tasks. Specifically we generate the discretized image residual errors by identifying the blur pixels and then transform them to a continuous form which is computationally more efficient than naively solving the original regression problem with continuous values. Here we found that the discretization result i.e. blur segmentation map remarkably exhibits visual similarity with the image residual errors. As a result our efficient model shows comparable performance to state-of-the-art methods in realistic benchmarks while our method is up to 10 times computationally more efficient. + + + + EmoVIT: Revolutionizing Emotion Insights with Visual Instruction Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_EmoVIT_Revolutionizing_Emotion_Insights_with_Visual_Instruction_Tuning_CVPR_2024_paper.pdf + Visual Instruction Tuning represents a novel learning paradigm involving the fine-tuning of pre-trained language models using task-specific instructions. This paradigm shows promising zero-shot results in various natural language processing tasks but is still unexplored in vision emotion understanding. In this work we focus on enhancing the model's proficiency in understanding and adhering to instructions related to emotional contexts. Initially we identify key visual clues critical to visual emotion recognition. Subsequently we introduce a novel GPT-assisted pipeline for generating emotion visual instruction data effectively addressing the scarcity of annotated instruction data in this domain. Expanding on the groundwork established by InstructBLIP our proposed EmoVIT architecture incorporates emotion-specific instruction data leveraging the powerful capabilities of Large Language Models to enhance performance. Through extensive experiments our model showcases its proficiency in emotion classification adeptness in affective reasoning and competence in comprehending humor. The comparative analysis provides a robust benchmark for Emotion Visual Instruction Tuning in the era of LLMs providing valuable insights and opening avenues for future exploration in this domain. Our code is available at https://github.com/aimmemotion/EmoVIT. 
+ + + + Learning to Count without Annotations + http://openaccess.thecvf.com//content/CVPR2024/papers/Knobel_Learning_to_Count_without_Annotations_CVPR_2024_paper.pdf + While recent supervised methods for reference-based object counting continue to improve the performance on benchmark datasets they have to rely on small datasets due to the cost associated with manually annotating dozens of objects in images. We propose UnCounTR a model that can learn this task without requiring any manual annotations. To this end we construct "Self-Collages" images with various pasted objects as training samples that provide a rich learning signal covering arbitrary object types and counts. Our method builds on existing unsupervised representations and segmentation techniques to successfully demonstrate for the first time the ability of reference-based counting without manual supervision. Our experiments show that our method not only outperforms simple baselines and generic models such as FasterRCNN and DETR but also matches the performance of supervised counting models in some domains. + + + + NARUTO: Neural Active Reconstruction from Uncertain Target Observations + http://openaccess.thecvf.com//content/CVPR2024/papers/Feng_NARUTO_Neural_Active_Reconstruction_from_Uncertain_Target_Observations_CVPR_2024_paper.pdf + We present NARUTO a neural active reconstruction system that combines a hybrid neural representation with uncertainty learning enabling high-fidelity surface reconstruction. Our approach leverages a multi-resolution hash-grid as the mapping backbone chosen for its exceptional convergence speed and capacity to capture high-frequency local features. The centerpiece of our work is the incorporation of an uncertainty learning module that dynamically quantifies reconstruction uncertainty while actively reconstructing the environment. By harnessing learned uncertainty we propose a novel uncertainty aggregation strategy for goal searching and efficient path planning. Our system autonomously explores by targeting uncertain observations and reconstructs environments with remarkable completeness and fidelity. We also demonstrate the utility of this uncertainty-aware approach by enhancing SOTA neural SLAM systems through an active ray sampling strategy. Extensive evaluations of NARUTO in various environments using an indoor scene simulator confirm its superior performance and state-of-the-art status in active reconstruction as evidenced by its impressive results on benchmark datasets like Replica and MP3D. + + + + Learnable Earth Parser: Discovering 3D Prototypes in Aerial Scans + http://openaccess.thecvf.com//content/CVPR2024/papers/Loiseau_Learnable_Earth_Parser_Discovering_3D_Prototypes_in_Aerial_Scans_CVPR_2024_paper.pdf + We propose an unsupervised method for parsing large 3D scans of real-world scenes with easily-interpretable shapes. This work aims to provide a practical tool for analyzing 3D scenes in the context of aerial surveying and mapping without the need for user annotations. Our approach is based on a probabilistic reconstruction model that decomposes an input 3D point cloud into a small set of learned prototypical 3D shapes. The resulting reconstruction is visually interpretable and can be used to perform unsupervised instance and low-shot semantic segmentation of complex scenes. We demonstrate the usefulness of our model on a novel dataset of seven large aerial LiDAR scans from diverse real-world scenarios. 
Our approach outperforms state-of-the-art unsupervised methods in terms of decomposition accuracy while remaining visually interpretable. Our code and dataset are available at https://romainloiseau.fr/learnable-earth-parser/. + + + + NeRFiller: Completing Scenes via Generative 3D Inpainting + http://openaccess.thecvf.com//content/CVPR2024/papers/Weber_NeRFiller_Completing_Scenes_via_Generative_3D_Inpainting_CVPR_2024_paper.pdf + We propose NeRFiller an approach that completes missing portions of a 3D capture via generative 3D inpainting using off-the-shelf 2D visual generative models. Often parts of a captured 3D scene or object are missing due to mesh reconstruction failures or a lack of observations (e.g. contact regions such as the bottom of objects or hard-to-reach areas). We approach this challenging 3D inpainting problem by leveraging a 2D inpainting diffusion model. We identify a surprising behavior of these models where they generate more 3D consistent inpaints when images form a 2x2 grid and show how to generalize this behavior to more than four images. We then present an iterative framework to distill these inpainted regions into a single consistent 3D scene. In contrast to related works we focus on completing scenes rather than deleting foreground objects and our approach does not require tight 2D object masks or text. We compare our approach to relevant baselines adapted to our setting on a variety of scenes where NeRFiller creates the most 3D consistent and plausible scene completions. Our project page is at https://ethanweber.me/nerfiller/. + + + + Absolute Pose from One or Two Scaled and Oriented Features + http://openaccess.thecvf.com//content/CVPR2024/papers/Ventura_Absolute_Pose_from_One_or_Two_Scaled_and_Oriented_Features_CVPR_2024_paper.pdf + Keypoints used for image matching often include an estimate of the feature scale and orientation. While recent work has demonstrated the advantages of using feature scales and orientations for relative pose estimation relatively little work has considered their use for absolute pose estimation. We introduce minimal solutions for absolute pose from two oriented feature correspondences in the general case or one scaled and oriented correspondence given a known vertical direction. Nowadays assuming a known direction is not particularly restrictive as modern consumer devices such as smartphones or drones are equipped with Inertial Measurement Units (IMU) that provide the gravity direction by default. Compared to traditional absolute pose methods requiring three point correspondences our solvers need a smaller minimal sample reducing the cost and complexity of robust estimation. Evaluations on large-scale and public real datasets demonstrate the advantage of our methods for fast and accurate localization in challenging conditions. Code is available at https://github.com/danini/absolute-pose-from-oriented-and-scaled-features . + + + + Source-Free Domain Adaptation with Frozen Multimodal Foundation Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Source-Free_Domain_Adaptation_with_Frozen_Multimodal_Foundation_Model_CVPR_2024_paper.pdf + Source-Free Domain Adaptation (SFDA) aims to adapt a source model for a target domain with only access to unlabeled target training data and the source model pretrained on a supervised source domain. Relying on pseudo labeling and/or auxiliary supervision conventional methods are inevitably error-prone. 
To mitigate this limitation in this work we for the first time explore the potentials of off-the-shelf vision-language (ViL) multimodal models (e.g. CLIP) with rich whilst heterogeneous knowledge. We find that directly applying the ViL model to the target domain in a zero-shot fashion is unsatisfactory as it is not specialized for this particular task but largely generic. To make it task specific we propose a novel Distilling multImodal Foundation mOdel (DIFO) approach. Specifically DIFO alternates between two steps during adaptation: (i) Customizing the ViL model by maximizing the mutual information with the target model in a prompt learning manner (ii) Distilling the knowledge of this customized ViL model to the target model. For more fine-grained and reliable distillation we further introduce two effective regularization terms namely most-likely category encouragement and predictive consistency. Extensive experiments show that DIFO significantly outperforms the state-of-the-art alternatives. Code is here. + + + + Benchmarking Audio Visual Segmentation for Long-Untrimmed Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Benchmarking_Audio_Visual_Segmentation_for_Long-Untrimmed_Videos_CVPR_2024_paper.pdf + Existing audio-visual segmentation datasets typically focus on short-trimmed videos with only one pixel-map annotation for a per-second video clip. In contrast for untrimmed videos the sound duration, start- and end-sounding time positions, and visual deformation of audible objects vary significantly. Therefore we observed that current AVS models trained on trimmed videos might struggle to segment sounding objects in long videos. To investigate the feasibility of grounding audible objects in videos along both temporal and spatial dimensions we introduce the Long-Untrimmed Audio-Visual Segmentation dataset (LU-AVS) which includes precise frame-level annotations of sounding emission times and provides exhaustive mask annotations for all frames. Considering that pixel-level annotations are difficult to achieve in some complex scenes we also provide the bounding boxes to indicate the sounding regions. Specifically LU-AVS contains 10M mask annotations across 6.6K videos and 11M bounding box annotations across 7K videos. Compared with the existing datasets LU-AVS videos are on average 4-8 times longer with the silent duration being 3-15 times greater. Furthermore we try our best to adapt some baseline models that were originally designed for audio-visual-relevant tasks to examine the challenges of our newly curated LU-AVS. Through comprehensive evaluation we demonstrate the challenges of LU-AVS compared to the ones containing trimmed videos. Therefore LU-AVS provides an ideal yet challenging platform for evaluating audio-visual segmentation and localization on untrimmed long videos. The dataset is publicly available at: https://yenanliu.github.io/LU-AVS/. + + + + VTQA: Visual Text Question Answering via Entity Alignment and Cross-Media Reasoning + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_VTQA_Visual_Text_Question_Answering_via_Entity_Alignment_and_Cross-Media_CVPR_2024_paper.pdf + Achieving the optimal form of Visual Question Answering mandates a profound grasp of understanding, grounding and reasoning within the intersecting domains of vision and language.
Traditional VQA benchmarks have predominantly focused on simplistic tasks such as counting visual attributes and object detection which do not necessitate intricate cross-modal information understanding and inference. Motivated by the need for a more comprehensive evaluation we introduce a novel dataset comprising 23781 questions derived from 10124 image-text pairs. Specifically the task of this dataset requires the model to align multimedia representations of the same entity to implement multi-hop reasoning between image and text and finally use natural language to answer the question. Furthermore we evaluate this VTQA dataset comparing the performance of both state-of-the-art VQA models and our proposed baseline model the Key Entity Cross-Media Reasoning Network (KECMRN). The VTQA task poses formidable challenges for traditional VQA models underscoring its intrinsic complexity. Conversely KECMRN exhibits a modest improvement signifying its potential in multimedia entity alignment and multi-step reasoning. Our analysis underscores the diversity difficulty and scale of the VTQA task compared to previous multimodal QA datasets. In conclusion we anticipate that this dataset will serve as a pivotal resource for advancing and evaluating models proficient in multimedia entity alignment multi-step reasoning and open-ended answer generation. Our dataset and code is available at https://visual-text-qa.github.io/. + + + + QN-Mixer: A Quasi-Newton MLP-Mixer Model for Sparse-View CT Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Ayad_QN-Mixer_A_Quasi-Newton_MLP-Mixer_Model_for_Sparse-View_CT_Reconstruction_CVPR_2024_paper.pdf + Inverse problems span across diverse fields. In medical contexts computed tomography (CT) plays a crucial role in reconstructing a patient's internal structure presenting challenges due to artifacts caused by inherently ill-posed inverse problems. Previous research advanced image quality via post-processing and deep unrolling algorithms but faces challenges such as extended convergence times with ultra-sparse data. Despite enhancements resulting images often show significant artifacts limiting their effectiveness for real-world diagnostic applications. We aim to explore deep second-order unrolling algorithms for solving imaging inverse problems emphasizing their faster convergence and lower time complexity compared to common first-order methods like gradient descent. In this paper we introduce QN-Mixer an algorithm based on the quasi-Newton approach. We use learned parameters through the BFGS algorithm and introduce Incept-Mixer an efficient neural architecture that serves as a non-local regularization term capturing long-range dependencies within images. To address the computational demands typically associated with quasi-Newton algorithms that require full Hessian matrix computations we present a memory-efficient alternative. Our approach intelligently downsamples gradient information significantly reducing computational requirements while maintaining performance. The approach is validated through experiments on the sparse-view CT problem involving various datasets and scanning protocols and is compared with post-processing and deep unrolling state-of-the-art approaches. Our method outperforms existing approaches and achieves state-of-the-art performance in terms of SSIM and PSNR all while reducing the number of unrolling iterations required. 
+ + + + Learning CNN on ViT: A Hybrid Model to Explicitly Class-specific Boundaries for Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ngo_Learning_CNN_on_ViT_A_Hybrid_Model_to_Explicitly_Class-specific_CVPR_2024_paper.pdf + Most domain adaptation (DA) methods are based on either convolutional neural networks (CNNs) or vision transformers (ViTs). They align the distribution differences between domains as encoders without considering their unique characteristics. For instance ViT excels in accuracy due to its superior ability to capture global representations while CNN has an advantage in capturing local representations. This fact has led us to design a hybrid method to fully take advantage of both ViT and CNN called Explicitly Class-specific Boundaries (ECB). ECB learns CNN on ViT to combine their distinct strengths. In particular we leverage ViT's properties to explicitly find class-specific decision boundaries by maximizing the discrepancy between the outputs of the two classifiers to detect target samples far from the source support. In contrast the CNN encoder clusters target features based on the previously defined class-specific boundaries by minimizing the discrepancy between the probabilities of the two classifiers. Finally ViT and CNN mutually exchange knowledge to improve the quality of pseudo labels and reduce the knowledge discrepancies of these models. Compared to conventional DA methods our ECB achieves superior performance which verifies its effectiveness in this hybrid model. The project website can be found at https://dotrannhattuong.github.io/ECB/website/. + + + + A Picture is Worth More Than 77 Text Tokens: Evaluating CLIP-Style Models on Dense Captions + http://openaccess.thecvf.com//content/CVPR2024/papers/Urbanek_A_Picture_is_Worth_More_Than_77_Text_Tokens_Evaluating_CVPR_2024_paper.pdf + Curation methods for massive vision-language datasets trade off between dataset size and quality. However even the highest-quality available curated captions are far too short to capture the rich visual detail in an image. To show the value of dense and highly-aligned image-text pairs we collect the Densely Captioned Images (DCI) dataset containing 8012 natural images human-annotated with mask-aligned descriptions averaging above 1000 words each. With precise and reliable captions associated with specific parts of an image we can evaluate vision-language models' (VLMs) understanding of image content with a novel task that matches each caption with its corresponding subcrop. As current models are often limited to 77 text tokens we also introduce a summarized version (sDCI) in which each caption length is limited. We show that modern techniques that make progress on standard benchmarks do not correspond with significant improvement on our sDCI-based benchmark. Lastly we finetune CLIP using sDCI and show significant improvements over the baseline despite a small training set. By releasing the first human-annotated dense image captioning dataset we hope to enable the development of new benchmarks or fine-tuning recipes for the next generation of VLMs to come. + + + + Infinigen Indoors: Photorealistic Indoor Scenes using Procedural Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Raistrick_Infinigen_Indoors_Photorealistic_Indoor_Scenes_using_Procedural_Generation_CVPR_2024_paper.pdf + We introduce Infinigen Indoors a Blender-based procedural generator of photorealistic indoor scenes.
It builds upon the existing Infinigen system which focuses on natural scenes but expands its coverage to indoor scenes by introducing a diverse library of procedural indoor assets including furniture architecture elements appliances and other day-to-day objects. It also introduces a constraint-based arrangement system which consists of a domain-specific language for expressing diverse constraints on scene composition and a solver that generates scene compositions that maximally satisfy the constraints. We provide an export tool that allows the generated 3D objects and scenes to be directly used for training embodied agents in real-time simulators such as Omniverse and Unreal. Infinigen Indoors is open-sourced under the BSD license. Please visit infinigen.org for code and videos. + + + + MimicDiffusion: Purifying Adversarial Perturbation via Mimicking Clean Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Song_MimicDiffusion_Purifying_Adversarial_Perturbation_via_Mimicking_Clean_Diffusion_Model_CVPR_2024_paper.pdf + Deep neural networks (DNNs) are vulnerable to adversarial perturbation where an imperceptible perturbation is added to the image that can fool the DNNs. Diffusion-based adversarial purification uses the diffusion model to generate a clean image against such adversarial attacks. Unfortunately the generative process of the diffusion model is also inevitably affected by adversarial perturbation since the diffusion model is also a deep neural network where its input has adversarial perturbation. In this work we propose MimicDiffusion a new diffusion-based adversarial purification technique that directly approximates the generative process of the diffusion model with the clean image as input. Concretely we analyze the differences between the guided terms using the clean image and the adversarial sample. After that we first implement MimicDiffusion based on Manhattan distance. Then we propose two guidance to purify the adversarial perturbation and approximate the clean diffusion model. Extensive experiments on three image datasets including CIFAR-10 CIFAR-100 and ImageNet with three classifier backbones including WideResNet-70-16 WideResNet-28-10 and ResNet-50 demonstrate that MimicDiffusion significantly performs better than the state-of-the-art baselines. On CIFAR-10 CIFAR-100 and ImageNet it achieves 92.67% 61.35% and 61.53% average robust accuracy which are 18.49% 13.23% and 17.64% higher respectively. The code is available at https://github.com/psky1111/MimicDiffusion. + + + + Robust Synthetic-to-Real Transfer for Stereo Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Robust_Synthetic-to-Real_Transfer_for_Stereo_Matching_CVPR_2024_paper.pdf + With advancements in domain generalized stereo matching networks models pre-trained on synthetic data demonstrate strong robustness to unseen domains. However few studies have investigated the robustness after fine-tuning them in real-world scenarios during which the domain generalization ability can be seriously degraded. In this paper we explore fine-tuning stereo matching networks without compromising their robustness to unseen domains. Our motivation stems from comparing Ground Truth (GT) versus Pseudo Label (PL) for fine-tuning: GT degrades but PL preserves the domain generalization ability. Empirically we find the difference between GT and PL implies valuable information that can regularize networks during fine-tuning. 
We also propose a framework to utilize this difference for fine-tuning consisting of a frozen Teacher an exponential moving average (EMA) Teacher and a Student network. The core idea is to utilize the EMA Teacher to measure what the Student has learned and dynamically improve GT and PL for fine-tuning. We integrate our framework with state-of-the-art networks and evaluate its effectiveness on several real-world datasets. Extensive experiments show that our method effectively preserves the domain generalization ability during fine-tuning. + + + + GenZI: Zero-Shot 3D Human-Scene Interaction Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_GenZI_Zero-Shot_3D_Human-Scene_Interaction_Generation_CVPR_2024_paper.pdf + Can we synthesize 3D humans interacting with scenes without learning from any 3D human-scene interaction data? We propose GenZI the first zero-shot approach to generating 3D human-scene interactions. Key to GenZI is our distillation of interaction priors from large vision-language models (VLMs) which have learned a rich semantic space of 2D human-scene compositions. Given a natural language description and a coarse point location of the desired interaction in a 3D scene we first leverage VLMs to imagine plausible 2D human interactions inpainted into multiple rendered views of the scene. We then formulate a robust iterative optimization to synthesize the pose and shape of a 3D human model in the scene guided by consistency with the 2D interaction hypotheses. In contrast to existing learning-based approaches GenZI circumvents the conventional need for captured 3D interaction data and allows for flexible control of the 3D interaction synthesis with easy-to-use text prompts. Extensive experiments show that our zero-shot approach has high flexibility and generality making it applicable to diverse scene types including both indoor and outdoor environments. + + + + DiffAssemble: A Unified Graph-Diffusion Model for 2D and 3D Reassembly + http://openaccess.thecvf.com//content/CVPR2024/papers/Scarpellini_DiffAssemble_A_Unified_Graph-Diffusion_Model_for_2D_and_3D_Reassembly_CVPR_2024_paper.pdf + Reassembly tasks play a fundamental role in many fields and multiple approaches exist to solve specific reassembly problems. In this context we posit that a general unified model can effectively address them all irrespective of the input data type (image 3D etc.). We introduce DiffAssemble a Graph Neural Network (GNN)-based architecture that learns to solve reassembly tasks using a diffusion model formulation. Our method treats the elements of a set whether pieces of 2D patch or 3D object fragments as nodes of a spatial graph. Training is performed by introducing noise into the position and rotation of the elements and iteratively denoising them to reconstruct the coherent initial pose. DiffAssemble achieves state-of-the-art (SOTA) results in most 2D and 3D reassembly tasks and is the first learning-based approach that solves 2D puzzles for both rotation and translation. Furthermore we highlight its remarkable reduction in run-time performing 11 times faster than the quickest optimization-based method for puzzle solving. 
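DiffAssemble, summarized above, casts reassembly as diffusion over fragment poses. A minimal sketch of the forward noising step such a model would be trained against is given below; the DDPM-style `alphas_cumprod` schedule, the tensor shapes, and the angle parameterization are assumptions for illustration, and the GNN denoiser itself is omitted.

```python
# Hedged sketch: perturb fragment positions and rotation angles at timestep t
# so a graph network can be trained to predict the injected noise.
import torch

def noise_fragment_poses(pos, rot, alphas_cumprod, t):
    """pos: (N, d) fragment positions, rot: (N, 1) rotation angles,
    alphas_cumprod: (T,) cumulative noise schedule, t: integer timestep."""
    a = alphas_cumprod[t]
    eps_pos, eps_rot = torch.randn_like(pos), torch.randn_like(rot)
    pos_t = a.sqrt() * pos + (1 - a).sqrt() * eps_pos
    rot_t = a.sqrt() * rot + (1 - a).sqrt() * eps_rot
    return pos_t, rot_t, eps_pos, eps_rot   # the denoiser regresses eps_*
```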
+ + + + NeISF: Neural Incident Stokes Field for Geometry and Material Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_NeISF_Neural_Incident_Stokes_Field_for_Geometry_and_Material_Estimation_CVPR_2024_paper.pdf + Multi-view inverse rendering is the problem of estimating the scene parameters such as shapes materials or illuminations from a sequence of images captured under different viewpoints. Many approaches however assume single light bounce and thus fail to recover challenging scenarios like inter-reflections. On the other hand simply extending those methods to consider multi-bounced light requires more assumptions to alleviate the ambiguity. To address this problem we propose Neural Incident Stokes Fields (NeISF) a multi-view inverse rendering framework that reduces ambiguities using polarization cues. The primary motivation for using polarization cues is that it is the accumulation of multi-bounced light providing rich information about geometry and material. Based on this knowledge the proposed incident Stokes field efficiently models the accumulated polarization effect with the aid of an original physically-based differentiable polarimetric renderer. Lastly experimental results show that our method outperforms the existing works in synthetic and real scenarios. + + + + ViT-Lens: Towards Omni-modal Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/Lei_ViT-Lens_Towards_Omni-modal_Representations_CVPR_2024_paper.pdf + Aiming to advance AI agents large foundation models significantly improve reasoning and instruction execution yet the current focus on vision and language neglects the potential of perceiving diverse modalities in open-world environments. However the success of data-driven vision and language models is costly or even infeasible to be reproduced for rare modalities. In this paper we present ViT-Lens that facilitates efficient omni-modal representation learning by perceiving novel modalities with a pretrained ViT and aligning them to a pre-defined space. Specifically the modality-specific lens is tuned to project any-modal signals to an intermediate embedding space which are then processed by a strong ViT with pre-trained visual knowledge. The encoded representations are optimized toward aligning with the modal-independent space pre-defined by off-the-shelf foundation models. ViT-Lens provides a unified solution for representation learning of increasing modalities with two appealing advantages: (i) Unlocking the great potential of pretrained ViTs to novel modalities effectively with efficient data regime; (ii) Enabling emergent downstream capabilities through modality alignment and shared ViT parameters. We tailor ViT-Lens to learn representations for 3D point cloud depth audio tactile and EEG and set new state-of-the-art results across various understanding tasks such as zero-shot classification. By seamlessly integrating ViT-Lens into Multimodal Foundation Models we enable Any-modality to Text and Image Generation in a zero-shot manner. Code and models are available at https://github.com/TencentARC/ViT-Lens. + + + + GeoChat: Grounded Large Vision-Language Model for Remote Sensing + http://openaccess.thecvf.com//content/CVPR2024/papers/Kuckreja_GeoChat_Grounded_Large_Vision-Language_Model_for_Remote_Sensing_CVPR_2024_paper.pdf + Recent advancements in Large Vision-Language Models (VLMs) have shown great promise in natural image domains allowing users to hold a dialogue about given visual content. 
However, such general-domain VLMs perform poorly in Remote Sensing (RS) scenarios, leading to inaccurate or fabricated information when presented with RS domain-specific queries. Such behavior emerges due to the unique challenges introduced by RS imagery. For example, to handle high-resolution RS imagery with diverse scale changes across categories and many small objects, region-level reasoning is necessary alongside holistic scene interpretation. Furthermore, the lack of domain-specific multimodal instruction-following data, as well as of strong backbone models for RS, makes it hard for the models to align their behavior with user queries. To address these limitations, we propose GeoChat - the first versatile remote sensing VLM that offers multitask conversational capabilities with high-resolution RS images. Specifically, GeoChat can not only answer image-level queries but also accept region inputs to hold region-specific dialogue. Furthermore, it can visually ground objects in its responses by referring to their spatial coordinates. To address the lack of domain-specific datasets, we generate a novel RS multimodal instruction-following dataset by extending image-text pairs from existing diverse RS datasets. Leveraging this rich dataset, we fine-tune our remote sensing VLM based on the LLaVA-1.5 architecture. We establish a comprehensive benchmark for RS multitask conversations and compare with a number of baseline methods. GeoChat demonstrates robust zero-shot performance on various remote sensing tasks, e.g. image and region captioning, visual question answering, scene classification, visually grounded conversations, and referring object detection. Our codes will be open-sourced. + + + + PerceptionGPT: Effectively Fusing Visual Perception into LLM http://openaccess.thecvf.com//content/CVPR2024/papers/Pi_PerceptionGPT_Effectively_Fusing_Visual_Perception_into_LLM_CVPR_2024_paper.pdf The integration of visual inputs with large language models (LLMs) has led to remarkable advancements in multi-modal capabilities, giving rise to vision large language models (VLLMs). However, effectively harnessing LLMs for intricate visual perception tasks such as detection and segmentation remains a challenge. Conventional approaches achieve this by transforming perception signals (e.g. bounding boxes, segmentation masks) into sequences of discrete tokens, which struggle with precision errors and introduce further complexities for training. In this paper, we present a novel end-to-end framework named PerceptionGPT, which represents the perception signals using the LLM's dynamic token embeddings. Specifically, we leverage lightweight encoders and decoders to handle the perception signals in the LLM's embedding space, taking advantage of the representation power of the high-dimensional token embeddings. Our approach significantly eases the training difficulties associated with the discrete representations in prior methods. Furthermore, owing to our compact representation, the inference speed is also greatly boosted. Consequently, PerceptionGPT enables accurate, flexible, and efficient handling of complex perception signals. We validate the effectiveness of our approach through extensive experiments. The results demonstrate significant improvements over previous methods with only 4% trainable parameters and less than 25% training time.
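PerceptionGPT's central move, as described above, is to represent a continuous perception signal as a soft token in the LLM's embedding space via lightweight encoders and decoders. The sketch below is a guessed minimal form of that idea; the class name `BoxTokenCodec`, the 4096-d embedding size, and the hidden width are illustrative assumptions, not details taken from the paper.

```python
# Hedged sketch: a tiny encoder/decoder pair that maps a 4-d bounding box to a
# single soft token in an assumed 4096-d LLM embedding space and back.
import torch.nn as nn

class BoxTokenCodec(nn.Module):
    def __init__(self, embed_dim=4096, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(4, hidden), nn.GELU(),
                                 nn.Linear(hidden, embed_dim))
        self.dec = nn.Sequential(nn.Linear(embed_dim, hidden), nn.GELU(),
                                 nn.Linear(hidden, 4))

    def encode(self, box):    # box: (B, 4) normalized coordinates
        return self.enc(box).unsqueeze(1)       # (B, 1, embed_dim) soft token

    def decode(self, token):  # token: (B, 1, embed_dim)
        return self.dec(token.squeeze(1))       # (B, 4) recovered box
```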
+ + + + Probabilistic Speech-Driven 3D Facial Motion Synthesis: New Benchmarks Methods and Applications + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Probabilistic_Speech-Driven_3D_Facial_Motion_Synthesis_New_Benchmarks_Methods_and_CVPR_2024_paper.pdf + We consider the task of animating 3D facial geometry from speech signal. Existing works are primarily deterministic focusing on learning a one-to-one mapping from speech signal to 3D face meshes on small datasets with limited speakers. While these models can achieve high-quality lip articulation for speakers in the training set they are unable to capture the full and diverse distribution of 3D facial motions that accompany speech in the real world. Importantly the relationship between speech and facial motion is one-to-many containing both inter-speaker and intra-speaker variations and necessitating a probabilistic approach. In this paper we identify and address key challenges that have so far limited the development of probabilistic models: lack of datasets and metrics that are suitable for training and evaluating them as well as the difficulty of designing a model that generates diverse results while remaining faithful to a strong conditioning signal as speech. We first propose large-scale benchmark datasets and metrics suitable for probabilistic modeling. Then we demonstrate a probabilistic model that achieves both diversity and fidelity to speech outperforming other methods across the proposed benchmarks. Finally we showcase useful applications of probabilistic models trained on these large-scale datasets: we can generate diverse speech-driven 3D facial motion that matches unseen speaker styles extracted from reference clips; and our synthetic meshes can be used to improve the performance of downstream audio-visual models. + + + + FreGS: 3D Gaussian Splatting with Progressive Frequency Regularization + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_FreGS_3D_Gaussian_Splatting_with_Progressive_Frequency_Regularization_CVPR_2024_paper.pdf + 3D Gaussian splatting has achieved very impressive performance in real-time novel view synthesis. However it often suffers from over-reconstruction during Gaussian densification where high-variance image regions are covered by a few large Gaussians only leading to blur and artifacts in the rendered images. We design a progressive frequency regularization (FreGS) technique to tackle the over-reconstruction issue within the frequency space. Specifically FreGS performs coarse-to-fine Gaussian densification by exploiting low-to-high frequency components that can be easily extracted with low-pass and high-pass filters in the Fourier space. By minimizing the discrepancy between the frequency spectrum of the rendered image and the corresponding ground truth it achieves high-quality Gaussian densification and alleviates the over-reconstruction of Gaussian splatting effectively. Experiments over multiple widely adopted benchmarks (e.g. Mip-NeRF360 Tanks-and-Temples and Deep Blending) show that FreGS achieves superior novel view synthesis and outperforms the state-of-the-art consistently. 
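The frequency-space discrepancy that FreGS minimizes can be pictured as an FFT-domain difference restricted by a radial low-pass mask whose cutoff grows during training. The function below is a hedged sketch of that general idea (grayscale inputs, a hard circular mask), not the released FreGS loss.

```python
# Hedged sketch: low-frequency amplitude discrepancy between a rendered image
# and its ground truth; growing `radius` over training gives a coarse-to-fine
# (low-to-high frequency) schedule.
import torch

def lowpass_frequency_loss(rendered, gt, radius):
    """rendered, gt: (H, W) grayscale tensors; radius: low-pass cutoff in pixels."""
    F_r = torch.fft.fftshift(torch.fft.fft2(rendered))
    F_g = torch.fft.fftshift(torch.fft.fft2(gt))
    H, W = rendered.shape
    yy, xx = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    dist = ((yy - H / 2) ** 2 + (xx - W / 2) ** 2).sqrt()
    mask = (dist <= radius).to(F_r.real.dtype)   # keep only low frequencies
    return (mask * (F_r - F_g).abs()).mean()
```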
+ + + + Discriminative Sample-Guided and Parameter-Efficient Feature Space Adaptation for Cross-Domain Few-Shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Perera_Discriminative_Sample-Guided_and_Parameter-Efficient_Feature_Space_Adaptation_for_Cross-Domain_Few-Shot_CVPR_2024_paper.pdf + In this paper we look at cross-domain few-shot classification which presents the challenging task of learning new classes in previously unseen domains with few labelled examples. Existing methods though somewhat effective encounter several limitations which we alleviate through two significant improvements. First we introduce a lightweight parameter-efficient adaptation strategy to address overfitting associated with fine-tuning a large number of parameters on small datasets. This strategy employs a linear transformation of pre-trained features significantly reducing the trainable parameter count. Second we replace the traditional nearest centroid classifier with a discriminative sample-aware loss function enhancing the model's sensitivity to the inter- and intra-class variances within the training set for improved clustering in feature space. Empirical evaluations on the Meta-Dataset benchmark showcase that our approach not only improves accuracy up to 7.7% and 5.3% on previously seen and unseen datasets respectively but also achieves the above performance while being at least 3x more parameter-efficient than existing methods establishing a new state-of-the-art in cross-domain few-shot learning. Our code is available at https://github.com/rashindrie/DIPA. + + + + Detector-Free Structure from Motion + http://openaccess.thecvf.com//content/CVPR2024/papers/He_Detector-Free_Structure_from_Motion_CVPR_2024_paper.pdf + We propose a structure-from-motion framework to recover accurate camera poses and point clouds from unordered images. Traditional SfM systems typically rely on the successful detection of repeatable keypoints across multiple views as the first step which is difficult for texture-poor scenes and poor keypoint detection may break down the whole SfM system. We propose a detector-free SfM framework to draw benefits from the recent success of detector-free matchers to avoid the early determination of keypoints while solving the multi-view inconsistency issue of detector-free matchers. Specifically our framework first reconstructs a coarse SfM model from quantized detector-free matches. Then it refines the model by a novel iterative refinement pipeline which iterates between an attention-based multi-view matching module to refine feature tracks and a geometry refinement module to improve the reconstruction accuracy. Experiments demonstrate that the proposed framework outperforms existing detector-based SfM systems on common benchmark datasets. We also collect a texture-poor SfM dataset to demonstrate the capability of our framework to reconstruct texture-poor scenes. Based on this framework we take first place in Image Matching Challenge 2023. + + + + CG-HOI: Contact-Guided 3D Human-Object Interaction Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Diller_CG-HOI_Contact-Guided_3D_Human-Object_Interaction_Generation_CVPR_2024_paper.pdf + We propose CG-HOI the first method to address the task of generating dynamic 3D human-object interactions (HOIs) from text. We model the motion of both human and object in an interdependent fashion as semantically rich human motion rarely happens in isolation without any interactions. 
Our key insight is that explicitly modeling contact between the human body surface and object geometry can be used as strong proxy guidance both during training and inference. Using this guidance to bridge human and object motion enables generating more realistic and physically plausible interaction sequences where the human body and corresponding object move in a coherent manner. Our method first learns to model human motion object motion and contact in a joint diffusion process inter-correlated through cross-attention. We then leverage this learned contact for guidance during inference to synthesize realistic and coherent HOIs. Extensive evaluation shows that our joint contact-based human-object interaction approach generates realistic and physically plausible sequences and we show two applications highlighting the capabilities of our method. Conditioned on a given object trajectory we can generate the corresponding human motion without re-training demonstrating strong human-object interdependency learning. Our approach is also flexible and can be applied to static real-world 3D scene scans. + + + + Towards Surveillance Video-and-Language Understanding: New Dataset Baselines and Challenges + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_Towards_Surveillance_Video-and-Language_Understanding_New_Dataset_Baselines_and_Challenges_CVPR_2024_paper.pdf + Surveillance videos are important for public security. However current surveillance video tasks mainly focus on classifying and localizing anomalous events. Existing methods are limited to detecting and classifying the predefined events with unsatisfactory semantic understanding although they have obtained considerable performance. To address this issue we propose a new research direction of surveillance video-and-language understanding(VALU) and construct the first multimodal surveillance video dataset. We manually annotate the real-world surveillance dataset UCF-Crime with fine-grained event content and timing. Our newly annotated dataset UCA (UCF-Crime Annotation) contains 23542 sentences with an average length of 20 words and its annotated videos are as long as 110.7 hours. Furthermore we benchmark SOTA models for four multimodal tasks on this newly created dataset which serve as new baselines for surveillance VALU. Through experiments we find that mainstream models used in previously public datasets perform poorly on surveillance video demonstrating new challenges in surveillance VALU. We also conducted experiments on multimodal anomaly detection. These results demonstrate that our multimodal surveillance learning can improve the performance of anomaly detection. All the experiments highlight the necessity of constructing this dataset to advance surveillance AI. + + + + AdaRevD: Adaptive Patch Exiting Reversible Decoder Pushes the Limit of Image Deblurring + http://openaccess.thecvf.com//content/CVPR2024/papers/Mao_AdaRevD_Adaptive_Patch_Exiting_Reversible_Decoder_Pushes_the_Limit_of_CVPR_2024_paper.pdf + Despite the recent progress in enhancing the efficacy of image deblurring the limited decoding capability constrains the upper limit of State-Of-The-Art (SOTA) methods. This paper proposes a pioneering work Adaptive Patch Exiting Reversible Decoder (AdaRevD) to explore their insufficient decoding capability. By inheriting the weights of the well-trained encoder we refactor a reversible decoder which scales up the single-decoder training to multi-decoder training while remaining GPU memory-friendly. 
Meanwhile we show that our reversible structure gradually disentangles high-level degradation degree and low-level blur pattern (residual of the blur image and its sharp counterpart) from compact degradation representation. Besides due to the spatially-variant motion blur kernels different blur patches have various deblurring difficulties. We further introduce a classifier to learn the degradation degree of image patches enabling them to exit at different sub-decoders for speedup. Experiments show that our AdaRevD pushes the limit of image deblurring e.g. achieving 34.60 dB in PSNR on GoPro dataset. + + + + Learning to Remove Wrinkled Transparent Film with Polarized Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Learning_to_Remove_Wrinkled_Transparent_Film_with_Polarized_Prior_CVPR_2024_paper.pdf + In this paper we study a new problem Film Removal (FR) which attempts to remove the interference of wrinkled transparent films and reconstruct the original information under films for industrial recognition systems. We first physically model the imaging of industrial materials covered by the film. Considering the specular highlight from the film can be effectively recorded by the polarized camera we build a practical dataset with polarization information containing paired data with and without transparent film. We aim to remove interference from the film (specular highlights and other degradations) with an end-to-end framework. To locate the specular highlight we use an angle estimation network to optimize the polarization angle with the minimized specular highlight. The image with minimized specular highlight is set as a prior for supporting the reconstruction network. Based on the prior and the polarized images the reconstruction network can decouple all degradations from the film. Extensive experiments show that our framework achieves SOTA performance in both image reconstruction and industrial downstream tasks. Our code will be released at https://github.com/jqtangust/FilmRemoval. + + + + Dispel Darkness for Better Fusion: A Controllable Visual Enhancer based on Cross-modal Conditional Adversarial Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Dispel_Darkness_for_Better_Fusion_A_Controllable_Visual_Enhancer_based_CVPR_2024_paper.pdf + We propose a controllable visual enhancer named DDBF which is based on cross-modal conditional adversarial learning and aims to dispel darkness and achieve better visible and infrared modalities fusion. Specifically a guided restoration module (GRM) is firstly designed to enhance weakened information in the low-light visible modality. The GRM utilizes the light-invariant high-contrast characteristics of the infrared modality as the central target distribution and constructs a multi-level conditional adversarial sample set to enable continuous controlled brightness enhancement of visible images. Then we develop an information fusion module (IFM) to integrate the advantageous features of the enhanced visible image and the infrared image. Thanks to customized explicit information preservation and hue fidelity constraints the IFM produces visually pleasing results with rich textures significant contrast and vivid colors. The brightened visible image and the final fused image compose the dual output of our DDBF to meet the diverse visual preferences of users. 
We evaluate DDBF on the public datasets achieving state-of-the-art performances of low-light enhancement and information integration that is available for both day and night scenarios. The experiments also demonstrate that our DDBF is effective in improving decision accuracy for object detection and semantic segmentation. Moreover we offer a user-friendly interface for the convenient application of our model. The code is publicly available at https://github.com/HaoZhang1018/DDBF. + + + + Querying as Prompt: Parameter-Efficient Learning for Multimodal Language Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_Querying_as_Prompt_Parameter-Efficient_Learning_for_Multimodal_Language_Model_CVPR_2024_paper.pdf + Recent advancements in language models pre-trained on large-scale corpora have significantly propelled developments in the NLP domain and advanced progress in multimodal tasks. In this paper we propose a Parameter-Efficient multimodal language model learning strategy named QaP (Querying as Prompt). Its core innovation is a novel modality-bridging method that allows a set of modality-specific queries to be input as soft prompts into a frozen pre-trained language model. Specifically we introduce an efficient Text-Conditioned Resampler that is easy to incorporate into the language models which enables adaptive injection of text-related multimodal information at different levels of the model through query learning. This approach effectively bridges multimodal information to the language models while fully leveraging its token fusion and representation potential. We validated our method across four datasets in three distinct multimodal tasks. The results demonstrate that our QaP multimodal language model achieves state-of-the-art performance in various tasks with training only 4.6% parameters. + + + + Deformable 3D Gaussians for High-Fidelity Monocular Dynamic Scene Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Deformable_3D_Gaussians_for_High-Fidelity_Monocular_Dynamic_Scene_Reconstruction_CVPR_2024_paper.pdf + Implicit neural representation has paved the way for new approaches to dynamic scene reconstruction. Nonetheless cutting-edge dynamic neural rendering methods rely heavily on these implicit representations which frequently struggle to capture the intricate details of objects in the scene. Furthermore implicit methods have difficulty achieving real-time rendering in general dynamic scenes limiting their use in a variety of tasks. To address the issues we propose a deformable 3D Gaussians splatting method that reconstructs scenes using 3D Gaussians and learns them in canonical space with a deformation field to model monocular dynamic scenes. We also introduce an annealing smoothing training mechanism with no extra overhead which can mitigate the impact of inaccurate poses on the smoothness of time interpolation tasks in real-world scenes. Through a differential Gaussian rasterizer the deformable 3D Gaussians not only achieve higher rendering quality but also real-time rendering speed. Experiments show that our method outperforms existing methods significantly in terms of both rendering quality and speed making it well-suited for tasks such as novel-view synthesis time interpolation and real-time rendering. 
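At its simplest, the deformation field described above is an MLP that maps a canonical Gaussian center and a timestamp to offsets for position, rotation, and scale. The sketch below assumes raw (un-encoded) inputs and a quaternion rotation offset, and omits the positional encoding and the annealing smoothing mechanism; it illustrates the general recipe rather than the authors' implementation.

```python
# Hedged sketch: (x, y, z, t) -> (delta position, delta rotation, delta scale)
import torch
import torch.nn as nn

class DeformField(nn.Module):
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(4, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + 4 + 3),   # d_xyz, d_rot (quaternion), d_scale
        )

    def forward(self, xyz, t):
        """xyz: (N, 3) canonical Gaussian centers, t: (N, 1) times in [0, 1]."""
        d = self.net(torch.cat([xyz, t], dim=-1))
        return d[:, :3], d[:, 3:7], d[:, 7:]
```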
+ + + + Enhancing 3D Object Detection with 2D Detection-Guided Query Anchors + http://openaccess.thecvf.com//content/CVPR2024/papers/Ji_Enhancing_3D_Object_Detection_with_2D_Detection-Guided_Query_Anchors_CVPR_2024_paper.pdf + Multi-camera-based 3D object detection has made notable progress in the past several years. However we observe that there are cases (e.g. faraway regions) in which popular 2D object detectors are more reliable than state-of-the-art 3D detectors. In this paper to improve the performance of query-based 3D object detectors we present a novel query generating approach termed QAF2D which infers 3D query anchors from 2D detection results. A 2D bounding box of an object in an image is lifted to a set of 3D anchors by associating each sampled point within the box with depth yaw angle and size candidates. Then the validity of each 3D anchor is verified by comparing its projection in the image with its corresponding 2D box and only valid anchors are kept and used to construct queries. The class information of the 2D bounding box associated with each query is also utilized to match the predicted boxes with ground truth for the set-based loss. The image feature extraction backbone is shared between the 3D detector and 2D detector by adding a small number of prompt parameters. We integrate QAF2D into three popular query-based 3D object detectors and carry out comprehensive evaluations on the nuScenes dataset. The largest improvement that QAF2D can bring about on the nuScenes validation subset is 2.3% NDS and 2.7% mAP. Code is available at https://github.com/max-vision/QAF2D. + + + + Continual Forgetting for Pre-trained Vision Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Continual_Forgetting_for_Pre-trained_Vision_Models_CVPR_2024_paper.pdf + For privacy and security concerns the need to erase unwanted information from pre-trained vision models is becoming evident nowadays. In real-world scenarios erasure requests originate at any time from both users and model owners. These requests usually form a sequence. Therefore under such a setting selective information is expected to be continuously removed from a pre-trained model while maintaining the rest. We define this problem as continual forgetting and identify two key challenges. (i) For unwanted knowledge efficient and effective deleting is crucial. (ii) For remaining knowledge the impact brought by the forgetting procedure should be minimal. To address them we propose Group Sparse LoRA (GS-LoRA). Specifically towards (i) we use LoRA modules to fine-tune the FFN layers in Transformer blocks for each forgetting task independently and towards (ii) a simple group sparse regularization is adopted enabling automatic selection of specific LoRA groups and zeroing out the others. GS-LoRA is effective parameter-efficient data-efficient and easy to implement. We conduct extensive experiments on face recognition object detection and image classification and demonstrate that GS-LoRA manages to forget specific classes with minimal impact on other classes. Codes will be released on https://github.com/bjzhb666/GS-LoRA. + + + + Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Real_Acoustic_Fields_An_Audio-Visual_Room_Acoustics_Dataset_and_Benchmark_CVPR_2024_paper.pdf + We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. 
The dataset includes high-quality and densely captured room impulse response data paired with multi-view images and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthesis and impulse response generation which previously relied on synthetic data. In our evaluation we thoroughly assessed existing audio and audio-visual models against multiple criteria and proposed settings to enhance their performance on real-world data. We also conducted experiments to investigate the impact of incorporating visual data (i.e. images and depth) into neural acoustic field models. Additionally we demonstrated the effectiveness of a simple sim2real approach where a model is pre-trained with simulated data and fine-tuned with sparse real-world data resulting in significant improvements in the few-shot learning approach. RAF is the first dataset to provide densely captured room acoustic data making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling techniques. + + + + A Physics-informed Low-rank Deep Neural Network for Blind and Universal Lens Aberration Correction + http://openaccess.thecvf.com//content/CVPR2024/papers/Gong_A_Physics-informed_Low-rank_Deep_Neural_Network_for_Blind_and_Universal_CVPR_2024_paper.pdf + High-end lenses although offering high-quality images suffer from both insufficient affordability and bulky design which hamper their applications in low-budget scenarios or on low-payload platforms. A flexible scheme is to tackle the optical aberration of low-end lenses computationally. However it is highly demanded but quite challenging to build a general model capable of handling non-stationary aberrations and covering diverse lenses especially in a blind manner. To address this issue we propose a universal solution by extensively utilizing the physical properties of camera lenses: (i) reducing the complexity of lens aberrations i.e. lens-specific non-stationary blur by warping annual-ring-shaped sub-images into rectangular stripes to transform non-uniform degenerations into a uniform one (ii) building a low-dimensional non-negative orthogonal representation of lens blur kernels to cover diverse lenses; (iii) designing a decoupling network to decompose the input low-quality image into several components degenerated by above kernel bases and applying corresponding pre-trained deconvolution networks to reverse the degeneration. Benefiting from the proper incorporation of lenses' physical properties and unique network design the proposed method achieves superb imaging quality wide applicability for various lenses high running efficiency and is totally free of kernel calibration. These advantages bring great potential for scenarios requiring lightweight high-quality photography. + + + + Calibrating Multi-modal Representations: A Pursuit of Group Robustness without Annotations + http://openaccess.thecvf.com//content/CVPR2024/papers/You_Calibrating_Multi-modal_Representations_A_Pursuit_of_Group_Robustness_without_Annotations_CVPR_2024_paper.pdf + Fine-tuning pre-trained vision-language models like CLIP has yielded success on diverse downstream tasks. However several pain points persist for this paradigm: (i) directly tuning entire pre-trained models becomes both time-intensive and computationally costly. 
Additionally, these tuned models tend to become highly specialized, limiting their practicality for real-world deployment; (ii) recent studies indicate that pre-trained vision-language classifiers may overly depend on spurious features -- patterns that correlate with the target in training data but are not related to the true labeling function; and (iii) existing studies on mitigating the reliance on spurious features, which are largely based on the assumption that we can identify such features, do not provide definitive assurance for real-world applications. As a piloting study, this work focuses on exploring how to mitigate the reliance on spurious features for CLIP without using any group annotation. To this end, we systematically study the existence of spurious correlations in CLIP and CLIP+ERM. Following recent work on Deep Feature Reweighting (DFR), we first verify that last-layer retraining can greatly improve group robustness on pretrained CLIP. In view of this, we advocate a lightweight representation calibration method for fine-tuning CLIP: we first generate a calibration set using the pretrained CLIP and then calibrate the representations of samples within this set through contrastive learning, all without the need for group labels. Extensive experiments and in-depth visualizations on several benchmarks validate the effectiveness of our proposals, largely reducing the reliance on spurious features and significantly boosting model generalization. + + + + MCD: Diverse Large-Scale Multi-Campus Dataset for Robot Perception http://openaccess.thecvf.com//content/CVPR2024/papers/Nguyen_MCD_Diverse_Large-Scale_Multi-Campus_Dataset_for_Robot_Perception_CVPR_2024_paper.pdf Perception plays a crucial role in various robot applications. However, existing well-annotated datasets are biased towards autonomous driving scenarios, while unlabelled SLAM datasets are quickly over-fitted and often lack environment and domain variations. To expand the frontier of these fields, we introduce a comprehensive dataset named MCD (Multi-Campus Dataset) featuring a wide range of sensing modalities, high-accuracy ground truth, and diverse, challenging environments across three Eurasian university campuses. MCD comprises both CCS (Classical Cylindrical Spinning) and NRE (Non-Repetitive Epicyclic) lidars, high-quality IMUs (Inertial Measurement Units), cameras, and UWB (Ultra-WideBand) sensors. Furthermore, in a pioneering effort, we introduce semantic annotations of 29 classes over 59k sparse NRE lidar scans across three domains, thus providing a novel challenge to existing semantic segmentation research on this largely unexplored lidar modality. Finally, we propose, for the first time to the best of our knowledge, continuous-time ground truth based on optimization-based registration of lidar-inertial data on large survey-grade prior maps, which are also publicly released, each several times the size of existing ones. We conduct a rigorous evaluation of numerous state-of-the-art algorithms on MCD, report their performance, and highlight the challenges awaiting solutions from the research community. + + + + ArGue: Attribute-Guided Prompt Tuning for Vision-Language Models http://openaccess.thecvf.com//content/CVPR2024/papers/Tian_ArGue_Attribute-Guided_Prompt_Tuning_for_Vision-Language_Models_CVPR_2024_paper.pdf Although soft prompt tuning is effective in efficiently adapting Vision-Language (V&L) models for downstream tasks, it shows limitations in dealing with distribution shifts.
We address this issue with Attribute-Guided Prompt Tuning (ArGue), making three key contributions. 1) In contrast to the conventional approach of directly appending soft prompts preceding class names, we align the model with primitive visual attributes generated by Large Language Models (LLMs). We posit that a model's ability to express high confidence in these attributes signifies its capacity to discern the correct class rationales. 2) We introduce attribute sampling to eliminate disadvantageous attributes, so that only semantically meaningful attributes are preserved. 3) We propose negative prompting, explicitly enumerating class-agnostic attributes to activate spurious correlations and encourage the model to generate highly orthogonal probability distributions in relation to these negative features. In experiments, our method significantly outperforms current state-of-the-art prompt tuning methods on both novel class prediction and out-of-distribution generalization tasks. + + + + Close Imitation of Expert Retouching for Black-and-White Photography http://openaccess.thecvf.com//content/CVPR2024/papers/Shin_Close_Imitation_of_Expert_Retouching_for_Black-and-White_Photography_CVPR_2024_paper.pdf Ever since cameras became widely available, black-and-white (BW) photography has been a popular choice for artistic and aesthetic expression. It highlights the main subject in varying tones of gray, creating various effects such as drama and contrast. However, producing BW photography often demands high-end cameras or photographic editing from experts. Even the experts prefer different styles depending on the subject, or even for the same subject, when taking grayscale photos or converting color images to BW, so it is questionable which approach is better. To imitate the artistic values of decolorized images, this paper introduces a deep metric learning framework with a novel subject-style-specified proxy and a large-scale BW dataset. Our proxy-based decolorization utilizes a hierarchical proxy-based loss and a hierarchical bilateral grid network to mimic the experts' retouching scheme. The proxy-based loss captures both expert-discriminative and class-sharing characteristics, while the hierarchical bilateral grid network enables imitating spatially-variant retouching by considering both global and local scene contexts. Our dataset, including color and BW images edited by three experts, demonstrates the scalability of our method, which can be further enhanced by constructing additional proxies from any set of BW photos, such as figures downloaded from the Internet. Our experiments show that our framework successfully produces visually pleasing BW images from color ones, as evaluated by user preference with respect to artistry and aesthetics. + + + + Understanding and Improving Source-free Domain Adaptation from a Theoretical Perspective http://openaccess.thecvf.com//content/CVPR2024/papers/Mitsuzumi_Understanding_and_Improving_Source-free_Domain_Adaptation_from_a_Theoretical_Perspective_CVPR_2024_paper.pdf Source-free Domain Adaptation (SFDA) is an emerging and challenging research area that addresses the problem of unsupervised domain adaptation (UDA) without source data. Though numerous successful methods have been proposed for SFDA, a theoretical understanding of why these methods work well is still absent. In this paper, we shed light on the theoretical perspective of existing SFDA methods.
Specifically we find that SFDA loss functions comprising discriminability and diversity losses work in the same way as the training objective in the theory of self-training based on the expansion assumption which shows the existence of the target error bound. This finding brings two novel insights that enable us to build an improved SFDA method comprising 1) Model Training with Auto-Adjusting Diversity Constraint and 2) Augmentation Training with Teacher-Student Framework yielding a better recognition performance. Extensive experiments on three benchmark datasets demonstrate the validity of the theoretical analysis and our method. + + + + Learning SO(3)-Invariant Semantic Correspondence via Local Shape Transform + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Learning_SO3-Invariant_Semantic_Correspondence_via_Local_Shape_Transform_CVPR_2024_paper.pdf + Establishing accurate 3D correspondences between shapes stands as a pivotal challenge with profound implications for computer vision and robotics. However existing self-supervised methods for this problem assume perfect input shape alignment restricting their real-world applicability. In this work we introduce a novel self-supervised Rotation-Invariant 3D correspondence learner with Local Shape Transform dubbed RIST that learns to establish dense correspondences between shapes even under challenging intra-class variations and arbitrary orientations. Specifically RIST learns to dynamically formulate an SO(3)-invariant local shape transform for each point which maps the SO(3)-equivariant global shape descriptor of the input shape to a local shape descriptor. These local shape descriptors are provided as inputs to our decoder to facilitate point cloud self- and cross-reconstruction. Our proposed self-supervised training pipeline encourages semantically corresponding points from different shapes to be mapped to similar local shape descriptors enabling RIST to establish dense point-wise correspondences. RIST demonstrates state-of-the-art performances on 3D part label transfer and semantic keypoint transfer given arbitrarily rotated point cloud pairs outperforming existing methods by significant margins. + + + + Deep-TROJ: An Inference Stage Trojan Insertion Algorithm through Efficient Weight Replacement Attack + http://openaccess.thecvf.com//content/CVPR2024/papers/Ahmed_Deep-TROJ_An_Inference_Stage_Trojan_Insertion_Algorithm_through_Efficient_Weight_CVPR_2024_paper.pdf + To insert Trojan into a Deep Neural Network (DNN) the existing attack assumes the attacker can access the victim's training facilities. However a realistic threat model was recently developed by leveraging memory fault to inject Trojans at the inference stage. In this work we develop a novel Trojan attack by adopting a unique memory fault injection technique that can inject bit-flip into the page table of the main memory. In the main memory each weight block consists of a group of weights located at a specific address of a DRAM row. A bit-flip in the page frame number replaces a target weight block of a DNN model with another replacement weight block. To develop a successful Trojan attack leveraging this unique fault model the attacker must solve three key challenges: i) how to identify a minimum set of target weight blocks to be modified? ii) how to identify the corresponding optimal replacement weight block? iii) how to optimize the trigger to maximize the attacker's objective given a target and replacement weight block set? 
We address them by proposing a novel Deep-TROJ attack algorithm that can identify a minimum set of vulnerable target weight blocks and corresponding replacement weight blocks while optimizing the trigger at the same time. We evaluate the performance of our proposed Deep-TROJ on the CIFAR-10, CIFAR-100, and ImageNet datasets for sixteen different DNN architectures, including vision transformers. The proposed Deep-TROJ is the most successful attack to date that does not require access to training facilities, while successfully bypassing the existing defenses. Our code is available at https://github.com/ML-Security-Research-LAB/Deep-TROJ. + + + + Investigating and Mitigating the Side Effects of Noisy Views for Self-Supervised Clustering Algorithms in Practical Multi-View Scenarios http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_Investigating_and_Mitigating_the_Side_Effects_of_Noisy_Views_for_CVPR_2024_paper.pdf Multi-view clustering (MVC) aims at exploring category structures among multi-view data in a self-supervised manner. Multiple views provide more information than single views, and thus existing MVC methods can achieve satisfactory performance. However, their performance might seriously degenerate when the views are noisy in practical multi-view scenarios. In this paper, we formally investigate the drawback of noisy views and then propose a theoretically grounded deep MVC method (namely MVCAN) to address this issue. Specifically, we propose a novel MVC objective that enables un-shared parameters and inconsistent clustering predictions across multiple views to reduce the side effects of noisy views. Furthermore, a two-level multi-view iterative optimization is designed to generate robust learning targets for refining individual views' representation learning. Theoretical analysis reveals that MVCAN works by achieving multi-view consistency, complementarity, and noise robustness. Finally, experiments on extensive public datasets demonstrate that MVCAN outperforms state-of-the-art methods and is robust against the existence of noisy views. + + + + EvalCrafter: Benchmarking and Evaluating Large Video Generation Models http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_EvalCrafter_Benchmarking_and_Evaluating_Large_Video_Generation_Models_CVPR_2024_paper.pdf Vision and language generative models have grown rapidly in recent years. For video generation, various open-sourced models and publicly available services have been developed to generate high-quality videos. However, these methods often use only a few metrics, e.g. FVD or IS, to evaluate the performance. We argue that it is hard to judge large conditional generative models from such simple metrics, since these models are often trained on very large datasets and have multi-aspect abilities. Thus, we propose a novel framework and pipeline for exhaustively evaluating the performance of the generated videos. Our approach involves generating a diverse and comprehensive list of 700 prompts for text-to-video generation, which is based on an analysis of real-world user data and generated with the assistance of a large language model. Then, we evaluate the state-of-the-art video generative models on our carefully designed benchmark, in terms of visual qualities, content qualities, motion qualities, and text-video alignment, with 17 well-selected objective metrics. To obtain the final leaderboard of the models, we further fit a series of coefficients to align the objective metrics to the users' opinions.
Based on the proposed human alignment method our final score shows a higher correlation than simply averaging the metrics showing the effectiveness of the proposed evaluation method. + + + + SelfOcc: Self-Supervised Vision-Based 3D Occupancy Prediction + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_SelfOcc_Self-Supervised_Vision-Based_3D_Occupancy_Prediction_CVPR_2024_paper.pdf + 3D occupancy prediction is an important task for the robustness of vision-centric autonomous driving which aims to predict whether each point is occupied in the surrounding 3D space. Existing methods usually require 3D occupancy labels to produce meaningful results. However it is very laborious to annotate the occupancy status of each voxel. In this paper we propose SelfOcc to explore a self-supervised way to learn 3D occupancy using only video sequences. We first transform the images into the 3D space (e.g. bird's eye view) to obtain 3D representation of the scene. We directly impose constraints on the 3D representations by treating them as signed distance fields. We can then render 2D images of previous and future frames as self-supervision signals to learn the 3D representations. We propose an MVS-embedded strategy to directly optimize the SDF-induced weights with multiple depth proposals. Our SelfOcc outperforms the previous best method SceneRF by 58.7% using a single frame as input on SemanticKITTI and is the first self-supervised work that produces reasonable 3D occupancy for surround cameras on nuScenes. SelfOcc produces high-quality depth and achieves state-of-the-art results on novel depth synthesis monocular depth estimation and surround-view depth estimation on the SemanticKITTI KITTI-2015 and nuScenes respectively. Code: https://github.com/huang-yh/SelfOcc. + + + + SubT-MRS Dataset: Pushing SLAM Towards All-weather Environments + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_SubT-MRS_Dataset_Pushing_SLAM_Towards_All-weather_Environments_CVPR_2024_paper.pdf + Simultaneous localization and mapping (SLAM) is a fundamental task for numerous applications such as autonomous navigation and exploration. Despite many SLAM datasets have been released current SLAM solutions still struggle to have sustained and resilient performance. One major issue is the absence of high-quality datasets including diverse all-weather conditions and a reliable metric for assessing robustness. This limitation significantly restricts the scalability and generalizability of SLAM technologies impacting their development validation and deployment. To address this problem we present SubT-MRS an extremely challenging real-world dataset designed to push SLAM towards all-weather environments to pursue the most robust SLAM performance. It contains multi-degraded environments including over 30 diverse scenes such as structureless corridors varying lighting conditions and perceptual obscurants like smoke and dust; multimodal sensors such as LiDAR fisheye camera IMU and thermal camera; and multiple locomotions like aerial legged and wheeled robots. We developed accuracy and robustness evaluation tracks for SLAM and introduced novel robustness metrics. Comprehensive studies are performed revealing new observations challenges and opportunities for future research. 
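For context on the accuracy track mentioned above, SLAM benchmarks commonly report a trajectory error such as ATE RMSE; the snippet below computes only that standard metric on pre-aligned trajectories and does not attempt to reproduce the dataset's own, newly introduced robustness metrics.

```python
# Standard Absolute Trajectory Error (RMSE) over already-aligned trajectories.
import numpy as np

def ate_rmse(est_xyz, gt_xyz):
    """est_xyz, gt_xyz: (N, 3) estimated and ground-truth positions,
    associated by timestamp and expressed in the same frame."""
    err = np.linalg.norm(est_xyz - gt_xyz, axis=1)
    return float(np.sqrt(np.mean(err ** 2)))
```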
+ + + + Relational Matching for Weakly Semi-Supervised Oriented Object Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Relational_Matching_for_Weakly_Semi-Supervised_Oriented_Object_Detection_CVPR_2024_paper.pdf + Oriented object detection has witnessed significant progress in recent years. However the impressive performance of oriented object detectors is at the huge cost of labor-intensive annotations and deteriorates once the annotated data becomes limited. Semi-supervised learning in which sufficient unannotated data are utilized to enhance the base detector is a promising method to address the annotation deficiency problem. Motivated by weakly supervised learning we introduce annotation-efficient point annotations for unannotated images and propose a weakly semi-supervised method for oriented object detection to balance the detection performance and annotation cost. Specifically we propose a Rotation-Modulated Relational Graph Matching method to match relations of proposals centered on annotated points between different models to alleviate the ambiguity of point annotations in depicting the oriented object. In addition we further propose a Relational Rank Distribution Matching method to align the rank distribution on classification and regression between different models. Finally to handle the difficult annotated points that both models are confused about we introduce weakly supervised learning to impose positive signals for difficult point-induced clusters to the base model and focus the base model on the occupancy between the predictions and annotated points. We perform extensive experiments on challenging datasets to demonstrate the effectiveness of our proposed weakly semi-supervised method in effectively leveraging unannotated data for significant performance improvement. + + + + Rethinking the Representation in Federated Unsupervised Learning with Non-IID Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Liao_Rethinking_the_Representation_in_Federated_Unsupervised_Learning_with_Non-IID_Data_CVPR_2024_paper.pdf + Federated learning achieves effective performance in modeling decentralized data. In practice client data are not well-labeled which makes it potential for federated unsupervised learning (FUSL) with non-IID data. However the performance of existing FUSL methods suffers from insufficient representations i.e. (1) representation collapse entanglement among local and global models and (2) inconsistent representation spaces among local models. The former indicates that representation collapse in local model will subsequently impact the global model and other local models. The latter means that clients model data representation with inconsistent parameters due to the deficiency of supervision signals. In this work we propose FedU2 which enhances generating uniform and unified representation in FUSL with non-IID data. Specifically FedU2 consists of flexible uniform regularizer (FUR) and efficient unified aggregator (EUA). FUR in each client avoids representation collapse via dispersing samples uniformly and EUA in server promotes unified representation by constraining consistent client model updating. To extensively validate the performance of FedU2 we conduct both cross-device and cross-silo evaluation experiments on two benchmark datasets i.e. CIFAR10 and CIFAR100. 
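One concrete way to "disperse samples uniformly", as the flexible uniform regularizer above is said to do, is the standard hypersphere-uniformity loss of Wang & Isola (2020). The sketch below is that generic term, offered only as an illustration of the concept, not FedU2's exact FUR objective.

```python
# Hedged sketch: generic hypersphere-uniformity regularizer on one client batch.
import torch
import torch.nn.functional as F

def uniformity_loss(z, t=2.0):
    """z: (N, D) representations; lower values mean more uniformly spread."""
    z = F.normalize(z, dim=-1)                       # project onto unit sphere
    sq_dists = torch.cdist(z, z, p=2).pow(2)         # pairwise squared distances
    n = z.shape[0]
    off_diag = sq_dists[~torch.eye(n, dtype=torch.bool)]
    return torch.log(torch.exp(-t * off_diag).mean())
```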
+ + + + Distraction is All You Need: Memory-Efficient Image Immunization against Diffusion-Based Image Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Lo_Distraction_is_All_You_Need_Memory-Efficient_Image_Immunization_against_Diffusion-Based_CVPR_2024_paper.pdf + Recent text-to-image (T2I) diffusion models have revolutionized image editing by empowering users to control outcomes using natural language. However the ease of image manipulation has raised ethical concerns with the potential for malicious use in generating deceptive or harmful content. To address the concerns we propose an image immunization approach named semantic attack to protect our images from being manipulated by malicious agents using diffusion models. Our approach focuses on disrupting the semantic understanding of T2I diffusion models regarding specific content. By attacking the cross-attention mechanism that encodes image features with text messages during editing we distract the model's attention regarding the content of our concern. Our semantic attack renders the model uncertain about the areas to edit resulting in poorly edited images and contradicting the malicious editing attempts. In addition by shifting the attack target towards intermediate attention maps from the final generated image our approach substantially diminishes computational burden and alleviates GPU memory constraints in comparison to previous methods. Moreover we introduce timestep universal gradient updating to create timestep-agnostic perturbations effective across different input noise levels. By treating the full diffusion process as discrete denoising timesteps during the attack we achieve equivalent or even superior immunization efficacy with nearly half the memory consumption of the previous method. Our contributions include a practical and effective approach to safeguard images against malicious editing and the proposed method offers robust immunization against various image inpainting and editing approaches showcasing its potential for real-world applications. + + + + Knowledge-Enhanced Dual-stream Zero-shot Composed Image Retrieval + http://openaccess.thecvf.com//content/CVPR2024/papers/Suo_Knowledge-Enhanced_Dual-stream_Zero-shot_Composed_Image_Retrieval_CVPR_2024_paper.pdf + We study the zero-shot Composed Image Retrieval (ZS-CIR) task which is to retrieve the target image given a reference image and a description without training on the triplet datasets. Previous works generate pseudo-word tokens by projecting the reference image features to the text embedding space. However they focus on the global visual representation ignoring the representation of detailed attributes e.g. color object number and layout. To address this challenge we propose a Knowledge-Enhanced Dual-stream zero-shot composed image retrieval framework (KEDs). KEDs implicitly models the attributes of the reference images by incorporating a database. The database enriches the pseudo-word tokens by providing relevant images and captions emphasizing shared attribute information in various aspects. In this way KEDs recognizes the reference image from diverse perspectives. Moreover KEDs adopts an extra stream that aligns pseudo-word tokens with textual concepts leveraging pseudo-triplets mined from image-text pairs. The pseudo-word tokens generated in this stream are explicitly aligned with fine-grained semantics in the text embedding space. Extensive experiments on widely used benchmarks i.e. 
ImageNet-R COCO object Fashion-IQ and CIRR show that KEDs outperforms previous zero-shot composed image retrieval methods. Code is available at https://github.com/suoych/KEDs. + + + + Grounding and Enhancing Grid-based Models for Neural Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhao_Grounding_and_Enhancing_Grid-based_Models_for_Neural_Fields_CVPR_2024_paper.pdf + Many contemporary studies utilize grid-based models for neural field representation but a systematic analysis of grid-based models is still missing hindering the improvement of those models. Therefore this paper introduces a theoretical framework for grid-based models. This framework points out that these models' approximation and generalization behaviors are determined by grid tangent kernels (GTK) which are intrinsic properties of grid-based models. The proposed framework facilitates a consistent and systematic analysis of diverse grid-based models. Furthermore the introduced framework motivates the development of a novel grid-based model named the Multiplicative Fourier Adaptive Grid (MulFAGrid). The numerical analysis demonstrates that MulFAGrid exhibits a lower generalization bound than its predecessors indicating its robust generalization performance. Empirical studies reveal that MulFAGrid achieves state-of-the-art performance in various tasks including 2D image fitting 3D signed distance field (SDF) reconstruction and novel view synthesis demonstrating superior representation ability. The project website is available at https://sites.google.com/view/cvpr24-2034-submission/home. + + + + GART: Gaussian Articulated Template Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Lei_GART_Gaussian_Articulated_Template_Models_CVPR_2024_paper.pdf + We introduce Gaussian Articulated Template Model (GART) an explicit efficient and expressive representation for non-rigid articulated subject capturing and rendering from monocular videos. GART utilizes a mixture of moving 3D Gaussians to explicitly approximate a deformable subject's geometry and appearance. It takes advantage of a categorical template model prior (SMPL SMAL etc.) with learnable forward skinning while further generalizing to more complex non-rigid deformations with novel latent bones. GART can be reconstructed via differentiable rendering from monocular videos in seconds or minutes and rendered in novel poses faster than 150fps. + + + + KP-RED: Exploiting Semantic Keypoints for Joint 3D Shape Retrieval and Deformation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_KP-RED_Exploiting_Semantic_Keypoints_for_Joint_3D_Shape_Retrieval_and_CVPR_2024_paper.pdf + In this paper we present KP-RED a unified KeyPoint-driven REtrieval and Deformation framework that takes object scans as input and jointly retrieves and deforms the most geometrically similar CAD models from a pre-processed database to tightly match the target. Unlike existing dense matching based methods that typically struggle with noisy partial scans we propose to leverage category-consistent sparse keypoints to naturally handle both full and partial object scans. Specifically we first employ a lightweight retrieval module to establish a keypoint-based embedding space measuring the similarity among objects by dynamically aggregating deformation-aware local-global features around extracted keypoints. Objects that are close in the embedding space are considered similar in geometry. 
Then we introduce the neural cage-based deformation module that estimates the influence vector of each keypoint upon cage vertices inside its local support region to control the deformation of the retrieved shape. Extensive experiments on the synthetic dataset PartNet and the real-world dataset Scan2CAD demonstrate that KP-RED surpasses existing state-of-the-art approaches by a large margin. Codes and trained models will be released in https://github.com/lolrudy/KP-RED. + + + + Learning from One Continuous Video Stream + http://openaccess.thecvf.com//content/CVPR2024/papers/Carreira_Learning_from_One_Continuous_Video_Stream_CVPR_2024_paper.pdf + We introduce a framework for online learning from a single continuous video stream - the way people and animals learn without mini-batches data augmentation or shuffling. This poses great challenges given the high correlation between consecutive video frames and there is very little prior work on it. Our framework allows us to do a first deep dive into the topic and includes a collection of streams and tasks composed from two existing video datasets plus methodology for performance evaluation that considers both adaptation and generalization. We employ pixel-to-pixel modelling as a practical and flexible way to switch between pre-training and single-stream evaluation as well as between arbitrary tasks without ever requiring changes to models and always using the same pixel loss. Equipped with this framework we obtained large single-stream learning gains from pre-training with a novel family of future prediction tasks found that momentum hurts and that the pace of weight updates matters. The combination of these insights leads to matching the performance of IID learning with batch size 1 when using the same architecture and without costly replay buffers. + + + + VGGSfM: Visual Geometry Grounded Deep Structure From Motion + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_VGGSfM_Visual_Geometry_Grounded_Deep_Structure_From_Motion_CVPR_2024_paper.pdf + Structure-from-motion (SfM) is a long-standing problem in the computer vision community which aims to reconstruct the camera poses and 3D structure of a scene from a set of unconstrained 2D images. Classical frameworks solve this problem in an incremental manner by detecting and matching keypoints registering images triangulating 3D points and conducting bundle adjustment. Recent research efforts have predominantly revolved around harnessing the power of deep learning techniques to enhance specific elements (e.g. keypoint matching) but are still based on the original non-differentiable pipeline. Instead we propose a new deep SfM pipeline where each component is fully differentiable and thus can be trained in an end-to-end manner. To this end we introduce new mechanisms and simplifications. First we build on recent advances in deep 2D point tracking to extract reliable pixel-accurate tracks which eliminates the need for chaining pairwise matches. Furthermore we recover all cameras simultaneously based on the image and track features instead of gradually registering cameras. Finally we optimise the cameras and triangulate 3D points via a differentiable bundle adjustment layer. We attain state-of-the-art performance on three popular datasets CO3D IMC Phototourism and ETH3D. 
+ + + + PixelLM: Pixel Reasoning with Large Multimodal Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Ren_PixelLM_Pixel_Reasoning_with_Large_Multimodal_Model_CVPR_2024_paper.pdf + While large multimodal models (LMMs) have achieved remarkable progress generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap we introduce PixelLM an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM are a novel lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens which encode detailed target-relevant information. With this design PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional costly segmentation models. Furthermore we propose a token fusion method to enhance the model's ability to differentiate between multiple targets leading to substantially improved mask quality. To advance research in this area we construct MUSE a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks outperforming well-established methods in multiple benchmarks including MUSE and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code models and datasets will be publicly available. + + + + MRFS: Mutually Reinforcing Image Fusion and Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_MRFS_Mutually_Reinforcing_Image_Fusion_and_Segmentation_CVPR_2024_paper.pdf + This paper proposes a coupled learning framework to break the performance bottleneck of infrared-visible image fusion and segmentation called MRFS. By leveraging the intrinsic consistency between vision and semantics it emphasizes mutual reinforcement rather than treating these tasks as separate issues. First we embed weakened information recovery and salient information integration into the image fusion task employing the CNN-based interactive gated mixed attention (IGM-Att) module to extract high-quality visual features. This aims to satisfy human visual perception producing fused images with rich textures high contrast and vivid colors. Second a transformer-based progressive cycle attention (PC-Att) module is developed to enhance semantic segmentation. It establishes single-modal self-reinforcement and cross-modal mutual complementarity enabling more accurate decisions in machine semantic perception. Then the cascade of IGM-Att and PC-Att couples image fusion and semantic segmentation tasks implicitly bringing vision-related and semantics-related features into closer alignment. Therefore they mutually provide learning priors to each other resulting in visually satisfying fused images and more accurate segmentation decisions. Extensive experiments on public datasets showcase the advantages of our method in terms of visual satisfaction and decision accuracy. The code is publicly available at https://github.com/HaoZhang1018/MRFS. + + + + Robust Depth Enhancement via Polarization Prompt Fusion Tuning + http://openaccess.thecvf.com//content/CVPR2024/papers/Ikemura_Robust_Depth_Enhancement_via_Polarization_Prompt_Fusion_Tuning_CVPR_2024_paper.pdf + Existing depth sensors are imperfect and may provide inaccurate depth values in challenging scenarios such as in the presence of transparent or reflective objects. 
In this work we present a general framework that leverages polarization imaging to improve inaccurate depth measurements from various depth sensors. Previous polarization-based depth enhancement methods focus on utilizing pure physics-based formulas for a single sensor. In contrast our method first adopts a learning-based strategy where a neural network is trained to estimate a dense and complete depth map from polarization data and a sensor depth map from different sensors. To further improve the performance we propose a Polarization Prompt Fusion Tuning (PPFT) strategy to effectively utilize RGB-based models pre-trained on large-scale datasets as the size of the polarization dataset is limited to train a strong model from scratch. We conducted extensive experiments on a public dataset and the results demonstrate that the proposed method performs favorably compared to existing depth enhancement baselines. Code and demos are available at https://lastbasket.github.io/PPFT/. + + + + Compact 3D Gaussian Representation for Radiance Field http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Compact_3D_Gaussian_Representation_for_Radiance_Field_CVPR_2024_paper.pdf Neural Radiance Fields (NeRFs) have demonstrated remarkable potential in capturing complex 3D scenes with high fidelity. However one persistent challenge that hinders the widespread adoption of NeRFs is the computational bottleneck due to the volumetric rendering. On the other hand 3D Gaussian splatting (3DGS) has recently emerged as an alternative representation that leverages a 3D Gaussian-based representation and adopts the rasterization pipeline to render the images rather than volumetric rendering achieving very fast rendering speed and promising image quality. However a significant drawback arises as 3DGS entails a substantial number of 3D Gaussians to maintain the high fidelity of the rendered images which requires a large amount of memory and storage. To address this critical issue we place a specific emphasis on two key objectives: reducing the number of Gaussian points without sacrificing performance and compressing the Gaussian attributes such as view-dependent color and covariance. To this end we propose a learnable mask strategy that significantly reduces the number of Gaussians while preserving high performance. In addition we propose a compact but effective representation of view-dependent color by employing a grid-based neural field rather than relying on spherical harmonics. Finally we learn codebooks to compactly represent the geometric attributes of Gaussians by vector quantization. With model compression techniques such as quantization and entropy coding we consistently show over 25x reduced storage and enhanced rendering speed while maintaining the quality of the scene representation compared to 3DGS. Our work provides a comprehensive framework for 3D scene representation achieving high performance fast training compactness and real-time rendering. Our project page is available at https://maincold2.github.io/c3dgs/.
+ + + + 3D Building Reconstruction from Monocular Remote Sensing Images with Multi-level Supervisions + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_3D_Building_Reconstruction_from_Monocular_Remote_Sensing_Images_with_Multi-level_CVPR_2024_paper.pdf + 3D building reconstruction from monocular remote sensing images is an important and challenging research problem that has received increasing attention in recent years owing to its low cost of data acquisition and availability for large-scale applications. However existing methods rely on expensive 3D-annotated samples for fully-supervised training restricting their application to large-scale cross-city scenarios. In this work we propose MLS-BRN a multi-level supervised building reconstruction network that can flexibly utilize training samples with different annotation levels to achieve better reconstruction results in an end-to-end manner. To alleviate the demand on full 3D supervision we design two new modules Pseudo Building Bbox Calculator and Roof-Offset guided Footprint Extractor as well as new tasks and training strategies for different types of samples. Experimental results on several public and new datasets demonstrate that our proposed MLS-BRN achieves competitive performance using much fewer 3D-annotated samples and significantly improves the footprint extraction and 3D reconstruction performance compared with current state-of-the-art. The code and datasets of this work will be released at https://github.com/opendatalab/MLS-BRN.git. + + + + Generative Latent Coding for Ultra-Low Bitrate Image Compression + http://openaccess.thecvf.com//content/CVPR2024/papers/Jia_Generative_Latent_Coding_for_Ultra-Low_Bitrate_Image_Compression_CVPR_2024_paper.pdf + Most existing image compression approaches perform transform coding in the pixel space to reduce its spatial redundancy. However they encounter difficulties in achieving both high-realism and high-fidelity at low bitrate as the pixel-space distortion may not align with human perception. To address this issue we introduce a Generative Latent Coding (GLC) architecture which performs transform coding in the latent space of a generative vector-quantized variational auto-encoder (VQ-VAE) instead of in the pixel space. The generative latent space is characterized by greater sparsity richer semantic and better alignment with human perception rendering it advantageous for achieving high-realism and high-fidelity compression. Additionally we introduce a categorical hyper module to reduce the bit cost of hyper-information and a code-prediction-based supervision to enhance the semantic consistency. Experiments demonstrate that our GLC maintains high visual quality with less than 0.04 bpp on natural images and less than 0.01 bpp on facial images. On the CLIC2020 test set we achieve the same FID as MS-ILLM with 45% fewer bits. Furthermore the powerful generative latent space enables various applications built on our GLC pipeline such as image restoration and style transfer. + + + + Distributionally Generative Augmentation for Fair Facial Attribute Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Distributionally_Generative_Augmentation_for_Fair_Facial_Attribute_Classification_CVPR_2024_paper.pdf + Facial Attribute Classification (FAC) holds substantial promise in widespread applications. However FAC models trained by traditional methodologies can be unfair by exhibiting accuracy inconsistencies across varied data subpopulations. 
This unfairness is largely attributed to bias in data where some spurious attributes (e.g. Male) statistically correlate with the target attribute (e.g. Smiling). Most of existing fairness-aware methods rely on the labels of spurious attributes which may be unavailable in practice. This work proposes a novel generation-based two-stage framework to train a fair FAC model on biased data without additional annotation. Initially we identify the potential spurious attributes based on generative models. Notably it enhances interpretability by explicitly showing the spurious attributes in image space. Following this for each image we first edit the spurious attributes with a random degree sampled from a uniform distribution while keeping target attribute unchanged. Then we train a fair FAC model by fostering model invariance to these augmentation. Extensive experiments on three common datasets demonstrate the effectiveness of our method in promoting fairness in FAC without compromising accuracy. Codes are in https://github.com/heqianpei/DiGA. + + + + From SAM to CAMs: Exploring Segment Anything Model for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Kweon_From_SAM_to_CAMs_Exploring_Segment_Anything_Model_for_Weakly_CVPR_2024_paper.pdf + Weakly Supervised Semantic Segmentation (WSSS) aims to learn the concept of segmentation using image-level class labels. Recent WSSS works have shown promising results by using the Segment Anything Model (SAM) a foundation model for segmentation during the inference phase. However we observe that these methods can still be vulnerable to the noise of class activation maps (CAMs) serving as initial seeds. As a remedy this paper introduces From-SAM-to-CAMs (S2C) a novel WSSS framework that directly transfers the knowledge of SAM to the classifier during the training process enhancing the quality of CAMs itself. S2C comprises SAM-segment Contrasting (SSC) and a CAM-based prompting module (CPM) which exploit SAM at the feature and logit levels respectively. SSC performs prototype-based contrasting using SAM's automatic segmentation results. It constrains each feature to be close to the prototype of its segment and distant from prototypes of the others. Meanwhile CPM extracts prompts from the CAM of each class and uses them to generate class-specific segmentation masks through SAM. The masks are aggregated into unified self-supervision based on the confidence score designed to consider the reliability of both SAM and CAMs. S2C achieves a new state-of-the-art performance across all benchmarks outperforming existing studies by significant margins. The code is available at https://github.com/sangrockEG/S2C. + + + + Boosting Flow-based Generative Super-Resolution Models via Learned Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Tsao_Boosting_Flow-based_Generative_Super-Resolution_Models_via_Learned_Prior_CVPR_2024_paper.pdf + Flow-based super-resolution (SR) models have demonstrated astonishing capabilities in generating high-quality images. However these methods encounter several challenges during image generation such as grid artifacts exploding inverses and suboptimal results due to a fixed sampling temperature. To overcome these issues this work introduces a conditional learned prior to the inference phase of a flow-based SR model. This prior is a latent code predicted by our proposed latent module conditioned on the low-resolution image which is then transformed by the flow model into an SR image. 
Our framework is designed to seamlessly integrate with any contemporary flow-based SR model without modifying its architecture or pre-trained weights. We evaluate the effectiveness of our proposed framework through extensive experiments and ablation analyses. The proposed framework successfully addresses all the inherent issues in flow-based SR models and enhances their performance in various SR scenarios. Our code is available at: https://github.com/liyuantsao/FlowSR-LP + + + + What You See is What You GAN: Rendering Every Pixel for High-Fidelity Geometry in 3D GANs http://openaccess.thecvf.com//content/CVPR2024/papers/Trevithick_What_You_See_is_What_You_GAN_Rendering_Every_Pixel_CVPR_2024_paper.pdf 3D-aware Generative Adversarial Networks (GANs) have shown remarkable progress in learning to generate multi-view-consistent images and 3D geometries of scenes from collections of 2D images via neural volume rendering. Yet the significant memory and computational costs of dense sampling in volume rendering have forced 3D GANs to adopt patch-based training or employ low-resolution rendering with post-processing 2D super resolution which sacrifices multiview consistency and the quality of resolved geometry. Consequently 3D GANs have not yet been able to fully resolve the rich 3D geometry present in 2D images. In this work we propose techniques to scale neural volume rendering to the much higher resolution of native 2D images thereby resolving fine-grained 3D geometry with unprecedented detail. Our approach employs learning-based samplers for accelerating neural rendering for 3D GAN training using up to 5 times fewer depth samples. This enables us to explicitly "render every pixel" of the full-resolution image during training and inference without post-processing superresolution in 2D. Together with our strategy to learn high-quality surface geometry our method synthesizes high-resolution 3D geometry and strictly view-consistent images while maintaining image quality on par with baselines relying on post-processing super resolution. We demonstrate state-of-the-art 3D geometric quality on FFHQ and AFHQ setting a new standard for unsupervised learning of 3D shapes in 3D GANs. + + + + Towards Robust Learning to Optimize with Theoretical Guarantees http://openaccess.thecvf.com//content/CVPR2024/papers/Song_Towards_Robust_Learning_to_Optimize_with_Theoretical_Guarantees_CVPR_2024_paper.pdf Learning to optimize (L2O) is an emerging technique to solve mathematical optimization problems with learning-based methods. Despite great success in many real-world scenarios such as wireless communications computer networks and electronic design existing L2O works lack theoretical demonstration of their performance and robustness in out-of-distribution (OOD) scenarios. We address this gap by providing comprehensive proofs. First we prove a sufficient condition for a robust L2O model with homogeneous convergence rates over all In-Distribution (InD) instances. We assume an L2O model achieves robustness for an InD scenario. Based on our proposed methodology of aligning OOD problems to InD problems we also demonstrate that the L2O model's convergence rate in OOD scenarios will deteriorate by an equation of the L2O model's input features. Moreover we propose an L2O model with a concise gradient-only feature construction and a novel gradient-based history modeling method.
Numerical simulation demonstrates that our proposed model outperforms the state-of-the-art baseline in both InD and OOD scenarios and achieves up to 10x convergence speedup. The code of our method can be found at https://github.com/NetX-lab/GoMathL2O-Official. + + + + Differentiable Neural Surface Refinement for Modeling Transparent Objects http://openaccess.thecvf.com//content/CVPR2024/papers/Deng_Differentiable_Neural_Surface_Refinement_for_Modeling_Transparent_Objects_CVPR_2024_paper.pdf Neural implicit surface reconstruction leveraging volume rendering has led to significant advances in multi-view reconstruction. However results for transparent objects can be very poor primarily because the rendering function fails to account for the intricate light transport induced by refraction and reflection. In this study we introduce transparent neural surface refinement (TNSR) a novel surface reconstruction framework that explicitly incorporates physical refraction and reflection tracing. Beginning with an initial approximate surface our method employs sphere tracing combined with Snell's law to cast both reflected and refracted rays. Central to our proposal is an innovative differentiable technique devised to allow signals from the photometric evidence to propagate back to the surface model by considering how the surface bends and reflects light rays. This allows us to connect surface refinement with volume rendering enabling end-to-end optimization solely on multi-view RGB images. In our experiments TNSR demonstrates significant improvements in novel view synthesis and geometry estimation of transparent objects without prior knowledge of the refractive index. + + + + Improving Generalization via Meta-Learning on Hard Samples http://openaccess.thecvf.com//content/CVPR2024/papers/Jain_Improving_Generalization_via_Meta-Learning_on_Hard_Samples_CVPR_2024_paper.pdf Learned reweighting (LRW) approaches to supervised learning use an optimization criterion to assign weights for training instances in order to maximize performance on a representative validation dataset. We pose and formalize the problem of optimized selection of the validation set used in LRW training to improve classifier generalization. In particular we show that using hard-to-classify instances in the validation set has both a theoretical connection to and strong empirical evidence of generalization. We provide an efficient algorithm for training this meta-optimized model as well as a simple train-twice heuristic for careful comparative study. We demonstrate that LRW with easy validation data performs consistently worse than LRW with hard validation data establishing the validity of our meta-optimization problem. Our proposed algorithm outperforms a wide range of baselines on a range of datasets and domain shift challenges (Imagenet-1K CIFAR-100 Clothing-1M CAMELYON WILDS etc.) with 1% gains using VIT-B on Imagenet. We also show that using naturally hard examples for validation (Imagenet-R / Imagenet-A) in LRW training for Imagenet improves performance on both clean and naturally hard test instances by 1-2%. Secondary analyses show that using hard validation data in an LRW framework improves margins on test data hinting at the mechanism underlying our empirical gains. We believe this work opens up new research directions for the meta-optimization of meta-learning in a supervised learning context.
+ + + + Differentiable Information Bottleneck for Deterministic Multi-view Clustering + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_Differentiable_Information_Bottleneck_for_Deterministic_Multi-view_Clustering_CVPR_2024_paper.pdf + In recent several years the information bottleneck (IB) principle provides an information-theoretic framework for deep multi-view clustering (MVC) by compressing multi-view observations while preserving the relevant information of multiple views. Although existing IB-based deep MVC methods have achieved huge success they rely on variational approximation and distribution assumption to estimate the lower bound of mutual information which is a notoriously hard and impractical problem in high-dimensional multi-view spaces. In this work we propose a new differentiable information bottleneck (DIB) method which provides a deterministic and analytical MVC solution by fitting the mutual information without the necessity of variational approximation. Specifically we first propose to directly fit the mutual information of high-dimensional spaces by leveraging normalized kernel Gram matrix which does not require any auxiliary neural estimator to estimate the lower bound of mutual information. Then based on the new mutual information measurement a deterministic multi-view neural network with analytical gradients is explicitly trained to parameterize IB principle which derives a deterministic compression of input variables from different views. Finally a triplet consistency discovery mechanism is devised which is capable of mining the feature consistency cluster consistency and joint consistency based on the deterministic and compact representations. Extensive experimental results show the superiority of our DIB method on 6 benchmarks compared with 13 state-of-the-art baselines. + + + + Animatable Gaussians: Learning Pose-dependent Gaussian Maps for High-fidelity Human Avatar Modeling + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Animatable_Gaussians_Learning_Pose-dependent_Gaussian_Maps_for_High-fidelity_Human_Avatar_CVPR_2024_paper.pdf + Modeling animatable human avatars from RGB videos is a long-standing and challenging problem. Recent works usually adopt MLP-based neural radiance fields (NeRF) to represent 3D humans but it remains difficult for pure MLPs to regress pose-dependent garment details. To this end we introduce Animatable Gaussians a new avatar representation that leverages powerful 2D CNNs and 3D Gaussian splatting to create high-fidelity avatars. To associate 3D Gaussians with the animatable avatar we learn a parametric template from the input videos and then parameterize the template on two front & back canonical Gaussian maps where each pixel represents a 3D Gaussian. The learned template is adaptive to the wearing garments for modeling looser clothes like dresses. Such template-guided 2D parameterization enables us to employ a powerful StyleGAN-based CNN to learn the pose-dependent Gaussian maps for modeling detailed dynamic appearances. Furthermore we introduce a pose projection strategy for better generalization given novel poses. Overall our method can create lifelike avatars with dynamic realistic and generalized appearances. Experiments show that our method outperforms other state-of-the-art approaches. Code: https://github.com/lizhe00/AnimatableGaussians. 
+ + + + Latency Correction for Event-guided Deblurring and Frame Interpolation http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Latency_Correction_for_Event-guided_Deblurring_and_Frame_Interpolation_CVPR_2024_paper.pdf Event cameras with their high temporal resolution dynamic range and low power consumption are particularly good at time-sensitive applications like deblurring and frame interpolation. However their performance is hindered by latency variability especially under low-light conditions and with fast-moving objects. This paper addresses the challenge of latency in event cameras -- the temporal discrepancy between the actual occurrence of changes and the corresponding timestamp assigned by the sensor. Focusing on event-guided deblurring and frame interpolation tasks we propose a latency correction method based on a parameterized latency model. To enable data-driven learning we develop an event-based temporal fidelity to describe the sharpness of latent images reconstructed from events and the corresponding blurry images and reformulate the event-based double integral model to be differentiable with respect to latency. The proposed method is validated using synthetic and real-world datasets demonstrating the benefits of latency correction for deblurring and interpolation across different lighting conditions. + + + + WinSyn: A High Resolution Testbed for Synthetic Data http://openaccess.thecvf.com//content/CVPR2024/papers/Kelly_WinSyn__A_High_Resolution_Testbed_for_Synthetic_Data_CVPR_2024_paper.pdf We present WinSyn a unique dataset and testbed for creating high-quality synthetic data with procedural modeling techniques. The dataset contains high-resolution photographs of windows selected from locations around the world with 89318 individual window crops showcasing diverse geometric and material characteristics. We evaluate a procedural model by training semantic segmentation networks on both synthetic and real images and then comparing their performances on a shared test set of real images. Specifically we measure the difference in mean Intersection over Union (mIoU) and determine the effective number of real images to match synthetic data's training performance. We design a baseline procedural model as a benchmark and provide 21290 synthetically generated images. By tuning the procedural model key factors are identified which significantly influence the model's fidelity in replicating real-world scenarios. Importantly we highlight the challenge of procedural modeling using current techniques especially in their ability to replicate the spatial semantics of real-world scenarios. This insight is critical because of the potential of procedural models to bridge hidden scene aspects such as depth reflectivity material properties and lighting conditions. + + + + Language-aware Visual Semantic Distillation for Video Question Answering http://openaccess.thecvf.com//content/CVPR2024/papers/Zou_Language-aware_Visual_Semantic_Distillation_for_Video_Question_Answering_CVPR_2024_paper.pdf Significant advancements in video question answering (VideoQA) have been made thanks to thriving large image-language pretraining frameworks. Although these image-language models can efficiently represent both video and language branches they typically employ a goal-free vision perception process and do not interact vision with language well during the answer generation thus omitting crucial visual cues.
In this paper we are inspired by the human recognition and learning pattern and propose VideoDistill a framework with language-aware (i.e. goal-driven) behavior in both vision perception and answer generation process. VideoDistill generates answers only from question-related visual embeddings and follows a thinking-observing-answering approach that closely resembles human behavior distinguishing it from previous research. Specifically we develop a language-aware gating mechanism to replace the standard cross-attention avoiding language's direct fusion into visual representations. We incorporate this mechanism into two key components of the entire framework. The first component is a differentiable sparse sampling module which selects frames containing the necessary dynamics and semantics relevant to the questions. The second component is a vision refinement module that merges existing spatial-temporal attention layers to ensure the extraction of multi-grained visual semantics associated with the questions. We conduct experimental evaluations on various challenging video question-answering benchmarks and VideoDistill achieves state-of-the-art performance in both general and long-form VideoQA datasets. In Addition we verify that VideoDistill can effectively alleviate the utilization of language shortcut solutions in the EgoTaskQA dataset. + + + + Disentangled Prompt Representation for Domain Generalization + http://openaccess.thecvf.com//content/CVPR2024/papers/Cheng_Disentangled_Prompt_Representation_for_Domain_Generalization_CVPR_2024_paper.pdf + Domain Generalization (DG) aims to develop a versatile model capable of performing well on unseen target domains. Recent advancements in pre-trained Visual Foundation Models (VFMs) such as CLIP show significant potential in enhancing the generalization abilities of deep models. Although there is a growing focus on VFM-based domain prompt tuning for DG effectively learning prompts that disentangle invariant features across all domains remains a major challenge. In this paper we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Observing that the text modality of VFMs is inherently easier to disentangle we introduce a novel text feature guided visual prompt tuning framework. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. Moreover we also devise domain-specific prototype learning to fully exploit domain-specific information to combine with the invariant feature prediction. Extensive experiments on mainstream DG datasets namely PACS VLCS OfficeHome DomainNet and TerraInc demonstrate that the proposed method achieves superior performances to state-of-the-art DG methods. + + + + Abductive Ego-View Accident Video Understanding for Safe Driving Perception + http://openaccess.thecvf.com//content/CVPR2024/papers/Fang_Abductive_Ego-View_Accident_Video_Understanding_for_Safe_Driving_Perception_CVPR_2024_paper.pdf + We present MM-AU a novel dataset for Multi-Modal Accident video Understanding. MM-AU contains 11727 in-the-wild ego-view accident videos each with temporally aligned text descriptions. We annotate over 2.23 million object boxes and 58650 pairs of video-based accident reasons covering 58 accident categories. 
MM-AU supports various accident understanding tasks particularly multimodal video diffusion to understand accident cause-effect chains for safe driving. With MM-AU we present an Abductive accident Video understanding framework for Safe Driving perception (AdVersa-SD). AdVersa-SD performs video diffusion via an Object-Centric Video Diffusion (OAVD) method which is driven by an abductive CLIP model. This model involves a contrastive interaction loss to learn the pair co-occurrence of normal near-accident accident frames with the corresponding text descriptions such as accident reasons prevention advice and accident categories. OAVD enforces the object region learning while fixing the content of the original frame background in video generation to find the dominant objects for certain accidents. Extensive experiments verify the abductive ability of AdVersa-SD and the superiority of OAVD against the state-of-the-art diffusion models. Additionally we provide careful benchmark evaluations for object detection and accident reason answering since AdVersa-SD relies on precise object and accident reason information. + + + + Cross-spectral Gated-RGB Stereo Depth Estimation http://openaccess.thecvf.com//content/CVPR2024/papers/Brucker_Cross-spectral_Gated-RGB_Stereo_Depth_Estimation_CVPR_2024_paper.pdf Gated cameras flood-illuminate a scene and capture the time-gated impulse response of a scene. By employing nanosecond-scale gates existing sensors are capable of capturing mega-pixel gated images delivering dense depth improving on today's LiDAR sensors in spatial resolution and depth precision. Although gated depth estimation methods deliver a million depth estimates per frame their resolution is still an order of magnitude below existing RGB imaging methods. In this work we combine high-resolution stereo HDR RCCB cameras with gated imaging allowing us to exploit depth cues from active gating multi-view RGB and multi-view NIR sensing -- multi-view and gated cues across the entire spectrum. The resulting capture system consists only of low-cost CMOS sensors and flood-illumination. We propose a novel stereo-depth estimation method that is capable of exploiting these multi-modal multi-view depth cues including the active illumination that is measured by the RCCB camera when removing the IR-cut filter. The proposed method achieves accurate depth at long ranges outperforming the next best existing method by 39% for ranges of 100 to 220 m in MAE on accumulated LiDAR ground-truth. Our code models and datasets are available here (https://light.princeton.edu/gatedrccbstereo/). + + + + KVQ: Kwai Video Quality Assessment for Short-form Videos http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_KVQ_Kwai_Video_Quality_Assessment_for_Short-form_Videos_CVPR_2024_paper.pdf Short-form UGC video platforms like Kwai and TikTok have been an emerging and irreplaceable mainstream media form thriving on user-friendly engagement and kaleidoscope creation etc. However the advancing content generation modes e.g. special effects and sophisticated processing workflows e.g. de-artifacts have introduced significant challenges to recent UGC video quality assessment: (i) the ambiguous contents hinder the identification of quality-determined regions. (ii) the diverse and complicated hybrid distortions are hard to distinguish.
To tackle the above challenges and assist in the development of short-form videos we establish the first large-scale Kwai short Video database for Quality assessment termed KVQ which comprises 600 user-uploaded short videos and 3600 processed videos through the diverse practical processing workflows including pre-processing transcoding and enhancement. Among them the absolute quality score of each video and partial ranking score among indistinguishable samples are provided by a team of professional researchers specializing in image processing. Based on this database we propose the first short-form video quality evaluator i.e. KSVQE which enables the quality evaluator to identify the quality-determined semantics with the content understanding of large vision language models (i.e. CLIP) and distinguish the distortions with the distortion understanding module. Experimental results have shown the effectiveness of KSVQE on our KVQ database and popular VQA databases. The project can be found at https://lixinustc.github.io/projects/KVQ/. + + + + Exploring the Transferability of Visual Prompting for Multimodal Large Language Models http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Exploring_the_Transferability_of_Visual_Prompting_for_Multimodal_Large_Language_CVPR_2024_paper.pdf Although Multimodal Large Language Models (MLLMs) have demonstrated promising versatile capabilities their performance is still inferior to specialized models on downstream tasks which makes adaptation necessary to enhance their utility. However fine-tuning methods require independent training for every model leading to huge computation and memory overheads. In this paper we propose a novel setting where we aim to improve the performance of diverse MLLMs with a group of shared parameters optimized for a downstream task. To achieve this we propose Transferable Visual Prompting (TVP) a simple and effective approach to generate visual prompts that can transfer to different models and improve their performance on downstream tasks after being trained on only one model. We introduce two strategies to address the issue of cross-model feature corruption of existing visual prompting methods and enhance the transferability of the learned prompts including 1) Feature Consistency Alignment: which imposes constraints on the prompted feature changes to maintain task-agnostic knowledge; 2) Task Semantics Enrichment: which encourages the prompted images to contain richer task-specific semantics with language guidance. We validate the effectiveness of TVP through extensive experiments with 6 modern MLLMs on a wide variety of tasks ranging from object recognition and counting to multimodal reasoning and hallucination correction.
In particular we hypothesise that this process can be greatly simplified by first encoding 3D objects into a suitable latent space. We validate this hypothesis by building upon the latent space of Shap-E. We demonstrate that direct 3D editing in this space is possible and efficient by learning a feed-forward editor network that only requires approximately one second per edit. Our experiments show that Shap-Editor generalises well to both in-distribution and out-of-distribution 3D assets with different prompts and achieves superior performance compared to methods that carry out test-time optimisation for each edited instance. + + + + HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Leng_HyperSDFusion_Bridging_Hierarchical_Structures_in_Language_and_Geometry_for_Enhanced_CVPR_2024_paper.pdf + 3D shape generation from text is a fundamental task in 3D representation learning. The text-shape pairs exhibit a hierarchical structure where a general text like "chair" covers all 3D shapes of the chair while more detailed prompts refer to more specific shapes. Furthermore both text and 3D shapes are inherently hierarchical structures. However existing Text2Shape methods such as SDFusion do not exploit that. In this work we propose HyperSDFusion a dual-branch diffusion model that generates 3D shapes from a given text. Since hyperbolic space is suitable for handling hierarchical data we propose to learn the hierarchical representations of text and 3D shapes in hyperbolic space. First we introduce a hyperbolic text-image encoder to learn the sequential and multi-modal hierarchical features of text in hyperbolic space. In addition we design a hyperbolic text-graph convolution module to learn the hierarchical features of text in hyperbolic space. In order to fully utilize these text features we introduce a dual-branch structure to embed text features in 3D feature space. At last to endow the generated 3D shapes with a hierarchical structure we devise a hyperbolic hierarchical loss. Our method is the first to explore the hyperbolic hierarchical representation for text-to-shape generation. Experimental results on the existing text-to-shape paired dataset Text2Shape achieved state-of-the-art results. We release our implementation under HyperSDFusion.github.io. + + + + Are Conventional SNNs Really Efficient? A Perspective from Network Quantization + http://openaccess.thecvf.com//content/CVPR2024/papers/Shen_Are_Conventional_SNNs_Really_Efficient_A_Perspective_from_Network_Quantization_CVPR_2024_paper.pdf + Spiking Neural Networks (SNNs) have been widely praised for their high energy efficiency and immense potential. However comprehensive research that critically contrasts and correlates SNNs with quantized Artificial Neural Networks (ANNs) remains scant often leading to skewed comparisons lacking fairness towards ANNs. This paper introduces a unified perspective illustrating that the time steps in SNNs and quantized bit-widths of activation values present analogous representations. Building on this we present a more pragmatic and rational approach to estimating the energy consumption of SNNs. Diverging from the conventional Synaptic Operations (SynOps) we champion the "Bit Budget" concept. This notion permits an intricate discourse on strategically allocating computational and storage resources between weights activation values and temporal steps under stringent hardware constraints. 
Guided by the Bit Budget paradigm we discern that pivoting efforts towards spike patterns and weight quantization rather than temporal attributes elicits profound implications for model performance. Utilizing the Bit Budget for holistic design consideration of SNNs elevates model performance across diverse data types encompassing static imagery and neuromorphic datasets. Our revelations bridge the theoretical chasm between SNNs and quantized ANNs and illuminate a pragmatic trajectory for future endeavors in energy-efficient neural computations. + + + + Initialization Matters for Adversarial Transfer Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Hua_Initialization_Matters_for_Adversarial_Transfer_Learning_CVPR_2024_paper.pdf + With the prevalence of the Pretraining-Finetuning paradigm in transfer learning the robustness of downstream tasks has become a critical concern. In this work we delve into adversarial robustness in transfer learning and reveal the critical role of initialization including both the pretrained model and the linear head. First we discover the necessity of an adversarially robust pretrained model. Specifically we reveal that with a standard pretrained model Parameter-Efficient Finetuning (PEFT) methods either fail to be adversarially robust or continue to exhibit significantly degraded adversarial robustness on downstream tasks even with adversarial training during finetuning. Leveraging a robust pretrained model surprisingly we observe that a simple linear probing can outperform full finetuning and other PEFT methods with random initialization on certain datasets. We further identify that linear probing excels in preserving robustness from the robust pretraining. Based on this we propose Robust Linear Initialization (RoLI) for adversarial finetuning which initializes the linear head with the weights obtained by adversarial linear probing to maximally inherit the robustness from pretraining. Across five different image classification datasets we demonstrate the effectiveness of RoLI and achieve new state-of-the-art results. Our code is available at https://github.com/DongXzz/RoLI. + + + + L0-Sampler: An L0 Model Guided Volume Sampling for NeRF + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_L0-Sampler_An_L0_Model_Guided_Volume_Sampling_for_NeRF_CVPR_2024_paper.pdf + Since its proposal Neural Radiance Fields (NeRF) has achieved great success in related tasks mainly adopting the hierarchical volume sampling (HVS) strategy for volume rendering. However the HVS of NeRF approximates distributions using piecewise constant functions which provides a relatively rough estimation. Based on the observation that a well-trained weight function w(t) and the L_0 distance between points and the surface have very high similarity we propose L_0-Sampler by incorporating the L_0 model into w(t) to guide the sampling process. Specifically we propose using piecewise exponential functions rather than piecewise constant functions for interpolation which can not only approximate quasi-L_0 weight distributions along rays quite well but can be easily implemented with a few lines of code change without additional computational burden. Stable performance improvements can be achieved by applying L_0-Sampler to NeRF and related tasks like 3D reconstruction. Code is available at https://ustc3dv.github.io/L0-Sampler/. 
+ + + + Practical Measurements of Translucent Materials with Inter-Pixel Translucency Prior + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Practical_Measurements_of_Translucent_Materials_with_Inter-Pixel_Translucency_Prior_CVPR_2024_paper.pdf + Material appearance is a key component of photorealism with a pronounced impact on human perception. Although there are many prior works targeting at measuring opaque materials using light-weight setups (e.g. consumer-level cameras) little attention is paid on acquiring the optical properties of translucent materials which are also quite common in nature. In this paper we present a practical method for acquiring scattering properties of translucent materials based solely on ordinary images captured with unknown lighting and camera parameters. The key to our method is an inter-pixel translucency prior which states that image pixels of a given homogeneous translucent material typically form curves (dubbed translucent curves) in the RGB space of which the shapes are determined by the parameters of the material. We leverage this prior in a specially-designed convolutional neural network comprising multiple encoders a translucency-aware feature fusion module and a cascaded decoder. We demonstrate through both visual comparisons and quantitative evaluations that high accuracy can be achieved on a wide range of real-world translucent materials. + + + + TurboSL: Dense Accurate and Fast 3D by Neural Inverse Structured Light + http://openaccess.thecvf.com//content/CVPR2024/papers/Mirdehghan_TurboSL_Dense_Accurate_and_Fast_3D_by_Neural_Inverse_Structured_CVPR_2024_paper.pdf + We show how to turn a noisy and fragile active triangulation technique--three-pattern structured light with a grayscale camera--into a fast and powerful tool for 3D capture: able to output sub-pixel accurate disparities at megapixel resolution along with reflectance normals and a no-reference estimate of its own pixelwise 3D error. To achieve this we formulate structured-light decoding as a neural inverse rendering problem. We show that despite having just three or four input images--all from the same viewpoint--this problem can be tractably solved by TurboSL an algorithm that combines (1) a precise image formation model (2) a signed distance field scene representation and (3) projection pattern sequences optimized for accuracy instead of precision. We use TurboSL to reconstruct a variety of complex scenes from images captured at up to 60 fps with a camera and a common projector. Our experiments highlight TurboSL's potential for dense and highly-accurate 3D acquisition from data captured in fractions of a second. + + + + GS-IR: 3D Gaussian Splatting for Inverse Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Liang_GS-IR_3D_Gaussian_Splatting_for_Inverse_Rendering_CVPR_2024_paper.pdf + We propose GS-IR a novel inverse rendering approach based on 3D Gaussian Splatting (GS) that leverages forward mapping volume rendering to achieve photorealistic novel view synthesis and relighting results. Unlike previous works that use implicit neural representations and volume rendering (e.g. NeRF) which suffer from low expressive power and high computational complexity we extend GS a top-performance representation for novel view synthesis to estimate scene geometry surface material and environment illumination from multi-view images captured under unknown lighting conditions. 
There are two main problems when introducing GS to inverse rendering: 1) GS does not support producing plausible normals natively; 2) forward mapping (e.g. rasterization and splatting) cannot trace the occlusion like backward mapping (e.g. ray tracing). To address these challenges our GS-IR proposes an efficient optimization scheme that incorporates a depth-derivation-based regularization for normal estimation and a baking-based occlusion to model indirect lighting. The flexible and expressive GS representation allows us to achieve fast and compact geometry reconstruction photorealistic novel view synthesis and effective physically-based rendering. We demonstrate the superiority of our method over baseline methods through qualitative and quantitative evaluations on various challenging scenes. + + + + SynFog: A Photo-realistic Synthetic Fog Dataset based on End-to-end Imaging Simulation for Advancing Real-World Defogging in Autonomous Driving + http://openaccess.thecvf.com//content/CVPR2024/papers/Xie_SynFog_A_Photo-realistic_Synthetic_Fog_Dataset_based_on_End-to-end_Imaging_CVPR_2024_paper.pdf + To advance research in learning-based defogging algorithms various synthetic fog datasets have been developed. However existing datasets created using the Atmospheric Scattering Model (ASM) or real-time rendering engines often struggle to produce photo-realistic foggy images that accurately mimic the actual imaging process. This limitation hinders the effective generalization of models from synthetic to real data. In this paper we introduce an end-to-end simulation pipeline designed to generate photo-realistic foggy images. This pipeline comprehensively considers the entire physically-based foggy scene imaging process closely aligning with real-world image capture methods. Based on this pipeline we present a new synthetic fog dataset named SynFog which features both sky light and active lighting conditions as well as three levels of fog density. Experimental results demonstrate that models trained on SynFog exhibit superior performance in visual perception and detection accuracy compared to others when applied to real-world foggy images. + + + + TRINS: Towards Multimodal Language Models that Can Read + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_TRINS_Towards_Multimodal_Language_Models_that_Can_Read_CVPR_2024_paper.pdf + Large multimodal language models have shown remarkable proficiency in understanding and editing images. However a majority of these visually-tuned models struggle to comprehend the textual content embedded in images primarily due to the limitation of training data. In this work we introduce TRINS: a Text-Rich image INStruction dataset with the objective of enhancing the reading ability of the multimodal large language model. TRINS is built upon LAION using hybrid data annotation strategies that include machine-assisted and human-assisted annotation processes. It contains 39153 text-rich images captions and 102437 questions. Specifically we show that the number of words per annotation in TRINS is significantly longer than that of related datasets providing new challenges. Furthermore we introduce a simple and effective architecture called a Language-Vision Reading Assistant (LaRA) which is good at understanding textual content within images. LaRA outperforms existing state-of-the-art multimodal large language models on the TRINS dataset as well as other classical benchmarks. 
Lastly we conducted a comprehensive evaluation with TRINS on various text-rich image understanding and generation tasks demonstrating its effectiveness. + + + + Self-Supervised Representation Learning from Arbitrary Scenarios + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Self-Supervised_Representation_Learning_from_Arbitrary_Scenarios_CVPR_2024_paper.pdf + Current self-supervised methods can primarily be categorized into contrastive learning and masked image modeling. Extensive studies have demonstrated that combining these two approaches can achieve state-of-the-art performance. However these methods essentially reinforce the global consistency of contrastive learning without taking into account the conflicts between these two approaches which hinders their generalizability to arbitrary scenarios. In this paper we theoretically prove that MAE serves as a patch-level contrastive learning where each patch within an image is considered as a distinct category. This presents a significant conflict with global-level contrastive learning which treats all patches in an image as an identical category. To address this conflict this work abandons the non-generalizable global-level constraints and proposes explicit patch-level contrastive learning as a solution. Specifically this work employs the encoder of MAE to generate dual-branch features which then perform patch-level learning through a decoder. In contrast to global-level data augmentation in contrastive learning our approach leverages patch-level feature augmentation to mitigate interference from global-level learning. Consequently our approach can learn heterogeneous representations from a single image while avoiding the conflicts encountered by previous methods. Massive experiments affirm the potential of our method for learning from arbitrary scenarios. + + + + Living Scenes: Multi-object Relocalization and Reconstruction in Changing 3D Environments + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Living_Scenes_Multi-object_Relocalization_and_Reconstruction_in_Changing_3D_Environments_CVPR_2024_paper.pdf + Research into dynamic 3D scene understanding has primarily focused on short-term change tracking from dense observations while little attention has been paid to long-term changes with sparse observations. We address this gap with MoRE a novel approach for multi-object relocalization and reconstruction in evolving environments. We view these environments as Living Scenes and consider the problem of transforming scans taken at different points in time into a 3D reconstruction of the object instances whose accuracy and completeness increase over time. At the core of our method lies an SE(3) equivariant representation in a single encoder-decoder network trained on synthetic data. This representation enables us to seamlessly tackle instance matching registration and reconstruction. We also introduce a joint optimization algorithm that facilitates the accumulation of point clouds originating from the same instance across multiple scans taken at different points in time. We validate our method on synthetic and real-world data and demonstrate state-of-the-art performance in both end-to-end performance and individual subtasks. 
+ + + + Task-Adaptive Saliency Guidance for Exemplar-free Class Incremental Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Task-Adaptive_Saliency_Guidance_for_Exemplar-free_Class_Incremental_Learning_CVPR_2024_paper.pdf + Exemplar-free Class Incremental Learning (EFCIL) aims to sequentially learn tasks with access only to data from the current one. EFCIL is of interest because it mitigates concerns about privacy and long-term storage of data while at the same time alleviating the problem of catastrophic forgetting in incremental learning. In this work we introduce task-adaptive saliency for EFCIL and propose a new framework which we call Task-Adaptive Saliency Supervision (TASS) for mitigating the negative effects of saliency drift between different tasks. We first apply boundary-guided saliency to maintain task adaptivity and plasticity on model attention. Besides we introduce task-agnostic low-level signals as auxiliary supervision to increase the stability of model attention. Finally we introduce a module for injecting and recovering saliency noise to increase the robustness of saliency preservation. Our experiments demonstrate that our method can better preserve saliency maps across tasks and achieve state-of-the-art results on the CIFAR-100 Tiny-ImageNet and ImageNet-Subset EFCIL benchmarks. Code is available at https://github.com/scok30/tass. + + + + Language-driven All-in-one Adverse Weather Removal + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_Language-driven_All-in-one_Adverse_Weather_Removal_CVPR_2024_paper.pdf + All-in-one (AiO) frameworks restore various adverse weather degradations with a single set of networks jointly. To handle various weather conditions an AiO framework is expected to adaptively learn weather-specific knowledge for different degradations and shared knowledge for common patterns. However existing methods: 1) rely on extra supervision signals which are usually unknown in real-world applications; 2) employ fixed network structures which restrict the diversity of weather-specific knowledge. In this paper we propose a Language-driven Restoration framework (LDR) to alleviate the aforementioned issues. First we leverage the power of pre-trained vision-language (PVL) models to enrich the diversity of weather-specific knowledge by reasoning about the occurrence type and severity of degradation generating description-based degradation priors. Then with the guidance of degradation prior we sparsely select restoration experts from a candidate list dynamically based on a Mixture-of-Experts (MoE) structure. This enables us to adaptively learn the weather-specific and shared knowledge to handle various weather conditions (e.g. unknown or mixed weather). Experiments on extensive restoration scenarios show our superior performance (see Fig. 1). The source code will be made available. + + + + MAPLM: A Real-World Large-Scale Vision-Language Benchmark for Map and Traffic Scene Understanding + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_MAPLM_A_Real-World_Large-Scale_Vision-Language_Benchmark_for_Map_and_Traffic_CVPR_2024_paper.pdf + Vision-language generative AI has demonstrated remarkable promise for empowering cross-modal scene understanding of autonomous driving and high-definition (HD) map systems. However current benchmark datasets lack multi-modal point cloud image and language data pairs. 
Recent approaches utilize visual instruction learning and cross-modal prompt engineering to expand vision-language models into this domain. In this paper we propose a new vision-language benchmark that can be used to finetune traffic and HD map domain-specific foundation models. Specifically we annotate and leverage large-scale broad-coverage traffic and map data extracted from huge HD map annotations and use CLIP and LLaMA-2 / Vicuna to finetune a baseline model with instruction-following data. Our experimental results across various algorithms reveal that while visual instruction-tuning large language models (LLMs) can effectively learn meaningful representations from MAPLM-QA there remains significant room for further advancements. To facilitate applying LLMs and multi-modal data into self-driving research we will release our visual-language QA data and the baseline models at GitHub.com/LLVM-AD/MAPLM. + + + + EgoExoLearn: A Dataset for Bridging Asynchronous Ego- and Exo-centric View of Procedural Activities in Real World + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_EgoExoLearn_A_Dataset_for_Bridging_Asynchronous_Ego-_and_Exo-centric_View_CVPR_2024_paper.pdf + Being able to map the activities of others into one's own point of view is one fundamental human skill even from a very early age. Taking a step toward understanding this human ability we introduce EgoExoLearn a large-scale dataset that emulates the human demonstration following process in which individuals record egocentric videos as they execute tasks guided by demonstration videos. Focusing on the potential applications of daily assistance and professional support EgoExoLearn contains egocentric and demonstration video data spanning 120 hours captured in daily life scenarios and specialized laboratories. Along with the videos we record high-quality gaze data and provide detailed multimodal annotations formulating a playground for modeling the human ability to bridge asynchronous procedural actions from different viewpoints. To this end we present benchmarks such as cross-view association cross-view action planning and cross-view referenced skill assessment along with detailed analysis. We expect EgoExoLearn can serve as an important resource for bridging the actions across views thus paving the way for creating AI agents capable of seamlessly learning by observing humans in the real world. The dataset and benchmark codes are available at https://github.com/OpenGVLab/EgoExoLearn. + + + + Improved Implicit Neural Representation with Fourier Reparameterized Training + http://openaccess.thecvf.com//content/CVPR2024/papers/Shi_Improved_Implicit_Neural_Representation_with_Fourier_Reparameterized_Training_CVPR_2024_paper.pdf + Implicit Neural Representation (INR) as a mighty representation paradigm has achieved success in various computer vision tasks recently. Due to the low-frequency bias issue of vanilla multi-layer perceptron (MLP) existing methods have investigated advanced techniques such as positional encoding and periodic activation function to improve the accuracy of INR. In this paper we connect the network training bias with the reparameterization technique and theoretically prove that weight reparameterization could provide us a chance to alleviate the spectral bias of MLP. Based on our theoretical analysis we propose a Fourier reparameterization method which learns coefficient matrix of fixed Fourier bases to compose the weights of MLP. 
We evaluate the proposed Fourier reparameterization method on different INR tasks with various MLP architectures including vanilla MLP MLP with positional encoding and MLP with advanced activation function etc. The superior approximation results on different MLP architectures clearly validate the advantage of our proposed method. Armed with our Fourier reparameterization method better INR with more textures and fewer artifacts can be learned from the training data. The codes are available at https://github.com/LabShuHangGU/FR-INR. + + + + Groupwise Query Specialization and Quality-Aware Multi-Assignment for Transformer-based Visual Relationship Detection + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Groupwise_Query_Specialization_and_Quality-Aware_Multi-Assignment_for_Transformer-based_Visual_Relationship_CVPR_2024_paper.pdf + Visual Relationship Detection (VRD) has seen significant advancements with Transformer-based architectures recently. However we identify two key limitations in a conventional label assignment for training Transformer-based VRD models which is a process of mapping a ground-truth (GT) to a prediction. Under the conventional assignment an 'unspecialized' query is trained since a query is expected to detect every relation which makes it difficult for a query to specialize in specific relations. Furthermore a query is also insufficiently trained since a GT is assigned only to a single prediction therefore near-correct or even correct predictions are suppressed by being assigned 'no relation' as a GT. To address these issues we propose Groupwise Query Specialization and Quality-Aware Multi-Assignment (SpeaQ). Groupwise Query Specialization trains a 'specialized' query by dividing queries and relations into disjoint groups and directing a query in a specific query group solely toward relations in the corresponding relation group. Quality-Aware Multi-Assignment further facilitates the training by assigning a GT to multiple predictions that are significantly close to a GT in terms of a subject an object and the relation in between. Experimental results and analyses show that SpeaQ effectively trains 'specialized' queries which better utilize the capacity of a model resulting in consistent performance gains with 'zero' additional inference cost across multiple VRD models and benchmarks. Code is available at https://github.com/mlvlab/SpeaQ. + + + + Purified and Unified Steganographic Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Purified_and_Unified_Steganographic_Network_CVPR_2024_paper.pdf + Steganography is the art of hiding secret data into the cover media for covert communication. In recent years more and more deep neural network (DNN)-based steganographic schemes are proposed to train steganographic networks for secret embedding and recovery which are shown to be promising. Compared with the handcrafted steganographic tools steganographic networks tend to be large in size. It raises concerns on how to imperceptibly and effectively transmit these networks to the sender and receiver to facilitate the covert communication. To address this issue we propose in this paper a Purified and Unified Steganographic Network (PUSNet). It performs an ordinary machine learning task in a purified network which could be triggered into steganographic networks for secret embedding or recovery using different keys. 
We formulate the construction of the PUSNet into a sparse weight filling problem to flexibly switch between the purified and steganographic networks. We further instantiate our PUSNet as an image denoising network with two steganographic networks concealed for secret image embedding and recovery. Comprehensive experiments demonstrate that our PUSNet achieves good performance on secret image embedding secret image recovery and image denoising in a single architecture. It is also shown to be capable of imperceptibly carrying the steganographic networks in a purified network. Code is available at https://github.com/albblgb/PUSNet + + + + TEA: Test-time Energy Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Yuan_TEA_Test-time_Energy_Adaptation_CVPR_2024_paper.pdf + Test Time Adaptation (TTA) aims to improve model generalizability when test data diverges from training distribution with the distinct advantage of not requiring access to training data and processes especially valuable in the context of pre-trained models. However current TTA methods fail to address the fundamental issue: covariate shift i.e. the decreased generalizability can be attributed to the model's reliance on the marginal distribution of the training data which may impair model calibration and introduce confirmation bias. To address this we propose a novel energy-based perspective enhancing the model's perception of target data distributions without requiring access to training data or processes. Building on this perspective we introduce Test-time Energy Adaptation (TEA) which transforms the trained classifier into an energy-based model and aligns the model's distribution with the test data's enhancing its ability to perceive test distributions and thus improving overall generalizability. Extensive experiments across multiple tasks benchmarks and architectures demonstrate TEA's superior generalization performance against state-of-the-art methods. 
Further in-depth analyses reveal that TEA can equip the model with a comprehensive perception of test distribution ultimately paving the way toward improved generalization and calibration. Code is available at https://github.com/yuanyige/tea. + + + + NEAT: Distilling 3D Wireframes from Neural Attraction Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Xue_NEAT_Distilling_3D_Wireframes_from_Neural_Attraction_Fields_CVPR_2024_paper.pdf + This paper studies the problem of structured 3D reconstruction using wireframes that consist of line segments and junctions focusing on the computation of structured boundary geometries of scenes. Instead of leveraging matching-based solutions from 2D wireframes (or line segments) for 3D wireframe reconstruction as done in prior arts we present NEAT a rendering-distilling formulation using neural fields to represent 3D line segments with 2D observations and bipartite matching for perceiving and distilling of a sparse set of 3D global junctions. The proposed NEAT enjoys the joint optimization of the neural fields and the global junctions from scratch using view-dependent 2D observations without precomputed cross-view feature matching. Comprehensive experiments on the DTU and BlendedMVS datasets demonstrate our NEAT's superiority over state-of-the-art alternatives for 3D wireframe reconstruction. Moreover the distilled 3D global junctions by NEAT are a better initialization than SfM points for the recently-emerged 3D Gaussian Splatting for high-fidelity novel view synthesis using about 20 times fewer initial 3D points. Project page: https://xuenan.net/neat + + + + LDP: Language-driven Dual-Pixel Image Defocus Deblurring Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_LDP_Language-driven_Dual-Pixel_Image_Defocus_Deblurring_Network_CVPR_2024_paper.pdf + Recovering sharp images from dual-pixel (DP) pairs with disparity-dependent blur is a challenging task. Existing blur map-based deblurring methods have demonstrated promising results. In this paper we propose to the best of our knowledge the first framework to introduce the contrastive language-image pre-training framework (CLIP) to achieve accurate blur map estimation from DP pairs unsupervisedly. To this end we first carefully design text prompts to enable CLIP to understand blur-related geometric prior knowledge from the DP pair. Then we propose a format to input stereo DP pair to the CLIP without any fine-tuning where the CLIP is pre-trained on monocular images. Given the estimated blur map we introduce a blur-prior attention block a blur-weighting loss and a blur-aware loss to recover the all-in-focus image. Our method achieves state-of-the-art performance in extensive experiments (see Fig. 1). + + + + MMSum: A Dataset for Multimodal Summarization and Thumbnail Generation of Videos + http://openaccess.thecvf.com//content/CVPR2024/papers/Qiu_MMSum_A_Dataset_for_Multimodal_Summarization_and_Thumbnail_Generation_of_CVPR_2024_paper.pdf + Multimodal summarization with multimodal output (MSMO) has emerged as a promising research direction. Nonetheless numerous limitations exist within existing public MSMO datasets including insufficient maintenance data inaccessibility limited size and the absence of proper categorization which pose significant challenges. To address these challenges and provide a comprehensive dataset for this new direction we have meticulously curated the MMSum dataset. 
Our new dataset features (1) Human-validated summaries for both video and textual content providing superior human instruction and labels for multimodal learning. (2) Comprehensively and meticulously arranged categorization spanning 17 principal categories and 170 subcategories to encapsulate a diverse array of real-world scenarios. (3) Benchmark tests performed on the proposed dataset to assess various tasks and methods including video summarization text summarization and multimodal summarization. To champion accessibility and collaboration we released the MMSum dataset and the data collection tool as fully open-source resources fostering transparency and accelerating future developments. + + + + Pre-trained Vision and Language Transformers Are Few-Shot Incremental Learners + http://openaccess.thecvf.com//content/CVPR2024/papers/Park_Pre-trained_Vision_and_Language_Transformers_Are_Few-Shot_Incremental_Learners_CVPR_2024_paper.pdf + Few-Shot Class Incremental Learning (FSCIL) is a task that requires a model to learn new classes incrementally without forgetting when only a few samples for each class are given. FSCIL encounters two significant challenges: catastrophic forgetting and overfitting and these challenges have driven prior studies to primarily rely on shallow models such as ResNet-18. Even though their limited capacity can mitigate both forgetting and overfitting issues it leads to inadequate knowledge transfer during few-shot incremental sessions. In this paper we argue that large models such as vision and language transformers pre-trained on large datasets can be excellent few-shot incremental learners. To this end we propose a novel FSCIL framework called PriViLege Pre-trained Vision and Language transformers with prompting functions and knowledge distillation. Our framework effectively addresses the challenges of catastrophic forgetting and overfitting in large models through new pre-trained knowledge tuning (PKT) and two losses: entropy-based divergence loss and semantic knowledge distillation loss. Experimental results show that the proposed PriViLege significantly outperforms the existing state-of-the-art methods with a large margin e.g. +9.38% in CUB200 +20.58% in CIFAR-100 and +13.36% in miniImageNet. Our implementation code is available at https://github.com/KHU-AGI/PriViLege. + + + + Language-guided Image Reflection Separation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhong_Language-guided_Image_Reflection_Separation_CVPR_2024_paper.pdf + This paper studies the problem of language-guided reflection separation which aims at addressing the ill-posed reflection separation problem by introducing language descriptions to provide layer content. We propose a unified framework to solve this problem which leverages the cross-attention mechanism with contrastive learning strategies to construct the correspondence between language descriptions and image layers. A gated network design and a randomized training strategy are employed to tackle the recognizable layer ambiguity. The effectiveness of the proposed method is validated by the significant performance advantage over existing reflection separation methods on both quantitative and qualitative comparisons. 
+ + + + View-Category Interactive Sharing Transformer for Incomplete Multi-View Multi-Label Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Ou_View-Category_Interactive_Sharing_Transformer_for_Incomplete_Multi-View_Multi-Label_Learning_CVPR_2024_paper.pdf + As a problem often encountered in real-world scenarios multi-view multi-label learning has attracted considerable research attention. However due to oversights in data collection and uncertainties in manual annotation real-world data often suffer from incompleteness. Regrettably most existing multi-view multi-label learning methods sidestep missing views and labels. Furthermore they often neglect the potential of harnessing complementary information between views and labels thus constraining their classification capabilities. To address these challenges we propose a view-category interactive sharing transformer tailored for incomplete multi-view multi-label learning. Within this network we incorporate a two-layer transformer module to characterize the interplay between views and labels. Additionally to address view incompleteness a KNN-style missing view generation module is employed. Finally we introduce a view-category consistency guided embedding enhancement module to align different views and improve the discriminating power of the embeddings. Collectively these modules synergistically integrate to classify the incomplete multi-view multi-label data effectively. Extensive experiments substantiate that our approach outperforms the existing state-of-the-art methods. + + + + The More You See in 2D the More You Perceive in 3D + http://openaccess.thecvf.com//content/CVPR2024/papers/Han_The_More_You_See_in_2D_the_More_You_Perceive_CVPR_2024_paper.pdf + Humans can infer 3D structure from 2D images of an object based on past experience and improve their 3D understanding as they see more images. Inspired by this behavior we introduce SAP3D a system for 3D reconstruction and novel view synthesis from an arbitrary number of unposed images. Given a few unposed images of an object we adapt a pre-trained view-conditioned diffusion model together with the camera poses of the images via test-time fine-tuning. The adapted diffusion model and the obtained camera poses are then utilized as instance-specific priors for 3D reconstruction and novel view synthesis. We show that as the number of input images increases the performance of our approach improves bridging the gap between optimization-based prior-less 3D reconstruction methods and single-image-to-3D diffusion-based methods. We demonstrate our system on real images as well as standard synthetic benchmarks. Our ablation studies confirm that this adaption behavior is key for more accurate 3D understanding. + + + + Unifying Automatic and Interactive Matting with Pretrained ViTs + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_Unifying_Automatic_and_Interactive_Matting_with_Pretrained_ViTs_CVPR_2024_paper.pdf + Automatic and interactive matting largely improve image matting by respectively alleviating the need for auxiliary input and enabling object selection. Due to different settings on whether prompts exist they either suffer from weakness in instance completeness or region details. Also when dealing with different scenarios directly switching between the two matting models introduces inconvenience and higher workload. Therefore we wonder whether we can alleviate the limitations of both settings while achieving unification to facilitate more convenient use. 
Our key idea is to offer saliency guidance for automatic mode to enable its attention to detailed regions and also refine the instance completeness in interactive mode by replacing the binary mask guidance with a more probabilistic form. With different guidance for each mode we can achieve unification through adaptable guidance defined as saliency information in automatic mode and a user cue for the interactive one. It is instantiated as candidate feature in our method an automatic switch for class token in pretrained ViTs and average feature of user prompts controlled by the existence of user prompts. Then we use the candidate feature to generate a probabilistic similarity map as the guidance to alleviate the over-reliance on binary mask. Extensive experiments show that our method can adapt well to both automatic and interactive scenarios with a more lightweight framework. Code available at https://github.com/coconuthust/SmartMatting. + + + + MoPE-CLIP: Structured Pruning for Efficient Vision-Language Models with Module-wise Pruning Error Metric + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_MoPE-CLIP_Structured_Pruning_for_Efficient_Vision-Language_Models_with_Module-wise_Pruning_CVPR_2024_paper.pdf + Vision-language pre-trained models have achieved impressive performance on various downstream tasks. However their large model sizes hinder their utilization on platforms with limited computational resources. We find that directly using smaller pre-trained models and applying magnitude-based pruning on CLIP models leads to inflexibility and inferior performance. Recent efforts for VLP compression either adopt uni-modal compression metrics resulting in limited performance or involve costly mask-search processes with learnable masks. In this paper we first propose the Module-wise Pruning Error (MoPE) metric accurately assessing CLIP module importance by performance decline on cross-modal tasks. Using the MoPE metric we introduce a unified pruning framework applicable to both pre-training and task-specific fine-tuning compression stages. For pre-training MoPE-CLIP effectively leverages knowledge from the teacher model significantly reducing pre-training costs while maintaining strong zero-shot capabilities. For fine-tuning consecutive pruning from width to depth yields highly competitive task-specific models. Extensive experiments in two stages demonstrate the effectiveness of the MoPE metric and MoPE-CLIP outperforms previous state-of-the-art VLP compression methods. + + + + Leveraging Frame Affinity for sRGB-to-RAW Video De-rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Leveraging_Frame_Affinity_for_sRGB-to-RAW_Video_De-rendering_CVPR_2024_paper.pdf + Unprocessed RAW video has shown distinct advantages over sRGB video in video editing and computer vision tasks. However capturing RAW video is challenging due to limitations in bandwidth and storage. Various methods have been proposed to address similar issues in single image RAW capture through de-rendering. These methods utilize both the metadata and the sRGB image to perform sRGB-to-RAW de-rendering and recover high-quality single-frame RAW data. However metadata-based methods always require additional computation for online metadata generation imposing a severe burden on mobile camera devices for high frame rate RAW video capture. To address this issue we propose a framework that utilizes frame affinity to achieve high-quality sRGB-to-RAW video reconstruction. Our approach consists of two main steps. 
The first step temporal affinity prior extraction uses motion information between adjacent frames to obtain a reference RAW image. The second step spatial feature fusion and mapping learns a pixel-level mapping function using scene-specific and position-specific features provided by the previous frame. Our method can be easily applied to current mobile camera equipment without complicated adaptations or added burden. To demonstrate the effectiveness of our approach we introduce the first RAW Video De-rendering Benchmark. In this benchmark our method outperforms state-of-the-art RAW image reconstruction methods even without image-level metadata. + + + + The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes + http://openaccess.thecvf.com//content/CVPR2024/papers/Ko_The_Mirrored_Influence_Hypothesis_Efficient_Data_Influence_Estimation_by_Harnessing_CVPR_2024_paper.pdf + Large-scale black-box models have become ubiquitous across numerous applications. Understanding the influence of individual training data sources on predictions made by these models is crucial for improving their trustworthiness. Current influence estimation techniques involve computing gradients for every training point or repeated training on different subsets. These approaches face obvious computational challenges when scaled up to large datasets and models. In this paper we introduce and explore the Mirrored Influence Hypothesis highlighting a reciprocal nature of influence between training and test data. Specifically it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent yet inverse problem: assessing how the predictions for training samples would be altered if the model were trained on specific test samples. Through both empirical and theoretical validations we demonstrate the wide applicability of our hypothesis. Inspired by this we introduce a new method for estimating the influence of training data which requires calculating gradients for specific test samples paired with a forward pass for each training point. This approach can capitalize on the common asymmetry in scenarios where the number of test samples under concurrent examination is much smaller than the scale of the training dataset thus gaining a significant improvement in efficiency compared to existing approaches. We demonstrate the applicability of our method across a range of scenarios including data attribution in diffusion models data leakage detection analysis of memorization mislabeled data detection and tracing behavior in language models. + + + + Choose What You Need: Disentangled Representation Learning for Scene Text Recognition Removal and Editing + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Choose_What_You_Need_Disentangled_Representation_Learning_for_Scene_Text_CVPR_2024_paper.pdf + Scene text images contain not only style information (font background) but also content information (character texture). Different scene text tasks need different information but previous representation learning methods use tightly coupled features for all tasks resulting in sub-optimal performance. We propose a Disentangled Representation Learning framework (DARLING) aimed at disentangling these two types of features for improved adaptability in better addressing various downstream tasks (choose what you really need). Specifically we synthesize a dataset of image pairs with identical style but different content. 
Based on the dataset we decouple the two types of features by the supervision design. Concretely we directly split the visual representation into style and content features; the content features are supervised by a text recognition loss while an alignment loss aligns the style features in the image pairs. Then style features are employed in reconstructing the counterpart image via an image decoder with a prompt that indicates the counterpart's content. Such an operation effectively decouples the features based on their distinctive properties. To the best of our knowledge this is the first work in the field of scene text that disentangles the inherent properties of the text images. Our method achieves state-of-the-art performance in Scene Text Recognition Removal and Editing. + + + + Symphonize 3D Semantic Scene Completion with Contextual Instance Queries + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Symphonize_3D_Semantic_Scene_Completion_with_Contextual_Instance_Queries_CVPR_2024_paper.pdf + 3D Semantic Scene Completion (SSC) has emerged as a nascent and pivotal undertaking in autonomous driving aiming to predict the voxel occupancy within volumetric scenes. However prevailing methodologies primarily focus on voxel-wise feature aggregation while neglecting instance semantics and scene context. In this paper we present a novel paradigm termed Symphonies (Scene-from-Insts) that delves into the integration of instance queries to orchestrate 2D-to-3D reconstruction and 3D scene modeling. Leveraging our proposed Serial Instance-Propagated Attentions Symphonies dynamically encodes instance-centric semantics facilitating intricate interactions between the image and volumetric domains. Simultaneously Symphonies fosters holistic scene comprehension by capturing context through the efficient fusion of instance queries alleviating geometric ambiguities such as occlusion and perspective errors through contextual scene reasoning. Experimental results demonstrate that Symphonies achieves state-of-the-art performance on the challenging SemanticKITTI and SSCBench-KITTI-360 benchmarks yielding remarkable mIoU scores of 15.04 and 18.58 respectively. These results showcase the promising advancements of our paradigm. The code for our method is available at https://github.com/hustvl/Symphonies. + + + + Loopy-SLAM: Dense Neural SLAM with Loop Closures + http://openaccess.thecvf.com//content/CVPR2024/papers/Liso_Loopy-SLAM_Dense_Neural_SLAM_with_Loop_Closures_CVPR_2024_paper.pdf + Neural RGBD SLAM techniques have shown promise in dense Simultaneous Localization And Mapping (SLAM) yet face challenges such as error accumulation during camera tracking resulting in distorted maps. In response we introduce Loopy-SLAM that globally optimizes poses and the dense 3D model. We use frame-to-model tracking using a data-driven point-based submap generation method and trigger loop closures online by performing global place recognition. Robust pose graph optimization is used to rigidly align the local submaps. As our representation is point based map corrections can be performed efficiently without the need to store the entire history of input frames used for mapping as typically required by methods employing a grid based mapping structure. Evaluation on the synthetic Replica and real-world TUM-RGBD and ScanNet datasets demonstrates competitive or superior performance in tracking mapping and rendering accuracy when compared to existing dense neural RGBD SLAM methods. Project page: notchla.github.io/Loopy-SLAM. 
+ + + + Content-Adaptive Non-Local Convolution for Remote Sensing Pansharpening + http://openaccess.thecvf.com//content/CVPR2024/papers/Duan_Content-Adaptive_Non-Local_Convolution_for_Remote_Sensing_Pansharpening_CVPR_2024_paper.pdf + Currently machine learning-based methods for remote sensing pansharpening have progressed rapidly. However existing pansharpening methods often do not fully exploit differentiating regional information in non-local spaces thereby limiting the effectiveness of the methods and resulting in redundant learning parameters. In this paper we introduce a so-called content-adaptive non-local convolution (CANConv) a novel method tailored for remote sensing image pansharpening. Specifically CANConv employs adaptive convolution ensuring spatial adaptability and incorporates non-local self-similarity through the similarity relationship partition (SRP) and the partition-wise adaptive convolution (PWAC) sub-modules. Furthermore we also propose a corresponding network architecture called CANNet which mainly utilizes the multi-scale self-similarity. Extensive experiments demonstrate the superior performance of CANConv compared with recent promising fusion methods. Besides we substantiate the method's effectiveness through visualization ablation experiments and comparison with existing methods on multiple test sets. The source code is publicly available at https://github.com/duanyll/CANConv. + + + + Learning Inclusion Matching for Animation Paint Bucket Colorization + http://openaccess.thecvf.com//content/CVPR2024/papers/Dai_Learning_Inclusion_Matching_for_Animation_Paint_Bucket_Colorization_CVPR_2024_paper.pdf + Colorizing line art is a pivotal task in the production of hand-drawn cel animation. This typically involves digital painters using a paint bucket tool to manually color each segment enclosed by lines based on RGB values predetermined by a color designer. This frame-by-frame process is both arduous and time-intensive. Current automated methods mainly focus on segment matching. This technique migrates colors from a reference to the target frame by aligning features within line-enclosed segments across frames. However issues like occlusion and wrinkles in animations often disrupt these direct correspondences leading to mismatches. In this work we introduce a new learning-based inclusion matching pipeline which directs the network to comprehend the inclusion relationships between segments rather than relying solely on direct visual correspondences. Our method features a two-stage pipeline that integrates a coarse color warping module with an inclusion matching module enabling more nuanced and accurate colorization. To facilitate the training of our network we also develop a unique dataset referred to as PaintBucket-Character. This dataset includes rendered line arts alongside their colorized counterparts featuring various 3D characters. Extensive experiments demonstrate the effectiveness and superiority of our method over existing techniques. + + + + SAM-6D: Segment Anything Model Meets Zero-Shot 6D Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_SAM-6D_Segment_Anything_Model_Meets_Zero-Shot_6D_Object_Pose_Estimation_CVPR_2024_paper.pdf + Zero-shot 6D object pose estimation involves the detection of novel objects with their 6D poses in cluttered scenes presenting significant challenges for model generalizability. 
Fortunately the recent Segment Anything Model (SAM) has showcased remarkable zero-shot transfer performance which provides a promising solution to tackle this task. Motivated by this we introduce SAM-6D a novel framework designed to realize the task through two steps including instance segmentation and pose estimation. Given the target objects SAM-6D employs two dedicated sub-networks namely Instance Segmentation Model (ISM) and Pose Estimation Model (PEM) to perform these steps on cluttered RGB-D images. ISM takes SAM as an advanced starting point to generate all possible object proposals and selectively preserves valid ones through meticulously crafted object matching scores in terms of semantics appearance and geometry. By treating pose estimation as a partial-to-partial point matching problem PEM performs a two-stage point matching process featuring a novel design of background tokens to construct dense 3D-3D correspondence ultimately yielding the pose estimates. Without bells and whistles SAM-6D outperforms the existing methods on the seven core datasets of the BOP Benchmark for both instance segmentation and pose estimation of novel objects. + + + + SPOT: Self-Training with Patch-Order Permutation for Object-Centric Learning with Autoregressive Transformers + http://openaccess.thecvf.com//content/CVPR2024/papers/Kakogeorgiou_SPOT_Self-Training_with_Patch-Order_Permutation_for_Object-Centric_Learning_with_Autoregressive_CVPR_2024_paper.pdf + Unsupervised object-centric learning aims to decompose scenes into interpretable object entities termed slots. Slot-based auto-encoders stand out as a prominent method for this task. Within them crucial aspects include guiding the encoder to generate object-specific slots and ensuring the decoder utilizes them during reconstruction. This work introduces two novel techniques (i) an attention-based self-training approach which distills superior slot-based attention masks from the decoder to the encoder enhancing object segmentation and (ii) an innovative patch-order permutation strategy for autoregressive transformers that strengthens the role of slot vectors in reconstruction. The effectiveness of these strategies is showcased experimentally. The combined approach significantly surpasses prior slot-based autoencoder methods in unsupervised object segmentation especially with complex real-world images. We provide the implementation code at https://github.com/gkakogeorgiou/spot . + + + + CroSel: Cross Selection of Confident Pseudo Labels for Partial-Label Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Tian_CroSel_Cross_Selection_of_Confident_Pseudo_Labels_for_Partial-Label_Learning_CVPR_2024_paper.pdf + Partial-label learning (PLL) is an important weakly supervised learning problem which allows each training example to have a candidate label set instead of a single ground-truth label. Identification-based methods have been widely explored to tackle label ambiguity issues in PLL which regard the true label as a latent variable to be identified. However identifying the true labels accurately and completely remains challenging causing noise in pseudo labels during model training. In this paper we propose a new method called CroSel which leverages historical predictions from the model to identify true labels for most training examples. First we introduce a cross selection strategy which enables two deep models to select true labels of partially labeled data for each other. 
Besides we propose a novel consistency regularization term called co-mix to avoid sample waste and tiny noise caused by false selection. In this way CroSel can pick out the true labels of most examples with high precision. Extensive experiments demonstrate the superiority of CroSel which consistently outperforms previous state-of-the-art methods on benchmark datasets. Additionally our method achieves over 90% accuracy and quantity for selecting true labels on CIFAR-type datasets under various settings. + + + + ModaVerse: Efficiently Transforming Modalities with LLMs + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_ModaVerse_Efficiently_Transforming_Modalities_with_LLMs_CVPR_2024_paper.pdf + Humans possess the capability to comprehend diverse modalities and seamlessly transfer information between them. In this work we introduce ModaVerse a Multi-modal Large Language Model (MLLM) capable of comprehending and transforming content across various modalities including images videos and audio. Predominant MLLM frameworks have largely relied on aligning latent spaces of textual and non-textual features. This alignment process which synchronizes a language model trained on textual data with encoders and decoders trained on multi-modal data often necessitates extensive training of several projection layers in multiple stages. Inspired by LLM-as-agent methodologies we propose a novel Input/Output (I/O) alignment mechanism that operates directly at the level of natural language. It aligns the LLM's output with the input of generative models avoiding the complexities associated with latent feature alignments and simplifying the multiple training stages of existing MLLMs into a single efficient process. By conducting experiments on several benchmarks we demonstrate that our approach attains comparable performance with the state of the art while achieving considerable efficiencies in data usage. + + + + Frequency-aware Event-based Video Deblurring for Real-World Motion Blur + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Frequency-aware_Event-based_Video_Deblurring_for_Real-World_Motion_Blur_CVPR_2024_paper.pdf + Video deblurring aims to restore sharp frames from blurred video clips. Despite notable progress in video deblurring works it is still a challenging problem because of the loss of motion information during the duration of the exposure time. Since event cameras can capture clear motion information asynchronously with high temporal resolution several works exploit the event camera for deblurring as they can provide abundant motion information. However despite these approaches there were few cases of actively exploiting the long-range temporal dependency of videos. To tackle these deficiencies we present an event-based video deblurring framework by actively utilizing temporal information from videos. To be specific we first introduce a frequency-based cross-modal feature enhancement module. Second we propose event-guided video alignment modules by considering the valuable characteristics of the event and videos. In addition we designed a hybrid camera system to collect the first real-world event-based video deblurring dataset. For the first time we build a dataset containing synchronized high-resolution real-world blurred videos and corresponding sharp videos and event streams. Experimental results validate that our frameworks significantly outperform the state-of-the-art frame-based and event-based deblurring works in the various datasets. 
The project pages are available at https://sites.google.com/view/fevd-cvpr2024. + + + + Unsegment Anything by Simulating Deformation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_Unsegment_Anything_by_Simulating_Deformation_CVPR_2024_paper.pdf + Foundation segmentation models while powerful pose a significant risk: they enable users to effortlessly extract any objects from any digital content with a single click potentially leading to copyright infringement or malicious misuse. To mitigate this risk we introduce a new task "Anything Unsegmentable" to grant any image "the right to be unsegmented". The ambitious pursuit of the task is to achieve highly transferable adversarial attack against all prompt-based segmentation models regardless of model parameterizations and prompts. We highlight the non-transferable and heterogeneous nature of prompt-specific adversarial noises. Our approach focuses on disrupting image encoder features to achieve prompt-agnostic attacks. Intriguingly targeted feature attacks exhibit better transferability compared to untargeted ones suggesting the optimal update direction aligns with the image manifold. Based on the observations we design a novel attack named Unsegment Anything by Simulating Deformation (UAD). Our attack optimizes a differentiable deformation function to create a target deformed image which alters structural information while preserving achievable feature distance by adversarial example. Extensive experiments verify the effectiveness of our approach compromising a variety of promptable segmentation models with different architectures and prompt interfaces. + + + + Transductive Zero-Shot and Few-Shot CLIP + http://openaccess.thecvf.com//content/CVPR2024/papers/Martin_Transductive_Zero-Shot_and_Few-Shot_CLIP_CVPR_2024_paper.pdf + Transductive inference has been widely investigated in few-shot image classification but completely overlooked in the recent fast growing literature on adapting vision-language models like CLIP. This paper addresses the transductive zero-shot and few-shot CLIP classification challenge in which inference is performed jointly across a mini-batch of unlabeled query samples rather than treating each instance independently. We initially construct informative vision-text probability features leading to a classification problem on the unit simplex set. Inspired by Expectation-Maximization (EM) our optimization-based classifying objective models the data probability distribution for each class using a Dirichlet law. The minimization problem is then tackled with a novel block Majorization-Minimization algorithm which simultaneously estimates the distribution parameters and class assignments. Extensive numerical experiments on 11 datasets underscore the benefits and efficacy of our batch inference approach. 
On zero-shot tasks with test batches of 75 samples our approach yields near 20% improvement in ImageNet accuracy over CLIP's zero-shot performance. Additionally we outperform state-of-the-art methods in the few-shot setting. Code is available at https://github.com/SegoleneMartin/transductive-CLIP. + + + + ID-Blau: Image Deblurring by Implicit Diffusion-based reBLurring AUgmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_ID-Blau_Image_Deblurring_by_Implicit_Diffusion-based_reBLurring_AUgmentation_CVPR_2024_paper.pdf + Image deblurring aims to remove undesired blurs from an image captured in a dynamic scene. Much research has been dedicated to improving deblurring performance through model architectural designs. However there is little work on data augmentation for image deblurring. Since continuous motion causes blurred artifacts during image exposure we aspire to develop a groundbreaking blur augmentation method to generate diverse blurred images by simulating motion trajectories in a continuous space. This paper proposes Implicit Diffusion-based reBLurring AUgmentation (ID-Blau) utilizing a sharp image paired with a controllable blur condition map to produce a corresponding blurred image. We parameterize the blur patterns of a blurred image with their orientations and magnitudes as a pixel-wise blur condition map to simulate motion trajectories and implicitly represent them in a continuous space. By sampling diverse blur conditions ID-Blau can generate various blurred images unseen in the training set. Experimental results demonstrate that ID-Blau can produce realistic blurred images for training and thus significantly improve performance for state-of-the-art deblurring models. The source code is available at https://github.com/plusgood-steven/ID-Blau. + + + + Decentralized Directed Collaboration for Personalized Federated Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Decentralized_Directed_Collaboration_for_Personalized_Federated_Learning_CVPR_2024_paper.pdf + Personalized Federated Learning (PFL) is proposed to find the greatest personalized models for each client. To avoid the central failure and communication bottleneck in the server-based FL we concentrate on the Decentralized Personalized Federated Learning (DPFL) that performs distributed model training in a Peer-to-Peer (P2P) manner. Most personalized works in DPFL are based on undirected and symmetric topologies however the data computation and communication resources heterogeneity result in large variances in the personalized models which lead the undirected aggregation to suboptimal personalized performance and unguaranteed convergence. To address these issues we propose a directed collaboration DPFL framework by incorporating stochastic gradient push and partial model personalization called Decentralized Federated Partial Gradient Push (DFedPGP). It personalizes the linear classifier in the modern deep model to customize the local solution and learns a consensus representation in a fully decentralized manner. Clients only share gradients with a subset of neighbors based on the directed and asymmetric topologies which guarantees flexible choices for resource efficiency and better convergence. Theoretically we show that the proposed DFedPGP achieves a superior convergence rate of O(1/√T) in the general non-convex setting and tighter connectivity among clients will speed up the convergence. 
The proposed method achieves state-of-the-art (SOTA) accuracy in both data and computation heterogeneity scenarios demonstrating the efficiency of the directed collaboration and partial gradient push. + + + + GES : Generalized Exponential Splatting for Efficient Radiance Field Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Hamdi_GES__Generalized_Exponential_Splatting_for_Efficient_Radiance_Field_Rendering_CVPR_2024_paper.pdf + Advancements in 3D Gaussian Splatting have significantly accelerated 3D reconstruction and generation. However it may require a large number of Gaussians which creates a substantial memory footprint. This paper introduces GES (Generalized Exponential Splatting) a novel representation that employs Generalized Exponential Function (GEF) to model 3D scenes requiring far fewer particles to represent a scene and thus significantly outperforming Gaussian Splatting methods in efficiency with a plug-and-play replacement ability for Gaussian-based utilities. GES is validated theoretically and empirically in both principled 1D setup and realistic 3D scenes. It is shown to represent signals with sharp edges more accurately which are typically challenging for Gaussians due to their inherent low-pass characteristics. Our empirical analysis demonstrates that GEF outperforms Gaussians in fitting natural-occurring signals (E.g. squares triangles parabolic signals) thereby reducing the need for extensive splitting operations that increase the memory footprint of Gaussian Splatting. With the aid of a frequency-modulated loss GES achieves competitive performance in novel-view synthesis benchmarks while requiring less than half the memory storage of Gaussian Splatting and increasing the rendering speed by up to 39%. The code is available on the project website https://abdullahamdi.com/ges . + + + + MMCert: Provable Defense against Adversarial Attacks to Multi-modal Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_MMCert_Provable_Defense_against_Adversarial_Attacks_to_Multi-modal_Models_CVPR_2024_paper.pdf + Different from a unimodal model whose input is from a single modality the input (called multi-modal input) of a multi-modal model is from multiple modalities such as image 3D points audio text etc. Similar to unimodal models many existing studies show that a multi-modal model is also vulnerable to adversarial perturbation where an attacker could add small perturbation to all modalities of a multi-modal input such that the multi-modal model makes incorrect predictions for it. Existing certified defenses are mostly designed for unimodal models which achieve sub-optimal certified robustness guarantees when extended to multi-modal models as shown in our experimental results. In our work we propose MMCert the first certified defense against adversarial attacks to a multi-modal model. We derive a lower bound on the performance of our MMCert under arbitrary adversarial attacks with bounded perturbations to both modalities (e.g. in the context of auto-driving we bound the number of changed pixels in both RGB image and depth image). We evaluate our MMCert using two benchmark datasets: one for the multi-modal road segmentation task and the other for the multi-modal emotion recognition task. Moreover we compare our MMCert with a state-of-the-art certified defense extended from unimodal models. Our experimental results show that our MMCert outperforms the baseline. 
+ + + + NAYER: Noisy Layer Data Generation for Efficient and Effective Data-free Knowledge Distillation + http://openaccess.thecvf.com//content/CVPR2024/papers/Tran_NAYER_Noisy_Layer_Data_Generation_for_Efficient_and_Effective_Data-free_CVPR_2024_paper.pdf + Data-Free Knowledge Distillation (DFKD) has made significant recent strides by transferring knowledge from a teacher neural network to a student neural network without accessing the original data. Nonetheless existing approaches encounter a significant challenge when attempting to generate samples from random noise inputs which inherently lack meaningful information. Consequently these models struggle to effectively map this noise to the ground-truth sample distribution resulting in prolonging training times and low-quality outputs. In this paper we propose a novel Noisy Layer Generation method (NAYER) which relocates the random source from the input to a noisy layer and utilizes the meaningful constant label-text embedding (LTE) as the input. LTE is generated by using the language model once and then it is stored in memory for all subsequent training processes. The significance of LTE lies in its ability to contain substantial meaningful inter-class information enabling the generation of high-quality samples with only a few training steps. Simultaneously the noisy layer plays a key role in addressing the issue of diversity in sample generation by preventing the model from overemphasizing the constrained label information. By reinitializing the noisy layer in each iteration we aim to facilitate the generation of diverse samples while still retaining the method's efficiency thanks to the ease of learning provided by LTE. Experiments carried out on multiple datasets demonstrate that our NAYER not only outperforms the state-of-the-art methods but also achieves speeds 5 to 15 times faster than previous approaches. The code is available at https://github.com/tmtuan1307/nayer. + + + + OmniVec2 - A Novel Transformer based Network for Large Scale Multimodal and Multitask Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Srivastava_OmniVec2_-_A_Novel_Transformer_based_Network_for_Large_Scale_CVPR_2024_paper.pdf + We present a novel multimodal multitask network and associated training algorithm. The method is capable of ingesting data from approximately 12 different modalities namely image video audio text depth point cloud time series tabular graph X-ray infrared IMU and hyperspectral. The proposed approach utilizes modality specialized tokenizers a shared transformer architecture and cross-attention mechanisms to project the data from different modalities into a unified embedding space. It addresses multimodal and multitask scenarios by incorporating modality-specific task heads for different tasks in respective modalities. We propose a novel pretraining strategy with iterative modality switching to initialize the network and a training algorithm which trades off fully joint training over all modalities with training on pairs of modalities at a time. We provide comprehensive evaluation across 25 datasets from 12 modalities and show state of the art performances demonstrating the effectiveness of the proposed architecture pretraining strategy and adapted multitask training. 
+ + + + Efficient Model Stealing Defense with Noise Transition Matrix + http://openaccess.thecvf.com//content/CVPR2024/papers/Wu_Efficient_Model_Stealing_Defense_with_Noise_Transition_Matrix_CVPR_2024_paper.pdf + With the escalating complexity and investment cost of training deep neural networks safeguarding them from unauthorized usage and intellectual property theft has become imperative. In particular the rampant misuse of prediction APIs to replicate models without access to the original data or architecture poses grave security threats. Diverse defense strategies have emerged to address these vulnerabilities yet these defenses either incur heavy inference overheads or assume idealized attack scenarios. To address these challenges we revisit the utilization of the noise transition matrix as an efficient perturbation technique which injects noise into predicted posteriors in a linear manner and integrates seamlessly into existing systems with minimal overhead for model stealing defense. Provably with such perturbed posteriors the attacker's cloning process degrades into learning from noisy data. Toward optimizing the noise transition matrix we propose a novel bi-level optimization training framework which preserves fidelity on the victim model while training against the surrogate model adversarially. Comprehensive experimental results demonstrate that our method effectively thwarts model stealing attacks and achieves minimal utility trade-offs outperforming existing state-of-the-art defenses. + + + + GS-SLAM: Dense Visual SLAM with 3D Gaussian Splatting + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_GS-SLAM_Dense_Visual_SLAM_with_3D_Gaussian_Splatting_CVPR_2024_paper.pdf + In this paper we introduce GS-SLAM which is the first to utilize a 3D Gaussian representation in the Simultaneous Localization and Mapping (SLAM) system. It facilitates a better balance between efficiency and accuracy. Compared to recent SLAM methods employing neural implicit representations our method utilizes a real-time differentiable splatting rendering pipeline that offers significant speedup to map optimization and RGB-D rendering. Specifically we propose an adaptive expansion strategy that adds new or deletes noisy 3D Gaussians in order to efficiently reconstruct newly observed scene geometry and improve the mapping of previously observed areas. This strategy is essential to extend the 3D Gaussian representation to reconstruct the whole scene rather than synthesize a static object as in existing methods. Moreover in the pose tracking process an effective coarse-to-fine technique is designed to select reliable 3D Gaussian representations to optimize camera pose resulting in runtime reduction and robust estimation. Our method achieves competitive performance compared with existing state-of-the-art real-time methods on the Replica and TUM-RGBD datasets. Project page: https://gs-slam.github.io/. + + + + Scaffold-GS: Structured 3D Gaussians for View-Adaptive Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_Scaffold-GS_Structured_3D_Gaussians_for_View-Adaptive_Rendering_CVPR_2024_paper.pdf + Neural rendering methods have significantly advanced photo-realistic 3D scene rendering in various academic and industrial applications. The recent 3D Gaussian Splatting method has achieved state-of-the-art rendering quality and speed combining the benefits of both primitive-based representations and volumetric representations.
However it often leads to heavily redundant Gaussians that try to fit every training view neglecting the underlying scene geometry. Consequently the resulting model becomes less robust to significant view changes texture-less areas and lighting effects. We introduce Scaffold-GS which uses anchor points to distribute local 3D Gaussians and predicts their attributes on-the-fly based on viewing direction and distance within the view frustum. Anchor growing and pruning strategies are developed based on the importance of neural Gaussians to reliably improve the scene coverage. We show that our method effectively reduces redundant Gaussians while delivering high-quality rendering. We also demonstrate an enhanced capability to accommodate scenes with varying levels-of-detail and view-dependent observations without sacrificing the rendering speed. Project page: https://city-super.github.io/scaffold-gs. + + + + Classes Are Not Equal: An Empirical Study on Image Recognition Fairness + http://openaccess.thecvf.com//content/CVPR2024/papers/Cui_Classes_Are_Not_Equal_An_Empirical_Study_on_Image_Recognition_CVPR_2024_paper.pdf + In this paper we present an empirical study on image recognition unfairness i.e. extreme class accuracy disparity on balanced data like ImageNet. We demonstrate that classes are not equal and unfairness is prevalent for image classification models across various datasets network architectures and model capacities. Moreover several intriguing properties of fairness are identified. First the unfairness lies in problematic representation rather than classifier bias distinguished from long-tailed recognition. Second with the proposed concept of Model Prediction Bias we investigate the origins of problematic representation during training optimization. Our findings reveal that models tend to exhibit greater prediction biases for classes that are more challenging to recognize. This means that more of the other classes will be confused with the harder classes. The False Positives (FPs) will then dominate the learning in optimization thus leading to their poor accuracy. Further we conclude that data augmentation and representation learning algorithms improve overall performance by promoting fairness to some degree in image classification. + + + + Multi-Scale 3D Gaussian Splatting for Anti-Aliased Rendering + http://openaccess.thecvf.com//content/CVPR2024/papers/Yan_Multi-Scale_3D_Gaussian_Splatting_for_Anti-Aliased_Rendering_CVPR_2024_paper.pdf + 3D Gaussians have recently emerged as a highly efficient representation for 3D reconstruction and rendering. Despite their high rendering quality and speed at high resolutions both deteriorate drastically when rendering at lower resolutions or from far-away camera positions. During low-resolution or far-away rendering the pixel size of the image can fall below the Nyquist frequency compared to the screen size of each splatted 3D Gaussian leading to aliasing effects. The rendering is also drastically slowed down by the sequential alpha blending of more splatted Gaussians per pixel. To address these issues we propose a multi-scale 3D Gaussian splatting algorithm which maintains Gaussians at different scales to represent the same scene. Higher-resolution images are rendered with more small Gaussians and lower-resolution images are rendered with fewer larger Gaussians.
With similar training time our algorithm can achieve a 13%-66% PSNR improvement and a 160%-2400% rendering speed improvement at 4x-128x scale rendering on the Mip-NeRF360 dataset compared to single-scale 3D Gaussian splatting. + + + + A Bayesian Approach to OOD Robustness in Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Kaushik_A_Bayesian_Approach_to_OOD_Robustness_in_Image_Classification_CVPR_2024_paper.pdf + An important and unsolved problem in computer vision is to ensure that the algorithms are robust to changes in image domains. We address this problem in the scenario where we have access to images from the target domains but no annotations. Motivated by the challenges of the OOD-CV benchmark where we encounter real world Out-of-Domain (OOD) nuisances and occlusion we introduce a novel Bayesian approach to OOD robustness for object classification. Our work extends Compositional Neural Networks (CompNets) which have been shown to be robust to occlusion but degrade badly when tested on OOD data. We exploit the fact that CompNets contain a generative head defined over feature vectors represented by von Mises-Fisher (vMF) kernels which correspond roughly to object parts and can be learned without supervision. We observe that some vMF kernels are similar between different domains while others are not. This enables us to learn a transitional dictionary of vMF kernels that are intermediate between the source and target domains and train the generative model on this dictionary using the annotations on the source domain followed by iterative refinement. This approach termed Unsupervised Generative Transition (UGT) performs very well in OOD scenarios even when occlusion is present. UGT is evaluated on different OOD benchmarks including the OOD-CV dataset several popular datasets (e.g. ImageNet-C) artificial image corruptions (including adding occluders) and synthetic-to-real domain transfer and does well in all scenarios outperforming SOTA alternatives. + + + + Unified-IO 2: Scaling Autoregressive Multimodal Models with Vision Language Audio and Action + http://openaccess.thecvf.com//content/CVPR2024/papers/Lu_Unified-IO_2_Scaling_Autoregressive_Multimodal_Models_with_Vision_Language_Audio_CVPR_2024_paper.pdf + We present Unified-IO 2 a multimodal and multi-skill unified model capable of following novel instructions. Unified-IO 2 can use text images audio and/or videos as input and can generate text image or audio outputs which is accomplished in a unified way by tokenizing these different inputs and outputs into a shared semantic space that can then be processed by a single encoder-decoder transformer model. Unified-IO 2 is trained from scratch on a custom-built multimodal pre-training corpus and then learns an expansive set of skills through fine-tuning on over 120 datasets including datasets for segmentation object detection image editing audio localization video tracking embodied AI and 3D detection. To facilitate instruction-following we add prompts and other data augmentations to these tasks to allow Unified-IO 2 to generalize these skills to new tasks zero-shot. Unified-IO 2 is the first model to be trained on such a diverse and wide-reaching set of skills and to unify three separate generation capabilities.
Unified-IO 2 achieves state-of-the-art performance on the multi-task GRIT benchmark and achieves strong results on 30 diverse datasets including SEED-Bench image and video understanding TIFA image generation VQA 2.0 ScienceQA VIMA robotic manipulation VGG-Sound and Kinetics-Sounds and can perform unseen tasks and generate free-form responses. We release our model and code to facilitate future work. + + + + Multi-Level Neural Scene Graphs for Dynamic Urban Environments + http://openaccess.thecvf.com//content/CVPR2024/papers/Fischer_Multi-Level_Neural_Scene_Graphs_for_Dynamic_Urban_Environments_CVPR_2024_paper.pdf + We estimate the radiance field of large-scale dynamic areas from multiple vehicle captures under varying environmental conditions. Previous works in this domain are either restricted to static environments do not scale to more than a single short video or struggle to separately represent dynamic object instances. To this end we present a novel decomposable radiance field approach for dynamic urban environments. We propose a multi-level neural scene graph representation that scales to thousands of images from dozens of sequences with hundreds of fast-moving objects. To enable efficient training and rendering of our representation we develop a fast composite ray sampling and rendering scheme. To test our approach in urban driving scenarios we introduce a new novel view synthesis benchmark. We show that our approach outperforms prior art by a significant margin on both established and our proposed benchmark while being faster in training and rendering. + + + + Bayes' Rays: Uncertainty Quantification for Neural Radiance Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Goli_Bayes_Rays_Uncertainty_Quantification_for_Neural_Radiance_Fields_CVPR_2024_paper.pdf + Neural Radiance Fields (NeRFs) have shown promise in applications like view synthesis and depth estimation but learning from multiview images faces inherent uncertainties. Current methods to quantify them are either heuristic or computationally demanding. We introduce BayesRays a post-hoc framework to evaluate uncertainty in any pretrained NeRF without modifying the training process. Our method establishes a volumetric uncertainty field using spatial perturbations and a Bayesian Laplace approximation. We derive our algorithm statistically and show its superior performance in key metrics and applications. Additional results available at: https://bayesrays.github.io/ + + + + Driving-Video Dehazing with Non-Aligned Regularization for Safety Assistance + http://openaccess.thecvf.com//content/CVPR2024/papers/Fan_Driving-Video_Dehazing_with_Non-Aligned_Regularization_for_Safety_Assistance_CVPR_2024_paper.pdf + Real driving-video dehazing poses a significant challenge due to the inherent difficulty in acquiring precisely aligned hazy/clear video pairs for effective model training especially in dynamic driving scenarios with unpredictable weather conditions. In this paper we propose a pioneering approach that addresses this challenge through a nonaligned regularization strategy. Our core concept involves identifying clear frames that closely match hazy frames serving as references to supervise a video dehazing network. Our approach comprises two key components: reference matching and video dehazing. Firstly we introduce a non-aligned reference frame matching module leveraging an adaptive sliding window to match high-quality reference frames from clear videos. 
Video dehazing incorporates a flow-guided cosine attention sampler and deformable cosine attention fusion modules to enhance spatial multi-frame alignment and fuse their improved information. To validate our approach we collect a GoProHazy dataset captured effortlessly with GoPro cameras in diverse rural and urban road environments. Extensive experiments demonstrate the superiority of the proposed method over current state-of-the-art methods in the challenging task of real driving-video dehazing. Project page. + + + + Is Vanilla MLP in Neural Radiance Field Enough for Few-shot View Synthesis? + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhu_Is_Vanilla_MLP_in_Neural_Radiance_Field_Enough_for_Few-shot_CVPR_2024_paper.pdf + Neural Radiance Field (NeRF) has achieved superior performance for novel view synthesis by modeling the scene with a Multi-Layer Perceptron (MLP) and a volume rendering procedure; however when fewer known views are given (i.e. few-shot view synthesis) the model is prone to overfitting the given views. To handle this issue previous efforts have been made towards leveraging learned priors or introducing additional regularizations. In contrast in this paper we for the first time provide an orthogonal method from the perspective of network structure. Given the observation that trivially reducing the number of model parameters alleviates the overfitting issue but at the cost of missing details we propose the multi-input MLP (mi-MLP) that incorporates the inputs (i.e. location and viewing direction) of the vanilla MLP into each layer to prevent the overfitting issue without harming detailed synthesis. To further reduce the artifacts we propose to model colors and volume density separately and present two regularization terms. Extensive experiments on multiple datasets demonstrate that: 1) although the proposed mi-MLP is easy to implement it is surprisingly effective as it boosts the PSNR of the baseline from 14.73 to 24.23. 2) the overall framework achieves state-of-the-art results on a wide range of benchmarks. We will release the code upon publication. + + + + CVT-xRF: Contrastive In-Voxel Transformer for 3D Consistent Radiance Fields from Sparse Inputs + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhong_CVT-xRF_Contrastive_In-Voxel_Transformer_for_3D_Consistent_Radiance_Fields_from_CVPR_2024_paper.pdf + Neural Radiance Fields (NeRF) have shown impressive capabilities for photorealistic novel view synthesis when trained on dense inputs. However when trained on sparse inputs NeRF typically encounters issues of incorrect density or color predictions mainly due to insufficient coverage of the scene causing partial and sparse supervision thus leading to significant performance degradation. While existing works mainly consider ray-level consistency to construct 2D learning regularization based on rendered color depth or semantics on image planes in this paper we propose a novel approach that models 3D spatial field consistency to improve NeRF's performance with sparse inputs. Specifically we first adopt a voxel-based ray sampling strategy to ensure that the sampled rays intersect with a certain voxel in 3D space. We then randomly sample additional points within the voxel and apply a Transformer to infer the properties of other points on each ray which are then incorporated into the volume rendering. By backpropagating through the rendering loss we enhance the consistency among neighboring points.
Additionally we propose to use a contrastive loss on the encoder output of the Transformer to further improve consistency within each voxel. Experiments demonstrate that our method yields significant improvement over different radiance fields in the sparse inputs setting and achieves comparable performance with current works. The project page for this paper is available at https://zhongyingji.github.io/CVT-xRF. + + + + Online Task-Free Continual Generative and Discriminative Learning via Dynamic Cluster Memory + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_Online_Task-Free_Continual_Generative_and_Discriminative_Learning_via_Dynamic_Cluster_CVPR_2024_paper.pdf + Online Task-Free Continual Learning (OTFCL) aims to learn novel concepts from streaming data without accessing task information. Most memory-based approaches used in OTFCL are not suitable for unsupervised learning because they require accessing supervised signals to implement their sample selection mechanisms. In this study we address this issue by proposing a novel memory management approach namely the Dynamic Cluster Memory (DCM) which builds new memory clusters to capture distribution shifts over time without accessing any supervised signals. DCM introduces a novel memory expansion mechanism based on the knowledge discrepancy criterion which evaluates the novelty of the incoming data as the signal for the memory expansion ensuring a compact memory capacity. We also propose a new sample selection approach that automatically stores incoming data samples with similar semantic information in the same memory cluster while also facilitating the knowledge diversity among memory clusters. Furthermore a novel memory pruning approach is proposed to automatically remove overlapping memory clusters through a graph relation evaluation ensuring a fixed memory capacity while maintaining the diversity among the samples stored in the memory. The proposed DCM is model-free plug-and-play and can be used in both supervised and unsupervised learning without modifications. Empirical results on OTFCL experiments show that the proposed DCM outperforms the state-of-the-art while requiring fewer data samples to be stored. The source code is available at https://github.com/dtuzi123/DCM. + + + + DSGG: Dense Relation Transformer for an End-to-end Scene Graph Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Hayder_DSGG_Dense_Relation_Transformer_for_an_End-to-end_Scene_Graph_Generation_CVPR_2024_paper.pdf + Scene graph generation aims to capture detailed spatial and semantic relationships between objects in an image which is challenging due to incomplete labeling long-tailed relationship categories and relational semantic overlap. Existing Transformer-based methods either employ distinct queries for objects and predicates or utilize holistic queries for relation triplets and hence often suffer from limited capacity in learning low-frequency relationships. In this paper we present a new Transformer-based method called DSGG that views scene graph detection as a direct graph prediction problem based on a unique set of graph-aware queries. In particular each graph-aware query encodes a compact representation of both the node and all of its relations in the graph acquired through the utilization of a relaxed sub-graph matching during the training process. 
Moreover to address the problem of relational semantic overlap we utilize a strategy for relation distillation aiming to efficiently learn multiple instances of semantic relationships. Extensive experiments on the VG and the PSG datasets show that our model achieves state-of-the-art results showing a significant improvement of 3.5% and 6.7% in mR@50 and mR@100 for the scene-graph generation task and achieves an even more substantial improvement of 8.5% and 10.3% in mR@50 and mR@100 for the panoptic scene graph generation task. Code is available at https://github.com/zeeshanhayder/DSGG. + + + + Object Dynamics Modeling with Hierarchical Point Cloud-based Representations + http://openaccess.thecvf.com//content/CVPR2024/papers/Kim_Object_Dynamics_Modeling_with_Hierarchical_Point_Cloud-based_Representations_CVPR_2024_paper.pdf + Modeling object dynamics with a neural network is an important problem with numerous applications. Most recent work has been based on graph neural networks. However physics happens in 3D space where geometric information potentially plays an important role in modeling physical phenomena. In this work we propose a novel U-net architecture based on continuous point convolution which naturally embeds information from 3D coordinates and allows for multi-scale feature representations with established downsampling and upsampling procedures. Bottleneck layers in the downsampled point clouds lead to better long-range interaction modeling. Besides the flexibility of point convolutions allows our approach to generalize to sparsely sampled points from mesh vertices and dynamically generate features on important interaction points on mesh faces. Experimental results demonstrate that our approach significantly improves the state-of-the-art especially in scenarios that require accurate gravity or collision reasoning. + + + + SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery + http://openaccess.thecvf.com//content/CVPR2024/papers/Guo_SkySense_A_Multi-Modal_Remote_Sensing_Foundation_Model_Towards_Universal_Interpretation_CVPR_2024_paper.pdf + Prior studies on Remote Sensing Foundation Model (RSFM) reveal immense potential towards a generic model for Earth Observation. Nevertheless these works primarily focus on a single modality without temporal and geo-context modeling hampering their capabilities for diverse tasks. In this study we present SkySense a generic billion-scale model pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations by the geo-context clue we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI's multi-modal spatiotemporal features. To our best knowledge SkySense is the largest Multi-Modal RSFM to date whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks from single- to multi-modal static to temporal and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. 
Specifically it outperforms the latest models such as GFM SatLas and Scale-MAE by a large margin i.e. 2.76% 3.67% and 3.61% on average respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications. + + + + CFAT: Unleashing Triangular Windows for Image Super-resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Ray_CFAT_Unleashing_Triangular_Windows_for_Image_Super-resolution_CVPR_2024_paper.pdf + Transformer-based models have revolutionized the field of image super-resolution (SR) by harnessing their inherent ability to capture complex contextual features. The overlapping rectangular shifted window technique used in transformer architectures nowadays is a common practice in super-resolution models to improve the quality and robustness of image upscaling. However it suffers from distortion at the boundaries and has limited unique shifting modes. To overcome these weaknesses we propose a non-overlapping triangular window technique that synchronously works with the rectangular one to mitigate boundary-level distortion and allows the model to access more unique shifting modes. In this paper we propose a Composite Fusion Attention Transformer (CFAT) that incorporates triangular-rectangular window-based local attention with a channel-based global attention technique in image super-resolution. As a result CFAT enables attention mechanisms to be activated on more image pixels and captures long-range multi-scale features to improve SR performance. The extensive experimental results and ablation study demonstrate the effectiveness of CFAT in the SR domain. Our proposed model shows a significant 0.7 dB performance improvement over other state-of-the-art SR architectures. + + + + Rolling Shutter Correction with Intermediate Distortion Flow Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Cao_Rolling_Shutter_Correction_with_Intermediate_Distortion_Flow_Estimation_CVPR_2024_paper.pdf + This paper proposes to correct rolling shutter (RS) distorted images by estimating the distortion flow from the global shutter (GS) to RS directly. Existing methods usually perform correction using the undistortion flow from the RS to GS. They initially predict the flow from consecutive RS frames subsequently rescaling it as the displacement fields from the RS frame to the underlying GS image using time-dependent scaling factors. Following this RS-aware forward warping is employed to convert the RS image into its GS counterpart. Nevertheless this strategy is prone to two shortcomings. First the undistortion flow estimation is rendered inaccurate by merely linearly scaling the flow due to the complex non-linear nature of the motion. Second RS-aware forward warping often results in unavoidable artifacts. To address these limitations we introduce a new framework that directly estimates the distortion flow and rectifies the RS image with the backward warping operation. More specifically we first propose a global correlation-based flow attention mechanism to estimate the initial distortion flow and GS feature jointly which are then refined by the following coarse-to-fine decoder layers. Additionally a multi-distortion flow prediction strategy is integrated to further mitigate the issue of inaccurate flow estimation. Experimental results validate the effectiveness of the proposed method which outperforms state-of-the-art approaches on various benchmarks while maintaining high efficiency. The project is available at https://github.com/ljzycmd/DFRSC.
+ + + + Attack To Defend: Exploiting Adversarial Attacks for Detecting Poisoned Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Fares_Attack_To_Defend_Exploiting_Adversarial_Attacks_for_Detecting_Poisoned_Models_CVPR_2024_paper.pdf + Poisoning (trojan/backdoor) attacks enable an adversary to train and deploy a corrupted machine learning (ML) model which typically works well and achieves good accuracy on clean input samples but behaves maliciously on poisoned samples containing specific trigger patterns. Using such poisoned ML models as the foundation to build real-world systems can compromise application safety. Hence there is a critical need for algorithms that detect whether a given target model has been poisoned. This work proposes a novel approach for detecting poisoned models called Attack To Defend (A2D) which is based on the observation that poisoned models are more sensitive to adversarial perturbations compared to benign models. We propose a metric called sensitivity to adversarial perturbations (SAP) to measure the sensitivity of a ML model to adversarial attacks at a specific perturbation bound. We then generate strong adversarial attacks against an unrelated reference model and estimate the SAP value of the target model by transferring the generated attacks. The target model is deemed to be a trojan if its SAP value exceeds a decision threshold. The A2D framework requires only black-box access to the target model and a small clean set while being computationally efficient. The A2D approach has been evaluated on four standard image datasets and its effectiveness under various types of poisoning attacks has been demonstrated + + + + Troika: Multi-Path Cross-Modal Traction for Compositional Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Troika_Multi-Path_Cross-Modal_Traction_for_Compositional_Zero-Shot_Learning_CVPR_2024_paper.pdf + Recent compositional zero-shot learning (CZSL) methods adapt pre-trained vision-language models (VLMs) by constructing trainable prompts only for composed state-object pairs. Relying on learning the joint representation of seen compositions these methods ignore the explicit modeling of the state and object thus limiting the exploitation of pre-trained knowledge and generalization to unseen compositions. With a particular focus on the universality of the solution in this work we propose a novel paradigm for CZSL models that establishes three identification branches (i.e. Multi-Path) to jointly model the state object and composition. The presented Troika is an outstanding implementation that aligns the branch-specific prompt representations with decomposed visual features. To calibrate the bias between semantically similar multi-modal representations we further devise a Cross-Modal Traction module into Troika that shifts the prompt representation towards the current visual content. We conduct extensive experiments on three popular benchmarks where our method significantly outperforms existing methods in both closed-world and open-world settings. The code will be available at https://github.com/bighuang624/Troika. + + + + Enhancing Multimodal Cooperation via Sample-level Modality Valuation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wei_Enhancing_Multimodal_Cooperation_via_Sample-level_Modality_Valuation_CVPR_2024_paper.pdf + One primary topic of multimodal learning is to jointly incorporate heterogeneous information from different modalities. 
However most models often suffer from unsatisfactory multimodal cooperation which cannot jointly utilize all modalities well. Some methods are proposed to identify and enhance the worse learnt modality but they are often hard to provide the fine-grained observation of multimodal cooperation at sample-level with theoretical support. Hence it is essential to reasonably observe and improve the fine-grained cooperation between modalities especially when facing realistic scenarios where the modality discrepancy could vary across different samples. To this end we introduce a sample-level modality valuation metric to evaluate the contribution of each modality for each sample. Via modality valuation we observe that modality discrepancy indeed could be different at sample-level beyond the global contribution discrepancy at dataset-level. We further analyze this issue and improve cooperation between modalities at sample-level by enhancing the discriminative ability of low-contributing modalities in a targeted manner. Overall our methods reasonably observe the fine-grained uni-modal contribution and achieve considerable improvement. The source code and dataset are available at https://github.com/GeWu-Lab/Valuate-and-Enhance-Multimodal-Cooperation. + + + + SatSynth: Augmenting Image-Mask Pairs through Diffusion Models for Aerial Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Toker_SatSynth_Augmenting_Image-Mask_Pairs_through_Diffusion_Models_for_Aerial_Semantic_CVPR_2024_paper.pdf + In recent years semantic segmentation has become a pivotal tool in processing and interpreting satellite imagery. Yet a prevalent limitation of supervised learning techniques remains the need for extensive manual annotations by experts. In this work we explore the potential of generative image diffusion to address the scarcity of annotated data in earth observation tasks. The main idea is to learn the joint data manifold of images and labels leveraging recent advancements in denoising diffusion probabilistic models. To the best of our knowledge we are the first to generate both images and corresponding masks for satellite segmentation. We find that the obtained pairs not only display high quality in fine-scale features but also ensure a wide sampling diversity. Both aspects are crucial for earth observation data where semantic classes can vary severely in scale and occurrence frequency. We employ the novel data instances for downstream segmentation as a form of data augmentation. In our experiments we provide comparisons to prior works based on discriminative diffusion models or GANs. We demonstrate that integrating generated samples yields significant quantitative improvements for satellite semantic segmentation -- both compared to baselines and when training only on the original data. + + + + XScale-NVS: Cross-Scale Novel View Synthesis with Hash Featurized Manifold + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_XScale-NVS_Cross-Scale_Novel_View_Synthesis_with_Hash_Featurized_Manifold_CVPR_2024_paper.pdf + We propose XScale-NVS for high-fidelity cross-scale novel view synthesis of real-world large-scale scenes. Existing representations based on explicit surface suffer from discretization resolution or UV distortion while implicit volumetric representations lack scalability for large scenes due to the dispersed weight distribution and surface ambiguity. 
In light of the above challenges we introduce the hash featurized manifold a novel hash-based featurization coupled with a deferred neural rendering framework. This approach fully unlocks the expressivity of the representation by explicitly concentrating the hash entries on the 2D manifold thus effectively representing highly detailed contents independent of the discretization resolution. We also introduce a novel dataset namely GigaNVS to benchmark cross-scale high-resolution novel view synthesis of real-world large-scale scenes. Our method significantly outperforms competing baselines on various real-world scenes yielding an average LPIPS that is ~40% lower than prior state-of-the-art on the challenging GigaNVS benchmark. Please see our project page at: xscalenvs.github.io. + + + + Ink Dot-Oriented Differentiable Optimization for Neural Image Halftoning + http://openaccess.thecvf.com//content/CVPR2024/papers/Jiang_Ink_Dot-Oriented_Differentiable_Optimization_for_Neural_Image_Halftoning_CVPR_2024_paper.pdf + Halftoning is a time-honored printing technique that simulates continuous tones using ink dots (halftone dots). The resurgence of deep learning has catalyzed the emergence of innovative technologies in the printing industry fostering the advancement of data-driven halftoning methods. Nevertheless current deep learning-based approaches produce halftones through image-to-image black box transformations lacking direct control over the movement of individual halftone dots. In this paper we propose an innovative halftoning method termed "neural dot-controllable halftoning". This method allows dot-level image dithering by providing direct control over the motion of each ink dot. We conceptualize halftoning as the process of sprinkling dots on a canvas. Initially a specific quantity of dots are randomly dispersed on the canvas and subsequently adjusted based on the surrounding grayscale and gradient. To establish differentiable transformations between discrete ink dot positions and halftone matrices we devise a lightweight dot encoding network to spread dense gradients to sparse dots. Dot control offers several advantages to our approach including the capability to regulate the quantity of halftone dots and to enhance specific areas with artifacts in the generated halftones by adjusting the placement of the dots. Our proposed method outperforms previous approaches in extensive quantitative and qualitative experiments. + + + + Scalable 3D Registration via Truncated Entry-wise Absolute Residuals + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_Scalable_3D_Registration_via_Truncated_Entry-wise_Absolute_Residuals_CVPR_2024_paper.pdf + Given an input set of 3D point pairs the goal of outlier-robust 3D registration is to compute some rotation and translation that align as many point pairs as possible. This is an important problem in computer vision for which many highly accurate approaches have been recently proposed. Despite their impressive performance these approaches lack scalability often overflowing the 16GB of memory of a standard laptop to handle roughly 30000 point pairs. In this paper we propose a 3D registration approach that can process more than ten million (10^7) point pairs with over 99% random outliers. Moreover our method is efficient entails low memory costs and maintains high accuracy at the same time. We call our method TEAR as it involves minimizing an outlier-robust loss that computes Truncated Entry-wise Absolute Residuals.
To minimize this loss we decompose the original 6-dimensional problem into two subproblems of dimensions 3 and 2 respectively solved in succession to global optimality via a customized branch-and-bound method. While branch-and-bound is often slow and unscalable this does not apply to TEAR as we propose novel bounding functions that are tight and computationally efficient. Experiments on various datasets are conducted to validate the scalability and efficiency of our method. + + + + ExtraNeRF: Visibility-Aware View Extrapolation of Neural Radiance Fields with Diffusion Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Shih_ExtraNeRF_Visibility-Aware_View_Extrapolation_of_Neural_Radiance_Fields_with_Diffusion_CVPR_2024_paper.pdf + We propose ExtraNeRF a novel method for extrapolating the range of views handled by a Neural Radiance Field (NeRF). Our main idea is to leverage NeRFs to model scene-specific fine-grained details while capitalizing on diffusion models to extrapolate beyond our observed data. A key ingredient is to track visibility to determine what portions of the scene have not been observed and focus on reconstructing those regions consistently with diffusion models. Our primary contributions include a visibility-aware diffusion-based inpainting module that is fine-tuned on the input imagery yielding an initial NeRF with moderate quality (often blurry) inpainted regions followed by a second diffusion model trained on the input imagery to consistently enhance notably sharpen the inpainted imagery from the first pass. We demonstrate high-quality results extrapolating beyond a small number of (typically six or fewer) input views effectively outpainting the NeRF as well as inpainting newly disoccluded regions inside the original viewing volume. We compare with related work both quantitatively and qualitatively and show significant gains over prior art. + + + + Equivariant Plug-and-Play Image Reconstruction + http://openaccess.thecvf.com//content/CVPR2024/papers/Terris_Equivariant_Plug-and-Play_Image_Reconstruction_CVPR_2024_paper.pdf + Plug-and-play algorithms constitute a popular framework for solving inverse imaging problems that rely on the implicit definition of an image prior via a denoiser. These algorithms can leverage powerful pre-trained denoisers to solve a wide range of imaging tasks circumventing the necessity to train models on a per-task basis. Unfortunately plug-and-play methods often show unstable behaviors hampering their promise of versatility and leading to suboptimal quality of reconstructed images. In this work we show that enforcing equivariance to certain groups of transformations (rotations reflections and/or translations) on the denoiser strongly improves the stability of the algorithm as well as its reconstruction quality. We provide a theoretical analysis that illustrates the role of equivariance on better performance and stability. We present a simple algorithm that enforces equivariance on any existing denoiser by simply applying a random transformation to the input of the denoiser and the inverse transformation to the output at each iteration of the algorithm. Experiments on multiple imaging modalities and denoising networks show that the equivariant plug-and-play algorithm improves both the reconstruction performance and the stability compared to their non-equivariant counterparts. 
+ + + + LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_LP_A_Surprisingly_Strong_Linear_Probe_for_Few-Shot_CLIP_CVPR_2024_paper.pdf + In a recent strongly emergent literature on few-shot CLIP adaptation Linear Probe (LP) has been often reported as a weak baseline. This has motivated intensive research building convoluted prompt learning or feature adaptation strategies. In this work we propose and examine from convex-optimization perspectives a generalization of the standard LP baseline in which the linear classifier weights are learnable functions of the text embedding with class-wise multipliers blending image and text knowledge. As our objective function depends on two types of variables i.e. the class visual prototypes and the learnable blending parameters we propose a computationally efficient block coordinate Majorize-Minimize (MM) descent algorithm. In our full-batch MM optimizer which we coin LP++ step sizes are implicit unlike standard gradient descent practices where learning rates are intensively searched over validation sets. By examining the mathematical properties of our loss (e.g. Lipschitz gradient continuity) we build majorizing functions yielding data-driven learning rates and derive approximations of the loss's minima which provide data-informed initialization of the variables. Our image-language objective function along with these non-trivial optimization insights and ingredients yields surprisingly highly competitive few-shot CLIP performances. Furthermore LP++ operates in black-box relaxes intensive validation searches for the optimization hyper-parameters and runs orders-of-magnitudes faster than state-of-the-art few-shot CLIP adaptation methods. Our code is available at: https://github.com/FereshteShakeri/FewShot-CLIP-Strong-Baseline.git. + + + + FlowVQTalker: High-Quality Emotional Talking Face Generation through Normalizing Flow and Quantization + http://openaccess.thecvf.com//content/CVPR2024/papers/Tan_FlowVQTalker_High-Quality_Emotional_Talking_Face_Generation_through_Normalizing_Flow_and_CVPR_2024_paper.pdf + Generating emotional talking faces is a practical yet challenging endeavor. To create a lifelike avatar we draw upon two critical insights from a human perspective: 1) The connection between audio and the non-deterministic facial dynamics encompassing expressions blinks poses should exhibit synchronous and one-to-many mapping. 2) Vibrant expressions are often accompanied by emotion-aware high-definition (HD) textures and finely detailed teeth. However both aspects are frequently overlooked by existing methods. To this end this paper proposes using normalizing Flow and Vector-Quantization modeling to produce emotional talking faces that satisfy both insights concurrently (FlowVQTalker). Specifically we develop a flowbased coefficient generator that encodes the dynamics of facial emotion into a multi-emotion-class latent space represented as a mixture distribution. The generation process commences with random sampling from the modeled distribution guided by the accompanying audio enabling both lip-synchronization and the uncertain nonverbal facial cues generation. Furthermore our designed vector-quantization image generator treats the creation of expressive facial images as a code query task utilizing a learned codebook to provide rich high-quality textures that enhance the emotional perception of the results. 
Extensive experiments are conducted to showcase the effectiveness of our approach. + + + + Learning from Observer Gaze: Zero-Shot Attention Prediction Oriented by Human-Object Interaction Recognition + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Learning_from_Observer_Gaze_Zero-Shot_Attention_Prediction_Oriented_by_Human-Object_CVPR_2024_paper.pdf + Most existing attention prediction research focuses on salient instances like humans and objects. However the more complex interaction-oriented attention arising from the comprehension of interactions between instances by human observers remains largely unexplored. This is equally crucial for advancing human-machine interaction and human-centered artificial intelligence. To bridge this gap we first collect a novel gaze fixation dataset named IG comprising 530000 fixation points across 740 diverse interaction categories capturing visual attention during human observers' cognitive processes of interactions. Subsequently we introduce the zero-shot interaction-oriented attention prediction task (ZeroIA) which challenges models to predict visual cues for interactions not encountered during training. Thirdly we present the Interactive Attention model (IA) designed to emulate human observers' cognitive processes to tackle the ZeroIA problem. Extensive experiments demonstrate that the proposed IA outperforms other state-of-the-art approaches in both ZeroIA and fully supervised settings. Lastly we endeavor to apply interaction-oriented attention to the interaction recognition task itself. Further experimental results demonstrate the promising potential to enhance the performance and interpretability of existing state-of-the-art HOI models by incorporating real human attention data from IG and attention labels generated by IA. + + + + Lift3D: Zero-Shot Lifting of Any 2D Vision Model to 3D + http://openaccess.thecvf.com//content/CVPR2024/papers/T_Lift3D_Zero-Shot_Lifting_of_Any_2D_Vision_Model_to_3D_CVPR_2024_paper.pdf + In recent years there has been an explosion of 2D vision models for numerous tasks such as semantic segmentation style transfer or scene editing enabled by large-scale 2D image datasets. At the same time there has been renewed interest in 3D scene representations such as neural radiance fields from multi-view images. However the availability of 3D or multiview data is still substantially limited compared to 2D image datasets making extending 2D vision models to 3D data highly desirable but also very challenging. Indeed extending a single 2D vision operator like scene editing to 3D typically requires a highly creative method specialized to that task and often requires per-scene optimization. In this paper we ask the question of whether any 2D vision model can be lifted to make 3D consistent predictions. We answer this question in the affirmative; our new Lift3D method trains to predict unseen views on feature spaces generated by a few visual models (i.e. DINO and CLIP) but then generalizes to novel vision operators and tasks such as style transfer super-resolution open vocabulary segmentation and image colorization; for some of these tasks there is no comparable previous 3D method. In many cases we even outperform state-of-the-art methods specialized for the task in question. Moreover Lift3D is a zero-shot method in the sense that it requires no task-specific training nor scene-specific optimization. 
+ + + + Multiway Point Cloud Mosaicking with Diffusion and Global Optimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Jin_Multiway_Point_Cloud_Mosaicking_with_Diffusion_and_Global_Optimization_CVPR_2024_paper.pdf + We introduce a novel framework for multiway point cloud mosaicking (named Wednesday) designed to co-align sets of partially overlapping point clouds -- typically obtained from 3D scanners or moving RGB-D cameras -- into a unified coordinate system. At the core of our approach is ODIN a learned pairwise registration algorithm that iteratively identifies overlaps and refines attention scores employing a diffusion-based process for denoising pairwise correlation matrices to enhance matching accuracy. Further steps include constructing a pose graph from all point clouds performing rotation averaging a novel robust algorithm for re-estimating translations optimally in terms of consensus maximization and translation optimization. Finally the point cloud rotations and positions are optimized jointly by a diffusion-based approach. Tested on four diverse large-scale datasets our method achieves state-of-the-art pairwise and multiway registration results by a large margin on all benchmarks. Our code and models are available at https://github.com/jinsz/Multiway-Point-Cloud-Mosaicking-with-Diffusion-and-Global-Optimization. + + + + PBWR: Parametric-Building-Wireframe Reconstruction from Aerial LiDAR Point Clouds + http://openaccess.thecvf.com//content/CVPR2024/papers/Huang_PBWR_Parametric-Building-Wireframe_Reconstruction_from_Aerial_LiDAR_Point_Clouds_CVPR_2024_paper.pdf + In this paper we present an end-to-end 3D-building-wireframe reconstruction method to regress edges directly from aerial light-detection-and-ranging (LiDAR) point clouds. Our method named parametric-building-wireframe reconstruction (PBWR) takes aerial LiDAR point clouds and initial edge entities as input and fully uses the self-attention mechanism of transformers to regress edge parameters without any intermediate steps such as corner prediction. We propose an edge non-maximum suppression (E-NMS) module based on edge similarity to remove redundant edges. Additionally a dedicated edge loss function is utilized to guide the PBWR in regressing edges parameters when the simple use of the edge distance loss is not suitable. In our experiments our proposed method demonstrated state-of-the-art results on the Building3D dataset achieving an improvement of approximately 36% in Entry-level dataset edge accuracy and around a 42% improvement in the Tallinn dataset. + + + + Spectrum AUC Difference (SAUCD): Human-aligned 3D Shape Evaluation + http://openaccess.thecvf.com//content/CVPR2024/papers/Luan_Spectrum_AUC_Difference_SAUCD_Human-aligned_3D_Shape_Evaluation_CVPR_2024_paper.pdf + Existing 3D mesh shape evaluation metrics mainly focus on the overall shape but are usually less sensitive to local details. This makes them inconsistent with human evaluation as human perception cares about both overall and detailed shape. In this paper we propose an analytic metric named Spectrum Area Under the Curve Difference (SAUCD) that demonstrates better consistency with human evaluation. To compare the difference between two shapes we first transform the 3D mesh to the spectrum domain using the discrete Laplace-Beltrami operator and Fourier transform. 
Then we calculate the Area Under the Curve (AUC) difference between the two spectrums so that each frequency band that captures either the overall or detailed shape is equitably considered. Taking human sensitivity across frequency bands into account we further extend our metric by learning suitable weights for each frequency band which better aligns with human perception. To measure the performance of SAUCD we build a 3D mesh evaluation dataset called Shape Grading along with manual annotations from more than 800 subjects. By measuring the correlation between our metric and human evaluation we demonstrate that SAUCD is well aligned with human evaluation and outperforms previous 3D mesh metrics. + + + + Multi-Session SLAM with Differentiable Wide-Baseline Pose Optimization + http://openaccess.thecvf.com//content/CVPR2024/papers/Lipson_Multi-Session_SLAM_with_Differentiable_Wide-Baseline_Pose_Optimization_CVPR_2024_paper.pdf + We introduce a new system for Multi-Session SLAM which tracks camera motion across multiple disjoint videos under a single global reference. Our approach couples the prediction of optical flow with solver layers to estimate camera pose. The backbone is trained end-to-end using a novel differentiable solver for wide-baseline two-view pose. The full system can connect disjoint sequences perform visual odometry and global optimization. Compared to existing approaches our design is accurate and robust to catastrophic failures. + + + + Improving Out-of-Distribution Generalization in Graphs via Hierarchical Semantic Environments + http://openaccess.thecvf.com//content/CVPR2024/papers/Piao_Improving_Out-of-Distribution_Generalization_in_Graphs_via_Hierarchical_Semantic_Environments_CVPR_2024_paper.pdf + Out-of-distribution (OOD) generalization in the graph domain is challenging due to complex distribution shifts and a lack of environmental contexts. Recent methods attempt to enhance graph OOD generalization by generating flat environments. However such flat environments come with inherent limitations to capture more complex data distributions. Considering the DrugOOD dataset which contains diverse training environments (e.g. scaffold size etc.) flat contexts cannot sufficiently address its high heterogeneity. Thus a new challenge is posed to generate more semantically enriched environments to enhance graph invariant learning for handling distribution shifts. In this paper we propose a novel approach to generate hierarchical semantic environments for each graph. Firstly given an input graph we explicitly extract variant subgraphs from the input graph to generate proxy predictions on local environments. Then stochastic attention mechanisms are employed to re-extract the subgraphs for regenerating global environments in a hierarchical manner. In addition we introduce a new learning objective that guides our model to learn the diversity of environments within the same hierarchy while maintaining consistency across different hierarchies. This approach enables our model to consider the relationships between environments and facilitates robust graph invariant learning. Extensive experiments on real-world graph data have demonstrated the effectiveness of our framework. Particularly in the challenging dataset DrugOOD our method achieves up to 1.29% and 2.83% improvement over the best baselines on IC50 and EC50 prediction tasks respectively. 
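Editor's note on the SAUCD entry above: the metric compares two mesh spectra band by band and optionally weights each band by learned human sensitivity. The sketch below assumes the spectra have already been obtained via the discrete Laplace-Beltrami operator and a Fourier transform (not reproduced here); the normalization, the "area under the per-band difference curve" reading, and the `band_weights` argument are simplifying assumptions, not the paper's exact formulation.

```python
import numpy as np

def _area(y):
    # Trapezoidal area under a 1-D curve sampled at unit spacing.
    y = np.asarray(y, dtype=float)
    return 0.5 * np.sum(y[1:] + y[:-1])

def saucd_like_distance(spec_a, spec_b, band_weights=None):
    """Toy spectrum comparison in the spirit of SAUCD.

    spec_a, spec_b : 1-D per-frequency-band spectrum magnitudes, assumed to come
                     from a Laplace-Beltrami + Fourier analysis of each mesh.
    band_weights   : optional per-band weights, standing in for the learned
                     human-sensitivity weights described in the abstract.
    """
    a = np.asarray(spec_a, dtype=float)
    b = np.asarray(spec_b, dtype=float)

    # Normalise each spectrum so the comparison reflects shape, not overall scale.
    a = a / (_area(a) + 1e-12)
    b = b / (_area(b) + 1e-12)

    diff = np.abs(a - b)
    if band_weights is not None:
        diff = diff * np.asarray(band_weights, dtype=float)

    # Area under the per-band difference curve.
    return _area(diff)

rng = np.random.default_rng(0)
s1 = np.abs(rng.normal(size=64))
s2 = s1 + 0.1 * np.abs(rng.normal(size=64))   # spectrum of a slightly perturbed mesh
print(saucd_like_distance(s1, s2))
```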
+ + + + CN-RMA: Combined Network with Ray Marching Aggregation for 3D Indoor Object Detection from Multi-view Images + http://openaccess.thecvf.com//content/CVPR2024/papers/Shen_CN-RMA_Combined_Network_with_Ray_Marching_Aggregation_for_3D_Indoor_CVPR_2024_paper.pdf + This paper introduces CN-RMA a novel approach for 3D indoor object detection from multi-view images. We observe the key challenge as the ambiguity of image and 3D correspondence without explicit geometry to provide occlusion information. To address this issue CN-RMA leverages the synergy of 3D reconstruction networks and 3D object detection networks where the reconstruction network provides a rough Truncated Signed Distance Function (TSDF) and guides image features to vote to 3D space correctly in an end-to-end manner. Specifically we associate weights to sampled points of each ray through ray marching representing the contribution of a pixel in an image to corresponding 3D locations. Such weights are determined by the predicted signed distances so that image features vote only to regions near the reconstructed surface. Our method achieves state-of-the-art performance in 3D object detection from multi-view images as measured by mAP@0.25 and mAP@0.5 on the ScanNet and ARKitScenes datasets. The code and models are released at https://github.com/SerCharles/CN-RMA. + + + + Hide in Thicket: Generating Imperceptible and Rational Adversarial Perturbations on 3D Point Clouds + http://openaccess.thecvf.com//content/CVPR2024/papers/Lou_Hide_in_Thicket_Generating_Imperceptible_and_Rational_Adversarial_Perturbations_on_CVPR_2024_paper.pdf + Adversarial attack methods based on point manipulation for 3D point cloud classification have revealed the fragility of 3D models yet the adversarial examples they produce are easily perceived or defended against. The trade-off between the imperceptibility and adversarial strength leads most point attack methods to inevitably introduce easily detectable outlier points upon a successful attack. Another promising strategy shape-based attack can effectively eliminate outliers but existing methods often suffer significant reductions in imperceptibility due to irrational deformations. We find that concealing deformation perturbations in areas insensitive to human eyes can achieve a better trade-off between imperceptibility and adversarial strength specifically in parts of the object surface that are complex and exhibit drastic curvature changes. Therefore we propose a novel shape-based adversarial attack method HiT-ADV which initially conducts a two-stage search for attack regions based on saliency and imperceptibility scores and then adds deformation perturbations in each attack region using Gaussian kernel functions. Additionally HiT-ADV is extendable to physical attack. We propose that by employing benign resampling and benign rigid transformations we can further enhance physical adversarial strength with little sacrifice to imperceptibility. Extensive experiments have validated the superiority of our method in terms of adversarial and imperceptible properties in both digital and physical spaces. + + + + SG-BEV: Satellite-Guided BEV Fusion for Cross-View Semantic Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Ye_SG-BEV_Satellite-Guided_BEV_Fusion_for_Cross-View_Semantic_Segmentation_CVPR_2024_paper.pdf + This paper aims at achieving fine-grained building attribute segmentation in a cross-view scenario i.e. using satellite and street-view image pairs. 
The main challenge lies in overcoming the significant perspective differences between street views and satellite views. In this work we introduce SG-BEV a novel approach for satellite-guided BEV fusion for cross-view semantic segmentation. To overcome the limitations of existing cross-view projection methods in capturing the complete building facade features we innovatively incorporate Bird's Eye View (BEV) method to establish a spatially explicit mapping of street-view features. Moreover we fully leverage the advantages of multiple perspectives by introducing a novel satellite-guided reprojection module optimizing the uneven feature distribution issues associated with traditional BEV methods. Our method demonstrates significant improvements on four cross-view datasets collected from multiple cities including New York San Francisco and Boston. On average across these datasets our method achieves an increase in mIOU by 10.13% and 5.21% compared with the state-of-the-art satellite-based and cross-view methods. The code and datasets of this work will be released at https://github.com/sysu-liweijia-lab/SG-BEV. + + + + LEAP-VO: Long-term Effective Any Point Tracking for Visual Odometry + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_LEAP-VO_Long-term_Effective_Any_Point_Tracking_for_Visual_Odometry_CVPR_2024_paper.pdf + Visual odometry estimates the motion of a moving camera based on visual input. Existing methods mostly focusing on two-view point tracking often ignore the rich temporal context in the image sequence thereby overlooking the global motion patterns and providing no assessment of the full trajectory reliability. These shortcomings hinder performance in scenarios with occlusion dynamic objects and low-texture areas. To address these challenges we present the Long-term Effective Any Point Tracking (LEAP) module. LEAP innovatively combines visual inter-track and temporal cues with mindfully selected anchors for dynamic track estimation. Moreover LEAP's temporal probabilistic formulation integrates distribution updates into a learnable iterative refinement module to reason about point-wise uncertainty. Based on these traits we develop LEAP-VO a robust visual odometry system adept at handling occlusions and dynamic scenes. Our mindful integration showcases a novel practice by employing long-term point tracking as the front-end. Extensive experiments demonstrate that the proposed pipeline significantly outperforms existing baselines across various visual odometry benchmarks. + + + + Unveiling the Unknown: Unleashing the Power of Unknown to Known in Open-Set Source-Free Domain Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wan_Unveiling_the_Unknown_Unleashing_the_Power_of_Unknown_to_Known_CVPR_2024_paper.pdf + Open-Set Source-Free Domain Adaptation aims to transfer knowledge in realistic scenarios where the target domain has additional unknown classes compared to the limited-access source domain. Due to the absence of information on unknown classes existing methods mainly transfer knowledge of known classes while roughly grouping unknown classes as one attenuating the knowledge transfer and generalization. In contrast this paper advocates that exploring unknown classes can better identify known ones and proposes a domain adaptation model to transfer knowledge on known and unknown classes jointly. 
Specifically given a source pre-trained model we first introduce an unknown diffuser that can determine whether classes in space need to be split and merged through similarity measures to estimate and generate a wider class space distribution including known and unknown classes. Based on such a wider space distribution we enhance the reliability of known class knowledge in the source pre-trained model through contrastive constraint. Finally various supervision information including reliable known class knowledge and clustered pseudo-labels optimize the model for impressive knowledge transfer and generalization. Extensive experiments show that our network can achieve superior exploration and knowledge generalization on unknown classes while maintaining excellent known class transfer. The code is available at https://github.com/xdwfl/UPUK. + + + + Instance-Adaptive and Geometric-Aware Keypoint Learning for Category-Level 6D Object Pose Estimation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_Instance-Adaptive_and_Geometric-Aware_Keypoint_Learning_for_Category-Level_6D_Object_Pose_CVPR_2024_paper.pdf + Category-level 6D object pose estimation aims to estimate the rotation translation and size of unseen instances within specific categories. In this area dense correspondence-based methods have achieved leading performance. However they do not explicitly consider the local and global geometric information of different instances resulting in poor generalization ability to unseen instances with significant shape variations. To deal with this problem we propose a novel Instance-Adaptive and Geometric-Aware Keypoint Learning method for category-level 6D object pose estimation (AG-Pose) which includes two key designs: (1) The first design is an Instance-Adaptive Keypoint Detection module which can adaptively detect a set of sparse keypoints for various instances to represent their geometric structures. (2) The second design is a Geometric-Aware Feature Aggregation module which can efficiently integrate the local and global geometric information into keypoint features. These two modules can work together to establish robust keypoint-level correspondences for unseen instances thus enhancing the generalization ability of the model. Experimental results on CAMERA25 and REAL275 datasets show that the proposed AG-Pose outperforms state-of-the-art methods by a large margin without category-specific shape priors. + + + + Universal Semi-Supervised Domain Adaptation by Mitigating Common-Class Bias + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Universal_Semi-Supervised_Domain_Adaptation_by_Mitigating_Common-Class_Bias_CVPR_2024_paper.pdf + Domain adaptation is a critical task in machine learning that aims to improve model performance on a target domain by leveraging knowledge from a related source domain. In this work we introduce Universal Semi-Supervised Domain Adaptation (UniSSDA) a practical yet challenging setting where the target domain is partially labeled and the source and target label spaces may not strictly match. UniSSDA is at the intersection of Universal Domain Adaptation (UniDA) and Semi-Supervised Domain Adaptation (SSDA): the UniDA setting does not allow for fine-grained categorization of target private classes not represented in the source domain while SSDA focuses on the restricted closed-set setting where source and target label spaces match exactly. &#13;
Existing UniDA and SSDA methods are susceptible to common-class bias in UniSSDA settings where models overfit to data distributions of classes common to both domains at the expense of private classes. We propose a new prior-guided pseudo-label refinement strategy to reduce the reinforcement of common-class bias due to pseudo-labeling a common label propagation strategy in domain adaptation. We demonstrate the effectiveness of the proposed strategy on benchmark datasets Office-Home DomainNet and VisDA. The proposed strategy attains the best performance across UniSSDA adaptation settings and establishes a new baseline for UniSSDA. + + + + Feature 3DGS: Supercharging 3D Gaussian Splatting to Enable Distilled Feature Fields + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhou_Feature_3DGS_Supercharging_3D_Gaussian_Splatting_to_Enable_Distilled_Feature_CVPR_2024_paper.pdf + 3D scene representations have gained immense popularity in recent years. Methods that use Neural Radiance fields are versatile for traditional tasks such as novel view synthesis. In recent times some work has emerged that aims to extend the functionality of NeRF beyond view synthesis for semantically aware tasks such as editing and segmentation using 3D feature field distillation from 2D foundation models. However these methods have two major limitations: (a) they are limited by the rendering speed of NeRF pipelines and (b) implicitly represented feature fields suffer from continuity artifacts reducing feature quality. Recently 3D Gaussian Splatting has shown state-of-the-art performance on real-time radiance field rendering. In this work we go one step further: in addition to radiance field rendering we enable 3D Gaussian splatting on arbitrary-dimension semantic features via 2D foundation model distillation. This translation is not straightforward: naively incorporating feature fields in the 3DGS framework encounters significant challenges notably the disparities in spatial resolution and channel consistency between RGB images and feature maps. We propose architectural and training changes to efficiently avert this problem. Our proposed method is general and our experiments showcase novel view semantic segmentation language-guided editing and segment anything through learning feature fields from state-of-the-art 2D foundation models such as SAM and CLIP-LSeg. Across experiments our distillation method is able to provide comparable or better results while being significantly faster to both train and render. Additionally to the best of our knowledge we are the first method to enable point and bounding-box prompting for radiance field manipulation by leveraging the SAM model. Project website at: https://feature-3dgs.github.io/ + + + + 4K4D: Real-Time 4D View Synthesis at 4K Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Xu_4K4D_Real-Time_4D_View_Synthesis_at_4K_Resolution_CVPR_2024_paper.pdf + This paper targets high-fidelity and real-time view synthesis of dynamic 3D scenes at 4K resolution. Recent methods on dynamic view synthesis have shown impressive rendering quality. However their speed is still limited when rendering high-resolution images. To overcome this problem we propose 4K4D a 4D point cloud representation that supports hardware rasterization and network pre-computation to enable unprecedented rendering speed with a high rendering quality. Our representation is built on a 4D feature grid so that the points are naturally regularized and can be robustly optimized. 
In addition we design a novel hybrid appearance model that significantly boosts the rendering quality while preserving efficiency. Moreover we develop a differentiable depth peeling algorithm to effectively learn the proposed model from RGB videos. Experiments show that our representation can be rendered at over 400 FPS on the DNA-Rendering dataset at 1080p resolution and 80 FPS on the ENeRF-Outdoor dataset at 4K resolution using an RTX 4090 GPU which is 30x faster than previous methods and achieves the state-of-the-art rendering quality. Our project page is available at https://zju3dv.github.io/4k4d. + + + + View-decoupled Transformer for Person Re-identification under Aerial-ground Camera Network + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_View-decoupled_Transformer_for_Person_Re-identification_under_Aerial-ground_Camera_Network_CVPR_2024_paper.pdf + Existing person re-identification methods have achieved remarkable advances in appearance-based identity association across homogeneous cameras such as ground-ground matching. However as a more practical scenario aerial-ground person re-identification (AGPReID) among heterogeneous cameras has received minimal attention. To alleviate the disruption of discriminative identity representation by dramatic view discrepancy as the most significant challenge in AGPReID the view-decoupled transformer (VDT) is proposed as a simple yet effective framework. Two major components are designed in VDT to decouple view-related and view-unrelated features namely hierarchical subtractive separation and orthogonal loss where the former separates these two features inside the VDT and the latter constrains these two to be independent. In addition we contribute a large-scale AGPReID dataset called CARGO consisting of five/eight aerial/ground cameras 5000 identities and 108563 images. Experiments on two datasets show that VDT is a feasible and effective solution for AGPReID surpassing the previous method on mAP/Rank1 by up to 5.0%/2.7% on CARGO and 3.7%/5.2% on AG-ReID keeping the same magnitude of computational complexity. Our project is available at https://github.com/LinlyAC/VDT-AGPReID + + + + OED: Towards One-stage End-to-End Dynamic Scene Graph Generation + http://openaccess.thecvf.com//content/CVPR2024/papers/Wang_OED_Towards_One-stage_End-to-End_Dynamic_Scene_Graph_Generation_CVPR_2024_paper.pdf + Dynamic Scene Graph Generation (DSGG) focuses on identifying visual relationships within the spatial-temporal domain of videos. Conventional approaches often employ multi-stage pipelines which typically consist of object detection temporal association and multi-relation classification. However these methods exhibit inherent limitations due to the separation of multiple stages and independent optimization of these sub-problems may yield sub-optimal solutions. To remedy these limitations we propose a one-stage end-to-end framework termed OED which streamlines the DSGG pipeline. This framework reformulates the task as a set prediction problem and leverages pair-wise features to represent each subject-object pair within the scene graph. Moreover another challenge of DSGG is capturing temporal dependencies we introduce a Progressively Refined Module (PRM) for aggregating temporal context without the constraints of additional trackers or handcrafted trajectories enabling end-to-end optimization of the network. Extensive experiments conducted on the Action Genome benchmark demonstrate the effectiveness of our design. 
The code and models are available at https://github.com/guanw-pku/OED. + + + + DeIL: Direct-and-Inverse CLIP for Open-World Few-Shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Shao_DeIL_Direct-and-Inverse_CLIP_for_Open-World_Few-Shot_Learning_CVPR_2024_paper.pdf + Open-World Few-Shot Learning (OFSL) is a critical field of research concentrating on the precise identification of target samples in environments with scarce data and unreliable labels thus possessing substantial practical significance. Recently the evolution of foundation models like CLIP has revealed their strong capacity for representation even in settings with restricted resources and data. This development has led to a significant shift in focus transitioning from the traditional method of "building models from scratch" to a strategy centered on "efficiently utilizing the capabilities of foundation models to extract relevant prior knowledge tailored for OFSL and apply it judiciously". Amidst this backdrop we unveil the Direct-and-Inverse CLIP (DeIL) an innovative method leveraging our proposed "Direct-and-Inverse" concept to activate CLIP-based methods for addressing OFSL. This concept transforms conventional single-step classification into a nuanced two-stage process: initially filtering out less probable categories followed by accurately determining the specific category of samples. DeIL comprises two key components: a pre-trainer (frozen) for data denoising and an adapter (tunable) for achieving precise final classification. In experiments DeIL achieves SOTA performance on 11 datasets. + + + + Large Language Models are Good Prompt Learners for Low-Shot Image Classification + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Large_Language_Models_are_Good_Prompt_Learners_for_Low-Shot_Image_CVPR_2024_paper.pdf + Low-shot image classification where training images are limited or inaccessible has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability e.g. CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs) with their vast encyclopedic knowledge emerge as the complement. Thus in this paper we discuss the integration of LLMs to enhance pre-trained VL models specifically on low-shot classification. However the domain gap between language and vision blocks the direct application of LLMs. Thus we propose LLaMP Large Language Models as Prompt learners that produces adaptive prompts for the CLIP text encoder establishing it as the connecting bridge. Experiments show that compared with other state-of-the-art prompt learning methods LLaMP yields better performance on both zero-shot generalization and few-shot image classification over a spectrum of 11 datasets. Code will be made available at: https://github.com/zhaohengz/LLaMP. + + + + VILA: On Pre-training for Visual Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Lin_VILA_On_Pre-training_for_Visual_Language_Models_CVPR_2024_paper.pdf + Visual language models (VLMs) rapidly progressed with the recent success of large language models. There have been growing efforts on visual instruction tuning to extend the LLM with visual inputs but lacks an in-depth study of the visual language pre-training process where the model learns to perform joint modeling on both modalities. 
In this work we examine the design options for VLM pre-training by augmenting LLM towards VLM through step-by-step controllable comparisons. We introduce three main findings: (1) freezing LLMs during pre-training can achieve decent zero-shot performance but lack in-context learning capability which requires unfreezing the LLM; (2) interleaved pre-training data is beneficial whereas image-text pairs alone are not optimal; (3) re-blending text-only instruction data to image-text data during instruction fine-tuning not only remedies the degradation of text-only tasks but also boosts VLM task accuracy. With an enhanced pre-training recipe we build VILA a Visual Language model family that consistently outperforms the state-of-the-art models e.g. LLaVA-1.5 across main benchmarks without bells and whistles. Multi-modal pre-training also helps unveil appealing properties of VILA including multi-image reasoning enhanced in-context learning and better world knowledge. VILA is also deployable on Jetson Orin for on-device VLM. + + + + Text-Guided Variational Image Generation for Industrial Anomaly Detection and Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lee_Text-Guided_Variational_Image_Generation_for_Industrial_Anomaly_Detection_and_Segmentation_CVPR_2024_paper.pdf + We propose a text-guided variational image generation method to address the challenge of getting clean data for anomaly detection in industrial manufacturing. Our method utilizes text information about the target object learned from extensive text library documents to generate non-defective data images resembling the input image. The proposed framework ensures that the generated non-defective images align with anticipated distributions derived from textual and image-based knowledge ensuring stability and generality. Experimental results demonstrate the effectiveness of our approach surpassing previous methods even with limited non-defective data. Our approach is validated through generalization tests across four baseline models and three distinct datasets. We present an additional analysis to enhance the effectiveness of anomaly detection models by utilizing the generated images. + + + + Self-Adaptive Reality-Guided Diffusion for Artifact-Free Super-Resolution + http://openaccess.thecvf.com//content/CVPR2024/papers/Zheng_Self-Adaptive_Reality-Guided_Diffusion_for_Artifact-Free_Super-Resolution_CVPR_2024_paper.pdf + Artifact-free super-resolution (SR) aims to translate low-resolution images into their high-resolution counterparts with a strict integrity of the original content eliminating any distortions or synthetic details. While traditional diffusion-based SR techniques have demonstrated remarkable abilities to enhance image detail they are prone to artifact introduction during iterative procedures. Such artifacts ranging from trivial noise to unauthentic textures deviate from the true structure of the source image thus challenging the integrity of the super-resolution process. In this work we propose Self-Adaptive Reality-Guided Diffusion (SARGD) a training-free method that delves into the latent space to effectively identify and mitigate the propagation of artifacts. Our SARGD begins by using an artifact detector to identify implausible pixels creating a binary mask that highlights artifacts. Following this the Reality Guidance Refinement (RGR) process refines artifacts by integrating this mask with realistic latent representations improving alignment with the original image. &#13;
Nonetheless initial realistic-latent representations from lower-quality images result in over-smoothing in the final output. To address this we introduce a Self-Adaptive Guidance (SAG) mechanism. It dynamically computes a reality score enhancing the sharpness of the realistic latent. These alternating mechanisms collectively achieve artifact-free super-resolution. Extensive experiments demonstrate the superiority of our method delivering detailed artifact-free high-resolution images while reducing sampling steps by 2X. We release our code at https://github.com/ProAirVerse/Self-Adaptive-Guidance-Diffusion.git. + + + + Multimodal Representation Learning by Alternating Unimodal Adaptation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Multimodal_Representation_Learning_by_Alternating_Unimodal_Adaptation_CVPR_2024_paper.pdf + Multimodal learning which integrates data from diverse sensory modes plays a pivotal role in artificial intelligence. However existing multimodal learning methods often struggle with challenges where some modalities appear more dominant than others during multimodal learning resulting in suboptimal performance. To address this challenge we propose MLA (Multimodal Learning with Alternating Unimodal Adaptation). MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process thereby minimizing interference between modalities. Simultaneously it captures cross-modal interactions through a shared head which undergoes continuous optimization across different modalities. This optimization process is controlled by a gradient modification mechanism to prevent the shared head from losing previously acquired information. During the inference phase MLA utilizes a test-time uncertainty-based model fusion mechanism to integrate multimodal information. Extensive experiments are conducted on five diverse datasets encompassing scenarios with complete modalities and scenarios with missing modalities. These experiments demonstrate the superiority of MLA over competing prior approaches. Our code is available at https://github.com/Cecile-hi/Multimodal-Learning-with-Alternating-Unimodal-Adaptation. + + + + Pre-training Vision Models with Mandelbulb Variations + http://openaccess.thecvf.com//content/CVPR2024/papers/Chiche_Pre-training_Vision_Models_with_Mandelbulb_Variations_CVPR_2024_paper.pdf + The use of models that have been pre-trained on natural image datasets like ImageNet may face some limitations. First this use may be restricted due to copyright and license on the training images and privacy laws. Second these datasets and models may incorporate societal and ethical biases. Formula-driven supervised learning (FDSL) enables model pre-training to circumvent these issues. This consists of generating a synthetic image dataset based on mathematical formulae and pre-training the model on it. In this work we propose novel FDSL datasets based on Mandelbulb Variations. These datasets contain RGB images that are projections of colored objects deriving from the 3D Mandelbulb fractal. Pre-training ResNet-50 on one of our proposed datasets MandelbulbVAR-1k enables an average top-1 accuracy over target classification datasets that is at least 1% higher than pre-training on existing FDSL datasets. With regard to anomaly detection on MVTec AD pre-training the WideResNet-50 backbone on MandelbulbVAR-1k enables PatchCore to achieve 97.2% average image-level AUROC. 
This is only 1.9% lower than pre-training on ImageNet-1k (99.1%) and 4.5% higher than pre-training on the second-best performing FDSL dataset i.e. VisualAtom-1k (92.7%). Regarding Vision Transformer (ViT) pre-training another dataset that we propose and coin MandelbulbVAR-Hybrid-21k enables ViT-Base to achieve 82.2% top-1 accuracy on ImageNet-1k which is 0.4% higher than pre-training on ImageNet-21k (81.8%) and only 0.1% lower than pre-training on VisualAtom-1k (82.3%). + + + + S2MVTC: a Simple yet Efficient Scalable Multi-View Tensor Clustering + http://openaccess.thecvf.com//content/CVPR2024/papers/Long_S2MVTC_a_Simple_yet_Efficient_Scalable_Multi-View_Tensor_Clustering_CVPR_2024_paper.pdf + Anchor-based large-scale multi-view clustering has attracted considerable attention for its effectiveness in handling massive datasets. However current methods mainly seek the consensus embedding feature for clustering by exploring global correlations between anchor graphs or projection matrices.In this paper we propose a simple yet efficient scalable multi-view tensor clustering (S2MVTC) approach where our focus is on learning correlations of embedding features within and across views. Specifically we first construct the embedding feature tensor by stacking the embedding features of different views into a tensor and rotating it. Additionally we build a novel tensor low-frequency approximation (TLFA) operator which incorporates graph similarity into embedding feature learning efficiently achieving smooth representation of embedding features within different views. Furthermore consensus constraints are applied to embedding features to ensure inter-view semantic consistency. Experimental results on six large-scale multi-view datasets demonstrate that S2MVTC significantly outperforms state-of-the-art algorithms in terms of clustering performance and CPU execution time especially when handling massive data. The code of S2MVTC is publicly available at https://github.com/longzhen520/S2MVTC. + + + + S2MAE: A Spatial-Spectral Pretraining Foundation Model for Spectral Remote Sensing Data + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_S2MAE_A_Spatial-Spectral_Pretraining_Foundation_Model_for_Spectral_Remote_Sensing_CVPR_2024_paper.pdf + In the expansive domain of computer vision a myriad of pre-trained models are at our disposal. However most of these models are designed for natural RGB images and prove inadequate for spectral remote sensing (RS) images. Spectral RS images have two main traits: (1) multiple bands capturing diverse feature information (2) spatial alignment and consistent spectral sequencing within the spatial-spectral dimension. In this paper we introduce Spatial-SpectralMAE (S2MAE) a specialized pre-trained architecture for spectral RS imagery. S2MAE employs a 3D transformer for masked autoencoder modeling integrating learnable spectral-spatial embeddings with a 90% masking ratio. The model efficiently captures local spectral consistency and spatial invariance using compact cube tokens demonstrating versatility to diverse input characteristics. This adaptability facilitates progressive pretraining on extensive spectral datasets. The effectiveness of S2MAE is validated through continuous pretraining on two sizable datasets totaling over a million training images. The pre-trained model is subsequently applied to three distinct downstream tasks with in-depth ablation studies conducted to emphasize its efficacy. 
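Editor's note on the S2MAE entry above: the abstract describes masked-autoencoder pre-training over spatial-spectral cube tokens with a 90% masking ratio. The sketch below covers only the tokenisation-and-masking step; the 3D transformer encoder/decoder is omitted, and the `token_size`, the even-tiling assertion, and the shuffling details are illustrative assumptions rather than S2MAE's exact recipe.

```python
import numpy as np

def cube_tokenize_and_mask(cube, token_size=(4, 4, 4), mask_ratio=0.9, seed=0):
    """Split a spectral image cube (H, W, B) into non-overlapping spatial-spectral
    cube tokens and randomly mask a fraction of them, MAE-style.

    Returns the visible tokens plus the indices of visible and masked tokens.
    """
    th, tw, tb = token_size
    H, W, B = cube.shape
    assert H % th == 0 and W % tw == 0 and B % tb == 0, "cube must tile evenly"

    # (H/th, W/tw, B/tb, th, tw, tb) -> flatten to (num_tokens, token_dim).
    tokens = (cube.reshape(H // th, th, W // tw, tw, B // tb, tb)
                  .transpose(0, 2, 4, 1, 3, 5)
                  .reshape(-1, th * tw * tb))

    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n)
    visible_idx, masked_idx = perm[:n_keep], perm[n_keep:]
    return tokens[visible_idx], visible_idx, masked_idx

cube = np.random.rand(32, 32, 16)          # toy spectral image: 32x32 pixels, 16 bands
visible, vis_idx, msk_idx = cube_tokenize_and_mask(cube)
print(visible.shape, len(vis_idx), len(msk_idx))   # only ~10% of tokens stay visible
```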
+ + + + DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Saadati_DIMAT_Decentralized_Iterative_Merging-And-Training_for_Deep_Learning_Models_CVPR_2024_paper.pdf + Recent advances in decentralized deep learning algorithms have demonstrated cutting-edge performance on various tasks with large pre-trained models. However a pivotal prerequisite for achieving this level of competitiveness is the significant communication and computation overheads when updating these models which prohibits the applications of them to real-world scenarios. To address this issue drawing inspiration from advanced model merging techniques without requiring additional training we introduce the Decentralized Iterative Merging-And-Training (DIMAT) paradigm--a novel decentralized deep learning framework. Within DIMAT each agent is trained on their local data and periodically merged with their neighboring agents using advanced model merging techniques like activation matching until convergence is achieved. DIMAT provably converges with the best available rate for nonconvex functions with various first-order methods while yielding tighter error bounds compared to the popular existing approaches. We conduct a comprehensive empirical analysis to validate DIMAT's superiority over baselines across diverse computer vision tasks sourced from multiple datasets. Empirical results validate our theoretical claims by showing that DIMAT attains faster and higher initial gain in accuracy with independent and identically distributed (IID) and non-IID data incurring lower communication overhead. This DIMAT paradigm presents a new opportunity for the future decentralized learning enhancing its adaptability to real-world with sparse and light-weight communication and computation. + + + + MMA: Multi-Modal Adapter for Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Yang_MMA_Multi-Modal_Adapter_for_Vision-Language_Models_CVPR_2024_paper.pdf + Pre-trained Vision-Language Models (VLMs) have served as excellent foundation models for transfer learning in diverse downstream tasks. However tuning VLMs for few-shot generalization tasks faces a discrimination -- generalization dilemma i.e. general knowledge should be preserved and task-specific knowledge should be fine-tuned. How to precisely identify these two types of representations remains a challenge. In this paper we propose a Multi-Modal Adapter (MMA) for VLMs to improve the alignment between representations from text and vision branches. MMA aggregates features from different branches into a shared feature space so that gradients can be communicated across branches. To determine how to incorporate MMA we systematically analyze the discriminability and generalizability of features across diverse datasets in both the vision and language branches and find that (1) higher layers contain discriminable dataset-specific knowledge while lower layers contain more generalizable knowledge and (2) language features are more discriminable than visual features and there are large semantic gaps between the features of the two modalities especially in the lower layers. Therefore we only incorporate MMA to a few higher layers of transformers to achieve an optimal balance between discrimination and generalization. We evaluate the effectiveness of our approach on three tasks: generalization to novel classes novel target datasets and domain generalization. 
Compared to many state-of-the-art methods our MMA achieves leading performance in all evaluations. Code is at https://github.com/ZjjConan/Multi-Modal-Adapter + + + + BioCLIP: A Vision Foundation Model for the Tree of Life + http://openaccess.thecvf.com//content/CVPR2024/papers/Stevens_BioCLIP_A_Vision_Foundation_Model_for_the_Tree_of_Life_CVPR_2024_paper.pdf + Images of the natural world collected by a variety of cameras from drones to individual phones are increasingly abundant sources of biological information. There is an explosion of computational methods and tools particularly computer vision for extracting biologically relevant information from images for science and conservation. Yet most of these are bespoke approaches designed for a specific task and are not easily adaptable or extendable to new questions contexts and datasets. A vision model for general organismal biology questions on images is of timely need. To approach this we curate and release TreeOfLife-10M the largest and most diverse ML-ready dataset of biology images. We then develop BioCLIP a foundation model for the tree of life leveraging the unique properties of biology captured by TreeOfLife-10M namely the abundance and variety of images of plants animals and fungi together with the availability of rich structured biological knowledge. We rigorously benchmark our approach on diverse fine-grained biology classification tasks and find that BioCLIP consistently and substantially outperforms existing baselines (by 17% to 20% absolute). Intrinsic evaluation reveals that BioCLIP has learned a hierarchical representation conforming to the tree of life shedding light on its strong generalizability. All data code and models will be publicly released upon acceptance. + + + + From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_From_Pixels_to_Graphs_Open-Vocabulary_Scene_Graph_Generation_with_Vision-Language_CVPR_2024_paper.pdf + Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advancements existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLM) by incorporating an image-to-graph generation paradigm. Specifically we generate scene graph sequences via image-to-text generation with VLM and then construct scene graphs from these sequences. By doing so we harness the strong capabilities of VLM for open-vocabulary SGG and seamlessly integrate explicit relational modeling for enhancing the VL tasks. Experimental results demonstrate that our design not only achieves superior performance with an open vocabulary but also enhances downstream vision-language task performance through explicit relation modeling knowledge. + + + + Deep Imbalanced Regression via Hierarchical Classification Adjustment + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiong_Deep_Imbalanced_Regression_via_Hierarchical_Classification_Adjustment_CVPR_2024_paper.pdf + Regression tasks in computer vision such as age estimation or counting are often formulated into classification by quantizing the target space into classes. 
Yet real-world data is often imbalanced -- the majority of training samples lie in a head range of target values while a minority of samples span a usually larger tail range. By selecting the class quantization one can adjust imbalanced regression targets into balanced classification outputs though there are trade-offs in balancing classification accuracy and quantization error. To improve regression performance over the entire range of data we propose to construct hierarchical classifiers for solving imbalanced regression tasks. The fine-grained classifiers limit the quantization error while being modulated by the coarse predictions to ensure high accuracy. Standard hierarchical classification approaches when applied to the regression problem fail to ensure that predicted ranges remain consistent across the hierarchy. As such we propose a range-preserving distillation process that effectively learns a single classifier from the set of hierarchical classifiers. Our novel hierarchical classification adjustment (HCA) for imbalanced regression shows superior results on three diverse tasks: age estimation crowd counting and depth estimation. Code is available at https://github.com/xhp-hust-2018-2011/HCA. + + + + Total-Decom: Decomposed 3D Scene Reconstruction with Minimal Interaction + http://openaccess.thecvf.com//content/CVPR2024/papers/Lyu_Total-Decom_Decomposed_3D_Scene_Reconstruction_with_Minimal_Interaction_CVPR_2024_paper.pdf + Scene reconstruction from multi-view images is a fundamental problem in computer vision and graphics. Recent neural implicit surface reconstruction methods have achieved high-quality results; however editing and manipulating the 3D geometry of reconstructed scenes remains challenging due to the absence of naturally decomposed object entities and complex object/background compositions. In this paper we present Total-Decom a novel method for decomposed 3D reconstruction with minimal human interaction. Our approach seamlessly integrates the Segment Anything Model (SAM) with hybrid implicit-explicit neural surface representations and a mesh-based region-growing technique for accurate 3D object decomposition. Total-Decom requires minimal human annotations while providing users with real-time control over the granularity and quality of decomposition. We extensively evaluate our method on benchmark datasets and demonstrate its potential for downstream applications such as animation and scene editing. + + + + Accelerating Neural Field Training via Soft Mining + http://openaccess.thecvf.com//content/CVPR2024/papers/Kheradmand_Accelerating_Neural_Field_Training_via_Soft_Mining_CVPR_2024_paper.pdf + We present an approach to accelerate Neural Field training by efficiently selecting sampling locations. While Neural Fields have recently become popular it is often trained by uniformly sampling the training domain or through handcrafted heuristics. We show that improved convergence and final training quality can be achieved by a soft mining technique based on importance sampling: rather than either considering or ignoring a pixel completely we weigh the corresponding loss by a scalar. To implement our idea we use Langevin Monte-Carlo sampling. We show that by doing so regions with higher error are being selected more frequently leading to more than 2x improvement in convergence speed. The code and related resources for this study are publicly available at https://ubc-vision.github.io/nf-soft-mining/. 
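Editor's note on the soft-mining entry above: instead of hard selection, each sampled pixel's loss is weighted by a scalar and high-error regions are drawn more often. The sketch below uses a plain categorical sampler with inverse-probability weighting rather than the paper's Langevin Monte Carlo, and the `alpha` exponent, error buffer, and decay factor are made-up stand-ins for a real training loop.

```python
import numpy as np

rng = np.random.default_rng(0)
num_pixels = 10_000
errors = rng.random(num_pixels) ** 2 + 1e-3   # stand-in per-pixel error estimates
alpha = 1.0                                   # how aggressively to favour high error

for step in range(3):
    # Sampling distribution proportional to (error ** alpha).
    probs = errors ** alpha
    probs /= probs.sum()

    # Soft mining: high-error pixels are drawn more often than uniform sampling.
    batch = rng.choice(num_pixels, size=256, p=probs)
    per_pixel_loss = errors[batch]

    # Importance weights keep the loss estimate unbiased w.r.t. uniform sampling.
    weights = 1.0 / (num_pixels * probs[batch])
    weighted_loss = (weights * per_pixel_loss).mean()

    # In a real pipeline the error buffer would be refreshed from the new losses;
    # here we simply decay the sampled entries to mimic training progress.
    errors[batch] *= 0.9
    print(f"step {step}: weighted loss {weighted_loss:.4f}")
```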
+ + + + Ensemble Diversity Facilitates Adversarial Transferability + http://openaccess.thecvf.com//content/CVPR2024/papers/Tang_Ensemble_Diversity_Facilitates_Adversarial_Transferability_CVPR_2024_paper.pdf + With the advent of ensemble-based attacks the transferability of generated adversarial examples is elevated by a noticeable margin despite many methods only employing superficial integration yet ignoring the diversity between ensemble models. However most of them compromise the latent value of the diversity between generated perturbation from distinct models which we argue is also able to increase the adversarial transferability especially heterogeneous attacks. To address the issues we propose a novel method of Stochastic Mini-batch black-box attack with Ensemble Reweighing using reinforcement learning (SMER) to produce highly transferable adversarial examples. We emphasize the diversity between surrogate models achieving individual perturbation iteratively. In order to customize the individual effect between surrogates ensemble reweighing is introduced to refine ensemble weights by maximizing attack loss based on reinforcement learning which functions on the ultimate transferability elevation. Extensive experiments demonstrate our superiority to recent ensemble attacks with a significant margin across different black-box attack scenarios especially on heterogeneous conditions. + + + + Gear-NeRF: Free-Viewpoint Rendering and Tracking with Motion-aware Spatio-Temporal Sampling + http://openaccess.thecvf.com//content/CVPR2024/papers/Liu_Gear-NeRF_Free-Viewpoint_Rendering_and_Tracking_with_Motion-aware_Spatio-Temporal_Sampling_CVPR_2024_paper.pdf + Extensions of Neural Radiance Fields (NeRFs) to model dynamic scenes have enabled their near photo-realistic free-viewpoint rendering. Although these methods have shown some potential in creating immersive experiences two drawbacks limit their ubiquity: (i) a significant reduction in reconstruction quality when the computing budget is limited and (ii) a lack of semantic understanding of the underlying scenes. To address these issues we introduce Gear-NeRF which leverages semantic information from powerful image segmentation models. Our approach presents a principled way for learning a spatio-temporal (4D) semantic embedding based on which we introduce the concept of gears to allow for stratified modeling of dynamic regions of the scene based on the extent of their motion. Such differentiation allows us to adjust the spatio-temporal sampling resolution for each region in proportion to its motion scale achieving more photo-realistic dynamic novel view synthesis. At the same time almost for free our approach enables free-viewpoint tracking of objects of interest -- a functionality not yet achieved by existing NeRF-based methods. Empirical studies validate the effectiveness of our method where we achieve state-of-the-art rendering and tracking performance on multiple challenging datasets. The project page is available at: https://merl.com/research/highlights/gear-nerf. + + + + BrainWash: A Poisoning Attack to Forget in Continual Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Abbasi_BrainWash_A_Poisoning_Attack_to_Forget_in_Continual_Learning_CVPR_2024_paper.pdf + Continual learning has gained substantial attention within the deep learning community offering promising solutions to the challenging problem of sequential learning. 
Yet a largely unexplored facet of this paradigm is its susceptibility to adversarial attacks especially with the aim of inducing forgetting. In this paper we introduce "BrainWash" a novel data poisoning method tailored to impose forgetting on a continual learner. By adding the BrainWash noise to a variety of baselines we demonstrate how a trained continual learner can be induced to forget its previously learned tasks catastrophically even when using these continual learning baselines. An important feature of our approach is that the attacker requires no access to previous tasks' data and is armed merely with the model's current parameters and the data belonging to the most recent task. Our extensive experiments highlight the efficacy of BrainWash showcasing degradation in performance across various regularization and memory replay-based continual learning methods. Our code is available here: https://github.com/mint-vu/Brainwash + + + + FreePoint: Unsupervised Point Cloud Instance Segmentation + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_FreePoint_Unsupervised_Point_Cloud_Instance_Segmentation_CVPR_2024_paper.pdf + Instance segmentation of point clouds is a crucial task in 3D field with numerous applications that involve localizing and segmenting objects in a scene. However achieving satisfactory results requires a large number of manual annotations which is time-consuming and expensive. To alleviate dependency on annotations we propose a novel framework FreePoint for underexplored unsupervised class-agnostic instance segmentation on point clouds. In detail we represent the point features by combining coordinates colors and self-supervised deep features. Based on the point features we perform a bottom-up multicut algorithm to segment point clouds into coarse instance masks as pseudo labels which are used to train a point cloud instance segmentation model. We propose an id-as-feature strategy at this stage to alleviate the randomness of the multicut algorithm and improve the pseudo labels' quality. During training we propose a weakly-supervised two-step training strategy and corresponding losses to overcome the inaccuracy of coarse masks. FreePoint has achieved breakthroughs in unsupervised class-agnostic instance segmentation on point clouds and outperformed previous traditional methods by over 18.2% and a competitive concurrent work UnScene3D by 5.5% in AP. Additionally when used as a pretext task and fine-tuned on S3DIS FreePoint performs significantly better than existing self-supervised pre-training methods with limited annotations and surpasses CSC by 6.0% in AP with 10% annotation masks. Code will be released at https://github.com/zzk273/FreePoint. + + + + Circuit Design and Efficient Simulation of Quantum Inner Product and Empirical Studies of Its Effect on Near-Term Hybrid Quantum-Classic Machine Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Xiong_Circuit_Design_and_Efficient_Simulation_of_Quantum_Inner_Product_and_CVPR_2024_paper.pdf + For the essential operation namely inner product (IP) as widely adopted in classic computing e.g. matrix multiplication its quantum counterpart: quantum inner product (QIP) has also been recently theoretically explored with a verifiable lower complexity on quantum computers. 
However it remains unclear how to embody the quantum circuits (QC) for QIP let alone how to (thoroughly) evaluate the QIP circuits especially in a practical context in the NISQ era by applying QIP to ML via hybrid quantum-classic pipelines. In this paper we carefully design the QIP circuits from scratch whose complexity is in accordance with the theoretical complexity. To make the simulation tractable on classic computers especially when it is integrated into gradient-based hybrid ML pipelines we further devise a highly-efficient simulation scheme that directly simulates the output state. Experiments show that the scheme accelerates the simulation by more than 68k times compared with the previous circuit simulator. This allows our empirical evaluation on typical machine learning tasks ranging from supervised and self-supervised learning via neural nets to K-Means clustering. The results show that the calculation error introduced by typical quantum mechanisms has in general little influence on the final numerical results given sufficient qubits. However certain tasks e.g. ranking in K-Means could be more sensitive to quantum noise. + + + + How to Make Cross Encoder a Good Teacher for Efficient Image-Text Retrieval? + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_How_to_Make_Cross_Encoder_a_Good_Teacher_for_Efficient_CVPR_2024_paper.pdf + Dominant dual-encoder models enable efficient image-text retrieval but suffer from limited accuracy while the cross-encoder models offer higher accuracy at the expense of efficiency. Distilling cross-modality matching knowledge from cross-encoder to dual-encoder provides a natural approach to harness their strengths. Thus we investigate the following valuable question: how to make cross-encoder a good teacher for dual-encoder? Our findings are threefold: (1) Cross-modal similarity score distribution of the cross-encoder is more concentrated while that of the dual-encoder is nearly normal making vanilla logit distillation less effective. However ranking distillation remains practical as it is not affected by the score distribution. (2) Only the relative order between hard negatives conveys valid knowledge while the order information between easy negatives has little significance. (3) Maintaining the coordination between distillation loss and dual-encoder training loss is beneficial for knowledge transfer. Based on these findings we propose a novel Contrastive Partial Ranking Distillation (CPRD) method which implements the objective of mimicking relative order between hard negative samples with contrastive learning. This approach coordinates with the training of the dual-encoder effectively transferring valid knowledge from the cross-encoder to the dual-encoder. Extensive experiments on image-text retrieval and ranking tasks show that our method surpasses other distillation methods and significantly improves the accuracy of the dual-encoder. + + + + Diffeomorphic Template Registration for Atmospheric Turbulence Mitigation + http://openaccess.thecvf.com//content/CVPR2024/papers/Lao_Diffeomorphic_Template_Registration_for_Atmospheric_Turbulence_Mitigation_CVPR_2024_paper.pdf + We describe a method for recovering the irradiance underlying a collection of images corrupted by atmospheric turbulence. Since supervised data is often technically impossible to obtain assumptions and biases have to be imposed to solve this inverse problem and we choose to model them explicitly. &#13;
Rather than initializing a latent irradiance ("template") by heuristics to estimate deformation we select one of the images as a reference and model the deformation in this image by the aggregation of the optical flow from it to other images exploiting a prior imposed by Central Limit Theorem. Then with a novel flow inversion module the model registers each image TO the template but WITHOUT the template avoiding artifacts related to poor template initialization. To illustrate the robustness of the method we simply (i) select the first frame as the reference and (ii) use the simplest optical flow to estimate the warpings yet the improvement in registration is decisive in the final reconstruction as we achieve state-of-the-art performance despite its simplicity. The method establishes a strong baseline that can be further improved by integrating it seamlessly into more sophisticated pipelines or with domain-specific methods if so desired. + + + + Selective Nonlinearities Removal from Digital Signals + http://openaccess.thecvf.com//content/CVPR2024/papers/Maliszewski_Selective_Nonlinearities_Removal_from_Digital_Signals_CVPR_2024_paper.pdf + Many instruments performing optical and non-optical imaging and sensing such as Optical Coherence Tomography (OCT) Magnetic Resonance Imaging or Fourier-transform spectrometry produce digital signals containing modulations sine-like components which only after Fourier transformation give information about the structure or characteristics of the investigated object. Due to the fundamental physics-related limitations of such methods the distribution of these signal components is often nonlinear and when not properly compensated leads to the resolution precision or quality drop in the final image. Here we propose an innovative approach that has the potential to allow cleaning of the signal from the nonlinearities but most of all it now allows to switch the given order off leaving all others intact. The latter provides a tool for more in-depth analysis of the nonlinearity-inducing properties of the investigated object which can lead to applications in early disease detection or more sensitive sensing of chemical compounds. We consider OCT signals and nonlinearities up to the third order. In our approach we propose two neural networks: one to remove solely the second-order nonlinearity and the other for removing solely the third-order nonlinearity. The input of the networks is a novel two-dimensional data structure with all the information needed for the network to infer a nonlinearity-free signal. We describe the developed networks and present the results for second-order and third-order nonlinearity removal in OCT data representing the images of various objects: a mirror glass and fruits. + + + + NB-GTR: Narrow-Band Guided Turbulence Removal + http://openaccess.thecvf.com//content/CVPR2024/papers/Xia_NB-GTR_Narrow-Band_Guided_Turbulence_Removal_CVPR_2024_paper.pdf + The removal of atmospheric turbulence is crucial for long-distance imaging. Leveraging the stochastic nature of atmospheric turbulence numerous algorithms have been developed that employ multi-frame input to mitigate the turbulence. However when limited to a single frame existing algorithms face substantial performance drops particularly in diverse real-world scenes. 
In this paper, we propose a robust solution to turbulence removal from an RGB image under the guidance of an additional narrow-band image, broadening the applicability of turbulence mitigation techniques in real-world imaging scenarios. Our approach exhibits a substantial suppression in the magnitude of turbulence artifacts by using only a pair of images, thereby enhancing the clarity and fidelity of the captured scene. + + + + Can Biases in ImageNet Models Explain Generalization? + http://openaccess.thecvf.com//content/CVPR2024/papers/Gavrikov_Can_Biases_in_ImageNet_Models_Explain_Generalization_CVPR_2024_paper.pdf + The robust generalization of models to rare in-distribution (ID) samples drawn from the long tail of the training distribution and to out-of-training-distribution (OOD) samples is one of the major challenges of current deep learning methods. For image classification, this manifests in the existence of adversarial attacks, the performance drops on distorted images, and a lack of generalization to concepts such as sketches. The current understanding of generalization in neural networks is very limited, but some biases that differentiate models from human vision have been identified and might be causing these limitations. Consequently, several attempts with varying success have been made to reduce these biases during training to improve generalization. We take a step back and sanity-check these attempts. Fixing the architecture to the well-established ResNet-50, we perform a large-scale study on 48 ImageNet models obtained via different training methods to understand how and if these biases - including shape bias, spectral biases, and critical bands - interact with generalization. Our extensive study results reveal that, contrary to previous findings, these biases are insufficient to accurately predict the generalization of a model holistically. We provide access to all checkpoints and evaluation code at https://github.com/paulgavrikov/biases_vs_generalization/ + + + + Generative Quanta Color Imaging + http://openaccess.thecvf.com//content/CVPR2024/papers/Purohit_Generative_Quanta_Color_Imaging_CVPR_2024_paper.pdf + The astonishing development of single-photon cameras has created an unprecedented opportunity for scientific and industrial imaging. However, the high data throughput generated by these 1-bit sensors creates a significant bottleneck for low-power applications. In this paper, we explore the possibility of generating a color image from a single binary frame of a single-photon camera. We find this problem to be particularly difficult for standard colorization approaches due to the substantial degree of exposure variation. The core innovation of our paper is an exposure synthesis model framed under a neural ordinary differential equation (Neural ODE) that allows us to generate a continuum of exposures from a single observation. This innovation ensures consistent exposure in the binary images that colorizers take as input, resulting in notably enhanced colorization. We demonstrate applications of the method in single-image and burst colorization and show superior generative performance over baselines. 
The project website can be found at https://vishal-s-p.github.io/projects/2023/generative_quanta_color.html + + + + Overload: Latency Attacks on Object Detection for Edge Devices + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Overload_Latency_Attacks_on_Object_Detection_for_Edge_Devices_CVPR_2024_paper.pdf + Nowadays, the deployment of deep learning-based applications is an essential task owing to the increasing demands on intelligent services. In this paper, we investigate latency attacks on deep learning applications. Unlike common adversarial attacks for misclassification, the goal of latency attacks is to increase the inference time, which may stop applications from responding to requests within a reasonable time. This kind of attack is ubiquitous for various applications, and we use object detection to demonstrate how such attacks work. We also design a framework named Overload to generate latency attacks at scale. Our method is based on a newly formulated optimization problem and a novel technique called spatial attention. This attack serves to escalate the required computing costs during inference, consequently leading to an extended inference time for object detection. It presents a significant threat, especially to systems with limited computing resources. We conducted experiments using YOLOv5 models on the Nvidia NX. Compared to existing methods, our method is simpler and more effective. The experimental results show that, with latency attacks, the inference time for a single image can be increased to ten times longer than in the normal setting. Moreover, our findings pose a potential new threat to all object detection tasks requiring non-maximum suppression (NMS), as our attack is NMS-agnostic. + + + + SD4Match: Learning to Prompt Stable Diffusion Model for Semantic Matching + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_SD4Match_Learning_to_Prompt_Stable_Diffusion_Model_for_Semantic_Matching_CVPR_2024_paper.pdf + In this paper, we address the challenge of matching semantically similar keypoints across image pairs. Existing research indicates that the intermediate output of the UNet within Stable Diffusion (SD) can serve as robust image feature maps for such a matching task. We demonstrate that by employing a basic prompt tuning technique, the inherent potential of Stable Diffusion can be harnessed, resulting in a significant enhancement in accuracy over previous approaches. We further introduce a novel conditional prompting module that conditions the prompt on the local details of the input image pairs, leading to a further improvement in performance. We designate our approach as SD4Match, short for Stable Diffusion for Semantic Matching. Comprehensive evaluations of SD4Match on the PF-Pascal, PF-Willow, and SPair-71k datasets show that it sets new benchmarks in accuracy across all these datasets. In particular, SD4Match outperforms the previous state-of-the-art by a margin of 12 percentage points on the challenging SPair-71k dataset. Code is available at the project website: https://sd4match.active.vision. + + + + Neural Video Compression with Feature Modulation + http://openaccess.thecvf.com//content/CVPR2024/papers/Li_Neural_Video_Compression_with_Feature_Modulation_CVPR_2024_paper.pdf + The emerging conditional coding-based neural video codec (NVC) shows superiority over the commonly-used residual coding-based codec, and the latest NVC already claims to outperform the best traditional codec. 
However, there still exist critical problems blocking the practicality of NVC. In this paper, we propose a powerful conditional coding-based NVC that solves two critical problems via feature modulation. The first is how to support a wide quality range in a single model. Previous NVCs with this capability only support a PSNR range of about 3.8 dB on average. To tackle this limitation, we modulate the latent feature of the current frame via a learnable quantization scaler. During training, we specially design a uniform quantization parameter sampling mechanism to improve the harmonization of encoding and quantization. This results in better learning of the quantization scaler and helps our NVC support a PSNR range of about 11.4 dB. The second is how to make NVC still work under a long prediction chain. We show that the previous SOTA NVC has an obvious quality degradation problem when using a large intra-period setting. To this end, we propose modulating the temporal feature with a periodically refreshing mechanism to boost the quality. Notably, under the single intra-frame setting, our codec can achieve a 29.7% bitrate saving over the previous SOTA NVC with a 16% MACs reduction. Our codec serves as a notable landmark in the journey of NVC evolution. The code is at https://github.com/microsoft/DCVC. + + + + Data Poisoning based Backdoor Attacks to Contrastive Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Zhang_Data_Poisoning_based_Backdoor_Attacks_to_Contrastive_Learning_CVPR_2024_paper.pdf + Contrastive learning (CL) pre-trains general-purpose encoders using an unlabeled pre-training dataset, which consists of images or image-text pairs. CL is vulnerable to data poisoning based backdoor attacks (DPBAs), in which an attacker injects poisoned inputs into the pre-training dataset so the encoder is backdoored. However, existing DPBAs achieve limited effectiveness. In this work, we take the first step to analyze the limitations of existing backdoor attacks and propose new DPBAs to CL, called CorruptEncoder. CorruptEncoder introduces a new attack strategy to create poisoned inputs and uses a theory-guided method to maximize attack effectiveness. Our experiments show that CorruptEncoder substantially outperforms existing DPBAs. In particular, CorruptEncoder is the first DPBA that achieves more than 90% attack success rates with only a few (3) reference images and a small poisoning ratio (0.5%). Moreover, we also propose a defense called localized cropping to defend against DPBAs. Our results show that our defense can reduce the effectiveness of DPBAs, but it sacrifices the utility of the encoder, highlighting the need for new defenses. + + + + Progressive Semantic-Guided Vision Transformer for Zero-Shot Learning + http://openaccess.thecvf.com//content/CVPR2024/papers/Chen_Progressive_Semantic-Guided_Vision_Transformer_for_Zero-Shot_Learning_CVPR_2024_paper.pdf + Zero-shot learning (ZSL) recognizes unseen classes by conducting visual-semantic interactions to transfer semantic knowledge from seen classes to unseen ones, supported by semantic information (e.g., attributes). However, existing ZSL methods simply extract visual features using a pre-trained network backbone (i.e., CNN or ViT), which fails to learn matched visual-semantic correspondences for representing semantic-related visual features due to the lack of guidance from semantic information, resulting in undesirable visual-semantic interactions. 
To tackle this issue, we propose a progressive semantic-guided vision transformer for zero-shot learning (dubbed ZSLViT). ZSLViT mainly considers two properties in the whole network: i) explicitly discovering the semantic-related visual representations, and ii) discarding the semantic-unrelated visual information. Specifically, we first introduce semantic-embedded token learning to improve the visual-semantic correspondences via semantic enhancement and to discover the semantic-related visual tokens explicitly with semantic-guided token attention. Then, we fuse visual tokens with low semantic-visual correspondence to discard the semantic-unrelated visual information for visual enhancement. These two operations are integrated into various encoders to progressively learn semantic-related visual representations for accurate visual-semantic interactions in ZSL. Extensive experiments show that our ZSLViT achieves significant performance gains on three popular benchmark datasets, i.e., CUB, SUN, and AWA2. + + + + Building Bridges across Spatial and Temporal Resolutions: Reference-Based Super-Resolution via Change Priors and Conditional Diffusion Model + http://openaccess.thecvf.com//content/CVPR2024/papers/Dong_Building_Bridges_across_Spatial_and_Temporal_Resolutions_Reference-Based_Super-Resolution_via_CVPR_2024_paper.pdf + Reference-based super-resolution (RefSR) has the potential to build bridges across spatial and temporal resolutions of remote sensing images. However, existing RefSR methods are limited by the faithfulness of content reconstruction and the effectiveness of texture transfer at large scaling factors. Conditional diffusion models have opened up new opportunities for generating realistic high-resolution images, but effectively utilizing reference images within these models remains an area for further exploration. Furthermore, content fidelity is difficult to guarantee in areas without relevant reference information. To solve these issues, we propose a change-aware diffusion model named Ref-Diff for RefSR, using land cover change priors to guide the denoising process explicitly. Specifically, we inject the priors into the denoising model to improve the utilization of reference information in unchanged areas and to regulate the reconstruction of semantically relevant content in changed areas. With this powerful guidance, we decouple the semantics-guided denoising and reference texture-guided denoising processes to improve the model performance. Extensive experiments demonstrate the superior effectiveness and robustness of the proposed method compared with state-of-the-art RefSR methods in both quantitative and qualitative evaluations. The code and data are available at https://github.com/dongrunmin/RefDiff. + + + + \ No newline at end of file diff --git a/rss/ICCV2023.xml b/rss/ICCV2023.xml new file mode 100644 index 0000000..9f29f11 --- /dev/null +++ b/rss/ICCV2023.xml @@ -0,0 +1,12943 @@ + + + + ICCV 2023 + + + Towards Attack-tolerant Federated Learning via Critical Parameter Analysis + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_Towards_Attack-tolerant_Federated_Learning_via_Critical_Parameter_Analysis_ICCV_2023_paper.pdf + Federated learning is used to train a shared model in a decentralized way without clients sharing private data with each other. Federated learning systems are susceptible to poisoning attacks when malicious clients send false updates to the central server. Existing defense strategies are ineffective under non-IID data settings. 
This paper proposes a new defense strategy, FedCPA (Federated learning with Critical Parameter Analysis). Our attack-tolerant aggregation method is based on the observation that benign local models have similar sets of top-k and bottom-k critical parameters, whereas poisoned local models do not. Experiments with different attack scenarios on multiple datasets demonstrate that our model outperforms existing defense strategies in defending against poisoning attacks. + + + + Stochastic Segmentation with Conditional Categorical Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zbinden_Stochastic_Segmentation_with_Conditional_Categorical_Diffusion_Models_ICCV_2023_paper.pdf + Semantic segmentation has made significant progress in recent years thanks to deep neural networks, but the common objective of generating a single segmentation output that accurately matches the image's content may not be suitable for safety-critical domains such as medical diagnostics and autonomous driving. Instead, multiple possible correct segmentation maps may be required to reflect the true distribution of annotation maps. In this context, stochastic semantic segmentation methods must learn to predict conditional distributions of labels given the image, but this is challenging due to the typically multimodal distributions, high-dimensional output spaces, and limited annotation data. To address these challenges, we propose a conditional categorical diffusion model (CCDM) for semantic segmentation based on Denoising Diffusion Probabilistic Models. Our model is conditioned on the input image, enabling it to generate multiple segmentation label maps that account for the aleatoric uncertainty arising from divergent ground truth annotations. Our experimental results show that CCDM achieves state-of-the-art performance on LIDC, a stochastic semantic segmentation dataset, and outperforms established baselines on the classical segmentation dataset Cityscapes. + + + + A Dynamic Dual-Processing Object Detection Framework Inspired by the Brain's Recognition Mechanism + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_A_Dynamic_Dual-Processing_Object_Detection_Framework_Inspired_by_the_Brains_ICCV_2023_paper.pdf + There are two main approaches to object detection: CNN-based and Transformer-based. The former views object detection as a dense local matching problem, while the latter sees it as a sparse global retrieval problem. Research in neuroscience has shown that the recognition decision in the brain is based on two processes, namely familiarity and recollection. Based on this biological support, we propose an efficient and effective dual-processing object detection framework. It integrates CNN- and Transformer-based detectors into a comprehensive object detection system consisting of a shared backbone, an efficient dual-stream encoder, and a dynamic dual-decoder. To better integrate local and global features, we design a search space for the CNN-Transformer dual-stream encoder to find the optimal fusion solution. To enable better coordination between the CNN- and Transformer-based decoders, we provide the dual-decoder with a selective mask. This mask dynamically chooses the more advantageous decoder for each position in the image based on high-level representation. As demonstrated by extensive experiments, our approach shows flexibility and effectiveness in improving the mAP of the various source detectors by 3.0-3.7 without increasing FLOPs. 
+ + + + Hard No-Box Adversarial Attack on Skeleton-Based Human Action Recognition with Skeleton-Motion-Informed Gradient + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_Hard_No-Box_Adversarial_Attack_on_Skeleton-Based_Human_Action_Recognition_with_ICCV_2023_paper.pdf + Recently, methods for skeleton-based human activity recognition have been shown to be vulnerable to adversarial attacks. However, these attack methods require either the full knowledge of the victim (i.e. white-box attacks), access to training data (i.e. transfer-based attacks) or frequent model queries (i.e. black-box attacks). All their requirements are highly restrictive, raising the question of how detrimental the vulnerability is. In this paper, we show that the vulnerability indeed exists. To this end, we consider a new attack task: the attacker has no access to the victim model or the training data or labels, where we coin the term hard no-box attack. Specifically, we first learn a motion manifold where we define an adversarial loss to compute a new gradient for the attack, named skeleton-motion-informed (SMI) gradient. Our gradient contains information of the motion dynamics, which is different from existing gradient-based attack methods that compute the loss gradient assuming each dimension in the data is independent. The SMI gradient can augment many gradient-based attack methods, leading to a new family of no-box attack methods. Extensive evaluation and comparison show that our method imposes a real threat to existing classifiers. They also show that the SMI gradient improves the transferability and imperceptibility of adversarial samples in both no-box and transfer-based black-box settings. + + + + GameFormer: Game-theoretic Modeling and Learning of Transformer-based Interactive Prediction and Planning for Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_GameFormer_Game-theoretic_Modeling_and_Learning_of_Transformer-based_Interactive_Prediction_and_ICCV_2023_paper.pdf + Autonomous vehicles operating in complex real-world environments require accurate predictions of interactive behaviors between traffic participants. This paper tackles the interaction prediction problem by formulating it with hierarchical game theory and proposing the GameFormer model for its implementation. The model incorporates a Transformer encoder, which effectively models the relationships between scene elements, alongside a novel hierarchical Transformer decoder structure. At each decoding level, the decoder utilizes the prediction outcomes from the previous level, in addition to the shared environmental context, to iteratively refine the interaction process. Moreover, we propose a learning process that regulates an agent's behavior at the current level to respond to other agents' behaviors from the preceding level. Through comprehensive experiments on large-scale real-world driving datasets, we demonstrate the state-of-the-art accuracy of our model on the Waymo interaction prediction task. Additionally, we validate the model's capacity to jointly reason about the motion plan of the ego agent and the behaviors of multiple agents in both open-loop and closed-loop planning tests, outperforming various baseline methods. Furthermore, we evaluate the efficacy of our model on the nuPlan planning benchmark, where it achieves leading performance. 
+ + + + Learning in Imperfect Environment: Multi-Label Classification with Long-Tailed Distribution and Partial Labels + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Learning_in_Imperfect_Environment_Multi-Label_Classification_with_Long-Tailed_Distribution_and_ICCV_2023_paper.pdf + Conventional multi-label classification (MLC) methods assume that all samples are fully labeled and identically distributed. Unfortunately, this assumption is unrealistic in large-scale MLC data that has long-tailed (LT) distribution and partial labels (PL). To address the problem, we introduce a novel task, Partial labeling and Long-Tailed Multi-Label Classification (PLT-MLC), to jointly consider the above two imperfect learning environments. Not surprisingly, we find that most LT-MLC and PL-MLC approaches fail to solve the PLT-MLC, resulting in significant performance degradation on the two proposed PLT-MLC benchmarks. Therefore, we propose an end-to-end learning framework: COrrection -> ModificatIon -> balanCe, abbreviated as COMC. Our bootstrapping philosophy is to simultaneously correct the missing labels (Correction) with convinced prediction confidence over a class-aware threshold and to learn from these recall labels during training. We next propose a novel multi-focal modifier loss that simultaneously addresses head-tail imbalance and positive-negative imbalance to adaptively modify the attention to different samples (Modification) under the LT class distribution. We also develop a balanced training strategy by distilling the model's learning effect from head and tail samples, and thus design the balanced classifier (Balance) conditioned on the head and tail learning effect to maintain a stable performance. Our experimental study shows that the proposed method significantly outperforms the general MLC, LT-MLC and ML-MLC methods in terms of effectiveness and robustness on our newly created PLT-MLC datasets. + + + + Flexible Visual Recognition by Evidential Modeling of Confusion and Ignorance + http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_Flexible_Visual_Recognition_by_Evidential_Modeling_of_Confusion_and_Ignorance_ICCV_2023_paper.pdf + In real-world scenarios, typical visual recognition systems could fail under two major causes, i.e., the misclassification between known classes and the excusable misbehavior on unknown-class images. To tackle these deficiencies, flexible visual recognition should dynamically predict multiple classes when they are unconfident between choices and reject making predictions when the input is entirely out of the training distribution. Two challenges emerge along with this novel task. First, prediction uncertainty should be separately quantified as confusion depicting inter-class uncertainties and ignorance identifying out-of-distribution samples. Second, both confusion and ignorance should be comparable between samples to enable effective decision-making. In this paper, we propose to model these two sources of uncertainty explicitly with the theory of Subjective Logic. Regarding recognition as an evidence-collecting process, confusion is then defined as conflicting evidence, while ignorance is the absence of evidence. By predicting Dirichlet concentration parameters for singletons, comprehensive subjective opinions, including confusion and ignorance, could be achieved via further evidence combinations. 
Through a series of experiments on synthetic data analysis, visual recognition, and open-set detection, we demonstrate the effectiveness of our methods in quantifying two sources of uncertainties and dealing with flexible recognition. + + + + Texture Generation on 3D Meshes with Point-UV Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Texture_Generation_on_3D_Meshes_with_Point-UV_Diffusion_ICCV_2023_paper.pdf + In this work, we focus on synthesizing high-quality textures on 3D meshes. We present Point-UV diffusion, a coarse-to-fine pipeline that marries the denoising diffusion model with UV mapping to generate 3D consistent and high-quality texture images in UV space. We start with introducing a point diffusion model to synthesize low-frequency texture components with our tailored style guidance to tackle the biased color distribution. The derived coarse texture offers global consistency and serves as a condition for the subsequent UV diffusion stage, aiding in regularizing the model to generate a 3D consistent UV texture image. Then, a UV diffusion model with hybrid conditions is developed to enhance the texture fidelity in the 2D UV space. Our method can process meshes of any genus, generating diversified, geometry-compatible, and high-fidelity textures. + + + + Enhanced Soft Label for Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Enhanced_Soft_Label_for_Semi-Supervised_Semantic_Segmentation_ICCV_2023_paper.pdf + As a mainstream framework in the field of semi-supervised learning (SSL), self-training via pseudo labeling and its variants have witnessed impressive progress in semi-supervised semantic segmentation with the recent advance of deep neural networks. However, modern self-training based SSL algorithms use a pre-defined constant threshold to select unlabeled pixel samples that contribute to the training, thus failing to be compatible with different learning difficulties of variant categories and different learning status of the model. To address these issues, we propose Enhanced Soft Label (ESL), a curriculum learning approach to fully leverage the high-value supervisory signals implicit in the untrustworthy pseudo label. ESL believes that pixels with unconfident predictions can be pretty sure about their belonging to a subset of dominant classes though being arduous to determine the exact one. It thus contains a Dynamic Soft Label (DSL) module to dynamically maintain the high probability classes, keeping the label "soft" so as to make full use of the high entropy prediction. However, the DSL itself will inevitably introduce ambiguity between dominant classes, thus blurring the classification boundary. Therefore, we further propose a pixel-to-part contrastive learning method cooperated with an unsupervised object part grouping mechanism to improve its ability to distinguish between different classes. Extensive experimental results on Pascal VOC 2012 and Cityscapes show that our approach achieves remarkable improvements over existing state-of-the-art approaches. + + + + HM-ViT: Hetero-Modal Vehicle-to-Vehicle Cooperative Perception with Vision Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiang_HM-ViT_Hetero-Modal_Vehicle-to-Vehicle_Cooperative_Perception_with_Vision_Transformer_ICCV_2023_paper.pdf + Vehicle-to-Vehicle technologies have enabled autonomous vehicles to share information to see through occlusions, greatly enhancing perception performance. 
Nevertheless, existing works all focused on homogeneous traffic where vehicles are equipped with the same type of sensors, which significantly hampers the scale of collaboration and benefit of cross-modality interactions. In this paper, we investigate the multi-agent hetero-modal cooperative perception problem where agents may have distinct sensor modalities. We present HM-ViT, the first unified multi-agent hetero-modal cooperative perception framework that can collaboratively predict 3D objects for highly dynamic Vehicle-to-Vehicle (V2V) collaborations with varying numbers and types of agents. To effectively fuse features from multi-view images and LiDAR point clouds, we design a novel heterogeneous 3D graph transformer to jointly reason inter-agent and intra-agent interactions. The extensive experiments on the V2V perception dataset OPV2V demonstrate that the HM-ViT outperforms SOTA cooperative perception methods for V2V hetero-modal cooperative perception. Our code will be released at https://github.com/XHwind/HM-ViT. + + + + HyperReenact: One-Shot Reenactment via Jointly Learning to Refine and Retarget Faces + http://openaccess.thecvf.com//content/ICCV2023/papers/Bounareli_HyperReenact_One-Shot_Reenactment_via_Jointly_Learning_to_Refine_and_Retarget_ICCV_2023_paper.pdf + In this paper, we present our method for neural face reenactment, called HyperReenact, that aims to generate realistic talking head images of a source identity, driven by a target facial pose. Existing state-of-the-art face reenactment methods train controllable generative models that learn to synthesize realistic facial images, yet producing reenacted faces that are prone to significant visual artifacts, especially under the challenging condition of extreme head pose changes, or requiring expensive few-shot fine-tuning to better preserve the source identity characteristics. We propose to address these limitations by leveraging the photorealistic generation ability and the disentangled properties of a pretrained StyleGAN2 generator, by first inverting the real images into its latent space and then using a hypernetwork to perform: (i) refinement of the source identity characteristics and (ii) facial pose re-targeting, eliminating this way the dependence on external editing methods that typically produce artifacts. Our method operates under the one-shot setting (i.e., using a single source frame) and allows for cross-subject reenactment, without requiring any subject-specific fine-tuning. We compare our method both quantitatively and qualitatively against several state-of-the-art techniques on the standard benchmarks of VoxCeleb1 and VoxCeleb2, demonstrating the superiority of our approach in producing artifact-free images, exhibiting remarkable robustness even under extreme head pose changes. We make the code and the pretrained models publicly available at: https://github.com/StelaBou/HyperReenact + + + + Unified Visual Relationship Detection with Vision and Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Unified_Visual_Relationship_Detection_with_Vision_and_Language_Models_ICCV_2023_paper.pdf + This work focuses on training a single visual relationship detector predicting over the union of label spaces from multiple datasets. Merging labels spanning different datasets could be challenging due to inconsistent taxonomies. The issue is exacerbated in visual relationship detection when second-order visual semantics are introduced between pairs of objects. 
To address this challenge, we propose UniVRD, a novel bottom-up method for Unified Visual Relationship Detection by leveraging vision and language models (VLMs). VLMs provide well-aligned image and text embeddings, where similar relationships are optimized to be close to each other for semantic unification. Our bottom-up design enables the model to enjoy the benefit of training with both object detection and visual relationship datasets. Empirical results on both human-object interaction detection and scene-graph generation demonstrate the competitive performance of our model. UniVRD achieves 38.07 mAP on HICO-DET, outperforming the current best bottom-up HOI detector by 14.26 mAP. More importantly, we show that our unified detector performs as well as dataset-specific models in mAP, and achieves further improvements when we scale up the model. Our code will be made publicly available on GitHub. + + + + Rickrolling the Artist: Injecting Backdoors into Text Encoders for Text-to-Image Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Struppek_Rickrolling_the_Artist_Injecting_Backdoors_into_Text_Encoders_for_Text-to-Image_ICCV_2023_paper.pdf + While text-to-image synthesis currently enjoys great popularity among researchers and the general public, the security of these models has been neglected so far. Many text-guided image generation models rely on pre-trained text encoders from external sources, and their users trust that the retrieved models will behave as promised. Unfortunately, this might not be the case. We introduce backdoor attacks against text-guided generative models and demonstrate that their text encoders pose a major tampering risk. Our attacks only slightly alter an encoder so that no suspicious model behavior is apparent for image generations with clean prompts. By then inserting a single character trigger into the prompt, e.g., a non-Latin character or emoji, the adversary can trigger the model to either generate images with pre-defined attributes or images following a hidden, potentially malicious description. We empirically demonstrate the high effectiveness of our attacks on Stable Diffusion and highlight that the injection process of a single backdoor takes less than two minutes. Besides phrasing our approach solely as an attack, it can also force an encoder to forget phrases related to certain concepts, such as nudity or violence, and help to make image generation safer. + + + + LD-ZNet: A Latent Diffusion Approach for Text-Based Image Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/PNVR_LD-ZNet_A_Latent_Diffusion_Approach_for_Text-Based_Image_Segmentation_ICCV_2023_paper.pdf + Large-scale pre-training tasks like image classification, captioning, or self-supervised techniques do not incentivize learning the semantic boundaries of objects. However, recent generative foundation models built using text-based latent diffusion techniques may learn semantic boundaries. This is because they have to synthesize intricate details about all objects in an image based on a text description. Therefore, we present a technique for segmenting real and AI-generated images using latent diffusion models (LDMs) trained on internet-scale datasets. First, we show that the latent space of LDMs (z-space) is a better input representation compared to other feature representations like RGB images or CLIP encodings for text-based image segmentation. 
By training the segmentation models on the latent z-space, which creates a compressed representation across several domains like different forms of art, cartoons, illustrations, and photographs, we are also able to bridge the domain gap between real and AI-generated images. We show that the internal features of LDMs contain rich semantic information and present a technique in the form of LD-ZNet to further boost the performance of text-based segmentation. Overall, we show up to 6% improvement over standard baselines for text-to-image segmentation on natural images. For AI-generated imagery, we show close to 20% improvement compared to state-of-the-art techniques. The project is available at https://koutilya-pnvr.github.io/LD-ZNet/. + + + + Downstream-agnostic Adversarial Examples + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Downstream-agnostic_Adversarial_Examples_ICCV_2023_paper.pdf + Self-supervised learning usually uses a large amount of unlabeled data to pre-train an encoder which can be used as a general-purpose feature extractor, such that downstream users only need to perform fine-tuning operations to enjoy the benefit of "big model". Despite this promising prospect, the security of pre-trained encoder has not been thoroughly investigated yet, especially when the pre-trained encoder is publicly available for commercial use. In this paper, we propose AdvEncoder, the first framework for generating downstream-agnostic universal adversarial examples based on the pre-trained encoder. AdvEncoder aims to construct a universal adversarial perturbation or patch for a set of natural images that can fool all the downstream tasks inheriting the victim pre-trained encoder. Unlike traditional adversarial example works, the pre-trained encoder only outputs feature vectors rather than classification labels. Therefore, we first exploit the high frequency component information of the image to guide the generation of adversarial examples. Then we design a generative attack framework to construct adversarial perturbations/patches by learning the distribution of the attack surrogate dataset to improve their attack success rates and transferability. Our results show that an attacker can successfully attack downstream tasks without knowing either the pre-training dataset or the downstream dataset. We also tailor four defenses for pre-trained encoders, the results of which further prove the attack ability of AdvEncoder. + + + + Studying How to Efficiently and Effectively Guide Models with Explanations + http://openaccess.thecvf.com//content/ICCV2023/papers/Rao_Studying_How_to_Efficiently_and_Effectively_Guide_Models_with_Explanations_ICCV_2023_paper.pdf + Despite being highly performant, deep neural networks might base their decisions on features that spuriously correlate with the provided labels, thus hurting generalization. To mitigate this, 'model guidance' has recently gained popularity, i.e. the idea of regularizing the models' explanations to ensure that they are "right for the right reasons". While various techniques to achieve such model guidance have been proposed, experimental validation of these approaches has thus far been limited to relatively simple and / or synthetic datasets. 
To better understand the effectiveness of the various design choices that have been explored in the context of model guidance, in this work we conduct an in-depth evaluation across various loss functions, attribution methods, models, and 'guidance depths' on the PASCAL VOC 2007 and MS COCO 2014 datasets. As annotation costs for model guidance can limit its applicability, we also place a particular focus on efficiency. Specifically, we guide the models via bounding box annotations, which are much cheaper to obtain than the commonly used segmentation masks, and evaluate the robustness of model guidance under limited (e.g. with only 1% of annotated images) or overly coarse annotations. Further, we propose using the EPG score as an additional evaluation metric and loss function ('Energy loss'). We show that optimizing for the Energy loss leads to models that exhibit a distinct focus on object-specific features, despite only using bounding box annotations that also include background regions. Lastly, we show that such model guidance can improve generalization under distribution shifts. Code available at: https://github.com/sukrutrao/Model-Guidance + + + + SkeletonMAE: Graph-based Masked Autoencoder for Skeleton Sequence Pre-training + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_SkeletonMAE_Graph-based_Masked_Autoencoder_for_Skeleton_Sequence_Pre-training_ICCV_2023_paper.pdf + Skeleton sequence representation learning has shown great advantages for action recognition due to its promising ability to model human joints and topology. However, the current methods usually require sufficient labeled data for training computationally expensive models. Moreover, these methods ignore how to utilize the fine-grained dependencies among different skeleton joints to pre-train an efficient skeleton sequence learning model that can generalize well across different datasets. In this paper, we propose an efficient skeleton sequence learning framework, named Skeleton Sequence Learning (SSL). To comprehensively capture the human pose and obtain discriminative skeleton sequence representation, we build an asymmetric graph-based encoder-decoder pre-training architecture named SkeletonMAE, which embeds skeleton joint sequence into graph convolutional network and reconstructs the masked skeleton joints and edges based on the prior human topology knowledge. Then, the pre-trained SkeletonMAE encoder is integrated with the Spatial-Temporal Representation Learning (STRL) module to build the SSL framework. Extensive experimental results show that our SSL generalizes well across different datasets and outperforms the state-of-the-art self-supervised skeleton-based methods on FineGym, Diving48, NTU 60 and NTU 120 datasets. Moreover, we obtain comparable performance to some fully supervised methods. The code is available at https://github.com/HongYan1123/SkeletonMAE. + + + + Pose-Free Neural Radiance Fields via Implicit Pose Regularization + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Pose-Free_Neural_Radiance_Fields_via_Implicit_Pose_Regularization_ICCV_2023_paper.pdf + Pose-free neural radiance fields (NeRF) aim to train NeRF with unposed multi-view images and it has achieved very impressive success in recent years. Most existing works share the pipeline of training a coarse pose estimator with rendered images at first, followed by a joint optimization of estimated poses and neural radiance field. 
However, as the pose estimator is trained with only rendered images, the pose estimation is usually biased or inaccurate for real images due to the domain gap between real images and rendered images, leading to poor robustness for the pose estimation of real images and further local minima in joint optimization. We design IR-NeRF, an innovative pose-free NeRF that introduces implicit pose regularization to refine pose estimator with unposed real images and improve the robustness of the pose estimation for real images. With a collection of 2D images of a specific scene, IR-NeRF constructs a scene codebook that stores scene features and captures the scene-specific pose distribution implicitly as priors. Thus, the robustness of pose estimation can be promoted with the scene priors according to the rationale that a 2D real image can be well reconstructed from the scene codebook only when its estimated pose lies within the pose distribution. Extensive experiments show that IR-NeRF achieves superior novel view synthesis and outperforms the state-of-the-art consistently across multiple synthetic and real datasets. + + + + Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories + http://openaccess.thecvf.com//content/ICCV2023/papers/Mensink_Encyclopedic_VQA_Visual_Questions_About_Detailed_Properties_of_Fine-Grained_Categories_ICCV_2023_paper.pdf + We propose Encyclopedic-VQA, a large scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence to support each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models as they perform poorly on our dataset: PaLI [9] is state-of-the-art on OK-VQA [29], yet it only achieves 13.0% accuracy on our dataset. Moreover, we experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information for the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval-augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models. + + + + Towards Understanding the Generalization of Deepfake Detectors from a Game-Theoretical View + http://openaccess.thecvf.com//content/ICCV2023/papers/Yao_Towards_Understanding_the_Generalization_of_Deepfake_Detectors_from_a_Game-Theoretical_ICCV_2023_paper.pdf + This paper aims to explain the generalization of deepfake detectors from the novel perspective of multi-order interactions among visual concepts. Specifically, we propose three hypotheses: 1. Deepfake detectors encode multi-order interactions among visual concepts, in which the low-order interactions usually have substantially negative contributions to deepfake detection. 2. Deepfake detectors with better generalization abilities tend to encode low-order interactions with fewer negative contributions. 3. Generalized deepfake detectors usually weaken the negative contributions of low-order interactions by suppressing their strength. 
Accordingly, we design several mathematical metrics to evaluate the effect of low-order interaction for deepfake detectors. Extensive comparative experiments are conducted, which verify the soundness of our hypotheses. Based on the analyses, we further propose a generic method, which directly reduces the toxic effects of low-order interactions to improve the generalization of deepfake detectors to some extent. The code will be released when the paper is accepted. + + + + 3DPPE: 3D Point Positional Encoding for Transformer-based Multi-Camera 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Shu_3DPPE_3D_Point_Positional_Encoding_for_Transformer-based_Multi-Camera_3D_Object_ICCV_2023_paper.pdf + Transformer-based methods have swept the benchmarks on 2D and 3D detection on images. Because tokenization before the attention mechanism drops the spatial information, positional encoding becomes critical for those methods. Recent works found that encodings based on samples of the 3D viewing rays can significantly improve the quality of multi-camera 3D object detection. We hypothesize that 3D point locations can provide more information than rays. Therefore, we introduce 3D point positional encoding, 3DPPE, to the 3D detection Transformer decoder. Although 3D measurements are not available at the inference time of monocular 3D object detection, 3DPPE uses predicted depth to approximate the real point positions. Our hybrid-depth module combines direct and categorical depth to estimate the refined depth of each pixel. Despite the approximation, 3DPPE achieves 46.0 mAP and 51.4 NDS on the competitive nuScenes dataset, significantly outperforming encodings based on ray samples. The codes are available at https://github.com/drilistbox/3DPPE. + + + + VertexSerum: Poisoning Graph Neural Networks for Link Inference + http://openaccess.thecvf.com//content/ICCV2023/papers/Ding_VertexSerum_Poisoning_Graph_Neural_Networks_for_Link_Inference_ICCV_2023_paper.pdf + Graph neural networks (GNNs) have brought superb performance to various applications utilizing graph structural data, such as social analysis and fraud detection. The graph links, e.g., social relationships and transaction history, are sensitive and valuable information, which raises privacy concerns when using GNNs. To exploit these vulnerabilities, we propose VertexSerum, a novel graph poisoning attack that increases the effectiveness of graph link stealing by amplifying the link connectivity leakage. To infer node adjacency more accurately, we propose an attention mechanism that can be embedded into the link detection network. Our experiments demonstrate that VertexSerum significantly outperforms the SOTA link inference attack, improving the AUC scores by an average of 9.8% across four real-world datasets and three different GNN structures. Furthermore, our experiments reveal the effectiveness of VertexSerum in both black-box and online learning settings, further validating its applicability in real-world scenarios. + + + + Deep Geometrized Cartoon Line Inbetweening + http://openaccess.thecvf.com//content/ICCV2023/papers/Siyao_Deep_Geometrized_Cartoon_Line_Inbetweening_ICCV_2023_paper.pdf + We aim to address a significant but understudied problem in the anime industry, namely the inbetweening of cartoon line drawings. Inbetweening involves generating intermediate frames between two black-and-white line drawings and is a time-consuming and expensive process that can benefit from automation. 
However, existing frame interpolation methods that rely on matching and warping whole raster images are unsuitable for line inbetweening and often produce blurring artifacts that damage the intricate line structures. To preserve the precision and detail of the line drawings, we propose a new approach, called AnimeInbet, which geometrizes raster line drawings into graphs of endpoints and reframes the inbetweening task as a graph fusion problem with vertex repositioning. Our method can effectively capture the sparsity and unique structure of line drawings while preserving the details during inbetweening. This is made possible through our novel modules, i.e., vertex encoding, a vertex correspondence Transformer, an effective mechanism for vertex repositioning and a visibility predictor. To train our method, we introduce MixamoLine240, a new dataset of line drawings with ground truth vectorization and matching labels. Our experiments demonstrate that AnimeInbet synthesizes high-quality, clean, and complete intermediate line drawings, outperforming existing methods quantitatively and qualitatively, especially in cases with large motions. + + + + MatrixCity: A Large-scale City Dataset for City-scale Neural Rendering and Beyond + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_MatrixCity_A_Large-scale_City_Dataset_for_City-scale_Neural_Rendering_and_ICCV_2023_paper.pdf + Neural radiance fields (NeRF) and its subsequent variants have led to remarkable progress in neural rendering. While most of recent neural rendering works focus on objects and small-scale scenes, developing neural rendering methods for city-scale scenes is of great potential in many real-world applications. However, this line of research is impeded by the absence of a comprehensive and high-quality dataset, yet collecting such a dataset over real city-scale scenes is costly, sensitive, and technically infeasible. To this end, we build a large-scale, comprehensive, and high-quality synthetic dataset for city-scale neural rendering researches. Leveraging the Unreal Engine 5 City Sample project, we developed a pipeline to easily collect aerial and street city views, accompanied by ground-truth camera poses and a range of additional data modalities. Flexible controls on environmental factors like light, weather, human and car crowd are also available in our pipeline, supporting the need of various tasks covering city-scale neural rendering and beyond. The resulting pilot dataset, MatrixCity, contains 67k aerial images and 452k street images from two city maps of total size 28km^2. On top of MatrixCity, a thorough benchmark is also conducted, which not only reveals unique challenges of the task of city-scale neural rendering, but also highlights potential improvements for future works. The dataset and code will be publicly available at the project page: https://city-super.github.io/matrixcity/. + + + + LinkGAN: Linking GAN Latents to Pixels for Controllable Image Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_LinkGAN_Linking_GAN_Latents_to_Pixels_for_Controllable_Image_Synthesis_ICCV_2023_paper.pdf + This work presents an easy-to-use regularizer for GAN training, which helps explicitly link some axes of the latent space to a set of pixels in the synthesized image. Establishing such a connection facilitates a more convenient local control of GAN generation, where users can alter the image content only within a spatial area simply by partially resampling the latent code. 
Experimental results confirm four appealing properties of our regularizer, which we call LinkGAN. (1) The latent-pixel linkage is applicable to either a fixed region (i.e., same for all instances) or a particular semantic category (i.e., varying across instances), like the sky. (2) Two or multiple regions can be independently linked to different latent axes, which further supports joint control. (3) Our regularizer can improve the spatial controllability of both 2D and 3D-aware GAN models, barely sacrificing the synthesis performance. (4) The models trained with our regularizer are compatible with GAN inversion techniques and maintain editability on real images. + + + + SVDiff: Compact Parameter Space for Diffusion Fine-Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_SVDiff_Compact_Parameter_Space_for_Diffusion_Fine-Tuning_ICCV_2023_paper.pdf + Recently, diffusion models have achieved remarkable success in text-to-image generation, enabling the creation of high-quality images from text prompts and various conditions. However, existing methods for customizing these models are limited by handling multiple personalized subjects and the risk of overfitting. Moreover, the large parameter space is inefficient for model storage. In this paper, we propose a novel approach to address the limitations in existing text-to-image diffusion models for personalization and customization. Our method involves fine-tuning the singular values of the weight matrices, leading to a compact and efficient parameter space that reduces the risk of overfitting and language-drifting. Our approach also includes a Cut-Mix-Unmix data-augmentation technique to enhance the quality of multi-subject image generation and a simple text-based image editing framework. Our proposed SVDiff method has a significantly smaller model size (1.7MB for StableDiffusion) compared to existing methods, making it more practical for real-world applications. + + + + Distilling Large Vision-Language Model with Out-of-Distribution Generalizability + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Distilling_Large_Vision-Language_Model_with_Out-of-Distribution_Generalizability_ICCV_2023_paper.pdf + Large vision-language models have achieved outstanding performance, but their size and computational requirements make their deployment on resource-constrained devices and time-sensitive tasks impractical. Model distillation, the process of creating smaller, faster models that maintain the performance of larger models, is a promising direction towards the solution. This paper investigates the distillation of visual representations in large teacher vision-language models into lightweight student models using a small- or mid-scale dataset. Notably, this study focuses on open-vocabulary out-of-distribution (OOD) generalization, a challenging problem that has been overlooked in previous model distillation literature. We propose two principles from vision and language modality perspectives to enhance student's OOD generalization: (1) by better imitating teacher's visual representation space, and carefully promoting better coherence in vision-language alignment with the teacher; (2) by enriching the teacher's language representations with informative and finegrained semantic attributes to effectively distinguish between different labels. We propose several metrics and conduct extensive experiments to investigate their techniques. 
The results demonstrate significant improvements in zero-shot and few-shot student performance on open-vocabulary out-of-distribution classification, highlighting the effectiveness of our proposed approaches. + + + + What do neural networks learn in image classification? A frequency shortcut perspective + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_What_do_neural_networks_learn_in_image_classification_A_frequency_ICCV_2023_paper.pdf + Frequency analysis is useful for understanding the mechanisms of representation learning in neural networks (NNs). Most research in this area focuses on the learning dynamics of NNs for regression tasks, while little for classification. This study empirically investigates the latter and expands the understanding of frequency shortcuts. First, we perform experiments on synthetic datasets, designed to have a bias in different frequency bands. Our results demonstrate that NNs tend to find simple solutions for classification, and what they learn first during training depends on the most distinctive frequency characteristics, which can be either low- or high-frequencies. Second, we confirm this phenomenon on natural images. We propose a metric to measure class-wise frequency characteristics and a method to identify frequency shortcuts. The results show that frequency shortcuts can be texture-based or shape-based, depending on what best simplifies the objective. Third, we validate the transferability of frequency shortcuts on out-of-distribution (OOD) test sets. Our results suggest that frequency shortcuts can be transferred across datasets and cannot be fully avoided by larger model capacity and data augmentation. We recommend that future research should focus on effective training schemes mitigating frequency shortcut learning. Codes and data are available at https://github.com/nis-research/nn-frequency-shortcuts. + + + + PromptCap: Prompt-Guided Image Captioning for VQA with GPT-3 + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_PromptCap_Prompt-Guided_Image_Captioning_for_VQA_with_GPT-3_ICCV_2023_paper.pdf + Knowledge-based visual question answering (VQA) involves questions that require world knowledge beyond the image to yield the correct answer. Large language models (LMs) like GPT-3 are particularly helpful for this task because of their strong knowledge retrieval and reasoning capabilities. To enable LM to understand images, prior work uses a captioning model to convert images into text. However, when summarizing an image in a single caption sentence, which visual entities to describe are often underspecified. Generic image captions often miss visual details essential for the LM to answer visual questions correctly. To address this challenge, we propose PromptCap (Prompt-guided image Captioning), a captioning model designed to serve as a better connector between images and black-box LMs. Different from generic captions, PromptCap takes a natural-language prompt to control the visual entities to describe in the generated caption. The prompt contains a question that the caption should aid in answering. To avoid extra annotation, PromptCap is trained by examples synthesized with GPT-3 and existing datasets. We demonstrate PromptCap's effectiveness on an existing pipeline in which GPT-3 is prompted with image captions to carry out VQA. PromptCap outperforms generic captions by a large margin and achieves state-of-the-art accuracy on knowledge-based VQA tasks (60.4% on OK-VQA and 59.6% on A-OKVQA). 
Zero-shot results on WebQA show that PromptCap generalizes well to unseen domains. + + + + Periodically Exchange Teacher-Student for Source-Free Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Periodically_Exchange_Teacher-Student_for_Source-Free_Object_Detection_ICCV_2023_paper.pdf + Source-free object detection (SFOD) aims to adapt the source detector to unlabeled target domain data in the absence of source domain data. Most SFOD methods follow the same self-training paradigm using the mean-teacher (MT) framework, where the student model is guided by only a single teacher model. However, such a paradigm can easily fall into a training instability problem: when the teacher model collapses uncontrollably due to domain shift, the student model also suffers drastic performance degradation. To address this issue, we propose the Periodically Exchange Teacher-Student (PETS) method, a simple yet novel approach that introduces a multiple-teacher framework consisting of a static teacher, a dynamic teacher, and a student model. During the training phase, we periodically exchange the weights between the static teacher and the student model. Then, we update the dynamic teacher using the moving average of the student model that has already been exchanged by the static teacher. In this way, the dynamic teacher can integrate knowledge from past periods, effectively reducing error accumulation and enabling a more stable training process within the MT-based framework. Further, we develop a consensus mechanism to merge the predictions of the two teacher models to provide higher-quality pseudo labels for the student model. Extensive experiments on multiple SFOD benchmarks show that the proposed method achieves state-of-the-art performance compared with other related methods, demonstrating the effectiveness and superiority of our method on the SFOD task. + + + + Learning to Transform for Generalizable Instance-wise Invariance + http://openaccess.thecvf.com//content/ICCV2023/papers/Singhal_Learning_to_Transform_for_Generalizable_Instance-wise_Invariance_ICCV_2023_paper.pdf + Computer vision research has long aimed to build systems that are robust to transformations found in natural data. Traditionally, this is done using data augmentation or hard-coding invariances into the architecture. However, too much or too little invariance can hurt, and the correct amount is unknown a priori and dependent on the instance. Ideally, the appropriate invariance would be learned from data and inferred at test-time. We treat invariance as a prediction problem. Given any image, we predict a distribution over transformations. We use variational inference to learn this distribution end-to-end. Combined with a graphical model approach, this distribution forms a flexible, generalizable, and adaptive form of invariance. Our experiments show that it can be used to align datasets and discover prototypes, adapt to out-of-distribution poses, and generalize invariances across classes. When used for data augmentation, our method shows consistent gains in accuracy and robustness on CIFAR 10, CIFAR10-LT, and TinyImageNet. + + + + Multiple Instance Learning Framework with Masked Hard Instance Mining for Whole Slide Image Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_Multiple_Instance_Learning_Framework_with_Masked_Hard_Instance_Mining_for_ICCV_2023_paper.pdf + Whole slide image (WSI) classification is often formulated as a multiple instance learning (MIL) problem.
Since the positive tissue is only a small fraction of the gigapixel WSI, existing MIL methods intuitively focus on identifying salient instances via attention mechanisms. However, this leads to a bias towards easy-to-classify instances while neglecting hard-to-classify instances. Some literature has revealed that hard examples are beneficial for modeling a discriminative boundary accurately. By applying such an idea at the instance level, we elaborate a novel MIL framework with masked hard instance mining (MHIM-MIL), which uses a Siamese structure (Teacher-Student) with a consistency constraint to explore the potential hard instances. With several instance masking strategies based on attention scores, MHIM-MIL employs a momentum teacher to implicitly mine hard instances for training the student model, which can be any attention-based MIL model. This counter-intuitive strategy essentially enables the student to learn a better discriminating boundary. Moreover, the student is used to update the teacher with an exponential moving average (EMA), which in turn identifies new hard instances for subsequent training iterations and stabilizes the optimization. Experimental results on the CAMELYON-16 and TCGA Lung Cancer datasets demonstrate that MHIM-MIL outperforms other latest methods in terms of performance and training cost. The code is available at: https://github.com/DearCaat/MHIM-MIL. + + + + Unsupervised Compositional Concepts Discovery with Text-to-Image Generative Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Unsupervised_Compositional_Concepts_Discovery_with_Text-to-Image_Generative_Models_ICCV_2023_paper.pdf + Text-to-image generative models have enabled high-resolution image synthesis across different domains, but require users to specify the content they wish to generate. In this paper, we consider the inverse problem - given a collection of different images, can we discover the generative concepts that represent each image? We present an unsupervised approach to discover generative concepts from a collection of images, disentangling different art styles in paintings, objects, and lighting from kitchen scenes, and discovering image classes given ImageNet images. We show how such generative concepts can accurately represent the content of images, be recombined and composed to generate new artistic and hybrid images, and be further used as a representation for downstream classification tasks. + + + + Partition-And-Debias: Agnostic Biases Mitigation via a Mixture of Biases-Specific Experts + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Partition-And-Debias_Agnostic_Biases_Mitigation_via_a_Mixture_of_Biases-Specific_Experts_ICCV_2023_paper.pdf + Bias mitigation in image classification has been widely researched, and existing methods have yielded notable results. However, most of these methods implicitly assume that a given image contains only one type of known or unknown bias, failing to consider the complexities of real-world biases. We introduce a more challenging scenario, agnostic biases mitigation, aiming at bias removal regardless of whether the type of bias or the number of types is unknown in the datasets. To address this difficult task, we present the Partition-and-Debias (PnD) method that uses a mixture of biases-specific experts to implicitly divide the bias space into multiple subspaces and a gating module to find a consensus among experts to achieve debiased classification. 
Experiments on both public and constructed benchmarks demonstrate the efficacy of PnD. Code is available at: https://github.com/Jiaxuan-Li/PnD. + + + + Spatial Self-Distillation for Object Detection with Inaccurate Bounding Boxes + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Spatial_Self-Distillation_for_Object_Detection_with_Inaccurate_Bounding_Boxes_ICCV_2023_paper.pdf + Object detection via inaccurate bounding box supervision has attracted broad interest due to the expense of high-quality annotation data and the occasional inevitability of low annotation quality (e.g. tiny objects). Previous works usually utilize multiple instance learning (MIL), which highly depends on category information, to select and refine a low-quality box. Those methods suffer from part domination, object drift and group prediction problems without exploring spatial information. In this paper, we heuristically propose a Spatial Self-Distillation based Object Detector (SSD-Det) to mine spatial information to refine the inaccurate box in a self-distillation fashion. SSD-Det utilizes a Spatial Position Self-Distillation (SPSD) module to exploit spatial information and an interactive structure to combine spatial information and category information, thus constructing a high-quality proposal bag. To further improve the selection procedure, a Spatial Identity Self-Distillation (SISD) module is introduced in SSD-Det to obtain spatial confidence to help select the best proposals. Experiments on the MS-COCO and VOC datasets with noisy box annotations verify our method's effectiveness, achieving state-of-the-art performance. The code is available at https://github.com/ucas-vg/PointTinyBenchmark/tree/SSD-Det. + + + + CC3D: Layout-Conditioned Generation of Compositional 3D Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Bahmani_CC3D_Layout-Conditioned_Generation_of_Compositional_3D_Scenes_ICCV_2023_paper.pdf + In this work, we introduce CC3D, a conditional generative model that synthesizes complex 3D scenes conditioned on 2D semantic scene layouts, trained using single-view images. Different from most existing 3D GANs that limit their applicability to aligned single objects, we focus on generating complex scenes with multiple objects, by modeling the compositional nature of 3D scenes. By devising a 2D layout-based approach for 3D synthesis and implementing a new 3D field representation with a stronger geometric inductive bias, we have created a 3D GAN that is both efficient and of high quality, while allowing for a more controllable generation process. Our evaluations on synthetic 3D-FRONT and real-world KITTI-360 datasets demonstrate that our model generates scenes of improved visual and geometric quality in comparison to previous works. + + + + TextPSG: Panoptic Scene Graph Generation from Textual Descriptions + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_TextPSG_Panoptic_Scene_Graph_Generation_from_Textual_Descriptions_ICCV_2023_paper.pdf + Panoptic Scene Graph has recently been proposed for comprehensive scene understanding. However, previous works adopt a fully-supervised learning manner, requiring large amounts of pixel-wise densely-annotated data, which is always tedious and expensive to obtain. To address this limitation, we study a new problem of Panoptic Scene Graph Generation from Purely Textual Descriptions (Caption-to-PSG). The key idea is to leverage the large collection of free image-caption data on the Web alone to generate panoptic scene graphs.
The problem is very challenging for three constraints: 1) no location priors; 2) no explicit links between visual regions and textual entities; and 3) no pre-defined concept sets. To tackle this problem, we propose a new framework TextPSG consisting of four modules, i.e., a region grouper, an entity grounder, a segment merger, and a label generator, with several novel techniques. The region grouper first groups image pixels into different segments and the entity grounder then aligns visual segments with language entities based on the textual description of the segment being referred to. The grounding results can thus serve as pseudo labels enabling the segment merger to learn the segment similarity as well as guiding the label generator to learn object semantics and relation predicates, resulting in a fine-grained structured scene understanding. Our framework is effective, significantly outperforming the baselines and achieving strong out-of-distribution robustness. We perform comprehensive ablation studies to corroborate the effectiveness of our design choices and provide an in-depth analysis to highlight future directions. Our code, data, and results are available on our project page: https://vis-www.cs.umass.edu/TextPSG. + + + + Cross-modal Latent Space Alignment for Image to Avatar Translation + http://openaccess.thecvf.com//content/ICCV2023/papers/de_Guevara_Cross-modal_Latent_Space_Alignment_for_Image_to_Avatar_Translation_ICCV_2023_paper.pdf + We present a novel method for automatic vectorized avatar generation from a single portrait image. Most existing approaches that create avatars rely on image-to-image translation methods, which present some limitations when applied to 3D rendering, animation, or video. Instead, we leverage modality-specific autoencoders trained on large-scale unpaired portraits and parametric avatars, and then learn a mapping between both modalities via an alignment module trained on a significantly smaller amount of data. The resulting cross-modal latent space preserves facial identity, producing more visually appealing and higher fidelity avatars than previous methods, as supported by our quantitative and qualitative evaluations. Moreover, our method's virtue of being resolution-independent makes it highly versatile and applicable in a wide range of settings. + + + + Inspecting the Geographical Representativeness of Images from Text-to-Image Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Basu_Inspecting_the_Geographical_Representativeness_of_Images_from_Text-to-Image_Models_ICCV_2023_paper.pdf + Recent progress in generative models has resulted in models that produce both realistic as well as relevant images for most textual inputs. These models are being used to generate millions of images everyday, and hold the potential to drastically impact areas such as generative art, digital marketing and data augmentation. Given their outsized impact, it is important to ensure that the generated content reflects the artifacts and surroundings across the globe, rather than over-representing certain parts of the world. In this paper, we measure the geographical representativeness of common nouns (e.g., a house) generated through DALL.E 2 and Stable Diffusion models using a crowdsourced study comprising 540 participants across 27 countries. 
For deliberately underspecified inputs without country names, the generated images most reflect the surroundings of the United States followed by India, and the top generations rarely reflect surroundings from all other countries (average score less than 3 out of 5). Specifying the country names in the input increases the representativeness by 1.44 points on average on a 5-point Likert scale for DALL.E 2 and 0.75 for Stable Diffusion, however, the overall scores for many countries still remain low, highlighting the need for future models to be more geographically inclusive. Lastly, we examine the feasibility of quantifying the geographical representativeness of generated images without conducting user studies. + + + + HSR-Diff: Hyperspectral Image Super-Resolution via Conditional Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_HSR-Diff_Hyperspectral_Image_Super-Resolution_via_Conditional_Diffusion_Models_ICCV_2023_paper.pdf + Despite the proven significance of hyperspectral images (HSIs) in performing various computer vision tasks, its potential is adversely affected by the low-resolution (LR) property in the spatial domain, resulting from multiple physical factors. Inspired by recent advancements in deep generative models, we propose an HSI Super-resolution (SR) approach with Conditional Diffusion Models (HSR-Diff) that merges a high-resolution (HR) multispectral image (MSI) with the corresponding LR-HSI. HSR-Diff generates an HR-HSI via repeated refinement, in which the HR-HSI is initialized with pure Gaussian noise and iteratively refined. At each iteration, the noise is removed with a Conditional Denoising Transformer (CDFormer) that is trained on denoising at different noise levels, conditioned on the hierarchical feature maps of HR-MSI and LR-HSI. In addition, a progressive learning strategy is employed to exploit the global information of full-resolution images. Systematic experiments have been conducted on four public datasets, demonstrating that HSR-Diff outperforms state-of-the-art methods. + + + + Advancing Example Exploitation Can Alleviate Critical Challenges in Adversarial Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Ge_Advancing_Example_Exploitation_Can_Alleviate_Critical_Challenges_in_Adversarial_Training_ICCV_2023_paper.pdf + Deep neural networks have achieved remarkable results across various tasks. However, they are susceptible to adversarial examples, which are generated by adding adversarial perturbations to original data. Adversarial training (AT) is the most effective defense mechanism against adversarial examples and has received significant attention. Recent studies highlight the importance of example exploitation, where the model's learning intensity is altered for specific examples to extend classic AT approaches. However, the analysis methodologies employed by these studies are varied and contradictory, which may lead to confusion in future research. To address this issue, we provide a comprehensive summary of representative strategies focusing on exploiting examples within a unified framework. Furthermore, we investigate the role of examples in AT and find that examples which contribute primarily to accuracy or robustness are distinct. Based on this finding, we propose a novel example-exploitation idea that can further improve the performance of advanced AT methods. 
This new idea suggests that critical challenges in AT, such as the accuracy-robustness trade-off, robust overfitting, and catastrophic overfitting, can be alleviated simultaneously from an example-exploitation perspective. The code can be found in https://github.com/geyao1995/advancing-example-exploitation-in-adversarial-training. + + + + ShiftNAS: Improving One-shot NAS via Probability Shift + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_ShiftNAS_Improving_One-shot_NAS_via_Probability_Shift_ICCV_2023_paper.pdf + One-shot Neural architecture search (One-shot NAS) has been proposed as a time-efficient approach to obtain optimal subnet architectures and weights under different complexity cases by training only once. However, the subnet performance obtained by weight sharing is often inferior to the performance achieved by retraining. In this paper, we investigate the performance gap and attribute it to the use of uniform sampling, which is a common approach in supernet training. Uniform sampling concentrates training resources on subnets with intermediate computational resources, which are sampled with high probability. However, subnets with different complexity regions require different optimal training strategies for optimal performance. To address the problem of uniform sampling, we propose ShiftNAS, a method that can adjust the sampling probability based on the complexity of subnets. We achieve this by evaluating the performance variation of subnets with different complexity and designing an architecture generator that can accurately and efficiently provide subnets with the desired complexity. Both the sampling probability and the architecture generator can be trained end-to-end in a gradient-based manner. With ShiftNAS, we can directly obtain the optimal model architecture and parameters for a given computational complexity. We evaluate our approach on multiple visual network models, including convolutional neural networks (CNNs) and vision transformers (ViTs), and demonstrate that ShiftNAS is model-agnostic. Experimental results on ImageNet show that ShiftNAS can improve the performance of one-shot NAS without additional computational consumption. Source codes are available at GitHub. + + + + Adaptive Testing of Computer Vision Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Adaptive_Testing_of_Computer_Vision_Models_ICCV_2023_paper.pdf + Vision models often fail systematically on groups of data that share common semantic characteristics (e.g., rare objects or unusual scenes), but identifying these failure modes is a challenge. We introduce AdaVision, an interactive process for testing vision models which helps users identify and fix coherent failure modes. Given a natural language description of a coherent group, AdaVision retrieves relevant images from LAION-5B with CLIP. The user then labels a small amount of data for model correctness, which is used in successive retrieval rounds to hill-climb towards high-error regions, refining the group definition. Once a group is saturated, AdaVision uses GPT-3 to suggest new group descriptions for the user to explore. We demonstrate the usefulness and generality of AdaVision in user studies, where users find major bugs in state-of-the-art classification, object detection, and image captioning models. These user-discovered groups have failure rates 2-3x higher than those surfaced by automatic error clustering methods. 
Finally, finetuning on examples found with AdaVision fixes the discovered bugs when evaluated on unseen examples, without degrading in-distribution accuracy, while also improving performance on out-of-distribution datasets. + + + + Feature Proliferation -- the "Cancer" in StyleGAN and its Treatments + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_Feature_Proliferation_--_the_Cancer_in_StyleGAN_and_its_Treatments_ICCV_2023_paper.pdf + Despite the success of StyleGAN in image synthesis, the images it synthesizes are not always perfect, and the well-known truncation trick has become a standard post-processing technique for StyleGAN to synthesize high-quality images. Although effective, it has long been noted that the truncation trick tends to reduce the diversity of synthesized images and unnecessarily sacrifices many distinct image features. To address this issue, in this paper, we first delve into the StyleGAN image synthesis mechanism and discover an important phenomenon, namely Feature Proliferation, which demonstrates how specific features reproduce with forward propagation. Then, we show how the occurrence of Feature Proliferation results in StyleGAN image artifacts. As an analogy, we refer to it as the "cancer" in StyleGAN owing to its proliferating and malignant nature. Finally, we propose a novel feature rescaling method that identifies and modulates risky features to mitigate feature proliferation. Thanks to our discovery of Feature Proliferation, the proposed feature rescaling method is less destructive and retains more useful image features than the truncation trick, as it is more fine-grained and works in a lower-level feature space rather than a high-level latent space. Experimental results justify the validity of our claims and the effectiveness of the proposed feature rescaling method. Our code is available at https://github.com/songc42/Feature-proliferation. + + + + Multi-Label Self-Supervised Learning with Scene Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Multi-Label_Self-Supervised_Learning_with_Scene_Images_ICCV_2023_paper.pdf + Self-supervised learning (SSL) methods targeting scene images have seen rapid growth recently, and they mostly rely on either a dedicated dense matching mechanism or a costly unsupervised object discovery module. This paper shows that instead of hinging on these strenuous operations, quality image representations can be learned by treating scene/multi-label image SSL simply as a multi-label classification problem, which greatly simplifies the learning framework. Specifically, multiple binary pseudo-labels are assigned for each input image by comparing its embeddings with those in two dictionaries, and the network is optimized using the binary cross entropy loss. The proposed method is named Multi-Label Self-supervised learning (MLS). Visualizations qualitatively show that the pseudo-labels produced by MLS can automatically find semantically similar pseudo-positive pairs across different images to facilitate contrastive learning. MLS learns high-quality representations on MS-COCO and achieves state-of-the-art results on classification, detection and segmentation benchmarks. At the same time, MLS is much simpler than existing methods, making it easier to deploy and explore further.
+ + + + Enhancing Fine-Tuning Based Backdoor Defense with Sharpness-Aware Minimization + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Enhancing_Fine-Tuning_Based_Backdoor_Defense_with_Sharpness-Aware_Minimization_ICCV_2023_paper.pdf + Backdoor defense, which aims to detect or mitigate the effect of malicious triggers introduced by attackers, is becoming increasingly critical for machine learning security and integrity. Fine-tuning based on benign data is a natural defense to erase the backdoor effect in a backdoored model. However, recent studies show that, given limited benign data, vanilla fine-tuning has poor defense performance. In this work, we first investigate the vanilla fine-tuning process for backdoor mitigation from the neuron weight perspective, and find that backdoor-related neurons are only slightly perturbed in the vanilla fine-tuning process, which explains its poor backdoor defense performance. To enhance the fine-tuning based defense, inspired by the observation that the backdoor-related neurons often have larger weight norms, we propose FT-SAM, a novel backdoor defense paradigm that aims to shrink the norms of backdoor-related neurons by incorporating sharpness-aware minimization with fine-tuning. We demonstrate the effectiveness of our method on several benchmark datasets and network architectures, where it achieves state-of-the-art defense performance, and provide extensive analysis to reveal FT-SAM's mechanism. Overall, our work provides a promising avenue for improving the robustness of machine learning models against backdoor attacks. Codes are available at https://github.com/SCLBD/BackdoorBench. + + + + Deep Geometry-Aware Camera Self-Calibration from Video + http://openaccess.thecvf.com//content/ICCV2023/papers/Hagemann_Deep_Geometry-Aware_Camera_Self-Calibration_from_Video_ICCV_2023_paper.pdf + Accurate intrinsic calibration is essential for camera-based 3D perception, yet it typically requires targets of well-known geometry. Here, we propose a camera self-calibration approach that infers camera intrinsics during application, from monocular videos in the wild. We propose to explicitly model projection functions and multi-view geometry, while leveraging the capabilities of deep neural networks for feature extraction and matching. To achieve this, we build upon recent research on integrating bundle adjustment into deep learning models, and introduce a self-calibrating bundle adjustment layer. The self-calibrating bundle adjustment layer optimizes camera intrinsics through classical Gauss-Newton steps and can be adapted to different camera models without re-training. As a specific realization, we implemented this layer within the deep visual SLAM system DROID-SLAM, and show that the resulting model, DroidCalib, yields state-of-the-art calibration accuracy across multiple public datasets. Our results suggest that the model generalizes to unseen environments and different camera models, including significant lens distortion. Thereby, the approach enables performing 3D perception tasks without prior knowledge about the camera. Code is available at https://github.com/boschresearch/droidcalib.
+ + + + Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Exploring_Object-Centric_Temporal_Modeling_for_Efficient_Multi-View_3D_Object_Detection_ICCV_2023_paper.pdf + In this paper, we propose a long-sequence modeling framework, named StreamPETR, for multi-view 3D object detection. Built upon the sparse query design in the PETR series, we systematically develop an object-centric temporal mechanism. The model is performed in an online manner and the long-term historical information is propagated through object queries frame by frame. Besides, we introduce a motion-aware layer normalization to model the movement of the objects. StreamPETR achieves significant performance improvements only with negligible computation cost, compared to the single-frame baseline. On the standard nuScenes benchmark, it is the first online multi-view method that achieves comparable performance (67.6% NDS & 65.3% AMOTA) with lidar-based methods. The lightweight version realizes 45.0% mAP and 31.7 FPS, outperforming the state-of-the-art method (SOLOFusion) by 2.3% mAP and 1.8x faster FPS. Code has been available at https://github.com/exiawsh/StreamPETR.git. + + + + ScanNet++: A High-Fidelity Dataset of 3D Indoor Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Yeshwanth_ScanNet_A_High-Fidelity_Dataset_of_3D_Indoor_Scenes_ICCV_2023_paper.pdf + We present ScanNet++, a large-scale dataset that couples together capture of high-quality and commodity-level geometry and color of indoor scenes. Each scene is captured with a high-end laser scanner at sub-millimeter resolution, along with registered 33-megapixel images from a DSLR camera, and RGB-D streams from an iPhone. Scene reconstructions are further annotated with an open vocabulary of semantics, with label-ambiguous scenarios explicitly annotated for comprehensive semantic understanding. ScanNet++ enables a new real-world benchmark for novel view synthesis, both from high-quality RGB capture, and importantly also from commodity-level images, in addition to a new benchmark for 3D semantic scene understanding that comprehensively encapsulates diverse and ambiguous semantic labeling scenarios. Currently, ScanNet++ contains 460 scenes, 280,000 captured DSLR images, and over 3.7M iPhone RGBD frames. + + + + Improving Diversity in Zero-Shot GAN Adaptation with Semantic Variations + http://openaccess.thecvf.com//content/ICCV2023/papers/Jeon_Improving_Diversity_in_Zero-Shot_GAN_Adaptation_with_Semantic_Variations_ICCV_2023_paper.pdf + Training deep generative models usually requires a large amount of data. To alleviate the data collection cost, the task of zero-shot GAN adaptation aims to reuse well-trained generators to synthesize images of an unseen target domain without any further training samples. Due to the data absence, the textual description of the target domain and the vision-language models, e.g., CLIP, are utilized to effectively guide the generator. However, with only a single representative text feature instead of real images, the synthesized images gradually lose diversity as the model is optimized, which is also known as mode collapse. To tackle the problem, we propose a novel method to find semantic variations of the target text in the CLIP space. Specifically, we explore diverse semantic variations based on the informative text feature of the target domain while regularizing the uncontrolled deviation of the semantic information. 
With the obtained variations, we design a novel directional moment loss that matches the first and second moments of image and text direction distributions. Moreover, we introduce elastic weight consolidation and a relation consistency loss to effectively preserve valuable content information from the source domain, e.g., appearances. Through extensive experiments, we demonstrate the efficacy of the proposed methods in ensuring sample diversity in various scenarios of zero-shot GAN adaptation. We also conduct ablation studies to validate the effect of each proposed component. Notably, our model achieves a new state-of-the-art on zero-shot GAN adaptation in terms of both diversity and quality. + + + + Vox-E: Text-Guided Voxel Editing of 3D Objects + http://openaccess.thecvf.com//content/ICCV2023/papers/Sella_Vox-E_Text-Guided_Voxel_Editing_of_3D_Objects_ICCV_2023_paper.pdf + Large scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses the power of latent diffusion models for editing existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation to conform to a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that combining this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object is challenging, as it requires achieving two conflicting goals while viewing only structure-and-appearance coupled 2D projections. Thus, we introduce a novel volumetric regularization loss that operates directly in 3D space, utilizing the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited object. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a myriad of edits which cannot be achieved by prior works. Our code and data will be made publicly available. + + + + Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation for Anomaly Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Unilaterally_Aggregated_Contrastive_Learning_with_Hierarchical_Augmentation_for_Anomaly_Detection_ICCV_2023_paper.pdf + Anomaly detection (AD), aiming to find samples that deviate from the training distribution, is essential in safety-critical applications. Though recent self-supervised learning based attempts achieve promising results by creating virtual outliers, their training objectives are less faithful to AD which requires a concentrated inlier distribution as well as a dispersive outlier distribution. In this paper, we propose Unilaterally Aggregated Contrastive Learning with Hierarchical Augmentation (UniCon-HA), taking into account both the requirements above. Specifically, we explicitly encourage the concentration of inliers and the dispersion of virtual outliers via supervised and unsupervised contrastive losses, respectively. 
Considering that standard contrastive data augmentation for generating positive views may induce outliers, we additionally introduce a soft mechanism to re-weight each augmented inlier according to its deviation from the inlier distribution, to ensure a purified concentration. Moreover, to prompt a higher concentration, inspired by curriculum learning, we adopt an easy-to-hard hierarchical augmentation strategy and perform contrastive aggregation at different depths of the network based on the strengths of data augmentation. Our method is evaluated under three AD settings including unlabeled one-class, unlabeled multi-class, and labeled multi-class, demonstrating its consistent superiority over other competitors. + + + + Learning Image-Adaptive Codebooks for Class-Agnostic Image Restoration + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Learning_Image-Adaptive_Codebooks_for_Class-Agnostic_Image_Restoration_ICCV_2023_paper.pdf + Recent work of discrete generative priors, in the form of codebooks, has shown exciting performance for image reconstruction and restoration, since the discrete prior space spanned by the codebooks increases the robustness against diverse image degradations. Nevertheless, these methods require separate training of codebooks for different image categories, which limits their use to specific image categories only (e.g. face, architecture, etc.), and fail to handle arbitrary natural images. In this paper, we propose AdaCode for learning image-adaptive codebooks for class-agnostic image restoration. Instead of learning a single codebook for all categories of images, we learn a set of basis codebooks. For a given input image, AdaCode learns a weight map with which we compute a weighted combination of these basis codebooks for adaptive image restoration. Intuitively, AdaCode is a more flexible and expressive discrete generative prior than previous work. Experimental results show that AdaCode achieves state-of-the-art performance on image reconstruction and restoration tasks, including image super-resolution and inpainting. + + + + 3D Segmentation of Humans in Point Clouds with Synthetic Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Takmaz_3D_Segmentation_of_Humans_in_Point_Clouds_with_Synthetic_Data_ICCV_2023_paper.pdf + Segmenting humans in 3D indoor scenes has become increasingly important with the rise of human-centered robotics and AR/VR applications. To this end, we propose the task of joint 3D human semantic segmentation, instance segmentation and multi-human body-part segmentation. Few works have attempted to directly segment humans in cluttered 3D scenes, which is largely due to the lack of annotated training data of humans interacting with 3D scenes. We address this challenge and propose a framework for generating training data of synthetic humans interacting with real 3D scenes. Furthermore, we propose a novel transformer-based model, Human3D, which is the first end-to-end model for segmenting multiple human instances and their body-parts in a unified manner. The key advantage of our synthetic data generation framework is its ability to generate diverse and realistic human-scene interactions, with highly accurate ground truth. Our experiments show that pre-training on synthetic data improves performance on a wide variety of 3D human segmentation tasks. Finally, we demonstrate that Human3D outperforms even task-specific state-of-the-art 3D segmentation methods. 
+ + + + Mastering Spatial Graph Prediction of Road Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Sotiris_Mastering_Spatial_Graph_Prediction_of_Road_Networks_ICCV_2023_paper.pdf + Accurately predicting road networks from satellite images requires a global understanding of the network topology. We propose to capture such high-level information by introducing a graph-based framework that, given a partially generated graph, sequentially adds new edges. To deal with misalignment between the model predictions and the intended purpose, and to optimize over complex, non-continuous metrics of interest, we adopt a reinforcement learning (RL) approach that nominates modifications that maximize a cumulative reward. As opposed to standard supervised techniques that tend to be more restricted to commonly used surrogate losses, our framework yields more power and flexibility to encode problem-dependent knowledge. Empirical results on several benchmark datasets demonstrate enhanced performance and increased high-level reasoning about the graph topology when using a tree-based search. We further demonstrate the superiority of our approach in handling examples with substantial occlusion and additionally provide evidence that our predictions better match the statistical properties of the ground-truth dataset. + + + + Domain Generalization via Rationale Invariance + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Domain_Generalization_via_Rationale_Invariance_ICCV_2023_paper.pdf + This paper offers a new perspective to ease the challenge of domain generalization, which involves maintaining robust results even in unseen environments. Our design focuses on the decision-making process in the final classifier layer. Specifically, we propose treating the element-wise contributions to the final results as the rationale for making a decision and representing the rationale for each sample as a matrix. For a well-generalized model, we suggest the rationale matrices for samples belonging to the same category should be similar, indicating the model relies on domain-invariant clues to make decisions, thereby ensuring robust results. To implement this idea, we introduce a rationale invariance loss as a simple regularization technique, requiring only a few lines of code. Our experiments demonstrate that the proposed approach achieves competitive results across various datasets, despite its simplicity. Code is available at https://github.com/liangchen527/RIDG. + + + + ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Upadhyay_ProbVLM_Probabilistic_Adapter_for_Frozen_Vison-Language_Models_ICCV_2023_paper.pdf + Large-scale vision-language models (VLMs) like CLIP successfully find correspondences between images and text. Through the standard deterministic mapping process, an image or a text sample is mapped to a single vector in the embedding space. This is problematic: as multiple samples (images or text) can abstract the same concept in the physical world, deterministic embeddings do not reflect the inherent ambiguity in the embedding space. We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained VLMs via inter/intra-modal alignment in a post-hoc manner without needing large-scale datasets or computing.
On four challenging datasets, i.e., COCO, Flickr, CUB, and Oxford-flowers, we estimate the multi-modal embedding uncertainties for two VLMs, i.e., CLIP and BLIP, quantify the calibration of embedding uncertainties in retrieval tasks and show that ProbVLM outperforms other methods. Furthermore, we propose active learning and model selection as two real-world downstream tasks for VLMs and show that the estimated uncertainty aids both tasks. Lastly, we present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model. + + + + Latent-OFER: Detect, Mask, and Reconstruct with Latent Vectors for Occluded Facial Expression Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Latent-OFER_Detect_Mask_and_Reconstruct_with_Latent_Vectors_for_Occluded_ICCV_2023_paper.pdf + Most research on facial expression recognition (FER) is conducted in highly controlled environments, but its performance is often unacceptable when applied to real-world situations. This is because when unexpected objects occlude the face, the FER network faces difficulties extracting facial features and accurately predicting facial expressions. Therefore, occluded FER (OFER) is a challenging problem. Previous studies on occlusion-aware FER have typically required fully annotated facial images for training. However, collecting facial images with various occlusions and expression annotations is time-consuming and expensive. Latent-OFER, the proposed method, can detect occlusions, restore occluded parts of the face as if they were unoccluded, and recognize them, improving FER accuracy. This approach involves three steps: First, the vision transformer (ViT)-based occlusion patch detector masks the occluded position by training only latent vectors from the unoccluded patches using the support vector data description algorithm. Second, the hybrid reconstruction network generates the masking position as a complete image using the ViT and convolutional neural network (CNN). Last, the expression-relevant latent vector extractor retrieves and uses expression-related information from all latent vectors by applying a CNN-based class activation map. This mechanism has a significant advantage in preventing performance degradation from occlusion by unseen objects. The experimental results on several databases demonstrate the superiority of the proposed method over state-of-the-art methods. + + + + Self-supervised Cross-view Representation Reconstruction for Change Captioning + http://openaccess.thecvf.com//content/ICCV2023/papers/Tu_Self-supervised_Cross-view_Representation_Reconstruction_for_Change_Captioning_ICCV_2023_paper.pdf + Change captioning aims to describe the difference between a pair of similar images. Its key challenge is how to learn a stable difference representation under pseudo changes caused by viewpoint change. In this paper, we address this by proposing a self-supervised cross-view representation reconstruction (SCORER) network. Concretely, we first design a multi-head token-wise matching to model relationships between cross-view features from similar/dissimilar images. Then, by maximizing cross-view contrastive alignment of two similar images, SCORER learns two view-invariant image representations in a self-supervised way. Based on these, we reconstruct the representations of unchanged objects by cross-attention, thus learning a stable difference representation for caption generation. 
Further, we devise a cross-modal backward reasoning module to improve the quality of the caption. This module reversely models a "hallucination" representation with the caption and the "before" representation. By pushing it closer to the "after" representation, we enforce the caption to be informative about the difference in a self-supervised manner. Extensive experiments show our method achieves state-of-the-art results on four datasets. The code is available at https://github.com/tuyunbin/SCORER. + + + + Unify, Align and Refine: Multi-Level Semantic Alignment for Radiology Report Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Unify_Align_and_Refine_Multi-Level_Semantic_Alignment_for_Radiology_Report_ICCV_2023_paper.pdf + Automatic radiology report generation has attracted enormous research interest due to its practical value in reducing the workload of radiologists. However, simultaneously establishing global correspondences between the image (e.g., Chest X-ray) and its related report and local alignments between image patches and keywords remains challenging. To this end, we propose a Unify, Align and then Refine (UAR) approach to learn multi-level cross-modal alignments and introduce three novel modules: Latent Space Unifier (LSU), Cross-modal Representation Aligner (CRA) and Text-to-Image Refiner (TIR). Specifically, LSU unifies multimodal data into discrete tokens, making it flexible to learn common knowledge among modalities with a shared network. The modality-agnostic CRA first learns discriminative features via a set of orthonormal bases and a dual-gate mechanism, and then globally aligns visual and textual representations under a triplet contrastive loss. TIR boosts token-level local alignment via calibrating text-to-image attention with a learnable mask. Additionally, we design a two-stage training procedure to make UAR gradually grasp cross-modal alignments at different levels, which imitates radiologists' workflow: writing sentence by sentence first and then checking word by word. Extensive experiments and analyses on the IU-Xray and MIMIC-CXR benchmark datasets demonstrate the superiority of our UAR against varied state-of-the-art methods. + + + + Scene-Aware Feature Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_Scene-Aware_Feature_Matching_ICCV_2023_paper.pdf + Current feature matching methods focus on point-level matching, pursuing better representation learning of individual features, but lacking further understanding of the scene. This results in significant performance degradation when handling challenging scenes such as scenes with large viewpoint and illumination changes. To tackle this problem, we propose a novel model named SAM, which applies attentional grouping to guide Scene-Aware feature Matching. SAM handles multi-level features, i.e., image tokens and group tokens, with attention layers, and groups the image tokens with the proposed token grouping module. Our model can be trained with ground-truth matches only and produces reasonable grouping results. With the scene-aware grouping guidance, SAM is not only more accurate and robust but also more interpretable than conventional feature matching models. Extensive experiments on various applications, including homography estimation, pose estimation, and image matching, demonstrate that our model achieves state-of-the-art performance.
+ + + + FDViT: Improve the Hierarchical Architecture of Vision Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_FDViT_Improve_the_Hierarchical_Architecture_of_Vision_Transformer_ICCV_2023_paper.pdf + Despite the fact that transformer-based models have yielded great success in computer vision tasks, they suffer from the challenge of high computational costs that limits their use on resource-constrained devices. One major reason is that vision transformers have redundant calculations since the self-attention operation generates patches with high similarity at a later stage in the network. Hierarchical architectures have been proposed for vision transformers to alleviate this challenge. However, by shrinking the spatial dimensions to half of the originals with downsampling layers, the challenge is actually overcompensated, as too much information is lost. In this paper, we propose FDViT to improve the hierarchical architecture of the vision transformer by using a flexible downsampling layer that is not limited to integer stride to smoothly reduce the sizes of the middle feature maps. Furthermore, a masked auto-encoder architecture is used to facilitate the training of the proposed flexible downsampling layer and produces informative outputs. Experimental results on benchmark datasets demonstrate that the proposed method can reduce computational costs while increasing classification performance and achieving state-of-the-art results. For example, the proposed FDViT-S model achieves a top-1 accuracy of 81.5%, which is 1.7 percent points higher than the ViT-S model and reduces 39% FLOPs. + + + + Towards Robust Model Watermark via Reducing Parametric Vulnerability + http://openaccess.thecvf.com//content/ICCV2023/papers/Gan_Towards_Robust_Model_Watermark_via_Reducing_Parametric_Vulnerability_ICCV_2023_paper.pdf + Deep neural networks are valuable assets considering their commercial benefits and huge demands for costly annotation and computation resources. To protect the copyright of DNNs, backdoor-based ownership verification becomes popular recently, in which the model owner can watermark the model by embedding a specific backdoor behavior before releasing it. The defenders (usually the model owners) can identify whether a suspicious third-party model is "stolen" from them based on the presence of the behavior. Unfortunately, these watermarks are proven to be vulnerable to removal attacks even like fine-tuning. To further explore this vulnerability, we investigate the parametric space and find there exist many watermark-removed models in the vicinity of the watermarked one, which may be easily used by removal attacks. Inspired by this finding, we propose a minimax formulation to find these watermark-removed models and recover their watermark behavior. Extensive experiments demonstrate that our method improves the robustness of the model watermarking against parametric changes and numerous watermark-removal attacks. The codes for reproducing our main experiments are available at https://github.com/GuanhaoGan/robust-model-watermarking. + + + + LEA2: A Lightweight Ensemble Adversarial Attack via Non-overlapping Vulnerable Frequency Regions + http://openaccess.thecvf.com//content/ICCV2023/papers/Qian_LEA2_A_Lightweight_Ensemble_Adversarial_Attack_via_Non-overlapping_Vulnerable_Frequency_ICCV_2023_paper.pdf + Recent work shows that well-designed adversarial examples can fool deep neural networks (DNNs). 
Due to their transferability, adversarial examples can also attack target models without extra information; such attacks are called black-box attacks. However, most existing ensemble attacks depend on numerous substitute models to cover the vulnerable subspace of a target model. In this work, we find three types of models with non-overlapping vulnerable frequency regions, which can cover a large enough vulnerable subspace. Based on this finding, we propose a lightweight ensemble adversarial attack named LEA2, which integrates standard, weakly robust, and robust models. Moreover, we analyze Gaussian noise from the perspective of frequency and find that Gaussian noise is located in the vulnerable frequency regions of standard models. Therefore, we substitute standard models with Gaussian noise to ensure the use of high-frequency vulnerable regions while reducing attack time consumption. Experiments on several image datasets indicate that LEA2 achieves better transferability under different defended models compared with extensive baselines and state-of-the-art attacks. + + + + Unsupervised Domain Adaptive Detection with Network Stability Analysis + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Unsupervised_Domain_Adaptive_Detection_with_Network_Stability_Analysis_ICCV_2023_paper.pdf + Domain adaptive detection aims to improve the generality of a detector, learned from the labeled source domain, on the unlabeled target domain. In this work, drawing inspiration from the concept of stability in control theory, which requires a robust system to remain consistent both externally and internally regardless of disturbances, we propose a novel framework that achieves unsupervised domain adaptive detection through stability analysis. Specifically, we treat discrepancies between images and regions from different domains as disturbances, and introduce a simple yet effective Network Stability Analysis (NSA) framework that considers various disturbances for domain adaptation. Particularly, we explore three types of perturbations including heavy and light image-level disturbances and instance-level disturbance. For each type, NSA performs external consistency analysis on the outputs from raw and perturbed images and/or internal consistency analysis on their features, using teacher-student models. By integrating NSA into Faster R-CNN, we immediately achieve state-of-the-art results. In particular, we set a new record of 52.7% mAP on Cityscapes-to-FoggyCityscapes, showing the potential of NSA for domain adaptive detection. It is worth noting that NSA is designed to be general-purpose, and is thus applicable to one-stage detection models (e.g., FCOS) besides the adopted one, as shown by experiments. Code is released at https://github.com/tiankongzhang/NSA. + + + + MeViS: A Large-scale Benchmark for Video Segmentation with Motion Expressions + http://openaccess.thecvf.com//content/ICCV2023/papers/Ding_MeViS_A_Large-scale_Benchmark_for_Video_Segmentation_with_Motion_Expressions_ICCV_2023_paper.pdf + This paper strives for motion expressions guided video segmentation, which focuses on segmenting objects in video content based on a sentence describing the motion of the objects. Existing referring video object datasets typically focus on salient objects and use language expressions that contain excessive static attributes that could potentially enable the target object to be identified in a single frame. These datasets downplay the importance of motion in video content for language-guided video object segmentation.
To investigate the feasibility of using motion expressions to ground and segment objects in videos, we propose a large-scale dataset called MeViS, which contains numerous motion expressions to indicate target objects in complex environments. We benchmarked 5 existing referring video object segmentation (RVOS) methods and conducted a comprehensive comparison on the MeViS dataset. The results show that current RVOS methods cannot effectively address motion expression-guided video segmentation. We further analyze the challenges and propose a baseline approach for the proposed MeViS dataset. The goal of our benchmark is to provide a platform that enables the development of effective language-guided video segmentation algorithms that leverage motion expressions as a primary cue for object segmentation in complex video scenes. The proposed MeViS dataset has been released at https://henghuiding.github.io/MeViS. + + + + OPERA: Omni-Supervised Representation Learning with Hierarchical Supervisions + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_OPERA_Omni-Supervised_Representation_Learning_with_Hierarchical_Supervisions_ICCV_2023_paper.pdf + The pretrain-finetune paradigm in modern computer vision facilitates the success of self-supervised learning, which tends to achieve better transferability than supervised learning. However, with the availability of massive labeled data, a natural question emerges: how to train a better model with both self and full supervision signals? In this paper, we propose Omni-suPErvised Representation leArning with hierarchical supervisions (OPERA) as a solution. We provide a unified perspective of supervisions from labeled and unlabeled data and propose a unified framework of fully supervised and self-supervised learning. We extract a set of hierarchical proxy representations for each image and impose self and full supervisions on the corresponding proxy representations. Extensive experiments on both convolutional neural networks and vision transformers demonstrate the superiority of OPERA in image classification, segmentation, and object detection. + + + + GPFL: Simultaneously Learning Global and Personalized Feature Information for Personalized Federated Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_GPFL_Simultaneously_Learning_Global_and_Personalized_Feature_Information_for_Personalized_ICCV_2023_paper.pdf + Federated Learning (FL) is popular for its privacy-preserving and collaborative learning capabilities. Recently, personalized FL (pFL) has received attention for its ability to address statistical heterogeneity and achieve personalization in FL. However, from the perspective of feature extraction, most existing pFL methods only focus on extracting global or personalized feature information during local training, which fails to meet the collaborative learning and personalization goals of pFL. To address this, we propose a new pFL method, named GPFL, to simultaneously learn global and personalized feature information on each client. We conduct extensive experiments on six datasets in three statistically heterogeneous settings and show the superiority of GPFL over ten state-of-the-art methods regarding effectiveness, scalability, fairness, stability, and privacy. Besides, GPFL mitigates overfitting and outperforms the baselines by up to 8.99% in accuracy. 
+ + + + Efficient Region-Aware Neural Radiance Fields for High-Fidelity Talking Portrait Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Efficient_Region-Aware_Neural_Radiance_Fields_for_High-Fidelity_Talking_Portrait_Synthesis_ICCV_2023_paper.pdf + This paper presents ER-NeRF, a novel conditional Neural Radiance Fields (NeRF) based architecture for talking portrait synthesis that can concurrently achieve fast convergence, real-time rendering, and state-of-the-art performance with small model size. Our idea is to explicitly exploit the unequal contribution of spatial regions to guide talking portrait modeling. Specifically, to improve the accuracy of dynamic head reconstruction, a compact and expressive NeRF-based Tri-Plane Hash Representation is introduced by pruning empty spatial regions with three planar hash encoders. For speech audio, we propose a Region Attention Module to generate region-aware condition feature via an attention mechanism. Different from existing methods that utilize an MLP-based encoder to learn the cross-modal relation implicitly, the attention mechanism builds an explicit connection between audio features and spatial regions to capture the priors of local motions. Moreover, a direct and fast Adaptive Pose Encoding is introduced to optimize the head-torso separation problem by mapping the complex transformation of the head pose into spatial coordinates. Extensive experiments demonstrate that our method renders better high-fidelity and audio-lips synchronized talking portrait videos, with realistic details and high efficiency compared to previous methods. + + + + End2End Multi-View Feature Matching with Differentiable Pose Optimization + http://openaccess.thecvf.com//content/ICCV2023/papers/Roessle_End2End_Multi-View_Feature_Matching_with_Differentiable_Pose_Optimization_ICCV_2023_paper.pdf + Erroneous feature matches have severe impact on subsequent camera pose estimation and often require additional, time-costly measures, like RANSAC, for outlier rejection. Our method tackles this challenge by addressing feature matching and pose optimization jointly. To this end, we propose a graph attention network to predict image correspondences along with confidence weights. The resulting matches serve as weighted constraints in a differentiable pose estimation. Training feature matching with gradients from pose optimization naturally learns to down-weight outliers and boosts pose estimation on image pairs compared to SuperGlue by 6.7% on ScanNet. At the same time, it reduces the pose estimation time by over 50% and renders RANSAC iterations unnecessary. Moreover, we integrate information from multiple views by spanning the graph across multiple frames to predict the matches all at once. Multi-view matching combined with end-to-end training improves the pose estimation metrics on Matterport3D by 18.5% compared to SuperGlue. + + + + Exploring the Benefits of Visual Prompting in Differential Privacy + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Exploring_the_Benefits_of_Visual_Prompting_in_Differential_Privacy_ICCV_2023_paper.pdf + Visual Prompting (VP) is an emerging and powerful technique that allows sample-efficient adaptation to downstream tasks by engineering a well-trained frozen source model. In this work, we explore the benefits of VP in constructing compelling neural network classifiers with differential privacy (DP). We explore and integrate VP into canonical DP training methods and demonstrate its simplicity and efficiency. 
In particular, we discover that VP in tandem with PATE, a state-of-the-art DP training method that leverages the knowledge transfer from an ensemble of teachers, achieves the state-of-the-art privacy-utility trade-off with minimal expenditure of the privacy budget. Moreover, we conduct additional experiments on cross-domain image classification with a sufficient domain gap to further unveil the advantage of VP in DP. Lastly, we also conduct extensive ablation studies to validate the effectiveness and contribution of VP under DP consideration. + + + + Mining bias-target Alignment from Voronoi Cells + http://openaccess.thecvf.com//content/ICCV2023/papers/Nahon_Mining_bias-target_Alignment_from_Voronoi_Cells_ICCV_2023_paper.pdf + Despite significant research efforts, deep neural networks remain vulnerable to biases: this raises concerns about their fairness and limits their generalization. In this paper, we propose a bias-agnostic approach to mitigate the impact of biases in deep neural networks. Unlike traditional debiasing approaches, we rely on a metric to quantify "bias alignment/misalignment" on target classes and use this information to discourage the propagation of bias-target alignment information through the network. We conduct experiments on several commonly used datasets for debiasing and compare our method with supervised and bias-specific approaches. Our results indicate that the proposed method achieves comparable performance to state-of-the-art supervised approaches, despite being bias-agnostic, even in the presence of multiple biases in the same sample. + + + + The Victim and The Beneficiary: Exploiting a Poisoned Model to Train a Clean Model on Poisoned Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_The_Victim_and_The_Beneficiary_Exploiting_a_Poisoned_Model_to_ICCV_2023_paper.pdf + Recently, backdoor attacks have posed a serious security threat to the training process of deep neural networks (DNNs). The attacked model behaves normally on benign samples but outputs a specific result when the trigger is present. However, compared with the rapid progress of backdoor attacks, existing defenses struggle to deal with these threats effectively or require benign samples to work, which may be unavailable in real scenarios. In this paper, we find that poisoned samples and benign samples can be distinguished by prediction entropy. This inspires us to propose a novel dual-network training framework: The Victim and The Beneficiary (V&B), which exploits a poisoned model to train a clean model without extra benign samples. Firstly, we sacrifice the Victim network to serve as a powerful poisoned-sample detector by training on suspicious samples. Secondly, we train the Beneficiary network on the credible samples selected by the Victim to inhibit backdoor injection. Thirdly, a semi-supervised suppression strategy is adopted for erasing potential backdoors and improving model performance. Furthermore, to better inhibit missed poisoned samples, we propose a strong data augmentation method, AttentionMix, which works well with our proposed V&B framework. Extensive experiments on two widely used datasets against 6 state-of-the-art attacks demonstrate that our framework is effective in preventing backdoor injection and robust to various attacks while maintaining performance on benign samples. Our code is available at https://github.com/Zixuan-Zhu/VaB.
+ + + + DIFFGUARD: Semantic Mismatch-Guided Out-of-Distribution Detection Using Pre-Trained Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_DIFFGUARD_Semantic_Mismatch-Guided_Out-of-Distribution_Detection_Using_Pre-Trained_Diffusion_Models_ICCV_2023_paper.pdf + Given a classifier, the inherent property of semantic Out-of-Distribution (OOD) samples is that their contents differ from all legal classes in terms of semantics, namely semantic mismatch. A recent work directly applies it to OOD detection, employing a conditional Generative Adversarial Network (cGAN) to enlarge semantic mismatch in the image space. While achieving remarkable OOD detection performance on small datasets, it is not applicable to ImageNet-scale datasets due to the difficulty in training cGANs with both input images and labels as conditions. As diffusion models are much easier to train and amenable to various conditions compared to cGANs, in this work, we propose to directly use pre-trained diffusion models for semantic mismatch-guided OOD detection, named DiffGuard. Specifically, given an OOD input image and the predicted label from the classifier, we try to enlarge the semantic difference between the reconstructed OOD image under these conditions and the original input image. We also present several test-time techniques to further strengthen such differences. Experimental results show that DiffGuard is effective on both CIFAR-10 and hard cases of the large-scale ImageNet, and it can be easily combined with existing OOD detection techniques to achieve state-of-the-art OOD detection results. + + + + Tracking Anything with Decoupled Video Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_Tracking_Anything_with_Decoupled_Video_Segmentation_ICCV_2023_paper.pdf + Training data for video segmentation are expensive to annotate. This impedes extensions of end-to-end algorithms to new video segmentation tasks, especially in large-vocabulary settings. To 'track anything' without training on video data for every individual task, we develop a decoupled video segmentation approach (DEVA), composed of task-specific image-level segmentation and class/task-agnostic bi-directional temporal propagation. Due to this design, we only need an image-level model for the target task (which is cheaper to train) and a universal temporal propagation model which is trained once and generalizes across tasks. To effectively combine these two modules, we use bi-directional propagation for (semi-)online fusion of segmentation hypotheses from different frames to generate a coherent segmentation. We show that this decoupled formulation compares favorably to end-to-end approaches in several data-scarce tasks including large-vocabulary video panoptic segmentation, open-world video segmentation, referring video segmentation, and unsupervised video object segmentation. Code is available at: https://hkchengrex.github.io/Tracking-Anything-with-DEVA. + + + + Generative Gradient Inversion via Over-Parameterized Networks in Federated Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Generative_Gradient_Inversion_via_Over-Parameterized_Networks_in_Federated_Learning_ICCV_2023_paper.pdf + Federated learning has gained recognition as a secure approach for safeguarding local private data in collaborative learning.
However, the advent of gradient inversion research has posed significant challenges to this premise by enabling a third party to recover ground-truth images via gradients. While prior research has predominantly focused on low-resolution images and small batch sizes, this study highlights the feasibility of reconstructing complex images with high resolutions and large batch sizes. The success of the proposed method is contingent on constructing an over-parameterized convolutional network, so that images are generated before fitting to the gradient matching requirement. Practical experiments demonstrate that the proposed algorithm achieves high-fidelity image recovery, surpassing state-of-the-art competitors that commonly fail in more intricate scenarios. Consequently, our study shows that local participants in a federated learning system are vulnerable to potential data leakage issues. Source code is available at https://github.com/czhang024/CI-Net. + + + + EQ-Net: Elastic Quantization Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_EQ-Net_Elastic_Quantization_Neural_Networks_ICCV_2023_paper.pdf + Current model quantization methods have shown their promising capability in reducing storage space and computation complexity. However, due to the diversity of quantization forms supported by different hardware, one limitation of existing solutions is that they usually require repeated optimization for different scenarios. How to construct a model with flexible quantization forms has been less studied. In this paper, we explore a one-shot network quantization regime, named Elastic Quantization Neural Networks (EQ-Net), which aims to train a robust weight-sharing quantization supernet. First of all, we propose an elastic quantization space (including elastic bit-width, granularity, and symmetry) to adapt to various mainstream quantization forms. Secondly, we propose the Weight Distribution Regularization Loss (WDR-Loss) and Group Progressive Guidance Loss (GPG-Loss) to bridge the inconsistency of the distributions of weights and output logits across the elastic quantization space. Lastly, we incorporate genetic algorithms and the proposed Conditional Quantization-Aware Accuracy Predictor (CQAP) as an estimator to quickly search mixed-precision quantized neural networks in the supernet. Extensive experiments demonstrate that our EQ-Net is close to or even better than its static counterparts as well as state-of-the-art robust bit-width methods. Code is available at https://github.com/xuke225/EQ-Net.git + + + + Exploring Open-Vocabulary Semantic Segmentation from CLIP Vision Encoder Distillation Only + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Exploring_Open-Vocabulary_Semantic_Segmentation_from_CLIP_Vision_Encoder_Distillation_Only_ICCV_2023_paper.pdf + Semantic segmentation is a crucial task in computer vision that involves segmenting images into semantically meaningful regions at the pixel level. However, existing approaches often rely on expensive human annotations as supervision for model training, limiting their scalability to large, unlabeled datasets. To address this challenge, we present ZeroSeg, a novel method that leverages the existing pretrained vision-language (VL) model (e.g. CLIP vision encoder) to train open-vocabulary zero-shot semantic segmentation models.
Although these VL models have acquired extensive knowledge of visual concepts, it is non-trivial to transfer this knowledge to the task of semantic segmentation, as they are usually trained at the image level. ZeroSeg overcomes this by distilling the visual concepts learned by VL models into a set of segment tokens, each summarizing a localized region of the target image. We evaluate ZeroSeg on multiple popular segmentation benchmarks, including PASCAL VOC 2012, PASCAL Context, and COCO, in a zero-shot manner. Our approach achieves state-of-the-art performance compared to other zero-shot segmentation methods under the same training data, while also performing competitively compared to strongly supervised methods. Finally, we also demonstrate the effectiveness of ZeroSeg on open-vocabulary segmentation, through both human studies and qualitative visualizations. The code is publicly available at https://github.com/facebookresearch/ZeroSeg + + + + Parallax-Tolerant Unsupervised Deep Image Stitching + http://openaccess.thecvf.com//content/ICCV2023/papers/Nie_Parallax-Tolerant_Unsupervised_Deep_Image_Stitching_ICCV_2023_paper.pdf + Traditional image stitching approaches tend to leverage increasingly complex geometric features (point, line, edge, etc.) for better performance. However, these hand-crafted features are only suitable for specific natural scenes with adequate geometric structures. In contrast, deep stitching schemes overcome adverse conditions by adaptively learning robust semantic features, but they cannot handle large-parallax cases. To solve these issues, we propose a parallax-tolerant unsupervised deep image stitching technique. First, we propose a robust and flexible warp to model the image registration from global homography to local thin-plate spline motion. It provides accurate alignment for overlapping regions and shape preservation for non-overlapping regions by joint optimization concerning alignment and distortion. Subsequently, to improve the generalization capability, we design a simple but effective iterative strategy to enhance the warp adaptation in cross-dataset and cross-resolution applications. Finally, to further eliminate the parallax artifacts, we propose to composite the stitched image seamlessly by unsupervised learning for seam-driven composition masks. Compared with existing methods, our solution is parallax-tolerant and free from laborious designs of complicated geometric features for specific scenes. Extensive experiments show our superiority over the SoTA methods, both quantitatively and qualitatively. The code will be available soon. + + + + M2T: Masking Transformers Twice for Faster Decoding + http://openaccess.thecvf.com//content/ICCV2023/papers/Mentzer_M2T_Masking_Transformers_Twice_for_Faster_Decoding_ICCV_2023_paper.pdf + We show how bidirectional transformers trained for masked token prediction can be applied to neural image compression to achieve state-of-the-art results. Such models were previously used for image generation by progressively sampling groups of masked tokens according to uncertainty-adaptive schedules. Unlike these works, we demonstrate that predefined, deterministic schedules perform as well or better for image compression. This insight allows us to use masked attention during training in addition to masked inputs, and activation caching during inference, to significantly speed up our models (4x higher inference speed) at a small increase in bitrate.
+ + + + CoIn: Contrastive Instance Feature Mining for Outdoor 3D Object Detection with Very Limited Annotations + http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_CoIn_Contrastive_Instance_Feature_Mining_for_Outdoor_3D_Object_Detection_ICCV_2023_paper.pdf + Recently, 3D object detection with sparse annotations has received great attention. However, current detectors usually perform poorly under very limited annotations. To address this problem, we propose a novel Contrastive Instance feature mining method, named CoIn. To better identify indistinguishable features learned through limited supervision, we design a Multi-Class contrastive learning module (MCcont) to enhance feature discrimination. Meanwhile, we propose a feature-level pseudo-label mining framework consisting of an instance feature mining module (InF-Mining) and a Labeled-to-Pseudo contrastive learning module (LPcont). These two modules exploit latent instances in feature space to supervise the training of detectors with limited annotations. Extensive experiments with KITTI dataset, Waymo open dataset, and nuScenes dataset show that under limited annotations, our method greatly improves the performance of baseline detectors: CenterPoint, Voxel-RCNN, and CasA. Combining CoIn with an iterative training strategy, we propose a CoIn++ pipeline, which requires only 2% annotations in the KITTI dataset to achieve performance comparable to the fully supervised methods. The code is available at https://github.com/xmuqimingxia/CoIn. + + + + Computation and Data Efficient Backdoor Attacks + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Computation_and_Data_Efficient_Backdoor_Attacks_ICCV_2023_paper.pdf + Backdoor attacks against deep learning have been widely studied. Various attack techniques have been proposed for different domains and paradigms, e.g., image, point cloud, natural language processing, transfer learning, etc. These works normally adopt the data poisoning strategy to embed the backdoor. They randomly select samples from the benign training set for poisoning, without considering the distinct contribution of each sample to the backdoor effectiveness, making the attack less optimal. A recent work (IJCAI-22) proposed to use the forgetting score to measure the importance of each poisoned sample and then filter out redundant data for effective backdoor training. However, this method is empirically designed without theoretical proofing. It is also very time-consuming as it needs to go through almost all the training stages for data selection. To address such limitations, we propose a novel confidence-based scoring methodology, which can efficiently measure the contribution of each poisoning sample based on the distance posteriors. We further introduce a greedy search algorithm to find the most informative samples for backdoor injection more promptly. Experimental evaluations on both 2D image and 3D point cloud classification tasks show that our approach can achieve comparable performance or even surpass the forgetting score-based searching method while requiring only several extra epochs' computation of a standard training process. 
+ + + + Decouple Before Interact: Multi-Modal Prompt Learning for Continual Visual Question Answering + http://openaccess.thecvf.com//content/ICCV2023/papers/Qian_Decouple_Before_Interact_Multi-Modal_Prompt_Learning_for_Continual_Visual_Question_ICCV_2023_paper.pdf + In the real world, a desirable Visual Question Answering model is expected to provide correct answers to new questions and images in a continual setting (recognized as CL-VQA). However, existing works formulate CLVQA from a vision-only or language-only perspective, and straightforwardly apply the uni-modal continual learning (CL) strategies to this multi-modal task, which is improper and suboptimal. On the one hand, such a partial formulation may result in limited evaluations. On the other hand, neglecting the interactions between modalities will lead to poor performance. To tackle these challenging issues, we propose a comprehensive formulation for CL-VQA from the perspective of multi-modal vision-language fusion. Based on our formulation, we further propose MulTi-Modal PRompt LearnIng with DecouPLing bEfore InTeraction (TRIPLET), a novel approach that builds on a pre-trained vision-language model and consists of decoupled prompts and prompt interaction strategies to capture the complex interactions between modalities. In particular, decoupled prompts contain learnable parameters that are decoupled w.r.t different aspects, and the prompt interaction strategies are in charge of modeling interactions between inputs and prompts. Additionally, we build two CL-VQA benchmarks for a more comprehensive evaluation. Extensive experiments demonstrate that our TRIPLET outperforms state-of-the-art methods in both uni-modal and multi-modal continual settings for CL-VQA. + + + + Unsupervised Manifold Linearizing and Clustering + http://openaccess.thecvf.com//content/ICCV2023/papers/Ding_Unsupervised_Manifold_Linearizing_and_Clustering_ICCV_2023_paper.pdf + We consider the problem of simultaneously clustering and learning a linear representation of data lying close to a union of low-dimensional manifolds, a fundamental task in machine learning and computer vision. When the manifolds are assumed to be linear subspaces, this reduces to the classical problem of subspace clustering, which has been studied extensively over the past two decades. Unfortunately, many real-world datasets such as natural images can not be well approximated by linear subspaces. On the other hand, numerous works have attempted to learn an appropriate transformation of the data, such that data is mapped from a union of general non-linear manifolds to a union of linear subspaces (with points from the same manifold being mapped to the same subspace). However, many existing works have limitations such as assuming knowledge of the membership of samples to clusters, requiring high sampling density, or being shown theoretically to learn trivial representations. In this paper, we propose to optimize the Maximal Coding Rate Reduction metric with respect to both the data representation and a novel doubly stochastic cluster membership, inspired by state-of-the-art subspace clustering results. We give a parameterization of such a representation and membership, allowing efficient mini-batching and one-shot initialization. Experiments on CIFAR-10, -20, -100, and TinyImageNet-200 datasets show that the proposed method is much more accurate and scalable than state-of-the-art deep clustering methods, and further learns a latent linear representation of the data. 
+ + + + MMVP: Motion-Matrix-Based Video Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhong_MMVP_Motion-Matrix-Based_Video_Prediction_ICCV_2023_paper.pdf + A central challenge of video prediction lies where the system has to reason the object's future motion from image frames while simultaneously maintaining the consistency of its appearance across frames. This work introduces an end-to-end trainable two-stream video prediction framework, Motion-Matrix-based Video Prediction (MMVP), to tackle this challenge. Unlike previous methods that usually handle motion prediction and appearance maintenance within the same set of modules, MMVP decouples motion and appearance information by constructing appearance-agnostic motion matrices. The motion matrices represent the temporal similarity of each and every pair of feature patches in the input frames, and are the sole input of the motion prediction module in MMVP. This design improves video prediction in both accuracy and efficiency, and reduces the model size. Results of extensive experiments demonstrate that MMVP outperforms state-of-the-art systems on public data sets by non-negligible large margins (approx. 1 db in PSNR, UCF Sports) in significantly smaller model sizes (84% the size or smaller). Please refer to https://github.com/Kay1794/MMVP-motion-matrix-based-video-prediction for the official code and the datasets used in this paper. + + + + Human Preference Score: Better Aligning Text-to-Image Models with Human Preference + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Human_Preference_Score_Better_Aligning_Text-to-Image_Models_with_Human_Preference_ICCV_2023_paper.pdf + Recent years have witnessed a rapid growth of deep generative models, with text-to-image models gaining significant attention from the public. However, existing models often generate images that do not align well with human preferences, such as awkward combinations of limbs and facial expressions. To address this issue, we collect a dataset of human choices on generated images from the Stable Foundation Discord channel. Our experiments demonstrate that current evaluation metrics for generative models do not correlate well with human choices. Thus, we train a human preference classifier with the collected dataset and derive a Human Preference Score (HPS) based on the classifier. Using HPS, we propose a simple yet effective method to adapt Stable Diffusion to better align with human preferences. Our experiments show that HPS outperforms CLIP in predicting human choices and has good generalization capability toward images generated from other models. By tuning Stable Diffusion with the guidance of HPS, the adapted model is able to generate images that are more preferred by human users. The project page is available here: https://tgxs002.github.io/align_sd_web/. + + + + Guided Motion Diffusion for Controllable Human Motion Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Karunratanakul_Guided_Motion_Diffusion_for_Controllable_Human_Motion_Synthesis_ICCV_2023_paper.pdf + Denoising diffusion models have shown great promise in human motion synthesis conditioned on natural language descriptions. However, integrating spatial constraints, such as pre-defined motion trajectories and obstacles, remains a challenge despite being essential for bridging the gap between isolated human motion and its surrounding environment. 
To address this issue, we propose Guided Motion Diffusion (GMD), a method that incorporates spatial constraints into the motion generation process. Specifically, we propose an effective feature projection scheme that manipulates motion representation to enhance the coherency between spatial information and local poses. Together with a new imputation formulation, the generated motion can reliably conform to spatial constraints such as global motion trajectories. Furthermore, given sparse spatial constraints (e.g. sparse keyframes), we introduce a new dense guidance approach to turn a sparse signal, which is susceptible to being ignored during the reverse steps, into denser signals to guide the generated motion to the given constraints. Our extensive experiments justify the development of GMD, which achieves a significant improvement over state-of-the-art methods in text-based motion generation while allowing control of the synthesized motions with spatial constraints. + + + + DiffuMask: Synthesizing Images with Pixel-level Annotations for Semantic Segmentation Using Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_DiffuMask_Synthesizing_Images_with_Pixel-level_Annotations_for_Semantic_Segmentation_Using_ICCV_2023_paper.pdf + Collecting and annotating images with pixel-wise labels is time-consuming and laborious. In contrast, synthetic data can be freely available using a generative model (e.g., DALL-E, Stable Diffusion). In this paper, we show that it is possible to automatically obtain accurate semantic masks of synthetic images generated by the pre-trained Stable Diffusion, which uses only text-image pairs during training. Our approach, called DiffuMask, exploits the potential of the cross-attention map between text and image, which makes it natural and seamless to extend text-driven image synthesis to semantic mask generation. DiffuMask uses text-guided cross-attention information to localize class/word-specific regions, which are combined with practical techniques to create a novel high-resolution and class-discriminative pixel-wise mask. The method substantially reduces data collection and annotation costs. Experiments demonstrate that existing segmentation methods trained on the synthetic data of DiffuMask can achieve performance competitive with their counterparts trained on real data (VOC 2012, Cityscapes). For some classes (e.g., bird), DiffuMask presents promising performance, close to the state-of-the-art results on real data (within a 3% mIoU gap). Moreover, in the open-vocabulary segmentation (zero-shot) setting, DiffuMask achieves a new SOTA result on the Unseen class of VOC 2012. + + + + StyleDomain: Efficient and Lightweight Parameterizations of StyleGAN for One-shot and Few-shot Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Alanov_StyleDomain_Efficient_and_Lightweight_Parameterizations_of_StyleGAN_for_One-shot_and_ICCV_2023_paper.pdf + Domain adaptation of GANs is a problem of fine-tuning GAN models pretrained on a large dataset (e.g. StyleGAN) to a specific domain with few samples (e.g. painting faces, sketches, etc.). While there are many methods that tackle this problem in different ways, there are still many important questions that remain unanswered. In this paper, we provide a systematic and in-depth analysis of the domain adaptation problem of GANs, focusing on the StyleGAN model.
We perform a detailed exploration of the most important parts of StyleGAN that are responsible for adapting the generator to a new domain depending on the similarity between the source and target domains. As a result of this study, we propose new efficient and lightweight parameterizations of StyleGAN for domain adaptation. Particularly, we show that there exist directions in StyleSpace (StyleDomain directions) that are sufficient for adapting to similar domains. For dissimilar domains, we propose Affine+ and AffineLight+ parameterizations that allow us to outperform existing baselines in few-shot adaptation while having significantly fewer trainable parameters. Finally, we examine StyleDomain directions and discover many of their surprising properties, which we apply to domain mixing and cross-domain image morphing. Source code can be found at https://github.com/AIRI-Institute/StyleDomain. + + + + RankMixup: Ranking-Based Mixup Training for Network Calibration + http://openaccess.thecvf.com//content/ICCV2023/papers/Noh_RankMixup_Ranking-Based_Mixup_Training_for_Network_Calibration_ICCV_2023_paper.pdf + Network calibration aims to accurately estimate the level of confidence, which is particularly important for employing deep neural networks in real-world systems. Recent approaches leverage mixup to calibrate the network's predictions during training. However, they do not consider the problem that mixtures of labels in mixup may not accurately represent the actual distribution of augmented samples. In this paper, we present RankMixup, a novel mixup-based framework alleviating the problem of the mixture of labels for network calibration. To this end, we propose to use an ordinal ranking relationship between raw and mixup-augmented samples as an alternative supervisory signal to the label mixtures for network calibration. We hypothesize that the network should estimate a higher level of confidence for the raw samples than the augmented ones (Fig. 1). To implement this idea, we introduce a mixup-based ranking loss (MRL) that encourages lower confidences for augmented samples compared to raw ones, maintaining the ranking relationship. We also propose to leverage the ranking relationship among multiple mixup-augmented samples to further improve the calibration capability. Augmented samples with larger mixing coefficients are expected to have higher confidences and vice versa (Fig. 1). That is, the order of confidences should be aligned with that of mixing coefficients. To this end, we introduce a novel loss, M-NDCG, in order to reduce the number of misaligned pairs of the coefficients and confidences. Extensive experimental results on standard benchmarks for network calibration demonstrate the effectiveness of RankMixup. + + + + Learning to Generate Semantic Layouts for Higher Text-Image Correspondence in Text-to-Image Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_Learning_to_Generate_Semantic_Layouts_for_Higher_Text-Image_Correspondence_in_ICCV_2023_paper.pdf + Existing text-to-image generation approaches have set high standards for photorealism and text-image correspondence, largely benefiting from web-scale text-image datasets, which can include up to 5 billion pairs. However, text-to-image generation models trained on domain-specific datasets, such as urban scenes, medical images, and faces, still suffer from low text-image correspondence due to the lack of text-image pairs.
Additionally, collecting billions of text-image pairs for a specific domain can be time-consuming and costly. Thus, ensuring high text-image correspondence without relying on web-scale text-image datasets remains a challenging task. In this paper, we present a novel approach for enhancing text-image correspondence by leveraging available semantic layouts. Specifically, we propose a Gaussian-categorical diffusion process that simultaneously generates both images and corresponding layout pairs. Our experiments reveal that we can guide text-to-image generation models to be aware of the semantics of different image regions, by training the model to generate semantic labels for each pixel. We demonstrate that our approach achieves higher text-image correspondence compared to existing text-to-image generation approaches in the Multi-Modal CelebA-HQ and the Cityscapes dataset, where text-image pairs are scarce. + + + + Erasing Concepts from Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Gandikota_Erasing_Concepts_from_Diffusion_Models_ICCV_2023_paper.pdf + Motivated by concerns that large-scale diffusion models can produce undesirable output such as sexually explicit content or copyrighted artistic styles, we study erasure of specific concepts from diffusion model weights. We propose a fine-tuning method that can erase a visual concept from a pre-trained diffusion model, given only the name of the style and using negative guidance as a teacher. We benchmark our method against previous approaches that remove sexually explicit content and demonstrate its effectiveness, performing on par with Safe Latent Diffusion and censored training. To evaluate artistic style removal, we conduct experiments erasing five modern artists from the network and conduct a user study to assess the human perception of the removed styles. Unlike previous methods, our approach can remove concepts from a diffusion model permanently rather than modifying the output at the inference time, so it cannot be circumvented even if a user has access to model weights. Our code, data, and results are available at erasing.baulab.info + + + + Fully Attentional Networks with Self-emerging Token Labeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Fully_Attentional_Networks_with_Self-emerging_Token_Labeling_ICCV_2023_paper.pdf + Recent studies indicate that Vision Transformers (ViTs) are robust against out-of-distribution scenarios. In particular, the Fully Attentional Network (FAN) - a family of ViT backbones, has achieved state-of-the-art robustness. In this paper, we revisit the FAN models and improve their pre-training with a self-emerging token labeling (STL) framework. Our method contains a two-stage training framework. Specifically, we first train a FAN token labeler (FAN-TL) to generate semantically meaningful patch token labels, followed by a FAN student model training stage that uses both the token labels and the original class label. With the proposed STL framework, our best model based on FAN-L-Hybrid (77.3M parameters) achieves 84.8% Top-1 accuracy and 42.1% mCE on ImageNet-1K and ImageNet-C, and sets a new state-of-the-art for ImageNet-A (46.1%) and ImageNet-R (56.6%) without using extra data, outperforming the original FAN counterpart by significant margins. The proposed framework also demonstrates significantly enhanced performance on downstream tasks such as semantic segmentation, with up to 1.7% improvement in robustness over the counterpart model. 
+ + + + ACTIVE: Towards Highly Transferable 3D Physical Camouflage for Universal and Robust Vehicle Evasion + http://openaccess.thecvf.com//content/ICCV2023/papers/Suryanto_ACTIVE_Towards_Highly_Transferable_3D_Physical_Camouflage_for_Universal_and_ICCV_2023_paper.pdf + Adversarial camouflage has garnered attention for its ability to attack object detectors from any viewpoint by covering the entire object's surface. However, universality and robustness in existing methods often fall short as the transferability aspect is often overlooked, thus restricting their application only to a specific target with limited performance. To address these challenges, we present Adversarial Camouflage for Transferable and Intensive Vehicle Evasion (ACTIVE), a state-of-the-art physical camouflage attack framework designed to generate universal and robust adversarial camouflage capable of concealing any 3D vehicle from detectors. Our framework incorporates innovative techniques to enhance universality and robustness, including a refined texture rendering that enables common texture application to different vehicles without being constrained to a specific texture map, a novel stealth loss that renders the vehicle undetectable, and a smooth and camouflage loss to enhance the naturalness of the adversarial camouflage. Our extensive experiments on 15 different models show that ACTIVE consistently outperforms existing works on various public detectors, including the latest YOLOv7. Notably, our universality evaluations reveal promising transferability to other vehicle classes, tasks (segmentation models), and the real world, not just other vehicles. + + + + Too Large; Data Reduction for Vision-Language Pre-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Too_Large_Data_Reduction_for_Vision-Language_Pre-Training_ICCV_2023_paper.pdf + This paper examines the problems of severe image-text misalignment and high redundancy in the widely-used large-scale Vision-Language Pre-Training (VLP) datasets. To address these issues, we propose an efficient and straightforward Vision-Language learning algorithm called TL;DR, which aims to compress the existing large VLP data into a small, high-quality set. Our approach consists of two major steps. First, a codebook-based encoder-decoder captioner is developed to select representative samples. Second, a new caption is generated to complement the original captions for selected samples, mitigating the text-image misalignment problem while maintaining uniqueness. As a result, TL;DR enables us to reduce the large dataset into a small set of high-quality data, which can serve as an alternative pre-training dataset. This algorithm significantly speeds up the time-consuming pretraining process. Specifically, TL;DR can compress the mainstream VLP datasets at a high ratio, e.g., reduce the well-cleaned CC3M dataset from 2.8M to 0.67M (~24%) and the noisy YFCC15M from 15M to 2.5M (~16.7%). Extensive experiments with three popular VLP models over seven downstream tasks show that a VLP model trained on the compressed dataset provided by TL;DR can achieve similar or even better results compared with training on the full-scale dataset.
+ + + + Towards Deeply Unified Depth-aware Panoptic Segmentation with Bi-directional Guidance Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/He_Towards_Deeply_Unified_Depth-aware_Panoptic_Segmentation_with_Bi-directional_Guidance_Learning_ICCV_2023_paper.pdf + Depth-aware panoptic segmentation is an emerging topic in computer vision which combines semantic and geometric understanding for more robust scene interpretation. Recent works pursue unified frameworks to tackle this challenge but mostly still treat it as two individual learning tasks, which limits their potential for exploring cross-domain information. We propose a deeply unified framework for depth-aware panoptic segmentation, which performs joint segmentation and depth estimation both in a per-segment manner with identical object queries. To narrow the gap between the two tasks, we further design a geometric query enhancement method, which is able to integrate scene geometry into object queries using latent representations. In addition, we propose a bi-directional guidance learning approach to facilitate cross-task feature learning by taking advantage of their mutual relations. Our method sets the new state of the art for depth-aware panoptic segmentation on both Cityscapes-DVPS and SemKITTI-DVPS datasets. Moreover, our guidance learning approach is shown to deliver performance improvement even under incomplete supervision labels. Code and models are available at https://github.com/jwh97nn/DeepDPS. + + + + Point-Query Quadtree for Crowd Counting, Localization, and More + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Point-Query_Quadtree_for_Crowd_Counting_Localization_and_More_ICCV_2023_paper.pdf + We show that crowd counting can be viewed as a decomposable point querying process. This formulation enables arbitrary points as input and jointly reasons whether the points are crowd and where they locate. The querying processing, however, raises an underlying problem on the number of necessary querying points. Too few imply underestimation; too many increase computational overhead. To address this dilemma, we introduce a decomposable structure, i.e., the point-query quadtree, and propose a new counting model, termed Point quEry Transformer (PET). PET implements decomposable point querying via data-dependent quadtree splitting, where each querying point could split into four new points when necessary, thus enabling dynamic processing of sparse and dense regions. Such a querying process yields an intuitive, universal modeling of crowd as both the input and output are interpretable and steerable. We demonstrate the applications of PET on a number of crowd-related tasks, including fully-supervised crowd counting and localization, partial annotation learning, and point annotation refinement, and also report state-of-the-art performance. For the first time, we show that a single counting model can address multiple crowd-related tasks across different learning paradigms. Code is available at https://github.com/cxliu0/PET. + + + + Zero-Shot Spatial Layout Conditioning for Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Couairon_Zero-Shot_Spatial_Layout_Conditioning_for_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf + Large-scale text-to-image diffusion models have significantly improved the state of the art in generative image modeling and allow for an intuitive and powerful user interface to drive the image generation process. Expressing spatial constraints, e.g. 
to position specific objects in particular locations, is cumbersome using text; and current text-based image generation models are not able to accurately follow such instructions. In this paper we consider image generation from text associated with segments on the image canvas, which combines an intuitive natural language interface with precise spatial control over the generated content. We propose ZestGuide, a "ZEro-shot" SegmenTation Guidance approach that can be plugged into pre-trained text-to-image diffusion models, and does not require any additional training. It leverages implicit segmentation maps that can be extracted from cross-attention layers, and uses them to align the generation with input masks. Our experimental results combine high image quality with accurate alignment of generated content with input segmentations, and improve over prior work both quantitatively and qualitatively, including methods that require training on images with corresponding segmentations. Compared to Paint with Words, the previous state-of-the art in image generation with zero-shot segmentation conditioning, we improve by 5 to 10 mIoU points on the COCO dataset with similar FID scores. + + + + SegGPT: Towards Segmenting Everything in Context + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_SegGPT_Towards_Segmenting_Everything_in_Context_ICCV_2023_paper.pdf + We present SegGPT, a generalist model for segmenting everything in context. We unify various segmentation tasks into a generalist in-context learning framework that accommodates different kinds of segmentation data by transforming them into the same format of images. The training of SegGPT is formulated as an in-context coloring problem with random color mapping for each data sample. The objective is to accomplish diverse tasks according to the context, rather than relying on specific colors. After training, SegGPT can perform arbitrary segmentation tasks in images or videos via in-context inference, such as object instance, stuff, part, contour, and text. SegGPT is evaluated on a broad range of tasks, including few-shot semantic segmentation, video object segmentation, semantic segmentation, and panoptic segmentation. Our results show strong capabilities in segmenting in-domain and out-of-domain targets, either qualitatively or quantitatively. + + + + DDColor: Towards Photo-Realistic Image Colorization via Dual Decoders + http://openaccess.thecvf.com//content/ICCV2023/papers/Kang_DDColor_Towards_Photo-Realistic_Image_Colorization_via_Dual_Decoders_ICCV_2023_paper.pdf + Image colorization is a challenging problem due to multi-modal uncertainty and high ill-posedness. Directly training a deep neural network usually leads to incorrect semantic colors and low color richness. While transformer-based methods can deliver better results, they often rely on manually designed priors, suffer from poor generalization ability, and introduce color bleeding effects. To address these issues, we propose DDColor, an end-to-end method with dual decoders for image colorization. Our approach includes a pixel decoder and a query-based color decoder. The former restores the spatial resolution of the image, while the latter utilizes rich visual features to refine color queries, thus avoiding hand-crafted priors. Our two decoders work together to establish correlations between color and multi-scale semantic representations via cross-attention, significantly alleviating the color bleeding effect. 
Additionally, a simple yet effective colorfulness loss is introduced to enhance the color richness. Extensive experiments demonstrate that DDColor achieves superior performance to existing state-of-the-art works both quantitatively and qualitatively. The codes and models are publicly available. + + + + Visual Explanations via Iterated Integrated Attributions + http://openaccess.thecvf.com//content/ICCV2023/papers/Barkan_Visual_Explanations_via_Iterated_Integrated_Attributions_ICCV_2023_paper.pdf + We introduce Iterated Integrated Attributions (IIA) - a generic method for explaining the predictions of vision models. IIA employs iterative integration across the input image, the internal representations generated by the model, and their gradients, yielding precise and focused explanation maps. We demonstrate the effectiveness of IIA through comprehensive evaluations across various tasks, datasets, and network architectures. Our results showcase that IIA produces accurate explanation maps, outperforming other state-of-the-art explanation techniques. + + + + Pairwise Similarity Learning is SimPLE + http://openaccess.thecvf.com//content/ICCV2023/papers/Wen_Pairwise_Similarity_Learning_is_SimPLE_ICCV_2023_paper.pdf + In this paper, we focus on a general yet important learning problem, pairwise similarity learning (PSL). PSL subsumes a wide range of important applications, such as open-set face recognition, speaker verification, image retrieval and person re-identification. The goal of PSL is to learn a pairwise similarity function assigning a higher similarity score to positive pairs (i.e., a pair of samples with the same label) than to negative pairs (i.e., a pair of samples with different label). We start by identifying a key desideratum for PSL, and then discuss how existing methods can achieve this desideratum. We then propose a surprisingly simple proxy-free method, called SimPLE, which requires neither feature/proxy normalization nor angular margin and yet is able to generalize well in open-set recognition. We apply the proposed method to three challenging PSL tasks: open-set face recognition, image retrieval and speaker verification. Comprehensive experimental results on large-scale benchmarks show that our method performs significantly better than current state-of-the-art methods. + + + + GO-SLAM: Global Optimization for Consistent 3D Instant Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_GO-SLAM_Global_Optimization_for_Consistent_3D_Instant_Reconstruction_ICCV_2023_paper.pdf + Neural implicit representations have recently demonstrated compelling results on dense Simultaneous Localization And Mapping (SLAM) but suffer from the accumulation of errors in camera tracking and distortion in the reconstruction. Purposely, we present GO-SLAM, a deep-learning-based dense visual SLAM framework globally optimizing poses and 3D reconstruction in real-time. Robust pose estimation is at its core, supported by efficient loop closing and online full bundle adjustment, which optimize per frame by utilizing the learned global geometry of the complete history of input frames. Simultaneously, we update the implicit and continuous surface representation on-the-fly to ensure global consistency of 3D reconstruction. Results on various synthetic and real-world datasets demonstrate that GO-SLAM outperforms state-of-the-art approaches at tracking robustness and reconstruction accuracy. Furthermore, GO-SLAM is versatile and can run with monocular, stereo, and RGB-D input. 
+ + + + FACTS: First Amplify Correlations and Then Slice to Discover Bias + http://openaccess.thecvf.com//content/ICCV2023/papers/Yenamandra_FACTS_First_Amplify_Correlations_and_Then_Slice_to_Discover_Bias_ICCV_2023_paper.pdf + Computer vision datasets frequently contain spurious correlations between task-relevant labels and (easy to learn) latent task-irrelevant attributes (e.g. context). Models trained on such datasets learn "shortcuts" and underperform on bias-conflicting slices of data where the correlation does not hold. In this work, we study the problem of identifying such slices to inform downstream bias mitigation strategies. We propose First Amplify Correlations and Then Slice (FACTS), wherein we first amplify correlations to fit a simple bias-aligned hypothesis via strongly regularized empirical risk minimization. Next, we perform correlation-aware slicing via mixture modeling in bias-aligned feature space to discover underperforming data slices that capture distinct correlations. Despite its simplicity, our method considerably improves over prior work (by as much as 35% precision@10) in correlation bias identification across a range of diverse evaluation settings. Our code is available at https://github.com/yvsriram/FACTS. + + + + Mask-Attention-Free Transformer for 3D Instance Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Lai_Mask-Attention-Free_Transformer_for_3D_Instance_Segmentation_ICCV_2023_paper.pdf + Recently, transformer-based methods have dominated 3D instance segmentation, where mask attention is commonly involved. Specifically, object queries are guided by the initial instance masks in the first cross-attention, and then iteratively refine themselves in a similar manner. However, we observe that the mask-attention pipeline usually leads to slow convergence due to low-recall initial instance masks. Therefore, we abandon the mask attention design and resort to an auxiliary center regression task instead. Through center regression, we effectively overcome the low-recall issue and perform cross-attention by imposing positional prior. To reach this goal, we develop a series of position-aware designs. First, we learn a spatial distribution of 3D locations as the initial position queries. They spread over the 3D space densely, and thus can easily capture the objects in a scene with a high recall. Moreover, we present relative position encoding for the cross-attention and iterative refinement for more accurate position queries. Experiments show that our approach converges 4x faster than existing work, sets a new state of the art on ScanNetv2 3D instance segmentation benchmark, and also demonstrates superior performance across various datasets. Code and models are available at https://github.com/dvlab-research/Mask-Attention-Free-Transformer. + + + + EgoLoc: Revisiting 3D Object Localization from Egocentric Videos with Visual Queries + http://openaccess.thecvf.com//content/ICCV2023/papers/Mai_EgoLoc_Revisiting_3D_Object_Localization_from_Egocentric_Videos_with_Visual_ICCV_2023_paper.pdf + With the recent advances in video and 3D understanding, novel 4D spatio-temporal methods fusing both concepts have emerged. Towards this direction, the Ego4D Episodic Memory Benchmark proposed a task for Visual Queries with 3D Localization (VQ3D). Given an egocentric video clip and an image crop depicting a query object, the goal is to localize the 3D position of the center of that query object with respect to the camera pose of a query frame. 
Current methods tackle the problem of VQ3D by unprojecting the 2D localization results of the sibling task Visual Queries with 2D Localization (VQ2D) into 3D predictions. Yet, we point out that the low number of camera poses caused by camera re-localization in previous VQ3D methods severely hinders their overall success rate. In this work, we formalize a pipeline (which we dub EgoLoc) that better entangles 3D multiview geometry with 2D object retrieval from egocentric videos. Our approach involves estimating more robust camera poses and aggregating multi-view 3D displacements by leveraging the 2D detection confidence, which enhances the success rate of object queries and leads to a significant improvement in the VQ3D baseline performance. Specifically, our approach achieves an overall success rate of up to 87.12%, which sets a new state-of-the-art result in the VQ3D task. We provide a comprehensive empirical analysis of the VQ3D task and existing solutions, and highlight the remaining challenges in VQ3D. The code is available at https://github.com/Wayne-Mai/EgoLoc. + + + + FLatten Transformer: Vision Transformer using Focused Linear Attention + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_FLatten_Transformer_Vision_Transformer_using_Focused_Linear_Attention_ICCV_2023_paper.pdf + The quadratic computation complexity of self-attention has been a persistent challenge when applying Transformer models to vision tasks. Linear attention, on the other hand, offers a much more efficient alternative with its linear complexity by approximating the Softmax operation through carefully designed mapping functions. However, current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead from the mapping functions. In this paper, we propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness. Specifically, we first analyze the factors contributing to the performance degradation of linear attention from two perspectives: the focus ability and feature diversity. To overcome these limitations, we introduce a simple yet effective mapping function and an efficient rank restoration module to enhance the expressiveness of self-attention while maintaining low computation complexity. Extensive experiments show that our linear attention module is applicable to a variety of advanced vision Transformers, and achieves consistently improved performances on multiple benchmarks. Code is available at https://github.com/LeapLabTHU/FLatten-Transformer. + + + + ADNet: Lane Shape Prediction via Anchor Decomposition + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiao_ADNet_Lane_Shape_Prediction_via_Anchor_Decomposition_ICCV_2023_paper.pdf + In this paper, we revisit the limitations of anchor-based lane detection methods, which have predominantly focused on fixed anchors that stem from the edges of the image, disregarding their versatility and quality. To overcome the inflexibility of anchors, we decompose them into learning the heat map of starting points and their associated directions. This decomposition removes the limitations on the starting point of anchors, making our algorithm adaptable to different lane types in various datasets. To enhance the quality of anchors, we introduce the Large Kernel Attention (LKA) for Feature Pyramid Network (FPN).
This significantly increases the receptive field, which is crucial for capturing sufficient context, as lane lines typically run throughout the entire image. We have named our proposed system the Anchor Decomposition Network (ADNet). Additionally, we propose the General Lane IoU (GLIoU) loss, which significantly improves the performance of ADNet in complex scenarios. Experimental results on three widely used lane detection benchmarks, VIL-100, CULane, and TuSimple, demonstrate that our approach outperforms the state-of-the-art methods on VIL-100 and exhibits competitive accuracy on CULane and TuSimple. Code and models will be released on https://github.com/Sephirex-X/ADNet. + + + + HollowNeRF: Pruning Hashgrid-Based NeRFs with Trainable Collision Mitigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_HollowNeRF_Pruning_Hashgrid-Based_NeRFs_with_Trainable_Collision_Mitigation_ICCV_2023_paper.pdf + Neural radiance fields (NeRF) have garnered significant attention, with recent works such as Instant-NGP accelerating NeRF training and evaluation through a combination of hashgrid-based positional encoding and neural networks. However, effectively leveraging the spatial sparsity of 3D scenes remains a challenge. To cull away unnecessary regions of the feature grid, existing solutions rely on prior knowledge of object shape or periodically estimate object shape during training by repeated model evaluations, which are costly and wasteful. To address this issue, we propose HollowNeRF, a novel compression solution for hashgrid-based NeRF which automatically sparsifies the feature grid during the training phase. Instead of directly compressing dense features, HollowNeRF trains a coarse 3D saliency mask that guides efficient feature pruning, and employs an alternating direction method of multipliers (ADMM) pruner to sparsify the 3D saliency mask during training. By exploiting the sparsity in the 3D scene to redistribute hash collisions, HollowNeRF improves rendering quality while using a fraction of the parameters of comparable state-of-the-art solutions, leading to a better cost-accuracy trade-off. Our method delivers comparable rendering quality to Instant-NGP, while utilizing just 31% of the parameters. In addition, our solution can achieve a PSNR accuracy gain of up to 1dB using only 56% of the parameters. + + + + A Complete Recipe for Diffusion Generative Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Pandey_A_Complete_Recipe_for_Diffusion_Generative_Models_ICCV_2023_paper.pdf + Score-based Generative Models (SGMs) have demonstrated exceptional synthesis outcomes across various tasks. However, the current design landscape of the forward diffusion process remains largely untapped and often relies on physical heuristics or simplifying assumptions. Utilizing insights from the development of scalable Bayesian posterior samplers, we present a complete recipe for formulating forward processes in SGMs, ensuring convergence to the desired target distribution. Our approach reveals that several existing SGMs can be seen as specific manifestations of our framework. Building upon this method, we introduce Phase Space Langevin Diffusion (PSLD), which relies on score-based modeling within an augmented space enriched by auxiliary variables akin to physical phase space. Empirical results exhibit the superior sample quality and improved speed-quality trade-off of PSLD compared to various competing approaches on established image synthesis benchmarks. 
Remarkably, PSLD achieves sample quality akin to state-of-the-art SGMs (FID: 2.10 for unconditional CIFAR-10 generation). Lastly, we demonstrate the applicability of PSLD in conditional synthesis using pre-trained score networks, offering an appealing alternative as an SGM backbone for future advancements. Code and model checkpoints can be accessed at https://github.com/mandt-lab/PSLD. + + + + The Devil is in the Crack Orientation: A New Perspective for Crack Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_The_Devil_is_in_the_Crack_Orientation_A_New_Perspective_ICCV_2023_paper.pdf + Cracks are usually curve-like structures that are the focus of many computer-vision applications (e.g., road safety inspection and surface inspection of industrial facilities). The existing pixel-based crack segmentation methods rely on time-consuming and costly pixel-level annotations. The object-based crack detection methods exploit the horizontal box to detect the crack without considering crack orientation, resulting in scale variation and intra-class variation. Considering this, we provide a new perspective for crack detection that models the cracks as a series of sub-cracks with the corresponding orientation. However, the vanilla adaptation of the existing oriented object detection methods to the crack detection tasks will result in limited performance, due to the boundary discontinuity issue and the ambiguities in sub-crack orientation. In this paper, we propose a first-of-its-kind oriented sub-crack detector, dubbed CrackDet, which is derived from a novel piecewise angle definition, to ease the boundary discontinuity problem. We then propose a multi-branch angle regression loss for learning sub-crack orientation and variance together. Since there are no related benchmarks, we construct three fully annotated datasets, namely, ORC, ONPP, and OCCSD, which involve various cracks in road pavement and industrial facilities. Experiments show that our approach outperforms state-of-the-art crack detectors. + + + + FedPD: Federated Open Set Recognition with Parameter Disentanglement + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_FedPD_Federated_Open_Set_Recognition_with_Parameter_Disentanglement_ICCV_2023_paper.pdf + Existing federated learning (FL) approaches are deployed under the unrealistic closed-set setting, with both training and testing classes belonging to the same set, which makes the global model fail to identify the unseen classes as `unknown'. To this end, we aim to study a novel problem of federated open-set recognition (FedOSR), which learns an open-set recognition (OSR) model under a federated paradigm such that it classifies seen classes while at the same time detecting unknown classes. In this work, we propose a parameter disentanglement guided federated open-set recognition (FedPD) algorithm to address two core challenges of FedOSR: cross-client inter-set interference between learning closed-set and open-set knowledge and cross-client intra-set inconsistency caused by data heterogeneity. The proposed FedPD framework mainly leverages two modules, i.e., local parameter disentanglement (LPD) and global divide-and-conquer aggregation (GDCA), to first disentangle the client OSR model into different subnetworks, then align the corresponding parts across clients for matched model aggregation. 
Specifically, on the client side, LPD decouples an OSR model into a closed-set subnetwork and an open-set subnetwork by the task-related importance, thus preventing inter-set interference. On the server side, GDCA first partitions the two subnetworks into specific and shared parts, and subsequently aligns the corresponding parts through optimal transport to eliminate parameter misalignment. Extensive experiments on various datasets demonstrate the superior performance of our proposed method. + + + + WaterMask: Instance Segmentation for Underwater Imagery + http://openaccess.thecvf.com//content/ICCV2023/papers/Lian_WaterMask_Instance_Segmentation_for_Underwater_Imagery_ICCV_2023_paper.pdf + Underwater image instance segmentation is a fundamental and critical step in underwater image analysis and understanding. However, the paucity of general multiclass instance segmentation datasets has impeded the development of instance segmentation studies for underwater images. In this paper, we propose the first underwater image instance segmentation dataset (UIIS), which provides 4628 images for 7 categories with pixel-level annotations. Meanwhile, we also design WaterMask for underwater image instance segmentation for the first time. In WaterMask, we first devise a Difference Similarity Graph Attention Module (DSGAT) to recover detailed information lost due to image quality degradation and downsampling, helping the network prediction. Then, we propose a Multi-level Feature Refinement Module (MFRM) to predict foreground masks and boundary masks separately by features at different scales, and guide the network through a Boundary Mask Strategy (BMS) with a boundary learning loss to provide finer prediction results. Extensive experimental results demonstrate that WaterMask achieves significant gains of 2.9 and 3.8 mAP over Mask R-CNN when using ResNet-50 and ResNet-101, respectively. Code and Dataset are available at https://github.com/LiamLian0727/WaterMask. + + + + MosaiQ: Quantum Generative Adversarial Networks for Image Generation on NISQ Computers + http://openaccess.thecvf.com//content/ICCV2023/papers/Silver_MosaiQ_Quantum_Generative_Adversarial_Networks_for_Image_Generation_on_NISQ_ICCV_2023_paper.pdf + Quantum machine learning and vision have come to the fore recently, with hardware advances enabling rapid advancement in the capabilities of quantum machines. Recently, quantum image generation has been explored with many potential advantages over non-quantum techniques; however, previous techniques have suffered from poor quality and robustness. To address these problems, we introduce MosaiQ, a high-quality quantum image generation GAN framework that can be executed on today's Near-term Intermediate Scale Quantum (NISQ) computers. + + + + DVIS: Decoupled Video Instance Segmentation Framework + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_DVIS_Decoupled_Video_Instance_Segmentation_Framework_ICCV_2023_paper.pdf + Video instance segmentation (VIS) is a critical task with diverse applications, including autonomous driving and video editing. Existing methods often underperform on complex and long videos in the real world, primarily due to two factors. Firstly, offline methods are limited by the tightly-coupled modeling paradigm, which treats all frames equally and disregards the interdependencies between adjacent frames. Consequently, this leads to the introduction of excessive noise during long-term temporal alignment. Secondly, online methods suffer from inadequate utilization of temporal information. 
To tackle these challenges, we propose a decoupling strategy for VIS by dividing it into three independent sub-tasks: segmentation, tracking, and refinement. The efficacy of the decoupling strategy relies on two crucial elements: 1) attaining precise long-term alignment outcomes via frame-by-frame association during tracking, and 2) the effective utilization of temporal information predicated on the aforementioned accurate alignment outcomes during refinement. We introduce a novel referring tracker and temporal refiner to construct the Decoupled VIS framework (DVIS). DVIS achieves new SOTA performance in both VIS and VPS, surpassing the current SOTA methods by 7.3 AP and 9.6 VPQ on the OVIS and VIPSeg datasets, which are the most challenging and realistic benchmarks. Moreover, thanks to the decoupling strategy, the referring tracker and temporal refiner are extremely lightweight (only 6% of the segmenter FLOPs), allowing for efficient training and inference on a single GPU with 11G memory. To promote reproducibility and facilitate further research, we will make the code publicly available. + + + + Rethinking Amodal Video Segmentation from Learning Supervised Signals with Object-centric Representation + http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_Rethinking_Amodal_Video_Segmentation_from_Learning_Supervised_Signals_with_Object-centric_ICCV_2023_paper.pdf + Video amodal segmentation is a particularly challenging task in computer vision, which requires deducing the full shape of an object from its visible parts. Recently, some studies have achieved promising performance by using motion flow to integrate information across frames under a self-supervised setting. However, motion flow is clearly limited by two factors: moving cameras and object deformation. This paper presents a rethinking of previous works. We particularly leverage the supervised signals with object-centric representation in real-world scenarios. The underlying idea is that the supervision signal of the specific object and the features from different views can mutually benefit the deduction of the full mask in any specific frame. We thus propose an Efficient object-centric Representation amodal Segmentation (EoRaS). Specifically, beyond solely relying on supervision signals, we design a translation module to project image features into the Bird's-Eye View (BEV), which introduces 3D information to improve current feature quality. Furthermore, we propose a multi-view fusion layer based temporal module which is equipped with a set of object slots and interacts with features from different views by an attention mechanism to fulfill sufficient object representation completion. As a result, the full mask of the object can be decoded from image features updated by object slots. Extensive experiments on both real-world and synthetic benchmarks demonstrate the superiority of our proposed method, achieving state-of-the-art performance. Our code will be released at https://github.com/kfan21/EoRaS. + + + + Distilled Reverse Attention Network for Open-world Compositional Zero-Shot Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Distilled_Reverse_Attention_Network_for_Open-world_Compositional_Zero-Shot_Learning_ICCV_2023_paper.pdf + Open-World Compositional Zero-Shot Learning (OW-CZSL) aims to recognize new compositions of seen attributes and objects. In OW-CZSL, methods built on the conventional closed-world setting degrade severely due to the unconstrained OW test space. 
While previous works alleviate the issue by pruning compositions according to external knowledge or correlations in seen pairs, they introduce biases that harm the generalization. Some methods thus predict state and object with independently constructed and trained classifiers, ignoring that attributes are highly context-dependent and visually entangled with objects. In this paper, we propose a novel Distilled Reverse Attention Network to address the challenges. We also model attributes and objects separately but with different motivations, capturing contextuality and locality, respectively. We further design a reverse-and-distill strategy that learns disentangled representations of elementary components in training data supervised by reverse attention and knowledge distillation. We conduct experiments on three datasets and consistently achieve state-of-the-art (SOTA) performance. + + + + TexFusion: Synthesizing 3D Textures with Text-Guided Image Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_TexFusion_Synthesizing_3D_Textures_with_Text-Guided_Image_Diffusion_Models_ICCV_2023_paper.pdf + We present TexFusion(Texture Diffusion), a new method to synthesize textures for given 3D geometries, using only large-scale text-guided image diffusion models. In contrast to recent works that leverage 2D text-to-image diffusion models to distill 3D objects using a slow and fragile optimization process, TexFusion introduces a new 3D-consistent generation technique specifically designed for texture synthesis that employs regular diffusion model sampling on different 2D rendered views. Specifically, we leverage latent diffusion models, apply the diffusion model's denoiser on a set of 2D renders of the 3D object, and aggregate the different denoising predictions on a shared latent texture map. Final RGB output textures are produced by optimizing an intermediate neural color field on the decodings of 2D renders of the latent texture. We thoroughly validate TexFusion and show that we can efficiently generate diverse, high quality and globally coherent textures. We achieve state-of-the-art text-guided texture synthesis performance using only image diffusion models, while avoiding the pitfalls of previous distillation-based methods. The text-conditioning offers detailed control and we also do not rely on any ground truth 3D textures for training. This makes our method very versatile and applicable to a broad range of geometries and texture types. We hope that TexFusion will advance AI-based texturing of 3D assets for applications in virtual reality, game design, simulation, and more. + + + + Shift from Texture-bias to Shape-bias: Edge Deformation-based Augmentation for Robust Object Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/He_Shift_from_Texture-bias_to_Shape-bias_Edge_Deformation-based_Augmentation_for_Robust_ICCV_2023_paper.pdf + Recent studies have shown the vulnerability of CNNs under perturbation noises, which is partially caused by the reason that the well-trained CNNs are too biased toward the object texture, i.e., they make predictions mainly based on texture cues. To reduce this texture-bias, current studies resort to learning augmented samples with heavily perturbed texture to make networks be more biased toward relatively stable shape cues. However, such methods usually fail to achieve real shape-biased networks due to the insufficient diversity of the shape cues. 
In this paper, we propose to augment the training dataset by generating semantically meaningful shapes and samples, via a shape deformation-based online augmentation, namely SDbOA. The samples generated by our SDbOA have two main merits. First, the augmented samples with more diverse shape variations enable networks to learn the shape cues more elaborately, which encourages the network to be shape-biased. Second, semantically meaningful shape-augmentation samples can be produced by jointly regularizing the generator with object texture and an edge-guidance soft constraint, where the edges are represented more robustly with a self-information-guided map to better resist the noise on them. Extensive experiments under various perturbation noises demonstrate the obvious superiority of our shape-bias-motivated model over the state of the art in terms of robustness. Our code is appended in the supplementary material. + + + + Data-free Knowledge Distillation for Fine-grained Visual Categorization + http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_Data-free_Knowledge_Distillation_for_Fine-grained_Visual_Categorization_ICCV_2023_paper.pdf + Data-free knowledge distillation (DFKD) is a promising approach for addressing issues related to model compression, security privacy, and transmission restrictions. Although the existing methods exploiting DFKD have achieved inspiring results in coarse-grained classification, in practical applications involving fine-grained classification tasks that require more detailed distinctions between similar categories, sub-optimal results are obtained. To address this issue, we propose an approach called DFKD-FGVC that extends DFKD to fine-grained vision categorization (FGVC) tasks. Our approach utilizes an adversarial distillation framework with an attention generator, mixed high-order attention distillation, and semantic feature contrast learning. Specifically, we introduce a spatial-wise attention mechanism to the generator to synthesize fine-grained images with more details of discriminative parts. We also utilize the mixed high-order attention mechanism to capture complex interactions among parts and the subtle differences among discriminative features of the fine-grained categories, paying attention to both local features and semantic context relationships. Moreover, we leverage the teacher and student models of the distillation framework to contrast high-level semantic feature maps in the hyperspace, comparing variances of different categories. We evaluate our approach on three widely-used FGVC benchmarks (Aircraft, Cars196, and CUB200) and demonstrate its superior performance. + + + + EgoPCA: A New Framework for Egocentric Hand-Object Interaction Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_EgoPCA_A_New_Framework_for_Egocentric_Hand-Object_Interaction_Understanding_ICCV_2023_paper.pdf + With the surge in attention to Egocentric Hand-Object Interaction (Ego-HOI), large-scale datasets such as Ego4D and EPIC-KITCHENS have been proposed. However, most current research is built on resources derived from third-person video action recognition. This inherent domain gap between first- and third-person action videos, which has not been adequately addressed before, makes current Ego-HOI suboptimal. This paper rethinks and proposes a new framework as an infrastructure to advance Ego-HOI recognition by Probing, Curation and Adaption (EgoPCA). 
We contribute comprehensive pre-train sets, balanced test sets and a new baseline, which are complete with a training-finetuning strategy. With our new framework, we not only achieve state-of-the-art performance on Ego-HOI benchmarks but also build several new and effective mechanisms and settings to advance further research. We believe our data and the findings will pave a new way for Ego-HOI understanding. Code and data are available at https://mvig-rhos.com/ego_pca. + + + + I Can't Believe There's No Images! Learning Visual Tasks Using only Language Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Gu_I_Cant_Believe_Theres_No_Images_Learning_Visual_Tasks_Using_ICCV_2023_paper.pdf + Many high-level skills that are required for computer vision tasks, such as parsing questions, comparing and contrasting semantics, and writing descriptions, are also required in other domains such as natural language processing. In this paper, we ask whether it is possible to learn those skills from text data and then transfer them to vision tasks without ever training on visual training data. Key to our approach is exploiting the joint embedding space of contrastively trained vision and language encoders. In practice, there can be systematic differences between embedding spaces for different modalities in contrastive models, and we analyze how these differences affect our approach and study strategies to mitigate this concern. We produce models using only text training data on four representative tasks: image captioning, visual entailment, visual question answering and visual news captioning, and evaluate them on standard benchmarks using images. We find these models perform close to models trained on images, while surpassing prior work for captioning and visual entailment in this text-only setting by over 9 points, and outperforming all prior work on visual news by over 30 points. We also showcase a variety of stylistic image captioning models that are trained using no image data and no human-curated language data, but instead using readily-available text data from books, the web, or language models. + + + + Feature Prediction Diffusion Model for Video Anomaly Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_Feature_Prediction_Diffusion_Model_for_Video_Anomaly_Detection_ICCV_2023_paper.pdf + Anomaly detection in the video is an important research area and a challenging task in real applications. Due to the unavailability of large-scale annotated anomaly events, most existing video anomaly detection (VAD) methods focus on learning the distribution of normal samples to detect the substantially deviated samples as anomalies. To well learn the distribution of normal motion and appearance, many auxiliary networks are employed to extract foreground object or action information. These high-level semantic features effectively filter the noise from the background to decrease its influence on detection models. However, the capability of these extra semantic models heavily affects the performance of the VAD methods. Motivated by the impressive generative and anti-noise capacity of diffusion model (DM), in this work, we introduce a novel DM-based method to predict the features of video frames for anomaly detection. We aim to learn the distribution of normal samples without any extra high-level semantic feature extraction models involved. To this end, we build two denoising diffusion implicit modules to predict and refine the features. 
The first module concentrates on feature motion learning, while the second focuses on feature appearance learning. To the best of our knowledge, it is the first DM-based method to predict frame features for VAD. The strong capacity of DMs also enables our method to more accurately predict the normal features than non-DM-based feature prediction-based VAD methods. Extensive experiments show that the proposed approach substantially outperforms state-of-the-art competing methods. + + + + MasQCLIP for Open-Vocabulary Universal Image Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_MasQCLIP_for_Open-Vocabulary_Universal_Image_Segmentation_ICCV_2023_paper.pdf + We present a new method for open-vocabulary universal image segmentation, which is capable of performing instance, semantic, and panoptic segmentation under a unified framework. Our approach, called MasQCLIP, seamlessly integrates with a pre-trained CLIP model by utilizing its dense features, thereby circumventing the need for extensive parameter training. MasQCLIP emphasizes two new aspects when building an image segmentation method with a CLIP model: 1) a student-teacher module to deal with masks of the novel (unseen) classes by distilling information from the base (seen) classes; 2) a fine-tuning process to update model parameters for the queries Q within the CLIP model. Thanks to these two simple and intuitive designs, MasQCLIP is able to achieve state-of-the-art performance with a substantial gain over competing methods across all three tasks, including open-vocabulary instance, semantic, and panoptic segmentation. Project page is at https://masqclip.github.io/. + + + + Self-similarity Driven Scale-invariant Learning for Weakly Supervised Person Search + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Self-similarity_Driven_Scale-invariant_Learning_for_Weakly_Supervised_Person_Search_ICCV_2023_paper.pdf + Weakly supervised person search aims to jointly detect and match persons with only bounding box annotations. Existing approaches typically focus on improving the features by exploring the relations of persons. However, the scale variation problem, i.e., that a person often appears in images at different scales (resolutions), is a more severe and under-studied obstacle. For one thing, small-scale images contain less information about a person, thus affecting the accuracy of the generated pseudo labels. For another, different similarities between cross-scale images of a person increase the difficulty of matching. In this paper, we address this by proposing a novel one-step framework, named Self-similarity driven Scale-invariant Learning (SSL). Scale invariance can be explored based on the self-similarity prior, i.e., that an image shows the same statistical properties at different scales. To this end, we introduce a Multi-scale Exemplar Branch to guide the network in concentrating on the foreground and learning scale-invariant features by hard exemplar mining. To enhance the discriminative power of the learned features, we further introduce a dynamic pseudo label prediction that progressively seeks true labels for training. Experimental results on two standard benchmarks, i.e., PRW and CUHK-SYSU datasets, demonstrate that the proposed method can solve the scale variation problem effectively and perform favorably against state-of-the-art methods. Code is available at https://github.com/Wangbenzhi/SSL.git. 
+ + + + Ord2Seq: Regarding Ordinal Regression as Label Sequence Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Ord2Seq_Regarding_Ordinal_Regression_as_Label_Sequence_Prediction_ICCV_2023_paper.pdf + Ordinal regression refers to classifying object instances into ordinal categories. It has been widely studied in many scenarios, such as medical disease grading and movie rating. Known methods focused only on learning inter-class ordinal relationships, but still incur limitations in distinguishing adjacent categories thus far. In this paper, we propose a simple sequence prediction framework for ordinal regression called Ord2Seq, which, for the first time, transforms each ordinal category label into a special label sequence and thus regards an ordinal regression task as a sequence prediction process. In this way, we decompose an ordinal regression task into a series of recursive binary classification steps, so as to subtly distinguish adjacent categories. Comprehensive experiments show the effectiveness of distinguishing adjacent categories for performance improvement and our new approach exceeds state-of-the-art performances in four different scenarios. Codes are available at https://github.com/wjh892521292/Ord2Seq. + + + + Controllable Visual-Tactile Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Controllable_Visual-Tactile_Synthesis_ICCV_2023_paper.pdf + Deep generative models have various content creation applications such as graphic design, e-commerce, and virtual try-on. However, current works mainly focus on synthesizing realistic visual outputs, often ignoring other sensory modalities, such as touch, which limits physical interaction with users. In this work, we leverage deep generative models to create a multi-sensory experience where users can touch and see the synthesized object when sliding their fingers on a haptic surface. The main challenges lie in the significant scale discrepancy between vision and touch sensing and the lack of explicit mapping from touch sensing data to a haptic rendering device. To bridge this gap, we collect high-resolution tactile data with a GelSight sensor and create a new visuotactile clothing dataset. We then develop a conditional generative model that synthesizes both visual and tactile outputs from a single sketch. We evaluate our method regarding image quality and tactile rendering accuracy. Finally, we introduce a pipeline to render high-quality visual and tactile outputs on an electroadhesion-based haptic device for an immersive experience, allowing for challenging materials and editable sketch inputs. + + + + Keep It SimPool: Who Said Supervised Transformers Suffer from Attention Deficit? + http://openaccess.thecvf.com//content/ICCV2023/papers/Psomas_Keep_It_SimPool_Who_Said_Supervised_Transformers_Suffer_from_Attention_ICCV_2023_paper.pdf + Convolutional networks and vision transformers have different forms of pairwise interactions, pooling across layers and pooling at the end of the network. Does the latter really need to be different? As a by-product of pooling, vision transformers provide spatial attention for free, but this is most often of low quality unless self-supervised, which is not well studied. Is supervision really the problem? In this work, we develop a generic pooling framework and then we formulate a number of existing methods as instantiations. 
By discussing the properties of each group of methods, we derive SimPool, a simple attention-based pooling mechanism as a replacement for the default one for both convolutional and transformer encoders. We find that, whether supervised or self-supervised, this improves performance on pre-training and downstream tasks and provides attention maps delineating object boundaries in all cases. One could thus call SimPool universal. To our knowledge, we are the first to obtain attention maps in supervised transformers of at least as good quality as self-supervised, without explicit losses or modifying the architecture. Code at: https://github.com/billpsomas/simpool. + + + + LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_LoGoPrompt_Synthetic_Text_Images_Can_Be_Good_Visual_Prompts_for_ICCV_2023_paper.pdf + Prompt engineering is a powerful tool used to enhance the performance of pre-trained models on downstream tasks. For example, providing the prompt "Let's think step by step" improved GPT-3's reasoning accuracy to 63% on MultiArith, while prompting "a photo of" filled with a class name enables CLIP to achieve 80% zero-shot accuracy on ImageNet. While previous research has explored prompt learning for the visual modality, analysis of what constitutes a good visual prompt specifically for image recognition remains limited. In addition, the generalization ability of existing visual prompt tuning methods is worse than that of text-only prompt tuning. This paper explores our key insight: synthetic text images are good visual prompts for vision-language models! To achieve that, we propose our LoGoPrompt, which reformulates the classification objective as visual prompt selection and addresses the chicken-and-egg challenge of first adding synthetic text images as class-wise visual prompts or predicting the class first. Without any trainable visual prompt parameters, experimental results on 16 datasets demonstrate that our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization. The code will be publicly available upon publication. + + + + FeatEnHancer: Enhancing Hierarchical Features for Object Detection and Beyond Under Low-Light Vision + http://openaccess.thecvf.com//content/ICCV2023/papers/Hashmi_FeatEnHancer_Enhancing_Hierarchical_Features_for_Object_Detection_and_Beyond_Under_ICCV_2023_paper.pdf + Extracting useful visual cues for the downstream tasks is especially challenging under low-light vision. Prior works create enhanced representations by either correlating visual quality with machine perception or designing illumination-degrading transformation methods that require pre-training on synthetic datasets. We argue that optimizing enhanced image representation pertaining to the loss of the downstream task can result in more expressive representations. Therefore, in this work, we propose a novel module, FeatEnHancer, that hierarchically combines multiscale features using multiheaded attention guided by a task-related loss function to create suitable representations. Furthermore, our intra-scale enhancement improves the quality of features extracted at each scale or level, as well as combines features from different scales in a way that reflects their relative importance for the task at hand. FeatEnHancer is a general-purpose plug-and-play module and can be incorporated into any low-light vision pipeline. 
We show with extensive experimentation that the enhanced representation produced with FeatEnHancer significantly and consistently improves results in several low-light vision tasks, including dark object detection (+5.7 mAP on ExDark), face detection (+1.5 mAP on DARK FACE), nighttime semantic segmentation (+5.1 mIoU on ACDC), and video object detection (+1.8 mAP on DarkVision), highlighting the effectiveness of enhancing hierarchical features under low-light vision. + + + + Saliency Regularization for Self-Training with Partial Annotations + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Saliency_Regularization_for_Self-Training_with_Partial_Annotations_ICCV_2023_paper.pdf + Partially annotated images are easy to obtain in multi-label classification. However, unknown labels in partially annotated images exacerbate the positive-negative imbalance inherent in multi-label classification, which affects supervised learning of known labels. Most current methods require sufficient image annotations, and do not focus on the imbalance of the labels in the supervised training phase. In this paper, we propose saliency regularization (SR) for a novel self-training framework. In particular, we model saliency on the class-specific maps, and strengthen the saliency of object regions corresponding to the present labels. Besides, we introduce consistency regularization to mine unlabeled information to complement unknown labels with the help of SR. It is verified to alleviate the negative dominance caused by the imbalance, and achieve state-of-the-art performance on Pascal VOC 2007, MS-COCO, VG-200, and OpenImages V3. + + + + Stabilizing Visual Reinforcement Learning via Asymmetric Interactive Cooperation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhai_Stabilizing_Visual_Reinforcement_Learning_via_Asymmetric_Interactive_Cooperation_ICCV_2023_paper.pdf + Vision-based reinforcement learning (RL) depends on discriminative representation encoders to abstract the observation states. Despite the great success of increasing CNN parameters for many supervised computer vision tasks, reinforcement learning with temporal-difference (TD) losses cannot benefit from it in most complex environments. In this paper, we find that the training instability arises from the oscillating self-overfitting of the heavy-optimizable encoder. We argue that serious oscillation occurs in the parameters when they are forced to fit the sensitive TD targets, causing uncertain drifting of the latent state space and thus transmitting these perturbations to the policy learning. To alleviate this phenomenon, we propose a novel asymmetric interactive cooperation approach with the interaction between a heavy-optimizable encoder and a supportive light-optimizable encoder, in which both their advantages are integrated including the highly discriminative capability as well as the training stability. We also present a greedy bootstrapping optimization to isolate the visual perturbations from policy learning, where representation and policy are trained sufficiently by turns. Finally, we demonstrate the effectiveness of our method in utilizing larger visual models on the first-person highway driving task in CARLA and in ViZDoom environments. 
+ + + + Learning Hierarchical Features with Joint Latent Space Energy-Based Prior + http://openaccess.thecvf.com//content/ICCV2023/papers/Cui_Learning_Hierarchical_Features_with_Joint_Latent_Space_Energy-Based_Prior_ICCV_2023_paper.pdf + This paper studies the fundamental problem of multi-layer generator models in learning hierarchical representations. The multi-layer generator model that consists of multiple layers of latent variables organized in a top-down architecture tends to learn multiple levels of data abstraction. However, such multi-layer latent variables are typically parameterized to be Gaussian, which can be less informative in capturing complex abstractions, resulting in limited success in hierarchical representation learning. On the other hand, the energy-based (EBM) prior is known to be expressive in capturing the data regularities, but it often lacks the hierarchical structure to capture different levels of hierarchical representations. In this paper, we propose a joint latent space EBM prior model with multi-layer latent variables for effective hierarchical representation learning. We develop a variational joint learning scheme that seamlessly integrates an inference model for efficient inference. Our experiments demonstrate that the proposed joint EBM prior is effective and expressive in capturing hierarchical representations and modelling data distribution. + + + + UniFormerV2: Unlocking the Potential of Image ViTs for Video Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_UniFormerV2_Unlocking_the_Potential_of_Image_ViTs_for_Video_Understanding_ICCV_2023_paper.pdf + The prolific performances of Vision Transformers (ViTs) in image tasks have prompted research into adapting the image ViTs for video tasks. However, the substantial gap between image and video impedes the spatiotemporal learning of these image-pretrained models. Though video-specialized models like UniFormer can transfer to the video domain more seamlessly, their unique architectures require prolonged image pretraining, limiting the scalability. Given the emergence of powerful open-source image ViTs, we propose unlocking their potential for video understanding with efficient UniFormer designs. We call the resulting model UniFormerV2, since it inherits the concise style of the UniFormer block, while redesigning local and global relation aggregators that seamlessly integrate advantages from both ViTs and UniFormer. Our UniFormerV2 achieves state-of-the-art performances on 8 popular video benchmarks, including scene-related Kinetics-400/600/700, heterogeneous Moments in Time, temporal-related Something-Something V1/V2, and untrimmed ActivityNet and HACS. It is noteworthy that to the best of our knowledge, UniFormerV2 is the first to elicit 90% top-1 accuracy on Kinetics-400. + + + + TARGET: Federated Class-Continual Learning via Exemplar-Free Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_TARGET_Federated_Class-Continual_Learning_via_Exemplar-Free_Distillation_ICCV_2023_paper.pdf + This paper focuses on an under-explored yet important problem: Federated Class-Continual Learning (FCCL), where new classes are dynamically added in federated learning. Existing FCCL works suffer from various limitations, such as requiring additional datasets or storing the private data from previous tasks. In response, we first demonstrate that non-IID data exacerbates catastrophic forgetting issue in FL. 
Then we propose a novel method called TARGET (federatTed clAss-continual leaRninG via Exemplar-free disTillation), which alleviates catastrophic forgetting in FCCL while preserving client data privacy. Our proposed method leverages the previously trained global model to transfer knowledge of old tasks to the current task at the model level. Moreover, a generator is trained to produce synthetic data to simulate the global distribution of data on each client at the data level. Compared to previous FCCL methods, TARGET does not require any additional datasets or storing real data from previous tasks, which makes it ideal for data-sensitive scenarios. + + + + DiffV2S: Diffusion-Based Video-to-Speech Synthesis with Vision-Guided Speaker Embedding + http://openaccess.thecvf.com//content/ICCV2023/papers/Choi_DiffV2S_Diffusion-Based_Video-to-Speech_Synthesis_with_Vision-Guided_Speaker_Embedding_ICCV_2023_paper.pdf + Recent research has demonstrated impressive results in video-to-speech synthesis which involves reconstructing speech solely from visual input. However, previous works have struggled to accurately synthesize speech due to a lack of sufficient guidance for the model to infer the correct content with the appropriate sound. To resolve the issue, they have adopted an extra speaker embedding as a speaking style guidance from a reference auditory information. Nevertheless, it is not always possible to obtain the audio information from the corresponding video input, especially during the inference time. In this paper, we present a novel vision-guided speaker embedding extractor using a self-supervised pre-trained model and prompt tuning technique. In doing so, the rich speaker embedding information can be produced solely from input visual information, and the extra audio information is not necessary during the inference time. Using the extracted vision-guided speaker embedding representations, we further develop a diffusion-based video-to-speech synthesis model, so called DiffV2S, conditioned on those speaker embeddings and the visual representation extracted from the input video. The proposed DiffV2S not only maintains phoneme details contained in the input video frames, but also creates a highly intelligible mel-spectrogram in which the speaker identities of the multiple speakers are all preserved. Our experimental results show that DiffV2S achieves the state-of-the-art performance compared to the previous video-to-speech synthesis technique. + + + + The Effectiveness of MAE Pre-Pretraining for Billion-Scale Pretraining + http://openaccess.thecvf.com//content/ICCV2023/papers/Singh_The_Effectiveness_of_MAE_Pre-Pretraining_for_Billion-Scale_Pretraining_ICCV_2023_paper.pdf + This paper revisits the standard pretrain-then-finetune paradigm used in computer vision for visual recognition tasks. Typically, state-of-the-art foundation models are pretrained using large scale (weakly) supervised datasets with billions of images. We introduce an additional pre-pretraining stage that is simple and uses the self supervised MAE technique to initialize the model. While MAE has only been shown to scale with the size of models, we find that it scales with the size of the training dataset as well. Thus, our MAE-based pre-pretraining scales with both model and data size making it applicable for training foundation models. 
Pre-pretraining consistently improves both the model convergence and the downstream transfer performance across a range of model scales (millions to billions of parameters), and dataset sizes (millions to billions of labels). We measure the effectiveness of pre-pretraining on 10 different visual recognition tasks spanning image classification, video recognition, object detection, low-shot classification and zero-shot recognition. Our largest model achieves new state-of-the-art results on iNaturalist-18 (91.3%), 1-shot ImageNet-1k (62.1%), and zero-shot transfer on Food-101 (96.2%). Our study reveals that model initialization plays a significant role, even for web-scale pretraining with billions of images. + + + + GPA-3D: Geometry-aware Prototype Alignment for Unsupervised Domain Adaptive 3D Object Detection from Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_GPA-3D_Geometry-aware_Prototype_Alignment_for_Unsupervised_Domain_Adaptive_3D_Object_ICCV_2023_paper.pdf + LiDAR-based 3D detection has made great progress in recent years. However, the performance of 3D detectors is considerably limited when deployed in unseen environments, owing to the severe domain gap problem. Existing domain adaptive 3D detection methods do not adequately consider the problem of the distributional discrepancy in feature space, thereby hindering the generalization of detectors across domains. In this work, we propose a novel unsupervised domain adaptive 3D detection framework, namely Geometry-aware Prototype Alignment (GPA-3D), which explicitly leverages the intrinsic geometric relationship from point cloud objects to reduce the feature discrepancy, thus facilitating cross-domain transferring. Specifically, GPA-3D assigns a series of tailored and learnable prototypes to point cloud objects with distinct geometric structures. Each prototype aligns BEV (bird's-eye-view) features derived from corresponding point cloud objects on source and target domains, reducing the distributional discrepancy and achieving better adaptation. The evaluation results obtained on various benchmarks, including Waymo, nuScenes and KITTI, demonstrate the superiority of our GPA-3D over the state-of-the-art approaches for different adaptation scenarios. The MindSpore version code will be publicly available at https://github.com/Liz66666/GPA3D. + + + + TransHuman: A Transformer-based Human Representation for Generalizable Neural Human Rendering + http://openaccess.thecvf.com//content/ICCV2023/papers/Pan_TransHuman_A_Transformer-based_Human_Representation_for_Generalizable_Neural_Human_Rendering_ICCV_2023_paper.pdf + In this paper, we focus on the task of generalizable neural human rendering which trains conditional Neural Radiance Fields (NeRF) from multi-view videos of different characters. To handle the dynamic human motion, previous methods have primarily used a SparseConvNet (SPC)-based human representation to process the painted SMPL. However, such SPC-based representation i) optimizes under the volatile observation space which leads to pose misalignment between the training and inference stages, and ii) lacks the global relationships among human parts that are critical for handling the incomplete painted SMPL. Tackling these issues, we present a brand-new framework named TransHuman, which learns the painted SMPL under the canonical space and captures the global relationships between human parts with transformers. 
Specifically, TransHuman is mainly composed of Transformer-based Human Encoding (TransHE), Deformable Partial Radiance Fields (DPaRF), and Fine-grained Detail Integration (FDI). TransHE first processes the painted SMPL under the canonical space via transformers for capturing the global relationships between human parts. Then, DPaRF binds each output token with a deformable radiance field for encoding the query point under the observation space. Finally, the FDI is employed to further integrate fine-grained information from reference images. Extensive experiments on ZJU-MoCap and H36M show that our TransHuman achieves a significantly new state-of-the-art performance with high efficiency. Project page: https://pansanity666.github.io/TransHuman/ + + + + Unsupervised Surface Anomaly Detection with Diffusion Probabilistic Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Unsupervised_Surface_Anomaly_Detection_with_Diffusion_Probabilistic_Model_ICCV_2023_paper.pdf + Unsupervised surface anomaly detection aims at discovering and localizing anomalous patterns using only anomaly-free training samples. Reconstruction-based models are among the most popular and successful methods, which rely on the assumption that anomaly regions are more difficult to reconstruct. However, there are three major challenges to the practical application of this approach: 1) the reconstruction quality needs to be further improved since it has a great impact on the final result, especially for images with structural changes; 2) it is observed that for many neural networks, the anomalies can also be well reconstructed, which severely violates the underlying assumption; 3) since reconstruction is an ill-conditioned problem, a test instance may correspond to multiple normal patterns, but most current reconstruction-based methods have ignored this critical fact. In this paper, we propose DiffAD, a method for unsupervised anomaly detection based on the latent diffusion model, inspired by its ability to generate high-quality and diverse images. We further propose noisy condition embedding and interpolated channels to address the aforementioned challenges in the general reconstruction-based pipeline. Extensive experiments show that our method achieves state-of-the-art performance on the challenging MVTec dataset, especially in localization accuracy. + + + + Simoun: Synergizing Interactive Motion-appearance Understanding for Vision-based Reinforcement Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Simoun_Synergizing_Interactive_Motion-appearance_Understanding_for_Vision-based_Reinforcement_Learning_ICCV_2023_paper.pdf + Efficient motion and appearance modeling are critical for vision-based Reinforcement Learning (RL). However, existing methods struggle to reconcile motion and appearance information within the state representations learned from a single observation encoder. To address the problem, we present Synergizing Interactive Motion-appearance Understanding (Simoun), a unified framework for vision-based RL. Given consecutive observation frames, Simoun deliberately and interactively learns both motion and appearance features through a dual-path network architecture. The learning process collaborates with a structural interactive module, which explores the latent motion-appearance structures from the two network paths to leverage their complementarity. 
To promote sample efficiency, we further design a consistency-guided curiosity module to encourage the exploration of under-learned observations. During training, the curiosity module provides intrinsic rewards according to the consistency of environmental temporal dynamics, which are deduced from both motion and appearance network paths. Experiments conducted on the DeepMind control suite and CARLA automatic driving benchmarks demonstrate the effectiveness of Simoun, where it performs favorably against state-of-the-art methods. + + + + Representation Disparity-aware Distillation for 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Representation_Disparity-aware_Distillation_for_3D_Object_Detection_ICCV_2023_paper.pdf + In this paper, we focus on developing knowledge distillation (KD) for compact 3D detectors. We observe that off-the-shelf KD methods manifest their efficacy only when the teacher model and student counterpart share similar intermediate feature representations. This might explain why they are less effective in building extreme-compact 3D detectors where significant representation disparity arises due primarily to the intrinsic sparsity and irregularity in 3D point clouds. This paper presents a novel representation disparity-aware distillation (RDD) method to address the representation disparity issue and reduce performance gap between compact students and over-parameterized teachers. This is accomplished by building our RDD from an innovative perspective of information bottleneck (IB), which can effectively minimize the disparity of proposal region pairs from student and teacher in features and logits. Extensive experiments are performed to demonstrate the superiority of our RDD over existing KD methods. For example, our RDD increases mAP of CP-Voxel-S to 57.1% on nuScenes dataset, which even surpasses teacher performance while taking up only 42% FLOPs. + + + + Breaking The Limits of Text-conditioned 3D Motion Synthesis with Elaborative Descriptions + http://openaccess.thecvf.com//content/ICCV2023/papers/Qian_Breaking_The_Limits_of_Text-conditioned_3D_Motion_Synthesis_with_Elaborative_ICCV_2023_paper.pdf + Given its wide applications, there is increasing focus on generating 3D human motions from textual descriptions. Differing from the majority of previous works, which regard actions as single entities and can only generate short sequences for simple motions, we propose EMS, an elaborative motion synthesis model conditioned on detailed natural language descriptions. It generates natural and smooth motion sequences for long and complicated actions by factorizing them into groups of atomic actions. Meanwhile, it understands atomic-action level attributes (e.g., motion direction, speed, and body parts) and enables users to generate sequences of unseen complex actions from unique sequences of known atomic actions with independent attribute settings and timings applied. We evaluate our method on the KIT Motion-Language and BABEL benchmarks, where it outperforms all previous state-of-the-art with noticeable margins. + + + + VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_VL-PET_Vision-and-Language_Parameter-Efficient_Tuning_via_Granularity_Control_ICCV_2023_paper.pdf + As the model size of pre-trained language models (PLMs) grows rapidly, full fine-tuning becomes prohibitively expensive for model training and storage. 
In vision-and-language (VL), parameter-efficient tuning (PET) techniques are proposed to integrate modular modifications (e.g., Adapter) into encoder-decoder PLMs. By tuning a small set of trainable parameters, these techniques perform on par with full fine-tuning. However, excessive modular modifications and neglecting the unique abilities of the encoders and decoders can lead to performance degradation, while existing PET techniques (e.g., VL-Adapter) overlook these issues. In this paper, we propose a Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework to impose effective control over modular modifications via a novel granularity-controlled mechanism. Considering different granularity-controlled matrices generated by this mechanism, a variety of model-agnostic VL-PET modules can be instantiated from our framework for better efficiency and effectiveness trade-offs. We further propose lightweight designs to enhance VL alignment and modeling for the encoders and maintain text generation for the decoders. Extensive experiments conducted on four image-text tasks and four video-text tasks demonstrate the efficiency, effectiveness, scalability and transferability of our VL-PET framework. In particular, our VL-PET-large significantly outperforms full fine-tuning by 2.39% (2.61%) and VL-Adapter by 2.92% (3.41%) with BART-base (T5-base) on image-text tasks, while utilizing fewer trainable parameters. Furthermore, we validate the enhanced effect of employing our VL-PET designs (e.g., granularity-controlled mechanism and lightweight designs) on existing PET techniques, enabling them to achieve significant performance improvements. + + + + ROME: Robustifying Memory-Efficient NAS via Topology Disentanglement and Gradient Accumulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_ROME_Robustifying_Memory-Efficient_NAS_via_Topology_Disentanglement_and_Gradient_Accumulation_ICCV_2023_paper.pdf + Albeit being a prevalent architecture searching approach, differentiable architecture search (DARTS) is largely hindered by its substantial memory cost since the entire supernet resides in the memory. This is where the single-path DARTS comes in, which only chooses a single-path submodel at each step. While being memory-friendly, it also comes with low computational costs. Nonetheless, we discover a critical issue of single-path DARTS that has not been primarily noticed. Namely, it also suffers from severe performance collapse since too many parameter-free operations like skip connections are derived, just like DARTS does. In this paper, we propose a new algorithm called RObustifying Memory-Efficient NAS (ROME) to give a cure. First, we disentangle the topology search from the operation search to make searching and evaluation consistent. We then adopt Gumbel-Top2 reparameterization and gradient accumulation to robustify the unwieldy bi-level optimization. We verify ROME extensively across 15 benchmarks to demonstrate its effectiveness and robustness. + + + + Toward Multi-Granularity Decision-Making: Explicit Visual Reasoning with Hierarchical Knowledge + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Toward_Multi-Granularity_Decision-Making_Explicit_Visual_Reasoning_with_Hierarchical_Knowledge_ICCV_2023_paper.pdf + Answering visual questions requires the ability to parse visual observations and correlate them with a variety of knowledge. 
Existing visual question answering (VQA) models either pay little attention to the role of knowledge or do not take into account the granularity of knowledge (e.g., attaching the color of "grassland" to "ground"). They have yet to develop the capability of modeling knowledge of multiple granularities, and are also vulnerable to spurious data biases. To fill the gap, this paper makes progress from two distinct perspectives: (1) It presents a Hierarchical Concept Graph (HCG) that discriminates and associates multi-granularity concepts with a multi-layered hierarchical structure, aligning visual observations with knowledge across different levels to alleviate data biases. (2) To facilitate a comprehensive understanding of how knowledge contributes throughout the decision-making process, we further propose an interpretable Hierarchical Concept Neural Module Network (HCNMN). It explicitly propagates multi-granularity knowledge across the hierarchical structure and incorporates it into a sequence of reasoning steps, providing a transparent interface to elaborate on the integration of observations and knowledge. Through extensive experiments on multiple challenging datasets (i.e., GQA, VQA, FVQA, OK-VQA), we demonstrate the effectiveness of our method in answering questions in different scenarios. Our code is available at https://github.com/SuperJohnZhang/HCNMN. + + + + 3D-aware Image Generation using 2D Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiang_3D-aware_Image_Generation_using_2D_Diffusion_Models_ICCV_2023_paper.pdf + In this paper, we introduce a novel 3D-aware image generation method that leverages 2D diffusion models. We formulate the 3D-aware image generation task as multiview 2D image set generation, and further as a sequential unconditional-conditional multiview image generation process. This allows us to utilize 2D diffusion models to boost the generative modelling power of the method. Additionally, we incorporate depth information from monocular depth estimators to construct the training data for the conditional diffusion model using only still images. We train our method on a large-scale unstructured dataset, i.e., ImageNet, which is not addressed by previous methods. It produces high-quality images that significantly outperform prior methods. Furthermore, our approach showcases its capability to generate instances with large view angles, even though the training images are diverse and unaligned, gathered from "in-the-wild" real-world environments. + + + + ICE-NeRF: Interactive Color Editing of NeRFs via Decomposition-Aware Weight Optimization + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_ICE-NeRF_Interactive_Color_Editing_of_NeRFs_via_Decomposition-Aware_Weight_Optimization_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRFs) have gained considerable attention for their high-quality results in 3D scene reconstruction and rendering. Recently, there have been active studies on various tasks such as novel view synthesis and scene editing. However, editing NeRFs is challenging, as accurately decomposing the desired area of 3D space and ensuring the consistency of edited results from different angles is difficult. In this paper, we propose ICE-NeRF, an Interactive Color Editing framework that performs color editing by taking a pre-trained NeRF and a rough user mask as input. Our proposed method performs the entire color editing process in under a minute using a partial fine-tuning approach. 
To perform effective color editing, we address two issues: (1) the entanglement of the implicit representation that causes unwanted color changes in undesired areas when learning weights, and (2) the loss of multi-view consistency when fine-tuning for a single or a few views. To address these issues, we introduce two techniques: Activation Field-based Regularization (AFR) and Single-mask Multi-view Rendering (SMR). The AFR performs weight regularization during fine-tuning based on the assumption that not all weights have an equal impact on the desired area. The SMR maps the 2D mask to 3D space through inverse projection and renders it from other views to generate multi-view masks. ICE-NeRF not only enables well-decomposed, multi-view consistent color editing but also significantly reduces processing time compared to existing methods. + + + + SPANet: Frequency-balancing Token Mixer using Spectral Pooling Aggregation Modulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yun_SPANet_Frequency-balancing_Token_Mixer_using_Spectral_Pooling_Aggregation_Modulation_ICCV_2023_paper.pdf + Recent studies show that self-attentions behave like low-pass filters (as opposed to convolutions) and enhancing their high-pass filtering capability improves model performance. Contrary to this idea, we investigate existing convolution-based models with spectral analysis and observe that improving the low-pass filtering in convolution operations also leads to performance improvement. To account for this observation, we hypothesize that utilizing optimal token mixers that capture balanced representations of both high- and low-frequency components can enhance the performance of models. We verify this by decomposing visual features into the frequency domain and combining them in a balanced manner. To handle this, we replace the balancing problem with a mask filtering problem in the frequency domain. Then, we introduce a novel token-mixer named SPAM and leverage it to derive a MetaFormer model termed as SPANet. Experimental results show that the proposed method provides a way to achieve this balance, and the balanced representations of both high- and low-frequency components can improve the performance of models on multiple computer vision tasks. Our code is available at https://doranlyong.github.io/projects/spanet/. + + + + ASAG: Building Strong One-Decoder-Layer Sparse Detectors via Adaptive Sparse Anchor Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Fu_ASAG_Building_Strong_One-Decoder-Layer_Sparse_Detectors_via_Adaptive_Sparse_Anchor_ICCV_2023_paper.pdf + Recent sparse detectors with multiple, e.g. six, decoder layers achieve promising performance but much inference time due to complex heads. Previous works have explored using dense priors as initialization and built one-decoder-layer detectors. Although they gain remarkable acceleration, their performance still lags behind their six-decoder-layer counterparts by a large margin. In this work, we aim to bridge this performance gap while retaining fast speed. We find that the architecture discrepancy between dense and sparse detectors leads to feature conflict, hampering the performance of one-decoder-layer detectors. Thus we propose Adaptive Sparse Anchor Generator (ASAG) which predicts dynamic anchors on patches rather than grids in a sparse way so that it alleviates the feature conflict problem. 
For each image, ASAG dynamically selects which feature maps and which locations to predict, forming a fully adaptive way to generate image-specific anchors. Further, a simple and effective Query Weighting method eases the training instability caused by this adaptiveness. Extensive experiments show that our method outperforms dense-initialized ones and achieves a better speed-accuracy trade-off. The code is available at https://github.com/iSEE-Laboratory/ASAG. + + + + The Perils of Learning From Unlabeled Data: Backdoor Attacks on Semi-supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Shejwalkar_The_Perils_of_Learning_From_Unlabeled_Data_Backdoor_Attacks_on_ICCV_2023_paper.pdf + Semi-supervised learning (SSL) is gaining popularity as it reduces the cost of machine learning (ML) by training high-performance models using unlabeled data. In this paper, we reveal that the key feature of SSL, i.e., learning from (non-inspected) unlabeled data, exposes SSL to strong poisoning attacks that can significantly damage its security. Poisoning is a long-standing problem in conventional supervised ML, but we argue that, as SSL relies on non-inspected unlabeled data, poisoning poses a more significant threat to SSL. We demonstrate this by designing a backdoor poisoning attack on SSL that can be conducted by a weak adversary with no knowledge of the target SSL pipeline. This is unlike prior poisoning attacks on supervised ML that assume strong adversaries with impractical capabilities. We show that by poisoning only 0.2% of the unlabeled training data, our (weak) adversary can successfully cause misclassification on more than 80% of test inputs (when they contain the backdoor trigger). Our attack remains effective across different benchmark datasets and SSL algorithms, and even circumvents state-of-the-art defenses against backdoor attacks. Our work raises significant concerns about the security of SSL in real-world security-critical applications. + + + + StyleDiffusion: Controllable Disentangled Style Transfer via Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_StyleDiffusion_Controllable_Disentangled_Style_Transfer_via_Diffusion_Models_ICCV_2023_paper.pdf + Content and style (C-S) disentanglement is a fundamental problem and critical challenge of style transfer. Existing approaches based on explicit definitions (e.g., Gram matrix) or implicit learning (e.g., GANs) are neither interpretable nor easy to control, resulting in entangled representations and less satisfying results. In this paper, we propose a new C-S disentangled framework for style transfer without using previous assumptions. The key insight is to explicitly extract the content information and implicitly learn the complementary style information, yielding interpretable and controllable C-S disentanglement and style transfer. A simple yet effective CLIP-based style disentanglement loss coordinated with a style reconstruction prior is introduced to disentangle C-S in the CLIP image space. By further leveraging the powerful style removal and generative ability of diffusion models, our framework achieves results superior to the state of the art, along with flexible C-S disentanglement and trade-off control. Our work provides new insights into C-S disentanglement in style transfer and demonstrates the potential of diffusion models for learning well-disentangled C-S characteristics. 
+ + + + AdvDiffuser: Natural Adversarial Example Synthesis with Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_AdvDiffuser_Natural_Adversarial_Example_Synthesis_with_Diffusion_Models_ICCV_2023_paper.pdf + Previous work on adversarial examples typically involves a fixed norm perturbation budget, which fails to capture the way humans perceive perturbations. Recent work has shifted towards investigating natural unrestricted adversarial examples (UAEs) that break l_p perturbation bounds but nonetheless remain semantically plausible. Current methods use GANs or VAEs to generate UAEs by perturbing latent codes. However, this leads to loss of high-level information, resulting in low-quality and unnatural UAEs. In light of this, we propose AdvDiffuser, a new method for synthesizing natural UAEs using diffusion models. It can generate UAEs from scratch or conditionally based on reference images. To generate natural UAEs, we perturb predicted images to steer their latent code towards the adversarial sample space of a particular classifier. In addition, we propose adversarial inpainting based on class activation mapping to retain the salient regions of the image while perturbing less important areas. Our method achieves impressive results on CIFAR-10, CelebA and ImageNet, and we demonstrate that it can defeat the most robust models on the RobustBench leaderboard with near 100% success rates. Furthermore, the synthesized UAEs are not only more natural but also stronger compared to the current state-of-the-art attacks. Specifically, compared with GA-attack, the UAEs generated with AdvDiffuser exhibit 6x smaller LPIPS perturbations, 2-3x smaller FID scores and 0.28 higher SSIM, making them perceptually stealthier. Lastly, it is capable of generating an unlimited number of natural adversarial examples. For more details, please visit our project page (link to follow). + + + + DarSwin: Distortion Aware Radial Swin Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Athwale_DarSwin_Distortion_Aware_Radial_Swin_Transformer_ICCV_2023_paper.pdf + Wide-angle lenses are commonly used in perception tasks requiring a large field of view. Unfortunately, these lenses produce significant distortions, making conventional models that ignore the distortion effects unable to adapt to wide-angle images. In this paper, we present a novel transformer-based model that automatically adapts to the distortion produced by wide-angle lenses. We leverage the physical characteristics of such lenses, which are analytically defined by the radial distortion profile (assumed to be known), to develop a distortion aware radial swin transformer (DarSwin). In contrast to conventional transformer-based architectures, DarSwin comprises a radial patch partitioning, a distortion-based sampling technique for creating token embeddings, and an angular position encoding for radial patch merging. We validate our method on classification tasks using synthetically distorted ImageNet data and show through extensive experiments that DarSwin can perform zero-shot adaptation to unseen distortions of different wide-angle lenses. Compared to other baselines, DarSwin achieves the best results (in terms of Top-1 accuracy) with significant gains when trained on bounded levels of distortions (very-low, low, medium, and high) and tested on all, including out-of-distribution distortions. 
The code and models are publicly available at https://lvsn.github.io/darswin/ + + + + Take-A-Photo: 3D-to-2D Generative Pre-training of Point Cloud Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Take-A-Photo_3D-to-2D_Generative_Pre-training_of_Point_Cloud_Models_ICCV_2023_paper.pdf + With the overwhelming trend of masked image modeling led by MAE, generative pre-training has shown a remarkable potential to boost the performance of fundamental models in 2D vision. However, in 3D vision, the over-reliance on Transformer-based backbones and the unordered nature of point clouds have restricted the further development of generative pre-training. In this paper, we propose a novel 3D-to-2D generative pre-training method that is adaptable to any point cloud model. We propose to generate view images from different instructed poses via the cross-attention mechanism as the pre-training scheme. Generating view images has more precise supervision than its point cloud counterpart, thus assisting 3D backbones to have a finer comprehension of the geometrical structure and stereoscopic relations of the point cloud. Experimental results have proved the superiority of our proposed 3D-to-2D generative pre-training over previous pre-training methods. Our method is also effective in boosting the performance of architecture-oriented approaches, achieving state-of-the-art performance when fine-tuning on ScanObjectNN classification and ShapeNetPart segmentation tasks. Code is available at https://github.com/wangzy22/TakeAPhoto. + + + + Open-vocabulary Panoptic Segmentation with Embedding Modulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Open-vocabulary_Panoptic_Segmentation_with_Embedding_Modulation_ICCV_2023_paper.pdf + Open-vocabulary segmentation is attracting increasing attention due to its critical applications in the real world. Traditional closed-vocabulary segmentation methods are not able to characterize novel objects, whereas several recent open-vocabulary attempts obtain unsatisfactory results, i.e., notable performance reduction on the closed vocabulary and a massive demand for extra training data. To this end, we propose OPSNet, an omnipotent and data-efficient framework for Open-vocabulary Panoptic Segmentation. Specifically, the exquisitely designed Embedding Modulation module, together with several meticulous components, enables adequate embedding enhancement and information exchange between the segmentation backbone and the visual-linguistic well-aligned CLIP encoder, resulting in superior segmentation performance under both open- and closed-vocabulary settings and much less need for additional data. Extensive experimental evaluations are conducted across multiple datasets (e.g., COCO, ADE20K, Cityscapes, and PascalContext) under various circumstances, where the proposed OPSNet achieves state-of-the-art results, which demonstrates the effectiveness and generality of the proposed approach. The code and trained models will be made publicly available. + + + + Beyond Single Path Integrated Gradients for Reliable Input Attribution via Randomized Path Sampling + http://openaccess.thecvf.com//content/ICCV2023/papers/Jeon_Beyond_Single_Path_Integrated_Gradients_for_Reliable_Input_Attribution_via_ICCV_2023_paper.pdf + Input attribution is a widely used explanation method for deep neural networks, especially in visual tasks. 
Among various attribution methods, Integrated Gradients (IG) is frequently used because of its model-agnostic applicability and desirable axioms. However, previous work has shown that such methods often produce noisy and unreliable attributions during the integration of the gradients over the path defined in the input space. In this paper, we tackle this issue by estimating the distribution of possible attributions induced by the choice of integration path. We show that such noisy attributions can be reduced by aggregating attributions from multiple paths instead of using a single path. Inspired by the Stick-Breaking Process (SBP), we suggest a random process to generate rich and varied samplings of the gradient integration path. Using the multiple input attributions obtained from randomized paths, we propose a novel attribution measure based on the distribution of attributions at each input feature. We qualitatively show that the proposed method produces less noisy, object-aligned attributions, and demonstrate its feasibility through quantitative evaluations. + + + + Foreground-Background Separation through Concept Distillation from Generative Image Foundation Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Dombrowski_Foreground-Background_Separation_through_Concept_Distillation_from_Generative_Image_Foundation_Models_ICCV_2023_paper.pdf + Curating datasets for object segmentation is a difficult task. With the advent of large-scale pre-trained generative models, conditional image generation has been given a significant boost in result quality and ease of use. In this paper, we present a novel method that enables the generation of general foreground-background segmentation models from simple textual descriptions, without requiring segmentation labels. We leverage and explore pre-trained latent diffusion models to automatically generate weak segmentation masks for concepts and objects. The masks are then used to fine-tune the diffusion model on an inpainting task, which enables fine-grained removal of the object, while at the same time providing a synthetic foreground and background dataset. We demonstrate that this method beats previous methods in both discriminative and generative performance and closes the gap with fully supervised training while requiring no pixel-wise object labels. We show results on the task of segmenting four different objects (humans, dogs, cars, birds) and a use case scenario in medical image analysis. The code is available at https://github.com/MischaD/fobadiffusion. + + + + ENVIDR: Implicit Differentiable Renderer with Neural Environment Lighting + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_ENVIDR_Implicit_Differentiable_Renderer_with_Neural_Environment_Lighting_ICCV_2023_paper.pdf + Recent advances in neural rendering have shown great potential for reconstructing scenes from multiview images. However, accurately representing objects with glossy surfaces remains a challenge for existing methods. In this work, we introduce ENVIDR, a rendering and modeling framework for high-quality rendering and reconstruction of surfaces with challenging specular reflections. To achieve this, we first propose a novel neural renderer with decomposed rendering components to learn the interaction between surface and environment lighting. This renderer is trained using existing physically based renderers and is decoupled from actual scene representations. 
We then propose an SDF-based neural surface model that leverages this learned neural renderer to represent general scenes. Our model additionally synthesizes indirect illumination caused by inter-reflections from shiny surfaces by marching surface-reflected rays. We demonstrate that our method outperforms state-of-the-art methods on challenging shiny scenes, providing high-quality rendering of specular reflections while also enabling material editing and scene relighting. + + + + Not All Steps are Created Equal: Selective Diffusion Distillation for Image Manipulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Not_All_Steps_are_Created_Equal_Selective_Diffusion_Distillation_for_ICCV_2023_paper.pdf + Conditional diffusion models have demonstrated impressive performance in image manipulation tasks. The general pipeline involves adding noise to the image and then denoising it. However, this method faces a trade-off problem: adding too much noise affects the fidelity of the image, while adding too little affects its editability. This largely limits their practical applicability. In this paper, we propose a novel framework, Selective Diffusion Distillation (SDD), that ensures both the fidelity and editability of images. Instead of directly editing images with a diffusion model, we train a feedforward image manipulation network under the guidance of the diffusion model. Besides, we propose an effective indicator to select the semantically relevant timestep to obtain the correct semantic guidance from the diffusion model. This approach successfully avoids the dilemma caused by the diffusion process. Our extensive experiments demonstrate the advantages of our framework. + + + + ALIP: Adaptive Language-Image Pre-Training with Synthetic Caption + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_ALIP_Adaptive_Language-Image_Pre-Training_with_Synthetic_Caption_ICCV_2023_paper.pdf + Contrastive Language-Image Pre-training (CLIP) has significantly boosted the performance of various vision-language tasks by scaling up the dataset with image-text pairs collected from the web. However, the presence of intrinsic noise and unmatched image-text pairs in web data can potentially affect the performance of representation learning. To address this issue, we first utilize the OFA model to generate synthetic captions that focus on the image content. The generated captions contain complementary information that is beneficial for pre-training. Then, we propose Adaptive Language-Image Pre-training (ALIP), a bi-path model that integrates supervision from both raw text and synthetic captions. As the core components of ALIP, the Language Consistency Gate (LCG) and Description Consistency Gate (DCG) dynamically adjust the weights of samples and image-text/caption pairs during the training process. Meanwhile, the adaptive contrastive loss effectively reduces the impact of noisy data and enhances the efficiency of the pre-training data. We validate ALIP with experiments on different scales of models and pre-training datasets. Experimental results show that ALIP achieves state-of-the-art performance on multiple downstream tasks including zero-shot image-text retrieval and linear probe. To facilitate future research, the code and pre-trained models are released at https://github.com/deepglint/ALIP. 
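The gating mechanism in the ALIP abstract above is only described at a high level. As a loose, hypothetical illustration of the general idea (per-sample and per-pair weights modulating a bi-path image-text contrastive loss over raw text and synthetic captions), a sketch might look as follows; the function name, the weight inputs, and the way the two text paths are combined are assumptions for illustration, not ALIP's actual formulation.

```python
import torch
import torch.nn.functional as F

def weighted_bipath_clip_loss(img, txt, cap, sample_w, pair_w_txt, pair_w_cap, tau=0.07):
    """Illustrative weighted contrastive loss over two text paths: raw text `txt`
    and synthetic caption `cap`. `sample_w` and the two `pair_w_*` tensors stand in
    for the gating outputs described in the abstract; how ALIP actually computes
    them is not specified here, so treat these as placeholders.

    img, txt, cap: (B, D) embeddings; sample_w, pair_w_txt, pair_w_cap: (B,) weights.
    """
    img, txt, cap = (F.normalize(t, dim=-1) for t in (img, txt, cap))
    target = torch.arange(img.size(0), device=img.device)

    def path_loss(text_emb):
        # symmetric InfoNCE over image-to-text and text-to-image directions
        logits = img @ text_emb.t() / tau
        return 0.5 * (F.cross_entropy(logits, target, reduction="none")
                      + F.cross_entropy(logits.t(), target, reduction="none"))

    loss = sample_w * (pair_w_txt * path_loss(txt) + pair_w_cap * path_loss(cap))
    return loss.mean()
```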
+ + + + LaPE: Layer-adaptive Position Embedding for Vision Transformers with Independent Layer Normalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_LaPE_Layer-adaptive_Position_Embedding_for_Vision_Transformers_with_Independent_Layer_ICCV_2023_paper.pdf + Position information is critical for Vision Transformers (VTs) due to the permutation-invariance of self-attention operations. A typical way to introduce position information is adding the absolute Position Embedding (PE) to patch embedding before entering VTs. However, this approach applies the same Layer Normalization (LN) to token embeddings and PE, and delivers the same PE to each layer. This results in restricted and monotonic PE across layers, as the shared LN affine parameters are not dedicated to PE, and the PE cannot be adjusted on a per-layer basis. To overcome these limitations, we propose using two independent LNs for token embeddings and PE in each layer, and progressively delivering PE across layers. By implementing this approach, VTs will receive layer-adaptive and hierarchical PE. We name our method Layer-adaptive Position Embedding, abbreviated as LaPE, which is simple, effective, and robust. Extensive experiments on image classification, object detection, and semantic segmentation demonstrate that LaPE significantly outperforms the default PE method. For example, LaPE improves +1.06% for CCT on CIFAR100, +1.57% for DeiT-Ti on ImageNet-1K, +0.7 box AP and +0.5 mask AP for ViT-Adapter-Ti on COCO, and +1.37 mIoU for tiny Segmenter on ADE20K. This is remarkable considering LaPE only adds negligible parameters, memory, and computational cost. + + + + SA-BEV: Generating Semantic-Aware Bird's-Eye-View Feature for Multi-view 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_SA-BEV_Generating_Semantic-Aware_Birds-Eye-View_Feature_for_Multi-view_3D_Object_Detection_ICCV_2023_paper.pdf + Recently, pure camera-based Bird's-Eye-View (BEV) perception has provided a feasible solution for economical autonomous driving. However, the existing BEV-based multi-view 3D detectors generally transform all image features into BEV features, without considering that the large proportion of background information may submerge the object information. In this paper, we propose Semantic-Aware BEV Pooling (SA-BEVPool), which can filter out background information according to the semantic segmentation of image features and transform image features into semantic-aware BEV features. Accordingly, we propose BEV-Paste, an effective data augmentation strategy that closely matches the semantic-aware BEV features. In addition, we design a Multi-Scale Cross-Task (MSCT) head, which combines task-specific and cross-task information to predict depth distribution and semantic segmentation more accurately, further improving the quality of the semantic-aware BEV features. Finally, we integrate the above modules into a novel multi-view 3D object detection framework, namely SA-BEV. Experiments on nuScenes show that SA-BEV achieves state-of-the-art performance. Code is available at https://github.com/mengtan00/SA-BEV.git. 
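As a rough sketch of the semantic-aware pooling idea described in the SA-BEV abstract above (filtering background pixels by a predicted foreground probability before pooling image features into BEV cells), one might write something like the following. The thresholding rule, the confidence weighting, and the scatter-based pooling are simplifications assumed here for illustration, not the paper's implementation.

```python
import torch

def semantic_aware_bev_pool(feats, fg_prob, bev_index, num_cells, fg_thresh=0.25):
    """Pool per-pixel image/frustum features into BEV cells, keeping only pixels
    whose predicted foreground probability exceeds a threshold.

    feats:     (N, C)  flattened image/frustum features
    fg_prob:   (N,)    foreground probability per feature (from a segmentation head)
    bev_index: (N,)    target BEV cell index per feature, in [0, num_cells)
    """
    keep = fg_prob > fg_thresh
    feats, bev_index = feats[keep], bev_index[keep]
    # weight surviving features by their foreground confidence, then scatter-add
    weighted = feats * fg_prob[keep].unsqueeze(-1)
    bev = torch.zeros(num_cells, feats.size(-1), device=feats.device)
    bev.index_add_(0, bev_index, weighted)
    return bev

# Toy usage: 1000 frustum points, 64-dim features, a flattened 128x128 BEV grid.
# bev = semantic_aware_bev_pool(torch.randn(1000, 64), torch.rand(1000),
#                               torch.randint(0, 128 * 128, (1000,)), 128 * 128)
```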
+ + + + Global Knowledge Calibration for Fast Open-Vocabulary Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_Global_Knowledge_Calibration_for_Fast_Open-Vocabulary_Segmentation_ICCV_2023_paper.pdf + Recent advancements in pre-trained vision-language models, such as CLIP, have enabled the segmentation of arbitrary concepts solely from textual inputs, a process commonly referred to as open-vocabulary semantic segmentation (OVS). However, existing OVS techniques confront a fundamental challenge: the trained classifier tends to overfit on the base classes observed during training, resulting in suboptimal generalization performance to unseen classes. To mitigate this issue, recent studies have proposed the use of an additional frozen pre-trained CLIP for classification. Nonetheless, this approach incurs heavy computational overheads as the CLIP vision encoder must be repeatedly forward-passed for each mask, rendering it impractical for real-world applications. To address this challenge, our objective is to develop a fast OVS model that can perform comparably or better without the extra computational burden of the CLIP image encoder during inference. To this end, we propose a core idea of preserving the generalizable representation when fine-tuning on known classes. Specifically, we introduce a text diversification strategy that generates a set of synonyms for each training category, which prevents the learned representation from collapsing onto specific known category names. Additionally, we employ a text-guided knowledge distillation method to preserve the generalizable knowledge of CLIP. Extensive experiments demonstrate that our proposed model achieves robust generalization performance across various datasets. Furthermore, we perform a preliminary exploration of open-vocabulary video segmentation and present a benchmark that can facilitate future open-vocabulary research in the video domain. + + + + Compatibility of Fundamental Matrices for Complete Viewing Graphs + http://openaccess.thecvf.com//content/ICCV2023/papers/Bratelund_Compatibility_of_Fundamental_Matrices_for_Complete_Viewing_Graphs_ICCV_2023_paper.pdf + This paper studies the problem of recovering cameras from a set of fundamental matrices. A set of fundamental matrices is said to be compatible if a set of cameras exists for which they are the fundamental matrices. We focus on the complete graph, where fundamental matrices for each pair of cameras are given. Previous work has established necessary and sufficient conditions for compatibility as rank and eigenvalue conditions on the n-view fundamental matrix obtained by concatenating the individual fundamental matrices. In this work, we show that the eigenvalue condition is redundant in the generic and collinear cases. We provide explicit homogeneous polynomials that describe necessary and sufficient conditions for compatibility in terms of the fundamental matrices and their epipoles. In this direction, we find that quadruple-wise compatibility is enough to ensure global compatibility for any number of cameras. We demonstrate that for four cameras, compatibility is generically described by triple-wise conditions and one additional equation involving all fundamental matrices. 
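To make the notion of "compatibility" in the abstract above concrete: fundamental matrices that all arise from one common set of cameras automatically satisfy every pairwise epipolar constraint. The small NumPy check below illustrates this using the standard construction F = [e']_x P' P^+ (as in Hartley and Zisserman); it does not reproduce the paper's polynomial conditions, and the random camera setup is an arbitrary synthetic example.

```python
import numpy as np

def skew(v):
    """Cross-product matrix [v]_x."""
    return np.array([[0.0, -v[2], v[1]],
                     [v[2], 0.0, -v[0]],
                     [-v[1], v[0], 0.0]])

def fundamental_from_cameras(P1, P2):
    """F with x2^T F x1 = 0, via F = [e2]_x P2 P1^+, where e2 = P2 C1 and C1 is
    the camera centre (right null vector) of P1."""
    C1 = np.linalg.svd(P1)[2][-1]
    e2 = P2 @ C1
    return skew(e2) @ P2 @ np.linalg.pinv(P1)

rng = np.random.default_rng(0)
cams = [rng.normal(size=(3, 4)) for _ in range(4)]   # generic projective cameras

# Pairwise fundamental matrices computed from a single camera set are compatible
# by construction: every pairwise epipolar constraint vanishes for any 3D point.
X = np.append(rng.normal(size=3), 1.0)               # homogeneous 3D point
for i in range(4):
    for j in range(i + 1, 4):
        F = fundamental_from_cameras(cams[i], cams[j])
        x_i, x_j = cams[i] @ X, cams[j] @ X
        print(i, j, abs(x_j @ F @ x_i))               # numerically zero for all pairs
```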
+ + + + MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_MAtch_eXpand_and_Improve_Unsupervised_Finetuning_for_Zero-Shot_Action_Recognition_ICCV_2023_paper.pdf + Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully-supervised baselines in both zero-shot and few-shot video recognition transfer. The code is provided in supplementary and will be released upon acceptance. + + + + Space Engage: Collaborative Space Supervision for Contrastive-Based Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Space_Engage_Collaborative_Space_Supervision_for_Contrastive-Based_Semi-Supervised_Semantic_Segmentation_ICCV_2023_paper.pdf + Semi-Supervised Semantic Segmentation (S4) aims to train a segmentation model with limited labeled images and a substantial volume of unlabeled images. To improve the robustness of representations, powerful methods introduce a pixel-wise contrastive learning approach in latent space (i.e., representation space) that aggregates the representations to their prototypes in a fully supervised manner. However, previous contrastive-based S4 methods merely rely on the supervision from the model's output (logits) in logit space during unlabeled training. In contrast, we utilize the outputs in both logit space and representation space to obtain supervision in a collaborative way. The supervision from two spaces plays two roles: 1) reduces the risk of over-fitting to incorrect semantic information in logits with the help of representations; 2) enhances the knowledge exchange between the two spaces. Furthermore, unlike previous approaches, we use the similarity between representations and prototypes as a new indicator to tilt training those under-performing representations and achieve a more efficient contrastive learning process. Results on two public benchmarks demonstrate the competitive performance of our method compared with state-of-the-art methods. + + + + Delving into Motion-Aware Matching for Monocular 3D Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Delving_into_Motion-Aware_Matching_for_Monocular_3D_Object_Tracking_ICCV_2023_paper.pdf + Recent advances of monocular 3D object detection facilitate the 3D multi-object tracking task based on low-cost camera sensors. 
In this paper, we find that the motion cue of objects along different time frames is critical in 3D multi-object tracking, which is less explored in existing monocular-based approaches. To this end, we propose MoMA-M3T, a framework that mainly consists of three motion-aware components. First, we represent the possible movement of an object related to all object tracklets in the feature space as its motion features. Then, we further model the historical object tracklet along the time frame in a spatial-temporal perspective via a motion transformer. Finally, we propose a motion-aware matching module to associate historical object tracklets and current observations as final tracking results. We conduct extensive experiments on the nuScenes and KITTI datasets to demonstrate that our MoMA-M3T achieves competitive performance against state-of-the-art methods. Moreover, the proposed tracker is flexible and can be easily plugged into existing image-based 3D object detectors without re-training. Code and models are available at https://github.com/kuanchihhuang/MoMA-M3T. + + + + Fast Adversarial Training with Smooth Convergence + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Fast_Adversarial_Training_with_Smooth_Convergence_ICCV_2023_paper.pdf + Fast adversarial training (FAT) is beneficial for improving the adversarial robustness of neural networks. However, previous FAT work has encountered a significant issue known as catastrophic overfitting when dealing with large perturbation budgets, i.e. the adversarial robustness of models declines to near zero during training. To address this, we analyze the training process of prior FAT work and observe that catastrophic overfitting is accompanied by the appearance of loss convergence outliers. Therefore, we argue a moderately smooth loss convergence process will be a stable FAT process that solves catastrophic overfitting. To obtain a smooth loss convergence process, we propose a novel oscillatory constraint (dubbed ConvergeSmooth) to limit the loss difference between adjacent epochs. The convergence stride of ConvergeSmooth is introduced to balance convergence and smoothing. Likewise, we design weight centralization without introducing additional hyperparameters other than the loss balance coefficient. Our proposed methods are attack-agnostic and thus can improve the training stability of various FAT techniques. Extensive experiments on popular datasets show that the proposed methods efficiently avoid catastrophic overfitting and outperform all previous FAT methods. Code is available at https://github.com/FAT-CS/ConvergeSmooth. + + + + A-STAR: Test-time Attention Segregation and Retention for Text-to-image Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Agarwal_A-STAR_Test-time_Attention_Segregation_and_Retention_for_Text-to-image_Synthesis_ICCV_2023_paper.pdf + While recent developments in text-to-image generative models have led to a suite of high-performing methods capable of producing creative imagery from free-form text, there are several limitations. By analyzing the cross-attention representations of these models, we notice two key issues. First, for text prompts that contain multiple concepts, there is a significant amount of pixel-space overlap (i.e., same spatial regions) among pairs of different concepts. This eventually leads to the model being unable to distinguish between the two concepts and one of them being ignored in the final generation. 
Next, while these models attempt to capture all such concepts during the beginning of denoising (e.g., first few steps) as evidenced by cross-attention maps, this knowledge is not retained by the end of denoising (e.g., last few steps). Such loss of knowledge eventually leads to inaccurate generation outputs. To address these issues, our key innovations include two test-time attention-based loss functions that substantially improve the performance of pretrained baseline text-to-image diffusion models. First, our attention segregation loss reduces the cross-attention overlap between attention maps of different concepts in the text prompt, thereby reducing the confusion/conflict among various concepts and the eventual capture of all concepts in the generated output. Next, our attention retention loss explicitly forces text-to-image diffusion models to retain cross-attention information for all concepts across all denoising time steps, thereby leading to reduced information loss and the preservation of all concepts in the generated output. We conduct extensive experiments with the proposed loss functions on a variety of text prompts and demonstrate they lead to generated images that are significantly semantically closer to the input text when compared to baseline text-to-image diffusion models. + + + + FaceCLIPNeRF: Text-driven 3D Face Manipulation using Deformable Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Hwang_FaceCLIPNeRF_Text-driven_3D_Face_Manipulation_using_Deformable_Neural_Radiance_Fields_ICCV_2023_paper.pdf + As recent advances in Neural Radiance Fields (NeRF) have enabled high-fidelity 3D face reconstruction and novel view synthesis, its manipulation also became an essential task in 3D vision. However, existing manipulation methods require extensive human labor, such as a user-provided semantic mask and manual attribute search unsuitable for non-expert users. Instead, our approach is designed to require a single text to manipulate a face reconstructed with NeRF. To do so, we first train a scene manipulator, a latent code-conditional deformable NeRF, over a dynamic scene to control a face deformation using the latent code. However, representing a scene deformation with a single latent code is unfavorable for compositing local deformations observed in different instances. As so, our proposed Position-conditional Anchor Compositor (PAC) learns to represent a manipulated scene with spatially varying latent codes. Their renderings with the scene manipulator are then optimized to yield high cosine similarity to a target text in CLIP embedding space for text-driven manipulation. To the best of our knowledge, our approach is the first to address the text-driven manipulation of a face reconstructed with NeRF. Extensive results, comparisons, and ablation studies demonstrate the effectiveness of our approach. + + + + Learning Shape Primitives via Implicit Convexity Regularization + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Learning_Shape_Primitives_via_Implicit_Convexity_Regularization_ICCV_2023_paper.pdf + Shape primitives decomposition has been an important and long-standing task in 3D shape analysis. Prior arts heavily rely on 3D point clouds or voxel data for shape primitives extraction, which are less practical in real-world scenarios. This paper proposes to learn shape primitives from multi-view images by introducing implicit surface rendering. 
It is challenging since implicit shapes have a high degree of freedom, which violates the simplicity property of shape primitives. In this work, a novel regularization term named Implicit Convexity Regularization (ICR) imposed on implicit primitive learning is proposed to tackle this problem. We start with the convexity definition of general 3D shapes, and then derive the equivalent expression for implicit shapes represented by signed distance functions (SDFs). Further, instead of directly constraining the output SDF values which cause unstable optimization, we alternatively impose constraint on second order directional derivatives on line segments inside the shapes, which proves to be a tighter condition for 3D convexity. Implicit primitives constrained by the proposed ICR are combined into a whole object via softmax-weighted-sum operation over all primitive SDFs. Experiments on synthetic and real-world datasets show that our method is able to decompose objects into simple and reasonable shape primitives without the need of segmentation labels or 3D data. Code and data is publicly available in https://github.com/seanywang0408/ICR. + + + + ITI-GEN: Inclusive Text-to-Image Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_ITI-GEN_Inclusive_Text-to-Image_Generation_ICCV_2023_paper.pdf + Text-to-image generative models often reflect the biases of the training data, leading to unequal representations of underrepresented groups. This study investigates inclusive text-to-image generative models that generate images based on human-written prompts and ensure the resulting images are uniformly distributed across attributes of interest. Unfortunately, directly expressing the desired attributes in the prompt often leads to sub-optimal results due to linguistic ambiguity or model misrepresentation. Hence, this paper proposes a drastically different approach that adheres to the maxim that "a picture is worth a thousand words". We show that, for some attributes, images can represent concepts more expressively than text. For instance, categories of skin tones are typically hard to specify by text but can be easily represented by example images. Building upon these insights, we propose a novel approach, ITI-GEN, that leverages readily available reference images for Inclusive Text-to-Image GENeration. The key idea is learning a set of prompt embeddings to generate images that can effectively represent all desired attribute categories. More importantly, ITI-GEN requires no model fine-tuning, making it computationally efficient to augment existing text-to-image models. Extensive experiments demonstrate that ITI-GEN largely improves over state-of-the-art models to generate inclusive images from a prompt. + + + + Learning Neural Eigenfunctions for Unsupervised Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Deng_Learning_Neural_Eigenfunctions_for_Unsupervised_Semantic_Segmentation_ICCV_2023_paper.pdf + Unsupervised semantic segmentation is a long-standing challenge in computer vision with great significance. Spectral clustering is a theoretically grounded solution to it where the spectral embeddings for pixels are computed to construct distinct clusters. Despite recent progress in enhancing spectral clustering with powerful pre-trained models, current approaches still suffer from inefficiencies in spectral decomposition and inflexibility in applying them to the test data. 
This work addresses these issues by casting spectral clustering as a parametric approach that employs neural network-based eigenfunctions to produce spectral embeddings. The outputs of the neural eigenfunctions are further restricted to discrete vectors that indicate clustering assignments directly. As a result, an end-to-end NN-based paradigm of spectral clustering emerges. In practice, the neural eigenfunctions are lightweight and take the features from pre-trained models as inputs, improving training efficiency and unleashing the potential of pre-trained models for dense prediction. We conduct extensive empirical studies to validate the effectiveness of our approach and observe significant performance gains over competitive baselines on Pascal Context, Cityscapes, and ADE20K benchmarks. The code is available at https://github.com/thudzj/NeuralEigenfunctionSegmentor. + + + + Shape Analysis of Euclidean Curves under Frenet-Serret Framework + http://openaccess.thecvf.com//content/ICCV2023/papers/Chassat_Shape_Analysis_of_Euclidean_Curves_under_Frenet-Serret_Framework_ICCV_2023_paper.pdf + Geometric frameworks for analyzing curves are common in applications as they focus on invariant features and provide visually satisfying solutions to standard problems such as computing invariant distances, averaging curves, or registering curves. We show that for any smooth curve in R^d, d>1, the generalized curvatures associated with the Frenet-Serret equation can be used to define a Riemannian geometry that takes into account all the geometric features of the shape. This geometry is based on a Square Root Curvature Transform that extends the square-root velocity transform for Euclidean curves (in any dimension) and provides likely geodesics that avoid artefacts encountered by representations using only first-order geometric information. Our analysis is supported by simulated data and is especially relevant for analyzing human motions. We consider trajectories acquired from sign language, and show the interest of considering curvature and also torsion in their analysis, both being physically meaningful. + + + + Efficient Diffusion Training via Min-SNR Weighting Strategy + http://openaccess.thecvf.com//content/ICCV2023/papers/Hang_Efficient_Diffusion_Training_via_Min-SNR_Weighting_Strategy_ICCV_2023_paper.pdf + Denoising diffusion models have been a mainstream approach for image generation; however, training these models often suffers from slow convergence. In this paper, we discover that the slow convergence is partly due to conflicting optimization directions between timesteps. To address this issue, we treat diffusion training as a multi-task learning problem, and introduce a simple yet effective approach referred to as Min-SNR-γ. This method adapts loss weights of timesteps based on clamped signal-to-noise ratios, which effectively balances the conflicts among timesteps. Our results demonstrate a significant improvement in convergence speed, 3.4x faster than previous weighting strategies. It is also more effective, achieving a new record FID score of 2.06 on the ImageNet 256x256 benchmark using smaller architectures than those employed in the previous state-of-the-art. 
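The clamped signal-to-noise weighting described in the Min-SNR abstract above can be sketched in a few lines. The version below assumes the common epsilon-prediction objective and the usual DDPM definition SNR_t = alpha_bar_t / (1 - alpha_bar_t); the exact form used in the paper may differ in detail, and the names are illustrative.

```python
import torch

def min_snr_loss_weights(alphas_cumprod: torch.Tensor, timesteps: torch.Tensor,
                         gamma: float = 5.0) -> torch.Tensor:
    """Per-timestep loss weights w_t = min(SNR_t, gamma) / SNR_t for an
    epsilon-prediction diffusion objective (a common formulation; treat it as
    an assumption rather than the paper's exact recipe)."""
    # SNR_t = alpha_bar_t / (1 - alpha_bar_t) for the standard DDPM forward process.
    alpha_bar = alphas_cumprod[timesteps]
    snr = alpha_bar / (1.0 - alpha_bar)
    return torch.clamp(snr, max=gamma) / snr

# Usage inside a training step (shapes and variable names are illustrative):
# w = min_snr_loss_weights(alphas_cumprod, t, gamma=5.0)
# loss = (w * ((eps_pred - eps_true) ** 2).mean(dim=(1, 2, 3))).mean()
```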
+ + + + Perceptual Grouping in Contrastive Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Ranasinghe_Perceptual_Grouping_in_Contrastive_Vision-Language_Models_ICCV_2023_paper.pdf + Recent advances in zero-shot image recognition suggest that vision-language models learn generic visual representations with a high degree of semantic information that may be arbitrarily probed with natural language phrases. Understanding an image, however, is not just about understanding what content resides within an image, but importantly, where that content resides. In this work, we examine how well vision-language models are able to understand where objects reside within an image and group together visually related parts of the imagery. We demonstrate how contemporary vision and language representation learning models based on contrastive losses and large web-based data capture limited object localization information. We propose a minimal set of modifications that results in models that uniquely learn both semantic and spatial information. We measure this performance in terms of zero-shot image recognition, unsupervised bottom-up and top-down semantic segmentations, as well as robustness analyses. We find that the resulting model achieves state-of-the-art results in terms of unsupervised segmentation, and demonstrate that the learned representations are uniquely robust to spurious correlations in datasets designed to probe the causal behavior of vision models. + + + + Dynamic Perceiver for Efficient Visual Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_Dynamic_Perceiver_for_Efficient_Visual_Recognition_ICCV_2023_paper.pdf + Early exiting has become a promising approach to improving the inference efficiency of deep networks. By structuring models with multiple classifiers (exits), predictions for "easy" samples can be generated at earlier exits, negating the need for executing deeper layers. Current multi-exit networks typically implement linear classifiers at intermediate layers, compelling low-level features to encapsulate high-level semantics. This sub-optimal design invariably undermines the performance of later exits. In this paper, we propose Dynamic Perceiver (Dyn-Perceiver) to decouple the feature extraction procedure and the early classification task with a novel dual-branch architecture. A feature branch serves to extract image features, while a classification branch processes a latent code assigned for classification tasks. Bi-directional cross-attention layers are established to progressively fuse the information of both branches. Early exits are placed exclusively within the classification branch, thus eliminating the need for linear separability in low-level features. Dyn-Perceiver constitutes a versatile and adaptable framework that can be built upon various architectures. Experiments on image classification, action recognition, and object detection demonstrate that our method significantly improves the inference efficiency of different backbones, outperforming numerous competitive approaches across a broad range of computational budgets. Evaluation on both CPU and GPU platforms substantiates the superior practical efficiency of Dyn-Perceiver. Code is available at https://www.github.com/LeapLabTHU/Dynamic_Perceiver. 
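A minimal, hypothetical sketch of the dual-branch layout described in the Dynamic Perceiver abstract above is given below: a convolutional feature branch, a latent-token classification branch, bidirectional cross-attention between the two, and early-exit classifiers attached only to the latent branch. Module sizes, the exit rule, and all names are illustrative assumptions rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class DynPerceiverSketch(nn.Module):
    """Two-branch toy model: conv feature branch + latent classification branch,
    fused per stage with bidirectional cross-attention; exits sit on the latent branch."""

    def __init__(self, num_classes=1000, dim=256, num_latents=16, stages=3):
        super().__init__()
        self.stem = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.feature_blocks = nn.ModuleList(
            [nn.Conv2d(dim, dim, 3, padding=1) for _ in range(stages)])
        self.x2z = nn.ModuleList(  # image features -> latents
            [nn.MultiheadAttention(dim, 4, batch_first=True) for _ in range(stages)])
        self.z2x = nn.ModuleList(  # latents -> image features
            [nn.MultiheadAttention(dim, 4, batch_first=True) for _ in range(stages)])
        self.exits = nn.ModuleList(
            [nn.Linear(dim, num_classes) for _ in range(stages)])

    def forward(self, x, exit_threshold=None):
        b = x.size(0)
        feat = self.stem(x)
        z = self.latents.unsqueeze(0).expand(b, -1, -1)
        logits_per_exit = []
        for blk, x2z, z2x, exit_head in zip(
                self.feature_blocks, self.x2z, self.z2x, self.exits):
            feat = torch.relu(blk(feat))
            tokens = feat.flatten(2).transpose(1, 2)          # (B, HW, dim)
            z = z + x2z(z, tokens, tokens)[0]                 # fuse features into latents
            tokens = tokens + z2x(tokens, z, z)[0]            # feed latents back to features
            feat = tokens.transpose(1, 2).reshape_as(feat)
            logits = exit_head(z.mean(dim=1))                 # early exit on latent branch
            logits_per_exit.append(logits)
            if exit_threshold is not None:
                conf = logits.softmax(-1).max(-1).values
                if bool((conf > exit_threshold).all()):       # naive whole-batch early exit
                    break
        return logits_per_exit

# logits = DynPerceiverSketch()(torch.randn(2, 3, 224, 224))
```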
+ + + + Phasic Content Fusing Diffusion Model with Directional Distribution Consistency for Few-Shot Model Adaption + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Phasic_Content_Fusing_Diffusion_Model_with_Directional_Distribution_Consistency_for_ICCV_2023_paper.pdf + Training a generative model with limited number of samples is a challenging task. Current methods primarily rely on few-shot model adaption to train the network. However, in scenarios where data is extremely limited (less than 10), the generative network tends to overfit and suffers from content degradation. To address these problems, we propose a novel phasic content fusing few-shot diffusion model with directional distribution consistency loss, which targets different learning objectives at distinct training stages of the diffusion model. Specifically, we design a phasic training strategy with phasic content fusion to help our model learn content and style information when t is large, and learn local details of target domain when t is small, leading to an improvement in the capture of content, style and local details. Furthermore, we introduce a novel directional distribution consistency loss that ensures the consistency between the generated and source distributions more efficiently and stably than the prior methods, preventing our model from overfitting. Finally, we propose a cross-domain structure guidance strategy that enhances structure consistency during domain adaptation. Theoretical analysis, qualitative and quantitative experiments demonstrate the superiority of our approach in few-shot generative model adaption tasks compared to state-of-the-art methods. + + + + HAL3D: Hierarchical Active Learning for Fine-Grained 3D Part Labeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_HAL3D_Hierarchical_Active_Learning_for_Fine-Grained_3D_Part_Labeling_ICCV_2023_paper.pdf + We present the first active learning tool for fine-grained 3D part labeling, a problem which challenges even the most advanced deep learning (DL) methods due to the significant structural variations among the intricate parts. For the same reason, the necessary effort to annotate training data is tremendous, motivating approaches to minimize human involvement. Our labeling tool iteratively verifies or modifies part labels predicted by a deep neural network, with human feedback continually improving the network prediction. To effectively reduce human efforts, we develop two novel features in our tool, hierarchical and symmetry-aware active labeling. Our human-in-the-loop approach, coined HAL3D, achieves close to error-free fine-grained annotations on any test set with pre-defined hierarchical part labels, with 80% time-saving over manual effort. We will release the finely labeled models to serve the community. + + + + FedPerfix: Towards Partial Model Personalization of Vision Transformers in Federated Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_FedPerfix_Towards_Partial_Model_Personalization_of_Vision_Transformers_in_Federated_ICCV_2023_paper.pdf + Personalized Federated Learning (PFL) represents a promising solution for decentralized learning in heterogeneous data environments. Partial model personalization has been proposed to improve the efficiency of PFL by selectively updating local model parameters instead of aggregating all of them. 
However, previous work on partial model personalization has mainly focused on Convolutional Neural Networks (CNNs), leaving a gap in understanding how it can be applied to other popular models such as Vision Transformers (ViTs). In this work, we investigate where and how to partially personalize a ViT model. Specifically, we empirically evaluate the sensitivity to data distribution of each type of layer. Based on the insight that the self-attention layer and the classification head are the most sensitive parts of a ViT, we propose a novel approach called FedPerfix, which leverages plugins to transfer information from the aggregated model to the local client as a personalization. Finally, we evaluate the proposed approach on the CIFAR-100, OrganAMNIST, and Office-Home datasets and demonstrate its effectiveness in improving the model's performance compared to several advanced PFL methods. + + + + Conditional 360-degree Image Synthesis for Immersive Indoor Scene Decoration + http://openaccess.thecvf.com//content/ICCV2023/papers/Shum_Conditional_360-degree_Image_Synthesis_for_Immersive_Indoor_Scene_Decoration_ICCV_2023_paper.pdf + In this paper, we address the problem of conditional scene decoration for 360° images. Our method takes a 360° background photograph of an indoor scene and generates decorated images of the same scene in the panorama view. To do this, we develop a 360-aware object layout generator that learns latent object vectors in the 360° view to enable a variety of furniture arrangements for an input 360° background image. We use this object layout to condition a generative adversarial network to synthesize images of an input scene. To further reinforce the generation capability of our model, we develop a simple yet effective scene emptier that removes the generated furniture and produces an emptied scene for our model to learn a cyclic constraint. We train the model on the Structure3D dataset and show that our model can generate diverse decorations with controllable object layout. Our method achieves state-of-the-art performance on the Structure3D dataset and generalizes well to the Zillow indoor scene dataset. Our user study confirms the immersive experiences provided by the realistic image quality and furniture layout in our generation results. Our implementation is available at https://github.com/kcshum/neural_360_decoration.git. + + + + SIDGAN: High-Resolution Dubbed Video Generation via Shift-Invariant Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Muaz_SIDGAN_High-Resolution_Dubbed_Video_Generation_via_Shift-Invariant_Learning_ICCV_2023_paper.pdf + Dubbed video generation aims to accurately synchronize mouth movements of a given facial video with driving audio while preserving identity and scene-specific visual dynamics, such as head pose and lighting. Despite the accurate lip generation of previous approaches that adopt a pretrained audio-video synchronization metric as an objective function, called Sync-Loss, extending it to high-resolution videos was challenging due to shift biases in the loss landscape that inhibit tandem optimization of Sync-Loss and visual quality, leading to a loss of detail. To address this issue, we introduce shift-invariant learning, which generates photo-realistic high-resolution videos with accurate Lip-Sync. Further, we employ a pyramid network with coarse-to-fine image generation to improve stability and lip synchronization. 
Our model outperforms state-of-the-art methods on multiple benchmark datasets, including AVSpeech, HDTF, and LRW, in terms of photo-realism, identity preservation, and Lip-Sync accuracy. + + + + Meta-ZSDETR: Zero-shot DETR with Meta-learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Meta-ZSDETR_Zero-shot_DETR_with_Meta-learning_ICCV_2023_paper.pdf + Zero-shot object detection aims to localize and recognize objects of unseen classes. Most of existing works face two problems: the low recall of RPN in unseen classes and the confusion of unseen classes with background. In this paper, we present the first method that combines DETR and meta-learning to perform zero-shot object detection, named Meta-ZSDETR, where model training is formalized as an individual episode based meta-learning task. Different from Faster R-CNN based methods that firstly generate class-agnostic proposals, and then classify them with visual-semantic alignment module, Meta-ZSDETR directly predict class-specific boxes with class-specific queries and further filter them with the predicted accuracy from classification head. The model is optimized with meta-contrastive learning, which contains a regression head to generate the coordinates of class-specific boxes, a classification head to predict the accuracy of generated boxes, and a contrastive head that utilizes the proposed contrastive-reconstruction loss to further separate different classes in visual space. We conduct extensive experiments on two benchmark datasets MS COCO and PASCAL VOC. Experimental results show that our method outperforms the existing ZSD methods by a large margin. + + + + STPrivacy: Spatio-Temporal Privacy-Preserving Action Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_STPrivacy_Spatio-Temporal_Privacy-Preserving_Action_Recognition_ICCV_2023_paper.pdf + Existing methods of privacy-preserving action recognition (PPAR) mainly focus on frame-level (spatial) privacy removal through 2D CNNs. Unfortunately, they have two major drawbacks. First, they may compromise temporal dynamics in input videos, which are critical for accurate action recognition. Second, they are vulnerable to practical attacking scenarios where attackers probe for privacy from an entire video rather than individual frames. To address these issues, we propose a novel framework STPrivacy to perform video-level PPAR. For the first time, we introduce vision Transformers into PPAR by treating a video as a tubelet sequence, and accordingly design two complementary mechanisms, i.e., sparsification and anonymization, to remove privacy from a spatio-temporal perspective. In specific, our privacy sparsification mechanism applies adaptive token selection to abandon action-irrelevant tubelets. Then, our anonymization mechanism implicitly manipulates the remaining action-tubelets to erase privacy in the embedding space through adversarial learning. These mechanisms provide significant advantages in terms of privacy preservation for human eyes and action-privacy trade-off adjustment during deployment. We additionally contribute the first two large-scale PPAR benchmarks, VP-HMDB51 and VP-UCF101, to the community. Extensive evaluations on them, as well as two other tasks, validate the effectiveness and generalization capability of our framework. 
+ + + + Computationally-Efficient Neural Image Compression with Shallow Decoders + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Computationally-Efficient_Neural_Image_Compression_with_Shallow_Decoders_ICCV_2023_paper.pdf + Neural image compression methods have seen increasingly strong performance in recent years. However, they suffer orders of magnitude higher computational complexity compared to traditional codecs, which hinders their real-world deployment. This paper takes a step forward in closing this gap in decoding complexity by adopting shallow or even linear decoding transforms. To compensate for the resulting drop in compression performance, we exploit the often asymmetrical computation budget between encoding and decoding, by adopting more powerful encoder networks and iterative encoding. We theoretically formalize the intuition behind, and our experimental results establish a new frontier in the trade-off between rate-distortion and decoding complexity for neural image compression. Specifically, we achieve rate-distortion performance competitive with the established mean-scale hyperprior architecture of Minnen et al. (2018) at less than 50K decoding FLOPs/pixel, reducing the baseline's overall decoding complexity by 80%, or over 90% for the synthesis transform alone. Our code can be found at https://github.com/mandt-lab/shallow-ntc. + + + + Tracing the Origin of Adversarial Attack for Forensic Investigation and Deterrence + http://openaccess.thecvf.com//content/ICCV2023/papers/Fang_Tracing_the_Origin_of_Adversarial_Attack_for_Forensic_Investigation_and_ICCV_2023_paper.pdf + Deep neural networks are vulnerable to adversarial attacks. In this paper, we take the role of investigators who want to trace the attack and identify the source, that is, the particular model which the adversarial examples are generated from. Techniques derived would aid forensic investigation of attack incidents and serve as deterrence to potential attacks. We consider the buyers-seller setting where a machine learning model is to be distributed to various buyers and each buyer receives a slightly different copy with same functionality. A malicious buyer generates adversarial examples from a particular copy "Mi" and uses them to attack other copies. From these adversarial examples, the investigator wants to identify the source "Mi". To address this problem, we propose a two-stage separate-and-trace framework. The model separation stage generates multiple copies of a model for a same classification task. This process injects unique characteristics into each copy so that adversarial examples generated have distinct and traceable features. We give a parallel structure which pairs a unique tracer with the original classification model in each copy and a variational autoencoder (VAE)-based training method to achieve this goal. The tracing stage takes in adversarial examples and a few candidate models, and identifies the likely source. Based on the unique features induced by the tracer, we could effectively trace the potential adversarial copy by considering the output logits from each tracer. Empirical results show that it is possible to trace the origin of the adversarial example and the mechanism can be applied to a wide range of architectures and datasets. 
+ + + + Scenimefy: Learning to Craft Anime Scene via Semi-Supervised Image-to-Image Translation + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Scenimefy_Learning_to_Craft_Anime_Scene_via_Semi-Supervised_Image-to-Image_Translation_ICCV_2023_paper.pdf + Automatic high-quality rendering of anime scenes from complex real-world images is of significant practical value. The challenges of this task lie in the complexity of the scenes, the unique features of anime style, and the lack of high-quality datasets to bridge the domain gap. Despite promising attempts, previous efforts still fall short of achieving satisfactory results with consistent semantic preservation, evident stylization, and fine details. In this study, we propose Scenimefy, a novel semi-supervised image-to-image translation framework that addresses these challenges. Our approach guides the learning with structure-consistent pseudo paired data, simplifying the pure unsupervised setting. The pseudo data are derived uniquely from a semantic-constrained StyleGAN leveraging rich model priors like CLIP. We further apply segmentation-guided data selection to obtain high-quality pseudo supervision. A patch-wise contrastive style loss is introduced to improve stylization and fine details. Besides, we contribute a high-resolution anime scene dataset to facilitate future research. Our extensive experiments demonstrate the superiority of our method over state-of-the-art baselines in terms of both perceptual quality and quantitative performance. + + + + DR-Tune: Improving Fine-tuning of Pretrained Visual Models by Distribution Regularization with Semantic Calibration + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_DR-Tune_Improving_Fine-tuning_of_Pretrained_Visual_Models_by_Distribution_Regularization_ICCV_2023_paper.pdf + The visual models pretrained on large-scale benchmarks encode general knowledge and prove effective in building more powerful representations for downstream tasks. Most existing approaches follow the fine-tuning paradigm, either by initializing or regularizing the downstream model based on the pretrained one. The former fails to retain the knowledge in the successive fine-tuning phase and is thereby prone to over-fitting, while the latter imposes strong constraints on the weights or feature maps of the downstream model without considering semantic drift, often incurring insufficient optimization. To deal with these issues, we propose a novel fine-tuning framework, namely distribution regularization with semantic calibration (DR-Tune). It employs distribution regularization by enforcing the downstream task head to decrease its classification error on the pretrained feature distribution, which prevents it from over-fitting while enabling sufficient training of downstream encoders. Furthermore, to alleviate the interference by semantic drift, we develop the semantic calibration (SC) module to align the global shape and class centers of the pretrained and downstream feature distributions. Extensive experiments on widely used image classification datasets show that DR-Tune consistently improves the performance when combined with various backbones under different pretraining strategies. Code is available at: https://github.com/weeknan/DR-Tune. &#13;
+ + + + Dense 2D-3D Indoor Prediction with Sound via Aligned Cross-Modal Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yun_Dense_2D-3D_Indoor_Prediction_with_Sound_via_Aligned_Cross-Modal_Distillation_ICCV_2023_paper.pdf + Sound can convey significant information for spatial reasoning in our daily lives. To endow deep networks with such ability, we address the challenge of dense indoor prediction with sound in both 2D and 3D via cross-modal knowledge distillation. In this work, we propose a Spatial Alignment via Matching (SAM) distillation framework that elicits local correspondence between the two modalities in vision-to-audio knowledge transfer. SAM integrates audio features with visually coherent learnable spatial embeddings to resolve inconsistencies in multiple layers of a student model. Our approach does not rely on a specific input representation, allowing for flexibility in the input shapes or dimensions without performance degradation. With a newly curated benchmark named Dense Auditory Prediction of Surroundings (DAPS), we are the first to tackle dense indoor prediction of omnidirectional surroundings in both 2D and 3D with audio observations. Specifically, for audio-based depth estimation, semantic segmentation, and challenging 3D scene reconstruction, the proposed distillation framework consistently achieves state-of-the-art performance across various metrics and backbone architectures. + + + + EverLight: Indoor-Outdoor Editable HDR Lighting Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Dastjerdi_EverLight_Indoor-Outdoor_Editable_HDR_Lighting_Estimation_ICCV_2023_paper.pdf + Because of the diversity in lighting environments, existing illumination estimation techniques have been designed explicitly on indoor or outdoor environments. Methods have focused specifically on capturing accurate energy (e.g., through parametric lighting models), which emphasizes shading and strong cast shadows; or producing plausible texture (e.g., with GANs), which prioritizes plausible reflections. Approaches which provide editable lighting capabilities have been proposed, but these tend to be with simplified lighting models, offering limited realism. In this work, we propose to bridge the gap between these recent trends in the literature, and propose a method which combines a parametric light model with 360deg panoramas, ready to use as HDRI in rendering engines. We leverage recent advances in GAN-based LDR panorama extrapolation from a regular image, which we extend to HDR using parametric spherical gaussians. To achieve this, we introduce a novel lighting co-modulation method that injects lighting-related features throughout the generator, tightly coupling the original or edited scene illumination within the panorama generation process. In our representation, users can easily edit light direction, intensity, number, etc. to impact shading while providing rich, complex reflections while seamlessly blending with the edits. Furthermore, our method encompasses indoor and outdoor environments, demonstrating state-of-the-art results even when compared to domain-specific methods. 
+ + + + MARS: Model-agnostic Biased Object Removal without Additional Supervision for Weakly-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Jo_MARS_Model-agnostic_Biased_Object_Removal_without_Additional_Supervision_for_Weakly-Supervised_ICCV_2023_paper.pdf + Weakly-supervised semantic segmentation aims to reduce labeling costs by training semantic segmentation models using weak supervision, such as image-level class labels. However, most approaches struggle to produce accurate localization maps and suffer from false predictions in class-related backgrounds (i.e., biased objects), such as detecting a railroad with the train class. Recent methods that remove biased objects require additional supervision for manually identifying biased objects for each problematic class and collecting their datasets by reviewing predictions, limiting their applicability to the real-world dataset with multiple labels and complex relationships for biasing. Following the first observation that biased features can be separated and eliminated by matching biased objects with backgrounds in the same dataset, we propose a fully-automatic/model-agnostic biased removal framework called MARS (Model-Agnostic biased object Removal without additional Supervision), which utilizes semantically consistent features of an unsupervised technique to eliminate biased objects in pseudo labels. Surprisingly, we show that MARS achieves new state-of-the-art results on two popular benchmarks, PASCAL VOC 2012 (val: 77.7%, test: 77.2%) and MS COCO 2014 (val: 49.4%), by consistently improving the performance of various WSSS models by at least 30% without additional supervision. Code is available at https://github.com/shjo-april/MARS. + + + + Few-Shot Physically-Aware Articulated Mesh Generation via Hierarchical Deformation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Few-Shot_Physically-Aware_Articulated_Mesh_Generation_via_Hierarchical_Deformation_ICCV_2023_paper.pdf + We study the problem of few-shot physically-aware articulated mesh generation. By observing an articulated object dataset containing only a few examples, we wish to learn a model that can generate diverse meshes with high visual fidelity and physical validity. Previous mesh generative models either have difficulties in depicting a diverse data space from only a few examples or fail to ensure physical validity of their samples. Regarding the above challenges, we propose two key innovations, including 1) a hierarchical mesh deformation-based generative model based upon the divide-and-conquer philosophy to alleviate the few-shot challenge by borrowing transferrable deformation patterns from large scale rigid meshes and 2) a physics-aware deformation correction scheme to encourage physically plausible generations. We conduct extensive experiments on 6 articulated categories to demonstrate the superiority of our method in generating articulated meshes with better diversity, higher visual fidelity, and better physical validity over previous methods in the few-shot setting. Further, we validate solid contributions of our two innovations in the ablation study. Project page with code is available at https://meowuu7.github.io/few-arti-obj-gen. 
+ + + + Single-Stage Diffusion NeRF: A Unified Approach to 3D Generation and Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Single-Stage_Diffusion_NeRF_A_Unified_Approach_to_3D_Generation_and_ICCV_2023_paper.pdf + 3D-aware image synthesis encompasses a variety of tasks, such as scene generation and novel view synthesis from images. Despite numerous task-specific methods, developing a comprehensive model remains challenging. In this paper, we present SSDNeRF, a unified approach that employs an expressive diffusion model to learn a generalizable prior of neural radiance fields (NeRF) from multi-view images of diverse objects. Previous studies have used two-stage approaches that rely on pretrained NeRFs as real data to train diffusion models. In contrast, we propose a new single-stage training paradigm with an end-to-end objective that jointly optimizes a NeRF auto-decoder and a latent diffusion model, enabling simultaneous 3D reconstruction and prior learning, even from sparsely available views. At test time, we can directly sample the diffusion prior for unconditional generation, or combine it with arbitrary observations of unseen objects for NeRF reconstruction. SSDNeRF demonstrates robust results comparable to or better than leading task-specific methods in unconditional generation and single/sparse-view 3D reconstruction. + + + + One-Shot Generative Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_One-Shot_Generative_Domain_Adaptation_ICCV_2023_paper.pdf + This work aims to transfer a Generative Adversarial Network (GAN) pre-trained on one image domain to another domain referred to as few as just one reference image. The challenge is that, under limited supervision, it is extremely difficult to synthesize photo realistic and highly diverse images while retaining the representative characters of the target domain. Different from existing approaches that adopt the vanilla fine-tuning strategy, we design two lightweight modules in the generator and the discriminator respectively. We first introduce an attribute adaptor in the generator and freeze the generator's original parameters, which can reuse the prior knowledge to the most extent and maintain the synthesis quality and diversity. We then equip the well-learned discriminator with an attribute classifier to ensure that the generator with the attribute adaptor captures the appropriate characters of the reference image. Furthermore, considering the very limited diversity of the training data (i.e., as few as only one image), we propose to constrain the diversity of the latent space through truncation in the training process, alleviating the optimization difficulty. Our approach brings appealing results under various settings, substantially surpassing state-of-the-art alternatives, especially in terms of synthesis diversity. Noticeably, our method works well even with large domain gaps and robustly converges within a few minutes for each experiment. Code and models are available at https://genforce.github.io/genda/. + + + + HybridAugment++: Unified Frequency Spectra Perturbations for Model Robustness + http://openaccess.thecvf.com//content/ICCV2023/papers/Yucel_HybridAugment_Unified_Frequency_Spectra_Perturbations_for_Model_Robustness_ICCV_2023_paper.pdf + Convolutional Neural Networks (CNN) are known to exhibit poor generalization performance under distribution shifts. 
Their generalization has been studied extensively, and one line of work approaches the problem from a frequency-centric perspective. These studies highlight the fact that humans and CNNs might focus on different frequency components of an image. First, inspired by these observations, we propose a simple yet effective data augmentation method HybridAugment that reduces the reliance of CNNs on high-frequency components, and thus improves their robustness while keeping their clean accuracy high. Second, we propose HybridAugment++, which is a hierarchical augmentation method that attempts to unify various frequency-spectrum augmentations. HybridAugment++ builds on HybridAugment, and also reduces the reliance of CNNs on the amplitude component of images, and promotes phase information instead. This unification yields results that are competitive with or better than the state of the art on clean accuracy (CIFAR-10/100 and ImageNet), corruption benchmarks (ImageNet-C, CIFAR-10-C and CIFAR-100-C), adversarial robustness on CIFAR-10, and out-of-distribution detection on various datasets. HybridAugment and HybridAugment++ are implemented in a few lines of code and do not require extra data, ensemble models, or additional networks. + + + + Doppelgangers: Learning to Disambiguate Images of Similar Structures + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_Doppelgangers_Learning_to_Disambiguate_Images_of_Similar_Structures_ICCV_2023_paper.pdf + We consider the visual disambiguation task of determining whether a pair of visually similar images depict the same or distinct 3D surfaces (e.g., the same or opposite sides of a symmetric building). Illusory image matches, where two images observe distinct but visually similar 3D surfaces, can be challenging for humans to differentiate, and can also lead 3D reconstruction algorithms to produce erroneous results. We propose a learning-based approach to visual disambiguation, formulating it as a binary classification task on image pairs. To that end, we introduce a new dataset for this problem, Doppelgangers, which includes image pairs of similar structures with ground truth labels. We also design a network architecture that takes the spatial distribution of local keypoints and matches as input, allowing for better reasoning about both local and global cues. Our evaluation shows that our method can distinguish illusory matches in difficult cases, and can be integrated into SfM pipelines to produce correct, disambiguated 3D reconstructions. See our project page for our code, datasets, and more results: http://doppelgangers-3d.github.io/. + + + + Improving Generalization of Adversarial Training via Robust Critical Fine-Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Improving_Generalization_of_Adversarial_Training_via_Robust_Critical_Fine-Tuning_ICCV_2023_paper.pdf + Deep neural networks are susceptible to adversarial examples, posing a significant security risk in critical applications. Adversarial Training (AT) is a well-established technique to enhance adversarial robustness, but it often comes at the cost of decreased generalization ability. This paper proposes Robustness Critical Fine-Tuning (RiFT), a novel approach to enhance generalization without compromising adversarial robustness. The core idea of RiFT is to exploit the redundant capacity for robustness by fine-tuning the adversarially trained model on its non-robust-critical module. &#13;
To do so, we introduce module robust criticality (MRC), a measure that evaluates the significance of a given module to model robustness under worst-case weight perturbations. Using this measure, we identify the module with the lowest MRC value as the non-robust-critical module and fine-tune its weights to obtain fine-tuned weights. Subsequently, we linearly interpolate between the adversarially trained weights and fine-tuned weights to derive the optimal fine-tuned model weights. We demonstrate the efficacy of RiFT on ResNet18, ResNet34, and WideResNet34-10 models trained on CIFAR10, CIFAR100, and Tiny-ImageNet datasets. Our experiments show that RiFT can significantly improve both generalization and out-of-distribution robustness by around 1.5% while maintaining or even slightly enhancing adversarial robustness. Code is available at https://github.com/Immortalise/RiFT. + + + + Understanding the Feature Norm for Out-of-Distribution Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_Understanding_the_Feature_Norm_for_Out-of-Distribution_Detection_ICCV_2023_paper.pdf + A neural network trained on a classification dataset often exhibits a higher vector norm of hidden layer features for in-distribution (ID) samples, while producing relatively lower norm values on unseen instances from out-of-distribution (OOD). Despite this intriguing phenomenon being utilized in many applications, the underlying cause has not been thoroughly investigated. In this study, we demystify this very phenomenon by scrutinizing the discriminative structures concealed in the intermediate layers of a neural network. Our analysis leads to the following discoveries: (1) The feature norm is a confidence value of a classifier hidden in the network layer, specifically its maximum logit. Hence, the feature norm distinguishes OOD from ID in the same manner that a classifier confidence does. (2) The feature norm is class-agnostic, thus it can detect OOD samples across diverse discriminative models. (3) The conventional feature norm fails to capture the deactivation tendency of hidden layer neurons, which may lead to misidentification of ID samples as OOD instances. To resolve this drawback, we propose a novel negative-aware norm (NAN) that can capture both the activation and deactivation tendencies of hidden layer neurons. We conduct extensive experiments on NAN, demonstrating its efficacy and compatibility with existing OOD detectors, as well as its capability in label-free environments. + + + + Knowledge Proxy Intervention for Deconfounded Video Question Answering + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Knowledge_Proxy_Intervention_for_Deconfounded_Video_Question_Answering_ICCV_2023_paper.pdf + Recently, Video Question-Answering (VideoQA) has drawn more and more attention from both industry and the research community. Despite all the success achieved by recent works, dataset bias always harmfully misleads current methods into focusing on spurious correlations in training data. To analyze the effects of dataset bias, we frame the VideoQA pipeline into a causal graph, which shows the causalities among video, question, aligned feature between video and question, answer, and underlying confounder. Through the causal graph, we prove that the confounder and the backdoor path lead to spurious causality. &#13;
To tackle the challenge that the confounder in VideoQA is unobserved and non-enumerable in general, we propose a model-agnostic framework called Knowledge Proxy Intervention (KPI), which introduces an extra knowledge proxy variable in the causal graph to cut the backdoor path and remove the confounder. Our KPI framework exploits the front-door adjustment, which requires no prior knowledge about the confounder. The effectiveness of our KPI framework is corroborated by three baseline methods on five benchmark datasets, including MSVD-QA, MSRVTT-QA, TGIF-QA, NExT-QA, and Causal-VidQA. + + + + DetZero: Rethinking Offboard 3D Object Detection with Long-term Sequential Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_DetZero_Rethinking_Offboard_3D_Object_Detection_with_Long-term_Sequential_Point_ICCV_2023_paper.pdf + Existing offboard 3D detectors always follow a modular pipeline design to take advantage of unlimited sequential point clouds. We have found that the full potential of offboard 3D detectors is not explored mainly due to two reasons: (1) the onboard multi-object tracker cannot generate sufficient complete object trajectories, and (2) the motion state of objects poses an inevitable challenge for the object-centric refining stage in leveraging the long-term temporal context representation. To tackle these problems, we propose a novel paradigm of offboard 3D object detection, named DetZero. Concretely, an offline tracker coupled with a multi-frame detector is proposed to focus on the completeness of generated object tracks. An attention-mechanism refining module is proposed to strengthen contextual information interaction across long-term sequential point clouds for object refining with decomposed regression methods. Extensive experiments on Waymo Open Dataset show our DetZero outperforms all state-of-the-art onboard and offboard 3D detection methods. Notably, DetZero ranks 1st place on Waymo 3D object detection leaderboard with 85.15 mAPH (L2) detection performance. Further experiments validate the application of taking the place of human labels with such high-quality results. Our empirical study leads to rethinking conventions and interesting findings that can guide future research on offboard 3D object detection. + + + + Learning from Noisy Data for Semi-Supervised 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Learning_from_Noisy_Data_for_Semi-Supervised_3D_Object_Detection_ICCV_2023_paper.pdf + Pseudo-Labeling (PL) is a critical approach in semi-supervised 3D object detection (SSOD). In PL, delicately selected pseudo-labels, generated by the teacher model, are provided for the student model to supervise the semi-supervised detection framework. However, such a paradigm may introduce misclassified labels or loose localized box predictions, resulting in a sub-optimal solution of detection performance. In this paper, we take PL from a noisy learning perspective: instead of directly applying vanilla pseudo-labels, we design a noise-resistant instance supervision module for better generalization. Specifically, we soften the classification targets by considering both the quality of pseudo labels and the network learning ability, and convert the regression task into a probabilistic modeling problem. Besides, considering that self-supervised learning works in the absence of labels, we incorporate dense pixel-wise feature consistency constraints to eliminate the negative impact of noisy labels. 
To this end, we propose NoiseDet, a simple yet effective framework for semi-supervised 3D object detection. Extensive experiments on competitive ONCE and Waymo benchmarks demonstrate that our method outperforms current semi-supervised approaches by a large margin. Notably, our NoiseDet achieves state-of-the-art performance under various dataset scales on ONCE dataset. For example, NoiseDet improves its NoiseyStudent baseline from 55.5 mAP to 58.0 mAP, and further reaches 60.2 mAP with enhanced pseudo-label generation. Code will be available at https://github.com/zehuichen123/NoiseDet. + + + + Towards Authentic Face Restoration with Iterative Diffusion Models and Beyond + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Towards_Authentic_Face_Restoration_with_Iterative_Diffusion_Models_and_Beyond_ICCV_2023_paper.pdf + An authentic face restoration system is becoming increasingly demanding in many computer vision applications, e.g., image enhancement, video communication, and taking portrait. Most of the advanced face restoration models can recover high-quality faces from low-quality ones but usually fail to faithfully generate realistic and high-frequency details that are favored by users. To achieve authentic restoration, we propose IDM, an Iteratively learned face restoration system based on denoising Diffusion Models (DDMs). We define the criterion of an authentic face restoration system, and argue that denoising diffusion models are naturally endowed with this property from two aspects: intrinsic iterative refinement and extrinsic iterative enhancement. Intrinsic learning can preserve the content well and gradually refine the high-quality details, while extrinsic enhancement helps clean the data and improve the restoration task one step further. We demonstrate superior performance on blind face restoration tasks. Beyond restoration, we find the authentically cleaned data by the proposed restoration system is also helpful to image generation tasks in terms of training stabilization and sample quality. Without modifying the baseline models, we achieve better quality than state-of-the-art on FFHQ and ImageNet generation using either GANs or diffusion models. + + + + Group DETR: Fast DETR Training with Group-Wise One-to-Many Assignment + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Group_DETR_Fast_DETR_Training_with_Group-Wise_One-to-Many_Assignment_ICCV_2023_paper.pdf + Detection transformer (DETR) relies on one-to-one assignment, assigning one ground-truth object to one prediction, for end-to-end detection without NMS post-processing. It is known that one-to-many assignment, assigning one ground-truth object to multiple predictions, succeeds in detection methods such as Faster R-CNN and FCOS. While the naive one-to-many assignment does not work for DETR, and it remains challenging to apply one-to-many assignment for DETR training. In this paper, we introduce Group DETR, a simple yet efficient DETR training approach that introduces a group-wise way for one-to-many assignment. This approach involves using multiple groups of object queries, conducting one-to-one assignment within each group, and performing decoder self-attention separately. It resembles data augmentation with automatically-learned object query augmentation. It is also equivalent to simultaneously training parameter-sharing networks of the same architecture, introducing more supervision and thus improving DETR training. 
The inference process is the same as for a normally trained DETR and only needs one group of queries without any architecture modification. Group DETR is versatile and is applicable to various DETR variants. The experiments show that Group DETR significantly speeds up the training convergence and improves the performance of various DETR-based models. Code will be available at https://github.com/Atten4Vis/GroupDETR. + + + + DETRs with Collaborative Hybrid Assignments Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Zong_DETRs_with_Collaborative_Hybrid_Assignments_Training_ICCV_2023_paper.pdf + In this paper, we provide the observation that too few queries assigned as positive samples in DETR with one-to-one set matching lead to sparse supervision on the encoder's output, which considerably hurts the discriminative feature learning of the encoder, and vice versa for attention learning in the decoder. To alleviate this, we present a novel collaborative hybrid assignments training scheme, namely Co-DETR, to learn more efficient and effective DETR-based detectors from versatile label assignment manners. This new training scheme can easily enhance the encoder's learning ability in end-to-end detectors by training multiple parallel auxiliary heads supervised by one-to-many label assignments such as ATSS and Faster R-CNN. In addition, we conduct extra customized positive queries by extracting the positive coordinates from these auxiliary heads to improve the training efficiency of positive samples in the decoder. At inference, these auxiliary heads are discarded, and thus our method introduces no additional parameters and computational cost to the original detector while requiring no hand-crafted non-maximum suppression (NMS). We conduct extensive experiments to evaluate the effectiveness of the proposed approach on DETR variants, including DAB-DETR, Deformable-DETR, and DINO-Deformable-DETR. The state-of-the-art DINO-Deformable-DETR with Swin-L can be improved from 58.5% to 59.5% AP on COCO val. Surprisingly, incorporated with a ViT-L backbone, we achieve 66.0% AP on COCO test-dev and 67.9% AP on LVIS val, outperforming previous methods by clear margins with much smaller model sizes. Codes are available at https://github.com/Sense-X/Co-DETR. + + + + Multi-Modal Neural Radiance Field for Monocular Dense SLAM with a Light-Weight ToF Sensor + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Multi-Modal_Neural_Radiance_Field_for_Monocular_Dense_SLAM_with_a_ICCV_2023_paper.pdf + Light-weight time-of-flight (ToF) depth sensors are compact and cost-efficient, and thus widely used on mobile devices for tasks such as autofocus and obstacle detection. However, due to the sparse and noisy depth measurements, these sensors have rarely been considered for dense geometry reconstruction. In this work, we present the first dense SLAM system with a monocular camera and a light-weight ToF sensor. Specifically, we propose a multi-modal implicit scene representation that supports rendering both the signals from the RGB camera and the light-weight ToF sensor, which drives the optimization by comparing with the raw sensor inputs. Moreover, in order to guarantee successful pose tracking and reconstruction, we exploit a predicted depth as an intermediate supervision and develop a coarse-to-fine optimization strategy for efficient learning of the implicit representation. &#13;
At last, the temporal information is explicitly exploited to deal with the noisy signals from light-weight ToF sensors to improve the accuracy and robustness of the system. Experiments demonstrate that our system well exploits the signals of light-weight ToF sensors and achieves competitive results both on camera tracking and dense scene reconstruction. Project page: https://zju3dv.github.io/tof_slam/. + + + + MonoNeRD: NeRF-like Representations for Monocular 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_MonoNeRD_NeRF-like_Representations_for_Monocular_3D_Object_Detection_ICCV_2023_paper.pdf + In the field of monocular 3D detection, it is common practice to utilize scene geometric clues to enhance the detector's performance. However, many existing works adopt these clues explicitly such as estimating a depth map and back-projecting it into 3D space. This explicit methodology induces sparsity in 3D representations due to the increased dimensionality from 2D to 3D, and leads to substantial information loss, especially for distant and occluded objects. To alleviate this issue, we propose MonoNeRD, a novel detection framework that can infer dense 3D geometry and occupancy. Specifically, we model scenes with Signed Distance Functions (SDF), facilitating the production of dense 3D representations. We treat these representations as Neural Radiance Fields (NeRF) and then employ volume rendering to recover RGB images and depth maps. To the best of our knowledge, this work is the first to introduce volume rendering for M3D, and demonstrates the potential of implicit reconstruction for image-based 3D perception. Extensive experiments conducted on the KITTI-3D benchmark and Waymo Open Dataset demonstrate the effectiveness of MonoNeRD. Codes are available at https://github.com/cskkxjk/MonoNeRD. + + + + Monocular 3D Object Detection with Bounding Box Denoising in 3D by Perceiver + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Monocular_3D_Object_Detection_with_Bounding_Box_Denoising_in_3D_ICCV_2023_paper.pdf + The main challenge of monocular 3D object detection is the accurate localization of 3D center. Motivated by a new and strong observation that this challenge can be remedied by a 3D-space local-grid search scheme in an ideal case, we propose a stage-wise approach, which combines the information flow from 2D-to-3D (3D bounding box proposal generation with a single 2D image) and 3D-to-2D (proposal verification by denoising with 3D-to-2D contexts) in a top-down manner. Specifically, we first obtain initial proposals from off-the-shelf backbone monocular 3D detectors. Then, we generate a 3D anchor space by local-grid sampling from the initial proposals. Finally, we perform 3D bounding box denoising at the 3D-to-2D proposal verification stage. To effectively learn discriminative features for denoising highly overlapped proposals, this paper presents a method of using the Perceiver I/O model to fuse the 3D-to-2D geometric information and the 2D appearance information. With the encoded latent representation of a proposal, the verification head is implemented with a self-attention module. Our method, named as MonoXiver, is generic and can be easily adapted to any backbone monocular 3D detectors. Experimental results on the well-established KITTI dataset and the challenging large-scale Waymo dataset show that MonoXiver consistently achieves improvement with limited computation overhead. 
+ + + + WaveIPT: Joint Attention and Flow Alignment in the Wavelet domain for Pose Transfer + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_WaveIPT_Joint_Attention_and_Flow_Alignment_in_the_Wavelet_domain_ICCV_2023_paper.pdf + Human pose transfer aims to generate a new image of the source person in a target pose. Among the existing methods, attention and flow are two of the most popular and effective approaches. Attention excels in preserving the semantic structure of the source image, which is more reflected in the low-frequency domain. Contrastively, flow is better at retaining fine-grained texture details in the high-frequency domain. To leverage the advantages of both attention and flow simultaneously, we propose Wavelet-aware Image-based Pose Transfer (WaveIPT) to fuse the attention and flow in the wavelet domain. To improve the fusion effect and avoid interference from irrelevant information between different frequencies, WaveIPT first applies Intra-scale Local Correlation (ILC) to adaptively fuse attention and flow in the same scale according to their strengths in low and high-frequency domains, and then uses Inter-scale Feature Interaction (IFI) module to explore inter-scale frequency features for effective information transfer across different scales. We further introduce an effective Progressive Flow Regularization to alleviate the challenges of flow estimation under large pose differences. Our experiments on the DeepFashion dataset demonstrate that WaveIPT achieves a new state-of-the-art in terms of FID and LPIPS, with improvements of 4.97% and 3.89%, respectively. + + + + PARTNER: Level up the Polar Representation for LiDAR 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Nie_PARTNER_Level_up_the_Polar_Representation_for_LiDAR_3D_Object_ICCV_2023_paper.pdf + Recently, polar-based representation has shown promising properties in perceptual tasks. In addition to Cartesian-based approaches, which separate point clouds unevenly, representing point clouds as polar grids has been recognized as an alternative due to (1) its advantage in robust performance under different resolutions and (2) its superiority in streaming-based approaches. However, state-of-the-art polar-based detection methods inevitably suffer from the feature distortion problem because of the non-uniform division of polar representation, resulting in a non-negligible performance gap compared to Cartesian-based approaches. To tackle this issue, we present PARTNER, a novel 3D object detector in the polar coordinate. PARTNER alleviates the dilemma of feature distortion with global representation re-alignment and facilitates the regression by introducing instance-level geometric information into the detection head. Extensive experiments show overwhelming advantages in streaming-based detection and different resolutions. Furthermore, our method outperforms the previous polar-based works with remarkable margins of 3.68% and 9.15% on Waymo and ONCE validation set, thus achieving competitive results over the state-of-the-art methods. + + + + Corrupting Neuron Explanations of Deep Visual Features + http://openaccess.thecvf.com//content/ICCV2023/papers/Srivastava_Corrupting_Neuron_Explanations_of_Deep_Visual_Features_ICCV_2023_paper.pdf + The inability of DNNs to explain their black-box behavior has led to a recent surge of explainability methods. However, there are growing concerns that these explainability methods are not robust and trustworthy. 
In this work, we perform the first robustness analysis of Neuron Explanation Methods under a unified pipeline and show that these explanations can be significantly corrupted by random noises and well-designed perturbations added to their probing data. We find that even adding small random noise with a standard deviation of 0.02 can already change the assigned concepts of up to 28% neurons in the deeper layers. Furthermore, we devise a novel corruption algorithm and show that our algorithm can manipulate the explanation of more than 80% neurons by poisoning less than 10% of probing data. This raises the concern of trusting Neuron Explanation Methods in real-life safety and fairness critical applications. + + + + PNI : Industrial Anomaly Detection using Position and Neighborhood Information + http://openaccess.thecvf.com//content/ICCV2023/papers/Bae_PNI__Industrial_Anomaly_Detection_using_Position_and_Neighborhood_Information_ICCV_2023_paper.pdf + Because anomalous samples cannot be used for training, many anomaly detection and localization methods use pre-trained networks and non-parametric modeling to estimate encoded feature distribution. However, these methods neglect the impact of position and neighborhood information on the distribution of normal features. To overcome this, we propose a new algorithm, PNI, which estimates the normal distribution using conditional probability given neighborhood features, modeled with a multi-layer perceptron network. Moreover, position information is utilized by creating a histogram of representative features at each position. Instead of simply resizing the anomaly map, the proposed method employs an additional refine network trained on synthetic anomaly images to better interpolate and account for the shape and edge of the input image. We conducted experiments on the MVTec AD benchmark dataset and achieved state-of-the-art performance, with 99.56% and 98.98% AUROC scores in anomaly detection and localization, respectively. Code is available at https://github.com/wogur110/PNI_Anomaly_Detection. + + + + Bidirectionally Deformable Motion Modulation For Video-based Human Pose Transfer + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Bidirectionally_Deformable_Motion_Modulation_For_Video-based_Human_Pose_Transfer_ICCV_2023_paper.pdf + Video-based human pose transfer is a video-to-video generation task that animates a plain source human image based on a series of target human poses. Considering the difficulties in transferring highly structural patterns on the garments and discontinuous poses, existing methods often generate unsatisfactory results such as distorted textures and flickering artifacts. To address these issues, we propose a novel Deformable Motion Modulation (DMM) that utilizes geometric kernel offset with adaptive weight modulation to simultaneously perform feature alignment and style transfer. Different from normal style modulation used in style transfer, the proposed modulation mechanism adaptively reconstructs smoothed frames from style codes according to the object shape through an irregular receptive field of view. To enhance the spatio-temporal consistency, we leverage bidirectional propagation to extract the hidden motion information from a warped image sequence generated by noisy poses. The proposed feature propagation significantly enhances the motion prediction ability by forward and backward propagation. 
Both quantitative and qualitative experimental results demonstrate superiority over the state-of-the-arts in terms of image fidelity and visual continuity. The source code is publicly available at github.com/rocketappslab/bdmm. + + + + Objects Do Not Disappear: Video Object Detection by Single-Frame Object Location Anticipation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Objects_Do_Not_Disappear_Video_Object_Detection_by_Single-Frame_Object_ICCV_2023_paper.pdf + Objects in videos are typically characterized by continuous smooth motion. We exploit continuous smooth motion in three ways. 1) Improved accuracy by using object motion as an additional source of supervision, which we obtain by anticipating object locations from a static keyframe. 2) Improved efficiency by only doing the expensive feature computations on a small subset of all frames. Because neighboring video frames are often redundant, we only compute features for a single static keyframe and predict object locations in subsequent frames. 3) Reduced annotation cost, where we only annotate the keyframe and use smooth pseudo-motion between keyframes. We demonstrate computational efficiency, annotation efficiency, and improved mean average precision compared to the state-of-the-art on four datasets: ImageNet VID, EPIC KITCHENS-55, YouTube-BoundingBoxes and Waymo Open dataset. Our source code is available at https://github.com/L-KID/Videoobject-detection-by-location-anticipation. + + + + Learning from Semantic Alignment between Unpaired Multiviews for Egocentric Video Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Learning_from_Semantic_Alignment_between_Unpaired_Multiviews_for_Egocentric_Video_ICCV_2023_paper.pdf + We are concerned with a challenging scenario in unpaired multiview video learning. In this case, the model aims to learn comprehensive multiview representations while the cross-view semantic information exhibits variations. We propose Semantics-based Unpaired Multiview Learning (SUM-L) to tackle this unpaired multiview learning problem. The key idea is to build cross-view pseudo-pairs and do view-invariant alignment by leveraging the semantic information of videos. To facilitate the data efficiency of multiview learning, we further perform video-text alignment for first-person and third-person videos, to fully leverage the semantic knowledge to improve video representations. Extensive experiments on multiple benchmark datasets verify the effectiveness of our framework. Our method also outperforms multiple existing view-alignment methods, under the more challenging scenario than typical paired or unpaired multimodal or multiview learning. Our code is available at https://github.com/wqtwjt1996/SUM-L. + + + + Source-free Depth for Object Pop-out + http://openaccess.thecvf.com//content/ICCV2023/papers/WU_Source-free_Depth_for_Object_Pop-out_ICCV_2023_paper.pdf + Depth cues are known to be useful for visual perception. However, direct measurement of depth is often impracticable. Fortunately, though, modern learning-based methods offer promising depth maps by inference in the wild. In this work, we adapt such depth inference models for object segmentation using the objects' "pop-out" prior in 3D. The "pop-out" is a simple composition prior that assumes objects reside on the background surface. Such compositional prior allows us to reason about objects in the 3D space. More specifically, we adapt the inferred depth maps such that objects can be localized using only 3D information. 
Such separation, however, requires knowledge about the contact surface, which we learn using the weak supervision of the segmentation mask. Our intermediate representation of the contact surface, and thereby reasoning about objects purely in 3D, allows us to better transfer the depth knowledge into semantics. The proposed adaptation method uses only the depth model without needing the source data used for training, making the learning process efficient and practical. Our experiments on eight datasets of two challenging tasks, namely salient object detection and camouflaged object detection, consistently demonstrate the benefit of our method in terms of both performance and generalizability. The source code is publicly available at https://github.com/Zongwei97/PopNet. + + + + Token-Label Alignment for Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiao_Token-Label_Alignment_for_Vision_Transformers_ICCV_2023_paper.pdf + Data mixing strategies (e.g., CutMix) have shown the ability to greatly improve the performance of convolutional neural networks (CNNs). They mix two images as inputs for training and assign them a mixed label with the same ratio. While they are shown effective for vision transformers (ViTs), we identify a token fluctuation phenomenon that has suppressed the potential of data mixing strategies. We empirically observe that the contributions of input tokens fluctuate during forward propagation, which might induce a different mixing ratio in the output tokens. The training target computed by the original data mixing strategy can thus be inaccurate, resulting in less effective training. To address this, we propose a token-label alignment (TL-Align) method to trace the correspondence between transformed tokens and the original tokens to maintain a label for each token. We reuse the computed attention at each layer for efficient token-label alignment, introducing only negligible additional training costs. Extensive experiments demonstrate that our method improves the performance of ViTs on image classification, semantic segmentation, object detection, and transfer learning tasks. + + + + Learning Gabor Texture Features for Fine-Grained Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Learning_Gabor_Texture_Features_for_Fine-Grained_Recognition_ICCV_2023_paper.pdf + Extracting and using class-discriminative features is critical for fine-grained recognition. Existing works have demonstrated the possibility of applying deep CNNs to exploit features that distinguish similar classes. However, CNNs suffer from problems including frequency bias and loss of detailed local information, which restricts the performance of recognizing fine-grained categories. To address the challenge, we propose a novel texture branch as complementary to the CNN branch for feature extraction. We innovatively utilize Gabor filters as a powerful extractor to exploit texture features, motivated by the capability of Gabor filters in effectively capturing multi-frequency features and detailed local information. We implement several designs to enhance the effectiveness of Gabor filters, including imposing constraints on parameter values and developing a learning method to determine the optimal parameters. &#13;
Moreover, we introduce a statistical feature extractor to utilize informative statistical information from the signals captured by Gabor filters, and a gate selection mechanism to enable efficient computation by only considering qualified regions as input for texture extraction. Through the integration of features from the Gabor-filter-based texture branch and CNN-based semantic branch, we achieve comprehensive information extraction. We demonstrate the efficacy of our method on multiple datasets, including CUB-200-2011, NA-bird, Stanford Dogs, and GTOS-mobile. State-of-the-art performance is achieved using our approach. + + + + An Embarrassingly Simple Backdoor Attack on Self-supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_An_Embarrassingly_Simple_Backdoor_Attack_on_Self-supervised_Learning_ICCV_2023_paper.pdf + As a new paradigm in machine learning, self-supervised learning (SSL) is capable of learning high-quality representations of complex data without relying on labels. In addition to eliminating the need for labeled data, research has found that SSL improves the adversarial robustness over supervised learning since lacking labels makes it more challenging for adversaries to manipulate model predictions. However, the extent to which this robustness superiority generalizes to other types of attacks remains an open question. We explore this question in the context of backdoor attacks. Specifically, we design and evaluate CTRL, an embarrassingly simple yet highly effective self-supervised backdoor attack. By only polluting a tiny fraction of training data (<1%) with indistinguishable poisoning samples, CTRL causes any trigger-embedded input to be misclassified to the adversary's designated class with a high probability (>99%) at inference time. Our findings suggest that SSL and supervised learning are comparably vulnerable to backdoor attacks. More importantly, through the lens of CTRL, we study the inherent vulnerability of SSL to backdoor attacks. With both empirical and analytical evidence, we reveal that the representation invariance property of SSL, which benefits adversarial robustness, may also be the very reason making SSL highly susceptible to backdoor attacks. Our findings also imply that the existing defenses against supervised backdoor attacks are not easily retrofitted to the unique vulnerability of SSL. + + + + Partition Speeds Up Learning Implicit Neural Representations Based on Exponential-Increase Hypothesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Partition_Speeds_Up_Learning_Implicit_Neural_Representations_Based_on_Exponential-Increase_ICCV_2023_paper.pdf + Implicit neural representations (INRs) aim to learn a continuous function (i.e., a neural network) to represent an image, where the input and output of the function are pixel coordinates and RGB/Gray values, respectively. However, images tend to consist of many objects whose colors are not perfectly consistent, resulting in the challenge that image is actually a discontinuous piecewise function and cannot be well estimated by a continuous function. In this paper, we empirically investigate that if a neural network is enforced to fit a discontinuous piecewise function to reach a fixed small error, the time costs will increase exponentially with respect to the boundaries in the spatial domain of the target signal. We name this phenomenon the exponential-increase hypothesis. 
Under the exponential-increase hypothesis, learning INRs for images with many objects will converge very slowly. To address this issue, we first prove that partitioning a complex signal into several sub-regions and utilizing piecewise INRs to fit that signal can significantly speed up the convergence. Based on this fact, we introduce a simple partition mechanism to boost the performance of two INR methods for image reconstruction: one for learning INRs, and the other for learning-to-learn INRs. In both cases, we partition an image into different sub-regions and dedicate smaller networks for each part. In addition, we further propose two partition rules based on regular grids and semantic segmentation maps, respectively. Extensive experiments validate the effectiveness of the proposed partitioning methods in terms of learning INR for a single image (ordinary learning framework) and the learning-to-learn framework. + + + + Uncertainty Guided Adaptive Warping for Robust and Efficient Stereo Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Jing_Uncertainty_Guided_Adaptive_Warping_for_Robust_and_Efficient_Stereo_Matching_ICCV_2023_paper.pdf + Correlation based stereo matching has achieved outstanding performance, which pursues cost volume between two feature maps. Unfortunately, current methods with a fixed trained model do not work uniformly well across various datasets, greatly limiting their real-world applicability. To tackle this issue, this paper proposes a new perspective to dynamically calculate correlation for robust stereo matching. A novel Uncertainty Guided Adaptive Correlation (UGAC) module is introduced to robustly adapt the same model for different scenarios. Specifically, a variance-based uncertainty estimation is employed to adaptively adjust the sampling area during warping operation. Additionally, we improve the traditional non-parametric warping with learnable parameters, such that the position-specific weights can be learned. We show that by empowering the recurrent network with the UGAC module, stereo matching can be exploited more robustly and effectively. Extensive experiments demonstrate that our method achieves state-of-the-art performance over the ETH3D, KITTI, and Middlebury datasets when employing the same fixed model over these datasets without any retraining procedure. To target real-time applications, we further design a lightweight model based on UGAC, which also outperforms other methods over KITTI benchmarks with only 0.6 M parameters. + + + + CGBA: Curvature-aware Geometric Black-box Attack + http://openaccess.thecvf.com//content/ICCV2023/papers/Reza_CGBA_Curvature-aware_Geometric_Black-box_Attack_ICCV_2023_paper.pdf + Decision-based black-box attacks often necessitate a large number of queries to craft an adversarial example. Moreover, decision-based attacks based on querying boundary points in the estimated normal vector direction often suffer from inefficiency and convergence issues. In this paper, we propose a novel query-efficient curvature-aware geometric decision-based black-box attack (CGBA) that conducts boundary search along a semicircular path on a restricted 2D plane to ensure finding a boundary point successfully irrespective of the boundary curvature. 
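A toy rendition of the semicircular boundary search described in the CGBA abstract above, using a linear stand-in classifier and a bisection over the angle; the plane construction, query budgeting, and curvature exploitation of the actual attack are more involved, so treat this only as a geometric sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_classifier(x, w):
    """Stand-in for the black-box decision function: returns a hard label 0/1."""
    return int(x @ w > 0)

def semicircular_boundary_search(x0, x_adv, is_adversarial, n_steps=20):
    """Bisect over the angle of a semicircle whose diameter is the segment [x0, x_adv];
    points on it get closer to x0 as the angle grows, so the largest adversarial angle
    found yields a closer boundary point."""
    m = (x0 + x_adv) / 2.0
    r = np.linalg.norm(x_adv - x0) / 2.0
    u = (x_adv - x0) / (2.0 * r)                   # in-plane axis through x_adv
    v = rng.standard_normal(x0.shape)
    v = v - (v @ u) * u                            # random orthogonal in-plane axis
    v = v / np.linalg.norm(v)
    point = lambda phi: m + r * (np.cos(phi) * u + np.sin(phi) * v)
    lo, hi = 0.0, np.pi                            # phi=0 is x_adv, phi=pi is x0
    for _ in range(n_steps):
        mid = (lo + hi) / 2.0
        if is_adversarial(point(mid)):
            lo = mid                               # still adversarial: move closer to x0
        else:
            hi = mid
    return point(lo)

w = rng.standard_normal(32)
x0 = rng.standard_normal(32)
if toy_classifier(x0, w) == 1:
    w = -w                                         # make x0 the "clean" class
x_adv = x0 + 10.0 * w / np.linalg.norm(w)          # a far-away adversarial starting point
closer = semicircular_boundary_search(x0, x_adv, lambda x: toy_classifier(x, w) == 1)
print(np.linalg.norm(x_adv - x0), np.linalg.norm(closer - x0))  # perturbation shrinks
```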
While the proposed CGBA attack can work effectively for an arbitrary decision boundary, it is particularly efficient in exploiting the low curvature to craft high-quality adversarial examples, which is widely seen and experimentally verified in commonly used classifiers under non-targeted attacks. In contrast, the decision boundaries often exhibit higher curvature under targeted attacks. Thus, we develop a new query-efficient variant, CGBA-H, that is adapted for the targeted attack. In addition, we further design an algorithm to obtain a better initial boundary point at the expense of some extra queries, which considerably enhances the performance of the targeted attack. Extensive experiments are conducted to evaluate the performance of our proposed methods against some well-known classifiers on the ImageNet and CIFAR10 datasets, demonstrating the superiority of CGBA and CGBA-H over state-of-the-art non-targeted and targeted attacks, respectively. The source code is available at https://github.com/Farhamdur/CGBA. + + + + Unsupervised Facial Performance Editing via Vector-Quantized StyleGAN Representations + http://openaccess.thecvf.com//content/ICCV2023/papers/Kicanaoglu_Unsupervised_Facial_Performance_Editing_via_Vector-Quantized_StyleGAN_Representations_ICCV_2023_paper.pdf + High-fidelity virtual human avatar applications create a need for photorealistic video face synthesis with controllable semantic editing over facial features. While recent generative neural methods have shown significant progress in portrait video synthesis, intuitive facial control, e.g., of mouth interior and gaze at different levels of details, remains a challenge. In this work, we present a novel face editing framework that combines a 3D face model with StyleGAN vector-quantization to learn multi-level semantic facial control. We show that vector quantization of StyleGAN features unveils richer semantic facial representations, e.g., teeth and pupils, which are difficult to model with 3D tracking priors. Such representations along with 3D tracking can be used as self-supervision to train a generator with control over coarse expressions and finer facial attributes. Learned representations can be combined with user-defined masks to create semantic segmentations that act as custom detail handles for semantic-aware video editing. Our formulation allows video face manipulation with precise local control over facial attributes, such as eyes and teeth, opening up a number of face reenactment and visual expression articulation applications. + + + + A Multidimensional Analysis of Social Biases in Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Brinkmann_A_Multidimensional_Analysis_of_Social_Biases_in_Vision_Transformers_ICCV_2023_paper.pdf + The embedding spaces of image models have been shown to encode a range of social biases such as racism and sexism. Here, we investigate specific factors that contribute to the emergence of these biases in Vision Transformers (ViT). Therefore, we measure the impact of training data, model architecture, and training objectives on social biases in the learned representations of ViTs. Our findings indicate that counterfactual augmentation training using diffusion-based image editing can mitigate biases, but does not eliminate them. Moreover, we find that larger models are less biased than smaller models, and that models trained using discriminative objectives are less biased than those trained using generative objectives. 
In addition, we observe inconsistencies in the learned social biases. To our surprise, ViTs can exhibit opposite biases when trained on the same data set using different self-supervised objectives. Our findings give insights into the factors that contribute to the emergence of social biases and suggests that we could achieve substantial fairness improvements based on model design choices. + + + + PGFed: Personalize Each Client's Global Objective for Federated Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_PGFed_Personalize_Each_Clients_Global_Objective_for_Federated_Learning_ICCV_2023_paper.pdf + Personalized federated learning has received an upsurge of attention due to the mediocre performance of conventional federated learning (FL) over heterogeneous data. Unlike conventional FL which trains a single global consensus model, personalized FL allows different models for different clients. However, existing personalized FL algorithms only implicitly transfer the collaborative knowledge across the federation by embedding the knowledge into the aggregated model or regularization. We observed that this implicit knowledge transfer fails to maximize the potential of each client's empirical risk toward other clients. Based on our observation, in this work, we propose Personalized Global Federated Learning (PGFed), a novel personalized FL framework that enables each client to personalize its own global objective by explicitly and adaptively aggregating the empirical risks of itself and other clients. To avoid massive (O(N^2)) communication overhead and potential privacy leakage while achieving this, each client's risk is estimated through a first-order approximation for other clients' adaptive risk aggregation. On top of PGFed, we develop a momentum upgrade, dubbed PGFedMo, to more efficiently utilize clients' empirical risks. Our extensive experiments on four datasets under different federated settings show consistent improvements of PGFed over previous state-of-the-art methods. The code is publicly available at https://github.com/ljaiverson/pgfed. + + + + Instance and Category Supervision are Alternate Learners for Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Tian_Instance_and_Category_Supervision_are_Alternate_Learners_for_Continual_Learning_ICCV_2023_paper.pdf + Continual Learning (CL) is the constant development of complex behaviors by building upon previously acquired skills. Yet, current CL algorithms tend to incur class-level forgetting as the label information is often quickly overwritten by new knowledge. This motivates attempts to mine instance-level discrimination by resorting to recent self-supervised learning (SSL) techniques. However, previous works have pointed that the self-supervised learning objective is essentially a trade-off between invariance to distortion and preserving sample information, which seriously hinders the unleashing of instance-level discrimination. In this work, we reformulate SSL from the information-theoretic perspective by disentangling the goal of instance-level discrimination, and tackle the trade-off to promote compact representations with maximally preserved invariance to distortion. On this basis, we develop a novel alternate learning paradigm to enjoy the complementary merits of instance-level and category-level supervision, which yields improved robustness against forgetting and better adaptation to each task. 
To verify the proposed method, we conduct extensive experiments on four different benchmarks using both class-incremental and task-incremental settings, where the leap in performance and thorough ablation studies demonstrate the efficacy and efficiency of our modeling strategy. + + + + Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_Diverse_Data_Augmentation_with_Diffusions_for_Effective_Test-time_Prompt_Tuning_ICCV_2023_paper.pdf + Benefiting from prompt tuning, recent years have witnessed the promising performance of pre-trained vision-language models, e.g., CLIP, on versatile downstream tasks. In this paper, we focus on a particular setting of learning adaptive prompts on the fly for each test sample from an unseen new domain, which is known as test-time prompt tuning (TPT). Existing TPT methods typically rely on data augmentation and confidence selection. However, conventional data augmentation techniques, e.g., random resized crops, suffers from the lack of data diversity, while entropy-based confidence selection alone is not sufficient to guarantee prediction fidelity. To address these issues, we propose a novel TPT method, named DiffTPT, which leverages pre-trained diffusion models to generate diverse and informative new data. Specifically, we incorporate augmented data by both conventional method and pre-trained stable diffusion to exploit their respective merits, improving the model's ability to adapt to unknown new test data. Moreover, to ensure the prediction fidelity of generated data, we introduce a cosine similarity-based filtration technique to select the generated data with higher similarity to the single test sample. Our experiments on test datasets with distribution shifts and unseen categories demonstrate that DiffTPT improves the zero-shot accuracy by an average of 5.13% compared to the state-of-the-art TPT method. + + + + GePSAn: Generative Procedure Step Anticipation in Cooking Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Abdelsalam_GePSAn_Generative_Procedure_Step_Anticipation_in_Cooking_Videos_ICCV_2023_paper.pdf + We study the problem of future step anticipation in procedural videos. Given a video of an ongoing procedural activity, we predict a plausible next procedure step described in rich natural language. While most previous work focus on the problem of data scarcity in procedural video datasets, another core challenge of future anticipation is how to account for multiple plausible future realizations in natural settings. This problem has been largely overlooked in previous work. To address this challenge, we frame future step prediction as modelling the distribution of all possible candidates for the next step. Specifically, we design a generative model that takes a series of video clips as input, and generates multiple plausible and diverse candidates (in natural language) for the next step. Following previous work, we side-step the video annotation scarcity by pretraining our model on a large text-based corpus of procedural activities, and then transfer the model to the video domain. Our experiments, both in textual and video domains, show that our model captures diversity in the next step prediction and generates multiple plausible future predictions. Moreover, our model establishes new state-of-the-art results on YouCookII, where it outperforms existing baselines on the next step anticipation. 
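The cosine-similarity filtration step described in the DiffTPT abstract above can be sketched as follows; the random vectors stand in for CLIP image features of the test sample and its diffusion-generated views, and the keep ratio is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def filter_augmented_views(test_feat, aug_feats, keep_ratio=0.25):
    """Keep only the augmented/generated views whose cosine similarity to the single
    test sample's feature is highest, discarding low-fidelity generations."""
    sims = F.cosine_similarity(aug_feats, test_feat.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * aug_feats.size(0)))
    keep = sims.topk(k).indices
    return aug_feats[keep], sims[keep]

test_feat = F.normalize(torch.randn(512), dim=0)       # stand-in for the test image feature
aug_feats = F.normalize(torch.randn(64, 512), dim=-1)  # stand-ins for generated-view features
kept, sims = filter_augmented_views(test_feat, aug_feats)
print(kept.shape, sims.min().item())                   # 16 retained views and their similarities
```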
Finally, we also show that our model can successfully transfer from text to the video domain zero-shot, ie, without fine-tuning or adaptation, and produces good-quality future step predictions from video. + + + + AutoDiffusion: Training-Free Optimization of Time Steps and Architectures for Automated Diffusion Model Acceleration + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_AutoDiffusion_Training-Free_Optimization_of_Time_Steps_and_Architectures_for_Automated_ICCV_2023_paper.pdf + Diffusion models are emerging expressive generative models, in which a large number of time steps (inference steps) are required for a single image generation. To accelerate such tedious process, reducing steps uniformly is considered as an undisputed principle of diffusion models. We consider that such a uniform assumption is not the optimal solution in practice; i.e., we can find different optimal time steps for different models. Therefore, we propose to search the optimal time steps sequence and compressed model architecture in a unified framework to achieve effective image generation for diffusion models without any further training. Specifically, we first design a unified search space that consists of all possible time steps and various architectures. Then, a two stage evolutionary algorithm is introduced to find the optimal solution in the designed search space. To further accelerate the search process, we employ FID score between generated and real samples to estimate the performance of the sampled examples. As a result, the proposed method is (i).training-free, obtaining the optimal time steps and model architecture without any training process; (ii). orthogonal to most advanced diffusion samplers and can be integrated to gain better sample quality. (iii). generalized, where the searched time steps and architectures can be directly applied on different diffusion models with the same guidance scale. Experimental results show that our method achieves excellent performance by using only a few time steps, e.g. 17.86 FID score on ImageNet 64 x 64 with only four steps, compared to 138.66 with DDIM. + + + + DPS-Net: Deep Polarimetric Stereo Depth Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Tian_DPS-Net_Deep_Polarimetric_Stereo_Depth_Estimation_ICCV_2023_paper.pdf + Stereo depth estimation usually struggles to deal with textureless scenes for both traditional and learning-based methods due to the inherent dependence on image correspondence matching. In this paper, we propose a novel neural network, i.e., DPS-Net, to exploit both the prior geometric knowledge and polarimetric information for depth estimation with two polarimetric stereo images. Specifically, we construct both RGB and polarization correlation volumes to fully leverage the multi-domain similarity between polarimetric stereo images. Since inherent ambiguities exist in the polarization images, we introduce the iso-depth cost explicitly into the network to solve these ambiguities. Moreover, we design a cascaded dual-GRU architecture to recurrently update the disparity and effectively fuse both the multi-domain correlation features and the iso-depth cost. Besides, we present new synthetic and real polarimetric stereo datasets for evaluation. Experimental results demonstrate that our method outperforms the state-of-the-art stereo depth estimation methods. 
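DPS-Net builds both RGB and polarization correlation volumes between the two views; below is a generic sketch of a single correlation cost volume over a disparity range, not the paper's exact multi-domain formulation.

```python
import torch

def correlation_volume(feat_left, feat_right, max_disp=16):
    """volume[b, d, y, x] = <feat_left[b, :, y, x], feat_right[b, :, y, x - d]> / C,
    i.e. the per-pixel feature correlation at each candidate disparity d."""
    B, C, H, W = feat_left.shape
    volume = feat_left.new_zeros(B, max_disp, H, W)
    for d in range(max_disp):
        if d == 0:
            volume[:, 0] = (feat_left * feat_right).mean(dim=1)
        else:
            volume[:, d, :, d:] = (feat_left[..., d:] * feat_right[..., :-d]).mean(dim=1)
    return volume

left = torch.randn(1, 32, 48, 64)              # stand-ins for left/right stereo feature maps
right = torch.randn(1, 32, 48, 64)
print(correlation_volume(left, right).shape)   # torch.Size([1, 16, 48, 64])
```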
+ + + + SpaceEvo: Hardware-Friendly Search Space Design for Efficient INT8 Inference + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_SpaceEvo_Hardware-Friendly_Search_Space_Design_for_Efficient_INT8_Inference_ICCV_2023_paper.pdf + The combination of Neural Architecture Search (NAS) and quantization has proven successful in automatically designing low-FLOPs INT8 quantized neural networks (QNN). However, directly applying NAS to design accurate QNN models that achieve low latency on real-world devices leads to inferior performance. In this work, we identify that the poor INT8 latency is due to the quantization-unfriendly issue: the operator and configuration (e.g., channel width) choices in prior art search spaces lead to diverse quantization efficiency and can slow down the INT8 inference speed. To address this challenge, we propose SpaceEvo, an automatic method for designing a dedicated, quantization-friendly search space for each target hardware. The key idea of SpaceEvo is to automatically search hardware-preferred operators and configurations to construct the search space, guided by a metric called Q-T score to quantify how quantization-friendly a candidate search space is. We further train a quantized-for-all supernet over our discovered search space, enabling the searched models to be directly deployed without extra retraining or quantization. Our discovered models, SEQnet, establish new SOTA INT8 quantized accuracy under various latency constraints, achieving up to 10.1% accuracy improvement on ImageNet over prior art CNNs under the same latency. Extensive experiments on real devices show that SpaceEvo consistently outperforms manually-designed search spaces with up to 2.5x faster speed while achieving the same accuracy. + + + + How Far Pre-trained Models Are from Neural Collapse on the Target Dataset Informs their Transferability + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_How_Far_Pre-trained_Models_Are_from_Neural_Collapse_on_the_ICCV_2023_paper.pdf + This paper focuses on model transferability estimation, i.e., assessing the performance of pre-trained models on a downstream task without performing fine-tuning. Motivated by the neural collapse (NC) that reveals the feature geometry at the terminal stage of training, our method considers the model transferability as how far the target activations obtained by pre-trained models are from their hypothetical state in the terminal phase of the fine-tuned model. We propose a metric that computes this proximity based on three phenomena of NC: within-class variability collapses, the simplex encoded label interpolation geometry structure is formed, and the nearest center classifier becomes optimal on training data. Extensive experiments on 11 benchmark datasets demonstrate the effectiveness and efficiency of the proposed method over the existing SOTA approaches. Particularly, our method achieves SOTA transferability estimation accuracy with an approximately 10x wall-clock time speedup compared to the existing approaches. + + + + Convolutional Networks with Oriented 1D Kernels + http://openaccess.thecvf.com//content/ICCV2023/papers/Kirchmeyer_Convolutional_Networks_with_Oriented_1D_Kernels_ICCV_2023_paper.pdf + In computer vision, 2D convolution is arguably the most important operation performed by a ConvNet. Unsurprisingly, it has been the focus of intense software and hardware optimization and enjoys highly efficient implementations.
In this work, we ask an intriguing question: can we make a ConvNet work without 2D convolutions? Surprisingly, we find that the answer is yes --- we show that a ConvNet consisting entirely of 1D convolutions can do just as well as 2D on ImageNet classification. Specifically, we find that one key ingredient to a high-performing 1D ConvNet is oriented 1D kernels: 1D kernels that are oriented not just horizontally or vertically, but also at other angles. Our experiments show that oriented 1D convolutions can not only replace 2D convolutions but also augment existing architectures with large kernels, leading to improved accuracy with minimal FLOPs increase. A key contribution of this work is a highly-optimized custom CUDA implementation of oriented 1D kernels, specialized to the depthwise convolution setting. Our benchmarks demonstrate that our custom CUDA implementation almost perfectly realizes the theoretical advantage of 1D convolution: it is faster than a native horizontal convolution for any arbitrary angle. Code is available at https://github.com/princeton-vl/Oriented1D. + + + + Improving Pixel-based MIM by Reducing Wasted Modeling Capability + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Improving_Pixel-based_MIM_by_Reducing_Wasted_Modeling_Capability_ICCV_2023_paper.pdf + There has been significant progress in Masked Image Modeling (MIM). Existing MIM methods can be broadly categorized into two groups based on the reconstruction target: pixel-based and tokenizer-based approaches. The former offers a simpler pipeline and lower computational cost, but it is known to be biased toward high-frequency details. In this paper, we provide a set of empirical studies to confirm this limitation of pixel-based MIM and propose a new method that explicitly utilizes low-level features from shallow layers to aid pixel reconstruction. By incorporating this design into our base method, MAE, we reduce the wasted modeling capability of pixel-based MIM, improving its convergence and achieving non-trivial improvements across various downstream tasks. To the best of our knowledge, we are the first to systematically investigate multi-level feature fusion for isotropic architectures like the standard Vision Transformer (ViT). Notably, when applied to a smaller model (e.g., ViT-S), our method yields significant performance gains, such as 1.2% on fine-tuning, 2.8% on linear probing, and 2.6% on semantic segmentation. + + + + Towards Memory- and Time-Efficient Backpropagation for Training Spiking Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Meng_Towards_Memory-_and_Time-Efficient_Backpropagation_for_Training_Spiking_Neural_Networks_ICCV_2023_paper.pdf + Spiking Neural Networks (SNNs) are promising energy-efficient models for neuromorphic computing. For training the non-differentiable SNN models, the backpropagation through time (BPTT) with surrogate gradients (SG) method has achieved high performance. However, this method suffers from considerable memory cost and training time during training. In this paper, we propose the Spatial Learning Through Time (SLTT) method that can achieve high performance while greatly improving training efficiency compared with BPTT. First, we show that the backpropagation of SNNs through the temporal domain contributes just a little to the final calculated gradients. Thus, we propose to ignore the unimportant routes in the computational graph during backpropagation. 
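The core idea of the SLTT method described above, dropping the temporal routes of backpropagation so memory no longer grows with the number of time steps, can be sketched with a toy leaky integrate-and-fire layer; the surrogate gradient shape, threshold, and leak factor here are illustrative assumptions rather than the paper's exact choices.

```python
import torch
import torch.nn as nn

class SpikeFn(torch.autograd.Function):
    """Heaviside spike with a rectangular surrogate gradient."""
    @staticmethod
    def forward(ctx, v):
        ctx.save_for_backward(v)
        return (v > 0).float()

    @staticmethod
    def backward(ctx, grad_out):
        (v,) = ctx.saved_tensors
        return grad_out * (v.abs() < 0.5).float()   # box-shaped surrogate around threshold

spike = SpikeFn.apply

class TinySNN(nn.Module):
    def __init__(self, detach_temporal=True):
        super().__init__()
        self.fc = nn.Linear(100, 10)
        self.detach_temporal = detach_temporal

    def forward(self, x_seq):                       # x_seq: (T, B, 100)
        mem, rate = 0.0, 0.0
        for x_t in x_seq:
            mem = 0.5 * mem + self.fc(x_t)          # leaky integration
            s_t = spike(mem - 1.0)                  # fire when the membrane exceeds 1.0
            mem = mem - s_t                         # soft reset
            if self.detach_temporal:
                # Cut the temporal backprop route: gradients only flow through the
                # current step's spatial computation, so memory does not grow with T.
                mem = mem.detach()
            rate = rate + s_t
        return rate / x_seq.shape[0]

net = TinySNN()
out = net(torch.rand(8, 4, 100))                    # T=8 time steps, batch of 4
loss = ((out - torch.rand(4, 10)) ** 2).mean()
loss.backward()                                     # gradients still reach net.fc at every step
```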
The proposed method reduces the number of scalar multiplications and achieves a small memory occupation that is independent of the total time steps. Furthermore, we propose a variant of SLTT, called SLTT-K, that allows backpropagation only at K time steps, then the required number of scalar multiplications is further reduced and is independent of the total time steps. Experiments on both static and neuromorphic datasets demonstrate superior training efficiency and performance of our SLTT. In particular, our method achieves state-of-the-art accuracy on ImageNet, while the memory cost and training time are reduced by more than 70% and 50%, respectively, compared with BPTT. + + + + When to Learn What: Model-Adaptive Data Augmentation Curriculum + http://openaccess.thecvf.com//content/ICCV2023/papers/Hou_When_to_Learn_What_Model-Adaptive_Data_Augmentation_Curriculum_ICCV_2023_paper.pdf + Data augmentation (DA) is widely used to improve the generalization of neural networks by enforcing the invariances and symmetries to pre-defined transformations applied to input data. However, a fixed augmentation policy may have different effects on each sample in different training stages but existing approaches cannot adjust the policy to be adaptive to each sample and the training model. In this paper, we propose "Model-Adaptive Data Augmentation (MADAug)" that jointly trains an augmentation policy network to teach the model "when to learn what". Unlike previous work, MADAug selects augmentation operators for each input image by a model-adaptive policy varying between training stages, producing a data augmentation curriculum optimized for better generalization. In MADAug, we train the policy through a bi-level optimization scheme, which aims to minimize a validation set loss of a model trained using the policy-produced data augmentations. We conduct an extensive evaluation of MADAug on multiple image classification tasks and network architectures with thorough comparisons to existing DA approaches. MADAug outperforms or is on par with other baselines and exhibits better fairness: it brings improvement to all classes and more to the difficult ones. Moreover, MADAug learned policy shows better performance when transferred to fine-grained datasets. In addition, the auto-optimized policy in MADAug gradually introduces increasing perturbations and naturally forms an easy-to-hard curriculum. + + + + COPILOT: Human-Environment Collision Prediction and Localization from Egocentric Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Pan_COPILOT_Human-Environment_Collision_Prediction_and_Localization_from_Egocentric_Videos_ICCV_2023_paper.pdf + The ability to forecast human-environment collisions from egocentric observations is vital to enable collision avoidance in applications such as VR, AR, and wearable assistive robotics. In this work, we introduce the challenging problem of predicting collisions in diverse environments from multi-view egocentric videos captured from body-mounted cameras. Solving this problem requires a generalizable perception system that can classify which human body joints will collide and estimate a collision region heatmap to localize collisions in the environment. To achieve this, we propose a transformer-based model called COPILOT to perform collision prediction and localization simultaneously, which accumulates information across multi-view inputs through a novel 4D space-time-viewpoint attention mechanism. 
To train our model and enable future research on this task, we develop a synthetic data generation framework that produces egocentric videos of virtual humans moving and colliding within diverse 3D environments. This framework is then used to establish a large-scale dataset consisting of 8.6M egocentric RGBD frames. Extensive experiments show that COPILOT generalizes to unseen synthetic as well as real-world scenes. We further demonstrate COPILOT outputs are useful for downstream collision avoidance through simple closed-loop control. Please visit our project webpage at https://sites.google.com/stanford.edu/copilot. + + + + EGformer: Equirectangular Geometry-biased Transformer for 360 Depth Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yun_EGformer_Equirectangular_Geometry-biased_Transformer_for_360_Depth_Estimation_ICCV_2023_paper.pdf + Estimating the depths of equirectangular (i.e., 360) images (EIs) is challenging given the distorted 180 x 360 field-of-view, which is hard to be addressed via convolutional neural network (CNN). Although a transformer with global attention achieves significant improvements over CNN for EI depth estimation task, it is computationally inefficient, which raises the need for transformer with local attention. However, to apply local attention successfully for EIs, a specific strategy, which addresses distorted equirectangular geometry and limited receptive field simultaneously, is required. Prior works have only cared either of them, resulting in unsatisfactory depths occasionally. In this paper, we propose an equirectangular geometry-biased transformer termed EGformer. While limiting the computational cost and the number of network parameters, EGformer enables the extraction of the equirectangular geometry-aware local attention with a large receptive field. To achieve this, we actively utilize the equirectangular geometry as the bias for the local attention instead of struggling to reduce the distortion of EIs. As compared to the most recent EI depth estimation studies, the proposed approach yields the best depth outcomes overall with the lowest computational cost and the fewest parameters, demonstrating the effectiveness of the proposed methods. + + + + Size Does Matter: Size-aware Virtual Try-on via Clothing-oriented Transformation Try-on Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Size_Does_Matter_Size-aware_Virtual_Try-on_via_Clothing-oriented_Transformation_Try-on_ICCV_2023_paper.pdf + Virtual try-on tasks aim at synthesizing realistic try-on results by trying target clothes on humans. Most previous works relied on the Thin Plate Spline or appearance flows to warp clothes to fit human body shapes. However, both approaches cannot handle complex warping, leading to over distortion or misalignment. Furthermore, there is a critical unaddressed challenge of adjusting clothing sizes for try-on. To tackle these issues, we propose a Clothing-Oriented Transformation Try-On Network (COTTON). COTTON leverages clothing structure with landmarks and segmentation to design a novel landmark-guided transformation for precisely deforming clothes, allowing for size adjustment during try-on. Additionally, to properly remove the clothing region from the human image without losing significant human characteristics, we propose a clothing elimination policy based on both transformed clothes and human segmentation. This method enables users to try on clothes tucked-in or untucked while retaining more human characteristics. 
Both qualitative and quantitative results show that COTTON outperforms the state-of-the-art high-resolution virtual try-on approaches. All the code is available at https://github.com/cotton6/COTTON-size-does-matter. + + + + Generating Realistic Images from In-the-wild Sounds + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Generating_Realistic_Images_from_In-the-wild_Sounds_ICCV_2023_paper.pdf + Representing wild sounds as images is an important but challenging task due to the lack of paired datasets between sound and image data and the significant differences in the characteristics of these two modalities. Previous studies have focused on generating images from sound in limited categories or music. In this paper, we propose a novel approach to generate images from wild sounds. First, we convert sound into text using audio captioning. Second, we propose audio attention and sentence attention to represent the rich characteristics of sound and visualize the sound. Lastly, we propose a direct sound optimization with CLIPscore and AudioCLIP and generate images with a diffusion-based model. In experiments, it shows that our model is able to generate high quality images from wild sounds and outperforms baselines in both quantitative and qualitative evaluations on wild audio datasets. + + + + Candidate-aware Selective Disambiguation Based On Normalized Entropy for Instance-dependent Partial-label Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/He_Candidate-aware_Selective_Disambiguation_Based_On_Normalized_Entropy_for_Instance-dependent_Partial-label_ICCV_2023_paper.pdf + In partial-label learning (PLL), each training example has a set of candidate labels, among which only one is the true label. Most existing PLL studies focus on the instance-independent (II) case, where the generation of candidate labels is only dependent on the true label. However, this II-PLL paradigm could be unrealistic, since candidate labels are usually generated according to the specific features of the instance. Therefore, instance-dependent PLL (ID-PLL) has attracted increasing attention recently. Unfortunately, existing ID-PLL studies lack an insightful perception of the intrinsic challenge in ID-PLL. In this paper, we start with an empirical study of the dynamics of label disambiguation in both II-PLL and ID-PLL. We found that the performance degradation of ID-PLL stems from the inaccurate supervision caused by massive under-disambiguated (UD) examples that do not achieve complete disambiguation. To solve this problem, we propose a novel two-stage PLL framework including selective disambiguation and candidate-aware thresholding. Specifically, we first choose a part of well-disambiguated (WD) examples based on the magnitude of normalized entropy (NE) and integrate harmless complementary supervision from the remaining ones to train two networks. Next, the remaining examples whose NE is lower than the specific class-wise WD-NE threshold are selected as additional WD ones. Meanwhile, the remaining UD examples, whose NE is lower than the self-adaptive UD-NE threshold and whose predictions from two networks are agreed, are also regarded as WD ones for model training. Extensive experiments demonstrate that our proposed method outperforms state-of-the-art PLL methods. 
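The normalized-entropy criterion used above to separate well-disambiguated from under-disambiguated partial-label examples can be sketched as follows; the fixed threshold stands in for the class-wise and self-adaptive thresholds of the actual method.

```python
import torch

def normalized_entropy(probs, candidate_mask, eps=1e-12):
    """Entropy of the prediction restricted to each example's candidate label set,
    normalized by log(|candidates|) so that values lie in [0, 1]."""
    masked = probs * candidate_mask
    q = masked / (masked.sum(dim=1, keepdim=True) + eps)   # renormalize over candidates
    ent = -(q * (q + eps).log()).sum(dim=1)
    n_cand = candidate_mask.sum(dim=1).clamp(min=2)        # singleton sets are trivially resolved
    return ent / n_cand.log()

probs = torch.softmax(torch.randn(5, 10), dim=1)           # model predictions
cand = (torch.rand(5, 10) < 0.3).float()
cand[torch.arange(5), torch.randint(0, 10, (5,))] = 1.0    # make sure every candidate set is non-empty
ne = normalized_entropy(probs, cand)
well_disambiguated = ne < 0.4   # a fixed toy threshold; the method uses adaptive ones
print(ne, well_disambiguated)
```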
+ + + + Open-vocabulary Video Question Answering: A New Benchmark for Evaluating the Generalizability of Video Question Answering Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Ko_Open-vocabulary_Video_Question_Answering_A_New_Benchmark_for_Evaluating_the_ICCV_2023_paper.pdf + Video Question Answering (VideoQA) is a challenging task that entails complex multi-modal reasoning. In contrast to multiple-choice VideoQA which aims to predict the answer given several options, the goal of open-ended VideoQA is to answer questions without restricting candidate answers. However, the majority of previous VideoQA models formulate open-ended VideoQA as a classification task to classify the video-question pairs into a fixed answer set, i.e., closed-vocabulary, which contains only frequent answers (e.g., top-1000 answers). This leads the model to be biased toward only frequent answers and fail to generalize on out-of-vocabulary answers. We hence propose a new benchmark, Open-vocabulary Video Question Answering (OVQA), to measure the generalizability of VideoQA models by considering rare and unseen answers. In addition, in order to improve the model's generalization power, we introduce a novel GNN-based soft verbalizer that enhances the prediction on rare and unseen answers by aggregating the information from their similar words. For evaluation, we introduce new baselines by modifying the existing (closed-vocabulary) open-ended VideoQA models and improve their performances by further taking into account rare and unseen answers. Our ablation studies and qualitative analyses demonstrate that our GNN-based soft verbalizer further improves the model performance, especially on rare and unseen answers. We hope that our benchmark OVQA can serve as a guide for evaluating the generalizability of VideoQA models and inspire future research. Code is available at https://github.com/mlvlab/OVQA. + + + + Using a Waffle Iron for Automotive Point Cloud Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Puy_Using_a_Waffle_Iron_for_Automotive_Point_Cloud_Semantic_Segmentation_ICCV_2023_paper.pdf + Semantic segmentation of point clouds in autonomous driving datasets requires techniques that can process large numbers of points efficiently. Sparse 3D convolutions have become the de-facto tools to construct deep neural networks for this task: they exploit point cloud sparsity to reduce the memory and computational loads and are at the core of today's best methods. In this paper, we propose an alternative method that reaches the level of state-of-the-art methods without requiring sparse convolutions. We actually show that such level of performance is achievable by relying on tools a priori unfit for large scale and high-performing 3D perception. In particular, we propose a novel 3D backbone, WaffleIron, made almost exclusively of MLPs and dense 2D convolutions and present how to train it to reach high performance on SemanticKITTI and nuScenes. We believe that WaffleIron is a compelling alternative to backbones using sparse 3D convolutions, especially in frameworks and on hardware where those convolutions are not readily available. + + + + AutoReP: Automatic ReLU Replacement for Fast Private Network Inference + http://openaccess.thecvf.com//content/ICCV2023/papers/Peng_AutoReP_Automatic_ReLU_Replacement_for_Fast_Private_Network_Inference_ICCV_2023_paper.pdf + The growth of the Machine-Learning-As-A-Service (MLaaS) market has highlighted clients' data privacy and security issues. 
Private inference (PI) techniques using cryptographic primitives offer a solution but often have high computation and communication costs, particularly with non-linear operators like ReLU. Many attempts to reduce ReLU operations exist, but they may need heuristic threshold selection or cause substantial accuracy loss. This work introduces AutoReP, a gradient-based approach to lessen non-linear operators and alleviate these issues. It automates the selection of ReLU and polynomial functions to speed up PI applications and introduces distribution-aware polynomial approximation (DaPa) to maintain model expressivity while accurately approximating ReLUs. Our experimental results demonstrate significant accuracy improvements of 6.12% (94.31%, 12.9K ReLU budget, CIFAR-10), 8.39% (74.92%, 12.9K ReLU budget, CIFAR-100), and 9.45% (63.69%, 55K ReLU budget, Tiny-ImageNet) over current state-of-the-art methods, e.g., SNL. Moreover, AutoReP is applied to EfficientNet-B2 on the ImageNet dataset and achieves 75.55% accuracy with a 176.1x ReLU budget reduction. + + + + Center-Based Decoupled Point-cloud Registration for 6D Object Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Center-Based_Decoupled_Point-cloud_Registration_for_6D_Object_Pose_Estimation_ICCV_2023_paper.pdf + In this paper, we propose a novel center-based decoupled point cloud registration framework for robust 6D object pose estimation in real-world scenarios. Our method decouples the translation from the entire transformation by predicting the object center and estimating the rotation in a center-aware manner. This center offset-based translation estimation is correspondence-free, freeing us from the difficulty of constructing correspondences in challenging scenarios, thus improving robustness. To obtain reliable center predictions, we use a multi-view (bird's eye view and front view) object shape description of the source-point features, with both views jointly voting for the object center. Additionally, we propose an effective shape embedding module to augment the source features, largely completing the missing shape information due to partial scanning, thus facilitating the center prediction. With the center-aligned source and model point clouds, the rotation predictor utilizes feature similarity to establish putative correspondences for SVD-based rotation estimation. In particular, we introduce a center-aware hybrid feature descriptor with a normal correction technique to extract discriminative, part-aware features for high-quality correspondence construction. Our experiments show that our method outperforms the state-of-the-art methods by a large margin on real-world datasets such as TUD-L, LINEMOD, and Occluded-LINEMOD. + + + + GAIT: Generating Aesthetic Indoor Tours with Deep Reinforcement Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_GAIT_Generating_Aesthetic_Indoor_Tours_with_Deep_Reinforcement_Learning_ICCV_2023_paper.pdf + Placing and orienting a camera to compose aesthetically meaningful shots of a scene is a key objective not only in real-world photography and cinematography but also in virtual content creation. The framing of a camera often significantly contributes to the storytelling in movies, games, and mixed reality applications. Generating single camera poses or even contiguous trajectories either requires a significant amount of manual labor or requires solving high-dimensional optimization problems, which can be computationally demanding and error-prone.
In this paper, we introduce GAIT, a framework for training a Deep Reinforcement Learning (DRL) agent that learns to automatically control a camera to generate a sequence of aesthetically meaningful views for synthetic 3D indoor scenes. To generate sequences of frames with high aesthetic value, GAIT relies on a neural aesthetics estimator, which is trained on a crowd-sourced dataset. Additionally, we introduce regularization techniques for diversity and smoothness to generate visually interesting trajectories for a 3D environment, and to constrain agent acceleration in the reward function to generate a smooth sequence of camera frames. We validated our method by comparing it to baseline algorithms, based on a perceptual user study, and through ablation studies. Code and visual results are available on the project website: https://desaixie.github.io/gait-rl + + + + Rethinking Mobile Block for Efficient Attention-based Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Rethinking_Mobile_Block_for_Efficient_Attention-based_Models_ICCV_2023_paper.pdf + This paper focuses on developing modern, efficient, lightweight models for dense predictions while trading off parameters, FLOPs, and performance. The Inverted Residual Block (IRB) serves as the infrastructure for lightweight CNNs, but no counterpart has been recognized by attention-based studies. This work rethinks lightweight infrastructure from efficient IRB and effective components of Transformer from a unified perspective, extending the CNN-based IRB to attention-based models and abstracting a one-residual Meta Mobile Block (MMB) for lightweight model design. Following a simple but effective design criterion, we deduce a modern Inverted Residual Mobile Block (iRMB) and build a ResNet-like Efficient MOdel (EMO) with only iRMB for down-stream tasks. Extensive experiments on ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, e.g., EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing equal-order CNN-/Attention-based models while trading off parameters, efficiency, and accuracy well: running 2.8-4.0x faster than EdgeNeXt on iPhone14. + + + + REAP: A Large-Scale Realistic Adversarial Patch Benchmark + http://openaccess.thecvf.com//content/ICCV2023/papers/Hingun_REAP_A_Large-Scale_Realistic_Adversarial_Patch_Benchmark_ICCV_2023_paper.pdf + Machine learning models are known to be susceptible to adversarial perturbation. One famous attack is the adversarial patch, a specially crafted sticker that makes the model mispredict the object it is placed on. This attack presents a critical threat to cyber-physical systems that rely on cameras, such as autonomous cars. Despite the significance of the problem, conducting research in this setting has been difficult; evaluating attacks and defenses in the real world is exceptionally costly while synthetic data are unrealistic.
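For reference, the Inverted Residual Block that the Rethinking Mobile Block abstract above takes as its starting point looks roughly like this MobileNetV2-style sketch; the paper's iRMB additionally injects an attention-style token mixer, which is not shown here.

```python
import torch
import torch.nn as nn

class InvertedResidual(nn.Module):
    """MobileNetV2-style inverted residual block: 1x1 expand -> 3x3 depthwise ->
    1x1 project, with a skip connection when the shapes match."""
    def __init__(self, channels, expand_ratio=4):
        super().__init__()
        hidden = channels * expand_ratio
        self.block = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),  # depthwise
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False),
            nn.BatchNorm2d(channels))

    def forward(self, x):
        return x + self.block(x)

x = torch.randn(1, 32, 56, 56)
print(InvertedResidual(32)(x).shape)   # torch.Size([1, 32, 56, 56])
```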
Our experiments suggest that patch attacks may present a smaller threat than previously believed and that the success rate of an attack on simpler digital simulations is not predictive of its actual effectiveness in practice. Our benchmark is released publicly at https://github.com/wagner-group/reap-benchmark. + + + + StegaNeRF: Embedding Invisible Information within Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_StegaNeRF_Embedding_Invisible_Information_within_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Recent advancements in neural rendering have paved the way for a future marked by the widespread distribution of visual data through the sharing of Neural Radiance Field (NeRF) model weights. However, while established techniques exist for embedding ownership or copyright information within conventional visual data such as images and videos, the challenges posed by the emerging NeRF format have remained unaddressed. In this paper, we introduce StegaNeRF, an innovative approach for steganographic information embedding within NeRF renderings. We have meticulously developed an optimization framework that enables precise retrieval of hidden information from images generated by NeRF, while ensuring the original visual quality of the rendered images to remain intact. Through rigorous experimentation, we assess the efficacy of our methodology across various potential deployment scenarios. Furthermore, we delve into the insights gleaned from our analysis. StegaNeRF represents an initial foray into the intriguing realm of infusing NeRF renderings with customizable, imperceptible, and recoverable information, all while minimizing any discernible impact on the rendered images. For more details, please visit our project page: https://xggnet.github.io/StegaNeRF/ + + + + Robust Evaluation of Diffusion-Based Adversarial Purification + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Robust_Evaluation_of_Diffusion-Based_Adversarial_Purification_ICCV_2023_paper.pdf + We question the current evaluation practice on diffusion-based purification methods. Diffusion-based purification methods aim to remove adversarial effects from an input data point at test time. The approach gains increasing attention as an alternative to adversarial training due to the disentangling between training and testing. Well-known white-box attacks are often employed to measure the robustness of the purification. However, it is unknown whether these attacks are the most effective for the diffusion-based purification since the attacks are often tailored for adversarial training. We analyze the current practices and provide a new guideline for measuring the robustness of purification methods against adversarial attacks. Based on our analysis, we further propose a new purification strategy improving robustness compared to the current diffusion-based purification methods. + + + + Hyperbolic Audio-visual Zero-shot Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Hong_Hyperbolic_Audio-visual_Zero-shot_Learning_ICCV_2023_paper.pdf + Audio-visual zero-shot learning aims to classify samples consisting of a pair of corresponding audio and video sequences from classes that are not present during training. An analysis of the audio-visual data reveals a large degree of hyperbolicity, indicating the potential benefit of using a hyperbolic transformation to achieve curvature-aware geometric learning, with the aim of exploring more complex hierarchical data structures for this task. 
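A minimal sketch of the hyperbolic projection idea introduced in the audio-visual zero-shot learning abstract above: Euclidean embeddings are mapped onto the Poincaré ball with the exponential map at the origin and compared with the geodesic distance. The cross-modal alignment loss and the multiple adaptive curvatures are not reproduced, and the random vectors stand in for real video and audio features.

```python
import torch

def expmap0(v, c=1.0, eps=1e-8):
    """Exponential map at the origin of the Poincare ball with curvature -c:
    projects Euclidean feature vectors into hyperbolic space."""
    sqrt_c = c ** 0.5
    norm = v.norm(dim=-1, keepdim=True).clamp_min(eps)
    return torch.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def poincare_distance(x, y, c=1.0, eps=1e-8):
    """Geodesic distance on the Poincare ball, usable for cross-modal alignment."""
    diff2 = (x - y).pow(2).sum(dim=-1)
    denom = (1 - c * x.pow(2).sum(dim=-1)) * (1 - c * y.pow(2).sum(dim=-1))
    arg = 1 + 2 * c * diff2 / denom.clamp_min(eps)
    return torch.acosh(arg.clamp_min(1 + eps)) / c ** 0.5

video_emb = expmap0(torch.randn(4, 128))   # stand-ins for video and audio embeddings
audio_emb = expmap0(torch.randn(4, 128))
print(poincare_distance(video_emb, audio_emb))
```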
The proposed approach employs a novel loss function that incorporates cross-modality alignment between video and audio features in the hyperbolic space. Additionally, we explore the use of multiple adaptive curvatures for hyperbolic projections. The experimental results on this very challenging task demonstrate that our proposed hyperbolic approach for zero-shot learning outperforms the SOTA method on three datasets: VGGSound-GZSL, UCF-GZSL, and ActivityNet-GZSL, achieving harmonic mean (HM) improvements of around 3.0%, 7.0%, and 5.3%, respectively. + + + + ModelGiF: Gradient Fields for Model Functional Distance + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_ModelGiF_Gradient_Fields_for_Model_Functional_Distance_ICCV_2023_paper.pdf + The last decade has witnessed the success of deep learning and the surge of publicly released trained models, which necessitates the quantification of the model functional distance for various purposes. However, quantifying the model functional distance is always challenging due to the opacity in inner workings and the heterogeneity in architectures and tasks. Inspired by the concept of "field" in physics, in this work we introduce Model Gradient Field (abbr. ModelGiF) to extract homogeneous representations from the heterogeneous pre-trained models. Our main assumption underlying ModelGiF is that each pre-trained deep model uniquely determines a ModelGiF over the input space. The distance between models can thus be measured by the similarity between their ModelGiFs. We provide theoretical insights into the proposed ModelGiFs for model functional distance, and validate the effectiveness of the proposed ModelGiF with a suite of testbeds, including task relatedness estimation, intellectual property protection, and model unlearning verification. Experimental results demonstrate the versatility of the proposed ModelGiF on these tasks, with significantly superior performance compared to state-of-the-art competitors. Codes are available at https://github.com/zju-vipa/modelgif. + + + + SIGMA: Scale-Invariant Global Sparse Shape Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_SIGMA_Scale-Invariant_Global_Sparse_Shape_Matching_ICCV_2023_paper.pdf + We propose a novel mixed-integer programming (MIP) formulation for generating precise sparse correspondences for highly non-rigid shapes. To this end, we introduce a projected Laplace-Beltrami operator (PLBO) which combines intrinsic and extrinsic geometric information to measure the deformation quality induced by predicted correspondences. We integrate the PLBO, together with an orientation-aware regulariser, into a novel MIP formulation that can be solved to global optimality for many practical problems. In contrast to previous methods, our approach is provably invariant to rigid transformations and global scaling, initialisation-free, has optimality guarantees, and scales to high resolution meshes with (empirically observed) linear time. We show state-of-the-art results for sparse non-rigid matching on several challenging 3D datasets, including data with inconsistent meshing, as well as applications in mesh-to-point-cloud matching. + + + + VidStyleODE: Disentangled Video Editing via StyleGAN and NeuralODEs + http://openaccess.thecvf.com//content/ICCV2023/papers/Ali_VidStyleODE_Disentangled_Video_Editing_via_StyleGAN_and_NeuralODEs_ICCV_2023_paper.pdf + We propose VidStyleODE, a spatiotemporally continuous disentangled video representation based upon StyleGAN and Neural-ODEs.
Effective traversal of the latent space learned by Generative Adversarial Networks (GANs) has been the basis for recent breakthroughs in image editing. However, the applicability of such advancements to the video domain has been hindered by the difficulty of representing and controlling videos in the latent space of GANs. In particular, videos are composed of content (i.e., appearance) and complex motion components that require a special mechanism to disentangle and control. To achieve this, VidStyleODE encodes the video content in a pre-trained StyleGAN W+ space and benefits from a latent ODE component to summarize the spatiotemporal dynamics of the input video. Our novel continuous video generation process then combines the two to generate high-quality and temporally consistent videos with varying frame rates. We show that our proposed method enables a variety of applications on real videos: text-guided appearance manipulation, motion manipulation, image animation, and video interpolation and extrapolation. + + + + LeaF: Learning Frames for 4D Point Cloud Sequence Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_LeaF_Learning_Frames_for_4D_Point_Cloud_Sequence_Understanding_ICCV_2023_paper.pdf + We focus on learning descriptive geometry and motion features from 4D point cloud sequences in this work. Existing works usually develop generic 4D learning tools without leveraging the prior that a 4D sequence comes from a single 3D scene with local dynamics. Based on this observation, we propose to learn region-wise coordinate frames that transform together with the underlying geometry. With such frames, we can factorize geometry and motion to facilitate a feature-space geometric reconstruction for more effective 4D learning. To learn such region frames, we develop a rotation equivariant network with a frame stabilization strategy. To leverage such frames for better spatial-temporal feature learning, we develop a frame-guided 4D learning scheme. Experiments show that this approach significantly outperforms previous state-of-the-art methods on a wide range of 4D understanding benchmarks. + + + + Towards Improved Input Masking for Convolutional Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Balasubramanian_Towards_Improved_Input_Masking_for_Convolutional_Neural_Networks_ICCV_2023_paper.pdf + The ability to remove features from the input of machine learning models is very important to understand and interpret model predictions. However, this is non-trivial for vision models since masking out parts of the input image typically causes large distribution shifts. This is because the baseline color used for masking (typically grey or black) is out of distribution. Furthermore, the shape of the mask itself can contain unwanted signals which can be used by the model for its predictions. Recently, there has been some progress in mitigating this issue (called missingness bias) in image masking for vision transformers. In this work, we propose a new masking method for CNNs we call layer masking in which the missingness bias caused by masking is reduced to a large extent. Intuitively, layer masking applies a mask to intermediate activation maps so that the model only processes the unmasked input. 
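The layer-masking idea described in the Towards Improved Input Masking abstract above can be sketched for a toy CNN by resizing the input mask to each intermediate activation map and re-applying it after every stage; the paper's handling of padding, pooling, and normalization effects is more careful than this minimal version.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LayerMaskedCNN(nn.Module):
    """Apply the input mask, resized to each intermediate activation map, after
    every stage instead of only graying out pixels at the input."""
    def __init__(self):
        super().__init__()
        self.stages = nn.ModuleList([
            nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
            nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)),
        ])
        self.head = nn.Linear(32, 10)

    def forward(self, x, mask):                    # mask: (B, 1, H, W) with 1 = keep
        x = x * mask
        for stage in self.stages:
            x = stage(x)
            m = F.interpolate(mask, size=x.shape[-2:], mode="nearest")
            x = x * m                              # re-mask so masked regions stay silent
        return self.head(x.mean(dim=(2, 3)))       # global average pool, then classify

net = LayerMaskedCNN()
img = torch.randn(2, 3, 32, 32)
mask = (torch.rand(2, 1, 32, 32) > 0.5).float()
print(net(img, mask).shape)                        # torch.Size([2, 10])
```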
We show that our method (i) is able to eliminate or minimize the influence of the mask shape or color on the output of the model, and (ii) is much better than replacing the masked region by black or grey for input perturbation based interpretability techniques like LIME. Thus, layer masking is much less affected by missingness bias than other masking strategies. We also demonstrate how the shape of the mask may leak information about the class, thus affecting estimates of model reliance on class-relevant features derived from input masking. Furthermore, we discuss the role of data augmentation techniques for tackling this problem, and argue that they are not sufficient for preventing model reliance on mask shape. + + + + Gramian Attention Heads are Strong yet Efficient Vision Learners + http://openaccess.thecvf.com//content/ICCV2023/papers/Ryu_Gramian_Attention_Heads_are_Strong_yet_Efficient_Vision_Learners_ICCV_2023_paper.pdf + We introduce a novel architecture design that enhances expressiveness by incorporating multiple head classifiers (i.e., classification heads) instead of relying on channel expansion or additional building blocks. Our approach employs attention-based aggregation, utilizing pairwise feature similarity to enhance multiple lightweight heads with minimal resource overhead. We compute the Gramian matrices to reinforce class tokens in an attention layer for each head. This enables the heads to learn more discriminative representations, enhancing their aggregation capabilities. Furthermore, we propose a learning algorithm that encourages heads to complement each other by reducing correlation for aggregation. Our models eventually surpass state-of-the-art CNNs and ViTs regarding the accuracy-throughput trade-off on ImageNet-1K and deliver remarkable performance across various downstream tasks, such as COCO object instance segmentation, ADE20k semantic segmentation, and fine-grained visual classification datasets. The effectiveness of our framework is substantiated by practical experimental results and further underpinned by a generalization error bound. We release the code publicly at: https://github.com/Lab-LVM/imagenet-models. + + + + MI-GAN: A Simple Baseline for Image Inpainting on Mobile Devices + http://openaccess.thecvf.com//content/ICCV2023/papers/Sargsyan_MI-GAN_A_Simple_Baseline_for_Image_Inpainting_on_Mobile_Devices_ICCV_2023_paper.pdf + In recent years, many deep learning based image inpainting methods have been developed by the research community. Some of those methods have shown impressive image completion abilities. Yet, to the best of our knowledge, there is no image inpainting model designed to run on mobile devices. In this paper we present a simple image inpainting baseline, Mobile Inpainting GAN (MI-GAN), which is approximately one order of magnitude computationally cheaper and smaller than existing state-of-the-art inpainting models, and can be efficiently deployed on mobile devices. Extensive quantitative and qualitative evaluations show that MI-GAN performs comparably to or, in some cases, better than recent state-of-the-art approaches. Moreover, we perform a user study comparing MI-GAN results with results from several commercial mobile inpainting applications, which clearly shows the advantage of MI-GAN in comparison to existing apps. With the goal of high-quality and efficient inpainting, we utilize an effective combination of adversarial training, model re-parametrization, and knowledge distillation.
Our models and code are publicly available at https://github.com/Picsart-AI-Research/MI-GAN. + + + + A Large-Scale Outdoor Multi-Modal Dataset and Benchmark for Novel View Synthesis and Implicit Scene Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_A_Large-Scale_Outdoor_Multi-Modal_Dataset_and_Benchmark_for_Novel_View_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRF) has achieved impressive results in single object scene reconstruction and novel view synthesis, as demonstrated on many single modality and single object focused indoor scene datasets like DTU, BMVS, and NeRF Synthetic. However, the study of NeRF on large-scale outdoor scene reconstruction is still limited, as there is no unified outdoor scene dataset for large-scale NeRF evaluation due to expensive data acquisition and calibration costs. In this work, we propose a large-scale outdoor multi-modal dataset, the OMMO dataset, containing complex objects and scenes with calibrated images, point clouds and prompt annotations. A new benchmark for several outdoor NeRF-based tasks is established, such as novel view synthesis, diverse 3D representation, and multi-modal NeRF. To create the dataset, we capture and collect a large number of real fly-view videos and select high-quality and high-resolution clips from them. Then we design a quality review module to refine images, remove low-quality frames and fail-to-calibrate scenes through a learning-based automatic evaluation plus manual review. Finally, volunteers are employed to label and review the prompt annotation for each scene and keyframe. Compared with existing NeRF datasets, our dataset contains abundant real-world urban and natural scenes with various scales, camera trajectories, and lighting conditions. Experiments show that our dataset can benchmark most state-of-the-art NeRF methods on different tasks. The dataset can be found at the following link: https://ommo.luchongshan.com/. + + + + Unleashing Vanilla Vision Transformer with Masked Image Modeling for Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Fang_Unleashing_Vanilla_Vision_Transformer_with_Masked_Image_Modeling_for_Object_ICCV_2023_paper.pdf + We present an approach to efficiently and effectively adapt a masked image modeling (MIM) pre-trained vanilla Vision Transformer (ViT) for object detection, which is based on our two novel observations: (i) A MIM pre-trained vanilla ViT encoder can work surprisingly well in the challenging object-level recognition scenario even with randomly sampled partial observations, e.g., only 25%-50% of the input embeddings. (ii) In order to construct multi-scale representations for object detection from single-scale ViT, a randomly initialized compact convolutional stem supplants the pre-trained patchify stem, and its intermediate features can naturally serve as the higher resolution inputs of a feature pyramid network without further upsampling or other manipulations. The pre-trained ViT is regarded only as the third stage of our detector's backbone rather than the whole feature extractor, which naturally results in a ConvNet-ViT hybrid architecture. The proposed detector, named MIMDet, enables a MIM pre-trained vanilla ViT to outperform leading hierarchical architectures such as Swin Transformer, MViTv2 and ConvNeXt on COCO object detection & instance segmentation, and achieves better results compared with the previous best adapted vanilla ViT detector using a more modest fine-tuning recipe while converging 2.8x faster. 
Code and pre-trained models are available at https://github.com/hustvl/MIMDet. + + + + Spatio-Temporal Crop Aggregation for Video Representation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Sameni_Spatio-Temporal_Crop_Aggregation_for_Video_Representation_Learning_ICCV_2023_paper.pdf + We propose Spatio-temporal Crop Aggregation for video representation LEarning (SCALE), a novel method that enjoys high scalability at both training and inference time. Our model builds long-range video features by learning from sets of video clip-level features extracted with a pre-trained backbone. To train the model, we propose a self-supervised objective consisting of masked clip feature predictions. We apply sparsity to both the input, by extracting a random set of video clips, and to the loss function, by only reconstructing the sparse inputs. Moreover, we use dimensionality reduction by working in the latent space of a pre-trained backbone applied to single video clips. These techniques make our method not only extremely efficient to train but also highly effective in transfer learning. We demonstrate that our video representation yields state-of-the-art performance with linear, nonlinear, and k-NN probing on common action classification and video understanding datasets. + + + + Zero-guidance Segmentation Using Zero Segment Labels + http://openaccess.thecvf.com//content/ICCV2023/papers/Rewatbowornwong_Zero-guidance_Segmentation_Using_Zero_Segment_Labels_ICCV_2023_paper.pdf + The joint visual-language model CLIP has enabled new and exciting applications, such as open-vocabulary segmentation, which can locate any segment given an arbitrary text query. In our research, we ask whether it is possible to discover semantic segments without any user guidance in the form of text queries or predefined classes, and label them using natural language automatically? We propose a novel problem zero-guidance segmentation and the first baseline that leverages two pre-trained generalist models, DINO and CLIP, to solve this problem without any fine-tuning or segmentation dataset. The general idea is to first segment an image into small over-segments, encode them into CLIP's visual-language space, translate them into text labels, and merge semantically similar segments together. The key challenge, however, is how to encode a visual segment into a segment-specific embedding that balances global and local context information, both useful for recognition. Our main contribution is a novel attention-masking technique that balances the two contexts by analyzing the attention layers inside CLIP. We also introduce several metrics for the evaluation of this new task. With CLIP's innate knowledge, our method can precisely locate the Mona Lisa painting among a museum crowd. + + + + Communication-efficient Federated Learning with Single-Step Synthetic Features Compressor for Faster Convergence + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Communication-efficient_Federated_Learning_with_Single-Step_Synthetic_Features_Compressor_for_Faster_ICCV_2023_paper.pdf + Reducing communication overhead in federated learning (FL) is challenging but crucial for large-scale distributed privacy-preserving machine learning. While methods utilizing sparsification or other techniques can largely reduce the communication overhead, the convergence rate is also greatly compromised. 
In this paper, we propose a novel method named Single-Step Synthetic Features Compressor (3SFC) to achieve communication-efficient FL by directly constructing a tiny synthetic dataset containing synthetic features based on raw gradients. Therefore, 3SFC can achieve an extremely low compression rate when the constructed synthetic dataset contains only one data sample. Additionally, the compressing phase of 3SFC utilizes a similarity-based objective function so that it can be optimized with just one step, considerably improving its performance and robustness. To minimize the compressing error, error feedback (EF) is also incorporated into 3SFC. Experiments on multiple datasets and models suggest that 3SFC has significantly better convergence rates compared to competing methods with lower compression rates (i.e., up to 0.02%). Furthermore, ablation studies and visualizations show that 3SFC can carry more information than competing methods for every communication round, further validating its effectiveness. + + + + CTVIS: Consistent Training for Online Video Instance Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ying_CTVIS_Consistent_Training_for_Online_Video_Instance_Segmentation_ICCV_2023_paper.pdf + The discrimination of instance embeddings plays a vital role in associating instances across time for online video instance segmentation (VIS). Instance embedding learning is directly supervised by the contrastive loss computed upon the contrastive items (CIs), which are sets of anchor/positive/negative embeddings. Recent online VIS methods leverage CIs sourced from one reference frame only, which we argue is insufficient for learning highly discriminative embeddings. Intuitively, a possible strategy to enhance CIs is replicating the inference phase during training. To this end, we propose a simple yet effective training strategy, called Consistent Training for Online VIS (CTVIS), which is devoted to aligning the training and inference pipelines in terms of building CIs. Specifically, CTVIS constructs CIs by referring to the inference pipeline, i.e., the momentum-averaged embeddings and the memory bank storage mechanisms, and by adding noise to the relevant embeddings. Such an extension allows a reliable comparison between embeddings of current instances and the stable representations of historical instances, thereby conferring an advantage in modeling VIS challenges such as occlusion, re-identification, and deformation. Empirically, CTVIS outstrips the SOTA VIS models by up to +5.0 points on three VIS benchmarks, including YTVIS19 (55.1% AP), YTVIS21 (50.1% AP) and OVIS (35.5% AP). Furthermore, we find that pseudo-videos transformed from images can train robust models surpassing fully-supervised ones. + + + + Unsupervised Video Object Segmentation with Online Adversarial Self-Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Su_Unsupervised_Video_Object_Segmentation_with_Online_Adversarial_Self-Tuning_ICCV_2023_paper.pdf + The existing unsupervised video object segmentation methods depend heavily on the segmentation model trained offline on a labeled training video set, and cannot generalize well to test videos from a different domain with possible distribution shifts. We propose to perform online fine-tuning on the pre-trained segmentation model to adapt to any ad-hoc videos at test time. 
To achieve this, we design an offline semi-supervised adversarial training process, which leverages the unlabeled video frames to improve the model generalizability while aligning the features of the labeled video frames with the features of the unlabeled video frames. With the trained segmentation model, we further conduct an online self-supervised adversarial finetuning, in which a teacher model and a student model are first initialized with the pre-trained segmentation model weights, and the pseudo label produced by the teacher model is used to supervise the student model in an adversarial learning framework. Through online finetuning, the student model is progressively updated according to the emerging patterns in each test video, which significantly reduces the test-time domain gap. We integrate our offline training and online fine-tuning in a unified framework for unsupervised video object segmentation and dub our method Online Adversarial Self-Tuning (OAST). Experiments show that our method outperforms the state of the art with significant gains on popular video object segmentation datasets. + + + + GlobalMapper: Arbitrary-Shaped Urban Layout Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/He_GlobalMapper_Arbitrary-Shaped_Urban_Layout_Generation_ICCV_2023_paper.pdf + Modeling and designing urban building layouts is of significant interest in computer vision, computer graphics, and urban applications. A building layout consists of a set of buildings in city blocks defined by a network of roads. We observe that building layouts are discrete structures, consisting of multiple rows of buildings of various shapes, and are amenable to skeletonization for mapping arbitrary city block shapes to a canonical form. Hence, we propose a fully automatic approach to building layout generation using graph attention networks. Our method generates realistic urban layouts given arbitrary road networks, and enables conditional generation based on learned priors. Our results, including a user study, demonstrate superior performance compared to prior layout generation networks and support arbitrary city block and varying building shapes, as demonstrated by generating layouts for 28 large cities. + + + + Unified Coarse-to-Fine Alignment for Video-Text Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Unified_Coarse-to-Fine_Alignment_for_Video-Text_Retrieval_ICCV_2023_paper.pdf + The canonical approach to video-text retrieval leverages a coarse-grained or fine-grained alignment between visual and textual information. However, retrieving the correct video according to the text query is often challenging as it requires the ability to reason about both high-level (scene) and low-level (object) visual clues and how they relate to the text query. To this end, we propose a Unified Coarse-to-fine Alignment model, dubbed UCOFIA. Specifically, our model captures the cross-modal similarity information at different granularity levels. To alleviate the effect of irrelevant visual clues, we also apply an Interactive Similarity Aggregation module (ISA) to consider the importance of different visual features while aggregating the cross-modal similarity to obtain a similarity score for each granularity. Finally, we apply the Sinkhorn-Knopp algorithm to normalize the similarities of each level before summing them, alleviating over- and under-representation issues at different levels. 
By jointly considering the cross-modal similarity of different granularity, UCOFIA allows the effective unification of multi-grained alignments. Empirically, UCOFIA outperforms previous state-of-the-art CLIP-based methods on multiple video-text retrieval benchmarks, achieving 2.4%, 1.4% and 1.3% improvements in text-to-video retrieval R@1 on MSR-VTT, Activity-Net, and DiDeMo, respectively. Our code is publicly available at https://github.com/Ziyang412/UCoFiA. + + + + Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Gradient-Regulated_Meta-Prompt_Learning_for_Generalizable_Vision-Language_Models_ICCV_2023_paper.pdf + Prompt tuning, a recently emerging paradigm, enables the powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data- efficient way, by learning the "soft prompts" to condition frozen pre-training models. Though effective, it is particularly problematic in the few-shot scenario, where prompt tuning performance is sensitive to the initialization and requires a time-consuming process to find a good initialization, thus restricting the fast adaptation ability of the pre-training models. In addition, prompt tuning could undermine the generalizability of the pre-training models, because the learnable prompt tokens are easy to overfit to the limited training samples. To address these issues, we introduce a novel Gradient-RegulAted Meta-prompt learning (GRAM) framework that jointly meta-learns an efficient soft prompt initialization for better adaptation and a lightweight gradient regulating function for strong cross-domain generalizability in a meta-learning paradigm using only the unlabeled image-text pre-training data. Rather than designing a specific prompt tuning method, our GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way, and comprehensive experiments show that GRAM brings about consistent improvement for them in several settings (i.e., few-shot learning, cross-domain generalization, cross-dataset generalization, etc.) over 11 datasets. Further, experiments show that GRAM enables the orthogonal methods of textual and visual prompt tuning to work in a mutually-enhanced way, offering better generalizability beyond the uni-modal prompt tuning methods. + + + + MUter: Machine Unlearning on Adversarially Trained Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_MUter_Machine_Unlearning_on_Adversarially_Trained_Models_ICCV_2023_paper.pdf + Machine unlearning is an emerging task of removing the influence of selected training datapoints from a trained model upon data deletion requests, which echoes the widely enforced data regulations mandating the Right to be Forgotten. Many unlearning methods have been proposed recently, achieving significant efficiency gains over the naive baseline of retraining from scratch. However, existing methods focus exclusively on unlearning from standard training models and do not apply to adversarial training models (ATMs) despite their popularity as effective defenses against adversarial examples. During adversarial training, the training data are involved in not only an outer loop for minimizing the training loss, but also an inner loop for generating the adversarial perturbation. 
Such bi-level optimization greatly complicates the influence measure for the data to be deleted and renders the unlearning more challenging than standard model training with single-level optimization. This paper proposes a new approach called MUter for unlearning from ATMs. We derive a closed-form unlearning step underpinned by a total Hessian-related data influence measure, while existing methods can mis-capture the data influence associated with the indirect Hessian part. We further alleviate the computational cost by introducing a series of approximations and conversions to avoid the most computationally demanding parts of Hessian inversions. The efficiency and effectiveness of MUter have been validated through experiments on four datasets using both linear and neural network models. + + + + ParCNetV2: Oversized Kernel with Enhanced Attention + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_ParCNetV2_Oversized_Kernel_with_Enhanced_Attention_ICCV_2023_paper.pdf + Transformers have shown great potential in various computer vision tasks. By borrowing design concepts from transformers, many studies revolutionized CNNs and showed remarkable results. This paper falls in this line of studies. Specifically, we propose a new convolutional neural network, ParCNetV2, that extends the research line of ParCNetV1 by bridging the gap between CNN and ViT. It introduces two key designs: 1) Oversized Convolution (OC) with twice the size of the input, and 2) Bifurcate Gate Unit (BGU) to ensure that the model is input adaptive. Fusing OC and BGU in a unified CNN, ParCNetV2 is capable of flexibly extracting global features like ViT, while maintaining lower latency and better accuracy. Extensive experiments demonstrate the superiority of our method over other convolutional neural networks and hybrid models that combine CNNs and transformers. The code is publicly available at https://github.com/XuRuihan/ParCNetV2. + + + + RealGraph: A Multiview Dataset for 4D Real-world Context Graph Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_RealGraph_A_Multiview_Dataset_for_4D_Real-world_Context_Graph_Generation_ICCV_2023_paper.pdf + In this paper, we propose a brand new scene understanding paradigm called "Context Graph Generation (CGG)", aiming at abstracting holistic semantic information in the complicated 4D world. The CGG task capitalizes on the calibrated multiview videos of a dynamic scene, and targets at recovering semantic information (coordination, trajectories and relationships) of the presented objects in the form of spatio-temporal context graph in 4D space. We also present a benchmark 4D video dataset "RealGraph", the first dataset tailored for the proposed CGG task. The raw data of RealGraph is composed of calibrated and synchronized multiview videos. We exclusively provide manual annotations including object 2D&3D bounding boxes, category labels and semantic relationships. We also make sure the annotated ID for every single object is temporally and spatially consistent. We propose the first CGG baseline algorithm, Multiview-based Context Graph Generation Network (MCGNet), to empirically investigate the legitimacy of CGG task on RealGraph dataset. We nevertheless reveal the great challenges behind this task and encourage the community to explore beyond our solution. 
+ + + + PivotNet: Vectorized Pivot Learning for End-to-end HD Map Construction + http://openaccess.thecvf.com//content/ICCV2023/papers/Ding_PivotNet_Vectorized_Pivot_Learning_for_End-to-end_HD_Map_Construction_ICCV_2023_paper.pdf + Vectorized high-definition map online construction has garnered considerable attention in the field of autonomous driving research. Most existing approaches model changeable map elements using a fixed number of points, or predict local maps in a two-stage autoregressive manner, which may miss essential details and lead to error accumulation. Towards precise map element learning, we propose a simple yet effective architecture named PivotNet, which adopts unified pivot-based map representations and is formulated as a direct set prediction paradigm. Concretely, we first propose a novel Point-to-Line Mask module to encode both the subordinate and geometrical point-line priors in the network. Then, a well-designed Pivot Dynamic Matching module is proposed to model the topology in dynamic point sequences by introducing the concept of sequence matching. Furthermore, to supervise the position and topology of the vectorized point predictions, we propose a Dynamic Vectorized Sequence loss. Extensive experiments and ablations show that PivotNet is remarkably superior to other SOTAs by 5.9 mAP at least. The code will be available soon. + + + + Universal Domain Adaptation via Compressive Attention Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Universal_Domain_Adaptation_via_Compressive_Attention_Matching_ICCV_2023_paper.pdf + Universal domain adaptation (UniDA) aims to transfer knowledge from the source domain to the target domain without any prior knowledge about the label set. The challenge lies in how to determine whether the target samples belong to common categories. The mainstream methods make judgments based on the sample features, which overemphasizes global information while ignoring the most crucial local objects in the image, resulting in limited accuracy. To address this issue, we propose a Universal Attention Matching (UniAM) framework by exploiting the self-attention mechanism in vision transformer to capture the crucial object information. The proposed framework introduces a novel Compressive Attention Matching (CAM) approach to explore the core information by compressively representing attentions. Furthermore, CAM incorporates a residual-based measurement to determine the sample commonness. By utilizing the measurement, UniAM achieves domain-wise and category-wise Common Feature Alignment (CFA) and Target Class Separation (TCS). Notably, UniAM is the first method utilizing the attention in vision transformer directly to perform classification tasks. Extensive experiments show that UniAM outperforms the current state-of-the-art methods on various benchmark datasets. + + + + Point2Mask: Point-supervised Panoptic Segmentation via Optimal Transport + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Point2Mask_Point-supervised_Panoptic_Segmentation_via_Optimal_Transport_ICCV_2023_paper.pdf + Weakly-supervised image segmentation has recently attracted increasing research attentions, aiming to avoid the expensive pixel-wise labeling. In this paper, we present an effective method, namely Point2Mask, to achieve high-quality panoptic prediction using only a single random point annotation per target for training. 
Specifically, we formulate the panoptic pseudo-mask generation as an Optimal Transport (OT) problem, where each ground-truth (gt) point label and pixel sample are defined as the label supplier and consumer, respectively. The transportation cost is calculated by the introduced task-oriented maps, which focus on the category-wise and instance-wise differences among the various thing and stuff targets. Furthermore, a centroid-based scheme is proposed to set the accurate unit number for each gt point supplier. Hence, the pseudo-mask generation is converted into finding the optimal transport plan at a globally minimal transportation cost, which can be solved via the Sinkhorn-Knopp Iteration. Experimental results on Pascal VOC and COCO demonstrate the promising performance of our proposed Point2Mask approach to point-supervised panoptic segmentation. Source code is available at: https://github.com/LiWentomng/Point2Mask. + + + + RFLA: A Stealthy Reflected Light Adversarial Attack in the Physical World + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_RFLA_A_Stealthy_Reflected_Light_Adversarial_Attack_in_the_Physical_ICCV_2023_paper.pdf + Physical adversarial attacks against deep neural networks (DNNs) have recently gained increasing attention. The current mainstream physical attacks use printed adversarial patches or camouflage to alter the appearance of the target object. However, these approaches generate conspicuous adversarial patterns that show poor stealthiness. Another physically deployable attack is the optical attack, which is stealthy but performs weakly in the daytime under sunlight. In this paper, we propose a novel Reflected Light Attack (RFLA) that is effective and stealthy in both the digital and physical worlds, implemented by placing a colored transparent plastic sheet and a paper cut-out of a specific shape in front of a mirror to create different colored geometries on the target object. To achieve these goals, we devise a general framework based on the circle to model the reflected light on the target object. Specifically, we optimize a circle (composed of a coordinate and radius) to carry various geometrical shapes determined by the optimized angle. The fill color of the geometric shape and its corresponding transparency are also optimized. We extensively evaluate the effectiveness of RFLA on different datasets and models. Experiment results suggest that the proposed method achieves a success rate of over 99% on different datasets and models in the digital world. Additionally, we verify the effectiveness of the proposed method in different physical environments by using sunlight or a flashlight. + + + + Nearest Neighbor Guidance for Out-of-Distribution Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_Nearest_Neighbor_Guidance_for_Out-of-Distribution_Detection_ICCV_2023_paper.pdf + Detecting out-of-distribution (OOD) samples is crucial for machine learning models deployed in open-world environments. Classifier-based scores are a standard approach for OOD detection due to their fine-grained detection capability. However, these scores often suffer from overconfidence issues, misclassifying OOD samples distant from the in-distribution region. To address this challenge, we propose a method called Nearest Neighbor Guidance (NNGuide) that guides the classifier-based score to respect the boundary geometry of the data manifold. 
NNGuide reduces the overconfidence of OOD samples while preserving the fine-grained capability of the classifier-based score. We conduct extensive experiments on ImageNet OOD detection benchmarks under diverse settings, including a scenario where the ID data undergoes natural distribution shift. Our results demonstrate that NNGuide provides a significant performance improvement on the base detection scores, achieving state-of-the-art results on both AUROC, FPR95, and AUPR metrics. + + + + Diffusion-SDF: Conditional Generative Modeling of Signed Distance Functions + http://openaccess.thecvf.com//content/ICCV2023/papers/Chou_Diffusion-SDF_Conditional_Generative_Modeling_of_Signed_Distance_Functions_ICCV_2023_paper.pdf + Probabilistic diffusion models have achieved state-of-the-art results for image synthesis, inpainting, and text-to-image tasks. However, they are still in the early stages of generating complex 3D shapes. This work proposes Diffusion-SDF, a generative model for shape completion, single-view reconstruction, and reconstruction of real-scanned point clouds. We use neural signed distance functions (SDFs) as our 3D representation to parameterize the geometry of various signals (e.g., point clouds, 2D images) through neural networks. Neural SDFs are implicit functions and diffusing them amounts to learning the reversal of their neural network weights, which we solve using a custom modulation module. Extensive experiments show that our method is capable of both realistic unconditional generation and conditional generation from partial inputs. This work expands the domain of diffusion models from learning 2D, explicit representations, to 3D, implicit representations. Code is released at https://github.com/princeton-computational-imaging/Diffusion-SDF. + + + + Open-Vocabulary Object Detection With an Open Corpus + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Open-Vocabulary_Object_Detection_With_an_Open_Corpus_ICCV_2023_paper.pdf + Existing open vocabulary object detection (OVD) works expand the object detector toward open categories by replacing the classifier with the category text embeddings and optimizing the region-text alignment on data of the base categories. However, both the class-agnostic proposal generator and the classifier are biased to the seen classes as demonstrated by the gaps of objectness and accuracy assessment between base and novel classes. In this paper, an open corpus, composed of a set of external object concepts and clustered to several centroids, is introduced to improve the generalization ability in the detector. We propose the generalized objectness assessment (GOAT) in the proposal generator based on the visual-text alignment, where the similarities of visual feature to the cluster centroids are summarized as the objectness. This simple heuristic evaluates objectness with concepts in open corpus and is thus generalized to open categories. We further propose category expanding (CE) with open corpus in two training tasks, which enables the detector to perceive more categories in the feature space and get more reasonable optimization direction. For the classification task, we introduce an open corpus classifier by reconstructing original classifier with similar words in text space. For the image-caption alignment task, the open corpus centroids are incorporated to enlarge the negative samples in the contrastive loss. 
Extensive experiments demonstrate the effectiveness of GOAT and CE, which greatly improve the performance on novel classes and get new state-of-the-art on the OVD benchmarks. + + + + Spectrum-guided Multi-granularity Referring Video Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Miao_Spectrum-guided_Multi-granularity_Referring_Video_Object_Segmentation_ICCV_2023_paper.pdf + Current referring video object segmentation (R-VOS) techniques extract conditional kernels from encoded (low-resolution) vision-language features to segment the decoded high-resolution features. We discovered that this causes significant feature drift, which the segmentation kernels struggle to perceive during the forward computation. This negatively affects the ability of segmentation kernels. To address the drift problem, we propose a Spectrum-guided Multi-granularity (SgMg) approach, which performs direct segmentation on the encoded features and employs visual details to further optimize the masks. In addition, we propose Spectrum-guided Cross-modal Fusion (SCF) to perform intra-frame global interactions in the spectral domain for effective multimodal representation. Finally, we extend SgMg to perform multi-object R-VOS, a new paradigm that enables simultaneous segmentation of multiple referred objects in a video. This not only makes R-VOS faster, but also more practical. Extensive experiments show that SgMg achieves state-of-the-art performance on four video benchmark datasets, outperforming the nearest competitor by 2.8% points on Ref-YouTube-VOS. Our extended SgMg enables multi-object R-VOS, runs about 3 times faster while maintaining satisfactory performance. Code is available at https://github.com/bo-miao/SgMg. + + + + Sound Source Localization is All about Cross-Modal Alignment + http://openaccess.thecvf.com//content/ICCV2023/papers/Senocak_Sound_Source_Localization_is_All_about_Cross-Modal_Alignment_ICCV_2023_paper.pdf + Humans can easily perceive the direction of sound sources in a visual scene, termed sound source localization. Recent studies on learning-based sound source localization have mainly explored the problem from a localization perspective. However, prior arts and existing benchmarks do not account for a more important aspect of the problem, cross-modal semantic understanding, which is essential for genuine sound source localization. Cross-modal semantic understanding is important in understanding semantically mismatched audio-visual events, e.g., silent objects, or off-screen sounds. To account for this, we propose a cross-modal alignment task as a joint task with sound source localization to better learn the interaction between audio and visual modalities. Thereby, we achieve high localization performance with strong cross-modal semantic understanding. Our method outperforms the state-of-the-art approaches in both sound source localization and cross-modal retrieval. Our work suggests that jointly tackling both tasks is necessary to conquer genuine sound source localization. + + + + BlendFace: Re-designing Identity Encoders for Face-Swapping + http://openaccess.thecvf.com//content/ICCV2023/papers/Shiohara_BlendFace_Re-designing_Identity_Encoders_for_Face-Swapping_ICCV_2023_paper.pdf + The great advancements of generative adversarial networks and face recognition models in computer vision have made it possible to swap identities on images from single sources. 
Although many studies seem to have proposed almost satisfactory solutions, we notice that previous methods still suffer from identity-attribute entanglement, which causes undesired attribute swapping, because widely used identity encoders, e.g., ArcFace, have some crucial attribute biases owing to their pretraining on face recognition tasks. To address this issue, we design BlendFace, a novel identity encoder for face-swapping. The key idea behind BlendFace is that training face recognition models on blended images, whose attributes are replaced with those of another person, mitigates inter-personal biases such as hairstyles and head shapes. BlendFace feeds disentangled identity features into generators and properly guides them as an identity loss function. Extensive experiments demonstrate that BlendFace improves the identity-attribute disentanglement in face-swapping models, maintaining a comparable quantitative performance to previous methods. + + + + Test-time Personalizable Forecasting of 3D Human Poses + http://openaccess.thecvf.com//content/ICCV2023/papers/Cui_Test-time_Personalizable_Forecasting_of_3D_Human_Poses_ICCV_2023_paper.pdf + Current motion forecasting approaches typically train a deep end-to-end model from the source domain data, and then apply it directly to target subjects. Despite promising results, they remain non-optimal because, due to privacy considerations, the test person and his/her natural properties (e.g., stature, behavioral traits) are typically unseen/absent in training. In this case, the source pre-trained model has a low ability to adapt to these out-of-source characteristics, resulting in an unreliable prediction. To tackle this issue, we propose a novel helper-predictor test-time personalization approach (H/P-TTP), which allows for a generalizable representation of out-of-source subjects to gain more realistic predictions. Concretely, the helper is preceded by explicit and implicit augmenters, where the former yields noisy sequences to improve robustness, while the latter generates novel-domain data with an adversarial learning paradigm. Domain-generalizable learning is then achieved, where the helper extracts cross-subject invariant knowledge to update the predictor. At test time, given a new person, the predictor can be further optimized to provide personalized capabilities for the subject's specific properties. Under several benchmarks, extensive experiments show that with H/P-TTP, the existing predictive models are significantly improved for various unseen subjects. + + + + DreamBooth3D: Subject-Driven Text-to-3D Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Raj_DreamBooth3D_Subject-Driven_Text-to-3D_Generation_ICCV_2023_paper.pdf + We present DreamBooth3D, an approach to personalize text-to-3D generative models from as few as 3-6 casually captured images of a subject. Our approach combines recent advances in personalizing text-to-image models (DreamBooth) with text-to-3D generation (DreamFusion). We find that naively combining these methods fails to yield satisfactory subject-specific 3D assets due to personalized text-to-image models overfitting to the input viewpoints of the subject. We overcome this through a 3-stage optimization strategy where we jointly leverage the 3D consistency of neural radiance fields together with the personalization capability of text-to-image models. 
Our method can produce high-quality, subject-specific 3D assets with text-driven modifications such as novel poses, colors and attributes that are not seen in any of the input images of the subject. + + + + Dynamic Snake Convolution Based on Topological Geometric Constraints for Tubular Structure Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Qi_Dynamic_Snake_Convolution_Based_on_Topological_Geometric_Constraints_for_Tubular_ICCV_2023_paper.pdf + Accurate segmentation of topological tubular structures, such as blood vessels and roads, is crucial in various fields, ensuring accuracy and efficiency in downstream tasks. However, many factors complicate the task, including thin local structures and variable global morphologies. In this work, we note the specificity of tubular structures and use this knowledge to guide our DSCNet to simultaneously enhance perception in three stages: feature extraction, feature fusion, and loss constraint. First, we propose a dynamic snake convolution to accurately capture the features of tubular structures by adaptively focusing on slender and tortuous local structures. Subsequently, we propose a multi-view feature fusion strategy to complement the attention to features from multiple perspectives during feature fusion, ensuring the retention of important information from different global morphologies. Finally, a continuity constraint loss function, based on persistent homology, is proposed to constrain the topological continuity of the segmentation better. Experiments on 2D and 3D datasets show that our DSCNet provides better accuracy and continuity on the tubular structure segmentation task compared with several methods. + + + + Learning to Upsample by Learning to Sample + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Learning_to_Upsample_by_Learning_to_Sample_ICCV_2023_paper.pdf + We present DySample, an ultra-lightweight and effective dynamic upsampler. While impressive performance gains have been witnessed from recent kernel-based dynamic upsamplers such as CARAFE, FADE, and SAPA, they introduce much workload, mostly due to the time-consuming dynamic convolution and the additional sub-network used to generate dynamic kernels. Further, the need for high-res feature guidance of FADE and SAPA somehow limits their application scenarios. To address these concerns, we bypass dynamic convolution and formulate upsampling from the perspective of point sampling, which is more resource-efficient and can be easily implemented with the standard built-in function in PyTorch. We first showcase a naive design, and then demonstrate how to strengthen its upsampling behavior step by step towards our new upsampler, DySample. Compared with former kernel-based dynamic upsamplers, DySample requires no customized CUDA package and has much fewer parameters, FLOPs, GPU memory, and latency. Besides the light-weight characteristics, DySample outperforms other upsamplers across five dense prediction tasks, including semantic segmentation, object detection, instance segmentation, panoptic segmentation, and monocular depth estimation. Code is available at https://github.com/tiny-smart/dysample. + + + + LayoutDiffusion: Improving Graphic Layout Generation by Discrete Diffusion Probabilistic Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_LayoutDiffusion_Improving_Graphic_Layout_Generation_by_Discrete_Diffusion_Probabilistic_Models_ICCV_2023_paper.pdf + Creating graphic layouts is a fundamental step in graphic designs. 
In this work, we present a novel generative model named LayoutDiffusion for automatic layout generation. As layout is typically represented as a sequence of discrete tokens, LayoutDiffusion models layout generation as a discrete denoising diffusion process. It learns to reverse a mild forward process, in which layouts become increasingly chaotic with the growth of forward steps and layouts in the neighboring steps do not differ too much. Designing such a mild forward process is however very challenging as layout has both categorical attributes and ordinal attributes. To tackle the challenge, we summarize three critical factors for achieving a mild forward process for the layout, i.e., legality, coordinate proximity and type disruption. Based on the factors, we propose a block-wise transition matrix coupled with a piece-wise linear noise schedule. Experiments on RICO and PubLayNet datasets show that LayoutDiffusion outperforms state-of-the-art approaches significantly. Moreover, it enables two conditional layout generation tasks in a plug-and-play manner without re-training and achieves better performance than existing methods. Project page: https://layoutdiffusion.github.io. + + + + Efficiently Robustify Pre-Trained Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Jain_Efficiently_Robustify_Pre-Trained_Models_ICCV_2023_paper.pdf + A recent trend in deep learning has been towards training large-scale models with high parameter counts on big datasets. However, the robustness of such large-scale models in real-world settings is still a less-explored topic. In this work, we first benchmark the performance of these models under different perturbations and datasets representing real-world shifts, and highlight their degraded performance under these shifts. We then discuss how existing robustification schemes based on complete model fine-tuning might not be a scalable option for very large networks and can also cause them to forget some of their desired characteristics. Finally, we propose a simple and cost-effective method to solve this problem, inspired by the knowledge transfer literature. It involves robustifying smaller models, at a lower computation cost, and then using them as teachers to tune a fraction of these large-scale networks, reducing the overall computational overhead. We evaluate our proposed method under various vision perturbations, including the ImageNet-C, R, S, and A datasets, as well as for transfer learning and zero-shot evaluation setups on different datasets. Benchmark results show that our method efficiently induces robustness in these large-scale models, requires significantly less time, and also preserves the transfer learning and zero-shot properties of the original model, which none of the existing methods achieve. + + + + XMem++: Production-level Video Segmentation From Few Annotated Frames + http://openaccess.thecvf.com//content/ICCV2023/papers/Bekuzarov_XMem_Production-level_Video_Segmentation_From_Few_Annotated_Frames_ICCV_2023_paper.pdf + Despite advancements in user-guided video segmentation, extracting complex objects consistently for highly complex scenes is still a labor-intensive task, especially for production. It is not uncommon that a majority of frames need to be annotated. We introduce a novel semi-supervised video object segmentation (SSVOS) model, XMem++, that improves existing memory-based models, with a permanent memory module. 
Most existing methods focus on single frame annotations, while our approach can effectively handle multiple user-selected frames with varying appearances of the same object or region. Our method can extract highly consistent results while keeping the required number of frame annotations low. We further introduce an iterative and attention-based frame suggestion mechanism, which computes the next best frame for annotation. Our method is real-time and does not require retraining after each user input. We also introduce a new dataset, PUMaVOS, which covers new challenging use cases not found in previous benchmarks. We demonstrate SOTA performance on challenging (partial and multi-class) segmentation scenarios as well as long videos, while ensuring significantly fewer frame annotations than any existing method. Project page: https://max810.github.io/xmem2-project-page/ + + + + End-to-End Diffusion Latent Optimization Improves Classifier Guidance + http://openaccess.thecvf.com//content/ICCV2023/papers/Wallace_End-to-End_Diffusion_Latent_Optimization_Improves_Classifier_Guidance_ICCV_2023_paper.pdf + Classifier guidance---using the gradients of an image classifier to steer the generations of a diffusion model---has the potential to dramatically expand the creative control over image generation and editing. However, currently classifier guidance requires either training new noise-aware models to obtain accurate gradients or using a one-step denoising approximation of the final generation, which leads to misaligned gradients and sub-optimal control. We highlight this approximation's shortcomings and propose a novel guidance method: Direct Optimization of Diffusion Latents (DOODL), which enables plug-and-play guidance by optimizing diffusion latents w.r.t. the gradients of a pre-trained classifier on the true generated pixels, using an invertible diffusion process to achieve memory-efficient backpropagation. Showcasing the potential of more precise guidance, DOODL outperforms one-step classifier guidance on computational and human evaluation metrics across different forms of guidance: using CLIP guidance to improve generations of complex prompts from DrawBench, using fine-grained visual classifiers to expand the vocabulary of Stable Diffusion, enabling image-conditioned generation with a CLIP visual encoder, and improving image aesthetics using an aesthetic scoring network. + + + + TRM-UAP: Enhancing the Transferability of Data-Free Universal Adversarial Perturbation via Truncated Ratio Maximization + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_TRM-UAP_Enhancing_the_Transferability_of_Data-Free_Universal_Adversarial_Perturbation_via_ICCV_2023_paper.pdf + Aiming at crafting a single universal adversarial perturbation (UAP) to fool CNN models for various data samples, universal attack enables a more efficient and accurate evaluation for the robustness of CNN models. Early universal attacks craft UAPs depending on data priors. For more practical applications, the data-free universal attacks that make UAPs from random noises have aroused much attention recently. However, existing data-free UAP methods perturb all the CNN feature layers equally via the maximization of the CNN activation, leading to poor transferability. In this paper, we propose a novel data-free universal attack without depending on any real data samples through truncated ratio maximization, which we term as TRM-UAP. 
Specifically, different from the maximization of the positive activation in convolution layers, we propose to optimize the UAP generation from the ratio of positive and negative activations. To further enhance the transferability of universal attack, TRM-UAP not only performs the ratio maximization merely on low-level generic features via the truncation strategy, but also incorporates a curriculum optimization algorithm that can effectively learn the diversity of artificial images. Extensive experiments on the ImageNet dataset verify that TRM-UAP achieves a state-of-the-art average fooling rate and excellent transferability on different CNN models as compared to other data-free UAP methods. Code is available at https://github.com/RandolphCarter0/TRMUAP. + + + + Scratching Visual Transformer's Back with Uniform Attention + http://openaccess.thecvf.com//content/ICCV2023/papers/Hyeon-Woo_Scratching_Visual_Transformers_Back_with_Uniform_Attention_ICCV_2023_paper.pdf + The favorable performance of Vision Transformers (ViTs) is often attributed to the multi-head self-attention (MSA), which enables global interactions at each layer of a ViT model. Previous works acknowledge the property of long-range dependency for the effectiveness of MSA. In this work, we study the role of MSA in terms of a different axis: density. Our preliminary analyses suggest that the spatial interactions of learned attention maps are close to dense interactions rather than sparse ones. This is a curious phenomenon because dense attention maps are harder for the model to learn due to softmax. We interpret this opposite behavior against softmax as a strong preference of the ViT models to include dense interactions. We thus manually insert dense uniform attention into each layer of the ViT models to supply the much-needed dense interactions. We call this method Context Broadcasting (CB). Our study demonstrates that the inclusion of CB takes over the role of dense attention and thereby reduces the degree of density in the original attention maps while complying with the softmax in MSA. We also show that, with the negligible cost of CB (one line in your model code and no additional parameters), both the capacity and generalizability of the ViT models are increased. + + + + Tune-A-Video: One-Shot Tuning of Image Diffusion Models for Text-to-Video Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Tune-A-Video_One-Shot_Tuning_of_Image_Diffusion_Models_for_Text-to-Video_Generation_ICCV_2023_paper.pdf + To replicate the success of text-to-image (T2I) generation, recent works employ large-scale video datasets to train a text-to-video (T2V) generator. Despite their promising results, such a paradigm is computationally expensive. In this work, we propose a new T2V generation setting--One-Shot Video Tuning, where only one text-video pair is presented. Our model is built on state-of-the-art T2I diffusion models pre-trained on massive image data. We make two key observations: 1) T2I models can generate still images that represent verb terms; 2) extending T2I models to generate multiple images concurrently exhibits surprisingly good content consistency. To further learn continuous motion, we introduce Tune-A-Video, which involves a tailored spatio-temporal attention mechanism and an efficient one-shot tuning strategy. At inference, we employ DDIM inversion to provide structure guidance for sampling. Extensive qualitative and numerical experiments demonstrate the remarkable ability of our method across various applications. 
+ + + + Anchor-Intermediate Detector: Decoupling and Coupling Bounding Boxes for Accurate Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Lv_Anchor-Intermediate_Detector_Decoupling_and_Coupling_Bounding_Boxes_for_Accurate_Object_ICCV_2023_paper.pdf + Anchor-based detectors have been continuously developed for object detection. However, the individual anchor box makes it difficult to predict the boundary's offset accurately. Instead of taking each bounding box as a closed individual, we consider using multiple boxes together to get prediction boxes. To this end, this paper proposes the Box Decouple-Couple(BDC) strategy in the inference, which no longer discards the overlapping boxes, but decouples the corner points of these boxes. Then, according to each corner's score, we couple the corner points to select the most accurate corner pairs. To meet the BDC strategy, a simple but novel model is designed named the Anchor-Intermediate Detector(AID), which contains two head networks, i.e., an anchor-based head and an anchor-free Corner-aware head. The corner-aware head is able to score the corners of each bounding box to facilitate the coupling between corner points. Extensive experiments on MS COCO show that the proposed anchor-intermediate detector respectively outperforms their baseline RetinaNet and GFL method by 2.4 and 1.2 AP on the MS COCO test-dev dataset without any bells and whistles. + + + + Extensible and Efficient Proxy for Neural Architecture Search + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Extensible_and_Efficient_Proxy_for_Neural_Architecture_Search_ICCV_2023_paper.pdf + Efficient or near-zero-cost proxies were proposed recently to address the demanding computational issues of Neural Architecture Search (NAS) in designing deep neural networks (DNNs), where each candidate architecture network only requires one iteration of backpropagation. The values obtained from proxies are used as predictions of architecture performance for downstream tasks. However, two significant drawbacks hinder the wide adoption of these efficient proxies: 1. they are not adaptive to various NAS search spaces and 2. they are not extensible to multi-modality downstream tasks. To address these two issues, we first propose an Extensible proxy (Eproxy) that utilizes self-supervised, few-shot training to achieve near-zero costs. A key component to our Eproxy's efficiency is the introduction of a barrier layer with randomly initialized frozen convolution parameters, which adds non-linearities to the optimization spaces so that Eproxy can discriminate the performance of architectures at an early stage. We further propose a Discrete Proxy Search (DPS) method to find the optimized training settings for Eproxy with only a handful of benchmarked architectures on the target tasks. Our extensive experiments confirm the effectiveness of both Eproxy and DPS. On the NDS-ImageNet search spaces, Eproxy+DPS achieves a higher average ranking correlation (Spearman r = 0.73) than the previous efficient proxy (Spearman r = 0.56). On the NAS-Bench-Trans-Micro search spaces with seven tasks, Eproxy+DPS delivers comparable performance with the early stopping method (146x faster). For the end-to-end task such as DARTS-ImageNet-1k, our method delivers better results than NAS performed on CIFAR-10 while only requiring one GPU hour with a single batch of CIFAR-10 images. 
+ + + + MAAL: Multimodality-Aware Autoencoder-Based Affordance Learning for 3D Articulated Objects + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_MAAL_Multimodality-Aware_Autoencoder-Based_Affordance_Learning_for_3D_Articulated_Objects_ICCV_2023_paper.pdf + Inferring affordance for 3D articulated objects is a challenging and practical problem. It is a primary problem for applying robots to real-world scenarios. The exploration can be summarized as figuring out where to act and how to act. Correspondingly, the task mainly requires producing actionability scores, action proposals, and success likelihood scores according to the given 3D object information and robotic information. Current works usually directly process multi-modal inputs with early fusion and apply critic networks to produce scores, which leads to insufficient multi-modal learning ability and inefficiently iterative training in multiple stages. This paper proposes a novel Multimodality-Aware Autoencoder-based affordance Learning (MAAL) for the 3D object affordance problem. It is an efficient pipeline, trained in one go, and only requires a few positive samples in training data. More importantly, MAAL contains a MultiModal Energized Encoder (MME) for better multi-modal learning. It comprehensively models all multi-modal inputs from 3D objects and robotic actions. Jointly considering information from multiple modalities, the encoder further learns interactions between robots and objects. MME empowers the better multi-modal learning ability for understanding object affordance. Experimental results and visualizations, based on a large-scale dataset PartNet-Mobility, show the effectiveness of MAAL in learning multi-modal data and solving the 3D articulated object affordance problem. + + + + Benchmarking and Analyzing Robust Point Cloud Recognition: Bag of Tricks for Defending Adversarial Examples + http://openaccess.thecvf.com//content/ICCV2023/papers/Ji_Benchmarking_and_Analyzing_Robust_Point_Cloud_Recognition_Bag_of_Tricks_ICCV_2023_paper.pdf + Deep Neural Networks (DNNs) for 3D point cloud recognition are vulnerable to adversarial examples, threatening their practical deployment. Despite the many research endeavors have been made to tackle this issue in recent years, the diversity of adversarial examples on 3D point clouds makes them more challenging to defend against than those on 2D images. For examples, attackers can generate adversarial examples by adding, shifting, or removing points. Consequently, existing defense strategies are hard to counter unseen point cloud adversarial examples. In this paper, we first establish a comprehensive, and rigorous point cloud adversarial robustness benchmark to evaluate adversarial robustness, which can provide a detailed understanding of the effects of the defense and attack methods. We then collect existing defense tricks in point cloud adversarial defenses and then perform extensive and systematic experiments to identify an effective combination of these tricks. Furthermore, we propose a hybrid training augmentation methods that consider various types of point cloud adversarial examples to adversarial training, significantly improving the adversarial robustness. By combining these tricks, we construct a more robust defense framework achieving an average accuracy of 83.45% against various attacks, demonstrating its capability to enabling robust learners. 
Our codebase is open-sourced at https://github.com/qiufan319/benchmark_pc_attack.git + + + + Poincare ResNet + http://openaccess.thecvf.com//content/ICCV2023/papers/van_Spengler_Poincare_ResNet_ICCV_2023_paper.pdf + This paper introduces an end-to-end residual network that operates entirely on the Poincare ball model of hyperbolic space. Hyperbolic learning has recently shown great potential for visual understanding, but is currently only performed in the penultimate layer(s) of deep networks. All visual representations are still learned through standard Euclidean networks. In this paper we investigate how to learn hyperbolic representations of visual data directly from the pixel level. We propose Poincare ResNet, a hyperbolic counterpart of the celebrated residual network, starting from Poincare 2D convolutions up to Poincare residual connections. We identify three roadblocks for training convolutional networks entirely in hyperbolic space and propose a solution for each: (i) Current hyperbolic network initializations collapse to the origin, limiting their applicability in deeper networks. We provide an identity-based initialization that preserves norms over many layers. (ii) Residual networks rely heavily on batch normalization, which comes with expensive Frechet mean calculations in hyperbolic space. We introduce Poincare midpoint batch normalization as a faster and equally effective alternative. (iii) Due to the many intermediate operations in Poincare layers, the computation graphs of deep learning libraries blow up, limiting our ability to train deep hyperbolic networks. We provide manual backward derivations of core hyperbolic operations to maintain manageable computation graphs. + + + + Subclass-balancing Contrastive Learning for Long-tailed Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Hou_Subclass-balancing_Contrastive_Learning_for_Long-tailed_Recognition_ICCV_2023_paper.pdf + Long-tailed recognition with imbalanced class distribution naturally emerges in practical machine learning applications. Existing methods such as data reweighting, resampling, and supervised contrastive learning enforce class balance at the price of introducing imbalance between instances of head classes and tail classes, which may ignore the underlying rich semantic substructures of the former and exaggerate the biases in the latter. We overcome these drawbacks with a novel "subclass-balancing contrastive learning (SBCL)" approach that clusters each head class into multiple subclasses of sizes similar to the tail classes and enforces representations to capture the two-layer class hierarchy between the original classes and their subclasses. Since the clustering is conducted in the representation space and updated during the course of training, the subclass labels preserve the semantic substructures of head classes. Meanwhile, it does not overemphasize tail class samples, so each individual instance contributes to the representation learning equally. Hence, our method achieves both instance- and subclass-balance, while the original class labels are also learned through contrastive learning among subclasses from different classes. We evaluate SBCL on a list of long-tailed benchmark datasets, and it achieves state-of-the-art performance. In addition, we present extensive analyses and ablation studies of SBCL to verify its advantages. 

+ + + + Dynamic Mesh-Aware Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Qiao_Dynamic_Mesh-Aware_Radiance_Fields_ICCV_2023_paper.pdf + Embedding polygonal mesh assets within photorealistic Neural Radiance Fields (NeRF) volumes, such that they can be rendered and their dynamics simulated in a physically consistent manner with the NeRF, is under-explored from the system perspective of integrating NeRF into the traditional graphics pipeline. This paper designs a two-way coupling between mesh and NeRF during rendering and simulation. We first review the light transport equations for both mesh and NeRF, then distill them into an efficient algorithm for updating radiance and throughput along a cast ray with an arbitrary number of bounces. To resolve the discrepancy between the linear color space that the path tracer assumes and the sRGB color space that standard NeRF uses, we train NeRF with High Dynamic Range (HDR) images. We also present a strategy to estimate light sources and cast shadows on the NeRF. Finally, we consider how the hybrid surface-volumetric formulation can be efficiently integrated with a high-performance physics simulator that supports cloth, rigid and soft bodies. The full rendering and simulation system can be run on a GPU at interactive rates. We show that a hybrid system approach outperforms alternatives in visual realism for mesh insertion, because it allows realistic light transport from volumetric NeRF media onto surfaces, which affects the appearance of reflective/refractive surfaces and illumination of diffuse surfaces informed by the dynamic scene. + + + + Learning Support and Trivial Prototypes for Interpretable Image Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Learning_Support_and_Trivial_Prototypes_for_Interpretable_Image_Classification_ICCV_2023_paper.pdf + Prototypical part network (ProtoPNet) methods have been designed to achieve interpretable classification by associating predictions with a set of training prototypes, which we refer to as trivial prototypes because they are trained to lie far from the classification boundary in the feature space. Note that it is possible to make an analogy between ProtoPNet and support vector machine (SVM) given that the classification from both methods relies on computing similarity with a set of training points (i.e., trivial prototypes in ProtoPNet, and support vectors in SVM). However, while trivial prototypes are located far from the classification boundary, support vectors are located close to this boundary, and we argue that this discrepancy with the well-established SVM theory can result in ProtoPNet models with inferior classification accuracy. In this paper, we aim to improve the classification of ProtoPNet with a new method to learn support prototypes that lie near the classification boundary in the feature space, as suggested by the SVM theory. In addition, we target the improvement of classification results with a new model, named ST-ProtoPNet, which exploits our support prototypes and the trivial prototypes to provide more effective classification. Experimental results on CUB-200-2011, Stanford Cars, and Stanford Dogs datasets demonstrate that ST-ProtoPNet achieves state-of-the-art classification accuracy and interpretability results. We also show that the proposed support prototypes tend to be better localised in the object of interest rather than in the background region. Code is available at https://github.com/cwangrun/ST-ProtoPNet. 

+ + + + Decoupled DETR: Spatially Disentangling Localization and Classification for Improved End-to-End Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Decoupled_DETR_Spatially_Disentangling_Localization_and_Classification_for_Improved_End-to-End_ICCV_2023_paper.pdf + The introduction of DETR represents a new paradigm for object detection. However, its decoder conducts classification and box localization using shared queries and cross-attention layers, leading to suboptimal results. We observe that different regions of interest in the visual feature map are suitable for performing query classification and box localization tasks, even for the same object. Salient regions provide vital information for classification, while the boundaries around them are more favorable for box regression. Unfortunately, such spatial misalignment between these two tasks greatly hinders DETR's training. Therefore, in this work, we focus on decoupling localization and classification tasks in DETR. To achieve this, we introduce a new design scheme called spatially decoupled DETR (SD-DETR), which includes a task-aware query generation module and a disentangled feature learning process. We elaborately design the task-aware query initialization process and divide the cross-attention block in the decoder to allow the task-aware queries to match different visual regions. Meanwhile, we also observe that the prediction misalignment problem for high classification confidence and precise localization exists, so we propose an alignment loss to further guide the spatially decoupled DETR training. Through extensive experiments, we demonstrate that our approach achieves a significant improvement in MSCOCO datasets compared to previous work. For instance, we improve the performance of Conditional DETR by 4.5%. By spatially disentangling the two tasks, our method overcomes the misalignment problem and greatly improves the performance of DETR for object detection. + + + + GIFD: A Generative Gradient Inversion Method with Feature Domain Optimization + http://openaccess.thecvf.com//content/ICCV2023/papers/Fang_GIFD_A_Generative_Gradient_Inversion_Method_with_Feature_Domain_Optimization_ICCV_2023_paper.pdf + Federated Learning (FL) has recently emerged as a promising distributed machine learning framework to preserve clients' privacy, by allowing multiple clients to upload the gradients calculated from their local data to a central server. Recent studies find that the exchanged gradients also take the risk of privacy leakage, e.g., an attacker can invert the shared gradients and recover sensitive data against an FL system by leveraging pre-trained generative adversarial networks (GAN) as prior knowledge. However, performing gradient inversion attacks in the latent space of the GAN model limits their expression ability and generalizability. To tackle these challenges, we propose Gradient Inversion over Feature Domains (GIFD), which disassembles the GAN model and searches the feature domains of the intermediate layers. Instead of optimizing only over the initial latent code, we progressively change the optimized layer, from the initial latent space to intermediate layers closer to the output images. In addition, we design a regularizer to avoid unreal image generation by adding a small l1 ball constraint to the searching range. We also extend GIFD to the out-of-distribution (OOD) setting, which weakens the assumption that the training sets of GANs and FL tasks obey the same data distribution. 
Extensive experiments demonstrate that our method can achieve pixel-level reconstruction and is superior to the existing methods. Notably, GIFD also shows great generalizability under different defense strategy settings and batch sizes. + + + + Generalized Sum Pooling for Metric Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Gurbuz_Generalized_Sum_Pooling_for_Metric_Learning_ICCV_2023_paper.pdf + A common architectural choice for deep metric learning is a convolutional neural network followed by global average pooling (GAP). Albeit simple, GAP is a highly effective way to aggregate information. One possible explanation for the effectiveness of GAP is considering each feature vector as representing a different semantic entity and GAP as a convex combination of them. Following this perspective, we generalize GAP and propose a learnable generalized sum pooling method (GSP). GSP improves GAP with two distinct abilities: i) the ability to choose a subset of semantic entities, effectively learning to ignore nuisance information, and ii) learning the weights corresponding to the importance of each entity. Formally, we propose an entropy-smoothed optimal transport problem and show that it is a strict generalization of GAP, i.e., a specific realization of the problem gives back GAP. We show that this optimization problem enjoys analytical gradients enabling us to use it as a direct learnable replacement for GAP. We further propose a zero-shot loss to ease the learning of GSP. We show the effectiveness of our method with extensive evaluations on 4 popular metric learning benchmarks. Code is available at: GSP-DML Framework + + + + AlignDet: Aligning Pre-training and Fine-tuning in Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_AlignDet_Aligning_Pre-training_and_Fine-tuning_in_Object_Detection_ICCV_2023_paper.pdf + The paradigm of large-scale pre-training followed by downstream fine-tuning has been widely employed in various object detection algorithms. In this paper, we reveal discrepancies in data, model, and task between the pre-training and fine-tuning procedure in existing practices, which implicitly limit the detector's performance, generalization ability, and convergence speed. To this end, we propose AlignDet, a unified pre-training framework that can be adapted to various existing detectors to alleviate the discrepancies. AlignDet decouples the pre-training process into two stages, i.e., image-domain and box-domain pre-training. The image-domain pre-training optimizes the detection backbone to capture holistic visual abstraction, and box-domain pre-training learns instance-level semantics and task-aware concepts to initialize the parts out of the backbone. By incorporating the self-supervised pre-trained backbones, we can pre-train all modules for various detectors in an unsupervised paradigm. As depicted in Figure 1, extensive experiments demonstrate that AlignDet can achieve significant improvements across diverse protocols, such as detection algorithm, model backbone, data setting, and training schedule. For example, AlignDet improves FCOS by 5.3 mAP, RetinaNet by 2.1 mAP, Faster R-CNN by 3.3 mAP, and DETR by 2.3 mAP under fewer epochs. 
+ + + + Dense Text-to-Image Generation with Attention Modulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Dense_Text-to-Image_Generation_with_Attention_Modulation_ICCV_2023_paper.pdf + Existing text-to-image diffusion models struggle to synthesize realistic images given dense captions, where each text prompt provides a detailed description for a specific image region. To address this, we propose DenseDiffusion, a training-free method that adapts a pre-trained text-to-image model to handle such dense captions while offering control over the scene layout. We first analyze the relationship between generated images' layouts and the pre-trained model's intermediate attention maps. Next, we develop an attention modulation method that guides objects to appear in specific regions according to layout guidance. Without requiring additional fine-tuning or datasets, we improve image generation performance given dense captions regarding both automatic and human evaluation scores. In addition, we achieve similar-quality visual results with models specifically trained with layout conditions. + + + + Sentence Attention Blocks for Answer Grounding + http://openaccess.thecvf.com//content/ICCV2023/papers/Khoshsirat_Sentence_Attention_Blocks_for_Answer_Grounding_ICCV_2023_paper.pdf + Answer grounding is the task of locating relevant visual evidence for the Visual Question Answering task. While a wide variety of attention methods have been introduced for this task, they suffer from the following three problems: designs that do not allow the usage of pre-trained networks and do not benefit from large data pre-training, custom designs that are not based on well-grounded previous designs, therefore limiting the learning power of the network, or complicated designs that make it challenging to re-implement or improve them. In this paper, we propose a novel architectural block, which we term Sentence Attention Block, to solve these problems. The proposed block re-calibrates channel-wise image feature-maps by explicitly modeling inter-dependencies between the image feature-maps and sentence embedding. We visually demonstrate how this block filters out irrelevant feature-maps channels based on sentence embedding. We start our design with a well-known attention method, and by making minor modifications, we improve the results to achieve state-of-the-art accuracy. The flexibility of our method makes it easy to use different pre-trained backbone networks, and its simplicity makes it easy to understand and be re-implemented. We demonstrate the effectiveness of our method on the TextVQA-X, VQS, VQA-X, and VizWiz-VQA-Grounding datasets. We perform multiple ablation studies to show the effectiveness of our design choices. + + + + Towards Fairness-aware Adversarial Network Pruning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Towards_Fairness-aware_Adversarial_Network_Pruning_ICCV_2023_paper.pdf + Network pruning aims to compress models while minimizing loss in accuracy. With the increasing focus on bias in AI systems, the bias inheriting or even magnification nature of traditional network pruning methods has raised a new perspective towards fairness-aware network pruning. Straightforward pruning plus debias methods and recent designs for monitoring disparities of demographic attributes during pruning have endeavored to enhance fairness in pruning. 
However, neither a simple assembly of the two tasks nor specifically designed pruning strategies can achieve the optimal trade-off among pruning ratio, accuracy, and fairness. This paper proposes an end-to-end learnable framework for fairness-aware network pruning, which jointly optimizes the pruning and debiasing tasks by adversarial training against the final evaluation metrics, i.e., accuracy for pruning, and disparate impact (DI) and equalized odds (DEO) for fairness. In other words, our fairness-aware adversarial pruning method learns to prune without any handcrafted rules. Therefore, our approach can flexibly adapt to various network structures. Exhaustive experimentation demonstrates the generalization capacity of our approach, as well as superior performance on pruning and debiasing simultaneously. Notably, the proposed method preserves SOTA pruning performance while significantly improving fairness by around 50% compared to traditional pruning methods. + + + + Breaking Temporal Consistency: Generating Video Universal Adversarial Perturbations Using Image Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Breaking_Temporal_Consistency_Generating_Video_Universal_Adversarial_Perturbations_Using_Image_ICCV_2023_paper.pdf + As video analysis using deep learning models becomes more widespread, the vulnerability of such models to adversarial attacks is becoming a pressing concern. In particular, Universal Adversarial Perturbation (UAP) poses a significant threat, as a single perturbation can mislead deep learning models on entire datasets. We propose a novel video UAP using image data and image models. This enables us to take advantage of the rich image data and image model-based studies available for video applications. However, image models are limited in their ability to analyze the temporal aspects of videos, which is crucial for a successful video attack. To address this challenge, we introduce the Breaking Temporal Consistency (BTC) method, which is the first attempt to incorporate temporal information into video attacks using image models. We aim to generate adversarial videos that have opposite patterns to the original. Specifically, BTC-UAP minimizes the feature similarity between neighboring frames in videos. Our approach is simple but effective at attacking unseen video models. Additionally, it is applicable to videos of varying lengths and invariant to temporal shifts. Our approach surpasses existing methods in terms of effectiveness on various datasets, including ImageNet, UCF-101, and Kinetics-400. + + + + Smoothness Similarity Regularization for Few-Shot GAN Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Sushko_Smoothness_Similarity_Regularization_for_Few-Shot_GAN_Adaptation_ICCV_2023_paper.pdf + The task of few-shot GAN adaptation aims to adapt a pre-trained GAN model to a small dataset with very few training images. While existing methods perform well when the dataset for pre-training is structurally similar to the target dataset, the approaches suffer from training instabilities or memorization issues when the objects in the two domains have a very different structure. To mitigate this limitation, we propose a new smoothness similarity regularization that transfers the inherently learned smoothness of the pre-trained GAN to the few-shot target domain even if the two domains are very different. 

We evaluate our approach by adapting an unconditional and a class-conditional GAN to diverse few-shot target domains. Our proposed method significantly outperforms prior few-shot GAN adaptation methods in the challenging case of structurally dissimilar source-target domains, while performing on par with the state of the art for similar source-target domains. + + + + Distilling Coarse-to-Fine Semantic Matching Knowledge for Weakly Supervised 3D Visual Grounding + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Distilling_Coarse-to-Fine_Semantic_Matching_Knowledge_for_Weakly_Supervised_3D_Visual_ICCV_2023_paper.pdf + 3D visual grounding involves finding a target object in a 3D scene that corresponds to a given sentence query. Although many approaches have been proposed and achieved impressive performance, they all require dense object-sentence pair annotations in 3D point clouds, which are both time-consuming and expensive. To address the problem that fine-grained annotated data is difficult to obtain, we propose to leverage weakly supervised annotations to learn the 3D visual grounding model, i.e., only coarse scene-sentence correspondences are used to learn object-sentence links. To accomplish this, we design a novel semantic matching model that analyzes the semantic similarity between object proposals and sentences in a coarse-to-fine manner. Specifically, we first extract object proposals and coarsely select the top-K candidates based on feature and class similarity matrices. Next, we reconstruct the masked keywords of the sentence using each candidate one by one, and the reconstructed accuracy finely reflects the semantic similarity of each candidate to the query. Additionally, we distill the coarse-to-fine semantic matching knowledge into a typical two-stage 3D visual grounding model, which reduces inference costs and improves performance by taking full advantage of the well-studied structure of the existing architectures. We conduct extensive experiments on ScanRefer, Nr3D, and Sr3D, which demonstrate the effectiveness of our proposed method. + + + + zPROBE: Zero Peek Robustness Checks for Federated Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Ghodsi_zPROBE_Zero_Peek_Robustness_Checks_for_Federated_Learning_ICCV_2023_paper.pdf + Privacy-preserving federated learning allows multiple users to jointly train a model with coordination of a central server. The server only learns the final aggregation result, thereby preventing leakage of the users' (private) training data from the individual model updates. However, keeping the individual updates private allows malicious users to degrade the model accuracy without being detected, also known as Byzantine attacks. Best existing defenses against Byzantine workers rely on robust rank-based statistics, e.g., setting robust bounds via the median of updates, to find malicious updates. However, implementing privacy-preserving rank-based statistics, especially median-based, is nontrivial and unscalable in the secure domain, as it requires sorting of all individual updates. We establish the first private robustness check that uses high break point rank-based statistics on aggregated model updates. By exploiting randomized clustering, we significantly improve the scalability of our defense without compromising privacy. We leverage the derived statistical bounds in zero-knowledge proofs to detect and remove malicious updates without revealing the private user updates. 
Our novel framework, zPROBE, enables Byzantine resilient and secure federated learning. We show the effectiveness of zPROBE on several computer vision benchmarks. Empirical evaluations demonstrate that zPROBE provides a low overhead solution to defend against state-of-the-art Byzantine attacks while preserving privacy. + + + + Generative Prompt Model for Weakly Supervised Object Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Generative_Prompt_Model_for_Weakly_Supervised_Object_Localization_ICCV_2023_paper.pdf + Weakly supervised object localization (WSOL) remains challenging when learning object localization models from image category labels. Conventional methods that discriminatively train activation models ignore representative yet less discriminative object parts. In this study, we propose a generative prompt model (GenPromp), defining the first generative pipeline to localize less discriminative object parts by formulating WSOL as a conditional image denoising procedure. During training, GenPromp converts image category labels to learnable prompt embeddings which are fed to a generative model to conditionally recover the input image with noise and learn representative embeddings. During inference, GenPromp combines the representative embeddings with discriminative embeddings (queried from an off-the-shelf vision-language model) for both representative and discriminative capacity. The combined embeddings are finally used to generate multi-scale high-quality attention maps, which facilitate localizing full object extent. Experiments on CUB-200-2011 and ILSVRC show that GenPromp respectively outperforms the best discriminative models by 5.2% and 5.6% (Top-1 Loc), setting a solid baseline for WSOL with the generative model. Code is available at https://github.com/callsys/GenPromp. + + + + ActFormer: A GAN-based Transformer towards General Action-Conditioned 3D Human Motion Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_ActFormer_A_GAN-based_Transformer_towards_General_Action-Conditioned_3D_Human_Motion_ICCV_2023_paper.pdf + We present a GAN-based Transformer for general action-conditioned 3D human motion generation, including not only single-person actions but also multi-person interactive actions. Our approach consists of a powerful Action-conditioned motion TransFormer (ActFormer) under a GAN training scheme, equipped with a Gaussian Process latent prior. Such a design combines the strong spatio-temporal representation capacity of Transformer, superiority in generative modeling of GAN, and inherent temporal correlations from the latent prior. Furthermore, ActFormer can be naturally extended to multi-person motions by alternately modeling temporal correlations and human interactions with Transformer encoders. To further facilitate research on multi-person motion generation, we introduce a new synthetic dataset of complex multi-person combat behaviors. Extensive experiments on NTU-13, NTU RGB+D 120, BABEL and the proposed combat dataset show that our method can adapt to various human motion representations and achieve superior performance over the state-of-the-art methods on both single-person and multi-person motion generation tasks, demonstrating a promising step towards a general human motion generator. 
+ + + + Hiding Visual Information via Obfuscating Adversarial Perturbations + http://openaccess.thecvf.com//content/ICCV2023/papers/Su_Hiding_Visual_Information_via_Obfuscating_Adversarial_Perturbations_ICCV_2023_paper.pdf + Growing leakage and misuse of visual information raise security and privacy concerns, which promotes the development of information protection. Existing adversarial perturbations-based methods mainly focus on the de-identification against deep learning models. However, the inherent visual information of the data has not been well protected. In this work, inspired by the Type-I adversarial attack, we propose an Adversarial Visual Information Hiding (AVIH) method to protect the visual privacy of data. Specifically, the method generates obfuscating adversarial perturbations to obscure the visual information of the data. Meanwhile, it maintains the hidden objectives to be correctly predicted by models. In addition, our method does not modify the parameters of the applied model, which makes it flexible for different scenarios. Experimental results on the recognition and classification tasks demonstrate that the proposed method can effectively hide visual information and hardly affect the performances of models. The code is available at https://github.com/suzhigangssz/AVIH. + + + + Category-aware Allocation Transformer for Weakly Supervised Object Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Category-aware_Allocation_Transformer_for_Weakly_Supervised_Object_Localization_ICCV_2023_paper.pdf + Weakly supervised object localization (WSOL) aims to localize objects based on only image-level labels as supervision. Recently, transformers have been introduced into WSOL, yielding impressive results. The self-attention mechanism and multilayer perceptron structure in transformers preserve long-range feature dependency, facilitating complete localization of the full object extent. However, current transformer-based methods predict bounding boxes using category-agnostic attention maps, which may lead to confused and noisy object localization. To address this issue, we propose a novel Category-aware Allocation TRansformer (CATR) that learns category-aware representations for specific objects and produces corresponding category-aware attention maps for object localization. First, we introduce a Category-aware Stimulation Module (CSM) to induce learnable category biases for self-attention maps, providing auxiliary supervision to guide the learning of more effective transformer representations. Second, we design an Object Constraint Module (OCM) to refine the object regions for the category-aware attention maps in a self-supervised manner. Extensive experiments on the CUB-200-2011 and ILSVRC datasets demonstrate that the proposed CATR achieves significant and consistent performance improvements over competing approaches. + + + + Domain Specified Optimization for Deployment Authorization + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Domain_Specified_Optimization_for_Deployment_Authorization_ICCV_2023_paper.pdf + This paper explores Deployment Authorization (DPA) as a means of restricting the generalization capabilities of vision models on certain domains to protect intellectual property. Nevertheless, the current advancements in DPA are predominantly confined to fully supervised settings. 
Such settings require access to annotated images from any unauthorized domain, rendering the DPA approach impractical for real-world applications due to its exorbitant costs. To address this issue, we propose Source-Only Deployment Authorization (SDPA), which assumes that only authorized domains are accessible during training phases, and the model's performance on unauthorized domains must be suppressed in inference stages. Drawing inspiration from distributionally robust statistics, we present a lightweight method called Domain-Specified Optimization (DSO) for SDPA that degrades the model's generalization over a divergence ball. DSO comes with theoretical guarantees on the convergence property and its authorization performance. As a complement to SDPA, we also propose Target-Combined Deployment Authorization (TDPA), where unauthorized domains are partially accessible, and simplify the DSO method to a perturbation operation on the pseudo predictions, referred to as Target-Dependent Domain-Specified Optimization (TDSO). We demonstrate the effectiveness of our proposed DSO and TDSO methods through extensive experiments on six image benchmarks, achieving dominant performance in both the SDPA and TDPA settings. + + + + Locally Stylized Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Pang_Locally_Stylized_Neural_Radiance_Fields_ICCV_2023_paper.pdf + In recent years, there has been increasing interest in applying stylization on 3D scenes from a reference style image, in particular onto neural radiance fields (NeRF). While performing stylization directly on NeRF guarantees appearance consistency over arbitrary novel views, it is a challenging problem to guide the transfer of patterns from the style image onto different parts of the NeRF scene. In this work, we propose a stylization framework for NeRF based on local style transfer. In particular, we use a hash-grid encoding to learn the embedding of the appearance and geometry components, and show that the mapping defined by the hash table allows us to control the stylization to a certain extent. Stylization is then achieved by optimizing the appearance branch while keeping the geometry branch fixed. To support local style transfer, we propose a new loss function that utilizes a segmentation network and bipartite matching to establish region correspondences between the style image and the content images obtained from volume rendering. Our experiments show that our method yields plausible stylization results with novel view synthesis while having flexible controllability via manipulating and customizing the region correspondences. + + + + Confidence-aware Pseudo-label Learning for Weakly Supervised Visual Grounding + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Confidence-aware_Pseudo-label_Learning_for_Weakly_Supervised_Visual_Grounding_ICCV_2023_paper.pdf + Visual grounding aims at localizing the target object in an image that is most related to the given free-form natural language query. As labeling the position of the target object is labor-intensive, weakly supervised methods, where only image-sentence annotations are required during model training, have recently received increasing attention. Most of the existing weakly-supervised methods first generate region proposals via pre-trained object detectors and then employ either a cross-modal similarity score or a reconstruction loss as the criterion to select proposals from them. 

However, due to the heterogeneous cross-modal gap, these methods often suffer from high-confidence spurious associations and are prone to error propagation. In this paper, we propose Confidence-aware Pseudo-label Learning (CPL) to overcome the above limitations. Specifically, we first adopt both the uni-modal and cross-modal pre-trained models and propose conditional prompt engineering to automatically generate multiple `descriptive, realistic and diverse' pseudo language queries for each region proposal, and then establish reliable cross-modal associations for model training based on the uni-modal similarity score (between pseudo and real text queries). Secondly, we propose a confidence-aware pseudo-label verification module, which reduces the amount of noise encountered in the training process and the risk of error propagation. Experiments on five widely used datasets validate the efficacy of our proposed components and demonstrate state-of-the-art performance. + + + + Luminance-aware Color Transform for Multiple Exposure Correction + http://openaccess.thecvf.com//content/ICCV2023/papers/Baek_Luminance-aware_Color_Transform_for_Multiple_Exposure_Correction_ICCV_2023_paper.pdf + Images captured with irregular exposures inevitably present unsatisfactory visual effects, such as distorted hue and color tone. However, most recent studies mainly focus on underexposure correction, which limits their applicability to real-world scenarios where exposure levels vary. Furthermore, some works that tackle multiple exposures rely on the encoder-decoder architecture, resulting in loss of detail in input images during the down-sampling and up-sampling processes. In this regard, a novel correction algorithm for multiple exposures, called luminance-aware color transform (LACT), is proposed in this study. First, we reason about the relative exposure conditions between images to obtain luminance features using a luminance comparison module. Next, we encode a set of transformation functions from the luminance features, which enable complex color transformations for both overexposed and underexposed images. Finally, we project the transformed representation onto the RGB color space to produce exposure correction results. Extensive experiments demonstrate that the proposed LACT yields new state-of-the-art results on two multiple-exposure datasets. + + + + A Simple Framework for Open-Vocabulary Segmentation and Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_A_Simple_Framework_for_Open-Vocabulary_Segmentation_and_Detection_ICCV_2023_paper.pdf + In this work, we present OpenSeeD, a simple Open-vocabulary Segmentation and Detection framework that learns from different segmentation and detection datasets. To bridge the gap in vocabulary and annotation granularity, we first introduce a pretrained text encoder to encode all the visual concepts in the two tasks and learn a common semantic space for them. This gives us reasonably good results compared with the counterparts trained on the segmentation task only. To further reconcile them, we locate two discrepancies: i) task discrepancy -- segmentation requires extracting masks for both foreground objects and background stuff, while detection merely cares about the former; ii) data discrepancy -- box and mask annotations come with different spatial granularity, and thus are not directly interchangeable. We propose a decoupled foreground/background decoding and a conditioned mask decoding to address these issues, respectively. 

To this end, we develop a simple encoder-decoder model encompassing all three techniques and train it jointly on COCO and Objects365. After pretraining, our model exhibits competitive or stronger zero-shot transferability for both segmentation and detection. Specifically, OpenSeeD beats the state-of-the-art method for open-vocabulary instance and panoptic segmentation across 5 datasets, and outperforms previous work for open-vocabulary detection on LVIS and ODinW under similar settings. When transferred to specific tasks, our model achieves new SoTA on panoptic segmentation on COCO and ADE20K, and instance segmentation on ADE20K and Cityscapes. Finally, we note that OpenSeed is the first to explore the potential of joint training on segmentation and detection, and hope it can be received as a strong baseline for developing a single model for open-vocabulary segmentation and detection. + + + + Alignment Before Aggregation: Trajectory Memory Retrieval Network for Video Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Alignment_Before_Aggregation_Trajectory_Memory_Retrieval_Network_for_Video_Object_ICCV_2023_paper.pdf + Memory-based methods in semi-supervised video object segmentation task achieve competitive performance by performing dense matching between query and memory frames. However, most of the existing methods neglect the fact that videos carry rich temporal information yet redundant spatial information. In this case, direct pixel-level global matching will lead to ambiguous correspondences. In this work, we reconcile the inherent tension of spatial and temporal information to retrieve memory frame information along the object trajectory, and propose a novel and coherent Trajectory Memory Retrieval Network (TMRN) to equip with the trajectory information, including a spatial alignment module and a temporal aggregation module. The proposed TMRN enjoys several merits. First, TMRN is empowered to characterize the temporal correspondence which is in line with the nature of video in a data-driven manner. Second, we elegantly customize the spatial alignment module by coupling SVD initialization with agent-level correlation for representative agent construction and rectifying false matches caused by direct pairwise pixel-level correlation, respectively. Extensive experimental results on challenging benchmarks including DAVIS 2017 validation / test and Youtube-VOS 2018 / 2019 demonstrate that our TMRN, as a general plugin module, achieves consistent improvements over several leading methods. + + + + Deep Directly-Trained Spiking Neural Networks for Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Su_Deep_Directly-Trained_Spiking_Neural_Networks_for_Object_Detection_ICCV_2023_paper.pdf + Spiking neural networks (SNNs) are brain-inspired energy-efficient models that encode information in spatiotemporal dynamics. Recently, deep SNNs trained directly have shown great success in achieving high performance on classification tasks with very few time steps. However, how to design a directly-trained SNN for the regression task of object detection still remains a challenging problem. To address this problem, we propose EMS-YOLO, a novel directly-trained SNN framework for object detection, which is the first trial to train a deep SNN with surrogate gradients for object detection rather than ANN-SNN conversion strategies. 
Specifically, we design a full-spike residual block, EMS-ResNet, which can effectively extend the depth of the directly-trained SNN with low power consumption. Furthermore, we theoretically analyze and prove the EMS-ResNet could avoid gradient vanishing or exploding. The results demonstrate that our approach outperforms the state-of-the-art ANN-SNN conversion methods (at least 500 time steps) in extremely fewer time steps (only 4 time steps). It is shown that our model could achieve comparable performance to the ANN with the same architecture while consuming 5.83x less energy on the frame-based COCO Dataset and the event-based Gen1 Dataset. Our code is available in https://github.com/BICLab/EMS-YOLO. + + + + Masked Autoencoders Are Stronger Knowledge Distillers + http://openaccess.thecvf.com//content/ICCV2023/papers/Lao_Masked_Autoencoders_Are_Stronger_Knowledge_Distillers_ICCV_2023_paper.pdf + Knowledge distillation (KD) has shown great success in improving student's performance by mimicking the intermediate output of the high-capacity teacher in fine-grained visual tasks, e.g. object detection. This paper proposes a technique called Masked Knowledge Distillation (MKD) that enhances this process using a masked autoencoding scheme. In MKD, random patches of the input image are masked, and the corresponding missing feature is recovered by forcing it to imitate the output of the teacher. MKD is based on two core designs. First, using the student as the encoder, we develop an adaptive decoder architecture, which includes a spatial alignment module that operates on the multi-scale features in the feature pyramid network (FPN), a simple decoder, and a spatial recovery module that mimics the teacher's output from the latent representation and mask tokens. Second, we introduce the masked convolution in each convolution block to keep the masked patches unaffected by others. By coupling these two designs, we can further improve the completeness and effectiveness of teacher knowledge learning. We conduct extensive experiments on different architectures with object detection and semantic segmentation. The results show that all the students can achieve further improvements compared to the conventional KD. Notably, we establish the new state-of-the-art results by boosting RetinaNet ResNet-18, and ResNet-50 from 33.4 to 37.5 mAP, and 37.4 to 41.5 mAP, respectively. + + + + ASIC: Aligning Sparse in-the-wild Image Collections + http://openaccess.thecvf.com//content/ICCV2023/papers/Gupta_ASIC_Aligning_Sparse_in-the-wild_Image_Collections_ICCV_2023_paper.pdf + We present a method for joint alignment of sparse in-the-wild image collections of an object category. Most prior works assume either ground-truth keypoint annotations or a large dataset of images of a single object category. However, neither of the above assumptions hold true for the long-tail of the objects present in the world. We present a self-supervised technique that directly optimizes on a sparse collection of images of a particular object/object category to obtain consistent dense correspondences across the collection. We use pairwise nearest neighbors obtained from deep features of a pre-trained vision transformer (ViT) model as noisy and sparse keypoint matches and make them dense and accurate matches by optimizing a neural network that jointly maps the image collection into a learned canonical grid. 
Experiments on CUB, SPair-71k and PF-Willow benchmarks demonstrate that our method can produce globally consistent and higher quality correspondences across the image collection when compared to existing self-supervised methods. Code and other material will be made available at https://kampta.github.io/asic. + + + + Residual Pattern Learning for Pixel-Wise Out-of-Distribution Detection in Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Residual_Pattern_Learning_for_Pixel-Wise_Out-of-Distribution_Detection_in_Semantic_Segmentation_ICCV_2023_paper.pdf + Semantic segmentation models classify pixels into a set of known ("in-distribution") visual classes. When deployed in an open world, the reliability of these models depends on their ability to not only classify in-distribution pixels but also to detect out-of-distribution (OoD) pixels. Historically, the poor OoD detection performance of these models has motivated the design of methods based on model re-training using synthetic training images that include OoD visual objects. Although successful, these re-trained methods have two issues: 1) their in-distribution segmentation accuracy may drop during re-training, and 2) their OoD detection accuracy does not generalise well to new contexts (e.g., country surroundings) outside the training set (e.g., city surroundings). In this paper, we mitigate these issues with: (i) a new residual pattern learning (RPL) module that assists the segmentation model to detect OoD pixels with minimal deterioration to the inlier segmentation performance; and (ii) a novel context-robust contrastive learning (CoroCL) that enforces RPL to robustly detect OoD pixels in various contexts. Our approach improves by around 10% FPR and 7% AuPRC the previous state-of-the-art in Fishyscapes, Segment-Me-If-You-Can, and RoadAnomaly datasets. Code will be available. + + + + Hierarchical Visual Primitive Experts for Compositional Zero-Shot Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Hierarchical_Visual_Primitive_Experts_for_Compositional_Zero-Shot_Learning_ICCV_2023_paper.pdf + Compositional zero-shot learning (CZSL) aims to recognize unseen compositions with prior knowledge of known primitives (attribute and object). Previous works for CZSL often suffer from grasping the contextuality between attribute and object, as well as the discriminability of visual features, and the long-tailed distribution of real-world compositional data. We propose a simple and scalable framework called Composition Transformer (CoT) to address these issues. CoT employs object and attribute experts in distinctive manners to generate representative embeddings, using the visual network hierarchically. The object expert extracts representative object embeddings from the final layer in a bottom-up manner, while the attribute expert makes attribute embeddings in a top-down manner with a proposed object-guided attention module that models contextuality explicitly. To remedy biased prediction caused by imbalanced data distribution, we develop a simple minority attribute augmentation (MAA) that synthesizes virtual samples by mixing two images and oversampling minority attribute classes. Our method achieves SoTA performance on several benchmarks, including MIT-States, C-GQA, and VAW-CZSL. We also demonstrate the effectiveness of CoT in improving visual discrimination and addressing the model bias from the imbalanced data distribution. The code is available at https://github.com/HanjaeKim98/CoT. 
+ + + + Segment Every Reference Object in Spatial and Temporal Spaces + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Segment_Every_Reference_Object_in_Spatial_and_Temporal_Spaces_ICCV_2023_paper.pdf + The reference-based object segmentation tasks, namely referring image segmentation (RIS), referring video object segmentation (RVOS), and video object segmentation (VOS), aim to segment a specific object by utilizing either language or annotated masks as references. Despite significant progress in each respective field, current methods are task-specifically designed and developed in different directions, which hinders the activation of multi-task capabilities for these tasks. In this work, we end the current fragmented situation and propose UniRef to unify the three reference-based object segmentation tasks with a single architecture. At the heart of our approach is the multiway-fusion for handling different task with respect to their specified references. And a unified Transformer architecture is then adopted for performing instance-level segmentation. With the unified designs, UniRef can be jointly trained on a broad range of benchmarks and can flexibly perform multiple tasks at runtime by specifying the corresponding references. We evaluate the jointly trained network on various benchmarks. Extensive experimental results indicate that our proposed UniRef achieves state-of-the-art performance on RIS and RVOS, and performs competitively on VOS with a single network. + + + + Unified Out-Of-Distribution Detection: A Model-Specific Perspective + http://openaccess.thecvf.com//content/ICCV2023/papers/Averly_Unified_Out-Of-Distribution_Detection_A_Model-Specific_Perspective_ICCV_2023_paper.pdf + Out-of-distribution (OOD) detection aims to identify test examples that do not belong to the training distribution and are thus unlikely to be predicted reliably. Despite a plethora of existing works, most of them focused only on the scenario where OOD examples come from semantic shift (e.g., unseen categories), ignoring other possible causes (e.g., covariate shift). In this paper, we present a novel, unifying framework to study OOD detection in a broader scope. Instead of detecting OOD examples from a particular cause, we propose to detect examples that a deployed machine learning model (e.g., an image classifier) is unable to predict correctly. That is, whether a test example should be detected and rejected or not is "model-specific". We show that this framework unifies the detection of OOD examples caused by semantic shift and covariate shift, and closely addresses the concern of applying a machine learning model to uncontrolled environments. We provide an extensive analysis that involves a variety of models (e.g., different architectures and training strategies), sources of OOD examples, and OOD detection approaches, and reveal several insights into improving and understanding OOD detection in uncontrolled environments. + + + + RankMatch: Fostering Confidence and Consistency in Learning with Noisy Labels + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_RankMatch_Fostering_Confidence_and_Consistency_in_Learning_with_Noisy_Labels_ICCV_2023_paper.pdf + Learning with noisy labels (LNL) is one of the most important and challenging problems in weakly-supervised learning. Recent advances adopt the sample selection strategy to mitigate the interference of noisy labels and use small-loss criteria to select clean samples. 
However, the one-dimensional loss is an over-simplified metric that fails to accommodate the complex feature landscape of various samples, and, hence, is prone to introduce classification errors during sample selection. In this paper, we propose RankMatch, a novel LNL framework that investigates additional dimensions of confidence and consistency in order to combat noisy labels. Confidence-wise, we propose a novel sample selection strategy based on confidence representation voting instead of the widely-used small-loss criterion. This new strategy is capable of increasing sample selection quantity without sacrificing labeling accuracy. Consistency-wise, instead of the widely adopted feature distance metric for measuring the consistency of inner-class samples, we advocate that the rank of principal features is a much more robust indicator. Based on this metric, we propose rank contrastive loss, which strengthens the consistency of similar samples regardless of their labels and facilitates feature representation learning. Experimental results on noisy versions of CIFAR-10, CIFAR-100, Clothing1M, and WebVision have validated the superiority of our approach over existing state-of-the-art methods. + + + + MixReorg: Cross-Modal Mixed Patch Reorganization is a Good Mask Learner for Open-World Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_MixReorg_Cross-Modal_Mixed_Patch_Reorganization_is_a_Good_Mask_Learner_ICCV_2023_paper.pdf + Recently, semantic segmentation models trained with image-level text supervision have shown promising results in challenging open-world scenarios. However, these models still face difficulties in learning fine-grained semantic alignment at the pixel level and predicting accurate object masks. To address this issue, we propose MixReorg, a novel and straightforward pre-training paradigm for semantic segmentation that enhances a model's ability to reorganize patches mixed across images, exploring both local visual relevance and global semantic coherence. Our approach involves generating fine-grained patch-text pairs data by mixing image patches while preserving the correspondence between patches and text. The model is then trained to minimize the segmentation loss of the mixed images and the two contrastive losses of the original and restored features. With MixReorg as a mask learner, conventional text-supervised semantic segmentation models can achieve highly generalizable pixel-semantic alignment ability, which is crucial for open-world segmentation. After training with large-scale image-text data, MixReorg models can be applied directly to segment visual objects of arbitrary categories, without the need for further fine-tuning. Our proposed framework demonstrates strong performance on popular zero-shot semantic segmentation benchmarks, outperforming GroupViT by significant margins of 5.0%, 6.2%, 2.5%, and 3.4% mIoU on PASCAL VOC2012, PASCAL Context, MS COCO, and ADE20K, respectively. + + + + Preface: A Data-driven Volumetric Prior for Few-shot Ultra High-resolution Face Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Buhler_Preface_A_Data-driven_Volumetric_Prior_for_Few-shot_Ultra_High-resolution_Face_ICCV_2023_paper.pdf + NeRFs have enabled highly realistic synthesis of human faces including complex appearance and reflectance effects of hair and skin. 
These methods typically require a large number of multi-view input images, making the process hardware-intensive and cumbersome, and limiting applicability to unconstrained settings. We propose a novel volumetric human face prior that enables the synthesis of ultra high-resolution novel views of subjects that are not part of the prior's training distribution. This prior model consists of an identity-conditioned NeRF, trained on a dataset of low-resolution multi-view images of diverse humans with known camera calibration. A simple sparse landmark-based 3D alignment of the training dataset allows our model to learn a smooth latent space of geometry and appearance despite a limited number of training identities. A high-quality volumetric representation of a novel subject can be obtained by model fitting to 2 or 3 camera views of arbitrary resolution. Importantly, our method requires as few as two views of casually captured images as input at inference time. + + + + ICICLE: Interpretable Class Incremental Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Rymarczyk_ICICLE_Interpretable_Class_Incremental_Continual_Learning_ICCV_2023_paper.pdf + Continual learning enables incremental learning of new tasks without forgetting those previously learned, resulting in positive knowledge transfer that can enhance performance on both new and old tasks. However, continual learning poses new challenges for interpretability, as the rationale behind model predictions may change over time, leading to interpretability concept drift. We address this problem by proposing Interpretable Class-InCremental LEarning (ICICLE), an exemplar-free method that adopts a prototypical part-based approach. It consists of three crucial novelties: interpretability regularization that distills previously learned concepts while preserving user-friendly positive reasoning; a proximity-based prototype initialization strategy dedicated to the fine-grained setting; and task-recency bias compensation devoted to prototypical parts. Our experimental results demonstrate that ICICLE reduces the interpretability concept drift and outperforms existing exemplar-free methods for common class-incremental learning when applied to concept-based models. + + + + PointCLIP V2: Prompting CLIP and GPT for Powerful 3D Open-world Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_PointCLIP_V2_Prompting_CLIP_and_GPT_for_Powerful_3D_Open-world_ICCV_2023_paper.pdf + Large-scale pre-trained models have shown promising open-world performance for both vision and language tasks. However, their transfer capacity on 3D point clouds is still limited and constrained to the classification task. In this paper, we first combine CLIP and GPT into a unified 3D open-world learner, named PointCLIP V2, which fully unleashes their potential for zero-shot 3D classification, segmentation, and detection. To better align 3D data with the pre-trained language knowledge, PointCLIP V2 contains two key designs. For the visual end, we prompt CLIP via a shape projection module to generate more realistic depth maps, narrowing the domain gap between projected point clouds and natural images. For the textual end, we prompt the GPT model to generate 3D-specific text as the input of CLIP's textual encoder. Without any training in 3D domains, our approach significantly surpasses PointCLIP by +42.90%, +40.44%, and +28.75% accuracy on three datasets for zero-shot 3D classification. 

On top of that, V2 can be extended to few-shot 3D classification, zero-shot 3D part segmentation, and 3D object detection in a simple manner, demonstrating its generalization ability for unified 3D open-world learning. + + + + Identification of Systematic Errors of Image Classifiers on Rare Subgroups + http://openaccess.thecvf.com//content/ICCV2023/papers/Metzen_Identification_of_Systematic_Errors_of_Image_Classifiers_on_Rare_Subgroups_ICCV_2023_paper.pdf + Despite the excellent average-case performance of many image classifiers, their performance can substantially deteriorate on semantically coherent subgroups of the data that were under-represented in the training data. These systematic errors can impact both fairness for demographic minority groups and robustness and safety under domain shift. A major challenge is to identify such subgroups with subpar performance when the subgroups are not annotated and their occurrence is very rare. We leverage recent advances in text-to-image models and search in the space of textual descriptions of subgroups ("prompts") for subgroups where the target model has low performance on the prompt-conditioned synthesized data. To tackle the exponentially growing number of subgroups, we employ combinatorial testing. We denote this procedure PromptAttack, as it can be interpreted as an adversarial attack in a prompt space. We study subgroup coverage and identifiability with PromptAttack in a controlled setting and find that it identifies systematic errors with high accuracy. Thereupon, we apply PromptAttack to ImageNet classifiers and identify novel systematic errors on rare subgroups. + + + + Clusterformer: Cluster-based Transformer for 3D Object Detection in Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Pei_Clusterformer_Cluster-based_Transformer_for_3D_Object_Detection_in_Point_Clouds_ICCV_2023_paper.pdf + Owing to the unstructured and sparse nature of point clouds, transformers show great potential for point cloud data processing. However, recent query-based 3D detectors usually project the features acquired from a sparse backbone into the structured and compact Bird's Eye View (BEV) plane before applying the transformer, which destroys the sparsity of the features and introduces empty tokens and additional resource consumption for the transformer. To this end, in this paper we propose a novel query-based 3D detector called Clusterformer. Clusterformer regards each object as a cluster in 3D space, consisting mainly of the non-empty voxels belonging to the same object, and leverages these clusters to drive the transformer decoder to generate proposals directly from the sparse voxel features. Such a cluster-based transformer structure can effectively improve the performance and convergence speed of query-based detectors by making use of the object prior information contained in the clusters. Additionally, we introduce a Query2Key strategy to iteratively enhance the key and value features with object-level information in our cluster-based transformer structure. Experimental results show that the proposed Clusterformer outperforms previous query-based detectors with lower latency and memory usage, achieving state-of-the-art performance on the Waymo Open Dataset and the KITTI dataset.
+ + + + CDUL: CLIP-Driven Unsupervised Learning for Multi-Label Image Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Abdelfattah_CDUL_CLIP-Driven_Unsupervised_Learning_for_Multi-Label_Image_Classification_ICCV_2023_paper.pdf + This paper presents a CLIP-based unsupervised learning method for annotation-free multi-label image classification, including three stages: initialization, training, and inference. At the initialization stage, we take full advantage of the powerful CLIP model and propose a novel approach to extend CLIP for multi-label predictions based on global-local image-text similarity aggregation. To be more specific, we split each image into snippets and leverage CLIP to generate the similarity vector for the whole image (global) as well as each snippet (local). Then a similarity aggregator is introduced to leverage the global and local similarity vectors. Using the aggregated similarity scores as the initial pseudo labels at the training stage, we propose an optimization framework to train the parameters of the classification network and refine pseudo labels for unobserved labels. During inference, only the classification network is used to predict the labels of the input image. Extensive experiments show that our method outperforms state-of-the-art unsupervised methods on MS-COCO, PASCAL VOC 2007, PASCAL VOC 2012, and NUS datasets and even achieves comparable results to weakly supervised classification methods. + + + + Your Diffusion Model is Secretly a Zero-Shot Classifier + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Your_Diffusion_Model_is_Secretly_a_Zero-Shot_Classifier_ICCV_2023_paper.pdf + The recent wave of large-scale text-to-image diffusion models has dramatically increased our text-based image generation abilities. These models can generate realistic images for a staggering variety of prompts and exhibit impressive compositional generalization abilities. Almost all use cases thus far have solely focused on sampling; however, diffusion models can also provide conditional density estimates, which are useful for tasks beyond image generation. In this paper, we show that the density estimates from large-scale text-to-image diffusion models like Stable Diffusion can be leveraged to perform zero-shot classification without any additional training. Our generative approach to classification, which we call Diffusion Classifier, attains strong results on a variety of benchmarks and outperforms alternative methods of extracting knowledge from diffusion models. Although a gap remains between generative and discriminative approaches on zero-shot recognition tasks, our diffusion-based approach has stronger multimodal compositional reasoning abilities than competing discriminative approaches. Finally, we use Diffusion Classifier to extract standard classifiers from class-conditional diffusion models trained on ImageNet. These models approach the performance of SOTA discriminative classifiers and exhibit strong "effective robustness" to distribution shift. Overall, our results are a step toward using generative over discriminative models for downstream tasks. + + + + Backpropagation Path Search On Adversarial Transferability + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Backpropagation_Path_Search_On_Adversarial_Transferability_ICCV_2023_paper.pdf + Deep neural networks are vulnerable to adversarial examples, dictating the imperativeness to test the model's robustness before deployment. 
Transfer-based attackers craft adversarial examples against surrogate models and transfer them to victim models deployed in a black-box setting. To enhance adversarial transferability, structure-based attackers adjust the backpropagation path to keep the attack from overfitting the surrogate model. However, existing structure-based attackers fail to explore the convolution module in CNNs and modify the backpropagation graph heuristically, leading to limited effectiveness. In this paper, we propose backPropagation pAth Search (PAS), which solves both of these problems. We first propose SkipConv to adjust the backpropagation path of convolution by structural reparameterization. To overcome the drawback of heuristically designed backpropagation paths, we further construct a DAG-based search space, utilize one-step approximation for path evaluation and employ Bayesian Optimization to search for the optimal path. We conduct comprehensive experiments in a wide range of transfer settings, showing that PAS improves the attack success rate by a large margin for both normally trained and defended models. + + + + Boosting Adversarial Transferability via Gradient Relevance Attack + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Boosting_Adversarial_Transferability_via_Gradient_Relevance_Attack_ICCV_2023_paper.pdf + A wealth of adversarial attack research has revealed the fragility of deep neural networks (DNNs), where imperceptible perturbations can cause drastic changes in the output. Among the diverse types of attack methods, gradient-based attacks are powerful and easy to implement, raising widespread concern about the security of DNNs. However, under the black-box setting, existing gradient-based attacks struggle to break through DNN models equipped with defense techniques, especially adversarially trained models. To make adversarial examples more transferable, in this paper we study the fluctuation of the sign of the adversarial perturbation's pixels during the generation of adversarial examples and propose the Gradient Relevance Attack (GRA). Specifically, two gradient relevance frameworks are presented to better utilize the information in the neighborhood of the input, which can correct the update direction adaptively. Then we adjust the update step at each iteration with a decay indicator to counter the fluctuation. Experimental results on a subset of the ILSVRC 2012 validation set convincingly verify the effectiveness of GRA. Furthermore, attack success rates of 68.7% and 64.8% on Tencent Cloud and Baidu AI Cloud indicate that GRA can craft adversarial examples that transfer across both datasets and model architectures. Code is released at https://github.com/RYC-98/GRA. + + + + CLIPN for Zero-Shot OOD Detection: Teaching CLIP to Say No + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_CLIPN_for_Zero-Shot_OOD_Detection_Teaching_CLIP_to_Say_No_ICCV_2023_paper.pdf + Out-of-distribution (OOD) detection refers to training a model on an in-distribution (ID) dataset so that it can determine whether input images come from unknown classes. Considerable effort has been invested in designing various OOD detection methods based on either convolutional neural networks or transformers. However, zero-shot OOD detection methods driven by CLIP, which require only class names for the ID data, have received less attention.
This paper presents a novel method, namely CLIP saying no (CLIPN), which empowers "no" logic within CLIP. Our key motivation is to equip CLIP with the capability of distinguishing OOD and ID samples via positive-semantic prompts and negation-semantic prompts. To be specific, we design a novel learnable "no" prompt and a "no" text encoder to capture negation semantics with respect to images. Subsequently, we introduce two loss functions: the image-text binary-opposite loss and the text semantic-opposite loss, which we use to teach CLIPN to associate images with "no" prompts, thereby enabling it to identify unknown samples. Furthermore, we propose two threshold-free inference algorithms to perform OOD detection using the negation semantics from the "no" prompts and text encoder. Experimental results on 9 benchmark datasets (3 ID datasets and 6 OOD datasets) for the OOD detection task demonstrate that CLIPN outperforms 7 widely used algorithms by at least 1.1% in AUROC and 7.37% in FPR95 on zero-shot OOD detection of ImageNet-1K. Our CLIPN can serve as a solid foundation for leveraging CLIP effectively in downstream OOD tasks. + + + + CO-Net: Learning Multiple Point Cloud Tasks at Once with A Cohesive Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_CO-Net_Learning_Multiple_Point_Cloud_Tasks_at_Once_with_A_ICCV_2023_paper.pdf + We present CO-Net, a cohesive framework that optimizes multiple point cloud tasks collectively across heterogeneous dataset domains. CO-Net maintains high storage efficiency, since models that share the majority of their parameters can be assembled into a single model. Specifically, we leverage a residual MLP (Res-MLP) block for effective feature extraction and scale it gracefully along the depth and width of the network to meet the demands of different tasks. Based on this block, we propose a novel nested layer-wise processing policy, which identifies the optimal architecture for each task while providing partially shared and partially non-shared parameters inside each layer of the block. Such a policy tackles the inherent challenges of multi-task learning on point clouds, e.g., diverse model topologies resulting from task skew and conflicting gradients induced by heterogeneous dataset domains. Finally, we propose a sign-based gradient surgery to promote the training of CO-Net, thereby emphasizing the usage of task-shared parameters and guaranteeing that each task can be thoroughly optimized. Experimental results reveal that models optimized by CO-Net jointly for all point cloud tasks incur much lower computation and overall storage costs yet outpace prior methods by a significant margin. We also demonstrate that CO-Net allows incremental learning and prevents catastrophic amnesia when adapting to a new point cloud task. + + + + Quality Diversity for Visual Pre-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Chavhan_Quality_Diversity_for_Visual_Pre-Training_ICCV_2023_paper.pdf + Models pre-trained on large datasets such as ImageNet provide the de facto standard for transfer learning, with both supervised and self-supervised approaches proving effective. However, emerging evidence suggests that no single pre-trained feature will perform well across diverse downstream tasks. Each pre-training strategy encodes a certain inductive bias, which may suit some downstream tasks but not others.
Notably, the augmentations used in both supervised and self-supervised training lead to features with high invariance to spatial and appearance transformations. This renders them sub-optimal for tasks that demand sensitivity to these factors. In this paper we develop a feature that better supports diverse downstream tasks by providing a diverse set of sensitivities and invariances. In particular, we are inspired by Quality-Diversity in evolution, to define a pre-training objective that requires high quality yet diverse features -- where diversity is defined in terms of transformation (in)variances. Our framework plugs in to both supervised and self-supervised pre-training, and produces a small ensemble of features. We further show how downstream tasks can easily and efficiently select their preferred (in)variances. Both empirical and theoretical analysis show the efficacy of our representation and transfer learning approach for diverse downstream tasks. + + + + UniDexGrasp++: Improving Dexterous Grasping Policy Learning via Geometry-Aware Curriculum and Iterative Generalist-Specialist Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Wan_UniDexGrasp_Improving_Dexterous_Grasping_Policy_Learning_via_Geometry-Aware_Curriculum_and_ICCV_2023_paper.pdf + We propose a novel, object-agnostic method for learning a universal policy for dexterous object grasping from realistic point cloud observations and proprioceptive information under a table-top setting, namely UniDexGrasp++. To address the challenge of learning the vision-based policy across thousands of object instances, we propose Geometry-aware Curriculum Learning (GeoCurriculum) and Geometry-aware iterative Generalist-Specialist Learning (GiGSL) which leverage the geometry feature of the task and significantly improve the generalizability. With our proposed techniques, our final policy shows universal dexterous grasping on thousands of object instances with 85.4% and 78.2% success rate on the train set and test set which outperforms the state-of-the-art baseline UniDexGrasp by 11.7% and 11.3%, respectively. + + + + FerKD: Surgical Label Adaptation for Efficient Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Shen_FerKD_Surgical_Label_Adaptation_for_Efficient_Distillation_ICCV_2023_paper.pdf + We present FerKD, a novel efficient knowledge distillation framework that incorporates partial soft-hard label adaptation coupled with a region-calibration mechanism. Our approach stems from the observation and intuition that standard data augmentations, such as RandomResizedCrop, tend to transform inputs into diverse conditions: easy positives, hard positives, or hard negatives. In traditional distillation frameworks, these transformed samples are utilized equally through their predictive probabilities derived from pretrained teacher models. However, merely relying on prediction values from a pretrained teacher, a common practice in prior studies, neglects the reliability of these soft label predictions. To address this, we propose a new scheme that calibrates the less-confident regions to be the context using softened hard groundtruth labels. Our approach involves the processes of hard regions mining + calibration. We demonstrate empirically that this method can dramatically improve the convergence speed and final accuracy. Additionally, we find that a consistent mixing strategy can stabilize the distributions of soft supervision, taking advantage of the soft labels. 
As a result, we introduce a stabilized SelfMix augmentation that weakens the variation of the mixed images and corresponding soft labels through mixing similar regions within the same image. FerKD is an intuitive and well-designed learning system that eliminates several heuristics and hyperparameters in former FKD solution. More importantly, it achieves remarkable improvement on ImageNet-1K and downstream tasks. For instance, FerKD achieves 81.2% on ImageNet-1K with ResNet-50, outperforming FKD and FunMatch by remarkable margins. Leveraging better pre-trained weights and larger architectures, our finetuned ViT-G14 even achieves 89.9%. Our code is available at https://github.com/szq0214/FKD/tree/main/FerKD. + + + + Neural Fields for Structured Lighting + http://openaccess.thecvf.com//content/ICCV2023/papers/Shandilya_Neural_Fields_for_Structured_Lighting_ICCV_2023_paper.pdf + We present an image formation model and optimization procedure that combines the advantages of neural radiance fields and structured light imaging. Existing depth-supervised neural models rely on depth sensors to accurately capture the scene's geometry. However, the depth maps recovered by these sensors can be prone to error, or even fail outright. Instead of depending on the fidelity of processed depth maps from a structured light system, a more principled approach is to explicitly model the raw structured light images themselves. Our proposed approach enables the estimation of high-fidelity depth maps, including for objects with complex material properties (e.g., partially-transparent surfaces). Besides computing depth, the raw structured light images also confer other useful radiometric cues, which enable predicting surface normals and decomposing scene appearance in terms of a direct, indirect, and ambient component. We evaluate our framework quantitatively and qualitatively on a range of real and synthetic scenes, and decompose scenes into their constituent components for novel views. + + + + ClothPose: A Real-world Benchmark for Visual Analysis of Garment Pose via An Indirect Recording Solution + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_ClothPose_A_Real-world_Benchmark_for_Visual_Analysis_of_Garment_Pose_ICCV_2023_paper.pdf + Garments are important and pervasive in daily life. However, visual analysis on them for pose estimation is challenging because it requires recovering the complete configurations of garments, which is difficult, if not impossible, to annotate in the real world. In this work, we propose a recording system, GarmentTwin, which can track garment poses in dynamic settings such as manipulation. GarmentTwin first collects garment models and RGB-D manipulation videos from the real world and then replays the manipulation process using physics-based animation. This way, we can obtain deformed garments with poses coarsely aligned with real-world observations. Finally, we adopt an optimization-based approach to fit the pose with real-world observations. We verify the fitting results quantitatively and qualitatively. With GarmentTwin, we construct a large-scale dataset named ClothPose, which consists of 30K RGB-D frames from 2K video clips on 600 garments of 10 categories. We benchmark two tasks on the proposed ClothPose: non-rigid reconstruction and pose estimation. The experiments show that previous baseline methods struggle with highly large non-rigid deformation of manipulated garments. 
Therefore, we hope that the recording system and the dataset can facilitate research on pose estimation tasks on non-rigid objects. Datasets, models, and codes are made publicly available. + + + + Unsupervised Object Localization with Representer Point Selection + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_Unsupervised_Object_Localization_with_Representer_Point_Selection_ICCV_2023_paper.pdf + We propose a novel unsupervised object localization method that allows us to explain the predictions of the model by utilizing self-supervised pre-trained models without additional finetuning. Existing unsupervised and self-supervised object localization methods often utilize class-agnostic activation maps or self-similarity maps of a pre-trained model. Although these maps can offer valuable information for localization, their limited ability to explain how the model makes predictions remains challenging. In this paper, we propose a simple yet effective unsupervised object localization method based on representer point selection, where the predictions of the model can be represented as a linear combination of representer values of training points. By selecting representer points, which are the most important examples for the model predictions, our model can provide insights into how the model predicts the foreground object by providing relevant examples as well as their importance. Our method outperforms the state-of-the-art unsupervised and self-supervised object localization methods on various datasets with significant margins and even outperforms recent weakly supervised and few-shot methods. + + + + SEMPART: Self-supervised Multi-resolution Partitioning of Image Semantics + http://openaccess.thecvf.com//content/ICCV2023/papers/Ravindran_SEMPART_Self-supervised_Multi-resolution_Partitioning_of_Image_Semantics_ICCV_2023_paper.pdf + Accurately determining salient regions of an image is challenging when labeled data is scarce. DINO-based self-supervised approaches have recently leveraged meaningful image semantics captured by patch-wise features for locating foreground objects. Recent methods have also incorporated intuitive priors and demonstrated value in unsupervised methods for object partitioning. In this paper, we propose SEMPART, which jointly infers coarse and fine bi-partitions over an image's DINO-based semantic graph. Furthermore, SEMPART preserves fine boundary details using graph-driven regularization and successfully distills the coarse mask semantics into the fine mask. Our salient object detection and single object localization findings suggest that SEMPART produces high-quality masks rapidly without additional post-processing and benefits from co-optimizing the coarse and fine branches. + + + + Flatness-Aware Minimization for Domain Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Flatness-Aware_Minimization_for_Domain_Generalization_ICCV_2023_paper.pdf + Domain generalization (DG) seeks to learn robust models that generalize well under unknown distribution shifts. As a critical aspect of DG, optimizer selection has not been explored in depth. Currently, most DG methods follow the widely used benchmark, DomainBed, and utilize Adam as the default optimizer for all datasets. However, we reveal that Adam is not necessarily the optimal choice for the majority of current DG methods and datasets. 
Based on the perspective of loss landscape flatness, we propose a novel approach, Flatness-Aware Minimization for Domain Generalization (FAD), which can efficiently optimize both zeroth-order and first-order flatness simultaneously for DG. We provide theoretical analyses of FAD's out-of-distribution (OOD) generalization error and convergence. Our experimental results demonstrate the superiority of FAD on various DG datasets. + + + + ProtoFL: Unsupervised Federated Learning via Prototypical Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_ProtoFL_Unsupervised_Federated_Learning_via_Prototypical_Distillation_ICCV_2023_paper.pdf + Federated learning (FL) is a promising approach for enhancing data privacy preservation, particularly for authentication systems. However, limited communication rounds, scarce representation, and scalability pose significant challenges to its deployment, hindering its full potential. In this paper, we propose ProtoFL, a Prototypical Representation Distillation-based unsupervised Federated Learning framework, to enhance the representation power of a global model and reduce round communication costs. Additionally, we introduce a local one-class classifier based on normalizing flows to improve performance with limited data. Our study represents the first investigation of using FL to improve one-class classification performance. We conduct extensive experiments on five widely used benchmarks, namely MNIST, CIFAR-10, CIFAR-100, ImageNet-30, and Keystroke-Dynamics, to demonstrate the superior performance of our proposed framework over previous methods in the literature. + + + + Multi-label Affordance Mapping from Egocentric Vision + http://openaccess.thecvf.com//content/ICCV2023/papers/Mur-Labadia_Multi-label_Affordance_Mapping_from_Egocentric_Vision_ICCV_2023_paper.pdf + Accurate affordance detection and segmentation with pixel precision is an important component of many interaction-based complex systems, such as robots and assistive devices. We present a new approach to affordance perception which enables accurate multi-label segmentation. Our approach can be used to automatically annotate grounded affordances from first-person videos of interactions using a 3D map of the environment, providing pixel-level precision for the affordance locations. We use this method to build the largest and most complete dataset on affordances, EPIC-Aff, based on the EPIC-Kitchen dataset, which provides automatic, interaction-grounded, multi-label, metric and spatial affordance annotations. Then, we propose a new approach to affordance segmentation based on multi-label detection, which enables multiple affordances to co-exist in the same space, for example if they are associated with the same object. We present several multi-label detection strategies using several segmentation architectures. The experimental results highlight the importance of multi-label detection. Finally, we show how our metric representation can be exploited to build a map of interaction hotspots in spatial action-centric zones, and we use that representation to perform task-oriented navigation. + + + + Unified Adversarial Patch for Cross-Modal Attacks in the Physical World + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Unified_Adversarial_Patch_for_Cross-Modal_Attacks_in_the_Physical_World_ICCV_2023_paper.pdf + Recently, physical adversarial attacks have been presented to evade DNN-based object detectors.
To ensure the security, many scenarios are simultaneously deployed with visible sensors and infrared sensors, leading to the failures of these single-modal physical attacks. To show the potential risks under such scenes, we propose a unified adversarial patch to perform cross-modal physical attacks, i.e., fooling visible and infrared object detectors at the same time via a single patch. Considering different imaging mechanisms of visible and infrared sensors, our work focuses on modeling the shapes of adversarial patches, which can be captured in different modalities when they change. To this end, we design a novel boundary-limited shape optimization to achieve the compact and smooth shapes, and thus they can be easily implemented in the physical world. In addition, to balance the fooling degree between visible detector and infrared detector during the optimization process, we propose a score-aware iterative evaluation, which can guide the adversarial patch to iteratively reduce the predicted scores of the multi-modal sensors. We finally test our method against the one-stage detector: YOLOv3 and the two-stage detector: Faster RCNN. Results show that our unified patch achieves an Attack Success Rate (ASR) of 73.33% and 69.17%, respectively. More importantly, we verify the effective attacks in the physical world when visible and infrared sensors shoot the objects under various settings like different angles, distances, postures, and scenes. + + + + Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pre-training + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Misalign_Contrast_then_Distill_Rethinking_Misalignments_in_Language-Image_Pre-training_ICCV_2023_paper.pdf + Contrastive Language-Image Pretraining has emerged as a prominent approach for training vision and text encoders with uncurated image-text pairs from the web. To enhance data-efficiency, recent efforts have introduced additional supervision terms that involve random-augmented views of the image. However, since the image augmentation process is unaware of its text counterpart, this procedure could cause various degrees of image-text misalignments during training. Prior methods either disregarded this discrepancy or introduced external models to mitigate the impact of misalignments during training. In contrast, we propose a novel metric learning approach that capitalizes on these misalignments as an additional training source, which we term "Misalign, Contrast then Distill (MCD)". Unlike previous methods that treat augmented images and their text counterparts as simple positive pairs, MCD predicts the continuous scales of misalignment caused by the augmentation. Our extensive experimental results show that our proposed MCD achieves state-of-the-art transferability in multiple classification and retrieval downstream datasets. + + + + MixPath: A Unified Approach for One-shot Neural Architecture Search + http://openaccess.thecvf.com//content/ICCV2023/papers/Chu_MixPath_A_Unified_Approach_for_One-shot_Neural_Architecture_Search_ICCV_2023_paper.pdf + Blending multiple convolutional kernels is proved advantageous in neural architecture design. However, current two-stage neural architecture search methods are mainly limited to single-path search spaces. How to efficiently search models of multi-path structures remains a difficult problem. In this paper, we are motivated to train a one-shot multi-path supernet to accurately evaluate the candidate architectures. 
Specifically, we discover that in the studied search spaces, feature vectors summed from multiple paths are nearly multiples of those from a single path. Such disparity perturbs the supernet training and its ranking ability. Therefore, we propose a novel mechanism called Shadow Batch Normalization (SBN) to regularize the disparate feature statistics. Extensive experiments prove that SBNs are capable of stabilizing the optimization and improving ranking performance. We call our unified multi-path one-shot approach as MixPath, which generates a series of models that achieve state-of-the-art results on ImageNet. + + + + Enhancing NeRF akin to Enhancing LLMs: Generalizable NeRF Transformer with Mixture-of-View-Experts + http://openaccess.thecvf.com//content/ICCV2023/papers/Cong_Enhancing_NeRF_akin_to_Enhancing_LLMs_Generalizable_NeRF_Transformer_with_ICCV_2023_paper.pdf + Cross-scene generalizable NeRF models, which can directly synthesize novel views of unseen scenes, have become a new spotlight of the NeRF field. Several existing attempts rely on increasingly end-to-end "neuralized" architectures, i.e., replacing scene representation and/or rendering modules with performant neural networks such as transformers, and turning novel view synthesis into a feed-forward inference pipeline. While those feedforward "neuralized" architectures still do not fit diverse scenes well out of the box, we propose to bridge them with the powerful Mixture-of-Experts (MoE) idea from large language models (LLMs), which has demonstrated superior generalization ability by balancing between larger overall model capacity and flexible per-instance specialization. Starting from a recent generalizable NeRF architecture called GNT, we first demonstrate that MoE can be neatly plugged in to enhance the model. We further customize a shared permanent expert and a geometry-aware consistency loss to enforce cross-scene consistency and spatial smoothness respectively, which are essential for generalizable view synthesis. Our proposed model, dubbed GNT with Mixture-of-View-Experts (GNT-MOVE), has experimentally shown state-of-the-art results when transferring to unseen scenes, indicating remarkably better cross-scene generalization in both zero-shot and few-shot settings. Our codes are available at https://github.com/VITA-Group/GNT-MOVE. + + + + Task-aware Adaptive Learning for Cross-domain Few-shot Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Task-aware_Adaptive_Learning_for_Cross-domain_Few-shot_Learning_ICCV_2023_paper.pdf + Although existing few-shot learning works yield promising results for in-domain queries, they still suffer from weak cross-domain generalization. Limited support data requires effective knowledge transfer, but domain-shift makes this harder. Towards this emerging challenge, researchers improved adaptation by introducing task-specific parameters, which are directly optimized and estimated for each task. However, adding a fixed number of additional parameters fails to consider the diverse domain shifts between target tasks and the source domain, limiting efficacy. In this paper, we first observe the dependence of task-specific parameter configuration on the target task. Abundant task-specific parameters may over-fit, and insufficient task-specific parameters may result in under-adaptation -- but the optimal task-specific configuration varies for different test tasks. 
Based on these findings, we propose the Task-aware Adaptive Network (TA2-Net), which is trained by reinforcement learning to adaptively estimate the optimal task-specific parameter configuration for each test task. It learns, for example, that tasks with significant domain shift usually have a larger need for task-specific parameters for adaptation. We evaluate our model on Meta-dataset. Empirical results show that our model outperforms existing state-of-the-art methods. + + + + Revisiting Domain-Adaptive 3D Object Detection by Reliable, Diverse and Class-balanced Pseudo-Labeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Revisiting_Domain-Adaptive_3D_Object_Detection_by_Reliable_Diverse_and_Class-balanced_ICCV_2023_paper.pdf + Unsupervised domain adaptation (DA) with the aid of pseudo labeling techniques has emerged as a crucial approach for domain-adaptive 3D object detection. While effective, existing DA methods suffer from a substantial drop in performance when applied to a multi-class training setting, due to the co-existence of low-quality pseudo labels and class imbalance issues. In this paper, we address this challenge by proposing a novel ReDB framework tailored for learning to detect all classes at once. Our approach produces Reliable, Diverse, and class-Balanced pseudo 3D boxes to iteratively guide the self-training on a distributionally different target domain. To alleviate disruptions caused by the environmental discrepancy (e.g., beam numbers), the proposed cross-domain examination (CDE) assesses the correctness of pseudo labels by copy-pasting target instances into a source environment and measuring the prediction consistency. To reduce computational overhead and mitigate the object shift (e.g., scales and point densities), we design an overlapped boxes counting (OBC) metric that allows to uniformly downsample pseudo-labeled objects across different geometric characteristics. To confront the issue of inter-class imbalance, we progressively augment the target point clouds with a class-balanced set of pseudo-labeled target instances and source objects, which boosts recognition accuracies on both frequently appearing and rare classes. Experimental results on three benchmark datasets using both voxel-based (i.e., SECOND) and point-based 3D detectors (i.e., PointRCNN) demonstrate that our proposed ReDB approach outperforms existing 3D domain adaptation methods by a large margin, improving 23.15% mAP on the nuScenes - KITTI task. The code is available at https://github.com/zhuoxiao-chen/ReDB-DA-3Ddet. + + + + Efficient Adaptive Human-Object Interaction Detection with Concept-guided Memory + http://openaccess.thecvf.com//content/ICCV2023/papers/Lei_Efficient_Adaptive_Human-Object_Interaction_Detection_with_Concept-guided_Memory_ICCV_2023_paper.pdf + Human Object Interaction (HOI) detection aims to localize and infer the relationships between a human and an object. Arguably, training supervised models for this task from scratch presents challenges due to the performance drop over rare classes and the high computational cost and time required to handle long-tailed distributions of HOIs in complex HOI scenes in realistic settings. This observation motivates us to design an HOI detector that can be trained even with long-tailed labeled data and can leverage existing knowledge from pre-trained models. 
Inspired by the powerful generalization ability of the large Vision-Language Models (VLM) on classification and retrieval tasks, we propose an efficient Adaptive HOI Detector with Concept-guided Memory (ADA-CM). ADA-CM has two operating modes. The first mode makes it tunable without learning new parameters in a training-free paradigm. Its second mode incorporates an instance-aware adapter mechanism that can further efficiently boost performance if updating a lightweight set of parameters can be afforded. Our proposed method achieves competitive results with state-of-the-art on the HICO-DET and V-COCO datasets with much less training time. Code can be found at https://github.com/ltttpku/ADA-CM. + + + + Attentive Mask CLIP + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Attentive_Mask_CLIP_ICCV_2023_paper.pdf + In vision-language modeling, image token removal is an efficient augmentation technique to reduce the cost of encoding image features. The CLIP-style models, however, have been found to be negatively impacted by this technique. We hypothesize that removing a large portion of image tokens may inadvertently destroy the semantic information associated to a given text description, resulting in misaligned paired data in CLIP training. To address this issue, we propose an attentive token removal approach, which retains a small number of tokens that have a strong semantic correlation to the corresponding text description. The correlation scores are dynamically evaluated through an EMA-updated vision encoder. Our method, termed attentive mask CLIP, outperforms original CLIP and CLIP variant with random token removal while saving the training time. In addition, our approach also enables efficient multi-view contrastive learning. Experimentally, by training ViT-B on YFCC-15M dataset, our approach achieves 43.9% top-1 accuracy on ImageNet-1K zero-shot classification, 62.7/42.1 and 38.0/23.2 I2T/T2I retrieval accuracy on Flickr30K and MS COCO, outperforming SLIP by +1.1%,+5.5/+0.9, and +4.4/+1.3, respectively, while being 2.30x faster. An efficient version of our approach runs 1.16x faster than the plain CLIP model, while achieving significant gains of +5.3%, +11.3/+8.0, and +9.5/+4.9 on these benchmarks, respectively. + + + + Motion-Guided Masking for Spatiotemporal Representation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_Motion-Guided_Masking_for_Spatiotemporal_Representation_Learning_ICCV_2023_paper.pdf + Several recent works have directly extended the image masked autoencoder (MAE) with random masking into video domain, achieving promising results. However, unlike images, both spatial and temporal information are important for video understanding. This suggests that the random masking strategy that is inherited from the image MAE is less effective for video MAE. This motivates the design of a novel masking algorithm that can more efficiently make use of video saliency. Specifically, we propose a motion-guided masking algorithm (MGM) which leverages motion vectors to guide the position of each mask over time. Crucially, these motion-based correspondences can be directly obtained from information stored in the compressed format of the video, which makes our method efficient and scalable. On two challenging large-scale video benchmarks (Kinetics-400 and Something-Something V2), we equip video MAE with our MGM and achieve up to +1.3% improvement compared to previous state-of-the-art methods. 
Additionally, our MGM achieves performance equivalent to the previous video MAE while using up to 66% fewer training epochs. Lastly, we show that MGM generalizes better to downstream transfer learning and domain adaptation tasks on the UCF101, HMDB51, and Diving48 datasets, achieving up to +4.9% improvement compared to baseline methods. + + + + Urban Radiance Field Representation with Deformable Neural Mesh Primitives + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_Urban_Radiance_Field_Representation_with_Deformable_Neural_Mesh_Primitives_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRFs) have achieved great success in the past few years. However, most current methods still require intensive resources due to ray marching-based rendering. To construct urban-level radiance fields efficiently, we design the Deformable Neural Mesh Primitive (DNMP) and propose to parameterize the entire scene with such primitives. The DNMP is a flexible and compact neural variant of the classic mesh representation, which enjoys both the efficiency of rasterization-based rendering and the powerful neural representation capability for photo-realistic image synthesis. Specifically, a DNMP consists of a set of connected deformable mesh vertices with paired vertex features that parameterize the geometry and radiance information of a local area. To constrain the degrees of freedom for optimization and lower the storage budget, we enforce the shape of each primitive to be decoded from a relatively low-dimensional latent space. The rendering colors are decoded from the vertex features (interpolated with rasterization) by a view-dependent MLP. The DNMP provides a new paradigm for urban-level scene representation with appealing properties: (1) High-quality rendering. Our method achieves leading performance for novel view synthesis in urban scenarios. (2) Low computational costs. Our representation enables fast rendering (2.07 ms/1k pixels) and low peak memory usage (110 MB/1k pixels). We also present a lightweight version that can run 33x faster than vanilla NeRFs and is comparable to the highly optimized Instant-NGP (0.61 vs 0.71 ms/1k pixels). + + + + Adaptive Frequency Filters As Efficient Global Token Mixers + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Adaptive_Frequency_Filters_As_Efficient_Global_Token_Mixers_ICCV_2023_paper.pdf + Recent vision transformers, large-kernel CNNs and MLPs have attained remarkable success in a broad range of vision tasks thanks to their effective information fusion in the global scope. However, their efficient deployment, especially on mobile devices, still suffers from noteworthy challenges due to the heavy computational costs of self-attention mechanisms, large kernels, or fully connected layers. In this work, we apply the classical convolution theorem to deep learning to address this issue and reveal that adaptive frequency filters can serve as efficient global token mixers. With this insight, we propose the Adaptive Frequency Filtering (AFF) token mixer. This neural operator transfers a latent representation to the frequency domain via a Fourier transform and performs semantic-adaptive frequency filtering via an elementwise multiplication, which is mathematically equivalent to a token mixing operation in the original latent space with a dynamic convolution kernel as large as the spatial resolution of this latent representation. We take AFF token mixers as the primary neural operators to build a lightweight neural network, dubbed AFFNet.
Extensive experiments demonstrate the effectiveness of our proposed AFF token mixer and show that AFFNet achieves superior accuracy-efficiency trade-offs compared to other lightweight network designs on a broad range of visual tasks, including visual recognition and dense prediction tasks. + + + + Zolly: Zoom Focal Length Correctly for Perspective-Distorted Human Mesh Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Zolly_Zoom_Focal_Length_Correctly_for_Perspective-Distorted_Human_Mesh_Reconstruction_ICCV_2023_paper.pdf + As it is hard to calibrate single-view RGB images in the wild, existing 3D human mesh reconstruction (3DHMR) methods either use a constant large focal length or estimate one based on the background environment context, which cannot tackle the problem of torso, limb, hand or face distortion caused by perspective camera projection when the camera is close to the human body. These naive focal length assumptions can harm this task through incorrectly formulated projection matrices. To solve this, we propose Zolly, the first 3DHMR method focusing on perspective-distorted images. Our approach begins with analysing the reason for perspective distortion, which we find is mainly caused by the relative location of the human body to the camera center. We propose a new camera model and a novel 2D representation, termed the distortion image, which describes the 2D dense distortion scale of the human body. We then estimate the distance from distortion scale features rather than environment context features. Afterwards, we integrate the distortion feature with image features to reconstruct the body mesh. To formulate the correct projection matrix and locate the human body position, we simultaneously use perspective and weak-perspective projection losses. Since existing datasets could not handle this task, we propose the first synthetic dataset, PDHuman, and extend two real-world datasets tailored for this task, all containing perspective-distorted human images. Extensive experiments show that Zolly outperforms existing state-of-the-art methods on both perspective-distorted datasets and the standard benchmark (3DPW). Code and dataset will be released at https://wenjiawang0312.github.io/projects/zolly/. + + + + Beyond One-to-One: Rethinking the Referring Image Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Beyond_One-to-One_Rethinking_the_Referring_Image_Segmentation_ICCV_2023_paper.pdf + Referring image segmentation aims to segment the target object referred to by a natural language expression. However, previous methods rely on the strong assumption that one sentence must describe one target in the image, which is often not the case in real-world applications. As a result, such methods fail when the expressions refer to either no objects or multiple objects. In this paper, we address this issue from two perspectives. First, we propose a Dual Multi-Modal Interaction (DMMI) Network, which contains two decoder branches and enables information flow in two directions. In the text-to-image decoder, the text embedding is utilized to query the visual features and localize the corresponding target. Meanwhile, the image-to-text decoder is implemented to reconstruct the erased entity-phrase conditioned on the visual features. In this way, visual features are encouraged to contain the critical semantic information about the target entity, which in turn supports accurate segmentation in the text-to-image decoder.
Secondly, we collect a new challenging but realistic dataset called Ref-ZOM, which includes image-text pairs under different settings. Extensive experiments demonstrate our method achieves state-of-the-art performance on different datasets, and the Ref-ZOM-trained model performs well on various types of text inputs. Codes and datasets are available at https://github.com/toggle1995/RIS-DMMI. + + + + MoreauGrad: Sparse and Robust Interpretation of Neural Networks via Moreau Envelope + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_MoreauGrad_Sparse_and_Robust_Interpretation_of_Neural_Networks_via_Moreau_ICCV_2023_paper.pdf + Explaining the predictions of deep neural nets has been a topic of great interest in the computer vision literature. While several gradient-based interpretation schemes have been proposed to reveal the influential variables in a neural net's prediction, standard gradient-based interpretation frameworks have been commonly observed to lack robustness to input perturbations and flexibility for incorporating prior knowledge of sparsity and group-sparsity structures. In this work, we propose MoreauGrad as an interpretation scheme based on the classifier neural net's Moreau envelope. We demonstrate that MoreauGrad results in a smooth and robust interpretation of a multi-layer neural network and can be efficiently computed through first-order optimization methods. Furthermore, we show that MoreauGrad can be naturally combined with L1-norm regularization techniques to output a sparse or group-sparse explanation which are prior conditions applicable to a wide range of deep learning applications. We empirically evaluate the proposed MoreauGrad scheme on standard computer vision datasets, showing the qualitative and quantitative success of the MoreauGrad approach in comparison to standard gradient-based interpretation methods. + + + + Class-Incremental Grouping Network for Continual Audio-Visual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Mo_Class-Incremental_Grouping_Network_for_Continual_Audio-Visual_Learning_ICCV_2023_paper.pdf + Continual learning is a challenging problem in which models need to be trained on non-stationary data across sequential tasks for class-incremental learning. While previous methods have focused on using either regularization or rehearsal-based frameworks to alleviate catastrophic forgetting in image classification, they are limited to a single modality and cannot learn compact class-aware cross-modal representations for continual audio-visual learning. To address this gap, we propose a novel class-incremental grouping network (CIGN) that can learn category-wise semantic features to achieve continual audio-visual learning. Our CIGN leverages learnable audio-visual class tokens and audio-visual grouping to continually aggregate class-aware features. Additionally, it utilizes class tokens distillation and continual grouping to prevent forgetting parameters learned from previous tasks, thereby improving the model's ability to capture discriminative audio-visual categories. We conduct extensive experiments on VGGSound-Instruments, VGGSound-100, and VGG-Sound Sources benchmarks. Our experimental results demonstrate that the CIGN achieves state-of-the-art audio-visual class-incremental learning performance. 
+ + + + Improving Sample Quality of Diffusion Models Using Self-Attention Guidance + http://openaccess.thecvf.com//content/ICCV2023/papers/Hong_Improving_Sample_Quality_of_Diffusion_Models_Using_Self-Attention_Guidance_ICCV_2023_paper.pdf + Denoising diffusion models (DDMs) have attracted attention for their exceptional generation quality and diversity. This success is largely attributed to the use of class- or text-conditional diffusion guidance methods, such as classifier and classifier-free guidance. In this paper, we present a more comprehensive perspective that goes beyond the traditional guidance methods. From this generalized perspective, we introduce novel condition- and training-free strategies to enhance the quality of generated images. As a simple solution, blur guidance improves the suitability of intermediate samples for their fine-scale information and structures, enabling diffusion models to generate higher quality samples with a moderate guidance scale. Improving upon this, Self-Attention Guidance (SAG) uses the intermediate self-attention maps of diffusion models to enhance their stability and efficacy. Specifically, SAG adversarially blurs only the regions that diffusion models attend to at each iteration and guides them accordingly. Our experimental results show that our SAG improves the performance of various diffusion models, including ADM, IDDPM, Stable Diffusion, and DiT. Moreover, combining SAG with conventional guidance methods leads to further improvement. + + + + Evaluating Data Attribution for Text-to-Image Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Evaluating_Data_Attribution_for_Text-to-Image_Models_ICCV_2023_paper.pdf + While large text-to-image models are able to synthesize "novel" images, these images are necessarily a reflection of the training data. The problem of data attribution in such models -- which of the images in the training set are most responsible for the appearance of a given generated image -- is a difficult yet important one. As an initial step toward this problem, we evaluate attribution through "customization" methods, which tune an existing large-scale model toward a given exemplar object or style. Our key insight is that this allows us to efficiently create synthetic images that are computationally influenced by the exemplar by construction. With our new dataset of such exemplar-influenced images, we are able to evaluate various data attribution algorithms and different possible feature spaces. Furthermore, by training on our dataset, we can tune standard models, such as DINO, CLIP, and ViT, toward the attribution problem. Even though the procedure is tuned towards small exemplar sets, we show generalization to larger sets. Finally, by taking into account the inherent uncertainty of the problem, we can assign soft attribution scores over a set of training images. + + + + Delta Denoising Score + http://openaccess.thecvf.com//content/ICCV2023/papers/Hertz_Delta_Denoising_Score_ICCV_2023_paper.pdf + This paper introduces Delta Denoising Score (DDS), a novel diffusion-based scoring technique that optimizes a parametric model for the task of image editing. Unlike the existing Score Distillation Sampling (SDS), which queries the generative model with a single image-text pair, DDS utilizes an additional fixed query of a reference image-text pair to generate delta scores that represent the difference between the outputs of the two queries. 
By estimating the noisy gradient directions introduced by SDS using the source image and its text description, DDS provides cleaner gradient directions that modify the edited portions of the image while leaving others unchanged, yielding a distilled edit of the source image. The analysis presented in this paper supports the power of the new score for image-to-image translation. We further show that the new score can be used to train an effective zero-shot image translation model. The experimental results show that the proposed loss term outperforms existing methods in terms of stability and quality, highlighting its potential for real-world applications. + + + + Hierarchical Prior Mining for Non-local Multi-View Stereo + http://openaccess.thecvf.com//content/ICCV2023/papers/Ren_Hierarchical_Prior_Mining_for_Non-local_Multi-View_Stereo_ICCV_2023_paper.pdf + As a fundamental problem in computer vision, multi-view stereo (MVS) aims at recovering the 3D geometry of a target from a set of 2D images. Recent advances in MVS have shown that it is important to perceive non-local structured information for recovering geometry in low-textured areas. In this work, we propose Hierarchical Prior Mining for Non-local Multi-View Stereo (HPM-MVS). The key characteristics are the following techniques that exploit non-local information to assist MVS: 1) A Non-local Extensible Sampling Pattern (NESP), which is able to adaptively change the size of sampled areas without becoming snared in locally optimal solutions. 2) A new approach to leverage non-local reliable points and construct a planar prior model based on K-Nearest Neighbor (KNN), to obtain potential hypotheses for regions where prior construction is challenging. 3) A Hierarchical Prior Mining (HPM) framework, which is used to mine extensive non-local prior information at different scales to assist 3D model recovery; this strategy achieves a good balance between the reconstruction of details and of low-textured areas. Experimental results on ETH3D and Tanks & Temples verify the superior performance and strong generalization capability of our method. Our code will be available at https://github.com/CLinvx/HPM-MVS. + + + + Generative Multiplane Neural Radiance for 3D-Aware Image Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Kumar_Generative_Multiplane_Neural_Radiance_for_3D-Aware_Image_Generation_ICCV_2023_paper.pdf + We present a method to efficiently generate 3D-aware high-resolution images that are view-consistent across multiple target views. The proposed multiplane neural radiance model, named GMNR, consists of a novel α-guided view-dependent representation (α-VdR) module for learning view-dependent information. The α-VdR module, facilitated by an α-guided pixel sampling technique, computes the view-dependent representation efficiently by learning viewing direction and position coefficients. Moreover, we propose a view-consistency loss to enforce photometric similarity across multiple views. The GMNR model can generate 3D-aware high-resolution images that are view-consistent across multiple camera poses, while maintaining computational efficiency in terms of both training and inference time. Experiments on three datasets demonstrate the effectiveness of the proposed modules, leading to favorable results in terms of both generation quality and inference time compared to existing approaches. Our GMNR model generates 3D-aware images of 1024 x 1024 pixels at 17.6 FPS on a single V100.
Code : https://github.com/VIROBO-15/GMNR + + + + Boosting Semantic Segmentation from the Perspective of Explicit Class Embeddings + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Boosting_Semantic_Segmentation_from_the_Perspective_of_Explicit_Class_Embeddings_ICCV_2023_paper.pdf + Semantic segmentation is a computer vision task that associates a label with each pixel in an image. Modern approaches tend to introduce class embeddings into semantic segmentation for deeply utilizing category semantics, and regard supervised class masks as final predictions. In this paper, we explore the mechanism of class embeddings and have an insight that more explicit and meaningful class embeddings can be generated based on class masks purposely. Following this observation, we propose ECENet, a new segmentation paradigm, in which class embeddings are obtained and enhanced explicitly during interacting with multi-stage image features. Based on this, we revisit the traditional decoding process and explore inverted information flow between segmentation masks and class embeddings. Furthermore, to ensure the discriminability and informativity of features from backbone, we propose a Feature Reconstruction module, which combines intrinsic and diverse branches together to ensure the concurrence of diversity and redundancy in features. Experiments show that our ECENet outperforms its counterparts on the ADE20K dataset with much less computational cost and achieves new state-of-the-art results on PASCAL-Context dataset. The code will be released at https://gitee.com/mindspore/models and https://github.com/Carol-lyh/ECENet. + + + + Learning to Identify Critical States for Reinforcement Learning from Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Learning_to_Identify_Critical_States_for_Reinforcement_Learning_from_Videos_ICCV_2023_paper.pdf + Recent work on deep reinforcement learning (DRL) has pointed out that algorithmic information about good policies can be extracted from offline data which lack explicit information about executed actions. For example, videos of humans or robots may convey a lot of implicit information about rewarding action sequences, but a DRL machine that wants to profit from watching such videos must first learn by itself to identify and recognize relevant states/actions/rewards. Without relying on ground-truth annotations, our new method called Deep State Identifier learns to predict returns from episodes encoded as videos. Then it uses a kind of mask-based sensitivity analysis to extract/identify important critical states. Extensive experiments showcase our method's potential for understanding and improving agent behavior. The source code and the generated datasets are available at https://github.com/AI-Initiative-KAUST/VideoRLCS. + + + + Editing Implicit Assumptions in Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Orgad_Editing_Implicit_Assumptions_in_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf + Text-to-image diffusion models often make implicit assumptions about the world when generating images. While some assumptions are useful (e.g., the sky is blue), they can also be outdated, incorrect, or reflective of social biases present in the training data. Thus, there is a need to control these assumptions without requiring explicit user input or costly re-training. In this work, we aim to edit a given implicit assumption in a pre-trained diffusion model. 
Our Text-to-Image Model Editing method, TIME for short, receives a pair of inputs: a "source" under-specified prompt for which the model makes an implicit assumption (e.g., "a pack of roses"), and a "destination" prompt that describes the same setting, but with a specified desired attribute (e.g., "a pack of blue roses"). TIME then updates the model's cross-attention layers, as these layers assign visual meaning to textual tokens. We edit the projection matrices in these layers such that the source prompt is projected close to the destination prompt. Our method is highly efficient, as it modifies a mere 2.2% of the model's parameters in under one second. To evaluate model editing approaches, we introduce TIMED (TIME Dataset), containing 147 source and destination prompt pairs from various domains. Our experiments (using Stable Diffusion) show that TIME is successful in model editing, generalizes well for related prompts unseen during editing, and imposes minimal effect on unrelated generations. + + + + Conceptual and Hierarchical Latent Space Decomposition for Face Editing + http://openaccess.thecvf.com//content/ICCV2023/papers/Ozkan_Conceptual_and_Hierarchical_Latent_Space_Decomposition_for_Face_Editing_ICCV_2023_paper.pdf + Generative Adversarial Networks (GANs) can produce photo-realistic results using an unconditional image-generation pipeline. However, the images generated by GANs (e.g., StyleGAN) are entangled in feature spaces, which makes it difficult to interpret and control the contents of images. In this paper, we present an encoder-decoder model that decomposes the entangled GAN space into a conceptual and hierarchical latent space in a self-supervised manner. The outputs of 3D morphable face models are leveraged to independently control image synthesis parameters like pose, expression, and illumination. For this purpose, a novel latent space decomposition pipeline is introduced using transformer networks and generative models. Later, this new space is used to optimize a transformer-based GAN space controller for face editing. In this work, a StyleGAN2 model for faces is utilized. Since our method manipulates only GAN features, the photo-realism of StyleGAN2 is fully preserved. The results demonstrate that our method qualitatively and quantitatively outperforms baselines in terms of identity preservation and editing precision. + + + + VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Bi_VL-Match_Enhancing_Vision-Language_Pretraining_with_Token-Level_and_Instance-Level_Matching_ICCV_2023_paper.pdf + Vision-Language Pretraining (VLP) has significantly improved the performance of various vision-language tasks with the matching of images and texts. In this paper, we propose VL-Match, a Vision-Language framework with Enhanced Token-level and Instance-level Matching. At the token level, a Vision-Language Replaced Token Detection task is designed to boost the substantial interaction between text tokens and images, where the text encoder of VLP works as a generator to generate a corrupted text, and the multimodal encoder of VLP works as a discriminator to predict whether each text token in the corrupted text matches the image. At the instance level, in the Image-Text Matching task that judges whether an image-text pair is matched, we propose a novel bootstrapping method to generate hard negative text samples that are different from the positive ones only at the token level. 
In this way, we can force the network to detect fine-grained differences between images and texts. Notably, with a smaller amount of parameters, VL-Match significantly outperforms previous SOTA on all image-text retrieval tasks. + + + + Translating Images to Road Network: A Non-Autoregressive Sequence-to-Sequence Approach + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_Translating_Images_to_Road_Network_A_Non-Autoregressive_Sequence-to-Sequence_Approach_ICCV_2023_paper.pdf + The extraction of road network is essential for the generation of high-definition maps since it enables the precise localization of road landmarks and their interconnections. However, generating road network poses a significant challenge due to the conflicting underlying combination of Euclidean (e.g., road landmarks location) and non-Euclidean (e.g., road topological connectivity) structures. Existing methods struggle to merge the two types of data domains effectively, but few of them address it properly. Instead, our work establishes a unified representation of both types of data domain by projecting both Euclidean and non-Euclidean data into an integer series called RoadNet Sequence. Further than modeling an auto-regressive sequence-to-sequence Transformer model to understand RoadNet Sequence, we decouple the dependency of RoadNet Sequence into a mixture of auto-regressive and non-autoregressive dependency. Building on this, our proposed non-autoregressive sequence-to-sequence approach leverages non-autoregressive dependencies while fixing the gap towards auto-regressive dependencies, resulting in success on both efficiency and accuracy. Extensive experiments on nuScenes dataset demonstrate the superiority of RoadNet Sequence representation and the non-autoregressive approach compared to existing state-of-the-art alternatives. + + + + Generative Novel View Synthesis with 3D-Aware Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Chan_Generative_Novel_View_Synthesis_with_3D-Aware_Diffusion_Models_ICCV_2023_paper.pdf + We present a diffusion-based model for 3D-aware generative novel view synthesis from as few as a single input image. Our model samples from the distribution of possible renderings consistent with the input and, even in the presence of ambiguity, is capable of rendering diverse and plausible novel views. To achieve this, our method makes use of existing 2D diffusion backbones but, crucially, incorporates geometry priors in the form of a 3D feature volume. This latent feature field captures the distribution over possible scene representations and improves our method's ability to generate view-consistent novel renderings. In addition to generating novel views, our method has the ability to autoregressively synthesize 3D-consistent sequences. We demonstrate state-of-the-art results on synthetic renderings and room-scale scenes; we also show compelling results for challenging, real-world objects. + + + + ALWOD: Active Learning for Weakly-Supervised Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_ALWOD_Active_Learning_for_Weakly-Supervised_Object_Detection_ICCV_2023_paper.pdf + Object detection (OD), a crucial vision task, remains challenged by the lack of large training datasets with precise object localization labels. In this work, we propose ALWOD, a new framework that addresses this problem by fusing active learning (AL) with weakly and semi-supervised object detection paradigms. 
Because the performance of AL critically depends on the model initialization, we propose a new auxiliary image generator strategy that utilizes an extremely small labeled set, coupled with a large weakly tagged set of images, as a warm-start for AL. We then propose a new AL acquisition function, another critical factor in AL success, that leverages the student-teacher OD pair disagreement and uncertainty to effectively propose the most informative images to annotate. Finally, to complete the AL loop, we introduce a new labeling task delegated to human annotators, based on selection and correction of model-proposed detections, which is both rapid and effective in labeling the informative images. We demonstrate, across several challenging benchmarks, that ALWOD significantly narrows the gap between the ODs trained on few partially labeled but strategically selected image instances and those that rely on the fully-labeled data. Our code is publicly available on https://github.com/seqam-lab/ALWOD. + + + + S-VolSDF: Sparse Multi-View Stereo Regularization of Neural Implicit Surfaces + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_S-VolSDF_Sparse_Multi-View_Stereo_Regularization_of_Neural_Implicit_Surfaces_ICCV_2023_paper.pdf + Neural rendering of implicit surfaces performs well in 3D vision applications. However, it requires dense input views as supervision. When only sparse input images are available, output quality drops significantly due to the shape-radiance ambiguity problem. We note that this ambiguity can be constrained when a 3D point is visible in multiple views, as is the case in multi-view stereo (MVS). We thus propose to regularize neural rendering optimization with an MVS solution. The use of an MVS probability volume and a generalized cross entropy loss leads to a noise-tolerant optimization process. In addition, neural rendering provides global consistency constraints that guide the MVS depth hypothesis sampling and thus improves MVS performance. Given only three sparse input views, experiments show that our method not only outperforms generic neural rendering models by a large margin but also significantly increases the reconstruction quality of MVS models. + + + + TextManiA: Enriching Visual Feature by Text-driven Manifold Augmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye-Bin_TextManiA_Enriching_Visual_Feature_by_Text-driven_Manifold_Augmentation_ICCV_2023_paper.pdf + We propose TextManiA, a text-driven manifold augmentation method that semantically enriches visual feature spaces, regardless of class distribution. TextManiA augments visual data with intra-class semantic perturbation by exploiting easy-to-understand visually mimetic words, i.e., attributes. This work is built on an interesting hypothesis that general language models, e.g., BERT and GPT, encompass visual information to some extent, even without training on visual training data. Given the hypothesis, TextManiA transfers pre-trained text representation obtained from a well-established large language encoder to a target visual feature space being learned. Our extensive analysis hints that the language encoder indeed encompasses visual information at least useful to augment visual representation. Our experiments demonstrate that TextManiA is particularly powerful in scarce samples with class imbalance as well as even distribution. We also show compatibility with the label mix-based approaches in evenly distributed scarce data. + + + + Breaking Common Sense: WHOOPS! 
A Vision-and-Language Benchmark of Synthetic and Compositional Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Bitton-Guetta_Breaking_Common_Sense_WHOOPS_A_Vision-and-Language_Benchmark_of_Synthetic_and_ICCV_2023_paper.pdf + Weird, unusual, and uncanny images pique the curiosity of observers because they challenge common sense. For example, an image released during the 2022 World Cup depicts the famous soccer stars Lionel Messi and Cristiano Ronaldo playing chess, which playfully violates our expectation that their competition should occur on the football field. Humans can easily recognize and interpret these unconventional images, but can AI models do the same? We introduce WHOOPS!, a new dataset and benchmark for visual commonsense. The dataset is composed of purposefully commonsense-defying images created by designers using publicly available image generation tools like Midjourney. We consider several tasks posed over the dataset. In addition to image captioning, cross-modal matching, and visual question answering, we introduce a difficult explanation generation task, where models must identify and explain why a given image is unusual. Our results show that state-of-the-art models such as GPT3 and BLIP2 still lag behind human performance on WHOOPS!. We hope our dataset will inspire the development of AI models with stronger visual commonsense reasoning abilities. + + + + Consistent Depth Prediction for Transparent Object Reconstruction from RGB-D Camera + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_Consistent_Depth_Prediction_for_Transparent_Object_Reconstruction_from_RGB-D_Camera_ICCV_2023_paper.pdf + Transparent objects are commonly seen in indoor scenes, but their depth is hard to estimate. Commercial depth cameras currently face difficulties in estimating the depth of transparent objects due to light reflection and refraction on their surfaces. As a result, they tend to produce noisy and incorrect depth values for transparent objects. These incorrect depth data cause traditional RGB-D SLAM methods to fail in reconstructing scenes that contain transparent objects. The missing depth of a transparent object must be restored in advance, and it is essential that this depth remain consistent across different views, or the reconstruction result will be distorted. Previous depth prediction methods for transparent objects can restore these missing depth values, but none of them yields good reconstructions because their predictions are inconsistent across views. In this work, we propose a real-time reconstruction method that uses a novel stereo-based depth prediction network to keep depth predictions consistent across a sequence of images. Because there is currently no video dataset of transparent objects to train our model, we construct a synthetic RGB-D video dataset with different transparent objects. Moreover, to test generalization capability, we capture video from real scenes using the RealSense D435i RGB-D camera. We compare against previous methods on the metrics of our dataset and on SLAM reconstruction results in both synthetic and real scenes. Experiments show significant improvements in depth prediction accuracy and scene reconstruction.
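The cross-view consistency requirement described above can be made concrete with a simple reprojection check: depth predicted in one view is back-projected to 3D, transformed with the known relative pose, and compared with the depth predicted in a second view. The minimal NumPy sketch below illustrates only that check; the shared pinhole intrinsics, the camera-to-camera pose convention, and the relative tolerance are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def depth_consistency(depth1, depth2, K, R, t, tol=0.05):
    """Fraction of view-1 pixels whose depth agrees with view 2 after reprojection.

    depth1, depth2: (H, W) depth maps in meters.
    K: (3, 3) pinhole intrinsics assumed shared by both views.
    R, t: rotation (3, 3) and translation (3,) mapping view-1 camera coords to view 2.
    tol: relative depth difference accepted as "consistent".
    """
    H, W = depth1.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Back-project view-1 pixels to 3D camera coordinates.
    pts1 = (np.linalg.inv(K) @ pix.T) * depth1.reshape(1, -1)

    # Move the points into view 2 and project them back to pixel coordinates.
    pts2 = R @ pts1 + t.reshape(3, 1)
    z2 = pts2[2]
    proj = (K @ pts2) / np.clip(z2, 1e-6, None)
    u2 = np.round(proj[0]).astype(int)
    v2 = np.round(proj[1]).astype(int)

    # Keep points that land inside view 2 and in front of the camera.
    valid = (z2 > 0) & (u2 >= 0) & (u2 < W) & (v2 >= 0) & (v2 < H)
    observed = depth2[v2[valid], u2[valid]]
    consistent = np.abs(observed - z2[valid]) < tol * z2[valid]
    return consistent.mean() if consistent.size else 0.0
```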
+ + + + DETR Does Not Need Multi-Scale or Locality Design + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_DETR_Does_Not_Need_Multi-Scale_or_Locality_Design_ICCV_2023_paper.pdf + This paper presents an improved DETR detector that maintains a "plain" nature: using a single-scale feature map and global cross-attention calculations without specific locality constraints, in contrast to previous leading DETR-based detectors that reintroduce architectural inductive biases of multi-scale and locality into the decoder. We show that two simple technologies are surprisingly effective within a plain design to compensate for the lack of multi-scale feature maps and locality constraints. The first is a box-to-pixel relative position bias (BoxRPB) term added to the cross-attention formulation, which effectively guides each query to attend to the corresponding object region while also providing encoding flexibility. The second is masked image modeling (MIM)-based backbone pre-training, which helps learn representations with fine-grained localization ability and proves crucial for remedying the dependency on multi-scale feature maps. By incorporating these technologies and recent advancements in training and problem formulation, the improved "plain" DETR shows exceptional improvements over the original DETR detector. By leveraging the Object365 dataset for pre-training, it achieves 63.9 mAP using a Swin-L backbone, which is highly competitive with state-of-the-art detectors that all heavily rely on multi-scale feature maps and region-based feature extraction. Code will be available at https://github.com/impiga/Plain-DETR. + + + + ClusT3: Information Invariant Test-Time Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Hakim_ClusT3_Information_Invariant_Test-Time_Training_ICCV_2023_paper.pdf + Deep Learning models have shown remarkable performance in a broad range of vision tasks. However, they are often vulnerable to domain shifts at test time. Test-time training (TTT) methods have been developed in an attempt to mitigate these vulnerabilities: a secondary task is solved at training time simultaneously with the main task, to be later used as a self-supervised proxy task at test time. In this work, we propose a novel unsupervised TTT technique based on the maximization of Mutual Information between multi-scale feature maps and a discrete latent representation, which can be integrated into standard training as an auxiliary clustering task. Experimental results demonstrate competitive classification performance on different popular test-time adaptation benchmarks. The code can be found at: https://github.com/dosowiechi/ClusT3.git + + + + AssetField: Assets Mining and Reconfiguration in Ground Feature Plane Representation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiangli_AssetField_Assets_Mining_and_Reconfiguration_in_Ground_Feature_Plane_Representation_ICCV_2023_paper.pdf + Both indoor and outdoor environments are inherently structured and repetitive. Traditional modeling pipelines keep an asset library storing unique object templates, which is both versatile and memory efficient in practice. Inspired by this observation, we propose AssetField, a novel neural scene representation that learns a set of object-aware ground feature planes to represent the scene, where an asset library storing template feature patches can be constructed in an unsupervised manner.
Unlike existing methods which require object masks to query spatial points for object editing, our ground feature plane representation offers a natural visualization of the scene in the bird's-eye view, allowing a variety of operations (e.g. translation, duplication, deformation) on objects to configure a new scene. With the template feature patches, group editing is enabled for scenes with many recurring items to avoid repetitive work on individual objects. We show that AssetField not only achieves competitive performance for novel-view synthesis but also generates realistic renderings for new scene configurations. + + + + SAGA: Spectral Adversarial Geometric Attack on 3D Meshes + http://openaccess.thecvf.com//content/ICCV2023/papers/Stolik_SAGA_Spectral_Adversarial_Geometric_Attack_on_3D_Meshes_ICCV_2023_paper.pdf + A triangular mesh is one of the most popular 3D data representations. As such, the deployment of deep neural networks for mesh processing is widespread and increasingly attracting attention. However, neural networks are prone to adversarial attacks, where carefully crafted inputs impair the model's functionality. The need to explore these vulnerabilities is a fundamental factor in the future development of 3D-based applications. Recently, mesh attacks were studied on the semantic level, where classifiers are misled to produce wrong predictions. Nevertheless, mesh surfaces possess complex geometric attributes beyond their semantic meaning, and their analysis often includes the need to encode and reconstruct the geometry of the shape. We propose a novel framework for a geometric adversarial attack on a 3D mesh autoencoder. In this setting, an adversarial input mesh deceives the autoencoder by forcing it to reconstruct a different geometric shape at its output. The malicious input is produced by perturbing a clean shape in the spectral domain. Our method leverages the spectral decomposition of the mesh along with additional mesh-related properties to obtain visually credible results that consider the delicacy of surface distortions. + + + + Learning Navigational Visual Representations with Semantic Map Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Hong_Learning_Navigational_Visual_Representations_with_Semantic_Map_Supervision_ICCV_2023_paper.pdf + Being able to perceive the semantics and the spatial structure of the environment is essential for visual navigation of a household robot. However, most existing works only employ visual backbones pre-trained either with independent images for classification or with self-supervised learning methods to adapt to the indoor navigation domain, both neglecting the spatial relationships that are essential to the learning of navigation. Inspired by the way humans naturally build semantically and spatially meaningful cognitive maps in their brains during navigation, in this paper we propose a novel navigation-specific visual representation learning method by contrasting the agent's egocentric views and semantic maps (Ego^2-Map). We apply the visual transformer as the backbone encoder and train the model with data collected from the large-scale Habitat-Matterport3D environments. Ego^2-Map learning transfers the compact and rich information from a map, such as objects, structure and transition, to the agent's egocentric representations for navigation. Experiments show that agents using our learned representations on object-goal navigation outperform recent visual pre-training methods.
Moreover, our representations lead to a significant improvement in vision-and-language navigation in continuous environments for both high-level and low-level action spaces, achieving new state-of-the-art results of 47% SR and 41% SPL on the test server. + + + + Shortcut-V2V: Compression Framework for Video-to-Video Translation Based on Temporal Redundancy Reduction + http://openaccess.thecvf.com//content/ICCV2023/papers/Chung_Shortcut-V2V_Compression_Framework_for_Video-to-Video_Translation_Based_on_Temporal_Redundancy_ICCV_2023_paper.pdf + Video-to-video translation aims to generate video frames of a target domain from an input video. Despite its usefulness, the existing networks require enormous computations, necessitating their model compression for wide use. While there exist compression methods that improve computational efficiency in various image/video tasks, a generally applicable compression method for video-to-video translation has not been studied much. In response, we present Shortcut-V2V, a general-purpose compression framework for video-to-video translation. Shortcut-V2V avoids full inference for every neighboring video frame by approximating the intermediate features of a current frame from those of the previous frame. Moreover, in our framework, a newly-proposed block called AdaBD adaptively blends and deforms features of neighboring frames, which makes more accurate predictions of the intermediate features possible. We conduct quantitative and qualitative evaluations using well-known video-to-video translation models on various tasks to demonstrate the general applicability of our framework. The results show that Shortcut-V2V achieves comparable performance compared to the original video-to-video translation model while saving 3.2-5.7x computational cost and 7.8-44x memory at test time. Our code and videos are available at https://shortcut-v2v.github.io/. + + + + SG-Former: Self-guided Transformer with Evolving Token Reallocation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ren_SG-Former_Self-guided_Transformer_with_Evolving_Token_Reallocation_ICCV_2023_paper.pdf + Vision Transformer has demonstrated impressive success across various vision tasks. However, its heavy computation cost, which grows quadratically with respect to the token sequence length, largely limits its power in handling large feature maps. To alleviate the computation cost, previous works rely on either fine-grained self-attentions restricted to local small regions, or global self-attentions but to shorten the sequence length resulting in coarse granularity. In this paper, we propose a novel model, termed as Self-guided Transformer (SG-Former), towards effective global self-attention with adaptive fine granularity. At the heart of our approach is to utilize a significance map, which is estimated through hybrid-scale self-attention and evolves itself during training, to reallocate tokens based on the significance of each region. Intuitively, we assign more tokens to the salient regions for achieving fine-grained attention, while allocating fewer tokens to the minor regions in exchange for efficiency and global receptive fields. The proposed SG-Former achieves performance superior to state of the art: our base size model achieves 84.7% Top-1 accuracy on ImageNet-1K, 51.2mAP bbAP on CoCo, 52.7mIoU on ADE20K surpassing the Swin Transformer by +1.3% / +2.7 mAP/ +3 mIoU, with lower computation costs and fewer parameters. 
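The token reallocation idea above can be illustrated with a toy routine that keeps high-significance tokens at full resolution and average-pools the rest into coarser tokens, shortening the sequence. This is only a schematic of significance-guided reallocation under assumed shapes; the function name, the grouping scheme, and the significance scores are hypothetical and do not reproduce SG-Former's hybrid-scale attention.

```python
import torch

def reallocate_tokens(tokens, significance, keep_ratio=0.5, merge_group=4):
    """Toy significance-guided token reallocation (illustrative only).

    tokens:       (B, N, C) token embeddings.
    significance: (B, N) per-token importance, e.g. accumulated attention mass.
    keep_ratio:   fraction of tokens kept at full resolution.
    merge_group:  how many low-significance tokens are averaged into one coarse token.
    """
    B, N, C = tokens.shape
    n_keep = max(1, int(N * keep_ratio))
    order = significance.argsort(dim=1, descending=True)   # most salient first
    keep_idx, merge_idx = order[:, :n_keep], order[:, n_keep:]

    gather = lambda idx: tokens.gather(1, idx.unsqueeze(-1).expand(-1, -1, C))
    kept = gather(keep_idx)                                 # fine-grained tokens
    if merge_idx.shape[1] == 0:
        return kept
    rest = gather(merge_idx)

    # Pad so the low-significance tokens split evenly into merge groups.
    pad = (-rest.shape[1]) % merge_group
    if pad:
        rest = torch.cat([rest, rest[:, :pad]], dim=1)
    merged = rest.view(B, -1, merge_group, C).mean(dim=2)   # coarse tokens

    return torch.cat([kept, merged], dim=1)                 # shorter sequence
```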
+ + + + ProtoTransfer: Cross-Modal Prototype Transfer for Point Cloud Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_ProtoTransfer_Cross-Modal_Prototype_Transfer_for_Point_Cloud_Segmentation_ICCV_2023_paper.pdf + Knowledge transfer from multi-modal data, i.e., LiDAR points and images, to a single LiDAR modality can take advantage of complementary information from modal fusion while keeping single-modal inference speed, showing a promising direction for point cloud semantic segmentation in autonomous driving. Recent advances in point cloud segmentation distill knowledge from strictly aligned point-pixel fusion features while leaving a large number of unmatched image pixels unexplored and unmatched LiDAR points under-benefited. In this paper, we propose a novel approach, named ProtoTransfer, which not only fully exploits image representations but also transfers the learned multi-modal knowledge to all point cloud features. Specifically, based on the basic multi-modal learning framework, we build up a class-wise prototype bank from the strictly-aligned fusion features and encourage all the point cloud features to learn from the prototypes during model training. Moreover, to exploit the massive unmatched point and pixel features, we use a pseudo-labeling scheme and further accumulate these features into the class-wise prototype bank with a carefully designed fusion strategy. Without bells and whistles, our approach demonstrates superior performance over the published state-of-the-art methods on two large-scale benchmarks, i.e., nuScenes and SemanticKITTI, and ranks 2nd on the competitive nuScenes Lidarseg challenge leaderboard. + + + + Deep Image Harmonization with Globally Guided Feature Transformation and Relation Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Niu_Deep_Image_Harmonization_with_Globally_Guided_Feature_Transformation_and_Relation_ICCV_2023_paper.pdf + Given a composite image, image harmonization aims to adjust the foreground illumination to be consistent with the background. Previous methods have explored transforming foreground features to achieve competitive performance. In this work, we show that using global information to guide foreground feature transformation can achieve significant improvement. Besides, we propose to transfer the foreground-background relation from real images to composite images, which can provide intermediate supervision for the transformed encoder features. Additionally, considering the drawbacks of existing harmonization datasets, we also contribute the ccHarmony dataset, which simulates natural illumination variation. Extensive experiments on iHarmony4 and our contributed dataset demonstrate the superiority of our method. Our ccHarmony dataset is released at https://github.com/bcmi/Image-Harmonization-Dataset-ccHarmony. + + + + VQ3D: Learning a 3D-Aware Generative Model on ImageNet + http://openaccess.thecvf.com//content/ICCV2023/papers/Sargent_VQ3D_Learning_a_3D-Aware_Generative_Model_on_ImageNet_ICCV_2023_paper.pdf + Recent work has shown the possibility of training generative models of 3D content from 2D image collections on small datasets corresponding to a single object class, such as human faces, animal faces, or cars. However, these models struggle on larger, more complex datasets. To model diverse and unconstrained image collections such as ImageNet, we present VQ3D, which introduces a NeRF-based decoder into a two-stage vector-quantized autoencoder.
Our Stage 1 allows for the reconstruction of an input image and the ability to change the camera position around the image, and our Stage 2 allows for the generation of new 3D scenes. VQ3D is capable of generating and reconstructing 3D-aware images from the 1000-class ImageNet dataset of 1.2 million training images, and achieves a competitive ImageNet generation FID score of 16.8. + + + + 2D-3D Interlaced Transformer for Point Cloud Segmentation with Scene-Level Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_2D-3D_Interlaced_Transformer_for_Point_Cloud_Segmentation_with_Scene-Level_Supervision_ICCV_2023_paper.pdf + We present a Multimodal Interlaced Transformer (MIT) that jointly considers 2D and 3D data for weakly supervised point cloud segmentation. Research studies have shown that 2D and 3D features are complementary for point cloud segmentation. However, existing methods require extra 2D annotations to achieve 2D-3D information fusion. Considering the high annotation cost of point clouds, effective 2D and 3D feature fusion based on weakly supervised learning is in great demand. To this end, we propose a transformer model with two encoders and one decoder for weakly supervised point cloud segmentation using only scene-level class tags. Specifically, the two encoders compute the self-attended features for 3D point clouds and 2D multi-view images, respectively. The decoder implements interlaced 2D-3D cross-attention and carries out implicit 2D and 3D feature fusion. We alternately switch the roles of queries and key-value pairs in the decoder layers. It turns out that the 2D and 3D features are iteratively enriched by each other. Experiments show that it performs favorably against existing weakly supervised point cloud segmentation methods by a large margin on the S3DIS and ScanNet benchmarks. + + + + Collecting The Puzzle Pieces: Disentangled Self-Driven Human Pose Transfer by Permuting Textures + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Collecting_The_Puzzle_Pieces_Disentangled_Self-Driven_Human_Pose_Transfer_by_ICCV_2023_paper.pdf + Human pose transfer synthesizes new view(s) of a person for a given pose. Recent work achieves this via self-reconstruction, which disentangles a person's pose and texture information by breaking down the person into several parts, then recombines them to reconstruct the person. However, this part-level disentanglement preserves some pose information that can create unwanted artifacts. In this paper, we propose Pose Transfer by Permuting Textures, a self-driven human pose transfer approach that disentangles pose from texture at the patch-level. Specifically, we remove pose from an input image by permuting image patches so only texture information remains. Then we reconstruct the input image by sampling from the permuted textures to achieve patch-level disentanglement. To reduce the noise and recover clothing shape information from the permuted patches, we employ encoders with multiple kernel sizes in a triple branch network. Extensive experiments on DeepFashion and Market-1501 show that our model improves the quality of generated images in terms of FID, LPIPS and SSIM over other self-driven methods, and even outperforming some fully-supervised methods. A user study also shows that among self-driven approaches, images generated by our method are preferred in 68% of cases over prior work. Code is available at https://github.com/NannanLi999/pt_square. 
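The patch-permutation step at the heart of the approach above, removing pose by shuffling non-overlapping patches while keeping texture, can be sketched in a few lines of PyTorch. The helper below is an illustrative reading of that step under assumed tensor shapes, not the authors' training code.

```python
import torch

def permute_patches(img, patch=16, generator=None):
    """Shuffle non-overlapping patches of an image, keeping texture but destroying pose.

    img: (B, C, H, W) tensor with H and W divisible by `patch`.
    Returns an image of the same shape whose patches are randomly permuted per sample.
    """
    B, C, H, W = img.shape
    gh, gw = H // patch, W // patch
    # (B, C, gh, patch, gw, patch) -> (B, gh*gw, C, patch, patch)
    patches = (img.reshape(B, C, gh, patch, gw, patch)
                  .permute(0, 2, 4, 1, 3, 5)
                  .contiguous()
                  .view(B, gh * gw, C, patch, patch))
    for b in range(B):  # independent permutation per sample
        perm = torch.randperm(gh * gw, generator=generator)
        patches[b] = patches[b][perm]
    # Reassemble the permuted patches back into an image.
    out = (patches.view(B, gh, gw, C, patch, patch)
                  .permute(0, 3, 1, 4, 2, 5)
                  .contiguous()
                  .view(B, C, H, W))
    return out
```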
+ + + + Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Sound_Localization_from_Motion_Jointly_Learning_Sound_Direction_and_Camera_ICCV_2023_paper.pdf + The images and sounds that we perceive undergo subtle but geometrically consistent changes as we rotate our heads. In this paper, we use these cues to solve a problem we call Sound Localization from Motion (SLfM): jointly estimating camera rotation and localizing sound sources. We learn to solve these tasks solely through self-supervision. A visual model predicts camera rotation from a pair of images, while an audio model predicts the direction of sound sources from binaural sounds. We train these models to generate predictions that agree with one another. At test time, the models can be deployed independently. To obtain a feature representation that is well-suited to solving this challenging problem, we also propose a method for learning an audio-visual representation through cross-view binauralization: estimating binaural sound from one view, given images and sound from another. Our model can successfully estimate accurate rotations on both real and synthetic scenes, and localize sound sources with accuracy competitive with state-of-the-art self-supervised approaches. + + + + Prompt Tuning Inversion for Text-driven Image Editing Using Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Prompt_Tuning_Inversion_for_Text-driven_Image_Editing_Using_Diffusion_Models_ICCV_2023_paper.pdf + Recently large-scale language-image models (e.g., text-guided diffusion models) have considerably improved the image generation capabilities to generate photorealistic images in various domains. Based on this success, current image editing methods use texts to achieve intuitive and versatile modification of images. To edit a real image using diffusion models, one must first invert the image to a noisy latent from which an edited image is sampled with a target text prompt. However, most methods lack one of the following: user-friendliness (e.g., additional masks or precise descriptions of the input image are required), generalization to larger domains, or high fidelity to the input image. In this paper, we design an accurate and quick inversion technique, Prompt Tuning Inversion, for text-driven image editing. Specifically, our proposed editing method consists of a reconstruction stage and an editing stage. In the first stage, we encode the information of the input image into a learnable conditional embedding via Prompt Tuning Inversion. In the second stage, we apply classifier-free guidance to sample the edited image, where the conditional embedding is calculated by linearly interpolating between the target embedding and the optimized one obtained in the first stage. This technique ensures a superior trade-off between editability and high fidelity to the input image of our method. For example, we can change the color of a specific object while preserving its original shape and background under the guidance of only a target text prompt. Extensive experiments on ImageNet demonstrate the superior editing performance of our method compared to the state-of-the-art baselines. 
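The editing stage described above reduces to two standard operations: linearly interpolating between the optimized conditional embedding and the target text embedding, and applying classifier-free guidance at each denoising step. The sketch below shows one such step; the `unet` callable, the `uncond` embedding, and the default weights are assumptions for illustration rather than the authors' implementation.

```python
def edit_step(unet, x_t, t, cond_opt, cond_target, uncond,
              lam=0.7, guidance_scale=7.5):
    """One illustrative noise-prediction step for text-driven editing (not the authors' code).

    cond_opt:    conditional embedding recovered for the input image by inversion/prompt tuning.
    cond_target: embedding of the target edit prompt.
    uncond:      unconditional (e.g. empty-prompt) embedding.
    lam:         interpolation weight; lam=0 reconstructs, lam=1 follows the target only.
    """
    # Linear interpolation between the optimized and the target conditional embedding.
    cond = (1.0 - lam) * cond_opt + lam * cond_target

    # Classifier-free guidance: push the conditional prediction away from the
    # unconditional one by `guidance_scale`.
    eps_uncond = unet(x_t, t, uncond)
    eps_cond = unet(x_t, t, cond)
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
```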
+ + + + UnitedHuman: Harnessing Multi-Source Data for High-Resolution Human Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Fu_UnitedHuman_Harnessing_Multi-Source_Data_for_High-Resolution_Human_Generation_ICCV_2023_paper.pdf + Human generation has achieved significant progress. Nonetheless, existing methods still struggle to synthesize specific regions such as faces and hands. We argue that the main reason is rooted in the training data. A holistic human dataset inevitably has insufficient and low-resolution information on local parts. Therefore, we propose to use multi-source datasets with various resolution images to jointly learn a high-resolution human generative model. However, multi-source data inherently a) contains different parts that do not spatially align into a coherent human, and b) comes with different scales. To tackle these challenges, we propose an end-to-end framework, UnitedHuman, that empowers continuous GAN with the ability to effectively utilize multi-source data for high-resolution human generation. Specifically, 1) we design a Multi-Source Spatial Transformer that spatially aligns multi-source images to full-body space with a human parametric model. 2) Next, a continuous GAN is proposed with global-structural guidance and CutMix consistency. Patches from different datasets are then sampled and transformed to supervise the training of this scale-invariant generative model. Extensive experiments demonstrate that our model jointly learned from multi-source data achieves superior quality than those learned from a holistic dataset. + + + + Neural Microfacet Fields for Inverse Rendering + http://openaccess.thecvf.com//content/ICCV2023/papers/Mai_Neural_Microfacet_Fields_for_Inverse_Rendering_ICCV_2023_paper.pdf + We present Neural Microfacet Fields, a method for recovering materials, geometry (volumetric density), and environmental illumination from a collection of images of a scene. Our method applies a microfacet reflectance model within a volumetric setting by treating each sample along the ray as a surface, rather than an emitter. Using surface-based Monte Carlo rendering in a volumetric setting enables our method to perform inverse rendering efficiently and enjoy recent advances in volume rendering. Our approach obtains similar performance as state-of-the-art methods for novel view synthesis and outperforms prior work in inverse rendering, capturing high fidelity geometry and high frequency illumination details. + + + + Understanding Self-attention Mechanism via Dynamical System Perspective + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Understanding_Self-attention_Mechanism_via_Dynamical_System_Perspective_ICCV_2023_paper.pdf + The self-attention mechanism (SAM) is widely used in various fields of artificial intelligence and has successfully boosted the performance of different models. However, current explanations of this mechanism are mainly based on intuitions and experiences, while there still lacks direct modeling for how the SAM helps performance. To mitigate this issue, in this paper, based on the dynamical system perspective of the residual neural network, we first show that the intrinsic stiffness phenomenon (SP) in the high-precision solution of ordinary differential equations (ODEs) also widely exists in high-performance neural networks (NN). Thus the ability of NN to measure SP at the feature level is necessary to obtain high performance and is an important factor in the difficulty of training NN. 
Similar to the adaptive step-size method which is effective in solving stiff ODEs, we show that the SAM is also a stiffness-aware step size adaptor that can enhance the model's representational ability to measure intrinsic SP by refining the estimation of stiffness information and generating adaptive attention values, which provides a new understanding about why and how the SAM can benefit the model performance. This novel perspective can also explain the lottery ticket hypothesis in SAM, design new quantitative metrics of representational ability, and inspire a new theoretic-inspired approach, StepNet. Extensive experiments on several popular benchmarks demonstrate that StepNet can extract fine-grained stiffness information and measure SP accurately, leading to significant improvements in various visual tasks. + + + + DDG-Net: Discriminability-Driven Graph Network for Weakly-supervised Temporal Action Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_DDG-Net_Discriminability-Driven_Graph_Network_for_Weakly-supervised_Temporal_Action_Localization_ICCV_2023_paper.pdf + Weakly-supervised temporal action localization (WTAL) is a practical yet challenging task. Due to large-scale datasets, most existing methods use a network pretrained in other datasets to extract features, which are not suitable enough for WTAL. To address this problem, researchers design several modules for feature enhancement, which improve the performance of the localization module, especially modeling the temporal relationship between snippets. However, all of them omit that ambiguous snippets deliver contradictory information, which would reduce the discriminability of linked snippets. Considering this phenomenon, we propose Discriminability-Driven Graph Network (DDG-Net), which explicitly models ambiguous snippets and discriminative snippets with well-designed connections, preventing the transmission of ambiguous information and enhancing the discriminability of snippet-level representations. Additionally, we propose feature consistency loss to prevent the assimilation of features and drive the graph convolution network to generate more discriminative representations. Extensive experiments on THUMOS14 and ActivityNet1.2 benchmarks demonstrate the effectiveness of DDG-Net, establishing new state-of-the-art results on both datasets. Source code is available at https://github.com/XiaojunTang22/ICCV2023-DDGNet. + + + + Rethinking Data Distillation: Do Not Overlook Calibration + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Rethinking_Data_Distillation_Do_Not_Overlook_Calibration_ICCV_2023_paper.pdf + Neural networks trained on distilled data often produce over-confident output and require correction by calibration methods. Existing calibration methods such as temperature scaling and mixup work well for networks trained on original large-scale data. However, we find that these methods fail to calibrate networks trained on data distilled from large source datasets. In this paper, we show that distilled data lead to networks that are not calibratable due to (i) a more concentrated distribution of the maximum logits and (ii) the loss of information that is semantically meaningful but unrelated to classification tasks. To address this problem, we propose Masked Temperature Scaling (MTS) and Masked Distillation Training (MDT) which mitigate the limitations of distilled data and achieve better calibration results while maintaining the efficiency of dataset distillation. 
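For reference, the temperature scaling baseline that the abstract above builds on fits a single temperature by minimizing negative log-likelihood on held-out logits. The sketch below shows only that plain baseline; the masked variant (MTS) proposed in the paper is not reproduced here, and the optimizer settings are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Fit a single temperature T by minimizing NLL on held-out (logits, labels).

    logits: (N, K) pre-softmax outputs of the trained network on a validation set.
    labels: (N,) integer class labels.
    """
    log_t = torch.zeros(1, requires_grad=True)  # optimize log T so that T stays positive
    opt = torch.optim.Adam([log_t], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = F.cross_entropy(logits / log_t.exp(), labels)
        loss.backward()
        opt.step()
    return log_t.exp().item()

# Usage: T = fit_temperature(val_logits, val_labels); probs = (test_logits / T).softmax(-1)
```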
+ + + + Building Vision Transformers with Hierarchy Aware Feature Aggregation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Building_Vision_Transformers_with_Hierarchy_Aware_Feature_Aggregation_ICCV_2023_paper.pdf + Thanks to the excellent global modeling capability of attention mechanisms, the Vision Transformer has achieved better results than ConvNets in many computer vision tasks. However, in generating hierarchical feature maps, the Transformer still adopts the ConvNet feature aggregation scheme. This leads to the problem that the semantic information of image grid regions becomes confused after feature aggregation, making it difficult for attention to accurately model global relationships. To address this, we propose the Hierarchy Aware Feature Aggregation framework (HAFA). HAFA adaptively enhances the extraction of local features in shallow layers, where semantic information is weak, while aggregating patches with similar semantics in deep layers. The clear semantic information of the aggregated patches enables the attention mechanism to more accurately model global information at the semantic level. Extensive experiments show that the HAFA framework achieves significant improvements over baseline models in image classification, object detection, and semantic segmentation tasks. + + + + SAL-ViT: Towards Latency Efficient Private Inference on ViT using Selective Attention Search with a Learnable Softmax Approximation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_SAL-ViT_Towards_Latency_Efficient_Private_Inference_on_ViT_using_Selective_ICCV_2023_paper.pdf + Recently, private inference (PI) has addressed the rising concern over data and model privacy in machine learning inference as a service. However, existing PI frameworks suffer from high computational and communication overheads due to the expensive multi-party computation (MPC) protocols, particularly for large models such as vision transformers (ViT). The majority of this overhead is due to the encrypted softmax operation in each self-attention layer. In this work, we present SAL-ViT with two novel techniques to boost PI efficiency on ViTs. Our first technique is a learnable PI-efficient approximation to softmax, namely, learnable 2Quad (L2Q), that introduces learnable scaling and shifting parameters to the prior 2Quad softmax approximation, enabling improvement in accuracy. Then, given our observation that external attention (EA) presents lower PI latency than widely-adopted self-attention (SA) at the cost of accuracy, we present a selective attention search (SAS) method to integrate the strengths of EA and SA. Specifically, for a given lightweight EA ViT, we leverage a constrained optimization procedure to selectively search and replace EA modules with SA alternatives to maximize the accuracy. Our extensive experiments show that SAL-ViT achieves on average 1.28x, 1.28x, and 1.14x lower PI latency with 1.79%, 1.41%, and 2.08% higher accuracy compared to the existing alternatives on CIFAR-10, CIFAR-100, and Tiny-ImageNet, respectively. + + + + TIJO: Trigger Inversion with Joint Optimization for Defending Multimodal Backdoored Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Sur_TIJO_Trigger_Inversion_with_Joint_Optimization_for_Defending_Multimodal_Backdoored_ICCV_2023_paper.pdf + We present a Multimodal Backdoor defense technique TIJO (Trigger Inversion using Joint Optimization). Recently Walmer et al.
demonstrated successful backdoor attacks on multimodal models for the Visual Question Answering task. Their dual-key backdoor trigger is split across two modalities (image and text), such that the backdoor is activated if and only if the trigger is present in both modalities. We propose TIJO that defends against dual-key attacks through a joint optimization that reverse-engineers the trigger in both the image and text modalities. This joint optimization is challenging in multimodal models due to the disconnected nature of the visual pipeline which consists of an offline feature extractor, whose output is then fused with the text using a fusion module. The key insight enabling the joint optimization in TIJO is that the trigger inversion needs to be carried out in the object detection box feature space as opposed to the pixel space. We demonstrate the effectiveness of our method on the TrojVQA benchmark, where TIJO improves upon the state-of-the-art unimodal methods from an AUC of 0.6 to 0.92 on multimodal dual-key backdoors. Furthermore, our method also improves upon the unimodal baselines on unimodal backdoors. We also present detailed ablation studies as well as qualitative results to provide insights into our algorithm such as the critical importance of overlaying the inverted feature triggers on all visual features during trigger inversion. + + + + + + The Making and Breaking of Camouflage + http://openaccess.thecvf.com//content/ICCV2023/papers/Lamdouar_The_Making_and_Breaking_of_Camouflage_ICCV_2023_paper.pdf + Not all camouflages are equally effective, as even a partially visible contour or a slight color difference can make the animal stand out and break its camouflage. In this paper, we address the question of what makes a camouflage successful, by proposing three scores for automatically assessing its effectiveness. In particular, we show that camouflage can be measured by the similarity between background and foreground features and boundary visibility. We use these camouflage scores to assess and compare all available camouflage datasets. We also incorporate the proposed camouflage score into a generative model as an auxiliary loss and show that effective camouflage images or videos can be synthesised in a scalable manner. The generated synthetic dataset is used to train a transformer-based model for segmenting camouflaged animals in videos. Experimentally, we demonstrate state-of-the-art camouflage breaking performance on the public MoCA-Mask benchmark. + + + + Object as Query: Lifting Any 2D Object Detector to 3D Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Object_as_Query_Lifting_Any_2D_Object_Detector_to_3D_ICCV_2023_paper.pdf + 3D object detection from multi-view images has drawn much attention over the past few years. Existing methods mainly establish 3D representations from multi-view images and adopt a dense detection head for object detection, or employ object queries distributed in 3D space to localize objects. In this paper, we design Multi-View 2D Objects guided 3D Object Detector (MV2D), which can lift any 2D object detector to multi-view 3D object detection. Since 2D detections can provide valuable priors for object existence, MV2D exploits 2D detectors to generate object queries conditioned on the rich image semantics. These dynamically generated queries help MV2D to recall objects in the field of view and show a strong capability of localizing 3D objects. 
For the generated queries, we design a sparse cross attention module to force them to focus on the features of specific objects, which suppresses interference from noises. The evaluation results on the nuScenes dataset demonstrate the dynamic object queries and sparse feature aggregation can promote 3D detection capability. MV2D also exhibits a state-of-the-art performance among existing methods. We hope MV2D can serve as a new baseline for future research. + + + + Versatile Diffusion: Text, Images and Variations All in One Diffusion Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Versatile_Diffusion_Text_Images_and_Variations_All_in_One_Diffusion_ICCV_2023_paper.pdf + Recent advances in diffusion models have set an impressive milestone in many generation tasks, and trending works such as DALL-E2, Imagen, and Stable Diffusion have attracted great interest. Despite the rapid landscape changes, recent new approaches focus on extensions and performance rather than capacity, thus requiring separate models for separate tasks. In this work, we expand the existing single-flow diffusion pipeline into a multi-task multimodal network, dubbed Versatile Diffusion (VD), that handles multiple flows of text-to-image, image-to-text, and variations in one unified model. The pipeline design of VD instantiates a unified multi-flow diffusion framework, consisting of sharable and swappable layer modules that enable the crossmodal generality beyond images and text. Through extensive experiments, we demonstrate that VD successfully achieves the following: a) VD outperforms the baseline approaches and handles all its base tasks with competitive quality; b) VD enables novel extensions such as disentanglement of style and semantics, dual- and multi-context blending, etc.; c) The success of our multi-flow multimodal framework over images and text may inspire further diffusion-based universal AI research. Our code and models are open-sourced at https://github.com/SHI-Labs/Versatile-Diffusion. + + + + Sat2Density: Faithful Density Learning from Satellite-Ground Image Pairs + http://openaccess.thecvf.com//content/ICCV2023/papers/Qian_Sat2Density_Faithful_Density_Learning_from_Satellite-Ground_Image_Pairs_ICCV_2023_paper.pdf + This paper aims to develop an accurate 3D geometry representation of satellite images using satellite-ground image pairs. Our focus is on the challenging problem of 3D-aware ground-views synthesis from a satellite image. We draw inspiration from the density field representation used in volumetric neural rendering and propose a new approach, called Sat2Density. Our method utilizes the properties of ground-view panoramas for the sky and non-sky regions to learn faithful density fields of 3D scenes in a geometric perspective. Unlike other methods that require extra depth information during training, our Sat2Density can automatically learn accurate and faithful 3D geometry via density representation without depth supervision. This advancement significantly improves the ground-view panorama synthesis task. Additionally, our study provides a new geometric perspective to understand the relationship between satellite and ground-view images in 3D space. + + + + Expressive Text-to-Image Generation with Rich Text + http://openaccess.thecvf.com//content/ICCV2023/papers/Ge_Expressive_Text-to-Image_Generation_with_Rich_Text_ICCV_2023_paper.pdf + Plain text has become a prevalent interface for text-to-image synthesis. 
However, its limited customization options hinder users from accurately describing desired outputs. For example, plain text makes it hard to specify continuous quantities, such as the precise RGB color value or importance of each word. Furthermore, creating detailed text prompts for complex scenes is tedious for humans to write and challenging for text encoders to interpret. To address these challenges, we propose using a rich-text editor supporting formats such as font style, size, color, and footnote. We extract each word's attributes from rich text to enable local style control, explicit token reweighting, precise color rendering, and detailed region synthesis. We achieve these capabilities through a region-based diffusion process. We first obtain each word's region based on attention maps of a diffusion process using plain text. For each region, we enforce its text attributes by creating region-specific detailed prompts and applying region-specific guidance, and maintain its fidelity against plain-text generation through region-based injections. We present various examples of image generation from rich text and demonstrate that our method outperforms strong baselines with quantitative evaluations. + + + + Text-Driven Generative Domain Adaptation with Spectral Consistency Regularization + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Text-Driven_Generative_Domain_Adaptation_with_Spectral_Consistency_Regularization_ICCV_2023_paper.pdf + Combined with the generative prior of pre-trained models and the flexibility of text, text-driven generative domain adaptation can generate images from a wide range of target domains. However, current methods still suffer from overfitting and the mode collapse problem. In this paper, we analyze the mode collapse from the geometric point of view and reveal its relationship to the Hessian matrix of generator. To alleviate it, we propose the spectral consistency regularization to preserve the diversity of source domain without restricting the semantic adaptation to target domain. We also design granularity adaptive regularization to flexibly control the balance between diversity and stylization for target model. We conduct experiments for broad target domains compared with state-of-the-art methods and extensive ablation studies. The experiments demonstrate the effectiveness of our method to preserve the diversity of source domain and generate high fidelity target images. + + + + Neural Reconstruction of Relightable Human Model from Monocular Video + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Neural_Reconstruction_of_Relightable_Human_Model_from_Monocular_Video_ICCV_2023_paper.pdf + Creating relightable and animatable human characters from monocular video at a low cost is a critical task for digital human modeling and virtual reality applications. This task is complex due to intricate articulation motion, a wide range of ambient lighting conditions, and pose-dependent clothing deformations. In this paper, we introduce a novel self-supervised framework that takes a monocular video of a moving human as input and generates a 3D neural representation capable of being rendered with novel poses under arbitrary lighting conditions. Our framework decomposes dynamic humans under varying illumination into neural fields in canonical space, taking into account geometry and spatially varying BRDF material properties. Additionally, we introduce pose-driven deformation fields, enabling bidirectional mapping between canonical space and observation. 
Leveraging the proposed appearance decomposition and deformation fields, our framework learns in a self-supervised manner. Ultimately, based on pose-driven deformation, recovered appearance, and physically-based rendering, the reconstructed human figure becomes relightable and can be explicitly driven by novel poses. We demonstrate significant performance improvements over previous works and provide compelling examples of relighting from monocular videos of moving humans in challenging, uncontrolled capture scenarios. + + + + FB-BEV: BEV Representation from Forward-Backward View Transformations + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_FB-BEV_BEV_Representation_from_Forward-Backward_View_Transformations_ICCV_2023_paper.pdf + View Transformation Module (VTM), where transformations happen between multi-view image features and Bird-Eye-View (BEV) representation, is a crucial step in camera-based BEV perception systems. Currently, the two most prominent VTM paradigms are forward projection and backward projection. Forward projection, represented by Lift-Splat-Shoot, leads to sparsely projected BEV features without post-processing. Backward projection, with BEVFormer being an example, tends to generate false-positive BEV features from incorrect projections due to the lack of utilization on depth. To address the above limitations, we propose a novel forward-backward view transformation module. Our approach compensates for the deficiencies in both existing methods, allowing them to enhance each other to obtain higher quality BEV representations mutually. We instantiate the proposed module with FB-BEV, which achieves a new state-of-the-art result of 62.4% NDS on the nuScenes test set. Code and models are available at https://github.com/NVlabs/FB-BEV + + + + BoxSnake: Polygonal Instance Segmentation with Box Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_BoxSnake_Polygonal_Instance_Segmentation_with_Box_Supervision_ICCV_2023_paper.pdf + Box-supervised instance segmentation has gained much attention as it requires only simple box annotations instead of costly mask or polygon annotations. However, existing box-supervised instance segmentation models mainly focus on mask-based frameworks. We propose a new end-to-end training technique, termed BoxSnake, to achieve effective polygonal instance segmentation using only box annotations for the first time. Our method consists of two loss functions: (1) a point-based unary loss that constrains the bounding box of predicted polygons to achieve coarse-grained segmentation; and (2) a distance-aware pairwise loss that encourages the predicted polygons to fit the object boundaries. Compared with the mask-based weakly-supervised methods, BoxSnake further reduces the performance gap between the predicted segmentation and the bounding box, and shows significant superiority on the Cityscapes dataset. + + + + ClimateNeRF: Extreme Weather Synthesis in Neural Radiance Field + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_ClimateNeRF_Extreme_Weather_Synthesis_in_Neural_Radiance_Field_ICCV_2023_paper.pdf + Physical simulations produce excellent predictions of weather effects. Neural radiance fields produce SOTA scene models. We describe a novel NeRF-editing procedure that can fuse physical simulations with NeRF models of scenes, producing realistic movies of physical phenomena in those scenes. Our application -- Climate NeRF -- allows people to visualize what climate change outcomes will do to them. 
ClimateNeRF allows us to render realistic weather effects, including smog, snow, and flood. Results can be controlled with physically meaningful variables like water level. Qualitative and quantitative studies show that our simulated results are significantly more realistic than those from SOTA 2D image editing and SOTA 3D NeRF stylization. + + + + Monte Carlo Linear Clustering with Single-Point Supervision is Enough for Infrared Small Target Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Monte_Carlo_Linear_Clustering_with_Single-Point_Supervision_is_Enough_for_ICCV_2023_paper.pdf + Single-frame infrared small target (SIRST) detection aims at separating small targets from cluttered backgrounds in infrared images. Recently, deep learning-based methods have achieved promising performance on SIRST detection, but at the cost of a large amount of training data with expensive pixel-level annotations. To reduce the annotation burden, we propose the first method to achieve SIRST detection with single-point supervision. The core idea of this work is to recover the per-pixel mask of each target from the given single point label by using clustering approaches, which looks simple but is indeed challenging since targets are always inconspicuous and accompanied by background clutter. To handle this issue, we introduce randomness to the clustering process by adding noise to the input images, and then obtain much more reliable pseudo masks by averaging the clustered results. Thanks to this "Monte Carlo" clustering approach, our method can accurately recover pseudo masks and thus turn arbitrary fully supervised SIRST detection networks into weakly supervised ones with only single point annotation. Experiments on four datasets demonstrate that our method can be applied to existing SIRST detection networks to achieve performance comparable to their fully-supervised counterparts, which reveals that single-point supervision is strong enough for SIRST detection. + + + + Practical Membership Inference Attacks Against Large-Scale Multi-Modal Models: A Pilot Study + http://openaccess.thecvf.com//content/ICCV2023/papers/Ko_Practical_Membership_Inference_Attacks_Against_Large-Scale_Multi-Modal_Models_A_Pilot_ICCV_2023_paper.pdf + Membership inference attacks (MIAs) aim to infer whether a data point has been used to train a machine learning model. These attacks can be employed to identify potential privacy vulnerabilities and detect unauthorized use of personal data. While MIAs have traditionally been studied for simple classification models, recent advancements in multi-modal pre-training, such as CLIP, have demonstrated remarkable zero-shot performance across a range of computer vision tasks. However, the sheer scale of data and models presents significant computational challenges for performing the attacks. This paper takes a first step towards developing practical MIAs against large-scale multi-modal models. We introduce a simple baseline strategy that thresholds the cosine similarity between text and image features of a target point, and propose further enhancing the baseline by aggregating cosine similarity across transformations of the target. We also present a new weakly supervised attack method that leverages ground-truth non-members (e.g., obtained by using the publication date of a target model and the timestamps of the open data) to further enhance the attack.
Our evaluation shows that CLIP models are susceptible to our attack strategies, with our simple baseline achieving over 75% membership identification accuracy. Furthermore, our enhanced attacks outperform the baseline across multiple models and datasets, with the weakly supervised attack demonstrating an average-case performance improvement of 17% and being at least 7X more effective at low false-positive rates. These findings highlight the importance of protecting the privacy of multi-modal foundational models, which were previously assumed to be less susceptible to MIAs due to less overfitting. The reach of the results presents unique challenges and insights for the broader community to address multi-modal privacy concerns. + + + + TCOVIS: Temporally Consistent Online Video Instance Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_TCOVIS_Temporally_Consistent_Online_Video_Instance_Segmentation_ICCV_2023_paper.pdf + In recent years, significant progress has been made in video instance segmentation (VIS), with many offline and online methods achieving state-of-the-art performance. While offline methods have the advantage of producing temporally consistent predictions, they are not suitable for real-time scenarios. Conversely, online methods are more practical, but maintaining temporal consistency remains a challenging task. In this paper, we propose a novel online method for video instance segmentation, called TCOVIS, which fully exploits the temporal information in a video clip. The core of our method consists of a global instance assignment strategy and a spatio-temporal enhancement module, which improve the temporal consistency of the features from two aspects. Specifically, we perform global optimal matching between the predictions and ground truth across the whole video clip, and supervise the model with the global optimal objective. We also capture the spatial feature and aggregate it with the semantic feature between frames, thus realizing the spatio-temporal enhancement. We evaluate our method on four widely adopted VIS benchmarks, namely YouTube-VIS 2019/2021/2022 and OVIS, and achieve state-of-the-art performance on all benchmarks without bells-and-whistles. For instance, on YouTube-VIS 2021, TCOVIS achieves 49.5 AP and 61.3 AP with ResNet-50 and Swin-L backbones, respectively. + + + + Long-Term Photometric Consistent Novel View Synthesis with Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Long-Term_Photometric_Consistent_Novel_View_Synthesis_with_Diffusion_Models_ICCV_2023_paper.pdf + Novel view synthesis from a single input image is a challenging task, where the goal is to generate a new view of a scene from a desired camera pose that may be separated by a large motion. The highly uncertain nature of this synthesis task due to unobserved elements within the scene (i.e. occlusion) and outside the field-of-view makes the use of generative models appealing to capture the variety of possible outputs. In this paper, we propose a novel generative model capable of producing a sequence of photorealistic images consistent with a specified camera trajectory, and a single starting image. Our approach is centred on an autoregressive conditional diffusion-based model capable of interpolating visible scene elements, and extrapolating unobserved regions in a view, in a geometrically consistent manner. Conditioning is limited to an image capturing a single camera view and the (relative) pose of the new camera view. 
To measure the consistency over a sequence of generated views, we introduce a new metric, the thresholded symmetric epipolar distance (TSED), to measure the number of consistent frame pairs in a sequence. While previous methods have been shown to produce high quality images and consistent semantics across pairs of views, we show empirically with our metric that they are often inconsistent with the desired camera poses. In contrast, we demonstrate that our method produces both photorealistic and view-consistent imagery. Additional material is available on our project page: https://yorkucvil.github.io/Photoconsistent-NVS/. + + + + Benchmarking Algorithmic Bias in Face Recognition: An Experimental Approach Using Synthetic Faces and Human Evaluation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_Benchmarking_Algorithmic_Bias_in_Face_Recognition_An_Experimental_Approach_Using_ICCV_2023_paper.pdf + We propose an experimental method for measuring bias in face recognition systems. Existing methods to measure bias depend on benchmark datasets that are collected in the wild and annotated for protected (e.g., race, gender) and non-protected (e.g., pose, lighting) attributes. Such observational datasets only permit correlational conclusions, e.g., "Algorithm A's accuracy is different on female and male faces in dataset X.". By contrast, experimental methods manipulate attributes individually and thus permit causal conclusions, e.g., "Algorithm A's accuracy is affected by gender and skin color." Our method is based on generating synthetic faces using a neural face generator, where each attribute of interest is modified independently while leaving all other attributes constant. Human observers crucially provide the ground truth on perceptual identity similarity between synthetic image pairs. We validate our method quantitatively by evaluating race and gender biases of three research-grade face recognition models. Our synthetic pipeline reveals that for these algorithms, accuracy is lower for Black and East Asian population subgroups. Our method can also quantify how perceptual changes in attributes affect face identity distances reported by these models. Our large synthetic dataset, consisting of 48,000 synthetic face image pairs (10,200 unique synthetic faces) and 555,000 human annotations (individual attributes and pairwise identity comparisons) is available to researchers in this important area. + + + + Spatial-Aware Token for Weakly Supervised Object Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Spatial-Aware_Token_for_Weakly_Supervised_Object_Localization_ICCV_2023_paper.pdf + Weakly supervised object localization (WSOL) is a challenging task aiming to localize objects with only image-level supervision. Recent works apply visual transformer to WSOL and achieve significant success by exploiting the long-range feature dependency in self-attention mechanism. However, existing transformer-based methods synthesize the classification feature maps as the localization map, which leads to optimization conflicts between classification and localization tasks. To address this problem, we propose to learn a task-specific spatial-aware token (SAT) to condition localization in a weakly supervised manner. Specifically, a spatial token is first introduced in the input space to aggregate representations for localization task. 
Then a spatial aware attention module is constructed, which allows spatial token to generate foreground probabilities of different patches by querying and to extract localization knowledge from the classification task. Besides, for the problem of sparse and unbalanced pixel-level supervision obtained from the image-level label, two spatial constraints, including batch area loss and normalization loss, are designed to compensate and enhance this supervision. Experiments show that the proposed SAT achieves state-of-the-art performance on both CUB-200 and ImageNet, with 98.45% and 73.13% GT-known Loc, respectively. Even under the extreme setting of using only 1 image per class from ImageNet for training, SAT already exceeds the SOTA method by 2.1% GT-known Loc. Code and models are available at https://github.com/wpy1999/SAT. + + + + Harnessing the Spatial-Temporal Attention of Diffusion Models for High-Fidelity Text-to-Image Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Harnessing_the_Spatial-Temporal_Attention_of_Diffusion_Models_for_High-Fidelity_Text-to-Image_ICCV_2023_paper.pdf + Diffusion-based models have achieved state-of-the-art performance on text-to-image synthesis tasks. However, one critical limitation of these models is the low fidelity of generated images with respect to the text description, such as missing objects, mismatched attributes, and mislocated objects. One key reason for such inconsistencies is the inaccurate cross-attention to text in both the spatial dimension, which controls at what pixel region an object should appear, and the temporal dimension, which controls how different levels of details are added through the denoising steps. In this paper, we propose a new text-to-image algorithm that adds explicit control over spatial-temporal cross-attention in diffusion models. We first utilize a layout predictor to predict the pixel regions for objects mentioned in the text. We then impose spatial attention control by combining the attention over the entire text description and that over the local description of the particular object in the corresponding pixel region of that object. The temporal attention control is further added by allowing the combination weights to change at each denoising step, and the combination weights are optimized to ensure high fidelity between the image and the text. Experiments show that our method generates images with higher fidelity compared to diffusion-model-based baselines without fine-tuning the diffusion model. Our code is publicly available. + + + + GraphAlign: Enhancing Accurate Feature Alignment by Graph matching for Multi-Modal 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_GraphAlign_Enhancing_Accurate_Feature_Alignment_by_Graph_matching_for_Multi-Modal_ICCV_2023_paper.pdf + LiDAR and cameras are complementary sensors for 3D object detection in autonomous driving. However, it is challenging to explore the unnatural interaction between point clouds and images, and the critical factor is how to conduct feature alignment of heterogeneous modalities. Currently, many methods achieve feature alignment by projection calibration only, without considering the problem of coordinate conversion accuracy errors between sensors, leading to sub-optimal performance. In this paper, we present GraphAlign, a more accurate feature alignment strategy for 3D object detection by graph matching. 
Specifically, we fuse image features from a semantic segmentation encoder in the image branch and point cloud features from a 3D Sparse CNN in the LiDAR branch. To save computation, we construct the nearest neighbor relationship by calculating the Euclidean distance within the subspaces into which the point cloud features are divided. Through the projection calibration between the image and point cloud, we project the nearest neighbors of point cloud features onto the image features. Then, by matching the nearest neighbors of a single point cloud feature to multiple image features, we search for a more appropriate feature alignment. In addition, we provide a self-attention module to enhance the weights of significant relations to fine-tune the feature alignment between heterogeneous modalities. Extensive experiments on the nuScenes benchmark demonstrate the effectiveness and efficiency of our GraphAlign. + + + + NEMTO: Neural Environment Matting for Novel View and Relighting Synthesis of Transparent Objects + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_NEMTO_Neural_Environment_Matting_for_Novel_View_and_Relighting_Synthesis_ICCV_2023_paper.pdf + We propose NEMTO, the first end-to-end neural rendering pipeline to model 3D transparent objects with complex geometry and unknown indices of refraction. Commonly used appearance modeling such as the Disney BSDF model cannot accurately address this challenging problem due to the complex light paths bending through refractions and the strong dependency of surface appearance on illumination. With 2D images of the transparent object as input, our method is capable of high-quality novel view and relighting synthesis. We leverage implicit Signed Distance Functions (SDF) to model the object geometry and propose a refraction-aware ray bending network to model the effects of light refraction within the object. Our ray bending network is more tolerant to geometric inaccuracies than traditional physically-based methods for rendering transparent objects. We provide extensive evaluations on both synthetic and real-world datasets to demonstrate our high-quality synthesis and the applicability of our method. + + + + USAGE: A Unified Seed Area Generation Paradigm for Weakly Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Peng_USAGE_A_Unified_Seed_Area_Generation_Paradigm_for_Weakly_Supervised_ICCV_2023_paper.pdf + Seed area generation is usually the starting point of weakly supervised semantic segmentation (WSSS). Computing the Class Activation Map (CAM) from a multi-label classification network is the de facto paradigm for seed area generation, but CAMs generated from Convolutional Neural Networks (CNNs) and Transformers are prone to be under- and over-activated, respectively, which makes the strategies to refine CAMs for CNNs usually inappropriate for Transformers, and vice versa. In this paper, we propose a Unified optimization paradigm for Seed Area GEneration (USAGE) for both types of networks, in which the objective function to be optimized consists of two terms: One is a generation loss, which controls the shape of seed areas by a temperature parameter following a deterministic principle for different types of networks; The other is a regularization loss, which ensures the consistency between the seed areas that are generated by self-adaptive network adjustment from different views, to overturn false activation in seed areas.
Experimental results show that USAGE consistently improves seed area generation for both CNNs and Transformers by large margins, e.g., outperforming state-of-the-art methods by an mIoU of 4.1% on PASCAL VOC. Moreover, based on the USAGE generated seed areas on Transformers, we achieve state-of-the-art WSSS results on both PASCAL VOC and MS COCO. + + + + NeuS2: Fast Learning of Neural Implicit Surfaces for Multi-view Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_NeuS2_Fast_Learning_of_Neural_Implicit_Surfaces_for_Multi-view_Reconstruction_ICCV_2023_paper.pdf + Recent methods for neural surface representation and rendering, for example NeuS, have demonstrated the remarkably high-quality reconstruction of static scenes. However, the training of NeuS takes an extremely long time (8 hours), which makes it almost impossible to apply them to dynamic scenes with thousands of frames. We propose a fast neural surface reconstruction approach, called NeuS2, which achieves two orders of magnitude improvement in terms of acceleration without compromising reconstruction quality. To accelerate the training process, we parameterize a neural surface representation by multi-resolution hash encodings and present a novel lightweight calculation of second-order derivatives tailored to our networks to leverage CUDA parallelism, achieving a factor two speed up. To further stabilize and expedite training, a progressive learning strategy is proposed to optimize multi-resolution hash encodings from coarse to fine. We extend our method for fast training of dynamic scenes, with a proposed incremental training strategy and a novel global transformation prediction component, which allow our method to handle challenging long sequences with large movements and deformations. Our experiments on various datasets demonstrate that NeuS2 significantly outperforms the state-of-the-arts in both surface reconstruction accuracy and training speed for both static and dynamic scenes. The code is available at our website: https://vcai.mpi-inf.mpg.de/projects/NeuS2/. + + + + Gender Artifacts in Visual Datasets + http://openaccess.thecvf.com//content/ICCV2023/papers/Meister_Gender_Artifacts_in_Visual_Datasets_ICCV_2023_paper.pdf + Gender biases are known to exist within large-scale visual datasets and can be reflected or even amplified in downstream models. Many prior works have proposed methods for mitigating gender biases, often by attempting to remove gender expression information from images. To understand the feasibility and practicality of these approaches, we investigate what "gender artifacts" exist in large-scale visual datasets. We define a "gender artifact" as a visual cue correlated with gender , focusing specifically on cues that are learnable by a modern image classifier and have an interpretable human corollary. Through our analyses, we find that gender artifacts are ubiquitous in the COCO and OpenImages datasets, occurring everywhere from low-level information (e.g., the mean value of the color channels) to higher-level image composition (e.g., pose and location of people). Further, bias mitigation methods that attempt to remove gender actually remove more information from the scene than the person. Given the prevalence of gender artifacts, we claim that attempts to remove these artifacts from such datasets are largely infeasible as certain removed artifacts may be necessary for the downstream task of object recognition. 
Instead, the responsibility lies with researchers and practitioners to be aware that the distribution of images within datasets is highly gendered and hence develop fairness-aware methods which are robust to these distributional shifts across groups. + + + + SuS-X: Training-Free Name-Only Transfer of Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Udandarao_SuS-X_Training-Free_Name-Only_Transfer_of_Vision-Language_Models_ICCV_2023_paper.pdf + Contrastive Language-Image Pre-training (CLIP) has emerged as a simple yet effective way to train large-scale vision-language models. CLIP demonstrates impressive zero-shot classification and retrieval performance on diverse downstream tasks. However, to leverage its full potential, fine-tuning still appears to be necessary. Fine-tuning the entire CLIP model can be resource-intensive and unstable. Moreover, recent methods that aim to circumvent this need for fine-tuning still require access to images from the target distribution. In this paper, we pursue a different approach and explore the regime of training-free "name-only transfer" in which the only knowledge we possess about downstream tasks comprises the names of downstream target categories. We propose a novel method, SuS-X, consisting of two key building blocks--"SuS" and "TIP-X", that requires neither intensive fine-tuning nor costly labelled data. SuS-X achieves state-of-the-art (SoTA) zero-shot classification results on 19 benchmark datasets. We further show the utility of TIP-X in the training-free few-shot setting, where we again achieve SoTA results over strong training-free baselines. + + + + Beating Backdoor Attack at Its Own Game + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Beating_Backdoor_Attack_at_Its_Own_Game_ICCV_2023_paper.pdf + Deep neural networks (DNNs) are vulnerable to backdoor attack, which does not affect the network's performance on clean data but would manipulate the network behavior once a trigger pattern is added. Existing defense methods have greatly reduced attack success rate, but their prediction accuracy on clean data still lags behind a clean model by a large margin. Inspired by the stealthiness and effectiveness of backdoor attack, we propose a simple but highly effective defense framework which injects non-adversarial backdoors targeting poisoned samples. Following the general steps in backdoor attack, we detect a small set of suspected samples and then apply a poisoning strategy to them. The non-adversarial backdoor, once triggered, suppresses the attacker's backdoor on poisoned data, but has limited influence on clean data. The defense can be carried out during data preprocessing, without any modification to the standard end-to-end training pipeline. We conduct extensive experiments on multiple benchmarks with different architectures and representative attacks. Results demonstrate that our method achieves state-of-the-art defense effectiveness with by far the lowest performance drop on clean data. Considering the surprising defense ability displayed by our framework, we call for more attention to utilizing backdoor for backdoor defense. Code is available at https://github.com/damianliumin/non-adversarial_backdoor. + + + + Do DALL-E and Flamingo Understand Each Other? 
+ http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Do_DALL-E_and_Flamingo_Understand_Each_Other_ICCV_2023_paper.pdf + The field of multimodal research focusing on the comprehension and creation of both images and text has witnessed significant strides. This progress is exemplified by the emergence of sophisticated models dedicated to image captioning at scale, such as the notable Flamingo model and text-to-image generative models, with DALL-E serving as a prominent example. An interesting question worth exploring in this domain is whether Flamingo and DALL-E understand each other. To study this question, we propose a reconstruction task where Flamingo generates a description for a given image and DALL-E uses this description as input to synthesize a new image. We argue that these models understand each other if the generated image is similar to the given image. Specifically, we study the relationship between the quality of the image reconstruction and that of the text generation. We find that an optimal description of an image is one that gives rise to a generated image similar to the original one. The finding motivates us to propose a unified framework to finetune the text-to-image and image-to-text models. Concretely, the reconstruction part forms a regularization loss to guide the tuning of the models. Extensive experiments on multiple datasets with different image captioning and image generation models validate our findings and demonstrate the effectiveness of our proposed unified framework. As DALL-E and Flamingo are not publicly available, we use Stable Diffusion and BLIP in the remaining work. Project website: https://dalleflamingo.github.io. + + + + Prototype-based Dataset Comparison + http://openaccess.thecvf.com//content/ICCV2023/papers/van_Noord_Protoype-based_Dataset_Comparison_ICCV_2023_paper.pdf + Dataset summarisation is a fruitful approach to dataset inspection. However, when applied to a single dataset the discovery of visual concepts is restricted to those most prominent. We argue that a comparative approach can expand upon this paradigm to enable richer forms of dataset inspection that go beyond the most prominent concepts. To enable dataset comparison we present a module that learns concept-level prototypes across datasets. We leverage self-supervised learning to discover these prototypes without supervision, and we demonstrate the benefits of our approach in two case-studies. Our findings show that dataset comparison extends dataset inspection and we hope to encourage more works in this direction. Code and usage instructions available at https://github.com/Nanne/ProtoSim + + + + FreeCOS: Self-Supervised Learning from Fractals and Unlabeled Images for Curvilinear Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_FreeCOS_Self-Supervised_Learning_from_Fractals_and_Unlabeled_Images_for_Curvilinear_ICCV_2023_paper.pdf + Curvilinear object segmentation is critical for many applications. However, manually annotating curvilinear objects is very time-consuming and error-prone, yielding insufficiently available annotated datasets for existing supervised methods and domain adaptation methods. This paper proposes a self-supervised curvilinear object segmentation method (FreeCOS) that learns robust and distinctive features from fractals and unlabeled images. The key contributions include a novel Fractal-FDA synthesis (FFS) module and a geometric information alignment (GIA) approach. 
FFS generates curvilinear structures based on the parametric Fractal L-system and integrates the generated structures into unlabeled images to obtain synthetic training images via Fourier Domain Adaptation. GIA reduces the intensity differences between the synthetic and unlabeled images by comparing the intensity order of a given pixel to the values of its nearby neighbors. Such image alignment can explicitly remove the dependency on absolute intensity values and enhance the inherent geometric characteristics which are common in both synthetic and real images. In addition, GIA aligns features of synthetic and real images via the prediction space adaptation loss (PSAL) and the curvilinear mask contrastive loss (CMCL). Extensive experimental results on four public datasets, i.e., XCAD, DRIVE, STARE and CrackTree demonstrate that our method outperforms the state-of-the-art unsupervised methods, self-supervised methods and traditional methods by a large margin. The source code of this work is available at https://github.com/TY-Shi/FreeCOS. + + + + Generating Dynamic Kernels via Transformers for Lane Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Generating_Dynamic_Kernels_via_Transformers_for_Lane_Detection_ICCV_2023_paper.pdf + State-of-the-art lane detection methods often rely on specific knowledge about lanes -- such as straight lines and parametric curves -- to detect lane lines. While the specific knowledge can ease the modeling process, it poses challenges in handling lane lines with complex topologies (e.g., dense, forked, curved, etc.). Recently, dynamic convolution-based methods have shown promising performance by utilizing the features from some key locations of a lane line, such as the starting point, as convolutional kernels, and convoluting them with the whole feature map to detect lane lines. While such methods reduce the reliance on specific knowledge, the kernels computed from the key locations fail to capture the lane line's global structure due to its long and thin structure, leading to inaccurate detection of lane lines with complex topologies. In addition, the kernels resulting from the key locations are sensitive to occlusion and lane intersections. To overcome these limitations, we propose a transformer-based dynamic kernel generation architecture for lane detection. It utilizes a transformer to generate dynamic convolutional kernels for each lane line in the input image, and then detect these lane lines with dynamic convolution. Compared to the kernels generated from the key locations of a lane line, the kernels generated with the transformer can capture the lane line's global structure from the whole feature map, enabling them to effectively handle occlusions and lane lines with complex topologies. We evaluate our method on three lane detection benchmarks, and the results demonstrate its state-of-the-art performance. Specifically, our method achieves an F1 score of 63.40 on OpenLane and 88.47 on CurveLanes, surpassing the state of the art by 4.30 and 2.37 points, respectively. + + + + Boosting Long-tailed Object Detection via Step-wise Learning on Smooth-tail Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Boosting_Long-tailed_Object_Detection_via_Step-wise_Learning_on_Smooth-tail_Data_ICCV_2023_paper.pdf + Real-world data tends to follow a long-tailed distribution, where the class imbalance results in dominance of the head classes during training. 
In this paper, we propose a frustratingly simple but effective step-wise learning framework to gradually enhance the capability of the model in detecting all categories of long-tailed datasets. Specifically, we build smooth-tail data where the long-tailed distribution of categories decays smoothly to correct the bias towards head classes. We pre-train a model on the whole long-tailed data to preserve discriminability between all categories. We then fine-tune the class-agnostic modules of the pre-trained model on the head class dominant replay data to get a head class expert model with improved decision boundaries from all categories. Finally, we train a unified model on the tail class dominant replay data while transferring knowledge from the head class expert model to ensure accurate detection of all categories. Extensive experiments on the long-tailed datasets LVIS v0.5 and LVIS v1.0 demonstrate the superior performance of our method, where we improve the AP with a ResNet-50 backbone from 27.0% to 30.3%, and especially for the rare categories from 15.5% to 24.9%. Our best model using a ResNet-101 backbone achieves 30.7% AP, which surpasses all existing detectors using the same backbone. + + + + Talking Head Generation with Probabilistic Audio-to-Visual Diffusion Priors + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Talking_Head_Generation_with_Probabilistic_Audio-to-Visual_Diffusion_Priors_ICCV_2023_paper.pdf + We introduce a novel framework for one-shot audio-driven talking head generation. Unlike prior works that require additional driving sources for controlled synthesis in a deterministic manner, we instead sample all holistic lip-irrelevant facial motions (i.e. pose, expression, blink, gaze, etc.) to semantically match the input audio while still maintaining both the photo-realism of audio-lip synchronization and overall naturalness. This is achieved by our newly proposed audio-to-visual diffusion prior trained on top of the mapping between audio and non-lip representations. Thanks to the probabilistic nature of the diffusion prior, one big advantage of our framework is that it can synthesize diverse facial motion sequences given the same audio clip, which is quite user-friendly for many real applications. Through comprehensive evaluations on public benchmarks, we conclude that (1) our diffusion prior outperforms an auto-regressive prior significantly on all the concerned metrics; (2) our overall system is competitive with prior works in terms of audio-lip synchronization but can effectively sample rich and natural-looking lip-irrelevant facial motions while still being semantically harmonized with the audio input. + + + + Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Learning_Cross-Modal_Affinity_for_Referring_Video_Object_Segmentation_Targeting_Limited_ICCV_2023_paper.pdf + Referring video object segmentation (RVOS), as a supervised learning task, relies on sufficient annotated data for a given scene. However, in more realistic scenarios, only minimal annotations are available for a new scene, which poses significant challenges to existing RVOS methods. With this in mind, we propose a simple yet effective model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture. The CMA module builds multimodal affinity with a few samples, thus quickly learning new semantic information, and enabling the model to adapt to different scenarios.
Since the proposed method targets limited samples for new scenes, we generalize the problem as - few-shot referring video object segmentation (FS-RVOS). To foster research in this direction, we build up a new FS-RVOS benchmark based on currently available datasets. The benchmark covers a wide range and includes multiple situations, which can maximally simulate real-world scenarios. Extensive experiments show that our model adapts well to different scenarios with only a few samples, reaching state-of-the-art performance on the benchmark. On Mini-Ref-YouTube-VOS, our model achieves an average performance of 53.1 J and 54.8 F, which are 10% better than the baselines. Furthermore, we show impressive results of 77.7 J and 74.8 F on Mini-Ref-SAIL-VOS, which are significantly better than the baselines. Code is publicly available at https://github.com/hengliusky/Few_shot_RVOS. + + + + Divide and Conquer: a Two-Step Method for High Quality Face De-identification with Model Explainability + http://openaccess.thecvf.com//content/ICCV2023/papers/Wen_Divide_and_Conquer_a_Two-Step_Method_for_High_Quality_Face_ICCV_2023_paper.pdf + Face de-identification involves concealing the true identity of a face while retaining other facial characteristics. Current target-generic methods typically disentangle identity features in the latent space, using adversarial training to balance privacy and utility. However, this pattern often leads to a trade-off between privacy and utility, and the latent space remains difficult to explain. To address these issues, we propose IDeudemon, which employs a "divide and conquer" strategy to protect identity and preserve utility step by step while maintaining good explainability. In Step I, we obfuscate the 3D disentangled ID code calculated by a parametric NeRF model to protect identity. In Step II, we incorporate visual similarity assistance and train a GAN with adjusted losses to preserve image utility. Thanks to the powerful 3D prior and delicate generative designs, our approach could protect the identity naturally, produce high quality details and is robust to different poses and expressions. Extensive experiments demonstrate that the proposed IDeudemon outperforms previous state-of-the-art methods. + + + + Set-level Guidance Attack: Boosting Adversarial Transferability of Vision-Language Pre-training Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_Set-level_Guidance_Attack_Boosting_Adversarial_Transferability_of_Vision-Language_Pre-training_Models_ICCV_2023_paper.pdf + Vision-language pre-training (VLP) models have shown vulnerability to adversarial examples in multimodal tasks. Furthermore, malicious adversaries can be deliberately transferred to attack other black-box models. However, existing work has mainly focused on investigating white-box attacks. In this paper, we present the first study to investigate the adversarial transferability of recent VLP models. We observe that existing methods exhibit much lower transferability, compared to the strong attack performance in white-box settings. The transferability degradation is partly caused by the under-utilization of cross-modal interactions. Particularly, unlike unimodal learning, VLP models rely heavily on cross-modal interactions and the multimodal alignments are many-to-many, e.g., an image can be described in various natural languages. 
To this end, we propose a highly transferable Set-level Guidance Attack (SGA) that thoroughly leverages modality interactions and incorporates alignment-preserving augmentation with cross-modal guidance. Experimental results demonstrate that SGA could generate adversarial examples that can strongly transfer across different VLP models on multiple downstream vision-language tasks. On image-text retrieval, SGA significantly enhances the attack success rate for transfer attacks from ALBEF to TCL by a large margin (at least 9.78% and up to 30.21%), compared to the state-of-the-art. Our code is available at https://github.com/Zoky-2020/SGA. + + + + Multimodal Distillation for Egocentric Action Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Radevski_Multimodal_Distillation_for_Egocentric_Action_Recognition_ICCV_2023_paper.pdf + The focal point of egocentric video understanding is modelling hand-object interactions. Standard models, e.g. CNNs or Vision Transformers, which receive RGB frames as input perform well, however, their performance improves further by employing additional input modalities that provide complementary cues, such as object detections, optical flow, audio, etc. The added complexity of the modality-specific modules, on the other hand, makes these models impractical for deployment. The goal of this work is to retain the performance of such a multimodal approach, while using only the RGB frames as input at inference time. We demonstrate that for egocentric action recognition on the Epic-Kitchens and the Something-Something datasets, students which are taught by multimodal teachers tend to be more accurate and better calibrated than architecturally equivalent models trained on ground truth labels in a unimodal or multimodal fashion. We further present a principled multimodal knowledge distillation framework, allowing us to deal with issues which occur when applying multimodal knowledge distillation in a naive manner. Lastly, we demonstrate the achieved reduction in computational complexity, and show that our approach maintains higher performance with the reduction of the number of input views. + + + + Perceptual Artifacts Localization for Image Synthesis Tasks + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Perceptual_Artifacts_Localization_for_Image_Synthesis_Tasks_ICCV_2023_paper.pdf + Recent advancements in deep generative models have facilitated the creation of photo-realistic images across various tasks. However, these generated images often exhibit perceptual artifacts in specific regions, necessitating manual correction. In this study, we present a comprehensive empirical examination of Perceptual Artifacts Localization (PAL) spanning diverse image synthesis endeavors. We introduce a novel dataset comprising 10,168 generated images, each annotated with per-pixel perceptual artifact labels across ten synthesis tasks. A segmentation model, trained on our proposed dataset, effectively localizes artifacts across a range of tasks. Additionally, we illustrate its proficiency in adapting to previously unseen models using minimal training samples. We further propose an innovative zoom-in inpainting pipeline that seamlessly rectifies perceptual artifacts in the generated images. Through our experimental analyses, we elucidate several invaluable downstream applications, such as automated artifact rectification, non-referential image quality evaluation, and abnormal region detection in images. 
The dataset and code are released here: https://owenzlz.github.io/PAL4VST + + + + Better May Not Be Fairer: A Study on Subgroup Discrepancy in Image Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Chiu_Better_May_Not_Be_Fairer_A_Study_on_Subgroup_Discrepancy_ICCV_2023_paper.pdf + In this paper, we provide 20,000 non-trivial human annotations on popular datasets as a first step toward bridging the gap in studying how natural semantic spurious features affect image classification, as prior works often study datasets mixing low-level features due to limitations in accessing realistic datasets. We investigate how natural background colors play a role as spurious features by annotating the test sets of CIFAR10 and CIFAR100 into subgroups based on the background color of each image. We name our datasets CIFAR10-B and CIFAR100-B and integrate them with CIFAR-Cs. We find that overall human-level accuracy does not guarantee consistent subgroup performances, and the phenomenon remains even on models pre-trained on ImageNet or after data augmentation (DA). To alleviate this issue, we propose FlowAug, a semantic DA method that leverages decoupled semantic representations captured by a pre-trained generative flow. Experimental results show that FlowAug achieves more consistent subgroup results than other types of DA methods on CIFAR10/100 and on CIFAR10/100-C. Additionally, it shows better generalization performance. Furthermore, we propose a generic metric, MacroStd, for studying model robustness to spurious correlations, where we take a macro average of the weighted standard deviations across different classes. We show that MacroStd is more predictive of better performance; per our metric, FlowAug demonstrates improvements on subgroup discrepancy. Although this metric is proposed to study our curated datasets, it applies to all datasets that have subgroups or subclasses. Lastly, we also show superior out-of-distribution results on CIFAR10.1. + + + + 3D Implicit Transporter for Temporally Consistent Keypoint Discovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhong_3D_Implicit_Transporter_for_Temporally_Consistent_Keypoint_Discovery_ICCV_2023_paper.pdf + Keypoint-based representation has proven advantageous in various visual and robotic tasks. However, the existing 2D and 3D methods for detecting keypoints mainly rely on geometric consistency to achieve spatial alignment, neglecting temporal consistency. To address this issue, the Transporter method was introduced for 2D data, which reconstructs the target frame from the source frame to incorporate both spatial and temporal information. However, the direct application of the Transporter to 3D point clouds is infeasible due to their structural differences from 2D images. Thus, we propose the first 3D version of the Transporter, which leverages hybrid 3D representation, cross attention, and implicit reconstruction. We apply this new learning system to 3D articulated objects/humans and show that the learned keypoints are spatiotemporally consistent. Additionally, we propose a control policy that utilizes the learned keypoints for 3D object manipulation and demonstrate its superior performance. Our codes, data, and models will be made publicly available.
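The MacroStd metric mentioned in the "Better May Not Be Fairer" entry above is only described at a high level (a macro average of weighted standard deviations across classes). The following is a minimal sketch of one plausible reading of that description, not the paper's reference implementation: the subgroup-size weighting, the function name macro_std, and the toy data are all assumptions for illustration.

```python
# Minimal sketch of a MacroStd-style metric: per class, take the
# subgroup-size-weighted standard deviation of subgroup accuracies,
# then macro-average over classes. Lower = more consistent subgroups.
import numpy as np

def macro_std(subgroup_acc, subgroup_size):
    """subgroup_acc[c][g]:  accuracy of class c on subgroup g (e.g. background color)
       subgroup_size[c][g]: number of test images of class c in subgroup g"""
    per_class = []
    for c, accs in subgroup_acc.items():
        groups = sorted(accs)
        a = np.array([accs[g] for g in groups], dtype=float)
        w = np.array([subgroup_size[c][g] for g in groups], dtype=float)
        w = w / w.sum()
        mean = float((w * a).sum())
        var = float((w * (a - mean) ** 2).sum())  # weighted variance
        per_class.append(np.sqrt(var))
    return float(np.mean(per_class))              # macro average over classes

# Toy usage: two classes, two background-color subgroups each.
acc  = {"cat": {"red": 0.92, "green": 0.80}, "dog": {"red": 0.88, "green": 0.86}}
size = {"cat": {"red": 500, "green": 500},   "dog": {"red": 700, "green": 300}}
print(round(macro_std(acc, size), 4))
```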
+ + + + Adaptive Rotated Convolution for Rotated Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Pu_Adaptive_Rotated_Convolution_for_Rotated_Object_Detection_ICCV_2023_paper.pdf + Rotated object detection aims to identify and locate objects in images with arbitrary orientation. In this scenario, the oriented directions of objects vary considerably across different images, while multiple orientations of objects exist within an image. This intrinsic characteristic makes it challenging for standard backbone networks to extract high-quality features of these arbitrarily orientated objects. In this paper, we present Adaptive Rotated Convolution (ARC) module to handle the aforementioned challenges. In our ARC module, the convolution kernels rotate adaptively to extract object features with varying orientations in different images, and an efficient conditional computation mechanism is introduced to accommodate the large orientation variations of objects within an image. The two designs work seamlessly in rotated object detection problem. Moreover, ARC can conveniently serve as a plug-and-play module in various vision backbones to boost their representation ability to detect oriented objects accurately. Experiments on commonly used benchmarks (DOTA and HRSC2016) demonstrate that equipped with our proposed ARC module in the backbone network, the performance of multiple popular oriented object detectors is significantly improved (e.g. +3.03% mAP on Rotated RetinaNet and +4.16% on CFA). Combined with the highly competitive method Oriented R-CNN, the proposed approach achieves state-of-the-art performance on the DOTA dataset with 81.77% mAP. Code is available at https://github.com/LeapLabTHU/ARC. + + + + UniVTG: Towards Unified Video-Language Temporal Grounding + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_UniVTG_Towards_Unified_Video-Language_Temporal_Grounding_ICCV_2023_paper.pdf + Video Temporal Grounding (VTG), which aims to ground target clips from videos (such as consecutive intervals or disjoint shots) according to custom language queries (e.g., sentences or words), is key for video browsing on social media. Most methods in this direction develop task-specific models that are trained with type-specific labels, such as moment retrieval (time interval) and highlight detection (worthiness curve), which limits their abilities to generalize to various VTG tasks and labels. In this paper, we propose to Unify the diverse VTG labels and tasks, dubbed UniVTG, along three directions: Firstly, we revisit a wide range of VTG labels and tasks and define a unified formulation. Based on this, we develop data annotation schemes to create scalable pseudo supervision. Secondly, we develop an effective and flexible grounding model capable of addressing each task and making full use of each label. Lastly, thanks to the unified framework, we are able to unlock temporal grounding pretraining from large-scale diverse labels and develop stronger grounding abilities e.g., zero-shot grounding. Extensive experiments on three tasks (moment retrieval, highlight detection and video summarization) across seven datasets (QVHighlights, Charades-STA, TACoS, Ego4D, YouTube Highlights, TV-Sum, and QFVS) demonstrate the effectiveness and flexibility of our proposed framework. The codes are available at https://github.com/showlab/UniVTG. 
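The Adaptive Rotated Convolution entry above describes kernels that rotate adaptively depending on the input image. Purely as a rough illustration of that idea, the sketch below predicts a single rotation angle per image with a tiny routing network and resamples a shared kernel before applying it. The router design, the single-angle simplification, and the per-sample loop are assumptions; the paper's ARC module also includes a conditional computation mechanism not shown here.

```python
# Sketch: rotate a convolution kernel by an input-dependent angle, then convolve.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveRotatedConv2d(nn.Module):  # hypothetical name, not the paper's code
    def __init__(self, in_ch, out_ch, k=3):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, k, k) * 0.02)
        # Tiny router: global context -> one rotation angle per image.
        self.router = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                    nn.Linear(in_ch, 1))

    def _rotate_kernel(self, angle):
        # Build a 2x3 rotation matrix and bilinearly resample the shared kernel.
        cos, sin = torch.cos(angle), torch.sin(angle)
        theta = torch.stack([torch.stack([cos, -sin, torch.zeros_like(cos)]),
                             torch.stack([sin,  cos, torch.zeros_like(cos)])])
        theta = theta.unsqueeze(0).repeat(self.weight.size(0), 1, 1)
        grid = F.affine_grid(theta, self.weight.shape, align_corners=False)
        return F.grid_sample(self.weight, grid, align_corners=False)

    def forward(self, x):
        angles = self.router(x).squeeze(-1)        # one angle per image
        outs = []
        for b in range(x.size(0)):                 # per-sample rotated kernels
            w = self._rotate_kernel(angles[b])
            outs.append(F.conv2d(x[b:b + 1], w, padding=1))
        return torch.cat(outs, dim=0)

layer = AdaptiveRotatedConv2d(3, 8)
print(layer(torch.randn(2, 3, 32, 32)).shape)      # torch.Size([2, 8, 32, 32])
```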
+ + + + Fast Globally Optimal Surface Normal Estimation from an Affine Correspondence + http://openaccess.thecvf.com//content/ICCV2023/papers/Hajder_Fast_Globally_Optimal_Surface_Normal_Estimation_from_an_Affine_Correspondence_ICCV_2023_paper.pdf + We present a new solver for estimating a surface normal from a single affine correspondence in two calibrated views. The proposed approach provides a new globally optimal solution for this over-determined problem and proves that it reduces to a linear system that can be solved extremely efficiently. This allows it to run significantly faster than other recent methods that solve the same problem, while obtaining the same globally optimal solution. We demonstrate on 15k image pairs from standard benchmarks that the proposed approach leads to the same results as other optimal algorithms while being, on average, five times faster than the fastest alternative. Besides its theoretical value, we demonstrate that such an approach has clear benefits, e.g., in image-based visual localization, due to not requiring a dense point cloud to recover the surface normal. We show on the Cambridge Landmarks dataset that leveraging the proposed surface normal estimation further improves localization accuracy. Matlab and C++ implementations are also published in the supplementary material. + + + + Frequency-aware GAN for Adversarial Manipulation Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Frequency-aware_GAN_for_Adversarial_Manipulation_Generation_ICCV_2023_paper.pdf + Image manipulation techniques have drawn growing concern as manipulated images might cause moral and security problems. Various methods have been proposed to detect manipulations and have achieved promising performance. However, these methods might be vulnerable to adversarial attacks. In this work, we design an Adversarial Manipulation Generation (AMG) task to explore the vulnerability of image manipulation detectors. We first propose an optimal loss function and extend existing attacks to generate adversarial examples. We observe that existing spatial attacks cause large degradation in image quality and find that the loss of high-frequency detail components might be the major reason. Inspired by this observation, we propose a novel adversarial attack that incorporates both spatial and frequency features into the GAN architecture to generate adversarial examples. We further design an encoder-decoder architecture with skip connections of high-frequency components to preserve fine details. We evaluated our method on three image manipulation detectors (FCN, ManTra-Net and MVSS-Net) with three benchmark datasets (DEFACTO, CASIAv2 and COVER). Experiments show that our method generates adversarial examples very quickly (0.01s per image), preserves better image quality (PSNR 30% higher than spatial attacks) and achieves a high attack success rate. We also observe that the examples generated by AMG can fool both classification and segmentation models, which indicates better transferability across different tasks.
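The AMG entry above attributes the quality drop of spatial attacks to lost high-frequency detail. Purely as a generic illustration of that idea, and not the paper's actual architecture or decomposition, the snippet below separates an image into low- and high-frequency components with a simple FFT radius mask; the cutoff radius and the function name split_frequencies are arbitrary assumptions.

```python
# Split a grayscale image into low- and high-frequency parts with an FFT mask.
import numpy as np

def split_frequencies(img, radius=16):
    """img: (H, W) array. Returns (low_freq, high_freq) with low + high == img."""
    H, W = img.shape
    spec = np.fft.fftshift(np.fft.fft2(img))
    yy, xx = np.ogrid[:H, :W]
    mask = (yy - H // 2) ** 2 + (xx - W // 2) ** 2 <= radius ** 2
    low = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
    high = img - low                      # residual carries fine details/edges
    return low, high

img = np.random.rand(64, 64)
low, high = split_frequencies(img)
print(np.allclose(low + high, img))       # True: the decomposition is exact
```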
+ + + + Template-guided Hierarchical Feature Restoration for Anomaly Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Template-guided_Hierarchical_Feature_Restoration_for_Anomaly_Detection_ICCV_2023_paper.pdf + Aiming to detect anomalies of various sizes within complicated normal patterns, we propose a Template-guided Hierarchical Feature Restoration method, which introduces two key techniques, bottleneck compression and template-guided compensation, for anomaly-free feature restoration. Specifically, our framework compresses hierarchical features of an image with a bottleneck structure to preserve the most crucial features shared among normal samples. We design template-guided compensation to restore the distorted features towards anomaly-free features. Particularly, we choose the most similar normal sample as the template and leverage hierarchical features from the template to compensate for the distorted features. The bottleneck can partially filter out anomaly features, while the compensation further converts the remaining anomaly features towards normal with template guidance. Finally, anomalies are detected in terms of the cosine distance between the pre-trained features of an inference image and the corresponding restored anomaly-free features. Experimental results demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on the MVTec LOCO AD dataset. + + + + PourIt!: Weakly-Supervised Liquid Perception from a Single Image for Visual Closed-Loop Robotic Pouring + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_PourIt_Weakly-Supervised_Liquid_Perception_from_a_Single_Image_for_Visual_ICCV_2023_paper.pdf + Liquid perception is critical for robotic pouring tasks. It usually requires the robust visual detection of flowing liquid. However, while recent works have shown promising results in liquid perception, they typically require labeled data for model training, a process that is both time-consuming and reliant on human labor. To this end, this paper proposes a simple yet effective framework, PourIt!, to serve as a tool for robotic pouring tasks. We design a simple data collection pipeline that only needs image-level labels to reduce the reliance on tedious pixel-wise annotations. Then, a binary classification model is trained to generate a Class Activation Map (CAM) that focuses on the visual difference between these two kinds of collected data, i.e., the existence of liquid drops or not. We also devise a feature contrast strategy to improve the quality of the CAM, thus entirely and tightly covering the actual liquid regions. Then, the container pose is further utilized to facilitate the 3D point cloud recovery of the detected liquid region. Finally, the liquid-to-container distance is calculated for visual closed-loop control of the physical robot. To validate the effectiveness of our proposed method, we also contribute a novel dataset for our task and name it the PourIt! dataset. Extensive results on this dataset and a physical Franka robot have shown the utility and effectiveness of our method in robotic pouring tasks. Our dataset, code and pre-trained models will be available on the project page. + + + + A Latent Space of Stochastic Diffusion Models for Zero-Shot Image Editing and Guidance + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_A_Latent_Space_of_Stochastic_Diffusion_Models_for_Zero-Shot_Image_ICCV_2023_paper.pdf + Diffusion models generate images by iterative denoising.
Recent work has shown that by making the denoising process deterministic, one can encode real images into latent codes of the same size, which can be used for image editing. This paper explores the possibility of defining a latent space even when the denoising process remains stochastic. Recall that, in stochastic diffusion models, Gaussian noises are added in each denoising step, and we can concatenate all the noises to form a latent code. This results in a latent space of much higher dimensionality than the original image. We demonstrate that this latent space of stochastic diffusion models can be used in the same way as that of deterministic diffusion models in two applications. First, we propose CycleDiffusion, a method for zero-shot and unpaired image editing using stochastic diffusion models, which improves the performance over its deterministic counterpart. Second, we demonstrate unified, plug-and-play guidance in the latent spaces of deterministic and stochastic diffusion models. + + + + Open Set Video HOI detection from Action-Centric Chain-of-Look Prompting + http://openaccess.thecvf.com//content/ICCV2023/papers/Xi_Open_Set_Video_HOI_detection_from_Action-Centric_Chain-of-Look_Prompting_ICCV_2023_paper.pdf + Human-Object Interaction (HOI) detection is essential for understanding and modeling real-world events. Existing works on HOI detection mainly focus on static images and a closed setting, where all HOI classes are provided in the training set. In comparison, detecting HOIs in videos in open set scenarios is more challenging. First, under open set circumstances, HOI detectors are expected to hold strong generalizability to recognize unseen HOIs not included in the training data. Second, accurately capturing temporal contextual information from videos is difficult, but it is crucial for detecting temporal-related actions such as open, close, pull, push. To this end, we propose ACoLP, a model of Action-centric Chain-of-Look Prompting for open set video HOI detection. ACoLP regards actions as the carrier of semantics in videos, which captures the essential semantic information across frames. To make the model generalizable on unseen classes, inspired by the chain-of-thought prompting in natural language processing, we introduce the chain-of-look prompting scheme that decomposes prompt generation from large-scale vision-language model into a series of intermediate visual reasoning steps. Consequently, our model captures complex visual reasoning processes underlying the HOI events in videos, providing essential guidance for detecting unseen classes. Extensive experiments on two video HOI datasets, VidHOI and CAD120, demonstrate that ACoLP achieves competitive performance compared with the state-of-the-art methods in the conventional closed setting, and outperforms existing methods by a large margin in the open set setting. Our code is available at https://github.com/southnx/ACoLP. + + + + Robust Mixture-of-Expert Training for Convolutional Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Robust_Mixture-of-Expert_Training_for_Convolutional_Neural_Networks_ICCV_2023_paper.pdf + Sparsely-gated Mixture of Expert (MoE), an emerging deep model architecture, has demonstrated a great promise to enable high-accuracy and ultra-efficient model inference. Despite the growing popularity of MoE, little work investigated its potential to advance convolutional neural networks (CNNs), especially in the plane of adversarial robustness.
Since the lack of robustness has become one of the main hurdles for CNNs, in this paper we ask: How to adversarially robustify a CNN-based MoE model? Can we robustly train it like an ordinary CNN model? Our pilot study shows that the conventional adversarial training (AT) mechanism (developed for vanilla CNNs) no longer remains effective to robustify an MoE-CNN. To better understand this phenomenon, we dissect the robustness of an MoE-CNN into two dimensions: Robustness of routers (i.e., gating functions to select data-specific experts) and robustness of experts (i.e., the router-guided pathways defined by the subnetworks of the backbone CNN). Our analyses show that routers and experts are hard to adapt to each other in the vanilla AT. Thus, we propose a new router-expert alternating Adversarial training framework for MoE, termed AdvMoE. The effectiveness of our proposal is justified across 4 commonly-used CNN model architectures over 4 benchmark datasets. We find that AdvMoE achieves 1%-4% adversarial robustness improvement over the original dense CNN, and enjoys the efficiency merit of sparsity-gated MoE, leading to more than 50% inference cost reduction. + + + + UniTR: A Unified and Efficient Multi-Modal Transformer for Bird's-Eye-View Representation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_UniTR_A_Unified_and_Efficient_Multi-Modal_Transformer_for_Birds-Eye-View_Representation_ICCV_2023_paper.pdf + Jointly processing information from multiple sensors is crucial to achieving accurate and robust perception for reliable autonomous driving systems. However, current 3D perception research follows a modality-specific paradigm, leading to additional computation overheads and inefficient collaboration between different sensor data. In this paper, we present an efficient multi-modal backbone for outdoor 3D perception named UniTR, which processes a variety of modalities with unified modeling and shared parameters. Unlike previous works, UniTR introduces a modality-agnostic transformer encoder to handle these view-discrepant sensor data for parallel modal-wise representation learning and automatic cross-modal interaction without additional fusion steps. More importantly, to make full use of these complementary sensor types, we present a novel multi-modal integration strategy by both considering semantic-abundant 2D perspective and geometry-aware 3D sparse neighborhood relations. UniTR is also a fundamentally task-agnostic backbone that naturally supports different 3D perception tasks. It sets a new state-of-the-art performance on the nuScenes benchmark, achieving +1.1 NDS higher for 3D object detection and +12.0 higher mIoU for BEV map segmentation with lower inference latency. Code will be available at https://github.com/Haiyang-W/UniTR. + + + + R3D3: Dense 3D Reconstruction of Dynamic Scenes from Multiple Cameras + http://openaccess.thecvf.com//content/ICCV2023/papers/Schmied_R3D3_Dense_3D_Reconstruction_of_Dynamic_Scenes_from_Multiple_Cameras_ICCV_2023_paper.pdf + Dense 3D reconstruction and ego-motion estimation are key challenges in autonomous driving and robotics. Compared to the complex, multi-modal systems deployed today, multi-camera systems provide a simpler, low-cost alternative. However, camera-based 3D reconstruction of complex dynamic scenes has proven extremely difficult, as existing solutions often produce incomplete or incoherent results. We propose R3D3, a multi-camera system for dense 3D reconstruction and ego-motion estimation.
Our approach iterates between geometric estimation that exploits spatial-temporal information from multiple cameras, and monocular depth refinement. We integrate multi-camera feature correlation and dense bundle adjustment operators that yield robust geometric depth and pose estimates. To improve reconstruction where geometric depth is unreliable, e.g. for moving objects or low-textured regions, we introduce learnable scene priors via a depth refinement network. We show that this design enables a dense, consistent 3D reconstruction of challenging, dynamic outdoor environments. Consequently, we achieve state-of-the-art dense depth prediction on the DDAD and NuScenes benchmarks. + + + + Focus the Discrepancy: Intra- and Inter-Correlation Learning for Image Anomaly Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Yao_Focus_the_Discrepancy_Intra-_and_Inter-Correlation_Learning_for_Image_Anomaly_ICCV_2023_paper.pdf + Humans recognize anomalies through two aspects: larger patch-wise representation discrepancies and weaker patch-to-normal-patch correlations. However, the previous AD methods didn't sufficiently combine the two complementary aspects to design AD models. To this end, we find that Transformer can ideally satisfy the two aspects as its great power in the unified modeling of patchwise representations and patch-to-patch correlations. In this paper, we propose a novel AD framework: FOcus-the-Discrepancy (FOD), which can simultaneously spot the patch-wise, intra- and inter-discrepancies of anomalies. The major characteristic of our method is that we renovate the self attention maps in transformers to Intra-Inter-Correlation (I2Correlation). The I2Correlation contains a two-branch structure to first explicitly establish intra- and inter-image correlations, and then fuses the features of two-branch to spotlight the abnormal patterns. To learn the intra- and inter-correlations adaptively, we propose the RBF-kernel-based target-correlations as learning targets for self-supervised learning. Besides, we introduce an entropy constraint strategy to solve the mode collapse issue in optimization and further amplify the normal-abnormal distinguishability. Extensive experiments on three unsupervised real-world AD benchmarks show the superior performance of our approach. Code will be available at https://github.com/xcyao00/FOD. + + + + Make Encoder Great Again in 3D GAN Inversion through Geometry and Occlusion-Aware Encoding + http://openaccess.thecvf.com//content/ICCV2023/papers/Yuan_Make_Encoder_Great_Again_in_3D_GAN_Inversion_through_Geometry_ICCV_2023_paper.pdf + 3D GAN inversion aims to achieve high reconstruction fidelity and reasonable 3D geometry simultaneously from a single image input. However, existing 3D GAN inversion methods rely on time-consuming optimization for each individual case. In this work, we introduce a novel encoder-based inversion framework based on EG3D, one of the most widely-used 3D GAN models. We leverage the inherent properties of EG3D's latent space to design a discriminator and a background depth regularization. This enables us to train a geometry-aware encoder capable of converting the input image into corresponding latent code. Additionally, we explore the feature space of EG3D and develop an adaptive refinement stage that improves the representation ability of features in EG3D to enhance the recovery of fine-grained textural details. Finally, we propose an occlusion-aware fusion operation to prevent distortion in unobserved regions.
Our method achieves impressive results comparable to optimization-based methods while operating up to 500 times faster. Our framework is well-suited for applications such as semantic editing. + + + + DLT: Conditioned layout generation with Joint Discrete-Continuous Diffusion Layout Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Levi_DLT_Conditioned_layout_generation_with_Joint_Discrete-Continuous_Diffusion_Layout_Transformer_ICCV_2023_paper.pdf + Generating visual layouts is an essential ingredient of graphic design. The ability to condition layout generation on a partial subset of component attributes is critical to real-world applications that involve user interaction. Recently, diffusion models have demonstrated high-quality generative performances in various domains. However, it is unclear how to apply diffusion models to the natural representation of layouts which consists of a mix of discrete (class) and continuous (location, size) attributes. To address the conditioning layout generation problem, we introduce DLT, a joint discrete-continuous diffusion model. DLT is a transformer-based model which has a flexible conditioning mechanism that allows for conditioning on any given subset of all layout components classes, locations and sizes. Our method outperforms state-of-the-art generative models on various layout generation datasets with respect to different metrics and conditioning settings. Additionally, we validate the effectiveness of our proposed conditioning mechanism and the joint continuous-diffusion process. This joint process can be incorporated into a wide range of mixed discrete-continuous generative tasks. + + + + Open-Vocabulary Semantic Segmentation with Decoupled One-Pass Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_Open-Vocabulary_Semantic_Segmentation_with_Decoupled_One-Pass_Network_ICCV_2023_paper.pdf + Recently, the open-vocabulary semantic segmentation problem has attracted increasing attention and the best performing methods are based on two-stream networks: one stream for proposal mask generation and the other for segment classification using a pre-trained visual-language model. However, existing two-stream methods require passing a great number of (up to a hundred) image crops into the visual-language model, which is highly inefficient. To address the problem, we propose a network that only needs a single pass through the visual-language model for each input image. Specifically, we first propose a novel network adaptation approach, termed patch severance, to restrict the harmful interference between the patch embeddings in the pre-trained visual encoder. We then propose classification anchor learning to encourage the network to spatially focus on more discriminative features for classification. Extensive experiments demonstrate that the proposed method achieves outstanding performance, surpassing state-of-the-art methods while being 4 to 7 times faster at inference. Code: https://github.com/CongHan0808/DeOP.git + + + + Human-Inspired Facial Sketch Synthesis with Dynamic Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Human-Inspired_Facial_Sketch_Synthesis_with_Dynamic_Adaptation_ICCV_2023_paper.pdf + Facial sketch synthesis (FSS) aims to generate a vivid sketch portrait from a given facial photo. Existing FSS methods merely rely on 2D representations of facial semantic or appearance. However, professional human artists usually use outlines or shadings to convey 3D geometry.
Thus facial 3D geometry (e.g. depth map) is extremely important for FSS. Besides, different artists may use diverse drawing techniques and create multiple styles of sketches; but the style is globally consistent in a sketch. Inspired by such observations, in this paper, we propose a novel Human-Inspired Dynamic Adaptation (HIDA) method. Specifically, we propose to dynamically modulate neuron activations based on a joint consideration of both facial 3D geometry and 2D appearance, as well as globally consistent style control. Besides, we use deformable convolutions at coarse-scales to align deep features, for generating abstract and distinct outlines. Experiments show that HIDA can generate high-quality sketches in multiple styles, and significantly outperforms previous methods, over a large range of challenging faces. Besides, HIDA allows precise style control of the synthesized sketch, and generalizes well to natural scenes and other artistic styles. Our code and results have been released online at: https://github.com/AiArt-HDU/HIDA. + + + + DS-Fusion: Artistic Typography via Discriminated and Stylized Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Tanveer_DS-Fusion_Artistic_Typography_via_Discriminated_and_Stylized_Diffusion_ICCV_2023_paper.pdf + We introduce a novel method to automatically generate an artistic typography by stylizing one or more letter fonts to visually convey the semantics of an input word, while ensuring that the output remains readable. To address an assortment of challenges with our task at hand including conflicting goals (artistic stylization vs. legibility), lack of ground truth, and immense search space, our approach utilizes large language models to bridge texts and visual images for stylization and build an unsupervised generative model with a diffusion model backbone. Specifically, we employ the denoising generator in Latent Diffusion Model (LDM), with the key addition of a CNN-based discriminator to adapt the input style onto the input text. The discriminator uses rasterized images of a given letter/word font as real samples and the output of the denoising generator as fake samples. Our model is coined DS-Fusion for discriminated and stylized diffusion. We showcase the quality and versatility of our method through numerous examples, qualitative and quantitative evaluation, and ablation studies. User studies comparing to strong baselines including CLIPDraw, DALL-E 2, Stable Diffusion, as well as artist-crafted typographies, demonstrate strong performance of DS-Fusion. + + + + Distilling DETR with Visual-Linguistic Knowledge for Open-Vocabulary Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Distilling_DETR_with_Visual-Linguistic_Knowledge_for_Open-Vocabulary_Object_Detection_ICCV_2023_paper.pdf + Current methods for open-vocabulary object detection (OVOD) rely on a pre-trained vision-language model (VLM) to acquire the recognition ability. In this paper, we propose a simple yet effective framework to Distill the Knowledge from the VLM to a DETR-like detector, termed DK-DETR. Specifically, we present two ingenious distillation schemes named semantic knowledge distillation (SKD) and relational knowledge distillation (RKD). To utilize the rich knowledge from the VLM systematically, SKD transfers the semantic knowledge explicitly, while RKD exploits implicit relationship information between objects.
Furthermore, a distillation branch including a group of auxiliary queries is added to the detector to mitigate the negative effect on base categories. Equipped with SKD and RKD on the distillation branch, DK-DETR improves the detection performance of novel categories significantly and avoids disturbing the detection of base categories. Extensive experiments on LVIS and COCO datasets show that DK-DETR surpasses existing OVOD methods under the setting that the base-category supervision is solely available. The code and models are available at https://github.com/hikvision-research/opera. + + + + Scale-MAE: A Scale-Aware Masked Autoencoder for Multiscale Geospatial Representation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Reed_Scale-MAE_A_Scale-Aware_Masked_Autoencoder_for_Multiscale_Geospatial_Representation_Learning_ICCV_2023_paper.pdf + Large, pretrained models are commonly finetuned with imagery that is heavily augmented to mimic different conditions and scales, with the resulting models used for various tasks with imagery from a range of spatial scales. Such models overlook scale-specific information in the data for scale-dependent domains, such as remote sensing. In this paper, we present Scale-MAE, a pretraining method that explicitly learns relationships between data at different, known scales throughout the pretraining process. Scale-MAE pretrains a network by masking an input image at a known input scale, where the area of the Earth covered by the image determines the scale of the ViT positional encoding, not the image resolution. Scale-MAE encodes the masked image with a standard ViT backbone, and then decodes the masked image through a bandpass filter to reconstruct low/high frequency images at lower/higher scales. We find that tasking the network with reconstructing both low/high frequency images leads to robust multiscale representations for remote sensing imagery. Scale-MAE achieves an average of a 2.4 - 5.6% non-parametric kNN classification improvement across eight remote sensing datasets compared to current state-of-the-art and obtains a 0.9 mIoU to 1.7 mIoU improvement on the SpaceNet building segmentation transfer task for a range of evaluation scales. + + + + A Unified Framework for Robustness on Diverse Sampling Errors + http://openaccess.thecvf.com//content/ICCV2023/papers/Jeon_A_Unified_Framework_for_Robustness_on_Diverse_Sampling_Errors_ICCV_2023_paper.pdf + Recent studies have substantiated that machine learning algorithms including convolutional neural networks often suffer from unreliable generalizations when there is a significant gap between the source and target data distributions. To mitigate this issue, a predetermined distribution shift has been addressed independently (e.g., single domain generalization, de-biasing). However, a distribution mismatch cannot be clearly estimated because the target distribution is unknown at training. Therefore, a conservative approach robust on unexpected diverse distributions is more desirable in practice. Our work starts from a motivation to allow adaptive inference once we know the target, since it is accessible only at testing. Instead of assuming and fixing the target distribution at training, our proposed approach allows adjusting the feature space the model refers to at every prediction, i.e., instance-wise adaptive inference. The extensive evaluation demonstrates our method is effective for generalization on diverse distributions. 
+ + + + LiDAR-Camera Panoptic Segmentation via Geometry-Consistent and Semantic-Aware Alignment + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_LiDAR-Camera_Panoptic_Segmentation_via_Geometry-Consistent_and_Semantic-Aware_Alignment_ICCV_2023_paper.pdf + 3D panoptic segmentation is a challenging perception task that requires both semantic segmentation and instance segmentation. In this task, we notice that images could provide rich texture, color, and discriminative information, which can complement LiDAR data for evident performance improvement, but their fusion remains a challenging problem. To this end, we propose LCPS, the first LiDAR-Camera Panoptic Segmentation network. In our approach, we conduct LiDAR-Camera fusion in three stages: 1) an Asynchronous Compensation Pixel Alignment (ACPA) module that calibrates the coordinate misalignment caused by asynchronous problems between sensors; 2) a Semantic-Aware Region Alignment (SARA) module that extends the one-to-one point-pixel mapping to one-to-many semantic relations; 3) a Point-to-Voxel feature Propagation (PVP) module that integrates both geometric and semantic fusion information for the entire point cloud. Our fusion strategy improves about 6.9% PQ performance over the LiDAR-only baseline on NuScenes dataset. Extensive quantitative and qualitative experiments further demonstrate the effectiveness of our novel framework. The code will be released at https://github.com/zhangzw12319/lcps.git. + + + + Scene-Aware Label Graph Learning for Multi-Label Image Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Scene-Aware_Label_Graph_Learning_for_Multi-Label_Image_Classification_ICCV_2023_paper.pdf + Multi-label image classification refers to assigning a set of labels for an image. One of the main challenges of this task is how to effectively capture the correlation among labels. Existing studies on this issue mostly rely on the statistical label co-occurrence or semantic similarity of labels. However, an important fact is ignored that the co-occurrence of labels is closely related with image scenes (indoor, outdoor, etc.), which is a vital characteristic in multi-label image classification. In this paper, a novel scene-aware label graph learning framework is proposed, which is capable of learning visual representations for labels while fully perceiving their co-occurrence relationships under variable scenes. Specifically, our framework is able to detect scene categories of images without relying on manual annotations, and keeps track of the co-occurring labels by maintaining a global co-occurrence matrix for each scene category throughout the whole training phase. These scene-independent co-occurrence matrices are further employed to guide the interactions among label representations in a graph propagation manner towards accurate label prediction. Extensive experiments on public benchmarks demonstrate the superiority of our proposed framework compared to the state of the arts. Code will be publicly available soon. + + + + Fcaformer: Forward Cross Attention in Hybrid Vision Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Fcaformer_Forward_Cross_Attention_in_Hybrid_Vision_Transformer_ICCV_2023_paper.pdf + Currently, one main research line in designing more efficient vision transformer is reducing computational cost of self attention modules by adopting sparse attention or using local attention windows. 
In contrast, we propose a different approach that aims to improve the performance of transformer-based architectures by densifying the attention pattern. Specifically, we propose forward cross attention for hybrid vision transformer (FcaFormer), where tokens from previous blocks in the same stage are used a second time. To achieve this, the FcaFormer leverages two innovative components: learnable scale factors (LSFs) and a token merge and enhancement module (TME). The LSFs enable efficient processing of cross tokens, while the TME generates representative cross tokens. By integrating these components, the proposed FcaFormer enhances the interactions of tokens across blocks with potentially different semantics, and encourages more information flows to the lower levels. Based on the forward cross attention (Fca), we have designed a series of FcaFormer models that achieve the best trade-off between model size, computational cost, memory cost, and accuracy. For example, without the need for knowledge distillation to strengthen training, our FcaFormer achieves 83.1% top-1 accuracy on Imagenet with only 16.3 million parameters and about 3.6 billion MACs. This saves almost half of the parameters and part of the computational cost while achieving 0.7% higher accuracy compared with the distilled EfficientFormer. + + + + Progressive Spatio-Temporal Prototype Matching for Text-Video Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Progressive_Spatio-Temporal_Prototype_Matching_for_Text-Video_Retrieval_ICCV_2023_paper.pdf + The performance of text-video retrieval has been significantly improved by vision-language cross-modal learning schemes. The typical solution is to directly align the global video-level and sentence-level features during learning, which would ignore the intrinsic video-text relations, i.e., a text description only corresponds to a spatio-temporal part of videos. Hence, the matching process should consider both fine-grained spatial content and various temporal semantic events. To this end, we propose a text-video learning framework with progressive spatio-temporal prototype matching. Specifically, the vanilla matching process is decomposed into two complementary phases: object-phrase prototype matching and event-sentence prototype matching. In the object-phrase prototype matching phase, a spatial prototype generation mechanism is developed to predict key patches or words, which are sparsely integrated into object or phrase prototypes. Importantly, optimizing the local alignment between object-phrase prototypes helps the model perceive spatial details. In the event-sentence prototype matching phase, we design a temporal prototype generation mechanism to associate intra-frame objects and interact inter-frame temporal relations. Such progressively generated event prototypes can reveal semantic diversity in videos for dynamic matching. Validated by comprehensive experiments, our method consistently outperforms the state-of-the-art methods on four video retrieval benchmarks. + + + + Data Augmented Flatness-aware Gradient Projection for Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Data_Augmented_Flatness-aware_Gradient_Projection_for_Continual_Learning_ICCV_2023_paper.pdf + The goal of continual learning (CL) is to continuously learn new tasks without forgetting previously learned old tasks.
To alleviate catastrophic forgetting, gradient projection based CL methods require that the gradient updates of new tasks are orthogonal to the subspace spanned by old tasks. This limits the learning process and leads to poor performance on the new task due to the projection constraint being too strong. In this paper, we first revisit the gradient projection method from the perspective of flatness of loss surface, and find that unflatness of the loss surface leads to catastrophic forgetting of the old tasks when the projection constraint is reduced to improve the performance of new tasks. Based on our findings, we propose a Data Augmented Flatness-aware Gradient Projection (DFGP) method to solve the problem, which consists of three modules: data and weight perturbation, flatness-aware optimization, and gradient projection. Specifically, we first perform a flatness-aware perturbation on the task data and current weights to find the case that makes the task loss worst. Next, flatness-aware optimization optimizes both the loss and the flatness of the loss surface on raw and worst-case perturbed data to obtain a flatness-aware gradient. Finally, gradient projection updates the network with the flatness-aware gradient along directions orthogonal to the subspace of the old tasks. Extensive experiments on four datasets show that our method improves the flatness of loss surface and the performance of new tasks, and achieves state-of-the-art (SOTA) performance in the average accuracy of all tasks. + + + + Sample-wise Label Confidence Incorporation for Learning with Noisy Labels + http://openaccess.thecvf.com//content/ICCV2023/papers/Ahn_Sample-wise_Label_Confidence_Incorporation_for_Learning_with_Noisy_Labels_ICCV_2023_paper.pdf + Deep learning algorithms require large amounts of labeled data for effective performance, but the presence of noisy labels often significantly degrade their performance. Although recent studies on designing a robust objective function to label noise, known as the robust loss method, have shown promising results for learning with noisy labels, they suffer from the issue of underfitting not only noisy samples but also clean ones, leading to suboptimal model performance. To address this issue, we propose a novel learning framework that selectively suppresses noisy samples while avoiding underfitting clean data. Our framework incorporates label confidence as a measure of label noise, enabling the network model to prioritize the training of samples deemed to be noise-free. The label confidence is based on the robust loss methods, and we provide theoretical evidence that our method can reach the optimal point of the robust loss, subject to certain conditions. Furthermore, the proposed method is generalizable and can be combined with existing robust loss methods, making it suitable for a wide range of applications of learning with noisy labels. We evaluate our approach on both synthetic and real-world datasets, and the experimental results demonstrate its effectiveness in achieving outstanding classification performance compared to state-of-the-art methods. 
+ + + + CLIPTrans: Transferring Visual Knowledge with Pre-trained Models for Multimodal Machine Translation + http://openaccess.thecvf.com//content/ICCV2023/papers/Gupta_CLIPTrans_Transferring_Visual_Knowledge_with_Pre-trained_Models_for_Multimodal_Machine_ICCV_2023_paper.pdf + There has been a growing interest in developing multimodal machine translation (MMT) systems that enhance neural machine translation (NMT) with visual knowledge. This problem setup involves using images as auxiliary information during training, and more recently, eliminating their use during inference. Towards this end, previous works face a challenge in training powerful MMT models from scratch due to the scarcity of annotated multilingual vision-language data, especially for low-resource languages. Simultaneously, there has been an influx of multilingual pre-trained models for NMT and multimodal pre-trained models for vision-language tasks, primarily in English, which have shown exceptional generalisation ability. However, these are not directly applicable to MMT since they do not provide aligned multimodal multilingual features for generative tasks. To alleviate this issue, instead of designing complex modules for MMT, we propose CLIPTrans, which simply adapts the independently pre-trained multimodal M-CLIP and the multilingual mBART. In order to align their embedding spaces, mBART is conditioned on the M-CLIP features by a prefix sequence generated through a lightweight mapping network. We train this in a two-stage pipeline which warms up the model with image captioning before the actual translation task. Through experiments, we demonstrate the merits of this framework and consequently push forward the state-of-the-art across standard benchmarks by an average of +2.67 BLEU. The code can be found at www.github.com/devaansh100/CLIPTrans. + + + + Ego-Only: Egocentric Action Detection without Exocentric Transferring + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Ego-Only_Egocentric_Action_Detection_without_Exocentric_Transferring_ICCV_2023_paper.pdf + We present Ego-Only, the first approach that enables state-of-the-art action detection on egocentric (first-person) videos without any form of exocentric (third-person) transferring. Despite the content and appearance gap separating the two domains, large-scale exocentric transferring has been the default choice for egocentric action detection. This is because prior works found that egocentric models are difficult to train from scratch and that transferring from exocentric representations leads to improved accuracy. However, in this paper, we revisit this common belief. Motivated by the large gap separating the two domains, we propose a strategy that enables effective training of egocentric models without exocentric transferring. Our Ego-Only approach is simple. It trains the video representation with a masked autoencoder finetuned for temporal segmentation. The learned features are then fed to an off-the-shelf temporal action localization method to detect actions. We find that this renders exocentric transferring unnecessary by showing remarkably strong results achieved by this simple Ego-Only approach on three established egocentric video datasets: Ego4D, EPIC-Kitchens-100, and Charades-Ego. On both action detection and action recognition, Ego-Only outperforms previous best exocentric transferring methods that use orders of magnitude more labels. Ego-Only sets new state-of-the-art results on these datasets and benchmarks without exocentric data. 
+ + + + CoinSeg: Contrast Inter- and Intra- Class Representations for Incremental Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_CoinSeg_Contrast_Inter-_and_Intra-_Class_Representations_for_Incremental_Segmentation_ICCV_2023_paper.pdf + Class incremental semantic segmentation aims to strike a balance between the model's stability and plasticity by maintaining old knowledge while adapting to new concepts. However, most state-of-the-art methods use the freeze strategy for stability, which compromises the model's plasticity. In contrast, releasing parameter training for plasticity could lead to the best performance for all categories, but this requires discriminative feature representation. Therefore, we prioritize the model's plasticity and propose the Contrast inter- and intra-class representations for Incremental Segmentation (CoinSeg), which pursues discriminative representations for flexible parameter tuning. Inspired by the Gaussian mixture model that samples from a mixture of Gaussian distributions, CoinSeg emphasizes intra-class diversity with multiple contrastive representation centroids. Specifically, we use mask proposals to identify regions with strong objectness that are likely to be diverse instances/centroids of a category. These mask proposals are then used for contrastive representations to reinforce intra-class diversity. Meanwhile, to avoid bias from intra-class diversity, we also apply category-level pseudo-labels to enhance category-level consistency and inter-category diversity. Additionally, CoinSeg ensures the model's stability and alleviates forgetting through a specific flexible tuning strategy. We validate CoinSeg on Pascal VOC 2012 and ADE20K datasets with multiple incremental scenarios and achieve superior results compared to previous state-of-the-art methods, especially in more challenging and realistic long-term scenarios. + + + + Multi-View Active Fine-Grained Visual Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Du_Multi-View_Active_Fine-Grained_Visual_Recognition_ICCV_2023_paper.pdf + Despite the remarkable progress of Fine-grained visual classification (FGVC) with years of history, it is still limited to recognizing 2D images. Recognizing objects in the physical world (i.e., 3D environment) poses a unique challenge -- discriminative information is not only present in visible local regions but also in other unseen views. Therefore, in addition to finding the distinguishable part from the current view, efficient and accurate recognition requires inferring the critical perspective with minimal glances. E.g., a person might recognize a "Ford sedan" with a glance at its side and then know that looking at the front can help tell which model it is. In this paper, towards FGVC in the real physical world, we put forward the problem of multi-view active fine-grained visual recognition (MAFR) and complete this study in three steps: (i) a multi-view, fine-grained vehicle dataset is collected as the testbed, (ii) a pilot experiment is designed to validate the need and research value of MAFR, (iii) a policy-gradient-based framework along with a dynamic exiting strategy is proposed to achieve efficient recognition with active view selection. Our comprehensive experiments demonstrate that the proposed method outperforms previous multi-view recognition works and can extend existing state-of-the-art FGVC methods and advanced neural networks to become FGVC experts in the 3D environment.
+ + + + Variational Causal Inference Network for Explanatory Visual Question Answering + http://openaccess.thecvf.com//content/ICCV2023/papers/Xue_Variational_Causal_Inference_Network_for_Explanatory_Visual_Question_Answering_ICCV_2023_paper.pdf + Explanatory Visual Question Answering (EVQA) is a recently proposed multimodal reasoning task that requires answering visual questions and generating multimodal explanations for the reasoning processes. Unlike traditional Visual Question Answering (VQA) which focuses solely on answering, EVQA aims to provide user-friendly explanations to enhance the explainability and credibility of reasoning models. However, existing EVQA methods typically predict the answer and explanation separately, which ignores the causal correlation between them. Moreover, they neglect the complex relationships among question words, visual regions, and explanation tokens. To address these issues, we propose a Variational Causal Inference Network (VCIN) that establishes the causal correlation between predicted answers and explanations, and captures cross-modal relationships to generate rational explanations. First, we utilize a vision-and-language pretrained model to extract visual features and question features. Secondly, we propose a multimodal explanation gating transformer that constructs cross-modal relationships and generates rational explanations. Finally, we propose a variational causal inference to establish the target causal structure and predict the answers. Comprehensive experiments demonstrate the superiority of VCIN over state-of-the-art EVQA methods. + + + + Enhancing Generalization of Universal Adversarial Perturbation through Gradient Aggregation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Enhancing_Generalization_of_Universal_Adversarial_Perturbation_through_Gradient_Aggregation_ICCV_2023_paper.pdf + Deep neural networks are vulnerable to universal adversarial perturbation (UAP), an instance-agnostic perturbation capable of fooling the target model for most samples. Compared to instance-specific adversarial examples, UAP is more challenging as it needs to generalize across various samples and models. In this paper, we examine the serious dilemma of UAP generation methods from a generalization perspective -- the gradient vanishing problem using small-batch stochastic gradient optimization and the local optima problem using large-batch optimization. To address these problems, we propose a simple and effective method called Stochastic Gradient Aggregation (SGA), which alleviates the gradient vanishing and escapes from poor local optima at the same time. Specifically, SGA employs the small-batch training to perform multiple iterations of inner pre-search. Then, all the inner gradients are aggregated as a one-step gradient estimation to enhance the gradient stability and reduce quantization errors. Extensive experiments on the standard ImageNet dataset demonstrate that our method significantly enhances the generalization ability of UAP and outperforms other state-of-the-art methods. The code is available at https://github.com/liuxuannan/Stochastic-Gradient-Aggregation. 
+ + + + Parallel Attention Interaction Network for Few-Shot Skeleton-Based Action Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Parallel_Attention_Interaction_Network_for_Few-Shot_Skeleton-Based_Action_Recognition_ICCV_2023_paper.pdf + Learning discriminative features from very few labeled samples to identify novel classes has received increasing attention in skeleton-based action recognition. Existing works aim to learn action-specific embeddings by exploiting either intra-skeleton or inter-skeleton spatial associations, which may lead to less discriminative representations. To address these issues, we propose a novel Parallel Attention Interaction Network (PAINet) that incorporates two complementary branches to strengthen the match by inter-skeleton and intra-skeleton correlation. Specifically, a topology encoding module utilizing topology and physical information is proposed to enhance the modeling of interactive parts and joint pairs in both branches. In the Cross Spatial Alignment branch, we employ a spatial cross-attention module to establish joint associations across sequences, and a directional Average Symmetric Surface Metric is introduced to locate the closest temporal similarity. In parallel, the Cross Temporal Alignment branch incorporates a spatial self-attention module to aggregate spatial context within sequences as well as applies the temporal cross-attention network to correct misalignment temporally and calculate similarity. Extensive experiments on three skeleton benchmarks, namely NTU-T, NTU-S, and Kinetics, demonstrate the superiority of our framework and consistently outperform state-of-the-art methods. + + + + Not All Features Matter: Enhancing Few-shot CLIP with Adaptive Prior Refinement + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Not_All_Features_Matter_Enhancing_Few-shot_CLIP_with_Adaptive_Prior_ICCV_2023_paper.pdf + The popularity of Contrastive Language-Image Pre-training (CLIP) has propelled its application to diverse downstream vision tasks. To improve its capacity on downstream tasks, few-shot learning has become a widely-adopted technique. However, existing methods either exhibit limited performance or suffer from excessive learnable parameters. In this paper, we propose APE, an Adaptive Prior rEfinement method for CLIP's pre-trained knowledge, which achieves superior accuracy with high computational efficiency. Via a prior refinement module, we analyze the inter-class disparity in the downstream data and decouple the domain-specific knowledge from the CLIP-extracted cache model. On top of that, we introduce two model variants, a training-free APE and a training-required APE-T. We explore the trilateral affinities between the test image, prior cache model, and textual representations, and only enable a lightweight category-residual module to be trained. For the average accuracy over 11 benchmarks, both APE and APE-T attain state-of-the-art and respectively outperform the second-best by +1.59% and +1.99% under 16 shots with x30 less learnable parameters. + + + + EgoVLPv2: Egocentric Video-Language Pre-training with Fusion in the Backbone + http://openaccess.thecvf.com//content/ICCV2023/papers/Pramanick_EgoVLPv2_Egocentric_Video-Language_Pre-training_with_Fusion_in_the_Backbone_ICCV_2023_paper.pdf + Video-language pre-training (VLP) has become increasingly important due to its ability to generalize to various vision and language tasks. 
However, existing egocentric VLP frameworks utilize separate video and language encoders and learn task-specific cross-modal information only during fine-tuning, limiting the development of a unified system. In this work, we introduce the second generation of egocentric video-language pre-training (EgoVLPv2), a significant improvement from the previous generation, by incorporating cross-modal fusion directly into the video and language backbones. EgoVLPv2 learns strong video-text representation during pre-training and reuses the cross-modal attention modules to support different downstream tasks in a flexible and efficient manner, reducing fine-tuning costs. Moreover, our proposed fusion in the backbone strategy is more lightweight and compute-efficient than stacking additional fusion-specific layers. Extensive experiments on a wide range of VL tasks demonstrate the effectiveness of EgoVLPv2 by achieving consistent state-of-the-art performance over strong baselines across all downstream. + + + + Deep Equilibrium Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Deep_Equilibrium_Object_Detection_ICCV_2023_paper.pdf + Query-based object detectors directly decode image features into object instances with a set of learnable queries. These query vectors are progressively refined to stable meaningful representations through a sequence of decoder layers, and then used to directly predict object locations and categories with simple FFN heads. In this paper, we present a new query-based object detector (DEQDet) by designing a deep equilibrium decoder. Our DEQ decoder models the query vector refinement as the fixed point solving of an implicit layer and is equivalent to applying infinite steps of refinement. To be more specific to object decoding, we use a two-step unrolled equilibrium equation to explicitly capture the query vector refinement. Accordingly, we are able to incorporate refinement awareness into the DEQ training with the inexact gradient back-propagation (RAG). In addition, to stabilize the training of our DEQDet and improve its generalization ability, we devise the deep supervision scheme on the optimization path of DEQ with refinement-aware perturbation (RAP). Our experiments demonstrate DEQDet converges faster, consumes less memory, and achieves better results than the baseline counterpart (AdaMixer). In particular, our DEQDet with ResNet50 backbone and 300 queries achieves the 49.5 mAP and 33.0 APs on the MS COCO benchmark under 2x training scheme (24 epochs). + + + + SMAUG: Sparse Masked Autoencoder for Efficient Video-Language Pre-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_SMAUG_Sparse_Masked_Autoencoder_for_Efficient_Video-Language_Pre-Training_ICCV_2023_paper.pdf + Video-language pre-training is crucial for learning powerful multi-modal representation. However, it typically requires a massive amount of computation. In this paper, we develop SMAUG, an efficient pre-training framework for video-language models. The foundation component in SMAUG is masked autoencoders. Different from prior works which only mask textual inputs, our masking strategy considers both visual and textual modalities, providing a better cross-modal alignment and saving more pre-training costs. On top of that, we introduce a space-time token sparsification module, which leverages context information to further select only "important" spatial regions and temporal frames for pre-training. 
Coupling all these designs allows our method to enjoy both competitive performances on text-to-video retrieval and video question answering tasks, and much less pre-training costs by 1.9x or more. For example, our SMAUG only needs 50 NVIDIA A6000 GPU hours for pre-training to attain competitive performances on these two video-language tasks across six popular benchmarks. + + + + Communication-Efficient Vertical Federated Learning with Limited Overlapping Samples + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Communication-Efficient_Vertical_Federated_Learning_with_Limited_Overlapping_Samples_ICCV_2023_paper.pdf + Federated learning is a popular collaborative learning approach that enables clients to train a global model without sharing their local data. Vertical federated learning (VFL) deals with scenarios in which the data on clients have different feature spaces but share some overlapping samples. Existing VFL approaches suffer from high communication costs and cannot deal efficiently with limited overlapping samples commonly seen in the real world. We propose a practical vertical federated learning (VFL) framework called one-shot VFL that can solve the communication bottleneck and the problem of limited overlapping samples simultaneously based on semi-supervised learning. We also propose few-shot VFL to improve the accuracy further with just one more communication round between the server and the clients. In our proposed framework, the clients only need to communicate with the server once or only a few times. We evaluate the proposed VFL framework on both image and tabular datasets. Our methods can improve the accuracy by more than 46.5% and reduce the communication cost by more than 330x compared with state-of-the-art VFL methods when evaluated on CIFAR-10. Our code is publicly available. + + + + On the Audio-visual Synchronization for Lip-to-Speech Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Niu_On_the_Audio-visual_Synchronization_for_Lip-to-Speech_Synthesis_ICCV_2023_paper.pdf + Most lip-to-speech (LTS) synthesis models are trained and evaluated with the assumption that the audio-video pairs in the dataset are well synchronized. In this work, we demonstrate that commonly used audiovisual datasets such as GRID, TCD-TIMIT, and Lip2Wav can, however, have the data asynchrony issue, which will lead to inaccurate evaluation with conventional time alignment-sensitive metrics such as STOI, ESTOI, and MCD. Moreover, training an LTS model with such datasets can result in model asynchrony, meaning that the generated speech and input video are out of sync. To address these problems, we first provide a time-alignment frontend for the commonly used metrics to ensure accurate evaluation. Then, we propose a synchronized lip-to-speech (SLTS) model with an automatic synchronization mechanism (ASM) that corrects data asynchrony and penalizes model asynchrony during training. We evaluated the effectiveness of our approach on both artificial and popular audiovisual datasets. Our proposed method outperforms existing SOTA models in a variety of evaluation metrics. + + + + BallGAN: 3D-aware Image Synthesis with a Spherical Background + http://openaccess.thecvf.com//content/ICCV2023/papers/Shin_BallGAN_3D-aware_Image_Synthesis_with_a_Spherical_Background_ICCV_2023_paper.pdf + 3D-aware GANs aim to synthesize realistic 3D scenes that can be rendered in arbitrary camera viewpoints, generating high-quality images with well-defined geometry.
As 3D content creation becomes more popular, the ability to generate foreground objects separately from the background has become a crucial property. Existing methods have been developed regarding overall image quality, but they can not generate foreground objects only and often show degraded 3D geometry. In this work, we propose to represent the background as a spherical surface for multiple reasons inspired by computer graphics. Our method naturally provides foreground-only 3D synthesis facilitating easier 3D content creation. Furthermore, it improves the foreground geometry of 3D-aware GANs and the training stability on datasets with complex backgrounds. + + + + AttT2M: Text-Driven Human Motion Generation with Multi-Perspective Attention Mechanism + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhong_AttT2M_Text-Driven_Human_Motion_Generation_with_Multi-Perspective_Attention_Mechanism_ICCV_2023_paper.pdf + Generating 3D human motion based on textual descriptions has been a research focus in recent years. It requires the generated motion to be diverse, natural, and conform to the textual description. Due to the complex spatio-temporal nature of human motion and the difficulty in learning the cross-modal relationship between text and motion, text-driven motion generation is still a challenging problem. To address these issues, we propose AttT2M, a two-stage method with multi-perspective attention mechanism: body-part attention and global-local motion-text attention. The former focuses on the motion embedding perspective, which means introducing a body-part spatio-temporal encoder into VQ-VAE to learn a more expressive discrete latent space. The latter is from the cross-modal perspective, which is used to learn the sentence-level and word-level motion-text cross-modal relationship. The text-driven motion is finally generated with a generative transformer. Extensive experiments conducted on HumanML3D and KIT-ML demonstrate that our method outperforms the current state-of-the-art works in terms of qualitative and quantitative evaluation, and achieve fine-grained synthesis and action2motion. Our code will be publicly available. + + + + A Theory of Topological Derivatives for Inverse Rendering of Geometry + http://openaccess.thecvf.com//content/ICCV2023/papers/Mehta_A_Theory_of_Topological_Derivatives_for_Inverse_Rendering_of_Geometry_ICCV_2023_paper.pdf + We introduce a theoretical framework for differentiable surface evolution that allows discrete topology changes through the use of topological derivatives for variational optimization of image functionals. While prior methods for inverse rendering of geometry rely on silhouette gradients for topology changes, such signals are sparse. In contrast, our theory derives topological derivatives that relate the introduction of vanishing holes and phases to changes in image intensity. As a result, we enable differentiable shape perturbations in the form of hole or phase nucleation. We validate the proposed theory with optimization of closed curves in 2D and surfaces in 3D to lend insights into limitations of current methods and enable improved applications such as image vectorization, vector-graphics generation from text prompts, single-image reconstruction of shape ambigrams and multiview 3D reconstruction. 
+ + + + Canonical Factors for Hybrid Neural Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Yi_Canonical_Factors_for_Hybrid_Neural_Fields_ICCV_2023_paper.pdf + Factored feature volumes offer a simple way to build more compact, efficient, and interpretable neural fields, but also introduce biases that are not necessarily beneficial for real-world data. In this work, we (1) characterize the undesirable biases that these architectures have for axis-aligned signals -- they can lead to radiance field reconstruction differences of as high as 2 PSNR -- and (2) explore how learning a set of canonicalizing transformations can improve representations by removing these biases. We prove in a simple two-dimensional model problem that a hybrid architecture that simultaneously learns these transformations together with scene appearance succeeds with drastically improved efficiency. We validate the resulting architectures, which we call TILTED, using 2D image, signed distance field, and radiance field reconstruction tasks, where we observe improvements across quality, robustness, compactness, and runtime. Results demonstrate that TILTED can enable capabilities comparable to baselines that are 2x larger, while highlighting weaknesses of standard procedures for evaluating neural field representations. + + + + + + When Do Curricula Work in Federated Learning? + http://openaccess.thecvf.com//content/ICCV2023/papers/Vahidian_When_Do_Curricula_Work_in_Federated_Learning_ICCV_2023_paper.pdf + An oft-cited open problem of federated learning is the existence of data heterogeneity among clients. One pathway to understanding the drastic accuracy drop in federated learning is by scrutinizing the behavior of the clients' deep models on data with different levels of "difficulty", which has been left unaddressed. In this paper, we investigate a different and rarely studied dimension of FL: ordered learning. Specifically, we aim to investigate how ordered learning principles can contribute to alleviating the heterogeneity effects in FL. We present theoretical analysis and conduct extensive empirical studies on the efficacy of orderings spanning three kinds of learning: curriculum, anti-curriculum, and random curriculum. We find that curriculum learning largely alleviates non-IIDness. Interestingly, the more disparate the data distributions across clients, the more they benefit from ordered learning. We provide analysis explaining this phenomenon, specifically indicating how curriculum training appears to make the objective landscape progressively less convex, suggesting fast converging iterations at the beginning of the training procedure. We derive quantitative results of convergence for both convex and nonconvex objectives by modeling the curriculum training on federated devices as local SGD with locally biased stochastic gradients. Also, inspired by ordered learning, we propose a novel client selection technique that benefits from the real-world disparity in the clients. Our proposed approach to client selection has a synergic effect when applied together with ordered learning in FL. + + + + Audio-Visual Class-Incremental Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Pian_Audio-Visual_Class-Incremental_Learning_ICCV_2023_paper.pdf + In this paper, we introduce audio-visual class-incremental learning, a class-incremental learning scenario for audio-visual video recognition.
We demonstrate that joint audio-visual modeling can improve class-incremental learning, but current methods fail to preserve semantic similarity between audio and visual features as incremental step grows. Furthermore, we observe that audio-visual correlations learned in previous tasks can be forgotten as incremental steps progress, leading to poor performance. To overcome these challenges, we propose AV-CIL, which incorporates Dual-Audio-Visual Similarity Constraint (D-AVSC) to maintain both instance-aware and class-aware semantic similarity between audio-visual modalities and Visual Attention Distillation (VAD) to retain previously learned audio-guided visual attentive ability. We create three audio-visual class-incremental datasets, AVE-Class-Incremental (AVE-CI), Kinetics-Sounds-Class-Incremental (K-S-CI), and VGGSound100-Class-Incremental (VS100-CI) based on the AVE, Kinetics-Sounds, and VGGSound datasets, respectively. Our experiments on AVE-CI, K-S-CI, and VS100-CI demonstrate that AV-CIL significantly outperforms existing class-incremental learning methods in audio-visual class-incremental learning. Code and data are available at: https://github.com/weiguoPian/AV-CIL_ICCV2023. + + + + Towards Viewpoint-Invariant Visual Recognition via Adversarial Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Ruan_Towards_Viewpoint-Invariant_Visual_Recognition_via_Adversarial_Training_ICCV_2023_paper.pdf + Visual recognition models are not invariant to viewpoint changes in the 3D world, as different viewing directions can dramatically affect the predictions given the same object. Although many efforts have been devoted to making neural networks invariant to 2D image translations and rotations, viewpoint invariance is rarely investigated. As most models process images in the perspective view, it is challenging to impose invariance to 3D viewpoint changes based only on 2D inputs. Motivated by the success of adversarial training in promoting model robustness, we propose Viewpoint-Invariant Adversarial Training (VIAT) to improve viewpoint robustness of common image classifiers. By regarding viewpoint transformation as an attack, VIAT is formulated as a minimax optimization problem, where the inner maximization characterizes diverse adversarial viewpoints by learning a Gaussian mixture distribution based on a new attack GMVFool, while the outer minimization trains a viewpoint-invariant classifier by minimizing the expected loss over the worst-case adversarial viewpoint distributions. To further improve the generalization performance, a distribution sharing strategy is introduced leveraging the transferability of adversarial viewpoints across objects. Experiments validate the effectiveness of VIAT in improving the viewpoint robustness of various image classifiers based on the diversity of adversarial viewpoints generated by GMVFool. + + + + Multi-Metrics Adaptively Identifies Backdoors in Federated Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Multi-Metrics_Adaptively_Identifies_Backdoors_in_Federated_Learning_ICCV_2023_paper.pdf + The decentralized and privacy-preserving nature of federated learning (FL) makes it vulnerable to backdoor attacks aiming to manipulate the behavior of the resulting model on specific adversary-chosen inputs. 
However, most existing defenses based on statistical differences take effect only against specific attacks, especially when the malicious gradients are similar to benign ones or the data are highly non-independent and identically distributed (non-IID). In this paper, we revisit the distance-based defense methods and discover that i) Euclidean distance becomes meaningless in high dimensions and ii) malicious gradients with diverse characteristics cannot be identified by a single metric. To this end, we present a simple yet effective defense strategy with multi-metrics and dynamic weighting to identify backdoors adaptively. Furthermore, our novel defense has no reliance on predefined assumptions over attack settings or data distributions and little impact on benign performance. To evaluate the effectiveness of our approach, we conduct comprehensive experiments on different datasets under various attack settings, where our method achieves the best defensive performance. For instance, we achieve the lowest backdoor accuracy of 3.06% under the difficult Edge-case PGD, showing significant superiority over previous defenses. The results also demonstrate that our method can be well-adapted to a wide range of non-IID degrees without sacrificing the benign performance. + + + + FPR: False Positive Rectification for Weakly Supervised Semantic Segmentation http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_FPR_False_Positive_Rectification_for_Weakly_Supervised_Semantic_Segmentation_ICCV_2023_paper.pdf Many weakly supervised semantic segmentation (WSSS) methods employ the class activation map (CAM) to generate the initial segmentation results. However, CAM often fails to distinguish the foreground from its co-occurred background (e.g., train and railroad), resulting in inaccurate activation from the background. Previous endeavors address this co-occurrence issue by introducing external supervision and human priors. In this paper, we present a False Positive Rectification (FPR) approach to tackle the co-occurrence problem by leveraging the false positives of CAM. Based on the observation that the CAM-activated regions of absent classes contain class-specific co-occurred background cues, we collect these false positives and utilize them to guide the training of the CAM network by proposing a region-level contrast loss and a pixel-level rectification loss. Without introducing any external supervision and human priors, the proposed FPR effectively suppresses wrong activations from the background objects. Extensive experiments on the PASCAL VOC 2012 and MS COCO 2014 demonstrate that FPR brings significant improvements for off-the-shelf methods and achieves state-of-the-art performance. Code is available at https://github.com/mt-cly/FPR. + + + + DETRDistill: A Universal Knowledge Distillation Framework for DETR-families http://openaccess.thecvf.com//content/ICCV2023/papers/Chang_DETRDistill_A_Universal_Knowledge_Distillation_Framework_for_DETR-families_ICCV_2023_paper.pdf Transformer-based detectors (DETRs) are becoming popular for their simple framework, but the large model size and heavy time consumption hinder their deployment in the real world. Knowledge distillation (KD) can be an appealing technique to compress giant detectors into small ones with comparable detection performance and low inference cost. However, since DETRs formulate object detection as a set prediction problem, existing KD methods designed for classic convolution-based detectors may not be directly applicable.
In this paper, we propose DETRDistill, a novel knowledge distillation method dedicated to DETR-families. Specifically, we first design a Hungarian-matching logits distillation to encourage the student model to have the exact predictions as those of the teacher DETRs. Then, we propose a target-aware feature distillation to help the student model learn from the object-centric features of the teacher model. Finally, in order to improve the convergence rate of the student DETR, we introduce a query-prior assignment distillation to speed up the student model learning from well-trained queries and stable assignment of the teacher model. Extensive experimental results on the COCO dataset validate the effectiveness of our approach. Notably, DETRDistill consistently improves various DETRs by more than 2.0 mAP, even surpassing their teacher models. + + + + F&F Attack: Adversarial Attack against Multiple Object Trackers by Inducing False Negatives and False Positives + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_FF_Attack_Adversarial_Attack_against_Multiple_Object_Trackers_by_Inducing_ICCV_2023_paper.pdf + Multi-object tracking (MOT) aims to build moving trajectories for number-agnostic objects. Modern multi-object trackers commonly follow the tracking-by-detection strategy. Therefore, fooling detectors can be an effective solution but it usually requires attacks in multiple successive frames, resulting in low efficiency. Attacking association processes improves efficiency but may require model-specific design, leading to poor generalization. In this paper, we propose a novel False negative and False positive attack (F&F attack) mechanism: it perturbs the input image to erase original detections and to inject deceptive false alarms around original ones while integrating the association attack implicitly. The mechanism can produce effective identity switches against multi-object trackers by only fooling detectors in a few frames. To demonstrate the flexibility of the mechanism, we deploy it to three multi-object trackers (ByteTrack, SORT, and CenterTrack) which are enabled by two representative detectors (YOLOX and CenterNet). Comprehensive experiments on MOT17 and MOT20 datasets show that our method significantly outperforms existing attackers, revealing the vulnerability of the tracking-by-detection paradigm to detection attacks. + + + + Transferable Decoding with Visual Entities for Zero-Shot Image Captioning + http://openaccess.thecvf.com//content/ICCV2023/papers/Fei_Transferable_Decoding_with_Visual_Entities_for_Zero-Shot_Image_Captioning_ICCV_2023_paper.pdf + Image-to-text generation aims to describe images using natural language. Recently, zero-shot image captioning based on pre-trained vision-language models (VLMs) and large language models (LLMs) has made significant progress. However, we have observed and empirically demonstrated that these methods are susceptible to modality bias induced by LLMs and tend to generate descriptions containing objects (entities) that do not actually exist in the image but frequently appear during training (i.e., object hallucination). In this paper, we propose ViECap, a transferable decoding model that leverages entity-aware decoding to generate descriptions in both seen and unseen scenarios. ViECap incorporates entity-aware hard prompts to guide LLMs' attention toward the visual entities present in the image, enabling coherent caption generation across diverse scenes. 
With entity-aware hard prompts, ViECap is capable of maintaining performance when transferring from in-domain to out-of-domain scenarios. Extensive experiments demonstrate that ViECap sets a new state-of-the-art in cross-domain (transferable) captioning and performs competitively in in-domain captioning compared to previous VLM-based zero-shot methods. Our code is available at: https://github.com/FeiElysia/ViECap + + + + ReMoDiffuse: Retrieval-Augmented Motion Diffusion Model http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_ReMoDiffuse_Retrieval-Augmented_Motion_Diffusion_Model_ICCV_2023_paper.pdf 3D human motion generation is crucial for the creative industry. Recent advances rely on generative models with domain knowledge for text-driven motion generation, leading to substantial progress in capturing common motions. However, the performance on more diverse motions remains unsatisfactory. In this work, we propose ReMoDiffuse, a diffusion-model-based motion generation framework that integrates a retrieval mechanism to refine the denoising process. ReMoDiffuse enhances the generalizability and diversity of text-driven motion generation with three key designs: 1) Hybrid Retrieval finds appropriate references from the database in terms of both semantic and kinematic similarities. 2) Semantic-Modulated Transformer selectively absorbs retrieval knowledge, adapting to the difference between retrieved samples and the target motion sequence. 3) Condition Mixture better utilizes the retrieval database during inference, overcoming the scale sensitivity in classifier-free guidance. Extensive experiments demonstrate that ReMoDiffuse outperforms state-of-the-art methods by balancing both text-motion consistency and motion quality, especially for more diverse motion generation. + + + + Advancing Referring Expression Segmentation Beyond Single Image http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Advancing_Referring_Expression_Segmentation_Beyond_Single_Image_ICCV_2023_paper.pdf Referring Expression Segmentation (RES) is a widely explored multi-modal task, which endeavors to segment the pre-existing object within a single image with a given linguistic expression. However, in broader real-world scenarios, it is not always possible to determine if the described object exists in a specific image. Generally, a collection of images is available, some of which potentially contain the target objects. To this end, we propose a more realistic setting, named Group-wise Referring Expression Segmentation (GRES), which expands RES to a group of related images, allowing the described objects to exist in a subset of the input image group. To support this new setting, we introduce an elaborately compiled dataset named Grouped Referring Dataset (GRD), containing complete group-wise annotations of the target objects described by given expressions. Moreover, we also present a baseline method named Grouped Referring Segmenter (GRSer), which explicitly captures the language-vision and intra-group vision-vision interactions to achieve state-of-the-art results on the proposed GRES setting and related tasks, such as Co-Salient Object Detection and traditional RES. Our dataset and codes are publicly released at https://github.com/shikras/d-cube.
+ + + + LogicSeg: Parsing Visual Semantics with Neural Logic Learning and Reasoning + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_LogicSeg_Parsing_Visual_Semantics_with_Neural_Logic_Learning_and_Reasoning_ICCV_2023_paper.pdf + Current high-performance semantic segmentation models are purely data-driven sub-symbolic approaches and blind to the structured nature of the visual world. This is in stark contrast to human cognition which abstracts visual perceptions at multiple levels and conducts symbolic reasoning with such structured abstraction. To fill these fundamental gaps, we devise LogicSeg, a holistic visual semantic parser that integrates neural inductive learning and logic reasoning with both rich data and symbolic knowledge. In particular, the semantic concepts of interest are structured as a hierarchy, from which a comprehensive set of constraints are derived for describing the symbolic relations and formalized in first-order logic. After fuzzy logic-based continuous relaxation, logical formulae are grounded onto data and neural computational graphs, hence enabling logic-induced network training. During inference, logical constraints are packaged into an iterative process and injected into the network in a form of several matrix multiplications, so as to achieve hierarchy-coherent prediction with logic reasoning. These designs together make LogicSeg a general and compact neural-logic machine that is readily integrated into existing segmentation models. Extensive experiments over four datasets with various segmentation models and backbones verify the effectiveness and generality of LogicSeg. We believe this study opens a new avenue for visual semantic parsing. Our code will be released. + + + + Texture Learning Domain Randomization for Domain Generalized Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Texture_Learning_Domain_Randomization_for_Domain_Generalized_Segmentation_ICCV_2023_paper.pdf + Deep Neural Networks (DNNs)-based semantic segmentation models trained on a source domain often struggle to generalize to unseen target domains, i.e., a domain gap problem. Texture often contributes to the domain gap, making DNNs vulnerable to domain shift because they are prone to be texture-biased. Existing Domain Generalized Semantic Segmentation (DGSS) methods have alleviated the domain gap problem by guiding models to prioritize shape over texture. On the other hand, shape and texture are two prominent and complementary cues in semantic segmentation. This paper argues that leveraging texture is crucial for improving performance in DGSS. Specifically, we propose a novel framework, coined Texture Learning Domain Randomization (TLDR). TLDR includes two novel losses to effectively enhance texture learning in DGSS: (1) a texture regularization loss to prevent overfitting to source domain textures by using texture features from an ImageNet pre-trained model and (2) a texture generalization loss that utilizes random style images to learn diverse texture representations in a self-supervised manner. Extensive experimental results demonstrate the superiority of the proposed TLDR; e.g., TLDR achieves 46.5 mIoU on GTA-to-Cityscapes using ResNet-50, which improves the prior state-of-the-art method by 1.9 mIoU. The source code is available at https://github.com/ssssshwan/TLDR. 
+ + + + Learning Concise and Descriptive Attributes for Visual Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_Learning_Concise_and_Descriptive_Attributes_for_Visual_Recognition_ICCV_2023_paper.pdf + Recent advances in foundation models present new opportunities for interpretable visual recognition -- one can first query Large Language Models (LLMs) to obtain a set of attributes that describe each class, then apply vision-language models to classify images via these attributes. Pioneering work shows that querying thousands of attributes can achieve performance competitive with image features. However, our further investigation on 8 datasets reveals that LLM-generated attributes in a large quantity perform almost the same as random words. This surprising finding suggests that significant noise may be present in these attributes. We hypothesize that there exist subsets of attributes that can maintain the classification performance with much smaller sizes, and propose a novel learning-to-search method to discover those concise sets of attributes. As a result, on the CUB dataset, our method achieves performance close to that of massive LLM-generated attributes (e.g., 10k attributes for CUB), yet using only 32 attributes in total to distinguish 200 bird species. Furthermore, our new paradigm demonstrates several additional benefits: higher interpretability and interactivity for humans, and the ability to summarize knowledge for a recognition task. + + + + Label-Noise Learning with Intrinsically Long-Tailed Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_Label-Noise_Learning_with_Intrinsically_Long-Tailed_Data_ICCV_2023_paper.pdf + Label noise is one of the key factors that lead to the poor generalization of deep learning models. Existing label-noise learning methods usually assume that the ground-truth classes of the training data are balanced. However, the real-world data is often imbalanced, leading to the inconsistency between observed and intrinsic class distribution with label noises. In this case, it is hard to distinguish clean samples from noisy samples on the intrinsic tail classes with the unknown intrinsic class distribution. In this paper, we propose a learning framework for label-noise learning with intrinsically long-tailed data. Specifically, we propose two-stage bi-dimensional sample selection (TABASCO) to better separate clean samples from noisy samples, especially for the tail classes. TABASCO consists of two new separation metrics that complement each other to compensate for the limitation of using a single metric in sample separation. Extensive experiments on benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Wakings/TABASCO. + + + + Rethinking Range View Representation for LiDAR Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Kong_Rethinking_Range_View_Representation_for_LiDAR_Segmentation_ICCV_2023_paper.pdf + LiDAR segmentation is crucial for autonomous driving perception. Recent trends favor point- or voxel-based methods as they often yield better performance than the traditional range view representation. In this work, we unveil several key factors in building powerful range view models. We observe that the "many-to-one" mapping, semantic incoherence, and shape deformation are possible impediments against effective learning from range view projections. 
We present RangeFormer -- a full-cycle framework comprising novel designs across network architecture, data augmentation, and post-processing -- that better handles the learning and processing of LiDAR point clouds from the range view. We further introduce a Scalable Training from Range view (STR) strategy that trains on arbitrary low-resolution 2D range images, while still maintaining satisfactory 3D segmentation accuracy. We show that, for the first time, a range view method is able to surpass the point, voxel, and multi-view fusion counterparts in the competing LiDAR semantic and panoptic segmentation benchmarks, i.e., SemanticKITTI, nuScenes, and ScribbleKITTI. + + + + Divide and Conquer: 3D Point Cloud Instance Segmentation With Point-Wise Binarization + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Divide_and_Conquer_3D_Point_Cloud_Instance_Segmentation_With_Point-Wise_ICCV_2023_paper.pdf + Instance segmentation on point clouds is crucially important for 3D scene understanding. Most SOTAs adopt distance clustering, which is typically effective but does not perform well in segmenting adjacent objects with the same semantic label (especially when they share neighboring points). Due to the uneven distribution of offset points, these existing methods can hardly cluster all instance points. To this end, we design a novel divide-and-conquer strategy named PBNet that binarizes each point and clusters them separately to segment instances. Our binary clustering divides offset instance points into two categories: high and low density points (HPs vs. LPs). Adjacent objects can be clearly separated by removing LPs, and then be completed and refined by assigning LPs via a neighbor voting method. To suppress potential over-segmentation, we propose to construct local scenes with the weight mask for each instance. As a plug-in, the proposed binary clustering can replace the traditional distance clustering and lead to consistent performance gains on many mainstream baselines. A series of experiments on ScanNetV2 and S3DIS datasets indicate the superiority of our model. In particular, PBNet ranks first on the ScanNetV2 official benchmark challenge, achieving the highest mAP. Code will be available publicly at https://github.com/weiguangzhao/PBNet. + + + + BANSAC: A Dynamic BAyesian Network for Adaptive SAmple Consensus + http://openaccess.thecvf.com//content/ICCV2023/papers/Piedade_BANSAC_A_Dynamic_BAyesian_Network_for_Adaptive_SAmple_Consensus_ICCV_2023_paper.pdf + RANSAC-based algorithms are the standard techniques for robust estimation in computer vision. These algorithms are iterative and computationally expensive; they alternate between random sampling of data, computing hypotheses, and running inlier counting. Many authors tried different approaches to improve efficiency. One of the major improvements is having a guided sampling, letting the RANSAC cycle stop sooner. This paper presents a new adaptive sampling process for RANSAC. Previous methods either assume no prior information about the inlier/outlier classification of data points or use some previously computed scores in the sampling. In this paper, we derive a dynamic Bayesian network that updates individual data points' inlier scores while iterating RANSAC. At each iteration, we apply weighted sampling using the updated scores. Our method works with or without prior data point scorings. In addition, we use the updated inlier/outlier scoring for deriving a new stopping criterion for the RANSAC loop. 
We test our method in multiple real-world datasets for several applications and obtain state-of-the-art results. Our method outperforms the baselines in accuracy while needing less computational time. + + + + ShapeScaffolder: Structure-Aware 3D Shape Generation from Text http://openaccess.thecvf.com//content/ICCV2023/papers/Tian_ShapeScaffolder_Structure-Aware_3D_Shape_Generation_from_Text_ICCV_2023_paper.pdf We present ShapeScaffolder, a structure-based neural network for generating colored 3D shapes based on text input. The approach, similar to providing scaffolds as internal structural supports and adding more details to them, aims to capture finer text-shape connections and improve the quality of generated shapes. Traditional text-to-shape methods often generate 3D shapes as a whole. However, humans tend to understand both shape and text as being structure-based. For example, a table is interpreted as being composed of legs, a seat, and a back; similarly, texts possess inherent linguistic structures that can be analyzed as dependency graphs, depicting the relationships between entities within the text. We believe structure-aware shape generation can bring finer text-shape connections and improve shape generation quality. However, the lack of explicit shape structure and the high freedom of text structure make cross-modality learning challenging. To address these challenges, we first build the structured shape implicit fields in an unsupervised manner. We then propose the part-level attention mechanism between shape parts and textual graph nodes to align the two modalities at the structural level. Finally, we employ a shape refiner to add further detail to the predicted structure, yielding the final results. Extensive experimentation demonstrates that our approaches outperform state-of-the-art methods in terms of both shape fidelity and shape-text matching. Our methods also allow for part-level manipulation and improved part-level completeness. + + + + Read-only Prompt Optimization for Vision-Language Few-shot Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Read-only_Prompt_Optimization_for_Vision-Language_Few-shot_Learning_ICCV_2023_paper.pdf In recent years, prompt tuning has proven effective in adapting pre-trained vision-language models to downstream tasks. These methods aim to adapt the pre-trained models by introducing learnable prompts while keeping pre-trained weights frozen. However, learnable prompts can affect the internal representation within the self-attention module, which may negatively impact performance variance and generalization, especially in data-deficient settings. To address these issues, we propose a novel approach, Read-only Prompt Optimization (RPO). RPO leverages masked attention to prevent the internal representation shift in the pre-trained model. Further, to facilitate the optimization of RPO, the read-only prompts are initialized based on special tokens of the pre-trained model. Our extensive experiments demonstrate that RPO outperforms CLIP and CoCoOp in base-to-new generalization and domain generalization while displaying better robustness. Also, the proposed method achieves better generalization on extremely data-deficient settings, while improving parameter efficiency and computational overhead. Code is available at https://github.com/mlvlab/RPO.
+ + + + COCO-O: A Benchmark for Object Detectors under Natural Distribution Shifts http://openaccess.thecvf.com//content/ICCV2023/papers/Mao_COCO-O_A_Benchmark_for_Object_Detectors_under_Natural_Distribution_Shifts_ICCV_2023_paper.pdf Practical object detection applications can lose their effectiveness on image inputs with natural distribution shifts. This problem leads the research community to pay more attention to the robustness of detectors under Out-Of-Distribution (OOD) inputs. Existing works construct datasets to benchmark the detector's OOD robustness for a specific application scenario, e.g., Autonomous Driving. However, these datasets lack universality and make it hard to benchmark general detectors built on common tasks such as COCO. To give a more comprehensive robustness assessment, we introduce COCO-O(ut-of-distribution), a test dataset based on COCO with 6 types of natural distribution shifts. COCO-O has a large distribution gap with training data and results in a significant 55.7% relative performance drop on a Faster R-CNN detector. We leverage COCO-O to conduct experiments on more than 100 modern object detectors to investigate if their improvements are credible or just over-fitting to the COCO test set. Unfortunately, most classic detectors from earlier years do not exhibit strong OOD generalization. We further study the robustness effect of recent breakthroughs in detector architecture design, augmentation, and pre-training techniques. Some empirical findings are revealed: 1) Compared with detection head or neck, backbone is the most important part for robustness; 2) An end-to-end detection transformer design brings no enhancement, and may even reduce robustness; 3) Large-scale foundation models have made a great leap in robust object detection. We hope our COCO-O could provide a rich testbed for robustness study of object detection. The dataset will be available at https://github.com/alibaba/easyrobust/tree/main/benchmarks/coco_o. + + + + StageInteractor: Query-based Object Detector with Cross-stage Interaction http://openaccess.thecvf.com//content/ICCV2023/papers/Teng_StageInteractor_Query-based_Object_Detector_with_Cross-stage_Interaction_ICCV_2023_paper.pdf Previous object detectors make predictions based on dense grid points or numerous preset anchors. Most of these detectors are trained with one-to-many label assignment strategies. In contrast, recent query-based object detectors are based on a sparse set of learnable queries refined by a series of decoder layers. The one-to-one label assignment is independently applied on each layer for deep supervision during training. Despite the great success of query-based object detection, this vanilla one-to-one label assignment strategy requires the detectors to have strong fine-grained discrimination and modeling capacity. In this paper, we propose a new query-based object detector with cross-stage interaction, coined as StageInteractor. During the forward pass, we come up with an efficient way to improve this modeling ability by reusing dynamic operators with lightweight adapters. As for the label assignment, a cross-stage label assigner is designed to improve the one-to-one label assignment. With this assigner, the training target class labels are gathered across stages and then reallocated to proper predictions at each decoder layer. On the MS COCO benchmark, our model improves the baseline counterpart by 2.2 AP, and achieves a 44.8 AP with ResNet-50 as backbone, 100 queries and 12 training epochs.
With longer training time and 300 queries, StageInteractor achieves 51.3 AP and 52.7 AP with ResNeXt-101-DCN and Swin-S, respectively. The code and models are made available at https://github.com/MCG-NJU/StageInteractor. + + + + Moment Detection in Long Tutorial Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Croitoru_Moment_Detection_in_Long_Tutorial_Videos_ICCV_2023_paper.pdf + Tutorial videos play an increasingly important role in professional development and self-directed education. For users to realise the full benefits of this medium, tutorial videos must be efficiently searchable. In this work, we focus on the task of moment detection, in which the goal is to localise the temporal window where a given event occurs within a given tutorial video. Prior work on moment detection has focused primarily on short videos (typically on videos shorter than three minutes). However, many tutorial videos are substantially longer (stretching to hours in duration), presenting significant challenges for existing moment detection approaches. To study this problem, we propose the first dataset of untrimmed, long-form tutorial videos for the task of Moment Detection called the Behance Moment Detection (BMD) dataset. BMD videos have an average duration of over one hour and are characterised by slowly evolving visual content and wide-ranging dialogue. To meet the unique challenges of this dataset, we propose a new framework, LongMoment-DETR, and demonstrate that it outperforms strong baselines. Additionally, we introduce a variation of the dataset that contains YouTube Chapter annotations and show that the features obtained by our framework can be successfully used to boost the performance on the task of chapter detection. Code and data can be found at https://github.com/ioanacroi/longmoment-detr. + + + + DFA3D: 3D Deformable Attention For 2D-to-3D Feature Lifting + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_DFA3D_3D_Deformable_Attention_For_2D-to-3D_Feature_Lifting_ICCV_2023_paper.pdf + In this paper, we propose a new operator, called 3D DeFormable Attention (DFA3D), for 2D-to-3D feature lifting, which transforms multi-view 2D image features into a unified 3D space for 3D object detection. Existing feature lifting approaches, such as Lift-Splat-based and 2D attention-based, either use estimated depth to get pseudo LiDAR features and then splat them to a 3D space, which is a one-pass operation without feature refinement, or ignore depth and lift features by 2D attention mechanisms, which achieve finer semantics while suffering from a depth ambiguity problem. In contrast, our DFA3D-based method first leverages the estimated depth to expand each view's 2D feature map to 3D and then utilizes DFA3D to aggregate features from the expanded 3D feature maps. With the help of DFA3D, the depth ambiguity problem can be effectively alleviated from the root, and the lifted features can be progressively refined layer by layer, thanks to the Transformer-like architecture. In addition, we propose a mathematically equivalent implementation of DFA3D which can significantly improve its memory efficiency and computational speed. We integrate DFA3D into several methods that use 2D attention-based feature lifting with only a few modifications in code and evaluate on the nuScenes dataset. 
The experiment results show a consistent improvement of +1.41% mAP on average, and up to +15.1% mAP improvement when high-quality depth information is available, demonstrating the superiority, applicability, and huge potential of DFA3D. The code is available at https://github.com/IDEA-Research/3D-deformable-attention.git. + + + + Rosetta Neurons: Mining the Common Units in a Model Zoo + http://openaccess.thecvf.com//content/ICCV2023/papers/Dravid_Rosetta_Neurons_Mining_the_Common_Units_in_a_Model_Zoo_ICCV_2023_paper.pdf + Do different neural networks, trained for various vision tasks, share some common representations? In this paper, we demonstrate the existence of common features we call "Rosetta Neurons" across a range of models with different architectures, different tasks (generative and discriminative), and different types of supervision (class-supervised, text-supervised, self-supervised). We present an algorithm for mining a dictionary of Rosetta Neurons across several popular vision models: Class Supervised-ResNet50, DINO-ResNet50, DINO-ViT, MAE, CLIP-ResNet50, BigGAN, StyleGAN-2, StyleGAN-XL. Our findings suggest that certain visual concepts and structures are inherently embedded in the natural world and can be learned by different models regardless of the specific task or architecture, and without the use of semantic labels. We can visualize shared concepts directly due to generative models included in our analysis. The Rosetta Neurons facilitate model-to-model translation enabling various inversion-based manipulations, including cross-class alignments, shifting, zooming, and more, without the need for specialized training. + + + + Semi-Supervised Semantic Segmentation under Label Noise via Diverse Learning Groups + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Semi-Supervised_Semantic_Segmentation_under_Label_Noise_via_Diverse_Learning_Groups_ICCV_2023_paper.pdf + Semi-supervised semantic segmentation methods use a small amount of clean pixel-level annotations to guide the interpretation of a larger quantity of unlabelled image data. The challenges of providing pixel-accurate annotations at scale mean that the labels are typically noisy, and this contaminates the final results. In this work, we propose an approach that is robust to label noise in the annotated data. The method uses two diverse learning groups with different network architectures to effectively handle both label noise and unlabelled images. Each learning group consists of a teacher network, a student network and a novel filter module. The filter module of each learning group utilizes pixel-level features from the teacher network to detect incorrectly labelled pixels. To reduce confirmation bias, we employ the labels cleaned by the filter module from one learning group to train the other learning group. Experimental results on two different benchmarks and settings demonstrate the superiority of our method over state-of-the-art approaches. + + + + Segment Anything + http://openaccess.thecvf.com//content/ICCV2023/papers/Kirillov_Segment_Anything_ICCV_2023_paper.pdf + We introduce the Segment Anything (SA) project: a new task, model, and dataset for image segmentation. Using our efficient model in a data collection loop, we built the largest segmentation dataset to date (by far), with over 1 billion masks on 11M licensed and privacy respecting images. The model is designed and trained to be promptable, so it can transfer zero-shot to new image distributions and tasks. 
We evaluate its capabilities on numerous tasks and find that its zero-shot performance is impressive -- often competitive with or even superior to prior fully supervised results. We are releasing the Segment Anything Model (SAM) and corresponding dataset (SA-1B) of 1B masks and 11M images at https://segment-anything.com to foster research into foundation models for computer vision. We recommend reading the full paper at: https://arxiv.org/abs/2304.02643. + + + + Unsupervised Prompt Tuning for Text-Driven Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/He_Unsupervised_Prompt_Tuning_for_Text-Driven_Object_Detection_ICCV_2023_paper.pdf + Grounded language-image pre-trained models have shown strong zero-shot generalization to various downstream object detection tasks. Despite their promising performance, the models rely heavily on the laborious prompt engineering. Existing works typically address this problem by tuning text prompts using downstream training data in a few-shot or fully supervised manner. However, a rarely studied problem is to optimize text prompts without using any annotations. In this paper, we delve into this problem and propose an Unsupervised Prompt Tuning framework for text-driven object detection, which is composed of two novel mean teaching mechanisms. In conventional mean teaching, the quality of pseudo boxes is expected to optimize better as the training goes on, but there is still a risk of overfitting noisy pseudo boxes. To mitigate this problem, 1) we propose Nested Mean Teaching, which adopts nested-annotation to supervise teacher-student mutual learning in a bi-level optimization manner; 2) we propose Dual Complementary Teaching, which employs an offline pre-trained teacher and an online mean teacher via data-augmentation-based complementary labeling so as to ensure learning without accumulating confirmation bias. By integrating these two mechanisms, the proposed unsupervised prompt tuning framework achieves significant performance improvement on extensive object detection datasets. + + + + Re-ReND: Real-Time Rendering of NeRFs across Devices + http://openaccess.thecvf.com//content/ICCV2023/papers/Rojas_Re-ReND_Real-Time_Rendering_of_NeRFs_across_Devices_ICCV_2023_paper.pdf + This paper proposes a novel approach for rendering a pre-trained Neural Radiance Field (NeRF) in real-time on resource-constrained devices. We introduce Re-ReND, a method enabling Real-time Rendering of NeRFs across Devices. Re-ReND is designed to achieve real-time performance by converting the NeRF into a representation that can be efficiently processed by standard graphics pipelines. The proposed method distills the NeRF by extracting the learned density into a mesh, while the learned color information is factorized into a set of matrices that represent the scene's light field. Factorization implies the field is queried via inexpensive MLP-free matrix multiplications, while using a light field allows rendering a pixel by querying the field a single time--as opposed to hundreds of queries when employing a radiance field. Since the proposed representation can be implemented using a fragment shader, it can be directly integrated with standard rasterization frameworks. Our flexible implementation can render a NeRF in real-time with low memory requirements and on a wide range of resource-constrained devices, including mobiles and AR/VR headsets. 
Notably, we find that Re-ReND can achieve over a 2.6-fold increase in rendering speed versus the state-of-the-art without perceptible losses in quality. + + + + Handwritten and Printed Text Segmentation: A Signature Case Study + http://openaccess.thecvf.com//content/ICCV2023/papers/Gholamian_Handwritten_and_Printed_Text_Segmentation_A_Signature_Case_Study_ICCV_2023_paper.pdf + While analyzing scanned documents, handwritten text can overlap with printed text. This overlap causes difficulties during the optical character recognition (OCR) and digitization process of documents, and subsequently, hurts downstream NLP tasks. Prior research either focuses solely on the binary classification of handwritten text or performs a three-class segmentation of the document, i.e., recognition of handwritten, printed, and background pixels. This approach results in the assignment of overlapping handwritten and printed pixels to only one of the classes, and thus, they are not accounted for in the other class. Thus, in this research, we develop novel approaches to address the challenges of handwritten and printed text segmentation. Our objective is to recover text from different classes in their entirety, especially enhancing the segmentation performance on overlapping sections. To support this task, we introduce a new dataset, SignaTR6K, collected from real legal documents, as well as a new model architecture for the handwritten and printed text segmentation task. Our best configuration outperforms prior work on two different datasets by 17.9% and 7.3% on IoU scores. The SignaTR6K dataset is accessible for download via the following link: https://forms.office.com/r/2a5RDg7cAY. + + + + RbA: Segmenting Unknown Regions Rejected by All + http://openaccess.thecvf.com//content/ICCV2023/papers/Nayal_RbA_Segmenting_Unknown_Regions_Rejected_by_All_ICCV_2023_paper.pdf + Standard semantic segmentation models owe their success to curated datasets with a fixed set of semantic categories, without contemplating the possibility of identifying unknown objects from novel categories. Existing methods in outlier detection suffer from a lack of smoothness and objectness in their predictions, due to limitations of the per-pixel classification paradigm. Furthermore, additional training for detecting outliers harms the performance of known classes. In this paper, we explore another paradigm with region-level classification to better segment unknown objects. We show that the object queries in mask classification tend to behave like one vs. all classifiers. Based on this finding, we propose a novel outlier scoring function called RbA by defining the event of being an outlier as being rejected by all known classes. Our extensive experiments show that mask classification improves the performance of the existing outlier detection methods, and the best results are achieved with the proposed RbA. We also propose an objective to optimize RbA using minimal outlier supervision. Further fine-tuning with outliers improves the unknown performance, and unlike previous methods, it does not degrade the inlier performance. 
Project page: https://kuis-ai.github.io/RbA + + + + Towards Open-Vocabulary Video Instance Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Towards_Open-Vocabulary_Video_Instance_Segmentation_ICCV_2023_paper.pdf + Video Instance Segmentation (VIS) aims at segmenting and categorizing objects in videos from a closed set of training categories, lacking the generalization ability to handle novel categories in real-world videos. To address this limitation, we make the following three contributions. First, we introduce the novel task of Open-Vocabulary Video Instance Segmentation, which aims to simultaneously segment, track, and classify objects in videos from open-set categories, including novel categories unseen during training. Second, to benchmark Open-Vocabulary VIS, we collect a Large-Vocabulary Video Instance Segmentation dataset (LV-VIS), that contains well-annotated objects from 1,196 diverse categories, significantly surpassing the category size of existing datasets by more than one order of magnitude. Third, we propose an efficient Memory-Induced Transformer architecture, OV2Seg, to first achieve Open-Vocabulary VIS in an end-to-end manner with near real-time inference speed. Extensive experiments on LV-VIS and four existing VIS datasets demonstrate the strong zero-shot generalization ability of OV2Seg on novel categories. The dataset and code are released here https://github.com/haochenheheda/LVVIS. + + + + Unleashing the Power of Gradient Signal-to-Noise Ratio for Zero-Shot NAS + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Unleashing_the_Power_of_Gradient_Signal-to-Noise_Ratio_for_Zero-Shot_NAS_ICCV_2023_paper.pdf + Neural Architecture Search (NAS) aims to automatically find optimal neural network architectures in an efficient way. Zero-Shot NAS is a promising technique that leverages proxies to predict the accuracy of candidate architectures without any training. However, we have observed that most existing proxies do not consistently perform well across different search spaces, and are less concerned with generalization. Recently, the gradient signal-to-noise ratio (GSNR) was shown to be correlated with neural network generalization performance. In this paper, we not only explicitly give the probability that larger GSNR at network initialization can ensure better generalization, but also theoretically prove that GSNR can ensure better convergence. Then we design the Xi-based gradient signal-to-noise ratio (Xi-GSNR) as a Zero-Shot NAS proxy to predict the network accuracy at initialization. Extensive experiments in different search spaces demonstrate that Xi-GSNR provides superior ranking consistency compared to previous proxies. Moreover, Xi-GSNR-based Zero-Shot NAS also achieves outstanding performance when directly searching for the optimal architecture in various search spaces and datasets. The source code is available at https://github.com/Sunzh1996/Xi-GSNR. + + + + BiViT: Extremely Compressed Binary Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/He_BiViT_Extremely_Compressed_Binary_Vision_Transformers_ICCV_2023_paper.pdf + Model binarization can significantly compress model size, reduce energy consumption, and accelerate inference through efficient bit-wise operations. Although binarizing convolutional neural networks have been extensively studied, there is little work on exploring binarization of vision Transformers which underpin most recent breakthroughs in visual recognition. 
To this end, we propose to solve two fundamental challenges to push the horizon of Binary Vision Transformers (BiViT). First, the traditional binary method does not take the long-tailed distribution of softmax attention into consideration, bringing large binarization errors in the attention module. To solve this, we propose Softmax-aware Binarization, which dynamically adapts to the data distribution and reduces the error caused by binarization. Second, to better preserve the information of the pretrained model and restore accuracy, we propose a Cross-layer Binarization scheme that decouples the binarization of self-attention and multi-layer perceptrons (MLPs), and Parameterized Weight Scales which introduce learnable scaling factors for weight binarization. Overall, our method performs favorably against state-of-the-arts by 19.8% on the TinyImageNet dataset. On ImageNet, our BiViT achieves a competitive 75.6% Top-1 accuracy over Swin-S model. Additionally, on COCO object detection, our method achieves an mAP of 40.8 with a Swin-T backbone over Cascade Mask R-CNN framework. + + + + Tree-Structured Shading Decomposition + http://openaccess.thecvf.com//content/ICCV2023/papers/Geng_Tree-Structured_Shading_Decomposition_ICCV_2023_paper.pdf + We study inferring a tree-structured representation from a single image for object shading. Prior work typically uses the parametric or measured representation to model shading, which is neither interpretable nor easily editable. We propose using the shade tree representation, which combines basic shading nodes and compositing methods to factorize object surface shading. The shade tree representation enables novice users who are unfamiliar with the physical shading process to edit object shading in an efficient and intuitive manner. A main challenge in inferring the shade tree is that the inference problem involves both the discrete tree structure and the continuous parameters of the tree nodes. We propose a hybrid approach to address this issue. We introduce an auto-regressive inference model to generate a rough estimation of the tree structure and node parameters, and then we fine-tune the inferred shade tree through an optimization algorithm. We show experiments on synthetic images, captured reflectance, real images, and non-realistic vector drawings, allowing downstream applications such as material editing, vectorized shading, and relighting. Project website: https://chen-geng.com/inv-shade-trees. + + + + EfficientTrain: Exploring Generalized Curriculum Learning for Training Visual Backbones + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_EfficientTrain_Exploring_Generalized_Curriculum_Learning_for_Training_Visual_Backbones_ICCV_2023_paper.pdf + The superior performance of modern deep networks usually comes with a costly training procedure. This paper presents a new curriculum learning approach for the efficient training of visual backbones (e.g., vision Transformers). Our work is inspired by the inherent learning dynamics of deep networks: we experimentally show that at an earlier training stage, the model mainly learns to recognize some 'easier-to-learn' discriminative patterns within each example, e.g., the lower-frequency components of images and the original information before data augmentation. 
Driven by this phenomenon, we propose a curriculum where the model always leverages all the training data at each epoch, while the curriculum starts with only exposing the 'easier-to-learn' patterns of each example, and gradually introduces more difficult patterns. To implement this idea, we 1) introduce a cropping operation in the Fourier spectrum of the inputs, which enables the model to learn from only the lower-frequency components efficiently, 2) demonstrate that exposing the features of original images amounts to adopting weaker data augmentation, and 3) integrate 1) and 2) and design a curriculum learning schedule with a greedy-search algorithm. The resulting approach, EfficientTrain, is simple, general, yet surprisingly effective. As an off-the-shelf method, it reduces the wall-time training cost of a wide variety of popular models (e.g., ResNet, ConvNeXt, DeiT, PVT, Swin, and CSWin) by >1.5x on ImageNet-1K/22K without sacrificing accuracy. It is also effective for self-supervised learning (e.g., MAE). Code is available at https://github.com/LeapLabTHU/EfficientTrain. + + + + IntrinsicNeRF: Learning Intrinsic Neural Radiance Fields for Editable Novel View Synthesis http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_IntrinsicNeRF_Learning_Intrinsic_Neural_Radiance_Fields_for_Editable_Novel_View_ICCV_2023_paper.pdf Existing methods combining inverse rendering with neural rendering can only perform editable novel view synthesis on object-specific scenes, while we present intrinsic neural radiance fields, dubbed IntrinsicNeRF, which introduce intrinsic decomposition into the NeRF-based neural rendering method and can extend its application to room-scale scenes. Since intrinsic decomposition is a fundamentally under-constrained inverse problem, we propose a novel distance-aware point sampling and adaptive reflectance iterative clustering optimization method, which enables IntrinsicNeRF with traditional intrinsic decomposition constraints to be trained in an unsupervised manner, resulting in multi-view consistent intrinsic decomposition results. To cope with the problem that different adjacent instances of similar reflectance in a scene are incorrectly clustered together, we further propose a hierarchical clustering method with coarse-to-fine optimization to obtain a fast hierarchical indexing representation. It supports compelling real-time augmented applications such as recoloring and illumination variation. Extensive experiments and editing samples on both object-specific/room-scale scenes and synthetic/real-world data demonstrate that we can obtain consistent intrinsic decomposition results and high-fidelity novel view synthesis even for challenging sequences. + + + + Multi-Object Discovery by Low-Dimensional Object Motion http://openaccess.thecvf.com//content/ICCV2023/papers/Safadoust_Multi-Object_Discovery_by_Low-Dimensional_Object_Motion_ICCV_2023_paper.pdf Recent work in unsupervised multi-object segmentation shows impressive results by predicting motion from a single image despite the inherent ambiguity in predicting motion without the next image. On the other hand, the set of possible motions for an image can be constrained to a low-dimensional space by considering the scene structure and moving objects in it. We propose to model pixel-wise geometry and object motion to remove ambiguity in reconstructing flow from a single image.
Specifically, we divide the image into coherently moving regions and use depth to construct flow bases that best explain the observed flow in each region. We achieve state-of-the-art results in unsupervised multi-object segmentation on synthetic and real-world datasets by modeling the scene structure and object motion. Our evaluation of the predicted depth maps shows reliable performance in monocular depth estimation. + + + + GACE: Geometry Aware Confidence Enhancement for Black-Box 3D Object Detectors on LiDAR-Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Schinagl_GACE_Geometry_Aware_Confidence_Enhancement_for_Black-Box_3D_Object_Detectors_ICCV_2023_paper.pdf + Widely-used LiDAR-based 3D object detectors often neglect fundamental geometric information readily available from the object proposals in their confidence estimation. This is mostly due to architectural design choices, which were often adopted from the 2D image domain, where geometric context is rarely available. In 3D, however, considering the object properties and its surroundings in a holistic way is important to distinguish between true and false positive detections, e.g. occluded pedestrians in a group. To address this, we present GACE, an intuitive and highly efficient method to improve the confidence estimation of a given black-box 3D object detector. We aggregate geometric cues of detections and their spatial relationships, which enables us to properly assess their plausibility and consequently, improve the confidence estimation. This leads to consistent performance gains over a variety of state-of-the-art detectors. Across all evaluated detectors, GACE proves to be especially beneficial for the vulnerable road user classes, i.e. pedestrians and cyclists. + + + + ToonTalker: Cross-Domain Face Reenactment + http://openaccess.thecvf.com//content/ICCV2023/papers/Gong_ToonTalker_Cross-Domain_Face_Reenactment_ICCV_2023_paper.pdf + We target cross-domain face reenactment in this paper, i.e., driving a cartoon image with the video of a real person and vice versa. Recently, many works have focused on one-shot talking face generation to drive a portrait with a real video, i.e., within-domain reenactment. Straightforwardly applying those methods to cross-domain animation will cause inaccurate expression transfer, blur effects, and even apparent artifacts due to the domain shift between cartoon and real faces. Only a few works attempt to settle cross-domain face reenactment. The most related work AnimeCeleb requires constructing a dataset with pose vector and cartoon image pairs by animating 3D characters, which makes it inapplicable anymore if no paired data is available. In this paper, we propose a novel method for cross-domain reenactment without paired data. Specifically, we propose a transformer-based framework to align the motions from different domains into a common latent space where motion transfer is conducted via latent code addition. Two domain-specific motion encoders and two learnable motion base memories are used to capture domain properties. A source query transformer and a driving one are exploited to project domain-specific motion to the canonical space. The edited motion is projected back to the domain of the source with a transformer. Moreover, since no paired data is provided, we propose a novel cross-domain training scheme using data from two domains with the designed analogy constraint. Besides, we contribute a cartoon dataset in Disney style. 
Extensive evaluations demonstrate the superiority of our method over competing methods. + + + + Source-free Domain Adaptive Human Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Peng_Source-free_Domain_Adaptive_Human_Pose_Estimation_ICCV_2023_paper.pdf + Human Pose Estimation (HPE) is widely used in various fields, including motion analysis, healthcare, and virtual reality. However, the great expenses of labeled real-world datasets present a significant challenge for HPE. To overcome this, one approach is to train HPE models on synthetic datasets and then perform domain adaptation (DA) on real-world data. Unfortunately, existing DA methods for HPE neglect data privacy and security by using both source and target data in the adaptation process. To this end, we propose a new task, named source-free domain adaptive HPE, which aims to address the challenges of cross-domain learning of HPE without access to source data during the adaptation process. We further propose a novel framework that consists of three models: source model, intermediate model, and target model, which explores the task from both source-protect and target-relevant perspectives. The source-protect module preserves source information more effectively while resisting noise, and the target-relevant module reduces the sparsity of spatial representations by building a novel spatial probability space, and pose-specific contrastive learning and information maximization are proposed on the basis of this space. Comprehensive experiments on several domain adaptive HPE benchmarks show that the proposed method outperforms existing approaches by a considerable margin. The codes are available at https://github.com/davidpengucf/SFDAHPE. + + + + DOT: A Distillation-Oriented Trainer + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_DOT_A_Distillation-Oriented_Trainer_ICCV_2023_paper.pdf + Knowledge distillation transfers knowledge from a large model to a small one via task and distillation losses. In this paper, we observe a trade-off between task and distillation losses, i.e., introducing distillation loss limits the convergence of task loss. We believe that the trade-off results from the insufficient optimization of distillation loss. The reason is: The teacher has a lower task loss than the student, and a lower distillation loss drives the student more similar to the teacher, then a better-converged task loss could be obtained. To break the trade-off, we propose the Distillation-Oriented Trainer (DOT). DOT separately considers gradients of task and distillation losses, then applies a larger momentum to distillation loss to accelerate its optimization. We empirically prove that DOT breaks the trade-off, i.e., both losses are sufficiently optimized. Extensive experiments validate the superiority of DOT. Notably, DOT achieves a +2.59% accuracy improvement on ImageNet-1k for the ResNet50-MobileNetV1 pair. Conclusively, DOT greatly benefits the student's optimization properties in terms of loss convergence and model generalization. Code will be made publicly available. + + + + Neural Collage Transfer: Artistic Reconstruction via Material Manipulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Neural_Collage_Transfer_Artistic_Reconstruction_via_Material_Manipulation_ICCV_2023_paper.pdf + Collage is a creative art form that uses diverse material scraps as a base unit to compose a single image. 
Although pixel-wise generation techniques can reproduce a target image in collage style, it is not a suitable method due to the solid stroke-by-stroke nature of the collage form. While some previous works for stroke-based rendering produced decent sketches and paintings, collages have received much less attention in research despite their popularity as a style. In this paper, we propose a method for learning to make collages via reinforcement learning without the need for demonstrations or collage artwork data. We design the collage Markov Decision Process (MDP), which allows the agent to handle various materials and propose a model-based soft actor-critic to mitigate the agent's training burden derived from the sophisticated dynamics of collage. Moreover, we devise additional techniques such as active material selection and complexity-based multi-scale collage to handle target images at any size and enhance the results' aesthetics by placing relatively more scraps in areas of high complexity. Experimental results show that the trained agent appropriately selected and pasted materials to regenerate the target image into a collage and obtained a higher evaluation score on content and style than pixel-wise generation methods. Code is available at https://github.com/northadventure/CollageRL. + + + + Informative Data Mining for One-Shot Cross-Domain Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Informative_Data_Mining_for_One-Shot_Cross-Domain_Semantic_Segmentation_ICCV_2023_paper.pdf + Contemporary domain adaptation offers a practical solution for achieving cross-domain transfer of semantic segmentation between labelled source data and unlabeled target data. These solutions have gained significant popularity; however, they require the model to be retrained when the test environment changes. This can result in unbearable costs in certain applications due to the time-consuming training process and concerns regarding data privacy. One-shot domain adaptation methods attempt to overcome these challenges by transferring the pre-trained source model to the target domain using only one target data. Despite this, the referring style transfer module still faces issues with computation cost and over-fitting problems. To address this problem, we propose a novel framework called Informative Data Mining (IDM) that enables efficient one-shot domain adaptation for semantic segmentation. Specifically, IDM provides an uncertainty-based selection criterion to identify the most informative samples, which facilitates quick adaptation and reduces redundant training. We then perform a model adaptation method using these selected samples, which includes patch-wise mixing and prototype-based information maximization to update the model. This approach effectively enhances adaptation and mitigates the overfitting problem. In general, we provide empirical evidence of the effectiveness and efficiency of IDM. Our approach outperforms existing methods and achieves a new state-of-the-art one-shot performance of 56.7%/55.4% on the GTA5/SYNTHIA to Cityscapes adaptation tasks, respectively. The code will be released at https://github.com/yxiwang/IDM. 
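As a rough illustration of the uncertainty-based selection criterion described in the Informative Data Mining (IDM) entry above, the sketch below ranks unlabeled target images by their mean per-pixel prediction entropy and keeps the most uncertain ones as the informative subset. The function names and the use of mean entropy as the uncertainty score are assumptions for illustration only, not the paper's exact criterion.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def mean_pixel_entropy(logits):
    """Average per-pixel predictive entropy for a batch of segmentation logits.

    logits: (B, C, H, W) raw scores from the segmentation head.
    Returns a (B,) tensor; higher values mean the image is more uncertain.
    """
    probs = F.softmax(logits, dim=1)
    entropy = -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=1)  # (B, H, W)
    return entropy.mean(dim=(1, 2))


@torch.no_grad()
def select_informative(model, target_images, k):
    """Rank unlabeled target images by uncertainty and keep the top-k indices."""
    model.eval()
    scores = mean_pixel_entropy(model(target_images))
    return scores.topk(k).indices
```

In practice such a selection step would be interleaved with the adaptation rounds, so that the model is updated only on the few samples it is currently least certain about.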
+ + + + Householder Projector for Unsupervised Latent Semantics Discovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_Householder_Projector_for_Unsupervised_Latent_Semantics_Discovery_ICCV_2023_paper.pdf + Generative Adversarial Networks (GANs), especially the recent style-based generators (StyleGANs), have versatile semantics in the structured latent space. Latent semantics discovery methods emerge to move around the latent code such that only one factor varies during the traversal. Recently, an unsupervised method proposed a promising direction to directly use the eigenvectors of the projection matrix that maps latent codes to features as the interpretable directions. However, one overlooked fact is that the projection matrix is non-orthogonal and the number of eigenvectors is too large. The non-orthogonality would entangle semantic attributes in the top few eigenvectors, and the large dimensionality might result in meaningless variations among the directions even if the matrix is orthogonal. To avoid these issues, we propose Householder Projector, a flexible and general low-rank orthogonal matrix representation based on Householder transformations, to parameterize the projection matrix. The orthogonality guarantees that the eigenvectors correspond to disentangled interpretable semantics, while the low-rank property encourages that each identified direction has meaningful variations. We integrate our projector into pre-trained StyleGAN2/StyleGAN3 and evaluate the models on several benchmarks. Within only 1% of the original training steps for fine-tuning, our projector helps StyleGANs to discover more disentangled and precise semantic attributes without sacrificing image fidelity. + + + + Bayesian Optimization Meets Self-Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Bayesian_Optimization_Meets_Self-Distillation_ICCV_2023_paper.pdf + Bayesian optimization (BO) has contributed greatly to improving model performance by suggesting promising hyperparameter configurations iteratively based on observations from multiple training trials. However, only partial knowledge (i.e., the measured performances of trained models and their hyperparameter configurations) from previous trials is transferred. On the other hand, Self-Distillation (SD) only transfers partial knowledge learned by the task model itself. To fully leverage the various knowledge gained from all training trials, we propose the BOSS framework, which combines BO and SD. BOSS suggests promising hyperparameter configurations through BO and carefully selects pre-trained models from previous trials for SD, which are otherwise abandoned in the conventional BO process. BOSS achieves significantly better performance than both BO and SD in a wide range of tasks including general image classification, learning with noisy labels, semi-supervised learning, and medical image analysis tasks. Our code is available at https://github.com/sooperset/boss. + + + + No Fear of Classifier Biases: Neural Collapse Inspired Federated Learning with Synthetic and Fixed Classifier + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_No_Fear_of_Classifier_Biases_Neural_Collapse_Inspired_Federated_Learning_ICCV_2023_paper.pdf + Data heterogeneity is an inherent challenge that hinders the performance of federated learning (FL). Recent studies have identified the biased classifiers of local models as the key bottleneck. 
Previous attempts have used classifier calibration after FL training, but this approach falls short in improving the poor feature representations caused by training-time classifier biases. Resolving the classifier bias dilemma in FL requires a full understanding of the mechanisms behind the classifier. Recent advances in neural collapse have shown that the classifiers and feature prototypes under perfect training scenarios collapse into an optimal structure called simplex equiangular tight frame (ETF). Building on this neural collapse insight, we propose a solution to the FL's classifier bias problem by utilizing a synthetic and fixed ETF classifier during training. The optimal classifier structure enables all clients to learn unified and optimal feature representations even under extremely heterogeneous data. We devise several effective modules to better adapt the ETF structure in FL, achieving both high generalization and personalization. Extensive experiments demonstrate that our method achieves state-of-the-art performances on CIFAR-10, CIFAR-100, and Tiny-ImageNet. The code is available at https://github.com/ZexiLee/ICCV-2023-FedETF. + + + + MemorySeg: Online LiDAR Semantic Segmentation with a Latent Memory + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_MemorySeg_Online_LiDAR_Semantic_Segmentation_with_a_Latent_Memory_ICCV_2023_paper.pdf + Semantic segmentation of LiDAR point clouds has been widely studied in recent years, with most existing methods focusing on tackling this task using a single scan of the environment. However, leveraging the temporal stream of observations can provide very rich contextual information on regions of the scene with poor visibility (e.g., occlusions) or sparse observations (e.g., at long range), and can help reduce redundant computation frame after frame. In this paper, we tackle the challenge of exploiting the information from the past frames to improve the predictions of the current frame in an online fashion. To address this challenge, we propose a novel framework for semantic segmentation of a temporal sequence of LiDAR point clouds that utilizes a memory network to store, update and retrieve past information. Our framework also includes a novel regularizer that penalizes prediction variations in the neighborhood of the point cloud. Prior works have attempted to incorporate memory in range view representations for semantic segmentation, but these methods fail to handle occlusions and the range view representation of the scene changes drastically as agents nearby move. Our proposed framework overcomes these limitations by building a sparse 3D latent representation of the surroundings. We evaluate our method on SemanticKITTI, nuScenes, and PandaSet. Our experiments demonstrate the effectiveness of the proposed framework compared to the state-of-the-art. For more information, visit the project website: https://waabi.ai/research/memoryseg. + + + + Hashing Neural Video Decomposition with Multiplicative Residuals in Space-Time + http://openaccess.thecvf.com//content/ICCV2023/papers/Chan_Hashing_Neural_Video_Decomposition_with_Multiplicative_Residuals_in_Space-Time_ICCV_2023_paper.pdf + We present a video decomposition method that facilitates layer-based editing of videos with spatiotemporally varying lighting and motion effects. 
Our neural model decomposes an input video into multiple layered representations, each comprising a 2D texture map, a mask for the original video, and a multiplicative residual characterizing the spatiotemporal variations in lighting conditions. A single edit on the texture maps can be propagated to the corresponding locations in the entire video frames while preserving other contents' consistencies. Our method efficiently learns the layer-based neural representations of a 1080p video in 25s per frame via coordinate hashing and allows real-time rendering of the edited result at 71 fps on a single GPU. Qualitatively, we run our method on various videos to show its effectiveness in generating high-quality editing effects. Quantitatively, we propose to adopt feature-tracking evaluation metrics for objectively assessing the consistency of video editing. Project page: https://lightbulb12294.github.io/hashing-nvd/ + + + + Multimodal Variational Auto-encoder based Audio-Visual Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Mao_Multimodal_Variational_Auto-encoder_based_Audio-Visual_Segmentation_ICCV_2023_paper.pdf + We propose an Explicit Conditional Multimodal Variational Auto-Encoder (ECMVAE) for audio-visual segmentation (AVS), aiming to segment sound sources in the video sequence. Existing AVS methods focus on implicit feature fusion strategies, where models are trained to fit the discrete samples in the dataset. With a limited and less diverse dataset, the resulting performance is usually unsatisfactory. In contrast, we address this problem from an effective representation learning perspective, aiming to model the contribution of each modality explicitly. Specifically, we find that audio contains critical category information of the sound producers, and visual data provides candidate sound producer(s). Their shared information corresponds to the target sound producer(s) shown in the visual data. In this case, cross-modal shared representation learning is especially important for AVS. To achieve this, our ECMVAE factorizes the representations of each modality with a modality-shared representation and a modality-specific representation. An orthogonality constraint is applied between the shared and specific representations to maintain the exclusive attribute of the factorized latent code. Further, a mutual information maximization regularizer is introduced to achieve extensive exploration of each modality. Quantitative and qualitative evaluations on the AVSBench demonstrate the effectiveness of our approach, leading to a new state-of-the-art for AVS, with a 3.84 mIOU performance leap on the challenging MS3 subset for multiple sound source segmentation. Code and pre-train model will release to provide full details of our proposed method. + + + + DynaMITe: Dynamic Query Bootstrapping for Multi-object Interactive Segmentation Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Rana_DynaMITe_Dynamic_Query_Bootstrapping_for_Multi-object_Interactive_Segmentation_Transformer_ICCV_2023_paper.pdf + Most state-of-the-art instance segmentation methods rely on large amounts of pixel-precise ground-truth annotations for training, which are expensive to create. Interactive segmentation networks help generate such annotations based on an image and the corresponding user interactions such as clicks. Existing methods for this task can only process a single instance at a time and each user interaction requires a full forward pass through the entire deep network. 
We introduce a more efficient approach, called DynaMITe, in which we represent user interactions as spatio-temporal queries to a Transformer decoder with a potential to segment multiple object instances in a single iteration. Our architecture also alleviates any need to re-compute image features during refinement, and requires fewer interactions for segmenting multiple instances in a single image when compared to other methods. DynaMITe achieves state-of-the-art results on multiple existing interactive segmentation benchmarks, and also on the new multi-instance benchmark that we propose in this paper. + + + + FRAug: Tackling Federated Learning with Non-IID Features via Representation Augmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_FRAug_Tackling_Federated_Learning_with_Non-IID_Features_via_Representation_Augmentation_ICCV_2023_paper.pdf + Federated Learning (FL) is a decentralized machine learning paradigm, in which multiple clients collaboratively train neural networks without centralizing their local data, and hence preserve data privacy. However, real-world FL applications usually encounter challenges arising from distribution shifts across the local datasets of individual clients. These shifts may drift the global model aggregation or result in convergence to deflected local optimum. While existing efforts have addressed distribution shifts in the label space, an equally important challenge remains relatively unexplored. This challenge involves situations where the local data of different clients indicate identical label distributions but exhibit divergent feature distributions. This issue can significantly impact the global model performance in the FL framework. In this work, we propose Federated Representation Augmentation (FRAug) to resolve this practical and challenging problem. FRAug optimizes a shared embedding generator to capture client consensus. Its output synthetic embeddings are transformed into client-specific by a locally optimized RTNet to augment the training space of each client. Our empirical evaluation on three public benchmarks and a real-world medical dataset demonstrates the effectiveness of the proposed method, which substantially outperforms the current state-of-the-art FL methods for feature distribution shifts, including PartialFed and FedBN. + + + + Homography Guided Temporal Fusion for Road Line and Marking Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Homography_Guided_Temporal_Fusion_for_Road_Line_and_Marking_Segmentation_ICCV_2023_paper.pdf + Reliable segmentation of road lines and markings is critical to autonomous driving. Our work is motivated by the observations that road lines and markings are (1) frequently occluded in the presence of moving vehicles, shadow, and glare and (2) highly structured with low intra-class shape variance and overall high appearance consistency. To solve these issues, we propose a Homography Guided Fusion (HomoFusion) module to exploit temporally-adjacent video frames for complementary cues facilitating the correct classification of the partially occluded road lines or markings. To reduce computational complexity, a novel surface normal estimator is proposed to establish spatial correspondences between the sampled frames, allowing the HomoFusion module to perform a pixel-to-pixel attention mechanism in updating the representation of the occluded road lines or markings. 
Experiments on ApolloScape, a large-scale lane mark segmentation dataset, and ApolloScape Night with artificial simulated night-time road conditions, demonstrate that our method outperforms other existing SOTA lane mark segmentation models with less than 9% of their parameters and computational complexity. We show that exploiting available camera intrinsic data and ground plane assumption for cross-frame correspondence can lead to a light-weight network with significantly improved performances in speed and accuracy. We also prove the versatility of our HomoFusion approach by applying it to the problem of water puddle segmentation and achieving SOTA performance. + + + + NeuRBF: A Neural Fields Representation with Adaptive Radial Basis Functions + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_NeuRBF_A_Neural_Fields_Representation_with_Adaptive_Radial_Basis_Functions_ICCV_2023_paper.pdf + We present a novel type of neural fields that uses general radial bases for signal representation. State-of-the-art neural fields typically rely on grid-based representations for storing local neural features and N-dimensional linear kernels for interpolating features at continuous query points. The spatial positions of their neural features are fixed on grid nodes and cannot well adapt to target signals. Our method instead builds upon general radial bases with flexible kernel position and shape, which have higher spatial adaptivity and can more closely fit target signals. To further improve the channel-wise capacity of radial basis functions, we propose to compose them with multi-frequency sinusoid functions. This technique extends a radial basis to multiple Fourier radial bases of different frequency bands without requiring extra parameters, facilitating the representation of details. Moreover, by marrying adaptive radial bases with grid-based ones, our hybrid combination inherits both adaptivity and interpolation smoothness. We carefully designed weighting schemes to let radial bases adapt to different types of signals effectively. Our experiments on 2D image and 3D signed distance field representation demonstrate the higher accuracy and compactness of our method than prior arts. When applied to neural radiance field reconstruction, our method achieves state-of-the-art rendering quality, with small model size and comparable training speed. + + + + Multi-granularity Interaction Simulation for Unsupervised Interactive Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Multi-granularity_Interaction_Simulation_for_Unsupervised_Interactive_Segmentation_ICCV_2023_paper.pdf + Interactive segmentation enables users to segment as needed by providing cues of objects, which introduces human-computer interaction for many fields, such as image editing and medical image analysis. Typically, massive and expansive pixel-level annotations are spent to train deep models by object-oriented interactions with manually labeled object masks. In this work, we reveal that informative interactions can be made by simulation with semantic-consistent yet diverse region exploration in an unsupervised paradigm. Concretely, we introduce a Multi-granularity Interaction Simulation (MIS) approach to open up a promising direction for unsupervised interactive segmentation. 
Drawing on the high-quality dense features produced by recent self-supervised models, we propose to gradually merge patches or regions with similar features to form more extensive regions and thus, every merged region serves as a semantically meaningful multi-granularity proposal. By randomly sampling these proposals and simulating possible interactions based on them, we provide meaningful interaction at multiple granularities to teach the model to understand interactions. Our MIS significantly outperforms non-deep learning unsupervised methods and is even comparable with some previous deep-supervised methods without any annotation. + + + + RecursiveDet: End-to-End Region-Based Recursive Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_RecursiveDet_End-to-End_Region-Based_Recursive_Object_Detection_ICCV_2023_paper.pdf + End-to-end region-based object detectors like Sparse R-CNN usually have multiple cascade bounding box decoding stages, which refine the current predictions according to their previous results. Model parameters within each stage are independent, incurring a huge cost. In this paper, we find the general setting of decoding stages is actually redundant. By simply sharing parameters and making a recursive decoder, the detector already obtains a significant improvement. The recursive decoder can be further enhanced by positional encoding (PE) of the proposal box, which makes it aware of the exact locations and sizes of input bounding boxes, thus becoming adaptive to proposals from different stages during the recursion. Moreover, we also design centerness-based PE to distinguish the RoI feature element and dynamic convolution kernels at different positions within the bounding box. To validate the effectiveness of the proposed method, we conduct intensive ablations and build the full model on three recent mainstream region-based detectors. The RecursiveDet is able to achieve obvious performance boosts with even fewer model parameters and slightly increased computation cost. + + + + Structure Invariant Transformation for better Adversarial Transferability + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Structure_Invariant_Transformation_for_better_Adversarial_Transferability_ICCV_2023_paper.pdf + Given the severe vulnerability of Deep Neural Networks (DNNs) against adversarial examples, there is an urgent need for an effective adversarial attack to identify the deficiencies of DNNs in security-sensitive applications. As one of the prevalent black-box adversarial attacks, the existing transfer-based attacks still cannot achieve comparable performance with the white-box attacks. Among these, input transformation based attacks have shown remarkable effectiveness in boosting transferability. In this work, we find that the existing input transformation based attacks transform the input image globally, resulting in limited diversity of the transformed images. We postulate that the more diverse transformed images result in better transferability. Thus, we investigate how to locally apply various transformations onto the input image to improve such diversity while preserving the structure of the image. To this end, we propose a novel input transformation based attack, called Structure Invariant Transformation (SIA), which applies a random image transformation onto each image block to craft a set of diverse images for gradient calculation. 
Extensive experiments on the standard ImageNet dataset demonstrate that SIA exhibits much better transferability than the existing SOTA input transformation based attacks on CNN-based and transformer-based models, showing its generality and superiority in boosting transferability. Code is available at https://github.com/xiaosen-wang/SIT. + + + + FULLER: Unified Multi-modality Multi-task 3D Perception via Multi-level Gradient Calibration + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_FULLER_Unified_Multi-modality_Multi-task_3D_Perception_via_Multi-level_Gradient_Calibration_ICCV_2023_paper.pdf + Multi-modality fusion and multi-task learning are becoming trendy in the 3D autonomous driving scenario, considering robust prediction and computation budget. However, naively extending the existing framework to the domain of multi-modality multi-task learning remains ineffective and even poisonous due to the notorious modality bias and task conflict. Previous works manually coordinate the learning framework with empirical knowledge, which may lead to sub-optima. To mitigate the issue, we propose a novel yet simple multi-level gradient calibration learning framework across tasks and modalities during optimization. Specifically, the gradients, produced by the task heads and used to update the shared backbone, will be calibrated at the backbone's last layer to alleviate the task conflict. Before the calibrated gradients are further propagated to the modality branches of the backbone, their magnitudes will be calibrated again to the same level, ensuring the downstream tasks pay balanced attention to different modalities. Experiments on large-scale benchmark nuScenes demonstrate the effectiveness of the proposed method, e.g., an absolute 14.4% mIoU improvement on map segmentation and 1.4% mAP improvement on 3D detection, advancing the application of 3D autonomous driving in the domain of multi-modality fusion and multi-task learning. We also discuss the links between modalities and tasks. + + + + Cross-Domain Product Representation Learning for Rich-Content E-Commerce + http://openaccess.thecvf.com/ICCV2023 + + + + + Detection Transformer with Stable Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Detection_Transformer_with_Stable_Matching_ICCV_2023_paper.pdf + This paper is concerned with the matching stability problem across different decoder layers in DEtection TRansformers (DETR). We point out that the unstable matching in DETR is caused by a multi-optimization path problem, which is highlighted by the one-to-one matching design in DETR. To address this problem, we show that the most important design is to use and only use positional metrics (like IOU) to supervise classification scores of positive examples. Under this principle, we propose two simple yet effective modifications by integrating positional metrics to DETR's classification loss and matching cost, named position-supervised loss and position-modulated cost. We verify our methods on several DETR variants. Our methods show consistent improvements over baselines. By integrating our methods with DINO, we achieve 50.4 and 51.5 AP on the COCO detection benchmark using ResNet-50 backbones under 1x (12 epochs) and 2x (24 epochs) training settings, achieving a new record under the same setting. Our code will be made available. 
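A minimal sketch of the idea behind the position-supervised loss in the Detection Transformer with Stable Matching entry above: the classification target of each matched (positive) query is set to the IoU between its predicted box and its matched ground-truth box, so that only a positional metric supervises the classification scores. Tensor shapes, argument names, and the plain binary cross-entropy (rather than any focal-style weighting the paper may use) are assumptions.

```python
import torch
import torch.nn.functional as F
from torchvision.ops import box_iou  # boxes in (x1, y1, x2, y2) format


def position_supervised_loss(cls_logits, pred_boxes, gt_boxes, gt_labels,
                             pos_query_idx, matched_gt_idx):
    """BCE where each positive query's classification target is the IoU between its
    predicted box and its matched ground-truth box; all other targets are zero.

    cls_logits:     (Q, C) per-query class logits
    pred_boxes:     (Q, 4) predicted boxes, xyxy
    gt_boxes:       (G, 4) ground-truth boxes, xyxy
    gt_labels:      (G,)   ground-truth class indices
    pos_query_idx:  (P,)   indices of matched (positive) queries
    matched_gt_idx: (P,)   ground-truth index matched to each positive query
    """
    targets = torch.zeros_like(cls_logits)
    # IoU between each positive prediction and its matched ground truth.
    ious = box_iou(pred_boxes[pos_query_idx], gt_boxes[matched_gt_idx]).diagonal()
    targets[pos_query_idx, gt_labels[matched_gt_idx]] = ious.detach()
    return F.binary_cross_entropy_with_logits(cls_logits, targets)
```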
+ + + + Be Everywhere - Hear Everything (BEE): Audio Scene Reconstruction by Sparse Audio-Visual Samples + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Be_Everywhere_-_Hear_Everything_BEE_Audio_Scene_Reconstruction_by_ICCV_2023_paper.pdf + Fully immersive and interactive audio-visual scenes are dynamic such that the listeners and the sound emitters move and interact with each other. Reconstruction of an immersive sound experience, as it happens in the scene, requires detailed reconstruction of the audio perceived by the listener at an arbitrary location. The audio at the listener location is a complex outcome of sound propagation through the scene geometry and interacting with surfaces and also the locations of the emitters and the sounds they emit. Due to these aspects, detailed audio reconstruction requires extensive sampling of audio at any potential listener location. This is usually difficult to implement in realistic real-time dynamic scenes. In this work, we propose to circumvent the need for extensive sensors by leveraging audio and visual samples from only a handful of A/V receivers placed in the scene. In particular, we introduce a novel method and end-to-end integrated rendering pipeline which allows the listener to be everywhere and hear everything (BEE) in a dynamic scene in real-time. BEE reconstructs the audio with two main modules, Joint Audio-Visual Representation, and Integrated Rendering Head. The first module extracts the informative audio-visual features of the scene from sparse A/V reference samples, while the second module integrates the audio samples with learned time-frequency transformations to obtain the target sound. Our experiments indicate that BEE outperforms existing methods by a large margin in terms of quality of sound reconstruction, can generalize to scenes not seen in training and runs in real-time speed. + + + + Story Visualization by Online Text Augmentation with Context Memory + http://openaccess.thecvf.com//content/ICCV2023/papers/Ahn_Story_Visualization_by_Online_Text_Augmentation_with_Context_Memory_ICCV_2023_paper.pdf + Story visualization (SV) is a challenging text-to-image generation task for the difficulty of not only rendering visual details from the text descriptions but also encoding a longterm context across multiple sentences. While prior efforts mostly focus on generating a semantically relevant image for each sentence, encoding a context spread across the given paragraph to generate contextually convincing images (e.g., with a correct character or with a proper background of the scene) remains a challenge. To this end, we propose a novel memory architecture for the Bi-directional Transformer framework with an online text augmentation that generates multiple pseudo-descriptions as supplementary supervision during training for better generalization to the language variation at inference. In extensive experiments on the two popular SV benchmarks, i.e., the Pororo-SV and Flintstones-SV, the proposed method significantly outperforms the state of the arts in various metrics including FID, character F1, frame accuracy, BLEU-2/3, and R-precision with similar or less computational complexity. 
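The online text augmentation in the story visualization entry above produces multiple pseudo-descriptions per sentence as supplementary supervision. The toy stand-in below uses random token dropout only to show where such augmented texts would enter a training loop; the paper generates pseudo-descriptions with a learned module, so the function name, dropout scheme, and parameters here are illustrative assumptions.

```python
import random


def pseudo_descriptions(sentence, num_aug=3, drop_prob=0.15, seed=None):
    """Toy pseudo-description generator: returns noisy variants of a caption by
    randomly dropping tokens. A stand-in for a learned generator, used only to
    show where multiple supplementary supervision texts would be produced."""
    rng = random.Random(seed)
    tokens = sentence.split()
    variants = []
    for _ in range(num_aug):
        kept = [t for t in tokens if rng.random() > drop_prob]
        variants.append(" ".join(kept) if kept else sentence)
    return variants


# Each training sentence yields several augmented targets alongside the original.
print(pseudo_descriptions("Pororo and Crong build a snowman near the blue house", seed=0))
```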
+ + + + Global Balanced Experts for Federated Long-Tailed Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zeng_Global_Balanced_Experts_for_Federated_Long-Tailed_Learning_ICCV_2023_paper.pdf + Federated learning (FL) is a prevalent distributed machine learning approach that enables collaborative training of a global model across multiple devices without sharing local data. However, the presence of long-tailed data can negatively deteriorate the model's performance in real-world FL applications. Moreover, existing re-balance strategies are less effective for the federated long-tailed issue when directly utilizing local label distribution as the class prior at the clients' side. To this end, we propose a novel Global Balanced Multi-Expert (GBME) framework to optimize a balanced global objective, which does not require additional information beyond the standard FL pipeline. In particular, a proxy is derived from the accumulated gradients uploaded by the clients after local training, and is shared by all clients as the class prior for re-balance training. Such a proxy can also guide the client grouping to train a multi-expert model, where the knowledge from different clients can be aggregated via the ensemble of different experts corresponding to different client groups. To further strengthen the privacy-preserving ability, we present a GBME-p algorithm with a theoretical guarantee to prevent privacy leakage from the proxy. Extensive experiments on long-tailed decentralized datasets demonstrate the effectiveness of GBME and GBME-p, both of which show superior performance to state-of-the-art methods. + + + + Cascade-DETR: Delving into High-Quality Universal Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Cascade-DETR_Delving_into_High-Quality_Universal_Object_Detection_ICCV_2023_paper.pdf + Object localization in general environments is a fundamental part of vision systems. While dominating on the COCO benchmark, recent Transformer-based detection methods are not competitive in diverse domains. Moreover, these methods still struggle to very accurately estimate the object bounding boxes in complex environments. We introduce Cascade-DETR for high-quality universal object detection. We jointly tackle the generalization to diverse domains and localization accuracy by proposing the Cascade Attention layer, which explicitly integrates object-centric information into the detection decoder by limiting the attention to the previous box prediction. To further enhance accuracy, we also revisit the scoring of queries. Instead of relying on classification scores, we predict the expected IoU of the query, leading to substantially more well-calibrated confidences. Lastly, we introduce a universal object detection benchmark, UDB10, that contains 10 datasets from diverse domains. While also advancing the state-of-the-art on COCO, Cascade-DETR substantially improves DETR-based detectors on all datasets in UDB10, even by over 10 mAP in some cases. The improvements under stringent quality requirements are even more pronounced. Our code and pretrained models are at https://github.com/SysCV/cascade-detr. + + + + ACLS: Adaptive and Conditional Label Smoothing for Network Calibration + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_ACLS_Adaptive_and_Conditional_Label_Smoothing_for_Network_Calibration_ICCV_2023_paper.pdf + We address the problem of network calibration adjusting miscalibrated confidences of deep neural networks. 
Many approaches to network calibration adopt a regularization-based method that exploits a regularization term to smooth the miscalibrated confidences. Although these approaches have shown the effectiveness on calibrating the networks, there is still a lack of understanding on the underlying principles of regularization in terms of network calibration. We present in this paper an in-depth analysis of existing regularization-based methods, providing a better understanding on how they affect to network calibration. Specifically, we have observed that 1) the regularization-based methods can be interpreted as variants of label smoothing, and 2) they do not always behave desirably. Based on the analysis, we introduce a novel loss function, dubbed ACLS, that unifies the merits of existing regularization methods, while avoiding the limitations. We show extensive experimental results for image classification and semantic segmentation on standard benchmarks, including CIFAR10, Tiny-ImageNet, ImageNet, and PASCAL VOC, demonstrating the effectiveness of our loss function. + + + + EMR-MSF: Self-Supervised Recurrent Monocular Scene Flow Exploiting Ego-Motion Rigidity + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_EMR-MSF_Self-Supervised_Recurrent_Monocular_Scene_Flow_Exploiting_Ego-Motion_Rigidity_ICCV_2023_paper.pdf + Self-supervised monocular scene flow estimation, aiming to understand both 3D structures and 3D motions from two temporally consecutive monocular images, has received increasing attention for its simple and economical sensor setup. However, the accuracy of current methods suffers from the bottleneck of less-efficient network architecture and lack of motion rigidity for regularization. In this paper, we propose a superior model named EMR-MSF by borrowing the advantages of network architecture design under the scope of supervised learning. We further impose explicit and robust geometric constraints with an elaborately constructed ego-motion aggregation module where a rigidity soft mask is proposed to filter out dynamic regions for stable ego-motion estimation using static regions. Moreover, we propose a motion consistency loss along with a mask regularization loss to fully exploit static regions. Several efficient training strategies are integrated including a gradient detachment technique and an enhanced view synthesis process for better performance. Our proposed method outperforms the previous self-supervised works by a large margin and catches up to the performance of supervised methods. On the KITTI scene flow benchmark, our approach improves the SF-all metric of the state-of-the-art self-supervised monocular method by 44% and demonstrates superior performance across sub-tasks including depth and visual odometry, amongst other self-supervised single-task or multi-task methods. + + + + Evaluation and Improvement of Interpretability for Self-Explainable Part-Prototype Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Evaluation_and_Improvement_of_Interpretability_for_Self-Explainable_Part-Prototype_Networks_ICCV_2023_paper.pdf + Part-prototype networks (e.g., ProtoPNet, ProtoTree, and ProtoPool) have attracted broad research interest for their intrinsic interpretability and comparable accuracy to non-interpretable counterparts. However, recent works find that the interpretability from prototypes is fragile, due to the semantic gap between the similarities in the feature space and that in the input space. 
In this work, we strive to address this challenge by making the first attempt to quantitatively and objectively evaluate the interpretability of the part-prototype networks. Specifically, we propose two evaluation metrics, termed as "consistency score" and "stability score", to evaluate the explanation consistency across images and the explanation robustness against perturbations, respectively, both of which are essential for explanations taken into practice. Furthermore, we propose an elaborated part-prototype network with a shallow-deep feature alignment (SDFA) module and a score aggregation (SA) module to improve the interpretability of prototypes. We conduct systematical evaluation experiments and provide substantial discussions to uncover the interpretability of existing part-prototype networks. Experiments on three benchmarks across nine architectures demonstrate that our model achieves significantly superior performance to the state of the art, in both the accuracy and interpretability. Our code is available at https://github.com/hqhQAQ/EvalProtoPNet. + + + + Mitigating Adversarial Vulnerability through Causal Parameter Estimation by Adversarial Double Machine Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Mitigating_Adversarial_Vulnerability_through_Causal_Parameter_Estimation_by_Adversarial_Double_ICCV_2023_paper.pdf + Adversarial examples derived from deliberately crafted perturbations on visual inputs can easily harm decision process of deep neural networks. To prevent potential threats, various adversarial training-based defense methods have grown rapidly and become a de facto standard approach for robustness. Despite recent competitive achievements, we observe that adversarial vulnerability varies across targets and certain vulnerabilities remain prevalent. Intriguingly, such peculiar phenomenon cannot be relieved even with deeper architectures and advanced defense methods. To address this issue, in this paper, we introduce a causal approach called Adversarial Double Machine Learning (ADML), which allows us to quantify the degree of adversarial vulnerability for network predictions and capture the effect of treatments on outcome of interests. ADML can directly estimate causal parameter of adversarial perturbations per se and mitigate negative effects that can potentially damage robustness, bridging a causal perspective into the adversarial vulnerability. Through extensive experiments on various CNN and Transformer architectures, we corroborate that ADML improves adversarial robustness with large margins and relieve the empirical observation. + + + + Dynamic Token Pruning in Plain Vision Transformers for Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_Dynamic_Token_Pruning_in_Plain_Vision_Transformers_for_Semantic_Segmentation_ICCV_2023_paper.pdf + Vision transformers have achieved leading performance on various visual tasks yet still suffer from high computational complexity. The situation deteriorates in dense prediction tasks like semantic segmentation, as high-resolution inputs and outputs usually imply more tokens involved in computations. Directly removing the less attentive tokens has been discussed for the image classification task but can not be extended to semantic segmentation since a dense prediction is required for every patch. To this end, this work introduces a Dynamic Token Pruning (DToP) method based on the early exit of tokens for semantic segmentation. 
Motivated by the coarse-to-fine segmentation process by humans, we naturally split the widely adopted auxiliary-loss-based network architecture into several stages, where each auxiliary block grades every token's difficulty level. We can finalize the prediction of easy tokens in advance without completing the entire forward pass. Moreover, we keep the k highest-confidence tokens for each semantic category to uphold the representative context information. Thus, computational complexity will change with the difficulty of the input, akin to the way humans do segmentation. Experiments suggest that the proposed DToP architecture reduces on average 20%-35% of the computational cost for current semantic segmentation methods based on plain vision transformers without accuracy degradation. The code is available through the following link: https://github.com/zbwxp/Dynamic-Token-Pruning. + + + + DiffFit: Unlocking Transferability of Large Diffusion Models via Simple Parameter-efficient Fine-Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_DiffFit_Unlocking_Transferability_of_Large_Diffusion_Models_via_Simple_Parameter-efficient_ICCV_2023_paper.pdf + Diffusion models have proven to be highly effective in generating high-quality images. However, adapting large pre-trained diffusion models to new domains remains an open challenge, which is critical for real-world applications. This paper proposes DiffFit, a parameter-efficient strategy to fine-tune large pre-trained diffusion models that enable fast adaptation to new domains. DiffFit is embarrassingly simple: it only fine-tunes the bias term and newly-added scaling factors in specific layers, yet results in significant training speed-up and reduced model storage costs. Compared with full fine-tuning, DiffFit achieves 2x training speed-up and only needs to store approximately 0.12% of the total model parameters. Intuitive theoretical analysis has been provided to justify the efficacy of scaling factors on fast adaptation. On 8 downstream datasets, DiffFit achieves superior or competitive performances compared to the full fine-tuning while being more efficient. Remarkably, we show that DiffFit can adapt a pre-trained low-resolution generative model to a high-resolution one by adding minimal cost. Among diffusion-based methods, DiffFit sets a new state-of-the-art FID of 3.02 on ImageNet 512x512 benchmark by fine-tuning only 25 epochs from a public pre-trained ImageNet 256x256 checkpoint while being 30x more training efficient than the closest competitor. + + + + QD-BEV : Quantization-aware View-guided Distillation for Multi-view 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_QD-BEV__Quantization-aware_View-guided_Distillation_for_Multi-view_3D_Object_Detection_ICCV_2023_paper.pdf + Multi-view 3D detection based on BEV (bird-eye-view) has recently achieved significant improvements. However, the huge memory consumption of state-of-the-art models makes it hard to deploy them on vehicles, and the non-trivial latency will affect the real-time perception of streaming applications. Despite the wide application of quantization to lighten models, we show in our paper that directly applying quantization in BEV tasks will 1) make the training unstable, and 2) lead to intolerable performance degradation. 
To solve these issues, our method QD-BEV enables a novel view-guided distillation (VGD) objective, which can stabilize the quantization-aware training (QAT) while enhancing the model performance by leveraging both image features and BEV features. Our experiments show that QD-BEV achieves similar or even better accuracy than previous methods with significant efficiency gains. On the nuScenes datasets, the 4-bit weight and 6-bit activation quantized QD-BEV-Tiny model achieves 37.2% NDS with only 15.8 MB model size, outperforming BevFormer-Tiny by 1.8% with an 8x model compression. On the Small and Base variants, QD-BEV models also perform superbly and achieve 47.9% NDS (28.2 MB) and 50.9% NDS (32.9 MB), respectively. + + + + CLIPascene: Scene Sketching with Different Types and Levels of Abstraction + http://openaccess.thecvf.com//content/ICCV2023/papers/Vinker_CLIPascene_Scene_Sketching_with_Different_Types_and_Levels_of_Abstraction_ICCV_2023_paper.pdf + In this paper, we present a method for converting a given scene image into a sketch using different types and multiple levels of abstraction. We distinguish between two types of abstraction. The first considers the fidelity of the sketch, varying its representation from a more precise portrayal of the input to a looser depiction. The second is defined by the visual simplicity of the sketch, moving from a detailed depiction to a sparse sketch. Using an explicit disentanglement into two abstraction axes --- and multiple levels for each one --- provides users additional control over selecting the desired sketch based on their personal goals and preferences. To form a sketch at a given level of fidelity and simplification, we train two MLP networks. The first network learns the desired placement of strokes, while the second network learns to gradually remove strokes from the sketch without harming its recognizability and semantics. Our approach is able to generate sketches of complex scenes including those with complex backgrounds (e.g. natural and urban settings) and subjects (e.g. animals and people) while depicting gradual abstractions of the input scene in terms of fidelity and simplicity. + + + + Multi-Directional Subspace Editing in Style-Space + http://openaccess.thecvf.com//content/ICCV2023/papers/Naveh_Multi-Directional_Subspace_Editing_in_Style-Space_ICCV_2023_paper.pdf + This paper describes a new technique for finding disentangled semantic directions in the latent space of StyleGAN. Our method identifies meaningful orthogonal subspaces that allow editing of one human face attribute, while minimizing undesired changes in other attributes. Our model is capable of editing a single attribute in multiple directions, resulting in a range of possible generated images. We compare our scheme with three state-of-the-art models and show that our method outperforms them in terms of face editing and disentanglement capabilities. Additionally, we suggest quantitative measures for evaluating attribute separation and disentanglement, and exhibit the superiority of our model with respect to those measures. + + + + Adaptive Superpixel for Active Learning in Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Adaptive_Superpixel_for_Active_Learning_in_Semantic_Segmentation_ICCV_2023_paper.pdf + Learning semantic segmentation requires pixel-wise annotations, which can be time-consuming and expensive. 
To reduce the annotation cost, we propose a superpixel-based active learning (AL) framework, which collects a dominant label per superpixel instead. To be specific, it consists of adaptive superpixel and sieving mechanisms, fully dedicated to AL. At each round of AL, we adaptively merge neighboring pixels of similar learned features into superpixels. We then query a selected subset of these superpixels using an acquisition function assuming no uniform superpixel size. This approach is more efficient than existing methods, which rely only on innate features such as RGB color and assume uniform superpixel sizes. Obtaining a dominant label per superpixel drastically reduces annotators' burden as it requires fewer clicks. However, it inevitably introduces noisy annotations due to mismatches between superpixel and ground truth segmentation. To address this issue, we further devise a sieving mechanism that identifies and excludes potentially noisy annotations from learning. Our experiments on both Cityscapes and PASCAL VOC datasets demonstrate the efficacy of adaptive superpixel and sieving mechanisms. + + + + Parametric Information Maximization for Generalized Category Discovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Chiaroni_Parametric_Information_Maximization_for_Generalized_Category_Discovery_ICCV_2023_paper.pdf + We introduce a Parametric Information Maximization (PIM) model for the Generalized Category Discovery (GCD) problem. Specifically, we propose a bi-level optimization formulation, which explores a parameterized family of objective functions, each evaluating a weighted mutual information between the features and the latent labels, subject to supervision constraints from the labeled samples. Our formulation mitigates the class-balance bias encoded in standard information maximization approaches, thereby handling effectively both short-tailed and long-tailed data sets. We report extensive experiments and comparisons demonstrating that our PIM model consistently sets new state-of-the-art performances in GCD across six different datasets, more so when dealing with challenging fine-grained problems. Our code: https://github.com/ThalesGroup/pim-generalized-category-discovery. + + + + A Generalist Framework for Panoptic Segmentation of Images and Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_A_Generalist_Framework_for_Panoptic_Segmentation_of_Images_and_Videos_ICCV_2023_paper.pdf + Panoptic segmentation assigns semantic and instance ID labels to every pixel of an image. As permutations of instance IDs are also valid solutions, the task requires learning of high-dimensional one-to-many mapping. As a result, state-of-the-art approaches use customized architectures and task-specific loss functions. We formulate panoptic segmentation as a discrete data generation problem, without relying on inductive bias of the task. A diffusion model is proposed to model panoptic masks, with a simple architecture and generic loss function. By simply adding past predictions as a conditioning signal, our method is capable of modeling video (in a streaming setting) and thereby learns to track object instances automatically. With extensive experiments, we demonstrate that our simple approach can perform competitively to state-of-the-art specialist methods in similar settings. 
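A minimal sketch of an information-maximization objective in the spirit of the Parametric Information Maximization entry above: a weighted mutual-information surrogate (marginal entropy minus conditional entropy) on unlabeled features, combined with cross-entropy on the labeled samples. The decomposition, the single weight alpha, and the function name are assumptions; the paper explores a parameterized family of such objectives via bi-level optimization, which is not reproduced here.

```python
import torch
import torch.nn.functional as F


def weighted_infomax_loss(logits_unlab, logits_lab, labels, alpha=1.0, ce_weight=1.0):
    """Info-max style objective: maximize a weighted mutual-information surrogate
    between inputs and predicted latent labels on unlabeled data, subject to
    cross-entropy supervision on the labeled subset."""
    p = F.softmax(logits_unlab, dim=1)                                  # (N, K) soft assignments
    cond_ent = -(p * torch.log(p.clamp_min(1e-12))).sum(1).mean()       # H(Y|X): prefer confident predictions
    marg = p.mean(0)                                                    # class marginal over the batch
    marg_ent = -(marg * torch.log(marg.clamp_min(1e-12))).sum()         # H(Y): prefer balanced class usage
    mi = alpha * marg_ent - cond_ent                                    # weighted mutual-information surrogate
    ce = F.cross_entropy(logits_lab, labels)                            # supervision constraint on labeled samples
    return ce_weight * ce - mi                                          # minimizing this maximizes MI + fits labels
```

The weight on the marginal-entropy term is what lets such an objective relax the class-balance bias of standard information maximization on long-tailed data.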
+ + + + DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generation Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Cho_DALL-Eval_Probing_the_Reasoning_Skills_and_Social_Biases_of_Text-to-Image_ICCV_2023_paper.pdf + Recently, DALL-E, a multimodal transformer language model, and its variants including diffusion models have shown high-quality text-to-image generation capabilities. However, despite the realistic image generation results, there has not been a detailed analysis of how to evaluate such models. In this work, we investigate the visual reasoning capabilities and social biases of different text-to-image models, covering both multimodal transformer language models and diffusion models. First, we measure three visual reasoning skills: object recognition, object counting, and spatial relation understanding. For this, we propose PaintSkills, a compositional diagnostic evaluation dataset that measures these skills. Despite the high-fidelity image generation capability, a large gap exists between the performance of recent models and the upper bound accuracy in object counting and spatial relation understanding skills. Second, we assess the gender and skin tone biases by measuring the gender/skin tone distribution of generated images across various professions and attributes. We demonstrate that recent text-to-image generation models learn specific biases about gender and skin tone from web image-text pairs. We hope our work will help guide future progress in improving text-to-image generation models on visual reasoning skills and learning socially unbiased representations. + + + + Scale-Aware Modulation Meet Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Scale-Aware_Modulation_Meet_Transformer_ICCV_2023_paper.pdf + This paper presents a new vision Transformer, Scale Aware Modulation Transformer (SMT), that can handle various downstream tasks efficiently by combining the convolutional network and vision Transformer. The proposed Scale Aware Modulation (SAM) in the SMT includes two primary novel designs. Firstly, we introduce the Multi-Head Mixed Convolution (MHMC) module, which can capture multi scale features and expand the receptive field. Secondly, we propose the Scale-Aware Aggregation (SAA) module, which is lightweight but effective, enabling information fusion across different heads. By leveraging these two modules, convolutional modulation is further enhanced. Furthermore, in contrast to prior works that utilized modulations throughout all stages to build an attention-free network, we propose an Evolutionary Hybrid Network (EHN), which can effectively simulate the shift from capturing local to global dependencies as the network becomes deeper, resulting in superior performance. Extensive experiments demonstrate that SMT significantly outperforms existing state-of-the-art models across a wide range of visual tasks. Specifically, SMT with 11.5M / 2.4GFLOPs and 32M / 7.7GFLOPs can achieve 82.2% and 84.3% top-1 accuracy on ImageNet-1K, respectively. After pretrained on ImageNet-22K in 224x224 resolution, it attains 87.1% and 88.1% top-1 accuracy when finetuned with resolution 224x224 and 384x384 , respectively. For object detection with Mask R-CNN, the SMT base trained with 1x and 3x schedule outperforms the Swin Transformer counterpart by 4.2 and 1.3 mAP on COCO, respectively. 
For semantic segmentation with UPerNet, the SMT base tested at single- and multi-scale surpasses Swin by 2.0 and 1.1 mIoU, respectively, on ADE20K. Our code is available at https://github.com/AFeng-x/SMT. + + + + SUMMIT: Source-Free Adaptation of Uni-Modal Models to Multi-Modal Targets + http://openaccess.thecvf.com//content/ICCV2023/papers/Simons_SUMMIT_Source-Free_Adaptation_of_Uni-Modal_Models_to_Multi-Modal_Targets_ICCV_2023_paper.pdf + Scene understanding using multi-modal data is necessary in many applications, e.g., autonomous navigation. To achieve this in a variety of situations, existing models must be able to adapt to shifting data distributions without arduous data annotation. Current approaches assume that the source data is available during adaptation and that the source consists of paired multi-modal data. Both these assumptions may be problematic for many applications. Source data may not be available due to privacy, security, or economic concerns. Assuming the existence of paired multi-modal data for training also entails significant data collection costs and fails to take advantage of widely available freely distributed pre-trained uni-modal models. In this work, we relax both of these assumptions by addressing the problem of adapting a set of models trained independently on uni-modal data to a target domain consisting of unlabeled multi-modal data, without having access to the original source dataset. Our proposed approach solves this problem through a switching framework which automatically chooses between two complementary methods of cross-modal pseudo-label fusion -- agreement filtering and entropy weighting -- based on the estimated domain gap. We demonstrate our work on the semantic segmentation problem. Experiments across seven challenging adaptation scenarios verify the efficacy of our approach, achieving results comparable to, and in some cases outperforming, methods which assume access to source data. Our method achieves an improvement in mIoU of up to 12% over competing baselines. Our code is publicly available at https://github.com/csimo005/SUMMIT. + + + + Learning a More Continuous Zero Level Set in Unsigned Distance Fields through Level Set Projection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Learning_a_More_Continuous_Zero_Level_Set_in_Unsigned_Distance_ICCV_2023_paper.pdf + The latest methods represent shapes with open surfaces using unsigned distance functions (UDFs). They train neural networks to learn UDFs and reconstruct surfaces with the gradients around the zero level set of the UDF. However, the differential networks struggle to learn the zero level set, where the UDF is not differentiable, which leads to large errors on unsigned distances and gradients around the zero level set, resulting in highly fragmented and discontinuous surfaces. To resolve this problem, we propose to learn a more continuous zero level set in UDFs with level set projections. Our insight is to guide the learning of the zero level set using the remaining non-zero level sets via a projection procedure. Our idea is inspired by the observation that the non-zero level sets are much smoother and more continuous than the zero level set. We pull the non-zero level sets onto the zero level set with gradient constraints which align gradients over different level sets and correct unsigned distance errors on the zero level set, leading to a smoother and more continuous unsigned distance field.
We conduct comprehensive experiments in surface reconstruction for point clouds, real scans or depth maps, and further explore the performance in unsupervised point cloud upsampling and unsupervised point normal estimation with the learned UDF, which demonstrate our non-trivial improvements over the state-of-the-art methods. Code is available at https://github.com/junshengzhou/LevelSetUDF. + + + + HairNeRF: Geometry-Aware Image Synthesis for Hairstyle Transfer + http://openaccess.thecvf.com//content/ICCV2023/papers/Chang_HairNeRF_Geometry-Aware_Image_Synthesis_for_Hairstyle_Transfer_ICCV_2023_paper.pdf + We propose a novel hairstyle transferred image synthesis method considering the underlying head geometry of two input images. In traditional GAN-based methods, transferring hairstyle from one image to the other often makes the synthesized result awkward due to differences in pose, shape, and size of heads. To resolve this, we utilize neural rendering by registering two input heads in the volumetric space to make a transferred hairstyle fit on the head of a target image. Because of the geometric nature of neural rendering, our method can render view-varying images of synthesized results from a single transfer process without causing the distortion from which extant hairstyle transfer methods built upon traditional GAN-based generators suffer. We verify that our method surpasses other baselines in terms of preserving the identity and hairstyle of the two input images when synthesizing a hairstyle transferred image rendered at any point of view. + + + + GETAvatar: Generative Textured Meshes for Animatable Human Avatars + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_GETAvatar_Generative_Textured_Meshes_for_Animatable_Human_Avatars_ICCV_2023_paper.pdf + We study the problem of 3D-aware full-body human generation, aiming at creating animatable human avatars with high-quality textures and geometries. Generally, two challenges remain in this field: i) existing methods struggle to generate geometries with rich realistic details such as the wrinkles of garments; ii) they typically utilize volumetric radiance fields and neural renderers in the synthesis process, making high-resolution rendering non-trivial. To overcome these problems, we propose GETAvatar, a Generative model that directly generates Explicit Textured 3D meshes for animatable human Avatars, with photo-realistic appearance and fine geometric details. Specifically, we first design an articulated 3D human representation with explicit surface modeling, and enrich the generated humans with realistic surface details by learning from the 2D normal maps of 3D scan data. Second, with the explicit mesh representation, we can use a rasterization-based renderer to perform surface rendering, allowing us to achieve high-resolution image generation efficiently. Extensive experiments demonstrate that GETAvatar achieves state-of-the-art performance on 3D-aware human generation both in appearance and geometry quality. Notably, GETAvatar can generate images at 512x512 resolution with 17FPS and 1024x1024 resolution with 14FPS, improving upon previous methods by 2x. Our code and models will be available.
+ + + + StylerDALLE: Language-Guided Style Transfer Using a Vector-Quantized Tokenizer of a Large-Scale Generative Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_StylerDALLE_Language-Guided_Style_Transfer_Using_a_Vector-Quantized_Tokenizer_of_a_ICCV_2023_paper.pdf + Despite the progress made in the style transfer task, most previous work focuses on transferring only relatively simple features like color or texture, while missing more abstract concepts such as overall art expression or painter-specific traits. However, these abstract semantics can be captured by models like DALL-E or CLIP, which have been trained using huge datasets of images and textual documents. In this paper, we propose StylerDALLE, a style transfer method that exploits both of these models and uses natural language to describe abstract art styles. Specifically, we formulate the language-guided style transfer task as a non-autoregressive token sequence translation, i.e., from input content image to output stylized image, in the discrete latent space of a large-scale pretrained vector-quantized tokenizer, e.g., the discrete variational auto-encoder (dVAE) of DALL-E. To incorporate style information, we propose a Reinforcement Learning strategy with CLIP-based language supervision that ensures stylization and content preservation simultaneously. Experimental results demonstrate the superiority of our method, which can effectively transfer art styles using language instructions at different granularities. Code is available at https://github.com/zipengxuc/StylerDALLE. + + + + Deep Image Harmonization with Learnable Augmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Niu_Deep_Image_Harmonization_with_Learnable_Augmentation_ICCV_2023_paper.pdf + The goal of image harmonization is adjusting the foreground appearance in a composite image to make the whole image harmonious. To construct paired training images, existing datasets adopt different ways to adjust the illumination statistics of foregrounds of real images to produce synthetic composite images. However, different datasets have a considerable domain gap and the performances on small-scale datasets are limited by insufficient training data. In this work, we explore learnable augmentation to enrich the illumination diversity of small-scale datasets for better harmonization performance. In particular, our designed SYnthetic COmposite Network (SycoNet) takes in a real image with a foreground mask and a random vector to learn a suitable color transformation, which is applied to the foreground of this real image to produce a synthetic composite image. Comprehensive experiments demonstrate the effectiveness of our proposed learnable augmentation for image harmonization. The code of SycoNet is released at https://github.com/bcmi/SycoNet-Adaptive-Image-Harmonization. + + + + Scalable Diffusion Models with Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Peebles_Scalable_Diffusion_Models_with_Transformers_ICCV_2023_paper.pdf + We explore a new class of diffusion models based on the transformer architecture. We train latent diffusion models of images, replacing the commonly-used U-Net backbone with a transformer that operates on latent patches. We analyze the scalability of our Diffusion Transformers (DiTs) through the lens of forward pass complexity as measured by Gflops. We find that DiTs with higher Gflops---through increased transformer depth/width or increased number of input tokens---consistently have lower FID.
In addition to possessing good scalability properties, our largest DiT-XL/2 models outperform all prior diffusion models on the class-conditional ImageNet 512x512 and 256x256 benchmarks, achieving a state-of-the-art FID of 2.27 on the latter. + + + + MMST-ViT: Climate Change-aware Crop Yield Prediction via Multi-Modal Spatial-Temporal Vision Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_MMST-ViT_Climate_Change-aware_Crop_Yield_Prediction_via_Multi-Modal_Spatial-Temporal_Vision_ICCV_2023_paper.pdf + Precise crop yield prediction provides valuable information for agricultural planning and decision-making processes. However, timely predicting crop yields remains challenging as crop growth is sensitive to growing season weather variation and climate change. In this work, we develop a deep learning-based solution, namely Multi-Modal Spatial-Temporal Vision Transformer (MMST-ViT), for predicting crop yields at the county level across the United States, by considering the effects of short-term meteorological variations during the growing season and the long-term climate change on crops. Specifically, our MMST-ViT consists of a Multi-Modal Transformer, a Spatial Transformer, and a Temporal Transformer. The Multi-Modal Transformer leverages both visual remote sensing data and short-term meteorological data for modeling the effect of growing season weather variations on crop growth. The Spatial Transformer learns the high-resolution spatial dependency among counties for accurate agricultural tracking. The Temporal Transformer captures the long-range temporal dependency for learning the impact of long-term climate change on crops. Meanwhile, we also devise a novel multi-modal contrastive learning technique to pre-train our model without extensive human supervision. Hence, our MMST-ViT captures the impacts of both short-term weather variations and long-term climate change on crops by leveraging both satellite images and meteorological data. We have conducted extensive experiments on over 200 counties in the United States, with the experimental results exhibiting that our MMST-ViT outperforms its counterparts under three performance metrics of interest. + + + + Grounded Image Text Matching with Mismatched Relation Reasoning + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Grounded_Image_Text_Matching_with_Mismatched_Relation_Reasoning_ICCV_2023_paper.pdf + This paper introduces Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. GITM-MR requires a model to first determine if an expression describes an image, then localize referred objects or ground the mismatched parts of the text. We provide a benchmark for evaluating vision-language (VL) models on this task, with a focus on the challenging settings of limited training data and out-of-distribution sentence lengths. Our evaluation demonstrates that pre-trained VL models often lack data efficiency and length generalization ability. To address this, we propose the Relation-sensitive Correspondence Reasoning Network (RCRN), which incorporates relation-aware reasoning via bi-directional message propagation guided by language structure. Our RCRN can be interpreted as a modular program and delivers strong performance in terms of both length generalization and data efficiency. The code and data are available on https://github.com/SHTUPLUS/GITM-MR. 
+ + + + UniKD: Universal Knowledge Distillation for Mimicking Homogeneous or Heterogeneous Object Detectors + http://openaccess.thecvf.com//content/ICCV2023/papers/Lao_UniKD_Universal_Knowledge_Distillation_for_Mimicking_Homogeneous_or_Heterogeneous_Object_ICCV_2023_paper.pdf + Knowledge distillation (KD) has become a standard method to boost the performance of lightweight object detectors. Most previous works are feature-based, where students mimic the features of homogeneous teacher detectors. However, distilling the knowledge from the heterogeneous teacher fails in this manner due to the serious semantic gap, which greatly limits the flexibility of KD in practical applications. Bridging this semantic gap now requires case-by-case algorithm design which is time-consuming and heavily relies on experienced adjustment. To alleviate this problem, we propose Universal Knowledge Distillation (UniKD), introducing additional decoder heads with deformable cross-attention called Adaptive Knowledge Extractor (AKE). In UniKD, AKEs are first pretrained on the teacher's output to infuse the teacher's content and positional knowledge into a fixed-number set of knowledge embeddings. The fixed AKEs are then attached to the student's backbone to encourage the student to absorb the teacher's knowledge in these knowledge embeddings. In this query-based distillation paradigm, detection-relevant information can be dynamically aggregated into a knowledge embedding set and transferred between different detectors. When the teacher model is too large for online inference, its output can be stored on disk in advance to save the computation overhead, which is more storage efficient than feature-based methods. Extensive experiments demonstrate that our UniKD can plug and play in any homogeneous or heterogeneous teacher-student pairs and significantly outperforms conventional feature-based KD. + + + + BoxDiff: Text-to-Image Synthesis with Training-Free Box-Constrained Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_BoxDiff_Text-to-Image_Synthesis_with_Training-Free_Box-Constrained_Diffusion_ICCV_2023_paper.pdf + Recent text-to-image diffusion models have demonstrated an astonishing capacity to generate high-quality images. However, researchers mainly studied the way of synthesizing images with only text prompts. While some works have explored using other modalities as conditions, considerable paired data, e.g., box/mask-image pairs, and fine-tuning time are required for nurturing models. As such paired data is time-consuming and labor-intensive to acquire and restricted to a closed set, this potentially becomes the bottleneck for applications in an open world. This paper focuses on the simplest form of user-provided conditions, e.g., box or scribble. To mitigate the aforementioned problem, we propose a training-free method to control objects and contexts in the synthesized images adhering to the given spatial conditions. Specifically, three spatial constraints, i.e., Inner-Box, Outer-Box, and Corner Constraints, are designed and seamlessly integrated into the denoising step of diffusion models, requiring no additional training and massive annotated layout data. Extensive experimental results demonstrate that the proposed constraints can control what and where to present in the images while retaining the ability of Diffusion models to synthesize with high fidelity and diverse concept coverage. The code is publicly available at https://github.com/showlab/BoxDiff. 
+ + + + Rapid Network Adaptation: Learning to Adapt Neural Networks Using Test-Time Feedback + http://openaccess.thecvf.com//content/ICCV2023/papers/Yeo_Rapid_Network_Adaptation_Learning_to_Adapt_Neural_Networks_Using_Test-Time_ICCV_2023_paper.pdf + We propose a method for adapting neural networks to distribution shifts at test-time. In contrast to training-time robustness mechanisms that attempt to anticipate the shift, we create a closed-loop system and make use of test-time feedback signal to adapt a network. We show that this loop can be effectively implemented using a learning-based function, which realizes an amortized optimizer for the network. This leads to an adaptation method, named Rapid Network Adaptation (RNA), that is notably more flexible and orders of magnitude faster than the baselines. Through a broad set of experiments using various adaptation signals and target tasks, we study the generality, efficiency, and flexibility of this method. We perform the evaluations using various datasets (Taskonomy, Replica, ScanNet, Hypersim, COCO, ImageNet), tasks (depth, optical flow, semantic segmentation, classification), and distribution shifts (Cross-datasets, 2D and 3D Common Corruptions) with promising results. + + + + Theoretical and Numerical Analysis of 3D Reconstruction Using Point and Line Incidences + http://openaccess.thecvf.com//content/ICCV2023/papers/Rydell_Theoretical_and_Numerical_Analysis_of_3D_Reconstruction_Using_Point_and_ICCV_2023_paper.pdf + We study the joint image of lines incident to points, meaning the set of image tuples obtained from fixed cameras observing a varying 3D point-line incidence. We prove a formula for the number of complex critical points of the triangulation problem that aims to compute a 3D point-line incidence from noisy images. Our formula works for an arbitrary number of images and measures the intrinsic difficulty of this triangulation. Additionally, we conduct numerical experiments using homotopy continuation methods, comparing different approaches of triangulation of such incidences. In our setup, exploiting the incidence relations gives a notably faster point reconstruction with comparable accuracy. + + + + Explaining Adversarial Robustness of Neural Networks from Clustering Effect Perspective + http://openaccess.thecvf.com//content/ICCV2023/papers/Jin_Explaining_Adversarial_Robustness_of_Neural_Networks_from_Clustering_Effect_Perspective_ICCV_2023_paper.pdf + Adversarial training (AT) is the most commonly used mechanism to improve the robustness of deep neural networks. Recently, a novel adversarial attack against intermediate layers exploits the extra fragility of adversarially trained networks to output incorrect predictions. The result implies the insufficiency in the searching space of the adversarial perturbation in adversarial training. To straighten out the reason for the effectiveness of the intermediate-layer attack, we interpret the forward propagation as the Clustering Effect, characterizing that the intermediate-layer representations of neural networks for samples i.i.d. to the training set with the same label are similar, and we theoretically prove the existence of Clustering Effect by corresponding Information Bottleneck Theory. We afterward observe that the intermediate-layer attack disobeys the clustering effect of the AT-trained model. Inspired by these significant observations, we propose a regularization method to extend the perturbation searching space during training, named sufficient adversarial training (SAT). 
We give a proven robustness bound of neural networks through rigorous mathematical proof. The experimental evaluations manifest the superiority of SAT over other state-of-the-art AT mechanisms in defending against adversarial attacks against both output and intermediate layers. Our code and Appendix can be found at https://github.com/clustering-effect/SAT. + + + + Leaping Into Memories: Space-Time Deep Feature Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Stergiou_Leaping_Into_Memories_Space-Time_Deep_Feature_Synthesis_ICCV_2023_paper.pdf + The success of deep learning models has led to their adaptation and adoption by prominent video understanding methods. The majority of these approaches encode features in a joint space-time modality for which the inner workings and learned representations are difficult to visually interpret. We propose LEArned Preconscious Synthesis (LEAPS), an architecture-independent method for synthesizing videos from the internal spatiotemporal representations of models. Using a stimulus video and a target class, we prime a fixed space-time model and iteratively optimize a video initialized with random noise. Additional regularizers are used to improve the feature diversity of the synthesized videos alongside the cross-frame temporal coherence of motions. We quantitatively and qualitatively evaluate the applicability of LEAPS by inverting a range of spatiotemporal convolutional and attention-based architectures trained on Kinetics-400, which to the best of our knowledge has not been previously accomplished. + + + + WDiscOOD: Out-of-Distribution Detection via Whitened Linear Discriminant Analysis + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_WDiscOOD_Out-of-Distribution_Detection_via_Whitened_Linear_Discriminant_Analysis_ICCV_2023_paper.pdf + Deep neural networks are susceptible to generating overconfident yet erroneous predictions when presented with data beyond known concepts. This challenge underscores the importance of detecting out-of-distribution (OOD) samples in the open world. In this work, we propose a novel feature-space OOD detection score based on class-specific and class-agnostic information. Specifically, the approach utilizes Whitened Linear Discriminant Analysis to project features into two subspaces - the discriminative and residual subspaces - for which the in-distribution (ID) classes are maximally separated and closely clustered, respectively. The OOD score is then determined by combining the deviation from the input data to the ID pattern in both subspaces. The efficacy of our method, named WDiscOOD, is verified on the large-scale ImageNet-1k benchmark, with six OOD datasets that cover a variety of distribution shifts. WDiscOOD demonstrates superior performance on deep classifiers with diverse backbone architectures, including CNN and vision transformer. Furthermore, we also show that WDiscOOD more effectively detects novel concepts in representation spaces trained with contrastive objectives, including supervised contrastive loss and multi-modality contrastive loss. + + + + Boosting Few-shot Action Recognition with Graph-guided Hybrid Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Xing_Boosting_Few-shot_Action_Recognition_with_Graph-guided_Hybrid_Matching_ICCV_2023_paper.pdf + Class prototype construction and matching are core aspects of few-shot action recognition. Previous methods mainly focus on designing spatiotemporal relation modeling modules or complex temporal alignment algorithms. 
Despite the promising results, they ignored the value of class prototype construction and matching, leading to unsatisfactory performance in recognizing similar categories in every task. In this paper, we propose GgHM, a new framework with Graph-guided Hybrid Matching. Concretely, we learn task-oriented features by the guidance of a graph neural network during class prototype construction, optimizing the intra- and inter-class feature correlation explicitly. Next, we design a hybrid matching strategy, combining frame-level and tuple-level matching to classify videos with multivariate styles. We additionally propose a learnable dense temporal modeling module to enhance the video feature temporal representation to build a more solid foundation for the matching process. GgHM shows consistent improvements over other challenging baselines on several few-shot datasets, demonstrating the effectiveness of our method. The code will be publicly available at https://github.com/jiazheng-xing/GgHM. + + + + Diffusion in Style + http://openaccess.thecvf.com//content/ICCV2023/papers/Everaert_Diffusion_in_Style_ICCV_2023_paper.pdf + We present Diffusion in Style, a simple method to adapt Stable Diffusion to any desired style, using only a small set of target images. It is based on the key observation that the style of the images generated by Stable Diffusion is tied to the initial latent tensor. Not adapting this initial latent tensor to the style makes fine-tuning slow, expensive, and impractical, especially when only a few target style images are available. In contrast, fine-tuning is much easier if this initial latent tensor is also adapted. Our Diffusion in Style is orders of magnitude more sample-efficient and faster. It also generates more pleasing images than existing approaches, as shown qualitatively and with quantitative comparisons. + + + + FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods + http://openaccess.thecvf.com//content/ICCV2023/papers/Hesse_FunnyBirds_A_Synthetic_Vision_Dataset_for_a_Part-Based_Analysis_of_ICCV_2023_paper.pdf + The field of explainable artificial intelligence (XAI) aims to uncover the inner workings of complex deep neural models. While being crucial for safety-critical domains, XAI inherently lacks ground-truth explanations, making its automatic evaluation an unsolved problem. We address this challenge by proposing a novel synthetic vision dataset, named FunnyBirds, and accompanying automatic evaluation protocols. Our dataset allows performing semantically meaningful image interventions, e.g., removing individual object parts, which has three important implications. First, it enables analyzing explanations on a part level, which is closer to human comprehension than existing methods that evaluate on a pixel level. Second, by comparing the model output for inputs with removed parts, we can estimate ground-truth part importances that should be reflected in the explanations. Third, by mapping individual explanations into a common space of part importances, we can analyze a variety of different explanation types in a single common framework. Using our tools, we report results for 24 different combinations of neural models and XAI methods, demonstrating the strengths and weaknesses of the assessed methods in a fully automatic and systematic manner. 
+ + + + Deformable Neural Radiance Fields using RGB and Event Cameras + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Deformable_Neural_Radiance_Fields_using_RGB_and_Event_Cameras_ICCV_2023_paper.pdf + Modeling Neural Radiance Fields for fast-moving deformable objects from visual data alone is a challenging problem. A major issue arises due to the high deformation and low acquisition rates. To address this problem, we propose to use event cameras that offer very fast acquisition of visual change in an asynchronous manner. In this work, we develop a novel method to model the deformable neural radiance fields using RGB and Event cameras. The proposed method uses the asynchronous stream of events and calibrated sparse RGB frames. In this setup, the pose of the individual events --required to integrate them into the radiance fields-- remains to be unknown. Our method jointly optimizes the pose and the radiance field, in an efficient manner by leveraging the collection of events at once and actively sampling the events during learning. Experiments conducted on both realistically rendered and real-world datasets demonstrate a significant benefit of the proposed method over the state-of-the-art and the compared baseline. This shows a promising direction for modeling deformable neural radiance fields in real-world dynamic scenes. Our code and data will be publicly available. + + + + BeLFusion: Latent Diffusion for Behavior-Driven Human Motion Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Barquero_BeLFusion_Latent_Diffusion_for_Behavior-Driven_Human_Motion_Prediction_ICCV_2023_paper.pdf + Stochastic human motion prediction (HMP) has generally been tackled with generative adversarial networks and variational autoencoders. Most prior works aim at predicting highly diverse motion in terms of the skeleton joints' dispersion. This has led to methods predicting fast and divergent movements, which are often unrealistic and incoherent with past motion. Such methods also neglect scenarios where anticipating diverse short-range behaviors with subtle joint displacements is important. To address these issues, we present BeLFusion, a model that, for the first time, leverages latent diffusion models in HMP to sample from a behavioral latent space where behavior is disentangled from pose and motion. Thanks to our behavior coupler, which is able to transfer sampled behavior to ongoing motion, BeLFusion's predictions display a variety of behaviors that are significantly more realistic, and coherent with past motion than the state of the art. To support it, we introduce two metrics, the Area of the Cumulative Motion Distribution, and the Average Pairwise Distance Error, which are correlated to realism according to a qualitative study (126 participants). Finally, we prove BeLFusion's generalization power in a new cross-dataset scenario for stochastic HMP. + + + + CleanCLIP: Mitigating Data Poisoning Attacks in Multimodal Contrastive Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Bansal_CleanCLIP_Mitigating_Data_Poisoning_Attacks_in_Multimodal_Contrastive_Learning_ICCV_2023_paper.pdf + Multimodal contrastive pretraining has been used to train multimodal representation models, such as CLIP, on large amounts of paired image-text data. However, previous studies have revealed that such models are vulnerable to backdoor attacks. 
Specifically, when trained on backdoored examples, CLIP learns spurious correlations between the embedded backdoor trigger and the target label, aligning their representations in the joint embedding space. Injecting even a small number of poisoned examples, such as 75 examples in 3 million pretraining data, can significantly manipulate the model's behavior, making it difficult to detect or unlearn such correlations. To address this issue, we propose CleanCLIP, a finetuning framework that weakens the learned spurious associations introduced by backdoor attacks by independently re-aligning the representations for individual modalities. We demonstrate that unsupervised finetuning using a combination of multimodal contrastive and unimodal self-supervised objectives for individual modalities can significantly reduce the impact of the backdoor attack. Additionally, we show that supervised finetuning on task-specific labeled image data removes the backdoor trigger from the CLIP vision encoder. We show empirically that CleanCLIP maintains model performance on benign examples while erasing a range of backdoor attacks on multimodal contrastive learning. + + + + Cumulative Spatial Knowledge Distillation for Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Cumulative_Spatial_Knowledge_Distillation_for_Vision_Transformers_ICCV_2023_paper.pdf + Distilling knowledge from convolutional neural networks (CNNs) is a double-edged sword for vision transformers (ViTs). It boosts the performance since the image-friendly local-inductive bias of CNN helps ViT learn faster and better, but it leads to two problems: (1) Network designs of CNN and ViT are completely different, which leads to different semantic levels of intermediate features, making spatial-wise knowledge transfer methods (e.g., feature mimicking) inefficient. (2) Distilling knowledge from CNN limits the network convergence in the later training period since ViT's capability of integrating global information is suppressed by CNN's local-inductive-bias supervision. To this end, we present Cumulative Spatial Knowledge Distillation (CSKD). CSKD distills spatial-wise knowledge to all patch tokens of ViT from the corresponding spatial responses of CNN, without introducing intermediate features. Furthermore, CSKD exploits a Cumulative Knowledge Fusion (CKF) module, which introduces the global response of CNN and increasingly emphasizes its importance during the training. Applying CKF leverages CNN's local inductive bias in the early training period and gives full play to ViT's global capability in the later one. Extensive experiments and analysis on ImageNet-1k and downstream datasets demonstrate the superiority of our CSKD. Code will be publicly available. + + + + Less is More: Focus Attention for Efficient DETR + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Less_is_More_Focus_Attention_for_Efficient_DETR_ICCV_2023_paper.pdf + DETR-like models have significantly boosted the performance of detectors and even outperformed classical convolutional models. However, treating all tokens equally without discrimination brings a redundant computational burden in the traditional encoder structure. The recent sparsification strategies exploit a subset of informative tokens to reduce attention complexity while maintaining performance through the sparse encoder. But these methods tend to rely on unreliable model statistics.
Moreover, simply reducing the token population hinders the detection performance to a large extent, limiting the application of these sparse models. We propose Focus-DETR, which focuses attention on more informative tokens for a better trade-off between computation efficiency and model accuracy. Specifically, we reconstruct the encoder with dual attention, which includes a token scoring mechanism that considers both localization and category semantic information of the objects from multi-scale feature maps. We efficiently abandon the background queries and enhance the semantic interaction of the fine-grained object queries based on the scores. Compared with the state-of-the-art sparse DETR-like detectors under the same setting, our Focus-DETR gets comparable complexity while achieving 50.4AP (+2.2) on COCO. The code is available at https://github.com/huawei-noah/noah-research/tree/master/Focus-DETR and https://gitee.com/mindspore/models/tree/master/research/cv/Focus-DETR. + + + + Efficient Controllable Multi-Task Architectures + http://openaccess.thecvf.com//content/ICCV2023/papers/Aich_Efficient_Controllable_Multi-Task_Architectures_ICCV_2023_paper.pdf + We aim to train a multi-task model such that users can adjust the desired compute budget and relative importance of task performances after deployment, without retraining. This enables optimizing performance for dynamically varying user needs, without heavy computational overhead to train and save models for various scenarios. To this end, we propose a multi-task model consisting of a shared encoder and task-specific decoders where both encoder and decoder channel widths are slimmable. Our key idea is to control the task importance by varying the capacities of task-specific decoders, while controlling the total computational cost by jointly adjusting the encoder capacity. This improves overall accuracy by allowing a stronger encoder for a given budget, increases control over computational cost, and delivers high-quality slimmed sub-architectures based on user's constraints. Our training strategy involves a novel `Configuration-Invariant Knowledge Distillation' loss that enforces backbone representations to be invariant under different runtime width configurations to enhance accuracy. Further, we present a simple but effective search algorithm that translates user constraints to runtime width configurations of both the shared encoder and task decoders, for sampling the sub-architectures. The key rule for the search algorithm is to provide a larger computational budget to the higher preferred task decoder, while searching a shared encoder configuration that enhances the overall MTL performance. Various experiments on three multi-task benchmarks (PASCALContext, NYUDv2, and CIFAR100-MTL) with diverse backbone architectures demonstrate the advantage of our approach. For example, our method shows a higher controllability by 33.5% in the NYUD-v2 dataset over prior methods, while incurring much less compute cost. + + + + Lens Parameter Estimation for Realistic Depth of Field Modeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Piche-Meunier_Lens_Parameter_Estimation_for_Realistic_Depth_of_Field_Modeling_ICCV_2023_paper.pdf + We present a method to estimate the depth of field effect from a single image. Most existing methods related to this task provide either a per-pixel estimation of blur and/or depth. 
Instead, we go further and propose to use a lens-based representation that models the depth of field using two parameters: the blur factor and focus disparity. Those two parameters, along with the signed defocus representation, result in a more intuitive and linear representation which we solve using a novel weighting network. Furthermore, our method explicitly enforces consistency between the estimated defocus blur, the lens parameters, and the depth map. Finally, we train our deep-learning-based model on a mix of real images with synthetic depth of field and fully synthetic images. These improvements result in a more robust and accurate method, as demonstrated by our state-of-the-art results. In particular, our lens parametrization enables several applications, such as 3D staging for AR environments and seamless object compositing. + + + + Semantic-Aware Implicit Template Learning via Part Deformation Consistency + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Semantic-Aware_Implicit_Template_Learning_via_Part_Deformation_Consistency_ICCV_2023_paper.pdf + Learning implicit templates as neural fields has recently shown impressive performance in unsupervised shape correspondence. Despite the success, we observe current approaches, which solely rely on geometric information, often learn suboptimal deformation across generic object shapes, which have high structural variability. In this paper, we highlight the importance of part deformation consistency and propose a semantic-aware implicit template learning framework to enable semantically plausible deformation. By leveraging semantic prior from a self-supervised feature extractor, we suggest local conditioning with novel semantic-aware deformation code and deformation consistency regularizations regarding part deformation, global deformation, and global scaling. Our extensive experiments demonstrate the superiority of the proposed method over baselines in various tasks: keypoint transfer, part label transfer, and texture transfer. More interestingly, our framework shows a larger performance gain under more challenging settings. We also provide qualitative analyses to validate the effectiveness of semantic-aware deformation. The code is available at https://github.com/mlvlab/PDC. + + + + GRAM-HD: 3D-Consistent Image Generation at High Resolution with Generative Radiance Manifolds + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiang_GRAM-HD_3D-Consistent_Image_Generation_at_High_Resolution_with_Generative_Radiance_ICCV_2023_paper.pdf + Recent works have shown that 3D-aware GANs trained on unstructured single image collections can generate multiview images of novel instances. The key underpinnings to achieve this are a 3D radiance field generator and a volume rendering process. However, existing methods either cannot generate high-resolution images (e.g., up to 256x256) due to the high computation cost of neural volume rendering, or rely on 2D CNNs for image-space upsampling which jeopardizes the 3D consistency across different views. This paper proposes a novel 3D-aware GAN that can generate high resolution images (up to 1024x1024) while keeping strict 3D consistency as in volume rendering. Our motivation is to achieve super-resolution directly in the 3D space to preserve 3D consistency. 
We avoid the otherwise prohibitively-expensive computation cost by applying 2D convolutions on a set of 2D radiance manifolds defined in the recent generative radiance manifold (GRAM) approach, and apply dedicated loss functions for effective GAN training at high resolution. Experiments on FFHQ and AFHQv2 datasets show that our method can produce high-quality 3D-consistent results that significantly outperform existing methods. It makes a significant step towards closing the gap between traditional 2D image generation and 3D-consistent free-view generation. + + + + Small Object Detection via Coarse-to-fine Proposal Generation and Imitation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Yuan_Small_Object_Detection_via_Coarse-to-fine_Proposal_Generation_and_Imitation_Learning_ICCV_2023_paper.pdf + The past few years have witnessed the immense success of object detection, while current excellent detectors struggle on tackling size-limited instances. Concretely, the well-known challenge of low overlaps between the priors and object regions leads to a constrained sample pool for optimization, and the paucity of discriminative information further aggravates the recognition. To alleviate the aforementioned issues, we propose CFINet, a two-stage framework tailored for small object detection based on the Coarse-to-fine pipeline and Feature Imitation learning. Firstly, we introduce Coarse-to-fine RPN (CRPN) to ensure sufficient and high-quality proposals for small objects through the dynamic anchor selection strategy and cascade regression. Then, we equip the conventional detection head with a Feature Imitation (FI) branch to facilitate the region representations of size-limited instances that perplex the model in an imitation manner. Moreover, an auxiliary imitation loss following supervised contrastive learning paradigm is devised to optimize this branch. When integrated with Faster RCNN, CFINet achieves state-of-the-art performance on the large-scale small object detection benchmarks, SODA-D and SODA-A, underscoring its superiority over baseline detector and other mainstream detection approaches. Code is available at https://github.com/shaunyuan22/CFINet. + + + + Anomaly Detection Under Distribution Shift + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_Anomaly_Detection_Under_Distribution_Shift_ICCV_2023_paper.pdf + Anomaly detection (AD) is a crucial machine learning task that aims to learn patterns from a set of normal training samples to identify abnormal samples in test data. Most existing AD studies assume that the training and test data are drawn from the same data distribution, but the test data can have large distribution shifts arising in many real-world applications due to different natural variations such as new lighting conditions, object poses, or background appearances, rendering existing AD methods ineffective in such cases. In this paper, we consider the problem of anomaly detection under distribution shift and establish performance benchmarks on four widely-used AD and out-of-distribution (OOD) generalization datasets. We demonstrate that simple adaptation of state-of-the-art OOD generalization methods to AD settings fails to work effectively due to the lack of labeled anomaly data. We further introduce a novel robust AD approach to diverse distribution shifts by minimizing the distribution gap between in-distribution and OOD normal samples in both the training and inference stages in an unsupervised way. 
Our extensive empirical results on the four datasets show that our approach substantially outperforms state-of-the-art AD methods and OOD generalization methods on data with various distribution shifts, while maintaining the detection accuracy on in-distribution data. Code and data are available at https://github.com/mala-lab/ADShift. + + + + Enhancing Privacy Preservation in Federated Learning via Learning Rate Perturbation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wan_Enhancing_Privacy_Preservation_in_Federated_Learning_via_Learning_Rate_Perturbation_ICCV_2023_paper.pdf + Federated learning (FL) is a privacy-enhanced distributed machine learning framework, in which multiple clients collaboratively train a global model by exchanging their model updates without sharing local private data. However, the adversary can use gradient inversion attacks to reveal the clients' privacy from the shared model updates. Previous attacks assume the adversary can infer the local learning rate of each client, while we observe that: (1) using uniformly distributed random local learning rates does not incur much accuracy loss of the global model, and (2) personalizing local learning rates can mitigate the drift issue which is caused by non-IID (identically and independently distributed) data. Moreover, we theoretically derive a convergence guarantee for FedAvg with uniformly perturbed local learning rates. Therefore, by perturbing the learning rate of each client with random noise, we propose a learning rate perturbation (LRP) defense against gradient inversion attacks. Specifically, for classification tasks, we adapt LRP to ada-LRP by personalizing the expectation of each local learning rate. The experiments show that our defenses can effectively enhance privacy preservation against existing gradient inversion attacks, and LRP outperforms 5 baseline defenses against a state-of-the-art gradient inversion attack. In addition, our defenses only incur minor accuracy reductions (less than 0.5%) of the global model. So they are effective in real applications. + + + + ImGeoNet: Image-induced Geometry-aware Voxel Representation for Multi-view 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Tu_ImGeoNet_Image-induced_Geometry-aware_Voxel_Representation_for_Multi-view_3D_Object_Detection_ICCV_2023_paper.pdf + We propose ImGeoNet, a multi-view image-based 3D object detection framework that models a 3D space by an image-induced geometry-aware voxel representation. Unlike previous methods which aggregate 2D features into 3D voxels without considering geometry, ImGeoNet learns to induce geometry from multi-view images to alleviate the confusion arising from voxels of free space, and during the inference phase, only images from multiple views are required. Besides, a powerful pre-trained 2D feature extractor can be leveraged by our representation, leading to a more robust performance. To evaluate the effectiveness of ImGeoNet, we conduct quantitative and qualitative experiments on three indoor datasets, namely ARKitScenes, ScanNetV2, and ScanNet200. The results demonstrate that ImGeoNet outperforms the current state-of-the-art multi-view image-based method, ImVoxelNet, on all three datasets in terms of detection accuracy. In addition, ImGeoNet shows great data efficiency by achieving results comparable to ImVoxelNet with 100 views while utilizing only 40 views.
Furthermore, our studies indicate that our proposed image-induced geometry-aware representation can enable image-based methods to attain higher detection accuracy than the seminal point cloud-based method, VoteNet, in two practical scenarios: (1) scenarios where point clouds are sparse and noisy, such as in ARKitScenes, and (2) scenarios involving diverse object classes, particularly classes of small objects, as is the case in ScanNet200. + + + + Diffusion-based Image Translation with Label Guidance for Domain Adaptive Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Peng_Diffusion-based_Image_Translation_with_Label_Guidance_for_Domain_Adaptive_Semantic_ICCV_2023_paper.pdf + Translating images from a source domain to a target domain for learning target models is one of the most common strategies in domain adaptive semantic segmentation (DASS). However, existing methods still struggle to preserve semantically-consistent local details between the original and translated images. In this work, we present an innovative approach that addresses this challenge by using source-domain labels as explicit guidance during image translation. Concretely, we formulate cross-domain image translation as a denoising diffusion process and utilize a novel Semantic Gradient Guidance (SGG) method to constrain the translation process, conditioning it on the pixel-wise source labels. Additionally, a Progressive Translation Learning (PTL) strategy is devised to enable the SGG method to work reliably across domains with large gaps. Extensive experiments demonstrate the superiority of our approach over state-of-the-art methods. + + + + Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yuan_Isomer_Isomerous_Transformer_for_Zero-shot_Video_Object_Segmentation_ICCV_2023_paper.pdf + Recent leading zero-shot video object segmentation (ZVOS) works are devoted to integrating appearance and motion information by elaborately designing feature fusion modules and identically applying them in multiple feature stages. Our preliminary experiments show that with the strong long-range dependency modeling capacity of Transformer, simply concatenating the two modality features and feeding them to vanilla Transformers for feature fusion can distinctly benefit the performance but at the cost of heavy computation. Through further empirical analysis, we find that attention dependencies learned in Transformer in different stages exhibit completely different properties: global query-independent dependency in the low-level stages and semantic-specific dependency in the high-level stages. Motivated by the observations, we propose two Transformer variants: i) Context-Sharing Transformer (CST) that learns the global-shared contextual information within image frames with a lightweight computation. ii) Semantic Gathering-Scattering Transformer (SGST) that models the semantic correlation separately for the foreground and background and reduces the computation cost with a soft token merging mechanism. We apply CST and SGST for low-level and high-level feature fusions, respectively, formulating a level-isomerous Transformer framework for the ZVOS task. Compared with the baseline that uses vanilla Transformers for multi-stage fusion, ours significantly increases the speed by 13 times and achieves new state-of-the-art ZVOS performance. Code is available at https://github.com/DLUT-yyc/Isomer.
+ + + + X-Mesh: Towards Fast and Accurate Text-driven 3D Stylization via Dynamic Textual Guidance + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_X-Mesh_Towards_Fast_and_Accurate_Text-driven_3D_Stylization_via_Dynamic_ICCV_2023_paper.pdf + Text-driven 3D stylization is a complex and crucial task in the fields of computer vision (CV) and computer graphics (CG), aimed at transforming a bare mesh to fit a target text. Prior methods adopt text-independent multilayer perceptrons (MLPs) to predict the attributes of the target mesh with the supervision of CLIP loss. However, such a text-independent architecture lacks textual guidance when predicting attributes, thus leading to unsatisfactory stylization and slow convergence. To address these limitations, we present X-Mesh, an innovative text-driven 3D stylization framework that incorporates a novel Text-guided Dynamic Attention Module (TDAM). The TDAM dynamically integrates the guidance of the target text by utilizing text-relevant spatial and channel-wise attentions during vertex feature extraction, resulting in more accurate attribute prediction and faster convergence speed. Furthermore, existing works lack standard benchmarks and automated metrics for evaluation, often relying on subjective and non-reproducible user studies to assess the quality of stylized 3D assets. To overcome this limitation, we introduce a new standard text-mesh benchmark, namely MIT-30, and two automated metrics, which will enable future research to achieve fair and objective comparisons. Our extensive qualitative and quantitative experiments demonstrate that X-Mesh outperforms previous state-of-the-art methods. + + + + ViLTA: Enhancing Vision-Language Pre-training through Textual Augmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_ViLTA_Enhancing_Vision-Language_Pre-training_through_Textual_Augmentation_ICCV_2023_paper.pdf + Vision-language pre-training (VLP) methods have been blossoming recently, and their crucial goal is to jointly learn visual and textual features via a transformer-based architecture, demonstrating promising improvements on a variety of vision-language tasks. Prior arts usually focus on how to align visual and textual features, but strategies for improving the robustness of the model and speeding up model convergence are left insufficiently explored. In this paper, we propose a novel method, ViLTA, comprising two components to further facilitate the model to learn fine-grained representations among image-text pairs. For Masked Language Modeling (MLM), we propose a cross-distillation method to generate soft labels to enhance the robustness of the model, which alleviates the problem of treating synonyms of masked words as negative samples in one-hot labels. For Image-Text Matching (ITM), we leverage the current language encoder to synthesize hard negatives based on the context of language input, encouraging the model to learn high-quality representations by increasing the difficulty of the ITM task. By leveraging the above techniques, our ViLTA can achieve better performance on various vision-language tasks. Extensive experiments on benchmark datasets demonstrate the effectiveness of ViLTA and its promising potential for vision-language pre-training.
+ + + + Not Every Side Is Equal: Localization Uncertainty Estimation for Semi-Supervised 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Not_Every_Side_Is_Equal_Localization_Uncertainty_Estimation_for_Semi-Supervised_ICCV_2023_paper.pdf + Semi-supervised 3D object detection from point cloud aims to train a detector with a small number of labeled data and a large number of unlabeled data. The core of existing methods lies in how to select high-quality pseudo-labels using the designed quality evaluation criterion. However, these methods treat each pseudo bounding box as a whole and assign equal importance to each side during training, which is detrimental to model performance due to many sides having poor localization quality. Besides, existing methods filter out a large number of low-quality pseudo-labels, which also contain some correct regression values that can help with model training. To address the above issues, we propose a side-aware framework for semi-supervised 3D object detection consisting of three key designs: a 3D bounding box parameterization method, an uncertainty estimation module, and a pseudo-label selection strategy. These modules work together to explicitly estimate the localization quality of each side and assign different levels of importance during the training phase. Extensive experiment results demonstrate that the proposed method can consistently outperform baseline models under different scenes and evaluation criteria. Moreover, our method achieves state-of-the-art performance on three datasets with different labeled ratios. + + + + Teaching CLIP to Count to Ten + http://openaccess.thecvf.com//content/ICCV2023/papers/Paiss_Teaching_CLIP_to_Count_to_Ten_ICCV_2023_paper.pdf + Large vision-language models, such as CLIP, learn robust representations of text and images, facilitating advances in many downstream tasks, including zero-shot classification and text-to-image generation. However, these models have several well-documented limitations. They fail to encapsulate compositional concepts, such as counting. To the best of our knowledge, this work is the first to extend CLIP to handle object counting. We introduce a simple yet effective method to improve the quantitative understanding of vision-language models, while maintaining their overall performance on common benchmarks. Our method automatically augments image captions to create hard negative samples that differ from the original captions by only the number of objects. For example, an image of three dogs can be contrasted with the negative caption "Six dogs playing in the yard". A dedicated loss encourages discrimination between the correct caption and its negative variant. In addition, we introduce CountBench, a new benchmark for evaluating a model's understanding of object counting, and demonstrate significant improvement over baseline models on this task. Furthermore, we leverage our improved CLIP representations for text-conditioned image generation, and show that our model can produce specific counts of objects more reliably than existing ones. + + + + Domain Generalization Guided by Gradient Signal to Noise Ratio of Parameters + http://openaccess.thecvf.com//content/ICCV2023/papers/Michalkiewicz_Domain_Generalization_Guided_by_Gradient_Signal_to_Noise_Ratio_of_ICCV_2023_paper.pdf + Overfitting to the source domain is a common issue in gradient-based training of deep neural networks. 
To compensate for over-parameterized models, numerous regularization techniques have been introduced, such as those based on dropout. While these methods achieve significant improvements on classical benchmarks such as ImageNet, their performance diminishes with the introduction of domain shift in the test set, i.e., when the unseen data comes from a significantly different distribution. In this paper, we move away from the classical approach of Bernoulli-sampled dropout mask construction and propose to base the selection on the gradient signal-to-noise ratio (GSNR) of the network's parameters. Specifically, at each training step, parameters with high GSNR will be discarded. Furthermore, we alleviate the burden of manually searching for the optimal dropout ratio by leveraging a meta-learning approach. We evaluate our method on standard domain generalization benchmarks and achieve competitive results on classification and face anti-spoofing problems. + + + + Counterfactual-based Saliency Map: Towards Visual Contrastive Explanations for Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Counterfactual-based_Saliency_Map_Towards_Visual_Contrastive_Explanations_for_Neural_Networks_ICCV_2023_paper.pdf + Explaining deep models in a human-understandable way has been explored by many works that mostly explain why an input causes a corresponding prediction (i.e., Why P?). However, they can seldom handle more complex causal questions like "why P rather than Q?" and "why one is P while another is Q?", which would better help humans understand the behavior of deep models. Considering the insufficient study of such complex causal questions, we make the first attempt to explain different causal questions by contrastive explanations in a unified framework, i.e., Counterfactual Contrastive Explanation (CCE), which visually and intuitively explains the aforementioned questions via a novel positive-negative saliency-based explanation scheme. More specifically, we propose a content-aware counterfactual perturbing algorithm to simulate contrastive examples, from which a pair of positive and negative saliency maps can be derived to contrastively explain why P (positive class) rather than Q (negative class). Beyond existing works, our counterfactual perturbation meets the principles of validity, sparsity, and data distribution closeness at the same time. In addition, by slightly adjusting the objective of the perturbation, our framework can adapt to different causal questions. Extensive experimental evaluation demonstrates the effectiveness and superior performance of the proposed CCE on different benchmark metrics for interpretability, including Sanity Check, Class Deviation Score, and Insertion-Deletion tests. A user study is conducted, and the results show that user confidence increases significantly when presented with CCE compared to standard saliency map baselines. + + + + MST-compression: Compressing and Accelerating Binary Neural Networks with Minimum Spanning Tree + http://openaccess.thecvf.com//content/ICCV2023/papers/Vo_MST-compression_Compressing_and_Accelerating_Binary_Neural_Networks_with_Minimum_Spanning_ICCV_2023_paper.pdf + Binary neural networks (BNNs) have been widely adopted to reduce the computational cost and memory storage on edge-computing devices by using a one-bit representation for activations and weights.
However, as neural networks become wider/deeper to improve accuracy and meet practical requirements, the computational burden remains a significant challenge even on the binary version. To address these issues, this paper proposes a novel method called Minimum Spanning Tree (MST) compression that learns to compress and accelerate BNNs. The proposed architecture leverages an observation from previous works that an output channel in a binary convolution can be computed using another output channel and XNOR operations with weights that differ from the weights of the reused channel. We first construct a fully connected graph with vertices corresponding to output channels, where the distance between two vertices is the number of different values between the weight sets used for these outputs. Then, the MST of the graph with the minimum depth is proposed to reorder output calculations, aiming to reduce computational cost and latency. Moreover, we propose a new learning algorithm to reduce the total MST distance during training. Experimental results on benchmark models demonstrate that our method achieves significant compression ratios with negligible accuracy drops, making it a promising approach for resource-constrained edge-computing devices. + + + + IIEU: Rethinking Neural Feature Activation from Decision-Making + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_IIEU_Rethinking_Neural_Feature_Activation_from_Decision-Making_ICCV_2023_paper.pdf + Nonlinear Activation (Act) models which help fit the underlying mappings are critical for neural representation learning. Neuronal behaviors inspire basic Act functions, e.g., Softplus and ReLU. We instead seek improved explainable Act models by re-interpreting neural feature Act from a new philosophical perspective of Multi-Criteria Decision-Making (MCDM). By treating activation models as selective feature re-calibrators that suppress/emphasize features according to their importance scores measured by feature-filter similarities, we propose a set of specific properties of effective Act models with new intuitions. This helps us identify the unexcavated yet critical problem of mismatched feature scoring led by the differentiated norms of the features and filters. We present the Instantaneous Importance Estimation Units (IIEUs), a novel class of interpretable Act models that address the problem by re-calibrating the feature with the Instantaneous Importance (II) score (which we refer to as) estimated with the adaptive norm-decoupled feature-filter similarities, capable of modeling the cross-layer and -channel cues at a low cost. The extensive experiments on various vision benchmarks demonstrate the significant improvements of our IIEUs over the SOTA Act models and validate our interpretation of feature Act. By replacing the popular/SOTA Act models with IIEUs, the small ResNet-26s outperform/match the large ResNet-101s on ImageNet with far fewer parameters and computations. + + + + Integrally Migrating Pre-trained Transformer Encoder-decoders for Visual Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Integrally_Migrating_Pre-trained_Transformer_Encoder-decoders_for_Visual_Object_Detection_ICCV_2023_paper.pdf + Modern object detectors have taken the advantages of backbone networks pre-trained on large scale datasets. 
Except for the backbone networks, however, other components such as the detector head and the feature pyramid network (FPN) remain trained from scratch, which hinders the generalization capacity of detectors. In this study, we propose to integrally migrate pre-trained transformer encoder-decoders (imTED) to a detector, constructing a feature extraction path which is "fully pre-trained" so that detectors' generalization capacity is maximized. The essential differences between imTED and the baseline detector are twofold: (1) migrating the pre-trained transformer decoder to the detector head while removing the randomly initialized FPN from the feature extraction path; and (2) defining a multi-scale feature modulator (MFM) to enhance scale adaptability. Such designs not only reduce randomly initialized parameters significantly but also unify detector training with representation learning as intended. Experiments on the MS COCO object detection dataset show that imTED consistently outperforms its counterparts by 2.4 AP. Without bells and whistles, imTED improves the state-of-the-art of few-shot object detection by up to 7.6 AP. Code is released at https://github.com/LiewFeng/imTED. + + + + V-FUSE: Volumetric Depth Map Fusion with Long-Range Constraints + http://openaccess.thecvf.com//content/ICCV2023/papers/Burgdorfer_V-FUSE_Volumetric_Depth_Map_Fusion_with_Long-Range_Constraints_ICCV_2023_paper.pdf + We introduce a learning-based depth map fusion framework that accepts a set of depth and confidence maps generated by a Multi-View Stereo (MVS) algorithm as input and improves them. This is accomplished by integrating volumetric visibility constraints that encode long-range surface relationships across different views into an end-to-end trainable architecture. We also introduce a depth search window estimation sub-network trained jointly with the larger fusion sub-network to reduce the depth hypothesis search space along each ray. Our method learns to model depth consensus and violations of visibility constraints directly from the data, effectively removing the necessity of fine-tuning fusion parameters. Extensive experiments on MVS datasets show substantial improvements in the accuracy of the output fused depth and confidence maps. + + + + GECCO: Geometrically-Conditioned Point Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Tyszkiewicz_GECCO_Geometrically-Conditioned_Point_Diffusion_Models_ICCV_2023_paper.pdf + Diffusion models generating images conditionally on text, such as Dall-E 2 and Stable Diffusion, have recently made a splash far beyond the computer vision community. Here, we tackle the related problem of generating point clouds, both unconditionally and conditionally with images. For the latter, we introduce a novel geometrically-motivated conditioning scheme based on projecting sparse image features into the point cloud and attaching them to each individual point, at every step in the denoising process. This approach improves geometric consistency and yields greater fidelity than current methods relying on unstructured, global latent codes. Additionally, we show how to apply recent continuous-time diffusion schemes. Our method performs on par with or above the state of the art on conditional and unconditional experiments on synthetic data, while being faster, lighter, and delivering tractable likelihoods. We show it can also scale to diverse indoor scenes.
+ + + + PETRv2: A Unified Framework for 3D Perception from Multi-Camera Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_PETRv2_A_Unified_Framework_for_3D_Perception_from_Multi-Camera_Images_ICCV_2023_paper.pdf + In this paper, we propose PETRv2, a unified framework for 3D perception from multi-view images. Based on PETR, PETRv2 explores the effectiveness of temporal modeling, which utilizes the temporal information of previous frames to boost 3D object detection. More specifically, we extend the 3D position embedding (3D PE) in PETR for temporal modeling. The 3D PE achieves the temporal alignment on object position of different frames. To support for multi-task learning (e.g., BEV segmentation and 3D lane detection), PETRv2 provides a simple yet effective solution by introducing task-specific queries, which are initialized under different spaces. PETRv2 achieves state-of-the-art performance on 3D object detection, BEV segmentation and 3D lane detection. Detailed robustness analysis is also conducted on PETR framework. Code is available at https://github.com/megvii-research/PETR. + + + + Out-of-Domain GAN Inversion via Invertibility Decomposition for Photo-Realistic Human Face Manipulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Out-of-Domain_GAN_Inversion_via_Invertibility_Decomposition_for_Photo-Realistic_Human_Face_ICCV_2023_paper.pdf + The fidelity of Generative Adversarial Networks (GAN) inversion is impeded by Out-Of-Domain (OOD) areas (e.g., background, accessories) in the image. Detecting the OOD areas beyond the generation ability of the pre-trained model and blending these regions with the input image can enhance fidelity. The "invertibility mask" figures out these OOD areas, and existing methods predict the mask with the reconstruction error. However, the estimated mask is usually inaccurate due to the influence of the reconstruction error in the In-Domain (ID) area. In this paper, we propose a novel framework that enhances the fidelity of human face inversion by designing a new module to decompose the input images to ID and OOD partitions with invertibility masks. Unlike previous works, our invertibility detector is simultaneously learned with a spatial alignment module. We iteratively align the generated features to the input geometry and reduce the reconstruction error in the ID regions. Thus, the OOD areas are more distinguishable and can be precisely predicted. Then, we improve the fidelity of our results by blending the OOD areas from the input image with the ID GAN inversion results. Our method produces photo-realistic results for real-world human face image inversion and manipulation. Extensive experiments demonstrate our method's superiority over existing methods in the quality of GAN inversion and attribute manipulation. + + + + Learning Trajectory-Word Alignments for Video-Language Tasks + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Learning_Trajectory-Word_Alignments_for_Video-Language_Tasks_ICCV_2023_paper.pdf + In a video, an object usually appears as the trajectory, i.e., it spans over a few spatial but longer temporal patches, that contains abundant spatiotemporal contexts. However, modern Video-Language BERTs (VDL-BERTs) neglect this trajectory characteristic that they usually follow image-language BERTs (IL-BERTs) to deploy the patch-to-word (P2W) attention that may over-exploit trivial spatial contexts and neglect significant temporal contexts. 
To amend this, we propose a novel TW-BERT to learn Trajectory-Word alignment by a newly designed trajectory-to-word (T2W) attention for solving video-language tasks. Moreover, previous VDL-BERTs usually uniformly sample a few frames into the model while different trajectories have diverse graininess, i.e., some trajectories span longer frames and some span shorter, and using a few frames will lose certain useful temporal contexts. However, simply sampling more frames will also make pre-training infeasible due to the largely increased training burdens. To alleviate the problem, during the fine-tuning stage, we insert a novel Hierarchical Frame-Selector (HFS) module into the video encoder. HFS gradually selects the suitable frames conditioned on the text context for the later cross-modal encoder to learn better trajectory-word alignments. By the proposed T2W attention and HFS, our TW-BERT achieves SOTA performances on text-to-video retrieval tasks, and comparable performances on video question-answering tasks with some VDL-BERTs trained on much more data. The code will be available in the supplementary material. + + + + Geometry-guided Feature Learning and Fusion for Indoor Scene Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Yin_Geometry-guided_Feature_Learning_and_Fusion_for_Indoor_Scene_Reconstruction_ICCV_2023_paper.pdf + In addition to color and textual information, geometry provides important cues for 3D scene reconstruction. However, current reconstruction methods only include geometry at the feature level thus not fully exploiting the geometric information. In contrast, this paper proposes a novel geometry integration mechanism for 3D scene reconstruction. Our approach incorporates 3D geometry at three levels, i.e. feature learning, feature fusion, and network supervision. First, geometry-guided feature learning encodes geometric priors to contain view-dependent information. Second, a geometry-guided adaptive feature fusion is introduced which utilizes the geometric priors as a guidance to adaptively generate weights for multiple views. Third, at the supervision level, taking the consistency between 2D and 3D normals into account, a consistent 3D normal loss is designed to add local constraints. Large-scale experiments are conducted on the ScanNet dataset, showing that volumetric methods with our geometry integration mechanism outperform state-of-the-art methods quantitatively as well as qualitatively. Volumetric methods with ours also show good generalization on the 7-Scenes and TUM RGB-D datasets. + + + + Atmospheric Transmission and Thermal Inertia Induced Blind Road Segmentation with a Large-Scale Dataset TBRSD + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Atmospheric_Transmission_and_Thermal_Inertia_Induced_Blind_Road_Segmentation_with_ICCV_2023_paper.pdf + Computer vision-based walking assistants are prominent tools for aiding visually impaired people in navigation. Blind road segmentation is a key element in these walking assistant systems. However, most walking assistant systems rely on visual light images, which is dangerous in weak illumination environments such as darkness or fog. To address this issue and enhance the safety of vision-based walking assistant systems, we developed a thermal infrared blind road segmentation neural network (TINN). 
In contrast to conventional segmentation techniques that primarily concentrate on enhancing feature extraction and perception, our approach is geared towards preserving the inherent radiation characteristics within the thermal imaging process. Initially, we model two critical factors in thermal infrared imaging: thermal light atmospheric transmission and the thermal inertia effect. Subsequently, we use an encoder-decoder architecture to fuse the features extracted by the two modules. Additionally, to train the network and evaluate the effectiveness of the proposed method, we constructed a large-scale thermal infrared blind road segmentation dataset named TBRSD, which consists of 5180 pixel-level manual annotations. The experimental results demonstrate that our method outperforms existing techniques and achieves state-of-the-art performance in thermal blind road segmentation, as validated on benchmark thermal infrared semantic segmentation datasets such as MFNet and SODA. The dataset and our code are both publicly available at https://github.com/chenjzBUAA/TBRSD or http://xzbai.buaa.edu.cn/datasets.html. + + + + Efficient-VQGAN: Towards High-Resolution Image Generation with Efficient Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_Efficient-VQGAN_Towards_High-Resolution_Image_Generation_with_Efficient_Vision_Transformers_ICCV_2023_paper.pdf + Vector-quantized image modeling has shown great potential in synthesizing high-quality images. However, generating high-resolution images remains a challenging task due to the quadratic computational overhead of the self-attention process. In this study, we seek to explore a more efficient two-stage framework for high-resolution image generation with improvements in the following three aspects. (1) Based on the observation that the first quantization stage has a solid local property, we employ a local attention-based quantization model instead of the global attention mechanism used in previous methods, leading to better efficiency and reconstruction quality. (2) We emphasize the importance of multi-grained feature interaction during image generation and introduce an efficient attention mechanism that combines global attention (long-range semantic consistency within the whole image) and local attention (fine-grained details). This approach results in faster generation speed, higher generation fidelity, and improved resolution. (3) We propose a new generation pipeline incorporating autoencoding training and an autoregressive generation strategy, demonstrating a better paradigm for image synthesis. Extensive experiments demonstrate the superiority of our approach in high-quality and high-resolution image reconstruction and generation. + + + + Towards Fair and Comprehensive Comparisons for Image-Based 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Towards_Fair_and_Comprehensive_Comparisons_for_Image-Based_3D_Object_Detection_ICCV_2023_paper.pdf + In this work, we build a modular-designed codebase, formulate strong training recipes, design an error diagnosis toolbox, and discuss current methods for image-based 3D object detection. Specifically, different from other highly mature tasks, e.g., 2D object detection, the community of image-based 3D object detection is still evolving, where methods often adopt different training recipes and tricks, resulting in unfair evaluations and comparisons. What is worse, these tricks may overwhelm their proposed designs in performance, even leading to wrong conclusions.
To address this issue, we build a module-designed codebase and formulate unified training standards for the community. Furthermore, we also design an error diagnosis toolbox to measure the detailed characterization of detection models. Using these tools, we analyze current methods in-depth under varying settings and provide discussions for some open questions, e.g., discrepancies in conclusions on KITTI-3D and nuScenes datasets, which have led to different dominant methods for these datasets. We hope that this work will facilitate future research in vision-based 3D detection. Our codes will be released at https://github.com/OpenGVLab/3dodi. + + + + Random Boxes Are Open-world Object Detectors + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Random_Boxes_Are_Open-world_Object_Detectors_ICCV_2023_paper.pdf + We show that classifiers trained with random region proposals achieve state-of-the-art Open-world Object Detection (OWOD): they can not only maintain the accuracy of the known objects (w/ training labels), but also considerably improve the recall of unknown ones (w/o training labels). Specifically, we propose RandBox, a Fast R-CNN based architecture trained on random proposals at each training iteration, surpassing existing Faster R-CNN and Transformer based OWOD. Its effectiveness stems from the following two benefits introduced by randomness. First, as the randomization is independent of the distribution of the limited known objects, the random proposals become the instrumental variable that prevents the training from being confounded by the known objects. Second, the unbiased training encourages more proposal explorations by using our proposed matching score that does not penalize the random proposals whose prediction scores do not match the known objects. On two benchmarks: Pascal-VOC/MS-COCO and LVIS, RandBox significantly outperforms the previous state-of-the-art in all metrics. We also detail the ablations on randomization and loss designs. Codes and other details are in Appendix. + + + + DiffDreamer: Towards Consistent Unsupervised Single-view Scene Extrapolation with Conditional Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_DiffDreamer_Towards_Consistent_Unsupervised_Single-view_Scene_Extrapolation_with_Conditional_Diffusion_ICCV_2023_paper.pdf + Scene extrapolation---the idea of generating novel views by flying into a given image---is a promising, yet challenging task. For each predicted frame, a joint inpainting and 3D refinement problem has to be solved, which is ill posed and includes a high level of ambiguity. Moreover, training data for long-range scenes is difficult to obtain and usually lacks sufficient views to infer accurate camera poses. We introduce DiffDreamer, an unsupervised framework capable of synthesizing novel views depicting a long camera trajectory while training solely on internet-collected images of nature scenes. Utilizing the stochastic nature of the guided denoising steps, we train the diffusion models to refine projected RGBD images but condition the denoising steps on multiple past and future frames for inference. We demonstrate that image-conditioned diffusion models can effectively perform long-range scene extrapolation while preserving consistency significantly better than prior GAN-based methods. DiffDreamer is a powerful and efficient solution for scene extrapolation, producing impressive results despite limited supervision. Project page: https://primecai.github.io/diffdreamer. 
+ + + + Enhancing Adversarial Robustness in Low-Label Regime via Adaptively Weighted Regularization and Knowledge Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Enhancing_Adversarial_Robustness_in_Low-Label_Regime_via_Adaptively_Weighted_Regularization_ICCV_2023_paper.pdf + Adversarial robustness is a research area that has recently received a lot of attention in the quest for trustworthy artificial intelligence. However, recent works on adversarial robustness have focused on supervised learning, where it is assumed that labeled data is plentiful. In this paper, we investigate semi-supervised adversarial training where labeled data is scarce. We derive two upper bounds for the robust risk and propose a regularization term for unlabeled data motivated by these two upper bounds. Then, we develop a semi-supervised adversarial training algorithm that combines the proposed regularization term with knowledge distillation using a semi-supervised teacher. Our experiments show that our proposed algorithm achieves state-of-the-art performance with significant margins compared to existing algorithms. In particular, compared to supervised learning algorithms, the performance of our proposed algorithm is not much worse even when the amount of labeled data is very small. For example, our algorithm with only 8% labeled data is comparable to supervised adversarial training algorithms that use all labeled data, in terms of both standard and robust accuracies on CIFAR-10. + + + + MIMO-NeRF: Fast Neural Rendering with Multi-input Multi-output Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Kaneko_MIMO-NeRF_Fast_Neural_Rendering_with_Multi-input_Multi-output_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Neural radiance fields (NeRFs) have shown impressive results for novel view synthesis. However, they depend on the repetitive use of a single-input single-output multilayer perceptron (SISO MLP) that maps 3D coordinates and view direction to the color and volume density in a sample-wise manner, which slows the rendering. We propose a multi-input multi-output NeRF (MIMO-NeRF) that reduces the number of MLP runs by replacing the SISO MLP with a MIMO MLP and conducting mappings in a group-wise manner. One notable challenge with this approach is that the color and volume density of each point can differ according to the choice of input coordinates in a group, which can lead to some notable ambiguity. We also propose a self-supervised learning method that regularizes the MIMO MLP with multiple fast reformulated MLPs to alleviate this ambiguity without using pretrained models. The results of a comprehensive experimental evaluation, including comparative and ablation studies, are presented to show that MIMO-NeRF obtains a good trade-off between speed and quality with a reasonable training time. We then demonstrate that MIMO-NeRF is compatible with and complementary to previous advancements in NeRFs by applying it to two representative fast NeRFs, i.e., a NeRF with a sampling network (DONeRF) and a NeRF with alternative representations (TensoRF). + + + + Instance Neural Radiance Field + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Instance_Neural_Radiance_Field_ICCV_2023_paper.pdf + This paper presents one of the first learning-based NeRF 3D instance segmentation pipelines, dubbed Instance Neural Radiance Field, or Instance-NeRF.
Taking a NeRF pretrained from multi-view RGB images as input, Instance-NeRF can learn 3D instance segmentation of a given scene, represented as an instance field component of the NeRF model. To this end, we adopt a 3D proposal-based mask prediction network on the sampled volumetric features from NeRF, which generates discrete 3D instance masks. The coarse 3D mask prediction is then projected to image space to match 2D segmentation masks from different views generated by existing panoptic segmentation models, which are used to supervise the training of the instance field. Notably, beyond generating consistent 2D segmentation maps from novel views, Instance-NeRF can query instance information at any 3D point, which greatly enhances NeRF object segmentation and manipulation. Our method is also one of the first to achieve such results in pure inference. Evaluated on synthetic and real-world NeRF datasets with complex indoor scenes, Instance-NeRF surpasses previous NeRF segmentation works and competitive 2D segmentation methods in segmentation performance on unseen views. Code and data are available at https://github.com/lyclyc52/Instance_NeRF. + + + + One-bit Flip is All You Need: When Bit-flip Attack Meets Model Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_One-bit_Flip_is_All_You_Need_When_Bit-flip_Attack_Meets_ICCV_2023_paper.pdf + Deep neural networks (DNNs) are widely deployed on real-world devices. Concerns regarding their security have gained great attention from researchers. Recently, a new weight modification attack called the bit flip attack (BFA) was proposed, which exploits memory fault injection techniques such as row hammer to attack quantized models in the deployment stage. With only a few bit flips, the target model can be rendered useless as a random guesser or even be implanted with malicious functionalities. In this work, we seek to further reduce the number of bit flips. We propose a training-assisted bit flip attack, in which the adversary is involved in the training stage to build a high-risk model to release. This high-risk model, obtained together with a corresponding malicious model, behaves normally and can escape various detection methods. The results on benchmark datasets show that an adversary can easily convert this high-risk but normal model to a malicious one on the victim's side by flipping only one critical bit on average in the deployment stage. Moreover, our attack still poses a significant threat even when defenses are employed. The code for reproducing the main experiments is available at https://github.com/jianshuod/TBA. + + + + Improving CLIP Fine-tuning Performance + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Improving_CLIP_Fine-tuning_Performance_ICCV_2023_paper.pdf + CLIP models have demonstrated impressively high zero-shot recognition accuracy; however, their fine-tuning performance on downstream vision tasks is sub-optimal. In contrast, masked image modeling (MIM) performs exceptionally well for fine-tuning on downstream tasks, despite the absence of semantic labels during training. We note that the two tasks have different ingredients: image-level targets versus token-level targets, a cross-entropy loss versus a regression loss, and full-image inputs versus partial-image inputs. To mitigate the differences, we introduce a classical feature map distillation framework, which can simultaneously inherit the semantic capability of CLIP models while constructing a task that incorporates the key ingredients of MIM.
Experiments suggest that the feature map distillation approach significantly boosts the fine-tuning performance of CLIP models on several typical downstream vision tasks. We also observe that the approach yields new CLIP representations which share some diagnostic properties with those of MIM. Furthermore, the feature map distillation approach generalizes to other pre-training models, such as DINO, DeiT and SwinV2-G, reaching a new record of 64.2 mAP on COCO object detection with a +1.1 improvement. The code and models are publicly available at https://github.com/SwinTransformer/Feature-Distillation. + + + + The Power of Sound (TPoS): Audio Reactive Video Generation with Stable Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Jeong_The_Power_of_Sound_TPoS_Audio_Reactive_Video_Generation_with_ICCV_2023_paper.pdf + In recent years, video generation has become a prominent generative tool and has drawn significant attention. However, there has been little consideration of audio-to-video generation, though audio contains unique qualities like temporal semantics and magnitude. Hence, we propose The Power of Sound (TPoS) model to incorporate audio input that includes both changeable temporal semantics and magnitude. To generate video frames, TPoS utilizes a latent stable diffusion model with textual semantic information, which is then guided by the sequential audio embedding from our pretrained Audio Encoder. As a result, this method produces audio-reactive video content. We demonstrate the effectiveness of TPoS across various tasks and compare its results with current state-of-the-art techniques in the field of audio-to-video generation. More examples are available at https://ku-vai.github.io/TPoS/ + + + + DINAR: Diffusion Inpainting of Neural Textures for One-Shot Human Avatars + http://openaccess.thecvf.com//content/ICCV2023/papers/Svitov_DINAR_Diffusion_Inpainting_of_Neural_Textures_for_One-Shot_Human_Avatars_ICCV_2023_paper.pdf + We present DINAR, an approach for creating realistic rigged full-body avatars from single RGB images. Similarly to previous works, our method uses neural textures combined with the SMPL-X body model to achieve photo-realistic quality of avatars while keeping them easy to animate and fast to infer. To restore the texture, we use a latent diffusion model and show how such a model can be trained in the neural texture space. The use of the diffusion model allows us to realistically reconstruct large unseen regions such as the back of a person given the frontal view. The models in our pipeline are trained using 2D images and videos only. In the experiments, our approach achieves state-of-the-art rendering quality and good generalization to new poses and viewpoints. In particular, the approach improves the state of the art on the SnapshotPeople public benchmark. + + + + ElasticViT: Conflict-aware Supernet Training for Deploying Fast Vision Transformer on Diverse Mobile Devices + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_ElasticViT_Conflict-aware_Supernet_Training_for_Deploying_Fast_Vision_Transformer_on_ICCV_2023_paper.pdf + Neural Architecture Search (NAS) has shown promising performance in the automatic design of vision transformers (ViT) exceeding 1G FLOPs. However, designing lightweight and low-latency ViT models for diverse mobile devices remains a big challenge.
In this work, we propose ElasticViT, a two-stage NAS approach that trains a high-quality ViT supernet over a very large search space for covering a wide range of mobile devices, and then searches an optimal sub-network (subnet) for direct deployment. However, current supernet training methods that rely on uniform sampling suffer from the gradient conflict issue: the sampled subnets can have vastly different model sizes (e.g., 50M vs. 2G FLOPs), leading to different optimization directions and inferior performance. To address this challenge, we propose two novel sampling techniques: complexity-aware sampling and performance-aware sampling. Complexity-aware sampling limits the FLOPs difference among the subnets sampled across adjacent training steps, while covering different-sized subnets in the search space. Performance-aware sampling further selects subnets that have good accuracy, which can reduce gradient conflicts and improve supernet quality. Our discovered models, ElasticViT models, achieve top-1 accuracy from 67.2% to 80.0% on ImageNet from 60M to 800M FLOPs without extra retraining, outperforming all prior CNNs and ViTs in terms of accuracy and latency. Our tiny and small models are also the first ViT models that surpass state-of-the-art CNNs with significantly lower latency on mobile devices. For instance, ElasticViT-S1 runs 2.62x faster than EfficientNet-B0 with 0.1% higher accuracy. + + + + Noise-Aware Learning from Web-Crawled Image-Text Data for Image Captioning + http://openaccess.thecvf.com//content/ICCV2023/papers/Kang_Noise-Aware_Learning_from_Web-Crawled_Image-Text_Data_for_Image_Captioning_ICCV_2023_paper.pdf + Image captioning is one of the straightforward tasks that can take advantage of large-scale web-crawled data which provides rich knowledge about the visual world for a captioning model. However, since web-crawled data contains image-text pairs that are aligned at different levels, the inherent noises (e.g., misaligned pairs) make it difficult to learn a precise captioning model. While the filtering strategy can effectively remove noisy data, it leads to a decrease in learnable knowledge and sometimes brings about a new problem of data deficiency. To take the best of both worlds, we propose a Noise-aware Captioning (NoC) framework, which learns rich knowledge from the whole web-crawled data while being less affected by the noises. This is achieved by the proposed alignment-level-controllable captioner, which is learned using alignment levels of the image-text pairs as a control signal during training. The alignment-level-conditioned training allows the model to generate high-quality captions by simply setting the control signal to the desired alignment level at inference time. An in-depth analysis shows the effectiveness of our framework in handling noise. With two tasks of zero-shot captioning and text-to-image retrieval using generated captions (i.e., self-retrieval), we also demonstrate our model can produce high-quality captions in terms of descriptiveness and distinctiveness. The code is available at https://github.com/kakaobrain/noc. + + + + Detecting Objects with Context-Likelihood Graphs and Graph Refinement + http://openaccess.thecvf.com//content/ICCV2023/papers/Bhowmik_Detecting_Objects_with_Context-Likelihood_Graphs_and_Graph_Refinement_ICCV_2023_paper.pdf + The goal of this paper is to detect objects by exploiting their interrelationships. 
Contrary to existing methods, which learn objects and relations separately, our key idea is to learn the object-relation distribution jointly. We first propose a novel way of creating a graphical representation of an image from inter-object relation priors and initial class predictions, we call a context-likelihood graph. We then learn the joint distribution with an energy-based modeling technique which allows to sample and refine the context-likelihood graph iteratively for a given image. Our formulation of jointly learning the distribution enables us to generate a more accurate graph representation of an image which leads to a better object detection performance. We demonstrate the benefits of our context-likelihood graph formulation and the energy-based graph refinement via experiments on the Visual Genome and MS-COCO datasets where we achieve a consistent improvement over object detectors like DETR and Faster-RCNN, as well as alternative methods modeling object interrelationships separately. Our method is detector agnostic, end-to-end trainable, and especially beneficial for rare object classes. + + + + Coarse-to-Fine Amodal Segmentation with Shape Prior + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Coarse-to-Fine_Amodal_Segmentation_with_Shape_Prior_ICCV_2023_paper.pdf + Amodal object segmentation is a challenging task that involves segmenting both visible and occluded parts of an object. In this paper, we propose a novel approach, called Coarse-to-Fine Segmentation (C2F-Seg), that addresses this problem by progressively modeling the amodal segmentation. C2F-Seg initially reduces the learning space from the pixel-level image space to the vector-quantized latent space. This enables us to better handle long-range dependencies and learn a coarse-grained amodal segment from visual features and visible segments. However, this latent space lacks detailed information about the object, which makes it difficult to provide a precise segmentation directly. To address this issue, we propose a convolution refine module to inject fine-grained information and provide a more precise amodal object segmentation based on visual features and coarse-predicted segmentation. To help the studies of amodal object segmentation, we create a synthetic amodal dataset, named as MOViD-Amodal (MOViD-A), which can be used for both image and video amodal object segmentation. We extensively evaluate our model on two benchmark datasets: KINS and COCO-A. Our empirical results demonstrate the superiority of C2F-Seg. Moreover, we exhibit the potential of our approach for video amodal object segmentation tasks on FISHBOWL and our proposed MOViD-A. Project page at: https://jianxgao.github.io/C2F-Seg. + + + + AdVerb: Visually Guided Audio Dereverberation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chowdhury_AdVerb_Visually_Guided_Audio_Dereverberation_ICCV_2023_paper.pdf + We present AdVerb, a novel audio-visual dereverberation framework that uses visual cues in addition to the reverberant sound to estimate clean audio. Although audio-only dereverberation is a well-studied problem, our approach incorporates the complementary visual modality to perform audio dereverberation. 
Given an image of the environment where the reverberated sound signal has been recorded, AdVerb employs a novel geometry-aware cross-modal transformer architecture that captures scene geometry and audio-visual cross-modal relationship to generate a complex ideal ratio mask, which, when applied to the reverberant audio predicts the clean sound. The effectiveness of our method is demonstrated through extensive quantitative and qualitative evaluations. Our approach significantly outperforms traditional audio-only and audio-visual baselines on three downstream tasks: speech enhancement, speech recognition, and speaker verification, with relative improvements in the range of 18% - 82% on the LibriSpeech test-clean set. We also achieve highly satisfactory RT60 error scores on the AVSpeech dataset. + + + + Open-vocabulary Object Segmentation with Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Open-vocabulary_Object_Segmentation_with_Diffusion_Models_ICCV_2023_paper.pdf + The goal of this paper is to extract the visual-language correspondence from a pre-trained text-to-image diffusion model, in the form of segmentation map, i.e., simultaneously generating images and segmentation masks for the corresponding visual entities described in the text prompt. We make the following contributions: (i) we pair the existing Stable Diffusion model with a novel grounding module, that can be trained to align the visual and textual embedding space of the diffusion model with only a small number of object categories; (ii) we establish an automatic pipeline for constructing a dataset, that consists of image, segmentation mask, text prompt triplets, to train the proposed grounding module; (iii) we evaluate the performance of open-vocabulary grounding on images generated from the text-to-image diffusion model and show that the module can well segment the objects of categories beyond seen ones at training time; (iv) we adopt the augmented diffusion model to build a synthetic semantic segmentation dataset, and show that, training a standard segmentation model on such dataset demonstrates competitive performance on the zero-shot segmentation (ZS3) benchmark, which opens up new opportunities for adopting the powerful diffusion model for discriminative tasks. + + + + With a Little Help from Your Own Past: Prototypical Memory Networks for Image Captioning + http://openaccess.thecvf.com//content/ICCV2023/papers/Barraco_With_a_Little_Help_from_Your_Own_Past_Prototypical_Memory_ICCV_2023_paper.pdf + Image captioning, like many tasks involving vision and language, currently relies on Transformer-based architectures for extracting the semantics in an image and translating it into linguistically coherent descriptions. Although successful, the attention operator only considers a weighted summation of projections of the current input sample, therefore ignoring the relevant semantic information which can come from the joint observation of other samples. In this paper, we devise a network which can perform attention over activations obtained while processing other training samples, through a prototypical memory model. Our memory models the distribution of past keys and values through the definition of prototype vectors which are both discriminative and compact. Experimentally, we assess the performance of the proposed model on the COCO dataset, in comparison with carefully designed baselines and state-of-the-art approaches, and by investigating the role of each of the proposed components. 
We demonstrate that our proposal can increase the performance of an encoder-decoder Transformer by 3.7 CIDEr points both when training in cross-entropy only and when fine-tuning with self-critical sequence training. Source code and trained models are available at: https://github.com/aimagelab/PMA-Net. + + + + PDiscoNet: Semantically consistent part discovery for fine-grained recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/van_der_Klis_PDiscoNet_Semantically_consistent_part_discovery_for_fine-grained_recognition_ICCV_2023_paper.pdf + Fine-grained classification often requires recognizing specific object parts, such as beak shape and wing patterns for birds. Encouraging a fine-grained classification model to first detect such parts and then using them to infer the class could help us gauge whether the model is indeed looking at the right details better than with interpretability methods that provide a single attribution map. We propose PDiscoNet to discover object parts by using only image-level class labels along with priors encouraging the parts to be: discriminative, compact, distinct from each other, equivariant to rigid transforms, and active in at least some of the images. In addition to using the appropriate losses to encode these priors, we propose to use part-dropout, where full part feature vectors are dropped at once to prevent a single part from dominating in the classification, and part feature vector modulation, which makes the information coming from each part distinct from the perspective of the classifier. Our results on CUB, CelebA, and PartImageNet show that the proposed method provides substantially better part discovery performance than previous methods while not requiring any additional hyper-parameter tuning and without penalizing the classification performance. + + + + How to Choose your Best Allies for a Transferable Attack? + http://openaccess.thecvf.com//content/ICCV2023/papers/Maho_How_to_Choose_your_Best_Allies_for_a_Transferable_Attack_ICCV_2023_paper.pdf + The transferability of adversarial examples is a key issue in the security of deep neural networks. The possibility of an adversarial example crafted for a source model fooling another targeted model makes the threat of adversarial attacks more realistic. Measuring transferability is a crucial problem, but the Attack Success Rate alone does not provide a sound evaluation. This paper proposes a new methodology for evaluating transferability by putting distortion in a central position. This new tool shows that transferable attacks may perform far worse than a black box attack if the attacker randomly picks the source model. To address this issue, we propose a new selection mechanism, called FiT, which aims at choosing the best source model with only a few preliminary queries to the target. Our experimental results show that FiT is highly effective at selecting the best source model for multiple scenarios such as single-model attacks, ensemble-model attacks and multiple attacks. + + + + Self-Supervised Object Detection from Egocentric Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Akiva_Self-Supervised_Object_Detection_from_Egocentric_Videos_ICCV_2023_paper.pdf + Understanding the visual world from the perspective of humans (egocentric) has been a long-standing challenge in computer vision. Egocentric videos exhibit high scene complexity and irregular motion flows compared to typical video understanding tasks. 
With the egocentric domain in mind, we address the problem of self-supervised, class-agnostic object detection, which aims to locate all objects in a given view, regardless of category, without any annotations or pre-training weights. Our method, self-supervised object Detection from Egocentric VIdeos (DEVI), generalizes appearance-based methods to learn features that are category-specific and invariant to viewing angles and illumination conditions from highly ambiguous environments in an end-to-end manner. Our approach leverages typical human behavior and its egocentric perception to sample diverse views of the same objects for our multi-view and scale-regression loss functions. With our learned cluster residual module, we are able to effectively describe multi-category patches for better complex scene understanding. DEVI provides a boost in performance on recent egocentric datasets, with performance gains up to 4.11% AP50, 0.11% AR1, 1.32% AR10, and 5.03% AR100, while significantly reducing model complexity. We also demonstrate competitive performance on out-of-domain datasets without additional training or fine-tuning. + + + + Cross Contrasting Feature Perturbation for Domain Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Cross_Contrasting_Feature_Perturbation_for_Domain_Generalization_ICCV_2023_paper.pdf + Domain generalization (DG) aims to learn a robust model from source domains that generalize well on unseen target domains. Recent studies focus on generating novel domain samples or features to diversify distributions complementary to source domains. Yet, these approaches can hardly deal with the restriction that the samples synthesized from various domains can cause semantic distortion. In this paper, we propose a Cross Contrasting Feature Perturbation (CCFP) framework to simulate domain shift by generating perturbed features in the latent space while regularizing the model prediction against domain shift. Different from the previous fixed synthesizing strategy, we design modules with learnable feature perturbations and semantic consistency constraints. In contrast to prior work, our method does not use any generative-based models or domain labels. We conduct extensive experiments on a standard DomainBed benchmark with a strict evaluation protocol for a fair comparison. Comprehensive experiments show that our method outperforms the previous state-of-the-art, and quantitative analyses illustrate that our approach can alleviate the domain shift problem in out-of-distribution (OOD) scenarios. + + + + DiffusionRet: Generative Text-Video Retrieval with Diffusion Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Jin_DiffusionRet_Generative_Text-Video_Retrieval_with_Diffusion_Model_ICCV_2023_paper.pdf + Existing text-video retrieval solutions are, in essence, discriminant models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we creatively tackle this task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating joint distribution from noise. 
During training, DiffusionRet is optimized from both the generation and discrimination perspectives, with the generator being optimized by generation loss and the feature extractor trained with contrastive loss. In this way, DiffusionRet cleverly leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo, with superior performances, justify the efficacy of our method. More encouragingly, without any modification, DiffusionRet even performs well in out-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code is available at https://github.com/jpthu17/DiffusionRet. + + + + Adversarial Finetuning with Latent Representation Constraint to Mitigate Accuracy-Robustness Tradeoff + http://openaccess.thecvf.com//content/ICCV2023/papers/Suzuki_Adversarial_Finetuning_with_Latent_Representation_Constraint_to_Mitigate_Accuracy-Robustness_Tradeoff_ICCV_2023_paper.pdf + This paper addresses the tradeoff between standard accuracy on clean examples and robustness against adversarial examples in deep neural networks (DNNs). Although adversarial training (AT) improves robustness, it degrades the standard accuracy, thus yielding the tradeoff. To mitigate this tradeoff, we propose a novel AT method called ARREST, which comprises three components: (i) adversarial finetuning (AFT), (ii) representation-guided knowledge distillation (RGKD), and (iii) noisy replay (NR). AFT trains a DNN on adversarial examples by initializing its parameters with a DNN that is standardly pretrained on clean examples. RGKD and NR respectively entail a regularization term and an algorithm to preserve latent representations of clean examples during AFT. RGKD penalizes the distance between the representations of the standardly pretrained and AFT DNNs. NR switches input adversarial examples to nonadversarial ones when the representation changes significantly during AFT. By combining these components, ARREST achieves both high standard accuracy and robustness. Experimental results demonstrate that ARREST mitigates the tradeoff more effectively than previous AT-based methods do. + + + + MULLER: Multilayer Laplacian Resizer for Vision + http://openaccess.thecvf.com//content/ICCV2023/papers/Tu_MULLER_Multilayer_Laplacian_Resizer_for_Vision_ICCV_2023_paper.pdf + Image resizing operation is a fundamental preprocessing module in modern computer vision. Throughout the deep learning revolution, researchers have overlooked the potential of alternative resizing methods beyond the commonly used resizers that are readily available, such as nearest-neighbors, bilinear, and bicubic. The key question of our interest is whether the front-end resizer affects the performance of deep vision models? In this paper, we present an extremely lightweight multilayer Laplacian resizer with only a handful of trainable parameters, dubbed MULLER resizer. MULLER has a bandpass nature in that it learns to boost details in certain frequency subbands that benefit the downstream recognition models. We show that MULLER can be easily plugged into various training pipelines, and it effectively boosts the performance of the underlying vision task with little to no extra cost. 
Specifically, we select a state-of-the-art vision Transformer, MaxViT, as the baseline, and show that, if trained with MULLER, MaxViT gains up to 0.6% top-1 accuracy and meanwhile enjoys a 36% inference cost saving to achieve similar top-1 accuracy on ImageNet-1k, as compared to the standard training scheme. Notably, MULLER's performance also scales with model size and training data size such as ImageNet-21k and JFT, and it is widely applicable to multiple vision tasks, including image classification, object detection and segmentation, as well as image quality assessment. The code is available at https://github.com/google-research/google-research/tree/master/muller. + + + + X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events + http://openaccess.thecvf.com//content/ICCV2023/papers/Dai_X-VoE_Measuring_eXplanatory_Violation_of_Expectation_in_Physical_Events_ICCV_2023_paper.pdf + Intuitive physics is pivotal for human understanding of the physical world, enabling prediction and interpretation of events even in infancy. Nonetheless, replicating this level of intuitive physics in artificial intelligence (AI) remains a formidable challenge. This study introduces X-VoE, a comprehensive benchmark dataset, to assess AI agents' grasp of intuitive physics. Built on the developmental psychology-rooted Violation of Expectation (VoE) paradigm, X-VoE establishes a higher bar for the explanatory capacities of intuitive physics models. Each VoE scenario within X-VoE encompasses three distinct settings, probing models' comprehension of events and their underlying explanations. Beyond model evaluation, we present an explanation-based learning system that captures physics dynamics and infers occluded object states solely from visual sequences, without explicit occlusion labels. Experimental outcomes highlight our model's alignment with human commonsense when tested against X-VoE. A remarkable feature is our model's ability to visually expound VoE events by reconstructing concealed scenes. Concluding, we discuss the findings' implications and outline future research directions. Through X-VoE, we catalyze the advancement of AI endowed with human-like intuitive physics capabilities. + + + + COOP: Decoupling and Coupling of Whole-Body Grasping Pose Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_COOP_Decoupling_and_Coupling_of_Whole-Body_Grasping_Pose_Generation_ICCV_2023_paper.pdf + Generating life-like whole-body human grasping has garnered significant attention in the field of computer graphics. Existing works have demonstrated the effectiveness of keyframe-guided motion generation frameworks, which focus on modeling the grasping motions of humans in a temporal sequence when the target objects are placed in front of them. However, the generated grasping poses of the human body in the key-frames are limited, failing to capture the full range of grasping poses that humans are capable of. To address this issue, we propose a novel framework called COOP (DeCOupling and COupling of Whole-Body GrasPing Pose Generation) to synthesize life-like whole-body poses that cover the widest range of human grasping capabilities. In this framework, we first decouple the whole-body pose into body pose and hand pose and model them separately, which allows us to pre-train the body model with out-of-domain data easily. Then, we couple these two generated body parts through a unified optimization algorithm.
Furthermore, we design a simple method to evaluate the generalization ability of models in generating grasping poses for objects placed at different positions. The experimental results demonstrate the efficacy and superiority of our method. COOP also holds great potential as a plug-and-play component for other domains in whole-body pose generation. Our models and code are available at https://github.com/zhengyanzhao1997/COOP. + + + + Model Calibration in Dense Classification with Adaptive Label Perturbation http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Model_Calibration_in_Dense_Classification_with_Adaptive_Label_Perturbation_ICCV_2023_paper.pdf For safety-related applications, it is crucial to produce trustworthy deep neural networks whose prediction is associated with confidence that can represent the likelihood of correctness for subsequent decision-making. Existing dense binary classification models are prone to being over-confident. To improve model calibration, we propose Adaptive Stochastic Label Perturbation (ASLP), which learns a unique label perturbation level for each training image. ASLP employs our proposed Self-Calibrating Binary Cross Entropy (SC-BCE) loss, which unifies label perturbation processes, including stochastic approaches (like DisturbLabel) and label smoothing, to correct calibration while maintaining classification rates. ASLP follows Maximum Entropy Inference of classic statistical mechanics to maximise prediction entropy with respect to missing information. It performs this while: (1) preserving classification accuracy on known data as a conservative solution, or (2) specifically improving the model calibration degree by minimising the gap between the prediction accuracy and the expected confidence of the target training label. Extensive results demonstrate that ASLP can significantly improve calibration degrees of dense binary classification models on both in-distribution and out-of-distribution data. + + + + + + Structure and Content-Guided Video Synthesis with Diffusion Models http://openaccess.thecvf.com//content/ICCV2023/papers/Esser_Structure_and_Content-Guided_Video_Synthesis_with_Diffusion_Models_ICCV_2023_paper.pdf Text-guided generative diffusion models unlock powerful image creation and editing tools. Recent approaches that edit the content of footage while retaining structure require expensive re-training for every input or rely on error-prone propagation of image edits across frames. In this work, we present a structure and content-guided video diffusion model that edits videos based on descriptions of the desired output. Conflicts between user-provided content edits and structure representations occur due to insufficient disentanglement between the two aspects. As a solution, we show that training on monocular depth estimates with varying levels of detail provides control over structure and content fidelity. A novel guidance method, enabled by joint video and image training, exposes explicit control over temporal consistency. Our experiments demonstrate a wide variety of successes: fine-grained control over output characteristics, customization based on a few reference images, and a strong user preference towards results by our model.
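The depth-as-structure conditioning described in the video synthesis entry above can be pictured with a small toy example. The sketch below is not the authors' architecture: the denoiser, the content embedding, and all tensor shapes are hypothetical stand-ins, and the diffusion noise schedule is omitted; it only illustrates concatenating per-frame depth (blurred to vary the level of structural detail) with noisy latents and injecting a pooled content embedding.

```python
# Minimal sketch (assumptions noted above, not the paper's code) of structure/content
# conditioning for a video denoiser: depth maps act as structure, a pooled embedding as content.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVideoDenoiser(nn.Module):
    """Placeholder denoiser: predicts noise from noisy latents + depth + content embedding."""
    def __init__(self, latent_ch=4, depth_ch=1, content_dim=64, hidden=32):
        super().__init__()
        self.conv = nn.Conv3d(latent_ch + depth_ch, hidden, kernel_size=3, padding=1)
        self.content_proj = nn.Linear(content_dim, hidden)
        self.out = nn.Conv3d(hidden, latent_ch, kernel_size=3, padding=1)

    def forward(self, noisy_latents, depth, content_emb):
        x = torch.cat([noisy_latents, depth], dim=1)                     # concat structure channel
        h = F.silu(self.conv(x))
        h = h + self.content_proj(content_emb)[:, :, None, None, None]  # inject content globally
        return self.out(h)

def vary_structure_detail(depth, blur_steps):
    # Coarser structure = more blurring of the depth maps (a proxy for "levels of detail").
    for _ in range(blur_steps):
        depth = F.avg_pool3d(depth, kernel_size=(1, 3, 3), stride=1, padding=(0, 1, 1))
    return depth

# One toy training step with random tensors standing in for encoded video and conditions.
B, C, T, H, W = 2, 4, 8, 16, 16
latents = torch.randn(B, C, T, H, W)
depth = torch.rand(B, 1, T, H, W)        # per-frame monocular depth (structure)
content = torch.randn(B, 64)             # pooled content embedding (e.g., from a text/image encoder)
noise = torch.randn_like(latents)
noisy = latents + noise                  # simplified forward process (no schedule)

model = TinyVideoDenoiser()
pred = model(noisy, vary_structure_detail(depth, blur_steps=2), content)
loss = F.mse_loss(pred, noise)           # standard noise-prediction objective
loss.backward()
print(float(loss))
```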
+ + + + Beyond Skin Tone: A Multidimensional Measure of Apparent Skin Color http://openaccess.thecvf.com//content/ICCV2023/papers/Thong_Beyond_Skin_Tone_A_Multidimensional_Measure_of_Apparent_Skin_Color_ICCV_2023_paper.pdf This paper strives to measure apparent skin color in computer vision, beyond a unidimensional scale on skin tone. In their seminal paper Gender Shades, Buolamwini and Gebru have shown how gender classification systems can be biased against women with darker skin tones. Subsequently, fairness researchers and practitioners have adopted the Fitzpatrick skin type classification as a common measure to assess skin color bias in computer vision systems. While effective, the Fitzpatrick scale only focuses on skin tone, ranging from light to dark. Towards a more comprehensive measure of skin color, we introduce the hue angle ranging from red to yellow. When applied to images, the hue dimension reveals additional biases related to skin color in both computer vision datasets and models. We then recommend multidimensional skin color scales, relying on both skin tone and hue, for fairness assessments. + + + + NeILF++: Inter-Reflectable Light Fields for Geometry and Material Estimation http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_NeILF_Inter-Reflectable_Light_Fields_for_Geometry_and_Material_Estimation_ICCV_2023_paper.pdf We present a novel differentiable rendering framework for joint geometry, material, and lighting estimation from multi-view images. In contrast to previous methods which assume a simplified environment map or co-located flashlights, in this work, we formulate the lighting of a static scene as one neural incident light field (NeILF) and one outgoing neural radiance field (NeRF). The key insight of the proposed method is the union of the incident and outgoing light fields through physically-based rendering and inter-reflections between surfaces, making it possible to disentangle the scene geometry, material, and lighting from image observations in a physically-based manner. The proposed incident light and inter-reflection framework can be easily applied to other NeRF systems. We show that our method can not only decompose the outgoing radiance into incident lights and surface materials, but also serve as a surface refinement module that further improves the reconstruction detail of the neural surface. We demonstrate on several datasets that the proposed method is able to achieve state-of-the-art results in terms of the geometry reconstruction quality, material estimation accuracy, and the fidelity of novel view rendering. + + + + MAGI: Multi-Annotated Explanation-Guided Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_MAGI_Multi-Annotated_Explanation-Guided_Learning_ICCV_2023_paper.pdf Explanation supervision is a technique in which the model is guided by human-generated explanations during training. This technique aims to improve the predictability of the model by incorporating human understanding of the prediction process into the training phase. This is a challenging task since it relies on the accuracy of human annotation labels. To obtain high-quality explanation annotations, using multiple annotations for explanation supervision is a reasonable approach.
However, how to use multiple annotations to improve accuracy is particularly challenging due to the following: 1) The noisiness of annotations from different annotators; 2) The lack of pre-given information about the corresponding relationship between annotations and annotators; 3) Missing annotations since some images are not labeled by all annotators. To solve these challenges, we propose a Multi-annotated explanation-guided learning (MAGI) framework to do explanation supervision with comprehensive and high-quality generated annotations. We first propose a novel generative model to generate annotations from all annotators and infer them using a newly proposed variational inference-based technique by learning the characteristics of each annotator. We also incorporate an alignment mechanism into the generative model to infer the correspondence between annotations and annotators in the training process. Extensive experiments on two datasets from the medical imaging domain demonstrate the effectiveness of our proposed framework in handling noisy annotations while obtaining superior prediction performance compared with previous SOTA. + + + + Adaptive Positional Encoding for Bundle-Adjusting Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Adaptive_Positional_Encoding_for_Bundle-Adjusting_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Neural Radiance Fields have shown great potential to synthesize novel views with only a few discrete image observations of the world. However, the requirement of accurate camera parameters to learn scene representations limits its further application. In this paper, we present adaptive positional encoding (APE) for bundle-adjusting neural radiance fields to reconstruct the neural radiance fields from unknown camera poses (or even intrinsics). Inspired by Fourier series regression, we investigate its relationship with the positional encoding method and therefore propose APE where all frequency bands are trainable. Furthermore, we introduce period-activated multilayer perceptrons (PMLPs) to construct the implicit network for the high-order scene representations and fine-grain gradients during backpropagation. Experimental results on public datasets demonstrate that the proposed method with APE and PMLPs can outperform the state-of-the-art methods in accurate camera poses and high-fidelity view synthesis. + + + + Inducing Neural Collapse to a Fixed Hierarchy-Aware Frame for Reducing Mistake Severity + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_Inducing_Neural_Collapse_to_a_Fixed_Hierarchy-Aware_Frame_for_Reducing_ICCV_2023_paper.pdf + There is a recently discovered and intriguing phenomenon called Neural Collapse: at the terminal phase of training a deep neural network for classification, the within-class penultimate feature means and the associated classifier vectors of all flat classes collapse to the vertices of a simplex Equiangular Tight Frame (ETF). Recent work has tried to exploit this phenomenon by fixing the related classifier weights to a pre-computed ETF to induce neural collapse and maximize the separation of the learned features when training with imbalanced data. In this work, we propose to fix the linear classifier of a deep neural network to a Hierarchy-Aware Frame (HAFrame), instead of an ETF, and use a cosine similarity-based auxiliary loss to learn hierarchy-aware penultimate features that collapse to the HAFrame. 
We demonstrate that our approach reduces the mistake severity of the model's predictions while maintaining its top-1 accuracy on several datasets of varying scales with hierarchies of heights ranging from 3 to 12. Code: https://github.com/ltong1130ztr/HAFrame. + + + + Factorized Inverse Path Tracing for Efficient and Accurate Material-Lighting Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Factorized_Inverse_Path_Tracing_for_Efficient_and_Accurate_Material-Lighting_Estimation_ICCV_2023_paper.pdf + Inverse path tracing has recently been applied to joint material and lighting estimation, given geometry and multi-view HDR observations of an indoor scene. However, it has two major limitations: path tracing is expensive to compute, and ambiguities exist between reflection and emission. Our Factorized Inverse Path Tracing (FIPT) addresses these challenges by using a factored light transport formulation and finds emitters driven by rendering errors. Our algorithm enables accurate material and lighting optimization faster than previous work, and is more effective at resolving ambiguities. The exhaustive experiments on synthetic scenes show that our method (1) outperforms state-of-the-art indoor inverse rendering and relighting methods particularly in the presence of complex illumination effects; (2) speeds up inverse path tracing optimization to less than an hour. We further demonstrate robustness to noisy inputs through material and lighting estimates that allow plausible relighting in a real scene. The source code is available at: https://github.com/lwwu2/fipt + + + + Overwriting Pretrained Bias with Finetuning Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Overwriting_Pretrained_Bias_with_Finetuning_Data_ICCV_2023_paper.pdf + Transfer learning is beneficial by allowing the expressive features of models pretrained on large-scale datasets to be finetuned for the target task of smaller, more domain-specific datasets. However, there is a concern that these pretrained models may come with their own biases which would propagate into the finetuned model. In this work, we investigate bias when conceptualized as both spurious correlations between the target task and a sensitive attribute as well as underrepresentation of a particular group in the dataset. Under both notions of bias, we find that (1) models finetuned on top of pretrained models can indeed inherit their biases, but (2) this bias can be corrected for through relatively minor interventions to the finetuning dataset, and often with a negligible impact to performance. Our findings imply that careful curation of the finetuning dataset is important for reducing biases on a downstream task, and doing so can even compensate for bias in the pretrained model. + + + + Anti-DreamBooth: Protecting Users from Personalized Text-to-image Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Van_Le_Anti-DreamBooth_Protecting_Users_from_Personalized_Text-to-image_Synthesis_ICCV_2023_paper.pdf + Text-to-image diffusion models are nothing but a revolution, allowing anyone, even without design skills, to create realistic images from simple text inputs. With powerful personalization tools like DreamBooth, they can generate images of a specific person just by learning from his/her few reference images. However, when misused, such a powerful and convenient tool can produce fake news or disturbing content targeting any individual victim, posing a severe negative social impact. 
In this paper, we explore a defense system called Anti-DreamBooth against such malicious use of DreamBooth. The system aims to add subtle noise perturbation to each user's image before publishing in order to disrupt the generation quality of any DreamBooth model trained on these perturbed images. We investigate a wide range of algorithms for perturbation optimization and extensively evaluate them on two facial datasets over various text-to-image model versions. Despite the complicated formulation of DreamBooth and Diffusion-based text-to-image models, our methods effectively defend users from the malicious use of those models. Their effectiveness withstands even adverse conditions, such as model or prompt/term mismatching between training and testing. Our code will be available at https://github.com/VinAIResearch/Anti-DreamBooth + + + + Contrastive Continuity on Augmentation Stability Rehearsal for Continual Self-Supervised Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_Contrastive_Continuity_on_Augmentation_Stability_Rehearsal_for_Continual_Self-Supervised_Learning_ICCV_2023_paper.pdf Self-supervised learning, which is able to learn powerful representations without any manual annotations, has attracted a lot of attention recently. However, self-supervised learning needs to develop the ability to continuously learn to cope with a variety of real-world challenges, i.e., Continual Self-Supervised Learning (CSSL). Catastrophic forgetting is a notorious problem in CSSL, where the model tends to forget the learned knowledge. In practice, simple rehearsal or regularization will bring extra negative effects while alleviating catastrophic forgetting in CSSL, e.g., overfitting on the rehearsal samples or hindering the model from encoding fresh information. In order to address catastrophic forgetting without overfitting on the rehearsal samples, we propose Augmentation Stability Rehearsal (ASR) in this paper, which selects the most representative and discriminative samples by estimating the augmentation stability for rehearsal. Meanwhile, we design a matching strategy for ASR to dynamically update the rehearsal buffer. In addition, we further propose Contrastive Continuity on Augmentation Stability Rehearsal (C2ASR) based on ASR. We show that C2ASR is an upper bound of the Information Bottleneck (IB) principle, which suggests that C2ASR essentially preserves as much information shared among seen task streams as possible to prevent catastrophic forgetting and dismisses the redundant information between previous task streams and the current task stream to free up the ability to encode fresh information. Our method achieves substantial improvements over state-of-the-art CSSL methods on a variety of CSSL benchmarks. + + + + Treating Pseudo-labels Generation as Image Matting for Weakly Supervised Semantic Segmentation http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Treating_Pseudo-labels_Generation_as_Image_Matting_for_Weakly_Supervised_Semantic_ICCV_2023_paper.pdf Generating accurate pseudo-labels under the supervision of image categories is a crucial step in Weakly Supervised Semantic Segmentation (WSSS). In this work, we propose a Mat-Label pipeline that provides a fresh way to treat WSSS pseudo-labels generation as an image matting task. By taking as input a trimap that specifies the foreground, background, and unknown regions, the image matting task outputs an object mask with fine edges.
The intuition behind our Mat-Label is that generating a trimap is much easier than generating pseudo-labels directly under the weakly supervised setting. Although current CAM-based methods are off-the-shelf solutions for generating a trimap, they suffer from cross-category and foreground-background pixel prediction confusion. To solve this problem, we develop a Double Decoupled Class Activation Map (D2CAM) for Mat-Label to generate a high-quality trimap. By drawing on the idea of metric learning, we explicitly model the class activation map with category decoupling and foreground-background decoupling. We also design two simple yet effective refinement constraints for D2CAM to stabilize optimization and eliminate non-exclusive activation. Extensive experiments validate that our Mat-Label achieves substantial and consistent performance gains compared to current state-of-the-art WSSS approaches. Our code is available in the supplementary material. + + + + UMFuse: Unified Multi View Fusion for Human Editing Applications http://openaccess.thecvf.com//content/ICCV2023/papers/Jain_UMFuse_Unified_Multi_View_Fusion_for_Human_Editing_Applications_ICCV_2023_paper.pdf Numerous pose-guided human editing methods have been explored by the vision community due to their extensive practical applications. However, most of these methods still use an image-to-image formulation in which a single image is given as input to produce an edited image as output. This objective becomes ill-defined in cases when the target pose differs significantly from the input pose. Existing methods then resort to in-painting or style transfer to handle occlusions and preserve content. In this paper, we explore the utilization of multiple views to minimize the issue of missing information and generate an accurate representation of the underlying human model. To fuse knowledge from multiple viewpoints, we design a multi-view fusion network that takes the pose key points and texture from multiple source images and generates an explainable per-pixel appearance retrieval map. Thereafter, the encodings from a separate network (trained on a single-view human reposing task) are merged in the latent space. This enables us to generate accurate, precise, and visually coherent images for different editing tasks. We show the application of our network on two newly proposed tasks - Multi-view human reposing and Mix&Match Human Image generation. Additionally, we study the limitations of single-view editing and scenarios in which multi-view provides a better alternative. + + + + CROSSFIRE: Camera Relocalization On Self-Supervised Features from an Implicit Representation http://openaccess.thecvf.com//content/ICCV2023/papers/Moreau_CROSSFIRE_Camera_Relocalization_On_Self-Supervised_Features_from_an_Implicit_Representation_ICCV_2023_paper.pdf Beyond novel view synthesis, Neural Radiance Fields are useful for applications that interact with the real world. In this paper, we use them as an implicit map of a given scene and propose a camera relocalization algorithm tailored for this representation. The proposed method makes it possible to compute, in real time, the precise position of a device using a single RGB camera during its navigation. In contrast with previous work, we do not rely on pose regression or photometric alignment but rather use dense local features obtained through volumetric rendering, which are specialized to the scene with a self-supervised objective.
As a result, our algorithm is more accurate than competitors, is able to operate in dynamic outdoor environments with changing lighting conditions, and can be readily integrated into any volumetric neural renderer. + + + + Unmasking Anomalies in Road-Scene Segmentation http://openaccess.thecvf.com//content/ICCV2023/papers/Nandan_Unmasking_Anomalies_in_Road-Scene_Segmentation_ICCV_2023_paper.pdf Anomaly segmentation is a critical task for driving applications, and it is approached traditionally as a per-pixel classification problem. However, reasoning individually about each pixel without considering their contextual semantics results in high uncertainty around the objects' boundaries and numerous false positives. We propose a paradigm change by shifting from a per-pixel classification to a mask classification. Our mask-based method, Mask2Anomaly, demonstrates the feasibility of integrating an anomaly detection method in a mask-classification architecture. Mask2Anomaly includes several technical novelties that are designed to improve the detection of anomalies in masks: i) a global masked attention module to focus individually on the foreground and background regions; ii) a mask contrastive learning that maximizes the margin between an anomaly and known classes; and iii) a mask refinement solution to reduce false positives. Mask2Anomaly achieves new state-of-the-art results across a range of benchmarks, both in the per-pixel and component-level evaluations. In particular, Mask2Anomaly reduces the average false positive rate by 60% with respect to the previous state-of-the-art. + + + + Self-Calibrated Cross Attention Network for Few-Shot Segmentation http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Self-Calibrated_Cross_Attention_Network_for_Few-Shot_Segmentation_ICCV_2023_paper.pdf The key to the success of few-shot segmentation (FSS) lies in how to effectively utilize support samples. Most solutions compress support foreground (FG) features into prototypes, but lose some spatial details. Instead, others use cross attention to fuse query features with uncompressed support FG. Query FG is safely fused with support FG; however, query background (BG) cannot find matched BG features in support FG, yet it inevitably integrates dissimilar features. Besides, as both query FG and BG are combined with support FG, they get entangled, thereby leading to ineffective segmentation. To cope with these issues, we design a self-calibrated cross attention (SCCA) block. For efficient patch-based attention, query and support features are first split into patches. Then, we design a patch alignment module to align each query patch with its most similar support patch for better cross attention. Specifically, SCCA takes a query patch as Q, and groups the patches from the same query image and the aligned patches from the support image as K&V. In this way, the query BG features are fused with matched BG features (from query patches), and thus the aforementioned issues will be mitigated. Moreover, when calculating SCCA, we design a scaled-cosine mechanism to better utilize the support features for similarity calculation. Extensive experiments conducted on PASCAL-5^i and COCO-20^i demonstrate the superiority of our model, e.g., the mIoU score under the 5-shot setting on COCO-20^i is more than 5.6% better than the previous state-of-the-art. The code is available at https://github.com/Sam1224/SCCAN.
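As a rough illustration of the scaled-cosine cross attention described in the SCCAN entry above, the sketch below shows query patch tokens attending to keys and values grouped from same-image patches and already aligned support patches, with cosine-normalized logits and a learnable scale. The patch alignment module and the surrounding few-shot pipeline are omitted, and the class and parameter names are placeholders rather than the authors' code.

```python
# Minimal sketch (under simplifying assumptions) of scaled-cosine cross attention:
# cosine similarity replaces dot-product logits, scaled by a learnable temperature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledCosineCrossAttention(nn.Module):
    def __init__(self, dim, init_scale=10.0):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.scale = nn.Parameter(torch.tensor(init_scale))  # learnable temperature

    def forward(self, query_tokens, aligned_support_tokens):
        # Group K&V from the query image itself and from the aligned support patches.
        kv = torch.cat([query_tokens, aligned_support_tokens], dim=1)
        q = F.normalize(self.q(query_tokens), dim=-1)
        k = F.normalize(self.k(kv), dim=-1)
        v = self.v(kv)
        attn = torch.softmax(self.scale * q @ k.transpose(-2, -1), dim=-1)  # cosine logits
        return attn @ v

# Toy usage: 2 images, 16 query-patch tokens, 16 aligned support-patch tokens, dim 32.
blk = ScaledCosineCrossAttention(dim=32)
out = blk(torch.randn(2, 16, 32), torch.randn(2, 16, 32))
print(out.shape)  # torch.Size([2, 16, 32])
```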
+ + + + Learning Global-aware Kernel for Image Harmonization http://openaccess.thecvf.com//content/ICCV2023/papers/Shen_Learning_Global-aware_Kernel_for_Image_Harmonization_ICCV_2023_paper.pdf Image harmonization aims to solve the visual inconsistency problem in composited images by adaptively adjusting the foreground pixels with the background as references. Existing methods employ local color transformation or region matching between foreground and background, which neglects the powerful proximity prior and distinguishes the foreground and background independently as whole parts for harmonization. As a result, they still show limited performance across varied foreground objects and scenes. To address this issue, we propose a novel Global-aware Kernel Network (GKNet) to harmonize local regions with comprehensive consideration of long-distance background references. Specifically, GKNet includes two parts, i.e., harmony kernel prediction and harmony kernel modulation branches. The former includes a Long-distance Reference Extractor (LRE) to obtain long-distance context and Kernel Prediction Blocks (KPB) to predict multi-level harmony kernels by fusing global information with local features. To achieve this goal, a novel Selective Correlation Fusion (SCF) module is proposed to better select relevant long-distance background references for local harmonization. The latter employs the predicted kernels to harmonize foreground regions with both local and global awareness. Abundant experiments demonstrate the superiority of our method for image harmonization over state-of-the-art methods, e.g., achieving a PSNR of 39.53 dB, surpassing the best counterpart by +0.78 dB, and decreasing fMSE by 11.5% and MSE by 6.7% compared with the SoTA method. + + + + Chordal Averaging on Flag Manifolds and Its Applications http://openaccess.thecvf.com//content/ICCV2023/papers/Mankovich_Chordal_Averaging_on_Flag_Manifolds_and_Its_Applications_ICCV_2023_paper.pdf This paper presents a new, provably-convergent algorithm for computing the flag-mean and flag-median of a set of points on a flag manifold under the chordal metric. The flag manifold is a mathematical space consisting of flags, which are sequences of nested subspaces of a vector space that increase in dimension. The flag manifold is a superset of a wide range of known matrix spaces, including Stiefel and Grassmannian manifolds, making it a general object that is useful in a wide variety of computer vision problems. To tackle the challenge of computing first order flag statistics, we first transform the problem into one that involves auxiliary variables constrained to the Stiefel manifold. The Stiefel manifold is a space of orthogonal frames, and leveraging the numerical stability and efficiency of Stiefel-manifold optimization enables us to compute the flag-mean effectively. Through a series of experiments, we show the competence of our method in Grassmann and rotation averaging, as well as principal component analysis. + + + + Towards Building More Robust Models with Frequency Bias http://openaccess.thecvf.com//content/ICCV2023/papers/Bu_Towards_Building_More_Robust_Models_with_Frequency_Bias_ICCV_2023_paper.pdf The vulnerability of deep neural networks to adversarial samples has been a major impediment to their broad applications, despite their success in various fields. Recently, some works have suggested that adversarially-trained models emphasize the importance of low-frequency information to achieve higher robustness.
While several attempts have been made to leverage this frequency characteristic, they have all faced the issue that applying low-pass filters directly to input images leads to irreversible loss of discriminative information and poor generalizability to datasets with distinct frequency features. This paper presents a plug-and-play module called the Frequency Preference Control Module that adaptively reconfigures the low- and high-frequency components of intermediate feature representations, providing better utilization of frequency in robust learning. Empirical studies show that our proposed module can be easily incorporated into any adversarial training framework, further improving model robustness across different architectures and datasets. Additionally, experiments were conducted to examine how the frequency bias of robust models impacts the adversarial training process and its final robustness, revealing interesting insights. + + + + PolicyCleanse: Backdoor Detection and Mitigation for Competitive Reinforcement Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_PolicyCleanse_Backdoor_Detection_and_Mitigation_for_Competitive_Reinforcement_Learning_ICCV_2023_paper.pdf + While real-world applications of reinforcement learning (RL) are becoming popular, the security and robustness of RL systems are worthy of more attention and exploration. In particular, recent works have revealed that, in a multi-agent RL environment, backdoor trigger actions can be injected into a victim agent (a.k.a. Trojan agent), which can result in a catastrophic failure as soon as it sees the backdoor trigger action. To ensure the security of RL agents against malicious backdoors, in this work, we propose the problem of Backdoor Detection in multi-agent RL systems, with the objective of detecting Trojan agents as well as the corresponding potential trigger actions, and further trying to mitigate their bad impact. In order to solve this problem, we propose PolicyCleanse that is based on the property that the activated Trojan agent's accumulated rewards degrade noticeably after several timesteps. Along with PolicyCleanse, we also design a machine unlearning-based approach that can effectively mitigate the detected backdoor. Extensive experiments demonstrate that the proposed methods can accurately detect Trojan agents, and outperform existing backdoor mitigation baseline approaches by at least 3% in winning rate across various types of agents and environments. + + + + Ref-NeuS: Ambiguity-Reduced Neural Implicit Surface Learning for Multi-View Reconstruction with Reflection + http://openaccess.thecvf.com//content/ICCV2023/papers/Ge_Ref-NeuS_Ambiguity-Reduced_Neural_Implicit_Surface_Learning_for_Multi-View_Reconstruction_with_ICCV_2023_paper.pdf + Neural implicit surface learning has shown significant progress in multi-view 3D reconstruction, where an object is represented by multilayer perceptrons that provide continuous implicit surface representation and view-dependent radiance. However, current methods often fail to accurately reconstruct reflective surfaces, leading to severe ambiguity. To overcome this issue, we propose Ref-NeuS, which aims to reduce ambiguity by attenuating the effect of reflective surfaces. Specifically, we utilize an anomaly detector to estimate an explicit reflection score with the guidance of multi-view context to localize reflective surfaces. 
Afterward, we design a reflection-aware photometric loss that adaptively reduces ambiguity by modeling rendered color as a Gaussian distribution, with the reflection score representing the variance. We show that together with a reflection direction-dependent radiance, our model achieves high-quality surface reconstruction on reflective surfaces and outperforms the state-of-the-art by a large margin. Our model also remains comparable on general surfaces. + + + + Class-incremental Continual Learning for Instance Segmentation with Image-level Weak Supervision http://openaccess.thecvf.com//content/ICCV2023/papers/Hsieh_Class-incremental_Continual_Learning_for_Instance_Segmentation_with_Image-level_Weak_Supervision_ICCV_2023_paper.pdf Instance segmentation requires labor-intensive manual labeling of the contours of complex objects in images for training. The labels can also be provided incrementally in practice to balance the human labor in different time steps. However, research on incremental learning for instance segmentation with only weak labels is still lacking. In this paper, we propose a continual-learning method to segment object instances from image-level labels. Unlike most weakly-supervised instance segmentation (WSIS) methods, which rely on traditional object proposals, we transfer the semantic knowledge from weakly-supervised semantic segmentation (WSSS) to WSIS to generate instance cues. To address the background shift problem in continual learning, we employ the old class segmentation results generated by the previous model to provide more reliable semantic and peak hypotheses. To our knowledge, this is the first work on weakly-supervised continual learning for instance segmentation of images. Experimental results show that our method can achieve better performance on Pascal VOC and COCO datasets under various incremental settings. + + + + When Prompt-based Incremental Learning Does Not Meet Strong Pretraining http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_When_Prompt-based_Incremental_Learning_Does_Not_Meet_Strong_Pretraining_ICCV_2023_paper.pdf Incremental learning aims to overcome catastrophic forgetting when learning deep networks from sequential tasks. With impressive learning efficiency and performance, prompt-based methods adapt a fixed backbone to sequential tasks by learning task-specific prompts. However, existing prompt-based methods heavily rely on strong pretraining (typically trained on ImageNet-21k), and we find that their models could be trapped if the potential gap between the pretraining task and unknown future tasks is large. In this work, we develop a learnable Adaptive Prompt Generator (APG). The key is to unify the prompt retrieval and prompt learning processes into a learnable prompt generator. Hence, the whole prompting process can be optimized to reduce the negative effects of the gap between tasks effectively. To make our APG avoid learning ineffective knowledge, we maintain a knowledge pool to regularize APG with the feature distribution of each class. Extensive experiments show that our method significantly outperforms advanced methods in exemplar-free incremental learning without (strong) pretraining. Besides, under strong pretraining, our method also has comparable performance to existing prompt-based models, showing that our method can still benefit from pretraining.
Code can be found at https://github.com/TOM-tym/APG + + + + Exploring Transformers for Open-world Instance Segmentation http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Exploring_Transformers_for_Open-world_Instance_Segmentation_ICCV_2023_paper.pdf Open-world instance segmentation is a rising task, which aims to segment all objects in the image by learning from a limited number of base-category objects. This task is challenging, as the number of unseen categories could be hundreds of times larger than that of seen categories. Recently, DETR-like models have been extensively studied in the closed world while remaining unexplored in the open world. In this paper, we utilize the Transformer for open-world instance segmentation and present SWORD. Firstly, we propose attaching the stop-gradient operation before the classification head and further adding IoU heads for discovering novel objects. We demonstrate that a simple stop-gradient operation not only prevents the novel objects from being suppressed as background, but also allows the network to enjoy the merit of heuristic label assignment. Secondly, we propose a novel contrastive learning framework to enlarge the representations between objects and background. Specifically, we maintain a universal object queue to obtain the object center, and dynamically select positive and negative samples from the object queries for contrastive learning. While previous works only focus on pursuing average recall and neglect average precision, we show the prominence of SWORD by giving consideration to both criteria. Our models achieve state-of-the-art performance in various open-world cross-category and cross-dataset generalizations. Particularly, in the VOC to non-VOC setup, our method sets new state-of-the-art results of 40.0% on ARb100 and 34.9% on ARm100. For COCO to UVO generalization, SWORD significantly outperforms the previous best open-world model by 5.9% on APm and 8.1% on ARm100, respectively. + + + + SSF: Accelerating Training of Spiking Neural Networks with Stabilized Spiking Flow http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_SSF_Accelerating_Training_of_Spiking_Neural_Networks_with_Stabilized_Spiking_ICCV_2023_paper.pdf Surrogate gradient (SG) is one of the most effective approaches for training spiking neural networks (SNNs). While assisting SNNs to achieve classification performance comparable to artificial neural networks, SG suffers from the problem of time-consuming training, which prevents efficient learning. In this paper, we formally analyze the backward process of classic SG and find that the membrane accumulation through time leads to exponential growth of training time. With this discovery, we propose Stabilized Spiking Flow (SSF), a simple yet effective approach to accelerate training of SG-based SNNs. For each spiking neuron, SSF averages its input and output activations over time to yield stabilized input and output, respectively. Then, instead of back-propagating all errors that are related to the current neuron and inherently entangled in the time domain, the auxiliary gradient is directly propagated from the stabilized output to the input through a devised relationship mapping. Additionally, the SSF method is applicable to different neuron models. Extensive experiments on both static and neuromorphic datasets demonstrate that SNNs trained with the SSF approach can achieve performance comparable to the original counterparts, while reducing the training time significantly.
In particular, SSF speeds up the training process of state-of-the-art SNN models by up to 10x when the number of time steps equals 80. + + + + Manipulate by Seeing: Creating Manipulation Controllers from Pre-Trained Representations http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Manipulate_by_Seeing_Creating_Manipulation_Controllers_from_Pre-Trained_Representations_ICCV_2023_paper.pdf The field of visual representation learning has seen explosive growth in the past years, but its benefits in robotics have been surprisingly limited so far. Prior work uses generic visual representations as a basis to learn (task-specific) robot action policies (e.g., via behavior cloning). While the visual representations do accelerate learning, they are primarily used to encode visual observations. Thus, action information has to be derived purely from robot data, which is expensive to collect! In this work, we present a scalable alternative where the visual representations can help directly infer robot actions. We observe that vision encoders express relationships between image observations as distances (e.g., via embedding dot product) that could be used to efficiently plan robot behavior. We operationalize this insight and develop a simple algorithm for acquiring a distance function and dynamics predictor, by fine-tuning a pre-trained representation on human collected video sequences. The final method is able to substantially outperform traditional robot learning baselines (e.g., 70% success vs. 50% for behavior cloning on pick-place) on a suite of diverse real-world manipulation tasks. It can also generalize to novel objects, without using any robot demonstrations during training time. For visualizations of the learned policies, please check: https://agi-labs.github.io/manipulate-by-seeing/ + + + + Learning Human-Human Interactions in Images from Weak Textual Supervision http://openaccess.thecvf.com//content/ICCV2023/papers/Alper_Learning_Human-Human_Interactions_in_Images_from_Weak_Textual_Supervision_ICCV_2023_paper.pdf Interactions between humans are diverse and context-dependent, but previous works have treated them as categorical, disregarding the heavy tail of possible interactions. We propose a new paradigm of learning human-human interactions as free text from a single still image, allowing for flexibility in modeling the unlimited space of situations and relationships between people. To overcome the absence of data labelled specifically for this task, we use knowledge distillation applied to synthetic caption data produced by a large language model without explicit supervision. We show that the pseudo-labels produced by this procedure can be used to train a captioning model to effectively understand human-human interactions in images, as measured by a variety of metrics capturing textual and semantic faithfulness and the factual groundedness of our predictions. We further show that our approach outperforms SOTA image captioning and situation recognition models on this task. We will release our code and pseudo-labels along with Waldo and Wenda, a manually-curated test set for still image human-human interaction understanding.
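At training time, the pseudo-label distillation recipe described in the human-human interaction entry above reduces to ordinary teacher-forced captioning on LLM-generated captions. The sketch below uses toy placeholders for the image encoder output, tokenizer, and vocabulary; it is meant only to show the shape of such a training step, not the authors' model.

```python
# Minimal sketch (toy placeholders, not the paper's architecture) of training a small
# captioning head on pseudo-captions produced offline by a large language model.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, PAD = 1000, 64, 0

class TinyCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM, padding_idx=PAD)
        layer = nn.TransformerDecoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, image_feats, caption_ids):
        # image_feats: (B, N, DIM) visual tokens; caption_ids: (B, L) pseudo-caption tokens.
        tgt = self.embed(caption_ids)
        L = caption_ids.size(1)
        causal = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)  # causal mask
        h = self.decoder(tgt, image_feats, tgt_mask=causal)
        return self.lm_head(h)

model = TinyCaptioner()
image_feats = torch.randn(2, 9, DIM)               # stand-in for frozen image-encoder output
pseudo_caption = torch.randint(1, VOCAB, (2, 12))  # stand-in for LLM-generated pseudo-labels
logits = model(image_feats, pseudo_caption[:, :-1])            # predict the next token
loss = F.cross_entropy(logits.reshape(-1, VOCAB),
                       pseudo_caption[:, 1:].reshape(-1), ignore_index=PAD)
loss.backward()
print(float(loss))
```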
+ + + + Prototype Reminiscence and Augmented Asymmetric Knowledge Aggregation for Non-Exemplar Class-Incremental Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_Prototype_Reminiscence_and_Augmented_Asymmetric_Knowledge_Aggregation_for_Non-Exemplar_Class-Incremental_ICCV_2023_paper.pdf Non-exemplar class-incremental learning (NECIL) requires deep models to maintain existing knowledge while continuously learning new classes without saving old class samples. In NECIL methods, prototypical representations are usually stored, which inject information from former classes to resist catastrophic forgetting in subsequent incremental learning. However, since the model continuously learns new knowledge, the stored prototypical representations cannot correctly model the properties of old classes in the presence of knowledge updates. To address this problem, we propose a novel prototype reminiscence mechanism that incorporates the previous class prototypes with arriving new class features to dynamically reshape old class feature distributions thus preserving the decision boundaries of previous tasks. In addition, to improve the model generalization on both newly arriving classes and old classes, we contribute an augmented asymmetric knowledge aggregation approach, which aggregates the overall knowledge of the current task and extracts the valuable knowledge of the past tasks, on top of self-supervised label augmentation. Experimental results on three benchmarks suggest the superior performance of our approach over the SOTA methods. + + + + Exemplar-Free Continual Transformer with Convolutions http://openaccess.thecvf.com//content/ICCV2023/papers/Roy_Exemplar-Free_Continual_Transformer_with_Convolutions_ICCV_2023_paper.pdf Continual Learning (CL) involves training a machine learning model in a sequential manner to learn new information while retaining previously learned tasks without the presence of previous training data. Although there has been significant interest in CL, most recent CL approaches in computer vision have focused on convolutional architectures only. However, with the recent success of vision transformers, there is a need to explore their potential for CL. Although there have been some recent CL approaches for vision transformers, they either store training instances of previous tasks or require a task identifier during test time, which can be limiting. This paper proposes a new exemplar-free approach for class/task incremental learning called ConTraCon, which does not require the task-id to be explicitly present during inference and avoids the need for storing previous training instances. The proposed approach leverages the transformer architecture and involves re-weighting the key, query, and value weights of the multi-head self-attention layers of a transformer trained on a similar task. The re-weighting is done using convolution, which enables the approach to maintain low parameter requirements per task. Additionally, an image augmentation-based entropic task identification approach is used to predict tasks without requiring task-ids during inference. Experiments on four benchmark datasets demonstrate that the proposed approach outperforms several competitive approaches while requiring fewer parameters.
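A minimal sketch of the convolutional re-weighting idea from the ConTraCon entry above: a frozen projection weight from a previously trained attention layer is treated as a one-channel image and modulated by a small per-task convolution kernel, so each new task adds only a handful of parameters. The wrapper class and its initialization are illustrative guesses, not the paper's implementation.

```python
# Minimal sketch (an illustrative guess, not the authors' code) of re-weighting a frozen
# key/query/value projection with a small, per-task learnable convolution kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvReweightedLinear(nn.Module):
    def __init__(self, base_linear: nn.Linear, kernel_size=3):
        super().__init__()
        # Frozen weights copied from the transformer trained on an earlier task.
        self.weight = nn.Parameter(base_linear.weight.detach().clone(), requires_grad=False)
        self.bias = nn.Parameter(base_linear.bias.detach().clone(), requires_grad=False)
        # Per-task trainable kernel applied to the frozen weight matrix seen as a 1-channel image.
        self.task_kernel = nn.Parameter(torch.zeros(1, 1, kernel_size, kernel_size))
        nn.init.dirac_(self.task_kernel)  # start as an identity-like re-weighting

    def adapted_weight(self):
        w = self.weight[None, None]  # (1, 1, out_features, in_features)
        w = F.conv2d(w, self.task_kernel, padding=self.task_kernel.shape[-1] // 2)
        return w[0, 0]

    def forward(self, x):
        return F.linear(x, self.adapted_weight(), self.bias)

# Toy usage: wrap the "query" projection of a frozen attention block for a new task.
base_q = nn.Linear(64, 64)
task_q = ConvReweightedLinear(base_q)
out = task_q(torch.randn(2, 16, 64))
trainable = sum(p.numel() for p in task_q.parameters() if p.requires_grad)
print(out.shape, trainable)  # only 9 trainable parameters for this projection
```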
+ + + + Efficient Decision-based Black-box Patch Attacks on Video Recognition http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Efficient_Decision-based_Black-box_Patch_Attacks_on_Video_Recognition_ICCV_2023_paper.pdf Although Deep Neural Networks (DNNs) have demonstrated excellent performance, they are vulnerable to adversarial patches that introduce perceptible and localized perturbations to the input. Generating adversarial patches on images has received much attention, while adversarial patches on videos have not been well investigated. Further, decision-based attacks, where attackers only access the predicted hard labels by querying threat models, have not been well explored on video models either, even though they are practical in real-world video recognition scenarios. The absence of such studies leads to a huge gap in the robustness assessment for video models. To bridge this gap, this work first explores decision-based patch attacks on video models. Our analysis shows that the huge parameter space brought by videos and the minimal information returned by decision-based models both greatly increase the attack difficulty and query burden. To achieve a query-efficient attack, we propose a spatial-temporal differential evolution (STDE) framework. First, STDE introduces target videos as patch textures and only adds patches on keyframes that are adaptively selected by temporal difference. Second, STDE takes minimizing the patch area as the optimization objective and adopts spatial-temporal mutation and crossover to search for the global optimum without falling into the local optimum. Experiments show that STDE achieves state-of-the-art performance in terms of threat, efficiency, and imperceptibility. Hence, STDE has the potential to be a powerful tool for evaluating the robustness of video recognition models. + + + + MetaGCD: Learning to Continually Learn in Generalized Category Discovery http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_MetaGCD_Learning_to_Continually_Learn_in_Generalized_Category_Discovery_ICCV_2023_paper.pdf In this paper, we consider a real-world scenario where a model that is trained on pre-defined classes continually encounters unlabeled data that contains both known and novel classes. The goal is to continually discover novel classes while maintaining the performance on known classes. We name the setting Continual Generalized Category Discovery (C-GCD). Existing methods for novel class discovery cannot directly handle the C-GCD setting due to some unrealistic assumptions, such as the unlabeled data only containing novel classes. Furthermore, they fail to discover novel classes in a continual fashion. In this work, we lift all these assumptions and propose an approach, called MetaGCD, to learn how to incrementally discover with less forgetting. Our proposed method uses a meta-learning framework and leverages the offline labeled data to simulate the testing incremental learning process. A meta-objective is defined to revolve around two conflicting learning objectives to achieve novel class discovery without forgetting. Furthermore, a soft neighborhood-based contrastive network is proposed to discriminate uncorrelated images while attracting correlated images. We build strong baselines and conduct extensive experiments on three widely used benchmarks to demonstrate the superiority of our method.
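The MetaGCD abstract above does not spell out the soft neighborhood-based contrastive objective, so the sketch below is only one plausible reading: each unlabeled image is attracted to other images in proportion to a soft neighbor weight derived from embedding similarity, instead of hard positive and negative labels. The function name, the sigmoid weighting, and all hyperparameters are assumptions.

```python
# Minimal sketch (a plausible form, not the paper's exact loss) of a soft
# neighborhood-based contrastive objective over a batch of unlabeled embeddings.
import torch
import torch.nn.functional as F

def soft_neighborhood_contrastive_loss(z, tau=0.1, threshold=0.5, sharpness=10.0):
    z = F.normalize(z, dim=-1)
    sim = z @ z.t()                                    # cosine similarities
    n = sim.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    logits = (sim / tau).masked_fill(eye, float('-inf'))   # ignore self-pairs
    log_p = F.log_softmax(logits, dim=1).masked_fill(eye, 0.0)
    # Soft neighbor weights: near 1 for correlated images, near 0 for uncorrelated ones.
    w = torch.sigmoid(sharpness * (sim.detach() - threshold)).masked_fill(eye, 0.0)
    w = w / w.sum(dim=1, keepdim=True).clamp_min(1e-8)
    return -(w * log_p).sum(dim=1).mean()

# Toy usage with random embeddings standing in for encoder outputs of unlabeled images.
z = torch.randn(8, 32, requires_grad=True)
loss = soft_neighborhood_contrastive_loss(z)
loss.backward()
print(float(loss))
```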
+ + + + Strip-MLP: Efficient Token Interaction for Vision MLP http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_Strip-MLP_Efficient_Token_Interaction_for_Vision_MLP_ICCV_2023_paper.pdf Token interaction operation is one of the core modules in MLP-based models to exchange and aggregate information between different spatial locations. However, the power of token interaction on the spatial dimension is highly dependent on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the features are down-sampled to a small spatial size. To address this issue, we present a novel method called Strip-MLP to enrich the token interaction power in three ways. Firstly, we introduce a new MLP paradigm called the Strip MLP layer that allows a token to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregation in adjacent but different strips of rows (or columns). Secondly, a Cascade Group Strip Mixing Module (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size. The module allows tokens to interact more effectively in both within-patch and cross-patch manners, independent of the feature spatial size. Finally, based on the Strip MLP layer, we propose a novel Local Strip Mixing Module (LSMM) to boost the token interaction power in the local region. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet, with clear advantages in the number of parameters and FLOPs. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44% on Caltech-101 and +2.16% on CIFAR-100. The source codes will be available at https://github.com/Med-Process/Strip_MLP. + + + + SAFARI: Versatile and Efficient Evaluations for Robustness of Interpretability http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_SAFARI_Versatile_and_Efficient_Evaluations_for_Robustness_of_Interpretability_ICCV_2023_paper.pdf Interpretability of Deep Learning (DL) is a barrier to trustworthy AI. Despite great efforts made by the Explainable AI (XAI) community, explanations lack robustness--indistinguishable input perturbations may lead to different XAI results. Thus, it is vital to assess how robust DL interpretability is, given an XAI method. In this paper, we identify several challenges that the state-of-the-art is unable to cope with collectively: i) existing metrics are not comprehensive; ii) XAI techniques are highly heterogeneous; iii) misinterpretations are normally rare events. To tackle these challenges, we introduce two black-box evaluation methods, concerning the worst-case interpretation discrepancy and a probabilistic notion of overall robustness, respectively. A Genetic Algorithm (GA) with a bespoke fitness function is used to solve constrained optimisation for efficient worst-case evaluation. Subset Simulation (SS), dedicated to estimating rare event probabilities, is used for evaluating overall robustness. Experiments show that the accuracy, sensitivity, and efficiency of our methods outperform the state-of-the-art. Finally, we demonstrate two applications of our methods: ranking robust XAI methods and selecting training schemes to improve both classification and interpretation robustness.
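To make the worst-case evaluation in the SAFARI entry above concrete, the sketch below runs a tiny genetic algorithm that searches an L-infinity ball for the perturbation maximizing the discrepancy between explanations of the original and perturbed inputs. A plain gradient-based saliency map stands in for a generic XAI method, and the toy model, fitness function, and GA settings are simplified assumptions rather than the paper's setup; the Subset Simulation part for overall robustness is not shown.

```python
# Minimal sketch (simplified assumptions, not the paper's implementation) of GA-based
# worst-case interpretation-discrepancy search within an L-inf perturbation budget.
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Flatten(), nn.Linear(64, 16), nn.ReLU(), nn.Linear(16, 3))

def saliency(x):
    # Gradient-based saliency as a stand-in for an arbitrary XAI method.
    x = x.clone().requires_grad_(True)
    score = model(x).max(dim=1).values.sum()
    return torch.autograd.grad(score, x)[0].abs()

def discrepancy(x, delta):
    return (saliency(x + delta) - saliency(x)).abs().sum(dim=(1, 2, 3))

def ga_worst_case(x, eps=0.05, pop=16, gens=20, elite=4):
    population = (torch.rand(pop, *x.shape[1:]) * 2 - 1) * eps
    for _ in range(gens):
        fitness = discrepancy(x.expand(pop, -1, -1, -1), population)
        top = population[fitness.argsort(descending=True)[:elite]]       # selection
        children = top[torch.randint(elite, (pop - elite,))]
        children = children + 0.01 * eps * torch.randn_like(children)    # mutation
        population = torch.cat([top, children]).clamp(-eps, eps)         # stay in the ball
    return discrepancy(x.expand(pop, -1, -1, -1), population).max()

x = torch.rand(1, 1, 8, 8)       # toy input
print(float(ga_worst_case(x)))   # worst-case explanation discrepancy found by the GA
```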
+ + + + Combating Noisy Labels with Sample Selection by Mining High-Discrepancy Examples http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_Combating_Noisy_Labels_with_Sample_Selection_by_Mining_High-Discrepancy_Examples_ICCV_2023_paper.pdf The sample selection approach is popular in learning with noisy labels. The state-of-the-art methods train two deep networks simultaneously for sample selection, which aims to employ their different learning abilities. To prevent two networks from converging to a consensus, their divergence should be maintained. Prior work shows that the divergence can be kept by locating the disagreement data, on which the prediction labels of the two networks are different. However, this procedure is sample-inefficient for generalization, which means that only a few clean examples can be utilized in training. In this paper, to address the issue, we propose a simple yet effective method called CoDis. In particular, we select possibly clean data that simultaneously have high-discrepancy prediction probabilities between two networks. As selected data have high discrepancies in probabilities, the divergence of the two networks can be maintained by training on such data. Additionally, the condition of high discrepancies is milder than disagreement, which allows more data to be considered for training, and makes our method more sample-efficient. Moreover, we show that the proposed method is able to mine hard clean examples to help generalization. Empirical results show that CoDis is superior to multiple baselines in the robustness of trained models. + + + + What can Discriminator do? Towards Box-free Ownership Verification of Generative Adversarial Networks http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_What_can_Discriminator_do_Towards_Box-free_Ownership_Verification_of_Generative_ICCV_2023_paper.pdf In recent decades, the Generative Adversarial Network (GAN) and its variants have achieved unprecedented success in image synthesis. However, well-trained GANs are under the threat of being illegally stolen or leaked. The prior studies on remote ownership verification assume a black-box setting where the defender can query the suspicious model with specific inputs, which we identify as insufficient for generation tasks. To this end, in this paper, we propose a novel IP protection scheme for GANs where ownership verification can be done by checking outputs only, without choosing the inputs (i.e., box-free setting). Specifically, we make use of the unexploited potential of the discriminator to learn a hypersphere that captures the unique distribution learned by the paired generator. Extensive evaluations on two popular GAN tasks and more than 10 GAN architectures demonstrate that our proposed scheme effectively verifies ownership. Our proposed scheme is shown to be immune to popular input-based removal attacks and robust against other existing attacks. The source code and models are available at https://github.com/AbstractTeen/gan_ownership_verification. + + + + An Adaptive Model Ensemble Adversarial Attack for Boosting Adversarial Transferability http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_An_Adaptive_Model_Ensemble_Adversarial_Attack_for_Boosting_Adversarial_Transferability_ICCV_2023_paper.pdf While the transferability property of adversarial examples allows the adversary to perform black-box attacks (i.e., the attacker has no knowledge about the target model), transfer-based adversarial attacks have gained great attention.
Previous works mostly study gradient variation or image transformations to amplify the distortion on critical parts of inputs. These methods can work when transferring across models with limited differences, i.e., from CNNs to CNNs, but always fail when transferring across models with wide differences, such as from CNNs to ViTs. Alternatively, model ensemble adversarial attacks are proposed to fuse outputs from surrogate models with diverse architectures to get an ensemble loss, making the generated adversarial example more likely to transfer to other models as it can fool multiple models concurrently. However, existing ensemble attacks simply fuse the outputs of the surrogate models evenly, and thus are not efficacious in capturing and amplifying the intrinsic transfer information of adversarial examples. In this paper, we propose an adaptive ensemble attack, dubbed AdaEA, to adaptively control the fusion of the outputs from each model, via monitoring the discrepancy ratio of their contributions towards the adversarial objective. Furthermore, an extra disparity-reduced filter is introduced to further synchronize the update direction. As a result, we achieve considerable improvement over the existing ensemble attacks on various datasets, and the proposed AdaEA can also boost existing transfer-based attacks, which further demonstrates its efficacy and versatility. + + + + 3D-VisTA: Pre-trained Transformer for 3D Vision and Text Alignment http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_3D-VisTA_Pre-trained_Transformer_for_3D_Vision_and_Text_Alignment_ICCV_2023_paper.pdf 3D vision-language grounding (3D-VL) is an emerging field that aims to connect the 3D physical world with natural language, which is crucial for achieving embodied intelligence. Current 3D-VL models rely heavily on sophisticated modules, auxiliary losses, and optimization tricks, which calls for a simple and unified model. In this paper, we propose 3D-VisTA, a pre-trained Transformer for 3D Vision and Text Alignment that can be easily adapted to various downstream tasks. 3D-VisTA simply utilizes self-attention layers for both single-modal modeling and multi-modal fusion without any sophisticated task-specific design. To further enhance its performance on 3D-VL tasks, we construct ScanScribe, the first large-scale 3D scene-text pairs dataset for 3D-VL pre-training. ScanScribe contains 2,995 RGB-D scans for 1,185 unique indoor scenes originating from ScanNet and 3R-Scan datasets, along with paired 278K scene descriptions generated from existing 3D-VL tasks, templates, and GPT-3. 3D-VisTA is pre-trained on ScanScribe via masked language/object modeling and scene-text matching. It achieves state-of-the-art results on various 3D-VL tasks, ranging from visual grounding and question answering to situated reasoning. Moreover, 3D-VisTA demonstrates superior data efficiency, obtaining strong performance even with limited annotations during downstream task fine-tuning. + + + + SparseDet: Improving Sparsely Annotated Object Detection with Pseudo-positive Mining http://openaccess.thecvf.com//content/ICCV2023/papers/Suri_SparseDet_Improving_Sparsely_Annotated_Object_Detection_with_Pseudo-positive_Mining_ICCV_2023_paper.pdf Training with sparse annotations is known to reduce the performance of object detectors. Previous methods have focused on proxies for missing ground truth annotations in the form of pseudo-labels for unlabeled boxes.
We observe that existing methods suffer at higher levels of sparsity in the data due to noisy pseudo-labels. To prevent this, we propose an end-to-end system that learns to separate the proposals into labeled and unlabeled regions using Pseudo-positive mining. While the labeled regions are processed as usual, self-supervised learning is used to process the unlabeled regions thereby preventing the negative effects of noisy pseudo-labels. This novel approach has multiple advantages such as improved robustness to higher sparsity when compared to existing methods. We conduct exhaustive experiments on five splits on the PASCAL-VOC and COCO datasets achieving state-of-the-art performance. We also unify various splits used across literature for this task and present a standardized benchmark. On average, we improve by 2.6, 3.9 and 9.6 mAP over previous state-of-the-art methods on three splits of increasing sparsity on COCO. Our project is publicly available at cs.umd.edu/ sakshams/SparseDet. + + + + Among Us: Adversarially Robust Collaborative Perception by Consensus + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Among_Us_Adversarially_Robust_Collaborative_Perception_by_Consensus_ICCV_2023_paper.pdf + Multiple robots could perceive a scene (e.g., detect objects) collaboratively better than individuals, although easily suffer from adversarial attacks when using deep learning. This could be addressed by the adversarial defense, but its training requires the often-unknown attacking mechanism. Differently, we propose ROBOSAC, a novel sampling-based defense strategy generalizable to unseen attackers. Our key idea is that collaborative perception should lead to consensus rather than dissensus in results compared to individual perception. This leads to our hypothesize-and-verify framework: perception results with and without collaboration from a random subset of teammates are compared until reaching a consensus. In such a framework, more teammates in the sampled subset often entail better perception performance but require longer sampling time to reject potential attackers. Thus, we derive how many sampling trials are needed to ensure the desired size of an attacker-free subset, or equivalently, the maximum size of such a subset that we can successfully sample within a given number of trials. We validate our method on the task of collaborative 3D object detection in autonomous driving scenarios. + + + + BUS: Efficient and Effective Vision-Language Pre-Training with Bottom-Up Patch Summarization. + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_BUS_Efficient_and_Effective_Vision-Language_Pre-Training_with_Bottom-Up_Patch_Summarization._ICCV_2023_paper.pdf + Vision Transformer (ViT) based Vision-Language Pretraining (VLP) models recently demonstrated impressive performance in various tasks. However, the lengthy visual token sequences used in these models can lead to inefficient and ineffective performance. Existing methods to address these issues lack textual guidance and may overlook crucial visual information related to the text, leading to the introduction of irrelevant information during cross-modal fusion and additional computational cost. In this paper, we propose a Bottom-Up Patch Summarization approach named BUS which is inspired by the Document Summarization Task in NLP to learn a concise visual summary of lengthy visual token sequences, guided by textual semantics. 
We introduce a Text-Semantic Aware Patch Selector (TAPS) in the ViT backbone to perform a coarse-grained selective visual summarization to over-determine the text-relevant patches, and a light Summarization Decoder to perform fine-grained abstractive summarization based on the selected patches, resulting in a further condensed representation sequence that highlights text-relevant visual semantic information. Such a bottom-up process is both efficient and effective, leading to higher performance. We evaluate our approach on various VL understanding and generation tasks and show competitive or better downstream task performance while boosting the efficiency by 50%. Additionally, our model achieves SOTA downstream task performance by increasing the input image resolution without increasing computational costs compared to baselines. + + + + SegPrompt: Boosting Open-World Segmentation via Category-Level Prompt Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_SegPrompt_Boosting_Open-World_Segmentation_via_Category-Level_Prompt_Learning_ICCV_2023_paper.pdf Current closed-set instance segmentation models rely on predefined class labels for each mask during training and evaluation, limiting their ability to detect novel objects. Open-world instance segmentation (OWIS) models address this challenge by detecting unknown objects in a class-agnostic manner. However, previous OWIS approaches completely erase category information during training to keep the model's ability to generalize to unknown objects. In this work, we propose a novel training mechanism called SegPrompt that utilizes category information to improve the model's class-agnostic segmentation ability for both known and unknown categories. In addition, the previous OWIS training setting exposes the unknown classes to the training set and brings information leakage, which is unreasonable in the real world. Therefore, we provide a new open-world benchmark closer to a real-world scenario by dividing the dataset classes into known-seen-unseen parts. For the first time, we focus on the model's ability to discover objects that never appear in the training set images. Experiments show that SegPrompt can improve the overall and unseen detection performance by 5.6% and 6.1% in AR on our new benchmark without affecting the inference efficiency. We further demonstrate the effectiveness of our method on existing cross-dataset transfer and strongly supervised settings, leading to 5.5% and 12.3% relative improvement. + + + + CL-MVSNet: Unsupervised Multi-View Stereo with Dual-Level Contrastive Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Xiong_CL-MVSNet_Unsupervised_Multi-View_Stereo_with_Dual-Level_Contrastive_Learning_ICCV_2023_paper.pdf Unsupervised Multi-View Stereo (MVS) methods have achieved promising progress recently. However, previous methods primarily depend on the photometric consistency assumption, which may suffer from two limitations: indistinguishable regions and view-dependent effects, e.g., low-textured areas and reflections. To address these issues, in this paper, we propose a new dual-level contrastive learning approach, named CL-MVSNet. Specifically, our model integrates two contrastive branches into an unsupervised MVS framework to construct additional supervisory signals. On the one hand, we present an image-level contrastive branch to guide the model to acquire more context awareness, thus leading to more complete depth estimation in indistinguishable regions.
On the other hand, we exploit a scene-level contrastive branch to boost the representation ability, improving robustness to view-dependent effects. Moreover, to recover more accurate 3D geometry, we introduce an L0.5 photometric consistency loss, which encourages the model to focus more on accurate points while mitigating the gradient penalty of undesirable ones. Extensive experiments on DTU and Tanks&Temples benchmarks demonstrate that our approach achieves state-of-the-art performance among all end-to-end unsupervised MVS frameworks and outperforms its supervised counterpart by a considerable margin without fine-tuning. + + + + TF-ICON: Diffusion-Based Training-Free Cross-Domain Image Composition + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_TF-ICON_Diffusion-Based_Training-Free_Cross-Domain_Image_Composition_ICCV_2023_paper.pdf + Text-driven diffusion models have exhibited impressive generative capabilities, enabling various image editing tasks. In this paper, we propose TF-ICON, a novel Training-Free Image COmpositioN framework that harnesses the power of text-driven diffusion models for cross-domain image-guided composition. This task aims to seamlessly integrate user-provided objects into a specific visual context. Current diffusion-based methods often involve costly instance-based optimization or finetuning of pretrained models on customized datasets, which can potentially undermine their rich prior. In contrast, TF-ICON can leverage off-the-shelf diffusion models to perform cross-domain image-guided composition without requiring additional training, finetuning, or optimization. Moreover, we introduce the exceptional prompt, which contains no information, to facilitate text-driven diffusion models in accurately inverting real images into latent representations, forming the basis for compositing. Our experiments show that equipping Stable Diffusion with the exceptional prompt outperforms state-of-the-art inversion methods on various datasets (CelebA-HQ, COCO, and ImageNet), and that TF-ICON surpasses prior baselines in versatile visual domains. Code is available at https://github.com/Shilin-LU/TF-ICON + + + + Landscape Learning for Neural Network Inversion + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Landscape_Learning_for_Neural_Network_Inversion_ICCV_2023_paper.pdf + Many machine learning methods operate by inverting a neural network at inference time, which has become a popular technique for solving inverse problems in computer vision, robotics, and graphics. However, these methods often involve gradient descent through a highly non-convex loss landscape, causing the optimization process to be unstable and slow. We introduce a method that learns a loss landscape where gradient descent is efficient, bringing massive improvement and acceleration to the inversion process. We demonstrate this advantage on a number of methods for both generative and discriminative tasks, including GAN inversion, adversarial defense, and 3D human pose reconstruction. + + + + PPR: Physically Plausible Reconstruction from Monocular Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_PPR_Physically_Plausible_Reconstruction_from_Monocular_Videos_ICCV_2023_paper.pdf + Given monocular videos, we build 3D models of articulated objects and environments whose 3D configurations satisfy dynamics and contact constraints. At its core, our method leverages differentiable physics simulation to aid visual reconstructions. 
We couple differentiable physics simulation with differentiable rendering via coordinate descent, which enables end-to-end optimization of, not only 3D reconstructions, but also physical system parameters from videos. We demonstrate the effectiveness of physics-informed reconstruction on monocular videos of quadruped animals and humans. It reduces reconstruction artifacts (e.g., scale ambiguity, unbalanced poses, and foot swapping) that are challenging to address by visual cues alone, and produces better foot contact estimation. + + + + Robust Heterogeneous Federated Learning under Data Corruption + http://openaccess.thecvf.com//content/ICCV2023/papers/Fang_Robust_Heterogeneous_Federated_Learning_under_Data_Corruption_ICCV_2023_paper.pdf + Model heterogeneous federated learning is a realistic and challenging problem. However, due to the limitations of data collection, storage, and transmission conditions, as well as the existence of free-rider participants, the clients may suffer from data corruption. This paper starts the first attempt to investigate the problem of data corruption in the model heterogeneous federated learning framework. We design a novel method named Augmented Heterogeneous Federated Learning (AugHFL), which consists of two stages: 1) In the local update stage, a corruption-robust data augmentation strategy is adopted to minimize the adverse effects of local corruption while enabling the models to learn rich local knowledge. 2) In the collaborative update stage, we design a robust re-weighted communication approach, which implements communication between heterogeneous models while mitigating corrupted knowledge transfer from others. Extensive experiments demonstrate the effectiveness of our method in coping with various corruption patterns in the model heterogeneous federated learning setting. + + + + Cyclic-Bootstrap Labeling for Weakly Supervised Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Yin_Cyclic-Bootstrap_Labeling_for_Weakly_Supervised_Object_Detection_ICCV_2023_paper.pdf + Recent progress in weakly supervised object detection is featured by a combination of multiple instance detection networks (MIDN) and ordinal online refinement. However, with only image-level annotation, MIDN inevitably assigns high scores to some unexpected region proposals when generating pseudo labels. These inaccurate high-scoring region proposals will mislead the training of subsequent refinement modules and thus hamper the detection performance. In this work, we explore how to ameliorate the quality of pseudo-labeling in MIDN. Formally, we devise Cyclic-Bootstrap Labeling (CBL), a novel weakly supervised object detection pipeline, which optimizes MIDN with rank information from a reliable teacher network. Specifically, we obtain this teacher network by introducing a weighted exponential moving average strategy to take advantage of various refinement modules. A novel class-specific ranking distillation algorithm is proposed to leverage the output of weighted ensembled teacher network for distilling MIDN with rank information. As a result, MIDN is guided to assign higher scores to accurate proposals, which further benefits final detection. Extensive experiments on the prevalent PASCAL VOC 2007 & 2012 and COCO datasets demonstrate the superior performance of our CBL framework. 
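A note on the Cyclic-Bootstrap Labeling (CBL) entry above: the reliable teacher it describes is built with a weighted exponential moving average (EMA) over refinement modules. The snippet below is only a minimal, generic EMA-teacher update in PyTorch, with illustrative names and a single student model standing in for the paper's weighted ensemble of refinement modules; it is not the authors' implementation.

import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module, momentum: float = 0.999):
    # teacher <- momentum * teacher + (1 - momentum) * student, applied parameter-wise.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)
    # Buffers (e.g., BatchNorm running statistics) are simply copied over.
    for b_t, b_s in zip(teacher.buffers(), student.buffers()):
        b_t.data.copy_(b_s.data)

Calling ema_update(teacher, student) once per training iteration keeps the teacher a slowly moving, smoothed copy of the student, which is the usual reason such a teacher provides more stable rank information for distillation.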
+ + + + Tangent Sampson Error: Fast Approximate Two-view Reprojection Error for Central Camera Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Terekhov_Tangent_Sampson_Error_Fast_Approximate_Two-view_Reprojection_Error_for_Central_ICCV_2023_paper.pdf + In this paper we introduce the Tangent Sampson error, which is a generalization of the classical Sampson error in two-view geometry that allows for arbitrary central camera models. It only requires local gradients of the distortion map at the original correspondences (allowing for pre-computation) resulting in a negligible increase in computational cost when used in RANSAC or local refinement. The error effectively approximates the true-reprojection error for a large variety of cameras, including extremely wide field-of-view lenses that cannot be undistorted to a single pinhole image. We show experimentally that the new error outperforms competing approaches both when used for model scoring in RANSAC and for non-linear refinement of the relative camera pose. + + + + MPCViT: Searching for Accurate and Efficient MPC-Friendly Vision Transformer with Heterogeneous Attention + http://openaccess.thecvf.com//content/ICCV2023/papers/Zeng_MPCViT_Searching_for_Accurate_and_Efficient_MPC-Friendly_Vision_Transformer_with_ICCV_2023_paper.pdf + Secure multi-party computation (MPC) enables computation directly on encrypted data and protects both data and model privacy in deep learning inference. However, existing neural network architectures, including Vision Transformers (ViTs), are not designed or optimized for MPC and incur significant latency overhead. We observe Softmax accounts for the major latency bottleneck due to a high communication complexity, but can be selectively replaced or linearized without compromising the model accuracy. Hence, in this paper, we propose an MPC-friendly ViT, dubbed MPCViT, to enable accurate yet efficient ViT inference in MPC. Based on a systematic latency and accuracy evaluation of the Softmax attention and other attention variants, we propose a heterogeneous attention optimization space. We also develop a simple yet effective MPC-aware neural architecture search algorithm for fast Pareto optimization. To further boost the inference efficiency, we propose MPCViT+, to jointly optimize the Softmax attention and other network components, including GeLU, matrix multiplication, etc. With extensive experiments, we demonstrate that MPCViT achieves 1.9%, 1.3% and 3.6% higher accuracy with 6.2x, 2.9x and 1.9x latency reduction compared with baseline ViT, MPCFormer and THE-X on the Tiny-ImageNet dataset, respectively. MPCViT+ further achieves a better Pareto front compared with MPCViT. The code and models for evaluation are available at https://github.com/PKU-SEC-Lab/mpcvit. + + + + Masked Spiking Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Masked_Spiking_Transformer_ICCV_2023_paper.pdf + The combination of Spiking Neural Networks (SNNs) and Transformers has attracted significant attention due to their potential for high energy efficiency and high-performance nature. However, existing works on this topic typically rely on direct training, which can lead to suboptimal performance. To address this issue, we propose to leverage the benefits of the ANN-to-SNN conversion method to combine SNNs and Transformers, resulting in significantly improved performance over existing state-of-the-art SNN models. 
Furthermore, inspired by the quantal synaptic failures observed in the nervous system, which reduce the number of spikes transmitted across synapses, we introduce a novel Masked Spiking Transformer (MST) framework. This incorporates a Random Spike Masking (RSM) method to prune redundant spikes and reduce energy consumption without sacrificing performance. Our experimental results demonstrate that the proposed MST model achieves a significant reduction of 26.8% in power consumption when the masking ratio is 75% while maintaining the same level of performance as the unmasked model. The code is available at: https://github.com/bic-L/Masked-Spiking-Transformer. + + + + Joint Implicit Neural Representation for High-fidelity and Compact Vector Fonts + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Joint_Implicit_Neural_Representation_for_High-fidelity_and_Compact_Vector_Fonts_ICCV_2023_paper.pdf + Existing vector font generation approaches either struggle to preserve high-frequency corner details of the glyph or produce vector shapes that have redundant segments, which hinders their applications in practical scenarios. In this paper, we propose to learn vector fonts from pixelated font images utilizing a joint neural representation that consists of a signed distance field (SDF) and a probabilistic corner field (CF) to capture shape corner details. To achieve smooth shape interpolation on the learned shape manifold, we establish connections between the two fields for better alignment. We further design a vectorization process to extract high-quality and compact vector fonts from our joint neural representation. Experiments demonstrate that our method can generate more visually appealing vector fonts with a higher level of compactness compared to existing alternatives. + + + + Neural Characteristic Function Learning for Conditional Image Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Neural_Characteristic_Function_Learning_for_Conditional_Image_Generation_ICCV_2023_paper.pdf + The emergence of conditional generative adversarial networks (cGANs) has revolutionised the way we approach and control the generation, by means of adversarially learning joint distributions of data and auxiliary information. Despite the success, cGANs have been consistently put under scrutiny due to their ill-posed discrepancy measure between distributions, leading to mode collapse and instability problems in training. To address this issue, we propose a novel conditional characteristic function generative adversarial network (CCF-GAN) to reduce the discrepancy by the characteristic functions (CFs), which is able to learn accurate distance measure of joint distributions under theoretical soundness. More specifically, the difference between CFs is first proved to be complete and optimisation-friendly, for measuring the discrepancy of two joint distributions. To relieve the problem of curse of dimensionality in calculating CF difference, we propose to employ the neural network, namely neural CF (NCF), to efficiently minimise an upper bound of the difference. Based on the NCF, we establish the CCF-GAN framework to explicitly decompose CFs of joint distributions, which allows for learning the data distribution and auxiliary information with classified importance. The experimental results on synthetic and real-world datasets verify the superior performances of our CCF-GAN, on both the generation quality and stability. 
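To make the characteristic-function idea in the CCF-GAN entry above concrete, here is a small NumPy sketch of an empirical characteristic-function discrepancy between two sample sets. It is a generic CF distance with random Gaussian frequencies, not the paper's neural CF (NCF) or its conditional decomposition; all names and the choice of frequency distribution are illustrative assumptions.

import numpy as np

def empirical_cf(x, t):
    # Empirical characteristic function phi(t) = E_x[exp(i * <t, x>)].
    # x: (n, d) samples, t: (m, d) frequency points -> (m,) complex values.
    return np.exp(1j * t @ x.T).mean(axis=1)

def cf_discrepancy(x_real, x_fake, num_freqs=128, scale=1.0, seed=0):
    # Average squared difference of the two empirical CFs over random frequencies.
    rng = np.random.default_rng(seed)
    t = rng.normal(scale=scale, size=(num_freqs, x_real.shape[1]))
    diff = empirical_cf(x_real, t) - empirical_cf(x_fake, t)
    return float(np.mean(np.abs(diff) ** 2))

Because characteristic functions always exist and uniquely determine a distribution, a discrepancy of this form is well defined for any pair of sample sets, which is the property the abstract appeals to.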
+ + + + Holistic Label Correction for Noisy Multi-Label Classification http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_Holistic_Label_Correction_for_Noisy_Multi-Label_Classification_ICCV_2023_paper.pdf Multi-label classification aims to learn classification models from instances associated with multiple labels. It is pivotal to learn and utilize the label dependence among multiple labels in multi-label classification. As a result of today's big and complex data, noisy labels are inevitable, making it pressing to target multi-label classification with noisy labels. Although the importance of label dependence has been shown in multi-label classification with clean labels, it is challenging to bring label dependence to the problem of multi-label classification with noisy labels. The issues are that we do not understand why label dependence is helpful in this problem, and how to learn and utilize label dependence using only training data with noisy multiple labels. In this paper, we bring label dependence to tackle the problem of multi-label classification with noisy labels. Specifically, we first provide a high-level understanding of why label dependence helps distinguish the examples with clean/noisy multiple labels. Benefiting from the memorization effect in handling noisy labels, a novel algorithm is then proposed to learn the label dependence by employing only training data with noisy multiple labels, and to utilize the learned dependence to help correct noisy multiple labels to clean ones. We prove that the use of label dependence could bring a higher success rate for recovering correct multiple labels. Empirical evaluations justify our claims and demonstrate the superiority of our algorithm. + + + + Unified Data-Free Compression: Pruning and Quantization without Fine-Tuning http://openaccess.thecvf.com//content/ICCV2023/papers/Bai_Unified_Data-Free_Compression_Pruning_and_Quantization_without_Fine-Tuning_ICCV_2023_paper.pdf Structured pruning and quantization are promising approaches for reducing the inference time and memory footprint of neural networks. However, most existing methods require the original training dataset to fine-tune the model. This not only incurs heavy resource consumption but is also not possible for applications with sensitive or proprietary data due to privacy and security concerns. Therefore, a few data-free methods have been proposed to address this problem, but they perform data-free pruning and quantization separately, which does not explore the complementarity of pruning and quantization. In this paper, we propose a novel framework named Unified Data-Free Compression (UDFC), which performs pruning and quantization simultaneously without any data or fine-tuning process. Specifically, UDFC starts with the assumption that the partial information of a damaged (e.g., pruned or quantized) channel can be preserved by a linear combination of other channels, and then derives the reconstruction form from this assumption to restore the information loss due to compression. Finally, we formulate the reconstruction error between the original network and its compressed network, and theoretically deduce the closed-form solution. We evaluate UDFC on the large-scale image classification task and obtain significant improvements over various network architectures and compression methods. For example, we achieve a 20.54% accuracy improvement on the ImageNet dataset compared to the SOTA method with a 30% pruning ratio and 6-bit quantization on ResNet-34.
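For the UDFC entry above, the following NumPy toy illustrates the stated assumption that a damaged channel can be preserved by a linear combination of other channels. Here the combination coefficients are estimated by least squares on the producing layer's weight rows and folded into the next layer, which is only one possible data-free instantiation of the assumption; the paper instead derives a closed-form solution for its reconstruction error, and every name below is hypothetical.

import numpy as np

def prune_channel_with_compensation(w_prev, w_next, k):
    # w_prev: (C, D) flattened weights producing each of C channels.
    # w_next: (M, C) weights of the next layer consuming those C channels.
    # Removes channel k and folds its approximate contribution into the rest.
    keep = [j for j in range(w_prev.shape[0]) if j != k]
    basis = w_prev[keep].T                                      # (D, C-1)
    coeffs, *_ = np.linalg.lstsq(basis, w_prev[k], rcond=None)  # channel k ~ sum_j coeffs[j] * channel j
    w_next_kept = w_next[:, keep] + np.outer(w_next[:, k], coeffs)
    return w_prev[keep], w_next_kept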
+ + + + Temporal Enhanced Training of Multi-view 3D Object Detector via Historical Object Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zong_Temporal_Enhanced_Training_of_Multi-view_3D_Object_Detector_via_Historical_ICCV_2023_paper.pdf + In this paper, we propose a new paradigm, named Historical Object Prediction (HoP) for multi-view 3D detection to leverage temporal information more effectively. The HoP approach is straightforward: given the current timestamp t, we generate a pseudo Bird's-Eye View (BEV) feature of timestamp t-k from its adjacent frames and utilize this feature to predict the object set at timestamp t-k. Our approach is motivated by the observation that enforcing the detector to capture both the spatial location and temporal motion of objects occurring at historical timestamps can lead to more accurate BEV feature learning. First, we elaborately design short-term and long-term temporal decoders, which can generate the pseudo BEV feature for timestamp t-k without the involvement of its corresponding camera images. Second, an additional object decoder is flexibly attached to predict the object targets using the generated pseudo BEV feature. Note that we only perform HoP during training, thus the proposed method does not introduce extra overheads during inference. As a plug-and-play approach, HoP can be easily incorporated into state-of-the-art BEV detection frameworks, including BEVFormer and BEVDet series. Furthermore, the auxiliary HoP approach is complementary to prevalent temporal modeling methods, leading to significant performance gains. Extensive experiments are conducted to evaluate the effectiveness of the proposed HoP on the nuScenes dataset. We choose the representative methods, including BEVFormer and BEVDet4D-Depth to evaluate our method. Surprisingly, HoP achieves 68.2% NDS and 61.6% mAP with ViT-L on nuScenes test, outperforming all the 3D object detectors on the leaderboard by a large margin. + + + + PARIS: Part-level Reconstruction and Motion Analysis for Articulated Objects + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_PARIS_Part-level_Reconstruction_and_Motion_Analysis_for_Articulated_Objects_ICCV_2023_paper.pdf + We address the task of simultaneous part-level reconstruction and motion parameter estimation for articulated objects. Given two sets of multi-view images of an object in two static articulation states, we decouple the movable part from the static part and reconstruct shape and appearance while predicting the motion parameters. To tackle this problem, we present PARIS: a self-supervised, end-to-end architecture that learns part-level implicit shape and appearance models and optimizes motion parameters jointly without any 3D supervision, motion, or semantic annotation. Our experiments show that our method generalizes better across object categories, and outperforms baselines and prior work that are given 3D point clouds as input. Our approach improves reconstruction relative to state-of-the-art baselines with a Chamfer-L1 distance reduction of 3.94 (45.2%) for objects and 26.79 (84.5%) for parts, and achieves 5% error rate for motion estimation across 10 object categories. 
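Returning to the Historical Object Prediction (HoP) entry above: its training-only auxiliary branch builds a pseudo BEV feature for a past timestamp t-k from adjacent frames and predicts objects there. The PyTorch module below is a deliberately simplified sketch of that idea, with a plain convolutional fusion and a dense per-cell head standing in for the paper's short-/long-term temporal decoders and object decoder; all layer choices and names are assumptions, not the authors' code.

import torch
import torch.nn as nn

class HoPStyleAuxBranch(nn.Module):
    # Training-only branch: fuse BEV features from frames around t-k into a
    # pseudo BEV feature, then predict a per-cell class map supervised at t-k.
    def __init__(self, bev_channels=256, num_classes=10):
        super().__init__()
        self.temporal_fuse = nn.Sequential(
            nn.Conv2d(2 * bev_channels, bev_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        self.object_head = nn.Conv2d(bev_channels, num_classes, kernel_size=1)

    def forward(self, bev_before, bev_after):
        # bev_before, bev_after: (B, C, H, W) BEV features adjacent to timestamp t-k.
        pseudo_bev = self.temporal_fuse(torch.cat([bev_before, bev_after], dim=1))
        return self.object_head(pseudo_bev)

The auxiliary loss from such a head would be added to the usual detection loss during training and the whole branch dropped at inference, matching the abstract's claim of no extra inference overhead.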
+ + + + OnlineRefer: A Simple Online Baseline for Referring Video Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_OnlineRefer_A_Simple_Online_Baseline_for_Referring_Video_Object_Segmentation_ICCV_2023_paper.pdf + Referring video object segmentation (RVOS) aims at segmenting an object in a video following human instruction. Current state-of-the-art methods fall into an offline pattern, in which each clip independently interacts with text embedding for cross-modal understanding. They usually present that the offline pattern is necessary for RVOS, yet model limited temporal association within each clip. In this work, we break up the previous offline belief and propose a simple yet effective online model using explicit query propagation, named OnlineRefer. Specifically, our approach leverages target cues that gather semantic information and position prior to improve the accuracy and ease of referring predictions for the current frame. Furthermore, we generalize our online model into a semi-online framework to be compatible with video-based backbones. To show the effectiveness of our method, we evaluate it on four benchmarks, i.e., Refer-Youtube-VOS, Refer-DAVIS17, A2D-Sentences, and JHMDB-Sentences. Without bells and whistles, our OnlineRefer with a Swin-L backbone achieves 63.5 J&F and 64.8 J&F on Refer-Youtube-VOS and Refer-DAVIS17, outperforming all other offline methods. Our code is available at https://github.com/wudongming97/OnlineRefer. + + + + Environment Agnostic Representation for Visual Reinforcement Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Choi_Environment_Agnostic_Representation_for_Visual_Reinforcement_Learning_ICCV_2023_paper.pdf + Generalization capability of vision-based deep reinforcement learning (RL) is indispensable to deal with dynamic environment changes that exist in visual observations. The high-dimensional space of the visual input, however, imposes challenges in adapting an agent to unseen environments. In this work, we propose Environment Agnostic Reinforcement learning (EAR), which is a compact framework for domain generalization of the visual deep RL. Environment-agnostic features (EAFs) are extracted by leveraging three novel objectives based on feature factorization, reconstruction, and episode-aware state shifting, so that policy learning is accomplished only with vital features. EAR is a simple single-stage method with a low model complexity and a fast inference time, ensuring a high reproducibility, while attaining state-of-the-art performance in the DeepMind Control Suite and DrawerWorld benchmarks. + + + + Mimic3D: Thriving 3D-Aware GANs via 3D-to-2D Imitation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Mimic3D_Thriving_3D-Aware_GANs_via_3D-to-2D_Imitation_ICCV_2023_paper.pdf + Generating images with both photorealism and multiview 3D consistency is crucial for 3D-aware GANs, yet existing methods struggle to achieve them simultaneously. Improving the photorealism via CNN-based 2D super-resolution can break the strict 3D consistency, while keeping the 3D consistency by learning high-resolution 3D representations for direct rendering often compromises image quality. In this paper, we propose a novel learning strategy, namely 3D-to-2D imitation, which enables a 3D-aware GAN to generate high-quality images while maintaining their strict 3D consistency, by letting the images synthesized by the generator's 3D rendering branch mimic those generated by its 2D super-resolution branch. 
We also introduce 3D-aware convolutions into the generator for better 3D representation learning, which further improves the image generation quality. With the above strategies, our method reaches FID scores of 5.4 and 4.3 on FFHQ and AFHQ-v2 Cats, respectively, at 512x512 resolution, largely outperforming existing 3D-aware GANs using direct 3D rendering and coming very close to the previous state-of-the-art method that leverages 2D super-resolution. Project website: https://seanchenxy.github.io/Mimic3DWeb. + + + + Does Physical Adversarial Example Really Matter to Autonomous Driving? Towards System-Level Effect of Adversarial Object Evasion Attack + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Does_Physical_Adversarial_Example_Really_Matter_to_Autonomous_Driving_Towards_ICCV_2023_paper.pdf + In autonomous driving (AD), accurate perception is indispensable to achieving safe and secure driving. Due to its safety-criticality, the security of AD perception has been widely studied. Among different attacks on AD perception, the physical adversarial object evasion attacks are especially severe. However, we find that all existing literature only evaluates their attack effect at the targeted AI component level but not at the system level, i.e., with the entire system semantics and context such as the full AD pipeline. Thereby, this raises a critical research question: can these existing researches effectively achieve system-level attack effects (e.g., traffic rule violations) in the real-world AD context? In this work, we conduct the first measurement study on whether and how effectively the existing designs can lead to system-level effects, especially for the STOP sign-evasion attacks due to their popularity and severity. Our evaluation results show that all the representative prior works cannot achieve any system-level effects. We observe two design limitations in the prior works: 1) physical model-inconsistent object size distribution in pixel sampling and 2) lack of vehicle plant model and AD system model consideration. Then, we propose SysAdv, a novel system-driven attack design in the AD context and our evaluation results show that the system-level effects can be significantly improved, i.e., the violation rate increases by around 70%. + + + + Generalizable Neural Fields as Partially Observed Neural Processes + http://openaccess.thecvf.com//content/ICCV2023/papers/Gu_Generalizable_Neural_Fields_as_Partially_Observed_Neural_Processes_ICCV_2023_paper.pdf + Neural fields, which represent signals as a function parameterized by a neural network, are a promising alternative to traditional discrete vector or grid-based representations. Compared to discrete representations, neural representations both scale well with increasing resolution, are continuous, and can be many-times differentiable. However, given a dataset of signals that we would like to represent, having to optimize a separate neural field for each signal is inefficient, and cannot capitalize on shared information or structures among signals. Existing generalization methods view this as a meta-learning problem and employ gradient-based meta-learning to learn an initialization which is then fine-tuned with test-time optimization, or learn hypernetworks to produce the weights of a neural field. We instead propose a new paradigm that views the large-scale training of neural representations as a part of a partially-observed neural process framework, and leverage neural process algorithms to solve this task. 
We demonstrate that this approach outperforms both state-of-the-art gradient-based meta-learning approaches and hypernetwork approaches. + + + + Adding Conditional Control to Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Adding_Conditional_Control_to_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf + We present ControlNet, a neural network architecture to add spatial conditioning controls to large, pretrained text-to-image diffusion models. ControlNet locks the production-ready large diffusion models, and reuses their deep and robust encoding layers pretrained with billions of images as a strong backbone to learn a diverse set of conditional controls. The neural architecture is connected with "zero convolutions" (zero-initialized convolution layers) that progressively grow the parameters from zero and ensure that no harmful noise could affect the finetuning. We test various conditioning controls, e.g., edges, depth, segmentation, human pose, etc., with Stable Diffusion, using single or multiple conditions, with or without prompts. We show that the training of ControlNets is robust with small (<50k) and large (>1m) datasets. Extensive results show that ControlNet may facilitate wider applications to control image diffusion models. + + + + 3D Instance Segmentation via Enhanced Spatial and Semantic Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Al_Khatib_3D_Instance_Segmentation_via_Enhanced_Spatial_and_Semantic_Supervision_ICCV_2023_paper.pdf + 3D instance segmentation has recently garnered increased attention. Typical deep learning methods adopt point grouping schemes followed by hand-designed geometric clustering. Inspired by the success of transformers for various 3D tasks, newer hybrid approaches have utilized transformer decoders coupled with convolutional backbones that operate on voxelized scenes. However, due to the nature of sparse feature backbones, the extracted features provided to the transformer decoder are lacking in spatial understanding. Thus, such approaches often predict spatially separate objects as single instances. To this end, we introduce a novel approach for 3D point clouds instance segmentation that addresses the challenge of generating distinct instance masks for objects that share similar appearances but are spatially separated. Our method leverages spatial and semantic supervision with query refinement to improve the performance of hybrid 3D instance segmentation models. Specifically, we provide the transformer block with spatial features to facilitate differentiation between similar object queries and incorporate semantic supervision to enhance prediction accuracy based on object class. Our proposed approach outperforms existing methods on the validation sets of ScanNet V2 and ScanNet200 datasets, establishing a new state-of-the-art for this task. + + + + Unleashing Text-to-Image Diffusion Models for Visual Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Unleashing_Text-to-Image_Diffusion_Models_for_Visual_Perception_ICCV_2023_paper.pdf + Diffusion models (DMs) have become the new trend of generative models and have demonstrated a powerful ability of conditional synthesis. Among those, text-to-image diffusion models pre-trained on large-scale image-text pairs are highly controllable by customizable prompts. 
Unlike the unconditional generative models that focus on low-level attributes and details, text-to-image diffusion models contain more high-level knowledge thanks to the vision-language pre-training. In this paper, we propose VPD (Visual Perception with pre-trained Diffusion models), a new framework that exploits the semantic information of a pre-trained text-to-image diffusion model in visual perception tasks. Instead of using the pre-trained denoising autoencoder in a diffusion-based pipeline, we simply use it as a backbone and aim to study how to take full advantage of the learned knowledge. Specifically, we prompt the denoising decoder with proper textual inputs and refine the text features with an adapter, leading to a better alignment to the pre-trained stage and making the visual contents interact with the text prompts. We also propose to utilize the cross-attention maps between the visual features and the text features to provide explicit guidance. Compared with other pre-training methods, we show that vision-language pre-trained diffusion models can be faster adapted to downstream visual perception tasks using the proposed VPD. Extensive experiments on semantic segmentation, referring image segmentation, and depth estimation demonstrate the effectiveness of our method. Notably, VPD attains 0.254 RMSE on NYUv2 depth estimation and 73.3% oIoU on RefCOCO-val referring image segmentation, establishing new records on these two benchmarks. Code is available at https://github.com/wl-zhao/VPD. + + + + Transferable Adversarial Attack for Both Vision Transformers and Convolutional Networks via Momentum Integrated Gradients + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Transferable_Adversarial_Attack_for_Both_Vision_Transformers_and_Convolutional_Networks_ICCV_2023_paper.pdf + Visual Transformers (ViTs) and Convolutional Neural Networks (CNNs) are the two primary backbone structures extensively used in various vision tasks. Generating transferable adversarial examples for ViTs is difficult due to ViTs' superior robustness, while transferring adversarial examples across ViTs and CNNs is even harder, since their structures and mechanisms for processing images are fundamentally distinct. In this work, we propose a novel attack method named Momentum Integrated Gradients (MIG), which not only attacks ViTs with high success rate, but also exhibits impressive transferability across ViTs and CNNs. Specifically, we use integrated gradients rather than gradients to steer the generation of adversarial perturbations, inspired by the observation that integrated gradients of images demonstrate higher similarity across models in comparison to regular gradients. Then we acquire the accumulated gradients by combining the integrated gradients from previous iterations with the current ones in a momentum manner and use their sign to modify the perturbations iteratively. We conduct extensive experiments to demonstrate that adversarial examples obtained using MIG show stronger transferability, resulting in significant improvements over state-of-the-art methods for both CNN and ViT models. + + + + Adaptive Image Anonymization in the Context of Image Classification with Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Shvai_Adaptive_Image_Anonymization_in_the_Context_of_Image_Classification_with_ICCV_2023_paper.pdf + Deep learning based methods have become the de-facto standard for various computer vision tasks. 
Nevertheless, they have repeatedly shown their vulnerability to various forms of input perturbations, such as pixel modification, region anonymization, etc., which are closely related to adversarial attacks. This research particularly addresses the case of image anonymization, which is important for preserving privacy and hence for securing digitized personal information from being exposed and potentially misused by the different services that capture it for various purposes. However, anonymization causes the classifier to provide different class decisions before and after it is applied, and therefore reduces the classifier's reliability and usability. In order to achieve a robust solution to this problem, we propose a novel anonymization procedure that allows existing classifiers to become class-decision invariant on the anonymized images without requiring any modification to the classification models. We conduct numerous experiments on the popular ImageNet benchmark as well as on a large-scale industrial toll classification dataset. The obtained results confirm the efficiency and effectiveness of the proposed method: it obtains a 0% rate of class decision change for both datasets, compared to 15.95% on ImageNet and 0.18% on the toll dataset obtained by applying naive anonymization approaches. Moreover, it has shown great potential to be applied to similar problems from different domains. + + + + Efficient Neural Supersampling on a Novel Gaming Dataset http://openaccess.thecvf.com//content/ICCV2023/papers/Mercier_Efficient_Neural_Supersampling_on_a_Novel_Gaming_Dataset_ICCV_2023_paper.pdf Real-time rendering for video games has become increasingly challenging due to the need for higher resolutions, framerates and photorealism. Supersampling has emerged as an effective solution to address this challenge. Our work introduces a novel neural algorithm for supersampling rendered content that is 4x more efficient than existing methods while maintaining the same level of accuracy. Additionally, we introduce a new dataset which provides auxiliary modalities such as motion vectors and depth generated using graphics rendering features like viewport jittering and mipmap biasing at different resolutions. We believe that this dataset fills a gap in the current dataset landscape and can serve as a valuable resource to help measure progress in the field and advance the state-of-the-art in super-resolution techniques for gaming content. + + + + Walking Your LiDOG: A Journey Through Multiple Domains for LiDAR Semantic Segmentation http://openaccess.thecvf.com//content/ICCV2023/papers/Saltori_Walking_Your_LiDOG_A_Journey_Through_Multiple_Domains_for_LiDAR_ICCV_2023_paper.pdf The ability to deploy robots that can operate safely in diverse environments is crucial for developing embodied intelligent agents. As a community, we have made tremendous progress in within-domain LiDAR semantic segmentation. However, do these methods generalize across domains? To answer this question, we design the first experimental setup for studying domain generalization (DG) for LiDAR semantic segmentation (DG-LSS). Our results confirm a significant gap between methods, evaluated in a cross-domain setting: for example, a model trained on the source dataset (SemanticKITTI) obtains 26.53 mIoU on the target data, compared to 48.49 mIoU obtained by the model trained on the target domain (nuScenes).
To tackle this gap, we propose the first method specifically designed for DG-LSS, which obtains 34.88 mIoU on the target domain, outperforming all baselines. Our method augments a sparse-convolutional encoder-decoder 3D segmentation network with an additional, dense 2D convolutional decoder that learns to classify a bird's-eye view of the point cloud. This simple auxiliary task encourages the 3D network to learn features that are robust to shifts in sensor placement and resolution, and are transferable across domains. With this work, we aim to inspire the community to develop and evaluate future models in such cross-domain conditions. + + + + Explore and Tell: Embodied Visual Captioning in 3D Environments http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Explore_and_Tell_Embodied_Visual_Captioning_in_3D_Environments_ICCV_2023_paper.pdf While current visual captioning models have achieved impressive performance, they often assume that the image is well-captured and provides a complete view of the scene. In real-world scenarios, however, a single image may not offer a good viewpoint, hindering fine-grained scene understanding. To overcome this limitation, we propose a novel task called Embodied Captioning, which equips visual captioning models with navigation capabilities, enabling them to actively explore the scene and reduce visual ambiguity from suboptimal viewpoints. Specifically, starting at a random viewpoint, an agent must navigate the environment to gather information from different viewpoints and generate a comprehensive paragraph describing all objects in the scene. To support this task, we build the ET-Cap dataset with the Kubric simulator, consisting of 10K 3D scenes with cluttered objects and three annotated paragraphs per scene. We propose a Cascade Embodied Captioning model (CaBOT), which comprises a navigator and a captioner, to tackle this task. The navigator predicts which actions to take in the environment, while the captioner generates a paragraph description based on the whole navigation trajectory. Extensive experiments demonstrate that our model outperforms other carefully designed baselines. Our dataset, codes and models are available at https://aim3-ruc.github.io/ExploreAndTell. + + + + FastViT: A Fast Hybrid Vision Transformer Using Structural Reparameterization http://openaccess.thecvf.com//content/ICCV2023/papers/Vasu_FastViT_A_Fast_Hybrid_Vision_Transformer_Using_Structural_Reparameterization_ICCV_2023_paper.pdf The recent amalgamation of transformer and convolutional designs has led to steady improvements in the accuracy and efficiency of models. In this work, we introduce FastViT, a hybrid vision transformer architecture that obtains the state-of-the-art latency-accuracy trade-off. To this end, we introduce a novel token mixing operator, RepMixer, a building block of FastViT, that uses structural reparameterization to lower the memory access cost by removing skip-connections in the network. We further apply train-time overparametrization and large kernel convolutions to boost accuracy and empirically show that these choices have minimal effect on latency. We show that our model is 3.5x faster than CMT, a recent state-of-the-art hybrid transformer architecture, 4.9x faster than EfficientNet, and 1.9x faster than ConvNeXt on a mobile device for the same accuracy on the ImageNet dataset. At similar latency, our model obtains 4.2% better Top-1 accuracy on ImageNet than MobileOne.
Our model consistently outperforms competing architectures across several tasks (image classification, detection, segmentation and 3D mesh regression) with significant improvements in latency on both a mobile device and a desktop GPU. Furthermore, our model is highly robust to out-of-distribution samples and corruptions, improving over competing robust models. Code and models are available at: https://github.com/apple/ml-fastvit + + + + OFVL-MS: Once for Visual Localization across Multiple Indoor Scenes http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_OFVL-MS_Once_for_Visual_Localization_across_Multiple_Indoor_Scenes_ICCV_2023_paper.pdf In this work, we seek to predict camera poses across scenes in a multi-task learning manner, where we view the localization of each scene as a new task. We propose OFVL-MS, a unified framework that dispenses with the traditional practice of training a model for each individual scene and relieves gradient conflict induced by optimizing multiple scenes collectively, enabling efficient storage yet precise visual localization for all scenes. Technically, in the forward pass of OFVL-MS, we design a layer-adaptive sharing policy with a learnable score for each layer to automatically determine whether the layer is shared or not. Such a sharing policy empowers us to acquire task-shared parameters for a reduction of storage cost and task-specific parameters for learning scene-related features to alleviate gradient conflict. In the backward pass of OFVL-MS, we introduce a gradient normalization algorithm that homogenizes the gradient magnitude of the task-shared parameters so that all tasks converge at the same pace. Furthermore, a sparse penalty loss is applied on the learnable scores to facilitate parameter sharing for all tasks without performance degradation. We conduct comprehensive experiments on multiple benchmarks and our newly released indoor dataset LIVL, showing that the OFVL-MS families significantly outperform the state of the art with fewer parameters. We also verify that OFVL-MS can generalize to a new scene with far fewer parameters while gaining superior localization performance. The proposed dataset and evaluation code are available at https://github.com/mooncake199809/UFVL-Net. + + + + Inter-Realization Channels: Unsupervised Anomaly Detection Beyond One-Class Classification http://openaccess.thecvf.com//content/ICCV2023/papers/McIntosh_Inter-Realization_Channels_Unsupervised_Anomaly_Detection_Beyond_One-Class_Classification_ICCV_2023_paper.pdf Unsupervised anomaly detection and localization in images is a challenging problem, leading previous methods to attempt an easier supervised one-class classification formalization. Assuming training images to be realizations of the underlying image distribution, it follows that nominal patches from these realizations will be well associated between and represented across realizations. From this, we propose Inter-Realization Channels (InReaCh), a fully unsupervised method of detecting and localizing anomalies. InReaCh extracts high-confidence nominal patches from training data by associating them between realizations into channels, only considering channels with high spans and low spread as nominal. We then create our nominal model from the patches of these channels to test new patches against.
InReaCh extracts nominal patches from the MVTec AD dataset with 99.9% precision, then achieves 0.968 AUROC in localization and 0.923 AUROC in detection with corrupted training data, competitive with current state-of-the-art supervised one-class classification methods. We test our model with up to 40% of the training data containing anomalies, with negligibly affected performance. The shift to fully unsupervised training simplifies dataset creation and broadens possible applications. + + + + High Quality Entity Segmentation http://openaccess.thecvf.com//content/ICCV2023/papers/Qi_High_Quality_Entity_Segmentation_ICCV_2023_paper.pdf Dense image segmentation tasks (e.g., semantic, panoptic) are useful for image editing, but existing methods can hardly generalize well in an in-the-wild setting where there are unrestricted image domains, classes, and image resolution & quality variations. Motivated by these observations, we construct a new entity segmentation dataset, with a strong focus on high-quality dense segmentation in the wild. The dataset contains images spanning diverse image domains and entities, along with plentiful high-resolution images and high-quality mask annotations for training and testing. Given the high-quality and high-resolution nature of the dataset, we propose CropFormer, which is designed to tackle the intractability of instance-level segmentation on high-resolution images. It improves mask prediction by fusing high-res image crops, which provide more fine-grained image details, and the full image. CropFormer is the first query-based Transformer architecture that can effectively fuse mask predictions from multiple image views, by learning queries that effectively associate the same entities across the full image and its crop. With CropFormer, we achieve a significant AP gain of 1.9 on the challenging entity segmentation task. Furthermore, CropFormer consistently improves the accuracy of traditional segmentation tasks and datasets. The dataset and code are released at http://luqi.info/entityv2.github.iol + + + + CoTDet: Affordance Knowledge Prompting for Task Driven Object Detection http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_CoTDet_Affordance_Knowledge_Prompting_for_Task_Driven_Object_Detection_ICCV_2023_paper.pdf Task driven object detection aims to detect object instances suitable for affording a task in an image. Its challenge lies in the object categories available for the task being too diverse to be limited to the closed object vocabulary of traditional object detection. Simply mapping categories and visual features of common objects to the task cannot address the challenge. In this paper, we propose to explore fundamental affordances rather than object categories, i.e., common attributes that enable different objects to accomplish the same task. Moreover, we propose a novel multi-level chain-of-thought prompting (MLCoT) to extract the affordance knowledge from large language models, which contains multi-level reasoning steps from task to object examples to essential visual attributes with rationales. Furthermore, to fully exploit knowledge to benefit object recognition and localization, we propose a knowledge-conditional detection framework, namely CoTDet. It conditions the detector on the knowledge to generate object queries and regress boxes.
Experimental results demonstrate that our CoTDet outperforms state-of-the-art methods consistently and significantly (+15.6 box AP and +14.8 mask AP) and can generate rationales for why objects are detected to afford the task. + + + + Rendering Humans from Object-Occluded Monocular Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiang_Rendering_Humans_from_Object-Occluded_Monocular_Videos_ICCV_2023_paper.pdf + 3D understanding and rendering of moving humans from monocular videos is a challenging task. Although recent progress has enabled this task to some extent, it is still difficult to guarantee satisfactory results in real-world scenarios, where obstacles may block the camera view and cause partial occlusions in the captured videos. Existing methods cannot handle such defects due to two reasons. Firstly, the standard rendering strategy relies on point-point mapping, which could lead to dramatic disparities between the visible and occluded areas of the body. Secondly, the naive direct regression approach does not consider any feasibility criteria (i.e., prior information) for rendering under occlusions. To tackle the above drawbacks, we present OccNeRF, a neural rendering method that achieves better rendering of humans in severely occluded scenes. As direct solutions to the two drawbacks, we propose surface-based rendering by integrating geometry and visibility priors. We validate our method on both simulated and real-world occlusions and demonstrate our method's superiority. + + + + Out-of-Distribution Detection for Monocular Depth Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Hornauer_Out-of-Distribution_Detection_for_Monocular_Depth_Estimation_ICCV_2023_paper.pdf + In monocular depth estimation, uncertainty estimation approaches mainly target the data uncertainty introduced by image noise. In contrast to prior work, we address the uncertainty due to lack of knowledge, which is relevant for the detection of data not represented by the training distribution, the so-called out-of-distribution (OOD) data. Motivated by anomaly detection, we propose to detect OOD images from an encoder-decoder depth estimation model based on the reconstruction error. Given the features extracted with the fixed depth encoder, we train an image decoder for image reconstruction using only in-distribution data. Consequently, OOD images result in a high reconstruction error, which we use to distinguish between in- and out-of-distribution samples. We built our experiments on the standard NYU Depth V2 and KITTI benchmarks as in-distribution data. Our post hoc method performs astonishingly well on different models and outperforms existing uncertainty estimation approaches without modifying the trained encoder-decoder depth estimation model. + + + + LLM-Planner: Few-Shot Grounded Planning for Embodied Agents with Large Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_LLM-Planner_Few-Shot_Grounded_Planning_for_Embodied_Agents_with_Large_Language_ICCV_2023_paper.pdf + This study focuses on using large language models (LLMs) as a planner for embodied agents that can follow natural language instructions to complete complex tasks in a visually-perceived environment. The high data cost and poor sample efficiency of existing methods hinders the development of versatile agents that are capable of many tasks and can learn new tasks quickly. 
In this work, we propose a novel method, LLM-Planner, that harnesses the power of large language models to do few-shot planning for embodied agents. We further propose a simple but effective way to enhance LLMs with physical grounding to generate and update plans that are grounded in the current environment. Experiments on the ALFRED dataset show that our method can achieve very competitive few-shot performance: Despite using less than 0.5% of paired training data, LLM-Planner achieves competitive performance with recent baselines that are trained using the full training data. Existing methods can barely complete any task successfully under the same few-shot setting. Our work opens the door for developing versatile and sample-efficient embodied agents that can quickly learn many tasks. + + + + Exploring Model Transferability through the Lens of Potential Energy + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Exploring_Model_Transferability_through_the_Lens_of_Potential_Energy_ICCV_2023_paper.pdf + Transfer learning has become crucial in computer vision tasks due to the vast availability of pre-trained deep learning models. However, selecting the optimal pre-trained model from a diverse pool for a specific downstream task remains a challenge. Existing methods for measuring the transferability of pre-trained models rely on statistical correlations between encoded static features and task labels, but they overlook the impact of underlying representation dynamics during fine-tuning, leading to unreliable results, especially for self-supervised models. In this paper, we present an insightful physics-inspired approach named PED to address these challenges. We reframe the challenge of model selection through the lens of potential energy and directly model the interaction forces that influence fine-tuning dynamics. By capturing the motion of dynamic representations to decline the potential energy within a force-driven physical model, we can acquire an enhanced and more stable observation for estimating transferability. The experimental results on 10 downstream tasks and 12 self-supervised models demonstrate that our approach can seamlessly integrate into existing ranking techniques and enhance their performances, revealing its effectiveness for the model selection task and its potential for understanding the mechanism in transfer learning. Code is available at https://github.com/lixiaotong97/PED. + + + + Diff-Retinex: Rethinking Low-light Image Enhancement with A Generative Diffusion Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Yi_Diff-Retinex_Rethinking_Low-light_Image_Enhancement_with_A_Generative_Diffusion_Model_ICCV_2023_paper.pdf + In this paper, we rethink the low-light image enhancement task and propose a physically explainable and generative diffusion model for low-light image enhancement, termed as Diff-Retinex. We aim to integrate the advantages of the physical model and the generative network. Furthermore, we hope to supplement and even deduce the information missing in the low-light image through the generative network. Therefore, Diff-Retinex formulates the low-light image enhancement problem into Retinex decomposition and conditional image generation. In the Retinex decomposition, we integrate the superiority of attention in Transformer and meticulously design a Retinex Transformer decomposition network (TDN) to decompose the image into illumination and reflectance maps. 
Then, we design multi-path generative diffusion networks to reconstruct the normal-light Retinex probability distribution and solve the various degradations in these components respectively, including dark illumination, noise, color deviation, loss of scene contents, etc. Owing to the generative diffusion model, Diff-Retinex puts the restoration of low-light subtle detail into practice. Extensive experiments conducted on real-world low-light datasets qualitatively and quantitatively demonstrate the effectiveness, superiority, and generalization of the proposed method. + + + + Bird's-Eye-View Scene Graph for Vision-Language Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Birds-Eye-View_Scene_Graph_for_Vision-Language_Navigation_ICCV_2023_paper.pdf + Vision-language navigation (VLN), which requires an agent to navigate 3D environments following human instructions, has shown great advances. However, current agents are built upon panoramic observations, which hinders their ability to perceive 3D scene geometry and easily leads to ambiguous selection of panoramic views. To address these limitations, we present a BEV Scene Graph (BSG), which leverages multi-step BEV representations to encode scene layouts and geometric cues of the indoor environment under the supervision of 3D detection. During navigation, BSG builds a local BEV representation at each step and maintains a BEV-based global scene map, which stores and organizes all the online collected local BEV representations according to their topological relations. Based on BSG, the agent predicts a local BEV grid-level decision score and a global graph-level decision score, combined with a subview selection score on panoramic views, for more accurate action prediction. Our approach significantly outperforms state-of-the-art methods on REVERIE, R2R, and R4R, showing the potential of BEV perception in VLN. + + + + PVT++: A Simple End-to-End Latency-Aware Visual Tracking Framework + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_PVT_A_Simple_End-to-End_Latency-Aware_Visual_Tracking_Framework_ICCV_2023_paper.pdf + Visual object tracking is essential to intelligent robots. Most existing approaches have ignored the online latency that can cause severe performance degradation during real-world processing. Especially for unmanned aerial vehicles (UAVs), where robust tracking is more challenging and onboard computation is limited, the latency issue can be fatal. In this work, we present a simple framework for end-to-end latency-aware tracking, i.e., end-to-end predictive visual tracking (PVT++). Unlike existing solutions that naively append Kalman Filters after trackers, PVT++ can be jointly optimized, so that it not only takes motion information into account but can also leverage the rich visual knowledge in most pre-trained tracker models for robust prediction. Besides, to bridge the training-evaluation domain gap, we propose a relative motion factor, empowering PVT++ to generalize to the challenging and complex UAV tracking scenes. These careful designs have made the small-capacity lightweight PVT++ a widely effective solution. Additionally, this work presents an extended latency-aware evaluation benchmark for assessing an any-speed tracker in the online setting. Empirical results on a robotic platform from the aerial perspective show that PVT++ can achieve significant performance gain on various trackers and exhibit higher accuracy than prior solutions, largely mitigating the degradation brought by latency. Our code will be made public.
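As context for the latency-aware tracking entry above, the baseline it contrasts with (appending a Kalman filter after a tracker) can be sketched in a few lines. The snippet below is a generic constant-velocity predictor that extrapolates a box center by the expected processing latency; it is not the PVT++ method itself, and the state layout and noise values are illustrative assumptions.

```python
import numpy as np

# Minimal constant-velocity Kalman predictor for a 2D box center.
# A sketch of the "Kalman filter appended after a tracker" baseline that the
# PVT++ abstract argues against; NOT the PVT++ method. State = [x, y, vx, vy].
class ConstantVelocityKF:
    def __init__(self, q=1e-2, r=1e-1):
        self.x = np.zeros(4)            # state estimate
        self.P = np.eye(4)              # state covariance
        self.Q = q * np.eye(4)          # process noise (assumed value)
        self.R = r * np.eye(2)          # measurement noise (assumed value)
        self.H = np.array([[1, 0, 0, 0],
                           [0, 1, 0, 0]], dtype=float)

    def _F(self, dt):
        F = np.eye(4)
        F[0, 2] = F[1, 3] = dt          # constant-velocity transition
        return F

    def update(self, center, dt):
        """Fuse an (x, y) measurement produced after a frame interval dt."""
        F = self._F(dt)
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + self.Q
        z = np.asarray(center, dtype=float)
        S = self.H @ self.P @ self.H.T + self.R
        K = self.P @ self.H.T @ np.linalg.inv(S)
        self.x = self.x + K @ (z - self.H @ self.x)
        self.P = (np.eye(4) - K @ self.H) @ self.P

    def predict_ahead(self, latency):
        """Extrapolate the center by the expected latency (state unchanged)."""
        return (self._F(latency) @ self.x)[:2]
```

A tracker would call update() with each detected center and the measured frame interval, then read predict_ahead(latency) as the latency-compensated output.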
+ + + + Supervised Homography Learning with Realistic Dataset Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Supervised_Homography_Learning_with_Realistic_Dataset_Generation_ICCV_2023_paper.pdf + In this paper, we propose an iterative framework, which consists of two phases: a generation phase and a training phase, to generate realistic training data and yield a supervised homography network. In the generation phase, given an unlabeled image pair, we utilize the pre-estimated dominant plane masks and homography of the pair, along with another sampled homography that serves as ground truth, to generate a new labeled training pair with realistic motion. In the training phase, the generated data is used to train the supervised homography network, in which the training data is refined via a content consistency module and a quality assessment module. Once an iteration is finished, the trained network is used in the next data generation phase to update the pre-estimated homography. Through such an iterative strategy, the quality of the dataset and the performance of the network can be gradually and simultaneously improved. Experimental results show that our method achieves state-of-the-art performance and existing supervised methods can also be improved based on the generated dataset. Code and dataset are available at https://github.com/JianghaiSCU/RealSH. + + + + E2E-LOAD: End-to-End Long-form Online Action Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_E2E-LOAD_End-to-End_Long-form_Online_Action_Detection_ICCV_2023_paper.pdf + Recently, feature-based methods for Online Action Detection (OAD) have been gaining traction. However, these methods are constrained by their fixed backbone design, which fails to leverage the potential benefits of a trainable backbone. This paper introduces an end-to-end learning network that revises these approaches, incorporating a backbone network design that improves effectiveness and efficiency. Our proposed model utilizes a shared initial spatial model for all frames and maintains an extended sequence cache, which enables low-cost inference. We promote an asymmetric spatiotemporal model that caters to long-form and short-form modeling. Additionally, we propose an innovative and efficient inference mechanism that accelerates extensive spatiotemporal exploration. Through comprehensive ablation studies and experiments, we validate the performance and efficiency of our proposed method. Remarkably, we achieve an end-to-end learning OAD of 17.3 (+12.6) FPS with 72.4% (+1.2%), 90.3% (+0.7%), and 48.1% (+26.0%) mAP on THUMOS'14, TVSeries, and HDD, respectively. + + + + Self-supervised Monocular Depth Estimation: Let's Talk About The Weather + http://openaccess.thecvf.com//content/ICCV2023/papers/Saunders_Self-supervised_Monocular_Depth_Estimation_Lets_Talk_About_The_Weather_ICCV_2023_paper.pdf + Current self-supervised depth estimation architectures rely on clear and sunny weather scenes to train deep neural networks. However, in many locations, this assumption is too strong. For example, in the UK (2021), 149 days had rain. For these architectures to be effective in real-world applications, we must create models that can generalise to all weather conditions, times of the day and image qualities. Using a combination of computer graphics and generative models, one can augment existing sunny-weather data in a variety of ways that simulate adverse weather effects.
While it is tempting to use such data augmentations for self-supervised depth, in the past this was shown to degrade performance instead of improving it. In this paper, we put forward a method that uses augmentations to remedy this problem. By exploiting the correspondence between unaugmented and augmented data we introduce a pseudo-supervised loss for both depth and pose estimation. This brings back some of the benefits of supervised learning while still not requiring any labels. We also make a series of practical recommendations which collectively offer a reliable, efficient framework for weather-related augmentation of self-supervised depth from monocular video. We present extensive testing to show that our method, Robust-Depth, achieves SotA performance on the KITTI dataset while significantly surpassing SotA on challenging, adverse condition data such as DrivingStereo, Foggy CityScape and NuScenes-Night. The project website can be found at https://kieran514.github.io/Robust-Depth-Project/. + + + + Fast Neural Scene Flow + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Fast_Neural_Scene_Flow_ICCV_2023_paper.pdf + Neural Scene Flow Prior (NSFP) is of significant interest to the vision community due to its inherent robustness to out-of-distribution (OOD) effects and its ability to deal with dense lidar points. The approach utilizes a coordinate neural network to estimate scene flow at runtime, without any training. However, it is up to 100 times slower than current state-of-the-art learning methods. In other applications such as image, video, and radiance function reconstruction innovations in speeding up the runtime performance of coordinate networks have centered upon architectural changes. In this paper, we demonstrate that scene flow is different---with the dominant computational bottleneck stemming from the loss function itself (i.e., Chamfer distance). Further, we rediscover the distance transform (DT) as an efficient, correspondence-free loss function that dramatically speeds up the runtime optimization. Our fast neural scene flow (FNSF) approach reports for the first time real-time performance comparable to learning methods, without any training or OOD bias on two of the largest open autonomous driving (AV) lidar datasets Waymo Open [62] and Argoverse [8]. + + + + ExposureDiffusion: Learning to Expose for Low-light Image Enhancement + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_ExposureDiffusion_Learning_to_Expose_for_Low-light_Image_Enhancement_ICCV_2023_paper.pdf + Previous raw image-based low-light image enhancement methods predominantly relied on feed-forward neural networks to learn deterministic mappings from low-light to normally-exposed images. However, they failed to capture critical distribution information, leading to visually undesirable results. This work addresses the issue by seamlessly integrating a diffusion model with a physics-based exposure model. Different from a vanilla diffusion model that has to perform Gaussian denoising, with the injected physics-based exposure model, our restoration process can directly start from a noisy image instead of pure noise. As such, our method obtains significantly improved performance and reduced inference time compared with vanilla diffusion models. To make full use of the advantages of different intermediate steps, we further propose an adaptive residual layer that effectively screens out the side-effect in the iterative refinement when the intermediate results have been already well-exposed. 
The proposed framework can work with real-paired datasets, SOTA noise models, and different backbone networks. We evaluate the proposed method on various public benchmarks, achieving promising results with consistent improvements using different exposure models and backbones. Besides, the proposed method achieves better generalization capacity for unseen amplifying ratios and better performance than a larger feedforward neural model when few parameters are adopted. The code is released at https://github.com/wyf0912/ExposureDiffusion. + + + + RefEgo: Referring Expression Comprehension Dataset from First-Person Perception of Ego4D + http://openaccess.thecvf.com//content/ICCV2023/papers/Kurita_RefEgo_Referring_Expression_Comprehension_Dataset_from_First-Person_Perception_of_Ego4D_ICCV_2023_paper.pdf + Grounding textual expressions on scene objects from first-person views is a truly demanding capability in developing agents that are aware of their surroundings and behave following intuitive text instructions. Such capability is necessary for glass devices or autonomous robots to localize referred objects in the real world. In the conventional referring expression comprehension tasks of images, however, datasets are mostly constructed based on web-crawled data and don't reflect the diverse real-world structures involved in grounding textual expressions in diverse objects in the real world. Recently, a massive-scale egocentric video dataset, Ego4D, was proposed. Ego4D covers diverse real-world scenes around the world, including numerous indoor and outdoor situations such as shopping, cooking, walking, talking, manufacturing, etc. Based on egocentric videos of Ego4D, we constructed a broad-coverage video-based referring expression comprehension dataset: RefEgo. Our dataset includes more than 12k video clips and 41 hours of video-based referring expression comprehension annotation. In experiments, we combine state-of-the-art 2D referring expression comprehension models with an object tracking algorithm, achieving video-wise referred object tracking even in difficult conditions: the referred object becomes out-of-frame in the middle of the video or multiple similar objects are presented in the video. + + + + Exploring Temporal Frequency Spectrum in Deep Video Deblurring + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Exploring_Temporal_Frequency_Spectrum_in_Deep_Video_Deblurring_ICCV_2023_paper.pdf + Video deblurring aims to restore the latent video frames from their blurred counterparts. Despite the remarkable progress, most promising video deblurring methods only investigate the temporal priors in the spatial domain and rarely explore their potential in the frequency domain. In this paper, we revisit the blurred sequence in the Fourier space and identify some intrinsic frequency-temporal priors which imply that the temporal blur degradation can be accessibly decoupled in the potential frequency domain. Based on these priors, we propose a novel Fourier-based frequency-temporal video deblurring solution, where the core design accommodates the temporal spectrum to a popular video deblurring pipeline of feature extraction, alignment, aggregation, and optimization. Specifically, we design a Spectrum Prior-guided Alignment module by leveraging enlarged blur information in the potential spectrum to mitigate the blur effects on the alignment.
Then, Temporal Energy prior-driven Aggregation is implemented to replenish the original local features by estimating the temporal spectrum energy as the global sharpness guidance. In addition, the customized frequency loss is devised to optimize the proposed method for decent spectral distribution. Extensive experiments demonstrate that our model performs favorably against other state-of-the-art methods, thus confirming the effectiveness of frequency-temporal prior modeling. + + + + Occ^2Net: Robust Image Matching Based on 3D Occupancy Estimation for Occluded Regions + http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_Occ2Net_Robust_Image_Matching_Based_on_3D_Occupancy_Estimation_for_ICCV_2023_paper.pdf + Image matching is a fundamental and critical task in various visual applications, such as Simultaneous Localization and Mapping (SLAM) and image retrieval, which require accurate pose estimation. However, most existing methods ignore the occlusion relations between objects caused by camera motion and scene structure. In this paper, we propose Occ^2Net, a novel image matching method that models occlusion relations using 3D occupancy and infers matching points in occluded regions. Thanks to the inductive bias encoded in the Occupancy Estimation (OE) module, it greatly simplifies bootstrapping of a multi-view consistent 3D representation that can then integrate information from multiple views. Together with an Occlusion-Aware (OA) module, it incorporates attention layers and rotation alignment to enable matching between occluded and visible points. We evaluate our method on both real-world and simulated datasets and demonstrate its superior performance over state-of-the-art methods on several metrics, especially in occlusion scenarios. + + + + Make-An-Animation: Large-Scale Text-conditional 3D Human Motion Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Azadi_Make-An-Animation_Large-Scale_Text-conditional_3D_Human_Motion_Generation_ICCV_2023_paper.pdf + Text-guided human motion generation has drawn significant interest because of its impactful applications spanning animation and robotics. Recently, application of diffusion models for motion generation has enabled improvements in the quality of generated motions. However, existing approaches are limited by their reliance on relatively small-scale motion capture data, leading to poor performance on more diverse, in-the-wild prompts. In this paper, we introduce Make-An-Animation, a text-conditioned human motion generation model which learns more diverse poses and prompts from large-scale image-text datasets, enabling significant improvement in performance over prior works. Make-An-Animation is trained in two stages. First, we train on a curated large-scale dataset of (text, static pseudo-pose) pairs extracted from image-text datasets. Second, we fine-tune on motion capture data, adding additional layers to model the temporal dimension. Unlike prior diffusion models for motion generation, Make-An-Animation uses a U-Net architecture similar to recent text-to-video generation models. Human evaluation of motion realism and alignment with input text shows that our model reaches state-of-the-art performance on text-to-motion generation. 
+ + + + AerialVLN: Vision-and-Language Navigation for UAVs + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_AerialVLN_Vision-and-Language_Navigation_for_UAVs_ICCV_2023_paper.pdf + Recently emerged Vision-and-Language Navigation (VLN) tasks have drawn significant attention in both the computer vision and natural language processing communities. Existing VLN tasks are built for agents that navigate on the ground, either indoors or outdoors. However, many tasks require intelligent agents to operate in the sky, such as UAV-based goods delivery, traffic/security patrol, and scenery tours, to name a few. Navigating in the sky is more complicated than on the ground because agents need to consider the flying height and more complex spatial relationship reasoning. To fill this gap and facilitate research in this field, we propose a new task named AerialVLN, which is UAV-based and targets outdoor environments. We develop a 3D simulator rendered by near-realistic pictures of 25 city-level scenarios. Our simulator supports continuous navigation, environment extension and configuration. We also propose an extended baseline model based on the widely-used cross-modal alignment (CMA) navigation methods. We find that there is still a significant gap between the baseline model and human performance, which suggests that AerialVLN is a challenging new task. + + + + On the Robustness of Open-World Test-Time Training: Self-Training with Dynamic Prototype Expansion + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_On_the_Robustness_of_Open-World_Test-Time_Training_Self-Training_with_Dynamic_ICCV_2023_paper.pdf + Generalizing deep learning models to an unknown target domain distribution with low latency has motivated research into test-time training/adaptation (TTT/TTA). Existing approaches often focus on improving test-time training performance under well-curated target domain data. As shown in this work, many state-of-the-art methods fail to maintain their performance when the target domain is contaminated with strong out-of-distribution (OOD) data, a.k.a. open-world test-time training (OWTTT). The failure is mainly due to the inability to distinguish strong OOD samples from regular weak OOD samples. To improve the robustness of OWTTT, we first develop an adaptive strong OOD pruning method, which improves the efficacy of the self-training TTT method. We further propose a way to dynamically expand the prototypes to represent strong OOD samples for an improved weak/strong OOD data separation. Finally, we regularize self-training with distribution alignment, and the combination yields state-of-the-art performance on 5 OWTTT benchmarks. The code is available at https://github.com/Yushu-Li/OWTTT. + + + + Self-supervised Learning to Bring Dual Reversed Rolling Shutter Images Alive + http://openaccess.thecvf.com//content/ICCV2023/papers/Shang_Self-supervised_Learning_to_Bring_Dual_Reversed_Rolling_Shutter_Images_Alive_ICCV_2023_paper.pdf + Modern consumer cameras usually employ the rolling shutter (RS) mechanism, where images are captured by scanning scenes row-by-row, yielding RS distortions for dynamic scenes. To correct RS distortions, existing methods adopt a fully supervised learning manner, where high framerate global shutter (GS) images should be collected as ground-truth supervision.
In this paper, we propose a Self-supervised learning framework for Dual reversed RS distortions Correction (SelfDRSC), where a DRSC network can be learned to generate a high framerate GS video based only on dual RS images with reversed distortions. In particular, a bidirectional distortion warping module is proposed for reconstructing dual reversed RS images, and then a self-supervised loss can be deployed to train the DRSC network by enhancing the cycle consistency between the input and reconstructed dual reversed RS images. Besides the start and end RS scanning times, GS images at arbitrary intermediate scanning times can also be supervised in SelfDRSC, thus enabling the learned DRSC network to generate a high framerate GS video. Moreover, a simple yet effective self-distillation strategy is introduced in the self-supervised loss for mitigating boundary artifacts in generated GS images. On the synthetic dataset, SelfDRSC achieves better or comparable quantitative metrics in comparison to state-of-the-art methods trained in a fully supervised manner. On real-world RS cases, our SelfDRSC can produce high framerate GS videos with finer correction textures and better temporal consistency. The source code and trained models are made publicly available at https://github.com/shangwei5/SelfDRSC. + + + + Self-Supervised Monocular Depth Estimation by Direction-aware Cumulative Convolution Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_Self-Supervised_Monocular_Depth_Estimation_by_Direction-aware_Cumulative_Convolution_Network_ICCV_2023_paper.pdf + Monocular depth estimation is known as an ill-posed task in which objects in a 2D image usually do not contain sufficient information to predict their depth. Thus, it acts differently from other tasks (e.g., classification and segmentation) in many ways. In this paper, we find that self-supervised monocular depth estimation shows a direction sensitivity and environmental dependency in the feature representation. However, the current CNN backbones borrowed from other tasks cannot handle different types of environmental information efficiently, limiting the overall depth accuracy. To bridge this gap, we propose a new Direction-aware Cumulative Convolution Network (DaCCN), which improves the depth feature representation in two aspects. First, we propose a direction-aware module, which can learn to adjust the feature extraction in each direction, facilitating the encoding of different types of information. Second, we design a new cumulative convolution to improve the efficiency of aggregating important environmental information. Experiments show that our method achieves significant improvements on three widely used benchmarks and sets a new state-of-the-art performance on the popular benchmarks with all three types of self-supervision. + + + + Few-Shot Common Action Localization via Cross-Attentional Fusion of Context and Temporal Dynamics + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Few-Shot_Common_Action_Localization_via_Cross-Attentional_Fusion_of_Context_and_ICCV_2023_paper.pdf + The goal of this paper is to localize action instances in a long untrimmed query video using only a few trimmed support videos representing a common action whose class information is not given. In this task, it is crucial to mine reliable temporal cues representing the common action from a handful of support videos. In our work, we develop an attention mechanism using cross-correlation.
Based on this cross-attention, we first transform the support videos into query video's context to emphasize query-relevant important frames, and suppress less relevant ones. Next, we summarize sub-sequences of support video frames to represent temporal dynamics in coarse temporal granularity, which is then propagated to the fine-grained support video features through the cross-attention. In each case, the cross-attentions are applied to each support video in the individual-to-all strategy to balance heterogeneity and compatibility of the support videos. In contrast, the candidate instances in the query video are lastly attended by the resulting support video features, at once. In addition, we also develop a relational classifier head based on the query and support video representations. We show the effectiveness of our work with the state-of-the-art (SOTA) performance in benchmark datasets (ActivityNet1.3 and THUMOS14), and analyze each component extensively. + + + + Physically-Plausible Illumination Distribution Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ershov_Physically-Plausible_Illumination_Distribution_Estimation_ICCV_2023_paper.pdf + A camera's auto-white-balance (AWB) module operates under the assumption that there is a single dominant illumination in a captured scene. AWB methods estimate an image's dominant illumination and use it as the target "white point" for correction. However, in natural scenes, there are often many light sources present. We performed a user study that revealed that non-dominant illuminations often produce visually pleasing white-balanced images and, in some cases, are even preferred over the dominant illumination. Motivated by this observation, we revisit AWB to predict a distribution of plausible illuminations for use in white balance. As part of this effort, we extend the Cube++ illumination estimation dataset to provide ground truth illumination distributions per image. Using this new ground truth data, we describe how to train a lightweight neural network method to predict the scene's illumination distribution. We describe how our idea can be used with existing image formats by embedding the estimated distribution in the RAW image to enable users to generate visually plausible white-balance images. + + + + Revisiting Foreground and Background Separation in Weakly-supervised Temporal Action Localization: A Clustering-based Approach + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Revisiting_Foreground_and_Background_Separation_in_Weakly-supervised_Temporal_Action_Localization_ICCV_2023_paper.pdf + Weakly-supervised temporal action localization aims to localize action instances in videos with only video-level action labels. Existing methods mainly embrace a localization-by-classification pipeline that optimizes the snippet-level prediction with a video classification loss. However, this formulation suffers from the discrepancy between classification and detection, resulting in inaccurate separation of foreground and background (F&B) snippets. To alleviate this problem, we propose to explore the underlying structure among the snippets by resorting to unsupervised snippet clustering, rather than heavily relying on the video classification loss. Specifically, we propose a novel clustering-based F&B separation algorithm. 
It comprises two core components: a snippet clustering component that groups the snippets into multiple latent clusters and a cluster classification component that further classifies the cluster as foreground or background. As there are no ground-truth labels to train these two components, we introduce a unified self-labeling mechanism based on optimal transport to produce high-quality pseudo-labels that match several plausible prior distributions. This ensures that the cluster assignments of the snippets can be accurately associated with their F&B labels, thereby boosting the F&B separation. We evaluate our method on three benchmarks: THUMOS14, ActivityNet v1.2 and v1.3. Our method achieves promising performance on all three benchmarks while being significantly more lightweight than previous methods. Code is available at https://github.com/Qinying-Liu/CASE + + + + 3D-Aware Neural Body Fitting for Occlusion Robust 3D Human Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_3D-Aware_Neural_Body_Fitting_for_Occlusion_Robust_3D_Human_Pose_ICCV_2023_paper.pdf + Regression-based methods for 3D human pose estimation directly predict the 3D pose parameters from a 2D image using deep networks. While achieving state-of-the-art performance on standard benchmarks, their performance degrades under occlusion. In contrast, optimization-based methods fit a parametric body model to 2D features in an iterative manner. The localized reconstruction loss can potentially make them robust to occlusion, but they suffer from the 2D-3D ambiguity. Motivated by the recent success of generative models in rigid object pose estimation, we propose 3D-aware Neural Body Fitting (3DNBF) - an approximate analysis-by-synthesis approach to 3D human pose estimation with SOTA performance and occlusion robustness. In particular, we propose a generative model of deep features based on a volumetric human representation with Gaussian ellipsoidal kernels emitting 3D pose-dependent feature vectors. The neural features are trained with contrastive learning to become 3D-aware and hence to overcome the 2D-3D ambiguity. Experiments show that 3DNBF outperforms other approaches on both occluded and standard benchmarks. + + + + Chinese Text Recognition with A Pre-Trained CLIP-Like Model Through Image-IDS Aligning + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Chinese_Text_Recognition_with_A_Pre-Trained_CLIP-Like_Model_Through_Image-IDS_ICCV_2023_paper.pdf + Scene text recognition has been studied for decades due to its broad applications. However, despite Chinese characters possessing different characteristics from Latin characters, such as complex inner structures and large categories, few methods have been proposed for Chinese Text Recognition (CTR). Particularly, the characteristic of large categories poses challenges in dealing with zero-shot and few-shot Chinese characters. In this paper, inspired by the way humans recognize Chinese texts, we propose a two-stage framework for CTR. Firstly, we pre-train a CLIP-like model through aligning printed character images and Ideographic Description Sequences (IDS). This pre-training stage simulates humans recognizing Chinese characters and obtains the canonical representation of each character. Subsequently, the learned representations are employed to supervise the CTR model, such that traditional single-character recognition can be improved to text-line recognition through image-IDS matching. 
To evaluate the effectiveness of the proposed method, we conduct extensive experiments on both Chinese character recognition (CCR) and CTR. The experimental results demonstrate that the proposed method performs best in CCR and outperforms previous methods in most scenarios of the CTR benchmark. It is worth noting that the proposed method can recognize zero-shot Chinese characters in text images without fine-tuning, whereas previous methods require fine-tuning when new classes appear. The code is available at https://github.com/FudanVI/FudanOCR/tree/main/image-ids-CTR. + + + + Exploiting Proximity-Aware Tasks for Embodied Social Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Cancelli_Exploiting_Proximity-Aware_Tasks_for_Embodied_Social_Navigation_ICCV_2023_paper.pdf + Learning how to navigate among humans in an occluded and spatially constrained indoor environment is a key ability required for embodied agents to be integrated into our society. In this paper, we propose an end-to-end architecture that exploits Proximity-Aware Tasks (referred to as Risk and Proximity Compass) to inject into a reinforcement learning navigation policy the ability to infer common-sense social behaviours. To this end, our tasks exploit the notion of immediate and future dangers of collision. Furthermore, we propose an evaluation protocol specifically designed for the Social Navigation Task in simulated environments. This is done to capture fine-grained features and characteristics of the policy by analyzing the minimal unit of human-robot spatial interaction, called an Encounter. We validate our approach on the Gibson4+ and Habitat-Matterport3D datasets. + + + + Hierarchical Contrastive Learning for Pattern-Generalizable Image Corruption Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_Hierarchical_Contrastive_Learning_for_Pattern-Generalizable_Image_Corruption_Detection_ICCV_2023_paper.pdf + Effective image restoration with large-size corruptions, such as blind image inpainting, entails precise detection of corruption region masks, which remains extremely challenging due to the diverse shapes and patterns of corruptions. In this work, we present a novel method for automatic corruption detection, which allows for blind corruption restoration without known corruption masks. Specifically, we develop a hierarchical contrastive learning framework to detect corrupted regions by capturing the intrinsic semantic distinctions between corrupted and uncorrupted regions. In particular, our model detects the corrupted mask in a coarse-to-fine manner by first predicting a coarse mask by contrastive learning in low-resolution feature space and then refining the uncertain area of the mask by high-resolution contrastive learning. A specialized hierarchical interaction mechanism is designed to facilitate the knowledge propagation of contrastive learning across different scales, boosting the modeling performance substantially. The detected multi-scale corruption masks are then leveraged to guide the corruption restoration. Detecting corrupted regions by learning the contrastive distinctions rather than the semantic patterns of corruptions, our model generalizes well across different corruption patterns.
Extensive experiments demonstrate the following merits of our model: 1) the superior performance over other methods on both corruption detection and various image restoration tasks including blind inpainting and watermark removal, and 2) strong generalization across different corruption patterns such as graffiti, random noise or other image content. Codes and trained weights are available at https://github.com/xyfJASON/HCL. + + + + Learning Optical Flow from Event Camera with Rendered Dataset + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_Learning_Optical_Flow_from_Event_Camera_with_Rendered_Dataset_ICCV_2023_paper.pdf + We study the problem of estimating optical flow from event cameras. One important issue is how to build a high-quality event-flow dataset with accurate event values and flow labels. Previous datasets are created by either capturing real scenes by event cameras or synthesizing from images with pasted foreground objects. The former case can produce real event values but with calculated flow labels, which are sparse and inaccurate. The latter case can generate dense flow labels but the interpolated events are prone to errors. In this work, we propose to render a physically correct event-flow dataset using computer graphics models. In particular, we first create indoor and outdoor 3D scenes by Blender with rich scene content variations. Second, diverse camera motions are included for the virtual capturing, producing images and accurate flow labels. Third, we render high-framerate videos between images for accurate events. The rendered dataset can adjust the density of events, based on which we further introduce an adaptive density module (ADM). Experiments show that our proposed dataset can facilitate event-flow learning: previous approaches, when trained on our dataset, consistently improve their performance by a relatively large margin. In addition, event-flow pipelines equipped with our ADM can further improve performance. Our code and dataset will be publicly available. + + + + EPiC: Ensemble of Partial Point Clouds for Robust Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Levi_EPiC_Ensemble_of_Partial_Point_Clouds_for_Robust_Classification_ICCV_2023_paper.pdf + Robust point cloud classification is crucial for real-world applications, as consumer-type 3D sensors often yield partial and noisy data, degraded by various artifacts. In this work, we propose a general ensemble framework, based on partial point cloud sampling. Each ensemble member is exposed to only partial input data. Three sampling strategies are used jointly: two local ones, based on patches and curves, and a global one of random sampling. We demonstrate the robustness of our method to various local and global degradations. We show that our framework significantly improves the robustness of top classification networks by a large margin. Our experimental setting uses the recently introduced ModelNet-C database by Ren et al., where we reach SOTA both on unaugmented and on augmented data. Our unaugmented mean Corruption Error (mCE) is 0.64 (current SOTA is 0.86) and 0.50 for augmented data (current SOTA is 0.57). We analyze and explain these remarkable results through diversity analysis.
Our code is available at: https://github.com/yossilevii100/EPiC + + + + Cross-Modal Learning with 3D Deformable Attention for Action Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Cross-Modal_Learning_with_3D_Deformable_Attention_for_Action_Recognition_ICCV_2023_paper.pdf + An important challenge in vision-based action recognition is the embedding of spatiotemporal features with two or more heterogeneous modalities into a single feature. In this study, we propose a new 3D deformable transformer for action recognition with adaptive spatiotemporal receptive fields and a cross-modal learning scheme. The 3D deformable transformer consists of three attention modules: 3D deformability, local joint stride, and temporal stride attention. The two cross-modal tokens are input into the 3D deformable attention module to create a cross-attention token with a reflected spatiotemporal correlation. Local joint stride attention is applied to spatially combine attention and pose tokens. Temporal stride attention temporally reduces the number of input tokens in the attention module and supports temporal expression learning without the simultaneous use of all tokens. The deformable transformer iterates L times and combines the last cross-modal token for classification. The proposed 3D deformable transformer was tested on the NTU60, NTU120, FineGYM, and PennAction datasets, and showed results better than or similar to pre-trained state-of-the-art methods even without a pre-training process. In addition, by visualizing important joints and correlations during action recognition through spatial joint and temporal stride attention, we demonstrate the potential for explainable action recognition. + + + + Tracking by 3D Model Estimation of Unknown Objects in Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Rozumnyi_Tracking_by_3D_Model_Estimation_of_Unknown_Objects_in_Videos_ICCV_2023_paper.pdf + Most model-free visual object tracking methods formulate the tracking task as object location estimation given by a 2D segmentation or a bounding box in each video frame. We argue that this representation is limited and instead propose to guide and improve 2D tracking with an explicit object representation, namely the textured 3D shape and 6DoF pose in each video frame. Our representation tackles a complex long-term dense correspondence problem between all 3D points on the object for all video frames, including frames where some points are invisible. To achieve that, the estimation is driven by re-rendering the input video frames as well as possible through differentiable rendering, which has not been used for tracking before. The proposed optimization minimizes a novel loss function to estimate the best 3D shape, texture, and 6DoF pose. We improve the state-of-the-art in 2D segmentation tracking on three different datasets with mostly rigid objects. + + + + Sigmoid Loss for Language Image Pre-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhai_Sigmoid_Loss_for_Language_Image_Pre-Training_ICCV_2023_paper.pdf + We propose a simple pairwise sigmoid loss for image-text pre-training. Unlike standard contrastive learning with softmax normalization, the sigmoid loss operates solely on image-text pairs and does not require a global view of the pairwise similarities for normalization. The sigmoid loss simultaneously allows further scaling up the batch size, while also performing better at smaller batch sizes.
With only four TPUv4 chips, we can train a Base CLIP model at 4k batch size and a Large LiT model at 20k batch size, the latter achieves 84.5% ImageNet zero-shot accuracy in two days. This disentanglement of the batch size from the loss further allows us to study the impact of examples vs pairs and negative to positive ratio. Finally, we push the batch size to the extreme, up to one million, and find that the benefits of growing batch size quickly diminish, with a more reasonable batch size of 32k being sufficient. We hope our research motivates further explorations in improving the quality and efficiency of language-image pre-training. + + + + Neural Video Depth Stabilizer + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Neural_Video_Depth_Stabilizer_ICCV_2023_paper.pdf + Video depth estimation aims to infer temporally consistent depth. Some methods achieve temporal consistency by finetuning a single-image depth model during test time using geometry and re-projection constraints, which is inefficient and not robust. An alternative approach is to learn how to enforce temporal consistency from data, but this requires well-designed models and sufficient video depth data. To address these challenges, we propose a plug-and-play framework called Neural Video Depth Stabilizer (NVDS) that stabilizes inconsistent depth estimations and can be applied to different single-image depth models without extra effort. We also introduce a large-scale dataset, Video Depth in the Wild (VDW), which consists of 14,203 videos with over two million frames, making it the largest natural-scene video depth dataset to our knowledge. We evaluate our method on the VDW dataset as well as two public benchmarks and demonstrate significant improvements in consistency, accuracy, and efficiency compared to previous approaches. Our work serves as a solid baseline and provides a data foundation for learning-based video depth models. We will release our dataset and code for future research. + + + + Learning Symmetry-Aware Geometry Correspondences for 6D Object Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Learning_Symmetry-Aware_Geometry_Correspondences_for_6D_Object_Pose_Estimation_ICCV_2023_paper.pdf + Current 6D pose estimation methods focus on handling objects that are previously trained, which limits their applications in real dynamic world. To this end, we propose a geometry correspondence-based framework, termed GCPose, to estimate 6D pose of arbitrary unseen objects without any re-training. Specifically, the proposed method draws the idea from point cloud registration and resorts to object-agnostic geometry features to establish the 3D-3D correspondences between the object-scene point cloud and object-model point cloud. Then the 6D pose parameters are solved by a least-squares fitting algorithm. Taking the symmetry properties of objects into consideration, we design a symmetry-aware matching loss to facilitate the learning of dense point-wise geometry features and improve the performance considerably. Moreover, we introduce an online training data generation with special data augmentation and normalization to empower the network to learn diverse geometry prior. With training on synthetic objects from ShapeNet, our method outperforms previous approaches for unseen object pose estimation by a large margin on T-LESS, LINEMOD, Occluded-LINEMOD, and TUD-L datasets. Code is available at https://github.com/hikvision-research/GCPose. 
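The GCPose entry above reduces 6D pose estimation of unseen objects to least-squares fitting of 3D-3D geometry correspondences. The snippet below is a minimal, generic SVD-based rigid fit (Kabsch-style) of that kind, shown only to illustrate the standard fitting step, not the authors' implementation; the sanity check uses made-up points.

```python
import numpy as np

def rigid_fit(model_pts, scene_pts):
    """Least-squares rigid transform (R, t) mapping model_pts -> scene_pts.

    Generic SVD-based fit from matched 3D-3D correspondences; a sketch of the
    standard least-squares step mentioned in the GCPose abstract, not the
    authors' code. Inputs are (N, 3) arrays of corresponding points.
    """
    P = np.asarray(model_pts, dtype=float)
    Q = np.asarray(scene_pts, dtype=float)
    cp, cq = P.mean(axis=0), Q.mean(axis=0)        # centroids
    H = (P - cp).T @ (Q - cq)                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T                             # enforce det(R) = +1
    t = cq - R @ cp
    return R, t

# Sanity check: recover a known rotation/translation from noiseless points.
rng = np.random.default_rng(0)
P = rng.normal(size=(100, 3))
angle = np.pi / 6
R_true = np.array([[np.cos(angle), -np.sin(angle), 0],
                   [np.sin(angle),  np.cos(angle), 0],
                   [0, 0, 1]])
t_true = np.array([0.1, -0.2, 0.3])
R_est, t_est = rigid_fit(P, P @ R_true.T + t_true)
assert np.allclose(R_est, R_true) and np.allclose(t_est, t_true)
```

In practice such a fit is usually wrapped in RANSAC or weighted by correspondence confidence, but the closed-form core stays the same.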
+ + + + TrackFlow: Multi-Object tracking with Normalizing Flows + http://openaccess.thecvf.com//content/ICCV2023/papers/Mancusi_TrackFlow_Multi-Object_tracking_with_Normalizing_Flows_ICCV_2023_paper.pdf + The field of multi-object tracking has recently seen a renewed interest in the good old schema of tracking-by-detection, as its simplicity and strong priors spare it from the complex design and painful babysitting of tracking-by-attention approaches. In view of this, we aim at extending tracking-by-detection to multi-modal settings, where a comprehensive cost has to be computed from heterogeneous information e.g., 2D motion cues, visual appearance, and pose estimates. More precisely, we follow a case study where a rough estimate of 3D information is also available and must be merged with other traditional metrics (e.g., the IoU). To achieve that, recent approaches resort to either simple rules or complex heuristics to balance the contribution of each cost. However, i) they require careful tuning of tailored hyperparameters on a hold-out set, and ii) they imply these costs to be independent, which does not hold in reality. We address these issues by building upon an elegant probabilistic formulation, which considers the cost of a candidate association as the negative log-likelihood yielded by a deep density estimator, trained to model the conditional joint probability distribution of correct associations. Our experiments, conducted on both simulated and real benchmarks, show that our approach consistently enhances the performance of several tracking-by-detection algorithms. + + + + Generating Instance-level Prompts for Rehearsal-free Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Jung_Generating_Instance-level_Prompts_for_Rehearsal-free_Continual_Learning_ICCV_2023_paper.pdf + We introduce Domain-Adaptive Prompt (DAP), a novel method for continual learning using Vision Transformers (ViT). Prompt-based continual learning has recently gained attention due to its rehearsal-free nature. Currently, the prompt pool, which is suggested by prompt-based continual learning, is key to effectively exploiting the frozen pre-trained ViT backbone in a sequence of tasks. However, we observe that the use of a prompt pool creates a domain scalability problem between pre-training and continual learning. This problem arises due to the inherent encoding of group-level instructions within the prompt pool. To address this problem, we propose DAP, a pool-free approach that generates a suitable prompt in an instance-level manner at inference time. We optimize an adaptive prompt generator that creates instance-specific fine-grained instructions required for each input, enabling enhanced model plasticity and reduced forgetting. Our experiments on seven datasets with varying degrees of domain similarity to ImageNet demonstrate the superiority of DAP over state-of-the-art prompt-based methods. Code is publicly available at https://github.com/naver-ai/dap-cl. + + + + HSE: Hybrid Species Embedding for Deep Metric Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_HSE_Hybrid_Species_Embedding_for_Deep_Metric_Learning_ICCV_2023_paper.pdf + Deep metric learning is crucial for finding an embedding function that can generalize to training and testing data, including unknown test classes. However, limited training samples restrict the model's generalization to downstream tasks. 
While adding new training samples is a promising solution, determining their labels remains a significant challenge. Here, we introduce Hybrid Species Embedding (HSE), which employs mixed sample data augmentations to generate hybrid species and provide additional training signals. We demonstrate that HSE outperforms multiple state-of-the-art methods in improving the metric Recall@K on the CUB-200 , CAR-196 and SOP datasets, thus offering a novel solution to deep metric learning's limitations. + + + + Online Continual Learning on Hierarchical Label Expansion + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Online_Continual_Learning_on_Hierarchical_Label_Expansion_ICCV_2023_paper.pdf + Continual learning (CL) enables models to adapt to new tasks and environments without forgetting previously learned knowledge. While current CL setups have ignored the relationship between labels in the past task and the new task with or without small task overlaps, real-world scenarios often involve hierarchical relationships between old and new tasks, posing another challenge for traditional CL approaches. To address this challenge, we propose a novel multi-level hierarchical class incremental task configuration with an online learning constraint, called hierarchical label expansion (HLE). Our configuration allows a network to first learn coarse-grained classes, with data labels continually expanding to more fine-grained classes in various hierarchy depths. To tackle this new setup, we propose a rehearsal-based method that utilizes hierarchy-aware pseudo-labeling to incorporate hierarchical class information. Additionally, we propose a simple yet effective memory management and sampling strategy that selectively adopts samples of newly encountered classes. Our experiments demonstrate that our proposed method can effectively use hierarchy on our HLE setup to improve classification accuracy across all levels of hierarchies, regardless of depth and class imbalance ratio, outperforming prior state-of-the-art works by significant margins while also outperforming them on the conventional disjoint, blurry and i-Blurry CL setups. + + + + 3D Motion Magnification: Visualizing Subtle Motions from Time-Varying Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_3D_Motion_Magnification_Visualizing_Subtle_Motions_from_Time-Varying_Radiance_Fields_ICCV_2023_paper.pdf + Motion magnification helps us visualize subtle, imperceptible motion. However, prior methods only work for 2D videos captured with a fixed camera. We present a 3D motion magnification method that can magnify subtle motions from scenes captured by a moving camera, while supporting novel view rendering. We represent the scene with time-varying radiance fields and leverage the Eulerian principle for motion magnification to extract and amplify the variation of the embedding of a fixed point over time. We study and validate our proposed principle for 3D motion magnification using both implicit and tri-plane-based radiance fields as our underlying 3D scene representation. We evaluate the effectiveness of our method on both synthetic and real-world scenes captured under various camera setups. 
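The 3D motion magnification entry above builds on the Eulerian principle of amplifying the temporal variation of a signal at a fixed point. The sketch below applies that principle to a per-point embedding sequence using a plain temporal-mean reference, which is a simplifying assumption; the paper operates on time-varying radiance-field embeddings, and a temporal bandpass filter would normally replace the mean.

```python
import numpy as np

def magnify_eulerian(embeddings, alpha=10.0):
    """Amplify subtle temporal variation of a fixed point's embedding.

    embeddings: (T, D) array holding one point's embedding over T time steps.
    Returns the sequence with deviations from the temporal mean scaled by
    (1 + alpha). Illustrative sketch of the Eulerian principle only; a
    temporal bandpass filter would usually replace the plain mean reference.
    """
    x = np.asarray(embeddings, dtype=float)
    reference = x.mean(axis=0, keepdims=True)   # per-dimension temporal mean
    variation = x - reference                   # the subtle motion signal
    return reference + (1.0 + alpha) * variation

# Example: a tiny 1% oscillation becomes clearly visible after magnification.
t = np.linspace(0, 2 * np.pi, 200)
signal = 1.0 + 0.01 * np.sin(8 * t)
magnified = magnify_eulerian(signal[:, None], alpha=20.0)
print(np.ptp(signal), np.ptp(magnified))        # peak-to-peak grows ~(1 + alpha)x
```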
+ + + + Learning Spatial-context-aware Global Visual Feature Representation for Instance Image Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Learning_Spatial-context-aware_Global_Visual_Feature_Representation_for_Instance_Image_Retrieval_ICCV_2023_paper.pdf + In instance image retrieval, considering local spatial information within an image has proven effective to boost retrieval performance, as demonstrated by local visual descriptor based geometric verification. Nevertheless, it will be highly valuable to make ordinary global image representations spatial-context-aware because global representation based image retrieval is appealing thanks to its algorithmic simplicity, low memory cost, and being friendly to sophisticated data structures. To this end, we propose a novel feature learning framework for instance image retrieval, which embeds local spatial context information into the learned global feature representations. Specifically, in parallel to the visual feature branch in a CNN backbone, we design a spatial context branch that consists of two modules called online token learning and distance encoding. For each local descriptor learned in CNN, the former module is used to indicate the types of its surrounding descriptors, while their spatial distribution information is captured by the latter module. After that, the visual feature branch and the spatial context branch are fused to produce a single global feature representation per image. As experimentally demonstrated, with the spatial-context-aware characteristic, we can well improve the performance of global representation based image retrieval while maintaining all of its appealing properties. Our code is available at https://github.com/Zy-Zhang/SpCa + + + + Space-time Prompting for Video Class-incremental Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Pei_Space-time_Prompting_for_Video_Class-incremental_Learning_ICCV_2023_paper.pdf + Recently, prompt-based learning has made impressive progress on image class-incremental learning, but it still lacks sufficient exploration in the video domain. In this paper, we will fill this gap by learning multiple prompts based on a powerful image-language pre-trained model, i.e., CLIP, making it fit for video class-incremental learning (VCIL). For this purpose, we present a space-time prompting approach (ST-Prompt) which contains two kinds of prompts, i.e., task-specific prompts and task-agnostic prompts. The task-specific prompts are to address the catastrophic forgetting problem by learning multi-grained prompts, i.e., spatial prompts, temporal prompts and comprehensive prompts, for accurate task identification. The task-agnostic prompts maintain a globally-shared prompt pool, which can empower the pre-trained image models with temporal perception abilities by exchanging contexts between frames. By this means, ST-Prompt can transfer the plentiful knowledge in the image-language pre-trained models to the VCIL task with only a tiny set of prompts to be optimized. To evaluate ST-Prompt, we conduct extensive experiments on three standard benchmarks. The results show that ST-Prompt can significantly surpass the state-of-the-art VCIL methods, especially it gains 9.06% on HMDB51 dataset under the 1*25 stage setting. 
+ + + + Sparse Sampling Transformer with Uncertainty-Driven Ranking for Unified Removal of Raindrops and Rain Streaks + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Sparse_Sampling_Transformer_with_Uncertainty-Driven_Ranking_for_Unified_Removal_of_ICCV_2023_paper.pdf + In the real world, image degradations caused by rain often exhibit a combination of rain streaks and raindrops, thereby increasing the challenges of recovering the underlying clean image. Note that the rain streaks and raindrops have diverse shapes, sizes, and locations in the captured image, and thus modeling the correlation relationship between irregular degradations caused by rain artifacts is a necessary prerequisite for image deraining. This paper aims to present an efficient and flexible mechanism to learn and model degradation relationships in a global view, thereby achieving a unified removal of intricate rain scenes. To do so, we propose a Sparse Sampling Transformer based on Uncertainty-Driven Ranking, dubbed UDR-S2Former. Compared to previous methods, our UDR-S2Former has three merits. First, it can adaptively sample relevant image degradation information to model underlying degradation relationships. Second, explicit application of the uncertainty-driven ranking strategy can facilitate the network to attend to degradation features and understand the reconstruction process. Finally, experimental results show that our UDR-S2Former clearly outperforms state-of-the-art methods for all benchmarks. + + + + LexLIP: Lexicon-Bottlenecked Language-Image Pre-Training for Large-Scale Image-Text Sparse Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_LexLIP_Lexicon-Bottlenecked_Language-Image_Pre-Training_for_Large-Scale_Image-Text_Sparse_Retrieval_ICCV_2023_paper.pdf + Image-text retrieval (ITR) aims to retrieve images or texts that match a query originating from the other modality. The conventional dense retrieval paradigm relies on encoding images and texts into dense representations with dual-stream encoders. However, this approach is limited by slow retrieval speeds in large-scale scenarios. To address this issue, we propose a novel sparse retrieval paradigm for ITR that exploits sparse representations in the vocabulary space for images and texts. This paradigm enables us to leverage bag-of-words models and efficient inverted indexes, significantly reducing retrieval latency. A critical gap emerges from representing continuous image data in a sparse vocabulary space. To bridge this gap, we introduce a novel pre-training framework, Lexicon-Bottlenecked Language-Image Pre-Training (LexLIP), that learns importance-aware lexicon representations. By using lexicon-bottlenecked modules between the dual-stream encoders and weakened text decoders, we are able to construct continuous bag-of-words bottlenecks and learn lexicon-importance distributions. Upon pre-training with same-scale data, our LexLIP achieves state-of-the-art performance on two ITR benchmarks, MSCOCO and Flickr30k. Furthermore, in large-scale retrieval scenarios, LexLIP outperforms CLIP with 5.8x faster retrieval speed and 19.1x less index storage memory. Beyond this, LexLIP surpasses CLIP across 8 out of 10 zero-shot image classification tasks. + + + + LFS-GAN: Lifelong Few-Shot Image Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Seo_LFS-GAN_Lifelong_Few-Shot_Image_Generation_ICCV_2023_paper.pdf + We address a challenging lifelong few-shot image generation task for the first time. 
In this situation, a generative model learns a sequence of tasks using only a few samples per task. Consequently, the learned model encounters both catastrophic forgetting and overfitting problems at a time. Existing studies on lifelong GANs have proposed modulation-based methods to prevent catastrophic forgetting. However, they require considerable additional parameters and cannot generate high-fidelity and diverse images from limited data. On the other hand, the existing few-shot GANs suffer from severe catastrophic forgetting when learning multiple tasks. To alleviate these issues, we propose a framework called Lifelong Few-Shot GAN (LFS-GAN) that can generate high-quality and diverse images in lifelong few-shot image generation task. Our proposed framework learns each task using an efficient task-specific modulator - Learnable Factorized Tensor (LeFT). LeFT is rank-constrained and has a rich representation ability due to its unique reconstruction technique. Furthermore, we propose a novel mode seeking loss to improve the diversity of our model in low-data circumstances. Extensive experiments demonstrate that the proposed LFS-GAN can generate high-fidelity and diverse images without any forgetting and mode collapse in various domains, achieving state-of-the-art in lifelong few-shot image generation task. Surprisingly, we find that our LFS-GAN even outperforms the existing few-shot GANs in the few-shot image generation task. The code is available at Github. + + + + MixCycle: Mixup Assisted Semi-Supervised 3D Single Object Tracking with Cycle Consistency + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_MixCycle_Mixup_Assisted_Semi-Supervised_3D_Single_Object_Tracking_with_Cycle_ICCV_2023_paper.pdf + 3D single object tracking (SOT) is an indispensable part of automated driving. Existing approaches rely heavily on large, densely labeled datasets. However, annotating point clouds is both costly and time-consuming. Inspired by the great success of cycle tracking in unsupervised 2D SOT, we introduce the first semi-supervised approach to 3D SOT. Specifically, we introduce two cycle-consistency strategies for supervision: 1) Self tracking cycles, which leverage labels to help the model converge better in the early stages of training; 2) forward-backward cycles, which strengthen the tracker's robustness to motion variations and the template noise caused by the template update strategy. Furthermore, we propose a data augmentation strategy named SOTMixup to improve the tracker's robustness to point cloud diversity. SOTMixup generates training samples by sampling points in two point clouds with a mixing rate and assigns a reasonable loss weight for training according to the mixing rate. The resulting MixCycle approach generalizes to appearance matching-based trackers. On the KITTI benchmark, based on the P2B tracker, MixCycle trained with 10% labels outperforms P2B trained with 100% labels, and achieves a 28.4% precision improvement when using 1% labels. Our code will be released at https://github.com/Mumuqiao/MixCycle. 
+ + + + DiffFacto: Controllable Part-Based 3D Point Cloud Generation with Cross Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Nakayama_DiffFacto_Controllable_Part-Based_3D_Point_Cloud_Generation_with_Cross_Diffusion_ICCV_2023_paper.pdf + While the community of 3D point cloud generation has witnessed a big growth in recent years, there still lacks an effective way to enable intuitive user control in the generation process, hence limiting the general utility of such methods. Since an intuitive way of decomposing a shape is through its parts, we propose to tackle the task of controllable part-based point cloud generation. We introduce DiffFacto, a novel probabilistic generative model that learns the distribution of shapes with part-level control. We propose a factorization that models independent part style and part configuration distributions, and present a novel cross diffusion network that enables us to generate coherent and plausible shapes under our proposed factorization. Experiments show that our method is able to generate novel shapes with multiple axes of control. It achieves state-of-the-art part-level generation quality and generates plausible and coherent shape, while enabling various downstream editing applications such as shape interpolation, mixing and transformation editing. Code will be made publicly available. + + + + Spatio-temporal Prompting Network for Robust Video Feature Extraction + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Spatio-temporal_Prompting_Network_for_Robust_Video_Feature_Extraction_ICCV_2023_paper.pdf + The frame quality deterioration problem is one of the main challenges in the field of video understanding. To compensate for the information loss caused by deteriorated frames, recent approaches exploit transformer-based integration modules to obtain spatio-temporal information. However, these integration modules are heavy and complex. Furthermore, each integration module is specifically tailored for its target task, making it difficult to generalise to multiple tasks. In this paper, we present a neat and unified framework, called Spatio-Temporal Prompting Network (STPN). It can efficiently extract robust and accurate video features by dynamically adjusting the input features in the backbone network. Specifically, STPN predicts several video prompts containing spatio-temporal information of neighbour frames. Then, these video prompts are prepended to the patch embeddings of the current frame as the updated input for video feature extraction. Moreover, STPN is easy to generalise to various video tasks because it does not contain task-specific modules. Without bells and whistles, STPN achieves state-of-the-art performance on three widely-used datasets for different video understanding tasks, i.e., ImageNetVID for video object detection, YouTubeVIS for video instance segmentation, and GOT-10k for visual object tracking. Codes are available at https://github.com/guanxiongsun/STPN. + + + + A Simple Vision Transformer for Weakly Semi-supervised 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_A_Simple_Vision_Transformer_for_Weakly_Semi-supervised_3D_Object_Detection_ICCV_2023_paper.pdf + Advanced 3D object detection methods usually rely on large-scale, elaborately labeled datasets to achieve good performance. However, labeling the bounding boxes for the 3D objects is difficult and expensive. 
Although semi-supervised (SS3D) and weakly-supervised 3D object detection (WS3D) methods can effectively reduce the annotation cost, they suffer from two limitations: 1) their performance is far inferior to the fully-supervised counterparts; 2) they are difficult to adapt to different detectors or scenes (e.g., indoor or outdoor). In this paper, we study weakly semi-supervised 3D object detection (WSS3D) with point annotations, where the dataset comprises a small number of fully labeled and massive weakly labeled data with a single point annotated for each 3D object. To fully exploit the point annotations, we employ a plain and non-hierarchical vision transformer to form a point-to-box converter, termed ViT-WSS3D. By modeling global interactions between LiDAR points and corresponding weak labels, our ViT-WSS3D can generate high-quality pseudo-bounding boxes, which are then used to train any 3D detector without exhaustive tuning. Extensive experiments on indoor and outdoor datasets (SUN RGBD and KITTI) show the effectiveness of our method. In particular, when only using 10% fully labeled and the rest as point labeled data, our ViT-WSS3D can enable most detectors to achieve similar performance to the oracle model using 100% fully labeled data. + + + + Open-domain Visual Entity Recognition: Towards Recognizing Millions of Wikipedia Entities + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Open-domain_Visual_Entity_Recognition_Towards_Recognizing_Millions_of_Wikipedia_Entities_ICCV_2023_paper.pdf + Large-scale multi-modal pre-training models such as CLIP and PaLI exhibit strong generalization on various visual domains and tasks. However, existing image classification benchmarks often evaluate recognition on a specific domain (e.g., outdoor images) or a specific task (e.g., classifying plant species), which falls short of evaluating whether pre-trained foundational models are universal visual recognizers. To address this, we formally present the task of Open-domain Visual Entity recognitioN (OVEN), where a model needs to link an image to a Wikipedia entity with respect to a text query. We construct OVEN by re-purposing 14 existing datasets with all labels grounded onto one single label space: Wikipedia entities. OVEN challenges models to select among six million possible Wikipedia entities, making it a general visual recognition benchmark with the largest number of labels. Our study on state-of-the-art pre-trained models reveals large headroom in generalizing to the massive-scale label space. We show that a PaLI-based auto-regressive visual recognition model performs surprisingly well, even on Wikipedia entities that have never been seen during fine-tuning. We also find that existing pre-trained models exhibit distinct strengths: while PaLI-based models obtain higher overall performance, CLIP-based models are better at recognizing tail entities. + + + + A Soft Nearest-Neighbor Framework for Continual Semi-Supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Kang_A_Soft_Nearest-Neighbor_Framework_for_Continual_Semi-Supervised_Learning_ICCV_2023_paper.pdf + Despite significant advances, the performance of state-of-the-art continual learning approaches hinges on the unrealistic scenario of fully labeled data. In this paper, we tackle this challenge and propose an approach for continual semi-supervised learning--a setting where not all the data samples are labeled.
A primary issue in this scenario is the model forgetting representations of unlabeled data and overfitting the labeled samples. We leverage the power of nearest-neighbor classifiers to nonlinearly partition the feature space and flexibly model the underlying data distribution thanks to its non-parametric nature. This enables the model to learn a strong representation for the current task, and distill relevant information from previous tasks. We perform a thorough experimental evaluation and show that our method outperforms all the existing approaches by large margins, setting a solid state of the art on the continual semi-supervised learning paradigm. For example, on CIFAR-100 we surpass several others even when using at least 30 times less supervision (0.8% vs. 25% of annotations). Finally, our method works well on both low and high resolution images and scales seamlessly to more complex datasets such as ImageNet-100. + + + + Minimal Solutions to Uncalibrated Two-view Geometry with Known Epipoles + http://openaccess.thecvf.com//content/ICCV2023/papers/Nakano_Minimal_Solutions_to_Uncalibrated_Two-view_Geometry_with_Known_Epipoles_ICCV_2023_paper.pdf + This paper proposes minimal solutions to uncalibrated two-view geometry with known epipoles. Exploiting the epipoles, we can reduce the number of point correspondences needed to find the fundamental matrix together with the intrinsic parameters: the focal length and the radial lens distortion. We define four cases by the number of available epipoles and unknown intrinsic parameters, then derive a closed-form solution for each case formulated as a higher-order polynomial in a single variable. The proposed solvers are more numerically stable and faster by orders of magnitude than the conventional 6- or 7-point algorithms. Moreover, we demonstrate by experiments on the human pose dataset that the proposed method can solve two-view geometry even with 2D human pose, of which point localization is noisier than general feature point detectors. + + + + Context-Aware Planning and Environment-Aware Memory for Instruction Following Embodied Agents + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Context-Aware_Planning_and_Environment-Aware_Memory_for_Instruction_Following_Embodied_Agents_ICCV_2023_paper.pdf + Accomplishing household tasks such as 'bringing a cup of water' requires to plan step-by-step actions by maintaining the knowledge about the spatial arrangement of objects and consequences of previous actions. Perception models of current embodied AI agents, however, often make mistakes due to lack of such knowledge but rely on imperfect learning of imitating agents or an algorithmic planner without the knowledge about the changed environment by the previous actions. To address the issue, we propose the CPEM (Context-aware Planner and Environment-aware Memory) embodied agent to incorporate the contextual information of previous actions for planning and maintaining spatial arrangement of objects with their states (e.g., if an object has been already moved or not) in the environment to the perception model for improving both visual navigation and object interactions. We observe that the proposed model achieves state-of-the-art task success performance in various metrics using a challenging interactive instruction following benchmark both in seen and unseen environments by large margins (up to +10.70% in unseen env.). 
+ + + + Passive Ultra-Wideband Single-Photon Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Passive_Ultra-Wideband_Single-Photon_Imaging_ICCV_2023_paper.pdf + We consider the problem of imaging a dynamic scene over an extreme range of timescales simultaneously--seconds to picoseconds--and doing so passively, without much light, and without any timing signals from the light source(s) emitting it. Because existing flux estimation techniques for single-photon cameras break down in this regime, we develop a flux probing theory that draws insights from stochastic calculus to enable reconstruction of a pixel's time-varying flux from a stream of monotonically-increasing photon detection timestamps. We use this theory to (1) show that passive free-running SPAD cameras have an attainable frequency bandwidth that spans the entire DC-to-31 GHz range in low-flux conditions, (2) derive a novel Fourier-domain flux reconstruction algorithm that scans this range for frequencies with statistically-significant support in the timestamp data, and (3) ensure the algorithm's noise model remains valid even for very low photon counts or non-negligible dead times. We show the potential of this asynchronous imaging regime by experimentally demonstrating several never-seen-before abilities: (1) imaging a scene illuminated simultaneously by sources operating at vastly different speeds without synchronization (bulbs, projectors, multiple pulsed lasers), (2) passive non-line-of-sight video acquisition, and (3) recording ultra-wideband video, which can be played back later at 30 Hz to show everyday motions--but can also be played a billion times slower to show the propagation of light itself. + + + + Deep Video Demoireing via Compact Invertible Dyadic Decomposition + http://openaccess.thecvf.com//content/ICCV2023/papers/Quan_Deep_Video_Demoireing_via_Compact_Invertible_Dyadic_Decomposition_ICCV_2023_paper.pdf + Removing moire patterns from videos recorded on screens or complex textures is known as video demoireing. It is a challenging task as both structures and textures of an image usually exhibit strong periodic patterns, which thus are easily confused with moire patterns and can be significantly erased in the removal process. By interpreting video demoireing as a multi-frame decomposition problem, we propose a compact invertible dyadic network called CIDNet that progressively decouples latent frames and the moire patterns from an input video sequence. Using a dyadic cross-scale coupling structure with coupling layers tailored for multi-scale processing, CIDNet aims at disentangling the features of image patterns from that of moire patterns at different scales, while retaining all latent image features to facilitate reconstruction. In addition, a compressed form for the network's output is introduced to reduce computational complexity and alleviate overfitting. The experiments show that CIDNet outperforms existing methods and enjoys the advantages in model size and computational efficiency. + + + + Scene Graph Contrastive Learning for Embodied Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Singh_Scene_Graph_Contrastive_Learning_for_Embodied_Navigation_ICCV_2023_paper.pdf + Training effective embodied AI agents often involves expert imitation, specialized components such as maps, or leveraging additional sensors for depth and localization. Another approach is to use neural architectures alongside self-supervised objectives which encourage better representation learning. 
However, in practice, there are few guarantees that these self-supervised objectives encode task-relevant information. We propose the Scene Graph Contrastive (SGC) loss, which uses scene graphs as training-only supervisory signals. The SGC loss does away with explicit graph decoding and instead uses contrastive learning to align an agent's representation with a rich graphical encoding of its environment. The SGC loss is simple to implement and encourages representations that encode objects' semantics, relationships, and history. By using the SGC loss, we attain gains on three embodied tasks: Object Navigation, Multi-Object Navigation, and Arm Point Navigation. Finally, we present studies and analyses which demonstrate the ability of our trained representation to encode semantic cues about the environment. + + + + Preparing the Future for Continual Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Preparing_the_Future_for_Continual_Semantic_Segmentation_ICCV_2023_paper.pdf + In this study, we focus on Continual Semantic Segmentation (CSS) and present a novel approach to tackle the issue of existing methods struggling to learn new classes. The primary challenge of CSS is to learn new knowledge while retaining old knowledge, which is commonly known as the rigidity-plasticity dilemma. Existing approaches strive to address this by carefully balancing the learning of new and old classes during training on new data. Differently, this work aims to avoid this dilemma fundamentally rather than handling the difficulties involved in it. Specifically, we reveal that this dilemma mainly arises from the greater fluctuation of knowledge for new classes because they have never been learned before the current step. Additionally, the data available in incremental steps are usually inadequate, which can impede the model's ability to learn discriminative features for both new and old classes. To address these challenges, we introduce a novel concept of pre-learning for future knowledge. Our approach entails optimizing the feature space and output space for unlabeled data, which thus enables the model to acquire knowledge for future classes. With this approach, updating the model for new classes becomes as smooth as for old classes, effectively avoiding the rigidity-plasticity dilemma. We conducted extensive experiments and the results demonstrate a significant improvement in the learning of new classes compared to previous state-of-the-art methods. + + + + Synthesizing Diverse Human Motions in 3D Indoor Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Synthesizing_Diverse_Human_Motions_in_3D_Indoor_Scenes_ICCV_2023_paper.pdf + We present a novel method for populating 3D indoor scenes with virtual humans that can navigate in the environment and interact with objects in a realistic manner. Existing approaches rely on high-quality training sequences that contain captured human motions and the 3D scenes they interact with. However, such interaction data are costly, difficult to capture, and can hardly cover the full range of plausible human-scene interactions in complex indoor environments. To address these challenges, we propose a reinforcement learning-based approach that enables virtual humans to navigate in 3D scenes and interact with objects realistically and autonomously, driven by learned motion control policies. 
The motion control policies employ latent motion action spaces, which correspond to realistic motion primitives and are learned from large-scale motion capture data using a powerful generative motion model. For navigation in a 3D environment, we propose a scene-aware policy with novel state and reward designs for collision avoidance. Combined with navigation mesh-based path-finding algorithms to generate intermediate waypoints, our approach enables the synthesis of diverse human motions navigating in 3D indoor scenes and avoiding obstacles. To generate fine-grained human-object interactions, we carefully curate interaction goal guidance using a marker-based body representation and leverage features based on the signed distance field (SDF) to encode human-scene proximity relations. Our method can synthesize realistic and diverse human-object interactions (e.g., sitting on a chair and then getting up) even for out-of-distribution test scenarios with different object shapes, orientations, starting body positions, and poses. Experimental results demonstrate that our approach outperforms state-of-the-art human-scene interaction synthesis methods in terms of both motion naturalness and diversity. Code, models, and demonstrative video results are publicly available at: https://zkf1997.github.io/DIMOS. + + + + Deep Optics for Video Snapshot Compressive Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Deep_Optics_for_Video_Snapshot_Compressive_Imaging_ICCV_2023_paper.pdf + Video snapshot compressive imaging (SCI) aims to capture a sequence of video frames with only a single shot of a 2D detector, whose backbones rest in optical modulation patterns (also known as masks) and a computational reconstruction algorithm. Advanced deep learning algorithms and mature hardware are putting video SCI into practical applications. Yet, there are two clouds in the sunshine of SCI: i) low dynamic range as a victim of high temporal multiplexing, and ii) existing deep learning algorithms' degradation on real systems. To address these challenges, this paper presents a deep optics framework to jointly optimize masks and a reconstruction network. Specifically, we first propose a new type of structural mask to realize motion-aware and full-dynamic-range measurement. Considering the motion awareness property in the measurement domain, we develop an efficient network for video SCI reconstruction using Transformer to capture long-term temporal dependencies, dubbed Res2former. Moreover, sensor response is introduced into the forward model of video SCI to guarantee end-to-end model training close to the real system. Finally, we implement the learned structural masks on a digital micro-mirror device. Experimental results on synthetic and real data validate the effectiveness of the proposed framework. We believe this is a milestone for real-world video SCI. The source code and data are available at https://github.com/pwangcs/DeepOpticsSCI. + + + + Joint Demosaicing and Deghosting of Time-Varying Exposures for Single-Shot HDR Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Joint_Demosaicing_and_Deghosting_of_Time-Varying_Exposures_for_Single-Shot_HDR_ICCV_2023_paper.pdf + The quad-Bayer patterned image sensor has made significant improvements in spatial resolution over recent years due to advancements in image sensor technology. This has enabled single-shot high-dynamic-range (HDR) imaging using spatially varying multiple exposures.
Popular methods for multi-exposure array sensors involve varying the gain of each exposure, but this does not effectively change the photoelectronic energy in each exposure. Consequently, HDR images produced using gain-based exposure variation may suffer from noise and details being saturated. To address this problem, we intend to use time-varying exposures in quad-Bayer patterned sensors. This approach allows long-exposure pixels to receive more photon energy than short- or middle-exposure pixels, resulting in higher-quality HDR images. However, time-varying exposures are not ideal for dynamic scenes and require an additional deghosting method. To tackle this issue, we propose a single-shot HDR demosaicing method that takes time-varying multiple exposures as input and jointly solves both the demosaicing and deghosting problems. Our method uses a feature-extraction module to handle mosaiced multiple exposures and a multiscale transformer module to register spatial displacements of multiple exposures and colors. We also created a dataset of quad-Bayer sensor input with time-varying exposures and trained our network using this dataset. Results demonstrate that our method outperforms baseline HDR reconstruction methods with both synthetic and real datasets. With our method, we can achieve high-quality HDR images in challenging lighting conditions. + + + + Tuning Pre-trained Model via Moment Probing + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Tuning_Pre-trained_Model_via_Moment_Probing_ICCV_2023_paper.pdf + Recently, efficient fine-tuning of large-scale pre-trained models has attracted increasing research interests, where linear probing (LP) as a fundamental module is involved in exploiting the final representations for task-dependent classification. However, most of the existing methods focus on how to effectively introduce a few of learnable parameters, and little work pays attention to the commonly used LP module. In this paper, we propose a novel Moment Probing (MP) method to further explore the potential of LP. Distinguished from LP which builds a linear classification head based on the mean of final features (e.g., word tokens for ViT) or classification tokens, our MP performs a linear classifier on feature distribution, which provides a stronger representation ability by exploiting richer statistical information inherent in features. Specifically, we represent feature distribution by its characteristic function, which is efficiently approximated by using first- and second-order moments of features. Furthermore, we propose a multi-head convolutional cross-covariance to compute second-order moments in an efficient and effective manner. By considering that MP could affect feature learning, we introduce a partially shared module to learn two recalibrating parameters (PSRP) for backbones based on MP, namely MP+. Extensive experiments on ten benchmarks using various models show that our MP significantly outperforms LP and is competitive with counterparts at less training cost, while our MP+ achieves state-of-the-art performance. + + + + Task Agnostic Restoration of Natural Video Dynamics + http://openaccess.thecvf.com//content/ICCV2023/papers/Ali_Task_Agnostic_Restoration_of_Natural_Video_Dynamics_ICCV_2023_paper.pdf + In many video restoration/translation tasks, image processing operations are naively extended to the video domain by processing each frame independently, disregarding the temporal connection of the video frames. 
This disregard for the temporal connection often leads to severe temporal inconsistencies. State-Of-The-Art (SOTA) techniques that address these inconsistencies rely on the availability of unprocessed videos to implicitly siphon and utilize consistent video dynamics to restore the temporal consistency of frame-wise processed videos which often jeopardizes the translation effect. We propose a general framework for this task that learns to infer and utilize consistent motion dynamics from inconsistent videos to mitigate the temporal flicker while preserving the perceptual quality for both the temporally neighboring and relatively distant frames without requiring the raw videos at test time. The proposed framework produces SOTA results on two benchmark datasets, DAVIS and videvo.net, processed by numerous image processing applications. The code and the trained models will be open-sourced upon acceptance. + + + + TMR: Text-to-Motion Retrieval Using Contrastive 3D Human Motion Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Petrovich_TMR_Text-to-Motion_Retrieval_Using_Contrastive_3D_Human_Motion_Synthesis_ICCV_2023_paper.pdf + In this paper, we present TMR, a simple yet effective approach for text to 3D human motion retrieval. While previous work has only treated retrieval as a proxy evaluation metric, we tackle it as a standalone task. Our method extends the state-of-the-art text-to-motion synthesis model TEMOS, and incorporates a contrastive loss to better structure the cross-modal latent space. We show that maintaining the motion generation loss, along with the contrastive training, is crucial to obtain good performance. We introduce a benchmark for evaluation and provide an in-depth analysis by reporting results on several protocols. Our extensive experiments on the KIT-ML and HumanML3D datasets show that TMR outperforms the prior work by a significant margin, for example reducing the median rank from 54 to 19. Finally, we showcase the potential of our approach on moment retrieval. Our code and models are publicly available at https://mathis.petrovich.fr/tmr. + + + + SINC: Self-Supervised In-Context Learning for Vision-Language Tasks + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_SINC_Self-Supervised_In-Context_Learning_for_Vision-Language_Tasks_ICCV_2023_paper.pdf + Large Pre-trained Transformers exhibit an intriguing capacity for in-context learning. Without gradient updates, these models can rapidly construct new predictors from demonstrations presented in the inputs. Recent works promote this ability in the vision-language domain by incorporating visual information into large language models that can already make in-context predictions. However, these methods could inherit issues in the language domain, such as template sensitivity and hallucination. Also, the scale of these language models raises a significant demand for computations, making learning and operating these models resource-intensive. To this end, we raise a question: "How can we enable in-context learning without relying on the intrinsic in-context ability of large language models?". To answer it, we propose a succinct and general framework, Self-supervised IN-Context learning (SINC), that introduces a meta-model to learn on self-supervised prompts consisting of tailored demonstrations. The learned models can be transferred to downstream tasks for making in-context predictions on-the-fly. 
Extensive experiments show that SINC outperforms gradient-based methods in various vision-language tasks under few-shot settings. Furthermore, the designs of SINC help us investigate the benefits of in-context learning across different tasks, and the analysis further reveals the essential components for the emergence of in-context learning in the vision-language domain. + + + + Learning a Room with the Occ-SDF Hybrid: Signed Distance Function Mingled with Occupancy Aids Scene Representation + http://openaccess.thecvf.com//content/ICCV2023/papers/Lyu_Learning_a_Room_with_the_Occ-SDF_Hybrid_Signed_Distance_Function_ICCV_2023_paper.pdf + Implicit neural rendering, using signed distance function (SDF) representation with geometric priors like depth or surface normal, has made impressive strides in the surface reconstruction of large-scale scenes. However, applying this method to reconstruct a room-level scene from images may miss structures in low-intensity areas and/or small, thin objects. We have conducted experiments on three datasets to identify limitations of the original color rendering loss and priors-embedded SDF scene representation. Our findings show that the color rendering loss creates an optimization bias against low-intensity areas, resulting in gradient vanishing and leaving these areas unoptimized. To address this issue, we propose a feature-based color rendering loss that utilizes non-zero feature values to bring back optimization signals. Additionally, the SDF representation can be influenced by objects along a ray path, disrupting the monotonic change of SDF values when a single object is present. Accordingly, we explore using the occupancy representation, which encodes each point separately and is unaffected by objects along a querying ray. Our experimental results demonstrate that the joint forces of the feature-based rendering loss and Occ-SDF hybrid representation scheme can provide high-quality reconstruction results, especially in challenging room-level scenarios. The code is available at https://github.com/shawLyu/Occ-SDF_Hybrid. + + + + Cloth2Body: Generating 3D Human Body Mesh from 2D Clothing + http://openaccess.thecvf.com//content/ICCV2023/papers/Dai_Cloth2Body_Generating_3D_Human_Body_Mesh_from_2D_Clothing_ICCV_2023_paper.pdf + In this paper, we define and study a new Cloth2Body problem, which aims to generate 3D human body meshes from a 2D clothing image. Unlike the existing human mesh recovery problem, Cloth2Body needs to address new and emerging challenges raised by the partial observation of the input and the high diversity of the output. Indeed, there are three specific challenges. First, how to locate and pose human bodies into the clothes. Second, how to effectively estimate body shapes out of various clothing types. Finally, how to generate diverse and plausible results from a 2D clothing image. To this end, we propose an end-to-end framework that can accurately estimate a 3D body mesh parameterized by pose and shape from a 2D clothing image. Along this line, we first utilize Kinematics-aware Pose Estimation to estimate body pose parameters. A 3D skeleton is employed as a proxy, followed by an inverse kinematics module to boost the estimation accuracy. We additionally design an adaptive depth trick to better align the re-projected 3D mesh with the 2D clothing image by disentangling the effects of object size and camera extrinsics. Next, we propose Physics-informed Shape Estimation to estimate body shape parameters.
3D shape parameters are predicted based on partial body measurements estimated from the RGB image, which not only improves pixel-wise human-cloth alignment, but also enables flexible user editing. Finally, we design an Evolution-based pose generation method, a skeleton transplanting approach inspired by genetic algorithms, to generate diverse and reasonable poses during inference. As shown by experimental results on both synthetic and real-world data, the proposed framework achieves state-of-the-art performance and can effectively recover natural and diverse 3D body meshes from 2D images that align well with clothing. + + + + Spatially and Spectrally Consistent Deep Functional Maps + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Spatially_and_Spectrally_Consistent_Deep_Functional_Maps_ICCV_2023_paper.pdf + Cycle consistency has long been exploited as a powerful prior for jointly optimizing maps within a collection of shapes. In this paper, we investigate its utility in the approaches of Deep Functional Maps, which are considered state-of-the-art in non-rigid shape matching. We first justify that under certain conditions, the learned maps, when represented in the spectral domain, are already cycle consistent. Furthermore, we identify the discrepancy that spectrally consistent maps are not necessarily spatially, or point-wise, consistent. In light of this, we present a novel design of unsupervised Deep Functional Maps, which effectively enforces the harmony of learned maps under the spectral and the point-wise representation. By taking advantage of cycle consistency, our framework produces state-of-the-art results in mapping shapes even under significant distortions. Beyond that, by independently estimating maps in both spectral and spatial domains, our method naturally alleviates over-fitting in network training, yielding superior generalization performance and accuracy within an array of challenging tests for both near-isometric and non-isometric datasets. + + + + Sparse Point Guided 3D Lane Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Yao_Sparse_Point_Guided_3D_Lane_Detection_ICCV_2023_paper.pdf + 3D lane detection usually builds a dense correspondence between the front-view space and the BEV space to estimate lane points in the 3D space. 3D lanes only occupy a small ratio of the dense correspondence, while most correspondence belongs to the redundant background. This sparsity phenomenon bottlenecks valuable computation and raises the computation cost of building a high-resolution correspondence for accurate results. In this paper, we propose a sparse point-guided 3D lane detection method, focusing on points related to 3D lanes. Our method runs in a coarse-to-fine manner, including coarse-level lane detection and iterative fine-level sparse point refinements. In coarse-level lane detection, we build a dense but efficient correspondence between the front view and BEV space at a very low resolution to compute coarse lanes. Then in fine-level sparse point refinement, we sample sparse points around coarse lanes to extract local features from the high-resolution front-view feature map. The high-resolution local information brought by sparse points refines 3D lanes in the BEV space hierarchically from low resolution to high resolution. The sparse points guide a more effective information flow and improve the SOTA result by 3 points on the overall F1-score and by 6 points in several hard situations, while reducing memory cost by almost half and doubling the speed.
+ + + + Event-based Temporally Dense Optical Flow Estimation with Sequential Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Ponghiran_Event-based_Temporally_Dense_Optical_Flow_Estimation_with_Sequential_Learning_ICCV_2023_paper.pdf + Event cameras provide an advantage over traditional frame-based cameras when capturing fast-moving objects without a motion blur. They achieve this by recording changes in light intensity (known as events), thus allowing them to operate at a much higher frequency and making them suitable for capturing motions in a highly dynamic scene. Many recent studies have proposed methods to train neural networks (NNs) for predicting optical flow from events. However, they often rely on a spatio-temporal representation constructed from events over a fixed interval, such as 10Hz used in training on the DSEC dataset. This limitation restricts the flow prediction to the same interval (10Hz) whereas the fast speed of event cameras, which can operate up to 3kHz, has not been effectively utilized. In this work, we show that a temporally dense flow estimation at 100Hz can be achieved by treating the flow estimation as a sequential problem using two different variants of recurrent networks - Long-short term memory (LSTM) and spiking neural network (SNN). First, We utilize the NN model constructed similar to the popular EV-FlowNet but with LSTM layers to demonstrate the efficiency of our training method. The model not only produces 10x more frequent optical flow than the existing ones, but the estimated flows also have 13% lower errors than predictions from the baseline EV-FlowNet. Second, we construct an EV-FlowNet SNN but with leaky integrate and fire neurons to efficiently capture the temporal dynamics. We found that simple inherent recurrent dynamics of SNN lead to significant parameter reduction compared to the LSTM model. In addition, because of its event-driven computation, the spiking model is estimated to consume only 1.5% energy of the LSTM model, highlighting the efficiency of SNN in processing events and the potential for achieving temporally dense flow. + + + + Continual Zero-Shot Learning through Semantically Guided Generative Random Walks + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Continual_Zero-Shot_Learning_through_Semantically_Guided_Generative_Random_Walks_ICCV_2023_paper.pdf + Learning novel concepts, remembering previous knowledge, and adapting it to future tasks occur simultaneously throughout a human's lifetime. To model such comprehensive abilities, continual zero-shot learning (CZSL) has recently been introduced. However, most existing methods overused the unseen semantic information that may not be continually accessible in realistic settings. In this paper, we address the challenge of continual zero-shot learning where unseen information is not provided during training, by leveraging generative modeling. The heart of the generative-based methods is to learn quality representations from seen classes to improve the generative understanding of the unseen visual space. Motivated by this, we introduce generalization-bound tools and provide the first theoretical explanation for the benefits of generative modeling to CZSL tasks. Guided by the theoretical analysis, we then propose our learning algorithm that employs a novel semantically guided Generative Random Walk (GRW) loss. The GRW loss augments the training by continually encouraging the model to generate realistic and characterized samples to represent the unseen space. 
Our algorithm achieves state-of-the-art performance on AWA1, AWA2, CUB, and SUN datasets, surpassing existing CZSL methods by 3-7%. The code is available here https://github.com/wx-zhang/IGCZSL + + + + Foreground-Background Distribution Modeling Transformer for Visual Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Foreground-Background_Distribution_Modeling_Transformer_for_Visual_Object_Tracking_ICCV_2023_paper.pdf + Visual object tracking is a fundamental research topic with a broad range of applications. Benefiting from the rapid development of Transformer, pure Transformer trackers have achieved great progress. However, the feature learning of these Transformer-based trackers is easily disturbed by complex backgrounds. To address the above limitations, we propose a novel foreground-background distribution modeling transformer for visual object tracking (F-BDMTrack), including a fore-background agent learning (FBAL) module and a distribution-aware attention (DA2) module in a unified transformer architecture. The proposed F-BDMTrack enjoys several merits. First, the proposed FBAL module can effectively mine fore-background information with designed fore-background agents. Second, the DA2 module can suppress the incorrect interaction between foreground and background by modeling fore-background distribution similarities. Finally, F-BDMTrack can extract discriminative features under ever-changing tracking scenarios for more accurate target state estimation. Extensive experiments show that our F-BDMTrack outperforms previous state-of-the-art trackers on eight tracking benchmarks. + + + + Low-Light Image Enhancement with Illumination-Aware Gamma Correction and Complete Image Modelling Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Low-Light_Image_Enhancement_with_Illumination-Aware_Gamma_Correction_and_Complete_Image_ICCV_2023_paper.pdf + This paper presents a novel network structure with illumination-aware gamma correction and complete image modelling to solve the low-light image enhancement problem. Low-light environments usually lead to less informative large-scale dark areas, directly learning deep representations from low-light images is insensitive to recovering normal illumination. We propose to integrate the effectiveness of gamma correction with the strong modelling capacities of deep networks, which enables the correction factor gamma to be learned in a coarse to elaborate manner via adaptively perceiving the deviated illumination. Because exponential operation introduces high computational complexity, we propose to use Taylor Series to approximate gamma correction, accelerating the training and inference speed. Dark areas usually occupy large scales in low-light images, common local modelling structures, e.g., CNN, SwinIR, are thus insufficient to recover accurate illumination across whole low-light images. We propose a novel Transformer block to completely simulate the dependencies of all pixels across images via a local-to-global hierarchical attention mechanism, so that dark areas could be inferred by borrowing the information from far informative regions in a highly effective manner. Extensive experiments on several benchmark datasets demonstrate that our approach outperforms state-of-the-art methods. 
+ + + + Both Diverse and Realism Matter: Physical Attribute and Style Alignment for Rainy Image Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Both_Diverse_and_Realism_Matter_Physical_Attribute_and_Style_Alignment_ICCV_2023_paper.pdf + Although considerable progress has been made in the deraining task under synthetic data, it is still a tough problem under real rain scenes, due to the domain gap between the synthetic and real data. Besides, difficulties in collecting and labeling diverse real rain images hinder the progress of this field. Consequently, we attempt to promote real rain removal from the rain image generation (RIG) perspective. Existing RIG methods mainly focus on diversity but miss realism, or pursue realism but neglect the diversity of the generation. To solve this dilemma, we propose a physical alignment and controllable generation network (PCGNet) for diverse and realistic rain generation. Our key idea is to simultaneously utilize the controllability of attributes from synthetic data and the realism of appearance from real data. Specifically, we devise a unified framework to disentangle background, rain attributes, and appearance style from synthetic and real data. Then we collaboratively align the factors with a novel semi-supervised weight moving strategy for attributes and an explicit distribution modeling method for real rain style. Furthermore, we pack these aligned factors into the generation model, achieving physically controllable mapping from the attributes to real rainy images with image-level and attribute-level consistency losses. Extensive experiments show that PCGNet can effectively generate appealing rainy results, which significantly improve the performance of all existing deraining methods under both synthetic and real scenes. + + + + Single Image Reflection Separation via Component Synergy + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Single_Image_Reflection_Separation_via_Component_Synergy_ICCV_2023_paper.pdf + The reflection superposition phenomenon is complex and widely distributed in the real world, which derives various simplified linear and nonlinear formulations of the problem. In this paper, based on the investigation of the weaknesses of existing models, we propose a more general form of the superposition model by introducing a learnable residue term, which can effectively capture residual information during decomposition, guiding the separated layers to be complete. In order to fully capitalize on its advantages, we further design the network structure elaborately, including a novel dual-stream interaction mechanism and a powerful decomposition network with a semantic pyramid encoder. Extensive experiments and ablation studies are conducted to verify our superiority over state-of-the-art approaches on multiple real-world benchmark datasets. + + + + SFHarmony: Source Free Domain Adaptation for Distributed Neuroimaging Analysis + http://openaccess.thecvf.com//content/ICCV2023/papers/Dinsdale_SFHarmony_Source_Free_Domain_Adaptation_for_Distributed_Neuroimaging_Analysis_ICCV_2023_paper.pdf + To represent the biological variability of clinical neuroimaging populations, it is vital to be able to combine data across scanners and studies. However, different MRI scanners produce images with different characteristics, resulting in a domain shift known as the 'harmonisation problem'. Additionally, neuroimaging data is inherently personal in nature, leading to data privacy concerns when sharing the data.
To overcome these barriers, we propose an Unsupervised Source-Free Domain Adaptation (SFDA) method, SFHarmony. Through modelling the imaging features as a Gaussian Mixture Model and minimising an adapted Bhattacharyya distance between the source and target features, we can create a model that performs well for the target data whilst having a shared feature representation across the data domains, without needing access to the source data for adaptation or target labels. We demonstrate the performance of our method on simulated and real domain shifts, showing that the approach is applicable to classification, segmentation and regression tasks, requiring no changes to the algorithm. Our method outperforms existing SFDA approaches across a range of realistic data scenarios, demonstrating the potential utility of our approach for MRI harmonisation and general SFDA problems. Our code is available at https://github.com/nkdinsdale/SFHarmony. + + + + 3D Human Mesh Recovery with Sequentially Global Rotation Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_3D_Human_Mesh_Recovery_with_Sequentially_Global_Rotation_Estimation_ICCV_2023_paper.pdf + Model-based 3D human mesh recovery aims to reconstruct a 3D human body mesh by estimating its parameters from monocular RGB images. Most of recent works adopt the Skinned Multi-Person Linear (SMPL) model to regress relative rotations for each body joint along the kinematics chain. This pipeline needs to transform each relative rotation matrix into a global rotation matrix to articulate the canonical mesh, and suffers from accumulated errors along the kinematics chain. This paper proposes to directly estimate the global rotation of each joint to avoid error accumulation and pursue better accuracy. The proposed Sequentially Global Rotation Estimation (SGRE) directly predicts the global rotation matrix of each joint on the kinematics chain. SGRE features a residual learning module to leverage complementary features and previously predicted rotations of parent joints to guide the estimation of subsequent child joints. Thanks to this global estimation pipeline and residual learning module, SGRE alleviates error accumulation and produces more accurate 3D human mesh. It can be flexibly integrated into existing regression-based methods and achieves superior performance on various benchmarks. For example, it improves the latest method 3DCrowdNet by 3.3 mm MPJPE and 5.0 mm PVE on 3DPW dataset and 3.2 AP on COCO dataset, respectively. + + + + DREAMWALKER: Mental Planning for Continuous Vision-Language Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_DREAMWALKER_Mental_Planning_for_Continuous_Vision-Language_Navigation_ICCV_2023_paper.pdf + VLN-CE is a recently released embodied task, where AI agents need to navigate a freely traversable environment to reach a distant target location, given language instructions. It poses great challenges due to the huge space of possible strategies. Driven by the belief that the ability to anticipate the consequences of future actions is crucial for the emergence of intelligent and interpretable planning behavior, we propose Dreamwalker --- a world model based VLN-CE agent. The world model is built to summarize the visual, topological, and dynamic properties of the complicated continuous environment into a discrete, structured, and compact representation. Dreamwalker can simulate and evaluate possible plans entirely in such internal abstract world, before executing costly actions. 
As opposed to existing model-free VLN-CE agents simply making greedy decisions in the real world, which easily results in shortsighted behaviors, Dreamwalker is able to make strategic planning through large amounts of "mental experiments." Moreover, the imagined future scenarios reflect our agent's intention, making its decision-making process more transparent. Extensive experiments and ablation studies on VLN-CE dataset confirm the effectiveness of the proposed approach and outline fruitful directions for future work. Our code will be released. + + + + LAN-HDR: Luminance-based Alignment Network for High Dynamic Range Video Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Chung_LAN-HDR_Luminance-based_Alignment_Network_for_High_Dynamic_Range_Video_Reconstruction_ICCV_2023_paper.pdf + As demands for high-quality videos continue to rise, high-resolution and high-dynamic range (HDR) imaging techniques are drawing attention. To generate an HDR video from low dynamic range (LDR) images, one of the critical steps is the motion compensation between LDR frames, for which most existing works employed the optical flow algorithm. However, these methods suffer from flow estimation errors when saturation or complicated motions exist. In this paper, we propose an end-to-end HDR video composition framework, which aligns LDR frames in the feature space and then merges aligned features into an HDR frame, without relying on pixel-domain optical flow. Specifically, we propose a luminance-based alignment network for HDR (LAN-HDR) consisting of an alignment module and a hallucination module. The alignment module aligns a frame to the adjacent reference by evaluating luminance-based attention, excluding color information. The hallucination module generates sharp details, especially for washed-out areas due to saturation. The aligned and hallucinated features are then blended adaptively to complement each other. Finally, we merge the features to generate a final HDR frame. In training, we adopt a temporal loss, in addition to frame reconstruction losses, to enhance temporal consistency and thus reduce flickering. Extensive experiments demonstrate that our method performs better or comparable to state-of-the-art methods on several benchmarks. Codes are available at https://github.com/haesoochung/LAN-HDR. + + + + Dancing in the Dark: A Benchmark towards General Low-light Video Enhancement + http://openaccess.thecvf.com//content/ICCV2023/papers/Fu_Dancing_in_the_Dark_A_Benchmark_towards_General_Low-light_Video_ICCV_2023_paper.pdf + Low-light video enhancement is a challenging task with broad applications. However, current research in this area is limited by the lack of high-quality benchmark datasets. To address this issue, we design a camera system and collect a high-quality low-light video dataset with multiple exposures and cameras. Our dataset provides dynamic video pairs with pronounced camera motion and strict spatial alignment. To achieve general low-light video enhancement, we also propose a novel Retinex-based method named Light Adjustable Network (LAN). LAN iteratively refines the illumination and adaptively adjusts it under varying lighting conditions, leading to visually appealing results even in diverse real-world scenarios. The extensive experiments demonstrate the superiority of our low-light video dataset and enhancement method. Our dataset and code will be publicly available. 
+ + + + RED-PSM: Regularization by Denoising of Partially Separable Models for Dynamic Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Iskender_RED-PSM_Regularization_by_Denoising_of_Partially_Separable_Models_for_Dynamic_ICCV_2023_paper.pdf + Dynamic imaging involves the recovery of a time-varying 2D or 3D object at each time instant using its undersampled measurements. In particular, in dynamic tomography, only a single projection at a single view angle may be available at a time, making the problem severely ill-posed. In this work, we propose an approach, RED-PSM, which combines for the first time two powerful techniques to address this challenging imaging problem. The first are partially separable models, which have been used to introduce a low-rank prior for the spatio-temporal object. The second is the recent Regularization by Denoising (RED), which provides a flexible framework to exploit the impressive performance of state-of-the-art image denoising algorithms, for various inverse problems. We propose a partially separable objective with RED and an optimization scheme with variable splitting and ADMM. Our objective is proved to converge to a value corresponding to a stationary point satisfying the first-order optimality conditions. Convergence is accelerated by a particular projection-domain-based initialization. We demonstrate the performance and computational improvements of our proposed RED-PSM with a learned image denoiser by comparing it to a recent deep-prior-based method TD-DIP. Although the emphasis is on dynamic tomography, we also demonstrate the performance advantages of RED-PSM in a dynamic cardiac MRI setting. + + + + D-IF: Uncertainty-aware Human Digitization via Implicit Distribution Field + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_D-IF_Uncertainty-aware_Human_Digitization_via_Implicit_Distribution_Field_ICCV_2023_paper.pdf + Realistic virtual humans play a crucial role in numerous industries, such as metaverse, intelligent healthcare, and self-driving simulation. But creating them on a large scale with high levels of realism remains a challenge. The utilization of deep implicit function sparks a new era of image-based 3D clothed human reconstruction, enabling pixel-aligned shape recovery with fine details. Subsequently, the vast majority of works locate the surface by regressing the deterministic implicit value for each point. However, should all points be treated equally regardless of their proximity to the surface? In this paper, we propose replacing the implicit value with an adaptive uncertainty distribution, to differentiate between points based on their distance to the surface. This simple "value to distribution" transition yields significant improvements on nearly all the baselines. Furthermore, qualitative results demonstrate that the models trained using our uncertainty distribution loss, can capture more intricate wrinkles, and realistic limbs. Code and models are available for research purposes at https://github.com/psyai-net/D-IF_release. + + + + AffordPose: A Large-Scale Dataset of Hand-Object Interactions with Affordance-Driven Hand Pose + http://openaccess.thecvf.com//content/ICCV2023/papers/Jian_AffordPose_A_Large-Scale_Dataset_of_Hand-Object_Interactions_with_Affordance-Driven_Hand_ICCV_2023_paper.pdf + How humans interact with objects depends on the functional roles of the target objects, which introduces the problem of affordance-aware hand-object interaction. 
It requires a large number of human demonstrations for the learning and understanding of plausible and appropriate hand-object interactions. In this work, we present AffordPose, a large-scale dataset of hand-object interactions with affordance-driven hand pose. We first annotate the specific part-level affordance labels for each object, e.g. twist, pull, handle-grasp, etc, instead of the general intents such as use or handover, to indicate the purpose and guide the localization of the hand-object interactions. The fine-grained hand-object interactions reveal the influence of hand-centered affordances on the detailed arrangement of the hand poses, yet also exhibit a certain degree of diversity. We collect a total of 26.7K hand-object interactions, each including the 3D object shape, the part-level affordance label, and the manually adjusted hand poses. The comprehensive data analysis shows the common characteristics and diversity of hand-object interactions per affordance via the parameter statistics and contacting computation. We also conduct experiments on the tasks of hand-object affordance understanding and affordance-oriented hand-object interaction generation, to validate the effectiveness of our dataset in learning the fine-grained hand-object interactions. Project page: https://github.com/GentlesJan/AffordPose . + + + + Locomotion-Action-Manipulation: Synthesizing Human-Scene Interactions in Complex 3D Environments + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Locomotion-Action-Manipulation_Synthesizing_Human-Scene_Interactions_in_Complex_3D_Environments_ICCV_2023_paper.pdf + Synthesizing interaction-involved human motions has been challenging due to the high complexity of 3D environments and the diversity of possible human behaviors within. We present LAMA, Locomotion-Action-MAnipulation, to synthesize natural and plausible long term human movements in complex indoor environments. The key motivation of LAMA is to build a unified framework to encompass a series of everyday motions including locomotion, scene interaction, and object manipulation. Unlike existing methods that require motion data "paired" with scanned 3D scenes for supervision, we formulate the problem as a test-time optimization by using human motion capture data only for synthesis. LAMA leverages a reinforcement learning framework coupled with motion matching algorithm for optimization, and further exploits a motion editing framework via manifold learning to cover possible variations in interaction and manipulation. Throughout extensive experiments, we demonstrate that LAMA outperforms previous approaches in synthesizing realistic motions in various challenging scenarios. + + + + NDDepth: Normal-Distance Assisted Monocular Depth Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_NDDepth_Normal-Distance_Assisted_Monocular_Depth_Estimation_ICCV_2023_paper.pdf + Monocular depth estimation has drawn widespread attention from the vision community due to its broad applications. In this paper, we propose a novel physics (geometry)-driven deep learning framework for monocular depth estimation by assuming that 3D scenes are constituted by piece-wise planes. Particularly, we introduce a new normal-distance head that outputs pixel-level surface normal and plane-to-origin distance for deriving depth at each position. Meanwhile, the normal and distance are regularized by a developed plane-aware consistency constraint. 
We further integrate an additional depth head to improve the robustness of the proposed framework. To fully exploit the strengths of these two heads, we develop an effective contrastive iterative refinement module that refines depth in a complementary manner according to the depth uncertainty. Extensive experiments indicate that the proposed method exceeds previous state-of-the-art competitors on the NYU-Depth-v2, KITTI and SUN RGB-D datasets. Notably, it ranks 1st among all submissions on the KITTI depth prediction online benchmark at the submission time. The source code is available at https://github.com/ShuweiShao/NDDepth. + + + + Sequential Texts Driven Cohesive Motions Synthesis with Natural Transitions + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Sequential_Texts_Driven_Cohesive_Motions_Synthesis_with_Natural_Transitions_ICCV_2023_paper.pdf + The intelligent synthesis/generation of daily-life motion sequences is fundamental and urgently needed for many VR/metaverse-related applications. However, existing approaches commonly focus on monotonic motion generation (e.g., walking, jumping, etc.) based on single instruction-like text, which is still not intelligent enough and can't meet practical demands. To this end, we propose a cohesive human motion sequence synthesis framework based on free-form sequential texts while ensuring semantic connection and natural transitions between adjacent motions. At the technical level, we explore the local-to-global semantic features of previous and current texts to extract relevant information. This information is used to guide the framework in understanding the semantics of the current moment. Moreover, we propose learnable tokens to adaptively learn the influence range of the previous motions towards natural transitions. These tokens can be trained to encode the relevant information into well-designed transition loss. To demonstrate the efficacy of our method, we conduct extensive experiments and comprehensive evaluations on the public dataset as well as a new dataset produced by us. All the experiments confirm that our method outperforms the state-of-the-art methods in terms of semantic matching, realism, and transition fluency. Our project is publicly available. https://druthrie.github.io/sequential-texts-to-motion/ + + + + Efficient Converted Spiking Neural Network for 3D and 2D Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Lan_Efficient_Converted_Spiking_Neural_Network_for_3D_and_2D_Classification_ICCV_2023_paper.pdf + Spiking Neural Networks (SNNs) have attracted enormous research interest due to their low-power and biologically plausible nature. Existing ANN-SNN conversion methods can achieve lossless conversion by converting a well-trained Artificial Neural Network (ANN) into an SNN. However, the converted SNN requires a large number of time steps to achieve competitive performance with the well-trained ANN, which means a large latency. In this paper, we propose an efficient unified ANN-SNN conversion method for point cloud classification and image classification to significantly reduce the time step to meet the fast and lossless ANN-SNN transformation. Specifically, we first adaptively adjust the threshold according to the activation state of spiking neurons, ensuring a certain proportion of spiking neurons are activated at each time step to reduce the time for accumulation of membrane potential. 
Next, we use an adaptive firing mechanism to enlarge the range of spiking output, obtaining more discriminative features in short time steps. Extensive experimental results on challenging point cloud and image datasets demonstrate that the suggested approach significantly outmatches state-of-the-art ANN-SNN conversion-based methods. + + + + Eulerian Single-Photon Vision + http://openaccess.thecvf.com//content/ICCV2023/papers/Gupta_Eulerian_Single-Photon_Vision_ICCV_2023_paper.pdf + Single-photon sensors measure light signals at the finest possible resolution -- individual photons. These sensors introduce two major challenges in the form of strong Poisson noise and extremely large data acquisition rates, which are also inherited by downstream computer vision tasks. Previous work has largely focused on solving the image reconstruction problem first and then using off-the-shelf methods for downstream tasks, but the most general solutions that account for motion are costly and not scalable to large data volumes produced by single-photon sensors. This work forgoes the image reconstruction problem. Instead, we demonstrate computationally light-weight phase-based algorithms for the tasks of edge detection and motion estimation. These methods directly process the raw single-photon data as a 3D volume with a bank of velocity-tuned filters, achieving speed-ups of more than two orders of magnitude compared to explicit reconstruction-based methods. Project webpage: https://wisionlab.com/project/eulerian-single-photon-vision/ + + + + NSF: Neural Surface Fields for Human Modeling from Monocular Depth + http://openaccess.thecvf.com//content/ICCV2023/papers/Xue_NSF_Neural_Surface_Fields_for_Human_Modeling_from_Monocular_Depth_ICCV_2023_paper.pdf + Obtaining personalized 3D animatable avatars from a monocular camera has several real world applications in gaming, virtual try-on, animation, and VR/XR, etc. However, it is very challenging to model dynamic and fine-grained clothing deformations from such sparse data. Existing methods for modeling 3D humans from depth data have limitations in terms of computational efficiency, mesh coherency, and flexibility in resolution and topology. For instance, reconstructing shapes using implicit functions and extracting explicit meshes per frame is computationally expensive and cannot ensure coherent meshes across frames. Moreover, predicting per-vertex deformations on a pre-designed human template with a discrete surface lacks flexibility in resolution and topology. To overcome these limitations, we propose a novel method 'NSF: Neural Surface Fields' for modeling 3D clothed humans from monocular depth. NSF defines a neural field solely on the base surface which models a continuous and flexible displacement field. NSF can be adapted to the base surface with different resolution and topology without retraining at inference time. Compared to existing approaches, our method eliminates the expensive per-frame surface extraction while maintaining mesh coherency, and is capable of reconstructing meshes with arbitrary resolution without retraining. To foster research in this direction, we release our code in project page at: https://yuxuan-xue.com/nsf. 
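The converted-SNN abstract above adjusts the firing threshold so that a fixed proportion of neurons spike at each time step. A minimal integrate-and-fire sketch of that rule, with a loose reading of the paper's proportion criterion and entirely made-up inputs, time steps and target rate (not the authors' conversion pipeline):

```python
import numpy as np

def if_layer_adaptive(inputs, timesteps=4, target_rate=0.3):
    """Integrate-and-fire simulation with a per-step adaptive threshold.

    inputs: (N,) constant input current per neuron, standing in for the
            activations of a converted ANN layer.
    Returns an approximate firing rate accumulated over `timesteps`.
    """
    v = np.zeros_like(inputs, dtype=float)       # membrane potentials
    spikes = np.zeros_like(inputs, dtype=float)  # accumulated spike counts
    for _ in range(timesteps):
        v += inputs                              # integrate input current
        # Adaptive threshold: choose it so roughly `target_rate` of the
        # neurons cross it at this step (one simple interpretation of the
        # "certain proportion of spiking neurons" rule).
        theta = max(np.quantile(v, 1.0 - target_rate), 1e-6)
        fired = v >= theta
        spikes += fired
        v[fired] -= theta                        # soft reset by subtraction
    return spikes / timesteps

rates = if_layer_adaptive(np.random.rand(8) * 2.0)
print(rates)
```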
+ + + + Unaligned 2D to 3D Translation with Conditional Vector-Quantized Code Diffusion using Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Corona-Figueroa_Unaligned_2D_to_3D_Translation_with_Conditional_Vector-Quantized_Code_Diffusion_ICCV_2023_paper.pdf + Generating 3D images of complex objects conditionally from a few 2D views is a difficult synthesis problem, compounded by issues such as domain gap and geometric misalignment. For instance, a unified framework such as Generative Adversarial Networks cannot achieve this unless it explicitly defines both a domain-invariant and geometric-invariant joint latent distribution, whereas Neural Radiance Fields are generally unable to handle both issues as they optimize at the pixel level. By contrast, we propose a simple and novel 2D to 3D synthesis approach based on conditional diffusion with vector-quantized codes. Operating in an information-rich code space enables high-resolution 3D synthesis via full-coverage attention across the views. Specifically, we generate the 3D codes (e.g. for CT images) conditional on previously generated 3D codes and the entire codebook of two 2D views (e.g. 2D X-rays). Qualitative and quantitative results demonstrate state-of-the-art performance over specialized methods across varied evaluation criteria, including fidelity metrics such as density, coverage, and distortion metrics for two complex volumetric imagery datasets from real-world scenarios. + + + + DMNet: Delaunay Meshing Network for 3D Shape Representation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_DMNet_Delaunay_Meshing_Network_for_3D_Shape_Representation_ICCV_2023_paper.pdf + Recently, there has been a growing interest in learning-based explicit methods due to their ability to respect the original input and preserve details. However, the connectivity on complex structures is still difficult to infer due to the limited local shape perception, resulting in artifacts and non-watertight triangles. In this paper, we present a novel learning-based method with Delaunay triangulation to achieve high-precision reconstruction. We model the Delaunay triangulation as a dual graph, extract local geometric information from the points, and embed it into the structural representation of Delaunay triangulation in an organic way, benefiting fine-grained details reconstruction. To encourage neighborhood information interaction of edges and nodes in the graph, we introduce a local graph iteration algorithm, which is a variant of graph neural network. Moreover, a geometric constraint loss further improves the classification of tetrahedrons. Benefiting from our fully local network, a scaling strategy is designed to enable large-scale reconstruction. Experiments show that our method yields watertight and high-quality meshes. Especially for some thin structures and sharp edges, our method shows better performance than the current state-of-the-art methods. Furthermore, it has a strong adaptability to point clouds of different densities. + + + + Body Knowledge and Uncertainty Modeling for Monocular 3D Human Body Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Body_Knowledge_and_Uncertainty_Modeling_for_Monocular_3D_Human_Body_ICCV_2023_paper.pdf + While 3D body reconstruction methods have made remarkable progress recently, it remains difficult to acquire the sufficiently accurate and numerous 3D supervisions required for training. 
In this paper, we propose KNOWN, a framework that effectively utilizes body KNOWledge and uNcertainty modeling to compensate for insufficient 3D supervisions. KNOWN exploits a comprehensive set of generic body constraints derived from well-established body knowledge. These generic constraints precisely and explicitly characterize the reconstruction plausibility and enable 3D reconstruction models to be trained without any 3D data. Moreover, existing methods typically use images from multiple datasets during training, which can result in data noise (e.g., inconsistent joint annotation) and data imbalance (e.g., minority images representing unusual poses or captured from challenging camera views). KNOWN solves these problems through a novel probabilistic framework that models both aleatoric and epistemic uncertainty. Aleatoric uncertainty is encoded in a robust Negative Log-Likelihood (NLL) training loss, while epistemic uncertainty is used to guide model refinement. Experiments demonstrate that KNOWN's body reconstruction outperforms prior weakly-supervised approaches, particularly on the challenging minority images. + + + + Equivariant Similarity for Vision-Language Foundation Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Equivariant_Similarity_for_Vision-Language_Foundation_Models_ICCV_2023_paper.pdf + This study explores the concept of equivariance in vision-language foundation models (VLMs), focusing specifically on the multimodal similarity function that is not only the major training objective but also the core delivery to support downstream tasks. Unlike the existing image-text similarity objective which only categorizes matched pairs as similar and unmatched as dissimilar, equivariance also requires similarity to vary faithfully according to the semantic changes. This allows VLMs to generalize better to nuanced and unseen multimodal compositions. However, modeling equivariance is challenging as the ground truth of semantic change is difficult to collect. For example, given an image-text pair about a dog, it is unclear to what extent the similarity changes when the pixel is changed from dog to cat? To this end, we propose EqSim, a regularization loss that can be efficiently calculated from any two matched training pairs and easily pluggable into existing image-text retrieval fine-tuning. Meanwhile, to further diagnose the equivariance of VLMs, we present a new challenging benchmark EqBen. Compared to the existing evaluation sets, EqBen is the first to focus on "visual-minimal change". Extensive experiments show the lack of equivariance in current VLMs and validate the effectiveness of EqSim. + + + + ReST: A Reconfigurable Spatial-Temporal Graph Model for Multi-Camera Multi-Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_ReST_A_Reconfigurable_Spatial-Temporal_Graph_Model_for_Multi-Camera_Multi-Object_Tracking_ICCV_2023_paper.pdf + Multi-Camera Multi-Object Tracking (MC-MOT) utilizes information from multiple views to better handle problems with occlusion and crowded scenes. Recently, the use of graph-based approaches to solve tracking problems has become very popular. However, many current graph-based methods do not effectively utilize information regarding spatial and temporal consistency. Instead, they rely on single-camera trackers as input, which are prone to fragmentation and ID switch errors. 
In this paper, we propose a novel reconfigurable graph model that first associates all detected objects across cameras spatially before reconfiguring it into a temporal graph for Temporal Association. This two-stage association approach enables us to extract robust spatial and temporal-aware features and address the problem with fragmented tracklets. Furthermore, our model is designed for online tracking, making it suitable for real-world applications. Experimental results show that the proposed graph model is able to extract more discriminating features for object tracking, and our model achieves state-of-the-art performance on several public datasets. Code is available at https://github.com/chengche6230/ReST. + + + + DiffTAD: Temporal Action Detection with Proposal Denoising Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Nag_DiffTAD_Temporal_Action_Detection_with_Proposal_Denoising_Diffusion_ICCV_2023_paper.pdf + We propose a new formulation of temporal action detection (TAD) with denoising diffusion, DiffTAD in short. Taking as input random temporal proposals, it can yield action proposals accurately given an untrimmed long video. This presents a generative modeling perspective, against previous discriminative learning manners. This capability is achieved by first diffusing the ground-truth proposals to random ones (i.e, the forward/noising process) and then learning to reverse the noising process (i.e, the backward/denoising process). Concretely, we establish the denoising process in the Transformer decoder (e.g, DETR) by introducing a temporal location query design with faster convergence in training. We further propose a cross-step selective conditioning algorithm for inference acceleration. Extensive evaluations on ActivityNet and THUMOS show that our DiffTAD achieves top performance compared to previous art alternatives. The code is available at https://github.com/sauradip/DiffusionTAD. + + + + Heterogeneous Diversity Driven Active Learning for Multi-Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Heterogeneous_Diversity_Driven_Active_Learning_for_Multi-Object_Tracking_ICCV_2023_paper.pdf + The existing one-stage multi-object tracking (MOT) algorithms have achieved satisfactory performance benefiting from a large amount of labeled data. However, acquiring plenty of laborious annotated frames is not practical in real applications. To reduce the cost of human annotations, we propose Heterogeneous Diversity driven Active Multi-Object Tracking (HD-AMOT), to infer the most informative frames for any MOT tracker by observing the heterogeneous cues of samples. HD-AMOT defines the diversified informative representation by encoding the geometric and semantic information, and formulates the frame inference strategy as a Markov decision process to learn an optimal sampling policy based on the designed informative representation. Specifically, HD-AMOT consists of a diversified informative representation module as well as an informative frame selection network. The former produces the signal characterizing the diversity and distribution of frames, and the latter receives the signal and conducts multi-frame cooperation to enable batch frame sampling. Extensive experiments conducted on the MOT15, MOT17, MOT20, and Dancetrack datasets demonstrate the efficacy and effectiveness of HD-AMOT. Experiments show that under 50% budget our HD-AMOT can achieve similar or even higher performance as fully-supervised learning. 
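DiffTAD, described just above, trains a detector to reverse a diffusion process applied to temporal action proposals. The snippet below sketches only the generic DDPM forward (noising) step on normalised (center, width) proposals, i.e. the "diffuse the ground-truth proposals to random ones" direction; the linear beta schedule, step count, and proposal values are illustrative assumptions rather than details from the paper.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def diffuse_proposals(x0, t, alpha_bar, rng=np.random.default_rng(0)):
    """DDPM forward process: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    a = alpha_bar[t]
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps, eps

# Ground-truth proposals as (center, width) fractions of the video length
# (illustrative values only).
gt = np.array([[0.20, 0.10],
               [0.65, 0.30]])
alpha_bar = make_alpha_bar()
noisy, eps = diffuse_proposals(gt, t=400, alpha_bar=alpha_bar)
print(noisy)  # the detection model is trained to denoise such corrupted proposals
```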
+ + + + Dual Aggregation Transformer for Image Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Dual_Aggregation_Transformer_for_Image_Super-Resolution_ICCV_2023_paper.pdf + Transformer has recently gained considerable popularity in low-level vision tasks, including image super-resolution (SR). These networks utilize self-attention along different dimensions, spatial or channel, and achieve impressive performance. This inspires us to combine the two dimensions in Transformer for a more powerful representation capability. Based on the above idea, we propose a novel Transformer model, Dual Aggregation Transformer (DAT), for image SR. Our DAT aggregates features across spatial and channel dimensions, in the inter-block and intra-block dual manner. Specifically, we alternately apply spatial and channel self-attention in consecutive Transformer blocks. The alternate strategy enables DAT to capture the global context and realize inter-block feature aggregation. Furthermore, we propose the adaptive interaction module (AIM) and the spatial-gate feed-forward network (SGFN) to achieve intra-block feature aggregation. AIM complements two self-attention mechanisms from corresponding dimensions. Meanwhile, SGFN introduces additional non-linear spatial information in the feed-forward network. Extensive experiments show that our DAT surpasses current methods. Code and models are obtainable at https://github.com/zhengchen1999/DAT. + + + + Semantify: Simplifying the Control of 3D Morphable Models Using CLIP + http://openaccess.thecvf.com//content/ICCV2023/papers/Gralnik_Semantify_Simplifying_the_Control_of_3D_Morphable_Models_Using_CLIP_ICCV_2023_paper.pdf + We present Semantify: a self-supervised method that utilizes the semantic power of CLIP language-vision foundation model to simplify the control of 3D morphable models. Given a parametric model, training data is created by randomly sampling the model's parameters, creating various shapes and rendering them. The similarity between the output images and a set of word descriptors is calculated in CLIP's latent space. Our key idea is first to choose a small set of semantically meaningful and disentangled descriptors that characterize the 3DMM, and then learn a non-linear mapping from scores across this set to the parametric coefficients of the given 3DMM. The non-linear mapping is defined by training a neural network without a human-in-the-loop. We present results on numerous 3DMMs: body shape models, face shape and expression models, as well as animal shapes. We demonstrate how our method defines a simple slider interface for intuitive modeling, and show how the mapping can be used to instantly fit a 3D parametric body shape to in-the-wild images. + + + + From Sky to the Ground: A Large-scale Benchmark and Simple Baseline Towards Real Rain Removal + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_From_Sky_to_the_Ground_A_Large-scale_Benchmark_and_Simple_ICCV_2023_paper.pdf + Learning-based image deraining methods have made great progress. However, the lack of large-scale high-quality paired training samples is the main bottleneck to hamper the real image deraining (RID). To address this dilemma and advance RID, we construct a Large-scale High-quality Paired real rain benchmark (LHP-Rain), including 3000 video sequences with 1 million high-resolution (1920*1080) frame pairs. 
The advantages of the proposed dataset over the existing ones are three-fold: higher-diversity and larger-scale rain, higher-resolution images, and higher-quality ground truth. Specifically, the real rains in LHP-Rain not only contain the classical rain streak/veiling/occlusion in the sky, but also the splashing on the ground overlooked by the deraining community. Moreover, we propose a novel robust low-rank tensor recovery model to generate the ground truth by better separating the static background from the dynamic rain. In addition, we design a simple transformer-based single image deraining baseline, which simultaneously utilizes the self-attention and cross-layer attention within the image and rain layer with discriminative feature representation. Extensive experiments verify the superiority of the proposed dataset and deraining method over the state-of-the-art. + + + + JOTR: 3D Joint Contrastive Learning with Transformers for Occluded Human Mesh Recovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_JOTR_3D_Joint_Contrastive_Learning_with_Transformers_for_Occluded_Human_ICCV_2023_paper.pdf + In this study, we focus on the problem of 3D human mesh recovery from a single image under obscured conditions. Most state-of-the-art methods aim to improve 2D alignment technologies, such as spatial averaging and 2D joint sampling. However, they tend to neglect the crucial aspect of 3D alignment by improving 3D representations. Furthermore, recent methods struggle to separate target human from occlusion or background in crowded scenes as they optimize the 3D space of target human with 3D joint coordinates as local supervision. To address these issues, a desirable method would involve a framework for fusing 2D and 3D features and a strategy for optimizing the 3D space globally. Therefore, this paper presents 3D JOint contrastive learning with TRansformers (JOTR) framework for handling occluded 3D human mesh recovery. Our method includes an encoder-decoder transformer architecture to fuse 2D and 3D representations for achieving 2D&3D aligned results in a coarse-to-fine manner and a novel 3D joint contrastive learning approach for adding explicitly global supervision for the 3D feature space. The contrastive learning approach includes two contrastive losses: joint-to-joint contrast for enhancing the similarity of semantically similar voxels (i.e., human joints), and joint-to-non-joint contrast for ensuring discrimination from others (e.g., occlusions and background). Qualitative and quantitative analyses demonstrate that our method outperforms state-of-the-art competitors on both occlusion-specific and standard benchmarks, significantly improving the reconstruction of occluded humans. + + + + NIR-assisted Video Enhancement via Unpaired 24-hour Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Niu_NIR-assisted_Video_Enhancement_via_Unpaired_24-hour_Data_ICCV_2023_paper.pdf + Low-light video enhancement in the visible (VIS) range is important yet technically challenging, and it is likely to become more tractable by introducing near-infrared (NIR) information for assistance, which in turn arouses a new challenge on how to obtain appropriate multispectral data for model training. In this paper, we defend the feasibility and superiority of NIR-assisted low-light video enhancement results by using unpaired 24-hour data for the first time, which significantly eases data collection and improves generalization performance on in-the-wild data. 
By accounting for different physical characteristics between unpaired daytime and nighttime videos, we first propose to turn daytime NIR & VIS into "nighttime mode". Specifically, we design a heuristic yet physics-inspired relighting algorithm to produce realistic pseudo nighttime NIR, and use a resampling strategy followed by a noiseGAN for nighttime VIS conversion. We further devise a temporal-aware network for video enhancement that extracts and fuses bi-directional temporal streams and is trained using real daytime videos and pseudo nighttime videos. We capture multi-spectral data using a co-axial camera and contribute Fulltime Multi-Spectral Video Dataset (FMSVD), the first dataset including aligned 24-hour NIR & VIS videos. Compared to alternative methods, we achieve significantly improved video quality as well as generalization ability on in-the-wild data in terms of both evaluation metrics and visual judgment. Codes and Data Available: https://github.com/MyNiuuu/NVEU. + + + + VeRi3D: Generative Vertex-based Radiance Fields for 3D Controllable Human Image Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_VeRi3D_Generative_Vertex-based_Radiance_Fields_for_3D_Controllable_Human_Image_ICCV_2023_paper.pdf + Unsupervised learning of 3D-aware generative adversarial networks has lately made much progress. Some recent work demonstrates promising results of learning human generative models using neural articulated radiance fields, yet their generalization ability and controllability lag behind parametric human models, i.e., they do not perform well when generalizing to novel pose/shape and are not part controllable. To solve these problems, we propose VeRi3D, a generative human vertex-based radiance field parameterized by vertices of the parametric human template, SMPL. We map each 3D point to the local coordinate system defined on its neighboring vertices, and use the corresponding vertex feature and local coordinates for mapping it to color and density values. We demonstrate that our simple approach allows for generating photorealistic human images with free control over camera pose, human pose, shape, as well as enabling part-level editing. + + + + SHIFT3D: Synthesizing Hard Inputs For Tricking 3D Detectors + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_SHIFT3D_Synthesizing_Hard_Inputs_For_Tricking_3D_Detectors_ICCV_2023_paper.pdf + We present SHIFT3D, a differentiable pipeline for generating 3D shapes that are structurally plausible yet challenging to 3D object detectors. In safety-critical applications like autonomous driving, discovering such novel challenging objects can offer insight into unknown vulnerabilities of 3D detectors. By representing objects with a signed distanced function (SDF), we show that gradient error signals allow us to smoothly deform the shape or pose of a 3D object in order to confuse a downstream 3D detector. Importantly, the objects generated by SHIFT3D physically differ from the baseline object yet retain a semantically recognizable shape. Our approach provides interpretable failure modes for modern 3D object detectors, and can aid in preemptive discovery of potential safety risks within 3D perception systems before these risks become critical failures. 
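SHIFT3D, summarised just above, follows gradients through an SDF-based shape representation to find plausible shapes that suppress a 3D detector's confidence. The sketch below shows only the generic adversarial optimisation loop on a latent shape code; `toy_detector_score`, the latent size, the regularisation weight and the step count are hypothetical placeholders, not the paper's pipeline.

```python
import torch

def toy_detector_score(latent: torch.Tensor) -> torch.Tensor:
    """Stand-in for a differentiable detector confidence (higher = detected)."""
    return torch.sigmoid(latent.sum())

latent0 = torch.randn(16)                 # baseline shape code
latent = latent0.clone().requires_grad_(True)
opt = torch.optim.Adam([latent], lr=0.05)

for step in range(100):
    score = toy_detector_score(latent)                 # drive this down
    stay_close = ((latent - latent0) ** 2).mean()      # keep the shape near the baseline
    loss = score + 0.1 * stay_close
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(toy_detector_score(latent0)), float(toy_detector_score(latent)))
```

The proximity penalty is the part that loosely corresponds to keeping the generated object "structurally plausible yet challenging"; a real system would differentiate through the SDF renderer and detector instead of a toy score.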
+ + + + Coordinate Transformer: Achieving Single-stage Multi-person Mesh Recovery from Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Coordinate_Transformer_Achieving_Single-stage_Multi-person_Mesh_Recovery_from_Videos_ICCV_2023_paper.pdf + Multi-person 3D mesh recovery from videos is a critical first step towards automatic perception of group behavior in virtual reality, physical therapy and beyond. However, existing approaches rely on multi-stage paradigms, where the person detection and tracking stages are performed in a multi-person setting, while temporal dynamics are only modeled for one person at a time. Consequently, their performance is severely limited by the lack of inter-person interactions in the spatial-temporal mesh recovery, as well as by detection and tracking defects. To address these challenges, we propose the Coordinate transFormer (CoordFormer) that directly models multi-person spatial-temporal relations and simultaneously performs multi-mesh recovery in an end-to-end manner. Instead of partitioning the feature map into coarse-scale patch-wise tokens, CoordFormer leverages a novel Coordinate-Aware Attention to preserve pixel-level spatial-temporal coordinate information. Additionally, we propose a simple, yet effective Body Center Attention mechanism to fuse position information. Extensive experiments on the 3DPW dataset demonstrate that CoordFormer significantly improves the state-of-the-art, outperforming the previously best results by 4.2%, 8.8% and 4.7% according to the MPJPE, PAMPJPE, and PVE metrics, respectively, while being 40% faster than recent video-based approaches. The released code can be found at https://github.com/Li-Hao-yuan/CoordFormer. + + + + Boosting Positive Segments for Weakly-Supervised Audio-Visual Video Parsing + http://openaccess.thecvf.com//content/ICCV2023/papers/Rachavarapu_Boosting_Positive_Segments_for_Weakly-Supervised_Audio-Visual_Video_Parsing_ICCV_2023_paper.pdf + In this paper, we address the problem of weakly supervised Audio-Visual Video Parsing (AVVP), where the goal is to temporally localize events that are audible or visible and simultaneously classify them into known event categories. This is a challenging task, as we only have access to the video-level event labels during training but need to predict event labels at the segment level during evaluation. Existing multiple-instance learning (MIL) based methods use a form of attentive pooling over segment-level predictions. These methods only optimize for a subset of most discriminative segments that satisfy the weak-supervision constraints, which miss identifying positive segments. To address this, we focus on improving the proportion of positive segments detected in a video. To this end, we model the number of positive segments in a video as a latent variable and show that it can be modeled as Poisson binomial distribution over segment-level predictions, which can be computed exactly. Given the absence of fine-grained supervision, we propose an Expectation-Maximization approach to learn the model parameters by maximizing the evidence lower bound (ELBO). We iteratively estimate the minimum positive segments in a video and refine them to capture more positive segments. We conducted extensive experiments on AVVP tasks to evaluate the effectiveness of our proposed approach, and the results clearly demonstrate that it increases the number of positive segments captured compared to existing methods. 
Additionally, our experiments on Temporal Action Localization (TAL) demonstrate the potential of our method for generalization to similar MIL tasks. + + + + Sign Language Translation with Iterative Prototype + http://openaccess.thecvf.com//content/ICCV2023/papers/Yao_Sign_Language_Translation_with_Iterative_Prototype_ICCV_2023_paper.pdf + This paper presents IP-SLT, a simple yet effective framework for sign language translation (SLT). Our IP-SLT adopts a recurrent structure and enhances the semantic representation (prototype) of the input sign language video via an iterative refinement manner. Our idea mimics the behavior of human reading, where a sentence can be digested repeatedly, till reaching accurate understanding. Technically, IP-SLT consists of feature extraction, prototype initialization, and iterative prototype refinement. The initialization module generates the initial prototype based on the visual feature extracted by the feature extraction module. Then, the iterative refinement module leverages the cross-attention mechanism to polish the previous prototype by aggregating it with the original video feature. Through repeated refinement, the prototype finally converges to a more stable and accurate state, leading to a fluent and appropriate translation. In addition, to leverage the sequential dependence of prototypes, we further propose an iterative distillation loss to compress the knowledge of the final iteration into previous ones. As the autoregressive decoding process is executed only once in inference, our IP-SLT is ready to improve various SLT systems with acceptable overhead. Extensive experiments are conducted on public benchmarks to demonstrate the effectiveness of the IP-SLT. + + + + Humans in 4D: Reconstructing and Tracking Humans with Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Goel_Humans_in_4D_Reconstructing_and_Tracking_Humans_with_Transformers_ICCV_2023_paper.pdf + We present an approach to reconstruct humans and track them over time. At the core of our approach, we propose a fully "transformerized" version of a network for human mesh recovery. This network, HMR 2.0, advances the state of the art and shows the capability to analyze unusual poses that have in the past been difficult to reconstruct from single images. To analyze video, we use 3D reconstructions from HMR 2.0 as input to a tracking system that operates in 3D. This enables us to deal with multiple people and maintain identities through occlusion events. Our complete approach, 4DHumans, achieves state-of-the-art results for tracking people from monocular video. Furthermore, we demonstrate the effectiveness of HMR 2.0 on the downstream task of action recognition, achieving significant improvements over previous pose-based action recognition approaches. Our code and models are available on the project website: https://shubham-goel.github.io/4dhumans/. + + + + Perpetual Humanoid Control for Real-time Simulated Avatars + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_Perpetual_Humanoid_Control_for_Real-time_Simulated_Avatars_ICCV_2023_paper.pdf + We present a physics-based humanoid controller that achieves high-fidelity motion imitation and fault-tolerant behavior in the presence of noisy input (e.g. pose estimates from video or generated from language) and unexpected falls. Our controller scales up to learning ten thousand motion clips without using any external stabilizing forces and learns to naturally recover from fail-state. 
Given reference motion, our controller can perpetually control simulated avatars without requiring resets. At its core, we propose the progressive multiplicative control policy (PMCP), which dynamically allocates new network capacity to learn harder and harder motion sequences. PMCP allows efficient scaling for learning from large-scale motion databases and adding new tasks, such as fail-state recovery, without catastrophic forgetting. We demonstrate the effectiveness of our controller by using it to imitate noisy poses from video-based pose estimators and language-based motion generators in a live and real-time multi-person avatar use case. + + + + Score Priors Guided Deep Variational Inference for Unsupervised Real-World Single Image Denoising + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_Score_Priors_Guided_Deep_Variational_Inference_for_Unsupervised_Real-World_Single_ICCV_2023_paper.pdf + Real-world single image denoising is crucial and practical in computer vision. Bayesian inversions combined with score priors now have proven effective for single image denoising but are limited to white Gaussian noise. Moreover, applying existing score-based methods for real-world denoising requires not only the explicit training of score priors on the target domain but also the careful design of sampling procedures for posterior inference, which is complicated and impractical. To address these limitations, we propose a score priors-guided deep variational inference, namely ScoreDVI, for practical real-world denoising. By considering the deep variational image posterior with a Gaussian form, score priors are extracted based on easily accessible minimum MSE Non-i.i.d Gaussian denoisers and variational samples, which in turn facilitate optimizing the variational image posterior. Such a procedure adaptively applies cheap score priors to denoising. Additionally, we exploit a Non-i.i.d Gaussian mixture model and variational noise posterior to model the real-world noise. This scheme also enables the pixel-wise fusion of multiple image priors and variational image posteriors. Besides, we develop a noise-aware prior assignment strategy that dynamically adjusts the weight of image priors in the optimization. Our method outperforms other single image-based real-world denoising methods and achieves comparable performance to dataset-based unsupervised methods. + + + + Improving Transformer-based Image Matching by Cascaded Capturing Spatially Informative Keypoints + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_Improving_Transformer-based_Image_Matching_by_Cascaded_Capturing_Spatially_Informative_Keypoints_ICCV_2023_paper.pdf + Learning robust local image feature matching is a fundamental low-level vision task, which has been widely explored in the past few years. Recently, detector-free local feature matchers based on transformers have shown promising results, which largely outperform pure Convolutional Neural Network (CNN) based ones. But correlations produced by transformer-based methods are spatially limited to the center of source views' coarse patches, because of the costly attention learning. In this work, we rethink this issue and find that such matching formulation degrades pose estimation, especially for low-resolution images. So we propose a transformer-based cascade matching model -- Cascade feature Matching TRansformer (CasMTR), to efficiently learn dense feature correlations, which allows us to choose more reliable matching pairs for the relative pose estimation. 
Instead of re-training a new detector, we use a simple yet effective Non-Maximum Suppression (NMS) post-process to filter keypoints through the confidence map, and largely improve the matching precision. CasMTR achieves state-of-the-art performance in indoor and outdoor pose estimation as well as visual localization. Moreover, thorough ablations show the efficacy of the proposed components and techniques. + + + + Boundary-Aware Divide and Conquer: A Diffusion-Based Solution for Unsupervised Shadow Removal + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Boundary-Aware_Divide_and_Conquer_A_Diffusion-Based_Solution_for_Unsupervised_Shadow_ICCV_2023_paper.pdf + Recent deep learning methods have achieved superior results in shadow removal. However, most of these supervised methods rely on training over a huge amount of shadow and shadow-free image pairs, which require laborious annotations and may end up with poor model generalization. Shadows, in fact, only form partial degradation in images, while their non-shadow regions provide rich structural information potentially for unsupervised learning. In this paper, we propose a novel diffusion-based solution for unsupervised shadow removal, which separately models the shadow, non-shadow, and their boundary regions. We employ a pretrained unconditional diffusion model fused with non-corrupted information to generate the natural shadow-free image. While the diffusion model can restore the clear structure in the boundary region by utilizing its adjacent non-corrupted contextual information, it fails to address the inner shadow area due to the isolation of the non-corrupted contexts. Thus we further propose a Shadow-Invariant Intrinsic Decomposition module to exploit the underlying reflectance in the shadow region to maintain structural consistency during the diffusive sampling. Extensive experiments on the publicly available shadow removal datasets show that the proposed method achieves a significant improvement compared to existing unsupervised methods, and even is comparable with some existing supervised methods. + + + + Towards Nonlinear-Motion-Aware and Occlusion-Robust Rolling Shutter Correction + http://openaccess.thecvf.com//content/ICCV2023/papers/Qu_Towards_Nonlinear-Motion-Aware_and_Occlusion-Robust_Rolling_Shutter_Correction_ICCV_2023_paper.pdf + This paper addresses the problem of rolling shutter correction in complex nonlinear and dynamic scenes with extreme occlusion. Existing methods suffer from two main drawbacks. Firstly, they face challenges in estimating the accurate correction field due to the uniform velocity assumption, leading to significant image correction errors under complex motion. Secondly, the drastic occlusion in dynamic scenes prevents current solutions from achieving better image quality because of the inherent difficulties in aligning and aggregating multiple frames. To tackle these challenges, we model the curvilinear trajectory of pixels analytically and propose a geometry-based Quadratic Rolling Shutter (QRS) motion solver, which precisely estimates the high-order correction field of individual pixels. Besides, to reconstruct high-quality occlusion frames in dynamic scenes, we present a 3D video architecture that effectively Aligns and Aggregates multi-frame context, namely, RSA2-Net. We evaluate our method across a broad range of cameras and video sequences, demonstrating its significant superiority. 
Specifically, our method surpasses the state-of-the-art by +4.98, +0.77, and +4.33 of PSNR on Carla-RS, Fastec-RS, and BS-RSC datasets, respectively. Code is available at https://github.com/DelinQu/qrsc. + + + + GraphEcho: Graph-Driven Unsupervised Domain Adaptation for Echocardiogram Video Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_GraphEcho_Graph-Driven_Unsupervised_Domain_Adaptation_for_Echocardiogram_Video_Segmentation_ICCV_2023_paper.pdf + Echocardiogram video segmentation plays an important role in cardiac disease diagnosis. This paper studies unsupervised domain adaptation (UDA) for echocardiogram video segmentation, where the goal is to generalize the model trained on the source domain to other unlabeled target domains. Existing UDA segmentation methods are not suitable for this task because they do not model local information and the cyclical consistency of heartbeat. In this paper, we introduce a newly collected CardiacUDA dataset and a novel GraphEcho method for cardiac structure segmentation. Our GraphEcho comprises two innovative modules, the Spatial-wise Cross-domain Graph Matching (SCGM) and the Temporal Cycle Consistency (TCC) module, which utilize prior knowledge of echocardiogram videos, i.e., consistent cardiac structure across patients and centers and the heartbeat cyclical consistency, respectively. These two modules can better align global and local features from source and target domains, leading to improved UDA segmentation results. Experimental results showed that our GraphEcho outperforms existing state-of-the-art UDA segmentation methods. Our collected dataset and code will be publicly released upon acceptance. This work will lay a new and solid cornerstone for cardiac structure segmentation from echocardiogram videos. + + + + Augmented Box Replay: Overcoming Foreground Shift for Incremental Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Augmented_Box_Replay_Overcoming_Foreground_Shift_for_Incremental_Object_Detection_ICCV_2023_paper.pdf + In incremental learning, replaying stored samples from previous tasks together with current task samples is one of the most efficient approaches to address catastrophic forgetting. However, unlike incremental classification, image replay has not been successfully applied to incremental object detection (IOD). In this paper, we identify the overlooked problem of foreground shift as the main reason for this. Foreground shift only occurs when replaying images of previous tasks and refers to the fact that their background might contain foreground objects of the current task. To overcome this problem, a novel and efficient Augmented Box Replay (ABR) method is developed that only stores and replays foreground objects and thereby circumvents the foreground shift problem. In addition, we propose an innovative Attentive RoI Distillation loss that uses spatial attention from region-of-interest (RoI) features to constrain the current model to focus on the most important information from the old model. ABR significantly reduces forgetting of previous classes while maintaining high plasticity in current classes. Moreover, it considerably reduces the storage requirements when compared to standard image replay. Comprehensive experiments on Pascal-VOC and COCO datasets support the state-of-the-art performance of our model. 
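The Augmented Box Replay abstract above replays only stored foreground objects rather than whole images from previous tasks. A toy version of that box-level replay step, where random arrays stand in for real images and stored object crops and the paste location is arbitrary (the full ABR method augments and blends replayed boxes in ways not shown here):

```python
import numpy as np

def paste_foreground(image, fg_patch, box_xy):
    """Paste a stored foreground crop from a previous task into the current image
    and return the image plus the new box annotation (x1, y1, x2, y2)."""
    x, y = box_xy
    h, w = fg_patch.shape[:2]
    out = image.copy()
    out[y:y + h, x:x + w] = fg_patch
    return out, (x, y, x + w, y + h)

rng = np.random.default_rng(0)
current_img = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)
stored_fg = rng.integers(0, 256, size=(40, 60, 3), dtype=np.uint8)  # replayed old-class object
mixed, box = paste_foreground(current_img, stored_fg, box_xy=(100, 50))
print(mixed.shape, box)  # the pasted box is added to the training targets
```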
+ + + + Lighting Every Darkness in Two Pairs: A Calibration-Free Pipeline for RAW Denoising + http://openaccess.thecvf.com//content/ICCV2023/papers/Jin_Lighting_Every_Darkness_in_Two_Pairs_A_Calibration-Free_Pipeline_for_ICCV_2023_paper.pdf + Calibration-based methods have dominated RAW image denoising under extremely low-light environments. However, these methods suffer from several main deficiencies: 1) the calibration procedure is laborious and time-consuming, 2) denoisers for different cameras are difficult to transfer, and 3) the discrepancy between synthetic noise and real noise is enlarged by high digital gain. To overcome the above shortcomings, we propose a calibration-free pipeline for Lighting Every Darkness (LED), regardless of the digital gain or camera sensor. Instead of calibrating the noise parameters and training repeatedly, our method can adapt to a target camera with only few-shot paired data and fine-tuning. In addition, well-designed structural modification during both stages alleviates the domain gap between synthetic noise and real noise without any extra computational cost. With 2 pairs for each additional digital gain (in total 6 pairs) and 0.5% iterations, our method achieves superior performance over other calibration-based methods. + + + + MotionBERT: A Unified Perspective on Learning Human Motion Representations + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_MotionBERT_A_Unified_Perspective_on_Learning_Human_Motion_Representations_ICCV_2023_paper.pdf + We present a unified perspective on tackling various human-centric video tasks by learning human motion representations from large-scale and heterogeneous data resources. Specifically, we propose a pretraining stage in which a motion encoder is trained to recover the underlying 3D motion from noisy partial 2D observations. The motion representations acquired in this way incorporate geometric, kinematic, and physical knowledge about human motion, which can be easily transferred to multiple downstream tasks. We implement the motion encoder with a Dual-stream Spatio-temporal Transformer (DSTformer) neural network. It could capture long-range spatio-temporal relationships among the skeletal joints comprehensively and adaptively, exemplified by the lowest 3D pose estimation error so far when trained from scratch. Furthermore, our proposed framework achieves state-of-the-art performance on all three downstream tasks by simply finetuning the pretrained motion encoder with a simple regression head (1-2 layers), which demonstrates the versatility of the learned motion representations. Code and models are available at https://motionbert.github.io/ + + + + Metric3D: Towards Zero-shot Metric 3D Prediction from A Single Image + http://openaccess.thecvf.com//content/ICCV2023/papers/Yin_Metric3D_Towards_Zero-shot_Metric_3D_Prediction_from_A_Single_Image_ICCV_2023_paper.pdf + Reconstructing accurate 3D scenes from images is a long-standing vision task. Due to the ill-posedness of the single-image reconstruction problem, most well-established methods are built upon multi-view geometry. State-of-the-art (SOTA) monocular metric depth estimation methods can only handle a single camera model and are unable to perform mixed-data training due to the metric ambiguity. Meanwhile, SOTA monocular methods trained on large mixed datasets achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. 
In this work, we show that the key to a zero-shot single-view metric depth model lies in the combination of large-scale data training and resolving the metric ambiguity from various camera models. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problems and can be effortlessly plugged into existing monocular models. Equipped with our module, monocular models can be stably trained on over 8 million images with thousands of camera models, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Experiments demonstrate SOTA performance of our method on 7 zero-shot benchmarks. Our method can recover the metric 3D structure on randomly collected Internet images, enabling plausible single-image metrology. Downstream tasks can also be significantly improved by naively plugging in our model. E.g., our model relieves the scale drift issues of monocular-SLAM (Fig. 1), leading to metric scale high-quality dense mapping. + + + + Lightweight Image Super-Resolution with Superpixel Token Interaction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Lightweight_Image_Super-Resolution_with_Superpixel_Token_Interaction_ICCV_2023_paper.pdf + Transformer-based methods have demonstrated impressive results on single-image super-resolution (SISR) task. However, self-attention mechanism is computationally expensive when applied to the entire image. As a result, current approaches divide low-resolution input images into small patches, which are processed separately and then fused to generate high-resolution images. Nevertheless, this conventional regular patch division is too coarse and lacks interpretability, resulting in artifacts and non-similar structure interference during attention operations. To address these challenges, we propose a novel super token interaction network (SPIN). Our method employs superpixels to cluster local similar pixels to form the explicable local regions and utilizes intra-superpixel attention to enable local information interaction. It is interpretable because only similar regions complement each other and dissimilar regions are excluded. Moreover, we design a superpixel cross-attention module to facilitate information propagation via the surrogation of superpixels. Extensive experiments demonstrate that the proposed SPIN model performs favorably against the state-of-the-art SR methods in terms of accuracy and lightweight. Code is available at https://github.com/ArcticHare105/SPIN. + + + + Iterative Denoiser and Noise Estimator for Self-Supervised Image Denoising + http://openaccess.thecvf.com//content/ICCV2023/papers/Zou_Iterative_Denoiser_and_Noise_Estimator_for_Self-Supervised_Image_Denoising_ICCV_2023_paper.pdf + With the emergence of powerful deep learning tools, more and more effective deep denoisers have advanced the field of image denoising. However, the huge progress made by these learning-based methods severely relies on large-scale and high-quality noisy/clean training pairs, which limits the practicality in real-world scenarios. To overcome this, researchers have been exploring self-supervised approaches that can denoise without paired data. However, the unavailable noise prior and inefficient feature extraction take these methods away from high practicality and precision. In this paper, we propose a Denoise-Corrupt-Denoise pipeline (DCD-Net) for self-supervised image denoising. 
Specifically, we design an iterative training strategy, which iteratively optimizes the denoiser and noise estimator, and gradually approaches high denoising performances using only single noisy images without any noise prior. The proposed self-supervised image denoising framework provides very competitive results compared with state-of-the-art methods on widely used synthetic and real-world image denoising benchmarks. + + + + Memory-and-Anticipation Transformer for Online Action Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Memory-and-Anticipation_Transformer_for_Online_Action_Understanding_ICCV_2023_paper.pdf + Most existing forecasting systems are memory-based methods, which attempt to mimic human forecasting ability by employing various memory mechanisms and have progressed in temporal modeling for memory dependency. Nevertheless, an obvious weakness of this paradigm is that it can only model limited historical dependence and can not transcend the past. In this paper, we rethink the temporal dependence of event evolution and propose a novel memory-anticipation-based paradigm to model an entire temporal structure, including the past, present, and future. Based on this idea, we present Memory-and-Anticipation Transformer (MAT), a memory-anticipation-based approach, to address the online action detection and anticipation tasks. In addition, owing to the inherent superiority of MAT, it can process online action detection and anticipation tasks in a unified manner. The proposed MAT model is tested on four challenging benchmarks TVSeries, THUMOS'14, HDD, and EPIC-Kitchens-100, for online action detection and anticipation tasks, and it significantly outperforms all existing methods. Code is available at https://github.com/Echo0125/Memory-and-Anticipation-Transformer . + + + + Realistic Full-Body Tracking from Sparse Observations via Joint-Level Modeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Realistic_Full-Body_Tracking_from_Sparse_Observations_via_Joint-Level_Modeling_ICCV_2023_paper.pdf + To bridge the physical and virtual worlds for rapidly developed VR/AR applications, the ability to realistically drive 3D full-body avatars is of great significance. Although real-time body tracking with only the head-mounted displays (HMDs) and hand controllers is heavily under-constrained, a carefully designed end-to-end neural network is of great potential to solve the problem by learning from large-scale motion data. To this end, we propose a two-stage framework that can obtain accurate and smooth full-body motions with the three tracking signals of head and hands only. Our framework explicitly models the joint-level features in the first stage and utilizes them as spatiotemporal tokens for alternating spatial and temporal transformer blocks to capture joint-level correlations in the second stage. Furthermore, we design a set of loss terms to constrain the task of a high degree of freedom, such that we can exploit the potential of our joint-level modeling. With extensive experiments on the AMASS motion dataset and real-captured data, we validate the effectiveness of our designs and show our proposed method can achieve more accurate and smooth motion compared to existing approaches. 
+ + + + MetaF2N: Blind Image Super-Resolution by Learning Efficient Model Adaptation from Faces + http://openaccess.thecvf.com//content/ICCV2023/papers/Yin_MetaF2N_Blind_Image_Super-Resolution_by_Learning_Efficient_Model_Adaptation_from_ICCV_2023_paper.pdf + Due to their highly structured characteristics, faces are easier to recover than natural scenes for blind image super-resolution. Therefore, we can extract the degradation representation of an image from the low-quality and recovered face pairs. Using the degradation representation, realistic low-quality images can then be synthesized to fine-tune the super-resolution model for the real-world low-quality image. However, such a procedure is time-consuming and laborious, and the gaps between recovered faces and the ground-truths further increase the optimization uncertainty. To facilitate efficient model adaptation towards image-specific degradations, we propose a method dubbed MetaF2N, which leverages the contained faces to fine-tune model parameters for adapting to the whole natural image in a meta-learning framework. The degradation extraction and low-quality image synthesis steps are thus circumvented in our MetaF2N, and it requires only one fine-tuning step to get decent performance. Considering the gaps between the recovered faces and ground-truths, we further deploy a MaskNet for adaptively predicting loss weights at different positions to reduce the impact of low-confidence areas. To evaluate our proposed MetaF2N, we have collected a real-world low-quality dataset with one or multiple faces in each image, and our MetaF2N achieves superior performance on both synthetic and real world datasets. Source code, pre-trained models, and collected datasets are available at https://github.com/yinzhicun/MetaF2N. + + + + Lighting up NeRF via Unsupervised Decomposition and Enhancement + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Lighting_up_NeRF_via_Unsupervised_Decomposition_and_Enhancement_ICCV_2023_paper.pdf + Neural Radiance Field (NeRF) is a promising approach for synthesizing novel views, given a set of images and the corresponding camera poses of a scene. However, images photographed from a low-light scene can hardly be used to train a NeRF model to produce high-quality results, due to their low pixel intensities, heavy noise, and color distortion. Combining existing low-light image enhancement methods with NeRF methods also does not work well due to the view inconsistency caused by the individual 2D enhancement process. In this paper, we propose a novel approach, called Low-Light NeRF (or LLNeRF), to enhance the scene representation and synthesize normal-light novel views directly from sRGB low-light images in an unsupervised manner. The core of our approach is a decomposition of radiance field learning, which allows us to enhance the illumination, reduce noise and correct the distorted colors jointly with the NeRF optimization process. Our method is able to produce novel view images with proper lighting and vivid colors and details, given a collection of camera-finished low dynamic range (8-bits/channel) images from a low-light scene. Experiments demonstrate that our method outperforms existing low-light enhancement methods and NeRF methods. 
+ + + + ViM: Vision Middleware for Unified Downstream Transferring + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_ViM_Vision_Middleware_for_Unified_Downstream_Transferring_ICCV_2023_paper.pdf + Foundation models are pre-trained on massive data and transferred to downstream tasks via fine-tuning. This work presents Vision Middleware (ViM), a new learning paradigm that targets unified transferring from a single foundation model to a variety of downstream tasks. ViM consists of a zoo of lightweight plug-in modules, each of which is independently learned on a midstream dataset with a shared frozen backbone. Downstream tasks can then benefit from an adequate aggregation of the module zoo thanks to the rich knowledge inherited from midstream tasks. There are three major advantages of such a design. From the efficiency aspect, the upstream backbone can be trained only once and reused for all downstream tasks without tuning. From the scalability aspect, we can easily append additional modules to ViM with no influence on existing modules. From the performance aspect, ViM can include as many midstream tasks as possible, narrowing the task gap between upstream and downstream. Considering these benefits, we believe that ViM, which the community could maintain and develop together, would serve as a powerful tool to assist foundation models. + + + + Co-Evolution of Pose and Mesh for 3D Human Body Estimation from Video + http://openaccess.thecvf.com//content/ICCV2023/papers/You_Co-Evolution_of_Pose_and_Mesh_for_3D_Human_Body_Estimation_ICCV_2023_paper.pdf + Despite significant progress in single image-based 3D human mesh recovery, accurately and smoothly recovering 3D human motion from a video remains challenging. Existing video-based methods generally recover human mesh by estimating the complex pose and shape parameters from coupled image features, whose high complexity and low representation ability often result in inconsistent pose motion and limited shape patterns. To alleviate this issue, we introduce 3D pose as the intermediary and propose a Pose and Mesh Co-Evolution network (PMCE) that decouples this task into two parts: 1) video-based 3D human pose estimation and 2) mesh vertices regression from the estimated 3D pose and temporal image feature. Specifically, we propose a two-stream encoder that estimates mid-frame 3D pose and extracts a temporal image feature from the input image sequence. In addition, we design a co-evolution decoder that performs pose and mesh interactions with the image-guided Adaptive Layer Normalization (AdaLN) to make pose and mesh fit the human body shape. Extensive experiments demonstrate that the proposed PMCE outperforms previous state-of-the-art methods in terms of both per-frame accuracy and temporal consistency on three benchmark datasets: 3DPW, Human3.6M, and MPI-INF-3DHP. Our code is available at https://github.com/kasvii/PMCE. + + + + Viewset Diffusion: (0-)Image-Conditioned 3D Generative Models from 2D Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Szymanowicz_Viewset_Diffusion_0-Image-Conditioned_3D_Generative_Models_from_2D_Data_ICCV_2023_paper.pdf + We present Viewset Diffusion, a diffusion-based generator that outputs 3D objects while only using multi-view 2D data for supervision. We note that there exists a one-to-one mapping between viewsets, i.e., collections of several 2D views of an object, and 3D models. 
Hence, we train a diffusion model to generate viewsets, but design the neural network generator to reconstruct internally corresponding 3D models, thus generating those too. We fit a diffusion model to a large number of viewsets for a given category of objects. The resulting generator can be conditioned on zero, one or more input views. Conditioned on a single view, it performs 3D reconstruction accounting for the ambiguity of the task and allowing to sample multiple solutions compatible with the input. The model performs reconstruction efficiently, in a feed-forward manner, and is trained using only rendering losses using as few as three views per viewset. Project page: szymanowiczs.github.io/viewset-diffusion + + + + SIRA-PCR: Sim-to-Real Adaptation for 3D Point Cloud Registration + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_SIRA-PCR_Sim-to-Real_Adaptation_for_3D_Point_Cloud_Registration_ICCV_2023_paper.pdf + Point cloud registration is essential for many applications. However, existing real datasets require extremely tedious and costly annotations, yet may not provide accurate camera poses. For the synthetic datasets, they are mainly object-level, so the trained models may not generalize well to real scenes. We design SIRA-PCR, a new approach to 3D point cloud registration. First, we build a synthetic scene-level 3D registration dataset, specifically designed with physically-based and random strategies to arrange diverse objects. Second, we account for variations in different sensing mechanisms and layout placements, then formulate a sim-to-real adaptation framework with an adaptive re-sample module to simulate patterns in real point clouds. To our best knowledge, this is the first work that explores sim-to-real adaptation for point cloud registration. Extensive experiments show the SOTA performance of SIRA-PCR on widely-used indoor and outdoor datasets. The code and dataset will be released on https://github.com/Chen-Suyi/SIRA_Pytorch. + + + + SOAR: Scene-debiasing Open-set Action Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhai_SOAR_Scene-debiasing_Open-set_Action_Recognition_ICCV_2023_paper.pdf + Deep models have the risk of utilizing spurious clues to make predictions, e.g., recognizing actions via classifying the background scene. This problem severely degrades the open-set action recognition performance when the testing samples exhibit scene distributions different from the training samples. To mitigate this scene bias, we propose a Scene-debiasing Open-set Action Recognition method (SOAR), which features an adversarial reconstruction module and an adaptive adversarial scene classification module. The former prevents a decoder from reconstructing the video background given video features, and thus helps reduce the background information in feature learning. The latter aims to confuse scene type classification given video features, and helps to learn scene-invariant information. In addition, we design an experiment to quantify the scene bias. The results suggest current open-set action recognizers are biased toward the scene, and our SOAR better mitigates such bias. Furthermore, extensive experiments show our method outperforms state-of-the-art methods, with ablation studies demonstrating the effectiveness of our proposed modules. 
+ + + + Discovering Spatio-Temporal Rationales for Video Question Answering + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Discovering_Spatio-Temporal_Rationales_for_Video_Question_Answering_ICCV_2023_paper.pdf + This paper strives to solve complex video question answering (VideoQA), which features long videos containing multiple objects and events at different times. To tackle the challenge, we highlight the importance of identifying question-critical temporal moments and spatial objects from the vast amount of video content. Towards this, we propose a Spatio-Temporal Rationalizer (STR), a differentiable selection module that adaptively collects question-critical moments and objects using cross-modal interaction. The discovered video moments and objects then serve as grounded rationales to support answer reasoning. Based on STR, we further propose TranSTR, a Transformer-style neural network architecture that takes STR as the core and additionally underscores a novel answer interaction mechanism to coordinate STR for answer decoding. Experiments on four datasets show that TranSTR achieves a new state-of-the-art (SoTA). In particular, on NExT-QA and Causal-VidQA, which feature complex VideoQA, it significantly surpasses the previous SoTA by 5.8% and 6.8%, respectively. We then conduct extensive studies to verify the importance of STR as well as the proposed answer interaction mechanism. With the success of TranSTR and our comprehensive analysis, we hope this work can spark more future efforts in complex VideoQA. Our results are fully reproducible at https://anonymous.4open.science/r/TranSTR/. + + + + Iterative Soft Shrinkage Learning for Efficient Image Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Iterative_Soft_Shrinkage_Learning_for_Efficient_Image_Super-Resolution_ICCV_2023_paper.pdf + Image super-resolution (SR) has witnessed extensive neural network designs from CNN to transformer architectures. However, prevailing SR models suffer from prohibitive memory footprints and intensive computations, which limits further deployment on edge devices. This work investigates the potential of network pruning for super-resolution to take advantage of off-the-shelf network designs and reduce the underlying computational overhead. Two main challenges remain in applying pruning methods for SR. First, the widely-used filter pruning technique reflects limited granularity and restricted adaptability to diverse network structures. Second, existing pruning methods generally operate upon a pre-trained network for the sparse structure determination, making it hard to get rid of dense model training in the traditional SR paradigm. To address these challenges, we adopt unstructured pruning with sparse models directly trained from scratch. Specifically, we propose a novel Iterative Soft Shrinkage-Percentage (ISS-P) method by optimizing the sparse structure of a randomly initialized network at each iteration and tweaking unimportant weights on-the-fly by a small amount proportional to the magnitude scale. We observe that the proposed ISS-P can dynamically learn sparse structures adapting to the optimization process and preserve the sparse model's trainability by yielding a more regularized gradient throughput. Experiments on benchmark datasets demonstrate the effectiveness of the proposed ISS-P over diverse network architectures.
Code is available at https://github.com/Jiamian-Wang/Iterative-Soft-Shrinkage-SR + + + + G2L: Semantically Aligned and Uniform Video Grounding via Geodesic and Game Theory + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_G2L_Semantically_Aligned_and_Uniform_Video_Grounding_via_Geodesic_and_ICCV_2023_paper.pdf + The recent video grounding works attempt to introduce vanilla contrastive learning into video grounding. However, we claim that this naive solution is suboptimal. Contrastive learning requires two key properties: (1) alignment of features of similar samples, and (2) uniformity of the induced distribution of the normalized features on the hypersphere. Due to two annoying issues in video grounding: (1) the co-existence of some visual entities in both ground truth and other moments, i.e. semantic overlapping; (2) only a few moments in the video are annotated, i.e. sparse annotation dilemma, vanilla contrastive learning is unable to model the correlations between temporally distant moments and learned inconsistent video representations. Both characteristics lead to vanilla contrastive learning being unsuitable for video grounding. In this paper, we introduce Geodesic and Game Localization (G2L), a semantically aligned and uniform video grounding framework via geodesic and game theory. We quantify the correlations among moments leveraging the geodesic distance that guides the model to learn the correct cross-modal representations. Furthermore, from the novel perspective of game theory, we propose semantic Shapley interaction based on geodesic distance sampling to learn fine-grained semantic alignment in similar moments. Experiments on three benchmarks demonstrate the effectiveness of our method. + + + + FashionNTM: Multi-turn Fashion Image Retrieval via Cascaded Memory + http://openaccess.thecvf.com//content/ICCV2023/papers/Pal_FashionNTM_Multi-turn_Fashion_Image_Retrieval_via_Cascaded_Memory_ICCV_2023_paper.pdf + Multi-turn textual feedback-based fashion image retrieval focuses on a real-world setting, where users can iteratively provide information to refine retrieval results until they find an item that fits all their requirements. In this work, we present a novel memory-based method, called FashionNTM, for such a multi-turn system. Our framework incorporates a new Cascaded Memory Neural Turing Machine (CM-NTM) approach for implicit state management, thereby learning to integrate information across all past turns to retrieve new images, for a given turn. Unlike vanilla Neural Turing Machine (NTM), our CM-NTM operates on multiple inputs, which interact with their respective memories via individual read and write heads, to learn complex relationships. Extensive evaluation results show that our proposed method outperforms the previous state-of-the-art algorithm by 50.5%, on Multi-turn FashionIQ -- the only existing multi-turn fashion dataset currently, in addition to having a relative improvement of 12.6% on Multi-turn Shoes -- an extension of the single-turn Shoes dataset that we created in this work. Further analysis of the model in a real-world interactive setting demonstrates two important capabilities of our model -- memory retention across turns, and agnosticity to turn order for non-contradictory feedback. Finally, user study results show that images retrieved by FashionNTM were favored by 83.1% over other multi-turn models. 
+ + + + Towards Zero Domain Gap: A Comprehensive Study of Realistic LiDAR Simulation for Autonomy Testing + http://openaccess.thecvf.com//content/ICCV2023/papers/Manivasagam_Towards_Zero_Domain_Gap_A_Comprehensive_Study_of_Realistic_LiDAR_ICCV_2023_paper.pdf + Testing the full autonomy system in simulation is the safest and most scalable way to evaluate autonomous vehicle performance before deployment. This requires simulating sensor inputs such as LiDAR. To be effective, it is essential that the simulation has low domain gap with the real world. That is, the autonomy system in simulation should perform exactly the same way it would in the real world for the same scenario. To date, there has been limited analysis into what aspects of LiDAR phenomena affect autonomy performance. It is also difficult to evaluate the domain gap of existing LiDAR simulators, as they operate on fully synthetic scenes. In this paper, we propose a novel "paired-scenario" approach to evaluating the domain gap of a LiDAR simulator by reconstructing digital twins of real world scenarios. We can then simulate LiDAR in the scene and compare it to the real LiDAR. We leverage this setting to analyze what aspects of LiDAR simulation, such as pulse phenomena, scanning effects, and asset quality, affect the domain gap with respect to the autonomy system, including perception, prediction, and motion planning, and analyze how modifications to the simulated LiDAR influence each part. We identify key aspects that are important to model, such as motion blur, material reflectance, and the accurate geometric reconstruction of traffic participants. This helps provide research directions for improving LiDAR simulation and autonomy robustness to these effects. For more information, please visit the project website: https://waabi.ai/lidar-dg + + + + Random Sub-Samples Generation for Self-Supervised Real Image Denoising + http://openaccess.thecvf.com//content/ICCV2023/papers/Pan_Random_Sub-Samples_Generation_for_Self-Supervised_Real_Image_Denoising_ICCV_2023_paper.pdf + With sufficient paired training samples, the supervised deep learning methods have attracted much attention in image denoising because of their superior performance. However, it is still very challenging to widely utilize the supervised methods in real cases due to the lack of paired noisy-clean images. Meanwhile, most self-supervised denoising methods are ineffective as well when applied to the real-world denoising tasks because of their strict assumptions in applications. For example, as a typical method for self-supervised denoising, the original blind spot network (BSN) assumes that the noise is pixel-wise independent, which is much different from the real cases. To solve this problem, we propose a novel self-supervised real image denoising framework named Sampling Difference As Perturbation (SDAP) based on Random Sub-samples Generation (RSG) with a cyclic sample difference loss. Specifically, we dig deeper into the properties of BSN to make it more suitable for real noise. Surprisingly, we find that adding an appropriate perturbation to the training images can effectively improve the performance of BSN. Further, we propose that the sampling difference can be considered as perturbation to achieve better results. Finally we propose a new BSN framework in combination with our RSG strategy. The results show that it significantly outperforms other state-of-the-art self-supervised denoising methods on real-world datasets. 
The code is available at https://github.com/p1y2z3/SDAP. + + + + Waffling Around for Performance: Visual Classification with Random Words and Broad Concepts + http://openaccess.thecvf.com//content/ICCV2023/papers/Roth_Waffling_Around_for_Performance_Visual_Classification_with_Random_Words_and_ICCV_2023_paper.pdf + The visual classification performance of vision-language models such as CLIP has been shown to benefit from additional semantic knowledge from large language models (LLMs) such as GPT-3. In particular, averaging over LLM-generated class descriptors, e.g. "waffle, which has a round shape", can notably improve generalization performance. In this work, we critically study this behavior and propose WaffleCLIP, a framework for zero-shot visual classification which simply replaces LLM-generated descriptors with random character and word descriptors. Without querying external models, we achieve comparable performance gains on a large number of visual classification tasks. This allows WaffleCLIP to both serve as a low-cost alternative, as well as a sanity check for any future LLM-based vision-language model extensions. We conduct an extensive experimental study on the impact and shortcomings of additional semantics introduced with LLM-generated descriptors, and showcase how - if available - semantic context is better leveraged by querying LLMs for high-level concepts, which we show can be done to jointly resolve potential class name ambiguities. Code is available here: https://github.com/ExplainableML/WaffleCLIP. + + + + AutoAD II: The Sequel - Who, When, and What in Movie Audio Description + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_AutoAD_II_The_Sequel_-_Who_When_and_What_in_ICCV_2023_paper.pdf + Audio Description (AD) is the task of generating descriptions of visual content, at suitable time intervals, for the benefit of visually impaired audiences. For movies, this presents notable challenges -- AD must occur only during existing pauses in dialogue, should refer to characters by name, and ought to aid understanding of the storyline as a whole. To this end, we develop a new model for automatically generating movie AD, given CLIP visual features of the frames, the cast list, and the temporal locations of the speech; addressing all three of the `who', `when', and `what' questions: (i) who -- we introduce a character bank consisting of the character's name, the actor that played the part, and a CLIP feature of their face, for the principal cast of each movie, and demonstrate how this can be used to improve naming in the generated AD; (ii) when -- we investigate several models for determining whether an AD should be generated for a time interval or not, based on the visual content of the interval and its neighbours; and (iii) what -- we implement a new vision-language model for this task, that can ingest the proposals from the character bank, whilst conditioning on the visual features using cross-attention, and demonstrate how this improves over previous architectures for AD text generation in an apples-to-apples comparison. + + + + Hyperbolic Chamfer Distance for Point Cloud Completion + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Hyperbolic_Chamfer_Distance_for_Point_Cloud_Completion_ICCV_2023_paper.pdf + Chamfer distance (CD) is a standard metric to measure the shape dissimilarity between point clouds in point cloud completion, as well as a loss function for (deep) learning. 
However, it is well known that CD is vulnerable to outliers, leading to the drift towards suboptimal models. In contrast to the literature where most works address such issues in Euclidean space, we propose an extremely simple yet powerful metric for point cloud completion, namely Hyperbolic Chamfer Distance (HyperCD), that computes CD in hyperbolic space. In backpropagation, HyperCD consistently assigns higher weights to the matched point pairs with smaller Euclidean distances. In this way, good point matches are likely to be preserved while bad matches can be updated gradually, leading to better completion results. We demonstrate state-of-the-art performance on the benchmark datasets, i.e. PCN, ShapeNet-55, and ShapeNet-34, and show from visualization that HyperCD can significantly improve the surface smoothness. Code is available at: https://github.com/Zhang-VISLab. + + + + AG3D: Learning to Generate 3D Avatars from 2D Image Collections + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_AG3D_Learning_to_Generate_3D_Avatars_from_2D_Image_Collections_ICCV_2023_paper.pdf + While progress in 2D generative models of human appearance has been rapid, many applications require 3D avatars that can be animated and rendered. Unfortunately, most existing methods for learning generative models of 3D humans with diverse shape and appearance require 3D training data, which is limited and expensive to acquire. The key to progress is hence to learn generative models of 3D avatars from abundant unstructured 2D image collections. However, learning realistic and complete 3D appearance and geometry in this under-constrained setting remains challenging, especially in the presence of loose clothing such as dresses. In this paper, we propose a new adversarial generative model of realistic 3D people from 2D images. Our method captures shape and deformation of the body and loose clothing by adopting a holistic 3D generator and integrating an efficient, flexible, articulation module. To improve realism, we train our model using multiple discriminators while also integrating geometric cues in the form of predicted 2D normal maps. We experimentally find that our method outperforms previous 3D- and articulation-aware methods in terms of geometry and appearance. We validate the effectiveness of our model and the importance of each component via systematic ablation studies. + + + + Learned Image Reasoning Prior Penetrates Deep Unfolding Network for Panchromatic and Multi-spectral Image Fusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Learned_Image_Reasoning_Prior_Penetrates_Deep_Unfolding_Network_for_Panchromatic_ICCV_2023_paper.pdf + The success of deep neural networks for pan-sharpening is commonly in a form of black box, lacking transparency and interpretability. To alleviate this issue, we propose a novel model-driven deep unfolding framework with image reasoning prior tailored for the pan-sharpening task. Different from existing unfolding solutions that deliver the proximal operator networks as the uncertain and vague priors, our framework is motivated by the content reasoning ability of masked autoencoders (MAE) with insightful designs. Specifically, the pre-trained MAE with spatial masking strategy, acting as intrinsic reasoning prior, is embedded into unfolding architecture. Meanwhile, the pre-trained MAE with spatial-spectral masking strategy is treated as the regularization term within loss function to constrain the spatial-spectral consistency. 
Such designs penetrate the image reasoning prior into deep unfolding networks while improving its interpretability and representation capability. The uniqueness of our framework is that the holistic learning process is explicitly integrated with the inherent physical mechanism underlying the pan-sharpening task. Extensive experiments on multiple satellite datasets demonstrate the superiority of our method over the existing state-of-the-art approaches. + + + + NCHO: Unsupervised Learning for Neural 3D Composition of Humans and Objects + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_NCHO_Unsupervised_Learning_for_Neural_3D_Composition_of_Humans_and_ICCV_2023_paper.pdf + Deep generative models have been recently extended to synthesizing 3D digital humans. However, previous approaches treat clothed humans as a single chunk of geometry without considering the compositionality of clothing and accessories. As a result, individual items cannot be naturally composed into novel identities, leading to limited expressiveness and controllability of generative 3D avatars. While several methods attempt to address this by leveraging synthetic data, the interaction between humans and objects is not authentic due to the domain gap, and manual asset creation is difficult to scale for a wide variety of objects. In this work, we present a novel framework for learning a compositional generative model of humans and objects (backpacks, coats, scarves, and more) from real-world 3D scans. Our compositional model is interaction-aware, meaning the spatial relationship between humans and objects, and the mutual shape change by physical contact is fully incorporated. The key challenge is that, since humans and objects are in contact, their 3D scans are merged into a single piece. To decompose them without manual annotations, we propose to leverage two sets of 3D scans of a single person with and without objects. Our approach learns to decompose objects and naturally compose them back into a generative human model in an unsupervised manner. Despite our simple setup requiring only the capture of a single subject with objects, our experiments demonstrate the strong generalization of our model by enabling the natural composition of objects to diverse identities in various poses and the composition of multiple objects, which is unseen in training data. The project page is available at https://taeksuu.github.io/ncho. + + + + Learning Non-Local Spatial-Angular Correlation for Light Field Image Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_Learning_Non-Local_Spatial-Angular_Correlation_for_Light_Field_Image_Super-Resolution_ICCV_2023_paper.pdf + Exploiting spatial-angular correlation is crucial to light field (LF) image super-resolution (SR), but is highly challenging due to its non-local property caused by the disparities among LF images. Although many deep neural networks (DNNs) have been developed for LF image SR and achieved continuously improved performance, existing methods cannot well leverage the long-range spatial-angular correlation and thus suffer a significant performance drop when handling scenes with large disparity variations. In this paper, we propose a simple yet effective method to learn the non-local spatial-angular correlation for LF image SR. 
In our method, we adopt the epipolar plane image (EPI) representation to project the 4D spatial-angular correlation onto multiple 2D EPI planes, and then develop a Transformer network with repetitive self-attention operations to learn the spatial-angular correlation by modeling the dependencies between each pair of EPI pixels. Our method can fully incorporate the information from all angular views while achieving a global receptive field along the epipolar line. We conduct extensive experiments with insightful visualizations to validate the effectiveness of our method. Comparative results on five public datasets show that our method not only achieves state-of-the-art SR performance, but also remains robust to disparity variations. + + + + MGMAE: Motion Guided Masking for Video Masked Autoencoding + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_MGMAE_Motion_Guided_Masking_for_Video_Masked_Autoencoding_ICCV_2023_paper.pdf + Masked autoencoding has shown excellent performance on self-supervised video representation learning. Temporal redundancy has led to a high masking ratio and a customized masking strategy in VideoMAE. In this paper, we aim to further improve the performance of video masked autoencoding by introducing a motion guided masking strategy. Our key insight is that motion is a general and unique prior in video, which should be taken into account during masked pre-training. Our motion guided masking explicitly incorporates motion information to build a temporally consistent masking volume. Based on this masking volume, we can track the unmasked tokens in time and sample a set of temporally consistent cubes from videos. These temporally aligned unmasked tokens further relieve the information leakage issue in time and encourage MGMAE to learn more useful structure information. We implement our MGMAE with an efficient online optical flow estimator and a backward masking map warping strategy. We perform experiments on the Something-Something V2 and Kinetics-400 datasets, demonstrating the superior performance of our MGMAE over the original VideoMAE. In addition, we provide a visualization analysis to illustrate that our MGMAE can sample temporally consistent cubes in a motion-adaptive manner for more effective video pre-training. + + + + ViewRefer: Grasp the Multi-view Knowledge for 3D Visual Grounding + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_ViewRefer_Grasp_the_Multi-view_Knowledge_for_3D_Visual_Grounding_ICCV_2023_paper.pdf + Understanding 3D scenes from multi-view inputs has been proven to alleviate the view discrepancy issue in 3D visual grounding. However, existing methods normally neglect the view cues embedded in the text modality and fail to weigh the relative importance of different views. In this paper, we propose ViewRefer, a multi-view framework for 3D visual grounding that explores how to grasp the view knowledge from both text and 3D modalities. For the text branch, ViewRefer leverages the diverse linguistic knowledge of large-scale language models to expand a single grounding text to multiple geometry-consistent descriptions. Meanwhile, in the 3D modality, a transformer fusion module with inter-view attention is introduced to boost the interaction of objects across views.
On top of that, we further present a set of learnable multi-view prototypes, which memorize scene-agnostic knowledge for different views, and enhance the framework from two perspectives: a view-guided attention module for more robust text features, and a view-guided scoring strategy during the final prediction. With our designed paradigm, ViewRefer achieves superior performance on three benchmarks and surpasses the second-best by +2.8%, +1.5%, and +1.35% on Sr3D, Nr3D, and ScanRefer. + + + + CaPhy: Capturing Physical Properties for Animatable Human Avatars + http://openaccess.thecvf.com//content/ICCV2023/papers/Su_CaPhy_Capturing_Physical_Properties_for_Animatable_Human_Avatars_ICCV_2023_paper.pdf + We present CaPhy, a novel method for reconstructing animatable human avatars with realistic dynamic properties for clothing. Specifically, we aim for capturing the geometric and physical properties of the clothing from real observations. This allows us to apply novel poses to the human avatar with physically correct deformations and wrinkles of the clothing. To this end, we combine unsupervised training with physics-based losses and 3D-supervised training using scanned data to reconstruct a dynamic model of clothing that is physically realistic and conforms to the human scans. We also optimize the physical parameters of the underlying physical model from the scans by introducing gradient constraints of the physics-based losses. In contrast to previous work on 3D avatar reconstruction, our method is able to generalize to novel poses with realistic dynamic cloth deformations. Experiments on several subjects demonstrate that our method can estimate the physical properties of the garments, resulting in superior quantitative and qualitative results compared with previous methods. + + + + Fine-grained Unsupervised Domain Adaptation for Gait Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Fine-grained_Unsupervised_Domain_Adaptation_for_Gait_Recognition_ICCV_2023_paper.pdf + Gait recognition has emerged as a promising technique for the long-range retrieval of pedestrians, providing numerous advantages such as accurate identification in challenging conditions and non-intrusiveness, making it highly desirable for improving public safety and security. However, the high cost of labeling datasets, which is a prerequisite for most existing fully supervised approaches, poses a significant obstacle to the development of gait recognition. Recently, some unsupervised methods for gait recognition have shown promising results. However, these methods mainly rely on a fine-tuning approach that does not sufficiently consider the relationship between source and target domains, leading to the catastrophic forgetting of source domain knowledge. This paper presents a novel perspective that adjacent-view sequences exhibit overlapping views, which can be leveraged by the network to gradually attain cross-view and cross-dressing capabilities without pre-training on the labeled source domain. Specifically, we propose a fine-grained Unsupervised Domain Adaptation (UDA) framework that iteratively alternates between two stages. The initial stage involves offline clustering, which transfers knowledge from the labeled source domain to the unlabeled target domain and adaptively generates pseudo-labels according to the expressiveness of each part. 
Subsequently, the second stage encompasses online training, which further achieves cross-dressing capabilities by continuously learning to distinguish numerous features of source and target domains. The effectiveness of the proposed method is demonstrated through extensive experiments conducted on widely-used public gait datasets. + + + + Instance-aware Dynamic Prompt Tuning for Pre-trained Point Cloud Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zha_Instance-aware_Dynamic_Prompt_Tuning_for_Pre-trained_Point_Cloud_Models_ICCV_2023_paper.pdf + Pre-trained point cloud models have found extensive applications in 3D understanding tasks like object classification and part segmentation. However, the prevailing strategy of full fine-tuning in downstream tasks leads to large per-task storage overhead for model parameters, which limits the efficiency when applying large-scale pre-trained models. Inspired by the recent success of visual prompt tuning (VPT), this paper attempts to explore prompt tuning on pre-trained point cloud models, to pursue an elegant balance between performance and parameter efficiency. We find while instance-agnostic static prompting, e.g. VPT, shows some efficacy in downstream transfer, it is vulnerable to the distribution diversity caused by various types of noises in real-world point cloud data. To conquer this limitation, we propose a novel Instance-aware Dynamic Prompt Tuning (IDPT) strategy for pre-trained point cloud models. The essence of IDPT is to develop a dynamic prompt generation module to perceive semantic prior features of each point cloud instance and generate adaptive prompt tokens to enhance the model's robustness. Notably, extensive experiments demonstrate that IDPT outperforms full fine-tuning in most tasks with a mere 7% of the trainable parameters, providing a promising solution to parameter-efficient learning for pre-trained point cloud models. Code is available at https://github.com/zyh16143998882/ICCV23-IDPT. + + + + GeoUDF: Surface Reconstruction from 3D Point Clouds via Geometry-guided Distance Representation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ren_GeoUDF_Surface_Reconstruction_from_3D_Point_Clouds_via_Geometry-guided_Distance_ICCV_2023_paper.pdf + We present a learning-based method, namely GeoUDF, to tackle the long-standing and challenging problem of reconstructing a discrete surface from a sparse point cloud. To be specific, we propose a geometry-guided learning method for UDF and its gradient estimation that explicitly formulates the unsigned distance of a query point as the learnable affine averaging of its distances to the tangent planes of neighboring points on the surface. Besides, we model the local geometric structure of the input point clouds by explicitly learning a quadratic polynomial for each point. This not only facilitates upsampling the input sparse point cloud but also naturally induces unoriented normal, which further augments UDF estimation. Finally, to extract triangle meshes from the predicted UDF, we propose a customized edge-based marching cube module. We conduct extensive experiments and ablation studies to demonstrate the significant advantages of our method over state-of-the-art methods in terms of reconstruction accuracy, efficiency, and generality. The source code is publicly available at https://github.com/rsy6318/GeoUDF. 
+ + + + MeMOTR: Long-Term Memory-Augmented Transformer for Multi-Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_MeMOTR_Long-Term_Memory-Augmented_Transformer_for_Multi-Object_Tracking_ICCV_2023_paper.pdf + As a video task, Multiple Object Tracking (MOT) is expected to capture temporal information of targets effectively. Unfortunately, most existing methods only explicitly exploit the object features between adjacent frames, while lacking the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. Our method is able to make the same object's track embedding more stable and distinguishable by leveraging long-term memory injection with a customized memory-attention layer. This significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR impressively surpasses the state-of-the-art method by 7.9% and 13.0% on the HOTA and AssA metrics, respectively. Furthermore, our model also outperforms other Transformer-based methods in association performance on MOT17 and generalizes well on BDD100K. Code is available at https://github.com/MCG-NJU/MeMOTR. + + + + RawHDR: High Dynamic Range Image Reconstruction from a Single Raw Image + http://openaccess.thecvf.com//content/ICCV2023/papers/Zou_RawHDR_High_Dynamic_Range_Image_Reconstruction_from_a_Single_Raw_ICCV_2023_paper.pdf + High dynamic range (HDR) images can record many more intensity levels than usual ones. Existing methods mainly reconstruct HDR images from the 8-bit low dynamic range (LDR) sRGB images that have been degraded by the camera processing pipeline. However, recovering extremely high dynamic range scenes from such low bit-depth data is challenging. Unlike existing methods, the core idea of this work is to incorporate more informative Raw sensor data to generate HDR images, aiming to recover scene information in hard regions (the darkest and brightest areas of an HDR scene). We propose a model customized for Raw images, considering the unique feature of Raw data to learn the Raw-to-HDR mapping. Specifically, we learn exposure masks to separate the hard and easy regions of a high dynamic scene. Then, we introduce two important guidances: dual intensity guidance, which guides less informative channels with more informative ones, and global spatial guidance, which hallucinates scene details from a longer spatial range. To verify our Raw-to-HDR approach, we collect a large and high-quality Raw/HDR paired dataset for both training and testing, which will be made publicly available. We verify the superiority of the proposed Raw-to-HDR reconstruction model, as well as our newly captured dataset, in the experiments. + + + + Robust Object Modeling for Visual Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_Robust_Object_Modeling_for_Visual_Tracking_ICCV_2023_paper.pdf + Object modeling has become a core part of recent tracking frameworks. Current popular trackers use Transformer attention to extract the template feature separately or interactively with the search region. However, separate template learning lacks communication between the template and search regions, which brings difficulty in extracting discriminative target-oriented features. On the other hand, interactive template learning produces hybrid template features, which may introduce potential distractors to the template via the cluttered search regions.
To enjoy the merits of both methods, we propose a robust object modeling framework for visual tracking (ROMTrack), which simultaneously models the inherent template and the hybrid template features. As a result, harmful distractors can be suppressed by combining the inherent features of target objects with search regions' guidance. Target-related features can also be extracted using the hybrid template, thus resulting in a more robust object modeling framework. To further enhance robustness, we present novel variation tokens to depict the ever-changing appearance of target objects. Variation tokens are adaptable to object deformation and appearance variations, which can boost overall performance with negligible computation. Experiments show that our ROMTrack sets a new state-of-the-art on multiple benchmarks. + + + + FSI: Frequency and Spatial Interactive Learning for Image Restoration in Under-Display Cameras + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_FSI_Frequency_and_Spatial_Interactive_Learning_for_Image_Restoration_in_ICCV_2023_paper.pdf + Under-display camera (UDC) systems remove the screen notch for bezel-free displays and provide a better interactive experience. The main challenge is that the pixel array of light-emitting diodes used for display diffracts and attenuates the incident light, leading to complex degradation. Existing models eliminate spatial diffraction by maximizing model capacity through complex designs and ignore the periodic distribution of diffraction in the frequency domain, which prevents these approaches from achieving satisfactory results. In this paper, we introduce a new perspective for handling various diffraction effects in UDC images by jointly exploring feature restoration in the frequency and spatial domains, and present a Frequency and Spatial Interactive Learning Network (FSI). It consists of a series of well-designed Frequency-Spatial Joint (FSJ) modules for feature learning and a color transform module for color enhancement. In particular, in the FSJ module, a frequency learning block uses the Fourier transform to eliminate spectral bias, a spatial learning block uses a multi-distillation structure to supplement the absence of local details, and a dual transfer unit facilitates the interactive learning between features of different domains. Experimental results demonstrate the superiority of the proposed FSI over state-of-the-art models through extensive quantitative and qualitative evaluations on three widely-used UDC benchmarks. + + + + Temporal Collection and Distribution for Referring Video Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_Temporal_Collection_and_Distribution_for_Referring_Video_Object_Segmentation_ICCV_2023_paper.pdf + Referring video object segmentation aims to segment a referent throughout a video sequence according to a natural language expression. It requires aligning the natural language expression with the objects' motions and their dynamic associations at the global video level but segmenting objects at the frame level. To achieve this goal, we propose to simultaneously maintain a global referent token and a sequence of object queries, where the former is responsible for capturing the video-level referent according to the language expression, while the latter serves to better locate and segment objects within each frame.
Furthermore, to explicitly capture object motions and spatial-temporal cross-modal reasoning over objects, we propose a novel temporal collection-distribution mechanism for interacting between the global referent token and object queries. Specifically, the temporal collection mechanism collects global information for the referent token from object queries to the temporal motions to the language expression. In turn, the temporal distribution first distributes the referent token to the referent sequence across all frames and then performs efficient cross-frame reasoning between the referent sequence and object queries in every frame. Experimental results show that our method outperforms state-of-the-art methods on all benchmarks consistently and significantly. + + + + Variational Degeneration to Structural Refinement: A Unified Framework for Superimposed Image Decomposition + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Variational_Degeneration_to_Structural_Refinement_A_Unified_Framework_for_Superimposed_ICCV_2023_paper.pdf + Decomposing a single mixed image into individual image layers is the common crux of a classical category of tasks in image restoration. Several unified frameworks have been proposed that can handle different types of degradation in superimposed image decomposition. However, there are always undesired structural distortions in the separated images when dealing with complicated degradation patterns. In this paper, we propose a unified framework for superimposed image decomposition that can cope with intricate degradation patterns adaptively. Considering the different mixing patterns between the layers, we introduce a degeneration representation in the latent space to mine the intrinsic relationship between the superimposed image and the degeneration pattern. Moreover, by extracting structure-guided knowledge from the superimposed image, we further propose structural guidance refinement to avoid confusing content caused by structure distortion. Extensive experiments have demonstrated that our method remarkably outperforms other popular image separation frameworks. The method also achieves competitive results on related applications including image deraining, image reflection removal, and image shadow removal, which validates the generalization of the framework. + + + + Focal Network for Image Restoration + http://openaccess.thecvf.com//content/ICCV2023/papers/Cui_Focal_Network_for_Image_Restoration_ICCV_2023_paper.pdf + Image restoration aims to reconstruct a sharp image from its degraded counterpart, which plays an important role in many fields. Recently, Transformer models have achieved promising performance on various image restoration tasks. However, their quadratic complexity remains an intractable issue for practical applications. The aim of this study is to develop an efficient and effective framework for image restoration. Inspired by the fact that different regions in a corrupted image always undergo degradations in various degrees, we propose to focus more on the important areas for reconstruction. To this end, we introduce a dual-domain selection mechanism to emphasize crucial information for restoration, such as edge signals and hard regions. In addition, we split high-resolution features to insert multi-scale receptive fields into the network, which improves both efficiency and performance. Finally, the proposed network, dubbed FocalNet, is built by incorporating these designs into a U-shaped backbone. 
Extensive experiments demonstrate that our model achieves state-of-the-art performance on ten datasets for three tasks, including single-image defocus deblurring, image dehazing, and image desnowing. Our code is available at https://github.com/c-yn/FocalNet. + + + + Indoor Depth Recovery Based on Deep Unfolding with Non-Local Prior + http://openaccess.thecvf.com//content/ICCV2023/papers/Dai_Indoor_Depth_Recovery_Based_on_Deep_Unfolding_with_Non-Local_Prior_ICCV_2023_paper.pdf + In recent years, depth recovery based on deep networks has achieved great success. However, the existing state-of-the-art network designs perform like black boxes in depth recovery tasks, lacking a clear mechanism. Utilizing the property that there is a large amount of non-local common characteristics in depth images, we propose a novel model-guided depth recovery method, namely the DC-NLAR model. A non-local auto-regressive regular term is also embedded into our model to capture more non-local depth information. To fully use the excellent performance of neural networks, we develop a deep image prior to better describe the characteristic of depth images. We also introduce an implicit data consistency term to tackle the degenerate operator with high heterogeneity. We then unfold the proposed model into networks by using the half-quadratic splitting algorithm. This proposed method is experimented on the NYU-Depth V2 and SUN RGB-D datasets, and the experimental results achieve comparable performance to that of deep learning methods. + + + + GAFlow: Incorporating Gaussian Attention into Optical Flow + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_GAFlow_Incorporating_Gaussian_Attention_into_Optical_Flow_ICCV_2023_paper.pdf + Optical flow, or the estimation of motion fields from image sequences, is one of the fundamental problems in computer vision. Unlike most pixel-wise tasks that aim at achieving consistent representations of the same category, optical flow raises extra demands for obtaining local discrimination and smoothness, which yet is not fully explored by existing approaches. In this paper, we push Gaussian Attention (GA) into the optical flow models to accentuate local properties during representation learning and enforce the motion affinity during matching. Specifically, we introduce a novel Gaussian-Constrained Layer (GCL) which can be easily plugged into existing Transformer blocks to highlight the local neighborhood that contains fine-grained structural information. Moreover, for reliable motion analysis, we provide a new Gaussian-Guided Attention Module (GGAM) which not only inherits properties from Gaussian distribution to instinctively revolve around the neighbor fields of each point but also is empowered to put the emphasis on contextually related regions during matching. Our fully-equipped model, namely Gaussian Attention Flow network (GAFlow), naturally incorporates a series of novel Gaussian-based modules into the conventional optical flow framework for reliable motion analysis. Extensive experiments on standard optical flow datasets consistently demonstrate the exceptional performance of the proposed approach in terms of both generalization ability evaluation and online benchmark testing. Code is available at https://github.com/LA30/GAFlow. 
+ + + + SoDaCam: Software-defined Cameras via Single-Photon Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Sundar_SoDaCam_Software-defined_Cameras_via_Single-Photon_Imaging_ICCV_2023_paper.pdf + Reinterpretable cameras are defined by their post-processing capabilities that exceed traditional imaging. We present "SoDaCam" that provides reinterpretable cameras at the granularity of photons, from photon-cubes acquired by single-photon devices. Photon-cubes represent the spatio-temporal detections of photons as a sequence of binary frames, at frame-rates as high as 100 kHz. We show that simple transformations of the photon-cube, or photon-cube projections, provide the functionality of numerous imaging systems including: exposure bracketing, flutter shutter cameras, video compressive systems, event cameras, and even cameras that move during exposure. Our photon-cube projections offer the flexibility of being software-defined constructs that are only limited by what is computable, and shot-noise. We exploit this flexibility to provide new capabilities for the emulated cameras. As an added benefit, our projections provide camera-dependent compression of photon-cubes, which we demonstrate using an implementation of our projections on a novel compute architecture that is designed for single-photon imaging. + + + + Decoupled Iterative Refinement Framework for Interacting Hands Reconstruction from a Single RGB Image + http://openaccess.thecvf.com//content/ICCV2023/papers/Ren_Decoupled_Iterative_Refinement_Framework_for_Interacting_Hands_Reconstruction_from_a_ICCV_2023_paper.pdf + Reconstructing interacting hands from a single RGB image is a very challenging task. On the one hand, severe mutual occlusion and similar local appearance between two hands confuse the extraction of visual features, resulting in the misalignment of estimated hand meshes and the image. On the other hand, there are complex spatial relationships between interacting hands, which significantly enlarges the solution space of hand poses and increases the difficulty of network learning. In this paper, we propose a decoupled iterative refinement framework to achieve pixel-aligned hand reconstruction while efficiently modeling the spatial relationship between hands. Specifically, we define two feature spaces with different characteristics, namely 2D visual feature space and 3D joint feature space. First, we obtain joint-wise features from the visual feature map and utilize a graph convolution network and a transformer to perform intra- and inter-hand information interaction in the 3D joint feature space, respectively. Then, we project the joint features with global information back into the 2D visual feature space in an obfuscation-free manner and utilize 2D convolution for pixel-wise enhancement. By performing multiple alternate enhancements in the two feature spaces, our method can achieve an accurate and robust reconstruction of interacting hands. Our method outperforms all existing two-hand reconstruction methods by a large margin on the InterHand2.6M dataset. + + + + Who Are You Referring To? Coreference Resolution In Image Narrations + http://openaccess.thecvf.com//content/ICCV2023/papers/Goel_Who_Are_You_Referring_To_Coreference_Resolution_In_Image_Narrations_ICCV_2023_paper.pdf + Coreference resolution aims to identify words and phrases which refer to the same entity in a text, a core task in natural language processing.
In this paper, we extend this task to resolving coreferences in long-form narrations of visual scenes. First, we introduce a new dataset with annotated coreference chains and their bounding boxes, as most existing image-text datasets only contain short sentences without coreferring expressions or labeled chains. We propose a new technique that learns to identify coreference chains using weak supervision, only from image-text pairs and a regularization using prior linguistic knowledge. Our model yields large performance gains over several strong baselines in resolving coreferences. We also show that coreference resolution helps improve grounding narratives in images. + + + + Dynamic Hyperbolic Attention Network for Fine Hand-object Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Leng_Dynamic_Hyperbolic_Attention_Network_for_Fine_Hand-object_Reconstruction_ICCV_2023_paper.pdf + Reconstructing both objects and hands in 3D from a single RGB image is complex. Existing methods rely on manually defined hand-object constraints in Euclidean space, leading to suboptimal feature learning. Compared with Euclidean space, hyperbolic space better preserves the geometric properties of meshes thanks to its exponentially-growing space distance, which amplifies the differences between the features based on similarity. In this work, we propose the first precise hand-object reconstruction method in hyperbolic space, namely Dynamic Hyperbolic Attention Network (DHANet), which leverages intrinsic properties of hyperbolic space to learn representative features. Our method that projects mesh and image features into a unified hyperbolic space includes two modules, i.e. dynamic hyperbolic graph convolution and image-attention hyperbolic graph convolution. With these two modules, our method learns mesh features with rich geometry-image multi-modal information and models better hand-object interaction. Our method provides a promising alternative for fine hand-object reconstruction in hyperbolic space. Extensive experiments on three public datasets demonstrate that our method outperforms most state-of-the-art methods. + + + + LivePose: Online 3D Reconstruction from Monocular Video with Dynamic Camera Poses + http://openaccess.thecvf.com//content/ICCV2023/papers/Stier_LivePose_Online_3D_Reconstruction_from_Monocular_Video_with_Dynamic_Camera_ICCV_2023_paper.pdf + Dense 3D reconstruction from RGB images traditionally assumes static camera pose estimates. This assumption has endured, even as recent works have increasingly focused on real-time methods for mobile devices. However, the assumption of a fixed pose for each image does not hold for online execution: poses from real-time SLAM are dynamic and may be updated following events such as bundle adjustment and loop closure. This has been addressed in the RGB-D setting, by de-integrating past views and re-integrating them with updated poses, but it remains largely untreated in the RGB-only setting. We formalize this problem to define the new task of dense online reconstruction from dynamically-posed images. To support further research, we introduce a dataset called LivePose containing the dynamic poses from a SLAM system running on ScanNet. We select three recent reconstruction systems and apply a framework based on de-integration to adapt each one to the dynamic-pose setting. In addition, we propose a novel, non-linear de-integration module that learns to remove stale scene content. 
We show that responding to pose updates is critical for high-quality reconstruction, and that our de-integration framework is an effective solution. + + + + Feature Modulation Transformer: Cross-Refinement of Global Representation via High-Frequency Prior for Image Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Feature_Modulation_Transformer_Cross-Refinement_of_Global_Representation_via_High-Frequency_Prior_ICCV_2023_paper.pdf + Transformer-based methods have exhibited remarkable potential in single image super-resolution (SISR) by effectively extracting long-range dependencies. However, most of the current research in this area has prioritized the design of transformer blocks to capture global information, while overlooking the importance of incorporating high-frequency priors, which we believe could be beneficial. In our study, we conducted a series of experiments and found that transformer structures are more adept at capturing low-frequency information, but have limited capacity in constructing high-frequency representations when compared to their convolutional counterparts. Our proposed solution, the cross-refinement adaptive feature modulation transformer (CRAFT), integrates the strengths of both convolutional and transformer structures. It comprises three key components: the high-frequency enhancement residual block (HFERB) for extracting high-frequency information, the shift rectangle window attention block (SRWAB) for capturing global information, and the hybrid fusion block (HFB) for refining the global representation. Our experiments on multiple datasets demonstrate that CRAFT outperforms state-of-the-art methods by up to 0.29dB while using fewer parameters. The source code will be made available at: https://github.com/AVC2-UESTC/CRAFT-SR.git. + + + + MPI-Flow: Learning Realistic Optical Flow with Multiplane Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_MPI-Flow_Learning_Realistic_Optical_Flow_with_Multiplane_Images_ICCV_2023_paper.pdf + The accuracy of learning-based optical flow estimation models heavily relies on the realism of the training datasets. Current approaches for generating such datasets either employ synthetic data or generate images with limited realism. However, the domain gap of these data with real-world scenes constrains the generalization of the trained model to real-world applications. To address this issue, we investigate generating realistic optical flow datasets from real-world images. Firstly, to generate highly realistic new images, we construct a layered depth representation, known as multiplane images (MPI), from single-view images. This allows us to generate novel view images that are highly realistic. To generate optical flow maps that correspond accurately to the new image, we calculate the optical flows of each plane using the camera matrix and plane depths. We then project these layered optical flows into the output optical flow map with volume rendering. Secondly, to ensure the realism of motion, we present an independent object motion module that can separate the camera and dynamic object motion in MPI. This module addresses the deficiency in MPI-based single-view methods, where optical flow is generated only by camera motion and does not account for any object movement. We additionally devise a depth-aware inpainting module to merge new images with dynamic objects and address unnatural motion occlusions. 
We show the superior performance of our method through extensive experiments on real-world datasets. Moreover, our approach achieves state-of-the-art performance in both unsupervised and supervised training of learning-based models. The code will be made publicly available at: https://github.com/Sharpiless/MPI-Flow. + + + + Learning Depth Estimation for Transparent and Mirror Surfaces + http://openaccess.thecvf.com//content/ICCV2023/papers/Costanzino_Learning_Depth_Estimation_for_Transparent_and_Mirror_Surfaces_ICCV_2023_paper.pdf + Inferring the depth of transparent or mirror (ToM) surfaces represents a hard challenge for sensors, algorithms, and deep networks alike. We propose a simple pipeline for learning to estimate depth properly for such surfaces with neural networks, without requiring any ground-truth annotation. We unveil how to obtain reliable pseudo labels by in-painting ToM objects in images and processing them with a monocular depth estimation model. These labels can be used to fine-tune existing monocular or stereo networks, to let them learn how to deal with ToM surfaces. Experimental results on the Booster dataset show the dramatic improvements enabled by our remarkably simple proposal. + + + + Towards Zero-Shot Scale-Aware Monocular Depth Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Guizilini_Towards_Zero-Shot_Scale-Aware_Monocular_Depth_Estimation_ICCV_2023_paper.pdf + Monocular depth estimation is scale-ambiguous, and thus requires scale supervision to produce metric predictions. Even so, the resulting models will be geometry-specific, with learned scales that cannot be directly transferred across domains. Because of that, recent works focus instead on relative depth, eschewing scale in favor of improved up-to-scale zero-shot transfer. In this work we introduce ZeroDepth, a novel monocular depth estimation framework capable of predicting metric scale for arbitrary test images from different domains and camera parameters. This is achieved by (i) the use of input-level geometric embeddings that enable the network to learn a scale prior over objects; and (ii) decoupling the encoder and decoder stages, via a variational latent representation that is conditioned on single frame information. We evaluated ZeroDepth targeting both outdoor (KITTI, DDAD, nuScenes) and indoor (NYUv2) benchmarks, and achieved a new state-of-the-art in both settings using the same pre-trained model, outperforming methods that train on in-domain data and require test-time scaling to produce metric estimates. + + + + PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Cho_PromptStyler_Prompt-driven_Style_Generation_for_Source-free_Domain_Generalization_ICCV_2023_paper.pdf + In a joint vision-language space, a text feature (e.g., from "a photo of a dog") could effectively represent its relevant image features (e.g., from dog photos). Also, a recent study has demonstrated the cross-modal transferability phenomenon of this joint space. From these observations, we propose PromptStyler, which simulates various distribution shifts in the joint space by synthesizing diverse styles via prompts, without using any images, to deal with source-free domain generalization. The proposed method learns to generate a variety of style features (from "a S* style of a") via learnable style word vectors for pseudo-words S*.
To ensure that learned styles do not distort content information, we force style-content features (from "a S* style of a [class]") to be located near their corresponding content features (from "[class]") in the joint vision-language space. After learning style word vectors, we train a linear classifier using synthesized style-content features. PromptStyler achieves the state of the art on PACS, VLCS, OfficeHome and DomainNet, even though it does not require any images for training. + + + + SVQNet: Sparse Voxel-Adjacent Query Network for 4D Spatio-Temporal LiDAR Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_SVQNet_Sparse_Voxel-Adjacent_Query_Network_for_4D_Spatio-Temporal_LiDAR_Semantic_ICCV_2023_paper.pdf + LiDAR-based semantic perception tasks are critical yet challenging for autonomous driving. Due to the motion of objects and static/dynamic occlusion, temporal information plays an essential role in reinforcing perception by enhancing and completing single-frame knowledge. Previous approaches either directly stack historical frames onto the current frame or build a 4D spatio-temporal neighborhood using KNN, which duplicates computation and hinders real-time performance. Based on our observation that stacking all the historical points would damage performance due to a large amount of redundant and misleading information, we propose the Sparse Voxel-Adjacent Query Network (SVQNet) for 4D LiDAR semantic segmentation. To take full advantage of the historical frames with high efficiency, we shunt the historical points into two groups with reference to the current points. One is the Voxel-Adjacent Neighborhood carrying local enhancing knowledge. The other is the Historical Context completing the global knowledge. Then we propose new modules to select and extract the instructive features from the two groups. Our SVQNet achieves state-of-the-art performance in LiDAR semantic segmentation on the SemanticKITTI benchmark and the nuScenes dataset. + + + + MEFLUT: Unsupervised 1D Lookup Tables for Multi-exposure Image Fusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_MEFLUT_Unsupervised_1D_Lookup_Tables_for_Multi-exposure_Image_Fusion_ICCV_2023_paper.pdf + In this paper, we introduce a new approach for high-quality multi-exposure image fusion (MEF). We show that the fusion weights of an exposure can be encoded into a 1D lookup table (LUT), which takes a pixel intensity value as input and produces a fusion weight as output. We learn one 1D LUT for each exposure; then all the pixels from different exposures can query the 1D LUT of that exposure independently for high-quality and efficient fusion. Specifically, to learn these 1D LUTs, we incorporate attention mechanisms along various dimensions, including frame, channel and spatial ones, into the MEF task, bringing significant quality improvement over the state-of-the-art (SOTA). In addition, we collect a new MEF dataset consisting of 960 samples, 155 of which are manually tuned by professionals as ground-truth for evaluation. Our network is trained on this dataset in an unsupervised manner. Extensive experiments are conducted to demonstrate the effectiveness of all the newly proposed components, and results show that our approach outperforms the SOTA on our dataset and another representative dataset, SICE, both qualitatively and quantitatively. Moreover, our 1D LUT approach takes less than 4ms to process a 4K image on a PC GPU.
Given its high quality, efficiency and robustness, our method has been shipped in millions of Android mobile devices across multiple brands worldwide. Code is available at: https://github.com/Hedlen/MEFLUT. + + + + The Unreasonable Effectiveness of Large Language-Vision Models for Source-Free Video Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zara_The_Unreasonable_Effectiveness_of_Large_Language-Vision_Models_for_Source-Free_Video_ICCV_2023_paper.pdf + The Source-Free Video Unsupervised Domain Adaptation (SFVUDA) task consists of adapting an action recognition model, trained on a labelled source dataset, to an unlabelled target dataset without accessing the actual source data. Previous approaches have attempted to address SFVUDA by leveraging self-supervision (e.g., enforcing temporal consistency) derived from the target data itself. In this work, we take an orthogonal approach by exploiting "web-supervision" from Large Language-Vision Models (LLVMs), driven by the rationale that LLVMs contain a rich world prior that is surprisingly robust to domain shift. We showcase the unreasonable effectiveness of integrating LLVMs for SFVUDA by devising an intuitive and parameter-efficient method, which we name Domain Adaptation with Large Language-Vision models (DALL-V), that distills the world prior and complementary source model information into a student network tailored for the target. Despite its simplicity, DALL-V achieves significant improvements over state-of-the-art SFVUDA methods. + + + + Sketch and Text Guided Diffusion Model for Colored Point Cloud Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Sketch_and_Text_Guided_Diffusion_Model_for_Colored_Point_Cloud_ICCV_2023_paper.pdf + Diffusion probabilistic models have achieved remarkable success in text-guided image generation. However, generating 3D shapes is still challenging due to the lack of sufficient data containing 3D models along with their descriptions. Moreover, text-based descriptions of 3D shapes are inherently ambiguous and lack details. In this paper, we propose a sketch and text guided probabilistic diffusion model for colored point cloud generation that conditions the denoising process jointly with a hand-drawn sketch of the object and its textual description. We incrementally diffuse the point coordinates and color values in a joint diffusion process to reach a Gaussian distribution. Colored point cloud generation thus amounts to learning the reverse diffusion process, conditioned by the sketch and text, to iteratively recover the desired shape and color. Specifically, to learn effective sketch-text embedding, our model adaptively aggregates the joint embedding of the text prompt and the sketch based on a capsule attention network. Our model uses staged diffusion to generate the shape and then assign colors to different parts conditioned on the appearance prompt while preserving precise shapes from the first stage. This gives our model the flexibility to extend to multiple tasks, such as appearance re-editing and part segmentation. Experimental results demonstrate that our model outperforms the recent state-of-the-art in point cloud generation. + + + + Leveraging SE(3) Equivariance for Learning 3D Geometric Shape Assembly + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Leveraging_SE3_Equivariance_for_Learning_3D_Geometric_Shape_Assembly_ICCV_2023_paper.pdf + Shape assembly aims to reassemble parts (or fragments) into a complete object, which is a common task in our daily life.
Different from the semantic part assembly (e.g., assembling a chair's semantic parts like legs into a whole chair), geometric part assembly (e.g., assembling bowl fragments into a complete bowl) is an emerging task in computer vision and robotics. Instead of semantic information, this task focuses on the geometric information of parts. As both the geometric and pose spaces of fractured parts are exceptionally large, shape pose disentanglement of part representations is beneficial to geometric shape assembly. In our paper, we propose to leverage SE(3) equivariance for such shape pose disentanglement. Moreover, while previous works in vision and robotics only consider SE(3) equivariance for the representations of single objects, we move a step forward and propose leveraging SE(3) equivariance for representations considering multi-part correlations, which further boosts the performance of the multi-part assembly. Experiments demonstrate the significance of SE(3) equivariance and our proposed method for geometric shape assembly. + + + + Adversarial Bayesian Augmentation for Single-Source Domain Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_Adversarial_Bayesian_Augmentation_for_Single-Source_Domain_Generalization_ICCV_2023_paper.pdf + Generalizing to unseen image domains is a challenging problem primarily due to the lack of diverse training data, inaccessible target data, and the large domain shift that may exist in many real-world settings. As such, data augmentation is a critical component of domain generalization methods that seek to address this problem. We present Adversarial Bayesian Augmentation (ABA), a novel algorithm that learns to generate image augmentations in the challenging single-source domain generalization setting. ABA draws on the strengths of adversarial learning and Bayesian neural networks to guide the generation of diverse data augmentations - these synthesized image domains aid the classifier in generalizing to unseen domains. We demonstrate the strength of ABA on several types of domain shift including style shift, subpopulation shift, and shift in the medical imaging setting. ABA outperforms all previous state-of-the-art methods, including pre-specified, pixel-based, and convolution-based augmentations. Code: https://github.com/shengcheng/ABA. + + + + Robust Geometry-Preserving Depth Estimation Using Differentiable Rendering + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Robust_Geometry-Preserving_Depth_Estimation_Using_Differentiable_Rendering_ICCV_2023_paper.pdf + In this study, we address the challenge of 3D scene structure recovery from monocular depth estimation. While traditional depth estimation methods leverage labeled datasets to directly predict absolute depth, recent advancements advocate for mix-dataset training, enhancing generalization across diverse scenes. However, such mixed dataset training yields depth predictions only up to an unknown scale and shift, hindering accurate 3D reconstructions. Existing solutions necessitate extra 3D datasets or geometry-complete depth annotations, constraints that limit their versatility. In this paper, we propose a learning framework that trains models to predict geometry-preserving depth without requiring extra data or annotations. To produce realistic 3D structures, we render novel views of the reconstructed scenes and design loss functions to promote depth estimation consistency across different views.
Comprehensive experiments underscore our framework's superior generalization capabilities, surpassing existing state-of-the-art methods on several benchmark datasets without leveraging extra training information. Moreover, our innovative loss functions empower the model to autonomously recover domain-specific scale-and-shift coefficients using solely unlabeled images. + + + + Self-regulating Prompts: Foundational Model Adaptation without Forgetting + http://openaccess.thecvf.com//content/ICCV2023/papers/Khattak_Self-regulating_Prompts_Foundational_Model_Adaptation_without_Forgetting_ICCV_2023_paper.pdf + Prompt learning has emerged as an efficient alternative for fine-tuning foundational models, such as CLIP, for various downstream tasks. Conventionally trained using the task-specific objective, i.e., cross-entropy loss, prompts tend to overfit downstream data distributions and find it challenging to capture task-agnostic general features from the frozen CLIP. This leads to the loss of the model's original generalization capability. To address this issue, our work introduces a self-regularization framework for prompting called PromptSRC (Prompting with Self-regulating Constraints). PromptSRC guides the prompts to optimize for both task-specific and task-agnostic general representations using a three-pronged approach by: (a) regulating prompted representations via mutual agreement maximization with the frozen model, (b) regulating with a self-ensemble of prompts over the training trajectory to encode their complementary strengths, and (c) regulating with textual diversity to mitigate sample diversity imbalance with the visual branch. To the best of our knowledge, this is the first regularization framework for prompt learning that avoids overfitting by jointly attending to pre-trained model features, the training trajectory during prompting, and the textual diversity. PromptSRC explicitly steers the prompts to learn a representation space that maximizes performance on downstream tasks without compromising CLIP generalization. We perform extensive experiments on 4 benchmarks, where PromptSRC performs favorably overall compared to existing methods. Our code and pre-trained models are publicly available. + + + + Improving Lens Flare Removal with General-Purpose Pipeline and Multiple Light Sources Recovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Improving_Lens_Flare_Removal_with_General-Purpose_Pipeline_and_Multiple_Light_ICCV_2023_paper.pdf + When taking images against strong light sources, the resulting images often contain heterogeneous flare artifacts. These artifacts can significantly affect image visual quality and downstream computer vision tasks. Since collecting real data pairs of flare-corrupted/flare-free images for training flare removal models is challenging, current methods utilize the direct-add approach to synthesize data. However, these methods do not consider automatic exposure and tone mapping in the image signal processing pipeline (ISP), limiting the generalization capability of deep models trained on such data. Besides, existing methods struggle to handle multiple light sources due to the different sizes, shapes and illuminance of various light sources. In this paper, we propose a solution to improve the performance of lens flare removal by revisiting the ISP, remodeling the principle of automatic exposure in the synthesis pipeline, and designing a more reliable light source recovery strategy.
The new pipeline approaches realistic imaging by discriminating the local and global illumination through convex combination, avoiding global illumination shifting and local over-saturation. Our strategy for recovering multiple light sources convexly averages the input and output of the neural network based on illuminance levels, thereby avoiding the need for a hard threshold in identifying light sources. We also contribute a new flare removal testing dataset containing the flare-corrupted images captured by ten types of consumer electronics. The dataset facilitates the verification of the generalization capability of flare removal methods. Extensive experiments show that our solution can effectively improve the performance of lens flare removal and push the frontier toward more general situations. + + + + DCPB: Deformable Convolution Based on the Poincare Ball for Top-view Fisheye Cameras + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_DCPB_Deformable_Convolution_Based_on_the_Poincare_Ball_for_Top-view_ICCV_2023_paper.pdf + The accuracy of visual tasks for top-view fisheye cameras is limited by Euclidean geometry for pose-distorted objects in images. In this paper, we demonstrate the analogy between the fisheye model and the Poincare ball and show that learning the shape of convolution kernels in the Poincare ball can alleviate the spatial distortion problem. In particular, we propose the Deformable Convolution based on the Poincare Ball, named DCPB, which applies a Graph Convolutional Network (GCN) in the Poincare ball and calculates the geodesic distances to Poincare hyperplanes as the offsets and modulation scalars of the modulated deformable convolution. Besides, we explore an appropriate network structure in the baseline with the DCPB. The DCPB markedly improves the neural network's performance. Experimental results on the public dataset THEODORE show that DCPB achieves higher accuracy, and its efficiency demonstrates the potential for using temporal information in fisheye videos. + + + + Integrating Boxes and Masks: A Multi-Object Framework for Unified Visual Tracking and Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Integrating_Boxes_and_Masks_A_Multi-Object_Framework_for_Unified_Visual_ICCV_2023_paper.pdf + Tracking any given object(s) spatially and temporally is a common purpose in Visual Object Tracking (VOT) and Video Object Segmentation (VOS). Joint tracking and segmentation have been attempted in some studies, but they often lack full compatibility of both box and mask in initialization and prediction, and mainly focus on single-object scenarios. To address these limitations, this paper proposes a Multi-object Mask-box Integrated framework for unified Tracking and Segmentation, dubbed MITS. Firstly, the unified identification module is proposed to support both box and mask reference for initialization, where detailed object information is inferred from boxes or directly retained from masks. Additionally, a novel pinpoint box predictor is proposed for accurate multi-object box prediction, facilitating target-oriented representation learning. All target objects are processed simultaneously from encoding to propagation and decoding, as a unified pipeline for VOT and VOS. Experimental results show that MITS achieves state-of-the-art performance on both VOT and VOS benchmarks.
Notably, MITS surpasses the best prior VOT competitor by around 6% on the GOT-10k test set, and significantly improves the performance of box initialization on VOS benchmarks. The code is available at https://github.com/yoxu515/MITS. + + + + 3DMOTFormer: Graph Transformer for Online 3D Multi-Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Ding_3DMOTFormer_Graph_Transformer_for_Online_3D_Multi-Object_Tracking_ICCV_2023_paper.pdf + Tracking 3D objects accurately and consistently is crucial for autonomous vehicles, enabling more reliable downstream tasks such as trajectory prediction and motion planning. Based on the substantial progress in object detection in recent years, the tracking-by-detection paradigm has become a popular choice due to its simplicity and efficiency. State-of-the-art 3D multi-object tracking (MOT) approaches typically rely on non-learned model-based algorithms such as the Kalman filter but require many manually tuned parameters. On the other hand, learning-based approaches face the problem of adapting the training to the online setting, leading to inevitable distribution mismatch between training and inference as well as suboptimal performance. In this work, we propose 3DMOTFormer, a learned geometry-based 3D MOT framework building upon the transformer architecture. We use an Edge-Augmented Graph Transformer to reason on the track-detection bipartite graph frame-by-frame and conduct data association via edge classification. To reduce the distribution mismatch between training and inference, we propose a novel online training strategy with an autoregressive and recurrent forward pass as well as sequential batch optimization. Using CenterPoint detections, our approach achieves 71.2% and 68.2% AMOTA on the nuScenes validation and test split, respectively. In addition, a trained 3DMOTFormer model generalizes well across different object detectors. Code is available at: https://github.com/dsx0511/3DMOTFormer. + + + + ReGen: A good Generative Zero-Shot Video Classifier Should be Rewarded + http://openaccess.thecvf.com//content/ICCV2023/papers/Bulat_ReGen_A_good_Generative_Zero-Shot_Video_Classifier_Should_be_Rewarded_ICCV_2023_paper.pdf + This paper sets out to solve the following problem: How can we turn a generative video captioning model into an open-world video/action classification model? Video captioning models can naturally produce open-ended free-form descriptions of a given video which, however, might not be discriminative enough for video/action recognition. Unfortunately, when fine-tuned to auto-regress the class names directly, video captioning models overfit the base classes, losing their open-world zero-shot capabilities. To alleviate base class overfitting, in this work, we propose to use reinforcement learning to enforce the output of the video captioning model to be more class-level discriminative. Specifically, we propose ReGen, a novel reinforcement learning-based framework with a three-fold objective and reward functions: (1) a class-level discrimination reward that enforces the generated caption to be correctly classified into the corresponding action class, (2) a CLIP reward that encourages the generated caption to continue to be descriptive of the input video (i.e., video-specific), and (3) a grammar reward that preserves the grammatical correctness of the caption. We show that ReGen can train a model to produce captions that are discriminative, video-specific, and grammatically correct.
Importantly, when evaluated on standard benchmarks for zero- and few-shot action classification, ReGen significantly outperforms the previous state-of-the-art. + + + + Complementary Domain Adaptation and Generalization for Unsupervised Continual Domain Shift Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Cho_Complementary_Domain_Adaptation_and_Generalization_for_Unsupervised_Continual_Domain_Shift_ICCV_2023_paper.pdf + Continual domain shift poses a significant challenge in real-world applications, particularly in situations where labeled data is not available for new domains. The challenge of acquiring knowledge in this problem setting is referred to as unsupervised continual domain shift learning. Existing methods for domain adaptation and generalization have limitations in addressing this issue, as they focus on either adapting to a specific domain or generalizing to unseen domains, but not both. In this paper, we propose Complementary Domain Adaptation and Generalization (CoDAG), a simple yet effective learning framework that combines domain adaptation and generalization in a complementary manner to achieve three major goals of unsupervised continual domain shift learning: adapting to a current domain, generalizing to unseen domains, and preventing forgetting of previously seen domains. Our approach is model-agnostic, meaning that it is compatible with any existing domain adaptation and generalization algorithms. We evaluate CoDAG on several benchmark datasets and demonstrate that our model outperforms state-of-the-art models on all datasets and evaluation metrics, highlighting its effectiveness and robustness in handling unsupervised continual domain shift learning. + + + + Ordered Atomic Activity for Fine-grained Interactive Traffic Scenario Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Agarwal_Ordered_Atomic_Activity_for_Fine-grained_Interactive_Traffic_Scenario_Understanding_ICCV_2023_paper.pdf + We introduce a novel representation called Ordered Atomic Activity for interactive scenario understanding. The representation decomposes each scenario into a set of ordered atomic activities, where each activity consists of an action and the corresponding actors involved, and the order denotes the temporal development of the scenario. The design also helps in identifying important interactive relationships such as yielding. The action is a high-level semantic motion pattern that is grounded in the surrounding road topology, which we decompose into zones and corners with unique IDs. For example, a group of pedestrians crossing on the left side is denoted as C1 - C4: P+, as depicted in Figure 1. We collect a new large-scale dataset called OATS (Ordered Atomic Activities in interactive Traffic Scenarios), comprising 1026 video clips (~20s) captured at intersections. Each clip is labeled with the proposed language, resulting in 59 activity categories and 6512 annotated activity instances. We propose three fine-grained scenario understanding tasks, i.e., multi-label Atomic Activity recognition, activity order prediction, and interactive scenario retrieval. We implement various state-of-the-art algorithms and conduct extensive experiments on OATS.
We found that existing methods cannot achieve satisfactory performance, indicating new opportunities for the community to develop new algorithms for these tasks toward better interactive scenario understanding. + + + + BEV-DG: Cross-Modal Learning under Bird's-Eye View for Domain Generalization of 3D Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_BEV-DG_Cross-Modal_Learning_under_Birds-Eye_View_for_Domain_Generalization_of_ICCV_2023_paper.pdf + Cross-modal Unsupervised Domain Adaptation (UDA) aims to exploit the complementarity of 2D-3D data to overcome the lack of annotation in a new domain. However, UDA methods rely on access to the target domain during training, meaning the trained model only works in a specific target domain. In light of this, we propose cross-modal learning under bird's-eye view for Domain Generalization (DG) of 3D semantic segmentation, called BEV-DG. DG is more challenging because the model cannot access the target domain during training, meaning it needs to rely on cross-modal learning to alleviate the domain gap. Since 3D semantic segmentation requires the classification of each point, existing cross-modal learning is directly conducted point-to-point, which is sensitive to the misalignment in projections between pixels and points. To this end, our approach aims to optimize domain-irrelevant representation modeling with the aid of cross-modal learning under bird's-eye view. We propose BEV-based Area-to-area Fusion (BAF) to conduct cross-modal learning under bird's-eye view, which has a higher fault tolerance for point-level misalignment. Furthermore, to model domain-irrelevant representations, we propose BEV-driven Domain Contrastive Learning (BDCL) with the help of cross-modal learning under bird's-eye view. We design three domain generalization settings based on three 3D datasets, and BEV-DG significantly outperforms state-of-the-art competitors by tremendous margins in all settings. + + + + Grounded Entity-Landmark Adaptive Pre-Training for Vision-and-Language Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Cui_Grounded_Entity-Landmark_Adaptive_Pre-Training_for_Vision-and-Language_Navigation_ICCV_2023_paper.pdf + Cross-modal alignment is one key challenge for Vision-and-Language Navigation (VLN). Most existing studies concentrate on mapping the global instruction or single sub-instruction to the corresponding trajectory. However, another critical problem of achieving fine-grained alignment at the entity level is seldom considered. To address this problem, we propose a novel Grounded Entity-Landmark Adaptive (GELA) pre-training paradigm for VLN tasks. To achieve the adaptive pre-training paradigm, we first introduce grounded entity-landmark human annotations into the Room-to-Room (R2R) dataset, named GEL-R2R. Additionally, we adopt three grounded entity-landmark adaptive pre-training objectives: 1) entity phrase prediction, 2) landmark bounding box prediction, and 3) entity-landmark semantic alignment, which explicitly supervise the learning of fine-grained cross-modal alignment between entity phrases and environment landmarks. Finally, we validate our model on two downstream benchmarks: VLN with descriptive instructions (R2R) and dialogue instructions (CVDN). The comprehensive experiments show that our GELA model achieves state-of-the-art results on both tasks, demonstrating its effectiveness and generalizability.
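One way to read the "entity-landmark semantic alignment" objective listed in the GELA abstract above is as a symmetric contrastive (InfoNCE-style) loss between paired entity-phrase and landmark-region embeddings. The sketch below is a generic formulation of that idea, not the paper's implementation; the shapes, the temperature, and the assumption of one landmark per phrase are ours.

```python
# Generic symmetric InfoNCE alignment loss between paired embeddings (an illustrative
# reading of "entity-landmark semantic alignment"; not the GELA code).
import numpy as np

def alignment_loss(phrase_emb, landmark_emb, temperature=0.07):
    """phrase_emb, landmark_emb: (N, D) arrays where row i of each is a matched pair."""
    a = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    b = landmark_emb / np.linalg.norm(landmark_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature                      # (N, N) scaled cosine similarities

    def ce(lg):
        # Cross-entropy with the matched pair on the diagonal.
        lg = lg - lg.max(axis=1, keepdims=True)         # numerical stability
        log_softmax = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.diag(log_softmax).mean()

    return 0.5 * (ce(logits) + ce(logits.T))            # phrase->landmark and landmark->phrase

rng = np.random.default_rng(0)
print(alignment_loss(rng.normal(size=(8, 32)), rng.normal(size=(8, 32))))
```

Minimizing such a loss pulls each entity phrase toward its annotated landmark region and pushes it away from the other regions in the batch, which matches the fine-grained cross-modal alignment the abstract describes.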
+ + + + Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Lip_Reading_for_Low-resource_Languages_by_Learning_and_Combining_General_ICCV_2023_paper.pdf + This paper proposes a novel lip reading framework, especially for low-resource languages, which has not been well addressed in the previous literature. Since low-resource languages do not have enough video-text paired data to train the model to have sufficient power to model lip movements and language, it is regarded as challenging to develop lip reading models for low-resource languages. In order to mitigate the challenge, we try to learn general speech knowledge, the ability to model lip movements, from a high-resource language through the prediction of speech units. It is known that different languages partially share common phonemes; thus, general speech knowledge learned from one language can be extended to other languages. Then, we try to learn language-specific knowledge, the ability to model language, by proposing a Language-specific Memory-augmented Decoder (LMDecoder). LMDecoder saves language-specific audio features into memory banks and can be trained on audio-text paired data, which is more easily accessible than video-text paired data. Therefore, with LMDecoder, we can transform the input speech units into language-specific audio features and translate them into texts by utilizing the learned rich language knowledge. Finally, by combining general speech knowledge and language-specific knowledge, we can efficiently develop lip reading models even for low-resource languages. Through extensive experiments using five languages, English, Spanish, French, Italian, and Portuguese, the effectiveness of the proposed method is evaluated. + + + + HopFIR: Hop-wise GraphFormer with Intragroup Joint Refinement for 3D Human Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhai_HopFIR_Hop-wise_GraphFormer_with_Intragroup_Joint_Refinement_for_3D_Human_ICCV_2023_paper.pdf + 2D-to-3D human pose lifting is fundamental for 3D human pose estimation (HPE), for which graph convolutional networks (GCNs) have proven inherently suitable for modeling the human skeletal topology. However, the current GCN-based 3D HPE methods update the node features by aggregating their neighbors' information without considering the interaction of joints in different joint synergies. Although some studies have proposed importing limb information to learn the movement patterns, the latent synergies among joints, such as maintaining balance, are seldom investigated. We propose the Hop-wise GraphFormer with Intragroup Joint Refinement (HopFIR) architecture to tackle the 3D HPE problem. HopFIR mainly consists of a novel hop-wise GraphFormer (HGF) module and an intragroup joint refinement (IJR) module. The HGF module groups the joints by k-hop neighbors and applies a hop-wise transformer-like attention mechanism to these groups to discover latent joint synergies. The IJR module leverages the prior limb information for peripheral joint refinement. Extensive experimental results show that HopFIR outperforms the SOTA methods by a large margin, with a mean per-joint position error (MPJPE) on the Human3.6M dataset of 32.67 mm.
We also demonstrate that the state-of-the-art GCN-based methods can benefit from the proposed hop-wise attention mechanism with a significant improvement in performance: SemGCN and MGCN are improved by 8.9% and 4.5%, respectively. + + + + Minimal Solutions to Generalized Three-View Relative Pose Problem + http://openaccess.thecvf.com//content/ICCV2023/papers/Ding_Minimal_Solutions_to_Generalized_Three-View_Relative_Pose_Problem_ICCV_2023_paper.pdf + For a generalized (or non-central) camera model, the minimal problem for two views of six points has efficient solvers. However, minimal problems of three views with four points and three views of six lines have not yet been explored and solved, despite the efforts from the computer vision community. This paper develops the formulations of these two minimal problems and shows how state-of-the-art GPU implementations of the Homotopy Continuation solver can be used effectively. The proposed methods are evaluated on both synthetic and real datasets, demonstrating that they are fast and accurate, and that they improve structure-from-motion estimates when employed in a hypothesis-and-test setting. + + + + Trajectory Unified Transformer for Pedestrian Trajectory Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_Trajectory_Unified_Transformer_for_Pedestrian_Trajectory_Prediction_ICCV_2023_paper.pdf + Pedestrian trajectory prediction is an essential link to understanding human behavior. Recent works achieve state-of-the-art performance gained from hand-designed post-processing, e.g., clustering. However, this post-processing suffers from expensive inference time and neglects the probability of the predicted trajectory disturbing downstream safety decisions. In this paper, we present Trajectory Unified TRansformer, called TUTR, which unifies the trajectory prediction components, social interaction and multimodal trajectory prediction, into a transformer encoder-decoder architecture to effectively remove the need for post-processing. Specifically, TUTR parses the relationships across various motion modes by an explicit global prediction and an implicit mode-level transformer encoder. Then, TUTR attends to the social interactions with neighbors by a social-level transformer decoder. Finally, a dual prediction forecasts diverse trajectories and corresponding probabilities in parallel without post-processing. TUTR achieves state-of-the-art accuracy and about 10x-40x inference speed improvements compared with previous well-tuned state-of-the-art methods that use post-processing. + + + + MHEntropy: Entropy Meets Multiple Hypotheses for Pose and Shape Recovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_MHEntropy_Entropy_Meets_Multiple_Hypotheses_for_Pose_and_Shape_Recovery_ICCV_2023_paper.pdf + For monocular RGB-based 3D pose and shape estimation, multiple solutions are often feasible due to factors like occlusion and truncation. This work presents a multi-hypothesis probabilistic framework by optimizing the Kullback-Leibler divergence (KLD) between the data and model distributions. Our formulation reveals a connection between the pose entropy and diversity in the multiple hypotheses that has been neglected by previous works. For a comprehensive evaluation, besides the best hypothesis (BH) metric, we factor in visibility for evaluating diversity. Additionally, our framework is label-friendly, in that it can be learned from only partial 2D keypoints, e.g., those that are visible.
Experiments on both ambiguous and real-world benchmarks demonstrate that our method outperforms other state-of-the-art multi-hypothesis methods in a comprehensive evaluation. The project page is at https://gloryyrolg.github.io/MHEntropy. + + + + Modeling the Relative Visual Tempo for Self-supervised Skeleton-based Action Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Modeling_the_Relative_Visual_Tempo_for_Self-supervised_Skeleton-based_Action_Recognition_ICCV_2023_paper.pdf + Visual tempo characterizes the dynamics and the temporal evolution, which helps describe actions. Recent approaches directly perform visual tempo prediction on skeleton sequences, which may suffer from the issue of insufficient feature representation. In this paper, we observe that relative visual tempo is more in line with human intuition and thus provides more effective supervision signals. Based on this, we propose a novel Relative Visual Tempo Contrastive Learning framework for skeleton action Representation (RVTCLR). Specifically, we design a Relative Visual Tempo Learning (RVTL) task to explore the motion information in intra-video clips, and an Appearance-Consistency (AC) task to learn appearance information simultaneously, resulting in more representative spatiotemporal features. Furthermore, skeleton sequence data is much sparser than RGB data, making the network learn shortcuts and overfit to low-level information such as skeleton scales. To learn high-order semantics, we further design a new Distribution-Consistency (DC) branch, containing three components: Skeleton-specific Data Augmentation (SDA), Fine-grained Skeleton Encoding Module (FSEM), and Distribution-aware Diversity (DD) Loss. We term our entire method (RVTCLR with DC) as RVTCLR+. Extensive experiments on the NTU RGB+D 60 and NTU RGB+D 120 datasets demonstrate that our RVTCLR+ can achieve competitive results compared with state-of-the-art methods. Code is available at https://github.com/Zhuysheng/RVTCLR. + + + + ImbSAM: A Closer Look at Sharpness-Aware Minimization in Class-Imbalanced Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_ImbSAM_A_Closer_Look_at_Sharpness-Aware_Minimization_in_Class-Imbalanced_Recognition_ICCV_2023_paper.pdf + Class imbalance is a common challenge in real-world recognition tasks, where the majority of classes, known as tail classes, have only a few samples. We address this challenge from the perspective of generalization and empirically find that the promising Sharpness-Aware Minimization (SAM) fails to address generalization issues under the class-imbalanced setting. Through investigating this specific type of task, we identify that its generalization bottleneck primarily lies in the severe overfitting for tail classes with limited training data. To overcome this bottleneck, we leverage class priors to restrict the generalization scope of the class-agnostic SAM and propose a class-aware smoothness optimization algorithm named Imbalanced-SAM (ImbSAM). With the guidance of class priors, our ImbSAM specifically improves generalization targeting tail classes. We also verify the efficacy of ImbSAM on two prototypical applications of class-imbalanced recognition: long-tailed classification and semi-supervised anomaly detection, where our ImbSAM demonstrates remarkable performance improvements for tail classes and anomalies. Our code implementation is available at https://github.com/cool-xuan/Imbalanced_SAM.
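The ImbSAM abstract above describes restricting sharpness-aware minimization to the tail classes. A minimal, self-contained sketch of that idea on a toy logistic-regression problem is given below; the model, the data, the head/tail split, and all hyperparameters are illustrative assumptions, and only the two-step "perturb, then re-evaluate the gradient" pattern of SAM, applied to the tail-class loss alone, follows the abstract's description.

```python
# Toy sketch: SAM-style update applied only to the tail-class part of the loss,
# plain gradient for the head classes (our reading of the ImbSAM abstract; not the
# authors' code). Model: logistic regression with an analytic gradient.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] + 0.1 * rng.normal(size=200) > 0).astype(float)
tail = rng.random(200) < 0.1                 # pretend 10% of samples belong to tail classes

def grad(w, Xs, ys):
    """Gradient of the mean binary cross-entropy of a logistic model."""
    if len(ys) == 0:
        return np.zeros_like(w)
    p = 1.0 / (1.0 + np.exp(-Xs @ w))
    return Xs.T @ (p - ys) / len(ys)

w, lr, rho = np.zeros(5), 0.1, 0.05
for _ in range(100):
    g_head = grad(w, X[~tail], y[~tail])                    # ordinary gradient on head-class samples
    g_tail = grad(w, X[tail], y[tail])
    eps = rho * g_tail / (np.linalg.norm(g_tail) + 1e-12)   # ascend toward a nearby "sharp" point
    g_tail_sam = grad(w + eps, X[tail], y[tail])            # re-evaluate the tail gradient there
    w -= lr * (g_head + g_tail_sam)

print("trained weights:", np.round(w, 3))
```

The point of the restriction is that the flat-minimum bias of SAM is spent where overfitting is worst, on the classes with few samples, while the well-populated head classes are optimized as usual.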
+ + + + MonoDETR: Depth-guided Transformer for Monocular 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_MonoDETR_Depth-guided_Transformer_for_Monocular_3D_Object_Detection_ICCV_2023_paper.pdf + Monocular 3D object detection has long been a challenging task in autonomous driving. Most existing methods follow conventional 2D detectors to first localize object centers, and then predict 3D attributes from neighboring features. However, only using local visual features is insufficient to understand the scene-level 3D spatial structures and ignores the long-range inter-object depth relations. In this paper, we introduce the first DETR framework for Monocular DEtection with a depth-guided TRansformer, named MonoDETR. We modify the vanilla transformer to be depth-aware and guide the whole detection process by contextual depth cues. Specifically, concurrent to the visual encoder that captures object appearances, we additionally predict a foreground depth map and specialize a depth encoder to extract non-local depth embeddings. Then, we formulate 3D object candidates as learnable queries and propose a depth-guided decoder to conduct object-scene depth interactions. In this way, each object query estimates its 3D attributes adaptively from the depth-guided regions on the image and is no longer constrained to local visual features. On the KITTI benchmark with monocular images as input, MonoDETR achieves state-of-the-art performance and requires no extra dense depth annotations. Besides, our depth-guided modules can also be used in a plug-and-play manner to enhance multi-view 3D object detectors on the nuScenes dataset, demonstrating our superior generalization capacity. Code is available at https://github.com/ZrrSkywalker/MonoDETR. + + + + Contrastive Feature Masking Open-Vocabulary Vision Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Contrastive_Feature_Masking_Open-Vocabulary_Vision_Transformer_ICCV_2023_paper.pdf + We present Contrastive Feature Masking Vision Transformer (CFM-ViT) - an image-text pretraining methodology that achieves simultaneous learning of image- and region-level representations for open-vocabulary object detection (OVD). Our approach combines the masked autoencoder (MAE) objective with the contrastive learning objective to improve the representation for localization tasks. Unlike standard MAE, we perform reconstruction in the joint image-text embedding space, rather than the pixel space as is customary with the classical MAE method, which causes the model to better learn region-level semantics. Moreover, we introduce Positional Embedding Dropout (PED) to address scale variation between image-text pretraining and detection finetuning by randomly dropping out the positional embeddings during pretraining. PED improves detection performance and enables the use of a frozen ViT backbone as a region classifier, preventing the forgetting of open-vocabulary knowledge during detection finetuning. On the LVIS open-vocabulary detection benchmark, CFM-ViT achieves a state-of-the-art 33.9 APr, surpassing the best approach by 7.6 points, and achieves better zero-shot detection transfer. Finally, CFM-ViT acquires strong image-level representations, outperforming the state of the art on 8 out of 12 metrics on zero-shot image-text retrieval benchmarks.
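The Positional Embedding Dropout (PED) mentioned in the CFM-ViT abstract above, randomly dropping positional embeddings during pretraining, can be sketched as follows. This is our reading of the one-sentence description, not the paper's implementation; the function name, the drop probability, and the per-sample dropping granularity are assumptions.

```python
# Illustrative sketch of positional-embedding dropout: with probability drop_prob,
# a sample's positional embeddings are omitted during pretraining (assumption: the
# drop decision is made per sample). Not the CFM-ViT code.
import numpy as np

def ped(tokens, pos_embed, drop_prob=0.5, training=True, rng=None):
    """tokens: (B, N, C); pos_embed: (N, C). Returns tokens with positions kept or dropped."""
    if not training:
        return tokens + pos_embed                          # always keep positions at inference
    rng = rng or np.random.default_rng()
    keep = rng.random(tokens.shape[0]) >= drop_prob        # one keep/drop decision per sample
    return tokens + keep[:, None, None] * pos_embed

x = np.zeros((4, 16, 8))                                   # batch of 4 sequences, 16 tokens, dim 8
pe = np.ones((16, 8))
out = ped(x, pe, rng=np.random.default_rng(0))
print(out.sum(axis=(1, 2)))                                # roughly half the samples keep the embedding
```

Trained this way, the encoder sees both position-aware and position-free inputs, which is one plausible way to reduce sensitivity to the resolution change between pretraining and detection finetuning that the abstract mentions.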
+ + + + OccFormer: Dual-path Transformer for Vision-based 3D Semantic Occupancy Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_OccFormer_Dual-path_Transformer_for_Vision-based_3D_Semantic_Occupancy_Prediction_ICCV_2023_paper.pdf + Vision-based perception for autonomous driving has undergone a transformation from bird's-eye-view (BEV) representations to 3D semantic occupancy. Compared with the BEV planes, the 3D semantic occupancy further provides structural information along the vertical direction. This paper presents OccFormer, a dual-path transformer network to effectively process the 3D volume for semantic occupancy prediction. OccFormer achieves a long-range, dynamic, and efficient encoding of the camera-generated 3D voxel features. It is obtained by decomposing the heavy 3D processing into the local and global transformer pathways along the horizontal plane. For the occupancy decoder, we adapt the vanilla Mask2Former for 3D semantic occupancy by proposing preserve-pooling and class-guided sampling, which notably mitigate the sparsity and class imbalance. Experimental results demonstrate that OccFormer significantly outperforms existing methods for semantic scene completion on the SemanticKITTI dataset and for LiDAR semantic segmentation on the nuScenes dataset. Code is available at https://github.com/zhangyp15/OccFormer. + + + + Probabilistic Triangulation for Uncalibrated Multi-View 3D Human Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Probabilistic_Triangulation_for_Uncalibrated_Multi-View_3D_Human_Pose_Estimation_ICCV_2023_paper.pdf + 3D human pose estimation has been a long-standing challenge in computer vision and graphics, where multi-view methods have significantly progressed but are limited by tedious calibration processes. Existing multi-view methods are restricted to fixed camera poses and therefore lack generalization ability. This paper presents a novel Probabilistic Triangulation module that can be embedded in a calibrated 3D human pose estimation method, generalizing it to uncalibrated scenes. The key idea is to use a probability distribution to model the camera pose and iteratively update the distribution from 2D features instead of using the camera pose directly. Specifically, we maintain a camera pose distribution and then iteratively update this distribution by computing the posterior probability of the camera pose through Monte Carlo sampling. This way, the gradients can be directly back-propagated from the 3D pose estimation to the 2D heatmap, enabling end-to-end training. Extensive experiments on Human3.6M and CMU Panoptic demonstrate that our method outperforms other uncalibrated methods and achieves results comparable to state-of-the-art calibrated methods. Thus, our method achieves a trade-off between estimation accuracy and generalizability. + + + + TORE: Token Reduction for Efficient Human Mesh Recovery with Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Dou_TORE_Token_Reduction_for_Efficient_Human_Mesh_Recovery_with_Transformer_ICCV_2023_paper.pdf + In this paper, we introduce a set of simple yet effective TOken REduction (TORE) strategies for Transformer-based Human Mesh Recovery from monocular images. Current SOTA performance is achieved by Transformer-based structures. However, they suffer from high model complexity and computation cost caused by redundant tokens.
We propose token reduction strategies based on two important aspects, i.e., the 3D geometry structure and the 2D image features, where we hierarchically recover the mesh geometry with priors from body structure and conduct token clustering to pass fewer but more discriminative image feature tokens to the Transformer. Our method massively reduces the number of tokens involved in high-complexity interactions in the Transformer. This leads to a significantly reduced computational cost while still achieving competitive or even higher accuracy in shape recovery. Extensive experiments across a wide range of benchmarks validate the superior effectiveness of the proposed method. We further demonstrate the generalizability of our method on hand mesh recovery. Visit our project page at https://frank-zy-dou.github.io/projects/Tore/index.html. + + + + D3G: Exploring Gaussian Prior for Temporal Sentence Grounding with Glance Annotation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_D3G_Exploring_Gaussian_Prior_for_Temporal_Sentence_Grounding_with_Glance_ICCV_2023_paper.pdf + Temporal sentence grounding (TSG) aims to locate a specific moment from an untrimmed video with a given natural language query. Weakly supervised methods still have a large performance gap compared to fully supervised ones, while the latter require laborious timestamp annotations. In this study, we aim to reduce the annotation cost yet keep competitive performance for the TSG task compared to fully supervised methods. To achieve this goal, we investigate a recently proposed glance-supervised temporal sentence grounding task, which requires only a single-frame annotation (referred to as glance annotation) for each query. Under this setup, we propose a Dynamic Gaussian prior based Grounding framework with Glance annotation (D3G), which consists of a Semantic Alignment Group Contrastive Learning module (SA-GCL) and a Dynamic Gaussian prior Adjustment module (DGA). Specifically, SA-GCL samples reliable positive moments from a 2D temporal map via jointly leveraging a Gaussian prior and semantic consistency, which contributes to aligning the positive sentence-moment pairs in the joint embedding space. Moreover, to alleviate the annotation bias resulting from glance annotation and to model complex queries consisting of multiple events, we propose the DGA module, which adjusts the distribution dynamically to approximate the ground truth of target moments. Extensive experiments on three challenging benchmarks verify the effectiveness of the proposed D3G. It outperforms the state-of-the-art weakly supervised methods by a large margin and narrows the performance gap compared to fully supervised methods. Code is available at https://github.com/solicucu/D3G. + + + + GEDepth: Ground Embedding for Monocular Depth Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_GEDepth_Ground_Embedding_for_Monocular_Depth_Estimation_ICCV_2023_paper.pdf + Monocular depth estimation is an ill-posed problem as the same 2D image can be projected from infinitely many 3D scenes. Although the leading algorithms in this field have reported significant improvement, they are essentially geared to the particular compound of pictorial observations and camera parameters (i.e., intrinsics and extrinsics), strongly limiting their generalizability in real-world scenarios.
In order to cope with this difficulty, this paper proposes a novel ground embedding module to decouple camera parameters from pictorial cues, thus promoting the generalization capability. Given camera parameters, our module generates the ground depth, which is stacked with the input image and referenced in the final depth prediction. A ground attention module is designed to optimally combine the ground depth with the residual depth. The proposed ground embedding is highly flexible and lightweight, leading to a plug-in module that is amenable to integration into various depth estimation networks. Experiments reveal that our approach achieves state-of-the-art results on popular benchmarks, and more importantly, renders significant improvement in cross-domain generalization. + + + + Animal3D: A Comprehensive Dataset of 3D Animal Pose and Shape + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Animal3D_A_Comprehensive_Dataset_of_3D_Animal_Pose_and_Shape_ICCV_2023_paper.pdf + Accurately estimating the 3D pose and shape is an essential step towards understanding animal behavior, and can potentially benefit many downstream applications, such as wildlife conservation. However, research in this area is held back by the lack of a comprehensive and diverse dataset with high-quality 3D pose and shape annotations. In this paper, we propose Animal3D, the first comprehensive dataset for mammalian 3D pose and shape estimation. Animal3D consists of 3379 images collected from 40 mammal species, high-quality annotations of 26 keypoints, and importantly the pose and shape parameters of the SMAL model. All annotations were labeled and checked manually in a multi-stage process to ensure the highest quality results. Based on the Animal3D dataset, we benchmark representative shape and pose estimation models in three settings: (1) supervised learning from only the Animal3D data, (2) synthetic-to-real transfer from synthetically generated images, and (3) fine-tuning human pose and shape estimation models. Our experimental results demonstrate that predicting the 3D shape and pose of animals across species remains a very challenging task, despite significant advances in human pose estimation and animal pose estimation for specific species. Our results further demonstrate that synthetic pre-training is a viable strategy to boost model performance. Overall, Animal3D opens new directions for facilitating future research in animal 3D pose and shape estimation, and is publicly available. + + + + Rethinking Video Frame Interpolation from Shutter Mode Induced Degradation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ji_Rethinking_Video_Frame_Interpolation_from_Shutter_Mode_Induced_Degradation_ICCV_2023_paper.pdf + Image restoration from various motion-related degradations, such as blur recorded by a global shutter (GS) and jello effects caused by a rolling shutter (RS), has been extensively studied. It has been recently recognized that such degradations encode temporal information, which can be exploited for video frame interpolation (VFI), a more challenging task than pure restoration. However, these VFI studies are mainly grounded in experiments with synthetic data, rather than real data. More fundamentally, under the same imaging conditions, it remains unknown which degradation will be more effective for VFI.
In this paper, we present the first real-world dataset for learning and benchmarking degraded video frame interpolation, named RD-VFI, and further explore the performance differences of three types of degradations, including GS blur, RS distortion, and an in-between effect caused by the rolling shutter with global reset (RSGR), thanks to our novel quad-axis imaging system. Moreover, we propose a unified Progressive Mutual Boosting Network (PMBNet) model to interpolate middle frames at arbitrary times for all shutter modes. Its disentanglement strategy and dual-stream correction enable us to adaptively deal with different degradations for VFI. Experimental results demonstrate that our PMBNet is superior to the respective state-of-the-art methods on all shutter modes. + + + + Semantic-Aware Dynamic Parameter for Video Inpainting Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Semantic-Aware_Dynamic_Parameter_for_Video_Inpainting_Transformer_ICCV_2023_paper.pdf + Recent learning-based video inpainting approaches have achieved considerable progress. However, they still cannot fully utilize semantic information within the video frames and thus predict improper scene layouts, failing to restore clear object boundaries for mixed scenes. To mitigate this problem, we introduce a new transformer-based video inpainting technique that can exploit semantic information within the input and considerably improve reconstruction quality. In this study, we use the mixture-of-experts scheme and train multiple experts to handle mixed scenes, including various semantics. We leverage these multiple experts and produce locally (token-wise) different network parameters to achieve semantic-aware inpainting results. Extensive experiments on the YouTube-VOS and DAVIS benchmark datasets demonstrate that, compared with existing conventional video inpainting approaches, the proposed method has superior performance in synthesizing visually pleasing videos with much clearer semantic structures and textures. + + + + SKED: Sketch-guided Text-based 3D Editing + http://openaccess.thecvf.com//content/ICCV2023/papers/Mikaeili_SKED_Sketch-guided_Text-based_3D_Editing_ICCV_2023_paper.pdf + Text-to-image diffusion models are gradually being introduced into computer graphics, recently enabling the development of Text-to-3D pipelines in an open domain. However, for interactive editing purposes, local manipulations of content through a simplistic textual interface can be arduous. Incorporating user-guided sketches with Text-to-image pipelines offers users more intuitive control. Still, as state-of-the-art Text-to-3D pipelines rely on optimizing Neural Radiance Fields (NeRF) through gradients from arbitrary rendering views, conditioning on sketches is not straightforward. In this paper, we present SKED, a technique for editing 3D shapes represented by NeRFs. Our technique utilizes as few as two guiding sketches from different views to alter an existing neural field. The edited region respects the prompt semantics through a pre-trained diffusion model. To ensure the generated output adheres to the provided sketches, we propose novel loss functions to generate the desired edits while preserving the density and radiance of the base instance. We demonstrate the effectiveness of our proposed method through several qualitative and quantitative experiments.
https://sked-paper.github.io/ + + + + MBPTrack: Improving 3D Point Cloud Tracking with Memory Networks and Box Priors + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_MBPTrack_Improving_3D_Point_Cloud_Tracking_with_Memory_Networks_and_ICCV_2023_paper.pdf + 3D single object tracking has been a crucial problem for decades with numerous applications such as autonomous driving. Despite its wide-ranging use, this task remains challenging due to the significant appearance variation caused by occlusion and size differences among tracked targets. To address these issues, we present MBPTrack, which adopts a Memory mechanism to utilize past information and formulates localization in a coarse-to-fine scheme using Box Priors given in the first frame. Specifically, past frames with targetness masks serve as an external memory, and a transformer-based module propagates tracked target cues from the memory to the current frame. To precisely localize objects of all sizes, MBPTrack first predicts the target center via Hough voting. By leveraging box priors given in the first frame, we adaptively sample reference points around the target center that roughly cover targets of different sizes. Then, we obtain dense feature maps by aggregating point features into the reference points, where localization can be performed more effectively. Extensive experiments demonstrate that MBPTrack achieves state-of-the-art performance on KITTI, nuScenes and Waymo Open Dataset, while running at 50 FPS on a single RTX3090 GPU. + + + + Novel-View Synthesis and Pose Estimation for Hand-Object Interaction from Sparse Views + http://openaccess.thecvf.com//content/ICCV2023/papers/Qu_Novel-View_Synthesis_and_Pose_Estimation_for_Hand-Object_Interaction_from_Sparse_ICCV_2023_paper.pdf + Hand-object interaction understanding and the barely addressed novel view synthesis are highly desired in immersive communication, but they are challenging due to the high deformation of the hand and heavy occlusions between hand and object. In this paper, we propose a neural rendering and pose estimation system for hand-object interaction from sparse views, which can also enable 3D hand-object interaction editing. We share the inspiration from recent scene understanding work that shows a scene-specific model built beforehand can significantly improve and unblock vision tasks especially when inputs are sparse, and extend it to the dynamic hand-object interaction scenario, proposing to solve the problem in two stages. We first learn the shape and appearance prior knowledge of hands and objects separately with the neural representation at the offline stage. During the online stage, we design a rendering-based joint model fitting framework to understand the dynamic hand-object interaction with the pre-built hand and object models as well as interaction priors, which thereby overcomes penetration and separation issues between hand and object and also enables novel view synthesis. In order to get stable contact during the hand-object interaction process in a sequence, we propose a stable contact loss to make the contact region consistent. Experiments demonstrate that our method outperforms the state-of-the-art methods. Code and dataset are available on the project webpage https://iscas3dv.github.io/HO-NeRF.
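As an illustrative aside, the stable contact loss in the hand-object entry above is described only as making the contact region consistent over a sequence; one plausible reading is a temporal-smoothness penalty on a soft per-vertex contact map. A minimal sketch under that assumption (the function, the obj_sdf callable, and the tau margin are illustrative, not the authors' formulation):

import torch

def stable_contact_loss(hand_verts, obj_sdf, tau=0.005):
    # hand_verts: (T, V, 3) hand mesh vertices over T frames
    # obj_sdf:    callable mapping (T, V, 3) points to signed distances of shape (T, V)
    dists = obj_sdf(hand_verts)                   # distance of each hand vertex to the object surface
    contact = torch.sigmoid((tau - dists) / tau)  # soft per-vertex contact indicator in (0, 1)
    # Penalize frame-to-frame change of the soft contact map.
    return (contact[1:] - contact[:-1]).abs().mean()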
+ + + + Distilling from Similar Tasks for Transfer Learning on a Budget + http://openaccess.thecvf.com//content/ICCV2023/papers/Borup_Distilling_from_Similar_Tasks_for_Transfer_Learning_on_a_Budget_ICCV_2023_paper.pdf + We address the challenge of getting efficient yet accurate recognition systems with limited labels. While recognition models improve with model size and amount of data, many specialized applications of computer vision have severe resource constraints both during training and inference. Transfer learning is an effective solution for training with few labels, however often at the expense of a computationally costly fine-tuning of large base models. We propose to mitigate this unpleasant trade-off between compute and accuracy via semi-supervised cross-domain distillation from a set of diverse source models. Initially, we show how to use task similarity metrics to select a single suitable source model to distill from, and that a good selection process is imperative for good downstream performance of a target model. We dub this approach DistillNearest. Though effective, DistillNearest assumes a single source model matches the target task, which is not always the case. To alleviate this, we propose a weighted multi-source distillation method to distill multiple source models trained on different domains weighted by their relevance for the target task into a single efficient model (named DistillWeighted). Our methods need no access to source data and merely need features and pseudo-labels of the source models. When the goal is accurate recognition under computational constraints, both DistillNearest and DistillWeighted approaches outperform both transfer learning from strong ImageNet initializations as well as state-of-the-art semi-supervised techniques such as FixMatch. Averaged over 8 diverse target tasks our multi-source method outperforms the baselines by 5.6%-points and 4.5%-points, respectively. + + + + Self-Supervised Burst Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Bhat_Self-Supervised_Burst_Super-Resolution_ICCV_2023_paper.pdf + We introduce a self-supervised training strategy for burst super-resolution that only uses noisy low-resolution bursts during training. Our approach eliminates the need to carefully tune synthetic data simulation pipelines, which often do not match real-world image statistics. Compared to weakly-paired training strategies, which require noisy smartphone burst photos of static scenes, paired with a clean reference obtained from a tripod-mounted DSLR camera, our approach is more scalable, and avoids the color mismatch between the smartphone and DSLR. To achieve this, we propose a new self-supervised objective that uses a forward imaging model to recover a high-resolution image from aliased high frequencies in the burst. Our approach does not require any manual tuning of the forward model's parameters; we learn them from data. Furthermore, we show our training strategy is robust to dynamic scene motion in the burst, which enables training burst super-resolution models using in-the-wild data. Extensive experiments on real and synthetic data show that, despite only using noisy bursts during training, models trained with our self-supervised strategy match, and sometimes surpass, the quality of fully-supervised baselines trained with synthetic data or weakly-paired ground-truth. Finally, we show our training strategy is general using four different burst super-resolution architectures. 
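As an illustrative aside, the weighted multi-source distillation in the DistillWeighted entry above boils down to a relevance-weighted sum of per-teacher distillation terms computed from the source models' pseudo-labels. A minimal sketch of that idea (the names, the temperature T, and the KL-based objective are assumptions, not the authors' exact loss):

import torch
import torch.nn.functional as F

def distill_weighted_loss(student_logits, teacher_logits_list, weights, T=2.0):
    # student_logits:      (B, C) logits of the target model
    # teacher_logits_list: list of (B, C) logits from the source models
    # weights:             per-source relevance weights (e.g., from a task-similarity metric)
    weights = torch.as_tensor(weights, dtype=student_logits.dtype, device=student_logits.device)
    weights = weights / weights.sum()
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    loss = 0.0
    for w, t_logits in zip(weights, teacher_logits_list):
        p_teacher = F.softmax(t_logits / T, dim=-1)  # soft pseudo-labels of one source model
        loss = loss + w * F.kl_div(log_p_student, p_teacher, reduction="batchmean") * T * T
    return loss

With a single nonzero weight this reduces to the DistillNearest setting described in the same entry.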
+ + + + PC-Adapter: Topology-Aware Adapter for Efficient Domain Adaption on Point Clouds with Rectified Pseudo-label + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_PC-Adapter_Topology-Aware_Adapter_for_Efficient_Domain_Adaption_on_Point_Clouds_ICCV_2023_paper.pdf + Understanding point clouds captured from the real world is challenging due to shifts in data distribution caused by varying object scales, sensor angles, and self-occlusion. Prior works have addressed this issue by combining recent learning principles such as self-supervised learning, self-training and adversarial training, which leads to significant computational overhead. Toward succinct yet powerful domain adaptation for point clouds, we revisit the unique challenges of point cloud data under domain shift scenarios and discover the importance of the global geometry of source data and trends of target pseudo-labels biased to the source label distribution. Motivated by our observations, we propose an adapter-guided domain adaptation method, PC-Adapter, that preserves the global shape information of the source domain using an attention-based adapter, while learning the local characteristics of the target domain via another adapter equipped with graph convolution. Additionally, we propose a novel pseudo-labeling strategy resilient to the classifier bias by adjusting confidence scores using their class-wise confidence distributions to consider relative confidences. Our method demonstrates superiority over baselines on various domain shift settings in benchmark datasets - PointDA, GraspNetPC, and PointSegDA. + + + + Cyclic Test-Time Adaptation on Monocular Video for 3D Human Mesh Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Nam_Cyclic_Test-Time_Adaptation_on_Monocular_Video_for_3D_Human_Mesh_ICCV_2023_paper.pdf + Despite recent advances in 3D human mesh reconstruction, the domain gap between training and test data is still a major challenge. Several prior works tackle the domain gap problem via test-time adaptation that fine-tunes a network relying on 2D evidence (e.g., 2D human keypoints) from test images. However, the high reliance on 2D evidence during adaptation causes two major issues. First, 2D evidence induces depth ambiguity, preventing the learning of accurate 3D human geometry. Second, 2D evidence is noisy or partially non-existent during test time, and such imperfect 2D evidence leads to erroneous adaptation. To overcome the above issues, we introduce CycleAdapt, which cyclically adapts two networks: a human mesh reconstruction network (HMRNet) and a human motion denoising network (MDNet), given a test video. In our framework, to alleviate the high reliance on 2D evidence, we fully supervise HMRNet with 3D supervision targets generated by MDNet. Our cyclic adaptation scheme progressively elaborates the 3D supervision targets, which compensate for imperfect 2D evidence. As a result, our CycleAdapt achieves state-of-the-art performance compared to previous test-time adaptation methods. The code is available here: https://github.com/hygenie1228/CycleAdapt_RELEASE.
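As an illustrative aside, the rectified pseudo-labeling in the PC-Adapter entry above adjusts confidence scores by their class-wise distributions so that sample selection reflects relative, rather than absolute, confidence. A minimal sketch of one such rectification (the function name, the per-class mean normalization, and the thresholding rule are assumptions, not the paper's exact procedure):

import torch

def rectified_pseudo_labels(probs, threshold=1.0):
    # probs: (N, C) softmax outputs of the classifier on target samples
    conf, labels = probs.max(dim=1)
    # Mean confidence per predicted class (defaults to 1 for empty classes).
    class_mean = torch.ones(probs.size(1), device=probs.device)
    for c in range(probs.size(1)):
        sel = labels == c
        if sel.any():
            class_mean[c] = conf[sel].mean()
    # Relative confidence: how confident a sample is compared to its class's average.
    relative_conf = conf / class_mean[labels]
    mask = relative_conf >= threshold  # keep samples at or above their class average
    return labels, mask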
+ + + + 2D3D-MATR: 2D-3D Matching Transformer for Detection-Free Registration Between Images and Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_2D3D-MATR_2D-3D_Matching_Transformer_for_Detection-Free_Registration_Between_Images_and_ICCV_2023_paper.pdf + The commonly adopted detect-then-match approach to registration finds difficulties in the cross-modality cases due to the incompatible keypoint detection and inconsistent feature description. We propose, 2D3D-MATR, a detection-free method for accurate and robust registration between images and point clouds. Our method adopts a coarse-to-fine pipeline where it first computes coarse correspondences between downsampled patches of the input image and the point cloud and then extends them to form dense correspondences between pixels and points within the patch region. The coarse-level patch matching is based on transformer which jointly learns global contextual constraints with self-attention and cross-modality correlations with cross-attention. To resolve the scale ambiguity in patch matching, we construct a multi-scale pyramid for each image patch and learn to find for each point patch the best matching image patch at a proper resolution level. Extensive experiments on two public benchmarks demonstrate that 2D3D-MATR outperforms the previous state-of-the-art P2-Net by around 20 percentage points on inlier ratio and over 10 points on registration recall. Our code and models will be publicly released. + + + + Group Pose: A Simple Baseline for End-to-End Multi-Person Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Group_Pose_A_Simple_Baseline_for_End-to-End_Multi-Person_Pose_Estimation_ICCV_2023_paper.pdf + In this paper, we study the problem of end-to-end multi-person pose estimation. State-of-the-art solutions adopt the DETR-like framework, and mainly develop the complex decoder, e.g., regarding pose estimation as keypoint box detection and combining with human detection in ED-Pose, hierarchically predicting with pose decoder and joint (keypoint) decoder in PETR. We present a simple yet effective transformer approach, named Group Pose. We simply regard K-keypoint pose estimation as predicting a set of NxK keypoint positions, each from a keypoint query, as well as representing each pose with an instance query for scoring N pose predictions. Motivated by the intuition that the interaction, among across-instance queries of different types, is not directly helpful, we make a simple modification to decoder self-attention. We replace single self-attention over all the Nx(K+1) queries with two subsequent group self-attentions: (i) N within-instance self-attention, with each over K keypoint queries and one instance query, and (ii) (K+1) same-type across-instance self-attention, each over N queries of the same type. The resulting decoder removes the interaction among across-instance type-different queries, easing the optimization and thus improving the performance. Experimental results on MS COCO and CrowdPose show that our approach without human box supervision is superior to previous methods with complex decoders, and even is slightly better than ED-Pose that uses human box supervision. Code is available. + + + + SkeleTR: Towards Skeleton-based Action Recognition in the Wild + http://openaccess.thecvf.com//content/ICCV2023/papers/Duan_SkeleTR_Towards_Skeleton-based_Action_Recognition_in_the_Wild_ICCV_2023_paper.pdf + We present SkeleTR, a new framework for skeleton-based action recognition. 
In contrast to prior work, which focuses mainly on controlled environments, we target in-the-wild scenarios that typically involve a variable number of people and various forms of interaction between people. SkeleTR works with a two-stage paradigm. It first models the intra-person skeleton dynamics for each skeleton sequence with graph convolutions, and then uses stacked Transformer encoders to capture person interactions that are important for action recognition in the wild. To mitigate the negative impact of inaccurate skeleton associations, SkeleTR takes relatively short skeleton sequences as input and increases the number of sequences. As a unified solution, SkeleTR can be directly applied to multiple skeleton-based action tasks, including video-level action classification, instance-level action detection, and group-level activity recognition. It also enables transfer learning and joint training across different action tasks and datasets, which results in performance improvement. When evaluated on various skeleton-based action recognition benchmarks, SkeleTR achieves state-of-the-art performance. + + + + Weakly-Supervised Action Localization by Hierarchically-Structured Latent Attention Modeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Weakly-Supervised_Action_Localization_by_Hierarchically-Structured_Latent_Attention_Modeling_ICCV_2023_paper.pdf + Weakly-supervised action localization aims to recognize and localize action instances in untrimmed videos with only video-level labels. Most existing models rely on multiple instance learning (MIL), where the predictions of unlabeled instances are supervised by classifying labeled bags. The MIL-based methods are relatively well studied with cogent performance achieved on classification but not on localization. Generally, they locate temporal regions by the video-level classification but overlook the temporal variations of feature semantics. To address this problem, we propose a novel attention-based hierarchically-structured latent model to learn the temporal variations of feature semantics. Specifically, our model entails two components: the first is an unsupervised change-point detection module that detects change-points by learning the latent representations of video features in a temporal hierarchy based on their rates of change, and the second is an attention-based classification model that selects the change-points of the foreground as the boundaries. To evaluate the effectiveness of our model, we conduct extensive experiments on two benchmark datasets, THUMOS-14 and ActivityNet-v1.3. The experiments show that our method outperforms current state-of-the-art methods, and even achieves comparable performance with fully-supervised methods. + + + + Get3DHuman: Lifting StyleGAN-Human into a 3D Generative Model Using Pixel-Aligned Reconstruction Priors + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiong_Get3DHuman_Lifting_StyleGAN-Human_into_a_3D_Generative_Model_Using_Pixel-Aligned_ICCV_2023_paper.pdf + Fast generation of high-quality 3D digital humans is important to a vast number of applications ranging from entertainment to professional concerns. Recent advances in differentiable rendering have enabled the training of 3D generative models without requiring 3D ground truths. However, the quality of the generated 3D humans still has much room to improve in terms of both fidelity and diversity.
In this paper, we present Get3DHuman, a novel 3D human framework that can significantly boost the realism and diversity of the generated outcomes by only using a limited budget of 3D ground-truth data. Our key observation is that the 3D generator can profit from human-related priors learned through 2D human generators and 3D reconstructors. Specifically, we bridge the latent space of Get3DHuman with that of StyleGAN-Human via a specially-designed prior network, where the input latent code is mapped to the shape and texture feature volumes spanned by the pixel-aligned 3D reconstructor. The outcomes of the prior network are then leveraged as the supervisory signals for the main generator network. To ensure effective training, we further propose three tailored losses applied to the generated feature volumes and the intermediate feature maps. Extensive experiments demonstrate that Get3DHuman greatly outperforms the other state-of-the-art approaches and can support a wide range of applications including shape interpolation, shape re-texturing, and single-view reconstruction through latent inversion. + + + + Query6DoF: Learning Sparse Queries as Implicit Shape Prior for Category-Level 6DoF Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Query6DoF_Learning_Sparse_Queries_as_Implicit_Shape_Prior_for_Category-Level_ICCV_2023_paper.pdf + Category-level 6DoF object pose estimation intends to estimate the rotation, translation, and size of unseen objects. Many previous works use point clouds as a pre-learned shape prior to overcome intra-category variability. The shape prior is deformed to reconstruct instances' point clouds in canonical space and to build dense 3D-3D correspondences between the observed and reconstructed point clouds. However, the pre-learned shape prior is not jointly optimized with estimation networks, and they are trained with a surrogate objective. We propose a novel 6D pose estimation network, named Query6DoF, based on a series of category-specific sparse queries that represent the prior shape. Each query represents a shape component, and these queries are learnable embeddings that can be optimized together with the estimation network according to the point cloud reconstruction loss, the normalized object coordinate loss, and the 6d pose estimation loss. Query6DoF adopts a deformation-and-matching paradigm with attention, where the queries dynamically extract features from regions of interest using the attention mechanism and then directly regress results. Furthermore, Query6DoF reduces computation overhead through the sparseness of the queries and the incorporation of a lightweight global information injection block. With the aforementioned design, Query6DoF achieves state-of-the-art (SOTA) pose estimation performance on the NOCS datasets. The source code and models are available at https://github.com/hustvl/Query6DoF. + + + + Towards High-Quality Specular Highlight Removal by Leveraging Large-Scale Synthetic Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Fu_Towards_High-Quality_Specular_Highlight_Removal_by_Leveraging_Large-Scale_Synthetic_Data_ICCV_2023_paper.pdf + This paper aims to remove specular highlights from a single object-level image. Although previous methods have made some progresses, their performance remains somewhat limited, particularly for real images with complex specular highlights. To this end, we propose a three-stage network to address them. 
Specifically, given an input image, we first decompose it into the albedo, shading, and specular residue components to estimate a coarse specular-free image. Then, we further refine the coarse result to alleviate its visual artifacts such as color distortion. Finally, we adjust the tone of the refined result to match the tone of the input as closely as possible. In addition, to facilitate network training and quantitative evaluation, we present a large-scale synthetic dataset of object-level images, covering diverse objects and illumination conditions. Extensive experiments illustrate that our network is able to generalize well to unseen real object-level images, and even produce good results for scene-level images with multiple background objects and complex lighting. + + + + Unsupervised 3D Perception with 2D Vision-Language Distillation for Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Najibi_Unsupervised_3D_Perception_with_2D_Vision-Language_Distillation_for_Autonomous_Driving_ICCV_2023_paper.pdf + Closed-set 3D perception models trained on only a pre-defined set of object categories can be inadequate for safety critical applications such as autonomous driving where new object types can be encountered after deployment. In this paper, we present a multi-modal auto labeling pipeline capable of generating amodal 3D bounding boxes and tracklets for training models on open-set categories without 3D human labels. Our pipeline exploits motion cues inherent in point cloud sequences in combination with the freely available 2D image-text pairs to identify and track all traffic participants. Compared to the recent studies in this domain, which can only provide class-agnostic auto labels limited to moving objects, our method can handle both static and moving objects in the unsupervised manner and is able to output open-vocabulary semantic labels thanks to the proposed vision-language knowledge distillation. Experiments on the Waymo Open Dataset show that our approach outperforms the prior work by significant margins on various unsupervised 3D perception tasks. + + + + Towards Grand Unified Representation Learning for Unsupervised Visible-Infrared Person Re-Identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Towards_Grand_Unified_Representation_Learning_for_Unsupervised_Visible-Infrared_Person_Re-Identification_ICCV_2023_paper.pdf + Unsupervised learning visible-infrared person re-identification (USL-VI-ReID) is an extremely important and challenging task, which can alleviate the issue of expensive cross-modality annotations. Existing works focus on handling the cross-modality discrepancy under unsupervised conditions. However, they ignore the fact that USL-VI-ReID is a cross-modality retrieval task with the hierarchical discrepancy, i.e., camera variation and modality discrepancy, resulting in clustering inconsistencies and ambiguous cross-modality label association. To address these issues, we propose a hierarchical framework to learn grand unified representation (GUR) for USL-VI-ReID. The grand unified representation lies in two aspects: 1) GUR adopts a bottom-up domain learning strategy with a cross-memory association embedding module to explore the information of hierarchical domains, i.e., intra-camera, inter-camera, and inter-modality domains, learning a unified and robust representation against hierarchical discrepancy. 
2) To unify the identities of the two modalities, we develop a cross-modality label unification module that constructs a cross-modality affinity matrix as a bridge for propagating labels between two modalities. Then, we utilize the homogeneous structure matrix to smooth the propagated labels, ensuring that the label structure within one modality remains unchanged. Extensive experiments demonstrate that our GUR framework significantly outperforms existing USL-VI-ReID methods, and even surpasses some supervised counterparts. + + + + ReFit: Recurrent Fitting Network for 3D Human Recovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_ReFit_Recurrent_Fitting_Network_for_3D_Human_Recovery_ICCV_2023_paper.pdf + We present Recurrent Fitting (ReFit), a neural network architecture for single-image, parametric 3D human reconstruction. ReFit learns a feedback-update loop that mirrors the strategy of solving an inverse problem through optimization. At each iterative step, it reprojects keypoints from the human model to feature maps to query feedback, and uses a recurrent-based updater to adjust the model to fit the image better. Because ReFit encodes strong knowledge of the inverse problem, it is faster to train than previous regression models. At the same time, ReFit improves state-of-the-art performance on standard benchmarks. Moreover, ReFit applies to other optimization settings, such as multi-view fitting and single-view shape fitting. Project website: https://yufu-wang.github.io/refit_humans/ + + + + Verbs in Action: Improving Verb Understanding in Video-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Momeni_Verbs_in_Action_Improving_Verb_Understanding_in_Video-Language_Models_ICCV_2023_paper.pdf + Understanding verbs is crucial to modelling how people and objects interact with each other and the environment through space and time. Recently, state-of-the-art video-language models based on CLIP have been shown to have limited verb understanding and to rely extensively on nouns, restricting their performance in real-world video applications that require action and temporal understanding. In this work, we improve verb understanding for CLIP-based video-language models by proposing a new Verb-Focused Contrastive (VFC) framework. This consists of two main components: (1) leveraging pretrained large language models (LLMs) to create hard negatives for cross-modal contrastive learning, together with a calibration strategy to balance the occurrence of concepts in positive and negative pairs; and (2) enforcing a fine-grained, verb phrase alignment loss. Our method achieves state-of-the-art results for zero-shot performance on three downstream tasks that focus on verb understanding, including video-text matching, video question-answering and video classification; while maintaining performance on noun-focused settings. To the best of our knowledge, this is the first work which proposes a method to alleviate the verb understanding problem, and does not simply highlight it. Our code is publicly available. + + + + Zero-Shot Point Cloud Segmentation by Semantic-Visual Aware Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Zero-Shot_Point_Cloud_Segmentation_by_Semantic-Visual_Aware_Synthesis_ICCV_2023_paper.pdf + This paper proposes a feature synthesis approach for zero-shot semantic segmentation of 3D point clouds, enabling generalization to previously unseen categories. 
Given only the class-level semantic information for unseen objects, we strive to enhance the correspondence, alignment and consistency between the visual and semantic spaces, to synthesise diverse, generic and transferable visual features. We develop a masked learning strategy to promote diversity within the same class visual features and enhance the separation between different classes. We further cast the visual features into a prototypical space to model their distribution for alignment with the corresponding semantic space. Finally, we develop a consistency regularizer to preserve the semantic-visual relationships between the real-seen features and synthetic-unseen features. Our approach shows considerable semantic segmentation gains on ScanNet, S3DIS and SemanticKITTI benchmarks. Our code is available at: https://github.com/leolyj/3DPC-GZSL + + + + Exploring Predicate Visual Context in Detecting of Human-Object Interactions + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Exploring_Predicate_Visual_Context_in_Detecting_of_Human-Object_Interactions_ICCV_2023_paper.pdf + Recently, the DETR framework has emerged as the dominant approach for human--object interaction (HOI) research. In particular, two-stage transformer-based HOI detectors are amongst the most performant and training-efficient approaches. However, these often condition HOI classification on object features that lack fine-grained contextual information, eschewing pose and orientation information in favour of visual cues about object identity and box extremities. This naturally hinders the recognition of complex or ambiguous interactions. In this work, we study these issues through visualisations and carefully designed experiments. Accordingly, we investigate how best to re-introduce image features via cross-attention. With an improved query design, extensive exploration of keys and values, and box pair positional embeddings as spatial guidance, our model with enhanced predicate visual context (PViC) outperforms state-of-the-art methods on the HICO-DET and V-COCO benchmarks, while maintaining low training cost. + + + + Towards Saner Deep Image Registration + http://openaccess.thecvf.com//content/ICCV2023/papers/Duan_Towards_Saner_Deep_Image_Registration_ICCV_2023_paper.pdf + With recent advances in computing hardware and surges of deep-learning architectures, learning-based deep image registration methods have surpassed their traditional counterparts, in terms of metric performance and inference time. However, these methods focus on improving performance measurements such as Dice, resulting in less attention given to model behaviors that are equally desirable for registrations, especially for medical imaging. This paper investigates these behaviors for popular learning-based deep registrations under a sanity-checking microscope. We find that most existing registrations suffer from low inverse consistency and nondiscrimination of identical pairs due to overly optimized image similarities. To rectify these behaviors, we propose a novel regularization-based sanity-enforcer method that imposes two sanity checks on the deep model to reduce its inverse consistency errors and increase its discriminative power simultaneously. Moreover, we derive a set of theoretical guarantees for our sanity-checked image registration method, with experimental results supporting our theoretical findings and their effectiveness in increasing the sanity of models without sacrificing any performance. 
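As an illustrative aside, the inverse-consistency sanity check discussed in the registration entry above can be expressed as a penalty on the composition of forward and backward deformations deviating from the identity map. A minimal sketch of such a generic penalty (the displacement-field convention, normalized coordinates, and L1 reduction are assumptions, not the paper's exact regularizer):

import torch
import torch.nn.functional as F

def inverse_consistency_loss(flow_ab, flow_ba):
    # flow_ab, flow_ba: (B, 2, H, W) displacement fields in normalized [-1, 1] coordinates
    _, _, h, w = flow_ab.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).unsqueeze(0).to(flow_ab)  # identity sampling grid (1, 2, H, W)
    coords_ab = (grid + flow_ab).permute(0, 2, 3, 1)              # where each pixel lands under phi_AB
    flow_ba_at_ab = F.grid_sample(flow_ba, coords_ab, align_corners=True)
    composed = flow_ab + flow_ba_at_ab                            # phi_BA(phi_AB(x)) - x, ideally zero
    return composed.abs().mean()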
+ + + + Interaction-aware Joint Attention Estimation Using People Attributes + http://openaccess.thecvf.com//content/ICCV2023/papers/Nakatani_Interaction-aware_Joint_Attention_Estimation_Using_People_Attributes_ICCV_2023_paper.pdf + This paper proposes joint attention estimation in a single image. Different from related work in which only the gaze-related attributes of people are independently employed, (i) their locations and actions are also employed as contextual cues for weighting their attributes, and (ii) interactions among all of these attributes are explicitly modeled in our method. For the interaction modeling, we propose a novel Transformer-based attention network to encode joint attention as low-dimensional features. We introduce a specialized MLP head with positional embedding to the Transformer so that it predicts pixelwise confidence of joint attention for generating the confidence heatmap. This pixelwise prediction improves the heatmap accuracy by avoiding the ill-posed problem in which the high-dimensional heatmap is predicted from the low-dimensional features. The estimated joint attention is further improved by being integrated with general image-based attention estimation. Our method outperforms SOTA methods quantitatively in comparative experiments. Code: https://anonymous.4open.science/r/anonymized_codes-ECA4. + + + + Non-Coaxial Event-Guided Motion Deblurring with Spatial Alignment + http://openaccess.thecvf.com//content/ICCV2023/papers/Cho_Non-Coaxial_Event-Guided_Motion_Deblurring_with_Spatial_Alignment_ICCV_2023_paper.pdf + Motion deblurring from a blurred image is a challenging computer vision problem because frame-based cameras lose information during the blurring process. Several attempts have compensated for the loss of motion information by using event cameras, which are bio-inspired sensors with a high temporal resolution. Even though most studies have assumed that image and event data are pixel-wise aligned, this is only possible with low-quality active-pixel sensor (APS) images and synthetic datasets. In real scenarios, obtaining per-pixel aligned event-RGB data is technically challenging since event and frame cameras have different optical axes. For practical application of the event camera, we propose the first Non-coaxial Event-guided Image Deblurring (NEID) approach that utilizes a camera setup composed of a standard frame-based camera and a non-coaxial single event camera. To achieve per-pixel alignment between the image and event data without additional devices, we propose the first NEID network that spatially aligns events to images while refining the image features from temporally dense event features. For training and evaluation of our network, we also present the first large-scale dataset, consisting of RGB frames with non-aligned events, aimed at a breakthrough in motion deblurring with an event camera. Extensive experiments on various datasets demonstrate that the proposed method achieves significantly better results than the prior works in terms of performance and speed, and it can be applied for practical uses of event cameras. + + + + Fingerprinting Deep Image Restoration Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Quan_Fingerprinting_Deep_Image_Restoration_Models_ICCV_2023_paper.pdf + Fingerprinting is a promising non-invasive method for protecting the intellectual property rights (IPR) of deep neural network (DNN) models. It extracts a feature called a fingerprint from a DNN model to identify its ownership.
Existing fingerprinting methods focus only on classification-related models that map images to labels, while inapplicable to models for image restoration that map images to images. This paper proposes a fingerprinting framework for DNN models of image restoration. The proposed framework defines the fingerprint using a critical image, which exhibits strongly discriminative patterns and is robust to modest model modifications. Model ownership is then verified by comparing the distance of color histograms and local gradient pattern histograms of critical images between the suspect and source models. We apply the proposed framework to two representative tasks, denoising and super-resolution. It outperforms the baselines of fingerprinting and competes against existing invasive model watermarking methods. + + + + SportsMOT: A Large Multi-Object Tracking Dataset in Multiple Sports Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Cui_SportsMOT_A_Large_Multi-Object_Tracking_Dataset_in_Multiple_Sports_Scenes_ICCV_2023_paper.pdf + Multi-object tracking (MOT) in sports scenes plays a critical role in gathering players statistics, supporting further applications, such as automatic tactical analysis. Yet existing MOT benchmarks cast little attention on this domain. In this work, we present a new large-scale multi-object tracking dataset in multiple sports scenes, coined as SportsMOT, where all players on the court are supposed to be tracked. It consists of 240 video sequences, over 150K frames (almost 15x MOT17) and over 1.6M bounding boxes (3x MOT17) collected from 3 sports categories, including basketball, volleyball and football. Our dataset is characterized with two key properties: 1) fast and variable-speed motion and 2) similar yet distinguishable appearance. We expect SportsMOT to encourage the MOT trackers to promote in both motion-based association and appearance-based association. We benchmark several state-of-the-art trackers and reveal the key challenge of SportsMOT lies in object association. To alleviate the issue, we further propose a new multi-object tracking framework, termed as MixSort, introducing a MixFormer-like structure as an auxiliary association model to prevailing tracking-by-detection trackers. By integrating the customized appearance-based association with the original motion-based association, MixSort achieves state-of-the-art performance on SportsMOT and MOT17. Based on MixSort, we give an in-depth analysis and provide some profound insights into SportsMOT. + + + + Localizing Moments in Long Video Via Multimodal Guidance + http://openaccess.thecvf.com//content/ICCV2023/papers/Barrios_Localizing_Moments_in_Long_Video_Via_Multimodal_Guidance_ICCV_2023_paper.pdf + The recent introduction of the large-scale, long-form MAD and Ego4D datasets has enabled researchers to investigate the performance of current state-of-the-art methods for video grounding in the long-form setup, with interesting findings: current grounding methods alone fail at tackling this challenging task and setup due to their inability to process long video sequences. In this paper, we propose a method for improving the performance of natural language grounding in long videos by identifying and pruning out non-describable windows. We design a guided grounding framework consisting of a Guidance Model and a base grounding model. 
The Guidance Model emphasizes describable windows, while the base grounding model analyzes short temporal windows to determine which segments accurately match a given language query. We offer two designs for the Guidance Model: Query-Agnostic and Query-Dependent, which balance efficiency and accuracy. Experiments demonstrate that our proposed method outperforms state-of-the-art models by 4.1% in MAD and 4.52% in Ego4D (NLQ), respectively. Code, data and MAD's audio features necessary to reproduce our experiments are available at: https://github.com/waybarrios/guidance-based-video-grounding. + + + + Towards Universal Image Embeddings: A Large-Scale Dataset and Challenge for Generic Image Representations + http://openaccess.thecvf.com//content/ICCV2023/papers/Ypsilantis_Towards_Universal_Image_Embeddings_A_Large-Scale_Dataset_and_Challenge_for_ICCV_2023_paper.pdf + Fine-grained and instance-level recognition methods are commonly trained and evaluated on specific domains, in a model per domain scenario. Such an approach, however, is impractical in real large-scale applications. In this work, we address the problem of universal image embedding, where a single universal model is trained and used in multiple domains. First, we leverage existing domain-specific datasets to carefully construct a new large-scale public benchmark for the evaluation of universal image embeddings, with 241k query images, 1.4M index images and 2.8M training images across 8 different domains and 349k classes. We define suitable metrics, training and evaluation protocols to foster future research in this area. Second, we provide a comprehensive experimental evaluation on the new dataset, demonstrating that existing approaches and simplistic extensions lead to worse performance than an assembly of models trained for each domain separately. Finally, we conducted a public research competition on this topic, leveraging industrial datasets, which attracted the participation of more than 1k teams worldwide. This exercise generated many interesting research ideas and findings which we present in detail. Project webpage: https://cmp.felk.cvut.cz/univ_emb/ + + + + SemARFlow: Injecting Semantics into Unsupervised Optical Flow Estimation for Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Yuan_SemARFlow_Injecting_Semantics_into_Unsupervised_Optical_Flow_Estimation_for_Autonomous_ICCV_2023_paper.pdf + Unsupervised optical flow estimation is especially hard near occlusions and motion boundaries and in low-texture regions. We show that additional information such as semantics and domain knowledge can help better constrain this problem. We introduce SemARFlow, an unsupervised optical flow network designed for autonomous driving data that takes estimated semantic segmentation masks as additional inputs. This additional information is injected into the encoder and into a learned upsampler that refines the flow output. In addition, a simple yet effective semantic augmentation module provides self-supervision when learning flow and its boundaries for vehicles, poles, and sky. Together, these injections of semantic information improve the KITTI-2015 optical flow test error rate from 11.80% to 8.38%. We also show visible improvements around object boundaries as well as a greater ability to generalize across datasets. Code is available at https://github.com/duke-vision/semantic-unsup-flow-release. 
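As an illustrative aside, the simplest way to inject estimated segmentation masks into an optical-flow encoder, as the SemARFlow entry above describes, is to concatenate a one-hot encoding of the class map with the input frame along the channel dimension. A minimal sketch of that input construction (the function name and the 19-class default are assumptions, not the authors' exact interface):

import torch
import torch.nn.functional as F

def build_encoder_input(image, seg_mask, num_classes=19):
    # image:    (B, 3, H, W) float frame
    # seg_mask: (B, H, W) long tensor of estimated class ids
    onehot = F.one_hot(seg_mask, num_classes=num_classes)  # (B, H, W, C)
    onehot = onehot.permute(0, 3, 1, 2).float()            # (B, C, H, W)
    return torch.cat([image, onehot], dim=1)               # (B, 3 + C, H, W) encoder input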
+ + + + Uncertainty-aware Unsupervised Multi-Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Uncertainty-aware_Unsupervised_Multi-Object_Tracking_ICCV_2023_paper.pdf + Without manually annotated identities, unsupervised multi-object trackers struggle to learn reliable feature embeddings. This also makes the similarity-based inter-frame association stage error-prone, where an uncertainty problem arises. The frame-by-frame accumulated uncertainty prevents trackers from learning consistent feature embeddings against time variation. To avoid this uncertainty problem, recent self-supervised techniques have been adopted, whereas they fail to capture temporal relations. The inter-frame uncertainty still exists. In fact, this paper argues that though the uncertainty problem is inevitable, it is possible to leverage the uncertainty itself to improve the learned consistency in turn. Specifically, an uncertainty-based metric is developed to verify and rectify the risky associations. The resulting accurate pseudo-tracklets boost learning of feature consistency. Moreover, accurate tracklets can incorporate temporal information into spatial transformations. This paper proposes a tracklet-guided augmentation strategy to simulate the tracklet's motion, which adopts a hierarchical uncertainty-based sampling mechanism for hard sample mining. The ultimate unsupervised MOT framework, namely U2MOT, is proven effective on the MOT-Challenge and VisDrone-MOT benchmarks. U2MOT achieves SOTA performance among published supervised and unsupervised trackers. + + + + Designing Phase Masks for Under-Display Cameras + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Designing_Phase_Masks_for_Under-Display_Cameras_ICCV_2023_paper.pdf + Diffractive blur and low light levels are two fundamental challenges in producing high-quality photographs in under-display cameras (UDCs). In this paper, we incorporate phase masks on display panels to tackle both challenges. Our design inserts two phase masks, specifically two microlens arrays, in front of and behind a display panel. The first phase mask concentrates light on the locations where the display is transparent so that more light passes through the display, and the second phase mask reverts the effect of the first phase mask. We further optimize the folding height of each microlens to improve the quality of PSFs and suppress chromatic aberration. We evaluate our design using a physically-accurate simulator based on Fourier optics. The proposed design is able to double the light throughput while improving the invertibility of the PSFs. Lastly, we discuss the effect of our design on the display quality and show that implementation with polarization-dependent phase masks can leave the display quality uncompromised. + + + + Can Language Models Learn to Listen? + http://openaccess.thecvf.com//content/ICCV2023/papers/Ng_Can_Language_Models_Learn_to_Listen_ICCV_2023_paper.pdf + We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words. Given an input transcription of the speaker's words with their timestamps, our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE. Since gesture is a language component, we propose treating the quantized atomic motion elements as additional language token inputs to a transformer-based large language model.
Initializing our transformer with the weights of a language model pre-trained only on text results in significantly higher quality listener responses than training a transformer from scratch. We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study. In our evaluation, we analyze the model's ability to utilize temporal and semantic aspects of spoken text. + + + + SurfsUP: Learning Fluid Simulation for Novel Surfaces + http://openaccess.thecvf.com//content/ICCV2023/papers/Mani_SurfsUP_Learning_Fluid_Simulation_for_Novel_Surfaces_ICCV_2023_paper.pdf + Modeling the mechanics of fluid in complex scenes is vital to applications in design, graphics, and robotics. Learning-based methods provide fast and differentiable fluid simulators, however most prior work is unable to accurately model how fluids interact with genuinely novel surfaces not seen during training. We introduce SurfsUP, a framework that represents objects implicitly using signed distance functions (SDFs), rather than an explicit representation of meshes or particles. This continuous representation of geometry enables more accurate simulation of fluid-object interactions over long time periods while simultaneously making computation more efficient. Moreover, SurfsUP trained on simple shape primitives generalizes considerably out-of-distribution, even to complex real-world scenes and objects. Finally, we show we can invert our model to design simple objects to manipulate fluid flow. + + + + Regularized Mask Tuning: Uncovering Hidden Knowledge in Pre-Trained Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Regularized_Mask_Tuning_Uncovering_Hidden_Knowledge_in_Pre-Trained_Vision-Language_Models_ICCV_2023_paper.pdf + Prompt tuning and adapter tuning have shown great potential in transferring pre-trained vision-language models (VLMs) to various downstream tasks. In this work, we design a new type of tuning method, termed as regularized mask tuning, which masks the network parameters through a learnable selection. Inspired by neural pathways, we argue that the knowledge required by a downstream task already exists in the pre-trained weights but just gets concealed in the upstream pre-training stage. To bring the useful knowledge back into light, we first identify a set of parameters that are important to a given downstream task, then attach a binary mask to each parameter, and finally optimize these masks on the downstream data with the parameters frozen. When updating the mask, we introduce a novel gradient dropout strategy to regularize the parameter selection, in order to prevent the model from forgetting old knowledge and overfitting the downstream data. Experimental results on 11 datasets demonstrate the consistent superiority of our method over previous alternatives. It is noteworthy that we manage to deliver 18.73% performance improvement compared to the zero-shot CLIP via masking an average of only 2.56% parameters. Furthermore, our method is synergistic with most existing parameter-efficient tuning methods and can boost the performance on top of them. Code will be made publicly available. 
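As an illustrative aside, the regularized mask tuning in the entry above attaches a learnable binary mask to frozen pre-trained parameters and regularizes the mask updates with gradient dropout. A minimal sketch of that mechanism for a single linear layer (the straight-through binarization, the hook-based gradient dropout, and all names are assumptions, not the authors' implementation):

import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    # Freeze a pre-trained linear layer and learn a binary mask over its weights.
    def __init__(self, linear: nn.Linear, grad_drop: float = 0.1):
        super().__init__()
        self.weight = nn.Parameter(linear.weight.data.clone(), requires_grad=False)
        self.bias = None if linear.bias is None else nn.Parameter(linear.bias.data.clone(), requires_grad=False)
        self.scores = nn.Parameter(torch.zeros_like(self.weight))  # mask logits, the only trainable tensor
        if grad_drop > 0:
            # Gradient dropout: randomly zero a fraction of the mask gradients at each step.
            self.scores.register_hook(lambda g: g * (torch.rand_like(g) > grad_drop).float())

    def forward(self, x):
        # Straight-through estimator: binary mask in the forward pass, identity gradient backward.
        hard = (self.scores >= 0).float()
        mask = hard + self.scores - self.scores.detach()
        return nn.functional.linear(x, self.weight * mask, self.bias)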
+ + + + + + Adaptive and Background-Aware Vision Transformer for Real-Time UAV Tracking   http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Adaptive_and_Background-Aware_Vision_Transformer_for_Real-Time_UAV_Tracking_ICCV_2023_paper.pdf   While discriminative correlation filters (DCF)-based trackers prevail in UAV tracking for their favorable efficiency, lightweight convolutional neural network (CNN)-based trackers using filter pruning have also demonstrated remarkable efficiency and precision. However, the use of pure vision transformer models (ViTs) for UAV tracking remains unexplored, which is a surprising finding given that ViTs have been shown to produce better performance and greater efficiency than CNNs in image classification. In this paper, we propose an efficient ViT-based tracking framework, Aba-ViTrack, for UAV tracking. In our framework, feature learning and template-search coupling are integrated into an efficient one-stream ViT to avoid an extra heavy relation modeling module. The proposed Aba-ViT exploits an adaptive and background-aware token computation method to reduce inference time. This approach adaptively discards tokens based on learned halting probabilities, which a priori are higher for background tokens than target ones. Extensive experiments on six UAV tracking benchmarks demonstrate that the proposed Aba-ViTrack achieves state-of-the-art performance in UAV tracking. Code is available at https://github.com/xyyang317/Aba-ViTrack. + + + + Persistent-Transient Duality: A Multi-Mechanism Approach for Modeling Human-Object Interaction   http://openaccess.thecvf.com//content/ICCV2023/papers/Tran_Persistent-Transient_Duality_A_Multi-Mechanism_Approach_for_Modeling_Human-Object_Interaction_ICCV_2023_paper.pdf   Humans are highly adaptable, swiftly switching between different modes to progressively handle different tasks, situations and contexts. In Human-object interaction (HOI) activities, these modes can be attributed to two mechanisms: (1) the large-scale consistent plan for the whole activity and (2) the small-scale children interactive actions that start and end along the timeline. While neuroscience and cognitive science have confirmed this multi-mechanism nature of human behavior, machine modeling approaches for human motion are trailing behind. While prior works have attempted to use gradually morphing structures (e.g., graph attention networks) to model the dynamic HOI patterns, they miss the expeditious and discrete mode-switching nature of human motion. To bridge that gap, this work proposes to model two concurrent mechanisms that jointly control human motion: the Persistent process that runs continually on the global scale, and the Transient sub-processes that operate intermittently on the local context of the human while interacting with objects. These two mechanisms form an interactive Persistent-Transient Duality that synergistically governs the activity sequences. We model this conceptual duality by a parent-child neural network of Persistent and Transient channels with a dedicated neural module for dynamic mechanism switching. The framework is trialed on HOI motion forecasting. On two rich datasets and a wide variety of settings, the model consistently delivers superior performances, proving its suitability for the challenge.
+ + + + DDS2M: Self-Supervised Denoising Diffusion Spatio-Spectral Model for Hyperspectral Image Restoration + http://openaccess.thecvf.com//content/ICCV2023/papers/Miao_DDS2M_Self-Supervised_Denoising_Diffusion_Spatio-Spectral_Model_for_Hyperspectral_Image_Restoration_ICCV_2023_paper.pdf + Diffusion models have recently received a surge of interest due to their impressive performance for image restoration, especially in terms of noise robustness. However, existing diffusion-based methods are trained on a large amount of training data and perform very well in-distribution, but can be quite susceptible to distribution shift. This is especially inappropriate for data-starved hyperspectral image (HSI) restoration. To tackle this problem, this work puts forth a self-supervised diffusion model for HSI restoration, namely Denoising Diffusion Spatio-Spectral Model (DDS2M), which works by inferring the parameters of the proposed Variational Spatio-Spectral Module (VS2M) during the reverse diffusion process, solely using the degraded HSI without any extra training data. In VS2M, a variational inference-based loss function is customized to enable the untrained spatial and spectral networks to learn the posterior distribution, which serves as the transitions of the sampling chain to help reverse the diffusion process. Benefiting from its self-supervised nature and the diffusion process, DDS2M enjoys stronger generalization ability to various HSIs compared to existing diffusion-based methods and superior robustness to noise compared to existing HSI restoration methods. Extensive experiments on HSI denoising, noisy HSI completion and super-resolution on a variety of HSIs demonstrate DDS2M's superiority over the existing task-specific state-of-the-arts. Code is available at: https://github.com/miaoyuchun/DDS2M. + + + + MotionLM: Multi-Agent Motion Forecasting as Language Modeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Seff_MotionLM_Multi-Agent_Motion_Forecasting_as_Language_Modeling_ICCV_2023_paper.pdf + Reliable forecasting of the future behavior of road agents is a critical component to safe planning in autonomous vehicles. Here, we represent continuous trajectories as sequences of discrete motion tokens and cast multi-agent motion prediction as a language modeling task over this domain. Our model, MotionLM, provides several advantages: First, it does not require anchors or explicit latent variable optimization to learn multimodal distributions. Instead, we leverage a single standard language modeling objective, maximizing the average log probability over sequence tokens. Second, our approach bypasses post-hoc interaction heuristics where individual agent trajectory generation is conducted prior to interactive scoring. Instead, MotionLM produces joint distributions over interactive agent futures in a single autoregressive decoding process. In addition, the model's sequential factorization enables temporally causal conditional rollouts. The proposed approach establishes new state-of-the-art performance for multi-agent motion prediction on the Waymo Open Motion Dataset, ranking 1st on the interactive challenge leaderboard. + + + + Black Box Few-Shot Adaptation for Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Ouali_Black_Box_Few-Shot_Adaptation_for_Vision-Language_Models_ICCV_2023_paper.pdf + Vision-Language (V-L) models trained with contrastive learning to align the visual and language modalities have been shown to be strong few-shot learners. 
Soft prompt learning is the method of choice for few-shot downstream adaptation, aiming to bridge the modality gap caused by the distribution shift induced by the new domain. While parameter-efficient, prompt learning still requires access to the model weights and can be computationally infeasible for large models with billions of parameters. To address these shortcomings, in this work, we describe a black-box method for V-L few-shot adaptation that (a) operates on pre-computed image and text features and hence works without access to the model's weights, (b) is orders of magnitude faster at training time, (c) is amenable to both supervised and unsupervised training, and (d) can even be used to align image and text features computed from uni-modal models. To achieve this, we propose Linear Feature Alignment (LFA), a simple linear approach for V-L re-alignment in the target domain. LFA is initialized from a closed-form solution to a least-squares problem and then it is iteratively updated by minimizing a re-ranking loss. Despite its simplicity, our approach can even surpass soft-prompt learning methods as shown by extensive experiments on 11 image and 2 video datasets. + + + + Zero-1-to-3: Zero-shot One Image to 3D Object   http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Zero-1-to-3_Zero-shot_One_Image_to_3D_Object_ICCV_2023_paper.pdf   We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this underconstrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our conditional diffusion model uses a synthetic dataset to learn controls of the relative camera viewpoint, which allow new images to be generated of the same object under a specified camera transformation. Even though it is trained on a synthetic dataset, our model retains a strong zero-shot generalization ability to out-of-distribution datasets as well as in-the-wild images, including impressionist paintings. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image. Qualitative and quantitative experiments show that our method significantly outperforms state-of-the-art single-view 3D reconstruction and novel view synthesis models by leveraging Internet-scale pre-training. + + + + 3D Distillation: Improving Self-Supervised Monocular Depth Estimation on Reflective Surfaces   http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_3D_Distillation_Improving_Self-Supervised_Monocular_Depth_Estimation_on_Reflective_Surfaces_ICCV_2023_paper.pdf   Self-supervised monocular depth estimation (SSMDE) aims at predicting the dense depth maps of monocular images, by learning to minimize a photometric loss using spatially neighboring image pairs during training. While SSMDE offers a significant scalability advantage over supervised approaches, it performs poorly on reflective surfaces as the photometric constancy assumption of the photometric loss is violated. We note that the appearance of reflective surfaces is view-dependent and often there are views of such surfaces in the training data that are not contaminated by strong specular reflections. Thus, reflective surfaces can be accurately reconstructed by aggregating the predicted depth of these views.
Motivated by this observation, we propose 3D distillation: a novel training framework that utilizes the projected depth of reconstructed reflective surfaces to generate reasonably accurate depth pseudo-labels. To identify those surfaces automatically, we employ an uncertainty-guided depth fusion method, combining the smoother and more accurate projected depth on reflective surfaces and the detailed predicted depth elsewhere. In our experiments using the ScanNet and 7-Scenes datasets, we show that 3D distillation not only significantly improves the prediction accuracy, especially on the problematic surfaces, but also that it generalizes well over various underlying network architectures and to new datasets. + + + + Low-Light Image Enhancement with Multi-Stage Residue Quantization and Brightness-Aware Attention   http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Low-Light_Image_Enhancement_with_Multi-Stage_Residue_Quantization_and_Brightness-Aware_Attention_ICCV_2023_paper.pdf   Low-light image enhancement (LLIE) aims to recover illumination and improve the visibility of low-light images. Conventional LLIE methods often produce poor results because they neglect the effect of noise interference. Deep learning-based LLIE methods focus on learning a mapping function between low-light images and normal-light images that outperforms conventional LLIE methods. However, most deep learning-based LLIE methods cannot yet fully exploit the guidance of auxiliary priors provided by normal-light images in the training dataset. In this paper, we propose a brightness-aware network with normal-light priors based on brightness-aware attention and a residual quantized codebook. To achieve a more natural and realistic enhancement, we design a query module to obtain more reliable normal-light features and fuse them with low-light features by a fusion branch. In addition, we propose a brightness-aware attention module to further retain the color consistency between the enhanced results and the normal-light images. Extensive experimental results on both real-captured and synthetic data show that our method outperforms existing state-of-the-art methods. + + + + Hierarchically Decomposed Graph Convolutional Networks for Skeleton-Based Action Recognition   http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Hierarchically_Decomposed_Graph_Convolutional_Networks_for_Skeleton-Based_Action_Recognition_ICCV_2023_paper.pdf   Graph convolutional networks (GCNs) are the most commonly used methods for skeleton-based action recognition and have achieved remarkable performance. Generating adjacency matrices with semantically meaningful edges is particularly important for this task, but extracting such edges is a challenging problem. To solve this, we propose a hierarchically decomposed graph convolutional network (HD-GCN) architecture with a novel hierarchically decomposed graph (HD-Graph). The proposed HD-GCN effectively decomposes every joint node into several sets to extract major structurally adjacent and distant edges, and uses them to construct an HD-Graph containing those edges in the same semantic spaces of a human skeleton. In addition, we introduce an attention-guided hierarchy aggregation (A-HA) module to highlight the dominant hierarchical edge sets of the HD-Graph. Furthermore, we apply a new six-way ensemble method, which uses only the joint and bone streams without any motion stream. The proposed model is evaluated and achieves state-of-the-art performance on four large, popular datasets.
Finally, we demonstrate the effectiveness of our model with various comparative experiments. + + + + LIST: Learning Implicitly from Spatial Transformers for Single-View 3D Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Arshad_LIST_Learning_Implicitly_from_Spatial_Transformers_for_Single-View_3D_Reconstruction_ICCV_2023_paper.pdf + Accurate reconstruction of both the geometric and topological details of a 3D object from a single 2D image embodies a fundamental challenge in computer vision. Existing explicit/implicit solutions to this problem struggle to recover self-occluded geometry and/or faithfully reconstruct topological shape structures. To resolve this dilemma, we introduce LIST, a novel neural architecture that leverages local and global image features to accurately reconstruct the geometric and topological structure of a 3D object from a single image. We utilize global 2D features to predict a coarse shape of the target object and then use it as a base for higher-resolution reconstruction. By leveraging both local 2D features from the image and 3D features from the coarse prediction, we can predict the signed distance between an arbitrary point and the target surface via an implicit predictor with great accuracy. Furthermore, our model does not require camera estimation or pixel alignment. It provides an uninfluenced reconstruction from the input-view direction. Through qualitative and quantitative analysis, we show the superiority of our model in reconstructing 3D objects from both synthetic and real-world images against the state of the art. + + + + LRRU: Long-short Range Recurrent Updating Networks for Depth Completion + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_LRRU_Long-short_Range_Recurrent_Updating_Networks_for_Depth_Completion_ICCV_2023_paper.pdf + Existing deep learning-based depth completion methods generally employ massive stacked layers to predict the dense depth map from sparse input data. Although such approaches greatly advance this task, their accompanied huge computational complexity hinders their practical applications. To accomplish depth completion more efficiently, we propose a novel lightweight deep network framework, the Long-short Range Recurrent Updating (LRRU) network. Without learning complex feature representations, LRRU first roughly fills the sparse input to obtain an initial dense depth map, and then iteratively updates it through learned spatially-variant kernels. Our iterative update process is content-adaptive and highly flexible, where the kernel weights are learned by jointly considering the guidance RGB images and the depth map to be updated, and large-to-small kernel scopes are dynamically adjusted to capture long-to-short range dependencies. Our initial depth map has coarse but complete scene depth information, which helps relieve the burden of directly regressing the dense depth from sparse ones, while our proposed method can effectively refine it to an accurate depth map with less learnable parameters and inference time. Experimental results demonstrate that our proposed LRRU variants achieve state-of-the-art performance across different parameter regimes. In particular, the LRRU-Base model outperforms competing approaches on the NYUv2 dataset, and ranks 1st on the KITTI depth completion benchmark at the time of submission. Project page: https://npucvr.github.io/LRRU/. 
+ + + + MetaBEV: Solving Sensor Failures for 3D Detection and Map Segmentation   http://openaccess.thecvf.com//content/ICCV2023/papers/Ge_MetaBEV_Solving_Sensor_Failures_for_3D_Detection_and_Map_Segmentation_ICCV_2023_paper.pdf   Perception systems in modern autonomous driving vehicles typically take inputs from complementary multi-modal sensors, e.g., LiDAR and cameras. However, in real-world applications, sensor corruptions and failures lead to inferior performances, thus compromising autonomous safety. In this paper, we propose a robust framework, called MetaBEV, to address extreme real-world environments, involving overall six sensor corruptions and two extreme sensor-missing situations. In MetaBEV, signals from multiple sensors are first processed by modal-specific encoders. Subsequently, a set of dense BEV queries are initialized, termed meta-BEV. These queries are then processed iteratively by a BEV-Evolving decoder, which selectively aggregates deep features from either LiDAR, cameras, or both modalities. The updated BEV representations are further leveraged for multiple 3D prediction tasks. Additionally, we introduce a new mixture-of-experts (MoE) structure to alleviate the performance drop on distinct tasks in multi-task joint learning. Finally, MetaBEV is evaluated on the nuScenes dataset with 3D object detection and BEV map segmentation tasks. Experiments show MetaBEV outperforms prior arts by a large margin on both full and corrupted modalities. For instance, when the LiDAR signal is missing, MetaBEV improves detection NDS by 35.5% and segmentation mIoU by 17.7% upon the vanilla BEVFusion model; and when the camera signal is absent, MetaBEV still achieves 69.2% NDS and 53.7% mIoU, which is even higher than previous works that operate on full modalities. Moreover, MetaBEV performs moderately against previous methods in both canonical perception and multi-task learning settings, refreshing state-of-the-art nuScenes BEV map segmentation with 70.4% mIoU. + + + + Exploring Temporal Concurrency for Video-Language Representation Learning   http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Exploring_Temporal_Concurrency_for_Video-Language_Representation_Learning_ICCV_2023_paper.pdf   Paired video and language data is naturally temporally concurrent, which requires the modeling of the temporal dynamics within each modality and the temporal alignment across modalities simultaneously. However, most existing video-language representation learning methods only focus on discrete semantic alignment that encourages aligned semantics to be close in the latent space, or temporal context dependency that captures short-range coherence, failing to build temporal concurrency. In this paper, we propose to learn video-language representations by modeling video-language pairs as Temporal Concurrent Processes (TCP) via a process-wise distance metric learning framework. Specifically, we employ soft Dynamic Time Warping (DTW) to measure the distance between two processes across modalities and then optimize the DTW costs. Meanwhile, we further introduce a regularization term that enforces the embeddings of each modality to approximate a stochastic process to guarantee the inherent dynamics. Experimental results on three benchmarks demonstrate that TCP stands as a state-of-the-art method for various video-language understanding tasks, including paragraph-to-video retrieval, video moment retrieval, and video question-answering. Code is available at https://github.com/hengRUC/TCP.
+ + + + DynamicISP: Dynamically Controlled Image Signal Processor for Image Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Yoshimura_DynamicISP_Dynamically_Controlled_Image_Signal_Processor_for_Image_Recognition_ICCV_2023_paper.pdf + Image Signal Processors (ISPs) play important roles in image recognition tasks as well as in the perceptual quality of captured images. In most cases, experts make a lot of effort to manually tune many parameters of ISPs, but the parameters are sub-optimal. In the literature, two types of techniques have been actively studied: a machine learning-based parameter tuning technique and a DNN-based ISP technique. The former is lightweight but lacks expressive power. The latter has expressive power, but the computational cost is too heavy on edge devices. To solve these problems, we propose "DynamicISP," which consists of multiple classical ISP functions and dynamically controls the parameters of each frame according to the recognition result of the previous frame. We show our method successfully controls the parameters of multiple ISP functions and achieves state-of-the-art accuracy with low computational cost in single and multi-category object detection tasks. + + + + R-Pred: Two-Stage Motion Prediction Via Tube-Query Attention-Based Trajectory Refinement + http://openaccess.thecvf.com//content/ICCV2023/papers/Choi_R-Pred_Two-Stage_Motion_Prediction_Via_Tube-Query_Attention-Based_Trajectory_Refinement_ICCV_2023_paper.pdf + Predicting the future motion of dynamic agents is of paramount importance to ensuring safety and assessing risks in motion planning for autonomous robots. In this study, we propose a two-stage motion prediction method, called R-Pred, designed to effectively utilize both scene and interaction context using a cascade of the initial trajectory proposal and trajectory refinement networks. The initial trajectory proposal network produces M trajectory proposals corresponding to the M modes of the future trajectory distribution. The trajectory refinement network enhances each of the M proposals using 1) tube-query scene attention (TQSA) and 2) proposal-level interaction attention (PIA) mechanisms. TQSA uses tube-queries to aggregate local scene context features pooled from proximity around trajectory proposals of interest. PIA further enhances the trajectory proposals by modeling inter-agent interactions using a group of trajectory proposals selected by their distances from neighboring agents. Our experiments conducted on Argoverse and nuScenes datasets demonstrate that the proposed refinement network provides significant performance improvements compared to the single-stage baseline and that R-Pred achieves state-of-the-art performance in some categories of the benchmarks. + + + + Aggregating Feature Point Cloud for Depth Completion + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Aggregating_Feature_Point_Cloud_for_Depth_Completion_ICCV_2023_paper.pdf + Guided depth completion aims to recover dense depth maps by propagating depth information from the given pixels to the remaining ones under the guidance of RGB images. However, most of the existing methods achieve this using a large number of iterative refinements or stacking repetitive blocks. Due to the limited receptive field of conventional convolution, the generalizability with respect to different sparsity levels of input depth maps is impeded. 
To tackle these problems, we propose a feature point cloud aggregation framework to directly propagate 3D depth information between the given points and the missing ones. We extract a 2D feature map from images and transform the sparse depth map into a point cloud to extract sparse 3D features. By regarding the extracted features as two sets of feature point clouds, the depth information for a target location can be reconstructed by aggregating adjacent sparse 3D features from the known points using cross attention. Based on this, we design a neural network, called PointDC, to complete the entire depth information reconstruction process. Experimental results show that our PointDC achieves superior or competitive results on the KITTI benchmark and NYUv2 dataset. In addition, the proposed PointDC demonstrates its higher generalizability to different sparsity levels of the input depth maps and cross-dataset evaluation. + + + + Reconstructed Convolution Module Based Look-Up Tables for Efficient Image Super-Resolution   http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Reconstructed_Convolution_Module_Based_Look-Up_Tables_for_Efficient_Image_Super-Resolution_ICCV_2023_paper.pdf   Look-up table (LUT)-based methods have shown great efficacy in the single image super-resolution (SR) task. However, previous methods do not delve into the essential reason for the restricted receptive field (RF) size in LUTs, which is caused by the interaction of space and channel features in vanilla convolution. To enlarge the RF with contained LUT sizes, we propose a novel Reconstructed Convolution (RC) module, which decouples channel-wise and spatial calculation. It can be formulated as n^2 1D LUTs to maintain an nxn receptive field, which is obviously smaller than the nxnD LUT formulated before. The LUT generated by our RC module reaches less than 1/10000 storage compared with the SR-LUT baseline. The proposed Reconstructed Convolution module based LUT method, termed RCLUT, can enlarge the RF size by 9 times compared with the state-of-the-art LUT-based SR method and achieve superior performance on five popular benchmark datasets. Moreover, the efficient and robust RC module can be used as a plugin to improve other LUT-based SR methods. The code is available at https://github.com/RC-LUT/RC-LUT.git. + + + + Action Sensitivity Learning for Temporal Action Localization   http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_Action_Sensitivity_Learning_for_Temporal_Action_Localization_ICCV_2023_paper.pdf   Temporal action localization (TAL), which involves recognizing and locating action instances, is a challenging task in video understanding. Most existing approaches directly predict action classes and regress offsets to boundaries, while overlooking the discrepant importance of each frame. In this paper, we propose an Action Sensitivity Learning framework (ASL) to tackle this task, which aims to assess the value of each frame and then leverage the generated action sensitivity to recalibrate the training procedure. We first introduce a lightweight Action Sensitivity Evaluator to learn the action sensitivity at the class level and instance level, respectively. The outputs of the two branches are combined to reweight the gradient of the two sub-tasks. Moreover, based on the action sensitivity of each frame, we design an Action Sensitive Contrastive Loss to enhance features, where the action-aware frames are sampled as positive pairs to push away the action-irrelevant frames.
The extensive studies on various action localization benchmarks (i.e., MultiThumos, Charades, Ego4D-Moment Queries v1.0, Epic-Kitchens 100, Thumos14 and ActivityNet1.3) show that ASL surpasses the state-of-the-art in terms of average-mAP under multiple types of scenarios, e.g., single-labeled, densely-labeled and egocentric. + + + + PEANUT: Predicting and Navigating to Unseen Targets + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhai_PEANUT_Predicting_and_Navigating_to_Unseen_Targets_ICCV_2023_paper.pdf + Efficient ObjectGoal navigation (ObjectNav) in novel environments requires an understanding of the spatial and semantic regularities in environment layouts. In this work, we present a straightforward method for learning these regularities by predicting the locations of unobserved objects from incomplete semantic maps. Our method differs from previous prediction-based navigation methods, such as frontier potential prediction or egocentric map completion, by directly predicting unseen targets while leveraging the global context from all previously explored areas. Our prediction model is lightweight and can be trained in a supervised manner using a relatively small amount of passively collected data. Once trained, the model can be incorporated into a modular pipeline for ObjectNav without the need for any reinforcement learning. We validate the effectiveness of our method on the HM3D and MP3D ObjectNav datasets. We find that it achieves the state-of-the-art on both datasets, despite not using any additional data for training. + + + + PoseDiffusion: Solving Pose Estimation via Diffusion-aided Bundle Adjustment + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_PoseDiffusion_Solving_Pose_Estimation_via_Diffusion-aided_Bundle_Adjustment_ICCV_2023_paper.pdf + Camera pose estimation is a long-standing computer vision problem that to date often relies on classical methods, such as handcrafted keypoint matching, RANSAC and bundle adjustment. In this paper, we propose to formulate the Structure from Motion (SfM) problem inside a probabilistic diffusion framework, modelling the conditional distribution of camera poses given input images. This novel view of an old problem has several advantages. (i) The nature of the diffusion framework mirrors the iterative procedure of bundle adjustment. (ii) The formulation allows a seamless integration of geometric constraints from epipolar geometry. (iii) It excels in typically difficult scenarios such as sparse views with wide baselines. (iv) The method can predict intrinsics and extrinsics for an arbitrary amount of images. We demonstrate that our method PoseDiffusion significantly improves over the classic SfM pipelines and the learned approaches on two real-world datasets. Finally, it is observed that our method can generalize across datasets without further training. Project page: https://posediffusion.github.io/ + + + + CORE: Cooperative Reconstruction for Multi-Agent Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_CORE_Cooperative_Reconstruction_for_Multi-Agent_Perception_ICCV_2023_paper.pdf + This paper presents CORE, a conceptually simple, effective and communication-efficient model for multi-agent cooperative perception. 
It addresses the task from a novel perspective of cooperative reconstruction, based on two key insights: 1) cooperating agents together provide a more holistic observation of the environment, and 2) the holistic observation can serve as valuable supervision to explicitly guide the model to learn how to reconstruct the ideal observation based on collaboration. CORE instantiates the idea with three major components: a compressor for each agent to create a more compact feature representation for efficient broadcasting, a lightweight attentive collaboration component for cross-agent message aggregation, and a reconstruction module to reconstruct the observation based on aggregated feature representations. This learning-to-reconstruct idea is task-agnostic, and offers clear and reasonable supervision to inspire more effective collaboration, eventually promoting perception tasks. We validate CORE on two large-scale multi-agent perception datasets, OPV2V and V2X-Sim, in two tasks, i.e., 3D object detection and semantic segmentation. Results demonstrate that CORE achieves state-of-the-art performance, and is more communication-efficient. + + + + SEFD: Learning to Distill Complex Pose and Occlusion   http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_SEFD_Learning_to_Distill_Complex_Pose_and_Occlusion_ICCV_2023_paper.pdf   This paper addresses the problem of three-dimensional (3D) human mesh estimation in complex poses and occluded situations. Although many improvements have been made in 3D human mesh estimation using the two-dimensional (2D) pose with occlusion between humans, occlusion from complex poses and other objects remains a consistent problem. Therefore, we propose the novel Skinned Multi-Person Linear (SMPL) Edge Feature Distillation (SEFD) that demonstrates robustness to complex poses and occlusions, without increasing the number of parameters compared to the baseline model. The model generates an SMPL overlapping edge similar to the ground truth that contains target person boundary and occlusion information, performing subsequent feature distillation in a simple edge map. We also perform experiments on various benchmarks and exhibit fidelity both qualitatively and quantitatively. Extensive experiments prove that our method outperforms the state-of-the-art method by 2.8% in MPJPE and 1.9% in MPVPE on the benchmark 3DPW dataset in the presence of domain gap. Also, our method is superior on the 3DPW-OCC, 3DPW-PC, RH-Dataset, OCHuman, CrowdPose, and LSP datasets, in which occlusion, complex pose, and domain gap exist. The code and occlusion & complex pose annotation will be available at https://anonymous.4open.science/r/SEFD-B7F8/ + + + + CiT: Curation in Training for Effective Vision-Language Data   http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_CiT_Curation_in_Training_for_Effective_Vision-Language_Data_ICCV_2023_paper.pdf   Large vision-language models are generally applicable to many downstream tasks, but come at an exorbitant training cost that only large institutions can afford. This paper trades generality for efficiency and presents Curation in Training (CiT), a simple and efficient vision-text learning algorithm that couples a data objective into training. CiT automatically yields quality data to speed up contrastive image-text training and alleviates the need for an offline data filtering pipeline, allowing broad data sources (including raw image-text pairs from the web).
CiT contains two loops: an outer loop curating the training data and an inner loop consuming the curated training data. The text encoder connects the two loops. Given metadata for tasks of interest, e.g., class names, and a large pool of image-text pairs, CiT alternatively selects relevant training data from the pool by measuring the similarity of their text embeddings and embeddings of the metadata. In our experiments, we observe that CiT can speed up training by over an order of magnitude, especially if the raw data size is large. + + + + SparseNeRF: Distilling Depth Ranking for Few-shot Novel View Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_SparseNeRF_Distilling_Depth_Ranking_for_Few-shot_Novel_View_Synthesis_ICCV_2023_paper.pdf + Neural Radiance Field (NeRF) significantly degrades when only a limited number of views are available. To complement the lack of 3D information, depth-based models, such as DSNeRF and MonoSDF, explicitly assume the availability of accurate depth maps of multiple views. They linearly scale the accurate depth maps as supervision to guide the predicted depth of few-shot NeRFs. However, accurate depth maps are difficult and expensive to capture due to wide-range depth distances in the wild. This work presents a new Sparse-view NeRF (SparseNeRF) framework that exploits depth priors from real-world inaccurate observations. The inaccurate depth observations are either from pre-trained depth models or coarse depth maps of consumer-level depth sensors. Since coarse depth maps are not strictly scaled to the ground-truth depth maps, we propose a simple yet effective constraint, a local depth ranking method, on NeRFs such that the expected depth ranking of the NeRF is consistent with that of the coarse depth maps in local patches. To preserve the spatial continuity of the estimated depth of NeRF, we further propose a spatial continuity constraint to encourage the consistency of the expected depth continuity of NeRF with coarse depth maps. Surprisingly, with simple depth ranking constraints, SparseNeRF outperforms all state-of-the-art few-shot NeRF methods (including depth-based models) on standard LLFF and DTU datasets. Moreover, we collect a new dataset NVS-RGBD that contains real-world depth maps from Azure Kinect, ZED 2, and iPhone 13 Pro. Extensive experiments on NVS-RGBD dataset also validate the superiority and generalizability of SparseNeRF. Code and dataset are available at https://sparsenerf.github.io/. + + + + ProPainter: Improving Propagation and Transformer for Video Inpainting + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_ProPainter_Improving_Propagation_and_Transformer_for_Video_Inpainting_ICCV_2023_paper.pdf + Flow-based propagation and spatiotemporal Transformer are two mainstream mechanisms in video inpainting (VI). Despite the effectiveness of these components, they still suffer from some limitations that affect their performance. Previous propagation-based approaches are performed separately either in the image or feature domain. Global image propagation isolated from learning may cause spatial misalignment due to inaccurate optical flow. Moreover, memory or computational constraints limit the temporal range of feature propagation and video Transformer, preventing exploration of correspondence information from distant frames. To address these issues, we propose an improved framework, called ProPainter, which involves enhanced ProPagation and an efficient Transformer. 
Specifically, we introduce dual-domain propagation that combines the advantages of image and feature warping, exploiting global correspondences reliably. We also propose a mask-guided sparse video Transformer, which achieves high efficiency by discarding unnecessary and redundant tokens. With these components, ProPainter outperforms prior arts by a large margin of 1.46 dB in PSNR while maintaining appealing efficiency. + + + + Root Pose Decomposition Towards Generic Non-rigid 3D Reconstruction with Monocular Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Root_Pose_Decomposition_Towards_Generic_Non-rigid_3D_Reconstruction_with_Monocular_ICCV_2023_paper.pdf + This work focuses on the 3D reconstruction of non-rigid objects based on monocular RGB video sequences. Concretely, we aim at building high-fidelity models for generic object categories and casually captured scenes. To this end, we do not assume known root poses of objects, and do not utilize category-specific templates or dense pose priors. The key idea of our method, Root Pose Decomposition (RPD), is to maintain a per-frame root pose transformation, meanwhile building a dense field with local transformations to rectify the root pose. The optimization of local transformations is performed by point registration to the canonical space. We also adapt RPD to multi-object scenarios with object occlusions and individual differences. As a result, RPD allows non-rigid 3D reconstruction for complicated scenarios containing objects with large deformations, complex motion patterns, occlusions, and scale diversities of different individuals. Such a pipeline potentially scales to diverse sets of objects in the wild. We experimentally show that RPD surpasses state-of-the-art methods on the challenging DAVIS, OVIS, and AMA datasets. We provide video results in https://rpd-share.github.io. + + + + GLA-GCN: Global-local Adaptive Graph Convolutional Network for 3D Human Pose Estimation from Monocular Video + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_GLA-GCN_Global-local_Adaptive_Graph_Convolutional_Network_for_3D_Human_Pose_ICCV_2023_paper.pdf + 3D human pose estimation has been researched for decades with promising fruits. 3D human pose lifting is one of the promising research directions toward the task where both estimated pose and ground truth pose data are used for training. Existing pose lifting works mainly focus on improving the performance of estimated pose, but they usually underperform when testing on the ground truth pose data. We observe that the performance of the estimated pose can be easily improved by preparing good quality 2D pose, such as fine-tuning the 2D pose or using advanced 2D pose detectors. As such, we concentrate on improving the 3D human pose lifting via ground truth data for the future improvement of more quality estimated pose data. Towards this goal, a simple yet effective model called Global-local Adaptive Graph Convolutional Network (GLA-GCN) is proposed in this work. Our GLA-GCN globally models the spatiotemporal structure via a graph representation and backtraces local joint features for 3D human pose estimation via individually connected layers. To validate our model design, we conduct extensive experiments on three benchmark datasets: Human3.6M, HumanEva-I, and MPI-INF-3DHP. 
Experimental results show that our GLA-GCN implemented with ground truth 2D poses significantly outperforms state-of-the-art methods (e.g., up to 3%, 17%, and 14% error reductions on Human3.6M, HumanEva-I, and MPI-INF-3DHP, respectively). + + + + Snow Removal in Video: A New Dataset and A Novel Method   http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Snow_Removal_in_Video_A_New_Dataset_and_A_Novel_ICCV_2023_paper.pdf   Snowfall is a common weather phenomenon that can severely affect computer vision tasks by obscuring objects and scenes. However, existing deep learning-based snow removal methods are designed for single images only. In this paper, we target a more complex task -- video snow removal, which aims to restore the clear video from the snowy video. To facilitate this task, we propose the first high-quality video dataset, which simulates realistic physical characteristics of snow and haze using a rendering engine and augmentation techniques. We also develop a deep learning framework for video snow removal. Specifically, it involves a snow-query temporal aggregation module and a snow-aware contrastive learning loss function. The module aggregates features between video frames and removes snow effectively, while the loss function helps identify and eliminate snow features. We conduct extensive experiments and demonstrate that our proposed dataset is more realistic than previous datasets, and the models trained on it achieve better performance on real-world snowy images. Our proposed method outperforms state-of-the-art video and image-based methods on both synthetic and real snowy videos. + + + + Degradation-Resistant Unfolding Network for Heterogeneous Image Fusion   http://openaccess.thecvf.com//content/ICCV2023/papers/He_Degradation-Resistant_Unfolding_Network_for_Heterogeneous_Image_Fusion_ICCV_2023_paper.pdf   Heterogeneous image fusion (HIF) techniques aim to enhance image quality by merging complementary information from images captured by different sensors. Among these algorithms, deep unfolding network (DUN)-based methods achieve promising performance but still suffer from two issues: they lack a degradation-resistant-oriented fusion model and struggle to adequately consider the structural properties of DUNs, making them vulnerable to degradation scenarios. In this paper, we propose a Degradation-Resistant Unfolding Network (DeRUN) for the HIF task to generate high-quality fused images even in degradation scenarios. Specifically, we introduce a novel HIF model for degradation resistance and derive its optimization procedures. Then, we incorporate the optimization unfolding process into the proposed DeRUN for end-to-end training. To ensure the robustness and efficiency of DeRUN, we employ a joint constraint strategy and a lightweight partial weight sharing module. To train DeRUN, we further propose a gradient direction-based entropy loss with powerful texture representation capacity. Extensive experiments show that DeRUN significantly outperforms existing methods on four HIF tasks, as well as downstream applications, with cheaper computational and memory costs. + + + + Priority-Centric Human Motion Generation in Discrete Latent Space   http://openaccess.thecvf.com//content/ICCV2023/papers/Kong_Priority-Centric_Human_Motion_Generation_in_Discrete_Latent_Space_ICCV_2023_paper.pdf   Text-to-motion generation is a formidable task, aiming to produce human motions that align with the input text while also adhering to human capabilities and physical laws.
While there have been advancements in diffusion models, their application in discrete spaces remains underexplored. Current methods often overlook the varying significance of different motions, treating them uniformly. It is essential to recognize that not all motions hold the same relevance to a particular textual description. Some motions, being more salient and informative, should be given precedence during generation. In response, we introduce a Priority-Centric Motion Discrete Diffusion Model (M2DM), which utilizes a Transformer-based VQ-VAE to derive a concise, discrete motion representation, incorporating a global self-attention mechanism and a regularization term to counteract code collapse. We also present a motion discrete diffusion model that employs an innovative noise schedule, determined by the significance of each motion token within the entire motion sequence. This approach retains the most salient motions during the reverse diffusion process, leading to more semantically rich and varied motions. Additionally, we formulate two strategies to gauge the importance of motion tokens, drawing from both textual and visual indicators. Comprehensive experiments on the HumanML3D and KIT-ML datasets confirm that our model surpasses existing techniques in fidelity and diversity, particularly for intricate textual descriptions. + + + + 3DHacker: Spectrum-based Decision Boundary Generation for Hard-label 3D Point Cloud Attack + http://openaccess.thecvf.com//content/ICCV2023/papers/Tao_3DHacker_Spectrum-based_Decision_Boundary_Generation_for_Hard-label_3D_Point_Cloud_ICCV_2023_paper.pdf + With the maturity of depth sensors, the vulnerability of 3D point cloud models has received increasing attention in various applications such as autonomous driving and robot navigation. Previous 3D adversarial attackers either follow the white-box setting to iteratively update the coordinate perturbations based on gradients, or utilize the output model logits to estimate noisy gradients in the black-box setting. However, these attack methods are hard to be deployed in real-world scenarios since realistic 3D applications will not share any model details to users. Therefore, we explore a more challenging yet practical 3D attack setting, i.e., attacking point clouds with black-box hard labels, in which the attacker can only have access to the prediction label of the input. To tackle this setting, we propose a novel 3D attack method, termed 3D Hard-label attacker (3DHacker), based on the developed decision boundary algorithm to generate adversarial samples solely with the knowledge of class labels. Specifically, to construct the class-aware model decision boundary, 3DHacker first randomly fuses two point clouds of different classes in the spectral domain to craft their intermediate sample with high imperceptibility, then projects it onto the decision boundary via binary search. To restrict the final perturbation size, 3DHacker further introduces an iterative optimization strategy to move the intermediate sample along the decision boundary for generating adversarial point clouds with smallest trivial perturbations. Extensive evaluations show that, even in the challenging hard-label setting, 3DHacker still competitively outperforms existing 3D attacks regarding the attack performance as well as adversary quality. 
+ + + + Exploring Lightweight Hierarchical Vision Transformers for Efficient Visual Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Kang_Exploring_Lightweight_Hierarchical_Vision_Transformers_for_Efficient_Visual_Tracking_ICCV_2023_paper.pdf + Transformer-based visual trackers have demonstrated significant progress owing to their superior modeling capabilities. However, existing trackers are hampered by low speed, limiting their applicability on devices with limited computational power. To alleviate this problem, we propose HiT, a new family of efficient tracking models that can run at high speed on different devices while retaining high performance. The central idea of HiT is the Bridge Module, which bridges the gap between modern lightweight transformers and the tracking framework. The Bridge Module incorporates the high-level information of deep features into the shallow large-resolution features. In this way, it produces better features for the tracking head. We also propose a novel dual-image position encoding technique that simultaneously encodes the position information of both the search region and template images. The HiT model achieves promising speed with competitive performance. For instance, it runs at 61 frames per second (fps) on the Nvidia Jetson AGX edge device. Furthermore, HiT attains 64.6% AUC on the LaSOT benchmark, surpassing all previous efficient trackers. + + + + MiniROAD: Minimal RNN Framework for Online Action Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/An_MiniROAD_Minimal_RNN_Framework_for_Online_Action_Detection_ICCV_2023_paper.pdf + Online Action Detection (OAD) is the task of identifying actions in streaming videos without access to future frames. Much effort has been devoted to effectively capturing long-range dependencies, with transformers receiving the spotlight for their ability to capture long-range temporal structures. In contrast, RNNs have received less attention lately, due to their lower performance compared to recent methods that utilize transformers. In this paper, we investigate the underlying reasons for the inferior performance of RNNs compared to transformer-based algorithms. Our findings indicate that the discrepancy between training and inference is the primary hindrance to the effective training of RNNs. To address this, we propose applying non-uniform weights to the loss computed at each time step, which allows the RNN model to learn from the predictions made in an environment that better resembles the inference stage. Extensive experiments on three benchmark datasets, THUMOS, TVSeries, and FineAction demonstrate that a minimal RNN-based model trained with the proposed methodology performs equally or better than the existing best methods with a significant increase in efficiency. The code is available at https://github.com/jbistanbul/MiniROAD. + + + + NDC-Scene: Boost Monocular 3D Semantic Scene Completion in Normalized Device Coordinates Space + http://openaccess.thecvf.com//content/ICCV2023/papers/Yao_NDC-Scene_Boost_Monocular_3D_Semantic_Scene_Completion_in_Normalized_Device_ICCV_2023_paper.pdf + Monocular 3D Semantic Scene Completion (SSC) has garnered significant attention in recent years due to its potential to predict complex semantics and geometry shapes from a single image, requiring no 3D inputs. 
In this paper, we identify several critical issues in current state-of-the-art methods, including the Feature Ambiguity of projected 2D features in the ray to the 3D space, the Pose Ambiguity of the 3D convolution, and the Computation Imbalance in the 3D convolution across different depth levels. To address these problems, we devise a novel Normalized Device Coordinates scene completion network (NDC-Scene) that directly extends the 2D feature map to a Normalized Device Coordinates (NDC) space, rather than to the world space directly, through progressive restoration of the dimension of depth with deconvolution operations. Experimental results demonstrate that transferring the majority of computation from the target 3D space to the proposed normalized device coordinates space benefits monocular SSC tasks. Additionally, we design a Depth-Adaptive Dual Decoder to simultaneously upsample and fuse the 2D and 3D feature maps, further improving overall performance. Our extensive experiments confirm that the proposed method consistently outperforms state-of-the-art methods on both outdoor SemanticKITTI and indoor NYUv2 datasets. Our code is available at https://github.com/Jiawei-Yao0812/NDCScene. + + + + SVDFormer: Complementing Point Cloud via Self-view Augmentation and Self-structure Dual-generator   http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_SVDFormer_Complementing_Point_Cloud_via_Self-view_Augmentation_and_Self-structure_Dual-generator_ICCV_2023_paper.pdf   In this paper, we propose a novel network, SVDFormer, to tackle two specific challenges in point cloud completion: understanding faithful global shapes from incomplete point clouds and generating high-accuracy local structures. Current methods either perceive shape patterns using only 3D coordinates or import extra images with well-calibrated intrinsic parameters to guide the geometry estimation of the missing parts. However, these approaches do not always fully leverage the cross-modal self-structures available for accurate and high-quality point cloud completion. To this end, we first design a Self-view Fusion Network that leverages multiple-view depth image information to observe incomplete self-shape and generate a compact global shape. To reveal highly detailed structures, we then introduce a refinement module, called Self-structure Dual-generator, in which we incorporate learned shape priors and geometric self-similarities for producing new points. By perceiving the incompleteness of each point, the dual-path design disentangles refinement strategies conditioned on the structural type of each point. SVDFormer absorbs the wisdom of self-structures, avoiding any additional paired information such as color images with precisely calibrated camera intrinsic parameters. Comprehensive experiments indicate that our method achieves state-of-the-art performance on widely-used benchmarks. Code is available at https://github.com/czvvd/SVDFormer. + + + + E3Sym: Leveraging E(3) Invariance for Unsupervised 3D Planar Reflective Symmetry Detection   http://openaccess.thecvf.com//content/ICCV2023/papers/Li_E3Sym_Leveraging_E3_Invariance_for_Unsupervised_3D_Planar_Reflective_Symmetry_ICCV_2023_paper.pdf   Detecting symmetrical properties is a fundamental task in 3D shape analysis. In the case of a 3D model with planar symmetries, each point has a corresponding mirror point w.r.t. a symmetry plane, and the correspondences remain invariant under any arbitrary Euclidean transformation.
Our proposed method, E3Sym, aims to detect planar reflective symmetry in an unsupervised and end-to-end manner by leveraging E(3) invariance. E3Sym establishes robust point correspondences through the use of E(3) invariant features extracted from a lightweight neural network, from which the dense symmetry prediction is produced. We also introduce a novel and efficient clustering algorithm to aggregate the dense prediction and produce a detected symmetry set, allowing for the detection of an arbitrary number of planar symmetries while ensuring the method remains differentiable for end-to-end training. Our method also possesses the ability to infer reasonable planar symmetries from incomplete shapes, which remains challenging for existing methods. Extensive experiments demonstrate that E3Sym is both effective and robust, outperforming state-of-the-art methods. + + + + Zero-Shot Composed Image Retrieval with Textual Inversion + http://openaccess.thecvf.com//content/ICCV2023/papers/Baldrati_Zero-Shot_Composed_Image_Retrieval_with_Textual_Inversion_ICCV_2023_paper.pdf + Composed Image Retrieval (CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption that describes the difference between the two images. The high effort and cost required for labeling datasets for CIR hamper the widespread usage of existing methods, as they rely on supervised learning. In this work, we propose a new task, Zero-Shot CIR (ZS-CIR), that aims to address CIR without requiring a labeled training dataset. Our approach, named zero-Shot composEd imAge Retrieval with textuaL invErsion SEARLE, maps the visual features of the reference image into a pseudo-word token in CLIP token embedding space and integrates it with the relative caption. To support research on ZS-CIR, we introduce an open-domain benchmarking dataset named Composed Image Retrieval on Common Objects in context (CIRCO), which is the first dataset for CIR containing multiple ground-truths for each query. The experiments show that SEARLE exhibits better performance than the baselines on the two main datasets for CIR tasks, FashionIQ and CIRR, and on the proposed CIRCO. The dataset, the code and the model are publicly available at https://github.com/miccunifi/SEARLE. + + + + BiFF: Bi-level Future Fusion with Polyline-based Coordinate for Interactive Trajectory Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_BiFF_Bi-level_Future_Fusion_with_Polyline-based_Coordinate_for_Interactive_Trajectory_ICCV_2023_paper.pdf + Predicting future trajectories of surrounding agents is essential for safety-critical autonomous driving. Most existing work focuses on predicting marginal trajectories for each agent independently. However, it has rarely been explored in predicting joint trajectories for interactive agents. In this work, we propose Bi-level Future Fusion (BiFF) to explicitly capture future interactions between interactive agents. Concretely, BiFF fuses the high-level future intentions followed by low-level future behaviors. Then the polyline-based coordinate is specifically designed for multi-agent prediction to ensure data efficiency, frame robustness, and prediction accuracy. Experiments show that BiFF achieves state-of-the-art performance on the interactive prediction benchmark of Waymo Open Motion Dataset. 
+ + + + COOL-CHIC: Coordinate-based Low Complexity Hierarchical Image Codec + http://openaccess.thecvf.com//content/ICCV2023/papers/Ladune_COOL-CHIC_Coordinate-based_Low_Complexity_Hierarchical_Image_Codec_ICCV_2023_paper.pdf + We introduce COOL-CHIC, a Coordinate-based Low Complexity Hierarchical Image Codec. It is a learned alternative to autoencoders with 629 parameters and 680 multiplications per decoded pixel. COOL-CHIC offers compression performance close to modern conventional MPEG codecs such as HEVC and is competitive with popular autoencoder-based systems. This method is inspired by Coordinate-based Neural Representations, where an image is represented as a learned function which maps pixel coordinates to RGB values. The parameters of the mapping function are then sent using entropy coding. At the receiver side, the compressed image is obtained by evaluating the mapping function for all pixel coordinates. COOL-CHIC implementation is made open-source. + + + + Normalizing Flows for Human Pose Anomaly Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Hirschorn_Normalizing_Flows_for_Human_Pose_Anomaly_Detection_ICCV_2023_paper.pdf + Video anomaly detection is an ill-posed problem because it relies on many parameters such as appearance, pose, camera angle, background, and more. We distill the problem to anomaly detection of human pose, thus decreasing the risk of nuisance parameters such as appearance affecting the result. Focusing on pose alone also has the side benefit of reducing bias against distinct minority groups. Our model works directly on human pose graph sequences and is exceptionally lightweight ( 1K parameters), capable of running on any machine able to run the pose estimation with negligible additional resources. We leverage the highly compact pose representation in a normalizing flows framework, which we extend to tackle the unique characteristics of spatio-temporal pose data and show its advantages in this use case. The algorithm is quite general and can handle training data of only normal examples as well as a supervised setting that consists of labeled normal and abnormal examples. We report state-of-the-art results on two anomaly detection benchmarks - the unsupervised ShanghaiTech dataset and the recent supervised UBnormal dataset. Code available at https://github.com/orhir/STG-NF. + + + + Reconstructing Groups of People with Hypergraph Relational Reasoning + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Reconstructing_Groups_of_People_with_Hypergraph_Relational_Reasoning_ICCV_2023_paper.pdf + Due to the mutual occlusion, severe scale variation, and complex spatial distribution, the current multi-person mesh recovery methods cannot produce accurate absolute body poses and shapes in large-scale crowded scenes. To address the obstacles, we fully exploit crowd features for reconstructing groups of people from a monocular image. A novel hypergraph relational reasoning network is proposed to formulate the complex and high-order relation correlations among individuals and groups in the crowd. We first extract compact human features and location information from the original high-resolution image. By conducting the relational reasoning on the extracted individual features, the underlying crowd collectiveness and interaction relationship can provide additional group information for the reconstruction. Finally, the updated individual features and the localization information are used to regress human meshes in camera coordinates. 
To facilitate the network training, we further build pseudo ground-truth on two crowd datasets, which may also promote future research on pose estimation and human behavior understanding in crowded scenes. The experimental results show that our approach outperforms other baseline methods both in crowded and common scenarios. The code and datasets are publicly available at https://github.com/boycehbz/GroupRec. + + + + What Does a Platypus Look Like? Generating Customized Prompts for Zero-Shot Image Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Pratt_What_Does_a_Platypus_Look_Like_Generating_Customized_Prompts_for_ICCV_2023_paper.pdf + Open-vocabulary models are a promising new paradigm for image classification. Unlike traditional classification models, open-vocabulary models classify among any arbitrary set of categories specified with natural language during inference. This natural language, called "prompts", typically consists of a set of hand-written templates (e.g., "a photo of a ") which are completed with each of the category names. This work introduces a simple method to generate higher accuracy prompts, without relying on any explicit knowledge of the task domain and with far fewer hand-constructed sentences. To achieve this, we combine open-vocabulary models with large language models (LLMs) to create Customized Prompts via Language models (CuPL, pronounced "couple"). In particular, we leverage the knowledge contained in LLMs in order to generate many descriptive sentences that contain important discriminating characteristics of the image categories. This allows the model to place a greater importance on these regions in the image when making predictions. We find that this straightforward and general approach improves accuracy on a range of zero-shot image classification benchmarks, including over one percentage point gain on ImageNet. Finally, this simple baseline requires no additional training and remains completely zero-shot. Code available at https://github.com/sarahpratt/CuPL. + + + + Scene as Occupancy + http://openaccess.thecvf.com//content/ICCV2023/papers/Tong_Scene_as_Occupancy_ICCV_2023_paper.pdf + Human drivers can easily describe complex traffic scenes through their visual system. Such an ability of precise perception is essential for a driver's planning. To achieve this, a geometry-aware representation that quantizes the physical 3D scene into a structured grid map with semantic labels per cell, termed as 3D Occupancy, would be desirable. Compared to the form of bounding boxes, a key insight behind occupancy is that it could capture the fine-grained details of critical obstacles in the scene, and thereby facilitate subsequent tasks. Prior or concurrent literature mainly concentrates on a single scene completion task, whereas we argue that this occupancy representation might possess a broader impact. In this paper, we propose OccNet, a multi-view vision-centric pipeline with a cascade and temporal voxel decoder to reconstruct 3D occupancy. At the core of OccNet is a general occupancy embedding to represent the 3D physical world. Such a descriptor could be applied towards a wide span of driving tasks, including detection, segmentation and planning. To validate the effectiveness of this new representation and our proposed algorithm, we propose OpenOcc, the first dense high-quality 3D occupancy benchmark built on top of nuScenes.
Empirical experiments show that there are evident performance gains across multiple tasks, e.g., motion planning could witness a collision rate reduction of 15%-58%, demonstrating the superiority of our method. + + + + U-RED: Unsupervised 3D Shape Retrieval and Deformation for Partial Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Di_U-RED_Unsupervised_3D_Shape_Retrieval_and_Deformation_for_Partial_Point_ICCV_2023_paper.pdf + In this paper, we propose U-RED, an Unsupervised shape REtrieval and Deformation pipeline that takes an arbitrary object observation as input, typically captured by RGB images or scans, and jointly retrieves and deforms the geometrically similar CAD models from a pre-established database to tightly match the target. Considering existing methods typically fail to handle noisy partial observations, U-RED is designed to address this issue from two aspects. First, since one partial shape may correspond to multiple potential full shapes, the retrieval method must allow such an ambiguous one-to-many relationship. Thereby U-RED learns to project all possible full shapes of a partial target onto the surface of a unit sphere. Then during inference, each sampling on the sphere will yield a feasible retrieval. Second, since real-world partial observations usually contain noticeable noise, a reliable learned metric that measures the similarity between shapes is necessary for stable retrieval. In U-RED, we design a novel point-wise residual-guided metric that allows noise-robust comparison. Extensive experiments on the synthetic datasets PartNet, ComplementMe and the real-world dataset Scan2CAD demonstrate that U-RED surpasses existing state-of-the-art approaches by 47.3%, 16.7% and 31.6% respectively under Chamfer Distance. Codes and trained models will be released soon. + + + + PatchCT: Aligning Patch Set and Label Set with Conditional Transport for Multi-Label Image Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_PatchCT_Aligning_Patch_Set_and_Label_Set_with_Conditional_Transport_ICCV_2023_paper.pdf + Multi-label image classification is a prediction task that aims to identify more than one label from a given image. This paper considers the semantic consistency of the latent space between the visual patch and linguistic label domains and introduces the conditional transport (CT) theory to bridge the acknowledged gap. While recent cross-modal attention-based studies have attempted to align such two representations and achieved impressive performance, they required carefully-designed alignment modules and extra complex operations in the attention computation. We find that by formulating the multi-label classification as a CT problem, we can exploit the interactions between the image and label efficiently by minimizing the bidirectional CT cost. Specifically, after feeding the images and textual labels into the modality-specific encoders, we view each image as a mixture of patch embeddings and a mixture of label embeddings, which capture the local region features and the class prototypes, respectively. CT is then employed to learn and align those two semantic sets by defining the forward and backward navigators. Importantly, the defined navigators in CT distance model the similarities between patches and labels, which provides an interpretable tool to visualize the learned prototypes. Extensive experiments on three public image benchmarks show that the proposed model consistently outperforms the previous methods.
+ + + + VI-Net: Boosting Category-level 6D Object Pose Estimation via Learning Decoupled Rotations on the Spherical Representations + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_VI-Net_Boosting_Category-level_6D_Object_Pose_Estimation_via_Learning_Decoupled_ICCV_2023_paper.pdf + Rotation estimation of high precision from an RGB-D object observation is a huge challenge in 6D object pose estimation, due to the difficulty of learning in the non-linear space of SO(3). In this paper, we propose a novel rotation estimation network, termed as VI-Net, to make the task easier by decoupling the rotation as the combination of a viewpoint rotation and an in-plane rotation. More specifically, VI-Net bases the feature learning on the sphere with two individual branches for the estimates of two factorized rotations, where a V-Branch is employed to learn the viewpoint rotation via binary classification on the spherical signals, while another I-Branch is used to estimate the in-plane rotation by transforming the signals to view from the zenith direction. To process the spherical signals, a Spherical Feature Pyramid Network is constructed based on a novel design of SPAtial Spherical Convolution (SPA-SConv), which settles the boundary problem of spherical signals via feature padding and realizes viewpoint-equivariant feature extraction by symmetric convolutional operations. We apply the proposed VI-Net to the challenging task of category-level 6D object pose estimation for predicting the poses of unknown objects without available CAD models; experiments on the benchmarking datasets confirm the efficacy of our method, which outperforms the existing ones by a large margin in the regime of high precision. + + + + Long-range Multimodal Pretraining for Movie Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Argaw_Long-range_Multimodal_Pretraining_for_Movie_Understanding_ICCV_2023_paper.pdf + Learning computer vision models from (and for) movies has a long-standing history. While great progress has been attained, there is still a need for a pretrained multimodal model that can perform well in the ever-growing set of movie understanding tasks the community has been establishing. In this work, we introduce Long-range Multimodal Pretraining, a strategy, and a model that leverages movie data to train transferable multimodal and cross-modal encoders. Our key idea is to learn from all modalities in a movie by observing and extracting relationships over a long range. After pretraining, we run ablation studies on the LVU benchmark and validate our modeling choices and the importance of learning from long-range time spans. Our model achieves state-of-the-art on several LVU tasks while being much more data efficient than previous works. Finally, we evaluate our model's transferability by setting a new state-of-the-art in five different benchmarks. + + + + Adverse Weather Removal with Codebook Priors + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Adverse_Weather_Removal_with_Codebook_Priors_ICCV_2023_paper.pdf + Despite recent advancements in unified adverse weather removal methods, there remains a significant challenge of achieving realistic fine-grained texture and reliable background reconstruction to mitigate serious distortions. Inspired by recent advancements in codebook and vector quantization (VQ) techniques, we present a novel Adverse Weather Removal network with Codebook Priors (AWRCP) to address the problem of unified adverse weather removal.
AWRCP leverages high-quality codebook priors derived from undistorted images to recover vivid texture details and faithful background structures. However, simply utilizing high-quality features from the codebook does not guarantee good results in terms of fine-grained details and structural fidelity. Therefore, we develop a deformable cross-attention with a sparse sampling mechanism to flexibly perform feature interaction between degraded features and high-quality features from codebook priors. In order to effectively incorporate high-quality texture features while maintaining the realism of the details generated by codebook priors, we propose a hierarchical texture warping head that gradually fuses hierarchical codebook prior features into high-resolution features at the final restoration stage. With the utilization of the VQ codebook as a feature dictionary of high quality and the proposed designs, AWRCP can largely improve the restored quality of texture details, achieving state-of-the-art performance across multiple adverse weather removal benchmarks. + + + + MAP: Towards Balanced Generalization of IID and OOD through Model-Agnostic Adapters + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_MAP_Towards_Balanced_Generalization_of_IID_and_OOD_through_Model-Agnostic_ICCV_2023_paper.pdf + Deep learning has achieved tremendous success in recent years, but most of these successes are built on an independent and identically distributed (IID) assumption. This somewhat hinders the application of deep learning to the more challenging out-of-distribution (OOD) scenarios. Although many OOD methods have been proposed to address this problem and have obtained good performance on testing data that has major shifts from the training distribution, interestingly, we experimentally find that these methods achieve excellent OOD performance by making a great sacrifice of the IID performance. We call this finding the IID-OOD dilemma. Clearly, in real-world applications, distribution shifts between training and testing data are often uncertain, where shifts could be minor, and even close to the IID scenario, and thus it is truly important to design a deep model with the balanced generalization ability between IID and OOD. To this end, in this paper, we investigate an intriguing problem of balancing IID and OOD generalizations and propose a novel Model Agnostic adaPters (MAP) method, which is more reliable and effective for distribution-shift-agnostic real-world data. Our key technical contribution is to use auxiliary adapter layers to incorporate the inductive bias of IID into OOD methods. To achieve this goal, we apply a bilevel optimization to explicitly model and optimize the coupling relationship between the OOD model and auxiliary adapter layers. We also theoretically give a first-order approximation to save computational time. Experimental results on six datasets successfully demonstrate that MAP can greatly improve the performance of IID while achieving good OOD performance. + + + + Exploring Group Video Captioning with Efficient Relational Approximation + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Exploring_Group_Video_Captioning_with_Efficient_Relational_Approximation_ICCV_2023_paper.pdf + Current video captioning efforts mostly focus on describing a single video, while the need for captioning videos in groups has increased considerably.
In this study, we propose a new task, group video captioning, which aims to infer the desired content among a group of target videos and describe it with another group of related reference videos. This task requires the model to effectively summarize the target videos and accurately describe the distinguishing content compared to the reference videos, and it becomes more difficult as the video length increases. To solve this problem, 1) First, we propose an efficient relational approximation (ERA) to identify the shared content among videos while the complexity is linearly related to the number of videos. 2) Then, we introduce a contextual feature refinery with intra-group self-supervision to capture the contextual information and further refine the common properties. 3) In addition, we construct two group video captioning datasets derived from the YouCook2 and the ActivityNet Captions. The experimental results demonstrate the effectiveness of our method on this new task. + + + + ADAPT: Efficient Multi-Agent Trajectory Prediction with Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Aydemir_ADAPT_Efficient_Multi-Agent_Trajectory_Prediction_with_Adaptation_ICCV_2023_paper.pdf + Forecasting future trajectories of agents in complex traffic scenes requires reliable and efficient predictions for all agents in the scene. However, existing methods for trajectory prediction are either inefficient or sacrifice accuracy. To address this challenge, we propose ADAPT, a novel approach for jointly predicting the trajectories of all agents in the scene with dynamic weight learning. Our approach outperforms state-of-the-art methods in both single-agent and multi-agent settings on the Argoverse and Interaction datasets, with a fraction of their computational overhead. We attribute the improvement in our performance: first, to the adaptive head augmenting the model capacity without increasing the model size; second, to our design choices in the endpoint-conditioned prediction, reinforced by gradient stopping. Our analyses show that ADAPT can focus on each agent with adaptive prediction, allowing for accurate predictions efficiently. https://KUIS-AI.github.io/adapt + + + + MAPConNet: Self-supervised 3D Pose Transfer with Mesh and Point Contrastive Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_MAPConNet_Self-supervised_3D_Pose_Transfer_with_Mesh_and_Point_Contrastive_ICCV_2023_paper.pdf + 3D pose transfer is a challenging generation task that aims to transfer the pose of a source geometry onto a target geometry with the target identity preserved. Many prior methods require keypoint annotations to find correspondence between the source and target. Current pose transfer methods allow end-to-end correspondence learning but require the desired final output as ground truth for supervision. Unsupervised methods have been proposed for graph convolutional models but they require ground truth correspondence between the source and target inputs. We present a novel self-supervised framework for 3D pose transfer which can be trained in unsupervised, semi-supervised, or fully supervised settings without any correspondence labels. We introduce two contrastive learning constraints in the latent space: a mesh-level loss for disentangling global patterns including pose and identity, and a point-level loss for discriminating local semantics. 
We demonstrate quantitatively and qualitatively that our method achieves state-of-the-art results in supervised 3D pose transfer, with comparable results in unsupervised and semi-supervised settings. Our method is also generalisable to unseen human and animal data with complex topologies. + + + + DARTH: Holistic Test-time Adaptation for Multiple Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Segu_DARTH_Holistic_Test-time_Adaptation_for_Multiple_Object_Tracking_ICCV_2023_paper.pdf + Multiple object tracking (MOT) is a fundamental component of perception systems for autonomous driving, and its robustness to unseen conditions is a requirement to avoid life-critical failures. Despite the urgency of safety in driving systems, no solution to the MOT adaptation problem to domain shift in test-time conditions has ever been proposed. However, the nature of a MOT system is manifold - requiring object detection and instance association - and adapting all its components is non-trivial. In this paper, we analyze the effect of domain shift on appearance-based trackers, and introduce DARTH, a holistic test-time adaptation framework for MOT. We propose a detection consistency formulation to adapt object detection in a self-supervised fashion, while adapting the instance appearance representations via our novel patch contrastive loss. We evaluate our method on a variety of domain shifts - including sim-to-real, outdoor-to-indoor, indoor-to-outdoor - and substantially improve the source model performance on all metrics. Project page: https://www.vis.xyz/pub/darth. + + + + Multi-interactive Feature Learning and a Full-time Multi-modality Benchmark for Image Fusion and Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Multi-interactive_Feature_Learning_and_a_Full-time_Multi-modality_Benchmark_for_Image_ICCV_2023_paper.pdf + Multi-modality image fusion and segmentation play a vital role in autonomous driving and robotic operation. Early efforts focus on boosting the performance for only one task, e.g., fusion or segmentation, making it hard to reach the 'Best of Both Worlds'. To overcome this issue, in this paper, we propose a Multi-interactive Feature learning architecture for image fusion and segmentation, namely SegMiF, and exploit dual-task correlation to promote the performance of both tasks. The SegMiF is of a cascade structure, containing a fusion sub-network and a commonly used segmentation sub-network. By slickly bridging intermediate features between two components, the knowledge learned from the segmentation task can effectively assist the fusion task. Also, the benefited fusion network in turn helps the segmentation network perform better. Besides, a hierarchical interactive attention block is established to ensure fine-grained mapping of all the vital information between two tasks, so that the modality/semantic features can be fully mutual-interactive. In addition, a dynamic weight factor is introduced to automatically adjust the corresponding weights of each task, which can balance the interactive feature correspondence and break through the limitation of laborious tuning. Furthermore, we construct a smart multi-wave binocular imaging system and collect a full-time multi-modality benchmark with 15 annotated pixel-level categories for image fusion and segmentation.
Extensive experiments on several public datasets and our benchmark demonstrate that the proposed method outputs visually appealing fused images and achieves on average 7.66% higher segmentation mIoU in real-world scenes than the state-of-the-art approaches. The source code and benchmark are available at https://github.com/JinyuanLiu-CV/SegMiF. + + + + BaRe-ESA: A Riemannian Framework for Unregistered Human Body Shapes + http://openaccess.thecvf.com//content/ICCV2023/papers/Hartman_BaRe-ESA_A_Riemannian_Framework_for_Unregistered_Human_Body_Shapes_ICCV_2023_paper.pdf + We present Basis Restricted Elastic Shape Analysis (BaRe-ESA), a novel Riemannian framework for human body scan representation, interpolation and extrapolation. BaRe-ESA operates directly on unregistered meshes, i.e., without the need to establish prior point-to-point correspondences or to assume a consistent mesh structure. Our method relies on a latent space representation, which is equipped with a Riemannian (non-Euclidean) metric associated to an invariant higher-order metric on the space of surfaces. Experimental results on the FAUST and DFAUST datasets show that BaRe-ESA brings significant improvements with respect to previous solutions in terms of shape registration, interpolation and extrapolation. The efficiency and strength of our model are further demonstrated in applications such as motion transfer and random generation of body shape and pose. + + + + Skip-Plan: Procedure Planning in Instructional Videos via Condensed Action Space Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Skip-Plan_Procedure_Planning_in_Instructional_Videos_via_Condensed_Action_Space_ICCV_2023_paper.pdf + In this paper, we propose Skip-Plan, a condensed action space learning method for procedure planning in instructional videos. Current procedure planning methods all stick to the state-action pair prediction at every timestep and generate actions adjacently. Although it coincides with human intuition, such a methodology consistently struggles with high-dimensional state supervision and error accumulation on action sequences. In this work, we abstract the procedure planning problem as a mathematical chain model. By skipping uncertain nodes and edges in action chains, we transfer long and complex sequence functions into short but reliable ones in two ways. First, we skip all the intermediate state supervision and only focus on action predictions. Second, we decompose relatively long chains into multiple short sub-chains by skipping unreliable intermediate actions. By this means, our model explores all sorts of reliable sub-relations within an action sequence in the condensed action space. Extensive experiments show Skip-Plan achieves state-of-the-art performance on the CrossTask and COIN benchmarks for procedure planning. + + + + Sparse Instance Conditioned Multimodal Trajectory Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Sparse_Instance_Conditioned_Multimodal_Trajectory_Prediction_ICCV_2023_paper.pdf + Pedestrian trajectory prediction is critical in many vision tasks but challenging due to the multimodality of the future trajectory. Most existing methods predict multimodal trajectories conditioned by goals (future endpoints) or instances (all future points). However, goal-conditioned methods ignore the intermediate process and instance-conditioned methods ignore the stochasticity of pedestrian motions.
In this paper, we propose a simple yet effective Sparse Instance Conditioned Network (SICNet), which gives a balanced solution between goal-conditioned and instance-conditioned methods. Specifically, SICNet learns comprehensive sparse instances, i.e., representative points of the future trajectory, through a mask generated by a long short-term memory encoder and uses the memory mechanism to store and retrieve such sparse instances. Hence SICNet can decode the observed trajectory into the future prediction conditioned on the stored sparse instance. Moreover, we design a memory refinement module that refines the retrieved sparse instances from the memory to reduce memory recall errors. Extensive experiments on ETH-UCY and SDD datasets show that our method outperforms existing state-of-the-art methods. In addition, ablation studies demonstrate the superiority of our method compared with goal-conditioned and instance-conditioned approaches. + + + + NAPA-VQ: Neighborhood-Aware Prototype Augmentation with Vector Quantization for Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Malepathirana_NAPA-VQ_Neighborhood-Aware_Prototype_Augmentation_with_Vector_Quantization_for_Continual_Learning_ICCV_2023_paper.pdf + Catastrophic forgetting; the loss of old knowledge upon acquiring new knowledge, is a pitfall faced by deep neural networks in real-world applications. Many prevailing solutions to this problem rely on storing exemplars (previously encountered data), which may not be feasible in applications with memory limitations or privacy constraints. Therefore, the recent focus has been on Non-Exemplar based Class Incremental Learning (NECIL) where a model incrementally learns about new classes without using any past exemplars. However, due to the lack of old data, NECIL methods struggle to discriminate between old and new classes causing their feature representations to overlap. We propose NAPA-VQ: Neighborhood Aware Prototype Augmentation with Vector Quantization, a framework that reduces this class overlap in NECIL. We draw inspiration from Neural Gas to learn the topological relationships in the feature space, identifying the neighboring classes that are most likely to get confused with each other. This neighborhood information is utilized to enforce strong separation between the neighboring classes as well as to generate old class representative prototypes that can better aid in obtaining a discriminative decision boundary between old and new classes. Our comprehensive experiments on CIFAR-100, TinyImageNet, and ImageNet-Subset demonstrate that NAPA-VQ outperforms the State-of-the-art NECIL methods by an average improvement of 5%, 2%, and 4% in accuracy and 10%, 3%, and 9% in forgetting respectively. Our code can be found in https://github.com/TamashaM/NAPA-VQ.git. + + + + Unsupervised Open-Vocabulary Object Localization in Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_Unsupervised_Open-Vocabulary_Object_Localization_in_Videos_ICCV_2023_paper.pdf + In this paper, we show that recent advances in video representation learning and pre-trained vision-language models allow for substantial improvements in self-supervised video object localization. We propose a method that first localizes objects in videos via a slot attention approach and then assigns text to the obtained slots. The latter is achieved by an unsupervised way to read localized semantic information from the pre-trained CLIP model. 
The resulting video object localization is entirely unsupervised apart from the implicit annotation contained in CLIP, and it is effectively the first unsupervised approach that yields good results on regular video benchmarks. + + + + Unsupervised Video Deraining with An Event Camera + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Unsupervised_Video_Deraining_with_An_Event_Camera_ICCV_2023_paper.pdf + Current unsupervised video deraining methods are inefficient in modeling the intricate spatio-temporal properties of rain, which leads to unsatisfactory results. In this paper, we propose a novel approach by integrating a bio-inspired event camera into the unsupervised video deraining pipeline, which enables us to capture high temporal resolution information and model complex rain characteristics. Specifically, we first design an end-to-end learning-based network consisting of two modules, the asymmetric separation module and the cross-modal fusion module. The two modules are responsible for segregating the features of the rain-background layer, and for positive enhancement and negative suppression from a cross-modal perspective, respectively. Second, to regularize the network training, we elaborately design a cross-modal contrastive learning method that leverages the complementary information from event cameras, exploring the mutual exclusion and similarity of rain-background layers in different domains. This encourages the deraining network to focus on the distinctive characteristics of each layer and learn a more discriminative representation. Moreover, we construct the first real-world dataset comprising rainy videos and events using a hybrid imaging system. Extensive experiments demonstrate the superior performance of our method on both synthetic and real-world datasets. + + + + DIME-FM : DIstilling Multimodal and Efficient Foundation Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_DIME-FM__DIstilling_Multimodal_and_Efficient_Foundation_Models_ICCV_2023_paper.pdf + Large Vision-Language Foundation Models (VLFM), such as CLIP, ALIGN and Florence, are trained on large private datasets of image-caption pairs and achieve superior transferability and robustness on downstream tasks, but they are difficult to use in many practical applications due to their large size, high latency and fixed architectures. Unfortunately, recent works show training a small custom VLFM for resource-limited applications is currently very difficult using public and smaller-scale data. In this paper, we introduce a new distillation mechanism (DIME-FM) that allows us to transfer the knowledge contained in large VLFMs to smaller, customized foundation models using a relatively small amount of inexpensive, unpaired images and sentences. We transfer the knowledge from the pre-trained CLIP-ViT-L/14 model to a ViT-B/32 model, with only 40M public images and 28.4M unpaired public sentences. The resulting model "Distill-ViT-B/32" rivals the CLIP-ViT-B/32 model pre-trained on its private WiT dataset (400M image-text pairs): Distill-ViT-B/32 achieves similar results in terms of zero-shot and linear-probing performance on both ImageNet and the ELEVATER (20 image classification tasks) benchmarks. It also displays comparable robustness when evaluated on five datasets with natural distribution shifts from ImageNet. 
+ + + + Boosting Single Image Super-Resolution via Partial Channel Shifting + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Boosting_Single_Image_Super-Resolution_via_Partial_Channel_Shifting_ICCV_2023_paper.pdf + Although deep learning has significantly facilitated the progress of single image super-resolution (SISR) in recent years, it still hits bottlenecks to further improve SR performance with the continuous growth of model scale. Therefore, one of the hotspots in the field is to construct efficient SISR models by elevating the effectiveness of feature representation. In this work, we present a straightforward and generic approach for feature enhancement that can effectively promote the performance of SR models, dubbed partial channel shifting (PCS). Specifically, it is inspired by the temporal shifting in video understanding and displaces part of the channels along the spatial dimensions, thus allowing the effective receptive field to be amplified and the feature diversity to be augmented at almost zero cost. Also, it can be assembled into off-the-shelf models as a plug-and-play component for performance boosting without extra network parameters and computational overhead. However, regulating the features with PCS encounters some issues, such as the shifting directions and amplitudes, and the proportions and patterns of shifted channels. We impose some technical constraints on these issues to simplify the general channel shifting. Extensive and thorough experiments illustrate that the PCS indeed enlarges the effective receptive field, augments the feature diversity for efficiently enhancing SR recovery, and can bring obvious performance gains to existing models. + + + + Distracting Downpour: Adversarial Weather Attacks for Motion Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Schmalfuss_Distracting_Downpour_Adversarial_Weather_Attacks_for_Motion_Estimation_ICCV_2023_paper.pdf + Current adversarial attacks on motion estimation, or optical flow, optimize small per-pixel perturbations, which are unlikely to appear in the real world. In contrast, adverse weather conditions constitute a much more realistic threat scenario. Hence, in this work, we present a novel attack on motion estimation that exploits adversarially optimized particles to mimic weather effects like snowflakes, rain streaks or fog clouds. At the core of our attack framework is a differentiable particle rendering system that integrates particles (i) consistently over multiple time steps (ii) into the 3D space (iii) with a photo-realistic appearance. Through optimization, we obtain adversarial weather that significantly impacts the motion estimation. Surprisingly, methods that previously showed good robustness towards small per-pixel perturbations are particularly vulnerable to adversarial weather. At the same time, augmenting the training with non-optimized weather increases a method's robustness towards weather effects and improves generalizability at almost no additional cost. Our code is available at https://github.com/cv-stuttgart/DistractingDownpour.
+ + + + Unfolding Framework with Prior of Convolution-Transformer Mixture and Uncertainty Estimation for Video Snapshot Compressive Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Unfolding_Framework_with_Prior_of_Convolution-Transformer_Mixture_and_Uncertainty_Estimation_ICCV_2023_paper.pdf + We consider the problem of video snapshot compressive imaging (SCI), where sequential high-speed frames are modulated by different masks and captured by a single measurement. The underlying principle of reconstructing multi-frame images from only one single measurement is to solve an ill-posed problem. By combining optimization algorithms and neural networks, deep unfolding networks (DUNs) have achieved tremendous success in solving inverse problems. In this paper, our proposed model is under the DUN framework and we propose a 3D Convolution-Transformer Mixture (CTM) module with an efficient and scalable 3D attention model plugged in, which helps fully learn the correlation between temporal and spatial dimensions by virtue of Transformer. To the best of our knowledge, this is the first time that Transformer is employed for video SCI reconstruction. Besides, to further investigate the high-frequency information during the reconstruction process, which is neglected in previous studies, we introduce variance estimation characterizing the uncertainty on a pixel-by-pixel basis. Extensive experimental results demonstrate that our proposed method achieves state-of-the-art (SOTA) results. Code can be found at https://github.com/zsm1211/CTM-SCI. + + + + Non-Semantics Suppressed Mask Learning for Unsupervised Video Semantic Compression + http://openaccess.thecvf.com//content/ICCV2023/papers/Tian_Non-Semantics_Suppressed_Mask_Learning_for_Unsupervised_Video_Semantic_Compression_ICCV_2023_paper.pdf + Most video compression methods aim to improve the visual quality of the decoded video, instead of particularly guaranteeing semantic completeness, which deteriorates downstream video analysis tasks, e.g., action recognition. In this paper, we focus on a novel unsupervised video semantic compression problem, where video semantics is compressed in a downstream task-agnostic manner. To tackle this problem, we first propose a Semantic-Mining-then-Compensation (SMC) framework to enhance the plain video codec with powerful semantic coding capability. Then, we optimize the framework with only unlabeled video data, by masking out a proportion of the compressed video and reconstructing the masked regions of the original video, which is inspired by recent masked image modeling (MIM) methods. Although the MIM scheme learns generalizable semantic features, its inner generative learning paradigm may also facilitate the coding framework memorizing non-semantic information with extra bit costs. To suppress this deficiency, we explicitly decrease the non-semantic information entropy of the decoded video features, by formulating it as a parametrized Gaussian Mixture Model conditioned on the mined video semantics. Comprehensive experimental results demonstrate that the proposed approach shows remarkable superiority over previous traditional, learnable and perceptual-quality-oriented video codecs, on three video analysis tasks and seven datasets.
+ + + + Inverse Compositional Learning for Weakly-supervised Relation Grounding + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Inverse_Compositional_Learning_for_Weakly-supervised_Relation_Grounding_ICCV_2023_paper.pdf + Video relation grounding (VRG) is a significant and challenging problem in the domains of cross-modal learning and video understanding. In this study, we introduce a novel approach called inverse compositional learning (ICL) for weakly-supervised video relation grounding. Our approach represents relations at both the holistic and partial levels, formulating VRG as a joint optimization problem that encompasses reasoning at both levels. For holistic-level reasoning, we propose an inverse attention mechanism and a compositional encoder to generate compositional relevance features. Additionally, we introduce an inverse loss to evaluate and learn the relevance between visual features and relation features. At the partial-level reasoning, we introduce a grounding by classification scheme. By leveraging the learned holistic-level features and partial-level features, we train the entire model in an end-to-end manner. We conduct evaluations on two challenging datasets and demonstrate the substantial superiority of our proposed method over state-of-the-art methods. Extensive ablation studies confirm the effectiveness of our approach. + + + + Navigating to Objects Specified by Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Krantz_Navigating_to_Objects_Specified_by_Images_ICCV_2023_paper.pdf + Images are a convenient way to specify which particular object instance an embodied agent should navigate to. Solving this task requires semantic visual reasoning and exploration of unknown environments. We present a system that can perform this task in both simulation and the real world. Our modular method solves sub-tasks of exploration, goal instance re-identification, goal localization, and local navigation. We re-identify the goal instance in egocentric vision using feature-matching and localize the goal instance by projecting matched features to a map. Each sub-task is solved using off-the-shelf components requiring zero fine-tuning. On the HM3D InstanceImageNav benchmark, this system outperforms a baseline end-to-end RL policy 7x and outperforms a state-of-the-art ImageNav model 2.3x (56% vs. 25% success). We deploy this system to a mobile robot platform and demonstrate effective performance in the real world, achieving an 88% success rate across a home and an office environment. + + + + LATR: 3D Lane Detection from Monocular Images with Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_LATR_3D_Lane_Detection_from_Monocular_Images_with_Transformer_ICCV_2023_paper.pdf + 3D lane detection from monocular images is a fundamental yet challenging task in autonomous driving. Recent advances primarily rely on structural 3D surrogates (e.g., bird's eye view) built from front-view image features and camera parameters. However, the depth ambiguity in monocular images inevitably causes misalignment between the constructed surrogate feature map and the original image, posing a great challenge for accurate lane detection. To address the above issue, we present a novel LATR model, an end-to-end 3D lane detector that uses 3D-aware front-view features without transformed view representation. 
Specifically, LATR detects 3D lanes via cross-attention based on query and key-value pairs, constructed using our lane-aware query generator and dynamic 3D ground positional embedding. On the one hand, each query is generated based on 2D lane-aware features and adopts a hybrid embedding to enhance the lane information. On the other hand, 3D space information is injected as positional embedding from an iteratively-updated 3D ground plane. LATR outperforms previous state-of-the-art methods on both synthetic Apollo and realistic OpenLane, ONCE-3DLanes datasets by large margins (e.g., 11.4 gain in terms of F1 score on OpenLane). Code will be released at https://github.com/JMoonr/LATR. + + + + Environment-Invariant Curriculum Relation Learning for Fine-Grained Scene Graph Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Min_Environment-Invariant_Curriculum_Relation_Learning_for_Fine-Grained_Scene_Graph_Generation_ICCV_2023_paper.pdf + The scene graph generation (SGG) task is designed to identify the predicates based on the subject-object pairs. However, existing datasets generally include two imbalance cases: one is the class imbalance from the predicted predicates and another is the context imbalance from the given subject-object pairs, which presents significant challenges for SGG. Most existing methods focus on the imbalance of the predicted predicate while ignoring the imbalance of the subject-object pairs, which could not achieve satisfactory results. To address the two imbalance cases, we propose a novel Environment Invariant Curriculum Relation learning (EICR) method, which can be applied in a plug-and-play fashion to existing SGG methods. Concretely, to remove the imbalance of the subject-object pairs, we first construct different distribution environments for the subject-object pairs and learn a model invariant to the environment changes. Then, we construct a class-balanced curriculum learning strategy to balance the different environments to remove the predicate imbalance. Comprehensive experiments conducted on VG and GQA datasets demonstrate that our EICR framework can be taken as a general strategy for various SGG models, and achieve significant improvements. + + + + Generalizable Decision Boundaries: Dualistic Meta-Learning for Open Set Domain Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Generalizable_Decision_Boundaries_Dualistic_Meta-Learning_for_Open_Set_Domain_Generalization_ICCV_2023_paper.pdf + Domain generalization (DG) is proposed to deal with the issue of domain shift, which occurs when statistical differences exist between source and target domains. However, most current methods do not account for a common realistic scenario where the source and target domains have different classes. To overcome this deficiency, open set domain generalization (OSDG) then emerges as a more practical setting to recognize unseen classes in unseen domains. An intuitive approach is to use multiple one-vs-all classifiers to define decision boundaries for each class and reject the outliers as unknown. However, the significant class imbalance between positive and negative samples often causes the boundaries biased towards positive ones, resulting in misclassification for known samples in the unseen target domain. 
In this paper, we propose a novel meta-learning-based framework called dualistic MEta-learning with joint DomaIn-Class matching (MEDIC), which considers gradient matching towards inter-domain and inter-class splits simultaneously to find a generalizable boundary balanced for all tasks. Experimental results demonstrate that MEDIC not only outperforms previous methods in open set scenarios, but also maintains competitive close set generalization ability at the same time. Our code is available at https://github.com/zzwdx/MEDIC. + + + + SimFIR: A Simple Framework for Fisheye Image Rectification with Self-supervised Representation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_SimFIR_A_Simple_Framework_for_Fisheye_Image_Rectification_with_Self-supervised_ICCV_2023_paper.pdf + In fisheye images, rich distinct distortion patterns are regularly distributed in the image plane. These distortion patterns are independent of the visual content and provide informative cues for rectification. To make the best of such rectification cues, we introduce SimFIR, a simple framework for fisheye image rectification based on self-supervised representation learning. Technically, we first split a fisheye image into multiple patches and extract their representations with a Vision Transformer (ViT). To learn fine-grained distortion representations, we then associate different image patches with their specific distortion patterns based on the fisheye model, and further subtly design an innovative unified distortion-aware pretext task for their learning. The transfer performance on the downstream rectification task is remarkably boosted, which verifies the effectiveness of the learned representations. Extensive experiments are conducted, and the quantitative and qualitative results demonstrate the superiority of our method over the state-of-the-art algorithms as well as its strong generalization ability on real-world fisheye images. + + + + Generalized Lightness Adaptation with Channel Selective Normalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Yao_Generalized_Lightness_Adaptation_with_Channel_Selective_Normalization_ICCV_2023_paper.pdf + Lightness adaptation is vital to the success of image processing to avoid unexpected visual deterioration, which covers multiple aspects, e.g., low-light image enhancement, image retouching, and inverse tone mapping. Existing methods typically work well on their trained lightness conditions but perform poorly in unknown ones due to their limited generalization ability. To address this limitation, we propose a novel generalized lightness adaptation algorithm that extends conventional normalization techniques through a channel filtering design, dubbed Channel Selective Normalization (CSNorm). The proposed CSNorm purposely normalizes the statistics of lightness-relevant channels and keeps other channels unchanged, so as to improve feature generalization and discrimination. To optimize CSNorm, we propose an alternating training strategy that effectively identifies lightness-relevant channels. The model equipped with our CSNorm only needs to be trained on one lightness condition and can be well generalized to unknown lightness conditions. Experimental results on multiple benchmark datasets demonstrate the effectiveness of CSNorm in enhancing the generalization ability for the existing lightness adaptation methods. Code is available at https://github.com/mdyao/CSNorm. 
+ + + + Omnidirectional Information Gathering for Knowledge Transfer-Based Audio-Visual Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Omnidirectional_Information_Gathering_for_Knowledge_Transfer-Based_Audio-Visual_Navigation_ICCV_2023_paper.pdf + Audio-visual navigation is an audio-targeted wayfinding task where a robot agent is required to travel through a never-before-seen 3D environment towards the sounding source. In this article, we present ORAN, an omnidirectional audio-visual navigator based on cross-task navigation skill transfer. In particular, ORAN sharpens its two basic abilities for such challenging tasks, namely wayfinding and audio-visual information gathering. First, ORAN is trained with a confidence-aware cross-task policy distillation (CCPD) strategy. CCPD transfers the fundamental, point-to-point wayfinding skill that is well-trained on the large-scale PointGoal task to ORAN, to help ORAN better master audio-visual navigation with far fewer training samples. To improve the efficiency of knowledge transfer and address the domain gap, CCPD is made to be adaptive to the decision confidence of the teacher policy. Second, ORAN is equipped with an omnidirectional information gathering (OIG) mechanism, i.e., gleaning visual-acoustic observations from different directions before decision-making. As a result, ORAN yields more robust navigation behaviour. Taking CCPD and OIG together, ORAN significantly outperforms previous competitors. After model ensembling, we placed 1st in the Soundspaces Challenge 2022, improving SPL and SR by 53% and 35% in relative terms. Our code will be released. + + + + Multi-Scale Bidirectional Recurrent Network with Hybrid Correlation for Point Cloud Based Scene Flow Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_Multi-Scale_Bidirectional_Recurrent_Network_with_Hybrid_Correlation_for_Point_Cloud_ICCV_2023_paper.pdf + Scene flow estimation provides the fundamental motion perception of a dynamic scene, which is of practical importance in many computer vision applications. In this paper, we propose a novel multi-scale bidirectional recurrent architecture that iteratively optimizes the coarse-to-fine scene flow estimation. In each resolution scale of estimation, a novel bidirectional gated recurrent unit is proposed to bidirectionally and iteratively augment point features and produce progressively optimized scene flow. The optimization of each iteration is integrated with the hybrid correlation that captures not only local correlation but also semantic correlation for more accurate estimation. Experimental results indicate that our proposed architecture significantly outperforms the existing state-of-the-art approaches on both FlyingThings3D and KITTI benchmarks while maintaining superior time efficiency. Codes and pre-trained models are publicly available at https://github.com/cwc1260/MSBRN. + + + + VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Qiao_VLN-PETL_Parameter-Efficient_Transfer_Learning_for_Vision-and-Language_Navigation_ICCV_2023_paper.pdf + The performance of the Vision-and-Language Navigation (VLN) tasks has witnessed rapid progress recently thanks to the use of large pre-trained vision-and-language models. However, fully fine-tuning the pre-trained model for every downstream VLN task is becoming costly due to the considerable model size.
The recent research hotspot of Parameter-Efficient Transfer Learning (PETL) shows great potential in efficiently tuning large pre-trained models for common CV and NLP tasks, exploiting most of the representation knowledge implied in the pre-trained model while tuning only a minimal set of parameters. However, simply utilizing existing PETL methods for the more challenging VLN tasks may bring non-trivial degeneration to the performance. Therefore, we present the first study to explore PETL methods for VLN tasks and propose a VLN-specific PETL method named VLN-PETL. Specifically, we design two PETL modules: Historical Interaction Booster (HIB) and Cross-modal Interaction Booster (CIB). Then we combine these two modules with several existing PETL methods as the integrated VLN-PETL. Extensive experimental results on four mainstream VLN tasks (R2R, REVERIE, NDH, RxR) demonstrate the effectiveness of our proposed VLN-PETL, where VLN-PETL achieves performance comparable to or even better than full fine-tuning and outperforms other PETL methods with promising margins. The source code is available at https://github.com/YanyuanQiao/VLN-PETL + + + + Learning Continuous Exposure Value Representations for Single-Image HDR Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Learning_Continuous_Exposure_Value_Representations_for_Single-Image_HDR_Reconstruction_ICCV_2023_paper.pdf + Deep learning is commonly used to produce impressive results in reconstructing HDR images from LDR images. LDR stack-based methods are used for single-image HDR reconstruction, generating an HDR image from a deep learning generated LDR stack. However, current methods generate the LDR stack with predetermined exposure values (EVs), which may limit the quality of HDR reconstruction. To address this, we propose the continuous exposure value representation (CEVR) model, which uses an implicit function to generate LDR images with arbitrary EVs, including those unseen during training. Our flexible approach generates a continuous stack with more images containing diverse EVs, significantly improving HDR reconstruction. We use a cycle training strategy to supervise the model in generating continuous EV LDR images without corresponding ground truths. Our CEVR model outperforms existing methods, as demonstrated by experimental results. + + + + MixSynthFormer: A Transformer Encoder-like Structure with Mixed Synthetic Self-attention for Efficient Human Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_MixSynthFormer_A_Transformer_Encoder-like_Structure_with_Mixed_Synthetic_Self-attention_for_ICCV_2023_paper.pdf + Human pose estimation in videos has wide-ranging practical applications across various fields, many of which require fast inference on resource-scarce devices, necessitating the development of efficient and accurate algorithms. Previous works have demonstrated the feasibility of exploiting motion continuity to conduct pose estimation using sparsely sampled frames with transformer-based models. However, these methods only consider the temporal relation while neglecting spatial attention, and the complexity of dot product self-attention calculations in transformers is quadratically proportional to the embedding size. To address these limitations, we propose MixSynthFormer, a transformer encoder-like model with MLP-based mixed synthetic attention.
By mixing synthesized spatial and temporal attentions, our model incorporates inter-joint and inter-frame importance and can accurately estimate human poses in an entire video sequence from sparsely sampled frames. Additionally, the flexible design of our model makes it versatile for other motion synthesis tasks. Our extensive experiments on 2D/3D pose estimation, body mesh recovery, and motion prediction validate the effectiveness and efficiency of MixSynthFormer. + + + + HumanMAC: Masked Motion Completion for Human Motion Prediction http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_HumanMAC_Masked_Motion_Completion_for_Human_Motion_Prediction_ICCV_2023_paper.pdf Human motion prediction is a classical problem in computer vision and computer graphics, which has a wide range of practical applications. Previous efforts achieve great empirical performance based on an encoding-decoding style. The methods of this style work by first encoding previous motions to latent representations and then decoding the latent representations into predicted motions. However, in practice, they are still unsatisfactory due to several issues, including complicated loss constraints, cumbersome training processes, and the scarcity of switches between different categories of motions in prediction. In this paper, to address the above issues, we jump out of the foregoing style and propose a novel framework from a new perspective. Specifically, our framework works in a masked completion fashion. In the training stage, we learn a motion diffusion model that generates motions from random noise. In the inference stage, with a denoising procedure, we make motion prediction conditioning on observed motions to output more continuous and controllable predictions. The proposed framework enjoys promising algorithmic properties: it needs only one loss for optimization and is trained in an end-to-end manner. Additionally, it accomplishes the switch of different categories of motions effectively, which is significant in realistic tasks, e.g., the animation task. Comprehensive experiments on benchmarks confirm the superiority of the proposed framework. The project page is available at https://lhchen.top/Human-MAC. + + + + Prompt Switch: Efficient CLIP Adaptation for Text-Video Retrieval http://openaccess.thecvf.com//content/ICCV2023/papers/Deng_Prompt_Switch_Efficient_CLIP_Adaptation_for_Text-Video_Retrieval_ICCV_2023_paper.pdf In text-video retrieval, recent works have benefited from the powerful learning capabilities of pre-trained text-image foundation models (e.g., CLIP) by adapting them to the video domain. A critical problem for them is how to effectively capture the rich semantics inside the video using the image encoder of CLIP. To tackle this, state-of-the-art methods adopt complex cross-modal modeling techniques to fuse the text information into video frame representations, which, however, incurs severe efficiency issues in large-scale retrieval systems as the video representations must be recomputed online for every text query. In this paper, we discard this problematic cross-modal fusion process and aim to learn semantically-enhanced representations purely from the video, so that the video representations can be computed offline and reused for different texts. Concretely, we first introduce a spatial-temporal "Prompt Cube" into the CLIP image encoder and iteratively switch it within the encoder layers to efficiently incorporate the global video semantics into frame representations.
We then propose to apply an auxiliary video captioning objective to train the frame representations, which facilitates the learning of detailed video semantics by providing fine-grained guidance in the semantic space. With a naive temporal fusion strategy (i.e., mean-pooling) on the enhanced frame representations, we obtain state-of-the-art performances on three benchmark datasets, i.e., MSR-VTT, MSVD, and LSMDC. + + + + Video Action Recognition with Attentive Semantic Units http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Video_Action_Recognition_with_Attentive_Semantic_Units_ICCV_2023_paper.pdf Visual-Language Models (VLMs) have significantly advanced video action recognition. Supervised by the semantics of action labels, recent works adapt the visual branch of VLMs to learn video representations. Despite the effectiveness proved by these works, we believe that the potential of VLMs has yet to be fully harnessed. In light of this, we exploit the semantic units (SU) hiding behind the action labels and leverage their correlations with fine-grained items in frames for more accurate action recognition. SUs are entities extracted from the language descriptions of the entire action set, including body parts, objects, scenes, and motions. To further enhance the alignments between visual contents and the SUs, we introduce a multi-region module (MRA) to the visual branch of the VLM. The MRA allows the perception of region-aware visual features beyond the original global feature. Our method adaptively attends to and selects relevant SUs with visual features of frames. With a cross-modal decoder, the selected SUs serve to decode spatiotemporal video representations. In summary, the SUs as the medium can boost discriminative ability and transferability. Specifically, in fully-supervised learning, our method achieved 87.8% top-1 accuracy on Kinetics-400. In K=2 few-shot experiments, our method surpassed the previous state-of-the-art by +7.1% and +15.0% on HMDB-51 and UCF-101, respectively. + + + + Scanning Only Once: An End-to-end Framework for Fast Temporal Grounding in Long Videos http://openaccess.thecvf.com//content/ICCV2023/papers/Pan_Scanning_Only_Once_An_End-to-end_Framework_for_Fast_Temporal_Grounding_ICCV_2023_paper.pdf Video temporal grounding aims to pinpoint a video segment that matches the query description. Despite the recent advance in short-form videos (e.g., in minutes), temporal grounding in long videos (e.g., in hours) is still at its early stage. To address this challenge, a common practice is to employ a sliding window, yet it can be inefficient and inflexible due to the limited number of frames within the window. In this work, we propose an end-to-end framework for fast temporal grounding, which is able to model an hours-long video with one-time network execution. Our pipeline is formulated in a coarse-to-fine manner, where we first extract context knowledge from non-overlapped video clips (i.e., anchors), and then supplement the anchors that respond strongly to the query with detailed content knowledge. Besides the remarkably high pipeline efficiency, another advantage of our approach is the capability of capturing long-range temporal correlation, thanks to modeling the entire video as a whole, which hence facilitates more accurate grounding. Experimental results suggest that, on the long-form video datasets MAD and Ego4d, our method significantly outperforms state-of-the-arts, and achieves 14.6x / 102.8x higher efficiency respectively.
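As a generic illustration of the naive temporal fusion (mean-pooling) step mentioned in the Prompt Switch entry above, the following Python sketch mean-pools pre-computed frame embeddings into one video embedding and ranks videos by cosine similarity to a text query. It assumes CLIP-like frame and text encoders exist elsewhere; the tensor shapes and function names are illustrative assumptions, not the paper's implementation.

    # Illustrative only: mean-pooled video embeddings ranked by cosine similarity.
    # Frame/text features are assumed to come from a CLIP-like encoder (not shown).
    import torch
    import torch.nn.functional as F

    def video_embeddings(frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (num_videos, num_frames, dim) -> (num_videos, dim)
        pooled = frame_feats.mean(dim=1)          # naive temporal fusion: mean-pooling
        return F.normalize(pooled, dim=-1)        # unit norm for cosine similarity

    def rank_videos(text_feats: torch.Tensor, frame_feats: torch.Tensor) -> torch.Tensor:
        # text_feats: (num_queries, dim); returns video indices sorted by similarity
        text = F.normalize(text_feats, dim=-1)
        sims = text @ video_embeddings(frame_feats).T
        return sims.argsort(dim=-1, descending=True)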
+ + + + VoroMesh: Learning Watertight Surface Meshes with Voronoi Diagrams http://openaccess.thecvf.com//content/ICCV2023/papers/Maruani_VoroMesh_Learning_Watertight_Surface_Meshes_with_Voronoi_Diagrams_ICCV_2023_paper.pdf In stark contrast to the case of images, finding a concise, learnable discrete representation of 3D surfaces remains a challenge. In particular, while polygon meshes are arguably the most common surface representation used in geometry processing, their irregular and combinatorial structure often makes them unsuitable for learning-based applications. In this work, we present VoroMesh, a novel and differentiable representation of watertight 3D shape surfaces. From a set of 3D points (called generators) and their associated occupancy, we define our boundary representation through the Voronoi diagram of the generators as the subset of Voronoi faces whose two associated (equidistant) generators are of opposite occupancy: the resulting polygon mesh forms a watertight approximation of the target shape's boundary. To learn the position of the generators, we propose a novel loss function, dubbed VoroLoss, that minimizes the distance from ground truth surface samples to the closest faces of the Voronoi diagram, and which does not require an explicit construction of the entire Voronoi diagram. A direct optimization of the VoroLoss to obtain generators on the Thingi32 dataset demonstrates the geometric efficiency of our representation compared to axiomatic meshing algorithms and recent learning-based mesh representations. We further use VoroMesh in a learning-based mesh prediction task from input SDF grids on the ABC dataset, and show comparable performance to state-of-the-art methods while guaranteeing closed output surfaces free of self-intersections. + + + + What does CLIP know about a red circle? Visual prompt engineering for VLMs http://openaccess.thecvf.com//content/ICCV2023/papers/Shtedritski_What_does_CLIP_know_about_a_red_circle_Visual_prompt_ICCV_2023_paper.pdf Large-scale Vision-Language Models, such as CLIP, learn powerful image-text representations that have found numerous applications, from zero-shot classification to text-to-image generation. Despite that, their capabilities for solving novel discriminative tasks via prompting fall behind those of large language models, such as GPT-3. Here we explore the idea of visual prompt engineering for solving computer vision tasks beyond classification by editing in image space instead of text. In particular, we discover an emergent ability of CLIP, where, by simply drawing a red circle around an object, we can direct the model's attention to that region, while also maintaining global information. We show the power of this simple approach by achieving state-of-the-art results in zero-shot referring expression comprehension and strong performance in keypoint localization tasks. Finally, we draw attention to some potential ethical concerns of large language-vision models. + + + + LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_LoLep_Single-View_View_Synthesis_with_Locally-Learned_Planes_and_Self-Attention_Occlusion_ICCV_2023_paper.pdf We propose a novel method, LoLep, which regresses Locally-Learned planes from a single RGB image to represent scenes accurately, thus generating better novel views. Without the depth information, regressing appropriate plane locations is a challenging problem.
To solve this issue, we pre-partition the disparity space into bins and design a disparity sampler to regress local offsets for multiple planes in each bin. However, using only such a sampler prevents the network from converging; we further propose two optimization strategies tailored to the different disparity distributions of datasets, as well as an occlusion-aware reprojection loss as a simple yet effective geometric supervision technique. We also introduce a self-attention mechanism to improve occlusion inference and present a Block-Sampling Self-Attention (BS-SA) module to address the problem of applying self-attention to large feature maps. We demonstrate the effectiveness of our approach and generate state-of-the-art results on different datasets. Compared to MINE, our approach has an LPIPS reduction of 4.8%-9.0% and an RV reduction of 74.9%-83.5%. We also evaluate the performance on real-world images and demonstrate the benefits. We will release the source code at the time of publication. + + + + Exploring Positional Characteristics of Dual-Pixel Data for Camera Autofocus http://openaccess.thecvf.com//content/ICCV2023/papers/Choi_Exploring_Positional_Characteristics_of_Dual-Pixel_Data_for_Camera_Autofocus_ICCV_2023_paper.pdf In digital photography, autofocus is a key feature that aids high-quality image capture, and modern approaches use the phase patterns arising from dual-pixel sensors as important focus cues. However, dual-pixel data is prone to multiple error sources in its image capturing process, including lens shading or distortions due to the inherent optical characteristics of the lens. We observe that, while these degradations are hard to model using prior knowledge, they are correlated with the spatial position of the pixels within the image sensor area, and we propose a learning-based autofocus model with positional encodings (PE) to capture these patterns. Specifically, we introduce RoI-PE, which encodes the spatial position of our focusing region-of-interest (RoI) on the imaging plane. Learning with RoI-PE allows the model to be more robust to spatially-correlated degradations. In addition, we also propose to encode the current focal position of the lens as lens-PE, which allows us to significantly reduce the computational complexity of the autofocus model. Experimental results clearly demonstrate the effectiveness of using the proposed position encodings for automatic focusing based on dual-pixel data. + + + + Heterogeneous Forgetting Compensation for Class-Incremental Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Heterogeneous_Forgetting_Compensation_for_Class-Incremental_Learning_ICCV_2023_paper.pdf Class-incremental learning (CIL) has achieved remarkable successes in learning new classes consecutively while overcoming catastrophic forgetting on old categories. However, most existing CIL methods unreasonably assume that all old categories have the same forgetting pace, and neglect the negative influence of forgetting heterogeneity among different old classes on forgetting compensation. To surmount the above challenges, we develop a novel Heterogeneous Forgetting Compensation (HFC) model, which can resolve heterogeneous forgetting of easy-to-forget and hard-to-forget old categories from both representation and gradient aspects. Specifically, we design a task-semantic aggregation block to alleviate heterogeneous forgetting from the representation aspect.
It aggregates local category information within each task to learn task-shared global representations. Moreover, we develop two novel plug-and-play losses: a gradient-balanced forgetting compensation loss and a gradient-balanced relation distillation loss to alleviate forgetting from gradient aspect. They consider gradient-balanced compensation to rectify forgetting heterogeneity of old categories and heterogeneous relation consistency. Experiments on several representative datasets illustrate effectiveness of our HFC model. The code is available at https://github.com/JiahuaDong/HFC. + + + + FemtoDet: An Object Detection Baseline for Energy Versus Performance Tradeoffs + http://openaccess.thecvf.com//content/ICCV2023/papers/Tu_FemtoDet_An_Object_Detection_Baseline_for_Energy_Versus_Performance_Tradeoffs_ICCV_2023_paper.pdf + Efficient detectors for edge devices are often optimized for parameters or speed count metrics, which remain in weak correlation with the energy of detectors. However, some vision applications of convolutional neural networks, such as always-on surveillance cameras, are critical for energy constraints. This paper aims to serve as a baseline by designing detectors to reach tradeoffs between energy and performance from two perspectives: 1) We extensively analyze various CNNs to identify low-energy architectures, including selecting activation functions, convolutions operators, and feature fusion structures on necks. These underappreciated details in past work seriously affect the energy consumption of detectors; 2) To break through the dilemmatic energy-performance problem, we propose a balanced detector driven by energy using discovered low-energy components named FemtoDet. In addition to the novel construction, we improve FemtoDet by considering convolutions and training strategy optimizations. Specifically, we develop a new instance boundary enhancement (IBE) module for convolution optimization to overcome the contradiction between the limited capacity of CNNs and detection tasks in diverse spatial representations, and propose a recursive warm-restart (RecWR) for optimizing training strategy to escape the sub-optimization of light-weight detectors by considering the data shift produced in popular augmentations. As a result, FemtoDet with only 68.77k parameters achieves a competitive score of 46.3 AP50 on PASCAL VOC and 1.11 W & 64.47 FPS on Qualcomm Snapdragon 865 CPU platforms. Extensive experiments on COCO and TJU-DHD datasets indicate that the proposed method achieves competitive results in diverse scenes. + + + + Iterative Prompt Learning for Unsupervised Backlit Image Enhancement + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_Iterative_Prompt_Learning_for_Unsupervised_Backlit_Image_Enhancement_ICCV_2023_paper.pdf + We propose a novel unsupervised backlit image enhancement method, abbreviated as CLIP-LIT, by exploring the potential of Contrastive Language-Image Pre-Training (CLIP) for pixel-level image enhancement. We show that the open-world CLIP prior not only aids in distinguishing between backlit and well-lit images, but also in perceiving heterogeneous regions with different luminance, facilitating the optimization of the enhancement network. Unlike high-level and image manipulation tasks, directly applying CLIP to enhancement tasks is non-trivial, owing to the difficulty in finding accurate prompts. 
To solve this issue, we devise a prompt learning framework that first learns an initial prompt pair by constraining the text-image similarity between the prompt (negative/positive sample) and the corresponding image (backlit image/well-lit image) in the CLIP latent space. Then, we train the enhancement network based on the text-image similarity between the enhanced result and the initial prompt pair. To further improve the accuracy of the initial prompt pair, we iteratively fine-tune the prompt learning framework to reduce the distribution gaps between the backlit images, enhanced results, and well-lit images via rank learning, boosting the enhancement performance. Our method alternates between updating the prompt learning framework and enhancement network until visually pleasing results are achieved. Extensive experiments demonstrate that our method outperforms state-of-the-art methods in terms of visual quality and generalization ability, without requiring any paired data. + + + + UATVR: Uncertainty-Adaptive Text-Video Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Fang_UATVR_Uncertainty-Adaptive_Text-Video_Retrieval_ICCV_2023_paper.pdf + With the explosive growth of web videos and emerging large-scale vision-language pre-training models, e.g., CLIP, retrieving videos of interest with text instructions has attracted increasing attention. A common practice is to transfer text-video pairs to the same embedding space and craft cross-modal interactions with certain entities in specific granularities for semantic correspondence. Unfortunately, the intrinsic uncertainties of optimal entity combinations in appropriate granularities for cross-modal queries are understudied, which is especially critical for modalities with hierarchical semantics, e.g., video, text, etc. In this paper, we propose an Uncertainty-Adaptive Text-Video Retrieval approach, termed UATVR, which models each look-up as a distribution matching procedure. Concretely, we add additional learnable tokens in the encoders to adaptively aggregate multi-grained semantics for flexible high-level reasoning. In the refined embedding space, we represent text-video pairs as probabilistic distributions where prototypes are sampled for matching evaluation. Comprehensive experiments on four benchmarks justify the superiority of our UATVR, which achieves new state-of-the-art results on MSR-VTT (50.8%), VATEX (64.5%), MSVD (49.7%), and DiDeMo (45.8%). The code is available at https://github.com/bofang98/UATVR. + + + + SALAD: Part-Level Latent Diffusion for 3D Shape Generation and Manipulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Koo_SALAD_Part-Level_Latent_Diffusion_for_3D_Shape_Generation_and_Manipulation_ICCV_2023_paper.pdf + We present a cascaded diffusion model based on a part-level implicit 3D representation. Our model achieves state-of-the-art generation quality and also enables part-level shape editing and manipulation without any additional training in conditional setup. Diffusion models have demonstrated impressive capabilities in data generation as well as zero-shot completion and editing via a guided reverse process. Recent research on 3D diffusion models has focused on improving their generation capabilities with various data representations, while the absence of structural information has limited their capability in completion and editing tasks. We thus propose our novel diffusion model using a part-level implicit representation. 
To effectively learn diffusion with high-dimensional embedding vectors of parts, we propose a cascaded framework, learning diffusion first on a low-dimensional subspace encoding extrinsic parameters of parts and then on the other high-dimensional subspace encoding intrinsic attributes. In the experiments, we demonstrate that our method outperforms previous ones in both generation and part-level completion and manipulation tasks. + + + + COMPASS: High-Efficiency Deep Image Compression with Arbitrary-scale Spatial Scalability http://openaccess.thecvf.com//content/ICCV2023/papers/Park_COMPASS_High-Efficiency_Deep_Image_Compression_with_Arbitrary-scale_Spatial_Scalability_ICCV_2023_paper.pdf Recently, neural network (NN)-based image compression has been actively studied and has shown impressive performance in comparison to traditional methods. However, most of the works have focused on non-scalable image compression (single-layer coding) while spatially scalable image compression has drawn less attention although it has many applications. In this paper, we propose a novel NN-based spatially scalable image compression method, called COMPASS, which supports arbitrary-scale spatial scalability. Our proposed COMPASS has a very flexible structure where the number of layers and their respective scale factors can be arbitrarily determined during inference. To reduce the spatial redundancy between adjacent layers for arbitrary scale factors, our COMPASS adopts an inter-layer arbitrary scale prediction method, called LIFF, based on implicit neural representation. We propose a combined RD loss function to effectively train multiple layers. Experimental results show that our COMPASS achieves BD-rate gains of -58.33% and -47.17% at maximum compared to SHVC and the state-of-the-art NN-based spatially scalable image compression method, respectively, for various combinations of scale factors. Our COMPASS also shows comparable or even better coding efficiency than the single-layer coding for various scale factors. + + + + Score-Based Diffusion Models as Principled Priors for Inverse Imaging http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_Score-Based_Diffusion_Models_as_Principled_Priors_for_Inverse_Imaging_ICCV_2023_paper.pdf Priors are essential for reconstructing images from noisy and/or incomplete measurements. The choice of the prior determines both the quality and uncertainty of recovered images. We propose turning score-based diffusion models into principled image priors ("score-based priors") for analyzing a posterior of images given measurements. Previously, probabilistic priors were limited to handcrafted regularizers and simple distributions. In this work, we empirically validate the theoretically-proven probability function of a score-based diffusion model. We show how to sample from the resulting posteriors by using this probability function for variational inference. Our results, including experiments on denoising, deblurring, and interferometric imaging, suggest that score-based priors enable principled inference with a sophisticated, data-driven image prior.
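For readers unfamiliar with how a learned score function can act as an image prior (as in the score-based priors entry above), the sketch below shows a generic Langevin-style posterior sampler that combines a score network with a measurement-likelihood gradient. The paper itself performs variational inference with the model's probability function, so this is only a common simplified alternative; score_net and likelihood_score are assumed user-supplied callables.

    # Illustrative only: generic unadjusted Langevin sampling with a score-based prior.
    # score_net(x) approximates grad_x log p(x); likelihood_score(x) is grad_x log p(y|x).
    import torch

    def langevin_posterior_sample(score_net, likelihood_score, x_init,
                                  n_steps: int = 500, step_size: float = 1e-4):
        x = x_init.clone()
        for _ in range(n_steps):
            with torch.no_grad():
                grad = score_net(x) + likelihood_score(x)              # posterior score
                noise = torch.randn_like(x)
                x = x + step_size * grad + (2.0 * step_size) ** 0.5 * noise
        return x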
+ + + + Multiscale Structure Guided Diffusion for Image Deblurring + http://openaccess.thecvf.com//content/ICCV2023/papers/Ren_Multiscale_Structure_Guided_Diffusion_for_Image_Deblurring_ICCV_2023_paper.pdf + Diffusion Probabilistic Models (DPMs) have recently been employed for image deblurring, formulated as an image-conditioned generation process that maps Gaussian noise to the high-quality image, conditioned on the blurry input. Image-conditioned DPMs (icDPMs) have shown more realistic results than regression-based methods when trained on pairwise in-domain data. However, their robustness in restoring images is unclear when presented with out-of-domain images as they do not impose specific degradation models or intermediate constraints. To this end, we introduce a simple yet effective multiscale structure guidance as an implicit bias that informs the icDPM about the coarse structure of the sharp image at the intermediate layers. This guided formulation leads to a significant improvement of the deblurring results, particularly on unseen domain. The guidance is extracted from the latent space of a regression network trained to predict the clean-sharp target at multiple lower resolutions, thus maintaining the most salient sharp structures. With both the blurry input and multiscale guidance, the icDPM model can better understand the blur and recover the clean image. We evaluate a single-dataset trained model on diverse datasets and demonstrate more robust deblurring results with fewer artifacts on unseen data. Our method outperforms existing baselines, achieving state-of-the-art perceptual quality while keeping competitive distortion metrics. + + + + CheckerPose: Progressive Dense Keypoint Localization for Object Pose Estimation with Graph Neural Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Lian_CheckerPose_Progressive_Dense_Keypoint_Localization_for_Object_Pose_Estimation_with_ICCV_2023_paper.pdf + Estimating the 6-DoF pose of a rigid object from a single RGB image is a crucial yet challenging task. Recent studies have shown the great potential of dense correspondence-based solutions, yet improvements are still needed to reach practical deployment. In this paper, we propose a novel pose estimation algorithm named CheckerPose, which improves on three main aspects. Firstly, CheckerPose densely samples 3D keypoints from the surface of the 3D object and finds their 2D correspondences progressively in the 2D image. Compared to previous solutions that conduct dense sampling in the image space, our strategy enables the correspondence searching in a 2D grid (i.e., pixel coordinate). Secondly, for our 3D-to-2D correspondence, we design a compact binary code representation for 2D image locations. This representation not only allows for progressive correspondence refinement but also converts the correspondence regression to a more efficient classification problem. Thirdly, we adopt a graph neural network to explicitly model the interactions among the sampled 3D keypoints, further boosting the reliability and accuracy of the correspondences. Together, these novel components make CheckerPose a strong pose estimation algorithm. When evaluated on the popular Linemod, Linemod-O, and YCB-V object pose estimation benchmarks, CheckerPose clearly boosts the accuracy of correspondence-based methods and achieves state-of-the-art performances. Code is available at https://github.com/RuyiLian/CheckerPose. 
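To make the "compact binary code representation for 2D image locations" in the CheckerPose entry above concrete, here is a hypothetical coordinate codec in which each pair of bits halves the remaining search region, so predicting bits coarse-to-fine progressively refines a correspondence and turns regression into per-bit classification. The exact code layout used by the paper may differ.

    # Illustrative only: hierarchical binary code for a pixel location via bisection.
    def encode_xy(x: float, y: float, width: int, height: int, bits: int = 7):
        code, (x0, x1, y0, y1) = [], (0.0, float(width), 0.0, float(height))
        for _ in range(bits):
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            bx, by = int(x >= mx), int(y >= my)
            code += [bx, by]                       # one bit per axis per level
            x0, x1 = (mx, x1) if bx else (x0, mx)
            y0, y1 = (my, y1) if by else (y0, my)
        return code                                # 2 * bits binary digits

    def decode_xy(code, width: int, height: int):
        x0, x1, y0, y1 = 0.0, float(width), 0.0, float(height)
        for bx, by in zip(code[0::2], code[1::2]):
            mx, my = (x0 + x1) / 2, (y0 + y1) / 2
            x0, x1 = (mx, x1) if bx else (x0, mx)
            y0, y1 = (my, y1) if by else (y0, my)
        return (x0 + x1) / 2, (y0 + y1) / 2        # center of the final cell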
+ + + + Event Camera Data Pre-training + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Event_Camera_Data_Pre-training_ICCV_2023_paper.pdf + This paper proposes a pre-trained neural network for handling event camera data. Our model is a self-supervised learning framework, and uses paired event camera data and natural RGB images for training. Our method contains three modules connected in a sequence: i) a family of event data augmentations, generating meaningful event images for self-supervised training; ii) a conditional masking strategy to sample informative event patches from event images, encouraging our model to capture the spatial layout of a scene and accelerating training; iii) a contrastive learning approach, enforcing the similarity of embeddings between matching event images, and between paired event and RGB images. An embedding projection loss is proposed to avoid the model collapse when enforcing the event image embedding similarities. A probability distribution alignment loss is proposed to encourage the event image to be consistent with its paired RGB image in the feature space. Transfer learning performance on downstream tasks shows the superiority of our method over state-of-the-art methods. For example, we achieve top-1 accuracy at 64.83% on the N-ImageNet dataset. Our code is available at https://github.com/Yan98/Event-Camera-Data-Pre-training. + + + + One-shot Implicit Animatable Avatars with Model-based Priors + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_One-shot_Implicit_Animatable_Avatars_with_Model-based_Priors_ICCV_2023_paper.pdf + Existing neural rendering methods for creating human avatars typically either require dense input signals such as video or multi-view images, or leverage a learned prior from large-scale specific 3D human datasets such that reconstruction can be performed with sparse-view inputs. Most of these methods fail to achieve realistic reconstruction when only a single image is available. To enable the data-efficient creation of realistic animatable 3D humans, we propose ELICIT, a novel method for learning human-specific neural radiance fields from a single image. Inspired by the fact that humans can effortlessly estimate the body geometry and imagine full-body clothing from a single image, we leverage two priors in ELICIT: 3D geometry prior and visual semantic prior. Specifically, ELICIT utilizes the 3D body shape geometry prior from a skinned vertex-based template model (i.e., SMPL) and implements the visual clothing semantic prior with the CLIP-based pretrained models. Both priors are used to jointly guide the optimization for creating plausible content in the invisible areas. Taking advantage of the CLIP models, ELICIT can use text descriptions to generate text-conditioned unseen regions. In order to further improve visual details, we propose a segmentation-based sampling strategy that locally refines different parts of the avatar. Comprehensive evaluations on multiple popular benchmarks, including ZJU-MoCAP, Human3.6M, and DeepFashion, show that ELICIT has outperformed strong baseline methods of avatar creation when only a single image is available. The code is public for research purposes at https://huangyangyi.github.io/ELICIT/. 
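The contrastive objective described in the event-camera pre-training entry above (enforcing similarity between paired event and RGB embeddings) is commonly implemented with an InfoNCE loss; a minimal generic sketch follows. It is not the paper's full objective, which additionally uses an embedding projection loss and a probability distribution alignment loss.

    # Illustrative only: symmetric InfoNCE loss for paired event/RGB embeddings.
    import torch
    import torch.nn.functional as F

    def info_nce(z_event: torch.Tensor, z_rgb: torch.Tensor, temperature: float = 0.07):
        # z_event, z_rgb: (batch, dim); row i of each tensor is a positive pair
        z_event = F.normalize(z_event, dim=-1)
        z_rgb = F.normalize(z_rgb, dim=-1)
        logits = z_event @ z_rgb.T / temperature            # pairwise cosine similarities
        targets = torch.arange(z_event.size(0), device=z_event.device)
        return 0.5 * (F.cross_entropy(logits, targets) +    # event -> rgb
                      F.cross_entropy(logits.T, targets))   # rgb -> event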
+ + + + Unsupervised Feature Representation Learning for Domain-generalized Cross-domain Image Retrieval http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Unsupervised_Feature_Representation_Learning_for_Domain-generalized_Cross-domain_Image_Retrieval_ICCV_2023_paper.pdf Cross-domain image retrieval has been extensively studied due to its high practical value. In recently proposed unsupervised cross-domain image retrieval methods, efforts are taken to break the data annotation barrier. However, applicability of the model is still confined to domains seen during training. This limitation motivates us to present the first attempt at domain-generalized unsupervised cross-domain image retrieval (DG-UCDIR) aiming at facilitating image retrieval between any two unseen domains in an unsupervised way. To improve domain generalizability of the model, we thus propose a new two-stage domain augmentation technique for diversified training data generation. DG-UCDIR also shares all the challenges present in the unsupervised cross-domain image retrieval, where domain-agnostic and semantic-aware feature representations are supposed to be learned without external supervision. To accomplish this, we introduce a novel cross-domain contrastive learning strategy by utilizing phase image as a proxy to mitigate the domain gap. Extensive experiments are carried out using PACS and DomainNet dataset, and consistently illustrate the superior performance of our framework compared to existing state-of-the-art methods. Our source code is available at https://github.com/conghui1002/DG-UCDIR. + + + + Dec-Adapter: Exploring Efficient Decoder-Side Adapter for Bridging Screen Content and Natural Image Compression http://openaccess.thecvf.com//content/ICCV2023/papers/Shen_Dec-Adapter_Exploring_Efficient_Decoder-Side_Adapter_for_Bridging_Screen_Content_and_ICCV_2023_paper.pdf Natural image compression has been greatly improved in the deep learning era. However, the compression performance will be heavily degraded if the pretrained encoder is directly applied on screen content image compression. Meanwhile, we observe that parameter-efficient transfer learning (PETL) methods have shown great adaptation ability in high-level vision tasks. Therefore, we propose a Dec-Adapter, a pioneering entropy-efficient transfer learning module for the decoder to bridge natural image and screen content compression. The adapter's parameters are learned during encoding and transmitted to the decoder for image-adaptive decoding. Our Dec-Adapter is lightweight, domain-transferable, and architecture-agnostic with generalized performance in bridging the two domains. Experiments demonstrate that our method outperforms all existing methods by a large margin in terms of BD-rate performance on screen content image compression. Specifically, our method achieves over 2 dB gain compared with the baseline when transferred to screen content image compression. + + + + Under-Display Camera Image Restoration with Scattering Effect http://openaccess.thecvf.com//content/ICCV2023/papers/Song_Under-Display_Camera_Image_Restoration_with_Scattering_Effect_ICCV_2023_paper.pdf The under-display camera (UDC) provides consumers with a full-screen visual experience without any obstruction due to notches or punched holes. However, the semi-transparent nature of the display inevitably introduces the severe degradation into UDC images.
In this work, we address the UDC image restoration problem with the specific consideration of the scattering effect caused by the display. We explicitly model the scattering effect by treating the display as a piece of homogeneous scattering medium. With the physical model of the scattering effect, we improve the image formation pipeline for the image synthesis to construct a realistic UDC dataset with ground truths. To suppress the scattering effect for the eventual UDC image recovery, a two-branch restoration network is designed. More specifically, the scattering branch leverages global modeling capabilities of the channel-wise self-attention to estimate parameters of the scattering effect from degraded images. While the image branch exploits the local representation advantage of CNN to recover clear scenes, implicitly guided by the scattering branch. Extensive experiments are conducted on both real-world and synthesized data, demonstrating the superiority of the proposed method over the state-of-the-art UDC restoration techniques. The source code and dataset are available at https://github.com/NamecantbeNULL/SRUDC. + + + + VideoFlow: Exploiting Temporal Cues for Multi-frame Optical Flow Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_VideoFlow_Exploiting_Temporal_Cues_for_Multi-frame_Optical_Flow_Estimation_ICCV_2023_paper.pdf + We introduce VideoFlow, a novel optical flow estimation framework for videos. In contrast to previous methods that learn to estimate optical flow from two frames, VideoFlow concurrently estimates bi-directional optical flows for multiple frames that are available in videos by sufficiently exploiting temporal cues. We first propose a TRi-frame Optical Flow (TROF) module that estimates bi-directional optical flows for the center frame in a three-frame manner. The information of the frame triplet is iteratively fused onto the center frame. To extend TROF for handling more frames, we further propose a MOtion Propagation (MOP) module that bridges multiple TROFs and propagates motion features between adjacent TROFs. With the iterative flow estimation refinement, the information fused in individual TROFs can be propagated into the whole sequence via MOP. By effectively exploiting video information, VideoFlow presents extraordinary performance, ranking 1st on all public benchmarks. On the Sintel benchmark, VideoFlow achieves 1.649 and 0.991 average end-point-error (AEPE) on the final and clean passes, a 15.1% and 7.6% error reduction from the best published results (1.943 and 1.073 from FlowFormer++). On the KITTI-2015 benchmark, VideoFlow achieves an F1-all error of 3.65%, a 19.2% error reduction from the best published result (4.52% from FlowFormer++). Code is released at https://github.com/XiaoyuShi97/VideoFlow. + + + + 3DMiner: Discovering Shapes from Large-Scale Unannotated Image Datasets + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_3DMiner_Discovering_Shapes_from_Large-Scale_Unannotated_Image_Datasets_ICCV_2023_paper.pdf + We present 3DMiner -- a pipeline for mining 3D shapes from challenging large-scale unannotated image datasets. Unlike other unsupervised 3D reconstruction methods, we assume that, within a large-enough dataset, there must exist images of objects with similar shapes but varying backgrounds, textures, and viewpoints. 
Our approach leverages the recent advances in learning self-supervised image representations to cluster images with geometrically similar shapes and find common image correspondences between them. We then exploit these correspondences to obtain rough camera estimates as initialization for bundle-adjustment. Finally, for every image cluster, we apply a progressive bundle-adjusting reconstruction method to learn a neural occupancy field representing the underlying shape. We show that this procedure is robust to several types of errors introduced in previous steps (e.g., wrong camera poses, images containing dissimilar shapes, etc.), allowing us to obtain shape and pose annotations for images in-the-wild. When using images from Pix3D chairs, our method is capable of producing significantly better results than state-of-the-art unsupervised 3D reconstruction techniques, both quantitatively and qualitatively. Furthermore, we show how 3DMiner can be applied to in-the-wild data by reconstructing shapes present in images from the LAION-5B dataset. + + + + Order-Prompted Tag Sequence Generation for Video Tagging http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Order-Prompted_Tag_Sequence_Generation_for_Video_Tagging_ICCV_2023_paper.pdf Video Tagging intends to infer multiple tags spanning relevant content for a given video. Typically, video tags are freely defined and uploaded by a variety of users, so they have two characteristics: abundant in quantity and disordered intra-video. It is difficult for the existing multi-label classification and generation methods to adapt directly to this task. This paper proposes a novel generative model, Order-Prompted Tag Sequence Generation (OP-TSG), according to the above characteristics. It regards video tagging as a tag sequence generation problem guided by sample-dependent order prompts. These prompts are semantically aligned with tags and enable decoupling of the tag generation order, making the model focus on modeling the tag dependencies. Moreover, the word-based generation strategy enables the model to generate novel tags. To verify the effectiveness and generalization of the proposed method, a Chinese video tagging benchmark CREATE-tagging, and an English image tagging benchmark Pexel-tagging are established. Extensive results show that OP-TSG is significantly superior to other methods; in particular, the results on rare tags improve by 3.3% and 3% over SOTA methods on CREATE-tagging and Pexel-tagging, and novel tags generated on CREATE-tagging exhibit a tag gain of 7.04%. + + + + XVO: Generalized Visual Odometry via Cross-Modal Self-Training http://openaccess.thecvf.com//content/ICCV2023/papers/Lai_XVO_Generalized_Visual_Odometry_via_Cross-Modal_Self-Training_ICCV_2023_paper.pdf We propose XVO, a semi-supervised learning method for training generalized monocular Visual Odometry (VO) models with robust off-the-shelf operation across diverse datasets and settings. In contrast to standard monocular VO approaches which often study a known calibration within a single dataset, XVO efficiently learns to recover relative pose with real-world scale from visual scene semantics, i.e., without relying on any known camera parameters. We optimize the motion estimation model via self-training from large amounts of unconstrained and heterogeneous dash camera videos available on YouTube. Our key contribution is twofold. First, we empirically demonstrate the benefits of semi-supervised training for learning a general-purpose direct VO regression network.
Second, we demonstrate multi-modal supervision, including segmentation, flow, depth, and audio auxiliary prediction tasks, to facilitate generalized representations for the VO task. Specifically, we find audio prediction task to significantly enhance the semi-supervised learning process while alleviating noisy pseudo-labels, particularly in highly dynamic and out-of-domain video data. Our proposed teacher network achieves state-of-the-art performance on the commonly used KITTI benchmark despite no multi-frame optimization or knowledge of camera parameters. Combined with the proposed semi-supervised step, XVO demonstrates off-the-shelf knowledge transfer across diverse conditions on KITTI, nuScenes, and Argoverse without fine-tuning. + + + + HMD-NeMo: Online 3D Avatar Motion Generation From Sparse Observations + http://openaccess.thecvf.com//content/ICCV2023/papers/Aliakbarian_HMD-NeMo_Online_3D_Avatar_Motion_Generation_From_Sparse_Observations_ICCV_2023_paper.pdf + Generating both plausible and accurate full body avatar motion is the key to the quality of immersive experiences in mixed reality scenarios. Head-Mounted Devices (HMDs) typically only provide a few input signals, such as head and hands 6-DoF. Recently, different approaches achieved impressive performance in generating full body motion given only head and hands signal. However, to the best of our knowledge, all existing approaches rely on full hand visibility. While this is the case when, e.g., using motion controllers, a considerable proportion of mixed reality experiences do not involve motion controllers and instead rely on egocentric hand tracking. This introduces the challenge of partial hand visibility owing to the restricted field of view of the HMD. In this paper, we propose the first unified approach, HMD-NeMo, that addresses plausible and accurate full body motion generation even when the hands may be only partially visible. HMD-NeMo is a lightweight neural network that predicts the full body motion in an online and real-time fashion. At the heart of HMD-NeMo is the spatio-temporal encoder with novel temporally adaptable mask tokens that encourage plausible motion in the absence of hand observations. We perform extensive analysis of the impact of different components in HMD-NeMo and introduce a new state-of-the-art on AMASS dataset through our evaluation. + + + + Adaptive Illumination Mapping for Shadow Detection in Raw Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Adaptive_Illumination_Mapping_for_Shadow_Detection_in_Raw_Images_ICCV_2023_paper.pdf + Shadow detection methods rely on multi-scale contrast, especially global contrast, information to locate shadows correctly. However, we observe that the camera image signal processor (ISP) tends to preserve more local contrast information by sacrificing global contrast information during the raw-to-sRGB conversion process. This often causes existing methods to fail in scenes with high global contrast but low local contrast in shadow regions. In this paper, we propose a novel method to detect shadows from raw images. Our key idea is that instead of performing a many-to-one mapping like the ISP process, we can learn a many-to-many mapping from the high dynamic range raw images to the sRGB images of different illumination, which is able to preserve multi-scale contrast for accurate shadow detection. To this end, we first construct a new shadow dataset with 7000 raw images and shadow masks. 
We then propose a novel network, which includes a novel adaptive illumination mapping (AIM) module to project the input raw images into sRGB images of different intensity ranges and a shadow detection module to leverage the preserved multi-scale contrast information to detect shadows. To learn the shadow-aware adaptive illumination mapping process, we propose a novel feedback mechanism to guide the AIM during training. Experiments show that our method outperforms state-of-the-art shadow detectors. Code and dataset are available at https://github.com/jiayusun/SARA. + + + + Multi-Scale Residual Low-Pass Filter Network for Image Deblurring http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Multi-Scale_Residual_Low-Pass_Filter_Network_for_Image_Deblurring_ICCV_2023_paper.pdf We present a simple and effective Multi-scale Residual Low-Pass Filter Network (MRLPFNet) that jointly explores the image details and main structures for image deblurring. Our work is motivated by an observation that the difference between the blurry image and the clear one not only contains high-frequency contents (note that the high-frequency contents in an image correspond to the image details, while the low-frequency ones denote the main structures of an image) but also includes low-frequency information due to the influence of blur, while using the standard residual learning is less effective for modeling the main structure distorted by the blur. Considering that the low-frequency contents usually correspond to main global structures that are spatially variant, we first propose a learnable low-pass filter based on a self-attention mechanism to adaptively explore the global contexts for better modeling the low-frequency information. Then we embed it into a Residual Low-Pass Filter (RLPF) module, which involves an additional fully convolutional neural network with the standard residual learning to model the high-frequency information. We formulate the RLPF module into an end-to-end trainable network based on an encoder and decoder architecture and develop a wavelet-based feature fusion to fuse the multi-scale features. Experimental results show that our method performs favorably against state-of-the-art ones on commonly-used benchmarks. + + + + PhaseMP: Robust 3D Pose Estimation via Phase-conditioned Human Motion Prior http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_PhaseMP_Robust_3D_Pose_Estimation_via_Phase-conditioned_Human_Motion_Prior_ICCV_2023_paper.pdf We present a novel motion prior, called PhaseMP, modeling a probability distribution on pose transitions conditioned on a frequency-domain feature extracted from a periodic autoencoder. The phase feature further enforces the pose transitions to be unidirectional (i.e. no backward movement in time), from which more stable and natural motions can be generated. Specifically, our motion prior can be useful for accurately estimating 3D human motions in the presence of challenging input data, including long periods of spatial and temporal occlusion, as well as noisy sensor measurements. Through a comprehensive evaluation, we demonstrate the efficacy of our novel motion prior, showcasing its superiority over existing state-of-the-art methods by a significant margin across various applications, including video-to-motion and motion estimation from sparse sensor data.
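The low-/high-frequency decomposition that motivates the MRLPFNet entry above (the low-pass output keeps main structures, the residual keeps details) can be illustrated with a fixed Gaussian filter as below; the paper instead learns the low-pass filter with a self-attention mechanism, so this is only a sketch of the premise.

    # Illustrative only: split an image into low-frequency structure and
    # high-frequency detail with a fixed Gaussian low-pass filter.
    import numpy as np
    from scipy.ndimage import gaussian_filter

    def split_frequencies(image: np.ndarray, sigma: float = 3.0):
        # image: (H, W) or (H, W, C) float array; channels are not blurred across.
        sigmas = (sigma, sigma) + (0,) * (image.ndim - 2)
        low = gaussian_filter(image, sigma=sigmas)   # main structures
        high = image - low                           # residual details
        return low, high                             # image == low + high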
+ + + + NLOS-NeuS: Non-line-of-sight Neural Implicit Surface http://openaccess.thecvf.com//content/ICCV2023/papers/Fujimura_NLOS-NeuS_Non-line-of-sight_Neural_Implicit_Surface_ICCV_2023_paper.pdf Non-line-of-sight (NLOS) imaging is conducted to infer invisible scenes from indirect light on visible objects. The neural transient field (NeTF) was proposed for representing scenes as neural radiance fields in NLOS scenes. We propose NLOS neural implicit surface (NLOS-NeuS), which extends the NeTF to neural implicit surfaces with a signed distance function (SDF) for reconstructing three-dimensional surfaces in NLOS scenes. We introduce two constraints as loss functions for correctly learning an SDF to avoid non-zero level-set surfaces. We also introduce a lower bound constraint of an SDF based on the geometry of the first-returning photons. The experimental results indicate that these constraints are essential for learning a correct SDF in NLOS scenes. Compared with previous methods with discretized representation, NLOS-NeuS with the neural continuous representation enables us to reconstruct smooth surfaces while preserving fine details in NLOS scenes. To the best of our knowledge, this is the first study on neural implicit surfaces with volume rendering in NLOS scenes. Project page: https://yfujimura.github.io/nlos-neus/ + + + + Augmenting and Aligning Snippets for Few-Shot Video Domain Adaptation http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Augmenting_and_Aligning_Snippets_for_Few-Shot_Video_Domain_Adaptation_ICCV_2023_paper.pdf For video models to be transferred and applied seamlessly across video tasks in varied environments, Video Unsupervised Domain Adaptation (VUDA) has been introduced to improve the robustness and transferability of video models. However, current VUDA methods rely on a vast amount of high-quality unlabeled target data, which may not be available in real-world cases. We thus consider a more realistic Few-Shot Video-based Domain Adaptation (FSVDA) scenario where we adapt video models with only a few target video samples. While a few methods have touched upon Few-Shot Domain Adaptation (FSDA) in images and in FSVDA, they rely primarily on spatial augmentation for target domain expansion with alignment performed statistically at the instance level. However, videos contain more knowledge in terms of rich temporal and semantic information, which should be fully considered while augmenting target domains and performing alignment in FSVDA. We propose a novel SSA2lign to address FSVDA at the snippet level, where the target domain is expanded through a simple snippet-level augmentation followed by the attentive alignment of snippets both semantically and statistically, where semantic alignment of snippets is conducted through multiple perspectives. Empirical results demonstrate state-of-the-art performance of SSA2lign across multiple cross-domain action recognition benchmarks. + + + + Towards Real-World Burst Image Super-Resolution: Benchmark and Method http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Towards_Real-World_Burst_Image_Super-Resolution_Benchmark_and_Method_ICCV_2023_paper.pdf Despite substantial advances, single-image super-resolution (SISR) is always in a dilemma to reconstruct high-quality images with limited information from one input image, especially in realistic scenarios.
In this paper, we establish a large-scale real-world burst super-resolution dataset, i.e., RealBSR, to explore the faithful reconstruction of image details from multiple frames. Furthermore, we introduce a Federated Burst Affinity network (FBAnet) to investigate non-trivial pixel-wise displacements among images under real-world image degradation. Specifically, rather than using pixel-wise alignment, our FBAnet employs a simple homography alignment from a structural geometry aspect and a Federated Affinity Fusion (FAF) strategy to aggregate the complementary information among frames. Those fused informative representations are fed to a Transformer-based module of burst representation decoding. Besides, we have conducted extensive experiments on two versions of our datasets, i.e., RealBSR-RAW and RealBSR-RGB. Experimental results demonstrate that our FBAnet outperforms existing state-of-the-art burst SR methods and also achieves visually pleasing SR image predictions with model details. Our dataset, codes, and models are publicly available at https://github.com/yjsunnn/FBANet. + + + + SYENet: A Simple Yet Effective Network for Multiple Low-Level Vision Tasks with Real-Time Performance on Mobile Device http://openaccess.thecvf.com//content/ICCV2023/papers/Gou_SYENet_A_Simple_Yet_Effective_Network_for_Multiple_Low-Level_Vision_ICCV_2023_paper.pdf With the rapid development of AI hardware accelerators, applying deep learning-based algorithms to solve various low-level vision tasks on mobile devices has gradually become possible. However, two main problems still need to be solved. Firstly, most low-level vision algorithms are task-specific and independent of each other, which makes them difficult to integrate into a single neural network architecture and accelerate simultaneously without task-level time-multiplexing. Secondly, most of these networks feature large amounts of parameters and huge computational costs in terms of multiplication-and-accumulation operations, and thus it is difficult to achieve real-time performance, especially on mobile devices with limited computing power. To tackle these problems, we propose a novel network, SYENet, with only 6K parameters. The SYENet consists of two asymmetrical branches with simple building blocks and is able to handle multiple low-level vision tasks on mobile devices in a real-time manner. To effectively connect the results by asymmetrical branches, a Quadratic Connection Unit (QCU) is proposed. Furthermore, in order to improve visual quality, a new Regression Focal Loss is proposed to process the image. The proposed method proves its superior performance with the best PSNR and visual quality as compared with other networks in real-time applications such as Image Signal Processing (ISP), Low-Light Enhancement (LLE), and Super-Resolution (SR) with 2K60FPS throughput on the Qualcomm 8 Gen 1 mobile SoC (System-on-Chip). Particularly, for the ISP task, SYENet got the highest score in the MAI 2022 Learned Smartphone ISP challenge. + + + + EdaDet: Open-Vocabulary Object Detection Using Early Dense Alignment http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_EdaDet_Open-Vocabulary_Object_Detection_Using_Early_Dense_Alignment_ICCV_2023_paper.pdf Vision-language models such as CLIP have boosted the performance of open-vocabulary object detection, where the detector is trained on base categories but required to detect novel categories.
Existing methods leverage CLIP's strong zero-shot recognition ability to align object-level embeddings with textual embeddings of categories. However, we observe that using CLIP for object-level alignment results in overfitting to base categories, i.e., novel categories most similar to base categories have particularly poor performance as they are recognized as similar base categories. In this paper, we first identify that the loss of critical fine-grained local image semantics hinders existing methods from attaining strong base-to-novel generalization. Then, we propose Early Dense Alignment (EDA) to bridge the gap between generalizable local semantics and object-level prediction. In EDA, we use object-level supervision to learn the dense-level rather than object-level alignment to maintain the local fine-grained semantics. Extensive experiments demonstrate our superior performance to competing approaches under the same strict setting and without using external training resources, i.e., improving the +8.4% novel box AP50 on COCO and +3.9% rare mask AP on LVIS. + + + + DOLCE: A Model-Based Probabilistic Diffusion Framework for Limited-Angle CT Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_DOLCE_A_Model-Based_Probabilistic_Diffusion_Framework_for_Limited-Angle_CT_Reconstruction_ICCV_2023_paper.pdf + Limited-Angle Computed Tomography (LACT) is a non-destructive 3D imaging technique used in a variety of applications ranging from security to medicine. The limited angle coverage in LACT is often a dominant source of severe artifacts in the reconstructed images, making it a challenging imaging inverse problem. Diffusion models are a recent class of deep generative models for synthesizing realistic images using image denoisers. In this work, we present DOLCE as the first framework for integrating conditionally-trained diffusion models and explicit physical measurement models for solving imaging inverse problems. DOLCE achieves the SOTA performance in highly ill-posed LACT by alternating between the data-fidelity and sampling updates of a diffusion model conditioned on the transformed sinogram. We show through extensive experimentation that unlike existing methods, DOLCE can synthesize high-quality and structurally coherent 3D volumes by using only 2D conditionally pre-trained diffusion models. We further show on several challenging real LACT datasets that the same pre-trained DOLCE model achieves the SOTA performance on drastically different types of images. + + + + Beyond Image Borders: Learning Feature Extrapolation for Unbounded Image Composition + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Beyond_Image_Borders_Learning_Feature_Extrapolation_for_Unbounded_Image_Composition_ICCV_2023_paper.pdf + For improving image composition and aesthetic quality, most existing methods modulate the captured images by striking out redundant content near the image borders. However, such image cropping methods are limited in the range of image views. Some methods have been suggested to extrapolate the images and predict cropping boxes from the extrapolated image. Nonetheless, the synthesized extrapolated regions may be included in the cropped image, making the image composition result not real and potentially with degraded image quality. In this paper, we circumvent this issue by presenting a joint framework for both unbounded recommendation of camera view and image composition (i.e., UNIC). 
In this way, the cropped image is a sub-image of the image acquired by the predicted camera view, and thus can be guaranteed to be real and consistent in image quality. Specifically, our framework takes the current camera preview frame as input and provides a recommendation for view adjustment, which contains operations unlimited by the image borders, such as zooming in or out and camera movement. To improve the accuracy of view adjustment prediction, we further extend the field of view by feature extrapolation. After one or several view adjustments, our method converges and results in both a camera view and a bounding box showing the image composition recommendation. Extensive experiments are conducted on the datasets constructed upon existing image cropping datasets, showing the effectiveness of our UNIC in unbounded recommendation of camera view and image composition. The source code, dataset, and pretrained models are available at https://github.com/liuxiaoyu1104/UNIC. + + + + DeepChange: A Long-Term Person Re-Identification Benchmark with Clothes Change + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_DeepChange_A_Long-Term_Person_Re-Identification_Benchmark_with_Clothes_Change_ICCV_2023_paper.pdf + Long-term re-id with clothes change is a challenging problem in surveillance AI. Currently, its major bottleneck is that this field is still missing a large realistic benchmark. In this work, we contribute a large, realistic long-term person re-identification benchmark, termed DeepChange. Its unique characteristics are: (1) Realistic and rich personal appearance (e.g., clothes and hair style) and variations: Highly diverse clothes change and styles, with varying reappearing gaps in time from minutes to seasons, different weather conditions (e.g., sunny, cloudy, windy, rainy, snowy, extremely cold) and events (e.g., working, leisure, daily activities). (2) Rich camera setups: Raw videos were recorded by 17 outdoor varying-resolution cameras operating in a real-world surveillance system. (3) The currently largest number of (17) cameras, (1,121) identities, and (178,407) bounding boxes, over the longest time span (12 months). We benchmark the representative supervised and unsupervised re-id methods on our dataset. In addition, we investigate multimodal fusion strategies for tackling the clothes change challenge. Extensive experiments show that our fusion models outperform a wide variety of state-of-the-art models on DeepChange. Our dataset and documents are available at https://github.com/PengBoXiangShang/deepchange. + + + + Discrepant and Multi-Instance Proxies for Unsupervised Person Re-Identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Zou_Discrepant_and_Multi-Instance_Proxies_for_Unsupervised_Person_Re-Identification_ICCV_2023_paper.pdf + Most recent unsupervised person re-identification methods maintain a cluster uni-proxy for contrastive learning. However, due to the intra-class variance and inter-class similarity, the cluster uni-proxy is prone to be biased and confused with similar classes, resulting in the learned features lacking intra-class compactness and inter-class separation in the embedding space. To completely and accurately represent the information contained in a cluster and learn discriminative features, we propose to maintain discrepant cluster proxies and multi-instance proxies for a cluster. 
Each cluster proxy focuses on representing a part of the information, and several discrepant proxies collaborate to represent the entire cluster completely. As a complement to the overall representation, multi-instance proxies are used to accurately represent the fine-grained information contained in the instances of the cluster. Based on the proposed discrepant cluster proxies, we construct a cluster contrastive loss to use the proxies as hard positive samples to pull instances of a cluster closer and reduce intra-class variance. Meanwhile, an instance contrastive loss is constructed by global hard negative sample mining in multi-instance proxies to push away the truly indistinguishable classes and decrease inter-class similarity. Extensive experiments on Market-1501 and MSMT17 demonstrate that the proposed method outperforms state-of-the-art approaches. + + + + Joint-Relation Transformer for Multi-Person Motion Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Joint-Relation_Transformer_for_Multi-Person_Motion_Prediction_ICCV_2023_paper.pdf + Multi-person motion prediction is a challenging problem due to the dependency of motion on both individual past movements and interactions with other people. Transformer-based methods have shown promising results on this task, but they miss the explicit relation representation between joints, such as skeleton structure and pairwise distance, which is crucial for accurate interaction modeling. In this paper, we propose the Joint-Relation Transformer, which utilizes relation information to enhance interaction modeling and improve future motion prediction. Our relation information contains the relative distance and the intra/inter-person physical constraints. To fuse relation and joint information, we design a novel joint-relation fusion layer with relation-aware attention to update both features. Additionally, we supervise the relation information by forecasting future distance. Experiments show that our method achieves a 13.4% improvement in 900ms VIM on 3DPW-SoMoF/RC and 17.8%/12.0% improvement in 3s MPJPE on the CMU-Mocap/MuPoTS-3D datasets. + + + + TMA: Temporal Motion Aggregation for Event-based Optical Flow + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_TMA_Temporal_Motion_Aggregation_for_Event-based_Optical_Flow_ICCV_2023_paper.pdf + Event cameras have the ability to record continuous and detailed trajectories of objects with high temporal resolution, thereby providing intuitive motion cues for optical flow estimation. Nevertheless, most existing learning-based approaches for event optical flow estimation directly remould the paradigm of conventional images by representing the consecutive event stream as static frames, ignoring the inherent temporal continuity of event data. In this paper, we argue that temporal continuity is a vital element of event-based optical flow and propose a novel Temporal Motion Aggregation (TMA) approach to unlock its potential. Technically, TMA comprises three components: an event splitting strategy to incorporate intermediate motion information underlying the temporal context, a linear lookup strategy to align temporally fine-grained motion features, and a novel motion pattern aggregation module to emphasize consistent patterns for motion feature enhancement. 
By incorporating temporally fine-grained motion information, TMA can derive better flow estimates than existing methods at early stages, which not only enables TMA to obtain more accurate final predictions, but also greatly reduces the demand for a number of refinements. Extensive experiments on DSEC-Flow and MVSEC datasets verify the effectiveness and superiority of our TMA. Remarkably, compared to E-RAFT, TMA achieves a 6% improvement in accuracy and a 40% reduction in inference time on DSEC-Flow. Code will be available at https://github.com/ispc-lab/TMA. + + + + Building a Winning Team: Selecting Source Model Ensembles using a Submodular Transferability Estimation Approach + http://openaccess.thecvf.com//content/ICCV2023/papers/B_Building_a_Winning_Team_Selecting_Source_Model_Ensembles_using_a_ICCV_2023_paper.pdf + Estimating the transferability of publicly available pre-trained models to a target task has assumed an important place for transfer learning tasks in recent years. Existing efforts propose metrics that allow a user to choose one model from a pool of pre-trained models without having to fine-tune each model individually and identify one explicitly. With the growth in the number of available pre-trained models and the popularity of model ensembles, it also becomes essential to study the transferability of multiple-source models for a given target task. The few existing efforts study transferability in such multi-source ensemble settings using just the outputs of the classification layer and neglect possible domain or task mismatch. Moreover, they overlook the most important factor while selecting the source models, viz., the cohesiveness factor between them, which can impact the performance and confidence in the prediction of the ensemble. To address these gaps, we propose a novel Optimal tranSport-based suBmOdular tRaNsferability metric (OSBORN) to estimate the transferability of an ensemble of models to a downstream task. OSBORN collectively accounts for image domain difference, task difference, and cohesiveness of models in the ensemble to provide reliable estimates of transferability. We gauge the performance of OSBORN on both image classification and semantic segmentation tasks. Our setup includes 28 source datasets, 11 target datasets, 5 model architectures, and 2 pre-training methods. We benchmark our method against current state-of-the-art metrics MS-LEEP and E-LEEP, and outperform them consistently using the proposed approach. + + + + Plausible Uncertainties for Human Pose Regression + http://openaccess.thecvf.com//content/ICCV2023/papers/Bramlage_Plausible_Uncertainties_for_Human_Pose_Regression_ICCV_2023_paper.pdf + Human pose estimation (HPE) is integral to scene understanding in numerous safety-critical domains involving human-machine interaction, such as autonomous driving or semi-automated work environments. Avoiding costly mistakes is synonymous with anticipating failure in model predictions, which necessitates meta-judgments on the accuracy of the applied models. Here, we propose a straightforward human pose regression framework to examine the behavior of two established methods for simultaneous aleatoric and epistemic uncertainty estimation: maximum a-posteriori (MAP) estimation with Monte-Carlo variational inference and deep evidential regression (DER). First, we evaluate both approaches on the quality of their predicted variances and whether these truly capture the expected model error. 
The initial assessment indicates that both methods exhibit the overconfidence issue common in deep probabilistic models. This observation motivates our implementation of an additional recalibration step to extract reliable confidence intervals. We then take a closer look at deep evidential regression, which, to our knowledge, is applied comprehensively for the first time to the HPE problem. Experimental results indicate that DER behaves as expected in challenging and adverse conditions commonly occurring in HPE and that the predicted uncertainties match their purported aleatoric and epistemic sources. Notably, DER achieves smooth uncertainty estimates without the need for a costly sampling step, making it an attractive candidate for uncertainty estimation on resource-limited platforms. + + + + DiffIR: Efficient Diffusion Model for Image Restoration + http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_DiffIR_Efficient_Diffusion_Model_for_Image_Restoration_ICCV_2023_paper.pdf + The diffusion model (DM) has achieved SOTA performance by modeling the image synthesis process as a sequential application of a denoising network. However, unlike image synthesis, which generates each pixel from scratch, most pixels in image restoration (IR) are already given. Thus, for IR, it is inefficient for traditional DMs to run massive iterations on a large model to estimate whole images or feature maps. To address this issue, we propose an efficient DM for IR (DiffIR), which consists of a compact IR prior extraction network (CPEN), a dynamic IR transformer (DIRformer), and a denoising network. Specifically, DiffIR has two training stages: pretraining and training the DM. In pretraining, we input ground-truth images into CPEN-S1 to capture a compact IR prior representation (IPR) to guide DIRformer. In the second stage, we train the DM to directly estimate the same IPR as the pretrained CPEN-S1 using only LQ images. We observe that since the IPR is only a compact vector, DiffIR can use fewer iterations than a traditional DM to obtain accurate estimations and generate more stable and realistic results. Since the iterations are few, our DiffIR can adopt a joint optimization of CPEN-S2, DIRformer, and the denoising network, which can further reduce the estimation error influence. We conduct extensive experiments on several IR tasks and achieve SOTA performance while consuming less computational cost. Codes and models will be released. + + + + Simple Baselines for Interactive Video Retrieval with Questions and Answers + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_Simple_Baselines_for_Interactive_Video_Retrieval_with_Questions_and_Answers_ICCV_2023_paper.pdf + To date, the majority of video retrieval systems have been optimized for a "single-shot" scenario in which the user submits a query in isolation, ignoring previous interactions with the system. Recently, there has been renewed interest in interactive systems to enhance retrieval, but existing approaches are complex and deliver limited gains in performance. In this work, we revisit this topic and propose several simple yet effective baselines for interactive video retrieval via question-answering. We employ a VideoQA model to simulate user interactions and show that this enables the productive study of the interactive retrieval task without access to ground truth dialogue data. Experiments on MSR-VTT, MSVD, and AVSD show that our framework using question-based interaction significantly improves the performance of text-based video retrieval systems. 
Code is available at https://github.com/kevinliang888/IVR-QA-baselines. + + + + Going Denser with Open-Vocabulary Part Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Going_Denser_with_Open-Vocabulary_Part_Segmentation_ICCV_2023_paper.pdf + Object detection has been expanded from a limited number of categories to open vocabulary. Moving forward, a complete intelligent vision system requires understanding more fine-grained object descriptions, i.e., object parts. In this paper, we propose a detector with the ability to predict both open-vocabulary objects and their part segmentation. This ability comes from two designs. First, we train the detector jointly on part-level, object-level and image-level data to build the multi-granularity alignment between language and image. Second, we parse the novel object into its parts by its dense semantic correspondence with the base object. These two designs enable the detector to largely benefit from various data sources and foundation models. In open-vocabulary part segmentation experiments, our method outperforms the baseline by 3.3-7.3 mAP in cross-dataset generalization on PartImageNet, and improves the baseline by 7.3 novel AP50 in cross-category generalization on Pascal Part. Finally, we train a detector that generalizes to a wide range of part segmentation datasets while achieving better performance than dataset-specific training. + + + + OCHID-Fi: Occlusion-Robust Hand Pose Estimation in 3D via RF-Vision + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_OCHID-Fi_Occlusion-Robust_Hand_Pose_Estimation_in_3D_via_RF-Vision_ICCV_2023_paper.pdf + Hand Pose Estimation (HPE) is crucial to many applications, but conventional camera-based HPE (CM-HPE) methods are completely subject to Line-of-Sight (LoS), as cameras cannot capture occluded objects. In this paper, we propose to exploit Radio-Frequency-Vision (RF-vision) capable of bypassing obstacles for achieving occluded HPE, and we introduce OCHID-Fi as the first RF-HPE method with 3D pose estimation capability. OCHID-Fi employs wideband RF sensors widely available on smart devices (e.g., iPhones) to probe 3D human hand poses and extract their skeletons behind obstacles. To overcome the challenge in labeling RF imaging given its human-incomprehensible nature, OCHID-Fi employs a cross-modality and cross-domain training process. It uses a pre-trained CM-HPE network and a synchronized CM/RF dataset to guide the training of its complex-valued RF-HPE network under LoS conditions. It further transfers knowledge learned from the labeled LoS domain to the unlabeled occluded domain via adversarial learning, enabling OCHID-Fi to generalize to unseen occluded scenarios. Experimental results demonstrate the superiority of OCHID-Fi: it achieves comparable accuracy to CM-HPE under normal conditions while maintaining such accuracy even in occluded scenarios, with empirical evidence for its generalizability to new domains. + + + + Reconstructing Interacting Hands with Interaction Prior from Monocular Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Zuo_Reconstructing_Interacting_Hands_with_Interaction_Prior_from_Monocular_Images_ICCV_2023_paper.pdf + Reconstructing interacting hands from monocular images is indispensable in AR/VR applications. Most existing solutions rely on the accurate localization of each skeleton joint. However, these methods tend to be unreliable due to the severe occlusion and confusing similarity among adjacent hand parts. 
This also defies human perception because humans can quickly imitate an interaction pattern without localizing all joints. Our key idea is to first construct a two-hand interaction prior and recast the interaction reconstruction task as the conditional sampling from the prior. To expand more interaction states, a large-scale multimodal dataset with physical plausibility is proposed. Then a VAE is trained to further condense these interaction patterns as latent codes in a prior distribution. When looking for image cues that contribute to interaction prior sampling, we propose the interaction adjacency heatmap (IAH). Compared with a joint-wise heatmap for localization, IAH assigns denser visible features to those invisible joints. Compared with an all-in-one visible heatmap, it provides more fine-grained local interaction information in each interaction region. Finally, the correlations between the extracted features and corresponding interaction codes are linked by the ViT module. Comprehensive evaluations on benchmark datasets have verified the effectiveness of this framework. The code and dataset are publicly available at: https://github.com/binghui-z/InterPrior_pytorch. + + + + Towards Realistic Evaluation of Industrial Continual Learning Scenarios with an Emphasis on Energy Consumption and Computational Footprint + http://openaccess.thecvf.com//content/ICCV2023/papers/Chavan_Towards_Realistic_Evaluation_of_Industrial_Continual_Learning_Scenarios_with_an_ICCV_2023_paper.pdf + Incremental Learning (IL) aims to develop Machine Learning (ML) models that can learn from continuous streams of data and mitigate catastrophic forgetting. We analyse the current state-of-the-art Class-IL implementations and demonstrate why the current body of research tends to be one-dimensional, with an excessive focus on accuracy metrics. A realistic evaluation of Continual Learning methods should also emphasise energy consumption and overall computational load for a comprehensive understanding. This paper addresses research gaps between current IL research and industrial project environments, including varying incremental tasks and the introduction of Joint Training in tandem with IL. We introduce InVar-100 (Industrial Objects in Varied Contexts), a novel dataset meant to simulate the visual environments in industrial setups and perform various experiments for IL. Additionally, we incorporate explainability (using class activations) to interpret the model predictions. Our approach, RECIL (Real-World Scenarios and Energy Efficiency Considerations for Class Incremental Learning) provides meaningful insights about the applicability of IL approaches in practical use cases. The overarching aim is to bring the Incremental Learning and Green AI fields together and encourage the application of CIL methods in real-world scenarios. Code and dataset are available. + + + + How Much Temporal Long-Term Context is Needed for Action Segmentation? + http://openaccess.thecvf.com//content/ICCV2023/papers/Bahrami_How_Much_Temporal_Long-Term_Context_is_Needed_for_Action_Segmentation_ICCV_2023_paper.pdf + Modeling long-term context in videos is crucial for many fine-grained tasks including temporal action segmentation. An interesting question that is still open is how much long-term temporal context is needed for optimal performance. While transformers can model the long-term context of a video, this becomes computationally prohibitive for long videos. 
Recent works on temporal action segmentation thus combine temporal convolutional networks with self-attentions that are computed only for a local temporal window. While these approaches show good results, their performance is limited by their inability to capture the full context of a video. In this work, we try to answer how much long-term temporal context is required for temporal action segmentation by introducing a transformer-based model that leverages sparse attention to capture the full context of a video. We compare our model with the current state of the art on three datasets for temporal action segmentation, namely 50Salads, Breakfast, and Assembly101. Our experiments show that modeling the full context of a video is necessary to obtain the best performance for temporal action segmentation. + + + + 3D VR Sketch Guided 3D Shape Prototyping and Exploration + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_3D_VR_Sketch_Guided_3D_Shape_Prototyping_and_Exploration_ICCV_2023_paper.pdf + 3D shape modeling is labor-intensive, time-consuming, and requires years of expertise. To facilitate 3D shape modeling, we propose a 3D shape generation network that takes a 3D VR sketch as a condition. We assume that sketches are created by novices without art training and aim to reconstruct geometrically realistic 3D shapes of a given category. To handle potential sketch ambiguity, our method creates multiple 3D shapes that align with the original sketch's structure. We carefully design our method, training the model step-by-step and leveraging multi-modal 3D shape representation to support training with limited training data. To guarantee the realism of generated 3D shapes we leverage the normalizing flow that models the distribution of the latent space of 3D shapes. To encourage the fidelity of the generated 3D shapes to an input sketch, we propose a dedicated loss that we deploy at different stages of the training process. The code is available at https://github.com/Rowl1ng/3Dsketch2shape. + + + + MDCS: More Diverse Experts with Consistency Self-distillation for Long-tailed Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_MDCS_More_Diverse_Experts_with_Consistency_Self-distillation_for_Long-tailed_Recognition_ICCV_2023_paper.pdf + Recently, multi-expert methods have led to significant improvements in long-tail recognition (LTR). We summarize two aspects that need further enhancement to contribute to LTR boosting: (1) More diverse experts; (2) Lower model variance. However, the previous methods didn't handle them well. To this end, we propose More Diverse experts with Consistency Self-distillation (MDCS) to bridge the gap left by earlier methods. Our MDCS approach consists of two core components: Diversity Loss (DL) and Consistency Self-distillation (CS). In detail, DL promotes diversity among experts by controlling their focus on different categories. To reduce the model variance, we employ KL divergence to distill the richer knowledge of weakly augmented instances for the experts' self-distillation. In particular, we design Confident Instance Sampling (CIS) to select the correctly classified instances for CS to avoid biased/noisy knowledge. In the analysis and ablation study, we demonstrate that our method compared with previous work can effectively increase the diversity of experts, significantly reduce the variance of the model, and improve recognition accuracy. 
Moreover, the roles of our DL and CS are mutually reinforcing and coupled: the diversity of experts benefits from the CS, and the CS cannot achieve remarkable results without the DL. Experiments show that our MDCS outperforms the state-of-the-art by 1%-2% on five popular long-tailed benchmarks, including CIFAR10-LT, CIFAR100-LT, ImageNet-LT, Places-LT, and iNaturalist 2018. The code is available at https://github.com/fistyee/MDCS. + + + + Similarity Min-Max: Zero-Shot Day-Night Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_Similarity_Min-Max_Zero-Shot_Day-Night_Domain_Adaptation_ICCV_2023_paper.pdf + Low-light conditions not only hamper human visual experience but also degrade the model's performance on downstream vision tasks. While existing works make remarkable progress on day-night domain adaptation, they rely heavily on domain knowledge derived from the task-specific nighttime dataset. This paper challenges a more complicated scenario with broader applicability, i.e., zero-shot day-night domain adaptation, which eliminates reliance on any nighttime data. Unlike prior zero-shot adaptation approaches emphasizing either image-level translation or model-level adaptation, we propose a similarity min-max paradigm that considers them under a unified framework. On the image level, we darken images towards minimum feature similarity to enlarge the domain gap. Then on the model level, we maximize the feature similarity between the darkened images and their normal-light counterparts for better model adaptation. To the best of our knowledge, this work represents the pioneering effort in jointly optimizing both aspects, resulting in a significant improvement of model generalizability. Extensive experiments demonstrate our method's effectiveness and broad applicability on various nighttime vision tasks, including classification, semantic segmentation, visual place recognition, and video action recognition. Our project page is available at https://red-fairy.github.io/ZeroShotDayNightDA-Webpage/. + + + + Dark Side Augmentation: Generating Diverse Night Examples for Metric Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Mohwald_Dark_Side_Augmentation_Generating_Diverse_Night_Examples_for_Metric_Learning_ICCV_2023_paper.pdf + Image retrieval methods based on CNN descriptors rely on metric learning from a large number of diverse examples of positive and negative image pairs. Domains, such as night-time images, with limited availability and variability of training data suffer from poor retrieval performance even with methods performing well on standard benchmarks. We propose to train a GAN-based synthetic-image generator, translating available day-time image examples into night images. Such a generator is used in metric learning as a form of augmentation, supplying training data to the scarce domain. Various types of generators are evaluated and analyzed. We contribute a novel light-weight GAN architecture that enforces the consistency between the original and translated image through edge consistency. The proposed architecture also allows simultaneous training of an edge detector that operates on both night and day images. To further increase the variability in the training examples and to maximize the generalization of the trained model, we propose a novel method of diverse anchor mining. 
The proposed method improves over state-of-the-art results on the standard Tokyo 24/7 day-night retrieval benchmark while preserving performance on the Oxford and Paris datasets. This is achieved without the need for training image pairs of matching day and night images. The source code is available at https://github.com/mohwald/gandtr. + + + + LVOS: A Benchmark for Long-term Video Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Hong_LVOS_A_Benchmark_for_Long-term_Video_Object_Segmentation_ICCV_2023_paper.pdf + Existing video object segmentation (VOS) benchmarks focus on short-term videos, which last only about 3-5 seconds and where objects are visible most of the time. These videos are poorly representative of practical applications, and the absence of long-term datasets restricts further investigation of VOS in realistic application scenarios. So, in this paper, we present a new benchmark dataset and evaluation methodology named LVOS, which consists of 220 videos with a total duration of 421 minutes. To the best of our knowledge, LVOS is the first densely annotated long-term VOS dataset. The videos in our LVOS last 1.59 minutes on average, which is 20 times longer than videos in existing VOS datasets. Each video includes various attributes, especially challenges deriving from the wild, such as long-term reappearing and cross-temporal similar objects. Based on LVOS, we assess existing video object segmentation algorithms and propose a Diverse Dynamic Memory network (DDMemory) that consists of three complementary memory banks to exploit temporal information adequately. The experimental results demonstrate the strengths and weaknesses of prior methods, pointing to promising directions for further study. Our objective is to provide the community with a large and varied benchmark to boost the advancement of long-term VOS. Data and code are available at https://lingyihongfd.github.io/lvos.github.io/. + + + + CHAMPAGNE: Learning Real-world Conversation from Large-Scale Web Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_CHAMPAGNE_Learning_Real-world_Conversation_from_Large-Scale_Web_Videos_ICCV_2023_paper.pdf + Visual information is central to conversation: body gestures and physical behaviour, for example, contribute to meaning that transcends words alone. To date, however, most neural conversational models are limited to just text. We introduce CHAMPAGNE, a generative model of conversations that can account for visual contexts. To train CHAMPAGNE, we collect and release YTD-18M, a large-scale corpus of 18M video-based dialogues. YTD-18M is constructed from web videos: crucial to our data collection pipeline is a pretrained language model that converts error-prone automatic transcripts to a cleaner dialogue format while maintaining meaning. Human evaluation reveals that YTD-18M is more sensible and specific than prior resources (MMDialog, 1M dialogues), while maintaining visual-groundedness. Experiments demonstrate that 1) CHAMPAGNE learns to conduct conversation from YTD-18M; and 2) when fine-tuned, it achieves state-of-the-art results on four vision-language tasks focused on real-world conversations. We release data, models, and code. 
+ + + + DeformToon3D: Deformable Neural Radiance Fields for 3D Toonification + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_DeformToon3D_Deformable_Neural_Radiance_Fields_for_3D_Toonification_ICCV_2023_paper.pdf + In this paper, we address the challenging problem of 3D toonification, which involves transferring the style of an artistic domain onto a target 3D face with stylized geometry and texture. Although fine-tuning a pre-trained 3D GAN on the artistic domain can produce reasonable performance, this strategy has limitations in the 3D domain. In particular, fine-tuning can deteriorate the original GAN latent space, which affects subsequent semantic editing, and requires independent optimization and storage for each new style, limiting flexibility and efficient deployment. To overcome these challenges, we propose DeformToon3D, an effective toonification framework tailored for hierarchical 3D GAN. Our approach decomposes 3D toonification into subproblems of geometry and texture stylization to better preserve the original latent space. Specifically, we devise a novel StyleField that predicts conditional 3D deformation to align a real-space NeRF to the style space for geometry stylization. Thanks to the StyleField formulation, which already handles geometry stylization well, texture stylization can be achieved conveniently via adaptive style mixing that injects information from the artistic domain into the decoder of the pre-trained 3D GAN. Due to the unique design, our method enables flexible style degree control and shape-texture-specific style swap. Furthermore, we achieve efficient training without any real-world 2D-3D training pairs but with proxy samples synthesized from off-the-shelf 2D toonification models. + + + + Empowering Low-Light Image Enhancer through Customized Learnable Priors + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Empowering_Low-Light_Image_Enhancer_through_Customized_Learnable_Priors_ICCV_2023_paper.pdf + Deep neural networks have achieved remarkable progress in enhancing low-light images by improving their brightness and eliminating noise. However, most existing methods construct end-to-end mapping networks heuristically, neglecting the intrinsic prior of the image enhancement task and lacking transparency and interpretability. Although some unfolding solutions have been proposed to relieve these issues, they rely on proximal operator networks that deliver ambiguous and implicit priors. In this work, we propose a paradigm for low-light image enhancement that explores the potential of customized learnable priors to improve the transparency of the deep unfolding paradigm. Motivated by the powerful feature representation capability of Masked Autoencoder (MAE), we customize MAE-based illumination and noise priors and redevelop them from two perspectives: 1) structure flow: we train the MAE from a normal-light image to its illumination properties and then embed it into the proximal operator design of the unfolding architecture; and 2) optimization flow: we train MAE from a normal-light image to its gradient representation and then employ it as a regularization term to constrain noise in the model output. These designs improve the interpretability and representation capability of the model. Extensive experiments on multiple low-light image enhancement datasets demonstrate the superiority of our proposed paradigm over state-of-the-art methods. Code is available at https://github.com/zheng980629/CUE. 
+ + + + Guiding Image Captioning Models Toward More Specific Captions + http://openaccess.thecvf.com//content/ICCV2023/papers/Kornblith_Guiding_Image_Captioning_Models_Toward_More_Specific_Captions_ICCV_2023_paper.pdf + Image captioning is conventionally formulated as the task of generating captions that match the conditional distribution of reference image-caption pairs. However, reference captions in standard captioning datasets are short and may not uniquely identify the images they describe. These problems are further exacerbated when models are trained directly on image-alt text pairs collected from the internet. In this work, we show that it is possible to generate more specific captions with minimal changes to the training process. We implement classifier-free guidance (Ho & Salimans, 2021) for an autoregressive captioning model by fine-tuning it to estimate both conditional and unconditional distributions over captions. The guidance scale applied at decoding controls a trade-off between maximizing p(caption|image) and p(image|caption). Compared to standard greedy decoding, decoding with a guidance scale of 2 substantially improves reference-free metrics such as CLIPScore (0.808 vs. 0.775) and caption->image retrieval performance in the CLIP embedding space (recall@1 44.6% vs. 26.5%), but worsens standard reference-based captioning metrics (e.g., CIDEr 78.6 vs. 126.1). We further explore the use of language models to guide the decoding process, obtaining small improvements over the Pareto frontier of reference-free vs. reference-based captioning metrics that arises from classifier-free guidance, and substantially improving the grammaticality of captions generated from a model trained only on minimally curated web data. + + + + Towards Effective Instance Discrimination Contrastive Loss for Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Towards_Effective_Instance_Discrimination_Contrastive_Loss_for_Unsupervised_Domain_Adaptation_ICCV_2023_paper.pdf + Domain adaptation (DA) aims to transfer knowledge from a label-rich source domain to a related but label-scarce target domain. Recently, increasing research has focused on exploring the data structure of the target domain. In light of the recent success of Instance Discrimination Contrastive (IDCo) loss in self-supervised learning, we try directly applying it to domain adaptation tasks. However, the improvement is very limited, which motivates us to rethink its underlying limitations for domain adaptation tasks. An intuitive limitation is that a pair of samples belonging to the same class could be treated as negatives. Here we argue that using low-confidence samples to construct positive and negative pairs can alleviate this issue and is more suitable for IDCo loss. Another limitation is that IDCo loss cannot capture enough semantic information. We address this by introducing domain-invariant and accurate semantic information from classifier weights and input data. Specifically, we propose class relationship enhanced features. It uses probability-weighted class prototypes as the input features of IDCo loss, which can implicitly transfer the domain-invariant class relationship. We further propose a target-dominated cross-domain mixup that can incorporate accurate semantic information from the source domain. 
We evaluate the proposed method in unsupervised DA and other DA settings, and extensive experimental results reveal that our method can make IDCo loss more effective and achieve state-of-the-art performance. + + + + FrozenRecon: Pose-free 3D Scene Reconstruction with Frozen Depth Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_FrozenRecon_Pose-free_3D_Scene_Reconstruction_with_Frozen_Depth_Models_ICCV_2023_paper.pdf + 3D scene reconstruction is a long-standing vision task. Existing approaches can be categorized into geometry-based and learning-based methods. The former leverages multi-view geometry but may face catastrophic failures due to the reliance on accurate pixel correspondence across views, while the latter mitigates these issues by learning 2D or 3D representation directly. However, without a large-scale video or 3D training data, it can hardly be generalized to diverse real-world scenarios due to the presence of tens of millions or even billions of optimization parameters in the deep network. Recently, robust monocular depth estimation models trained with large-scale datasets have been proven to possess weak 3D geometry prior, but they are insufficient for reconstruction due to the unknown camera parameters, the affine-invariant property, and inter-frame inconsistency. To address these issues, we propose a novel test-time optimization approach that can transfer the robustness of affine-invariant depth models such as LeReS to challenging diverse scenes while ensuring inter-frame consistency, with only dozens of parameters to optimize per video frame. Specifically, our approach involves freezing the pre-trained affine-invariant depth model's depth predictions, rectifying them by optimizing the unknown scale-shift values with a geometric consistency alignment module, and employing the resulting scale-consistent depth maps to robustly obtain camera poses and achieve dense scene reconstruction, even in low-texture regions. Experiments show that our method achieves state-of-the-art cross-dataset reconstruction on five zero-shot testing datasets. Code is available at: https://aim-uofa.github.io/FrozenRecon/ + + + + Affective Image Filter: Reflecting Emotions from Text to Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Weng_Affective_Image_Filter_Reflecting_Emotions_from_Text_to_Images_ICCV_2023_paper.pdf + Understanding the emotions in text and presenting them visually is a very challenging problem that requires a deep understanding of natural language and high-quality image synthesis simultaneously. In this work, we propose Affective Image Filter (AIF), a novel model that is able to understand the visually-abstract emotions from the text and reflect them to visually-concrete images with appropriate colors and textures. We build our model based on the multi-modal transformer architecture, which unifies both images and texts into tokens and encodes the emotional prior knowledge. Various loss functions are proposed to understand complex emotions and produce appropriate visualization. In addition, we collect and contribute a new dataset with abundant aesthetic images and emotional texts for training and evaluating the AIF model. We carefully design four quantitative metrics and conduct a user study to comprehensively evaluate the performance, which demonstrates our AIF model outperforms state-of-the-art methods and could evoke specific emotional responses from human observers. 
+ + + + Content-Aware Local GAN for Photo-Realistic Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_Content-Aware_Local_GAN_for_Photo-Realistic_Super-Resolution_ICCV_2023_paper.pdf + Recently, GAN has successfully contributed to making single-image super-resolution (SISR) methods produce more realistic images. However, natural images have complex distribution in the real world, and a single classifier in the discriminator may not have enough capacity to classify real and fake samples, making the preceding SR network generate unpleasing noise and artifacts. To solve the problem, we propose a novel content-aware local GAN framework, CAL-GAN, which processes a large and complicated distribution of real-world images by dividing them into smaller subsets based on similar contents. Our mixture of classifiers (MoC) design allocates different super-resolved patches to corresponding expert classifiers. Additionally, we introduce novel routing and orthogonality loss terms so that different classifiers can handle various contents and learn separable features. By feeding similar distributions into the corresponding specialized classifiers, CAL-GAN enhances the representation power of existing super-resolution models, achieving state-of-the-art perceptual performance on standard benchmarks and real-world images without modifying the generator-side architecture. + + + + Structure-Aware Surface Reconstruction via Primitive Assembly + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Structure-Aware_Surface_Reconstruction_via_Primitive_Assembly_ICCV_2023_paper.pdf + We propose a novel and efficient method for reconstructing manifold surfaces from point clouds. Unlike previous approaches that use dense implicit reconstructions or piecewise approximations and overlook inherent structures like quadrics in CAD models, our method faithfully preserves these quadric structures by assembling primitives. To achieve high-quality primitive extraction, we use a variational shape approximation, followed by a mesh arrangement for space partitioning and candidate primitive patches generation. We then introduce an effective pruning mechanism to classify candidate primitive patches as active or inactive, and further prune inactive patches to reduce the search space and speed up surface extraction significantly. Finally, the optimal active patches are computed by a binary linear programming and assembled as manifold and watertight surfaces. We perform extensive experiments on a wide range of CAD objects to validate its effectiveness. + + + + FineDance: A Fine-grained Choreography Dataset for 3D Full Body Dance Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_FineDance_A_Fine-grained_Choreography_Dataset_for_3D_Full_Body_Dance_ICCV_2023_paper.pdf + Generating full-body and multi-genre dance sequences from given music is a challenging task, due to the limitations of existing datasets and the inherent complexity of the fine-grained hand motion and dance genres. To address these problems, we propose FineDance, which contains 14.6 hours of music-dance paired data, with fine-grained hand motions, fine-grained genres (22 dance genres), and accurate posture. To the best of our knowledge, FineDance is the largest music-dance paired dataset with the most dance genres. 
Additionally, to address the monotonous and unnatural hand movements present in previous methods, we propose a full-body dance generation network, which utilizes the diverse generation capabilities of the diffusion model to solve the monotony problem, and uses expert nets to solve the unnaturalness problem. To further enhance the genre matching and long-term stability of generated dances, we propose a Genre&Coherent aware Retrieval Module. Besides, we propose a new metric named Genre Matching Score to measure the genre matching between dance and music. Quantitative and qualitative experiments demonstrate the quality of FineDance and the state-of-the-art performance of FineNet. + + + + Improving Online Lane Graph Extraction by Object-Lane Clustering + http://openaccess.thecvf.com//content/ICCV2023/papers/Can_Improving_Online_Lane_Graph_Extraction_by_Object-Lane_Clustering_ICCV_2023_paper.pdf + Autonomous driving requires accurate local scene understanding information. To this end, autonomous agents deploy object detection and online BEV lane graph extraction methods as a part of their perception stack. In this work, we propose an architecture and loss formulation to improve the accuracy of local lane graph estimates by using 3D object detection outputs. The proposed method learns to assign the objects to centerlines by considering the centerlines as cluster centers and the objects as data points to be assigned a probability distribution over the cluster centers. This training scheme ensures direct supervision on the relationship between lanes and objects, thus leading to better performance. The proposed method improves lane graph estimation substantially over state-of-the-art methods. The extensive ablations show that our method can achieve significant performance improvements by using the outputs of existing 3D object detection methods. Since our method uses the detection outputs rather than detection method intermediate representations, a single model of our method can use any detection method at test time. The code will be made publicly available. + + + + Video Background Music Generation: Dataset, Method and Evaluation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhuo_Video_Background_Music_Generation_Dataset_Method_and_Evaluation_ICCV_2023_paper.pdf + Music is essential when editing videos, but selecting music manually is difficult and time-consuming. Thus, we seek to automatically generate background music tracks given video input. This is a challenging task since it requires music-video datasets, efficient architectures for video-to-music generation, and reasonable metrics, none of which currently exist. To close this gap, we introduce a complete recipe including dataset, benchmark model, and evaluation metric for video background music generation. We present SymMV, a video and symbolic music dataset with various musical annotations. To the best of our knowledge, it is the first video-music dataset with rich musical annotations. We also propose a benchmark video background music generation framework named V-MusProd, which utilizes music priors of chords, melody, and accompaniment along with video-music relations of semantic, color, and motion features. To address the lack of objective metrics for video-music correspondence, we design a retrieval-based metric VMCP built upon a powerful video-music representation learning model. Experiments show that with our dataset, V-MusProd outperforms the state-of-the-art method in both music quality and correspondence with videos. 
We believe our dataset, benchmark model, and evaluation metric will boost the development of video background music generation. Our dataset and code are available at https://github.com/zhuole1025/SymMV. + + + + Markov Game Video Augmentation for Action Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Aziere_Markov_Game_Video_Augmentation_for_Action_Segmentation_ICCV_2023_paper.pdf + This paper addresses data augmentation for action segmentation. Our key novelty is that we augment the original training videos in the deep feature space, not in the visual spatiotemporal domain as done by previous work. For augmentation, we modify original deep features of video frames such that the resulting embeddings fall closer to the class decision boundaries. Also, we edit action sequences of the original training videos (a.k.a. transcripts) by inserting, deleting, and replacing actions such that the resulting transcripts are close in edit distance to the ground truth ones. For our data augmentation we resort to reinforcement learning, instead of more common supervised learning, since we do not have access to reliable oracles which would provide supervision about the optimal data modifications in the deep feature space. For modifying frame embeddings, we use a meta-model formulated as a Markov Game with multiple self-interested agents. Also, new transcripts are generated using a fast, parameter-free Monte Carlo tree search. Our experiments show that the proposed data augmentation of the Breakfast, GTEA, and 50Salads datasets leads to significant performance gains of several state of the art action segmenters. + + + + RegFormer: An Efficient Projection-Aware Transformer Network for Large-Scale Point Cloud Registration + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_RegFormer_An_Efficient_Projection-Aware_Transformer_Network_for_Large-Scale_Point_Cloud_ICCV_2023_paper.pdf + Although point cloud registration has achieved remarkable advances in object-level and indoor scenes, large-scale registration methods are rarely explored. Challenges mainly arise from the huge point number, complex distribution, and outliers of outdoor LiDAR scans. In addition, most existing registration works generally adopt a two-stage paradigm: They first find correspondences by extracting discriminative local features and then leverage estimators (eg. RANSAC) to filter outliers, which are highly dependent on well-designed descriptors and post-processing choices. To address these problems, we propose an end-to-end transformer network (RegFormer) for large-scale point cloud alignment without any further post-processing. Specifically, a projection-aware hierarchical transformer is proposed to capture long-range dependencies and filter outliers by extracting point features globally. Our transformer has linear complexity, which guarantees high efficiency even for large-scale scenes. Furthermore, to effectively reduce mismatches, a bijective association transformer is designed for regressing the initial transformation. Extensive experiments on KITTI and NuScenes datasets demonstrate that our RegFormer achieves competitive performance in terms of both accuracy and efficiency. Codes are available at https://github.com/IRMVLab/RegFormer. 
+ + + + Graphics2RAW: Mapping Computer Graphics Images to Sensor RAW Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Seo_Graphics2RAW_Mapping_Computer_Graphics_Images_to_Sensor_RAW_Images_ICCV_2023_paper.pdf + Computer graphics (CG) rendering platforms produce imagery with ever-increasing photo realism. The narrowing domain gap between real and synthetic imagery makes it possible to use CG images as training data for deep learning models targeting high-level computer vision tasks, such as autonomous driving and semantic segmentation. CG images, however, are currently not suitable for low-level vision tasks targeting RAW sensor images. This is because RAW images are encoded in sensor-specific color spaces and incur pre-white-balance color casts caused by the sensor's response to scene illumination. CG images are rendered directly to a device-independent perceptual color space without needing white balancing. As a result, it is necessary to apply a mapping procedure to close the domain gap between graphics and RAW images. To this end, we introduce a framework to process graphics images to mimic RAW sensor images accurately. Our approach allows a one-to-many mapping, where a single graphics image can be transformed to match multiple sensors and multiple scene illuminations. In addition, our approach requires only a handful of example RAW-DNG files from the target sensor as parameters for the mapping process. We compare our method to alternative strategies and show that our approach produces more realistic RAW images and provides better results on three low-level vision tasks: RAW denoising, illumination estimation, and neural rendering for night photography. Finally, as part of this work, we provide a dataset of 292 realistic CG images for training low-light imaging models. + + + + VAD: Vectorized Scene Representation for Efficient Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_VAD_Vectorized_Scene_Representation_for_Efficient_Autonomous_Driving_ICCV_2023_paper.pdf + Autonomous driving requires a comprehensive understanding of the surrounding environment for reliable trajectory planning. Previous works rely on dense rasterized scene representation (e.g., agent occupancy and semantic map) to perform planning, which is computationally intensive and misses the instance-level structure information. In this paper, we propose VAD, an end-to-end vectorized paradigm for autonomous driving, which models the driving scene as a fully vectorized representation. The proposed vectorized paradigm has two significant advantages. On one hand, VAD exploits the vectorized agent motion and map elements as explicit instance-level planning constraints which effectively improves planning safety. On the other hand, VAD runs much faster than previous end-to-end planning methods by getting rid of computation-intensive rasterized representation and hand-designed post-processing steps. VAD achieves state-of-the-art end-to-end planning performance on the nuScenes dataset, outperforming the previous best method by a large margin. Our base model, VAD-Base, greatly reduces the average collision rate by 29.0% and runs 2.5x faster. Besides, a lightweight variant, VAD-Tiny, greatly improves the inference speed (up to 9.3x) while achieving comparable planning performance. We believe the excellent performance and the high efficiency of VAD are critical for the real-world deployment of an autonomous driving system. 
Code and models are available at https://github.com/hustvl/VAD for facilitating future research. + + + + Batch-based Model Registration for Fast 3D Sherd Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Batch-based_Model_Registration_for_Fast_3D_Sherd_Reconstruction_ICCV_2023_paper.pdf + 3D reconstruction techniques have widely been used for digital documentation of archaeological fragments. However, efficient digital capture of fragments remains a challenge. In this work, we aim to develop a portable, high-throughput, and accurate reconstruction system for efficient digitization of fragments excavated in archaeological sites. To realize high-throughput digitization of large numbers of objects, an effective strategy is to perform scanning and reconstruction in batches. However, effective batch-based scanning and reconstruction face two key challenges: 1) how to correlate partial scans of the same object from multiple batch scans, and 2) how to register and reconstruct complete models from partial scans that exhibit only small overlaps. To tackle these two challenges, we develop a new batch-based matching algorithm that pairs the front and back sides of the fragments, and a new Bilateral Boundary ICP algorithm that can register partial scans sharing very narrow overlapping regions. Extensive validation in labs and testing in excavation sites demonstrate that these designs enable efficient batch-based scanning for fragments. We show that such a batch-based scanning and reconstruction pipeline can have immediate applications in digitizing sherds in archaeological excavations. + + + + HiFace: High-Fidelity 3D Face Reconstruction by Learning Static and Dynamic Details + http://openaccess.thecvf.com//content/ICCV2023/papers/Chai_HiFace_High-Fidelity_3D_Face_Reconstruction_by_Learning_Static_and_Dynamic_ICCV_2023_paper.pdf + 3D Morphable Models (3DMMs) demonstrate great potential for reconstructing faithful and animatable 3D facial surfaces from a single image. The facial surface is influenced by the coarse shape, as well as the static detail (e.g., person-specific appearance) and dynamic detail (e.g., expression-driven wrinkles). Previous work struggles to decouple the static and dynamic details through image-level supervision, leading to reconstructions that are not realistic. In this paper, we aim at high-fidelity 3D face reconstruction and propose HiFace to explicitly model the static and dynamic details. Specifically, the static detail is modeled as the linear combination of a displacement basis, while the dynamic detail is modeled as the linear interpolation of two displacement maps with polarized expressions. We exploit several loss functions to jointly learn the coarse shape and fine details with both synthetic and real-world datasets, which enable HiFace to reconstruct high-fidelity 3D shapes with animatable details. Extensive quantitative and qualitative experiments demonstrate that HiFace presents state-of-the-art reconstruction quality and faithfully recovers both the static and dynamic details. + + + + Fast and Accurate Transferability Measurement by Evaluating Intra-class Feature Variance + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Fast_and_Accurate_Transferability_Measurement_by_Evaluating_Intra-class_Feature_Variance_ICCV_2023_paper.pdf + Given a set of pre-trained models, how can we quickly and accurately find the most useful pre-trained model for a downstream task? 
Transferability measurement quantifies how transferable a pre-trained model learned on a source task is to a target task. It is used for quickly ranking pre-trained models for a given task and thus becomes a crucial step for transfer learning. Existing methods measure transferability as the discrimination ability of a source model for target data before transfer learning, which cannot accurately estimate the fine-tuning performance. Some of them restrict the application of transferability measurement to selecting the best supervised pre-trained models that have classifiers. It is important to have a general method for measuring transferability that can be applied in a variety of situations, such as selecting the best self-supervised pre-trained models that do not have classifiers, and selecting the best transferring layer for a target task. In this work, we propose TMI (TRANSFERABILITY MEASUREMENT WITH INTRA-CLASS FEATURE VARIANCE), a fast and accurate algorithm to measure transferability. We view transferability as the generalization of a pre-trained model on a target task by measuring intra-class feature variance. Intra-class variance evaluates the adaptability of the model to a new task, which measures how transferable the model is. Compared to previous studies that estimate how discriminative the models are, intra-class variance is more accurate because it does not require an optimal feature extractor and classifier. Extensive experiments on real-world datasets show that TMI outperforms competitors for selecting the top-5 best models, and exhibits consistently better correlation in 13 out of 17 cases. + + + + Algebraically Rigorous Quaternion Framework for the Neural Network Pose Estimation Problem http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Algebraically_Rigorous_Quaternion_Framework_for_the_Neural_Network_Pose_Estimation_ICCV_2023_paper.pdf The 3D pose estimation problem -- aligning pairs of noisy 3D point clouds -- is a problem with a wide variety of real-world applications. Here we focus on the use of quaternion-based neural network approaches to this problem and apparent anomalies that have arisen in previous efforts to resolve them. In addressing these anomalies, we draw heavily from the extensive literature on closed-form methods to solve this problem. We suggest that the major concerns that have been put forward could be resolved using a simple multi-valued training target derived from rigorous theoretical properties of the rotation-to-quaternion map of Bar-Itzhack. This multi-valued training target is then demonstrated to have good performance for both simulated and ModelNet targets. We provide a comprehensive theoretical context, using the quaternion adjugate, to confirm and establish the necessity of replacing single-valued quaternion functions by quaternions treated in the extended domain of multiple-charted manifolds. + + + + CVSformer: Cross-View Synthesis Transformer for Semantic Scene Completion http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_CVSformer_Cross-View_Synthesis_Transformer_for_Semantic_Scene_Completion_ICCV_2023_paper.pdf Semantic scene completion (SSC) requires an accurate understanding of the geometric and semantic relationships between the objects in the 3D scene for reasoning about the occluded objects. The popular SSC methods voxelize the 3D objects, allowing the deep 3D convolutional network (3D CNN) to learn the object relationships from the complex scenes. 
However, the current networks lack the controllable kernels to model the object relationship across multiple views, where appropriate views provide the relevant information for suggesting the existence of the occluded objects. In this paper, we propose Cross-View Synthesis Transformer (CVSformer), which consists of Multi-View Feature Synthesis and Cross-View Transformer for learning cross-view object relationships. In the multi-view feature synthesis, we use a set of 3D convolutional kernels rotated differently to compute the multi-view features for each voxel. In the cross-view transformer, we employ the cross-view fusion to comprehensively learn the cross-view relationships, which form useful information for enhancing the features of individual views. We use the enhanced features to predict the geometric occupancies and semantic labels of all voxels. We evaluate CVSformer on public datasets, where CVSformer yields state-of-the-art results. Our code is available at https://github.com/donghaotian123/CVSformer. + + + + UrbanGIRAFFE: Representing Urban Scenes as Compositional Generative Neural Feature Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_UrbanGIRAFFE_Representing_Urban_Scenes_as_Compositional_Generative_Neural_Feature_Fields_ICCV_2023_paper.pdf + Generating photorealistic images with controllable camera pose and scene contents is essential for many applications including AR/VR and simulation. Despite the fact that rapid progress has been made in 3D-aware generative models, most existing methods focus on object-centric images and are not applicable to generating urban scenes for free camera viewpoint control and scene editing. To address this challenging task, we propose UrbanGIRAFFE, which uses a coarse 3D panoptic prior, including the layout distribution of uncountable stuff and countable objects, to provide semantic and geometric prior. Our model is compositional and controllable as it breaks down the scene into stuff, objects, and sky. Using stuff prior in the form of semantic voxel grids, we build a conditioned stuff generator that effectively incorporates the coarse semantic and geometry information. The object layout prior further allows us to learn an object generator from cluttered scenes. With proper loss functions, our approach facilitates photorealistic 3D-aware image synthesis with diverse controllability, including large camera movement, stuff editing, and object manipulation. We validate the effectiveness of our model on both synthetic and real-world datasets, including the challenging KITTI-360 dataset. + + + + Active Neural Mapping + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_Active_Neural_Mapping_ICCV_2023_paper.pdf + We address the problem of active mapping with a continually-learned neural scene representation, namely Active Neural Mapping. The key lies in actively finding the target space to be explored with efficient agent movement, thus minimizing the map uncertainty on-the-fly within a previously unseen environment. In this paper, we examine the weight space of the continually-learned neural field, and show empirically that the neural variability, the prediction robustness against random weight perturbation, can be directly utilized to measure the instant uncertainty of the neural map. Together with the continuous geometric information inherited in the neural map, the agent can be guided to find a traversable path to gradually gain knowledge of the environment. 
We present for the first time an online active mapping system with a coordinate-based implicit neural representation. Experiments in the visually-realistic Gibson and Matterport3D environment demonstrate the efficacy of the proposed method. + + + + RecRecNet: Rectangling Rectified Wide-Angle Images by Thin-Plate Spline Model and DoF-based Curriculum Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Liao_RecRecNet_Rectangling_Rectified_Wide-Angle_Images_by_Thin-Plate_Spline_Model_and_ICCV_2023_paper.pdf + The wide-angle lens shows appealing applications in VR technologies, but it introduces severe radial distortion into its captured image. To recover the realistic scene, previous works devote to rectifying the content of the wide-angle image. However, such a rectification solution inevitably distorts the image boundary, which changes related geometric distributions and misleads the current vision perception models. In this work, we explore constructing a win-win representation on both content and boundary by contributing a new learning model, i.e., Rectangling Rectification Network (RecRecNet). In particular, we propose a thin-plate spline (TPS) module to formulate the non-linear and non-rigid transformation for rectangling images. By learning the control points on the rectified image, our model can flexibly warp the source structure to the target domain and achieves an end-to-end unsupervised deformation. To relieve the complexity of structure approximation, we then inspire our RecRecNet to learn the gradual deformation rules with a DoF (Degree of Freedom)-based curriculum learning. By increasing the DoF in each curriculum stage, namely, from similarity transformation (4-DoF) to homography transformation (8-DoF), the network is capable of investigating more detailed deformations, offering fast convergence on the final rectangling task. Experiments show the superiority of our solution over the compared methods on both quantitative and qualitative evaluations. The code and dataset will be made available. + + + + Learning Versatile 3D Shape Generation with Improved Auto-regressive Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_Learning_Versatile_3D_Shape_Generation_with_Improved_Auto-regressive_Models_ICCV_2023_paper.pdf + Auto-Regressive (AR) models have achieved impressive results in 2D image generation by modeling joint distributions in the grid space. While this approach has been extended to the 3D domain for powerful shape generation, it still has two limitations: expensive computations on volumetric grids and ambiguous auto-regressive order along grid dimensions. To overcome these limitations, we propose the Improved Auto-regressive Model (ImAM) for 3D shape generation, which applies discrete representation learning based on a latent vector instead of volumetric grids. Our approach not only reduces computational costs but also preserves essential geometric details by learning the joint distribution in a more tractable order. Moreover, thanks to the simplicity of our model architecture, we can naturally extend it from unconditional to conditional generation by concatenating various conditioning inputs, such as point clouds, categories, images, and texts. Extensive experiments demonstrate that ImAM can synthesize diverse and faithful shapes of multiple categories, achieving state-of-the-art performance. 
+ + + + DETA: Denoised Task Adaptation for Few-Shot Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_DETA_Denoised_Task_Adaptation_for_Few-Shot_Learning_ICCV_2023_paper.pdf Test-time task adaptation in few-shot learning aims to adapt a pre-trained task-agnostic model for capturing task-specific knowledge of the test task, relying only on a few labeled support samples. Previous approaches generally focus on developing advanced algorithms to achieve the goal, while neglecting the inherent problems of the given support samples. In fact, with only a handful of samples available, the adverse effect of either the image noise (a.k.a. X-noise) or the label noise (a.k.a. Y-noise) from support samples can be severely amplified. To address this challenge, in this work we propose DEnoised Task Adaptation (DETA), a first, unified image- and label-denoising framework orthogonal to existing task adaptation approaches. Without extra supervision, DETA filters out task-irrelevant, noisy representations by taking advantage of both global visual information and local region details of support samples. On the challenging Meta-Dataset, DETA consistently improves the performance of a broad spectrum of baseline methods applied to various pre-trained models. Notably, by tackling the overlooked image noise in Meta-Dataset, DETA establishes new state-of-the-art results. Code is released at https://github.com/JimZAI/DETA. + + + + Robust Frame-to-Frame Camera Rotation Estimation in Crowded Scenes http://openaccess.thecvf.com//content/ICCV2023/papers/Delattre_Robust_Frame-to-Frame_Camera_Rotation_Estimation_in_Crowded_Scenes_ICCV_2023_paper.pdf We present an approach to estimating camera rotation in crowded, real-world scenes from handheld monocular video. While camera rotation estimation is a well-studied problem, no previous methods exhibit both high accuracy and acceptable speed in this setting. Because the setting is not addressed well by other datasets, we provide a new dataset and benchmark, with high-accuracy, rigorously verified ground truth, on 17 video sequences. Methods developed for wide baseline stereo (e.g., 5-point methods) perform poorly on monocular video. On the other hand, methods used in autonomous driving (e.g., SLAM) leverage specific sensor setups, specific motion models, or local optimization strategies (lagging batch processing) and do not generalize well to handheld video. Finally, for dynamic scenes, commonly used robustification techniques like RANSAC require large numbers of iterations, and become prohibitively slow. We introduce a novel generalization of the Hough transform on SO(3) to efficiently and robustly find the camera rotation most compatible with optical flow. Among comparably fast methods, ours reduces error by almost 50% over the next best, and is more accurate than any method, irrespective of speed. This represents a strong new performance point for crowded scenes, an important setting for computer vision. The code and the dataset are available at https://fabiendelattre.com/robust-rotation-estimation. + + + + Bayesian Prompt Learning for Image-Language Model Generalization http://openaccess.thecvf.com//content/ICCV2023/papers/Derakhshani_Bayesian_Prompt_Learning_for_Image-Language_Model_Generalization_ICCV_2023_paper.pdf Foundational image-language models have generated considerable interest due to their efficient adaptation to downstream tasks by prompt learning. 
Prompt learning treats part of the language model input as trainable while freezing the rest, and optimizes an Empirical Risk Minimization objective. However, Empirical Risk Minimization is known to suffer from distributional shifts which hurt generalizability to prompts unseen during training. By leveraging the regularization ability of Bayesian methods, we frame prompt learning from the Bayesian perspective and formulate it as a variational inference problem. Our approach regularizes the prompt space, reduces overfitting to the seen prompts and improves the prompt generalization on unseen prompts. Our framework is implemented by modeling the input prompt space in a probabilistic manner, as an a priori distribution which makes our proposal compatible with prompt learning approaches that are unconditional or conditional on the image. We demonstrate empirically on 15 benchmarks that Bayesian prompt learning provides an appropriate coverage of the prompt space, prevents learning spurious features, and exploits transferable invariant features. This results in better generalization of unseen prompts, even across different datasets and domains. Code available at: https://github.com/saic-fi/Bayesian-Prompt-Learning + + + + DiLiGenT-Pi: Photometric Stereo for Planar Surfaces with Rich Details - Benchmark Dataset and Beyond http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_DiLiGenT-Pi_Photometric_Stereo_for_Planar_Surfaces_with_Rich_Details_-_ICCV_2023_paper.pdf Photometric stereo aims to recover detailed surface shapes from images captured under varying illuminations. However, existing real-world datasets primarily focus on evaluating photometric stereo for general non-Lambertian reflectances and feature bulgy shapes that have a certain height. As shape detail recovery is the key strength of photometric stereo over other 3D reconstruction techniques, and the near-planar surfaces widely exist in cultural relics and manufacturing workpieces, we present a new real-world dataset DiLiGenT-Pi containing 30 near-planar scenes with rich surface details. This dataset enables us to evaluate recent photometric stereo methods specifically for their ability to estimate shape details under diverse materials and to identify open problems such as near-planar surface normal estimation from uncalibrated photometric stereo and surface detail recovery for translucent materials. To inspire future research, this dataset will be open sourced at https://photometricstereo.github.io/diligentpi.html. + + + + Accurate and Fast Compressed Video Captioning http://openaccess.thecvf.com//content/ICCV2023/papers/Shen_Accurate_and_Fast_Compressed_Video_Captioning_ICCV_2023_paper.pdf Existing video captioning approaches typically first sample video frames from a decoded video and then conduct a subsequent process (e.g., feature extraction and/or captioning model learning). In this pipeline, manual frame sampling may ignore key information in videos and thus degrade performance. Additionally, redundant information in the sampled frames may result in low efficiency in the inference of video captioning. 
Addressing this, we study video captioning from a different perspective in compressed domain, which brings multi-fold advantages over the existing pipeline: 1) Compared to raw images from the decoded video, the compressed video, consisting of I-frames, motion vectors and residuals, is highly distinguishable, which allows us to leverage the entire video for learning without manual sampling through a specialized model design; 2) The captioning model is more efficient in inference as smaller and less redundant information is processed. We propose a simple yet effective end-to-end transformer in the compressed domain for video captioning that enables learning from the compressed video for captioning. We show that even with a simple design, our method can achieve state-of-the-art performance on different benchmarks while running almost 2x faster than existing approaches. Code is available at https://github.com/acherstyx/CoCap. + + + + Visible-Infrared Person Re-Identification via Semantic Alignment and Affinity Inference + http://openaccess.thecvf.com//content/ICCV2023/papers/Fang_Visible-Infrared_Person_Re-Identification_via_Semantic_Alignment_and_Affinity_Inference_ICCV_2023_paper.pdf + Visible-infrared person re-identification (VI-ReID) focuses on matching the pedestrian images of the same identity captured by different modality cameras. The part-based methods achieve great success by extracting fine-grained features from feature maps. But most existing part-based methods employ horizontal division to obtain part features suffering from misalignment caused by irregular pedestrian movements. Moreover, most current methods use Euclidean or cosine distance of the output features to measure the similarity without considering the pedestrian relationships. Misaligned part features and naive inference methods both limit the performance of existing works. We propose a Semantic Alignment and Affinity Inference framework (SAAI), which aims to align latent semantic part features with the learnable prototypes and improve inference with affinity information. Specifically, we first propose semantic-aligned feature learning that employs the similarity between pixel-wise features and learnable prototypes to aggregate the latent semantic part features. Then, we devise an affinity inference module to optimize the inference with pedestrian relationships. Comprehensive experimental results conducted on the SYSU-MM01 and RegDB datasets demonstrate the favorable performance of our SAAI framework. Our code will be released at https://github.com/xiaoye-hhh/SAAI. + + + + DG3D: Generating High Quality 3D Textured Shapes by Learning to Discriminate Multi-Modal Diffusion-Renderings + http://openaccess.thecvf.com//content/ICCV2023/papers/Zuo_DG3D_Generating_High_Quality_3D_Textured_Shapes_by_Learning_to_ICCV_2023_paper.pdf + Many virtual reality applications require massive 3D content, which impels the need for low-cost and efficient modeling tools in terms of quality and quantity. In this paper, we present a Diffusion-augmented Generative model to generate high-fidelity 3D textured meshes that can be directly used in modern graphics engines. Challenges in directly generating textured mesh arise from the instability and texture incompleteness of a hybrid framework which contains conversion between 2D features and 3D space. 
To alleviate these difficulties, DG3D incorporates a diffusion-based augmentation module into the min-max game between the 3D tetrahedral mesh generator and 2D renderings discriminators, which stabilizes network optimization and prevents mode collapse in vanilla GANs. We also suggest using multi-modal renderings in discrimination to further increase the aesthetics and completeness of generated textures. Extensive experiments on the public benchmark and real scans show that our proposed DG3D outperforms existing state-of-the-art methods by a large margin, i.e., 5%-40% in FID-3D score and 5%-10% in geometry-related metrics. Code is available at https://github.com/seakforzq/DG3D. + + + + VLSlice: Interactive Vision-and-Language Slice Discovery http://openaccess.thecvf.com//content/ICCV2023/papers/Slyman_VLSlice_Interactive_Vision-and-Language_Slice_Discovery_ICCV_2023_paper.pdf Recent work in vision-and-language demonstrates that large-scale pretraining can learn generalizable models that are efficiently transferable to downstream tasks. While this may improve dataset-scale aggregate metrics, analyzing performance around hand-crafted subgroups targeting specific bias dimensions reveals systemic undesirable behaviors. However, this subgroup analysis is frequently stalled by annotation efforts, which require extensive time and resources to collect the necessary data. Prior art attempts to automatically discover subgroups to circumvent these constraints but typically leverages model behavior on existing task-specific annotations and rapidly degrades on more complex inputs beyond "tabular" data, none of which study vision-and-language models. This paper presents VLSlice, an interactive system enabling user-guided discovery of coherent representation-level subgroups with consistent visiolinguistic behavior, denoted as vision-and-language slices, from unlabeled image sets. We show that VLSlice enables users to quickly generate diverse high-coherency slices in a user study (n=22) and release the tool publicly. + + + + Learning to Ground Instructional Articles in Videos through Narrations http://openaccess.thecvf.com//content/ICCV2023/papers/Mavroudi_Learning_to_Ground_Instructional_Articles_in_Videos_through_Narrations_ICCV_2023_paper.pdf In this paper we present an approach for localizing steps of procedural activities in narrated how-to videos. To deal with the scarcity of labeled data at scale, we source the step descriptions from a language knowledge base (wikiHow) containing instructional articles for a large variety of procedural tasks. Without any form of manual supervision, our model learns to temporally ground the steps of procedural articles in how-to videos by matching three modalities: frames, narrations, and step descriptions. Specifically, our method aligns steps to video by fusing information from two distinct pathways: i) direct alignment of step descriptions to frames, ii) indirect alignment obtained by composing steps-to-narrations with narrations-to-video correspondences. Notably, our approach performs global temporal grounding of all steps in an article at once by exploiting order information, and is trained with step pseudo-labels which are iteratively refined and aggressively filtered. In order to validate our model we introduce a new benchmark -- HT-Step -- obtained by manually annotating a 124-hour subset of HowTo100M with steps sourced from wikiHow articles. 
Experiments on this benchmark as well as zero-shot evaluations on CrossTask demonstrate that our multi-modality alignment yields dramatic gains over several baselines and prior works. Finally, we show that our inner module for matching narration-to-video outperforms by a large margin the state of the art on the HTM-Align narration-video alignment benchmark. + + + + MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yasarla_MAMo_Leveraging_Memory_and_Attention_for_Monocular_Video_Depth_Estimation_ICCV_2023_paper.pdf + We propose MAMo, a novel memory and attention framework for monocular video depth estimation. MAMo can augment and improve any single-image depth estimation networks into video depth estimation models, enabling them to take advantage of the temporal information to predict more accurate depth. In MAMo, we augment model with memory which aids the depth prediction as the model streams through the video. Specifically, the memory stores learned visual and displacement tokens of the previous time instances. This allows the depth network to cross-reference relevant features from the past when predicting depth on the current frame. We introduce a novel scheme to continuously update the memory, optimizing it to keep tokens that correspond with both the past and the present visual information. We adopt attention-based approach to process memory features where we first learn the spatio-temporal relation among the resultant visual and displacement memory tokens using self-attention module. Further, the output features of self-attention are aggregated with the current visual features through cross-attention. The cross-attended features are finally given to a decoder to predict depth on the current frame. Through extensive experiments on several benchmarks, including KITTI, NYU-Depth V2, and DDAD, we show that MAMo consistently improves monocular depth estimation networks and sets new state-of-the-art (SOTA) accuracy. Notably, our MAMo video depth estimation provides higher accuracy with lower latency, when comparing to SOTA cost-volume-based video depth models. + + + + HDG-ODE: A Hierarchical Continuous-Time Model for Human Pose Forecasting + http://openaccess.thecvf.com//content/ICCV2023/papers/Xing_HDG-ODE_A_Hierarchical_Continuous-Time_Model_for_Human_Pose_Forecasting_ICCV_2023_paper.pdf + Recently, human pose estimation has attracted more and more attention due to its importance in many real applications. Although many efforts have been put on extracting 2D poses from static images, there are still some severe problems to be solved. A critical one is occlusion, which is more obvious in multi-person scenarios and makes it even more difficult to recover the corresponding 3D poses. When we consider a sequence of images, the temporal correlation among the contexts can be utilized to help us ease the problem, but most of the current works only rely on discrete-time models and estimate the joint locations of all people within a whole sparse graph. In this paper, we propose a new framework, Hierarchical Dynamic Graph Ordinary Differential Equation (HDG-ODE), to tackle the 3D pose forecasting task from 2D skeleton representations in videos. Our framework adopts ODE, a continuous-time model, as the base to predict the 3D joint positions at any time. 
Considering the structural-property of the skeleton data in representing human poses and the possible irregularity caused by occlusion, we propose the use of dynamic graph convolution as the basic operator. To reduce the computational complexity introduced by the sparsity of the pose graph, our model takes a hierarchical structure where the encoding process at the observation timestamp is done in a cascade manner while the propagation between observations is conducted in parallel. The performance studies on several datasets demonstrate that our model is effective and can out-perform other methods with fewer parameters. + + + + Self-supervised Monocular Underwater Depth Recovery, Image Restoration, and a Real-sea Video Dataset + http://openaccess.thecvf.com//content/ICCV2023/papers/Varghese_Self-supervised_Monocular_Underwater_Depth_Recovery_Image_Restoration_and_a_Real-sea_ICCV_2023_paper.pdf + Underwater (UW) depth estimation and image restoration is a challenging task due to its fundamental ill-posedness and the unavailability of real large-scale UW-paired datasets. UW depth estimation has been attempted before by utilizing either the haze information present or the geometry cue from stereo images or the adjacent frames in a video. To obtain improved estimates of depth from a single UW image, we propose a deep learning (DL) method that utilizes both haze and geometry during training. By harnessing the physical model for UW image formation in conjunction with the view-synthesis constraint on neighboring frames in monocular videos, we perform disentanglement of the input image to also get an estimate of the scene radiance. The proposed method is completely self-supervised and simultaneously outputs the depth map and the restored image in real-time (55 fps). We call this first-ever Underwater Self-supervised deep learning network for simultaneous Recovery of Depth and Image as USe-ReDI-Net. To facilitate monocular self-supervision, we collected a Dataset of Real-world Underwater Videos of Artifacts (DRUVA) in shallow sea waters. DRUVA is the first UW video dataset that contains video sequences of 20 different submerged artifacts with almost full azimuthal coverage of each artifact. Extensive experiments on our DRUVA dataset and other UW datasets establish the superiority of our proposed USe-ReDI-Net over prior art for both UW depth and image recovery. The dataset DRUVA is available at https://github.com/nishavarghese15/DRUVA + + + + Geometrized Transformer for Self-Supervised Homography Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Geometrized_Transformer_for_Self-Supervised_Homography_Estimation_ICCV_2023_paper.pdf + For homography estimation, we propose Geometrized Transformer (GeoFormer), a new detector-free feature matching method. Current detector-free methods, e.g. LoFTR, lack an effective mean to accurately localize small and thus computationally feasible regions for cross-attention diffusion. We resolve the challenge with an extremely simple idea: using the classical RANSAC geometry for attentive region search. Given coarse matches by LoFTR, a homography is obtained with ease. Such a homography allows us to compute cross-attention in a focused manner, where key/value sets required by Transformers can be reduced to small fix-sized regions rather than an entire image. Local features can thus be enhanced by standard Transformers. We integrate GeoFormer into the LoFTR framework. 
By minimizing a multi-scale cross-entropy based matching loss on auto-generated training data, the network is trained in a fully self-supervised manner. Extensive experiments are conducted on multiple real-world datasets covering natural images, heavily manipulated pictures and retinal images. The proposed method compares favorably against the state-of-the-art. + + + + TiDy-PSFs: Computational Imaging with Time-Averaged Dynamic Point-Spread-Functions + http://openaccess.thecvf.com//content/ICCV2023/papers/Shah_TiDy-PSFs_Computational_Imaging_with_Time-Averaged_Dynamic_Point-Spread-Functions_ICCV_2023_paper.pdf + Point-spread-function (PSF) engineering is a powerful computational imaging technique wherein a custom phase mask is integrated into an optical system to encode additional information into captured images. Used in combination with deep learning, such systems now offer state-of-the-art performance at monocular depth estimation, extended depth-of-field imaging, lensless imaging, and other tasks. Inspired by recent advances in spatial light modulator (SLM) technology, this paper answers a natural question: Can one encode additional information and achieve superior performance by changing a phase mask dynamically over time? We first prove that the set of PSFs described by static phase masks is non-convex and that, as a result, time-averaged PSFs generated by dynamic phase masks are fundamentally more expressive. We then demonstrate, in simulation, that time-averaged dynamic (TiDy) phase masks can leverage this increased expressiveness to offer substantially improved monocular depth estimation and extended depth-of-field imaging performance. + + + + Learning Fine-Grained Features for Pixel-Wise Video Correspondences + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Learning_Fine-Grained_Features_for_Pixel-Wise_Video_Correspondences_ICCV_2023_paper.pdf + Video analysis tasks rely heavily on identifying the pixels from different frames that correspond to the same visual target. To tackle this problem, recent studies have advocated feature learning methods that aim to learn distinctive representations to match the pixels, especially in a self-supervised fashion. Unfortunately, these methods have difficulties for tiny or even single-pixel visual targets. Pixel-wise video correspondences were traditionally related to optical flows, which however lead to deterministic correspondences and lack robustness on real-world videos. We address the problem of learning features for establishing pixel-wise correspondences. Motivated by optical flows as well as the self-supervised feature learning, we propose to use not only labeled synthetic videos but also unlabeled real-world videos for learning fine-grained representations in a holistic framework. We adopt an adversarial learning scheme to enhance the generalization ability of the learned features. Moreover, we design a coarse-to-fine framework to pursue high computational efficiency. Our experimental results on a series of correspondence-based tasks demonstrate that the proposed method outperforms state-of-the-art rivals in both accuracy and efficiency. 
+ + + + FS-DETR: Few-Shot DEtection TRansformer with Prompting and without Re-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Bulat_FS-DETR_Few-Shot_DEtection_TRansformer_with_Prompting_and_without_Re-Training_ICCV_2023_paper.pdf + This paper is on Few-Shot Object Detection (FSOD), where given a few templates (examples) depicting a novel class (not seen during training), the goal is to detect all of its occurrences within a set of images. From a practical perspective, an FSOD system must fulfil the following desiderata: (a) it must be used as is, without requiring any fine-tuning at test time, (b) it must be able to process an arbitrary number of novel objects concurrently while supporting an arbitrary number of examples from each class and (c) it must achieve accuracy comparable to a closed system. Towards satisfying (a)-(c), in this work, we make the following contributions: We introduce, for the first time, a simple, yet powerful, few-shot detection transformer (FS-DETR) based on visual prompting that can address both desiderata (a) and (b). Our system builds upon the DETR framework, extending it based on two key ideas: (1) feed the provided visual templates of the novel classes as visual prompts during test time, and (2) "stamp" these prompts with pseudo-class embeddings (akin to soft prompting), which are then predicted at the output of the decoder. Importantly, we show that our system is not only more flexible than existing methods, but also, it makes a step towards satisfying desideratum (c). Specifically, it is significantly more accurate than all methods that do not require fine-tuning and even matches and outperforms the current state-of-the-art fine-tuning based methods on the most well-established benchmarks (PASCAL VOC & MSCOCO). + + + + Learning to Learn: How to Continuously Teach Humans and Machines + http://openaccess.thecvf.com//content/ICCV2023/papers/Singh_Learning_to_Learn_How_to_Continuously_Teach_Humans_and_Machines_ICCV_2023_paper.pdf + Curriculum design is a fundamental component of education. For example, when we learn mathematics at school, we build upon our knowledge of addition to learn multiplication. These and other concepts must be mastered before our first algebra lesson, which also reinforces our addition and multiplication skills. Designing a curriculum for teaching either a human or a machine shares the underlying goal of maximizing knowledge transfer from earlier to later tasks, while also minimizing forgetting of learned tasks. Prior research on curriculum design for image classification focuses on the ordering of training examples during a single offline task. Here, we investigate the effect of the order in which multiple distinct tasks are learned in a sequence. We focus on the online class-incremental continual learning setting, where algorithms or humans must learn image classes one at a time during a single pass through a dataset. We find that curriculum consistently influences learning outcomes for humans and for multiple continual machine learning algorithms across several benchmark datasets. We introduce a novel-object recognition dataset for human curriculum learning experiments and observe that curricula that are effective for humans are highly correlated with those that are effective for machines. As an initial step towards automated curriculum design for online class-incremental learning, we propose a novel algorithm, dubbed Curriculum Designer (CD), that designs and ranks curricula based on inter-class feature similarities. 
We find significant overlap between curricula that are empirically highly effective and those that are highly ranked by our CD. Our study establishes a framework for further research on teaching humans and machines to learn continuously using optimized curricula. + + + + A 5-Point Minimal Solver for Event Camera Relative Motion Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_A_5-Point_Minimal_Solver_for_Event_Camera_Relative_Motion_Estimation_ICCV_2023_paper.pdf + Event-based cameras are ideal for line-based motion estimation, since they predominantly respond to edges in the scene. However, accurately determining the camera displacement based on events continues to be an open problem. This is because line feature extraction and dynamics estimation are tightly coupled when using event cameras, and no precise model is currently available for describing the complex structures generated by lines in the space-time volume of events. We solve this problem by deriving the correct non-linear parametrization of such manifolds, which we term eventails, and demonstrate its application to event-based linear motion estimation, with known rotation from an Inertial Measurement Unit. Using this parametrization, we introduce a novel minimal 5-point solver that jointly estimates line parameters and linear camera velocity projections, which can be fused into a single, averaged linear velocity when considering multiple lines. We demonstrate on both synthetic and real data that our solver generates more stable relative motion estimates than other methods while capturing more inliers than clustering based on spatio-temporal planes. In particular, our method consistently achieves a 100% success rate in estimating linear velocity where existing closed-form solvers only achieve between 23% and 70%. The proposed eventails contribute to a better understanding of spatio-temporal event-generated geometries and we thus believe it will become a core building block of future event-based motion estimation algorithms. + + + + TM2D: Bimodality Driven 3D Dance Generation via Music-Text Integration + http://openaccess.thecvf.com//content/ICCV2023/papers/Gong_TM2D_Bimodality_Driven_3D_Dance_Generation_via_Music-Text_Integration_ICCV_2023_paper.pdf + We propose a novel task for generating 3D dance movements that simultaneously incorporate both text and music modalities. Unlike existing works that generate dance movements using a single modality such as music, our goal is to produce richer dance movements guided by the instructive information provided by the text. However, the lack of paired motion data with both music and text modalities limits the ability to generate dance movements that integrate both. To alleviate this challenge, we propose to utilize a 3D human motion VQ-VAE to project the motions of the two datasets into a latent space consisting of quantized vectors, which effectively mix the motion tokens from the two datasets with different distributions for training. Additionally, we propose a cross-modal transformer to integrate text instructions into motion generation architecture for generating 3D dance movements without degrading the performance of music-conditioned dance generation. To better evaluate the quality of the generated motion, we introduce two novel metrics, namely Motion Prediction Distance (MPD) and Freezing Score (FS), to measure the coherence and freezing percentage of the generated motion. 
Extensive experiments show that our approach can generate realistic and coherent dance movements conditioned on both text and music while maintaining comparable performance with the two single modalities. Code is available at https://garfield-kh.github.io/TM2D/. + + + + Bootstrap Motion Forecasting With Self-Consistent Constraints + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Bootstrap_Motion_Forecasting_With_Self-Consistent_Constraints_ICCV_2023_paper.pdf + We present a novel framework to bootstrap Motion forecasting with Self-consistent Constraints (MISC). The motion forecasting task aims at predicting future trajectories of vehicles by incorporating spatial and temporal information from the past. A key design of MISC is the proposed Dual Consistency Constraints that regularize the predicted trajectories under spatial and temporal perturbation during training. Also, to model the multi-modality in motion forecasting, we design a novel self-ensembling scheme to obtain accurate teacher targets to enforce the self-constraints with multi-modality supervision. With explicit constraints from multiple teacher targets, we observe a clear improvement in the prediction performance. Extensive experiments on the Argoverse motion forecasting benchmark and Waymo Open Motion dataset show that MISC significantly outperforms the state-of-the-art methods. As the proposed strategies are general and can be easily incorporated into other motion forecasting approaches, we also demonstrate that our proposed scheme consistently improves the prediction performance of several existing methods. + + + + CDAC: Cross-domain Attention Consistency in Transformer for Domain Adaptive Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_CDAC_Cross-domain_Attention_Consistency_in_Transformer_for_Domain_Adaptive_Semantic_ICCV_2023_paper.pdf + While transformers have greatly boosted performance in semantic segmentation, domain adaptive transformers are not yet well explored. We identify that the domain gap can cause discrepancies in self-attention. Due to this gap, the transformer attends to spurious regions or pixels, which deteriorates accuracy on the target domain. We propose Cross-Domain Attention Consistency (CDAC), to perform adaptation on attention maps using cross-domain attention layers that share features between source and target domains. Specifically, we impose consistency between predictions from cross-domain attention and self-attention modules to encourage similar distributions across domains in both the attention and output of the model, i.e., attention-level and output-level alignment. We also enforce consistency in attention maps between different augmented views to further strengthen the attention-based alignment. Combining these two components, CDAC mitigates the discrepancy in attention maps across domains and further boosts the performance of the transformer under unsupervised domain adaptation settings. Our method is evaluated on various widely used benchmarks and outperforms the state-of-the-art baselines, including GTAV-to-Cityscapes by 1.3 and 1.5 percent point (pp) and Synthia-to-Cityscapes by 0.6 pp and 2.9 pp when combining with two competitive Transformer-based backbones, respectively. Our code will be publicly available at https://github.com/wangkaihong/CDAC. 
+ + + + Confidence-based Visual Dispersal for Few-shot Unsupervised Domain Adaptation http://openaccess.thecvf.com//content/ICCV2023/papers/Xiong_Confidence-based_Visual_Dispersal_for_Few-shot_Unsupervised_Domain_Adaptation_ICCV_2023_paper.pdf Unsupervised domain adaptation aims to transfer knowledge from a fully-labeled source domain to an unlabeled target domain. However, in real-world scenarios, providing abundant labeled data even in the source domain can be infeasible due to the difficulty and high expense of annotation. To address this issue, recent works consider Few-shot Unsupervised Domain Adaptation (FUDA), where only a few source samples are labeled, and conduct knowledge transfer via self-supervised learning methods. Yet existing methods generally overlook that the sparse label setting hinders learning reliable source knowledge for transfer. Additionally, target samples differ in learning difficulty, but this difference is ignored, leaving hard target samples poorly classified. To tackle both deficiencies, in this paper, we propose a novel Confidence-based Visual Dispersal Transfer learning method (C-VisDiT) for FUDA. Specifically, C-VisDiT consists of a cross-domain visual dispersal strategy that transfers only high-confidence source knowledge for model adaptation and an intra-domain visual dispersal strategy that guides the learning of hard target samples with easy ones. We conduct extensive experiments on Office-31, Office-Home, VisDA-C, and DomainNet benchmark datasets and the results demonstrate that the proposed C-VisDiT significantly outperforms state-of-the-art FUDA methods. Our code is available at https://github.com/Bostoncake/C-VisDiT. + + + + Event-Guided Procedure Planning from Instructional Videos with Text Supervision http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Event-Guided_Procedure_Planning_from_Instructional_Videos_with_Text_Supervision_ICCV_2023_paper.pdf In this work, we focus on the task of procedure planning from instructional videos with text supervision, where a model aims to predict an action sequence to transform the initial visual state into the goal visual state. A critical challenge of this task is the large semantic gap between observed visual states and unobserved intermediate actions, which is ignored by previous works. Specifically, this semantic gap means that the contents of the observed visual states are semantically different from the elements of some action text labels in a procedure. To bridge this semantic gap, we propose a novel event-guided paradigm, which first infers events from the observed states and then plans out actions based on both the states and predicted events. Our inspiration comes from the observation that planning a procedure from an instructional video amounts to completing a specific event, and a specific event usually involves specific actions. Based on the proposed paradigm, we contribute an Event-guided Prompting-based Procedure Planning (E3P) model, which encodes event information into the sequential modeling process to support procedure planning. To further consider the strong action associations within each event, our E3P adopts a mask-and-predict approach for relation mining, incorporating a probabilistic masking scheme for regularization. Extensive experiments on three datasets demonstrate the effectiveness of our proposed model. 
+ + + + Multimodal Motion Conditioned Diffusion Model for Skeleton-based Video Anomaly Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Flaborea_Multimodal_Motion_Conditioned_Diffusion_Model_for_Skeleton-based_Video_Anomaly_Detection_ICCV_2023_paper.pdf + Anomalies are rare and anomaly detection is often therefore framed as One-Class Classification (OCC), i.e. trained solely on normalcy. Leading OCC techniques constrain the latent representations of normal motions to limited volumes and detect as abnormal anything outside, which accounts satisfactorily for the openset'ness of anomalies. But normalcy shares the same openset'ness property, since humans can perform the same action in several ways, which the leading techniques neglect. We propose a novel generative model for video anomaly detection (VAD), which assumes that both normality and abnormality are multimodal. We consider skeletal representations and leverage state-of-the-art diffusion probabilistic models to generate multimodal future human poses. We contribute a novel conditioning on the past motion of people and exploit the improved mode coverage capabilities of diffusion processes to generate different-but-plausible future motions. Upon the statistical aggregation of future modes, an anomaly is detected when the generated set of motions is not pertinent to the actual future. We validate our model on 4 established benchmarks: UBnormal, HR-UBnormal, HR-STC, and HR-Avenue, with extensive experiments surpassing state-of-the-art results. + + + + CDFSL-V: Cross-Domain Few-Shot Learning for Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Samarasinghe_CDFSL-V_Cross-Domain_Few-Shot_Learning_for_Videos_ICCV_2023_paper.pdf + Few-shot video action recognition is an effective approach to recognizing new categories with only a few labeled examples, thereby reducing the challenges associated with collecting and annotating large-scale video datasets. Existing methods in video action recognition rely on large labeled datasets from the same domain. However, this setup is not realistic as novel categories may come from different data domains that may have different spatial and temporal characteristics. This dissimilarity between the source and target domains can pose a significant challenge, rendering traditional few-shot action recognition techniques ineffective. To address this issue, in this work, we propose a novel cross-domain few-shot video action recognition method that leverages self-supervised learning and curriculum learning to balance the information from the source and target domains. To be particular, our method employs a masked autoencoder-based self-supervised training objective to learn from both source and target data in a self-supervised manner. Then a progressive curriculum balances learning the discriminative information from the source dataset with the generic information learned from the target domain. Initially, our curriculum utilizes supervised learning to learn class discriminative features from the source data. As the training progresses, we transition to learning target-domain-specific features. We propose a progressive curriculum to encourage the emergence of rich features in the target domain based on class discriminative supervised features in the source domain. We evaluate our method on several challenging benchmark datasets and demonstrate that our approach outperforms existing cross-domain few-shot learning techniques. 
Our code is available at https://github.com/Sarinda251/CDFSL-V + + + + Towards Viewpoint Robustness in Bird's Eye View Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Klinghoffer_Towards_Viewpoint_Robustness_in_Birds_Eye_View_Segmentation_ICCV_2023_paper.pdf + Autonomous vehicles (AV) require that neural networks used for perception be robust to different viewpoints if they are to be deployed across many types of vehicles without the repeated cost of data collection and labeling for each. AV companies typically focus on collecting data from diverse scenarios and locations, but not camera rig configurations, due to cost. As a result, only a small number of rig variations exist across most fleets. In this paper, we study how AV perception models are affected by changes in camera viewpoint and propose a way to scale them across vehicle types without repeated data collection and labeling. Using bird's eye view (BEV) segmentation as a motivating task, we find through extensive experiments that existing perception models are surprisingly sensitive to changes in camera viewpoint. When trained with data from one camera rig, small changes to pitch, yaw, depth, or height of the camera at inference time lead to large drops in performance. We introduce a technique for novel view synthesis and use it to transform collected data to the viewpoint of target rigs, allowing us to train BEV segmentation models for diverse target rigs without any additional data collection or labeling cost. To analyze the impact of viewpoint changes, we leverage synthetic data to mitigate other gaps (content, ISP, etc). Our approach is then trained on real data and evaluated on synthetic data, enabling evaluation on diverse target rigs. We release all data for use in future work. Our method is able to recover an average of 14.7% of the IoU that is otherwise lost when deploying to new rigs. + + + + What Can a Cook in Italy Teach a Mechanic in India? Action Recognition Generalisation Over Scenarios and Locations + http://openaccess.thecvf.com//content/ICCV2023/papers/Plizzari_What_Can_a_Cook_in_Italy_Teach_a_Mechanic_in_ICCV_2023_paper.pdf + We propose and address a new generalisation problem: can a model trained for action recognition successfully classify actions when they are performed within a previously unseen scenario and in a previously unseen location? To answer this question, we introduce the Action Recognition Generalisation Over scenarios and locations dataset ARGO1M, which contains 1.1M video clips from the large-scale Ego4D dataset, across 10 scenarios and 13 locations. We demonstrate recognition models struggle to generalise over 10 proposed test splits, each of an unseen scenario in an unseen location. We thus propose CIR, a method to represent each video as a Cross-Instance Reconstruction of videos from other domains. Reconstructions are paired with text narrations to guide the learning of a domain generalisable representation. We provide extensive analysis and ablations on ARGO1M that show CIR outperforms prior domain generalisation works on all test splits. + + + + EMDB: The Electromagnetic Database of Global 3D Human Pose and Shape in the Wild + http://openaccess.thecvf.com//content/ICCV2023/papers/Kaufmann_EMDB_The_Electromagnetic_Database_of_Global_3D_Human_Pose_and_ICCV_2023_paper.pdf + We present EMDB, the Electromagnetic Database of Global 3D Human Pose and Shape in the Wild. 
EMDB is a novel dataset that contains high-quality 3D SMPL pose and shape parameters with global body and camera trajectories for in-the-wild videos. We use body-worn, wireless electromagnetic (EM) sensors and a hand-held iPhone to record a total of 58 minutes of motion data, distributed over 81 indoor and outdoor sequences and 10 participants. Together with accurate body poses and shapes, we also provide global camera poses and body root trajectories. To construct EMDB, we propose a multi-stage optimization procedure, which first fits SMPL to the 6-DoF EM measurements and then refines the poses via image observations. To achieve high-quality results, we leverage a neural implicit avatar model to reconstruct detailed human surface geometry and appearance, which allows for improved alignment and smoothness via a dense pixel-level objective. Our evaluations, conducted with a multi-view volumetric capture system, indicate that EMDB has an expected accuracy of 2.3 cm positional and 10.6 degrees angular error, surpassing the accuracy of previous in-the-wild datasets. We evaluate existing state-of-the-art monocular RGB methods for camera-relative and global pose estimation on EMDB. EMDB is publicly available under https://ait.ethz.ch/emdb. + + + + Weakly-Supervised Action Segmentation and Unseen Error Detection in Anomalous Instructional Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Ghoddoosian_Weakly-Supervised_Action_Segmentation_and_Unseen_Error_Detection_in_Anomalous_Instructional_ICCV_2023_paper.pdf + We present a novel method for weakly-supervised action segmentation and unseen error detection in anomalous instructional videos. In the absence of an appropriate dataset for this task, we introduce the Anomalous Toy Assembly (ATA) dataset, which comprises 1152 untrimmed videos of 32 participants assembling three different toys, recorded from four different viewpoints. The training set comprises 27 participants who assemble toys in an expected and consistent manner, while the test and validation sets comprise 5 participants who display sequential anomalies in their task. We introduce a weakly labeled segmentation algorithm that is a generalization of the constrained Viterbi algorithm and identifies potential anomalous moments based on the difference between future anticipation and current recognition results. The proposed method is not restricted by the training transcripts during testing, allowing for the inference of anomalous action sequences while maintaining real-time performance. Based on these segmentation results, we also introduce a baseline to detect pre-defined human errors, and benchmark results on the ATA dataset. Experiments were conducted on the ATA and CSV datasets, demonstrating that the proposed method outperforms the state-of-the-art in segmenting anomalous videos under both online and offline conditions. + + + + Mesh2Tex: Generating Mesh Textures from Image Queries + http://openaccess.thecvf.com//content/ICCV2023/papers/Bokhovkin_Mesh2Tex_Generating_Mesh_Textures_from_Image_Queries_ICCV_2023_paper.pdf + Remarkable advances have been achieved recently in learning neural representations that characterize object geometry, while generating textured objects suitable for downstream applications and 3D rendering remains at an early stage. In particular, reconstructing textured geometry from images of real objects is a significant challenge - reconstructed geometry is often inexact, making realistic texturing a significant challenge. 
We present Mesh2Tex, which learns a realistic object texture manifold from uncorrelated collections of 3D object geometry and photorealistic RGB images, by leveraging a hybrid mesh-neural-field texture representation. Our texture representation enables compact encoding of high-resolution textures as a neural field in the barycentric coordinate system of the mesh faces. The learned texture manifold enables effective navigation to generate an object texture for a given 3D object geometry that matches an input RGB image, and maintains robustness even under challenging real-world scenarios where the mesh geometry is only an inexact match to the underlying geometry in the RGB image. Mesh2Tex can effectively generate realistic object textures for an object mesh to match real image observations towards digitization of real environments, significantly improving over the previous state of the art. + + + + Deep Feature Deblurring Diffusion for Detecting Out-of-Distribution Objects + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Deep_Feature_Deblurring_Diffusion_for_Detecting_Out-of-Distribution_Objects_ICCV_2023_paper.pdf + To promote the safe application of detectors, the task of unsupervised out-of-distribution object detection (OOD-OD) has recently been proposed, whose goal is to detect unseen OOD objects without accessing any auxiliary OOD data. For this task, the challenge mainly lies in how to leverage only the known in-distribution (ID) data to detect OOD objects accurately without affecting the detection of ID objects, which can be framed as a diffusion problem for deep feature synthesis. Accordingly, such a challenge can be addressed by the forward and reverse processes in a diffusion model. In this paper, we propose a new approach called Deep Feature Deblurring Diffusion (DFDD), consisting of forward blurring and reverse deblurring processes. Specifically, the forward process gradually performs Gaussian blur on the extracted features, which is instrumental in retaining sufficient input-relevant information. In this way, the forward process can synthesize virtual OOD features that are close to the classification boundary between ID and OOD objects, which improves the performance of detecting OOD objects. During the reverse process, based on the blurred features, a dedicated deblurring model is designed to continually recover the details lost in the forward process. Both the deblurred features and the original features are taken as input for training, strengthening the discrimination ability. In the experiments, our method is evaluated on OOD-OD, open-set object detection, and incremental object detection. The significant performance gains over baselines demonstrate the superiority of our method. The source code will be made available at: https://github.com/AmingWu/DFDD-OOD. + + + + Introducing Language Guidance in Prompt-based Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Khan_Introducing_Language_Guidance_in_Prompt-based_Continual_Learning_ICCV_2023_paper.pdf + Continual Learning aims to learn a single model on a sequence of tasks without having access to data from previous tasks. The biggest challenge in the domain remains catastrophic forgetting: a loss in performance on seen classes of earlier tasks. Some existing methods rely on an expensive replay buffer to store a chunk of data from previous tasks. This, while promising, becomes expensive when the number of tasks becomes large or data cannot be stored for privacy reasons. 
As an alternative, prompt-based methods have been proposed that store the task information in a learnable prompt pool. This prompt pool instructs a frozen image encoder on how to solve each task. While the model faces a disjoint set of classes in each task in this setting, we argue that these classes can be encoded into the same embedding space of a pre-trained language encoder. In this work, we propose Language Guidance for Prompt-based Continual Learning (LGCL) as a plug-in for prompt-based methods. LGCL is model agnostic and introduces language guidance at the task level in the prompt pool and at the class level on the output feature of the vision encoder. We show with extensive experimentation that LGCL consistently improves the performance of prompt-based continual learning methods to set a new state-of-the-art. LGCL achieves these performance improvements without needing any additional learnable parameters. + + + + Invariant Training 2D-3D Joint Hard Samples for Few-Shot Point Cloud Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Yi_Invariant_Training_2D-3D_Joint_Hard_Samples_for_Few-Shot_Point_Cloud_ICCV_2023_paper.pdf + We tackle the data scarcity challenge in few-shot point cloud recognition of 3D objects by using a joint prediction from a conventional 3D model and a well-pretrained 2D model. Surprisingly, such an ensemble, though seemingly trivial, has hardly been shown effective in recent 2D-3D models. We find that the crux is the ineffective training on the "joint hard samples", which receive high-confidence predictions for different wrong labels, implying that the 2D and 3D models do not collaborate well. To this end, our proposed invariant training strategy, called INVJOINT, not only emphasizes training on the hard samples, but also seeks the invariance between the conflicting 2D and 3D ambiguous predictions. INVJOINT can learn more collaborative 2D and 3D representations for a better ensemble. Extensive experiments on 3D shape classification with the widely-adopted ModelNet10/40, ScanObjectNN and Toys4K, and shape retrieval with ShapeNet-Core validate the superiority of our INVJOINT. + + + + EigenPlaces: Training Viewpoint Robust Models for Visual Place Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Berton_EigenPlaces_Training_Viewpoint_Robust_Models_for_Visual_Place_Recognition_ICCV_2023_paper.pdf + Visual Place Recognition is a task that aims to predict the place of an image (called query) based solely on its visual features. This is typically done through image retrieval, where the query is matched to the most similar images from a large database of geotagged photos, using learned global descriptors. A major challenge in this task is recognizing places seen from different viewpoints. To overcome this limitation, we propose a new method, called EigenPlaces, to train our neural network on images from different points of view, which embeds viewpoint robustness into the learned global descriptors. The underlying idea is to cluster the training data so as to explicitly present the model with different views of the same points of interest. These points of interest are selected without the need for extra supervision. We then present experiments on the most comprehensive set of datasets in the literature, finding that EigenPlaces is able to outperform the previous state of the art on the majority of datasets, while requiring 60% less GPU memory for training and using 50% smaller descriptors. 
The code and trained models for EigenPlaces are available at https://github.com/gmberton/EigenPlaces, while results with any other baseline can be computed with the codebase at https://github.com/gmberton/auto_VPR. + + + + CIRI: Curricular Inactivation for Residue-aware One-shot Video Inpainting + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_CIRI_Curricular_Inactivation_for_Residue-aware_One-shot_Video_Inpainting_ICCV_2023_paper.pdf + Video inpainting aims at filling in missing regions of a video. However, when dealing with dynamic scenes with camera or object movements, annotating the inpainting target becomes laborious and impractical. In this paper, we resolve the one-shot video inpainting problem in which only one annotated first frame is provided. A naive solution is to propagate the initial target to the other frames with techniques like object tracking. In this context, the main obstacles are the unreliable propagation and the partially inpainted artifacts due to the inaccurate mask. For the former problem, we propose curricular inactivation to replace the hard masking mechanism for indicating the inpainting target, which is robust to erroneous predictions in long-term video inpainting. For the latter, we explore the properties of inpainting residue and present an online residue removal method in an iterative detect-and-refine manner. Extensive experiments on several real-world datasets demonstrate the quantitative and qualitative superiorities of our proposed method in one-shot video inpainting. More importantly, our method is extremely flexible that can be integrated with arbitrary traditional inpainting models, activating them to perform the reliable one-shot video inpainting task. Video demonstrations can be found in our supplement, and our code can be found at https://github.com/Arise-zwy/CIRI. + + + + RSFNet: A White-Box Image Retouching Approach using Region-Specific Color Filters + http://openaccess.thecvf.com//content/ICCV2023/papers/Ouyang_RSFNet_A_White-Box_Image_Retouching_Approach_using_Region-Specific_Color_Filters_ICCV_2023_paper.pdf + Retouching images is an essential aspect of enhancing the visual appeal of photos. Although users often share common aesthetic preferences, their retouching methods may vary based on their individual preferences. Therefore, there is a need for white-box approaches that produce satisfying results and enable users to conveniently edit their images simultaneously. Recent white-box retouching methods rely on cascaded global filters that provide image-level filter arguments but cannot perform fine-grained retouching. In contrast, colorists typically employ a divide-and-conquer approach, performing a series of region-specific fine-grained enhancements when using traditional tools like Davinci Resolve. We draw on this insight to develop a white-box framework for photo retouching using parallel region-specific filters, called RSFNet. Our model generates filter arguments (e.g., saturation, contrast, hue) and attention maps of regions for each filter simultaneously. Instead of cascading filters, RSFNet employs linear summations of filters, allowing for a more diverse range of filter classes that can be trained more easily. Our experiments demonstrate that RSFNet achieves state-of-the-art results, offering satisfying aesthetic appeal and increased user convenience for editable white-box retouching. 
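The RSFNet abstract above replaces cascaded global filters with a linear summation of parallel filters weighted by predicted region attention maps. The snippet below is a minimal, hypothetical sketch of that blending step, not the authors' implementation: the two example filters, the softmax normalization across filters, and the tensor layouts are illustrative assumptions.

```python
# Hypothetical sketch of parallel region-specific filtering in the spirit of RSFNet:
# each filter produces a globally adjusted image, and per-filter attention maps
# blend the candidates with a linear (softmax-normalized) summation instead of cascading.
import torch

def adjust_contrast(img, arg):
    # img: (B, 3, H, W) in [0, 1]; arg: (B, 1, 1, 1) contrast gain around the channel mean
    mean = img.mean(dim=(2, 3), keepdim=True)
    return (img - mean) * arg + mean

def adjust_saturation(img, arg):
    gray = img.mean(dim=1, keepdim=True)            # crude luminance proxy
    return gray + (img - gray) * arg

def rsf_blend(img, filter_args, attn_logits):
    """filter_args: (B, 2) gains for [contrast, saturation] (assumed layout);
    attn_logits: (B, 2, H, W) unnormalized per-filter attention maps."""
    candidates = torch.stack([
        adjust_contrast(img, filter_args[:, 0].view(-1, 1, 1, 1)),
        adjust_saturation(img, filter_args[:, 1].view(-1, 1, 1, 1)),
    ], dim=1)                                        # (B, 2, 3, H, W)
    attn = torch.softmax(attn_logits, dim=1).unsqueeze(2)   # normalize across filters
    return (candidates * attn).sum(dim=1).clamp(0.0, 1.0)

if __name__ == "__main__":
    img = torch.rand(1, 3, 64, 64)
    out = rsf_blend(img, torch.tensor([[1.2, 0.8]]), torch.randn(1, 2, 64, 64))
    print(out.shape)  # torch.Size([1, 3, 64, 64])
```

Because the candidates are blended rather than cascaded, each filter only needs to be correct inside its attended region, which is the property the abstract emphasizes.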
+ + + + Tem-Adapter: Adapting Image-Text Pretraining for Video Question Answer + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Tem-Adapter_Adapting_Image-Text_Pretraining_for_Video_Question_Answer_ICCV_2023_paper.pdf + Video-language pre-trained models have shown remarkable success in guiding video question-answering (VideoQA) tasks. However, due to the length of video sequences, training large-scale video-based models incurs considerably higher costs than training image-based ones. This motivates us to leverage the knowledge from image-based pretraining, despite the obvious gaps between image and video domains. To bridge these gaps, in this paper, we propose Tem-Adapter, which enables the learning of temporal dynamics and complex semantics by a visual Temporal Aligner and a textual Semantic Aligner. Unlike conventional pretrained knowledge adaptation methods that only concentrate on the downstream task objective, the Temporal Aligner introduces an extra language-guided autoregressive task aimed at facilitating the learning of temporal dependencies, with the objective of predicting future states based on historical clues and language guidance that describes event progression. Besides, to reduce the semantic gap and adapt the textual representation for better event description, we introduce a Semantic Aligner that first designs a template to fuse question and answer pairs as event descriptions and then learns a Transformer decoder with the whole video sequence as guidance for refinement. We evaluate Tem-Adapter and different pre-train transferring methods on two VideoQA benchmarks, and the significant performance improvement demonstrates the effectiveness of our method. + + + + Unleashing the Potential of Spiking Neural Networks with Dynamic Confidence + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Unleashing_the_Potential_of_Spiking_Neural_Networks_with_Dynamic_Confidence_ICCV_2023_paper.pdf + This paper presents a new methodology to alleviate the fundamental trade-off between accuracy and latency in spiking neural networks (SNNs). The approach involves decoding confidence information over time from the SNN outputs and using it to develop a decision-making agent that can dynamically determine when to terminate each inference. The proposed method, Dynamic Confidence, provides several significant benefits to SNNs. 1. It can effectively optimize latency dynamically at runtime, setting it apart from many existing low-latency SNN algorithms. Our experiments on CIFAR-10 and ImageNet datasets have demonstrated an average 40% speedup across eight different settings after applying Dynamic Confidence. 2. The decision-making agent in Dynamic Confidence is straightforward to construct and highly robust in parameter space, making it extremely easy to implement. 3. The proposed method enables visualizing the potential of any given SNN, which sets a target for current SNNs to approach. For instance, if an SNN can terminate at the most appropriate time point for each input sample, a ResNet-50 SNN can achieve an accuracy as high as 82.47% on ImageNet within just 4.71 time steps on average. Unlocking the potential of SNNs needs a highly-reliable decision-making agent to be constructed and fed with a high-quality estimation of ground truth. In this regard, Dynamic Confidence represents a meaningful step toward realizing the potential of SNNs. 
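The Dynamic Confidence abstract above terminates an SNN's inference as soon as the confidence decoded from the accumulated outputs is high enough. Below is a minimal sketch of such confidence-gated early exit, assuming the network exposes a per-timestep output vector; the linearly relaxing threshold schedule and the toy stand-in network are assumptions for illustration, not the paper's design.

```python
# Minimal sketch of confidence-based early termination for a rate-coded SNN.
import torch

def run_with_dynamic_confidence(snn_step, x, max_steps=32,
                                start_thresh=0.99, end_thresh=0.80):
    """snn_step(x, t) -> (num_classes,) output increment at timestep t."""
    accumulated = None
    for t in range(max_steps):
        out_t = snn_step(x, t)
        accumulated = out_t if accumulated is None else accumulated + out_t
        probs = torch.softmax(accumulated / (t + 1), dim=-1)
        confidence, pred = probs.max(dim=-1)
        # Assumed dynamic threshold: demand very high confidence early, relax it over time.
        thresh = start_thresh + (end_thresh - start_thresh) * t / (max_steps - 1)
        if confidence.item() >= thresh:
            return pred.item(), t + 1        # early exit: latency = t + 1 timesteps
    return pred.item(), max_steps            # fell through: used all timesteps

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy stand-in for an SNN: noisy per-step evidence that favours class 3.
    target_logits = torch.zeros(10)
    target_logits[3] = 1.0
    fake_snn_step = lambda x, t: target_logits + 0.5 * torch.randn(10)
    label, steps = run_with_dynamic_confidence(fake_snn_step, torch.rand(1, 3, 32, 32))
    print(f"predicted class {label} after {steps} timesteps")
```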
+ + + + TeD-SPAD: Temporal Distinctiveness for Self-Supervised Privacy-Preservation for Video Anomaly Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Fioresi_TeD-SPAD_Temporal_Distinctiveness_for_Self-Supervised_Privacy-Preservation_for_Video_Anomaly_Detection_ICCV_2023_paper.pdf + Video anomaly detection (VAD) without human monitoring is a complex computer vision task that can have a positive impact on society if implemented successfully. While recent advances have made significant progress in solving this task, most existing approaches overlook a critical real-world concern: privacy. With the increasing popularity of artificial intelligence technologies, it becomes crucial to implement proper AI ethics into their development. Privacy leakage in VAD allows models to pick up and amplify unnecessary biases related to people's personal information, which may lead to undesirable decision making. In this paper, we propose TeD-SPAD, a privacy-aware video anomaly detection framework that destroys visual private information in a self-supervised manner. In particular, we propose the use of a temporally-distinct triplet loss to promote temporally discriminative features, which complements current weakly-supervised VAD methods. Using TeD-SPAD, we achieve a positive trade-off between privacy protection and utility anomaly detection performance on three popular weakly supervised VAD datasets: UCF-Crime, XD-Violence, and ShanghaiTech. Our proposed anonymization model reduces private attribute prediction by 32.25% while only reducing frame-level ROC AUC on the UCF-Crime anomaly detection dataset by 3.69%. + + + + HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_HiTeA_Hierarchical_Temporal-Aware_Video-Language_Pre-training_ICCV_2023_paper.pdf + Video-language pre-training has advanced the performance of various downstream video-language tasks. However, most previous methods directly inherit or adapt typical image-language pre-training paradigms to video-language pre-training, thus not fully exploiting the unique characteristic of video, i.e., temporal. In this paper, we propose a Hierarchical Temporal-Aware video-language pre-training framework, HiTeA, with two novel pre-training tasks for yielding temporal-aware multi-modal representation with cross-modal fine-grained temporal moment information and temporal contextual relations between video-text multi-modal pairs. First, we propose a cross-modal moment exploration task to explore moments in videos by mining the paired texts, which results in detailed video moment representation. Then, based on the learned detailed moment representations, the inherent temporal contextual relations are captured by aligning video-text pairs as a whole in different time resolutions with multi-modal temporal relation exploration task. Furthermore, we introduce the shuffling test to evaluate the temporal reliance of datasets and video-language pre-training models. We achieve state-of-the-art results on 15 well-established video-language understanding and generation tasks, especially on temporal-oriented datasets (e.g., SSv2-Template and SSv2-Label) with 8.6% and 11.1% improvement respectively. HiTeA also demonstrates strong generalization ability when directly transferred to downstream tasks in a zero-shot manner. 
+ + + + VAPCNet: Viewpoint-Aware 3D Point Cloud Completion + http://openaccess.thecvf.com//content/ICCV2023/papers/Fu_VAPCNet_Viewpoint-Aware_3D_Point_Cloud_Completion_ICCV_2023_paper.pdf + Most existing learning-based 3D point cloud completion methods ignore the fact that the completion process is highly coupled with the viewpoint of a partial scan. However, the various viewpoints of incompletely scanned objects in real-world applications are normally unknown and directly estimating the viewpoint of each incomplete object is usually time-consuming and leads to huge annotation cost. In this paper, we thus propose an unsupervised viewpoint representation learning scheme for 3D point cloud completion without explicit viewpoint estimation. To be specific, we learn abstract representations of partial scans to distinguish various viewpoints in the representation space rather than the explicit estimation in the 3D space. We also introduce a Viewpoint-Aware Point cloud Completion Network (VAPCNet) with flexible adaption to various viewpoints based on the learned representations. The proposed viewpoint representation learning scheme can extract discriminative representations to obtain accurate viewpoint information. Reported experiments on two popular public datasets show that our VAPCNet achieves state-of-the-art performance for the point cloud completion task. Source code is available at https://github.com/FZH92128/VAPCNet. + + + + AutoSynth: Learning to Generate 3D Training Data for Object Point Cloud Registration + http://openaccess.thecvf.com//content/ICCV2023/papers/Dang_AutoSynth_Learning_to_Generate_3D_Training_Data_for_Object_Point_ICCV_2023_paper.pdf + In the current deep learning paradigm, the amount and quality of training data are as critical as the network architecture and its training details. However, collecting, processing, and annotating real data at scale is difficult, expensive, and time-consuming, particularly for tasks such as 3D object registration. While synthetic datasets can be created, they require expertise to design and include a limited number of categories. In this paper, we introduce a new approach called AutoSynth, which automatically generates 3D training data for point cloud registration. Specifically, AutoSynth automatically curates an optimal dataset by exploring a search space encompassing millions of potential datasets with diverse 3D shapes at a low cost. To achieve this, we generate synthetic 3D datasets by assembling shape primitives, and develop a meta-learning strategy to search for the best training data for 3D registration on real point clouds. For this search to remain tractable, we replace the point cloud registration network with a much smaller surrogate network, leading to a 4056.43 times speedup. We demonstrate the generality of our approach by implementing it with two different point cloud registration networks, BPNet and IDAM. Our results on TUD-L, LINEMOD, and Occluded-LINEMOD evidence that a neural network trained on our searched dataset yields consistently better performance than the same one trained on the widely used ModelNet40 dataset. 
+ + + + Self-supervised Learning of Implicit Shape Representation with Dense Correspondence for Deformable Objects + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Self-supervised_Learning_of_Implicit_Shape_Representation_with_Dense_Correspondence_for_ICCV_2023_paper.pdf + Learning 3D shape representation with dense correspondence for deformable objects is a fundamental problem in computer vision. Existing approaches often need additional annotations of specific semantic domain, e.g., skeleton pose for human body or animals, which require extra annotation effort and suffer from error accumulation, and they are limited to specific domain. In this paper, we propose a novel self-supervised approach to learn neural implicit shape representation for deformable objects, which can represent shapes with a template shape and dense correspondence in 3D. Our method does not require the priors of skeleton and skinning weight, and only requires a collection of shapes represented in signed distance fields. To handle the large deformation, we constrain the learned template shape in the same latent space with the training shapes, design a new formulation of local rigid constraint that enforces rigid transformation in local region and addresses local reflection issue, and present a new hierarchical rigid constraint to reduce the ambiguity due to the joint learning of template shape and correspondence. Extensive experiments show that our model can represent shapes with large deformations. We also show that our shape representation can support two typical applications, such as texture transfer and shape editing, with competitive performance. The code and models will be publicly released. + + + + Scaling Data Generation in Vision-and-Language Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Scaling_Data_Generation_in_Vision-and-Language_Navigation_ICCV_2023_paper.pdf + Recent research in language-guided visual navigation has demonstrated a significant demand for the diversity of traversable environments and the quantity of supervision for training generalizable agents. To tackle the common data scarcity issue in existing vision-and-language navigation datasets, we propose an effective paradigm for generating large-scale data for learning, which applies 1200+ photo-realistic environments from HM3D and Gibson datasets and synthesizes 4.9 million instruction trajectory pairs using fully-accessible resources on the web. Importantly, we investigate the influence of each component in this paradigm on the agent's performance and study how to adequately apply the augmented data to pre-train and fine-tune an agent. Thanks to our large-scale dataset, the performance of an existing agent can be pushed up (+11% absolute with regard to previous SoTA) to a significantly new best of 80% single-run success rate on the R2R test split by simple imitation learning. The long-lasting generalization gap between navigating in seen and unseen environments is also reduced to less than 1% (versus 8% in the previous best method). Moreover, our paradigm also facilitates different models to achieve new state-of-the-art navigation results on CVDN, REVERIE, and R2R in continuous environments. 
+ + + + Dual Learning with Dynamic Knowledge Distillation for Partially Relevant Video Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Dual_Learning_with_Dynamic_Knowledge_Distillation_for_Partially_Relevant_Video_ICCV_2023_paper.pdf + Almost all previous text-to-video retrieval works assume that videos are pre-trimmed with short durations. However, in practice, videos are generally untrimmed containing much background content. In this work, we investigate the more practical but challenging Partially Relevant Video Retrieval (PRVR) task, which aims to retrieve partially relevant untrimmed videos with the query input. Particularly, we propose to address PRVR from a new perspective, i.e., distilling the generalization knowledge from the large-scale vision-language pre-trained model and transferring it to a task-specific PRVR network. To be specific, we introduce a Dual Learning framework with Dynamic Knowledge Distillation (DL-DKD), which exploits the knowledge of a large vision-language model as the teacher to guide a student model. During the knowledge distillation, an inheritance student branch is devised to absorb the knowledge from the teacher model. Considering that the large model may be of mediocre performance due to the domain gaps, we further develop an exploration student branch to take the benefits of task-specific information. By jointly training the above two branches in a dual-learning way, our model is able to selectively acquire appropriate knowledge from the teacher model while capturing the task-specific property. In addition, a dynamical knowledge distillation strategy is further devised to adjust the effect of each student branch learning during the training. Experiment results demonstrate that our proposed model achieves state-of-the-art performance on ActivityNet and TVR datasets for PRVR. + + + + Disposable Transfer Learning for Selective Source Task Unlearning + http://openaccess.thecvf.com//content/ICCV2023/papers/Koh_Disposable_Transfer_Learning_for_Selective_Source_Task_Unlearning_ICCV_2023_paper.pdf + Transfer learning is widely used for training deep neural networks (DNN) for building a powerful representation. Even after the pre-trained model is adapted for the target task, the representation performance of the feature extractor is retained to some extent. As the performance of the pre-trained model can be considered the private property of the owner, it is natural to seek the exclusive right of the generalized performance of the pre-trained weight. To address this issue, we suggest a new paradigm of transfer learning called disposable transfer learning (DTL), which disposes of only the source task without degrading the performance of the target task. To achieve knowledge disposal, we propose a novel loss named Gradient Collision loss (GC loss). GC loss selectively unlearns the source knowledge by leading the gradient vectors of mini-batches in different directions. Whether the model successfully unlearns the source task is measured by piggyback learning accuracy (PL accuracy). PL accuracy estimates the vulnerability of knowledge leakage by retraining the scrubbed model on a subset of source data or new downstream data. We demonstrate that GC loss is an effective approach to the DTL problem by showing that the model trained with GC loss retains the performance on the target task with a significantly reduced PL accuracy. 
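The disposable transfer learning abstract above introduces a Gradient Collision (GC) loss that unlearns the source task by driving the gradient vectors of different mini-batches in different directions. The following hypothetical sketch shows one training step in that spirit, where the collision term is taken to be the cosine similarity between two source-batch gradients of the shared parameters; the exact formulation and weighting in the paper may differ.

```python
# Sketch of a "gradient collision" style step: keep the target-task loss low while
# pushing the gradients of two source-task mini-batches away from each other.
import torch
import torch.nn.functional as F

def flat_grad(loss, params):
    # Gradient of `loss` w.r.t. `params`, flattened into one vector;
    # create_graph=True so the collision term itself can be backpropagated.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    return torch.cat([g.reshape(-1) for g in grads])

def disposable_transfer_step(model, target_batch, src_batch_a, src_batch_b,
                             criterion, optimizer, gc_weight=1.0):
    params = [p for p in model.parameters() if p.requires_grad]
    xa, ya = src_batch_a
    xb, yb = src_batch_b
    xt, yt = target_batch
    # Driving the cosine similarity down makes the source gradients "collide",
    # so the shared representation gradually stops fitting the source task.
    ga = flat_grad(criterion(model(xa), ya), params)
    gb = flat_grad(criterion(model(xb), yb), params)
    collision = F.cosine_similarity(ga, gb, dim=0)
    loss = criterion(model(xt), yt) + gc_weight * collision
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item(), collision.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(),
                                torch.nn.Linear(32, 4))
    opt = torch.optim.SGD(model.parameters(), lr=0.01)
    batch = lambda: (torch.randn(8, 16), torch.randint(0, 4, (8,)))
    print(disposable_transfer_step(model, batch(), batch(), batch(),
                                   torch.nn.CrossEntropyLoss(), opt))
```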
+ + + + Grounding 3D Object Affordance from 2D Interactions in Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Grounding_3D_Object_Affordance_from_2D_Interactions_in_Images_ICCV_2023_paper.pdf + Grounding 3D object affordance seeks to locate objects' "action possibilities" regions in the 3D space, which serves as a link between perception and operation for embodied agents. Existing studies primarily focus on connecting visual affordances with geometry structures, e.g., relying on annotations to declare interactive regions of interest on the object and establishing a mapping between the regions and affordances. However, the essence of learning object affordance is to understand how to use it, and the manner that detaches interactions is limited in generalization. Normally, humans possess the ability to perceive object affordances in the physical world through demonstration images or videos. Motivated by this, we introduce a novel task setting: grounding 3D object affordance from 2D interactions in images, which faces the challenge of anticipating affordance through interactions of different sources. To address this problem, we devise a novel Interaction-driven 3D Affordance Grounding Network (IAG), which aligns the region feature of objects from different sources and models the interactive contexts for 3D object affordance grounding. Besides, we collect a Point-Image Affordance Dataset (PIAD) to support the proposed task. Comprehensive experiments on PIAD demonstrate the reliability of the proposed task and the superiority of our method. The project is available at https://github.com/yyvhang/IAGNet. + + + + Tube-Link: A Flexible Cross Tube Framework for Universal Video Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Tube-Link_A_Flexible_Cross_Tube_Framework_for_Universal_Video_Segmentation_ICCV_2023_paper.pdf + Video segmentation aims to segment and track every pixel in diverse scenarios accurately. In this paper, we present Tube-Link, a versatile framework that addresses multiple core tasks of video segmentation with a unified architecture. Our framework is a near-online approach that takes a short subclip as input and outputs the corresponding spatial-temporal tube masks. To enhance the modeling of cross-tube relationships, we propose an effective way to perform tube-level linking via attention along the queries. In addition, we introduce temporal contrastive learning to instance-wise discriminative features for tube-level association. Our approach offers flexibility and efficiency for both short and long video inputs, as the length of each subclip can be varied according to the needs of datasets or scenarios. Tube-Link outperforms existing specialized architectures by a significant margin on five video segmentation datasets. Specifically, it achieves almost 13% relative improvements on VIPSeg and 4% improvements on KITTI-STEP over the strong baseline Video K-Net. When using a ResNet50 backbone on Youtube-VIS-2019 and 2021, Tube-Link boosts IDOL by 3% and 4%, respectively. Code is available at https://github.com/lxtGH/Tube-Link. + + + + Hybrid Spectral Denoising Transformer with Guided Attention + http://openaccess.thecvf.com//content/ICCV2023/papers/Lai_Hybrid_Spectral_Denoising_Transformer_with_Guided_Attention_ICCV_2023_paper.pdf + In this paper, we present a Hybrid Spectral Denoising Transformer (HSDT) for hyperspectral image denoising. 
Challenges in adapting transformer for HSI arise from the capabilities to tackle existing limitations of CNN-based methods in capturing the global and local spatial-spectral correlations while maintaining efficiency and flexibility. To address these issues, we introduce a hybrid approach that combines the advantages of both models with a Spatial-Spectral Separable Convolution (S3Conv), Guided Spectral Self-Attention (GSSA), and Self-Modulated Feed-Forward Network (SM-FFN). Our S3Conv works as a lightweight alternative to 3D convolution, which extracts more spatial-spectral correlated features while keeping the flexibility to tackle HSIs with an arbitrary number of bands. These features are then adaptively processed by GSSA which performs 3D self-attention across the spectral bands, guided by a set of learnable queries that encode the spectral signatures. This not only enriches our model with powerful capabilities for identifying global spectral correlations but also maintains linear complexity. Moreover, our SM-FFN proposes the self-modulation that intensifies the activations of more informative regions, which further strengthens the aggregated features. Extensive experiments are conducted on various datasets under both simulated and real-world noise, and it shows that our HSDT significantly outperforms the existing state-of-the-art methods while maintaining low computational overhead. Code is at https://github.com/Zeqiang-Lai/HSDT. + + + + HiVLP: Hierarchical Interactive Video-Language Pre-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_HiVLP_Hierarchical_Interactive_Video-Language_Pre-Training_ICCV_2023_paper.pdf + Video-Language Pre-training (VLP) has become one of the most popular research topics in deep learning. However, compared to image-language pre-training, VLP has lagged far behind due to the lack of large amounts of video-text pairs. In this work, we train a VLP model with a hybrid of image-text and video-text pairs, which significantly outperforms pre-training with only the video-text pairs. Besides, existing methods usually model the cross-modal interaction using cross-attention between single-scale visual tokens and textual tokens. These visual features are either of low resolutions lacking fine-grained information, or of high resolutions without high-level semantics. To address the issue, we propose Hierarchical interactive Video-Language Pre-training (HiVLP) that efficiently uses a hierarchical visual feature group for multi-modal cross-attention during pre-training. In the hierarchical framework, low-resolution features are learned with focus on more global high-level semantic information, while high-resolution features carry fine-grained details. As a result, HiVLP has the ability to effectively learn both the global and fine-grained representations to achieve better alignment between video and text inputs. Furthermore, we design a hierarchical multi-scale vision contrastive loss for self-supervised learning to boost the interaction between them. Experimental results show that HiVLP establishes new state-of-the-art results in three downstream tasks, text-video retrieval, video-text retrieval, and video captioning. 
+ + + + Learning Concordant Attention via Target-aware Alignment for Visible-Infrared Person Re-identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Learning_Concordant_Attention_via_Target-aware_Alignment_for_Visible-Infrared_Person_Re-identification_ICCV_2023_paper.pdf + Owing to the large distribution gap between the heterogeneous data in Visible-Infrared Person Re-identification (VI Re-ID), we point out that existing paradigms often suffer from the inter-modal semantic misalignment issue and thus fail to align and compare local details properly. In this paper, we present Concordant Attention Learning (CAL), a novel framework that learns semantic-aligned representations for VI Re-ID. Specifically, we design the Target-aware Concordant Alignment paradigm, which allows target-aware attention adaptation when aligning heterogeneous samples (i.e., adaptive attention adjustment according to the target image being aligned). This is achieved by exploiting the discriminative clues from the modality counterpart and designing effective modality-agnostic correspondence searching strategies. To ensure semantic concordance during the cross-modal retrieval stage, we further propose MatchDistill, which matches the attention patterns across modalities and learns their underlying semantic correlations by bipartite-graph-based similarity modeling and cross-modal knowledge exchange. Extensive experiments on VI Re-ID benchmark datasets demonstrate the effectiveness and superiority of the proposed CAL. + + + + Masked Motion Predictors are Strong 3D Action Representation Learners + http://openaccess.thecvf.com//content/ICCV2023/papers/Mao_Masked_Motion_Predictors_are_Strong_3D_Action_Representation_Learners_ICCV_2023_paper.pdf + In 3D human action recognition, limited supervised data makes it challenging to fully tap into the modeling potential of powerful networks such as transformers. As a result, researchers have been actively investigating effective self-supervised pre-training strategies. In this work, we show that instead of following the prevalent pretext task to perform masked self-component reconstruction in human joints, explicit contextual motion modeling is key to the success of learning effective feature representation for 3D action recognition. Formally, we propose the Masked Motion Prediction (MAMP) framework. To be specific, the proposed MAMP takes as input the masked spatio-temporal skeleton sequence and predicts the corresponding temporal motion of the masked human joints. Considering the high temporal redundancy of the skeleton sequence, in our MAMP, the motion information also acts as an empirical semantic richness prior that guide the masking process, promoting better attention to semantically rich temporal regions. Extensive experiments on NTU-60, NTU-120, and PKU-MMD datasets show that the proposed MAMP pre-training substantially improves the performance of the adopted vanilla transformer, achieving state-of-the-art results without bells and whistles. The source code of our MAMP is available at https://github.com/maoyunyao/MAMP. + + + + RIGID: Recurrent GAN Inversion and Editing of Real Face Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_RIGID_Recurrent_GAN_Inversion_and_Editing_of_Real_Face_Videos_ICCV_2023_paper.pdf + GAN inversion is indispensable for applying the powerful editability of GAN to real images. However, existing methods invert video frames individually often leading to undesired inconsistent results over time. 
In this paper, we propose a unified recurrent framework, named Recurrent vIdeo GAN Inversion and eDiting (RIGID), to explicitly and simultaneously enforce temporally coherent GAN inversion and facial editing of real videos. Our approach models the temporal relations between current and previous frames from three aspects. To enable a faithful real video reconstruction, we first maximize the inversion fidelity and consistency by learning a temporally compensated latent code. Second, we observe that incoherent noise lies in the high-frequency domain and can be disentangled from the latent space. Third, to remove the inconsistency after attribute manipulation, we propose an in-between frame composition constraint such that an arbitrary frame must be a direct composite of its neighboring frames. Our unified framework learns the inherent coherence between input frames in an end-to-end manner, and therefore it is agnostic to a specific attribute and can be applied to arbitrary editing of the same video without re-training. Extensive experiments demonstrate that RIGID outperforms state-of-the-art methods qualitatively and quantitatively in both inversion and editing tasks. The deliverables can be found at https://cnnlstm.github.io/RIGID. + + + + CSDA: Learning Category-Scale Joint Feature for Domain Adaptive Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_CSDA_Learning_Category-Scale_Joint_Feature_for_Domain_Adaptive_Object_Detection_ICCV_2023_paper.pdf + Domain Adaptive Object Detection (DAOD) aims to improve the detection performance on target domains by minimizing the feature distribution discrepancy between the source and target domains. Recent approaches usually align such distributions in terms of categories through adversarial learning, and some progress has been made. However, when objects are non-uniformly distributed at different scales, such category-level alignment causes imbalanced object feature learning, referred to as the inconsistency of category alignment at different scales. For better category-level feature alignment, we propose a novel DAOD framework that jointly exploits category and scale information, dubbed CSDA; such a design enables effective object learning across different scales. Specifically, our framework is implemented by two closely-related modules: 1) SGFF (Scale-Guided Feature Fusion) fuses the category representations of different domains to learn category-specific features, where the features are aligned by discriminators at three scales. 2) SAFE (Scale-Auxiliary Feature Enhancement) encodes scale coordinates into a group of tokens and enhances the representation of category-specific features at different scales by self-attention. Based on the anchor-based Faster-RCNN and anchor-free FCOS detectors, experiments show that our method achieves state-of-the-art results on three DAOD benchmarks. + + + + Single Image Defocus Deblurring via Implicit Neural Inverse Kernels + http://openaccess.thecvf.com//content/ICCV2023/papers/Quan_Single_Image_Defocus_Deblurring_via_Implicit_Neural_Inverse_Kernels_ICCV_2023_paper.pdf + Single image defocus deblurring (SIDD) is a challenging task due to the spatially-varying nature of defocus blur, characterized by per-pixel point spread functions (PSFs). Existing deep-learning-based methods for SIDD are limited by either over-fitting due to the lack of model constraints or under-parametrization that restricts their applicability to real-world images. 
To address the limitations, this paper proposes an interpretable approach that explicitly predicts inverse kernels with structural regularization. Motivated by the observation that defocus PSFs within an image often have similar shapes but different sizes, we represent the inverse kernels linearly over a multi-scale dictionary parameterized by implicit neural representations. We predict the corresponding representation coefficients via a duplex scale-recurrent neural network that jointly performs fine-to-coarse and coarse-to-fine estimations. Extensive experiments demonstrate that our approach achieves excellent performance using a lightweight model. + + + + AvatarCraft: Transforming Text into Neural Human Avatars with Parameterized Shape and Pose Control + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_AvatarCraft_Transforming_Text_into_Neural_Human_Avatars_with_Parameterized_Shape_ICCV_2023_paper.pdf + Neural implicit fields are powerful for representing 3D scenes and generating high-quality novel views, but it remains challenging to use such implicit representations for creating a 3D human avatar with a specific identity and artistic style that can be easily animated. Our proposed method, AvatarCraft, addresses this challenge by using diffusion models to guide the learning of geometry and texture for a neural avatar based on a single text prompt. We carefully design the optimization framework of neural implicit fields, including a coarse-to-fine multi-bounding box training strategy, shape regularization, and diffusion-based constraints, to produce high-quality geometry and texture. Additionally, we make the human avatar animatable by deforming the neural implicit field with an explicit warping field that maps the target human mesh to a template human mesh, both represented using parametric human models. This simplifies animation and reshaping of the generated avatar by controlling pose and shape parameters. Extensive experiments on various text descriptions show that AvatarCraft is effective and robust in creating human avatars and rendering novel views, poses, and shapes. Our project page is: https://avatar-craft.github.io/. + + + + Why Is Prompt Tuning for Vision-Language Models Robust to Noisy Labels? + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Why_Is_Prompt_Tuning_for_Vision-Language_Models_Robust_to_Noisy_ICCV_2023_paper.pdf + Vision-language models such as CLIP learn a generic text-image embedding from large-scale training data. A vision-language model can be adapted to a new classification task through few-shot prompt tuning. We find that such prompt tuning process is highly robust to label noises. This intrigues us to study the key reasons contributing to the robustness of the prompt tuning paradigm. We conducted extensive experiments to explore this property and find the key factors are: 1. the fixed classname tokens provide a strong regularization to the optimization of the model, reducing gradients induced by the noisy samples; 2. the powerful pre-trained image-text embedding that is learned from diverse and generic web data provides strong prior knowledge for image classification. Further, we demonstrate that noisy zero-shot predictions from CLIP can be used to tune its own prompt, significantly enhancing prediction accuracy in the unsupervised setting. 
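The prompt-tuning abstract above notes that a vision-language model's own noisy zero-shot predictions can supervise the tuning of its prompt while the classname tokens stay fixed. The sketch below illustrates that unsupervised loop on toy tensors; the GRU text encoder, dimensions, and training schedule are stand-in assumptions rather than CLIP's actual architecture or the paper's code.

```python
# Schematic sketch of unsupervised prompt tuning with self-generated pseudo-labels.
import torch
import torch.nn.functional as F

torch.manual_seed(0)
num_classes, ctx_len, dim, n_images = 5, 4, 64, 32

# Frozen pieces of a toy CLIP-like model: precomputed image features, fixed
# classname token embeddings, and a stand-in text encoder that stays frozen.
image_features = F.normalize(torch.randn(n_images, dim), dim=-1)
classname_embed = torch.randn(num_classes, 1, dim)
text_encoder = torch.nn.GRU(dim, dim, batch_first=True)
for p in text_encoder.parameters():
    p.requires_grad_(False)

# Learnable context vectors ("the prompt"), shared across classes.
context = torch.nn.Parameter(0.02 * torch.randn(1, ctx_len, dim))

def class_text_features():
    tokens = torch.cat([context.expand(num_classes, -1, -1), classname_embed], dim=1)
    _, h = text_encoder(tokens)                     # h: (1, num_classes, dim)
    return F.normalize(h.squeeze(0), dim=-1)

# Step 1: the model's own (noisy) zero-shot predictions become pseudo-labels.
with torch.no_grad():
    pseudo_labels = (image_features @ class_text_features().t()).argmax(dim=-1)

# Step 2: tune only the context vectors against those pseudo-labels; the fixed
# classname tokens act as the regularizer highlighted in the abstract.
optimizer = torch.optim.Adam([context], lr=1e-2)
for step in range(100):
    logits = 100.0 * image_features @ class_text_features().t()
    loss = F.cross_entropy(logits, pseudo_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
print(f"final pseudo-label loss: {loss.item():.4f}")
```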
+ + + + Unified Pre-Training with Pseudo Texts for Text-To-Image Person Re-Identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_Unified_Pre-Training_with_Pseudo_Texts_for_Text-To-Image_Person_Re-Identification_ICCV_2023_paper.pdf + The pre-training task is indispensable for the text-to-image person re-identification (T2I-ReID) task. However, there are two underlying inconsistencies between these two tasks that may impact the performance: i) Data inconsistency. A large domain gap exists between the generic images/texts used in public pre-trained models and the specific person data in the T2I-ReID task. This gap is especially severe for texts, as general textual data are usually unable to describe specific people in fine-grained detail. ii) Training inconsistency. The processes of pre-training of images and texts are independent, despite cross-modality learning being critical to T2I-ReID. To address the above issues, we present a new unified pre-training pipeline (UniPT) designed specifically for the T2I-ReID task. We first build a large-scale text-labeled person dataset "LUPerson-T", in which pseudo-textual descriptions of images are automatically generated by the CLIP paradigm using a divide-conquer-combine strategy. Benefiting from this dataset, we then utilize a simple vision-and-language pre-training framework to explicitly align the feature space of the image and text modalities during pre-training. In this way, the pre-training task and the T2I-ReID task are made consistent with each other on both data and training levels. Without the need for any bells and whistles, our UniPT achieves competitive Rank-1 accuracy of, i.e., 68.50%, 60.09%, and 51.85% on CUHK-PEDES, ICFG-PEDES and RSTPReid, respectively. Both the LUPerson-T dataset and code are available at https://github.com/ZhiyinShao-H/UniPT. + + + + Traj-MAE: Masked Autoencoders for Trajectory Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Traj-MAE_Masked_Autoencoders_for_Trajectory_Prediction_ICCV_2023_paper.pdf + Trajectory prediction has been a crucial task in building a reliable autonomous driving system by anticipating possible dangers. One key issue is to generate consistent trajectory predictions without colliding. To overcome the challenge, we propose an efficient masked autoencoder for trajectory prediction (Traj-MAE) that better represents the complicated behaviors of agents in the driving environment. Specifically, our Traj-MAE employs diverse masking strategies to pre-train the trajectory encoder and map encoder, allowing for the capture of social and temporal information among agents while leveraging the effect of environment from multiple granularities. To address the catastrophic forgetting problem that arises when pre-training the network with multiple masking strategies, we introduce a continual pre-training framework, which can help Traj-MAE learn valuable and diverse information from various strategies efficiently. Our experimental results in both multi-agent and single-agent settings demonstrate that Traj-MAE achieves competitive results with state-of-the-art methods and significantly outperforms our baseline model. Project page: https://jiazewang.com/projects/trajmae.html. 
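The Traj-MAE abstract above pre-trains trajectory and map encoders by masking parts of their inputs and reconstructing them. Below is a minimal sketch of the trajectory side of such pre-training, assuming 2D (x, y) tracks; the tiny transformer, the 50% masking ratio, and the loss restricted to masked timesteps are illustrative choices, not the paper's exact configuration.

```python
# Minimal sketch of masked trajectory reconstruction pre-training.
import torch
import torch.nn as nn

class TinyTrajMAE(nn.Module):
    def __init__(self, d_model=64, horizon=20):
        super().__init__()
        self.embed = nn.Linear(2, d_model)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(0.02 * torch.randn(1, horizon, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 2)

    def forward(self, traj, mask):
        # traj: (B, T, 2); mask: (B, T) boolean, True where the timestep is hidden
        tokens = self.embed(traj)
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        return self.head(self.encoder(tokens + self.pos))

def pretrain_step(model, traj, optimizer, mask_ratio=0.5):
    mask = torch.rand(traj.shape[:2]) < mask_ratio
    recon = model(traj, mask)
    loss = ((recon - traj) ** 2)[mask].mean()    # reconstruct masked steps only
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

if __name__ == "__main__":
    torch.manual_seed(0)
    model = TinyTrajMAE()
    opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
    traj = torch.cumsum(0.1 * torch.randn(16, 20, 2), dim=1)   # toy random-walk trajectories
    for _ in range(5):
        print(pretrain_step(model, traj, opt))
```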
+ + + + UniFusion: Unified Multi-View Fusion Transformer for Spatial-Temporal Representation in Bird's-Eye-View + http://openaccess.thecvf.com//content/ICCV2023/papers/Qin_UniFusion_Unified_Multi-View_Fusion_Transformer_for_Spatial-Temporal_Representation_in_Birds-Eye-View_ICCV_2023_paper.pdf + Bird's eye view (BEV) representation is a new perception formulation for autonomous driving, which is based on spatial fusion. Further, temporal fusion is also introduced in BEV representation and gains great success. In this work, we propose a new method that unifies both spatial and temporal fusion and merges them into a unified mathematical formulation. The unified fusion could not only provide a new perspective on BEV fusion but also brings new capabilities. With the proposed unified spatial-temporal fusion, our method could support long-range fusion, which is hard to achieve in conventional BEV methods. Moreover, the BEV fusion in our work is temporal-adaptive and the weights of temporal fusion are learnable. In contrast, conventional methods mainly use fixed and equal weights for temporal fusion. Besides, the proposed unified fusion could avoid information lost in conventional BEV fusion methods and make full use of features. Extensive experiments and ablation studies on the NuScenes dataset show the effectiveness of the proposed method and our method gains the state-of-the-art performance in the map and vehicle segmentation task. + + + + Sample-adaptive Augmentation for Point Cloud Recognition Against Real-world Corruptions + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Sample-adaptive_Augmentation_for_Point_Cloud_Recognition_Against_Real-world_Corruptions_ICCV_2023_paper.pdf + Robust 3D perception under corruption has become an essential task for the realm of 3D vision. While current data augmentation techniques usually perform random transformations on all point cloud objects in an offline way and ignore the structure of the samples, resulting in over-or-under enhancement. In this work, we propose an alternative to make sample-adaptive transformations based on the structure of the sample to cope with potential corruption via an auto-augmentation framework, named as AdaptPoint. Specially, we leverage a imitator, consisting of a Deformation Controller and a Mask Controller, respectively in charge of predicting deformation parameters and producing a per-point mask, based on the intrinsic structural information of the input point cloud, and then conduct corruption simulations on top. Then a discriminator is utilized to prevent the generation of excessive corruption that deviates from the original data distribution. In addition, a perception-guidance feedback mechanism is incorporated to guide the generation of samples with appropriate difficulty level. Furthermore, to address the paucity of real-world corrupted point cloud, we also introduce a new dataset ScanObjectNN-C, that exhibits greater similarity to actual data in real-world environments, especially when contrasted with preceding CAD datasets. Experiments show that our method achieves state-of-the-art results on multiple corruption benchmarks including ModelNet-C, our ScanObjectNN-C, and ShapeNet-C. The source code is released at: https://github.com/Roywangj/AdaptPoint. 
+ + + + Modality Unifying Network for Visible-Infrared Person Re-Identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Modality_Unifying_Network_for_Visible-Infrared_Person_Re-Identification_ICCV_2023_paper.pdf + Visible-infrared person re-identification (VI-ReID) is a challenging task due to large cross-modality discrepancies and intra-class variations. Existing methods mainly focus on learning modality-shared representations by embedding different modalities into the same feature space. As a result, the learned feature emphasizes the common patterns across modalities while suppressing modality-specific and identity-aware information that is valuable for Re-ID. To address these issues, we propose a novel Modality Unifying Network (MUN) to explore a robust auxiliary modality for VI-ReID. First, the auxiliary modality is generated by combining the proposed cross-modality learner and intra-modality learner, which can dynamically model the modality-specific and modality-shared representations to alleviate both cross-modality and intra-modality variations. Second, by aligning identity centres across the three modalities, an identity alignment loss function is proposed to discover the discriminative feature representations. Third, a modality alignment loss is introduced to consistently reduce the distribution distance of visible and infrared images by modality prototype modeling. Extensive experiments on multiple public datasets demonstrate that the proposed method surpasses the current state-of-the-art methods by a significant margin. + + + + Taming Contrast Maximization for Learning Sequential, Low-latency, Event-based Optical Flow + http://openaccess.thecvf.com//content/ICCV2023/papers/Paredes-Valles_Taming_Contrast_Maximization_for_Learning_Sequential_Low-latency_Event-based_Optical_Flow_ICCV_2023_paper.pdf + Event cameras have recently gained significant traction since they open up new avenues for low-latency and low-power solutions to complex computer vision problems. To unlock these solutions, it is necessary to develop algorithms that can leverage the unique nature of event data. However, the current state-of-the-art is still highly influenced by the frame-based literature, and usually fails to deliver on these promises. In this work, we take this into consideration and propose a novel self-supervised learning pipeline for the sequential estimation of event-based optical flow that allows for the scaling of the models to high inference frequencies. At its core, we have a continuously-running stateful neural model that is trained using a novel formulation of contrast maximization that makes it robust to nonlinearities and varying statistics in the input events. Results across multiple datasets confirm the effectiveness of our method, which establishes a new state of the art in terms of accuracy for approaches trained or optimized without ground truth. + + + + CASSPR: Cross Attention Single Scan Place Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_CASSPR_Cross_Attention_Single_Scan_Place_Recognition_ICCV_2023_paper.pdf + Place recognition based on point clouds (LiDAR) is an important component for autonomous robots or self-driving vehicles. Current SOTA performance is achieved on accumulated LiDAR submaps using either point-based or voxel-based structures. While voxel-based approaches nicely integrate spatial context across multiple scales, they do not exhibit the local precision of point-based methods. 
As a result, existing methods struggle with fine-grained matching of subtle geometric features in sparse single-shot LiDAR scans. To overcome these limitations, we propose CASSPR as a method to fuse point-based and voxel-based approaches using cross attention transformers. CASSPR leverages a sparse voxel branch for extracting and aggregating information at lower resolution and a point-wise branch for obtaining fine-grained local information. CASSPR uses queries from one branch to try to match structures in the other branch, ensuring that both extract self-contained descriptors of the point cloud (rather than one branch dominating), but using both to inform the output global descriptor of the point cloud. Extensive experiments show that CASSPR surpasses the state-of-the-art by a large margin on several datasets (Oxford RobotCar, TUM, USyd). For instance, it achieves AR@1 of 85.6% on the TUM dataset, surpassing the strongest prior model by 15%. Our code is publicly available. + + + + DDFM: Denoising Diffusion Model for Multi-Modality Image Fusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_DDFM_Denoising_Diffusion_Model_for_Multi-Modality_Image_Fusion_ICCV_2023_paper.pdf + Multi-modality image fusion aims to combine different modalities to produce fused images that retain the complementary features of each modality, such as functional highlights and texture details. To leverage strong generative priors and address challenges such as unstable training and lack of interpretability for GAN-based generative methods, we propose a novel fusion algorithm based on the denoising diffusion probabilistic model (DDPM). The fusion task is formulated as a conditional generation problem under the DDPM sampling framework, which is further divided into an unconditional generation subproblem and a maximum likelihood subproblem. The latter is modeled in a hierarchical Bayesian manner with latent variables and inferred by the expectation-maximization (EM) algorithm. By integrating the inference solution into the diffusion sampling iteration, our method can generate high-quality fused images with natural image generative priors and cross-modality information from source images. Note that all we required is an unconditional pre-trained generative model, and no fine-tuning is needed. Our extensive experiments indicate that our approach yields promising fusion results in infrared-visible image fusion and medical image fusion. The code is available at https://github.com/Zhaozixiang1228/MMIF-DDFM. + + + + A Unified Continual Learning Framework with General Parameter-Efficient Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_A_Unified_Continual_Learning_Framework_with_General_Parameter-Efficient_Tuning_ICCV_2023_paper.pdf + The "pre-training - downstream adaptation" presents both new opportunities and challenges for Continual Learning (CL). Although the recent state-of-the-art in CL is achieved through Parameter-Efficient-Tuning (PET) adaptation paradigm, only prompt has been explored, limiting its application to Transformers only. In this paper, we position prompting as one instantiation of PET, and propose a unified CL framework with general PET, dubbed as Learning-Accumulation-Ensemble (LAE). PET, e.g., using Adapter, LoRA, or Prefix, can adapt a pre-trained model to downstream tasks with fewer parameters and resources. Given a PET method, our LAE framework incorporates it for CL with three novel designs. 
1) Learning: the pre-trained model adapts to the new task by tuning an online PET module, along with our adaptation speed calibration to align different PET modules, 2) Accumulation: the task-specific knowledge learned by the online PET module is accumulated into an offline PET module through momentum update, 3) Ensemble: During inference, we respectively construct two experts with online/offline PET modules (which are favored by the novel/historical tasks) for prediction ensemble. We show that LAE is compatible with a battery of PET methods and gains strong CL capability. For example, LAE with Adaptor PET surpasses the prior state-of-the-art by 1.3% and 3.6% in last-incremental accuracy on CIFAR100 and ImageNet-R datasets, respectively. + + + + Hierarchical Generation of Human-Object Interactions with Diffusion Probabilistic Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Pi_Hierarchical_Generation_of_Human-Object_Interactions_with_Diffusion_Probabilistic_Models_ICCV_2023_paper.pdf + This paper presents a novel approach to generating the 3D motion of a human interacting with a target object, with a focus on solving the challenge of synthesizing long-range and diverse motions, which could not be fulfilled by existing auto-regressive models or path planning-based methods. We propose a hierarchical generation framework to solve this challenge. Specifically, our framework first generates a set of milestones and then synthesizes the motion along them. Therefore, the long-range motion generation could be reduced to synthesizing several short motion sequences guided by milestones. The experiments on the NSM, COUCH, and SAMP datasets show that our approach outperforms previous methods by a large margin in both quality and diversity. The source code is available on our project page https://zju3dv.github.io/hghoi. + + + + Learning Data-Driven Vector-Quantized Degradation Model for Animation Video Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Tuo_Learning_Data-Driven_Vector-Quantized_Degradation_Model_for_Animation_Video_Super-Resolution_ICCV_2023_paper.pdf + Existing real-world video super-resolution (VSR) methods focus on designing a general degradation pipeline for open-domain videos while ignoring data intrinsic characteristics which strongly limit their performance when applying to some specific domains (e.g., animation videos). In this paper, we thoroughly explore the characteristics of animation videos and leverage the rich priors in real-world animation data for a more practical animation VSR model. In particular, we propose a multi-scale Vector-Quantized Degradation model for animation video Super-Resolution (VQD-SR) to decompose the local details from global structures and transfer the degradation priors in real-world animation videos to a learned vector-quantized codebook for degradation modeling. A rich-content Real Animation Low-quality (RAL) video dataset is collected for extracting the priors. We further propose a data enhancement strategy for high-resolution (HR) training videos based on our observation that existing HR videos are mostly collected from the Web which contains conspicuous compression artifacts. The proposed strategy is valid to lift the upper bound of animation VSR performance, regardless of the specific VSR model. 
Experimental results demonstrate the superiority of the proposed VQD-SR over state-of-the-art methods, through extensive quantitative and qualitative evaluations of the latest animation video super-resolution benchmark. The code and pre-trained models can be downloaded at https://github.com/researchmm/VQD-SR. + + + + Calibrating Panoramic Depth Estimation for Practical Localization and Mapping + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Calibrating_Panoramic_Depth_Estimation_for_Practical_Localization_and_Mapping_ICCV_2023_paper.pdf + The absolute depth values of surrounding environments provide crucial cues for various assistive technologies, such as localization, navigation, and 3D structure estimation. We propose that accurate depth estimated from panoramic images can serve as a powerful and light-weight input for a wide range of downstream tasks requiring 3D information. While panoramic images can easily capture the surrounding context from commodity devices, the estimated depth shares the limitations of conventional image-based depth estimation; the performance deteriorates under large domain shifts and the absolute values are still ambiguous to infer from 2D observations. By taking advantage of the holistic view, we mitigate such effects in a self-supervised way and fine-tune the network with geometric consistency during the test phase. Specifically, we construct a 3D point cloud from the current depth prediction and project the point cloud at various viewpoints or apply stretches on the current input image to generate synthetic panoramas. Then we minimize the discrepancy of the 3D structure estimated from synthetic images without collecting additional data. We empirically evaluate our method in robot navigation and map-free localization where our method shows large performance enhancements. Our calibration method can therefore widen the applicability under various external conditions, serving as a key component for practical panorama-based machine vision systems. + + + + DiffDis: Empowering Generative Diffusion Model with Cross-Modal Discrimination Capability + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_DiffDis_Empowering_Generative_Diffusion_Model_with_Cross-Modal_Discrimination_Capability_ICCV_2023_paper.pdf + Recently, large-scale diffusion models, e.g., Stable diffusion and DallE2, have shown remarkable results on image synthesis. On the other hand, large-scale cross-modal pre-trained models (e.g., CLIP, ALIGN, and FILIP) are competent for various downstream tasks by learning to align vision and language embeddings. In this paper, we explore the possibility of jointly modeling generation and discrimination. Specifically, we propose DiffDis to unify the cross-modal generative and discriminative pretraining into one single framework under the diffusion process. DiffDis first formulates the image-text discriminative problem as a generative diffusion process of the text embedding from the text encoder conditioned on the image. Then, we propose a novel dual-stream network architecture, which fuses the noisy text embedding with the knowledge of latent images from different scales for image-text discriminative learning. Moreover, the generative and discriminative tasks can efficiently share the image-branch network structure in the multi-modality model. Benefiting from diffusion-based unified training, DiffDis achieves both better generation ability and cross-modal semantic alignment in one architecture. 
Experimental results show that DiffDis outperforms single-task models on both the image generation and the image-text discriminative tasks, e.g., 1.65% improvement on average accuracy of zero-shot classification over 12 datasets and 2.42 improvement on FID of zero-shot image synthesis. + + + + View Consistent Purification for Accurate Cross-View Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_View_Consistent_Purification_for_Accurate_Cross-View_Localization_ICCV_2023_paper.pdf + This paper proposes a fine-grained self-localization method for outdoor robotics that utilizes a flexible number of onboard cameras and readily accessible satellite images. The proposed method addresses limitations in existing cross-view localization methods that struggle to handle noise sources such as moving objects and seasonal variations. It is the first sparse visual-only method that enhances perception in dynamic environments by detecting view-consistent key points and their corresponding deep features from ground and satellite views, while removing off-the-ground objects and establishing homography transformation between the two views. Moreover, the proposed method incorporates a spatial embedding approach that leverages camera intrinsic and extrinsic information to reduce the ambiguity of purely visual matching, leading to improved feature matching and overall pose estimation accuracy. The method exhibits strong generalization and is robust to environmental changes, requiring only geo-poses as ground truth. Extensive experiments on the KITTI and Ford Multi-AV Seasonal datasets demonstrate that our proposed method outperforms existing state-of the-art methods, achieving median spatial accuracy errors below 0.5 meters along the lateral and longitudinal directions, and a median orientation accuracy error below 2 degrees. + + + + Efficient Video Action Detection with Token Dropout and Context Refinement + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Efficient_Video_Action_Detection_with_Token_Dropout_and_Context_Refinement_ICCV_2023_paper.pdf + Streaming video clips with large-scale video tokens impede vision transformers (ViTs) for efficient recognition, especially in video action detection where sufficient spatiotemporal representations are required for precise actor identification. In this work, we propose an end-to-end framework for efficient video action detection (EVAD) based on vanilla ViTs. Our EVAD consists of two specialized designs for video action detection. First, we propose a spatiotemporal token dropout from a keyframe-centric perspective. In a video clip, we maintain all tokens from its keyframe, preserve tokens relevant to actor motions from other frames, and drop out the remaining tokens in this clip. Second, we refine scene context by leveraging remaining tokens for better recognizing actor identities. The region of interest (RoI) in our action detector is expanded into temporal domain. The captured spatiotemporal actor identity representations are refined via scene context in a decoder with the attention mechanism. These two designs make our EVAD efficient while maintaining accuracy, which is validated on three benchmark datasets (i.e., AVA, UCF101-24, JHMDB). Compared to the vanilla ViT backbone, our EVAD reduces the overall GFLOPs by 43% and improves real-time inference speed by 40% with no performance degradation. Moreover, even at similar computational costs, our EVAD can improve the performance by 1.1 mAP with higher resolution inputs. 
Code is available at https://github.com/MCG-NJU/EVAD. + + + + Explicit Motion Disentangling for Efficient Optical Flow Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Deng_Explicit_Motion_Disentangling_for_Efficient_Optical_Flow_Estimation_ICCV_2023_paper.pdf + In this paper, we propose a novel framework for optical flow estimation that achieves a good balance between performance and efficiency. Our approach involves disentangling global motion learning from local flow estimation, treating global matching and local refinement as separate stages. We offer two key insights: First, the multi-scale 4D cost-volume based recurrent flow decoder is computationally expensive and unnecessary for handling small displacement. With the separation, we can utilize lightweight methods for both parts and maintain similar performance. Second, a dense and robust global matching is essential for both flow initialization as well as stable and fast convergence for the refinement stage. Towards this end, we introduce EMD-Flow, a framework that explicitly separates global motion estimation from the recurrent refinement stage. We propose two novel modules: Multi-scale Motion Aggregation (MMA) and Confidence-induced Flow Propagation (CFP). These modules leverage cross-scale matching prior and self-contained confidence maps to handle the ambiguities of dense matching in a global manner, generating a dense initial flow. Additionally, a lightweight decoding module is followed to handle small displacements, resulting in an efficient yet robust flow estimation framework. We further conduct comprehensive experiments on standard optical flow benchmarks with the proposed framework, and the experimental results demonstrate its superior balance between performance and runtime. Code is available at https://github.com/gddcx/EMD-Flow. + + + + From Chaos Comes Order: Ordering Event Representations for Object Recognition and Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zubic_From_Chaos_Comes_Order_Ordering_Event_Representations_for_Object_Recognition_ICCV_2023_paper.pdf + Today, state-of-the-art deep neural networks that process events first convert them into dense, grid-like input representations before using an off-the-shelf network. However, selecting the appropriate representation for the task traditionally requires training a neural network for each representation and selecting the best one based on the validation score, which is very time-consuming. This work eliminates this bottleneck by selecting representations based on the Gromov-Wasserstein Discrepancy (GWD) between raw events and their representation. It is about 200 times faster to compute than training a neural network and preserves the task performance ranking of event representations across multiple representations, network backbones, datasets, and tasks. Thus finding representations with high task scores is equivalent to finding representations with a low GWD. We use this insight to, for the first time, perform a hyperparameter search on a large family of event representations, revealing new and powerful representations that exceed the state-of-the-art. Our optimized representations outperform existing representations by 1.7 mAP on the 1 Mpx dataset and 0.3 mAP on the Gen1 dataset, two established object detection benchmarks, and reach a 3.8% higher classification score on the mini N-ImageNet benchmark. 
Moreover, we outperform state-of-the-art by 2.1 mAP on Gen1 and state-of-the-art feed-forward methods by 6.0 mAP on the 1 Mpx datasets. This work opens a new unexplored field of explicit representation optimization for event-based learning. + + + + Identity-Consistent Aggregation for Video Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Deng_Identity-Consistent_Aggregation_for_Video_Object_Detection_ICCV_2023_paper.pdf + In Video Object Detection (VID), a common practice is to leverage the rich temporal contexts from the video to enhance the object representations in each frame. Existing methods treat the temporal contexts obtained from different objects indiscriminately and ignore their different identities. While intuitively, aggregating local views of the same object in different frames may facilitate a better understanding of the object. Thus, in this paper, we aim to enable the model to focus on the identity-consistent temporal contexts of each object to obtain more comprehensive object representations and handle the rapid object appearance variations such as occlusion, motion blur, etc. However, realizing this goal on top of existing VID models faces low-efficiency problems due to their redundant region proposals and nonparallel frame-wise prediction manner. To aid this, we propose ClipVID, a VID model equipped with Identity-Consistent Aggregation (ICA) layers specifically designed for mining fine-grained and identity-consistent temporal contexts. It effectively reduces the redundancies through the set prediction strategy, making the ICA layers very efficient and further allowing us to design an architecture that makes parallel clip-wise predictions for the whole video clip. Extensive experimental results demonstrate the superiority of our method: a state-of-the-art (SOTA) performance (84.7% mAP) on the ImageNet VID dataset while running at a speed about 7x faster (39.3 fps) than previous SOTAs. + + + + Relightify: Relightable 3D Faces from a Single Image via Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Papantoniou_Relightify_Relightable_3D_Faces_from_a_Single_Image_via_Diffusion_ICCV_2023_paper.pdf + Following the remarkable success of diffusion models on image generation, recent works have also demonstrated their impressive ability to address a number of inverse problems in an unsupervised way, by properly constraining the sampling process based on a conditioning input. Motivated by this, in this paper, we present the first approach to use diffusion models as a prior for highly accurate 3D facial BRDF reconstruction from a single image. We start by leveraging a high-quality UV dataset of facial reflectance (diffuse and specular albedo and normals), which we render under varying illumination settings to simulate natural RGB textures and, then, train an unconditional diffusion model on concatenated pairs of rendered textures and reflectance components. At test time, we fit a 3D morphable model to the given image and unwrap the face in a partial UV texture. By sampling from the diffusion model, while retaining the observed texture part intact, the model inpaints not only the self-occluded areas but also the unknown reflectance components, in a single sequence of denoising steps. In contrast to existing methods, we directly acquire the observed texture from the input image, thus, resulting in more faithful and consistent reflectance estimation. 
Through a series of qualitative and quantitative comparisons, we demonstrate superior performance in both texture completion and reflectance reconstruction tasks. + + + + Leveraging Spatio-Temporal Dependency for Skeleton-Based Action Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Leveraging_Spatio-Temporal_Dependency_for_Skeleton-Based_Action_Recognition_ICCV_2023_paper.pdf + Skeleton-based action recognition has attracted considerable attention due to its compact representation of the human body's skeletal structure. Many recent methods have achieved remarkable performance using graph convolutional networks (GCNs) and convolutional neural networks (CNNs), which extract spatial and temporal features, respectively. Although spatial and temporal dependencies in the human skeleton have been explored separately, spatio-temporal dependency is rarely considered. In this paper, we propose the Spatio-Temporal Curve Network (STC-Net) to effectively leverage the spatio-temporal dependency of the human skeleton. Our proposed network consists of two novel elements: 1) The Spatio-Temporal Curve (STC) module; and 2) Dilated Kernels for Graph Convolution (DK-GC). The STC module dynamically adjusts the receptive field by identifying meaningful node connections between every adjacent frame and generating spatio-temporal curves based on the identified node connections, providing an adaptive spatio-temporal coverage. In addition, we propose DK-GC to consider long-range dependencies, which results in a large receptive field without any additional parameters by applying an extended kernel to the given adjacency matrices of the graph. Our STC-Net combines these two modules and achieves state-of-the-art performance on four skeleton-based action recognition benchmarks. + + + + Camera-Driven Representation Learning for Unsupervised Domain Adaptive Person Re-identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Camera-Driven_Representation_Learning_for_Unsupervised_Domain_Adaptive_Person_Re-identification_ICCV_2023_paper.pdf + We present a novel unsupervised domain adaptation method for person re-identification (reID) that generalizes a model trained on a labeled source domain to an unlabeled target domain. We introduce a camera-driven curriculum learning (CaCL) framework that leverages camera labels of person images to transfer knowledge from source to target domains progressively. To this end, we divide the target domain dataset into multiple subsets based on the camera labels, and initially train our model with a single subset (i.e., images captured by a single camera). We then gradually exploit more subsets for training, according to a curriculum sequence obtained with a camera-driven scheduling rule. The scheduler considers maximum mean discrepancies (MMD) between each subset and the source domain dataset, such that the subset closer to the source domain is exploited earlier within the curriculum. For each curriculum sequence, we generate pseudo labels of person images in a target domain to train a reID model in a supervised way. We have observed that the pseudo labels are highly biased toward cameras, suggesting that person images obtained from the same camera are likely to have the same pseudo labels, even for different IDs.
To address the camera bias problem, we also introduce a camera-diversity (CD) loss encouraging person images of the same pseudo label, but captured across various cameras, to be involved more in discriminative feature learning, providing person representations robust to inter-camera variations. Experimental results on standard benchmarks, including real-to-real and synthetic-to-real scenarios, demonstrate the effectiveness of our framework. + + + + Name Your Colour For the Task: Artificially Discover Colour Naming via Colour Quantisation Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Su_Name_Your_Colour_For_the_Task_Artificially_Discover_Colour_Naming_ICCV_2023_paper.pdf + The long-standing theory that a colour-naming system evolves under the dual pressure of efficient communication and perceptual mechanisms is supported by more and more linguistic studies, including analysing four decades of diachronic data from the Nafaanra language. This inspires us to explore whether machine learning could evolve and discover a similar colour-naming system via optimising the communication efficiency represented by high-level recognition performance. Here, we propose a novel colour quantisation transformer, CQFormer, that quantises colour space while maintaining the accuracy of machine recognition on the quantised images. Given an RGB image, the Annotation Branch maps it into an index map before generating the quantised image with a colour palette; meanwhile the Palette Branch utilises a key-point detection approach to find proper colours in the palette among the whole colour space. By interacting with colour annotation, CQFormer is able to balance both the machine vision accuracy and colour perceptual structure, such as a distinct and stable colour distribution for the discovered colour system. Very interestingly, we even observe a consistent evolution pattern between our artificial colour system and basic colour terms across human languages. Besides, our colour quantisation approach also offers an efficient quantisation method that effectively compresses image storage while maintaining high performance in high-level recognition tasks such as classification and detection. Extensive experiments demonstrate the superior performance of our method with extremely low bit-rate colours, showing the potential to integrate into quantisation networks to quantise from image to network activation. The source code is available at https://github.com/ryeocthiv/CQFormer. + + + + FSAR: Federated Skeleton-based Action Recognition with Adaptive Topology Structure and Knowledge Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_FSAR_Federated_Skeleton-based_Action_Recognition_with_Adaptive_Topology_Structure_and_ICCV_2023_paper.pdf + Existing skeleton-based action recognition methods typically follow a centralized learning paradigm, which can pose privacy concerns when exposing human-related videos. Federated Learning (FL) has attracted much attention due to its outstanding advantages in privacy preservation. However, directly applying FL approaches to skeleton videos suffers from unstable training. In this paper, we investigate and discover that the heterogeneous human topology graph structure is the crucial factor hindering training stability. To address this issue, we pioneer a novel Federated Skeleton-based Action Recognition (FSAR) paradigm, which enables the construction of a globally generalized model without accessing local sensitive data.
Specifically, we introduce an Adaptive Topology Structure (ATS), separating generalization and personalization by learning a domain-invariant topology shared across clients and a domain-specific topology decoupled from global model aggregation. Furthermore, we explore Multi-grain Knowledge Distillation (MKD) to mitigate the discrepancy between clients and the server caused by distinct updating patterns through aligning shallow block-wise motion features. Extensive experiments on multiple datasets demonstrate that FSAR outperforms state-of-the-art FL-based methods while inherently protecting privacy for skeleton-based action recognition. + + + + Video Adverse-Weather-Component Suppression Network via Weather Messenger and Adversarial Backpropagation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Video_Adverse-Weather-Component_Suppression_Network_via_Weather_Messenger_and_Adversarial_Backpropagation_ICCV_2023_paper.pdf + Although convolutional neural networks (CNNs) have been proposed to remove adverse weather conditions in single images using a single set of pre-trained weights, they fail to restore weather videos due to the absence of temporal information. Furthermore, existing methods for removing adverse weather conditions (e.g., rain, fog, and snow) from videos can only handle one type of adverse weather. In this work, we propose the first framework for restoring videos from all adverse weather conditions by developing a video adverse-weather-component suppression network (ViWS-Net). To achieve this, we first devise a weather-agnostic video transformer encoder with multiple transformer stages. Moreover, we design a long short-term temporal modeling mechanism for weather messenger to early fuse input adjacent video frames and learn weather-specific information. We further introduce a weather discriminator with gradient reversion, to maintain the weather-invariant common information and suppress the weather-specific information in pixel features, by adversarially predicting weather types. Finally, we develop a messenger-driven video transformer decoder to retrieve the residual weather-specific feature, which is spatiotemporally aggregated with hierarchical pixel features and refined to predict the clean target frame of input videos. Experimental results, on benchmark datasets and real-world weather videos, demonstrate that our ViWS-Net outperforms current state-of-the-art methods in terms of restoring videos degraded by any weather condition. + + + + Part-Aware Transformer for Generalizable Person Re-identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Ni_Part-Aware_Transformer_for_Generalizable_Person_Re-identification_ICCV_2023_paper.pdf + Domain generalization person re-identification (DG ReID) aims to train a model on source domains and generalize well on unseen domains. Vision Transformer usually yields better generalization ability than common CNN networks under distribution shifts. However, Transformer-based ReID models inevitably overfit to domain-specific biases due to the supervised learning strategy on the source domain. We observe that while the global images of different IDs should have different features, their similar local parts (e.g., black backpack) are not bounded by this constraint. Motivated by this, we propose a pure Transformer model (termed Part-aware Transformer) for DG-ReID by designing a proxy task, named Cross-ID Similarity Learning (CSL), to mine local visual information shared by different IDs. 
This proxy task allows the model to learn generic features because it only cares about the visual similarity of the parts regardless of the ID labels, thus alleviating the side effect of domain-specific biases. Based on the local similarity obtained in CSL, a Part-guided Self-Distillation (PSD) is proposed to further improve the generalization of global features. Our method achieves state-of-the-art performance under most DG ReID settings. The code is available at https://github.com/liyuke65535/Part-Aware-Transformer. + + + + Blending-NeRF: Text-Driven Localized Editing in Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_Blending-NeRF_Text-Driven_Localized_Editing_in_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Text-driven localized editing of 3D objects is particularly difficult as locally mixing the original 3D object with the intended new object and style effects without distorting the object's form is not a straightforward process. To address this issue, we propose a novel NeRF-based model, Blending-NeRF, which consists of two NeRF networks: pretrained NeRF and editable NeRF. Additionally, we introduce new blending operations that allow Blending-NeRF to properly edit target regions which are localized by text. By using a pretrained vision-language aligned model, CLIP, we guide Blending-NeRF to add new objects with varying colors and densities, modify textures, and remove parts of the original object. Our extensive experiments demonstrate that Blending-NeRF produces naturally and locally edited 3D objects from various text prompts. + + + + Panoramas from Photons + http://openaccess.thecvf.com//content/ICCV2023/papers/Jungerman_Panoramas_from_Photons_ICCV_2023_paper.pdf + Scene reconstruction in the presence of high-speed motion and low illumination is important in many applications such as augmented and virtual reality, drone navigation, and autonomous robotics. Traditional motion estimation techniques fail in such conditions, suffering from too much blur in the presence of high-speed motion and strong noise in low-light conditions. Single-photon cameras have recently emerged as a promising technology capable of capturing hundreds of thousands of photon frames per second thanks to their high speed and extreme sensitivity. Unfortunately, traditional computer vision techniques are not well suited for dealing with the binary-valued photon data captured by these cameras because these are corrupted by extreme Poisson noise. Here we present a method capable of estimating extreme scene motion under challenging conditions, such as low light or high dynamic range, from a sequence of high-speed image frames such as those captured by a single-photon camera. Our method relies on iteratively improving a motion estimate by grouping and aggregating frames after-the-fact, in a stratified manner. We demonstrate the creation of high-quality panoramas under fast motion and extremely low light, and super-resolution results using a custom single-photon camera prototype. + + + + Global Adaptation Meets Local Generalization: Unsupervised Domain Adaptation for 3D Human Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chai_Global_Adaptation_Meets_Local_Generalization_Unsupervised_Domain_Adaptation_for_3D_ICCV_2023_paper.pdf + When applying a pre-trained 2D-to-3D Human Pose lifting model to a target unseen dataset, a large performance degradation is commonly encountered due to domain shift issues. 
We observe that the degradation is caused by two factors: 1) the large distribution gap over global positions of poses between the source and target datasets due to variant camera parameters and settings, and 2) the deficient diversity of local structures of poses in the training. To this end, we combine global adaptation and local generalization in PoseDA, a simple yet effective framework of unsupervised domain adaptation for 3D human pose estimation. Specifically, global adaptation aims to align global positions of poses from the source domain to the target domain with a proposed global position alignment (GPA) module. This module brings significant performance improvement without introducing additional learnable parameters. In addition, we propose local pose augmentation (LPA) to enhance the diversity of 3D poses following an adversarial training scheme consisting of 1) an augmentation generator that generates the parameters of pre-defined pose transformations and 2) an anchor discriminator to ensure the reality and quality of the augmented data. Our approach is applicable to almost all 2D-3D lifting models. PoseDA achieves 61.3 mm of MPJPE on MPI-INF-3DHP under a cross-dataset evaluation setup, improving upon the previous state-of-the-art method by 10.2%. + + + + DeFormer: Integrating Transformers with Deformable Models for 3D Shape Abstraction from a Single Image + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_DeFormer_Integrating_Transformers_with_Deformable_Models_for_3D_Shape_Abstraction_ICCV_2023_paper.pdf + Explicit 3D shape abstraction from a single 2D image is a long-standing problem in computer vision and graphics. By leveraging a set of primitives to represent the target shape, recent methods have achieved promising results. However, these methods either use a relatively large number of primitives or lack geometric flexibility due to the low-dimensional expressibility of the primitives. In this paper, we propose a novel bi-channel Transformer architecture, integrated with parameterized deformable models, termed DeFormer, to simultaneously estimate global and local deformations. In this way, DeFormer can abstract complex object shapes while using a small number of primitives which offer broader geometry coverage and finer details. Then, we introduce a force-driven dynamic fitting and a cycle-consistent re-projection loss to optimize the primitive parameters. Extensive experiments on ShapeNet across various settings show that DeFormer achieves better reconstruction accuracy over the state-of-the-art, and visualizes with consistent semantic correspondences for improved interpretability. + + + + Cross-view Semantic Alignment for Livestreaming Product Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Cross-view_Semantic_Alignment_for_Livestreaming_Product_Recognition_ICCV_2023_paper.pdf + Live commerce is the act of selling products online through livestreaming. The customers' diverse demands for online products introduce more challenges to Livestreaming Product Recognition. Previous works either focus on fashion clothing data or are subject to single-modal input, and are thus inconsistent with the real-world scenario where multimodal data from various categories are present. In this paper, we contribute LPR4M, a large-scale multimodal dataset that covers 34 categories, comprises 3 modalities (image, video, and text), and is 50 times larger than the largest publicly available dataset.
In addition, LPR4M contains diverse videos and noise modality pair while also having a long-tailed distribution, resembling real-world problems. Moreover, a cRoss-vIew semantiC alignmEnt (RICE) model is proposed to learn discriminative instance features from the two views (image and video) of products via instance-level contrastive learning as well as cross-view patch-level feature propagation. A novel Patch Feature Reconstruction loss is proposed to penalize the semantic misalignment between the cross-view patches. Extensive ablation studies demonstrate the effectiveness of RICE and provide insights into the importance of dataset diversity and expressivity. + + + + Continuously Masked Transformer for Image Inpainting + http://openaccess.thecvf.com//content/ICCV2023/papers/Ko_Continuously_Masked_Transformer_for_Image_Inpainting_ICCV_2023_paper.pdf + A novel continuous-mask-aware transformer for image inpainting, called CMT, is proposed in this paper, which uses a continuous mask to represent the amounts of errors in tokens. First, we initialize a mask and use it during the self-attention. To facilitate the masked self-attention, we also introduce the notion of overlapping tokens. Second, we update the mask by modeling the error propagation during the masked self-attention. Through several masked self-attention and mask update (MSAU) layers, we predict initial inpainting results. Finally, we refine the initial results to reconstruct a more faithful image. Experimental results on multiple datasets show that the proposed CMT algorithm outperforms existing algorithms significantly. The source codes are available at https://github.com/keunsoo-ko/CMT. + + + + Vanishing Point Estimation in Uncalibrated Images with Prior Gravity Direction + http://openaccess.thecvf.com//content/ICCV2023/papers/Pautrat_Vanishing_Point_Estimation_in_Uncalibrated_Images_with_Prior_Gravity_Direction_ICCV_2023_paper.pdf + We tackle the problem of estimating a Manhattan frame, i.e. three orthogonal vanishing points, and the unknown focal length of the camera, leveraging a prior vertical direction. The direction can come from an Inertial Measurement Unit that is a standard component of recent consumer devices, e.g., smartphones. We provide an exhaustive analysis of minimal line configurations and derive two new 2-line solvers, one of which does not suffer from singularities affecting existing solvers. Additionally, we design a new non-minimal method, running on an arbitrary number of lines, to boost the performance in local optimization. Combining all solvers in a hybrid robust estimator, our method achieves increased accuracy even with a rough prior. Experiments on synthetic and real-world datasets demonstrate the superior accuracy of our method compared to the state of the art, while having comparable runtimes. We further demonstrate the applicability of our solvers for relative rotation estimation. The code is available at https://github.com/cvg/VP-Estimation-with-Prior-Gravity. + + + + Learn TAROT with MENTOR: A Meta-Learned Self-Supervised Approach for Trajectory Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Pourkeshavarz_Learn_TAROT_with_MENTOR_A_Meta-Learned_Self-Supervised_Approach_for_Trajectory_ICCV_2023_paper.pdf + Predicting diverse yet admissible trajectories that adhere to the map constraints is challenging. Graph-based scene encoders have been proven effective for preserving local structures of maps by defining lane-level connections. 
However, such encoders do not capture more complex patterns emerging from long-range heterogeneous connections between nonadjacent interacting lanes. To this end, we shed new light on learning common driving patterns by introducing meTA ROad paTh (TAROT) to formulate combinations of various relations between lanes on the road topology. Intuitively, this can be viewed as finding feasible routes. Furthermore, we propose MEta-road NeTwORk (MENTOR) that helps trajectory prediction by providing it with TAROT as navigation tips. More specifically, 1) we define TAROT prediction as a novel self-supervised proxy task to identify the complex heterogeneous structure of the map. 2) For typical driving actions, we establish several TAROTs that result in multiple Heterogeneous Structure Learning (HSL) tasks. These tasks are used in MENTOR, which performs meta-learning by simultaneously predicting trajectories along with proxy tasks, identifying an optimal combination of them, and automatically balancing them to improve the primary task. We show that our model achieves state-of-the-art performance on the Argoverse dataset, especially on diversity and admissibility metrics, achieving up to 20% improvements in challenging scenarios. We further investigate the contribution of proposed modules in ablation studies. + + + + MatrixVT: Efficient Multi-Camera to BEV Transformation for 3D Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_MatrixVT_Efficient_Multi-Camera_to_BEV_Transformation_for_3D_Perception_ICCV_2023_paper.pdf + This paper proposes an efficient multi-camera to Bird's-Eye-View (BEV) view transformation method for 3D perception, dubbed MatrixVT. Existing view transformers either suffer from poor transformation efficiency or rely on device-specific operators, hindering the broad application of BEV models. In contrast, our method generates BEV features efficiently with only convolutions and matrix multiplications (MatMul). Specifically, we propose describing the BEV feature as the MatMul of image feature and a sparse Feature Transporting Matrix (FTM). A Prime Extraction module is then introduced to compress the dimension of image features and reduce FTM's sparsity. Moreover, we propose the Ring & Ray Decomposition to replace the FTM with two matrices and reformulate our pipeline to reduce calculation further. Compared to existing methods, MatrixVT enjoys a faster speed and less memory footprint while remaining deploy-friendly. Extensive experiments on the nuScenes benchmark demonstrate that our method is highly efficient but obtains results on par with the SOTA method in object detection and map segmentation tasks. Code will be available. + + + + Local and Global Logit Adjustments for Long-Tailed Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Tao_Local_and_Global_Logit_Adjustments_for_Long-Tailed_Learning_ICCV_2023_paper.pdf + Multi-expert ensemble models for long-tailed learning typically either learn diverse generalists from the whole dataset or aggregate specialists on different subsets. However, the former is insufficient for tail classes due to the high imbalance factor of the entire dataset, while the latter may bring ambiguity in predicting unseen classes. To address these issues, we propose a novel Local and Global Logit Adjustments (LGLA) method that learns experts with full data covering all classes and enlarges the discrepancy among them by elaborated logit adjustments. 
LGLA consists of two core components: a Class-aware Logit Adjustment (CLA) strategy and an Adaptive Angular Weighted (AAW) loss. The CLA strategy trains multiple experts which excel at each subset using the Local Logit Adjustment (LLA). It also trains one expert specializing in an inversely long-tailed distribution through Global Logit Adjustment (GLA). Moreover, the AAW loss adopts adaptive hard sample mining with respect to different experts to further improve accuracy. Extensive experiments on popular long-tailed benchmarks manifest the superiority of LGLA over the SOTA methods. + + + + Sensitivity-Aware Visual Parameter-Efficient Fine-Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/He_Sensitivity-Aware_Visual_Parameter-Efficient_Fine-Tuning_ICCV_2023_paper.pdf + Visual Parameter-Efficient Fine-Tuning (PEFT) has become a powerful alternative for full fine-tuning so as to adapt pre-trained vision models to downstream tasks, which only tunes a small number of parameters while freezing the vast majority ones to ease storage burden and optimization difficulty. However, existing PEFT methods introduce trainable parameters to the same positions across different tasks depending solely on human heuristics and neglect the domain gaps. To this end, we study where to introduce and how to allocate trainable parameters by proposing a novel Sensitivity-aware visual Parameter-efficient fine-Tuning (SPT) scheme, which adaptively allocates trainable parameters to task-specific important positions given a desired tunable parameter budget. Specifically, our SPT first quickly identifies the sensitive parameters that require tuning for a given task in a data-dependent way. Next, our SPT further boosts the representational capability for the weight matrices whose number of sensitive parameters exceeds a pre-defined threshold by utilizing existing structured tuning methods, e.g., LoRA or Adapter, to replace directly tuning the selected sensitive parameters (unstructured tuning) under the budget. Extensive experiments on a wide range of downstream recognition tasks show that our SPT is complementary to the existing PEFT methods and largely boosts their performance, e.g., SPT improves Adapter with supervised pre-trained ViT-B/16 backbone by 4.2% and 1.4% mean Top-1 accuracy, reaching SOTA performance on FGVC and VTAB-1k benchmarks, respectively. Source code is at https://github.com/ziplab/SPT. + + + + Weakly-supervised 3D Pose Transfer with Keypoints + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Weakly-supervised_3D_Pose_Transfer_with_Keypoints_ICCV_2023_paper.pdf + The main challenges of 3D pose transfer are: 1) Lack of paired training data with different characters performing the same pose; 2) Disentangling pose and shape information from the target mesh; 3) Difficulty in applying to meshes with different topologies. We thus propose a novel weakly-supervised keypoint-based framework to overcome these difficulties. Specifically, we use a topology-agnostic keypoint detector with inverse kinematics to compute transformations between the source and target meshes. Our method only requires supervision on the keypoints, can be applied to meshes with different topologies and is shape-invariant for the target which allows extraction of pose-only information from the target meshes without transferring shape information. 
We further design a cycle reconstruction to perform self-supervised pose transfer without the need for ground truth deformed mesh with the same pose and shape as the target and source, respectively. We evaluate our approach on benchmark human and animal datasets, where we achieve superior performance compared to the state-of-the-art unsupervised approaches and even comparable performance with the fully supervised approaches. We test on the more challenging Mixamo dataset to verify our approach's ability in handling meshes with different topologies and complex clothes. Cross-dataset evaluation further shows the strong generalization ability of our approach. Our code will be open-sourced upon paper acceptance. + + + + On the Effectiveness of Spectral Discriminators for Perceptual Quality Improvement + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_On_the_Effectiveness_of_Spectral_Discriminators_for_Perceptual_Quality_Improvement_ICCV_2023_paper.pdf + Several recent studies advocate the use of spectral discriminators, which evaluate the Fourier spectra of images for generative modeling. However, the effectiveness of the spectral discriminators is not well interpreted yet. We tackle this issue by examining the spectral discriminators in the context of perceptual image super-resolution (i.e., GAN-based SR), as SR image quality is susceptible to spectral changes. Our analyses reveal that the spectral discriminator indeed performs better than the ordinary (a.k.a. spatial) discriminator in identifying the differences in the high-frequency range; however, the spatial discriminator holds an advantage in the low-frequency range. Thus, we suggest that the spectral and spatial discriminators shall be used simultaneously. Moreover, we improve the spectral discriminators by first calculating the patch-wise Fourier spectrum and then aggregating the spectra by Transformer. We verify the effectiveness of the proposed method twofold. On the one hand, thanks to the additional spectral discriminator, our obtained SR images have their spectra better aligned to those of the real images, which leads to a better PD tradeoff. On the other hand, our ensembled discriminator predicts the perceptual quality more accurately, as evidenced in the no-reference image quality assessment task. + + + + Diffusion-Based 3D Human Pose Estimation with Multi-Hypothesis Aggregation + http://openaccess.thecvf.com//content/ICCV2023/papers/Shan_Diffusion-Based_3D_Human_Pose_Estimation_with_Multi-Hypothesis_Aggregation_ICCV_2023_paper.pdf + In this paper, a novel Diffusion-based 3D Pose estimation (D3DP) method with Joint-wise reProjection-based Multi-hypothesis Aggregation (JPMA) is proposed for probabilistic 3D human pose estimation. On the one hand, D3DP generates multiple possible 3D pose hypotheses for a single 2D observation. It gradually diffuses the ground truth 3D poses to a random distribution, and learns a denoiser conditioned on 2D keypoints to recover the uncontaminated 3D poses. The proposed D3DP is compatible with existing 3D pose estimators and supports users to balance efficiency and accuracy during inference through two customizable parameters. On the other hand, JPMA is proposed to assemble multiple hypotheses generated by D3DP into a single 3D pose for practical use. It reprojects 3D pose hypotheses to the 2D camera plane, selects the best hypothesis joint-by-joint based on the reprojection errors, and combines the selected joints into the final pose. 
The proposed JPMA conducts aggregation at the joint level and makes use of the 2D prior information, both of which have been overlooked by previous approaches. Extensive experiments on Human3.6M and MPI-INF-3DHP datasets show that our method outperforms the state-of-the-art deterministic and probabilistic approaches by 1.5% and 8.9%, respectively. Code is available at https://github.com/paTRICK-swk/D3DP. + + + + RPEFlow: Multimodal Fusion of RGB-PointCloud-Event for Joint Optical Flow and Scene Flow Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wan_RPEFlow_Multimodal_Fusion_of_RGB-PointCloud-Event_for_Joint_Optical_Flow_and_ICCV_2023_paper.pdf + Recently, the RGB images and point clouds fusion methods have been proposed to jointly estimate 2D optical flow and 3D scene flow. However, as both conventional RGB cameras and LiDAR sensors adopt a frame-based data acquisition mechanism, their performance is limited by the fixed low sampling rates, especially in highly-dynamic scenes. By contrast, the event camera can asynchronously capture the intensity changes with a very high temporal resolution, providing complementary dynamic information of the observed scenes. In this paper, we incorporate RGB images, Point clouds and Events for joint optical flow and scene flow estimation with our proposed multi-stage multimodal fusion model, RPEFlow. First, we present an attention fusion module with a cross-attention mechanism to implicitly explore the internal cross-modal correlation for 2D and 3D branches, respectively. Second, we introduce a mutual information regularization term to explicitly model the complementary information of three modalities for effective multimodal feature learning. We also contribute a new synthetic dataset to advocate further research. Experiments on both synthetic and real datasets show that our model outperforms the existing state-of-the-art by a wide margin. Code and dataset is available at https://npucvr.github.io/RPEFlow. + + + + DyGait: Exploiting Dynamic Representations for High-performance Gait Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_DyGait_Exploiting_Dynamic_Representations_for_High-performance_Gait_Recognition_ICCV_2023_paper.pdf + Gait recognition is a biometric technology that recognizes the identity of humans through their walking patterns. Compared with other biometric technologies, gait recognition is more difficult to disguise and can be applied to the condition of long-distance without the cooperation of subjects. Thus, it has unique potential and wide application for crime prevention and social security. At present, most gait recognition methods directly extract features from the video frames to establish representations. However, these architectures learn representations from different features equally but do not pay enough attention to dynamic features, which refers to a representation of dynamic parts of silhouettes over time (e.g. legs). Since dynamic parts of the human body are more informative than other parts (e.g. bags) during walking, in this paper, we propose a novel and high-performance framework named DyGait. This is the first framework on gait recognition that is designed to focus on the extraction of dynamic features. Specifically, to take full advantage of the dynamic information, we propose a Dynamic Augmentation Module (DAM), which can automatically establish spatial-temporal feature representations of the dynamic parts of the human body. 
The experimental results show that our DyGait network outperforms other state-of-the-art gait recognition methods. It achieves an average Rank-1 accuracy of 71.4% on the GREW dataset, 66.3% on the Gait3D dataset, 98.4% on the CASIA-B dataset and 98.3% on the OU-MVLP dataset. + + + + Helping Hands: An Object-Aware Ego-Centric Video Recognition Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Helping_Hands_An_Object-Aware_Ego-Centric_Video_Recognition_Model_ICCV_2023_paper.pdf + We introduce an object-aware decoder for improving the performance of spatio-temporal representations on ego-centric videos. The key idea is to enhance object-awareness during training by tasking the model to predict hand positions, object positions, and the semantic label of the objects using paired captions when available. At inference time the model only requires RGB frames as inputs, and is able to track and ground objects (although it has not been trained explicitly for this). We demonstrate the performance of the object-aware representations learnt by our model, by: (i) evaluating it for strong transfer, i.e, through zero-shot testing, on a number of downstream video-text retrieval and classification benchmarks; and (ii) by evaluating its temporal and spatial (grounding) performance by fine-tuning for this task. In all cases the performance improves over the state of the art -- even for networks trained with far larger batch sizes. Overall, we show that the model can act as a drop-in replacement for an ego-centric video model, and improve performance. + + + + SpinCam: High-Speed Imaging via a Rotating Point-Spread Function + http://openaccess.thecvf.com//content/ICCV2023/papers/Chan_SpinCam_High-Speed_Imaging_via_a_Rotating_Point-Spread_Function_ICCV_2023_paper.pdf + High-speed cameras are an indispensable tool used for the slow-motion analysis of scenes. The fixed bandwidth of any imaging system quickly becomes a bottleneck however, resulting in a fundamental trade-off between the camera's spatial and temporal resolutions. In recent years, compressive high-speed imaging systems have been proposed to circumvent these issues, by optically compressing the signal and using a reconstruction procedure to recover a video. In our work, we propose a novel approach for compressive high-speed imaging based on temporally coding the camera's point-spread function (PSF). By mechanically spinning a diffraction grating in front of a camera, the sensor integrates an image blurred by a PSF that continuously rotates over time. We also propose a deconvolution-based reconstruction algorithm to reconstruct videos from these measurements. Our method achieves superior light efficiency and handles a wider class of scenes compared to prior methods. Also, our mechanical design yields flexible temporal resolution that can be easily increased, potentially allowing capture at 192 kHz--far higher than prior works. We demonstrate a prototype on various applications including motion capture and particle image velocimetry (PIV). + + + + GlueStick: Robust Image Matching by Sticking Points and Lines Together + http://openaccess.thecvf.com//content/ICCV2023/papers/Pautrat_GlueStick_Robust_Image_Matching_by_Sticking_Points_and_Lines_Together_ICCV_2023_paper.pdf + Line segments are powerful features complementary to points. They offer structural cues, robust to drastic viewpoint and illumination changes, and can be present even in texture-less areas. 
However, describing and matching them is more challenging compared to points due to partial occlusions, lack of texture, or repetitiveness. This paper introduces a new matching paradigm, where points, lines, and their descriptors are unified into a single wireframe structure. We propose GlueStick, a deep matching Graph Neural Network (GNN) that takes two wireframes from different images and leverages the connectivity information between nodes to better glue them together. In addition to the increased efficiency brought by the joint matching, we also demonstrate a large boost of performance when leveraging the complementary nature of these two features in a single architecture. We show that our matching strategy outperforms the state-of-the-art approaches independently matching line segments and points for a wide variety of datasets and tasks. Code is available at https://github.com/cvg/GlueStick. + + + + Computational 3D Imaging with Position Sensors + http://openaccess.thecvf.com//content/ICCV2023/papers/Klotz_Computational_3D_Imaging_with_Position_Sensors_ICCV_2023_paper.pdf + Underlying many structured light systems, especially those based on laser scanning, is a simple vision task: tracking a light spot. To accomplish this, scanners use conventional CMOS sensors to capture, transmit, and process millions of pixel measurements. This approach, while capable of achieving high-fidelity 3D scans, is wasteful in terms of (often scarce) sensing and computational resources. We present a structured light system based on position sensing diodes (PSDs), an unconventional sensing modality that directly measures the centroid of the spatial distribution of incident light, thus enabling high-resolution 3D laser scanning with a minimal amount of sensor data. We develop theory and computational algorithms for PSD-based structured light under a variety of light transport effects. We demonstrate the benefits of the proposed techniques using a hardware prototype on several real-world scenes, including optically-challenging objects with long-range inter-reflections and scattering. + + + + Towards Multi-Layered 3D Garments Animation + http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_Towards_Multi-Layered_3D_Garments_Animation_ICCV_2023_paper.pdf + Mimicking realistic dynamics in 3D garment animations is a challenging task due to the complex nature of multi-layered garments and the variety of outer forces involved. Existing approaches mostly focus on single-layered garments driven by only human bodies and struggle to handle general scenarios. In this paper, we propose a novel data-driven method, called LayersNet, to model garment-level animations as particle-wise interactions in a micro physics system. We improve simulation efficiency by representing garments as patch-level particles in a two-level structural hierarchy. Moreover, we introduce a novel Rotation Equivalent Transformation with Rotation Invariant Attention that leverage the rotation invariance and additivity of physics systems to better model outer forces. To verify the effectiveness of our approach and bridge the gap between experimental environments and real-world scenarios, we introduce a new challenging dataset, D-LAYERS, containing 700K frames of dynamics of 4,900 combinations of multi-layered garments driven by human bodies and randomly sampled wind. Our LayersNet achieves superior performance both quantitatively and qualitatively. Project page: www.mmlab-ntu.com/project/layersnet/index.html . 
+ + + + Learning Image Harmonization in the Linear Color Space + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Learning_Image_Harmonization_in_the_Linear_Color_Space_ICCV_2023_paper.pdf + Harmonizing cut-and-paste images into perceptually realistic ones is challenging, as it requires a full understanding of the discrepancies between the background of the target image and the inserted object. Existing methods mainly adjust the appearances of the inserted object via pixel-level manipulations. They are not effective in correcting color discrepancy caused by different scene illuminations and the image formation processes. We note that image colors are essentially camera ISP projection of the scene radiance. If we can trace the image colors back to the radiance field, we may be able to model the scene illumination and harmonize the discrepancy better. In this paper, we propose a novel neural approach to harmonize the image colors in a camera-independent color space, in which color values are proportional to the scene radiance. To this end, we propose a novel image unprocessing module to estimate an intermediate high dynamic range version of the object to be inserted. We then propose a novel color harmonization module that harmonizes the colors of the inserted object by querying the estimated scene radiance and re-rendering the harmonized object in the output color space. Extensive experiments demonstrate that our method outperforms the state-of-the-art approaches. + + + + Chasing Clouds: Differentiable Volumetric Rasterisation of Point Clouds as a Highly Efficient and Accurate Loss for Large-Scale Deformable 3D Registration + http://openaccess.thecvf.com//content/ICCV2023/papers/Heinrich_Chasing_Clouds_Differentiable_Volumetric_Rasterisation_of_Point_Clouds_as_a_ICCV_2023_paper.pdf + Learning-based registration for large-scale 3D point clouds has been shown to improve robustness and accuracy compared to classical methods and can be trained without supervision for locally rigid problems. However, for tasks with highly deformable structures, such as alignment of pulmonary vascular trees for medical diagnostics, previous approaches of self-supervision with regularisation and point distance losses have failed to succeed, leading to the need for complex synthetic augmentation strategies to obtain reliably strong supervision. In this work, we introduce a novel Differentiable Volumetric Rasterisation of point Clouds (DiVRoC) that overcomes those limitations and offers a highly efficient and accurate loss for large-scale deformable 3D registration. DiVRoC drastically reduces the computational complexity for measuring point cloud distances for high-resolution data with over 100k 3D points and can also be employed to extrapolate and regularise sparse motion fields, as loss in a self-training setting and as objective function in instance optimisation. DiVRoC can be successfully embedded into geometric registration networks, including PointPWC-Net and other graph CNNs. Our approach yields new state-of-the-art accuracy on the challenging PVT dataset in three different settings without training with manual ground truth: 1) unsupervised metric-based learning 2) self-supervised learning with pseudo labels generated by self-training and 3) optimisation based alignment without learning. 
https://github.com/mattiaspaul/ChasingClouds + + + + The Devil is in the Upsampling: Architectural Decisions Made Simpler for Denoising with Deep Image Prior + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_The_Devil_is_in_the_Upsampling_Architectural_Decisions_Made_Simpler_ICCV_2023_paper.pdf + Deep Image Prior (DIP) shows that some network architectures inherently tend towards generating smooth images while resisting noise, a phenomenon known as spectral bias. Image denoising is a natural application of this property. Although denoising with DIP mitigates the need for large training sets, two often intertwined practical challenges need to be overcome: architectural design and noise fitting. Existing methods either handcraft or search for suitable architectures from a vast design space, due to the limited understanding of how architectural choices affect the denoising outcome. In this study, we demonstrate from a frequency perspective that unlearnt upsampling is the main driving force behind the denoising phenomenon with DIP. This finding leads to straightforward strategies for identifying a suitable architecture for every image without laborious search. Extensive experiments show that the estimated architectures achieve superior denoising results than existing methods with up to 95% fewer parameters. Thanks to this under-parameterization, the resulting architectures are less prone to noise-fitting. + + + + Video Object Segmentation-aware Video Frame Interpolation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yoo_Video_Object_Segmentation-aware_Video_Frame_Interpolation_ICCV_2023_paper.pdf + Video frame interpolation (VFI) is a very active research topic due to its broad applicability to many applications, including video enhancement, video encoding, and slow-motion effects. VFI methods have been advanced by improving the overall image quality for challenging sequences containing occlusions, large motion, and dynamic texture. This mainstream research direction neglects that foreground and background regions have different importance in perceptual image quality. Moreover, accurate synthesis of moving objects can be of utmost importance in computer vision applications. In this paper, we propose a video object segmentation (VOS)-aware training framework called VOS-VFI that allows VFI models to interpolate frames with more precise object boundaries. Specifically, we exploit VOS as an auxiliary task to help train VFI models by providing additional loss functions, including segmentation loss and bi-directional consistency loss. From extensive experiments, we demonstrate that VOS-VFI can boost the performance of existing VFI models by rendering clear object boundaries. Moreover, VOS-VFI displays its effectiveness on multiple benchmarks for different applications, including video object segmentation, object pose estimation, and visual tracking. + + + + Coherent Event Guided Low-Light Video Enhancement + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_Coherent_Event_Guided_Low-Light_Video_Enhancement_ICCV_2023_paper.pdf + With frame-based cameras, capturing fast-moving scenes without suffering from blur often comes at the cost of low SNR and low contrast. Worse still, the photometric constancy that enhancement techniques heavily relied on is fragile for frames with short exposure. Event cameras can record brightness changes at an extremely high temporal resolution. 
For low-light videos, event data are not only suitable to help capture temporal correspondences but also provide alternative observations in the form of intensity ratios between consecutive frames and exposure-invariant information. Motivated by this, we propose a low-light video enhancement method with hybrid inputs of events and frames. Specifically, a neural network is trained to establish spatiotemporal coherence between visual signals with different modalities and resolutions by constructing correlation volume across space and time. Experimental results on synthetic and real data demonstrate the superiority of the proposed method compared to the state-of-the-art methods. + + + + FCCNs: Fully Complex-valued Convolutional Networks using Complex-valued Color Model and Loss Function + http://openaccess.thecvf.com//content/ICCV2023/papers/Yadav_FCCNs_Fully_Complex-valued_Convolutional_Networks_using_Complex-valued_Color_Model_and_ICCV_2023_paper.pdf + Although complex-valued convolutional neural networks (iCNNs) have existed for a while, they lack proper complex-valued image inputs and loss functions. In addition, all their operations are not complex-valued as they have both complex-valued convolutional layers and real-valued fully-connected layers. As a result, they lack an end-to-end flow of complex-valued information, making them inconsistent w.r.t. the claimed operating domain, i.e., complex numbers. Considering these inconsistencies, we propose a complex-valued color model and loss function and turn fully-connected layers into convolutional layers. All these contributions culminate in what we call FCCNs (Fully Complex-valued Convolutional Networks), which take complex-valued images as inputs, perform only complex-valued operations, and have a complex-valued loss function. Thus, our proposed FCCNs have an end-to-end flow of complex-valued information, which lacks in existing iCNNs. Our extensive experiments on five image classification benchmark datasets show that FCCNs consistently perform better than existing iCNNs. Code is available at https://github.com/saurabhya/FCCNs . + + + + S-TREK: Sequential Translation and Rotation Equivariant Keypoints for Local Feature Extraction + http://openaccess.thecvf.com//content/ICCV2023/papers/Santellani_S-TREK_Sequential_Translation_and_Rotation_Equivariant_Keypoints_for_Local_Feature_ICCV_2023_paper.pdf + In this work we introduce S-TREK, a novel local feature extractor that combines a deep keypoint detector, which is both translation and rotation equivariant by design, with a lightweight deep descriptor extractor. We train the S-TREK keypoint detector within a framework inspired by reinforcement learning, where we leverage a sequential procedure to maximize a reward directly related to keypoint repeatability. Our descriptor network is trained following a "detect, then describe" approach, where the descriptor loss is evaluated only at those locations where keypoints have been selected by the already trained detector. Extensive experiments on multiple benchmarks confirm the effectiveness of our proposed method, with S-TREK often outperforming other state-of-the-art methods in terms of repeatability and quality of the recovered poses, especially when dealing with in-plane rotations. 
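To make the "fully complex-valued" idea in the FCCNs entry above concrete, here is a minimal, framework-free check that a complex-valued 2D convolution decomposes into four real convolutions via (a+ib)(c+id) = (ac-bd) + i(ad+bc). The helper names and tensor sizes are mine; this is not the FCCNs architecture or its complex-valued color model.

```python
import numpy as np

def conv2d_valid(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Plain 'valid' 2D cross-correlation; real or complex dtype is preserved."""
    kh, kw = k.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1), dtype=np.result_type(x, k))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = (x[i:i + kh, j:j + kw] * k).sum()
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8)) + 1j * rng.normal(size=(8, 8))   # complex-valued input
k = rng.normal(size=(3, 3)) + 1j * rng.normal(size=(3, 3))   # complex-valued kernel

# Direct complex arithmetic ...
direct = conv2d_valid(x, k)

# ... equals four real convolutions combined as (a+ib)(c+id) = (ac-bd) + i(ad+bc).
xr, xi, kr, ki = x.real, x.imag, k.real, k.imag
recombined = (conv2d_valid(xr, kr) - conv2d_valid(xi, ki)) \
           + 1j * (conv2d_valid(xr, ki) + conv2d_valid(xi, kr))

assert np.allclose(direct, recombined)
```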
+ + + + E2NeRF: Event Enhanced Neural Radiance Fields from Blurry Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Qi_E2NeRF_Event_Enhanced_Neural_Radiance_Fields_from_Blurry_Images_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRF) achieves impressive ren-dering performance by learning volumetric 3D representation from several images of different views. However, it is difficult to reconstruct a sharp NeRF from blurry input as often occurred in the wild. To solve this problem, we propose a novel Event-Enhanced NeRF (E2NeRF) by utilizing the combination data of a bio-inspired event camera and a standard RGB camera. To effectively introduce event stream into the learning process of neural volumetric representation, we propose a blur rendering loss and an event rendering loss, which guide the network via modelling real blur process and event generation process, respectively. Moreover, a camera pose estimation framework for real-world data is built with the guidance of event stream to generalize the method to practical applications. In contrast to previous image-based or event-based NeRF, our framework effectively utilizes the internal relationship between events and images. As a result, E2NeRF not only achieves image deblurring but also achieves high-quality novel view image generation. Extensive experiments on both synthetic data and real-world data demonstrate that E2NeRF can effectively learn a sharp NeRF from blurry images, especially in complex and low-light scenes. Our code and datasets are publicly available at https://github.com/iCVTEAM/E2NeRF. + + + + EgoTV: Egocentric Task Verification from Natural Language Task Descriptions + http://openaccess.thecvf.com//content/ICCV2023/papers/Hazra_EgoTV_Egocentric_Task_Verification_from_Natural_Language_Task_Descriptions_ICCV_2023_paper.pdf + To enable progress towards egocentric agents capable of understanding everyday tasks specified in natural language, we propose a benchmark and a synthetic dataset called Egocentric Task Verification (EgoTV). The goal in EgoTV is to verify the execution of tasks from egocentric videos based on the natural language description of these tasks. EgoTV contains pairs of videos and their task descriptions for multi-step tasks -- these tasks contain multiple sub-task decompositions, state changes, object interactions, and sub-task ordering constraints. In addition, EgoTV also provides abstracted task descriptions that contain only partial details about ways to accomplish a task. Consequently, EgoTV requires causal, temporal, and compositional reasoning of video and language modalities, which is missing in existing datasets. We also find that existing vision-language models struggle at such all round reasoning needed for task verification in EgoTV. Inspired by the needs of EgoTV, we propose a novel Neuro-Symbolic Grounding (NSG) approach that leverages symbolic representations to capture the compositional and temporal structure of tasks. We demonstrate NSG's capability towards task tracking and verification on our EgoTV dataset and a real-world dataset derived from CrossTask (CTV). We open-source the EgoTV and CTV datasets and the NSG model for future research on egocentric assistive agents. 
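The event rendering loss in the E2NeRF entry above is described as modelling the event generation process; the standard idealised model for that process fires an event whenever a pixel's log intensity changes by more than a contrast threshold. The sketch below simulates that counting rule between two frames; the threshold value and function names are illustrative and not taken from the paper.

```python
import numpy as np

def simulate_events(frame_prev: np.ndarray, frame_next: np.ndarray,
                    threshold: float = 0.2, eps: float = 1e-3) -> np.ndarray:
    """Signed per-pixel event counts between two intensity frames.

    Idealised event-camera model: a pixel emits one event each time its
    log intensity changes by more than `threshold` (positive or negative).
    """
    dlog = np.log(frame_next + eps) - np.log(frame_prev + eps)
    return np.sign(dlog) * np.floor(np.abs(dlog) / threshold)

# Toy example: a pixel brightening from 0.2 to 0.5 yields several positive events.
prev = np.full((4, 4), 0.2)
nxt = np.full((4, 4), 0.5)
print(simulate_events(prev, nxt)[0, 0])   # 4.0 with threshold=0.2
```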
+ + + + LMR: A Large-Scale Multi-Reference Dataset for Reference-Based Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_LMR_A_Large-Scale_Multi-Reference_Dataset_for_Reference-Based_Super-Resolution_ICCV_2023_paper.pdf + It is widely agreed that reference-based super-resolution (RefSR) achieves superior results by referring to similar high quality images, compared to single image super-resolution (SISR). Intuitively, the more references, the better performance. However, previous RefSR methods have all focused on single-reference image training, while multiple reference images are often available in testing or practical applications. The root cause of such training-testing mismatch is the absence of publicly available multi-reference SR training datasets, which greatly hinders research efforts on multi-reference super-resolution. To this end, we construct a large-scale, multi-reference super-resolution dataset, named LMR. It contains 112,142 groups of 300x300 training images, which is 10x of the existing largest RefSR dataset. The image size is also some times larger. More importantly, each group is equipped with 5 reference images with different similarity levels. Furthermore, we propose a new baseline method for multi-reference super-resolution: MRefSR, including a Multi-Reference Attention Module (MAM) for feature fusion of an arbitrary number of reference images, and a Spatial Aware Filtering Module (SAFM) for the fused feature selection. The proposed MRefSR achieves significant improvements over state-of-the-art approaches on both quantitative and qualitative evaluations. Our code and data are available at: https://github.com/wdmwhh/MRefSR. + + + + Neural Implicit Surface Evolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Novello_Neural_Implicit_Surface_Evolution_ICCV_2023_paper.pdf + This work investigates the use of smooth neural networks for modeling dynamic variations of implicit surfaces under the level set equation (LSE). For this, it extends the representation of neural implicit surfaces to the space-time, which opens up mechanisms for continuous geometric transformations. Examples include evolving an initial surface towards general vector fields, smoothing and sharpening using the mean curvature equation, and interpolations of initial conditions. The network training considers two constraints. A data term is responsible for fitting the initial condition to the corresponding time instant. Then, a LSE term forces the network to approximate the underlying geometric evolution given by the LSE, without any supervision. The network can also be initialized based on previously trained initial conditions, resulting in faster convergence compared to the standard approach. + + + + Distribution-Aligned Diffusion for Human Mesh Recovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Foo_Distribution-Aligned_Diffusion_for_Human_Mesh_Recovery_ICCV_2023_paper.pdf + Recovering a 3D human mesh from a single RGB image is a challenging task due to depth ambiguity and self-occlusion, resulting in a high degree of uncertainty. Meanwhile, diffusion models have recently seen much success in generating high-quality outputs by progressively denoising noisy inputs. Inspired by their capability, we explore a diffusion-based approach for human mesh recovery, and propose a Human Mesh Diffusion (HMDiff) framework which frames mesh recovery as a reverse diffusion process. 
We also propose a Distribution Alignment Technique (DAT) that injects input-specific distribution information into the diffusion process, and provides useful prior knowledge to simplify the mesh recovery task. Our method achieves state-of-the-art performance on three widely used datasets. Project page: https://gongjia0208.github.io/HMDiff/. + + + + Diffuse3D: Wide-Angle 3D Photography via Bilateral Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Diffuse3D_Wide-Angle_3D_Photography_via_Bilateral_Diffusion_ICCV_2023_paper.pdf + This paper aims to resolve the challenging problem of wide-angle novel view synthesis from a single image, a.k.a. wide-angle 3D photography. Existing approaches rely on local context and treat them equally to inpaint occluded RGB and depth regions, which fail to deal with large-region occlusion (i.e., observing from an extreme angle) and foreground layers might blend into background inpainting. To address the above issues, we propose Diffuse3D which employs a pre-trained diffusion model for global synthesis, while amending the model to activate depth-aware inference. Our key insight is to alter the convolution mechanism in the denoising process. We inject depth information into the denoising convolution operation with bilateral kernels, i.e., a depth kernel and a spatial kernel, to consider layered correlations among pixels. In this way, foreground regions are overlooked in background inpainting and only pixels close in depth are leveraged. On the other hand, we propose a global-local balancing approach to maximize both contextual understandings. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods in novel view synthesis, especially in wide-angle scenarios. More importantly, our method does not require any training and is a plug-and-play module that can be integrated with any diffusion model. Our code can be found at https://github.com/yutaojiang1/Diffuse3D. + + + + Tubelet-Contrastive Self-Supervision for Video-Efficient Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Thoker_Tubelet-Contrastive_Self-Supervision_for_Video-Efficient_Generalization_ICCV_2023_paper.pdf + We propose a self-supervised method for learning motion-focused video representations. Existing approaches minimize distances between temporally augmented videos, which maintain high spatial similarity. We instead propose to learn similarities between videos with identical local motion dynamics but an otherwise different appearance. We do so by adding synthetic motion trajectories to videos which we refer to as tubelets. By simulating different tubelet motions and applying transformations, such as scaling and rotation, we introduce motion patterns beyond what is present in the pretraining data. This allows us to learn a video representation that is remarkably data efficient: our approach maintains performance when using only 25% of the pretraining videos. Experiments on 10 diverse downstream settings demonstrate our competitive performance and generalizability to new domains and fine-grained actions. + + + + Generalizing Event-Based Motion Deblurring in Real-World Scenarios + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Generalizing_Event-Based_Motion_Deblurring_in_Real-World_Scenarios_ICCV_2023_paper.pdf + Event-based motion deblurring has shown promising results by exploiting low-latency events. 
However, current approaches are limited in their practical usage, as they assume the same spatial resolution of inputs and specific blurriness distributions. This work addresses these limitations and aims to generalize the performance of event-based deblurring in real-world scenarios. We propose a scale-aware network that allows flexible input spatial scales and enables learning from different temporal scales of motion blur. A two-stage self-supervised learning scheme is then developed to fit real-world data distribution. By utilizing the relativity of blurriness, our approach efficiently ensures the restored brightness and structure of latent images and further generalizes deblurring performance to handle varying spatial and temporal scales of motion blur in a self-distillation manner. Our method is extensively evaluated, demonstrating remarkable performance, and we also introduce a real-world dataset consisting of multi-scale blurry frames and events to facilitate research in event-based deblurring. + + + + RCA-NOC: Relative Contrastive Alignment for Novel Object Captioning + http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_RCA-NOC_Relative_Contrastive_Alignment_for_Novel_Object_Captioning_ICCV_2023_paper.pdf + In this paper, we introduce a novel approach to novel object captioning which employs relative contrastive learning to learn visual and semantic alignment. Our approach maximizes compatibility between regions and object tags in a contrastive manner. To set up a proper contrastive learning objective, for each image, we augment tags by leveraging the relative nature of positive and negative pairs obtained from foundation models such as CLIP. We then use the rank of each augmented tag in a list as a relative relevance label to contrast each top-ranked tag with a set of lower-ranked tags. This learning objective encourages the top-ranked tags to be more compatible with their image and text context than lower-ranked tags, thus improving the discriminative ability of the learned multi-modality representation. We evaluate our approach on two datasets and show that our proposed RCA-NOC approach outperforms state-of-the-art methods by a large margin, demonstrating its effectiveness in improving vision-language representation for novel object captioning. + + + + What Can Simple Arithmetic Operations Do for Temporal Modeling? + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_What_Can_Simple_Arithmetic_Operations_Do_for_Temporal_Modeling_ICCV_2023_paper.pdf + Temporal modeling plays a crucial role in understanding video content. To tackle this problem, previous studies built complicated temporal relations through time sequence thanks to the development of computationally powerful devices. In this work, we explore the potential of four simple arithmetic operations for temporal modeling. Specifically, we first capture auxiliary temporal cues by computing addition, subtraction, multiplication, and division between pairs of extracted frame features. Then, we extract corresponding features from these cues to benefit the original temporal-irrespective domain. We term such a simple pipeline as an Arithmetic Temporal Module (ATM), which operates on the stem of a visual backbone with a plug-and-play style. We conduct comprehensive ablation studies on the instantiation of ATMs and demonstrate that this module provides powerful temporal modeling capability at a low computational cost. Moreover, the ATM is compatible with both CNNs- and ViTs-based architectures. 
Our results show that ATM achieves superior performance over several popular video benchmarks. Specifically, on Something-Something V1, V2 and Kinetics-400, we reach top-1 accuracy of 65.6%, 74.6%, and 89.4% respectively. The code is available at https://github.com/whwu95/ATM. + + + + Pixel Adaptive Deep Unfolding Transformer for Hyperspectral Image Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Pixel_Adaptive_Deep_Unfolding_Transformer_for_Hyperspectral_Image_Reconstruction_ICCV_2023_paper.pdf + Hyperspectral Image (HSI) reconstruction has made gratifying progress with the deep unfolding framework by formulating the problem into a data module and a prior module. Nevertheless, existing methods still face the problem of insufficient matching with HSI data. The issues lie in three aspects: 1) fixed gradient descent step in the data module while the degradation of HSI is agnostic in the pixel-level. 2) inadequate prior module for 3D HSI cube. 3) stage interaction ignoring the differences in features at different stages. To address these issues, in this work, we propose a Pixel Adaptive Deep Unfolding Transformer (PADUT) for HSI reconstruction. In the data module, a pixel adaptive descent step is employed to focus on pixel-level agnostic degradation. In the prior module, we introduce the Non-local Spectral Transformer (NST) to emphasize the 3D characteristics of HSI for recovering. Moreover, inspired by the diverse expression of features in different stages and depths, the stage interaction is improved by the Fast Fourier Transform (FFT). Experimental results on both simulated and real scenes exhibit the superior performance of our method compared to state-of-the-art HSI reconstruction methods. The code is released at: https://github.com/MyuLi/PADUT + + + + Dynamic PlenOctree for Adaptive Sampling Refinement in Explicit NeRF + http://openaccess.thecvf.com//content/ICCV2023/papers/Bai_Dynamic_PlenOctree_for_Adaptive_Sampling_Refinement_in_Explicit_NeRF_ICCV_2023_paper.pdf + The explicit neural radiance field (NeRF) has gained considerable interest for its efficient training and fast inference capabilities, making it a promising direction such as virtual reality and gaming. In particular, PlenOctree (POT), an explicit hierarchical multi-scale octree representation, has emerged as a structural and influential framework. However, POT's fixed structure for direct optimization is sub-optimal as the scene complexity evolves continuously with updates to cached color and density, necessitating refining the sampling distribution to capture signal complexity accordingly. To address this issue, we propose the dynamic PlenOctree (DOT), which adaptively refines the sample distribution to adjust to changing scene complexity. Specifically, DOT proposes a concise yet novel hierarchical feature fusion strategy during the iterative rendering process. Firstly, it identifies the regions of interest through training signals to ensure adaptive and efficient refinement. Next, rather than directly filtering out valueless nodes, DOT introduces the sampling and pruning operations for octrees to aggregate features, enabling rapid parameter learning. Compared with POT, our DOT outperforms it by enhancing visual quality, reducing over 55.15/68.84% parameters, and providing 1.7/1.9 times FPS for NeRF-synthetic and Tanks & Temples, respectively. 
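The Arithmetic Temporal Module described in the "What Can Simple Arithmetic Operations Do for Temporal Modeling?" entry above is simple enough to sketch directly: the four cues are element-wise addition, subtraction, multiplication, and division between consecutive frame features. The shapes, the epsilon guard, and the dictionary interface below are my own choices, not the released ATM code.

```python
import numpy as np

def arithmetic_temporal_cues(feats: np.ndarray, eps: float = 1e-6) -> dict:
    """Four arithmetic cues between consecutive frame features.

    feats: (T, C) array of per-frame feature vectors; each cue is (T-1, C).
    """
    a, b = feats[1:], feats[:-1]             # pairs of consecutive frames
    return {
        "add": a + b,
        "sub": a - b,                         # akin to a temporal difference
        "mul": a * b,
        "div": a / (b + eps),                 # eps is a simple guard; a real
                                              # implementation would stabilise
                                              # division more carefully
    }

T, C = 8, 16
cues = arithmetic_temporal_cues(np.random.default_rng(1).normal(size=(T, C)))
print({name: cue.shape for name, cue in cues.items()})   # each cue is (7, 16)
```

In the paper these cues feed small heads attached to the backbone; the sketch only shows where the four operations sit relative to the frame features.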
+ + + + Scene Matters: Model-based Deep Video Compression + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_Scene_Matters_Model-based_Deep_Video_Compression_ICCV_2023_paper.pdf + Video compression has always been a popular research area, where many traditional and deep video compression methods have been proposed. These methods typically rely on signal prediction theory to enhance compression performance by designing high efficient intra and inter prediction strategies and compressing video frames one by one. In this paper, we propose a novel model-based video compression (MVC) framework that regards scenes as the fundamental units for video sequences. Our proposed MVC directly models the intensity variation of the entire video sequence in one scene, seeking non-redundant representations instead of reducing redundancy through spatio-temporal predictions. To achieve this, we employ implicit neural representation as our basic modeling architecture. To improve the efficiency of video modeling, we first propose context-related spatial positional embedding and frequency domain supervision in spatial context enhancement. For temporal correlation capturing, we design the scene flow constrain mechanism and temporal contrastive loss. Extensive experimental results demonstrate that our method achieves up to a 20% bitrate reduction compared to the latest video coding standard H.266 and is more efficient in decoding than existing video coding strategies. + + + + A Good Student is Cooperative and Reliable: CNN-Transformer Collaborative Learning for Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_A_Good_Student_is_Cooperative_and_Reliable_CNN-Transformer_Collaborative_Learning_ICCV_2023_paper.pdf + In this paper, we strive to answer the question 'how to collaboratively learn convolutional neural network (CNN)-based and vision transformer (ViT)-based models by selecting and exchanging the reliable knowledge between them for semantic segmentation?' Accordingly, we propose an online knowledge distillation (KD) framework that can simultaneously learn compact yet effective CNN-based and ViT-based models with two key technical breakthroughs to take full advantage of CNNs and ViT while compensating their limitations. Firstly, we propose heterogeneous feature distillation (HFD) to improve students' consistency in low-layer feature space by mimicking heterogeneous features between CNNs and ViT. Secondly, to facilitate the two students to learn reliable knowledge from each other, we propose bidirectional selective distillation (BSD) that can dynamically transfer selective knowledge. This is achieved by 1) region-wise BSD determining the directions of knowledge transferred between the corresponding regions in the feature space and 2) pixel-wise BSD discerning which of the prediction knowledge to be transferred in the logit space. Extensive experiments on three benchmark datasets demonstrate that our proposed framework outperforms the state-of-the-art online distillation methods by a large margin, and shows its efficacy in learning collaboratively between ViT-based and CNN-based models. 
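As a loose, heavily simplified analogue of the pixel-wise bidirectional selective distillation in the "A Good Student is Cooperative and Reliable" entry above, the sketch below lets whichever student currently fits the ground truth better at a pixel act as the teacher there, and applies a KL distillation term to the other student at that pixel. The selection rule, shapes, and function names are assumptions for illustration; the paper's HFD and BSD modules are more involved.

```python
import numpy as np

def softmax(logits: np.ndarray, axis: int = -1) -> np.ndarray:
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def selective_bidirectional_kd(logits_cnn, logits_vit, labels) -> float:
    """Pixel-wise selective distillation between two segmentation students.

    logits_*: (H, W, K) class logits; labels: (H, W) integer ground truth.
    At each pixel, the student that currently fits the label better acts as
    the teacher; the other student receives a KL distillation loss there.
    (In a real framework the teacher side would be detached from gradients.)
    """
    p_cnn, p_vit = softmax(logits_cnn), softmax(logits_vit)
    h, w, _ = logits_cnn.shape
    ii, jj = np.mgrid[0:h, 0:w]
    ce_cnn = -np.log(p_cnn[ii, jj, labels] + 1e-12)
    ce_vit = -np.log(p_vit[ii, jj, labels] + 1e-12)
    cnn_teaches = ce_cnn < ce_vit                      # boolean (H, W) mask

    kl_vit_from_cnn = (p_cnn * (np.log(p_cnn + 1e-12) - np.log(p_vit + 1e-12))).sum(-1)
    kl_cnn_from_vit = (p_vit * (np.log(p_vit + 1e-12) - np.log(p_cnn + 1e-12))).sum(-1)
    return float(np.where(cnn_teaches, kl_vit_from_cnn, kl_cnn_from_vit).mean())

rng = np.random.default_rng(0)
H, W, K = 4, 4, 5
loss = selective_bidirectional_kd(rng.normal(size=(H, W, K)),
                                  rng.normal(size=(H, W, K)),
                                  rng.integers(0, K, size=(H, W)))
print(loss)
```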
+ + + + Fan-Beam Binarization Difference Projection (FB-BDP): A Novel Local Object Descriptor for Fine-Grained Leaf Image Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Fan-Beam_Binarization_Difference_Projection_FB-BDP_A_Novel_Local_Object_Descriptor_ICCV_2023_paper.pdf + Fine-grained leaf image retrieval (FGLIR) aims to search for similar leaf images at the subspecies level, which involves very high interclass visual similarity and accordingly poses great challenges to leaf image description. In this study, we introduce a new concept, named fan-beam binarization difference projection (FB-BDP), to address this challenging issue. It is designed based on the theory of fan-beam projection (FBP), a mathematical tool originally used for computed tomographic reconstruction of objects, which has the merits of capturing the inner structure information of objects in multiple directions and an excellent ability to suppress image noise. However, few studies have applied FBP to the description of texture patterns. Rather than calculating ray integrals over the whole object area, FB-BDP restricts its ray integrals to local patches to guarantee the locality of the extracted features. By binarizing the intensity differences between the off-center and center rays, FB-BDP makes its ray integrals insensitive to illumination change and more discriminative in the characterization of texture patterns. In addition, by inheriting the merits of FBP, the proposed FB-BDP is superior to existing local image descriptors in its invariance to scaling transformations, robustness to noise, and strong ability to capture directional and structural texture patterns. The results of extensive experiments on FGLIR show its higher retrieval accuracy than the benchmark methods, promising generalization power, and strong complementarity to deep features. + + + + InterDiff: Generating 3D Human-Object Interactions with Physics-Informed Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_InterDiff_Generating_3D_Human-Object_Interactions_with_Physics-Informed_Diffusion_ICCV_2023_paper.pdf + This paper addresses a novel task of anticipating 3D human-object interactions (HOIs). Most existing research on HOI synthesis lacks comprehensive whole-body interactions with dynamic objects, e.g., often limited to manipulating small or static objects. Our task is significantly more challenging, as it requires modeling dynamic objects with various shapes, capturing whole-body motion, and ensuring physically valid interactions. To this end, we propose InterDiff, a framework comprising two key steps: (i) interaction diffusion, where we leverage a diffusion model to encode the distribution of future human-object interactions; (ii) interaction correction, where we introduce a physics-informed predictor to correct denoised HOIs in a diffusion step. Our key insight is to inject prior knowledge that the interactions under reference with respect to contact points follow a simple pattern and are easily predictable. Experiments on multiple human-object interaction datasets demonstrate the effectiveness of our method for this task, capable of producing realistic, vivid, and remarkably long-term 3D HOI predictions. 

+ + + + IST-Net: Prior-Free Category-Level Pose Estimation with Implicit Space Transformation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_IST-Net_Prior-Free_Category-Level_Pose_Estimation_with_Implicit_Space_Transformation_ICCV_2023_paper.pdf + Category-level 6D pose estimation aims to predict the poses and sizes of unseen objects from a specific category. Thanks to prior deformation, which explicitly adapts a category-specific 3D prior (i.e., a 3D template) to a given object instance, prior-based methods have attained great success and have become a major research stream. However, obtaining category-specific priors requires collecting a large number of 3D models, which is labor-intensive and often not feasible in practice. This motivates us to investigate whether priors are necessary to make prior-based methods effective. Our empirical study shows that the 3D prior itself is not what accounts for the high performance. The key is actually the explicit deformation process, which aligns camera and world coordinates under the supervision of world-space 3D models (also called the canonical space). Inspired by these observations, we introduce a simple prior-free implicit space transformation network, namely IST-Net, to transform camera-space features to world-space counterparts and build correspondences between them in an implicit manner without relying on 3D priors. Besides, we design camera- and world-space enhancers to enrich the features with pose-sensitive information and geometrical constraints, respectively. Albeit simple, IST-Net achieves state-of-the-art performance with a prior-free design, with top inference speed on the REAL275 benchmark. Our code and models are available at https://github.com/CVMI-Lab/IST-Net. + + + + Curvature-Aware Training for Coordinate Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Saratchandran_Curvature-Aware_Training_for_Coordinate_Networks_ICCV_2023_paper.pdf + Coordinate networks are widely used in computer vision due to their ability to represent signals as compressed, continuous entities. However, training these networks with first-order optimizers can be slow, hindering their use in real-time applications. Recent works have opted for shallow voxel-based representations to achieve faster training, but this sacrifices memory efficiency. This work proposes a solution that leverages second-order optimization methods to significantly reduce training times for coordinate networks while maintaining their compressibility. Experiments demonstrate the effectiveness of this approach on various signal modalities, such as audio, images, videos, shapes, and neural radiance fields (NeRF). + + + + Learning Rain Location Prior for Nighttime Deraining + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Learning_Rain_Location_Prior_for_Nighttime_Deraining_ICCV_2023_paper.pdf + Rain can significantly degrade image quality and visibility, making deraining a critical area of research in computer vision. Despite recent progress in learning-based deraining methods, there is a lack of focus on nighttime deraining due to the unique challenges posed by non-uniform local illuminations from artificial light sources. Rain streaks in these scenes have diverse appearances that are tightly related to their positions relative to light sources, making it difficult for existing deraining methods to effectively handle them. In this paper, we highlight the importance of rain streak location information in nighttime deraining. 
Specifically, we propose a Rain Location Prior (RLP) that is learned implicitly from rainy images using a recurrent residual model. This learned prior contains location information of rain streaks and, when injected into deraining models, can significantly improve their performance. To further improve the effectiveness of the learned prior, we also propose a Rain Prior Injection Module (RPIM) to modulate the prior before injection, increasing the importance of features within rain streak areas. Experimental results demonstrate that our approach outperforms existing state-of-the-art methods by about 1dB and effectively improves the performance of deraining models. We also evaluate our method on real night rainy images to show the capability to handle real scenes with fully synthetic data for training. Our method represents a significant step forward in the area of nighttime deraining and highlights the importance of location information in this challenging problem. The code is publicly available at https://github.com/zkawfanx/RLP. + + + + FBLNet: FeedBack Loop Network for Driver Attention Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_FBLNet_FeedBack_Loop_Network_for_Driver_Attention_Prediction_ICCV_2023_paper.pdf + The problem of predicting driver attention from the driving perspective is gaining increasing research focus due to its remarkable significance for autonomous driving and assisted driving systems. The driving experience is extremely important for safe driving, a skilled driver is able to effortlessly predict oncoming danger (before it becomes salient) based on the driving experience and quickly pay attention to the corresponding zones. However, the nonobjective driving experience is difficult to model, so a mechanism simulating the driver experience accumulation procedure is absent in existing methods, and the current methods usually follow the technique line of saliency prediction methods to predict driver attention. In this paper, we propose a FeedBack Loop Network (FBLNet), which attempts to model the driving experience accumulation procedure. By over-and-over iterations, FBLNet generates the incremental knowledge that carries rich historically-accumulative and long-term temporal information. The incremental knowledge in our model is like the driving experience of humans. Under the guidance of the incremental knowledge, our model fuses the CNN feature and Transformer feature that are extracted from the input image to predict driver attention. Our model exhibits a solid advantage over existing methods, achieving an outstanding performance improvement on two driver attention benchmark datasets. + + + + Video Anomaly Detection via Sequentially Learning Multiple Pretext Tasks + http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_Video_Anomaly_Detection_via_Sequentially_Learning_Multiple_Pretext_Tasks_ICCV_2023_paper.pdf + Learning multiple pretext tasks is a popular approach to tackle the nonalignment problem in unsupervised video anomaly detection. However, the conventional learning method of simultaneously learning multiple pretext tasks, is prone to sub-optimal solutions, incurring sharp performance drops. In this paper, we propose to sequentially learn multiple pretext tasks according to their difficulties in an ascending manner to improve the performance of anomaly detection. 
The core idea is to relax the learning objective by starting with easy pretext tasks in the early stage and gradually refine it by involving more challenging pretext tasks later on. In this way, our method is able to reduce the difficulty of learning and avoid converging to sub-optimal solutions. Specifically, we design a tailored sequential learning order for three widely-used pretext tasks: it starts with the frame prediction task, then moves on to the frame reconstruction task, and finally ends with the frame-order classification task. We further introduce a new contrastive loss which makes the learned representations of normality more discriminative by pushing normal and pseudo-abnormal samples apart. Extensive experiments on three datasets demonstrate the effectiveness of our method. + + + + SlaBins: Fisheye Depth Estimation using Slanted Bins on Road Environments + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_SlaBins_Fisheye_Depth_Estimation_using_Slanted_Bins_on_Road_Environments_ICCV_2023_paper.pdf + Although 3D perception for autonomous vehicles has focused on frontal-view information, more than half of fatal accidents in practice occur due to side impacts (e.g., T-bone crashes). Motivated by this fact, we investigate the problem of side-view depth estimation, especially for monocular fisheye cameras, which provide wide-FoV information. However, since fisheye cameras are oriented toward the road, they mostly observe road areas, which results in severe distortion on object areas such as vehicles or pedestrians. To alleviate these issues, we propose a new fisheye depth estimation network, SlaBins, that infers an accurate and dense depth map based on a geometric property of road environments: most objects are standing (i.e., orthogonal) on the road. Concretely, we introduce a slanted multi-cylindrical image (MCI) representation, which allows us to describe a distance as a radius to a cylindrical layer orthogonal to the ground regardless of the camera viewing direction. Based on the slanted MCI, we estimate a set of adaptive bins and a per-pixel probability map for depth estimation. Then, by combining these with the estimated slant angle of the viewing direction, we directly infer a dense and accurate depth map for fisheye cameras. Experiments demonstrate that SlaBins outperforms the state-of-the-art methods in both qualitative and quantitative evaluation on the SynWoodScape and KITTI-360 depth datasets. + + + + March in Chat: Interactive Prompting for Remote Embodied Referring Expression + http://openaccess.thecvf.com//content/ICCV2023/papers/Qiao_March_in_Chat_Interactive_Prompting_for_Remote_Embodied_Referring_Expression_ICCV_2023_paper.pdf + Many Vision-and-Language Navigation (VLN) tasks have been proposed in recent years, from room-based to object-based and from indoor to outdoor. The REVERIE (Remote Embodied Referring Expression) task is interesting since it only provides high-level instructions to the agent, which are closer to human commands in practice. Nevertheless, this poses more challenges than other VLN tasks since it requires agents to infer a navigation plan based only on a short instruction. Large Language Models (LLMs) show great potential in robot action planning by providing proper prompts. Still, this strategy has not been explored under the REVERIE setting. There are several new challenges. For example, the LLM should be environment-aware so that the navigation plan can be adjusted based on the current visual observation. 
Moreover, the LLM planned actions should be adaptable to the much larger and more complex REVERIE environment. This paper proposes a March-in-Chat (MiC) model that can talk to the LLM on the fly and plan dynamically based on a newly proposed Room-and-Object Aware Scene Perceiver (ROASP). Our MiC model outperforms the previous state-of-the-art by large margins by SPL and RGSPL metrics on the REVERIE benchmark. The source code is available at https://github.com/YanyuanQiao/MiC + + + + Efficient Unified Demosaicing for Bayer and Non-Bayer Patterned Image Sensors + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Efficient_Unified_Demosaicing_for_Bayer_and_Non-Bayer_Patterned_Image_Sensors_ICCV_2023_paper.pdf + As the physical size of recent CMOS image sensors (CIS) gets smaller, the latest mobile cameras are adopting unique non-Bayer color filter array (CFA) patterns (e.g., Quad, Nona, QxQ), which consist of homogeneous color units with adjacent pixels. These non-Bayer sensors are superior to conventional Bayer CFA thanks to their changeable pixel-bin sizes for different light conditions, but may introduce visual artifacts during demosaicing due to their inherent pixel pattern structures and sensor hardware characteristics. Previous demosaicing methods have primarily focused on fixed pixel-bin sizes of Bayer CFA, requiring specialized reconstruction methods for non-Bayer patterned CIS and executing multiple CFA modes depending on lighting conditions. In this work, we propose an efficient unified demosaicing method that can be applied to both conventional Bayer RAW and various non-Bayer CFAs' RAW data in different operation modes. Our Knowledge Learning-based demosaicing model for Adaptive Patterns, namely KLAP, utilizes CFA-adaptive filters for only 1% key filters in the network for each CFA, but still manages to effectively demosaic all the CFAs, yielding comparable performance to the large-scale models. Furthermore, by employing meta-learning during inference (KLAP-M), our model is able to eliminate unknown sensor-generic artifacts in real RAW data, effectively bridging the gap between synthetic images and real sensor RAW. Our KLAP and KLAP-M methods achieved state-of-the-art demosaicing performance in both synthetic and real RAW data of Bayer and non-Bayer CFAs. + + + + Spatially-Adaptive Feature Modulation for Efficient Image Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Spatially-Adaptive_Feature_Modulation_for_Efficient_Image_Super-Resolution_ICCV_2023_paper.pdf + Although deep learning-based solutions have achieved impressive reconstruction performance in image super-resolution (SR), these models are generally large, with complex architectures, making them incompatible with low-power devices with many computational and memory constraints. To overcome these challenges, we propose a spatially-adaptive feature modulation (SAFM) mechanism for efficient SR design. In detail, the SAFM layer uses independent computations to learn multi-scale feature representations and aggregates these features for dynamic spatial modulation. As the SAFM prioritizes exploiting non-local feature dependencies, we further introduce a convolutional channel mixer (CCM) to encode local contextual information and mix channels simultaneously. Extensive experimental results show that the proposed method is 3x smaller than state-of-the-art efficient SR methods, e.g., IMDN, and yields comparable performance with much less memory usage. 
Our source codes and pre-trained models are available at: https://github.com/sunny2109/SAFMN. + + + + Unsupervised Image Denoising in Real-World Scenarios via Self-Collaboration Parallel Generative Adversarial Branches + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Unsupervised_Image_Denoising_in_Real-World_Scenarios_via_Self-Collaboration_Parallel_Generative_ICCV_2023_paper.pdf + Deep learning methods have shown remarkable performance in image denoising, particularly when trained on large-scale paired datasets. However, acquiring such paired datasets for real-world scenarios poses a significant challenge. Although unsupervised approaches based on generative adversarial networks (GANs) offer a promising solution for denoising without paired datasets, they are difficult in surpassing the performance limitations of conventional GAN-based unsupervised frameworks without significantly modifying existing structures or increasing the computational complexity of denoisers. To address this problem, we propose a self-collaboration (SC) strategy for multiple denoisers. This strategy can achieve significant performance improvement without increasing the inference complexity of the GAN-based denoising framework. Its basic idea is to iteratively replace the previous less powerful denoiser in the filter-guided noise extraction module with the current powerful denoiser. This process generates better synthetic clean-noisy image pairs, leading to a more powerful denoiser for the next iteration. In addition, we propose a baseline method that includes parallel generative adversarial branches with complementary "self-synthesis" and "unpaired-synthesis" constraints. This baseline ensures the stability and effectiveness of the training network. The experimental results demonstrate the superiority of our method over state-of-the-art unsupervised methods. + + + + Self-supervised Image Denoising with Downsampled Invariance Loss and Conditional Blind-Spot Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Jang_Self-supervised_Image_Denoising_with_Downsampled_Invariance_Loss_and_Conditional_Blind-Spot_ICCV_2023_paper.pdf + There have been many image denoisers using deep neural networks, which outperform conventional model-based methods by large margins. Recently, self-supervised methods have attracted attention because constructing a large real noise dataset for supervised training is an enormous burden. The most representative self-supervised denoisers are based on blind-spot networks, which exclude the receptive field's center pixel. However, excluding any input pixel is abandoning some information, especially when the input pixel at the corresponding output position is excluded. In addition, a standard blind-spot network fails to reduce real camera noise due to the pixel-wise correlation of noise, though it successfully removes independently distributed synthetic noise. Hence, to realize a more practical denoiser, we propose a novel self-supervised training framework that can remove real noise. For this, we derive the theoretic upper bound of a supervised loss where the network is guided by the downsampled blinded output. Also, we design a conditional blind-spot network (C-BSN), which selectively controls the blindness of the network to use the center pixel information. Furthermore, we exploit a random subsampler to decorrelate noise spatially, making the C-BSN free of visual artifacts that were often seen in downsample-based methods. 
Extensive experiments show that the proposed C-BSN achieves state-of-the-art performance on real-world datasets as a self-supervised denoiser and shows qualitatively pleasing results without any post-processing or refinement. + + + + Generative Action Description Prompts for Skeleton-based Action Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiang_Generative_Action_Description_Prompts_for_Skeleton-based_Action_Recognition_ICCV_2023_paper.pdf + Skeleton-based action recognition has recently received considerable attention. Current approaches to skeleton-based action recognition are typically formulated as one-hot classification tasks and do not fully exploit the semantic relations between actions. For example, "make victory sign" and "thumb up" are two actions of hand gestures, whose major difference lies in the movement of hands. This information is agnostic from the categorical one-hot encoding of action classes but could be unveiled from the action description. Therefore, utilizing action description in training could potentially benefit representation learning. In this work, we propose a Generative Action-description Prompts (GAP) approach for skeleton-based action recognition. More specifically, we employ a pre-trained large-scale language model as the knowledge engine to automatically generate text descriptions for body parts movements of actions, and propose a multi-modal training scheme by utilizing the text encoder to generate feature vectors for different body parts and supervise the skeleton encoder for action representation learning. Experiments show that our proposed GAP method achieves noticeable improvements over various baseline models without extra computation cost at inference. GAP achieves new state-of-the-arts on popular skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120 and NW-UCLA. The source code is available at https://github.com/MartinXM/GAP. + + + + Transparent Shape from a Single View Polarization Image + http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_Transparent_Shape_from_a_Single_View_Polarization_Image_ICCV_2023_paper.pdf + This paper presents a learning-based method for transparent surface estimation from a single view polarization image. Existing shape from polarization(SfP) methods have the difficulty in estimating transparent shape since the inherent transmission interference heavily reduces the reliability of physics-based prior. To address this challenge, we propose the concept of physics-based prior confidence, which is inspired by the characteristic that the transmission component in the polarization image has more noise than reflection. The confidence is used to determine the contribution of the interfered physics-based prior. Then, we build a network(TransSfP) with multi-branch architecture to avoid the destruction of relationships between different hierarchical inputs. To train and test our method, we construct a dataset for transparent shape from polarization with paired polarization images and ground-truth normal maps. Extensive experiments and comparisons demonstrate the superior accuracy of our method. 
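Physics-based shape-from-polarization priors, such as the one whose confidence is modelled in the "Transparent Shape from a Single View Polarization Image" entry above, are typically derived from the degree and angle of linear polarization. The sketch below shows the standard Stokes-parameter computation of those two quantities from four polarizer-angle captures; it is textbook background, not the TransSfP network.

```python
import numpy as np

def dolp_aolp(i0, i45, i90, i135, eps: float = 1e-6):
    """Degree and angle of linear polarization from four polarizer-angle images.

    Standard Stokes-vector estimates used as the physics-based cue in
    shape-from-polarization pipelines.
    """
    s0 = 0.5 * (i0 + i45 + i90 + i135)     # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    dolp = np.sqrt(s1 ** 2 + s2 ** 2) / (s0 + eps)
    aolp = 0.5 * np.arctan2(s2, s1)        # angle of linear polarization
    return dolp, aolp

# Toy check: an ideal partially polarized pixel with DoLP=0.3, AoLP=20 degrees.
phi = np.deg2rad(np.array([0.0, 45.0, 90.0, 135.0]))
i_meas = (1.0 + 0.3 * np.cos(2 * (phi - np.deg2rad(20.0)))) / 2
dolp, aolp = dolp_aolp(*i_meas)
print(round(float(dolp), 3), round(float(np.rad2deg(aolp)), 1))   # ~0.3, ~20.0
```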
+ + + + DriveAdapter: Breaking the Coupling Barrier of Perception and Planning in End-to-End Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Jia_DriveAdapter_Breaking_the_Coupling_Barrier_of_Perception_and_Planning_in_ICCV_2023_paper.pdf + End-to-end autonomous driving aims to build a fully differentiable system that takes raw sensor data as inputs and directly outputs the planned trajectory or control signals of the ego vehicle. State-of-the-art methods usually follow the `Teacher-Student' paradigm. The Teacher model uses privileged information (ground-truth states of surrounding agents and map elements) to learn the driving strategy. The student model only has access to raw sensor data and conducts behavior cloning on the data collected by the teacher model. By eliminating the noise of the perception part during planning learning, state-of-the-art works could achieve better performance with significantly less data compared to those coupled ones. However, under the current Teacher-Student paradigm, the student model still needs to learn a planning head from scratch, which could be challenging due to the redundant and noisy nature of raw sensor inputs and the casual confusion issue of behavior cloning. In this work, we aim to explore the possibility of directly adopting the strong teacher model to conduct planning while letting the student model focus more on the perception part. We find that even equipped with a SOTA perception model, directly letting the student model learn the required inputs of the teacher model leads to poor driving performance, which comes from the large distribution gap between predicted privileged inputs and the ground-truth. To this end, we propose DriveAdapter, which employs adapters with the feature alignment objective function between the student (perception) and teacher (planning) modules. Additionally, since the pure learning-based teacher model itself is imperfect and occasionally breaks safety rules, we propose a method of action-guided feature learning with a mask for those imperfect teacher features to further inject the priors of hand-crafted rules into the learning process. DriveAdapter achieves SOTA performance on multiple closed-loop simulation-based benchmarks of CARLA. + + + + General Planar Motion from a Pair of 3D Correspondences + http://openaccess.thecvf.com//content/ICCV2023/papers/Dibene_General_Planar_Motion_from_a_Pair_of_3D_Correspondences_ICCV_2023_paper.pdf + We present a novel 2-point method for estimating the relative pose of a camera undergoing planar motion from 3D data (e.g. from a calibrated stereo setup or an RGB-D sensor). Unlike prior art, our formulation does not assume knowledge of the plane of motion, (e.g. parallelism between the optical axis and motion plane) to resolve the under-constrained nature of SE(3) motion estimation in this context. Instead, we enforce geometric constraints identifying, in closed-form, a unique planar motion solution from an orbital set of geometrically consistent SE(3) motion estimates. We explore the set of special and degenerate geometric cases arising from our formulation. Experiments on synthetic data characterize the sensitivity of our estimation framework to measurement noise and different types of observed motion. We integrate our solver within a RANSAC framework and demonstrate robust operation on standard benchmark sequences of real-world imagery. Code is available at: https://github.com/jdibenes/gpm. 
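For context on the "General Planar Motion from a Pair of 3D Correspondences" entry above: the classical way to estimate a rigid pose from 3D-3D correspondences is the SVD-based Kabsch/Procrustes solver sketched below, which needs at least three non-degenerate matches and ignores the planar-motion constraint that lets the paper work with only two. The code is a generic baseline with illustrative names, not the paper's closed-form solver.

```python
import numpy as np

def rigid_pose_from_correspondences(P: np.ndarray, Q: np.ndarray):
    """Least-squares R, t such that Q ~= R @ P + t (Kabsch / Procrustes).

    P, Q: (N, 3) matched 3D points, N >= 3 and not collinear.
    """
    p0, q0 = P.mean(0), Q.mean(0)
    H = (P - p0).T @ (Q - q0)                 # cross-covariance of centred points
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # avoid reflections
    R = Vt.T @ D @ U.T
    t = q0 - R @ p0
    return R, t

# Toy check: recover a known planar (yaw-only) motion from noiseless points.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3))
yaw = np.deg2rad(15.0)
R_true = np.array([[np.cos(yaw), -np.sin(yaw), 0.0],
                   [np.sin(yaw),  np.cos(yaw), 0.0],
                   [0.0,          0.0,         1.0]])
t_true = np.array([0.4, -0.2, 0.0])
Q = P @ R_true.T + t_true
R_est, t_est = rigid_pose_from_correspondences(P, Q)
print(np.allclose(R_est, R_true), np.allclose(t_est, t_true))   # True True
```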
+ + + + Single Depth-image 3D Reflection Symmetry and Shape Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Single_Depth-image_3D_Reflection_Symmetry_and_Shape_Prediction_ICCV_2023_paper.pdf + In this paper, we present Iterative Symmetry Completion Network (ISCNet), a single depth-image shape completion method that exploits reflective symmetry cues to obtain more detailed shapes. The efficacy of single depth-image shape completion methods is often sensitive to the accuracy of the symmetry plane. ISCNet therefore jointly estimates the symmetry plane and shape completion iteratively; more complete shapes contribute to more robust symmetry plane estimates and vice versa. Furthermore, our shape completion method operates in the image domain, enabling more efficient high-resolution, detailed geometry reconstruction. We perform the shape completion from pairs of viewpoints, reflected across the symmetry plane, predicted by a reinforcement learning agent to improve robustness and to simultaneously explicitly leverage symmetry. We demonstrate the effectiveness of ISCNet on a variety of object categories on both synthetic and real-scanned datasets. + + + + Downscaled Representation Matters: Improving Image Rescaling with Collaborative Downscaled Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Downscaled_Representation_Matters_Improving_Image_Rescaling_with_Collaborative_Downscaled_Images_ICCV_2023_paper.pdf + Deep networks have achieved great success in image rescaling (IR) task that seeks to learn the optimal downscaled representations, i.e., low-resolution (LR) images, to reconstruct the original high-resolution (HR) images. Compared with super-resolution methods that consider a fixed downscaling scheme, e.g., bicubic, IR often achieves significantly better reconstruction performance thanks to the learned downscaled representations. This highlights the importance of a good downscaled representation. Existing IR methods mainly learn the downscaled representation by jointly optimizing the downscaling and upscaling models. Unlike them, we seek to improve the downscaled representation through a different and more direct way -- directly optimizing the downscaled image itself instead of the down-/upscaling models. Consequently, we propose a Hierarchical Collaborative Downscaling (HCD) method that performs gradient descent w.r.t. the reconstruction loss in both HR and LR domains to improve the downscaled representations, so as to boost IR performance. Extensive experiments show that our HCD significantly improves the reconstruction performance both quantitatively and qualitatively. Particularly, we improve over popular IR methods by >0.57db PSNR on Set5. Moreover, we also highlight the flexibility of our HCD since it can generalize well across diverse image rescaling models. The code is available at https://github.com/xubingna/HCD. + + + + Attention Discriminant Sampling for Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Hong_Attention_Discriminant_Sampling_for_Point_Clouds_ICCV_2023_paper.pdf + This paper describes an attention-driven approach to 3-D point cloud sampling. We establish our method based on a structure-aware attention discriminant analysis that explores geometric and semantic relations embodied among points and their clusters. The proposed attention discriminant sampling (ADS) starts by efficiently decomposing a given point cloud into clusters to implicitly encode its structural and geometric relatedness among points. 
By treating each cluster as a structural component, ADS then draws on evaluating two levels of self-attention: within-cluster and between-cluster. The former reflects the semantic complexity entailed by the learned features of points within each cluster, while the latter reveals the semantic similarity between clusters. Driven by structurally preserving the point distribution, these two aspects of self-attention help avoid sampling redundancy and decide the number of sampled points in each cluster. Extensive experiments demonstrate that ADS significantly improves classification performance to 95.1% on ModelNet40 and 87.5% on ScanObjectNN and achieves 86.9% mIoU on ShapeNet Part Segmentation. For scene segmentation, ADS yields 91.1% accuracy on S3DIS with higher mIoU to the state-of-the-art and 75.6% mIoU on ScanNetV2. Furthermore, ADS surpasses the state-of-the-art with 55.0% mAP50 on ScanNetV2 object detection. + + + + IHNet: Iterative Hierarchical Network Guided by High-Resolution Estimated Information for Scene Flow Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_IHNet_Iterative_Hierarchical_Network_Guided_by_High-Resolution_Estimated_Information_for_ICCV_2023_paper.pdf + Scene flow estimation, which predicts the 3D displacements of point clouds, is a fundamental task in autonomous driving. Most methods have adopted a coarse-to-fine structure to balance computational efficiency with accuracy, particularly when handling large displacements. However, inaccuracies in the initial coarse layer's scene flow estimates may accumulate, leading to incorrect final estimates. To alleviate this, we introduce a novel Iterative Hierarchical Network----IHNet. This approach circulates high-resolution estimated information (scene flow and feature) from the preceding iteration back to the low-resolution layer of the current iteration. Serving as a guide, the high-resolution estimated scene flow, instead of initializing the scene flow from zero, provides a more precise center for low-resolution layer to identify matches. Meanwhile, the decoder's feature at the high-resolution layer can contribute essential movement information. Furthermore, based on the recurrent structure, we design a resampling scheme to enhance the correspondence between points across two consecutive frames. By employing the previous estimated scene flow to fine-tune the target frame's coordinates, we can significantly reduce the correspondence discrepancy between two frame points, a problem often caused by point sparsity. Following this adjustment, we continue to estimate the scene flow using the newly updated coordinates, along with the reencoded feature. Our approach outperforms the recent state-of-the-art method WSAFlowNet by 20.1% on FlyingThings3D and 56.0% on KITTI scene flow datasets according to EPE3D metric. The code is available at https://github.com/wangyunlhr/IHNet. + + + + SimNP: Learning Self-Similarity Priors Between Neural Points + http://openaccess.thecvf.com//content/ICCV2023/papers/Wewer_SimNP_Learning_Self-Similarity_Priors_Between_Neural_Points_ICCV_2023_paper.pdf + Existing neural field representations for 3D object reconstruction either (1) utilize object-level representations, but suffer from low-quality details due to conditioning on a global latent code, or (2) are able to perfectly reconstruct the observations, but fail to utilize object-level prior knowledge to infer unobserved regions. 
We present SimNP, a method to learn category-level self-similarities, which combines the advantages of both worlds by connecting neural point radiance fields with a category-level self-similarity representation. Our contribution is two-fold. (1) We design the first neural point representation on a category level by utilizing the concept of coherent point clouds. The resulting neural point radiance fields store a high level of detail for locally supported object regions. (2) We learn how information is shared between neural points in an unconstrained and unsupervised fashion, which allows to derive unobserved regions of an object during the reconstruction process from given observations. We show that SimNP is able to outperform previous methods in reconstructing symmetric unseen object regions, surpassing methods that build upon category-level or pixel-aligned radiance fields, while providing semantic correspondences between instances. + + + + Beyond the Limitation of Monocular 3D Detector via Knowledge Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Beyond_the_Limitation_of_Monocular_3D_Detector_via_Knowledge_Distillation_ICCV_2023_paper.pdf + Knowledge distillation (KD) is a promising approach that facilitates the compact student model to learn dark knowledge from the huge teacher model for better results. Although KD methods are well explored in the 2D detection task, existing approaches are not suitable for 3D monocular detection without considering spatial cues. Motivated by the potential of depth information, we propose a novel distillation framework that validly improves the performance of the student model without extra depth labels. Specifically, we first put forward a perspective-induced feature imitation, which utilizes the perspective principle (the farther the smaller) to facilitate the student to imitate more features of farther objects from the teacher model. Moreover, we construct a depth-guided matrix by the predicted depth gap of teacher and student to facilitate the model to learn more knowledge of farther objects in prediction level distillation. The proposed method is available for advanced monocular detectors with various backbones, which also brings no extra inference time. Extensive experiments on the KITTI and nuScenes benchmarks with diverse settings demonstrate that the proposed method outperforms the state-of-the-art KD methods. + + + + Temporal-Coded Spiking Neural Networks with Dynamic Firing Threshold: Learning with Event-Driven Backpropagation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Temporal-Coded_Spiking_Neural_Networks_with_Dynamic_Firing_Threshold_Learning_with_ICCV_2023_paper.pdf + Spiking Neural Networks (SNNs) offer a highly promising computing paradigm due to their biological plausibility, exceptional spatiotemporal information processing capability and low power consumption. As a temporal encoding scheme for SNNs, Time-To-First-Spike (TTFS) encodes information using the timing of a single spike, which allows spiking neurons to transmit information through sparse spike trains and results in lower power consumption and higher computational efficiency compared to traditional rate-based encoding counterparts. However, despite the advantages of the TTFS encoding scheme, the effective and efficient training of TTFS-based deep SNNs remains a significant and open research problem. In this work, we first examine the factors underlying the limitations of applying existing TTFS-based learning algorithms to deep SNNs. 
Specifically, we investigate issues related to over-sparsity of spikes and the complexity of finding the `causal set'. We then propose a simple yet efficient dynamic firing threshold (DFT) mechanism for spiking neurons to address these issues. Building upon the proposed DFT mechanism, we further introduce a novel direct training algorithm for TTFS-based deep SNNs, called DTA-TTFS. This method utilizes event-driven processing and spike timing to enable efficient learning of deep SNNs. The proposed training method was validated on the image classification task and experimental results clearly demonstrate that our proposed method achieves state-of-the-art accuracy in comparison to existing TTFS-based learning algorithms, while maintaining high levels of sparsity and energy efficiency on neuromorphic inference accelerator. + + + + NeO 360: Neural Fields for Sparse View Synthesis of Outdoor Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Irshad_NeO_360_Neural_Fields_for_Sparse_View_Synthesis_of_Outdoor_ICCV_2023_paper.pdf + Recent implicit neural representations have shown great results for novel view synthesis. However, existing methods require expensive per-scene optimization from many views hence limiting their application to real-world unbounded urban settings where the objects of interest or backgrounds are observed from very few views. To mitigate this challenge, we introduce a new approach called NeO 360, Neural fields for sparse view synthesis of outdoor scenes. NeO 360 is a generalizable method that reconstructs 360deg scenes from a single or a few posed RGB images. The essence of our approach is in capturing the distribution of complex real-world outdoor 3D scenes and using a hybrid image-conditional triplanar representation that can be queried from any world point. Our representation combines the best of both voxel-based and bird's-eye-view (BEV) representations and is more effective and expressive than each. NeO 360's representation allows us to learn from a large collection of unbounded 3D scenes while offering generalizability to new views and novel scenes from as few as a single image during inference. We demonstrate our approach on the proposed challenging 360deg unbounded dataset, called NeRDS 360, and show that NeO 360 outperforms state-of-the-art generalizable methods for novel view synthesis while also offering editing and composition capabilities. Project page: zubair-irshad.github.io/projects/neo360.html + + + + UnLoc: A Unified Framework for Video Localization Tasks + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_UnLoc_A_Unified_Framework_for_Video_Localization_Tasks_ICCV_2023_paper.pdf + While large-scale image-text pretrained models such as CLIP have been used for multiple video-level tasks on trimmed videos, their use for temporal localization in untrimmed videos is still a relatively unexplored task. We design a new approach for this called UnLoc, which uses pretrained image and text towers, and feeds tokens to a video-text fusion model. The output of the fusion module are then used to construct a feature pyramid in which each level connects to a head to predict a per-frame relevancy score and start/end time displacements. Unlike previous works, our architecture enables Moment Retrieval, Temporal Localization, and Action Segmentation with a single stage model, without the need for action proposals, motion based pretrained features or representation masking. 
Unlike specialized models, we achieve state of the art results on all three different localization tasks with a unified approach. Code is available at: https://github.com/google-research/scenic. + + + + Fast Inference and Update of Probabilistic Density Estimation on Trajectory Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Maeda_Fast_Inference_and_Update_of_Probabilistic_Density_Estimation_on_Trajectory_ICCV_2023_paper.pdf + Safety-critical applications such as autonomous vehicles and social robots require fast computation and accurate probability density estimation on trajectory prediction. To address both requirements, this paper presents a new normalizing flow-based trajectory prediction model named FlowChain. FlowChain is a stack of conditional continuously-indexed flows (CIFs) that are expressive and allow analytical probability density computation. This analytical computation is faster than the generative models that need additional approximations such as kernel density estimation. Moreover, FlowChain is more accurate than the Gaussian mixture-based models due to fewer assumptions on the estimated density. FlowChain also allows a rapid update of estimated probability densities. This update is achieved by adopting the newest observed position and reusing the flow transformations and its log-det-jacobians that represent the motion trend. This update is completed in less than one millisecond because this reuse greatly omits the computational cost. Experimental results showed our FlowChain achieved state-of-the-art trajectory prediction accuracy compared to previous methods. Furthermore, our FlowChain demonstrated superiority in the accuracy and speed of density estimation. Our code is available at https://github.com/meaten/FlowChain-ICCV2023 + + + + Adaptive Spiral Layers for Efficient 3D Representation Learning on Meshes + http://openaccess.thecvf.com//content/ICCV2023/papers/Babiloni_Adaptive_Spiral_Layers_for_Efficient_3D_Representation_Learning_on_Meshes_ICCV_2023_paper.pdf + The success of deep learning models on structured data has generated significant interest in extending their application to non-Euclidean domains. In this work, we introduce a novel intrinsic operator suitable for representation learning on 3D meshes. Our operator is specifically tailored to adapt its behavior to the irregular structure of the underlying graph and effectively utilize its long-range dependencies, while at the same time ensuring computational efficiency and ease of optimization. In particular, inspired by the framework of Spiral Convolution, which extracts and transforms the vertices in the 3D mesh following a local spiral ordering, we propose a general operator that dynamically adjusts the length of the spiral trajectory and the parameters of the transformation for each processed vertex and mesh. Then, we use polyadic decomposition to factorize its dense weight tensor into a sequence of lighter linear layers that separately process features and vertices information, hence significantly reducing the computational complexity without introducing any stringent inductive biases. Notably, we leverage dynamic gating to achieve spatial adaptivity and induce global reasoning with constant time complexity benefitting from an efficient dynamic pooling mechanism based on Summed-Area-tables. 
Used as a drop-in replacement on existing architectures for shape correspondence our operator significantly improves the performance-efficiency trade-off, and in 3D shape generation with morphable models achieves state-of-the-art performance with a three-fold reduction in the number of parameters required. Project page: https://github.com/Fb2221/DFC + + + + Convex Decomposition of Indoor Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Vavilala_Convex_Decomposition_of_Indoor_Scenes_ICCV_2023_paper.pdf + We describe a method to parse a complex, cluttered indoor scene into primitives which offer a parsimonious abstraction of scene structure. Our primitives are simple convexes. Our method uses a learned regression procedure to parse a scene into a fixed number of convexes from RGBD input, and can optionally accept segmentations to improve the decomposition. The result is then polished with a descent method which adjusts the convexes to produce a very good fit, and greedily removes superfluous primitives. Because the entire scene is parsed, we can evaluate using traditional depth, normal, and segmentation error metrics. Our evaluation procedure demonstrates that the error from our primitive representation is comparable to that of predicting depth from a single image. + + + + Toward Unsupervised Realistic Visual Question Answering + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Toward_Unsupervised_Realistic_Visual_Question_Answering_ICCV_2023_paper.pdf + The problem of realistic VQA (RVQA), where a model has to reject unanswerable questions (UQs) and answer answerable ones (AQs), is studied. We first point out 2 drawbacks in current RVQA research, where (1) datasets contain too many unchallenging UQs and (2) a large number of annotated UQs are required for training. To resolve the first drawback, we propose a new testing dataset, RGQA, which combines AQs from an existing VQA dataset with around 29K human-annotated UQs. These UQs consist of both fine-grained and coarse-grained image-question pairs generated with 2 approaches: CLIP-based and Perturbation-based. To address the second drawback, we introduce an unsupervised training approach. This combines pseudo UQs obtained by randomly pairing images and questions, with an RoI Mixup procedure to generate more fine-grained pseudo UQs, and model ensembling to regularize model confidence. Experiments show that using pseudo UQs significantly outperforms RVQA baselines. RoI Mixup and model ensembling further increase the gain. Finally, human evaluation reveals a performance gap between humans and models, showing that more RVQA research is needed. + + + + Video OWL-ViT: Temporally-consistent Open-world Localization in Video + http://openaccess.thecvf.com//content/ICCV2023/papers/Heigold_Video_OWL-ViT_Temporally-consistent_Open-world_Localization_in_Video_ICCV_2023_paper.pdf + We present an architecture and a training recipe that adapts pretrained open-world image models to localization in videos. Understanding the open visual world (without being constrained by fixed label spaces) is crucial for many real-world vision tasks. Contrastive pre-training on large image-text datasets has recently led to significant improvements for image-level tasks. For more structured tasks involving object localization applying pre-trained models is more challenging. This is particularly true for video tasks, where task-specific data is limited. 
We show successful transfer of open-world models by building on the OWL-ViT open-vocabulary detection model and adapting it to video by adding a transformer decoder. The decoder propagates object representations recurrently through time by using the output tokens for one frame as the object queries for the next. Our model is end-to-end trainable on video data and enjoys improved temporal consistency compared to tracking-by-detection baselines, while retaining the open-world capabilities of the backbone detector. We evaluate our model on the challenging TAO-OW benchmark and demonstrate that open-world capabilities, learned from large-scale image-text pretraining, can be transferred successfully to open-world localization across diverse videos. + + + + Physics-Driven Turbulence Image Restoration with Stochastic Refinement + http://openaccess.thecvf.com//content/ICCV2023/papers/Jaiswal_Physics-Driven_Turbulence_Image_Restoration_with_Stochastic_Refinement_ICCV_2023_paper.pdf + Image distortion by atmospheric turbulence is a stochastic degradation, which is a critical problem in long-range optical imaging systems. A number of research has been conducted during the past decades, including model-based and emerging deep-learning solutions with the help of synthetic data. Although fast and physics-grounded simulation tools have been introduced to help the deep-learning models adapt to real-world turbulence conditions recently, the training of such models only relies on the synthetic data and ground truth pairs. This paper proposes the Physics-integrated Restoration Network (PiRN) to bring the physics-based simulator directly into the training process to help the network to disentangle the stochasticity from the degradation and the underlying image. Furthermore, to overcome the "average effect" introduced by deterministic models and the domain gap between the synthetic and real-world degradation, we further introduce PiRN with Stochastic Refinement (PiRN-SR) to boost its perceptual quality. Overall, our PiRN and PiRN-SR improve the generalization to real-world unknown turbulence conditions and provide a state-of-the-art restoration in both pixel-wise accuracy and perceptual quality. + + + + Enhancing Non-line-of-sight Imaging via Learnable Inverse Kernel and Attention Mechanisms + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Enhancing_Non-line-of-sight_Imaging_via_Learnable_Inverse_Kernel_and_Attention_Mechanisms_ICCV_2023_paper.pdf + Recovering information from non-line-of-sight (NLOS) imaging is a computationally-intensive inverse problem. Most physics-based NLOS imaging methods address the complexity of this problem by assuming three-bounce reflections and no self-occlusion. However, these assumptions may break down for objects with large depth variations, preventing physics-based algorithms from accurately reconstructing the details and high-frequency information. On the other hand, while learning-based methods can avoid these assumptions, they may struggle to reconstruct details without specific designs due to the spectral bias of neural networks. To overcome these issues, we propose a novel approach that enhances physics-based NLOS imaging methods by introducing a learnable inverse kernel in the Fourier domain and using an attention mechanism to improve the neural network to learn high-frequency information. 
Our method is evaluated on publicly available and new synthetic datasets, demonstrating its commendable performance compared to prior physics-based and learning-based methods, especially for objects with large depth variations. Moreover, our approach generalizes well to real data and can be applied to tasks such as classification and depth reconstruction. We will make our code and dataset publicly available: https://sci2020.github.io. + + + + DECO: Dense Estimation of 3D Human-Scene Contact In The Wild + http://openaccess.thecvf.com//content/ICCV2023/papers/Tripathi_DECO_Dense_Estimation_of_3D_Human-Scene_Contact_In_The_Wild_ICCV_2023_paper.pdf + Understanding how humans use physical contact to interact with the world is key to enabling human-centric artificial intelligence. While inferring 3D contact is crucial for modeling realistic and physically-plausible human-object interactions, existing methods either focus on 2D, consider body joints rather than the surface, use coarse 3D body regions, or do not generalize to in-the-wild images. In contrast, we focus on inferring dense, 3D contact between the full body surface and objects in arbitrary images. To achieve this, we first collect DAMON, a new dataset containing dense vertex-level contact annotations paired with RGB images containing complex human-object and human-scene contact. Second, we train DECO, a novel 3D contact detector that uses both body-part-driven and scene-context-driven attention to estimate vertex-level contact on the SMPL body. DECO builds on the insight that human observers recognize contact by reasoning about the contacting body parts, their proximity to scene objects, and the surrounding scene context. We perform extensive evaluations of our detector on DAMON as well as on the RICH and BEHAVE datasets. We significantly outperform existing SOTA methods across all benchmarks. We also show qualitatively that DECO generalizes well to diverse and challenging real-world human interactions in natural images. The code, data, and models are available at https://deco.is.tue.mpg.de. + + + + PlaneRecTR: Unified Query Learning for 3D Plane Recovery from a Single View + http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_PlaneRecTR_Unified_Query_Learning_for_3D_Plane_Recovery_from_a_ICCV_2023_paper.pdf + 3D plane recovery from a single image can usually be divided into several subtasks of plane detection, segmentation, parameter estimation and possibly depth estimation. Previous works tend to solve it by either extending the RCNN-based segmentation network or the dense pixel embedding-based clustering framework. However, none of them tried to integrate above related subtasks into a unified framework but treated them separately and sequentially, which we suspect is potentially a main source of performance limitation for existing approaches. Motivated by this finding and the success of query-based learning in enriching reasoning among semantic entities, in this paper, we propose PlaneRecTR, a Transformer-based architecture, which for the first time unifies all subtasks related to single-view plane recovery with a single compact model. Extensive quantitative and qualitative experiments demonstrate that our proposed unified learning achieves mutual benefits across subtasks, obtaining a new state-of-the-art performance on public ScanNet and NYUv2-Plane datasets. 
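The entry on enhancing non-line-of-sight imaging above introduces a learnable inverse kernel applied in the Fourier domain. As a minimal sketch of that general mechanism only (the paper's kernel parameterization, attention modules, and data layout are not shown; shapes and names here are assumptions), a per-frequency learnable filter could be written as:

```python
import torch
import torch.nn as nn

class LearnableFourierFilter(nn.Module):
    """Toy learnable filter applied in the Fourier domain (illustrative only)."""

    def __init__(self, depth, height, width):
        super().__init__()
        # Real and imaginary parts of a per-frequency weight (rfft halves the last axis).
        self.w_re = nn.Parameter(torch.ones(depth, height, width // 2 + 1))
        self.w_im = nn.Parameter(torch.zeros(depth, height, width // 2 + 1))

    def forward(self, x):                                   # x: (B, D, H, W) measurement volume
        spec = torch.fft.rfftn(x, dim=(1, 2, 3))
        spec = spec * torch.complex(self.w_re, self.w_im)   # frequency-wise filtering
        return torch.fft.irfftn(spec, s=x.shape[1:], dim=(1, 2, 3))
```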
+ + + + EigenTrajectory: Low-Rank Descriptors for Multi-Modal Trajectory Forecasting http://openaccess.thecvf.com//content/ICCV2023/papers/Bae_EigenTrajectory_Low-Rank_Descriptors_for_Multi-Modal_Trajectory_Forecasting_ICCV_2023_paper.pdf Capturing high-dimensional social interactions and feasible futures is essential for predicting trajectories. To address this complexity, several attempts have been devoted to reducing the dimensionality of the output variables via parametric curve fitting, such as Bezier curves and B-spline functions. However, these functions, which originate in computer graphics, are not well suited to modeling socially acceptable human dynamics. In this paper, we present EigenTrajectory (ET), a trajectory prediction approach that uses a novel trajectory descriptor to form a compact space, known here as ET space, in place of Euclidean space, for representing pedestrian movements. We first reduce the complexity of the trajectory descriptor via a low-rank approximation. We transform the pedestrians' history paths into our ET space represented by spatio-temporal principal components, and feed them into off-the-shelf trajectory forecasting models. The inputs and outputs of the models as well as social interactions are all gathered and aggregated in the corresponding ET space. Lastly, we propose a trajectory anchor-based refinement method to cover all possible futures in the proposed ET space. Extensive experiments demonstrate that our EigenTrajectory predictor can significantly improve both the prediction accuracy and reliability of existing trajectory forecasting models on public benchmarks, indicating that the proposed descriptor is well suited to representing pedestrian behaviors. Code is publicly available at https://github.com/inhwanbae/EigenTrajectory. + + + + Video-FocalNets: Spatio-Temporal Focal Modulation for Video Action Recognition http://openaccess.thecvf.com//content/ICCV2023/papers/Wasim_Video-FocalNets_Spatio-Temporal_Focal_Modulation_for_Video_Action_Recognition_ICCV_2023_paper.pdf Recent video recognition models utilize Transformer models for long-range spatio-temporal context modeling. Video transformer designs are based on self-attention, which can model global context but at a high computational cost. In comparison, convolutional designs for videos offer an efficient alternative but lack long-range dependency modeling. Towards achieving the best of both designs, this work proposes Video-FocalNet, an effective and efficient architecture for video recognition that models both local and global contexts. Video-FocalNet is based on a spatio-temporal focal modulation architecture that reverses the interaction and aggregation steps of self-attention for better efficiency. Further, the aggregation step and the interaction step are both implemented using efficient convolution and element-wise multiplication operations that are computationally less expensive than their self-attention counterparts on video representations. We extensively explore the design space of focal modulation-based spatio-temporal context modeling and demonstrate our parallel spatial and temporal encoding design to be the optimal choice. Video-FocalNets perform favorably against state-of-the-art transformer-based models for video recognition on five large-scale datasets (Kinetics-400, Kinetics-600, SS-v2, Diving-48, and ActivityNet-1.3) at a lower computational cost. Our code/models are released at https://github.com/TalalWasim/Video-FocalNets. 

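The EigenTrajectory entry above builds its descriptor by projecting trajectories onto a small set of spatio-temporal principal components. A bare-bones version of that low-rank idea (plain truncated SVD, not the authors' full ET-space pipeline; array shapes and names are assumptions) looks like this:

```python
import numpy as np

def fit_trajectory_basis(trajs, k):
    """Fit a k-dimensional linear basis to trajectories via truncated SVD.

    trajs: (N, T, 2) trajectories; returns (mean, basis) with basis of shape (k, T*2).
    """
    flat = trajs.reshape(len(trajs), -1)
    mean = flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:k]

def to_descriptor(traj, mean, basis):
    return (traj.reshape(-1) - mean) @ basis.T        # (k,) low-rank coefficients

def from_descriptor(coef, mean, basis, horizon):
    return (coef @ basis + mean).reshape(horizon, 2)  # back to a (T, 2) trajectory
```

A forecasting model can then regress the k coefficients instead of full coordinate sequences, which is the dimensionality reduction the entry refers to.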
+ + + + Hidden Biases of End-to-End Driving Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Jaeger_Hidden_Biases_of_End-to-End_Driving_Models_ICCV_2023_paper.pdf + End-to-end driving systems have recently made rapid progress, in particular on CARLA. Independent of their major contribution, they introduce changes to minor system components. Consequently, the source of improvements is unclear. We identify two biases that recur in nearly all state-of-the-art methods and are critical for the observed progress on CARLA: (1) lateral recovery via a strong inductive bias towards target point following, and (2) longitudinal averaging of multimodal waypoint predictions for slowing down. We investigate the drawbacks of these biases and identify principled alternatives. By incorporating our insights, we develop TF++, a simple end-to-end method that ranks first on the Longest6 and LAV benchmarks, gaining 11 driving score over the best prior work on Longest6. + + + + PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Guan_PIDRo_Parallel_Isomeric_Attention_with_Dynamic_Routing_for_Text-Video_Retrieval_ICCV_2023_paper.pdf + Text-video retrieval is a fundamental task with high practical value in multi-modal research. Inspired by the great success of pre-trained image-text models with large-scale data, such as CLIP, many methods are proposed to transfer the strong representation learning capability of CLIP to text-video retrieval. However, due to the modality difference between videos and images, how to effectively adapt CLIP to the video domain is still underexplored. In this paper, we investigate this problem from two aspects. First, we enhance the transferred image encoder of CLIP for fine-grained video understanding in a seamless fashion. Second, we conduct fine-grained contrast between videos and texts from both model improvement and loss design. Particularly, we propose a fine-grained contrastive model equipped with parallel isomeric attention and dynamic routing, namely PIDRo, for text-video retrieval. The parallel isomeric attention module is used as the video encoder, which consists of two parallel branches modeling the spatial-temporal information of videos from both patch and frame levels. The dynamic routing module is constructed to enhance the text encoder of CLIP, generating informative word representations by distributing the fine-grained information to the related word tokens within a sentence. Such model design provides us with informative patch, frame and word representations. We then conduct token-wise interaction upon them. With the enhanced encoders and the token-wise loss, we are able to achieve finer-grained text-video alignment and more accurate retrieval. PIDRo obtains state-of-the-art performance over various text-video retrieval benchmarks, including MSR-VTT, MSVD, LSMDC, DiDeMo and ActivityNet. + + + + RFD-ECNet: Extreme Underwater Image Compression with Reference to Feature Dictionary + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_RFD-ECNet_Extreme_Underwater_Image_Compression_with_Reference_to_Feature_Dictionary_ICCV_2023_paper.pdf + Thriving underwater applications demand efficient extreme compression technology to realize the transmission of underwater images (UWIs) in very narrow underwater bandwidth. 
However, existing image compression methods achieve inferior performance on UWIs because they do not consider the characteristics of UWIs: (1) Multifarious underwater styles of color shift and distance-dependent clarity, caused by the unique underwater physical imaging; (2) Massive redundancy between different UWIs, caused by the fact that different UWIs contain several common ocean objects, which have plenty of similarities in structures and semantics. To remove redundancy among UWIs, we first construct an exhaustive underwater multi-scale feature dictionary to provide coarse-to-fine reference features for UWI compression. Subsequently, an extreme UWI compression network with reference to the feature dictionary (RFD-ECNet) is creatively proposed, which utilizes feature match and reference feature variant to significantly remove redundancy among UWIs. To align the multifarious underwater styles and improve the accuracy of feature match, an underwater style normalized block (USNB) is proposed, which utilizes underwater physical priors extracted from the underwater physical imaging model to normalize the underwater styles of dictionary features toward the input. Moreover, a reference feature variant module (RFVM) is designed to adaptively morph the reference features, improving the similarity between the reference and input features. Experimental results on four UWI datasets show that our RFD-ECNet is the first work that achieves a significant BD-rate saving of 31% over the most advanced VVC. + + + + High-Resolution Document Shadow Removal via A Large-Scale Real-World Dataset and A Frequency-Aware Shadow Erasing Net + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_High-Resolution_Document_Shadow_Removal_via_A_Large-Scale_Real-World_Dataset_and_ICCV_2023_paper.pdf + Shadows often occur when we capture the document with casual equipment, which influences the visual quality and readability of the digital copies. Different from the algorithms for natural shadow removal, the algorithms in document shadow removal need to preserve the details of fonts and figures in high-resolution input. Previous works ignore this problem and remove the shadows via approximate attention and small datasets, which might not work in real-world situations. We handle high-resolution document shadow removal directly via a larger-scale real-world dataset and a carefully-designed frequency-aware network. As for the dataset, we acquire over 7k couples of high-resolution (2462 x 3699) images of real-world documents pairs with various samples under different lighting circumstances, which is 10 times larger than existing datasets. As for the design of the network, we decouple the high-resolution images in the frequency domain, where the low-frequency details and high-frequency boundaries can be effectively learned via the carefully designed network structure. Powered by our network and dataset, the proposed method shows a clearly better performance than previous methods in terms of visual quality and numerical results. The code, models, and dataset are available at: https://github.com/CXH-Research/DocShadow-SD7K. 
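The high-resolution document shadow removal entry above decouples images in the frequency domain so that low-frequency illumination and high-frequency detail can be handled separately. A minimal sketch of such a decomposition (a simple Gaussian low-pass split, not the paper's network design; names and defaults are assumptions) is:

```python
import torch
import torch.nn.functional as F

def frequency_decouple(img, kernel_size=21, sigma=5.0):
    """Split an image into low- and high-frequency parts with a Gaussian low-pass.

    img: (B, C, H, W). Returns (low, high) such that img == low + high.
    """
    half = kernel_size // 2
    coords = torch.arange(kernel_size, dtype=img.dtype, device=img.device) - half
    gauss = torch.exp(-coords ** 2 / (2 * sigma ** 2))
    gauss = gauss / gauss.sum()
    kernel = (gauss[:, None] * gauss[None, :]).repeat(img.shape[1], 1, 1, 1)  # (C, 1, K, K)
    padded = F.pad(img, [half, half, half, half], mode="reflect")
    low = F.conv2d(padded, kernel, groups=img.shape[1])
    return low, img - low
```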
+ + + + SILT: Shadow-Aware Iterative Label Tuning for Learning to Detect Shadows from Noisy Labels + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_SILT_Shadow-Aware_Iterative_Label_Tuning_for_Learning_to_Detect_Shadows_ICCV_2023_paper.pdf + Existing shadow detection datasets often contain missing or mislabeled shadows, which can hinder the performance of deep learning models trained directly on such data. To address this issue, we propose SILT, the Shadow-aware Iterative Label Tuning framework, which explicitly considers noise in shadow labels and trains the deep model in a self-training manner. Specifically, we incorporate strong data augmentations with shadow counterfeiting to help the network better recognize non-shadow regions and alleviate overfitting. We also devise a simple yet effective label tuning strategy with global-local fusion and shadow-aware filtering to encourage the network to make significant refinements on the noisy labels. We evaluate the performance of SILT by relabeling the test set of the SBU dataset and conducting various experiments. Our results show that even a simple U-Net trained with SILT can outperform all state-of-the-art methods by a large margin. When trained on SBU / UCF / ISTD, our network can successfully reduce the Balanced Error Rate by 25.2% / 36.9% / 21.3% over the best state-of-the-art method. + + + + Implicit Autoencoder for Point-Cloud Self-Supervised Representation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_Implicit_Autoencoder_for_Point-Cloud_Self-Supervised_Representation_Learning_ICCV_2023_paper.pdf + This paper advocates the use of implicit surface representation in autoencoder-based self-supervised 3D representation learning. The most popular and accessible 3D representation, i.e., point clouds, involves discrete samples of the underlying continuous 3D surface. This discretization process introduces sampling variations on the 3D shape, making it challenging to develop transferable knowledge of the true 3D geometry. In the standard autoencoding paradigm, the encoder is compelled to encode not only the 3D geometry but also information on the specific discrete sampling of the 3D shape into the latent code. This is because the point cloud reconstructed by the decoder is considered unacceptable unless there is a perfect mapping between the original and the reconstructed point clouds. This paper introduces the Implicit AutoEncoder (IAE), a simple yet effective method that addresses the sampling variation issue by replacing the commonly-used point-cloud decoder with an implicit decoder. The implicit decoder reconstructs a continuous representation of the 3D shape, independent of the imperfections in the discrete samples. Extensive experiments demonstrate that the proposed IAE achieves state-of-the-art performance across various self-supervised learning benchmarks. + + + + Speech4Mesh: Speech-Assisted Monocular 3D Facial Reconstruction for Speech-Driven 3D Facial Animation + http://openaccess.thecvf.com//content/ICCV2023/papers/He_Speech4Mesh_Speech-Assisted_Monocular_3D_Facial_Reconstruction_for_Speech-Driven_3D_Facial_ICCV_2023_paper.pdf + Recent audio2mesh-based methods have shown promising prospects for speech-driven 3D facial animation tasks. However, some intractable challenges are urgent to be settled. For example, the data-scarcity problem is intrinsically inevitable due to the difficulty of 4D data collection. Besides, current methods generally lack controllability on the animated face. 
To this end, we propose a novel framework named Speech4Mesh to consecutively generate 4D talking head data and train the audio2mesh network with the reconstructed meshes. In our framework, we first reconstruct the 4D talking head sequence based on the monocular videos. For precise capture of the talking-related variation on the face, we exploit the audio-visual alignment information from the video by employing a contrastive learning scheme. We next can train the audio2mesh network (e.g., FaceFormer) based on the generated 4D data. To get control of the animated talking face, we encode the speaking-unrelated factors (e.g., emotion, etc.) into an emotion embedding for manipulation. Finally, a differentiable renderer guarantees more accurate photometric details of the reconstruction and animation results. Empirical experiments demonstrate that the Speech4Mesh framework can not only outperform state-of-the-art reconstruction methods, especially on the lower-face part but also achieve better animation performance both perceptually and objectively after pre-trained on the synthesized data. Besides, we also verify that the proposed framework is able to explicitly control the emotion of the animated talking face. + + + + Generalizing Neural Human Fitting to Unseen Poses With Articulated SE(3) Equivariance + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_Generalizing_Neural_Human_Fitting_to_Unseen_Poses_With_Articulated_SE3_ICCV_2023_paper.pdf + We address the problem of fitting a parametric human body model (SMPL) to point cloud data. Optimization based methods require careful initialization and are prone to becoming trapped in local optima. Learning-based methods address this but do not generalize well when the input pose is far from those seen during training. For rigid point clouds, remarkable generalization has been achieved by leveraging SE(3)-equivariant networks, but these methods do not work on articulated objects. In this work we extend this idea to human bodies and propose ArtEq, a novel part-based SE(3)-equivariant neural architecture for SMPL model estimation from point clouds. Specifically, we learn a part detection network by leveraging local SO(3) invariance, and regress shape and pose using articulated SE(3) shape-invariant and pose-equivariant networks, all trained end-to-end. Our novel pose regression module leverages the permutation-equivariant property of self-attention layers to preserve rotational equivariance. Experimental results show that ArtEq generalizes to poses not seen during training, outperforming state-of-the-art methods by 44%in terms of body reconstruction accuracy, without requiring an optimization refinement step. Furthermore, ArtEq is three orders of magnitude faster during inference than prior work and has 97.3% fewer parameters. The code and model are available for research purposes at https://arteq.is.tue.mpg.de. + + + + Learning from Noisy Pseudo Labels for Semi-Supervised Temporal Action Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_Learning_from_Noisy_Pseudo_Labels_for_Semi-Supervised_Temporal_Action_Localization_ICCV_2023_paper.pdf + Semi-Supervised Temporal Action Localization (SS-TAL) aims to improve the generalization ability of action detectors with large-scale unlabeled videos. Albeit the recent advancement, one of the major challenges still remains: noisy pseudo labels hinder efficient learning on abundant unlabeled videos, embodied as location biases and category errors. 
In this paper, we dive deep into such an important but understudied dilemma. To this end, we propose a unified framework, termed Noisy Pseudo-Label Learning, to handle both location biases and category errors. Specifically, our method is featured with (1) Noisy Label Ranking to rank pseudo labels based on the semantic confidence and boundary reliability, (2) Noisy Label Filtering to address the class-imbalance problem of pseudo labels caused by category errors, (3) Noisy Label Learning to penalize inconsistent boundary predictions to achieve noise-tolerant learning for heavy location biases. As a result, our method could effectively handle the label noise problem and improve the utilization of a large amount of unlabeled videos. Extensive experiments on THUMOS14 and ActivityNet v1.3 demonstrate the effectiveness of our method. The code is available at github.com/kunnxia/NPL. + + + + Activate and Reject: Towards Safe Domain Generalization under Category Shift + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Activate_and_Reject_Towards_Safe_Domain_Generalization_under_Category_Shift_ICCV_2023_paper.pdf + Albeit the notable performance on in-domain test points, it is non-trivial for deep neural networks to attain satisfactory accuracy when deploying in the open world, where novel domains and object classes often occur. In this paper, we study a practical problem of Domain Generalization under Category Shift (DGCS), which aims to simultaneously detect unknown-class samples and classify known-class samples in the target domains. Compared to prior DG works, we face two new challenges: 1) how to learn the concept of "unknown" during training with only source known-class samples, and 2) how to adapt the source-trained model to unseen environments for safe model deployment. To this end, we propose a novel Activate and Reject (ART) framework to reshape the model's decision boundary to accommodate unknown classes and conduct post hoc modification to further discriminate known and unknown classes using unlabeled test data. Specifically, during training, we promote the response to the unknown by optimizing the unknown probability and then smoothing the overall output to mitigate the overconfidence issue. At test time, we introduce a step-wise online adaptation method that predicts the label by virtue of the cross-domain nearest neighbor and class prototype information without updating the network's parameters or using threshold-based mechanisms. Experiments reveal that ART consistently improves the generalization capability of deep networks on different vision tasks. For image classification, ART improves the H-score by 6.1% on average compared to the previous best method. For object detection and semantic segmentation, we establish new benchmarks and achieve competitive performance. + + + + Dynamic Mesh Recovery from Partial Point Cloud Sequence + http://openaccess.thecvf.com//content/ICCV2023/papers/Jang_Dynamic_Mesh_Recovery_from_Partial_Point_Cloud_Sequence_ICCV_2023_paper.pdf + The exact 3D dynamics of the human body provides crucial evidence to analyze the consequences of the physical interaction between the body and the environment, which can eventually assist everyday activities in a wide range of applications. However, optimizing for 3D configurations from image observation requires a significant amount of computation, whereas real-world 3D measurements often suffer from noisy observation or complex occlusion. 
We resolve the challenge by learning a latent distribution representing strong temporal priors. We use a conditional variational autoencoder (CVAE) architecture with a transformer to train the motion priors with a large-scale motion dataset. Then our feature follower effectively aligns the feature spaces of noisy, partial observation with the necessary input for pre-trained motion priors, and quickly recovers a complete mesh sequence of motion. We demonstrate that the transformer-based autoencoder can collect necessary spatio-temporal correlations robust to various adversaries, such as missing temporal frames, or noisy observation under severe occlusion. Our framework is general and can be applied to recover the full 3D dynamics of other subjects with parametric representations. + + + + Neural Deformable Models for 3D Bi-Ventricular Heart Shape Reconstruction and Modeling from 2D Sparse Cardiac Magnetic Resonance Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Neural_Deformable_Models_for_3D_Bi-Ventricular_Heart_Shape_Reconstruction_and_ICCV_2023_paper.pdf + We propose a novel neural deformable model (NDM) targeting at the reconstruction and modeling of 3D bi-ventricular shape of the heart from 2D sparse cardiac magnetic resonance (CMR) imaging data. We model the bi-ventricular shape using blended deformable superquadrics, which are parameterized by a set of geometric parameter functions and are capable of deforming globally and locally. While global geometric parameter functions and deformations capture gross shape features from visual data, local deformations, parameterized as neural diffeomorphic point flows, can be learned to recover the detailed heart shape. Different from iterative optimization methods used in conventional deformable model formulations, NDMs can be trained to learn such geometric parameter functions, global and local deformations from a shape distribution manifold. Our NDM can learn to densify a sparse cardiac point cloud with arbitrary scales and generate high-quality triangular meshes automatically. It also enables the implicit learning of dense correspondences among different heart shape instances for accurate cardiac shape registration. Furthermore, the parameters of NDM are intuitive, and can be used by a physician without sophisticated post-processing. Experimental results on a large CMR dataset demonstrate the improved performance of NDM over traditional methods. + + + + Nonrigid Object Contact Estimation With Regional Unwrapping Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_Nonrigid_Object_Contact_Estimation_With_Regional_Unwrapping_Transformer_ICCV_2023_paper.pdf + Acquiring contact patterns between hands and nonrigid objects is a common concern in the vision and robotics community. However, existing learning-based methods focus more on contact with rigid ones from monocular images. When adopting them for nonrigid contact, a major problem is that the existing contact representation is restricted by the geometry of the object. Consequently, contact neighborhoods are stored in an unordered manner and contact features are difficult to align with image cues. At the core of our approach lies a novel hand-object contact representation called RUPs (Region Unwrapping Profiles), which unwrap the roughly estimated hand-object surfaces as multiple high-resolution 2D regional profiles. 
The region grouping strategy is consistent with the hand kinematic bone division because they are the primitive initiators for a composite contact pattern. Based on this representation, our Regional Unwrapping Transformer (RUFormer) learns the correlation priors across regions from monocular inputs and predicts corresponding contact and deformed transformations. Our experiments demonstrate that the proposed framework can robustly estimate the deformed degrees and deformed transformations, which make it suitable for both nonrigid and rigid contact. + + + + Semi-supervised Semantics-guided Adversarial Training for Robust Trajectory Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiao_Semi-supervised_Semantics-guided_Adversarial_Training_for_Robust_Trajectory_Prediction_ICCV_2023_paper.pdf + Predicting the trajectories of surrounding objects is a critical task for self-driving vehicles and many other autonomous systems. Recent works demonstrate that adversarial attacks on trajectory prediction, where small crafted perturbations are introduced to history trajectories, may significantly mislead the prediction of future trajectories and induce unsafe planning. However, few works have addressed enhancing the robustness of this important safety-critical task. In this paper, we present a novel adversarial training method for trajectory prediction. Compared with typical adversarial training on image tasks, our work is challenged by more random input with rich context and a lack of class labels. To address these challenges, we propose a method based on a semi-supervised adversarial autoencoder, which models disentangled semantic features with domain knowledge and provides additional latent labels for the adversarial training. Extensive experiments with different types of attacks demonstrate that our Semisupervised Semantics-guided Adversarial Training (SSAT) method can effectively mitigate the impact of adversarial attacks by up to 73% and outperform other popular defense methods. In addition, experiments show that our method can significantly improve the system's robust generalization to unseen patterns of attacks. We believe that such semantics-guided architecture and advancement on robust generalization is an important step for developing robust prediction models and enabling safe decision-making. + + + + Linear-Covariance Loss for End-to-End Learning of 6D Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Linear-Covariance_Loss_for_End-to-End_Learning_of_6D_Pose_Estimation_ICCV_2023_paper.pdf + Most modern image-based 6D object pose estimation methods learn to predict 2D-3D correspondences, from which the pose can be obtained using a PnP solver. Because of the non-differentiable nature of common PnP solvers, these methods are supervised via the individual correspondences. To address this, several methods have designed differentiable PnP strategies, thus imposing supervision on the pose obtained after the PnP step. Here, we argue that this conflicts with the averaging nature of the PnP problem, leading to gradients that may encourage the network to degrade the accuracy of individual correspondences. To address this, we derive a loss function that exploits the ground truth pose before solving the PnP problem. Specifically, we linearize the PnP solver around the ground-truth pose and compute the covariance of the resulting pose distribution. 
We then define our loss based on the diagonal covariance elements, which entails considering the final pose estimate yet not suffering from the PnP averaging issue. Our experiments show that our loss consistently improves the pose estimation accuracy for both dense and sparse correspondence based methods, achieving state-of-the-art results on both Linemod-Occluded and YCB-Video. + + + + RLSAC: Reinforcement Learning Enhanced Sample Consensus for End-to-End Robust Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Nie_RLSAC_Reinforcement_Learning_Enhanced_Sample_Consensus_for_End-to-End_Robust_Estimation_ICCV_2023_paper.pdf + Robust estimation is a crucial and still challenging task, which involves estimating model parameters in noisy environments. Although conventional sampling consensus-based algorithms sample several times to achieve robustness, these algorithms cannot use data features and historical information effectively. In this paper, we propose RLSAC, a novel Reinforcement Learning enhanced SAmple Consensus framework for end-to-end robust estimation. RLSAC employs a graph neural network to utilize both data and memory features to guide exploring directions for sampling the next minimum set. The feedback of downstream tasks serves as the reward for unsupervised training. Therefore, RLSAC can avoid differentiating to learn the features and the feedback of downstream tasks for end-to-end robust estimation. In addition, RLSAC integrates a state transition module that encodes both data and memory features. Our experimental results demonstrate that RLSAC can learn from features to gradually explore a better hypothesis. Through analysis, it is apparent that RLSAC can be easily transferred to other sampling consensus-based robust estimation tasks. To the best of our knowledge, RLSAC is also the first method that uses reinforcement learning to sample consensus for end-to-end robust estimation. We release our codes at https://github.com/IRMVLab/RLSAC. + + + + Multi-Frequency Representation Enhancement with Privilege Information for Video Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Multi-Frequency_Representation_Enhancement_with_Privilege_Information_for_Video_Super-Resolution_ICCV_2023_paper.pdf + CNN's limited receptive field restricts its ability to capture long-range spatial-temporal dependencies, leading to unsatisfactory performance in video super-resolution. To tackle this challenge, this paper presents a novel multi-frequency representation enhancement module (MFE) that performs spatial-temporal information aggregation in the frequency domain. Specifically, MFE mainly includes a spatial-frequency representation enhancement branch which captures the long-range dependency in the spatial dimension, and an energy frequency representation enhancement branch to obtain the inter-channel feature relationship. Moreover, a novel model training method named privilege training is proposed to encode the privilege information from high-resolution videos to facilitate model training. With these two methods, we introduce a new VSR model named MFPI, which outperforms state-of-the-art methods by a large margin while maintaining good efficiency on various datasets, including REDS4, Vimeo, Vid4, and UDM10. 
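The linear-covariance loss entry above penalizes the diagonal of the pose covariance obtained by linearizing the PnP solver around the ground-truth pose. A bare-bones numerical sketch of that kind of quantity (not the paper's derivation; the residual Jacobian and per-correspondence weights are assumed to be given) could be:

```python
import torch

def diagonal_covariance_loss(jacobian, weights, damping=1e-6):
    """Sum of diagonal covariance entries of a linearized weighted least-squares pose estimate.

    jacobian: (2N, 6) Jacobian of reprojection residuals w.r.t. the 6-DoF pose,
              evaluated at the ground-truth pose.
    weights:  (2N,) predicted confidences for the individual residuals.
    """
    info = jacobian.T @ (weights[:, None] * jacobian)          # 6x6 information matrix
    eye = torch.eye(6, dtype=jacobian.dtype, device=jacobian.device)
    cov = torch.linalg.inv(info + damping * eye)               # approximate pose covariance
    return cov.diagonal().sum()
```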
+ + + + Self-supervised Pre-training for Mirror Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Self-supervised_Pre-training_for_Mirror_Detection_ICCV_2023_paper.pdf + Existing mirror detection methods require supervised ImageNet pre-training to obtain good general-purpose image features. However, supervised ImageNet pre-training focuses on category-level discrimination and may not be suitable for downstream tasks like mirror detection, due to the overfitting upstream tasks (e.g., supervised image classification). We observe that mirror reflection is crucial to how people perceive the presence of mirrors, and such mid-level features can be better transferred from self-supervised pre-trained models. Inspired by this observation, in this paper we aim to improve mirror detection methods by proposing a new self-supervised learning (SSL) pre-training framework for modeling the representation of mirror reflection progressively in the pre-training process. Our framework consists of three pre-training stages at different levels: 1) an image-level pre-training stage to globally incorporate mirror reflection features into the pre-trained model; 2) a patch-level pre-training stage to spatially simulate and learn local mirror reflection from image patches; and 3) a pixel-level pre-training stage to pixel-wisely capture mirror reflection via reconstructing corrupted mirror images based on the relationship between the inside and outside of mirrors. Extensive experiments show that our SSL pre-training framework significantly outperforms previous state-of-the-art CNN-based SSL pre-training frameworks and even outperforms supervised ImageNet pre-training when transferred to the mirror detection task. Code and models are available at https://jiaying.link/iccv2023-sslmirror/ + + + + GlowGAN: Unsupervised Learning of HDR Images from LDR Images in the Wild + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_GlowGAN_Unsupervised_Learning_of_HDR_Images_from_LDR_Images_in_ICCV_2023_paper.pdf + Most in-the-wild images are stored in Low Dynamic Range (LDR) form, serving as a partial observation of the High Dynamic Range (HDR) visual world. Despite limited dynamic range, these LDR images are often captured with different exposures, implicitly containing information about the underlying HDR image distribution. Inspired by this intuition, in this work we present, to the best of our knowledge, the first method for learning a generative model of HDR images from in-the-wild LDR image collections in a fully unsupervised manner. The key idea is to train a generative adversarial network (GAN) to generate HDR images which, when projected to LDR under various exposures, are indistinguishable from real LDR images. Experiments show that our method GlowGAN can synthesize photorealistic HDR images in many challenging cases such as landscapes, lightning, or windows, where previous supervised generative models produce overexposed images. With the assistance of GlowGAN, we showcase the innovative application of unsupervised inverse tone mapping (GlowGAN-ITM) that sets a new paradigm in this field. Unlike previous methods that gradually complete information from LDR input, GlowGAN-ITM searches the entire HDR image manifold modeled by GlowGAN for the HDR images which can be mapped back to the LDR input. 
GlowGAN-ITM method achieves more realistic reconstruction of overexposed regions compared to state-of-the-art supervised learning models, despite not requiring HDR images or paired multi-exposure images for training. + + + + Dual Pseudo-Labels Interactive Self-Training for Semi-Supervised Visible-Infrared Person Re-Identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_Dual_Pseudo-Labels_Interactive_Self-Training_for_Semi-Supervised_Visible-Infrared_Person_Re-Identification_ICCV_2023_paper.pdf + Visible-infrared person re-identification (VI-ReID) aims to match a specific person from a gallery of images captured from non-overlapping visible and infrared cameras. Most works focus on fully supervised VI-ReID, which requires substantial cross-modality annotation that is more expensive than the annotation in single-modality. To reduce the extensive cost of annotation, we explore two practical semi-supervised settings: uni-semi-supervised (annotating only visible images) and bi-semi-supervised (annotating partially in both modalities). These two semi-supervised settings face two challenges due to the large cross-modality discrepancies and the lack of correspondence supervision between visible and infrared images. Thus, it is difficult to generate reliable pseudo-labels and learn modality-invariant features from noise pseudo-labels. In this paper, we propose a dual pseudo-label interactive self-training (DPIS) for these two semi-supervised VI-ReID. Our DPIS integrates two pseudo-labels generated by distinct models into a hybrid pseudo-label for unlabeled data. However, the hybrid pseudo-label still inevitably contains noise. To eliminate the negative effect of noise pseudo-labels, we introduce three modules: noise label penalty (NLP), noise correspondence calibration (NCC), and unreliable anchor learning (UAL). Specifically, NLP penalizes noise labels, NCC calibrates noisy correspondences, and UAL mines the hard-to-discriminate features. Extensive experimental results on SYSU-MM01 and RegDB demonstrate that our DPIS achieves impressive performance under these two semi-supervised settings. + + + + Learned Compressive Representations for Single-Photon 3D Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Gutierrez-Barragan_Learned_Compressive_Representations_for_Single-Photon_3D_Imaging_ICCV_2023_paper.pdf + Single-photon 3D cameras can record the time-of-arrival of billions of photons per second with picosecond accuracy. One common approach to summarize the photon data stream is to build a per-pixel timestamp histogram, resulting in a 3D histogram tensor that encodes distances along the time axis. As the spatio-temporal resolution of the histogram tensor increases, the in-pixel memory requirements and output data rates can quickly become impractical. To overcome this limitation, we propose a family of linear compressive representations of histogram tensors that can be computed efficiently, in an online fashion, as a matrix operation. We design practical lightweight compressive representations that are amenable to an in-pixel implementation and consider the spatio-temporal information of each timestamp. Furthermore, we implement our proposed framework as the first layer of a neural network, which enables the joint end-to-end optimization of the compressive representations and a downstream SPAD data processing model. 
We find that a well-designed compressive representation can reduce in-sensor memory and data rates up to 2 orders of magnitude without significantly reducing 3D imaging quality. Finally, we analyze the power consumption implications through an on-chip implementation. + + + + Alignment-free HDR Deghosting with Semantics Consistent Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Tel_Alignment-free_HDR_Deghosting_with_Semantics_Consistent_Transformer_ICCV_2023_paper.pdf + High dynamic range (HDR) imaging aims to retrieve information from multiple low-dynamic range inputs to generate realistic output. The essence is to leverage the contextual information, including both dynamic and static semantics, for better image generation. Existing methods often focus on the spatial misalignment across input frames caused by the foreground and/or camera motion. However, there is no research on jointly leveraging the dynamic and static context in a simultaneous manner. To delve into this problem, we propose a novel alignment-free network with a Semantics Consistent Transformer (SCTNet) with both spatial and channel attention modules in the network. The spatial attention aims to deal with the intra-image correlation to model the dynamic motion, while the channel attention enables the inter-image intertwining to enhance the semantic consistency across frames. Aside from this, we introduce a novel realistic HDR dataset with more variations in foreground objects, environmental factors, and larger motions. Extensive comparisons on both conventional datasets and ours validate the effectiveness of our method, achieving the best trade-off on the performance and the computational cost. The source code and dataset are available at https://github.com/Zongwei97/SCTNet. + + + + Multi3DRefer: Grounding Text Description to Multiple 3D Objects + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Multi3DRefer_Grounding_Text_Description_to_Multiple_3D_Objects_ICCV_2023_paper.pdf + We introduce the task of localizing a flexible number of objects in real-world 3D scenes using natural language descriptions. Existing 3D visual grounding tasks focus on localizing a unique object given a text description. However, such a strict setting is unnatural as localizing potentially multiple objects is a common need in real-world scenarios and robotic tasks (e.g., visual navigation and object rearrangement). To address this setting we propose Multi3DRefer, generalizing the ScanRefer dataset and task. Our dataset contains 61926 descriptions of 11609 objects, where zero, single or multiple target objects are referenced by each description. We also introduce a new evaluation metric and benchmark methods from prior work to enable further investigation of multi-modal 3D scene understanding. Furthermore, we develop a better baseline leveraging 2D features from CLIP by rendering object proposals online with contrastive learning, which outperforms the state of the art on the ScanRefer benchmark. + + + + Examining Autoexposure for Challenging Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Tedla_Examining_Autoexposure_for_Challenging_Scenes_ICCV_2023_paper.pdf + Autoexposure (AE) is a critical step applied by camera systems to ensure properly exposed images. While current AE algorithms are effective in well-lit environments with constant illumination, these algorithms still struggle in environments with bright light sources or scenes with abrupt changes in lighting. 
A significant hurdle in developing new AE algorithms for challenging environments, especially those with time-varying lighting, is the lack of suitable image datasets. To address this issue, we have captured a new 4D exposure dataset that provides a large solution space (i.e., shutter speed range from 1/500 to 15 seconds) over a temporal sequence with moving objects, bright lights, and varying lighting. In addition, we have designed a software platform to allow AE algorithms to be used in a plug-and-play manner with the dataset. Our dataset and associated platform enable repeatable evaluation of different AE algorithms and provide a much-needed starting point to develop better AE methods. We examine several existing AE strategies using our dataset and show that most users prefer a simple saliency method for challenging lighting conditions. + + + + Improved Visual Fine-tuning with Natural Language Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Improved_Visual_Fine-tuning_with_Natural_Language_Supervision_ICCV_2023_paper.pdf + Fine-tuning a visual pre-trained model can leverage the semantic information from large-scale pre-training data and mitigate the over-fitting problem on downstream vision tasks with limited training examples. While the problem of catastrophic forgetting in the pre-trained backbone has been extensively studied for fine-tuning, its potential bias from the corresponding pre-training task and data attracts less attention. In this work, we investigate this problem by demonstrating that the obtained classifier after fine-tuning will be close to that induced by the pre-trained model. To reduce the bias in the classifier effectively, we introduce a reference distribution obtained from a fixed text classifier, which can help regularize the learned vision classifier. The proposed method, Text Supervised fine-tuning (TeS), is evaluated with diverse pre-trained vision models including ResNet and ViT, and text encoders including BERT and CLIP, on 11 downstream tasks. The consistent improvement with a clear margin over distinct scenarios confirms the effectiveness of our proposal. Code is available at https://github.com/idstcv/TeS. + + + + Person Re-Identification without Identification via Event anonymization + http://openaccess.thecvf.com//content/ICCV2023/papers/Ahmad_Person_Re-Identification_without_Identification_via_Event_anonymization_ICCV_2023_paper.pdf + Wide-scale use of visual surveillance in public spaces puts individual privacy at stake while increasing resource consumption (energy, bandwidth, and computation). Neuromorphic vision sensors (event-cameras) have been recently considered a valid solution to the privacy issue because they do not capture detailed RGB visual information of the subjects in the scene. However, recent deep learning architectures have been able to reconstruct images from event cameras with high fidelity, reintroducing a potential threat to privacy for event-based vision applications. In this paper, we aim to anonymize event-streams to protect the identity of human subjects against such image reconstruction attacks. To achieve this, we propose an end-to-end network architecture jointly optimized for the twofold objective of preserving privacy and performing a downstream task such as person ReId. Our network learns to scramble events, enforcing the degradation of images recovered from the privacy attacker.
In this work, we also bring to the community the first ever event-based person ReId dataset gathered to evaluate the performance of our approach. We validate our approach with extensive experiments and report results on the synthetic event data simulated from the publicly available SoftBio dataset and our proposed Event-ReId dataset. + + + + Self-Feedback DETR for Temporal Action Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Self-Feedback_DETR_for_Temporal_Action_Detection_ICCV_2023_paper.pdf + Temporal Action Detection (TAD) is challenging but fundamental for real-world video applications. Recently, DETR-based models have been devised for TAD but have not performed well yet. In this paper, we point out the problem in the self-attention of DETR for TAD; the attention modules focus on a few key elements, called temporal collapse problem. It degrades the capability of the encoder and decoder since their self-attention modules play no role. To solve the problem, we propose a novel framework, Self-DETR, which utilizes cross-attention maps of the decoder to reactivate self-attention modules. We recover the relationship between encoder features by simple matrix multiplication of the cross-attention map and its transpose. Likewise, we also get the information within decoder queries. By guiding collapsed self-attention maps with the guidance map calculated, we settle down the temporal collapse of self-attention modules in the encoder and decoder. Our extensive experiments demonstrate that Self-DETR resolves the temporal collapse problem by keeping high diversity of attention over all layers. + + + + UMC: A Unified Bandwidth-efficient and Multi-resolution based Collaborative Perception Framework + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_UMC_A_Unified_Bandwidth-efficient_and_Multi-resolution_based_Collaborative_Perception_Framework_ICCV_2023_paper.pdf + Multi-agent collaborative perception (MCP) has recently attracted much attention. It includes three key processes: communication for sharing, collaboration for integration, and reconstruction for different downstream tasks. Existing methods pursue designing the collaboration process alone, ignoring their intrinsic interactions and resulting in suboptimal performance. In contrast, we aim to propose a Unified Collaborative perception framework named UMC, optimizing the communication, collaboration, and reconstruction processes with the Multi-resolution technique. The communication introduces a novel trainable multi-resolution and selective-region (MRSR) mechanism, achieving higher quality and lower bandwidth. Then, a graph-based collaboration is proposed, conducting on each resolution to adapt the MRSR. Finally, the reconstruction integrates the multi-resolution collaborative features for downstream tasks. Since the general metric can not reflect the performance enhancement brought by MCP systematically, we introduce a brand-new evaluation metric that evaluates the MCP from different perspectives. To verify our algorithm, we conducted experiments on the V2X-Sim and OPV2V datasets. Our quantitative and qualitative experiments prove that the proposed UMC outperforms the state-of-the-art collaborative perception approaches. 
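The self-feedback idea in the Self-DETR entry above reduces to a pair of matrix products on the decoder's cross-attention map; the minimal sketch below derives the two guidance maps and scores a self-attention map against them. The row normalization and the KL objective are assumptions; the paper's exact formulation may differ.

```python
# Assumed sketch of cross-attention-derived guidance for collapsed self-attention.
import torch
import torch.nn.functional as F

def guidance_maps(cross_attn: torch.Tensor):
    """cross_attn: (Q, T) decoder-query-to-encoder-token attention; rows sum to 1."""
    enc_guide = cross_attn.T @ cross_attn          # (T, T) relations between encoder tokens
    dec_guide = cross_attn @ cross_attn.T          # (Q, Q) relations between decoder queries
    enc_guide = enc_guide / enc_guide.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    dec_guide = dec_guide / dec_guide.sum(dim=-1, keepdim=True).clamp_min(1e-8)
    return enc_guide, dec_guide

def feedback_loss(self_attn: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
    """KL(guide || self_attn), averaged over rows; both inputs are row-stochastic maps."""
    return F.kl_div(self_attn.clamp_min(1e-8).log(), guide, reduction="batchmean")

Q, T = 40, 100
cross = torch.softmax(torch.randn(Q, T), dim=-1)        # decoder cross-attention map
enc_self = torch.softmax(torch.randn(T, T), dim=-1)     # (possibly collapsed) encoder self-attention map
enc_guide, _ = guidance_maps(cross)
print(feedback_loss(enc_self, enc_guide))               # scalar guidance loss
```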
+ + + + Viewing Graph Solvability in Practice + http://openaccess.thecvf.com//content/ICCV2023/papers/Arrigoni_Viewing_Graph_Solvability_in_Practice_ICCV_2023_paper.pdf + We present an advance in understanding the projective Structure-from-Motion, focusing in particular on the viewing graph: such a graph has cameras as nodes and fundamental matrices as edges. We propose a practical method for testing finite solvability, i.e., whether a viewing graph induces a finite number of camera configurations. Our formulation uses a significantly smaller number of equations (up to 400x) with respect to previous work. As a result, this is the only method in the literature that can be applied to large viewing graphs coming from real datasets, comprising up to 300K edges. In addition, we develop the first algorithm for identifying maximal finite-solvable components. + + + + SATR: Zero-Shot Semantic Segmentation of 3D Shapes + http://openaccess.thecvf.com//content/ICCV2023/papers/Abdelreheem_SATR_Zero-Shot_Semantic_Segmentation_of_3D_Shapes_ICCV_2023_paper.pdf + We explore the task of zero-shot semantic segmentation of 3D shapes by using large-scale off-the-shelf 2D image recognition models. Surprisingly, we find that modern zero-shot 2D object detectors are better suited for this task than contemporary text/image similarity predictors or even zero-shot 2D segmentation networks. Our key finding is that it is possible to extract accurate 3D segmentation maps from multi-view bounding box predictions by using the topological properties of the underlying surface. For this, we develop the Segmentation Assignment with Topological Reweighting (SATR) algorithm and evaluate it on ShapeNetPart and our proposed FAUST benchmarks. SATR achieves state-of-the-art performance and outperforms a baseline algorithm by 1.3% and 4% average mIoU on the FAUST coarse and fine-grained benchmarks, respectively, and by 5.2% average mIoU on the ShapeNetPart benchmark. Our source code and data will be publicly released. Project webpage: https://samir55.github.io/SATR/. + + + + Pseudo Flow Consistency for Self-Supervised 6D Object Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Hai_Pseudo_Flow_Consistency_for_Self-Supervised_6D_Object_Pose_Estimation_ICCV_2023_paper.pdf + Most self-supervised 6D object pose estimation methods can only work with additional depth information or rely on the accurate annotation of 2D segmentation masks, limiting their application range. In this paper, we propose a 6D object pose estimation method that can be trained with pure RGB images without any auxiliary information. We first obtain a rough pose initialization from networks trained on synthetic images rendered from the target's 3D mesh. Then, we introduce a refinement strategy leveraging the geometry constraint in synthetic-to-real image pairs from multiple different views. We formulate this geometry constraint as pixel-level flow consistency between the training images with dynamically generated pseudo labels. We evaluate our method on three challenging datasets and demonstrate that it outperforms state-of-the-art self-supervised methods significantly, with neither 2D annotations nor additional depth images.
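For the SATR entry above, the simplified sketch below shows the basic multi-view voting step that turns 2D bounding-box detections into per-vertex 3D part labels; it deliberately omits the paper's topological (geodesic) reweighting, and all function and variable names are illustrative.

```python
# Simplified, assumed version of multi-view box-to-3D label aggregation (no topological reweighting).
import numpy as np

def vote_labels(proj_xy, boxes_per_view, num_labels):
    """
    proj_xy: (V, N, 2) pixel coordinates of N mesh vertices projected into V views.
    boxes_per_view: list of length V; each entry is a list of (label, x0, y0, x1, y1, score).
    Returns an (N,) array of per-vertex part labels.
    """
    V, N, _ = proj_xy.shape
    votes = np.zeros((N, num_labels))
    for v in range(V):
        for label, x0, y0, x1, y1, score in boxes_per_view[v]:
            x, y = proj_xy[v, :, 0], proj_xy[v, :, 1]
            inside = (x >= x0) & (x <= x1) & (y >= y0) & (y <= y1)
            votes[inside, label] += score              # confidence-weighted vote per detection
    return votes.argmax(axis=1)

proj = np.random.rand(4, 1000, 2) * 224                 # 4 rendered views, 1000 vertices
boxes = [[(0, 0, 0, 112, 224, 0.9), (1, 112, 0, 224, 224, 0.8)] for _ in range(4)]
print(vote_labels(proj, boxes, num_labels=2).shape)     # (1000,)
```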
+ + + + Probabilistic Human Mesh Recovery in 3D Scenes from Egocentric Views + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Probabilistic_Human_Mesh_Recovery_in_3D_Scenes_from_Egocentric_Views_ICCV_2023_paper.pdf + Automatic perception of human behaviors during social interactions is crucial for AR/VR applications, and an essential component is estimation of plausible 3D human pose and shape of our social partners from the egocentric view. One of the biggest challenges of this task is severe body truncation due to close social distances in egocentric scenarios, which brings large pose ambiguities for unseen body parts. To tackle this challenge, we propose a novel scene-conditioned diffusion method to model the body pose distribution. Conditioned on the 3D scene geometry, the diffusion model generates bodies in plausible human-scene interactions, with the sampling guided by a physics-based collision score to further resolve human-scene interpenetrations. The classifier-free training enables flexible sampling with different conditions and enhanced diversity. A visibility-aware graph convolution model guided by per-joint visibility serves as the diffusion denoiser to incorporate inter-joint dependencies and per-body-part control. Extensive evaluations show that our method generates bodies in plausible interactions with 3D scenes, achieving both superior accuracy for visible joints and diversity for invisible body parts. The code is available at https://sanweiliti.github.io/egohmr/egohmr.html. + + + + SceneRF: Self-Supervised Monocular 3D Scene Reconstruction with Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_SceneRF_Self-Supervised_Monocular_3D_Scene_Reconstruction_with_Radiance_Fields_ICCV_2023_paper.pdf + 3D reconstruction from a single 2D image has been extensively covered in the literature but relies on depth supervision at training time, which limits its applicability. To relax the dependence on depth, we propose SceneRF, a self-supervised monocular scene reconstruction method using only posed image sequences for training. Fueled by the recent progress in neural radiance fields (NeRF), we optimize a radiance field, though with explicit depth optimization and a novel probabilistic sampling strategy to efficiently handle large scenes. At inference, a single input image suffices to hallucinate novel depth views which are fused together to obtain 3D scene reconstruction. Thorough experiments demonstrate that we outperform all baselines for novel depth view synthesis and scene reconstruction, on indoor BundleFusion and outdoor SemanticKITTI. Code is available at https://astra-vision.github.io/SceneRF . + + + + INT2: Interactive Trajectory Prediction at Intersections + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_INT2_Interactive_Trajectory_Prediction_at_Intersections_ICCV_2023_paper.pdf + Motion forecasting is an important component in autonomous driving systems. One of the most challenging problems in motion forecasting is interactive trajectory prediction, whose goal is to jointly forecast the future trajectories of interacting agents. To this end, we present a large-scale interactive trajectory prediction dataset named INT2 for INTeractive trajectory prediction at INTersections. INT2 includes 612,000 scenes, each lasting 1 minute, containing up to 10,200 hours of data. The agent trajectories are auto-labeled by a high-performance offline temporal detection and fusion algorithm, whose quality is further inspected by human judges.
Vectorized semantic maps and traffic light information are also included in INT2. Additionally, the dataset poses an interesting domain mismatch challenge. For each intersection, we treat rush-hour and non-rush-hour segments as different domains. We benchmark the best open-sourced interactive trajectory prediction method on INT2 and Waymo Open Motion, under in-domain and cross-domain settings. The dataset, code and models are publicly available at https://github.com/AIR-DISCOVER/INT2. + + + + MapPrior: Bird's-Eye View Map Layout Estimation with Generative Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_MapPrior_Birds-Eye_View_Map_Layout_Estimation_with_Generative_Models_ICCV_2023_paper.pdf + Despite tremendous advancements in bird's-eye view (BEV) perception, existing models fall short in generating realistic and coherent semantic map layouts, and they fail to account for uncertainties arising from partial sensor information (such as occlusion or limited coverage). In this work, we introduce MapPrior, a novel BEV perception framework that combines a traditional discriminative BEV perception model with a learned generative model for semantic map layouts. Our MapPrior delivers predictions with better accuracy, realism, and uncertainty awareness. We evaluate our model on the large-scale nuScenes benchmark. At the time of submission, MapPrior outperforms the strongest competing method, with significantly improved MMD and ECE scores in camera- and LiDAR-based BEV perception. Furthermore, our method can be used to perpetually generate layouts with unconditional sampling. + + + + Conditional Cross Attention Network for Multi-Space Embedding without Entanglement in Only a SINGLE Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_Conditional_Cross_Attention_Network_for_Multi-Space_Embedding_without_Entanglement_in_ICCV_2023_paper.pdf + Many studies in vision tasks have aimed to create effective embedding spaces for single-label object prediction within an image. However, in reality, most objects possess multiple specific attributes, such as shape, color, and length, with each attribute composed of various classes. To apply models in real-world scenarios, it is essential to be able to distinguish between the granular components of an object. Conventional approaches to embedding multiple specific attributes into a single network often result in entanglement, where fine-grained features of each attribute cannot be identified separately. To address this problem, we propose a Conditional Cross-Attention Network that induces disentangled multi-space embeddings for various specific attributes with only a single backbone. Firstly, we employ a cross-attention mechanism to fuse and switch the information of conditions (specific attributes), and we demonstrate its effectiveness through a diverse visualization example. Secondly, we leverage the vision transformer for the first time to a fine-grained image retrieval task and present a simple yet effective framework compared to existing methods. Unlike previous studies where performance varied depending on the benchmark dataset, our proposed method achieved consistent state-of-the-art performance on the FashionAI, DARN, DeepFashion, and Zappos50K benchmark datasets. 
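The condition-as-query mechanism in the Conditional Cross Attention entry above can be sketched with a standard multi-head attention layer, as below; the single-layer design, the dimensions, and the class name are assumptions rather than the published architecture.

```python
# Assumed sketch: one attribute-conditioned cross-attention head on top of a shared backbone.
import torch
import torch.nn as nn

class ConditionalCrossAttention(nn.Module):
    def __init__(self, dim: int = 256, num_conditions: int = 3, heads: int = 8):
        super().__init__()
        self.cond_embed = nn.Embedding(num_conditions, dim)   # one query vector per attribute
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, cond_id: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, dim) backbone patch tokens; cond_id: (B,) attribute index per sample.
        q = self.cond_embed(cond_id).unsqueeze(1)             # (B, 1, dim) condition query
        out, _ = self.attn(query=q, key=tokens, value=tokens) # pool the tokens relevant to that condition
        return self.norm(out.squeeze(1))                      # (B, dim) attribute-specific embedding

tokens = torch.randn(4, 196, 256)                             # e.g. ViT patch tokens
emb = ConditionalCrossAttention()(tokens, torch.tensor([0, 1, 2, 0]))
print(emb.shape)  # torch.Size([4, 256])
```

Because each attribute owns its own query embedding, the same backbone tokens are pooled into different embedding spaces depending on the requested condition, which is one way to keep the attribute spaces disentangled with a single network.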
+ + + + MB-TaylorFormer: Multi-Branch Efficient Transformer Expanded by Taylor Formula for Image Dehazing + http://openaccess.thecvf.com//content/ICCV2023/papers/Qiu_MB-TaylorFormer_Multi-Branch_Efficient_Transformer_Expanded_by_Taylor_Formula_for_Image_ICCV_2023_paper.pdf + In recent years, Transformer networks are beginning to replace pure convolutional neural networks (CNNs) in the field of computer vision due to their global receptive field and adaptability to input. However, the quadratic computational complexity of softmax-attention limits the wide application in image dehazing task, especially for high-resolution images. To address this issue, we propose a new Transformer variant, which applies the Taylor expansion to approximate the softmax-attention and achieves linear computational complexity. A multi-scale attention refinement module is proposed as a complement to correct the error of the Taylor expansion. Furthermore, we introduce a multi-branch architecture with multi-scale patch embedding to the proposed Transformer, which embeds features by overlapping deformable convolution of different scales. The design of multi-scale patch embedding is based on three key ideas: 1) various sizes of the receptive field; 2) flexible shapes of the receptive field; 3) multi-level semantic information. Our model, named Multi-branch Transformer expanded by Taylor formula (MB-TaylorFormer), can embed coarse to fine features more flexibly at the patch embedding stage and capture long-distance pixel interactions with limited computational cost. Experimental results on several dehazing benchmarks show that MB-TaylorFormer achieves state-of-the-art performance with a light computational burden. + + + + FocalFormer3D: Focusing on Hard Instance for 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_FocalFormer3D_Focusing_on_Hard_Instance_for_3D_Object_Detection_ICCV_2023_paper.pdf + False negatives (FN) in 3D object detection, e.g., missing predictions of pedestrians, vehicles, or other obstacles, can lead to potentially dangerous situations in autonomous driving. While being fatal, this issue is understudied in many current 3D detection methods. In this work, we propose Hard Instance Probing (HIP), a general pipeline that identifies FN in a multi-stage manner and guides the models to focus on excavating difficult instances. For 3D object detection, we instantiate this method as FocalFormer3D, a simple yet effective detector that excels at excavating difficult objects and improving prediction recall. FocalFormer3D features a multi-stage query generation to discover hard objects and a box-level transformer decoder to efficiently distinguish objects from massive object candidates. Experimental results on the nuScenes and Waymo datasets validate the superior performance of FocalFormer3D. The advantage leads to strong performance on both detection and tracking, in both LiDAR and multi-modal settings. Notably, FocalFormer3D achieves a 70.5 mAP and 73.9 NDS on nuScenes detection benchmark, while the nuScenes tracking benchmark shows 72.1 AMOTA, both ranking 1st place on the nuScenes LiDAR leaderboard. Our code is available at https://github.com/NVlabs/FocalFormer3D. 
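The linearization in the MB-TaylorFormer entry above rests on the first-order Taylor expansion exp(q·k) ≈ 1 + q·k, which makes attention computable in linear time via associativity. The sketch below shows only that core approximation; the multi-scale attention refinement module and the multi-branch patch embedding are not reproduced, and the scaling choices are assumptions.

```python
# Assumed sketch of first-order Taylor (linear) attention; not the paper's full block.
import torch

def taylor_linear_attention(q, k, v, eps: float = 1e-6):
    """q, k, v: (B, N, D). Runs in O(N * D^2) instead of the O(N^2 * D) of softmax attention."""
    d = q.shape[-1]
    q, k = q / d ** 0.5, k / d ** 0.5            # keep q.k small so the expansion stays reasonable
    kv = torch.einsum("bnd,bne->bde", k, v)      # sum_j k_j v_j^T,  (B, D, D)
    k_sum = k.sum(dim=1)                         # sum_j k_j,        (B, D)
    v_sum = v.sum(dim=1, keepdim=True)           # sum_j v_j,        (B, 1, D)
    num = v_sum + torch.einsum("bnd,bde->bne", q, kv)        # sum_j (1 + q_i.k_j) v_j
    den = k.shape[1] + torch.einsum("bnd,bd->bn", q, k_sum)  # sum_j (1 + q_i.k_j)
    # Unlike softmax weights, first-order weights are not guaranteed non-negative; the paper's
    # refinement module corrects the expansion error, which is not modeled here.
    return num / (den.unsqueeze(-1) + eps)

q = k = v = torch.randn(2, 4096, 32)
print(taylor_linear_attention(q, k, v).shape)    # torch.Size([2, 4096, 32])
```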
+ + + + TEMPO: Efficient Multi-View Pose Estimation, Tracking, and Forecasting + http://openaccess.thecvf.com//content/ICCV2023/papers/Choudhury_TEMPO_Efficient_Multi-View_Pose_Estimation_Tracking_and_Forecasting_ICCV_2023_paper.pdf + Existing volumetric methods for predicting 3D human pose estimation are accurate, but computationally expensive and optimized for single time-step prediction. We present TEMPO, an efficient multi-view pose estimation model that learns a robust spatiotemporal representation, improving pose accuracy while also tracking and forecasting human pose. We significantly reduce computation compared to the state-of-the-art by recurrently computing per-person 2D pose features, fusing both spatial and temporal information into a single representation. In doing so, our model is able to use spatiotemporal context to predict more accurate human poses without sacrificing efficiency. We further use this representation to track human poses over time as well as predict future poses. Finally, we demonstrate that our model is able to generalize across datasets without scene-specific fine-tuning. TEMPO achieves 10% better MPJPE with a 33x improvement in FPS compared to TesseTrack on the challenging CMU Panoptic Studio dataset. Our code and demos are available at https://rccchoudhury.github.io/tempo2023. + + + + DiffPose: SpatioTemporal Diffusion Model for Video-Based Human Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_DiffPose_SpatioTemporal_Diffusion_Model_for_Video-Based_Human_Pose_Estimation_ICCV_2023_paper.pdf + Denoising diffusion probabilistic models that were initially proposed for realistic image generation have recently shown success in various perception tasks (e.g., object detection and image segmentation) and are increasingly gaining attention in computer vision. However, extending such models to multi-frame human pose estimation is non-trivial due to the presence of the additional temporal dimension in videos. More importantly, learning representations that focus on keypoint regions is crucial for accurate localization of human joints. Nevertheless, the adaptation of the diffusion-based methods remains unclear on how to achieve such objective. In this paper, we present DiffPose, a novel diffusion architecture that formulates video-based human pose estimation as a conditional heatmap generation problem. First, to better leverage temporal information, we propose SpatioTemporal Representation Learner which aggregates visual evidences across frames and uses the resulting features in each denoising step as a condition. In addition, we present a mechanism called Lookup-based MultiScale Feature Interaction that determines the correlations between local joints and global contexts across multiple scales. This mechanism generates delicate representations that focus on keypoint regions. Altogether, by extending diffusion models, we show two unique characteristics from DiffPose on pose estimation task: (i) the ability to combine multiple sets of pose estimates to improve prediction accuracy, particularly for challenging joints, and (ii) the ability to adjust the number of iterative steps for feature refinement without retraining the model. DiffPose sets new state-of-the-art results on three benchmarks: PoseTrack2017, PoseTrack2018, and PoseTrack21. 
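To make the "pose estimation as conditional heatmap generation" formulation in the DiffPose entry above concrete, here is a generic DDPM-style training step that denoises ground-truth joint heatmaps conditioned on video features. The linear noise schedule, the toy convolutional denoiser, and the concatenation-based conditioning are placeholder assumptions, not the paper's architecture.

```python
# Generic conditional-diffusion training step (assumed, simplified), in PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)            # cumulative noise schedule

denoiser = nn.Sequential(                                  # toy epsilon-prediction network
    nn.Conv2d(17 + 64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 17, 3, padding=1),
)                                                          # (a real model would also embed the timestep t)

def diffusion_step(heatmaps, cond_feats):
    """heatmaps: (B, 17, H, W) ground-truth joint heatmaps; cond_feats: (B, 64, H, W) video features."""
    b = heatmaps.shape[0]
    t = torch.randint(0, T, (b,))
    a = alphas_bar[t].view(b, 1, 1, 1)
    noise = torch.randn_like(heatmaps)
    noisy = a.sqrt() * heatmaps + (1 - a).sqrt() * noise   # forward process q(x_t | x_0)
    pred = denoiser(torch.cat([noisy, cond_feats], dim=1)) # condition by channel concatenation
    return F.mse_loss(pred, noise)                         # standard epsilon-prediction objective

loss = diffusion_step(torch.rand(2, 17, 64, 48), torch.randn(2, 64, 64, 48))
loss.backward()
```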
+ + + + IntentQA: Context-aware Video Intent Reasoning + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_IntentQA_Context-aware_Video_Intent_Reasoning_ICCV_2023_paper.pdf + In this paper, we propose a novel task IntentQA, a special VideoQA task focusing on video intent reasoning, which has become increasingly important for AI with its advantages in equipping AI agents with the capability of reasoning beyond mere recognition in daily tasks. We also contribute a large-scale VideoQA dataset for this task. We propose a Context-aware Video Intent Reasoning model (CaVIR) consisting of i) Video Query Language (VQL) for better cross-modal representation of the situational context, ii) Contrastive Learning module for utilizing the contrastive context, and iii) Commonsense Reasoning module for incorporating the commonsense context. Comprehensive experiments on this challenging task demonstrate the effectiveness of each model component, the superiority of our full model over other baselines, and the generalizability of our model to a new VideoQA task. The dataset and codes are open-sourced at: https://github.com/JoseponLee/IntentQA.git + + + + Robust Monocular Depth Estimation under Challenging Conditions + http://openaccess.thecvf.com//content/ICCV2023/papers/Gasperini_Robust_Monocular_Depth_Estimation_under_Challenging_Conditions_ICCV_2023_paper.pdf + While state-of-the-art monocular depth estimation approaches achieve impressive results in ideal settings, they are highly unreliable under challenging illumination and weather conditions, such as at nighttime or in the presence of rain. In this paper, we uncover these safety-critical issues and tackle them with md4all: a simple and effective solution that works reliably under both adverse and ideal conditions, as well as for different types of learning supervision. We achieve this by exploiting the efficacy of existing methods under perfect settings. Therefore, we provide valid training signals independently of what is in the input. First, we generate a set of complex samples corresponding to the normal training ones. Then, we train the model by guiding its self- or full-supervision by feeding the generated samples and computing the standard losses on the corresponding original images. Doing so enables a single model to recover information across diverse conditions without modifications at inference time. Extensive experiments on two challenging public datasets, namely nuScenes and Oxford RobotCar, demonstrate the effectiveness of our techniques, outperforming prior works by a large margin in both standard and challenging conditions. Source code and data are available at: https://md4all.github.io. + + + + Parametric Depth Based Feature Representation Learning for Object Detection and Segmentation in Bird's-Eye View + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Parametric_Depth_Based_Feature_Representation_Learning_for_Object_Detection_and_ICCV_2023_paper.pdf + Recent vision-only perception models for autonomous driving achieved promising results by encoding multi-view image features into Bird's-Eye-View (BEV) space. A critical step and the main bottleneck of these methods is transforming image features into the BEV coordinate frame. This paper focuses on leveraging geometry information, such as depth, to model such feature transformation. Existing works rely on non-parametric depth distribution modeling leading to significant memory consumption, or ignore the geometry information to address this problem. 
In contrast, we propose to use parametric depth distribution modeling for feature transformation. We first lift the 2D image features to the 3D space defined for the ego vehicle via a predicted parametric depth distribution for each pixel in each view. Then, we aggregate the 3D feature volume based on the 3D space occupancy derived from depth to the BEV frame. Finally, we use the transformed features for downstream tasks such as object detection and semantic segmentation. Existing semantic segmentation methods do also suffer from an hallucination problem as they do not take visibility information into account. This hallucination can be particularly problematic for subsequent modules such as control and planning. To mitigate the issue, our method provides depth uncertainty and reliable visibility-aware estimations. We further leverage our parametric depth modeling to present a novel visibility-aware evaluation metric that, when taken into account, can mitigate the hallucination problem. Extensive experiments on object detection and semantic segmentation on the nuScenes datasets demonstrate that our method outperforms existing methods on both tasks. + + + + Global Features are All You Need for Image Retrieval and Reranking + http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_Global_Features_are_All_You_Need_for_Image_Retrieval_and_ICCV_2023_paper.pdf + Image retrieval systems conventionally use a two-stage paradigm, leveraging global features for initial retrieval and local features for reranking. However, the scalability of this method is often limited due to the significant storage and computation cost incurred by local feature matching in the reranking stage. In this paper, we present SuperGlobal, a novel approach that exclusively employs global features for both stages, improving efficiency without sacrificing accuracy. SuperGlobal introduces key enhancements to the retrieval system, specifically focusing on the global feature extraction and reranking processes. For extraction, we identify sub-optimal performance when the widely-used ArcFace loss and Generalized Mean (GeM) pooling methods are combined and propose several new modules to improve GeM pooling. In the reranking stage, we introduce a novel method to update the global features of the query and top-ranked images by only considering feature refinement with a small set of images, thus being very compute and memory efficient. Our experiments demonstrate substantial improvements compared to the state of the art in standard benchmarks. Notably, on the Revisited Oxford+1M Hard dataset, our single-stage results improve by 7.1%, while our two-stage gain reaches 3.7% with a strong 64,865x speedup. Our two-stage system surpasses the current single-stage state-of-the-art by 16.3%, offering a scalable, accurate alternative for high-performing image retrieval systems with minimal time overhead. Code: https://github.com/ShihaoShao-GH/SuperGlobal. + + + + DPF-Net: Combining Explicit Shape Priors in Deformable Primitive Field for Unsupervised Structural Reconstruction of 3D Objects + http://openaccess.thecvf.com//content/ICCV2023/papers/Shuai_DPF-Net_Combining_Explicit_Shape_Priors_in_Deformable_Primitive_Field_for_ICCV_2023_paper.pdf + Unsupervised methods for reconstructing structures face significant challenges in capturing the geometric details with consistent structures among diverse shapes of the same category. 
To address this issue, we present a novel unsupervised structural reconstruction method, named DPF-Net, based on a new Deformable Primitive Field (DPF) representation, which allows for high-quality shape reconstruction using parameterized geometric primitives. We design a two-stage shape reconstruction pipeline which consists of a primitive generation module and a primitive deformation module to approximate the target shape of each part progressively. The primitive generation module estimates the explicit orientation, position, and size parameters of parameterized geometric primitives, while the primitive deformation module predicts a dense deformation field based on a parameterized primitive field to recover shape details. The strong shape prior encoded in parameterized geometric primitives enables our DPF-Net to extract high-level structures and recover fine-grained shape details consistently. The experimental results on three categories of objects in diverse shapes demonstrate the effectiveness and generalization ability of our DPF-Net on structural reconstruction and shape segmentation. + + + + CORE: Co-planarity Regularized Monocular Geometry Estimation with Weak Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_CORE_Co-planarity_Regularized_Monocular_Geometry_Estimation_with_Weak_Supervision_ICCV_2023_paper.pdf + The ill-posed nature of monocular 3D geometry (depth map and surface normals) estimation makes it rely mostly on data-driven approaches such as Deep Neural Networks (DNN). However, data acquisition of surface normals, especially the reliable normals, is acknowledged difficult. Commonly, reconstruction of surface normals with high quality is heuristic and time-consuming. Such fact urges methodologies to minimize dependency on ground-truth normals when predicting 3D geometry. In this work, we devise CO-planarity REgularized (CORE) loss functions and Structure-Aware Normal Estimator (SANE). Without involving any knowledge of ground-truth normals, these two designs enable pixel-wise 3D geometry estimation weakly supervised by only ground-truth depth map. For CORE loss functions, the key idea is to exploit locally linear depth-normal orthogonality under spherical coordinates as pixel-level constraints, and utilize our designed Adaptive Polar Regularization (APR) to resolve underlying numerical degeneracies. Meanwhile, SANE easily establishes multi-task learning with CORE loss functions on both depth and surface normal estimation, leading to the whole performance leap. Extensive experiments present the effectiveness of our method on various DNN architectures and data benchmarks. The experimental results demonstrate that our depth estimation achieves the state-of-the-art performance across all metrics on indoor scenes and comparable performance on outdoor scenes. In addition, our surface normal estimation is overall superior. + + + + A Sentence Speaks a Thousand Images: Domain Generalization through Distilling CLIP with Language Guidance + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_A_Sentence_Speaks_a_Thousand_Images_Domain_Generalization_through_Distilling_ICCV_2023_paper.pdf + Domain generalization studies the problem of training a model with samples from several domains (or distributions) and then testing the model with samples from a new, unseen domain. 
In this paper, we propose a novel approach for domain generalization that leverages recent advances in large vision-language models, specifically a CLIP teacher model, to train a smaller model that generalizes to unseen domains. The key technical contribution is a new type of regularization that requires the student's learned image representations to be close to the teacher's learned text representations obtained from encoding the corresponding text descriptions of images. We introduce two designs of the loss function, absolute and relative distance, which provide specific guidance on how the training process of the student model should be regularized. We evaluate our proposed method, dubbed RISE (Regularized Invariance with Semantic Embeddings), on various benchmark datasets, and show that it outperforms several state-of-the-art domain generalization methods. To our knowledge, our work is the first to leverage knowledge distillation using a large vision-language model for domain generalization. By incorporating text-based information, RISE improves the generalization capability of machine learning models. + + + + Yes, we CANN: Constrained Approximate Nearest Neighbors for Local Feature-Based Visual Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Aiger_Yes_we_CANN_Constrained_Approximate_Nearest_Neighbors_for_Local_Feature-Based_ICCV_2023_paper.pdf + Large-scale visual localization systems continue to rely on 3D point clouds built from image collections using structure-from-motion. While the 3D points in these models are represented using local image features, directly matching a query image's local features against the point cloud is challenging due to the scale of the nearest-neighbor search problem. Many recent approaches to visual localization have thus proposed a hybrid method, where first a global (per image) embedding is used to retrieve a small subset of database images, and local features of the query are matched only against those. It seems to have become common belief that global embeddings are critical for said image-retrieval in visual localization, despite the significant downside of having to compute two feature types for each query image. In this paper, we take a step back from this assumption and propose Constrained Approximate Nearest Neighbors (CANN), a joint solution of k-nearest-neighbors across both the geometry and appearance space using only local features. We first derive the theoretical foundation for k-nearest-neighbor retrieval across multiple metrics and then showcase how CANN improves visual localization. Our experiments on public localization benchmarks demonstrate that our method significantly outperforms both state-of-the-art global feature-based retrieval and approaches using local feature aggregation schemes. Moreover, it is an order of magnitude faster in both index and query time than feature aggregation schemes for these datasets. Code will be released. + + + + Multi-Object Navigation with Dynamically Learned Neural Implicit Representations + http://openaccess.thecvf.com//content/ICCV2023/papers/Marza_Multi-Object_Navigation_with_Dynamically_Learned_Neural_Implicit_Representations_ICCV_2023_paper.pdf + Understanding and mapping a new environment are core abilities of any autonomously navigating agent. While classical robotics usually estimates maps in a stand-alone manner with SLAM variants, which maintain a topological or metric representation, end-to-end learning of navigation keeps some form of memory in a neural network.
Networks are typically imbued with inductive biases, which can range from vectorial representations to birds-eye metric tensors or topological structures. In this work, we propose to structure neural networks with two neural implicit representations, which are learned dynamically during each episode and map the content of the scene: (i) the Semantic Finder predicts the position of a previously seen queried object; (ii) the Occupancy and Exploration Implicit Representation encapsulates information about explored area and obstacles, and is queried with a novel global read mechanism which directly maps from function space to a usable embedding space. Both representations are leveraged by an agent trained with Reinforcement Learning (RL) and learned online during each episode. We evaluate the agent on Multi-Object Navigation and show the high impact of using neural implicit representations as a memory source. + + + + NPC: Neural Point Characters from Video + http://openaccess.thecvf.com//content/ICCV2023/papers/Su_NPC_Neural_Point_Characters_from_Video_ICCV_2023_paper.pdf + High-fidelity human 3D models can now be learned directly from videos, typically by combining a template-based surface model with neural representations. However, obtaining a template surface requires expensive multi-view capture systems, laser scans, or strictly controlled conditions. Previous methods avoid using a template but rely on a costly or ill-posed mapping from observation to canonical space. We propose a hybrid point-based representation for animatable humans that does not require an explicit surface model, while being generalizable to novel poses. For a given video, our method automatically produces an explicit set of 3D points representing approximate canonical geometry, and learns an articulated deformation model that produces pose-dependent point transformations. The points serve both as a scaffold for high-frequency neural features and an anchor for efficiently mapping between observation and canonical space. We demonstrate on established benchmarks that our representation overcomes limitations of prior work operating in either canonical or in observation space. Moreover, our automatic point extraction approach enables learning models of human and animal characters alike, matching the performance of the methods using rigged surface templates despite being more general. Project website: https: //lemonatsu.github.io/npc/. + + + + CrossLoc3D: Aerial-Ground Cross-Source 3D Place Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Guan_CrossLoc3D_Aerial-Ground_Cross-Source_3D_Place_Recognition_ICCV_2023_paper.pdf + We present CrossLoc3D, a novel 3D place recognition method that solves a large-scale point matching problem in a cross-source setting. Cross-source point cloud data corresponds to point sets captured by depth sensors with different accuracies or from different distances and perspectives. We address the challenges in terms of developing 3D place recognition methods that account for the representation gap between points captured by different sources. Our method handles cross-source data by utilizing multi-grained features and selecting convolution kernel sizes that correspond to most prominent features. Inspired by the diffusion models, our method uses a novel iterative refinement process that gradually shifts the embedding spaces from different sources to a single canonical space for better metric learning. 
In addition, we present CS-Campus3D, the first 3D aerial-ground cross-source dataset consisting of point cloud data from both aerial and ground LiDAR scans. The point clouds in CS-Campus3D have representation gaps and other features like different views, point densities, and noise patterns. We show that our CrossLoc3D algorithm can achieve an improvement of 4.74% - 15.37% in terms of the top 1 average recall on our CS-Campus3D benchmark and achieves performance comparable to state-of-the-art 3D place recognition method on the Oxford RobotCar. We will release the code and CS-Campus3D benchmark. + + + + Recursive Video Lane Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Jin_Recursive_Video_Lane_Detection_ICCV_2023_paper.pdf + A novel algorithm to detect road lanes in videos, called recursive video lane detector (RVLD), is proposed in this paper, which propagates the state of a current frame recursively to the next frame. RVLD consists of an intra-frame lane detector (ILD) and a predictive lane detector (PLD). First, we design ILD to localize lanes in a still frame. Second, we develop PLD to exploit the information of the previous frame for lane detection in a current frame. To this end, we estimate a motion field and warp the previous output to the current frame. Using the warped information, we refine the feature map of the current frame to detect lanes more reliably. Experimental results show that RVLD outperforms existing detectors on video lane datasets. Our codes are available at https://github.com/dongkwonjin/RVLD. + + + + Unsupervised Self-Driving Attention Prediction via Uncertainty Mining and Knowledge Embedding + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Unsupervised_Self-Driving_Attention_Prediction_via_Uncertainty_Mining_and_Knowledge_Embedding_ICCV_2023_paper.pdf + Predicting attention regions of interest is an important yet challenging task for self-driving systems. Existing methodologies rely on large-scale labeled traffic datasets that are labor-intensive to obtain. Besides, the huge domain gap between natural scenes and traffic scenes in current datasets also limits the potential for model training. To address these challenges, we are the first to introduce an unsupervised way to predict self-driving attention by uncertainty modeling and driving knowledge integration. Our approach's Uncertainty Mining Branch (UMB) discovers commonalities and differences from multiple generated pseudo-labels achieved from models pre-trained on natural scenes by actively measuring the uncertainty. Meanwhile, our Knowledge Embedding Block (KEB) bridges the domain gap by incorporating driving knowledge to adaptively refine the generated pseudo-labels. Quantitative and qualitative results with equivalent or even more impressive performance compared to fully-supervised state-of-the-art approaches across all three public datasets demonstrate the effectiveness of the proposed method and the potential of this direction. The code is available at https://github.com/zaplm/DriverAttention. + + + + DLGSANet: Lightweight Dynamic Local and Global Self-Attention Networks for Image Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_DLGSANet_Lightweight_Dynamic_Local_and_Global_Self-Attention_Networks_for_Image_ICCV_2023_paper.pdf + We propose an effective lightweight dynamic local and global self-attention network (DLGSANet) to solve image super-resolution. Our method explores the properties of Transformers while having low computational costs. 
Motivated by the network designs of Transformers, we develop a simple yet effective multi-head dynamic local self-attention (MHDLSA) module to extract local features efficiently. In addition, we note that existing Transformers usually explore all similarities of the tokens between the queries and keys for the feature aggregation. However, not all the tokens from the queries are relevant to those in keys, using all the similarities does not effectively facilitate the high-resolution image reconstruction. To overcome this problem, we develop a sparse global self-attention (SparseGSA) module to select the most useful similarity values so that the most useful global features can be better utilized for the high-resolution image reconstruction. We develop a hybrid dynamic-Transformer block (HDTB) that integrates the MHDLSA and SparseGSA for both local and global feature exploration. To ease the network training, we formulate the HDTBs into a residual hybrid dynamic-Transformer group (RHDTG). By embedding the RHDTGs into an end-to-end trainable network, we show that our proposed method has fewer network parameters and lower computational costs while achieving competitive performance against state-of-the-art ones in terms of accuracy. More information is available at https://neonleexiang.github.io/DLGSANet/. + + + + Black-Box Unsupervised Domain Adaptation with Bi-Directional Atkinson-Shiffrin Memory + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Black-Box_Unsupervised_Domain_Adaptation_with_Bi-Directional_Atkinson-Shiffrin_Memory_ICCV_2023_paper.pdf + Black-box unsupervised domain adaptation (UDA) learns with source predictions of target data without accessing either source data or source models during training, and it has clear superiority in data privacy and flexibility in target network selection. However, the source predictions of target data are often noisy and training with them is prone to learning collapses. We propose BiMem, a bi-directional memorization mechanism that learns to remember useful and representative information to correct noisy pseudo labels on the fly, leading to robust black-box UDA that can generalize across different visual recognition tasks. BiMem constructs three types of memory, including sensory memory, short-term memory, and long-term memory, which interact in a bi-directional manner for comprehensive and robust memorization of learnt features. It includes a forward memorization flow that identifies and stores useful features and a backward calibration flow that rectifies features' pseudo labels progressively. Extensive experiments show that BiMem achieves superior domain adaptation performance consistently across various visual recognition tasks such as image classification, semantic segmentation and object detection. + + + + Disentangling Spatial and Temporal Learning for Efficient Image-to-Video Transfer Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Qing_Disentangling_Spatial_and_Temporal_Learning_for_Efficient_Image-to-Video_Transfer_Learning_ICCV_2023_paper.pdf + Recently, large-scale pre-trained language-image models like CLIP have shown extraordinary capabilities for understanding spatial contents, but naively transferring such models to video recognition still suffers from unsatisfactory temporal modelling capabilities. 
Existing methods insert tunable structures into or in parallel with the pre-trained model, which either requires back-propagation through the whole pre-trained model and is thus resource-demanding, or is limited by the temporal reasoning capability of the pre-trained structure. In this work, we present DiST, which disentangles the learning of spatial and temporal aspects of videos. Specifically, DiST uses a dual-encoder structure, where a pre-trained foundation model acts as the spatial encoder and a lightweight network is introduced as the temporal encoder. An integration branch is inserted between the encoders to fuse spatio-temporal information. The decoupled spatial and temporal learning in DiST is highly efficient because it avoids back-propagation of massive pre-trained parameters. Meanwhile, we empirically show that separated learning with an extra network for integration is beneficial to both spatial and temporal understanding. Extensive experiments on five benchmarks show that DiST delivers better performance than existing state-of-the-art methods by convincing margins. When pre-training on the large-scale Kinetics-710, we achieve 89.7% on Kinetics-400 with a frozen ViT-L model, which verifies the scalability of DiST. Our code and models will be made available. + + + + Coarse-to-Fine: Learning Compact Discriminative Representation for Single-Stage Image Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Coarse-to-Fine_Learning_Compact_Discriminative_Representation_for_Single-Stage_Image_Retrieval_ICCV_2023_paper.pdf + Image retrieval aims to find images from a database that are visually similar to the query image. Two-stage methods following the retrieve-and-rerank paradigm have achieved excellent performance, but their separate local and global modules are inefficient for real-world applications. To better trade off retrieval efficiency and accuracy, some approaches fuse global and local features into a joint representation to perform single-stage image retrieval. However, they still struggle with various challenging situations, e.g., background clutter, occlusion and viewpoint changes. In this work, we design a Coarse-to-Fine framework to learn Compact Discriminative representation (CFCD) for end-to-end single-stage image retrieval, requiring only image-level labels. Specifically, we first design a novel adaptive softmax-based loss which dynamically tunes its scale and margin within each mini-batch and increases them progressively to strengthen supervision during training and intra-class compactness. Furthermore, we propose a mechanism which attentively selects prominent local descriptors and infuses fine-grained semantic relations into the global representation by a hard negative sampling strategy to optimize inter-class distinctiveness at a global scale. Extensive experimental results demonstrate the effectiveness of our method, which achieves state-of-the-art single-stage image retrieval performance on benchmarks such as Revisited Oxford and Revisited Paris. Code is available at https://github.com/bassyess/CFCD. + + + + Deep Fusion Transformer Network with Weighted Vector-Wise Keypoints Voting for Robust 6D Object Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Deep_Fusion_Transformer_Network_with_Weighted_Vector-Wise_Keypoints_Voting_for_ICCV_2023_paper.pdf + One critical challenge in 6D object pose estimation from a single RGBD image is efficient integration of two different modalities, i.e., color and depth. &#13;
In this work, we tackle this problem with a novel Deep Fusion Transformer (DFTr) block that can aggregate cross-modality features for improving pose estimation. Unlike existing fusion methods, the proposed DFTr can better model cross-modality semantic correlation by leveraging their semantic similarity, such that globally enhanced features from different modalities can be better integrated for improved information extraction. Moreover, to further improve robustness and efficiency, we introduce a novel weighted vector-wise voting algorithm that employs a non-iterative global optimization strategy for precise 3D keypoint localization while achieving near real-time inference. Extensive experiments show the effectiveness and strong generalization capability of our proposed 3D keypoint voting algorithm. Results on four widely used benchmarks also demonstrate that our method outperforms the state-of-the-art methods by large margins. + + + + BT^2: Backward-compatible Training with Basis Transformation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_BT2_Backward-compatible_Training_with_Basis_Transformation_ICCV_2023_paper.pdf + Modern retrieval systems often require recomputing the representation of every piece of data in the gallery when updating to a better representation model. This process is known as backfilling and can be especially costly in the real world where the gallery often contains billions of samples. Recently, researchers have proposed the idea of Backward Compatible Training (BCT) where the new representation model can be trained with an auxiliary loss to make it backward compatible with the old representation. In this way, the new representation can be directly compared with the old representation, in principle avoiding the need for any backfilling. However, follow-up work shows that there is an inherent trade-off where a backward-compatible representation model cannot simultaneously maintain the performance of the new model itself. This paper reports our "not-so-surprising" finding that adding extra dimensions to the representation can help here. However, we also found that naively increasing the dimension of the representation did not work. To deal with this, we propose Backward-compatible Training with a novel Basis Transformation (BT2). A basis transformation (BT) is basically a learnable set of parameters that applies an orthonormal transformation. Such a transformation possesses an important property whereby the original information contained in its input is retained in its output. We show in this paper how a BT can be utilized to add only the necessary amount of additional dimensions. We empirically verify the advantage of BT2 over other state-of-the-art methods in a wide range of settings. We then further extend BT2 to other challenging yet more practical settings, including significant changes in model architecture (CNN to Transformers), modality change, and even a series of updates in the model architecture mimicking the evolution of deep learning models in the past decade. + + + + ViperGPT: Visual Inference via Python Execution for Reasoning + http://openaccess.thecvf.com//content/ICCV2023/papers/Suris_ViperGPT_Visual_Inference_via_Python_Execution_for_Reasoning_ICCV_2023_paper.pdf + Answering visual queries is a complex task that requires both visual processing and reasoning. End-to-end models, the dominant approach for this task, do not explicitly differentiate between the two, limiting interpretability and generalization. &#13;
Learning modular programs presents a promising alternative, but has proven challenging due to the difficulty of learning both the programs and modules simultaneously. We introduce ViperGPT, a framework that leverages code-generation models to compose vision-and-language models into subroutines to produce a result for any query. ViperGPT utilizes a provided API to access the available modules, and composes them by generating Python code that is later executed. This simple approach requires no further training, and achieves state-of-the-art results across various complex visual tasks. + + + + Fine-grained Visible Watermark Removal + http://openaccess.thecvf.com//content/ICCV2023/papers/Niu_Fine-grained_Visible_Watermark_Removal_ICCV_2023_paper.pdf + Visible watermark removal aims to erase the watermark from a watermarked image and recover the background image, which is a challenging task due to the diversity of watermarks. Previous works have designed dynamic networks to handle various types of watermarks adaptively, but they ignore that even the watermarked region in a single image can be divided into multiple local parts with distinct visual appearances. In this work, we advance the image-specific dynamic network towards a part-specific dynamic network, which discovers multiple local parts within the watermarked region and handles them adaptively. Specifically, we propose a query-based multi-task framework, in which part query embeddings are jointly used in two branches to predict part masks and restore watermarked parts. Extensive experiments demonstrate the effectiveness of our fine-grained watermark removal network. + + + + GridMM: Grid Memory Map for Vision-and-Language Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_GridMM_Grid_Memory_Map_for_Vision-and-Language_Navigation_ICCV_2023_paper.pdf + Vision-and-language navigation (VLN) enables the agent to navigate to a remote location following the natural language instruction in 3D environments. To represent the previously visited environment, most approaches for VLN implement memory using recurrent states, topological maps, or top-down semantic maps. In contrast to these approaches, we build the top-down egocentric and dynamically growing Grid Memory Map (i.e., GridMM) to structure the visited environment. From a global perspective, historical observations are projected into a unified grid map in a top-down view, which can better represent the spatial relations of the environment. From a local perspective, we further propose an instruction relevance aggregation method to capture fine-grained visual clues in each grid region. Extensive experiments are conducted on the REVERIE, R2R, and SOON datasets in discrete environments, and on the R2R-CE dataset in continuous environments, showing the superiority of our proposed method. + + + + LAC - Latent Action Composition for Skeleton-based Action Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_LAC_-_Latent_Action_Composition_for_Skeleton-based_Action_Segmentation_ICCV_2023_paper.pdf + Skeleton-based action segmentation requires recognizing composable actions in untrimmed videos. Current approaches decouple this problem by first extracting local visual features from skeleton sequences and then processing them by a temporal model to classify frame-wise actions. However, their performances remain limited as the visual features cannot sufficiently express composable actions. &#13;
In this context, we propose Latent Action Composition (LAC), a novel self-supervised framework aiming at learning from synthesized composable motions for skeleton-based action segmentation. LAC is composed of a novel generation module towards synthesizing new sequences. Specifically, we design a linear latent space in the generator to represent primitive motion. New composed motions can be synthesized by simply performing arithmetic operations on latent representations of multiple input skeleton sequences. LAC leverages such synthesized sequences, which have large diversity and complexity, for learning visual representations of skeletons in both sequence and frame spaces via contrastive learning. The resulting visual encoder has a high expressive power and can be effectively transferred onto action segmentation tasks by end-to-end fine-tuning without the need for additional temporal models. We conduct a study focusing on transfer-learning and we show that representations learned from pre-trained LAC outperform the state-of-the-art by a large margin on TSU, Charades, PKU-MMD datasets. + + + + Learning Vision-and-Language Navigation from YouTube Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Learning_Vision-and-Language_Navigation_from_YouTube_Videos_ICCV_2023_paper.pdf + Vision-and-language navigation (VLN) requires an embodied agent to navigate in realistic 3D environments using natural language instructions. Existing VLN methods suffer from training on small-scale environments or unreasonable path-instruction datasets, limiting the generalization to unseen environments. There are massive house tour videos on YouTube, providing abundant real navigation experiences and layout information. However, these videos have not been explored for VLN before. In this paper, we propose to learn an agent from these videos by creating a large-scale dataset which comprises reasonable path-instruction pairs from house tour videos and pre-training the agent on it. To achieve this, we have to tackle the challenges of automatically constructing path-instruction pairs and exploiting real layout knowledge from raw and unlabeled videos. To address these, we first leverage an entropy-based method to construct the nodes of a path trajectory. Then, we propose an action-aware generator for generating instructions from unlabeled trajectories. Last, we devise a trajectory judgment pretext task to encourage the agent to mine the layout knowledge. Experimental results show that our method achieves state-of-the-art performance on two popular benchmarks (R2R and REVERIE). Code is available at https://github.com/JeremyLinky/YouTube-VLN. + + + + Uncertainty-aware State Space Transformer for Egocentric 3D Hand Trajectory Forecasting + http://openaccess.thecvf.com//content/ICCV2023/papers/Bao_Uncertainty-aware_State_Space_Transformer_for_Egocentric_3D_Hand_Trajectory_Forecasting_ICCV_2023_paper.pdf + Hand trajectory forecasting from egocentric views is crucial for enabling a prompt understanding of human intentions when interacting with AR/VR systems. However, existing methods handle this problem in a 2D image space which is inadequate for 3D real-world applications. In this paper, we set up an egocentric 3D hand trajectory forecasting task that aims to predict hand trajectories in a 3D space from early observed RGB videos in a first-person view. 
To fulfill this goal, we propose an uncertainty-aware state space Transformer (USST) that takes the merits of the attention mechanism and aleatoric uncertainty within the framework of the classical state-space model. The model can be further enhanced by the velocity constraint and visual prompt tuning (VPT) on large vision transformers. Moreover, we develop an annotation workflow to collect 3D hand trajectories with high quality. Experimental results on H2O and EgoPAT3D datasets demonstrate the superiority of USST for both 2D and 3D trajectory forecasting. The code and datasets are publicly released: https://actionlab-cv.github.io/EgoHandTrajPred. + + + + Pretrained Language Models as Visual Planners for Human Assistance + http://openaccess.thecvf.com//content/ICCV2023/papers/Patel_Pretrained_Language_Models_as_Visual_Planners_for_Human_Assistance_ICCV_2023_paper.pdf + In our pursuit of advancing multi-modal AI assistants capable of guiding users to achieve complex multi-step goals, we propose the task of 'Visual Planning for Assistance (VPA)'. Given a succinct natural language goal, e.g., "make a shelf", and a video of the user's progress so far, the aim of VPA is to devise a plan, i.e., a sequence of actions such as "sand shelf", "paint shelf", etc. to realize the specified goal. This requires assessing the user's progress from the (untrimmed) video, and relating it to the requirements of natural language goal, i.e., which actions to select and in what order? Consequently, this requires handling long video history and arbitrarily complex action dependencies. To address these challenges, we decompose VPA into video action segmentation and forecasting. Importantly, we experiment by formulating the forecasting step as a multi-modal sequence modeling problem, allowing us to leverage the strength of pre-trained LMs (as the sequence model). This novel approach, which we call Visual Language Model based Planner (VLaMP), outperforms baselines across a suite of metrics that gauge the quality of the generated plans. Furthermore, through comprehensive ablations, we also isolate the value of each component--language pre-training, visual observations, and goal information. We have open-sourced all the data, model checkpoints, and training code. + + + + Dynamic Point Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Prokudin_Dynamic_Point_Fields_ICCV_2023_paper.pdf + Recent years have witnessed significant progress in the field of neural surface reconstruction. While extensive focus was put on volumetric and implicit approaches, a number of works have shown that explicit graphics primitives, such as point clouds, can significantly reduce computational complexity without sacrificing the reconstructed surface quality. However, less emphasis has been put on modeling dynamic surfaces with point primitives. In this work, we present a dynamic point field model that combines the representational benefits of explicit point-based graphics with implicit deformation networks to allow efficient modeling of non-rigid 3D surfaces. Using explicit surface primitives also allows us to easily incorporate well-established constraints such as isometric-as-possible regularization. While learning this deformation model is prone to local optima when trained in a fully unsupervised manner, we propose to also leverage semantic information, such as keypoint correspondence, to guide the deformation learning. 
We demonstrate how this approach can be used for creating an expressive animatable human avatar from a collection of 3D scans. Here, previous methods mostly rely on variants of the linear blend skinning paradigm, which fundamentally limits the expressivity of such models when dealing with complex cloth appearances, such as long skirts. We show the advantages of our dynamic point field framework in terms of its representational power, learning efficiency, and robustness to out-of-distribution novel poses. The code for the project is publicly available. + + + + Lip2Vec: Efficient and Robust Visual Speech Recognition via Latent-to-Latent Visual to Audio Representation Mapping + http://openaccess.thecvf.com//content/ICCV2023/papers/Djilali_Lip2Vec_Efficient_and_Robust_Visual_Speech_Recognition_via_Latent-to-Latent_Visual_ICCV_2023_paper.pdf + Visual Speech Recognition (VSR) differs from the common perception tasks as it requires deeper reasoning over the video sequence, even by human experts. Despite the recent advances in VSR, current approaches rely on labeled data to fully train or finetune their models predicting the target speech. This hinders their ability to generalize well beyond the training set and leads to performance degeneration under out-of-distribution challenging scenarios. Unlike previous works that involve auxiliary losses or complex training procedures and architectures, we propose a simple approach, named Lip2Vec, that is based on learning a prior model. Given a robust visual speech encoder, this network maps the encoded latent representations of the lip sequence to their corresponding latents from the audio pair, which are sufficiently invariant for effective text decoding. The generated audio representation is then decoded to text using an off-the-shelf Audio Speech Recognition (ASR) model. The proposed model compares favorably with fully-supervised learning methods on the LRS3 dataset, achieving 26 WER. Unlike SoTA approaches, our model keeps a reasonable performance on the VoxCeleb2-en test set. We believe that reprogramming the VSR as an ASR task narrows the performance gap between the two and paves the way for more flexible formulations of lip reading. + + + + Spectral Graphormer: Spectral Graph-Based Transformer for Egocentric Two-Hand Reconstruction using Multi-View Color Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Tse_Spectral_Graphormer_Spectral_Graph-Based_Transformer_for_Egocentric_Two-Hand_Reconstruction_using_ICCV_2023_paper.pdf + We propose a novel transformer-based framework that reconstructs two high-fidelity hands from multi-view RGB images. Unlike existing hand pose estimation methods, where one typically trains a deep network to regress hand model parameters from a single RGB image, we consider a more challenging problem setting where we directly regress the absolute root poses of two hands with extended forearms at high resolution from an egocentric view. As existing datasets are either infeasible for egocentric viewpoints or lack background variations, we create a large-scale synthetic dataset with diverse scenarios and collect a real dataset from a multi-calibrated camera setup to verify our proposed multi-view image feature fusion strategy. To make the reconstruction physically plausible, we propose two strategies: (i) a coarse-to-fine spectral graph convolution decoder to smoothen the meshes during upsampling and (ii) an optimisation-based refinement stage at inference to prevent self-penetrations. &#13;
Through extensive quantitative and qualitative evaluations, we show that our framework is able to produce realistic two-hand reconstructions and demonstrate the generalisation of synthetic-trained models to real data, as well as real-time AR/VR applications. + + + + Recovering a Molecule's 3D Dynamics from Liquid-phase Electron Microscopy Movies + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Recovering_a_Molecules_3D_Dynamics_from_Liquid-phase_Electron_Microscopy_Movies_ICCV_2023_paper.pdf + The dynamics of biomolecules are crucial for our understanding of their functioning in living systems. However, current 3D imaging techniques, such as cryogenic electron microscopy (cryo-EM), require freezing the sample, which limits the observation of their conformational changes in real time. The innovative liquid-phase electron microscopy (liquid-phase EM) technique allows molecules to be placed in the native liquid environment, providing a unique opportunity to observe their dynamics. In this paper, we propose TEMPOR, a Temporal Electron MicroscoPy Object Reconstruction algorithm for liquid-phase EM that leverages an implicit neural representation (INR) and a dynamical variational auto-encoder (DVAE) to recover time series of molecular structures. We demonstrate its advantages in recovering different motion dynamics from two simulated datasets, 7bcq and Cas9. To our knowledge, our work is the first attempt to directly recover 3D structures of a temporally-varying particle from liquid-phase EM movies. It provides a promising new approach for studying molecules' 3D dynamics in structural biology. + + + + SOCS: Semantically-Aware Object Coordinate Space for Category-Level 6D Object Pose Estimation under Large Shape Variations + http://openaccess.thecvf.com//content/ICCV2023/papers/Wan_SOCS_Semantically-Aware_Object_Coordinate_Space_for_Category-Level_6D_Object_Pose_ICCV_2023_paper.pdf + Most learning-based approaches to category-level 6D pose estimation are designed around the normalized object coordinate space (NOCS). While being successful, NOCS-based methods become inaccurate and less robust when handling objects of a category containing significant intra-category shape variations. This is because the object coordinates induced by global and rigid alignment of objects are semantically incoherent, making the coordinate regression hard to learn and generalize. We propose the Semantically-aware Object Coordinate Space (SOCS), built by warping-and-aligning the objects guided by a sparse set of keypoints with semantically meaningful correspondence. SOCS is semantically coherent: any point on the surface of an object can be mapped to a semantically meaningful location in SOCS, allowing for accurate pose and size estimation under large shape variations. To learn effective coordinate regression to SOCS, we propose a novel multi-scale coordinate-based attention network. Evaluations demonstrate that our method is easy to train, well-generalizing for large intra-category shape variations and robust to inter-object occlusions. + + + + NeRF-LOAM: Neural Implicit Representation for Large-Scale Incremental LiDAR Odometry and Mapping + http://openaccess.thecvf.com//content/ICCV2023/papers/Deng_NeRF-LOAM_Neural_Implicit_Representation_for_Large-Scale_Incremental_LiDAR_Odometry_and_ICCV_2023_paper.pdf + Simultaneous odometry and mapping using LiDAR data is an important task for mobile systems to achieve full autonomy in large-scale environments. &#13;
However, most existing LiDAR-based methods prioritize tracking quality over reconstruction quality. Although the recently developed neural radiance fields (NeRF) have shown promising advances in implicit reconstruction for indoor environments, the problem of simultaneous odometry and mapping for large-scale scenarios using incremental LiDAR data remains unexplored. To bridge this gap, in this paper, we propose a novel NeRF-based LiDAR odometry and mapping approach, NeRF-LOAM, consisting of three modules: neural odometry, neural mapping, and mesh reconstruction. All these modules utilize our proposed neural signed distance function, which separates LiDAR points into ground and non-ground points to reduce Z-axis drift, optimizes odometry and voxel embeddings concurrently, and in the end generates dense smooth mesh maps of the environment. Moreover, this joint optimization allows our NeRF-LOAM to be free of pre-training and to exhibit strong generalization abilities when applied to different environments. Extensive evaluations on three publicly available datasets demonstrate that our approach achieves state-of-the-art odometry and mapping performance, as well as a strong generalization in large-scale environments utilizing LiDAR data. Furthermore, we perform multiple ablation studies to validate the effectiveness of our network design. The implementation of our approach will be made available at https://github.com/JunyuanDeng/NeRF-LOAM. + + + + OmniLabel: A Challenging Benchmark for Language-Based Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Schulter_OmniLabel_A_Challenging_Benchmark_for_Language-Based_Object_Detection_ICCV_2023_paper.pdf + Language-based object detection is a promising direction towards building a natural interface to describe objects in images that goes far beyond plain category names. While recent methods show great progress in that direction, proper evaluation is lacking. With OmniLabel, we propose a novel task definition, dataset, and evaluation metric. The task subsumes standard and open-vocabulary detection as well as referring expressions. With more than 30K unique object descriptions on over 25K images, OmniLabel provides a challenging benchmark with diverse and complex object descriptions in a naturally open-vocabulary setting. Moreover, a key differentiation to existing benchmarks is that our object descriptions can refer to one, multiple or even no object, hence, providing negative examples in free-form text. The proposed evaluation handles the large label space and judges performance via a modified average precision metric, which we validate by evaluating strong language-based baselines. OmniLabel indeed provides a challenging test bed for future research on language-based detection. + + + + Divide&Classify: Fine-Grained Classification for City-Wide Visual Geo-Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Trivigno_DivideClassify_Fine-Grained_Classification_for_City-Wide_Visual_Geo-Localization_ICCV_2023_paper.pdf + Visual place recognition is commonly addressed as an image retrieval problem. However, retrieval methods are impractical to scale to large datasets, densely sampled from city-wide maps, since their dimension negatively impacts the inference time. Using approximate nearest neighbour search for retrieval helps to mitigate this issue, at the cost of a performance drop. &#13;
In this paper we investigate whether we can effectively approach this task as a classification problem, thus bypassing the need for a similarity search. We find that existing classification methods for coarse, planet-wide localization are not suitable for the fine-grained and city-wide setting. This is largely due to how the dataset is split into classes, because these methods are designed to handle a sparse distribution of photos and as such do not consider the visual aliasing problem across neighbouring classes that naturally arises in dense scenarios. Thus, we propose a partitioning scheme that enables a fast and accurate inference, preserving a simple learning procedure, and a novel inference pipeline based on an ensemble of novel classifiers that uses the prototypes learned via an angular margin loss. Our method, Divide&Classify (D&C), enjoys the fast inference of classification solutions and an accuracy competitive with retrieval methods on the fine-grained, city-wide setting. Moreover, we show that D&C can be paired with existing retrieval pipelines to speed up computations by over 20 times while increasing their recall, leading to new state-of-the-art results. Code is available at https://github.com/ga1i13o/Divide-and-Classify + + + + 3D Semantic Subspace Traverser: Empowering 3D Generative Model with Shape Editing Capability + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_3D_Semantic_Subspace_Traverser_Empowering_3D_Generative_Model_with_Shape_ICCV_2023_paper.pdf + Shape generation is the practice of producing 3D shapes as various representations for 3D content creation. Previous studies on 3D shape generation have focused on shape quality and structure, with little or no consideration of the importance of semantic information. Consequently, such generative models often fail to preserve the semantic consistency of shape structure or enable manipulation of the semantic attributes of shapes during generation. In this paper, we propose a novel semantic generative model named 3D Semantic Subspace Traverser that utilizes semantic attributes for category-specific 3D shape generation and editing. Our method utilizes implicit functions as the 3D shape representation and combines a novel latent-space GAN with a linear subspace model to discover semantic dimensions in the local latent space of 3D shapes. Each dimension of the subspace corresponds to a particular semantic attribute, and we can edit the attributes of generated shapes by traversing the coefficients of those dimensions. Experimental results demonstrate that our method can produce plausible shapes with complex structures and enable the editing of semantic attributes. The code and trained models are available at https://github.com/TrepangCat/3D_Semantic_Subspace_Traverser. + + + + Text2Room: Extracting Textured 3D Meshes from 2D Text-to-Image Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Hollein_Text2Room_Extracting_Textured_3D_Meshes_from_2D_Text-to-Image_Models_ICCV_2023_paper.pdf + We present Text2Room, a method for generating room-scale textured 3D meshes from a given text prompt as input. To this end, we leverage pre-trained 2D text-to-image models to synthesize a sequence of images from different poses. In order to lift these outputs into a consistent 3D scene representation, we combine monocular depth estimation with a text-conditioned inpainting model. The core idea of our approach is a tailored viewpoint selection such that the content of each image can be fused into a seamless, textured 3D mesh. &#13;
More specifically, we propose a continuous alignment strategy that iteratively fuses scene frames with the existing geometry to create a seamless mesh. Unlike existing works that focus on generating single objects [56, 41] or zoom-out trajectories [18] from text, our method generates complete 3D scenes with multiple objects and explicit 3D geometry. We evaluate our approach using qualitative and quantitative metrics, demonstrating it as the first method to generate room-scale 3D geometry with compelling textures from only text as input. + + + + On the Robustness of Normalizing Flows for Inverse Problems in Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Hong_On_the_Robustness_of_Normalizing_Flows_for_Inverse_Problems_in_ICCV_2023_paper.pdf + Conditional normalizing flows can generate diverse image samples for solving inverse problems. Most normalizing flows for inverse problems in imaging employ the conditional affine coupling layer that can generate diverse images quickly. However, unintended severe artifacts are occasionally observed in their outputs. In this work, we address this critical issue by investigating the origins of these artifacts and proposing the conditions to avoid them. First of all, we empirically and theoretically reveal that these problems are caused by "exploding inverse" in the conditional affine coupling layer for certain out-of-distribution (OOD) conditional inputs. Then, we further validate that the probability of causing erroneous artifacts in pixels is highly correlated with a Mahalanobis distance-based OOD score for inverse problems in imaging. Lastly, based on our investigations, we propose a condition for avoiding exploding inverse and, based on it, suggest a simple remedy that substitutes the affine coupling layers with the modified rational quadratic spline coupling layers in normalizing flows, to encourage the robustness of generated image samples. Our experimental results demonstrate that our suggested methods effectively suppress critical artifacts occurring in normalizing flows for super-resolution space generation and low-light image enhancement. + + + + DistillBEV: Boosting Multi-Camera 3D Object Detection with Cross-Modal Knowledge Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_DistillBEV_Boosting_Multi-Camera_3D_Object_Detection_with_Cross-Modal_Knowledge_Distillation_ICCV_2023_paper.pdf + 3D perception based on the representations learned from multi-camera bird's-eye-view (BEV) is trending as cameras are cost-effective for mass production in the autonomous driving industry. However, there exists a distinct performance gap between multi-camera BEV and LiDAR based 3D object detection. One key reason is that LiDAR captures accurate depth and other geometry measurements, while it is notoriously challenging to infer such 3D information from merely image input. In this work, we propose to boost the representation learning of a multi-camera BEV based student detector by training it to imitate the features of a well-trained LiDAR based teacher detector. We propose an effective balancing strategy to enforce the student to focus on learning the crucial features from the teacher, and generalize knowledge transfer to multi-scale layers with temporal fusion. We conduct extensive evaluations on multiple representative models of multi-camera BEV. Experiments reveal that our approach renders significant improvement over the student models, leading to state-of-the-art performance on the popular nuScenes benchmark. &#13;
+ + + + PoseFix: Correcting 3D Human Poses with Natural Language + http://openaccess.thecvf.com//content/ICCV2023/papers/Delmas_PoseFix_Correcting_3D_Human_Poses_with_Natural_Language_ICCV_2023_paper.pdf + Automatically producing instructions to modify one's posture could open the door to endless applications, such as personalized coaching and in-home physical therapy. Tackling the reverse problem (i.e., refining a 3D pose based on some natural language feedback) could help for assisted 3D character animation or robot teaching, for instance. Although a few recent works explore the connections between natural language and 3D human pose, none focus on describing 3D body pose differences. In this paper, we tackle the problem of correcting 3D human poses with natural language. To this end, we introduce the PoseFix dataset, which consists of several thousand paired 3D poses and their corresponding text feedback, that describe how the source pose needs to be modified to obtain the target pose. We demonstrate the potential of this dataset on two tasks: (1) text-based pose editing, that aims at generating corrected 3D body poses given a query pose and a text modifier; and (2) correctional text generation, where instructions are generated based on the differences between two body poses. The dataset and the code are available at https://europe.naverlabs.com/research/computer-vision/posefix/. + + + + TAPIR: Tracking Any Point with Per-Frame Initialization and Temporal Refinement + http://openaccess.thecvf.com//content/ICCV2023/papers/Doersch_TAPIR_Tracking_Any_Point_with_Per-Frame_Initialization_and_Temporal_Refinement_ICCV_2023_paper.pdf + We present a novel model for Tracking Any Point (TAP) that effectively tracks any queried point on any physical surface throughout a video sequence. Our approach employs two stages: (1) a matching stage, which independently locates a suitable candidate point match for the query point on every other frame, and (2) a refinement stage, which updates both the trajectory and query features based on local correlations. The resulting model surpasses all baseline methods by a significant margin on the TAP-Vid benchmark, as demonstrated by an approximate 20% absolute average Jaccard (AJ) improvement on DAVIS. Our model facilitates fast inference on long and high-resolution video sequences. On a modern GPU, our implementation has the capacity to track points faster than real-time. Given the high-quality trajectories extracted from a large dataset, we demonstrate a proof-of-concept diffusion model which generates trajectories from static images, enabling plausible animations. Visualizations, source code, and pretrained models can be found at https://deepmind-tapir.github.io. + + + + SwinLSTM: Improving Spatiotemporal Prediction Accuracy using Swin Transformer and LSTM + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_SwinLSTM_Improving_Spatiotemporal_Prediction_Accuracy_using_Swin_Transformer_and_LSTM_ICCV_2023_paper.pdf + Integrating CNNs and RNNs to capture spatiotemporal dependencies is a prevalent strategy for spatiotemporal prediction tasks. However, the property of CNNs to learn local spatial information decreases their efficiency in capturing spatiotemporal dependencies, thereby limiting their prediction accuracy. In this paper, we propose a new recurrent cell, SwinLSTM, which integrates Swin Transformer blocks and the simplified LSTM, an extension that replaces the convolutional structure in ConvLSTM with the self-attention mechanism. 
Furthermore, we construct a network with SwinLSTM cell as the core for spatiotemporal prediction. Without using unique tricks, SwinLSTM outperforms state-of-the-art methods on Moving MNIST, Human3.6m, TaxiBJ, and KTH datasets. In particular, it exhibits a significant improvement in prediction accuracy compared to ConvLSTM. Our competitive experimental results demonstrate that learning global spatial dependencies is more advantageous for models to capture spatiotemporal dependencies. We hope that SwinLSTM can serve as a solid baseline to promote the advancement of spatiotemporal prediction accuracy. The codes are publicly available at https://github.com/SongTang-x/SwinLSTM. + + + + DEDRIFT: Robust Similarity Search under Content Drift + http://openaccess.thecvf.com//content/ICCV2023/papers/Baranchuk_DEDRIFT_Robust_Similarity_Search_under_Content_Drift_ICCV_2023_paper.pdf + The statistical distribution of content uploaded and searched on media sharing sites changes over time due to seasonal, sociological and technical factors. We investigate the impact of this "content drift" for large-scale similarity search tools, based on nearest neighbor search in embedding space. Unless a costly index reconstruction is performed frequently, content drift degrades the search accuracy and efficiency. The degradation is especially severe since, in general, both the query and database distributions change. We introduce and analyze real-world image and video datasets for which temporal information is available over a long time period. Based on the learnings, we devise DeDrift, a method that updates embedding quantizers to continuously adapt large-scale indexing structures on-the-fly. DeDrift almost eliminates the accuracy degradation due to the query and database content drift while being up to 100x faster than a full index reconstruction. + + + + Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment + http://openaccess.thecvf.com//content/ICCV2023/papers/Ibrahimi_Audio-Enhanced_Text-to-Video_Retrieval_using_Text-Conditioned_Feature_Alignment_ICCV_2023_paper.pdf + Text-to-video retrieval systems have recently made significant progress by utilizing pre-trained models trained on large-scale image-text pairs. However, most of the latest methods primarily focus on the video modality while disregarding the audio signal for this task. Nevertheless, a recent advancement by EclipSE has improved long-range text-to-video retrieval by developing an audiovisual video representation. Nonetheless, the objective of the text-to-video retrieval task is to capture the complementary audio and video information that is pertinent to the text query rather than simply achieving better audio and video alignment. To address this issue, we introduce TEFAL, a TExt-conditioned Feature ALignment method that produces both audio and video representations conditioned on the text query. Instead of using only an audiovisual attention block, which could suppress the audio information relevant to the text query, our approach employs two independent cross-modal attention blocks that enable the text to attend to the audio and video representations separately. Our proposed method's efficacy is demonstrated on four benchmark datasets that include audio: MSR-VTT, LSMDC, VATEX, and Charades, and achieves better than state-of-the-art performance consistently across the four datasets. 
This is attributed to the additional text-query-conditioned audio representation and the complementary information it adds to the text-query-conditioned video representation. + + + + Prior-guided Source-free Domain Adaptation for Human Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Raychaudhuri_Prior-guided_Source-free_Domain_Adaptation_for_Human_Pose_Estimation_ICCV_2023_paper.pdf + Domain adaptation methods for 2D human pose estimation typically require continuous access to the source data during adaptation, which can be challenging due to privacy, memory, or computational constraints. To address this limitation, we focus on the task of source-free domain adaptation for pose estimation, where a source model must adapt to a new target domain using only unlabeled target data. Although recent advances have introduced source-free methods for classification tasks, extending them to the regression task of pose estimation is non-trivial. In this paper, we present Prior-guided Self-training (POST), a pseudo-labeling approach that builds on the popular Mean Teacher framework to compensate for the distribution shift. POST leverages prediction-level and feature-level consistency between a student and teacher model against certain image transformations. In the absence of source data, POST utilizes a human pose prior that regularizes the adaptation process by directing the model to generate more accurate and anatomically plausible pose pseudo-labels. Despite being simple and intuitive, our framework can deliver significant performance gains compared to applying the source model directly to the target data, as demonstrated in our extensive experiments and ablation studies. In fact, our approach achieves comparable performance to recent state-of-the-art methods that use source data for adaptation + + + + Auxiliary Tasks Benefit 3D Skeleton-based Human Motion Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Auxiliary_Tasks_Benefit_3D_Skeleton-based_Human_Motion_Prediction_ICCV_2023_paper.pdf + Exploring spatial-temporal dependencies from observed motions is one of the core challenges of human motion prediction. Previous methods mainly focus on dedicated network structures to model the spatial and temporal dependencies. This paper considers a new direction by introducing a model learning framework with auxiliary tasks. In our auxiliary tasks, partial body joints' coordinates are corrupted by either masking or adding noise and the goal is to recover corrupted coordinates depending on the rest coordinates. To work with auxiliary tasks, we propose a novel auxiliary-adapted transformer, which can handle incomplete, corrupted motion data and achieve coordinate recovery via capturing spatial-temporal dependencies. Through auxiliary tasks, the auxiliary-adapted transformer is promoted to capture more comprehensive spatial-temporal dependencies among body joints' coordinates, leading to better feature learning. Extensive experimental results have shown that our method outperforms state-of-the-art methods by remarkable margins of 7.2%, 3.7%, and 9.4% in terms of 3D mean per joint position error (MPJPE) on the Human3.6M, CMU Mocap, and 3DPW datasets, respectively. We also demonstrate that our method is more robust under data missing cases and noisy data cases. Code is available at https://github.com/MediaBrain-SJTU/AuxFormer. 
+ + + + Measuring Asymmetric Gradient Discrepancy in Parallel Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Lyu_Measuring_Asymmetric_Gradient_Discrepancy_in_Parallel_Continual_Learning_ICCV_2023_paper.pdf + In Parallel Continual Learning (PCL), multiple parallel tasks start and end training unpredictably, thus suffering from training conflict and catastrophic forgetting issues. The two issues arise because the gradients from parallel tasks differ in directions and magnitudes. Thus, in this paper, we formulate PCL as a minimum distance optimization problem among gradients and propose an explicit Asymmetric Gradient Distance (AGD) to evaluate the gradient discrepancy in PCL. AGD considers both gradient magnitude ratios and directions, and has a tolerance when updating with a small gradient of inverse direction, which reduces the imbalanced influence of gradients on parallel task training. Moreover, we propose a novel Maximum Discrepancy Optimization (MaxDO) strategy to minimize the maximum discrepancy among multiple gradients. By solving MaxDO with AGD, parallel training reduces the influence of the training conflict and suppresses the catastrophic forgetting of finished tasks. Extensive experiments validate the effectiveness of our approach on three image recognition datasets. + + + + HyperDiffusion: Generating Implicit Neural Fields with Weight-Space Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Erkoc_HyperDiffusion_Generating_Implicit_Neural_Fields_with_Weight-Space_Diffusion_ICCV_2023_paper.pdf + Implicit neural fields, typically encoded by a multilayer perceptron (MLP) that maps from coordinates (e.g., xyz) to signals (e.g., signed distances), have shown remarkable promise as a high-fidelity and compact representation. However, the lack of a regular and explicit grid structure also makes it challenging to apply generative modeling directly on implicit neural fields in order to synthesize new data. To this end, we propose HyperDiffusion, a novel approach for unconditional generative modeling of implicit neural fields. HyperDiffusion operates directly on MLP weights and generates new neural implicit fields encoded by synthesized MLP parameters. Specifically, a collection of MLPs is first optimized to faithfully represent individual data samples. Subsequently, a diffusion process is trained in this MLP weight space to model the underlying distribution of neural implicit fields. HyperDiffusion enables diffusion modeling over an implicit, compact, and yet high-fidelity representation of complex signals across various dimensionalities within one single unified framework. Experiments on both 3D shapes and 4D mesh animations demonstrate the effectiveness of our approach with significant improvement over prior work in high-fidelity synthesis. + + + + Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_Retinexformer_One-stage_Retinex-based_Transformer_for_Low-light_Image_Enhancement_ICCV_2023_paper.pdf + When enhancing low-light images, many deep learning algorithms are based on the Retinex theory. However, the Retinex model does not consider the corruptions hidden in the dark or introduced by the light-up process. Besides, these methods usually require a tedious multi-stage training pipeline and rely on convolutional neural networks, showing limitations in capturing long-range dependencies. &#13;
In this paper, we formulate a simple yet principled One-stage Retinex-based Framework (ORF). ORF first estimates the illumination information to light up the low-light image and then restores the corruption to produce the enhanced image. We design an Illumination-Guided Transformer (IGT) that utilizes illumination representations to direct the modeling of non-local interactions of regions with different lighting conditions. By plugging IGT into ORF, we obtain our algorithm, Retinexformer. Comprehensive quantitative and qualitative experiments demonstrate that our Retinexformer significantly outperforms state-of-the-art methods on thirteen benchmarks. The user study and application on low-light object detection also reveal the latent practical values of our method. Code is available at https://github.com/caiyuanhao1998/Retinexformer + + + + Linear Spaces of Meanings: Compositional Structures in Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Trager_Linear_Spaces_of_Meanings_Compositional_Structures_in_Vision-Language_Models_ICCV_2023_paper.pdf + We investigate compositional structures in data embeddings from pre-trained vision-language models (VLMs). Traditionally, compositionality has been associated with algebraic operations on embeddings of words from a pre-existing vocabulary. In contrast, we seek to approximate representations from an encoder as combinations of a smaller set of vectors in the embedding space. These vectors can be seen as "ideal words" for generating concepts directly within the embedding space of the model. We first present a framework for understanding compositional structures from a geometric perspective. We then explain what these compositional structures entail probabilistically in the case of VLM embeddings, providing intuitions for why they arise in practice. Finally, we empirically explore these structures in CLIP's embeddings and we evaluate their usefulness for solving different vision-language tasks such as classification, debiasing, and retrieval. Our results show that simple linear algebraic operations on embedding vectors can be used as compositional and interpretable methods for regulating the behavior of VLMs. + + + + Tracking by Natural Language Specification with Long Short-term Context Decoupling + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Tracking_by_Natural_Language_Specification_with_Long_Short-term_Context_Decoupling_ICCV_2023_paper.pdf + The main challenge of Tracking by Natural Language Specification (TNL) is to predict the movement of the target object given two heterogeneous sources of information: one is the static description of the main characteristics of a video contained in the textual query, i.e., long-term context; the other is an image patch containing the object and its surroundings cropped from the current frame, i.e., the search area. Currently, most methods still struggle with how to use these two sources of information rationally and simply fuse the two. However, the linguistic information contained in the textual query and the visual representation stored in the search area may sometimes be inconsistent, in which case the direct fusion of the two may lead to conflicts. To address this problem, we propose DecoupleTNL, introducing a video clip containing short-term context information into the framework of TNL and exploring a proper way to reduce the impact when the visual representation is inconsistent with the linguistic information. &#13;
Concretely, we design two jointly optimized tasks, i.e., short-term context-matching and long-term context-perceiving. The context-matching task aims to gather the dynamic short-term context information in a period, while the context-perceiving task tends to extract the static long-term context information. After that, we design a long short-term modulation module to integrate both kinds of context information for accurate tracking. Extensive experiments have been conducted on three tracking benchmark datasets to demonstrate the superiority of DecoupleTNL. + + + + Pyramid Dual Domain Injection Network for Pan-sharpening + http://openaccess.thecvf.com//content/ICCV2023/papers/He_Pyramid_Dual_Domain_Injection_Network_for_Pan-sharpening_ICCV_2023_paper.pdf + Pan-sharpening, a panchromatic image guided low-spatial-resolution multi-spectral super-resolution task, aims to reconstruct the missing high-frequency information of the high-resolution multi-spectral counterpart. Despite the inborn connection with the frequency domain, existing pan-sharpening research has scarcely investigated potential solutions in the frequency domain, thus limiting the improvement of model performance. To this end, we first revisit the degradation process of pan-sharpening in Fourier space, and then devise a Pyramid Dual Domain Injection Pan-sharpening Network upon the above observation by fully exploring and exploiting the distinguished information in both the spatial and frequency domains. Specifically, the proposed network is organized in a multi-scale U-shape manner and composed of two core parts: a spatial guidance pyramid sub-network for fusing local spatial information and a frequency guidance pyramid sub-network for fusing global frequency domain information, thus encouraging dual-domain complementary learning. In this way, the model can capture multi-scale dual-domain information to enable generating high-quality pan-sharpening results. Quantitative and qualitative experiments over multiple datasets demonstrate that our method performs the best against other state-of-the-art ones and exhibits a strong generalization ability for real-world scenes. + + + + NeSS-ST: Detecting Good and Stable Keypoints with a Neural Stability Score and the Shi-Tomasi detector + http://openaccess.thecvf.com//content/ICCV2023/papers/Pakulev_NeSS-ST_Detecting_Good_and_Stable_Keypoints_with_a_Neural_Stability_ICCV_2023_paper.pdf + Learning a feature point detector presents a challenge both due to the ambiguity of the definition of a keypoint and, correspondingly, the need for specially prepared ground truth labels for such points. In our work, we address both of these issues by utilizing a combination of a hand-crafted Shi-Tomasi detector, a specially designed metric that assesses the quality of keypoints, the stability score (SS), and a neural network. We build on the principled and localized keypoints provided by the Shi-Tomasi detector and train the neural network to select good feature points via the stability score. The neural network incorporates the knowledge from the training targets in the form of the neural stability score (NeSS). Therefore, our method is named NeSS-ST since it combines the Shi-Tomasi detector and the properties of the neural stability score. It only requires sets of images for training without dataset pre-labeling or the need for reconstructed correspondence labels. We evaluate NeSS-ST on HPatches, ScanNet, MegaDepth and IMC-PT, demonstrating state-of-the-art performance and good generalization on downstream tasks. &#13;
The project repository is available at: https://github.com/KonstantinPakulev/NeSS-ST. + + + + Video Action Segmentation via Contextually Refined Temporal Keypoints + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Video_Action_Segmentation_via_Contextually_Refined_Temporal_Keypoints_ICCV_2023_paper.pdf + Video action segmentation refers to the task of densely casting each video frame or short segment in an untrimmed video into pre-specified action categories. Although recent years have witnessed great promise in the development of action segmentation techniques, a large body of existing methods still relies on frame-wise segmentation, which tends to render fragmentary results (i.e., over-segmentation). To effectively address these issues, we propose a video action segmentation model that implements the novel idea of Refined Temporal Keypoints (RTK) to overcome the caveats of existing methods. The proposed model initially seeks high-quality, sparse temporal keypoints by extracting non-local cues from the video, rather than conducting frame-wise classification as in many competing methods. Afterwards, large improvements over the initial temporal keypoints are obtained by further refining and re-assembling operations. Specifically, we develop a graph matching module that aggregates structural information between different temporal keypoints by learning the correspondence between the temporal source graphs and the annotated target graphs. The initial temporal keypoints are refined with the encoded structural information by reusing the graph matching module. A small set of prior rules is harnessed for post-processing and re-assembling all temporal keypoints. The temporal keypoints remaining after all refinement steps are used to generate the final action segmentation results. We perform experiments on three popular datasets: 50Salads, GTEA and Breakfast, and our method significantly outperforms current methods, achieving state-of-the-art F1@50 scores of 83.4%, 79.5%, and 60.5% on the three datasets, respectively. + + + + Shatter and Gather: Learning Referring Image Segmentation with Text Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Shatter_and_Gather_Learning_Referring_Image_Segmentation_with_Text_Supervision_ICCV_2023_paper.pdf + Referring image segmentation, the task of segmenting any arbitrary entities described in free-form texts, opens up a variety of vision applications. However, manual labeling of training data for this task is prohibitively costly, leading to a lack of labeled data for training. We address this issue with a weakly supervised learning approach using text descriptions of training images as the only source of supervision. To this end, we first present a new model that discovers semantic entities in the input image and then combines such entities relevant to the text query to predict the mask of the referent. We also present a new loss function that allows the model to be trained without any further supervision. Our method was evaluated on four public benchmarks for referring image segmentation, where it clearly outperformed the existing method for the same task and recent open-vocabulary segmentation models on all the benchmarks. &#13;
+ + + + Two-in-One Depth: Bridging the Gap Between Monocular and Binocular Self-Supervised Depth Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Two-in-One_Depth_Bridging_the_Gap_Between_Monocular_and_Binocular_Self-Supervised_ICCV_2023_paper.pdf + Monocular and binocular self-supervised depth estimation are two important and related tasks in computer vision, which aim to predict scene depths from single images and stereo image pairs, respectively. In the literature, the two tasks are usually tackled separately by two different kinds of models, and binocular models generally fail to predict depth from single images, while the prediction accuracy of monocular models is generally inferior to that of binocular models. In this paper, we propose a Two-in-One self-supervised depth estimation network, called TiO-Depth, which can not only compatibly handle the two tasks, but also improve the prediction accuracy. TiO-Depth employs a Siamese architecture, and each of its sub-networks can be used as a monocular depth estimation model. For binocular depth estimation, a Monocular Feature Matching module is proposed for incorporating the stereo knowledge between the two images, and the full TiO-Depth is used to predict depths. We also design a multi-stage joint-training strategy for improving the performance of TiO-Depth on both tasks by combining their relative advantages. Experimental results on the KITTI, Cityscapes, and DDAD datasets demonstrate that TiO-Depth outperforms both the monocular and binocular state-of-the-art methods in most cases, further verifying the feasibility of a two-in-one network for monocular and binocular depth estimation. The code is available at https://github.com/ZM-Zhou/TiO-Depth_pytorch. + + + + Rethinking Pose Estimation in Crowds: Overcoming the Detection Information Bottleneck and Ambiguity + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Rethinking_Pose_Estimation_in_Crowds_Overcoming_the_Detection_Information_Bottleneck_ICCV_2023_paper.pdf + Frequent interactions between individuals are a fundamental challenge for pose estimation algorithms. Current pipelines either use an object detector together with a pose estimator (top-down approach), or localize all body parts first and then link them to predict the pose of individuals (bottom-up). Yet, when individuals closely interact, top-down methods are ill-defined due to overlapping individuals, and bottom-up methods often falsely infer connections to distant body parts. Thus, we propose a novel pipeline called bottom-up conditioned top-down pose estimation (BUCTD) that combines the strengths of bottom-up and top-down methods. Specifically, we propose to use a bottom-up model as the detector, which in addition to an estimated bounding box provides a pose proposal that is fed as a condition to an attention-based top-down model. We demonstrate the performance and efficiency of our approach on animal and human pose estimation benchmarks. On CrowdPose and OCHuman, we outperform previous state-of-the-art models by a significant margin. We achieve 78.5 AP on CrowdPose and 48.5 AP on OCHuman, an improvement of 8.6% and 7.8% over the prior art, respectively. Furthermore, we show that our method strongly improves the performance on multi-animal benchmarks involving fish and monkeys. The code is available at https://github.com/amathislab/BUCTD.
+ + + + Social Diffusion: Long-term Multiple Human Motion Anticipation + http://openaccess.thecvf.com//content/ICCV2023/papers/Tanke_Social_Diffusion_Long-term_Multiple_Human_Motion_Anticipation_ICCV_2023_paper.pdf + We propose Social Diffusion, a novel method for short-term and long-term forecasting of the motion of multiple persons as well as their social interactions. Jointly forecasting motions for multiple persons involved in social activities is inherently a challenging problem due to the interdependencies between individuals. In this work, we leverage a diffusion model conditioned on motion histories and causal temporal convolutional networks to forecast individually and contextually plausible motions for all participants. The contextual plausibility is achieved via an order-invariant aggregation function. As a second contribution, we design a new evaluation protocol that measures the plausibility of social interactions which we evaluate on the Haggling dataset, which features a challenging social activity where people are actively taking turns to talk and switching their attention. We evaluate our approach on four datasets for multi-person forecasting where our approach outperforms the state-of-the-art in terms of motion realism and contextual plausibility. + + + + Synchronize Feature Extracting and Matching: A Single Branch Framework for 3D Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Synchronize_Feature_Extracting_and_Matching_A_Single_Branch_Framework_for_ICCV_2023_paper.pdf + Siamese network has been a de facto benchmark framework for 3D LiDAR object tracking with a shared-parametric encoder extracting features from template and search region, respectively. This paradigm relies heavily on an additional matching network to model the cross-correlation/similarity of the template and search region. In this paper, we forsake the conventional Siamese paradigm and propose a novel single-branch framework, SyncTrack, synchronizing the feature extracting and matching to avoid forwarding encoder twice for template and search region as well as introducing extra parameters of matching network. The synchronization mechanism is based on the dynamic affinity of the Transformer, and an in-depth analysis of the relevance is provided theoretically. Moreover, based on the synchronization, we introduce a novel Attentive Points-Sampling strategy into the Transformer layers (APST), replacing the random/Farthest Points Sampling (FPS) method with sampling under the supervision of attentive relations between the template and search region. It implies connecting point-wise sampling with the feature learning, beneficial to aggregating more distinctive and geometric features for tracking with sparse points. Extensive experiments on two benchmark datasets (KITTI and NuScenes) show that SyncTrack achieves state-of-the-art performance in real-time tracking. + + + + Leveraging Intrinsic Properties for Non-Rigid Garment Alignment + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Leveraging_Intrinsic_Properties_for_Non-Rigid_Garment_Alignment_ICCV_2023_paper.pdf + We address the problem of aligning real-world 3D data of garments, which benefits many applications such as texture learning, physical parameter estimation, generative modeling of garments, etc. Existing extrinsic methods typically perform non-rigid iterative closest point and struggle to align details due to incorrect closest matches and rigidity constraints. 
While intrinsic methods based on functional maps can produce high-quality correspondences, they work under isometric assumptions and become unreliable for garment deformations which are highly non-isometric. To achieve wrinkle-level as well as texture-level alignment, we present a novel coarse-to-fine two-stage method that leverages intrinsic manifold properties with two neural deformation fields, in the 3D space and the intrinsic space, respectively. The coarse stage performs a 3D fitting, where we leverage intrinsic manifold properties to define a manifold deformation field. The coarse fitting then induces a functional map that produces an alignment of intrinsic embeddings. We further refine the intrinsic alignment with a second neural deformation field for higher accuracy. We evaluate our method with our captured garment dataset, GarmCap. The method achieves accurate wrinkle-level and texture-level alignment and works for difficult garment types such as long coats. Our project page is https://jsnln.github.io/iccv2023_intrinsic/index.html. + + + + P2C: Self-Supervised Point Cloud Completion from Single Partial Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Cui_P2C_Self-Supervised_Point_Cloud_Completion_from_Single_Partial_Clouds_ICCV_2023_paper.pdf + Point cloud completion aims to recover the complete shape based on a partial observation. Existing methods require either complete point clouds or multiple partial observations of the same object for learning. In contrast to previous approaches, we present Partial2Complete (P2C), the first self-supervised framework that completes point cloud objects using training samples consisting of only a single incomplete point cloud per object. Specifically, our framework groups incomplete point clouds into local patches as input and predicts masked patches by learning prior information from different partial objects. We also propose Region-Aware Chamfer Distance to regularize shape mismatch without limiting completion capability, and devise the Normal Consistency Constraint to incorporate a local planarity assumption, encouraging the recovered shape surface to be continuous and complete. In this way, P2C no longer needs multiple observations or complete point clouds as ground truth. Instead, structural cues are learned from a category-specific dataset to complete partial point clouds of objects. We demonstrate the effectiveness of our approach on both synthetic ShapeNet data and real-world ScanNet data, showing that P2C produces comparable results to methods trained with complete shapes, and outperforms methods learned with multiple partial observations. Code is available at https://github.com/CuiRuikai/Partial2Complete. + + + + A Game of Bundle Adjustment - Learning Efficient Convergence + http://openaccess.thecvf.com//content/ICCV2023/papers/Belder_A_Game_of_Bundle_Adjustment_-_Learning_Efficient_Convergence_ICCV_2023_paper.pdf + Bundle adjustment is the common way to solve localization and mapping. It is an iterative process in which a system of non-linear equations is solved using two optimization methods, weighted by a damping factor. In the classic approach, the latter is chosen heuristically by the Levenberg-Marquardt algorithm on each iteration. This might take many iterations, making the process computationally expensive, which might be harmful to real-time applications. We propose to replace this heuristic by viewing the problem in a holistic manner, as a game, and formulating it as a reinforcement-learning task. 
We set an environment which solves the non-linear equations and train an agent to choose the damping factor in a learned manner. We demonstrate that our approach considerably reduces the number of iterations required to reach the bundle adjustment's convergence, on both synthetic and real-life scenarios. We show that this reduction benefits the classic approach and can be integrated with other bundle adjustment acceleration methods. + + + + Learning Correction Filter via Degradation-Adaptive Regression for Blind Single Image Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Learning_Correction_Filter_via_Degradation-Adaptive_Regression_for_Blind_Single_Image_ICCV_2023_paper.pdf + Although existing image deep learning super-resolution (SR) methods achieve promising performance on benchmark datasets, they still suffer from severe performance drops when the degradation of the low-resolution (LR) input is not covered in training. To address the problem, we propose an innovative unsupervised method of Learning Correction Filter via Degradation-Adaptive Regression for Blind Single Image Super-Resolution. Highly inspired by the generalized sampling theory, our method aims to enhance the strength of off-the-shelf SR methods trained on known degradations and adapt to unknown complex degradations to generate improved results. Specifically, we first conduct degradation estimation for each local image region by learning the internal distribution in an unsupervised manner via GAN. Instead of assuming degradation are spatially invariant across the whole image, we learn correction filters to adjust degradations to known degradations in a spatially variant way by a novel linearly-assembled pixel degradation-adaptive regression module (DARM). DARM is lightweight and easy to optimize on a dictionary of multiple pre-defined filter bases. Extensive experiments on synthetic and real-world datasets verify the effectiveness of our method both qualitatively and quantitatively. Code can be available at: https://github.com/edbca/DARSR. + + + + SINC: Spatial Composition of 3D Human Motions for Simultaneous Action Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Athanasiou_SINC_Spatial_Composition_of_3D_Human_Motions_for_Simultaneous_Action_ICCV_2023_paper.pdf + Our goal is to synthesize 3D human motions given textual inputs describing simultaneous actions, for example `waving hand' while `walking' at the same time. We refer to generating such simultaneous movements as performing `spatial compositions'. In contrast to `temporal compositions' that seek to transition from one action to another, spatial compositing requires understanding which body parts are involved with which action, to be able to move them simultaneously. Motivated by the observation that the correspondence between actions and body parts is encoded in powerful language models, we extract this knowledge by prompting GPT-3 with text such as "what are the body parts involved in the action <action name>?", while also providing the parts list and a few examples. Given this action-part mapping, we combine body parts from two motions together and establish the first automated method to spatially compose two actions. However, training data with compositional actions is always limited by the combinatorics. Hence, we further create synthetic data with this approach, and use it to train a new state-of-the-art text-to-motion generation model, called SINC ("SImultaneous actioN Compositions for 3D human motions"). 
In our experiments, we find that training with such GPT-guided synthetic data improves spatial composition generation over baselines. Our code is publicly available at https://sinc.is.tue.mpg.de/. + + + + MV-DeepSDF: Implicit Modeling with Multi-Sweep Point Clouds for 3D Vehicle Reconstruction in Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_MV-DeepSDF_Implicit_Modeling_with_Multi-Sweep_Point_Clouds_for_3D_Vehicle_ICCV_2023_paper.pdf + Reconstructing 3D vehicles from noisy and sparse partial point clouds is of great significance to autonomous driving. Most existing 3D reconstruction methods cannot be directly applied to this problem because they are elaborately designed to deal with dense inputs with trivial noise. In this work, we propose a novel framework, dubbed MV-DeepSDF, which estimates the optimal Signed Distance Function (SDF) shape representation from multi-sweep point clouds to reconstruct vehicles in the wild. Although there have been some SDF-based implicit modeling methods, they only focus on single-view-based reconstruction, resulting in low fidelity. In contrast, we first analyze multi-sweep consistency and complementarity in the latent feature space and propose to transform the implicit space shape estimation problem into an element-to-set feature extraction problem. Then, we devise a new architecture to extract individual element-level representations and aggregate them to generate a set-level predicted latent code. This set-level latent code is an expression of the optimal 3D shape in the implicit space, and can be subsequently decoded to a continuous SDF of the vehicle. In this way, our approach learns consistent and complementary information among multi-sweeps for 3D vehicle reconstruction. We conduct thorough experiments on two real-world autonomous driving datasets (Waymo and KITTI) to demonstrate the superiority of our approach over state-of-the-art alternative methods both qualitatively and quantitatively. + + + + CHORD: Category-level Hand-held Object Reconstruction via Shape Deformation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_CHORD_Category-level_Hand-held_Object_Reconstruction_via_Shape_Deformation_ICCV_2023_paper.pdf + In daily life, humans utilize hands to manipulate objects. Modeling the shape of objects that are manipulated by the hand is essential for AI to comprehend daily tasks and to learn manipulation skills. However, previous approaches have encountered difficulties in reconstructing the precise shapes of hand-held objects, primarily owing to a deficiency in prior shape knowledge and inadequate data for training. As illustrated, given a particular type of tool, such as a mug, despite its infinite variations in shape and appearance, humans have a limited number of 'effective' modes and poses for its manipulation. This can be attributed to the fact that humans have mastered the shape prior of the 'mug' category, and can quickly establish the corresponding relations between different mug instances and the prior, such as where the rim and handle are located. In light of this, we propose a new method, CHORD, for Category-level Hand-held Object Reconstruction via shape Deformation. CHORD deforms a categorical shape prior for reconstructing the intra-class objects. To ensure accurate reconstruction, we empower CHORD with three types of awareness: appearance, shape, and interacting pose. In addition, we have constructed a new dataset, COMIC, of category-level hand-object interaction. 
COMIC contains a rich array of object instances, materials, hand interactions, and viewing directions. Extensive evaluation shows that CHORD outperforms state-of-the-art approaches in both quantitative and qualitative measures. Code, model, and datasets are available at https://kailinli.github.io/CHORD. + + + + Towards Universal LiDAR-Based 3D Object Detection by Multi-Domain Knowledge Transfer + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Towards_Universal_LiDAR-Based_3D_Object_Detection_by_Multi-Domain_Knowledge_Transfer_ICCV_2023_paper.pdf + Contemporary LiDAR-based 3D object detection methods mostly focus on single-domain learning or cross-domain adaptive learning. However, for autonomous driving systems, optimizing a specific LiDAR-based 3D object detector for each domain is costly and lacks scalability in real-world deployment. It is desirable to train a universal LiDAR-based 3D object detector from multiple domains. In this work, we present the first attempt to explore multi-domain learning and generalization for LiDAR-based 3D object detection. We show that jointly optimizing a 3D object detector from multiple domains achieves better generalization capability compared to the conventional single-domain learning model. To explore informative knowledge across domains towards a universal 3D object detector, we propose a multi-domain knowledge transfer framework with universal feature transformation. This approach leverages spatial-wise and channel-wise knowledge across domains to learn universal feature representations, which facilitates optimizing a universal 3D object detector for deployment across different domains. Extensive experiments on four benchmark datasets (Waymo, KITTI, NuScenes and ONCE) show the superiority of our approach over the state-of-the-art approaches for multi-domain learning and generalization in LiDAR-based 3D object detection. + + + + Towards High-Fidelity Text-Guided 3D Face Generation and Manipulation Using only Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Towards_High-Fidelity_Text-Guided_3D_Face_Generation_and_Manipulation_Using_only_ICCV_2023_paper.pdf + Generating 3D faces from textual descriptions has a multitude of applications, such as gaming, movies, and robotics. Recent progress has demonstrated the success of unconditional 3D face generation and text-to-3D shape generation. However, due to the limited availability of text-3D face data pairs, text-driven 3D face generation remains an open problem. In this paper, we propose a text-guided 3D face generation method, referred to as TG-3DFace, for generating realistic 3D faces using text guidance. Specifically, we adopt an unconditional 3D face generation framework and equip it with text conditions, which learns the text-guided 3D face generation with only text-2D face data. On top of that, we propose two text-to-face cross-modal alignment techniques, including global contrastive learning and a fine-grained alignment module, to facilitate high semantic consistency between generated 3D faces and input texts. Besides, we present directional classifier guidance during the inference process, which encourages creativity for out-of-domain generations. Compared to the existing methods, TG-3DFace creates more realistic and aesthetically pleasing 3D faces, boosting multi-view consistency (MVIC) by 9% over Latent3D.
The rendered face images generated by TG-3DFace achieve higher FID and CLIP scores than text-to-2D face/image generation models, demonstrating our superiority in generating realistic and semantically consistent textures. + + + + ENTL: Embodied Navigation Trajectory Learner + http://openaccess.thecvf.com//content/ICCV2023/papers/Kotar_ENTL_Embodied_Navigation_Trajectory_Learner_ICCV_2023_paper.pdf + We propose Embodied Navigation Trajectory Learner (ENTL), a method for extracting long sequence representations for embodied navigation. Our approach unifies world modeling, localization and imitation learning into a single sequence prediction task. We train our model using vector-quantized predictions of future states conditioned on current states and actions. ENTL's generic architecture enables the sharing of the spatio-temporal sequence encoder for multiple challenging embodied tasks. We achieve competitive performance on navigation tasks using significantly less data than strong baselines while performing auxiliary tasks such as localization and future frame prediction (a proxy for world modeling). A key property of our approach is that the model is pre-trained without any explicit reward signal, which makes the resulting model generalizable to multiple tasks and environments. + + + + AGG-Net: Attention Guided Gated-Convolutional Network for Depth Image Completion + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_AGG-Net_Attention_Guided_Gated-Convolutional_Network_for_Depth_Image_Completion_ICCV_2023_paper.pdf + Recently, stereo vision based on lightweight RGBD cameras has been widely used in various fields. However, limited by their imaging principles, the commonly used RGB-D cameras based on TOF, structured light, or binocular vision inevitably acquire some invalid data, such as weak reflections, boundary shadows, and artifacts, which may adversely impact follow-up work. In this paper, we propose a new model for depth image completion based on the Attention Guided Gated-convolutional Network (AGG-Net), through which more accurate and reliable depth images can be obtained based on the raw depth maps and the corresponding RGB images. Our model employs a UNet-like architecture which consists of two parallel branches of depth and color features. In the encoding stage, an Attention Guided Gated Convolution (AG-GConv) module is proposed to realize the fusion of depth and color features at different scales, which can effectively reduce the negative impacts of invalid depth data on the reconstruction. In the decoding stage, an Attention Guided Skip Connection (AG-SC) module is presented to avoid introducing too many depth-irrelevant features to the reconstruction. The experimental results demonstrate that our method outperforms the state-of-the-art methods on the popular benchmarks NYU-Depth V2, DIML, and SUN RGB-D. + + + + Real-Time Neural Rasterization for Large Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Real-Time_Neural_Rasterization_for_Large_Scenes_ICCV_2023_paper.pdf + We propose a new method for realistic real-time novel-view synthesis (NVS) of large scenes. Existing fast neural rendering methods generate realistic results, but primarily work for small scale scenes (<50 square meters) and have difficulty at large scale (>10000 square meters). Traditional graphics-based rasterization rendering is fast for large scenes but lacks realism and requires expensive manually created assets.
Our approach combines the best of both worlds by taking a moderate-quality scaffold mesh as input and learning a neural texture field and shader to model view-dependant effects to enhance realism, while still using the standard graphics pipeline for real-time rendering. Our method outperforms existing neural rendering methods, providing at least 30x faster rendering with comparable or better realism for large self-driving and drone scenes. Our work is the first to enable real-time visualization of large real-world scenes. + + + + MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_MixSpeech_Cross-Modality_Self-Learning_with_Audio-Visual_Stream_Mixup_for_Visual_Speech_ICCV_2023_paper.pdf + Multi-media communications facilitate global interaction among people. However, despite researchers exploring cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, there is still a shortage of cross-lingual studies on visual speech. This lack of research is mainly due to the absence of datasets containing visual speech and translated text pairs. In this paper, we present AVMuST-TED, the first dataset for Audio-Visual Multilingual Speech Translation, derived from TED talks. Nonetheless, visual speech is not as distinguishable as audio speech, making it difficult to develop a mapping from source speech phonemes to the target language text. To address this issue, we propose MixSpeech, a cross-modality self-learning framework that utilizes audio speech to regularize the training of visual speech tasks. To further minimize the cross-modality gap and its impact on knowledge transfer, we suggest adopting mixed speech, which is created by interpolating audio and visual streams, along with a curriculum learning strategy to adjust the mixing ratio as needed. MixSpeech enhances speech translation in noisy environments, improving BLEU scores for four languages on AVMuST-TED by +1.4 to +4.2. Moreover, it achieves state-of-the-art performance in lip reading on CMLR (11.1%), LRS2 (25.5%), and LRS3 (28.0%). + + + + Innovating Real Fisheye Image Correction with Dual Diffusion Architecture + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Innovating_Real_Fisheye_Image_Correction_with_Dual_Diffusion_Architecture_ICCV_2023_paper.pdf + Fisheye image rectification is hindered by synthetic models producing poor results for real-world correction. To address this, we propose a Dual Diffusion Architecture (DDA) for fisheye rectification that offers better practicality. The DDA leverages Denoising Diffusion Probabilistic Models (DDPMs) to gradually introduce bidirectional noise, allowing the synthesized and real images to develop into a consistent noise distribution. As a result, our network can perceive the distribution of unlabelled real fisheye images without relying on a transfer network, thus improving the performance of real fisheye correction. Additionally, we design an unsupervised one-pass network that generates a plausible new condition to strengthen guidance and address the non-negligible indeterminacy between the prior condition and the target. It can significantly affect the rectification task, especially in cases where radial distortion causes significant artifacts. This network can be regarded as an alternate scheme for fast producing reliable results without iterative inference. 
Compared to the state-of-the-art methods, our approach achieves superior performance in both synthetic and real fisheye image correction. + + + + Global Perception Based Autoregressive Neural Processes + http://openaccess.thecvf.com//content/ICCV2023/papers/Tai_Global_Perception_Based_Autoregressive_Neural_Processes_ICCV_2023_paper.pdf + Increasingly, autoregressive approaches are being used to serialize observed variables based on specific criteria. Neural Processes (NPs) model variable distributions as continuous functions and provide quick solutions for different tasks using a meta-learning framework. This paper proposes an autoregressive framework for NPs that builds on their autoregressive properties. This framework leverages the autoregressive stacking effects of various variables to enhance the representation of the latent distribution, concurrently refining local and global relationships within the positional representation through the use of a sliding window mechanism. Autoregression improves function approximations in a stacked fashion, thereby raising the upper bound of the optimization. We have designated this framework as Autoregressive Neural Processes (AENPs) or Conditional Autoregressive Neural Processes (CAENPs). Traditional NP models and their variants aim to capture relationships between the context sample points, without addressing either local or global considerations. Specifically, we capture contextual relationships in the deterministic path and introduce sliding window attention and global attention to reconcile local and global relationships in the context sample points. Autoregressive constraints exist between multiple latent variables in the latent paths, thus building a complex global structure that allows our model to learn complex distributions. Finally, we demonstrate the effectiveness of the AENP and CAENP models for 1D data, Bayesian optimization, and 2D data. + + + + VQA Therapy: Exploring Answer Differences by Visually Grounding Answers + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_VQA_Therapy_Exploring_Answer_Differences_by_Visually_Grounding_Answers_ICCV_2023_paper.pdf + Visual question answering is the task of predicting the answer to a question about an image. Given that different people can provide different answers to a visual question, we aim to better understand why with answer groundings. We introduce the first dataset that visually grounds each unique answer to each visual question, which we call VQAAnswerTherapy. We then propose two novel problems of predicting whether a visual question has a single answer grounding and localizing all answer groundings. We benchmark modern algorithms for these novel problems to show where they succeed and struggle. The dataset and evaluation server can be found publicly at https://vizwiz.org/tasks-and-datasets/vqa-answer-therapy/. + + + + Energy-based Self-Training and Normalization for Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Herath_Energy-based_Self-Training_and_Normalization_for_Unsupervised_Domain_Adaptation_ICCV_2023_paper.pdf + We propose an Unsupervised Domain Adaptation (UDA) method that makes use of Energy-Based Learning (EBL) and demonstrate that 1) EBL can be used to improve the instance selection for a self-training task on the unlabelled target domain, and 2) aligning and normalizing energy scores can learn domain-invariant representations.
For the former, we show that an energy-based selection criterion can be used to model instance selections by mimicking the joint distribution between data and predictions in the target domain. As for learning domain-invariant representations, we show that stable domain alignment can be achieved by a combined energy alignment and energy normalization process. We implement our method with a vision-transformer (ViT) backbone and empirically show that our proposed method can outperform state-of-the-art ViT-based UDA methods on diverse benchmarks (DomainNet, OfficeHome, and VISDA2017). + + + + Collaborative Tracking Learning for Frame-Rate-Insensitive Multi-Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Collaborative_Tracking_Learning_for_Frame-Rate-Insensitive_Multi-Object_Tracking_ICCV_2023_paper.pdf + Multi-object tracking (MOT) at low frame rates can reduce computational, storage and power overhead to better meet the constraints of edge devices. Many existing MOT methods suffer from significant performance degradation in low-frame-rate videos due to significant location and appearance changes between adjacent frames. To this end, we propose to explore collaborative tracking learning (CoTracker) for frame-rate-insensitive MOT in a query-based end-to-end manner. Multiple historical queries of the same target jointly track it with richer temporal descriptions. Meanwhile, we insert an information refinement module between every two temporal blocking decoders to better fuse temporal clues and refine features. Moreover, a tracking object consistency loss is proposed to guide the interaction between historical queries. Extensive experimental results demonstrate that in high-frame-rate videos, CoTracker obtains higher performance than state-of-the-art methods on the large-scale datasets Dancetrack and BDD100K, and outperforms the existing end-to-end methods on MOT17. More importantly, CoTracker has a significant advantage over state-of-the-art methods in low-frame-rate videos, which allows it to obtain faster processing speeds by reducing frame-rate requirements while maintaining higher performance. Code will be released at https://github.com/yolomax/ColTrack + + + + Prompt-aligned Gradient for Prompt Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Prompt-aligned_Gradient_for_Prompt_Tuning_ICCV_2023_paper.pdf + Thanks to large pre-trained vision-language models (VLMs) like CLIP, we can craft a zero-shot classifier by discrete prompt design, e.g., the confidence score of an image being "[CLASS]" can be obtained by using the VLM-provided similarity between the image and the prompt sentence "a photo of a [CLASS]". Furthermore, prompting shows great potential for fast adaptation of VLMs to downstream tasks if we fine-tune the soft prompts with few samples. However, we find a common failure: improper fine-tuning or learning with extremely few-shot samples may even under-perform the zero-shot prediction. Existing methods still address this problem by using traditional anti-overfitting techniques such as early stopping and data augmentation, which lack a principled solution specific to prompting. In this paper, we present Prompt-aligned Gradient, dubbed ProGrad, to prevent prompt tuning from forgetting the general knowledge learned from VLMs.
In particular, ProGrad only updates the prompt whose gradient is aligned with (or non-conflicting with) the general knowledge, which is represented as the optimization direction offered by the predefined prompt predictions. Extensive experiments under the few-shot learning, domain generalization, base-to-new generalization and cross-dataset transfer settings demonstrate the stronger few-shot generalization ability of ProGrad over state-of-the-art prompt tuning methods. Code and theoretical proofs are provided in the Appendix. + + + + Aperture Diffraction for Compact Snapshot Spectral Imaging + http://openaccess.thecvf.com//content/ICCV2023/papers/Lv_Aperture_Diffraction_for_Compact_Snapshot_Spectral_Imaging_ICCV_2023_paper.pdf + We demonstrate a compact, cost-effective snapshot spectral imaging system named Aperture Diffraction Imaging Spectrometer (ADIS), which consists only of an imaging lens with an ultra-thin orthogonal aperture mask and a mosaic filter sensor, requiring no additional physical footprint compared to common RGB cameras. We then introduce a new optical design in which each point in the object space is multiplexed to discrete encoding locations on the mosaic filter sensor by diffraction-based spatial-spectral projection engineering generated from the orthogonal mask. The orthogonal projection is uniformly accepted to obtain a weakly calibration-dependent data form to enhance modulation robustness. Meanwhile, the Cascade Shift-Shuffle Spectral Transformer (CSST), with strong perception of the diffraction degeneration, is designed to solve a sparsity-constrained inverse problem, realizing the volume reconstruction from 2D measurements with a large amount of aliasing. We evaluate our system by elaborating the imaging optical theory and the reconstruction algorithm, and by demonstrating experimental imaging under a single exposure. Ultimately, we achieve sub-super-pixel spatial resolution and high spectral resolution imaging. The code will be available at: https://github.com/Krito-ex/CSST. + + + + Diffusion Action Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Diffusion_Action_Segmentation_ICCV_2023_paper.pdf + Temporal action segmentation is crucial for understanding long-form videos. Previous works on this task commonly adopt an iterative refinement paradigm by using multi-stage models. We propose a novel framework via denoising diffusion models, which nonetheless shares the same inherent spirit of such iterative refinement. In this framework, action predictions are iteratively generated from random noise with input video features as conditions. To enhance the modeling of three striking characteristics of human actions, including the position prior, the boundary ambiguity, and the relational dependency, we devise a unified masking strategy for the conditioning inputs in our framework. Extensive experiments on three benchmark datasets, i.e., GTEA, 50Salads, and Breakfast, are performed and the proposed method achieves superior or comparable results to state-of-the-art methods, showing the effectiveness of a generative approach for action segmentation. + + + + Scalable Video Object Segmentation with Simplified Framework + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Scalable_Video_Object_Segmentation_with_Simplified_Framework_ICCV_2023_paper.pdf + The current popular methods for video object segmentation (VOS) implement feature matching through several hand-crafted modules that separately perform feature extraction and matching.
However, the above hand-crafted designs empirically cause insufficient target interaction, thus limiting the dynamic target-aware feature learning in VOS. To tackle these limitations, this paper presents a scalable Simplified VOS (SimVOS) framework to perform joint feature extraction and matching by leveraging a single transformer backbone. Specifically, SimVOS employs a scalable ViT backbone for simultaneous feature extraction and matching between query and reference features. This design enables SimVOS to learn better target-ware features for accurate mask prediction. More importantly, SimVOS could directly apply well-pretrained ViT backbones (e.g., MAE) for VOS, which bridges the gap between VOS and large-scale self-supervised pre-training. To achieve a better performance-speed trade-off, we further explore within-frame attention and propose a new token refinement module to improve the running speed and save computational cost. Experimentally, our SimVOS achieves state-of-the-art results on popular video object segmentation benchmarks, i.e., DAVIS-2017 (88.0% J&F), DAVIS-2016 (92.9% J&F) and YouTube-VOS 2019 (84.2% J&F), without applying any synthetic video or BL30K pre-training used in previous VOS approaches. Our code and models are available at https://github.com/jimmy-dq/SimVOS.git. + + + + Rehearsal-Free Domain Continual Face Anti-Spoofing: Generalize More and Forget Less + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_Rehearsal-Free_Domain_Continual_Face_Anti-Spoofing_Generalize_More_and_Forget_Less_ICCV_2023_paper.pdf + Face Anti-Spoofing (FAS) is recently studied under the continual learning setting, where the FAS models are expected to evolve after encountering data from new domains. However, existing methods need extra replay buffers to store previous data for rehearsal, which becomes infeasible when previous data is unavailable because of privacy issues. In this paper, we propose the first rehearsal-free method for Domain Continual Learning (DCL) of FAS, which deals with catastrophic forgetting and unseen domain generalization problems simultaneously. For better generalization to unseen domains, we design the Dynamic Central Difference Convolutional Adapter (DCDCA) to adapt Vision Transformer (ViT) models during the continual learning sessions. To alleviate the forgetting of previous domains without using previous data, we propose the Proxy Prototype Contrastive Regularization (PPCR) to constrain the continual learning with previous domain knowledge from the proxy prototypes. Simulating practical DCL scenarios, we devise two new protocols which evaluate both generalization and anti-forgetting performance. Extensive experimental results show that our proposed method can improve the generalization performance in unseen domains and alleviate the catastrophic forgetting of previous knowledge. The code and protocol files are released on https://github.com/RizhaoCai/DCL-FAS-ICCV2023. + + + + Towards General Low-Light Raw Noise Synthesis and Modeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Towards_General_Low-Light_Raw_Noise_Synthesis_and_Modeling_ICCV_2023_paper.pdf + Modeling and synthesizing low-light raw noise is a fundamental problem for computational photography and image processing applications. 
Although most recent works have adopted physics-based models to synthesize noise, the signal-independent noise in low-light conditions is far more complicated and varies dramatically across camera sensors, which is beyond the description of these models. To address this issue, we introduce a new perspective to synthesize the signal-independent noise by a generative model. Specifically, we synthesize the signal-dependent and signal-independent noise in a physics- and learning-based manner, respectively. In this way, our method can be considered as a general model, that is, it can simultaneously learn different noise characteristics for different ISO levels and generalize to various sensors. Subsequently, we present an effective multi-scale discriminator termed Fourier transformer discriminator (FTD) to distinguish the noise distribution accurately. Additionally, we collect a new low-light raw denoising (LRD) dataset for training and benchmarking. Qualitative validation shows that the noise generated by our proposed noise model can be highly similar to the real noise in terms of distribution. Furthermore, extensive denoising experiments demonstrate that our method performs favorably against state-of-the-art methods on different sensors. + + + + Beyond the Pixel: a Photometrically Calibrated HDR Dataset for Luminance and Color Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Bolduc_Beyond_the_Pixel_a_Photometrically_Calibrated_HDR_Dataset_for_Luminance_ICCV_2023_paper.pdf + Light plays an important role in human well-being. However, most computer vision tasks treat pixels without considering their relationship to physical luminance. To address this shortcoming, we introduce the Laval Photometric Indoor HDR Dataset, the first large-scale photometrically calibrated dataset of high dynamic range 360deg panoramas. Our key contribution is the calibration of an existing, uncalibrated HDR Dataset. We do so by accurately capturing RAW bracketed exposures simultaneously with a professional photometric measurement device (chroma meter) for multiple scenes across a variety of lighting conditions. Using the resulting measurements, we establish the calibration coefficients to be applied to the HDR images. The resulting dataset is a rich representation of indoor scenes which displays a wide range of illuminance and color, and varied types of light sources. We exploit the dataset to introduce three novel tasks, where: per-pixel luminance, per-pixel color and planar illuminance can be predicted from a single input image. Finally, we also capture another smaller photometric dataset with a commercial 360deg camera, to experiment on generalization across cameras. We are optimistic that the release of our datasets and associated code will spark interest in physically accurate light estimation within the community. Dataset and code are available at https://lvsn.github.io/beyondthepixel/. + + + + Prototypical Mixing and Retrieval-Based Refinement for Label Noise-Resistant Image Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Prototypical_Mixing_and_Retrieval-Based_Refinement_for_Label_Noise-Resistant_Image_Retrieval_ICCV_2023_paper.pdf + Label noise is pervasive in real-world applications, which influences the optimization of neural network models. This paper investigates a realistic but understudied problem of image retrieval under label noise, which could lead to severe overfitting or memorization of noisy samples during optimization. 
Moreover, identifying noisy samples correctly is still a challenging problem for retrieval models. In this paper, we propose a novel approach called Prototypical Mixing and Retrieval-based Refinement (TITAN) for label noise-resistant image retrieval, which corrects label noise and mitigates the effects of the memorization simultaneously. Specifically, we first characterize numerous prototypes with Gaussian distributions in the hidden space, which would direct the Mixing procedure in providing synthesized samples. These samples are fed into a similarity learning framework with varying emphasis based on the prototypical structure to learn semantics with reduced overfitting. In addition, we retrieve comparable samples for each prototype from simple to complex, which refine noisy samples in an accurate and class-balanced manner. Comprehensive experiments on five benchmark datasets demonstrate the superiority of our proposed TITAN compared with various competing baselines. + + + + AccFlow: Backward Accumulation for Long-Range Optical Flow + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_AccFlow_Backward_Accumulation_for_Long-Range_Optical_Flow_ICCV_2023_paper.pdf + Recent deep learning-based optical flow estimators have exhibited impressive performance in generating local flows between consecutive frames. However, the estimation of long-range flows between distant frames, particularly under complex object deformation and large motion occlusion, remains a challenging task. One promising solution is to accumulate local flows explicitly or implicitly to obtain the desired long-range flow. Nevertheless, the accumulation errors and flow misalignment can hinder the effectiveness of this approach. This paper proposes a novel recurrent framework called AccFlow, which recursively backward accumulates local flows using a deformable module called as AccPlus. In addition, an adaptive blending module is designed along with AccPlus to alleviate the occlusion effect by backward accumulation and rectify the accumulation error. Notably, we demonstrate the superiority of backward accumulation over conventional forward accumulation, which to the best of our knowledge has not been explicitly established before. To train and evaluate the proposed AccFlow, we have constructed a large-scale high-quality dataset named CVO, which provides ground-truth optical flow labels between adjacent and distant frames. Extensive experiments validate the effectiveness of AccFlow in handling long-range optical flow estimation. Codes are available at https://github.com/mulns/AccFlow. + + + + Contrastive Model Adaptation for Cross-Condition Robustness in Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Bruggemann_Contrastive_Model_Adaptation_for_Cross-Condition_Robustness_in_Semantic_Segmentation_ICCV_2023_paper.pdf + Standard unsupervised domain adaptation methods adapt models from a source to a target domain using labeled source data and unlabeled target data jointly. In model adaptation, on the other hand, access to the labeled source data is prohibited, i.e., only the source-trained model and unlabeled target data are available. We investigate normal-to-adverse condition model adaptation for semantic segmentation, whereby image-level correspondences are available in the target domain. The target set consists of unlabeled pairs of adverse- and normal-condition street images taken at GPS-matched locations. 
Our method--CMA--leverages such image pairs to learn condition-invariant features via contrastive learning. In particular, CMA encourages features in the embedding space to be grouped according to their condition-invariant semantic content and not according to the condition under which respective inputs are captured. To obtain accurate cross-domain semantic correspondences, we warp the normal image to the viewpoint of the adverse image and leverage warp-confidence scores to create robust, aggregated features. With this approach, we achieve state-of-the-art semantic segmentation performance for model adaptation on several normal-to-adverse adaptation benchmarks, such as ACDC and Dark Zurich. We also evaluate CMA on a newly procured adverse-condition generalization benchmark and report favorable results compared to standard unsupervised domain adaptation methods, despite the comparative handicap of CMA due to source data inaccessibility. Code is available at https://github.com/brdav/cma. + + + + Creative Birds: Self-Supervised Single-View 3D Style Transfer + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Creative_Birds_Self-Supervised_Single-View_3D_Style_Transfer_ICCV_2023_paper.pdf + In this paper, we propose a novel method for single-view 3D style transfer that generates a unique 3D object with both shape and texture transfer. Our focus lies primarily on birds, a popular subject in 3D reconstruction, for which no existing single-view 3D transfer methods have been developed. The method we propose seeks to generate a 3D mesh shape and texture of a bird from two single-view images. To achieve this, we introduce a novel shape transfer generator that comprises a dual residual gated network (DRGNet), and a multi-layer perceptron (MLP). DRGNet extracts the features of source and target images using a shared coordinate gate unit, while the MLP generates spatial coordinates for building a 3D mesh. We also introduce a semantic UV texture transfer module that implements textural style transfer using semantic UV segmentation, which ensures consistency in the semantic meaning of the transferred regions. This module can be widely adapted to many existing approaches. Finally, our method constructs a novel 3D bird using a differentiable renderer. Experimental results on the CUB dataset verify that our method achieves state-of-the-art performance on the single-view 3D style transfer task. Code is available at https://github.com/wrk226/creative_birds. + + + + Boosting Novel Category Discovery Over Domains with Soft Contrastive Learning and All in One Classifier + http://openaccess.thecvf.com//content/ICCV2023/papers/Zang_Boosting_Novel_Category_Discovery_Over_Domains_with_Soft_Contrastive_Learning_ICCV_2023_paper.pdf + Unsupervised domain adaptation (UDA) has proven to be highly effective in transferring knowledge from a label-rich source domain to a label-scarce target domain. However, the presence of additional novel categories in the target domain has led to the development of open-set domain adaptation (ODA) and universal domain adaptation (UNDA). Existing ODA and UNDA methods treat all novel categories as a single, unified unknown class and attempt to detect it during training. However, we found that domain variance can lead to more significant view-noise in unsupervised data augmentation, which affects the effectiveness of contrastive learning (CL) and causes the model to be overconfident in novel category discovery. 
To address these issues, a framework named Soft-contrastive All-in-one Network (SAN) is proposed for ODA and UNDA tasks. SAN includes a novel data-augmentation-based soft contrastive learning (SCL) loss to fine-tune the backbone for feature transfer and a more human-intuitive classifier to improve new class discovery capability. The SCL loss weakens the adverse effects of the data augmentation view-noise problem, which is amplified in domain transfer tasks. The All-in-One (AIO) classifier overcomes the overconfidence problem of current mainstream closed-set and open-set classifiers. Visualization and ablation experiments demonstrate the effectiveness of the proposed innovations. Furthermore, extensive experimental results on ODA and UNDA show that SAN outperforms existing state-of-the-art methods. + + + + Search for or Navigate to? Dual Adaptive Thinking for Object Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Dang_Search_for_or_Navigate_to_Dual_Adaptive_Thinking_for_Object_ICCV_2023_paper.pdf + "Search for" or "Navigate to"? When we find a specific object in an unknown environment, the two choices always arise in our subconscious mind. Before we see the target, we search for the target based on prior experience. Once we have seen the target, we can navigate to it by remembering the target location. However, recent object navigation methods consider using object association mostly to enhance the "search for" phase while neglecting the importance of the "navigate to" phase. Therefore, this paper proposes a dual adaptive thinking (DAT) method that flexibly adjusts thinking strategies in different navigation stages. Dual thinking includes both search thinking according to the object association ability and navigation thinking according to the target location ability. To make navigation thinking more effective, we design a target-oriented memory graph (TOMG) (which stores historical target information) and a target-aware multi-scale aggregator (TAMSA) (which encodes the relative position of the target). We assess our method on the AI2-Thor and RoboTHOR datasets. Compared with state-of-the-art (SOTA) methods, our approach significantly raises the overall success rate (SR) and success weighted by path length (SPL) while enhancing the agent's performance in the "navigate to" phase. + + + + OmniZoomer: Learning to Move and Zoom in on Sphere at High-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_OmniZoomer_Learning_to_Move_and_Zoom_in_on_Sphere_at_ICCV_2023_paper.pdf + Omnidirectional images (ODIs) have become increasingly popular, as their large field-of-view (FoV) can offer viewers the chance to freely choose the view directions in immersive environments such as virtual reality. The Mobius transformation is typically employed to further provide the opportunity for movement and zoom on ODIs, but applying it at the image level often results in a blurry effect and an aliasing problem. In this paper, we propose a novel deep learning-based approach, called OmniZoomer, to incorporate the Mobius transformation into the network for movement and zoom on ODIs. By learning various transformed feature maps under different conditions, the network is enhanced to handle the increasing edge curvatures, which alleviates the blurry effect. Moreover, to address the aliasing problem, we propose two key components.
Firstly, to compensate for the lack of pixels for describing curves, we enhance the feature maps in the high-resolution (HR) space and calculate the transformed index map with a spatial index generation module. Secondly, considering that ODIs are inherently represented in the spherical space, we propose a spherical resampling module that combines the index map and HR feature maps to transform the feature maps for better spherical correlation. The transformed feature maps are decoded to output a zoomed ODI. Experiments show that our method can produce HR and high-quality ODIs with the flexibility to move and zoom in on the object of interest. The project page is available at http://vlislab22.github.io/OmniZoomer/. + + + + Knowing Where to Focus: Event-aware Transformer for Video Grounding + http://openaccess.thecvf.com//content/ICCV2023/papers/Jang_Knowing_Where_to_Focus_Event-aware_Transformer_for_Video_Grounding_ICCV_2023_paper.pdf + Recent DETR-based video grounding models directly predict moment timestamps without any hand-crafted components, such as a pre-defined proposal or non-maximum suppression, by learning moment queries. However, their input-agnostic moment queries inevitably overlook the intrinsic temporal structure of a video, providing limited positional information. In this paper, we formulate an event-aware dynamic moment query to enable the model to take the input-specific content and positional information of the video into account. To this end, we present two levels of reasoning: 1) event reasoning that captures distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning that fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and video-sentence representations to predict moment timestamps. Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks. The code is publicly available at https://github.com/jinhyunj/EaTR. + + + + Movement Enhancement toward Multi-Scale Video Feature Representation for Temporal Action Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Movement_Enhancement_toward_Multi-Scale_Video_Feature_Representation_for_Temporal_Action_ICCV_2023_paper.pdf + Boundary localization is a challenging problem in Temporal Action Detection (TAD), in which there are two main issues. First, the submergence of the movement feature, i.e., the movement information in a snippet is covered by the scene information. Second, the scale of action, that is, the proportion of action segments in the entire video, is considerably variable. In this work, we first design a Movement Enhance Module (MEM) to highlight the movement feature for better action localization, and then we propose a Scale Feature Pyramid Network (SFPN) to detect multi-scale actions in videos. For the Movement Enhance Module, a Movement Feature Extractor (MFE) is first designed to extract the movement feature. Then, we propose a Multi-Relation Enhance Module (MREM) to capture valuable information correlations both locally and temporally. For the Scale Feature Pyramid Network, we design a U-Shape Module to model actions at different scales; moreover, we design training and inference strategies for different scales, ensuring that each pyramid layer is only responsible for actions at a specific scale.
These two innovations are integrated as the Movement Enhance Network (MENet), and extensive experiments conducted on two challenging benchmarks demonstrate its effectiveness. MENet outperforms other representative TAD methods on ActivityNet-1.3 and THUMOS-14. + + + + Single Image Deblurring with Row-dependent Blur Magnitude + http://openaccess.thecvf.com//content/ICCV2023/papers/Ji_Single_Image_Deblurring_with_Row-dependent_Blur_Magnitude_ICCV_2023_paper.pdf + Image degradation often occurs during fast camera or object movements, regardless of the exposure modes: global shutter (GS) or rolling shutter (RS). Since these two exposure modes give rise to intrinsically different degradations, two restoration threads have been explored separately, i.e. motion deblurring of GS images and distortion correction of RS images, both of which are challenging restoration tasks, especially in the presence of a single input image. In this paper, we explore a novel in-between exposure mode, called global reset release (GRR) shutter, which produces GS-like blur but with row-dependent blur magnitude. We take advantage of this unique characteristic of GRR to explore the latent frames within a single image and restore a clear counterpart by only relying on these latent contexts. Specifically, we propose a residual spatially-compensated and spectrally-enhanced Transformer (RSS-T) block for row-dependent deblurring of a single GRR image. Its hierarchical positional encoding compensates global positional context of windows and enables order-awareness of the local pixel's position, along with a novel feed-forward network that simultaneously uses spatial and spectral information for gaining mixed global context. Extensive experimental results demonstrate that our method outperforms the state-of-the-art GS deblurring and RS correction methods on single GRR input. + + + + Deep Active Contours for Real-time 6-DoF Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Deep_Active_Contours_for_Real-time_6-DoF_Object_Tracking_ICCV_2023_paper.pdf + This paper solves the problem of real-time 6-DoF object tracking from an RGB video. Prior optimization-based methods optimize the object pose by aligning the projected model to the image based on handcrafted features, which are prone to suboptimal solutions. Recent learning-based methods use neural networks to predict the pose, which suffer from limited generalizability or computational efficiency. We propose a learning-based active contour model to make the best use of both worlds. Specifically, given an initial pose, we project the object model to the image plane to obtain the initial contour and use a lightweight network to predict how the contour should move to match the true object boundary, which provides the gradients to optimize the object pose. We also devise an efficient optimization algorithm to train our model end-to-end with pose supervision. Experimental results on semi-synthetic and real-world 6-DoF object tracking datasets demonstrate that our model outperforms state-of-the-art methods by a substantial margin in pose accuracy, while achieving real-time performance on mobile devices. Code is available on our project page: https://zju3dv.github.io/deep_ac/. 
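The Deep Active Contours abstract above describes projecting the object model under a candidate pose, letting a lightweight network predict how the projected contour should move toward the true object boundary, and backpropagating that signal to refine the pose. The sketch below only illustrates that optimization loop under toy assumptions: a hand-rolled pinhole projection and a hypothetical `predict_offsets` stand-in for the paper's learned contour network. It is not the authors' implementation.

```python
import torch

def project(points_3d, rot_vec, trans, f=500.0, cx=320.0, cy=240.0):
    """Pinhole projection with an axis-angle rotation (Rodrigues' formula)."""
    theta = rot_vec.norm() + 1e-8
    k = rot_vec / theta
    zero = torch.zeros((), dtype=rot_vec.dtype)
    K = torch.stack([
        torch.stack([zero, -k[2], k[1]]),
        torch.stack([k[2], zero, -k[0]]),
        torch.stack([-k[1], k[0], zero]),
    ])
    R = torch.eye(3) + torch.sin(theta) * K + (1.0 - torch.cos(theta)) * (K @ K)
    cam = points_3d @ R.T + trans                     # points in the camera frame, (N, 3)
    uv = cam[:, :2] / cam[:, 2:3]                     # perspective division
    return torch.stack([f * uv[:, 0] + cx, f * uv[:, 1] + cy], dim=-1)

def predict_offsets(contour_2d, boundary_2d):
    # Hypothetical stand-in for the learned network: per contour point, a 2D step
    # toward the true boundary (here simply the residual to a known target).
    return (boundary_2d - contour_2d).detach()

points = torch.randn(64, 3) * 0.05 + torch.tensor([0.0, 0.0, 1.0])   # toy 3D boundary points
gt_boundary = project(points, torch.zeros(3), torch.zeros(3))        # "observed" contour

rot = torch.tensor([0.02, -0.01, 0.03], requires_grad=True)          # perturbed initial pose
trans = torch.tensor([0.01, -0.02, 0.05], requires_grad=True)
opt = torch.optim.Adam([rot, trans], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    contour = project(points, rot, trans)
    offsets = predict_offsets(contour, gt_boundary)
    # Move the projected contour by the predicted step; gradients flow to the pose.
    loss = ((contour - (contour.detach() + offsets)) ** 2).mean()
    loss.backward()
    opt.step()
print(rot.detach(), trans.detach())   # should drift toward zero pose error
```

In the paper the offsets come from a trained network and the optimization is run in real time on mobile hardware; the toy above only shows how predicted contour motion can drive a differentiable pose update.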
+ + + + + + Online Class Incremental Learning on Stochastic Blurry Task Boundary via Mask and Visual Prompt Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Moon_Online_Class_Incremental_Learning_on_Stochastic_Blurry_Task_Boundary_via_ICCV_2023_paper.pdf + Continual learning aims to learn a model from a continuous stream of data, but it mainly assumes a fixed number of data and tasks with clear task boundaries. However, in real-world scenarios, the number of input data and tasks is constantly changing in a statistical way, not a static way. Although recently introduced incremental learning scenarios having blurry task boundaries somewhat address the above issues, they still do not fully reflect the statistical properties of real-world situations because of the fixed ratio of disjoint and blurry samples. In this paper, we propose a new Stochastic incremental Blurry task boundary scenario, called Si-Blurry, which reflects the stochastic properties of the real-world. We find that there are two major challenges in the Si-Blurry scenario: (1) inter- and intra-task forgettings and (2) class imbalance problem. To alleviate them, we introduce Mask and Visual Prompt tuning (MVP). In MVP, to address the inter- and intra-task forgetting issues, we propose a novel instance-wise logit masking and contrastive visual prompt tuning loss. Both of them help our model discern the classes to be learned in the current batch. It results in consolidating the previous knowledge. In addition, to alleviate the class imbalance problem, we introduce a new gradient similarity-based focal loss and adaptive feature scaling to ease overfitting to the major classes and underfitting to the minor classes. Extensive experiments show that our proposed MVP significantly outperforms the existing state-of-the-art methods in our challenging Si-Blurry scenario. + + + + SCANet: Scene Complexity Aware Network for Weakly-Supervised Video Moment Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Yoon_SCANet_Scene_Complexity_Aware_Network_for_Weakly-Supervised_Video_Moment_Retrieval_ICCV_2023_paper.pdf + Video moment retrieval aims to localize moments in video corresponding to a given language query. To avoid the expensive cost of annotating the temporal moments, weakly-supervised VMR (wsVMR) systems have been studied. For such systems, generating a number of proposals as moment candidates and then selecting the most appropriate proposal has been a popular approach. These proposals are assumed to contain many distinguishable scenes in a video as candidates. However, existing proposals of wsVMR systems do not respect the varying numbers of scenes in each video, where the proposals are heuristically determined irrespective of the video. We argue that the retrieval system should be able to counter the complexities caused by varying numbers of scenes in each video. To this end, we present a novel concept of a retrieval system referred to as Scene Complexity Aware Network (SCANet), which measures the `scene complexity' of multiple scenes in each video and generates adaptive proposals responding to variable complexities of scenes in each video. Experimental results on three retrieval benchmarks (i.e. Charades-STA, ActivityNet, TVR) achieve state-of-the-art performances and demonstrate the effectiveness of incorporating the scene complexity. 
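The SCANet abstract above hinges on measuring a per-video "scene complexity" and letting it control how many moment proposals are generated. The snippet below is an illustrative guess at that idea, not SCANet's modules: the complexity score (mean feature change between consecutive frames) and the sliding-window proposal scheme are both assumptions made here for demonstration.

```python
import numpy as np

def scene_complexity(frame_feats):
    """Mean cosine distance between consecutive frame features, roughly in [0, 1]."""
    a, b = frame_feats[:-1], frame_feats[1:]
    cos = (a * b).sum(-1) / (np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + 1e-8)
    return float(np.clip(1.0 - cos, 0.0, 2.0).mean())

def adaptive_proposals(num_frames, complexity, base=4, max_extra=28):
    """More (and finer-grained) sliding-window proposals for more complex videos."""
    n = base + int(round(max_extra * min(complexity, 1.0)))
    proposals = []
    for k in range(1, n + 1):
        length = max(1, int(num_frames * k / (n + 1)))
        for start in range(0, num_frames - length + 1, max(1, length // 2)):
            proposals.append((start, start + length))
    return sorted(set(proposals))

feats = np.random.randn(120, 256).astype(np.float32)   # toy per-frame features
c = scene_complexity(feats)
print(len(adaptive_proposals(120, c)), "proposals for complexity", round(c, 3))
```

The point of the sketch is only the control flow: videos whose features change more get a denser proposal set, which is the adaptivity the abstract argues for.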
+ + + + Neural Interactive Keypoint Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Neural_Interactive_Keypoint_Detection_ICCV_2023_paper.pdf + This work proposes an end-to-end neural interactive keypoint detection framework named Click-Pose, which can reduce the labeling cost of 2D keypoint annotation by more than 10 times compared with manual-only annotation. Click-Pose explores how user feedback can cooperate with a neural keypoint detector to correct the predicted keypoints in an interactive way for a faster and more effective annotation process. Specifically, we design the pose error modeling strategy that inputs the ground truth pose combined with four typical pose errors into the decoder and trains the model to reconstruct the correct poses, which enhances the self-correction ability of the model. Then, we attach an interactive human-feedback loop that allows receiving users' clicks to correct one or several predicted keypoints and iteratively utilizes the decoder to update all other keypoints with a minimum number of clicks (NoC) for efficient annotation. We validate Click-Pose in in-domain, out-of-domain scenes, and a new task of keypoint adaptation. For annotation, Click-Pose only needs 1.97 and 6.45 NoC@95 (at precision 95%) on COCO and Human-Art, reducing annotation effort by 31.4% and 36.3%, respectively, compared with the SOTA model (ViTPose) with manual correction. Besides, without user clicks, Click-Pose surpasses the previous end-to-end model by 1.4 AP on COCO and 3.0 AP on Human-Art. + + + + Knowledge-Aware Prompt Tuning for Generalizable Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Kan_Knowledge-Aware_Prompt_Tuning_for_Generalizable_Vision-Language_Models_ICCV_2023_paper.pdf + Pre-trained vision-language models, e.g., CLIP, working with manually designed prompts have demonstrated great effectiveness in transfer learning. Recently, learnable prompts have achieved state-of-the-art performance; however, they are prone to overfitting to seen classes while failing to generalize to unseen classes. In this paper, we propose a Knowledge-Aware Prompt Tuning (KAPT) framework for vision-language models. Our approach takes inspiration from human intelligence in which external knowledge is usually incorporated into recognizing novel categories of objects. Specifically, we design two complementary types of knowledge-aware prompts for the text encoder to leverage the distinctive characteristics of category-related external knowledge. The discrete prompt extracts the key information from descriptions of an object category, and the learned continuous prompt captures overall contexts. We further design an adaptation head for the visual encoder to aggregate salient attentive visual cues, which establishes discriminative and task-aware visual representations. We conduct extensive experiments on 11 widely-used benchmark datasets and the results verify its effectiveness in few-shot image classification, especially in generalizing to unseen categories. Compared with the state-of-the-art CoCoOp method, KAPT exhibits favorable performance and achieves an absolute gain of 3.22% on new classes and 2.57% in terms of harmonic mean. + + + + Leveraging Inpainting for Single-Image Shadow Removal + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Leveraging_Inpainting_for_Single-Image_Shadow_Removal_ICCV_2023_paper.pdf + Fully-supervised shadow removal methods achieve the best restoration qualities on public datasets but still generate some shadow remnants.
One of the reasons is the lack of large-scale shadow & shadow-free image pairs. Unsupervised methods can alleviate the issue but their restoration qualities are much lower than those of fully-supervised methods. In this work, we find that pretraining shadow removal networks on the image inpainting dataset can reduce the shadow remnants significantly: a naive encoder-decoder network gets competitive restoration quality w.r.t. the state-of-the-art methods via only 10% shadow & shadow-free image pairs. After analyzing networks with/without inpainting pretraining via the information stored in the weight (IIW), we find that inpainting pretraining improves restoration quality in non-shadow regions and enhances the generalization ability of networks significantly. Additionally, shadow removal fine-tuning enables networks to fill in the details of shadow regions. Inspired by these observations we formulate shadow removal as an adaptive fusion task that takes advantage of both shadow removal and image inpainting. Specifically, we develop an adaptive fusion network consisting of two encoders, an adaptive fusion block, and a decoder. The two encoders are responsible for extracting the features from the shadow image and the shadow-masked image respectively. The adaptive fusion block is responsible for combining these features in an adaptive manner. Finally, the decoder converts the adaptive fused features to the desired shadow-free result. The extensive experiments show that our method empowered with inpainting outperforms all state-of-the-art methods. + + + + Accurate 3D Face Reconstruction with Facial Component Tokens + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Accurate_3D_Face_Reconstruction_with_Facial_Component_Tokens_ICCV_2023_paper.pdf + Accurately reconstructing 3D faces from monocular images and videos is crucial for various applications, such as digital avatar creation. However, the current deep learning-based methods face significant challenges in achieving accurate reconstruction with disentangled facial parameters and ensuring temporal stability in single-frame methods for 3D face tracking on video data. In this paper, we propose TokenFace, a transformer-based monocular 3D face reconstruction model. TokenFace uses separate tokens for different facial components to capture information about different facial parameters and employs temporal transformers to capture temporal information from video data. This design can naturally disentangle different facial components and is flexible to both 2D and 3D training data. Trained on hybrid 2D and 3D data, our model shows its power in accurately reconstructing faces from images and producing stable results for video data. Experimental results on popular benchmarks NoW and Stirling demonstrate that TokenFace achieves state-of-the-art performance, outperforming existing methods on all metrics by a large margin. + + + + Implicit Neural Representation for Cooperative Low-light Image Enhancement + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Implicit_Neural_Representation_for_Cooperative_Low-light_Image_Enhancement_ICCV_2023_paper.pdf + The following three factors restrict the application of existing low-light image enhancement methods: unpredictable brightness degradation and noise, inherent gap between metric-favorable and visual-friendly versions, and the limited paired training data. 
To address these limitations, we propose an implicit Neural Representation method for Cooperative low-light image enhancement, dubbed NeRCo. It robustly recovers perceptually friendly results in an unsupervised manner. Concretely, NeRCo unifies the diverse degradation factors of real-world scenes with a controllable fitting function, leading to better robustness. In addition, for the output results, we introduce semantic-orientated supervision with priors from the pre-trained vision-language model. Instead of merely following reference images, it encourages results to meet subjective expectations, finding more visually friendly solutions. Further, to ease the reliance on paired data and reduce solution space, we develop a dual-closed-loop constrained enhancement module. It is trained cooperatively with other affiliated modules in a self-supervised manner. Finally, extensive experiments demonstrate the robustness and superior effectiveness of our proposed NeRCo. Our code is available at https://github.com/Ysz2022/NeRCo. + + + + ReLeaPS : Reinforcement Learning-based Illumination Planning for Generalized Photometric Stereo + http://openaccess.thecvf.com//content/ICCV2023/papers/Chan_ReLeaPS__Reinforcement_Learning-based_Illumination_Planning_for_Generalized_Photometric_Stereo_ICCV_2023_paper.pdf + Illumination planning in photometric stereo aims to find a balance between surface normal estimation accuracy and image capturing efficiency by selecting optimal light configurations. It depends on factors such as the unknown shape and general reflectance of the target object, global illumination, and the choice of photometric stereo backbones, which are too complex to be handled by existing methods based on handcrafted illumination planning rules. This paper proposes a learning-based illumination planning method that jointly considers these factors via integrating a neural network and a generalized image formation model. As it is impractical to supervise illumination planning due to the enormous search space for ground truth light configurations, we formulate illumination planning using reinforcement learning, which explores the light space in a photometric stereo-aware and reward-driven manner. Experiments on synthetic and real-world datasets demonstrate that photometric stereo under the 20-light configurations from our method is comparable to, or even surpasses, that of using lights from all available directions. + + + + Learning Foresightful Dense Visual Affordance for Deformable Object Manipulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Learning_Foresightful_Dense_Visual_Affordance_for_Deformable_Object_Manipulation_ICCV_2023_paper.pdf + Understanding and manipulating deformable objects (e.g., ropes and fabrics) is an essential yet challenging task with broad applications. Difficulties come from complex states and dynamics, diverse configurations and high-dimensional action space of deformable objects. Besides, the manipulation tasks usually require multiple steps to accomplish, and greedy policies may easily lead to locally optimal states. Existing studies usually tackle this problem using reinforcement learning or imitating expert demonstrations, with limitations in modeling complex states or requiring hand-crafted expert policies.
In this paper, we study deformable object manipulation using dense visual affordance, with generalization towards diverse states, and propose a novel kind of foresightful dense affordance, which avoids local optima by estimating states' values for long-term manipulation. We propose a framework for learning this representation, with novel designs such as multi-stage stable learning and efficient self-supervised data collection without experts. Experiments demonstrate the superiority of our proposed foresightful dense affordance. + + + + CiteTracker: Correlating Image and Text for Visual Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_CiteTracker_Correlating_Image_and_Text_for_Visual_Tracking_ICCV_2023_paper.pdf + Existing visual tracking methods typically take an image patch as the reference of the target to perform tracking. However, a single image patch cannot provide a complete and precise concept of the target object as images are limited in their ability to abstract and can be ambiguous, which makes it difficult to track targets with drastic variations. In this paper, we propose the CiteTracking algorithm to enhance target modeling and inference in visual tracking by connecting images and text. Specifically, we develop a text generation module to convert the target image patch into a descriptive text containing its class and attribute information, providing a comprehensive reference point for the target. In addition, a dynamic description module is designed to adapt to target variations for more effective target representation. We then associate the target description and the search image using an attention-based correlation module to generate the correlated features for target state reference. Extensive experiments on five diverse datasets are conducted to evaluate the proposed algorithm and the favorable performance against the state-of-the-art methods demonstrates the effectiveness of the proposed tracking method. The source code and trained models will be made available to the public. + + + + PHRIT: Parametric Hand Representation with Implicit Template + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_PHRIT_Parametric_Hand_Representation_with_Implicit_Template_ICCV_2023_paper.pdf + We propose PHRIT, a novel approach for parametric hand mesh modeling with an implicit template that combines the advantages of both parametric meshes and implicit representations. Our method represents deformable hand shapes using signed distance fields (SDFs) with part-based shape priors, utilizing a deformation field to execute the deformation. The model offers efficient high-fidelity hand reconstruction by deforming the canonical template at infinite resolution. Additionally, it is fully differentiable and can be easily used in hand modeling since it can be driven by the skeleton and shape latent codes. We evaluate PHRIT on multiple downstream tasks, including skeleton-driven hand reconstruction, shapes from point clouds, and single-view 3D reconstruction, demonstrating that our approach achieves realistic and immersive hand modeling with state-of-the-art performance. + + + + BEVPlace: Learning LiDAR-based Place Recognition using Bird's Eye View Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_BEVPlace_Learning_LiDAR-based_Place_Recognition_using_Birds_Eye_View_Images_ICCV_2023_paper.pdf + Place recognition is a key module for long-term SLAM systems. 
Current LiDAR-based place recognition methods usually use representations of point clouds such as unordered points or range images. These methods achieve high recall rates of retrieval, but their performance may degrade in the case of view variation or scene changes. In this work, we explore the potential of a different representation in place recognition, i.e. bird's eye view (BEV) images. We validate that, without any delicate design, a simple ResNet trained on BEV images achieves comparable performance with the state-of-the-art place recognition methods in scenes of slight viewpoint changes. For more robust place recognition, we propose a rotation-invariant network called BEVPlace. We use group convolution to extract rotation-equivariant local features from the images and NetVLAD for global feature aggregation. In addition, we observe that the distance between BEV features is correlated with the geometry distance of point clouds. Based on the observation, we develop a method to estimate the position of the query cloud, extending the usage of place recognition. The experiments conducted on large-scale public datasets show that our method 1) achieves state-of-the-art performance in terms of recall rates, 2) is robust to view changes, 3) shows strong generalization ability, and 4) can estimate the positions of query point clouds. Source codes are publicly available at https://github.com/zjuluolun/BEVPlace. + + + + TrajPAC: Towards Robustness Verification of Pedestrian Trajectory Prediction Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_TrajPAC_Towards_Robustness_Verification_of_Pedestrian_Trajectory_Prediction_Models_ICCV_2023_paper.pdf + Robust pedestrian trajectory forecasting is crucial to developing safe autonomous vehicles. Although previous works have studied adversarial robustness in the context of trajectory forecasting, some significant issues remain unaddressed. In this work, we try to tackle these crucial problems. Firstly, the previous definitions of robustness in trajectory prediction are ambiguous. We thus provide formal definitions for two kinds of robustness, namely label robustness and pure robustness. Secondly, as previous works fail to consider robustness about all points in a disturbance interval, we utilise a probably approximately correct (PAC) framework for robustness verification. Additionally, this framework can not only identify potential counterexamples, but also provides interpretable analyses of the original methods. Our approach is applied using a prototype tool named TrajPAC. With TrajPAC, we evaluate the robustness of four state-of-the-art trajectory prediction models -- Trajectron++, MemoNet, AgentFormer, and MID -- on trajectories from five scenes of the ETH/UCY dataset and scenes of the Stanford Drone Dataset. Using our framework, we also experimentally study various factors that could influence robustness performance. + + + + Learning Point Cloud Completion without Complete Point Clouds: A Pose-Aware Approach + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Learning_Point_Cloud_Completion_without_Complete_Point_Clouds_A_Pose-Aware_ICCV_2023_paper.pdf + Point cloud completion is to restore complete 3D scenes and objects from incomplete observations or limited sensor data. Existing fully-supervised methods rely on paired datasets of incomplete and complete point clouds, which are labor-intensive to obtain. Unpaired methods have been proposed, but still require a set of complete point clouds as a reference. 
As a remedy, in this paper, we propose a novel point cloud completion framework without using any complete point cloud at all. Our main idea is to generate multiple incomplete point clouds of various poses and integrate them into a complete point cloud. We train our framework based on cycle consistency, to generate an incomplete point cloud such that it 1) shares the same object as the input incomplete point cloud and 2) corresponds to an arbitrarily given pose. In addition, we devise a novel projection method conditioned on pose to gather visible features from a volumetric feature extracted by an encoder. Extensive experiments demonstrate that the proposed method achieves results comparable to or better than those of existing unpaired methods. Further, we show that our method can also be applied to real incomplete point clouds. + + + + Frequency Guidance Matters in Few-Shot Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_Frequency_Guidance_Matters_in_Few-Shot_Learning_ICCV_2023_paper.pdf + Few-shot classification aims to learn a discriminative feature representation to recognize unseen classes with few labeled support samples. While most few-shot learning methods focus on exploiting the spatial information of image samples, frequency representation has also been proven essential in classification tasks. In this paper, we investigate the effect of different frequency components on few-shot learning tasks. To enhance the performance and generalizability of few-shot methods, we propose a novel Frequency-Guided Few-shot Learning framework (dubbed FGFL), which leverages the task-specific frequency components to adaptively mask the corresponding image information, with a novel multi-level metric learning strategy including a triplet loss among original, masked and unmasked images as well as a contrastive loss between masked and original support and query sets to exploit more discriminative information. Extensive experiments on four benchmarks under several few-shot scenarios, i.e., standard, cross-dataset, cross-domain, and coarse-to-fine annotated classification, are conducted. Both qualitative and quantitative results show that our proposed FGFL scheme can attend to the class-discriminative frequency components, thus integrating that information towards more effective and generalizable few-shot learning. + + + + Spherical Space Feature Decomposition for Guided Depth Map Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Spherical_Space_Feature_Decomposition_for_Guided_Depth_Map_Super-Resolution_ICCV_2023_paper.pdf + Guided depth map super-resolution (GDSR), as a hot topic in multi-modal image processing, aims to upsample low-resolution (LR) depth maps with additional information contained in high-resolution (HR) RGB images from the same scene. The critical step of this task is to effectively extract domain-shared and domain-private RGB/depth features. In addition, three detailed issues, namely blurry edges, noisy surfaces, and over-transferred RGB texture, need to be addressed. In this paper, we propose the Spherical Space feature Decomposition Network (SSDNet) to solve the above issues. To better model cross-modality features, Restormer block-based RGB/depth encoders are employed for extracting local-global features. Then, the extracted features are mapped to the spherical space to complete the separation of private features and the alignment of shared features. Shared features of RGB are fused with the depth features to complete the GDSR task.
Subsequently, a spherical contrast refinement (SCR) module is proposed to further address the detail issues. Patches that are classified according to imperfect categories are input into the SCR module, where the patch features are pulled closer to the ground truth and pushed away from the corresponding imperfect samples in the spherical feature space via contrastive learning. Extensive experiments demonstrate that our method can achieve state-of-the-art results on four test datasets, as well as successfully generalize to real-world scenes. The code is available at https://github.com/Zhaozixiang1228/GDSR-SSDNet. + + + + Tiled Multiplane Images for Practical 3D Photography + http://openaccess.thecvf.com//content/ICCV2023/papers/Khan_Tiled_Multiplane_Images_for_Practical_3D_Photography_ICCV_2023_paper.pdf + The task of synthesizing novel views from a single image has useful applications in virtual reality and mobile computing, and a number of approaches to the problem have been proposed in recent years. A Multiplane Image (MPI) estimates the scene as a stack of RGBA layers, and can model complex appearance effects, anti-alias depth errors and synthesize soft edges better than methods that use textured meshes or layered depth images. And unlike neural radiance fields, an MPI can be efficiently rendered on graphics hardware. However, MPIs are highly redundant and require a large number of depth layers to achieve plausible results. Based on the observation that the depth complexity in local image regions is lower than that over the entire image, we split an MPI into many small, tiled regions, each with only a few depth planes. We call this representation a Tiled Multiplane Image (TMPI). We propose a method for generating a TMPI with adaptive depth planes for single-view 3D photography in the wild. Our synthesized results are comparable to state-of-the-art single-view MPI methods while having lower computational overhead. + + + + HTML: Hybrid Temporal-scale Multimodal Learning Framework for Referring Video Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_HTML_Hybrid_Temporal-scale_Multimodal_Learning_Framework_for_Referring_Video_Object_ICCV_2023_paper.pdf + Referring Video Object Segmentation (RVOS) aims to segment the object instance from a given video according to the textual description of this object. However, in the open world, the object descriptions are often diversified in content and flexible in length. This leads to the key difficulty in RVOS, i.e., various descriptions of different objects correspond to different temporal scales in the video, which is ignored by most existing approaches with a single stride of frame sampling. To tackle this problem, we propose a concise Hybrid Temporal-scale Multimodal Learning (HTML) framework, which can effectively align lingual and visual features to discover core object semantics in the video, by learning multimodal interaction hierarchically from different temporal scales. More specifically, we introduce a novel inter-scale multimodal perception module, where the language queries dynamically interact with visual features across temporal scales.
It can effectively reduce complex object confusion by passing video context among different scales. Finally, we conduct extensive experiments on the widely used benchmarks, including Ref-Youtube-VOS, Ref-DAVIS17, A2D-Sentences and JHMDB-Sentences, where our HTML achieves state-of-the-art performance on all these datasets. + + + + PointDC: Unsupervised Semantic Segmentation of 3D Point Clouds via Cross-Modal Distillation and Super-Voxel Clustering + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_PointDC_Unsupervised_Semantic_Segmentation_of_3D_Point_Clouds_via_Cross-Modal_ICCV_2023_paper.pdf + Semantic segmentation of point clouds usually requires exhaustive human annotation effort; hence, learning from unlabeled or weaker forms of annotation has attracted wide attention as a challenging topic. In this paper, we make the first attempt at fully unsupervised semantic segmentation of point clouds, which aims to delineate semantically meaningful objects without any form of annotations. Previous unsupervised pipelines for 2D images fail on this point cloud task, due to: 1) Clustering Ambiguity caused by limited magnitude of data and imbalanced class distribution; 2) Irregularity Ambiguity caused by the irregular sparsity of point clouds. Therefore, we propose a novel framework, PointDC, which is composed of two steps that handle the aforementioned problems respectively: Cross-Modal Distillation (CVD) and Super-Voxel Clustering (SVC). In the first stage of CVD, multi-view visual features are back-projected to the 3D space and aggregated to a unified point feature to distill the training of the point representation. In the second stage of SVC, the point features are aggregated to super-voxels and then fed to the iterative clustering process for excavating semantic classes. PointDC yields a significant improvement over the prior state-of-the-art unsupervised methods on both the ScanNet v2 (+18.4 mIOU) and S3DIS (+11.5 mIOU) semantic segmentation benchmarks. + + + + MV-Map: Offboard HD-Map Generation with Multi-view Consistency + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_MV-Map_Offboard_HD-Map_Generation_with_Multi-view_Consistency_ICCV_2023_paper.pdf + While bird's-eye-view (BEV) perception models can be useful for building high-definition maps (HD-Maps) with less human labor, their results are often unreliable and demonstrate noticeable inconsistencies in the predicted HD-Maps from different viewpoints. This is because BEV perception is typically set up in an "onboard" manner, which restricts the computation and consequently prevents algorithms from reasoning over multiple views simultaneously. This paper overcomes these limitations and advocates a more practical "offboard" HD-Map generation setup that removes the computation constraints, based on the fact that HD-Maps are commonly reusable infrastructures built offline in data centers. To this end, we propose a novel offboard pipeline called MV-Map that capitalizes on multi-view consistency and can handle an arbitrary number of frames with the key design of a "region-centric" framework. In MV-Map, the target HD-Maps are created by aggregating all the frames of onboard predictions, weighted by the confidence scores assigned by an "uncertainty network." To further enhance multi-view consistency, we augment the uncertainty network with the global 3D structure optimized by a voxelized neural radiance field (Voxel-NeRF).
Extensive experiments on nuScenes show that our MV-Map significantly improves the quality of HD-Maps, further highlighting the importance of offboard methods for HD-Map generation. + + + + Multi-view Self-supervised Disentanglement for General Image Denoising + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Multi-view_Self-supervised_Disentanglement_for_General_Image_Denoising_ICCV_2023_paper.pdf + With its significant performance improvements, the deep learning paradigm has become a standard tool for modern image denoisers. While promising performance has been shown on seen noise distributions, existing approaches often suffer from generalisation to unseen noise types or general and real noise. It is understandable as the model is designed to learn paired mapping (e.g. from a noisy image to its clean version). In this paper, we instead propose to learn to disentangle the noisy image, under the intuitive assumption that different corrupted versions of the same clean image share a common latent space. A self-supervised learning framework is proposed to achieve the goal, without looking at the latent clean image. By taking two different corrupted versions of the same image as input, the proposed Multi-view Self-supervised Disentanglement (MeD) approach learns to disentangle the latent clean features from the corruptions and recover the clean image consequently. Extensive experimental analysis on both synthetic and real noise shows the superiority of the proposed method over prior self-supervised approaches, especially on unseen novel noise types. On real noise, the proposed method even outperforms its supervised counterparts by over 3dB. + + + + SHERF: Generalizable Human NeRF from a Single Image + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_SHERF_Generalizable_Human_NeRF_from_a_Single_Image_ICCV_2023_paper.pdf + Existing Human NeRF methods for reconstructing 3D humans typically rely on multiple 2D images from multi-view cameras or monocular videos captured from fixed camera views. However, in real-world scenarios, human images are often captured from random camera angles, presenting challenges for high-quality 3D human reconstruction. In this paper, we propose SHERF, the first generalizable Human NeRF model for recovering animatable 3D humans from a single input image. SHERF extracts and encodes 3D human representations in canonical space, enabling rendering and animation from free views and poses. To achieve high-fidelity novel view and pose synthesis, the encoded 3D human representations should capture both global appearance and local fine-grained textures. To this end, we propose a bank of 3D-aware hierarchical features, including global, point-level, and pixel-aligned features, to facilitate informative encoding. Global features enhance the information extracted from the single input image and complement the information missing from the partial 2D observation. Point-level features provide strong clues of 3D human structure, while pixel-aligned features preserve more fine-grained details. To effectively integrate the 3D-aware hierarchical feature bank, we design a feature fusion transformer. Extensive experiments on THuman, RenderPeople, ZJU_MoCap, and HuMMan datasets demonstrate that SHERF achieves state-of-the-art performance, with better generalizability for novel view and pose synthesis. 
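The SHERF abstract above centers on a bank of 3D-aware hierarchical features (global, point-level, and pixel-aligned) fused by a transformer before NeRF decoding. The toy sketch below only illustrates how such a bank could be gathered and fused for a batch of 3D query points; the shapes, the nearest-anchor lookup, and the single attention layer are assumptions for demonstration, not the SHERF architecture.

```python
import torch
import torch.nn.functional as F

C = 32
feat_map = torch.randn(1, C, 16, 16)            # 2D feature map of the single input image
anchor_xyz = torch.randn(128, 3)                # sparse 3D anchors (e.g. body-model vertices)
anchor_feat = torch.randn(128, C)
query_xyz = torch.randn(64, 3)                  # 3D query points to shade
query_uv = torch.rand(64, 2) * 2 - 1            # their (assumed known) 2D projections in [-1, 1]

# Global feature: one pooled vector shared by all queries.
global_feat = feat_map.mean(dim=(2, 3)).expand(64, C)
# Point-level feature: nearest sparse anchor in 3D.
nn_idx = torch.cdist(query_xyz, anchor_xyz).argmin(dim=1)
point_feat = anchor_feat[nn_idx]
# Pixel-aligned feature: bilinear sample at the query's projection.
grid = query_uv.view(1, 64, 1, 2)
pixel_feat = F.grid_sample(feat_map, grid, align_corners=True).reshape(C, 64).T

# Fuse the three feature types per query with a single attention layer.
attn = torch.nn.MultiheadAttention(embed_dim=C, num_heads=4, batch_first=True)
tokens = torch.stack([global_feat, point_feat, pixel_feat], dim=1)   # (64, 3, C)
fused, _ = attn(pixel_feat.unsqueeze(1), tokens, tokens)             # (64, 1, C)
print(fused.squeeze(1).shape)   # per-point fused feature, e.g. input to a NeRF MLP
```

The intent is only to make the "hierarchical feature bank plus fusion transformer" idea concrete; the real model learns all of these components end-to-end.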
+ + + + MVPSNet: Fast Generalizable Multi-view Photometric Stereo + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_MVPSNet_Fast_Generalizable_Multi-view_Photometric_Stereo_ICCV_2023_paper.pdf + We propose a fast and generalizable solution to Multiview Photometric Stereo (MVPS), called MVPSNet. The key to our approach is a feature extraction network that effectively combines images from the same view captured under multiple lighting conditions to extract geometric features from shading cues for stereo matching. We demonstrate these features, termed 'Light Aggregated Feature Maps' (LAFM), are effective for feature matching even in textureless regions, where traditional multi-view stereo methods often fail. Our method produces similar reconstruction results to PS-NeRF, a state-of-the-art MVPS method that optimizes a neural network per-scene, while being 411x faster (105 seconds vs. 12 hours) in inference. Additionally, we introduce a new synthetic dataset for MVPS, sMVPS, which is shown to be effective for training a generalizable MVPS method. + + + + Human from Blur: Human Pose Tracking from Blurry Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Human_from_Blur_Human_Pose_Tracking_from_Blurry_Images_ICCV_2023_paper.pdf + We propose a method to estimate 3D human poses from substantially blurred images. The key idea is to tackle the inverse problem of image deblurring by modeling the forward problem with a 3D human model, a texture map, and a sequence of poses to describe human motion. The blurring process is then modeled by a temporal image aggregation step. Using a differentiable renderer, we can solve the inverse problem by backpropagating the pixel-wise reprojection error to recover the best human motion representation that explains a single or multiple input images. Since the image reconstruction loss alone is insufficient, we present additional regularization terms. To the best of our knowledge, we present the first method to tackle this problem. Our method consistently outperforms other methods on significantly blurry inputs since they lack one or multiple key functionalities that our method unifies, i.e. image deblurring with sub-frame accuracy and explicit 3D modeling of non-rigid human motion. + + + + Uni-3D: A Universal Model for Panoptic 3D Scene Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Uni-3D_A_Universal_Model_for_Panoptic_3D_Scene_Reconstruction_ICCV_2023_paper.pdf + Performing holistic 3D scene understanding from a single-view observation, involving generating instance shapes and 3D scene segmentation, is a long-standing challenge. Prevailing works either focus only on geometry or segmentation, or model the task in two folds by separate modules, whose results are merged later to form the final prediction. Inspired by recent advances in 2D vision that unify image segmentation and detection by Transformer-based models, we present Uni-3D, a holistic 3D scene parsing/reconstruction system for a single RGB image. Uni-3D features a universal model with query-based representations for predicting segments of both object instances and scene layout. In Uni-3D, we also introduce a single Transformer for 2D depth-aware panoptic segmentation, which offers queries that serve as strong shape priors in 3D. Uni-3D seamlessly integrates 2D and 3D in its architecture and it outperforms previous methods significantly. 
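The Uni-3D abstract above relies on query-based representations, where a fixed set of learned queries predicts both object instances and scene layout. The snippet below is a tiny, assumption-laden sketch of that general idea in the spirit of MaskFormer-style heads, not the actual Uni-3D architecture: each query attends to image features and then emits a class logit vector and a per-pixel mask logit map.

```python
import torch
import torch.nn as nn

class QueryHead(nn.Module):
    def __init__(self, num_queries=16, dim=64, num_classes=10):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.cls_head = nn.Linear(dim, num_classes + 1)   # +1 for a "no object" slot
        self.mask_embed = nn.Linear(dim, dim)

    def forward(self, feat):                              # feat: (B, dim, H, W) backbone features
        b, d, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)          # (B, H*W, dim)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)   # (B, Q, dim)
        q, _ = self.attn(q, tokens, tokens)               # queries gather image context
        cls_logits = self.cls_head(q)                     # (B, Q, num_classes + 1)
        mask_logits = torch.einsum("bqd,bdhw->bqhw", self.mask_embed(q), feat)
        return cls_logits, mask_logits

head = QueryHead()
cls_logits, mask_logits = head(torch.randn(2, 64, 32, 32))
print(cls_logits.shape, mask_logits.shape)    # (2, 16, 11) and (2, 16, 32, 32)
```

In a unified model of this kind, some queries would be supervised as object instances and others as layout segments; the paper additionally lifts these 2D predictions into 3D, which is omitted here.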
+ + + + Full-Body Articulated Human-Object Interaction + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Full-Body_Articulated_Human-Object_Interaction_ICCV_2023_paper.pdf + Fine-grained capture of 3D Human-Object Interactions (HOIs) boosts human activity understanding and facilitates various downstream visual tasks. Prior models mostly assume that humans interact with rigid objects using only a few body parts, limiting their scope. In this paper, we address the challenging problem of Full-Body Articulated Human-Object Interaction (f-AHOI), wherein the whole human bodies interact with articulated objects, whose parts are connected by movable joints. We present Capturing Human and Articulated-object InteRactionS (CHAIRS), a large-scale motion-captured f-AHOI dataset, consisting of 17.3 hours of versatile interactions between 46 participants and 81 articulated and rigid sittable objects. CHAIRS provides 3D meshes of both humans and articulated objects during the entire interactive process, as well as realistic and physically plausible full-body interactions. We show the value of CHAIRS with object pose estimation. By learning the geometrical relationships in HOI, we devise the first model that leverages human pose estimation to tackle the articulated object pose/shape estimation during whole-body interactions. Given an image and an estimated human pose, our model reconstructs the object pose/shape and optimizes the reconstruction according to a learned interaction prior. Under two evaluation settings, our model significantly outperforms baselines. We further demonstrate the value of CHAIRS with a downstream task of generating interacting human poses conditioned on articulated objects. We hope CHAIRS will promote the community towards finer-grained interaction understanding. Data/code will be made publicly available. + + + + FeatureNeRF: Learning Generalizable NeRFs by Distilling Foundation Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_FeatureNeRF_Learning_Generalizable_NeRFs_by_Distilling_Foundation_Models_ICCV_2023_paper.pdf + Recent works on generalizable NeRFs have shown promising results on novel view synthesis from single or few images. However, such models have rarely been applied on other downstream tasks beyond synthesis such as semantic understanding and parsing. In this paper, we propose a novel framework named FeatureNeRF to learn generalizable NeRFs by distilling pre-trained vision foundation models (e.g., DINO, Latent Diffusion). FeatureNeRF leverages 2D pre-trained foundation models to 3D space via neural rendering, and then extract deep features for 3D query points from NeRF MLPs. Consequently, it allows to map 2D images to continuous 3D semantic feature volumes, which can be used for various downstream tasks. We evaluate FeatureNeRF on tasks of 2D/3D semantic keypoint transfer and 2D/3D object part segmentation. Our extensive experiments demonstrate the effectiveness of FeatureNeRF as a generalizable 3D semantic feature extractor. + + + + SRFormer: Permuted Self-Attention for Single Image Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_SRFormer_Permuted_Self-Attention_for_Single_Image_Super-Resolution_ICCV_2023_paper.pdf + Previous works have shown that increasing the window size for Transformer-based image super-resolution models (e.g., SwinIR) can significantly improve the model performance but the computation overhead is also considerable. 
In this paper, we present SRFormer, a simple but novel method that can enjoy the benefit of large window self-attention but introduces even less computational burden. The core of our SRFormer is the permuted self-attention(PSA), which strikes an appropriate balance between the channel and spatial information for self-attention. Our PSA is simple and can be easily applied to existing super-resolution networks based on window self-attention. Without any bells and whistles, we show that our SRFormer achieves a 33.86dB PSNR score on the Urban100 dataset, which is 0.46dB higher than that of SwinIR but uses fewer parameters and computations. We hope our simple and effective approach can serve as a useful tool for future research in super-resolution model design. Our code is available at https://github.com/HVision-NKU/SRFormer. + + + + Deep Homography Mixture for Single Image Rolling Shutter Correction + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_Deep_Homography_Mixture_for_Single_Image_Rolling_Shutter_Correction_ICCV_2023_paper.pdf + We present a deep homography mixture motion model for single image rolling shutter correction. Rolling shutter (RS) effects are often caused by row-wise exposure delay in the widely adopted CMOS sensor. Previous methods often require more than one frame for the correction, leading to data quality requirements. Few approaches address the more challenging task of single image RS correction, which often adopt designs like trajectory estimation or long rectangular kernels, to learn the camera motion parameters of an RS image, to restore the global shutter (GS) image. In this work, we adopt a more straightforward method to learn deep homography mixture motion between an RS image and its corresponding GS image, without large solution space or strict restrictions on image features. We show that dividing an image into blocks with a Gaussian weight of block scanlines fits well for the RS setting. Moreover, instead of directly learning the motion mapping, we learn coefficients that assemble several motion bases to produce the correction motion, where these bases are learned from the consecutive frames of natural videos beforehand. Experiments show that our method outperforms existing single RS methods statistically and visually, in both synthesized and real RS images. + + + + Audio-Visual Glance Network for Efficient Video Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Nugroho_Audio-Visual_Glance_Network_for_Efficient_Video_Recognition_ICCV_2023_paper.pdf + Deep learning has made significant strides in video understanding tasks, but the computation required to classify lengthy and massive videos using clip-level video classifiers remains impractical and prohibitively expensive. To address this issue, we propose Audio-Visual Glance Network (AVGN), which leverages the commonly available audio and visual modalities to efficiently process the spatio-temporally important parts of a video. AVGN firstly divides the video into snippets of image-audio clip pair and employs lightweight unimodal encoders to extract global visual features and audio features. To identify the important temporal segments, we use an Audio-Visual Temporal Saliency Transformer (AV-TeST) that estimates the saliency scores of each frame. To further increase efficiency in the spatial dimension, AVGN processes only the important patches instead of the whole images. 
We use an Audio-Enhanced Spatial Patch Attention (AESPA) module to produce a set of enhanced coarse visual features, which are fed to a policy network that produces the coordinates of the important patches. This approach enables us to focus only on the most important spatio-temporal parts of the video, leading to more efficient video recognition. Moreover, we incorporate various training techniques and multi-modal feature fusion to enhance the robustness and effectiveness of our AVGN. By combining these strategies, our AVGN sets new state-of-the-art performance on multiple video recognition benchmarks while achieving a faster processing speed. + + + + STEPs: Self-Supervised Key Step Extraction and Localization from Unlabeled Procedural Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Shah_STEPs_Self-Supervised_Key_Step_Extraction_and_Localization_from_Unlabeled_Procedural_ICCV_2023_paper.pdf + We address the problem of extracting key steps from unlabeled procedural videos, motivated by the potential of Augmented Reality (AR) headsets to revolutionize job training and performance. We decompose the problem into two steps: representation learning and key step extraction. We propose a training objective, the Bootstrapped Multi-Cue Contrastive (BMC2) loss, to learn discriminative representations for various steps without any labels. Different from prior works, we develop techniques to train a light-weight temporal module which uses off-the-shelf features for self-supervision. Our approach can seamlessly leverage information from multiple cues like optical flow, depth or gaze to learn discriminative features for key-steps, making it amenable to AR applications. We finally extract key steps via a tunable algorithm that clusters the representations and samples. We show significant improvements over prior works for the task of key step localization and phase classification. Qualitative results demonstrate that the extracted key steps are meaningful and succinctly represent various steps of the procedural tasks. + + + + Towards Robust and Smooth 3D Multi-Person Pose Estimation from Monocular Videos in the Wild + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_Towards_Robust_and_Smooth_3D_Multi-Person_Pose_Estimation_from_Monocular_ICCV_2023_paper.pdf + 3D pose estimation is an invaluable task in computer vision with various practical applications. In particular, 3D multi-person pose estimation from a monocular video (3DMPPE) is particularly challenging and still largely uncharted, far from being applicable to in-the-wild scenarios yet. We pose three unresolved issues with the existing methods: lack of robustness on unseen views during training, vulnerability to occlusion, and severe jittering in the output. As a remedy, we propose POTR-3D, the first realization of a sequence-to-sequence 2D-to-3D lifting model for 3DMPPE, powered by a novel geometry-aware data augmentation strategy, capable of generating unbounded data with a variety of views while accounting for the ground plane and occlusions. Through extensive experiments, we verify that the proposed model and data augmentation robustly generalize to diverse unseen views, robustly recover the poses against heavy occlusions, and reliably generate more natural and smoother outputs. The effectiveness of our approach is verified not only by achieving state-of-the-art performance on public benchmarks, but also by qualitative results on more challenging in-the-wild videos.
Demo videos are available at https://www.youtube.com/@potr3d. + + + + Clustering based Point Cloud Representation Learning for 3D Analysis + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_Clustering_based_Point_Cloud_Representation_Learning_for_3D_Analysis_ICCV_2023_paper.pdf + Point cloud analysis (such as 3D segmentation and detection) is a challenging task, because of not only the irregular geometries of many millions of unordered points, but also the great variations caused by depth, viewpoint, occlusion, etc. Current studies put much focus on the adaption of neural networks to the complex geometries of point clouds, but are blind to a fundamental question: how to learn an appropriate point embedding space that is aware of both discriminative semantics and challenging variations? As a response, we propose a clustering based supervised learning scheme for point cloud analysis. Unlike current de-facto, scene-wise training paradigm, our algorithm conducts within-class clustering on the point embedding space for automatically discovering subclass patterns which are latent yet representative across scenes. The mined patterns are, in turn, used to repaint the embedding space, so as to respect the underlying distribution of the entire training dataset and improve the robustness to the variations. Our algorithm is principled and readily pluggable to modern point cloud segmentation networks during training, without extra overhead during testing. With various 3D network architectures (i.e., voxel-based, point-based, Transformer-based, automatically searched), our algorithm shows notable improvements on famous point cloud segmentation datasets (i.e., 2.0-2.6% on single-scan and 2.0-2.2% multi-scan of SemanticKITTI, 1.8-1.9% on S3DIS, in terms of mIoU). Our algorithm also demonstrates utility in 3D detection, showing 2.0-3.4% mAP gains on KITTI. Our code is released at: https://github.com/FengZicai/Cluster3Dseg/. + + + + Forecast-MAE: Self-supervised Pre-training for Motion Forecasting with Masked Autoencoders + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_Forecast-MAE_Self-supervised_Pre-training_for_Motion_Forecasting_with_Masked_Autoencoders_ICCV_2023_paper.pdf + This study explores the application of self-supervised learning (SSL) to the task of motion forecasting, an area that has not yet been extensively investigated despite the widespread success of SSL in computer vision and natural language processing. To address this gap, we introduce Forecast-MAE, an extension of the mask autoencoders framework that is specifically designed for self-supervised learning of the motion forecasting task. Our approach includes a novel masking strategy that leverages the strong interconnections between agents' trajectories and road networks, involving complementary masking of agents' future or history trajectories and random masking of lane segments. Our experiments on the challenging Argoverse 2 motion forecasting benchmark show that Forecast-MAE, which utilizes standard Transformer blocks with minimal inductive bias, achieves competitive performance compared to state-of-the-art methods that rely on supervised learning and sophisticated designs. Moreover, it outperforms the previous self-supervised learning method by a significant margin. Code is available at https://github.com/jchengai/forecast-mae. 
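The Forecast-MAE abstract above describes a masking strategy in which each agent has either its history or its future hidden (complementary masking) while lane segments are masked at random. The snippet below is one plausible realization of that mask generation written for illustration; it is an assumption, not the released Forecast-MAE code.

```python
import torch

def complementary_masks(num_agents, num_lanes, lane_mask_ratio=0.5, generator=None):
    """Return boolean masks; True means the token is hidden and must be reconstructed."""
    hide_future = torch.rand(num_agents, generator=generator) < 0.5
    history_mask = ~hide_future          # mask the history exactly when the future is kept
    future_mask = hide_future
    lane_mask = torch.rand(num_lanes, generator=generator) < lane_mask_ratio
    return history_mask, future_mask, lane_mask

g = torch.Generator().manual_seed(0)
hist_m, fut_m, lane_m = complementary_masks(num_agents=6, num_lanes=10, generator=g)
print(hist_m.tolist(), fut_m.tolist(), lane_m.tolist())
# Every agent has exactly one of (history, future) masked, never both or neither.
assert torch.all(hist_m ^ fut_m)
```

The complementary structure is what forces the encoder to relate past and future motion rather than trivially copying one from the other.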
+ + + + Efficient Transformer-based 3D Object Detection with Dynamic Token Halting + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Efficient_Transformer-based_3D_Object_Detection_with_Dynamic_Token_Halting_ICCV_2023_paper.pdf + Balancing efficiency and accuracy is a long-standing problem for deploying deep learning models. The trade-off is even more important for real-time safety-critical systems like autonomous vehicles. In this paper, we propose an effective approach for accelerating transformer-based 3D object detectors by dynamically halting tokens at different layers depending on their contribution to the detection task. Although halting a token is a non-differentiable operation, our method allows for differentiable end-to-end learning by leveraging an equivalent differentiable forward-pass. Furthermore, our framework allows halted tokens to be reused to inform the model's predictions through a straightforward token recycling mechanism. Our method significantly improves the Pareto frontier of efficiency versus accuracy when compared with the existing approaches. By halting tokens and increasing model capacity, we are able to improve the baseline model's performance without increasing the model's latency on the Waymo Open Dataset. + + + + Video Task Decathlon: Unifying Image and Video Tasks in Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Video_Task_Decathlon_Unifying_Image_and_Video_Tasks_in_Autonomous_ICCV_2023_paper.pdf + Performing multiple heterogeneous visual tasks in dynamic scenes is a hallmark of human perception capability. Despite remarkable progress in image and video recognition via representation learning, current research still focuses on designing specialized networks for singular, homogeneous, or simple combination of tasks. We instead explore the construction of a unified model for major image and video recognition tasks in autonomous driving with diverse input and output structures. To enable such an investigation, we design a new challenge, Video Task Decathlon (VTD), which includes ten representative image and video tasks spanning classification, segmentation, localization, and association of objects and pixels. On VTD, we develop our unified network, VTDNet, that uses a single structure and a single set of weights for all ten tasks. VTDNet groups similar tasks and employs task interaction stages to exchange information within and between task groups. Given the impracticality of labeling all tasks on all frames, and the performance degradation associated with joint training of many tasks, we design a Curriculum training, Pseudo-labeling, and Fine-tuning (CPF) scheme to successfully train VTDNet on all tasks and mitigate performance loss. Armed with CPF, VTDNet significantly outperforms its single-task counterparts on most tasks with only 20% overall computations. VTD is a promising new direction for exploring the unification of perception tasks in autonomous driving. + + + + PreSTU: Pre-Training for Scene-Text Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Kil_PreSTU_Pre-Training_for_Scene-Text_Understanding_ICCV_2023_paper.pdf + The ability to recognize and reason about text embedded in visual inputs is often lacking in vision-and-language (V&L) models, perhaps because V&L pre-training methods have often failed to include such an ability in their training objective. In this paper, we propose PreSTU, a novel pre-training recipe dedicated to scene-text understanding (STU). 
PreSTU introduces OCR-aware pre-training objectives that encourage the model to recognize text from an image and connect it to the rest of the image content. We implement PreSTU using a simple transformer-based encoder-decoder architecture, combined with large-scale image-text datasets with scene text obtained from an off-the-shelf OCR system. We empirically demonstrate the effectiveness of this pre-training approach on eight visual question answering and four image captioning benchmarks. + + + + Towards Better Robustness against Common Corruptions for Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Towards_Better_Robustness_against_Common_Corruptions_for_Unsupervised_Domain_Adaptation_ICCV_2023_paper.pdf + Recent studies have investigated how to achieve robustness for unsupervised domain adaptation (UDA). While most efforts focus on adversarial robustness, i.e. how the model performs against unseen malicious adversarial perturbations, robustness against benign common corruption (RaCC) surprisingly remains under-explored for UDA. Towards improving RaCC for UDA methods in an unsupervised manner, we propose a novel Distributionally and Discretely Adversarial Regularization (DDAR) framework in this paper. Formulated as a min-max optimization with a distribution distance, DDAR is theoretically well-founded to ensure generalization over unknown common corruptions. Meanwhile, we show that our regularization scheme effectively reduces a surrogate of RaCC, i.e., the perceptual distance between natural data and common corruption. To enable a better adversarial regularization, the design of the optimization pipeline relies on an image discretization scheme that can transform "out-of-distribution" adversarial data into "in-distribution" data augmentation. Through extensive experiments, in terms of RaCC, our method is superior to conventional unsupervised regularization mechanisms, widely improves the robustness of existing UDA methods, and achieves state-of-the-art performance. + + + + TALL: Thumbnail Layout for Deepfake Video Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_TALL_Thumbnail_Layout_for_Deepfake_Video_Detection_ICCV_2023_paper.pdf + The growing threats of deepfakes to society and cybersecurity have raised enormous public concerns, and increasing efforts have been devoted to this critical topic of deepfake video detection. Existing video methods achieve good performance but are computationally intensive. This paper introduces a simple yet effective strategy named Thumbnail Layout (TALL), which transforms a video clip into a pre-defined layout to realize the preservation of spatial and temporal dependencies. Specifically, consecutive frames are masked in a fixed position in each frame to improve generalization, then resized to sub-images and rearranged into a pre-defined layout as the thumbnail. TALL is model-agnostic and extremely simple by only modifying a few lines of code. Inspired by the success of vision transformers, we incorporate TALL into Swin Transformer, forming an efficient and effective method TALL-Swin. Extensive experiments on intra-dataset and cross-dataset validate the validity and superiority of TALL and SOTA TALL-Swin. TALL-Swin achieves 90.79% AUC on the challenging cross-dataset task, FaceForensics++ - Celeb-DF. The code is available at https://github.com/rainy-xu/TALL4Deepfake. 
+ + + + Bidirectional Alignment for Domain Adaptive Detection with Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/He_Bidirectional_Alignment_for_Domain_Adaptive_Detection_with_Transformers_ICCV_2023_paper.pdf + We propose a Bidirectional Alignment for domain adaptive Detection with Transformers (BiADT) to improve cross-domain object detection performance. Existing adversarial learning based methods use a gradient reverse layer (GRL) to reduce the domain gap between the source and target domains in feature representations. Since different image parts and objects may exhibit various degrees of domain-specific characteristics, directly applying GRL on a global image or object representation may not be suitable. Our proposed BiADT explicitly estimates token-wise domain-invariant and domain-specific features in the image and object token sequences. BiADT has a novel deformable attention and self-attention, aimed at bi-directional domain alignment and mutual information minimization. These two objectives reduce the domain gap in domain-invariant representations, and simultaneously increase the distinctiveness of domain-specific features. Our experiments show that BiADT consistently achieves very competitive performance compared to SOTA on Cityscapes-to-FoggyCityscapes, Sim10K-to-Cityscapes and Cityscapes-to-BDD100K, outperforming the strong baseline, AQT, by 2.0, 2.1, and 2.4 in mAP50, respectively. + + + + CAME: Contrastive Automated Model Evaluation + http://openaccess.thecvf.com//content/ICCV2023/papers/Peng_CAME_Contrastive_Automated_Model_Evaluation_ICCV_2023_paper.pdf + The Automated Model Evaluation (AutoEval) framework entertains the possibility of evaluating a trained machine learning model without resorting to a labeled testing set. Despite the promise and some decent results, the existing AutoEval methods heavily rely on computing distribution shifts between the unlabelled testing set and the training set. We believe this reliance on the training set becomes another obstacle in shipping this technology to real-world ML development. In this work, we propose Contrastive Automatic Model Evaluation (CAME), a novel AutoEval framework that does not involve the training set in the loop. The core idea of CAME is based on a theoretical analysis that links model performance to a contrastive loss. Further, with extensive empirical validation, we manage to set up a predictable relationship between the two, simply by inference on the unlabeled/unseen testing set. The resulting framework CAME establishes new SOTA results for AutoEval by surpassing prior work significantly. + + + + Order-preserving Consistency Regularization for Domain Adaptation and Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Jing_Order-preserving_Consistency_Regularization_for_Domain_Adaptation_and_Generalization_ICCV_2023_paper.pdf + Deep learning models fail on cross-domain challenges if the model is oversensitive to domain-specific attributes, e.g., lighting, background, camera angle, etc. To alleviate this problem, data augmentation coupled with consistency regularization is commonly adopted to make the model less sensitive to domain-specific attributes. Consistency regularization forces the model to output the same representation or prediction for two views of one image. These constraints, however, are either too strict or not order-preserving for the classification probabilities.
In this work, we propose the Order-preserving Consistency Regularization (OCR) for cross-domain tasks. The order-preserving property for the prediction makes the model robust to task-irrelevant transformations. As a result, the model becomes less sensitive to the domain-specific attributes. Comprehensive experiments show that our method achieves clear advantages on five different cross-domain tasks. + + + + Workie-Talkie: Accelerating Federated Learning by Overlapping Computing and Communications via Contrastive Regularization + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Workie-Talkie_Accelerating_Federated_Learning_by_Overlapping_Computing_and_Communications_via_ICCV_2023_paper.pdf + Federated learning (FL) over mobile devices is a promising distributed learning paradigm for various mobile applications. However, practical deployment of FL over mobile devices is very challenging because (i) conventional FL incurs huge training latency for mobile devices due to interleaved local computing and communications of model updates, (ii) there are heterogeneous training data across mobile devices, and (iii) mobile devices have hardware heterogeneity in terms of computing and communication capabilities. To address the aforementioned challenges, in this paper, we propose a novel "workie-talkie" FL scheme, which can accelerate FL's training by overlapping local computing and wireless communications via contrastive regularization (FedCR). FedCR can reduce FL's training latency and almost eliminate straggler issues since it buries/embeds the time consumption of communications into that of local training. To resolve the co-existing issues of model staleness and data heterogeneity, we introduce class-wise contrastive regularization to correct the local training in FedCR. Besides, we jointly exploit contrastive regularization and subnetworks to further extend our FedCR approach to accommodate edge devices with hardware heterogeneity. We deploy FedCR in our FL testbed and conduct extensive experiments. The results show that FedCR outperforms status quo FL approaches on various datasets and models. + + + + Late Stopping: Avoiding Confidently Learning from Mislabeled Examples + http://openaccess.thecvf.com//content/ICCV2023/papers/Yuan_Late_Stopping_Avoiding_Confidently_Learning_from_Mislabeled_Examples_ICCV_2023_paper.pdf + Sample selection is a prevalent method in learning with noisy labels, where small-loss data are typically considered as correctly labeled data. However, this method may not effectively identify clean hard examples with large losses, which are critical for achieving the model's close-to-optimal generalization performance. In this paper, we propose a new framework, Late Stopping, which leverages the intrinsic robust learning ability of DNNs through a prolonged training process. Specifically, Late Stopping gradually shrinks the noisy dataset by removing high-probability mislabeled examples while retaining the majority of clean hard examples in the training set throughout the learning process. We empirically observe that mislabeled and clean examples exhibit differences in the number of epochs required for them to be consistently and correctly classified, and thus high-probability mislabeled examples can be removed. Experimental results on benchmark-simulated and real-world noisy datasets demonstrate that the proposed method outperforms state-of-the-art counterparts.
+ + + + Most Important Person-Guided Dual-Branch Cross-Patch Attention for Group Affect Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_Most_Important_Person-Guided_Dual-Branch_Cross-Patch_Attention_for_Group_Affect_Recognition_ICCV_2023_paper.pdf + Group affect refers to the subjective emotion that is evoked by an external stimulus in a group, which is an important factor that shapes group behavior and outcomes. Recognizing group affect involves identifying important individuals and salient objects among a crowd that can evoke emotions. However, most existing methods lack attention to affective meaning in group dynamics and fail to account for the contextual relevance of faces and objects in group-level images. In this work, we propose a solution by incorporating the psychological concept of the Most Important Person (MIP), which represents the most noteworthy face in a crowd and has affective semantic meaning. We present the Dual-branch Cross-Patch Attention Transformer (DCAT), which uses the global image and the MIP together as inputs. Specifically, we first learn the informative facial regions produced by the MIP and the global context separately. Then, the Cross-Patch Attention module is proposed to fuse the features of the MIP and the global context together to complement each other. Our proposed method outperforms state-of-the-art methods on the GAF 3.0, GroupEmoW, and HECO datasets. Moreover, we demonstrate the potential for broader applications by showing that our proposed model can be transferred to another group affect task, group cohesion, and achieve comparable results. + + + + Achievement-Based Training Progress Balancing for Multi-Task Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Yun_Achievement-Based_Training_Progress_Balancing_for_Multi-Task_Learning_ICCV_2023_paper.pdf + Multi-task learning faces two challenging issues: (1) the high cost of annotating labels for all tasks and (2) balancing the training progress of various tasks with different natures. To resolve the label annotation issue, we construct a large-scale "partially annotated" multi-task dataset by combining task-specific datasets. However, the numbers of annotations for individual tasks are imbalanced, which may escalate an imbalance in training progress. To balance the training progress, we propose an achievement-based multi-task loss to modulate training speed based on the "achievement," defined as the ratio of current accuracy to single-task accuracy. Then, we formulate the multi-task loss as a weighted geometric mean of individual task losses instead of a weighted sum to prevent any task from dominating the loss. In experiments, we evaluated the accuracy and training speed of the proposed multi-task loss on the large-scale multi-task dataset against recent multi-task losses. The proposed loss achieved the best multi-task accuracy without incurring training time overhead. Compared to single-task models, the proposed one achieved 1.28%, 1.65%, and 1.18% accuracy improvement in object detection, semantic segmentation, and depth estimation, respectively, while reducing computations to 33.73%. Source code is available at https://github.com/samsung/Achievement-based-MTL.
+ + + + Logic-induced Diagnostic Reasoning for Semi-supervised Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_Logic-induced_Diagnostic_Reasoning_for_Semi-supervised_Semantic_Segmentation_ICCV_2023_paper.pdf + Recent advances in semi-supervised semantic segmentation have been heavily reliant on pseudo labeling to compensate for limited labeled data, disregarding the valuable relational knowledge among semantic concepts. To bridge this gap, we devise LogicDiag, a brand new neural-logic semi-supervised learning framework. Our key insight is that conflicts within pseudo labels, identified through symbolic knowledge, can serve as strong yet commonly ignored learning signals. LogicDiag resolves such conflicts via reasoning with logic-induced diagnoses, enabling the recovery of (potentially) erroneous pseudo labels, ultimately alleviating the notorious error accumulation problem. We showcase the practical application of LogicDiag in the data-hungry segmentation scenario, where we formalize the structured abstraction of semantic concepts as a set of logic rules. Extensive experiments on three standard semi-supervised semantic segmentation benchmarks demonstrate the effectiveness and generality of LogicDiag. Moreover, LogicDiag highlights the promising opportunities arising from the systematic integration of symbolic reasoning into the prevalent statistical, neural learning approaches. + + + + NeRF-Det: Learning Geometry-Aware Volumetric Representation for Multi-View 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_NeRF-Det_Learning_Geometry-Aware_Volumetric_Representation_for_Multi-View_3D_Object_Detection_ICCV_2023_paper.pdf + We present NeRF-Det, a novel method for indoor 3D detection with posed RGB images as input. Unlike existing indoor 3D detection methods that struggle to model scene geometry, our method makes novel use of NeRF in an end-to-end manner to explicitly estimate 3D geometry, thereby improving 3D detection performance. Specifically, to avoid the significant extra latency associated with per-scene optimization of NeRF, we introduce sufficient geometry priors to enhance the generalizability of NeRF-MLP. Furthermore, we subtly connect the detection and NeRF branches through a shared MLP, enabling an efficient adaptation of NeRF to detection and yielding geometry-aware volumetric representations for 3D detection. Our method outperforms state-of-the-arts by 3.9 mAP and 3.1 mAP on the ScanNet and ARKITScenes benchmarks, respectively. We provide extensive analysis to shed light on how NeRF-Det works. As a result of our joint-training design, NeRF-Det is able to generalize well to unseen scenes for object detection, view synthesis, and depth estimation tasks without requiring per-scene optimization. Code is available at https://github.com/facebookresearch/NeRF-Det. + + + + Spatio-Temporal Domain Awareness for Multi-Agent Collaborative Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Spatio-Temporal_Domain_Awareness_for_Multi-Agent_Collaborative_Perception_ICCV_2023_paper.pdf + Multi-agent collaborative perception as a potential application for vehicle-to-everything communication could significantly improve the perception performance of autonomous vehicles over single-agent perception. However, several challenges remain in achieving pragmatic information sharing in this emerging research. 
In this paper, we propose SCOPE, a novel collaborative perception framework that aggregates the spatio-temporal awareness characteristics across on-road agents in an end-to-end manner. Specifically, SCOPE has three distinct strengths: i) it considers effective semantic cues of the temporal context to enhance current representations of the target agent; ii) it aggregates perceptually critical spatial information from heterogeneous agents and overcomes localization errors via multi-scale feature interactions; iii) it integrates multi-source representations of the target agent based on their complementary contributions by an adaptive fusion paradigm. To thoroughly evaluate SCOPE, we consider both real-world and simulated scenarios of collaborative 3D object detection tasks on three datasets. Extensive experiments show the superiority of our approach and the necessity of the proposed components. The project link is https://ydk122024.github.io/SCOPE/. + + + + LPFF: A Portrait Dataset for Face Generators Across Large Poses + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_LPFF_A_Portrait_Dataset_for_Face_Generators_Across_Large_Poses_ICCV_2023_paper.pdf + Existing face generators exhibit exceptional performance on faces in small to medium poses (with respect to frontal faces) but struggle to produce realistic results for large poses. The distorted rendering results on large poses in 3D-aware generators further show that the generated 3D face shapes are far from the distribution of 3D faces in reality. We find that the above issues are caused by the training dataset's pose imbalance. To this end, we present LPFF, a large-pose Flickr face dataset comprised of 19,590 high-quality real large-pose portrait images. We utilize our dataset to train a 2D face generator that can process large-pose face images, as well as a 3D-aware generator that can generate realistic human face geometry. To better validate our pose-conditional 3D-aware generators, we develop a new FID measure to evaluate the 3D-level performance. Through this novel FID measure and other experiments, we show that LPFF can help 2D face generators extend their latent space and better manipulate the large-pose data, and help 3D-aware face generators achieve better view consistency and more realistic 3D reconstruction results. + + + + Pseudo-label Alignment for Semi-supervised Instance Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Pseudo-label_Alignment_for_Semi-supervised_Instance_Segmentation_ICCV_2023_paper.pdf + Pseudo-labeling is significant for semi-supervised instance segmentation, which generates instance masks and classes from unannotated images for subsequent training. However, in existing pipelines, pseudo-labels that contain valuable information may be directly filtered out due to mismatches in class and mask quality. To address this issue, we propose a novel framework, called pseudo-label aligning instance segmentation (PAIS), in this paper. In PAIS, we devise a dynamic aligning loss (DALoss) that adjusts the weights of semi-supervised loss terms with varying class and mask score pairs. Through extensive experiments conducted on the COCO and Cityscapes datasets, we demonstrate that PAIS is a promising framework for semi-supervised instance segmentation, particularly in cases where labeled data is severely limited. 
Notably, with just 1% labeled data, PAIS achieves 21.2 mAP (based on Mask-RCNN) and 19.9 mAP (based on K-Net) on the COCO dataset, outperforming the current state-of-the-art model, i.e., NoisyBoundary with 7.7 mAP, by a margin of over 12 points. Code is available at: https://github.com/hujiecpp/PAIS. + + + + MixBag: Bag-Level Data Augmentation for Learning from Label Proportions + http://openaccess.thecvf.com//content/ICCV2023/papers/Asanomi_MixBag_Bag-Level_Data_Augmentation_for_Learning_from_Label_Proportions_ICCV_2023_paper.pdf + Learning from label proportions (LLP) is a promising weakly supervised learning problem. In LLP, a set of instances (bag) has label proportions but no instance-level labels. LLP aims to train an instance-level classifier by using the label proportions of the bag. In this paper, we propose a bag-level data augmentation method for LLP called MixBag, which is based on the key observation from our preliminary experiments: the instance-level classification accuracy improves as the number of labeled bags increases, even though the total number of instances is fixed. We also propose a confidence interval loss designed based on statistical theory in order to use the augmented bags effectively. To the best of our knowledge, this is the first attempt to propose bag-level data augmentation for LLP. The advantage of MixBag is that it can be applied to instance-level data augmentation techniques and any LLP method that uses the proportion loss. Experimental results demonstrate this advantage and the effectiveness of our method. + + + + Effective Real Image Editing with Accelerated Iterative Diffusion Inversion + http://openaccess.thecvf.com//content/ICCV2023/papers/Pan_Effective_Real_Image_Editing_with_Accelerated_Iterative_Diffusion_Inversion_ICCV_2023_paper.pdf + Despite all recent progress, it is still challenging to edit and manipulate natural images with modern generative models. When using a Generative Adversarial Network (GAN), one major hurdle is the inversion process that maps a real image to its corresponding noise vector in the latent space, since it is necessary to be able to reconstruct an image to edit its contents. Likewise for Denoising Diffusion Implicit Models (DDIM), the linearization assumption in each inversion step makes the whole deterministic inversion process unreliable. Existing approaches that have tackled the problem of inversion stability often incur significant trade-offs in computational efficiency. In this work we propose an Accelerated Iterative Diffusion Inversion method, dubbed AIDI, that significantly improves reconstruction accuracy with minimal additional overhead in space and time complexity. By using a novel blended guidance technique, we show that effective results can be obtained on a large range of image editing tasks without large classifier-free guidance in inversion. Furthermore, when compared with other diffusion inversion based works, our proposed process is shown to be more robust for fast image editing in the 10 and 20 diffusion step regimes. + + + + UniFace: Unified Cross-Entropy Loss for Deep Face Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_UniFace_Unified_Cross-Entropy_Loss_for_Deep_Face_Recognition_ICCV_2023_paper.pdf + As a widely used loss function in deep face recognition, the softmax loss cannot guarantee that the minimum positive sample-to-class similarity is larger than the maximum negative sample-to-class similarity.
As a result, no unified threshold is available to separate positive sample-to-class pairs from negative sample-to-class pairs. To bridge this gap, we design a UCE (Unified Cross-Entropy) loss for face recognition model training, which is built on the vital constraint that all the positive sample-to-class similarities shall be larger than the negative ones. Our UCE loss can be integrated with margins for a further performance boost. The face recognition model trained with the proposed UCE loss, UniFace, was intensively evaluated using a number of popular public datasets like MFR, IJB-C, LFW, CFP-FP, AgeDB, and MegaFace. Experimental results show that our approach outperforms SOTA methods like SphereFace, CosFace, ArcFace, Partial FC, etc. In particular, as of the submission of this work (Mar. 8, 2023), the proposed UniFace achieves the highest TAR@MR-All on the academic track of the MFR-ongoing challenge. Code is publicly available. + + + + Jumping through Local Minima: Quantization in the Loss Landscape of Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Frumkin_Jumping_through_Local_Minima_Quantization_in_the_Loss_Landscape_of_ICCV_2023_paper.pdf + Quantization scale and bit-width are the most important parameters when considering how to quantize a neural network. Prior work focuses on optimizing quantization scales in a global manner through gradient methods (gradient descent & Hessian analysis). Yet, when applying perturbations to quantization scales, we observe a very jagged, highly non-smooth test loss landscape. In fact, small perturbations in quantization scale can greatly affect accuracy, yielding a 0.5-0.8% accuracy boost in 4-bit quantized vision transformers (ViTs). In this regime, gradient methods break down, since they cannot reliably reach local minima. In our work, dubbed Evol-Q, we use evolutionary search to effectively traverse the non-smooth landscape. Additionally, we propose using an infoNCE loss, which not only helps combat overfitting on the small (1,000 images) calibration dataset but also makes traversing such a highly non-smooth surface easier. Evol-Q improves the top-1 accuracy of a fully quantized ViT-Base by 10.30%, 0.78%, and 0.15% for 3-bit, 4-bit, and 8-bit weight quantization levels. Extensive experiments on a variety of CNN and ViT architectures further demonstrate its robustness in extreme quantization scenarios. + + + + ScatterNeRF: Seeing Through Fog with Physically-Based Inverse Neural Rendering + http://openaccess.thecvf.com//content/ICCV2023/papers/Ramazzina_ScatterNeRF_Seeing_Through_Fog_with_Physically-Based_Inverse_Neural_Rendering_ICCV_2023_paper.pdf + Vision in adverse weather conditions, whether it be snow, rain, or fog, is challenging. In these scenarios, scattering and attenuation severely degrade image quality. Handling such inclement weather conditions, however, is essential to operate autonomous vehicles, drones and robotic applications where human performance is impeded the most. A large body of work explores removing weather-induced image degradations with dehazing methods. Most methods rely on single images as input and struggle to generalize from synthetic fully-supervised training approaches or to generate high fidelity results from unpaired real-world datasets. With data as the bottleneck and most of today's training data relying on good weather conditions with inclement weather as an outlier, we rely on an inverse rendering approach to reconstruct the scene content.
We introduce ScatterNeRF, a neural rendering method which adequately renders foggy scenes and decomposes the fog-free background from the participating media -- exploiting the multiple views from a short automotive sequence without the need for a large training data corpus. Instead, the rendering approach is optimized on the multi-view scene itself, which can typically be captured by an autonomous vehicle, robot or drone during operation. Specifically, we propose a disentangled representation for the scattering volume and the scene objects, and learn the scene reconstruction with physics-inspired losses. We validate our method by capturing multi-view In-the-Wild data and controlled captures in a large-scale fog chamber. Our code and datasets are available at https://light.princeton.edu/scatternerf. + + + + Towards Generic Image Manipulation Detection with Weakly-Supervised Self-Consistency Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhai_Towards_Generic_Image_Manipulation_Detection_with_Weakly-Supervised_Self-Consistency_Learning_ICCV_2023_paper.pdf + As advanced image manipulation techniques emerge, detecting the manipulation becomes increasingly important. Despite the success of recent learning-based approaches for image manipulation detection, they typically require expensive pixel-level annotations to train, while exhibiting degraded performance when testing on images that are differently manipulated compared with training images. To address these limitations, we propose weakly-supervised image manipulation detection, such that only binary image-level labels (authentic or tampered with) are required for training purposes. Such a weakly-supervised setting can leverage more training images and has the potential to adapt quickly to new manipulation techniques. To improve the generalization ability, we propose weakly-supervised self-consistency learning (WSCL) to leverage the weakly annotated images. For the second problem, we propose an end-to-end learnable method, which takes advantage of image self-consistency properties. Specifically, two consistency properties are learned: multi-source consistency (MSC) and inter-patch consistency (IPC). MSC exploits different content-agnostic information and enables cross-source learning via an online pseudo label generation and refinement process. IPC performs global pair-wise patch-patch relationship reasoning to discover a complete region of manipulation. Extensive experiments validate that our WSCL, even though it is weakly supervised, exhibits competitive performance compared with its fully-supervised counterpart under both in-distribution and out-of-distribution evaluations, as well as reasonable manipulation localization ability. + + + + PARF: Primitive-Aware Radiance Fusion for Indoor Scene Novel View Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Ying_PARF_Primitive-Aware_Radiance_Fusion_for_Indoor_Scene_Novel_View_Synthesis_ICCV_2023_paper.pdf + This paper proposes a method for fast scene radiance field reconstruction with strong novel view synthesis performance and convenient scene editing functionality. The key idea is to fully utilize semantic parsing and primitive extraction for constraining and accelerating the radiance field reconstruction process. To fulfill this goal, a primitive-aware hybrid rendering strategy is proposed to enjoy the best of both volumetric and primitive rendering.
We further contribute a reconstruction pipeline that conducts primitive parsing and radiance field learning iteratively for each input frame, which successfully fuses semantic, primitive, and radiance information into a single framework. Extensive evaluations demonstrate the fast reconstruction ability, high rendering quality, and convenient editing functionality of our method. + + + + DeePoint: Visual Pointing Recognition and Direction Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Nakamura_DeePoint_Visual_Pointing_Recognition_and_Direction_Estimation_ICCV_2023_paper.pdf + In this paper, we realize automatic visual recognition and direction estimation of pointing. We introduce the first neural pointing understanding method based on two key contributions. The first is the introduction of a first-of-its-kind large-scale dataset for pointing recognition and direction estimation, which we refer to as the DP Dataset. The DP Dataset consists of more than 2 million frames of 33 people pointing in various styles, annotated for each frame with pointing timings and 3D directions. The second is DeePoint, a novel deep network model for joint recognition and 3D direction estimation of pointing. DeePoint is a Transformer-based network which fully leverages the spatio-temporal coordination of the body parts, not just the hands. Through extensive experiments, we demonstrate the accuracy and efficiency of DeePoint. We believe the DP Dataset and DeePoint will serve as a sound foundation for visual human intention understanding. + + + + Deformer: Dynamic Fusion Transformer for Robust Hand Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Fu_Deformer_Dynamic_Fusion_Transformer_for_Robust_Hand_Pose_Estimation_ICCV_2023_paper.pdf + Accurately estimating 3D hand pose is crucial for understanding how humans interact with the world. Despite remarkable progress, existing methods often struggle to generate plausible hand poses when the hand is heavily occluded or blurred. In videos, the movements of the hand allow us to observe various parts of the hand that may be occluded or blurred in a single frame. To adaptively leverage the visual clues before and after the occlusion or blurring for robust hand pose estimation, we propose the Deformer: a framework that implicitly reasons about the relationship between hand parts within the same image (spatial dimension) and across different timesteps (temporal dimension). We show that a naive application of the transformer self-attention mechanism is not sufficient because motion blur or occlusions in certain frames can lead to heavily distorted hand features and generate imprecise keys and queries. To address this challenge, we incorporate a Dynamic Fusion Module into Deformer, which predicts the deformation of the hand and warps the hand mesh predictions from nearby frames to explicitly support the current frame estimation. Furthermore, we have observed that errors are unevenly distributed across different hand parts, with vertices around fingertips having disproportionately higher errors than those around the palm. We mitigate this issue by introducing a new loss function called maxMSE that automatically adjusts the weight of every vertex to focus the model on critical hand parts. Extensive experiments show that our method significantly outperforms state-of-the-art methods by 10%, and is more robust to occlusions (over 14%).
+ + + + iDAG: Invariant DAG Searching for Domain Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_iDAG_Invariant_DAG_Searching_for_Domain_Generalization_ICCV_2023_paper.pdf + Existing machine learning (ML) models are often fragile in open environments because the data distribution frequently shifts. To address this problem, domain generalization (DG) aims to explore underlying invariant patterns for stable prediction across domains. In this work, we first characterize that this failure of conventional ML models in DG is attributed to an inadequate identification of causal structures. We further propose a novel and theoretically grounded invariant Directed Acyclic Graph (dubbed iDAG) searching framework that attains an invariant graphical relation as the proxy to the causality structure from the intrinsic data-generating process. To enable tractable computation, iDAG solves a constrained optimization objective built on a set of representative class-conditional prototypes. Additionally, we integrate a hierarchical contrastive learning module, which poses a strong effect of clustering, for enhanced prototypes as well as stabler prediction. Extensive experiments on the synthetic and real-world benchmarks demonstrate that iDAG outperforms the state-of-the-art approaches, verifying the superiority of causal structure identification for DG. The code of iDAG is available at https://github.com/lccurious/iDAG. + + + + Spacetime Surface Regularization for Neural Dynamic Scene Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Choe_Spacetime_Surface_Regularization_for_Neural_Dynamic_Scene_Reconstruction_ICCV_2023_paper.pdf + We propose an algorithm, 4DRegSDF, for the spacetime surface regularization to improve the fidelity of neural rendering and reconstruction in dynamic scenes. The key idea is to impose local rigidity on the deformable Signed Distance Function (SDF) for temporal coherency. Our approach works by (1) sampling points on the deformed surface by taking gradient steps toward the steepest direction along SDF, (2) extracting differential surface geometry, such as tangent plane or curvature, at each sample, and (3) adjusting the local rigidity at different timestamps. This enables our dynamic surface regularization to align 4D spacetime geometry via 3D canonical space more accurately. Experiments demonstrate that our 4DRegSDF achieves state-of-the-art performance in both reconstruction and rendering quality over synthetic and real-world datasets. + + + + GasMono: Geometry-Aided Self-Supervised Monocular Depth Estimation for Indoor Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_GasMono_Geometry-Aided_Self-Supervised_Monocular_Depth_Estimation_for_Indoor_Scenes_ICCV_2023_paper.pdf + This paper tackles the challenges of self-supervised monocular depth estimation in indoor scenes caused by large rotation between frames and low texture. We ease the learning process by obtaining coarse camera poses from monocular sequences through multi-view geometry to deal with the former. However, we found that limited by the scale ambiguity across different scenes in the training dataset, a naive introduction of geometric coarse poses cannot play a positive role in performance improvement, which is counter-intuitive. To address this problem, we propose to refine those poses during training through rotation and translation/scale optimization. 
To soften the effect of the low texture, we combine the global reasoning of vision transformers with an overfitting-aware, iterative self-distillation mechanism, providing more accurate depth guidance coming from the network itself. Experiments on NYUv2, ScanNet, 7scenes, and KITTI datasets support the effectiveness of each component in our framework, which sets a new state-of-the-art for indoor self-supervised monocular depth estimation, as well as outstanding generalization ability. Code and models are available at https://github.com/zxcqlf/GasMono + + + + Audio-Visual Deception Detection: DOLOS Dataset and Parameter-Efficient Crossmodal Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Audio-Visual_Deception_Detection_DOLOS_Dataset_and_Parameter-Efficient_Crossmodal_Learning_ICCV_2023_paper.pdf + Deception detection in conversations is a challenging yet important task, having pivotal applications in many fields such as credibility assessment in business, multimedia anti-frauds, and custom security. Despite this, deception detection research is hindered by the lack of high-quality deception datasets, as well as the difficulties of learning multimodal features effectively. To address this issue, we introduce DOLOS, the largest gameshow deception detection dataset with rich deceptive conversations. DOLOS includes 1,675 video clips featuring 213 subjects, and it has been labeled with audio-visual feature annotations. We provide train-test, duration, and gender protocols to investigate the impact of different factors. We benchmark our dataset on previously proposed deception detection approaches. To further improve the performance by fine-tuning fewer parameters, we propose Parameter-Efficient Crossmodal Learning (PECL), where a Uniform Temporal Adapter (UT-Adapter) explores temporal attention in transformer-based architectures, and a crossmodal fusion module, Plug-in Audio-Visual Fusion (PAVF), combines crossmodal information from audio-visual features. Based on the rich fine-grained audio-visual annotations on DOLOS, we also exploit multi-task learning to enhance performance by concurrently predicting deception and audio-visual features. Experimental results demonstrate the desired quality of the DOLOS dataset and the effectiveness of the PECL. The DOLOS dataset and the source codes are available. + + + + Alleviating Catastrophic Forgetting of Incremental Object Detection via Within-Class and Between-Class Knowledge Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Kang_Alleviating_Catastrophic_Forgetting_of_Incremental_Object_Detection_via_Within-Class_and_ICCV_2023_paper.pdf + Incremental object detection (IOD) task requires a model to learn continually from newly added data. However, directly fine-tuning a well-trained detection model on a new task will sharply decrease the performance on old tasks, which is known as catastrophic forgetting. Knowledge distillation, including feature distillation and response distillation, has been proven to be an effective way to alleviate catastrophic forgetting. However, previous works on feature distillation heavily rely on low-level feature information, while under-exploring the importance of high-level semantic information. In this paper, we discuss the cause of catastrophic forgetting in IOD task as destruction of semantic feature space. 
We propose a method that dynamically distills both semantic and feature information, with consideration of both between-class discriminativeness and within-class consistency, on a Transformer-based detector. Between-class discriminativeness is preserved by distilling class-level semantic distance and feature distance among various categories, while within-class consistency is preserved by distilling instance-level semantic information and feature information within each category. Extensive experiments are conducted on both the Pascal VOC and MS COCO benchmarks. Our method outperforms all the previous CNN-based SOTA methods under various experimental scenarios, with a remarkable mAP improvement from 36.90% to 39.80% under the one-step IOD task. + + + + Revisiting the Parameter Efficiency of Adapters from the Perspective of Precision Redundancy + http://openaccess.thecvf.com//content/ICCV2023/papers/Jie_Revisiting_the_Parameter_Efficiency_of_Adapters_from_the_Perspective_of_ICCV_2023_paper.pdf + Current state-of-the-art results in computer vision depend in part on fine-tuning large pre-trained vision models. However, with the exponential growth of model sizes, the conventional full fine-tuning, which needs to store an individual network copy for each task, leads to increasingly huge storage and transmission overhead. Adapter-based Parameter-Efficient Tuning (PET) methods address this challenge by tuning lightweight adapters inserted into the frozen pre-trained models. In this paper, we investigate how to make adapters even more efficient, reaching a new minimum size required to store a task-specific fine-tuned network. Inspired by the observation that the parameters of adapters converge at flat local minima, we find that adapters are resistant to noise in parameter space, which means they are also resistant to low numerical precision. To train low-precision adapters, we propose a computationally efficient quantization method which minimizes the quantization error. Through extensive experiments, we find that low-precision adapters exhibit minimal performance degradation, and even 1-bit precision is sufficient for adapters. The results of the experiments demonstrate that 1-bit adapters outperform all other PET methods on both the VTAB-1K benchmark and few-shot FGVC datasets, while requiring the smallest storage size. Our findings show, for the first time, the significant potential of quantization techniques in PET, providing a general solution to enhance the parameter efficiency of adapter-based PET methods. + + + + EMQ: Evolving Training-free Proxies for Automated Mixed Precision Quantization + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_EMQ_Evolving_Training-free_Proxies_for_Automated_Mixed_Precision_Quantization_ICCV_2023_paper.pdf + Mixed-Precision Quantization (MQ) can achieve a competitive accuracy-complexity trade-off for models. Conventional training-based search methods require time-consuming candidate training to search for optimized per-layer bit-width configurations in MQ. Recently, some training-free approaches have presented various MQ proxies and significantly improved search efficiency. However, the correlation between these proxies and quantization accuracy is poorly understood. To address this gap, we first build the MQ-Bench-101, which involves different bit configurations and quantization results. Then, we observe that the existing training-free proxies exhibit weak correlations on the MQ-Bench-101.
To efficiently seek superior proxies, we develop an automatic search of proxies framework for MQ via evolving algorithms. In particular, we devise an elaborate search space involving the existing proxies and perform an evolution search to discover the best correlated MQ proxy. We proposed a diversity-prompting selection strategy and compatibility screening protocol to avoid premature convergence and improve search efficiency. In this way, our Evolving proxies for Mixed-precision Quantization (EMQ) framework allows the auto-generation of proxies without heavy tuning and expert knowledge. Extensive experiments on ImageNet with various ResNet and MobileNet families demonstrate that our EMQ obtains superior performance than state-of-the-art mixed-precision methods at a significantly reduced cost. The code will be released. + + + + Face Clustering via Graph Convolutional Networks with Confidence Edges + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Face_Clustering_via_Graph_Convolutional_Networks_with_Confidence_Edges_ICCV_2023_paper.pdf + Face clustering is a method for unlabeled image annotation and has attracted increasing attention. Existing methods have made significant breakthroughs by introducing Graph Convolutional Networks (GCNs) on the affinity graph. However, such graphs will contain many vertex pairs with inconsistent similarities and labels, thus degrading the model's performance. There are already relevant efforts for this problem, but the information about features needs to be mined further. In this paper, we define a new concept called confidence edge and guide the construction of graphs. Furthermore, a novel confidence-GCN is proposed to cluster face images by deriving more confidence edges. Firstly, Local Information Fusion is advanced to obtain a more accurate similarity metric by considering the neighbors of vertices. Then Unsupervised Neighbor Determination is used to discard low-quality edges based on similarity differences. Moreover, we elaborate that the remaining edges retain the most beneficial information to demonstrate the validity. At last, the confidence-GCN takes the graph as the input and fully uses the confidence edges to complete the clustering. Experiments show that our method outperforms existing methods on the face and person datasets to achieve state-of-the-art. At the same time, comparable results are obtained on the fashion dataset. + + + + Multimodal Garment Designer: Human-Centric Latent Diffusion Models for Fashion Image Editing + http://openaccess.thecvf.com//content/ICCV2023/papers/Baldrati_Multimodal_Garment_Designer_Human-Centric_Latent_Diffusion_Models_for_Fashion_Image_ICCV_2023_paper.pdf + Fashion illustration is used by designers to communicate their vision and to bring the design idea from conceptualization to realization, showing how clothes interact with the human body. In this context, computer vision can thus be used to improve the fashion design process. Differently from previous works that mainly focused on the virtual try-on of garments, we propose the task of multimodal-conditioned fashion image editing, guiding the generation of human-centric fashion images by following multimodal prompts, such as text, human body poses, and garment sketches. We tackle this problem by proposing a new architecture based on latent diffusion models, an approach that has not been used before in the fashion domain. 
Given the lack of existing datasets suitable for the task, we also extend two existing fashion datasets, namely Dress Code and VITON-HD, with multimodal annotations collected in a semi-automatic manner. Experimental results on these new datasets demonstrate the effectiveness of our proposal, both in terms of realism and coherence with the given multimodal inputs. Source code and collected multimodal annotations are publicly available at: https://github.com/aimagelab/multimodal-garment-designer. + + + + Time-to-Contact Map by Joint Estimation of Up-to-Scale Inverse Depth and Global Motion using a Single Event Camera + http://openaccess.thecvf.com//content/ICCV2023/papers/Nunes_Time-to-Contact_Map_by_Joint_Estimation_of_Up-to-Scale_Inverse_Depth_and_ICCV_2023_paper.pdf + Event cameras asynchronously report brightness changes with a temporal resolution in the order of microseconds, which makes them inherently suitable to address problems that involve rapid motion perception. In this paper, we address the problem of time-to-contact (TTC) estimation using a single event camera. This problem is typically addressed by estimating a single global TTC measure, which explicitly assumes that the surface/obstacle is planar and fronto-parallel. We relax this assumption by proposing an incremental event-based method to estimate the TTC that jointly estimates the (up-to scale) inverse depth and global motion using a single event camera. The proposed method is reliable and fast while asynchronously maintaining a TTC map (TTCM), which provides per-pixel TTC estimates. As a side product, the proposed method can also estimate per-event optical flow. We achieve state-of-the-art performances on TTC estimation in terms of accuracy and runtime per event while achieving competitive performance on optical flow estimation. + + + + A Benchmark for Chinese-English Scene Text Image Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_A_Benchmark_for_Chinese-English_Scene_Text_Image_Super-Resolution_ICCV_2023_paper.pdf + Scene Text Image Super-resolution (STISR) aims to recover high-resolution (HR) scene text images with visually pleasant and readable text content from the given low-resolution (LR) input. Most existing works focus on recovering English texts, which have simple structures in the characters, while little work has been done on the more challenging Chinese texts with diverse and complex character structures. In this paper, we propose a real-world Chinese-English benchmark dataset, namely Real-CE, for the task of STISR with the emphasis on restoring structurally complex Chinese characters. The benchmark provides 1,935/783 real-world LR-HR text image pairs (contains 33,789 text lines in total) for training/testing in 2x and 4x zooming modes, complemented by detailed annotations, including detection boxes and text transcripts. Moreover, we design an edge-aware learning method, which provides structural supervision in image and feature domain, to effectively reconstruct the dense structures of Chinese characters. We conduct experiments on the proposed Real-CE benchmark and evaluate the existing STISR models with and without our edge-aware loss. The benchmark, including data and source code, will be made publicly available. 
+ + + + Replay: Multi-modal Multi-view Acted Videos for Casual Holography + http://openaccess.thecvf.com//content/ICCV2023/papers/Shapovalov_Replay_Multi-modal_Multi-view_Acted_Videos_for_Casual_Holography_ICCV_2023_paper.pdf + We introduce Replay, a collection of multi-view, multi-modal videos of humans interacting socially. Each scene is filmed in high production quality, from different viewpoints with several static cameras, as well as wearable action cameras, and recorded with a large array of microphones at different positions in the room. Overall, the dataset contains over 3000 minutes of footage and over 5 million timestamped high-resolution frames annotated with camera poses and partially with foreground masks. The Replay dataset has many potential applications, such as novel-view synthesis, 3D reconstruction, novel-view acoustic synthesis, human body and face analysis, and training generative models. We provide a benchmark for training and evaluating novel-view synthesis, with two scenarios of different difficulty. Finally, we evaluate several baseline state-of-the-art methods on the new benchmark. + + + + Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Affine-Consistent_Transformer_for_Multi-Class_Cell_Nuclei_Detection_ICCV_2023_paper.pdf + Multi-class cell nuclei detection is a fundamental prerequisite in the diagnosis of histopathology. It is critical to efficiently locate and identify cells with diverse morphology and distributions in digital pathological images. Most existing methods take complex intermediate representations as learning targets and rely on inflexible post-refinements while paying less attention to various cell density and fields of view. In this paper, we propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions and is trained collaboratively through two sub-networks, a global and a local network. The local branch learns to infer distorted input images of smaller scales while the global network outputs the large-scale predictions as extra supervision signals. We further introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training. The AAT module works by learning to capture the transformed image regions that are more valuable for training the model. Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks. + + + + Removing Anomalies as Noises for Industrial Defect Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_Removing_Anomalies_as_Noises_for_Industrial_Defect_Localization_ICCV_2023_paper.pdf + Unsupervised anomaly detection aims to train models with only anomaly-free images to detect and localize unseen anomalies. Previous reconstruction-based methods have been limited by inaccurate reconstruction results. This work presents a denoising model to detect and localize the anomalies with a generative diffusion model. In particular, we introduce random noise to overwhelm the anomalous pixels and obtain pixel-wise precise anomaly scores from the intermediate denoising process. We find that the KL divergence of the diffusion model serves as a better anomaly score compared with the traditional RGB space score. 
Furthermore, we reconstruct the features from a pre-trained deep feature extractor as our feature level score to improve localization performance. Moreover, we propose a gradient denoising process to smoothly transform an anomalous image into a normal one. Our denoising model outperforms the state-of-the-art reconstruction-based anomaly detection methods for precise anomaly localization and high-quality normal image reconstruction on the MVTec-AD benchmark. + + + + GPGait: Generalized Pose-based Gait Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Fu_GPGait_Generalized_Pose-based_Gait_Recognition_ICCV_2023_paper.pdf + Recent works on pose-based gait recognition have demonstrated the potential of using such simple information to achieve results comparable to silhouette-based methods. However, the generalization ability of pose-based methods on different datasets is undesirably inferior to that of silhouette-based ones, which has received little attention but hinders the application of these methods in real-world scenarios. To improve the generalization ability of pose-based methods across datasets, we propose a Generalized Pose-based Gait recognition (GPGait) framework. First, a Human-Oriented Transformation (HOT) and a series of Human-Oriented Descriptors (HOD) are proposed to obtain a unified pose representation with discriminative multi-features. Then, given the slight variations in the unified representation after HOT and HOD, it becomes crucial for the network to extract local-global relationships between the keypoints. To this end, a Part-Aware Graph Convolutional Network (PAGCN) is proposed to enable efficient graph partition and local-global spatial feature extraction. Experiments on four public gait recognition datasets, CASIA-B, OUMVLP-Pose, Gait3D and GREW, show that our model demonstrates better and more stable cross-domain capabilities compared to existing skeleton-based methods, achieving comparable recognition results to silhouette-based ones. Code is available at https://github.com/BNU-IVC/FastPoseGait. + + + + Stable and Causal Inference for Discriminative Self-supervised Deep Visual Representations + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Stable_and_Causal_Inference_for_Discriminative_Self-supervised_Deep_Visual_Representations_ICCV_2023_paper.pdf + In recent years, discriminative self-supervised methods have made significant strides in advancing various visual tasks. The central idea of learning a data encoder that is robust to data distortions/augmentations is straightforward yet highly effective. Although many studies have demonstrated the empirical success of various learning methods, the resulting learned representations can exhibit instability and hinder downstream performance. In this study, we analyze discriminative self-supervised methods from a causal perspective to explain these unstable behaviors and propose solutions to overcome them. Our approach draws inspiration from prior works that empirically demonstrate the ability of discriminative self-supervised methods to demix ground truth causal sources to some extent. Unlike previous work on causality-empowered representation learning, we do not apply our solutions during the training process but rather during the inference process to improve time efficiency. 
Through experiments on both controlled image datasets and realistic image datasets, we show that our proposed solutions, which involve tempering a linear transformation with controlled synthetic data, are effective in addressing these issues. + + + + Semantic Attention Flow Fields for Monocular Dynamic Scene Decomposition + http://openaccess.thecvf.com//content/ICCV2023/papers/Liang_Semantic_Attention_Flow_Fields_for_Monocular_Dynamic_Scene_Decomposition_ICCV_2023_paper.pdf + From video, we reconstruct a neural volume that captures time-varying color, density, scene flow, semantics, and attention information. The semantics and attention let us identify salient foreground objects separately from the background across spacetime. To mitigate low resolution semantic and attention features, we compute pyramids that trade detail with whole-image context. After optimization, we perform a saliency-aware clustering to decompose the scene. To evaluate real-world scenes, we annotate object masks in the NVIDIA Dynamic Scene and DyCheck datasets. We demonstrate that this method can decompose dynamic scenes in an unsupervised way with competitive performance to a supervised method, and that it improves foreground/background segmentation over recent static/dynamic split methods. Project webpage: https://visual.cs.brown.edu/saff + + + + A Fast Unified System for 3D Object Detection and Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Heitzinger_A_Fast_Unified_System_for_3D_Object_Detection_and_Tracking_ICCV_2023_paper.pdf + We present FUS3D, a fast and lightweight system for real-time 3D object detection and tracking on edge devices. Our approach seamlessly integrates stages for 3D object detection and multi-object-tracking into a single, end-to-end trainable model. FUS3D is specially tuned for indoor 3D human behavior analysis, with target applications in Ambient Assisted Living (AAL) or surveillance. The system is optimized for inference on the edge, thus enabling sensor-near processing of potentially sensitive data. In addition, our system relies exclusively on the less privacy-intrusive 3D depth imaging modality, thus further highlighting the potential of our method for application in sensitive areas. FUS3D achieves best results when utilized in a joint detection and tracking configuration. Nevertheless, the proposed detection stage can function as a fast standalone object detection model if required. We have evaluated FUS3D extensively on the MIPT dataset and demonstrated its superior performance over comparable existing state-of-the-art methods in terms of 3D object detection, multi-object tracking, and most importantly, runtime. Model code will be made publicly available. + + + + AIDE: A Vision-Driven Multi-View, Multi-Modal, Multi-Tasking Dataset for Assistive Driving Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_AIDE_A_Vision-Driven_Multi-View_Multi-Modal_Multi-Tasking_Dataset_for_Assistive_Driving_ICCV_2023_paper.pdf + Driver distraction has become a significant cause of severe traffic accidents over the past decade. Despite the growing development of vision-driven driver monitoring systems, the lack of comprehensive perception datasets restricts road safety and traffic security. In this paper, we present an AssIstive Driving pErception dataset (AIDE) that considers context information both inside and outside the vehicle in naturalistic scenarios. 
AIDE facilitates holistic driver monitoring through three distinctive characteristics, including multi-view settings of driver and scene, multi-modal annotations of face, body, posture, and gesture, and four pragmatic task designs for driving understanding. To thoroughly explore AIDE, we provide experimental benchmarks on three kinds of baseline frameworks via extensive methods. Moreover, two fusion strategies are introduced to give new insights into learning effective multi-stream/modal representations. We also systematically investigate the importance and rationality of the key components in AIDE and benchmarks. The project link is https://github.com/ydk122024/AIDE. + + + + Self-Supervised Character-to-Character Distillation for Text Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Guan_Self-Supervised_Character-to-Character_Distillation_for_Text_Recognition_ICCV_2023_paper.pdf + When handling complicated text images (e.g., irregular structures, low resolution, heavy occlusion, and uneven illumination), existing supervised text recognition methods are data-hungry. Although these methods employ large-scale synthetic text images to reduce the dependence on annotated real images, the domain gap still limits the recognition performance. Therefore, exploring the robust text feature representations on unlabeled real images by self-supervised learning is a good solution. However, existing self-supervised text recognition methods conduct sequence-to-sequence representation learning by roughly splitting the visual features along the horizontal axis, which limits the flexibility of the augmentations, as large geometric-based augmentations may lead to sequence-to-sequence feature inconsistency. Motivated by this, we propose a novel self-supervised Character-to-Character Distillation method, CCD, which enables versatile augmentations to facilitate general text representation learning. Specifically, we delineate the character structures of unlabeled real images by designing a self-supervised character segmentation module. Following this, CCD easily enriches the diversity of local characters while keeping their pairwise alignment under flexible augmentations, using the transformation matrix between two augmented views from images. Experiments demonstrate that CCD achieves state-of-the-art results, with average performance gains of 1.38% in text recognition, 1.7% in text segmentation, 0.24 dB (PSNR) and 0.0321 (SSIM) in text super-resolution. Code will be released soon. + + + + Domain Adaptive Few-Shot Open-Set Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Pal_Domain_Adaptive_Few-Shot_Open-Set_Learning_ICCV_2023_paper.pdf + Few-shot learning has made impressive strides in addressing the crucial challenges of recognizing unknown samples from novel classes in target query sets and managing visual shifts between domains. However, existing techniques fall short when it comes to identifying target outliers under domain shifts by learning to reject pseudo-outliers from the source domain, resulting in an incomplete solution to both problems. To address these challenges comprehensively, we propose a novel approach called Domain Adaptive Few-Shot Open Set Recognition (DA-FSOS) and introduce a meta-learning-based architecture named DAFOS-Net. During training, our model learns a shared and discriminative embedding space while creating a pseudo-open-space decision boundary, given a fully-supervised source domain and a label-disjoint few-shot target domain. 
To enhance data density, we use a pair of conditional adversarial networks with tunable noise variances to augment both domains' closed and pseudo-open spaces. Furthermore, we propose a domain-specific batch-normalized class prototypes alignment strategy to align both domains globally while ensuring class-discriminativeness through novel metric objectives. Our training approach ensures that DAFOS-Net can generalize well to new scenarios in the target domain. We present three benchmarks for DA-FSOS based on the Office-Home, mini-ImageNet/CUB, and DomainNet datasets and demonstrate the efficacy of DAFOS-Net through extensive experimentation. + + + + Interactive Class-Agnostic Object Counting + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Interactive_Class-Agnostic_Object_Counting_ICCV_2023_paper.pdf + We propose a novel framework for interactive class-agnostic object counting, where a human user can interactively provide feedback to improve the accuracy of a counter. Our framework consists of two main components: a user-friendly visualizer to gather feedback and an efficient mechanism to incorporate it. In each iteration, we produce a density map to show the current prediction result, and we segment it into non-overlapping regions with an easily verifiable number of objects. The user can provide feedback by selecting a region with obvious counting errors and specifying the range for the estimated number of objects within it. To improve the counting result, we develop a novel adaptation loss to force the visual counter to output the predicted count within the user-specified range. For effective and efficient adaptation, we propose a refinement module that can be used with any density-based visual counter, and only the parameters in the refinement module will be updated during adaptation. Our experiments on two challenging class-agnostic object counting benchmarks, FSCD-LVIS and FSC-147, show that our method can reduce the mean absolute error of multiple state-of-the-art visual counters by roughly 30% to 40% with minimal user input. Our project can be found at https://yifehuang97.github.io/ICACountProjectPage/. + + + + Estimator Meets Equilibrium Perspective: A Rectified Straight Through Estimator for Binary Neural Networks Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Estimator_Meets_Equilibrium_Perspective_A_Rectified_Straight_Through_Estimator_for_ICCV_2023_paper.pdf + Binarization of neural networks is a dominant paradigm in neural networks compression. The pioneering work BinaryConnect uses Straight Through Estimator (STE) to mimic the gradients of the sign function, but it also causes the crucial inconsistency problem. Most of the previous methods design different estimators instead of STE to mitigate it. However, they ignore the fact that when reducing the estimating error, the gradient stability will decrease concomitantly. These highly divergent gradients will harm the model training and increase the risk of gradient vanishing and gradient exploding. To fully take the gradient stability into consideration, we present a new perspective to the BNNs training, regarding it as the equilibrium between the estimating error and the gradient stability. In this view, we firstly design two indicators to quantitatively demonstrate the equilibrium phenomenon. 
In addition, in order to balance the estimating error and the gradient stability well, we revise the original straight through estimator and propose a power function based estimator, Rectified Straight Through Estimator (ReSTE for short). Comparing to other estimators, ReSTE is rational and capable of flexibly balancing the estimating error with the gradient stability. Extensive experiments on CIFAR-10 and ImageNet datasets show that ReSTE has excellent performance and surpasses the state-of-the-art methods without any auxiliary modules or losses. + + + + MedKLIP: Medical Knowledge Enhanced Language-Image Pre-Training for X-ray Diagnosis + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_MedKLIP_Medical_Knowledge_Enhanced_Language-Image_Pre-Training_for_X-ray_Diagnosis_ICCV_2023_paper.pdf + In this paper, we consider enhancing medical visual-language pre-training (VLP) with domain-specific knowledge, by exploiting the paired image-text reports from the radiological daily practice. In particular, we make the following contributions: First, unlike existing works that directly process the raw reports, we adopt a novel triplet extraction module to extract the medical-related information, avoiding unnecessary complexity from language grammar and enhancing the supervision signals; Second, we propose a novel triplet encoding module with entity translation by querying a knowledge base, to exploit the rich domain knowledge in medical field, and implicitly build relationships between medical entities in the language embedding space; Third, we propose to use a Transformer-based fusion model for spatially aligning the entity description with visual signals at the image patch level, enabling the ability for medical diagnosis; Fourth, we conduct thorough experiments to validate the effectiveness of our architecture, and benchmark on numerous public benchmarks e.g., ChestX-ray14, RSNA Pneumonia, SIIM-ACR Pneumothorax, COVIDx CXR-2, COVID Rural, and EdemaSeverity. In both zero-shot and fine-tuning settings, our model has demonstrated strong performance compared with the former methods on disease classification and grounding. + + + + Automated Knowledge Distillation via Monte Carlo Tree Search + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Automated_Knowledge_Distillation_via_Monte_Carlo_Tree_Search_ICCV_2023_paper.pdf + In this paper, we present Auto-KD, the first automated search framework for optimal knowledge distillation design. Traditional distillation techniques typically require handcrafted designs by experts and extensive tuning costs for different teacher-student pairs. To address these issues, we empirically study different distillers, finding that they can be decomposed, combined, and simplified. Based on these observations, we build our uniform search space with advanced operations in transformations, distance functions, and hyperparameters components. For instance, the transformation parts are optional for global, intra-spatial, and inter-spatial operations, such as attention, mask, and multi-scale. Then, we introduce an effective search strategy based on the Monte Carlo tree search, modeling the search space as a Monte Carlo Tree (MCT) to capture the dependency among options. The MCT is updated using test loss and representation gap of student trained by candidate distillers as the reward for better exploration-exploitation balance. 
To accelerate the search process, we exploit offline processing without teacher inference, sparse training for the student, and proxy settings based on distillation properties. In this way, our Auto-KD requires only a small cost to search for optimal distillers before the distillation phase. Moreover, we extend Auto-KD to multi-layer and multi-teacher scenarios with training-free weighted factors. Our method is promising yet practical, and extensive experiments demonstrate that it generalizes well to different CNNs and Vision Transformer models and attains state-of-the-art performance across a range of vision tasks, including image classification, object detection, and semantic segmentation. Code is provided at https://github.com/lilujunai/Auto-KD. + + + + EmoTalk: Speech-Driven Emotional Disentanglement for 3D Face Animation + http://openaccess.thecvf.com//content/ICCV2023/papers/Peng_EmoTalk_Speech-Driven_Emotional_Disentanglement_for_3D_Face_Animation_ICCV_2023_paper.pdf + Speech-driven 3D face animation aims to generate realistic facial expressions that match the speech content and emotion. However, existing methods often neglect emotional facial expressions or fail to disentangle them from speech content. To address this issue, this paper proposes an end-to-end neural network to disentangle different emotions in speech so as to generate rich 3D facial expressions. Specifically, we introduce the emotion disentangling encoder (EDE) to disentangle the emotion and content in the speech by cross-reconstructed speech signals with different emotion labels. Then an emotion-guided feature fusion decoder is employed to generate a 3D talking face with enhanced emotion. The decoder is driven by the disentangled identity, emotional, and content embeddings so as to generate controllable personal and emotional styles. Finally, considering the scarcity of 3D emotional talking face data, we resort to the supervision of facial blendshapes, which enables the reconstruction of plausible 3D faces from 2D emotional data, and contribute a large-scale 3D emotional talking face dataset (3D-ETF) to train the network. Our experiments and user studies demonstrate that our approach outperforms state-of-the-art methods and exhibits more diverse facial movements. We recommend watching the supplementary video: https://ziqiaopeng.github.io/emotalk + + + + Text-Conditioned Sampling Framework for Text-to-Image Generation with Masked Generative Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Text-Conditioned_Sampling_Framework_for_Text-to-Image_Generation_with_Masked_Generative_Models_ICCV_2023_paper.pdf + Token-based masked generative models are gaining popularity for their fast inference time with parallel decoding. While recent token-based approaches achieve performance competitive with diffusion-based models, their generation performance is still suboptimal as they sample multiple tokens simultaneously without considering the dependence among them. We empirically investigate this problem and propose a learnable sampling model, Text-Conditioned Token Selection (TCTS), to select optimal tokens via localized supervision with text information. TCTS improves not only the image quality but also the semantic alignment of the generated images with the given texts. To further improve the image quality, we introduce a cohesive sampling strategy, Frequency Adaptive Sampling (FAS), to each group of tokens divided according to the self-attention maps. 
We validate the efficacy of TCTS combined with FAS with various generative tasks, demonstrating that it significantly outperforms the baselines in image-text alignment and image quality. Our text-conditioned sampling framework further reduces the original inference time by more than 50% without modifying the original generative model. + + + + Inverse Problem Regularization with Hierarchical Variational Autoencoders + http://openaccess.thecvf.com//content/ICCV2023/papers/Prost_Inverse_Problem_Regularization_with_Hierarchical_Variational_Autoencoders_ICCV_2023_paper.pdf + In this paper, we propose to regularize ill-posed inverse problems using a deep hierarchical Variational AutoEncoder (HVAE) as an image prior. The proposed method synthesizes the advantages of i) denoiser-based Plug & Play approaches and ii) generative model based approaches to inverse problems. First, we exploit VAE properties to design an efficient algorithm that benefits from convergence guarantees of Plug-and-Play (PnP) methods. Second, our approach is not restricted to specialized datasets and the proposed PnP-HVAE model is able to solve image restoration problems on natural images of any size. Our experiments show that the proposed PnP-HVAE method is competitive with both SOTA denoiser-based PnP approaches, and other SOTA restoration methods based on generative models. The code for this project is available at https://github.com/jprost76/PnP-HVAE. + + + + Unpaired Multi-domain Attribute Translation of 3D Facial Shapes with a Square and Symmetric Geometric Map + http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_Unpaired_Multi-domain_Attribute_Translation_of_3D_Facial_Shapes_with_a_ICCV_2023_paper.pdf + While impressive progress has recently been made in image-oriented facial attribute translation, shape-oriented 3D facial attribute translation remains an unsolved issue. This is primarily limited by the lack of 3D generative models and ineffective usage of 3D facial data. We propose a learning framework for 3D facial attribute translation to relieve these limitations. Firstly, we customize a novel geometric map for 3D shape representation and embed it in an end-to-end generative adversarial network. The geometric map represents 3D shapes symmetrically on a square image grid, while preserving the neighboring relationship of 3D vertices in a local least-square sense. This enables effective learning for the latent representation of data with different attributes. Secondly, we employ a unified and unpaired learning framework for multi-domain attribute translation. It not only makes effective usage of data correlation from multiple domains, but also mitigates the constraint for hardly accessible paired data. Finally, we propose a hierarchical architecture for the discriminator to guarantee robust results against both global and local artifacts. We conduct extensive experiments to demonstrate the advantage of the proposed framework over the state-of-the-art in generating high-fidelity facial shapes. Given an input 3D facial shape, the proposed framework is able to synthesize novel shapes of different attributes, which covers some downstream applications, such as expression transfer, gender translation, and aging. Code at https://github.com/NaughtyZZ/3D_facial_shape_attribute_translation_ssgmap. 
+ + + + Template Inversion Attack against Face Recognition Systems using 3D Face Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Shahreza_Template_Inversion_Attack_against_Face_Recognition_Systems_using_3D_Face_ICCV_2023_paper.pdf + Face recognition systems are increasingly being used in different applications. In such systems, some features (also known as embeddings or templates) are extracted from each face image. Then, the extracted templates are stored in the system's database during the enrollment stage and are later used for recognition. In this paper, we focus on template inversion attacks against face recognition systems and introduce a novel method (dubbed GaFaR) to reconstruct 3D face from facial templates. To this end, we use a geometry-aware generator network based on generative neural radiance fields (GNeRF), and learn a mapping from facial templates to the intermediate latent space of the generator network. We train our network with a semi-supervised learning approach using real and synthetic images simultaneously. For the real training data, we use a Generative Adversarial Network (GAN) based framework to learn the distribution of the latent space. For the synthetic training data, where we have the true latent code, we directly train in the latent space of the generator network. In addition, during the inference stage, we also propose optimization on the camera parameters to generate face images to improve the success attack rate (up to 17.14% in our experiments). We evaluate the performance of our method in the whitebox and blackbox attacks against state-of-the-art face recognition models on the LFW and MOBIO datasets. To our knowledge, this paper is the first work on 3D face reconstruction from facial templates. The project page is available at: https://www.idiap.ch/paper/gafar + + + + ETran: Energy-Based Transferability Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Gholami_ETran_Energy-Based_Transferability_Estimation_ICCV_2023_paper.pdf + This paper addresses the problem of ranking pre-trained models for object detection and image classification. Selecting the best pre-trained model by fine-tuning is an expensive and time-consuming task. Previous works have proposed transferability estimation based on features extracted by the pre-trained models. We argue that quantifying whether the target dataset is in-distribution (IND) or out-of-distribution (OOD) for the pre-trained model is an important factor in the transferability estimation. To this end, we propose ETran, an energy-based transferability assessment metric, which includes three scores: 1) energy score, 2) classification score, and 3) regression score. We use energy-based models to determine whether the target dataset is OOD or IND for the pre-trained model. In contrast to the prior works, ETran is applicable to a wide range of tasks including classification, regression, and object detection (classification+regression). This is the first work that proposes transferability estimation for object detection task. Our extensive experiments on four benchmarks and two tasks show that ETran outperforms previous works on object detection and classification benchmarks by an average of 21% and 12%, respectively, and achieves SOTA in transferability assessment. 
+ + + + Predict to Detect: Prediction-guided 3D Object Detection using Sequential Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Predict_to_Detect_Prediction-guided_3D_Object_Detection_using_Sequential_Images_ICCV_2023_paper.pdf + Recent camera-based 3D object detection methods have introduced sequential frames to improve the detection performance hoping that multiple frames would mitigate the large depth estimation error. Despite improved detection performance, prior works rely on naive fusion methods (e.g., concatenation) or are limited to static scenes (e.g., temporal stereo), neglecting the importance of the motion cue of objects. These approaches do not fully exploit the potential of sequential images and show limited performance improvements. To address this limitation, we propose a novel 3D object detection model, P2D (Predict to Detect), that integrates a prediction scheme into a detection framework to explicitly extract and leverage motion features. P2D predicts object information in the current frame using solely past frames to learn temporal motion features. We then introduce a novel temporal feature aggregation method that attentively exploits Bird's-Eye-View (BEV) features based on predicted object information, resulting in accurate 3D object detection. Experimental results demonstrate that P2D improves mAP and NDS by 3.0% and 3.7% compared to the sequential image-based baseline, proving that incorporating a prediction scheme can significantly improve detection accuracy. + + + + IDiff-Face: Synthetic-based Face Recognition through Fizzy Identity-Conditioned Diffusion Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Boutros_IDiff-Face_Synthetic-based_Face_Recognition_through_Fizzy_Identity-Conditioned_Diffusion_Model_ICCV_2023_paper.pdf + The availability of large-scale authentic face databases has been crucial to the significant advances made in face recognition research over the past decade. However, legal and ethical concerns led to the recent retraction of many of these databases by their creators, raising questions about the continuity of future face recognition research without one of its key resources. Synthetic datasets have emerged as a promising alternative to privacy-sensitive authentic data for face recognition development. However, recent synthetic datasets that are used to train face recognition models suffer either from limitations in intra-class diversity or cross-class (identity) discrimination, leading to less optimal accuracies, far away from the accuracies achieved by models trained on authentic data. This paper targets this issue by proposing IDiff-Face, a novel approach based on conditional latent diffusion models for synthetic identity generation with realistic identity variations for face recognition training. Through extensive evaluations, our proposed synthetic-based face recognition approach pushed the limits of state-of-the-art performances, achieving, for example, 98.00% accuracy on the Labeled Faces in the Wild (LFW) benchmark, far ahead from the recent synthetic-based face recognition solutions with 95.40% and bridging the gap to authentic-based face recognition with 99.82% accuracy. 
+ + + + Rethinking Multi-Contrast MRI Super-Resolution: Rectangle-Window Cross-Attention Transformer and Arbitrary-Scale Upsampling + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Rethinking_Multi-Contrast_MRI_Super-Resolution_Rectangle-Window_Cross-Attention_Transformer_and_Arbitrary-Scale_Upsampling_ICCV_2023_paper.pdf + Recently, several methods have explored the potential of multi-contrast magnetic resonance imaging (MRI) super-resolution (SR) and obtain results superior to single-contrast SR methods. However, existing approaches still have two shortcomings: (1) They can only address fixed integer upsampling scales, such as 2x, 3x, and 4x, which require training and storing the corresponding model separately for each upsampling scale in clinic. (2) They lack direct interaction among different windows as they adopt the square window (e.g., 8x8) transformer network architecture, which results in inadequate modelling of longer-range dependencies. Moreover, the relationship between reference images and target images is not fully mined. To address these issues, we develop a novel network for multi-contrast MRI arbitrary-scale SR, dubbed as McASSR. Specifically, we design a rectangle-window cross-attention transformer to establish longer-range dependencies in MR images without increasing computational complexity and fully use reference information. Besides, we propose the reference-aware implicit attention as an upsampling module, achieving arbitrary-scale super-resolution via implicit neural representation, further fusing supplementary information of the reference image. Extensive and comprehensive experiments on both public and clinical datasets show that our McASSR yields superior performance over SOTA methods, demonstrating its great potential to be applied in clinical practice. + + + + Towards Open-Set Test-Time Adaptation Utilizing the Wisdom of Crowds in Entropy Minimization + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Towards_Open-Set_Test-Time_Adaptation_Utilizing_the_Wisdom_of_Crowds_in_ICCV_2023_paper.pdf + Test-time adaptation (TTA) methods, which generally rely on the model's predictions (e.g., entropy minimization) to adapt the source pretrained model to the unlabeled target domain, suffer from noisy signals originating from 1) incorrect or 2) open-set predictions. Long-term stable adaptation is hampered by such noisy signals, so training models without such error accumulation is crucial for practical TTA. To address these issues, including open-set TTA, we propose a simple yet effective sample selection method inspired by the following crucial empirical finding. While entropy minimization compels the model to increase the probability of its predicted label (i.e., confidence values), we found that noisy samples rather show decreased confidence values. To be more specific, entropy minimization attempts to raise the confidence values of an individual sample's prediction, but individual confidence values may rise or fall due to the influence of signals from numerous other predictions (i.e., wisdom of crowds). Due to this fact, noisy signals misaligned with such 'wisdom of crowds', generally found in the correct signals, fail to raise the individual confidence values of wrong samples, despite attempts to increase them. Based on such findings, we filter out the samples whose confidence values are lower in the adapted model than in the original model, as they are likely to be noisy. 
Our method is widely applicable to existing TTA methods and improves their long-term adaptation performance in both image classification (e.g., 49.4% reduced error rates with TENT) and semantic segmentation (e.g., 11.7% gain in mIoU with TENT). + + + + Long-Range Grouping Transformer for Multi-View 3D Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Long-Range_Grouping_Transformer_for_Multi-View_3D_Reconstruction_ICCV_2023_paper.pdf + Nowadays, transformer networks have demonstrated superior performance in many computer vision tasks. In a multi-view 3D reconstruction algorithm following this paradigm, self-attention processing has to deal with intricate image tokens that carry massive amounts of information when many views are given as input. This curse of information content makes model learning extremely difficult. To alleviate this problem, recent methods compress the number of tokens representing each view or discard the attention operations between the tokens from different views. Obviously, this has a negative impact on performance. Therefore, we propose long-range grouping attention (LGA) based on the divide-and-conquer principle. Tokens from all views are grouped for separate attention operations. The tokens in each group are sampled from all views and can provide a macro representation for the view they reside in. The richness of feature learning is guaranteed by the diversity among different groups. An effective and efficient encoder can thus be established that connects inter-view features using LGA and extracts intra-view features using the standard self-attention layer. Moreover, a novel progressive upsampling decoder is also designed for voxel generation with relatively high resolution. Building on the above, we construct a powerful transformer-based network, called LRGT. Experimental results on ShapeNet verify that our method achieves SOTA accuracy in multi-view reconstruction. Code is available at https://github.com/LiyingCV/Long-Range-Grouping-Transformer. + + + + DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_DenseShift_Towards_Accurate_and_Efficient_Low-Bit_Power-of-Two_Quantization_ICCV_2023_paper.pdf + Efficiently deploying deep neural networks on low-resource edge devices is challenging due to their ever-increasing resource requirements. To address this issue, researchers have proposed multiplication-free neural networks, such as Power-of-Two quantization, also known as Shift networks, which aim to reduce memory usage and simplify computation. However, existing low-bit Shift networks are not as accurate as their full-precision counterparts, typically suffering from limited weight range encoding schemes and quantization loss. In this paper, we propose the DenseShift network, which significantly improves the accuracy of Shift networks, achieving performance competitive with full-precision networks for vision and speech applications. In addition, we introduce a method to deploy an efficient DenseShift network using non-quantized floating-point activations, while obtaining 1.6X speed-up over existing methods. To achieve this, we demonstrate that zero-weight values in low-bit Shift networks do not contribute to model capacity and negatively impact inference computation. To address this issue, we propose a zero-free shifting mechanism that simplifies inference and increases model capacity. 
We further propose a sign-scale decomposition design to enhance training efficiency and a low-variance random initialization strategy to improve the model's transfer learning performance. Our extensive experiments on various computer vision and speech tasks demonstrate that DenseShift outperforms existing low-bit multiplication-free networks and achieves competitive performance compared to full-precision networks. Furthermore, our proposed approach exhibits strong transfer learning performance without a drop in accuracy. Our code was released on GitHub. + + + + Efficient Computation Sharing for Multi-Task Visual Scene Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Shoouri_Efficient_Computation_Sharing_for_Multi-Task_Visual_Scene_Understanding_ICCV_2023_paper.pdf + Solving multiple visual tasks using individual models can be resource-intensive, while multi-task learning can conserve resources by sharing knowledge across different tasks. Despite the benefits of multi-task learning, such techniques can struggle with balancing the loss for each task, leading to potential performance degradation. We present a novel computation- and parameter-sharing framework that balances efficiency and accuracy to perform multiple visual tasks utilizing individually-trained single-task transformers. Our method is motivated by transfer learning schemes to reduce computational and parameter storage costs while maintaining the desired performance. Our approach involves splitting the tasks into a base task and the other sub-tasks, and sharing a significant portion of activations and parameters/weights between the base and sub-tasks to decrease inter-task redundancies and enhance knowledge sharing. The evaluation conducted on NYUD-v2 and PASCAL-context datasets shows that our method is superior to the state-of-the-art transformer-based multi-task learning techniques with higher accuracy and reduced computational resources. Moreover, our method is extended to video stream inputs, further reducing computational costs by efficiently sharing information across the temporal domain as well as the task domain. Our codes are available at https://github.com/sarashoouri/EfficientMTL. + + + + DDIT: Semantic Scene Completion via Deformable Deep Implicit Templates + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_DDIT_Semantic_Scene_Completion_via_Deformable_Deep_Implicit_Templates_ICCV_2023_paper.pdf + Scene reconstructions are often incomplete due to occlusions and limited viewpoints. There have been efforts to use semantic information for scene completion. However, the completed shapes may be rough and imprecise since respective methods rely on 3D convolution and/or lack effective shape constraints. To overcome these limitations, we propose a semantic scene completion method based on deformable deep implicit templates (DDIT). Specifically, we complete each segmented instance in a scene by deforming a template with a latent code. Such a template is expressed by a deep implicit function in the canonical frame. It abstracts the shape prior of a category, and thus can provide constraints on the overall shape of an instance. Latent code controls the deformation of template to guarantee fine details of an instance. For code prediction, we design a neural network that leverages both intra- and inter-instance information. We also introduce an algorithm to transform instances between the world and canonical frames based on geometric constraints and a hierarchical tree. 
To further improve accuracy, we jointly optimize the latent code and transformation by enforcing the zero-valued isosurface constraint. In addition, we establish a new dataset to solve different problems of existing datasets. Experiments showed that our DDIT outperforms state-of-the-art approaches. + + + + Attention Where It Matters: Rethinking Visual Document Understanding with Selective Region Concentration + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_Attention_Where_It_Matters_Rethinking_Visual_Document_Understanding_with_Selective_ICCV_2023_paper.pdf + We propose a novel end-to-end document understanding model called SeRum (SElective Region Understanding Model) for extracting meaningful information from document images, including document analysis, retrieval, and office automation. Unlike state-of-the-art approaches that rely on multi-stage technical schemes and are computationally expensive, SeRum converts document image understanding and recognition tasks into a local decoding process of the vision tokens of interest, using a content-aware token merge module. This mechanism enables the model to pay more attention to regions of interest generated by the query decoder, improving the model's effectiveness and speeding up the decoding speed of the generative scheme. We also designed several pre-training tasks to enhance the understanding and local awareness of the model. Experimental results demonstrate that SeRum achieves state-of-the-art performance on document understanding tasks and competitive results on text spotting tasks. SeRum represents a substantial advancement towards enabling efficient and effective end-to-end document understanding. + + + + 3D Neural Embedding Likelihood: Probabilistic Inverse Graphics for Robust 6D Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_3D_Neural_Embedding_Likelihood_Probabilistic_Inverse_Graphics_for_Robust_6D_ICCV_2023_paper.pdf + The ability to perceive and understand 3D scenes is crucial for many applications in computer vision and robotics. Inverse graphics is an appealing approach to 3D scene understanding that aims to infer the 3D scene structure from 2D images. In this paper, we introduce probabilistic modeling to the inverse graphics framework to quantify uncertainty and achieve robustness in 6D pose estimation tasks. Specifically, we propose 3D Neural Embedding Likelihood (3DNEL) as a unified probabilistic model over RGB-D images, and develop efficient inference procedures on 3D scene descriptions. 3DNEL effectively combines learned neural embeddings from RGB with depth information to improve robustness in sim-to-real 6D object pose estimation from RGB-D images. Performance on the YCB-Video dataset is on par with state-of-the-art yet is much more robust in challenging regimes. In contrast to discriminative approaches, 3DNEL's probabilistic generative formulation jointly models multiple objects in a scene, quantifies uncertainty in a principled way, and handles object pose tracking under heavy occlusion. Finally, 3DNEL provides a principled framework for incorporating prior knowledge about the scene and objects, which allows natural extension to additional tasks like camera pose tracking from video. + + + + SupFusion: Supervised LiDAR-Camera Fusion for 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Qin_SupFusion_Supervised_LiDAR-Camera_Fusion_for_3D_Object_Detection_ICCV_2023_paper.pdf + LiDAR-Camera fusion-based 3D detection is a critical task for automatic driving. 
In recent years, many LiDAR-Camera fusion approaches have sprung up and achieved promising performance compared with single-modal detectors, but they often lack carefully designed and effective supervision for the fusion process. In this paper, we propose a novel training strategy called SupFusion, which provides auxiliary feature-level supervision for effective LiDAR-Camera fusion and significantly boosts detection performance. Our strategy involves a data enhancement method named Polar Sampling, which densifies sparse objects and trains an assistant model to generate high-quality features as the supervision. These features are then used to train the LiDAR-Camera fusion model, where the fusion feature is optimized to simulate the generated high-quality features. Furthermore, we propose a simple yet effective deep fusion module, which continuously gains superior performance compared with previous fusion methods under the SupFusion strategy. In this manner, our proposal offers the following advantages. Firstly, SupFusion introduces auxiliary feature-level supervision which can boost LiDAR-Camera detection performance without introducing extra inference costs. Secondly, the proposed deep fusion can continuously improve the detector's abilities. Our proposed SupFusion and deep fusion module are plug-and-play, and we conduct extensive experiments to demonstrate their effectiveness. Specifically, we gain around 2% 3D mAP improvements on the KITTI benchmark based on multiple LiDAR-Camera 3D detectors. Our code is available at https://github.com/IranQin/SupFusion. + + + + EMMN: Emotional Motion Memory Network for Audio-driven Emotional Talking Face Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Tan_EMMN_Emotional_Motion_Memory_Network_for_Audio-driven_Emotional_Talking_Face_ICCV_2023_paper.pdf + Synthesizing expressions is essential for creating realistic talking faces. Previous works consider expressions and mouth shapes as a whole and predict them solely from audio inputs. However, the limited information contained in audio, such as phonemes and coarse emotion embedding, may not be suitable as the source of elaborate expressions. Besides, since expressions are tightly coupled to lip motions, generating expressions from other sources is tricky and often neglects the expression performed in the mouth region, leading to inconsistency between them. To tackle these issues, this paper proposes Emotional Motion Memory Net (EMMN), which synthesizes the overall expression on the talking face from the emotion embedding and lip motion instead of audio alone. Specifically, we extract an emotion embedding from audio and design a Motion Reconstruction module to decompose ground truth videos into mouth features and expression features before training, where the latter encode all facial factors about expression. During training, the emotion embedding and mouth features are used as keys, and the corresponding expression features are used as values to create key-value pairs stored in the proposed Motion Memory Net. Hence, once the audio-relevant mouth features and emotion embedding are individually predicted from audio at inference time, we treat them as a query to retrieve the best-matching expression features, performing the expression over the whole face and thus avoiding inconsistent results. Extensive experiments demonstrate that our method can generate high-quality talking face videos with accurate lip movements and vivid expressions on unseen subjects. 
+ + + + Rethinking Vision Transformers for MobileNet Size and Speed + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Rethinking_Vision_Transformers_for_MobileNet_Size_and_Speed_ICCV_2023_paper.pdf + With the success of Vision Transformers (ViTs) in computer vision tasks, recent works try to optimize the performance and complexity of ViTs to enable efficient deployment on mobile devices. Multiple approaches have been proposed to accelerate the attention mechanism, improve inefficient designs, or incorporate mobile-friendly lightweight convolutions to form hybrid architectures. However, ViT and its variants still have higher latency or considerably more parameters than lightweight CNNs, even when compared to the years-old MobileNet. In practice, latency and size are both crucial for efficient deployment on resource-constrained hardware. In this work, we investigate a central question: can transformer models run as fast as MobileNet and maintain a similar size? We revisit the design choices of ViTs and propose a novel supernet with low latency and high parameter efficiency. We further introduce a novel fine-grained joint search strategy for transformer models that can find efficient architectures by optimizing latency and number of parameters simultaneously. The proposed models, EfficientFormerV2, achieve 3.5% higher top-1 accuracy than MobileNetV2 on ImageNet-1K with similar latency and parameters. This work demonstrates that properly designed and optimized vision transformers can achieve high performance even with MobileNet-level size and speed. + + + + Implicit Identity Representation Conditioned Memory Compensation Network for Talking Head Video Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Hong_Implicit_Identity_Representation_Conditioned_Memory_Compensation_Network_for_Talking_Head_ICCV_2023_paper.pdf + Talking head video generation aims to animate a human face in a still image with dynamic poses and expressions using motion information derived from a target-driving video, while maintaining the person's identity in the source image. However, dramatic and complex motions in the driving video cause ambiguous generation, because the still source image cannot provide sufficient appearance information for occluded regions or delicate expression variations, which produces severe artifacts and significantly degrades the generation quality. To tackle this problem, we propose to learn a global facial representation space, and design a novel implicit identity representation conditioned memory compensation network, coined as MCNet, for high-fidelity talking head generation. Specifically, we devise a network module to learn a unified spatial facial meta-memory bank from all training samples, which can provide rich facial structure and appearance priors to compensate warped source facial features for the generation. Furthermore, we propose an effective query mechanism based on implicit identity representations learned from the discrete keypoints of the source image. It can greatly facilitate the retrieval of more correlated information from the memory bank for the compensation. Extensive experiments demonstrate that MCNet can learn representative and complementary facial memory, and can clearly outperform previous state-of-the-art talking head generation methods on the VoxCeleb1 and CelebV datasets. 
+ + + + Chupa: Carving 3D Clothed Humans from Skinned Shape Priors using 2D Diffusion Probabilistic Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Chupa_Carving_3D_Clothed_Humans_from_Skinned_Shape_Priors_using_ICCV_2023_paper.pdf + We propose a 3D generation pipeline that uses diffusion models to generate realistic human digital avatars. Due to the wide variety of human identities, poses, and stochastic details, the generation of 3D human meshes has been a challenging problem. To address this, we decompose the problem into 2D normal map generation and normal map-based 3D reconstruction. Specifically, we first simultaneously generate realistic normal maps for the front and backside of a clothed human, dubbed dual normal maps, using a pose-conditional diffusion model. For 3D reconstruction, we "carve" the prior SMPL-X mesh to a detailed 3D mesh according to the normal maps through mesh optimization. To further enhance the high-frequency details, we present a diffusion resampling scheme on both body and facial regions, thus encouraging the generation of realistic digital avatars. We also seamlessly incorporate a recent text-to-image diffusion model to support text-based human identity control. Our method, namely, Chupa, is capable of generating realistic 3D clothed humans with better perceptual quality and identity variety. + + + + Going Beyond Nouns With Vision & Language Models Using Synthetic Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Cascante-Bonilla_Going_Beyond_Nouns_With_Vision__Language_Models_Using_Synthetic_ICCV_2023_paper.pdf + Large-scale pre-trained Vision & Language (VL) models have shown remarkable performance in many applications, enabling replacing a fixed set of supported classes with zero-shot open vocabulary reasoning over (almost arbitrary) natural language prompts. However, recent works have uncovered a fundamental weakness of these models. For example, their difficulty to understand Visual Language Concepts (VLC) that go 'beyond nouns' such as the meaning of non-object words (e.g., attributes, actions, relations, states, etc.), or difficulty in performing compositional reasoning such as understanding the significance of the order of the words in a sentence. In this work, we investigate to which extent purely synthetic data could be leveraged to teach these models to overcome such shortcomings without compromising their zero-shot capabilities. We contribute Synthetic Visual Concepts (SyViC) - a million-scale synthetic dataset and data generation codebase allowing to generate additional suitable data to improve VLC understanding and compositional reasoning of VL models. Additionally, we propose a general VL finetuning strategy for effectively leveraging SyViC towards achieving these improvements. Our extensive experiments and ablations on VL-Checklist, Winoground, and ARO benchmarks demonstrate that it is possible to adapt strong pre-trained VL models with synthetic data significantly enhancing their VLC understanding (e.g. by 9.9% on ARO and 4.3% on VL-Checklist) with under 1% drop in their zero-shot accuracy. + + + + Zero-Shot Contrastive Loss for Text-Guided Diffusion Image Style Transfer + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Zero-Shot_Contrastive_Loss_for_Text-Guided_Diffusion_Image_Style_Transfer_ICCV_2023_paper.pdf + Diffusion models have shown great promise in text-guided image style transfer, but there is a trade-off between style transformation and content preservation due to their stochastic nature. 
Existing methods require computationally expensive fine-tuning of diffusion models or additional neural network. To address this, here we propose a zero-shot contrastive loss for diffusion models that doesn't require additional fine-tuning or auxiliary networks. By leveraging patch-wise contrastive loss between generated samples and original image embeddings in the pre-trained diffusion model, our method can generate images with the same semantic content as the source image in a zero-shot manner. Our approach outperforms existing methods while preserving content and requiring no additional training, not only for image style transfer but also for image-to-image translation and manipulation. Our experimental results validate the effectiveness of our proposed method. + + + + Identity-Seeking Self-Supervised Representation Learning for Generalizable Person Re-Identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Dou_Identity-Seeking_Self-Supervised_Representation_Learning_for_Generalizable_Person_Re-Identification_ICCV_2023_paper.pdf + This paper aims to learn a domain-generalizable (DG) person re-identification (ReID) representation from large-scale videos without any annotation. Prior DG ReID methods employ limited labeled data for training due to the high cost of annotation, which restricts further advances. To overcome the barriers of data and annotation, we propose to utilize large-scale unsupervised data for training. The key issue lies in how to mine identity information. To this end, we propose an Identity-seeking Self-supervised Representation learning (ISR) method. ISR constructs positive pairs from inter-frame images by modeling the instance association as a maximum-weight bipartite matching problem. A reliability-guided contrastive loss is further presented to suppress the adverse impact of noisy positive pairs, ensuring that reliable positive pairs dominate the learning process. The training cost of ISR scales approximately linearly with the data size, making it feasible to utilize large-scale data for training. The learned representation exhibits superior generalization ability. Without human annotation and fine-tuning, ISR achieves 87.0% Rank-1 on Market-1501 and 56.4% Rank-1 on MSMT17, outperforming the best supervised domain-generalizable method by 5.0% and 19.5%, respectively. In the pre-training-to-fine-tuning scenario, ISR achieves state-of-the-art performance, with 88.4% Rank-1 on MSMT17. + + + + 3D-Aware Generative Model for Improved Side-View Image Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Jo_3D-Aware_Generative_Model_for_Improved_Side-View_Image_Synthesis_ICCV_2023_paper.pdf + While recent 3D-aware generative models have shown photo-realistic image synthesis with multi-view consistency, the synthesized image quality degrades depending on the camera pose (e.g., a face with a blurry and noisy boundary at a side viewpoint). Such degradation is mainly caused by the difficulty of learning both pose consistency and photo-realism simultaneously from a dataset with heavily imbalanced poses. In this paper, we propose SideGAN, a novel 3D GAN training method to generate photo-realistic images irrespective of the camera pose, especially for faces of side-view angles. To ease the challenging problem of learning photo-realistic and pose-consistent image synthesis, we split the problem into two subproblems, each of which can be solved more easily. 
Specifically, we formulate the problem as a combination of two simple discrimination problems, one of which learns to discriminate whether a synthesized image looks real or not, and the other learns to discriminate whether a synthesized image agrees with the camera pose. Based on this, we propose a dual-branched discriminator with two discrimination branches. We also propose a pose-matching loss to learn the pose consistency of 3D GANs. In addition, we present a pose sampling strategy to increase learning opportunities for steep angles in a pose-imbalanced dataset. With extensive validation, we demonstrate that our approach enables 3D GANs to generate high-quality geometries and photo-realistic images irrespective of the camera pose. + + + + OxfordTVG-HIC: Can Machine Make Humorous Captions from Images? + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_OxfordTVG-HIC_Can_Machine_Make_Humorous_Captions_from_Images_ICCV_2023_paper.pdf + This paper presents OxfordTVG-HIC (Humorous Image Captions), a large-scale dataset for humour generation and understanding. Humour is an abstract, subjective, and context-dependent cognitive construct involving several cognitive factors, making it a challenging task to generate and interpret. Hence, humour generation and understanding can serve as a new task for evaluating the ability of deep-learning methods to process abstract and subjective information. Due to the scarcity of data, humour-related generation tasks such as captioning remain underexplored. To address this gap, OxfordTVG-HIC offers approximately 2.9M image-text pairs with humour scores to train a generalizable humour captioning model. Contrary to existing captioning datasets, OxfordTVG-HIC features a wide range of emotional and semantic diversity resulting in out-of-context examples that are particularly conducive to generating humour. Moreover, OxfordTVG-HIC is curated devoid of offensive content. We also show how OxfordTVG HIC can be leveraged for evaluating the humour of a generated text. Through explainability analysis of the trained models, we identify the visual and linguistic cues influential for evoking humour prediction (and generation). We observe qualitatively that these cues are aligned with the benign violation theory of humour in cognitive psychology. + + + + EDAPS: Enhanced Domain-Adaptive Panoptic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Saha_EDAPS_Enhanced_Domain-Adaptive_Panoptic_Segmentation_ICCV_2023_paper.pdf + With autonomous industries on the rise, domain adaptation of the visual perception stack is an important research direction due to the cost savings promise. Much prior art was dedicated to domain-adaptive semantic segmentation in the synthetic-to-real context. Despite being a crucial output of the perception stack, panoptic segmentation has been largely overlooked by the domain adaptation community. Therefore, we revisit well-performing domain adaptation strategies from other fields, adapt them to panoptic segmentation, and show that they can effectively enhance panoptic domain adaptation. Further, we study the panoptic network design and propose a novel architecture (EDAPS) designed explicitly for domain-adaptive panoptic segmentation. It uses a shared, domain-robust transformer encoder to facilitate the joint adaptation of semantic and instance features, but task-specific decoders tailored for the specific requirements of both domain-adaptive semantic and instance segmentation. 
As a result, the performance gap seen in challenging panoptic benchmarks is substantially narrowed. EDAPS significantly improves the state-of-the-art performance for panoptic segmentation UDA by a large margin of 20% on SYNTHIA-to-Cityscapes and even 72% on the more challenging SYNTHIA-to-Mapillary Vistas. The implementation is available at https://github.com/susaha/edaps. + + + + Scratch Each Other's Back: Incomplete Multi-Modal Brain Tumor Segmentation via Category Aware Group Self-Support Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Qiu_Scratch_Each_Others_Back_Incomplete_Multi-Modal_Brain_Tumor_Segmentation_via_ICCV_2023_paper.pdf + Although Magnetic Resonance Imaging (MRI) is very helpful for brain tumor segmentation and discovery, it often lacks some modalities in clinical practice. As a result, degradation of prediction performance is inevitable. In current implementations, different modalities are treated as independent and non-interfering with each other during modal feature extraction, even though they are in fact complementary. In this paper, considering the sensitivity of different modalities to diverse tumor regions, we propose a Category Aware Group Self-Support Learning framework, called GSS, to make up for the information deficit among the modalities in the individual modal feature extraction phase. Precisely, within each prediction category, predictions of all modalities form a group, where the prediction with the highest sensitivity is selected as the group leader. Collaborative efforts between group leaders and members identify the communal learning target with high consistency and certainty. As our minor contribution, we introduce a random mask to reduce the possible biases. GSS adopts the standard training strategy without specific architectural choices and thus can be easily plugged into existing incomplete multi-modal brain tumor segmentation methods. Remarkably, extensive experiments on the BraTS2020, BraTS2018, and BraTS2015 datasets demonstrate that GSS can improve the performance of existing SOTA algorithms by 1.27-3.20% in Dice on average. The code is released at https://github.com/qysgithubopen/GSS. + + + + Agglomerative Transformer for Human-Object Interaction Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Tu_Agglomerative_Transformer_for_Human-Object_Interaction_Detection_ICCV_2023_paper.pdf + We propose an agglomerative Transformer (AGER) that enables Transformer-based human-object interaction (HOI) detectors to flexibly exploit extra instance-level cues in a single-stage and end-to-end manner for the first time. AGER acquires instance tokens by dynamically clustering patch tokens and aligning cluster centres to instances with textual guidance, thus enjoying two benefits: 1) Integrality: each instance token is encouraged to contain all discriminative feature regions of an instance, which demonstrates a significant improvement in the extraction of different instance-level cues and subsequently leads to a new state-of-the-art performance of HOI detection with 36.75 mAP on HICO-Det. 2) Efficiency: the dynamic clustering mechanism allows AGER to generate instance tokens jointly with the feature learning of the Transformer encoder, eliminating the need for an additional object detector or instance decoder used in prior methods, thus allowing the extraction of desirable extra cues for HOI detection in a single-stage and end-to-end pipeline. 
Concretely, AGER reduces GFLOPs by 8.5% and improves FPS by 36%, even compared to a vanilla DETR-like pipeline without extra cue extraction. + + + + Rethinking Fast Fourier Convolution in Image Inpainting + http://openaccess.thecvf.com//content/ICCV2023/papers/Chu_Rethinking_Fast_Fourier_Convolution_in_Image_Inpainting_ICCV_2023_paper.pdf + Recently proposed image inpainting method LaMa builds its network upon Fast Fourier Convolution (FFC), which was originally proposed for high-level vision tasks like image classification. FFC empowers the fully convolutional network to have a global receptive field in its early layers. Thanks to the unique character of the FFC module, LaMa has the ability to produce robust repeating texture, which can not be achieved by the previous inpainting methods. However, is the vanilla FFC module suitable for low-level vision tasks like image inpainting? In this paper, we analyze the fundamental flaws of using FFC in image inpainting, which are 1) spectrum shifting, 2) unexpected spatial activation, and 3) limited frequency receptive field. Such flaws make FFC-based inpainting framework difficult in generating complicated texture and performing faithful reconstruction. Based on the above analysis, we propose a novel Unbiased Fast Fourier Convolution (UFFC) module, which modifies the vanilla FFC module with 1) range transform and inverse transform, 2) absolute position embedding, 3) dynamic skip connection, and 4) adaptive clip, to overcome such flaws, achieving better inpainting results. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our method, outperforming the state-of-the-art methods in both texture-capturing ability and expressiveness. + + + + Learning Robust Representations with Information Bottleneck and Memory Network for RGB-D-based Gesture Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Learning_Robust_Representations_with_Information_Bottleneck_and_Memory_Network_for_ICCV_2023_paper.pdf + Although previous RGB-D-based gesture recognition methods have shown promising performance, researchers often overlook the interference of task-irrelevant cues like illumination and background. These unnecessary factors are learned together with the predictive ones by the network and hinder accurate recognition. In this paper, we propose a convenient and analytical framework to learn a robust feature representation that is impervious to gesture-irrelevant factors. Based on the Information Bottleneck theory, two rules of Sufficiency and Compactness are derived to develop a new information-theoretic loss function, which cultivates a more sufficient and compact representation from the feature encoding and mitigates the impact of gesture-irrelevant information. To highlight the predictive information, we further integrate a memory network. Using our proposed content-based and contextual memory addressing scheme, we weaken the nuisances while preserving the task-relevant information, providing guidance for refining the feature representation. Experiments conducted on three public datasets demonstrate that our approach leads to a better feature representation and achieves better performance than state-of-the-art methods. 
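To make the Sufficiency and Compactness rules in the gesture-recognition abstract above more concrete, here is a minimal PyTorch sketch of a variational information-bottleneck-style head and loss; the Gaussian encoder, the standard-normal prior, and the weighting factor are illustrative assumptions, not the paper's exact formulation.

```python
# Hedged sketch: Sufficiency keeps the bottleneck predictive of the gesture label,
# Compactness penalizes extra (gesture-irrelevant) information via a KL term.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VIBHead(nn.Module):
    def __init__(self, feat_dim=512, bottleneck_dim=128, num_classes=27):
        super().__init__()
        self.mu = nn.Linear(feat_dim, bottleneck_dim)       # mean of q(z|x)
        self.logvar = nn.Linear(feat_dim, bottleneck_dim)   # log-variance of q(z|x)
        self.classifier = nn.Linear(bottleneck_dim, num_classes)

    def forward(self, feats):
        mu, logvar = self.mu(feats), self.logvar(feats)
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)  # reparameterization
        return self.classifier(z), mu, logvar

def ib_loss(logits, labels, mu, logvar, beta=1e-3):
    # Sufficiency: the compressed representation must still classify the gesture.
    sufficiency = F.cross_entropy(logits, labels)
    # Compactness: pull q(z|x) toward N(0, I) to discard nuisance information.
    compactness = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(dim=1).mean()
    return sufficiency + beta * compactness

# toy usage with random backbone features for 8 RGB-D clips
head = VIBHead()
logits, mu, logvar = head(torch.randn(8, 512))
loss = ib_loss(logits, torch.randint(0, 27, (8,)), mu, logvar)
loss.backward()
```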
+ + + + P1AC: Revisiting Absolute Pose From a Single Affine Correspondence + http://openaccess.thecvf.com//content/ICCV2023/papers/Ventura_P1AC_Revisiting_Absolute_Pose_From_a_Single_Affine_Correspondence_ICCV_2023_paper.pdf + Affine correspondences have traditionally been used to improve feature matching over wide baselines. While recent work has successfully used affine correspondences to solve various relative camera pose estimation problems, less attention has been given to their use in absolute pose estimation. We introduce the first general solution to the problem of estimating the pose of a calibrated camera given a single observation of an oriented point and an affine correspondence. The advantage of our approach (P1AC) is that it requires only a single correspondence, in comparison to the traditional point-based approach (P3P), significantly reducing the combinatorics in robust estimation. P1AC provides a general solution that removes restrictive assumptions made in prior work and is applicable to large-scale image-based localization. We propose a minimal solution to the P1AC problem and evaluate our novel solver on synthetic data, showing its numerical stability and performance under various types of noise. On standard image-based localization benchmarks we show that P1AC achieves more accurate results than the widely used P3P algorithm. Code for our method is available at https://github.com/jonathanventura/P1AC/. + + + + Lossy and Lossless (L2) Post-training Model Size Compression + http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_Lossy_and_Lossless_L2_Post-training_Model_Size_Compression_ICCV_2023_paper.pdf + Deep neural networks have delivered remarkable performance and have been widely used in various visual tasks. However, their huge sizes cause significant inconvenience for transmission and storage. Many previous studies have explored model size compression. However, these studies often approach various lossy and lossless compression methods in isolation, leading to challenges in achieving high compression ratios efficiently. This work proposes a post-training model size compression method that combines lossy and lossless compression in a unified way. We first propose a unified parametric weight transformation, which ensures different lossy compression methods can be performed jointly in a post-training manner. Then, a dedicated differentiable counter is introduced to guide the optimization of lossy compression to arrive at a more suitable point for later lossless compression. Additionally, our method can easily control a desired global compression ratio and allocate adaptive ratios for different layers. Finally, our method can achieve a stable 10 times compression ratio without sacrificing accuracy and a 20 times compression ratio with minor accuracy loss in a short time. Our code is available at https://github.com/ModelTC/L2_Compression. + + + + C2ST: Cross-Modal Contextualized Sequence Transduction for Continuous Sign Language Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_C2ST_Cross-Modal_Contextualized_Sequence_Transduction_for_Continuous_Sign_Language_Recognition_ICCV_2023_paper.pdf + Continuous Sign Language Recognition (CSLR) aims to transcribe the signs of an untrimmed video into written words or glosses. 
The mainstream framework for CSLR consists of a spatial module for visual representation learning, a temporal module aggregating the local and global temporal information of the frame sequence, and the connectionist temporal classification (CTC) loss, which aligns video features with the gloss sequence. Unfortunately, the language prior implicit in the gloss sequence is ignored throughout the modeling process. Furthermore, the contextualization of glosses is further ignored in alignment learning, as CTC makes an independence assumption between glosses. In this paper, we propose a Cross-modal Contextualized Sequence Transduction (C2ST) method for CSLR, which effectively incorporates the knowledge of the gloss sequence into the process of video representation learning and sequence transduction. Specifically, we introduce a cross-modal context learning framework for CSLR, in which the linguistic features of gloss sequences are extracted by a language model and recurrently integrated with visual features for video modelling. Moreover, we introduce a contextualized sequence transduction loss that incorporates the contextual information of gloss sequences in label prediction, without making any independence assumptions between the glosses. Our method sets a new state of the art on three widely used large-scale sign language recognition datasets: Phoenix-2014, Phoenix-2014-T, and CSL-Daily. On CSL-Daily, our approach achieves an absolute gain of 4.9% WER compared to the best published results. + + + + ObjectFusion: Multi-modal 3D Object Detection with Object-Centric Fusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_ObjectFusion_Multi-modal_3D_Object_Detection_with_Object-Centric_Fusion_ICCV_2023_paper.pdf + Recent progress on multi-modal 3D object detection has featured BEV (Bird-Eye-View) based fusion, which effectively unifies both LiDAR point clouds and camera images in a shared BEV space. Nevertheless, it is not trivial to perform camera-to-BEV transformation due to the inherently ambiguous depth estimation of each pixel, resulting in spatial misalignment between these two multi-modal features. Moreover, such transformation also inevitably leads to projection distortion of camera image features in BEV space. In this paper, we propose a novel Object-centric Fusion (ObjectFusion) paradigm, which completely gets rid of camera-to-BEV transformation during fusion to align object-centric features across different modalities for 3D object detection. ObjectFusion first learns three kinds of modality-specific feature maps (i.e., voxel, BEV, and image features) from LiDAR point clouds, their BEV projections, and camera images. Then a set of 3D object proposals is produced from the BEV features via a heatmap-based proposal generator. Next, the 3D object proposals are reprojected back to voxel, BEV, and image spaces. We leverage voxel pooling and RoI pooling to generate spatially aligned object-centric features for each modality. The object-centric features of the three modalities are further fused at the object level and finally fed into the detection heads. Extensive experiments on the nuScenes dataset demonstrate the superiority of our ObjectFusion, achieving 69.8% mAP on the nuScenes validation set and improving over BEVFusion by 1.3%. 
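The object-centric fusion step described in the ObjectFusion abstract above can be pictured with the hedged PyTorch sketch below, which assumes the per-proposal voxel, BEV, and image features have already been pooled; the gated fusion and all dimensions are illustrative stand-ins rather than the paper's exact design.

```python
# Illustrative sketch: fuse per-proposal features from three modalities at the
# object level, then feed the fused feature to a simple detection head.
import torch
import torch.nn as nn

class ObjectLevelFusion(nn.Module):
    def __init__(self, dim=256, num_modalities=3):
        super().__init__()
        self.proj = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_modalities))
        self.gate = nn.Linear(dim, 1)          # scores each modality per object
        self.head = nn.Linear(dim, 7 + 1)      # box (x, y, z, w, l, h, yaw) + score

    def forward(self, voxel_feats, bev_feats, img_feats):
        # each input: (num_proposals, dim), already pooled per 3D proposal
        per_mod = torch.stack(
            [p(f) for p, f in zip(self.proj, (voxel_feats, bev_feats, img_feats))], dim=1
        )                                                    # (N, 3, dim)
        weights = torch.softmax(self.gate(per_mod), dim=1)   # (N, 3, 1) modality weights
        fused = (weights * per_mod).sum(dim=1)               # object-centric fused feature
        return self.head(fused)

fusion = ObjectLevelFusion()
n = 100  # proposals from a heatmap-based generator
preds = fusion(torch.randn(n, 256), torch.randn(n, 256), torch.randn(n, 256))
print(preds.shape)  # torch.Size([100, 8])
```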
+ + + + Adaptive Calibrator Ensemble: Navigating Test Set Difficulty in Out-of-Distribution Scenarios + http://openaccess.thecvf.com//content/ICCV2023/papers/Zou_Adaptive_Calibrator_Ensemble_Navigating_Test_Set_Difficulty_in_Out-of-Distribution_Scenarios_ICCV_2023_paper.pdf + Model calibration usually requires optimizing some parameters (e.g., temperature) w.r.t. an objective function like negative log-likelihood. This work uncovers a significant but often overlooked aspect: the objective function is influenced by calibration set difficulty, i.e., the ratio of misclassified to correctly classified samples. If a test set has a drastically different difficulty level from the calibration set, as out-of-distribution (OOD) data often do, the optimal calibration parameters of the two datasets will differ, rendering a calibrator that is optimal on the calibration set suboptimal on the OOD test set and thus degrading calibration performance. With this knowledge, we propose a simple and effective method named adaptive calibrator ensemble (ACE) to calibrate OOD datasets whose difficulty is usually higher than that of the calibration set. Specifically, two calibration functions are trained, one for in-distribution data (low difficulty) and the other for severely OOD data (high difficulty). To achieve desirable calibration on a new OOD dataset, ACE uses an adaptive weighting method that strikes a balance between the two extreme functions. When plugged in, ACE generally improves the performance of several state-of-the-art calibration schemes on a series of OOD benchmarks. Importantly, such improvement does not come at the cost of in-distribution calibration performance. Project Website: https://github.com/insysgroup/Adaptive-Calibrators-Ensemble.git. + + + + Contrastive Learning Relies More on Spatial Inductive Bias Than Supervised Learning: An Empirical Study + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhong_Contrastive_Learning_Relies_More_on_Spatial_Inductive_Bias_Than_Supervised_ICCV_2023_paper.pdf + Though self-supervised contrastive learning (CL) has shown its potential to achieve state-of-the-art accuracy without any supervision, its behavior remains under-investigated by academia. Different from most previous work that understands CL from learning objectives, we focus on an unexplored yet natural aspect: the spatial inductive bias that seems to be implicitly exploited via data augmentations in CL. We design an experiment to study the reliance of CL on such spatial inductive bias by destroying the global or local spatial structures of images with global or local patch shuffling, and comparing the performance drop between experiments on the original and corrupted datasets to quantify the reliance on a certain inductive bias. We also use the uniformity of the feature space to further study how CL-pre-trained models behave on the corrupted datasets. Our results and analysis show that CL has a much higher reliance on spatial inductive bias than SL, regardless of the specific CL algorithm or backbone, opening a new direction for studying the behavior of CL. + + + + Randomized Quantization: A Generic Augmentation for Data Agnostic Self-supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Randomized_Quantization_A_Generic_Augmentation_for_Data_Agnostic_Self-supervised_Learning_ICCV_2023_paper.pdf + Self-supervised representation learning follows a paradigm of withholding some part of the data and tasking the network to predict it from the remaining part. 
Among many techniques, data augmentation lies at the core of creating the information gap. Towards this end, masking has emerged as a generic and powerful tool where content is withheld along the sequential dimension, e.g., spatial in images, temporal in audio, and syntactic in language. In this paper, we explore the orthogonal channel dimension for generic data augmentation by exploiting precision redundancy. The data for each channel is quantized through a non-uniform quantizer, with the quantized value sampled randomly within randomly sampled quantization bins. From another perspective, quantization is analogous to channel-wise masking, as it removes the information within each bin but preserves the information across bins. Our approach significantly surpasses existing generic data augmentation methods, while showing on-par performance against modality-specific augmentations. We comprehensively evaluate our approach on vision, audio, and 3D point clouds, as well as the DABS benchmark, which comprises various data modalities. The code is available at https://github.com/microsoft/random_quantize. + + + + Neural Radiance Field with LiDAR maps + http://openaccess.thecvf.com//content/ICCV2023/papers/Chang_Neural_Radiance_Field_with_LiDAR_maps_ICCV_2023_paper.pdf + We address outdoor Neural Radiance Fields (NeRF) with LiDAR maps. Existing NeRF methods usually require specially collected hypersampled source views and do not perform well with open-source camera-LiDAR datasets, significantly limiting the approach's practical utility. In this paper, we demonstrate an approach that allows these datasets to be utilized for high-quality neural renderings. Our design leverages 1) LiDAR sensors for strong 3D geometry priors that significantly improve the ray sampling locality, and 2) Conditional Adversarial Networks (cGANs) to recover image details, since aggregating embeddings from imperfect LiDAR maps causes artifacts in the synthesized images. Our experiments show that while NeRF baselines produce either noisy or blurry results on Argoverse 2, the images synthesized by our system not only outperform baselines in image quality metrics under both clean and noisy conditions, but also obtain Detectron2 results that are closer to those of the ground-truth images. Furthermore, to show the broad applicability of our system, we demonstrate that it can be used for data augmentation when training a pose regression network and for multi-season view synthesis. Our dataset and code will be released. + + + + AREA: Adaptive Reweighting via Effective Area for Long-Tailed Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_AREA_Adaptive_Reweighting_via_Effective_Area_for_Long-Tailed_Classification_ICCV_2023_paper.pdf + Large-scale data from the real world usually follow a long-tailed distribution (i.e., a few majority classes occupy plentiful training data, while most minority classes have few samples), making the hyperplanes heavily skewed toward the minority classes. Traditionally, reweighting is adopted to make the hyperplanes fairly split the feature space, where the weights are designed according to the number of samples. However, we find that the number of samples in a class cannot accurately measure the size of its spanned space, especially for a majority class, whose spanned space is usually larger than its number of samples would suggest because of high diversity. Therefore, weights designed based on the number of samples will still compress the space of minority classes. 
In this paper, we reconsider reweighting from a totally new perspective of analyzing the spanned space of each class. We argue that, besides statistical numbers, relations between samples are also significant for sufficiently depicting the spanned space. Consequently, we estimate the size of the spanned space for each category, namely effective area, by detailedly analyzing its samples' distribution. By treating samples of a class as identically distributed random variables and analyzing their correlations, a simple and non-parametric formula is derived to estimate the effective area. Then, the weight simply calculated inversely proportional to the effective area of each class is adopted to achieve fairer training. Note that our weights are more flexible as they can be adaptively adjusted along with the optimizing features during training. Experiments on four long-tailed datasets show that the proposed weights outperform the state-of-the-art reweighting methods. Moreover, our method can also achieve better results on statistically balanced CIFAR-10/100. Code is available at https://github.com/xiaohua-chen/AREA. + + + + Learning Adaptive Neighborhoods for Graph Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Saha_Learning_Adaptive_Neighborhoods_for_Graph_Neural_Networks_ICCV_2023_paper.pdf + Graph convolutional networks (GCNs) enable end-to-end learning on graph structured data. However, many works assume a given graph structure. When the input graph is noisy or unavailable, one approach is to construct or learn a latent graph structure. These methods typically fix the choice of node degree for the entire graph, which is suboptimal. Instead, we propose a novel end-to-end differentiable graph generator which builds graph topologies where each node selects both its neighborhood and its size. Our module can be readily integrated into existing pipelines involving graph convolution operations, replacing the predetermined or existing adjacency matrix with one that is learned, and optimized, as part of the general objective. As such it is applicable to any GCN. We integrate our module into trajectory prediction, point cloud classification and node classification pipelines resulting in improved accuracy over other structure-learning methods across a wide range of datasets and GCN backbones. + + + + Make-It-3D: High-fidelity 3D Creation from A Single Image with Diffusion Prior + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_Make-It-3D_High-fidelity_3D_Creation_from_A_Single_Image_with_Diffusion_ICCV_2023_paper.pdf + In this work, we investigate the problem of creating high-fidelity 3D content from only a single image. This is inherently challenging: it essentially involves estimating the underlying 3D geometry while hallucinating unseen textures. To address this challenge, we leverage prior knowledge in a well-trained 2D diffusion model to serve as a 3D-aware supervision for 3D creation. Our proposed method, Make-It-3D, employs a two-stage optimization pipeline: the first stage optimizes a neural radiance field with constraints from the reference image and diffusion prior; the second stage builds textured point clouds from the coarse model and further enhances the textures with diffusion prior leveraging the availability of high-quality textures from the reference image. Extensive experiments show that our method achieves a clear improvement over previous works, displaying faithful reconstruction and impressive visual quality. 
Our method presents the first attempt to achieve high-quality 3D creation from a single image for general objects, and enables various applications such as text-to-3D creation and texture editing. + + + + Taxonomy Adaptive Cross-Domain Adaptation in Medical Imaging via Optimization Trajectory Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_Taxonomy_Adaptive_Cross-Domain_Adaptation_in_Medical_Imaging_via_Optimization_Trajectory_ICCV_2023_paper.pdf + The success of automated medical image analysis depends on large-scale and expert-annotated training sets. Unsupervised domain adaptation (UDA) has been raised as a promising approach to alleviate the burden of labeled data collection. However, they generally operate under the closed-set adaptation setting assuming an identical label set between the source and target domains, which is over-restrictive in clinical practice where new classes commonly exist across datasets due to taxonomic inconsistency. While several methods have been presented to tackle both domain shifts and incoherent label sets, none of them take into account the common characteristics of the two issues and consider the learning dynamics along network training. In this work, we propose optimization trajectory distillation, a unified approach to address the two technical challenges from a new perspective. It exploits the low-rank nature of gradient space and devises a dual-stream distillation algorithm to regularize the learning dynamics of insufficiently annotated domain and classes with the external guidance obtained from reliable sources. Our approach resolves the issue of inadequate navigation along network optimization, which is the major obstacle in the taxonomy adaptive cross-domain adaptation scenario. We evaluate the proposed method extensively on several tasks towards various endpoints with clinical significance. The results demonstrate its effectiveness and improvements over previous methods. + + + + Ray Conditioning: Trading Photo-consistency for Photo-realism in Multi-view Image Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Ray_Conditioning_Trading_Photo-consistency_for_Photo-realism_in_Multi-view_Image_Generation_ICCV_2023_paper.pdf + Multi-view image generation attracts particular attention these days due to its promising 3D-related applications, e.g., image viewpoint editing. Most existing methods follow a paradigm where a 3D representation is first synthesized, and then rendered into 2D images to ensure photo-consistency across viewpoints. However, such explicit bias for photo-consistency sacrifices photo-realism, causing geometry artifacts and loss of fine-scale details when these methods are applied to edit real images. To address this issue, we propose ray conditioning, a geometry-free alternative that relaxes the photo-consistency constraint. Our method generates multi-view images by conditioning a 2D GAN on a light field prior. With explicit viewpoint control, state-of-the-art photo-realism and identity consistency, our method is particularly suited for the viewpoint editing task. 
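As a hedged illustration of the light-field (ray) prior in the Ray Conditioning abstract above, the PyTorch sketch below conditions a toy 2D generator on per-pixel ray directions from a pinhole camera; the tiny convolutional generator and the conditioning scheme are stand-ins, not the paper's StyleGAN-based architecture.

```python
# Sketch: per-pixel ray directions act as a viewpoint prior concatenated with the latent.
import torch
import torch.nn as nn
import torch.nn.functional as F

def ray_directions(h, w, focal, c2w):
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    dirs = torch.stack([(xs - w / 2) / focal, -(ys - h / 2) / focal,
                        -torch.ones_like(xs)], dim=-1)
    dirs = dirs @ c2w[:3, :3].T                 # rotate camera rays into world space
    return F.normalize(dirs, dim=-1)            # (h, w, 3) unit ray directions

class RayConditionedGenerator(nn.Module):
    def __init__(self, z_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(z_dim + 3, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, z, rays):
        h, w, _ = rays.shape
        z_map = z[:, :, None, None].expand(-1, -1, h, w)                  # broadcast latent
        ray_map = rays.permute(2, 0, 1)[None].expand(z.shape[0], -1, -1, -1)
        return self.net(torch.cat([z_map, ray_map], dim=1))               # viewpoint-aware image

rays = ray_directions(64, 64, focal=80.0, c2w=torch.eye(4))
img = RayConditionedGenerator()(torch.randn(2, 64), rays)
print(img.shape)  # torch.Size([2, 3, 64, 64])
```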
+ + + + SCOB: Universal Text Understanding via Character-wise Supervised Contrastive Learning with Online Text Rendering for Bridging Domain Gap + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_SCOB_Universal_Text_Understanding_via_Character-wise_Supervised_Contrastive_Learning_with_ICCV_2023_paper.pdf + Inspired by the great success of language model (LM)-based pre-training, recent studies in visual document understanding have explored LM-based pre-training methods for modeling text within document images. Among them, pre-training that reads all text from an image has shown promise, but often exhibits instability and even fails when applied to broader domains, such as those involving both visual documents and scene text images. This is a substantial limitation for real-world scenarios, where the processing of text image inputs in diverse domains is essential. In this paper, we investigate effective pre-training tasks in the broader domains and also propose a novel pre-training method called SCOB that leverages character-wise supervised contrastive learning with online text rendering to effectively pre-train document and scene text domains by bridging the domain gap. Moreover, SCOB enables weakly supervised learning, significantly reducing annotation costs. Extensive benchmarks demonstrate that SCOB generally improves vanilla pre-training methods and achieves comparable performance to state-of-the-art methods. Our findings suggest that SCOB can be served generally and effectively for read-type pre-training methods. The code will be available at https://github.com/naver-ai/scob. + + + + Domain Generalization of 3D Semantic Segmentation in Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Sanchez_Domain_Generalization_of_3D_Semantic_Segmentation_in_Autonomous_Driving_ICCV_2023_paper.pdf + Using deep learning, 3D autonomous driving semantic segmentation has become a well-studied subject, with methods that can reach very high performance. Nonetheless, because of the limited size of the training datasets, these models cannot see every type of object and scene found in real-world applications. The ability to be reliable in these various unknown environments is called domain generalization. Despite its importance, domain generalization is relatively unexplored in the case of 3D autonomous driving semantic segmentation. To fill this gap, this paper presents the first benchmark for this application by testing state-of-the-art methods and discussing the difficulty of tackling Laser Imaging Detection and Ranging (LiDAR) domain shifts. We also propose the first method designed to address this domain generalization, which we call 3DLabelProp. This method relies on leveraging the geometry and sequentiality of the LiDAR data to enhance its generalization performances by working on partially accumulated point clouds. It reaches a mean Intersection over Union (mIoU) of 50.4% on SemanticPOSS and of 55.2% on PandaSet solid-state LiDAR while being trained only on SemanticKITTI, making it the state-of-the-art method for generalization (+5% and +33% better, respectively, than the second best method). 
+ + + + HaMuCo: Hand Pose Estimation via Multiview Collaborative Self-Supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_HaMuCo_Hand_Pose_Estimation_via_Multiview_Collaborative_Self-Supervised_Learning_ICCV_2023_paper.pdf + Recent advancements in 3D hand pose estimation have shown promising results, but its effectiveness has primarily relied on the availability of large-scale annotated datasets, the creation of which is a laborious and costly process. To alleviate the label-hungry limitation, we propose a self-supervised learning framework, HaMuCo, that learns a single view hand pose estimator from multi-view pseudo 2D labels. However, one of the main challenges of self-supervised learning is the presence of noisy labels and the "groupthink" effect from multiple views. To overcome these issues, we introduce a cross-view interaction network that distills the single view estimator by utilizing the cross-view correlated features and enforcing multi-view consistency to achieve collaborative learning. Both the single view estimator and the cross-view interaction network are trained jointly in an end-to-end manner. Extensive experiments show that our method can achieve state-of-the-art performance on multi-view self-supervised hand pose estimation. Furthermore, the proposed cross-view interaction network can also be applied to hand pose estimation from multi-view input and outperforms previous methods under same settings. + + + + Efficient Model Personalization in Federated Learning via Client-Specific Prompt Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Efficient_Model_Personalization_in_Federated_Learning_via_Client-Specific_Prompt_Generation_ICCV_2023_paper.pdf + Federated learning (FL) emerges as a decentralized learning framework which trains models from multiple distributed clients without sharing their data to preserve privacy. Recently, large-scale pre-trained models (e.g., Vision Transformer) have shown a strong capability of deriving robust representations. However, the data heterogeneity among clients, the limited computation resources, and the communication bandwidth restrict the deployment of large-scale models in FL frameworks. To leverage robust representations from large-scale models while enabling efficient model personalization for heterogeneous clients, we propose a novel personalized FL framework of client-specific Prompt Generation (pFedPG), which learns to deploy a personalized prompt generator at the server for producing client-specific visual prompts that efficiently adapts frozen backbones to local data distributions. Our proposed framework jointly optimizes the stages of personalized prompt adaptation locally and personalized prompt generation globally. The former aims to train visual prompts that adapt foundation models to each client, while the latter observes local optimization directions to generate personalized prompts for all clients. Through extensive experiments on benchmark datasets, we show that our pFedPG is favorable against state-of-the-art personalized FL methods under various types of data heterogeneity, allowing computation and communication efficient model personalization. 
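A minimal sketch of the client-specific prompt generation idea from the pFedPG abstract above, assuming a learnable per-client descriptor on the server and an MLP that maps it to visual prompt tokens prepended to a frozen ViT's patch tokens; all module names and dimensions are illustrative.

```python
# Hedged sketch: the server generates prompts per client; the client only adapts
# these prompts while the backbone stays frozen.
import torch
import torch.nn as nn

class PromptGenerator(nn.Module):
    def __init__(self, num_clients=10, desc_dim=64, num_prompts=8, token_dim=768):
        super().__init__()
        self.client_desc = nn.Embedding(num_clients, desc_dim)   # kept at the server
        self.to_prompts = nn.Sequential(
            nn.Linear(desc_dim, 256), nn.GELU(),
            nn.Linear(256, num_prompts * token_dim),
        )
        self.num_prompts, self.token_dim = num_prompts, token_dim

    def forward(self, client_id):
        desc = self.client_desc(client_id)                       # (B, desc_dim)
        return self.to_prompts(desc).view(-1, self.num_prompts, self.token_dim)

def prepend_prompts(patch_tokens, prompts):
    # patch_tokens: (B, N, D) from a frozen ViT; prompts: (B, P, D)
    return torch.cat([prompts, patch_tokens], dim=1)

generator = PromptGenerator()
prompts = generator(torch.tensor([3]))                # prompts tailored to client 3
tokens = prepend_prompts(torch.randn(1, 196, 768), prompts)
print(tokens.shape)  # torch.Size([1, 204, 768])
```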
+ + + + Knowledge Restore and Transfer for Multi-Label Class-Incremental Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Knowledge_Restore_and_Transfer_for_Multi-Label_Class-Incremental_Learning_ICCV_2023_paper.pdf + Current class-incremental learning research mainly focuses on single-label classification tasks while multi-label class-incremental learning (MLCIL) with more practical application scenarios is rarely studied. Although there have been many anti-forgetting methods to solve the problem of catastrophic forgetting in single-label class-incremental learning, these methods have difficulty in solving the MLCIL problem due to label absence and information dilution problems. To solve these problems, we propose a Knowledge Restore and Transfer (KRT) framework including a dynamic pseudo-label (DPL) module to solve the label absence problem by restoring the knowledge of old classes to the new data and an incremental cross-attention (ICA) module with session-specific knowledge retention tokens storing knowledge and a unified knowledge transfer token transferring knowledge to solve the information dilution problem. Comprehensive experimental results on MS-COCO and PASCAL VOC datasets demonstrate the effectiveness of our method for improving recognition performance and mitigating forgetting on multi-label class-incremental learning tasks. + + + + PanFlowNet: A Flow-Based Deep Network for Pan-Sharpening + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_PanFlowNet_A_Flow-Based_Deep_Network_for_Pan-Sharpening_ICCV_2023_paper.pdf + Pan-sharpening aims to generate a high-resolution multispectral (HRMS) image by integrating the spectral information of a low-resolution multispectral (LRMS) image with the texture details of a high-resolution panchromatic (PAN) image. It essentially inherits the ill-posed nature of the super-resolution (SR) task that diverse HRMS images can degrade into an LRMS image. However, existing deep learning-based methods recover only one HRMS image from the LRMS image and PAN image using a deterministic mapping, thus ignoring the diversity of the HRMS image. In this paper, to alleviate this ill-posed issue, we propose a flow-based pan-sharpening network (PanFlowNet) to directly learn the conditional distribution of HRMS image given LRMS image and PAN image instead of learning a deterministic mapping. Specifically, we first transform this unknown conditional distribution into a given Gaussian distribution by an invertible network, and the conditional distribution can thus be explicitly defined. Then, we design an invertible Conditional Affine Coupling Block (CACB) and further build the architecture of PanFlowNet by stacking a series of CACBs. Finally, the PanFlowNet is trained by maximizing the log-likelihood of the conditional distribution given a training set and can then be used to predict diverse HRMS images. The experimental results verify that the proposed PanFlowNet can generate various HRMS images given an LRMS image and a PAN image. Additionally, the experimental results on different kinds of satellite datasets also demonstrate the superiority of our PanFlowNet compared with other state-of-the-art methods both visually and quantitatively. Code is available at Github. 
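The Conditional Affine Coupling Block at the heart of PanFlowNet can be sketched as a single conditional affine coupling layer, as in the hedged PyTorch example below; the channel sizes and the small conditioning network are assumptions, and the real model stacks many such invertible blocks.

```python
# Sketch: half of the channels are scaled/shifted using the other half plus the
# conditioning features (e.g., PAN and upsampled LRMS); the map stays invertible.
import torch
import torch.nn as nn

class ConditionalAffineCoupling(nn.Module):
    def __init__(self, channels=8, cond_channels=16, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(channels // 2 + cond_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, channels, 3, padding=1),   # predicts log-scale and shift
        )

    def forward(self, x, cond):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([x1, cond], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)                        # keep the scale well-behaved
        y2 = x2 * log_s.exp() + t
        logdet = log_s.flatten(1).sum(dim=1)             # contributes to the log-likelihood
        return torch.cat([x1, y2], dim=1), logdet

    def inverse(self, y, cond):
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(torch.cat([y1, cond], dim=1)).chunk(2, dim=1)
        log_s = torch.tanh(log_s)
        return torch.cat([y1, (y2 - t) * (-log_s).exp()], dim=1)

layer = ConditionalAffineCoupling()
hrms = torch.randn(2, 8, 64, 64)                         # 8-band HRMS stand-in
cond = torch.randn(2, 16, 64, 64)                        # PAN/LRMS conditioning features
z, logdet = layer(hrms, cond)
print(torch.allclose(layer.inverse(z, cond), hrms, atol=1e-5))  # invertibility check
```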
+ + + + Domain Generalization via Balancing Training Difficulty and Model Capability + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Domain_Generalization_via_Balancing_Training_Difficulty_and_Model_Capability_ICCV_2023_paper.pdf + Domain generalization (DG) aims to learn domain-generalizable models from one or multiple source domains that can perform well in unseen target domains. Despite its recent progress, most existing work suffers from the misalignment between the difficulty level of training samples and the capability of the contemporarily trained model, leading to over-fitting or under-fitting in the trained generalization model. We design MoDify, a Momentum Difficulty framework that tackles the misalignment by balancing the seesaw between the model's capability and the samples' difficulties along the training process. MoDify consists of two novel designs that collaborate to fight against the misalignment while learning domain-generalizable models. The first is MoDify-based Data Augmentation, which exploits an RGB Shuffle technique to generate difficulty-aware training samples on the fly. The second is MoDify-based Network Optimization, which dynamically schedules the training samples for balanced and smooth learning at appropriate difficulty. Without bells and whistles, a simple implementation of MoDify achieves superior performance across multiple benchmarks. In addition, MoDify can complement existing methods as a plug-in, and it is generic and works for different visual recognition tasks. + + + + CLIP-Driven Universal Model for Organ Segmentation and Tumor Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_CLIP-Driven_Universal_Model_for_Organ_Segmentation_and_Tumor_Detection_ICCV_2023_paper.pdf + An increasing number of public datasets have shown a marked impact on automated organ segmentation and tumor detection. However, due to the small size and partial labeling of each dataset, as well as a limited investigation of diverse types of tumors, the resulting models are often limited to segmenting specific organs/tumors, ignore the semantics of anatomical structures, and cannot be extended to novel domains. To address these issues, we propose the CLIP-Driven Universal Model, which incorporates text embeddings learned from Contrastive Language-Image Pre-training (CLIP) into segmentation models. This CLIP-based label encoding captures anatomical relationships, enabling the model to learn a structured feature embedding and segment 25 organs and 6 types of tumors. The proposed model is developed from an assembly of 14 datasets, using a total of 3,410 CT scans for training, and then evaluated on 6,162 external CT scans from 3 additional datasets. We rank first on the Medical Segmentation Decathlon (MSD) public leaderboard and achieve state-of-the-art results on Beyond The Cranial Vault (BTCV). Additionally, the Universal Model is computationally more efficient (6x faster) compared with dataset-specific models, generalizes better to CT scans from varying sites, and shows stronger transfer learning performance on novel tasks. 
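As a simplified stand-in for the CLIP-based label encoding described above, the sketch below projects per-class text embeddings and correlates them with per-voxel features to obtain one mask channel per organ or tumor; random vectors replace real CLIP text-encoder outputs, and the dot-product head simplifies the paper's design.

```python
# Hedged sketch: text embeddings act as class queries against a 3D feature volume.
import torch
import torch.nn as nn

class TextDrivenSegHead(nn.Module):
    def __init__(self, text_dim=512, feat_dim=48):
        super().__init__()
        self.proj = nn.Linear(text_dim, feat_dim)   # maps label embeddings into feature space

    def forward(self, voxel_feats, text_embeds):
        # voxel_feats: (B, C, D, H, W); text_embeds: (num_classes, text_dim)
        class_queries = self.proj(text_embeds)                       # (K, C)
        logits = torch.einsum("bcdhw,kc->bkdhw", voxel_feats, class_queries)
        return logits                                                # one channel per class

classes = ["spleen", "liver", "liver tumor", "pancreas tumor"]
text_embeds = torch.randn(len(classes), 512)   # would come from CLIP's text encoder
head = TextDrivenSegHead()
masks = head(torch.randn(1, 48, 32, 96, 96), text_embeds)
print(masks.shape)  # torch.Size([1, 4, 32, 96, 96])
```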
+ + + + Anchor Structure Regularization Induced Multi-view Subspace Clustering via Enhanced Tensor Rank Minimization + http://openaccess.thecvf.com//content/ICCV2023/papers/Ji_Anchor_Structure_Regularization_Induced_Multi-view_Subspace_Clustering_via_Enhanced_Tensor_ICCV_2023_paper.pdf + The tensor-based multi-view subspace clustering algorithms have received widespread attention due to the powerful ability to capture high-order correlation across views. Although such algorithms have achieved remarkable success, they still suffer from three main issues: 1) The extremely high computational complexity makes tensor-based methods difficult to handle large-scale data sets. 2) The commonly used Tensor Nuclear Norm (TNN) treats different singular values equally and under-penalizes the noise components, resulting in a sub-optimal representation tensor. 3) The subspace-based methods usually ignore the local geometric structure of the original data. Being aware of these, we propose Anchor Structure Regularitation Induced Multi-view Subspace Clustering via Enhanced Tensor Rank Minimization (ASR-ETR). Specifically, an anchor representation tensor is constructed by using the anchor representation strategy rather than the self-representation strategy to reduce the time complexity, and an Anchor Structure Regularization (ASR) is employed to enhance the local geometric structure in the learned anchor-representation tensor. We further define an Enhanced Tensor Rank (ETR), which is a tighter surrogate of the tensor rank and more effective to drive the noise out. Moreover, an efficient iterative optimization algorithm is designed to solve the ASR-ETR, which enjoys both linear complexity and favorable convergence. Extensive experimental results on various data sets demonstrate the superiority of the proposed algorithm as compared to state-of-the-art methods. + + + + MOSE: A New Dataset for Video Object Segmentation in Complex Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Ding_MOSE_A_New_Dataset_for_Video_Object_Segmentation_in_Complex_ICCV_2023_paper.pdf + Video object segmentation (VOS) aims at segmenting a particular object throughout the entire video clip sequence. The state-of-the-art VOS methods have achieved excellent performance (e.g., 90+% J&F) on existing datasets. However, since the target objects in these existing datasets are usually relatively salient, dominant, and isolated, VOS under complex scenes has rarely been studied. To revisit VOS and make it more applicable in the real world, we collect a new VOS dataset called coMplex video Object SEgmentation (MOSE) to study the tracking and segmenting objects in complex environments. MOSE contains 2,149 video clips and 5,200 objects from 36 categories, with 431,725 high-quality object segmentation masks. The most notable feature of MOSE dataset is complex scenes with crowded and occluded objects. The target objects in the videos are commonly occluded by others and disappear in some frames. To analyze the proposed MOSE dataset, we benchmark 18 existing VOS methods under 4 different settings on the proposed MOSE dataset and conduct comprehensive comparisons. The experiments show that current VOS algorithms cannot well perceive objects in complex scenes. For example, under the semi-supervised VOS setting, the highest J&F by existing state-of-the-art VOS methods is only 59.4% on MOSE, much lower than their 90% J&F performance on DAVIS. 
The results reveal that although excellent performance has been achieved on existing benchmarks, there are unresolved challenges under complex scenes and more efforts are desired to explore these challenges in the future. + + + + BoMD: Bag of Multi-label Descriptors for Noisy Chest X-ray Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_BoMD_Bag_of_Multi-label_Descriptors_for_Noisy_Chest_X-ray_Classification_ICCV_2023_paper.pdf + Deep learning methods have shown outstanding classification accuracy in medical imaging problems, which is largely attributed to the availability of large-scale datasets manually annotated with clean labels. However, given the high cost of such manual annotation, new medical imaging classification problems may need to rely on machine-generated noisy labels extracted from radiology reports. Indeed, many Chest X-Ray (CXR) classifiers have been modelled from datasets with noisy labels, but their training procedure is in general not robust to noisy-label samples, leading to sub-optimal models. Furthermore, CXR datasets are mostly multi-label, so current multi-class noisy-label learning methods cannot be easily adapted. In this paper, we propose a new method designed for noisy multi-label CXR learning, which detects and smoothly re-labels noisy samples from the dataset to be used in the training of common multi-label classifiers. The proposed method optimises a bag of multi-label descriptors (BoMD) to promote their similarity with the semantic descriptors produced by language models from multi-label image annotations. Our experiments on noisy multi-label training sets and clean testing sets show that our model has state-of-the-art accuracy and robustness in many CXR multi-label classification benchmarks, including a new benchmark that we propose to systematically assess noisy multi-label methods. + + + + Q-Diffusion: Quantizing Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Q-Diffusion_Quantizing_Diffusion_Models_ICCV_2023_paper.pdf + Diffusion models have achieved great success in image synthesis through iterative noise estimation using deep neural networks. However, the slow inference, high memory consumption, and computation intensity of the noise estimation model hinder the efficient adoption of diffusion models. Although post-training quantization (PTQ) is considered a go-to compression method for other tasks, it does not work out-of-the-box on diffusion models. We propose a novel PTQ method specifically tailored towards the unique multi-timestep pipeline and model architecture of the diffusion models, which compresses the noise estimation network to accelerate the generation process. We identify the key difficulty of diffusion model quantization as the changing output distributions of noise estimation networks over multiple time steps and the bimodal activation distribution of the shortcut layers within the noise estimation network. We tackle these challenges with timestep-aware calibration and split shortcut quantization in this work. Experimental results show that our proposed method is able to quantize full-precision unconditional diffusion models into 4-bit while maintaining comparable performance (small FID change of at most 2.34 compared to >100 for traditional PTQ) in a training-free manner. Our approach can also be applied to text-guided image generation, where we can run stable diffusion in 4-bit weights with high generation quality for the first time. 
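The timestep-aware calibration idea in the Q-Diffusion abstract above can be sketched as gathering calibration activations across uniformly sampled denoising steps before fixing a quantization scale; the toy noise-estimation network and symmetric per-tensor quantizer below are illustrative assumptions, not the paper's exact scheme.

```python
# Hedged sketch: output distributions change across timesteps, so the calibration
# set spans many steps instead of a single one before the scale is chosen.
import torch
import torch.nn as nn

class ToyEpsNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(3 + 1, 32, 3, padding=1), nn.SiLU(),
                                  nn.Conv2d(32, 3, 3, padding=1))

    def forward(self, x, t):
        # encode the timestep as an extra (normalized) input channel
        t_map = t.float().view(-1, 1, 1, 1).expand(-1, 1, *x.shape[-2:]) / 1000.0
        return self.body(torch.cat([x, t_map], dim=1))

@torch.no_grad()
def timestep_aware_scale(model, samples, num_steps=1000, calib_steps=25, n_bits=4):
    acts = []
    for t in torch.linspace(0, num_steps - 1, calib_steps).long():   # span all timesteps
        acts.append(model(samples, t.repeat(samples.shape[0])))
    acts = torch.cat(acts)
    return acts.abs().max() / (2 ** (n_bits - 1) - 1)                # symmetric uniform scale

def fake_quant(x, scale, n_bits=4):
    q = torch.clamp(torch.round(x / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale                                                 # simulated low-bit tensor

model = ToyEpsNet()
calib = torch.randn(4, 3, 32, 32)                # stand-in for intermediate x_t samples
scale = timestep_aware_scale(model, calib)
print(fake_quant(model(calib, torch.full((4,), 500)), scale).shape)
```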
+ + + + Robustifying Token Attention for Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Robustifying_Token_Attention_for_Vision_Transformers_ICCV_2023_paper.pdf + Despite the success of vision transformers (ViTs), they still suffer from significant drops in accuracy in the presence of common corruptions, such as noise or blur. Interestingly, we observe that the attention mechanism of ViTs tends to rely on few important tokens, a phenomenon we call token overfocusing. More critically, these tokens are not robust to corruptions, often leading to highly diverging attention patterns. In this paper, we intend to alleviate this overfocusing issue and make attention more stable through two general techniques: First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism. Specifically, TAP learns average pooling schemes for each token such that the information of potentially important tokens in the neighborhood can adaptively be taken into account. Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few by using our Attention Diversification Loss (ADL). We achieve this by penalizing high cosine similarity between the attention vectors of different tokens. In experiments, we apply our methods to a wide range of transformer architectures and improve robustness significantly. For example, we improve corruption robustness on ImageNet-C by 2.4% while improving accuracy by 0.4% based on state-of-the-art robust architecture FAN. Also, when fine-tuning on semantic segmentation tasks, we improve robustness on CityScapes-C by 2.4% and ACDC by 3.0%. Our code is available at https://github.com/guoyongcs/TAPADL. + + + + UniSeg: A Unified Multi-Modal LiDAR Segmentation Network and the OpenPCSeg Codebase + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_UniSeg_A_Unified_Multi-Modal_LiDAR_Segmentation_Network_and_the_OpenPCSeg_ICCV_2023_paper.pdf + Point-, voxel-, and range-views are three representative forms of point clouds. All of them have accurate 3D measurements but lack color and texture information. RGB images are a natural complement to these point cloud views and fully utilizing the comprehensive information of them benefits more robust perceptions. In this paper, we present a unified multi-modal LiDAR segmentation network, termed UniSeg, which leverages the information of RGB images and three views of the point cloud, and accomplishes semantic segmentation and panoptic segmentation simultaneously. Specifically, we first design the Learnable cross-Modal Association (LMA) module to automatically fuse voxel-view and range-view features with image features, which fully utilize the rich semantic information of images and are robust to calibration errors. Then, the enhanced voxel-view and range-view features are transformed to the point space, where three views of point cloud features are further fused adaptively by the Learnable cross-View Association module (LVA). Notably, UniSeg achieves promising results in three public benchmarks, i.e., SemanticKITTI, nuScenes, and Waymo Open Dataset (WOD); it ranks 1st on two challenges of two benchmarks, including the LiDAR semantic segmentation challenge of nuScenes and panoptic segmentation challenges of SemanticKITTI. Besides, we construct the OpenPCSeg codebase, which is the largest and most comprehensive outdoor LiDAR segmentation codebase. 
It contains most of the popular outdoor LiDAR segmentation algorithms and provides reproducible implementations. The OpenPCSeg codebase will be made publicly available at https://github.com/PJLab-ADG/PCSeg. + + + + Pixel-Wise Contrastive Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Pixel-Wise_Contrastive_Distillation_ICCV_2023_paper.pdf + We present a simple but effective pixel-level self-supervised distillation framework friendly to dense prediction tasks. Our method, called Pixel-Wise Contrastive Distillation (PCD), distills knowledge by attracting the corresponding pixels from student's and teacher's output feature maps. PCD includes a novel design called SpatialAdaptor which "reshapes" a part of the teacher network while preserving the distribution of its output features. Our ablation experiments suggest that this reshaping behavior enables more informative pixel-to-pixel distillation. Moreover, we utilize a plug-in multi-head self-attention module that explicitly relates the pixels of student's feature maps to enhance the effective receptive field, leading to a more competitive student. PCD outperforms previous self-supervised distillation methods on various dense prediction tasks. A backbone of ResNet-18-FPN distilled by PCD achieves 37.4 AP-bbox and 34.0 AP-mask on COCO dataset using the detector of Mask R-CNN. We hope our study will inspire future research on how to pre-train a small model friendly to dense prediction tasks in a self-supervised fashion. + + + + Efficient Deep Space Filling Curve + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Efficient_Deep_Space_Filling_Curve_ICCV_2023_paper.pdf + Space-filling curves (SFCs) act as a linearization approach to map data in higher dimensional space to lower dimensional space, which is used comprehensively in computer vision, such as image/point cloud compression, hashing and etc. Currently, researchers formulate the problem of searching for an optimal SFC to the problem of finding a single Hamiltonian circuit on the image grid graph. Existing methods adopt graph neural networks (GNN) for SFC search. By modeling the pixel grid as a graph, they first adopt GNN to predict the edge weights and then generate a minimum spanning tree (MST) based on the predictions, which is further used to construct the SFC. However, GNN-based methods suffer from high computational costs and memory footprint usage. Besides, MST generation is un-differentiable, which is infeasible to optimize via gradient descent. To remedy these issues, we propose a GNN-based SFC-search framework with a tailored algorithm that largely reduces computational cost of GNN. Additionally, we propose a siamese network learning scheme to optimize DNN-based models in an end-to-end fashion. Extensive experiments show that our proposed method outperforms both DNN-based methods and traditional SFCs, e.g. Hilbert curve, by a large margin on various benchmarks. + + + + GlueGen: Plug and Play Multi-modal Encoders for X-to-image Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Qin_GlueGen_Plug_and_Play_Multi-modal_Encoders_for_X-to-image_Generation_ICCV_2023_paper.pdf + Text-to-image (T2I) models based on diffusion processes have achieved remarkable success in controllable image generation using user-provided captions. However, the tight coupling between the current text encoder and image decoder in T2I models makes it challenging to replace or upgrade. 
Such changes often require massive fine-tuning or even training from scratch with the prohibitive expense. To address this problem, we propose GlueGen, which applies a newly proposed GlueNet model to align features from single-modal or multi-modal encoders with the latent space of an existing T2I model. The approach introduces a new training objective that leverages parallel corpora to align the representation spaces of different encoders. Empirical results show that GlueNet can be trained efficiently and enables various capabilities beyond previous state-of-the-art models: 1) multilingual language models such as XLM-Roberta can be aligned with existing T2I models, allowing for the generation of high-quality images from captions beyond English; 2) GlueNet can align multi-modal encoders such as AudioCLIP with the Stable Diffusion model, enabling sound-to-image generation; 3) it can also upgrade the current text encoder of the latent diffusion model for challenging case generation. By the alignment of various feature representations, the GlueNet allows for flexible and efficient integration of new functionality into existing T2I models and sheds light on X-to-image (X2I) generation. + + + + Ponder: Point Cloud Pre-training via Neural Rendering + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Ponder_Point_Cloud_Pre-training_via_Neural_Rendering_ICCV_2023_paper.pdf + We propose a novel approach to self-supervised learning of point cloud representations by differentiable neural rendering. Motivated by the fact that informative point cloud features should be able to encode rich geometry and appearance cues and render realistic images, we train a point-cloud encoder within a devised point-based neural renderer by comparing the rendered images with real images on massive RGB-D data. The learned point-cloud encoder can be easily integrated into various downstream tasks, including not only high-level tasks like 3D detection and segmentation, but low-level tasks like 3D reconstruction and image synthesis. Extensive experiments on various tasks demonstrate the superiority of our approach compared to existing pre-training methods. + + + + L-DAWA: Layer-wise Divergence Aware Weight Aggregation in Federated Self-Supervised Visual Representation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Rehman_L-DAWA_Layer-wise_Divergence_Aware_Weight_Aggregation_in_Federated_Self-Supervised_Visual_ICCV_2023_paper.pdf + The ubiquity of camera-enabled devices has led to large amounts of unlabeled image data being produced at the edge. The integration of self-supervised learning (SSL) and federated learning (FL) into one coherent system can potentially offer data privacy guarantees while also advancing the quality and robustness of the learned visual representations without needing to move data around. However, client bias and divergence during FL aggregation caused by data heterogeneity limits the performance of learned visual representations on downstream tasks. In this paper, we propose a new aggregation strategy termed Layer-wise Divergence Aware Weight Aggregation (L-DAWA) to mitigate the influence of client bias and divergence during FL aggregation. The proposed method aggregates weights at the layer-level according to the measure of angular divergence between the clients' model and the global model. 
Extensive experiments with cross-silo and cross-device settings on CIFAR-10/100 and Tiny ImageNet datasets demonstrate that our methods are effective and obtain new SOTA performance on both contrastive and non-contrastive SSL approaches. + + + + Controllable Guide-Space for Generalizable Face Forgery Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Controllable_Guide-Space_for_Generalizable_Face_Forgery_Detection_ICCV_2023_paper.pdf + Recent studies on face forgery detection have shown satisfactory performance for methods involved in training datasets, but are not ideal enough for unknown domains. This motivates many works to improve the generalization, but forgery-irrelevant information, such as image background and identity, still exists in different domain features and causes unexpected clustering, limiting the generalization. In this paper, we propose a controllable guide-space (GS) method to enhance the discrimination of different forgery domains, so as to increase the forgery relevance of features and thereby improve the generalization. The well-designed guide-space can simultaneously achieve both the proper separation of forgery domains and the large distance between real-forgery domains in an explicit and controllable manner. Moreover, for better discrimination, we use a decoupling module to weaken the interference of forgery-irrelevant correlations between domains. Furthermore, we make adjustments to the decision boundary manifold according to the clustering degree of the same domain features within the neighborhood. Extensive experiments in multiple in-domain and cross-domain settings confirm that our method can achieve state-of-the-art generalization. + + + + Calibrating Uncertainty for Semi-Supervised Crowd Counting + http://openaccess.thecvf.com//content/ICCV2023/papers/LI_Calibrating_Uncertainty_for_Semi-Supervised_Crowd_Counting_ICCV_2023_paper.pdf + Semi-supervised crowd counting is an important yet challenging task. A popular approach is to iteratively generate pseudo-labels for unlabeled data and add them to the training set. The key is to use uncertainty to select reliable pseudo-labels. In this paper, we propose a novel method to calibrate model uncertainty for crowd counting. Our method takes a supervised uncertainty estimation strategy to train the model through a surrogate function. This ensures the uncertainty is well controlled throughout the training. We propose a matching-based patch-wise surrogate function to better approximate uncertainty for crowd counting tasks. The proposed method pays a sufficient amount of attention to details, while maintaining a proper granularity. Altogether our method is able to generate reliable uncertainty estimation, high quality pseudolabels, and achieve state-of-the-art performance in semisupervised crowd counting. + + + + Segmentation of Tubular Structures Using Iterative Training with Tailored Samples + http://openaccess.thecvf.com//content/ICCV2023/papers/Liao_Segmentation_of_Tubular_Structures_Using_Iterative_Training_with_Tailored_Samples_ICCV_2023_paper.pdf + We propose a minimal path method to simultaneously compute segmentation masks and extract centerlines of tubular structures with line-topology. Minimal path methods are commonly used for the segmentation of tubular structures in a wide variety of applications. Recent methods use features extracted by CNNs, and often outperform methods using hand-tuned features. 
However, for CNN-based methods, the samples used for training may be generated inappropriately, so that they can be very different from samples encountered during inference. We approach this discrepancy by introducing a novel iterative training scheme, which enables generating better training samples specifically tailored for the minimal path methods without changing existing annotations. In our method, segmentation masks and centerlines are not determined after one another by post-processing, but obtained using the same steps. Our method requires only very few annotated training images. Comparison with seven previous approaches on three public datasets, including satellite images and medical images, shows that our method achieves state-of-the-art results both for segmentation masks and centerlines. + + + + Surface Extraction from Neural Unsigned Distance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Surface_Extraction_from_Neural_Unsigned_Distance_Fields_ICCV_2023_paper.pdf + We propose a method, named DualMesh-UDF, to extract a surface from unsigned distance functions (UDFs), encoded by neural networks, or neural UDFs. Neural UDFs are becoming increasingly popular for surface representation because of their versatility in presenting surfaces with arbitrary topologies, as opposed to the signed distance function that is limited to representing a closed surface. However, the applications of neural UDFs are hindered by the notorious difficulty in extracting the target surfaces they represent. Recent methods for surface extraction from a neural UDF suffer from significant geometric errors or topological artifacts due to two main difficulties: (1) A UDF does not exhibit sign changes; and (2) A neural UDF typically has substantial approximation errors. DualMesh-UDF addresses these two difficulties. Specifically, given a neural UDF encoding a target surface S to be recovered, we first estimate the tangent planes of S at a set of sample points close to S. Next, we organize these sample points into local clusters, and for each local cluster, solve a linear least squares problem to determine a final surface point. These surface points are then connected to create the output mesh surface, which approximates the target surface. The robust estimation of the tangent planes of the target surface and the subsequent minimization problem constitute our core strategy, which contributes to the favorable performance of DualMesh-UDF over other competing methods. To efficiently implement this strategy, we employ an adaptive Octree. Within this framework, we estimate the location of a surface point in each of the octree cells identified as containing part of the target surface. Extensive experiments show that our method outperforms existing methods in terms of surface reconstruction quality while maintaining comparable computational efficiency. + + + + CBA: Improving Online Continual Learning via Continual Bias Adaptor + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_CBA_Improving_Online_Continual_Learning_via_Continual_Bias_Adaptor_ICCV_2023_paper.pdf + Online continual learning (CL) aims to learn new knowledge and consolidate previously learned knowledge from non-stationary data streams. Due to the time-varying training setting, the model learned from a changing distribution easily forgets the previously learned knowledge and biases towards the newly received task. 
To address this problem, we propose a Continual Bias Adaptor (CBA) module to augment the classifier network to adapt to catastrophic distribution change during training, such that the classifier network is able to learn a stable consolidation of previously learned tasks. In the testing stage, CBA can be removed which introduces no additional computation cost and memory overhead. We theoretically reveal the reason why the proposed method can effectively alleviate catastrophic distribution shifts, and empirically demonstrate its effectiveness through extensive experiments based on four rehearsal-based baselines and three public continual learning benchmarks. + + + + Multi-view Spectral Polarization Propagation for Video Glass Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Qiao_Multi-view_Spectral_Polarization_Propagation_for_Video_Glass_Segmentation_ICCV_2023_paper.pdf + In this paper, we present the first polarization-guided video glass segmentation propagation solution (PGVS-Net) that can robustly and coherently propagate glass segmentation in RGB-P video sequences. By leveraging spatiotemporal polarization and color information, our method combines multi-view polarization cues and thus can alleviate the view dependence of single-input intensity variations on glass objects. We demonstrate that our model can outperform glass segmentation on RGB-only video sequences as well as produce more robust segmentation than per-frame RGB-P single-image segmentation methods. To train and validate PGVS-Net, we introduce a novel RGB-P Glass Video dataset (PGV-117) containing 117 video sequences of scenes captured with different types of camera paths, lighting conditions, dynamics, and glass types. + + + + DandelionNet: Domain Composition with Instance Adaptive Classification for Domain Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_DandelionNet_Domain_Composition_with_Instance_Adaptive_Classification_for_Domain_Generalization_ICCV_2023_paper.pdf + Domain generalization (DG) attempts to learn a model on source domains that can well generalize to unseen but different domains. The multiple source domains are innately different in distribution but intrinsically related to each other, e.g., from the same label space. To achieve a generalizable feature, most existing methods attempt to reduce the domain discrepancy by either learning domain-invariant feature, or additionally mining domain-specific feature. In the space of these features, the multiple source domains are either tightly aligned or not aligned at all, which both cannot fully take the advantage of complementary information from multiple domains. In order to preserve more complementary information from multiple domains at the meantime of reducing their domain gap, we propose that the multiple domains should not be tightly aligned but composite together, where all domains are pulled closer but still preserve their individuality respectively. This is achieved by using instance-adaptive classifier specified for each instance's classification, where the instance-adaptive classifier is slightly deviated from a universal classifier shared by samples from all domains. This adaptive classifier deviation allows all instances from the same category but different domains to be dispersed around the class center rather than squeezed tightly, leading to better generalization for unseen domain samples. 
In result, the multiple domains are harmoniously composite centered on a universal core, like a dandelion, so this work is referred to as DandelionNet. Experiments on multiple DG benchmarks demonstrate that the proposed method can learn a model with better generalization and experiments on source free domain adaption also indicate the versatility. + + + + PASTA: Proportional Amplitude Spectrum Training Augmentation for Syn-to-Real Domain Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Chattopadhyay_PASTA_Proportional_Amplitude_Spectrum_Training_Augmentation_for_Syn-to-Real_Domain_Generalization_ICCV_2023_paper.pdf + Synthetic data offers the promise of cheap and bountiful training data for settings where labeled real-world data is scarce. However, models trained on synthetic data significantly underperform when evaluated on real-world data. In this paper, we propose Proportional Amplitude Spectrum Training Augmentation (PASTA), a simple and effective augmentation strategy to improve out-of-the-box synthetic-to-real (syn-to-real) generalization performance. PASTA perturbs the amplitude spectra of synthetic images in the Fourier domain to generate augmented views. Specifically, with PASTA we propose a structured perturbation strategy where high-frequency components are perturbed relatively more than the low-frequency ones. For the tasks of semantic segmentation (GTAV-Real), object detection (Sim10K-Real), and object recognition (VisDA-C Syn-Real), across a total of 5 syn-to-real shifts, we find that PASTA outperforms more complex state-of-the-art generalization methods while being complementary to the same. + + + + RANA: Relightable Articulated Neural Avatars + http://openaccess.thecvf.com//content/ICCV2023/papers/Iqbal_RANA_Relightable_Articulated_Neural_Avatars_ICCV_2023_paper.pdf + We propose RANA, a relightable and articulated neural avatar for the photorealistic synthesis of humans under arbitrary viewpoints, body poses, and lighting. We only require a short video clip of the person to create the avatar and assume no knowledge about the lighting environment. We present a novel framework to model humans while disentangling their geometry, texture, and also lighting environment from monocular RGB videos. To simplify this otherwise ill-posed task we first estimate the coarse geometry and texture of the person via SMPL+D model fitting and then learn an articulated neural representation for photorealistic image generation. RANA first generates the normal and albedo maps of the person in any given target body pose and then uses spherical harmonics lighting to generate the shaded image in the target lighting environment. We also propose to pretrain RANA using synthetic images and demonstrate that it leads to better disentanglement between geometry and texture while also improving robustness to novel body poses. Finally, we also present a new photorealistic synthetic dataset, Relighting Humans, to quantitatively evaluate the performance of the proposed approach. + + + + MODA: Mapping-Once Audio-driven Portrait Animation with Dual Attentions + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_MODA_Mapping-Once_Audio-driven_Portrait_Animation_with_Dual_Attentions_ICCV_2023_paper.pdf + Audio-driven portrait animation aims to synthesize portrait videos that are conditioned by given audio. Animating high-fidelity and multimodal video portraits has a variety of applications. 
Previous methods have attempted to capture different motion modes and generate high-fidelity portrait videos by training different models or sampling signals from given videos. However, the lack of correlation learning between lip-sync and other movements (e.g., head pose/eye blinking) usually leads to unnatural results. In this paper, we propose a unified system for multi-person, diverse, and high-fidelity talking portrait generation. Our method contains three stages, i.e., 1) a Mapping-Once network with Dual Attentions (MODA) generates the talking representation from the given audio. In MODA, we design a dual-attention module to encode accurate mouth movements and diverse modalities. 2) a facial composer network generates dense and detailed face landmarks, and 3) a temporally guided renderer synthesizes stable videos. Extensive evaluations demonstrate that the proposed system produces more natural and realistic video portraits compared to previous methods. + + + + DIRE for Diffusion-Generated Image Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_DIRE_for_Diffusion-Generated_Image_Detection_ICCV_2023_paper.pdf + Diffusion models have shown remarkable success in visual synthesis, but have also raised concerns about potential abuse for malicious purposes. In this paper, we seek to build a detector for telling apart real images from diffusion-generated images. We find that existing detectors struggle to detect images generated by diffusion models, even if we include generated images from a specific diffusion model in their training data. To address this issue, we propose a novel image representation called DIffusion Reconstruction Error (DIRE), which measures the error between an input image and its reconstruction counterpart by a pre-trained diffusion model. We observe that diffusion-generated images can be approximately reconstructed by a diffusion model while real images cannot. This provides a hint that DIRE can serve as a bridge to distinguish generated and real images. DIRE provides an effective way to detect images generated by most diffusion models, and it is general for detecting generated images from unseen diffusion models and robust to various perturbations. Furthermore, we establish a comprehensive diffusion-generated benchmark including images generated by eight diffusion models to evaluate the performance of diffusion-generated image detectors. Extensive experiments on our collected benchmark demonstrate that DIRE exhibits superiority over previous generated-image detectors. The code, models, and dataset are available at https://github.com/ZhendongWang6/DIRE. + + + + Bring Clipart to Life + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Bring_Clipart_to_Life_ICCV_2023_paper.pdf + The development of face editing has been boosted since the birth of StyleGAN. While previous works have explored different interactive methods, such as sketching and exemplar photos, they have been limited in terms of expressiveness and generality. In this paper, we propose a new interaction method that guides the editing with abstract clipart, composed of a set of simple semantic parts, allowing users to control edits on face photos with simple clicks. However, this is a challenging task given the large domain gap between colorful face photos and abstract clipart with limited data. To solve this problem, we introduce a framework called ClipFaceShop built on top of StyleGAN. 
The key idea is to take advantage of the rich and disentangled visual features encoded in the W+ latent code, and to create a new lightweight selective feature adaptor that predicts a modifiable path toward the target output photo. Since no pairwise labeled data exists for training, we design a set of losses to provide supervision signals for learning the modifiable path. Experimental results show that ClipFaceShop generates realistic and faithful face photos, sharing the same facial attributes as the reference clipart. We demonstrate that ClipFaceShop supports clipart in diverse styles, even in the form of a free-hand sketch. + + + + Noise2Info: Noisy Image to Information of Noise for Self-Supervised Image Denoising + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Noise2Info_Noisy_Image_to_Information_of_Noise_for_Self-Supervised_Image_ICCV_2023_paper.pdf + Unsupervised image denoising has been proposed to alleviate the widespread noise problem without requiring clean images. Existing works mainly follow a self-supervised approach, which tries to reconstruct each pixel x of noisy images without the knowledge of x. More recently, some pioneering works further emphasize the importance of x and propose to weigh the information extracted from x and other pixels when recovering x. However, such a method is highly sensitive to the standard deviation \sigma_n of the noise injected into clean images, where \sigma_n is inaccessible without knowing the clean images. Thus, it is unrealistic to assume that \sigma_n is known when pursuing high model performance. To alleviate this issue, we propose Noise2Info to extract the critical information, the standard deviation \sigma_n of the injected noise, based only on the noisy images. Specifically, we first theoretically provide an upper bound on \sigma_n, although the bound requires clean images. Then, we propose a novel method to estimate the bound of \sigma_n by using only noisy images. Besides, we prove that the difference between our estimate and the true deviation shrinks as model training proceeds. Empirical studies show that Noise2Info is effective and robust on benchmark data sets and closely estimates the standard deviation of noises during model training. + + + + SynBody: Synthetic Dataset with Layered Human Models for 3D Human Perception and Modeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_SynBody_Synthetic_Dataset_with_Layered_Human_Models_for_3D_Human_ICCV_2023_paper.pdf + Synthetic data has emerged as a promising source for 3D human research as it offers low-cost access to large-scale human datasets. To advance the diversity and annotation quality of human models, we introduce a new synthetic dataset, SynBody, with three appealing features: 1) a clothed parametric human model that can generate a diverse range of subjects; 2) a layered human representation that naturally offers high-quality 3D annotations to support multiple tasks; 3) a scalable system for producing realistic data to facilitate real-world tasks. The dataset comprises 1.2M images with corresponding accurate 3D annotations, covering 10,000 human body models, 1,187 actions, and various viewpoints. The dataset includes two subsets for human pose and shape estimation as well as human neural rendering. Extensive experiments on SynBody indicate that it substantially enhances both SMPL and SMPL-X estimation. Furthermore, the incorporation of layered annotations offers a valuable training resource for investigating Human Neural Radiance Fields (NeRF). 
+ + + + EP2P-Loc: End-to-End 3D Point to 2D Pixel Localization for Large-Scale Visual Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_EP2P-Loc_End-to-End_3D_Point_to_2D_Pixel_Localization_for_Large-Scale_ICCV_2023_paper.pdf + Visual localization is the task of estimating a 6-DoF camera pose of a query image within a provided 3D reference map. Thanks to recent advances in various 3D sensors, 3D point clouds are becoming a more accurate and affordable option for building the reference map, but research on matching the points of 3D point clouds with pixels in 2D images for visual localization remains challenging. Existing approaches that jointly learn 2D-3D feature matching suffer from low inlier ratios due to representational differences between the two modalities, and methods that bypass this problem by casting it as classification suffer from poor refinement. In this work, we propose EP2P-Loc, a novel large-scale visual localization method that mitigates such appearance discrepancy and enables end-to-end training for pose estimation. To increase the number of inliers, we propose a simple algorithm to remove invisible 3D points in the image, and find all 2D-3D correspondences without keypoint detection. To reduce memory usage and search complexity, we take a coarse-to-fine approach where we extract patch-level features from 2D images, then perform 2D patch classification on each 3D point, and obtain the exact corresponding 2D pixel coordinates through positional encoding. Finally, for the first time in this task, we employ a differentiable PnP for end-to-end training. In experiments on newly curated large-scale indoor and outdoor benchmarks based on 2D-3D-S and KITTI, we show that our method achieves state-of-the-art performance compared to existing visual localization and image-to-point cloud registration methods. + + + + Physics-Augmented Autoencoder for 3D Skeleton-Based Gait Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Physics-Augmented_Autoencoder_for_3D_Skeleton-Based_Gait_Recognition_ICCV_2023_paper.pdf + In this paper, we introduce the physics-augmented autoencoder (PAA), a framework for 3D skeleton-based human gait recognition. Specifically, we construct the autoencoder with a graph-convolution-based encoder and a physics-based decoder. The encoder takes the skeleton sequence as input and generates the generalized positions and forces of each joint, which are taken by the decoder to reconstruct the input skeleton based on Lagrangian dynamics. In this way, the intermediate representations are physically plausible and discriminative. During inference, the decoder is discarded and an RNN-based classifier takes the output of the encoder for gait recognition. We evaluated our proposed method on three benchmark datasets including Gait3D, GREW, and KinectGait. Our method achieves state-of-the-art performance for 3D skeleton-based gait recognition. Furthermore, extensive ablation studies show that our method generalizes better and is more robust with small-scale training data by incorporating physics knowledge. We also validated the physical plausibility of the intermediate representations by making force predictions on real data with physical annotations. 
+ + + + Regularized Primitive Graph Learning for Unified Vector Mapping + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Regularized_Primitive_Graph_Learning_for_Unified_Vector_Mapping_ICCV_2023_paper.pdf + Large-scale vector mapping is the foundation for transportation and urban planning. Most existing mapping methods are tailored to one specific mapping task, due to task-specific requirements on shape regularization and topology reconstruction. We propose GraphMapper, a unified framework for end-to-end vector map extraction from satellite images. Our key idea is to use the primitive graph as a unified representation of vector maps and to formulate shape regularization and topology reconstruction as primitive graph reconstruction problems that can be solved in the same framework. Specifically, shape regularization is modeled as the consistency between primitive directions and their pairwise relationship. Based on the primitive graph, we design a learning approach to reconstruct primitive graphs in multiple stages. GraphMapper can fully explore primitive-wise and pairwise information for shape regularization and topology reconstruction, resulting in improved primitive graph learning capabilities. We empirically demonstrate the effectiveness of GraphMapper on two challenging mapping tasks for building footprints and road networks. While sharing the majority of the architecture design and adding only a few task-specific components, our model outperforms state-of-the-art methods in both tasks on public benchmarks. Our code will be publicly available. + + + + FlipNeRF: Flipped Reflection Rays for Few-shot Novel View Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Seo_FlipNeRF_Flipped_Reflection_Rays_for_Few-shot_Novel_View_Synthesis_ICCV_2023_paper.pdf + Neural Radiance Field (NeRF) has been a mainstream approach in novel view synthesis owing to its remarkable quality of rendered images and simple architecture. Although NeRF has been developed in various directions, continuously improving its performance, the necessity of a dense set of multi-view images remains a stumbling block for practical application. In this work, we propose FlipNeRF, a novel regularization method for few-shot novel view synthesis that utilizes our proposed flipped reflection rays. The flipped reflection rays are explicitly derived from the input ray directions and estimated normal vectors, and serve as effective additional training rays while enabling the model to estimate more accurate surface normals and learn the 3D geometry effectively. Since the surface normal and the scene depth are both derived from the estimated densities along a ray, an accurate surface normal leads to more exact depth estimation, which is a key factor for few-shot novel view synthesis. Furthermore, with our proposed Uncertainty-aware Emptiness Loss and Bottleneck Feature Consistency Loss, FlipNeRF is able to produce more reliable outputs while effectively reducing floating artifacts across different scene structures, and to enhance the feature-level consistency between the pair of rays cast toward photo-consistent pixels without any additional feature extractor, respectively. Our FlipNeRF achieves SOTA performance on multiple benchmarks across all scenarios. 
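The flipped reflection rays above are, in essence, mirror reflections of the input ray directions about the estimated surface normals. Below is a minimal NumPy sketch of that derivation, assuming unit-norm ray directions and normals and hypothetical array shapes; it illustrates the standard reflection formula r = d - 2(d . n)n, not FlipNeRF's actual implementation.

import numpy as np

def flipped_reflection_rays(ray_dirs, normals):
    # ray_dirs: (N, 3) unit ray directions cast from the camera.
    # normals:  (N, 3) unit surface normals estimated along each ray.
    # Returns the (N, 3) reflected directions r = d - 2 * (d . n) * n.
    d_dot_n = np.sum(ray_dirs * normals, axis=-1, keepdims=True)
    reflected = ray_dirs - 2.0 * d_dot_n * normals
    # Re-normalize to guard against numerical drift in the inputs.
    return reflected / np.linalg.norm(reflected, axis=-1, keepdims=True)

# Example: a ray hitting a plane with normal +z at 45 degrees is mirrored about that plane.
d = np.array([[0.0, np.sqrt(0.5), -np.sqrt(0.5)]])
n = np.array([[0.0, 0.0, 1.0]])
print(flipped_reflection_rays(d, n))  # [[0. 0.7071 0.7071]]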
+ + + + MolGrapher: Graph-based Visual Recognition of Chemical Structures + http://openaccess.thecvf.com//content/ICCV2023/papers/Morin_MolGrapher_Graph-based_Visual_Recognition_of_Chemical_Structures_ICCV_2023_paper.pdf + The automatic analysis of chemical literature has immense potential to accelerate the discovery of new materials and drugs. Much of the critical information in patent documents and scientific articles is contained in figures, depicting the molecule structures. However, automatically parsing the exact chemical structure is a formidable challenge, due to the amount of detailed information, the diversity of drawing styles, and the need for training data. In this work, we introduce MolGrapher to recognize chemical structures visually. First, a deep keypoint detector detects the atoms. Second, we treat all candidate atoms and bonds as nodes and put them in a graph. This construct allows a natural graph representation of the molecule. Last, we classify atom and bond nodes in the graph with a Graph Neural Network. To address the lack of real training data, we propose a synthetic data generation pipeline producing diverse and realistic results. In addition, we introduce a large-scale benchmark of annotated real molecule images, USPTO-30K, to spur research on this critical topic. Extensive experiments on five datasets show that our approach significantly outperforms classical and learning-based methods in most settings. Code, models, and datasets are available. + + + + SAMPLING: Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_SAMPLING_Scene-adaptive_Hierarchical_Multiplane_Images_Representation_for_Novel_View_Synthesis_ICCV_2023_paper.pdf + Recent novel view synthesis methods obtain promising results for relatively small scenes, e.g., indoor environments and scenes with a few objects, but tend to fail for unbounded outdoor scenes with a single image as input. In this paper, we introduce SAMPLING, a Scene-adaptive Hierarchical Multiplane Images Representation for Novel View Synthesis from a Single Image based on improved multiplane images (MPI). Observing that depth distribution varies significantly for unbounded outdoor scenes, we employ an adaptive-bins strategy for MPI to arrange planes in accordance with each scene image. To represent intricate geometry and multi-scale details, we further introduce a hierarchical refinement branch, which results in high-quality synthesized novel views. Our method demonstrates considerable performance gains in synthesizing large-scale unbounded outdoor scenes using a single image on the KITTI dataset and generalizes well to the unseen Tanks and Temples dataset. The code and models will be made available at https://pkuvdig.github.io/SAMPLING/. + + + + PointOdyssey: A Large-Scale Synthetic Dataset for Long-Term Point Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_PointOdyssey_A_Large-Scale_Synthetic_Dataset_for_Long-Term_Point_Tracking_ICCV_2023_paper.pdf + We introduce PointOdyssey, a large-scale synthetic dataset, and data generation framework, for the training and evaluation of long-term fine-grained tracking algorithms. Our goal is to advance the state-of-the-art by placing emphasis on long videos with naturalistic motion. 
Toward the goal of naturalism, we animate deformable characters using real-world motion capture data, we build 3D scenes to match the motion capture environments, and we render camera viewpoints using trajectories mined via structure-from-motion on real videos. We create combinatorial diversity by randomizing character appearance, motion profiles, materials, lighting, 3D assets, and atmospheric effects. Our dataset currently includes 104 videos, averaging 2,000 frames long, with orders of magnitude more correspondence annotations than prior work. We show that existing methods can be trained from scratch in our dataset and outperform the published variants. Finally, we introduce modifications to the PIPs point tracking method, greatly widening its temporal receptive field, which improves its performance on PointOdyssey as well as on two real-world benchmarks. Our data and code are publicly available at: https://pointodyssey.com. + + + + LNPL-MIL: Learning from Noisy Pseudo Labels for Promoting Multiple Instance Learning in Whole Slide Image + http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_LNPL-MIL_Learning_from_Noisy_Pseudo_Labels_for_Promoting_Multiple_Instance_ICCV_2023_paper.pdf + Gigapixel Whole Slide Images (WSIs) aided patient diagnosis and prognosis analysis are promising directions in computational pathology. However, limited by expensive and time-consuming annotation costs, WSIs usually only have weak annotations, including 1) WSI-level Annotations (WA) and 2) Limited Patch-level Annotations (LPA). Currently, Multiple Instance Learning (MIL) often exploits WA, while LPA usually assign pseudo-labels for unlabeled data. Intuitively, pseudo-labels can serve as a practical guide for MIL, but the unreliable prediction caused by LPA inevitably introduces noise. Furthermore, WA-supervised MIL training inevitably suffers from the semantical unalignment between instances and bag-level labels. To address these problems, we design a framework called Learning from Noisy Pseudo Labels for promoting Multiple Instance Learning (LNPL-MIL), which considers both types of weak annotation. In MIL, we propose a Transformer aware of instance Order and Distribution (TOD-MIL) that strengthens instances correlation and weakens semantical unalignment in the bag. We validate our LNPL-MIL on Tumor Diagnosis and Survival Prediction, achieving state-of-the-art performance with at least 2.7%/2.9% AUC and 2.6%/2.3% C-Index improvement with the patches labeled for two scales. Ablation study and visualization analysis further verify the effectiveness. + + + + Few-Shot Dataset Distillation via Translative Pre-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Few-Shot_Dataset_Distillation_via_Translative_Pre-Training_ICCV_2023_paper.pdf + Dataset distillation aims at a small synthetic dataset to mimic the training performance on neural networks of a given large dataset. Existing approaches heavily rely on an iterative optimization to update synthetic data and multiple forward-backward passes over thousands of neural network spaces, which introduce significant overhead for computation and are inconvenient in scenarios requiring high efficiency. In this paper, we focus on few-shot dataset distillation, where a distilled dataset is synthesized with only a few or even a single network. 
To this end, we introduce the notion of distillation space, such that synthetic data optimized only in this specific space can achieve the effect of those optimized through numerous neural networks, with dramatically accelerated training and reduced computational cost. To learn such a distillation space, we first formulate the problem as a quad-level optimization framework and propose a bi-level algorithm. Nevertheless, the algorithm in its original form has a large memory footprint in practice due to the back-propagation through an unrolled computational graph. We then convert the problem of learning the distillation space to a first-order one based on image translation. Specifically, the synthetic images are optimized in an arbitrary but fixed neural space and then translated to those in the targeted distillation space. We pre-train the translator on some large datasets like ImageNet so that it requires only a limited number of adaptation steps on the target dataset. Extensive experiments demonstrate that the translator after pre-training and a limited number of adaptation steps achieves comparable distillation performance with state of the arts, with 15x acceleration. It also exerts satisfactory generalization performance across different datasets, storage budgets, and numbers of classes. + + + + TinyCLIP: CLIP Distillation via Affinity Mimicking and Weight Inheritance + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_TinyCLIP_CLIP_Distillation_via_Affinity_Mimicking_and_Weight_Inheritance_ICCV_2023_paper.pdf + In this paper, we propose a novel cross-modal distillation method, called TinyCLIP, for large-scale language-image pre-trained models. The method introduces two core techniques: affinity mimicking and weight inheritance. Affinity mimicking explores the interaction between modalities during distillation, enabling student models to mimic teachers' behavior of learning cross-modal feature alignment in a visual-linguistic affinity space. Weight inheritance transmits the pre-trained weights from the teacher models to their student counterparts to improve distillation efficiency. Moreover, we extend the method into a multi-stage progressive distillation to mitigate the loss of informative weights during extreme compression. Comprehensive experiments demonstrate the efficacy of TinyCLIP, showing that it can reduce the size of the pre-trained CLIP ViT-B/32 by 50%, while maintaining comparable zero-shot performance. While aiming for comparable performance, distillation with weight inheritance can speed up the training by 1.4 - 7.8x compared to training from scratch. Moreover, our TinyCLIP ViT-8M/16, trained on YFCC-15M, achieves an impressive zero-shot top-1 accuracy of 41.1% on ImageNet, surpassing the original CLIP ViT-B/16 by 3.5% while utilizing only 8.9% parameters. Finally, we demonstrate the good transferability of TinyCLIP in various downstream tasks. Code and models will be open-sourced at aka.ms/tinyclip. + + + + Democratising 2D Sketch to 3D Shape Retrieval Through Pivoting + http://openaccess.thecvf.com//content/ICCV2023/papers/Chowdhury_Democratising_2D_Sketch_to_3D_Shape_Retrieval_Through_Pivoting_ICCV_2023_paper.pdf + This paper studies the problem of 2D sketch to 3D shape retrieval, but with a focus on democratising the process. 
We would like this democratisation to happen on two fronts: (i) to remove the need for large-scale specifically sourced 2D sketch and 3D shape datasets, and (ii) to remove restrictions on how well the user needs to sketch and from what viewpoint. The end result is a system that is trainable using existing datasets, and once trained allows users to sketch regardless of drawing skills and without restriction on view angle. We achieve all this via a clever use of pivoting, along with novel designs that inject 3D understanding of 2D sketches into the system. We perform pivoting on two existing datasets, each from a research domain distant from the other: 2D sketch and photo pairs from the sketch-based image retrieval field (SBIR), and 3D shapes from ShapeNet. It follows that the actual feature pivoting happens on photos from the former and 2D projections from the latter. Doing this already achieves most of our democratisation challenge -- the level of 2D sketch abstraction embedded in the SBIR dataset offers democratisation on drawing quality, and the whole system works without a specifically sourced 2D sketch and 3D model pair. To further achieve democratisation on sketching viewpoint, we "lift" 2D sketches to 3D space using Blind Perspective-n-Points (BPnP), which injects 3D-aware information into the sketch encoder. Results show ours achieves competitive performance compared with fully-supervised baselines, while meeting all the stated democratisation goals. + + + + KECOR: Kernel Coding Rate Maximization for Active 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_KECOR_Kernel_Coding_Rate_Maximization_for_Active_3D_Object_Detection_ICCV_2023_paper.pdf + Achieving a reliable LiDAR-based object detector in autonomous driving is paramount, but its success hinges on obtaining large amounts of precise 3D annotations. Active learning (AL) seeks to mitigate the annotation burden through algorithms that use fewer labels and can attain performance comparable to fully supervised learning. Although AL has shown promise, current approaches prioritize the selection of unlabeled point clouds with high aleatoric and/or epistemic uncertainty, leading to the selection of more instances for labeling and reduced computational efficiency. In this paper, we resort to a novel kernel coding rate maximization (KECOR) strategy which aims to identify the most informative point clouds to acquire labels for through the lens of information theory. Greedy search is applied to seek desired point clouds that can maximize the minimal number of bits required to encode the latent features. To determine the uniqueness and informativeness of the selected samples from the model perspective, we construct a proxy network of the 3D detector head and compute the outer product of Jacobians from all proxy layers to form the empirical neural tangent kernel (NTK) matrix. To accommodate both one-stage (i.e., SECOND) and two-stage detectors (i.e., PV-RCNN), we further incorporate classification entropy maximization and achieve a good trade-off between detection performance and the total number of bounding boxes selected for annotation. Extensive experiments conducted on two 3D benchmarks and a 2D detection dataset evidence the superiority and versatility of the proposed approach. Our results show that approximately 44% of box-level annotation costs and 26% of computational time are reduced compared to the state-of-the-art AL method, without compromising detection performance. 
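The "minimal number of bits" objective referenced above is a coding rate; a common closed form for an n x n kernel matrix K is R(K) = 0.5 * logdet(I + K / (n * eps^2)). The sketch below is a hedged illustration of greedy selection under such a coding-rate criterion, using a plain feature inner-product kernel as a stand-in for the paper's empirical NTK; the function names, epsilon value, and toy data are assumptions, not the authors' implementation.

import numpy as np

def coding_rate(K, eps=0.5):
    # Log-determinant coding rate of an n x n kernel (Gram) matrix K.
    n = K.shape[0]
    if n == 0:
        return 0.0
    _, logdet = np.linalg.slogdet(np.eye(n) + K / (n * eps ** 2))
    return 0.5 * logdet

def greedy_select(features, budget, eps=0.5):
    # Greedily pick `budget` sample indices whose kernel sub-matrix maximizes the coding rate.
    K_full = features @ features.T  # stand-in kernel; KECOR builds an empirical NTK instead
    selected = []
    for _ in range(budget):
        best_gain, best_idx = -np.inf, None
        for i in range(features.shape[0]):
            if i in selected:
                continue
            idx = selected + [i]
            gain = coding_rate(K_full[np.ix_(idx, idx)], eps)
            if gain > best_gain:
                best_gain, best_idx = gain, i
        selected.append(best_idx)
    return selected

# Example: choose 3 of 8 random latent feature vectors to send for labeling.
rng = np.random.default_rng(0)
print(greedy_select(rng.normal(size=(8, 16)), budget=3))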
+ + + + Locating Noise is Halfway Denoising for Semi-Supervised Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Fang_Locating_Noise_is_Halfway_Denoising_for_Semi-Supervised_Segmentation_ICCV_2023_paper.pdf + We investigate semi-supervised semantic segmentation with self-training, where a teacher model generates pseudo masks to exploit the benefits of a large amount of unlabeled images. We notice that the noisy label from the generated pseudo masks is the major obstacle to achieving good performance. Previous works all treat the noise in pixel level and ignore the contextual information of the noise. This work shows that locating the patch-wise noisy region is a better way to deal with noise. To be specific, our method, named Uncertainty-aware Patch CutMix (UPC), first estimates the uncertainty of per-pixel prediction for pseudo masks of unlabeled images. Then UPC splits the uncertainty map into patches and calculates patch-wise uncertainty. UPC selects top-k most uncertain patches to generate the uncertain regions. Finally, uncertain regions are replaced with reliable ones from labeled images. We conduct extensive experiments using standard semi-supervised settings on Pascal VOC and Cityscapes. Experiment results show that UPC can significantly boost the performance of the state-of-the-art methods. In addition, we further demonstrate that our UPC is robust to out-of-distribution unlabeled images, eg, MSCOCO. + + + + SSB: Simple but Strong Baseline for Boosting Performance of Open-Set Semi-Supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_SSB_Simple_but_Strong_Baseline_for_Boosting_Performance_of_Open-Set_ICCV_2023_paper.pdf + Semi-supervised learning (SSL) methods effectively leverage unlabeled data to improve model generalization. However, SSL models often underperform in open-set scenarios, where unlabeled data contain outliers from novel categories that do not appear in the labeled set. In this paper, we study the challenging and realistic open-set SSL setting, where the goal is to both correctly classify inliers and to detect outliers. Intuitively, the inlier classifier should be trained on inlier data only. However, we find that inlier classification performance can be largely improved by incorporating high-confidence pseudo-labeled data, regardless of whether they are inliers or outliers. Also, we propose to utilize non-linear transformations to separate the features used for inlier classification and outlier detection in the multi-task learning framework, preventing adverse effects between them. Additionally, we introduce pseudo-negative mining, which further boosts outlier detection performance. The three ingredients lead to what we call Simple but Strong Baseline (SSB) for open-set SSL. In experiments, SSB greatly improves both inlier classification and outlier detection performance, outperforming existing methods by a large margin. Our code will be released at https://github.com/YUE-FAN/SSB. + + + + Cross-Modal Orthogonal High-Rank Augmentation for RGB-Event Transformer-Trackers + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Cross-Modal_Orthogonal_High-Rank_Augmentation_for_RGB-Event_Transformer-Trackers_ICCV_2023_paper.pdf + This paper addresses the problem of cross-modal object tracking from RGB videos and event data. Rather than constructing a complex cross-modal fusion network, we explore the great potential of a pre-trained vision Transformer (ViT). 
Particularly, we carefully investigate plug-and-play training augmentations that encourage the ViT to bridge the vast distribution gap between the two modalities, enabling comprehensive cross-modal information interaction and thus enhancing its ability. Specifically, we propose a mask modeling strategy that randomly masks a specific modality of some tokens to encourage tokens from different modalities to interact proactively. To mitigate network oscillations resulting from the masking strategy and further amplify its positive effect, we then theoretically propose an orthogonal high-rank loss to regularize the attention matrix. Extensive experiments demonstrate that our plug-and-play training augmentation techniques can significantly boost state-of-the-art one-stream and two-stream trackers in terms of both tracking precision and success rate. Our new perspective and findings will potentially bring insights to the field of leveraging powerful pre-trained ViTs to model cross-modal data. The code is publicly available at https://github.com/ZHU-Zhiyu/High-Rank_RGB-Event_Tracker. + + + + How to Boost Face Recognition with StyleGAN? + http://openaccess.thecvf.com//content/ICCV2023/papers/Sevastopolskiy_How_to_Boost_Face_Recognition_with_StyleGAN_ICCV_2023_paper.pdf + State-of-the-art face recognition systems require huge amounts of labeled training data. Given the priority of privacy in face recognition applications, the data is limited to celebrity web crawls, which have issues such as skewed distributions of ethnicities and limited numbers of identities. On the other hand, the self-supervised revolution in the industry motivates research on adaptation of the related techniques to facial recognition. One of the most popular practical tricks is to augment the dataset with samples drawn from high-resolution, high-fidelity models (e.g., StyleGAN-like), while preserving the identity. We show that a simple approach based on fine-tuning an encoder for StyleGAN allows us to improve upon state-of-the-art facial recognition and performs better than training on synthetic face identities. We also collect large-scale unlabeled datasets with controllable ethnic constitution -- AfricanFaceSet-5M (5 million images of different people) and AsianFaceSet-3M (3 million images of different people) -- and we show that pretraining on each of them improves recognition of the respective ethnicities (as well as others), while combining all unlabeled datasets results in the biggest performance increase. Our self-supervised strategy is the most useful with limited amounts of labeled training data, which can be beneficial for more tailored face recognition tasks and when facing privacy concerns. Evaluation is provided based on the standard RFW dataset and a new large-scale RB-WebFace benchmark. + + + + Text2Tex: Text-driven Texture Synthesis via Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Text2Tex_Text-driven_Texture_Synthesis_via_Diffusion_Models_ICCV_2023_paper.pdf + We present Text2Tex, a novel method for generating high-quality textures for 3D meshes from given text prompts. Our method incorporates inpainting into a pre-trained depth-aware image diffusion model to progressively synthesize high-resolution partial textures from multiple viewpoints. 
To avoid accumulating inconsistent and stretched artifacts across views, we dynamically segment the rendered view into a generation mask, which represents the generation status of each visible texel. This partitioned view representation guides the depth-aware inpainting model to generate and update partial textures for the corresponding regions. Furthermore, we propose an automatic view sequence generation scheme to determine the next best view for updating the partial texture. Extensive experiments demonstrate that our method significantly outperforms the existing text-driven approaches and GAN-based methods. + + + + MUVA: A New Large-Scale Benchmark for Multi-View Amodal Instance Segmentation in the Shopping Scenario + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_MUVA_A_New_Large-Scale_Benchmark_for_Multi-View_Amodal_Instance_Segmentation_ICCV_2023_paper.pdf + Amodal Instance Segmentation (AIS) endeavors to accurately deduce complete object shapes that are partially or fully occluded. However, the inherent ill-posed nature of single-view datasets poses challenges in determining occluded shapes. A multi-view framework may help alleviate this problem, as humans often adjust their perspective when encountering occluded objects. At present, this approach has not yet been explored by existing methods and datasets. To bridge this gap, we propose a new task called Multi-view Amodal Instance Segmentation (MAIS) and introduce the MUVA dataset, the first MUlti-View AIS dataset that takes the shopping scenario as instantiation. MUVA provides comprehensive annotations, including multi-view amodal/visible segmentation masks, 3D models, and depth maps, making it the largest image-level AIS dataset in terms of both the number of images and instances. Additionally, we propose a new method for aggregating representative features across different instances and views, which demonstrates promising results in accurately predicting occluded objects from one viewpoint by leveraging information from other viewpoints. Besides, we also demonstrate that MUVA can benefit the AIS task in real-world scenarios. + + + + SeiT: Storage-Efficient Vision Training with Tokens Using 1% of Pixel Storage + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_SeiT_Storage-Efficient_Vision_Training_with_Tokens_Using_1_of_Pixel_ICCV_2023_paper.pdf + We need billion-scale images to achieve more generalizable and ground-breaking vision models, as well as massive dataset storage to ship the images (e.g., the LAION-4B dataset needs 240TB storage space). However, it has become challenging to deal with unlimited dataset storage with limited storage infrastructure. A number of storage-efficient training methods have been proposed to tackle the problem, but they are rarely scalable or suffer from severe damage to performance. In this paper, we propose a storage-efficient training strategy for vision classifiers for large-scale datasets (e.g., ImageNet) that only uses 1024 tokens per instance without using the raw level pixels; our token storage only needs <1% of the original JPEG-compressed raw pixels. We also propose token augmentations and a Stem-adaptor module to make our approach able to use the same architecture as pixel-based approaches with only minimal modifications on the stem layer and the carefully tuned optimization settings. Our experimental results on ImageNet-1K show that our method significantly outperforms other storage-efficient training methods with a large gap. 
We further show the effectiveness of our method in other practical scenarios: storage-efficient pre-training and continual learning. We will make our implementation and tokenized dataset publicly available after acceptance. + + + + CLIP2Point: Transfer CLIP to Point Cloud Classification with Image-Depth Pre-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_CLIP2Point_Transfer_CLIP_to_Point_Cloud_Classification_with_Image-Depth_Pre-Training_ICCV_2023_paper.pdf + Pre-training across 3D vision and language remains under development because of limited training data. Recent works attempt to transfer vision-language (V-L) pre-training methods to 3D vision. However, the domain gap between 3D and images remains unsolved, so V-L pre-trained models are restricted in 3D downstream tasks. To address this issue, we propose CLIP2Point, an image-depth pre-training method based on contrastive learning to transfer CLIP to the 3D domain, and adapt it to point cloud classification. We introduce a new depth rendering setting that produces a better visual effect, and then render 52,460 pairs of images and depth maps from ShapeNet for pre-training. The pre-training scheme of CLIP2Point combines cross-modality learning, which enforces the depth features to capture expressive visual and textual features, and intra-modality learning, which enhances the invariance of depth aggregation. Additionally, we propose a novel Gated Dual-Path Adapter (GDPA), i.e., a dual-path structure with global-view aggregators and gated fusion for downstream representative learning. It allows ensembling CLIP and CLIP2Point, tuning pre-trained knowledge to downstream tasks through efficient adaptation. Experimental results show that CLIP2Point is effective in transferring CLIP knowledge to 3D vision. Our CLIP2Point outperforms other 3D transfer learning and pre-training networks, achieving state-of-the-art results on zero-shot, few-shot, and fully-supervised classification. + + + + Parametric Classification for Generalized Category Discovery: A Baseline Study + http://openaccess.thecvf.com//content/ICCV2023/papers/Wen_Parametric_Classification_for_Generalized_Category_Discovery_A_Baseline_Study_ICCV_2023_paper.pdf + Generalized Category Discovery (GCD) aims to discover novel categories in unlabelled datasets using knowledge learned from labelled samples. Previous studies argued that parametric classifiers are prone to overfitting to seen categories, and endorsed using a non-parametric classifier formed with semi-supervised k-means. However, in this study, we investigate the failure of parametric classifiers, verify the effectiveness of previous design choices when high-quality supervision is available, and identify unreliable pseudo-labels as a key problem. We demonstrate that two prediction biases exist: the classifier tends to predict seen classes more often, and produces an imbalanced distribution across seen and novel categories. Based on these findings, we propose a simple yet effective parametric classification method that benefits from entropy regularisation, achieves state-of-the-art performance on multiple GCD benchmarks, and shows strong robustness to unknown class numbers. We hope the investigation and the proposed simple framework can serve as a strong baseline to facilitate future studies in this field. Our code is available at: https://github.com/CVMI-Lab/SimGCD. 
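In its simplest form, the entropy regularisation mentioned above maximizes the entropy of the classifier's batch-averaged prediction, which discourages the collapsed, seen-class-biased distributions described in the abstract. The sketch below is an illustrative mean-entropy regulariser under that reading, not the exact SimGCD objective; the function name and the toy batches are assumptions.

import numpy as np

def mean_entropy_regularizer(logits):
    # Negative entropy of the batch-averaged class distribution.
    # Adding this term to the loss (with a positive weight) penalizes
    # batches whose predictions collapse onto a few (typically seen) classes.
    # logits: (batch, num_classes) classifier outputs.
    z = logits - logits.max(axis=1, keepdims=True)           # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    mean_probs = probs.mean(axis=0)                           # average prediction over the batch
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12))
    return -entropy  # minimizing this maximizes the entropy of the mean prediction

# Example: a batch that piles onto one class incurs a larger regularizer value.
biased = np.array([[5.0, 0.0, 0.0], [6.0, 0.1, 0.0]])
spread = np.array([[5.0, 0.0, 0.0], [0.0, 5.0, 0.0]])
print(mean_entropy_regularizer(biased) > mean_entropy_regularizer(spread))  # True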
+ + + + Denoising Diffusion Autoencoders are Unified Self-supervised Learners + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiang_Denoising_Diffusion_Autoencoders_are_Unified_Self-supervised_Learners_ICCV_2023_paper.pdf + Inspired by recent advances in diffusion models, which are reminiscent of denoising autoencoders, we investigate whether they can acquire discriminative representations for classification via generative pre-training. This paper shows that the networks in diffusion models, namely denoising diffusion autoencoders (DDAE), are unified self-supervised learners: by pre-training on unconditional image generation, DDAE has already learned strongly linear-separable representations within its intermediate layers without auxiliary encoders, thus making diffusion pre-training emerge as a general approach for generative-and-discriminative dual learning. To validate this, we conduct linear probe and fine-tuning evaluations. Our diffusion-based approach achieves 95.9% and 50.0% linear evaluation accuracies on CIFAR-10 and Tiny-ImageNet, respectively, and is comparable to contrastive learning and masked autoencoders for the first time. Transfer learning from ImageNet also confirms the suitability of DDAE for Vision Transformers, suggesting the potential to scale DDAEs as unified foundation models. Code is available at github.com/FutureXiang/ddae. + + + + Cross-view Topology Based Consistent and Complementary Information for Deep Multi-view Clustering + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Cross-view_Topology_Based_Consistent_and_Complementary_Information_for_Deep_Multi-view_ICCV_2023_paper.pdf + Multi-view clustering aims to extract valuable information from different sources or perspectives. Over the years, the deep neural network has demonstrated its superior representation learning capability in multi-view clustering and achieved impressive performance. However, most existing deep clustering approaches are dedicated to merging and exploring the consistent latent representation across multiple views while overlooking the abundant complementary information in each view. Furthermore, finding correlations between multiple views in an unsupervised setting is a significant challenge. To tackle these issues, we present a novel Cross-view Topology based Consistent and Complementary information extraction framework, termed CTCC. In detail, deep embedding can be obtained from the bipartite graph learning module for each view individually. CTCC then constructs the cross-view topological graph based on the OT distance between the bipartite graph of each view. Utilizing the above graph, we maximize the mutual information across views to learn consistent information and enhance the complementarity of each view by selectively isolating distributions from each other. Extensive experiments on five challenging datasets verify that CTCC outperforms existing methods significantly. + + + + Distribution-Consistent Modal Recovering for Incomplete Multimodal Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Distribution-Consistent_Modal_Recovering_for_Incomplete_Multimodal_Learning_ICCV_2023_paper.pdf + Recovering missed modality is popular in incomplete multimodal learning because it usually benefits downstream tasks. 
However, the existing methods often directly estimate missed modalities from the observed ones by deep neural networks, lacking consideration of the distribution gap between modalities, resulting in the inconsistency of distributions between the recovered data and true data. To mitigate this issue, in this work, we propose a novel recovery paradigm, Distribution-Consistent Modal Recovering (DiCMoR), to transfer the distributions from available modalities to missed modalities, which thus maintains the distribution consistency of recovered data. In particular, we design a class-specific flow based modality recovery method to transform cross-modal distributions on the condition of sample class, which could well predict a distribution-consistent space for missing modality by virtue of the invertibility and exact density estimation of normalizing flow. The generated data from the predicted distribution is jointly integrated with available modalities for the task of classification. Experiments demonstrate that DiCMoR gains superior performances and is more robust than existing state-of-the-art methods under various missing patterns. Visualization results show that the distribution gaps between recovered modalities and missing modalities are mitigated. + + + + ContactGen: Generative Contact Modeling for Grasp Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_ContactGen_Generative_Contact_Modeling_for_Grasp_Generation_ICCV_2023_paper.pdf + This paper presents a novel object-centric contact representation ContactGen for hand-object interaction. The ContactGen comprises 3 components: a contact map indicates the contact location, a part map represents the contact hand part, and a direction map tells the contact direction within each part. Given an input object, we propose a conditional generative model to predict ContactGen and adopt model-based optimization to predict diverse and geometrically feasible grasps. Experimental results demonstrate our method can generate high-fidelity and diverse human grasps for various objects. + + + + Ego-Humans: An Ego-Centric 3D Multi-Human Benchmark + http://openaccess.thecvf.com//content/ICCV2023/papers/Khirodkar_Ego-Humans_An_Ego-Centric_3D_Multi-Human_Benchmark_ICCV_2023_paper.pdf + We present EgoHumans, a new multi-view multi-human video benchmark to advance the state-of-the-art of egocentric human 3D pose estimation and tracking. Existing egocentric benchmarks either capture single subject or indoor-only scenarios, which limit the generalization of computer vision algorithms for real-world applications. We propose a novel 3D capture setup to construct a comprehensive egocentric multi-human benchmark in the wild with annotations to support diverse tasks such as human detection, tracking, 2D/3D pose estimation, and mesh recovery. We leverage consumer-grade wearable camera-equipped glasses for the egocentric view, which enables us to capture dynamic activities like playing tennis, fencing, volleyball, etc. Furthermore, our multi-view setup generates accurate 3D ground truth even under severe or complete occlusion. The dataset consists of more than 125k egocentric images, spanning diverse scenes with a particular focus on challenging and unchoreographed multi-human activities and fast-moving egocentric views. We rigorously evaluate existing state-of-the-art methods and highlight their limitations in the egocentric scenario, specifically on multi-human tracking. 
To address such limitations, we propose EgoFormer, a novel approach with a multi-stream transformer architecture and explicit 3D spatial reasoning to estimate and track the human pose. EgoFormer significantly outperforms prior art by 13.6% IDF1 on the EgoHumans dataset. + + + + Reference-guided Controllable Inpainting of Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Mirzaei_Reference-guided_Controllable_Inpainting_of_Neural_Radiance_Fields_ICCV_2023_paper.pdf + The popularity of Neural Radiance Fields (NeRFs) for view synthesis has led to a desire for NeRF editing tools. Here, we focus on inpainting regions in a view-consistent and controllable manner. In addition to the typical NeRF inputs and masks delineating the unwanted region in each view, we require only a single inpainted view of the scene, i.e., a reference view. We use monocular depth estimators to back-project the inpainted view to the correct 3D positions. Then, via a novel rendering technique, a bilateral solver can construct view-dependent effects in non-reference views, making the inpainted region appear consistent from any view. For non-reference disoccluded regions, which cannot be supervised by the single reference view, we devise a method based on image inpainters to guide both the geometry and appearance. Our approach shows superior performance to NeRF inpainting baselines, with the additional advantage that a user can control the generated scene via a single inpainted image. + + + + Diffusion-Guided Reconstruction of Everyday Hand-Object Interaction Clips + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Diffusion-Guided_Reconstruction_of_Everyday_Hand-Object_Interaction_Clips_ICCV_2023_paper.pdf + We tackle the task of reconstructing hand-object interactions from short video clips. Given an input video, our approach casts 3D inference as a per-video optimization and recovers a neural 3D representation of the object shape, as well as the time-varying motion and hand articulation. While the input video naturally provides some multi-view cues to guide 3D inference, these are insufficient on their own due to occlusions and limited viewpoint variations. To obtain accurate 3D, we augment the multi-view signals with generic data-driven priors to guide reconstruction. Specifically, we learn a diffusion network to model the conditional distribution of (geometric) renderings of objects conditioned on hand configuration and category label, and leverage it as a prior to guide the novel-view renderings of the reconstructed scene. We empirically evaluate our approach on egocentric videos across 6 object categories, and observe significant improvements over prior single-view and multi-view methods. Finally, we demonstrate our system's ability to reconstruct arbitrary clips from YouTube, showing both 1st and 3rd person interactions. + + + + DVGaze: Dual-View Gaze Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_DVGaze_Dual-View_Gaze_Estimation_ICCV_2023_paper.pdf + Gaze estimation methods estimate gaze from facial appearance with a single camera. However, due to the limited view of a single camera, the captured facial appearance cannot provide complete facial information and thus complicates the gaze estimation problem. Recently, camera devices have been updated rapidly. Dual cameras are affordable for users and have been applied in many devices. This development suggests that gaze estimation performance can be further improved with dual-view gaze estimation.
In this paper, we propose a dual-view gaze estimation network (DV-Gaze). DV-Gaze estimates dual-view gaze directions from a pair of images. We first propose a dual-view interactive convolution (DIC) block in DV-Gaze. DIC blocks exchange dual-view information during convolution at multiple feature scales. Each block fuses dual-view features along epipolar lines and compensates the original features with the fused feature. We further propose a dual-view transformer to estimate gaze from dual-view features. Camera poses are encoded to indicate the position information in the transformer. We also consider the geometric relation between dual-view gaze directions and propose a dual-view gaze consistency loss for DV-Gaze. DV-Gaze achieves state-of-the-art performance on the ETH-XGaze and EVE datasets. Our experiments also prove the potential of dual-view gaze estimation. We release code at https://github.com/yihuacheng/DVGaze. + + + + Efficient Joint Optimization of Layer-Adaptive Weight Pruning in Deep Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Efficient_Joint_Optimization_of_Layer-Adaptive_Weight_Pruning_in_Deep_Neural_ICCV_2023_paper.pdf + In this paper, we propose a novel layer-adaptive weight-pruning approach for Deep Neural Networks (DNNs) that addresses the challenge of minimizing output distortion while adhering to a target pruning ratio constraint. Our approach takes into account the collective influence of all layers to design a layer-adaptive pruning scheme. We discover and utilize a very important additivity property of output distortion caused by pruning weights on multiple layers. This property enables us to formulate the pruning as a combinatorial optimization problem and efficiently solve it through dynamic programming. By decomposing the problem into sub-problems, we achieve linear time complexity, making our optimization algorithm fast and feasible to run on CPUs. Our extensive experiments demonstrate the superiority of our approach over existing methods on the ImageNet and CIFAR-10 datasets. On CIFAR-10, our method achieves remarkable improvements, outperforming others by up to 1.0% for ResNet-32, 0.5% for VGG-16, and 0.7% for DenseNet-121 in terms of top-1 accuracy. On ImageNet, we achieve up to 4.7% and 4.6% higher top-1 accuracy compared to other methods for VGG-16 and ResNet-50, respectively. These results highlight the effectiveness and practicality of our approach for enhancing DNN performance through layer-adaptive weight pruning. Code will be available on https://github.com/Akimoto-Cris/RD_VIT_PRUNE. + + + + Exploring the Sim2Real Gap Using Digital Twins + http://openaccess.thecvf.com//content/ICCV2023/papers/Sudhakar_Exploring_the_Sim2Real_Gap_Using_Digital_Twins_ICCV_2023_paper.pdf + It is very time-consuming to create datasets for training computer vision models. An emerging alternative is to use synthetic data, but if the synthetic data is not similar enough to the real data, the performance is typically below that of training with real data. Thus using synthetic data still requires a large amount of time, money, and skill as one needs to author the data carefully. In this paper, we seek to understand which aspects of this authoring process are most critical. We present an analysis of which factors of variation between simulated and real data are most important. We capture images of YCB objects to create a novel YCB-Real dataset.
We then create a novel synthetic "digital twin" dataset, YCB-Synthetic, which matches the YCB-Real dataset and includes a variety of artifacts added to the synthetic data. We study the effects of these artifacts on our dataset and two existing published datasets across two different computer vision tasks: object detection and instance segmentation. We provide an analysis of the cost-benefit trade-offs between artist time for fixing artifacts and trained model accuracy. We plan to release this dataset (images and 3D assets) so it can be further used by the community. + + + + Re:PolyWorld - A Graph Neural Network for Polygonal Scene Parsing + http://openaccess.thecvf.com//content/ICCV2023/papers/Zorzi_RePolyWorld_-_A_Graph_Neural_Network_for_Polygonal_Scene_Parsing_ICCV_2023_paper.pdf + While most state-of-the-art instance segmentation methods produce pixel-wise segmentation masks, numerous applications demand precise vector polygons of detected objects instead of rasterized output. This paper proposes Re:PolyWorld as a remastered and improved version of PolyWorld, a neural network that extracts object vertices from an image and connects them optimally to generate precise polygons. The objective of this work was to overcome weaknesses and shortcomings of the original model, as well as to introduce an improved polygonal representation to obtain a general-purpose method for polygon extraction in images. The architecture has been redesigned to not only exploit vertex features, but to also make use of the visual appearance of edges. To this end, an edge-aware Graph Neural Network predicts the connection strength between each pair of vertices, which is further used to compute the assignment by solving a differentiable optimal transport problem. The proposed redefinition of the polygonal scene turns the method into a powerful generalized approach that can be applied to a large variety of tasks and problem settings, such as building extraction, floorplan reconstruction and even wireframe parsing. Re:PolyWorld not only outperforms the original model on building extraction in aerial images, thanks to the proposed joint analysis of vertices and edges, but also beats the state-of-the-art in multiple other domains. + + + + Video State-Changing Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Video_State-Changing_Object_Segmentation_ICCV_2023_paper.pdf + Daily objects commonly experience state changes. For example, slicing a cucumber changes its state from whole to sliced. Learning about object state changes in Video Object Segmentation (VOS) is crucial for understanding and interacting with the visual world. Conventional VOS benchmarks do not consider this challenging yet crucial problem. This paper makes a pioneering effort to introduce a weakly-supervised benchmark on Video State-Changing Object Segmentation (VSCOS). We construct our VSCOS benchmark by selecting state-changing videos from existing datasets. To advocate an annotation-efficient approach to state-changing object segmentation, we only annotate the first and last frames of training videos, which is different from conventional VOS. Notably, an open-vocabulary setting is included to evaluate the generalization to novel types of objects or state changes. We empirically illustrate that state-of-the-art VOS models struggle with state-changing objects and lose track after the state changes.
We analyze the main difficulties of our VSCOS task and identify three technical improvements, namely, fine-tuning strategies, representation learning, and integrating motion information. Applying these improvements results in a strong baseline for segmenting state-changing objects consistently. Our benchmark and baseline methods are publicly available at https://github.com/venom12138/VSCOS. + + + + MonoNeRF: Learning a Generalizable Dynamic Radiance Field from Monocular Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Tian_MonoNeRF_Learning_a_Generalizable_Dynamic_Radiance_Field_from_Monocular_Videos_ICCV_2023_paper.pdf + In this paper, we target the problem of learning a generalizable dynamic radiance field from monocular videos. Different from most existing NeRF methods that are based on multiple views, monocular videos only contain one view at each timestamp, thereby suffering from ambiguity along the view direction in estimating point features and scene flows. Previous studies such as DynNeRF disambiguate point features by positional encoding, which is not transferable and severely limits the generalization ability. As a result, these methods have to train one independent model for each scene and suffer from heavy computational costs when applied to the increasing number of monocular videos in real-world applications. To address this, we propose MonoNeRF to simultaneously learn point features and scene flows with point trajectory and feature correspondence constraints across frames. More specifically, we learn an implicit velocity field to estimate point trajectory from temporal features with Neural ODE, which is followed by a flow-based feature aggregation module to obtain spatial features along the point trajectory. We jointly optimize temporal and spatial features in an end-to-end manner. Experiments show that our MonoNeRF is able to learn from multiple scenes and support new applications such as scene editing, unseen frame synthesis, and fast novel scene adaptation. Codes are available at https://github.com/tianfr/MonoNeRF + + + + PG-RCNN: Semantic Surface Point Generation for 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Koo_PG-RCNN_Semantic_Surface_Point_Generation_for_3D_Object_Detection_ICCV_2023_paper.pdf + One of the main challenges in LiDAR-based 3D object detection is that the sensors often fail to capture the complete spatial information about the objects due to long distance and occlusion. Two-stage detectors with point cloud completion approaches tackle this problem by adding more points to the regions of interest (RoIs) with a pre-trained network. However, these methods generate dense point clouds of objects for all region proposals, assuming that objects always exist in the RoIs. This leads to indiscriminate point generation for incorrect proposals as well. Motivated by this, we propose Point Generation R-CNN (PG-RCNN), a novel end-to-end detector that generates semantic surface points of foreground objects for accurate detection. Our method uses a jointly trained RoI point generation module to process the contextual information of RoIs and estimate the complete shape and displacement of foreground objects. For every generated point, PG-RCNN assigns a semantic feature that indicates the estimated foreground probability. Extensive experiments show that the point clouds generated by our method provide geometrically and semantically rich information for refining false positive and misaligned proposals.
PG-RCNN achieves competitive performance on the KITTI benchmark, with significantly fewer parameters than state-of-the-art models. The code is available at https://github.com/quotation2520/PG-RCNN. + + + + Representation Uncertainty in Self-Supervised Learning as Variational Inference + http://openaccess.thecvf.com//content/ICCV2023/papers/Nakamura_Representation_Uncertainty_in_Self-Supervised_Learning_as_Variational_Inference_ICCV_2023_paper.pdf + In this study, a novel self-supervised learning (SSL) method is proposed, which considers SSL in terms of variational inference to learn not only representation but also representation uncertainties. SSL is a method of learning representations without labels by maximizing the similarity between image representations of different augmented views of an image. Meanwhile, variational autoencoder (VAE) is an unsupervised representation learning method that trains a probabilistic generative model with variational inference. Both VAE and SSL can learn representations without labels, but their relationship has not been investigated in the past. Herein, the theoretical relationship between SSL and variational inference has been clarified. Furthermore, a novel method, namely variational inference SimSiam (VI-SimSiam), has been proposed. VI-SimSiam can predict the representation uncertainty by interpreting SimSiam with variational inference and defining the latent space distribution. The present experiments qualitatively show that VI-SimSiam could learn uncertainty by comparing input images and predicted uncertainties. Additionally, we described a relationship between estimated uncertainty and classification accuracy. + + + + Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Bridging_Vision_and_Language_Encoders_Parameter-Efficient_Tuning_for_Referring_Image_ICCV_2023_paper.pdf + Parameter efficient tuning (PET) has received considerable attention owing to its applicability to reduce the number of parameters that need to be updated while maintaining competitive performance and providing better hardware resource savings. Although substantial progress has been made, most existing studies mainly focus on either single-modal tasks or simple classification tasks, with few works paying attention to the dense prediction tasks and the interaction between different modalities. Therefore, in this paper, we do an in-depth investigation of the efficient tuning problem on referring image segmentation. First, considering the absence of interaction between the dual encoder, we design a novel adapter named Bridger to facilitate the exchange of cross-modal information. This module also plays a role in injecting vision-specific inductive biases and task-specific information into the pre-trained model while keeping its original parameters fixed. Second, we design a lightweight decoder for referring image segmentation to make further alignment on visual and linguistic features. To perform a comprehensive assessment and promote further research, we evaluate the proposed framework on several challenging benchmarks. Experimental results illustrate the effectiveness of our approach. Updating only 1.61% to 3.38% parameters, the proposed framework gains comparable or even superior performance compared to existing full fine-tuning methods that utilize the same backbone. 
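To make the parameter-efficient tuning recipe described in the Bridger abstract above more concrete, here is a small, hypothetical PyTorch sketch of the general pattern (freeze the pre-trained encoders and train only a lightweight inserted module); the class and function names are illustrative, not the paper's actual API.

import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    # A residual bottleneck block of the kind typically inserted between frozen
    # encoder layers in parameter-efficient tuning.
    def __init__(self, dim: int, hidden: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(torch.relu(self.down(x)))

def freeze_all_but_adapters(model: nn.Module) -> float:
    # Freeze every parameter whose name does not contain "adapter" and report
    # the percentage of parameters that remain trainable.
    for name, param in model.named_parameters():
        param.requires_grad = "adapter" in name
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return 100.0 * trainable / total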
+ + + + ATT3D: Amortized Text-to-3D Object Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Lorraine_ATT3D_Amortized_Text-to-3D_Object_Synthesis_ICCV_2023_paper.pdf + Text-to-3D modelling has seen exciting progress by combining generative text-to-image models with image-to-3D methods like Neural Radiance Fields. DreamFusion recently achieved high-quality results but requires a lengthy, per-prompt optimization to create 3D objects. To address this, we amortize optimization over text prompts by training on many prompts simultaneously with a unified model instead of separately. With this, we share computation across a prompt set, training in less time than per-prompt optimization. Our framework, Amortized Text-to-3D (ATT3D), enables knowledge sharing between prompts to generalize to unseen setups and smooth interpolations between text for novel assets and simple animations. + + + + Virtual Try-On with Pose-Garment Keypoints Guided Inpainting + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Virtual_Try-On_with_Pose-Garment_Keypoints_Guided_Inpainting_ICCV_2023_paper.pdf + Virtual try-on is an important technology supporting online apparel shopping, which provides consumers with a virtual experience to fit garments without physically wearing them. Recently, image-based virtual try-on has received growing research attention. However, the synthetic results of existing virtual try-on methods usually present distortions in garment shape and lose pattern details. In this paper, we propose a pose-garment keypoints guided inpainting method for the image-based virtual try-on task, which produces high-fidelity try-on images and preserves the shapes and patterns of the garments well. In our method, human pose and garment keypoints are extracted from source images and constructed as graphs to predict the garment keypoints at the target pose, after which the predicted keypoints are used as guidance to predict the target segmentation map and warp the garment image. The try-on image is finally generated with a semantic-conditioned inpainting scheme using the segmentation map and recomposed person image as conditions. To verify the effectiveness of our proposed method, we conduct extensive experiments on the VITON-HD dataset under both paired and unpaired experimental settings. The qualitative and quantitative results show that our method significantly outperforms prior methods at different image resolutions. The code repository is available at https://github.com/lizhi-ntu/KGI. + + + + Learning by Sorting: Self-supervised Learning with Group Ordering Constraints + http://openaccess.thecvf.com//content/ICCV2023/papers/Shvetsova_Learning_by_Sorting_Self-supervised_Learning_with_Group_Ordering_Constraints_ICCV_2023_paper.pdf + Contrastive learning has become an important tool in learning representations from unlabeled data mainly relying on the idea of minimizing distance between positive data pairs, e.g., views from the same images, and maximizing distance between negative data pairs, e.g., views from different images. This paper proposes a new variation of the contrastive learning objective, Group Ordering Constraints (GroCo), that leverages the idea of sorting the distances of positive and negative pairs and computing the respective loss based on how many positive pairs have a larger distance than the negative pairs, and thus are not ordered correctly.
To this end, the GroCo loss is based on differentiable sorting networks, which enable training with sorting supervision by matching a differentiable permutation matrix, which is produced by sorting a given set of scores, to a respective ground truth permutation matrix. Applying this idea to groupwise pre-ordered inputs of multiple positive and negative pairs allows introducing the GroCo loss with implicit emphasis on strong positives and negatives, leading to better optimization of the local neighborhood. We evaluate the proposed formulation on various self-supervised learning benchmarks and show that it not only leads to improved results compared to vanilla contrastive learning but also shows competitive performance to comparable methods in linear probing and outperforms current methods in k-NN performance. + + + + Cross Modal Transformer: Towards Fast and Robust 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_Cross_Modal_Transformer_Towards_Fast_and_Robust_3D_Object_Detection_ICCV_2023_paper.pdf + In this paper, we propose a robust 3D detector, named Cross Modal Transformer (CMT), for end-to-end 3D multi-modal detection. Without explicit view transformation, CMT takes the image and point cloud tokens as inputs and directly outputs accurate 3D bounding boxes. The spatial alignment of multi-modal tokens is performed by encoding the 3D points into multi-modal features. The core design of CMT is quite simple while its performance is impressive. It achieves 74.1% NDS (state-of-the-art with a single model) on the nuScenes test set while maintaining fast inference speed. Moreover, CMT remains strongly robust even if the LiDAR is missing. Code is released at https://github.com/junjie18/CMT. + + + + MoTIF: Learning Motion Trajectories with Local Implicit Neural Functions for Continuous Space-Time Video Super-Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_MoTIF_Learning_Motion_Trajectories_with_Local_Implicit_Neural_Functions_for_ICCV_2023_paper.pdf + This work addresses continuous space-time video super-resolution (C-STVSR), which aims to up-scale an input video both spatially and temporally by arbitrary scaling factors. One key challenge of C-STVSR is to propagate information temporally among the input video frames. To this end, we introduce a space-time local implicit neural function. It has the striking feature of learning forward motion for a continuum of pixels. We motivate the use of forward motion from the perspective of learning individual motion trajectories, as opposed to learning a mixture of motion trajectories with backward motion. To ease motion interpolation, we encode sparsely sampled forward motion extracted from the input video as the contextual input. Along with a reliability-aware splatting and decoding scheme, our framework, termed MoTIF, achieves state-of-the-art performance on C-STVSR. The source code of MoTIF is available at https://github.com/sichun233746/MoTIF. + + + + CRN: Camera Radar Net for Accurate, Robust, Efficient 3D Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_CRN_Camera_Radar_Net_for_Accurate_Robust_Efficient_3D_Perception_ICCV_2023_paper.pdf + Autonomous driving requires an accurate and fast 3D perception system that includes 3D object detection, tracking, and segmentation. Although recent low-cost camera-based approaches have shown promising results, they are susceptible to poor illumination or bad weather conditions and have a large localization error.
Hence, fusing camera with low-cost radar, which provides precise long-range measurement and operates reliably in all environments, is promising but has not yet been thoroughly investigated. In this paper, we propose Camera Radar Net (CRN), a novel camera-radar fusion framework that generates a semantically rich and spatially accurate bird's-eye-view (BEV) feature map for various tasks. To overcome the lack of spatial information in an image, we transform perspective view image features to BEV with the help of sparse but accurate radar points. We further aggregate image and radar feature maps in BEV using multi-modal deformable attention designed to tackle the spatial misalignment between inputs. CRN with real-time setting operates at 20 FPS while achieving comparable performance to LiDAR detectors on nuScenes, and even outperforms at a far distance on 100m setting. Moreover, CRN with offline setting yields 62.4% NDS, 57.5% mAP on nuScenes test set and ranks first among all camera and camera-radar 3D object detectors. + + + + GaPro: Box-Supervised 3D Point Cloud Instance Segmentation Using Gaussian Processes as Pseudo Labelers + http://openaccess.thecvf.com//content/ICCV2023/papers/Ngo_GaPro_Box-Supervised_3D_Point_Cloud_Instance_Segmentation_Using_Gaussian_Processes_ICCV_2023_paper.pdf + Instance segmentation on 3D point clouds (3DIS) is a longstanding challenge in computer vision, where state-of-the-art methods are mainly based on full supervision. As annotating ground truth dense instance masks is tedious and expensive, solving 3DIS with weak supervision has become more practical. In this paper, we propose GaPro, a new instance segmentation for 3D point clouds using axis-aligned 3D bounding box supervision. Our two-step approach involves generating pseudo labels from box annotations and training a 3DIS network with the resulting labels. Additionally, we employ the self-training strategy to improve the performance of our method further. We devise an effective Gaussian Process to generate pseudo instance masks from the bounding boxes and resolve ambiguities when they overlap, resulting in pseudo instance masks with their uncertainty values. Our experiments show that GaPro outperforms previous weakly supervised 3D instance segmentation methods and has competitive performance compared to state-of-the-art fully supervised ones. Furthermore, we demonstrate the robustness of our approach, where we can adapt various state-of-the-art fully supervised methods to the weak supervision task by using our pseudo labels for training. We will release our implementation upon publication. + + + + Get the Best of Both Worlds: Improving Accuracy and Transferability by Grassmann Class Representation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Get_the_Best_of_Both_Worlds_Improving_Accuracy_and_Transferability_ICCV_2023_paper.pdf + We generalize the class vectors found in neural networks to linear subspaces (i.e., points in the Grassmann manifold) and show that the Grassmann Class Representation (GCR) enables simultaneous improvement in accuracy and feature transferability. In GCR, each class is a subspace, and the logit is defined as the norm of the projection of a feature onto the class subspace. We integrate Riemannian SGD into deep learning frameworks such that class subspaces in a Grassmannian are jointly optimized with the rest model parameters. Compared to the vector form, the representative capability of subspaces is more powerful. 
We show that on ImageNet-1K, the top-1 errors of ResNet50-D, ResNeXt50, Swin-T, and Deit3-S are reduced by 5.6%, 4.5%, 3.0%, and 3.5%, respectively. Subspaces also provide freedom for features to vary, and we observed that the intra-class feature variability grows when the subspace dimension increases. Consequently, we found the quality of GCR features is better for downstream tasks. For ResNet50-D, the average linear transfer accuracy across 6 datasets improves from 77.98% to 79.70% compared to the strong baseline of vanilla softmax. For Swin-T, it improves from 81.5% to 83.4% and for Deit3, it improves from 73.8% to 81.4%. With these encouraging results, we believe that more applications could benefit from the Grassmann class representation. Code is released at https://github.com/innerlee/GCR. + + + + ObjectSDF++: Improved Object-Compositional Neural Implicit Surfaces + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_ObjectSDF_Improved_Object-Compositional_Neural_Implicit_Surfaces_ICCV_2023_paper.pdf + In recent years, neural implicit surface reconstruction has emerged as a popular paradigm for multi-view 3D reconstruction. Unlike traditional multi-view stereo approaches, the neural implicit surface-based methods leverage neural networks to represent 3D scenes as signed distance functions (SDFs). However, they tend to disregard the reconstruction of individual objects within the scene, which limits their performance and practical applications. To address this issue, previous work ObjectSDF introduced a nice framework of object-composition neural implicit surfaces, which utilizes 2D instance masks to supervise individual object SDFs. In this paper, we propose a new framework called ObjectSDF++ to overcome the limitations of ObjectSDF. First, in contrast to ObjectSDF whose performance is primarily restricted by its converted semantic field, the core component of our model is an occlusion-aware object opacity rendering formulation that directly volume-renders object opacity to be supervised with instance masks. Second, we design a novel regularization term for object distinction, which can effectively mitigate the issue that ObjectSDF may result in unexpected reconstruction in invisible regions due to the lack of constraint to prevent collisions. Our extensive experiments demonstrate that our novel framework not only produces superior object reconstruction results but also significantly improves the quality of scene reconstruction. Code and more resources can be found in https://qianyiwu.github.io/objectsdf++ + + + + Towards Unsupervised Domain Generalization for Face Anti-Spoofing + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Towards_Unsupervised_Domain_Generalization_for_Face_Anti-Spoofing_ICCV_2023_paper.pdf + Generalizable face anti-spoofing (FAS) based on domain generalization (DG) has gained growing attention due to its robustness in real-world applications. However, these DG methods rely heavily on labeled source data, which are usually costly and hard to access. Comparably, unlabeled face data are far more accessible in various scenarios. In this paper, we propose the first Unsupervised Domain Generalization framework for Face Anti-Spoofing, namely UDG-FAS, which could exploit large amounts of easily accessible unlabeled data to learn generalizable features for enhancing the low-data regime of FAS. 
Yet without supervision signals, learning intrinsic live/spoof features from complicated facial information is challenging, which is even tougher in cross-domain scenarios due to domain shift. Existing unsupervised learning methods tend to learn identity-biased and domain-biased features as shortcuts, and fail to specify spoof cues. To this end, we propose a novel Split-Rotation-Merge module to build identity-agnostic local representations for mining intrinsic spoof cues and search the nearest neighbors in the same domain as positives for mitigating the identity bias. Moreover, we propose to search cross-domain neighbors with domain-specific normalization and merged local features to learn a domain-invariant feature space. To our best knowledge, this is the first attempt to learn generalized FAS features in a fully unsupervised way. Extensive experiments show that UDG-FAS significantly outperforms state-of-the-art methods on six diverse practical protocols. + + + + MotionDeltaCNN: Sparse CNN Inference of Frame Differences in Moving Camera Videos with Spherical Buffers and Padded Convolutions + http://openaccess.thecvf.com//content/ICCV2023/papers/Parger_MotionDeltaCNN_Sparse_CNN_Inference_of_Frame_Differences_in_Moving_Camera_ICCV_2023_paper.pdf + Convolutional neural network inference on video input is computationally expensive and requires high memory bandwidth. Recently, DeltaCNN managed to reduce the cost by only processing pixels with significant updates over the previous frame. However, DeltaCNN relies on static camera input. Moving cameras add new challenges in how to fuse newly unveiled image regions with already processed regions efficiently to minimize the update rate - without increasing memory overhead and without knowing the camera extrinsics of future frames. In this work, we propose MotionDeltaCNN, a sparse CNN inference framework that supports moving cameras. We introduce spherical buffers and padded convolutions to enable seamless fusion of newly unveiled regions and previously processed regions - without increasing memory footprint. Our evaluation shows that we outperform DeltaCNN by up to 90% for moving camera videos. + + + + General Image-to-Image Translation with One-Shot Image Guidance + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_General_Image-to-Image_Translation_with_One-Shot_Image_Guidance_ICCV_2023_paper.pdf + Large-scale text-to-image models pre-trained on massive text-image pairs show excellent performance in image synthesis recently. However, image can provide more intuitive visual concepts than plain text. People may ask: how can we integrate the desired visual concept into an existing image, such as our portrait? Current methods are inadequate in meeting this demand as they lack the ability to preserve content or translate visual concepts effectively. Inspired by this, we propose a novel framework named visual concept translator (VCT) with the ability to preserve content in the source image and translate the visual concepts guided by a single reference image. The proposed VCT contains a content-concept inversion (CCI) process to extract contents and concepts, and a content-concept fusion (CCF) process to gather the extracted information to obtain the target image. Given only one reference image, the proposed VCT can complete a wide range of general image-to-image translation tasks with excellent results. Extensive experiments are conducted to prove the superiority and effectiveness of the proposed methods. 
Codes are available at https://github.com/CrystalNeuro/visual-concept-translator. + + + + ASM: Adaptive Skinning Model for High-Quality 3D Face Modeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_ASM_Adaptive_Skinning_Model_for_High-Quality_3D_Face_Modeling_ICCV_2023_paper.pdf + The research fields of parametric face model and 3D face reconstruction have been extensively studied. However, a critical question remains unanswered: how to tailor the face model for specific reconstruction settings. We argue that reconstruction with multi-view uncalibrated images demands a new model with stronger capacity. Our study shifts attention from data-dependent 3D Morphable Models (3DMM) to an understudied human-designed skinning model. We propose Adaptive Skinning Model (ASM), which redefines the skinning model with more compact and fully tunable parameters. With extensive experiments, we demonstrate that ASM achieves significantly improved capacity than 3DMM, with the additional advantage of model size and easy implementation for new topology. We achieve state-of-the-art performance with ASM for multi-view reconstruction on the Florence MICC Coop benchmark. Our quantitative analysis demonstrates the importance of a high-capacity model for fully exploiting abundant information from multi-view input in reconstruction. Furthermore, our model with physical-semantic parameters can be directly utilized for real-world applications, such as in-game avatar creation. As a result, our work opens up new research direction for parametric face model and facilitates future research on multi-view reconstruction. + + + + CAFA: Class-Aware Feature Alignment for Test-Time Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Jung_CAFA_Class-Aware_Feature_Alignment_for_Test-Time_Adaptation_ICCV_2023_paper.pdf + Despite recent advancements in deep learning, deep neural networks continue to suffer from performance degradation when applied to new data that differs from training data. Test-time adaptation (TTA) aims to address this challenge by adapting a model to unlabeled data at test time. TTA can be applied to pretrained networks without modifying their training procedures, enabling them to utilize a well-formed source distribution for adaptation. One possible approach is to align the representation space of test samples to the source distribution (i.e., feature alignment). However, performing feature alignment in TTA is especially challenging in that access to labeled source data is restricted during adaptation. That is, a model does not have a chance to learn test data in a class-discriminative manner, which was feasible in other adaptation tasks (e.g., unsupervised domain adaptation) via supervised losses on the source data. Based on this observation, we propose a simple yet effective feature alignment loss, termed as Class-Aware Feature Alignment (CAFA), which simultaneously 1) encourages a model to learn target representations in a class-discriminative manner and 2) effectively mitigates the distribution shifts at test time. Our method does not require any hyper-parameters or additional losses, which are required in previous approaches. We conduct extensive experiments on 6 different datasets and show our proposed method consistently outperforms existing baselines. 
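As a rough illustration of the class-aware feature alignment idea in the CAFA abstract above, the sketch below pulls each test-time feature toward stored source statistics of its pseudo-label class via a Mahalanobis-style distance; this is a hypothetical instantiation under assumed inputs (per-class source means and a shared precision matrix), not the paper's exact formulation.

import torch

def class_aware_alignment_loss(feats: torch.Tensor, logits: torch.Tensor,
                               src_means: torch.Tensor, src_precision: torch.Tensor) -> torch.Tensor:
    # feats: (B, D) test-time features, logits: (B, C) classifier outputs,
    # src_means: (C, D) class means estimated on source data,
    # src_precision: (D, D) shared inverse covariance of source features.
    pseudo_labels = logits.argmax(dim=1)        # class-aware assignment
    diff = feats - src_means[pseudo_labels]     # (B, D)
    # Squared Mahalanobis distance of each feature to its assigned class mean.
    dists = torch.einsum("bd,de,be->b", diff, src_precision, diff)
    return dists.mean()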
+ + + + Learning Clothing and Pose Invariant 3D Shape Representation for Long-Term Person Re-Identification + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Learning_Clothing_and_Pose_Invariant_3D_Shape_Representation_for_Long-Term_ICCV_2023_paper.pdf + Long-Term Person Re-Identification (LT-ReID) has become increasingly crucial in computer vision and biometrics. In this work, we aim to extend LT-ReID beyond pedestrian recognition to include a wider range of real-world human activities while still accounting for cloth-changing scenarios over large time gaps. This setting poses additional challenges due to the geometric misalignment and appearance ambiguity caused by the diversity of human pose and clothing. To address these challenges, we propose a new approach 3DInvarReID for (i) disentangling identity from non-identity components (pose, clothing shape, and texture) of 3D clothed humans, and (ii) reconstructing accurate 3D clothed body shapes and learning discriminative features of naked body shapes for person ReID in a joint manner. To better evaluate our study of LT-ReID, we collect a real-world dataset called CCDA, which contains a wide variety of human activities and clothing changes. Experimentally, we show the superior performance of our approach for person ReID. + + + + Agile Modeling: From Concept to Classifier in Minutes + http://openaccess.thecvf.com//content/ICCV2023/papers/Stretcu_Agile_Modeling_From_Concept_to_Classifier_in_Minutes_ICCV_2023_paper.pdf + The application of computer vision methods to nuanced, subjective concepts is growing. While crowdsourcing has served the vision community well for most objective tasks (such as labeling a "zebra"), it now falters on tasks where there is substantial subjectivity in the concept (such as identifying "gourmet tuna"). However, empowering any user to develop a classifier for their concept is technically difficult: users are neither machine learning experts nor have the patience to label thousands of examples. In reaction, we introduce the problem of Agile Modeling: the process of turning any subjective visual concept into a computer vision model through a real-time user-in-the-loop interactions. We instantiate an Agile Modeling prototype for image classification and show through a user study (N=14) that users can create classifiers with minimal effort under 30 minutes. We compare this user driven process with the traditional crowdsourcing paradigm and find that the crowd's notion often differs from that of the user's, especially as the concepts become more subjective. Finally, we scale our experiments with simulations of users training classifiers for ImageNet21k categories to further demonstrate the efficacy. + + + + FACET: Fairness in Computer Vision Evaluation Benchmark + http://openaccess.thecvf.com//content/ICCV2023/papers/Gustafson_FACET_Fairness_in_Computer_Vision_Evaluation_Benchmark_ICCV_2023_paper.pdf + Computer vision models have known performance disparities across attributes such as gender and skin tone. This means during tasks such as classification and detection, model performance differs for certain classes based on the demographics of the people in the image. These disparities have been shown to exist, but until now there has not been a unified approach to measure these differences for common use-cases of computer vision models. 
We present a new benchmark named FACET (FAirness in Computer Vision EvaluaTion), a large, publicly available evaluation set of 32k images for some of the most common vision tasks - image classification, object detection and segmentation. For every image in FACET, we hired expert reviewers to manually annotate person-related attributes such as perceived skin tone and hair type, manually draw bounding boxes and label fine-grained person-related classes such as disk jockey or guitarist. In addition, we use FACET to benchmark state-of-the-art vision models and present a deeper understanding of potential performance disparities and challenges across sensitive demographic attributes. With the exhaustive annotations collected, we probe models using single demographic attributes as well as multiple attributes using an intersectional approach (e.g. hair color and perceived skin tone). Our results show that classification, detection, segmentation, and visual grounding models exhibit performance disparities across demographic attributes and intersections of attributes. These harms suggest that not all people represented in datasets receive fair and equitable treatment in these vision tasks. We hope current and future results using our benchmark will contribute to fairer, more robust vision models. FACET is available publicly at https://facet.metademolab.com + + + + Prototypes-oriented Transductive Few-shot Learning with Conditional Transport + http://openaccess.thecvf.com//content/ICCV2023/papers/Tian_Prototypes-oriented_Transductive_Few-shot_Learning_with_Conditional_Transport_ICCV_2023_paper.pdf + Transductive Few-Shot Learning (TFSL) has recently attracted increasing attention since it typically outperforms its inductive peer by leveraging statistics of query samples. However, previous TFSL methods usually encode a uniform prior that all the classes within query samples are equally likely, which is biased in imbalanced TFSL and causes severe performance degradation. Given this pivotal issue, in this work, we propose a novel Conditional Transport (CT) based imbalanced TFSL model called Prototypes-oriented Unbiased Transfer Model (PUTM) to fully exploit unbiased statistics of imbalanced query samples, which employs forward and backward navigators as transport matrices to balance the prior of query samples per class between uniform and adaptive data-driven distributions. For efficiently transferring statistics learned by CT, we further derive a closed-form solution to refine prototypes based on MAP given the learned navigators. The above two steps of discovering and transferring unbiased statistics follow an iterative manner, forming our EM-based solver. Experimental results on four standard benchmarks including miniImageNet, tieredImageNet, CUB, and CIFAR-FS demonstrate the superiority of our model in class-imbalanced generalization. + + + + SparseFusion: Fusing Multi-Modal Sparse Representations for Multi-Sensor 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_SparseFusion_Fusing_Multi-Modal_Sparse_Representations_for_Multi-Sensor_3D_Object_Detection_ICCV_2023_paper.pdf + By identifying four important components of existing LiDAR-camera 3D object detection methods (LiDAR and camera candidates, transformation, and fusion outputs), we observe that all existing methods either find dense candidates or yield dense representations of scenes.
However, given that objects occupy only a small part of a scene, finding dense candidates and generating dense representations is noisy and inefficient. We propose SparseFusion, a novel multi-sensor 3D detection method that exclusively uses sparse candidates and sparse representations. Specifically, SparseFusion utilizes the outputs of parallel detectors in the LiDAR and camera modalities as sparse candidates for fusion. We transform the camera candidates into the LiDAR coordinate space by disentangling the object representations. Then, we can fuse the multi-modality candidates in a unified 3D space by a lightweight self-attention module. To mitigate negative transfer between modalities, we propose novel semantic and geometric cross-modality transfer modules that are applied prior to the modality-specific detectors. SparseFusion achieves state-of-the-art performance on the nuScenes benchmark while also running at the fastest speed, even outperforming methods with stronger backbones. We perform extensive experiments to demonstrate the effectiveness and efficiency of our modules and overall method pipeline. Our code will be made publicly available at https://github.com/yichen928/SparseFusion. + + + + DetermiNet: A Large-Scale Diagnostic Dataset for Complex Visually-Grounded Referencing using Determiners + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_DetermiNet_A_Large-Scale_Diagnostic_Dataset_for_Complex_Visually-Grounded_Referencing_using_ICCV_2023_paper.pdf + State-of-the-art visual grounding models can achieve high detection accuracy, but they are not designed to distinguish between all objects versus only certain objects of interest. In natural language, in order to specify a particular object or set of objects of interest, humans use determiners such as "my", "either" and "those". Determiners, as an important word class, are a type of schema in natural language about the reference or quantity of the noun. Existing grounded referencing datasets place much less emphasis on determiners, compared to other word classes such as nouns, verbs and adjectives. This makes it difficult to develop models that understand the full variety and complexity of object referencing. Thus, we have developed and released the DetermiNet dataset, which comprises 250,000 synthetically generated images and captions based on 25 determiners. The task is to predict bounding boxes to identify objects of interest, constrained by the semantics of the given determiner. We find that current state-of-the-art visual grounding models do not perform well on the dataset, highlighting the limitations of existing models on reference and quantification tasks. + + + + RICO: Regularizing the Unobservable for Indoor Compositional Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_RICO_Regularizing_the_Unobservable_for_Indoor_Compositional_Reconstruction_ICCV_2023_paper.pdf + Recently, neural implicit surfaces have become popular for multi-view reconstruction. To facilitate practical applications like scene editing and manipulation, some works extend the framework with semantic masks input for the object-compositional reconstruction rather than the holistic perspective. Though achieving plausible disentanglement, the performance drops significantly when processing the indoor scenes where objects are usually partially observed. We propose RICO to address this by regularizing the unobservable regions for indoor compositional reconstruction. 
Our key idea is to first regularize the smoothness of the occluded background, which then in turn guides the foreground object reconstruction in unobservable regions based on the object-background relationship. Particularly, we regularize the geometry smoothness of occluded background patches. With the improved background surface, the signed distance function and the reversedly rendered depth of objects can be optimized to bound them within the background range. Extensive experiments show our method outperforms other methods on synthetic and real-world indoor scenes and prove the effectiveness of proposed regularizations. The code is available at https://github.com/kyleleey/RICO. + + + + CO-PILOT: Dynamic Top-Down Point Cloud with Conditional Neighborhood Aggregation for Multi-Gigapixel Histopathology Image Representation + http://openaccess.thecvf.com//content/ICCV2023/papers/Nakhli_CO-PILOT_Dynamic_Top-Down_Point_Cloud_with_Conditional_Neighborhood_Aggregation_for_ICCV_2023_paper.pdf + Predicting survival rates based on multi-gigapixel histopathology images is one of the most challenging tasks in digital pathology. Due to the computational complexities, Multiple Instance Learning (MIL) has become the conventional approach for this process as it breaks the image into smaller patches. However, this technique fails to account for the individual cells present in each patch, while they are the fundamental part of the tissue. In this work, we developed a novel dynamic and hierarchical point-cloud-based method (CO-PILOT) for the processing of cellular graphs extracted from routine histopathology images. By using bottom-up information propagation and top-down conditional attention, our model gains access to an adaptive focus across different levels of tissue hierarchy. Through comprehensive experiments, we demonstrate that our model can outperform all the state-of-the-art methods in survival prediction, including the hierarchical Vision Transformer (ViT), across two datasets and four metrics with only half of the parameters of the closest baseline. Importantly, our model is able to stratify the patients into different risk cohorts with statistically different outcomes across two large datasets, a task that was previously achievable only using genomic information. Furthermore, we publish a large dataset containing 873 cellular graphs from 188 patients, along with their survival information, making it one of the largest publicly available datasets in this context. + + + + Troubleshooting Ethnic Quality Bias with Curriculum Domain Adaptation for Face Image Quality Assessment + http://openaccess.thecvf.com//content/ICCV2023/papers/Ou_Troubleshooting_Ethnic_Quality_Bias_with_Curriculum_Domain_Adaptation_for_Face_ICCV_2023_paper.pdf + Face Image Quality Assessment (FIQA) lays the foundation for ensuring the stability and accuracy of face recognition systems. However, existing FIQA methods mainly formulate quality relationships within the training set to yield quality scores, ignoring the generalization problem caused by ethnic quality bias between the training and test sets. Domain adaptation presents a potential solution to mitigate the bias, but if FIQA is treated essentially as a regression task, it will be limited by the challenge of feature scaling in transfer learning. Additionally, how to guarantee source risk is also an issue due to the lack of ground-truth labels of the source domain for FIQA. 
This paper presents the first attempt in the field of FIQA to address these challenges with a novel Ethnic-Quality-Bias Mitigating (EQBM) framework. Specifically, to eliminate the restriction of scalar regression, we first compute the Likert-scale quality probability distributions as source domain annotations. Furthermore, we design an easy-to-hard training scheduler based on the inter-domain uncertainty and intra-domain quality margin as well as the ranking-based domain adversarial network to enhance the effectiveness of transfer learning and further reduce the source risk in domain adaptation. Extensive experiments demonstrate that the EQBM significantly mitigates the quality bias and improves the generalization capability of FIQA across races on different datasets. + + + + CLR: Channel-wise Lightweight Reprogramming for Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Ge_CLR_Channel-wise_Lightweight_Reprogramming_for_Continual_Learning_ICCV_2023_paper.pdf + Continual learning aims to emulate the human ability to continually accumulate knowledge over sequential tasks. The main challenge is to maintain performance on previously learned tasks after learning new tasks, i.e., to avoid catastrophic forgetting. We propose a Channel-wise Lightweight Reprogramming (CLR) approach that helps convolutional neural networks (CNNs) overcome catastrophic forgetting during continual learning. We show that a CNN model trained on an old task (or self-supervised proxy task) could be "reprogrammed" to solve a new task by using our proposed lightweight (very cheap) reprogramming parameter. With the help of CLR, we have a better stability-plasticity trade-off to solve continual learning problems: To maintain stability and retain previous task ability, we use a common task-agnostic immutable part as the shared "anchor" parameter set. We then add task-specific lightweight reprogramming parameters to reinterpret the outputs of the immutable parts, to enable plasticity and integrate new knowledge. To learn sequential tasks, we only train the lightweight reprogramming parameters to learn each new task. Reprogramming parameters are task-specific and exclusive to each task, which makes our method immune to catastrophic forgetting. To minimize the parameter requirement of reprogramming to learn new tasks, we make reprogramming lightweight by only adjusting essential kernels and learning channel-wise linear mappings from anchor parameters to task-specific domain knowledge. We show that, for general CNNs, the CLR parameter increase is less than 0.6% for any new task. Our method outperforms 13 state-of-the-art continual learning baselines on a new challenging sequence of 53 image classification datasets. Code and data are here: https://github.com/gyhandy/Channel-wise-Lightweight-Reprogramming + + + + IOMatch: Simplifying Open-Set Semi-Supervised Learning with Joint Inliers and Outliers Utilization + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_IOMatch_Simplifying_Open-Set_Semi-Supervised_Learning_with_Joint_Inliers_and_Outliers_ICCV_2023_paper.pdf + Semi-supervised learning (SSL) aims to leverage massive unlabeled data when labels are expensive to obtain. Unfortunately, in many real-world applications, the collected unlabeled data will inevitably contain unseen-class outliers not belonging to any of the labeled classes. To deal with the challenging open-set SSL task, the mainstream methods tend to first detect outliers and then filter them out. 
However, we observe the surprising fact that such an approach can result in more severe performance degradation when labels are extremely scarce, as the unreliable outlier detector may wrongly exclude a considerable portion of valuable inliers. To tackle this issue, we introduce a novel open-set SSL framework, IOMatch, which can jointly utilize inliers and outliers, even when it is difficult to distinguish exactly between them. Specifically, we propose to employ a multi-binary classifier in combination with the standard closed-set classifier for producing unified open-set classification targets, which regard all outliers as a single new class. By adopting these targets as open-set pseudo-labels, we optimize an open-set classifier with all unlabeled samples including both inliers and outliers. Extensive experiments have shown that IOMatch significantly outperforms the baseline methods across different benchmark datasets and different settings despite its remarkable simplicity. Our code and models are available at https://github.com/nukezil/IOMatch. + + + + Hierarchical Point-based Active Learning for Semi-supervised Point Cloud Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Hierarchical_Point-based_Active_Learning_for_Semi-supervised_Point_Cloud_Semantic_Segmentation_ICCV_2023_paper.pdf + Impressive performance on point cloud semantic segmentation has been achieved by fully-supervised methods with large amounts of labelled data. As it is labour-intensive to acquire large-scale point cloud data with point-wise labels, many attempts have been made to explore learning 3D point cloud segmentation with limited annotations. Active learning is one of the effective strategies to achieve this purpose but is still under-explored. The most recent methods of this kind measure the uncertainty of each pre-divided region for manual labelling but they suffer from redundant information and require additional efforts for region division. This paper aims at addressing this issue by developing a hierarchical point-based active learning strategy. Specifically, we measure the uncertainty for each point by a hierarchical minimum margin uncertainty module which considers the contextual information at multiple levels. Then, a feature-distance suppression strategy is designed to select important and representative points for manual labelling. Besides, to better exploit the unlabelled data, we build a semi-supervised segmentation framework based on our active strategy. Extensive experiments on the S3DIS and ScanNetV2 datasets demonstrate that the proposed framework achieves 96.5% and 100% of the performance of the fully-supervised baseline with only 0.07% and 0.1% of the training data, respectively, outperforming the state-of-the-art weakly-supervised and active learning methods. The code will be available at https://github.com/SmiletoE/HPAL. + + + + Quality-Agnostic Deepfake Detection with Intra-model Collaborative Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Le_Quality-Agnostic_Deepfake_Detection_with_Intra-model_Collaborative_Learning_ICCV_2023_paper.pdf + Deepfakes have recently raised a plethora of societal concerns over possible security threats and the dissemination of fake information. Much research on deepfake detection has been undertaken. However, detecting low-quality deepfakes, as well as simultaneously detecting deepfakes of different qualities, remains a grave challenge. 
Most SOTA approaches are limited by using a single specific model for detecting a certain deepfake video quality type. When constructing multiple models with prior information about video quality, this kind of strategy incurs significant computational cost, as well as model and training data overhead. Further, such a strategy is neither scalable nor practical to deploy in real-world settings. In this work, we propose a universal intra-model collaborative learning framework to enable the effective and simultaneous detection of deepfakes of different qualities. That is, our approach is a quality-agnostic deepfake detection method, dubbed QAD. In particular, by observing the upper bound of the general error expectation, we maximize the dependency between intermediate representations of images from different quality levels via the Hilbert-Schmidt Independence Criterion. In addition, an Adversarial Weight Perturbation module is carefully devised to enable the model to be more robust against image corruption while boosting the overall model's performance. Extensive experiments over seven popular deepfake datasets demonstrate the superiority of our QAD model over prior SOTA benchmarks. + + + + Object-Centric Multiple Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Object-Centric_Multiple_Object_Tracking_ICCV_2023_paper.pdf + Unsupervised object-centric learning methods allow the partitioning of scenes into entities without additional localization information and are excellent candidates for reducing the annotation burden of multiple-object tracking (MOT) pipelines. Unfortunately, they lack two key properties: objects are often split into parts and are not consistently tracked over time. In fact, state-of-the-art models achieve pixel-level accuracy and temporal consistency by relying on supervised object detection with additional ID labels for the association through time. This paper proposes a video object-centric model for MOT. It consists of an index-merge module that adapts the object-centric slots into detection outputs and an object memory module that builds complete object prototypes to handle occlusions. Benefiting from object-centric learning, we only require sparse detection labels (0%-6.25%) for object localization and feature binding. Relying on our self-supervised Expectation-Maximization-inspired loss for object association, our approach requires no ID labels. Our experiments significantly narrow the gap between the existing object-centric model and the fully supervised state-of-the-art and outperform several unsupervised trackers that also do not require ID labels. + + + + Point-TTA: Test-Time Adaptation for Point Cloud Registration Using Multitask Meta-Auxiliary Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Hatem_Point-TTA_Test-Time_Adaptation_for_Point_Cloud_Registration_Using_Multitask_Meta-Auxiliary_ICCV_2023_paper.pdf + We present Point-TTA, a novel test-time adaptation framework for point cloud registration (PCR) that improves the generalization and the performance of registration models. While learning-based approaches have achieved impressive progress, generalization to unknown testing environments remains a major challenge due to the variations in 3D scans. Existing methods typically train a generic model, and the same trained model is applied to each instance during testing. This could be sub-optimal since it is difficult for the same model to handle all the variations during testing. In this paper, we propose a test-time adaptation approach for PCR. 
Our model can adapt to unseen distributions at test time without requiring any prior knowledge of the test data. Concretely, we design three self-supervised auxiliary tasks that are optimized jointly with the primary PCR task. Given a test instance, we adapt our model using these auxiliary tasks, and the updated model is used to perform the inference. During training, our model is trained using a meta-auxiliary learning approach, such that the model adapted via the auxiliary tasks improves the accuracy of the primary task. Experimental results demonstrate the effectiveness of our approach in improving the generalization of point cloud registration and outperforming other state-of-the-art approaches. + + + + uSplit: Image Decomposition for Fluorescence Microscopy + http://openaccess.thecvf.com//content/ICCV2023/papers/Ashesh_uSplit_Image_Decomposition_for_Fluorescence_Microscopy_ICCV_2023_paper.pdf + We present uSplit, a dedicated approach for trained image decomposition in the context of fluorescence microscopy images. We find that the best results using regular deep architectures are achieved when large image patches are used during training, making memory consumption the limiting factor to further improving performance. We therefore introduce lateral contextualization (LC), a novel meta-architecture that enables the memory-efficient incorporation of large image context, which we observe is a key ingredient to solving the image decomposition task at hand. We integrate LC with U-Nets, Hierarchical AEs, and Hierarchical VAEs, for which we formulate a modified ELBO loss. Additionally, LC enables training deeper hierarchical models than otherwise possible and, interestingly, helps to reduce tiling artefacts that are inherently impossible to avoid when using tiled VAE predictions. We apply uSplit to five decomposition tasks: one on a synthetic dataset and four derived from real microscopy data. Our method consistently achieves the best results (an average improvement of 2.25 dB PSNR over the best baseline), while simultaneously requiring considerably less GPU memory. Our code and datasets can be found at https://github.com/juglab/uSplit. + + + + LightGlue: Local Feature Matching at Light Speed + http://openaccess.thecvf.com//content/ICCV2023/papers/Lindenberger_LightGlue_Local_Feature_Matching_at_Light_Speed_ICCV_2023_paper.pdf + We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. Cumulatively, they make LightGlue more efficient -- in terms of both memory and computation, more accurate, and much easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like 3D reconstruction. The code and trained models are publicly available at github.com/cvg/LightGlue. + + + + Masked Autoencoders are Efficient Class Incremental Learners + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhai_Masked_Autoencoders_are_Efficient_Class_Incremental_Learners_ICCV_2023_paper.pdf + Class Incremental Learning (CIL) aims to sequentially learn new classes while avoiding catastrophic forgetting of previous knowledge. 
We propose to use Masked Autoencoders (MAEs) as efficient learners for CIL. MAEs were originally designed to learn useful representations through reconstructive unsupervised learning, and they can be easily integrated with a supervised loss for classification. Moreover, MAEs can reliably reconstruct original input images from randomly selected patches, which we use to store exemplars from past tasks more efficiently for CIL. We also propose a bilateral MAE framework to learn from image-level and embedding-level fusion, which produces better-quality reconstructed images and more stable representations. Our experiments confirm that our approach performs better than the state-of-the-art on CIFAR-100, ImageNet-Subset, and ImageNet-Full. The code is available at https://github.com/scok30/MAE-CIL. + + + + Towards Semi-supervised Learning with Non-random Missing Labels + http://openaccess.thecvf.com//content/ICCV2023/papers/Duan_Towards_Semi-supervised_Learning_with_Non-random_Missing_Labels_ICCV_2023_paper.pdf + Semi-supervised learning (SSL) tackles the missing-label problem by enabling the effective usage of unlabeled data. While existing SSL methods focus on the traditional setting, a practical and challenging scenario called label Missing Not At Random (MNAR) is usually ignored. In MNAR, the labeled and unlabeled data fall into different class distributions, resulting in biased label imputation, which deteriorates the performance of SSL models. In this work, class transition tracking based Pseudo-Rectifying Guidance (PRG) is devised for MNAR. We explore the class-level guidance information obtained by the Markov random walk, which is modeled on a dynamically created graph built over the class tracking matrix. PRG unifies the history information of each class transition caused by the pseudo-rectifying procedure to activate the model's enthusiasm for neglected classes, so that the quality of pseudo-labels on both popular and rare classes in MNAR can be improved. We show the superior performance of PRG across a variety of MNAR scenarios, outperforming the latest SSL approaches combining bias removal solutions by a large margin. Code and model weights are available at https://github.com/NJUyued/PRG4SSL-MNAR. + + + + NeRFrac: Neural Radiance Fields through Refractive Surface + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhan_NeRFrac_Neural_Radiance_Fields_through_Refractive_Surface_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRF) is a popular neural representation for novel view synthesis. By querying spatial points and view directions, a multilayer perceptron (MLP) can be trained to output the volume density and radiance at each point, which lets us render novel views of the scene. The original NeRF and its recent variants, however, target opaque scenes dominated by diffuse reflection surfaces and cannot handle complex refractive surfaces well. We introduce NeRFrac to realize neural novel view synthesis of scenes captured through refractive surfaces, typically water surfaces. For each queried ray, an MLP-based Refractive Field is trained to estimate the distance from the ray origin to the refractive surface. A refracted ray at each intersection point is then computed by Snell's Law, given the input ray and the approximated local normal. Points of the scene are sampled along the refracted ray and are sent to a Radiance Field for further radiance estimation. 
We show that from a sparse set of images, our model achieves accurate novel view synthesis of the scene underneath the refractive surface and simultaneously reconstructs the refractive surface. We evaluate the effectiveness of our method with synthetic and real scenes seen through water surfaces. Experimental results demonstrate the accuracy of NeRFrac for modeling scenes seen through wavy refractive surfaces. + + + + LivelySpeaker: Towards Semantic-Aware Co-Speech Gesture Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhi_LivelySpeaker_Towards_Semantic-Aware_Co-Speech_Gesture_Generation_ICCV_2023_paper.pdf + Gestures are non-verbal but important behaviors accompanying people's speech. While previous methods are able to generate speech rhythm-synchronized gestures, the semantic context of the speech is generally lacking in the gesticulations. Although semantic gestures do not occur very regularly in human speech, they are indeed the key for the audience to understand the speech context in a more immersive environment. Hence, we introduce LivelySpeaker, a framework that realizes semantics-aware co-speech gesture generation and offers several control handles. Specifically, the script-based gesture generation leverages the pre-trained CLIP text embeddings as the guidance for generating gestures that are highly semantically aligned with the script. Then, we devise a simple but effective diffusion-based gesture generation backbone using pure MLPs, which is conditioned only on audio signals and learns to gesticulate with realistic motions. We utilize such a powerful prior to rhyme the script-guided gestures with the audio signals, notably in a zero-shot setting. Our novel two-stage generation framework also enables several applications, such as changing the gesticulation style, editing the co-speech gestures via textual prompting, and controlling the semantic awareness and rhythm alignment with guided diffusion. Extensive experiments demonstrate the advantages of the proposed framework over competing methods. In addition, our core diffusion-based generative model also achieves state-of-the-art performance on two benchmarks. The code and model will be released to facilitate future research. + + + + Preventing Zero-Shot Transfer Degradation in Continual Learning of Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Preventing_Zero-Shot_Transfer_Degradation_in_Continual_Learning_of_Vision-Language_Models_ICCV_2023_paper.pdf + Continual learning (CL) can help pre-trained vision-language models efficiently adapt to new or under-trained data distributions without re-training. Nevertheless, during the continual training of the Contrastive Language-Image Pre-training (CLIP) model, we observe that the model's zero-shot transfer ability significantly degrades due to catastrophic forgetting. Existing CL methods can mitigate forgetting by replaying previous data. However, since the CLIP dataset is private, replay methods cannot access the pre-training dataset. In addition, replaying data of previously learned downstream tasks can enhance their performance but comes at the cost of sacrificing zero-shot performance. To address this challenge, we propose a novel method, ZSCL, to prevent zero-shot transfer degradation in the continual learning of vision-language models in both feature and parameter space. In the feature space, a reference dataset is introduced for distillation between the current and initial models. 
The reference dataset should be semantically diverse, but it does not need to be labeled, to have been seen during pre-training, or to consist of matched image-text pairs. In the parameter space, we prevent a large parameter shift by averaging weights during training. We propose a more challenging Multi-domain Task Incremental Learning (MTIL) benchmark to evaluate different methods, where tasks come from various domains instead of being class-separated within a single dataset. Our method outperforms other methods in the traditional class-incremental learning setting and on the MTIL benchmark by 9.7% in average score. Our code is available at https://github.com/Thunderbeee/ZSCL. + + + + Personalized Image Generation for Color Vision Deficiency Population + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Personalized_Image_Generation_for_Color_Vision_Deficiency_Population_ICCV_2023_paper.pdf + Approximately 350 million people, around 8% of the population, suffer from color vision deficiency (CVD). While image generation algorithms have been highly successful in synthesizing high-quality images, CVD populations are unintentionally excluded from target users and have difficulties understanding the generated images as normal viewers do. Although a straightforward baseline can be formed by combining generation models and recolor compensation methods as a post-processing step, the CVD friendliness of the resulting images is still limited, since the input image content of recolor methods is not CVD-oriented and is kept fixed during the recolor compensation process. Besides, CVD populations cannot be fully served since the varying degrees of CVD are often neglected in recoloring methods. Instead, we propose a personalized CVD-friendly image generation algorithm with two key characteristics: (i) generating CVD-oriented images end-to-end; (ii) generating continuous personalized images for people with various CVD types and degrees through disentangling the color representation based on a triple-latent structure. Quantitative experiments and the user study indicate that our proposed image generation model can generate practical and compelling results compared to the normal generation model and combination baselines on several datasets. + + + + EGC: Image Generation and Classification via a Diffusion Energy-Based Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_EGC_Image_Generation_and_Classification_via_a_Diffusion_Energy-Based_Model_ICCV_2023_paper.pdf + Learning image classification and image generation using the same set of network parameters presents a formidable challenge. Recent advanced approaches that perform well in one task often exhibit poor performance in the other. This work introduces an energy-based classifier and generator, namely EGC, which can achieve superior performance in both tasks using a single neural network. Unlike conventional classifiers that produce a label given an image (i.e., a conditional distribution p(y|x)), the forward pass in EGC is a classification model that yields a joint distribution p(x,y), enabling a diffusion model in its backward pass by marginalizing out the label y to estimate the score function. Furthermore, EGC can be adapted for unsupervised learning by considering the label as a latent variable. EGC achieves competitive generation results compared with state-of-the-art approaches on ImageNet-1k, CelebA-HQ and LSUN Church, while achieving superior classification accuracy and robustness against adversarial attacks on CIFAR-10. 
This work marks the inaugural success in mastering both domains using a unified network parameter set. We believe that EGC bridges the gap between discriminative and generative learning. + + + + Joint Metrics Matter: A Better Standard for Trajectory Forecasting + http://openaccess.thecvf.com//content/ICCV2023/papers/Weng_Joint_Metrics_Matter_A_Better_Standard_for_Trajectory_Forecasting_ICCV_2023_paper.pdf + Multi-modal trajectory forecasting methods commonly evaluate using single-agent metrics (marginal metrics), such as minimum Average Displacement Error (ADE) and Final Displacement Error (FDE), which fail to capture joint performance of multiple interacting agents. Only focusing on marginal metrics can lead to unnatural predictions, such as colliding trajectories or diverging trajectories for people who are clearly walking together as a group. Consequently, methods optimized for marginal metrics lead to overly-optimistic estimations of performance, which is detrimental to progress in trajectory forecasting research. In response to the limitations of marginal metrics, we present the first comprehensive evaluation of state-of-the-art (SOTA) trajectory forecasting methods with respect to multi-agent metrics (joint metrics): JADE, JFDE, and collision rate. We demonstrate the importance of joint metrics as opposed to marginal metrics with quantitative evidence and qualitative examples drawn from the ETH / UCY and Stanford Drone datasets. We introduce a new loss function incorporating joint metrics that, when applied to a SOTA trajectory forecasting method, achieves SOTA performance with respect to JADE and JFDE, achieving a 7% improvement over the previous SOTA on the ETH / UCY datasets. Our results also indicate that optimizing for joint metrics naturally leads to an improvement in interaction modeling, as evidenced by a 16% decrease in mean collision rate on the ETH / UCY datasets with respect to the previous SOTA. Code is available at https://github.com/ericaweng/joint-metrics-matter. + + + + Test Time Adaptation for Blind Image Quality Assessment + http://openaccess.thecvf.com//content/ICCV2023/papers/Roy_Test_Time_Adaptation_for_Blind_Image_Quality_Assessment_ICCV_2023_paper.pdf + While the design of blind image quality assessment (IQA) algorithms has improved significantly, the distribution shift between the training and testing scenarios often leads to a poor performance of these methods at inference time. This motivates the study of test time adaptation (TTA) techniques to improve their performance at inference time. Existing auxiliary tasks and loss functions used for TTA may not be relevant for quality-aware adaptation of the pre-trained model. In this work, we introduce two novel quality-relevant auxiliary tasks at the batch and sample levels to enable TTA for blind IQA. In particular, we introduce a group contrastive loss at the batch level and a relative rank loss at the sample level to make the model quality aware and adapt to the target data. Our experiments reveal that even using a small batch of images from the test distribution helps achieve significant improvement in performance by updating the batch normalization statistics of the source model. 
+ + + + GeT: Generative Target Structure Debiasing for Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_GeT_Generative_Target_Structure_Debiasing_for_Domain_Adaptation_ICCV_2023_paper.pdf + Domain adaptation (DA) aims to transfer knowledge from a fully labeled source to a scarcely labeled or totally unlabeled target under domain shift. Recently, semi-supervised learning (SSL) based techniques that leverage pseudo labeling have been increasingly used in DA. Despite the competitive performance, these pseudo labeling methods rely heavily on the source domain to generate pseudo labels for the target domain and therefore still suffer considerably from source data bias. Moreover, class distribution bias in the target domain is also often ignored in the pseudo label generation, thus leading to further deterioration of performance. In this paper, we propose GeT, which learns an unbiased target embedding distribution with high-quality pseudo labels. Specifically, we formulate an online target generative classifier to induce the target distribution into distinctive Gaussian components weighted by their class priors to mitigate source data bias and enhance target class discriminability. We further propose a structure similarity regularization framework to alleviate target class distribution bias and further improve target class discriminability. Experimental results show that our proposed GeT is effective and achieves consistent improvements under various DA settings with and without class distribution bias. Our code is available at: https://lulusindazc.github.io/getproject/. + + + + Point-SLAM: Dense Neural Point Cloud-based SLAM + http://openaccess.thecvf.com//content/ICCV2023/papers/Sandstrom_Point-SLAM_Dense_Neural_Point_Cloud-based_SLAM_ICCV_2023_paper.pdf + We propose a dense neural simultaneous localization and mapping (SLAM) approach for monocular RGBD input which anchors the features of a neural scene representation in a point cloud that is iteratively generated in an input-dependent data-driven manner. We demonstrate that both tracking and mapping can be performed with the same point-based neural scene representation by minimizing an RGBD-based re-rendering loss. In contrast to recent dense neural SLAM methods which anchor the scene features in a sparse grid, our point-based approach allows us to dynamically adapt the anchor point density to the information density of the input. This strategy reduces runtime and memory usage in regions with fewer details and dedicates higher point density to resolve fine details. Our approach performs better than or competitively with existing dense neural RGBD SLAM methods in tracking, mapping and rendering accuracy on the Replica, TUM-RGBD and ScanNet datasets. + + + + TrajectoryFormer: 3D Object Tracking Transformer with Predictive Trajectory Hypotheses + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_TrajectoryFormer_3D_Object_Tracking_Transformer_with_Predictive_Trajectory_Hypotheses_ICCV_2023_paper.pdf + 3D multi-object tracking (MOT) is vital for many applications, including autonomous driving vehicles and service robots. With the commonly used tracking-by-detection paradigm, 3D MOT has made important progress in recent years. However, these methods only use the detection boxes of the current frame to obtain trajectory-box association results, which makes it impossible for the tracker to recover objects missed by the detector. In this paper, we present TrajectoryFormer, a novel point-cloud-based 3D MOT framework. 
To recover objects missed by the detector, we generate multiple trajectory hypotheses with hybrid candidate boxes, including temporally predicted boxes and current-frame detection boxes, for trajectory-box association. The predicted boxes can propagate an object's historical trajectory information to the current frame, and thus the network can tolerate short-term missed detections of the tracked objects. We combine long-term object motion features and short-term object appearance features to create per-hypothesis feature embeddings, which reduces the computational overhead for spatial-temporal encoding. Additionally, we introduce a Global-Local Interaction Module to conduct information interaction among all hypotheses and model their spatial relations, leading to accurate estimation of hypotheses. Our TrajectoryFormer achieves state-of-the-art performance on the Waymo 3D MOT benchmarks. + + + + See More and Know More: Zero-shot Point Cloud Segmentation via Multi-modal Visual Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_See_More_and_Know_More_Zero-shot_Point_Cloud_Segmentation_via_ICCV_2023_paper.pdf + Zero-shot point cloud segmentation aims to make deep models capable of recognizing novel objects in point clouds that are unseen in the training phase. Recent trends favor the pipeline which transfers knowledge from seen classes with labels to unseen classes without labels. They typically align visual features with semantic features obtained from word embedding by the supervision of seen classes' annotations. However, point clouds contain limited information to fully match with semantic features. In fact, the rich appearance information of images is a natural complement to the textureless point cloud, which is not well explored in previous literature. Motivated by this, we propose a novel multi-modal zero-shot learning method to better utilize the complementary information of point clouds and images for more accurate visual-semantic alignment. Extensive experiments are performed on two popular benchmarks, i.e., SemanticKITTI and nuScenes, and our method outperforms current SOTA methods by 52% and 49% on average for unseen-class mIoU, respectively. + + + + Editable Image Geometric Abstraction via Neural Primitive Assembly + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Editable_Image_Geometric_Abstraction_via_Neural_Primitive_Assembly_ICCV_2023_paper.pdf + This work explores a novel image geometric abstraction paradigm based on assembly out of a pool of pre-defined simple parametric primitives (i.e., triangle, rectangle, circle and semicircle), facilitating controllable shape editing in images. While cast as a mixed combinatorial and continuous optimization problem, the above task is approximately reformulated within a token translation neural framework that simultaneously outputs primitive assignments and corresponding transformation and color parameters in an image-to-set manner, thus bypassing complex/non-differentiable graph-matching iterations. To relax the search space and address the gradient vanishing issue, a novel Neural Soft Assignment scheme that well explores the quasi-equivalence between the assignment in Bipartite b-Matching and opacity-aware weighted multiple rasterization combination is introduced, drastically reducing the optimization complexity. 
Without ground-truth image abstraction labeling (i.e., vectorized representation), the whole pipeline is end-to-end trainable in a self-supervised manner, based on the linkage of differentiable rasterization techniques. Extensive experiments on several datasets well demonstrate that our framework is able to predict highly compelling vectorized geometric abstraction results with a combination of ONLY four simple primitives, also with VERY straightforward shape editing capability by simple replacement of primitive type, compared to previous image abstraction and image vectorization methods. + + + + Homeomorphism Alignment for Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Homeomorphism_Alignment_for_Unsupervised_Domain_Adaptation_ICCV_2023_paper.pdf + Existing unsupervised domain adaptation (UDA) methods rely on aligning the features from the source and target domains explicitly or implicitly in a common space (i.e., the domain invariant space). Explicit distribution matching ignores the discriminability of learned features, while the implicit counterpart such as self-supervised learning suffers from pseudo-label noises. With distribution alignment, it is challenging to acquire a common space which maintains fully the discriminative structure of both domains. In this work, we propose a novel HomeomorphisM Alignment (HMA) approach characterized by aligning the source and target data in two separate spaces. Specifically, an invertible neural network based homeomorphism is constructed. Distribution matching is then used as a sewing up tool for connecting this homeomorphism mapping between the source and target feature spaces. Theoretically, we show that this mapping can preserve the data topological structure (e.g., the cluster/group structure). This property allows for more discriminative model adaptation by leveraging both the original and transformed features of source data in a supervised manner, and those of target domain in an unsupervised manner (e.g., prediction consistency). Extensive experiments demonstrate that our method can achieve the state-of-the-art results. Code is released at https://github.com/buerzlh/HMA. + + + + EmoSet: A Large-scale Visual Emotion Dataset with Rich Attributes + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_EmoSet_A_Large-scale_Visual_Emotion_Dataset_with_Rich_Attributes_ICCV_2023_paper.pdf + Visual Emotion Analysis (VEA) aims at predicting people's emotional responses to visual stimuli. This is a promising, yet challenging, task in affective computing, which has drawn increasing attention in recent years. Most of the existing work in this area focuses on feature design, while little attention has been paid to dataset construction. In this work, we introduce EmoSet, the first large-scale visual emotion dataset annotated with rich attributes, which is superior to existing datasets in four aspects: scale, annotation richness, diversity, and data balance. EmoSet comprises 3.3 million images in total, with 118,102 of these images carefully labeled by human annotators, making it five times larger than the largest existing dataset. EmoSet includes images from social networks, as well as artistic images, and it is well balanced between different emotion categories. 
Motivated by psychological studies, in addition to emotion category, each image is also annotated with a set of describable emotion attributes: brightness, colorfulness, scene type, object class, facial expression, and human action, which can help understand visual emotions in a precise and interpretable way. The relevance of these emotion attributes is validated by analyzing the correlations between them and visual emotion, as well as by designing an attribute module to help visual emotion recognition. We believe EmoSet will bring some key insights and encourage further research in visual emotion analysis and understanding. Project page: https://vcc.tech/EmoSet. + + + + Class-relation Knowledge Distillation for Novel Class Discovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Gu_Class-relation_Knowledge_Distillation_for_Novel_Class_Discovery_ICCV_2023_paper.pdf + We tackle the problem of novel class discovery, which aims to learn novel classes without supervision based on labeled data from known classes. A key challenge lies in transferring the knowledge in the known-class data to the learning of novel classes. Previous methods mainly focus on building a shared representation space for knowledge transfer and often ignore modeling class relations. To address this, we introduce a class relation representation for the novel classes based on the predicted class distribution of a model trained on known classes. Empirically, we find that such class relation becomes less informative during typical discovery training. To prevent such information loss, we propose a novel knowledge distillation framework, which utilizes our class-relation representation to regularize the learning of novel classes. In addition, to enable a flexible knowledge distillation scheme for each data point in novel classes, we develop a learnable weighting function for the regularization, which adaptively promotes knowledge transfer based on the semantic similarity between the novel and known classes. To validate the effectiveness and generalization of our method, we conduct extensive experiments on multiple benchmarks, including CIFAR100, Stanford Cars, CUB, and FGVC-Aircraft datasets. Our results demonstrate that the proposed method outperforms the previous state-of-the-art methods by a significant margin on almost all benchmarks. + + + + Data-Free Class-Incremental Hand Gesture Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Aich_Data-Free_Class-Incremental_Hand_Gesture_Recognition_ICCV_2023_paper.pdf + This paper investigates data-free class-incremental learning (DFCIL) for hand gesture recognition from 3D skeleton sequences. In this class-incremental learning (CIL) setting, while incrementally registering the new classes, we do not have access to the training samples (i.e. data-free) of the already known classes due to privacy. Existing DFCIL methods primarily focus on various forms of knowledge distillation for model inversion to mitigate catastrophic forgetting. Unlike SOTA methods, we delve deeper into the choice of the best samples for inversion. Inspired by the well-grounded theory of max-margin classification, we find that the best samples tend to lie close to the approximate decision boundary within a reasonable margin. To this end, we propose BOAT-MI -- a simple and effective boundary-aware prototypical sampling mechanism for model inversion for DFCIL. 
Our sampling scheme outperforms SOTA methods significantly on two 3D skeleton gesture datasets: the publicly available SHREC 2017, and EgoGesture3D -- which we extract from a publicly available RGBD dataset. Both our codebase and the EgoGesture3D skeleton dataset are publicly available: https://github.com/humansensinglab/dfcil-hgr + + + + Mixed Neural Voxels for Fast Multi-view Video Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Mixed_Neural_Voxels_for_Fast_Multi-view_Video_Synthesis_ICCV_2023_paper.pdf + Synthesizing high-fidelity videos from real-world multiview input is challenging due to the complexities of real-world environments and high-dynamic movements. Previous works based on neural radiance fields have demonstrated high-quality reconstructions of dynamic scenes. However, training such models on real-world scenes is time-consuming, usually taking days or weeks. In this paper, we present a novel method named MixVoxels to efficiently represent dynamic scenes, which leads to fast training and rendering speed. The proposed MixVoxels represents the 4D dynamic scenes as a mixture of static and dynamic voxels and processes them with different networks. In this way, the computation of the required modalities for static voxels can be processed by a lightweight model, which essentially reduces the amount of computation as many daily dynamic scenes are dominated by static backgrounds. To distinguish the two kinds of voxels, we propose a novel variation field to estimate the temporal variance of each voxel. For the dynamic representations, we design an inner-product time query method to efficiently query multiple time steps, which is essential to recover the high-dynamic movements. As a result, with 15 minutes of training for dynamic scenes with inputs of 300-frame videos, MixVoxels achieves better PSNR than previous methods. For rendering, MixVoxels can render a novel view video with 1K resolution at 37 fps. Codes and trained models are available at https://github.com/fengres/mixvoxels. + + + + Harvard Glaucoma Detection and Progression: A Multimodal Multitask Dataset and Generalization-Reinforced Semi-Supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_Harvard_Glaucoma_Detection_and_Progression_A_Multimodal_Multitask_Dataset_and_ICCV_2023_paper.pdf + Glaucoma is the number one cause of irreversible blindness globally. A major challenge for accurate glaucoma detection and progression forecasting is the bottleneck of limited labeled patients with the state-of-the-art (SOTA) 3D retinal imaging data of optical coherence tomography (OCT). To address the data scarcity issue, this paper proposes two solutions. First, we develop a novel generalization-reinforced semi-supervised learning (SSL) model called pseudo supervisor to optimally utilize unlabeled data. Compared with SOTA models, the proposed pseudo supervisor optimizes the policy of predicting pseudo labels with unlabeled samples to improve empirical generalization. Our pseudo supervisor model is evaluated with two clinical tasks consisting of glaucoma detection and progression forecasting. The progression forecasting task is evaluated both unimodally and multimodally. Our pseudo supervisor model demonstrates superior performance compared to SOTA SSL comparison models. Moreover, our model also achieves the best results on the publicly available LAG fundus dataset. 
Second, we introduce the Harvard Glaucoma Detection and Progression (Harvard-GDP) Dataset, a multimodal multitask dataset that includes data from 1,000 patients with OCT imaging data, as well as labels for glaucoma detection and progression. This is the largest glaucoma detection dataset with 3D OCT imaging data and the first glaucoma progression forecasting dataset that is publicly available. Detailed sex and racial analyses are provided, which can be used by interested researchers for fairness learning studies. Our released dataset is benchmarked with several SOTA supervised CNN and transformer deep learning models. The dataset and code are made publicly available via https://ophai.hms.harvard.edu/datasets/harvard-gdp1000. + + + + Tracking Everything Everywhere All at Once + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Tracking_Everything_Everywhere_All_at_Once_ICCV_2023_paper.pdf + We present a new test-time optimization method for estimating dense and long-range motion from a video sequence. Prior optical flow or particle video tracking algorithms typically operate within limited temporal windows, struggling to track through occlusions and maintain global consistency of estimated motion trajectories. We propose a complete and globally consistent motion representation, dubbed OmniMotion, that allows for accurate, full-length motion estimation of every pixel in a video. OmniMotion represents a video using a quasi-3D canonical volume and performs pixel-wise tracking via bijections between local and canonical space. This representation allows us to ensure global consistency, track through occlusions, and model any combination of camera and object motion. Extensive evaluations on the TAP-Vid benchmark and real-world footage show that our approach outperforms prior state-of-the-art methods by a large margin both quantitatively and qualitatively. + + + + CauSSL: Causality-inspired Semi-supervised Learning for Medical Image Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Miao_CauSSL_Causality-inspired_Semi-supervised_Learning_for_Medical_Image_Segmentation_ICCV_2023_paper.pdf + Semi-supervised learning (SSL) has recently demonstrated great success in medical image segmentation, significantly enhancing data efficiency with limited annotations. However, despite its empirical benefits, there are still concerns in the literature about the theoretical foundation and explanation of semi-supervised segmentation. To explore this problem, this study first proposes a novel causal diagram to provide a theoretical foundation for the mainstream semi-supervised segmentation methods. Our causal diagram takes two additional intermediate variables into account, which are neglected in previous work. Drawing from this proposed causal diagram, we then introduce a causality-inspired SSL approach on top of co-training frameworks, called CauSSL, to improve SSL for medical image segmentation. Specifically, we first point out the importance of algorithmic independence between two networks or branches in SSL, which is often overlooked in the literature. We then propose a novel statistical quantification of the uncomputable algorithmic independence and further enhance the independence via a min-max optimization process. Our method can be flexibly incorporated into different existing SSL methods to improve their performance. 
Our method has been evaluated on three challenging medical image segmentation tasks using both 2D and 3D network architectures and has shown consistent improvements over state-of-the-art methods. Our code is publicly available at: https://github.com/JuzhengMiao/CauSSL. + + + + ChartReader: A Unified Framework for Chart Derendering and Comprehension without Heuristic Rules + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_ChartReader_A_Unified_Framework_for_Chart_Derendering_and_Comprehension_without_ICCV_2023_paper.pdf + Charts are a powerful tool for visually conveying complex data, but their comprehension poses a challenge due to the diverse chart types and intricate components. Existing chart comprehension methods suffer from either heuristic rules or an over-reliance on OCR systems, resulting in suboptimal performance. To address these issues, we present ChartReader, a unified framework that seamlessly integrates chart derendering and comprehension tasks. Our approach includes a transformer-based chart component detection module and an extended pre-trained vision-language model for chart-to-X tasks. By learning the rules of charts automatically from annotated datasets, our approach eliminates the need for manual rule-making, reducing effort and enhancing accuracy. We also introduce a data variable replacement technique and extend the input and position embeddings of the pre-trained model for cross-task training. We evaluate ChartReader on Chart-to-Table, ChartQA, and Chart-to-Text tasks, demonstrating its superiority over existing methods. Our proposed framework can significantly reduce the manual effort involved in chart analysis, providing a step towards a universal chart understanding model. Moreover, our approach offers opportunities for plug-and-play integration with mainstream LLMs such as T5 and TaPas, extending their capability to chart comprehension tasks. + + + + Neural LiDAR Fields for Novel View Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Neural_LiDAR_Fields_for_Novel_View_Synthesis_ICCV_2023_paper.pdf + We present Neural Fields for LiDAR (NFL), a method to optimise a neural field scene representation from LiDAR measurements, with the goal of synthesizing realistic LiDAR scans from novel viewpoints. NFL combines the rendering power of neural fields with a detailed, physically motivated model of the LiDAR sensing process, thus enabling it to accurately reproduce key sensor behaviors like beam divergence, secondary returns, and ray dropping. We evaluate NFL on synthetic and real LiDAR scans and show that it outperforms explicit reconstruct-then-simulate methods as well as other NeRF-style methods on the LiDAR novel view synthesis task. Moreover, we show that the improved realism of the synthesized views narrows the domain gap to real scans and translates to better registration and semantic segmentation performance. + + + + Understanding 3D Object Interaction from a Single Image + http://openaccess.thecvf.com//content/ICCV2023/papers/Qian_Understanding_3D_Object_Interaction_from_a_Single_Image_ICCV_2023_paper.pdf + Humans can easily understand a single image as depicting multiple potential objects permitting interaction. We use this skill to plan our interactions with the world and accelerate understanding of new objects without engaging in interaction. In this paper, we would like to endow machines with a similar ability, so that intelligent agents can better explore the 3D scene or manipulate objects. 
Our approach is a transformer-based model that predicts the 3D location, physical properties and affordance of objects. To power this model, we collect a dataset with Internet videos, egocentric videos and indoor images to train and validate our approach. Our model yields strong performance on our data, and generalizes well to robotics data. + + + + Cross-Modal Translation and Alignment for Survival Analysis + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Cross-Modal_Translation_and_Alignment_for_Survival_Analysis_ICCV_2023_paper.pdf + With the rapid advances in high-throughput sequencing technologies, the focus of survival analysis has shifted from examining clinical indicators to incorporating genomic profiles with pathological images. However, existing methods either directly adopt a straightforward fusion of pathological features and genomic profiles for survival prediction, or take genomic profiles as guidance to integrate the features of pathological images. The former would overlook intrinsic cross-modal correlations. The latter would discard pathological information irrelevant to gene expression. To address these issues, we present a Cross-Modal Translation and Alignment (CMTA) framework to explore the intrinsic cross-modal correlations and transfer potential complementary information. Specifically, we construct two parallel encoder-decoder structures for multi-modal data to integrate intra-modal information and generate cross-modal representation. Taking the generated cross-modal representation to enhance and recalibrate intra-modal representation can significantly improve its discrimination for comprehensive survival analysis. To explore the intrinsic cross-modal correlations, we further design a cross-modal attention module as the information bridge between different modalities to perform cross-modal interactions and transfer complementary information. Our extensive experiments on five public TCGA datasets demonstrate that our proposed framework outperforms the state-of-the-art methods. The source code has been released. + + + + Chaotic World: A Large and Challenging Benchmark for Human Behavior Understanding in Chaotic Events + http://openaccess.thecvf.com//content/ICCV2023/papers/Ong_Chaotic_World_A_Large_and_Challenging_Benchmark_for_Human_Behavior_ICCV_2023_paper.pdf + Understanding and analyzing human behaviors (actions and interactions of people), voices, and sounds in chaotic events is crucial in many applications, e.g., crowd management and emergency response services. Human behaviors in chaotic events generally differ from those in daily life in how people behave and influence one another, and hence are often much more complex. However, there is currently a lack of a large video dataset for analyzing human behaviors in chaotic situations. To this end, we create the first large and challenging multi-modal dataset, Chaotic World, that simultaneously provides different levels of fine-grained and dense spatio-temporal annotations of sounds, individual actions and group interaction graphs, and even text descriptions for each scene in each video, thereby enabling a thorough analysis of complicated behaviors in crowds and chaos. 
Our dataset consists of a total of 299,923 annotated instances for detecting human behaviors for Spatiotemporal Action Localization in chaotic events, 224,275 instances for identifying interactions between people for Behavior Graph Analysis in chaotic events, 336,390 instances for localizing relevant scenes of interest in long videos for Spatiotemporal Event Grounding, and 378,093 instances for triangulating the source of sound for Event Sound Source Localization. Given the practical complexity and challenges in chaotic events (e.g., large crowds, serious occlusions, complicated interaction patterns), our dataset shall be able to facilitate the community to develop, adapt, and evaluate various types of advanced models for analyzing human behaviors in chaotic events. We also design a simple yet effective IntelliCare model with a Dynamic Knowledge Pathfinder module that intelligently learns from multiple tasks and can analyze various aspects of a chaotic scene in a unified architecture. This method achieves promising results in experiments. Dataset and code can be found at https://github.com/sutdcv/Chaotic-World + + + + Active Stereo Without Pattern Projector + http://openaccess.thecvf.com//content/ICCV2023/papers/Bartolomei_Active_Stereo_Without_Pattern_Projector_ICCV_2023_paper.pdf + This paper proposes a novel framework integrating the principles of active stereo in standard passive camera systems without a physical pattern projector. We virtually project a pattern over the left and right images according to the sparse measurements obtained from a depth sensor. Any such devices can be seamlessly plugged into our framework, allowing for the deployment of a virtual active stereo setup in any possible environment, overcoming the limitation of pattern projectors, such as limited working range or environmental conditions. Experiments on indoor/outdoor datasets, featuring both long and close-range, support the seamless effectiveness of our approach, boosting the accuracy of both stereo algorithms and deep networks. + + + + Towards Instance-adaptive Inference for Federated Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_Towards_Instance-adaptive_Inference_for_Federated_Learning_ICCV_2023_paper.pdf + Federated learning (FL) is a distributed learning paradigm that enables multiple clients to learn a powerful global model by aggregating local training. However, the performance of the global model is often hampered by non-i.i.d. distribution among the clients, requiring extensive efforts to mitigate inter-client data heterogeneity. Going beyond inter-client data heterogeneity, we note that intra-client heterogeneity can also be observed on complex real-world data and seriously deteriorate FL performance. In this paper, we present a novel FL algorithm, i.e., FedIns, to handle intra-client data heterogeneity by enabling instance-adaptive inference in the FL framework. Instead of huge instance-adaptive models, we resort to a parameter-efficient fine-tuning method, i.e., scale and shift deep features (SSF), upon a pre-trained model. Specifically, we first train an SSF pool for each client, and aggregate these SSF pools on the server side, thus still maintaining a low communication cost. To enable instance-adaptive inference, for a given instance, we dynamically find the best-matched SSF subsets from the pool and aggregate them to generate an adaptive SSF specified for the instance, thereby reducing the intra-client as well as the inter-client heterogeneity. 
Extensive experiments show that our FedIns outperforms state-of-the-art FL algorithms, e.g., a 6.64% improvement against the top-performing method with less than 15% communication cost on Tiny-ImageNet. + + + + Online Clustered Codebook + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Online_Clustered_Codebook_ICCV_2023_paper.pdf + Vector Quantisation (VQ) is experiencing a comeback in machine learning, where it is increasingly used in representation learning. However, optimizing the codevectors in existing VQ-VAE is not entirely trivial. A problem is codebook collapse, where only a small subset of codevectors receive gradients useful for their optimization, whereas a majority of them simply "dies off" and is never updated or used. This limits the effectiveness of VQ for learning larger codebooks in complex computer vision tasks that require high-capacity representations. In this paper, we present a simple alternative method for online codebook learning, Clustering VQ-VAE (CVQ-VAE). Our approach selects encoded features as anchors to update the "dead" codevectors, while optimizing the codebooks which are alive via the original loss. This strategy brings unused codevectors closer in distribution to the encoded features, increasing the likelihood of being chosen and optimized. We extensively validate the generalization capability of our quantizer on various datasets, tasks (e.g., reconstruction and generation), and architectures (e.g., VQ-VAE, VQGAN, LDM). CVQ-VAE can be easily integrated into the existing models with just a few lines of code. + + + + Robo3D: Towards Robust and Reliable 3D Perception against Corruptions + http://openaccess.thecvf.com//content/ICCV2023/papers/Kong_Robo3D_Towards_Robust_and_Reliable_3D_Perception_against_Corruptions_ICCV_2023_paper.pdf + The robustness of 3D perception systems under natural corruptions from environments and sensors is pivotal for safety-critical applications. Existing large-scale 3D perception datasets often contain data that are meticulously cleaned. Such configurations, however, cannot reflect the reliability of perception models during the deployment stage. In this work, we present Robo3D, the first comprehensive benchmark heading toward probing the robustness of 3D detectors and segmentors under out-of-distribution scenarios against natural corruptions that occur in real-world environments. Specifically, we consider eight corruption types stemming from severe weather conditions, external disturbances, and internal sensor failure. We uncover that, although promising results have been progressively achieved on standard benchmarks, state-of-the-art 3D perception models are at risk of being vulnerable to corruptions. We draw key observations on the use of data representations, augmentation schemes, and training strategies, that could severely affect the model's performance. To pursue better robustness, we propose a density-insensitive training framework along with a simple flexible voxelization strategy to enhance the model resiliency. We hope our benchmark and approach could inspire future research in designing more robust and reliable 3D perception models. Our robustness benchmark suite is publicly available. 
+ + + + Gradient-based Sampling for Class Imbalanced Semi-supervised Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Gradient-based_Sampling_for_Class_Imbalanced_Semi-supervised_Object_Detection_ICCV_2023_paper.pdf + Current semi-supervised object detection (SSOD) algorithms typically assume class balanced datasets (PASCAL VOC etc.) or slightly class imbalanced datasets (MSCOCO, etc). This assumption can be easily violated since real world datasets can be extremely class imbalanced in nature, thus making the performance of semi-supervised object detectors far from satisfactory. Besides, the research for this problem in SSOD is severely under-explored. To bridge this research gap, we comprehensively study the class imbalance problem for SSOD under more challenging scenarios, thus forming the first experimental setting for class imbalanced SSOD (CI-SSOD). Moreover, we propose a simple yet effective gradient-based sampling framework that tackles the class imbalance problem from the perspective of two types of confirmation biases. To tackle confirmation bias towards majority classes, the gradient-based reweighting and gradient-based thresholding modules leverage the gradients from each class to fully balance the influence of the majority and minority classes. To tackle the confirmation bias from incorrect pseudo labels of minority classes, the class-rebalancing sampling module resamples unlabeled data following the guidance of the gradient-based reweighting module. Experiments on three proposed sub-tasks, namely MS-COCO, MS-COCO- Object365 and LVIS, suggest that our method outperforms current class imbalanced object detectors by clear margins, serving as a baseline for future research in CISSOD. Code will be available at https://github.com/nightkeepers/CI-SSOD. + + + + SLCA: Slow Learner with Classifier Alignment for Continual Learning on a Pre-trained Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_SLCA_Slow_Learner_with_Classifier_Alignment_for_Continual_Learning_on_ICCV_2023_paper.pdf + The goal of continual learning is to improve the performance of recognition models in learning sequentially arrived data. Although most existing works are established on the premise of learning from scratch, growing efforts have been devoted to incorporating the benefits of pre-training. However, how to adaptively exploit the pre-trained knowledge for each incremental task while maintaining its generalizability remains an open question. In this work, we present an extensive analysis for continual learning on a pre-trained model (CLPM), and attribute the key challenge to a progressive overfitting problem. Observing that selectively reducing the learning rate can almost resolve this issue in the representation layer, we propose a simple but extremely effective approach named Slow Learner with Classifier Alignment (SLCA), which further improves the classification layer by modeling the class-wise distributions and aligning the classification layers in a post-hoc fashion. Across a variety of scenarios, our proposal provides substantial improvements for CLPM (e.g., up to 49.76%, 50.05%, 44.69% and 40.16% on Split CIFAR-100, Split ImageNet-R, Split CUB-200 and Split Cars-196, respectively), and thus outperforms state-of-the-art approaches by a large margin. Based on such a strong baseline, critical factors and promising directions are analyzed in-depth to facilitate subsequent research. 
+ + + + Implicit Temporal Modeling with Learnable Alignment for Video Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Tu_Implicit_Temporal_Modeling_with_Learnable_Alignment_for_Video_Recognition_ICCV_2023_paper.pdf + Contrastive language-image pretraining (CLIP) has demonstrated remarkable success in various image tasks. However, how to extend CLIP with effective temporal modeling is still an open and crucial problem. Existing factorized or joint spatial-temporal modeling trades off between efficiency and performance. While modeling temporal information within a straight-through tube is widely adopted in the literature, we find that simple frame alignment already provides enough essence without temporal attention. To this end, in this paper, we propose a novel Implicit Learnable Alignment (ILA) method, which minimizes the temporal modeling effort while achieving incredibly high performance. Specifically, for a frame pair, an interactive point is predicted in each frame, serving as the mutual information rich region. By enhancing the features around the interactive point, two frames are implicitly aligned. The aligned features are then pooled into a single token, which is leveraged in the subsequent spatial self-attention. Our method allows eliminating the costly or insufficient temporal self-attention in video. Extensive experiments on benchmarks demonstrate the superiority and generality of our module. Particularly, the proposed ILA achieves a top-1 accuracy of 88.9% on Kinetics-400 with much fewer FLOPs compared with Swin-L and ViViT-H. Code is released at https://github.com/Francis-Rings/ILA. + + + + Pixel-Aligned Recurrent Queries for Multi-View 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_Pixel-Aligned_Recurrent_Queries_for_Multi-View_3D_Object_Detection_ICCV_2023_paper.pdf + We present PARQ - a multi-view 3D object detector with transformer and pixel-aligned recurrent queries. Unlike previous works that use learnable features or only encode 3D point positions as queries in the decoder, PARQ leverages appearance-enhanced queries initialized from reference points in 3D space and updates their 3D location with recurrent cross-attention operations. Incorporating pixel-aligned features and cross attention enables the model to encode the necessary 3D-to-2D correspondences and capture global contextual information of the input images. PARQ outperforms prior best methods on the ScanNet and ARKitScenes datasets, learns and detects faster, is more robust to distribution shifts in reference points, can leverage additional input views without retraining, and can adapt inference compute by changing the number of recurrent iterations. + + + + TiDAL: Learning Training Dynamics for Active Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Kye_TiDAL_Learning_Training_Dynamics_for_Active_Learning_ICCV_2023_paper.pdf + Active learning (AL) aims to select the most useful data samples from an unlabeled data pool and annotate them to expand the labeled dataset under a limited budget. Especially, uncertainty-based methods choose the most uncertain samples, which are known to be effective in improving model performance. However, AL literature often overlooks training dynamics (TD), defined as the ever-changing model behavior during optimization via stochastic gradient descent, even though other research areas have empirically shown that TD provides important clues for measuring the data uncertainty. 
In this paper, we first provide theoretical and empirical evidence to argue the usefulness of utilizing the ever-changing model behavior rather than the fully trained model snapshot. We then propose a novel AL method, Training Dynamics for Active Learning (TiDAL), which efficiently predicts the training dynamics of unlabeled data to estimate their uncertainty. Experimental results show that our TiDAL achieves better or comparable performance on both balanced and imbalanced benchmark datasets compared to state-of-the-art AL methods, which estimate data uncertainty using only static information after model training. + + + + DiffPose: Multi-hypothesis Human Pose Estimation using Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Holmquist_DiffPose_Multi-hypothesis_Human_Pose_Estimation_using_Diffusion_Models_ICCV_2023_paper.pdf + Traditionally, monocular 3D human pose estimation employs a machine learning model to predict the most likely 3D pose for a given input image. However, a single image can be highly ambiguous and induces multiple plausible solutions for the 2D-3D lifting step, which results in overly confident 3D pose predictors. To this end, we propose DiffPose, a conditional diffusion model that predicts multiple hypotheses for a given input image. Compared to similar approaches, our diffusion model is straightforward and avoids intensive hyperparameter tuning, complex network structures, mode collapse, and unstable training. Moreover, we tackle the problem of over-simplification of the intermediate representation of the common two-step approaches which first estimate a distribution of 2D joint locations via joint-wise heatmaps and consecutively use their maximum argument for the 3D pose estimation step. Since such a simplification of the heatmaps removes valid information about possibly correct, though labeled unlikely, joint locations, we propose to represent the heatmaps as a set of 2D joint candidate samples. To extract information about the original distribution from these samples, we introduce our embedding transformer which conditions the diffusion model. Experimentally, we show that DiffPose improves upon the state of the art for multi-hypothesis pose estimation by 3-5% for simple poses and outperforms it by a large margin for highly ambiguous poses. + + + + AesPA-Net: Aesthetic Pattern-Aware Style Transfer Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Hong_AesPA-Net_Aesthetic_Pattern-Aware_Style_Transfer_Networks_ICCV_2023_paper.pdf + To deliver the artistic expression of the target style, recent studies exploit the attention mechanism owing to its ability to map the local patches of the style image to the corresponding patches of the content image. However, because of the low semantic correspondence between arbitrary content and artworks, the attention module repeatedly abuses specific local patches from the style image, resulting in disharmonious and evident repetitive artifacts. To overcome this limitation and accomplish impeccable artistic style transfer, we focus on enhancing the attention mechanism and capturing the rhythm of patterns that organize the style. In this paper, we introduce a novel metric, namely pattern repeatability, that quantifies the repetition of patterns in the style image. Based on the pattern repeatability, we propose Aesthetic Pattern-Aware style transfer Networks (AesPA-Net) that discover the sweet spot of local and global style expressions. 
In addition, we propose a novel self-supervisory task to encourage the attention mechanism to learn precise and meaningful semantic correspondence. Lastly, we introduce the patch-wise style loss to transfer the elaborate rhythm of local patterns. Through qualitative and quantitative evaluations, we verify the reliability of the proposed pattern repeatability that aligns with human perception, and demonstrate the superiority of the proposed framework. + + + + Self-Ordering Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Self-Ordering_Point_Clouds_ICCV_2023_paper.pdf + In this paper we address the task of finding representative subsets of points in a 3D point cloud by means of a point-wise ordering. Only a few works have tried to address this challenging vision problem, all with the help of hard to obtain point and cloud labels. Different from these works, we introduce the task of point-wise ordering in 3D point clouds through self-supervision, which we call self-ordering. We further contribute the first end-to-end trainable network that learns a point-wise ordering in a self-supervised fashion. It utilizes a novel differentiable point scoring-sorting strategy and it constructs an hierarchical contrastive scheme to obtain self-supervision signals. We extensively ablate the method and show its superior performance even compared to supervised ordering methods on multiple datasets and tasks including zero-shot ordering of point clouds from unseen categories. + + + + Continual Segment: Towards a Single, Unified and Non-forgetting Continual Segmentation Model of 143 Whole-body Organs in CT Scans + http://openaccess.thecvf.com//content/ICCV2023/papers/Ji_Continual_Segment_Towards_a_Single_Unified_and_Non-forgetting_Continual_Segmentation_ICCV_2023_paper.pdf + Deep learning empowers the mainstream medical image segmentation methods. Nevertheless, current deep segmentation approaches are not capable of efficiently and effectively adapting and updating the trained models when new segmentation classes are incrementally added. In the real clinical environment, it can be preferred that segmentation models could be dynamically extended to segment new organs/tumors without the (re-)access to previous training datasets due to obstacles of patient privacy and data storage. This process can be viewed as a continual semantic segmentation (CSS) problem, being understudied for multi-organ segmentation. In this work, we propose a new architectural CSS learning framework to learn a single deep segmentation model for segmenting a total of 143 whole-body organs. Using the encoder/decoder network structure, we demonstrate that a continually trained then frozen encoder coupled with incrementally-added decoders can extract sufficiently representative image features for new classes to be subsequently and validly segmented, while avoiding the catastrophic forgetting in CSS. To maintain a single network model complexity, each decoder is progressively pruned using neural architecture search and teacher-student based knowledge distillation. Finally, we propose a body-part and anomaly-aware output merging module to combine organ predictions originating from different decoders and incorporate both healthy and pathological organs appearing in different datasets. 
Trained and validated on 3D CT scans of 2500+ patients from four datasets, our single network can segment a total of 143 whole-body organs with very high accuracy, closely reaching the upper bound performance level by training four separate segmentation models (i.e., one model per dataset/task). + + + + Enhancing Modality-Agnostic Representations via Meta-Learning for Brain Tumor Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Konwer_Enhancing_Modality-Agnostic_Representations_via_Meta-Learning_for_Brain_Tumor_Segmentation_ICCV_2023_paper.pdf + In medical vision, different imaging modalities provide complementary information. However, in practice, not all modalities may be available during inference or even training. Previous approaches, e.g., knowledge distillation or image synthesis, often assume the availability of full modalities for all patients during training; this is unrealistic and impractical due to the variability in data collection across sites. We propose a novel approach to learn enhanced modality-agnostic representations by employing a meta-learning strategy in training, even when only limited full modality samples are available. Meta-learning enhances partial modality representations to full modality representations by meta-training on partial modality data and meta-testing on limited full modality samples. Additionally, we co-supervise this feature enrichment by introducing an auxiliary adversarial learning branch. More specifically, a missing modality detector is used as a discriminator to mimic the full modality setting. Our segmentation framework significantly outperforms state-of-the-art brain tumor segmentation techniques in missing modality scenarios. + + + + DNA-Rendering: A Diverse Neural Actor Repository for High-Fidelity Human-Centric Rendering + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_DNA-Rendering_A_Diverse_Neural_Actor_Repository_for_High-Fidelity_Human-Centric_Rendering_ICCV_2023_paper.pdf + Realistic human-centric rendering plays a key role in both computer vision and computer graphics. Rapid progress has been made in the algorithm aspect over the years, yet existing human-centric rendering datasets and benchmarks are rather impoverished in terms of diversity (e.g., outfit's fabric/material, body's interaction with objects, and motion sequences), which are crucial for rendering effect. Researchers are usually constrained to explore and evaluate a small set of rendering problems on current datasets, while real-world applications require methods to be robust across different scenarios. In this work, we present DNA-Rendering, a large-scale, high-fidelity repository of human performance data for neural actor rendering. DNA-Rendering presents several appealing attributes. First, our dataset contains over 1500 human subjects, 5000 motion sequences, and 67.5M frames' data volume. Upon the massive collections, we provide human subjects with grand categories of pose actions, body shapes, clothing, accessories, hairdos, and object intersection, which ranges the geometry and appearance variances from everyday life to professional occasions. Second, we provide rich assets for each subject - 2D/3D human body keypoints, foreground masks, SMPLX models, cloth/accessory materials, multi-view images, and videos. These assets boost the current method's accuracy on downstream rendering tasks. 
Third, we construct a professional multi-view system to capture data, which contains 60 synchronous cameras with max 4096 x 3000 resolution, 15 fps speed, and stern camera calibration steps, ensuring high-quality resources for task training and evaluation. Along with the dataset, we provide a large-scale and quantitative benchmark in full-scale, with multiple tasks to evaluate the existing progress of novel view synthesis, novel pose animation synthesis, and novel identity rendering methods. In this manuscript, we describe our DNA-Rendering effort as a revealing of new observations, challenges, and future directions to human-centric rendering. The dataset, code, and benchmarks will be publicly available at https://dna-rendering.github.io/. + + + + A step towards understanding why classification helps regression + http://openaccess.thecvf.com//content/ICCV2023/papers/Pintea_A_step_towards_understanding_why_classification_helps_regression_ICCV_2023_paper.pdf + A number of computer vision deep regression approaches report improved results when adding a classification loss to the regression loss. Here, we explore why this is useful in practice and when it is beneficial. To do so, we start from precisely controlled dataset variations and data samplings and find that the effect of adding a classification loss is the most pronounced for regression with imbalanced data. We explain these empirical findings by formalizing the relation between the balanced and imbalanced regression losses. Finally, we show that our findings hold on two real imbalanced image datasets for depth estimation (NYUD2-DIR), and age estimation (IMDB-WIKI-DIR), and on the problem of imbalanced video progress prediction (Breakfast). Our main takeaway is: for a regression task, if the data sampling is imbalanced, then add a classification loss. + + + + CTP:Towards Vision-Language Continual Pretraining via Compatible Momentum Contrast and Topology Preservation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_CTPTowards_Vision-Language_Continual_Pretraining_via_Compatible_Momentum_Contrast_and_Topology_ICCV_2023_paper.pdf + Vision-Language Pretraining (VLP) has shown impressive results on diverse downstream tasks by offline training on large-scale datasets. Regarding the growing nature of real-world data, such an offline training paradigm on ever-expanding data is unsustainable, because models lack the continual learning ability to accumulate knowledge constantly. However, most continual learning studies are limited to uni-modal classification and existing multi-modal datasets cannot simulate continual non-stationary data stream scenarios. To support the study of Vision-Language Continual Pretraining (VLCP), we first contribute a comprehensive and unified benchmark dataset P9D which contains over one million product image-text pairs from 9 industries. The data from each industry as an independent task supports continual learning and conforms to the real-world long-tail nature to simulate pretraining on web data. We comprehensively study the characteristics and challenges of VLCP, and propose a new algorithm: Compatible momentum contrast with Topology Preservation, dubbed CTP. The compatible momentum model absorbs the knowledge of the current and previous-task models to flexibly update the modal feature. Moreover, Topology Preservation transfers the knowledge of embedding across tasks while preserving the flexibility of feature adjustment. 
The experimental results demonstrate our method not only achieves superior performance compared with other baselines but also does not bring an expensive training burden. Dataset and codes are available at https://github.com/KevinLight831/CTP. + + + + FLIP: Cross-domain Face Anti-spoofing with Language Guidance + http://openaccess.thecvf.com//content/ICCV2023/papers/Srivatsan_FLIP_Cross-domain_Face_Anti-spoofing_with_Language_Guidance_ICCV_2023_paper.pdf + Face anti-spoofing (FAS) or presentation attack detection is an essential component of face recognition systems deployed in security-critical applications. Existing FAS methods have poor generalizability to unseen spoof types, camera sensors, and environmental conditions. Recently, vision transformer (ViT) models have been shown to be effective for the FAS task due to their ability to capture long-range dependencies among image patches. However, adaptive modules or auxiliary loss functions are often required to adapt pre-trained ViT weights learned on large-scale datasets such as ImageNet. In this work, we first show that initializing ViTs with multimodal (e.g., CLIP) pre-trained weights improves generalizability for the FAS task, which is in line with the zero-shot transfer capabilities of vision-language pre-trained (VLP) models. We then propose a novel approach for robust cross-domain FAS by grounding visual representations with the help of natural language. Specifically, we show that aligning the image representation with an ensemble of class descriptions (based on natural language semantics) improves FAS generalizability in low-data regimes. Finally, we propose a multimodal contrastive learning strategy to boost feature generalization further and bridge the gap between source and target domains. Extensive experiments on three standard protocols demonstrate that our method significantly outperforms the state-of-the-art methods, achieving better zero-shot transfer performance than five-shot transfer of "adaptive ViTs". Code: https://github.com/koushiksrivats/FLIP + + + + Distribution Shift Matters for Knowledge Distillation with Webly Collected Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_Distribution_Shift_Matters_for_Knowledge_Distillation_with_Webly_Collected_Images_ICCV_2023_paper.pdf + Knowledge distillation aims to learn a lightweight student network from a pre-trained teacher network. In practice, existing knowledge distillation methods are usually infeasible when the original training data is unavailable due to some privacy issues and data management considerations. Therefore, data-free knowledge distillation approaches proposed to collect training instances from the Internet. However, most of them have ignored the common distribution shift between the instances from original training data and webly collected data, affecting the reliability of the trained student network. To solve this problem, we propose a novel method dubbed "Knowledge Distillation between Different Distributions" (KD^ 3 ), which consists of three components. Specifically, we first dynamically select useful training instances from the webly collected data according to the combined predictions of teacher network and student network. Subsequently, we align both the weighted features and classifier parameters of the two networks for knowledge memorization. 
Meanwhile, we also build a new contrastive learning block called MixDistribution to generate perturbed data with a new distribution for instance alignment, so that the student network can further learn a distribution-invariant representation. Intensive experiments on various benchmark datasets demonstrate that our proposed KD^ 3 can outperform the state-of-the-art data-free knowledge distillation approaches. + + + + Gram-based Attentive Neural Ordinary Differential Equations Network for Video Nystagmography Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Qiu_Gram-based_Attentive_Neural_Ordinary_Differential_Equations_Network_for_Video_Nystagmography_ICCV_2023_paper.pdf + Video nystagmography (VNG) is the diagnostic gold standard of benign paroxysmal positional vertigo (BPPV), which requires medical professionals to examine the direction, frequency, intensity, duration, and variation in the strength of nystagmus on a VNG video. This is a tedious process heavily influenced by the doctor's experience, which is error-prone. Recent automatic VNG classification methods approach this problem from the perspective of video analysis without considering medical prior knowledge, resulting in unsatisfactory accuracy and limited diagnostic capability for nystagmographic types, thereby preventing their clinical application. In this paper, we propose an end-to-end data-driven novel BPPV diagnosis framework (TC-BPPV) by considering this problem as an eye trajectory classification problem due to the disease's symptoms and experts' prior knowledge. In this framework, we utilize an eye movement tracking system to capture the eye trajectory and propose the Gram-based attentive neural ordinary differential equations network (Gram-AODE) to perform classification. We validate our framework using the VNG dataset provided by the collaborative university hospital and achieve state-of-the-art performance. We also evaluate Gram-AODE on multiple open-source benchmarks to demonstrate its effectiveness in trajectory classification. Code is available at https://github.com/XiheQiu/Gram-AODE. + + + + Pluralistic Aging Diffusion Autoencoder + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Pluralistic_Aging_Diffusion_Autoencoder_ICCV_2023_paper.pdf + Face aging is an ill-posed problem because multiple plausible aging patterns may correspond to a given input. Most existing methods often produce one deterministic estimation. This paper proposes a novel CLIP-driven Pluralistic Aging Diffusion Autoencoder (PADA) to enhance the diversity of aging patterns. First, we employ diffusion models to generate diverse low-level aging details via a sequential denoising reverse process. Second, we present Probabilistic Aging Embedding (PAE) to capture diverse high-level aging patterns, which represents age information as probabilistic distributions in the common CLIP latent space. A text-guided KL-divergence loss is designed to guide this learning. Our method can achieve pluralistic face aging conditioned on open-world aging texts and arbitrary unseen face images. Qualitative and quantitative experiments demonstrate that our method can generate more diverse and high-quality plausible aging results. 
+ + + + TIFA: Accurate and Interpretable Text-to-Image Faithfulness Evaluation with Question Answering + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_TIFA_Accurate_and_Interpretable_Text-to-Image_Faithfulness_Evaluation_with_Question_Answering_ICCV_2023_paper.pdf + Despite thousands of researchers, engineers, and artists actively working on improving text-to-image generation models, systems often fail to produce images that accurately align with the text inputs. We introduce TIFA (Text-to-image Faithfulness evaluation with question Answering), an automatic evaluation metric that measures the faithfulness of a generated image to its text input via visual question answering (VQA). Specifically, given a text input, we automatically generate several question-answer pairs using a language model. We calculate image faithfulness by checking whether existing VQA models can answer these questions using the generated image. TIFA is a reference-free metric that allows for fine-grained and interpretable evaluations of generated images. TIFA also has better correlations with human judgments than existing metrics. Based on this approach, we introduce TIFA v1.0, a benchmark consisting of 4K diverse text inputs and 25K questions across 12 categories (object, counting, etc.). We present a comprehensive evaluation of existing text-to-image models using TIFA v1.0 and highlight the limitations and challenges of current models. For instance, we find that current text-to-image models, despite doing well on color and material, still struggle in counting, spatial relations, and composing multiple objects. We hope our benchmark will help carefully measure the research progress in text-to-image synthesis and provide valuable insights for further research. + + + + Towards Models that Can See and Read + http://openaccess.thecvf.com//content/ICCV2023/papers/Ganz_Towards_Models_that_Can_See_and_Read_ICCV_2023_paper.pdf + Visual Question Answering (VQA) and Image Captioning (CAP), which are among the most popular vision-language tasks, have analogous scene-text versions that require reasoning from the text in the image. Despite their obvious resemblance, the two are treated independently and, as we show, yield task-specific methods that can either see or read, but not both. In this work, we conduct an in-depth analysis of this phenomenon and propose UniTNT, a Unified Text-Non-Text approach, which grants existing multimodal architectures scene-text understanding capabilities. Specifically, we treat scene-text information as an additional modality, fusing it with any pretrained encoder-decoder-based architecture via designated modules. Thorough experiments reveal that UniTNT leads to the first single model that successfully handles both task types. Moreover, we show that scene-text understanding capabilities can boost vision-language models' performance on general VQA and CAP by up to 2.69% and 0.6 CIDEr, respectively. + + + + Query Refinement Transformer for 3D Instance Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_Query_Refinement_Transformer_for_3D_Instance_Segmentation_ICCV_2023_paper.pdf + 3D instance segmentation aims to predict a set of object instances in a scene and represent them as binary foreground masks with corresponding semantic labels. However, object instances are diverse in shape and category, and point clouds are usually sparse, unordered, and irregular, which leads to a query sampling dilemma. 
Besides, noise background queries interfere with proper scene perception and accurate instance segmentation. To address the above issues, we propose a Query Refinement Transformer termed QueryFormer. The key to our approach is to exploit a query initialization module to optimize the initialization process for the query distribution with a high coverage and low repetition rate. Additionally, we design an affiliated transformer decoder that suppresses the interference of noise background queries and helps the foreground queries focus on instance discriminative parts to predict final segmentation results. Extensive experiments on ScanNetV2 and S3DIS datasets show that our QueryFormer can surpass state-of-the-art 3D instance segmentation methods. + + + + 3DHumanGAN: 3D-Aware Human Image Generation with 3D Pose Mapping + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_3DHumanGAN_3D-Aware_Human_Image_Generation_with_3D_Pose_Mapping_ICCV_2023_paper.pdf + We present 3DHumanGAN, a 3D-aware generative adversarial network that synthesizes photorealistic images of full-body humans with consistent appearances under different view-angles and body-poses. To tackle the representational and computational challenges in synthesizing the articulated structure of human bodies, we propose a novel generator architecture in which a 2D convolutional backbone is modulated by a 3D pose mapping network. The 3D pose mapping network is formulated as a renderable implicit function conditioned on a posed 3D human mesh. This design has several merits: i) it leverages the strength of 2D GANs to produce high-quality images; ii) it generates consistent images under varying view-angles and poses; iii) the model can incorporate the 3D human prior and enable pose conditioning. Project page: https://3dhumangan.github.io/. + + + + Domain-Specificity Inducing Transformers for Source-Free Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Sanyal_Domain-Specificity_Inducing_Transformers_for_Source-Free_Domain_Adaptation_ICCV_2023_paper.pdf + Conventional Domain Adaptation (DA) methods aim to learn domain-invariant feature representations to improve the target adaptation performance. However, we motivate that domain-specificity is equally important since in-domain trained models hold crucial domain-specific properties that are beneficial for adaptation. Hence, we propose to build a framework that supports disentanglement and learning of domain-specific factors and task-specific factors in a unified model. Motivated by the success of vision transformers in several multi-modal vision problems, we find that queries could be leveraged to extract the domain-specific factors. Hence, we propose a novel Domain-Specificity inducing Transformer (DSiT) framework for disentangling and learning both domain-specific and task-specific factors. To achieve disentanglement, we propose to construct novel Domain-Representative Inputs (DRI) with domain-specific information to train a domain classifier with a novel domain token. We are the first to utilize vision transformers for domain adaptation in a privacy-oriented source-free setting, and our approach achieves state-of-the-art performance on single-source, multi-source, and multi-target benchmarks. 
+ + + + Efficient Emotional Adaptation for Audio-Driven Talking-Head Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Gan_Efficient_Emotional_Adaptation_for_Audio-Driven_Talking-Head_Generation_ICCV_2023_paper.pdf + Audio-driven talking-head synthesis is a popular research topic for virtual human-related applications. However, the inflexibility and inefficiency of existing methods, which necessitate expensive end-to-end training to transfer emotions from guidance videos to talking-head predictions, are significant limitations. In this work, we propose the Emotional Adaptation for Audio-driven Talking-head (EAT) method, which transforms emotion-agnostic talking-head models into emotion-controllable ones in a cost-effective and efficient manner through parameter-efficient adaptations. Our approach utilizes a pretrained emotion-agnostic talking-head transformer and introduces three lightweight adaptations (the Deep Emotional Prompts, Emotional Deformation Network, and Emotional Adaptation Module) from different perspectives to enable precise and realistic emotion controls. Our experiments demonstrate that our approach achieves state-of-the-art performance on widely-used benchmarks, including LRW and MEAD. Additionally, our parameter-efficient adaptations exhibit remarkable generalization ability, even in scenarios where emotional training videos are scarce or nonexistent. Project website: https://yuangan.github.io/eat/ + + + + Object-aware Gaze Target Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Tonini_Object-aware_Gaze_Target_Detection_ICCV_2023_paper.pdf + Gaze target detection aims to predict the image location where the person is looking and the probability that a gaze is out of the scene. Several works have tackled this task by regressing a gaze heatmap centered on the gaze location, however, they overlooked decoding the relationship between the people and the gazed objects. This paper proposes a Transformer-based architecture that automatically detects objects (including heads) in the scene to build associations between every head and the gazed-head/object, resulting in a comprehensive, explainable gaze analysis composed of: gaze target area, gaze pixel point, the class and the image location of the gazed-object. Upon evaluation of the in-the-wild benchmarks, our method achieves state-of-the-art results on all metrics (up to 2.91% gain in AUC, 50% reduction in gaze distance, and 9% gain in out-of-frame average precision) for gaze target detection and 11-13% improvement in average precision for the classification and the localization of the gazed-objects. The code of the proposed method is publicly available. + + + + VADER: Video Alignment Differencing and Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Black_VADER_Video_Alignment_Differencing_and_Retrieval_ICCV_2023_paper.pdf + We propose VADER, a spatio-temporal matching, alignment, and change summarization method to help fight misinformation spread via manipulated videos. VADER matches and coarsely aligns partial video fragments to candidate videos using a robust visual descriptor and scalable search over adaptively chunked video content. A transformer-based alignment module then refines the temporal localization of the query fragment within the matched video. A space-time comparator module identifies regions of manipulation between aligned content, invariant to any changes due to any residual temporal misalignments or artifacts arising from non-editorial changes of the content. 
Robustly matching video to a trusted source enables conclusions to be drawn on video provenance, enabling informed trust decisions on content encountered. Code and data are available at https://github.com/AlexBlck/vader + + + + HiLo: Exploiting High Low Frequency Relations for Unbiased Panoptic Scene Graph Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_HiLo_Exploiting_High_Low_Frequency_Relations_for_Unbiased_Panoptic_Scene_ICCV_2023_paper.pdf + Panoptic Scene Graph generation (PSG) is a recently proposed task in image scene understanding that aims to segment the image and extract triplets of subjects, objects and their relations to build a scene graph. This task is particularly challenging for two reasons. First, it suffers from a long-tail problem in its relation categories, making naive biased methods more inclined to high-frequency relations. Existing unbiased methods tackle the long-tail problem by data/loss rebalancing to favor low-frequency relations. Second, a subject-object pair can have two or more semantically overlapping relations. While existing methods favor one over the other, our proposed HiLo framework lets different network branches specialize on low and high frequency relations, enforce their consistency and fuse the results. To the best of our knowledge we are the first to propose an explicitly unbiased PSG method. In extensive experiments we show that our HiLo framework achieves state-of-the-art results on the PSG task. We also apply our method to the Scene Graph Generation task that predicts boxes instead of masks and see improvements over all baseline methods. Code is available at https://github.com/franciszzj/HiLo. + + + + Chop & Learn: Recognizing and Generating Object-State Compositions + http://openaccess.thecvf.com//content/ICCV2023/papers/Saini_Chop__Learn_Recognizing_and_Generating_Object-State_Compositions_ICCV_2023_paper.pdf + Recognizing and generating object-state compositions has been a challenging task, especially when generalizing to unseen compositions. In this paper, we study the task of cutting objects in different styles and the resulting object state changes. We propose a new benchmark suite Chop & Learn, to accommodate the needs of learning objects and different cut styles using multiple viewpoints. We also propose a new task of Compositional Image Generation, which can transfer learned cut styles to different objects, by generating novel object-state images. Moreover, we also use the videos for Compositional Action Recognition, and show valuable uses of this dataset for multiple video tasks. Project website: https://chopnlearn.github.io. + + + + Automatic Animation of Hair Blowing in Still Portrait Photos + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiao_Automatic_Animation_of_Hair_Blowing_in_Still_Portrait_Photos_ICCV_2023_paper.pdf + We propose a novel approach to animate human hair in a still portrait photo. Existing work has largely studied the animation of fluid elements such as water and fire. However, hair animation for a real image remains underexplored, which is a challenging problem, due to the high complexity of hair structure and dynamics. Considering the complexity of hair structure, we innovatively treat hair wisp extraction as an instance segmentation problem, where a hair wisp is referred to as an instance. With advanced instance segmentation networks, our method extracts meaningful and natural hair wisps. 
Furthermore, we propose a wisp-aware animation module that animates hair wisps with pleasing motions without noticeable artifacts. The extensive experiments show the superiority of our method. Our method provides the most pleasing and compelling viewing experience in the qualitative experiments, and outperforms state-of-the-art still-image animation methods by a large margin in the quantitative evaluation. Project url: https://nevergiveu.github.io/AutomaticHairBlowing/ + + + + 4D Panoptic Segmentation as Invariant and Equivariant Field Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_4D_Panoptic_Segmentation_as_Invariant_and_Equivariant_Field_Prediction_ICCV_2023_paper.pdf + In this paper, we develop rotation-equivariant neural networks for 4D panoptic segmentation. 4D panoptic segmentation is a benchmark task for autonomous driving that requires recognizing semantic classes and object instances on the road based on LiDAR scans, as well as assigning temporally consistent IDs to instances across time. We observe that the driving scenario is symmetric to rotations on the ground plane. Therefore, rotation-equivariance could provide better generalization and more robust feature learning. Specifically, we review the object instance clustering strategies and restate the centerness-based approach and the offset-based approach as the prediction of invariant scalar fields and equivariant vector fields. Other sub-tasks are also unified from this perspective, and different invariant and equivariant layers are designed to facilitate their predictions. Through evaluation on the standard 4D panoptic segmentation benchmark of SemanticKITTI, we show that our equivariant models achieve higher accuracy with lower computational costs compared to their non-equivariant counterparts. Moreover, our method sets the new state-of-the-art performance and achieves 1st place on the SemanticKITTI 4D Panoptic Segmentation leaderboard. + + + + Zip-NeRF: Anti-Aliased Grid-Based Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Barron_Zip-NeRF_Anti-Aliased_Grid-Based_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Neural Radiance Field training can be accelerated through the use of grid-based representations in NeRF's learned mapping from spatial coordinates to colors and volumetric density. However, these grid-based approaches lack an explicit understanding of scale and therefore often introduce aliasing, usually in the form of jaggies or missing scene content. Anti-aliasing has previously been addressed by mip-NeRF 360, which reasons about sub-volumes along a cone rather than points along a ray, but this approach is not natively compatible with current grid-based techniques. We show how ideas from rendering and signal processing can be used to construct a technique that combines mip-NeRF 360 and grid-based models such as Instant NGP to yield error rates that are 8%-77% lower than either prior technique, and that trains 24x faster than mip-NeRF 360. + + + + Neural-PBIR Reconstruction of Shape, Material, and Illumination + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Neural-PBIR_Reconstruction_of_Shape_Material_and_Illumination_ICCV_2023_paper.pdf + Reconstructing the shape and spatially varying surface appearances of a physical-world object as well as its surrounding illumination based on 2D images (e.g., photographs) of the object has been a long-standing problem in computer vision and graphics. 
In this paper, we introduce an accurate and highly efficient object reconstruction pipeline combining neural based object reconstruction and physics-based inverse rendering (PBIR). Our pipeline firstly leverages a neural SDF based shape reconstruction to produce high-quality but potentially imperfect object shape. Then, we introduce a neural material and lighting distillation stage to achieve high-quality predictions for material and illumination. In the last stage, initialized by the neural predictions, we perform PBIR to refine the initial results and obtain the final high-quality reconstruction of object shape, material, and illumination. Experimental results demonstrate our pipeline significantly outperforms existing methods quality-wise and performance-wise. Code: https://neural-pbir.github.io/ + + + + Fg-T2M: Fine-Grained Text-Driven Human Motion Generation via Diffusion Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Fg-T2M_Fine-Grained_Text-Driven_Human_Motion_Generation_via_Diffusion_Model_ICCV_2023_paper.pdf + Text-driven human motion generation in computer vision is both significant and challenging. However, current methods are limited to producing either deterministic or imprecise motion sequences, failing to effectively control the temporal and spatial relationships required to conform to a given text description. In this work, we propose a fine-grained method for generating high-quality, conditional human motion sequences supporting precise text description. Our approach consists of two key components: 1) a linguistics-structure assisted module that constructs accurate and complete language feature to fully utilize text information; and 2) a context-aware progressive reasoning module that learns neighborhood and overall semantic linguistics features from shallow and deep graph neural networks to achieve a multi-step inference. Experiments show that our approach outperforms text-driven motion generation methods on HumanML3D and KIT test sets and generates better visually confirmed motion to the text conditions. + + + + BlindHarmony: "Blind" Harmonization for MR Images via Flow Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Jeong_BlindHarmony_Blind_Harmonization_for_MR_Images_via_Flow_Model_ICCV_2023_paper.pdf + In MRI, images of the same contrast (e.g., T1) from the same subject can exhibit noticeable differences when acquired using different hardware, sequences, or scan parameters. These differences in images create a domain gap that needs to be bridged by a step called image harmonization, to process the images successfully using conventional or deep learning-based image analysis (e.g., segmentation). Several methods, including deep learning-based approaches, have been proposed to achieve image harmonization. However, they often require datasets from multiple domains for deep learning training and may still be unsuccessful when applied to images from unseen domains. To address this limitation, we propose a novel concept called 'Blind Harmonization', which utilizes only target domain data for training but still has the capability to harmonize images from unseen domains. For the implementation of blind harmonization, we developed BlindHarmony using an unconditional flow model trained on target domain data. The harmonized image is optimized to have a correlation with the input source domain image while ensuring that the latent vector of the flow model is close to the center of the Gaussian distribution. 
BlindHarmony was evaluated on both simulated and real datasets and compared to conventional methods. BlindHarmony demonstrated noticeable performance on both datasets, highlighting its potential for future use in clinical settings. The source code is available at: https://github.com/SNU-LIST/BlindHarmony + + + + Efficient LiDAR Point Cloud Oversegmentation Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Hui_Efficient_LiDAR_Point_Cloud_Oversegmentation_Network_ICCV_2023_paper.pdf + Point cloud oversegmentation is a challenging task since it needs to produce perceptually meaningful partitions (i.e., superpoints) of a point cloud. Most existing oversegmentation methods cannot efficiently generate superpoints from large-scale LiDAR point clouds due to complex and inefficient procedures. In this paper, we propose a simple yet efficient end-to-end LiDAR oversegmentation network, which segments superpoints from the LiDAR point cloud by grouping points based on low-level point embeddings. Specifically, we first learn the similarity of points from the constructed local neighborhoods to obtain low-level point embeddings through the local discriminative loss. Then, to generate homogeneous superpoints from the sparse LiDAR point cloud, we propose a LiDAR point grouping algorithm that simultaneously considers the similarity of point embeddings and the Euclidean distance of points in 3D space. Finally, we design a superpoint refinement module for accurately assigning the hard boundary points to the corresponding superpoints. Extensive results on two large-scale outdoor datasets, SemanticKITTI and nuScenes, show that our method achieves a new state-of-the-art in LiDAR oversegmentation. Notably, the inference time of our method is 100x faster than that of other methods. Furthermore, we apply the learned superpoints to the LiDAR semantic segmentation task and the results show that using superpoints can significantly improve the LiDAR semantic segmentation of the baseline network. Code is available at https://github.com/fpthink/SuperLiDAR. + + + + Few-Shot Video Classification via Representation Fusion and Promotion Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_Few-Shot_Video_Classification_via_Representation_Fusion_and_Promotion_Learning_ICCV_2023_paper.pdf + Recent few-shot video classification (FSVC) works achieve promising performance by capturing similarity across support and query samples with different temporal alignment strategies or learning discriminative features via Transformer block within each episode. However, they ignore two important issues: a) It is difficult to capture rich intrinsic action semantics from a limited number of support instances within each task. b) Redundant or irrelevant frames in videos easily weaken the positive influence of discriminative frames. To address these two issues, this paper proposes a novel Representation Fusion and Promotion Learning (RFPL) mechanism with two sub-modules: meta-action learning (MAL) and reinforced image representation (RIR). Concretely, during training stage, we perform online learning for seeking a task-shared meta-action bank to enrich task-specific action representation by injecting global knowledge. Besides, we exploit reinforcement learning to obtain the importance of each frame and refine the representation. This operation maximizes the contribution of discriminative frames to further capture the similarity of support and query samples from the same category. 
Our RFPL framework is highly flexible that it can be integrated with many existing FSVC methods. Extensive experiments show that RFPL significantly enhances the performance of existing FSVC models when integrated with them. + + + + Hallucination Improves the Performance of Unsupervised Visual Representation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Hallucination_Improves_the_Performance_of_Unsupervised_Visual_Representation_Learning_ICCV_2023_paper.pdf + Contrastive learning models based on Siamese structure have demonstrated remarkable performance in self-supervised learning. Such a success of contrastive learning relies on two conditions, including a sufficient number of positive pairs and adequate variations between them. If the conditions are not met, these frameworks will lack semantic contrast and be fragile on overfitting. To address these two issues, we propose Hallucinator that could efficiently generate additional positive samples for further contrast. The Hallucinator creates new data in the feature space, thus introducing nearly negligible computation. Moreover, we reduce the mutual information of hallucinated pairs and smooth them through non-linear operations. This process helps avoid over-confident contrastive learning models during the training and achieves more robust transformation-invariant feature embeddings. Remarkably, we empirically prove that the proposed Hallucinator generalizes well to various contrastive learning models, including MoCoV1&V2, SimCLR and SimSiam. Under the linear classification protocol, a stable accuracy gain is achieved, ranging from 0.3% to 3.0% on CIFAR10&100, Tiny ImageNet, STL-10 and ImageNet. The improvement is also observed in transferring pre-train encoders to the downstream tasks, including object detection and segmentation. + + + + S3IM: Stochastic Structural SIMilarity and Its Unreasonable Effectiveness for Neural Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_S3IM_Stochastic_Structural_SIMilarity_and_Its_Unreasonable_Effectiveness_for_Neural_ICCV_2023_paper.pdf + Recently, Neural Radiance Field (NeRF) has shown great success in rendering novel-view images of a given scene by learning an implicit representation with only posed RGB images. NeRF and relevant neural field methods (e.g., neural surface representation) typically optimize a point-wise loss and make point-wise predictions, where one data point corresponds to one pixel. Unfortunately, this line of research failed to use the collective supervision of distant pixels, although it is known that pixels in an image or scene can provide rich structural information. To the best of our knowledge, we are the first to design a nonlocal multiplex training paradigm for NeRF and relevant neural field methods via a novel Stochastic Structural SIMilarity (S3IM) loss that processes multiple data points as a whole set instead of process multiple inputs independently. Our extensive experiments demonstrate the unreasonable effectiveness of S3IM in improving NeRF and neural surface representation for nearly free. The improvements of quality metrics can be particularly significant for those relatively difficult tasks: e.g., the test MSE loss unexpectedly drops by more than 90% for TensoRF and DVGO over eight novel view synthesis tasks; a 198% F-score gain and a 64% Chamfer L1 distance reduction for NeuS over eight surface reconstruction tasks. Moreover, S3IM is consistently robust even with sparse inputs, corrupted images, and dynamic scenes. 
+ + + + Membrane Potential Batch Normalization for Spiking Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Membrane_Potential_Batch_Normalization_for_Spiking_Neural_Networks_ICCV_2023_paper.pdf + As one of the energy-efficient alternatives to conventional neural networks (CNNs), spiking neural networks (SNNs) have gained more and more interest recently. To train deep models, some effective batch normalization (BN) techniques have been proposed for SNNs. All of these BNs are applied after the convolution layer, as is usually done in CNNs. However, the spiking neuron is much more complex, with spatiotemporal dynamics. The regulated data flow after the BN layer will be disturbed again by the membrane potential updating operation before the firing function, i.e., the nonlinear activation. Therefore, we advocate adding another BN layer before the firing function to normalize the membrane potential again, called MPBN. To eliminate the induced time cost of MPBN, we also propose a training-inference-decoupled re-parameterization technique to fold the trained MPBN into the firing threshold. With the re-parameterization technique, the MPBN will not induce any extra time burden during inference. Furthermore, the MPBN can also adopt an element-wise form, while the BN after the convolution layer can only use the channel-wise form. Experimental results show that the proposed MPBN performs well on both popular non-spiking static and neuromorphic datasets. + + + + Enhancing Sample Utilization through Sample Adaptive Augmentation in Semi-Supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Gui_Enhancing_Sample_Utilization_through_Sample_Adaptive_Augmentation_in_Semi-Supervised_Learning_ICCV_2023_paper.pdf + In semi-supervised learning, unlabeled samples can be utilized through augmentation and consistency regularization. However, we observed that certain samples, even after undergoing strong augmentation, are still correctly classified with high confidence, resulting in a loss close to zero. This indicates that these samples have already been learned well and do not provide any additional optimization benefits to the model. We refer to these samples as "naive samples". Unfortunately, existing SSL models overlook the characteristics of naive samples, and they just apply the same learning strategy to all samples. To further optimize the SSL model, we emphasize the importance of giving attention to naive samples and augmenting them in a more diverse manner. Sample adaptive augmentation (SAA) is proposed for this purpose and consists of two modules: 1) a sample selection module; 2) a sample augmentation module. Specifically, the sample selection module picks out naive samples based on historical training information at each epoch, and the naive samples are then augmented in a more diverse manner in the sample augmentation module. Thanks to the ease of implementing the above modules, SAA has the advantage of being simple and lightweight. We add SAA on top of FixMatch and FlexMatch respectively, and experiments demonstrate that SAA can significantly improve the models. For example, SAA helped improve the accuracy of FixMatch from 92.50% to 94.76% and that of FlexMatch from 95.01% to 95.31% on CIFAR-10 with 40 labels. The code can be downloaded from the supplementary materials. 
+ + + + Imitator: Personalized Speech-driven 3D Facial Animation + http://openaccess.thecvf.com//content/ICCV2023/papers/Thambiraja_Imitator_Personalized_Speech-driven_3D_Facial_Animation_ICCV_2023_paper.pdf + Speech-driven 3D facial animation has been widely explored, with applications in gaming, character animation, virtual reality, and telepresence systems. State-of-the-art methods deform the face topology of the target actor to sync the input audio without considering the identity-specific speaking style and facial idiosyncrasies, thus, resulting in unrealistic and inaccurate lip movements. To address this, we present Imitator, a speech-driven facial expression synthesis method, which learns identity-specific details from a short input video and produces novel facial expressions matching the identity-specific speaking style and facial idiosyncrasies of the target actor. Specifically, we train a style-agnostic transformer on a large facial expression dataset which we use as a prior for audio-driven facial expressions. We utilize this prior to optimize for identity-specific speaking style based on a short reference video. To train the prior, we introduce a novel loss function based on detected bilabial consonants to ensure plausible lip closures and consequently improve the realism of the generated expressions. Through detailed experiments and user studies, we show that our approach improves Lip-Sync by 49% and produces expressive facial animations from input audio while preserving the actor's speaking style. Project page: https://balamuruganthambiraja.github.io/Imitator + + + + Seeing Beyond the Patch: Scale-Adaptive Semantic Segmentation of High-resolution Remote Sensing Imagery based on Reinforcement Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Seeing_Beyond_the_Patch_Scale-Adaptive_Semantic_Segmentation_of_High-resolution_Remote_ICCV_2023_paper.pdf + In remote sensing imagery analysis, patch-based methods have limitations in capturing information beyond the sliding window. This shortcoming poses a significant challenge in processing complex and variable geo-objects, which results in semantic inconsistency in segmentation results. To address this challenge, we propose a dynamic scale perception framework, named GeoAgent, which adaptively captures appropriate scale context information outside the image patch based on the different geo-objects. In GeoAgent, each image patch's states are represented by a global thumbnail and a location mask. The global thumbnail provides context beyond the patch, and the location mask guides the perceived spatial relationships. The scale-selection actions are performed through a Scale Control Agent (SCA). A feature indexing module is proposed to enhance the ability of the agent to distinguish the current image patch's location. The action switches the patch scale and context branch of a dual-branch segmentation network that extracts and fuses the features of multi-scale patches. The GeoAgent adjusts the network parameters to perform the appropriate scale-selection action based on the reward received for the selected scale. The experimental results, using two publicly available datasets and our newly constructed dataset WUSU, demonstrate that GeoAgent outperforms previous segmentation methods, particularly for large-scale mapping applications. 
+ + + + WALDO: Future Video Synthesis Using Object Layer Decomposition and Parametric Flow Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Le_Moing_WALDO_Future_Video_Synthesis_Using_Object_Layer_Decomposition_and_Parametric_ICCV_2023_paper.pdf + This paper presents WALDO (WArping Layer-Decomposed Objects), a novel approach to the prediction of future video frames from past ones. Individual images are decomposed into multiple layers combining object masks and a small set of control points. The layer structure is shared across all frames in each video to build dense inter-frame connections. Complex scene motions are modeled by combining parametric geometric transformations associated with individual layers, and video synthesis is broken down into discovering the layers associated with past frames, predicting the corresponding transformations for upcoming ones and warping the associated object regions accordingly, and filling in the remaining image parts. Extensive experiments on multiple benchmarks including urban videos (Cityscapes and KITTI) and videos featuring nonrigid motions (UCF-Sports and H3.6M), show that our method consistently outperforms the state of the art by a significant margin in every case. Code, pretrained models, and video samples synthesized by our approach can be found in the project webpage. + + + + Contactless Pulse Estimation Leveraging Pseudo Labels and Self-Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Contactless_Pulse_Estimation_Leveraging_Pseudo_Labels_and_Self-Supervision_ICCV_2023_paper.pdf + Remote photoplethysmography (rPPG) is a promising research area involving non-invasive monitoring of vital signs using cameras. While several supervised methods have been proposed, recent research has focused on contrastive-based self-supervised methods. However, these methods often collapse to learning irrelevant periodicities when dealing with interferences such as head motions, facial dynamics, and video compression. To address this limitation, firstly, we enhance the current self-supervised learning by introducing more reliable and explicit contrastive constraints. Secondly, we propose an innovative learning strategy that seamlessly integrates self-supervised constraints with pseudo-supervisory signals derived from traditional unsupervised methods. This is followed by a co-rectification technique designed to mitigate the adverse effects of noisy pseudo-labels. Experimental results demonstrate the superiority of our methodology over representative models when applied to small, high-quality datasets such as PURE and UBFC-rPPG. Importantly, on large-scale challenging datasets such as VIPL-HR and V4V, our method, with zero annotation cost, not only significantly surpasses prevailing self-supervised techniques but also showcases remarkable alignment with state-of-the-art supervised methods. + + + + Instruct-NeRF2NeRF: Editing 3D Scenes with Instructions + http://openaccess.thecvf.com//content/ICCV2023/papers/Haque_Instruct-NeRF2NeRF_Editing_3D_Scenes_with_Instructions_ICCV_2023_paper.pdf + We propose a method for editing NeRF scenes with text-instructions. Given a NeRF of a scene and the collection of images used to reconstruct it, our method uses an image-conditioned diffusion model (InstructPix2Pix) to iteratively edit the input images while optimizing the underlying scene, resulting in an optimized 3D scene that respects the edit instruction. 
We demonstrate that our proposed method is able to edit large-scale, real-world scenes, and is able to accomplish more realistic, targeted edits than prior work. + + + + Multi-Task Learning with Knowledge Distillation for Dense Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Multi-Task_Learning_with_Knowledge_Distillation_for_Dense_Prediction_ICCV_2023_paper.pdf + While multi-task learning (MTL) has become an attractive topic, its training usually poses more difficulties than the single-task case. How to successfully apply knowledge distillation into MTL to improve training efficiency and model performance is still a challenging problem. In this paper, we introduce a new knowledge distillation procedure with an alternative match for MTL of dense prediction based on two simple design principles. First, for memory and training efficiency, we use a single strong multi-task model as a teacher during training instead of multiple teachers, as widely adopted in existing studies. Second, we employ a less sensitive Cauchy-Schwarz (CS) divergence instead of the Kullback-Leibler (KL) divergence and propose a CS distillation loss accordingly. With the less sensitive divergence, our knowledge distillation with an alternative match is applied for capturing inter-task and intra-task information between the teacher model and the student model of each task, thereby learning more "dark knowledge" for effective distillation. We conducted extensive experiments on dense prediction datasets, including NYUD-v2 and PASCAL-Context, for multiple vision tasks, such as semantic segmentation, human parts segmentation, depth estimation, surface normal estimation, and boundary detection. The results show that our proposed method decidedly improves model performance and the practical inference efficiency. + + + + ICD-Face: Intra-class Compactness Distillation for Face Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_ICD-Face_Intra-class_Compactness_Distillation_for_Face_Recognition_ICCV_2023_paper.pdf + Knowledge distillation is an effective model compression method to improve the performance of a lightweight student model by transferring the knowledge of a well-performed teacher model, which has been widely adopted in many computer vision tasks, including face recognition (FR). The current FR distillation methods usually utilize the Feature Consistency Distillation (FCD) (e.g., L2 distance) on the learned embeddings extracted by the teacher and student models. However, after using FCD, we observe that the intra-class similarities of the student model are lower than the intra-class similarities of the teacher model a lot. Therefore, we propose an effective FR distillation method called ICD-Face by introducing intra-class compactness distillation into the existing distillation framework. Specifically, in ICD-Face, we first propose to calculate the similarity distributions of the teacher and student models, where the feature banks are introduced to construct sufficient and high-quality positive pairs. Then, we estimate the probability distributions of the teacher and student models and introduce the Similarity Distribution Consistency (SDC) loss to improve the intra-class compactness of the student model. Extensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our proposed ICD-Face for face recognition. 
+ + + + MRM: Masked Relation Modeling for Medical Image Pre-Training with Genetics + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_MRM_Masked_Relation_Modeling_for_Medical_Image_Pre-Training_with_Genetics_ICCV_2023_paper.pdf + Modern deep learning techniques on automatic multimodal medical diagnosis rely on massive expert annotations, which is time-consuming and prohibitive. Recent masked image modeling (MIM)-based pre-training methods have witnessed impressive advances for learning meaningful representations from unlabeled data and transferring to downstream tasks. However, these methods focus on natural images and ignore the specific properties of medical data, yielding unsatisfying generalization performance on downstream medical diagnosis. In this paper, we aim to leverage genetics to boost image pre-training and present a masked relation modeling (MRM) framework. Instead of explicitly masking input data in previous MIM methods leading to loss of disease-related semantics, we design relation masking to mask out token-wise feature relation in both self- and cross-modality levels, which preserves intact semantics within the input and allows the model to learn rich disease-related information. Moreover, to enhance semantic relation modeling, we propose relation matching to align the sample-wise relation between the intact and masked features. The relation matching exploits inter-sample relation by encouraging global constraints in the feature space to render sufficient semantic relation for feature representation. Extensive experiments demonstrate that the proposed framework is simple yet powerful, achieving state-of-the-art transfer performance on various downstream diagnosis tasks. + + + + TaskExpert: Dynamically Assembling Multi-Task Representations with Memorial Mixture-of-Experts + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_TaskExpert_Dynamically_Assembling_Multi-Task_Representations_with_Memorial_Mixture-of-Experts_ICCV_2023_paper.pdf + Learning discriminative task-specific features simultaneously for multiple distinct tasks is a fundamental problem in multi-task learning. Recent state-of-the-art models consider directly decoding task-specific features from one shared task-generic feature (e.g., feature from a backbone layer), and utilize carefully designed decoders to produce multi-task features. However, as the input feature is fully shared and each task decoder also shares decoding parameters for different input samples, it leads to a static feature decoding process, producing less discriminative task-specific representations. To tackle this limitation, we propose TaskExpert, a novel multi-task mixture-of-experts model that enables learning multiple representative task-generic feature spaces and decoding task-specific features in a dynamic manner. Specifically, TaskExpert introduces a set of expert networks to decompose the backbone feature into several representative task-generic features. Then, the task-specific features are decoded by using dynamic task-specific gating networks operating on the decomposed task-generic features. Furthermore, to establish long-range modeling of the task-specific representations from different layers of TaskExpert, we design a multi-task feature memory that updates at each layer and acts as an additional feature expert for dynamic task-specific feature decoding. 
Extensive experiments demonstrate that our TaskExpert clearly outperforms previous best-performing methods on all 9 metrics of two competitive multi-task learning benchmarks for visual scene understanding (i.e., PASCAL-Context and NYUD-v2). Code and models will be made publicly available. + + + + Meta OOD Learning For Continuously Adaptive OOD Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Meta_OOD_Learning_For_Continuously_Adaptive_OOD_Detection_ICCV_2023_paper.pdf + Out-of-distribution (OOD) detection is crucial to modern deep learning applications by identifying and alerting about the OOD samples that should not be tested or used for making predictions. Current OOD detection methods have made significant progress when in-distribution (ID) and OOD samples are drawn from static distributions. However, this can be unrealistic when applied to real-world systems which often undergo continuous variations and shifts in ID and OOD distributions over time. Therefore, for an effective application in real-world systems, the development of OOD detection methods that can adapt to these dynamic and evolving distributions is essential. In this paper, we propose a novel and more realistic setting called continuously adaptive out-of-distribution (CAOOD) detection which targets on developing an OOD detection model that enables dynamic and quick adaptation to a new arriving distribution, with insufficient ID samples during deployment time. To address CAOOD, we develop meta OOD learning (MOL) by designing a learning-to-adapt diagram such that a good initialized OOD detection model is learned during the training process. In the testing process, MOL ensures OOD detection performance over shifting distributions by quickly adapting to new distributions with a few adaptations. Extensive experiments on several OOD benchmarks endorse the effectiveness of our method in preserving both ID classification accuracy and OOD detection performance on continuously shifting distributions. + + + + Few-shot Continual Infomax Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Gu_Few-shot_Continual_Infomax_Learning_ICCV_2023_paper.pdf + Few-shot continual learning is the ability to continually train a neural network from a sequential stream of few-shot data. In this paper, we propose a Few-shot Continual Infomax Learning (FCIL) framework that makes a deep model to continually/incrementally learn new concepts from few labeled samples, relieving the catastrophic forgetting of past knowledge. Specifically, inspired by the theoretical definition of transfer entropy, we introduce a feature embedding infomax to effectively perform the few-shot learning, which can transfer the strong encoding capability of the base network to learn the feature embedding of these novel classes by maximizing the mutual information of different-level feature distributions. Further, considering that the learned knowledge in the human brain is a generalization of actual information and exists in a certain relational structure, we perform continual structure infomax learning to relieve the catastrophic forgetting problem in the continual learning process. The information structure of this learned knowledge can be preserved through maximizing the mutual information across these continual-changing relations of inter-classes. Comprehensive evaluations on CIFAR100, miniImageNet, and CUB200 datasets demonstrate the superiority of our FCIL when compared against state-of-the-art methods on the few-shot continual learning task. 
+ + + + A Parse-Then-Place Approach for Generating Graphic Layouts from Textual Descriptions + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_A_Parse-Then-Place_Approach_for_Generating_Graphic_Layouts_from_Textual_Descriptions_ICCV_2023_paper.pdf + Creating layouts is a fundamental step in graphic design. In this work, we propose to use text as the guidance to create graphic layouts, i.e., Text-to-Layout, aiming to lower the design barriers. Text-to-Layout is a challenging task, because it needs to consider the implicit, combined, and incomplete layout constraints from text, each of which has not been studied in previous work. To address this, we present a two-stage approach, named parse-then-place. The approach introduces an intermediate representation (IR) between text and layout to represent diverse layout constraints. With IR, Text-to-Layout is decomposed into a parse stage and a place stage. The parse stage takes a textual description as input and generates an IR, in which the implicit constraints from the text are transformed into explicit ones. The place stage generates layouts based on the IR. To model combined and incomplete constraints, we use a Transformer-based layout generation model and carefully design a way to represent constraints and layouts as sequences. Besides, we adopt the pretrain-then-finetune strategy to boost the performance of the layout generation model with large-scale unlabeled layouts. To evaluate our approach, we construct two Text-to-Layout datasets and conduct experiments on them. Quantitative results, qualitative analysis, and user studies demonstrate our approach's effectiveness. + + + + A Retrospect to Multi-prompt Learning across Vision and Language + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_A_Retrospect_to_Multi-prompt_Learning_across_Vision_and_Language_ICCV_2023_paper.pdf + The vision community is undergoing the unprecedented progress with the emergence of Vision-Language Pretraining Models (VLMs). Prompt learning plays as the holy grail of accessing VLMs since it enables their fast adaptation to downstream tasks with limited resources. Whereas existing research milling around single-prompt paradigms, rarely investigate the technical potential behind their multi-prompt learning counterparts. This paper aims to provide a principled retrospect for vision-language multi-prompt learning. We extend the recent constant modality gap phenomenon to learnable prompts and then, justify the superiority of vision-language transfer with multi-prompt augmentation, empirically and theoretically. In terms of this observation, we propose an Energy-based Multi-prompt Learning (EMPL) to generate multiple prompt embeddings by drawing instances from an energy-based distribution, which is implicitly defined by VLMs. So our EMPL is not only parameter-efficient but also rigorously lead to the balance between in-domain and out-of-domain open-vocabulary generalization. Comprehensive experiments have been conducted to justify our claims and the excellence of EMPL. + + + + Label Shift Adapter for Test-Time Adaptation under Covariate and Label Shifts + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_Label_Shift_Adapter_for_Test-Time_Adaptation_under_Covariate_and_Label_ICCV_2023_paper.pdf + Test-time adaptation (TTA) aims to adapt a pre-trained model to the target domain in a batch-by-batch manner during inference. 
While label distributions often exhibit imbalances in real-world scenarios, most previous TTA approaches typically assume that both source and target domain datasets have balanced label distribution. Due to the fact that certain classes appear more frequently in certain domains (e.g., buildings in cities, trees in forests), it is natural that the label distribution shifts as the domain changes. However, we discover that the majority of existing TTA methods fail to address the coexistence of covariate and label shifts. To tackle this challenge, we propose a novel label shift adapter that can be incorporated into existing TTA approaches to deal with label shifts during the TTA process effectively. Specifically, we estimate the label distribution of the target domain to feed it into the label shift adapter. Subsequently, the label shift adapter produces optimal parameters for the target label distribution. By predicting only the parameters for a part of the pre-trained source model, our approach is computationally efficient and can be easily applied, regardless of the model architectures. Through extensive experiments, we demonstrate that integrating our strategy with TTA approaches leads to substantial performance improvements under the joint presence of label and covariate shifts. + + + + Dataset Quantization + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Dataset_Quantization_ICCV_2023_paper.pdf + State-of-the-art deep neural networks are trained with large amounts (millions or even billions) of data. The expensive computation and memory costs make it difficult to train them on limited hardware resources, especially for recent popular large language models (LLM) and computer vision models (CV). Recent popular dataset distillation methods are thus developed, aiming to reduce the number of training samples via synthesizing small-scale datasets via gradient matching. However, as the gradient calculation is coupled with the specific network architecture, the synthesized dataset is biased and performs poorly when used for training unseen architectures. To address these limitations, we present dataset quantization (DQ), a new framework to compress large-scale datasets into small subsets which can be used for training any neural network architectures. Extensive experiments demonstrate that DQ is able to generate condensed small datasets for training unseen network architectures with state-of-the-art compression ratios for lossless model training. To the best of our knowledge, DQ is the first method that can successfully distill large-scale datasets such as ImageNet-1k with a state-of-the-art compression ratio. Notably, with 60% data from ImageNet and 20% data from Alpaca's instruction tuning data, the models can be trained with negligible or no performance drop for both vision tasks (including classification, semantic segmentation, and object detection) as well as language tasks (including instruction tuning tasks such as BBH and DROP). + + + + Overcoming Forgetting Catastrophe in Quantization-Aware Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Overcoming_Forgetting_Catastrophe_in_Quantization-Aware_Training_ICCV_2023_paper.pdf + Quantization is an effective approach for memory cost reduction by compressing networks to lower bits. However, existing quantization processes learned only from the current data tend to suffer from forgetting catastrophe on streaming data, i.e., significant performance decrement on old task data after being trained on new tasks. 
Therefore, we propose a lifelong quantization process, LifeQuant, to address the problem. We theoretically analyze the forgetting catastrophe from the shift of quantization search space with the change of data tasks. To overcome the forgetting catastrophe, we first minimize the space shift during quantization and propose Proximal Quantization Space Search (ProxQ), for regularizing the search space during quantization to be close to a pre-defined standard space. Afterward, we exploit replay data (a subset of old task data) for retraining in new tasks to alleviate the forgetting problem. However, the limited amount of replay data usually leads to biased quantization performance toward the new tasks. To address the imbalance issue, we design a Balanced Lifelong Learning (BaLL) Loss to reweight (to increase) the influence of replay data in new task learning, by leveraging the class distributions. Experimental results show that LifeQuant achieves outstanding accuracy performance with a low forgetting rate. + + + + Efficient Video Prediction via Sparsely Conditioned Flow Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Davtyan_Efficient_Video_Prediction_via_Sparsely_Conditioned_Flow_Matching_ICCV_2023_paper.pdf + We introduce a novel generative model for video prediction based on latent flow matching, an efficient alternative to diffusion-based models. In contrast to prior work, we keep the high costs of modeling the past during training and inference at bay by conditioning only on a small random set of past frames at each integration step of the image generation process. Moreover, to enable the generation of high-resolution videos and to speed up the training, we work in the latent space of a pretrained VQGAN. Finally, we propose to approximate the initial condition of the flow ODE with the previous noisy frame. This allows to reduce the number of integration steps and hence, speed up the sampling at inference time. We call our model Random frame conditioned flow Integration for VidEo pRediction, or, in short, RIVER. We show that RIVER achieves superior or on par performance compared to prior work on common video prediction benchmarks, while requiring an order of magnitude fewer computational resources. Project website: https://araachie.github.io/river. + + + + Surface Normal Clustering for Implicit Representation of Manhattan Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Popovic_Surface_Normal_Clustering_for_Implicit_Representation_of_Manhattan_Scenes_ICCV_2023_paper.pdf + Novel view synthesis and 3D modeling using implicit neural field representation are shown to be very effective for calibrated multi-view cameras. Such representations are known to benefit from additional geometric and semantic supervision. Most existing methods that exploit additional supervision require dense pixel-wise labels or localized scene priors. These methods cannot benefit from high-level vague scene priors provided in terms of scenes' descriptions. In this work, we aim to leverage the geometric prior of Manhattan scenes to improve the implicit neural radiance field representations. More precisely, we assume that only the knowledge of the indoor scene (under investigation) being Manhattan is known -- with no additional information whatsoever -- with an unknown Manhattan coordinate frame. Such high-level prior is used to self-supervise the surface normals derived explicitly in the implicit neural fields. 
Our modeling allows us to cluster the derived normals and exploit their orthogonality constraints for self-supervision. Our exhaustive experiments on datasets of diverse indoor scenes demonstrate the significant benefit of the proposed method over the established baselines. The source code will be available at https://github.com/nikola3794/normal-clustering-nerf. + + + + Adaptive Similarity Bootstrapping for Self-Distillation Based Representation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Lebailly_Adaptive_Similarity_Bootstrapping_for_Self-Distillation_Based_Representation_Learning_ICCV_2023_paper.pdf + Most self-supervised methods for representation learning leverage a cross-view consistency objective, i.e., they maximize the representation similarity of a given image's augmented views. The recent work NNCLR goes beyond the cross-view paradigm and uses positive pairs from different images obtained via nearest neighbor bootstrapping in a contrastive setting. We empirically show that, as opposed to the contrastive learning setting which relies on negative samples, incorporating nearest neighbor bootstrapping in a self-distillation scheme can lead to a performance drop or even collapse. We scrutinize the reason for this unexpected behavior and provide a solution. We propose to adaptively bootstrap neighbors based on the estimated quality of the latent space. We report consistent improvements compared to the naive bootstrapping approach and the original baselines. Our approach leads to performance improvements for various self-distillation method/backbone combinations and standard downstream tasks. Our code is publicly available at https://github.com/tileb1/AdaSim. + + + + Generalized Differentiable RANSAC + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Generalized_Differentiable_RANSAC_ICCV_2023_paper.pdf + We propose ∇-RANSAC, a generalized differentiable RANSAC that allows learning the entire randomized robust estimation pipeline. The proposed approach enables the use of relaxation techniques for estimating the gradients in the sampling distribution, which are then propagated through a differentiable solver. The trainable quality function marginalizes over the scores from all the models estimated within ∇-RANSAC to guide the network in learning accurate and useful inlier probabilities or to train feature detection and matching networks. Our method directly maximizes the probability of drawing a good hypothesis, allowing us to learn better sampling distributions. We test ∇-RANSAC on various real-world scenarios of fundamental and essential matrix estimation and 3D point cloud registration, outdoors and indoors, with handcrafted and learning-based features. It is superior to the state of the art in terms of accuracy while running at a similar speed to its less accurate alternatives. The code and trained models are available at https://github.com/weitong8591/differentiable_ransac. + + + + ResQ: Residual Quantization for Video Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Abati_ResQ_Residual_Quantization_for_Video_Perception_ICCV_2023_paper.pdf + This paper accelerates video perception, such as semantic segmentation and human pose estimation, by leveraging cross-frame redundancies. Unlike existing approaches, which avoid redundant computations by warping past features using optical flow or by performing sparse convolutions on frame differences, we approach the problem from a new perspective: low-bit quantization.
We observe that residuals, as the difference in network activations between two neighboring frames, exhibit properties that make them highly quantizable. Based on this observation, we propose a novel quantization scheme for video networks coined as Residual Quantization. ResQ extends the standard, frame-by-frame, quantization scheme by incorporating temporal dependencies that lead to better performance in terms of accuracy vs. bit-width. Furthermore, we extend our model to dynamically adjust the bit-width proportional to the amount of changes in the video. We demonstrate the superiority of our model, against the standard quantization and existing efficient video perception models, using various architectures on semantic segmentation and human pose estimation benchmarks. + + + + MHCN: A Hyperbolic Neural Network Model for Multi-view Hierarchical Clustering + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_MHCN_A_Hyperbolic_Neural_Network_Model_for_Multi-view_Hierarchical_Clustering_ICCV_2023_paper.pdf + Multi-view hierarchical clustering (MCHC) plays a pivotal role in comprehending the structures within multi-view data, which hinges on the skillful interaction between hierarchical feature learning and comprehensive representation learning across multiple views. However, existing methods often overlook this interplay due to the simple heuristic agglomerative strategies or the decoupling of multi-view representation learning and hierarchical modeling, thus leading to insufficient representation learning. To address these issues, this paper proposes a novel Multi-view Hierarchical Clustering Network (MHCN) model by performing simultaneous multi-view learning and hierarchy modeling. Specifically, to uncover efficient tree-like structures among all views, we derive multiple hyperbolic autoencoders with latent space mapped onto the Poincare ball. Then, the corresponding hyperbolic embeddings are further regularized to achieve the multi-view representation learning principles for both view-common and view-private information, and to ensure hyperbolic uniformity with a well-balanced hierarchy for better interpretability. Extensive experiments on real-world and synthetic multi-view datasets have demonstrated that our method can achieve state-of-the-art hierarchical clustering performance, and empower the clustering results with good interpretability. + + + + FineRecon: Depth-aware Feed-forward Network for Detailed 3D Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Stier_FineRecon_Depth-aware_Feed-forward_Network_for_Detailed_3D_Reconstruction_ICCV_2023_paper.pdf + Recent works on 3D reconstruction from posed images have demonstrated that direct inference of scene-level 3D geometry without test-time optimization is feasible using deep neural networks, showing remarkable promise and high efficiency. However, the reconstructed geometry, typically represented as a 3D truncated signed distance function (TSDF), is often coarse without fine geometric details. To address this problem, we propose three effective solutions for improving the fidelity of inference-based 3D reconstructions. We first present a resolution-agnostic TSDF supervision strategy to provide the network with a more accurate learning signal during training, avoiding the pitfalls of TSDF interpolation seen in previous work. We then introduce a depth guidance strategy using multi-view depth estimates to enhance the scene representation and recover more accurate surfaces. 
Finally, we develop a novel architecture for the final layers of the network, conditioning the output TSDF prediction on high-resolution image features in addition to coarse voxel features, enabling sharper reconstruction of fine details. Our method, FineRecon, produces smooth and highly accurate reconstructions, showing significant improvements across multiple depth and 3D reconstruction metrics. + + + + Zenseact Open Dataset: A Large-Scale and Diverse Multimodal Dataset for Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Alibeigi_Zenseact_Open_Dataset_A_Large-Scale_and_Diverse_Multimodal_Dataset_for_ICCV_2023_paper.pdf + Existing datasets for autonomous driving (AD) often lack diversity and long-range capabilities, focusing instead on 360° perception and temporal reasoning. To address this gap, we introduce ZOD, a large-scale and diverse multimodal dataset collected over two years in various European countries, covering an area 9x that of existing datasets. ZOD boasts the highest range and resolution sensors among comparable datasets, coupled with detailed keyframe annotations for 2D and 3D objects (up to 245m), road instance/semantic segmentation, traffic sign recognition, and road classification. We believe that this unique combination will facilitate breakthroughs in long-range perception and multi-task learning. The dataset is composed of Frames, Sequences, and Drives, designed to encompass both data diversity and support for spatio-temporal learning, sensor fusion, localization, and mapping. Frames consist of 100k curated camera images with two seconds of other supporting sensor data, while the 1473 Sequences and 29 Drives include the entire sensor suite for 20 seconds and a few minutes, respectively. ZOD is the only AD dataset released under the permissive CC BY-SA 4.0 license, allowing for both research and commercial use. More information, and an extensive devkit, can be found at zod.zenseact.com. + + + + Weakly Supervised Referring Image Segmentation with Intra-Chunk and Inter-Chunk Consistency + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Weakly_Supervised_Referring_Image_Segmentation_with_Intra-Chunk_and_Inter-Chunk_Consistency_ICCV_2023_paper.pdf + Referring image segmentation (RIS) aims to localize the object in an image referred to by a natural language expression. Most previous studies learn RIS with a large-scale dataset containing segmentation labels, which are costly to obtain. We present a weakly supervised learning method for RIS that only uses readily available image-text pairs. We first train a visual-linguistic model for image-text matching and extract a visual saliency map through Grad-CAM to identify the image regions corresponding to each word. However, we found two major problems with Grad-CAM. First, it lacks consideration of critical semantic relationships between words. We tackle this problem by modeling the relationship between words through intra-chunk and inter-chunk consistency. Second, Grad-CAM identifies only small regions of the referred object, leading to low recall. Therefore, we refine the localization maps with self-attention in the Transformer and an unsupervised object shape prior. On three popular benchmarks (RefCOCO, RefCOCO+, G-Ref), our method significantly outperforms recent comparable techniques. We also show that our method is applicable to various levels of supervision and obtains better performance than recent methods.
+ + + + Parameterized Cost Volume for Stereo Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Zeng_Parameterized_Cost_Volume_for_Stereo_Matching_ICCV_2023_paper.pdf + Stereo matching becomes computationally challenging when dealing with a large disparity range. Prior methods mainly alleviate the computation through a dynamic cost volume that focuses on a local disparity space, but this requires many iterations to get close to the ground truth due to the lack of a global view. We find that the dynamic cost volume approximately encodes the disparity space as a single Gaussian distribution with a fixed and small variance at each iteration, which results in an inadequate global view over the disparity space and a small update step at every iteration. In this paper, we propose a parameterized cost volume to encode the entire disparity space using a multi-Gaussian distribution. The disparity distribution of each pixel is parameterized by weights, means, and variances. The means and variances are used to sample disparity candidates for cost computation, while the weights and means are used to calculate the disparity output. The above parameters are computed through a JS-divergence-based optimization, which is realized as a gradient descent update in a feed-forward differential module. Experiments show that our method speeds up the runtime of RAFT-Stereo by 4 to 15 times, achieving real-time performance and comparable accuracy. + + + + SAFE: Sensitivity-Aware Features for Out-of-Distribution Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wilson_SAFE_Sensitivity-Aware_Features_for_Out-of-Distribution_Object_Detection_ICCV_2023_paper.pdf + We address the problem of out-of-distribution (OOD) detection for the task of object detection. We show that residual convolutional layers with batch normalisation produce Sensitivity-Aware FEatures (SAFE) that are consistently powerful for distinguishing in-distribution from out-of-distribution detections. We extract SAFE vectors for every detected object, and train a multilayer perceptron on the surrogate task of distinguishing adversarially perturbed from clean in-distribution examples. This circumvents the need for realistic OOD training data, computationally expensive generative models, or retraining of the base object detector. SAFE outperforms the state-of-the-art OOD object detectors on multiple benchmarks by large margins, e.g. reducing the FPR95 by an absolute 30.6% from 48.3% to 17.7% on the OpenImages dataset. + + + + DREAM: Efficient Dataset Distillation by Representative Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_DREAM_Efficient_Dataset_Distillation_by_Representative_Matching_ICCV_2023_paper.pdf + Dataset distillation aims to synthesize small datasets with little information loss from the original large-scale ones in order to reduce storage and training costs. Recent state-of-the-art methods mainly constrain the sample synthesis process by matching synthetic images and the original ones in terms of gradients, embedding distributions, or training trajectories. Although there are various matching objectives, currently the strategy for selecting original images is limited to naive random sampling. We argue that random sampling overlooks the evenness of the selected sample distribution, which may result in noisy or biased matching targets. Besides, the sample diversity is also not constrained by random sampling.
These factors together lead to optimization instability in the distillation process and degrade training efficiency. Accordingly, we propose a novel matching strategy named Dataset distillation by REpresentAtive Matching (DREAM), where only representative original images are selected for matching. DREAM can be easily plugged into popular dataset distillation frameworks and reduces the distillation iterations by more than 8 times without a performance drop. Given sufficient training time, DREAM further provides significant improvements and achieves state-of-the-art performance. + + + + Focus on Your Target: A Dual Teacher-Student Framework for Domain-Adaptive Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Huo_Focus_on_Your_Target_A_Dual_Teacher-Student_Framework_for_Domain-Adaptive_ICCV_2023_paper.pdf + We study unsupervised domain adaptation (UDA) for semantic segmentation. Currently, a popular UDA framework lies in self-training, which endows the model with two-fold abilities: (i) learning reliable semantics from the labeled images in the source domain, and (ii) adapting to the target domain via generating pseudo labels on the unlabeled images. We find that, by decreasing/increasing the proportion of training samples from the target domain, the 'learning ability' is strengthened/weakened while the 'adapting ability' goes in the opposite direction, implying a conflict between these two abilities, especially for a single model. To alleviate this issue, we propose a novel dual teacher-student (DTS) framework and equip it with a bidirectional learning strategy. By increasing the proportion of target-domain data, the second teacher-student model learns to 'Focus on Your Target' while the first model is not affected. DTS is easily plugged into existing self-training approaches. In a standard UDA scenario (training on synthetic, labeled data and real, unlabeled data), DTS shows consistent gains over the baselines and sets new state-of-the-art results of 76.5% and 75.1% mIoU on GTAv-Cityscapes and SYNTHIA-Cityscapes, respectively. The implementation is available at https://github.com/xinyuehuo/DTS. + + + + Enhanced Meta Label Correction for Coping with Label Corruption + http://openaccess.thecvf.com//content/ICCV2023/papers/Taraday_Enhanced_Meta_Label_Correction_for_Coping_with_Label_Corruption_ICCV_2023_paper.pdf + Deep Neural Networks (DNNs) have revolutionized visual classification tasks over the last decade. The training phase of deep-learning-based algorithms, however, often requires a vast amount of reliably annotated data. While reliably collecting such an amount of labeled data is usually an exhaustive, expensive process, for many applications acquiring massive datasets with imperfect annotations is straightforward. For instance, crawling search engines and online websites can generate a large amount of noisy labeled data. Hence, solving the problem of learning with noisy labels (LNL) is of paramount importance. Traditional LNL methods have successfully handled datasets with artificially injected noise, but they still fall short of adequately handling real-world noise. With the increasing use of meta-learning in diverse fields of machine learning, researchers have tried to leverage auxiliary small clean datasets to meta-correct the training labels. Nonetheless, existing meta-label correction approaches do not fully exploit their potential. In this study, we propose EMLC, an enhanced meta-label correction approach for the LNL problem.
We re-examine the meta-learning process and introduce faster and more accurate meta-gradient derivations. We propose a novel teacher architecture tailored explicitly for the LNL problem, equipped with novel training objectives. EMLC outperforms prior approaches and achieves state-of-the-art results in all standard benchmarks. Notably, EMLC enhances the previous art on the noisy real-world dataset Clothing1M by 0.87%. Our publicly available code can be found at the following link: https://github.com/iccv23anonymous/Enhanced-Meta-Label-Correction + + + + Will Large-scale Generative Models Corrupt Future Datasets? + http://openaccess.thecvf.com//content/ICCV2023/papers/Hataya_Will_Large-scale_Generative_Models_Corrupt_Future_Datasets_ICCV_2023_paper.pdf + Recently proposed large-scale text-to-image generative models such as DALLE 2, Midjourney, and StableDiffusion can generate high-quality and realistic images from users' prompts. Not limited to the research community, ordinary Internet users enjoy these generative models, and consequently, a tremendous amount of generated images have been shared on the Internet. Meanwhile, today's success of deep learning in the computer vision field owes a lot to images collected from the Internet. These trends lead us to a research question: "will such generated images impact the quality of future datasets and the performance of computer vision models positively or negatively?" This paper empirically answers this question by simulating contamination. Namely, we generate ImageNet-scale and COCO-scale datasets using a state-of-the-art generative model and evaluate models trained with "contaminated" datasets on various tasks, including image classification and image generation. Throughout experiments, we conclude that generated images negatively affect downstream performance, while the significance depends on tasks and the amount of generated images. The generated datasets and the codes for experiments will be publicly released for future research. Generated datasets and source codes are available from https://github.com/moskomule/dataset-contamination. + + + + SHACIRA: Scalable HAsh-grid Compression for Implicit Neural Representations + http://openaccess.thecvf.com//content/ICCV2023/papers/Girish_SHACIRA_Scalable_HAsh-grid_Compression_for_Implicit_Neural_Representations_ICCV_2023_paper.pdf + Implicit Neural Representations (INR) or neural fields have emerged as a popular framework to encode multimedia signals such as images and radiance fields while retaining high-quality. Recently, learnable feature grids such as Instant-NGP have allowed significant speed-up in the training as well as the sampling of INRs by replacing a large neural network with a multi-resolution look-up table of feature vectors and a much smaller neural network. However, these feature grids come at the expense of large memory consumption which can be a bottleneck for storage and streaming applications. In this work, we propose SHACIRA, a simple yet effective task-agnostic framework for compressing such feature grids with no additional post-hoc pruning/quantization stages. We reparameterize feature grids with quantized latent weights and apply entropy regularization in the latent space to achieve high levels of compression across various domains. Quantitative and qualitative results on diverse datasets consisting of images, videos, and radiance fields, show that our approach outperforms existing INR approaches without the need for any large datasets or domain-specific heuristics. 
Our project page is available at https://shacira.github.io + + + + A Low-Shot Object Counting Network With Iterative Prototype Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Dukic_A_Low-Shot_Object_Counting_Network_With_Iterative_Prototype_Adaptation_ICCV_2023_paper.pdf + We consider low-shot counting of arbitrary semantic categories in the image using only few annotated exemplars (few-shot) or no exemplars (no-shot). The standard few-shot pipeline follows extraction of appearance queries from exemplars and matching them with image features to infer the object counts. Existing methods extract queries by feature pooling which neglects the shape information (e.g., size and aspect) and leads to a reduced object localization accuracy and count estimates. We propose a Low-shot Object Counting network with iterative prototype Adaptation (LOCA). Our main contribution is the new object prototype extraction module, which iteratively fuses the exemplar shape and appearance information with image features. The module is easily adapted to zero-shot scenarios, enabling LOCA to cover the entire spectrum of low-shot counting problems. LOCA outperforms all recent state-of-the-art methods on FSC147 benchmark by 20-30% in RMSE on one-shot and few-shot and achieves state-of-the-art on zero-shot scenarios, while demonstrating better generalization capabilities. The code and models are available here: https://github.com/djukicn/loca. + + + + MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Sadoughi_MEGA_Multimodal_Alignment_Aggregation_and_Distillation_For_Cinematic_Video_Segmentation_ICCV_2023_paper.pdf + Previous research has studied the task of segmenting cinematic videos into scenes and into narrative acts. However, these studies have overlooked the essential task of multimodal alignment and fusion for effectively and efficiently processing long-form videos (>60min). In this paper, we introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation. MEGA tackles the challenge by leveraging multiple media modalities. The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding. To maintain temporal synchronization while reducing computation, we further introduce an enhanced bottleneck fusion layer which uses temporal alignment. Additionally, MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on video shots. Our experimental results show that MEGA outperforms state-of-the-art methods on MovieNet dataset for scene segmentation (with an Average Precision improvement of +1.19%) and on TRIPOD dataset for act segmentation (with a Total Agreement improvement of +5.51%). + + + + DiffRate : Differentiable Compression Rate for Efficient Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_DiffRate__Differentiable_Compression_Rate_for_Efficient_Vision_Transformers_ICCV_2023_paper.pdf + Token compression aims to speed up large-scale vision transformers (e.g. ViTs) by pruning (dropping) or merging tokens. It is an important but challenging task. Although recent advanced approaches achieved great success, they need to carefully handcraft a compression rate (i.e. number of tokens to remove), which is tedious and leads to sub-optimal performance. 
To tackle this problem, we propose Differentiable Compression Rate (DiffRate), a novel token compression method that has several appealing properties that prior art does not have. First, DiffRate enables propagating the loss function's gradient onto the compression ratio, which is considered a non-differentiable hyperparameter in previous work. In this way, different layers can automatically learn different compression rates without extra overhead. Second, token pruning and merging can be naturally performed simultaneously in DiffRate, whereas they were isolated in previous works. Third, extensive experiments demonstrate that DiffRate achieves state-of-the-art performance. For example, by applying the learned layer-wise compression rates to an off-the-shelf ViT-H (MAE) model, we achieve a 40% FLOPs reduction and a 1.5x throughput improvement, with a minor accuracy drop of 0.16% on ImageNet without fine-tuning, even outperforming previous methods with fine-tuning. Code and models are available at https://github.com/OpenGVLab/DiffRate. + + + + Multi-Modal Continual Test-Time Adaptation for 3D Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_Multi-Modal_Continual_Test-Time_Adaptation_for_3D_Semantic_Segmentation_ICCV_2023_paper.pdf + Continual Test-Time Adaptation (CTTA) generalizes conventional Test-Time Adaptation (TTA) by assuming that the target domain is dynamic over time rather than stationary. In this paper, we explore Multi-Modal Continual Test-Time Adaptation (MM-CTTA) as a new extension of CTTA for 3D semantic segmentation. The key to MM-CTTA is to adaptively attend to the reliable modality while avoiding catastrophic forgetting during continual domain shifts, which is beyond the capability of previous TTA or CTTA methods. To fill this gap, we propose an MM-CTTA method called Continual Cross-Modal Adaptive Clustering (CoMAC) that addresses this task from two perspectives. On the one hand, we propose an adaptive dual-stage mechanism to generate reliable cross-modal predictions by attending to the reliable modality based on the class-wise feature-centroid distance in the latent space. On the other hand, to perform test-time adaptation without catastrophic forgetting, we design class-wise momentum queues that capture confident target features for adaptation while stochastically restoring pseudo-source features to revisit source knowledge. We further introduce two new benchmarks to facilitate the exploration of MM-CTTA in the future. Our experimental results show that our method achieves state-of-the-art performance on both benchmarks. Visit our project website at https://sites.google.com/view/mmcotta. + + + + UMIFormer: Mining the Correlations between Similar Tokens for Multi-View 3D Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_UMIFormer_Mining_the_Correlations_between_Similar_Tokens_for_Multi-View_3D_ICCV_2023_paper.pdf + In recent years, many video tasks have achieved breakthroughs by utilizing the vision transformer and establishing spatial-temporal decoupling for feature extraction. Although multi-view 3D reconstruction also takes multiple images as input, it cannot immediately inherit this success due to the completely ambiguous associations between unstructured views. There is no usable prior relationship analogous to the temporal-coherence property of a video. To solve this problem, we propose a novel transformer network for Unstructured Multiple Images (UMIFormer).
It exploits transformer blocks for decoupled intra-view encoding and designed blocks for token rectification that mine the correlation between similar tokens from different views to achieve decoupled inter-view encoding. Afterward, all tokens acquired from various branches are compressed into a fixed-size compact representation while preserving rich information for reconstruction by leveraging the similarities between tokens. We empirically demonstrate on ShapeNet and confirm that our decoupled learning method is adaptable for unstructured multiple images. Meanwhile, the experiments also verify our model outperforms existing SOTA methods by a large margin. Code will be available at https://github.com/GaryZhu1996/UMIFormer. + + + + Improved Knowledge Transfer for Semi-Supervised Domain Adaptation via Trico Training Strategy + http://openaccess.thecvf.com//content/ICCV2023/papers/Ngo_Improved_Knowledge_Transfer_for_Semi-Supervised_Domain_Adaptation_via_Trico_Training_ICCV_2023_paper.pdf + The motivation of the semi-supervised domain adaptation (SSDA) is to train a model by leveraging knowledge acquired from the plentiful labeled source combined with extremely scarce labeled target data to achieve the lowest error on the unlabeled target data at the testing time. However, due to inter-domain and intra-domain discrepancies, the improvement of classification accuracy is limited. To solve these, we propose the Trico-training method that utilizes a multilayer perceptron (MLP) classifier and two graph convolutional network (GCN) classifiers called inter-view GCN and intra-view GCN classifiers. The first co-training strategy exploits a correlation between MLP and inter-view GCN classifiers to minimize the inter-domain discrepancy, in which the inter-view GCN classifier provides its pseudo labels to teach the MLP classifier, which encourages class representation alignment across domains. In contrast, the MLP classifier gives feedback to the inter-view GCN classifier by using a new concept, 'pseudo-edge', for neighbor's feature aggregation. Doing this increases the data structure mining ability of the inter-view GCN classifier; thus, the quality of generated pseudo labels is improved. The second co-training strategy between MLP and intra-view GCN is conducted in a similar way to reduce the intra-domain discrepancy by enhancing the correlation between labeled and unlabeled target data. Due to an imbalance in classification accuracy between inter-view and intra-view GCN classifiers, we propose the third co-training strategy that encourages them to cooperate to address this problem. We verify the effectiveness of the proposed method on three standard SSDA benchmark datasets: Office-31, Office-Home, and DomainNet. The extended experimental results show that our method surpasses the prior state-of-the-art approaches in SSDA. + + + + InterFormer: Real-time Interactive Image Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_InterFormer_Real-time_Interactive_Image_Segmentation_ICCV_2023_paper.pdf + Interactive image segmentation enables annotators to efficiently perform pixel-level annotation for segmentation tasks. However, the existing interactive segmentation pipeline suffers from inefficient computations of interactive models because of the following two issues. First, annotators' later click is based on models' feedback of annotators' former click. This serial interaction is unable to utilize model's parallelism capabilities. 
Second, in each interaction step, the model handles the invariant image along with the sparse variable clicks, resulting in a process that is highly repetitive and redundant. For efficient computation, we propose a method named InterFormer that follows a new pipeline to address these issues. InterFormer extracts and preprocesses the computationally time-consuming part, i.e., image processing, from the existing process. Specifically, InterFormer employs a large vision transformer (ViT) on high-performance devices to preprocess images in parallel, and then uses a lightweight module called interactive multi-head self-attention (I-MSA) for interactive segmentation. Furthermore, the I-MSA module's deployment on low-power devices extends the practical application of interactive segmentation. The I-MSA module utilizes the preprocessed features to efficiently respond to annotator inputs in real time. The experiments on several datasets demonstrate the effectiveness of InterFormer, which outperforms previous interactive segmentation models in terms of computational efficiency and segmentation quality, achieving real-time high-quality interactive segmentation on CPU-only devices. The code is available at https://github.com/YouHuang67/InterFormer. + + + + Online Prototype Learning for Online Continual Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Online_Prototype_Learning_for_Online_Continual_Learning_ICCV_2023_paper.pdf Online continual learning (CL) studies the problem of learning continuously from a single-pass data stream while adapting to new data and mitigating catastrophic forgetting. Recently, by storing a small subset of old data, replay-based methods have shown promising performance. Unlike previous methods that focus on sample storage or knowledge distillation against catastrophic forgetting, this paper aims to understand why online learning models fail to generalize well from a new perspective of shortcut learning. We identify shortcut learning as the key limiting factor for online CL, where the learned features may be biased, not generalizable to new tasks, and may have an adverse impact on knowledge distillation. To tackle this issue, we present the online prototype learning (OnPro) framework for online CL. First, we propose online prototype equilibrium to learn representative features against shortcut learning and discriminative features to avoid class confusion, ultimately achieving an equilibrium status that separates all seen classes well while learning new classes. Second, with the feedback of online prototypes, we devise a novel adaptive prototypical feedback mechanism to sense the classes that are easily misclassified and then enhance their boundaries. Extensive experimental results on widely-used benchmark datasets demonstrate the superior performance of OnPro over the state-of-the-art baseline methods. Source code is available at https://github.com/weilllllls/OnPro. + + + + Robust e-NeRF: NeRF from Sparse & Noisy Events under Non-Uniform Motion http://openaccess.thecvf.com//content/ICCV2023/papers/Low_Robust_e-NeRF_NeRF_from_Sparse__Noisy_Events_under_Non-Uniform_ICCV_2023_paper.pdf Event cameras offer many advantages over standard cameras due to their distinctive principle of operation: low power, low latency, high temporal resolution and high dynamic range.
Nonetheless, the success of many downstream visual applications also hinges on an efficient and effective scene representation, where Neural Radiance Field (NeRF) is seen as the leading candidate. Such promise and potential of event cameras and NeRF inspired recent works to investigate on the reconstruction of NeRF from moving event cameras. However, these works are mainly limited in terms of the dependence on dense and low-noise event streams, as well as generalization to arbitrary contrast threshold values and camera speed profiles. In this work, we propose Robust e-NeRF, a novel method to directly and robustly reconstruct NeRFs from moving event cameras under various real-world conditions, especially from sparse and noisy events generated under non-uniform motion. It consists of two key components: a realistic event generation model that accounts for various intrinsic parameters (e.g. time-independent, asymmetric threshold and refractory period) and non-idealities (e.g. pixel-to-pixel threshold variation), as well as a complementary pair of normalized reconstruction losses that can effectively generalize to arbitrary speed profiles and intrinsic parameter values without such prior knowledge. Experiments on real and novel realistically simulated sequences verify our effectiveness. Our code, synthetic dataset and improved event simulator are public. + + + + ActorsNeRF: Animatable Few-shot Human Rendering with Generalizable NeRFs + http://openaccess.thecvf.com//content/ICCV2023/papers/Mu_ActorsNeRF_Animatable_Few-shot_Human_Rendering_with_Generalizable_NeRFs_ICCV_2023_paper.pdf + While NeRF-based human representations have shown impressive novel view synthesis results, most methods still rely on a large number of images / views for training. In this work, we propose a novel animatable NeRF called ActorsNeRF. It is first pre-trained on diverse human subjects, and then adapted with few-shot monocular video frames for a new actor with unseen poses. Building on previous generalizable NeRFs with parameter sharing using a ConvNet encoder, ActorsNeRF further adopts two human priors to capture the large human appearance, shape, and pose variations. Specifically, in the encoded feature space, we will first align different human subjects in a category-level canonical space, and then align the same human from different frames in an instance-level canonical space for rendering. We quantitatively and qualitatively demonstrate that ActorsNeRF significantly outperforms the existing state-of-the-art on few-shot generalization to new people and poses on multiple datasets. + + + + Multiple Planar Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Multiple_Planar_Object_Tracking_ICCV_2023_paper.pdf + Tracking both location and pose of multiple planar objects (MPOT) is of great significance to numerous real-world applications. The greater degree-of-freedom of planar objects compared with common objects makes MPOT far more challenging than well-studied object tracking, especially when occlusion occurs. To address this challenging task, we are inspired by amodal perception that humans jointly track visible and invisible parts of the target, and propose a tracking framework that unifies appearance perception and occlusion reasoning. Specifically, we present a dual-branch network to track the visible part of planar objects, including vertexes and mask. 
Then, we develop an occlusion area localization strategy to infer the invisible part, i.e., the occluded region, followed by a two-stream attention network finally refining the prediction. To alleviate the lack of data in this field, we build the first large-scale benchmark dataset, namely MPOT-3K. It consists of 3,717 planar objects from 356 videos and contains 148,896 frames together with 687,417 annotations. The collected planar objects have 9 motion patterns and the videos are shot in 6 types of indoor and outdoor scenes. Extensive experiments demonstrate the superiority of our proposed method on the newly developed MPOT-3K as well as other two popular single planar object tracking datasets. The code and MPOT-3K dataset are released on https://zzcheng.top/MPOT. + + + + Label-Guided Knowledge Distillation for Continual Semantic Segmentation on 2D Images and 3D Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Label-Guided_Knowledge_Distillation_for_Continual_Semantic_Segmentation_on_2D_Images_ICCV_2023_paper.pdf + Continual semantic segmentation (CSS) aims to extend an existing model to tackle unseen tasks while retaining its old knowledge. Naively fine-tuning the old model on new data leads to catastrophic forgetting. A common solution is knowledge distillation (KD), where the output distribution of the new model is regularized to be similar to that of the old model. However, in CSS, this is challenging because of the background shift issue. Existing KD-based CSS methods continue to suffer from confusion between the background and novel classes since they fail to establish a reliable class correspondence for distillation. To address this issue, we propose a new label-guided knowledge distillation (LGKD) loss, where the old model output is expanded and transplanted (with the guidance of the ground truth label) to form a semantically appropriate class correspondence with the new model output. Consequently, the useful knowledge from the old model can be effectively distilled into the new model without causing confusion. We conduct extensive experiments on two prevailing CSS benchmarks, Pascal-VOC and ADE20K, where our LGKD significantly boosts the performance of three competing methods, especially on novel mIoU by up to +76%, setting new state-of-the-art. Finally, to further demonstrate its generalization ability, we introduce the first CSS benchmark for 3D point cloud based on ScanNet, along with several re-implemented baselines for comparison. Experiments show that LGKD is versatile in both 2D and 3D modalities without requiring ad hoc design. Codes are available at https://github.com/Ze-Yang/LGKD. + + + + PRANC: Pseudo RAndom Networks for Compacting Deep Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Nooralinejad_PRANC_Pseudo_RAndom_Networks_for_Compacting_Deep_Models_ICCV_2023_paper.pdf + We demonstrate that a deep model can be reparametrized as a linear combination of several randomly initialized and frozen deep models in the weight space. During training, we seek local minima that reside within the subspace spanned by these random models (i.e., `basis' networks). Our framework, PRANC, enables significant compaction of a deep model. The model can be reconstructed using a single scalar `seed,' employed to generate the pseudo-random `basis' networks, together with the learned linear mixture coefficients. 
In practical applications, PRANC addresses the challenge of efficiently storing and communicating deep models, a common bottleneck in several scenarios, including multi-agent learning, continual learners, federated systems, and edge devices, among others. In this study, we employ PRANC to condense image classification models and compress images by compacting their associated implicit neural networks. PRANC outperforms baselines by a large margin on image classification when compressing a deep model almost 100 times. Moreover, we show that PRANC enables memory-efficient inference by generating layer-wise weights on the fly. The source code of PRANC is here: https://github.com/UCDvision/PRANC + + + + Clutter Detection and Removal in 3D Scenes with View-Consistent Inpainting http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Clutter_Detection_and_Removal_in_3D_Scenes_with_View-Consistent_Inpainting_ICCV_2023_paper.pdf Removing clutter from scenes is essential in many applications, ranging from privacy-concerned content filtering to data augmentation. In this work, we present an automatic system that removes clutter from 3D scenes and inpaints with coherent geometry and texture. We propose techniques for its two key components: 3D segmentation based on shared properties and 3D inpainting, both of which are important problems. We define 3D scene clutter as frequently-moving objects (e.g., clothes or chairs that are typically moved within a few days), a notion that is not well captured by the commonly-studied object categories in computer vision. To tackle the lack of well-defined clutter annotations, we group noisy fine-grained labels, leverage virtual rendering, and impose an instance-level area-sensitive loss. Once clutter is removed, we inpaint geometry and texture in the resulting holes by merging inpainted RGB-D images. This requires novel voting and pruning strategies that guarantee multi-view consistency across individually inpainted images for mesh reconstruction. Experiments on the ScanNet and Matterport3D datasets show that our method outperforms baselines for clutter segmentation and 3D inpainting, both visually and quantitatively. Project page: https://weify627.github.io/clutter/. + + + + Hierarchical Spatio-Temporal Representation Learning for Gait Recognition http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Hierarchical_Spatio-Temporal_Representation_Learning_for_Gait_Recognition_ICCV_2023_paper.pdf Gait recognition is a biometric technique that identifies individuals by their unique walking styles; it is suitable for unconstrained environments and has a wide range of applications. While current methods focus on exploiting body part-based representations, they often neglect the hierarchical dependencies between local motion patterns. In this paper, we propose a hierarchical spatio-temporal representation learning (HSTL) framework for extracting gait features from coarse to fine. Our framework starts with a hierarchical clustering analysis to recover multi-level body structures from the whole body to local details. Next, an adaptive region-based motion extractor (ARME) is designed to learn region-independent motion features. The proposed HSTL then stacks multiple ARMEs in a top-down manner, with each ARME corresponding to a specific partition level of the hierarchy.
An adaptive spatio-temporal pooling (ASTP) module is used to capture gait features at different levels of detail to perform hierarchical feature mapping. Finally, a frame-level temporal aggregation (FTA) module is employed to reduce redundant information in gait sequences through multi-scale temporal downsampling. Extensive experiments on CASIA-B, OUMVLP, GREW, and Gait3D datasets demonstrate that our method outperforms the state-of-the-art while maintaining a reasonable balance between model accuracy and complexity. Code is available at: https://github.com/gudaochangsheng/HSTL. + + + + Weakly Supervised Learning of Semantic Correspondence through Cascaded Online Correspondence Refinement + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Weakly_Supervised_Learning_of_Semantic_Correspondence_through_Cascaded_Online_Correspondence_ICCV_2023_paper.pdf + In this paper, we develop a weakly supervised learning algorithm to learn robust semantic correspondences from large-scale datasets with only image-level labels. Following the spirit of multiple instance learning (MIL), we decompose the weakly supervised correspondence learning problem into three stages: image-level matching, region-level matching, and pixel-level matching. We propose a novel cascaded online correspondence refinement algorithm to integrate MIL and the correspondence filtering and refinement procedure into a single deep network and train this network end-to-end with only image-level supervision, i.e., without point-to-point matching information. During the correspondence learning process, pixel-to-pixel matching pairs inferred from weak supervision are propagated, filtered, and enhanced through masked correspondence voting and calibration. Besides, we design a correspondence consistency check algorithm to select images with discriminative key points to generate pseudo-labels for classical matching algorithms. Finally, we filter out about 110,000 images from the ImageNet ILSVRC training set to formulate a new dataset, called SC-ImageNet. Experiments on several popular benchmarks indicate that pre-training on SC-ImageNet can improve the performance of state-of-the-art algorithms efficiently. Our project is available on https://github.com/21210240056/SC-ImageNet. + + + + NaviNeRF: NeRF-based 3D Representation Disentanglement by Latent Semantic Navigation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xie_NaviNeRF_NeRF-based_3D_Representation_Disentanglement_by_Latent_Semantic_Navigation_ICCV_2023_paper.pdf + 3D representation disentanglement aims to identify, decompose, and manipulate the underlying explanatory factors of 3D data, which helps AI fundamentally understand our 3D world. This task is currently under-explored and poses great challenges: (i) the 3D representations are complex and in general contains much more information than 2D image; (ii) many 3D representations are not well suited for gradient-based optimization, let alone disentanglement. To address these challenges, we use NeRF as a differentiable 3D representation, and introduce a self-supervised Navigation to identify interpretable semantic directions in the latent space. To our best knowledge, this novel method, dubbed NaviNeRF, is the first work to achieve fine-grained 3D disentanglement without any priors or supervision. Specifically, NaviNeRF is built upon the generative NeRF pipeline, and equipped with an Outer Navigation Branch and an Inner Refinement Branch. 
The two branches are complementary: the outer navigation branch identifies global-view semantic directions, while the inner refinement branch is dedicated to fine-grained attributes. A synergistic loss is further devised to coordinate the two branches. Extensive experiments demonstrate that NaviNeRF has superior fine-grained 3D disentanglement ability compared with previous 3D-aware models. Its performance is also comparable to editing-oriented models relying on semantic or geometry priors. + + + + Image-Free Classifier Injection for Zero-Shot Classification http://openaccess.thecvf.com//content/ICCV2023/papers/Christensen_Image-Free_Classifier_Injection_for_Zero-Shot_Classification_ICCV_2023_paper.pdf Zero-shot learning models achieve remarkable results on image classification for samples from classes that were not seen during training. However, such models must be trained from scratch with specialised methods: therefore, access to a training dataset is required when the need for zero-shot classification arises. In this paper, we aim to equip pre-trained models with zero-shot classification capabilities without the use of image data. We achieve this with our proposed Image-free Classifier Injection with Semantics (ICIS) that injects classifiers for new, unseen classes into pre-trained classification models in a post-hoc fashion without relying on image data. Instead, the existing classifier weights and simple class-wise descriptors, such as class names or attributes, are used. ICIS has two encoder-decoder networks that learn to reconstruct classifier weights from descriptors (and vice versa), exploiting (cross-)reconstruction and cosine losses to regularise the decoding process. Notably, ICIS can be cheaply trained and applied directly on top of pre-trained classification models. Experiments on benchmark ZSL datasets show that ICIS produces unseen classifier weights that achieve strong (generalised) zero-shot classification performance. Code is available at https://github.com/ExplainableML/ImageFreeZSL. + + + + Semantically Structured Image Compression via Irregular Group-Based Decoupling http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_Semantically_Structured_Image_Compression_via_Irregular_Group-Based_Decoupling_ICCV_2023_paper.pdf Image compression techniques typically focus on compressing rectangular images for human consumption, which results in transmitting redundant content to downstream applications. To overcome this limitation, some previous works propose to semantically structure the bitstream, which can meet specific application requirements through selective transmission and reconstruction. Nevertheless, they divide the input image into multiple rectangular regions according to semantics and fail to prevent information interaction among them, which wastes bitrate and distorts the reconstruction of region boundaries. In this paper, we propose to decouple an image into multiple groups with irregular shapes based on a customized group mask and compress them independently. Our group mask describes the image at a finer granularity, enabling significant bitrate saving by reducing the transmission of redundant content. Moreover, to ensure the fidelity of selective reconstruction, this paper proposes the concept of a group-independent transform that maintains the independence among distinct groups, and we instantiate it with the proposed Group-Independent Swin-Block (GI Swin-Block).
Experimental results demonstrate that our framework structures the bitstream with negligible cost, and exhibits superior performance in both visual quality and intelligent task support. + + + + Self-Organizing Pathway Expansion for Non-Exemplar Class-Incremental Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_Self-Organizing_Pathway_Expansion_for_Non-Exemplar_Class-Incremental_Learning_ICCV_2023_paper.pdf Non-exemplar class-incremental learning aims to recognize both the old and new classes without access to old class samples. The conflict between old and new class optimization is exacerbated since the shared neural pathways can only be differentiated by the incremental samples. To address this problem, we propose a novel self-organizing pathway expansion scheme. Our scheme consists of a class-specific pathway organization strategy that reduces the coupling of optimization pathways among different classes to enhance the independence of the feature representation, and a pathway-guided feature optimization mechanism to mitigate the update interference between the old and new classes. Extensive experiments on four datasets demonstrate significant performance gains, outperforming the state-of-the-art methods by margins of 1%, 3%, 2% and 2%, respectively. + + + + Preserving Tumor Volumes for Unsupervised Medical Image Registration http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Preserving_Tumor_Volumes_for_Unsupervised_Medical_Image_Registration_ICCV_2023_paper.pdf Medical image registration is a critical task that estimates the spatial correspondence between pairs of images. However, current traditional and learning-based methods rely on similarity measures to generate a deformation field, which often results in disproportionate volume changes in dissimilar regions, especially in tumor regions. These changes can significantly alter the tumor size and underlying anatomy, which limits the practical use of image registration in clinical diagnosis. To address this issue, we formulate image registration with tumors as a constrained problem that preserves tumor volumes while maximizing image similarity in other normal regions. Our proposed framework involves a two-stage process. In the first stage, we use similarity-based registration to identify potential tumor regions by their volume change, generating a soft tumor mask accordingly. In the second stage, we propose a volume-preserving registration with a novel adaptive volume-preserving loss that penalizes the change in size adaptively based on the masks calculated from the previous stage. Our approach balances image similarity and volume preservation in different regions, i.e., normal and tumor regions, by using soft tumor masks to adjust the imposition of the volume-preserving loss on each one. This ensures that the tumor volume is preserved during the registration process. We have evaluated our framework on various datasets and network architectures, demonstrating that our method successfully preserves the tumor volume while achieving registration results comparable to state-of-the-art methods. Our code is at: https://dddraxxx.github.io/Volume-Preserving-Registration/.
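As a rough illustration of the adaptive volume-preserving idea described in the entry above, the following minimal PyTorch sketch penalizes per-voxel volume change, measured through the Jacobian determinant of a dense displacement field, weighted by the soft tumor mask. The tensor names disp and soft_mask, the finite-difference Jacobian, and the log-determinant penalty are illustrative assumptions, not the paper's exact formulation.

    import torch

    def jacobian_det(disp):
        # disp: (B, 3, D, H, W) displacement field in voxel units.
        # Finite differences approximate the Jacobian of phi(x) = x + disp(x).
        grads = torch.gradient(disp, dim=(2, 3, 4))      # 3 tensors, each (B, 3, D, H, W)
        J = torch.stack(grads, dim=-1)                   # (B, 3, D, H, W, 3): d disp_i / d x_j
        J = J.permute(0, 2, 3, 4, 1, 5)                  # (B, D, H, W, 3, 3)
        J = J + torch.eye(3, device=disp.device)         # add identity for the x term of phi
        return torch.linalg.det(J)                       # (B, D, H, W), local volume ratio

    def adaptive_volume_preserving_loss(disp, soft_mask, eps=1e-6):
        # soft_mask: (B, 1, D, H, W) in [0, 1]; larger where volume should be preserved (tumor).
        log_ratio = torch.log(jacobian_det(disp).clamp_min(eps)).abs()
        w = soft_mask.squeeze(1)
        return (w * log_ratio).sum() / w.sum().clamp_min(eps)

In the full method this term would presumably be added, with some weighting factor, to the similarity loss that drives registration in the normal regions.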
+ + + + Unsupervised Accuracy Estimation of Deep Visual Models using Domain-Adaptive Adversarial Perturbation without Source Samples + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Unsupervised_Accuracy_Estimation_of_Deep_Visual_Models_using_Domain-Adaptive_Adversarial_ICCV_2023_paper.pdf + Deploying deep visual models can lead to performance drops due to the discrepancies between source and target distributions. Several approaches leverage labeled source data to estimate target domain accuracy, but accessing labeled source data is often prohibitively difficult due to data confidentiality or resource limitations on serving devices. Our work proposes a new framework to estimate model accuracy on unlabeled target data without access to source data. We investigate the feasibility of using pseudo-labels for accuracy estimation and evolve this idea into adopting recent advances in source-free domain adaptation algorithms. Our approach measures the disagreement rate between the source hypothesis and the target pseudo-labeling function, adapted from the source hypothesis. We mitigate the impact of erroneous pseudo-labels that may arise due to a high ideal joint hypothesis risk by employing adaptive adversarial perturbation on the input of the target model. Our proposed source-free framework effectively addresses the challenging distribution shift scenarios and outperforms existing methods requiring source data and labels for training. + + + + MATE: Masked Autoencoders are Online 3D Test-Time Learners + http://openaccess.thecvf.com//content/ICCV2023/papers/Mirza_MATE_Masked_Autoencoders_are_Online_3D_Test-Time_Learners_ICCV_2023_paper.pdf + Our MATE is the first Test-Time-Training (TTT) method designed for 3D data, which makes deep networks trained for point cloud classification robust to distribution shifts occurring in test data. Like existing TTT methods from the 2D image domain, MATE also leverages test data for adaptation. Its test-time objective is that of a Masked Autoencoder: a large portion of each test point cloud is removed before it is fed to the network, tasked with reconstructing the full point cloud. Once the network is updated, it is used to classify the point cloud. We test MATE on several 3D object classification datasets and show that it significantly improves robustness of deep networks to several types of corruptions commonly occurring in 3D point clouds. We show that MATE is very efficient in terms of the fraction of points it needs for the adaptation. It can effectively adapt given as few as 5% of tokens of each test sample, making it extremely lightweight. Our experiments show that MATE also achieves competitive performance by adapting sparsely on the test data, which further reduces its computational overhead, making it ideal for real-time applications. + + + + Two Birds, One Stone: A Unified Framework for Joint Learning of Image and Video Style Transfers + http://openaccess.thecvf.com//content/ICCV2023/papers/Gu_Two_Birds_One_Stone_A_Unified_Framework_for_Joint_Learning_ICCV_2023_paper.pdf + Current arbitrary style transfer models are limited to either image or video domains. In order to achieve satisfying image and video style transfers, two different models are inevitably required with separate training processes on image and video domains, respectively. In this paper, we show that this can be precluded by introducing UniST, a Unified Style Transfer framework for both images and videos. 
At the core of UniST is a domain interaction transformer (DIT), which first explores context information within each specific domain and then allows the contextualized domain information to interact for joint learning. In particular, DIT enables the exploration of temporal information from videos for the image style transfer task and meanwhile allows rich appearance texture from images for video style transfer, thus leading to mutual benefits. Considering the heavy computation of traditional multi-head self-attention, we present a simple yet effective axial multi-head self-attention (AMSA) for DIT, which improves computational efficiency while maintaining style transfer performance. To verify the effectiveness of UniST, we conduct extensive experiments on both image and video style transfer tasks and show that UniST performs favorably against state-of-the-art approaches on both tasks. Code is available at https://github.com/NevSNev/UniST. + + + + Task-Oriented Multi-Modal Mutual Leaning for Vision-Language Models http://openaccess.thecvf.com//content/ICCV2023/papers/Long_Task-Oriented_Multi-Modal_Mutual_Leaning_for_Vision-Language_Models_ICCV_2023_paper.pdf Prompt learning has become one of the most efficient paradigms for adapting large pre-trained vision-language models to downstream tasks. Current state-of-the-art methods, like CoOp and ProDA, tend to adopt soft prompts to learn an appropriate prompt for each specific task. Recent CoCoOp further boosts the base-to-new generalization performance via an image-conditional prompt. However, it directly fuses identical image semantics into the prompts of different labels and significantly weakens the discrimination among different classes, as shown in our experiments. Motivated by this observation, we first propose a class-aware text prompt (CTP) to enrich generated prompts with label-related image information. Unlike CoCoOp, CTP can effectively involve image semantics and avoid introducing extra ambiguities into different prompts. On the other hand, instead of reserving the complete image representations, we propose text-guided feature tuning (TFT) to make the image branch attend to class-related representations. A contrastive loss is employed to align such augmented text and image representations on downstream tasks. In this way, the image-to-text CTP and text-to-image TFT can be mutually promoted to enhance the adaptation of VLMs to downstream tasks. Extensive experiments demonstrate that our method outperforms the existing methods by a significant margin. In particular, compared to CoCoOp, we achieve an average improvement of 4.03% on new classes and 3.19% on harmonic mean over eleven classification benchmarks.
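The contrastive alignment mentioned in the entry above can be pictured with a standard symmetric (CLIP-style) objective. The sketch below is only a generic stand-in, assuming L2-normalized image and text features where row i of each batch forms a matched pair; it does not reproduce the paper's exact CTP/TFT loss.

    import torch
    import torch.nn.functional as F

    def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.07):
        # img_feats, txt_feats: (N, d), rows assumed L2-normalized.
        logits = img_feats @ txt_feats.t() / temperature          # (N, N) cosine similarities
        targets = torch.arange(img_feats.size(0), device=img_feats.device)
        loss_i2t = F.cross_entropy(logits, targets)               # image -> matching text
        loss_t2i = F.cross_entropy(logits.t(), targets)           # text -> matching image
        return 0.5 * (loss_i2t + loss_t2i)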
+ + + + NeMF: Inverse Volume Rendering with Neural Microflake Field http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_NeMF_Inverse_Volume_Rendering_with_Neural_Microflake_Field_ICCV_2023_paper.pdf Recovering the physical attributes of an object's appearance from its images captured under an unknown illumination is challenging yet essential for photo-realistic rendering. Recent approaches adopt the emerging implicit scene representations and have shown impressive results. However, they unanimously adopt a surface-based representation, and hence cannot handle scenes with very complex geometry, translucent objects, etc. In this paper, we propose to conduct inverse volume rendering, in contrast to surface-based approaches, by representing a scene with a microflake volume, which assumes the space is filled with infinitesimal flakes and light reflects or scatters at each spatial location according to microflake distributions. We further adopt coordinate networks to implicitly encode the microflake volume, and develop a differentiable microflake volume renderer to train the network in an end-to-end way in principle. Our NeMF enables effective recovery of appearance attributes for highly complex geometry and scattering objects, enables high-quality relighting and material editing, and especially simulates volume rendering effects, such as scattering, which are infeasible for surface-based approaches. Our data and code are available at: https://github.com/YoujiaZhang/NeMF. + + + + MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_MasaCtrl_Tuning-Free_Mutual_Self-Attention_Control_for_Consistent_Image_Synthesis_and_ICCV_2023_paper.pdf Despite the success in large-scale text-to-image generation and text-conditioned image editing, existing methods still struggle to produce consistent generation and editing results. For example, generation approaches usually fail to synthesize multiple images of the same objects/characters but with different views or poses. Meanwhile, existing editing methods either fail to achieve effective complex non-rigid editing while maintaining the overall textures and identity, or require time-consuming fine-tuning to capture the image-specific appearance. In this paper, we develop MasaCtrl, a tuning-free method to achieve consistent image generation and complex non-rigid image editing simultaneously. Specifically, MasaCtrl converts existing self-attention in diffusion models into mutual self-attention, so that it can query correlated local contents and textures from source images for consistency. To further alleviate the query confusion between foreground and background, we propose a mask-guided mutual self-attention strategy, where the mask can be easily extracted from the cross-attention maps. Extensive experiments show that the proposed MasaCtrl can produce impressive results in both consistent image generation and complex non-rigid real image editing. + + + + Understanding Hessian Alignment for Domain Generalization http://openaccess.thecvf.com//content/ICCV2023/papers/Hemati_Understanding_Hessian_Alignment_for_Domain_Generalization_ICCV_2023_paper.pdf Out-of-distribution (OOD) generalization is a critical ability for deep learning models in many real-world scenarios, including healthcare and autonomous vehicles. Recently, different techniques have been proposed to improve OOD generalization.
Among these methods, gradient-based regularizers have shown promising performance compared with other competitors. Despite this success, our understanding of the role of Hessian and gradient alignment in domain generalization is still limited. To address this shortcoming, we analyze the role of the classifier's head Hessian matrix and gradient in domain generalization using recent OoD theory of transferability. Theoretically, we show that spectral norm between the classifier's head Hessian matrices across domains is an upper bound of the transfer measure, a notion of distance between target and source domains. Furthermore, we analyze all the attributes that get aligned when we encourage similarity between Hessians and gradients. Our analysis explains the success of many regularizers like CORAL, IRM, V-REx, Fish, IGA, and Fishr as they regularize part of the classifier's head Hessian and/or gradient. Finally, we propose two simple yet effective methods to match the classifier's head Hessians and gradients in an efficient way, based on the Hessian Gradient Product (HGP) and Hutchinson's method (Hutchinson), and without directly calculating Hessians. We validate the OOD generalization ability of proposed methods in different scenarios, including transferability, severe correlation shift, label shift and diversity shift. Our results show that Hessian alignment methods achieve promising performance on various OOD benchmarks. Our code is available here. + + + + Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Ge_Preserve_Your_Own_Correlation_A_Noise_Prior_for_Video_Diffusion_ICCV_2023_paper.pdf + Despite tremendous progress in generating high-quality images using diffusion models, synthesizing a sequence of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets for image generation are available, collecting similar video data of the same scale is still challenging. Also, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution for the video synthesis task. We find that naively extending the image noise prior to video noise prior in video diffusion leads to sub-optimal performance. Our carefully designed video noise prior leads to substantially better performance. Extensive experimental validation shows that our model, Preserve Your Own COrrelation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a 10x smaller model using significantly less computation than the prior art. + + + + Revisiting Vision Transformer from the View of Path Ensemble + http://openaccess.thecvf.com//content/ICCV2023/papers/Chang_Revisiting_Vision_Transformer_from_the_View_of_Path_Ensemble_ICCV_2023_paper.pdf + Vision Transformers (ViTs) are normally regarded as a stack of transformer layers. In this work, we propose a novel view of ViTs showing that they can be seen as ensemble networks containing multiple parallel paths with different lengths. Specifically, we equivalently transform the traditional cascade of multi-head self-attention (MSA) and feed-forward network (FFN) into three parallel paths in each transformer layer. 
Then, we utilize the identity connection in our new transformer form and further transform the ViT into an explicit multi-path ensemble network. From the new perspective, these paths perform two functions: the first is to provide the feature for the classifier directly, and the second is to provide the lower-level feature representation for subsequent longer paths. We investigate the influence of each path for the final prediction and discover that some paths even pull down the performance. Therefore, we propose the path pruning and EnsembleScale skills for improvement, which cut out the underperforming paths and re-weight the ensemble components, respectively, to optimize the path combination and make the short paths focus on providing high-quality representation for subsequent paths. We also demonstrate that our path combination strategies can help ViTs go deeper and act as high-pass filters to filter out partial low-frequency signals. To further enhance the representation of paths served for subsequent paths, self-distillation is applied to transfer knowledge from the long paths to the short paths. This work calls for more future research to explain and design ViTs from new perspectives. + + + + Tetra-NeRF: Representing Neural Radiance Fields Using Tetrahedra + http://openaccess.thecvf.com//content/ICCV2023/papers/Kulhanek_Tetra-NeRF_Representing_Neural_Radiance_Fields_Using_Tetrahedra_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRFs) are a very recent and very popular approach for the problems of novel view synthesis and 3D reconstruction. A popular scene representation used by NeRFs is to combine a uniform, voxel-based subdivision of the scene with an MLP. Based on the observation that a (sparse) point cloud of the scene is often available, this paper proposes to use an adaptive representation based on tetrahedra obtained by Delaunay triangulation instead of uniform subdivision or point-based representations. We show that such a representation enables efficient training and leads to state-of-the-art results. Our approach elegantly combines concepts from 3D geometry processing, triangle-based rendering, and modern neural radiance fields. Compared to voxel-based representations, ours provides more detail around parts of the scene likely to be close to the surface. Compared to point-based representations, our approach achieves better performance. The source code is publicly available at: https://jkulhanek.com/tetra-nerf. + + + + Ablating Concepts in Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Kumari_Ablating_Concepts_in_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf + Large-scale text-to-image diffusion models can generate high-fidelity images with powerful compositional ability. However, these models are typically trained on an enormous amount of Internet data, often containing copyrighted material, licensed images, and personal photos. Furthermore, they have been found to replicate the style of various living artists or memorize exact training samples. How can we remove such copyrighted concepts or images without retraining the model from scratch? To achieve this goal, we propose an efficient method of ablating concepts in the pretrained model, i.e., preventing the generation of a target concept. Our algorithm learns to match the image distribution for a target style, instance, or text prompt we wish to ablate to the distribution corresponding to an anchor concept. 
This prevents the model from generating target concepts given its text condition. Extensive experiments show that our method can successfully prevent the generation of the ablated concept while preserving closely related concepts in the model. + + + + MapFormer: Boosting Change Detection by Using Pre-change Information http://openaccess.thecvf.com//content/ICCV2023/papers/Bernhard_MapFormer_Boosting_Change_Detection_by_Using_Pre-change_Information_ICCV_2023_paper.pdf Change detection in remote sensing imagery is essential for a variety of applications such as urban planning, disaster management, and climate research. However, existing methods for identifying semantically changed areas overlook the availability of semantic information in the form of existing maps describing features of the earth's surface. In this paper, we leverage this information for change detection in bi-temporal images. We show that the simple integration of the additional information via concatenation of latent representations suffices to significantly outperform state-of-the-art change detection methods. Motivated by this observation, we propose the new task of Conditional Change Detection, where pre-change semantic information is used as input next to bi-temporal images. To fully exploit the extra information, we propose MapFormer, a novel architecture based on a multi-modal feature fusion module that allows for feature processing conditioned on the available semantic information. We further employ a supervised, cross-modal contrastive loss to guide the learning of visual representations. Our approach outperforms existing change detection methods by an absolute 11.7% and 18.4% in terms of binary change IoU on DynamicEarthNet and HRSCD, respectively. Furthermore, we demonstrate the robustness of our approach to the quality of the pre-change semantic information and to the absence of pre-change imagery. The code is available at https://github.com/mxbh/mapformer. + + + + Masked Diffusion Transformer is a Strong Image Synthesizer http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Masked_Diffusion_Transformer_is_a_Strong_Image_Synthesizer_ICCV_2023_paper.pdf Despite their success in image synthesis, we observe that diffusion probabilistic models (DPMs) often lack the contextual reasoning ability to learn the relations among object parts in an image, leading to a slow learning process. To solve this issue, we propose a Masked Diffusion Transformer (MDT) that introduces a mask latent modeling scheme to explicitly enhance the DPMs' ability to learn contextual relations among object semantic parts in an image. During training, MDT operates in the latent space to mask certain tokens. Then, an asymmetric masking diffusion transformer is designed to predict masked tokens from unmasked ones while maintaining the diffusion generation process. Our MDT can reconstruct the full information of an image from its incomplete contextual input, thus enabling it to learn the associated relations among image tokens. Experimental results show that MDT achieves superior image synthesis performance, e.g., a new SOTA FID score on the ImageNet dataset, and has about 3x faster learning speed than the previous SOTA DiT. The source code is released at https://github.com/sail-sg/MDT.
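A minimal sketch of the latent-token masking step described in the MDT entry above is given below. It shows random masking only, not the asymmetric masking diffusion transformer itself; mask_ratio and the learnable mask_token are illustrative assumptions rather than details taken from the paper.

    import torch

    def mask_latent_tokens(tokens, mask_token, mask_ratio=0.3):
        # tokens: (B, N, d) latent tokens; mask_token: (d,) learnable embedding.
        # Returns tokens with a random subset replaced by mask_token, plus the boolean mask.
        B, N, d = tokens.shape
        num_mask = int(N * mask_ratio)
        ids = torch.rand(B, N, device=tokens.device).argsort(dim=1)   # random permutation per sample
        mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
        batch_idx = torch.arange(B, device=tokens.device).unsqueeze(1)
        mask[batch_idx, ids[:, :num_mask]] = True                     # mark masked positions
        masked = torch.where(mask.unsqueeze(-1), mask_token.expand(B, N, d), tokens)
        return masked, mask

During training, one would presumably ask the transformer to predict the original tokens at the masked positions while the usual diffusion objective is kept on the full token set.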
+ + + + LightDepth: Single-View Depth Self-Supervision from Illumination Decline http://openaccess.thecvf.com//content/ICCV2023/papers/Rodriguez-Puigvert_LightDepth_Single-View_Depth_Self-Supervision_from_Illumination_Decline_ICCV_2023_paper.pdf Single-view depth estimation can be remarkably effective if there is enough ground-truth depth data for supervised training. However, there are scenarios, especially in medicine in the case of endoscopies, where such data cannot be obtained. In such cases, multi-view self-supervision and synthetic-to-real transfer serve as alternative approaches, albeit with a considerable performance reduction in comparison to the supervised case. Instead, we propose a single-view self-supervised method that achieves a performance similar to the supervised case. In some medical devices, such as endoscopes, the camera and light sources are co-located at a small distance from the target surfaces. Thus, we can exploit the fact that, for any given albedo and surface orientation, pixel brightness is inversely proportional to the square of the distance to the surface, providing a strong single-view self-supervisory signal. In our experiments, our self-supervised models deliver accuracies comparable to those of fully supervised ones, while being applicable without depth ground-truth data. + + + + Referring Image Segmentation Using Text Supervision http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Referring_Image_Segmentation_Using_Text_Supervision_ICCV_2023_paper.pdf Existing Referring Image Segmentation (RIS) methods typically require expensive pixel-level or box-level annotations for supervision. In this paper, we observe that the referring texts used in RIS already provide sufficient information to localize the target object. Hence, we propose a novel weakly-supervised RIS framework that formulates the target localization problem as a classification process to differentiate between positive and negative text expressions. While the referring text expressions for an image are used as positive expressions, the referring text expressions from other images can be used as negative expressions for this image. Our framework has three main novelties. First, we propose a bilateral prompt method to facilitate the classification process, by harmonizing the domain discrepancy between visual and linguistic features. Second, we propose a calibration method to reduce noisy background information and improve the correctness of the response maps for target object localization. Third, we propose a positive response map selection strategy to generate high-quality pseudo-labels from the enhanced response maps, for training a segmentation network for RIS inference. For evaluation, we propose a new metric to measure localization accuracy. Experiments on four benchmarks show that our framework achieves promising performance relative to existing fully-supervised RIS methods while outperforming state-of-the-art weakly-supervised methods adapted from related areas. Code is available at https://github.com/fawnliu/TRIS. + + + + Once Detected, Never Lost: Surpassing Human Performance in Offline LiDAR based 3D Object Detection http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_Once_Detected_Never_Lost_Surpassing_Human_Performance_in_Offline_LiDAR_ICCV_2023_paper.pdf This paper aims for high-performance offline LiDAR-based 3D object detection. We first observe that experienced human annotators annotate objects from a track-centric perspective.
They first label objects in a track with clear shapes, and then leverage the temporal coherence to infer the annotations of obscure objects. Drawing inspiration from this, we propose a high-performance offline detector in a track-centric perspective instead of the conventional object-centric perspective. Our method features a bidirectional tracking module and a track-centric learning module. Such a design allows our detector to infer and refine a complete track once the object is detected at a certain moment. We refer to this characteristic as "onCe detecTed, neveR Lost" and name the proposed system CTRL. Extensive experiments demonstrate the remarkable performance of our method, surpassing the human-level annotating accuracy and outperforming the previous state-of-the-art methods in the highly competitive Waymo Open Dataset leaderboard without model ensemble. The code is available at https://github.com/tusen-ai/SST. + + + + Eventful Transformers: Leveraging Temporal Redundancy in Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Dutson_Eventful_Transformers_Leveraging_Temporal_Redundancy_in_Vision_Transformers_ICCV_2023_paper.pdf + Vision Transformers achieve impressive accuracy across a range of visual recognition tasks. Unfortunately, their accuracy frequently comes with high computational costs. This is a particular issue in video recognition, where models are often applied repeatedly across frames or temporal chunks. In this work, we exploit temporal redundancy between subsequent inputs to reduce the cost of Transformers for video processing. We describe a method for identifying and re-processing only those tokens that have changed significantly over time. Our proposed family of models, Eventful Transformers, can be converted from existing Transformers (often without any re-training) and give adaptive control over the compute cost at runtime. We evaluate our method on large-scale datasets for video object detection (ImageNet VID) and action recognition (EPIC-Kitchens 100). Our approach leads to significant computational savings (on the order of 2-4x) with only minor reductions in accuracy. + + + + Robust Referring Video Object Segmentation with Cyclic Structural Consensus + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Robust_Referring_Video_Object_Segmentation_with_Cyclic_Structural_Consensus_ICCV_2023_paper.pdf + Referring Video Object Segmentation (R-VOS) is a challenging task that aims to segment an object in a video based on a linguistic expression. Most existing R-VOS methods have a critical assumption: the object referred to must appear in the video. This assumption, which we refer to as "semantic consensus", is often violated in real-world scenarios, where the expression may be queried against false videos. In this work, we highlight the need for a robust R-VOS model that can handle semantic mismatches. Accordingly, we propose an extended task called Robust R-VOS (RRVOS), which accepts unpaired video-text inputs. We tackle this problem by jointly modeling the primary R-VOS problem and its dual (text reconstruction). A structural text-to-text cycle constraint is introduced to discriminate semantic consensus between video-text pairs and impose it in positive pairs, thereby achieving multi-modal alignment from both positive and negative pairs. Our structural constraint effectively addresses the challenge posed by linguistic diversity, overcoming the limitations of previous methods that relied on the point-wise constraint. 
A new evaluation dataset, RRYTVOS is constructed to measure the model robustness. Our model achieves state-of-the-art performance on R-VOS benchmarks, Ref-DAVIS17 and Ref-Youtube-VOS, and also our RRYTVOS dataset. + + + + Building Bridge Across the Time: Disruption and Restoration of Murals In the Wild + http://openaccess.thecvf.com//content/ICCV2023/papers/Shao_Building_Bridge_Across_the_Time_Disruption_and_Restoration_of_Murals_ICCV_2023_paper.pdf + In this paper, we focus on the mural-restoration task, which aims to detect damaged regions in the mural and repaint them automatically. Different from traditional image restoration tasks like in/out/blind-painting and image renovation, the corrupted mural suffers from more complicated degradation. However, existing mural-restoration methods and datasets still focus on simple degradation like masking. Such a significant gap prevents mural-restoration from being applied to real scenarios. To fill this gap, in this work, we propose a systematic framework to simulate the physical process for damaged murals and provide a new benchmark dataset for mural-restoration. Limited by the simplification of the data synthesis process, the previous mural-restoration methods suffer from poor performance in our proposed dataset. To handle this problem, we propose the Attention Diffusion Framework (ADF) for this challenging task. Within the framework, a damage attention map module is proposed to estimate the damage extent. Facing the diversity of defects, we propose a series of loss functions to choose repair strategies adaptively. Finally, experimental results support the effectiveness of the proposed framework in terms of both mural synthesis and restoration. + + + + Neural Haircut: Prior-Guided Strand-Based Hair Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Sklyarova_Neural_Haircut_Prior-Guided_Strand-Based_Hair_Reconstruction_ICCV_2023_paper.pdf + Generating realistic human 3D reconstructions using image or video data is essential for various communication and entertainment applications. While existing methods achieved impressive results for body and facial regions, realistic hair modeling still remains challenging due to its high mechanical complexity. This work proposes an approach capable of accurate hair geometry reconstruction at a strand level from a monocular video or multi-view images captured in uncontrolled lighting conditions. Our method has two stages, with the first stage performing joint reconstruction of coarse hair and bust shapes and hair orientation using implicit volumetric representations. The second stage then estimates a strand-level hair reconstruction by reconciling in a single optimization process the coarse volumetric constraints with hair strand and hairstyle priors learned from the synthetic data. To further increase the reconstruction fidelity, we incorporate image-based losses into the fitting process using a new differentiable renderer. The combined system, named Neural Haircut, achieves high realism and personalization of the reconstructed hairstyles. + + + + DG-Recon: Depth-Guided Neural 3D Scene Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Ju_DG-Recon_Depth-Guided_Neural_3D_Scene_Reconstruction_ICCV_2023_paper.pdf + A key challenge in neural 3D scene reconstruction from monocular images is to fuse features back projected from various views without any depth or occlusion information. 
We address this by leveraging monocular depth priors, which effectively guide the fusion to improve surface prediction and skip over irrelevant, ambiguous, or occluded features. Furthermore, we revisit the average-based fusion used by most neural 3D reconstruction methods and propose two alternatives, a variance-based and a cross-attention-based fusion module, that are more efficient and effective than the average-based and self-attention-based counterparts. Compared to the NeuralRecon baseline, the proposed DG-Recon models significantly improve the reconstruction quality and completeness while remaining in real-time. Our method achieves state-of-the-art online reconstruction results on the ScanNet dataset and is on par with the current best offline method, which repeatedly accesses keyframes from the entire video sequence. Our ScanNet-trained model also generalizes robustly to the challenging 7-Scenes dataset and a subset of SUN3D containing scenes as big as an entire floor. + + + + The Stable Signature: Rooting Watermarks in Latent Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Fernandez_The_Stable_Signature_Rooting_Watermarks_in_Latent_Diffusion_Models_ICCV_2023_paper.pdf + Generative image modeling enables a wide range of applications but raises ethical concerns about responsible deployment. This paper introduces an active strategy combining image watermarking and Latent Diffusion Models. The goal is for all generated images to conceal a watermark allowing for future detection and/or identification. The method quickly fine-tunes the image generator, conditioned on a binary signature. A pre-trained watermark extractor recovers the hidden signature from any generated image and a statistical test then determines whether it comes from the generative model. We evaluate the invisibility and robustness of our watermark on a variety of generation tasks, showing that Stable Signature works even after the images are modified. For instance, it detects the origin of an image generated from a text prompt, then cropped to keep 10% of the content, with 90+% accuracy at a false positive rate below 1e-6. + + + + Seal-3D: Interactive Pixel-Level Editing for Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Seal-3D_Interactive_Pixel-Level_Editing_for_Neural_Radiance_Fields_ICCV_2023_paper.pdf + With the popularity of implicit neural representations, or neural radiance fields (NeRF), there is a pressing need for editing methods to interact with the implicit 3D models for tasks like post-processing reconstructed scenes and 3D content creation. While previous works have explored NeRF editing from various perspectives, they are restricted in editing flexibility, quality, and speed, failing to offer direct editing response and instant preview. The key challenge is to conceive a locally editable neural representation that can directly reflect the editing instructions and update instantly. To bridge the gap, we propose a new interactive editing method and system for implicit representations, called Seal-3D, which allows users to edit NeRF models in a pixel-level and free manner with a wide range of NeRF-like backbone and preview the editing effects instantly. To achieve the effects, the challenges are addressed by our proposed proxy function mapping the editing instructions to the original space of NeRF models in the teacher model and a two-stage training strategy for the student model with local pretraining and global finetuning. 
A NeRF editing system is built to showcase various editing types. Our system can achieve compelling editing effects with an interactive speed of about 1 second. + + + + NeRF-MS: Neural Radiance Fields with Multi-Sequence + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_NeRF-MS_Neural_Radiance_Fields_with_Multi-Sequence_ICCV_2023_paper.pdf + Neural radiance fields (NeRF) achieve impressive performance in novel view synthesis when trained on only single sequence data. However, leveraging multiple sequences captured by different cameras at different times is essential for better reconstruction performance. Multi-sequence data poses two main challenges: appearance variation due to different lighting conditions and non-static objects like pedestrians. To address these issues, we propose NeRF-MS, a novel approach to training NeRF with multi-sequence data. Specifically, we utilize a triplet loss to regularize the distribution of per-image appearance code, which leads to better high-frequency texture and consistent appearance, such as specular reflections. Then, we explicitly model non-static objects to reduce floaters. Extensive results demonstrate that NeRF-MS not only outperforms state-of-the-art view synthesis methods on outdoor and synthetic scenes, but also achieves 3D-consistent rendering and robust appearance control. Project page: https://nerf-ms.github.io/. + + + + Diffusion Model as Representation Learner + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Diffusion_Model_as_Representation_Learner_ICCV_2023_paper.pdf + Diffusion Probabilistic Models (DPMs) have recently demonstrated impressive results on various generative tasks. Despite these promises, however, the learned representations of pre-trained DPMs have not been fully understood. In this paper, we conduct an in-depth investigation of the representation power of DPMs, and propose a novel knowledge transfer method that leverages the knowledge acquired by generative DPMs for analytical tasks. Our study begins by examining the feature space of DPMs, revealing that DPMs are inherently denoising autoencoders that balance representation learning with regularizing model capacity. To this end, we introduce a novel knowledge transfer paradigm named RepFusion. Our paradigm extracts representations at different time steps from off-the-shelf DPMs and dynamically employs them as supervision for student networks, in which the optimal time step is determined through reinforcement learning. We evaluate our approach on several image classification, semantic segmentation, and landmark detection benchmarks, and demonstrate that it outperforms state-of-the-art methods. Our results uncover the potential of DPMs as a powerful tool for representation learning and provide insights into the usefulness of generative models beyond sample generation. + + + + Nerfbusters: Removing Ghostly Artifacts from Casually Captured NeRFs + http://openaccess.thecvf.com//content/ICCV2023/papers/Warburg_Nerfbusters_Removing_Ghostly_Artifacts_from_Casually_Captured_NeRFs_ICCV_2023_paper.pdf + Casually captured Neural Radiance Fields (NeRFs) suffer from artifacts such as floaters or flawed geometry when rendered outside the input camera trajectory. Existing evaluation protocols often do not capture these effects, since they usually only assess image quality at every 8th frame of the training capture.
To aid in the development and evaluation of new methods in novel-view synthesis, we propose a new dataset and evaluation procedure, where two camera trajectories of the scene are recorded: one used for training, and the other for evaluation. In this more challenging in-the-wild setting, we find that existing hand-crafted regularizers neither remove floaters nor improve scene geometry. Thus, we propose a 3D diffusion-based method that leverages local 3D priors and a novel density-based score distillation sampling loss to discourage artifacts during NeRF optimization. We show that this data-driven prior removes floaters and improves scene geometry for casual captures. + + + + Document Understanding Dataset and Evaluation (DUDE) + http://openaccess.thecvf.com//content/ICCV2023/papers/Van_Landeghem_Document_Understanding_Dataset_and_Evaluation_DUDE_ICCV_2023_paper.pdf + We call on the Document AI (DocAI) community to re-evaluate current methodologies and embrace the challenge of creating more practically-oriented benchmarks. Document Understanding Dataset and Evaluation (DUDE) seeks to remediate the halted research progress in understanding visually-rich documents (VRDs). We present a new dataset with novelties related to types of questions, answers, and document layouts based on multi-industry, multi-domain, and multi-page VRDs of various origins and dates. Moreover, we are pushing the boundaries of current methods by creating multi-task and multi-domain evaluation setups that more accurately simulate real-world situations where powerful generalization and adaptation under low-resource settings are desired. DUDE aims to set a new standard as a more practical, long-standing benchmark for the community, and we hope that it will lead to future extensions and contributions that address real-world challenges. Finally, our work illustrates the importance of finding more efficient ways to model language, images, and layout in DocAI. + + + + Prototypical Kernel Learning and Open-set Foreground Perception for Generalized Few-shot Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Prototypical_Kernel_Learning_and_Open-set_Foreground_Perception_for_Generalized_Few-shot_ICCV_2023_paper.pdf + Generalized Few-shot Semantic Segmentation (GFSS) extends Few-shot Semantic Segmentation (FSS) to simultaneously segment unseen classes and seen classes during evaluation. Previous works leverage an additional branch or prototypical aggregation to eliminate the constrained setting of FSS. However, representation division and embedding prejudice, which heavily degrade the performance of GFSS, have not been considered together. We address these problems by combining prototypical kernel learning and open-set foreground perception. Specifically, a group of learnable kernels is proposed to perform segmentation, with each kernel in charge of a stuff class. Then, we explore merging prototypical learning into the update of the base-class kernels, which is consistent with the prototype knowledge aggregation of few-shot novel classes. In addition, a foreground contextual perception module cooperating with conditional-bias-based inference is adopted to perform class-agnostic as well as open-set foreground detection, thus mitigating the embedding prejudice and preventing novel targets from being misclassified as background.
Moreover, we also adapt our method to Class Incremental Few-shot Semantic Segmentation (CIFSS), which receives the knowledge of novel classes in an incremental stream. Extensive experiments on the PASCAL-5i and COCO-20i datasets demonstrate that our method performs better than the previous state-of-the-art. + + + + Simple and Effective Out-of-Distribution Detection via Cosine-based Softmax Loss + http://openaccess.thecvf.com//content/ICCV2023/papers/Noh_Simple_and_Effective_Out-of-Distribution_Detection_via_Cosine-based_Softmax_Loss_ICCV_2023_paper.pdf + Deep learning models need to detect out-of-distribution (OOD) data at the inference stage because they are trained to estimate the training distribution and to infer on data sampled from that distribution. Many methods have been proposed, but they have some limitations, such as requiring additional data, input processing, or high computational cost. Moreover, most methods have hyperparameters to be set by users, which have a significant impact on the detection rate. We propose a simple and effective OOD detection method by combining the feature norm and the Mahalanobis distance obtained from classification models trained with the cosine-based softmax loss. Our method is practical because it does not use additional data for training, is about three times faster at inference than methods that use input processing, and is easy to apply because it does not have any hyperparameters for OOD detection. Our experiments confirm that our method is superior or at least comparable to state-of-the-art OOD detection methods. + + + + CFCG: Semi-Supervised Semantic Segmentation via Cross-Fusion and Contour Guidance Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_CFCG_Semi-Supervised_Semantic_Segmentation_via_Cross-Fusion_and_Contour_Guidance_Supervision_ICCV_2023_paper.pdf + Current state-of-the-art semi-supervised semantic segmentation (SSSS) methods typically adopt pseudo labeling and consistency regularization between multiple learners with different perturbations. Although the performance is desirable, many issues remain: (1) supervision from a single learner tends to be noisy, which causes unreliable consistency regularization; (2) existing pixel-wise confidence-score-based reliability measurement causes potential error accumulation as training proceeds. In this paper, we propose a novel SSSS framework, called CFCG, which combines cross-fusion and contour guidance supervision to tackle these issues. Concretely, we adopt both image-level and feature-level perturbations to expand the feature distribution, thus pushing the potential limits of consistency regularization. Then, two particular modules are proposed to enable effective semi-supervised learning under heavy coherent perturbations. Firstly, the Cross-Fusion Supervision (CFS) mechanism leverages multiple learners to enhance the quality of pseudo labels. Secondly, we introduce an adaptive contour guidance module (ACGM) to effectively identify unreliable spatial regions in pseudo labels. Finally, our proposed CFCG achieves mIoU gains of +1.40% and +0.89% with a single learner and +1.85% and +1.33% with fusion inference on PASCAL VOC 2012 and Cityscapes, respectively, under 1/8 protocols, clearly surpassing previous methods and reaching the state-of-the-art.
+ + + + SLAN: Self-Locator Aided Network for Vision-Language Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhai_SLAN_Self-Locator_Aided_Network_for_Vision-Language_Understanding_ICCV_2023_paper.pdf + Learning fine-grained interplay between vision and language contributes to a more accurate understanding for Vision-Language tasks. However, it remains challenging to extract key image regions according to the texts for semantic alignments. Most existing works are either limited by text-agnostic and redundant regions obtained with the frozen detectors, or failing to scale further due to their heavy reliance on scarce grounding (gold) data to pre-train detectors. To solve these problems, we propose Self-Locator Aided Network (SLAN) for vision-language understanding tasks without any extra gold data. SLAN consists of a region filter and a region adaptor to localize regions of interest conditioned on different texts. By aggregating vision-language information, the region filter selects key regions and the region adaptor updates their coordinates with text guidance. With detailed region-word alignments, SLAN can be easily generalized to many downstream tasks. It achieves fairly competitive results on five vision-language understanding tasks (e.g., 85.7% and 69.2% on COCO image-to-text and text-to-image retrieval, surpassing previous SOTA methods). SLAN also demonstrates strong zero-shot and fine-tuned transferability to two localization tasks. + + + + Anomaly Detection using Score-based Perturbation Resilience + http://openaccess.thecvf.com//content/ICCV2023/papers/Shin_Anomaly_Detection_using_Score-based_Perturbation_Resilience_ICCV_2023_paper.pdf + Unsupervised anomaly detection is widely studied for industrial applications since it is difficult to obtain anomalous data. In particular, reconstruction-based anomaly detection can be a feasible solution if there is no option to use external knowledge, such as extra datasets or pre-trained models. However, reconstruction-based methods have limited utility due to poor detection performance. A score-based model, also known as a denoising diffusion model, recently has shown a high sample quality in the generation task. In this paper, we propose a novel unsupervised anomaly detection method leveraging the score-based model. This method promises good performance without external knowledge. The score, a gradient of the log-likelihood, has a property that is available for anomaly detection. The samples on the data manifold can be restored instantly by the score, even if they are randomly perturbed. We call this a score-based perturbation resilience. On the other hand, the samples that deviate from the manifold cannot be restored in the same way. The variation of resilience depending on the sample position can be an indicator to discriminate anomalies. We derive this statement from a geometric perspective. Our method shows superior performance on three benchmark datasets for industrial anomaly detection. Specifically, on MVTec AD, we achieve image-level AUROC of 97.7% and pixel-level AUROC of 97.4% outperforming previous works that do not use external knowledge. + + + + Generating Visual Scenes from Touch + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Generating_Visual_Scenes_from_Touch_ICCV_2023_paper.pdf + An emerging line of work has sought to generate plausible imagery from touch. 
Existing approaches, however, tackle only narrow aspects of the visuo-tactile synthesis problem, and lag significantly behind the quality of cross-modal synthesis methods in other domains. We draw on recent advances in latent diffusion to create a model for synthesizing images from tactile signals (and vice versa) and apply it to a number of visuo-tactile synthesis tasks. Using this model, we significantly outperform prior work on the tactile-driven stylization problem, i.e., manipulating an image to match a touch signal, and we are the first to successfully generate images from touch without additional sources of information about the scene. We also successfully use our model to address two novel synthesis problems: generating images that do not contain the touch sensor or the hand holding it, and estimating an image's shading from its reflectance and touch. + + + + SatlasPretrain: A Large-Scale Dataset for Remote Sensing Image Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Bastani_SatlasPretrain_A_Large-Scale_Dataset_for_Remote_Sensing_Image_Understanding_ICCV_2023_paper.pdf + Remote sensing images are useful for a wide variety of planet monitoring applications, from tracking deforestation to tackling illegal fishing. The Earth is extremely diverse---the amount of potential tasks in remote sensing images is massive, and the sizes of features range from several kilometers to just tens of centimeters. However, creating generalizable computer vision methods is a challenge in part due to the lack of a large-scale dataset that captures these diverse features for many tasks. In this paper, we present SatlasPretrain, a remote sensing dataset that is large in both breadth and scale, combining Sentinel-2 and NAIP images with 302M labels under 137 categories and seven label types. We evaluate eight baselines and a proposed method on SatlasPretrain, and find that there is substantial room for improvement in addressing research challenges specific to remote sensing, including processing image time series that consist of images from very different types of sensors, and taking advantage of long-range spatial context. Moreover, we find that pre-training on SatlasPretrain substantially improves performance on downstream tasks, increasing average accuracy by 18% over ImageNet and 6% over the next best baseline. The dataset, pre-trained model weights, and code are available at https://satlas-pretrain.allen.ai/. + + + + DReg-NeRF: Deep Registration for Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_DReg-NeRF_Deep_Registration_for_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Although Neural Radiance Fields (NeRF) is popular in the computer vision community recently, registering multiple NeRFs has yet to gain much attention. Unlike the existing work, NeRF2NeRF, which is based on traditional optimization methods and needs human annotated keypoints, we propose DReg-NeRF to solve the NeRF registration problem on object-centric scenes without human intervention. After training NeRF models, our DReg-NeRF first extracts features from the occupancy grid in NeRF. Subsequently, our DReg-NeRF utilizes a transformer architecture with self-attention and cross-attention layers to learn the relations between pairwise NeRF blocks. In contrast to state-of-the-art (SOTA) point cloud registration methods, the decoupled correspondences are supervised by surface fields without any ground truth overlapping labels. 
We construct a novel view synthesis dataset with 1,700+ 3D objects obtained from Objaverse to train our network. When evaluated on the test set, our proposed method beats the SOTA point cloud registration methods by a large margin with a mean RPE = 9.67* and a mean RTE = 0.038. Our code is available at https://github.com/AIBluefisher/DReg-NeRF. + + + + All in Tokens: Unifying Output Space of Visual Tasks via Soft Token + http://openaccess.thecvf.com//content/ICCV2023/papers/Ning_All_in_Tokens_Unifying_Output_Space_of_Visual_Tasks_via_ICCV_2023_paper.pdf + We introduce AiT, a unified output representation for various vision tasks, which is a crucial step towards general-purpose vision task solvers. Despite the challenges posed by the high-dimensional and task-specific outputs, we showcase the potential of using discrete representation (VQ-VAE) to model the dense outputs of many computer vision tasks as a sequence of discrete tokens. This is inspired by the established ability of VQ-VAE to conserve the structures spanning multiple pixels using few discrete codes. To that end, we present a modified shallower architecture for VQ-VAE that improves efficiency while keeping prediction accuracy. Our approach also incorporates uncertainty into the decoding process by using a soft fusion of the codebook entries, providing a more stable training process, which notably improved prediction accuracy. Our evaluation of AiT on depth estimation and instance segmentation tasks, with both continuous and discrete labels, demonstrates its superiority compared to other unified models. The code and models are available at https://github.com/SwinTransformer/AiT. + + + + LDL: Line Distance Functions for Panoramic Localization + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_LDL_Line_Distance_Functions_for_Panoramic_Localization_ICCV_2023_paper.pdf + We introduce LDL, a fast and robust algorithm that localizes a panorama to a 3D map using line segments. LDL focuses on the sparse structural information of lines in the scene, which is robust to illumination changes and can potentially enable efficient computation. While previous line-based localization approaches tend to sacrifice accuracy or computation time, our method effectively observes the holistic distribution of lines within panoramic images and 3D maps. Specifically, LDL matches the distribution of lines with 2D and 3D line distance functions, which are further decomposed along principal directions of lines to increase the expressiveness. The distance functions provide coarse pose estimates by comparing the distributional information, where the poses are further optimized using conventional local feature matching. As our pipeline solely leverages line geometry and local features, it does not require costly additional training of line-specific features or correspondence matching. Nevertheless, our method demonstrates robust performance on challenging scenarios including object layout changes, illumination shifts, and large-scale scenes, while exhibiting fast pose search terminating within a matter of milliseconds. We thus expect our method to serve as a practical solution for line-based localization, and complement the well-established point-based paradigm. 
+ + + + TransTIC: Transferring Transformer-based Image Compression from Human Perception to Machine Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_TransTIC_Transferring_Transformer-based_Image_Compression_from_Human_Perception_to_Machine_ICCV_2023_paper.pdf + This work aims for transferring a Transformer-based image compression codec from human perception to machine perception without fine-tuning the codec. We propose a transferable Transformer-based image compression framework, termed TransTIC. Inspired by visual prompt tuning, TransTIC adopts an instance-specific prompt generator to inject instance-specific prompts to the encoder and task-specific prompts to the decoder. Extensive experiments show that our proposed method is capable of transferring the base codec to various machine tasks and outperforms the competing methods significantly. To our best knowledge, this work is the first attempt to utilize prompting on the low-level image compression task. + + + + CHORUS : Learning Canonicalized 3D Human-Object Spatial Relations from Unbounded Synthesized Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_CHORUS__Learning_Canonicalized_3D_Human-Object_Spatial_Relations_from_Unbounded_ICCV_2023_paper.pdf + We present a method for teaching machines to understand and model the underlying spatial common sense of diverse human-object interactions in 3D in a self-supervised way. This is a challenging task, as there exist specific manifolds of the interactions that can be considered human-like and natural, but the human pose and the geometry of objects can vary even for similar interactions. Such diversity makes the annotating task of 3D interactions difficult and hard to scale, which limits the potential to reason about that in a supervised way. One way of learning the 3D spatial relationship between humans and objects during interaction is by showing multiple 2D images captured from different viewpoints when humans interact with the same type of objects. The core idea of our method is to leverage a generative model that produces high-quality 2D images from an arbitrary text prompt input as an "unbounded" data generator with effective controllability and view diversity. Despite its imperfection of the image quality over real images, we demonstrate that the synthesized images are sufficient to learn the 3D human-object spatial relations. We present multiple strategies to leverage the synthesized images, including (1) the first method to leverage a generative image model for 3D human-object spatial relation learning; (2) a framework to reason about the 3D spatial relations from inconsistent 2D cues in a self-supervised manner via 3D occupancy reasoning with pose canonicalization; (3) semantic clustering to disambiguate different types of interactions with the same object types; and (4) a novel metric to assess the quality of 3D spatial learning of interaction. Project Page: https://jellyheadandrew.github.io/projects/chorus + + + + ViLLA: Fine-Grained Vision-Language Representation Learning from Real-World Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Varma_ViLLA_Fine-Grained_Vision-Language_Representation_Learning_from_Real-World_Data_ICCV_2023_paper.pdf + Vision-language models (VLMs), such as CLIP and ALIGN, are generally trained on datasets consisting of image-caption pairs obtained from the web. However, real-world multimodal datasets, such as healthcare data, are significantly more complex: each image (e.g. 
X-ray) is often paired with text (e.g. physician report) that describes many distinct attributes occurring in fine-grained regions of the image. We refer to these samples as exhibiting high pairwise complexity, since each image-text pair can be decomposed into a large number of region-attribute pairings. The extent to which VLMs can capture fine-grained relationships between image regions and textual attributes when trained on such data has not been previously evaluated. The first key contribution of this work is to demonstrate through systematic evaluations that as the pairwise complexity of the training dataset increases, standard VLMs struggle to learn region-attribute relationships, exhibiting performance degradations of up to 37% on retrieval tasks. In order to address this issue, we introduce ViLLA as our second key contribution. ViLLA, which is trained to capture fine-grained region-attribute relationships from complex datasets, involves two components: (a) a lightweight, self-supervised mapping model to decompose image-text samples into region-attribute pairs, and (b) a contrastive VLM to learn representations from generated region-attribute pairs. We demonstrate with experiments across four domains (synthetic, product, medical, and natural images) that ViLLA outperforms comparable VLMs on fine-grained reasoning tasks, such as zero-shot object detection (up to 3.6 AP50 points on COCO and 0.6 mAP points on LVIS) and retrieval (up to 14.2 R-Precision points). + + + + Towards Unifying Medical Vision-and-Language Pre-Training via Soft Prompts + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Towards_Unifying_Medical_Vision-and-Language_Pre-Training_via_Soft_Prompts_ICCV_2023_paper.pdf + Medical vision-and-language pre-training (Med-VLP) has shown promising improvements on many downstream medical tasks owing to its applicability to extracting generic representations from medical images and texts. Practically, there exist two typical types, i.e., the fusion-encoder type and the dual-encoder type, depending on whether a heavy fusion module is used. The former is superior at multi-modal tasks owing to the sufficient interaction between modalities; the latter is good at uni-modal and cross-modal tasks due to the single-modality encoding ability. To take advantage of these two types, we propose an effective yet straightforward scheme named PTUnifier to unify the two types. We first unify the input format by introducing visual and textual prompts, which serve as DETR-like queries that assist in extracting features when one of the modalities is missing. By doing so, a single model could serve as a foundation model that processes various tasks adopting different input formats (i.e., image-only, text-only, and image-text-pair). Furthermore, we construct a prompt pool (instead of static ones) to improve diversity and scalability, enabling queries conditioned on different input instances. Experimental results show that our approach achieves state-of-the-art results on a broad range of tasks, spanning uni-modal tasks (i.e., image/text classification and text summarization), cross-modal tasks (i.e., image-to-text generation and image-text/text-image retrieval), and multi-modal tasks (i.e., visual question answering), demonstrating the effectiveness of our approach. Note that the adoption of prompts is orthogonal to most existing Med-VLP approaches and could be a beneficial and complementary extension to these approaches. 
The source code is available at https://anonymous.4open.science/r/ICCV-2023-Submission-PTUnifier/ and will be released in the final version of this paper. + + + + A Large-scale Study of Spatiotemporal Representation Learning with a New Benchmark on Action Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Deng_A_Large-scale_Study_of_Spatiotemporal_Representation_Learning_with_a_New_ICCV_2023_paper.pdf + The goal of building a benchmark (suite of datasets) is to provide a unified protocol for fair evaluation and thus facilitate the evolution of a specific area. Nonetheless, we point out that existing protocols of action recognition could yield partial evaluations due to several limitations. To comprehensively probe the effectiveness of spatiotemporal representation learning, we introduce BEAR, a new BEnchmark on video Action Recognition. BEAR is a collection of 18 video datasets grouped into 5 categories (anomaly, gesture, daily, sports, and instructional), which covers a diverse set of real-world applications. With BEAR, we thoroughly evaluate 6 common spatiotemporal models pre-trained by both supervised and self-supervised learning. We also report transfer performance via standard finetuning, few-shot fine-tuning, and unsupervised domain adaptation. Our observation suggests that the current state-of-the-art cannot solidly guarantee high performance on datasets close to real-world applications, and we hope BEAR can serve as a fair and challenging evaluation benchmark to gain insights on building next-generation spatiotemporal learners. Our dataset, code, and models are released at: https://github.com/AndongDeng/BEAR + + + + HoloFusion: Towards Photo-realistic 3D Generative Modeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Karnewar_HoloFusion_Towards_Photo-realistic_3D_Generative_Modeling_ICCV_2023_paper.pdf + Diffusion-based image generators can now produce high-quality and diverse samples, but their success has yet to fully translate to 3D generation: existing diffusion methods can either generate low-resolution but 3D consistent outputs, or detailed 2D views of 3D objects with potential structural defects and lacking either view consistency or realism. We present HoloFusion, a method that combines the best of these approaches to produce high-fidelity, plausible, and diverse 3D samples while learning from a collection of multi-view 2D images only. The method first generates coarse 3D samples using a variant of the recently proposed HoloDiffusion generator. Then, it independently renders and upsamples a large number of views of the coarse 3D model, super-resolves them to add detail, and distills those into a single, high-fidelity implicit 3D representation, which also ensures view-consistency of the final renders. The super-resolution network is trained as an integral part of HoloFusion, and the final distillation uses a new sampling scheme to capture the space of super-resolved signals. We compare our method against existing baselines, including DreamFusion, Get3D, EG3D, and HoloDiffusion, and achieve, to the best of our knowledge, the most realistic results on the challenging CO3Dv2 dataset. 
+ + + + Improving Continuous Sign Language Recognition with Cross-Lingual Signs + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Improving_Continuous_Sign_Language_Recognition_with_Cross-Lingual_Signs_ICCV_2023_paper.pdf + This work is dedicated to continuous sign language recognition (CSLR), which is a weakly supervised task dealing with the recognition of continuous signs from videos, without any prior knowledge about the temporal boundaries between consecutive signs. Data scarcity heavily impedes the progress of CSLR. Existing approaches typically train CSLR models on a monolingual corpus, which is orders of magnitude smaller than that of speech recognition. In this work, we explore the feasibility of utilizing multilingual sign language corpora to facilitate monolingual CSLR. Our work is built upon the observation of cross-lingual signs, which originate from different sign languages but have similar visual signals (e.g., hand shape and motion). The underlying idea of our approach is to identify the cross-lingual signs in one sign language and properly leverage them as auxiliary training data to improve the recognition capability of another. To achieve this goal, we first build two sign language dictionaries containing isolated signs that appear in two datasets. Then we identify the sign-to-sign mappings between the two sign languages via a well-optimized isolated sign language recognition model. Finally, we train a CSLR model on the combination of the target data with original labels and the auxiliary data with mapped labels. Experimentally, our approach achieves state-of-the-art performance on two widely-used CSLR datasets: Phoenix-2014 and Phoenix-2014T. + + + + TransIFF: An Instance-Level Feature Fusion Framework for Vehicle-Infrastructure Cooperative 3D Detection with Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_TransIFF_An_Instance-Level_Feature_Fusion_Framework_for_Vehicle-Infrastructure_Cooperative_3D_ICCV_2023_paper.pdf + Cooperation between vehicles and infrastructure is vital to enhancing the safety of autonomous driving. Two significant and conflicting challenges now stand in the way of collaborative perception: fusion accuracy and communication bandwidth. Previous intermediate fusion methods, which transmit features, strike a better balance between accuracy and bandwidth than early fusion and late fusion, but they usually have problems with feature alignment and domain gaps, and, to the best of our knowledge, their bandwidth usage still falls short of industrial application standards. In this paper, we propose TransIFF, an instance-level feature fusion framework with transformers that can effectively reduce bandwidth usage. Furthermore, it can align the domain gaps between vehicle and infrastructure features, and improve the robustness of feature fusion, leading to high cooperative perception accuracy. TransIFF is composed of three components: a vehicle-side network, an infrastructure-side network, and a vehicle-infrastructure fusion network. Initially, the vehicle-side and infrastructure-side networks independently generate instance-level features. Subsequently, the infrastructure-side instance-level features are transmitted to the vehicles, significantly reducing the communication bandwidth usage. Finally, in the vehicle-infrastructure fusion network, a Cross-Domain Adaptation (CDA) module is designed to align the feature domains, followed by a Feature Magnet (FM) module that adaptively fuses the instance features to achieve robust feature fusion.
TransIFF yields state-of-the-art performance on the widely used real-world vehicle-infrastructure cooperative benchmark DAIR-V2X, achieving 59.62% AP with only 2^12 bytes bandwidth consumption. + + + + Masked Retraining Teacher-Student Framework for Domain Adaptive Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Masked_Retraining_Teacher-Student_Framework_for_Domain_Adaptive_Object_Detection_ICCV_2023_paper.pdf + Domain adaptive Object Detection (DAOD) leverages a labeled domain (source) to learn an object detector generalizing to a novel domain without annotation (target). Recent advances use a teacher-student framework, i.e., a student model is supervised by the pseudo labels from a teacher model. Though great success, they suffer from the limited number of pseudo boxes with incorrect predictions caused by the domain shift, misleading the student model to get sub-optimal results. To mitigate this problem, we propose Masked Retraining Teacher-student framework (MRT) which leverages masked autoencoder and selective retraining mechanism on detection transformer. Specifically, we present a customized design of masked autoencoder branch, masking the multi-scale feature maps of target images and reconstructing features by the encoder of the student model and an auxiliary decoder. This helps the student model capture target domain characteristics and become a more data-efficient learner to gain knowledge from the limited number of pseudo boxes. Furthermore, we adopt selective retraining mechanism, periodically re-initializing certain parts of the student parameters with masked autoencoder refined weights to allow the model to jump out of the local optimum biased to the incorrect pseudo labels. Experimental results on three DAOD benchmarks demonstrate the effectiveness of our method. Code can be found at https://github.com/JeremyZhao1998/MRT-release. + + + + Prune Spatio-temporal Tokens by Semantic-aware Temporal Accumulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ding_Prune_Spatio-temporal_Tokens_by_Semantic-aware_Temporal_Accumulation_ICCV_2023_paper.pdf + Transformers have become the primary backbone of the computer vision community due to their impressive performance. However, the unfriendly computation cost impedes their potential in the video recognition domain. To optimize the speed-accuracy trade-off, we propose Semantic-aware Temporal Accumulation score (STA) to prune spatio-temporal tokens integrally. STA score considers two critical factors: temporal redundancy and semantic importance. The former depicts a specific region based on whether it is a new occurrence or a seen entity by aggregating token-to-token similarity in consecutive frames while the latter evaluates each token based on its contribution to the overall prediction. As a result, tokens with higher scores of STA carry more temporal redundancy as well as lower semantics thus being pruned. Based on the STA score, we are able to progressively prune the tokens without introducing any additional parameters or requiring further re-training. We directly apply the STA module to off-the-shelf ViT and VideoSwin backbones, and the empirical results on Kinetics-400 and Something-Something V2 achieve over 30% computation reduction with a negligible 0.2% accuracy drop. The code is released at https://github.com/Mark12Ding/STA. 
+ + + + Growing a Brain with Sparsity-Inducing Generation for Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Jin_Growing_a_Brain_with_Sparsity-Inducing_Generation_for_Continual_Learning_ICCV_2023_paper.pdf + Deep neural networks suffer from catastrophic forgetting in continual learning, where they tend to lose information about previously learned tasks when optimizing a new incoming task. Recent strategies isolate the important parameters for previous tasks to retain old knowledge while learning the new task. However, using the fixed old knowledge might act as an obstacle to capturing novel representations. To overcome this limitation, we propose a framework that evolves the previously allocated parameters by absorbing the knowledge of the new task. The approach performs under two different networks. The base network learns knowledge of sequential tasks, and the sparsity-inducing hypernetwork generates parameters for each time step for evolving old knowledge. The generated parameters transform old parameters of the base network to reflect the new knowledge. We design the hypernetwork to generate sparse parameters conditional to the task-specific information and the structural information of the base network. We evaluate the proposed approach on class-incremental and task-incremental learning scenarios for image classification and video action recognition tasks. Experimental results show that the proposed method consistently outperforms a large variety of continual learning approaches for those scenarios by evolving old knowledge. + + + + Cross-Ray Neural Radiance Fields for Novel-View Synthesis from Unconstrained Image Collections + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Cross-Ray_Neural_Radiance_Fields_for_Novel-View_Synthesis_from_Unconstrained_Image_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRF) is a revolutionary approach for rendering scenes by sampling a single ray per pixel and it has demonstrated impressive capabilities in novel-view synthesis from static scene images. However, in practice, we usually need to recover NeRF from unconstrained image collections, which poses two challenges: 1) the images often have dynamic changes in appearance because of different capturing time and camera settings; 2) the images may contain transient objects such as humans and cars, leading to occlusion and ghosting artifacts. Conventional approaches seek to address these challenges by locally utilizing a single ray to synthesize a color of a pixel. In contrast, humans typically perceive appearance and objects by globally utilizing information across multiple pixels. To mimic the perception process of humans, in this paper, we propose Cross-Ray NeRF (CR-NeRF) that leverages interactive information across multiple rays to synthesize occlusion-free novel views with the same appearances as the images. Specifically, to model varying appearances, we first propose to represent multiple rays with a novel cross-ray feature and then recover the appearance by fusing global statistics, i.e., feature covariance of the rays and the image appearance. Moreover, to avoid occlusion introduced by transient objects, we propose a transient objects handler and introduce a grid sampling strategy for masking out the transient objects. We theoretically find that leveraging correlation across multiple rays promotes capturing more global information. Moreover, extensive experimental results on large real-world datasets verify the effectiveness of CR-NeRF. 
+ + + + SPACE: Speech-driven Portrait Animation with Controllable Expression + http://openaccess.thecvf.com//content/ICCV2023/papers/Gururani_SPACE_Speech-driven_Portrait_Animation_with_Controllable_Expression_ICCV_2023_paper.pdf + Animating portraits using speech has received growing attention in recent years, with various creative and practical use cases. An ideal generated video should have good lip sync with the audio, natural facial expressions and head motions, and high frame quality. In this work, we present SPACE, which uses speech and a single image to generate high-resolution, and expressive videos with realistic head pose, without requiring a driving video. It uses a multi-stage approach, combining the controllability of facial landmarks with the high-quality synthesis power of a pretrained face generator. SPACE also allows for the control of emotions and their intensities. Our method outperforms prior methods in objective metrics for image quality and facial motions and is strongly preferred by users in pair-wise comparisons. Please visit the project page to view the videos and to see more results: https://research.nvidia.com/labs/dir/space. + + + + End-to-end 3D Tracking with Decoupled Queries + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_End-to-end_3D_Tracking_with_Decoupled_Queries_ICCV_2023_paper.pdf + In this work, we present an end-to-end framework for camera-based 3D multi-object tracking, called DQTrack. To avoid heuristic design in detection-based trackers, recent query-based approaches deal with identity-agnostic detection and identity-aware tracking in a single embedding. However, it brings inferior performance because of the inherent representation conflict. To address this issue, we decouple the single embedding into separated queries, i.e., object query and track query. Unlike previous detection-based and query-based methods, the decoupled-query paradigm utilizes task-specific queries and still maintains the compact pipeline without complex post-processing. Moreover, the learnable association and temporal update are designed to provide differentiable trajectory association and frame-by-frame query update, respectively. The proposed DQTrack is demonstrated to achieve consistent gains in various benchmarks, outperforming all previous tracking-by-detection and learning-based methods on the nuScenes dataset. + + + + Deformable Model-Driven Neural Rendering for High-Fidelity 3D Reconstruction of Human Heads Under Low-View Settings + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Deformable_Model-Driven_Neural_Rendering_for_High-Fidelity_3D_Reconstruction_of_Human_ICCV_2023_paper.pdf + Reconstructing 3D human heads in low-view settings presents technical challenges, mainly due to the pronounced risk of overfitting with limited views and high-frequency signals. To address this, we propose geometry decomposition and adopt a two-stage, coarse-to-fine training strategy, allowing for progressively capturing high-frequency geometric details. We represent 3D human heads using the zero level-set of a combined signed distance field, comprising a smooth template, a non-rigid deformation, and a high-frequency displacement field. The template captures features that are independent of both identity and expression and is co-trained with the deformation network across multiple individuals with sparse and randomly selected views. The displacement field, capturing individual-specific details, undergoes separate training for each person. 
Our network training does not require 3D supervision or object masks. Experimental results demonstrate the effectiveness and robustness of our geometry decomposition and two-stage training strategy. Our method outperforms existing neural rendering approaches in terms of reconstruction accuracy and novel view synthesis under low-view settings. Moreover, the pre-trained template serves a good initialization for our model when encountering unseen individuals. + + + + Density-invariant Features for Distant Point Cloud Registration + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Density-invariant_Features_for_Distant_Point_Cloud_Registration_ICCV_2023_paper.pdf + Registration of distant outdoor LiDAR point clouds is crucial to extending the 3D vision of collaborative autonomous vehicles, and yet is challenging due to small overlapping area and a huge disparity between observed point densities. In this paper, we propose Group-wise Contrastive Learning (GCL) scheme to extract density-invariant geometric features to register distant outdoor LiDAR point clouds. We mark through theoretical analysis and experiments that, contrastive positives should be independent and identically distributed (i.i.d.), in order to train density-invariant feature extractors. We propose upon the conclusion a simple yet effective training scheme to force the feature of multiple point clouds in the same spatial location (referred to as positive groups) to be similar, which naturally avoids the sampling bias introduced by a pair of point clouds to conform with the i.i.d. principle. The resulting fully-convolutional feature extractor is more powerful and density-invariant than state-of-the-art methods, improving the registration recall of distant scenarios on KITTI and nuScenes benchmarks by 40.9% and 26.9%, respectively. Code is available at https://github.com/liuQuan98/GCL. + + + + + + Diffusion Models as Masked Autoencoders + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Diffusion_Models_as_Masked_Autoencoders_ICCV_2023_paper.pdf + There has been a longstanding belief that generation can facilitate a true understanding of visual data. In line with this, we revisit generatively pre-training visual representations in light of recent interest in denoising diffusion models. While directly pre-training with diffusion models does not produce strong representations, we condition diffusion models on masked input and formulate diffusion models as masked autoencoders (DiffMAE). Our approach is capable of (i) serving as a strong initialization for downstream recognition tasks, (ii) conducting high-quality image inpainting, and (iii) being effortlessly extended to video where it produces state-of-the-art classification accuracy. We further perform a comprehensive study on the pros and cons of design choices and build connections between diffusion models and masked autoencoders. + + + + One-Shot Recognition of Any Material Anywhere Using Contrastive Learning with Physics-Based Rendering + http://openaccess.thecvf.com//content/ICCV2023/papers/Drehwald_One-Shot_Recognition_of_Any_Material_Anywhere_Using_Contrastive_Learning_with_ICCV_2023_paper.pdf + Visual recognition of materials and their states is essential for understanding the world, from determining whether food is cooked, metal is rusted, or a chemical reaction has occurred. However, current image recognition methods are limited to specific classes and properties and can't handle the vast number of material states in the world. 
To address this, we present MatSim: the first dataset and benchmark for computer vision-based recognition of similarities and transitions between materials and textures, focusing on identifying any material under any conditions using one or a few examples. The dataset contains synthetic and natural images. Synthetic images were rendered using giant collections of textures, objects, and environments generated by computer graphics artists. We use mixtures and gradual transitions between materials to allow the system to learn cases with smooth transitions between states (like gradually cooked food). We also render images with materials inside transparent containers to support beverage and chemistry lab use cases. We use this dataset to train a Siamese net that identifies the same material in different objects, mixtures, and environments. The descriptor generated by this net can be used to identify the states of materials and their subclasses using a single image. We also present the first few-shot material recognition benchmark with natural images from a wide range of fields, including the state of foods and beverages, types of grounds, and many other use cases. We show that a net trained on the MatSim synthetic dataset outperforms state-of-the-art models like Clip on the benchmark and also achieves good results on other unsupervised material classification tasks. Dataset, generation code and trained models have been made available at: https://github.com/ZuseZ4/MatSim-Dataset-Generator-Scripts-And-Neural-net + + + + HairCLIPv2: Unifying Hair Editing via Proxy Feature Blending + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_HairCLIPv2_Unifying_Hair_Editing_via_Proxy_Feature_Blending_ICCV_2023_paper.pdf + Hair editing has made tremendous progress in recent years. Early hair editing methods use well-drawn sketches or masks to specify the editing conditions. Even though they can enable very fine-grained local control, such interaction modes are inefficient for the editing conditions that can be easily specified by language descriptions or reference images. Thanks to the recent breakthrough of cross-modal models (e.g., CLIP), HairCLIP is the first work that enables hair editing based on text descriptions or reference images. However, such text-driven and reference-driven interaction modes make HairCLIP unable to support fine-grained controls specified by sketch or mask. In this paper, we propose HairCLIPv2, aiming to support all the aforementioned interactions with one unified framework. Simultaneously, it improves upon HairCLIP with better irrelevant attributes (e.g., identity, background) preservation and unseen text descriptions support. The key idea is to convert all the hair editing tasks into hair transfer tasks, with editing conditions converted into different proxies accordingly. The editing effects are added upon the input image by blending the corresponding proxy features within the hairstyle or hair color feature spaces. Besides the unprecedented user interaction mode support, quantitative and qualitative experiments demonstrate the superiority of HairCLIPv2 in terms of editing effects, irrelevant attribute preservation and visual naturalness. Our code is available at https://github.com/wty-ustc/HairCLIPv2. 
+ + + + DocTr: Document Transformer for Structured Information Extraction in Documents + http://openaccess.thecvf.com//content/ICCV2023/papers/Liao_DocTr_Document_Transformer_for_Structured_Information_Extraction_in_Documents_ICCV_2023_paper.pdf + We present a new formulation for structured information extraction (SIE) from visually rich documents. We address the limitations of existing IOB tagging and graph-based formulations, which are either overly reliant on the correct ordering of input text or struggle with decoding a complex graph. Instead, motivated by anchor-based object detectors in computer vision, we represent an entity as an anchor word and a bounding box, and represent entity linking as the association between anchor words. This is more robust to text ordering, and maintains a compact graph for entity linking. The formulation motivates us to introduce 1) a Document Transformer (DocTr) that aims at detecting and associating entity bounding boxes in visually rich documents, and 2) a simple pre-training strategy that helps learn entity detection in the context of language. Evaluations on three SIE benchmarks show the effectiveness of the proposed formulation, and the overall approach outperforms existing solutions. + + + + Role-Aware Interaction Generation from Textual Description + http://openaccess.thecvf.com//content/ICCV2023/papers/Tanaka_Role-Aware_Interaction_Generation_from_Textual_Description_ICCV_2023_paper.pdf + This research tackles the problem of generating interaction between two human actors corresponding to textual description. We claim that certain interactions, which we call asymmetric interactions, involve a relationship between an actor and a receiver, whose motions significantly differ depending on the assigned role. However, existing studies of interaction generation attempt to learn the correspondence between a single label and the motions of both actors combined, overlooking differences in individual roles. We consider a novel problem of role-aware interaction generation, where roles can be designated before generation. We translate the text of the asymmetric interactions into active and passive voice to ensure the textual context is consistent with each role. We propose a model that learns to generate motions of the designated role, which together form a mutually consistent interaction. As the model treats individual motions separately, it can be pretrained to derive knowledge from single-person motion data for more accurate interactions. Moreover, we introduce a method inspired by Permutation Invariant Training (PIT) that can automatically learn which of the two actions corresponds to an actor or a receiver without additional annotation. We further present cases where existing evaluation metrics fail to accurately assess the quality of generated interactions, and propose a novel metric, Mutual Consistency, to address such shortcomings. Experimental results demonstrate the efficacy of our method, as well as the necessity of the proposed metric. Our code is available at https://github.com/line/Human-Interaction-Generation. + + + + Continual Learning for Personalized Co-speech Gesture Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ahuja_Continual_Learning_for_Personalized_Co-speech_Gesture_Generation_ICCV_2023_paper.pdf + Co-speech gestures are a key channel of human communication, making them important for personalized chat agents to generate. 
In the past, gesture generation models assumed that data for each speaker is available all at once, and in large amounts. However in practical scenarios, speaker data comes sequentially and in small amounts as the agent personalizes with more speakers, akin to a continual learning paradigm. While more recent works have shown progress in adapting to low-resource data, they catastrophically forget the gesture styles of initial speakers they were trained on. Also, prior generative continual learning works are not multimodal, making this space less studied. In this paper, we explore this new paradigm and propose C-DiffGAN: an approach that continually learns new speaker gesture styles with only a few minutes of per-speaker data, while retaining previously learnt styles. Inspired by prior continual learning works, C-DiffGAN encourages knowledge retention by 1) generating reminiscences of previous low-resource speaker data, then 2) crossmodally aligning to them to mitigate catastrophic forgetting. We quantitatively demonstrate improved performance and reduced forgetting over strong baselines through standard continual learning measures, reinforced by a qualitative user study that shows that our method produces more natural, style-preserving gestures. Code and videos can be found at https://chahuja.com/cdiffgan + + + + DreamTeacher: Pretraining Image Backbones with Deep Generative Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_DreamTeacher_Pretraining_Image_Backbones_with_Deep_Generative_Models_ICCV_2023_paper.pdf + In this work, we introduce a self-supervised feature representation learning framework DreamTeacher that utilizes generative networks for pre-training downstream image backbones. We propose to distill knowledge from a trained generative model into standard image backbones that have been well engineered for specific perception tasks. We investigate two types of knowledge distillation: 1) distilling learned generative features onto target image backbones as an alternative to pretraining these backbones on large labeled datasets such as ImageNet, and 2) distilling labels obtained from generative networks with task heads onto logits of target backbones. We perform extensive analysis on several generative models, dense prediction benchmarks, and several pre-training regimes. We empirically find that our DreamTeacher significantly outperforms existing self-supervised representation learning approaches across the board. Unsupervised ImageNet pre-training with DreamTeacher leads to significant improvements over ImageNet classification pre-training on downstream datasets, showcasing generative models, and diffusion generative models specifically, as a promising approach to representation learning on large, diverse datasets without requiring manual annotation. + + + + Decomposition-Based Variational Network for Multi-Contrast MRI Super-Resolution and Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Lei_Decomposition-Based_Variational_Network_for_Multi-Contrast_MRI_Super-Resolution_and_Reconstruction_ICCV_2023_paper.pdf + Multi-contrast MRI super-resolution (SR) and reconstruction methods aim to explore complementary information from the reference image to help the reconstruction of the target image. Existing deep learning-based methods usually manually design fusion rules to aggregate the multi-contrast images, fail to model their correlations accurately and lack certain interpretations. 
Against these issues, we propose a multi-contrast variational network (MC-VarNet) to explicitly model the relationship of multi-contrast images. Our model is constructed based on an intuitive motivation that multi-contrast images have consistent (edges and structures) and inconsistent (contrast) information. We thus build a model to reconstruct the target image and decompose the reference image as a common component and a unique component. In the feature interaction phase, only the common component is transferred to the target image. We solve the variational model and unfold the iterative solutions into a deep network. Hence, the proposed method combines the good interpretability of model-based methods with the powerful representation ability of deep learning-based methods. Experimental results on the multi-contrast MRI reconstruction and SR demonstrate the effectiveness of the proposed model. Especially, since we explicitly model the multi-contrast images, our model is more robust to the reference images with noises and large inconsistent structures. The code is available at https://github.com/lpcccc-cv/MC-VarNet. + + + + Semi-supervised Speech-driven 3D Facial Animation via Cross-modal Encoding + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Semi-supervised_Speech-driven_3D_Facial_Animation_via_Cross-modal_Encoding_ICCV_2023_paper.pdf + Existing Speech-driven 3D facial animation methods typically follow the supervised paradigm, involving regression from speech to 3D facial animation. This paradigm faces two major challenges: the high cost of supervision acquisition, and the ambiguity in mapping between speech and lip movements. To address these challenges, this study proposes a novel cross-modal semi-supervised framework, comprising a Speech-to-Image Transcoder and a Face-to-Geometry Regressor. The former jointly learns a common representation space from speech and image domains, enabling the transformation of speech into semantically-consistent facial images. The latter is responsible for reconstructing 3D facial meshes from the transformed images. Both modules require minimal effort to acquire the necessary training data, thereby obviating the dependence on costly supervised data. Furthermore, the joint learning scheme enables the fusion of intricate visual features into speech encoding, thereby facilitating the transformation of subtle speech variations into nuanced lip movements, ultimately enhancing the fidelity of 3D face reconstructions. Consequently, the ambiguity of the direct mapping of speech-to-animation is significantly reduced, leading to coherent and high-fidelity generation of lip motion. Extensive experiments demonstrate that our approach produces competitive results compared to supervised methods. + + + + WaveNeRF: Wavelet-based Generalizable Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_WaveNeRF_Wavelet-based_Generalizable_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Neural Radiance Field (NeRF) has shown impressive performance in novel view synthesis via implicit scene representation. However, it usually suffers from poor scalability as requiring densely sampled images for each new scene. Several studies have attempted to mitigate this problem by integrating Multi-View Stereo (MVS) technique into NeRF while they still entail a cumbersome fine-tuning process for new scenes. Notably, the rendering quality will drop severely without this fine-tuning process and the errors mainly appear around the high-frequency features. 
In the light of this observation, we design WaveNeRF, which integrates wavelet frequency decomposition into MVS and NeRF to achieve generalizable yet high-quality synthesis without any per-scene optimization. To preserve high-frequency information when generating 3D feature volumes, WaveNeRF builds Multi-View Stereo in the Wavelet domain by integrating the discrete wavelet transform into the classical cascade MVS, which disentangles high-frequency information explicitly. With that, disentangled frequency features can be injected into classic NeRF via a novel hybrid neural renderer to yield faithful high-frequency details, and an intuitive frequency-guided sampling strategy can be designed to suppress artifacts around high-frequency regions. Extensive experiments over three widely studied benchmarks show that WaveNeRF achieves superior generalizable radiance field modeling when only given three images as input. + + + + LoCUS: Learning Multiscale 3D-consistent Features from Posed Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Kloepfer_LoCUS_Learning_Multiscale_3D-consistent_Features_from_Posed_Images_ICCV_2023_paper.pdf + An important challenge for autonomous agents such as robots is to maintain a spatially and temporally consistent model of the world. It must be maintained through occlusions, previously-unseen views, and long time horizons (e.g., loop closure and re-identification). It is still an open question how to train such a versatile neural representation without supervision. We start from the idea that the training objective can be framed as a patch retrieval problem: given an image patch in one view of a scene, we would like to retrieve (with high precision and recall) all patches in other views that map to the same real-world location. One drawback is that this objective does not promote reusability of features: by being unique to a scene (achieving perfect precision/recall), a representation will not be useful in the context of other scenes. We find that it is possible to balance retrieval and reusability by constructing the retrieval set carefully, leaving out patches that map to far-away locations. Similarly, we can easily regulate the scale of the learned features (e.g., points, objects, or rooms) by adjusting the spatial tolerance for considering a retrieval to be positive. We optimize for (smooth) Average Precision (AP), in a single unified ranking-based objective. This objective also doubles as a criterion for choosing landmarks or keypoints, as patches with high AP. We show results creating sparse, multi-scale, semantic spatial maps composed of highly identifiable landmarks, with applications in landmark retrieval, localization, semantic segmentation and instance segmentation. + + + + Foreground Object Search by Distilling Composite Image Feature + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Foreground_Object_Search_by_Distilling_Composite_Image_Feature_ICCV_2023_paper.pdf + Foreground object search (FOS) aims to find compatible foreground objects for a given background image, producing realistic composite image. We observe that competitive retrieval performance could be achieved by using a discriminator to predict the compatibility of composite image, but this approach has unaffordable time cost. To this end, we propose a novel FOS method via distilling composite feature (DiscoFOS). Specifically, the abovementioned discriminator serves as teacher network. 
The student network employs two encoders to extract foreground feature and background feature. Their interaction output is enforced to match the composite image feature from the teacher network. Additionally, previous works did not release their datasets, so we contribute two datasets for FOS task: S-FOSD dataset with synthetic composite images and R-FOSD dataset with real composite images. Extensive experiments on our two datasets demonstrate the superiority of the proposed method over previous approaches. The dataset and code are available at https://github.com/bcmi/Foreground-Object-Search-Dataset-FOSD. + + + + Generalized Few-Shot Point Cloud Segmentation via Geometric Words + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Generalized_Few-Shot_Point_Cloud_Segmentation_via_Geometric_Words_ICCV_2023_paper.pdf + Existing fully-supervised point cloud segmentation methods suffer in the dynamic testing environment with emerging new classes. Few-shot point cloud segmentation algorithms address this problem by learning to adapt to new classes at the sacrifice of segmentation accuracy for the base classes, which severely impedes its practicality. This largely motivates us to present the first attempt at a more practical paradigm of generalized few-shot point cloud segmentation, which requires the model to generalize to new categories with only a few support point clouds and simultaneously retain the capability to segment base classes. We propose the geometric words to represent geometric components shared between the base and novel classes, and incorporate them into a novel geometric-aware semantic representation to facilitate better generalization to the new classes without forgetting the old ones. Moreover, we introduce geometric prototypes to guide the segmentation with geometric prior knowledge. Extensive experiments on S3DIS and ScanNet consistently illustrate the superior performance of our method over baseline methods. Our code is available at: https://github.com/Pixie8888/GFS-3DSeg_GWs. + + + + STEERER: Resolving Scale Variations for Counting and Localization via Selective Inheritance Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_STEERER_Resolving_Scale_Variations_for_Counting_and_Localization_via_Selective_ICCV_2023_paper.pdf + Scale variation is a deep-rooted problem in object counting, which has not been effectively addressed by existing scale-aware algorithms. An important factor is that they typically involve cooperative learning across multi-resolutions, which could be suboptimal for learning the most discriminative features from each scale. In this paper, we propose a novel method termed STEERER (SelecTivE inhERitance lEaRning) that addresses the issue of scale variations in object counting. STEERER selects the most suitable scale for patch objects to boost feature extraction and only inherits discriminative features from lower to higher resolution progressively. The main insights of STEERER are a dedicated Feature Selection and Inheritance Adaptor (FSIA), which selectively forwards scale-customized features at each scale, and a Masked Selection and Inheritance Loss (MSIL) that helps to achieve high-quality density maps across all scales. Our experimental results on nine datasets with counting and localization tasks demonstrate the unprecedented scale generalization ability of STEERER. Code is available at https://github.com/taohan10200/STEERER. 
 + + + + Geometric Viewpoint Learning with Hyper-Rays and Harmonics Encoding http://openaccess.thecvf.com//content/ICCV2023/papers/Min_Geometric_Viewpoint_Learning_with_Hyper-Rays_and_Harmonics_Encoding_ICCV_2023_paper.pdf Viewpoint is a fundamental modality that carries the interaction between observers and their environment. This paper proposes the first deep-learning framework for the viewpoint modality. The challenge in formulating learning frameworks for viewpoints resides in a suitable multimodal representation that links across the camera viewing space and 3D environment. Traditional approaches reduce the problem to image analysis instances, making them computationally expensive and not adequately modelling the intrinsic geometry and environmental context of 6DoF viewpoints. We address these issues in two ways. 1) We propose a generalized viewpoint representation forgoing the analysis of photometric pixels in favor of encoded viewing ray embeddings attained from point cloud learning frameworks. 2) We propose a novel SE(3)-bijective 6D viewing ray, hyper-ray, that addresses the DoF deficiency problem of using 5DoF viewing rays representing 6DoF viewpoints. We demonstrate that our approach outperforms existing methods in both efficiency and accuracy in novel real-world environments. + + + + C2F2NeUS: Cascade Cost Frustum Fusion for High Fidelity and Generalizable Neural Surface Reconstruction http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_C2F2NeUS_Cascade_Cost_Frustum_Fusion_for_High_Fidelity_and_Generalizable_ICCV_2023_paper.pdf There is an emerging effort to combine the two popular 3D frameworks using Multi-View Stereo (MVS) and Neural Implicit Surfaces (NIS) with a specific focus on the few-shot / sparse view setting. In this paper, we introduce a novel integration scheme that combines multi-view stereo with neural signed distance function representations, which potentially overcomes the limitations of both methods. MVS uses per-view depth estimation and cross-view fusion to generate accurate surfaces, while NIS relies on a common coordinate volume. Based on this strategy, we propose to construct per-view cost frustums for finer geometry estimation, and then fuse cross-view frustums and estimate the implicit signed distance functions to tackle artifacts that are due to noise and holes in the produced surface reconstruction. We further apply a cascade frustum fusion strategy to effectively capture global-local information and structural consistency. Finally, we apply cascade sampling and a pseudo-geometric loss to foster stronger integration between the two architectures. Extensive experiments demonstrate that our method reconstructs robust surfaces and outperforms existing state-of-the-art methods. + + + + Fast Full-frame Video Stabilization with Iterative Optimization http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Fast_Full-frame_Video_Stabilization_with_Iterative_Optimization_ICCV_2023_paper.pdf Video stabilization refers to the problem of transforming a shaky video into a visually pleasing one. The question of how to strike a good trade-off between visual quality and computational speed has remained one of the open challenges in video stabilization. Inspired by the analogy between wobbly frames and jigsaw puzzles, we propose an iterative optimization-based learning approach using synthetic datasets for video stabilization, which consists of two interacting submodules: motion trajectory smoothing and full-frame outpainting. 
First, we develop a two-level (coarse-to-fine) stabilizing algorithm based on the probabilistic flow field. The confidence map associated with the estimated optical flow is exploited to guide the search for shared regions through backpropagation. Second, we take a divide-and-conquer approach and propose a novel multiframe fusion strategy to render full-frame stabilized views. An important new insight brought about by our iterative optimization approach is that the target video can be interpreted as the fixed point of nonlinear mapping for video stabilization. We formulate video stabilization as a problem of minimizing the amount of jerkiness in motion trajectories, which guarantees convergence with the help of fixed-point theory. Extensive experimental results are reported to demonstrate the superiority of the proposed approach in terms of computational speed and visual quality. The code will be available on GitHub. + + + + Learning Semi-supervised Gaussian Mixture Models for Generalized Category Discovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Learning_Semi-supervised_Gaussian_Mixture_Models_for_Generalized_Category_Discovery_ICCV_2023_paper.pdf + In this paper, we address the problem of generalized category discovery (GCD), i.e., given a set of images where part of them are labelled and the rest are not, the task is to automatically cluster the images in the unlabelled data, leveraging the information from the labelled data, while the unlabelled data contain images from the labelled classes and also new ones. GCD is similar to semi-supervised learning (SSL) but is more realistic and challenging, as SSL assumes all the unlabelled images are from the same classes as the labelled ones. We also do not assume the class number in the unlabelled data is known a-priori, making the GCD problem even harder. To tackle the problem of GCD without knowing the class number, we propose an EM-like framework that alternates between representation learning and class number estimation. We propose a semi-supervised variant of the Gaussian Mixture Model (GMM) with a stochastic splitting and merging mechanism to dynamically determine the prototypes by examining the cluster compactness and separability. With these prototypes, we leverage prototypical contrastive learning for representation learning on the partially labelled data subject to the constraints imposed by the labelled data. Our framework alternates between these two steps until convergence. The cluster assignment for an unlabelled instance can then be retrieved by identifying its nearest prototype. We comprehensively evaluate our framework on both generic image classification datasets and challenging fine-grained object recognition datasets, achieving state-of-the-art performance. + + + + Rethinking Point Cloud Registration as Masking and Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Rethinking_Point_Cloud_Registration_as_Masking_and_Reconstruction_ICCV_2023_paper.pdf + Point cloud registration is essential in computer vision and robotics. In this paper, a critical observation is made that the invisible parts of each point cloud can be directly utilized as inherent masks, and the aligned point cloud pair can be regarded as the reconstruction target. Motivated by this observation, we rethink the point cloud registration problem as a masking and reconstruction task. To this end, a generic and concise auxiliary training network, the Masked Reconstruction Auxiliary Network (MRA), is proposed. 
The MRA reconstructs the complete point cloud by separately using the encoded features of each point cloud obtained from the backbone, guiding the contextual features in the backbone to capture fine-grained geometric details and the overall structures of point cloud pairs. Unlike recently developed high-performing methods that incorporate specific encoding methods into transformer models, which sacrifice versatility and introduce significant computational complexity during the inference process, our MRA can be easily inserted into other methods to further improve registration accuracy. Additionally, the MRA is detached after training, thereby avoiding extra computational complexity during the inference process. Building upon the MRA, we present a novel transformer-based method, the Masked Reconstruction Transformer (MRT), which achieves both precise and efficient alignment using standard transformers. Extensive experiments conducted on the 3DMatch, ModelNet40, and KITTI datasets demonstrate the superior performance of our MRT over state-of-the-art methods, and the efficiency of the MRA in improving registration accuracy. + + + + Human Part-wise 3D Motion Context Learning for Sign Language Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Human_Part-wise_3D_Motion_Context_Learning_for_Sign_Language_Recognition_ICCV_2023_paper.pdf + In this paper, we propose P3D, the human part-wise motion context learning framework for sign language recognition. Our main contributions lie in two dimensions: learning the part-wise motion context and employing the pose ensemble to utilize 2D and 3D pose jointly. First, our empirical observation implies that part-wise context encoding benefits the performance of sign language recognition. While previous methods of sign language recognition learned motion context from the sequence of the entire pose, we argue that such methods cannot exploit part-specific motion context. In order to utilize part-wise motion context, we propose the alternating combination of a part-wise encoding Transformer (PET) and a whole-body encoding Transformer (WET). PET encodes the motion contexts from a part sequence, while WET merges them into a unified context. By learning part-wise motion context, our P3D achieves superior performance on WLASL compared to previous state-of-the-art methods. Second, our framework is the first to ensemble 2D and 3D poses for sign language recognition. Since the 3D pose holds rich motion context and depth information to distinguish the words, our P3D outperformed the previous state-of-the-art methods employing a pose ensemble. + + + + Remembering Normality: Memory-guided Knowledge Distillation for Unsupervised Anomaly Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Gu_Remembering_Normality_Memory-guided_Knowledge_Distillation_for_Unsupervised_Anomaly_Detection_ICCV_2023_paper.pdf + Knowledge distillation (KD) has been widely explored in unsupervised anomaly detection (AD). The student is assumed to constantly produce representations of typical patterns within trained data, named "normality", and the representation discrepancy between the teacher and student model is identified as anomalies. However, it suffers from the "normality forgetting" issue. Trained on anomaly-free data, the student still well reconstructs anomalous representations for anomalies and is sensitive to fine patterns in normal data, which also appear in training. 
To mitigate this issue, we introduce a novel Memory-guided Knowledge-Distillation (MemKD) framework that adaptively modulates the normality of student features in detecting anomalies. Specifically, we first propose a normality recall memory (NR Memory) to strengthen the normality of student-generated features by recalling the stored normal information. In this sense, representations will not present anomalies and fine patterns will be well described. Subsequently, we employ a normality embedding learning strategy to promote information learning for the NR Memory. It constructs a normal exemplar set so that the NR Memory can memorize prior knowledge in anomaly-free data and later recall them from the query feature. Consequently, comprehensive experiments demonstrate that the proposed MemKD achieves promising results on five benchmarks, i.e., MVTec AD, VisA, MPDD, MVTec 3D-AD, and Eyecandies. + + + + Coordinate Quantized Neural Implicit Representations for Multi-view Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Coordinate_Quantized_Neural_Implicit_Representations_for_Multi-view_Reconstruction_ICCV_2023_paper.pdf + In recent years, huge progress has been made on learn- ing neural implicit representations from multi-view images for 3D reconstruction. As an additional input complement- ing coordinates, using sinusoidal functions as positional encodings plays a key role in revealing high frequency de- tails with coordinate-based neural networks. However, high frequency positional encodings make the optimization un- stable, which results in noisy reconstructions and artifacts in empty space. To resolve this issue in a general sense, we introduce to learn neural implicit representations with quantized coordinates, which reduces the uncertainty and ambiguity in the field during optimization. Instead of con- tinuous coordinates, we discretize continuous coordinates into discrete coordinates using nearest interpolation among quantized coordinates which are obtained by discretizing the field in an extremely high resolution. We use discrete coordinates and their positional encodings to learn implicit functions through volume rendering. This significantly re- duces the variations in the sample space, and triggers more multi-view consistency constraints on intersections of rays from different views, which enables to infer implicit function in a more effective way. Our quantized coordinates do not bring any computational burden, and can seamlessly work upon the latest methods. Our evaluations under the widely used benchmarks show our superiority over the state-of-the- art. Our code is available at https://github.com/ MachinePerceptionLab/CQ-NIR. + + + + MAS: Towards Resource-Efficient Federated Multiple-Task Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhuang_MAS_Towards_Resource-Efficient_Federated_Multiple-Task_Learning_ICCV_2023_paper.pdf + Federated learning (FL) is an emerging distributed machine learning method that empowers in-situ model training on decentralized edge devices. However, multiple simultaneous FL tasks could overload resource-constrained devices. In this work, we propose the first FL system to effectively coordinate and train multiple simultaneous FL tasks. We first formalize the problem of training simultaneous FL tasks. Then, we present our new approach, MAS (Merge and Split), to optimize the performance of training multiple simultaneous FL tasks. 
MAS starts by merging FL tasks into an all-in-one FL task with a multi-task architecture. After training for a few rounds, MAS splits the all-in-one FL task into two or more FL tasks by using the affinities among tasks measured during the all-in-one training. It then continues training each split of FL tasks based on model parameters from the all-in-one training. Extensive experiments demonstrate that MAS outperforms other methods while reducing training time by 2x and reducing energy consumption by 40%. We hope this work will inspire the community to further study and optimize training simultaneous FL tasks. + + + + Bridging Cross-task Protocol Inconsistency for Distillation in Dense Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Bridging_Cross-task_Protocol_Inconsistency_for_Distillation_in_Dense_Object_Detection_ICCV_2023_paper.pdf + Knowledge distillation (KD) has shown potential for learning compact models in dense object detection. However, the commonly used softmax-based distillation ignores the absolute classification scores for individual categories. Thus, the optimum of the distillation loss does not necessarily lead to the optimal student classification scores for dense object detectors. This cross-task protocol inconsistency is critical, especially for dense object detectors, since the foreground categories are extremely imbalanced. To address the issue of protocol differences between distillation and classification, we propose a novel distillation method with cross-task consistent protocols, tailored for the dense object detection. For classification distillation, we address the cross-task protocol inconsistency problem by formulating the classification logit maps in both teacher and student models as multiple binary-classification maps and applying a binary-classification distillation loss to each map. For localization distillation, we design an IoU-based Localization Distillation Loss that is free from specific network structures and can be compared with existing localization distillation losses. Our proposed method is simple but effective, and experimental results demonstrate its superiority over existing methods. + + + + Narrator: Towards Natural Control of Human-Scene Interaction Generation via Relationship Reasoning + http://openaccess.thecvf.com//content/ICCV2023/papers/Xuan_Narrator_Towards_Natural_Control_of_Human-Scene_Interaction_Generation_via_Relationship_ICCV_2023_paper.pdf + Naturally controllable human-scene interaction (HSI) generation has an important role in various fields, such as VR/AR content creation and human-centered AI. However, existing methods are unnatural and unintuitive in their controllability, which heavily limits their application in practice. Therefore, we focus on a challenging task of naturally and controllably generating realistic and diverse HSIs from textual descriptions. From human cognition, the ideal generative model should correctly reason about spatial relationships and interactive actions. To that end, we propose Narrator, a novel relationship reasoning-based generative approach using a conditional variation autoencoder for naturally controllable generation given a 3D scene and a textual description. Also, we model global and local spatial relationships in a 3D scene and a textual description respectively based on the scene graph, and introduce a part-level action mechanism to represent interactions as atomic body part states. 
In particular, benefiting from our relationship reasoning, we further propose a simple yet effective multi-human generation strategy, which is the first exploration for controllable multi-human scene interaction generation. Our extensive experiments and perceptual studies show that Narrator can controllably generate diverse interactions and significantly outperform existing works. + + + + Vision Relation Transformer for Unbiased Scene Graph Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Sudhakaran_Vision_Relation_Transformer_for_Unbiased_Scene_Graph_Generation_ICCV_2023_paper.pdf + Recent years have seen a growing interest in Scene Graph Generation (SGG), a comprehensive visual scene understanding task that aims to predict entity relationships using a relation encoder-decoder pipeline stacked on top of an object encoder-decoder backbone. Unfortunately, current SGG methods suffer from an information loss regarding the entities' local-level cues during the relation encoding process. To mitigate this, we introduce the Vision rElation TransfOrmer (VETO), consisting of a novel local-level entity relation encoder. We further observe that many existing SGG methods claim to be unbiased, but are still biased towards either head or tail classes. To overcome this bias, we introduce a Mutually Exclusive ExperT (MEET) learning strategy that captures important relation features without bias towards head or tail classes. Experimental results on the VG and GQA datasets demonstrate that VETO + MEET boosts the predictive performance by up to 47% over the state of the art while being 10x smaller. + + + + Revisit PCA-based Technique for Out-of-Distribution Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Guan_Revisit_PCA-based_Technique_for_Out-of-Distribution_Detection_ICCV_2023_paper.pdf + Out-of-distribution (OOD) detection is a desired ability to ensure the reliability and safety of intelligent systems. A scoring function is often designed to measure the degree of any new data being an OOD sample. While most designed scoring functions are based on a single source of information (e.g., the classifier's output, logits, or feature vector), recent studies demonstrate that fusion of multiple sources may help better detect OOD data. In this study, after detailed analysis of the issue in OOD detection by the conventional principal component analysis (PCA), we propose fusing a simple regularized PCA-based reconstruction error with other source of scoring function to further improve OOD detection performance. In particular, when combined with a strong energy score-based OOD method, the regularized reconstruction error helps achieve new state-of-the-art OOD detection results on multiple standard benchmarks. The code is available at https://github.com/SYSU-MIA-GROUP/pca-based-out-of-distribution-detection. + + + + Visually-Prompted Language Model for Fine-Grained Scene Graph Generation in an Open World + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_Visually-Prompted_Language_Model_for_Fine-Grained_Scene_Graph_Generation_in_an_ICCV_2023_paper.pdf + Scene Graph Generation (SGG) aims to extract <subject, predicate, object> relationships in images for vision understanding. Although recent works have made steady progress on SGG, they still suffer long-tail distribution that tail-predicates are more costly to train and hard to distinguish due to a small amount of annotated data compared to frequent predicates. 
Existing re-balancing strategies try to handle it via prior rules but still are confined to pre-defined conditions, which are not scalable for various models and datasets. In this paper, we propose a Cross-modal prediCate boosting (CaCao) framework, where a visually-prompted language model is learned to generate diverse fine-grained predicates in a low-resource way. The proposed CaCao can be applied in a plug-and-play fashion and automatically strengthen existing SGG to tackle the long-tailed problem. Based on that, we further introduce a novel Entangled cross-modal prompt approach for open-world predicate scene graph generation (Epic), where models can generalize to unseen predicates in a zero-shot manner. Comprehensive experiments on three benchmark datasets show that CaCao consistently boosts the performance of multiple scene graph generation models in a model-agnostic way. Moreover, our Epic achieves competitive performance on open-world predicate prediction. The data and code for this paper are publicly available. + + + + FishNet: A Large-scale Dataset and Benchmark for Fish Recognition, Detection, and Functional Trait Prediction http://openaccess.thecvf.com//content/ICCV2023/papers/Khan_FishNet_A_Large-scale_Dataset_and_Benchmark_for_Fish_Recognition_Detection_ICCV_2023_paper.pdf Aquatic species are essential components of the world's ecosystem, and the preservation of aquatic biodiversity is crucial for maintaining proper ecosystem functioning. Unfortunately, increasing anthropogenic pressures such as overfishing, climate change, and coastal development pose significant threats to aquatic biodiversity. To address this challenge, it is necessary to design automatic aquatic species monitoring systems that can help researchers and policymakers better understand changes in aquatic ecosystems and take appropriate actions to preserve biodiversity. However, the development of such systems is impeded by a lack of large-scale diverse aquatic species datasets. Existing aquatic species recognition datasets generally have a limited number of species and do not provide functional trait data, and so have only narrow potential for application. To address the need for generalized systems that can recognize, locate, and predict a wide array of species and their functional traits, we present FishNet, a large-scale diverse dataset containing 94,532 meticulously organized images from 17,357 aquatic species, organized according to aquatic biological taxonomy (order, family, genus, and species). We further build three benchmarks, i.e., fish classification, fish detection, and functional trait prediction, inspired by ecological research needs, to facilitate the development of aquatic species recognition systems, and promote further research in the field of aquatic ecology. Our FishNet dataset has the potential to encourage the development of more accurate and effective tools for the monitoring and protection of aquatic ecosystems, and hence support effective action toward the conservation of our planet's aquatic biodiversity. Our dataset and code will be released at https://fishnet-2023.github.io/. 
+ + + + Masked Spatio-Temporal Structure Prediction for Self-supervised Learning on Point Cloud Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Shen_Masked_Spatio-Temporal_Structure_Prediction_for_Self-supervised_Learning_on_Point_Cloud_ICCV_2023_paper.pdf + Recently, the community has made tremendous progress in developing effective methods for point cloud video understanding that learn from massive amounts of labeled data. However, annotating point cloud videos is usually notoriously expensive. Moreover, training via one or only a few traditional tasks (e.g., classification) may be insufficient to learn subtle details of the spatio-temporal structure existing in point cloud videos. In this paper, we propose a Masked Spatio-Temporal Structure Prediction (MaST-Pre) method to capture the structure of point cloud videos without human annotations. MaST-Pre is based on spatio-temporal point-tube masking and consists of two self-supervised learning tasks. First, by reconstructing masked point tubes, our method is able to capture the appearance information of point cloud videos. Second, to learn motion, we propose a temporal cardinality difference prediction task that estimates the change in the number of points within a point tube. In this way, MaST-Pre is forced to model the spatial and temporal structure in point cloud videos. Extensive experiments on MSRAction-3D, NTU-RGBD, NvGesture, and SHREC'17 demonstrate the effectiveness of the proposed method. + + + + DreamPose: Fashion Video Synthesis with Stable Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Karras_DreamPose_Fashion_Video_Synthesis_with_Stable_Diffusion_ICCV_2023_paper.pdf + We present DreamPose, a diffusion-based method for generating animated fashion videos from still images. Given an image and a sequence of human body poses, our method synthesizes a video containing both human and fabric motion. To achieve this, we transform a pretrained text-to-image model (Stable Diffusion) into a pose-and-image guided video synthesis model, using a novel finetuning strategy, a set of architectural changes to support the added conditioning signals, and techniques to encourage temporal consistency. We fine-tune on a collection of fashion videos from the UBC Fashion dataset. We evaluate our method on a variety of clothing styles and poses, and demonstrate that our method produces state-of-the-art results on fashion video animation. Video results are available at www.grail.cs.washington.edu/projects/dreampose. + + + + PhysDiff: Physics-Guided Human Motion Diffusion Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Yuan_PhysDiff_Physics-Guided_Human_Motion_Diffusion_Model_ICCV_2023_paper.pdf + Denoising diffusion models hold great promise for generating diverse and realistic human motions. However, existing motion diffusion models largely disregard the laws of physics in the diffusion process and often generate physically-implausible motions with pronounced artifacts such as floating, foot sliding, and ground penetration. This seriously impacts the quality of generated motions and limits their real-world application. To address this issue, we present a novel physics-guided motion diffusion model (PhysDiff), which incorporates physical constraints into the diffusion process. Specifically, we propose a physics-based motion projection module that uses motion imitation in a physics simulator to project the denoised motion of a diffusion step to a physically-plausible motion. 
The projected motion is further used in the next diffusion step to guide the denoising diffusion process. Intuitively, the use of physics in our model iteratively pulls the motion toward a physically-plausible space, which cannot be achieved by simple post-processing. Experiments on large-scale human motion datasets show that our approach achieves state-of-the-art motion quality and improves physical plausibility drastically (>78% for all datasets). + + + + SwiftFormer: Efficient Additive Attention for Transformer-based Real-time Mobile Vision Applications http://openaccess.thecvf.com//content/ICCV2023/papers/Shaker_SwiftFormer_Efficient_Additive_Attention_for_Transformer-based_Real-time_Mobile_Vision_Applications_ICCV_2023_paper.pdf Self-attention has become a de facto choice for capturing global context in various vision applications. However, its quadratic computational complexity with respect to image resolution limits its use in real-time applications, especially for deployment on resource-constrained mobile devices. Although hybrid approaches have been proposed to combine the advantages of convolutions and self-attention for a better speed-accuracy trade-off, the expensive matrix multiplication operations in self-attention remain a bottleneck. In this work, we introduce a novel efficient additive attention mechanism that effectively replaces the quadratic matrix multiplication operations with linear element-wise multiplications. Our design shows that the key-value interaction can be replaced with a linear layer without sacrificing any accuracy. Unlike previous state-of-the-art methods, our efficient formulation of self-attention enables its usage at all stages of the network. Using our proposed efficient additive attention, we build a series of models called "SwiftFormer" which achieves state-of-the-art performance in terms of both accuracy and mobile inference speed. Our small variant achieves 78.5% top-1 ImageNet-1K accuracy with only 0.8 ms latency on iPhone 14, which is more accurate and 2x faster compared to MobileViT-v2. Our code and models: https://tinyurl.com/5ft8v46w + + + + UpCycling: Semi-supervised 3D Object Detection without Sharing Raw-level Unlabeled Scenes http://openaccess.thecvf.com//content/ICCV2023/papers/Hwang_UpCycling_Semi-supervised_3D_Object_Detection_without_Sharing_Raw-level_Unlabeled_Scenes_ICCV_2023_paper.pdf Semi-supervised Learning (SSL) has received increasing attention in autonomous driving to reduce the enormous burden of 3D annotation. In this paper, we propose UpCycling, a novel SSL framework for 3D object detection with zero additional raw-level point cloud: learning from unlabeled de-identified intermediate features (i.e., "smashed" data) to preserve privacy. Since these intermediate features are naturally produced by the inference pipeline, no additional computation is required on autonomous vehicles. However, generating an effective consistency loss for unlabeled feature-level scenes turns out to be a critical challenge. The latest SSL frameworks for 3D object detection that enforce consistency regularization between different augmentations of an unlabeled raw-point scene become detrimental when applied to intermediate features. To solve the problem, we introduce a novel combination of hybrid pseudo labels and feature-level Ground Truth sampling (F-GT), which safely augments unlabeled multi-type 3D scene features and provides high-quality supervision. 
We implement UpCycling on two representative 3D object detection models: SECOND-IoU and PV-RCNN. Experiments on widely-used datasets (Waymo, KITTI, and Lyft) verify that UpCycling outperforms other augmentation methods applied at the feature level. In addition, while preserving privacy, UpCycling performs better or comparably to the state-of-the-art methods that utilize raw-level unlabeled data in both domain adaptation and partial-label scenarios. + + + + s-Adaptive Decoupled Prototype for Few-Shot Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Du_s-Adaptive_Decoupled_Prototype_for_Few-Shot_Object_Detection_ICCV_2023_paper.pdf + Meta-learning-based few-shot detectors use one K-average-pooled prototype (averaging along K-shot dimension) in both Region Proposal Network (RPN) and Detection head (DH) for query detection. Such plain operation would harm the FSOD performance in two aspects: 1) the poor quality of the prototype, and 2) the equivocal guidance due to the contradictions between RPN and DH. In this paper, we look closely into those critical issues and propose the s-Adaptive Decoupled Prototype (s-ADP) as a solution. To generate the high-quality prototype, we prioritize salient representations and deemphasize trivial variations by accessing both angle distance and magnitude dispersion (s) across K-support samples. To provide precise information for the query image, the prototype is decoupled into task-specific ones, which provide tailored guidance for 'where to look' and 'what to look for', respectively. Beyond that, we find our s-ADP can gradually strengthen the generalization power of encoding network during meta-training. So it can robustly deal with intra-class variations and a simple K- average pooling is enough to generate a high-quality prototype at meta-testing. We provide theoretical analysis to support its rationality. Extensive experiments on Pascal VOC, MS-COCO and FSOD datasets demonstrate that the proposed method achieves new state-of-the-art performance. Notably, our method surpasses the baseline model by a large margin - up to around 5.0% AP50 and 8.0% AP75 on novel classes. + + + + Semantics Meets Temporal Correspondence: Self-supervised Object-centric Learning in Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Qian_Semantics_Meets_Temporal_Correspondence_Self-supervised_Object-centric_Learning_in_Videos_ICCV_2023_paper.pdf + Self-supervised methods have shown remarkable progress in learning high-level semantics and low-level temporal correspondence. Building on these results, we take one step further and explore the possibility of integrating these two features to enhance object-centric representations. Our preliminary experiments indicate that query slot attention can extract different semantic components from the RGB feature map, while random sampling based slot attention can exploit temporal correspondence cues between frames to assist instance identification. Motivated by this, we propose a novel semantic-aware masked slot attention on top of the fused semantic features and correspondence maps. It comprises two slot attention stages with a set of shared learnable Gaussian distributions. In the first stage, we use the mean vectors as slot initialization to decompose potential semantics and generate semantic segmentation masks through iterative attention. 
In the second stage, for each semantics, we randomly sample slots from the corresponding Gaussian distribution and perform masked feature aggregation within the semantic area to exploit temporal correspondence patterns for instance identification. We adopt semantic- and instance-level temporal consistency as self-supervision to encourage temporally coherent object-centric representations. Our model effectively identifies multiple object instances with semantic structure, reaching promising results on unsupervised video object discovery. Furthermore, we achieve state-of-the-art performance on dense label propagation tasks, demonstrating the potential for object-centric analysis. + + + + First Session Adaptation: A Strong Replay-Free Baseline for Class-Incremental Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Panos_First_Session_Adaptation_A_Strong_Replay-Free_Baseline_for_Class-Incremental_Learning_ICCV_2023_paper.pdf + In Class-Incremental Learning (CIL) an image classification system is exposed to new classes in each learning session and must be updated incrementally. Methods approaching this problem have updated both the classification head and the feature extractor body at each session of CIL. In this work, we develop a baseline method, First Session Adaptation (FSA), that sheds light on the efficacy of existing CIL approaches, and allows us to assess the relative performance contributions from head and body adaption. FSA adapts a pre-trained neural network body only on the first learning session and fixes it thereafter; a head based on linear discriminant analysis (LDA), is then placed on top of the adapted body, allowing exact updates through CIL. FSA is replay-free i.e. it does not memorize examples from previous sessions of continual learning. To empirically motivate FSA, we first consider a diverse selection of 22 image-classification datasets, evaluating different heads and body adaptation techniques in high/low-shot offline settings. We find that the LDA head performs well and supports CIL out-of-the-box. We also find that Featurewise Layer Modulation (FiLM) adapters are highly effective in the few-shot setting, and full-body adaption in the high-shot setting. Second, we empirically investigate various CIL settings including high-shot CIL and few-shot CIL, including settings that have previously been used in the literature. We show that FSA significantly improves over the state-of-the-art in 15 of the 16 settings considered. FSA with FiLM adapters is especially performant in the few-shot setting. These results indicate that current approaches to continuous body adaptation are not working as expected. Finally, we propose a measure that can be applied to a set of unlabelled inputs which is predictive of the benefits of body adaptation. + + + + Ada3D : Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Ada3D__Exploiting_the_Spatial_Redundancy_with_Adaptive_Inference_for_ICCV_2023_paper.pdf + Voxel-based methods have achieved state-of-the-art performance for 3D object detection in autonomous driving. However, their significant computational and memory costs pose a challenge for their application to resource-constrained vehicles. One reason for this high resource consumption is the presence of a large number of redundant background points in Lidar point clouds, resulting in spatial redundancy in both 3D voxel and dense BEV map representations. 
To address this issue, we propose an adaptive inference framework called Ada3D, which focuses on exploiting the input-level spatial redundancy. Ada3D adaptively filters the redundant input, guided by a lightweight importance predictor and the unique properties of the Lidar point cloud. Additionally, we utilize the BEV features' intrinsic sparsity by introducing the Sparsity Preserving Batch Normalization. With Ada3D, we achieve 40% reduction for 3D voxels and decrease the density of 2D BEV feature maps from 100% to 20% without sacrificing accuracy. Ada3D reduces the model computational and memory cost by 5x, and achieves 1.52x/1.45x end-to-end GPU latency and 1.5x/4.5x GPU peak memory optimization for the 3D and 2D backbone respectively. + + + + Point Contrastive Prediction with Semantic Clustering for Self-Supervised Learning on Point Cloud Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Sheng_Point_Contrastive_Prediction_with_Semantic_Clustering_for_Self-Supervised_Learning_on_ICCV_2023_paper.pdf + We propose a unified point cloud video self-supervised learning framework for object-centric and scene-centric data. Previous methods commonly conduct representation learning at the clip or frame level and cannot well capture fine-grained semantics. Instead of contrasting the representations of clips or frames, in this paper, we propose a unified self-supervised framework by conducting contrastive learning at the point level. Moreover, we introduce a new pretext task by achieving semantic alignment of superpoints, which further facilitates the representations to capture semantic cues at multiple scales. In addition, due to the high redundancy in the temporal dimension of dynamic point clouds, directly conducting contrastive learning at the point level usually leads to massive undesired negatives and insufficient modeling of positive representations. To remedy this, we propose a selection strategy to retain proper negatives and make use of high-similarity samples from other instances as positive supplements. Extensive experiments show that our method outperforms supervised counterparts on a wide range of downstream tasks and demonstrates the superior transferability of the learned representations. + + + + Preserving Modality Structure Improves Multi-Modal Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Swetha_Preserving_Modality_Structure_Improves_Multi-Modal_Learning_ICCV_2023_paper.pdf + Self-supervised learning on large-scale multi-modal datasets allows learning semantically meaningful embeddings in a joint multi-modal representation space without relying on human annotations. These joint embeddings enable zero-shot cross-modal tasks like retrieval and classification. However, these methods often struggle to generalize well on out-of-domain data as they ignore the semantic structure present in modality-specific embeddings. In this context, we propose a novel Semantic-Structure-Preserving Consistency approach to improve generalizability by preserving the modality-specific relationships in the joint embedding space. To capture modality-specific semantic relationships between samples, we propose to learn multiple anchors and represent the multifaceted relationship between samples with respect to their relationship with these anchors. To assign multiple anchors to each sample, we propose a novel Multi-Assignment Sinkhorn-Knopp algorithm. Our experimentation demonstrates that our proposed approach learns semantically meaningful anchors in a self-supervised manner. 
Furthermore, our evaluation on MSR-VTT and YouCook2 datasets demonstrates that our proposed multi-anchor assignment based solution achieves state-of-the-art performance and generalizes to both inand out-of-domain datasets. Code: https://github.com/Swetha5/Multi_Sinkhorn_Knopp + + + + Pre-training Vision Transformers with Very Limited Synthesized Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Nakamura_Pre-training_Vision_Transformers_with_Very_Limited_Synthesized_Images_ICCV_2023_paper.pdf + Formula-driven supervised learning (FDSL) is a pre-training method that relies on synthetic images generated from mathematical formulae such as fractals. Prior work on FDSL has shown that pre-training vision transformers on such synthetic datasets can yield competitive accuracy on a wide range of downstream tasks. These synthetic images are categorized according to the parameters in the mathematical formula that generate them. In the present work, we hypothesize that the process for generating different instances for the same category in FDSL, can be viewed as a form of data augmentation. We validate this hypothesis by replacing the instances with data augmentation, which means we only need a single image per category. Our experiments show that this one-instance fractal database (OFDB) performs better than the original dataset where instances were explicitly generated. We further scale up OFDB to 21,000 categories and show that it matches, or even surpasses, the model pre-trained on ImageNet-21k in ImageNet-1k fine-tuning. The number of images in OFDB is 21k, whereas ImageNet-21k has 14M. This opens new possibilities for pre-training vision transformers with much smaller datasets. + + + + PADDLES: Phase-Amplitude Spectrum Disentangled Early Stopping for Learning with Noisy Labels + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_PADDLES_Phase-Amplitude_Spectrum_Disentangled_Early_Stopping_for_Learning_with_Noisy_ICCV_2023_paper.pdf + Convolutional Neural Networks (CNNs) are powerful in learning patterns of different vision tasks, but they are sensitive to label noise and may overfit to noisy labels during training. The early stopping strategy averts updating CNNs during the early training phase and is widely employed in the presence of noisy labels. Motivated by biological findings that the amplitude spectrum (AS) and phase spectrum (PS) in the frequency domain play different roles in the animal's vision system, we observe that PS, which captures more semantic information, can increase the robustness of CNNs to label noise, more so than AS can. We thus propose early stops at different times for AS and PS by disentangling the features of some layer(s) into AS and PS using Discrete Fourier Transform (DFT) during training. Our proposed Phase-AmplituDe DisentangLed Early Stopping (PADDLES) method is shown to be effective on both synthetic and real-world label-noise datasets. PADDLES outperforms other early stopping methods and obtains state-of-the-art performance. + + + + CLIP-Cluster: CLIP-Guided Attribute Hallucination for Face Clustering + http://openaccess.thecvf.com//content/ICCV2023/papers/Shen_CLIP-Cluster_CLIP-Guided_Attribute_Hallucination_for_Face_Clustering_ICCV_2023_paper.pdf + One of the most important yet rarely studied challenges for supervised face clustering is the large intra-class variance caused by different face attributes such as age, pose, and expression. 
Images of the same identity but with different face attributes usually tend to be clustered into different sub-clusters. For the first time, we proposed an attribute hallucination framework named CLIP-Cluster to address this issue, which first hallucinates multiple representations for different attributes with the powerful CLIP model and then pools them by learning neighbor-adaptive attention. Specifically, CLIP-Cluster first introduces a text-driven attribute hallucination module, which allows one to use natural language as the interface to hallucinate novel attributes for a given face image based on the well-aligned image-language CLIP space. Furthermore, we develop a neighbor-aware proxy generator that fuses the features describing various attributes into a proxy feature to build a bridge among different sub-clusters and reduce the intra-class variance. The proxy feature is generated by adaptively attending to the hallucinated visual features and the source one based on the local neighbor information. On this basis, a graph built with the proxy representations is used for subsequent clustering operations. Extensive experiments show our proposed approach outperforms state-of-the-art face clustering methods with high inference efficiency. + + + + Compositional Feature Augmentation for Unbiased Scene Graph Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Compositional_Feature_Augmentation_for_Unbiased_Scene_Graph_Generation_ICCV_2023_paper.pdf + Scene Graph Generation (SGG) aims to detect all the visual relation triplets <sub, pred, obj> in a given image. With the emergence of various advanced techniques for better utilizing both the intrinsic and extrinsic information in each relation triplet, SGG has achieved great progress over the recent years. However, due to the ubiquitous long-tailed predicate distributions, today's SGG models are still easily biased to the head predicates. Currently, the most prevalent debiasing solutions for SGG are re-balancing methods, e.g., changing the distributions of original training samples. In this paper, we argue that all existing re-balancing strategies fail to increase the diversity of the relation triplet features of each predicate, which is critical for robust SGG. To this end, we propose a novel Compositional Feature Augmentation (CFA) strategy, which is the first unbiased SGG work to mitigate the bias issue from the perspective of increasing the diversity of triplet features. Specifically, we first decompose each relation triplet feature into two components: intrinsic feature and extrinsic feature, which correspond to the intrinsic characteristics and extrinsic contexts of a relation triplet, respectively. Then, we design two different feature augmentation modules to enrich the feature diversity of original relation triplets by replacing or mixing up either their intrinsic or extrinsic features from other samples. Due to its model-agnostic nature, CFA can be seamlessly incorporated into any SGG model. Extensive ablations have shown that CFA can achieve a new state-of-the-art performance on the trade-off between different metrics. + + + + Foreground and Text-lines Aware Document Image Rectification + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Foreground_and_Text-lines_Aware_Document_Image_Rectification_ICCV_2023_paper.pdf + This paper aims at the distorted document image rectification problem, the objective to eliminate the geometric distortion in the document images and realize document intelligence. 
Improving the readability of distorted documents is crucial to effectively extract information from deformed images. According to our observations, the foreground and text-line of the original warped image can represent the deformation tendency. However, previous distorted image rectification methods pay little attention to the readability of the warped paper. In this paper, we focus on the foreground and text-line regions of distorted paper and propose a global and local fusion method to improve the rectification effect of distorted images and enhance the readability of document images. We introduce cross attention to capture the features of the foreground and text-lines in the warped document and effectively fuse them. The proposed method is evaluated quantitatively and qualitatively on the public DocUNet benchmark and the DIR300 dataset, where it achieves state-of-the-art performance. Experimental analysis shows the proposed method performs overall geometric rectification of distorted images well and effectively improves document readability (using the metrics of Character Error Rate and Edit Distance). The code is available at https://github.com/xiaomore/Document-Image-Dewarping. + + + + INSTA-BNN: Binary Neural Network with INSTAnce-aware Threshold + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_INSTA-BNN_Binary_Neural_Network_with_INSTAnce-aware_Threshold_ICCV_2023_paper.pdf + Binary Neural Networks (BNNs) have emerged as a promising solution for reducing the memory footprint and compute costs of deep neural networks, but they suffer from quality degradation due to the lack of freedom as activations and weights are constrained to the binary values. To compensate for the accuracy drop, we propose a novel BNN design called Binary Neural Network with INSTAnce-aware threshold (INSTA-BNN), which controls the quantization threshold dynamically in an input-dependent or instance-aware manner. According to our observation, higher-order statistics can be a representative metric to estimate the characteristics of the input distribution. INSTA-BNN is designed to adjust the threshold dynamically considering various information, including higher-order statistics, but it is also optimized judiciously to realize minimal overhead on a real device. Our extensive study shows that INSTA-BNN outperforms the baseline by 3.0% and 2.8% on the ImageNet classification task with comparable computing cost, achieving 68.5% and 72.2% top-1 accuracy on ResNet-18 and MobileNetV1 based models, respectively. + + + + When Epipolar Constraint Meets Non-Local Operators in Multi-View Stereo + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_When_Epipolar_Constraint_Meets_Non-Local_Operators_in_Multi-View_Stereo_ICCV_2023_paper.pdf + Learning-based multi-view stereo (MVS) methods heavily rely on feature matching, which requires distinctive and descriptive representations. An effective solution is to apply non-local feature aggregation, e.g., Transformer. Albeit useful, these techniques introduce heavy computation overheads for MVS, as each pixel densely attends to the whole image. In contrast, we propose to constrain non-local feature augmentation within a pair of lines: each point only attends to the corresponding pair of epipolar lines. Our idea takes inspiration from the classic epipolar geometry, which shows that one point with different depth hypotheses will be projected to the epipolar line on the other view. This constraint reduces the 2D search space to the epipolar line in stereo matching. &#13;
Similarly, this suggests that the matching of MVS is to distinguish a series of points lying on the same line. Inspired by this point-to-line search, we devise a line-to-point non-local augmentation strategy. We first devise an optimized searching algorithm to split the 2D feature maps into epipolar line pairs. Then, an Epipolar Transformer (ET) performs non-local feature augmentation among epipolar line pairs. We incorporate the ET into a learning-based MVS baseline, named ET-MVSNet. ET-MVSNet achieves state-of-the-art reconstruction performance on both the DTU and Tanks-and-Temples benchmark with high efficiency. Code is available at https://github.com/TQTQliu/ET-MVSNet. + + + + LU-NeRF: Scene and Pose Estimation by Synchronizing Local Unposed NeRFs + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_LU-NeRF_Scene_and_Pose_Estimation_by_Synchronizing_Local_Unposed_NeRFs_ICCV_2023_paper.pdf + A critical obstacle preventing NeRF models from being deployed broadly in the wild is their reliance on accurate camera poses. Consequently, there is growing interest in extending NeRF models to jointly optimize camera poses and scene representation, which offers an alternative to off-the-shelf SfM pipelines which have well-understood failure modes. Existing approaches for unposed NeRF operate under limited assumptions, such as a prior pose distribution or coarse pose initialization, making them less effective in a general setting. In this work, we propose a novel approach, LU-NeRF, that jointly estimates camera poses and neural radiance fields with relaxed assumptions on pose configuration. Our approach operates in a local-to-global manner, where we first optimize over local subsets of the data, dubbed mini-scenes. LU-NeRF estimates local pose and geometry for this challenging few-shot task. The mini-scene poses are brought into a global reference frame through a robust pose synchronization step, where a final global optimization of pose and scene can be performed. We show our LU-NeRF pipeline outperforms prior attempts at unposed NeRF without making restrictive assumptions on the pose prior. This allows us to operate in the general SE(3) pose setting, unlike the baselines. Our results also indicate our model can be complementary to feature-based SfM pipelines as it compares favorably to COLMAP on low-texture and low-resolution images. + + + + GrowCLIP: Data-Aware Automatic Model Growing for Large-scale Contrastive Language-Image Pre-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Deng_GrowCLIP_Data-Aware_Automatic_Model_Growing_for_Large-scale_Contrastive_Language-Image_Pre-Training_ICCV_2023_paper.pdf + Cross-modal pre-training has shown impressive performance on a wide range of downstream tasks, benefiting from massive image-text pairs collected from the Internet. In practice, online data are growing constantly, highlighting the importance of the ability of pre-trained model to learn from data that is continuously growing. Existing works on cross-modal pre-training mainly focus on training a network with fixed architecture. However, it is impractical to limit the model capacity when considering the continuously growing nature of pre-training data in real-world applications. On the other hand, it is important to utilize the knowledge in current model to obtain efficient training and better performance. 
To address the above issues, in this paper, we propose GrowCLIP, a data-driven automatic model growing algorithm for contrastive language-image pre-training with continuous image-text pairs as input. Specifically, we adopt a dynamic growth space and seek out the optimal architecture at each growth step to adapt to online learning scenarios. In addition, a shared encoder is proposed in our growth space to enhance the degree of cross-modal fusion. Besides, we explore the effect of growth in different dimensions, which could provide future references for the design of cross-modal model architecture. Finally, we employ parameter inheriting with momentum (PIM) to maintain the previous knowledge and address the local minimum dilemma. Compared with the existing methods, GrowCLIP improves average top-1 accuracy by 2.3% on zero-shot image classification across 9 downstream tasks. As for zero-shot image retrieval, GrowCLIP improves top-1 image-to-text recall by 1.2% on the Flickr30K dataset. + + + + LA-Net: Landmark-Aware Learning for Reliable Facial Expression Recognition under Label Noise + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_LA-Net_Landmark-Aware_Learning_for_Reliable_Facial_Expression_Recognition_under_Label_ICCV_2023_paper.pdf + Facial expression recognition (FER) remains a challenging task due to the ambiguity of expressions. The derived noisy labels significantly harm the performance in real-world scenarios. To address this issue, we present a new FER model named Landmark-Aware Net (LA-Net), which leverages facial landmarks to mitigate the impact of label noise from two perspectives. Firstly, LA-Net uses landmark information to suppress the uncertainty in expression space and constructs the label distribution of each sample by neighborhood aggregation, which in turn improves the quality of training supervision. Secondly, the model incorporates landmark information into expression representations using the devised expression-landmark contrastive loss. The enhanced expression feature extractor can be less susceptible to label noise. Our method can be integrated with any deep neural network for better training supervision without introducing extra inference costs. We conduct extensive experiments on both in-the-wild datasets and synthetic noisy datasets and demonstrate that LA-Net achieves state-of-the-art performance. + + + + SGAligner: 3D Scene Alignment with Scene Graphs + http://openaccess.thecvf.com//content/ICCV2023/papers/Sarkar_SGAligner_3D_Scene_Alignment_with_Scene_Graphs_ICCV_2023_paper.pdf + Building 3D scene graphs has recently emerged as a topic in scene representation for several embodied AI applications to represent the world in a structured and rich manner. With their increased use in solving downstream tasks (e.g., navigation and room rearrangement), can we leverage and recycle them for creating 3D maps of environments, a pivotal step in agent operation? We focus on the fundamental problem of aligning pairs of 3D scene graphs whose overlap can range from zero to partial and can contain arbitrary changes. We propose SGAligner, the first method for aligning pairs of 3D scene graphs that is robust to in-the-wild scenarios (i.e., unknown overlap - if any - and changes in the environment). We draw inspiration from multi-modality knowledge graphs and use contrastive learning to learn a joint, multi-modal embedding space. We evaluate on the 3RScan dataset and further showcase that our method can be used for estimating the transformation between pairs of 3D scenes. &#13;
Since benchmarks for these tasks are missing, we create them on this dataset. The code, benchmark, and trained models are available on the project website. + + + + Efficient Discovery and Effective Evaluation of Visual Perceptual Similarity: A Benchmark and Beyond + http://openaccess.thecvf.com//content/ICCV2023/papers/Barkan_Efficient_Discovery_and_Effective_Evaluation_of_Visual_Perceptual_Similarity_A_ICCV_2023_paper.pdf + Visual similarities discovery (VSD) is an important task with broad e-commerce applications. Given an image of a certain object, the goal of VSD is to retrieve images of different objects with high perceptual visual similarity. Although being a highly addressed problem, the evaluation of proposed methods for VSD is often based on a proxy of an identification-retrieval task, evaluating the ability of a model to retrieve different images of the same object. We posit that evaluating VSD methods based on identification tasks is limited, and faithful evaluation must rely on expert annotations. In this paper, we introduce the first large-scale fashion visual similarity benchmark dataset, consisting of more than 110K expert-annotated image pairs. Besides this major contribution, we share insight from the challenges we faced while curating this dataset. Based on these insights, we propose a novel and efficient labeling procedure that can be applied to any dataset. Our analysis examines its limitations and inductive biases, and based on these findings, we propose metrics to mitigate those limitations. Though our primary focus lies on visual similarity, the methodologies we present have broader applications for discovering and evaluating perceptual similarity across various domains. + + + + Improving Representation Learning for Histopathologic Images with Cluster Constraints + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Improving_Representation_Learning_for_Histopathologic_Images_with_Cluster_Constraints_ICCV_2023_paper.pdf + Recent advances in whole-slide image (WSI) scanners and computational capabilities have significantly propelled the application of artificial intelligence in histopathology slide analysis. While these strides are promising, current supervised learning approaches for WSI analysis come with the challenge of exhaustively labeling high-resolution slides--a process that is both labor-intensive and time-consuming. In contrast, self-supervised learning (SSL) pretraining strategies are emerging as a viable alternative, given that they don't rely on explicit data annotations. These SSL strategies are quickly bridging the performance disparity with their supervised counterparts. In this context, we introduce an SSL framework. This framework aims for transferable representation learning and semantically meaningful clustering by synergizing invariance loss and clustering loss in WSI analysis. Notably, our approach outperforms common SSL methods in downstream classification and clustering tasks, as evidenced by tests on the Camelyon16 and a pancreatic cancer dataset. + + + + Learning Neural Implicit Surfaces with Object-Aware Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Learning_Neural_Implicit_Surfaces_with_Object-Aware_Radiance_Fields_ICCV_2023_paper.pdf + Recent progress on multi-view 3D object reconstruction has featured neural implicit surfaces via learning high-fidelity radiance fields. However, most approaches hinge on the visual hull derived from cost-expensive silhouette masks to obtain object surfaces. 
In this paper, we propose a novel Object-aware Radiance Fields (ORF) to automatically learn an object-aware geometry reconstruction. The geometric correspondences between multi-view 2D object regions and 3D implicit/explicit object surfaces are additionally exploited to boost the learning of object surfaces. Technically, a critical transparency discriminator is designed to distinguish the object-intersected and object-bypassed rays based on the estimated 2D object regions, leading to 3D implicit object surfaces. Such implicit surfaces can be directly converted into explicit object surfaces (e.g., meshes) via marching cubes. Then, we build the geometric correspondence between 2D planes and 3D meshes by rasterization, and project the estimated object regions into 3D explicit object surfaces by aggregating the object information across multiple views. The aggregated object information in 3D explicit object surfaces is further reprojected back to 2D planes, aiming to update 2D object regions and enforce them to be multi-view consistent. Extensive experiments on DTU and BlendedMVS verify the capability of ORF to produce comparable surfaces against the state-of-the-art models that demand silhouette masks. + + + + PADCLIP: Pseudo-labeling with Adaptive Debiasing in CLIP for Unsupervised Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Lai_PADCLIP_Pseudo-labeling_with_Adaptive_Debiasing_in_CLIP_for_Unsupervised_Domain_ICCV_2023_paper.pdf + Traditional Unsupervised Domain Adaptation (UDA) leverages the labeled source domain to tackle the learning tasks on the unlabeled target domain. It can be more challenging when a large domain gap exists between the source and the target domain. A more practical setting is to utilize a large-scale pre-trained model to fill the domain gap. For example, CLIP shows promising zero-shot generalizability to bridge the gap. However, after applying traditional fine-tuning to specifically adjust CLIP on a target domain, CLIP suffers from catastrophic forgetting issues where the new domain knowledge can quickly override CLIP's pre-trained knowledge and decreases the accuracy by half. We propose Catastrophic Forgetting Measurement (CFM) to adjust the learning rate to avoid excessive training (thus mitigating the catastrophic forgetting issue). We then utilize CLIP's zero-shot prediction to formulate a Pseudo-labeling setting with Adaptive Debiasing in CLIP (PADCLIP) by adjusting causal inference with our momentum and CFM. Our PADCLIP allows end-to-end training on source and target domains without extra overhead, and we achieved the best results on four public datasets, with a significant improvement (+18.5% accuracy) on DomainNet. + + + + Causal-DFQ: Causality Guided Data-Free Network Quantization + http://openaccess.thecvf.com//content/ICCV2023/papers/Shang_Causal-DFQ_Causality_Guided_Data-Free_Network_Quantization_ICCV_2023_paper.pdf + Model quantization, which aims to compress deep neural networks and accelerate inference speed, has greatly facilitated the development of cumbersome models on mobile and edge devices. There is a common assumption in quantization methods from prior works that training data is available. In practice, however, this assumption cannot always be fulfilled due to reasons of privacy and security, rendering these methods inapplicable in real-life situations. Thus, data-free network quantization has recently received significant attention in neural network compression. 
Causal reasoning provides an intuitive way to model causal relationships to eliminate data-driven correlations, making causality an essential component of analyzing data-free problems. However, causal formulations of data-free quantization are inadequate in the literature. To bridge this gap, we construct a causal graph to model the data generation and discrepancy reduction between the pre-trained and quantized models. Inspired by the causal understanding, we propose the Causality-guided Data-free Network Quantization method, Causal-DFQ, to eliminate the reliance on data via approaching an equilibrium of causality-driven intervened distributions. Specifically, we design a content-style-decoupled generator, synthesizing images conditioned on the relevant and irrelevant factors; then we propose a discrepancy reduction loss to align the intervened distributions of the pre-trained and quantized models. It is worth noting that our work is the first attempt to introduce causality into the data-free quantization problem. Extensive experiments demonstrate the efficacy of Causal-DFQ. + + + + CancerUniT: Towards a Single Unified Model for Effective Detection, Segmentation, and Diagnosis of Eight Major Cancers Using a Large Collection of CT Scans + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_CancerUniT_Towards_a_Single_Unified_Model_for_Effective_Detection_Segmentation_ICCV_2023_paper.pdf + Human readers or radiologists routinely perform full-body multi-organ multi-disease detection and diagnosis in clinical practice, while most medical AI systems are built to focus on single organs with a narrow list of a few diseases. This might severely limit AI's clinical adoption. A certain number of AI models need to be assembled non-trivially to match the diagnostic process of a human reading a CT scan. In this paper, we construct a Unified Tumor Transformer (CancerUniT) model to jointly detect tumor existence & location and diagnose tumor characteristics for eight major cancers in CT scans. CancerUniT is a query-based Mask Transformer model with the output of multi-tumor prediction. We decouple the object queries into organ queries, tumor detection queries and tumor diagnosis queries, and further establish hierarchical relationships among the three groups. This clinically-inspired architecture effectively assists inter- and intra-organ representation learning of tumors and facilitates the resolution of these complex, anatomically related multi-organ cancer image reading tasks. CancerUniT is trained end-to-end using curated large-scale CT images of 10,042 patients including eight major types of cancers and occurring non-cancer tumors (all are pathology-confirmed with 3D tumor masks annotated by radiologists). On the test set of 631 patients, CancerUniT has demonstrated strong performance under a set of clinically relevant evaluation metrics, substantially outperforming both multi-disease methods and an assembly of eight single-organ expert models in tumor detection, segmentation, and diagnosis. This moves one step closer towards a universal high-performance cancer screening tool. + + + + Dual Meta-Learning with Longitudinally Consistent Regularization for One-Shot Brain Tissue Segmentation Across the Human Lifespan + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Dual_Meta-Learning_with_Longitudinally_Consistent_Regularization_for_One-Shot_Brain_Tissue_ICCV_2023_paper.pdf + Brain tissue segmentation is essential for neuroscience and clinical studies. &#13;
However, segmentation on longitudinal data is challenging due to dynamic brain changes across the lifespan. Previous research mainly focuses on self-supervision with regularizations and loses longitudinal generalization when fine-tuning on a specific age group. In this paper, we propose a dual meta-learning paradigm to learn longitudinally consistent representations that persist after fine-tuning. Specifically, we learn a plug-and-play feature extractor to extract longitudinal-consistent anatomical representations by meta-feature learning and a well-initialized task head for fine-tuning by meta-initialization learning. Besides, two class-aware regularizations are proposed to encourage longitudinal consistency. Experimental results on the iSeg2019 and ADNI datasets demonstrate the effectiveness of our method. + + + + Active Self-Supervised Learning: A Few Low-Cost Relationships Are All You Need + http://openaccess.thecvf.com//content/ICCV2023/papers/Cabannes_Active_Self-Supervised_Learning_A_Few_Low-Cost_Relationships_Are_All_You_ICCV_2023_paper.pdf + Self-Supervised Learning (SSL) has emerged as the solution of choice to learn transferable representations from unlabeled data. However, SSL requires building samples that are known to be semantically akin, i.e., positive views. Requiring such knowledge is the main limitation of SSL and is often tackled by ad-hoc strategies, e.g., applying known data augmentations to the same input. In this work, we generalize and formalize this principle through Positive Active Learning (PAL) where an oracle queries semantic relationships between samples. PAL achieves three main objectives. First, it is a theoretically grounded learning framework that encapsulates standard SSL but also supervised and semi-supervised learning depending on the employed oracle. Second, it provides a consistent algorithm to embed a priori knowledge, e.g. some observed labels, into any SSL losses without any change in the training pipeline. Third, it provides a proper active learning framework yielding low-cost solutions to annotate datasets, arguably bridging the gap between theory and practice of active learning that is based on simple-to-answer-by-non-experts queries of semantic relationships between inputs. + + + + Wasserstein Expansible Variational Autoencoder for Discriminative and Generative Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Wasserstein_Expansible_Variational_Autoencoder_for_Discriminative_and_Generative_Continual_Learning_ICCV_2023_paper.pdf + Task-Free Continual Learning (TFCL) represents a challenging learning paradigm where a model is trained on the non-stationary data distributions without any knowledge of the task information, thus representing a more practical approach. Despite promising achievements by the Variational Autoencoder (VAE) mixtures in continual learning, such methods ignore the redundancy among the probabilistic representations of their components when performing model expansion, leading to mixture components learning similar tasks. This paper proposes the Wasserstein Expansible Variational Autoencoder (WEVAE), which evaluates the statistical similarity between the probabilistic representation of new data and that represented by each mixture component and then uses it for deciding when to expand the model. Such a mechanism can avoid unnecessary model expansion while ensuring the knowledge diversity among the trained components. &#13;
In addition, we propose an energy-based sample selection approach that assigns high energies to novel samples and low energies to samples that are similar to the model's knowledge. Extensive empirical studies on both supervised and unsupervised benchmark tasks demonstrate that our model outperforms all competing methods. The code is available at https://github.com/dtuzi123/WEVAE/. + + + + Label-Free Event-based Object Recognition via Joint Learning with Image Reconstruction from Events + http://openaccess.thecvf.com//content/ICCV2023/papers/Cho_Label-Free_Event-based_Object_Recognition_via_Joint_Learning_with_Image_Reconstruction_ICCV_2023_paper.pdf + Recognizing objects from sparse and noisy events becomes extremely difficult when paired images and category labels do not exist. In this paper, we study label-free event-based object recognition where category labels and paired images are not available. To this end, we propose a joint formulation of object recognition and image reconstruction in a complementary manner. Our method first reconstructs images from events and performs object recognition through Contrastive Language-Image Pre-training (CLIP), enabling better recognition through a rich context of images. Since the category information is essential in reconstructing images, we propose category-guided attraction loss and category-agnostic repulsion loss to bridge the textual features of predicted categories and the visual features of reconstructed images using CLIP. Moreover, we introduce a reliable data sampling strategy and local-global reconstruction consistency to boost joint learning of the two tasks. To enhance the accuracy of prediction and quality of reconstruction, we also propose a prototype-based approach using unpaired images. Extensive experiments demonstrate the superiority of our method and its extensibility for zero-shot object recognition. Our project code is available at https://github.com/Chohoonhee/Ev-LaFOR. + + + + Gloss-Free Sign Language Translation: Improving from Visual-Language Pretraining + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Gloss-Free_Sign_Language_Translation_Improving_from_Visual-Language_Pretraining_ICCV_2023_paper.pdf + Sign Language Translation (SLT) is a challenging task due to its cross-domain nature, involving the translation of visual-gestural language to text. Many previous methods employ an intermediate representation, i.e., gloss sequences, to facilitate SLT, thus transforming it into a two-stage task of sign language recognition (SLR) followed by sign language translation (SLT). However, the scarcity of gloss-annotated sign language data, combined with the information bottleneck in the mid-level gloss representation, has hindered the further development of the SLT task. To address this challenge, we propose a novel Gloss-Free SLT method based on Visual-Language Pretraining (GFSLT-VLP), which improves SLT by inheriting language-oriented prior knowledge from pre-trained models, without any gloss annotation assistance. Our approach involves two stages: (i) integrating Contrastive Language-Image Pre-training (CLIP) with masked self-supervised learning to create pre-tasks that bridge the semantic gap between visual and textual representations and restore masked sentences, and (ii) constructing an end-to-end architecture with an encoder-decoder-like structure that inherits the parameters of the pre-trained Visual Encoder and Text Decoder from the first stage. &#13;
The seamless combination of these novel designs forms a robust sign language representation and significantly improves gloss-free sign language translation. In particular, we have achieved unprecedented improvements in terms of BLEU-4 score on the PHOENIX14T dataset (>=+5) and the CSL-Daily dataset (>=+3) compared to state-of-the-art gloss-free SLT methods. Furthermore, our approach also achieves competitive results on the PHOENIX14T dataset when compared with most of the gloss-based methods. + + + + Shrinking Class Space for Enhanced Certainty in Semi-Supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Shrinking_Class_Space_for_Enhanced_Certainty_in_Semi-Supervised_Learning_ICCV_2023_paper.pdf + Semi-supervised learning is attracting blooming attention, due to its success in combining unlabeled data. To mitigate potentially incorrect pseudo labels, recent frameworks mostly set a fixed confidence threshold to discard uncertain samples. This practice ensures high-quality pseudo labels, but incurs a relatively low utilization of the whole unlabeled set. In this work, our key insight is that these uncertain samples can be turned into certain ones, as long as the confusion classes for the top-1 class are detected and removed. Invoked by this, we propose a novel method dubbed ShrinkMatch to learn uncertain samples. For each uncertain sample, it adaptively seeks a shrunk class space, which merely contains the original top-1 class, as well as remaining less likely classes. Since the confusion ones are removed in this space, the re-calculated top-1 confidence can satisfy the pre-defined threshold. We then impose a consistency regularization between a pair of strongly and weakly augmented samples in the shrunk space to strive for discriminative representations. Furthermore, considering the varied reliability among uncertain samples and the gradually improved model during training, we correspondingly design two reweighting principles for our uncertain loss. Our method exhibits impressive performance on widely adopted benchmarks. Code is available at https://github.com/LiheYoung/ShrinkMatch. + + + + eP-ALM: Efficient Perceptual Augmentation of Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Shukor_eP-ALM_Efficient_Perceptual_Augmentation_of_Language_Models_ICCV_2023_paper.pdf + Large Language Models (LLMs) have so far impressed the world, with unprecedented capabilities that emerge in models at large scales. On the vision side, transformer models (i.e., ViT) are following the same trend, achieving the best performance on challenging benchmarks. With the abundance of such unimodal models, a natural question arises; do we need also to follow this trend to tackle multimodal tasks? In this work, we propose to rather direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception. Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency. In particular, they still train a large number of parameters, rely on large multimodal pretraining, use encoders (e.g., CLIP) trained on huge image-text datasets, and add significant inference overhead. In addition, most of these approaches have focused on Zero-Shot and In Context Learning, with little to no effort on direct finetuning. 
We investigate the minimal computational effort needed to adapt unimodal models for multimodal tasks and propose a new challenging setup, alongside different approaches, that efficiently adapts unimodal pretrained models. We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning across Image, Video, and Audio modalities, following the proposed setup. The code is available here: https://github.com/mshukor/eP-ALM. + + + + Multimodal Optimal Transport-based Co-Attention Transformer with Global Structure Consistency for Survival Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Multimodal_Optimal_Transport-based_Co-Attention_Transformer_with_Global_Structure_Consistency_for_ICCV_2023_paper.pdf + Survival prediction is a complicated ordinal regression task that aims to predict the ranking risk of death, which generally benefits from the integration of histology and genomic data. Despite the progress in joint learning from pathology and genomics, existing methods still suffer from challenging issues: 1) Due to the large size of pathological images, it is difficult to effectively represent the gigapixel whole slide images (WSIs). 2) Interactions within tumor microenvironment (TME) in histology are essential for survival analysis. Although current approaches attempt to model these interactions via co-attention between histology and genomic data, they focus on only dense local similarity across modalities, which fails to capture global consistency between potential structures, i.e. TME-related interactions of histology and co-expression of genomic data. To address these challenges, we propose a Multimodal Optimal Transport-based Co-Attention Transformer framework with global structure consistency, in which optimal transport (OT) is applied to match patches of a WSI and genes embeddings for selecting informative patches to represent the gigapixel WSI. More importantly, OT-based co-attention provides a global awareness to effectively capture structural interactions within TME for survival prediction. To overcome high computational complexity of OT, we propose a robust and efficient implementation over micro-batch of WSI patches by approximating the original OT with unbalanced mini-batch OT. Extensive experiments show the superiority of our method on five benchmark datasets compared to the state-of-the-art methods. The code will be released. + + + + Robust One-Shot Face Video Re-enactment using Hybrid Latent Spaces of StyleGAN2 + http://openaccess.thecvf.com//content/ICCV2023/papers/Oorloff_Robust_One-Shot_Face_Video_Re-enactment_using_Hybrid_Latent_Spaces_of_ICCV_2023_paper.pdf + Recent research on one-shot face re-enactment has progressively overcome the low-resolution constraint with the help of StyleGAN's high-fidelity portrait generation. However, such approaches rely on explicit 2D/3D structural priors for guidance and/or use flow-based warping which constrain their performance. Moreover, existing methods are sensitive (not robust) to the source frame's facial expressions and head pose, even though ideally only the identity of the source frame should have an effect. 
Addressing these limitations, we propose a novel framework exploiting the implicit 3D prior and inherent latent properties of StyleGAN2 to facilitate one-shot face re-enactment at 1024x1024 (1) with zero dependencies on explicit structural priors, (2) accommodating attribute edits, and (3) robust to diverse facial expressions and head poses of the source frame. We train an encoder using a self-supervised approach to decompose the identity and facial deformation of a portrait image within the pre-trained StyleGAN2's predefined latent spaces itself (automatically facilitating (1) and (2)). The decomposed identity latent of the source and the facial deformation latents of the driving sequence are used to generate re-enacted frames using the StyleGAN2 generator. Additionally, to improve the identity reconstruction and to enable seamless transfer of driving motion, we propose a novel approach, Cyclic Manifold Adjustment. We perform extensive qualitative and quantitative analyses which demonstrate the superiority of the proposed approach against state-of-the-art methods. Project page: https://trevineoorloff.github.io/FaceVideoReenactment_HybridLatents.io/. + + + + RPG-Palm: Realistic Pseudo-data Generation for Palmprint Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Shen_RPG-Palm_Realistic_Pseudo-data_Generation_for_Palmprint_Recognition_ICCV_2023_paper.pdf + Palmprint recently shows great potential in recognition applications as it is a privacy-friendly and stable biometric. However, the lack of large-scale public palmprint datasets limits further research and development of palmprint recognition. In this paper, we propose a novel realistic pseudo-palmprint generation (RPG) model to synthesize palmprints with massive identities. We first introduce a conditional modulation generator to improve intra-class diversity. Then an identity-aware loss is proposed to ensure identity consistency against unpaired training. We further improve the Bezier palm creases generation strategy to guarantee identity independence. Extensive experimental results demonstrate that synthetic pretraining significantly boosts the recognition model performance. For example, our model improves the state-of-the-art BezierPalm by more than 5% and 14% in terms of TAR@FAR=1e-6 under the 1:1 and 1:3 Open-set protocol. When accessing only 10% of the real training data, our method still outperforms ArcFace with 100% real training data, indicating that we are closer to real-data-free palmprint recognition. The code will be made open upon acceptance. + + + + Lecture Presentations Multimodal Dataset: Towards Understanding Multimodality in Educational Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_Lecture_Presentations_Multimodal_Dataset_Towards_Understanding_Multimodality_in_Educational_Videos_ICCV_2023_paper.pdf + Many educational videos use slide presentations, a sequence of visual pages that contain text and figures accompanied by spoken language, which are constructed and presented carefully in order to optimally transfer knowledge to students. Previous studies in multimedia and psychology attribute the effectiveness of lecture presentations to their multimodal nature. As a step toward developing vision-language models to aid in student learning as intelligent teacher assistants, we introduce the Lecture Presentations Multimodal (LPM) Dataset as a large-scale benchmark testing the capabilities of vision-and-language models in multimodal understanding of educational videos. 
Our dataset contains aligned slides and spoken language, for 180+ hours of video and 9000+ slides, with 10 lecturers from various subjects (e.g., computer science, dentistry, biology). We introduce three research tasks, (1) figure-to-text retrieval, (2) text-to-figure retrieval, and (3) generation of slide explanations, which are grounded in multimedia learning and psychology principles to test a vision-language model's understanding of multimodal content. We provide manual annotations to help implement these tasks and establish baselines on them. Comparing baselines and human student performances, we find that state-of-the-art vision-language models (zero-shot and fine-tuned) struggle in (1) weak crossmodal alignment between slides and spoken text, (2) learning novel visual mediums, (3) technical language, and (4) long-range sequences. We introduce PolyViLT, a novel multimodal transformer trained with a multi-instance learning loss that is more effective than current approaches for retrieval. We conclude by shedding light on the challenges and opportunities in multimodal understanding of educational presentation videos. + + + + Window-Based Early-Exit Cascades for Uncertainty Estimation: When Deep Ensembles are More Efficient than Single Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_Window-Based_Early-Exit_Cascades_for_Uncertainty_Estimation_When_Deep_Ensembles_are_ICCV_2023_paper.pdf + Deep Ensembles are a simple, reliable, and effective method of improving both the predictive performance and uncertainty estimates of deep learning approaches. However, they are widely criticised as being computationally expensive, due to the need to deploy multiple independent models. Recent work has challenged this view, showing that for predictive accuracy, ensembles can be more computationally efficient (at inference) than scaling single models within an architecture family. This is achieved by cascading ensemble members via an early-exit approach. In this work, we investigate extending these efficiency gains to tasks related to uncertainty estimation. As many such tasks, e.g. selective classification, are binary classification, our key novel insight is to only pass samples within a window close to the binary decision boundary to later cascade stages. Experiments on ImageNet-scale data across a number of network architectures and uncertainty tasks show that the proposed window-based early-exit approach is able to achieve a superior uncertainty-computation trade-off compared to scaling single models. For example, a cascaded EfficientNet-B2 ensemble is able to achieve similar coverage at 5% risk as a single EfficientNet-B4 with <30% the number of MACs. We also find that cascades/ensembles give more reliable improvements on OOD data vs scaling models up. + + + + XNet: Wavelet-Based Low and High Frequency Fusion Networks for Fully- and Semi-Supervised Semantic Segmentation of Biomedical Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_XNet_Wavelet-Based_Low_and_High_Frequency_Fusion_Networks_for_Fully-_ICCV_2023_paper.pdf + Fully- and semi-supervised semantic segmentation of biomedical images have been advanced with the development of deep neural networks (DNNs). So far, however, DNN models are usually designed to support one of these two learning schemes, unified models that support both fully- and semi-supervised segmentation remain limited. 
Furthermore, few fully-supervised models focus on the intrinsic low frequency (LF) and high frequency (HF) information of images to improve performance. Perturbations in consistency-based semi-supervised models are often artificially designed. They may introduce negative learning bias that are not beneficial for training. In this study, we propose a wavelet-based LF and HF fusion model XNet, which supports both fully- and semi-supervised semantic segmentation and outperforms state-of-the-art models in both fields. It emphasizes extracting LF and HF information for consistency training to alleviate the learning bias caused by artificial perturbations. Extensive experiments on two 2D and two 3D datasets demonstrate the effectiveness of our model. Code is available at https://github.com/Yanfeng-Zhou/XNet. + + + + Betrayed by Captions: Joint Caption Grounding and Generation for Open Vocabulary Instance Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Betrayed_by_Captions_Joint_Caption_Grounding_and_Generation_for_Open_ICCV_2023_paper.pdf + In this work, we focus on open vocabulary instance segmentation to expand a segmentation model to classify and segment instance-level novel categories. Previous approaches have relied on massive caption datasets and complex pipelines to establish one-to-one mappings between image regions and words in captions. However, such methods build noisy supervision by matching non-visible words to image regions, such as adjectives and verbs. Meanwhile, context words are also important for inferring the existence of novel objects as they show high inter-correlations with novel categories. To overcome these limitations, we devise a joint Caption Grounding and Generation (CGG) framework, which incorporates a novel grounding loss that only focuses on matching object nouns to improve learning efficiency. We also introduce a caption generation head that enables additional supervision and contextual modeling as a complementation to the grounding loss. Our analysis and results demonstrate that grounding and generation components complement each other, significantly enhancing the segmentation performance for novel classes. Experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS) demonstrate the superiority of the CGG. Specifically, CGG achieves a substantial improvement of 6.8% mAP for novel classes without extra data on the OVIS task and 15% PQ improvements for novel classes on the OSPS benchmark. + + + + StyleGANEX: StyleGAN-Based Manipulation Beyond Cropped Aligned Faces + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_StyleGANEX_StyleGAN-Based_Manipulation_Beyond_Cropped_Aligned_Faces_ICCV_2023_paper.pdf + Recent advances in face manipulation using StyleGAN have produced impressive results. However, StyleGAN is inherently limited to cropped aligned faces at a fixed image resolution it is pre-trained on. In this paper, we propose a simple and effective solution to this limitation by using dilated convolutions to rescale the receptive fields of shallow layers in StyleGAN, without altering any model parameters. This allows fixed-size small features at shallow layers to be extended into larger ones that can accommodate variable resolutions, making them more robust in characterizing unaligned faces. 
To enable real face inversion and manipulation, we introduce a corresponding encoder that provides the first-layer feature of the extended StyleGAN in addition to the latent style code. We validate the effectiveness of our method using unaligned face inputs of various resolutions in a diverse set of face manipulation tasks, including facial attribute editing, super-resolution, sketch/mask-to-face translation, and face toonification. + + + + HandR2N2: Iterative 3D Hand Pose Estimation Using a Residual Recurrent Neural Network + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_HandR2N2_Iterative_3D_Hand_Pose_Estimation_Using_a_Residual_Recurrent_ICCV_2023_paper.pdf + 3D hand pose estimation is a critical task in various human-computer interaction applications. Numerous deep learning based estimation models in this domain have been actively explored. However, the existing models follows a non-recurrent scheme and thus require complex architectures or redundant parameters in order to achieve acceptable model capacity. To tackle this limitation, this paper proposes HandR2N2, a compact neural network that iteratively regresses the hand pose using a novel residual recurrent unit. The recurrent design allows recursive exploitation of partial layers to gradually optimize previously estimated joint locations. In addition, we exploit graph reasoning to capture kinematic dependencies between joints for better performance. Experimental results show that the proposed model significantly outperforms the existing methods on three hand pose benchmark datasets in terms of both accuracy and efficiency. Codes and pre-trained models are publicly available at https://github.com/cwc1260/HandR2N2. + + + + Unsupervised Learning of Object-Centric Embeddings for Cell Instance Segmentation in Microscopy Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Wolf_Unsupervised_Learning_of_Object-Centric_Embeddings_for_Cell_Instance_Segmentation_in_ICCV_2023_paper.pdf + Segmentation of objects in microscopy images is required for many biomedical applications. We introduce object-centric embeddings (OCEs), which embed image patches such that the spatial offsets between patches cropped from the same object are preserved. Those learnt embeddings can be used to delineate individual objects and thus obtain instance segmentations. Here, we show theoretically that, under assumptions commonly found in microscopy images, OCEs can be learnt through a self-supervised task that predicts the spatial offset between image patches. Together, this forms an unsupervised cell instance segmentation method which we evaluate on nine diverse large-scale microscopy datasets. Segmentations obtained with our method lead to substantially improved results, compared to a state-of-the-art baseline on six out of nine datasets, and perform on par on the remaining three datasets. If ground-truth annotations are available, our method serves as an excellent starting point for supervised training, reducing the required amount of ground-truth needed by one order of magnitude, thus substantially increasing the practical applicability of our method. Source code is available at github.com/funkelab/cellulus. 
+ + + + XiNet: Efficient Neural Networks for tinyML + http://openaccess.thecvf.com//content/ICCV2023/papers/Ancilotto_XiNet_Efficient_Neural_Networks_for_tinyML_ICCV_2023_paper.pdf + The recent interest in the edge-to-cloud continuum paradigm has emphasized the need for simple and scalable architectures to deliver optimal performance on computationally constrained devices. However, resource-efficient neural networks usually optimize for parameter count and thus use operators such as depthwise convolutions, which do not maximally exploit the efficiency of resource-constrained devices. In this article, we propose XiNet, a novel convolutional neural architecture that targets edge devices. We derived the XiNet architecture from an extensive real-world efficiency analysis of various neural network operators (e.g., standard, depthwise, and pointwise convolutions). Compared to other mobile architectures, our approach substantially improves the performance-complexity trade-off by optimizing the number of operations, parameters, and working memory (RAM). Moreover, we show how XiNet can be easily adapted to different devices thanks to Hardware Aware Scaling (HAS), which enables disjoint optimization of RAM, FLASH, and operations count. We analyze the scaling properties of our architecture under different hardware constraints and validate it on the image classification task. Finally, we evaluate the performance of XiNet for object detection on the MS-COCO and VOC-2012 benchmarks and compare it with state-of-the-art mobile neural networks, achieving a 70% reduction in energy requirements with similar performance. + + + + GridPull: Towards Scalability in Learning Implicit Representations from 3D Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_GridPull_Towards_Scalability_in_Learning_Implicit_Representations_from_3D_Point_ICCV_2023_paper.pdf + Learning implicit representations has been a widely used solution for surface reconstruction from 3D point clouds. The latest methods infer a distance or occupancy field by overfitting a neural network on a single point cloud. However, these methods suffer from a slow inference due to the slow convergence of neural networks and the extensive calculation of distances to surface points, which limits them to small scale points. To resolve the scalability issue in surface reconstruction, we propose GridPull to improve the efficiency of learning implicit representations from large scale point clouds. Our novelty lies in the fast inference of a discrete distance field defined on grids without using any neural components. To remedy the lack of continuousness brought by neural networks, we introduce a loss function to encourage continuous distances and consistent gradients in the field during pulling queries onto the surface in grids near to the surface. We use uniform grids for a fast grid search to localize sampled queries, and organize surface points in a tree structure to speed up the calculation of distances to the surface. We do not rely on learning priors or normal supervision during optimization, and achieve superiority over the latest methods in terms of complexity and accuracy. We evaluate our method on shape and scene benchmarks, and report numerical and visual comparisons with the latest methods to justify our effectiveness and superiority. The code is available at https://github.com/chenchao15/GridPull. 
+ + + + GeoMIM: Towards Better 3D Knowledge Transfer via Masked Image Modeling for Multi-view 3D Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_GeoMIM_Towards_Better_3D_Knowledge_Transfer_via_Masked_Image_Modeling_ICCV_2023_paper.pdf + Multi-view camera-based 3D detection is a challenging problem in computer vision. Recent works leverage a pretrained LiDAR detection model to transfer knowledge to a camera-based student network. However, we argue that there is a major domain gap between the LiDAR BEV features and the camera-based BEV features, as they have different characteristics and are derived from different sources. In this paper, we propose Geometry Enhanced Masked Image Modeling (GeoMIM) to transfer the knowledge of the LiDAR model in a pretrain-finetune paradigm for improving the multi-view camera-based 3D detection. GeoMIM is a multi-camera vision transformer with Cross-View Attention (CVA) blocks that uses LiDAR BEV features encoded by the pretrained BEV model as learning targets. During pretraining, GeoMIM's decoder has a semantic branch completing dense perspective-view features and the other geometry branch reconstructing dense perspective-view depth maps. The depth branch is designed to be camera-aware by inputting the camera's parameters for better transfer capability. Extensive results demonstrate that GeoMIM outperforms existing methods on nuScenes benchmark, achieving state-of-the-art performance for camera-based 3D object detection and 3D segmentation. + + + + RenderIH: A Large-Scale Synthetic Dataset for 3D Interacting Hand Pose Estimation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_RenderIH_A_Large-Scale_Synthetic_Dataset_for_3D_Interacting_Hand_Pose_ICCV_2023_paper.pdf + The current interacting hand (IH) datasets are relatively simplistic in terms of background and texture, with hand joints being annotated by a machine annotator, which may result in inaccuracies, and the diversity of pose distribution is limited. However, the variability of background, pose distribution, and texture can greatly influence the generalization ability. Therefore, we present a large-scale synthetic dataset --RenderIH-- for interacting hands with accurate and diverse pose annotations. The dataset contains 1M photo-realistic images with varied backgrounds, perspectives, and hand textures. To generate natural and diverse interacting poses, we propose a new pose optimization algorithm. Additionally, for better pose estimation accuracy, we introduce a transformer-based pose estimation network, TransHand, to leverage the correlation between interacting hands and verify the effectiveness of RenderIH in improving results. Our dataset is model-agnostic and can improve more accuracy of any hand pose estimation method in comparison to other real or synthetic datasets. Experiments have shown that pretraining on our synthetic data can significantly decrease the error from 6.76mm to 5.79mm, and our Transhand surpasses contemporary methods. Our dataset and code are available at https://github.com/adwardlee/RenderIH. + + + + Cross-modal Scalable Hyperbolic Hierarchical Clustering + http://openaccess.thecvf.com//content/ICCV2023/papers/Long_Cross-modal_Scalable_Hierarchical_Clustering_in_Hyperbolic_space_ICCV_2023_paper.pdf + Hierarchical clustering is a natural approach to discover ontologies from data. Yet, existing approaches are hampered by their inability to scale to large datasets and the discrete encoding of the hierarchy. 
We introduce scalable Hyperbolic Hierarchical Clustering (sHHC) which overcomes these limitations by learning continuous hierarchies in hyperbolic space. Our hierarchical clustering is of high quality and can be obtained in a fraction of the runtime. Additionally, we demonstrate the strength of sHHC on a downstream cross-modal self-supervision task. By using the discovered hierarchies from sound and vision to construct continuous hierarchical pseudo-labels we can efficiently optimize a network for activity recognition and obtain competitive performance compared to recent self-supervised learning models. Our findings demonstrate the strength of Hyperbolic Hierarchical Clustering and its potential for Self-Supervised Learning. + + + + PointMBF: A Multi-scale Bidirectional Fusion Network for Unsupervised RGB-D Point Cloud Registration + http://openaccess.thecvf.com//content/ICCV2023/papers/Yuan_PointMBF_A_Multi-scale_Bidirectional_Fusion_Network_for_Unsupervised_RGB-D_Point_ICCV_2023_paper.pdf + Point cloud registration is a task to estimate the rigid transformation between two unaligned scans, which plays an important role in many computer vision applications. Previous learning-based works commonly focus on supervised registration, which have limitations in practice. Recently, with the advance of inexpensive RGB-D sensors, several learning-based works utilize RGB-D data to achieve unsupervised registration. However, most of existing unsupervised methods follow a cascaded design or fuse RGB-D data in a unidirectional manner, which do not fully exploit the complementary information in the RGB-D data. To leverage the complementary information more effectively, we propose a network implementing multi-scale bidirectional fusion between RGB images and point clouds generated from depth images. By bidirectionally fusing visual and geometric features in multi-scales, more distinctive deep features for correspondence estimation can be obtained, making our registration more accurate. Extensive experiments on ScanNet and 3DMatch demonstrate that our method achieves new state-of-the-art performance. Code will be released at https://github.com/phdymz/PointMBF. + + + + LiveHand: Real-time and Photorealistic Neural Hand Rendering + http://openaccess.thecvf.com//content/ICCV2023/papers/Mundra_LiveHand_Real-time_and_Photorealistic_Neural_Hand_Rendering_ICCV_2023_paper.pdf + The human hand is the main medium through which we interact with our surroundings, making its digitization an important problem. While there are several works modeling the geometry of hands, little attention has been paid to capturing photo-realistic appearance. Moreover, for applications in extended reality and gaming, real-time rendering is critical. We present the first neural-implicit approach to photo-realistically render hands in real-time. This is a challenging problem as hands are textured and undergo strong articulations with pose-dependent effects. However, we show that this aim is achievable through our carefully designed method. This includes training on a low-resolution rendering of a neural radiance field, together with a 3D-consistent super-resolution module and mesh-guided sampling and space canonicalization. We demonstrate a novel application of perceptual loss on the image space, which is critical for learning details accurately. We also show a live demo where we photo-realistically render the human hand in real-time for the first time, while also modeling pose- and view-dependent appearance effects. 
We ablate all our design choices and show that they optimize for rendering speed and quality. + + + + TripLe: Revisiting Pretrained Model Reuse and Progressive Learning for Efficient Vision Transformer Scaling and Searching + http://openaccess.thecvf.com//content/ICCV2023/papers/Fu_TripLe_Revisiting_Pretrained_Model_Reuse_and_Progressive_Learning_for_Efficient_ICCV_2023_paper.pdf + One promising way to accelerate transformer training is to reuse small pretrained models to initialize the transformer, as their existing representation power facilitates faster model convergence. Previous works designed expansion operators to scale up pretrained models to the target model before training. Yet, model functionality is difficult to preserve when scaling a transformer in all dimensions at once. Moreover, maintaining the pretrained optimizer states for weights is critical for model scaling, whereas the new weights added during expansion lack these states in pretrained models. To address these issues, we propose TripLe, which partially scales a model before training, while growing the rest of the new parameters during training by copying both the warmed-up weights with the optimizer states from existing weights. As such, the new parameters introduced during training will obtain their training states. Furthermore, through serializing the scaling of model width and depth, the functionality of each expansion can be preserved. We evaluate TripLe in both single-trial model scaling and multi-trial neural architecture search (NAS). Due to the fast training convergence of TripLe, the proxy accuracy from TripLe better reveals the model quality compared to from-scratch training in multi-trial NAS. Experiments show that TripLe outperforms both from-scratch training and knowledge distillation (KD) in both training time and task performance. TripLe can also be combined with KD to achieve an even higher task accuracy. For NAS, the model obtained from TripLe outperforms DeiT-B in task accuracy with 69% reduction in parameter size and FLOPs. + + + + Learning Unified Decompositional and Compositional NeRF for Editable Novel View Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Learning_Unified_Decompositional_and_Compositional_NeRF_for_Editable_Novel_View_ICCV_2023_paper.pdf + Implicit neural representations have shown powerful capacity in modeling real-world 3D scenes, offering superior performance in novel view synthesis. In this paper, we target a more challenging scenario, i.e., joint scene novel view synthesis and editing based on implicit neural scene representations. State-of-the-art methods in this direction typically consider building separate networks for these two tasks (i.e., view synthesis and editing). Thus, the modeling of interactions and correlations between these two tasks is very limited, which, however, is critical for learning high-quality scene representations.To tackle this problem, in this paper, we propose a unified Neural Radiance Field (NeRF) framework to effectively perform joint scene decomposition and composition for modeling real-world scenes. The decomposition aims at learning disentangled 3D representations of different objects and the background, allowing for scene editing, while scene composition models an entire scene representation for novel view synthesis. 
Specifically, with a two-stage NeRF framework, we learn a coarse stage for predicting a global radiance field as guidance for point sampling, and in the second fine-grained stage, we perform scene decomposition by a novel one-hot object radiance field regularization module and a pseudo supervision via inpainting to handle ambiguous background regions occluded by objects. The decomposed object-level radiance fields are further composed by using activations from the decomposition module. Extensive quantitative and qualitative results show the effectiveness of our method for scene decomposition and composition, outperforming state-of-the-art methods for both novel-view synthesis and editing tasks. + + + + SeeABLE: Soft Discrepancies and Bounded Contrastive Learning for Exposing Deepfakes http://openaccess.thecvf.com//content/ICCV2023/papers/Larue_SeeABLE_Soft_Discrepancies_and_Bounded_Contrastive_Learning_for_Exposing_Deepfakes_ICCV_2023_paper.pdf Modern deepfake detectors have achieved encouraging results when training and test images are drawn from the same data collection. However, when these detectors are applied to images produced with unknown deepfake-generation techniques, considerable performance degradations are commonly observed. In this paper, we propose a novel deepfake detector, called SeeABLE, that formalizes the detection problem as a (one-class) out-of-distribution detection task and generalizes better to unseen deepfakes. Specifically, SeeABLE first generates local image perturbations (referred to as soft-discrepancies) and then pushes the perturbed faces towards predefined prototypes using a novel regression-based bounded contrastive loss. To strengthen the generalization performance of SeeABLE to unknown deepfake types, we generate a rich set of soft discrepancies and train the detector: (i) to localize which part of the face was modified, and (ii) to identify the alteration type. To demonstrate the capabilities of SeeABLE, we perform rigorous experiments on several widely-used deepfake datasets and show that our model convincingly outperforms competing state-of-the-art detectors, while exhibiting highly encouraging generalization capabilities. The source code for SeeABLE is available from: https://github.com/anonymous-author-sub/seeable. + + + + Semi-Supervised Learning via Weight-Aware Distillation under Class Distribution Mismatch http://openaccess.thecvf.com//content/ICCV2023/papers/Du_Semi-Supervised_Learning_via_Weight-Aware_Distillation_under_Class_Distribution_Mismatch_ICCV_2023_paper.pdf Semi-Supervised Learning (SSL) under class distribution mismatch aims to tackle a challenging problem wherein unlabeled data contain many unknown categories unseen in the labeled ones. In such mismatch scenarios, traditional SSL suffers severe performance degradation due to the harmful invasion of the instances with unknown categories into the target classifier. In this study, by strict mathematical reasoning, we reveal that the SSL error under class distribution mismatch is composed of pseudo-labeling error and invasion error, both of which jointly bound the SSL population risk. To alleviate the SSL error, we propose a robust SSL framework called Weight-Aware Distillation (WAD) that, by weights, selectively transfers knowledge beneficial to the target task from unsupervised contrastive representation to the target classifier.
Specifically, WAD captures adaptive weights and high-quality pseudo-labels to target instances by exploring point mutual information (PMI) in representation space to maximize the role of unlabeled data and filter unknown categories. Theoretically, we prove that WAD has a tight upper bound of population risk under class distribution mismatch. Experimentally, extensive results demonstrate that WAD outperforms five state-of-the-art SSL approaches and one standard baseline on two benchmark datasets, CIFAR10 and CIFAR100, and an artificial cross-dataset. The code is available at https://github.com/RUC-DWBI-ML/research/tree/main/WAD-master. + + + + ELFNet: Evidential Local-global Fusion for Stereo Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Lou_ELFNet_Evidential_Local-global_Fusion_for_Stereo_Matching_ICCV_2023_paper.pdf + Although existing stereo matching models have achieved continuous improvement, they often face issues related to trustworthiness due to the absence of uncertainty estimation. Additionally, effectively leveraging multi-scale and multi-view knowledge of stereo pairs remains unexplored. In this paper, we introduce the Evidential Local-global Fusion (ELF) framework for stereo matching, which endows both uncertainty estimation and confidence-aware fusion with trustworthy heads. Instead of predicting the disparity map alone, our model estimates an evidential-based disparity considering both aleatoric and epistemic uncertainties. With the normal inverse-Gamma distribution as a bridge, the proposed framework realizes intra evidential fusion of multi-level predictions and inter evidential fusion between cost-volume-based and transformer-based stereo matching. Extensive experimental results show that the proposed framework exploits multi-view information effectively and achieves state-of-the-art overall performance both on accuracy and cross-domain generalization. The codes are available at https://github.com/jimmy19991222/ELFNet. + + + + SimpleClick: Interactive Image Segmentation with Simple Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_SimpleClick_Interactive_Image_Segmentation_with_Simple_Vision_Transformers_ICCV_2023_paper.pdf + Click-based interactive image segmentation aims at extracting objects with a limited user clicking. A hierarchical backbone is the de-facto architecture for current methods. Recently, the plain, non-hierarchical Vision Transformer (ViT) has emerged as a competitive backbone for dense prediction tasks. This design allows the original ViT to be a foundation model that can be finetuned for downstream tasks without redesigning a hierarchical backbone for pretraining. Although this design is simple and has been proven effective, it has not yet been explored for interactive segmentation. To fill this gap, we propose SimpleClick, the first plain-backbone method for interactive segmentation. Other than the plain backbone, we also explore several variants of simple feature pyramid networks that only take as input the last feature representation of the backbone. With the plain backbone pretrained as a masked autoencoder (MAE), SimpleClick achieves state-of-the-art performance. Remarkably, our method achieves 4.15 NoC@90 on SBD, improving 21.8% over the previous best result. Extensive evaluation on medical images demonstrates the generalizability of our method. 
We further develop an extremely tiny ViT backbone for SimpleClick and provide a detailed computational analysis, highlighting its suitability as a practical annotation tool. + + + + Towards Content-based Pixel Retrieval in Revisited Oxford and Paris + http://openaccess.thecvf.com//content/ICCV2023/papers/An_Towards_Content-based_Pixel_Retrieval_in_Revisited_Oxford_and_Paris_ICCV_2023_paper.pdf + This paper introduces the first two landmark pixel retrieval benchmarks. Like semantic segmentation extends classification to the pixel level, pixel retrieval is an extension of image retrieval and offers information about which pixels are related to the query object. In addition to retrieving images for the given query, it helps users quickly identify the query object in true positive images and exclude false positive images by denoting the correlated pixels. Our user study results show pixel-level annotation can significantly improve the user experience. Compared with semantic and instance segmentation, pixel retrieval requires a fine-grained recognition capability for variable-granularity targets. To this end, we propose pixel retrieval benchmarks named PROxford and PRParis, which are based on the widely used image retrieval datasets, ROxford and RParis. Three professional annotators label 5,942 images with two rounds of double-checking and refinement. Furthermore, we conduct extensive experiments and analysis on the SOTA methods in image search, image matching, detection, segmentation, and dense matching using our pixel retrieval benchmarks. Results show that the pixel retrieval task is challenging to these approaches and distinctive from existing problems, suggesting that further research can advance the content-based pixel-retrieval and, thus, user search experience. + + + + Retro-FPN: Retrospective Feature Pyramid Network for Point Cloud Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xiang_Retro-FPN_Retrospective_Feature_Pyramid_Network_for_Point_Cloud_Semantic_Segmentation_ICCV_2023_paper.pdf + Learning per-point semantic features from the hierarchical feature pyramid is essential for point cloud semantic segmentation. However, most previous methods suffered from ambiguous region features or failed to refine per-point features effectively, which leads to information loss and ambiguous semantic identification. To resolve this, we propose Retro-FPN to model the per-point feature prediction as an explicit and retrospective refining process, which goes through all the pyramid layers to extract semantic features explicitly for each point. Its key novelty is a retro-transformer for summarizing semantic contexts from the previous layer and accordingly refining the features in the current stage. In this way, the categorization of each point is conditioned on its local semantic pattern. Specifically, the retro-transformer consists of a local cross-attention block and a semantic gate unit. The cross-attention serves to summarize the semantic pattern retrospectively from the previous layer. And the gate unit carefully incorporates the summarized contexts and refines the current semantic features. Retro-FPN is a pluggable neural network that applies to hierarchical decoders. By integrating Retro-FPN with three representative backbones, including both point-based and voxel-based methods, we show that Retro-FPN can significantly improve performance over state-of-the-art backbones. 
Comprehensive experiments on widely used benchmarks can justify the effectiveness of our design. The source is available at https://github.com/AllenXiangX/Retro-FPN. + + + + Benchmarking Low-Shot Robustness to Natural Distribution Shifts + http://openaccess.thecvf.com//content/ICCV2023/papers/Singh_Benchmarking_Low-Shot_Robustness_to_Natural_Distribution_Shifts_ICCV_2023_paper.pdf + Robustness to natural distribution shifts has seen remarkable progress thanks to recent pre-training strategies combined with better fine-tuning methods. However, such fine-tuning assumes access to large amounts of labelled data, and the extent to which the observations hold when the amount of training data is not as high remains unknown. We address this gap by performing the first in-depth study of robustness to various natural distribution shifts in different low-shot regimes: spanning datasets, architectures, pre-trained initializations, and state-of-the-art robustness interventions. Most importantly, we find that there is no single model of choice that is often more robust than others, and existing interventions can fail to improve robustness on some datasets even if they do so in the full-shot regime. We hope that our work will motivate the community to focus on this problem of practical importance. + + + + DeLiRa: Self-Supervised Depth, Light, and Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Guizilini_DeLiRa_Self-Supervised_Depth_Light_and_Radiance_Fields_ICCV_2023_paper.pdf + Differentiable volumetric rendering is a powerful paradigm for 3D reconstruction and novel view synthesis. However, standard volume rendering approaches struggle with degenerate geometries in the case of limited viewpoint diversity, a common scenario in robotics applications. In this work, we propose to use the multi-view photometric objective from the self-supervised depth estimation literature as a geometric regularizer for volumetric rendering, significantly improving novel view synthesis without requiring additional information. Building upon this insight, we explore the explicit modeling of scene geometry using a generalist Transformer, jointly learning a radiance field as well as depth and light fields with a set of shared latent codes. We demonstrate that sharing geometric information across tasks is mutually beneficial, leading to improvements over single-task learning without an increase in network complexity. Our DeLiRa architecture achieves state-of-the-art results on the ScanNet benchmark, enabling high quality volumetric rendering as well as real-time novel view and depth synthesis in the limited viewpoint diversity setting. + + + + Stable Cluster Discrimination for Deep Clustering + http://openaccess.thecvf.com//content/ICCV2023/papers/Qian_Stable_Cluster_Discrimination_for_Deep_Clustering_ICCV_2023_paper.pdf + Deep clustering can optimize representations of instances (i.e., representation learning) and explore the inherent data distribution (i.e., clustering) simultaneously, which demonstrates a superior performance over conventional clustering methods with given features. However, the coupled objective implies a trivial solution that all instances collapse to the uniform features. To tackle the challenge, a two-stage training strategy is developed for decoupling, where it introduces an additional pre-training stage for representation learning and then fine-tunes the obtained model for clustering. 
Meanwhile, one-stage methods are developed mainly for representation learning rather than clustering, where various constraints for cluster assignments are designed to avoid collapsing explicitly. Despite the success of these methods, an appropriate learning objective tailored for deep clustering has not been investigated sufficiently. In this work, we first show that the prevalent discrimination task in supervised learning is unstable for one-stage clustering due to the lack of ground-truth labels and positive instances for certain clusters in each mini-batch. To mitigate the issue, a novel stable cluster discrimination (SeCu) task is proposed and a new hardness-aware clustering criterion can be obtained accordingly. Moreover, a global entropy constraint for cluster assignments is studied with efficient optimization. Extensive experiments are conducted on benchmark data sets and ImageNet. SeCu achieves state-of-the-art performance on all of them, which demonstrates the effectiveness of one-stage deep clustering. + + + + Pix2Video: Video Editing using Image Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Ceylan_Pix2Video_Video_Editing_using_Image_Diffusion_ICCV_2023_paper.pdf + Image diffusion models, trained on massive image collections, have emerged as the most versatile image generator model in terms of quality and diversity. They support inverting real images and conditional (e.g., text) generation, making them attractive for high-quality image editing applications. We investigate how to use such pre-trained image models for text-guided video editing. The critical challenge is to achieve the target edits while still preserving the content of the source video. Our method works in two simple steps: first, we use a pre-trained structure-guided (e.g., depth) image diffusion model to perform text-guided edits on an anchor frame; then, in the key step, we progressively propagate the changes to the future frames via self-attention feature injection to adapt the core denoising step of the diffusion model. We then consolidate the changes by adjusting the latent code for the frame before continuing the process. Our approach is training-free and generalizes to a wide range of edits. We demonstrate the effectiveness of the approach by extensive experimentation and compare it against four different prior and parallel efforts (on ArXiv). We demonstrate that realistic text-guided video edits are possible, without any compute-intensive preprocessing or video-specific finetuning. + + + + Holistic Geometric Feature Learning for Structured Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Lu_Holistic_Geometric_Feature_Learning_for_Structured_Reconstruction_ICCV_2023_paper.pdf + The inference of topological principles is a key problem in structured reconstruction. We observe that wrongly predicted topological relationships are often incurred by the lack of holistic geometry clues in low-level features. Inspired by the fact that massive signals can be compactly described with frequency analysis, we experimentally explore the efficiency and tendency of learning structure geometry in the frequency domain. Accordingly, we propose a frequency-domain feature learning strategy (F-Learn) to fuse scattered geometric fragments holistically for topology-intact structure reasoning. Benefiting from the parsimonious design, the F-Learn strategy can be easily deployed into a deep reconstructor with a lightweight model modification. 
Experiments demonstrate that the F-Learn strategy can effectively introduce structure awareness into geometric primitive detection and topology inference, bringing significant performance improvement to final structured reconstruction. Code and pre-trained models are available at https://github.com/Geo-Tell/F-Learn. + + + + FateZero: Fusing Attentions for Zero-shot Text-based Video Editing + http://openaccess.thecvf.com//content/ICCV2023/papers/QI_FateZero_Fusing_Attentions_for_Zero-shot_Text-based_Video_Editing_ICCV_2023_paper.pdf + The diffusion-based generative models have achieved remarkable success in text-based image generation. However, since it contains enormous randomness in generation progress, it is still challenging to apply such models for real-world visual content editing, especially in videos. In this paper, we propose FateZero, a zero-shot text-based editing method on real-world videos without per-prompt training or use-specific mask. To edit videos consistently, we propose several techniques based on the pre-trained models. Firstly, in contrast to the straightforward DDIM inversion technique, our approach captures intermediate attention maps during inversion, which effectively retain both structural and motion information. These maps are directly fused in the editing process rather than generated during denoising. To further minimize semantic leakage of the source video, we then fuse self-attentions with a blending mask obtained by cross-attention features from the source prompt. Furthermore, we have implemented a reform of the self-attention mechanism in denoising UNet by introducing spatial-temporal attention to ensure frame consistency. Yet succinct, our method is the first one to show the ability of zero-shot text-driven video style and local attribute editing from the trained text-to-image model. We also have a better zero-shot shape-aware editing ability in the text-to-video model. + + + + Uncertainty-guided Learning for Improving Image Manipulation Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Ji_Uncertainty-guided_Learning_for_Improving_Image_Manipulation_Detection_ICCV_2023_paper.pdf + Image manipulation detection (IMD) is of vital importance as faking images and spreading misinformation can be malicious and harm our daily life. IMD is the core technique to solve these issues and poses challenges in two main aspects: (1) Data Uncertainty, i.e., the manipulated artifacts are often hard for humans to discern and lead to noisy labels, which may disturb model training; (2) Model Uncertainty, i.e., the same object may hold different categories (tampered or not) due to manipulation operations, which could potentially confuse the model training and result in unreliable outcomes. Previous works mainly focus on solving the model uncertainty issue by designing meticulous features and networks, however, the data uncertainty problem is rarely considered. In this paper, we address both problems by introducing an uncertainty-guided learning framework, which measures data and model uncertainty by a novel Uncertainty Estimation Network (UEN). UEN is trained under dynamic supervision, and outputs estimated uncertainty maps to refine manipulation detection results, which significantly alleviates the learning difficulties. To our knowledge, this is the first work to embed uncertainty modeling into IMD. Extensive experiments on various datasets demonstrate state-of-the-art performance, validating the effectiveness and generalizability of our method. 
+ + + + AdaMV-MoE: Adaptive Multi-Task Vision Mixture-of-Experts + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_AdaMV-MoE_Adaptive_Multi-Task_Vision_Mixture-of-Experts_ICCV_2023_paper.pdf + Sparsely activated Mixture-of-Experts (MoE) is becoming a promising paradigm for multi-task learning (MTL). Instead of compressing multiple tasks' knowledge into a single model, MoE separates the parameter space and only utilizes the relevant model pieces given task type and its input, which provides stabilized MTL training and ultra-efficient inference. However, current MoE approaches adopt a fixed network capacity (e.g., two experts in usual) for all tasks. It potentially results in the over-fitting of simple tasks or the under-fitting of challenging scenarios, especially when tasks are significantly distinctive in their complexity. In this paper, we propose an adaptive MoE framework for multi-task vision recognition, dubbed AdaMV-MoE. Based on the training dynamics, it automatically determines the number of activated experts for each task, avoiding the laborious manual tuning of optimal model size. To validate our proposal, we benchmark it on ImageNet classification and COCO object detection & instance segmentation which are notoriously difficult to learn in concert, due to their discrepancy. Extensive experiments across a variety of vision transformers demonstrate a superior performance of AdaMV-MoE, compared to MTL with a shared backbone and the recent state-of-the-art (SoTA) MTL MoE approach. Codes are available online: https://github.com/google-research/google-research/tree/master/moe_mtl. + + + + Hierarchical Visual Categories Modeling: A Joint Representation Learning and Density Estimation Framework for Out-of-Distribution Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Hierarchical_Visual_Categories_Modeling_A_Joint_Representation_Learning_and_Density_ICCV_2023_paper.pdf + Detecting out-of-distribution inputs for visual recognition models has become critical in safe deep learning. This paper proposes a novel hierarchical visual category modeling scheme to separate out-of-distribution data from in-distribution data through joint representation learning and statistical modeling. We learn a mixture of Gaussian models for each in-distribution category. There are many Gaussian mixture models to model different visual categories. With these Gaussian models, we design an in-distribution score function by aggregating multiple Mahalanobis-based metrics. We don't use any auxiliary outlier data as training samples, which may hurt the generalization ability of out-of-distribution detection algorithms. We split the ImageNet1k dataset into ten folds randomly. We use one fold as the in-distribution dataset and the others as out-of-distribution datasets to evaluate the proposed method. We also conduct experiments on seven popular benchmarks, including CIFAR, iNaturalist, SUN, Places, Textures, ImageNet-O, and OpenImage-O. Extensive experiments indicate that the proposed method outperforms state-of-the-art algorithms clearly. Meanwhile, we find that our visual representation has a competitive performance when compared with features learned by classical methods. These results demonstrate that the proposed method hasn't weakened the discriminative ability of visual recognition models and keeps high efficiency in detecting out-of-distribution samples. 
+ + + + ReNeRF: Relightable Neural Radiance Fields with Nearfield Lighting + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_ReNeRF_Relightable_Neural_Radiance_Fields_with_Nearfield_Lighting_ICCV_2023_paper.pdf + Recent work on radiance fields and volumetric inverse rendering (e.g., NeRFs) has provided excellent results in building data-driven models of real scenes for novel view synthesis with high photorealism. While full control over viewpoint is achieved, scene lighting is typically "baked" into the model and cannot be changed; other methods only capture limited variation in lighting or make restrictive assumptions about the captured scene. These limitations prevent the application on arbitrary materials and novel 3D environments with complex, distinct lighting. In this paper, we target the application scenario of capturing high-fidelity assets for neural relighting in controlled studio conditions, but without requiring a dense light stage. Instead, we leverage a small number of area lights commonly used in photogrammetry. We propose ReNeRF, a relightable radiance field model based on the intuitive and powerful approach of image-based relighting, which implicitly captures global light transport (for arbitrary objects) without complex, error-prone simulations. Thus, our new method is simple and provides full control over viewpoint and lighting, without simplistic assumptions about how light interacts with the scene. In addition, ReNeRF does not rely on the usual assumption of distant lighting - during training, we explicitly account for the distance between 3D points in the volume and point samples on the light sources. Thus, at test time, we achieve better generalization to novel, continuous lighting directions, including nearfield lighting effects. + + + + 360VOT: A New Benchmark Dataset for Omnidirectional Visual Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_360VOT_A_New_Benchmark_Dataset_for_Omnidirectional_Visual_Object_Tracking_ICCV_2023_paper.pdf + 360deg images can provide an omnidirectional field of view which is important for stable and long-term scene perception. In this paper, we explore 360deg images for visual object tracking and perceive new challenges caused by large distortion, stitching artifacts, and other unique attributes of 360deg images. To alleviate these problems, we take advantage of novel representations of target localization, i.e., bounding field-of-view, and then introduce a general 360 tracking framework that can adopt typical trackers for omnidirectional tracking. More importantly, we propose a new large-scale omnidirectional tracking benchmark dataset, 360VOT, in order to facilitate future research. 360VOT contains 120 sequences with up to 113K high-resolution frames in equirectangular projection. And the tracking targets cover 32 categories in diverse scenarios. Moreover, we provide 4 types of unbiased ground truth, including (rotated) bounding boxes and (rotated) bounding field-of-views, as well as new metrics tailored for 360deg images which allow accurate evaluation of omnidirectional tracking performance. Finally, we extensively evaluated 20 state-of-the-art visual trackers and provided a new baseline for future comparisons. Homepage: https://360vot.hkustvgd.com + + + + Is Imitation All You Need? 
Generalized Decision-Making with Dual-Phase Training http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Is_Imitation_All_You_Need_Generalized_Decision-Making_with_Dual-Phase_Training_ICCV_2023_paper.pdf We introduce DualMind, a generalist agent designed to tackle various decision-making tasks, addressing challenges posed by current methods such as overfitting behaviors and dependence on task-specific fine-tuning. DualMind uses a novel "Dual-phase" training strategy that emulates how humans learn to act in the world. The model first learns fundamental common knowledge through a self-supervised objective tailored for control tasks and then learns how to make decisions based on different contexts through imitating behaviors conditioned on given prompts. DualMind can handle tasks across domains, scenes, and embodiments using just a single set of model weights and can execute zero-shot prompting without requiring task-specific fine-tuning. We evaluate DualMind on MetaWorld and Habitat through extensive experiments and demonstrate its superior generalizability compared to previous techniques, outperforming other generalist agents by over 50% and 70% on Habitat and MetaWorld, respectively. On the 45 tasks in MetaWorld, DualMind achieves a 90% success rate on more than 30 of them. Our source code is available at https://github.com/yunyikristy/DualMind. + + + + LERF: Language Embedded Radiance Fields http://openaccess.thecvf.com//content/ICCV2023/papers/Kerr_LERF_Language_Embedded_Radiance_Fields_ICCV_2023_paper.pdf Humans describe the physical world using natural language to refer to specific 3D locations based on a vast range of properties: visual appearance, semantics, abstract associations, or actionable affordances. In this work we propose Language Embedded Radiance Fields (LERFs), a method for grounding language embeddings from off-the-shelf models like CLIP into NeRF, which enables these types of open-ended language queries in 3D. LERF learns a dense, multi-scale language field inside NeRF by volume rendering CLIP embeddings along training rays, supervising these embeddings across training views to provide multi-view consistency and smooth the underlying language field. After optimization, LERF can extract 3D relevancy maps for a broad range of language prompts interactively in real-time, which has potential use cases in robotics, understanding vision-language models, and interacting with 3D scenes. LERF enables pixel-aligned, zero-shot queries on the distilled 3D CLIP embeddings without relying on region proposals or masks, supporting long-tail open-vocabulary queries hierarchically across the volume. See the project website at: https://lerf.io + + + + DomainAdaptor: A Novel Approach to Test-time Adaptation http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_DomainAdaptor_A_Novel_Approach_to_Test-time_Adaptation_ICCV_2023_paper.pdf To deal with the domain shift between training and test samples, current methods have primarily focused on learning generalizable features during training while ignoring the specificity of unseen samples, which is also critical during the test. In this paper, we investigate a more challenging task that aims to adapt a trained CNN model to unseen domains during the test. To maximally mine the information in the test data, we propose a unified method called DomainAdaptor for test-time adaptation, which consists of an AdaMixBN module and a Generalized Entropy Minimization (GEM) loss.
Specifically, AdaMixBN addresses the domain shift by adaptively fusing training and test statistics in the normalization layer via a dynamic mixture coefficient and a statistic transformation operation. To further enhance the adaptation ability of AdaMixBN, we design a GEM loss that extends the Entropy Minimization loss to better exploit the information in the test data. Extensive experiments show DomainAdaptor consistently outperforms the state-of-the-art methods on four benchmarks. Furthermore, our method brings more remarkable improvement against existing methods on the few-data unseen domain. The code is available at https://github.com/koncle/DomainAdaptor. + + + + Mitigating and Evaluating Static Bias of Action Representations in the Background and the Foreground + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Mitigating_and_Evaluating_Static_Bias_of_Action_Representations_in_the_ICCV_2023_paper.pdf + In video action recognition, shortcut static features can interfere with the learning of motion features, resulting in poor out-of-distribution (OOD) generalization. The video background is clearly a source of static bias, but the video foreground, such as the clothing of the actor, can also provide static bias. In this paper, we empirically verify the existence of foreground static bias by creating test videos with conflicting signals from the static and moving portions of the video. To tackle this issue, we propose a simple yet effective technique, StillMix, to learn robust action representations. Specifically, StillMix identifies bias-inducing video frames using a 2D reference network and mixes them with videos for training, serving as effective bias suppression even when we cannot explicitly extract the source of bias within each video frame or enumerate types of bias. Finally, to precisely evaluate static bias, we synthesize two new benchmarks, SCUBA for static cues in the background, and SCUFO for static cues in the foreground. With extensive experiments, we demonstrate that StillMix mitigates both types of static bias and improves video representations for downstream applications. + + + + CuNeRF: Cube-Based Neural Radiance Field for Zero-Shot Medical Image Arbitrary-Scale Super Resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_CuNeRF_Cube-Based_Neural_Radiance_Field_for_Zero-Shot_Medical_Image_Arbitrary-Scale_ICCV_2023_paper.pdf + Medical image arbitrary-scale super-resolution (MIASSR) has recently gained widespread attention, aiming to supersample medical volumes at arbitrary scales via a single model. However, existing MIASSR methods face two major limitations: (i) reliance on high-resolution (HR) volumes and (ii) limited generalization ability, which restricts their applications in various scenarios. To overcome these limitations, we propose Cube-based Neural Radiance Field (CuNeRF), a zero-shot MIASSR framework that is able to yield medical images at arbitrary scales and free viewpoints in a continuous domain. Unlike existing MISR methods that only fit the mapping between low-resolution (LR) and HR volumes, CuNeRF focuses on building a continuous volumetric representation from each LR volume without the knowledge from the corresponding HR one. This is achieved by the proposed differentiable modules: cube-based sampling, isotropic volume rendering, and cube-based hierarchical rendering. 
Through extensive experiments on magnetic resonance imaging (MRI) and computed tomography (CT) modalities, we demonstrate that CuNeRF can synthesize high-quality SR medical images, outperforming state-of-the-art MISR methods and achieving better visual verisimilitude with fewer objectionable artifacts. Compared to existing MISR methods, our CuNeRF is more applicable in practice. + + + + Beyond Object Recognition: A New Benchmark towards Object Concept Learning http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Beyond_Object_Recognition_A_New_Benchmark_towards_Object_Concept_Learning_ICCV_2023_paper.pdf Understanding objects is a central building block of AI, especially for embodied AI. Even though object recognition excels with deep learning, current machines struggle to learn higher-level knowledge, e.g., what attributes an object has, and what we can do with it. Here, we propose a challenging Object Concept Learning (OCL) task to push the envelope of object understanding. It requires machines to reason out affordances and simultaneously give the reason: what attributes make an object possess these affordances. To support OCL, we build a densely annotated knowledge base including extensive annotations for three levels of object concept (category, attribute, affordance), and the clear causal relations of the three levels. By analyzing the causal structure of OCL, we present a baseline, Object Concept Reasoning Network (OCRN). It leverages concept instantiation and causal intervention to infer the three levels. In experiments, OCRN effectively infers the object knowledge while following the causalities well. Our data and code are available at https://mvig-rhos.com/ocl. + + + + EgoObjects: A Large-Scale Egocentric Dataset for Fine-Grained Object Understanding http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_EgoObjects_A_Large-Scale_Egocentric_Dataset_for_Fine-Grained_Object_Understanding_ICCV_2023_paper.pdf Object understanding in egocentric visual data is arguably a fundamental research topic in egocentric vision. However, existing object datasets are either non-egocentric or have limitations in object categories, visual content, and annotation granularities. In this work, we introduce EgoObjects, a large-scale egocentric dataset for fine-grained object understanding. Its Pilot version contains over 9K videos collected by 250 participants from 50+ countries using 4 wearable devices, and over 650K object annotations from 368 object categories. Unlike prior datasets containing only object category labels, EgoObjects also annotates each object with an instance-level identifier, and includes over 14K unique object instances. EgoObjects was designed to capture the same object under diverse background complexities, surrounding objects, distance, lighting and camera motion. In parallel to the data collection, we conducted data annotation by developing a multi-stage federated annotation process to accommodate the growing nature of the dataset. To bootstrap the research on EgoObjects, we present a suite of 4 benchmark tasks around egocentric object understanding, including novel instance-level and classical category-level object detection. Moreover, we also introduce 2 novel continual learning object detection tasks. The dataset and API are available at https://github.com/facebookresearch/EgoObjects.
+ + + + Simulating Fluids in Real-World Still Images http://openaccess.thecvf.com//content/ICCV2023/papers/Fan_Simulating_Fluids_in_Real-World_Still_Images_ICCV_2023_paper.pdf In this work, we tackle the problem of real-world fluid animation from a still image. The key to our system is a surface-based layered representation, where the scene is decoupled into a surface fluid layer and an impervious background layer with corresponding transparencies to characterize the composition of the two layers. The animated video can be produced by warping only the surface fluid layer according to the estimation of fluid motions and recombining it with the background. In addition, we introduce surface-only fluid simulation, a 2.5D fluid calculation, as a replacement for motion estimation. Specifically, we leverage a triangular mesh based on a monocular depth estimator to represent the fluid surface layer and simulate the motion, drawing inspiration from the classic hybrid Lagrangian-Eulerian method in physics, along with a learnable network so as to adapt to complex real-world image textures. Extensive experiments indicate not only our method's competitive performance for common fluid scenes but also its better robustness and reasonability under complex transparent fluid scenarios. Moreover, as the proposed surface-based layer representation and surface-only fluid simulation naturally disentangle the scene, interactive editing such as adding objects and replacing textures can be easily achieved with realistic results. + + + + SC3K: Self-supervised and Coherent 3D Keypoints Estimation from Rotated, Noisy, and Decimated Point Cloud Data http://openaccess.thecvf.com//content/ICCV2023/papers/Zohaib_SC3K_Self-supervised_and_Coherent_3D_Keypoints_Estimation_from_Rotated_Noisy_ICCV_2023_paper.pdf This paper proposes a new method to infer keypoints from arbitrary object categories in practical scenarios where point cloud data (PCD) are noisy, down-sampled and arbitrarily rotated. Our proposed model adheres to the following principles: i) keypoint inference is fully unsupervised (no annotation given), ii) keypoint position error should be low and resilient to PCD perturbations (robustness), iii) keypoints should not change their indexes for intra-class objects (semantic coherence), iv) keypoints should be close or proximal to the PCD surface (compactness). We achieve these desiderata by proposing a new self-supervised training strategy for keypoint estimation that does not assume any a priori knowledge of the object class, and a model architecture with coupled auxiliary losses that promotes the desired keypoint properties. We compare the keypoints estimated by the proposed approach with those of the state-of-the-art unsupervised approaches. The experiments show that our approach outperforms them, estimating keypoints with improved coverage (+9.41%) and semantic consistency (+4.66%) that best characterize the object's 3D shape for downstream tasks. + + + + Segmenting Known Objects and Unseen Unknowns without Prior Knowledge http://openaccess.thecvf.com//content/ICCV2023/papers/Gasperini_Segmenting_Known_Objects_and_Unseen_Unknowns_without_Prior_Knowledge_ICCV_2023_paper.pdf Panoptic segmentation methods assign a known class to each pixel given as input. Even for state-of-the-art approaches, this inevitably enforces decisions that systematically lead to wrong predictions for objects outside the training categories.
However, robustness against out-of-distribution samples and corner cases is crucial in safety-critical settings to avoid dangerous consequences. Since real-world datasets cannot contain enough data points to adequately sample the long tail of the underlying distribution, models must be able to deal with unseen and unknown scenarios as well. Previous methods targeted this by re-identifying already-seen unlabeled objects. In this work, we propose the necessary step to extend segmentation with a new setting which we term holistic segmentation. Holistic segmentation aims to identify and separate objects of unseen, unknown categories into instances without any prior knowledge about them while performing panoptic segmentation of known classes. We tackle this new problem with U3HS, which finds unknowns as highly uncertain regions and clusters their corresponding instance-aware embeddings into individual objects. By doing so, for the first time in panoptic segmentation with unknown objects, our U3HS is trained without unknown categories, reducing assumptions and leaving the settings as unconstrained as in real-life scenarios. Extensive experiments on public data from MS COCO, Cityscapes, and Lost&Found demonstrate the effectiveness of U3HS for this new, challenging, and assumptions-free setting called holistic segmentation. Project page: https://holisticseg.github.io. + + + + CMDA: Cross-Modality Domain Adaptation for Nighttime Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_CMDA_Cross-Modality_Domain_Adaptation_for_Nighttime_Semantic_Segmentation_ICCV_2023_paper.pdf + Most nighttime semantic segmentation studies are based on domain adaptation approaches and image input. However, limited by the low dynamic range of conventional cameras, images fail to capture structural details and boundary information in low-light conditions. Event cameras, as a new form of vision sensors, are complementary to conventional cameras with their high dynamic range. To this end, we propose a novel unsupervised Cross-Modality Domain Adaptation (CMDA) framework to leverage multi-modality (Images and Events) information for nighttime semantic segmentation, with only labels on daytime images. In CMDA, we design the Image Motion-Extractor to extract motion information and the Image Content-Extractor to extract content information from images, in order to bridge the gap between different modalities (Images to Events) and domains (Day to Night). Besides, we introduce the first image-event nighttime semantic segmentation dataset. Extensive experiments on both the public image dataset and the proposed image-event dataset demonstrate the effectiveness of our proposed approach. We open-source our code, models, and dataset at https://github.com/XiaRho/CMDA. + + + + Learning with Diversity: Self-Expanded Equalization for Better Generalized Deep Metric Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_Learning_with_Diversity_Self-Expanded_Equalization_for_Better_Generalized_Deep_Metric_ICCV_2023_paper.pdf + Exploring good generalization ability is essential in deep metric learning (DML). Most existing DML methods focus on improving the model robustness against category shift to keep the performance on unseen categories. However, in addition to category shift, domain shift also widely exists in real-world scenarios. Therefore, learning better generalization ability for the DML model is still a challenging yet realistic problem. 
In this paper, we propose a new self-expanded equalization (SEE) method to effectively generalize the DML model to both unseen categories and domains. Specifically, we take a `min-max' strategy combined with a proxy-based loss to adaptively augment diverse out-of-distribution samples that vastly expand the span of original training data. To take full advantage of the implicit cross-domain relations between source and augmented samples, we introduce a domain-aware equalization module to induce the domain-invariant distance metric by regularizing the feature distribution in the metric space. Extensive experiments on two benchmarks and a large-scale multi-domain dataset demonstrate the superiority of our SEE over the existing DML methods. + + + + Dynamic Residual Classifier for Class Incremental Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Dynamic_Residual_Classifier_for_Class_Incremental_Learning_ICCV_2023_paper.pdf + The rehearsal strategy is widely used to alleviate the catastrophic forgetting problem in class incremental learning (CIL) by preserving limited exemplars from previous tasks. With imbalanced sample numbers between old and new classes, the classifier learning can be biased. Existing CIL methods exploit the long-tailed (LT) recognition techniques, e.g., the adjusted losses and the data re-sampling methods, to handle the data imbalance issue within each increment task. In this work, the dynamic nature of data imbalance in CIL is shown and a novel Dynamic Residual Classifier (DRC) is proposed to handle this challenging scenario. Specifically, DRC is built upon a recent advance residual classifier with the branch layer merging to handle the model-growing problem. Moreover, DRC is compatible with different CIL pipelines and substantially improves them. Combining DRC with the model adaptation and fusion (MAF) pipeline, this method achieves state-of-the-art results on both the conventional CIL and the LT-CIL benchmarks. Extensive experiments are also conducted for a detailed analysis. The code is publicly available. + + + + Optimizing the Placement of Roadside LiDARs for Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Optimizing_the_Placement_of_Roadside_LiDARs_for_Autonomous_Driving_ICCV_2023_paper.pdf + Multi-agent cooperative perception is an increasingly popular topic in the field of autonomous driving, where roadside LiDARs play an essential role. However, how to optimize the placement of roadside LiDARs is a crucial but often overlooked problem. This paper proposes an approach to optimize the placement of roadside LiDARs by selecting optimized positions within the scene for better perception performance. To efficiently obtain the best combination of locations, a greedy algorithm based on the perceptual gain is proposed, which selects the location that can maximize the perceptual gain sequentially. We define perceptual gain as the increased perceptual capability when a new LiDAR is placed. To obtain the perception capability, we propose a perception predictor that learns to evaluate LiDAR placement using only a single point cloud frame. A dataset named Roadside-Opt is created using the CARLA simulator to facilitate research on the roadside LiDAR placement problem. Extensive experiments are conducted to demonstrate the effectiveness of our proposed method. 
+ + + + Diverse Inpainting and Editing with GAN Inversion + http://openaccess.thecvf.com//content/ICCV2023/papers/Yildirim_Diverse_Inpainting_and_Editing_with_GAN_Inversion_ICCV_2023_paper.pdf + Recent inversion methods have shown that real images can be inverted into StyleGAN's latent space and numerous edits can be achieved on those images thanks to the semantically rich feature representations of well-trained GAN models. However, extensive research has also shown that image inversion is challenging due to the trade-off between high-fidelity reconstruction and editability. In this paper, we tackle an even more difficult task, inverting erased images into GAN's latent space for realistic inpaintings and editings. Furthermore, by augmenting inverted latent codes with different latent samples, we achieve diverse inpaintings. Specifically, we propose to learn an encoder and mixing network to combine encoded features from erased images with StyleGAN's mapped features from random samples. To encourage the mixing network to utilize both inputs, we train the networks with generated data via a novel set-up. We also utilize higher-rate features to prevent color inconsistencies between the inpainted and unerased parts. We run extensive experiments and compare our method with state-of-the-art inversion and inpainting methods. Qualitative metrics and visual comparisons show significant improvements. + + + + DiFaReli: Diffusion Face Relighting + http://openaccess.thecvf.com//content/ICCV2023/papers/Ponglertnapakorn_DiFaReli_Diffusion_Face_Relighting_ICCV_2023_paper.pdf + We present a novel approach to single-view face relighting in the wild. Handling non-diffuse effects, such as global illumination or cast shadows, has long been a challenge in face relighting. Prior work often assumes Lambertian surfaces, simplified lighting models or involves estimating 3D shape, albedo, or a shadow map. This estimation, however, is error-prone and requires many training examples with lighting ground truth to generalize well. Our work bypasses the need for accurate estimation of intrinsic components and can be trained solely on 2D images without any light stage data, multi-view images, or lighting ground truth. Our key idea is to leverage a conditional diffusion implicit model (DDIM) for decoding a disentangled light encoding along with other encodings related to 3D shape and facial identity inferred from off-the-shelf estimators. We also propose a novel conditioning technique that eases the modeling of the complex interaction between light and geometry by using a rendered shading reference to spatially modulate the DDIM. We achieve state-of-the-art performance on standard benchmark Multi-PIE and can photorealistically relight in-the-wild images. Please visit our page: https://diffusion-face-relighting.github.io. + + + + Building3D: A Urban-Scale Dataset and Benchmarks for Learning Roof Structures from Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Building3D_A_Urban-Scale_Dataset_and_Benchmarks_for_Learning_Roof_Structures_ICCV_2023_paper.pdf + Urban modeling from LiDAR point clouds is an important topic in computer vision, computer graphics, photogrammetry and remote sensing. 3D city models have found a wide range of applications in smart cities, autonomous navigation, urban planning and mapping etc. However, existing datasets for 3D modeling mainly focus on common objects such as furniture or cars. 
The lack of building datasets has become a major obstacle to applying deep learning technology to specific domains such as urban modeling. In this paper, we present an urban-scale dataset consisting of more than 160 thousand buildings along with corresponding point clouds, mesh and wire-frame models, covering 16 cities in Estonia over an area of about 998 km2. We extensively evaluate the performance of state-of-the-art algorithms, including handcrafted and deep-feature-based methods. Experimental results indicate that Building3D poses challenges of high intra-class variance, data imbalance and large-scale noise. Building3D is the first and largest urban-scale building modeling benchmark, allowing a comparison of supervised and self-supervised learning methods. We believe that our Building3D will facilitate future research on urban modeling, aerial path planning, mesh simplification, and semantic/part segmentation, etc. + + + + Localizing Object-Level Shape Variations with Text-to-Image Diffusion Models http://openaccess.thecvf.com//content/ICCV2023/papers/Patashnik_Localizing_Object-Level_Shape_Variations_with_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf Text-to-image models give rise to workflows which often begin with an exploration step, where users sift through a large collection of generated images. The global nature of the text-to-image generation process prevents users from narrowing their exploration to a particular object in the image. In this paper, we present a technique to generate a collection of images that depicts variations in the shape of a specific object, enabling an object-level shape exploration process. Creating plausible variations is challenging as it requires control over the shape of the generated object while respecting its semantics. A particular challenge when generating object variations is accurately localizing the manipulation applied over the object's shape. We introduce a prompt-mixing technique that switches between prompts along the denoising process to attain a variety of shape choices. To localize the image-space operation, we present two techniques that use the self-attention layers in conjunction with the cross-attention layers. Moreover, we show that these localization techniques are general and effective beyond the scope of generating object variations. Extensive results and comparisons demonstrate the effectiveness of our method in generating object variations, and the competence of our localization techniques. + + + + CoSign: Exploring Co-occurrence Signals in Skeleton-based Continuous Sign Language Recognition http://openaccess.thecvf.com//content/ICCV2023/papers/Jiao_CoSign_Exploring_Co-occurrence_Signals_in_Skeleton-based_Continuous_Sign_Language_Recognition_ICCV_2023_paper.pdf The co-occurrence signals (e.g., hand shape, facial expression, and lip pattern) play a critical role in Continuous Sign Language Recognition (CSLR). Compared to RGB data, skeleton data provide a more efficient and concise option, and lay a good foundation for co-occurrence exploration in CSLR. However, skeleton data are often used as a tool to assist visual grounding and have not attracted sufficient attention. In this paper, we propose a simple yet effective GCN-based approach, named CoSign, to incorporate Co-occurrence Signals and explore the potential of skeleton data in CSLR. Specifically, we propose a group-specific GCN to better exploit the knowledge of each signal and a complementary regularization to prevent complex co-adaptation across signals.
Furthermore, we propose a two-stream framework that gradually fuses both static and dynamic information in skeleton data. Experimental results on three public CSLR datasets (PHOENIX14, PHOENIX14-T and CSL-Daily) show that the proposed CoSign achieves competitive performance with recent video-based approaches while reducing the computation cost during training. + + + + Disentangle then Parse: Night-time Semantic Segmentation with Illumination Disentanglement + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Disentangle_then_Parse_Night-time_Semantic_Segmentation_with_Illumination_Disentanglement_ICCV_2023_paper.pdf + Most prior semantic segmentation methods have been developed for day-time scenes, while typically underperforming in night-time scenes due to insufficient and complicated lighting conditions. In this work, we tackle this challenge by proposing a novel night-time semantic segmentation paradigm, i.e., disentangle then parse (DTP). DTP explicitly disentangles night-time images into light-invariant reflectance and light-specific illumination components and then recognizes semantics based on their adaptive fusion. Concretely, the proposed DTP comprises two key components: 1) Instead of processing lighting-entangled features as in prior works, our Semantic-Oriented Disentanglement (SOD) framework enables the extraction of reflectance component without being impeded by lighting, allowing the network to consistently recognize the semantics under cover of varying and complicated lighting conditions. 2) Based on the observation that the illumination component can serve as a cue for some semantically confused regions, we further introduce an Illumination-Aware Parser (IAParser) to explicitly learn the correlation between semantics and lighting, and aggregate the illumination features to yield more precise predictions. Extensive experiments on the night-time segmentation task with various settings demonstrate that DTP significantly outperforms state-of-the-art methods. Furthermore, with negligible additional parameters, DTP can be directly used to benefit existing day-time methods for night-time segmentation. Code and dataset are available at https://github.com/w1oves/DTP.git. + + + + Large-Scale Land Cover Mapping with Fine-Grained Classes via Class-Aware Semi-Supervised Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Large-Scale_Land_Cover_Mapping_with_Fine-Grained_Classes_via_Class-Aware_Semi-Supervised_ICCV_2023_paper.pdf + Semi-supervised learning has attracted increasing attention in the large-scale land cover mapping task. However, existing methods overlook the potential to alleviate the class imbalance problem by selecting a suitable set of unlabeled data. Besides, in class-imbalanced scenarios, existing pseudo-labeling methods mostly only pick confident samples, failing to exploit the hard samples during training. To tackle these issues, we propose a unified Class-Aware Semi-Supervised Semantic Segmentation framework. The proposed framework consists of three key components. To construct a better semi-supervised learning dataset, we propose a class-aware unlabeled data selection method that is more balanced towards the minority classes. Based on the built dataset with improved class balance, we propose a Class-Balanced Cross Entropy loss, jointly considering the annotation bias and the class bias to re-weight the loss in both sample and class levels to alleviate the class imbalance problem. 
Moreover, we propose the Class Center Contrast method to jointly utilize the labeled and unlabeled data. Specifically, we decompose the feature embedding space using the ground truth and pseudo-labels, and employ the embedding centers for hard and easy samples of each class per image in the contrast loss to exploit the hard samples during training. Compared with state-of-the-art class-balanced pseudo-labeling methods, the proposed method improves the mean accuracy and mIoU by 4.28% and 1.70%, respectively, on the large-scale Sentinel-2 dataset with 24 land cover classes. + + + + LISTER: Neighbor Decoding for Length-Insensitive Scene Text Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_LISTER_Neighbor_Decoding_for_Length-Insensitive_Scene_Text_Recognition_ICCV_2023_paper.pdf + The diversity in length constitutes a significant characteristic of text. Due to the long-tail distribution of text lengths, most existing methods for scene text recognition (STR) only work well on short or seen-length text, lacking the capability of recognizing longer text or performing length extrapolation. This is a crucial issue, since the lengths of the text to be recognized are usually not given in advance in real-world applications, but it has not been adequately investigated in previous works. Therefore, we propose in this paper a method called Length-Insensitive Scene TExt Recognizer (LISTER), which remedies the limitation regarding the robustness to various text lengths. Specifically, a Neighbor Decoder is proposed to obtain accurate character attention maps with the assistance of a novel neighbor matrix regardless of the text lengths. Besides, a Feature Enhancement Module is devised to model the long-range dependency with low computation cost, which is able to perform iterations with the neighbor decoder to enhance the feature map progressively. To the best of our knowledge, we are the first to achieve effective length-insensitive scene text recognition. Extensive experiments demonstrate that the proposed LISTER algorithm exhibits obvious superiority on long text recognition and the ability for length extrapolation, while comparing favourably with the previous state-of-the-art methods on standard benchmarks for STR (mainly short text). + + + + Proxy Anchor-based Unsupervised Learning for Continuous Generalized Category Discovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_Proxy_Anchor-based_Unsupervised_Learning_for_Continuous_Generalized_Category_Discovery_ICCV_2023_paper.pdf + Recent advances in deep learning have significantly improved the performance of various computer vision applications. However, discovering novel categories in an incremental learning scenario remains a challenging problem due to the lack of prior knowledge about the number and nature of new categories. Existing methods for novel category discovery are limited by their reliance on labeled datasets and prior knowledge about the number of novel categories and the proportion of novel samples in the batch. To address the limitations and more accurately reflect real-world scenarios, in this paper, we propose a novel unsupervised class incremental learning approach for discovering novel categories on unlabeled sets without prior knowledge. The proposed method fine-tunes the feature extractor and proxy anchors on labeled sets, then splits samples into old and novel categories and clusters on the unlabeled dataset. 
Furthermore, the proxy anchors-based exemplar generates representative category vectors to mitigate catastrophic forgetting. Experimental results demonstrate that our proposed approach outperforms the state-of-the-art methods on fine-grained datasets under real-world scenarios. + + + + Distribution-Aware Prompt Tuning for Vision-Language Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Cho_Distribution-Aware_Prompt_Tuning_for_Vision-Language_Models_ICCV_2023_paper.pdf + Pre-trained vision-language models (VLMs) have shown impressive performance on various downstream tasks by utilizing knowledge learned from large data. In general, the performance of VLMs on target tasks can be further improved by prompt tuning, which adds context to the input image or text. By leveraging data from target tasks, various prompt-tuning methods have been studied in the literature. A key to prompt tuning is the feature space alignment between two modalities via learnable vectors with model parameters fixed. We observed that the alignment becomes more effective when embeddings of each modality are 'well-arranged' in the latent space. Inspired by this observation, we proposed distribution-aware prompt tuning (DAPT) for vision-language models, which is simple yet effective. Specifically, the prompts are learned by maximizing inter-dispersion, the distance between classes, as well as minimizing the intra-dispersion measured by the distance between embeddings from the same class. Our extensive experiments on 11 benchmark datasets demonstrate that our method significantly improves generalizability. The code is available at https://github.com/mlvlab/DAPT. + + + + Fantasia3D: Disentangling Geometry and Appearance for High-quality Text-to-3D Content Creation + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Fantasia3D_Disentangling_Geometry_and_Appearance_for_High-quality_Text-to-3D_Content_Creation_ICCV_2023_paper.pdf + Automatic 3D content creation has achieved rapid progress recently due to the availability of pre-trained, large language models and image diffusion models, forming the emerging topic of text-to-3D content creation. Existing text-to-3D methods commonly use implicit scene representations, which couple the geometry and appearance via volume rendering and are suboptimal in terms of recovering finer geometries and achieving photorealistic rendering; consequently, they are less effective for generating high-quality 3D assets. In this work, we propose a new method of Fantasia3D for high-quality text-to-3D content creation. Key to Fantasia3D is the disentangled modeling and learning of geometry and appearance. For geometry learning, we rely on a hybrid scene representation, and propose to encode surface normal extracted from the representation as the input of the image diffusion model. For appearance modeling, we introduce the spatially varying bidirectional reflectance distribution function (BRDF) into the text-to-3D task, and learn the surface material for photorealistic rendering of the generated surface. Our disentangled framework is more compatible with popular graphics engines, supporting relighting, editing, and physical simulation of the generated 3D assets. We conduct thorough experiments that show the advantages of our method over existing ones under different text-to-3D task settings. Project page and source codes: https://fantasia3d.github.io/. 
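A minimal sketch of the two dispersion objectives described in the Distribution-Aware Prompt Tuning (DAPT) abstract above: embeddings of the same class are pulled toward their class prototype (intra-dispersion) while prototypes of different classes are pushed apart (inter-dispersion). The shapes, the margin, and the loss weighting are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn.functional as F

def dispersion_losses(embeddings, labels, num_classes, margin=0.5):
    """embeddings: (N, D) features; labels: (N,) int64 class ids.
    Returns (intra, inter): intra-dispersion to minimize, and a hinge term
    whose minimization pushes class prototypes apart."""
    embeddings = F.normalize(embeddings, dim=-1)
    protos, intra = [], embeddings.new_tensor(0.0)
    for c in range(num_classes):
        members = embeddings[labels == c]
        if members.numel() == 0:
            continue
        center = members.mean(dim=0)
        protos.append(center)
        # intra-dispersion: mean squared distance of members to their prototype
        intra = intra + (members - center).pow(2).sum(dim=-1).mean()
    protos = torch.stack(protos)                       # (C', D)
    # inter-dispersion: hinge on pairwise prototype distances, so the loss
    # vanishes once prototypes are farther apart than the margin
    dists = torch.cdist(protos, protos)                # (C', C')
    off_diag = dists[~torch.eye(len(protos), dtype=torch.bool)]
    inter = F.relu(margin - off_diag).mean()
    return intra / max(len(protos), 1), inter

# usage sketch: total = task_loss + intra + lambda_inter * inter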
+ + + + MagicFusion: Boosting Text-to-Image Generation Performance by Fusing Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_MagicFusion_Boosting_Text-to-Image_Generation_Performance_by_Fusing_Diffusion_Models_ICCV_2023_paper.pdf + The advent of open-source AI communities has produced a cornucopia of powerful text-guided diffusion models that are trained on various datasets. However, few explorations have been conducted on ensembling such models to combine their strengths. In this work, we propose a simple yet effective method called Saliency-aware Noise Blending (SNB) that can empower the fused text-guided diffusion models to achieve more controllable generation. Specifically, we experimentally find that the responses of classifier-free guidance are highly related to the saliency of generated images. Thus, we propose to trust different models in their areas of expertise by blending the predicted noises of two diffusion models in a saliency-aware manner. SNB is training-free and can be completed within a DDIM sampling process. Additionally, it can automatically align the semantics of two noise spaces without requiring additional annotations such as masks. Extensive experiments show the impressive effectiveness of SNB in various applications. The project page is available at https://magicfusion.github.io. + + + + UCF: Uncovering Common Features for Generalizable Deepfake Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Yan_UCF_Uncovering_Common_Features_for_Generalizable_Deepfake_Detection_ICCV_2023_paper.pdf + Deepfake detection remains a challenging task due to the difficulty of generalizing to new types of forgeries. This problem primarily stems from the overfitting of existing detection methods to forgery-irrelevant features and method-specific patterns. The latter has been rarely studied and not well addressed by previous works. This paper presents a novel approach to address the two types of overfitting issues by uncovering common forgery features. Specifically, we first propose a disentanglement framework that decomposes image information into three distinct components: forgery-irrelevant, method-specific forgery, and common forgery features. To ensure the decoupling of method-specific and common forgery features, a multi-task learning strategy is employed, including a multi-class classification that predicts the category of the forgery method and a binary classification that distinguishes the real from the fake. Additionally, a conditional decoder is designed to utilize forgery features as a condition along with forgery-irrelevant features to generate reconstructed images. Furthermore, a contrastive regularization technique is proposed to encourage the disentanglement of the common and specific forgery features. Ultimately, we only utilize the common forgery features for the purpose of generalizable deepfake detection. Extensive evaluations demonstrate that our framework achieves superior generalization compared to current state-of-the-art methods. + + + + Sample4Geo: Hard Negative Sampling For Cross-View Geo-Localisation + http://openaccess.thecvf.com//content/ICCV2023/papers/Deuser_Sample4Geo_Hard_Negative_Sampling_For_Cross-View_Geo-Localisation_ICCV_2023_paper.pdf + Cross-View Geo-Localisation is still a challenging task where additional modules, specific pre-processing or zooming strategies are necessary to determine accurate positions of images. 
Since different views have different geometries, pre-processing like polar transformation helps to merge them. However, this results in distorted images which then have to be rectified. Adding hard negatives to the training batch could improve the overall performance, but with the default loss functions in geo-localisation it is difficult to include them. In this article, we present a simplified but effective architecture based on contrastive learning with symmetric InfoNCE loss that outperforms current state-of-the-art results. Our framework consists of a narrow training pipeline that eliminates the need for aggregation modules, avoids further pre-processing steps and even increases the generalisation capability of the model to unknown regions. We introduce two types of sampling strategies for hard negatives. The first explicitly exploits geographically neighboring locations to provide a good starting point. The second leverages the visual similarity between the image embeddings in order to mine hard negative samples. Our work shows excellent performance on common cross-view datasets like CVUSA, CVACT, University-1652 and VIGOR. A comparison between cross-area and same-area settings demonstrates the good generalisation capability of our model. + + + + Novel Scenes & Classes: Towards Adaptive Open-set Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Novel_Scenes__Classes_Towards_Adaptive_Open-set_Object_Detection_ICCV_2023_paper.pdf + Domain Adaptive Object Detection (DAOD) transfers an object detector to a novel domain free of labels. However, in the real world, besides encountering novel scenes, novel domains always contain novel-class objects de facto, which are ignored in existing research. Thus, we formulate and study a more practical setting, Adaptive Open-set Object Detection (AOOD), considering both novel scenes and classes. Directly combining off-the-shelf cross-domain and open-set approaches is sub-optimal because their low-order dependence, e.g., the confidence score, is insufficient for the AOOD setting with its two dimensions of novel information. To address this, we propose a novel Structured mOtif MAtching (SOMA) framework for AOOD, which models the high-order relation with motifs, i.e., a statistically significant subgraph, and formulates the AOOD solution as motif matching to learn with high-order patterns. In a nutshell, SOMA consists of Structure-aware Novel-class Learning (SNL) and Structure-aware Transfer Learning (STL). As for SNL, we establish an instance-oriented graph to capture the class-independent object feature hidden in different base classes. Then, a high-order metric is proposed to match the most significant motif as high-order patterns, serving for motif-guided novel-class learning. In STL, we set up a semantic-oriented graph to model the class-dependent relation across domains, and match unlabelled objects with high-order motifs to align the cross-domain distribution with structural awareness. Extensive experiments demonstrate that the proposed SOMA achieves state-of-the-art performance. Codes will be released publicly for further study. + + + + LIMITR: Leveraging Local Information for Medical Image-Text Representation + http://openaccess.thecvf.com//content/ICCV2023/papers/Dawidowicz_LIMITR_Leveraging_Local_Information_for_Medical_Image-Text_Representation_ICCV_2023_paper.pdf + Medical imaging analysis plays a critical role in the diagnosis and treatment of various medical conditions. 
This paper focuses on chest X-ray images and their corresponding radiological reports. It presents a new model that learns a joint X-ray image & report representation. The model is based on a novel alignment scheme between the visual data and the text, which takes into account both local and global information. Furthermore, the model integrates domain-specific information of two types -- lateral images and the consistent visual structure of chest images. Our representation is shown to benefit three types of retrieval tasks: text-image retrieval, class-based retrieval, and phrase-grounding. + + + + Multi-task View Synthesis with Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Multi-task_View_Synthesis_with_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Multi-task visual learning is a critical aspect of computer vision. Current research, however, predominantly concentrates on the multi-task dense prediction setting, which overlooks the intrinsic 3D world and its multi-view consistent structures, and lacks the capacity for versatile imagination. In response to these limitations, we present a novel problem setting -- multi-task view synthesis (MTVS), which reinterprets multi-task prediction as a set of novel-view synthesis tasks for multiple scene properties, including RGB. To tackle the MTVS problem, we propose MuvieNeRF, a framework that incorporates both multi-task and cross-view knowledge to simultaneously synthesize multiple scene properties. MuvieNeRF integrates two key modules, the Cross-Task Attention (CTA) and Cross-View Attention (CVA) modules, enabling the efficient use of information across multiple views and tasks. Extensive evaluations on both synthetic and realistic benchmarks demonstrate that MuvieNeRF is capable of simultaneously synthesizing different scene properties with promising visual quality, even outperforming conventional discriminative models in various settings. Notably, we show that MuvieNeRF exhibits universal applicability across a range of NeRF backbones. Our code is available at https://github.com/zsh2000/MuvieNeRF. + + + + Visual Traffic Knowledge Graph Generation from Scene Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Visual_Traffic_Knowledge_Graph_Generation_from_Scene_Images_ICCV_2023_paper.pdf + Although previous works on traffic scene understanding have achieved great success, most of them stop at a low-level perception stage, such as road segmentation and lane detection, and few concern high-level understanding. In this paper, we present Visual Traffic Knowledge Graph Generation (VTKGG), a new task for in-depth traffic scene understanding that tries to extract multiple kinds of information and integrate them into a knowledge graph. To achieve this goal, we first introduce a large dataset named CASIA-Tencent Road Scene dataset (RS10K) with comprehensive annotations to support related research. Secondly, we propose a novel traffic scene parsing architecture containing a Hierarchical Graph ATtention network (HGAT) to analyze the heterogeneous elements and their complicated relations in traffic scene images. By hierarchizing the heterogeneous graph and equipping it with cross-level links, our approach exploits the correlation among various elements completely and acquires accurate relations. The experimental results show that our method can effectively generate visual traffic knowledge graphs and achieve state-of-the-art performance. 
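The HGAT described in the VTKGG abstract above is not public in this feed; as a point of reference, the sketch below is a generic single-head graph attention layer (GAT-style) of the kind such hierarchical attention networks build on. The shapes and the LeakyReLU slope are conventional choices, not details taken from the paper.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttentionLayer(nn.Module):
    """Single-head graph attention: each node aggregates neighbour features
    weighted by a learned, softmax-normalized attention score."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim, bias=False)
        self.attn = nn.Linear(2 * out_dim, 1, bias=False)

    def forward(self, x, adj):
        # x: (N, in_dim) node features, adj: (N, N) adjacency with self-loops
        h = self.proj(x)                                      # (N, out_dim)
        n = h.size(0)
        hi = h.unsqueeze(1).expand(n, n, -1)                  # features of node i
        hj = h.unsqueeze(0).expand(n, n, -1)                  # features of node j
        scores = F.leaky_relu(self.attn(torch.cat([hi, hj], dim=-1)).squeeze(-1), 0.2)
        scores = scores.masked_fill(adj == 0, float("-inf"))  # attend only along edges
        alpha = torch.softmax(scores, dim=-1)                 # attention over neighbours
        return alpha @ h                                      # (N, out_dim)

# usage sketch: layer = GraphAttentionLayer(64, 64); out = layer(node_feats, adjacency)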
+ + + + OmnimatteRF: Robust Omnimatte with 3D Background Modeling + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_OmnimatteRF_Robust_Omnimatte_with_3D_Background_Modeling_ICCV_2023_paper.pdf + Video matting has broad applications, from adding interesting effects to casually captured movies to assisting video production professionals. Matting with associated effects such as shadows and reflections has also attracted increasing research activity, and methods like Omnimatte have been proposed to separate dynamic foreground objects of interest into their own layers. However, prior works represent video backgrounds as 2D image layers, limiting their capacity to express more complicated scenes, thus hindering application to real-world videos. In this paper, we propose a novel video matting method, OmnimatteRF, that combines dynamic 2D foreground layers and a 3D background model. The 2D layers preserve the details of the subjects, while the 3D background robustly reconstructs scenes in real-world videos. Extensive experiments demonstrate that our method reconstructs scenes with better quality on various videos. + + + + Bold but Cautious: Unlocking the Potential of Personalized Federated Learning through Cautiously Aggressive Collaboration + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Bold_but_Cautious_Unlocking_the_Potential_of_Personalized_Federated_Learning_ICCV_2023_paper.pdf + Personalized federated learning (PFL) reduces the impact of non-independent and identically distributed (non-IID) data among clients by allowing each client to train a personalized model when collaborating with others. A key question in PFL is to decide which parameters of a client should be localized or shared with others. In current mainstream approaches, all layers that are sensitive to non-IID data (such as classifier layers) are generally personalized. The reasoning behind this approach is understandable, as localizing parameters that are easily influenced by non-IID data can prevent potential negative effects of collaboration. However, we believe that this approach is too conservative for collaboration. For example, for a certain client, even if its parameters are easily influenced by non-IID data, it can still benefit by sharing these parameters with clients having similar data distribution. This observation emphasizes the importance of considering not only the sensitivity to non-IID data but also the similarity of data distribution when determining which parameters should be localized in PFL. This paper introduces a novel guideline for client collaboration in PFL. Unlike existing approaches that prohibit all collaboration of sensitive parameters, our guideline allows clients to share more parameters with others, leading to improved model performance. Additionally, we propose a new PFL method named FedCAC, which employs a quantitative metric to evaluate each parameter's sensitivity to non-IID data and carefully selects collaborators based on this evaluation. Experimental results demonstrate that FedCAC enables clients to share more parameters with others, resulting in superior performance compared to state-of-the-art methods, particularly in scenarios where clients have diverse distributions. 
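The FedCAC abstract above describes scoring each parameter's sensitivity to non-IID data and selecting collaborators per client, but the exact metric is not given in this feed. The sketch below therefore uses a simple stand-in (the magnitude of the local update) as the sensitivity score and cosine similarity between clients' sensitivity vectors to pick collaborators; all names and choices here are illustrative assumptions, not the FedCAC algorithm itself.

import torch

def sensitivity(global_state, local_state):
    """Stand-in per-parameter sensitivity: how far local training moved each
    (float) parameter tensor away from the global model."""
    return {k: (local_state[k] - global_state[k]).float().abs().mean()
            for k in global_state}

def select_collaborators(sens_by_client, me, top_k=3):
    """Pick the clients whose sensitivity pattern is most similar to `me`."""
    keys = sorted(sens_by_client[me])
    vec = lambda c: torch.stack([sens_by_client[c][k] for k in keys])
    mine = vec(me)
    sims = {c: float(torch.cosine_similarity(mine, vec(c), dim=0))
            for c in sens_by_client if c != me}
    return sorted(sims, key=sims.get, reverse=True)[:top_k]

def aggregate(local_states, collaborators, me):
    """Average each parameter over the client itself plus its collaborators."""
    group = [me] + list(collaborators)
    return {k: torch.stack([local_states[c][k] for c in group]).mean(dim=0)
            for k in local_states[me]}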
+ + + + ESSAformer: Efficient Transformer for Hyperspectral Image Super-resolution + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_ESSAformer_Efficient_Transformer_for_Hyperspectral_Image_Super-resolution_ICCV_2023_paper.pdf + Single hyperspectral image super-resolution (single-HSI-SR) aims to restore a high-resolution hyperspectral image from a low-resolution observation. However, the prevailing CNN-based approaches have shown limitations in building long-range dependencies and capturing interaction information between spectral features. This results in inadequate utilization of spectral information and artifacts after upsampling. To address this issue, we propose ESSAformer, an ESSA attention-embedded Transformer network for single-HSI-SR with an iterative refining structure. Specifically, we first introduce a robust and spectral-friendly similarity metric, i.e., the spectral correlation coefficient of the spectrum (SCC), to replace the original attention matrix and incorporates inductive biases into the model to facilitate training. Built upon it, we further utilize the kernelizable attention technique with theoretical support to form a novel efficient SCC-kernel-based self-attention (ESSA) and reduce attention computation to linear complexity. ESSA enlarges the receptive field for features after upsampling without bringing much computation and allows the model to effectively utilize spatial-spectral information from different scales, resulting in the generation of more natural high-resolution images. Without the need for pretraining on large-scale datasets, our experiments demonstrate ESSA's effectiveness in both visual quality and quantitative results. The code will be released. + + + + Thinking Image Color Aesthetics Assessment: Models, Datasets and Benchmarks + http://openaccess.thecvf.com//content/ICCV2023/papers/He_Thinking_Image_Color_Aesthetics_Assessment_Models_Datasets_and_Benchmarks_ICCV_2023_paper.pdf + We present a comprehensive study on a new task named image color aesthetics assessment (ICAA), which aims to assess color aesthetics based on human perception. ICAA is important for various applications such as imaging measurement and image analysis. However, due to the highly diverse aesthetic preferences and numerous color combinations, ICAA presents more challenges than conventional image quality assessment tasks. To advance ICAA research, 1) we propose a baseline model called the Delegate Transformer, which not only deploys deformable transformers to adaptively allocate interest points, but also learns human color space segmentation behavior by the dedicated module. 2) We elaborately build a color-oriented dataset, ICAA17K, containing 17K images, covering 30 popular color combinations, 80 devices and 50 scenes, with each image densely annotated by more than 1,500 people. Moreover, we develop a large-scale benchmark of 15 methods, the most comprehensive one thus far based on two datasets, SPAQ and ICAA17K. Our work, not only achieves state-of-the-art performance, but more importantly offers the community a roadmap to explore solutions for ICAA. Code and dataset are available in https://github.com/woshidandan/Image-Color-Aesthetics-Assessment. 
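The spectral correlation coefficient (SCC) used as the attention similarity in the ESSAformer abstract above is, at its core, a Pearson-style correlation between spectra. Below is a plain, non-kernelized sketch of SCC-weighted attention; the linear-complexity kernelized ESSA variant from the paper is not reproduced, and the shapes are illustrative.

import torch

def scc_matrix(q, k, eps=1e-8):
    """Spectral correlation coefficient between every query/key spectrum.
    q: (N, C) query spectra, k: (M, C) key spectra -> (N, M) values in [-1, 1]."""
    qc = q - q.mean(dim=-1, keepdim=True)
    kc = k - k.mean(dim=-1, keepdim=True)
    num = qc @ kc.t()
    den = qc.norm(dim=-1, keepdim=True) * kc.norm(dim=-1, keepdim=True).t() + eps
    return num / den

def scc_attention(q, k, v, temperature=1.0):
    """Attention that replaces dot-product scores with SCC scores."""
    attn = torch.softmax(scc_matrix(q, k) / temperature, dim=-1)
    return attn @ v

# usage sketch: out = scc_attention(queries, keys, values)  # values: (M, D)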
+ + + + Multi-body Depth and Camera Pose Estimation from Multiple Views + http://openaccess.thecvf.com//content/ICCV2023/papers/Dal_Cin_Multi-body_Depth_and_Camera_Pose_Estimation_from_Multiple_Views_ICCV_2023_paper.pdf + Traditional and deep Structure-from-Motion (SfM) methods typically operate under the assumption that the scene is rigid, i.e., the environment is static or consists of a single moving object. Few multi-body SfM approaches address the reconstruction of multiple rigid bodies in a scene but suffer from the inherent scale ambiguity of SfM, such that objects are reconstructed at inconsistent scales. We propose a depth and camera pose estimation framework to resolve the scale ambiguity in multi-body scenes. Specifically, starting from disorganized images, we present a novel multi-view scale estimator that resolves the camera pose ambiguity and a multi-body plane sweep network that generalizes depth estimation to dynamic scenes. Experiments demonstrate the advantages of our method over state-of-the-art SfM frameworks in multi-body scenes and show that it achieves comparable results in static scenes. The code and dataset are available at https://github.com/andreadalcin/MultiBodySfM. + + + + DISeR: Designing Imaging Systems with Reinforcement Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Klinghoffer_DISeR_Designing_Imaging_Systems_with_Reinforcement_Learning_ICCV_2023_paper.pdf + Imaging systems consist of cameras to encode visual information about the world and perception models to interpret this encoding. Cameras contain (1) illumination sources, (2) optical elements, and (3) sensors, while perception models use (4) algorithms. Directly searching over all combinations of these four building blocks to design an imaging system is challenging due to the size of the search space. Moreover, cameras and perception models are often designed independently, leading to sub-optimal task performance. In this paper, we formulate these four building blocks of imaging systems as a context-free grammar (CFG), which can be automatically searched over with a learned camera designer to jointly optimize the imaging system with task-specific perception models. By transforming the CFG to a state-action space, we then show how the camera designer can be implemented with reinforcement learning to intelligently search over the combinatorial space of possible imaging system configurations. We demonstrate our approach on two tasks, depth estimation and camera rig design for autonomous vehicles, showing that our method yields rigs that outperform industry-wide standards. We believe that our proposed approach is an important step towards automating imaging system design. + + + + The Euclidean Space is Evil: Hyperbolic Attribute Editing for Few-shot Image Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_The_Euclidean_Space_is_Evil_Hyperbolic_Attribute_Editing_for_Few-shot_ICCV_2023_paper.pdf + Few-shot image generation is a challenging task since it aims to generate diverse new images for an unseen category with only a few images. Existing methods suffer from the trade-off between the quality and diversity of generated images. To tackle this problem, we propose Hyperbolic Attribute Editing (HAE), a simple yet effective method. Unlike prior arts that work in Euclidean space, HAE captures the hierarchy among images using data from seen categories in hyperbolic space. 
Given a well-trained HAE, images of unseen categories can be generated by moving the latent code of a given image toward any meaningful direction in the Poincare disk with a fixed radius. Most importantly, the hyperbolic space allows us to control the semantic diversity of the generated images by setting different radii in the disk. Extensive experiments and visualizations demonstrate that HAE is capable of not only generating images with promising quality and diversity using limited data but also achieving a highly controllable and interpretable editing process. + + + + Invariant Feature Regularization for Fair Face Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Invariant_Feature_Regularization_for_Fair_Face_Recognition_ICCV_2023_paper.pdf + Fair face recognition is all about learning invariant features that generalize to unseen faces in any demographic group. Unfortunately, face datasets inevitably capture the imbalanced demographic attributes that are ubiquitous in real-world observations, and the model learns biased features that generalize poorly in the minority group. We point out that the bias arises due to the confounding demographic attributes, which mislead the model to capture the spurious demographic-specific feature. The confounding effect can only be removed by causal intervention, which requires the confounder annotations. However, such annotations can be prohibitively expensive due to the diversity of the demographic attributes. To tackle this, we propose to generate diverse data partitions iteratively in an unsupervised fashion. Each data partition acts as a self-annotated confounder, enabling our Invariant Feature Regularization (INV-REG) to deconfound. INV-REG is orthogonal to existing methods, and combining INV-REG with two strong baselines (Arcface and CIFP) leads to a new state-of-the-art that improves face recognition on a variety of demographic groups. Code is available at https://github.com/milliema/InvReg. + + + + Local Context-Aware Active Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Local_Context-Aware_Active_Domain_Adaptation_ICCV_2023_paper.pdf + Active Domain Adaptation (ADA) queries the labels of a small number of selected target samples to help adapt a model from a source domain to a target domain. The local context of queried data is important, especially when the domain gap is large. However, this has not been fully explored by existing ADA works. In this paper, we propose a Local context-aware ADA framework, named LADA, to address this issue. To select informative target samples, we devise a novel criterion based on the local inconsistency of model predictions. Since the labeling budget is usually small, fine-tuning the model on only queried data can be inefficient. We progressively augment labeled target data with the confident neighbors in a class-balanced manner. Experiments validate that the proposed criterion chooses more informative target samples than existing active selection strategies. Furthermore, our full method clearly surpasses recent ADA arts on various benchmarks. Code is available at https://github.com/tsun/LADA. + + + + Deep Incubation: Training Large Models by Divide-and-Conquering + http://openaccess.thecvf.com//content/ICCV2023/papers/Ni_Deep_Incubation_Training_Large_Models_by_Divide-and-Conquering_ICCV_2023_paper.pdf + Recent years have witnessed a remarkable success of large deep learning models. 
However, training these models is challenging due to high computational costs, painfully slow convergence, and overfitting issues. In this paper, we present Deep Incubation, a novel approach that enables the efficient and effective training of large models by dividing them into smaller sub-modules which can be trained separately and assembled seamlessly. A key challenge for implementing this idea is to ensure the compatibility of the independently trained sub-modules. To address this issue, we first introduce a global, shared meta model, which is leveraged to implicitly link all the modules together, and can be designed as an extremely small network with negligible computational overhead. Then we propose a module incubation algorithm, which trains each sub-module to replace the corresponding component of the meta model and accomplish a given learning task. Despite the simplicity, our approach effectively encourages each sub-module to be aware of its role in the target large model, such that the finally-learned sub-modules can collaborate with each other smoothly after being assembled. Empirically, our method can outperform end-to-end (E2E) training in well-established training setting and shows transferable performance gain for downstream tasks (e.g., object detection and image segmentation on COCO and ADE20K). Our code is available at https://github.com/LeapLabTHU/Deep-Incubation. + + + + iVS-Net: Learning Human View Synthesis from Internet Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_iVS-Net_Learning_Human_View_Synthesis_from_Internet_Videos_ICCV_2023_paper.pdf + Recent advances in implicit neural representations make it possible to generate free-viewpoint videos of the human from sparse view images. To avoid the expensive training for each person, previous methods adopt the generalizable human model and demonstrate impressive results. However, these methods usually rely on limited multi-view images typically collected in the studio or commercial high-quality 3D scans for training, which heavily prohibits their generalization capability for in-the-wild images. To solve this problem, we propose a new approach to learn a generalizable human model from a new source of data, i.e., Internet videos. These videos capture various human appearances and poses and record the performers from abundant viewpoints. To exploit these videos, we present a temporal self-supervised pipeline to enforce the local appearance consistency of each body part over different frames of the same video. Once learned, the human model enables creating photorealistic free-viewpoint videos from a single input image. Experiments show that our method can generate high-quality view synthesis on in-the-wild images while only training on monocular videos. + + + + All4One: Symbiotic Neighbour Contrastive Learning via Self-Attention and Redundancy Reduction + http://openaccess.thecvf.com//content/ICCV2023/papers/Estepa_All4One_Symbiotic_Neighbour_Contrastive_Learning_via_Self-Attention_and_Redundancy_Reduction_ICCV_2023_paper.pdf + Nearest neighbour based methods have proved to be one of the most successful self-supervised learning (SSL) approaches due to their high generalization capabilities. However, their computational efficiency decreases when more than one neighbour is used. In this paper, we propose a novel contrastive SSL approach, which we call All4One, that reduces the distance between neighbour representations using "centroids" created through a self-attention mechanism. 
We use a Centroid Contrasting objective along with single Neighbour Contrasting and Feature Contrasting objectives. Centroids help in learning contextual information from multiple neighbours, whereas the neighbour contrast enables learning representations directly from the neighbours, and the feature contrast allows learning representations unique to the features. This combination enables All4One to outperform popular instance discrimination approaches by more than 1% on linear classification evaluation for popular benchmark datasets and to obtain state-of-the-art (SoTA) results. Finally, we show that All4One is robust towards embedding dimensionalities and augmentations, surpassing NNCLR and Barlow Twins by more than 5% in low dimensionality and weak augmentation settings. + + + + Contrastive Pseudo Learning for Open-World DeepFake Attribution + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_Contrastive_Pseudo_Learning_for_Open-World_DeepFake_Attribution_ICCV_2023_paper.pdf + The challenge in sourcing attribution for forgery faces has gained widespread attention due to the rapid development of generative techniques. While many recent works have taken essential steps on GAN-generated faces, more threatening attacks related to identity swapping or expression transferring are still overlooked. Moreover, the forgery traces hidden in unknown attacks from the open-world unlabeled faces still remain under-explored. To push the related frontier research, we introduce a new benchmark called Open-World DeepFake Attribution (OW-DFA), which aims to evaluate attribution performance against various types of fake faces under open-world scenarios. Meanwhile, we propose a novel framework named Contrastive Pseudo Learning (CPL) for the OW-DFA task through 1) introducing a Global-Local Voting module to guide the feature alignment of forged faces with different manipulated regions, and 2) designing a Confidence-based Soft Pseudo-label strategy to mitigate the pseudo-noise caused by similar methods in the unlabeled set. In addition, we extend the CPL framework with a multi-stage paradigm that leverages a pre-training technique and iterative learning to further enhance traceability performance. Extensive experiments verify the superiority of our proposed method on OW-DFA and also demonstrate the interpretability of the deepfake attribution task and its impact on improving the security of the deepfake detection area. + + + + ICL-D3IE: In-Context Learning with Diverse Demonstrations Updating for Document Information Extraction + http://openaccess.thecvf.com//content/ICCV2023/papers/He_ICL-D3IE_In-Context_Learning_with_Diverse_Demonstrations_Updating_for_Document_Information_ICCV_2023_paper.pdf + Large language models (LLMs), such as GPT-3 and ChatGPT, have demonstrated remarkable results in various natural language processing (NLP) tasks with in-context learning, which involves inference based on a few demonstration examples. Despite their successes in NLP tasks, no investigation has been conducted to assess the ability of LLMs to perform document information extraction (DIE) using in-context learning. Applying LLMs to DIE poses two challenges: the modality and task gap. To this end, we propose a simple but effective in-context learning framework called ICL-D3IE, which enables LLMs to perform DIE with different types of demonstration examples. Specifically, we extract the most difficult and distinct segments from hard training documents as hard demonstrations for benefiting all test instances. 
We design demonstrations describing relationships that enable LLMs to understand positional relationships. We introduce formatting demonstrations for easy answer extraction. Additionally, the framework improves diverse demonstrations by updating them iteratively. Our experiments on three widely used benchmark datasets demonstrate that the ICL-D3IE framework enables GPT-3/ChatGPT to achieve superior performance when compared to previous pre-trained methods fine-tuned with full training in both the in-distribution (ID) setting and in the out-of-distribution (OOD) setting. Code is available at https://anonymous.4open.science/r/ICL-D3IE-B1EE. + + + + Shape Anchor Guided Holistic Indoor Scene Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Shape_Anchor_Guided_Holistic_Indoor_Scene_Understanding_ICCV_2023_paper.pdf + This paper proposes a shape anchor guided learning strategy (AncLearn) for robust holistic indoor scene understanding. We observe that the search space constructed by current methods for proposal feature grouping and instance point sampling often introduces massive noise to instance detection and mesh reconstruction. Accordingly, we develop AncLearn to generate anchors that dynamically fit instance surfaces to (i) unmix noise and target-related features for offering reliable proposals at the detection stage, and (ii) reduce outliers in object point sampling for directly providing well-structured geometry priors without segmentation during reconstruction. We embed AncLearn into a reconstruction-from-detection learning system (AncRec) to generate high-quality semantic scene models in a purely instance-oriented manner. Experiments conducted on the ScanNetv2 dataset (with ground truths from Scan2CAD and SceneCAD) demonstrate that our shape anchor-based method consistently achieves state-of-the-art performance in terms of 3D object detection, layout estimation, and shape reconstruction. + + + + Knowledge-Aware Federated Active Learning with Non-IID Data + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_Knowledge-Aware_Federated_Active_Learning_with_Non-IID_Data_ICCV_2023_paper.pdf + Federated learning enables multiple decentralized clients to learn collaboratively without sharing local data. However, the expensive annotation cost on local clients remains an obstacle in utilizing local data. In this paper, we propose a federated active learning paradigm to efficiently learn a global model with a limited annotation budget while protecting data privacy in a decentralized learning manner. The main challenge faced by federated active learning is the mismatch between the active sampling goal of the global model on the server and that of the asynchronous local clients. This becomes even more significant when data is distributed non-IID across local clients. To address the aforementioned challenge, we propose Knowledge-Aware Federated Active Learning (KAFAL), which consists of Knowledge-Specialized Active Sampling (KSAS) and Knowledge-Compensatory Federated Update (KCFU). Specifically, KSAS is a novel active sampling method tailored for the federated active learning problem, aiming to deal with the mismatch challenge by sampling actively based on the discrepancies between local and global models. KSAS intensifies specialized knowledge in local clients, ensuring the sampled data is informative for both the local clients and the global model. 
Meanwhile, KCFU deals with the client heterogeneity caused by limited data and non-IID data distributions by compensating for each client's ability in weak classes with the assistance of the global model. Extensive experiments and analyses are conducted to show the superiority of KAFAL over recent state-of-the-art active learning methods. Code is available at https://github.com/ycao5602/KAFAL. + + + + PlankAssembly: Robust 3D Reconstruction from Three Orthographic Views with Learnt Shape Programs + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_PlankAssembly_Robust_3D_Reconstruction_from_Three_Orthographic_Views_with_Learnt_ICCV_2023_paper.pdf + In this paper, we develop a new method to automatically convert 2D line drawings from three orthographic views into 3D CAD models. Existing methods for this problem reconstruct 3D models by back-projecting the 2D observations into 3D space while maintaining explicit correspondence between the input and output. Such methods are sensitive to errors and noise in the input, and thus often fail in practice, where the input drawings created by human designers are imperfect. To overcome this difficulty, we leverage the attention mechanism in a Transformer-based sequence generation model to learn flexible mappings between the input and output. Further, we design shape programs which are suitable for generating the objects of interest to boost the reconstruction accuracy and facilitate CAD modeling applications. Experiments on a new benchmark dataset show that our method significantly outperforms existing ones when the inputs are noisy or incomplete. + + + + PODIA-3D: Domain Adaptation of 3D Generative Model Across Large Domain Gap Using Pose-Preserved Text-to-Image Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_PODIA-3D_Domain_Adaptation_of_3D_Generative_Model_Across_Large_Domain_ICCV_2023_paper.pdf + Recently, significant advancements have been made in 3D generative models; however, training these models across diverse domains is challenging and requires a huge amount of training data and knowledge of pose distribution. Text-guided domain adaptation methods have allowed the generator to be adapted to the target domains using text prompts, thereby obviating the need for assembling numerous data. Recently, DATID-3D has presented impressive sample quality in the text-guided domain, preserving diversity in text by leveraging text-to-image diffusion. However, adapting 3D generators to domains with significant domain gaps from the source domain still remains challenging due to the following issues in current text-to-image diffusion models: 1) shape-pose trade-off in diffusion-based translation, 2) pose bias, and 3) instance bias in the target domain, resulting in inferior 3D shapes, low text-image correspondence, and low intra-domain diversity in the generated samples. To address these issues, we propose a novel pipeline called PODIA-3D, which uses pose-preserved text-to-image diffusion-based domain adaptation for 3D generative models. We construct a pose-preserved text-to-image diffusion model that allows the use of extremely high-level noise for significant domain changes. We also propose specialized-to-general sampling strategies to improve the details of the generated samples. Moreover, to overcome the instance bias, we introduce a text-guided debiasing method that improves intra-domain diversity. Consequently, our method successfully adapts 3D generators across significant domain gaps. 
Our qualitative results and user study demonstrate that our approach outperforms existing 3D text-guided domain adaptation methods in terms of text-image correspondence, realism, diversity of rendered images, and sense of depth of 3D shapes in the generated samples. + + + + Steered Diffusion: A Generalized Framework for Plug-and-Play Conditional Image Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Nair_Steered_Diffusion_A_Generalized_Framework_for_Plug-and-Play_Conditional_Image_Synthesis_ICCV_2023_paper.pdf + Conditional generative models typically demand large annotated training sets to achieve high-quality synthesis. As a result, there has been significant interest in designing models that perform plug-and-play generation, i.e., to use a predefined or pretrained model, which is not explicitly trained on the generative task, to guide the generative process (e.g., using language). However, such guidance is typically useful only towards synthesizing high-level semantics rather than editing fine-grained details as in image-to-image translation tasks. To this end, and capitalizing on the powerful fine-grained generative control offered by the recent diffusion-based generative models, we introduce Steered Diffusion, a generalized framework for photorealistic zero-shot conditional image generation using a diffusion model trained for unconditional generation. The key idea is to steer the image generation of the diffusion model at inference time via designing a loss using a pre-trained inverse model that characterizes the conditional task. This loss modulates the sampling trajectory of the diffusion process. Our framework allows for easy incorporation of multiple conditions during inference. We present experiments using steered diffusion on several tasks including inpainting, colorization, text-guided semantic editing, and image super-resolution. Our results demonstrate clear qualitative and quantitative improvements over state-of-the-art diffusion-based plug-and-play models while adding negligible additional computational cost. + + + + Vision Grid Transformer for Document Layout Analysis + http://openaccess.thecvf.com//content/ICCV2023/papers/Da_Vision_Grid_Transformer_for_Document_Layout_Analysis_ICCV_2023_paper.pdf + Document pre-trained models and grid-based models have proven to be very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a multi-modal fashion, usually rely on either textual features or visual features. Grid-based models for DLA are multi-modality but largely neglect the effect of pre-training. To fully leverage multi-modal information and exploit pre-training techniques to learn better representation for DLA, in this paper, we present VGT, a two-stream Vision Grid Transformer, in which Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. Furthermore, a new dataset named D^4LA, which is so far the most diverse and detailed manually-annotated benchmark for document layout analysis, is curated and released. Experiment results have illustrated that the proposed VGT model achieves new state-of-the-art results on DLA tasks, e.g. PubLayNet (95.7% to 96.2%), DocBank (79.6% to 84.1%), and D^4LA (67.7% to 68.8%). The code and models as well as the D4LA dataset will be made publicly available. 
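One common way to realize the kind of steering described in the Steered Diffusion abstract above is to nudge the predicted noise at each DDIM step with the gradient of a task loss evaluated on the current clean-image estimate. The sketch below shows that generic pattern; the names denoiser, guidance_loss, and the guidance scale are assumptions, not the authors' exact formulation.

import torch

@torch.enable_grad()
def steered_ddim_step(x_t, t, t_prev, alphas_cumprod, denoiser, guidance_loss, scale=1.0):
    """One deterministic DDIM step whose noise estimate is steered by the
    gradient of `guidance_loss` (e.g. ||inverse_model(x0) - target||^2)."""
    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[t_prev]
    x_t = x_t.detach().requires_grad_(True)

    eps = denoiser(x_t, t)                                    # predicted noise
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()     # clean-image estimate

    loss = guidance_loss(x0_pred)                             # scalar conditional loss
    grad = torch.autograd.grad(loss, x_t)[0]
    eps = eps + scale * (1 - a_t).sqrt() * grad               # steer the noise estimate

    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return (a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps).detach()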
+ + + + Few Shot Font Generation Via Transferring Similarity Guided Global Style and Quantization Local Style + http://openaccess.thecvf.com//content/ICCV2023/papers/Pan_Few_Shot_Font_Generation_Via_Transferring_Similarity_Guided_Global_Style_ICCV_2023_paper.pdf + Automatic few-shot font generation (AFFG), aiming at generating new fonts with only a few glyph references, reduces the labor cost of manually designing fonts. However, the traditional AFFG paradigm of style-content disentanglement cannot capture the diverse local details of different fonts. So, many component-based approaches are proposed to tackle this problem. The issue with component-based approaches is that they usually require special pre-defined glyph components, e.g., strokes and radicals, which is infeasible for AFFG of different languages. In this paper, we present a novel font generation approach by aggregating styles from character similarity-guided global features and stylized component-level representations. We calculate the similarity scores of the target character and the referenced samples by measuring the distance along the corresponding channels from the content features, and assigning them as the weights for aggregating the global style features. To better capture the local styles, a cross-attention-based style transfer module is adopted to transfer the styles of reference glyphs to the components, where the components are self-learned discrete latent codes through vector quantization without manual definition. With these designs, our AFFG method could obtain a complete set of component-level style representations, and also control the global glyph characteristics. The experimental results reflect the effectiveness and generalization of the proposed method on different linguistic scripts, and also show its superiority when compared with other state-of-the-art methods. The source code can be found at https://github.com/awei669/VQ-Font. + + + + Differentiable Transportation Pruning + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Differentiable_Transportation_Pruning_ICCV_2023_paper.pdf + Deep learning algorithms are increasingly employed at the edge. However, edge devices are resource constrained and thus require efficient deployment of deep neural networks. Pruning methods are a key tool for edge deployment as they can improve storage, compute, memory bandwidth, and energy usage. In this paper we propose a novel accurate pruning technique that allows precise control over the output network size. Our method uses an efficient optimal transportation scheme which we make end-to-end differentiable and which automatically tunes the exploration-exploitation behavior of the algorithm to find accurate sparse sub-networks. We show that our method achieves state-of-the-art performance compared to previous pruning methods on 3 different datasets, using 5 different models, across a wide range of pruning ratios, and with two types of sparsity budgets and pruning granularities. + + + + Large Selective Kernel Network for Remote Sensing Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Large_Selective_Kernel_Network_for_Remote_Sensing_Object_Detection_ICCV_2023_paper.pdf + Recent research on remote sensing object detection has largely focused on improving the representation of oriented bounding boxes but has overlooked the unique prior knowledge presented in remote sensing scenarios. 
Such prior knowledge can be useful because tiny remote sensing objects may be mistakenly detected without referencing a sufficiently long-range context, which can vary for different objects. This paper considers these priors and proposes the lightweight Large Selective Kernel Network (LSKNet). LSKNet can dynamically adjust its large spatial receptive field to better model the ranging context of various objects in remote sensing scenarios. To our knowledge, large and selective kernel mechanisms have not been previously explored in remote sensing object detection. Without bells and whistles, our lightweight LSKNet sets new state-of-the-art scores on standard benchmarks, i.e., HRSC2016 (98.46% mAP), DOTA-v1.0 (81.85% mAP), and FAIR1M-v1.0 (47.87% mAP). + + + + I-ViT: Integer-only Quantization for Efficient Vision Transformer Inference + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_I-ViT_Integer-only_Quantization_for_Efficient_Vision_Transformer_Inference_ICCV_2023_paper.pdf + Vision Transformers (ViTs) have achieved state-of-the-art performance on various computer vision applications. However, these models have considerable storage and computational overheads, making their deployment and efficient inference on edge devices challenging. Quantization is a promising approach to reducing model complexity, and the dyadic arithmetic pipeline can allow the quantized models to perform efficient integer-only inference. Unfortunately, dyadic arithmetic is based on the homogeneity condition in convolutional neural networks, which is not applicable to the non-linear components in ViTs, making integer-only inference of ViTs an open issue. In this paper, we propose I-ViT, an integer-only quantization scheme for ViTs, to enable ViTs to perform the entire computational graph of inference with integer arithmetic and bit-shifting, and without any floating-point arithmetic. In I-ViT, linear operations (e.g., MatMul and Dense) follow the integer-only pipeline with dyadic arithmetic, and non-linear operations (e.g., Softmax, GELU, and LayerNorm) are approximated by the proposed light-weight integer-only arithmetic methods. More specifically, I-ViT applies the proposed Shiftmax and ShiftGELU, which are designed to use integer bit-shifting to approximate the corresponding floating-point operations. We evaluate I-ViT on various benchmark models and the results show that integer-only INT8 quantization achieves comparable (or even slightly higher) accuracy to the full-precision (FP) baseline. Furthermore, we utilize TVM for practical hardware deployment on the GPU's integer arithmetic units, achieving a 3.72x to 4.11x inference speedup compared to the FP model. Code for both PyTorch and TVM is released at https://github.com/zkkli/I-ViT. + + + + To Adapt or Not to Adapt? Real-Time Adaptation for Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Colomer_To_Adapt_or_Not_to_Adapt_Real-Time_Adaptation_for_Semantic_ICCV_2023_paper.pdf + The goal of Online Domain Adaptation for semantic segmentation is to handle unforeseeable domain changes that occur during deployment, like sudden weather events. However, the high computational costs associated with brute-force adaptation make this paradigm unfeasible for real-world applications. In this paper, we propose HAMLET, a Hardware-Aware Modular Least Expensive Training framework for real-time domain adaptation. 
Our approach includes a hardware-aware back-propagation orchestration agent (HAMT) and a dedicated domain-shift detector that enables active control over when and how the model is adapted (LT). Thanks to these advancements, our approach is capable of performing semantic segmentation while simultaneously adapting at more than 29 FPS on a single consumer-grade GPU. Experimental results on the OnDA and SHIFT benchmarks demonstrate our framework's encouraging accuracy-speed trade-off. + + + + Strivec: Sparse Tri-Vector Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Strivec_Sparse_Tri-Vector_Radiance_Fields_ICCV_2023_paper.pdf + We propose Strivec, a novel neural representation that models a 3D scene as a radiance field with sparsely distributed and compactly factorized local tensor feature grids. Our approach leverages tensor decomposition, following the recent work TensoRF, to model the tensor grids. In contrast to TensoRF, which uses a global tensor and focuses on its vector-matrix decomposition, we propose to utilize a cloud of local tensors and apply the classic CANDECOMP/PARAFAC (CP) decomposition to factorize each tensor into triple vectors that express local feature distributions along spatial axes and compactly encode a local neural field. We also apply multi-scale tensor grids to discover the geometry and appearance commonalities and exploit spatial coherence with the tri-vector factorization at multiple local scales. The final radiance field properties are regressed by aggregating neural features from multiple local tensors across all scales. Our tri-vector tensors are sparsely distributed around the actual scene surface, discovered by a fast coarse reconstruction, leveraging the sparsity of a 3D scene. We demonstrate that our model can achieve better rendering quality while using significantly fewer parameters than previous methods, including TensoRF and Instant-NGP. + + + + Multiscale Representation for Real-Time Anti-Aliasing Neural Rendering + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Multiscale_Representation_for_Real-Time_Anti-Aliasing_Neural_Rendering_ICCV_2023_paper.pdf + The rendering scheme in neural radiance field (NeRF) is effective in rendering a pixel by casting a ray into the scene. However, NeRF yields blurred rendering results when the training images are captured at non-uniform scales, and produces aliasing artifacts if the test images are taken in distant views. To address this issue, Mip-NeRF proposes a multiscale representation as a conical frustum to encode scale information. Nevertheless, this approach is only suitable for offline rendering since it relies on integrated positional encoding (IPE) to query a multilayer perceptron (MLP). To overcome this limitation, we propose mip voxel grids (Mip-VoG), an explicit multiscale representation with a deferred architecture for real-time anti-aliasing rendering. Our approach includes a density Mip-VoG for scene geometry and a feature Mip-VoG with a small MLP for view-dependent color. Mip-VoG represents scene scale using the level of detail (LOD) derived from ray differentials and uses quadrilinear interpolation to map a queried 3D location to its features and density from two neighboring down-sampled voxel grids. To our knowledge, our approach is the first to offer multiscale training and real-time anti-aliasing rendering simultaneously.
We conducted experiments on a multiscale dataset, and the results show that our approach outperforms state-of-the-art real-time rendering baselines. + + + + Borrowing Knowledge From Pre-trained Language Model: A New Data-efficient Visual Learning Paradigm + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Borrowing_Knowledge_From_Pre-trained_Language_Model_A_New_Data-efficient_Visual_ICCV_2023_paper.pdf + The development of vision models for real-world applications is hindered by the challenge of annotated data scarcity, which has necessitated the adoption of data-efficient visual learning techniques such as semi-supervised learning. Unfortunately, the prevalent cross-entropy supervision is limited by its focus on category discrimination while disregarding the semantic connection between concepts, which ultimately results in the suboptimal exploitation of scarce labeled data. To address this issue, this paper presents a novel approach that seeks to leverage linguistic knowledge for data-efficient visual learning. The proposed approach, BorLan, Borrows knowledge from off-the-shelf pretrained Language models that are already endowed with rich semantics extracted from large corpora, to compensate for the semantic deficiency due to limited annotation in visual training. Specifically, we design a distribution alignment objective, which guides the vision model to learn both semantic-aware and domain-agnostic representations for the task through linguistic knowledge. One significant advantage of this paradigm is its flexibility in combining various visual and linguistic models. Extensive experiments on semi-supervised learning, single domain generalization and few-shot learning validate its effectiveness. + + + + Tracking without Label: Unsupervised Multiple Object Tracking via Contrastive Similarity Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Meng_Tracking_without_Label_Unsupervised_Multiple_Object_Tracking_via_Contrastive_Similarity_ICCV_2023_paper.pdf + Unsupervised learning is a challenging task due to the lack of labels. Multiple Object Tracking (MOT), which inevitably suffers from mutual object interference, occlusion, etc., is even more difficult without label supervision. In this paper, we explore the latent consistency of sample features across video frames and propose an Unsupervised Contrastive Similarity Learning method, named UCSL, including three contrast modules: self-contrast, cross-contrast, and ambiguity contrast. Specifically, i) self-contrast uses intra-frame direct and inter-frame indirect contrast to obtain discriminative representations by maximizing self-similarity. ii) Cross-contrast aligns cross- and continuous-frame matching results, mitigating the persistent negative effect caused by object occlusion. And iii) ambiguity contrast matches ambiguous objects with each other to further increase the certainty of subsequent object association in an implicit manner. On existing benchmarks, our method outperforms the existing unsupervised methods using only limited help from the ReID head, and even provides higher accuracy than many fully supervised methods.
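As a rough illustration of the self-contrast idea in the UCSL abstract above, the following minimal PyTorch sketch treats the embeddings of the same object in two consecutive frames as a positive pair and all other cross-frame pairs as negatives; the function name, temperature, and exact loss form are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def self_contrast_loss(feats_t, feats_t1, temperature=0.1):
    """Toy self-contrast objective: the same object in frames t and t+1 is a
    positive pair; every other cross-frame pair acts as a negative."""
    a = F.normalize(feats_t, dim=1)       # (N, D) object embeddings in frame t
    b = F.normalize(feats_t1, dim=1)      # (N, D) the same objects in frame t+1
    logits = a @ b.t() / temperature      # (N, N) scaled cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # maximize self-similarity in both matching directions
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))
```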
+ + + + Re-mine, Learn and Reason: Exploring the Cross-modal Semantic Correlations for Language-guided HOI detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_Re-mine_Learn_and_Reason_Exploring_the_Cross-modal_Semantic_Correlations_for_ICCV_2023_paper.pdf + Human-Object Interaction (HOI) detection is a challenging computer vision task that requires visual models to address the complex interactive relationship between humans and objects and predict <human, action, object> triplets. Despite the challenges posed by the numerous interaction combinations, they also offer opportunities for multi-modal learning of visual texts. In this paper, we present a systematic and unified framework (RmLR) that enhances HOI detection by incorporating structured text knowledge. Firstly, we qualitatively and quantitatively analyze the loss of interaction information in the two-stage HOI detector and propose a re-mining strategy to generate more comprehensive visual representation. Secondly, we design more fine-grained sentence- and word-level alignment and knowledge transfer strategies to effectively address the many-to-many matching problem between multiple interactions and multiple texts. These strategies alleviate the matching confusion problem that arises when multiple interactions occur simultaneously, thereby improving the effectiveness of the alignment process. Finally, HOI reasoning by visual features augmented with textual knowledge substantially improves the understanding of interactions. Experimental results illustrate the effectiveness of our approach, where state-of-the-art performance is achieved on public benchmarks. + + + + Strata-NeRF : Neural Radiance Fields for Stratified Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Dhiman_Strata-NeRF__Neural_Radiance_Fields_for_Stratified_Scenes_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRF) approaches learn the underlying 3D representation of a scene and generate photo-realistic novel views with high fidelity. However, most proposed settings concentrate on 3D modelling a single object or a single level of a scene. However, in the real world, a person captures a structure at multiple levels, resulting in layered capture. For example, tourists usually capture a monument's exterior structure before capturing the inner structure. Modelling such scenes in 3D with seamless switching between levels can drastically improve the Virtual Reality (VR) experience. However, most of the existing techniques struggle in modelling such scenes. Hence, we propose Strata-NeRF, a single radiance field that can implicitly learn the 3D representation of outer, inner, and subsequent levels. Strata-NeRF achieves this by conditioning the NeRFs on Vector Quantized (VQ) latents which allows sudden changes in scene structure with changes in levels due to their discrete nature. We first investigate the proposed approach's effectiveness by modelling a novel multilayered synthetic dataset comprising diverse scenes and then further validate its generalization on the real-world RealEstate dataset. We find that Strata-NeRF effectively models the scene structure, minimizes artefacts and synthesizes high-fidelity views compared to existing state-of-the-art approaches in the literature. + + + + 3D-aware Blending with Generative NeRFs + http://openaccess.thecvf.com//content/ICCV2023/papers/Kim_3D-aware_Blending_with_Generative_NeRFs_ICCV_2023_paper.pdf + Image blending aims to combine multiple images seamlessly. 
It remains challenging for existing 2D-based methods, especially when input images are misaligned due to differences in 3D camera poses and object shapes. To tackle these issues, we propose a 3D-aware blending method using generative Neural Radiance Fields (NeRF), including two key components: 3D-aware alignment and 3D-aware blending. For 3D-aware alignment, we first estimate the camera pose of the reference image with respect to generative NeRFs and then perform pose alignment for objects. To further leverage 3D information of the generative NeRF, we propose 3D-aware blending that utilizes volume density and blends on the NeRF's latent space, rather than raw pixel space. Collectively, our method outperforms existing 2D baselines, as validated by extensive quantitative and qualitative evaluations with FFHQ and AFHQ-Cat. + + + + Multi-Modal Gated Mixture of Local-to-Global Experts for Dynamic Image Fusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Cao_Multi-Modal_Gated_Mixture_of_Local-to-Global_Experts_for_Dynamic_Image_Fusion_ICCV_2023_paper.pdf + Infrared and visible image fusion aims to integrate comprehensive information from multiple sources to achieve superior performances on various practical tasks, such as detection, over that of a single modality. However, most existing methods directly combined the texture details and object contrast of different modalities, ignoring the dynamic changes in reality, which diminishes the visible texture in good lighting conditions and the infrared contrast in low lighting conditions. To fill this gap, we propose a dynamic image fusion framework with a multi-modal gated mixture of local-to-global experts, termed MoE-Fusion, to dynamically extract effective and comprehensive information from the respective modalities. Our model consists of a Mixture of Local Experts (MoLE) and a Mixture of Global Experts (MoGE) guided by a multi-modal gate. The MoLE performs specialized learning of multi-modal local features, prompting the fused images to retain the local information in a sample-adaptive manner, while the MoGE focuses on the global information that complements the fused image with overall texture detail and contrast. Extensive experiments show that our MoE-Fusion outperforms state-of-the-art methods in preserving multi-modal image texture and contrast through the local-to-global dynamic learning paradigm, and also achieves superior performance on detection tasks. Our code is available: https://github.com/SunYM2020/MoE-Fusion. + + + + DELFlow: Dense Efficient Learning of Scene Flow for Large-Scale Point Clouds + http://openaccess.thecvf.com//content/ICCV2023/papers/Peng_DELFlow_Dense_Efficient_Learning_of_Scene_Flow_for_Large-Scale_Point_ICCV_2023_paper.pdf + Point clouds are naturally sparse, while image pixels are dense. The inconsistency limits feature fusion from both modalities for point-wise scene flow estimation. Previous methods rarely predict scene flow from the entire point clouds of the scene with one-time inference due to the memory inefficiency and heavy overhead from distance calculation and sorting involved in commonly used farthest point sampling, KNN, and ball query algorithms for local feature aggregation. To mitigate these issues in scene flow learning, we regularize raw points to a dense format by storing 3D coordinates in 2D grids. 
Unlike the sampling operation commonly used in existing works, the dense 2D representation 1) preserves most points in the given scene, 2) brings in a significant boost of efficiency, and 3) eliminates the density gap between points and pixels, allowing us to perform effective feature fusion. We also present a novel warping projection technique to alleviate the information loss problem resulting from the fact that multiple points could be mapped into one grid during projection when computing the cost volume. Sufficient experiments demonstrate the efficiency and effectiveness of our method, outperforming the prior arts on the FlyingThings3D and KITTI datasets. + + + + E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_E2VPT_An_Effective_and_Efficient_Approach_for_Visual_Prompt_Tuning_ICCV_2023_paper.pdf + As the size of transformer-based models continues to grow, fine-tuning these large-scale pre-trained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning has been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, there is still a significant performance gap compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts and visual prompts into self-attention and input layers, respectively, to improve the effectiveness of model fine-tuning. Moreover, we design a prompt pruning procedure to systematically prune low-importance prompts while preserving model performance, which largely enhances the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks, with considerably low parameter usage (e.g., 0.32% of model parameters on VTAB-1k). We anticipate that this work will inspire further exploration within the pretrain-then-finetune paradigm for large-scale models. + + + + From Knowledge Distillation to Self-Knowledge Distillation: A Unified Approach with Normalized Loss and Customized Soft Labels + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_From_Knowledge_Distillation_to_Self-Knowledge_Distillation_A_Unified_Approach_with_ICCV_2023_paper.pdf + Knowledge Distillation (KD) uses the teacher's prediction logits as soft labels to guide the student, while self-KD does not need a real teacher to provide the soft labels. This work unifies the formulations of the two tasks by decomposing and reorganizing the generic KD loss into a Normalized KD (NKD) loss and customized soft labels for both target class (image's category) and non-target classes named Universal Self-Knowledge Distillation (USKD). We decompose the KD loss and find that its non-target loss forces the student's non-target logits to match the teacher's, but the sums of the student's and teacher's non-target logits are different, preventing them from being identical. NKD normalizes the non-target logits to equalize their sum. It can be generally used for KD and self-KD to better use the soft labels for the distillation loss. USKD generates customized soft labels for both target and non-target classes without a teacher. It smooths the target logit of the student as the soft target label and uses the rank of the intermediate feature to generate the soft non-target labels with Zipf's law.
For KD with teachers, our NKD achieves state-of-the-art performance on the CIFAR-100 and ImageNet datasets, boosting the ImageNet Top-1 accuracy of ResNet-18 from 69.90% to 71.96% with a ResNet-34 teacher. For self-KD without teachers, USKD is the first self-KD method that can be effectively applied to both CNN and ViT models with negligible additional time and memory cost, resulting in new state-of-the-art results, such as 1.17% and 0.55% accuracy gains on ImageNet for MobileNet and DeiT-Tiny, respectively. Code is available at https://github.com/yzd-v/cls_KD. + + + + Improving Generalization in Visual Reinforcement Learning via Conflict-aware Gradient Agreement Augmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Improving_Generalization_in_Visual_Reinforcement_Learning_via_Conflict-aware_Gradient_Agreement_ICCV_2023_paper.pdf + Learning a policy with strong generalization to unseen environments remains challenging but critical in visual reinforcement learning. Despite the success of augmentation combination for generalization in supervised learning, naively applying it to visual RL algorithms may damage training efficiency and suffer from severe performance degradation. In this paper, we first conduct qualitative analysis and illuminate the main causes: (i) high-variance gradient magnitudes and (ii) gradient conflicts existing in various augmentation methods. To alleviate these issues, we propose a general policy gradient optimization framework, named Conflict-aware Gradient Agreement Augmentation (CG2A), and better integrate augmentation combination into visual RL algorithms to address the generalization bias. In particular, CG2A develops a Gradient Agreement Solver to adaptively balance the varying gradient magnitudes, and introduces a Soft Gradient Surgery strategy to alleviate the gradient conflicts. Extensive experiments demonstrate that CG2A significantly improves the generalization performance and sample efficiency of visual RL algorithms. + + + + Graph Matching with Bi-level Noisy Correspondence + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_Graph_Matching_with_Bi-level_Noisy_Correspondence_ICCV_2023_paper.pdf + In this paper, we study a novel and widely existing problem in graph matching (GM), namely, Bi-level Noisy Correspondence (BNC), which refers to node-level noisy correspondence (NNC) and edge-level noisy correspondence (ENC). In brief, on the one hand, due to the poor recognizability and viewpoint differences between images, it is inevitable to inaccurately annotate some keypoints with offset and confusion, leading to the mismatch between two associated nodes, i.e., NNC. On the other hand, the noisy node-to-node correspondence will further contaminate the edge-to-edge correspondence, thus leading to ENC. For the BNC challenge, we propose a novel method termed Contrastive Matching with Momentum Distillation. Specifically, the proposed method employs a robust quadratic contrastive loss that enjoys the following merits: i) better exploring the node-to-node and edge-to-edge correlations through a GM-customized quadratic contrastive learning paradigm; ii) adaptively penalizing the noisy assignments based on the confidence estimated by the momentum teacher. Extensive experiments on three real-world datasets show the robustness of our model compared with 12 competitive baselines. The code is available at https://github.com/XLearning-SCU/2023-ICCV-COMMON.
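The confidence-based penalization of noisy assignments described in the graph matching abstract above can be pictured with a minimal PyTorch sketch that weights a per-node matching loss by the softmax confidence of a momentum (EMA) teacher; this only conveys the general confidence-weighting idea under assumed shapes and names, not the paper's quadratic contrastive loss.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Momentum (EMA) teacher update, the usual ingredient of momentum distillation."""
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.data.mul_(momentum).add_(p_s.data, alpha=1.0 - momentum)

def confidence_weighted_matching_loss(student_logits, teacher_logits, targets):
    """Down-weight node correspondences that the momentum teacher finds unreliable.
    student_logits, teacher_logits: (N, N) matching scores; targets: (N,) annotated matches."""
    with torch.no_grad():
        conf = teacher_logits.softmax(dim=1)                    # teacher's matching distribution
        conf = conf.gather(1, targets.unsqueeze(1)).squeeze(1)  # confidence of each annotation
    per_node = F.cross_entropy(student_logits, targets, reduction="none")
    return (conf * per_node).mean()
```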
+ + + + InfiniCity: Infinite-Scale City Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Lin_InfiniCity_Infinite-Scale_City_Synthesis_ICCV_2023_paper.pdf + Toward infinite-scale 3D city synthesis, we propose a novel framework, InfiniCity, which constructs and renders an unconstrainedly large and 3D-grounded environment from random noise. InfiniCity decomposes the seemingly impractical task into three feasible modules, taking advantage of both 2D and 3D data. First, an infinite-pixel image synthesis module generates arbitrary-scale 2D maps from the bird's-eye view. Next, an octree-based voxel completion module lifts the generated 2D map to 3D octrees. Finally, a voxel-based neural rendering module texturizes the voxels and renders 2D images. InfiniCity can thus synthesize arbitrary-scale and traversable 3D city environments. We quantitatively and qualitatively demonstrate the efficacy of the proposed framework. + + + + OpenOccupancy: A Large Scale Benchmark for Surrounding Semantic Occupancy Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_OpenOccupancy_A_Large_Scale_Benchmark_for_Surrounding_Semantic_Occupancy_Perception_ICCV_2023_paper.pdf + Semantic occupancy perception is essential for autonomous driving, as automated vehicles require a fine-grained perception of the 3D urban structures. However, existing relevant benchmarks lack diversity in urban scenes, and they only evaluate front-view predictions. Towards a comprehensive benchmarking of surrounding perception algorithms, we propose OpenOccupancy, which is the first surrounding semantic occupancy perception benchmark. In the OpenOccupancy benchmark, we extend the large-scale nuScenes dataset with dense semantic occupancy annotations. Previous annotations rely on the superimposition of LiDAR points, where some occupancy labels are missed due to sparse LiDAR channels. To mitigate the problem, we introduce the Augmenting And Purifying (AAP) pipeline to 2x densify the annotations, where 4000 human hours are involved in the labeling process. Besides, camera-based, LiDAR-based and multi-modal baselines are established for the OpenOccupancy benchmark. Furthermore, considering that the complexity of surrounding occupancy perception lies in the computational burden of high-resolution 3D predictions, we propose the Cascade Occupancy Network (CONet) to refine the coarse prediction, which improves the performance by a relative 30% over the baseline. We hope the OpenOccupancy benchmark will boost the development of surrounding occupancy perception algorithms. + + + + Weakly-Supervised Text-Driven Contrastive Learning for Facial Behavior Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Weakly-Supervised_Text-Driven_Contrastive_Learning_for_Facial_Behavior_Understanding_ICCV_2023_paper.pdf + Contrastive learning has shown promising potential for learning robust representations by utilizing unlabeled data. However, constructing effective positive-negative pairs for contrastive learning on facial behavior datasets remains challenging. This is because such pairs inevitably encode the subject-ID information, and the randomly constructed pairs may push similar facial images away due to the limited number of subjects in facial behavior datasets. To address this issue, we propose to utilize activity descriptions, coarse-grained information provided in some datasets, which can provide high-level semantic information about the image sequences but is often neglected in previous studies.
More specifically, we introduce a two-stage Contrastive Learning with Text-Embedded framework for Facial behavior understanding (CLEF). The first stage is a weakly-supervised contrastive learning method that learns representations from positive-negative pairs constructed using coarse-grained activity information. The second stage aims to train the recognition of facial expressions or facial action units by maximizing the similarity between the image and the corresponding text label names. The proposed CLEF achieves state-of-the-art performance on three in-the-lab datasets for AU recognition and three in-the-wild datasets for facial expression recognition. + + + + Box-based Refinement for Weakly Supervised and Unsupervised Localization Tasks + http://openaccess.thecvf.com//content/ICCV2023/papers/Gomel_Box-based_Refinement_for_Weakly_Supervised_and_Unsupervised_Localization_Tasks_ICCV_2023_paper.pdf + It has been established that training a box-based detector network can enhance the localization performance of weakly supervised and unsupervised methods. Moreover, we extend this understanding by demonstrating that these detectors can be utilized to improve the original network, paving the way for further advancements. To accomplish this, we train the detectors on top of the network output instead of the image data and apply suitable loss backpropagation. Our findings reveal a significant improvement in phrase grounding for the "what is where by looking" task, as well as various methods of unsupervised object discovery. + + + + PRIOR: Prototype Representation Joint Learning from Medical Images and Reports + http://openaccess.thecvf.com//content/ICCV2023/papers/Cheng_PRIOR_Prototype_Representation_Joint_Learning_from_Medical_Images_and_Reports_ICCV_2023_paper.pdf + Contrastive learning based vision-language joint pre-training has emerged as a successful representation learning strategy. In this paper, we present a prototype representation learning framework incorporating both global and local alignment between medical images and reports. In contrast to standard global multi-modality alignment methods, we employ a local alignment module for fine-grained representation. Furthermore, a cross-modality conditional reconstruction module is designed to interchange information across modalities in the training phase by reconstructing masked images and reports. For reconstructing long reports, a sentence-wise prototype memory bank is constructed, enabling the network to focus on low-level localized visual and high-level clinical linguistic features. Additionally, a non-auto-regressive generation paradigm is proposed for reconstructing non-sequential reports. Experimental results on five downstream tasks, including supervised classification, zero-shot classification, image-to-text retrieval, semantic segmentation, and object detection, show that the proposed method outperforms other state-of-the-art methods across multiple datasets and under different dataset size settings. The code is available at https://github.com/QtacierP/PRIOR. + + + + Vision HGNN: An Image is More than a Graph of Nodes + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_Vision_HGNN_An_Image_is_More_than_a_Graph_of_ICCV_2023_paper.pdf + The realm of graph-based modeling has proven its adaptability across diverse real-world data types. However, its applicability to general computer vision tasks had been limited until the introduction of the Vision Graph Neural Network (ViG).
ViG divides input images into patches, conceptualized as nodes, constructing a graph through connections to nearest neighbors. Nonetheless, this method of graph construction confines itself to simple pairwise relationships, leading to surplus edges and unwarranted memory and computation expenses. In this paper, we enhance ViG by transcending conventional "pairwise" linkages and harnessing the power of the hypergraph to encapsulate image information. Our objective is to encompass more intricate inter-patch associations. In both training and inference phases, we adeptly establish and update the hypergraph structure using the Fuzzy C-Means method, ensuring minimal computational burden. This augmentation yields the Vision HyperGraph Neural Network (ViHGNN). The model's efficacy is empirically substantiated through its state-of-the-art performance on both image classification and object detection tasks, courtesy of the hypergraph structure learning module that uncovers higher-order relationships. Our code is available at: https://github.com/VITA-Group/ViHGNN. + + + + HumanSD: A Native Skeleton-Guided Diffusion Model for Human Image Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ju_HumanSD_A_Native_Skeleton-Guided_Diffusion_Model_for_Human_Image_Generation_ICCV_2023_paper.pdf + Controllable human image generation (HIG) has attracted significant attention from academia and industry for its numerous real-life applications. State-of-the-art solutions, such as ControlNet and T2I-Adapter, introduce an additional learnable branch on top of the frozen pre-trained stable diffusion (SD) model, which can enforce various kinds of conditions, including skeleton guidance of HIG. While such a plug-and-play approach is appealing, the inevitable and uncertain conflicts between the original images produced from the frozen SD branch and the given condition incur significant challenges for the learnable branch, which conducts the condition learning via image feature editing. In this work, we propose a native skeleton-guided diffusion model for controllable HIG called HumanSD. Instead of performing image editing with dual-branch diffusion, we fine-tune the original SD model using a novel heatmap-guided denoising loss. This strategy effectively and efficiently strengthens the given skeleton condition during model training while mitigating the catastrophic forgetting effects. HumanSD is fine-tuned on the assembly of three large-scale human-centric datasets with text-image-pose information, two of which are established in this work. Experimental results show that HumanSD outperforms ControlNet in terms of pose control and image quality, particularly when the given skeleton guidance is sophisticated. Code and data are available at: https://idearesearch.github.io/HumanSD/. + + + + HRS-Bench: Holistic, Reliable and Scalable Benchmark for Text-to-Image Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Bakr_HRS-Bench_Holistic_Reliable_and_Scalable_Benchmark_for_Text-to-Image_Models_ICCV_2023_paper.pdf + Designing robust text-to-image (T2I) models has been extensively explored in recent years, especially with the emergence of diffusion models, which achieve state-of-the-art results on T2I synthesis tasks. Despite the significant effort and success in this direction, we observed that the existing metrics are not robust enough to measure real progress. As a result, comparing existing models is complex and relies heavily on subjective human evaluation.
In addition, we observe that the efforts in developing new architectures do not coincide with efforts in the evaluation direction. Driven by this observation, designing a concrete evaluation becomes important to fill the gap between model design and evaluation efforts. Accordingly, we introduce our holistic, reliable, and scalable benchmark, termed HRS-Bench, for T2I models. Unlike the existing benchmarks, which focus on limited aspects, we measure 13 skills, which can be categorized into five critical skills: accuracy, robustness, generalization, fairness, and bias. In addition, HRS-Bench covers 50 applications, e.g., fashion, animals, transportation, food, and clothes. We evaluate nine recent large-scale T2I models using metrics that cover this wide range of skills, e.g., robustness, fairness, and bias. To probe the effectiveness of HRS-Bench, a human evaluation is conducted, which aligns with our evaluations 95% of the time on average across the 13 skills. We hope our findings, e.g., that none of the existing models can generate visual text or emotionally grounded images, help accelerate and direct future research. To this end, the code and data are available at https://eslambakr.github.io/hrsbench.github.io/. + + + + DiffCloth: Diffusion Based Garment Synthesis and Manipulation via Structural Cross-modal Semantic Alignment + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_DiffCloth_Diffusion_Based_Garment_Synthesis_and_Manipulation_via_Structural_Cross-modal_ICCV_2023_paper.pdf + Cross-modal garment synthesis and manipulation will significantly benefit the way fashion designers generate garments and modify their designs via flexible linguistic interfaces. However, despite the significant progress that has been made in generic image synthesis using diffusion models, producing garment images with garment-part-level semantics that are well aligned with input text prompts and then flexibly manipulating the generated results remains a problem. Current approaches follow the general text-to-image paradigm and mine cross-modal relations via simple cross-attention modules, neglecting the structural correspondence between visual and textual representations in the fashion design domain. In this work, we instead introduce DiffCloth, a diffusion-based pipeline for cross-modal garment synthesis and manipulation, which empowers diffusion models with flexible compositionality in the fashion domain by structurally aligning the cross-modal semantics. Specifically, we formulate the part-level cross-modal alignment as a bipartite matching problem between the linguistic Attribute-Phrases (AP) and the visual garment parts which are obtained via constituency parsing and semantic segmentation, respectively. To mitigate the issue of attribute confusion, we further propose a semantic-bundled cross-attention to preserve the spatial structure similarities between the attention maps of attribute adjectives and part nouns in each AP. Moreover, DiffCloth allows for manipulation of the generated results by simply replacing APs in the text prompts. The manipulation-irrelevant regions are recognized by blended masks obtained from the bundled attention maps of the APs and kept unchanged. Extensive experiments on the CM-Fashion benchmark demonstrate that DiffCloth both yields state-of-the-art garment synthesis results by leveraging the inherent structural information and supports flexible manipulation with region consistency.
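The part-level alignment in the DiffCloth abstract above is formulated as a bipartite matching problem; the following minimal sketch shows that general formulation under the assumption of pre-computed attribute-phrase and garment-part embeddings, using the Hungarian algorithm from SciPy (the function and variable names are hypothetical, not from the paper).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_phrases_to_parts(phrase_emb, part_emb):
    """Assign each attribute-phrase embedding (P, D) to a garment-part embedding (G, D)
    by solving a bipartite matching problem over negative cosine similarity."""
    phrase_emb = phrase_emb / np.linalg.norm(phrase_emb, axis=1, keepdims=True)
    part_emb = part_emb / np.linalg.norm(part_emb, axis=1, keepdims=True)
    cost = -phrase_emb @ part_emb.T           # lower cost means more similar
    rows, cols = linear_sum_assignment(cost)  # Hungarian algorithm
    return list(zip(rows.tolist(), cols.tolist()))
```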
+ + + + Class Prior-Free Positive-Unlabeled Learning with Taylor Variational Loss for Hyperspectral Remote Sensing Imagery + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Class_Prior-Free_Positive-Unlabeled_Learning_with_Taylor_Variational_Loss_for_Hyperspectral_ICCV_2023_paper.pdf + Positive-unlabeled learning (PU learning) in hyperspectral remote sensing imagery (HSI) is aimed at learning a binary classifier from positive and unlabeled data, which has broad prospects in various earth vision applications. However, when PU learning meets limited labeled HSI, the unlabeled data may dominate the optimization process, which makes the neural networks overfit the unlabeled data. In this paper, a Taylor variational loss is proposed for HSI PU learning, which reduces the weight of the gradient of the unlabeled data by Taylor series expansion to enable the network to find a balance between overfitting and underfitting. In addition, the self-calibrated optimization strategy is designed to stabilize the training process. Experiments on 7 benchmark datasets (21 tasks in total) validate the effectiveness of the proposed method. Code is at: https://github.com/Hengwei-Zhao96/T-HOneCls. + + + + HoloAssist: an Egocentric Human Interaction Dataset for Interactive AI Assistants in the Real World + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_HoloAssist_an_Egocentric_Human_Interaction_Dataset_for_Interactive_AI_Assistants_ICCV_2023_paper.pdf + Building an interactive AI assistant that can perceive, reason, and collaborate with humans in the real world has been a long-standing pursuit in the AI community. This work is part of a broader research effort to develop intelligent agents that can interactively guide humans through performing tasks in the physical world. As a first step in this direction, we introduce HoloAssist, a large-scale egocentric human interaction dataset, where two people collaboratively complete physical manipulation tasks. The task performer executes the task while wearing a mixed-reality headset that captures seven synchronized data streams. The task instructor watches the performer's egocentric video in real time and guides them verbally. By augmenting the data with action and conversational annotations and observing the rich behaviors of various participants, we present key insights into how human assistants correct mistakes, intervene in the task completion procedure, and ground their instructions to the environment. HoloAssist spans 166 hours of data captured by 350 unique instructor-performer pairs. Furthermore, we construct and present benchmarks on mistake detection, intervention type prediction, and hand forecasting, along with detailed analysis. We expect HoloAssist will provide an important resource for building AI assistants that can fluidly collaborate with humans in the real world. Data can be downloaded at https://holoassist.github.io/. + + + + StableVideo: Text-driven Consistency-aware Diffusion Video Editing + http://openaccess.thecvf.com//content/ICCV2023/papers/Chai_StableVideo_Text-driven_Consistency-aware_Diffusion_Video_Editing_ICCV_2023_paper.pdf + Diffusion-based methods can generate realistic images and videos, but they struggle to edit existing objects in a video while preserving their geometry over time. This prevents diffusion models from being applied to natural video editing. 
In this paper, we tackle this problem by introducing temporal dependency to existing text-driven diffusion models, which allows them to generate a consistent appearance for the new objects. Specifically, we develop a novel inter-frame propagation mechanism for diffusion video editing, which leverages the concept of layered representations to propagate the geometry and appearance information from one frame to the next. We then build up a text-driven video editing framework based on this mechanism, namely StableVideo, which can achieve consistency-aware video editing. Extensive qualitative experiments demonstrate the strong editing capability of our approach. Compared with state-of-the-art video editing methods, our approach shows superior qualitative and quantitative results. + + + + PIRNet: Privacy-Preserving Image Restoration Network via Wavelet Lifting + http://openaccess.thecvf.com//content/ICCV2023/papers/Deng_PIRNet_Privacy-Preserving_Image_Restoration_Network_via_Wavelet_Lifting_ICCV_2023_paper.pdf + Cloud-based multimedia services have become increasingly popular over the last decade; however, they pose a serious threat to clients' privacy. To address this issue, many methods have utilized image encryption as a defense mechanism. However, the encrypted images look quite different from the natural images, making them vulnerable to attackers. In this paper, we propose a novel method, named PIRNet, which performs privacy-preserving image restoration in the steganographic domain. Compared to existing methods, our method offers significant advantages in terms of invisibility and security. Specifically, we first propose a wavelet Lifting-based Invertible Hiding (LIH) network to conceal the secret image within the stego image. Then, a Lifting-based Secure Restoration (LSR) network is utilized to perform image restoration in the steganographic domain. Since the secret image remains hidden throughout the whole image restoration process, the privacy of clients can be largely ensured. In addition, since the stego image looks visually the same as the cover image, the attackers can hardly discover it, which significantly improves the security. The experimental results on different datasets show the superiority of our PIRNet over the existing methods on various privacy-preserving image restoration tasks, including image denoising, deblurring and super-resolution. + + + + LAW-Diffusion: Complex Scene Generation by Diffusion with Layouts + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_LAW-Diffusion_Complex_Scene_Generation_by_Diffusion_with_Layouts_ICCV_2023_paper.pdf + Thanks to the rapid development of diffusion models, unprecedented progress has been witnessed in image synthesis. Prior works mostly rely on pre-trained linguistic models, but a text is often too abstract to properly specify all the spatial properties of an image, e.g., the layout configuration of a scene, leading to sub-optimal results for complex scene generation. In this paper, we achieve accurate complex scene generation by proposing a semantically controllable Layout-AWare diffusion model, termed LAW-Diffusion. Distinct from the previous Layout-to-Image generation (L2I) methods that primarily explore category-aware relationships, LAW-Diffusion introduces a spatial dependency parser to encode the location-aware semantic coherence across objects as a layout embedding and produces a scene with perceptually harmonious object styles and contextual relations.
To be specific, we delicately instantiate each object's regional semantics as an object region map and leverage a location-aware cross-object attention module to capture the spatial dependencies among those disentangled representations. We further propose an adaptive guidance schedule for our layout guidance to mitigate the trade-off between the regional semantic alignment and the texture fidelity of generated objects. Moreover, LAW-Diffusion allows for instance reconfiguration while maintaining the other regions in a synthesized image by introducing a layout-aware latent grafting mechanism to recompose its local regional semantics. To better verify the plausibility of generated scenes, we propose a new evaluation metric for the L2I task, dubbed Scene Relation Score (SRS) to measure how the images preserve the rational and harmonious relations among contextual objects. Comprehensive experiments on COCO-Stuff and Visual-Genome demonstrate that our LAW-Diffusion yields the state-of-the-art generative performance, especially with coherent object relations. + + + + Multi-Label Knowledge Distillation + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Multi-Label_Knowledge_Distillation_ICCV_2023_paper.pdf + Existing knowledge distillation methods typically work by imparting the knowledge of output logits or intermediate feature maps from the teacher network to the student network, which is very successful in multi-class single-label learning. However, these methods can hardly be extended to the multi-label learning scenario, where each instance is associated with multiple semantic labels, because the prediction probabilities do not sum to one and feature maps of the whole example may ignore minor classes in such a scenario. In this paper, we propose a novel multi-label knowledge distillation method. On one hand, it exploits the informative semantic knowledge from the logits by dividing the multi-label learning problem into a set of binary classification problems; on the other hand, it enhances the distinctiveness of the learned feature representations by leveraging the structural information of label-wise embeddings. Experimental results on multiple benchmark datasets validate that the proposed method can avoid knowledge counteraction among labels, thus achieving superior performance against diverse comparing methods. + + + + Towards Geospatial Foundation Models via Continual Pretraining + http://openaccess.thecvf.com//content/ICCV2023/papers/Mendieta_Towards_Geospatial_Foundation_Models_via_Continual_Pretraining_ICCV_2023_paper.pdf + Geospatial technologies are becoming increasingly essential in our world for a wide range of applications, including agriculture, urban planning, and disaster response. To help improve the applicability and performance of deep learning models on these geospatial tasks, various works have begun investigating foundation models for this domain. Researchers have explored two prominent approaches for introducing such models in geospatial applications, but both have drawbacks in terms of limited performance benefit or prohibitive training cost. Therefore, in this work, we propose a novel paradigm for building highly effective geospatial foundation models with minimal resource cost and carbon impact. We first construct a compact yet diverse dataset from multiple sources to promote feature diversity, which we term GeoPile. 
Then, we investigate the potential of continual pretraining from large-scale ImageNet-22k models and propose a multi-objective continual pretraining paradigm, which leverages the strong representations of ImageNet while simultaneously providing the freedom to learn valuable in-domain features. Our approach outperforms previous state-of-the-art geospatial pretraining methods in an extensive evaluation on seven downstream datasets covering various tasks such as change detection, classification, multi-label classification, semantic segmentation, and super-resolution. Code is available at https://github.com/mmendiet/GFM. + + + + ConSlide: Asynchronous Hierarchical Interaction Transformer with Breakup-Reorganize Rehearsal for Continual Whole Slide Image Analysis + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_ConSlide_Asynchronous_Hierarchical_Interaction_Transformer_with_Breakup-Reorganize_Rehearsal_for_Continual_ICCV_2023_paper.pdf + Whole slide image (WSI) analysis has become increasingly important in the medical imaging community, enabling automated and objective diagnosis, prognosis, and therapeutic-response prediction. However, in clinical practice, the continuous progress of evolving WSI acquisition technology, the diversity of scanners, and different imaging protocols hamper the utility of WSI analysis models. In this paper, we propose the FIRST continual learning framework for WSI analysis, named ConSlide, to tackle the challenges of enormous image size, utilization of hierarchical structure, and catastrophic forgetting by progressive model updating on multiple sequential datasets. Our framework contains three key components. The Hierarchical Interaction Transformer (HIT) is proposed to model and utilize the hierarchical structural knowledge of WSI. The BreakupReorganize (BuRo) rehearsal method is developed for WSI data replay with efficient region storing buffer and WSI reorganizing operation. The asynchronous updating mechanism is devised to encourage the network to learn generic and specific knowledge respectively during the replay stage, based on a nested cross-scale similarity learning (CSSL) module. We evaluated the proposed ConSlide on four public WSI datasets from TCGA projects. It performs best over other state-of-the-art methods with a fair WSI-based continual learning setting and achieves a better trade-off of the overall performance and forgetting on previous tasks. + + + + RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_RepQ-ViT_Scale_Reparameterization_for_Post-Training_Quantization_of_Vision_Transformers_ICCV_2023_paper.pdf + Post-training quantization (PTQ), which only requires a tiny dataset for calibration without end-to-end retraining, is a light and practical model compression technique. Recently, several PTQ schemes for vision transformers (ViTs) have been presented; unfortunately, they typically suffer from non-trivial accuracy degradation, especially in low-bit cases. In this paper, we propose RepQ-ViT, a novel PTQ framework for ViTs based on quantization scale reparameterization, to address the above issues. RepQ-ViT decouples the quantization and inference processes, where the former employs complex quantizers and the latter employs scale-reparameterized simplified quantizers. 
This ensures both accurate quantization and efficient inference, which distinguishes it from existing approaches that sacrifice quantization performance to meet the target hardware. More specifically, we focus on two components with extreme distributions: post-LayerNorm activations with severe inter-channel variation and post-Softmax activations with power-law features, and initially apply channel-wise quantization and log(sqrt(2)) quantization, respectively. Then, we reparameterize the scales to hardware-friendly layer-wise quantization and log2 quantization for inference, with only slight accuracy or computational costs. Extensive experiments are conducted on multiple vision tasks with different model variants, proving that RepQ-ViT, without hyperparameters and expensive reconstruction procedures, can outperform existing strong baselines and encouragingly improve the accuracy of 4-bit PTQ of ViTs to a usable level. Code is available at https://github.com/zkkli/RepQ-ViT. + + + + ReactioNet: Learning High-Order Facial Behavior from Universal Stimulus-Reaction by Dyadic Relation Reasoning + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_ReactioNet_Learning_High-Order_Facial_Behavior_from_Universal_Stimulus-Reaction_by_Dyadic_ICCV_2023_paper.pdf + Diverse visual stimuli can evoke various human affective states, which are usually manifested in an individual's muscular actions and facial expressions. In lab-controlled emotion datasets, such a critical component (i.e., stimulus) was commonly designed in a limited way, making researchers incapable of generalizing the universal correlation and causation of stimulus-reaction as well as predicting possible emotions from context, timing, and relation. In this paper, we collected a large-scale spontaneous facial behavior database ReactioNet, which contains 1.1 million coupled stimulus-reaction tuples (visual/audio/caption from both stimuli and subjects). We introduce a new facial behavior detection scenario, Dyadic Relation Reasoning (DRR), which aims to detect facial actions by reasoning their relations with stimuli. By aggregating the dyadic information, our method essentially forms a relation prototype Universal Stimulus Reaction (U-SR), which encodes the low-order and high-order relationships between stimulus agents and facial reactions. A framework with both non-graph and graph modules is further developed to evaluate DRR-based facial action unit detection, facial expression recognition, and scene classification. Specifically, to learn "what" arouses a facial reaction, the non-graph module associates and projects the fine-grained stimulus-reaction features into common subspaces using cross-domain contrastive learning. To learn "how" stimulus-reaction are mutually related, the graph module adopts Graph Convolution Network to represent, converge, and infer the dyadic U-SR relation under two relation assumptions (i.e., homophily and heterophily). Extensive experiments demonstrate the effectiveness of the proposed work. The new dataset will be available for the research community. + + + + Emotional Listener Portrait: Neural Listener Head Generation with Emotion + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_Emotional_Listener_Portrait_Neural_Listener_Head_Generation_with_Emotion_ICCV_2023_paper.pdf + Listener head generation centers on generating non-verbal behaviors (e.g., smile) of a listener in reference to the information delivered by a speaker. 
A significant challenge when generating such responses is the non-deterministic nature of fine-grained facial expressions during a conversation, which varies depending on the emotions and attitudes of both the speaker and the listener. To tackle this problem, we propose the Emotional Listener Portrait (ELP), which treats each fine-grained facial motion as a composition of several discrete motion-codewords and explicitly models the probability distribution of the motions under different emotional contexts in conversation. Benefiting from the "explicit" and "discrete" design, our ELP model can not only automatically generate natural and diverse responses toward a given speaker via sampling from the learned distribution but also generate controllable responses with a predetermined attitude. Under several quantitative metrics, our ELP exhibits significant improvements compared to previous methods. + + + + Unsupervised Domain Adaptation for Training Event-Based Networks Using Contrastive Learning and Uncorrelated Conditioning + http://openaccess.thecvf.com//content/ICCV2023/papers/Jian_Unsupervised_Domain_Adaptation_for_Training_Event-Based_Networks_Using_Contrastive_Learning_ICCV_2023_paper.pdf + Event-based cameras offer reliable measurements for performing computer vision tasks in high-dynamic-range environments and during fast motion maneuvers. However, adopting deep learning in event-based vision faces the challenge of annotated data scarcity due to the recency of event cameras. Transferring the knowledge that can be obtained from annotated conventional-camera data offers a practical solution to this challenge. We develop an unsupervised domain adaptation algorithm for training a deep network for image classification on event-based data using contrastive learning and uncorrelated conditioning of data. Our solution outperforms the existing algorithms for this purpose. + + + + DRAW: Defending Camera-shooted RAW Against Image Manipulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_DRAW_Defending_Camera-shooted_RAW_Against_Image_Manipulation_ICCV_2023_paper.pdf + RAW files are the initial measurement of scene radiance widely used in most cameras, and the ubiquitously-used RGB images are converted from RAW data through Image Signal Processing (ISP) pipelines. Nowadays, digital images are at risk of being nefariously manipulated. Inspired by the fact that innate immunity is the first line of body defense, we propose DRAW, a novel scheme of defending images against manipulation by protecting their sources, i.e., camera-shooted RAWs. Specifically, we design a lightweight Multi-frequency Partial Fusion Network (MPF-Net) friendly to devices with limited computing resources by frequency learning and partial feature fusion. It introduces invisible watermarks as a protective signal into the RAW data. The protection capability can not only be transferred into the rendered RGB images regardless of the applied ISP pipeline, but is also resilient to post-processing operations such as blurring or compression. Once the image is manipulated, we can accurately identify the forged areas with a localization network. Extensive experiments on several well-known RAW datasets, e.g., RAISE, FiveK and SIDD, indicate the effectiveness of our method. We hope that this technique can be used in future cameras as an option for image protection, which could effectively restrict image manipulation at the source.
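To make the "protect the RAW, then verify downstream" idea from the DRAW abstract concrete, here is a deliberately simplified NumPy toy: an additive pseudo-random watermark and a correlation detector. DRAW itself learns the embedding with an invertible wavelet-lifting network and localizes forged regions with a dedicated network; nothing below is taken from the paper.

```python
import numpy as np

def embed_watermark(raw, key=0, strength=0.5):
    """Add a key-derived pseudo-random +/-1 pattern to the RAW measurement."""
    rng = np.random.default_rng(key)
    pattern = rng.choice([-1.0, 1.0], size=raw.shape)
    return raw + strength * pattern, pattern

def detect_watermark(image, pattern):
    """Correlation detector: a clearly positive score suggests the protective
    signal survived rendering and post-processing (same-shape arrays assumed)."""
    centered = image - image.mean()
    return float((centered * pattern).mean() / (centered.std() * pattern.std() + 1e-8))
```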
+ + + + Controllable Person Image Synthesis with Pose-Constrained Latent Diffusion + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_Controllable_Person_Image_Synthesis_with_Pose-Constrained_Latent_Diffusion_ICCV_2023_paper.pdf + Controllable person image synthesis aims at rendering a source image based on user-specified changes in body pose or appearance. Prior approaches leverage pixel-level denoising diffusion models conditioned on the coarse skeleton via cross-attention. This leads to two limitations: low efficiency and inaccurate condition information. To address both issues, a novel Pose-Constrained Latent Diffusion model (PoCoLD) is introduced. Rather than using the skeleton as a sparse pose representation, we exploit DensePose, which offers much richer body structure information. To effectively capitalize on DensePose at a low cost, we propose an efficient pose-constrained attention module that is capable of modeling the complex interplay between appearance and pose. Extensive experiments show that our PoCoLD outperforms the state-of-the-art competitors in image synthesis fidelity. Critically, it runs 2x faster and consumes 3.6x less memory than the latest diffusion-model-based alternative during inference. + + + + TopoSeg: Topology-Aware Nuclear Instance Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/He_TopoSeg_Topology-Aware_Nuclear_Instance_Segmentation_ICCV_2023_paper.pdf + Nuclear instance segmentation has been critical for pathology image analysis in medical science, e.g., cancer diagnosis. Current methods typically adopt pixel-wise optimization for nuclei boundary exploration, where rich structural information could be lost for subsequent quantitative morphology assessment. To address this issue, we develop a topology-aware segmentation approach, termed TopoSeg, which exploits topological structure information to keep the predictions rational, especially in common situations with densely touching and overlapping nucleus instances. Concretely, TopoSeg builds on a topology-aware module (TAM), which encodes dynamic changes of different topology structures within the three-class probability maps (inside, boundary, and background) of the nuclei into persistence barcodes and forms the topology-aware loss function. To efficiently focus on regions with high topological errors, we propose an adaptive topology-aware selection (ATS) strategy to enhance the topology-aware optimization procedure further. Experiments on three nuclear instance segmentation datasets justify the superiority of TopoSeg, which achieves state-of-the-art performance. The code is available at https://github.com/hhlisme/toposeg. + + + + CPCM: Contextual Point Cloud Modeling for Weakly-supervised Point Cloud Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_CPCM_Contextual_Point_Cloud_Modeling_for_Weakly-supervised_Point_Cloud_Semantic_ICCV_2023_paper.pdf + We study the task of weakly-supervised point cloud semantic segmentation with sparse annotations (e.g., less than 0.1% of points are labeled), aiming to reduce the high cost of dense annotations. Unfortunately, with extremely sparse annotated points, it is very difficult to extract both contextual and object information for scene understanding such as semantic segmentation. Motivated by masked modeling (e.g., MAE) in image and video representation learning, we seek to harness the power of masked modeling to learn contextual information from sparsely-annotated points.
However, directly applying MAE to 3D point clouds with sparse annotations may fail to work. First, it is nontrivial to effectively mask out the informative visual context from 3D point clouds. Second, how to fully exploit the sparse annotations for context modeling remains an open question. In this paper, we propose a simple yet effective Contextual Point Cloud Modeling (CPCM) method that consists of two parts: a region-wise masking (RegionMask) strategy and a contextual masked training (CMT) method. Specifically, RegionMask masks the point cloud continuously in geometric space to construct a meaningful masked prediction task for subsequent context learning. CMT disentangles the learning of supervised segmentation and unsupervised masked context prediction for effectively learning the very limited labeled points and mass unlabeled points, respectively. Extensive experiments on the widely-tested ScanNet V2 and S3DIS benchmarks demonstrate the superiority of CPCM over the state-of-the-art. + + + + PATMAT: Person Aware Tuning of Mask-Aware Transformer for Face Inpainting + http://openaccess.thecvf.com//content/ICCV2023/papers/Motamed_PATMAT_Person_Aware_Tuning_of_Mask-Aware_Transformer_for_Face_Inpainting_ICCV_2023_paper.pdf + Generative models such as StyleGAN2 and Stable Diffusion have achieved state-of-the-art performance in computer vision tasks such as image synthesis, inpainting, and de-noising. However, current generative models for face inpainting often fail to preserve fine facial details and the identity of the person, despite creating aesthetically convincing image structures and textures. In this work, we propose Person Aware Tuning (PAT) of Mask-Aware Transformer (MAT) for face inpainting, which addresses this issue. Our proposed method, PATMAT, effectively preserves identity by incorporating reference images of a subject and fine-tuning a MAT architecture trained on faces. By using 40 reference images, PATMAT creates anchor points in MAT's style module, and tunes the model using the fixed anchors to adapt the model to a new face identity. Moreover, PATMAT's use of multiple images per anchor during training allows the model to use fewer reference images than competing methods. We demonstrate that PATMAT outperforms state-of-the-art models in terms of image quality, the preservation of person-specific details, and the identity of the subject. Our results suggest that PATMAT can be a promising approach for improving the quality of personalized face inpainting. + + + + Adaptive Nonlinear Latent Transformation for Conditional Face Editing + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Adaptive_Nonlinear_Latent_Transformation_for_Conditional_Face_Editing_ICCV_2023_paper.pdf + Recent works for face editing usually manipulate the latent space of StyleGAN via the linear semantic directions. However, they usually suffer from the entanglement of facial attributes, need to tune the optimal editing strength, and are limited to binary attributes with strong supervision signals. This paper proposes a novel adaptive nonlinear latent transformation for disentangled and conditional face editing, termed AdaTrans. Specifically, our AdaTrans divides the manipulation process into several finer steps; i.e., the direction and size at each step are conditioned on both the facial attributes and the latent codes. In this way, AdaTrans describes an adaptive nonlinear transformation trajectory to manipulate the faces into target attributes while keeping other attributes unchanged. 
Then, AdaTrans leverages a predefined density model to constrain the learned trajectory in the distribution of latent codes by maximizing the likelihood of the transformed latent code. Moreover, we propose a disentangled learning strategy under a mutual information framework to eliminate the entanglement among attributes, which can further relax the need for labeled data. Consequently, AdaTrans enables controllable face editing with the advantages of disentanglement, flexibility with non-binary attributes, and high fidelity. Extensive experimental results on various facial attributes demonstrate the qualitative and quantitative effectiveness of the proposed AdaTrans over existing state-of-the-art methods, especially in the most challenging scenarios with a large age gap and few labeled examples. The source code is available at https://github.com/Hzzone/AdaTrans. + + + + Tiny Updater: Towards Efficient Neural Network-Driven Software Updating http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Tiny_Updater_Towards_Efficient_Neural_Network-Driven_Software_Updating_ICCV_2023_paper.pdf Significant advancements have been accomplished with deep neural networks in diverse visual tasks, which have substantially elevated their deployment in edge device software. However, during the update of neural network-based software, users are required to download all the parameters of the neural network anew, which harms the user experience. Motivated by previous progress in model compression, we propose a novel training methodology named Tiny Updater to address this issue. Specifically, by adopting the variant of pruning and knowledge distillation methods, Tiny Updater can update the neural network-based software by only downloading a few parameters (10%-20%) instead of all the parameters in the neural network. Experiments on eleven datasets of three tasks, including image classification, image-to-image translation, and video recognition, have demonstrated its effectiveness. Codes have been released at https://github.com/ArchipLab-LinfengZhang/TinyUpdater. + + + + CAD-Estate: Large-scale CAD Model Annotation in RGB Videos http://openaccess.thecvf.com//content/ICCV2023/papers/Maninis_CAD-Estate_Large-scale_CAD_Model_Annotation_in_RGB_Videos_ICCV_2023_paper.pdf We propose a method for annotating videos of complex multi-object scenes with a globally-consistent 3D representation of the objects. We annotate each object with a CAD model from a database, and place it in the 3D coordinate frame of the scene with a 9-DoF pose transformation. Our method is semi-automatic and works on commonly-available RGB videos, without requiring a depth sensor. Many steps are performed automatically, and the tasks performed by humans are simple, well-specified, and require only limited reasoning in 3D. This makes them feasible for crowd-sourcing and has allowed us to construct a large-scale dataset by annotating real-estate videos from YouTube. Our dataset CAD-Estate offers 101k instances of 12k unique CAD models placed in the 3D representations of 20k videos. In comparison to Scan2CAD, the largest existing dataset with CAD model annotations on real scenes, CAD-Estate has 7x more instances and 4x more unique CAD models. We showcase the benefits of pre-training a Mask2CAD model on CAD-Estate for the task of automatic 3D object reconstruction and pose estimation, demonstrating that it leads to performance improvements on the popular Scan2CAD benchmark.
The dataset is available at https://github.com/google-research/cad-estate. + + + + Muscles in Action http://openaccess.thecvf.com//content/ICCV2023/papers/Chiquier_Muscles_in_Action_ICCV_2023_paper.pdf Human motion is created by, and constrained by, our muscles. We take a first step at building computer vision methods that represent the internal muscle activity that causes motion. We present a new dataset, Muscles in Action (MIA), to learn to incorporate muscle activity into human motion representations. The dataset consists of 12.5 hours of synchronized video and surface electromyography (sEMG) data of 10 subjects performing various exercises. Using this dataset, we learn a bidirectional representation that predicts muscle activation from video, and conversely, reconstructs motion from muscle activation. We evaluate our model on in-distribution subjects and exercises, as well as on out-of-distribution subjects and exercises. We demonstrate how advances in modeling both modalities jointly can serve as conditioning for muscularly consistent motion generation. Putting muscles into computer vision systems will enable richer models of virtual humans, with applications in sports, fitness, and AR/VR. + + + + Large-Scale Person Detection and Localization Using Overhead Fisheye Cameras http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Large-Scale_Person_Detection_and_Localization_Using_Overhead_Fisheye_Cameras_ICCV_2023_paper.pdf Location determination finds wide applications in daily life. In contrast to existing efforts devoted to localizing tourist photos captured by perspective cameras, in this article we focus on developing person positioning solutions using overhead fisheye cameras. Such solutions are advantageous in large field of view (FOV), low cost, anti-occlusion, and unaggressive work mode (without the necessity of cameras carried by persons). However, related studies are quite scarce, due to the paucity of data. To stimulate research in this exciting area, we present LOAF, the first large-scale overhead fisheye dataset for person detection and localization. LOAF is built with many essential features, e.g., i) the data cover abundant diversities in scenes, human pose, density, and location; ii) it currently contains the largest number of annotated pedestrians, i.e., 600K bounding boxes with ground-truth location information; iii) the body-boxes are labeled as radius-aligned so as to fully address the positioning challenge. To approach localization, we build a fisheye person detection network, which exploits the fisheye distortions by a clever position embedding strategy and is trained to predict radius-aligned human boxes end-to-end. Then, the actual locations of the detected persons are calculated by a numerical solution on the fisheye model and camera altitude data. Extensive experiments on LOAF validate the superiority of our fisheye detector w.r.t. previous methods, and show that our whole fisheye positioning solution is able to locate all persons in FOV with an accuracy of 0.5m, within 0.1s. Our dataset and code shall be released. + + + + All-to-Key Attention for Arbitrary Style Transfer http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_All-to-Key_Attention_for_Arbitrary_Style_Transfer_ICCV_2023_paper.pdf Attention-based arbitrary style transfer studies have shown promising performance in synthesizing vivid local style details.
They typically use the all-to-all attention mechanism---each position of content features is fully matched to all positions of style features. However, all-to-all attention tends to generate distorted style patterns and has quadratic complexity, limiting the effectiveness and efficiency of arbitrary style transfer. In this paper, we propose a novel all-to-key attention mechanism---each position of content features is matched to stable key positions of style features---that is more in line with the characteristics of style transfer. Specifically, it integrates two newly proposed attention forms: distributed and progressive attention. Distributed attention assigns attention to key style representations that depict the style distribution of local regions; Progressive attention pays attention from coarse-grained regions to fine-grained key positions. The resultant module, dubbed StyA2K, shows extraordinary performance in preserving the semantic structure and rendering consistent style patterns. Qualitative and quantitative comparisons with state-of-the-art methods demonstrate the superior performance of our approach. Codes and models are available on https://github.com/LearningHx/StyA2K. + + + + Learning to Distill Global Representation for Sparse-View CT + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Learning_to_Distill_Global_Representation_for_Sparse-View_CT_ICCV_2023_paper.pdf + Sparse-view computed tomography (CT)---using a small number of projections for tomographic reconstruction---enables much lower radiation dose to patients and accelerated data acquisition. The reconstructed images, however, suffer from strong artifacts, greatly limiting their diagnostic value. Current trends for sparse-view CT turn to the raw data for better information recovery. The resultant dual-domain methods, nonetheless, suffer from secondary artifacts, especially in ultra-sparse view scenarios, and their generalization to other scanners/protocols is greatly limited. A crucial question arises: have the image post-processing methods reached the limit? Our answer is not yet. In this paper, we stick to image post-processing methods due to great flexibility and propose global representation (GloRe) distillation framework for sparse-view CT, termed GloReDi. First, we propose to learn GloRe with Fourier convolution, so each element in GloRe has an image-wide receptive field. Second, unlike methods that only use the full-view images for supervision, we propose to distill GloRe from intermediate-view reconstructed images that are readily available but not explored in previous literature. The success of GloRe distillation is attributed to two key components: representation directional distillation to align the GloRe directions, and band-pass-specific contrastive distillation to gain clinically important details. Extensive experiments demonstrate the superiority of the proposed GloReDi over the state-of-the-art methods, including dual-domain ones. The source code is available at https://github.com/longzilicart/GloReDi. + + + + SparseMAE: Sparse Training Meets Masked Autoencoders + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_SparseMAE_Sparse_Training_Meets_Masked_Autoencoders_ICCV_2023_paper.pdf + Masked Autoencoders (MAE) and its variants have proven to be effective for pretraining large-scale Vision Transformers (ViTs). However, small-scale models do not benefit from the pretraining mechanisms due to limited capacity. 
Sparse training is a method of transferring representations from large models to small ones by pruning unimportant parameters. However, naively combining MAE finetuning with sparse training makes the network task-specific, resulting in the loss of task-agnostic knowledge, which is crucial for model generalization. In this paper, we aim to reduce model complexity from large vision transformers pretrained by MAE with the assistance of sparse training. We summarize various sparse training methods to prune large vision transformers during MAE pretraining and finetuning stages, and discuss their shortcomings. To improve learning both task-agnostic and task-specific knowledge, we propose SparseMAE, a novel two-stage sparse training method that includes sparse pretraining and sparse finetuning. In sparse pretraining, we dynamically prune a small-scale sub-network from a ViT-Base. During finetuning, the sparse sub-network adaptively changes its topology connections under the task-agnostic knowledge of the full model. Extensive experimental results demonstrate the effectiveness of our method and its superiority on small-scale vision transformers. Code will be available at https://github.com/aojunzz/SparseMAE. + + + + ELITE: Encoding Visual Concepts into Textual Embeddings for Customized Text-to-Image Generation http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_ELITE_Encoding_Visual_Concepts_into_Textual_Embeddings_for_Customized_Text-to-Image_ICCV_2023_paper.pdf In addition to the unprecedented ability in imaginary creation, large text-to-image models are expected to take customized concepts in image generation. Existing works generally learn such concepts in an optimization-based manner, yet bringing excessive computation or memory burden. In this paper, we instead propose a learning-based encoder, which consists of a global and a local mapping network for fast and accurate customized text-to-image generation. Specifically, the global mapping network projects the hierarchical features of a given image into multiple "new" words in the textual word embedding space, i.e., one primary word for the well-editable concept and other auxiliary words to exclude irrelevant disturbances (e.g., background). In the meantime, a local mapping network injects the encoded patch features into cross attention layers to provide omitted details, without sacrificing the editability of primary concepts. We compare our method with existing optimization-based approaches on a variety of user-defined concepts, and demonstrate that our method enables high-fidelity inversion and more robust editability with a significantly faster encoding process. Our code is publicly available at https://github.com/csyxwei/ELITE. + + + + Text2Performer: Text-Driven Human Video Generation http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Text2Performer_Text-Driven_Human_Video_Generation_ICCV_2023_paper.pdf Text-driven content creation has evolved to be a transformative technique that revolutionizes creativity. Here we study the task of text-driven human video generation, where a video sequence is synthesized from texts describing the appearance and motions of a target performer. Compared to general text-driven video generation, human-centric video generation requires maintaining the appearance of the synthesized human while performing complex motions. In this work, we present Text2Performer to generate vivid human videos with articulated motions from texts.
Text2Performer has two novel designs: 1) decomposed human representation and 2) diffusion-based motion sampler. First, we decompose the VQVAE latent space into human appearance and pose representation in an unsupervised manner by utilizing the nature of human videos. In this way, the appearance is well maintained along the generated frames. Then, we propose continuous VQ-diffuser to sample a sequence of pose embeddings. Unlike existing VQ-based methods that operate in the discrete space, continuous VQ-diffuser directly outputs the continuous pose embeddings for better motion modeling. Finally, motion-aware masking strategy is designed to mask the pose embeddings spatial-temporally to enhance the temporal coherence. Moreover, to facilitate the task of text-driven human video generation, we contribute a Fashion-Text2Video dataset with manually annotated action labels and text descriptions. Extensive experiments demonstrate that Text2Performer generates high-quality human videos (up to 512x256 resolution) with diverse appearances and flexible motions. Our project page is https://yumingj.github.io/projects/Text2Performer.html + + + + A Simple Recipe to Meta-Learn Forward and Backward Transfer + http://openaccess.thecvf.com//content/ICCV2023/papers/Cetin_A_Simple_Recipe_to_Meta-Learn_Forward_and_Backward_Transfer_ICCV_2023_paper.pdf + Meta-learning holds the potential to provide a general and explicit solution to tackle interference and forgetting in continual learning. However, many popular algorithms introduce expensive and unstable optimization processes with new key hyper-parameters and requirements, hindering their applicability. We propose a new, general, and simple meta-learning algorithm for continual learning (SiM4C) that explicitly optimizes to minimize forgetting and facilitate forward transfer. We show our method is stable, introduces only minimal computational overhead, and can be integrated with any memory-based continual learning algorithm in only a few lines of code. SiM4C meta-learns how to effectively continually learn even on very long task sequences, largely outperforming prior meta-approaches. Naively integrating with existing memory-based algorithms, we also record universal performance benefits and state-of-the-art results across different visual classification benchmarks without introducing new hyper-parameters. + + + + 4D Myocardium Reconstruction with Decoupled Motion and Shape Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Yuan_4D_Myocardium_Reconstruction_with_Decoupled_Motion_and_Shape_Model_ICCV_2023_paper.pdf + Estimating the shape and motion state of the myocardium is essential in diagnosing cardiovascular diseases. However, cine magnetic resonance (CMR) imaging is dominated by 2D slices, whose large slice spacing challenges inter-slice shape reconstruction and motion acquisition. To address this problem, we propose a 4D reconstruction method that decouples motion and shape, which can predict the inter-/intra- shape and motion estimation from a given sparse point cloud sequence obtained from limited slices. Our framework comprises a neural motion model and an end-diastolic (ED) shape model. The implicit ED shape model can learn a continuous boundary and encourage the motion model to predict without the supervision of ground truth deformation, and the motion model enables canonical input of the shape model by deforming any point from any phase to the ED phase. 
Additionally, the constructed ED-space enables pre-training of the shape model, thereby guiding the motion model and addressing the issue of data scarcity. To the best of our knowledge, we propose the first 4D myocardial dataset, and we verify our method on the proposed, public, and cross-modal datasets, showing superior reconstruction performance and enabling various clinical applications. + + + + LiDAR-UDA: Self-ensembling Through Time for Unsupervised LiDAR Domain Adaptation http://openaccess.thecvf.com//content/ICCV2023/papers/Shaban_LiDAR-UDA_Self-ensembling_Through_Time_for_Unsupervised_LiDAR_Domain_Adaptation_ICCV_2023_paper.pdf We introduce LiDAR-UDA, a novel two-stage self-training-based Unsupervised Domain Adaptation (UDA) method for LiDAR segmentation. Existing self-training methods use a model trained on labeled source data to generate pseudo labels for target data and refine the predictions via fine-tuning the network on the pseudo labels. These methods suffer from domain shifts caused by different LiDAR sensor configurations in the source and target domains. We propose two techniques to reduce sensor discrepancy and improve pseudo label quality: 1) LiDAR beam subsampling, which simulates different LiDAR scanning patterns by randomly dropping beams; 2) cross-frame ensembling, which exploits temporal consistency of consecutive frames to generate more reliable pseudo labels. Our method is simple, generalizable, and does not incur any extra inference cost. We evaluate our method on several public LiDAR datasets and show that it outperforms the state-of-the-art methods by more than 3.9% mIoU on average for all scenarios. Code will be available at https://github.com/JHLee0513/lidar_uda. + + + + MSI: Maximize Support-Set Information for Few-Shot Segmentation http://openaccess.thecvf.com//content/ICCV2023/papers/Moon_MSI_Maximize_Support-Set_Information_for_Few-Shot_Segmentation_ICCV_2023_paper.pdf FSS (Few-shot segmentation) aims to segment a target class using a small number of labeled images (support set). To extract the information relevant to the target class, a dominant approach in best-performing FSS methods removes background features using a support mask. We observe that this feature excision through a limiting support mask introduces an information bottleneck in several challenging FSS cases, e.g., for small targets and/or inaccurate target boundaries. To this end, we present a novel method (MSI), which maximizes the support-set information by exploiting two complementary sources of features to generate super correlation maps. We validate the effectiveness of our approach by instantiating it into three recent and strong FSS methods. Experimental results on several publicly available FSS benchmarks show that our proposed method consistently improves performance by visible margins and leads to faster convergence. Our code and trained models are available at: https://github.com/moonsh/MSI-Maximize-Support-Set-Information + + + + H3WB: Human3.6M 3D WholeBody Dataset and Benchmark http://openaccess.thecvf.com//content/ICCV2023/papers/Zhu_H3WB_Human3.6M_3D_WholeBody_Dataset_and_Benchmark_ICCV_2023_paper.pdf We present a benchmark for 3D human whole-body pose estimation, which involves identifying accurate 3D keypoints on the entire human body, including face, hands, body, and feet. Currently, the lack of a fully annotated and accurate 3D whole-body dataset results in deep networks being trained separately on specific body parts, which are combined during inference.
Or they rely on pseudo-groundtruth provided by parametric body models which are not as accurate as detection based methods. To overcome these issues, we introduce the Human3.6M 3D WholeBody (H3WB) dataset, which provides whole-body annotations for the Human3.6M dataset using the COCO Wholebody layout. H3WB comprises 133 whole-body keypoint annotations on 100K images, made possible by our new multi-view pipeline. We also propose three tasks: i) 3D whole-body pose lifting from 2D complete whole-body pose, ii) 3D whole-body pose lifting from 2D incomplete whole-body pose, and iii) 3D whole-body pose estimation from a single RGB image. Additionally, we report several baselines from popular methods for these tasks. Furthermore, we also provide automated 3D whole-body annotations of TotalCapture and experimentally show that when used with H3WB it helps to improve the performance. + + + + LDP-Feat: Image Features with Local Differential Privacy + http://openaccess.thecvf.com//content/ICCV2023/papers/Pittaluga_LDP-Feat_Image_Features_with_Local_Differential_Privacy_ICCV_2023_paper.pdf + Modern computer vision services often require users to share raw feature descriptors with an untrusted server. This presents an inherent privacy risk, as raw descriptors may be used to recover the source images from which they were extracted. To address this issue, researchers recently proposed privatizing image features by embedding them within an affine subspace containing the original feature as well as adversarial feature samples. In this paper, we propose two novel inversion attacks to show that it is possible to (approximately) recover the original image features from these embeddings, allowing us to recover privacy-critical image content. In light of such successes and the lack of theoretical privacy guarantees afforded by existing visual privacy methods, we further propose the first method to privatize image features via local differential privacy, which, unlike prior approaches, provides a guaranteed bound for privacy leakage regardless of the strength of the attacks. In addition, our method yields strong performance in visual localization as a downstream task while enjoying the privacy guarantee. + + + + Pre-Training-Free Image Manipulation Localization through Non-Mutually Exclusive Contrastive Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_Pre-Training-Free_Image_Manipulation_Localization_through_Non-Mutually_Exclusive_Contrastive_Learning_ICCV_2023_paper.pdf + Deep Image Manipulation Localization (IML) models suffer from training data insufficiency and thus heavily rely on pre-training. We argue that contrastive learning is more suitable to tackle the data insufficiency problem for IML. Crafting mutually exclusive positives and negatives is the prerequisite for contrastive learning. However, when adopting contrastive learning in IML, we encounter three categories of image patches: tampered, authentic, and contour patches. Tampered and authentic patches are naturally mutually exclusive, but contour patches containing both tampered and authentic pixels are non-mutually exclusive to them. Simply abnegating these contour patches results in a drastic performance loss since contour patches are decisive to the learning outcomes. Hence, we propose the Non-mutually exclusive Contrastive Learning (NCL) framework to rescue conventional contrastive learning from the above dilemma. 
In NCL, to cope with the non-mutual exclusivity, we first establish a pivot structure with dual branches to constantly switch the role of contour patches between positives and negatives while training. Then, we devise a pivot-consistent loss to avoid spatial corruption caused by the role-switching process. In this manner, NCL both inherits the self-supervised merits to address the data insufficiency and retains a high manipulation localization accuracy. Extensive experiments verify that our NCL achieves state-of-the-art performance on all five benchmarks without any pre-training and is more robust on unseen real-life samples. https://github.com/Knightzjz/NCL-IML + + + + MRN: Multiplexed Routing Network for Incremental Multilingual Text Recognition http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_MRN_Multiplexed_Routing_Network_for_Incremental_Multilingual_Text_Recognition_ICCV_2023_paper.pdf Multilingual text recognition (MLTR) systems typically focus on a fixed set of languages, which makes it difficult to handle newly added languages or adapt to ever-changing data distribution. In this paper, we propose the Incremental MLTR (IMLTR) task in the context of incremental learning (IL), where different languages are introduced in batches. IMLTR is particularly challenging due to rehearsal-imbalance, which refers to the uneven distribution of sample characters in the rehearsal set, used to retain a small amount of old data as past memories. To address this issue, we propose a Multiplexed Routing Network (MRN). MRN trains a recognizer for each language that is currently seen. Subsequently, a language domain predictor is learned based on the rehearsal set to weigh the recognizers. Since the recognizers are derived from the original data, MRN effectively reduces the reliance on older data and better fights against catastrophic forgetting, the core issue in IL. We extensively evaluate MRN on MLT17 and MLT19 datasets. It outperforms existing general-purpose IL methods by large margins, with average accuracy improvements ranging from 10.3% to 35.8% under different settings. Code is available at https://github.com/simplify23/MRN. + + + + MOST: Multiple Object Localization with Self-Supervised Transformers for Object Discovery http://openaccess.thecvf.com//content/ICCV2023/papers/Rambhatla_MOST_Multiple_Object_Localization_with_Self-Supervised_Transformers_for_Object_Discovery_ICCV_2023_paper.pdf We tackle the challenging task of unsupervised object localization in this work. Recently, transformers trained with self-supervised learning have been shown to exhibit object localization properties without being trained for this task. In this work, we present Multiple Object localization with Self-supervised Transformers (MOST) that uses features of transformers trained using self-supervised learning to localize multiple objects in real world images. MOST analyzes the similarity maps of the features using box counting, a fractal analysis tool, to identify tokens lying on foreground patches. The identified tokens are then clustered together, and tokens of each cluster are used to generate bounding boxes on foreground regions. Unlike recent state-of-the-art object localization methods, MOST can localize multiple objects per image and outperforms SOTA algorithms on several object localization and discovery benchmarks on PASCAL-VOC 07, 12 and COCO20k datasets.
Additionally, we show that MOST can be used for self-supervised pretraining of object detectors, and yields consistent improvements on fully and semi-supervised object detection and unsupervised region proposal generation. Our project is publicly available at rssaketh.github.io/most. + + + + SAFE: Machine Unlearning With Shard Graphs http://openaccess.thecvf.com//content/ICCV2023/papers/Dukler_SAFE_Machine_Unlearning_With_Shard_Graphs_ICCV_2023_paper.pdf We present Synergy Aware Forgetting Ensemble (SAFE), a method to adapt large models on a diverse collection of data while minimizing the expected cost to remove the influence of training samples from the trained model. This process, also known as selective forgetting or unlearning, is often conducted by partitioning a dataset into shards, training fully independent models on each, then ensembling the resulting models. Increasing the number of shards reduces the expected cost to forget but at the same time it increases inference cost and reduces the final accuracy of the model since synergistic information between samples is lost during the independent model training. Rather than treating each shard as independent, SAFE introduces the notion of a shard graph, which allows incorporating limited information from other shards during training, trading off a modest increase in expected forgetting cost with a significant increase in accuracy, all while still attaining complete removal of residual influence after forgetting. SAFE uses a lightweight system of adapters which can be trained while reusing most of the computations. This allows SAFE to be trained on shards an order-of-magnitude smaller than current state-of-the-art methods (thus reducing the forgetting costs) while also maintaining high accuracy, as we demonstrate empirically on fine-grained computer vision datasets. + + + + OrthoPlanes: A Novel Representation for Better 3D-Awareness of GANs http://openaccess.thecvf.com//content/ICCV2023/papers/He_OrthoPlanes_A_Novel_Representation_for_Better_3D-Awareness_of_GANs_ICCV_2023_paper.pdf We present a new method for generating realistic and view-consistent images with fine geometry from 2D image collections. Our method proposes a hybrid explicit-implicit representation called OrthoPlanes, which encodes fine-grained 3D information in feature maps that can be efficiently generated by modifying 2D StyleGANs. Compared to previous representations, our method has better scalability and expressiveness with clear and explicit information. As a result, our method can handle more challenging view-angles and synthesize articulated objects with high spatial degrees of freedom. Experiments demonstrate that our method achieves state-of-the-art results on FFHQ and SHHQ datasets, both quantitatively and qualitatively. + + + + NeTO: Neural Reconstruction of Transparent Objects with Self-Occlusion Aware Refraction-Tracing http://openaccess.thecvf.com//content/ICCV2023/papers/Li_NeTONeural_Reconstruction_of_Transparent_Objects_with_Self-Occlusion_Aware_Refraction-Tracing_ICCV_2023_paper.pdf We present a novel method called NeTO for capturing the 3D geometry of solid transparent objects from 2D images via volume rendering. Reconstructing transparent objects is a very challenging task, which is ill-suited for general-purpose reconstruction techniques due to the specular light transport phenomena.
Although existing refraction-tracing-based methods, designed especially for this task, achieve impressive results, they still suffer from unstable optimization and loss of fine details since the explicit surface representation they adopt is difficult to optimize, and the self-occlusion problem is ignored for refraction-tracing. In this paper, we propose to leverage an implicit Signed Distance Function (SDF) as the surface representation and optimize the SDF field via volume rendering with a self-occlusion aware refractive ray tracing. The implicit representation enables our method to produce high-quality reconstructions even with a limited set of views, and the self-occlusion aware strategy makes it possible for our method to accurately reconstruct the self-occluded regions. Experiments show that our method achieves faithful reconstruction results and outperforms prior works by a large margin. Visit our project page at https://www.xxlong.site/NeTO/. + + + + Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_Boosting_3-DoF_Ground-to-Satellite_Camera_Localization_Accuracy_via_Geometry-Guided_Cross-View_Transformer_ICCV_2023_paper.pdf Image retrieval-based cross-view localization methods often lead to very coarse camera pose estimation, due to the limited sampling density of the database satellite images. In this paper, we propose a method to increase the accuracy of a ground camera's location and orientation by estimating the relative rotation and translation between the ground-level image and its matched/retrieved satellite image. Our approach designs a geometry-guided cross-view transformer that combines the benefits of conventional geometry and learnable cross-view transformers to map the ground-view observations to an overhead view. Given the synthesized overhead view and observed satellite feature maps, we construct a neural pose optimizer with strong global information embedding ability to estimate the relative rotation between them. After aligning their rotations, we develop an uncertainty-guided spatial correlation to generate a probability map of the vehicle locations, from which the relative translation can be determined. Experimental results demonstrate that our method significantly outperforms the state-of-the-art. Notably, the likelihood of restricting the vehicle lateral pose to be within 1m of its Ground Truth (GT) value on the cross-view KITTI dataset has been improved from 35.54% to 76.44%, and the likelihood of restricting the vehicle orientation to be within 1 degree of its GT value has been improved from 19.64% to 99.10%. + + + + Adaptive Reordering Sampler with Neurally Guided MAGSAC http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Adaptive_Reordering_Sampler_with_Neurally_Guided_MAGSAC_ICCV_2023_paper.pdf We propose a new sampler for robust estimators that always selects the sample with the highest probability of consisting only of inliers. After every unsuccessful iteration, the inlier probabilities are updated in a principled way via a Bayesian approach. The probabilities obtained by the deep network are used as prior (so-called neural guidance) inside the sampler. Moreover, we introduce a new loss that exploits, in a geometrically justifiable manner, the orientation and scale that can be estimated for any type of feature, e.g., SIFT or SuperPoint, to estimate two-view geometry.
The new loss helps to learn higher-order information about the underlying scene geometry. Benefiting from the new sampler and the proposed loss, we combine the neural guidance with the state-of-the-art MAGSAC++. Adaptive Reordering Sampler with Neurally Guided MAGSAC (ARS-MAGSAC) is superior to the state-of-the-art in terms of accuracy and run-time on the PhotoTourism and KITTI datasets for essential and fundamental matrix estimation. The code and trained models are available at https://github.com/weitong8591/ars_magsac. + + + + Learning Cross-Representation Affinity Consistency for Sparsely Supervised Biomedical Instance Segmentation http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Learning_Cross-Representation_Affinity_Consistency_for_Sparsely_Supervised_Biomedical_Instance_Segmentation_ICCV_2023_paper.pdf Sparse instance-level supervision has recently been explored to address insufficient annotation in biomedical instance segmentation, which makes it easier to annotate crowded instances and better preserves instance completeness for 3D volumetric datasets compared to common semi-supervision. In this paper, we propose a sparsely supervised biomedical instance segmentation framework via cross-representation affinity consistency regularization. Specifically, we adopt two individual networks to enforce the perturbation consistency between an explicit affinity map and an implicit affinity map to capture both feature-level instance discrimination and pixel-level instance boundary structure. We then select the highly confident region of each affinity map as the pseudo label to supervise the other one for affinity consistency learning. To obtain the highly confident region, we propose a pseudo-label noise filtering scheme by integrating two entropy-based decision strategies. Extensive experiments on four biomedical datasets with sparse instance annotations show the state-of-the-art performance of our proposed framework. For the first time, we demonstrate the superiority of sparse instance-level supervision on 3D volumetric datasets, compared to common semi-supervision under the same annotation cost. + + + + A Skeletonization Algorithm for Gradient-Based Optimization http://openaccess.thecvf.com//content/ICCV2023/papers/Menten_A_Skeletonization_Algorithm_for_Gradient-Based_Optimization_ICCV_2023_paper.pdf The skeleton of a digital image is a compact representation of its topology, geometry, and scale. It has utility in many computer vision applications, such as image description, segmentation, and registration. However, skeletonization has only seen limited use in contemporary deep learning solutions. Most existing skeletonization algorithms are not differentiable, making it impossible to integrate them with gradient-based optimization. Compatible algorithms based on morphological operations and neural networks have been proposed, but their results often deviate from the geometry and topology of the true medial axis. This work introduces the first three-dimensional skeletonization algorithm that is both compatible with gradient-based optimization and preserves an object's topology. Our method is exclusively based on matrix additions and multiplications, convolutional operations, basic non-linear functions, and sampling from a uniform probability distribution, allowing it to be easily implemented in any major deep learning library.
In benchmarking experiments, we prove the advantages of our skeletonization algorithm compared to non-differentiable, morphological, and neural-network-based baselines. Finally, we demonstrate the utility of our algorithm by integrating it with two medical image processing applications that use gradient-based optimization: deep-learning-based blood vessel segmentation, and multimodal registration of the mandible in computed tomography and magnetic resonance images. + + + + V3Det: Vast Vocabulary Visual Detection Dataset + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_V3Det_Vast_Vocabulary_Visual_Detection_Dataset_ICCV_2023_paper.pdf + Recent advances in detecting arbitrary objects in the real world are trained and evaluated on object detection datasets with a relatively restricted vocabulary. To facilitate the development of more general visual object detection, we propose V3Det, a vast vocabulary visual detection dataset with precisely annotated bounding boxes on massive images. V3Det has several appealing properties: 1) Vast Vocabulary: It contains bounding boxes of objects from 13,204 categories on real-world images, which is 10 times larger than the existing large vocabulary object detection dataset, e.g., LVIS. 2) Hierarchical Category Organization: The vast vocabulary of V3Det is organized by a hierarchical category tree which annotates the inclusion relationship among categories, encouraging the exploration of category relationships in vast and open vocabulary object detection. 3) Rich Annotations: V3Det comprises precisely annotated objects in 243k images and professional descriptions of each category written by human experts and a powerful chatbot. By offering a vast exploration space, V3Det enables extensive benchmarks on both vast and open vocabulary object detection, leading to new observations, practices, and insights for future research. It has the potential to serve as a cornerstone dataset for developing more general visual perception systems. V3Det is available at https://v3det.openxlab.org.cn/. + + + + Multi-weather Image Restoration via Domain Translation + http://openaccess.thecvf.com//content/ICCV2023/papers/Patil_Multi-weather_Image_Restoration_via_Domain_Translation_ICCV_2023_paper.pdf + Weather degraded conditions such as rain, haze, snow, etc. may degrade the performance of most computer vision systems. Therefore, effective restoration of multi-weather degraded images is an essential prerequisite for successful functioning of such systems. The current multi-weather image restoration approaches utilize a model that is trained on a combined dataset consisting of individual images for rainy, snowy, and hazy weather degradations. These methods may face challenges when dealing with real-world situations where the images may have multiple, more intricate weather conditions. To address this issue, we propose a domain translation-based unified method for multi-weather image restoration. In this approach, the proposed network learns multiple weather degradations simultaneously, making it immune for real-world conditions. Specifically, we first propose an instance-level domain (weather) translation with multi-attentive feature learning approach to get different weather-degraded variants of the same scenario. Next, the original and translated images are used as input to the proposed novel multi-weather restoration network which utilizes a progressive multi-domain deformable alignment (PMDA) with cascaded multi-head attention (CMA). 
The proposed PMDA helps the restoration network learn weather-invariant clues effectively. Further, the PMDA and the respective decoder features are merged via the proposed CMA module for restoration. Extensive experimental results on synthetic and real-world hazy, rainy, and snowy image databases clearly demonstrate that our model outperforms the state-of-the-art multi-weather image restoration methods. The URL for our code is provided in the supplementary material and will be made public upon acceptance. + + + + Improving Unsupervised Visual Program Inference with Code Rewriting Families http://openaccess.thecvf.com//content/ICCV2023/papers/Ganeshan_Improving_Unsupervised_Visual_Program_Inference_with_Code_Rewriting_Families_ICCV_2023_paper.pdf Programs offer compactness and structure that make them an attractive representation for visual data. We explore how code rewriting can be used to improve systems for inferring programs from visual data. We first propose Sparse Intermittent Rewrite Injection (SIRI), a framework for unsupervised bootstrapped learning. SIRI sparsely applies code rewrite operations over a dataset of training programs, injecting the improved programs back into the training set. We design a family of rewriters for visual programming domains: parameter optimization, code pruning, and code grafting. For three shape programming languages in 2D and 3D, we experimentally validate that using SIRI with our family of rewriters improves performance: better reconstructions and faster convergence rates, compared with bootstrapped learning methods that do not use rewriters or use them naively. Finally, we demonstrate that our family of rewriters can be effectively employed at test time to improve the output of SIRI predictions. For 2D and 3D CSG, we outperform or match the reconstruction performance of recent domain-specific neural architectures, while producing more parsimonious programs that use significantly fewer primitives. + + + + Essential Matrix Estimation using Convex Relaxations in Orthogonal Space http://openaccess.thecvf.com//content/ICCV2023/papers/Karimian_Essential_Matrix_Estimation_using_Convex_Relaxations_in_Orthogonal_Space_ICCV_2023_paper.pdf We introduce a novel method to estimate the essential matrix for two-view Structure from Motion (SfM). We show that every 3 by 3 essential matrix can be embedded in a 4 by 4 rotation, having its bottom right entry fixed to zero; we call the latter the quintessential matrix. This embedding leads to rich relations with the space of 4-D rotations, quaternions, and the classical twisted-pair ambiguity in two-view SfM. We use this structure to derive a succession of semidefinite relaxations that require fewer parameters than the existing non-minimal solvers and yield faster convergence with certifiable optimality. We then exploit the low-rank geometry of these relaxations to reduce them to an equivalent optimization on a Riemannian manifold and solve them via the Riemannian Staircase method. The experimental evaluation confirms that our algorithm always finds the globally optimal solution and outperforms the existing non-minimal methods. We make our implementations open source.
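As background for the essential-matrix abstract above, the numpy sketch below shows the classical normalized eight-point baseline that certifiably optimal solvers like the one described improve upon. It is standard textbook material, not the paper's quintessential-matrix semidefinite relaxation, and the function name is an assumption.

import numpy as np

def essential_eight_point(x1, x2):
    """Classical linear estimate of the essential matrix from N >= 8 calibrated
    correspondences x1[i] <-> x2[i], given as (N, 2) normalized image coordinates."""
    x1h = np.hstack([x1, np.ones((len(x1), 1))])
    x2h = np.hstack([x2, np.ones((len(x2), 1))])
    # Each correspondence contributes one row of the epipolar constraint x2^T E x1 = 0.
    A = np.einsum('ni,nj->nij', x2h, x1h).reshape(len(x1), 9)
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)
    # Project onto the essential manifold: singular values (s, s, 0).
    U, S, Vt = np.linalg.svd(E)
    s = (S[0] + S[1]) / 2.0
    return U @ np.diag([s, s, 0.0]) @ Vt

In practice such a linear estimate is wrapped in a robust loop (e.g., RANSAC or MAGSAC++, as in the ARS-MAGSAC abstract above), whereas the paper's relaxation targets certifiable global optimality.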
+ + + + Concept-wise Fine-tuning Matters in Preventing Negative Transfer + http://openaccess.thecvf.com//content/ICCV2023/papers/Yang_Concept-wise_Fine-tuning_Matters_in_Preventing_Negative_Transfer_ICCV_2023_paper.pdf + A multitude of prevalent pre-trained models mark a major milestone in the development of artificial intelligence, while fine-tuning has been a common practice that enables pre-trained models to figure prominently in a wide array of target datasets. Our empirical results reveal that off-the-shelf fine-tuning techniques are far from adequate to mitigate negative transfer caused by two types of underperforming features in a pre-trained model, including rare features and spuriously correlated features. Rooted in structural causal models of predictions after fine-tuning, we propose a Concept-wise fine-tuning (Concept-Tuning) approach which refines feature representations in the level of patches with each patch encoding a concept. Concept-Tuning minimizes the negative impacts of rare features and spuriously correlated features by (1) maximizing the mutual information between examples in the same category with regard to a slice of rare features (a patch) and (2) applying front-door adjustment via attention neural networks in channels and feature slices (patches). The proposed Concept-Tuning consistently and significantly (by up to 4.76%) improves prior state-of-the-art fine-tuning methods on eleven datasets, diverse pre-training strategies (supervised and self-supervised ones), various network architectures, and sample sizes in a target dataset. + + + + Learning Human Dynamics in Autonomous Driving Scenarios + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Learning_Human_Dynamics_in_Autonomous_Driving_Scenarios_ICCV_2023_paper.pdf + Simulation has emerged as an indispensable tool for scaling and accelerating the development of self-driving systems. A critical aspect of this is simulating realistic and diverse human behavior and intent. In this work, we propose a holistic framework for learning physically plausible human dynamics from real driving scenarios, narrowing the gap between real and simulated human behavior in safety-critical applications. We show that state-of-the-art methods underperform in driving scenarios where video data is recorded from moving vehicles, and humans are frequently partially or fully occluded. Furthermore, existing methods often disregard the global scene where humans are situated, resulting in various motion artifacts like foot sliding, floating, or ground penetration. Therefore, the primary technical challenge of this work is to infer physically plausible human dynamics for the occluded body parts on uneven terrain, based on visible motions. To address this challenge, we propose an approach that incorporates physics with a reinforcement learning-based motion controller to learn human dynamics for driving scenarios. Our framework can simulate physically plausible human dynamics that accurately match observed human motions and infill motions for occluded body parts, while improving the physical plausibility of the entire motion sequence. We evaluate our method on the challenging driving scenarios in the Waymo Open Dataset. Experiments on the challenging Waymo Open Dataset show that our method outperforms state-of-the-art motion capture approaches significantly in recovering high-quality, physically plausible, and scene-aware human dynamics. 
+ + + + DDP: Diffusion Model for Dense Visual Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Ji_DDP_Diffusion_Model_for_Dense_Visual_Prediction_ICCV_2023_paper.pdf + We propose a simple, efficient, yet powerful framework for dense visual predictions based on the conditional diffusion pipeline. Our approach follows a "noise-to-map" generative paradigm for prediction by progressively removing noise from a random Gaussian distribution, guided by the image. The method, called DDP, efficiently extends the denoising diffusion process into the modern perception pipeline. Without task-specific design and architecture customization, DDP is easy to generalize to most dense prediction tasks, e.g., semantic segmentation and depth estimation. In addition, DDP shows attractive properties such as dynamic inference and uncertainty awareness, in contrast to previous single-step discriminative methods. We show top results on three representative tasks with six diverse benchmarks, without tricks, DDP achieves state-of-the-art or competitive performance on each task compared to the specialist counterparts. For example, semantic segmentation (83.9 mIoU on Cityscapes), BEV map segmentation (70.6 mIoU on nuScenes), and depth estimation (0.05 REL on KITTI). We hope that our approach will serve as a solid baseline and facilitate future research. + + + + Semantics-Consistent Feature Search for Self-Supervised Visual Representation Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_Semantics-Consistent_Feature_Search_for_Self-Supervised_Visual_Representation_Learning_ICCV_2023_paper.pdf + In contrastive self-supervised learning, the common way to learn discriminative representation is to pull different augmented "views" of the same image closer while pushing all other images further apart, which has been proven to be effective. However, it is unavoidable to construct undesirable views containing different semantic concepts during the augmentation procedure. It would damage the semantic consistency of representation to pull these augmentations closer in the feature space indiscriminately. In this study, we introduce feature-level augmentation and propose a novel semantics-consistent feature search (SCFS) method to mitigate this negative effect. The main idea of SCFS is to adaptively search semantics-consistent features to enhance the contrast between semantics-consistent regions in different augmentations. Thus, the trained model can learn to focus on meaningful object regions, improving the semantic representation ability. Extensive experiments conducted on different datasets and tasks demonstrate that SCFS effectively improves the performance of self-supervised learning and achieves state-of-the-art performance on different downstream tasks. + + + + Probabilistic Modeling of Inter- and Intra-observer Variability in Medical Image Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Schmidt_Probabilistic_Modeling_of_Inter-_and_Intra-observer_Variability_in_Medical_Image_ICCV_2023_paper.pdf + Medical image segmentation is a challenging task, particularly due to inter- and intra-observer variability, even between medical experts. In this paper, we propose a novel model, called Probabilistic Inter-Observer and iNtra-Observer variation NetwOrk (Pionono). It captures the labeling behavior of each rater with a multidimensional probability distribution and integrates this information with the feature maps of the image to produce probabilistic segmentation predictions. 
The model is optimized by variational inference and can be trained end-to-end. It outperforms state-of-the-art models such as STAPLE, Probabilistic U-Net, and models based on confusion matrices. Additionally, Pionono predicts multiple coherent segmentation maps that mimic the rater's expert opinion, which provides additional valuable information for the diagnostic process. Experiments on real-world cancer segmentation datasets demonstrate the high accuracy and efficiency of Pionono, making it a powerful tool for medical image analysis. + + + + Total-Recon: Deformable Scene Reconstruction for Embodied View Synthesis + http://openaccess.thecvf.com//content/ICCV2023/papers/Song_Total-Recon_Deformable_Scene_Reconstruction_for_Embodied_View_Synthesis_ICCV_2023_paper.pdf + We explore the task of embodied view synthesis from monocular videos of deformable scenes. Given a minute-long RGBD video of people interacting with their pets, we render the scene from novel camera trajectories derived from the in-scene motion of actors: (1) egocentric cameras that simulate the point of view of a target actor and (2) 3rd-person cameras that follow the actor. Building such a system requires reconstructing the root-body and articulated motion of every actor, as well as a scene representation that supports free-viewpoint synthesis. Longer videos are more likely to capture the scene from diverse viewpoints (which helps reconstruction) but are also more likely to contain larger motions (which complicates reconstruction). To address these challenges, we present Total-Recon, the first method to photorealistically reconstruct deformable scenes from long monocular RGBD videos. Crucially, to scale to long videos, our method hierarchically decomposes the scene into the background and objects, whose motion is decomposed into carefully initialized root-body motion and local articulations. To quantify such "in-the-wild" reconstruction and view synthesis, we collect ground-truth data from a specialized stereo RGBD capture rig for 11 challenging videos, significantly outperforming prior methods. + + + + AdaNIC: Towards Practical Neural Image Compression via Dynamic Transform Routing + http://openaccess.thecvf.com//content/ICCV2023/papers/Tao_AdaNIC_Towards_Practical_Neural_Image_Compression_via_Dynamic_Transform_Routing_ICCV_2023_paper.pdf + Compressive autoencoders (CAEs) play an important role in deep learning-based image compression, but large-scale CAEs are computationally expensive. We propose a framework with three techniques to enable efficient CAE-based image coding: 1) Spatially-adaptive convolution and normalization operators enable block-wise nonlinear transform to spend FLOPs unevenly across the image to be compressed, according to a transform capacity map. 2) Just-unpenalized model capacity (JUMC) optimizes the transform capacity of each CAE block via rate-distortion-complexity optimization, finding the optimal capacity for the source image content. 3) A lightweight routing agent model predicts the transform capacity map for the CAEs by approximating JUMC targets. By activating the best-sized sub-CAE inside the slimmable supernet, our approach achieves up to 40% computational speed-up with minimal BD-Rate increase, validating its ability to save computational resources in a content-aware manner. 
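To illustrate the kind of content-adaptive routing that AdaNIC describes, here is a toy numpy sketch that assigns each image block a capacity score and sends low-score blocks through a cheap transform and high-score blocks through a heavier one. The local standard deviation stands in for the paper's learned routing agent, and the threshold and the two stand-in transforms are illustrative assumptions rather than the slimmable sub-CAE machinery.

import numpy as np

def capacity_map(img, block=16):
    """Toy per-block capacity score: local standard deviation as a proxy for content complexity."""
    h, w = img.shape
    hb, wb = h // block, w // block
    crop = img[:hb * block, :wb * block]
    return crop.reshape(hb, block, wb, block).std(axis=(1, 3))

def route_blocks(img, block=16, thresh=0.05):
    """Send low-complexity blocks through a cheap transform and high-complexity
    blocks through a heavier one (stand-ins for slimmed vs. full sub-networks)."""
    cmap = capacity_map(img, block)
    out = img.copy()
    for by in range(cmap.shape[0]):
        for bx in range(cmap.shape[1]):
            sl = (slice(by * block, (by + 1) * block),
                  slice(bx * block, (bx + 1) * block))
            if cmap[by, bx] < thresh:
                out[sl] = img[sl].mean()                 # cheap path: flat approximation
            else:
                out[sl] = np.round(img[sl] * 64) / 64.0  # heavy path: finer quantization
    return out

# Usage on a synthetic image in [0, 1]: smooth background plus a textured patch.
rng = np.random.default_rng(0)
img = np.tile(np.linspace(0, 1, 128), (128, 1))
img[32:64, 32:64] += 0.2 * rng.standard_normal((32, 32))
print(capacity_map(img).round(2))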
+ + + + Privacy Preserving Localization via Coordinate Permutations + http://openaccess.thecvf.com//content/ICCV2023/papers/Pan_Privacy_Preserving_Localization_via_Coordinate_Permutations_ICCV_2023_paper.pdf + Recent methods on privacy-preserving image-based localization use a random line parameterization to protect the privacy of query images and database maps. The lifting of points to lines effectively drops one of the two geometric constraints traditionally used with point-to-point correspondences in structure-based localization. This leads to a significant loss of accuracy for the privacy-preserving methods. In this paper, we overcome this limitation by devising a coordinate permutation scheme that allows for recovering the original point positions during pose estimation. The recovered points provide the full 2D geometric constraints and enable us to close the gap between privacy-preserving and traditional methods in terms of accuracy. Another limitation of random line methods is their vulnerability to density based 3D line cloud inversion attacks. Our method not only provides better accuracy than the original random line based approach but also provides stronger privacy guarantees against these recently proposed attacks. Extensive experiments on standard benchmark datasets demonstrate these improvements consistently across both scenarios of protecting the privacy of query images as well as the database map. + + + + SMMix: Self-Motivated Image Mixing for Vision Transformers + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_SMMix_Self-Motivated_Image_Mixing_for_Vision_Transformers_ICCV_2023_paper.pdf + CutMix is a vital augmentation strategy that determines the performance and generalization ability of vision transformers (ViTs). However, the inconsistency between the mixed images and the corresponding labels harms its efficacy. Existing CutMix variants tackle this problem by generating more consistent mixed images or more precise mixed labels, but inevitably introduce heavy training overhead or require extra information, undermining ease of use. To this end, we propose an efficient and effective Self-Motivated image Mixing method (SMMix), which motivates both image and label enhancement by the model under training itself. Specifically, we propose a max-min attention region mixing approach that enriches the attention-focused objects in the mixed images. Then, we introduce a fine-grained label assignment technique that co-trains the output tokens of mixed images with fine-grained supervision. Moreover, we devise a novel feature consistency constraint to align features from mixed and unmixed images. Due to the subtle designs of the self-motivated paradigm, our SMMix is significant in its smaller training overhead and better performance than other CutMix variants. In particular, SMMix improves the accuracy of DeiT-T/S/B, CaiT-XXS-24/36, and PVT-T/S/M/L by more than +1% on ImageNet-1k. The generalization capability of our method is also demonstrated on downstream tasks and out-of-distribution datasets. Our project is available at https://github.com/ChenMnZ/SMMix. + + + + Reconciling Object-Level and Global-Level Objectives for Long-Tail Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Reconciling_Object-Level_and_Global-Level_Objectives_for_Long-Tail_Detection_ICCV_2023_paper.pdf + Large vocabulary object detectors are often faced with the long-tailed label distributions, seriously degrading their ability to detect rarely seen categories. 
On one hand, the rare objects are prone to be misclassified as frequent categories. On the other hand, due to the limitation on the total number of detections per image, detectors usually rank all the confidence scores globally and filter out the lower-ranking ones. This may result in missed detection during inference, especially for the rare categories that naturally come with lower scores. Existing methods mainly focus on the former problem and design various classification loss to enhance the object-level classification accuracy, but largely overlook the global-level ranking task. In this paper, we propose a novel framework that Reconciles Object-level and Global-level (ROG) objectives to address both problems. As a multi-task learning framework, ROG simultaneously trains the model with two tasks: classifying each object proposal individually and ranking all the confidence scores globally. Specifically, complementary to the object-level classification loss for model discrimination, we design a generalized average precision (GAP) loss to explicitly optimize the global-level score ranking across different objects. For each category, GAP loss generates balanced gradients to rectify the ranking errors. In experiments, we show that GAP loss is highly versatile to be plugged into various advanced methods and brings considerable benefits. + + + + In-Style: Bridging Text and Uncurated Videos with Style Transfer for Text-Video Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Shvetsova_In-Style_Bridging_Text_and_Uncurated_Videos_with_Style_Transfer_for_ICCV_2023_paper.pdf + Large-scale noisy web image-text datasets have been proven to be efficient for learning robust vision-language models. However, to transfer them to the task of video retrieval, models still need to be fine-tuned on hand-curated paired text-video data to adapt to the diverse styles of video descriptions. To address this problem without the need for hand-annotated pairs, we propose a new setting, text-video retrieval with uncurated & unpaired data, that uses only text queries together with uncurated web videos during training without any paired text-video data. To this end, we propose an approach, In-Style, that learns the style of the text queries and transfers it to uncurated web videos. Moreover, to improve generalization, we show that one model can be trained with multiple text styles. To this end, we introduce a multi-style contrastive training procedure, that improves the generalizability over several datasets simultaneously. We evaluate our model on retrieval performance over multiple datasets to demonstrate the advantages of our style transfer framework on the new task of uncurated & unpaired text-video retrieval and improve state-of-the-art performance on zero-shot text-video retrieval. + + + + CLIPTER: Looking at the Bigger Picture in Scene Text Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Aberdam_CLIPTER_Looking_at_the_Bigger_Picture_in_Scene_Text_Recognition_ICCV_2023_paper.pdf + Reading text in real-world scenarios often requires understanding the context surrounding it, especially when dealing with poor-quality text. However, current scene text recognizers are unaware of the bigger picture as they operate on cropped text images. In this study, we harness the representative capabilities of modern vision-language models, such as CLIP, to provide scene-level information to the crop-based recognizer. 
We achieve this by fusing a rich representation of the entire image, obtained from the vision-language model, with the recognizer word-level features via a gated cross-attention mechanism. This component gradually shifts to the context-enhanced representation, allowing for stable fine-tuning of a pretrained recognizer. We demonstrate the effectiveness of our model-agnostic framework, CLIPTER (CLIP TExt Recognition), on leading text recognition architectures and achieve state-of-the-art results across multiple benchmarks. Furthermore, our analysis highlights improved robustness to out-of-vocabulary words and enhanced generalization in low-data regimes. + + + + Revisiting Scene Text Recognition: A Data Perspective + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Revisiting_Scene_Text_Recognition_A_Data_Perspective_ICCV_2023_paper.pdf + This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective. We begin by revisiting the six commonly used benchmarks in STR and observe a trend of performance saturation, whereby only 2.91% of the benchmark images cannot be accurately recognized by an ensemble of 13 representative models. While these results are impressive and suggest that STR could be considered solved, we argue that this is primarily due to the less challenging nature of the common benchmarks, thus concealing the underlying issues that STR faces. To this end, we consolidate a large-scale real STR dataset, namely Union14M, which comprises 4 million labeled images and 10 million unlabeled images, to assess the performance of STR models in more complex real-world scenarios. Our experiments demonstrate that the 13 models can only achieve an average accuracy of 66.53% on the 4 million labeled images, indicating that STR still faces numerous challenges in the real world. By analyzing the error patterns of the 13 models, we identify seven open challenges in STR and develop a challenge-driven benchmark consisting of eight distinct subsets to facilitate further progress in the field. Our exploration demonstrates that STR is far from being solved and that leveraging data may be a promising solution. In this regard, we find that utilizing the 10 million unlabeled images through self-supervised pre-training can significantly improve the robustness of STR models in real-world scenarios and lead to state-of-the-art performance. Code and dataset are available at https://github.com/Mountchicken/Union14M. + + + + DPM-OT: A New Diffusion Probabilistic Model Based on Optimal Transport + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_DPM-OT_A_New_Diffusion_Probabilistic_Model_Based_on_Optimal_Transport_ICCV_2023_paper.pdf + Sampling from diffusion probabilistic models (DPMs) can be viewed as a piecewise distribution transformation, which generally requires hundreds or thousands of steps of the inverse diffusion trajectory to get a high-quality image. Recent progress in designing fast samplers for DPMs achieves a trade-off between sampling speed and sample quality by knowledge distillation or by adjusting the variance schedule or the denoising equation. However, these samplers cannot be optimal in both aspects and often suffer from mode mixture in short steps. &#13;
To tackle this problem, we innovatively regard inverse diffusion as an optimal transport (OT) problem between latents at different stages and propose DPM-OT, a unified learning framework for fast DPMs with a direct expressway represented by the OT map, which can generate high-quality samples within around 10 function evaluations. By calculating the semi-discrete optimal transport between the data latents and the white noise, we obtain the expressway from the prior distribution to the data distribution, while significantly alleviating the problem of mode mixture. In addition, we give the error bound of the proposed method, which theoretically guarantees the stability of the algorithm. Extensive experiments validate the effectiveness and advantages of DPM-OT in terms of speed and quality (FID and mode mixture), thus representing an efficient solution for generative modeling. Source code is available at https://github.com/cognaclee/DPM-OT + + + + Inherent Redundancy in Spiking Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Yao_Inherent_Redundancy_in_Spiking_Neural_Networks_ICCV_2023_paper.pdf + Spiking Neural Networks (SNNs) are well known as a promising energy-efficient alternative to conventional artificial neural networks. Owing to the preconceived impression that SNNs fire sparsely, the analysis and optimization of inherent redundancy in SNNs have been largely overlooked, and the potential advantages of spike-based neuromorphic computing in accuracy and energy efficiency are thus compromised. In this work, we pose and focus on three key questions regarding the inherent redundancy in SNNs. We argue that the redundancy is induced by the spatio-temporal invariance of SNNs, which enhances the efficiency of parameter utilization but also invites many noise spikes. Further, we analyze the effect of spatio-temporal invariance on the spatio-temporal dynamics and spike firing of SNNs. Then, motivated by these analyses, we propose an Advance Spatial Attention (ASA) module to harness SNNs' redundancy, which can adaptively optimize their membrane potential distribution with a pair of individual spatial attention sub-modules. In this way, noise spike features are accurately regulated. Experimental results demonstrate that the proposed method can significantly reduce spike firing while achieving better performance than state-of-the-art baselines. Our code is available at https://github.com/BICLab/ASA-SNN. + + + + FastRecon: Few-shot Industrial Anomaly Detection via Fast Feature Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Fang_FastRecon_Few-shot_Industrial_Anomaly_Detection_via_Fast_Feature_Reconstruction_ICCV_2023_paper.pdf + In industrial anomaly detection, data efficiency and the ability to migrate quickly across products become the main concerns when developing detection algorithms. Existing methods tend to be data-hungry and work in a one-model-one-category way, which hinders their effectiveness in real-world industrial scenarios. In this paper, we propose a few-shot anomaly detection strategy that works in a low-data regime and can generalize across products at no cost. Given a defective query sample, we propose to utilize a few normal samples as a reference to reconstruct its normal version, where the final anomaly detection can be achieved by sample alignment. &#13;
Specifically, we introduce a novel regression with distribution regularization to obtain the optimal transformation from support to query features, which guarantees that the reconstruction result shares visual similarity with the query sample while maintaining the properties of normal samples. Experimental results show that our method significantly outperforms the previous state-of-the-art in both image- and pixel-level AUROC in 2- to 8-shot scenarios. Moreover, with only a limited number of training samples (fewer than 8), our method reaches performance competitive with vanilla AD methods that are trained with extensive normal samples. + + + + Local or Global: Selective Knowledge Assimilation for Federated Learning with Limited Labels + http://openaccess.thecvf.com//content/ICCV2023/papers/Cho_Local_or_Global_Selective_Knowledge_Assimilation_for_Federated_Learning_with_ICCV_2023_paper.pdf + Many existing FL methods assume clients with fully-labeled data, while in realistic settings, clients have limited labels due to the expensive and laborious process of labeling. Limited labeled local data of the clients often leads to their local model having poor generalization abilities to their larger unlabeled local data, such as having class-distribution mismatch with the unlabeled data. As a result, clients may instead look to benefit from the global model trained across clients to leverage their unlabeled data, but this also becomes difficult due to data heterogeneity across clients. In our work, we propose FedLabel, where clients selectively choose the local or global model to pseudo-label their unlabeled data depending on which is more of an expert on the data. We further utilize both the local and global models' knowledge via global-local consistency regularization, which minimizes the divergence between the two models' outputs when they have identical pseudo-labels for the unlabeled data. Unlike other semi-supervised FL baselines, our method does not require additional experts other than the local or global model, nor does it require additional parameters to be communicated. We also do not assume any server-labeled data or fully labeled clients. For both cross-device and cross-silo settings, we show that FedLabel outperforms other semi-supervised FL baselines by 8-24%, and even outperforms standard fully supervised FL baselines (100% labeled data) with only 5-20% of labeled data. + + + + Learning Pseudo-Relations for Cross-domain Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Learning_Pseudo-Relations_for_Cross-domain_Semantic_Segmentation_ICCV_2023_paper.pdf + Domain adaptive semantic segmentation aims to adapt a model trained on a labeled source domain to the unlabeled target domain. Self-training shows competitive potential in this field. Existing methods along this stream mainly focus on selecting reliable predictions on target data as pseudo-labels for category learning, while ignoring the useful relations between pixels for relation learning. In this paper, we propose a pseudo-relation learning framework, Relation Teacher (RTea), which can exploit pixel relations to efficiently use unreliable pixels and learn generalized representations. In this framework, we build reasonable pseudo-relations on local grids and fuse them with low-level relations in the image space, which are motivated by the reliable local relations prior and the available low-level relations prior. &#13;
Then, we design a pseudo-relation learning strategy and optimize the class probability to meet the relation consistency by finding the optimal sub-graph division. In this way, the model's certainty and consistency of prediction are enhanced on the target domain, and the cross-domain inadaptation is further eliminated. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method. + + + + Human-centric Scene Understanding for 3D Large-scale Scenarios + http://openaccess.thecvf.com//content/ICCV2023/papers/Xu_Human-centric_Scene_Understanding_for_3D_Large-scale_Scenarios_ICCV_2023_paper.pdf + Human-centric scene understanding is significant for real-world applications, but it is extremely challenging due to the existence of diverse human poses and actions, complex human-environment interactions, severe occlusions in crowds, etc. In this paper, we present a large-scale multi-modal dataset for human-centric scene understanding, dubbed HuCenLife, which is collected in diverse daily-life scenarios with rich and fine-grained annotations. Our HuCenLife can benefit many 3D perception tasks, such as segmentation, detection, action recognition, etc., and we also provide benchmarks for these tasks to facilitate related research. In addition, we design novel modules for LiDAR-based segmentation and action recognition, which are more applicable for large-scale human-centric scenarios and achieve state-of-the-art performance. The dataset and code can be found at https://github.com/4DVLab/HuCenLife.git. + + + + SimMatchV2: Semi-Supervised Learning with Graph Consistency + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_SimMatchV2_Semi-Supervised_Learning_with_Graph_Consistency_ICCV_2023_paper.pdf + Semi-Supervised image classification is one of the most fundamental problem in computer vision, which significantly reduces the need for human labor. In this paper, we introduce a new semi-supervised learning algorithm - SimMatchV2, which formulates various consistency regularizations between labeled and unlabeled data from the graph perspective. In SimMatchV2, we regard the augmented view of a sample as a node, which consists of a label and its corresponding representation. Different nodes are connected with the edges, which are measured by the similarity of the node representations. Inspired by the message passing and node classification in graph theory, we propose four types of consistencies, namely 1) node-node consistency, 2) node-edge consistency, 3) edge-edge consistency, and 4) edge-node consistency. We also uncover that a simple feature normalization can reduce the gaps of the feature norm between different augmented views, significantly improving the performance of SimMatchV2. Our SimMatchV2 has been validated on multiple semi-supervised learning benchmarks. Notably, with ResNet-50 as our backbone and 300 epochs of training, SimMatchV2 achieves 71.9% and 76.2% Top-1 Accuracy with 1% and 10% labeled examples on ImageNet, which significantly outperforms the previous methods and achieves state-of-the-art performance. + + + + Reinforced Disentanglement for Face Swapping without Skip Connection + http://openaccess.thecvf.com//content/ICCV2023/papers/Ren_Reinforced_Disentanglement_for_Face_Swapping_without_Skip_Connection_ICCV_2023_paper.pdf + The SOTA face swap models still suffer the problem of either target identity (i.e., shape) being leaked or the target non-identity attributes (i.e., background, hair) failing to be fully preserved in the final results. 
We show that this insufficient disentanglement is caused by two flawed designs that were commonly adopted in prior models: (1) counting on only one compressed encoder to represent both the semantic-level non-identity facial attributes (i.e., pose) and the pixel-level non-facial region details, two demands that are contradictory to satisfy at the same time; (2) relying heavily on long skip-connections between the encoder and the final generator, which leaks a certain amount of target face identity into the result. To fix them, we introduce a new face swap framework called "WSC-swap" that gets rid of skip connections and uses two target encoders to respectively capture the pixel-level non-facial region attributes and the semantic non-identity attributes in the face region. To further reinforce the disentanglement learning for the target encoder, we employ both an identity removal loss via adversarial training (i.e., GAN) and a non-identity preservation loss via prior 3DMM models. Extensive experiments on both FaceForensics++ and CelebA-HQ show that our results significantly outperform previous works on a rich set of metrics, including one novel metric for measuring identity consistency that was completely neglected before. + + + + Privacy-Preserving Face Recognition Using Random Frequency Components + http://openaccess.thecvf.com//content/ICCV2023/papers/Mi_Privacy-Preserving_Face_Recognition_Using_Random_Frequency_Components_ICCV_2023_paper.pdf + The ubiquitous use of face recognition has sparked increasing privacy concerns, as unauthorized access to sensitive face images could compromise the information of individuals. This paper presents an in-depth study of protecting face images' visual information and protecting them against recovery. Drawing on the perceptual disparity between humans and models, we propose to conceal visual information by pruning human-perceivable low-frequency components. For impeding recovery, we first elucidate the seeming paradox between reducing model-exploitable information and retaining high recognition accuracy. Based on recent theoretical insights and our observation on model attention, we propose a solution to the dilemma, by advocating for the training and inference of recognition models on randomly selected frequency components. We distill our findings into a novel privacy-preserving face recognition method, PartialFace. Extensive experiments demonstrate that PartialFace effectively balances privacy protection goals and recognition accuracy. Code is available at: https://github.com/Tencent/TFace. + + + + Vision Transformer Adapters for Generalizable Multitask Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Bhattacharjee_Vision_Transformer_Adapters_for_Generalizable_Multitask_Learning_ICCV_2023_paper.pdf + We introduce the first multitasking vision transformer adapters that learn generalizable task affinities which can be applied to novel tasks and domains. Integrated into an off-the-shelf vision transformer backbone, our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner, unlike existing multitasking transformers that are parametrically expensive. In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added. We introduce a task-adapted attention mechanism within our adapter framework that combines gradient-based task similarities with attention-based ones. &#13;
The learned task affinities generalize to the following settings: zero-shot task transfer, unsupervised domain adaptation, and generalization without fine-tuning to novel domains. We demonstrate that our approach outperforms not only the existing convolutional neural network-based multitasking methods but also the vision transformer-based ones. Our project page is at https://ivrl.github.io/VTAGML. + + + + CVRecon: Rethinking 3D Geometric Feature Learning For Neural Reconstruction + http://openaccess.thecvf.com//content/ICCV2023/papers/Feng_CVRecon_Rethinking_3D_Geometric_Feature_Learning_For_Neural_Reconstruction_ICCV_2023_paper.pdf + Recent advances in neural reconstruction using posed image sequences have made remarkable progress. However, due to the lack of depth information, existing volumetric-based techniques simply duplicate 2D image features of the object surface along the entire camera ray. We contend this duplication introduces noise in empty and occluded spaces, posing challenges for producing high-quality 3D geometry. Drawing inspiration from traditional multi-view stereo methods, we propose an end-to-end 3D neural reconstruction framework CVRecon, designed to exploit the rich geometric embedding in the cost volumes to facilitate 3D geometric feature learning. Furthermore, we present Ray-contextual Compensated Cost Volume (RCCV), a novel 3D geometric feature representation that encodes view-dependent information with improved integrity and robustness. Through comprehensive experiments, we demonstrate that our approach significantly improves the reconstruction quality in various metrics and recovers clear fine details of the 3D geometries. Our extensive ablation studies provide insights into the development of effective 3D geometric feature learning schemes. Project page: https://cvrecon.ziyue.cool + + + + ClothesNet: An Information-Rich 3D Garment Model Repository with Simulated Clothes Environment + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhou_ClothesNet_An_Information-Rich_3D_Garment_Model_Repository_with_Simulated_Clothes_ICCV_2023_paper.pdf + We present ClothesNet: a large-scale dataset of 3D clothes objects with information-rich annotations. Our dataset consists of around 4000 models covering 11 categories annotated with clothes features, boundary lines, and keypoints. ClothesNet can be used to facilitate a variety of computer vision and robot interaction tasks. Using our dataset, we establish benchmark tasks for clothes perception, including classification, boundary line segmentation, and keypoint detection, and develop simulated clothes environments for robotic interaction tasks, including rearranging, folding, hanging, and dressing. We also demonstrate the efficacy of our ClothesNet in real-world experiments. + + + + StyleLipSync: Style-based Personalized Lip-sync Video Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ki_StyleLipSync_Style-based_Personalized_Lip-sync_Video_Generation_ICCV_2023_paper.pdf + In this paper, we present StyleLipSync, a style-based personalized lip-sync video generative model that can generate identity-agnostic lip-synchronizing video from arbitrary audio. To generate a video of arbitrary identities, we leverage expressive lip prior from the semantically rich latent space of a pre-trained StyleGAN, where we can also design a video consistency with a linear transformation. 
In contrast to the previous lip-sync methods, we introduce pose-aware masking that dynamically locates the mask to improve the naturalness over frames by utilizing a 3D parametric mesh predictor frame by frame. Moreover, we propose a few-shot lip-sync adaptation method for an arbitrary person by introducing a sync regularizer that preserves lip-sync generalization while enhancing the person-specific visual information. Extensive experiments demonstrate that our model can generate accurate lip-sync videos even with the zero-shot setting and enhance characteristics of an unseen face using a few seconds of target video through the proposed adaptation method. + + + + Efficient 3D Semantic Segmentation with Superpoint Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Robert_Efficient_3D_Semantic_Segmentation_with_Superpoint_Transformer_ICCV_2023_paper.pdf + We introduce a novel superpoint-based transformer architecture for efficient semantic segmentation of large-scale 3D scenes. Our method incorporates a fast algorithm to partition point clouds into a hierarchical superpoint structure, which makes our preprocessing 7 times faster than existing superpoint-based approaches. Additionally, we leverage a self-attention mechanism to capture the relationships between superpoints at multiple scales, leading to state-of-the-art performance on three challenging benchmark datasets: S3DIS (76.0% mIoU 6-fold validation), KITTI-360 (63.5% on Val), and DALES (79.6%). With only 212k parameters, our approach is up to 200 times more compact than other state-of-the-art models while maintaining similar performance. Furthermore, our model can be trained on a single GPU in 3 hours for a fold of the S3DIS dataset, which is 7x to 70x fewer GPU-hours than the best-performing methods. Our code and models are accessible at github.com/drprojects/superpoint_transformer. + + + + Minimum Latency Deep Online Video Stabilization + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Minimum_Latency_Deep_Online_Video_Stabilization_ICCV_2023_paper.pdf + We present a novel camera path optimization framework for the task of online video stabilization. Typically, a stabilization pipeline consists of three steps: motion estimating, path smoothing, and novel view rendering. Most previous methods concentrate on motion estimation, proposing various global or local motion models. In contrast, path optimization receives relatively less attention, especially in the important online setting, where no future frames are available. In this work, we adopt recent off-the-shelf high-quality deep motion models for motion estimation to recover the camera trajectory and focus on the latter two steps. Our network takes a short 2D camera path in a sliding window as input and outputs the stabilizing warp field of the last frame in the window, which warps the coming frame to its stabilized position. A hybrid loss is well-defined to constrain the spatial and temporal consistency. In addition, we build a motion dataset that contains stable and unstable motion pairs for the training. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art online methods both qualitatively and quantitatively and achieves comparable performance to offline methods. 
+ + + + Speech2Lip: High-fidelity Speech to Lip Generation by Learning from a Short Video + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Speech2Lip_High-fidelity_Speech_to_Lip_Generation_by_Learning_from_a_ICCV_2023_paper.pdf + Synthesizing realistic videos according to a given speech is still an open challenge. Previous works have been plagued by issues such as inaccurate lip shape generation and poor image quality. The key reason is that only motions and appearances on limited facial areas (e.g., the lip area) are mainly driven by the input speech. Therefore, directly learning a mapping function from speech to the entire head image is prone to ambiguity, particularly when using a short video for training. We thus propose a decomposition-synthesis-composition framework named Speech to Lip (Speech2Lip) that disentangles speech-sensitive and speech-insensitive motion/appearance to facilitate effective learning from limited training data, resulting in the generation of natural-looking videos. First, given a fixed head pose (i.e., canonical space), we present a speech-driven implicit model for lip image generation which concentrates on learning speech-sensitive motion and appearance. Next, to model the major speech-insensitive motion (i.e., head movement), we introduce a geometry-aware mutual explicit mapping (GAMEM) module that establishes geometric mappings between different head poses. This allows us to paste generated lip images at the canonical space onto head images with arbitrary poses and synthesize talking videos with natural head movements. In addition, a Blend-Net and a contrastive sync loss are introduced to enhance the overall synthesis performance. Quantitative and qualitative results on three benchmarks demonstrate that our model can be trained on a video of just a few minutes in length and achieve state-of-the-art performance in both visual quality and speech-visual synchronization. Code: https://github.com/CVMI-Lab/Speech2Lip. + + + + UHDNeRF: Ultra-High-Definition Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_UHDNeRF_Ultra-High-Definition_Neural_Radiance_Fields_ICCV_2023_paper.pdf + We propose UHDNeRF, a new framework for novel view synthesis on challenging ultra-high-resolution (e.g., 4K) real-world scenes. Previous NeRF methods are not specifically designed for rendering at extremely high resolutions, leading to blurry results with notable loss of detail even when trained on 4K images. This is mainly due to the mismatch between the high-resolution inputs and the low-dimensional volumetric representation. To address this issue, we introduce an adaptive implicit-explicit scene representation in which an explicit sparse point cloud is used to boost the performance of an implicit volume on modeling subtle details. Specifically, we reconstruct the complex real-world scene with a frequency separation strategy in which the implicit volume learns to represent the low-frequency properties of the whole scene, and the sparse point cloud is used for reproducing high-frequency details. To better explore the information embedded in the point cloud, we extract a global structure feature and a local point-wise feature from the point cloud for each sample located in the high-frequency regions. Furthermore, a patch-based sampling strategy is introduced to reduce the computational cost. &#13;
The high-fidelity rendering results demonstrate the superiority of our method for retaining high-frequency details at 4K ultra-high-resolution scenarios against state-of-the-art NeRF-based solutions. + + + + Why do networks have inhibitory/negative connections? + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Why_do_networks_have_inhibitorynegative_connections_ICCV_2023_paper.pdf + Why do brains have inhibitory connections? Why do deep networks have negative weights? We propose an answer from the perspective of representation capacity. We believe representing functions is the primary role of both (i) the brain in natural intelligence, and (ii) deep networks in artificial intelligence. Our answer to why there are inhibitory/negative weights is: to learn more functions. We prove that, in the absence of negative weights, neural networks with non-decreasing activation functions are not universal approximators. While this may be an intuitive result to some, to the best of our knowledge, there is no formal theory, in either machine learning or neuroscience, that demonstrates why negative weights are crucial in the context of representation capacity. Further, we provide insights on the geometric properties of the representation space that non-negative deep networks cannot represent. We expect these insights will yield a deeper understanding of more sophisticated inductive priors imposed on the distribution of weights that lead to more efficient biological and machine learning. + + + + Ordinal Label Distribution Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Wen_Ordinal_Label_Distribution_Learning_ICCV_2023_paper.pdf + Label distribution learning (LDL) is a recent hot topic, in which ambiguity is modeled via description degrees of the labels. However, in common LDL tasks, e.g., age estimation, labels are in an intrinsic order. The conventional LDL paradigm adopts a per-label manner for optimization, neglecting the internal sequential patterns of labels. Therefore, we propose a new paradigm, termed ordinal label distribution learning (OLDL). We model the sequential patterns of labels from aspects of spatial, semantic, and temporal order relationships. The spatial order depicts the relative position between arbitrary labels. We build cross-label transformation between distributions, which is determined by the spatial margin in labels. Labels naturally yield different semantics, so the semantic order is represented by constructing semantic correlations between arbitrary labels. The temporal order describes that the presence of labels is determined by their order, i.e. five after four. The value of a particular label contains information about previous labels, and we adopt cumulative distribution to construct this relationship. Based on these characteristics of ordinal labels, we propose the learning objectives and evaluation metrics for OLDL, namely CAD, QFD, and CJS. Comprehensive experiments conducted on four tasks demonstrate the superiority of OLDL against other existing LDL methods in both traditional and newly proposed metrics. Our project page can be found at https://downdric23.github.io/. 
+ + + + Boosting Multi-modal Model Performance with Adaptive Gradient Modulation + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Boosting_Multi-modal_Model_Performance_with_Adaptive_Gradient_Modulation_ICCV_2023_paper.pdf + While the field of multi-modal learning keeps growing fast, the deficiency of the standard joint training paradigm has become clear through recent studies. They attribute the sub-optimal performance of the jointly trained model to the modality competition phenomenon. Existing works attempt to improve the jointly trained model by modulating the training process. Despite their effectiveness, those methods can only apply to late fusion models. More importantly, the mechanism of the modality competition remains unexplored. In this paper, we first propose an adaptive gradient modulation method that can boost the performance of multi-modal models with various fusion strategies. Extensive experiments show that our method surpasses all existing modulation methods. Furthermore, to have a quantitative understanding of the modality competition and the mechanism behind the effectiveness of our modulation method, we introduce a novel metric to measure the competition strength. This metric is built on the mono-modal concept, a function that is designed to represent the competition-less state of a modality. Through systematic investigation, our results confirm the intuition that the modulation encourages the model to rely on the more informative modality. In addition, we find that the jointly trained model typically has a preferred modality on which the competition is weaker than other modalities. However, this preferred modality need not dominate others. Our code will be available at https://github.com/lihong2303/AGM_ICCV2023. + + + + PODA: Prompt-driven Zero-shot Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Fahes_PODA_Prompt-driven_Zero-shot_Domain_Adaptation_ICCV_2023_paper.pdf + Domain adaptation has been vastly investigated in computer vision but still requires access to target images at train time, which might be intractable in some uncommon conditions. In this paper, we propose the task of 'Prompt-driven Zero-shot Domain Adaptation', where we adapt a model trained on a source domain using only a general description in natural language of the target domain, i.e., a prompt. First, we leverage a pretrained contrastive vision-language model (CLIP) to optimize affine transformations of source features, steering them towards the target text embedding while preserving their content and semantics. To achieve this, we propose Prompt-driven Instance Normalization (PIN). Second, we show that these prompt-driven augmentations can be used to perform zero-shot domain adaptation for semantic segmentation. Experiments demonstrate that our method significantly outperforms CLIP-based style transfer baselines on several datasets for the downstream task at hand, even surpassing one-shot unsupervised domain adaptation. A similar boost is observed on object detection and image classification. The code is available at https://github.com/astra-vision/PODA . 
+ + + + SAFL-Net: Semantic-Agnostic Feature Learning Network with Auxiliary Plugins for Image Manipulation Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Sun_SAFL-Net_Semantic-Agnostic_Feature_Learning_Network_with_Auxiliary_Plugins_for_Image_ICCV_2023_paper.pdf + Since image editing methods in real world scenarios cannot be exhausted, generalization is a core challenge for image manipulation detection, which could be severely weakened by semantically related features. In this paper we propose SAFL-Net, which constrains a feature extractor to learn semantic-agnostic features by designing specific modules with corresponding auxiliary tasks. Applying constraints directly to the features extracted by the encoder helps it learn semantic-agnostic manipulation trace features, which prevents the biases related to semantic information within the limited training data and improves generalization capabilities. The consistency of auxiliary boundary prediction task and original region prediction task is guaranteed by a feature transformation structure. Experiments on various public datasets and comparisons in multiple dimensions demonstrate that SAFL-Net is effective for image manipulation detection. + + + + DataDAM: Efficient Dataset Distillation with Attention Matching + http://openaccess.thecvf.com//content/ICCV2023/papers/Sajedi_DataDAM_Efficient_Dataset_Distillation_with_Attention_Matching_ICCV_2023_paper.pdf + Researchers have long tried to minimize training costs in deep learning while maintaining strong generalization across diverse datasets. Emerging research on dataset distillation aims to reduce training costs by creating a small synthetic set that contains the information of a larger real dataset and ultimately achieves test accuracy equivalent to a model trained on the whole dataset. Unfortunately, the synthetic data generated by previous methods are not guaranteed to distribute and discriminate as well as the original training data, and they incur significant computational costs. Despite promising results, there still exists a significant performance gap between models trained on condensed synthetic sets and those trained on the whole dataset. In this paper, we address these challenges using efficient Dataset Distillation with Attention Matching (DataDAM), achieving state-of-the-art performance while reducing training costs. Specifically, we learn synthetic images by matching the spatial attention maps of real and synthetic data generated by different layers within a family of randomly initialized neural networks. Our method outperforms the prior methods on several datasets, including CIFAR10/100, TinyImageNet, ImageNet-1K, and subsets of ImageNet-1K across most of the settings, and achieves improvements of up to 6.5% and 4.1% on CIFAR100 and ImageNet-1K, respectively. We also show that our high-quality distilled images have practical benefits for downstream applications, such as continual learning and neural architecture search. + + + + PlanarTrack: A Large-scale Challenging Benchmark for Planar Object Tracking + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_PlanarTrack_A_Large-scale_Challenging_Benchmark_for_Planar_Object_Tracking_ICCV_2023_paper.pdf + Planar object tracking is a critical computer vision problem and has drawn increasing interest owing to its key roles in robotics, augmented reality, etc. 
Despite rapid progress, its further development, especially in the deep learning era, is largely hindered due to the lack of large-scale challenging benchmarks. Addressing this, we introduce PlanarTrack, a large-scale challenging planar tracking benchmark. Specifically, PlanarTrack consists of 1,000 videos with more than 490K images. All these videos are collected in complex unconstrained scenarios from the wild, which makes PlanarTrack, compared with existing benchmarks, more challenging but realistic for real-world applications. To ensure the high-quality annotation, each frame in PlanarTrack is manually labeled using four corners with multiple-round careful inspection and refinement. To our best knowledge, PlanarTrack, to date, is the largest and most challenging dataset dedicated to planar object tracking. In order to analyze the proposed PlanarTrack, we evaluate 10 planar trackers and conduct comprehensive comparisons and in-depth analysis. Our results, not surprisingly, demonstrate that current top-performing planar trackers degenerate significantly on the challenging PlanarTrack and more efforts are needed to improve planar tracking in the future. In addition, we further derive a variant named PlanarTrack_BB for generic object tracking from PlanarTrack. Our evaluation of 10 excellent generic trackers on PlanarTrack_BB manifests that, surprisingly, PlanarTrack_BB is even more challenging than several popular generic tracking benchmarks and more attention should be paid to handle such planar objects, though they are rigid. All benchmarks and evaluations will be released. + + + + Structural Alignment for Network Pruning through Partial Regularization + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_Structural_Alignment_for_Network_Pruning_through_Partial_Regularization_ICCV_2023_paper.pdf + In this paper, we propose a novel channel pruning method to reduce the computational and storage costs of Convolutional Neural Networks (CNNs). Many existing one-shot pruning methods directly remove redundant structures, which brings a huge gap between the model before and after network pruning. This gap will no doubt result in performance loss for network pruning. To mitigate this gap, we first learn a target sub-network during the model training process, and then we use this sub-network to guide the learning of model weights through partial regularization. The target sub-network is learned and produced by using an architecture generator, and it can be optimized efficiently. In addition, we also derive the proximal gradient for our proposed partial regularization to facilitate the structural alignment process. With these designs, the gap between the pruned model and the sub-network is reduced, thus improving the pruning performance. Empirical results also suggest that the sub-network found by our method has a much higher performance than the one-shot pruning setting. Extensive experiments show that our method can achieve state-of-the-art performances on CIFAR-10 and ImageNet with ResNets and MobileNet-V2. + + + + Learning Long-Range Information with Dual-Scale Transformers for Indoor Scene Completion + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Learning_Long-Range_Information_with_Dual-Scale_Transformers_for_Indoor_Scene_Completion_ICCV_2023_paper.pdf + Due to the limited resolution of 3D sensors and the inevitable mutual occlusion between objects, 3D scans of real scenes are commonly incomplete. 
Previous scene completion methods struggle to capture long-range spatial feature, resulting in unsatisfactory completion results. To alleviate the problem, we propose a novel Dual-Scale Transformer Network (DST-Net) that efficiently utilizes both long-range and short-range spatial context information to improve the quality of 3D scene completion. To reduce the heavy computation cost of extracting long-range features via transformers, DST-Net adopts a self-supervised two-stage completion strategy. In the first stage, we split the input scene into blocks, and perform completion on individual blocks. In the second stage, the blocks are merged together as a whole and then further refined to improve completeness. More importantly, we propose a contrastive attention training strategy to encourage the transformers to learn distinguishable features for better scene completion. Experiments on datasets of Matterport3D, ScanNet, and ICL-NUIM demonstrate that our method can generate better completion results, and our method outperforms the state-of-the-art methods quantitatively and qualitatively. + + + + Discriminative Class Tokens for Text-to-Image Diffusion Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Schwartz_Discriminative_Class_Tokens_for_Text-to-Image_Diffusion_Models_ICCV_2023_paper.pdf + Recent advances in text-to-image diffusion models have enabled the generation of diverse and high-quality images. While impressive, the images often fall short of depicting subtle details and are susceptible to errors due to ambiguity in the input text. One way of alleviating these issues is to train diffusion models on class-labeled datasets. This approach has two disadvantages: (i) supervised datasets are generally small compared to large-scale scraped text-image datasets on which text-to-image models are trained, affecting the quality and diversity of the generated images, or (ii) the input is a hard-coded label, as opposed to free-form text, limiting the control over the generated images. In this work, we propose a non-invasive fine-tuning technique that capitalizes on the expressive potential of free-form text while achieving high accuracy through discriminative signals from a pretrained classifier. This is done by iteratively modifying the embedding of an added input token of a text-to-image diffusion model, by steering generated images toward a given target class according to a classifier. Our method is fast compared to prior fine-tuning methods and does not require a collection of in-class images or retraining of a noise-tolerant classifier. We evaluate our method extensively, showing that the generated images are: (i) more accurate and of higher quality than standard diffusion models, (ii) can be used to augment training data in a low-resource setting, and (iii) reveal information about the data used to train the guiding classifier. The code is available at https://github.com/idansc/discriminative_class_tokens. + + + + ORC: Network Group-based Knowledge Distillation using Online Role Change + http://openaccess.thecvf.com//content/ICCV2023/papers/Choi_ORC_Network_Group-based_Knowledge_Distillation_using_Online_Role_Change_ICCV_2023_paper.pdf + In knowledge distillation, since a single, omnipotent teacher network cannot solve all problems, multiple teacher-based knowledge distillations have been studied recently. However, sometimes their improvements are not as good as expected because some immature teachers may transfer the false knowledge to the student. 
In this paper, to overcome this limitation and take the efficacy of the multiple networks, we divide the multiple networks into teacher and student groups, respectively. That is, the student group is a set of immature networks that require learning the teacher's knowledge, while the teacher group consists of the selected networks that are capable of teaching successfully. We propose our online role change strategy where the top-ranked networks in the student group are able to promote to the teacher group at every iteration. After training the teacher group using the error samples of the student group to refine the teacher group's knowledge, we transfer the collaborative knowledge from the teacher group to the student group successfully. We verify the superiority of the proposed method on CIFAR-10, CIFAR-100, and ImageNet which achieves high performance. We further show the generality of our method with various backbone architectures such as ResNet, WRN, VGG, Mobilenet, and Shufflenet. + + + + Audiovisual Masked Autoencoders + http://openaccess.thecvf.com//content/ICCV2023/papers/Georgescu_Audiovisual_Masked_Autoencoders_ICCV_2023_paper.pdf + Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset. + + + + DomainDrop: Suppressing Domain-Sensitive Channels for Domain Generalization + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_DomainDrop_Suppressing_Domain-Sensitive_Channels_for_Domain_Generalization_ICCV_2023_paper.pdf + Deep Neural Networks have exhibited considerable success in various visual tasks. However, when applied to unseen test datasets, state-of-the-art models often suffer performance degradation due to domain shifts. In this paper, we introduce a novel approach for domain generalization from a novel perspective of enhancing the robustness of channels in feature maps to domain shifts. We observe that models trained on source domains contain a substantial number of channels that exhibit unstable activations across different domains, which are inclined to capture domain-specific features and behave abnormally when exposed to unseen target domains. To address the issue, we propose a DomainDrop framework to continuously enhance the channel robustness to domain shifts, where a domain discriminator is used to identify and drop unstable channels in feature maps of each network layer during forward propagation. We theoretically prove that our framework could effectively lower the generalization bound. Extensive experiments on several benchmarks indicate that our framework achieves state-of-the-art performance compared to other competing methods. Our code is available at https://github.com/lingeringlight/DomainDrop. 
+ + + + StyleInV: A Temporal Style Modulated Inversion Network for Unconditional Video Generation + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_StyleInV_A_Temporal_Style_Modulated_Inversion_Network_for_Unconditional_Video_ICCV_2023_paper.pdf + Unconditional video generation is a challenging task that involves synthesizing high-quality videos that are both coherent and of extended duration. To address this challenge, researchers have used pretrained StyleGAN image generators for high-quality frame synthesis and focused on motion generator design. The motion generator is trained in an autoregressive manner using heavy 3D convolutional discriminators to ensure motion coherence during video generation. In this paper, we introduce a novel motion generator design that uses a learning-based inversion network for GAN. The encoder in our method captures rich and smooth priors from encoding images to latents, and given the latent of an initially generated frame as guidance, our method can generate smooth future latent by modulating the inversion encoder temporally. Our method enjoys the advantage of sparse training and naturally constrains the generation space of our motion generator with the inversion network guided by the initial frame, eliminating the need for heavy discriminators. Moreover, our method supports style transfer with simple fine-tuning when the encoder is paired with a pretrained StyleGAN generator. Extensive experiments conducted on various benchmarks demonstrate the superiority of our method in generating long and high-resolution videos with decent single-frame quality and temporal consistency. Code is available at https://github.com/johannwyh/StyleInV. + + + + Anatomical Invariance Modeling and Semantic Alignment for Self-supervised Learning in 3D Medical Image Analysis + http://openaccess.thecvf.com//content/ICCV2023/papers/Jiang_Anatomical_Invariance_Modeling_and_Semantic_Alignment_for_Self-supervised_Learning_in_ICCV_2023_paper.pdf + Self-supervised learning (SSL) has recently achieved promising performance for 3D medical image analysis tasks. Most current methods follow existing SSL paradigm originally designed for photographic or natural images, which cannot explicitly and thoroughly exploit the intrinsic similar anatomical structures across varying medical images. This may in fact degrade the quality of learned deep representations by maximizing the similarity among features containing spatial misalignment information and different anatomical semantics. In this work, we propose a new self-supervised learning framework, namely Alice, that explicitly fulfills Anatomical invariance modeling and semantic alignment via elaborately combining discriminative and generative objectives. Alice introduces a new contrastive learning strategy which encourages the similarity between views that are diversely mined but with consistent high-level semantics, in order to learn invariant anatomical features. Moreover, we design a conditional anatomical feature alignment module to complement corrupted embeddings with globally matched semantics and inter-patch topology information, conditioned by the distribution of local image content, which permits to create better contrastive pairs. Our extensive quantitative experiments on three 3D medical image analysis tasks demonstrate and validate the performance superiority of Alice, surpassing the previous best SSL counterpart methods and showing promising ability for united representation learning. 
Codes are available at https://github.com/alibaba-damo-academy/alice. + + + + SSDA: Secure Source-Free Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ahmed_SSDA_Secure_Source-Free_Domain_Adaptation_ICCV_2023_paper.pdf + Source-free domain adaptation (SFDA) is a popular unsupervised domain adaptation method where a pre-trained model from a source domain is adapted to a target domain without accessing any source data. Despite rich results in this area, existing literature overlooks the security challenges of the unsupervised SFDA setting in the presence of a malicious source domain owner. This work investigates the effect of a source adversary which may inject a hidden malicious behavior (Backdoor/Trojan) during source training and potentially transfer it to the target domain even after benign training by the victim (target domain owner). Our investigation of the current SFDA setting reveals that because of the unique challenges present in SFDA (e.g., no source data, no target labels), defending against backdoor attacks using existing defenses becomes practically ineffective in protecting the target model. To address this, we propose a novel target domain protection scheme called secure source-free domain adaptation (SSDA). SSDA adopts a single-shot model compression of a pre-trained source model and a novel knowledge transfer scheme with a spectral-norm-based loss penalty for target training. The proposed static compression and the dynamic training loss penalty are designed to suppress the malicious channels responsive to the backdoor during the adaptation stage. At the same time, the knowledge transfer from an uncompressed auxiliary model helps to recover the benign test accuracy. Our extensive evaluation on multiple datasets and domain tasks against recent backdoor attacks reveals that the proposed SSDA can successfully defend against strong backdoor attacks with little to no degradation in test accuracy compared to the vulnerable baseline SFDA methods. Our code is available at https://github.com/ML-Security-Research-LAB/SSDA. + + + + ESTextSpotter: Towards Better Scene Text Spotting with Explicit Synergy in Transformer + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_ESTextSpotter_Towards_Better_Scene_Text_Spotting_with_Explicit_Synergy_in_ICCV_2023_paper.pdf + In recent years, end-to-end scene text spotting approaches have been evolving toward the Transformer-based framework. While previous studies have shown the crucial importance of the intrinsic synergy between text detection and recognition, recent advances in Transformer-based methods usually adopt an implicit synergy strategy with a shared query, which cannot fully realize the potential of these two interactive tasks. In this paper, we argue that explicit synergy that considers the distinct characteristics of text detection and recognition can significantly improve the performance of text spotting. To this end, we introduce a new model named Explicit Synergy-based Text Spotting Transformer framework (ESTextSpotter), which achieves explicit synergy by modeling discriminative and interactive features for text detection and recognition within a single decoder. Specifically, we decompose the conventional shared query into task-aware queries for text polygon and content, respectively.
Through the decoder with the proposed vision-language communication module, the queries interact with each other in an explicit manner while preserving discriminative patterns of text detection and recognition, thus improving performance significantly. Additionally, we propose a task-aware query initialization scheme to ensure stable training. Experimental results demonstrate that our model significantly outperforms previous state-of-the-art methods. Code is available at https://github.com/mxin262/ESTextSpotter. + + + + UGC: Unified GAN Compression for Efficient Image-to-Image Translation + http://openaccess.thecvf.com//content/ICCV2023/papers/Ren_UGC_Unified_GAN_Compression_for_Efficient_Image-to-Image_Translation_ICCV_2023_paper.pdf + Recent years have witnessed the prevailing progress of Generative Adversarial Networks (GANs) in image-to-image translation. However, the success of these GAN models hinges on ponderous computational costs and labor-expensive training data. Current efficient GAN learning techniques often fall into two orthogonal aspects: i) model slimming via reduced calculation costs; ii)data/label-efficient learning with fewer training data/labels. To combine the best of both worlds, we propose a new learning paradigm, Unified GAN Compression (UGC), with a unified optimization objective to seamlessly prompt the synergy of model-efficient and label-efficient learning. UGC sets up semi-supervised-driven network architecture search and adaptive online semi-supervised distillation stages sequentially, which formulates a heterogeneous mutual learning scheme to obtain an architecture-flexible, label-efficient, and performance-excellent model. Extensive experiments demonstrate that UGC obtains state-of-the-art lightweight models even with less than 50% labels. UGC that compresses 40X MACs can achieve 21.43 FID on edges-shoes with 25% labels, which even outperforms the original model with 100% labels by 2.75 FID. + + + + Efficient View Synthesis with Neural Radiance Distribution Field + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Efficient_View_Synthesis_with_Neural_Radiance_Distribution_Field_ICCV_2023_paper.pdf + Recent work on Neural Radiance Fields (NeRF) has demonstrated significant advances in high-quality view synthesis. A major limitation of NeRF is its low rendering efficiency due to the need for multiple network forwardings to render a single pixel. Existing methods to improve NeRF either reduce the number of required samples or optimize the implementation to accelerate the network forwarding. Despite these efforts, the problem of multiple sampling persists due to the intrinsic representation of radiance fields. In contrast, Neural Light Fields (NeLF) reduce the computation cost of NeRF by querying only one single network forwarding per pixel. To achieve a close visual quality to NeRF, existing NeLF methods require significantly larger network capacities which limits their rendering efficiency in practice. In this work, we propose a new representation called Neural Radiance Distribution Field (NeRDF) that targets efficient view synthesis in real-time. Specifically, we use a small network similar to NeRF while preserving the rendering speed with a single network forwarding per pixel as in NeLF. The key is to model the radiance distribution along each ray with frequency basis and predict frequency weights using the network. Pixel values are then computed via volume rendering on radiance distributions. 
Experiments show that our proposed method offers a better trade-off among speed, quality, and network size than existing methods: we achieve a 254x speed-up over NeRF with similar network size, with only a marginal performance decline. Our project page is at yushuang-wu.github.io/NeRDF. + + + + SparseBEV: High-Performance Sparse 3D Object Detection from Multi-Camera Videos + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_SparseBEV_High-Performance_Sparse_3D_Object_Detection_from_Multi-Camera_Videos_ICCV_2023_paper.pdf + Camera-based 3D object detection in BEV (Bird's Eye View) space has drawn great attention over the past few years. Dense detectors typically follow a two-stage pipeline by first constructing a dense BEV feature and then performing object detection in BEV space, which suffers from complex view transformations and high computation cost. On the other side, sparse detectors follow a query-based paradigm without explicit dense BEV feature construction, but achieve worse performance than the dense counterparts. In this paper, we find that the key to mitigate this performance gap is the adaptability of the detector in both BEV and image space. To achieve this goal, we propose SparseBEV, a fully sparse 3D object detector that outperforms the dense counterparts. SparseBEV contains three key designs, which are (1) scale-adaptive self attention to aggregate features with adaptive receptive field in BEV space, (2) adaptive spatio-temporal sampling to generate sampling locations under the guidance of queries, and (3) adaptive mixing to decode the sampled features with dynamic weights from the queries. On the test split of nuScenes, SparseBEV achieves the state-of-the-art performance of 67.5 NDS. On the val split, SparseBEV achieves 55.8 NDS while maintaining a real-time inference speed of 23.5 FPS. Code is available at https://github.com/MCG-NJU/SparseBEV. + + + + Boosting Whole Slide Image Classification from the Perspectives of Distribution, Correlation and Magnification + http://openaccess.thecvf.com//content/ICCV2023/papers/Qu_Boosting_Whole_Slide_Image_Classification_from_the_Perspectives_of_Distribution_ICCV_2023_paper.pdf + Bag-based multiple instance learning (MIL) methods have become the mainstream for Whole Slide Image (WSI) classification. However, there are still three important issues that have not been fully addressed: (1) positive bags with a low positive instance ratio are prone to the influence of a large number of negative instances; (2) the correlation between local and global features of pathology images has not been fully modeled; and (3) there is a lack of effective information interaction between different magnifications. In this paper, we propose MILBooster, a powerful dual-scale multi-stage MIL framework to address these issues from the perspectives of distribution, correlation, and magnification. Specifically, to address issue (1), we propose a plug-and-play bag filter that effectively increases the positive instance ratio of positive bags. For issue (2), we propose a novel window-based Transformer architecture called PiceBlock to model the correlation between local and global features of pathology images. For issue (3), we propose a dual-branch architecture to process different magnifications and design an information interaction module called Scale Mixer for efficient information interaction between them. We conducted extensive experiments on four clinical WSI classification tasks using three datasets. 
MILBooster achieved new state-of-the-art performance on all these tasks. Codes will be available. + + + + Multimodal High-order Relation Transformer for Scene Boundary Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_Multimodal_High-order_Relation_Transformer_for_Scene_Boundary_Detection_ICCV_2023_paper.pdf + Scene boundary detection breaks down long videos into meaningful story-telling units and plays a crucial role in high-level video understanding. Despite significant advancements in this area, this task remains a challenging problem as it requires a comprehensive understanding of multimodal cues and high-level semantics. To tackle this issue, we propose a multimodal high-order relation transformer, which integrates a high-order encoder and an adaptive decoder in a unified framework. By modeling the multimodal cues and exploring similarities between the shots, the encoder is capable of capturing high-order relations between shots and extracting shot features with context semantics. By clustering the shots adaptively, the decoder can discover more universal switch patterns between successive scenes, thus helping scene boundary detection. Extensive experimental results on three standard benchmarks demonstrate that the proposed model performs favorably against state-of-the-art video scene detection methods. + + + + Tri-MipRF: Tri-Mip Representation for Efficient Anti-Aliasing Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Tri-MipRF_Tri-Mip_Representation_for_Efficient_Anti-Aliasing_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Despite the tremendous progress in neural radiance fields (NeRF), we still face a dilemma of the trade-off between quality and efficiency, e.g., MipNeRF presents fine-detailed and anti-aliased renderings but takes days for training, while Instant-ngp can accomplish the reconstruction in a few minutes but suffers from blurring or aliasing when rendering at various distances or resolutions due to ignoring the sampling area. To this end, we propose a novel Tri-Mip encoding (a la "mipmap") that enables both instant reconstruction and anti-aliased high-fidelity rendering for neural radiance fields. The key is to factorize the pre-filtered 3D feature spaces into three orthogonal mipmaps. In this way, we can efficiently perform 3D area sampling by taking advantage of 2D pre-filtered feature maps, which significantly elevates the rendering quality without sacrificing efficiency. To cope with the novel Tri-Mip representation, we propose a cone-casting rendering technique to efficiently sample anti-aliased 3D features with the Tri-Mip encoding considering both pixel imaging and observing distance. Extensive experiments on both synthetic and real-world datasets demonstrate our method achieves state-of-the-art rendering quality and reconstruction speed while maintaining a compact representation that reduces model size by 25% compared with Instant-ngp. Code is available at the project webpage: https://wbhu.github.io/projects/Tri-MipRF + + + + LaRS: A Diverse Panoptic Maritime Obstacle Detection Dataset and Benchmark + http://openaccess.thecvf.com//content/ICCV2023/papers/Zust_LaRS_A_Diverse_Panoptic_Maritime_Obstacle_Detection_Dataset_and_Benchmark_ICCV_2023_paper.pdf + The progress in maritime obstacle detection is hindered by the lack of a diverse dataset that adequately captures the complexity of general maritime environments.
We present the first maritime panoptic obstacle detection benchmark LaRS, featuring scenes from Lakes, Rivers and Seas. Our major contribution is the new dataset, which boasts the largest diversity in recording locations, scene types, obstacle classes, and acquisition conditions among the related datasets. LaRS is composed of over 4000 per-pixel labeled key frames with nine preceding frames to allow utilization of the temporal texture, amounting to over 40k frames. Each key frame is annotated with 8 thing, 3 stuff classes and 19 global scene attributes. We report the results of 27 semantic and panoptic segmentation methods, along with several performance insights and future research directions. To enable objective evaluation, we have implemented an online evaluation server. The LaRS dataset, evaluation toolkit and benchmark are publicly available at: https://lojzezust.github.io/lars-dataset + + + + Self-Evolved Dynamic Expansion Model for Task-Free Continual Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Self-Evolved_Dynamic_Expansion_Model_for_Task-Free_Continual_Learning_ICCV_2023_paper.pdf + Task-Free Continual Learning (TFCL) aims to learn new concepts from a stream of data without any task information. The Dynamic Expansion Model (DEM) has shown promising results in TFCL by dynamically expanding the model's capacity to deal with shifts in the data distribution. However, existing approaches only consider the recognition of the input shift as the expansion signal and ignore the correlation between the newly incoming data and previously learned knowledge, resulting in adding and training unnecessary parameters. In this paper, we propose a novel and effective framework for TFCL, which dynamically expands the architecture of a DEM model through a self-assessment mechanism evaluating the diversity of knowledge among existing experts as expansion signals. This mechanism ensures learning additional underlying data distributions with a compact model structure. A novelty-aware sample selection approach is proposed to manage the memory buffer that forces the newly added expert to learn novel information from a data stream, which further promotes the diversity among experts. Moreover, we also propose to reuse previously learned representation information for learning new incoming data by using knowledge transfer in TFCL, which has not been explored before. The DEM expansion and training are regularized through a gradient updating mechanism to gradually explore the positive forward transfer, further improving the performance. Empirical results on TFCL benchmarks show that the proposed framework outperforms the state-of-the-art while using a reasonable number of parameters. The code is available at https://github.com/dtuzi123/SEDEM/. + + + + Adaptive Template Transformer for Mitochondria Segmentation in Electron Microscopy Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Pan_Adaptive_Template_Transformer_for_Mitochondria_Segmentation_in_Electron_Microscopy_Images_ICCV_2023_paper.pdf + Mitochondria, as tiny structures within the cell, are of significant importance to study cell functions for biological and clinical analysis. And exploring how to automatically segment mitochondria in electron microscopy (EM) images has attracted increasing attention. However, most of existing methods struggle to adapt to different scales and appearances of the input due to the inherent limitations of the traditional CNN architecture. 
To mitigate these limitations, we propose a novel adaptive template transformer (ATFormer) for mitochondria segmentation. The proposed ATFormer model enjoys several merits. First, the designed structural template learning module can acquire appearance-adaptive templates of background, foreground and contour to sense the characteristics of different shapes of mitochondria. And we further adopt an optimal transport algorithm to enlarge the discrepancy among diverse templates to fully activate corresponding regions. Second, we introduce a hierarchical attention learning mechanism to absorb multi-level information for templates to be adaptive scale-aware classifiers for dense prediction. Extensive experimental results on three challenging benchmarks including MitoEM, Lucchi and NucMM-Z datasets demonstrate that our ATFormer performs favorably against state-of-the-art mitochondria segmentation methods. + + + + Tangent Model Composition for Ensembling and Continual Fine-tuning + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Tangent_Model_Composition_for_Ensembling_and_Continual_Fine-tuning_ICCV_2023_paper.pdf + Tangent Model Composition (TMC) is a method to combine component models independently fine-tuned around a pre-trained point. Component models are tangent vectors to the pre-trained model that can be added, scaled, or subtracted to support incremental learning, ensembling, or unlearning. Component models are composed at inference time via scalar combination, reducing the cost of ensembling to that of a single model. TMC improves accuracy by 4.2% compared to ensembling non-linearly fine-tuned models at a 2.5x to 10x reduction of inference cost, growing linearly with the number of component models. Each component model can be forgotten at zero cost, with no residual effect on the resulting inference. When used for continual fine-tuning, TMC is not constrained by sequential bias and can be executed in parallel on federated data. TMC outperforms recently published continual fine-tuning methods almost uniformly on each setting -- task-incremental, class-incremental, and data-incremental -- on a total of 13 experiments across 3 benchmark datasets, despite not using any replay buffer. TMC is designed for composing models that are local to a pre-trained embedding, but could be extended to more general settings. + + + + Knowledge-Spreader: Learning Semi-Supervised Facial Action Dynamics by Consistifying Knowledge Granularity + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Knowledge-Spreader_Learning_Semi-Supervised_Facial_Action_Dynamics_by_Consistifying_Knowledge_Granularity_ICCV_2023_paper.pdf + Recent studies on dynamic facial action unit (AU) detection have extensively relied on dense annotations. However, manual annotations are difficult, time-consuming, and costly. The canonical semi-supervised learning (SSL) methods ignore the consistency, extensibility, and adaptability of structural knowledge across spatial-temporal domains. Furthermore, the reliance on offline design and excessive parameters hinder the efficiency of the learning process. To remedy these issues, we propose a lightweight and on-line semi-supervised framework, a so-called Knowledge-Spreader (KS), to learn AU dynamics with sparse annotations. By formulating SSL as a Progressive Knowledge Distillation (PKD) problem, we aim to infer cross-domain information, specifically from spatial to temporal domains, by consistifying knowledge granularity within Teacher-Students Network. 
Specifically, KS employs sparsely annotated key-frames to learn AU dependencies as the privileged knowledge. Then, the model spreads the learned knowledge to their unlabeled neighbours by jointly applying knowledge distillation and pseudo-labeling, and completes the temporal information as the expanded knowledge. We term the progressive knowledge distillation as "Knowledge Spreading", which allows our model to learn spatial-temporal knowledge from video clips with only one label allocated. Extensive experiments demonstrate that KS achieves competitive performance as compared to the state of the arts under the circumstances of using only 2% labels on BP4D and 5% labels on DISFA. In addition, we have tested it on our newly developed large-scale comprehensive emotion database BP4D++, which contains considerable samples across well-synchronized and aligned sensor modalities for alleviating the scarcity issue of annotations and identities. + + + + Kick Back & Relax: Learning to Reconstruct the World by Watching SlowTV + http://openaccess.thecvf.com//content/ICCV2023/papers/Spencer_Kick_Back__Relax_Learning_to_Reconstruct_the_World_by_ICCV_2023_paper.pdf + Self-supervised monocular depth estimation (SS-MDE) has the potential to scale to vast quantities of data. Unfortunately, existing approaches limit themselves to the automotive domain, resulting in models incapable of generalizing to complex environments such as natural or indoor settings. To address this, we propose a large-scale SlowTV dataset curated from YouTube, containing an order of magnitude more data than existing automotive datasets. SlowTV contains 1.7M images from a rich diversity of environments, such as worldwide seasonal hiking, scenic driving and scuba diving. Using this dataset, we train an SS-MDE model that provides zero-shot generalization to a large collection of indoor/outdoor datasets. The resulting model outperforms all existing SSL approaches and closes the gap on supervised SoTA, despite using a more efficient architecture. We additionally introduce a collection of best-practices to further maximize performance and zero-shot generalization. This includes 1) aspect ratio augmentation, 2) camera intrinsic estimation, 3) support frame randomization and 4) flexible motion estimation. + + + + ChildPlay: A New Benchmark for Understanding Children's Gaze Behaviour + http://openaccess.thecvf.com//content/ICCV2023/papers/Tafasca_ChildPlay_A_New_Benchmark_for_Understanding_Childrens_Gaze_Behaviour_ICCV_2023_paper.pdf + Gaze behaviors such as eye-contact or shared attention are important markers for diagnosing developmental disorders in children. While previous studies have looked at some of these elements, the analysis is usually performed on private datasets and is restricted to lab settings. Furthermore, all publicly available gaze target prediction benchmarks mostly contain instances of adults, which makes models trained on them less applicable to scenarios with young children. In this paper, we propose the first study for predicting the gaze target of children and interacting adults. To this end, we introduce the ChildPlay dataset: a curated collection of short video clips featuring children playing and interacting with adults in uncontrolled environments (e.g. kindergarten, therapy centers, preschools etc.), which we annotate with rich gaze information. 
We further propose a new model for gaze target prediction that is geometrically grounded by explicitly identifying the scene parts in the 3D field of view (3DFoV) of the person, leveraging recent geometry preserving depth inference methods. Our model achieves state of the art results on benchmark datasets and ChildPlay. Furthermore, results show that looking at faces prediction performance on children is much worse than on adults, and can be significantly improved by fine-tuning models using child gaze annotations. Our dataset is available at https://www.idiap.ch/en/dataset/childplay-gaze. + + + + When Noisy Labels Meet Long Tail Dilemmas: A Representation Calibration Method + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_When_Noisy_Labels_Meet_Long_Tail_Dilemmas_A_Representation_Calibration_ICCV_2023_paper.pdf + Real-world large-scale datasets are both noisily labeled and class-imbalanced. The issues seriously hurt the generalization of trained models. It is hence significant to address the simultaneous incorrect labeling and class-imbalance, i.e., the problem of learning with noisy labels on long-tailed data. Previous works develop several methods for the problem. However, they always rely on strong assumptions that are invalid or hard to be checked in practice. In this paper, to handle the problem and address the limitations of prior works, we propose a representation calibration method RCAL. Specifically, RCAL works with the representations extracted by unsupervised contrastive learning. We assume that without incorrect labeling and class imbalance, the representations of instances in each class conform to a multivariate Gaussian distribution, which is much milder and easier to be checked. Based on the assumption, we recover underlying representation distributions from polluted ones resulting from mislabeled and class-imbalanced data. Additional data points are then sampled from the recovered distributions to help generalization. Moreover, during classifier training, representation learning takes advantage of representation robustness brought by contrastive learning, which further improves the classifier performance. We derive theoretical results to discuss the effectiveness of our representation calibration. Experiments on multiple benchmarks justify our claims and confirm the superiority of the proposed method. + + + + Reinforce Data, Multiply Impact: Improved Model Accuracy and Robustness with Dataset Reinforcement + http://openaccess.thecvf.com//content/ICCV2023/papers/Faghri_Reinforce_Data_Multiply_Impact_Improved_Model_Accuracy_and_Robustness_with_ICCV_2023_paper.pdf + We propose Dataset Reinforcement, a strategy to improve a dataset once such that the accuracy of any model architecture trained on the reinforced dataset is improved at no additional training cost for users. We propose a Dataset Reinforcement strategy based on data augmentation and knowledge distillation. Our generic strategy is designed based on extensive analysis across CNN- and transformer-based models and performing large-scale study of distillation with state-of-the-art models with various data augmentations. We create a reinforced version of the ImageNet training dataset, called ImageNet+, as well as reinforced datasets CIFAR-100+, Flowers-102+, and Food-101+. Models trained with ImageNet+ are more accurate, robust, and calibrated, and transfer well to downstream tasks (e.g., segmentation and detection). 
As an example, the accuracy of ResNet-50 improves by 1.7% on the ImageNet validation set, 3.5% on ImageNetV2, and 10.0% on ImageNet-R. Expected Calibration Error (ECE) on the ImageNet validation set is also reduced by 9.9%. Using this backbone with Mask-RCNN for object detection on MS-COCO, the mean average precision improves by 0.8%. We reach similar gains for MobileNets, ViTs, and Swin-Transformers. For MobileNetV3 and Swin-Tiny, we observe significant improvements on ImageNet-R/A/C of up to 20% improved robustness. Models pretrained on ImageNet+ and fine-tuned on CIFAR-100+, Flowers-102+, and Food-101+, reach up to 3.4% improved accuracy. The code, datasets, and pretrained models are available at https://github.com/apple/ml-dr. + + + + Incremental Generalized Category Discovery + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhao_Incremental_Generalized_Category_Discovery_ICCV_2023_paper.pdf + We explore the problem of Incremental Generalized Category Discovery (IGCD). This is a challenging category-incremental learning setting where the goal is to develop models that can correctly categorize images from previously seen categories, in addition to discovering novel ones. Learning is performed over a series of time steps where the model obtains new labeled and unlabeled data, and discards old data, at each iteration. The difficulty of the problem is compounded in our generalized setting as the unlabeled data can contain images from categories that may or may not have been observed before. We present a new model for IGCD which combines non-parametric categorization with efficient image sampling to mitigate catastrophic forgetting. To quantify performance, we propose a new benchmark dataset named iNatIGCD that is motivated by a real-world fine-grained visual categorization task. In our experiments we outperform existing related methods. + + + + Guiding Local Feature Matching with Surface Curvature + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_Guiding_Local_Feature_Matching_with_Surface_Curvature_ICCV_2023_paper.pdf + We propose a new method, named curvature similarity extractor (CSE), for improving local feature matching across images. CSE calculates the curvature of the local 3D surface patch for each detected feature point in a viewpoint-invariant manner via fitting quadrics to predicted monocular depth maps. This curvature is then leveraged as an additional signal in feature matching with off-the-shelf matchers like SuperGlue and LoFTR. Additionally, CSE enables end-to-end joint training by connecting the matcher and depth predictor networks. Our experiments demonstrate on large-scale real-world datasets that CSE continuously improves the accuracy of state-of-the-art methods. Fine-tuning the depth prediction network further enhances the accuracy. The proposed approach achieves state-of-the-art results on the ScanNet dataset, showcasing the effectiveness of incorporating 3D geometric information into feature matching. + + + + Constraining Depth Map Geometry for Multi-View Stereo: A Dual-Depth Approach with Saddle-shaped Depth Cells + http://openaccess.thecvf.com//content/ICCV2023/papers/Ye_Constraining_Depth_Map_Geometry_for_Multi-View_Stereo_A_Dual-Depth_Approach_ICCV_2023_paper.pdf + Learning-based multi-view stereo (MVS) methods deal with predicting accurate depth maps to achieve an accurate and complete 3D representation. Despite the excellent performance, existing methods ignore the fact that a suitable depth geometry is also critical in MVS. 
In this paper, we demonstrate that different depth geometries have significant performance gaps, even using the same depth prediction error. Therefore, we introduce an ideal depth geometry composed of Saddle-Shaped Cells, whose predicted depth map oscillates upward and downward around the ground-truth surface, rather than maintaining a continuous and smooth depth plane. To achieve it, we develop a coarse-to-fine framework called Dual-MVSNet (DMVSNet), which can produce an oscillating depth plane. Technically, we predict two depth values for each pixel (Dual-Depth) and propose a novel loss function and a checkerboard-shaped selecting strategy to constrain the predicted depth geometry. Compared to existing methods, DMVSNet achieves a high rank on the DTU benchmark and obtains the top performance on challenging scenes of Tanks and Temples, demonstrating its strong performance and generalization ability. Our method also points to a new research direction for considering depth geometry in MVS. + + + + DiffusionDet: Diffusion Model for Object Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_DiffusionDet_Diffusion_Model_for_Object_Detection_ICCV_2023_paper.pdf + We propose DiffusionDet, a new framework that formulates object detection as a denoising diffusion process from noisy boxes to object boxes. During the training stage, object boxes diffuse from ground-truth boxes to random distribution, and the model learns to reverse this noising process. In inference, the model refines a set of randomly generated boxes to the output results in a progressive way. Our work possesses an appealing property of flexibility, which enables the dynamic number of boxes and iterative evaluation. The extensive experiments on the standard benchmarks show that DiffusionDet achieves favorable performance compared to previous well-established detectors. For example, DiffusionDet achieves 5.3 AP and 4.8 AP gains when evaluated with more boxes and iteration steps, under a zero-shot transfer setting from COCO to CrowdHuman. Our code is available at https://github.com/ShoufaChen/DiffusionDet. + + + + Forward Flow for Novel View Synthesis of Dynamic Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Forward_Flow_for_Novel_View_Synthesis_of_Dynamic_Scenes_ICCV_2023_paper.pdf + This paper proposes a neural radiance field (NeRF) approach for novel view synthesis of dynamic scenes using forward warping. Existing methods often adopt a static NeRF to represent the canonical space, and render dynamic images at other time steps by mapping the sampled 3D points back to the canonical space with the learned backward flow field. However, this backward flow field is non-smooth and discontinuous, which is difficult to be fitted by commonly used smooth motion models. To address this problem, we propose to estimate the forward flow field and directly warp the canonical radiance field to other time steps. Such forward flow field is smooth and continuous within the object region, which benefits the motion model learning. To achieve this goal, we represent the canonical radiance field with voxel grids to enable efficient forward warping, and propose a differentiable warping process, including an average splatting operation and an inpaint network, to resolve the many-to-one and one-to-many mapping issues. Thorough experiments show that our method outperforms existing methods in both novel view rendering and motion modeling, demonstrating the effectiveness of our forward flow motion modeling. 
Project page: https://npucvr.github.io/ForwardFlowDNeRF. + + + + CopyRNeRF: Protecting the CopyRight of Neural Radiance Fields + http://openaccess.thecvf.com//content/ICCV2023/papers/Luo_CopyRNeRF_Protecting_the_CopyRight_of_Neural_Radiance_Fields_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRF) have the potential to be a major representation of media. Since training a NeRF has never been an easy task, the protection of its model copyright should be a priority. In this paper, by analyzing the pros and cons of possible copyright protection solutions, we propose to protect the copyright of NeRF models by replacing the original color representation in NeRF with a watermarked color representation. Then, a distortion-resistant rendering scheme is designed to guarantee robust message extraction in 2D renderings of NeRF. Our proposed method can directly protect the copyright of NeRF models while maintaining high rendering quality and bit accuracy when compared among optional solutions. + + + + SegRCDB: Semantic Segmentation via Formula-Driven Supervised Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Shinoda_SegRCDB_Semantic_Segmentation_via_Formula-Driven_Supervised_Learning_ICCV_2023_paper.pdf + Pre-training is a strong strategy for enhancing visual models to efficiently train them with a limited number of labeled images. In semantic segmentation, creating annotation masks requires an intensive amount of labor and time, and therefore, a large-scale pre-training dataset with semantic labels is quite difficult to construct. Moreover, what matters in semantic segmentation pre-training has not been fully investigated. In this paper, we propose the Segmentation Radial Contour DataBase (SegRCDB), which for the first time applies formula-driven supervised learning for semantic segmentation. SegRCDB enables pre-training for semantic segmentation without real images or any manual semantic labels. SegRCDB is based on insights about what is important in pre-training for semantic segmentation and allows efficient pre-training. Pre-training with SegRCDB achieved higher mIoU than the pre-training with COCO-Stuff for fine-tuning on ADE-20k and Cityscapes with the same number of training images. SegRCDB has a high potential to contribute to semantic segmentation pre-training and investigation by enabling the creation of large datasets without manual annotation. The SegRCDB dataset will be released under a license that allows research and commercial use. + + + + LoTE-Animal: A Long Time-span Dataset for Endangered Animal Behavior Understanding + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_LoTE-Animal_A_Long_Time-span_Dataset_for_Endangered_Animal_Behavior_Understanding_ICCV_2023_paper.pdf + Understanding and analyzing animal behavior is increasingly essential to protect endangered animal species. However, the application of advanced computer vision techniques in this regard is minimal, which boils down to lacking large and diverse datasets for training deep models. To break the deadlock, we present LoTE-Animal, a large-scale endangered animal dataset collected over 12 years, to foster the application of deep learning in rare species conservation. The collected data contains vast variations such as ecological seasons, weather conditions, periods, viewpoints, and habitat scenes. So far, we retrieved at least 500K videos and 1.2 million images. 
Specifically, we selected and annotated 11 endangered animals for behavior understanding, including 10K video sequences for the action recognition task, 28K images for object detection, instance segmentation, and pose estimation tasks. In addition, we gathered 7K web images of the same species as source domain data for the domain adaptation task. We provide evaluation results of representative vision understanding approaches and cross-domain experiments. LoTE-Animal dataset would facilitate the community to research more advanced machine learning models and explore new tasks to aid endangered animal conservation. Our dataset will be released with the paper. + + + + DQS3D: Densely-matched Quantization-aware Semi-supervised 3D Detection + http://openaccess.thecvf.com//content/ICCV2023/papers/Gao_DQS3D_Densely-matched_Quantization-aware_Semi-supervised_3D_Detection_ICCV_2023_paper.pdf + In this paper, we study the problem of semi-supervised 3D object detection, which is of great importance considering the high annotation cost for cluttered 3D indoor scenes. We resort to the robust and principled framework of self-teaching, which has triggered notable progress for semi-supervised learning recently. While this paradigm is natural for image-level or pixel-level prediction, adapting it to the detection problem is challenged by the issue of proposal matching. Prior methods are based upon two-stage pipelines, matching heuristically selected proposals generated in the first stage and resulting in spatially sparse training signals. In contrast, we propose the first semi-supervised 3D detection algorithm that works in the single-stage manner and allows spatially dense training signals. A fundamental issue of this new design is the quantization error caused by point-to-voxel discretization, which inevitably leads to misalignment between two transformed views in the voxel domain. To this end, we derive and implement closed-form rules that compensate this misalignment on-the-fly. Our results are significant, e.g., promoting ScanNet mAP@0.5 from 35.2% to 48.5% using 20% annotation. Codes and data are publicly available. + + + + Towards Inadequately Pre-trained Models in Transfer Learning + http://openaccess.thecvf.com//content/ICCV2023/papers/Deng_Towards_Inadequately_Pre-trained_Models_in_Transfer_Learning_ICCV_2023_paper.pdf + Transfer learning has been a popular learning paradigm in the deep learning era, especially in annotation-insufficient scenarios. Better ImageNet pre-trained models have been demonstrated, from the perspective of architecture, by previous research to have better transferability to downstream tasks. However, in this paper, we find that during the same pre-training process, models at middle epochs, which are inadequately pre-trained, can outperform fully trained models when used as feature extractors (FE), while the fine-tuning (FT) performance still grows with the source performance. This reveals that there is not a solid positive correlation between top-1 accuracy on ImageNet and the transferring result on target data. Based on the contradictory phenomenon between FE and FT that a better feature extractor fails to be fine-tuned better accordingly, we conduct comprehensive analyses on features before the softmax layer to provide insightful explanations. Our discoveries suggest that, during pre-training, models tend to first learn spectral components corresponding to large singular values and the residual components contribute more when fine-tuning. 
+ + + + Class-Aware Patch Embedding Adaptation for Few-Shot Image Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Hao_Class-Aware_Patch_Embedding_Adaptation_for_Few-Shot_Image_Classification_ICCV_2023_paper.pdf + "A picture is worth a thousand words", significantly beyond a mere categorization. At the same time, many patches of the image could have meanings completely irrelevant to the categorization if they were observed independently. This could significantly reduce the efficiency of a large family of few-shot learning algorithms, which have limited data and highly rely on the comparison of image patches. To address this issue, we propose a Class-aware Patch Embedding Adaptation (CPEA) method to learn "class-aware embeddings" of the image patches. The key idea of CPEA is to integrate patch embeddings with class-aware embeddings to make them class-relevant. Furthermore, we define a dense score matrix between class-relevant patch embeddings across images, based on which the degree of similarity between paired images is quantified. Visualization results show that CPEA concentrates patch embeddings by class, thus making them class-relevant. Extensive experiments on four benchmark datasets, miniImageNet, tieredImageNet, CIFAR-FS, and FC-100, indicate that our CPEA significantly outperforms the existing state-of-the-art methods. The source code is available at https://github.com/FushengHao/CPEA. + + + + Federated Learning Over Images: Vertical Decompositions and Pre-Trained Backbones Are Difficult to Beat + http://openaccess.thecvf.com//content/ICCV2023/papers/Hu_Federated_Learning_Over_Images_Vertical_Decompositions_and_Pre-Trained_Backbones_Are_ICCV_2023_paper.pdf + We carefully evaluate a number of algorithms for learning in a federated environment, and test their utility for a variety of image classification tasks. We consider many issues that have not been adequately considered before: whether learning over data sets that do not have diverse sets of images affects the results; whether to use a pre-trained feature extraction "backbone"; how to evaluate learner performance (we argue that classification accuracy is not enough), among others. Overall, across a wide variety of settings, we find that vertically decomposing a neural network seems to give the best results, and outperforms more standard reconciliation-based methods. + + + + HOSNeRF: Dynamic Human-Object-Scene Neural Radiance Fields from a Single Video + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_HOSNeRF_Dynamic_Human-Object-Scene_Neural_Radiance_Fields_from_a_Single_Video_ICCV_2023_paper.pdf + We introduce HOSNeRF, a novel 360° free-viewpoint rendering method that reconstructs neural radiance fields for a dynamic human-object-scene from a single monocular in-the-wild video. Our method enables pausing the video at any frame and rendering all scene details (dynamic humans, objects, and backgrounds) from arbitrary viewpoints. The first challenge in this task is the complex object motions in human-object interactions, which we tackle by introducing the new object bones into the conventional human skeleton hierarchy to effectively estimate large object deformations in our dynamic human-object model. The second challenge is that humans interact with different objects at different times, for which we introduce two new learnable object state embeddings that can be used as conditions for learning our human-object representation and scene representation, respectively.
Extensive experiments show that HOSNeRF significantly outperforms SOTA approaches on two challenging datasets by a large margin of 40%-50% in terms of LPIPS. The code, data, and compelling examples of 360° free-viewpoint renderings from single videos are available at: https://showlab.github.io/HOSNeRF. + + + + Collaborative Propagation on Multiple Instance Graphs for 3D Instance Segmentation with Single-point Supervision + http://openaccess.thecvf.com//content/ICCV2023/papers/Dong_Collaborative_Propagation_on_Multiple_Instance_Graphs_for_3D_Instance_Segmentation_ICCV_2023_paper.pdf + Instance segmentation on 3D point clouds has been attracting increasing attention due to its wide applications, especially in scene understanding areas. However, most existing methods operate on fully annotated data, while manually preparing ground-truth labels at the point level is very cumbersome and labor-intensive. To address this issue, we propose a novel weakly supervised method RWSeg that only requires labeling one object with one point. With these sparse weak labels, we introduce a unified framework with two branches to propagate semantic and instance information respectively to unknown regions using self-attention and a cross-graph random walk method. Specifically, we propose a Cross-graph Competing Random Walks (CRW) algorithm that encourages competition among different instance graphs to resolve ambiguities in closely placed objects, improving instance assignment accuracy. RWSeg generates high-quality instance-level pseudo labels. Experimental results on ScanNet-v2 and S3DIS datasets show that our approach achieves comparable performance with fully-supervised methods and outperforms previous weakly-supervised methods by a substantial margin. + + + + RMP-Loss: Regularizing Membrane Potential Distribution for Spiking Neural Networks + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_RMP-Loss_Regularizing_Membrane_Potential_Distribution_for_Spiking_Neural_Networks_ICCV_2023_paper.pdf + Spiking Neural Networks (SNNs), as one family of biology-inspired models, have received much attention recently. They can significantly reduce energy consumption since they quantize the real-valued membrane potentials to 0/1 spikes to transmit information; thus the multiplications of activations and weights can be replaced by additions when implemented on hardware. However, this quantization mechanism will inevitably introduce quantization error, thus causing catastrophic information loss. To address the quantization error problem, we propose a regularizing membrane potential loss (RMP-Loss) to adjust the distribution, which is directly related to quantization error, to a range close to the spikes. Our method is extremely simple to implement and makes training an SNN straightforward. Furthermore, it is shown to consistently outperform previous state-of-the-art methods over different network architectures and datasets. + + + + Multi-grained Temporal Prototype Learning for Few-shot Video Object Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_Multi-grained_Temporal_Prototype_Learning_for_Few-shot_Video_Object_Segmentation_ICCV_2023_paper.pdf + Few-Shot Video Object Segmentation (FSVOS) aims to segment objects in a query video with the same category defined by a few annotated support images. However, this task has seldom been explored.
In this work, based on IPMT, a state-of-the-art few-shot image segmentation method that combines external support guidance information with adaptive query guidance cues, we propose to leverage multi-grained temporal guidance information for handling the temporal correlation nature of video data. We decompose the query video information into a clip prototype and a memory prototype for capturing local and long-term internal temporal guidance, respectively. Frame prototypes are further used for each frame independently to handle fine-grained adaptive guidance and enable bidirectional clip-frame prototype communication. To reduce the influence of noisy memory, we propose to leverage the structural similarity relation among different predicted regions and the support for selecting reliable memory frames. Furthermore, a new segmentation loss is also proposed to enhance the category discriminability of the learned prototypes. Experimental results demonstrate that our proposed video IPMT model significantly outperforms previous models on two benchmark datasets. Code is available at https://github.com/nankepan/VIPMT. + + + + Time Does Tell: Self-Supervised Time-Tuning of Dense Image Representations + http://openaccess.thecvf.com//content/ICCV2023/papers/Salehi_Time_Does_Tell_Self-Supervised_Time-Tuning_of_Dense_Image_Representations_ICCV_2023_paper.pdf + Spatially dense self-supervised learning is a rapidly growing problem domain with promising applications for unsupervised segmentation and pretraining for dense downstream tasks. Despite the abundance of temporal data in the form of videos, this information-rich source has been largely overlooked. Our paper aims to address this gap by proposing a novel approach that incorporates temporal consistency in dense self-supervised learning. While methods designed solely for images face difficulties in achieving even the same performance on videos, our method improves not only the representation quality for videos - but also images. Our approach, which we call time-tuning, starts from image-pretrained models and fine-tunes them with a novel self-supervised temporal-alignment clustering loss on unlabeled videos. This effectively facilitates the transfer of high-level information from videos to image representations. Time-tuning improves the state-of-the-art by 8-10% for unsupervised semantic segmentation on videos and matches it for images. We believe this method paves the way for further self-supervised scaling by leveraging the abundant availability of videos. The implementation can be found here : https://github.com/SMSD75/Timetuning + + + + CroCo v2: Improved Cross-view Completion Pre-training for Stereo Matching and Optical Flow + http://openaccess.thecvf.com//content/ICCV2023/papers/Weinzaepfel_CroCo_v2_Improved_Cross-view_Completion_Pre-training_for_Stereo_Matching_and_ICCV_2023_paper.pdf + Despite impressive performance for high-level downstream tasks, self-supervised pre-training methods have not yet fully delivered on dense geometric vision tasks such as stereo matching or optical flow. The application of self-supervised concepts, such as instance discrimination or masked image modeling, to geometric tasks is an active area of research. In this work, we build on the recent cross-view completion framework, a variation of masked image modeling that leverages a second view from the same scene which makes it well suited for binocular downstream tasks. 
The applicability of this concept has so far been limited in at least two ways: (a) by the difficulty of collecting real-world image pairs -- in practice only synthetic data have been used -- and (b) by the lack of generalization of vanilla transformers to dense downstream tasks for which relative position is more meaningful than absolute position. We explore three avenues of improvement. First, we introduce a method to collect suitable real-world image pairs at large scale. Second, we experiment with relative positional embeddings and show that they enable vision transformers to perform substantially better. Third, we scale up vision transformer based cross-completion architectures, which is made possible by the use of large amounts of data. With these improvements, we show for the first time that state-of-the-art results on stereo matching and optical flow can be reached without using any classical task-specific techniques like correlation volume, iterative estimation, image warping or multi-scale reasoning, thus paving the way towards universal vision models. + + + + ExBluRF: Efficient Radiance Fields for Extreme Motion Blurred Images + http://openaccess.thecvf.com//content/ICCV2023/papers/Lee_ExBluRF_Efficient_Radiance_Fields_for_Extreme_Motion_Blurred_Images_ICCV_2023_paper.pdf + We present ExBluRF, a novel view synthesis method for extreme motion blurred images based on efficient radiance fields optimization. Our approach consists of two main components: 6-DOF camera trajectory-based motion blur formulation and voxel-based radiance fields. From extremely blurred images, we optimize the sharp radiance fields by jointly estimating the camera trajectories that generate the blurry images. In training, multiple rays along the camera trajectory are accumulated to reconstruct single blurry color, which is equivalent to the physical motion blur operation. We minimize the photo-consistency loss on blurred image space and obtain the sharp radiance fields with camera trajectories that explain the blur of all images. The joint optimization on the blurred image space demands painfully increasing computation and resources proportional to the blur size. Our method solves this problem by replacing the MLP-based framework to low-dimensional 6-DOF camera poses and voxel-based radiance fields. Compared with the existing works, our approach restores much sharper 3D scenes from challenging motion blurred views with the order of 10x less training time and GPU memory consumption. + + + + Text2Video-Zero: Text-to-Image Diffusion Models are Zero-Shot Video Generators + http://openaccess.thecvf.com//content/ICCV2023/papers/Khachatryan_Text2Video-Zero_Text-to-Image_Diffusion_Models_are_Zero-Shot_Video_Generators_ICCV_2023_paper.pdf + Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce a new task, zero-shot text-to-video generation, and propose a low-cost approach (without any training or optimization) by leveraging the power of existing text-to-image synthesis methods (e.g., Stable Diffusion), making them suitable for the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background time consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. 
Experiments show that this leads to low overhead, yet high-quality and remarkably consistent video generation. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably or sometimes better than recent approaches, despite not being trained on additional video data. Our code is publicly available at: https://github.com/Picsart-AI-Research/Text2Video-Zero . + + + + Exploring Video Quality Assessment on User Generated Contents from Aesthetic and Technical Perspectives + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Exploring_Video_Quality_Assessment_on_User_Generated_Contents_from_Aesthetic_ICCV_2023_paper.pdf + The rapid increase in user-generated-content (UGC) videos calls for the development of effective video quality assessment (VQA) algorithms. However, the objective of the UGC-VQA problem is still ambiguous and can be viewed from two perspectives: the technical perspective, measuring the perception of distortions; and the aesthetic perspective, which relates to preference and recommendation on contents. To understand how these two perspectives affect overall subjective opinions in UGC-VQA, we conduct a large-scale subjective study to collect human quality opinions on overall quality of videos as well as perceptions from aesthetic and technical perspectives. The collected Disentangled Video Quality Database (DIVIDE-3k) confirms that human quality opinions on UGC videos are universally and inevitably affected by both aesthetic and technical perspectives. In light of this, we propose the Disentangled Objective Video Quality Evaluator (DOVER) to learn the quality of UGC videos based on the two perspectives. The DOVER proves state-of-the-art performance in UGC-VQA under very high efficiency. With perspective opinions in DIVIDE-3k, we further propose DOVER++, the first approach to provide reliable clear-cut quality evaluations from a single aesthetic or technical perspective. + + + + Distributed Bundle Adjustment with Block-Based Sparse Matrix Compression for Super Large Scale Datasets + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Distributed_Bundle_Adjustment_with_Block-Based_Sparse_Matrix_Compression_for_Super_ICCV_2023_paper.pdf + We propose a distributed bundle adjustment (DBA) method using the exact Levenberg-Marquardt (LM) algorithm for super large-scale datasets. Most of the existing methods partition the global map to small ones and conduct bundle adjustment in the submaps. In order to fit the parallel framework, they use approximate solutions instead of the LM algorithm. However, those methods often give sub-optimal results. Different from them, we utilize the exact LM algorithm to conduct global bundle adjustment where the formation of the reduced camera system (RCS) is actually parallelized and executed in a distributed way. To store the large RCS, we compress it with a block-based sparse matrix compression format (BSMC), which fully exploits its block feature. The BSMC format also enables the distributed storage and updating of the global RCS. The proposed method is extensively evaluated and compared with the state-of-the-art pipelines using both synthetic and real datasets. Preliminary results demonstrate the efficient memory usage and vast scalability of the proposed method compared with the baselines. 
For the first time, we conducted parallel bundle adjustment using the LM algorithm on a real dataset with 1.18 million images and a synthetic dataset with 10 million images (about 500 times that of the state-of-the-art LM-based BA) on a distributed computing system. + + + + Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet + http://openaccess.thecvf.com//content/ICCV2023/papers/Neuhaus_Spurious_Features_Everywhere_-_Large-Scale_Detection_of_Harmful_Spurious_Features_ICCV_2023_paper.pdf + Benchmark performance of deep learning classifiers alone is not a reliable predictor for the performance of a deployed model. In particular, if the image classifier has picked up spurious features in the training data, its predictions can fail in unexpected ways. In this paper, we develop a framework that allows us to systematically identify spurious features in large datasets like ImageNet. It is based on our neural PCA components and their visualization. Previous work on spurious features often operates in toy settings or requires costly pixel-wise annotations. In contrast, we work with ImageNet and validate our results by showing that the presence of the harmful spurious feature of a class alone is sufficient to trigger the prediction of that class. We introduce the novel dataset "Spurious ImageNet", which allows measuring the reliance of any ImageNet classifier on harmful spurious features. Moreover, we introduce SpuFix as a simple mitigation method to reduce the dependence of any ImageNet classifier on previously identified harmful spurious features without requiring additional labels or retraining of the model. We provide code and data at https://github.com/YanNeu/spurious_imagenet. + + + + Delicate Textured Mesh Recovery from NeRF via Adaptive Surface Refinement + http://openaccess.thecvf.com//content/ICCV2023/papers/Tang_Delicate_Textured_Mesh_Recovery_from_NeRF_via_Adaptive_Surface_Refinement_ICCV_2023_paper.pdf + Neural Radiance Fields (NeRF) have constituted a remarkable breakthrough in image-based 3D reconstruction. However, their implicit volumetric representations differ significantly from the widely-adopted polygonal meshes and lack support from common 3D software and hardware, making their rendering and manipulation inefficient. To overcome this limitation, we present a novel framework that generates textured surface meshes from images. Our approach begins by efficiently initializing the geometry and view-dependency decomposed appearance with a NeRF. Subsequently, a coarse mesh is extracted, and an iterative surface refinement algorithm is developed to adaptively adjust both vertex positions and face density based on re-projected rendering errors. We jointly refine the appearance with geometry and bake it into texture images for real-time rendering. Extensive experiments demonstrate that our method achieves superior mesh quality and competitive rendering quality. + + + + Probabilistic Precision and Recall Towards Reliable Evaluation of Generative Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Park_Probabilistic_Precision_and_Recall_Towards_Reliable_Evaluation_of_Generative_Models_ICCV_2023_paper.pdf + Assessing the fidelity and diversity of a generative model is a difficult but important issue for technological advancement. Thus, recent papers have introduced k-Nearest Neighbor (kNN) based precision-recall metrics to break down the statistical distance into fidelity and diversity.
While they provide an intuitive method, we thoroughly analyze these metrics and identify oversimplified assumptions and undesirable properties of kNN that result in unreliable evaluation, such as susceptibility to outliers and insensitivity to distributional changes. Thus, we propose novel metrics, P-precision and P-recall (PP&PR), based on a probabilistic approach that addresses these problems. Through extensive investigations on toy experiments and state-of-the-art generative models, we show that our PP&PR provide more reliable estimates for comparing fidelity and diversity than the existing metrics. The code is available at https://github.com/kdst-team/Probablistic_precision_recall. + + + + Deep Multitask Learning with Progressive Parameter Sharing + http://openaccess.thecvf.com//content/ICCV2023/papers/Shi_Deep_Multitask_Learning_with_Progressive_Parameter_Sharing_ICCV_2023_paper.pdf + We propose a novel progressive parameter-sharing strategy (MPPS) in this paper for effectively training multitask learning models on diverse computer vision tasks simultaneously. Specifically, we propose to parameterize distributions for different tasks to control the sharing, based on the concept of Exclusive Capacity that we introduce. A scheduling mechanism following the concept of curriculum learning is also designed to progressively change the sharing strategy to increase the level of sharing during the learning process. We further propose a novel loss function to regularize the optimization of network parameters as well as the sharing probabilities of each neuron for each task. Our approach can be combined with many state-of-the-art multitask learning solutions to achieve better joint task performance. Comprehensive experiments show that it has competitive performance on three challenging datasets (Multi-CIFAR100, NYUv2, and Cityscapes) using various convolutional neural network architectures. + + + + Personalized Semantics Excitation for Federated Image Classification + http://openaccess.thecvf.com//content/ICCV2023/papers/Xia_Personalized_Semantics_Excitation_for_Federated_Image_Classification_ICCV_2023_paper.pdf + Federated learning enables distributed local clients to collaborate, with privacy protected, to attain a more generic global model. However, significant distribution shift in input/label space across different clients makes it challenging to generalize well to all clients, which motivates personalized federated learning (PFL). Existing PFL methods typically customize the local model by fine-tuning with limited local supervision and the global model regularizer, which secures local specificity but risks ruining the global discriminative knowledge. In this paper, we propose a novel Personalized Semantics Excitation (PSE) mechanism to break through this limitation by exciting and fusing personalized semantics from the global model during local model customization. Specifically, PSE explores channel-wise gradient differentiation across global and local models to identify important low-level semantics, mostly from convolutional layers, which are embedded into the client-specific training. In addition, PSE deploys the collaboration of global and local models to enrich high-level feature representations and facilitate the robustness of the client classifier through a cross-model attention module. Extensive experiments and analysis on various image classification benchmarks demonstrate the effectiveness and advantage of our method over the state-of-the-art PFL methods.
+ + + + SurroundOcc: Multi-camera 3D Occupancy Prediction for Autonomous Driving + http://openaccess.thecvf.com//content/ICCV2023/papers/Wei_SurroundOcc_Multi-camera_3D_Occupancy_Prediction_for_Autonomous_Driving_ICCV_2023_paper.pdf + 3D scene understanding plays a vital role in vision-based autonomous driving. While most existing methods focus on 3D object detection, they have difficulty describing real-world objects of arbitrary shapes and infinite classes. Towards a more comprehensive perception of a 3D scene, in this paper, we propose a SurroundOcc method to predict the 3D occupancy with multi-camera images. We first extract multi-scale features for each image and adopt spatial 2D-3D attention to lift them to the 3D volume space. Then we apply 3D convolutions to progressively upsample the volume features and impose supervision on multiple levels. To obtain dense occupancy prediction, we design a pipeline to generate dense occupancy ground truth without expansive occupancy annotations. Specifically, we fuse multi-frame LiDAR scans of dynamic objects and static scenes separately. Then we adopt Poisson Reconstruction to fill the holes and voxelize the mesh to get dense occupancy labels. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our method. Code and dataset are available at https://github.com/weiyithu/SurroundOcc. + + + + Deep Multiview Clustering by Contrasting Cluster Assignments + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_Deep_Multiview_Clustering_by_Contrasting_Cluster_Assignments_ICCV_2023_paper.pdf + Multiview clustering (MVC) aims to reveal the underlying structure of multiview data by categorizing data samples into clusters. Deep learning-based methods exhibit strong feature learning capabilities on large-scale datasets. For most existing deep MVC methods, exploring the invariant representations of multiple views is still an intractable problem. In this paper, we propose a cross-view contrastive learning (CVCL) method that learns view-invariant representations and produces clustering results by contrasting the cluster assignments among multiple views. Specifically, we first employ deep autoencoders to extract view-dependent features in the pretraining stage. Then, a cluster-level CVCL strategy is presented to explore consistent semantic label information among the multiple views in the fine-tuning stage. Thus, the proposed CVCL method is able to produce more discriminative cluster assignments by virtue of this learning strategy. Moreover, we provide a theoretical analysis of soft cluster assignment alignment. Extensive experimental results obtained on several datasets demonstrate that the proposed CVCL method outperforms several state-of-the-art approaches. + + + + Look at the Neighbor: Distortion-aware Unsupervised Domain Adaptation for Panoramic Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Look_at_the_Neighbor_Distortion-aware_Unsupervised_Domain_Adaptation_for_Panoramic_ICCV_2023_paper.pdf + Endeavors have been recently made to transfer knowledge from the labeled pinhole image domain to the unlabeled panoramic image domain via Unsupervised Domain Adaptation (UDA). The aim is to tackle the domain gaps caused by the style disparities and distortion problem of the non-uniformly distributed pixels of equirectangular projection (ERP). Previous works typically focus on transferring knowledge based on geometric priors with specially designed multi-branch network architectures. 
As a result, considerable computational costs are incurred, and their generalization abilities are profoundly hindered by the variation of distortion among pixels. In this paper, we find that the neighborhood regions of ERP pixels indeed introduce less distortion. Building on this intuition, we propose a novel UDA framework that can effectively address the distortion problems in panoramic semantic segmentation. In comparison, our method is simpler, easier to implement, and more computationally efficient. Specifically, we propose distortion-aware attention (DA) that captures the neighboring pixel distribution without using any geometric constraints. Moreover, we propose a class-wise feature aggregation (CFA) module to iteratively update the feature representations with a memory bank. As such, the feature similarity between the two domains can be consistently optimized. Extensive experiments show that our method achieves new state-of-the-art performance while reducing the number of parameters by a remarkable 80%. + + + + Rethinking Safe Semi-supervised Learning: Transferring the Open-set Problem to A Close-set One + http://openaccess.thecvf.com//content/ICCV2023/papers/Ma_Rethinking_Safe_Semi-supervised_Learning_Transferring_the_Open-set_Problem_to_A_ICCV_2023_paper.pdf + Conventional semi-supervised learning (SSL) relies on the close-set assumption that the labeled and unlabeled sets contain data with the same seen classes, called in-distribution (ID) data. In contrast, safe SSL investigates a more challenging open-set problem where the unlabeled set may involve some out-of-distribution (OOD) data with unseen classes, which could harm the performance of SSL. When experimenting with mainstream safe SSL methods, we make the surprising finding that all OOD data show a clear tendency to gather in the feature space. This inspires us to solve the safe SSL problem from a fresh perspective. Specifically, for a classification task with K seen classes, we utilize a prototype network not only to generate K prototypes of all seen classes, but also to explicitly model an additional prototype for the OOD data, transferring the K-way classification on the open set to the (K+1)-way classification on the close set. In this way, typical SSL techniques (e.g., consistency regularization and pseudo labeling) can be applied to tackle the safe SSL problem without the additional OOD data processing that other safe SSL methods require. In particular, considering possible low-confidence pseudo labels, we further propose an iterative negative learning (INL) paradigm that encourages the network to learn from complementary labels over a wider set of classes, improving the network's classification performance. Extensive experiments on four benchmark datasets show that our approach remarkably lifts the performance of safe SSL and outperforms the state-of-the-art methods. + + + + Iterative Superquadric Recomposition of 3D Objects from Multiple Views + http://openaccess.thecvf.com//content/ICCV2023/papers/Alaniz_Iterative_Superquadric_Recomposition_of_3D_Objects_from_Multiple_Views_ICCV_2023_paper.pdf + Humans are good at recomposing novel objects, i.e., they can identify commonalities between unknown objects from general structure to finer detail, an ability that is difficult for machines to replicate. We propose a framework, ISCO, to recompose an object using 3D superquadrics as semantic parts directly from 2D views without training a model that uses 3D supervision.
To achieve this, we optimize the superquadric parameters that compose a specific instance of the object, comparing its rendered 3D view and 2D image silhouette. Our ISCO framework iteratively adds new superquadrics wherever the reconstruction error is high, abstracting first coarse regions and then finer details of the target object. With this simple coarse-to-fine inductive bias, ISCO provides consistent superquadrics for related object parts, despite not having any semantic supervision. Since ISCO does not train any neural network, it is also inherently robust to out of distribution objects. Experiments show that, compared to recent single instance superquadrics reconstruction approaches, ISCO provides consistently more accurate 3D reconstructions, even from images in the wild. Code available at https://github.com/ExplainableML/ISCO. + + + + SiLK: Simple Learned Keypoints + http://openaccess.thecvf.com//content/ICCV2023/papers/Gleize_SiLK_Simple_Learned_Keypoints_ICCV_2023_paper.pdf + Keypoint detection & descriptors are foundational technologies for computer vision tasks like image matching, 3D reconstruction and visual odometry. Hand-engineered methods like Harris corners, SIFT, and HOG descriptors have been used for decades; more recently, there has been a trend to introduce learning in an attempt to improve keypoint detectors. On inspection however, the results are difficult to interpret; recent learning-based methods employ a vast diversity of experimental setups and design choices: empirical results are often reported using different backbones, protocols, datasets, types of supervisions or tasks. Since these differences are often coupled together, it raises a natural question on what makes a good learned keypoint detector. In this work, we revisit the design of existing keypoint detectors by deconstructing their methodologies and identifying the key components. We re-design each component from first-principle and propose Simple Learned Keypoints (SiLK) that is fully-differentiable, lightweight, and flexible. Despite its simplicity, SiLK advances new state-of-the-art on Detection Repeatability and Homography Estimation tasks on HPatches and 3D Point-Cloud Registration task on ScanNet, and achieves competitive performance to state-of-the-art on camera pose estimation in 2022 Image Matching Challenge and ScanNet. + + + + EfficientViT: Lightweight Multi-Scale Attention for High-Resolution Dense Prediction + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_EfficientViT_Lightweight_Multi-Scale_Attention_for_High-Resolution_Dense_Prediction_ICCV_2023_paper.pdf + High-resolution dense prediction enables many appealing real-world applications, such as computational photography, autonomous driving, etc. However, the vast computational cost makes deploying state-of-the-art high-resolution dense prediction models on hardware devices difficult. This work presents EfficientViT, a new family of high-resolution vision models with novel lightweight multi-scale attention. Unlike prior high-resolution dense prediction models that rely on heavy self-attention, hardware-inefficient large-kernel convolution, or complicated topology structure to obtain good performances, our lightweight multi-scale attention achieves a global receptive field and multi-scale learning (two critical features for high-resolution dense prediction) with only lightweight and hardware-efficient operations. 
As such, EfficientViT delivers remarkable performance gains over previous state-of-the-art high-resolution dense prediction models with significant speedup on diverse hardware platforms, including mobile CPU, edge GPU, and cloud GPU. Without performance loss on Cityscapes, our EfficientViT provides up to 8.8x and 3.8x GPU latency reduction over SegFormer and SegNeXt, respectively. For super-resolution, EfficientViT provides up to 6.4x speedup over Restormer while providing 0.11dB gain in PSNR. + + + + Rapid Adaptation in Online Continual Learning: Are We Evaluating It Right? + http://openaccess.thecvf.com//content/ICCV2023/papers/Al_Kader_Hammoud_Rapid_Adaptation_in_Online_Continual_Learning_Are_We_Evaluating_It_ICCV_2023_paper.pdf + We revisit the common practice of evaluating adaptation of Online Continual Learning (OCL) algorithms through the metric of online accuracy, which measures the accuracy of the model on the immediate next few samples. However, we show that this metric is unreliable, as even vacuous blind classifiers, which do not use input images for prediction, can achieve unrealistically high online accuracy by exploiting spurious label correlations in the data stream. Our study reveals that existing OCL algorithms can also achieve high online accuracy, but perform poorly in retaining useful information, suggesting that they unintentionally learn spurious label correlations. To address this issue, we propose a novel metric for measuring adaptation based on the accuracy on the near-future samples, where spurious correlations are removed. We benchmark existing OCL approaches using our proposed metric on large-scale datasets under various computational budgets and find that better generalization can be achieved by retaining and reusing past seen information. We believe that our proposed metric can aid in the development of truly adaptive OCL methods. We provide code to reproduce our results at https://github.com/drimpossible/EvalOCL. + + + + Label-Efficient Online Continual Object Detection in Streaming Video + http://openaccess.thecvf.com//content/ICCV2023/papers/Wu_Label-Efficient_Online_Continual_Object_Detection_in_Streaming_Video_ICCV_2023_paper.pdf + Humans can watch a continuous video stream and effortlessly perform continual acquisition and transfer of new knowledge with minimal supervision yet retaining previously learnt experiences. In contrast, existing continual learning (CL) methods require fully annotated labels to effectively learn from individual frames in a video stream. Here, we examine a more realistic and challenging problem--Label-Efficient Online Continual Object Detection (LEOCOD) in streaming video. We propose a plug-and-play module, Efficient-CLS, that can be easily inserted into and improve existing continual learners for object detection in video streams with reduced data annotation costs and model retraining time. We show that our method has achieved significant improvement with minimal forgetting across all supervision levels on two challenging CL benchmarks for streaming real-world videos. Remarkably, with only 25% annotated video frames, our method still outperforms the base CL learners, which are trained with 100% annotations on all video frames. The data and source code will be publicly available at https://github.com/showlab/Efficient-CLS. 
+ + + + Diverse Cotraining Makes Strong Semi-Supervised Segmentor + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Diverse_Cotraining_Makes_Strong_Semi-Supervised_Segmentor_ICCV_2023_paper.pdf + Deep co-training has been introduced to semi-supervised segmentation and achieves impressive results, yet few studies have explored the working mechanism behind it. In this work, we revisit the core assumption that supports co-training: multiple compatible and conditionally independent views. By theoretically deriving the generalization upper bound, we prove the prediction similarity between two models negatively impacts the model's generalization ability. However, most current co-training models are tightly coupled together and violate this assumption. Such coupling leads to the homogenization of networks and confirmation bias which consequently limits the performance. To this end, we explore different dimensions of co-training and systematically increase the diversity from the aspects of input domains, different augmentations and model architectures to counteract homogenization. Our Diverse Co-training outperforms the state-of-the-art (SOTA) methods by a large margin across different evaluation protocols on the Pascal and Cityscapes. For example. we achieve the best mIoU of 76.2%, 77.7% and 80.2% on Pascal with only 92, 183 and 366 labeled images, surpassing the previous best results by more than 5%. + + + + VQA-GNN: Reasoning with Multimodal Knowledge via Graph Neural Networks for Visual Question Answering + http://openaccess.thecvf.com//content/ICCV2023/papers/Wang_VQA-GNN_Reasoning_with_Multimodal_Knowledge_via_Graph_Neural_Networks_for_ICCV_2023_paper.pdf + Visual question answering (VQA) requires systems to perform concept-level reasoning by unifying unstructured (e.g., the context in question and answer; "QA context") and structured (e.g., knowledge graph for the QA context and scene; "concept graph") multimodal knowledge. Existing works typically combine a scene graph and a concept graph of the scene by connecting corresponding visual nodes and concept nodes, then incorporate the QA context representation to perform question answering. However, these methods only perform a unidirectional fusion from unstructured knowledge to structured knowledge, limiting their potential to capture joint reasoning over the heterogeneous modalities of knowledge. To perform more expressive reasoning, we propose VQA-GNN, a new VQA model that performs bidirectional fusion between unstructured and structured multimodal knowledge to obtain unified knowledge representations. Specifically, we inter-connect the scene graph and the concept graph through a super node that represents the QA context, and introduce a new multimodal GNN technique to perform inter-modal message passing for reasoning that mitigates representational gaps between modalities. On two challenging VQA tasks (VCR and GQA), our method outperforms strong baseline VQA methods by 3.2% on VCR (Q-AR) and 4.6% on GQA, suggesting its strength in performing concept-level reasoning. Ablation studies further demonstrate the efficacy of the bidirectional fusion and multimodal GNN method in unifying unstructured and structured multimodal knowledge. 
+ + + + Unmasked Teacher: Towards Training-Efficient Video Foundation Models + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_Unmasked_Teacher_Towards_Training-Efficient_Video_Foundation_Models_ICCV_2023_paper.pdf + Video Foundation Models (VFMs) have received limited exploration due to high computational costs and data scarcity. Previous VFMs rely on Image Foundation Models (IFMs), which face challenges in transferring to the video domain. Although VideoMAE has trained a robust ViT from limited data, its low-level reconstruction poses convergence difficulties and conflicts with high-level cross-modal alignment. This paper proposes a training-efficient method for temporal-sensitive VFMs that integrates the benefits of existing methods. To increase data efficiency, we mask out most of the low-semantics video tokens, but selectively align the unmasked tokens with IFM, which serves as the UnMasked Teacher (UMT). By providing semantic guidance, our method enables faster convergence and multimodal friendliness. With a progressive pre-training framework, our model can handle various tasks including scene-related, temporal-related, and complex video-language understanding. Using only public sources for pre-training in 6 days on 32 A100 GPUs, our scratch-built ViT-L/16 achieves state-of-the-art performances on various video tasks. + + + + SQAD: Automatic Smartphone Camera Quality Assessment and Benchmarking + http://openaccess.thecvf.com//content/ICCV2023/papers/Fang_SQAD_Automatic_Smartphone_Camera_Quality_Assessment_and_Benchmarking_ICCV_2023_paper.pdf + Smartphone photography is becoming increasingly popular, but fitting high-performing camera systems within the given space limitations remains a challenge for manufacturers. As a result, powerful mobile camera systems are in high demand. Despite recent progress in computer vision, camera system quality assessment remains a tedious and manual process. In this paper, we present the Smartphone Camera Quality Assessment Dataset (SQAD), which includes natural images captured by 29 devices. SQAD defines camera system quality based on six widely accepted criteria: resolution, color accuracy, noise level, dynamic range, Point Spread Function, and aliasing. Built on thorough examinations in a controlled laboratory environment, SQAD provides objective metrics for quality assessment, overcoming previous subjective opinion scores. Moreover, we introduce the task of automatic camera quality assessment and train deep learning-based models on the collected data to perform a precise quality prediction for arbitrary photos. The dataset, codes and pre-trained models are released at https://github.com/aiff22/SQAD. + + + + Multi-Event Video-Text Retrieval + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Multi-Event_Video-Text_Retrieval_ICCV_2023_paper.pdf + Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model architecture that learns a joint representation of video-text pairs has become a prominent approach for the VTR task. However, these models operate under the assumption of bijective video-text correspondences and neglect a more practical scenario where video content usually encompasses multiple events, while texts like user queries or webpage metadata tend to be specific and correspond to single events. 
This establishes a gap between the previous training objective and real-world applications, leading to the potential performance degradation of earlier models during inference. In this study, we introduce the Multi-event Video-Text Retrieval (MeVTR) task, addressing scenarios in which each video contains multiple different events, as a niche scenario of the conventional Video-Text Retrieval Task. We present a simple model, Me-Retriever, which incorporates key event video representation and a new MeVTR loss for the MeVTR task. Comprehensive experiments show that this straightforward framework outperforms other models in the Video-to-Text and Text-to-Video tasks, effectively establishing a robust baseline for the MeVTR task. We believe this work serves as a strong foundation for future studies. Code is available at https://github.com/gengyuanmax/MeVTR. + + + + You Never Get a Second Chance To Make a Good First Impression: Seeding Active Learning for 3D Semantic Segmentation + http://openaccess.thecvf.com//content/ICCV2023/papers/Samet_You_Never_Get_a_Second_Chance_To_Make_a_Good_ICCV_2023_paper.pdf + We propose SeedAL, a method to seed active learning for efficient annotation of 3D point clouds for semantic segmentation. Active Learning (AL) iteratively selects relevant data fractions to annotate within a given budget, but requires a first fraction of the dataset (a 'seed') to be already annotated to estimate the benefit of annotating other data fractions. We first show that the choice of the seed can significantly affect the performance of many AL methods. We then propose a method for automatically constructing a seed that will ensure good performance for AL. Assuming that images of the point clouds are available, which is common, our method relies on powerful unsupervised image features to measure the diversity of the point clouds. It selects the point clouds for the seed by optimizing the diversity under an annotation budget, which can be done by solving a linear optimization problem. Our experiments demonstrate the effectiveness of our approach compared to random seeding and existing methods on both the S3DIS and SemanticKitti datasets. Code is available at https://github.com/nerminsamet/seedal. + + + + Scalable Multi-Temporal Remote Sensing Change Data Generation via Simulating Stochastic Change Process + http://openaccess.thecvf.com//content/ICCV2023/papers/Zheng_Scalable_Multi-Temporal_Remote_Sensing_Change_Data_Generation_via_Simulating_Stochastic_ICCV_2023_paper.pdf + Understanding the temporal dynamics of Earth's surface is a mission of multi-temporal remote sensing image analysis, significantly promoted by deep vision models with its fuel---labeled multi-temporal images. However, collecting, preprocessing, and annotating multi-temporal remote sensing images at scale is non-trivial since it is expensive and knowledge-intensive. In this paper, we present a scalable multi-temporal remote sensing change data generator via generative modeling, which is cheap and automatic, alleviating these problems. Our main idea is to simulate a stochastic change process over time. We consider the stochastic change process as a probabilistic semantic state transition, namely generative probabilistic change model (GPCM), which decouples the complex simulation problem into two more trackable sub-problems, i.e., change event simulation and semantic change synthesis. 
To solve these two problems, we present the change generator (Changen), a GAN-based GPCM, enabling controllable object change data generation, including customizable object properties and change events. Extensive experiments suggest that our Changen has superior generation capability, and the change detectors with Changen pre-training exhibit excellent transferability to real-world change datasets. + + + + NerfAcc: Efficient Sampling Accelerates NeRFs + http://openaccess.thecvf.com//content/ICCV2023/papers/Li_NerfAcc_Efficient_Sampling_Accelerates_NeRFs_ICCV_2023_paper.pdf + Optimizing and rendering Neural Radiance Fields is computationally expensive due to the vast number of samples required by volume rendering. Recent works have included alternative sampling approaches to help accelerate their methods; however, these are often not the focus of the work. In this paper, we investigate and compare multiple sampling approaches and demonstrate that improved sampling is generally applicable across NeRF variants under a unified concept of the transmittance estimator. To facilitate future experiments, we develop NerfAcc, a Python toolbox that provides flexible APIs for incorporating advanced sampling methods into NeRF-related methods. We demonstrate its flexibility by showing that it can reduce the training time of several recent NeRF methods by 1.5x to 20x with minimal modifications to the existing codebase. Additionally, highly customized NeRFs, such as Instant-NGP, can be implemented in native PyTorch using NerfAcc. Our code is open-sourced at https://www.nerfacc.com. + + + + A2Q: Accumulator-Aware Quantization with Guaranteed Overflow Avoidance + http://openaccess.thecvf.com//content/ICCV2023/papers/Colbert_A2Q_Accumulator-Aware_Quantization_with_Guaranteed_Overflow_Avoidance_ICCV_2023_paper.pdf + We present accumulator-aware quantization (A2Q), a novel weight quantization method designed to train quantized neural networks (QNNs) to avoid overflow when using low-precision accumulators during inference. A2Q introduces a unique formulation inspired by weight normalization that constrains the L1-norm of model weights according to accumulator bit width bounds that we derive. Thus, in training QNNs for low-precision accumulation, A2Q also inherently promotes unstructured weight sparsity to guarantee overflow avoidance. We apply our method to deep learning-based computer vision tasks to show that A2Q can train QNNs for low-precision accumulators while maintaining model accuracy competitive with a floating-point baseline. In our evaluations, we consider the impact of A2Q on both general-purpose platforms and programmable hardware. However, we primarily target model deployment on FPGAs because they can be programmed to fully exploit custom accumulator bit widths. Our experimentation shows that accumulator bit width significantly impacts the resource efficiency of FPGA-based accelerators. On average across our benchmarks, A2Q offers up to a 2.3x reduction in resource utilization over 32-bit accumulator counterparts with 99.2% of the floating-point model accuracy. + + + + ARNOLD: A Benchmark for Language-Grounded Task Learning with Continuous States in Realistic 3D Scenes + http://openaccess.thecvf.com//content/ICCV2023/papers/Gong_ARNOLD_A_Benchmark_for_Language-Grounded_Task_Learning_with_Continuous_States_ICCV_2023_paper.pdf + Understanding the continuous states of objects is essential for task learning and planning in the real world.
However, most existing task learning benchmarks assume discrete (e.g., binary) object states, which poses challenges for learning complex tasks and transferring learned policy from the simulated environment to the real world. Furthermore, the robot's ability to follow human instructions based on grounding the actions and states is limited. To tackle these challenges, we present ARNOLD, a benchmark that evaluates language-grounded task learning with continuous states in realistic 3D scenes. ARNOLD consists of 8 language-conditioned tasks that involve understanding object states and learning policies for continuous goals. To promote language-instructed learning, we provide expert demonstrations with template-generated language descriptions. We assess task performance by utilizing the latest language-conditioned policy learning models. Our results indicate that current models for language-conditioned manipulations continue to experience significant challenges when it comes to novel goal-state generalizations, scene generalizations, and object generalizations. These findings highlight the need to develop new algorithms to address this gap and underscore the potential for further research in this area. + + + + CLNeRF: Continual Learning Meets NeRF + http://openaccess.thecvf.com//content/ICCV2023/papers/Cai_CLNeRF_Continual_Learning_Meets_NeRF_ICCV_2023_paper.pdf + Novel view synthesis aims to render unseen views given a set of calibrated images. In practical applications, the coverage, appearance or geometry of the scene may change over time, with new images continuously being captured. Efficiently incorporating such continuous change is an open challenge. Standard NeRF benchmarks only involve scene coverage expansion. To study other practical scene changes, we propose a new dataset, World Across Time (WAT), consisting of scenes that change in appearance and geometry over time. We also propose a simple yet effective method, CLNeRF, which introduces continual learning (CL) to Neural Radiance Fields (NeRFs). CLNeRF combines generative replay and the Instant Neural Graphics Primitives (NGP) architecture to effectively prevent catastrophic forgetting and efficiently update the model when new data arrives. We also add trainable appearance and geometry embeddings to NGP, allowing a single compact model to handle complex scene changes. Without the need to store historical images, CLNeRF trained sequentially over multiple scans of a changing scene performs on-par with the upper bound model trained on all scans at once. Compared to other CL baselines CLNeRF performs much better across standard benchmarks and WAT. The source code, a demo, and the WAT dataset are available at https://github.com/IntelLabs/CLNeRF. + + + + CrossMatch: Source-Free Domain Adaptive Semantic Segmentation via Cross-Modal Consistency Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Yin_CrossMatch_Source-Free_Domain_Adaptive_Semantic_Segmentation_via_Cross-Modal_Consistency_Training_ICCV_2023_paper.pdf + Source-free domain adaptive semantic segmentation has gained increasing attention recently. It eases the requirement of full data access to the source domain by transferring knowledge only from a well-trained source model. However, reducing the uncertainty of the target pseudo labels becomes inevitably more challenging without the supervision of the labeled source data. In this work, we propose a novel asymmetric two-stream architecture that learns more robustly from noisy pseudo labels. 
Our approach simultaneously conducts dual-head pseudo label denoising and cross-modal consistency regularization. Towards the former, we introduce a multimodal auxiliary network during training (and discard it during inference), which effectively enhances the pseudo labels' correctness by leveraging the guidance from the depth information. Towards the latter, we enforce a new cross-modal pixel-wise consistency between the predictions of the two streams, encouraging our model to behave smoothly for both modality variance and image perturbations. It serves as an effective regularization to further reduce the impact of the inaccurate pseudo labels in source-free unsupervised domain adaptation. Experiments on the GTA5-to-Cityscapes and SYNTHIA-to-Cityscapes benchmarks demonstrate the superiority of our proposed method, obtaining new state-of-the-art mIoU of 57.7% and 57.5%, respectively. + + + + Improving Equivariance in State-of-the-Art Supervised Depth and Normal Predictors + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhong_Improving_Equivariance_in_State-of-the-Art_Supervised_Depth_and_Normal_Predictors_ICCV_2023_paper.pdf + Dense depth and surface normal predictors should be equivariant to cropping-and-resizing -- cropping the input image should result in cropping the same output image. However, we find that state-of-the-art depth and normal predictors, despite their strong performance, surprisingly do not respect equivariance. The problem exists even when crop-and-resize data augmentation is employed during training. To remedy this, we propose an equivariant regularization technique, consisting of an averaging procedure and a self-consistency loss, to explicitly promote cropping-and-resizing equivariance in depth and normal networks. Our approach can be applied to both CNN and Transformer architectures, does not incur extra cost during testing, and notably improves the supervised and semi-supervised learning performance of dense predictors on Taskonomy tasks. Finally, finetuning with our loss on unlabeled images improves not only the equivariance but also the accuracy of state-of-the-art depth and normal predictors when evaluated on NYU-v2. + + + + Reducing Training Time in Cross-Silo Federated Learning Using Multigraph Topology + http://openaccess.thecvf.com//content/ICCV2023/papers/Do_Reducing_Training_Time_in_Cross-Silo_Federated_Learning_Using_Multigraph_Topology_ICCV_2023_paper.pdf + Federated learning is an active research topic since it enables several participants to jointly train a model without sharing local data. Currently, cross-silo federated learning is a popular training setting that utilizes a few hundred reliable data silos with high-speed access links to train a model. While this approach has been widely applied in real-world scenarios, designing a robust topology to reduce the training time remains an open problem. In this paper, we present a new multigraph topology for cross-silo federated learning. We first construct the multigraph using the overlay graph. We then parse this multigraph into different simple graphs with isolated nodes. The existence of isolated nodes allows us to perform model aggregation without waiting for other nodes, hence effectively reducing the training time. Intensive experiments on three public datasets show that our proposed method significantly reduces the training time compared with recent state-of-the-art topologies while maintaining the accuracy of the learned model.
+ + + + Counting Crowds in Bad Weather + http://openaccess.thecvf.com//content/ICCV2023/papers/Huang_Counting_Crowds_in_Bad_Weather_ICCV_2023_paper.pdf + Crowd counting has recently attracted significant attention in the field of computer vision due to its wide applications to image understanding. Numerous methods have been proposed and achieved state-of-the-art performance for real-world tasks. However, existing approaches do not perform well under adverse weather such as haze, rain, and snow since the visual appearances of crowds in such scenes are drastically different from those images in clear weather of typical datasets. In this paper, we propose a method for robust crowd counting in adverse weather scenarios. Instead of using a two-stage approach that involves image restoration and crowd counting modules, our model learns effective features and adaptive queries to account for large appearance variations. With these weather queries, the proposed model can learn the weather information according to the degradation of the input image and optimize with the crowd counting module simultaneously. Experimental results show that the proposed algorithm is effective in counting crowds under different weather types on benchmark datasets. The source code and trained models will be made available to the public. + + + + FreeDoM: Training-Free Energy-Guided Conditional Diffusion Model + http://openaccess.thecvf.com//content/ICCV2023/papers/Yu_FreeDoM_Training-Free_Energy-Guided_Conditional_Diffusion_Model_ICCV_2023_paper.pdf + Recently, conditional diffusion models have gained popularity in numerous applications due to their exceptional generation ability. However, many existing methods are training-required. They need to train a time-dependent classifier or a condition-dependent score estimator, which increases the cost of constructing conditional diffusion models and is inconvenient to transfer across different conditions. Some current works aim to overcome this limitation by proposing training-free solutions, but most can only be applied to a specific category of tasks and not to more general conditions. In this work, we propose a training-Free conditional Diffusion Model (FreeDoM) used for various conditions. Specifically, we leverage off-the-shelf pre-trained networks, such as a face detection model, to construct time-independent energy functions, which guide the generation process without requiring training. Furthermore, because the construction of the energy function is very flexible and adaptable to various conditions, our proposed FreeDoM has a broader range of applications than existing training-free methods. FreeDoM is advantageous in its simplicity, effectiveness, and low cost. Experiments demonstrate that FreeDoM is effective for various conditions and suitable for diffusion models of diverse data domains, including image and latent code domains. + + + + UniT3D: A Unified Transformer for 3D Dense Captioning and Visual Grounding + http://openaccess.thecvf.com//content/ICCV2023/papers/Chen_UniT3D_A_Unified_Transformer_for_3D_Dense_Captioning_and_Visual_ICCV_2023_paper.pdf + Performing 3D dense captioning and visual grounding requires a common and shared understanding of the underlying multimodal relationships. However, despite some previous attempts on connecting these two related tasks with highly task-specific neural modules, it remains understudied how to explicitly depict their shared nature to learn them simultaneously. 
In this work, we propose UniT3D, a simple yet effective fully unified transformer-based architecture for jointly solving 3D visual grounding and dense captioning. UniT3D enables learning a strong multimodal representation across the two tasks through a supervised joint pre-training scheme with bidirectional and seq-to-seq objectives. With a generic architecture design, UniT3D allows expanding the pre-training scope to more various training sources such as the synthesized data from 2D prior knowledge to benefit 3D vision-language tasks. Extensive experiments and analysis demonstrate that UniT3D obtains significant gains for 3D dense captioning and visual grounding. + + + + SKiT: a Fast Key Information Video Transformer for Online Surgical Phase Recognition + http://openaccess.thecvf.com//content/ICCV2023/papers/Liu_SKiT_a_Fast_Key_Information_Video_Transformer_for_Online_Surgical_ICCV_2023_paper.pdf + This paper introduces SKiT, a fast Key information Transformer for phase recognition of videos. Unlike previous methods that rely on complex models to capture long-term temporal information, SKiT accurately recognizes high-level stages of videos using an efficient key pooling operation. This operation records important key information by retaining the maximum value recorded from the beginning up to the current video frame, with a time complexity of O(1). Experimental results on Cholec80 and AutoLaparo surgical datasets demonstrate the ability of our model to recognize phases in an online manner. SKiT achieves higher performance than state-of-the-art methods with an accuracy of 92.5% and 82.9% on Cholec80 and AutoLaparo, respectively, while running the temporal model eight times faster ( 7ms v.s. 55ms) than LoViT, which uses ProbSparse to capture global information. We highlight that the inference time of SKiT is constant, and independent from the input length, making it a stable choice for keeping a record of important global information, that appears on long surgical videos, essential for phase recognition. To sum up, we propose an effective and efficient model for surgical phase recognition that leverages key global information. This has an intrinsic value when performing this task in an online manner on long surgical videos for stable real-time surgical recognition systems. + + + + Automatic Network Pruning via Hilbert-Schmidt Independence Criterion Lasso under Information Bottleneck Principle + http://openaccess.thecvf.com//content/ICCV2023/papers/Guo_Automatic_Network_Pruning_via_Hilbert-Schmidt_Independence_Criterion_Lasso_under_Information_ICCV_2023_paper.pdf + Most existing neural network pruning methods hand-crafted their importance criteria and structures to prune. This constructs heavy and unintended dependencies on heuristics and expert experience for both the objective and the parameters of the pruning approach. In this paper, we try to solve this problem by introducing a principled and unified framework based on Information Bottleneck (IB) theory, which further guides us to an automatic pruning approach. Specifically, we first formulate the channel pruning problem from an IB perspective, and then implement the IB principle by solving a Hilbert-Schmidt Independence Criterion (HSIC) Lasso problem under certain conditions. Based on the theoretical guidance, we then provide an automatic pruning scheme by searching for global penalty coefficients. Verified by extensive experiments, our method yields state-of-the-art performance on various benchmark networks and datasets. 
For example, with VGG-16, we achieve a 60%-FLOPs reduction by removing 76% of the parameters, with an improvement of 0.40% in top-1 accuracy on CIFAR-10. With ResNet-50, we achieve a 56%-FLOPs reduction by removing 50% of the parameters, with a small loss of 0.08% in the top-1 accuracy on ImageNet. The code is available at https://github.com/sunggo/APIB. + + + + Neglected Free Lunch - Learning Image Classifiers Using Annotation Byproducts + http://openaccess.thecvf.com//content/ICCV2023/papers/Han_Neglected_Free_Lunch_-_Learning_Image_Classifiers_Using_Annotation_Byproducts_ICCV_2023_paper.pdf + Supervised learning of image classifiers distills human knowledge into a parametric model through pairs of images and corresponding labels (X,Y). We argue that this simple and widely used representation of human knowledge neglects rich auxiliary information from the annotation procedure, such as the time-series of mouse traces and clicks left after image selection. Our insight is that such annotation byproducts Z provide approximate human attention that weakly guides the model to focus on the foreground cues, reducing spurious correlations and discouraging shortcut learning. To verify this, we create ImageNet-AB and COCO-AB. They are ImageNet and COCO training sets enriched with sample-wise annotation byproducts, collected by replicating the respective original annotation tasks. We refer to the new paradigm of training models with annotation byproducts as learning using annotation byproducts (LUAB). We show that a simple multitask loss for regressing Z together with Y already improves the generalisability and robustness of the learned models. Compared to the original supervised learning, LUAB does not require extra annotation costs. ImageNet-AB and COCO-AB are available at https://github.com/naver-ai/NeglectedFreeLunch. + + + + Rethinking the Role of Pre-Trained Networks in Source-Free Domain Adaptation + http://openaccess.thecvf.com//content/ICCV2023/papers/Zhang_Rethinking_the_Role_of_Pre-Trained_Networks_in_Source-Free_Domain_Adaptation_ICCV_2023_paper.pdf + Source-free domain adaptation (SFDA) aims to adapt a source model trained on a fully-labeled source domain to an unlabeled target domain. Large-data pre-trained networks are used to initialize source models during source training, and subsequently discarded. However, source training can cause the model to overfit to the source data distribution and lose applicable target domain knowledge. We propose to integrate the pre-trained network into the target adaptation process as it has diversified features important for generalization and provides an alternate view of features and classification decisions different from the source model. We propose to distil useful target domain information through a co-learning strategy to improve target pseudolabel quality for finetuning the source model. Evaluation on four benchmark datasets shows that our proposed strategy improves adaptation performance and can be successfully integrated with existing SFDA methods. Leveraging modern pre-trained networks that have stronger representation learning ability in the co-learning strategy further boosts performance.
+ + + + RLIPv2: Fast Scaling of Relational Language-Image Pre-Training + http://openaccess.thecvf.com//content/ICCV2023/papers/Yuan_RLIPv2_Fast_Scaling_of_Relational_Language-Image_Pre-Training_ICCV_2023_paper.pdf + Relational Language-Image Pre-training (RLIP) aims to align vision representations with relational texts, thereby advancing the capability of relational reasoning in computer vision tasks. However, hindered by the slow convergence of RLIPv1 architecture and the limited availability of existing scene graph data, scaling RLIPv1 is challenging. In this paper, we propose RLIPv2, a fast converging model that enables the scaling of relational pre-training to large-scale pseudo-labelled scene graph data. To enable fast scaling, RLIPv2 introduces Asymmetric Language-Image Fusion (ALIF), a mechanism that facilitates earlier and deeper gated cross-modal fusion with sparsified language encoding layers. ALIF leads to comparable or better performance than RLIPv1 in a fraction of the time for pre-training and fine-tuning. To obtain scene graph data at scale, we extend object detection datasets with free-form relation labels by introducing a captioner (e.g., BLIP) and a designed Relation Tagger. The Relation Tagger assigns BLIP-generated relation texts to region pairs, thus enabling larger-scale relational pre-training. Through extensive experiments conducted on Human-Object Interaction Detection and Scene Graph Generation, RLIPv2 shows state-of-the-art performance on three benchmarks under fully-finetuning, few-shot and zero-shot settings. Notably, the largest RLIPv2 achieves 23.29mAP on HICO-DET without any fine-tuning, yields 32.22mAP with just 1% data and yields 45.09mAP with 100% data. Code and models are publicly available at https://github.com/JacobYuan7/RLIPv2. + + + + TransFace: Calibrating Transformer Training for Face Recognition from a Data-Centric Perspective + http://openaccess.thecvf.com//content/ICCV2023/papers/Dan_TransFace_Calibrating_Transformer_Training_for_Face_Recognition_from_a_Data-Centric_ICCV_2023_paper.pdf + Vision Transformers (ViTs) have demonstrated powerful representation ability in various visual tasks thanks to their intrinsic data-hungry nature. However, we unexpectedly find that ViTs perform vulnerably when applied to face recognition (FR) scenarios with extremely large datasets. We investigate the reasons for this phenomenon and discover that the existing data augmentation approach and hard sample mining strategy are incompatible with ViTs-based FR backbone due to the lack of tailored consideration on preserving face structural information and leveraging each local token information. To remedy these problems, this paper proposes a superior FR model called TransFace, which employs a patch-level data augmentation strategy named DPAP and a hard sample mining strategy named EHSM. Specially, DPAP randomly perturbs the amplitude information of dominant patches to expand sample diversity, which effectively alleviates the overfitting problem in ViTs. EHSM utilizes the information entropy in the local tokens to dynamically adjust the importance weight of easy and hard samples during training, leading to a more stable prediction. Experiments on several benchmarks demonstrate the superiority of our TransFace. Code and models are available at https://github.com/DanJun6737/TransFace. 
+ + + + Aria Digital Twin: A New Benchmark Dataset for Egocentric 3D Machine Perception + http://openaccess.thecvf.com//content/ICCV2023/papers/Pan_Aria_Digital_Twin_A_New_Benchmark_Dataset_for_Egocentric_3D_ICCV_2023_paper.pdf + We introduce the Aria Digital Twin (ADT) - an egocentric dataset captured using Aria glasses with extensive object, environment, and human level ground truth. This ADT release contains 200 sequences of real-world activities conducted by Aria wearers in two real indoor scenes with 398 object instances (344 stationary and 74 dynamic). Each sequence consists of: a) raw data of two monochrome camera streams, one RGB camera stream, two IMU streams; b) complete sensor calibration; c) ground truth data including continuous 6-degree-of-freedom (6DoF) poses of the Aria devices, object 6DoF poses, 3D eye gaze vectors, 3D human poses, 2D image segmentations, image depth maps; and d) photo-realistic synthetic renderings. To the best of our knowledge, there is no existing egocentric dataset with a level of accuracy, photo-realism and comprehensiveness comparable to ADT. By contributing ADT to the research community, our mission is to set a new standard for evaluation in the egocentric machine perception domain, which includes very challenging research problems such as 3D object detection and tracking, scene reconstruction and understanding, sim-to-real learning, human pose prediction - while also inspiring new machine perception tasks for augmented reality (AR) applications. To kick start exploration of the ADT research use cases, we evaluated several existing state-of-the-art methods for object detection, segmentation and image translation tasks that demonstrate the usefulness of ADT as a benchmarking dataset. + + + + \ No newline at end of file